Aligning reads to a genome
Next we create another wrapper to control an external program, to do the alignment with
the indexed genome data. There are many more options to control the program this time,
as this is highly dependent not only on the source of the data (in terms of which kind of
sequence machine produced it) but also on the way in which the sequenced DNA was
prepared and how this experimentally informs the biology that we are interested in. The
key aspects of the sequences we wish to align are the sequence lengths and within this
length what the quality of the called sequence reads is, given this generally diminishes
with length. Also when the alignment is actually done we have to consider how
mismatches (e.g. polymorphisms or errors in the sequencing) are handled. The full list of
options that we pass in, as arguments to the alignment function, is as follows:
genomeName
This is the name or tag given to the indexed genome that we
wish to align to. This is the same name as was used when
building the index with the function described above.
genomeDir
The location of the directory that contains the binary genome
index files.
readFiles
A list containing the names of the files containing the short-
read sequences to align.
pairedReadFiles
For experiments that use paired read data, a second list of
short-read sequence file names that match up (as pairs) with
the readFiles.
outFile
The name of the output alignment file, which by default will
use the SAM format.
leftTrim
The number of sequence letters to trim from the start of all the
reads; useful to remove known barcode sequences etc., which
should not be mapped.
rightTrim
How many sequence letters to remove from the end of the
sequence reads.
useSOLiD
Whether the data comes from an ABI SOLiD sequencing
machine.
qualType
The type of quality scores to use, which is expected to be
‘illumina1.3’, ‘solexa’ or ‘sanger’. See the description below of
the FASTQ format for more explanation of this.
maxMismatches
The maximum number of base-pair mismatches to tolerate in
an alignment.
highQualLen
How many sequence positions, from the start of the read, are
deemed reliable.
pairSepRange
The range of separations for paired sequence reads; known
from the size selection used to prepare the DNA fragment
library.
showHits
The number of genome location alignment ‘hits’ to show for
each input read sequence.
maxHits
Sets the maximum number of alignments that will be tolerated
for a read sequence; otherwise the read is deemed to be
unmappable and is discarded.
enforceBest
Whether to enforce the display of only the best-matching read,
if more than one could be aligned to the genome.
cpuCores
If set, this states how many processing jobs to run at one time,
making use of multi-core central processing units. The default
is to use all available CPU cores.
The short sequence reads that are aligned to the genome are generally stored in a textual
format called FASTQ. These files contain sequence entries, one after another, in a form
exemplified by the following text:
@Annotation
CGGATGATTTTTATCCCATGAGACATCCAGTTCGG
+Annotation
45567778999:9;;<===>?@@@@AAAABCCCDE
As illustrated, each sequence read comprises four lines of text. The first and third lines,
beginning with ‘@’ and ‘+’ respectively, are textual annotations for the sequence read. In a
basic sense these are just identifiers, but the output of most sequencing machines contains
a rich variety of information, including details relating to the machine that read the
sequence. Importantly, if the FASTQ sequence files consist of read pairs (sequencing from
the two ends of DNA fragments) then the annotation for each sequence can be used to
unambiguously identify the connected read pairs even if (and as is generally the case) the
read pairs are given in two files, one for the sequencing from each end. However, it should
be noted that currently in many cases the order of sequences in paired FASTQ files is
consistent, so that the pair identity is simply the position in the file. The second line of a
FASTQ file is clearly the one-letter code sequence of the DNA base pairs. The fourth line
of the entry is an array of quality scores, each character of which relates to the sequence
letter at the same read position. It is perhaps unfortunate that there are several slightly
different systems of code that are used to state the positional read quality, which is why
the quality score scheme is taken as an input to the alignment program. All the quality
score systems use a range of textual (ASCII) characters to specify the quality value at each
read position and what differs in each system is which range of characters are used. The
basic problem is that just using characters from 0 through to 9 does not give enough
precision to the scores, so instead more characters are used, which naturally includes
letters and other symbols. All the characters are used in the following order:
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_
'abcdefghijklmnopqrstuvwxyz{|}~
Here the score value in the scale decreases left to right and each sub-sequence character
is the next point on the scale. One thing that differs between the different schemes,
however, is which characters have the highest and lowest scores. Recent Illumina
machines (under the ‘sanger’ scheme here), for example, use symbols from ‘!’ to ‘J’
inclusive (older Illumina machines use the ‘illumina1.3’ scheme). Note that it is possible
to also use reads in FASTA file format, rather than FASTQ, although these will not contain
any quality information.
Now we come to the Python function that actually does the genome alignment by
wrapping the Bowtie aligner program. It should be noted that not all of the possible
command line options for Bowtie have been considered (such as the read pair orientation,
reverse complementation and backtracking retries), and of course this is specifically
designed to fit the requirements of one version of the genome alignment software. The
genomeAlign() function takes the arguments described in the above table, noting that we
can split the line defining the input alignments to keep things neat and tidy:
def genomeAlign(genomeName, genomeDir, readFiles, pairedReadFiles=None,
outFile=None, leftTrim=0, rightTrim=0, useSOLiD=False,
qualType='sanger', maxMismatches=2, highQualLen=28,
pairSepRange=(0,250), showHits=1, maxHits=2,
enforceBest=True, cpuCores=None):
This alignment program requires that the genome directory, containing the binary index
files, is specified as an environment variable, i.e. a name needs to be associated with the
location at the operating-system level, rather than being specified at run time. To get round
this requirement we can simply set the operating system’s environment variables within
Python, hence the os.environ line which links the variable name to its value using a
Python dictionary. The next operation then sets the output file if it was not passed into the
function, with the default being the genome name appended with ‘.sam’, given we are
making SAM-format files.
14
os.environ['BOWTIE_INDEXES'] = genomeDir
if not outFile:
outFile = genomeName + '.sam'
As with the previous example we collect a list of text strings that are equivalent to the
program name with the command line options required to run the program. Here the ‘-S’
option means that the output alignment file format will be SAM, although we could adapt
the function to specify something different. The readFilesStr is simply the comma-joined
list of input (short-read sequence) file names.
cmdArgs = [ALIGNER_PATH+'bowtie', '-S']
readFilesStr = ','.join(readFiles)
Next is some code to automatically guess the type of the input file based on the file
extension that follows the file name (from the ‘.’ at the end). Here we use the
os.path.splitext() function to split the name on the appropriate period/full-stop character to
get the trailing characters. Note that this procedure assumes that all the input read files are
of the same file format, given we only determine the type from the first in the list, but in
reality they really ought to be the same. If the file extension is in the first pre-set list then
the input reads are assumed to be in FASTA format, which requires that the ‘-f’ option is
used. Otherwise we check whether the format appears to be FASTQ, where we use the ‘-q’
option. And if none of that is true we assume the reads are in ‘raw’ format, where there is
simply a different one-letter sequence on each line.
foreName, fileType = os.path.splitext(readFiles[0])
if fileType in ('.fa','.fna','.mfa','.fasta'):
cmdArgs.append('-f')
elif fileType in ('.fq','.fastq'):
cmdArgs.append('-q')
else:
cmdArgs.append('-r')
Next is the option for using ABI SOLiD data. Then the quality score type is set, which
effectively translates the more informative text string options into ‘—’ style options. Note
here that the term ‘phred’ refers to the original DNA sequence quality score system, from a
program of that name, which is adjusted to use the text characters as described above
(ASCII with a specified offset).
if useSOLiD:
cmdArgs.append('-C')
if qualType == 'illumina1.3': # Phred+64
cmdArgs.append('--phred64-quals')
elif qualType == 'solexa': # Phred+59
cmdArgs.append('--solexa-quals')
else: # Sanger, current illumina: Phred + 33
cmdArgs.append('--phred33-quals')
The options to enforce reporting of only the best-matching genome alignment and
setting the number of processing jobs are considered. Note that the number of CPU cores
available on the current computer system can be determined using the standard
multiprocessing library, which is available from Python 2.6 onward.
if enforceBest:
cmdArgs.append('--best')
if not cpuCores:
import multiprocessing
cpuCores = multiprocessing.cpu_count()
The remaining, numeric, options are converted to text strings and added to the list. Here
the += operator is used, which is equivalent to list.extend() to add the contents of one list
to another.
cmdArgs += ['-5', str(leftTrim),
'-3', str(rightTrim),
'-n', str(maxMismatches),
'-l', str(highQualLen),
'-k', str(showHits),
'-m', str(maxHits),
'-p', str(cpuCores),
'--chunkmbs', '256']
If a second list of file names was passed to the function, specifying paired sequence
reads from the two ends of the DNA fragments, then we set some further options
specifically relating to read pairs. Naturally, we process the input sequence file names, as a
comma-joined list like before. Then we split the input separation limits for the pairs from
pairSepRange; these are input as separate arguments for the aligner. Note that when paired
reads are used the ‘-1’ and ‘-2’ options are employed to set the two file names. Otherwise,
i.e. at the else, the remaining options just refer to the one readFilesStr.
if pairedReadFiles:
pairedReadFiles = ','.join(pairedReadFiles)
minSep, maxSep = pairSepRange
cmdArgs += [genomeName,
'--minins', str(minSep),
'--maxins', str(maxSep),
'-1', readFilesStr,
'-2', pairedReadFiles,
outFile]
else:
cmdArgs += [genomeName, readFilesStr, outFile]
And finally at the end we invoke call() to run the actual external aligner program, which
may take a significant time depending on the number of sequence reads and the available
computational power. The output file name is returned from the function.
call(cmdArgs)
return outFile
All of the above can be brought together in the following example of a test, using the
demonstration data supplied in the on-line material,
15
which is the single E. coli
chromosome sequence, although in practice we will only really index the genome if it
changes. Note that the test chromosome file is converted to the full (absolute) file-system
path using os.path.abspath, which is what our indexing program requires.
filePath = os.path.abspath('examples/EcoliGenome.fasta')
genomeName = 'E_coli'
indexGenome(genomeName, [filePath,], 'examples')
fastqFile = 'examples/EcoliReads.fastq'
genomeAlign(genomeName, 'examples', [fastqFile], qualType='sanger')
The final alignment of this will produce output of the form:
# reads processed: 1000
# reads with at least one reported alignment: 685 (68.50%)
# reads that failed to align: 301 (30.10%)
# reads with alignments suppressed due to -m: 14 (1.40%)
Reported 685 alignments to 1 output stream(s)
Do'stlaringiz bilan baham: |