Python Programming for Biology: Bioinformatics and Beyond

Aligning reads to a genome

Download 7,75 Mb.

Pdf ko'rish

bet	260/514
Sana	30.12.2021
Hajmi	7,75 Mb.
	#91066

1 ... 256 257 258 259 260 261 262 263 ... 514

Bog'liq
[Tim J. Stevens, Wayne Boucher] Python Programming

Aligning reads to a genome

Next we create another wrapper to control an external program, to do the alignment with

the indexed genome data. There are many more options to control the program this time,

as this is highly dependent not only on the source of the data (in terms of which kind of

sequence machine produced it) but also on the way in which the sequenced DNA was

prepared and how this experimentally informs the biology that we are interested in. The

key aspects of the sequences we wish to align are the sequence lengths and within this

length what the quality of the called sequence reads is, given this generally diminishes

with length. Also when the alignment is actually done we have to consider how

mismatches (e.g. polymorphisms or errors in the sequencing) are handled. The full list of

options that we pass in, as arguments to the alignment function, is as follows:

genomeName

This is the name or tag given to the indexed genome that we

wish to align to. This is the same name as was used when

building the index with the function described above.

genomeDir

The location of the directory that contains the binary genome

index files.

readFiles

A list containing the names of the files containing the short-

read sequences to align.

pairedReadFiles

For experiments that use paired read data, a second list of

short-read sequence file names that match up (as pairs) with

the readFiles.

outFile

The name of the output alignment file, which by default will

use the SAM format.

leftTrim

The number of sequence letters to trim from the start of all the

reads; useful to remove known barcode sequences etc., which

should not be mapped.

rightTrim

How many sequence letters to remove from the end of the

sequence reads.

useSOLiD

Whether the data comes from an ABI SOLiD sequencing

machine.

qualType

The type of quality scores to use, which is expected to be

‘illumina1.3’, ‘solexa’ or ‘sanger’. See the description below of

the FASTQ format for more explanation of this.

maxMismatches

The maximum number of base-pair mismatches to tolerate in

an alignment.

highQualLen

How many sequence positions, from the start of the read, are

deemed reliable.

pairSepRange

The range of separations for paired sequence reads; known

from the size selection used to prepare the DNA fragment

library.

showHits

The number of genome location alignment ‘hits’ to show for

each input read sequence.

maxHits

Sets the maximum number of alignments that will be tolerated

for a read sequence; otherwise the read is deemed to be

unmappable and is discarded.

enforceBest

Whether to enforce the display of only the best-matching read,

if more than one could be aligned to the genome.

cpuCores

If set, this states how many processing jobs to run at one time,

making use of multi-core central processing units. The default

is to use all available CPU cores.

The short sequence reads that are aligned to the genome are generally stored in a textual

format called FASTQ. These files contain sequence entries, one after another, in a form

exemplified by the following text:

@Annotation

CGGATGATTTTTATCCCATGAGACATCCAGTTCGG

+Annotation

45567778999:9;;<===>?@@@@AAAABCCCDE

As illustrated, each sequence read comprises four lines of text. The first and third lines,

beginning with ‘@’ and ‘+’ respectively, are textual annotations for the sequence read. In a

basic sense these are just identifiers, but the output of most sequencing machines contains

a rich variety of information, including details relating to the machine that read the

sequence. Importantly, if the FASTQ sequence files consist of read pairs (sequencing from

the two ends of DNA fragments) then the annotation for each sequence can be used to

unambiguously identify the connected read pairs even if (and as is generally the case) the

read pairs are given in two files, one for the sequencing from each end. However, it should

be noted that currently in many cases the order of sequences in paired FASTQ files is

consistent, so that the pair identity is simply the position in the file. The second line of a

FASTQ file is clearly the one-letter code sequence of the DNA base pairs. The fourth line

of the entry is an array of quality scores, each character of which relates to the sequence

letter at the same read position. It is perhaps unfortunate that there are several slightly

different systems of code that are used to state the positional read quality, which is why

the quality score scheme is taken as an input to the alignment program. All the quality

score systems use a range of textual (ASCII) characters to specify the quality value at each

read position and what differs in each system is which range of characters are used. The

basic problem is that just using characters from 0 through to 9 does not give enough

precision to the scores, so instead more characters are used, which naturally includes

letters and other symbols. All the characters are used in the following order:

!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_

'abcdefghijklmnopqrstuvwxyz{|}~

Here the score value in the scale decreases left to right and each sub-sequence character

is the next point on the scale. One thing that differs between the different schemes,

however, is which characters have the highest and lowest scores. Recent Illumina

machines (under the ‘sanger’ scheme here), for example, use symbols from ‘!’ to ‘J’

inclusive (older Illumina machines use the ‘illumina1.3’ scheme). Note that it is possible

to also use reads in FASTA file format, rather than FASTQ, although these will not contain

any quality information.

Now we come to the Python function that actually does the genome alignment by

wrapping the Bowtie aligner program. It should be noted that not all of the possible

command line options for Bowtie have been considered (such as the read pair orientation,

reverse complementation and backtracking retries), and of course this is specifically

designed to fit the requirements of one version of the genome alignment software. The

genomeAlign() function takes the arguments described in the above table, noting that we

can split the line defining the input alignments to keep things neat and tidy:

def genomeAlign(genomeName, genomeDir, readFiles, pairedReadFiles=None,

outFile=None, leftTrim=0, rightTrim=0, useSOLiD=False,

qualType='sanger', maxMismatches=2, highQualLen=28,

pairSepRange=(0,250), showHits=1, maxHits=2,

enforceBest=True, cpuCores=None):

This alignment program requires that the genome directory, containing the binary index

files, is specified as an environment variable, i.e. a name needs to be associated with the

location at the operating-system level, rather than being specified at run time. To get round

this requirement we can simply set the operating system’s environment variables within

Python, hence the os.environ line which links the variable name to its value using a

Python dictionary. The next operation then sets the output file if it was not passed into the

function, with the default being the genome name appended with ‘.sam’, given we are

making SAM-format files.

os.environ['BOWTIE_INDEXES'] = genomeDir

if not outFile:

outFile = genomeName + '.sam'

As with the previous example we collect a list of text strings that are equivalent to the

program name with the command line options required to run the program. Here the ‘-S’

option means that the output alignment file format will be SAM, although we could adapt

the function to specify something different. The readFilesStr is simply the comma-joined

list of input (short-read sequence) file names.

cmdArgs = [ALIGNER_PATH+'bowtie', '-S']

readFilesStr = ','.join(readFiles)

Next is some code to automatically guess the type of the input file based on the file

extension that follows the file name (from the ‘.’ at the end). Here we use the

os.path.splitext() function to split the name on the appropriate period/full-stop character to

get the trailing characters. Note that this procedure assumes that all the input read files are

of the same file format, given we only determine the type from the first in the list, but in

reality they really ought to be the same. If the file extension is in the first pre-set list then

the input reads are assumed to be in FASTA format, which requires that the ‘-f’ option is

used. Otherwise we check whether the format appears to be FASTQ, where we use the ‘-q’

option. And if none of that is true we assume the reads are in ‘raw’ format, where there is

simply a different one-letter sequence on each line.

foreName, fileType = os.path.splitext(readFiles[0])

if fileType in ('.fa','.fna','.mfa','.fasta'):

cmdArgs.append('-f')

elif fileType in ('.fq','.fastq'):

cmdArgs.append('-q')

else:

cmdArgs.append('-r')

Next is the option for using ABI SOLiD data. Then the quality score type is set, which

effectively translates the more informative text string options into ‘—’ style options. Note

here that the term ‘phred’ refers to the original DNA sequence quality score system, from a

program of that name, which is adjusted to use the text characters as described above

(ASCII with a specified offset).

if useSOLiD:

cmdArgs.append('-C')

if qualType == 'illumina1.3': # Phred+64

cmdArgs.append('--phred64-quals')

elif qualType == 'solexa': # Phred+59

cmdArgs.append('--solexa-quals')

else: # Sanger, current illumina: Phred + 33

cmdArgs.append('--phred33-quals')

The options to enforce reporting of only the best-matching genome alignment and

setting the number of processing jobs are considered. Note that the number of CPU cores

available on the current computer system can be determined using the standard

multiprocessing library, which is available from Python 2.6 onward.

if enforceBest:

cmdArgs.append('--best')

if not cpuCores:

import multiprocessing

cpuCores = multiprocessing.cpu_count()

The remaining, numeric, options are converted to text strings and added to the list. Here

the += operator is used, which is equivalent to list.extend() to add the contents of one list

to another.

cmdArgs += ['-5', str(leftTrim),

'-3', str(rightTrim),

'-n', str(maxMismatches),

'-l', str(highQualLen),

'-k', str(showHits),

'-m', str(maxHits),

'-p', str(cpuCores),

'--chunkmbs', '256']

If a second list of file names was passed to the function, specifying paired sequence

reads from the two ends of the DNA fragments, then we set some further options

specifically relating to read pairs. Naturally, we process the input sequence file names, as a

comma-joined list like before. Then we split the input separation limits for the pairs from

pairSepRange; these are input as separate arguments for the aligner. Note that when paired

reads are used the ‘-1’ and ‘-2’ options are employed to set the two file names. Otherwise,

i.e. at the else, the remaining options just refer to the one readFilesStr.

if pairedReadFiles:

pairedReadFiles = ','.join(pairedReadFiles)

minSep, maxSep = pairSepRange

cmdArgs += [genomeName,

'--minins', str(minSep),

'--maxins', str(maxSep),

'-1', readFilesStr,

'-2', pairedReadFiles,

outFile]

else:

cmdArgs += [genomeName, readFilesStr, outFile]

And finally at the end we invoke call() to run the actual external aligner program, which

may take a significant time depending on the number of sequence reads and the available

computational power. The output file name is returned from the function.

call(cmdArgs)

return outFile

All of the above can be brought together in the following example of a test, using the

demonstration data supplied in the on-line material,

which is the single E. coli

chromosome sequence, although in practice we will only really index the genome if it

changes. Note that the test chromosome file is converted to the full (absolute) file-system

path using os.path.abspath, which is what our indexing program requires.

filePath = os.path.abspath('examples/EcoliGenome.fasta')

genomeName = 'E_coli'

indexGenome(genomeName, [filePath,], 'examples')

fastqFile = 'examples/EcoliReads.fastq'

genomeAlign(genomeName, 'examples', [fastqFile], qualType='sanger')

The final alignment of this will produce output of the form:

# reads processed: 1000

# reads with at least one reported alignment: 685 (68.50%)

# reads that failed to align: 301 (30.10%)

# reads with alignments suppressed due to -m: 14 (1.40%)

Reported 685 alignments to 1 output stream(s)

Download 7,75 Mb.

Do'stlaringiz bilan baham:

1 ... 256 257 258 259 260 261 262 263 ... 514