Python Programming for Biology: Bioinformatics and Beyond


Aligning reads to a genome



Download 7,75 Mb.
Pdf ko'rish
bet260/514
Sana30.12.2021
Hajmi7,75 Mb.
#91066
1   ...   256   257   258   259   260   261   262   263   ...   514
Bog'liq
[Tim J. Stevens, Wayne Boucher] Python Programming

Aligning reads to a genome

Next we create another wrapper to control an external program, to do the alignment with

the indexed genome data. There are many more options to control the program this time,

as this is highly dependent not only on the source of the data (in terms of which kind of

sequence  machine  produced  it)  but  also  on  the  way  in  which  the  sequenced  DNA  was

prepared  and  how  this  experimentally  informs  the  biology  that  we  are  interested  in.  The

key  aspects  of  the  sequences  we  wish  to  align  are  the  sequence  lengths  and  within  this

length  what  the  quality  of  the  called  sequence  reads  is,  given  this  generally  diminishes

with  length.  Also  when  the  alignment  is  actually  done  we  have  to  consider  how

mismatches (e.g. polymorphisms or errors in the sequencing) are handled. The full list of

options that we pass in, as arguments to the alignment function, is as follows:

genomeName

This is the name or tag given to the indexed genome that we

wish to align to. This is the same name as was used when

building the index with the function described above.

genomeDir

The location of the directory that contains the binary genome

index files.

readFiles

A list containing the names of the files containing the short-

read sequences to align.

pairedReadFiles

For experiments that use paired read data, a second list of

short-read sequence file names that match up (as pairs) with

the readFiles.

outFile


The name of the output alignment file, which by default will

use the SAM format.

leftTrim

The number of sequence letters to trim from the start of all the

reads; useful to remove known barcode sequences etc., which

should not be mapped.

rightTrim

How many sequence letters to remove from the end of the

sequence reads.

useSOLiD


Whether the data comes from an ABI SOLiD sequencing

machine.


qualType

The type of quality scores to use, which is expected to be




‘illumina1.3’, ‘solexa’ or ‘sanger’. See the description below of

the FASTQ format for more explanation of this.

maxMismatches

The maximum number of base-pair mismatches to tolerate in

an alignment.

highQualLen

How many sequence positions, from the start of the read, are

deemed reliable.

pairSepRange

The range of separations for paired sequence reads; known

from the size selection used to prepare the DNA fragment

library.


showHits

The number of genome location alignment ‘hits’ to show for

each input read sequence.

maxHits


Sets the maximum number of alignments that will be tolerated

for a read sequence; otherwise the read is deemed to be

unmappable and is discarded.

enforceBest

Whether to enforce the display of only the best-matching read,

if more than one could be aligned to the genome.

cpuCores

If set, this states how many processing jobs to run at one time,

making use of multi-core central processing units. The default

is to use all available CPU cores.

The short sequence reads that are aligned to the genome are generally stored in a textual

format  called  FASTQ.  These  files  contain  sequence  entries,  one  after  another,  in  a  form

exemplified by the following text:

@Annotation

CGGATGATTTTTATCCCATGAGACATCCAGTTCGG

+Annotation

45567778999:9;;<===>?@@@@AAAABCCCDE

As illustrated, each sequence read comprises four lines of text. The first and third lines,

beginning with ‘@’ and ‘+’ respectively, are textual annotations for the sequence read. In a

basic sense these are just identifiers, but the output of most sequencing machines contains

a  rich  variety  of  information,  including  details  relating  to  the  machine  that  read  the

sequence. Importantly, if the FASTQ sequence files consist of read pairs (sequencing from

the  two  ends  of  DNA  fragments)  then  the  annotation  for  each  sequence  can  be  used  to

unambiguously identify the connected read pairs even if (and as is generally the case) the

read pairs are given in two files, one for the sequencing from each end. However, it should

be  noted  that  currently  in  many  cases  the  order  of  sequences  in  paired  FASTQ  files  is

consistent, so that the pair identity is simply the position in the file. The second line of a

FASTQ file is clearly the one-letter code sequence of the DNA base pairs. The fourth line




of the entry is an array of quality scores, each character of which relates to the sequence

letter  at  the  same  read  position.  It  is  perhaps  unfortunate  that  there  are  several  slightly

different  systems  of  code  that  are  used  to  state  the  positional  read  quality,  which  is  why

the  quality  score  scheme  is  taken  as  an  input  to  the  alignment  program.  All  the  quality

score systems use a range of textual (ASCII) characters to specify the quality value at each

read position and what differs in each system is which range of characters are used. The

basic  problem  is  that  just  using  characters  from  0  through  to  9  does  not  give  enough

precision  to  the  scores,  so  instead  more  characters  are  used,  which  naturally  includes

letters and other symbols. All the characters are used in the following order:

!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_

'abcdefghijklmnopqrstuvwxyz{|}~

Here the score value in the scale decreases left to right and each sub-sequence character

is  the  next  point  on  the  scale.  One  thing  that  differs  between  the  different  schemes,

however,  is  which  characters  have  the  highest  and  lowest  scores.  Recent  Illumina

machines  (under  the  ‘sanger’  scheme  here),  for  example,  use  symbols  from  ‘!’  to  ‘J’

inclusive (older Illumina machines use the ‘illumina1.3’ scheme). Note that it is possible

to also use reads in FASTA file format, rather than FASTQ, although these will not contain

any quality information.

Now  we  come  to  the  Python  function  that  actually  does  the  genome  alignment  by

wrapping  the  Bowtie  aligner  program.  It  should  be  noted  that  not  all  of  the  possible

command line options for Bowtie have been considered (such as the read pair orientation,

reverse  complementation  and  backtracking  retries),  and  of  course  this  is  specifically

designed  to  fit  the  requirements  of  one  version  of  the  genome  alignment  software.  The

genomeAlign() function takes the arguments described in the above table, noting that we

can split the line defining the input alignments to keep things neat and tidy:

def genomeAlign(genomeName, genomeDir, readFiles, pairedReadFiles=None,

outFile=None, leftTrim=0, rightTrim=0, useSOLiD=False,

qualType='sanger', maxMismatches=2, highQualLen=28,

pairSepRange=(0,250), showHits=1, maxHits=2,

enforceBest=True, cpuCores=None):

This alignment program requires that the genome directory, containing the binary index

files, is specified as an environment variable, i.e. a name needs to be associated with the

location at the operating-system level, rather than being specified at run time. To get round

this  requirement  we  can  simply  set  the  operating  system’s  environment  variables  within

Python,  hence  the  os.environ  line  which  links  the  variable  name  to  its  value  using  a

Python dictionary. The next operation then sets the output file if it was not passed into the

function,  with  the  default  being  the  genome  name  appended  with  ‘.sam’,  given  we  are

making SAM-format files.

14

os.environ['BOWTIE_INDEXES'] = genomeDir



if not outFile:

outFile = genomeName + '.sam'

As with the previous example we collect a list of text strings that are equivalent to the



program name with the command line options required to run the program. Here the ‘-S’

option means that the output alignment file format will be SAM, although we could adapt

the function to specify something different. The readFilesStr is simply the comma-joined

list of input (short-read sequence) file names.

cmdArgs = [ALIGNER_PATH+'bowtie', '-S']

readFilesStr = ','.join(readFiles)

Next  is  some  code  to  automatically  guess  the  type  of  the  input  file  based  on  the  file

extension  that  follows  the  file  name  (from  the  ‘.’  at  the  end).  Here  we  use  the

os.path.splitext() function to split the name on the appropriate period/full-stop character to

get the trailing characters. Note that this procedure assumes that all the input read files are

of the same file format, given we only determine the type from the first in the list, but in

reality they really ought to be the same. If the file extension is in the first pre-set list then

the input reads are assumed to be in FASTA format, which requires that the ‘-f’ option is

used. Otherwise we check whether the format appears to be FASTQ, where we use the ‘-q’

option. And if none of that is true we assume the reads are in ‘raw’ format, where there is

simply a different one-letter sequence on each line.

foreName, fileType = os.path.splitext(readFiles[0])

if fileType in ('.fa','.fna','.mfa','.fasta'):

cmdArgs.append('-f')

elif fileType in ('.fq','.fastq'):

cmdArgs.append('-q')

else:


cmdArgs.append('-r')

Next is the option for using ABI SOLiD data. Then the quality score type is set, which

effectively translates the more informative text string options into ‘—’ style options. Note

here that the term ‘phred’ refers to the original DNA sequence quality score system, from a

program  of  that  name,  which  is  adjusted  to  use  the  text  characters  as  described  above

(ASCII with a specified offset).

if useSOLiD:

cmdArgs.append('-C')

if qualType == 'illumina1.3': # Phred+64

cmdArgs.append('--phred64-quals')

elif qualType == 'solexa': # Phred+59

cmdArgs.append('--solexa-quals')

else: # Sanger, current illumina: Phred + 33

cmdArgs.append('--phred33-quals')

The  options  to  enforce  reporting  of  only  the  best-matching  genome  alignment  and

setting the number of processing jobs are considered. Note that the number of CPU cores

available  on  the  current  computer  system  can  be  determined  using  the  standard

multiprocessing library, which is available from Python 2.6 onward.




if enforceBest:

cmdArgs.append('--best')

if not cpuCores:

import multiprocessing

cpuCores = multiprocessing.cpu_count()

The remaining, numeric, options are converted to text strings and added to the list. Here

the += operator is used, which is equivalent to list.extend() to add the contents of one list

to another.

cmdArgs += ['-5', str(leftTrim),

'-3', str(rightTrim),

'-n', str(maxMismatches),

'-l', str(highQualLen),

'-k', str(showHits),

'-m', str(maxHits),

'-p', str(cpuCores),

'--chunkmbs', '256']

If  a  second  list  of  file  names  was  passed  to  the  function,  specifying  paired  sequence

reads  from  the  two  ends  of  the  DNA  fragments,  then  we  set  some  further  options

specifically relating to read pairs. Naturally, we process the input sequence file names, as a

comma-joined list like before. Then we split the input separation limits for the pairs from

pairSepRange; these are input as separate arguments for the aligner. Note that when paired

reads are used the ‘-1’ and ‘-2’ options are employed to set the two file names. Otherwise,

i.e. at the else, the remaining options just refer to the one readFilesStr.

if pairedReadFiles:

pairedReadFiles = ','.join(pairedReadFiles)

minSep, maxSep = pairSepRange

cmdArgs += [genomeName,

'--minins', str(minSep),

'--maxins', str(maxSep),

'-1', readFilesStr,

'-2', pairedReadFiles,

outFile]


else:

cmdArgs += [genomeName, readFilesStr, outFile]

And finally at the end we invoke call() to run the actual external aligner program, which

may take a significant time depending on the number of sequence reads and the available

computational power. The output file name is returned from the function.

call(cmdArgs)

return outFile

All of the above can be brought together in the following example of a test, using the

demonstration  data  supplied  in  the  on-line  material,

15

 which  is  the  single  E.  coli



chromosome  sequence,  although  in  practice  we  will  only  really  index  the  genome  if  it

changes. Note that the test chromosome file is converted to the full (absolute) file-system

path using os.path.abspath, which is what our indexing program requires.



filePath = os.path.abspath('examples/EcoliGenome.fasta')

genomeName = 'E_coli'

indexGenome(genomeName, [filePath,], 'examples')

fastqFile = 'examples/EcoliReads.fastq'

genomeAlign(genomeName, 'examples', [fastqFile], qualType='sanger')

The final alignment of this will produce output of the form:

# reads processed: 1000

# reads with at least one reported alignment: 685 (68.50%)

# reads that failed to align: 301 (30.10%)

# reads with alignments suppressed due to -m: 14 (1.40%)

Reported 685 alignments to 1 output stream(s)


Download 7,75 Mb.

Do'stlaringiz bilan baham:
1   ...   256   257   258   259   260   261   262   263   ...   514




Ma'lumotlar bazasi mualliflik huquqi bilan himoyalangan ©hozir.org 2024
ma'muriyatiga murojaat qiling

kiriting | ro'yxatdan o'tish
    Bosh sahifa
юртда тантана
Боғда битган
Бугун юртда
Эшитганлар жилманглар
Эшитмадим деманглар
битган бодомлар
Yangiariq tumani
qitish marakazi
Raqamli texnologiyalar
ilishida muhokamadan
tasdiqqa tavsiya
tavsiya etilgan
iqtisodiyot kafedrasi
steiermarkischen landesregierung
asarlaringizni yuboring
o'zingizning asarlaringizni
Iltimos faqat
faqat o'zingizning
steierm rkischen
landesregierung fachabteilung
rkischen landesregierung
hamshira loyihasi
loyihasi mavsum
faolyatining oqibatlari
asosiy adabiyotlar
fakulteti ahborot
ahborot havfsizligi
havfsizligi kafedrasi
fanidan bo’yicha
fakulteti iqtisodiyot
boshqaruv fakulteti
chiqarishda boshqaruv
ishlab chiqarishda
iqtisodiyot fakultet
multiservis tarmoqlari
fanidan asosiy
Uzbek fanidan
mavzulari potok
asosidagi multiservis
'aliyyil a'ziym
billahil 'aliyyil
illaa billahil
quvvata illaa
falah' deganida
Kompyuter savodxonligi
bo’yicha mustaqil
'alal falah'
Hayya 'alal
'alas soloh
Hayya 'alas
mavsum boyicha


yuklab olish