Python Programming for Biology: Bioinformatics and Beyond



[Tim J. Stevens, Wayne Boucher] Python Programming

Short-read mapping

Irrespective of which technique was used to generate the DNA segments, their sequences
must be mapped to a reference genome to find where in the chromosomes they originated.
Effectively, mapping short sequence reads back to a pre-assembled genome sequence allows
the reads to be annotated with all the known genomic information. This will include
aspects such as: whether the sequence is from a gene, is a regulatory region, is a
structural region[2] or is non-functional; which gene, if any, the sequence is from (or
near to); and whether the sequence is an intron or an exon. Often the actual base-pair
sequence of a read is not the main point of interest; the location within the genome is.
Naturally, finding where DNA fragments come from requires an alignment of the read
sequences to the reference genome sequence to find where they match. Usually only the two
ends of the fragments are read, for the first 100 or so base pairs,[3] but this is
generally enough to locate the sequence within the genome. In this case the pairing of
the sequences from the two fragment ends can also help the mapping: if you know the range
of lengths of the DNA fragments (for example, using information from gel electrophoresis)
then you know how far apart the paired-end reads could be, and can thus restrict
alignments to only those genome positions where the reads are relatively close together.
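
The paired-end restriction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name pair_candidates, its parameters and the example positions are all hypothetical, both reads are taken to have the same length, and positions are start coordinates on the same strand of one chromosome.

```python
def pair_candidates(fwd_positions, rev_positions, read_len, min_frag, max_frag):
    """Keep (first, second) start-position pairs whose implied fragment
    length lies within the expected range (hypothetical helper)."""
    pairs = []
    for f in fwd_positions:
        for r in rev_positions:
            # Span from the start of the first read to the end of the second
            frag_len = r + read_len - f
            if min_frag <= frag_len <= max_frag:
                pairs.append((f, r))
    return pairs


# Made-up candidate match positions for the two ends of one fragment
fwd_hits = [1200, 50400, 910000]
rev_hits = [1480, 300250, 910900]

print(pair_candidates(fwd_hits, rev_hits, read_len=100, min_frag=300, max_frag=500))
# [(1200, 1480)]
```

Of the nine possible pairings only one implies a fragment length (380 base pairs) inside the assumed 300 to 500 range, so the other candidate positions can be discarded.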

Unfortunately, there may still be more than one genome match for a particular sequence
read, especially if the region has a repetitive sequence. Here the ambiguity can sometimes
be resolved by sequencing for longer, e.g. reading the DNA fragments for more than 100
base pairs, but this gives diminishing returns as longer reads become more error-prone.
Sometimes a sequence read may not match at all, if there has been a genuine substitution
or an error in the sequencing (common at the end of reads). Fortunately, in situations
where sequences differ slightly we can do the sequence alignments in a permissive way that
accepts small changes, with the expectation that the quality of the reads decreases with
length; i.e. the chance of a mismatch increases with position along the read.
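
One simple way to picture such permissive matching is to count mismatches but penalise those near the end of a read less heavily. The sketch below is an assumption made for illustration, not the scheme used by any particular mapping program: the linear weighting, the end_weight and max_score parameters and both function names are hypothetical, and the brute-force scan over every genome position is only sensible for a toy sequence.

```python
def mismatch_score(read, ref_segment, end_weight=0.5):
    """Sum of position-weighted mismatches. The weight falls linearly from
    1.0 at the first base to end_weight at the last (assumed scheme)."""
    if len(read) != len(ref_segment):
        raise ValueError("read and reference segment must have equal length")
    n = len(read)
    score = 0.0
    for i, (a, b) in enumerate(zip(read, ref_segment)):
        if a != b:
            frac = i / (n - 1) if n > 1 else 0.0
            score += 1.0 - (1.0 - end_weight) * frac
    return score


def permissive_matches(read, genome, max_score=1.5):
    """Return (start, score) for every genome window within the tolerance."""
    n = len(read)
    hits = []
    for start in range(len(genome) - n + 1):
        score = mismatch_score(read, genome[start:start + n])
        if score <= max_score:
            hits.append((start, score))
    return hits


genome = "ACGTACGTTAGC"
read = "ACGTTAGA"  # differs from genome[4:12] only at its final base

print(permissive_matches(read, genome))
# [(4, 0.5)]
```

The read is accepted at position 4 despite the mismatch because the difference falls at the last base, where errors are most expected and so cost the least.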

The alignment of short, high-throughput DNA sequence reads to a genome is not done using
the types of sequence alignment discussed previously, i.e. not using dynamic programming
or programs like BLAST. Such methods would be too slow. Instead, genome mapping methods
pre-index the genome sequence for a quick look-up, and commonly use the Burrows-Wheeler
transform for data compression.[4] The genome index means that a small query sequence can
be mapped to the genome by extracting the known genome positions for its constituent
sequence(s) that have been previously located; this is somewhat similar to finding data in
a Python dictionary using its key. The general idea is to avoid having to align a query
sequence to the large number of possible short sequences in the whole genome each time.
Rather, significant matches are found with a quick look-up which can eliminate the vast
majority of the genome sequence. This strategy is optimised for large numbers of reads
being mapped to the same target (the genome sequence). It would be impractical for general
pairwise alignments of arbitrary sequence databases because the indexing process, required
before using a new target sequence, is designed for large contiguous sequences. Indexing
is memory-intensive and proportionately slow but, given that the target is fixed for a
given genome sequence,[5] the cost is repaid many times over when mapping large numbers of
small sequences to a single target.
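
The dictionary analogy can be made concrete with a toy index. Note that this swaps in a plain k-mer dictionary for the compressed Burrows-Wheeler based indexes that real mapping programs use, and it only finds exact matches; the function names build_kmer_index and map_read, the seed length k and the example sequences are all hypothetical.

```python
from collections import defaultdict


def build_kmer_index(genome, k):
    """Record every start position of each length-k subsequence (done once)."""
    index = defaultdict(list)
    for pos in range(len(genome) - k + 1):
        index[genome[pos:pos + k]].append(pos)
    return index


def map_read(read, genome, index, k):
    """Look up the first k bases of the read as a seed, then verify the
    whole read only at the few positions the index returns."""
    hits = []
    for pos in index.get(read[:k], []):
        if genome[pos:pos + len(read)] == read:
            hits.append(pos)
    return hits


genome = "GATTACAGATTACCAGGATTACA"
k = 4
index = build_kmer_index(genome, k)  # the one-off indexing cost

for read in ["TTACCAG", "GATTACA", "CCCC"]:
    print(read, map_read(read, genome, index, k))
# TTACCAG [9]
# GATTACA [0, 16]
# CCCC []
```

Building the index is the expensive step, but each subsequent read needs only one dictionary look-up and a handful of comparisons, which is why the approach pays off when many reads are mapped to the same genome. The second read also shows the repeat problem: it matches at two positions.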



