Python Programming for Biology: Bioinformatics and Beyond

[Tim J. Stevens, Wayne Boucher] Python Programming

Short-read mapping

Irrespective of which technique was used to generate the DNA segments, their

sequences must be mapped to a reference genome to find where in the chromosomes

each sequence originated. Mapping short sequence reads back to a pre-assembled

genome sequence allows the reads to be annotated with all the known

genomic information. This will include aspects such as: whether the sequence is from a

gene, is a regulatory region, is a structural region or is non-functional; which gene, if any,

the sequence is from (or near to) and whether the sequence is from an intron or an exon. Often

the actual base-pair sequence of a read is not the main point of interest; its location within

the genome is. Naturally, finding where DNA fragments come from requires an alignment

of the read sequences to the reference genome sequence to find where they match. Usually

only the two ends of the fragments are read, for the first 100 or so base pairs, but this is

generally enough to locate the sequence within the genome. Also in this case, the pairing

of the sequences from the two fragment ends can help the mapping: if you know the range

of lengths of the DNA fragments (for example, using information from gel

electrophoresis) then you know how far apart the paired-end reads could be, and thus

restrict alignments to only genome positions where the reads are relatively close together.
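The paired-end restriction can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name pair_candidates, the example positions and the 300 to 600 base-pair fragment range are all hypothetical, and both candidate lists are assumed to refer to the same chromosome, with the second list holding the outermost coordinate of the mate read.

```python
def pair_candidates(fwd_positions, rev_positions, min_len, max_len):
    """Keep only pairs of candidate positions whose separation is
    consistent with the known range of fragment lengths."""
    pairs = []
    for start in fwd_positions:
        for end in rev_positions:
            span = end - start  # Implied fragment length for this pairing
            if min_len <= span <= max_len:
                pairs.append((start, end))
    return pairs

# Hypothetical candidate matches for the two ends of one fragment
fwd_hits = [1200, 50300, 981000]
rev_hits = [1650, 400000]

print(pair_candidates(fwd_hits, rev_hits, 300, 600))  # [(1200, 1650)]
```

Of the six possible pairings only one implies a fragment of plausible length, so the other candidate positions can be discarded.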

Unfortunately there may still be more than one genome match for a particular sequence

read, especially if the region has a repetitive sequence. Here the ambiguity can sometimes

be resolved by sequencing for longer, e.g. reading the DNA fragments for more than 100

base pairs, but this gives diminishing returns as longer reads become more error-prone.

Sometimes a sequence read may not match at all, if there has been a genuine substitution

or an error in the sequencing (common at the end of reads). Fortunately, in situations

where sequences differ slightly we can do the sequence alignments in a permissive way that

accepts small changes, with the expectation that the quality of the reads decreases with

length; i.e. the chance of a mismatch increases towards the end of a read.
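As a rough illustration of permissive matching, the following sketch accepts any genome position where a read differs by no more than a set number of bases. The function names and the toy sequences are assumptions made for this example, and the brute-force scan over every position is used purely for clarity; it is not how real mapping programs search a genome.

```python
def count_mismatches(read, target):
    """Count the positions at which two equal-length sequences differ."""
    return sum(1 for a, b in zip(read, target) if a != b)

def find_permissive(read, genome, max_mismatches=2):
    """Return (position, mismatches) for every genome window that
    matches the read with at most max_mismatches differences."""
    n = len(read)
    hits = []
    for i in range(len(genome) - n + 1):
        m = count_mismatches(read, genome[i:i+n])
        if m <= max_mismatches:
            hits.append((i, m))
    return hits

genome = 'ACGTTGCAAGCTTAGGCTA'  # Toy reference sequence
read = 'AGCTTAGC'               # Final base differs from the reference

print(find_permissive(read, genome, max_mismatches=0))  # []
print(find_permissive(read, genome, max_mismatches=1))  # [(8, 1)]
```

With no mismatches allowed the read is not placed at all, but tolerating a single difference recovers its position. A more refined scheme could penalise mismatches near the end of a read less heavily, given that errors are more common there.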

The alignment of short, high-throughput DNA sequence reads to a genome is not done

using the types of sequence alignment discussed previously, i.e. not using dynamic

programming or programs like BLAST. Such methods would be too slow. Instead genome

mapping methods pre-index the genome sequence for a quick look-up, and commonly use

the Burrows-Wheeler transform for data compression.
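To give a feel for the Burrows-Wheeler transform, here is a naive sketch that sorts all rotations of a sequence and takes the last letter of each. The function names and the '$' end marker are choices made for this example; practical genome indexers build the transform with far more efficient methods (typically via suffix arrays), so this is suitable only for short sequences.

```python
def bwt(text, terminator='$'):
    """Naive Burrows-Wheeler transform: sort every rotation of the
    terminated text and join the last character of each rotation."""
    text = text + terminator
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return ''.join(rot[-1] for rot in rotations)

def inverse_bwt(transformed, terminator='$'):
    """Recover the original text by repeatedly prepending the
    transformed column and re-sorting the table of rows."""
    table = [''] * len(transformed)
    for _ in range(len(transformed)):
        table = sorted(ch + row for ch, row in zip(transformed, table))
    for row in table:
        if row.endswith(terminator):
            return row[:-1]

seq = 'ACAACG'
transformed = bwt(seq)

print(transformed)               # GC$AAAC
print(inverse_bwt(transformed))  # ACAACG
```

The transform is reversible and tends to group identical letters together, which is what makes the result easy to compress.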

The genome index means that a small query sequence can be mapped to the genome by extracting the known genome

positions for its constituent sequence(s) that have been previously located; this is

somewhat similar to finding data in a Python dictionary using its key. The general idea is

to avoid having to align a query sequence to the large number of possible short sequences

in the whole genome each time. Rather, significant matches are found with a quick

look-up, which can eliminate the vast majority of the genome sequence. This strategy is

optimised for large numbers of reads being mapped to the same target (the genome

sequence). This would be impractical for general pairwise alignments of arbitrary

sequence databases because the indexing process, required before using a new target

sequence, is designed for large contiguous sequences. Indexing is memory-intensive and

proportionately slow but, given that the target is fixed for a given genome sequence, the cost

is repaid many times over by the mapping of large numbers of small sequences to a

single target.
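The dictionary analogy can be made concrete with a toy index. In this minimal sketch the genome is indexed once by every sub-sequence of length k, and each read is then placed by looking up its own sub-sequences and counting which start position they agree on. The names build_index and map_read, the value of k and the toy sequences are assumptions for illustration; real mappers use compressed indices rather than a plain Python dictionary.

```python
from collections import Counter, defaultdict

def build_index(genome, k):
    """Record every start position of each length-k sub-sequence.
    This is the slow step, done once per genome."""
    index = defaultdict(list)
    for i in range(len(genome) - k + 1):
        index[genome[i:i+k]].append(i)
    return index

def map_read(read, index, k):
    """Look up each length-k piece of the read and vote for the
    genome start position that each match implies."""
    votes = Counter()
    for offset in range(len(read) - k + 1):
        for pos in index.get(read[offset:offset+k], []):
            votes[pos - offset] += 1
    return votes.most_common()  # (start position, votes), best first

genome = 'ACGTTGCAAGCTTAGGCTA'  # Toy reference sequence
index = build_index(genome, 4)  # Built once, reused for every read

print(map_read('AGCTTAGG', index, 4))  # [(8, 5)]
print(map_read('TTTTTTTT', index, 4))  # []
```

All five sub-sequences of the first read point to the same start position, while the second read has no match in the index and is rejected without scanning the genome at all.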
