sequence must be mapped to a reference genome to find from where in the chromosomes
the sequence originated. Effectively, mapping short sequence reads back to a
pre-assembled genome sequence allows the reads to be annotated with all the known
genomic information. This will include aspects such as: whether the sequence is from a
the actual base-pair sequence of a read is not the point of main interest; the location within
of the read sequences to the reference genome sequence to find where they match. Usually
only the two ends of the fragments are read for the first 100 or so base pairs,
restrict alignments to only genome positions where the reads are relatively close together.
read, especially if the region has a repetitive sequence. Here the ambiguity can sometimes
be resolved by sequencing for longer, e.g. reading the DNA fragments for more than 100
base pairs, but this gives diminishing returns as longer reads become more error-prone.
Sometimes a sequence read may not match at all, if there has been a genuine substitution
or an error in the sequencing (common at the end of reads). Fortunately, where sequences
differ slightly we can perform the sequence alignments in a permissive way and accept
small changes, on the expectation that the quality of the reads decreases with
length; i.e. the chances of a mismatch increase with length.
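The idea of a permissive match can be illustrated with a minimal sketch, assuming that only substitutions (no insertions or deletions) are tolerated. The function names, the example sequences and the max_mismatches parameter are hypothetical, chosen only for this illustration. The brute-force scan over every genome position is for clarity on a toy sequence and is not how a practical read mapper searches.

```python
def count_mismatches(read, target):
    """Count positions where two equal-length sequences differ."""
    if len(read) != len(target):
        raise ValueError("sequences must be the same length")
    return sum(1 for a, b in zip(read, target) if a != b)


def find_permissive(read, genome, max_mismatches=2):
    """Return (position, mismatches) for every ungapped placement of the
    read that differs from the genome at no more than max_mismatches
    positions. Substitutions only; gaps are not considered."""
    hits = []
    n = len(read)
    for pos in range(len(genome) - n + 1):
        window = genome[pos:pos + n]
        mismatches = count_mismatches(read, window)
        if mismatches <= max_mismatches:
            hits.append((pos, mismatches))
    return hits


# Toy data (assumed for illustration): the read carries one substitution
# relative to genome[9:17], which is "GGCTTACG".
genome = "ACGTTGCAAGGCTTACGATCGGATC"
read = "GGCTAACG"

print(find_permissive(read, genome, max_mismatches=1))  # [(9, 1)]
print(find_permissive(read, genome, max_mismatches=0))  # []
```

With an exact-match requirement the read is lost, whereas allowing a single mismatch recovers its position. Raising the tolerance too far would start to admit spurious placements, especially for short reads.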
The alignment of short, high-throughput DNA sequence reads to a genome is not done
using the types of sequence alignment discussed previously, i.e. not using dynamic
programming or programs like BLAST. Such methods would be too slow. Instead genome
mapping methods pre-index the genome sequence for a quick look-up, and commonly use
the Burrows-Wheeler transform for data compression.
The genome index means that a
small query sequence can be mapped to the genome by extracting the known genome
positions for its constituent sequence(s) that have been previously located; this is
somewhat similar to finding data in a Python dictionary using its key. The general idea is
to avoid having to align a query sequence to the large number of possible short sequences
in the whole genome each time. Rather, significant matches are found with a quick look-
up which can eliminate the vast majority of the genome sequence. This strategy is
optimised for large numbers of reads being mapped to the same target (the genome
sequence). This would be impractical for general pairwise alignments of arbitrary
sequence databases because the indexing process, required before using a new target
sequence, is designed for large contiguous sequences. Indexing is memory-intensive and is
proportionately slow but, given the target is fixed for a given genome sequence,
the cost
is returned many times over for the mapping of large numbers of small sequences to a
single target.
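The dictionary analogy can be made concrete with a small sketch. It assumes a plain Python dictionary keyed on fixed-length subsequences, rather than the compressed Burrows-Wheeler index that real mapping programs use, and it only finds exact matches. The function names build_index and map_read, the seed length k and the toy sequences are hypothetical and serve only as an illustration.

```python
from collections import defaultdict


def build_index(genome, k):
    """Map every k-letter subsequence of the genome to its start positions."""
    index = defaultdict(list)
    for pos in range(len(genome) - k + 1):
        index[genome[pos:pos + k]].append(pos)
    return index


def map_read(read, genome, index, k):
    """Look up the read's first k letters in the index, then confirm each
    candidate position by comparing the whole read with the genome."""
    seed = read[:k]
    positions = []
    for pos in index.get(seed, []):
        if genome[pos:pos + len(read)] == read:
            positions.append(pos)
    return positions


# Toy data (assumed for illustration)
genome = "ACGTTGCAAGGCTTACGATCGGATC"
k = 4
index = build_index(genome, k)  # built once, reused for every read

reads = ["GGCTTACG", "CGATCGG", "TTTTTTTT"]
for read in reads:
    print(read, map_read(read, genome, index, k))
# GGCTTACG [9]
# CGATCGG [15]
# TTTTTTTT []
```

The index is built once and then each read needs only a dictionary look-up followed by a check at a handful of candidate positions, so most of the genome is never examined. This is where the up-front indexing cost is repaid when many reads are mapped to the same target.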