Substitutability
Next we will move on from measuring a simple sequence identity to the more subtle
measure of sequence similarity; this is to say that sequence pairs in an alignment can have
a score even when they are not the same. The notion of similarity in this case is somewhat
subjective and ultimately depends on the kind of biology you are working with.
Nevertheless the general idea when scoring how similar two residues are, when aligned as
a pair, is to consider how substitutable one residue type is for another; in other words, how
likely they are to have been swapped or exchanged for one another. Residues that
commonly swap for one another are deemed to be similar and give high scores, while
those that rarely swap are dissimilar and give low scores. High similarity in this instance
doesn’t necessarily mean that two residues are always chemically similar, although they
often are. Strictly speaking the substitutability of one residue type for another depends on
the exact context of the residue (where it is in a chromosome or protein etc.) but we can
ignore this complication for now and consider just an average value for swap-ability.
The substitutability of one residue for another is stored as a two-dimensional array,
commonly called a substitution matrix or similarity matrix. The idea is that each score
value in the matrix represents the substitutability of two residue types, e.g. ‘A’ to ‘G’ in
DNA or ‘V’ to ‘L’ in proteins. The two residue types can be thought of as indicating the
row and column of an element in a matrix, although in our Python examples we will
encode matrices as dictionaries of dictionaries. Using dictionaries we can look up the
score for two residue types by using the residue letters directly as keys, without having to
work out the numbers for the matrix row and column. With a substitution matrix
dictionary the first key (residue letter) identifies a sub-dictionary from inside the main
dictionary and the second key gets the final value from inside the sub-dictionary.
Below is an example of a very simple substitution matrix that would give the same
scores as if you were measuring sequence identity, i.e. a score of one where residues are
identical and zero elsewhere.
DNA_1 = {'G': { 'G':1, 'C':0, 'A':0, 'T':0 },
         'C': { 'G':0, 'C':1, 'A':0, 'T':0 },
         'A': { 'G':0, 'C':0, 'A':1, 'T':0 },
         'T': { 'G':0, 'C':0, 'A':0, 'T':1 }}
Remembering that two keys are needed to extract a value (one for the main dictionary
and one for the sub-dictionaries) we would get 1 for identical residue look-ups like
DNA_1['G']['G'] and 0 for non-identical keys like DNA_1['G']['A'].
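As a minimal sketch of the two-key look-up, the same identity-style matrix can also be built with loops rather than typed out by hand. The names residues and identity_matrix are our own, introduced only for this illustration.

```python
# Build an identity-style substitution matrix equivalent to DNA_1.
# The variable names here are illustrative, not taken from the text.
residues = 'GCAT'

identity_matrix = {}
for a in residues:
    identity_matrix[a] = {}          # one sub-dictionary per residue type
    for b in residues:
        identity_matrix[a][b] = 1 if a == b else 0

# The first key selects a sub-dictionary, the second key selects the score
sub_dict = identity_matrix['G']
print(sub_dict['G'])                 # identical residues
print(identity_matrix['G']['A'])     # non-identical residues
```

The two print statements give 1 and 0 respectively, matching the look-ups described above.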
Changing track slightly, rather than scoring DNA sequences for matches we could also
score for complementarity (i.e. using Crick and Watson’s pairing rules), with 1 for A:T or
G:C matches and -1 for mismatches. Expressed as a Python dictionary this would be:
REV_COMP = {'G': { 'G':-1, 'C': 1, 'A':-1, 'T':-1 },
            'C': { 'G': 1, 'C':-1, 'A':-1, 'T':-1 },
            'A': { 'G':-1, 'C':-1, 'A':-1, 'T': 1 },
            'T': { 'G':-1, 'C':-1, 'A': 1, 'T':-1 }}
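To see the complementarity matrix in use, the sketch below sums the pair scores for two equal-length sequences, position by position. The function name complementarity_score is hypothetical, and for simplicity we assume both sequences are written in the same left-to-right order, so no strand reversal is done.

```python
REV_COMP = {'G': { 'G':-1, 'C': 1, 'A':-1, 'T':-1 },
            'C': { 'G': 1, 'C':-1, 'A':-1, 'T':-1 },
            'A': { 'G':-1, 'C':-1, 'A':-1, 'T': 1 },
            'T': { 'G':-1, 'C':-1, 'A': 1, 'T':-1 }}

def complementarity_score(seq_a, seq_b, matrix):
    """Sum the matrix scores for residues paired position by position."""
    if len(seq_a) != len(seq_b):
        raise ValueError('Sequences must be the same length')
    return sum(matrix[a][b] for a, b in zip(seq_a, seq_b))

# Every position pairs A:T or G:C, so each of the 7 pairs scores 1
print(complementarity_score('GATTACA', 'CTAATGT', REV_COMP))   # 7

# A sequence against itself has no complementary pairs; each pair scores -1
print(complementarity_score('GATTACA', 'GATTACA', REV_COMP))   # -7
```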
Moving on to a more sophisticated matrix: as the complementarity example above
illustrates, substitution scores can have negative values (mismatch), and a score of zero is
often used to indicate indifference. In the DNA_2 matrix below note that identical residue
keys give a score of 1 but non-identical keys give -3. In other words the mismatches are
strongly penalised; in an alignment three identical residues are required to balance one
mismatch. Also note that the example uses the residue code ‘N’, which in this instance for
DNA means any[9] unidentified residue. Such a residue is indifferent in an alignment, given
that we cannot tell if it is good or bad, and so it scores zero with everything.
DNA_2 = {'G': { 'G': 1, 'C':-3, 'A':-3, 'T':-3, 'N':0 },
         'C': { 'G':-3, 'C': 1, 'A':-3, 'T':-3, 'N':0 },
         'A': { 'G':-3, 'C':-3, 'A': 1, 'T':-3, 'N':0 },
         'T': { 'G':-3, 'C':-3, 'A':-3, 'T': 1, 'N':0 },
         'N': { 'G': 0, 'C': 0, 'A': 0, 'T': 0, 'N':0 }}
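The balance between matches and mismatches can be checked with a short sketch that adds up the scores for already-aligned residue pairs. The function name score_aligned_pairs is our own choice for this illustration, and it assumes two ungapped sequences of equal length.

```python
DNA_2 = {'G': { 'G': 1, 'C':-3, 'A':-3, 'T':-3, 'N':0 },
         'C': { 'G':-3, 'C': 1, 'A':-3, 'T':-3, 'N':0 },
         'A': { 'G':-3, 'C':-3, 'A': 1, 'T':-3, 'N':0 },
         'T': { 'G':-3, 'C':-3, 'A':-3, 'T': 1, 'N':0 },
         'N': { 'G': 0, 'C': 0, 'A': 0, 'T': 0, 'N':0 }}

def score_aligned_pairs(seq_a, seq_b, matrix):
    """Sum substitution scores for two ungapped, equal-length sequences."""
    return sum(matrix[a][b] for a, b in zip(seq_a, seq_b))

# Three matches (+1 each) exactly balance one mismatch (-3)
print(score_aligned_pairs('AGCT', 'AGCA', DNA_2))   # 0

# An unidentified 'N' neither adds to nor subtracts from the total
print(score_aligned_pairs('AGCN', 'AGCT', DNA_2))   # 3
```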
The next example is part of a substitution matrix for protein sequences. It is a fairly
famous one called BLOSUM62 (often the default in many programs). You will of course
note that the matrix is much larger than for DNA because we have 20 regular amino acids,
plus ‘X’ for unknown type. We have only shown the first four sub-dictionaries here, but
the full matrix can be found in the on-line material (available via
http://www.cambridge.org/pythonforbiology). There are usually many variants of a given
substitution matrix type. Here we specifically use the ‘62’[10] version of the BLOSUM series
because it is a good general-purpose one. You would commonly consider using different
matrix versions to tune your alignment for more closely related or distantly related
sequences for which substitution preferences are known to differ.
BLOSUM62 = {'A':{'A': 4,'R':-1,'N':-2,'D':-2,'C': 0,'Q':-1,
                 'E':-1,'G': 0,'H':-2,'I':-1,'L':-1,'K':-1,
                 'M':-1,'F':-2,'P':-1,'S': 1,'T': 0,'W':-3,
                 'Y':-2,'V': 0,'X':0},
            'R':{'A':-1,'R': 5,'N': 0,'D':-2,'C':-3,'Q': 1,
                 'E': 0,'G':-2,'H': 0,'I':-3,'L':-2,'K': 2,
                 'M':-1,'F':-3,'P':-2,'S':-1,'T':-1,'W':-3,
                 'Y':-2,'V':-3,'X':0},
            'N':{'A':-2,'R': 0,'N': 6,'D': 1,'C':-3,'Q': 0,
                 'E': 0,'G': 0,'H': 1,'I':-3,'L':-3,'K': 0,
                 'M':-2,'F':-3,'P':-2,'S': 1,'T': 0,'W':-4,
                 'Y':-2,'V':-3,'X':0},
            'D':{'A':-2,'R':-2,'N': 1,'D': 6,'C':-3,'Q': 0,
                 'E': 2,'G':-1,'H':-1,'I':-3,'L':-4,'K':-1,
                 'M':-3,'F':-3,'P':-1,'S': 0,'T':-1,'W':-4,
                 'Y':-3,'V':-3,'X':0}}
## SNIP: THE FULL MATRIX CARRIES ON FOR 17 MORE SUB-DICTIONARIES ##
As with the DNA matrices we use two keys to get the substitution score and have
positive, zero and negative values. Note that the matrix, like the DNA matrix examples, is
symmetric,[11] i.e. BLOSUM62['A']['R'] equals BLOSUM62['R']['A']. Unlike the DNA
examples the diagonal of the matrix is not uniform, which is to say that the score for
residue types being the same in an alignment differs. For example, BLOSUM62['A']['A'],
meaning an exact alanine match, gives a score of 4, but an exact asparagine match
BLOSUM62['N']['N'] gives a higher score of 6. Thus ‘A’ is less well conserved (more
swappable for something else) than ‘N’.
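Both properties can be checked in a few lines. Because only part of the matrix is printed here, the sketch below uses a small four-residue excerpt, which we have named BLOSUM62_PART for this illustration; its values are copied from the sub-dictionaries shown above.

```python
# Excerpt of BLOSUM62 restricted to the residue types A, R, N and D.
# BLOSUM62_PART is an illustrative name; the values come from the text above.
BLOSUM62_PART = {'A': {'A': 4, 'R':-1, 'N':-2, 'D':-2},
                 'R': {'A':-1, 'R': 5, 'N': 0, 'D':-2},
                 'N': {'A':-2, 'R': 0, 'N': 6, 'D': 1},
                 'D': {'A':-2, 'R':-2, 'N': 1, 'D': 6}}

residues = 'ARND'

# Symmetry: swapping the order of the two keys never changes the score
is_symmetric = all(BLOSUM62_PART[a][b] == BLOSUM62_PART[b][a]
                   for a in residues for b in residues)
print(is_symmetric)   # True

# Diagonal: the score for an exact match depends on the residue type
diagonal = {r: BLOSUM62_PART[r][r] for r in residues}
for r in residues:
    print(r, diagonal[r])
```

The same checks should work unchanged on the full matrix, given the complete set of residue letters.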
We will not go into fine detail about how substitution matrices are calculated until
Chapter 14. In essence the idea is that you first generate good, well-curated alignments of
multiple sequences using as much information as humanly possible from structure and
function etc. and you then count how many times one residue type is substituted for
another within the alignment. Then in various ways these counts are converted into
whole-number scores, relative to some baseline value. If you are really interested, we
recommend the early papers on the PAM[12] and BLOSUM[13] matrices. These two protein matrices are
calculated in slightly different ways, but together they give a good idea of the underlying
principles.
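To give a rough feel for the counting idea, the sketch below converts substitution counts into whole-number scores using a log-odds ratio, with the frequency expected by chance as the baseline. This is only one plausible scheme and is not necessarily the exact procedure of either set of papers. The pair counts are toy numbers invented for the illustration, and the names pair_counts, residue_freq and toy_matrix, along with the scaling factor of 2, are our own assumptions.

```python
from math import log2

# Toy counts of aligned residue pairs (invented for illustration, not real data).
# Each key is an unordered pair of residue types.
pair_counts = {('A', 'A'): 60, ('A', 'G'): 20, ('G', 'G'): 20}

total = sum(pair_counts.values())

# Frequency of each residue type; each pair contributes two residues
residue_freq = {}
for (a, b), count in pair_counts.items():
    residue_freq[a] = residue_freq.get(a, 0.0) + count / (2.0 * total)
    residue_freq[b] = residue_freq.get(b, 0.0) + count / (2.0 * total)

toy_matrix = {}
for (a, b), count in pair_counts.items():
    observed = count / total
    expected = residue_freq[a] * residue_freq[b]   # baseline: pairing by chance
    if a != b:
        expected *= 2.0   # an unordered pair can arise as a:b or b:a

    # Log-odds of observed versus expected, scaled and rounded to a whole number
    score = int(round(2.0 * log2(observed / expected)))
    toy_matrix.setdefault(a, {})[b] = score
    toy_matrix.setdefault(b, {})[a] = score   # keep the matrix symmetric

print(toy_matrix)
```

With these toy counts, pairs seen more often than chance predicts get positive scores and the under-represented A:G pair gets a negative one.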