Calculating substitution matrices
Next we will move on from conservation to a means of measuring the variations, or more
specifically the kinds of residue substitution, present in a multiple-sequence alignment.
We will use the analysis of substitutions to illustrate how an empirically derived
substitution matrix, like the BLOSUM variant we have been using, can be measured. Of
course it may seem that there is a somewhat circular argument in what we will be doing;
an initial substitution matrix is needed to make the alignments from which we calculate a
new substitution matrix. Clearly the composition of the new matrix will be influenced by
the choice of the first. However, this is not all folly. Firstly, there are many ways of
making and improving alignments which don’t rely on a substitution rule. For proteins this
would typically involve using alignments of three-dimensional structures. If homologous
proteins of known structure can be aligned spatially then we have an alternative means to
detect equivalent residue pairs. Secondly, even if we use one matrix to generate the input
alignments the choice of sequences in the alignment has a vast influence on the outcome.
So, for example, we might use a general matrix to perform alignments of a very narrow
class of sequences, but the measured substitutions will reflect the choice of input. In this
way a substitution matrix becomes more specialised, and thus better able to detect
substitutions of the specialist class.
The function to calculate substitution naturally accepts a list of alignments to analyse,
and various other arguments, but it is notable that these include a number that is used to
smooth the data. The idea of data smoothing might seem dubious at first but it has an
important function here. What the function will do in the end is compare the observed
number of substitutions, given the alignments, and the expected number of substitutions
for each possible pair of residue types. If the number of observations is very small then
such a comparison is much less meaningful. The addition or removal of an observation
might cause a large proportional change, but this isn’t significant. Consider a substitution
type where the number of observations is 4, but the expected value is 2. The observation is
double what you expect, but it is likely that an extra couple of observations arose by pure
chance. If the number of observations were 40 and the expected 20, then this is much
more meaningful. Accordingly, the smoothing procedure used to deal with this is only
significant when the number of observations is small (i.e. about the same size as the
smoothing value), and all it does is make the data move towards the expectation, such that
a particular point is not unduly significant. Small numbers of observations are of course
much less problematic if the total amount of alignment data that you can analyse is large,
but you don’t always have the luxury of this.
We will compare observed and expected substitutions by calculating the logarithm of a
probability ratio: a log-odds score. Hence, we import the log function from Python’s
standard mathematical library. Logarithms are convenient for substitution tables so that
addition can be used to get an overall score for alignments etc. If raw probabilities were
used then we would need to multiply values instead, which would often result in floating
point numbers that are far too small for a computer to quickly and accurately deal with.
The function name is then defined with its arguments and two variables are initialised:
empty dictionaries that will hold the statistical data. The alphabet argument is a list of all
possible residue letters, whether or not they are observed in the alignments, and the
maxVal argument is used to scale the substitution matrix at the end.
from math import log
def calcSubstitutionMatrix(alignments, alphabet, maxVal, smooth=5):
matrix = {}
counts = {}
The blank dictionary that will contain the substitution matrix is filled with smaller sub-
dictionaries, which in turn are filled with zero values. The residue letters are the keys in
both dictionaries so two letters are required to extract each value: the substitutability of
one residue for another. The dictionary containing the residue counts is also filled, with an
initial value of zero for each letter.
for letterA in alphabet:
subDict = {}
for letterB in alphabet:
subDict[letterB] = 0
matrix[letterA] = subDict
counts[letterA] = 0
Initially the function will measure the observed number of residues of each kind and the
number of substitutions of one residue type to another. However, in the end we need to
calculate simple probabilities as the fraction of total possible events we observe.
Accordingly, variables for two totals are initialised to zero, one for the total number of pair
substitutions and one for the total number of residue letters across all sequences. Note that
these numbers have floating point representation (0.0 not 0) because we will eventually
use them for division and wish to avoid integer rounding in Python 2.
totalRes = 0.0
totalSub = 0.0
The main part of the function involves looping through the input alignments to collect
the substitution data. It would have been possible to call upon the profile() function that
we have used previously, but on this occasion we would have had to loop through all of
the sequences and positions in any case, to calculate the totals and residue counts. Thus in
this instance it is simpler to do everything within the same set of loops. After the initial
loop we record the number of positions in the particular alignment, and then loop through
these positions, initialising a list which will collect the residue letters observed at this
position, eventually doing yet another loop though the residue letters, seq[i], available
from all the sequences at this location (i). If the letter represents a gap it is skipped,
otherwise it is added to the letters list.
for align in alignments:
numPos = len(align[0])
for i in range(numPos):
letters = []
for seq in align:
letter = seq[i]
if letter == '-':
continue
letters.append(letter)
With the letters collected we go through the list in two loops to get all of the possible
residue pairs. The count for each letter is increased inside the first loop and inside the
second loop we add each pair observation to our matrix. The matrix is thus filled with the
observed substitutions, although since the letter does not change we really mean
Do'stlaringiz bilan baham: |