letters list (so not including gaps), and the substitution total is increased by the square of
Once the residue and substation counts have been collected the next task is to calculate
the various probabilities and the log-odds scores for the final matrix. To calculate the
expected substitution probability we first calculate the average composition of each
residue type by considering the number of counts over all alignments, residue positions
A variable for the maximum score is initialised and then we loop through all residue
For each pair of residue types we calculate the expected substitution probability by
multiplying the average compositions of each type. This is equivalent to saying that the
expectation for a given pair, in the absence of any other information, depends only upon
the probability of observing each residue type independently. With the expectation
calculated we do a check to make sure that it is not a zero value (False in the if statement);
if it is zero then one of the residue types is not observed at all in the data, thus we have
nothing to do and so skip the loop.
expected = averageComp[resA] * averageComp[resB]
if not expected:
continue
The observed substitution count is simply the pair count of this combination, as stored
in matrix. This is then used to calculate a weighting value that is used to smooth the data
in the event of the number of observations being low. Note that if observed is much larger
than the smooth value the weight is close to zero, if they are equal the weight is one half,
and if observed is comparatively small the weight is about one.
observed = matrix[resA][resB]
weight = 1.0 / (1.0+(observed/smooth))
The observed substitution counts for the pair are converted into a probability by
dividing by the total number of substitution pairs. Then the smoothing is applied by
redefining the observed probability as a combination of the original observed probability
and the expected probability. With a low weight (lots of observations) the observation
probability is barely altered, but with a large weight (few observations) the probability is
closer to the expectation.
observed /= totalSub
observed = weight*expected + (1-weight)*observed
Lastly in the loops the log-odds score is calculated as the logarithm of the probability
ratio; we check whether this score is the new maximum score value and then put the score
into the matrix for the current residue letters.
logOdds = log(observed/expected)
if (maxScore is None) or (logOdds>maxScore):
maxScore = logOdds
matrix[resA][resB] = logOdds
Once the loops are done it simply remains to scale the log-odds values in the
substitution matrix and convert to integers. The integer conversion is somewhat
traditional, but does mean that calculations with the substitution matrix are quicker. Note
how the maximum score is converted to a positive value using abs()because the score
logarithm could be negative. The maximum score is used to divide the matrix values to get
a relative figure. However, we also multiply the log-odds score by the maxVal argument
passed in at the start, which means we generate a range of integer values up this figure.
Without maxVal using int() here would be pointless; we would only get zero or one.
maxScore = abs(maxScore)
for resA in alphabet:
for resB in alphabet:
matrix[resA][resB] = int(maxVal*matrix[resA][resB]/maxScore)
return matrix
The function is tested with a list of alignments, a list of residue letters and a scaling,
maximum value for the matrix. Note how we can cheekily get the residue letters from the
keys of an existing substitution matrix.
aminoAcids = BLOSUM62.keys()
print(calcSubstitutionMatrix([align2,], aminoAcids, 10))
Do'stlaringiz bilan baham: