As the first example we define the getConservation function, which expects input
arguments consisting of an alignment and a substitution matrix, so that we can score the
similarity between sequence elements. Note how initially, after defining a conservation list
which we will fill and pass back at the end, we use the profile() function defined in the
previous chapter to convert the alignment into a profile: a list of dictionaries containing
the composition of each position. In order to use the profile-generating function we must
either have defined it in the same Python file or use an import command like the one
below to access it from another location. The module import we illustrate is what you
would need if importing from MultipleAlign.py which has been placed on the current
Python import path (e.g. current working directory). This file is downloadable from the
on-line
archive
which
accompanies
this
book
(see
http://www.cambridge.org/pythonforbiology
).
from MultipleAlign import profile
def getConservation(align, simMatrix):
conservation = []
prof = profile(align)
So armed with a compositional profile generated from the input alignment we next have
to convert the compositional data for each alignment position into a single value
representing the degree of residue conservation, according to some input matrix of
similarity scores. Thus we loop through all of the dictionaries of the profile. Within the
loop, for each composition dictionary we call the .items() function, which is an inbuilt
function or method of standard Python dictionaries. In Python 2 this generates a list of
(key, value) pairs which serves two purposes: firstly we will need both the residue letter
(key) and composition (value) when we look through all the residue combinations, and
secondly having a list allows us to sort the data. In Python 3, the .items() function returns
a view onto the items and a list() function is applied to this to get an actual list.
for compDict in prof:
items = list(compDict.items()) # list() not needed in Python 2
We sort the items list according to the composition value, the second element of each
pair, so that we get a ranking from which we can find the highest-scoring residue code.
Note that if we sorted according to the first element of the pairs we would get the list in
alphabetical order, not the required scoring order. The sort is performed using the
specified key function, which takes as its one argument an element of the list, and returns
the desired comparison value determined by that element. Here that is the second element,
x[1], which is used for the comparison. See
Chapter 5
for a more detailed description of
lambda. If we don’t pass a key function then the list would be sorted according to x itself,
which means the first element, x[0].
items.sort( key=lambda x: x[1] )
Next we define an initial score of zero and then make two loops through the items
containing residue codes and composition values, so that we generate all possible
combinations of two residue types. The score for each combination is simply the
multiplication of the compositional fractions and the similarity score from the substitution
matrix. You can imagine the two composition values defining lengths on the side of a unit
square, and thus their product represents an area of the whole.
score = 0.0
for resA, compA in items:
for resB, compB in items:
score += compA * compB * simMatrix[resA][resB]
With the score total for this position defined we now scale its value so that we can
easily compare different positions. Accordingly, we find the maximum possible value for
the score at this position and divide by that, so that all scores will be between zero and
one. We define the best possible score as being the diagonal element of the substitution
matrix for the residue with the highest composition at this location. Accordingly, we get
the best residue code from the end of our sorted list of items (index -1 to get the highest
value) and use it as both of the keys in the matrix dictionary look-up.
bestLetter = items[-1][0]
maxScore = simMatrix[bestLetter][bestLetter]
The score is scaled by dividing it by the maximum value and added to the conservation
list, which once filled is then passed back at the very end, outside the profile loop.
score /= maxScore
conservation.append(score)
return conservation
We can test the function by defining some alignments and calling the function with a
substitution matrix appropriate to DNA or protein. Note that we can import the previously
defined matrices, which are part of the file Alignments.py.
from Alignments import DNA_1, BLOSUM62
align1 = ['AAGCCGCACACAGACCCTGAG',
'AAGCTGCACGCAGACCCTGAG',
'AGGCTGCACGCAGACCCTGAG',
'AAGCTGCACGTGGACCCTGAG',
'AGGCTGCACGTGGACCCTGAG',
'AGGCTGCACGTGGACCCTGAG',
'AAGCTGCATGTGGACCCTGAG']
print(getConservation(align1, DNA_1))
# [1.0, 0.51, 1.0, 1.0, 0.76, 1.0, …]
align2 = ['QPVHPFSRPAPVVIILIILCVMAGVIGTILLISYGIRLLIK-------------',
'QLVHRFTVPAPVVIILIILCVMAGIIGTILLISYTIRRLIK-------------',
'QLAHHFSEPE---ITLIIFGVMAGVIGTILLISYGIRRLIKKSPSDVKPLPSPD',
'QLVHEFSELV---IALIIFGVMAGVIGTILFISYGSRRLIKKSESDVQPLPPPD',
'MLEHEFSAPV---AILIILGVMAGIIGIILLISYSIGQIIKKRSVDIQPPEDED',
'PIQHDFPALV---MILIILGVMAGIIGTILLISYCISRMTKKSSVDIQSPEGGD',
'QLVHIFSEPV---IIGIIYAVMLGIIITILSIAFCIGQLTKKSSLPAQVASPED',
'-LAHDFSQPV---ITVIILGVMAGIIGIILLLAYVSRRLRKRP-----PADVP-',
'SYHQDFSHAE---ITGIIFAVMAGLLLIIFLIAYLIRRMIKKPLPVPKPQDSPD']
print(getConservation(align2, BLOSUM62))
# [0.3, 0.38, 0.09, 0.8, 0.08, 1.0, 0.64, 0.1, 0.32, 0.27, …]