Conservation display
In the next example we will use the conservation values to make a very simple display,
consisting of symbols that can be placed under the alignment to indicate which positions
are conserved and which vary. Such a display, often involving ‘*’, ‘:’ and ‘.’ characters to
represent perfect, high and moderate conservation respectively, is very commonly seen in
the output of programs that generate multiple alignments. A more sophisticated graphical
display might involve rendering text with different colours. We will not give such an
example here, but the same conservation values could be used.
The function is defined as accepting an alignment, a similarity matrix and a list of
threshold values as input arguments. The list of thresholds will define the values that
separate the conservation into different categories, and hence dictate which symbol is
used. After the initial function definition we define an empty string which will be filled
with symbols to indicate the similarity at each position. We use the previously defined
getConservation() function to extract the conservation values from the input alignment,
and we extract the list of thresholds as three singular values.
def makeSimilarityString(align, simMatrix, thresholds):
simString = ''
conservation = getConservation(align, simMatrix)
t1, t2, t3 = thresholds
for score in conservation:
if score >= t1:
symbol = '*'
elif score >= t2:
symbol = ':'
elif score >= t3:
symbol = '.'
else:
symbol = ' '
simString += symbol
return simString
The remainder of the function simply involves looping through each score and
comparing it with the threshold values, which are assumed to be in descending order. If
the score is at or above the first threshold, the symbol is defined as ‘*’. Otherwise if it’s at
or above the second it is a ‘:’, and at or above the third value a ‘.’. Finally, if the score was
below the smallest threshold the symbol is set to a space, representing little or no
conservation. At the end of each loop the symbol is added to the growing text string,
which is then passed back at the return statement.
We test this function by printing out the sequences in the alignment with the line of
newly generated conservation symbols underneath.
symbols = makeSimilarityString(align2, BLOSUM62, (1.0, 0.5, 0.3))
for seq in align2:
print(seq)
print(symbols)
And the result in this case is:
QPVHPFSRPAPVVIILIILCVMAGVIGTILLISYGIRLLIK-------------
QLVHRFTVPAPVVIILIILCVMAGIIGTILLISYTIRRLIK-------------
QLAHHFSEPE---ITLIIFGVMAGVIGTILLISYGIRRLIKKSPSDVKPLPSPD
QLVHEFSELV---IALIIFGVMAGVIGTILFISYGSRRLIKKSESDVQPLPPPD
MLEHEFSAPV---AILIILGVMAGIIGIILLISYSIGQIIKKRSVDIQPPEDED
PIQHDFPALV---MILIILGVMAGIIGTILLISYCISRMTKKSSVDIQSPEGGD
QLVHIFSEPV---IIGIIYAVMLGIIITILSIAFCIGQLTKKSSLPAQVASPED
-LAHDFSQPV---ITVIILGVMAGIIGIILLLAYVSRRLRKRP-----PADVP-
SYHQDFSHAE---ITGIIFAVMAGLLLIIFLIAYLIRRMIKKPLPVPKPQDSPD
.. : *: . :. **: **:*::..*::::: ...:.*: .
In the above examples we have been thinking of conservation as being an amount of
something. However, we can also say in what way the residues are being conserved,
especially for amino acids that have chemical characteristics which may be preserved,
even if the exact residue types are not. An example would be a position where we have
histidine, lysine and arginine residues (‘H’, ‘K’ and ‘R’), where although the amino acid
type does change, the position is one that can be categorised as having a positive electric
charge, a property of all three types.
Here we illustrate a Python list that defines some amino acid categories, based upon
their properties. Each category is represented by a pair of items in a tuple. The first item is
the name of the category and the second is a string containing all of the residue codes for
that category. Note that we have simple one-letter categories for the individual amino
acids, so that if an alignment only has one residue type in a given position the pure amino
acid is stated, rather than giving a larger, less precise category.
AA_CATEGORIES = [('G','G'), ('A','A'), ('I','I'), ('V','V'),
('L','L'), ('M','M'), ('F','F'), ('Y','Y'),
('W','W'), ('H','H'), ('C','C'), ('P','P'),
('K','K'), ('R','R'), ('D','D'), ('E','E'),
('Q','Q'), ('N','N'), ('S','S'), ('T','T'),
('-','-'),
('acidic', 'DE'),
('hydroxyl', 'ST'),
('aliphatic', 'VIL'),
('basic', 'KHR'),
('tiny', 'GAS'),
('aromatic', 'FYWH'),
('charged', 'KHRDE'),
('small', 'AGSVTDNPC'),
('polar', 'KHRDEQNSTC'),
('hydrophobic','IVLFYWHAGMC'),
('turnlike', 'GASHKRDEQNSTC'),
('undef', 'AGVILMFYWHCPKRDEQNST-')]
The function example that employs amino acid categorisation naturally takes this list as
an argument, as well as an alignment to work on. Inside the function a list is defined that
will be filled with the residue categories, one for each position, then the next command
makes a profile from the alignment using the previously discussed profile() function.
from MultipleAlign import profile
def getAlignProperties(align, categories):
properties = []
prof = profile(align)
The bulk of the function involves looping through the dictionaries containing
composition fractions, defined for the positions of profile, and getting the list of residue
letters. Here the actual composition values are not used, but the dictionary keys give us the
list of residue types found at that location.
for fracDict in prof:
letters = fracDict.keys()
Given the residue letters the task is now to go through each category, which is listed in
order of size from smallest to largest, to find the first, and hence smallest, category that
contains all of the residue letters for this position. Accordingly for each category in turn
we make a loop through each letter and determine if it is within the string of letters for that
category (group). If the letter is not in the category, then the current category cannot
possibly be right, so we use the break keyword to quit the current loop through the letters
immediately, thus going on to test the next category. If there is no residue letter that causes
the break to be triggered, i.e. all letters were present for the group, then the program flow
will reach the else: statement, whereupon the name of the current category is added to the
list to represent the current alignment position. With that done we don’t need to test any
more categories for this position so a different break is issued, this time to quit the loop
through categories, thus going on to the next position. At the very end of the function the
list of properties is returned as you would expect.
for name, group in categories:
for letter in letters:
if letter not in group:
break # quit inner loop
else: # all letters are in group
properties.append(name)
break # quit outer loop
return properties
The function is tested with an alignment and the categories list for the amino acids. In
this instance we print the results by looping through the category names and their index
numbers simultaneously, courtesy of Python’s enumerate() function
catNames = getAlignProperties(align2, AA_CATEGORIES)
for i, category in enumerate(catNames):
print(i, category)
Do'stlaringiz bilan baham: |