Python Programming for Biology: Bioinformatics and Beyond

Download 7,75 Mb.

Pdf ko'rish

bet	201/514
Sana	30.12.2021
Hajmi	7,75 Mb.
	#91066

1 ... 197 198 199 200 201 202 203 204 ... 514

Bog'liq
[Tim J. Stevens, Wayne Boucher] Python Programming

Conservation display

In the next example we will use the conservation values to make a very simple display,

consisting of symbols that can be placed under the alignment to indicate which positions

are conserved and which vary. Such a display, often involving ‘*’, ‘:’ and ‘.’ characters to

represent perfect, high and moderate conservation respectively, is very commonly seen in

the output of programs that generate multiple alignments. A more sophisticated graphical

display might involve rendering text with different colours. We will not give such an

example here, but the same conservation values could be used.

The function is defined as accepting an alignment, a similarity matrix and a list of

threshold values as input arguments. The list of thresholds will define the values that

separate the conservation into different categories, and hence dictate which symbol is

used. After the initial function definition we define an empty string which will be filled

with symbols to indicate the similarity at each position. We use the previously defined

getConservation() function to extract the conservation values from the input alignment,

and we extract the list of thresholds as three singular values.

def makeSimilarityString(align, simMatrix, thresholds):

simString = ''

conservation = getConservation(align, simMatrix)

t1, t2, t3 = thresholds

for score in conservation:

if score >= t1:

symbol = '*'

elif score >= t2:

symbol = ':'

elif score >= t3:

symbol = '.'

else:

symbol = ' '

simString += symbol

return simString

The remainder of the function simply involves looping through each score and

comparing it with the threshold values, which are assumed to be in descending order. If

the score is at or above the first threshold, the symbol is defined as ‘*’. Otherwise if it’s at

or above the second it is a ‘:’, and at or above the third value a ‘.’. Finally, if the score was

below the smallest threshold the symbol is set to a space, representing little or no

conservation. At the end of each loop the symbol is added to the growing text string,

which is then passed back at the return statement.

We test this function by printing out the sequences in the alignment with the line of

newly generated conservation symbols underneath.

symbols = makeSimilarityString(align2, BLOSUM62, (1.0, 0.5, 0.3))

for seq in align2:

print(seq)

print(symbols)

And the result in this case is:

QPVHPFSRPAPVVIILIILCVMAGVIGTILLISYGIRLLIK-------------

QLVHRFTVPAPVVIILIILCVMAGIIGTILLISYTIRRLIK-------------

QLAHHFSEPE---ITLIIFGVMAGVIGTILLISYGIRRLIKKSPSDVKPLPSPD

QLVHEFSELV---IALIIFGVMAGVIGTILFISYGSRRLIKKSESDVQPLPPPD

MLEHEFSAPV---AILIILGVMAGIIGIILLISYSIGQIIKKRSVDIQPPEDED

PIQHDFPALV---MILIILGVMAGIIGTILLISYCISRMTKKSSVDIQSPEGGD

QLVHIFSEPV---IIGIIYAVMLGIIITILSIAFCIGQLTKKSSLPAQVASPED

-LAHDFSQPV---ITVIILGVMAGIIGIILLLAYVSRRLRKRP-----PADVP-

SYHQDFSHAE---ITGIIFAVMAGLLLIIFLIAYLIRRMIKKPLPVPKPQDSPD

.. : *: . :. **: **:*::..*::::: ...:.*: .

In the above examples we have been thinking of conservation as being an amount of

something. However, we can also say in what way the residues are being conserved,

especially for amino acids that have chemical characteristics which may be preserved,

even if the exact residue types are not. An example would be a position where we have

histidine, lysine and arginine residues (‘H’, ‘K’ and ‘R’), where although the amino acid

type does change, the position is one that can be categorised as having a positive electric

charge, a property of all three types.

Here we illustrate a Python list that defines some amino acid categories, based upon

their properties. Each category is represented by a pair of items in a tuple. The first item is

the name of the category and the second is a string containing all of the residue codes for

that category. Note that we have simple one-letter categories for the individual amino

acids, so that if an alignment only has one residue type in a given position the pure amino

acid is stated, rather than giving a larger, less precise category.

AA_CATEGORIES = [('G','G'), ('A','A'), ('I','I'), ('V','V'),

('L','L'), ('M','M'), ('F','F'), ('Y','Y'),

('W','W'), ('H','H'), ('C','C'), ('P','P'),

('K','K'), ('R','R'), ('D','D'), ('E','E'),

('Q','Q'), ('N','N'), ('S','S'), ('T','T'),

('-','-'),

('acidic', 'DE'),

('hydroxyl', 'ST'),

('aliphatic', 'VIL'),

('basic', 'KHR'),

('tiny', 'GAS'),

('aromatic', 'FYWH'),

('charged', 'KHRDE'),

('small', 'AGSVTDNPC'),

('polar', 'KHRDEQNSTC'),

('hydrophobic','IVLFYWHAGMC'),

('turnlike', 'GASHKRDEQNSTC'),

('undef', 'AGVILMFYWHCPKRDEQNST-')]

The function example that employs amino acid categorisation naturally takes this list as

an argument, as well as an alignment to work on. Inside the function a list is defined that

will be filled with the residue categories, one for each position, then the next command

makes a profile from the alignment using the previously discussed profile() function.

from MultipleAlign import profile

def getAlignProperties(align, categories):

properties = []

prof = profile(align)

The bulk of the function involves looping through the dictionaries containing

composition fractions, defined for the positions of profile, and getting the list of residue

letters. Here the actual composition values are not used, but the dictionary keys give us the

list of residue types found at that location.

for fracDict in prof:

letters = fracDict.keys()

Given the residue letters the task is now to go through each category, which is listed in

order of size from smallest to largest, to find the first, and hence smallest, category that

contains all of the residue letters for this position. Accordingly for each category in turn

we make a loop through each letter and determine if it is within the string of letters for that

category (group). If the letter is not in the category, then the current category cannot

possibly be right, so we use the break keyword to quit the current loop through the letters

immediately, thus going on to test the next category. If there is no residue letter that causes

the break to be triggered, i.e. all letters were present for the group, then the program flow

will reach the else: statement, whereupon the name of the current category is added to the

list to represent the current alignment position. With that done we don’t need to test any

more categories for this position so a different break is issued, this time to quit the loop

through categories, thus going on to the next position. At the very end of the function the

list of properties is returned as you would expect.

for name, group in categories:

for letter in letters:

if letter not in group:

break # quit inner loop

else: # all letters are in group

properties.append(name)

break # quit outer loop

return properties

The function is tested with an alignment and the categories list for the amino acids. In

this instance we print the results by looping through the category names and their index

numbers simultaneously, courtesy of Python’s enumerate() function

catNames = getAlignProperties(align2, AA_CATEGORIES)

for i, category in enumerate(catNames):

print(i, category)

Download 7,75 Mb.

Do'stlaringiz bilan baham:

1 ... 197 198 199 200 201 202 203 204 ... 514