Python Programming for Biology: Bioinformatics and Beyond

Download 7,75 Mb.

Pdf ko'rish

bet	200/514
Sana	30.12.2021
Hajmi	7,75 Mb.
	#91066

1 ... 196 197 198 199 200 201 202 203 ... 514

Bog'liq
[Tim J. Stevens, Wayne Boucher] Python Programming

Measuring conservation

As the first example we define the getConservation function, which expects input

arguments consisting of an alignment and a substitution matrix, so that we can score the

similarity between sequence elements. Note how initially, after defining a conservation list

which we will fill and pass back at the end, we use the profile() function defined in the

previous chapter to convert the alignment into a profile: a list of dictionaries containing

the composition of each position. In order to use the profile-generating function we must

either have defined it in the same Python file or use an import command like the one

below to access it from another location. The module import we illustrate is what you

would need if importing from MultipleAlign.py which has been placed on the current

Python import path (e.g. current working directory). This file is downloadable from the

on-line

archive

which

accompanies

this

book

(see

http://www.cambridge.org/pythonforbiology

from MultipleAlign import profile

def getConservation(align, simMatrix):

conservation = []

prof = profile(align)

So armed with a compositional profile generated from the input alignment we next have

to convert the compositional data for each alignment position into a single value

representing the degree of residue conservation, according to some input matrix of

similarity scores. Thus we loop through all of the dictionaries of the profile. Within the

loop, for each composition dictionary we call the .items() function, which is an inbuilt

function or method of standard Python dictionaries. In Python 2 this generates a list of

(key, value) pairs which serves two purposes: firstly we will need both the residue letter

(key) and composition (value) when we look through all the residue combinations, and

secondly having a list allows us to sort the data. In Python 3, the .items() function returns

a view onto the items and a list() function is applied to this to get an actual list.

for compDict in prof:

items = list(compDict.items()) # list() not needed in Python 2

We sort the items list according to the composition value, the second element of each

pair, so that we get a ranking from which we can find the highest-scoring residue code.

Note that if we sorted according to the first element of the pairs we would get the list in

alphabetical order, not the required scoring order. The sort is performed using the

specified key function, which takes as its one argument an element of the list, and returns

the desired comparison value determined by that element. Here that is the second element,

x[1], which is used for the comparison. See

Chapter 5

for a more detailed description of

lambda. If we don’t pass a key function then the list would be sorted according to x itself,

which means the first element, x[0].

items.sort( key=lambda x: x[1] )

Next we define an initial score of zero and then make two loops through the items

containing residue codes and composition values, so that we generate all possible

combinations of two residue types. The score for each combination is simply the

multiplication of the compositional fractions and the similarity score from the substitution

matrix. You can imagine the two composition values defining lengths on the side of a unit

square, and thus their product represents an area of the whole.

score = 0.0

for resA, compA in items:

for resB, compB in items:

score += compA * compB * simMatrix[resA][resB]

With the score total for this position defined we now scale its value so that we can

easily compare different positions. Accordingly, we find the maximum possible value for

the score at this position and divide by that, so that all scores will be between zero and

one. We define the best possible score as being the diagonal element of the substitution

matrix for the residue with the highest composition at this location. Accordingly, we get

the best residue code from the end of our sorted list of items (index -1 to get the highest

value) and use it as both of the keys in the matrix dictionary look-up.

bestLetter = items[-1][0]

maxScore = simMatrix[bestLetter][bestLetter]

The score is scaled by dividing it by the maximum value and added to the conservation

list, which once filled is then passed back at the very end, outside the profile loop.

score /= maxScore

conservation.append(score)

return conservation

We can test the function by defining some alignments and calling the function with a

substitution matrix appropriate to DNA or protein. Note that we can import the previously

defined matrices, which are part of the file Alignments.py.

from Alignments import DNA_1, BLOSUM62

align1 = ['AAGCCGCACACAGACCCTGAG',

'AAGCTGCACGCAGACCCTGAG',

'AGGCTGCACGCAGACCCTGAG',

'AAGCTGCACGTGGACCCTGAG',

'AGGCTGCACGTGGACCCTGAG',

'AAGCTGCATGTGGACCCTGAG']

print(getConservation(align1, DNA_1))

# [1.0, 0.51, 1.0, 1.0, 0.76, 1.0, …]

align2 = ['QPVHPFSRPAPVVIILIILCVMAGVIGTILLISYGIRLLIK-------------',

'QLVHRFTVPAPVVIILIILCVMAGIIGTILLISYTIRRLIK-------------',

'QLAHHFSEPE---ITLIIFGVMAGVIGTILLISYGIRRLIKKSPSDVKPLPSPD',

'QLVHEFSELV---IALIIFGVMAGVIGTILFISYGSRRLIKKSESDVQPLPPPD',

'MLEHEFSAPV---AILIILGVMAGIIGIILLISYSIGQIIKKRSVDIQPPEDED',

'PIQHDFPALV---MILIILGVMAGIIGTILLISYCISRMTKKSSVDIQSPEGGD',

'QLVHIFSEPV---IIGIIYAVMLGIIITILSIAFCIGQLTKKSSLPAQVASPED',

'-LAHDFSQPV---ITVIILGVMAGIIGIILLLAYVSRRLRKRP-----PADVP-',

'SYHQDFSHAE---ITGIIFAVMAGLLLIIFLIAYLIRRMIKKPLPVPKPQDSPD']

print(getConservation(align2, BLOSUM62))

# [0.3, 0.38, 0.09, 0.8, 0.08, 1.0, 0.64, 0.1, 0.32, 0.27, …]

Download 7,75 Mb.

Do'stlaringiz bilan baham:

1 ... 196 197 198 199 200 201 202 203 ... 514