Python Programming for Biology: Bioinformatics and Beyond

Figure 14.5. The basic construction of a phylogenetic tree using neighbour-joining

Download 7,75 Mb.

Pdf ko'rish

bet	206/514
Sana	30.12.2021
Hajmi	7,75 Mb.
	#91066

1 ... 202 203 204 205 206 207 208 209 ... 514

Bog'liq
[Tim J. Stevens, Wayne Boucher] Python Programming

Figure 14.5. The basic construction of a phylogenetic tree using neighbour-joining.

Although many more rigorous methods exist, phylogenetic trees can be constructed from

sequences using the relatively simple neighbour-joining method. This involves creating

pairwise alignments between all sequences to determine their similarity, and thus also

rough evolutionary distance from one another. The resulting distance matrix is then used

to construct the tree. This is done in a cycle where the most similar pair each time is

combined to make a branch point and reduce the size of the matrix, until all sequences are

joined.

Before we get to the main tree-generating Python function we will construct some

ancillary functions that will be used inside the main routine. This will help understanding

of the process by breaking it into operations that can be thought about separately. Also, the

function that makes the tree will be neater and easier to understand. The downside to this

way of doing things is the relative speed penalty incurred with calling Python functions.

Thus if speed was an issue you could forego the separate functions and embed their code

directly in the main section. If even more speed is required we would recommend a

Python/C hybrid language approach, for example, using Cython, as discussed in

Chapter

To start with we make a simple distance matrix generation function that accepts a list of

sequences and a substitution matrix as input arguments. It defines an initially blank

matrix

7

of zeros with a row and column for each sequence (n sequences), such that the

matrix elements will eventually be filled with the distance between all possible sequence

pairs. Then we calculate a list of maximum possible scores for sequence comparisons, by

comparing each sequence with itself using the calcSeqSimilarity() function defined in

Chapter 12

. These values will give a baseline for the other scores. The row and column

indices in the matrix are then looped through (i and j) and for each index the

corresponding sequence is obtained. Note how two for loops are used but we don’t go

from index 0 to n for these; instead the first loop goes up to n-1, and the second starts from

i+1. This means we only consider one half of the matrix, excluding the diagonal. We can

reduce the number of loops this way, and thus save some time, because the matrix is

symmetric, i.e. matrix[i][j] is the same as matrix[j][i].

from Alignments import sequenceAlign, calcSeqSimilarity

from math import exp

def getDistanceMatrix(seqs, simMatrix):

n = len(seqs)

matrix = [[0.0] * n for x in range(n)]

maxScores = [calcSeqSimilarity(x, x, simMatrix) for x in seqs]

for i in range(n-1):

seqA = seqs[i]

for j in range(i+1,n):

seqB = seqs[j]

score, alignA, alignB = sequenceAlign(seqA, seqB, simMatrix)

maxScore = max(maxScores[i],maxScores[j])

dist = maxScore - score

matrix[i][j] = dist

matrix[j][i] = dist

return matrix

The sequences for each matrix element are aligned using the input substitution matrix.

The alignment thus generated is passed back with a score, which is then converted into a

distance. The maximum possible score is calculated as the maximum from amongst the

two maxScores list values for the sequences compared in this iteration. The distance is

then calculated as the difference between this maximum possible score and the observed

alignment score. Accordingly, if the sequences are identical the alignment score value is

equal to the maximum and the distance is zero, and the more dissimilar the sequences, the

larger the distance between them. If we were to use the simple DNA substitution matrix

DNA_1, the distance would represent the number of non-identical positions. The distance

value is then inserted into the matrix at the given row and column.

Here we test the function using previously defined lists of sequences (without gaps) and

substitution matrices. The output is looped through row by row and the values in each row

are displayed in a formatted manner (to two decimal places using ‘%.2f’) to keep things

neat on screen.

seqs = ['QPVHPFSRPAPVVIILIILCVMAGVIGTILLISYGIRLLIK',

'QLVHRFTVPAPVVIILIILCVMAGIIGTILLISYTIRRLIK',

'QLAHHFSEPEITLIIFGVMAGVIGTILLISYGIRRLIKKSPSDVKPLPSPD',

'QLVHEFSELVIALIIFGVMAGVIGTILFISYGSRRLIKKSESDVQPLPPPD',

'MLEHEFSAPVAILIILGVMAGIIGIILLISYSIGQIIKKRSVDIQPPEDED',

'PIQHDFPALVMILIILGVMAGIIGTILLISYCISRMTKKSSVDIQSPEGGD',

'QLVHIFSEPVIIGIIYAVMLGIIITILSIAFCIGQLTKKSSLPAQVASPED',

'LAHDFSQPVITVIILGVMAGIIGIILLLAYVSRRLRKRPPADVP',

'SYHQDFSHAEITGIIFAVMAGLLLIIFLIAYLIRRMIKKPLPVPKPQDSPD']

distMatrix = getDistanceMatrix(seqs, BLOSUM62)

distMatrix = getDistanceMatrix(align1, DNA_1)

for row in distMatrix:

print(['%.2f' % x for x in row])

# ['0.00', '2.00', '3.00', '4.00', '5.00', '5.00', '5.00']

# …

Next we define another function to find out which element of a distance matrix

represents the closest sequences, and thus which sequences should be joined next when

making a tree. Note that this is not as simple as finding the smallest distance, rather we

calculate the value q (which is often described in the form of the Q-matrix) as the distance

scaled by the number of branch points for the tree (n-2) minus the sum of the row and

column at that point. For each row and column (again only needing to consider half of the

matrix) we determine whether the q value is smaller than the best so far. If so (and for the

initial loop, where minQ is None) then we redefine the best q, and record the row and

column for this element as joinPair.

def getJoinPair(distMatrix):

n = len(distMatrix)

minQ = None

joinPair = None

for i in range(n-1):

sumRow = sum(distMatrix[i])

for j in range(i+1, n):

sumCol = sum(distMatrix[j])

dist = distMatrix[i][j]

q = (n-2)*dist - sumRow - sumCol

if (minQ is None) or (q < minQ):

minQ = q

joinPair = [i,j]

return joinPair

At the end of the function the indices for the closest pair of sequences are returned.

Because of the for loops these indices are in a specific order, which helps later when the

tree is built. We can test the function using a previously defined distance matrix:

print(getJoinPair(distMatrix))

The final function we require before we get to the main tree construction is one to

determine the distance from a point on the tree (sequence or branch point) to the branch

point that is created when joining up with another part of the tree. This function is required

because the neighbour-joining algorithm works using a distance matrix, and each time we

join sequences we need to know the distances to this new join point (node). Effectively the

joined parts of the tree are replaced by a new single node; i.e. two sequences are replaced

with an ancestor. When the distance matrix is recalculated we need to know how far each

of the joined sequences is from the new point.

The function takes a distance matrix and two indices, i and j. The first index is the

row/column for the sequence we want to find the distance from, and the second index is

the sequence we are joining with. The row and column for the two indices are extracted

directly, remembering that the distance matrix is symmetric, so the j column is the same as

the j row.

def getDistToJunction(distMatrix, i, j):

n = len(distMatrix)

row = distMatrix[i]

column = distMatrix[j]

dist = distMatrix[i][j] + (sum(row)-sum(column))/(n-2)

dist *= 0.5

return dist

The distance from sequence index i to the new node is the average distance (hence

multiplying by 0.5) between the two joined sequences (distMatrix[i][j]) and the distance to

the whole tree, i.e. for n-2 nodes. The distance to the rest of the tree is calculated by taking

the total (whole tree) distance for one sequence away from the total distance for the other

and then dividing by the number of nodes to get an average. If the total tree were not

considered the branch point would be exactly halfway between the two sequences.

However, with the rest of the tree potentially being closer to one of the sequences than the

other the branch point is pulled to one side.

Finally we get to the function that actually builds the tree. This takes an abstract

distance matrix as input, so that it can be used in many different situations (not just for

sequences), but which can be generated here using getDistanceMatrix() using a list of

sequences and a residue similarity matrix as arguments. As mentioned previously this

could be modified to take multiple sequences for each taxon, rather than just one. Initially

in the function a list joinOrder is initialised, which will record the order in which

sequences are joined to build the tree, and we use the function defined above to make the

initial matrix of distances between sequences. The tree variable is initialised as a list

containing index numbers, one for each sequence in the distance matrix. An alternative to

using simple numbers here would be to use names for the sequences if we had them. What

is actually used for the tree construction is not important; it just needs to be a tag to

identify each taxon. As the main part of the function proceeds the tree variable will be

modified, and each time a new branch is formed the tags for the two joined sections will

be combined (into a tuple). In the end the nesting of pair sequence tags in this list will

represent the structure of the tree.

def neighbourJoinTree(distMatrix):

joinOrder = []

n = len(distMatrix)

tree = list(range(n)) # do not need list() in Python 2

Next we perform an iterative loop using the while statement. The loop continues while

the length of the distance matrix is greater than two, i.e. while we still have to work out

which possible bits of tree to combine. Each time we go through the loop a new branch

point is defined and the length of the distance matrix will decrease by one (two sequences

disappear and one new branch point appears). When there are only two tree parts to join

there are no more choices to be made; the last remaining join is made and the tree is

complete. Inside the loop we use the getJoinPair() function defined above to determine

which indices, stored as x and y, represent the closest pair, and thus should be joined

during this round.

while n > 2:

x, y = getJoinPair(distMatrix)

Just as the distance matrix gets smaller each time we join sequences so too does the tree

list. When we combine branches we remove the representation of those branches from the

list but add a new value for the larger, joined branch. Accordingly, the rows and columns

of the distance matrix have a direct correspondence to the positions in tree. The variable

node is defined as a tuple containing the two parts of the tree that are to be combined.

These two parts correspond to the positions x and y (in the distance matrix and tree list)

and can represent single, unjoined sequence tags or tags that are already in a branching

structure. The new branch node is added to the joinOrder list and the tree list. To see the

tree grow you could issue ‘print(tree)’ during the iteration.

node = (tree[x], tree[y])

joinOrder.append(node)

tree.append(node)

Then we delete the x and y elements from the tree list because these parts have now

been combined, and added as the new node to the end. Note that the y element is deleted

before the x because we want to delete the largest list index first. Deleting the smaller

index x first would shuffle the remainder of the list containing the y position along by one,

complicating matters.

del tree[y]

del tree[x]

Next we have to adjust the distance matrix in light of the newly joined positions. First

we calculate the distance from the joined x and y positions to the new branch point using

the function getDistToJunction() explained in detail above.

distX = getDistToJunction(distMatrix, x, y)

distY = getDistToJunction(distMatrix, y, x)

A new row, containing zeros, is added to the distance matrix. This will later be filled

with distances to the new branch point. Note that this new row has one more element than

the existing rows because it won’t be extended like the others in the following loop.

distMatrix.append([0] * (n+1))

Then we loop through all positions in the distance matrix, except the newly added one,

and select those that have not been joined this round (not x or y). The distance from each

of these positions to the new node is then defined as the average of the distances to the x,

y points minus the respective distances of x and y to their joining node.

for i in range(n):

if i not in (x,y):

dist = (distMatrix[x][i]-distX) + (distMatrix[y][i]-distY)

dist *= 0.5

This distance is added to the end of the distance matrix row. Also, the new row (n) at

the bottom has the appropriate element filled in. Thus we have filled in the new distance

values for the ends of row and column i.

distMatrix[i].append(dist)

distMatrix[n][i] = dist

Once the loop through matrix positions is complete we will have a new row and column

in the distance matrix for the newly joined branch point. So next we delete the old x and y

positions from the matrix. Firstly we delete two rows, and then for the remaining rows we

loop through and delete the appropriate two columns. At the end of the while loop we

decrease n by one, because after the merge the distance matrix is now one row/column

smaller.

del distMatrix[y]

del distMatrix[x]

for row in distMatrix:

del row[y]

del row[x]

n -= 1

At the end of the function we convert the tree list to a tuple, effectively saying that the

two remaining parts are joined (remember tuples are immutable). We add the whole tree to

the list of join events. Finally the function passes back the tree and branch combination

order, each of which is useful in different ways.

tree = tuple(tree)

joinOrder.append(tree)

return tree, joinOrder

For the above function the output tree data just uses sequence identifiers, it doesn’t

contain any distance information. However, it is trivial to make modifications that will

allow the function to also pass back distX and distY, the distance to the branch points,

along with the identifiers, e.g. node = ((tree[x], distX), (tree[y], distY)). Finally we can

test the tree generation, first generating the distance matrix with a list of protein sequences

and the BLOSUM residue substitution scores:

distMatrix = getDistanceMatrix(seqs, BLOSUM62)

tree, treeJoinOrder = neighbourJoinTree(distMatrix)

print(tree) # Result : (((7, (0, 1)), (4, 5)), ((2, 3), (6, 8)))

Download 7,75 Mb.

Do'stlaringiz bilan baham:

1 ... 202 203 204 205 206 207 208 209 ... 514