Figure 14.5. The basic construction of a phylogenetic tree using neighbour-joining.
Although many more rigorous methods exist, phylogenetic trees can be constructed from
sequences using the relatively simple neighbour-joining method. This involves creating
pairwise alignments between all sequences to determine their similarity, and thus also
rough evolutionary distance from one another. The resulting distance matrix is then used
to construct the tree. This is done in a cycle where the most similar pair each time is
combined to make a branch point and reduce the size of the matrix, until all sequences are
joined.
Before we get to the main tree-generating Python function we will construct some
ancillary functions that will be used inside the main routine. This will help understanding
of the process by breaking it into operations that can be thought about separately. Also, the
function that makes the tree will be neater and easier to understand. The downside to this
way of doing things is the relative speed penalty incurred with calling Python functions.
Thus if speed was an issue you could forego the separate functions and embed their code
directly in the main section. If even more speed is required we would recommend a
Python/C hybrid language approach, for example, using Cython, as discussed in
Chapter
27
.
To start with we make a simple distance matrix generation function that accepts a list of
sequences and a substitution matrix as input arguments. It defines an initially blank
matrix
7
of zeros with a row and column for each sequence (n sequences), such that the
matrix elements will eventually be filled with the distance between all possible sequence
pairs. Then we calculate a list of maximum possible scores for sequence comparisons, by
comparing each sequence with itself using the calcSeqSimilarity() function defined in
Chapter 12
. These values will give a baseline for the other scores. The row and column
indices in the matrix are then looped through (i and j) and for each index the
corresponding sequence is obtained. Note how two for loops are used but we don’t go
from index 0 to n for these; instead the first loop goes up to n-1, and the second starts from
i+1. This means we only consider one half of the matrix, excluding the diagonal. We can
reduce the number of loops this way, and thus save some time, because the matrix is
symmetric, i.e. matrix[i][j] is the same as matrix[j][i].
from Alignments import sequenceAlign, calcSeqSimilarity
from math import exp
def getDistanceMatrix(seqs, simMatrix):
n = len(seqs)
matrix = [[0.0] * n for x in range(n)]
maxScores = [calcSeqSimilarity(x, x, simMatrix) for x in seqs]
for i in range(n-1):
seqA = seqs[i]
for j in range(i+1,n):
seqB = seqs[j]
score, alignA, alignB = sequenceAlign(seqA, seqB, simMatrix)
maxScore = max(maxScores[i],maxScores[j])
dist = maxScore - score
matrix[i][j] = dist
matrix[j][i] = dist
return matrix
The sequences for each matrix element are aligned using the input substitution matrix.
The alignment thus generated is passed back with a score, which is then converted into a
distance. The maximum possible score is calculated as the maximum from amongst the
two maxScores list values for the sequences compared in this iteration. The distance is
then calculated as the difference between this maximum possible score and the observed
alignment score. Accordingly, if the sequences are identical the alignment score value is
equal to the maximum and the distance is zero, and the more dissimilar the sequences, the
larger the distance between them. If we were to use the simple DNA substitution matrix
DNA_1, the distance would represent the number of non-identical positions. The distance
value is then inserted into the matrix at the given row and column.
Here we test the function using previously defined lists of sequences (without gaps) and
substitution matrices. The output is looped through row by row and the values in each row
are displayed in a formatted manner (to two decimal places using ‘%.2f’) to keep things
neat on screen.
seqs = ['QPVHPFSRPAPVVIILIILCVMAGVIGTILLISYGIRLLIK',
'QLVHRFTVPAPVVIILIILCVMAGIIGTILLISYTIRRLIK',
'QLAHHFSEPEITLIIFGVMAGVIGTILLISYGIRRLIKKSPSDVKPLPSPD',
'QLVHEFSELVIALIIFGVMAGVIGTILFISYGSRRLIKKSESDVQPLPPPD',
'MLEHEFSAPVAILIILGVMAGIIGIILLISYSIGQIIKKRSVDIQPPEDED',
'PIQHDFPALVMILIILGVMAGIIGTILLISYCISRMTKKSSVDIQSPEGGD',
'QLVHIFSEPVIIGIIYAVMLGIIITILSIAFCIGQLTKKSSLPAQVASPED',
'LAHDFSQPVITVIILGVMAGIIGIILLLAYVSRRLRKRPPADVP',
'SYHQDFSHAEITGIIFAVMAGLLLIIFLIAYLIRRMIKKPLPVPKPQDSPD']
distMatrix = getDistanceMatrix(seqs, BLOSUM62)
distMatrix = getDistanceMatrix(align1, DNA_1)
for row in distMatrix:
print(['%.2f' % x for x in row])
# ['0.00', '2.00', '3.00', '4.00', '5.00', '5.00', '5.00']
# …
Next we define another function to find out which element of a distance matrix
represents the closest sequences, and thus which sequences should be joined next when
making a tree. Note that this is not as simple as finding the smallest distance, rather we
calculate the value q (which is often described in the form of the Q-matrix) as the distance
scaled by the number of branch points for the tree (n-2) minus the sum of the row and
column at that point. For each row and column (again only needing to consider half of the
matrix) we determine whether the q value is smaller than the best so far. If so (and for the
initial loop, where minQ is None) then we redefine the best q, and record the row and
column for this element as joinPair.
def getJoinPair(distMatrix):
n = len(distMatrix)
minQ = None
joinPair = None
for i in range(n-1):
sumRow = sum(distMatrix[i])
for j in range(i+1, n):
sumCol = sum(distMatrix[j])
dist = distMatrix[i][j]
q = (n-2)*dist - sumRow - sumCol
if (minQ is None) or (q < minQ):
minQ = q
joinPair = [i,j]
return joinPair
At the end of the function the indices for the closest pair of sequences are returned.
Because of the for loops these indices are in a specific order, which helps later when the
tree is built. We can test the function using a previously defined distance matrix:
print(getJoinPair(distMatrix))
The final function we require before we get to the main tree construction is one to
determine the distance from a point on the tree (sequence or branch point) to the branch
point that is created when joining up with another part of the tree. This function is required
because the neighbour-joining algorithm works using a distance matrix, and each time we
join sequences we need to know the distances to this new join point (node). Effectively the
joined parts of the tree are replaced by a new single node; i.e. two sequences are replaced
with an ancestor. When the distance matrix is recalculated we need to know how far each
of the joined sequences is from the new point.
The function takes a distance matrix and two indices, i and j. The first index is the
row/column for the sequence we want to find the distance from, and the second index is
the sequence we are joining with. The row and column for the two indices are extracted
directly, remembering that the distance matrix is symmetric, so the j column is the same as
the j row.
def getDistToJunction(distMatrix, i, j):
n = len(distMatrix)
row = distMatrix[i]
column = distMatrix[j]
dist = distMatrix[i][j] + (sum(row)-sum(column))/(n-2)
dist *= 0.5
return dist
The distance from sequence index i to the new node is the average distance (hence
multiplying by 0.5) between the two joined sequences (distMatrix[i][j]) and the distance to
the whole tree, i.e. for n-2 nodes. The distance to the rest of the tree is calculated by taking
the total (whole tree) distance for one sequence away from the total distance for the other
and then dividing by the number of nodes to get an average. If the total tree were not
considered the branch point would be exactly halfway between the two sequences.
However, with the rest of the tree potentially being closer to one of the sequences than the
other the branch point is pulled to one side.
Finally we get to the function that actually builds the tree. This takes an abstract
distance matrix as input, so that it can be used in many different situations (not just for
sequences), but which can be generated here using getDistanceMatrix() using a list of
sequences and a residue similarity matrix as arguments. As mentioned previously this
could be modified to take multiple sequences for each taxon, rather than just one. Initially
in the function a list joinOrder is initialised, which will record the order in which
sequences are joined to build the tree, and we use the function defined above to make the
initial matrix of distances between sequences. The tree variable is initialised as a list
containing index numbers, one for each sequence in the distance matrix. An alternative to
using simple numbers here would be to use names for the sequences if we had them. What
is actually used for the tree construction is not important; it just needs to be a tag to
identify each taxon. As the main part of the function proceeds the tree variable will be
modified, and each time a new branch is formed the tags for the two joined sections will
be combined (into a tuple). In the end the nesting of pair sequence tags in this list will
represent the structure of the tree.
def neighbourJoinTree(distMatrix):
joinOrder = []
n = len(distMatrix)
tree = list(range(n)) # do not need list() in Python 2
Next we perform an iterative loop using the while statement. The loop continues while
the length of the distance matrix is greater than two, i.e. while we still have to work out
which possible bits of tree to combine. Each time we go through the loop a new branch
point is defined and the length of the distance matrix will decrease by one (two sequences
disappear and one new branch point appears). When there are only two tree parts to join
there are no more choices to be made; the last remaining join is made and the tree is
complete. Inside the loop we use the getJoinPair() function defined above to determine
which indices, stored as x and y, represent the closest pair, and thus should be joined
during this round.
while n > 2:
x, y = getJoinPair(distMatrix)
Just as the distance matrix gets smaller each time we join sequences so too does the tree
list. When we combine branches we remove the representation of those branches from the
list but add a new value for the larger, joined branch. Accordingly, the rows and columns
of the distance matrix have a direct correspondence to the positions in tree. The variable
node is defined as a tuple containing the two parts of the tree that are to be combined.
These two parts correspond to the positions x and y (in the distance matrix and tree list)
and can represent single, unjoined sequence tags or tags that are already in a branching
structure. The new branch node is added to the joinOrder list and the tree list. To see the
tree grow you could issue ‘print(tree)’ during the iteration.
node = (tree[x], tree[y])
joinOrder.append(node)
tree.append(node)
Then we delete the x and y elements from the tree list because these parts have now
been combined, and added as the new node to the end. Note that the y element is deleted
before the x because we want to delete the largest list index first. Deleting the smaller
index x first would shuffle the remainder of the list containing the y position along by one,
complicating matters.
del tree[y]
del tree[x]
Next we have to adjust the distance matrix in light of the newly joined positions. First
we calculate the distance from the joined x and y positions to the new branch point using
the function getDistToJunction() explained in detail above.
distX = getDistToJunction(distMatrix, x, y)
distY = getDistToJunction(distMatrix, y, x)
A new row, containing zeros, is added to the distance matrix. This will later be filled
with distances to the new branch point. Note that this new row has one more element than
the existing rows because it won’t be extended like the others in the following loop.
distMatrix.append([0] * (n+1))
Then we loop through all positions in the distance matrix, except the newly added one,
and select those that have not been joined this round (not x or y). The distance from each
of these positions to the new node is then defined as the average of the distances to the x,
y points minus the respective distances of x and y to their joining node.
for i in range(n):
if i not in (x,y):
dist = (distMatrix[x][i]-distX) + (distMatrix[y][i]-distY)
dist *= 0.5
This distance is added to the end of the distance matrix row. Also, the new row (n) at
the bottom has the appropriate element filled in. Thus we have filled in the new distance
values for the ends of row and column i.
distMatrix[i].append(dist)
distMatrix[n][i] = dist
Once the loop through matrix positions is complete we will have a new row and column
in the distance matrix for the newly joined branch point. So next we delete the old x and y
positions from the matrix. Firstly we delete two rows, and then for the remaining rows we
loop through and delete the appropriate two columns. At the end of the while loop we
decrease n by one, because after the merge the distance matrix is now one row/column
smaller.
del distMatrix[y]
del distMatrix[x]
for row in distMatrix:
del row[y]
del row[x]
n -= 1
At the end of the function we convert the tree list to a tuple, effectively saying that the
two remaining parts are joined (remember tuples are immutable). We add the whole tree to
the list of join events. Finally the function passes back the tree and branch combination
order, each of which is useful in different ways.
tree = tuple(tree)
joinOrder.append(tree)
return tree, joinOrder
For the above function the output tree data just uses sequence identifiers, it doesn’t
contain any distance information. However, it is trivial to make modifications that will
allow the function to also pass back distX and distY, the distance to the branch points,
along with the identifiers, e.g. node = ((tree[x], distX), (tree[y], distY)). Finally we can
test the tree generation, first generating the distance matrix with a list of protein sequences
and the BLOSUM residue substitution scores:
distMatrix = getDistanceMatrix(seqs, BLOSUM62)
tree, treeJoinOrder = neighbourJoinTree(distMatrix)
print(tree) # Result : (((7, (0, 1)), (4, 5)), ((2, 3), (6, 8)))
Do'stlaringiz bilan baham: |