Phylogenetic trees using neighbour-joining
Tree-building can be a very difficult task. This is basically because the number of branch
combinations increases very rapidly with the number of input taxa. The number of
combinations can be so vast that we cannot routinely test all combinations to give the
optimum tree arrangement. Hence what is often done, and which we will do here, is to
generate a quite good, but not optimal, tree using a computationally fast method. The
method used here is the one introduced by Saitou and Nei; the neighbour-joining method.
5
This is a quick way of generating a tree that is usually not so far from a global optimum
and which is also relatively easy to understand mechanistically. The generated tree will
not necessarily be at the global optimum because the algorithm is ‘greedy’ in that it only
optimises the local solution for pairing up sequences, i.e. it never considers the tree as a
whole. Nonetheless it is still a speedy and useful example to be used in subsequent
analyses, and it can also provide a starting point for more globally aware optimising
methods, if more accuracy is required. As with most of the examples in this book, the aim
of the tree-generating example is not to give a high-performance, gold-standard program,
but rather to lead you through a relatively simple piece of code that can be readily
understood and which illustrates the major points of the topic.
The neighbour-joining algorithm involves repeatedly finding the closest pair from
amongst the input sequences (and sub-trees after the first cycle) and joining these to form
a new larger sub-tree until all sequences have been considered and only one, fully joined
tree remains. Thus the first step for the tree-building example is to calculate what is
termed a distance matrix, where the rows and columns of the matrix represent the different
input sequences and the matrix elements represent the evolutionary distance between those
sequences. Here we will only compare isolated input sequences; however, the method
could be adapted to consider multiple sequences for each taxon (e.g. each input is a group
of sequences representing a different species) so that the evolutionary distance between
organisms is better represented as a whole. The tree that will result in the end is what is
called an unrooted tree. This means that the tree has no indication of which taxon came
first, evolutionarily speaking, and hence where the common (but unseen) ancestor fits in.
By making an unrooted tree we are only showing the branch relationships between the
input taxa. We recommend further reading if you wish to build rooted trees: those with an
ancestor start point.
The evolutionary distance that separates two sequences in this instance will be
estimated using the scores that are generated by performing pairwise sequence alignment.
Thus the values will depend on the similarity of paired residues as defined by a
substitution matrix. With DNA sequences it is common to use a more complex
evolutionary model, to say how each change relates to distance; there are many more
sophisticated ways of calculating the distance between sequences, not all of which can be
used with the relatively simple neighbour-joining method used here. However, by using
the substitution matrices we will keep things very simple for illustrative purposes and be
able to use existing Python functions. For further reading on distance matrix generation,
and tree generation in general, the PHYLIP
6
software package is a good starting point.
Do'stlaringiz bilan baham: |