Given a group of homologous sequences we can often go beyond saying that they are
with such a tree is to reconstruct the way that sequences have diverged during evolution.
This can be used to reconstruct the events of how genes, non-coding regions and even
protein domains arose. Given enough information we can look at a large scale to say how
whole species are related, and if we look at the fine details how individuals within a
family are related. Of course on some occasions we already know the inheritance tree, by
using knowledge of parentage. This enables us to follow traits including physical
differences, biochemical differences (e.g. blood groups) and inherited disease symptoms.
However, it is only if we study the inherited differences at the biological sequence level
that we can understand the molecular reasons, which in turn improves medicine and
Historically, evolutionary and family trees were built according to observable
characteristics. If two species shared certain anatomical characteristics they would be
deemed to be more closely related. This works well in some cases, but not in others (such
as knowing where to place the elephant, whale and duck-billed platypus in the evolution
of mammals). The reason for this difficulty is that people were only following a few
subjective measurements. DNA sequencing allows us to place evolutionary lineages with
much more confidence, because reading a sequence is precise and there are
vastly more data points to follow: potentially every base pair, gene and transposon.
Nevertheless, we sometimes still have to resort to anatomical comparisons when DNA is
unavailable, as with dinosaurs, but the more bones the better.
When constructing a phylogenetic tree of sequences the basic principle is to think of the
most similar sequences being the most closely related, analogous to the anatomical means
of grouping organisms. When looking at sequence evolution we often think in terms of the
most frugal explanation or parsimony; it is reasonable to assume that minimal changes are
the most likely, so we would think that a nucleotide is less likely to change from say T to
G to C than it is to go directly from T to C. Accordingly when we build a phylogenetic
tree we assume that the correct one is, or is close to, the one that involves the minimum
amount of overall sequence change. Absolute parsimony is not a good idea in all
situations: with distantly related sequences, and those with a high rate of change, the
chance of intermediate residue changes is significant, so it is better to think in
terms of the long-term equilibrium of sequence. Also some things may be similar by
chance and not because of a common ancestor, although this becomes increasingly
unlikely overall as we consider more sequence data. However, there may
simply not be enough data to form a firm opinion, even if building some sort of optimised
tree is computationally possible.
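The two ideas above, that the most similar sequences are taken to be the most closely related and that strict parsimony undercounts change when sequences are distant, can be illustrated with a small sketch. The function names `p_distance` and `jukes_cantor` and the toy sequences are assumptions made purely for illustration. The Jukes-Cantor formula is swapped in here as one simple, standard way of allowing for unseen intermediate changes in DNA; it assumes all substitutions are equally likely, which real sequences may not obey.

```python
import math

def p_distance(seq_a, seq_b):
    """Fraction of aligned, ungapped positions at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    # Ignore alignment columns where either sequence has a gap
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != '-' and b != '-']
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    n_diff = sum(1 for a, b in pairs if a != b)
    return n_diff / len(pairs)

def jukes_cantor(p):
    """Estimated substitutions per site, allowing for multiple changes
    at the same position (Jukes-Cantor model, DNA only)."""
    if p >= 0.75:
        # Sequences are at or beyond saturation; the distance is undefined
        return float('inf')
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

# Hypothetical aligned toy sequences
seq_1 = "ACGTACGTAC"
seq_2 = "ACGTACGTAT"
seq_3 = "ATGAACCTAT"

for name, other in (("seq_2", seq_2), ("seq_3", seq_3)):
    p = p_distance(seq_1, other)
    print(name, p, jukes_cantor(p))
```

For closely related sequences the corrected distance is almost the same as the observed fraction of differences, but as the sequences diverge the correction grows, reflecting the increasing chance that a position has changed more than once.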
When trying to work out real inheritance and evolutionary relationships more
information will yield better results. Thus when we look at the relationships between
species it is best to consider as much sequence and as many sequences as possible,
although given the choice it is better to have sequences that sample a tree widely and
evenly. Tree-building becomes less accurate, with regard to the underlying truth, the
longer the branch, so it is best to have lots of linking sequences and hence shorter
branches. Also, some sequences (genes, proteins or whatever) may be better than others at
uncovering the relationships, particularly if the rate of sequence change is the right
magnitude; too many changes and the assumption of parsimony is weaker, but too few
changes and there isn’t enough evidence to support a hypothesis. Accordingly, when we
study fast-moving things, like the mutation of viruses, we look at rapidly changing genes,
and for slow things like speciation we look at slowly changing things: ribosomal RNA
genes, mitochondrial ‘housekeeping’ genes and rare transposon and duplication events.
When we have confidently built a phylogenetic tree, analysis of sequence variation
gives us more information than can be obtained from alignments. We will be able to spot
which changes occurred first and whether the same change has occurred more than once.
As illustrated in Figure 14.4, consider for example four sequences A, B, C and D, two of
which, A and B, have residue W at a position and two of which, C and D, have residue Y
at the same position. If we know that the pairs (A, B) and (C, D) are each more closely
related as a whole, then we know that one residue substitution was enough to make the
observed situation; the ancestor of the sequences might have had W or Y but one
substitution is enough to generate the A and B (W) branch or C and D (Y) branch.
Conversely, if the most closely related pairs overall are (A, C) and (B, D) then each
pair contains a mix of W and Y residues. In this case there must have been at least two
substitution events, one on each branch from the ancestor, which could only have one of
the two residues. Accordingly, by considering the overall relationship between sequences
we can make much better measurements of the rate of change than we can from just a
multiple alignment.
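The counting argument for the four sequences can be checked with a short sketch. It uses Fitch's small-parsimony procedure, which is not named in the text but is a standard way to find the minimum number of substitutions at one alignment position for a given tree. Representing a tree as nested tuples of sequence names, and the names `fitch`, `tree_1` and `tree_2`, are assumptions made for illustration.

```python
def fitch(tree, residues):
    """Return (possible ancestral residues, minimum substitution count)
    for a binary tree given as nested tuples of leaf names."""
    if isinstance(tree, str):
        # A leaf: its residue is known and no substitution is needed
        return {residues[tree]}, 0

    left, right = tree
    left_set, left_count = fitch(left, residues)
    right_set, right_count = fitch(right, residues)

    common = left_set & right_set
    if common:
        # Both sub-trees can share an ancestral residue: no extra change
        return common, left_count + right_count

    # No shared residue: at least one substitution occurred below this node
    return left_set | right_set, left_count + right_count + 1

# Residues at the position of interest, as in the example above
residues = {'A': 'W', 'B': 'W', 'C': 'Y', 'D': 'Y'}

tree_1 = (('A', 'B'), ('C', 'D'))  # pairs share the same residue
tree_2 = (('A', 'C'), ('B', 'D'))  # each pair mixes W and Y

print(fitch(tree_1, residues)[1])  # 1
print(fitch(tree_2, residues)[1])  # 2
```

The same observed residues therefore imply one substitution on the first tree but at least two on the second, which is why knowing the overall relationships changes our estimate of the rate of change.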