Python Programming for Biology: Bioinformatics and Beyond

[Tim J. Stevens, Wayne Boucher] Python Programming

Phylogenetic trees

Given a group of homologous sequences we can often go beyond saying that they are

related and build a phylogenetic tree to say how they are related to one another. The idea

with such a tree is to reconstruct the way that sequences have diverged during evolution.

This can be used to reconstruct the events of how genes, non-coding regions and even

protein domains arose. Given enough information we can look at a large scale to say how

whole species are related, and if we look at the fine details how individuals within a

family are related. Of course on some occasions we already know the inheritance tree, by

using knowledge of parentage. This enables us to follow traits including physical

differences, biochemical differences (e.g. blood groups) and inherited disease symptoms.

However, it is only if we study the inherited differences at the biological sequence level

that we can understand the molecular reasons, which in turn improves medicine and

biology.

Historically, evolutionary and family trees were built according to observable

characteristics. If two species shared certain anatomical characteristics they would be

deemed to be more closely related. This works well in some cases, but not in others (such

as knowing where to place the elephant, whale and duck-billed platypus in the evolution

of mammals). The reason for this difficulty is that people were only following a few

subjective measurements. DNA sequencing allows us to place evolutionary lineages with

much more confidence, because determining a sequence is precise and there are

vastly more data points to follow: potentially every base pair, gene and transposon.

Nevertheless, we sometimes still have to resort to anatomical comparisons when DNA is

unavailable, as with dinosaurs, but the more bones the better.

When constructing a phylogenetic tree of sequences the basic principle is to think of the

most similar sequences being the most closely related, analogous to the anatomical means

of grouping organisms. When looking at sequence evolution we often think in terms of the

most frugal explanation or parsimony; it is reasonable to assume that minimal changes are

the most likely, so we would think that a nucleotide is less likely to change from say T to

G to C than it is to go directly from T to C. Accordingly when we build a phylogenetic

tree we assume that the correct one is, or is close to, the one that involves the minimum

amount of overall sequence change. Absolute parsimony isn’t always a good idea:

with distantly related sequences, and those with a high rate of change, the

chance of intermediate residue changes is significant, so it is better to think in

terms of the long-term equilibrium of sequence. Also some things may be similar by

chance and not because of a common ancestor, although this becomes increasingly

unlikely overall as we consider more sequence data. However, there may

simply not be enough data to form a firm opinion, even if building some sort of optimised

tree is computationally possible.
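
The principle that the most similar sequences are treated as the most closely related can be sketched in a few lines of Python. This is only a minimal illustration, assuming the sequences are already aligned to the same length; the function names fraction_different and closest_pair and the example sequences are made up for this sketch and are not taken from the text.

```python
from itertools import combinations

def fraction_different(seq_a, seq_b):
    """Fraction of aligned positions at which two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError('Sequences must be aligned to the same length')
    diffs = sum(1 for x, y in zip(seq_a, seq_b) if x != y)
    return diffs / len(seq_a)

def closest_pair(seq_dict):
    """Return (distance, name_1, name_2) for the most similar pair of sequences."""
    return min((fraction_different(seq_dict[a], seq_dict[b]), a, b)
               for a, b in combinations(sorted(seq_dict), 2))

# Hypothetical aligned sequences, invented for illustration only
seqs = {'A': 'GATTACAGATTACA',
        'B': 'GATTACAGATTGCA',
        'C': 'GCTTACGGATCGCA',
        'D': 'GCTTATGGATCGCT'}

distance, name_1, name_2 = closest_pair(seqs)
print(name_1, name_2, distance)
```

A simple count of differences like this treats every mismatch as a single change, which is exactly the assumption that weakens for distantly related or rapidly changing sequences.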

When trying to work out real inheritance and evolutionary relationships more

information will yield better results. Thus when we look at the relationships between

species it is best to consider as much sequence and as many sequences as possible,

although given the choice it is better to have sequences that sample a tree widely and

evenly. Tree-building becomes less accurate, with regard to the underlying truth, the

longer the branch, so it is best to have lots of linking sequences and hence shorter

branches. Also, some sequences (genes, proteins and so on) may be better than others at

uncovering the relationships, particularly if the rate of sequence change is the right

magnitude; too many changes and the assumption of parsimony is weaker, but too few

changes and there isn’t enough evidence to support a hypothesis. Accordingly, when we

study fast-moving things, like the mutation of viruses, we look at rapidly changing genes,

and for slow things like speciation we look at slowly changing things: ribosomal RNA

genes, mitochondrial ‘housekeeping’ genes and rare transposon and duplication events.
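
One way to see why too many changes weaken a simple count of differences is to compare the observed fraction of differing sites with an estimate of the changes that actually occurred. The sketch below uses the Jukes-Cantor correction for DNA, which is a standard simple model brought in here purely for illustration and not a method named in the text above; it assumes all substitutions are equally likely, and the function name is hypothetical.

```python
from math import log

def jukes_cantor_distance(p):
    """Estimate substitutions per site from the observed fraction p of
    differing DNA sites, assuming all substitutions are equally likely."""
    if not 0.0 <= p < 0.75:
        raise ValueError('p must be in the range [0, 0.75)')
    return -0.75 * log(1.0 - 4.0 * p / 3.0)

# Compare observed differences with the model's estimate of actual change
for p in (0.05, 0.30, 0.60):
    d = jukes_cantor_distance(p)
    print('observed %.2f  estimated %.3f' % (p, d))
```

For small values the two numbers nearly agree, but as the observed fraction grows the estimate climbs much faster, reflecting repeated changes at the same site that a direct comparison cannot see.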

When we have confidently built a phylogenetic tree, analysis of sequence variation

gives us more information than can be obtained from alignments. We will be able to spot

which changes occurred first and whether the same change has occurred more than once.

As illustrated in Figure 14.4, consider for example four sequences A, B, C and D, two of

which, A and B, have residue W at a position and two of which, C and D, have residue Y

at the same position. If we know that the pairs (A, B) and (C, D) are more closely

related as a whole, then we know that one residue substitution was enough to make the

observed situation; the ancestor of the sequences might have had W or Y but one

substitution is enough to generate the A and B (W) branch or C and D (Y) branch.

Conversely, if the most closely related pairs overall are (A, C) and (B, D), then each

pair contains a mix of W and Y residues. In this case there must have been at least two

substitution events, one on each branch from the ancestor, which could only have one of

the two residues. Accordingly, by considering the overall relationship between sequences

we can make much better measurements of the rate of change than we can from just a

multiple alignment.
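
The four-sequence example can be checked with a small parsimony count. The sketch below follows the idea of Fitch's algorithm for a single alignment position: working up from the leaves, each internal node keeps the residues shared by its two children, or, if they share none, the union of both at the cost of one substitution. The tree representation (nested 2-tuples of sequence names) and the function name fitch_count are assumptions made for this illustration.

```python
def fitch_count(tree, residues):
    """Return (possible ancestral residues, minimum substitutions) for a
    binary tree given as nested 2-tuples of leaf names."""
    if isinstance(tree, str):
        # Leaf: the observed residue, with no changes needed below it
        return {residues[tree]}, 0

    left, right = tree
    left_set, left_changes = fitch_count(left, residues)
    right_set, right_changes = fitch_count(right, residues)
    changes = left_changes + right_changes

    common = left_set & right_set
    if common:
        # Children can share a residue: no extra substitution here
        return common, changes

    # No shared residue: at least one substitution on a child branch
    return left_set | right_set, changes + 1

residues = {'A': 'W', 'B': 'W', 'C': 'Y', 'D': 'Y'}

tree_1 = (('A', 'B'), ('C', 'D'))  # A with B, C with D
tree_2 = (('A', 'C'), ('B', 'D'))  # A with C, B with D

print(fitch_count(tree_1, residues))  # minimum of 1 substitution
print(fitch_count(tree_2, residues))  # minimum of 2 substitutions
```

In both cases the set returned for the root contains W and Y, matching the observation that the ancestor might have had either residue; only the number of substitutions needed differs between the two trees.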
