Calculating substitution rates using trees
Another way to use sequences while following a phylogenetic tree structure is to estimate
the rate of residue substitution. This will not be a rate in a specific time sense, but a rate in
terms of the evolutionary distance between sequences. Strictly speaking the calculations
will be of the minimum rate of substitution because we might not see all intermediate
residue changes, i.e. although it is possible for a residue to change from X to Z to Y, unless
we observe Z we will assume a direct transition from X to Y, the most parsimonious
solution. There are two example Python functions that will measure substitution rates: the
first is just to give a simple numeric rate value for the positions along a set of aligned
sequences, and the second is to work out the relative rates of passive and active DNA
substitutions in protein-coding regions.
In the following rate calculation functions we will be following residue letters back
along the branches of a tree. In order to determine whether there has been a substitution
event at a particular tree branch, at a particular sequence position, we estimate what the
ancestral residue might be. This is a very simple guess, and other more complex
probabilistic estimates can be made, but it is good enough for illustrative purposes. In
essence the function works by taking two sets for residues, one for each of the branches of
a tree, and determining whether these sets have any residue codes in common. If they do it
is assumed that these are the ancestral residue codes. When the outer branches of the trees
are used then there will only be a single residue letter in each set. However, for inner
branch points where substitution events may have occurred (and there is no good guess at
when the ancestor might be) each set can contain several residue codes.
The function is defined to take two sets of codes. Because we are using the Python set
objects we can use their inbuilt .intersection() function to find residue letters that are
present in both input sets. Then we perform some tests to determine what the ancestor
residue codes might be and whether we think a substitution event has occurred.
def ancestorResidue(codesA, codesB):
common = codesA.intersection(codesB)
If there are some residue codes in common this common set is passed back as the
ancestor set. Also, it is assumed that a substitution has not occurred if the number of
common residues is smaller than one or both inputs: then the ancestor represents a
clarification, rather than a divergence; any input set that is larger than the common set
would have already had its substitutions counted earlier, when the ambiguous set was
defined.
if common:
return common, False
If there are no residue codes in common, the ancestor is set to the union of the two
input sets. In this way ambiguous sets are generated. If the inputs are both full a
substitution event has occurred, otherwise we have one or more gaps and no substitution is
counted.
else:
union = codesA.union(codesB)
if codesA and codesB:
return union, True
else:
return union, False
Next we define a function that follows a tree of sequences and uses the ancestor residue
estimation to calculate minimum substitution rates. This function takes an alignment and
tree joining order as inputs; obviously these relate to the same sequences. We determine
the number of alignment positions n and the number of tree nodes. Two lists of zeros are
then initialised to be filled in with the relative and absolute substitution rates for each
position. The relative rates will be compared to the average for all positions.
def calcSubstitutionRates(align, treeOrder):
n = len(align[0])
numNodes = float(len(treeOrder))
absRates = [0.0] * n
relRates = [0.0] * n
Next a dictionary is defined and initially filled with the gapped sequences from the
input alignment. The sequences are converted from a string of letters into a list of Python
sets, each containing a single letter. Note that gaps are replaced by empty sets. As we
compare sequences in tree order, to count substitutions, the elements of this dictionary will
be replaced with new sets of residues representing ancestral sequences (i.e. at tree branch
points). The initial dictionary keys are simply indices within the alignment.
treeSeqs = {}
for i, seq in enumerate(align):
sets = []
for letter in seq:
if letter == '-':
sets.append(set())
else:
sets.append(set([letter])) # Could use sets.append({letter})
from Python 2.7
treeSeqs[i] = sets
Then we loop though the tree generation order, and in each iteration we extract the
indices of the combined branches (a and b) from the start of the list. Note that by using
.pop() the list is shortened, and when the list is empty the while loop will stop. The indices
are then used to extract the relevant sequences, and a new list is defined which will be
filled with a guess at the ancestor sequence.
while treeOrder:
a, b = treeOrder.pop(0)
seqA = treeSeqs[a]
seqB = treeSeqs[b]
seqC = []
To fill the ancestor sequence, all the alignment positions are looped through and the
corresponding pair of branching sequence elements extracted (seqA[i], seqB[i]). The
ancestor residue data is predicted with the function defined above and we catch the output
values for the ancestral residue codes and a Boolean value (True or False) that tells us
whether a substitution has occurred. If it has, the list of counts for the absolute rate
calculation is incremented by one at this position.
for i in range(n):
residueSet, swapped = ancestorResidue(seqA[i], seqB[i])
seqC.append(residueSet)
if swapped:
absRates[i] += 1.0
With this tree branch point considered, we store the ancestor sequence in the dictionary
of sequences, combining the old keys in a tuple to make a new key for the ancestor. This is
the same way that the node identifiers were made in the neighbourJoinTree() function.
treeSeqs[(a, b)] = seqC
del treeSeqs[a]
del treeSeqs[b]
After the while loop the average substitution rate is calculated as the total number of
residue substitutions, across all positions, divided by the maximum possible number of
substitutions: one for each branch node at each alignment position.
meanRate = sum(absRates)/(numNodes*n)
Finally, we loop though the alignment positions and calculate the per-site rate values.
The absolute rate is simply the count in the absRates list divided by the number of branch
points. The relative rate is the difference of the absolute rate from the mean, as a
proportion of the mean rate, and thus gives a positive or negative value depending on
whether the local rate is larger or smaller than average. At the end the lists of absolute and
relative rates are passed back as output.
for i in range(n):
rate = absRates[i] / numNodes
absRates[i] = rate
relRates[i] = (rate-meanRate)/meanRate
return absRates, relRates
To test the function we use our multiple-alignment function to generate the alignment
and tree data for some sequences. The alignment and order of tree construction are then
passed as arguments to calculate the substitution rates, which we then loop through and
print.
align, tree, joinOrder = treeProfileMultipleAlign(seqs, BLOSUM62)
absRates, relRates = calcSubstitutionRates(align, joinOrder)
for i, absRate in enumerate(absRates):
print('%2d %6.2f %6.2f' % (i, absRate, relRates[i]))
The final example function in this chapter will detect sequence substitutions in the same
manner as the substitution rate calculation, by following a guiding tree. However, instead
of treating all substitutions equally we will subdivide them into two categories: those that
change a codon to encode a different amino acid (active) and those that do not (passive).
As mentioned above, regions of high active substitution may indicate positive selection in
evolution, and more passive regions may indicate preservation of function. Naturally this
analysis only works for DNA sequences corresponding to the protein-coding region of
genes. We will assume that the input alignment is of the coding DNA strand (so not
reverse complement of the genetic code) and starts exactly at the start of the first codon to
be considered (so no offset).
The function takes an alignment, tree data and a genetic code as arguments. As before
the number of positions and number of nodes in the tree are defined. Then counters for the
active and passive substitutions are initialised and the dictionary to contain the sequences,
and ancestor sets, is defined. Much of the function’s logic is the same as described
previously in calcSubstitutionRates(). However, once the ancestor residues are predicted
things begin to differ.
def calcActivePassive(align, treeOrder, geneticCode):
n = len(align[0])
numNodes = float(len(treeOrder))
active = 0
passive = 0
treeSeqs = {}
for i, seq in enumerate(align):
sets = []
for letter in seq:
if letter == '-':
sets.append(set())
else:
sets.append(set([letter]))
treeSeqs[i] = sets
while treeOrder:
a, b = treeOrder.pop(0)
seqA = treeSeqs[a]
seqB = treeSeqs[b]
seqC = []
for i in range(n):
residues, swapped = ancestorResidue(seqA[i], seqB[i])
seqC.append(residues)
If a substitution event is detected then we have to consider the codons in which the
current DNA residues reside. The codon start position is defined as the multiple of three at
or before the current position: if we divide i by three, round to the integer and then
multiply by three, we will round down to a multiple of three. The sub-sequences for the
codons are defined for each of the sequences by taking a slice of each from the codon start
position to a position three residues along.
if swapped:
codonStart = (i//3)*3 # // is integer division
subSeqA = seqA[codonStart:codonStart+3]
subSeqB = seqB[codonStart:codonStart+3]
Because the variables containing the sub-sequences are actually lists of Python sets
(generated with the ancestorResidue()function) the actual triple-letter codons are
generated by looping through all the residue possibilities for each of the three positions
and joining the residue combinations (x+y+z). For each combination the amino acid code
is then looked up using the DNA codon, according to the input genetic code dictionary.
Note that because our genetic code dictionary uses RNA sequence keys, we first have to
replace T with U in the DNA codon. The amino acid code is added to the set of amino acid
codes aminoAcidsA using the .add() function of Python sets.
aminoAcidsA = set()
for x in subSeqA[0]:
for y in subSeqA[1]:
for z in subSeqA[2]:
codon = x+y+z
codon.replace('T','U'’)
aminoAcidsA.add( geneticCode.get(codon) )
The process is repeated for the second sub-sequence, i.e. the other branch of the tree.
aminoAcidsB = set()
for x in subSeqB[0]:
for y in subSeqB[1]:
for z in subSeqB[2]:
codon = x+y+z
codon.replace('T','U')
aminoAcidsB.add( geneticCode.get(codon) )
If the detected substitution leaves any of the amino acid codes the same, comparing the
two branches of the tree, we increase the count for the passive substitutions, otherwise if
the code does change then the active count is increased. Note how the intersection of the
two sets of amino acid codes is used to find whether there are codes in common, in which
case we assume a passive change.
if aminoAcidsA.intersection(aminoAcidsB):
passive += 1
else:
active += 1
Finally, in the while loop the new ancestor sequence is stored and the already
considered parts of the tree are deleted. And at the end of the function the counts for the
active and passive DNA substitutions are passed back at the return statement.
treeSeqs[(a, b)] = seqC
del treeSeqs[a]
del treeSeqs[b]
return active, passive
To test we first generate an alignment and tree data and then make the active and
passive counts, using an imported genetic code dictionary.
seqs = ['AAAGTGGATGAAGTTGGTGCTGAGGCCCTGGGCAGGCTG',
'AAAGTGGATGATGTTGGTGCTGAGGCCCTGGGCAGGCTG',
'AAAGTGGATGAAGTTGGTGCTGAGGCCCTGGGCAGGCTG',
'AAAGTGGATGAAGTTGGTGCTGAAGCCCTGGGCAGGCTG',
'AAAGTGGATGAAGTTGGTGCTGAGGCCCTGGGCAGGCTG',
'CATGTGGATGAAGTTGGTGGTGAGGCCCTGGGCAGGCTG',
'AAAGTGGACGAAGTTGGTGCTGAGGCCCTGGGCAGGCTG',
'CATGTGGATGAAATTAGTGGTGAGGTCCTGGGCAGGCTG',
'AACGTGGATGAAGTTGGTGGTGAGGCCCTGGGCAGGCTG']
from Sequences import STANDARD_GENETIC_CODE as SGC
align, tree, joinOrder = treeProfileMultipleAlign(seqs, DNA_1)
active, passive = calcActivePassive (align, joinOrder, SGC)
print('Active: %d Passive: %d' % (active, passive))
1
Many virus replication enzymes lack proof-reading ability so that although error-prone
their genome can mutate, adapt and evolve at a very high rate.
Do'stlaringiz bilan baham: |