Python Programming for Biology: Bioinformatics and Beyond

Calculating substitution rates using trees

Download 7,75 Mb.

Pdf ko'rish

bet	208/514
Sana	30.12.2021
Hajmi	7,75 Mb.
	#91066

1 ... 204 205 206 207 208 209 210 211 ... 514

Bog'liq
[Tim J. Stevens, Wayne Boucher] Python Programming

Calculating substitution rates using trees

Another way to use sequences while following a phylogenetic tree structure is to estimate

the rate of residue substitution. This will not be a rate in a specific time sense, but a rate in

terms of the evolutionary distance between sequences. Strictly speaking the calculations

will be of the minimum rate of substitution because we might not see all intermediate

residue changes, i.e. although it is possible for a residue to change from X to Z to Y, unless

we observe Z we will assume a direct transition from X to Y, the most parsimonious

solution. There are two example Python functions that will measure substitution rates: the

first is just to give a simple numeric rate value for the positions along a set of aligned

sequences, and the second is to work out the relative rates of passive and active DNA

substitutions in protein-coding regions.

In the following rate calculation functions we will be following residue letters back

along the branches of a tree. In order to determine whether there has been a substitution

event at a particular tree branch, at a particular sequence position, we estimate what the

ancestral residue might be. This is a very simple guess, and other more complex

probabilistic estimates can be made, but it is good enough for illustrative purposes. In

essence the function works by taking two sets for residues, one for each of the branches of

a tree, and determining whether these sets have any residue codes in common. If they do it

is assumed that these are the ancestral residue codes. When the outer branches of the trees

are used then there will only be a single residue letter in each set. However, for inner

branch points where substitution events may have occurred (and there is no good guess at

when the ancestor might be) each set can contain several residue codes.

The function is defined to take two sets of codes. Because we are using the Python set

objects we can use their inbuilt .intersection() function to find residue letters that are

present in both input sets. Then we perform some tests to determine what the ancestor

residue codes might be and whether we think a substitution event has occurred.

def ancestorResidue(codesA, codesB):

common = codesA.intersection(codesB)

If there are some residue codes in common this common set is passed back as the

ancestor set. Also, it is assumed that a substitution has not occurred if the number of

common residues is smaller than one or both inputs: then the ancestor represents a

clarification, rather than a divergence; any input set that is larger than the common set

would have already had its substitutions counted earlier, when the ambiguous set was

defined.

if common:

return common, False

If there are no residue codes in common, the ancestor is set to the union of the two

input sets. In this way ambiguous sets are generated. If the inputs are both full a

substitution event has occurred, otherwise we have one or more gaps and no substitution is

counted.

else:

union = codesA.union(codesB)

if codesA and codesB:

return union, True

else:

return union, False

Next we define a function that follows a tree of sequences and uses the ancestor residue

estimation to calculate minimum substitution rates. This function takes an alignment and

tree joining order as inputs; obviously these relate to the same sequences. We determine

the number of alignment positions n and the number of tree nodes. Two lists of zeros are

then initialised to be filled in with the relative and absolute substitution rates for each

position. The relative rates will be compared to the average for all positions.

def calcSubstitutionRates(align, treeOrder):

n = len(align[0])

numNodes = float(len(treeOrder))

absRates = [0.0] * n

relRates = [0.0] * n

Next a dictionary is defined and initially filled with the gapped sequences from the

input alignment. The sequences are converted from a string of letters into a list of Python

sets, each containing a single letter. Note that gaps are replaced by empty sets. As we

compare sequences in tree order, to count substitutions, the elements of this dictionary will

be replaced with new sets of residues representing ancestral sequences (i.e. at tree branch

points). The initial dictionary keys are simply indices within the alignment.

treeSeqs = {}

for i, seq in enumerate(align):

sets = []

for letter in seq:

if letter == '-':

sets.append(set())

else:

sets.append(set([letter])) # Could use sets.append({letter})

from Python 2.7

treeSeqs[i] = sets

Then we loop though the tree generation order, and in each iteration we extract the

indices of the combined branches (a and b) from the start of the list. Note that by using

.pop() the list is shortened, and when the list is empty the while loop will stop. The indices

are then used to extract the relevant sequences, and a new list is defined which will be

filled with a guess at the ancestor sequence.

while treeOrder:

a, b = treeOrder.pop(0)

seqA = treeSeqs[a]

seqB = treeSeqs[b]

seqC = []

To fill the ancestor sequence, all the alignment positions are looped through and the

corresponding pair of branching sequence elements extracted (seqA[i], seqB[i]). The

ancestor residue data is predicted with the function defined above and we catch the output

values for the ancestral residue codes and a Boolean value (True or False) that tells us

whether a substitution has occurred. If it has, the list of counts for the absolute rate

calculation is incremented by one at this position.

for i in range(n):

residueSet, swapped = ancestorResidue(seqA[i], seqB[i])

seqC.append(residueSet)

if swapped:

absRates[i] += 1.0

With this tree branch point considered, we store the ancestor sequence in the dictionary

of sequences, combining the old keys in a tuple to make a new key for the ancestor. This is

the same way that the node identifiers were made in the neighbourJoinTree() function.

treeSeqs[(a, b)] = seqC

del treeSeqs[a]

del treeSeqs[b]

After the while loop the average substitution rate is calculated as the total number of

residue substitutions, across all positions, divided by the maximum possible number of

substitutions: one for each branch node at each alignment position.

meanRate = sum(absRates)/(numNodes*n)

Finally, we loop though the alignment positions and calculate the per-site rate values.

The absolute rate is simply the count in the absRates list divided by the number of branch

points. The relative rate is the difference of the absolute rate from the mean, as a

proportion of the mean rate, and thus gives a positive or negative value depending on

whether the local rate is larger or smaller than average. At the end the lists of absolute and

relative rates are passed back as output.

for i in range(n):

rate = absRates[i] / numNodes

absRates[i] = rate

relRates[i] = (rate-meanRate)/meanRate

return absRates, relRates

To test the function we use our multiple-alignment function to generate the alignment

and tree data for some sequences. The alignment and order of tree construction are then

passed as arguments to calculate the substitution rates, which we then loop through and

print.

align, tree, joinOrder = treeProfileMultipleAlign(seqs, BLOSUM62)

absRates, relRates = calcSubstitutionRates(align, joinOrder)

for i, absRate in enumerate(absRates):

print('%2d %6.2f %6.2f' % (i, absRate, relRates[i]))

The final example function in this chapter will detect sequence substitutions in the same

manner as the substitution rate calculation, by following a guiding tree. However, instead

of treating all substitutions equally we will subdivide them into two categories: those that

change a codon to encode a different amino acid (active) and those that do not (passive).

As mentioned above, regions of high active substitution may indicate positive selection in

evolution, and more passive regions may indicate preservation of function. Naturally this

analysis only works for DNA sequences corresponding to the protein-coding region of

genes. We will assume that the input alignment is of the coding DNA strand (so not

reverse complement of the genetic code) and starts exactly at the start of the first codon to

be considered (so no offset).

The function takes an alignment, tree data and a genetic code as arguments. As before

the number of positions and number of nodes in the tree are defined. Then counters for the

active and passive substitutions are initialised and the dictionary to contain the sequences,

and ancestor sets, is defined. Much of the function’s logic is the same as described

previously in calcSubstitutionRates(). However, once the ancestor residues are predicted

things begin to differ.

def calcActivePassive(align, treeOrder, geneticCode):

n = len(align[0])

numNodes = float(len(treeOrder))

active = 0

passive = 0

treeSeqs = {}

for i, seq in enumerate(align):

sets = []

for letter in seq:

if letter == '-':

sets.append(set())

else:

sets.append(set([letter]))

treeSeqs[i] = sets

while treeOrder:

a, b = treeOrder.pop(0)

seqA = treeSeqs[a]

seqB = treeSeqs[b]

seqC = []

for i in range(n):

residues, swapped = ancestorResidue(seqA[i], seqB[i])

seqC.append(residues)

If a substitution event is detected then we have to consider the codons in which the

current DNA residues reside. The codon start position is defined as the multiple of three at

or before the current position: if we divide i by three, round to the integer and then

multiply by three, we will round down to a multiple of three. The sub-sequences for the

codons are defined for each of the sequences by taking a slice of each from the codon start

position to a position three residues along.

if swapped:

codonStart = (i//3)*3 # // is integer division

subSeqA = seqA[codonStart:codonStart+3]

subSeqB = seqB[codonStart:codonStart+3]

Because the variables containing the sub-sequences are actually lists of Python sets

(generated with the ancestorResidue()function) the actual triple-letter codons are

generated by looping through all the residue possibilities for each of the three positions

and joining the residue combinations (x+y+z). For each combination the amino acid code

is then looked up using the DNA codon, according to the input genetic code dictionary.

Note that because our genetic code dictionary uses RNA sequence keys, we first have to

replace T with U in the DNA codon. The amino acid code is added to the set of amino acid

codes aminoAcidsA using the .add() function of Python sets.

aminoAcidsA = set()

for x in subSeqA[0]:

for y in subSeqA[1]:

for z in subSeqA[2]:

codon = x+y+z

codon.replace('T','U'’)

aminoAcidsA.add( geneticCode.get(codon) )

The process is repeated for the second sub-sequence, i.e. the other branch of the tree.

aminoAcidsB = set()

for x in subSeqB[0]:

for y in subSeqB[1]:

for z in subSeqB[2]:

codon = x+y+z

codon.replace('T','U')

aminoAcidsB.add( geneticCode.get(codon) )

If the detected substitution leaves any of the amino acid codes the same, comparing the

two branches of the tree, we increase the count for the passive substitutions, otherwise if

the code does change then the active count is increased. Note how the intersection of the

two sets of amino acid codes is used to find whether there are codes in common, in which

case we assume a passive change.

if aminoAcidsA.intersection(aminoAcidsB):

passive += 1

else:

active += 1

Finally, in the while loop the new ancestor sequence is stored and the already

considered parts of the tree are deleted. And at the end of the function the counts for the

active and passive DNA substitutions are passed back at the return statement.

treeSeqs[(a, b)] = seqC

del treeSeqs[a]

del treeSeqs[b]

return active, passive

To test we first generate an alignment and tree data and then make the active and

passive counts, using an imported genetic code dictionary.

seqs = ['AAAGTGGATGAAGTTGGTGCTGAGGCCCTGGGCAGGCTG',

'AAAGTGGATGATGTTGGTGCTGAGGCCCTGGGCAGGCTG',

'AAAGTGGATGAAGTTGGTGCTGAGGCCCTGGGCAGGCTG',

'AAAGTGGATGAAGTTGGTGCTGAAGCCCTGGGCAGGCTG',

'AAAGTGGATGAAGTTGGTGCTGAGGCCCTGGGCAGGCTG',

'CATGTGGATGAAGTTGGTGGTGAGGCCCTGGGCAGGCTG',

'AAAGTGGACGAAGTTGGTGCTGAGGCCCTGGGCAGGCTG',

'CATGTGGATGAAATTAGTGGTGAGGTCCTGGGCAGGCTG',

'AACGTGGATGAAGTTGGTGGTGAGGCCCTGGGCAGGCTG']

from Sequences import STANDARD_GENETIC_CODE as SGC

align, tree, joinOrder = treeProfileMultipleAlign(seqs, DNA_1)

active, passive = calcActivePassive (align, joinOrder, SGC)

print('Active: %d Passive: %d' % (active, passive))

Many virus replication enzymes lack proof-reading ability so that although error-prone

their genome can mutate, adapt and evolve at a very high rate.

Download 7,75 Mb.

Do'stlaringiz bilan baham:

1 ... 204 205 206 207 208 209 210 211 ... 514