Tree-guided multiple-sequence alignments
Although we may just want to admire the evolutionary relationship suggested by a
phylogenetic tree we can also use trees for many subsequent bioinformatics exercises. One
important example is in the generation of multiple-sequence alignments. When making a
tree, pairs of sequences are compared by alignment, but we can combine the sequences in
the same order as they are liked into the tree to build a deeper multiple alignment. The
example of a multiple alignment we gave earlier in
Chapter 13
simply combined
sequences in an arbitrary input order. Now we aim to do much better and link them in a
way that better represents how they diverged. The process of multiple alignment in our
example will be much the same; profiles are aligned and gaps are inserted so that
sequences can be stacked; it is just the order of comparing the profiles that differs (which
now represents the branching nodes of the tree). We could make further improvements to
the multiple alignments by weighting the different branches when we combine the
sequences. However, we will leave such complexities to programs like ClustalW etc.
The multiple-alignment function naturally takes a list of one-letter sequences and a
substitution matrix as input arguments. Then inside the function we create a dictionary
multipleAlign which will hold the sequences as they are combined. Because we will be
combining groups of sequences that correspond to the branches of a phylogenetic tree, this
dictionary will hold all of the partial alignments as they are joined together. The keys of
this dictionary represent the indices of the sequences that were used to make each sub-
alignment. Initially nothing has been combined, so the dictionary is filled with isolated
sequences alone in a list ([seq,]), keyed by a single index. As the multiple alignment
builds new indices will be added to indicate which sequences have been combined. These
new indices will be of the tuple form (x,y) where x and y could be plain index numbers or
other, already joined keys. For example, the key ‘(3,(1,2))’ will be for an alignment that
first combined sequences 1 and 2 which was then added to 3 – of course here the keys
‘(1,2)’ and ‘3’ will already exist because we are following the branches of the tree from
outside inwards.
from MultipleAlign import profile, profileAlign
def treeProfileMultipleAlign(seqs, simMatrix):
multipleAlign = {}
for i, seq in enumerate(seqs):
multipleAlign[i] = [seq,]
With the alignment dictionary initialised, the neighbourJoinTree() function is called
using a distance matrix generated with the same input sequences and substitution matrix.
This will return not only the structure of the tree, in terms of indices, but also the order in
which sequences were combined. It is this second value that we will use to combine the
sequence alignment. A copy is made of the tree combination order, which can be altered as
we make the alignment. The original unmodified treeOrder list will be returned from the
function as output.
distMatrix = getDistanceMatrix(seqs, simMatrix)
tree, treeOrder = neighbourJoinTree(distMatrix)
joinStack = treeOrder[:]
Next a while loop is set up to continue combining sequences while we still have tree
branches to consider. Note how the first (index zero) element is removed from the
joinStack list, thus shortening it while at the same time providing the keys (keyA and
keyB) to extract the next sub-alignments to be joined. Eventually all branch points will
have been considered and the joinStack list will be empty, whereupon the loop will stop.
while joinStack:
keyA, keyB = joinStack.pop(0)
These keys are then used to access the relevant sub-alignments from the larger
dictionary. These sub-alignments will then be combined, once gaps have been inserted, to
make a new, larger multiple alignment. This is equivalent to joining two branches of the
phylogenetic tree.
subAlignA = multipleAlign[keyA]
subAlignB = multipleAlign[keyB]
The profiles (lists of residue fraction dictionaries) are calculated using the previously
defined profile() function for each sub-alignment. These are then passed into the profile
alignment function, again previously defined, to generate a score (s, which will be
completely ignored here) and two aligned profiles, i.e. with gaps.
profA = profile(subAlignA)
profB = profile(subAlignB)
s, profAlignA, profAlignB = profileAlign(profA, profB, simMatrix)
We insert the gaps in the previously combined multiple alignments (or single sequence
at the beginning) by looking for gaps in the newly aligned profile. Accordingly, we loop
though the profile alignment to find positions (index i) where there is no fraction
dictionary, but rather a None object, indicating a gap.
gaps = []
for i, fractions in enumerate(profAlignA):
if fractions is None:
gaps.append(i)
With the gaps collected, another loop is performed, this time through each of the
sequences in the sub-alignment. Each sequence is redefined by the insertion of dashes ‘-’
at the positions specified by the gaps list. Note that as the gaps are inserted the sequence
string will grow, so although all gap positions after the first are meaningless, with regard
to the initial contents of seq, once all preceding gaps have been inserted the next gap
location will be in-step with the longer redefined seq. After the gap insertion the sequence
is re-inserted in the sub-alignment at the correct index [j].
for j, seq in enumerate(subAlignA):
for gap in gaps:
seq = seq[:gap] + '-' + seq[gap:]
subAlignA[j] = seq
The gapping procedure is repeated using the gaps from the second of the aligned
profiles (profAlignB) to insert dashes into the second of the sub-alignments (subAlignB).
gaps = []
for i, fractions in enumerate(profAlignB):
if fractions is None:
gaps.append(i)
for j, seq in enumerate(subAlignB):
for gap in gaps:
seq = seq[:gap] + '-' + seq[gap:]
subAlignB[j] = seq
Now that the sequences have had the appropriate gaps inserted the two smaller
alignments are combined and placed in an expanded multipleAlign dictionary. The new
key for this alignment is simply a tuple containing the two keys of the smaller sub-
alignments that were combined in this iteration through the loop. The old sub-alignments
are no longer needed and can now be removed from the dictionary by deleting the entries
for the old, now combined, keys.
newKey = (keyA, keyB)
multipleAlign[newKey] = subAlignA + subAlignB
del multipleAlign[keyA]
del multipleAlign[keyB]
Finally we return the deepest alignment in the dictionary, which is the one that was
stored using the final value of newKey from the loop, and the tree data. This represents the
last joining in the treeOrder list, and thus represents the tree as a whole.
return multipleAlign[newKey], tree, treeOrder
We test the function with a list of input sequences and then print out the gapped output
sequences from the returned alignment.
align, tree, order = treeProfileMultipleAlign(seqs, BLOSUM62)
for seq in align:
print(seq)
And this gives:
-LAH-DFSQ--PVITV-IILGVMAGIIGIILLLAYVSRRLR-KR-----P-PADVP
QPVH-PFSRPAPVVIILIILCVMAGVIGTILLISYGIRLL--------------IK
QLVH-RFTVPAPVVIILIILCVMAGIIGTILLISYTIRRL--------------IK
MLEH-EFS--APVAIL-IILGVMAGIIGIILLISYSIGQIIKKRSVDIQP-PEDED
PIQH-DFP--ALVMIL-IILGVMAGIIGTILLISYCISRMTKKSSVDIQS-PEGGD
QLAH-HFSE--PEITL-IIFGVMAGVIGTILLISYGIRRLIKKSPSDVKPLPSP-D
QLVH-EFSE--LVIAL-IIFGVMAGVIGTILFISYGSRRLIKKSESDVQPLPPP-D
QLVH-IFSE--PVIIG-IIYAVMLGIIITILSIAFCIGQLTKKS-SLPAQVASPED
-SYHQDFSH--AEITG-IIFAVMAGLLLIIFLIAYLIRRMIKKPLPVPKPQDSP-D
Do'stlaringiz bilan baham: |