Python Programming for Biology: Bioinformatics and Beyond

Tree-guided multiple-sequence alignments

Download 7,75 Mb.

Pdf ko'rish

bet	207/514
Sana	30.12.2021
Hajmi	7,75 Mb.
	#91066

1 ... 203 204 205 206 207 208 209 210 ... 514

Bog'liq
[Tim J. Stevens, Wayne Boucher] Python Programming

Tree-guided multiple-sequence alignments

Although we may just want to admire the evolutionary relationship suggested by a

phylogenetic tree we can also use trees for many subsequent bioinformatics exercises. One

important example is in the generation of multiple-sequence alignments. When making a

tree, pairs of sequences are compared by alignment, but we can combine the sequences in

the same order as they are liked into the tree to build a deeper multiple alignment. The

example of a multiple alignment we gave earlier in

Chapter 13

simply combined

sequences in an arbitrary input order. Now we aim to do much better and link them in a

way that better represents how they diverged. The process of multiple alignment in our

example will be much the same; profiles are aligned and gaps are inserted so that

sequences can be stacked; it is just the order of comparing the profiles that differs (which

now represents the branching nodes of the tree). We could make further improvements to

the multiple alignments by weighting the different branches when we combine the

sequences. However, we will leave such complexities to programs like ClustalW etc.

The multiple-alignment function naturally takes a list of one-letter sequences and a

substitution matrix as input arguments. Then inside the function we create a dictionary

multipleAlign which will hold the sequences as they are combined. Because we will be

combining groups of sequences that correspond to the branches of a phylogenetic tree, this

dictionary will hold all of the partial alignments as they are joined together. The keys of

this dictionary represent the indices of the sequences that were used to make each sub-

alignment. Initially nothing has been combined, so the dictionary is filled with isolated

sequences alone in a list ([seq,]), keyed by a single index. As the multiple alignment

builds new indices will be added to indicate which sequences have been combined. These

new indices will be of the tuple form (x,y) where x and y could be plain index numbers or

other, already joined keys. For example, the key ‘(3,(1,2))’ will be for an alignment that

first combined sequences 1 and 2 which was then added to 3 – of course here the keys

‘(1,2)’ and ‘3’ will already exist because we are following the branches of the tree from

outside inwards.

from MultipleAlign import profile, profileAlign

def treeProfileMultipleAlign(seqs, simMatrix):

multipleAlign = {}

for i, seq in enumerate(seqs):

multipleAlign[i] = [seq,]

With the alignment dictionary initialised, the neighbourJoinTree() function is called

using a distance matrix generated with the same input sequences and substitution matrix.

This will return not only the structure of the tree, in terms of indices, but also the order in

which sequences were combined. It is this second value that we will use to combine the

sequence alignment. A copy is made of the tree combination order, which can be altered as

we make the alignment. The original unmodified treeOrder list will be returned from the

function as output.

distMatrix = getDistanceMatrix(seqs, simMatrix)

tree, treeOrder = neighbourJoinTree(distMatrix)

joinStack = treeOrder[:]

Next a while loop is set up to continue combining sequences while we still have tree

branches to consider. Note how the first (index zero) element is removed from the

joinStack list, thus shortening it while at the same time providing the keys (keyA and

keyB) to extract the next sub-alignments to be joined. Eventually all branch points will

have been considered and the joinStack list will be empty, whereupon the loop will stop.

while joinStack:

keyA, keyB = joinStack.pop(0)

These keys are then used to access the relevant sub-alignments from the larger

dictionary. These sub-alignments will then be combined, once gaps have been inserted, to

make a new, larger multiple alignment. This is equivalent to joining two branches of the

phylogenetic tree.

subAlignA = multipleAlign[keyA]

subAlignB = multipleAlign[keyB]

The profiles (lists of residue fraction dictionaries) are calculated using the previously

defined profile() function for each sub-alignment. These are then passed into the profile

alignment function, again previously defined, to generate a score (s, which will be

completely ignored here) and two aligned profiles, i.e. with gaps.

profA = profile(subAlignA)

profB = profile(subAlignB)

s, profAlignA, profAlignB = profileAlign(profA, profB, simMatrix)

We insert the gaps in the previously combined multiple alignments (or single sequence

at the beginning) by looking for gaps in the newly aligned profile. Accordingly, we loop

though the profile alignment to find positions (index i) where there is no fraction

dictionary, but rather a None object, indicating a gap.

gaps = []

for i, fractions in enumerate(profAlignA):

if fractions is None:

gaps.append(i)

With the gaps collected, another loop is performed, this time through each of the

sequences in the sub-alignment. Each sequence is redefined by the insertion of dashes ‘-’

at the positions specified by the gaps list. Note that as the gaps are inserted the sequence

string will grow, so although all gap positions after the first are meaningless, with regard

to the initial contents of seq, once all preceding gaps have been inserted the next gap

location will be in-step with the longer redefined seq. After the gap insertion the sequence

is re-inserted in the sub-alignment at the correct index [j].

for j, seq in enumerate(subAlignA):

for gap in gaps:

seq = seq[:gap] + '-' + seq[gap:]

subAlignA[j] = seq

The gapping procedure is repeated using the gaps from the second of the aligned

profiles (profAlignB) to insert dashes into the second of the sub-alignments (subAlignB).

gaps = []

for i, fractions in enumerate(profAlignB):

if fractions is None:

gaps.append(i)

for j, seq in enumerate(subAlignB):

for gap in gaps:

seq = seq[:gap] + '-' + seq[gap:]

subAlignB[j] = seq

Now that the sequences have had the appropriate gaps inserted the two smaller

alignments are combined and placed in an expanded multipleAlign dictionary. The new

key for this alignment is simply a tuple containing the two keys of the smaller sub-

alignments that were combined in this iteration through the loop. The old sub-alignments

are no longer needed and can now be removed from the dictionary by deleting the entries

for the old, now combined, keys.

newKey = (keyA, keyB)

multipleAlign[newKey] = subAlignA + subAlignB

del multipleAlign[keyA]

del multipleAlign[keyB]

Finally we return the deepest alignment in the dictionary, which is the one that was

stored using the final value of newKey from the loop, and the tree data. This represents the

last joining in the treeOrder list, and thus represents the tree as a whole.

return multipleAlign[newKey], tree, treeOrder

We test the function with a list of input sequences and then print out the gapped output

sequences from the returned alignment.

align, tree, order = treeProfileMultipleAlign(seqs, BLOSUM62)

for seq in align:

print(seq)

And this gives:

-LAH-DFSQ--PVITV-IILGVMAGIIGIILLLAYVSRRLR-KR-----P-PADVP

QPVH-PFSRPAPVVIILIILCVMAGVIGTILLISYGIRLL--------------IK

QLVH-RFTVPAPVVIILIILCVMAGIIGTILLISYTIRRL--------------IK

MLEH-EFS--APVAIL-IILGVMAGIIGIILLISYSIGQIIKKRSVDIQP-PEDED

PIQH-DFP--ALVMIL-IILGVMAGIIGTILLISYCISRMTKKSSVDIQS-PEGGD

QLAH-HFSE--PEITL-IIFGVMAGVIGTILLISYGIRRLIKKSPSDVKPLPSP-D

QLVH-EFSE--LVIAL-IIFGVMAGVIGTILFISYGSRRLIKKSESDVQPLPPP-D

QLVH-IFSE--PVIIG-IIYAVMLGIIITILSIAFCIGQLTKKS-SLPAQVASPED

-SYHQDFSH--AEITG-IIFAVMAGLLLIIFLIAYLIRRMIKKPLPVPKPQDSP-D

Download 7,75 Mb.

Do'stlaringiz bilan baham:

1 ... 203 204 205 206 207 208 209 210 ... 514