Before we define the next function we make the required imports to get a substitution
matrix (used for testing) and the regular pairwise alignment function sequenceAlign(). The
). The profile-based
as input. Then we initialise a value for the number of input sequences.
We begin the multiple alignment by aligning the first two sequences and placing the
gapped output in the multipleAlign list. Obviously we would need to check that we had at
least two sequences before using this function. Note that if we were doing this properly we
would need to start with the edges of a genetic tree.
score, alignA, alignB = sequenceAlign(seqs[0], seqs[1], simMatrix)
multipleAlign = [alignA, alignB]
Next the function loops through an index, one for each input sequence, but starting at 2.
We start here because we have already used the first two sequences to get the multiple
alignment started. Then in the loop we generate two profiles, one for the existing
alignment and one for the sequence to be added, seqs[i], noting that this is placed inside a
list as required by the profile-generating function.
for i in range(2,n):
profA = profile(multipleAlign)
toAdd = [seqs[i],]
profB = profile(toAdd)
Now we do the alignment of the two profiles, and get two gapped profile alignments
back, as well as a score which in this instance we ignore.
score, alignA, alignB = profileAlign(profA, profB, simMatrix)
With the two gapped and aligned profiles, we have to use the positions of gap insertions
to combine the new sequence into the existing multiple alignment. Note how the output
profiles are just a guide to place insertions; the deepening alignment always uses the
original sequences. We repeat this operation twice, once to collect and insert gaps for the
starting multiple alignment, and once for the newly added sequences. Accordingly, we
loop through the alignment using enumerate() so that we extract both an index number and
a fractions dictionary (which is what a profile is composed of). If the fractions dictionary
is None, i.e. missing, then we have a gap in that profile and we place the current index in a
list of gap positions.
gaps = []
for j, fractions in enumerate(alignA):
if fractions is None:
gaps.append(j)
Once the gap locations are collected, all the sequences in that part of the alignment are
looped through and are redefined with extra dashes, i.e. we add a column of ‘-’ at the gap
location. In this round we are putting dashes into the original multipleAlign list.
for j, seq in enumerate(multipleAlign):
for gap in gaps:
seq = seq[:gap] + '-' + seq[gap:]
multipleAlign[j] = seq
Then a second round of gap insertion is done, this time collecting locations from the
second align profile, alignB, and placing the gaps in the toAdd list.
gaps = []
for j, fractions in enumerate(alignB):
if fractions is None:
gaps.append(j)
for j, seq in enumerate(toAdd):
for gap in gaps:
seq = seq[:gap] + '-' + seq[gap:]
toAdd[j] = seq
With all the gaps placed, the toAdd list is then added to the multipleAlign list to form a
new, deeper alignment. The .extend() function is used because we are joining two lists.
multipleAlign.extend(toAdd)
Finally, outside the loop, having combined all of the sequences, we pass back the
completed multiple alignment and test the function.
return multipleAlign
# To test and print out the result
seqs = ['SRPAPVVLIILCVMAGVIGTILLISYGIRLLIK',
'TVPAPVVIILIILCVMAGIIGTILLLIISYTIRRLIK',
'HHFSEPEITLIIFGVMAGVIGTILLLIISYGIRLIK',
'HFSELVIALIIFGVMAGVIGTILFISYGSRLIK']
align = simpleProfileMultipleAlign(seqs, BLOSUM62)
for k, seq in enumerate(align):
print(k, seq)
# Result:
# 0 -SRPAPVV--LIILCVMAGVIGTI--LLISYGIRLLIK
# 1 -TVPAPVVIILIILCVMAGIIGTILLLIISYTIRRLIK
# 2 HHFSEPEI-TLIIFGVMAGVIGTILLLIISYGIR-LIK
# 3 -HFSELVI-ALIIFGVMAGVIGTI--LFISYGSR-LIK
Note that in the above test, after we have obtained the alignment we loop through the
list of sequences of which it is composed using the inbuilt enumerate() function. This is so
we efficiently get back an index number at the same time as the one-letter sequences.
Do'stlaringiz bilan baham: