Pairwise alignment with Python
The following Python function is an example of how dynamic programming is used in
practice to generate an optimal alignment for a pair of sequences. The method is very
similar to the one described by Needleman and Wunsch.
14
Another popular dynamic
programming method for sequence alignment was introduced by Smith and Waterman,
which focuses more on local solutions.
15
However, in this case the underlying principle is
the same, and the only major difference is that a Smith-Waterman alignment doesn’t allow
the intermediate alignment scores to dip below zero, so that local (off diagonal)
alignments are more prominent.
The following discussion will split the alignment function into more easily understood
sections but the code, in its concatenated entirety, can be viewed in the on-line material at
http://www.cambridge.org/pythonforbiology
. First we begin with the function definition
that takes two one-letter sequences, a substitution/similarity matrix and two different gap
penalties. Note that you would normally check that the input values were sensible (e.g.
sequences were really one-letter and upper case) before using the function
def sequenceAlign(seqA, seqB, simMatrix=DNA_2, insert=8, extend=4):
Next we define two numbers which represent the size of the comparison grid. As
illustrated above, the comparison grid is simply a matrix of one sequence versus the other,
so that each point is a different possible residue pair. The size of the grid is one larger than
the lengths of the sequences because we require an extra row and column at the start of the
matrix (left and top) to contain starting values to begin growing the alignments from.
numI = len(seqA) + 1
numJ = len(seqB) + 1
Next we create the matrices (actually lists of lists here, but we could use NumPy
matrices) which represent the comparison grid. We will actually use two matrices of equal
size; one is used to contain the sub-totals of the alignment scores and the other is used to
record which route (sub-alignment) was taken, i.e. whether the best route to this point
involved a gap or a residue comparison. In the route matrix we will use a code to say
which of the three directions was taken, where 0 represents the pairing of two residues, 1
is for a gap in seqB (but a residue in seqA) and 2 is for a gap in seqA (but a residue in
seqB). We build the preliminary blank matrices of the right size initially with zeros:
16
scoreMatrix = [[0] * numJ for x in range(numI)]
routeMatrix = [[0] * numJ for x in range(numI)]
Then for the route matrix we adjust the top and left edges, except the first element, so
that they contain route codes that represent gaps. This means that for alignments where
one sequence is indented relative to the other the only route to get to that point is by
having a series of leading gaps in the indented sequence. We will leave the edges of the
score matrix at zero, so as not to penalise alignments that begin with a gap. However, in
some situations you might want to have a penalty for this.
for i in range(1, numI):
routeMatrix[i][0] = 1
for j in range(1, numJ):
routeMatrix[0][j] = 2
Given that the edges of the route and score matrix are now defined, we now fill in the
remainder of the matrix values by looping through i and j, the indices of the rows and
columns, starting at 1, rather than 0, due to the extra edges which were added.
for i in range(1, numI):
for j in range(1, numJ):
We need to decide which gap penalty is relevant for each matrix element. So we use
some logic which means that the insertion penalty is used, except if the previous position
has a gap, in which case we use the extension penalty. Note that we make this decision
twice and define two separate gap penalty values, one for each sequence. Detecting
whether the previous position has a gap is a simple matter of checking the route matrix in
the previous row (i-1) or column (j-1) to see if it has a gap code (1 or 2).
penalty1 = insert
penalty2 = insert
if routeMatrix[i-1][j] == 1:
penalty1 = extend
elif routeMatrix[i][j-1] == 2:
penalty2 = extend
Next we look up the similarity score for comparing the residue codes in this row and
column. Note that we get the position for the residues within the two sequences using i-1
and j-1; this is because the row and column in the comparison matrix is one larger than the
equivalent position in the sequences, due to the extra initialisation values put at the start.
similarity = simMatrix[ seqA[i-1] ][ seqB[j-1] ]
With the gap penalties and the similarity scores defined, we make a list of the three
possible path options for extending the alignment at this position. These paths are the sub-
total scores of three neighbouring matrix elements (representing the three shorter sub-
alignments that meet at this point) with either the similarity score added or a gap penalty
subtracted. The first path (route code 0) means adding two residues and using their
similarity score, effectively going diagonally in the comparison matrix i-1, j-1 to i,j). The
second path (route code 1) means having a gap in seqB and subtracting a penalty, going
down a row i-1, j to i,j. Likewise the third path (route code 2) subtracts a penalty for a gap
in seqA and goes across a column i, j-1 to i,j.
paths = [scoreMatrix[i-1][j-1] + similarity, # Route 0
scoreMatrix[i-1][j] - penalty1, # Route 1
scoreMatrix[i][j-1] - penalty2] # Route 2
The best sub-total score is simply the maximum value in the paths list, and the route
code is the index (position starting at zero) of this score. Note how the paths.index is the
reason for using the numeric route codes; they are arbitrary, but come naturally from the
order of the above list.
best = max(paths)
route = paths.index(best)
Finally for the loop we store the best score as the sub-total for this matrix element, and
the best route to get to this point in the route matrix. In a subsequent loop these values will
be considered to extend the alignment further. The route might not survive, but at least it
will be considered.
scoreMatrix[i][j] = best
routeMatrix[i][j] = route
Now that the matrix of scores and matrix of routes are filled in we have reached our
destination; routes have met at the ends of the two sequences. The task now is to work
back from the end of the paths (bottom right of the comparison matrix), following the
winning routes backwards to the start of the sequences, at each point adding the
appropriate gap or paired residues. Thus, we define initially blank alignment lists, one for
each input sequence that we will fill with residue codes and dashes in reverse order:
alignA = []
alignB = []
Next we get the row and column for the end of the alignment, which are one less than
the corresponding sizes, given that list indices start at zero. We also record the score for
this final position, which we can give back as the overall score for the overall alignment.
i = numI-1
j = numJ-1
score = scoreMatrix[i][j]
The next loop fills in the alignment strings by going back along the rows and columns
(decreasing values of i and j), following the winning route at each step. Naturally we
continue while we still have sequence alignment to fill in. Note how the while loop
continues if either the row or column is bigger than zero; it only stops when both are zero,
when at the very start of the alignment. Each time the operations go through the loop there
will be a new row and column combination, and thus a new route code is defined for this
point.
while i > 0 or j > 0:
route = routeMatrix[i][j]
Depending on the route code, i.e. which of the paths won out to get to this point, we
grow the alignment strings in one of three ways. Route 0 indicates that the maximum
score came from comparing two residues, so in this case we add residue letters to both
alignment strings and decrease both the row and column by one. Routes 1 and 2 indicate
that the best score came from having a penalty and inserting a gap, in which case we add a
dash to one sequence and a residue to the other, decreasing either the row or the column as
appropriate. Again, note that the indices for getting residue letters from the sequences are
one less than the row and column indices.
if route == 0: # Diagonal
alignA.append( seqA[i-1] )
alignB.append( seqB[j-1] )
i -= 1
j -= 1
elif route == 1: # Gap in seqB
alignA.append( seqA[i-1] )
alignB.append( '-' )
i -= 1
elif route == 2: # Gap in seqA
alignA.append( '-' )
alignB.append( seqB[j-1] )
j -= 1
Lastly we reverse the alignment lists, given that we were working backwards from the
end, and convert them to text strings which we send back with the overall score. Note that
working with lists of text characters in this way, which we only join into a string at the
end, is quicker than repeatedly extending strings, especially for long sequences.
alignA.reverse()
alignB.reverse()
alignA = ''.join(alignA)
alignB = ''.join(alignB)
return score, alignA, alignB
And the function can be tested accordingly:
seqA = 'WFSEPEIST'
seqB = 'FSRPAVVIST'
score, alignA, alignB = sequenceAlign(seqA, seqB, BLOSUM62)
print(score) # 17
print(alignA) # WFSEPE--IST
print(alignB) # -FSRPAVVIST
Do'stlaringiz bilan baham: |