Python Programming for Biology: Bioinformatics and Beyond

Pairwise alignment with Python

Download 7,75 Mb.

Pdf ko'rish

bet	173/514
Sana	30.12.2021
Hajmi	7,75 Mb.
	#91066

1 ... 169 170 171 172 173 174 175 176 ... 514

Bog'liq
[Tim J. Stevens, Wayne Boucher] Python Programming

Pairwise alignment with Python

The following Python function is an example of how dynamic programming is used in

practice to generate an optimal alignment for a pair of sequences. The method is very

similar to the one described by Needleman and Wunsch.

Another popular dynamic

programming method for sequence alignment was introduced by Smith and Waterman,

which focuses more on local solutions.

However, in this case the underlying principle is

the same, and the only major difference is that a Smith-Waterman alignment doesn’t allow

the intermediate alignment scores to dip below zero, so that local (off diagonal)

alignments are more prominent.

The following discussion will split the alignment function into more easily understood

sections but the code, in its concatenated entirety, can be viewed in the on-line material at

http://www.cambridge.org/pythonforbiology

. First we begin with the function definition

that takes two one-letter sequences, a substitution/similarity matrix and two different gap

penalties. Note that you would normally check that the input values were sensible (e.g.

sequences were really one-letter and upper case) before using the function

def sequenceAlign(seqA, seqB, simMatrix=DNA_2, insert=8, extend=4):

Next we define two numbers which represent the size of the comparison grid. As

illustrated above, the comparison grid is simply a matrix of one sequence versus the other,

so that each point is a different possible residue pair. The size of the grid is one larger than

the lengths of the sequences because we require an extra row and column at the start of the

matrix (left and top) to contain starting values to begin growing the alignments from.

numI = len(seqA) + 1

numJ = len(seqB) + 1

Next we create the matrices (actually lists of lists here, but we could use NumPy

matrices) which represent the comparison grid. We will actually use two matrices of equal

size; one is used to contain the sub-totals of the alignment scores and the other is used to

record which route (sub-alignment) was taken, i.e. whether the best route to this point

involved a gap or a residue comparison. In the route matrix we will use a code to say

which of the three directions was taken, where 0 represents the pairing of two residues, 1

is for a gap in seqB (but a residue in seqA) and 2 is for a gap in seqA (but a residue in

seqB). We build the preliminary blank matrices of the right size initially with zeros:

scoreMatrix = [[0] * numJ for x in range(numI)]

routeMatrix = [[0] * numJ for x in range(numI)]

Then for the route matrix we adjust the top and left edges, except the first element, so

that they contain route codes that represent gaps. This means that for alignments where

one sequence is indented relative to the other the only route to get to that point is by

having a series of leading gaps in the indented sequence. We will leave the edges of the

score matrix at zero, so as not to penalise alignments that begin with a gap. However, in

some situations you might want to have a penalty for this.

for i in range(1, numI):

routeMatrix[i][0] = 1

for j in range(1, numJ):

routeMatrix[0][j] = 2

Given that the edges of the route and score matrix are now defined, we now fill in the

remainder of the matrix values by looping through i and j, the indices of the rows and

columns, starting at 1, rather than 0, due to the extra edges which were added.

for i in range(1, numI):

for j in range(1, numJ):

We need to decide which gap penalty is relevant for each matrix element. So we use

some logic which means that the insertion penalty is used, except if the previous position

has a gap, in which case we use the extension penalty. Note that we make this decision

twice and define two separate gap penalty values, one for each sequence. Detecting

whether the previous position has a gap is a simple matter of checking the route matrix in

the previous row (i-1) or column (j-1) to see if it has a gap code (1 or 2).

penalty1 = insert

penalty2 = insert

if routeMatrix[i-1][j] == 1:

penalty1 = extend

elif routeMatrix[i][j-1] == 2:

penalty2 = extend

Next we look up the similarity score for comparing the residue codes in this row and

column. Note that we get the position for the residues within the two sequences using i-1

and j-1; this is because the row and column in the comparison matrix is one larger than the

equivalent position in the sequences, due to the extra initialisation values put at the start.

similarity = simMatrix[ seqA[i-1] ][ seqB[j-1] ]

With the gap penalties and the similarity scores defined, we make a list of the three

possible path options for extending the alignment at this position. These paths are the sub-

total scores of three neighbouring matrix elements (representing the three shorter sub-

alignments that meet at this point) with either the similarity score added or a gap penalty

subtracted. The first path (route code 0) means adding two residues and using their

similarity score, effectively going diagonally in the comparison matrix i-1, j-1 to i,j). The

second path (route code 1) means having a gap in seqB and subtracting a penalty, going

down a row i-1, j to i,j. Likewise the third path (route code 2) subtracts a penalty for a gap

in seqA and goes across a column i, j-1 to i,j.

paths = [scoreMatrix[i-1][j-1] + similarity, # Route 0

scoreMatrix[i-1][j] - penalty1, # Route 1

scoreMatrix[i][j-1] - penalty2] # Route 2

The best sub-total score is simply the maximum value in the paths list, and the route

code is the index (position starting at zero) of this score. Note how the paths.index is the

reason for using the numeric route codes; they are arbitrary, but come naturally from the

order of the above list.

best = max(paths)

route = paths.index(best)

Finally for the loop we store the best score as the sub-total for this matrix element, and

the best route to get to this point in the route matrix. In a subsequent loop these values will

be considered to extend the alignment further. The route might not survive, but at least it

will be considered.

scoreMatrix[i][j] = best

routeMatrix[i][j] = route

Now that the matrix of scores and matrix of routes are filled in we have reached our

destination; routes have met at the ends of the two sequences. The task now is to work

back from the end of the paths (bottom right of the comparison matrix), following the

winning routes backwards to the start of the sequences, at each point adding the

appropriate gap or paired residues. Thus, we define initially blank alignment lists, one for

each input sequence that we will fill with residue codes and dashes in reverse order:

alignA = []

alignB = []

Next we get the row and column for the end of the alignment, which are one less than

the corresponding sizes, given that list indices start at zero. We also record the score for

this final position, which we can give back as the overall score for the overall alignment.

i = numI-1

j = numJ-1

score = scoreMatrix[i][j]

The next loop fills in the alignment strings by going back along the rows and columns

(decreasing values of i and j), following the winning route at each step. Naturally we

continue while we still have sequence alignment to fill in. Note how the while loop

continues if either the row or column is bigger than zero; it only stops when both are zero,

when at the very start of the alignment. Each time the operations go through the loop there

will be a new row and column combination, and thus a new route code is defined for this

point.

while i > 0 or j > 0:

route = routeMatrix[i][j]

Depending on the route code, i.e. which of the paths won out to get to this point, we

grow the alignment strings in one of three ways. Route 0 indicates that the maximum

score came from comparing two residues, so in this case we add residue letters to both

alignment strings and decrease both the row and column by one. Routes 1 and 2 indicate

that the best score came from having a penalty and inserting a gap, in which case we add a

dash to one sequence and a residue to the other, decreasing either the row or the column as

appropriate. Again, note that the indices for getting residue letters from the sequences are

one less than the row and column indices.

if route == 0: # Diagonal

alignA.append( seqA[i-1] )

alignB.append( seqB[j-1] )

i -= 1

j -= 1

elif route == 1: # Gap in seqB

alignA.append( seqA[i-1] )

alignB.append( '-' )

i -= 1

elif route == 2: # Gap in seqA

alignA.append( '-' )

alignB.append( seqB[j-1] )

j -= 1

Lastly we reverse the alignment lists, given that we were working backwards from the

end, and convert them to text strings which we send back with the overall score. Note that

working with lists of text characters in this way, which we only join into a string at the

end, is quicker than repeatedly extending strings, especially for long sequences.

alignA.reverse()

alignB.reverse()

alignA = ''.join(alignA)

alignB = ''.join(alignB)

return score, alignA, alignB

And the function can be tested accordingly:

seqA = 'WFSEPEIST'

seqB = 'FSRPAVVIST'

score, alignA, alignB = sequenceAlign(seqA, seqB, BLOSUM62)

print(score) # 17

print(alignA) # WFSEPE--IST

print(alignB) # -FSRPAVVIST

Download 7,75 Mb.

Do'stlaringiz bilan baham:

1 ... 169 170 171 172 173 174 175 176 ... 514