Python Programming for Biology: Bioinformatics and Beyond


Pairwise alignment with Python



Download 7,75 Mb.
Pdf ko'rish
bet173/514
Sana30.12.2021
Hajmi7,75 Mb.
#91066
1   ...   169   170   171   172   173   174   175   176   ...   514
Bog'liq
[Tim J. Stevens, Wayne Boucher] Python Programming

Pairwise alignment with Python

The  following  Python  function  is  an  example  of  how  dynamic  programming  is  used  in

practice  to  generate  an  optimal  alignment  for  a  pair  of  sequences.  The  method  is  very

similar  to  the  one  described  by  Needleman  and  Wunsch.

14

 Another  popular  dynamic



programming  method  for  sequence  alignment  was  introduced  by  Smith  and  Waterman,

which focuses more on local solutions.

15

However, in this case the underlying principle is



the same, and the only major difference is that a Smith-Waterman alignment doesn’t allow

the  intermediate  alignment  scores  to  dip  below  zero,  so  that  local  (off  diagonal)

alignments are more prominent.

The following discussion will split the alignment function into more easily understood

sections but the code, in its concatenated entirety, can be viewed in the on-line material at

http://www.cambridge.org/pythonforbiology

.  First  we  begin  with  the  function  definition

that takes two one-letter sequences, a substitution/similarity matrix and two different gap

penalties.  Note  that  you  would  normally  check  that  the  input  values  were  sensible  (e.g.

sequences were really one-letter and upper case) before using the function

def sequenceAlign(seqA, seqB, simMatrix=DNA_2, insert=8, extend=4):

Next  we  define  two  numbers  which  represent  the  size  of  the  comparison  grid.  As




illustrated above, the comparison grid is simply a matrix of one sequence versus the other,

so that each point is a different possible residue pair. The size of the grid is one larger than

the lengths of the sequences because we require an extra row and column at the start of the

matrix (left and top) to contain starting values to begin growing the alignments from.

numI = len(seqA) + 1

numJ = len(seqB) + 1

Next  we  create  the  matrices  (actually  lists  of  lists  here,  but  we  could  use  NumPy

matrices) which represent the comparison grid. We will actually use two matrices of equal

size; one is used to contain the sub-totals of the alignment scores and the other is used to

record  which  route  (sub-alignment)  was  taken,  i.e.  whether  the  best  route  to  this  point

involved  a  gap  or  a  residue  comparison.  In  the  route  matrix  we  will  use  a  code  to  say

which of the three directions was taken, where 0 represents the pairing of two residues, 1

is for a gap in seqB (but a residue in seqA) and 2  is  for  a  gap  in  seqA  (but  a  residue  in

seqB). We build the preliminary blank matrices of the right size initially with zeros:

16

scoreMatrix = [[0] * numJ for x in range(numI)]



routeMatrix = [[0] * numJ for x in range(numI)]

Then for the route matrix we adjust the top and left edges, except the first element, so

that  they  contain  route  codes  that  represent  gaps.  This  means  that  for  alignments  where

one  sequence  is  indented  relative  to  the  other  the  only  route  to  get  to  that  point  is  by

having a series of leading gaps in the indented sequence. We will leave the edges of the

score matrix at zero, so as not to penalise alignments that begin with a gap. However, in

some situations you might want to have a penalty for this.

for i in range(1, numI):

routeMatrix[i][0] = 1

for j in range(1, numJ):

routeMatrix[0][j] = 2

Given that the edges of the route and score matrix are now defined, we now fill in the

remainder  of  the  matrix  values  by  looping  through  i  and  j,  the  indices  of  the  rows  and

columns, starting at 1, rather than 0, due to the extra edges which were added.

for i in range(1, numI):

for j in range(1, numJ):

We  need  to  decide  which  gap  penalty  is  relevant  for  each  matrix  element.  So  we  use

some logic which means that the insertion penalty is used, except if the previous position

has  a  gap,  in  which  case  we  use  the  extension  penalty.  Note  that  we  make  this  decision

twice  and  define  two  separate  gap  penalty  values,  one  for  each  sequence.  Detecting

whether the previous position has a gap is a simple matter of checking the route matrix in

the previous row (i-1) or column (j-1) to see if it has a gap code (1 or 2).

penalty1 = insert

penalty2 = insert

if routeMatrix[i-1][j] == 1:



penalty1 = extend

elif routeMatrix[i][j-1] == 2:

penalty2 = extend

Next  we  look  up  the  similarity  score  for  comparing  the  residue  codes  in  this  row  and

column. Note that we get the position for the residues within the two sequences using i-1

and j-1; this is because the row and column in the comparison matrix is one larger than the

equivalent position in the sequences, due to the extra initialisation values put at the start.

similarity = simMatrix[ seqA[i-1] ][ seqB[j-1] ]

With  the  gap  penalties  and  the  similarity  scores  defined,  we  make  a  list  of  the  three

possible path options for extending the alignment at this position. These paths are the sub-

total  scores  of  three  neighbouring  matrix  elements  (representing  the  three  shorter  sub-

alignments that meet at this point) with either the similarity score added or a gap penalty

subtracted.  The  first  path  (route  code  0)  means  adding  two  residues  and  using  their

similarity score, effectively going diagonally in the comparison matrix i-1, j-1 to i,j). The

second path (route code 1) means having a gap in seqB and subtracting a penalty, going

down a row i-1, j to i,j. Likewise the third path (route code 2) subtracts a penalty for a gap

in seqA and goes across a column i, j-1 to i,j.

paths = [scoreMatrix[i-1][j-1] + similarity, # Route 0

scoreMatrix[i-1][j] - penalty1, # Route 1

scoreMatrix[i][j-1] - penalty2] # Route 2

The  best  sub-total  score  is  simply  the  maximum  value  in  the  paths  list,  and  the  route

code is the index (position starting at zero) of this score. Note how the paths.index is the

reason for using the numeric route codes; they are arbitrary, but come naturally from the

order of the above list.

best = max(paths)

route = paths.index(best)

Finally for the loop we store the best score as the sub-total for this matrix element, and

the best route to get to this point in the route matrix. In a subsequent loop these values will

be considered to extend the alignment further. The route might not survive, but at least it

will be considered.

scoreMatrix[i][j] = best

routeMatrix[i][j] = route

Now  that  the  matrix  of  scores  and  matrix  of  routes  are  filled  in  we  have  reached  our

destination;  routes  have  met  at  the  ends  of  the  two  sequences.  The  task  now  is  to  work

back  from  the  end  of  the  paths  (bottom  right  of  the  comparison  matrix),  following  the

winning  routes  backwards  to  the  start  of  the  sequences,  at  each  point  adding  the

appropriate gap or paired residues. Thus, we define initially blank alignment lists, one for

each input sequence that we will fill with residue codes and dashes in reverse order:

alignA = []

alignB = []




Next we get the row and column for the end of the alignment, which are one less than

the corresponding sizes, given that list indices start at zero. We also record the score for

this final position, which we can give back as the overall score for the overall alignment.

i = numI-1

j = numJ-1

score = scoreMatrix[i][j]

The next loop fills in the alignment strings by going back along the rows and columns

(decreasing  values  of  i  and  j),  following  the  winning  route  at  each  step.  Naturally  we

continue  while  we  still  have  sequence  alignment  to  fill  in.  Note  how  the  while  loop

continues if either the row or column is bigger than zero; it only stops when both are zero,

when at the very start of the alignment. Each time the operations go through the loop there

will be a new row and column combination, and thus a new route code is defined for this

point.

while i > 0 or j > 0:



route = routeMatrix[i][j]

Depending  on  the  route  code,  i.e.  which  of  the  paths  won  out  to  get  to  this  point,  we

grow  the  alignment  strings  in  one  of  three  ways.  Route  0  indicates  that  the  maximum

score  came  from  comparing  two  residues,  so  in  this  case  we  add  residue  letters  to  both

alignment strings and decrease both the row and column by one. Routes 1 and 2 indicate

that the best score came from having a penalty and inserting a gap, in which case we add a

dash to one sequence and a residue to the other, decreasing either the row or the column as

appropriate. Again, note that the indices for getting residue letters from the sequences are

one less than the row and column indices.

if route == 0: # Diagonal

alignA.append( seqA[i-1] )

alignB.append( seqB[j-1] )

i -= 1

j -= 1


elif route == 1: # Gap in seqB

alignA.append( seqA[i-1] )

alignB.append( '-' )

i -= 1


elif route == 2: # Gap in seqA

alignA.append( '-' )

alignB.append( seqB[j-1] )

j -= 1


Lastly we reverse the alignment lists, given that we were working backwards from the

end, and convert them to text strings which we send back with the overall score. Note that

working  with  lists  of  text  characters  in  this  way,  which  we  only  join  into  a  string  at  the

end, is quicker than repeatedly extending strings, especially for long sequences.

alignA.reverse()

alignB.reverse()

alignA = ''.join(alignA)



alignB = ''.join(alignB)

return score, alignA, alignB

And the function can be tested accordingly:

seqA = 'WFSEPEIST'

seqB = 'FSRPAVVIST'

score, alignA, alignB = sequenceAlign(seqA, seqB, BLOSUM62)

print(score) # 17

print(alignA) # WFSEPE--IST

print(alignB) # -FSRPAVVIST


Download 7,75 Mb.

Do'stlaringiz bilan baham:
1   ...   169   170   171   172   173   174   175   176   ...   514




Ma'lumotlar bazasi mualliflik huquqi bilan himoyalangan ©hozir.org 2024
ma'muriyatiga murojaat qiling

kiriting | ro'yxatdan o'tish
    Bosh sahifa
юртда тантана
Боғда битган
Бугун юртда
Эшитганлар жилманглар
Эшитмадим деманглар
битган бодомлар
Yangiariq tumani
qitish marakazi
Raqamli texnologiyalar
ilishida muhokamadan
tasdiqqa tavsiya
tavsiya etilgan
iqtisodiyot kafedrasi
steiermarkischen landesregierung
asarlaringizni yuboring
o'zingizning asarlaringizni
Iltimos faqat
faqat o'zingizning
steierm rkischen
landesregierung fachabteilung
rkischen landesregierung
hamshira loyihasi
loyihasi mavsum
faolyatining oqibatlari
asosiy adabiyotlar
fakulteti ahborot
ahborot havfsizligi
havfsizligi kafedrasi
fanidan bo’yicha
fakulteti iqtisodiyot
boshqaruv fakulteti
chiqarishda boshqaruv
ishlab chiqarishda
iqtisodiyot fakultet
multiservis tarmoqlari
fanidan asosiy
Uzbek fanidan
mavzulari potok
asosidagi multiservis
'aliyyil a'ziym
billahil 'aliyyil
illaa billahil
quvvata illaa
falah' deganida
Kompyuter savodxonligi
bo’yicha mustaqil
'alal falah'
Hayya 'alal
'alas soloh
Hayya 'alas
mavsum boyicha


yuklab olish