Python Programming for Biology: Bioinformatics and Beyond


Figure 14.5.  The basic construction of a phylogenetic tree using neighbour-joining



Download 7,75 Mb.
Pdf ko'rish
bet206/514
Sana30.12.2021
Hajmi7,75 Mb.
#91066
1   ...   202   203   204   205   206   207   208   209   ...   514
Bog'liq
[Tim J. Stevens, Wayne Boucher] Python Programming

Figure 14.5.  The basic construction of a phylogenetic tree using neighbour-joining.

Although many more rigorous methods exist, phylogenetic trees can be constructed from

sequences using the relatively simple neighbour-joining method. This involves creating

pairwise alignments between all sequences to determine their similarity, and thus also

rough evolutionary distance from one another. The resulting distance matrix is then used

to construct the tree. This is done in a cycle where the most similar pair each time is

combined to make a branch point and reduce the size of the matrix, until all sequences are

joined.


Before  we  get  to  the  main  tree-generating  Python  function  we  will  construct  some

ancillary functions that will be used inside the main routine. This will help understanding

of the process by breaking it into operations that can be thought about separately. Also, the

function that makes the tree will be neater and easier to understand. The downside to this

way of doing things is the relative speed penalty incurred with calling Python functions.

Thus if speed was an issue you could forego the separate functions and embed their code

directly  in  the  main  section.  If  even  more  speed  is  required  we  would  recommend  a

Python/C hybrid language approach, for example, using Cython, as discussed in

Chapter

27

.




To start with we make a simple distance matrix generation function that accepts a list of

sequences  and  a  substitution  matrix  as  input  arguments.  It  defines  an  initially  blank

matrix

7

 of  zeros  with  a  row  and  column  for  each  sequence  (n  sequences),  such  that  the



matrix elements will eventually be filled with the distance between all possible sequence

pairs. Then we calculate a list of maximum possible scores for sequence comparisons, by

comparing  each  sequence  with  itself  using  the  calcSeqSimilarity()  function  defined  in

Chapter  12

.  These  values  will  give  a  baseline  for  the  other  scores.  The  row  and  column

indices  in  the  matrix  are  then  looped  through  (i  and  j)  and  for  each  index  the

corresponding  sequence  is  obtained.  Note  how  two  for  loops  are  used  but  we  don’t  go

from index 0 to n for these; instead the first loop goes up to n-1, and the second starts from

i+1. This means we only consider one half of the matrix, excluding the diagonal. We can

reduce  the  number  of  loops  this  way,  and  thus  save  some  time,  because  the  matrix  is

symmetric, i.e. matrix[i][j] is the same as matrix[j][i].

from Alignments import sequenceAlign, calcSeqSimilarity

from math import exp

def getDistanceMatrix(seqs, simMatrix):

n = len(seqs)

matrix = [[0.0] * n for x in range(n)]

maxScores = [calcSeqSimilarity(x, x, simMatrix) for x in seqs]

for i in range(n-1):

seqA = seqs[i]

for j in range(i+1,n):

seqB = seqs[j]

score, alignA, alignB = sequenceAlign(seqA, seqB, simMatrix)

maxScore = max(maxScores[i],maxScores[j])

dist = maxScore - score

matrix[i][j] = dist

matrix[j][i] = dist

return matrix

The sequences for each matrix element are aligned using the input substitution matrix.

The alignment thus generated is passed back with a score, which is then converted into a

distance.  The  maximum  possible  score  is  calculated  as  the  maximum  from  amongst  the

two  maxScores  list  values  for  the  sequences  compared  in  this  iteration.  The  distance  is

then calculated as the difference between this maximum possible score and the observed

alignment  score.  Accordingly,  if  the  sequences  are  identical  the  alignment  score  value  is

equal to the maximum and the distance is zero, and the more dissimilar the sequences, the

larger  the  distance  between  them.  If  we  were  to  use  the  simple  DNA  substitution  matrix

DNA_1, the distance would represent the number of non-identical positions. The distance

value is then inserted into the matrix at the given row and column.

Here we test the function using previously defined lists of sequences (without gaps) and

substitution matrices. The output is looped through row by row and the values in each row

are displayed in a formatted manner (to two decimal places using ‘%.2f’) to keep things




neat on screen.

seqs = ['QPVHPFSRPAPVVIILIILCVMAGVIGTILLISYGIRLLIK',

'QLVHRFTVPAPVVIILIILCVMAGIIGTILLISYTIRRLIK',

'QLAHHFSEPEITLIIFGVMAGVIGTILLISYGIRRLIKKSPSDVKPLPSPD',

'QLVHEFSELVIALIIFGVMAGVIGTILFISYGSRRLIKKSESDVQPLPPPD',

'MLEHEFSAPVAILIILGVMAGIIGIILLISYSIGQIIKKRSVDIQPPEDED',

'PIQHDFPALVMILIILGVMAGIIGTILLISYCISRMTKKSSVDIQSPEGGD',

'QLVHIFSEPVIIGIIYAVMLGIIITILSIAFCIGQLTKKSSLPAQVASPED',

'LAHDFSQPVITVIILGVMAGIIGIILLLAYVSRRLRKRPPADVP',

'SYHQDFSHAEITGIIFAVMAGLLLIIFLIAYLIRRMIKKPLPVPKPQDSPD']

distMatrix = getDistanceMatrix(seqs, BLOSUM62)

distMatrix = getDistanceMatrix(align1, DNA_1)

for row in distMatrix:

print(['%.2f' % x for x in row])

# ['0.00', '2.00', '3.00', '4.00', '5.00', '5.00', '5.00']

# …


Next  we  define  another  function  to  find  out  which  element  of  a  distance  matrix

represents  the  closest  sequences,  and  thus  which  sequences  should  be  joined  next  when

making  a  tree.  Note  that  this  is  not  as  simple  as  finding  the  smallest  distance,  rather  we

calculate the value q (which is often described in the form of the Q-matrix) as the distance

scaled  by  the  number  of  branch  points  for  the  tree  (n-2)  minus  the  sum  of  the  row  and

column at that point. For each row and column (again only needing to consider half of the

matrix) we determine whether the q value is smaller than the best so far. If so (and for the

initial  loop,  where  minQ  is  None)  then  we  redefine  the  best  q,  and  record  the  row  and

column for this element as joinPair.

def getJoinPair(distMatrix):

n = len(distMatrix)

minQ = None

joinPair = None

for i in range(n-1):

sumRow = sum(distMatrix[i])

for j in range(i+1, n):

sumCol = sum(distMatrix[j])

dist = distMatrix[i][j]

q = (n-2)*dist - sumRow - sumCol

if (minQ is None) or (q < minQ):

minQ = q

joinPair = [i,j]

return joinPair

At  the  end  of  the  function  the  indices  for  the  closest  pair  of  sequences  are  returned.




Because of the for loops these indices are in a specific order, which helps later when the

tree is built. We can test the function using a previously defined distance matrix:

print(getJoinPair(distMatrix))

The  final  function  we  require  before  we  get  to  the  main  tree  construction  is  one  to

determine the distance from a point on the tree (sequence or branch point) to the branch

point that is created when joining up with another part of the tree. This function is required

because the neighbour-joining algorithm works using a distance matrix, and each time we

join sequences we need to know the distances to this new join point (node). Effectively the

joined parts of the tree are replaced by a new single node; i.e. two sequences are replaced

with an ancestor. When the distance matrix is recalculated we need to know how far each

of the joined sequences is from the new point.

The  function  takes  a  distance  matrix  and  two  indices,  i  and  j.  The  first  index  is  the

row/column for the sequence we want to find the distance from, and the second index is

the sequence we are joining with. The row and column for the two indices are extracted

directly, remembering that the distance matrix is symmetric, so the j column is the same as

the j row.

def getDistToJunction(distMatrix, i, j):

n = len(distMatrix)

row = distMatrix[i]

column = distMatrix[j]

dist = distMatrix[i][j] + (sum(row)-sum(column))/(n-2)

dist *= 0.5

return dist

The  distance  from  sequence  index  i  to  the  new  node  is  the  average  distance  (hence

multiplying by 0.5) between the two joined sequences (distMatrix[i][j]) and the distance to

the whole tree, i.e. for n-2 nodes. The distance to the rest of the tree is calculated by taking

the total (whole tree) distance for one sequence away from the total distance for the other

and  then  dividing  by  the  number  of  nodes  to  get  an  average.  If  the  total  tree  were  not

considered  the  branch  point  would  be  exactly  halfway  between  the  two  sequences.

However, with the rest of the tree potentially being closer to one of the sequences than the

other the branch point is pulled to one side.

Finally  we  get  to  the  function  that  actually  builds  the  tree.  This  takes  an  abstract

distance  matrix  as  input,  so  that  it  can  be  used  in  many  different  situations  (not  just  for

sequences),  but  which  can  be  generated  here  using  getDistanceMatrix()  using  a  list  of

sequences  and  a  residue  similarity  matrix  as  arguments.  As  mentioned  previously  this

could be modified to take multiple sequences for each taxon, rather than just one. Initially

in  the  function  a  list  joinOrder  is  initialised,  which  will  record  the  order  in  which

sequences are joined to build the tree, and we use the function defined above to make the

initial  matrix  of  distances  between  sequences.  The  tree  variable  is  initialised  as  a  list

containing index numbers, one for each sequence in the distance matrix. An alternative to

using simple numbers here would be to use names for the sequences if we had them. What



is  actually  used  for  the  tree  construction  is  not  important;  it  just  needs  to  be  a  tag  to

identify  each  taxon.  As  the  main  part  of  the  function  proceeds  the  tree  variable  will  be

modified, and each time a new branch is formed the tags for the two joined sections will

be  combined  (into  a  tuple).  In  the  end  the  nesting  of  pair  sequence  tags  in  this  list  will

represent the structure of the tree.

def neighbourJoinTree(distMatrix):

joinOrder = []

n = len(distMatrix)

tree = list(range(n)) # do not need list() in Python 2

Next we perform an iterative loop using the while statement. The loop continues while

the length of the distance matrix is greater than two, i.e. while we still have to work out

which  possible  bits  of  tree  to  combine.  Each  time  we  go  through  the  loop  a  new  branch

point is defined and the length of the distance matrix will decrease by one (two sequences

disappear and one new branch point appears). When there are only two tree parts to join

there  are  no  more  choices  to  be  made;  the  last  remaining  join  is  made  and  the  tree  is

complete.  Inside  the  loop  we  use  the  getJoinPair()  function  defined  above  to  determine

which  indices,  stored  as  x  and  y,  represent  the  closest  pair,  and  thus  should  be  joined

during this round.

while n > 2:

x, y = getJoinPair(distMatrix)

Just as the distance matrix gets smaller each time we join sequences so too does the tree

list. When we combine branches we remove the representation of those branches from the

list but add a new value for the larger, joined branch. Accordingly, the rows and columns

of the distance matrix have a direct correspondence to the positions in tree. The variable

node  is  defined  as  a  tuple  containing  the  two  parts  of  the  tree  that  are  to  be  combined.

These two parts correspond to the positions x and y (in the distance matrix and tree  list)

and  can  represent  single,  unjoined  sequence  tags  or  tags  that  are  already  in  a  branching

structure. The new branch node is added to the joinOrder list and the tree list. To see the

tree grow you could issue ‘print(tree)’ during the iteration.

node = (tree[x], tree[y])

joinOrder.append(node)

tree.append(node)

Then  we  delete  the  x  and  y  elements  from  the  tree  list  because  these  parts  have  now

been combined, and added as the new node to the end. Note that the y element is deleted

before  the  x  because  we  want  to  delete  the  largest  list  index  first.  Deleting  the  smaller

index x first would shuffle the remainder of the list containing the y position along by one,

complicating matters.

del tree[y]

del tree[x]

Next we have to adjust the distance matrix in light of the newly joined positions. First

we calculate the distance from the joined x and y positions to the new branch point using



the function getDistToJunction() explained in detail above.

distX = getDistToJunction(distMatrix, x, y)

distY = getDistToJunction(distMatrix, y, x)

A  new  row,  containing  zeros,  is  added  to  the  distance  matrix.  This  will  later  be  filled

with distances to the new branch point. Note that this new row has one more element than

the existing rows because it won’t be extended like the others in the following loop.

distMatrix.append([0] * (n+1))

Then we loop through all positions in the distance matrix, except the newly added one,

and select those that have not been joined this round (not x or y). The distance from each

of these positions to the new node is then defined as the average of the distances to the x,

y points minus the respective distances of x and y to their joining node.

for i in range(n):

if i not in (x,y):

dist = (distMatrix[x][i]-distX) + (distMatrix[y][i]-distY)

dist *= 0.5

This distance is added to the end of the distance matrix row. Also, the new row (n)  at

the bottom has the appropriate element filled in. Thus we have filled in the new distance

values for the ends of row and column i.

distMatrix[i].append(dist)

distMatrix[n][i] = dist

Once the loop through matrix positions is complete we will have a new row and column

in the distance matrix for the newly joined branch point. So next we delete the old x and y

positions from the matrix. Firstly we delete two rows, and then for the remaining rows we

loop  through  and  delete  the  appropriate  two  columns.  At  the  end  of  the  while  loop  we

decrease  n  by  one,  because  after  the  merge  the  distance  matrix  is  now  one  row/column

smaller.


del distMatrix[y]

del distMatrix[x]

for row in distMatrix:

del row[y]

del row[x]

n -= 1


At the end of the function we convert the tree list to a tuple, effectively saying that the

two remaining parts are joined (remember tuples are immutable). We add the whole tree to

the  list  of  join  events.  Finally  the  function  passes  back  the  tree  and  branch  combination

order, each of which is useful in different ways.

tree = tuple(tree)

joinOrder.append(tree)




return tree, joinOrder

For  the  above  function  the  output  tree  data  just  uses  sequence  identifiers,  it  doesn’t

contain  any  distance  information.  However,  it  is  trivial  to  make  modifications  that  will

allow  the  function  to  also  pass  back  distX  and  distY,  the  distance  to  the  branch  points,

along  with  the  identifiers,  e.g.  node  =  ((tree[x],  distX),  (tree[y],  distY)).  Finally  we  can

test the tree generation, first generating the distance matrix with a list of protein sequences

and the BLOSUM residue substitution scores:

distMatrix = getDistanceMatrix(seqs, BLOSUM62)

tree, treeJoinOrder = neighbourJoinTree(distMatrix)

print(tree) # Result : (((7, (0, 1)), (4, 5)), ((2, 3), (6, 8)))




Download 7,75 Mb.

Do'stlaringiz bilan baham:
1   ...   202   203   204   205   206   207   208   209   ...   514




Ma'lumotlar bazasi mualliflik huquqi bilan himoyalangan ©hozir.org 2024
ma'muriyatiga murojaat qiling

kiriting | ro'yxatdan o'tish
    Bosh sahifa
юртда тантана
Боғда битган
Бугун юртда
Эшитганлар жилманглар
Эшитмадим деманглар
битган бодомлар
Yangiariq tumani
qitish marakazi
Raqamli texnologiyalar
ilishida muhokamadan
tasdiqqa tavsiya
tavsiya etilgan
iqtisodiyot kafedrasi
steiermarkischen landesregierung
asarlaringizni yuboring
o'zingizning asarlaringizni
Iltimos faqat
faqat o'zingizning
steierm rkischen
landesregierung fachabteilung
rkischen landesregierung
hamshira loyihasi
loyihasi mavsum
faolyatining oqibatlari
asosiy adabiyotlar
fakulteti ahborot
ahborot havfsizligi
havfsizligi kafedrasi
fanidan bo’yicha
fakulteti iqtisodiyot
boshqaruv fakulteti
chiqarishda boshqaruv
ishlab chiqarishda
iqtisodiyot fakultet
multiservis tarmoqlari
fanidan asosiy
Uzbek fanidan
mavzulari potok
asosidagi multiservis
'aliyyil a'ziym
billahil 'aliyyil
illaa billahil
quvvata illaa
falah' deganida
Kompyuter savodxonligi
bo’yicha mustaqil
'alal falah'
Hayya 'alal
'alas soloh
Hayya 'alas
mavsum boyicha


yuklab olish