Python Programming for Biology: Bioinformatics and Beyond


Tree-guided multiple-sequence alignments



Download 7,75 Mb.
Pdf ko'rish
bet207/514
Sana30.12.2021
Hajmi7,75 Mb.
#91066
1   ...   203   204   205   206   207   208   209   210   ...   514
Bog'liq
[Tim J. Stevens, Wayne Boucher] Python Programming

Tree-guided multiple-sequence alignments

Although  we  may  just  want  to  admire  the  evolutionary  relationship  suggested  by  a

phylogenetic tree we can also use trees for many subsequent bioinformatics exercises. One

important example is in the generation of multiple-sequence alignments. When making a

tree, pairs of sequences are compared by alignment, but we can combine the sequences in

the  same  order  as  they  are  liked  into  the  tree  to  build  a  deeper  multiple  alignment.  The

example  of  a  multiple  alignment  we  gave  earlier  in

Chapter  13

 simply  combined

sequences in an arbitrary input order. Now we aim to do much better and link them in a

way  that  better  represents  how  they  diverged.  The  process  of  multiple  alignment  in  our

example  will  be  much  the  same;  profiles  are  aligned  and  gaps  are  inserted  so  that

sequences can be stacked; it is just the order of comparing the profiles that differs (which

now represents the branching nodes of the tree). We could make further improvements to

the  multiple  alignments  by  weighting  the  different  branches  when  we  combine  the

sequences. However, we will leave such complexities to programs like ClustalW etc.

The  multiple-alignment  function  naturally  takes  a  list  of  one-letter  sequences  and  a

substitution  matrix  as  input  arguments.  Then  inside  the  function  we  create  a  dictionary

multipleAlign  which  will  hold  the  sequences  as  they  are  combined.  Because  we  will  be

combining groups of sequences that correspond to the branches of a phylogenetic tree, this

dictionary will hold all of the partial alignments as they are joined together. The keys of

this  dictionary  represent  the  indices  of  the  sequences  that  were  used  to  make  each  sub-

alignment.  Initially  nothing  has  been  combined,  so  the  dictionary  is  filled  with  isolated

sequences  alone  in  a  list  ([seq,]),  keyed  by  a  single  index.  As  the  multiple  alignment

builds new indices will be added to indicate which sequences have been combined. These

new indices will be of the tuple form (x,y) where x and y could be plain index numbers or

other,  already  joined  keys.  For  example,  the  key  ‘(3,(1,2))’  will  be  for  an  alignment  that

first  combined  sequences  1  and  2  which  was  then  added  to  3  –  of  course  here  the  keys

‘(1,2)’ and ‘3’  will  already  exist  because  we  are  following  the  branches  of  the  tree  from

outside inwards.

from MultipleAlign import profile, profileAlign

def treeProfileMultipleAlign(seqs, simMatrix):




multipleAlign = {}

for i, seq in enumerate(seqs):

multipleAlign[i] = [seq,]

With  the  alignment  dictionary  initialised,  the  neighbourJoinTree()  function  is  called

using a distance matrix generated with the same input sequences and substitution matrix.

This will return not only the structure of the tree, in terms of indices, but also the order in

which sequences were combined. It is this second value that we will use to combine the

sequence alignment. A copy is made of the tree combination order, which can be altered as

we make the alignment. The original unmodified treeOrder list will be returned from the

function as output.

distMatrix = getDistanceMatrix(seqs, simMatrix)

tree, treeOrder = neighbourJoinTree(distMatrix)

joinStack = treeOrder[:]

Next  a  while  loop  is  set  up  to  continue  combining  sequences  while  we  still  have  tree

branches  to  consider.  Note  how  the  first  (index  zero)  element  is  removed  from  the

joinStack  list,  thus  shortening  it  while  at  the  same  time  providing  the  keys  (keyA  and

keyB)  to  extract  the  next  sub-alignments  to  be  joined.  Eventually  all  branch  points  will

have been considered and the joinStack list will be empty, whereupon the loop will stop.

while joinStack:

keyA, keyB = joinStack.pop(0)

These  keys  are  then  used  to  access  the  relevant  sub-alignments  from  the  larger

dictionary. These sub-alignments will then be combined, once gaps have been inserted, to

make a new, larger multiple alignment. This is equivalent to joining two branches of the

phylogenetic tree.

subAlignA = multipleAlign[keyA]

subAlignB = multipleAlign[keyB]

The  profiles  (lists  of  residue  fraction  dictionaries)  are  calculated  using  the  previously

defined  profile()  function  for  each  sub-alignment.  These  are  then  passed  into  the  profile

alignment  function,  again  previously  defined,  to  generate  a  score  (s,  which  will  be

completely ignored here) and two aligned profiles, i.e. with gaps.

profA = profile(subAlignA)

profB = profile(subAlignB)

s, profAlignA, profAlignB = profileAlign(profA, profB, simMatrix)

We insert the gaps in the previously combined multiple alignments (or single sequence

at the beginning) by looking for gaps in the newly aligned profile. Accordingly, we loop

though  the  profile  alignment  to  find  positions  (index  i)  where  there  is  no  fraction

dictionary, but rather a None object, indicating a gap.

gaps = []

for i, fractions in enumerate(profAlignA):

if fractions is None:

gaps.append(i)



With  the  gaps  collected,  another  loop  is  performed,  this  time  through  each  of  the

sequences in the sub-alignment. Each sequence is redefined by the insertion of dashes ‘-’

at the positions specified by the gaps list. Note that as the gaps are inserted the sequence

string will grow, so although all gap positions after the first are meaningless, with regard

to  the  initial  contents  of  seq,  once  all  preceding  gaps  have  been  inserted  the  next  gap

location will be in-step with the longer redefined seq. After the gap insertion the sequence

is re-inserted in the sub-alignment at the correct index [j].

for j, seq in enumerate(subAlignA):

for gap in gaps:

seq = seq[:gap] + '-' + seq[gap:]

subAlignA[j] = seq

The  gapping  procedure  is  repeated  using  the  gaps  from  the  second  of  the  aligned

profiles (profAlignB) to insert dashes into the second of the sub-alignments (subAlignB).

gaps = []

for i, fractions in enumerate(profAlignB):

if fractions is None:

gaps.append(i)

for j, seq in enumerate(subAlignB):

for gap in gaps:

seq = seq[:gap] + '-' + seq[gap:]

subAlignB[j] = seq

Now  that  the  sequences  have  had  the  appropriate  gaps  inserted  the  two  smaller

alignments  are  combined  and  placed  in  an  expanded  multipleAlign  dictionary.  The  new

key  for  this  alignment  is  simply  a  tuple  containing  the  two  keys  of  the  smaller  sub-

alignments that were combined in this iteration through the loop. The old sub-alignments

are no longer needed and can now be removed from the dictionary by deleting the entries

for the old, now combined, keys.

newKey = (keyA, keyB)

multipleAlign[newKey] = subAlignA + subAlignB

del multipleAlign[keyA]

del multipleAlign[keyB]

Finally  we  return  the  deepest  alignment  in  the  dictionary,  which  is  the  one  that  was

stored using the final value of newKey from the loop, and the tree data. This represents the

last joining in the treeOrder list, and thus represents the tree as a whole.

return multipleAlign[newKey], tree, treeOrder

We test the function with a list of input sequences and then print out the gapped output

sequences from the returned alignment.

align, tree, order = treeProfileMultipleAlign(seqs, BLOSUM62)

for seq in align:



print(seq)

And this gives:

-LAH-DFSQ--PVITV-IILGVMAGIIGIILLLAYVSRRLR-KR-----P-PADVP

QPVH-PFSRPAPVVIILIILCVMAGVIGTILLISYGIRLL--------------IK

QLVH-RFTVPAPVVIILIILCVMAGIIGTILLISYTIRRL--------------IK

MLEH-EFS--APVAIL-IILGVMAGIIGIILLISYSIGQIIKKRSVDIQP-PEDED

PIQH-DFP--ALVMIL-IILGVMAGIIGTILLISYCISRMTKKSSVDIQS-PEGGD

QLAH-HFSE--PEITL-IIFGVMAGVIGTILLISYGIRRLIKKSPSDVKPLPSP-D

QLVH-EFSE--LVIAL-IIFGVMAGVIGTILFISYGSRRLIKKSESDVQPLPPP-D

QLVH-IFSE--PVIIG-IIYAVMLGIIITILSIAFCIGQLTKKS-SLPAQVASPED

-SYHQDFSH--AEITG-IIFAVMAGLLLIIFLIAYLIRRMIKKPLPVPKPQDSP-D


Download 7,75 Mb.

Do'stlaringiz bilan baham:
1   ...   203   204   205   206   207   208   209   210   ...   514




Ma'lumotlar bazasi mualliflik huquqi bilan himoyalangan ©hozir.org 2024
ma'muriyatiga murojaat qiling

kiriting | ro'yxatdan o'tish
    Bosh sahifa
юртда тантана
Боғда битган
Бугун юртда
Эшитганлар жилманглар
Эшитмадим деманглар
битган бодомлар
Yangiariq tumani
qitish marakazi
Raqamli texnologiyalar
ilishida muhokamadan
tasdiqqa tavsiya
tavsiya etilgan
iqtisodiyot kafedrasi
steiermarkischen landesregierung
asarlaringizni yuboring
o'zingizning asarlaringizni
Iltimos faqat
faqat o'zingizning
steierm rkischen
landesregierung fachabteilung
rkischen landesregierung
hamshira loyihasi
loyihasi mavsum
faolyatining oqibatlari
asosiy adabiyotlar
fakulteti ahborot
ahborot havfsizligi
havfsizligi kafedrasi
fanidan bo’yicha
fakulteti iqtisodiyot
boshqaruv fakulteti
chiqarishda boshqaruv
ishlab chiqarishda
iqtisodiyot fakultet
multiservis tarmoqlari
fanidan asosiy
Uzbek fanidan
mavzulari potok
asosidagi multiservis
'aliyyil a'ziym
billahil 'aliyyil
illaa billahil
quvvata illaa
falah' deganida
Kompyuter savodxonligi
bo’yicha mustaqil
'alal falah'
Hayya 'alal
'alas soloh
Hayya 'alas
mavsum boyicha


yuklab olish