Python Programming for Biology: Bioinformatics and Beyond



Download 7,75 Mb.
Pdf ko'rish
bet335/514
Sana30.12.2021
Hajmi7,75 Mb.
#91066
1   ...   331   332   333   334   335   336   337   338   ...   514
Bog'liq
[Tim J. Stevens, Wayne Boucher] Python Programming

The Viterbi algorithm

The Viterbi algorithm is an efficient means of using the probabilities within an HMM to

predict  the  most  likely  sequence  of  hidden  states  that  generates  a  sequence  of  known

outcomes:  our  observed  data.  There  are  many  situations  when  it  can  be  useful  to  get  a

handle on the ‘best guess’ for the underlying states, and in the example at the end of this

chapter  we  will  use  the  method  to  estimate  the  most  likely  sequence  of  buried/exposed

protein  states.  However,  this  is  not  to  suggest  that  we  are  always  satisfied  with  a  single

optimum  answer;  considering  a  Bayesian  approach  it  may  actually  be  more  relevant  to

have a distribution of different hidden state sequences, which can tell us something about

the reliability of estimates.

The Viterbi algorithm is an example of dynamic programming, similar to the pairwise

sequence alignment method shown in

Chapter 12

, where we compute the path through the

hidden  state  transitions  that  maximises  the  overall  probability.  Rather  than  multiplying

probabilities, which would often result in very small numbers that cannot be represented

accurately  in  floating  point  representation,

11

 we  add  the  logarithms  of  the  probabilities,



which will always be in a manageable range.

The procedure starts at the beginning of the HMM chain and calculates the probabilities




for all hidden states at subsequent positions until the end, where one ‘winning’ sequence

of  hidden  states  is  revealed.  Given  an  array  of  different  states  there  are  several  possible

transitions (paths) to get to the different states at the subsequent positions in the Markov

chain.  At  a  given  position  where  the  probability  of  having  each  state  is  recorded,  the

highest  probability  for  a  state  in  the  next  position  is  taken  as  the  maximum  of  the

outcomes  when  these  probabilities  are  multiplied  by  the  appropriate  transition  and

emission probabilities (i.e. observing the data for the following position). For the protein

example this means we multiply by probabilities of going from one buried/exposed state

to  the  next  and  by  the  probability  of  observing  a  particular  amino  acid,  given  the  next

state.


In the Viterbi algorithm, although there are many sequences of hidden states that can be

proposed  to  yield  a  particular  observed  state  for  a  position  in  the  chain,  the  sub-optimal

state  sequences  are  discarded.  The  procedure  can  be  viewed  as  using  pre-computed

solutions  for  the  prior  chain  positions.  The  dynamic  programming  core  of  the  algorithm

means  that  we  do  not  compute  all  possible  routes,  only  the  best  ones  for  each  position,

thus minimising the amount of computation that has to be done.

The implementation of the Viterbi algorithm is fairly simple using Python with NumPy

arrays. We go from one point in the sequence to the next in a for loop, given we know the

number  of  observations.  Calculating  the  log-probabilities  involves  adding  arrays  for  the

preceding states (or starting values) to the transition and emission arrays, noting that the

emission  probabilities  are  conditioned  given  the  actual  observed  data  for  the  point.  For

each  step,  choosing  the  best  transition  to  get  to  each  of  the  states  in  the  subsequent

position is a simple maximisation of the summed log-probabilities. Here we use .argmax()

to give the indices for the highest-scoring transitions. This allows us not only to select the

required values, but also to record the list of winning indices at each position. Hence, once

the end is reached, the overall highest-scoring path will be determined. Note that we are

only  recording  the  best  route  and  log-probability  for  each  state,  so  this  operation  is

relatively frugal on memory.

For the actual Python example we do the required NumPy imports for arrays and then

define  a  function  which  takes  four  arguments:  a  list  of  the  observed  data  and  log-

probabilities  for  the  starting  values,  transition  matrix  and  emissions.  Note  that  it  is

assumed  that  all  the  probabilities  are  NumPy  arrays  and  that  the  logarithms  were  taken

beforehand  (e.g.  logArray  =  numpy.log(myArray)),  so  we  can  add  rather  than  multiply.

For  simplicity  we  will  often  refer  to  the  log-probabilities  simply  as  ‘scores’.  The

observations may be a list or an array and encode the data as indices. So, for example, if

the data is a DNA sequence with G, C, A and T letters then obs is a list that selects from

four indices: 0, 1, 2 or 3. Here the order in which the data states are encoded is arbitrary,

but the order must naturally be the same as in the emission matrix pEmit.

from numpy import empty, array

def viterbi(obs, pStart, pTrans, pEmit):

Initially we record the number of hidden states from the input arrays, and then initialise

values for the log-probabilities scores. The first value for scoresPrev is initialised with the

starting  scores  and  the  emissions  for  the  first  data  item.  The  dictionary  pathsPrev  that



records  the  score-maximising  choices,  and  hence  best  path  through  the  HMM,  uses  the

index of the state (here i) as a key and holds a list of indices, to record the prior states and

hence path. The initial value here is naturally i, where we start from.

nStates = len(pStart)

states = range(nStates)

scores = empty(nStates)

scoresPrev = pStart + pEmit[:,obs[0]]

pathsPrev = dict([(i, [i]) for i in states])

Next  we  loop  through  all  of  the  subsequent  observations,  i.e.  for  each  position  in  the

Markov  chain  after  the  start,  noting  that  we  record  the  observed  value  as  val.  For  each

position we redefine the paths based on the previous ones so we create paths, which will

extend the previous array of state indices.

for val in obs[1:]:

paths = {}

For the current position we go through each state index i, and define the options for the

possible  outcomes  as  a  sum  of  log-probabilities.  To  the  previous  scores  we  add  the

transition  to  get  to  state  i  and  the  probability  of  emitting  the  outcome  val.  Note  that  we

take the slice pTrans[:,i] so that we get the scores to transition from all states to the current

one, and that the emission of the observed value for this state pEmit[i,val] is just a single

number (and so is added to all array elements). The best score for the state is simply the

largest .max()  and  the  index  for  this  is  extracted  using  .argmax().  The  index  bestState  is

the winning state that we come from to get to i. This index is used to access the path for

that  state:  the  winning  one  up  to  the  previous  point.  The  updated  path  list  is  then  the

extension of the path we came from with the current state.

for i in states:

options = scoresPrev + pTrans[:,i] + pEmit[i,val]

bestState = options.argmax()

scores[i] = options.max()

paths[i] = pathsPrev[bestState] + [i]

Once all the states have been considered for this position we store the current paths and

scores as the previous ones for the next iteration.

pathsPrev = paths

scoresPrev = array(scores)

Finally,  it  is  easy  to  get  the  best  state  (or  index  thereof)  and  its  log-probability  score

from the arrays we are left with at the end. The winning path and score is returned from

the function.

endState = scores.argmax()

logProb = scores.max()

return logProb, paths[endState]




Download 7,75 Mb.

Do'stlaringiz bilan baham:
1   ...   331   332   333   334   335   336   337   338   ...   514




Ma'lumotlar bazasi mualliflik huquqi bilan himoyalangan ©hozir.org 2024
ma'muriyatiga murojaat qiling

kiriting | ro'yxatdan o'tish
    Bosh sahifa
юртда тантана
Боғда битган
Бугун юртда
Эшитганлар жилманглар
Эшитмадим деманглар
битган бодомлар
Yangiariq tumani
qitish marakazi
Raqamli texnologiyalar
ilishida muhokamadan
tasdiqqa tavsiya
tavsiya etilgan
iqtisodiyot kafedrasi
steiermarkischen landesregierung
asarlaringizni yuboring
o'zingizning asarlaringizni
Iltimos faqat
faqat o'zingizning
steierm rkischen
landesregierung fachabteilung
rkischen landesregierung
hamshira loyihasi
loyihasi mavsum
faolyatining oqibatlari
asosiy adabiyotlar
fakulteti ahborot
ahborot havfsizligi
havfsizligi kafedrasi
fanidan bo’yicha
fakulteti iqtisodiyot
boshqaruv fakulteti
chiqarishda boshqaruv
ishlab chiqarishda
iqtisodiyot fakultet
multiservis tarmoqlari
fanidan asosiy
Uzbek fanidan
mavzulari potok
asosidagi multiservis
'aliyyil a'ziym
billahil 'aliyyil
illaa billahil
quvvata illaa
falah' deganida
Kompyuter savodxonligi
bo’yicha mustaqil
'alal falah'
Hayya 'alal
'alas soloh
Hayya 'alas
mavsum boyicha


yuklab olish