Python Programming for Biology: Bioinformatics and Beyond


Using Python for hidden Markov models



Tim J. Stevens and Wayne Boucher


So far we have covered a fair amount of general theory, so for the remainder of this chapter we turn to Python versions of the key algorithms that allow you to work with hidden Markov models, showing how these can be practically implemented. Given the definition of an HMM, the task naturally turns to extracting useful information. We usually want to know what the likely underlying hidden states are, given some sequence of observable data. Because we are dealing with a probabilistic model, and bearing in mind a Bayesian approach which considers alternative hypotheses, there will be various competing sequences of hidden states, or 'routes', through the Markov chain. Often it is useful to have the most likely sequence of hidden states, hence we cover the Viterbi algorithm. We then go on to the forward-backward algorithm, which allows us to estimate the probability of a particular state being present at a particular point in the chain, to compare competing solutions and to indicate the points of greatest uncertainty in our predictions.
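As a taste of what is to come, the most likely route through the chain can be found with a short dynamic-programming function. The sketch below is only illustrative: the two-state parameters are made-up numbers, not values from the model developed in this chapter, and the calculation is done in log space simply to avoid numeric underflow on long sequences.

```python
from math import log

def viterbi(obs, states, startProbs, transProbs, emitProbs):
  """Return the most likely sequence of hidden states for the
  observations, working in log space to avoid underflow."""
  logs = {s: log(startProbs[s]) + log(emitProbs[s][obs[0]]) for s in states}
  paths = {s: [s] for s in states}

  for x in obs[1:]:
    newLogs = {}
    newPaths = {}
    for s in states:
      # Find the best previous state from which to arrive at s
      prev = max(states, key=lambda p: logs[p] + log(transProbs[p][s]))
      newLogs[s] = logs[prev] + log(transProbs[prev][s]) + log(emitProbs[s][x])
      newPaths[s] = paths[prev] + [s]
    logs, paths = newLogs, newPaths

  best = max(states, key=logs.get)
  return paths[best]

# Illustrative two-state example with made-up probabilities;
# 'V' (valine) is emitted more readily by the buried state and
# 'S' (serine) by the exposed state
states = ('buried', 'exposed')
startProbs = {'buried': 0.4, 'exposed': 0.6}
transProbs = {'buried':  {'buried': 0.7, 'exposed': 0.3},
              'exposed': {'buried': 0.3, 'exposed': 0.7}}
emitProbs = {'buried':  {'V': 0.6, 'S': 0.4},
             'exposed': {'V': 0.3, 'S': 0.7}}

print(viterbi('VVSS', states, startProbs, transProbs, emitProbs))
# ['buried', 'buried', 'exposed', 'exposed']
```

With these toy numbers the two valines are assigned to the buried state and the two serines to the exposed state, as the emission probabilities suggest.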

We will demonstrate the use of these algorithms with a relatively simple example of a hidden Markov model that relates to protein sequences. Specifically, we will have states that describe protein residues in sequences as being 'buried' or 'exposed'. These reflect where amino acid residues lie within the 3D structure of a folded protein: the buried residues will be found in the interior of the protein, at its core, and the exposed residues will be on its surface, i.e. exposed to the water solvent. We don't expect state predictions from this HMM to be especially accurate, given that protein folding is so complex, but the example is simple enough for this book. Also, we are basing the HMM on a well-known observation: that the buried cores of residues in proteins have a distinctly different complement of amino acids compared to the exposed surface. The exposed amino acids have a strong tendency to be hydrophilic, and have side chains that contain polar and charged atom groups, which can interact with the water. In contrast, the buried residues tend to be hydrophobic, and thus have non-polar, uncharged side chains.


A simple way to form the HMM would be to have only two hidden states, one each for the buried and exposed categories. In this case we would calculate the probabilities of changing (or not changing) between states as we move through the protein sequence. For an observed amino acid type there will be probabilities that it came from each of the two states. Hence the observed data, the protein sequence, would be emitted in a probabilistic manner from these two states.
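Such a two-state model amounts to a start distribution, a 2x2 transition matrix and, for each state, an emission distribution over amino acid types. A sketch in plain Python dictionaries follows; the alphabet is cut down to four amino acids and all probability values are illustrative placeholders, not estimates from real structural data:

```python
# Two hidden states: a residue is either buried in the protein core
# or exposed at the surface
states = ('buried', 'exposed')

# Probability of the chain starting in each state (illustrative)
startProbs = {'buried': 0.4, 'exposed': 0.6}

# Probabilities of changing, or not changing, state as we move one
# residue along the sequence (illustrative)
transProbs = {'buried':  {'buried': 0.7, 'exposed': 0.3},
              'exposed': {'buried': 0.3, 'exposed': 0.7}}

# Probability that each state emits a given amino acid type; here the
# alphabet is restricted to four types for brevity. Hydrophobic valine
# (V) and leucine (L) are more likely from the buried state, while
# hydrophilic serine (S) and aspartate (D) favour the exposed state.
emitProbs = {'buried':  {'V': 0.35, 'L': 0.35, 'S': 0.15, 'D': 0.15},
             'exposed': {'V': 0.15, 'L': 0.15, 'S': 0.35, 'D': 0.35}}

# Each probability distribution must sum to one
for state in states:
  assert abs(sum(transProbs[state].values()) - 1.0) < 1e-9
  assert abs(sum(emitProbs[state].values()) - 1.0) < 1e-9
```

These three dictionaries are exactly the inputs that algorithms like Viterbi and forward-backward operate on.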

In practice we will not use this simple HMM; rather, we will have many more states. Each state will be a combination of a buried/exposed label and an amino acid type, so there will be 40 states in total. The reason behind this is to make the transition probabilities more detailed, so that the overall HMM becomes more accurate. By encoding the amino acid type in the hidden state we will be able to incorporate the effect of the amino acid type in the transitions of the chain. For example, where a sequence position has a valine amino acid followed by a serine amino acid, we will know more precisely what the probability is of going from a buried or exposed valine to a buried or exposed serine; in effect the transition probability matrix will vary according to the sequence. If we use such combined states the emission probabilities become trivial: the probability of getting the observed amino acid is 1.0 if that amino acid is part of the hidden state and 0.0 otherwise. Naturally, the more states we have in the HMM the more statistical data we need in order to get accurate probabilities; we need data for 40x40 transitions. However, the probabilities will be estimated from a large amount of PDB data (containing protein 3D structures), so there is little concern about getting good statistics here.
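The combined states and their trivial emission probabilities could be set up along the following lines. The transition estimation here is a hypothetical sketch, counting adjacent state pairs in labelled sequences with a pseudocount, rather than the exact procedure applied to the PDB data:

```python
# One-letter codes for the 20 standard amino acids
aminoAcids = 'ACDEFGHIKLMNPQRSTVWY'

# Each hidden state pairs an amino acid type with a buried/exposed
# label, giving 20 x 2 = 40 states in total
stateList = [(aa, label) for aa in aminoAcids
                         for label in ('buried', 'exposed')]

def getEmitProb(state, observedAa):
  """Emission is trivial for combined states: the observed amino acid
  is certain if it matches the state's amino acid type, else impossible."""
  aa, label = state
  return 1.0 if aa == observedAa else 0.0

def estimateTransProbs(labelledSeqs, pseudoCount=1.0):
  """Estimate the 40 x 40 transition matrix from sequences of
  (aminoAcid, label) pairs, e.g. as derived from PDB structures.
  A pseudocount avoids zero probabilities for unseen transitions."""
  counts = {s1: {s2: pseudoCount for s2 in stateList} for s1 in stateList}
  for seq in labelledSeqs:
    for s1, s2 in zip(seq, seq[1:]):
      counts[s1][s2] += 1

  transProbs = {}
  for s1 in stateList:
    total = sum(counts[s1].values())  # normalise each row to sum to one
    transProbs[s1] = {s2: counts[s1][s2]/total for s2 in stateList}
  return transProbs

# A tiny made-up training example: one labelled sequence fragment
fragment = [('V', 'buried'), ('S', 'exposed'), ('V', 'buried')]
transProbs = estimateTransProbs([fragment])
```

Note how the observed transition from buried valine to exposed serine in the fragment makes that entry of the matrix slightly larger than the unseen alternatives, which all fall back to the pseudocount.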



