Using Python for hidden Markov models
So far we have covered a fair amount of general theory, so for the remainder of this
chapter we turn to Python versions of the key algorithms that will allow you to work
with hidden Markov models, showing how these can be practically implemented. Given
the definition of an HMM, the task naturally turns to extracting some
useful information. We usually want to know what the likely underlying hidden states are,
given some sequence of observable data. Because we are dealing with a probabilistic
model, and bearing in mind a Bayesian approach which considers alternative hypotheses,
there will be various competing sequences of hidden states or ‘routes’ through the Markov
chain. Often it is useful to have the most likely sequence of hidden states, hence we cover
the Viterbi algorithm. We then go on to the forward-backward algorithm, which will allow
us to estimate the probability of each state being present at a particular point in the
chain, to compare competing solutions and to indicate the points of greatest uncertainty
in our predictions.
We will demonstrate the use of these algorithms with a relatively simple example of a
hidden Markov model that relates to protein sequences. Specifically we will have states to
describe protein residues in sequences as being ‘buried’ or ‘exposed’. These reflect where
amino acid residues lie within the 3D structure of a folded protein. The buried residues
will be found in the interior of the protein, at its core, and the exposed residues will be on
its surface, i.e. exposed to the water solvent. We don’t expect state predictions from this
HMM to be especially accurate, given that protein folding is so complex, but the example
is simple enough for the purposes of this book. Also, we are basing the HMM on a well-known
observation: that the buried cores of proteins have a distinctly different complement of
amino acids from their exposed surfaces. The exposed amino acids
have a strong tendency to be hydrophilic and have side chains that contain polar and
charged atom groups, which can interact with the water. In contrast the buried residues
tend to be hydrophobic, and thus have non-polar, uncharged side chains.
A simple way to form the HMM would be to have only two hidden states, one for the
buried and one for the exposed category. In this case we would calculate the probabilities
of changing (or not changing) between states as we move through the protein sequence.
For an observed amino acid type there will be a probability that it came from each of the
two states. Hence the observed data, the protein sequence, would be emitted in a
probabilistic manner from these two states.
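As a rough sketch of how such a two-state model might be represented in Python, the
short snippet below sets out the start, transition and emission probabilities as NumPy
arrays. The numbers used are purely illustrative placeholders, not values estimated from
real structural data, and the variable names are simply our own choices.

import numpy as np

# Illustrative two-state HMM; all of the probabilities below are invented
# placeholders, not values estimated from real structural data
stateNames = ['buried', 'exposed']
aminoAcids = 'ACDEFGHIKLMNPQRSTVWY'  # one-letter codes for the 20 amino acids

# Probability of a sequence starting in each hidden state
startProbs = np.array([0.4, 0.6])

# Transition probabilities between hidden states: rows are the current
# state, columns the next state (order: buried, exposed)
transProbs = np.array([[0.7, 0.3],
                       [0.2, 0.8]])

# Emission probabilities: for each hidden state, the chance of observing
# each amino acid type; here simply uniform placeholders
emitProbs = np.full((2, len(aminoAcids)), 1.0/len(aminoAcids))

# Each row of transition and emission probabilities must sum to 1.0
assert np.allclose(transProbs.sum(axis=1), 1.0)
assert np.allclose(emitProbs.sum(axis=1), 1.0)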
In practice we will not use this simple HMM, rather we will have many more states.
Each state will be a combination of a buried/exposed label and an amino acid type, so
there will be 40 states in total. The reason behind this is to make the transition
probabilities more detailed, so the overall HMM becomes more accurate. By encoding the
amino acid type in the hidden state we will be able to incorporate the effect of the amino
acid type in the transitions of the chain. For example, where a valine residue is followed
by a serine residue we will know more precisely the probability of going from a buried or
exposed valine to a buried or exposed serine; in effect the transition probability matrix
will vary according to the sequence. If we use such
combined states the emission probabilities become trivial: the probability of getting the
observed amino acid is 1.0 if that amino acid is part of the hidden state and 0.0 otherwise.
Naturally, the more states we have in the HMM, the more statistical data we need to get
accurate probabilities; we need data for 40×40 transitions. However, the probabilities will
be estimated from a large amount of PDB data (containing protein 3D structures), so there
is little concern about getting good statistics here.
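To make the combined-state idea concrete, the sketch below shows one way the 40 states
and the trivial emission scheme could be set up in Python. The state ordering, variable
names and uniform transition values are our own illustrative choices; in practice the
transition probabilities would be estimated from PDB-derived counts.

import numpy as np

# Each hidden state pairs a buried/exposed label with an amino acid type,
# giving 2 x 20 = 40 states in total; names and ordering are illustrative
aminoAcids = 'ACDEFGHIKLMNPQRSTVWY'
categories = ['buried', 'exposed']
stateLabels = [(cat, aa) for cat in categories for aa in aminoAcids]
nStates = len(stateLabels)  # 40

# The 40 x 40 transition matrix; a uniform placeholder here, though in
# practice each value would come from counting successive residue pairs
# in protein structures
transProbs = np.full((nStates, nStates), 1.0/nStates)

# The emission probabilities are trivial: a state emits its own amino
# acid with probability 1.0 and every other amino acid with 0.0
emitProbs = np.zeros((nStates, len(aminoAcids)))
for i, (cat, aa) in enumerate(stateLabels):
  emitProbs[i, aminoAcids.index(aa)] = 1.0

With this arrangement an observed amino acid is compatible with only two of the 40
hidden states, its buried and exposed versions, so the algorithms described below
effectively choose between those two possibilities at each position in the sequence.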