Using Python for hidden Markov models
So far we have covered a fair amount of general theory, so for the remainder of this
chapter we turn to Python versions of the key algorithms that will allow you to work
with hidden Markov models, showing how these can be practically implemented. Given
the definition of an HMM, the task naturally turns to extracting some
useful information. We usually want to know what the likely underlying hidden states are,
given some sequence of observable data. Because we are dealing with a probabilistic
model, and bearing in mind a Bayesian approach which considers alternative hypotheses,
there will be various competing sequences of hidden states or ‘routes’ through the Markov
chain. Often it is useful to have the most likely sequence of hidden states, hence we cover
the Viterbi algorithm. We then go on to the forward-backward algorithm, which will allow
us to estimate the probability of each state being present at a particular point in the
chain, to compare competing solutions and to indicate the points of greatest uncertainty
in our predictions.
We will demonstrate the use of these algorithms with a relatively simple example of a
hidden Markov model that relates to protein sequences. Specifically we will have states to
describe protein residues in sequences as being ‘buried’ or ‘exposed’. These reflect where
amino acid residues lie within the 3D structure of a folded protein. The buried residues
will be found in the interior of the protein, at its core, and the exposed residues will be on
its surface, i.e. exposed to the water solvent. We don’t expect state predictions from this
HMM to be especially accurate, given that protein folding is so complex, but the example
is simple enough for the purposes of this book. Also, we are basing the HMM on a well-known
observation: that the buried cores of proteins have a distinctly different complement of
amino acids from their exposed surfaces. The exposed amino acids
have a strong tendency to be hydrophilic and have side chains that contain polar and
charged atom groups, which can interact with the water. In contrast the buried residues
tend to be hydrophobic, and thus have non-polar, uncharged side chains.
A simple way to form the HMM would be to have only two hidden states, one for the
buried and one for the exposed category. In this case we would calculate the probabilities
of changing (or not changing) between states as we move through the protein sequence.
For an observed amino acid type there will be a probability that it came from each of the
two states. Hence the observed data, the protein sequence, would be emitted in a
probabilistic manner from these two states.
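As a rough sketch of how such a two-state model might be represented in Python, the
short snippet below sets out the start, transition and emission probabilities as NumPy
arrays. The numbers used are purely illustrative placeholders, not values estimated from
real structural data, and the variable names are simply our own choices.

import numpy as np

# Illustrative two-state HMM; all of the probabilities below are invented
# placeholders, not values estimated from real structural data
stateNames = ['buried', 'exposed']
aminoAcids = 'ACDEFGHIKLMNPQRSTVWY'  # one-letter codes for the 20 amino acids

# Probability of a sequence starting in each hidden state
startProbs = np.array([0.4, 0.6])

# Transition probabilities between hidden states: rows are the current
# state, columns the next state (order: buried, exposed)
transProbs = np.array([[0.7, 0.3],
                       [0.2, 0.8]])

# Emission probabilities: for each hidden state, the chance of observing
# each amino acid type; here simply uniform placeholders
emitProbs = np.full((2, len(aminoAcids)), 1.0/len(aminoAcids))

# Each row of transition and emission probabilities must sum to 1.0
assert np.allclose(transProbs.sum(axis=1), 1.0)
assert np.allclose(emitProbs.sum(axis=1), 1.0)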
In practice we will not use this simple HMM, rather we will have many more states.
Each state will be a combination of a buried/exposed label and an amino acid type, so
there will be 40 states in total. The reason behind this is to make the transition
probabilities more detailed, so the overall HMM becomes more accurate. By encoding the
amino acid type in the hidden state we will be able to incorporate the effect of the amino
acid type in the transitions of the chain. For example, where a valine residue is followed
by a serine residue we will know more precisely the probability of going from a buried or
exposed valine to a buried or exposed serine; in effect the transition probability matrix
will vary according to the sequence. If we use such
combined states the emission probabilities become trivial: the probability of getting the
observed amino acid is 1.0 if that amino acid is part of the hidden state and 0.0 otherwise.
Naturally, the more states we have in the HMM, the more statistical data we need to get
accurate probabilities; we need data for 40×40 transitions. However, the probabilities will
be estimated from a large amount of PDB data (containing protein 3D structures), so there
is little concern about getting good statistics here.
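To make the combined-state idea concrete, the sketch below shows one way the 40 states
and the trivial emission scheme could be set up in Python. The state ordering, variable
names and uniform transition values are our own illustrative choices; in practice the
transition probabilities would be estimated from PDB-derived counts.

import numpy as np

# Each hidden state pairs a buried/exposed label with an amino acid type,
# giving 2 x 20 = 40 states in total; names and ordering are illustrative
aminoAcids = 'ACDEFGHIKLMNPQRSTVWY'
categories = ['buried', 'exposed']
stateLabels = [(cat, aa) for cat in categories for aa in aminoAcids]
nStates = len(stateLabels)  # 40

# The 40 x 40 transition matrix; a uniform placeholder here, though in
# practice each value would come from counting successive residue pairs
# in protein structures
transProbs = np.full((nStates, nStates), 1.0/nStates)

# The emission probabilities are trivial: a state emits its own amino
# acid with probability 1.0 and every other amino acid with 0.0
emitProbs = np.zeros((nStates, len(aminoAcids)))
for i, (cat, aa) in enumerate(stateLabels):
  emitProbs[i, aminoAcids.index(aa)] = 1.0

With this arrangement an observed amino acid is compatible with only two of the 40
hidden states, its buried and exposed versions, so the algorithms described below
effectively choose between those two possibilities at each position in the sequence.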