The Viterbi algorithm
The Viterbi algorithm is an efficient means of using the probabilities within an HMM to
predict the most likely sequence of hidden states that generates a sequence of known
outcomes: our observed data. There are many situations when it can be useful to get a
handle on the ‘best guess’ for the underlying states, and in the example at the end of this
chapter we will use the method to estimate the most likely sequence of buried/exposed
protein states. However, this is not to suggest that we are always satisfied with a single
optimum answer; considering a Bayesian approach it may actually be more relevant to
have a distribution of different hidden state sequences, which can tell us something about
the reliability of estimates.
The Viterbi algorithm is an example of dynamic programming, similar to the pairwise
sequence alignment method shown in
Chapter 12
, where we compute the path through the
hidden state transitions that maximises the overall probability. Rather than multiplying
probabilities, which would often result in very small numbers that cannot be represented
accurately in floating point representation,
11
we add the logarithms of the probabilities,
which will always be in a manageable range.
The procedure starts at the beginning of the HMM chain and calculates the probabilities
for all hidden states at subsequent positions until the end, where one ‘winning’ sequence
of hidden states is revealed. Given an array of different states there are several possible
transitions (paths) to get to the different states at the subsequent positions in the Markov
chain. At a given position where the probability of having each state is recorded, the
highest probability for a state in the next position is taken as the maximum of the
outcomes when these probabilities are multiplied by the appropriate transition and
emission probabilities (i.e. observing the data for the following position). For the protein
example this means we multiply by probabilities of going from one buried/exposed state
to the next and by the probability of observing a particular amino acid, given the next
state.
In the Viterbi algorithm, although there are many sequences of hidden states that can be
proposed to yield a particular observed state for a position in the chain, the sub-optimal
state sequences are discarded. The procedure can be viewed as using pre-computed
solutions for the prior chain positions. The dynamic programming core of the algorithm
means that we do not compute all possible routes, only the best ones for each position,
thus minimising the amount of computation that has to be done.
The implementation of the Viterbi algorithm is fairly simple using Python with NumPy
arrays. We go from one point in the sequence to the next in a for loop, given we know the
number of observations. Calculating the log-probabilities involves adding arrays for the
preceding states (or starting values) to the transition and emission arrays, noting that the
emission probabilities are conditioned given the actual observed data for the point. For
each step, choosing the best transition to get to each of the states in the subsequent
position is a simple maximisation of the summed log-probabilities. Here we use .argmax()
to give the indices for the highest-scoring transitions. This allows us not only to select the
required values, but also to record the list of winning indices at each position. Hence, once
the end is reached, the overall highest-scoring path will be determined. Note that we are
only recording the best route and log-probability for each state, so this operation is
relatively frugal on memory.
For the actual Python example we do the required NumPy imports for arrays and then
define a function which takes four arguments: a list of the observed data and log-
probabilities for the starting values, transition matrix and emissions. Note that it is
assumed that all the probabilities are NumPy arrays and that the logarithms were taken
beforehand (e.g. logArray = numpy.log(myArray)), so we can add rather than multiply.
For simplicity we will often refer to the log-probabilities simply as ‘scores’. The
observations may be a list or an array and encode the data as indices. So, for example, if
the data is a DNA sequence with G, C, A and T letters then obs is a list that selects from
four indices: 0, 1, 2 or 3. Here the order in which the data states are encoded is arbitrary,
but the order must naturally be the same as in the emission matrix pEmit.
from numpy import empty, array
def viterbi(obs, pStart, pTrans, pEmit):
Initially we record the number of hidden states from the input arrays, and then initialise
values for the log-probabilities scores. The first value for scoresPrev is initialised with the
starting scores and the emissions for the first data item. The dictionary pathsPrev that
records the score-maximising choices, and hence best path through the HMM, uses the
index of the state (here i) as a key and holds a list of indices, to record the prior states and
hence path. The initial value here is naturally i, where we start from.
nStates = len(pStart)
states = range(nStates)
scores = empty(nStates)
scoresPrev = pStart + pEmit[:,obs[0]]
pathsPrev = dict([(i, [i]) for i in states])
Next we loop through all of the subsequent observations, i.e. for each position in the
Markov chain after the start, noting that we record the observed value as val. For each
position we redefine the paths based on the previous ones so we create paths, which will
extend the previous array of state indices.
for val in obs[1:]:
paths = {}
For the current position we go through each state index i, and define the options for the
possible outcomes as a sum of log-probabilities. To the previous scores we add the
transition to get to state i and the probability of emitting the outcome val. Note that we
take the slice pTrans[:,i] so that we get the scores to transition from all states to the current
one, and that the emission of the observed value for this state pEmit[i,val] is just a single
number (and so is added to all array elements). The best score for the state is simply the
largest .max() and the index for this is extracted using .argmax(). The index bestState is
the winning state that we come from to get to i. This index is used to access the path for
that state: the winning one up to the previous point. The updated path list is then the
extension of the path we came from with the current state.
for i in states:
options = scoresPrev + pTrans[:,i] + pEmit[i,val]
bestState = options.argmax()
scores[i] = options.max()
paths[i] = pathsPrev[bestState] + [i]
Once all the states have been considered for this position we store the current paths and
scores as the previous ones for the next iteration.
pathsPrev = paths
scoresPrev = array(scores)
Finally, it is easy to get the best state (or index thereof) and its log-probability score
from the arrays we are left with at the end. The winning path and score is returned from
the function.
endState = scores.argmax()
logProb = scores.max()
return logProb, paths[endState]
Do'stlaringiz bilan baham: |