Python Programming for Biology: Bioinformatics and Beyond

Download 7,75 Mb.

Pdf ko'rish

bet	335/514
Sana	30.12.2021
Hajmi	7,75 Mb.
	#91066

1 ... 331 332 333 334 335 336 337 338 ... 514

Bog'liq
[Tim J. Stevens, Wayne Boucher] Python Programming

The Viterbi algorithm

The Viterbi algorithm is an efficient means of using the probabilities within an HMM to

predict the most likely sequence of hidden states that generates a sequence of known

outcomes: our observed data. There are many situations when it can be useful to get a

handle on the ‘best guess’ for the underlying states, and in the example at the end of this

chapter we will use the method to estimate the most likely sequence of buried/exposed

protein states. However, this is not to suggest that we are always satisfied with a single

optimum answer; considering a Bayesian approach it may actually be more relevant to

have a distribution of different hidden state sequences, which can tell us something about

the reliability of estimates.

The Viterbi algorithm is an example of dynamic programming, similar to the pairwise

sequence alignment method shown in

Chapter 12

, where we compute the path through the

hidden state transitions that maximises the overall probability. Rather than multiplying

probabilities, which would often result in very small numbers that cannot be represented

accurately in floating point representation,

we add the logarithms of the probabilities,

which will always be in a manageable range.

The procedure starts at the beginning of the HMM chain and calculates the probabilities

for all hidden states at subsequent positions until the end, where one ‘winning’ sequence

of hidden states is revealed. Given an array of different states there are several possible

transitions (paths) to get to the different states at the subsequent positions in the Markov

chain. At a given position where the probability of having each state is recorded, the

highest probability for a state in the next position is taken as the maximum of the

outcomes when these probabilities are multiplied by the appropriate transition and

emission probabilities (i.e. observing the data for the following position). For the protein

example this means we multiply by probabilities of going from one buried/exposed state

to the next and by the probability of observing a particular amino acid, given the next

state.

In the Viterbi algorithm, although there are many sequences of hidden states that can be

proposed to yield a particular observed state for a position in the chain, the sub-optimal

state sequences are discarded. The procedure can be viewed as using pre-computed

solutions for the prior chain positions. The dynamic programming core of the algorithm

means that we do not compute all possible routes, only the best ones for each position,

thus minimising the amount of computation that has to be done.

The implementation of the Viterbi algorithm is fairly simple using Python with NumPy

arrays. We go from one point in the sequence to the next in a for loop, given we know the

number of observations. Calculating the log-probabilities involves adding arrays for the

preceding states (or starting values) to the transition and emission arrays, noting that the

emission probabilities are conditioned given the actual observed data for the point. For

each step, choosing the best transition to get to each of the states in the subsequent

position is a simple maximisation of the summed log-probabilities. Here we use .argmax()

to give the indices for the highest-scoring transitions. This allows us not only to select the

required values, but also to record the list of winning indices at each position. Hence, once

the end is reached, the overall highest-scoring path will be determined. Note that we are

only recording the best route and log-probability for each state, so this operation is

relatively frugal on memory.

For the actual Python example we do the required NumPy imports for arrays and then

define a function which takes four arguments: a list of the observed data and log-

probabilities for the starting values, transition matrix and emissions. Note that it is

assumed that all the probabilities are NumPy arrays and that the logarithms were taken

beforehand (e.g. logArray = numpy.log(myArray)), so we can add rather than multiply.

For simplicity we will often refer to the log-probabilities simply as ‘scores’. The

observations may be a list or an array and encode the data as indices. So, for example, if

the data is a DNA sequence with G, C, A and T letters then obs is a list that selects from

four indices: 0, 1, 2 or 3. Here the order in which the data states are encoded is arbitrary,

but the order must naturally be the same as in the emission matrix pEmit.

from numpy import empty, array

def viterbi(obs, pStart, pTrans, pEmit):

Initially we record the number of hidden states from the input arrays, and then initialise

values for the log-probabilities scores. The first value for scoresPrev is initialised with the

starting scores and the emissions for the first data item. The dictionary pathsPrev that

records the score-maximising choices, and hence best path through the HMM, uses the

index of the state (here i) as a key and holds a list of indices, to record the prior states and

hence path. The initial value here is naturally i, where we start from.

nStates = len(pStart)

states = range(nStates)

scores = empty(nStates)

scoresPrev = pStart + pEmit[:,obs[0]]

pathsPrev = dict([(i, [i]) for i in states])

Next we loop through all of the subsequent observations, i.e. for each position in the

Markov chain after the start, noting that we record the observed value as val. For each

position we redefine the paths based on the previous ones so we create paths, which will

extend the previous array of state indices.

for val in obs[1:]:

paths = {}

For the current position we go through each state index i, and define the options for the

possible outcomes as a sum of log-probabilities. To the previous scores we add the

transition to get to state i and the probability of emitting the outcome val. Note that we

take the slice pTrans[:,i] so that we get the scores to transition from all states to the current

one, and that the emission of the observed value for this state pEmit[i,val] is just a single

number (and so is added to all array elements). The best score for the state is simply the

largest .max() and the index for this is extracted using .argmax(). The index bestState is

the winning state that we come from to get to i. This index is used to access the path for

that state: the winning one up to the previous point. The updated path list is then the

extension of the path we came from with the current state.

for i in states:

options = scoresPrev + pTrans[:,i] + pEmit[i,val]

bestState = options.argmax()

scores[i] = options.max()

paths[i] = pathsPrev[bestState] + [i]

Once all the states have been considered for this position we store the current paths and

scores as the previous ones for the next iteration.

pathsPrev = paths

scoresPrev = array(scores)

Finally, it is easy to get the best state (or index thereof) and its log-probability score

from the arrays we are left with at the end. The winning path and score is returned from

the function.

endState = scores.argmax()

logProb = scores.max()

return logProb, paths[endState]

Download 7,75 Mb.

Do'stlaringiz bilan baham:

1 ... 331 332 333 334 335 336 337 338 ... 514