Implementing a protein sequence HMM
Finally in this chapter we end with a demonstration of using the above algorithms for
handling our example protein sequence HMM, so we can predict buried or exposed status.
To do this we must first obtain some probabilities for the transitions between the 40
different possible states. In Python we will label the states in the form (exposure,
aminoAcid). The probabilities are derived from a simple statistical analysis of a subset of
the PDB database. The actual subset of 3D structures coordinates used are from the VAST
chain set
13
and represent only protein amino acid chains that are dissimilar to each other,
within a predefined limit (p-value cut-off 10
−7
). The idea behind using this non-redundant
subset is to try to reduce the amount of bias that comes from having multiple entries for
closely related proteins. Overall, we would like the probabilities for our HMM to be
representative of proteins in general, and not skewed towards those kinds which have the
most structure data. This kind of problem is common when dealing with sequence data in
general (protein, DNA or RNA) and is something that a bioinformatician should be wary
of.
To determine whether each of the amino acids in our structure database is in a solvent-
exposed or buried context an external program was run on all of the PDB-format data
files. For reasons of space we will not discuss this in detail but the method is the one
described by Shrake and Rupley.
14
This calculates numerical values representing the
exposed surface area for each atom. For our HMM we want to have exposed or buried
categories for each residue, so the per-atom values needed to be converted. The algorithm
here is to add the values for all the atoms within a residue to get the exposed surface area
for the whole residue. This is then divided by the maximum surface area for that kind of
residue, which gives a fractional value and eliminates the effect of amino acid size. Buried
residues are then defined as those which have an exposed surface area of less than 7% of
the maximum.
15
The results from this procedure are represented in a file which
accompanies
this
book:
‘PdbSeqExposureCategories.txt’
(download
via
http://www.cambridge.org/pythonforbiology
). The lines of this file are in the form
illustrated below, with one line of one-letter amino acid codes followed by a second line of
the same length representing buried or exposed categories for the sequence. Here a dash
‘-’ represents an exposed position and an asterisk ‘*’ buried, so that the categorisation is
easy to see:
GSSGSSGHEETECPLRLAVCQHCDLELSILKLKEHEDYCGARTELCGNCGRNVLVKDLKTHPEVCGREGS
-------------------*------------------**------------*-------*---*-----
VWSVQIVDNAGLGANLALYPSGNSSTVPRYVTVTGYAPITFSEIGPKTVHQSWYITVHNGDDRAFQLGYEGGGVA
-*-*-----------*-----**-*---*****---******------------******-----****---*-*
With the exposure category data at hand we move on to calculating the probabilities
and generating the HMM. For this example we have plenty of data to derive probabilities
from the frequency of observing particular events. However, if data is missing, or
otherwise difficult to analyse, we could use a method like the Baum-Welch algorithm, to
estimate the emission and transition possibilities, given a sufficient amount of
representative training data. The Baum-Welch algorithm can actually be applied when
none of the probabilities are initially known, which in effect means that the HMM
probabilities can be ‘learned’ from even proportionately sparse data. In essence this is a
variety of machine learning, although we leave further discussion of this topic until
Chapter 24
. In our example case we simply count the occurrences of the different
buried/exposed transitions in the data file.
Initially we get a list of the exposure and amino acid type labels, the combination of
which gives the hidden state labels. Although we will be working with NumPy arrays, and
thus referring to the states by numeric indices, the order of symbols in these lists relates
each state index to a meaningful symbol.
expTypes = ['-','*',]
aaTypes = ['A','C','D','E','F','G','H','I','K','L',
'M','N','P','Q','R','S','T','V','W','Y']
The data counts and probabilities will be stored in NumPy arrays, so we define these
upfront as empty data structures of the required sizes, and of floating point data type.
nExp = len(expTypes) # Number of exposure categories
nAmino = len(aaTypes) # Number of amino acid types
nStates = nExp * nAmino # Number of HMM states
from numpy import zeros, log
pStart = zeros(nStates, float) # Starting probabilities
pTrans = zeros((nStates, nStates), float) # Transition probabilities
pEmit = zeros((nStates, nAmino), float) # Emission probabilities
Because we will be reading the data from a file containing textual symbols we will need
to relate those symbols to the correct indices in the NumPy arrays. Hence we create a
dictionary for all the textual state codes so we can quickly look up each index. This is a
simple matter of looping through all the exposure and amino acid symbols and for each
combination making a tuple that acts as a key to the index, which is incremented each time
in the loop. The additional stateDict is made so we can do the reverse look-up later on,
converting indices into symbols.
indexDict = {}
stateDict = {}
index = 0
for exposure in expTypes:
for aminoAcid in aaTypes:
stateKey = (exposure, aminoAcid)
indexDict[stateKey] = index
stateDict[index] = stateKey
index += 1
To fill the counts for the state transitions we read through the data file. We open the
data file for reading, remembering to use the full path to the file if it is not in the current
directory:
fileName = 'examples/PdbSeqExposureCategories.txt'
fileObj = open(fileName,'r')
The file is read two lines at a time using a while loop so that the amino acid sequence
comes from line1 and the exposure code from line2.
line1 = fileObj.readline()
line2 = fileObj.readline()
while line1 and line2:
sequence = line1.strip()
exposure = line2.strip()
For each sequence, a second loop goes through all adjacent pairs of amino acids and the
corresponding exposure codes. Combining the exposure and amino acid symbols gives a
key to look up the correct numeric indices. Note that we use .get() to skip situations where
we have unusual amino acids (not in our list of 20) and thus have an unknown state key.
n = len(sequence)
for i in range(n-2):
aa1, aa2 = sequence[i:i+2]
exp1, exp2 = exposure[i:i+2]
stateKey1 = (exp1, aa1)
stateKey2 = (exp2, aa2)
index1 = indexDict.get(stateKey1)
index2 = indexDict.get(stateKey2)
if index1 is None or index2 is None:
continue
The indices allow the counts to be incremented in the correct element of the arrays:
pStart[index1] += 1.0
pTrans[index1, index2] += 1.0
When the inner loop is done, we add a count for the last position (which was otherwise
missed because in getting sequence pairs we didn’t go to the end) and then get the next
lines for the while loop.
pStart[index2] += 1
line1 = fileObj.readline()
line2 = fileObj.readline()
The NumPy arrays contain counts, which are converted to probabilities by dividing by
their totals. For pTrans we divide the values in each row of the matrix by the total for that
row.
pStart /= pStart.sum()
for i in range(nStates):
pTrans[i] /= pTrans[i].sum()
Finally we fill the trivial emission probabilities, setting to 1.0 where the amino acid
matches the hidden state, and leaving values at 0.0 otherwise. Note that we use the look-
up dictionary to get the index for the state (the first dimension in the emission array) but
the plain amino acid index simply comes from looping through the list of amino acid types
with enumerate().
for exposure in expTypes:
for aminoIndex, aminoAcid in enumerate(aaTypes):
stateIndex = indexDict[(exposure, aminoAcid)]
pEmit[stateIndex, aminoIndex] = 1.0
With the probability arrays defined we can then test the Viterbi algorithm, noting that
we convert the probabilities to logarithms and represent the sequence as numeric indices,
rather than code letters. Note the small addition when taking logarithms, to deal with zero
probability values.
seq = "MYGKIIFVLLLSEIVSISASSTTGVAMHTSTSSSVTKSYISSQTNDTHKRDTYAATPRAH"\
"EVSEISVRTVYPPEEETGERVQLAHHFSEPEITLIIFGVMAGVIGTILLISYGIRRLIKK"\
"SPSDVKPLPSPDTDVPLSSVEIENPETSDQ"
obs = [aaTypes.index(aa) for aa in seq]
adj = 1e-99
logStart = log(pStart+adj)
logTrans = log(pTrans+adj)
logEmit = log(pEmit+adj)
logProbScore, path = viterbi(obs, logStart, logTrans, logEmit)
To generate a string of symbols representing the buried or exposed states, the indices
from the winning Viterbi path are used as keys to stateDict, which gives back the textual
symbols for the state (exposure, aminoAcid) and the first of these (hence [0]) is the
symbol we want.
bestExpCodes = ''.join([stateDict[i][0] for i in path])
print(seq)
print(bestExpCodes)
This will give the following output, albeit on one line:
MYGKIIFVLLLSEIVSISASSTTGVAMHTSTSSSVTKSYISSQTNDTHKRDTYAATPRAHEVSEISVRT
----*******---**------------------------*------------------------***---
VYPPEEETGERVQLAHHFSEPEITLIIFGVMAGVIGTILLISYGIRRLIKKSPSDVKPLPSPDTDVPLS
*----------*----------***************************---**------*-------------
SVEIENPETSDQ
------------
We can also test the implementation of the forward-backward algorithm and plot a
graph to show the underlying probabilities for exposed or buried states at each point in the
Markov chain. For each position we get back an array of smoothed probability values
(combining forward and backward values) and because of the way we created the data
arrays we know that the first 20 values correspond to the exposed category, and the
remainder the buried. The sum of each half of values will give the total probability for the
category, although for our example HMM only one amino acid (for the observed type) will
have a non-zero value.
smooth = forwardBackward(obs, pStart, pTrans, pEmit)
buriedList = []
exposeList = []
for values in smooth:
exposeList.append(sum(values[:20]))
buriedList.append(sum(values[20:]))
xAxisValues = list(range(len(exposeList))) # Sequence positions
from matplotlib import pyplot
pyplot.plot(xAxisValues, exposeList, c='#A0A0A0')
pyplot.plot(xAxisValues, buriedList, c='#000000')
pyplot.show()
Do'stlaringiz bilan baham: |