Python Programming for Biology: Bioinformatics and Beyond


Implementing a protein sequence HMM



Download 7,75 Mb.
Pdf ko'rish
bet337/514
Sana30.12.2021
Hajmi7,75 Mb.
#91066
1   ...   333   334   335   336   337   338   339   340   ...   514
Bog'liq
[Tim J. Stevens, Wayne Boucher] Python Programming

Implementing a protein sequence HMM

Finally  in  this  chapter  we  end  with  a  demonstration  of  using  the  above  algorithms  for

handling our example protein sequence HMM, so we can predict buried or exposed status.

To  do  this  we  must  first  obtain  some  probabilities  for  the  transitions  between  the  40

different  possible  states.  In  Python  we  will  label  the  states  in  the  form  (exposure,

aminoAcid). The probabilities are derived from a simple statistical analysis of a subset of

the PDB database. The actual subset of 3D structures coordinates used are from the VAST

chain set

13

and represent only protein amino acid chains that are dissimilar to each other,



within a predefined limit (p-value cut-off 10

−7

). The idea behind using this non-redundant



subset is to try to reduce the amount of bias that comes from having multiple entries for

closely  related  proteins.  Overall,  we  would  like  the  probabilities  for  our  HMM  to  be

representative of proteins in general, and not skewed towards those kinds which have the

most structure data. This kind of problem is common when dealing with sequence data in

general (protein, DNA or RNA) and is something that a bioinformatician should be wary

of.


To determine whether each of the amino acids in our structure database is in a solvent-

exposed  or  buried  context  an  external  program  was  run  on  all  of  the  PDB-format  data

files.  For  reasons  of  space  we  will  not  discuss  this  in  detail  but  the  method  is  the  one

described  by  Shrake  and  Rupley.

14

 This  calculates  numerical  values  representing  the



exposed  surface  area  for  each  atom.  For  our  HMM  we  want  to  have  exposed  or  buried

categories for each residue, so the per-atom values needed to be converted. The algorithm

here is to add the values for all the atoms within a residue to get the exposed surface area

for the whole residue. This is then divided by the maximum surface area for that kind of

residue, which gives a fractional value and eliminates the effect of amino acid size. Buried

residues are then defined as those which have an exposed surface area of less than 7% of

the  maximum.

15

 The  results  from  this  procedure  are  represented  in  a  file  which



accompanies

this


book:

‘PdbSeqExposureCategories.txt’

(download

via


http://www.cambridge.org/pythonforbiology

).  The  lines  of  this  file  are  in  the  form

illustrated below, with one line of one-letter amino acid codes followed by a second line of

the same length representing buried or exposed categories for the sequence. Here a dash

‘-’ represents an exposed position and an asterisk ‘*’  buried,  so  that  the  categorisation  is

easy to see:

GSSGSSGHEETECPLRLAVCQHCDLELSILKLKEHEDYCGARTELCGNCGRNVLVKDLKTHPEVCGREGS

-------------------*------------------**------------*-------*---*-----

VWSVQIVDNAGLGANLALYPSGNSSTVPRYVTVTGYAPITFSEIGPKTVHQSWYITVHNGDDRAFQLGYEGGGVA

-*-*-----------*-----**-*---*****---******------------******-----****---*-*




With  the  exposure  category  data  at  hand  we  move  on  to  calculating  the  probabilities

and generating the HMM. For this example we have plenty of data to derive probabilities

from  the  frequency  of  observing  particular  events.  However,  if  data  is  missing,  or

otherwise difficult to analyse, we could use a method like the Baum-Welch algorithm, to

estimate  the  emission  and  transition  possibilities,  given  a  sufficient  amount  of

representative  training  data.  The  Baum-Welch  algorithm  can  actually  be  applied  when

none  of  the  probabilities  are  initially  known,  which  in  effect  means  that  the  HMM

probabilities  can  be  ‘learned’  from  even  proportionately  sparse  data.  In  essence  this  is  a

variety  of  machine  learning,  although  we  leave  further  discussion  of  this  topic  until

Chapter  24

.  In  our  example  case  we  simply  count  the  occurrences  of  the  different

buried/exposed transitions in the data file.

Initially  we  get  a  list  of  the  exposure  and  amino  acid  type  labels,  the  combination  of

which gives the hidden state labels. Although we will be working with NumPy arrays, and

thus referring to the states by numeric indices, the order of symbols in these lists relates

each state index to a meaningful symbol.

expTypes = ['-','*',]

aaTypes = ['A','C','D','E','F','G','H','I','K','L',

'M','N','P','Q','R','S','T','V','W','Y']

The  data  counts  and  probabilities  will  be  stored  in  NumPy  arrays,  so  we  define  these

upfront as empty data structures of the required sizes, and of floating point data type.

nExp = len(expTypes) # Number of exposure categories

nAmino = len(aaTypes) # Number of amino acid types

nStates = nExp * nAmino # Number of HMM states

from numpy import zeros, log

pStart = zeros(nStates, float) # Starting probabilities

pTrans = zeros((nStates, nStates), float) # Transition probabilities

pEmit = zeros((nStates, nAmino), float) # Emission probabilities

Because we will be reading the data from a file containing textual symbols we will need

to  relate  those  symbols  to  the  correct  indices  in  the  NumPy  arrays.  Hence  we  create  a

dictionary for all the textual state codes so we can quickly look up each index. This is a

simple  matter  of  looping  through  all  the  exposure  and  amino  acid  symbols  and  for  each

combination making a tuple that acts as a key to the index, which is incremented each time

in  the  loop.  The  additional  stateDict  is  made  so  we  can  do  the  reverse  look-up  later  on,

converting indices into symbols.

indexDict = {}

stateDict = {}

index = 0

for exposure in expTypes:

for aminoAcid in aaTypes:

stateKey = (exposure, aminoAcid)

indexDict[stateKey] = index

stateDict[index] = stateKey

index += 1




To  fill  the  counts  for  the  state  transitions  we  read  through  the  data  file.  We  open  the

data file for reading, remembering to use the full path to the file if it is not in the current

directory:

fileName = 'examples/PdbSeqExposureCategories.txt'

fileObj = open(fileName,'r')

The file is read two lines at a time using a while loop so that the amino acid sequence

comes from line1 and the exposure code from line2.

line1 = fileObj.readline()

line2 = fileObj.readline()

while line1 and line2:

sequence = line1.strip()

exposure = line2.strip()

For each sequence, a second loop goes through all adjacent pairs of amino acids and the

corresponding exposure codes. Combining the exposure and amino acid symbols gives a

key to look up the correct numeric indices. Note that we use .get() to skip situations where

we have unusual amino acids (not in our list of 20) and thus have an unknown state key.

n = len(sequence)

for i in range(n-2):

aa1, aa2 = sequence[i:i+2]

exp1, exp2 = exposure[i:i+2]

stateKey1 = (exp1, aa1)

stateKey2 = (exp2, aa2)

index1 = indexDict.get(stateKey1)

index2 = indexDict.get(stateKey2)

if index1 is None or index2 is None:

continue


The indices allow the counts to be incremented in the correct element of the arrays:

pStart[index1] += 1.0

pTrans[index1, index2] += 1.0

When the inner loop is done, we add a count for the last position (which was otherwise

missed  because  in  getting  sequence  pairs  we  didn’t  go  to  the  end)  and  then  get  the  next

lines for the while loop.

pStart[index2] += 1

line1 = fileObj.readline()

line2 = fileObj.readline()

The NumPy arrays contain counts, which are converted to probabilities by dividing by

their totals. For pTrans we divide the values in each row of the matrix by the total for that

row.


pStart /= pStart.sum()

for i in range(nStates):

pTrans[i] /= pTrans[i].sum()



Finally  we  fill  the  trivial  emission  probabilities,  setting  to  1.0  where  the  amino  acid

matches the hidden state, and leaving values at 0.0 otherwise. Note that we use the look-

up dictionary to get the index for the state (the first dimension in the emission array) but

the plain amino acid index simply comes from looping through the list of amino acid types

with enumerate().

for exposure in expTypes:

for aminoIndex, aminoAcid in enumerate(aaTypes):

stateIndex = indexDict[(exposure, aminoAcid)]

pEmit[stateIndex, aminoIndex] = 1.0

With the probability arrays defined we can then test the Viterbi algorithm, noting that

we convert the probabilities to logarithms and represent the sequence as numeric indices,

rather than code letters. Note the small addition when taking logarithms, to deal with zero

probability values.

seq = "MYGKIIFVLLLSEIVSISASSTTGVAMHTSTSSSVTKSYISSQTNDTHKRDTYAATPRAH"\

"EVSEISVRTVYPPEEETGERVQLAHHFSEPEITLIIFGVMAGVIGTILLISYGIRRLIKK"\

"SPSDVKPLPSPDTDVPLSSVEIENPETSDQ"

obs = [aaTypes.index(aa) for aa in seq]

adj = 1e-99

logStart = log(pStart+adj)

logTrans = log(pTrans+adj)

logEmit = log(pEmit+adj)

logProbScore, path = viterbi(obs, logStart, logTrans, logEmit)

To  generate  a  string  of  symbols  representing  the  buried  or  exposed  states,  the  indices

from the winning Viterbi path are used as keys to stateDict, which gives back the textual

symbols  for  the  state  (exposure,  aminoAcid)  and  the  first  of  these  (hence  [0])  is  the

symbol we want.

bestExpCodes = ''.join([stateDict[i][0] for i in path])

print(seq)

print(bestExpCodes)

This will give the following output, albeit on one line:

MYGKIIFVLLLSEIVSISASSTTGVAMHTSTSSSVTKSYISSQTNDTHKRDTYAATPRAHEVSEISVRT

----*******---**------------------------*------------------------***---

VYPPEEETGERVQLAHHFSEPEITLIIFGVMAGVIGTILLISYGIRRLIKKSPSDVKPLPSPDTDVPLS

*----------*----------***************************---**------*-------------

SVEIENPETSDQ

------------

We  can  also  test  the  implementation  of  the  forward-backward  algorithm  and  plot  a

graph to show the underlying probabilities for exposed or buried states at each point in the

Markov  chain.  For  each  position  we  get  back  an  array  of  smoothed  probability  values

(combining  forward  and  backward  values)  and  because  of  the  way  we  created  the  data




arrays  we  know  that  the  first  20  values  correspond  to  the  exposed  category,  and  the

remainder the buried. The sum of each half of values will give the total probability for the

category, although for our example HMM only one amino acid (for the observed type) will

have a non-zero value.

smooth = forwardBackward(obs, pStart, pTrans, pEmit)

buriedList = []

exposeList = []

for values in smooth:

exposeList.append(sum(values[:20]))

buriedList.append(sum(values[20:]))

xAxisValues = list(range(len(exposeList))) # Sequence positions

from matplotlib import pyplot

pyplot.plot(xAxisValues, exposeList, c='#A0A0A0')

pyplot.plot(xAxisValues, buriedList, c='#000000')

pyplot.show()


Download 7,75 Mb.

Do'stlaringiz bilan baham:
1   ...   333   334   335   336   337   338   339   340   ...   514




Ma'lumotlar bazasi mualliflik huquqi bilan himoyalangan ©hozir.org 2024
ma'muriyatiga murojaat qiling

kiriting | ro'yxatdan o'tish
    Bosh sahifa
юртда тантана
Боғда битган
Бугун юртда
Эшитганлар жилманглар
Эшитмадим деманглар
битган бодомлар
Yangiariq tumani
qitish marakazi
Raqamli texnologiyalar
ilishida muhokamadan
tasdiqqa tavsiya
tavsiya etilgan
iqtisodiyot kafedrasi
steiermarkischen landesregierung
asarlaringizni yuboring
o'zingizning asarlaringizni
Iltimos faqat
faqat o'zingizning
steierm rkischen
landesregierung fachabteilung
rkischen landesregierung
hamshira loyihasi
loyihasi mavsum
faolyatining oqibatlari
asosiy adabiyotlar
fakulteti ahborot
ahborot havfsizligi
havfsizligi kafedrasi
fanidan bo’yicha
fakulteti iqtisodiyot
boshqaruv fakulteti
chiqarishda boshqaruv
ishlab chiqarishda
iqtisodiyot fakultet
multiservis tarmoqlari
fanidan asosiy
Uzbek fanidan
mavzulari potok
asosidagi multiservis
'aliyyil a'ziym
billahil 'aliyyil
illaa billahil
quvvata illaa
falah' deganida
Kompyuter savodxonligi
bo’yicha mustaqil
'alal falah'
Hayya 'alal
'alas soloh
Hayya 'alas
mavsum boyicha


yuklab olish