Python Programming for Biology: Bioinformatics and Beyond


A neural network for biological sequences




Next we move on from the trivial neural network test example to illustrate how feature vectors may be generated from biological sequences (and category data generally), so that they can be used in such machine learning programs. The example will predict the secondary structure of the residue in the middle of a five-amino-acid sequence. Both the amino acid sequence and the output secondary-structure categories will be represented initially as code letters, but they will be converted into numbers (zeros and ones) before being passed into the feed-forward neural network. Although this example uses protein sequences, an analogous procedure can be used for DNA and RNA.

The test data that will be illustrated for the example is very small, simply to give a taste of the data and still have it fit on the page. As a result, a neural network trained on this data would be totally useless at making secondary-structure predictions in practice. However, the on-line material includes a file (SecStrucTrainingData.tsv, available via http://www.cambridge.org/pythonforbiology) with a large data set containing many thousands of sequences; using this as input would give a vastly superior result. The test data is presented as a list of 2-tuples; each tuple has a five-letter protein sequence and a secondary-structure code. Three secondary-structure codes are used to represent three general conformations of protein backbone geometry. These codes are ‘E’ for extended conformations (mostly beta-strand), ‘H’ for helices (mostly alpha-helix) and ‘C’ for random coil or unstructured stretches (everything else).

seqSecStrucData = [('ADTLL','E'),
                   ('DTLLI','E'),
                   ('TLLIL','E'),
                   ('LLILG','E'),
                   ('LILGD','E'),
                   ('ILGDS','E'),
                   ('LGDSL','C'),
                   ('GDSLS','H'),
                   ('DSLSA','H'),
                   ('SLSAG','H'),
                   ('LSAGY','H'),
                   ('SAGYR','C'),
                   ('AGYRM','C'),
                   ('GYRMS','C'),
                   ('YRMSA','C'),
                   ('RMSAS','C')]

Before the above data can be used it will need to be converted from text strings into numeric feature vectors (arrays of numbers). All of the feature vectors that represent either sequence input or prediction output will contain numbers to represent the presence or absence of a particular category of item. In this example the feature vectors will contain ones to indicate the presence and zeros to represent the absence of an amino acid (for input) or of a secondary-structure code letter (for output). Other numbers could have been used instead with equal success (e.g. ±1 or ±0.5), although the two values chosen for presence and absence should naturally be distinct and lie in the range of the trigger function where there is a steep gradient; for the hyperbolic tangent example used here the range from −1 to +1 is usually best. Note that the size of the vector generated is the length of the input sequence multiplied by the number of possible letters; each element of the vector represents a different residue at a different sequence position. For a five-letter protein sequence the vector will have 20 elements (one for each amino acid) for each of the five sequence positions, and thus the total length will be 100.

To make the feature vectors we first define dictionaries so that a letter can be used as a key to look up the position (index) in the vector that should be set to 1.0. The actual order in which we go through the letter codes is unimportant, but it should be consistent for a given program. Here we could have used the list.index(letter) form to get a position index, but this is slower, especially if the number of possible letters and the amount of training data are large. To make the index-look-up dictionary for the amino acids we loop through the string of possible letters and associate each residue letter code with the index i, which in this case was generated with enumerate():

aminoAcids = 'ACDEFGHIKLMNPQRSTVWY'

aaIndexDict = {}
for i, aa in enumerate(aminoAcids):
  aaIndexDict[aa] = i
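
The same mapping can be built more compactly with a dictionary comprehension; the equivalent one-liner and spot-checks below are our own illustrative additions, not part of the original recipe:

aaIndexDict = {aa: i for i, aa in enumerate(aminoAcids)}  # same result as the loop

print(aaIndexDict['A'])  # 0 : first letter of the alphabet string
print(aaIndexDict['F'])  # 4 : fifth letter of 'ACDEFGHIKLMNPQRSTVWY'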

The same sort of thing is repeated for the secondary-structure codes, albeit with a smaller number of possible letters:

ssIndexDict = {}
ssCodes = 'HCE'
for i, code in enumerate(ssCodes):
  ssIndexDict[code] = i

Now we actually do the conversion of the training data from the text strings to numeric vectors that can be used in the neural network routine. To help with this, the convertSeqToVector function defined below is constructed to take a sequence of letters, seq, and indexDict, which can convert each letter to the correct position in the code alphabet. The vector initially starts filled with zeros, but then selected positions are converted to ones, depending on which sequence letters are observed. The actual index in the vector that needs to be set is determined by the index-look-up dictionary for that letter, indexDict[letter], and the start point for that sequence position, pos * numLetters. For example, if the third letter (index 2, counting from 0) in seq is ‘F’, this adds a one at position 44; forty (2 × 20) to get to the start of the block that represents the third sequence position and four more because ‘F’ is at index 4 in the protein sequence alphabet.

def convertSeqToVector(seq, indexDict):

  numLetters = len(indexDict)
  vector = [0.0] * len(seq) * numLetters

  for pos, letter in enumerate(seq):
    index = pos * numLetters + indexDict[letter]
    vector[index] = 1.0

  return vector
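
As a quick sanity check on the indexing arithmetic described above (an illustrative addition, not from the original text), we can encode a five-letter sequence with ‘F’ as its third letter and confirm that element 44 is set:

# 'F' at sequence position 2 (the third letter) should set
# vector index 2 * 20 + 4 = 44
checkVec = convertSeqToVector('AAFAA', aaIndexDict)

print(len(checkVec))  # 100 : five positions times twenty amino acids
print(checkVec[44])   # 1.0 : the 'F' element within the third position block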

The actual training data for the neural network is made by looping through all of the pairs of protein sequence and secondary-structure code, and for each of these using the above function to make the feature vectors from the text. The trainingData list is constructed as linked pairs of input and output feature vectors. Note that because the secondary-structure code ss is only a single letter the output vector will be of length three. Specifically, for the three codes the output vectors will be as follows: ‘E’: [0.0, 0.0, 1.0]; ‘C’: [0.0, 1.0, 0.0]; ‘H’: [1.0, 0.0, 0.0].

trainingData = []
for seq, ss in seqSecStrucData:

  inputVec = convertSeqToVector(seq, aaIndexDict)
  outputVec = convertSeqToVector(ss, ssIndexDict)

  trainingData.append( (inputVec, outputVec) )
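
It is easy to verify that the conversion produced vectors of the expected sizes; this check is another optional illustration, not part of the original text:

print(len(trainingData))  # 16 : one pair per sequence fragment

inputVec, outputVec = trainingData[0]
print(len(inputVec), len(outputVec))  # 100 3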

The number of hidden nodes is set to three, by way of example. The training data and network size, together with the number of training cycles (1000 here), are then passed into the main training function in order to generate the predictive weight matrices.

wMatrixIn, wMatrixOut = neuralNetTrain(trainingData, 3, 1000)

After training, the neural network can make secondary-structure predictions for any five-letter protein sequence, but, as before, this must be converted into the numeric vector form. Because the prediction operates on a whole numpy.array, the test vector is converted into an array (here with one element).

from numpy import array  # needed if not already imported for the earlier example

testSeq = 'DLLSA'
testVec = convertSeqToVector(testSeq, aaIndexDict)
testArray = array( [testVec,] )

The weight matrices from the training are used to make predictions on the test array, and the output signal, sOut, is interrogated to find the position of the largest value (i.e. the position predicted to be one, not zero). This index represents the best secondary-structure code, and the code itself can be obtained by using the index with the original set of codes to get a letter back.



sIn, sHid, sOut = neuralNetPredict(testArray, wMatrixIn, wMatrixOut)

index = sOut.argmax()
print("Test prediction: %s" % ssCodes[index])

