A neural network for biological sequences
Next we move on from the trivial neural network test example to illustrate how feature
vectors may be generated from biological sequences (and category data generally), so that
they can be used in such machine learning programs. The example will predict the
secondary structure of a residue in the middle of a five-amino-acid sequence. Both the
amino acid sequence and the output secondary-structure categories will be represented
initially as code letters, but they will be converted into numbers (zeros and ones) before
being passed into the feed-forward neural network. Although this example uses protein
sequences, an analogous procedure can be used for DNA and RNA.
The test data shown here is deliberately small, simply to give a taste of the data while
still fitting on the page. As a result, a neural network trained on this data alone would be
useless at making secondary-structure predictions in practice. However, the on-line
material includes a much larger file (SecStrucTrainingData.tsv, available via
http://www.cambridge.org/pythonforbiology) containing many thousands of sequences;
using that as input gives a vastly superior result. The test data is presented as a list of
2-tuples; each tuple has a five-residue
protein sequence and a secondary-structure code. Three secondary-structure codes are
used to represent three general conformations of protein backbone geometry. These codes
are ‘E’ for extended conformations (mostly beta-strand), ‘H’ for helices (mostly alpha-
helix) and ‘C’ for random coil or unstructured stretches (everything else).
seqSecStrucData = [('ADTLL','E'), ('DTLLI','E'), ('TLLIL','E'), ('LLILG','E'),
                   ('LILGD','E'), ('ILGDS','E'), ('LGDSL','C'), ('GDSLS','H'),
                   ('DSLSA','H'), ('SLSAG','H'), ('LSAGY','H'), ('SAGYR','C'),
                   ('AGYRM','C'), ('GYRMS','C'), ('YRMSA','C'), ('RMSAS','C')]
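To use the much larger SecStrucTrainingData.tsv set mentioned above, the same list of
tuples can be read from the file instead. The following is only a minimal sketch, which
assumes that each line of the file holds a five-letter sequence and its secondary-structure
code separated by a tab:

seqSecStrucData = []
fileObj = open('SecStrucTrainingData.tsv')
for line in fileObj:
    seq, ss = line.split()   # assumed format: sequence<TAB>code
    seqSecStrucData.append( (seq, ss) )
fileObj.close()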
Before the above data can be used it will need to be converted from text strings into
numeric feature vectors (arrays of numbers). All of the feature vectors that represent either
sequence input or prediction output will contain numbers to represent the presence or
absence of a particular category of item. In this example the feature vectors will contain
ones to indicate the presence and zeros to represent the absence of an amino acid (for
input) or of a secondary-structure code letter (for output). Other numbers could have been
used instead with equal success (e.g. ±1 or ±0.5), although the two values chosen for
presence or absence should naturally be distinct and lie in the range of the trigger function
where there is a steep gradient; for the hyperbolic tangent example used here the range
from −1 to +1 is usually best. Note that the size of the vector generated is the length of the
input sequence multiplied by the number of possible letters; each element of the vector
represents a different residue at a different sequence position. For a five-letter protein
sequence the vector will have 20 elements (one for each amino acid) for each of the five
sequence positions, and thus the total length will be 100.
To make the feature vectors we first define dictionaries so that a letter can be used as a
key to look up the position (index) in the vector that should be set to 1.0. The actual order
that we go through the letter codes is unimportant, but it should be consistent for a given
program. Here we could have used the list.index(letter) form to get a position index, but
this is slower, especially if the number of possible letters and amount of training data are
large. To make the index-look-up dictionary for the amino acids we loop through a string
of the possible residue codes and associate each letter with the index i, which is generated
here with enumerate():
aminoAcids = 'ACDEFGHIKLMNPQRSTVWY'
aaIndexDict = {}
for i, aa in enumerate(aminoAcids):
    aaIndexDict[aa] = i
The same sort of thing is repeated for the secondary-structure codes, albeit with a
smaller number of possible letters:
ssIndexDict = {}
ssCodes = 'HCE'
for i, code in enumerate(ssCodes):
    ssIndexDict[code] = i
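As an aside, the same look-up tables can be written more compactly with dictionary
comprehensions; a minimal equivalent sketch:

aaIndexDict = {aa: i for i, aa in enumerate(aminoAcids)}
ssIndexDict = {code: i for i, code in enumerate(ssCodes)}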
Now we actually do the conversion of the training data from the text strings to numeric
vectors that can be used in the neural network routine. To help with this the
convertSeqToVector function defined below is constructed to take a sequence of letters
seq, and indexDict, which can convert each letter to the correct position in the code
alphabet. The vector initially starts filled with zeros, but then selective positions are
converted to ones, depending on which sequence letters are observed. The actual index in
the vector that needs to be set is determined by the index-look-up dictionary for that letter,
indexDict[letter], and the start point for that sequence position, pos * numLetters. For
example, if the third letter (index 2, counting from 0) in seq is ‘F’, this adds a one at
index 44; forty (2 × 20) to get to the start of the block that represents the third sequence
position and four more because ‘F’ is at index 4 in the protein sequence alphabet.
def convertSeqToVector(seq, indexDict):
    numLetters = len(indexDict)
    vector = [0.0] * len(seq) * numLetters
    for pos, letter in enumerate(seq):
        index = pos * numLetters + indexDict[letter]
        vector[index] = 1.0
    return vector
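As a quick check of the indexing described above, we can try the function on a
hypothetical five-letter sequence (‘AAFAA’ is invented purely for illustration):

vec = convertSeqToVector('AAFAA', aaIndexDict)
print(len(vec))   # 100 elements: 5 positions x 20 amino acids
onPositions = [i for i, x in enumerate(vec) if x == 1.0]
print(onPositions)   # [0, 20, 44, 60, 80]; the 'F' at sequence position 2 sets index 44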
The actual training data for the neural network is made by looping through all of the
pairs of protein sequence and secondary-structure code, and for each of these using the
above function to make the feature vector from the text. The trainingData list is
constructed as linked pairs of input and output feature vectors. Note that because the
secondary-structure code ss is only a single letter the output vector will be of length three.
Specifically, for the three codes the output vectors will be as follows: ‘E’: [0.0, 0.0, 1.0];
‘C’: [0.0, 1.0, 0.0]; ‘H’: [1.0, 0.0, 0.0].
trainingData = []
for seq, ss in seqSecStrucData:
    inputVec = convertSeqToVector(seq, aaIndexDict)
    outputVec = convertSeqToVector(ss, ssIndexDict)
    trainingData.append( (inputVec, outputVec) )
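A quick sanity check, not part of the original listing, confirms the vector sizes and the
output encoding described above:

firstInput, firstOutput = trainingData[0]
print(len(firstInput), len(firstOutput))   # 100 and 3
print(firstOutput)                         # [0.0, 0.0, 1.0] because 'ADTLL' is labelled 'E'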
The number of hidden nodes is set to three, by way of example. The training data, the
network size and the number of training steps (here 1000) are then passed into the main
training function, in order to generate the predictive weight matrices.
wMatrixIn, wMatrixOut = neuralNetTrain(trainingData, 3, 1000)
After training, the neural network can make secondary-structure predictions for any
five-letter protein sequence, but, as before, this must be converted into the numeric vector
form. Because the prediction operates on a whole numpy.array the test vector is converted
into an array (here containing just one data item).
from numpy import array  # needed if not already imported for the earlier neural network code

testSeq = 'DLLSA'
testVec = convertSeqToVector(testSeq, aaIndexDict)
testArray = array( [testVec,] )
The weight matrices from the training are used to make predictions on the test array and
the output signal, sOut, is interrogated to find the position of the largest value (i.e. the
position predicted to be one not zero). This index represents the best secondary-structure
code, and the code itself can be obtained by using the index with the original set of codes
to get a letter back.
sIn, sHid, sOut = neuralNetPredict(testArray, wMatrixIn, wMatrixOut)
index = sOut.argmax()
print("Test prediction: %s" % ssCodes[index])