Python Programming for Biology: Bioinformatics and Beyond


A neural network for biological sequences




Next we move on from the trivial neural network test example to illustrate how feature vectors may be generated from biological sequences (and category data generally), so that they can be used in such machine learning programs. The example will predict the secondary structure of the residue in the middle of a five-amino-acid sequence. Both the amino acid sequence and the output secondary-structure categories will be represented initially as code letters, but they will be converted into numbers (zeros and ones) before being passed into the feed-forward neural network. Although this example uses protein sequences, an analogous procedure can be used for DNA and RNA.

The test data that will be illustrated for the example is very small, simply to give a taste of the data and still have it fit on the page. As a result, a neural network trained on this data would be totally useless at making secondary-structure predictions in practice. However, the on-line material includes a file (SecStrucTrainingData.tsv, available via http://www.cambridge.org/pythonforbiology) with a large data set containing many thousands of sequences; using this as input would give a vastly superior result. The test data is presented as a list of 2-tuples; each tuple has a five-letter protein sequence and a secondary-structure code. Three secondary-structure codes are used to represent three general conformations of protein backbone geometry. These codes are ‘E’ for extended conformations (mostly beta-strand), ‘H’ for helices (mostly alpha-helix) and ‘C’ for random coil or unstructured stretches (everything else).

seqSecStrucData = [('ADTLL','E'),
                   ('DTLLI','E'),
                   ('TLLIL','E'),
                   ('LLILG','E'),
                   ('LILGD','E'),
                   ('ILGDS','E'),
                   ('LGDSL','C'),
                   ('GDSLS','H'),
                   ('DSLSA','H'),
                   ('SLSAG','H'),
                   ('LSAGY','H'),
                   ('SAGYR','C'),
                   ('AGYRM','C'),
                   ('GYRMS','C'),
                   ('YRMSA','C'),
                   ('RMSAS','C')]

Before the above data can be used it will need to be converted from text strings into numeric feature vectors (arrays of numbers). All of the feature vectors that represent either sequence input or prediction output will contain numbers to represent the presence or absence of a particular category of item. In this example the feature vectors will contain ones to indicate the presence and zeros to represent the absence of an amino acid (for input) or of a secondary-structure code letter (for output). Other numbers could have been used instead with equal success (e.g. ±1 or ±0.5), although the two values chosen for presence and absence should naturally be distinct and lie in the range of the trigger function where there is a steep gradient; for the hyperbolic tangent example used here the range from −1 to +1 is usually best. Note that the size of the vector generated is the length of the input sequence multiplied by the number of possible letters; each element of the vector represents a different residue at a different sequence position. For a five-letter protein sequence the vector will have 20 elements (one for each amino acid) for each of the five sequence positions, and thus the total length will be 100.

To make the feature vectors we first define dictionaries so that a letter can be used as a key to look up the position (index) in the vector that should be set to 1.0. The actual order in which we go through the letter codes is unimportant, but it should be consistent for a given program. Here we could have used the list.index(letter) form to get a position index, but this is slower, especially if the number of possible letters and the amount of training data are large. To make the index-look-up dictionary for the amino acids we loop through the string of possible letters and associate each residue letter code with the index i, which in this case was generated with enumerate():

aminoAcids = 'ACDEFGHIKLMNPQRSTVWY'

aaIndexDict = {}
for i, aa in enumerate(aminoAcids):
  aaIndexDict[aa] = i
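
The same mapping can be built more compactly with a dictionary comprehension; the equivalent one-liner and spot-checks below are our own illustrative additions, not part of the original recipe:

aaIndexDict = {aa: i for i, aa in enumerate(aminoAcids)}  # same result as the loop

print(aaIndexDict['A'])  # 0 : first letter of the alphabet string
print(aaIndexDict['F'])  # 4 : fifth letter of 'ACDEFGHIKLMNPQRSTVWY'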

The same sort of thing is repeated for the secondary-structure codes, albeit with a smaller number of possible letters:

ssIndexDict = {}
ssCodes = 'HCE'
for i, code in enumerate(ssCodes):
  ssIndexDict[code] = i

Now we actually do the conversion of the training data from the text strings to numeric vectors that can be used in the neural network routine. To help with this, the convertSeqToVector function defined below is constructed to take a sequence of letters, seq, and indexDict, which can convert each letter to the correct position in the code alphabet. The vector initially starts filled with zeros, but then selected positions are converted to ones, depending on which sequence letters are observed. The actual index in the vector that needs to be set is determined by the index-look-up dictionary for that letter, indexDict[letter], and the start point for that sequence position, pos * numLetters. For example, if the third letter (index 2, counting from 0) in seq is ‘F’, this adds a one at position 44; forty (2 × 20) to get to the start of the block that represents the third sequence position and four more because ‘F’ is at index 4 in the protein sequence alphabet.

def convertSeqToVector(seq, indexDict):

  numLetters = len(indexDict)
  vector = [0.0] * len(seq) * numLetters

  for pos, letter in enumerate(seq):
    index = pos * numLetters + indexDict[letter]
    vector[index] = 1.0

  return vector
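
As a quick sanity check on the indexing arithmetic described above (an illustrative addition, not from the original text), we can encode a five-letter sequence with ‘F’ as its third letter and confirm that element 44 is set:

# 'F' at sequence position 2 (the third letter) should set
# vector index 2 * 20 + 4 = 44
checkVec = convertSeqToVector('AAFAA', aaIndexDict)

print(len(checkVec))  # 100 : five positions times twenty amino acids
print(checkVec[44])   # 1.0 : the 'F' element within the third position block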

The actual training data for the neural network is made by looping through all of the pairs of protein sequence and secondary-structure code, and for each of these using the above function to make the feature vectors from the text. The trainingData list is constructed as linked pairs of input and output feature vectors. Note that because the secondary-structure code ss is only a single letter the output vector will be of length three. Specifically, for the three codes the output vectors will be as follows: ‘E’: [0.0, 0.0, 1.0]; ‘C’: [0.0, 1.0, 0.0]; ‘H’: [1.0, 0.0, 0.0].

trainingData = []
for seq, ss in seqSecStrucData:

  inputVec = convertSeqToVector(seq, aaIndexDict)
  outputVec = convertSeqToVector(ss, ssIndexDict)

  trainingData.append( (inputVec, outputVec) )
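
It is easy to verify that the conversion produced vectors of the expected sizes; this check is another optional illustration, not part of the original text:

print(len(trainingData))  # 16 : one pair per sequence fragment

inputVec, outputVec = trainingData[0]
print(len(inputVec), len(outputVec))  # 100 3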

The number of hidden nodes is set to three, by way of example. The training data and network size, together with the number of training cycles (1000 here), are then passed into the main training function in order to generate the predictive weight matrices.

wMatrixIn, wMatrixOut = neuralNetTrain(trainingData, 3, 1000)

After training, the neural network can make secondary-structure predictions for any five-letter protein sequence, but, as before, this must be converted into the numeric vector form. Because the prediction operates on a whole numpy.array, the test vector is converted into an array (here with one element).

from numpy import array  # needed if not already imported for the earlier example

testSeq = 'DLLSA'
testVec = convertSeqToVector(testSeq, aaIndexDict)
testArray = array( [testVec,] )

The weight matrices from the training are used to make predictions on the test array, and the output signal, sOut, is interrogated to find the position of the largest value (i.e. the position predicted to be one, not zero). This index represents the best secondary-structure code, and the code itself can be obtained by using the index with the original set of codes to get a letter back.



sIn, sHid, sOut = neuralNetPredict(testArray, wMatrixIn, wMatrixOut)

index = sOut.argmax()
print("Test prediction: %s" % ssCodes[index])

