Now we will move on to another example that produces data we can display as a
graph, but this time it will be for a protein sequence. The task here is to generate a plot of
hydrophobicity along the sequence, so that we can find regions whose residues do not
carry a charge or chemical groups that can form hydrogen bonds. It is often useful to
find such hydrophobic regions because, by shunning water, they make important
interactions inside the folded core of proteins or allow a protein to be inserted into a
cellular membrane. This example is set in the context of cell membranes.
A cellular membrane is a double layer (bilayer) of hydrophobic lipid molecules[10] into
which specific proteins are embedded by virtue of a hydrophobic anchor. A membrane
defines the outer extent of each cell, and various internal compartments, with special
functions, inside it. Biologically the lipid component of a membrane creates a barrier to
most molecules and the protein component allows selective passage for some molecules,
in line with the requirements of the cell.
The next example function aims to predict whether a protein possesses a sufficiently
hydrophobic segment of residues (which will fold into a helix) that will allow it to be
inserted into a cell’s system of membranes. This is a simplistic prediction, as in reality
there are other factors that govern whether a segment is used, but nonetheless it is
sufficiently accurate to find over 70% of membrane spans.
Initially we define a hydrophobicity scale: a number associated with each amino acid
letter that says how water-hating it is. For this example we will use the GES scale,[11] but
there are several others to choose from.
GES_SCALE = {'F':-3.7,'M':-3.4,'I':-3.1,'L':-2.8,'V':-2.6,
             'C':-2.0,'W':-1.9,'A':-1.6,'T':-1.2,'G':-1.0,
             'S':-0.6,'P': 0.2,'Y': 0.7,'H': 3.0,'Q': 4.1,
             'N': 4.8,'E': 8.2,'K': 8.8,'D': 9.2,'R':12.3}
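As a quick illustration of how the scale is read, the following sketch sums the values for two short peptides. The peptide strings and the helper name segmentScore are made up for this illustration, and the dictionary is repeated only so that the snippet runs on its own. With the sign convention of the values listed above, the more negative a total is, the more hydrophobic the segment.

```python
GES_SCALE = {'F':-3.7,'M':-3.4,'I':-3.1,'L':-2.8,'V':-2.6,
             'C':-2.0,'W':-1.9,'A':-1.6,'T':-1.2,'G':-1.0,
             'S':-0.6,'P': 0.2,'Y': 0.7,'H': 3.0,'Q': 4.1,
             'N': 4.8,'E': 8.2,'K': 8.8,'D': 9.2,'R':12.3}

def segmentScore(segment, scale):
    """Add up the scale value of each residue (hypothetical helper)."""
    return sum(scale[residue] for residue in segment)

hydrophobicPeptide = 'LIVFA'  # made-up example: hydrophobic residues
chargedPeptide = 'KDERK'      # made-up example: charged residues

# The first total is negative, the second strongly positive
print(segmentScore(hydrophobicPeptide, GES_SCALE))
print(segmentScore(chargedPeptide, GES_SCALE))
```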
We define the function that will perform the search so that it accepts a protein sequence
and hydrophobicity scale dictionary as mandatory inputs, and an optional input to specify
a search window size. The philosophy of this function differs a little from those above
because it includes a speed optimisation: it minimises the number of operations
performed.
An index i is defined to loop through the sequence and, because it is useful in several
spots, we define j to be i plus the search width. The adding up of the hydrophobicity score
for each segment can take place inside one of two separate sections, depending on the
result of an if statement. This statement is set up such that the first time we add up scores
(detected by the score being at its start value of None) we consider all of the positions
from i up to j. After this first summation, rather than repeating the summation for the
whole of the next section, we use the fact that the next section only differs from the
previous one at its first and last positions. Accordingly, to get the score for the next section
we take the existing score and take away the score of the residue we have just left behind
(i-1) and add the score of the new end residue (j-1: we go up to but do not include position
j). This is a speed optimisation because fewer operations are performed overall, but it is
prone to the accumulation of small floating point errors. Such errors will not, however,
grow to anything significant for something as short as a protein sequence.
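def hydrophobicitySearch(seq, scale, winSize=15):
    """Scan a protein sequence for hydrophobic regions using the
    supplied hydrophobicity scale (e.g. GES).
    """
    score = None
    scoreList = []

    # The + 1 ensures the final full-length window is also scored
    for i in range(len(seq) - winSize + 1):
        j = i + winSize

        if score is None:
            # First window: add up every position from i up to (not including) j
            score = 0
            for k in range(i, j):
                score += scale[seq[k]]

        else:
            # Later windows: add the new end residue, remove the one left behind
            score += scale[seq[j-1]]
            score -= scale[seq[i-1]]

        scoreList.append(score)

    return scoreList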
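One way to gain confidence in the running-sum approach, and to see how small the floating point drift is, is to compare it with a direct summation of every window. The sketch below does that. Both function names, the three-letter toy scale (values copied from the GES table) and the test sequence are made up for the illustration, and both helpers score every full-length window including the last one.

```python
def naiveWindowScores(seq, scale, winSize=15):
    """Score each window by adding up all of its residues afresh."""
    return [sum(scale[aa] for aa in seq[i:i+winSize])
            for i in range(len(seq) - winSize + 1)]

def runningWindowScores(seq, scale, winSize=15):
    """Score the same windows by updating the previous total."""
    if len(seq) < winSize:
        return []

    score = sum(scale[aa] for aa in seq[:winSize])  # first window in full
    scores = [score]

    for i in range(1, len(seq) - winSize + 1):
        # Add the new end residue, subtract the one just left behind
        score += scale[seq[i+winSize-1]] - scale[seq[i-1]]
        scores.append(score)

    return scores

toyScale = {'L': -2.8, 'A': -1.6, 'K': 8.8}  # subset of GES values
toySeq = 'LLKALAKKLLAAKLALKKAL' * 3          # made-up 60-residue sequence

naive = naiveWindowScores(toySeq, toyScale)
running = runningWindowScores(toySeq, toyScale)

# Largest disagreement between the two methods; expected to be tiny
maxDiff = max(abs(a - b) for a, b in zip(naive, running))
print(len(naive), maxDiff)
```

The direct version performs roughly winSize additions per window, whereas the running version performs two, which is where the saving comes from.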
As before we can execute the function with an example sequence and plot the results
with Matplotlib.
from matplotlib import pyplot
scores = hydrophobicitySearch(proteinSeq, GES_SCALE)
pyplot.plot(scores)
pyplot.show()
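To turn the list of scores into a simple yes/no prediction, one option is to locate the lowest-scoring (most hydrophobic) window and compare it with a cutoff. The following is only a sketch of that idea: the function name mostHydrophobicWindow, the example scores and the cutoff value are all hypothetical, and a real cutoff would need to be chosen to suit the scale and window size.

```python
def mostHydrophobicWindow(scores, winSize=15):
    """Return (start, end, score) of the lowest-scoring window,
    or None if there are no scores. The end index is exclusive.
    """
    if not scores:
        return None

    start = min(range(len(scores)), key=scores.__getitem__)
    return start, start + winSize, scores[start]

exampleScores = [12.0, 3.5, -20.1, -31.4, -18.2, 5.0]  # made-up values
CUTOFF = -25.0  # hypothetical threshold, for illustration only

hit = mostHydrophobicWindow(exampleScores, winSize=15)

if hit is not None and hit[2] <= CUTOFF:
    start, end, score = hit
    # Positions are zero-based sequence indices
    print('Candidate membrane span at positions %d-%d' % (start, end-1))
else:
    print('No sufficiently hydrophobic segment found')
```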