It is often useful to measure how repetitive a section of DNA or protein sequence is. For
DNA, regions of repetitive sequence are often associated with non-coding, and especially
‘junk’, DNA; hence if we are searching for genes and their regulatory regions we look for
sequence that is not repetitive. For protein, repetitive amino acid sequences are associated
with the parts that are unstructured: the bits that don’t fold into a compact globule. This
can be especially useful for multi-domain proteins (several independent compact globules
along the length of the chain), where repetitive sequence regions can identify the flexible
linkers between the structured domains.
For this example we will measure the repetitiveness along a sequence by counting how
many different kinds of residue are present within a given search window. If a given sub-
sequence only uses a limited number of residue types, rather than a varied mixture, we
will regard it as repetitive. For a mathematical definition we turn to information theory
and use a measure derived from the concept of information entropy.
Considering that DNA and protein have different numbers of possible residue types (4
nucleotides versus 20 amino acids), our expectations will naturally be
different. For example, in a given 12-residue segment, for DNA we would expect to see
every kind of nucleotide, and if all nucleotides are equally likely there would be on
average three of each kind. For 12-residue protein segments, it is impossible to have all
amino acids present, and we might reasonably expect the average occurrence of any kind
to be less than 1.
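That expectation is simply the window length multiplied by the baseline probability of each residue type; a two-line check, using the numbers from the text and assuming the unbiased baseline:

```python
window = 12  # length of the search window

# Expected count of each residue type under a uniform null hypothesis
print(window / 4.0)   # DNA: 4 nucleotide types, so 3.0 of each on average
print(window / 20.0)  # protein: 20 amino acid types, so 0.6 of each on average
```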
Accordingly, because the expectation of how likely it is to find a given type of residue
varies according to the situation, we will measure sequence repetitiveness by comparing
the measured occurrences with what we would expect given a random selection. Here we
will refer to such a random selection as the null hypothesis: what we would expect in the
absence of extra information (see also Chapter 22 for further discussion). In reality there
are many null hypotheses that you could choose from; for example, if you were searching
within gene sequences and already knew that G:C base pairs are more common than A:T
pairs, you could adjust the expected proportions accordingly. However,
for this exercise we will assume an unbiased null hypothesis, where each residue type is
equally likely: the baseline for nucleotides is 25% and the baseline for amino acids is 5%.
We will refer to the formulation we use for this comparative measure of repetitiveness as
the relative entropy, also known as the Kullback-Leibler divergence.
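In symbols, writing p_r for the measured proportion of residue type r and q_r for the corresponding null-hypothesis proportion, the relative entropy is:

```latex
H = \sum_{r} p_r \log_2 \frac{p_r}{q_r}
```

With the unbiased null hypothesis used here, every q_r is the uniform baseline: 1/4 for nucleotides and 1/20 for amino acids. The code below additionally divides the sum by the logarithm of the baseline to scale the result.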
The actual example code will be broken up into two separate functions; one will
calculate the relative entropy and the other will scan through a sequence compiling the
results. The first function takes a sequence and a list of possible residue types, which of
course will differ for DNA and protein, and does the appropriate mathematics (see the last
column of Table 11.2) to give back a single value H (a traditional letter to use for entropy).
Note that this function does not contain any code specific to a given context. This means
that the function will be useful in other situations, one of the main reasons to write it in a
separate block of code. A separate function also helps to keep the Python code readable.
The workings of the function involve calculating the expected baseline proportion, base,
then filling the dictionary prop, which stores the abundance of each kind of residue
we have in the input sequence. The residue counts are converted to a proportion (i.e.
between 0 and 1) by dividing by the sequence length, making sure that we do floating
point division (dividing two integers gives an integer in Python before version 3). Armed
with the base level and the proportions we then loop through each residue type and add the
local repetitiveness measure for that type to the relative entropy total. The occurrence of
base-2 logarithms throughout is a means to ensure that the answer is output in units of bits
(common in information theory).
def calcRelativeEntropy(seq, resCodes):
    """Calculate a relative entropy value for the residues in a
    sequence compared to a uniform null hypothesis.
    """
    from math import log

    N = float(len(seq))        # sequence length, as a float for division
    base = 1.0/len(resCodes)   # expected proportion of each residue type

    # Count the occurrences of each residue type in the sequence
    prop = {}
    for r in resCodes:
        prop[r] = 0
    for r in seq:
        prop[r] += 1

    # Convert counts to proportions between 0 and 1
    for r in resCodes:
        prop[r] /= N

    # Sum the contribution of each residue type that is actually present
    H = 0
    for r in resCodes:
        if prop[r] != 0.0:
            h = prop[r] * log(prop[r]/base, 2.0)
            H += h

    H /= log(base, 2.0)  # normalise by log2 of the baseline (a negative number)

    return H
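As a quick sanity check (the function is repeated here so the snippet stands alone): a maximally varied window scores 0, while a window of a single residue type scores -1, the sign following from the final division by log2(base), which is negative.

```python
from math import log

def calcRelativeEntropy(seq, resCodes):
    # Relative entropy against a uniform null hypothesis, as defined above
    N = float(len(seq))
    base = 1.0 / len(resCodes)
    prop = dict.fromkeys(resCodes, 0)
    for r in seq:
        prop[r] += 1
    H = 0.0
    for r in resCodes:
        p = prop[r] / N
        if p:
            H += p * log(p / base, 2.0)
    return H / log(base, 2.0)

print(calcRelativeEntropy('ACGTACGTACGT', 'ACGT'))  # all four types equally present: 0
print(calcRelativeEntropy('AAAAAAAAAAAA', 'ACGT'))  # a single residue type: -1
print(calcRelativeEntropy('ATATATATATAT', 'ACGT'))  # two of the four types: -0.5
```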
The second function will use the calcRelativeEntropy() function as it scans through an
input sequence. In this example, and unlike the previous ones in this chapter, rather than
having our search window stop before it falls off the end of the input sequence we will use
a cheat that will enable the search to go right to the last residue. This involves pretending
that the sequence is circular; rather than falling off the end the search will jump back to
the beginning of the sequence. Whether it is appropriate to cheat is at the programmer’s
discretion, but in this instance it gives results that appear reasonable, because the
wrapped-around window repeats a real segment of the sequence.
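As a sketch of the approach just described (the function and variable names here are illustrative, not necessarily those used in the full example; calcRelativeEntropy is repeated so the snippet runs on its own):

```python
from math import log

def calcRelativeEntropy(seq, resCodes):
    # Relative entropy against a uniform null hypothesis, as defined above
    N = float(len(seq))
    base = 1.0 / len(resCodes)
    prop = dict.fromkeys(resCodes, 0)
    for r in seq:
        prop[r] += 1
    H = 0.0
    for r in resCodes:
        p = prop[r] / N
        if p:
            H += p * log(p / base, 2.0)
    return H / log(base, 2.0)

def relativeEntropySearch(seq, winSize, isProtein=False):
    # Scan a window along the sequence, pretending the sequence is circular
    # so that the search can run right up to the last residue
    lenSeq = len(seq)
    scores = [0.0] * lenSeq
    extendedSeq = seq + seq[:winSize]  # wrap the start back onto the end
    resCodes = 'ACDEFGHIKLMNPQRSTVWY' if isProtein else 'GCAT'
    for i in range(lenSeq):
        subSeq = extendedSeq[i:i+winSize]
        scores[i] = calcRelativeEntropy(subSeq, resCodes)
    return scores
```

For example, relativeEntropySearch('ACGTACGTACGT', 4) gives a score of 0 at every position, since every four-base circular window contains each nucleotide exactly once.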