It is often useful to measure how repetitive a section of DNA or protein sequence is. For
DNA, regions of repetitive sequence are often associated with non-coding, and especially
‘junk’, DNA; hence if we are searching for genes and their regulatory regions we look for
sequence that is not repetitive. For protein, repetitive amino acid sequences are associated
with the parts that are unstructured: the bits that don’t fold into a compact globule. This
can be especially useful for multi-domain proteins (several independent compact globules
along the length of the chain), where repetitive sequence regions can identify the flexible
linkers between the structured domains.
For this example we will measure the repetitiveness along a sequence by counting how
many different kinds of residue are present within a given search window. If a given sub-
sequence only uses a limited number of residue types, rather than a varied mixture, we
will regard it as repetitive. For a mathematical definition we turn to information theory
and use a measure derived from the concept of information entropy.
Considering that DNA and protein have different numbers of possible residue types (4
nucleotides versus 20 amino acids), our expectations will naturally be
different. For example, in a given 12-residue segment, for DNA we would expect to see
every kind of nucleotide, and if all nucleotides are equally likely there would be on
average three of each kind. For 12-residue protein segments, it is impossible to have all
amino acids present, and we might reasonably expect the average occurrence of any kind
to be less than 1.
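That expectation is simply the window length multiplied by the baseline probability of each residue type; a two-line check, using the numbers from the text and assuming the unbiased baseline:

```python
window = 12  # length of the search window

# Expected count of each residue type under a uniform null hypothesis
print(window / 4.0)   # DNA: 4 nucleotide types, so 3.0 of each on average
print(window / 20.0)  # protein: 20 amino acid types, so 0.6 of each on average
```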
Accordingly, because the expectation of how likely it is to find a given type of residue
varies according to the situation, we will measure sequence repetitiveness by comparing
the measured occurrences with what we would expect given a random selection. Here we
will refer to such a random selection as the null hypothesis: what we would expect in the
absence of extra information (see also Chapter 22 for further discussion). In reality there
are many null hypotheses that you could choose from; for example, if you were searching
within gene sequences and already knew that G:C base pairs are more common than A:T
pairs, you could adjust the expected proportions accordingly. However,
for this exercise we will assume an unbiased null hypothesis, where each residue type is
equally likely: the baseline for nucleotides is 25% and the baseline for amino acids is 5%.
We will refer to the formulation we use for this comparative measure of repetitiveness as
the relative entropy, also known as the Kullback-Leibler divergence.
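In symbols, writing p_r for the measured proportion of residue type r and q_r for the corresponding null-hypothesis proportion, the relative entropy is:

```latex
H = \sum_{r} p_r \log_2 \frac{p_r}{q_r}
```

With the unbiased null hypothesis used here, every q_r is the uniform baseline: 1/4 for nucleotides and 1/20 for amino acids. The code below additionally divides the sum by the logarithm of the baseline to scale the result.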
The actual example code will be broken up into two separate functions; one will
calculate the relative entropy and the other will scan through a sequence compiling the
results. The first function takes a sequence and a list of possible residue types, which of
course will differ for DNA and protein, and does the appropriate mathematics (see the last
column of Table 11.2) to give back a single value H (a traditional letter to use for entropy).
Note that this function does not contain any code specific to a given context. This means
that the function will be useful in other situations, one of the main reasons to write it in a
separate block of code. A separate function also helps to keep the Python code readable.
The workings of the function involve calculating the expected baseline proportion, base,
then filling the dictionary prop, which stores the abundance of each kind of residue
we have in the input sequence. The residue counts are converted to a proportion (i.e.
between 0 and 1) by dividing by the sequence length, making sure that we do floating
point division (dividing two integers gives an integer in Python before version 3). Armed
with the base level and the proportions we then loop through each residue type and add the
local repetitiveness measure for that type to the relative entropy total. The occurrence of
base-2 logarithms throughout is a means to ensure that the answer is output in units of bits
(common in information theory).
def calcRelativeEntropy(seq, resCodes):
    """Calculate a relative entropy value for the residues in a
    sequence compared to a uniform null hypothesis.
    """
    from math import log

    N = float(len(seq))        # sequence length, as a float for division
    base = 1.0/len(resCodes)   # expected proportion of each residue type

    # Count the occurrences of each residue type in the sequence
    prop = {}
    for r in resCodes:
        prop[r] = 0
    for r in seq:
        prop[r] += 1

    # Convert counts to proportions between 0 and 1
    for r in resCodes:
        prop[r] /= N

    # Sum the contribution of each residue type that is actually present
    H = 0
    for r in resCodes:
        if prop[r] != 0.0:
            h = prop[r] * log(prop[r]/base, 2.0)
            H += h

    H /= log(base, 2.0)  # normalise by log2 of the baseline (a negative number)

    return H
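As a quick sanity check (the function is repeated here so the snippet stands alone): a maximally varied window scores 0, while a window of a single residue type scores -1, the sign following from the final division by log2(base), which is negative.

```python
from math import log

def calcRelativeEntropy(seq, resCodes):
    # Relative entropy against a uniform null hypothesis, as defined above
    N = float(len(seq))
    base = 1.0 / len(resCodes)
    prop = dict.fromkeys(resCodes, 0)
    for r in seq:
        prop[r] += 1
    H = 0.0
    for r in resCodes:
        p = prop[r] / N
        if p:
            H += p * log(p / base, 2.0)
    return H / log(base, 2.0)

print(calcRelativeEntropy('ACGTACGTACGT', 'ACGT'))  # all four types equally present: 0
print(calcRelativeEntropy('AAAAAAAAAAAA', 'ACGT'))  # a single residue type: -1
print(calcRelativeEntropy('ATATATATATAT', 'ACGT'))  # two of the four types: -0.5
```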
The second function will use the calcRelativeEntropy() function as it scans through an
input sequence. In this example, and unlike the previous ones in this chapter, rather than
having our search window stop before it falls off the end of the input sequence we will use
a cheat that will enable the search to go right to the last residue. This involves pretending
that the sequence is circular; rather than falling off the end the search will jump back to
the beginning of the sequence. Whether it is appropriate to cheat is at the programmer’s
discretion, but in this instance it gives results that appear reasonable, because the
wrapped-around window repeats a real segment of the sequence.
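As a sketch of the approach just described (the function and variable names here are illustrative, not necessarily those used in the full example; calcRelativeEntropy is repeated so the snippet runs on its own):

```python
from math import log

def calcRelativeEntropy(seq, resCodes):
    # Relative entropy against a uniform null hypothesis, as defined above
    N = float(len(seq))
    base = 1.0 / len(resCodes)
    prop = dict.fromkeys(resCodes, 0)
    for r in seq:
        prop[r] += 1
    H = 0.0
    for r in resCodes:
        p = prop[r] / N
        if p:
            H += p * log(p / base, 2.0)
    return H / log(base, 2.0)

def relativeEntropySearch(seq, winSize, isProtein=False):
    # Scan a window along the sequence, pretending the sequence is circular
    # so that the search can run right up to the last residue
    lenSeq = len(seq)
    scores = [0.0] * lenSeq
    extendedSeq = seq + seq[:winSize]  # wrap the start back onto the end
    resCodes = 'ACDEFGHIKLMNPQRSTVWY' if isProtein else 'GCAT'
    for i in range(lenSeq):
        subSeq = extendedSeq[i:i+winSize]
        scores[i] = calcRelativeEntropy(subSeq, resCodes)
    return scores
```

For example, relativeEntropySearch('ACGTACGTACGT', 4) gives a score of 0 at every position, since every four-base circular window contains each nucleotide exactly once.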