Python Programming for Biology: Bioinformatics and Beyond



[Tim J. Stevens, Wayne Boucher] Python Programming

Measuring repetitiveness

It is often useful to measure how repetitive a section of DNA or protein sequence is. For
DNA, regions of repetitive sequence are often associated with non-coding, and especially
‘junk’, DNA; hence if we are searching for genes and their regulatory regions we look for a
sequence that is not repetitive. For protein, repetitive amino acid sequences are associated
with the parts that are unstructured: the bits that don’t fold into a compact globule. This
can be especially useful for multi-domain proteins (several independent compact globules
along the length of the chain), where repetitive sequence regions can identify the flexible
linkers between the compact, inflexible, folded domains.

For this example we will measure the repetitiveness along a sequence by counting how
many different kinds of residue are present within a given search window. If a given
sub-sequence only uses a limited number of residue types, rather than a varied mixture, we
deem that sequence to be repetitive or to have low sequence complexity. For a more formal
mathematical definition we turn to information theory and use a measure which is derived
from the theory of information entropy, as initially described by Shannon.
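
The informal idea of counting residue types per window can be sketched as follows. This is only an illustration under our own assumptions: the function name countResidueTypes and the test sequence are hypothetical, and the window here simply stops at the end of the sequence.

```python
def countResidueTypes(seq, winSize):
    """Return the number of distinct residue types in each full window."""
    counts = []
    # Slide the window one residue at a time, stopping before it runs off the end
    for i in range(len(seq) - winSize + 1):
        window = seq[i:i + winSize]
        # A set keeps only one copy of each residue type
        counts.append(len(set(window)))
    return counts

dnaSeq = 'AAAAAAGCTAGC'  # made-up sequence: a repetitive start, a varied end
print(countResidueTypes(dnaSeq, 4))
```

Low counts mark the repetitive stretch at the start of the made-up sequence, while the varied end reaches the maximum of four nucleotide types.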

Considering that DNA and protein have different numbers of possible residue types (4
versus 20), it is clear that for these different kinds of molecule the expectation of how
many types of residue we will see on average in any given segment will be markedly
different. For example, in a given 12-residue segment, for DNA we would expect to see
every kind of nucleotide, and if all nucleotides are equally likely there would be on
average three of each kind. For 12-residue protein segments, it is impossible to have all
amino acids present, and we might reasonably expect the average occurrence of any kind
to be less than 1.
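
The arithmetic behind these expectations is simply the window length divided by the number of residue types, assuming every type is equally likely. A minimal sketch (the helper name expectedCount is our own, not from the text):

```python
def expectedCount(winSize, numTypes):
    """Average occurrences of each residue type in a window, assuming
    all types are equally likely."""
    return winSize / float(numTypes)

print(expectedCount(12, 4))   # DNA: 4 nucleotide types
print(expectedCount(12, 20))  # protein: 20 amino acid types
```

For a 12-residue window this gives three of each nucleotide, but only 0.6 of each amino acid.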

Accordingly, because the expectation of how likely it is to find a given type of residue
varies according to the situation, we will measure sequence repetitiveness by comparing
the measured occurrences with what we would expect given a random selection. Here we
will refer to such a random selection as the null hypothesis: what we would expect in the
absence of extra information (see also Chapter 22 for further discussion). In reality there
are many null hypotheses that you could choose from; for example, if you were searching
within gene sequences you might already know that G:C base pairs are more common than
A:T. However, for this exercise we will assume an unbiased null hypothesis, where each
residue type is equally likely: the baseline for nucleotides is 25% and the baseline for
amino acids is 5%. We will refer to the formulation we use for this comparative measure of
repetitiveness as the relative entropy, also known as the Kullback-Leibler divergence.
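
For reference, the standard textbook form of the Kullback-Leibler divergence is written out below. The symbols are our own notation rather than the text's: p_r is the observed proportion of residue type r, q_r is its proportion under the null hypothesis, and n is the number of residue types. Terms with p_r = 0 are taken to contribute nothing.

```latex
D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{r} p_r \log_2 \frac{p_r}{q_r},
\qquad
q_r = \frac{1}{n} \;\Rightarrow\;
D_{\mathrm{KL}}(P \,\|\, Q) = \log_2 n + \sum_{r} p_r \log_2 p_r
```

With the uniform null hypothesis this is zero when all residue types are equally represented and reaches its maximum, log2 n bits, when only a single type is present.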

The actual example code will be broken up into two separate functions: one will
calculate the relative entropy and the other will scan through a sequence compiling the
results. The first function takes a sequence and a list of possible residue types, which of
course will differ for DNA and protein, and does the appropriate mathematics (see the last
column of Table 11.2) to give back a single value H (a traditional letter to use for entropy).
Note that this function does not contain any code specific to a given context. This means
that the function will be useful in other situations, which is one of the main reasons to
write it in a separate block of code. A separate function also helps to keep the Python code
readable.

The workings of the function involve calculating the expected base-level proportion
base, then filling the dictionary prop, which stores the abundance of each kind of residue
we have in the input sequence. The residue counts are converted to a proportion (i.e.
between 0 and 1) by dividing by the sequence length, making sure that we do floating
point division (dividing two integers gives an integer in Python before version 3). Armed
with the base level and the proportions we then loop through each residue type and add the
local repetitiveness measure for that type to the relative entropy total. The occurrence of
base-2 logarithms throughout is a means to ensure that the answer is output in units of bits
(common in information theory). The final division by the base-2 logarithm of base, which
is negative, rescales the total so that it runs from 0 for an even mixture of residue types
to -1 for a single repeated type.
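
The conversion from counts to proportions, and the integer-division pitfall mentioned above, can be seen in isolation in this small sketch. The helper name residueProportions and the four-residue test sequence are our own illustrations.

```python
def residueProportions(seq, resCodes):
    """Return the fraction of seq made up by each residue type."""
    n = float(len(seq))  # float() forces true division in Python 2 as well
    props = {}
    for r in resCodes:
        props[r] = seq.count(r) / n
    return props

props = residueProportions('AAGT', 'GCAT')
print(props['A'], props['C'])

# Floor division shows what integer-only division would give for 3 out of 12
print(3 // 12)
```

Without the conversion to a floating point number, Python 2 would truncate every proportion below 1 to zero.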

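def calcRelativeEntropy(seq, resCodes):
    """Calculate a relative entropy value for the residues in a
    sequence compared to a uniform null hypothesis.
    """

    from math import log

    N = float(len(seq))  # float so that the divisions below are not integer divisions
    base = 1.0/len(resCodes)  # expected proportion of each residue type

    # Count how often each residue type occurs in the sequence
    prop = {}
    for r in resCodes:
        prop[r] = 0

    for r in seq:
        prop[r] += 1

    # Convert the counts to proportions between 0 and 1
    for r in resCodes:
        prop[r] /= N

    # Sum the contribution of each residue type that is present
    H = 0
    for r in resCodes:
        if prop[r] != 0.0:
            h = prop[r] * log(prop[r]/base, 2.0)
            H += h

    # log(base, 2.0) is negative, so H now runs from 0 (even mix) to -1 (one type)
    H /= log(base, 2.0)

    return H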
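
To get a feel for the values this function gives back, we can try it on a few made-up DNA sequences. So that the snippet runs on its own, the function is restated here in a condensed form that uses seq.count() rather than the prop dictionary. That is our own shortcut, and it assumes every residue in seq appears in resCodes.

```python
from math import log

def calcRelativeEntropy(seq, resCodes):
    """Condensed restatement of the function above, for a standalone test."""
    N = float(len(seq))
    base = 1.0 / len(resCodes)
    H = 0.0
    for r in resCodes:
        p = seq.count(r) / N
        if p != 0.0:
            H += p * log(p / base, 2.0)
    # Rescale by the (negative) base-2 logarithm of the baseline proportion
    return H / log(base, 2.0)

dnaCodes = 'GCAT'
print(calcRelativeEntropy('AAAAAAAAAAAA', dnaCodes))  # one residue type only
print(calcRelativeEntropy('AAAAAACCCCCC', dnaCodes))  # two types, evenly split
print(calcRelativeEntropy('GCATGCATGCAT', dnaCodes))  # all four types, evenly split
```

The three results should come out close to -1, -0.5 and 0 respectively (the last may print as -0.0), so the more repetitive the sequence, the more negative the value.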

The second function will use the calcRelativeEntropy() function as it scans through an
input sequence. In this example, and unlike the previous ones in this chapter, rather than
having our search window stop before it falls off the end of the input sequence we will use
a cheat that will enable the search to go right to the last residue. This involves pretending
that the sequence is circular; rather than falling off the end the search will jump back to
the beginning of the sequence. Whether it is appropriate to cheat is at the programmer’s
discretion, but in this instance it gives results that appear reasonable, because we are
repeating a real segment of sequence.
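
One way the circular trick might be written is sketched below. This is not the scanning function itself, only an illustration of the wrap-around: the name circularWindows is hypothetical, and we assume the window is no longer than the sequence.

```python
def circularWindows(seq, winSize):
    """Yield one window per start position, treating seq as circular."""
    if not 0 < winSize <= len(seq):
        raise ValueError('winSize must be between 1 and len(seq)')
    # Append the first winSize-1 residues so that late windows wrap to the start
    extended = seq + seq[:winSize - 1]
    for i in range(len(seq)):
        yield extended[i:i + winSize]

for window in circularWindows('ACGTTT', 3):
    print(window)
```

Every residue now starts a full-length window, so the scan yields as many windows as there are residues. The last two windows here borrow their trailing residues from the start of the sequence.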

