Python Programming for Biology: Bioinformatics and Beyond


Simple sub-sequence properties



Download 7,75 Mb.
Pdf ko'rish
bet146/514
Sana30.12.2021
Hajmi7,75 Mb.
#91066
1   ...   142   143   144   145   146   147   148   149   ...   514
Bog'liq
[Tim J. Stevens, Wayne Boucher] Python Programming

Simple sub-sequence properties


The  following  examples  move  from  considering  a  whole  sequence  to  looking  at  the

properties  of  regions  or  overlapping  sub-sequences  from  within  a  larger  sequence.

Although we have only room to give a few simple examples, this kind of analysis is very

important in trying to ascertain the biological function of the different parts of a sequence

and in some cases to help guide experimentation.

Finding a sequence motif

The  next  example  script  is  designed  to  find  a  particular  smaller  sub-sequence  within  a

larger sequence. This kind of operation is useful because specific small sequences, called

motifs,  often  have  important  biological  roles.  Examples  of  this  in  DNA  include  specific

sequences which indicate the start of genes, specify where the coding regions of genes are

spliced  (introns  are  removed)  or  allow  a  protein  to  bind.  There  are  also  examples  in

proteins including where specific sequences are used to interact with other proteins, bind

to DNA or direct that the protein should be modified (e.g. by joining sugars to it).

The following is a simple example of how to find a fixed sub-sequence within a larger

sequence:

seq = 'AGCTCGCTCGCTGCGTATAAAATCGCATCGCGCGCAGC'

position1 = seq.find('TATAAA')

position2 = seq.find('GAGGAG')

This will give the values for position1 as 15 and position2 as −1 (because it cannot be

found). The negative number that is given back from .find() when a substring is not found

is a bit of an oddity. The authors feel that giving back the None object would have been

more appropriate.

In  many  cases,  however,  it  is  not  just  one  single  well-defined  sub-sequence  that

corresponds to a motif with a biological function. More usually there are a range of similar

sequences  that  are  all  found  to  fulfil  the  same  role.  Accordingly,  next  we  will  not  be

defining a sequence motif as a single stretch of residues, but rather as a residue profile.

A residue profile states, for each position in the motif, what range of residues are found

and  how  often  one  kind  of  residue  is  present  compared  to  the  others.  Obviously,  the

residue  profile  that  we  use  to  try  to  detect  a  particular  motif  will  have  been  determined

beforehand,  almost  always  by  comparing  sequences  that  we  know  (e.g.  by

experimentation) will function in the desired way.

This  particular  example  attempts  to  find  the  region  of  a  DNA  sequence  called  the

‘TATA  box’.  This  is  a  biologically  important  region  which,  as  the  name  suggests,  often

contains a particular sequence. The biological role of this sequence is to help define where

the start of a gene is. Note that only some genes use the TATA box system.

Firstly,  we  define  the  profile  that  encodes  the  preferred  sequence  letters  for  each

position in a section of DNA. The profile is stored as a Python dictionary, where the keys

to the dictionary are the nucleotide letters and the values are lists of scores. Each score list

contains  a  number  for  each  position  that  indicates  how  likely  it  is  to  find  the  given

nucleotide at that location.

profile = {



'A':[ 61, 16,352, 3,354,268,360,222,155, 56, 83, 82, 82, 68, 77],

'C':[145, 46, 0, 10, 0, 0, 3, 2, 44,135,147,127,118,107,101],

'G':[152, 18, 2, 2, 5, 0, 10, 44,157,150,128,128,128,139,140],

'T':[ 31,309, 35,374, 30,121, 6,121, 33, 48, 31, 52, 61, 75, 71]}

Next we define a function that takes a DNA sequence and a profile to determine where

the profile best matches. A for loop is used to define the start point i of each sub-sequence

that we wish to test against the profile. Note that we have ensured that the sub-sequence

does not fall off the end of the larger input sequence because we stop the loop at the end of

the sequence minus the profile length, for which we use the variable width.

We  calculate  the  score  for  each  sub-sequence  by  looking  up  the  scores  for  the

component letters using the profile dictionary: we use the nucleotide letter to first get the

correct sores list and then the profile position j, to extract the right score. The position we

interrogate  from  the  input  sequence  is  defined  as  i+j;  i.e.  we  add  the  position  within  the

sub-sequence to the absolute start of the sub-sequence within the larger sequence. The best

score  that  has  been  found  so  far  is  recorded,  together  with  the  position  where  that  score

occurred.

def matchDnaProfile(seq, profile):

""" Find the best-matching position and score when comparing a DNA

sequence with a DNA sequence profile """

bestScore = 0

bestPosition = None # Just to start with

width = len(profile['A'])

for i in range(len(seq)-width):

score = 0

for j in range(width):

letter = seq[i+j]

score += profile[letter][j]

if score > bestScore:

bestScore = score

bestPosition = i

return bestScore, bestPosition

Rather  than  finding  the  best  score  and  best  position,  we  could  alternatively  have

reported the scores at all positions, or even a list of positions that all give a score above a

specified  threshold.  To  test  our  function  with  our  profile  and  DNA  sequence,  and  report

back which section of the sequence gave the best hit, we do the following:

score, position = matchDnaProfile(dnaSeq, profile)

print(score, position, dnaSeq[position:position+15])


Download 7,75 Mb.

Do'stlaringiz bilan baham:
1   ...   142   143   144   145   146   147   148   149   ...   514




Ma'lumotlar bazasi mualliflik huquqi bilan himoyalangan ©hozir.org 2024
ma'muriyatiga murojaat qiling

kiriting | ro'yxatdan o'tish
    Bosh sahifa
юртда тантана
Боғда битган
Бугун юртда
Эшитганлар жилманглар
Эшитмадим деманглар
битган бодомлар
Yangiariq tumani
qitish marakazi
Raqamli texnologiyalar
ilishida muhokamadan
tasdiqqa tavsiya
tavsiya etilgan
iqtisodiyot kafedrasi
steiermarkischen landesregierung
asarlaringizni yuboring
o'zingizning asarlaringizni
Iltimos faqat
faqat o'zingizning
steierm rkischen
landesregierung fachabteilung
rkischen landesregierung
hamshira loyihasi
loyihasi mavsum
faolyatining oqibatlari
asosiy adabiyotlar
fakulteti ahborot
ahborot havfsizligi
havfsizligi kafedrasi
fanidan bo’yicha
fakulteti iqtisodiyot
boshqaruv fakulteti
chiqarishda boshqaruv
ishlab chiqarishda
iqtisodiyot fakultet
multiservis tarmoqlari
fanidan asosiy
Uzbek fanidan
mavzulari potok
asosidagi multiservis
'aliyyil a'ziym
billahil 'aliyyil
illaa billahil
quvvata illaa
falah' deganida
Kompyuter savodxonligi
bo’yicha mustaqil
'alal falah'
Hayya 'alal
'alas soloh
Hayya 'alas
mavsum boyicha


yuklab olish