Python Programming for Biology: Bioinformatics and Beyond


Translate a DNA sequence into protein



Date: 30.12.2021
[Tim J. Stevens, Wayne Boucher] Python Programming


The first example script determines the sequence of amino acids in a protein, starting
from a DNA or RNA sequence, by using the genetic code, stored in a Python dictionary, to
perform the translation. In practice the situation is generally more complicated, because
precisely which section (or sections) of a nucleotide sequence ends up being used is not
always clear; there can be the issue of finding a gene amongst a large amount of DNA and
working out how the RNA that is made (transcribed) from the gene is processed by splicing
to give a mature messenger RNA. For now we leave such problems aside.

Firstly, we define a dictionary that contains our genetic code. Here we use strings
containing three nucleotide letters as the dictionary’s keys; these are the codons. The
value associated with each codon is the three-letter code of the appropriate amino acid or
the None object if it is a stop codon.

STANDARD_GENETIC_CODE = {
    'UUU':'Phe', 'UUC':'Phe', 'UCU':'Ser', 'UCC':'Ser',
    'UAU':'Tyr', 'UAC':'Tyr', 'UGU':'Cys', 'UGC':'Cys',
    'UUA':'Leu', 'UCA':'Ser', 'UAA':None, 'UGA':None,
    'UUG':'Leu', 'UCG':'Ser', 'UAG':None, 'UGG':'Trp',
    'CUU':'Leu', 'CUC':'Leu', 'CCU':'Pro', 'CCC':'Pro',
    'CAU':'His', 'CAC':'His', 'CGU':'Arg', 'CGC':'Arg',
    'CUA':'Leu', 'CUG':'Leu', 'CCA':'Pro', 'CCG':'Pro',
    'CAA':'Gln', 'CAG':'Gln', 'CGA':'Arg', 'CGG':'Arg',
    'AUU':'Ile', 'AUC':'Ile', 'ACU':'Thr', 'ACC':'Thr',
    'AAU':'Asn', 'AAC':'Asn', 'AGU':'Ser', 'AGC':'Ser',
    'AUA':'Ile', 'ACA':'Thr', 'AAA':'Lys', 'AGA':'Arg',
    'AUG':'Met', 'ACG':'Thr', 'AAG':'Lys', 'AGG':'Arg',
    'GUU':'Val', 'GUC':'Val', 'GCU':'Ala', 'GCC':'Ala',
    'GAU':'Asp', 'GAC':'Asp', 'GGU':'Gly', 'GGC':'Gly',
    'GUA':'Val', 'GUG':'Val', 'GCA':'Ala', 'GCG':'Ala',
    'GAA':'Glu', 'GAG':'Glu', 'GGA':'Gly', 'GGG':'Gly'}
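A hand-typed table of 64 entries is easy to get subtly wrong, so it can be worth cross-checking it against an independent construction. The sketch below is not the book's method: it rebuilds the same kind of dictionary from the compact one-letter string commonly used to write out the standard genetic code (bases in the order U, C, A, G, with '*' for stop). The names BASES, AMINO_ACIDS, ONE_TO_THREE and build_genetic_code are hypothetical, chosen only for this illustration.

```python
from itertools import product

# Hypothetical names, for illustration only.
BASES = 'UCAG'

# One-letter amino acid codes for all 64 codons, with the first base
# varying slowest and the third base fastest; '*' marks a stop codon.
AMINO_ACIDS = ('FFLLSSSSYY**CC*W'
               'LLLLPPPPHHQQRRRR'
               'IIIMTTTTNNKKSSRR'
               'VVVVAAAADDEEGGGG')

ONE_TO_THREE = {
    'A': 'Ala', 'R': 'Arg', 'N': 'Asn', 'D': 'Asp', 'C': 'Cys',
    'Q': 'Gln', 'E': 'Glu', 'G': 'Gly', 'H': 'His', 'I': 'Ile',
    'L': 'Leu', 'K': 'Lys', 'M': 'Met', 'F': 'Phe', 'P': 'Pro',
    'S': 'Ser', 'T': 'Thr', 'W': 'Trp', 'Y': 'Tyr', 'V': 'Val',
    '*': None}  # stop codons map to None, as in the table above


def build_genetic_code():
    """Return a codon -> three-letter amino acid dictionary."""
    # product() yields base triples with the last position changing fastest
    codons = [''.join(triple) for triple in product(BASES, repeat=3)]
    return {codon: ONE_TO_THREE[aa] for codon, aa in zip(codons, AMINO_ACIDS)}


code = build_genetic_code()
stop_codons = sorted(codon for codon, aa in code.items() if aa is None)
print(len(code), stop_codons)  # 64 ['UAA', 'UAG', 'UGA']
```

If the hand-typed dictionary is also in scope, comparing the two with == is a quick way to catch a mistyped entry.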

Now we define a sequence. This initial sequence is really only for testing purposes. In
reality of course we want to accept a variety of different sequences from files and
databases.

dnaSeq = 'ATGGTGCATCTGACTCCTGAGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTG'

Given the nucleotide sequence, we take each group of three nucleotide letters and use the
group as a key to look up the corresponding amino acid code, remembering of course that we
must convert any DNA T residues into RNA U residues (which our genetic code dictionary
requires). Assuming we find an amino acid code, we add it to the list which represents the
protein sequence. If the dictionary gives None rather than an amino acid for a codon, then
we have reached a stop codon, whereupon our protein sequence is complete and we can
immediately stop the translation. Note that we define the codon as a three-letter
sub-sequence using the slice notation seq[i:i+3], remembering that this will take letters
from position i, up to but not including i+3. At the end we pass back the list of amino
acid codes. This operation is put into a Python function, so that we can repeat the
operation with any sequence and genetic code.
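As a small illustration of the slice notation described above, the following sketch uses a made-up eight-letter RNA string (not the book's test sequence) to show which letters each slice picks out, and what happens when fewer than three letters remain.

```python
seq = 'AUGGUGCA'  # illustrative only: two full codons plus two spare letters

first = seq[0:3]   # positions 0, 1 and 2
second = seq[3:6]  # positions 3, 4 and 5
tail = seq[6:9]    # only two letters are left; the slice is silently shortened

print(first, second, tail)  # AUG GUG CA
```

Because a slice that runs past the end of a string does not raise an error, the translation loop has to check for itself that a complete codon is still available.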

def proteinTranslation(seq, geneticCode):
    """ This function translates a nucleic acid sequence into a
        protein sequence, until the end or until it comes across
        a stop codon """

    seq = seq.replace('T','U') # Make sure we have RNA sequence
    proteinSeq = []
    i = 0

    while i+2 < len(seq):
        codon = seq[i:i+3]
        aminoAcid = geneticCode[codon]

        if aminoAcid is None: # Found stop codon
            break

        proteinSeq.append(aminoAcid)
        i += 3

    return proteinSeq

Note that there are many ways in which we could have extracted the groups of three letters
from the input sequence. In this instance we used a while loop, which continues as long as
at least three letters remain, i.e. while the index of the codon's last letter, i+2, is
still within the sequence (and as long as the break is not triggered by a stop codon). Here
index i is the position of the first letter in the codon and i+2 is the position of the
last letter. Getting these ‘boundary conditions’ correct (so it is i+2 not i+1 or i+3) is
one of the tricky bits of computer programming. Of course at the end of each pass through
the loop we increase the index by three for the next round.
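One of the other ways to step through the codons is a for loop over range() with a step of three, which removes the need to update the index by hand. The following is a minimal sketch of that alternative, not the book's version; the function name translate_with_range and the three-entry TINY_CODE dictionary are hypothetical and exist only to keep the example self-contained.

```python
def translate_with_range(seq, genetic_code):
    """Translate a nucleotide sequence, stopping at the end or at a stop codon."""
    seq = seq.replace('T', 'U')  # make sure we have an RNA sequence
    protein_seq = []

    # range() stops before len(seq) - 2, so i + 2 is always a valid index
    for i in range(0, len(seq) - 2, 3):
        amino_acid = genetic_code[seq[i:i+3]]
        if amino_acid is None:  # found a stop codon
            break
        protein_seq.append(amino_acid)

    return protein_seq


# Deliberately tiny, hypothetical code table: just enough for the demo.
TINY_CODE = {'AUG': 'Met', 'GUG': 'Val', 'UAA': None}

stopped_early = translate_with_range('ATGGTGTAAATG', TINY_CODE)
ran_to_end = translate_with_range('ATGGTGAT', TINY_CODE)
print(stopped_early)  # ['Met', 'Val']  (translation ends at UAA)
print(ran_to_end)     # ['Met', 'Val']  (trailing 'AU' is not a full codon)
```

The boundary condition is the same as in the while loop: the last starting index allowed is the one for which i+2 is still inside the sequence.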

To actually run the function on our test sequence, we call it by using its name together
with the variable for the test sequence and the variable that holds the genetic code: these
get passed to the function as arguments. The resulting protein sequence is passed back and
assigned to the protein3LetterSeq variable.

protein3LetterSeq = proteinTranslation(dnaSeq, STANDARD_GENETIC_CODE)
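The returned list can then be displayed or converted as needed. In the sketch below the list of three-letter codes is written out by hand from the codon table for the test sequence, standing in for the function's return value so that the example runs on its own; the THREE_TO_ONE dictionary and the oneLetterSeq variable are assumptions added for illustration and do not appear in the text.

```python
# Hand-worked translation of the 19 codons in the test sequence; in the real
# script this list would come from the proteinTranslation() call.
protein3LetterSeq = ['Met', 'Val', 'His', 'Leu', 'Thr', 'Pro', 'Glu', 'Glu',
                     'Lys', 'Ser', 'Ala', 'Val', 'Thr', 'Ala', 'Leu', 'Trp',
                     'Gly', 'Lys', 'Val']

# Hypothetical helper table: three-letter to one-letter amino acid codes.
THREE_TO_ONE = {
    'Ala': 'A', 'Arg': 'R', 'Asn': 'N', 'Asp': 'D', 'Cys': 'C',
    'Gln': 'Q', 'Glu': 'E', 'Gly': 'G', 'His': 'H', 'Ile': 'I',
    'Leu': 'L', 'Lys': 'K', 'Met': 'M', 'Phe': 'F', 'Pro': 'P',
    'Ser': 'S', 'Thr': 'T', 'Trp': 'W', 'Tyr': 'Y', 'Val': 'V'}

# Join the three-letter codes for display, then build a compact string
print('-'.join(protein3LetterSeq))
oneLetterSeq = ''.join(THREE_TO_ONE[aa] for aa in protein3LetterSeq)
print(oneLetterSeq)  # MVHLTPEEKSAVTALWGKV
```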

Converting a DNA sequence to an RNA sequence is much easier, because all we have to do is
replace T letters with U letters, as we already had to do when using the genetic code
dictionary, and we can use the inbuilt Python functionality (assuming the sequence is
stored as a text string) to do this.

rnaSeq = dnaSeq.replace('T','U')
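One caveat is that replace() is case-sensitive, so a sequence stored in lower case would pass through unchanged. A minimal sketch of a more forgiving conversion is given below, assuming that upper-case output is acceptable; the function names dna_to_rna and rna_to_dna are hypothetical.

```python
def dna_to_rna(dna_seq):
    """Return the RNA version of a DNA string, in upper case."""
    return dna_seq.upper().replace('T', 'U')


def rna_to_dna(rna_seq):
    """Return the DNA version of an RNA string, in upper case."""
    return rna_seq.upper().replace('U', 'T')


lower_case_dna = 'atggtgcat'  # illustrative input only

unchanged = lower_case_dna.replace('T', 'U')  # no upper-case 'T' to replace
converted = dna_to_rna(lower_case_dna)

print(unchanged)              # atggtgcat
print(converted)              # AUGGUGCAU
print(rna_to_dna(converted))  # ATGGTGCAT
```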

