The first example script is designed to determine the sequence of amino acids in a protein,
starting from a DNA or RNA sequence by using the genetic code, stored in a Python
dictionary, to perform the translation. The situation is generally more complicated:
where the coding region begins is not always clear; there can be the issue of finding a
gene amongst a large amount of non-coding sequence; and the initial transcript may be
processed by splicing to give a mature messenger RNA. For now we leave such problems aside.
Firstly, we define a dictionary that contains our genetic code. Here we use strings
containing three nucleotide letters as the dictionary’s keys; these are the codons. The value
associated with each codon is the three-letter code of the appropriate amino acid or the
None object if it is a stop codon.
STANDARD_GENETIC_CODE = {
'UUU':'Phe', 'UUC':'Phe', 'UCU':'Ser', 'UCC':'Ser',
'UAU':'Tyr', 'UAC':'Tyr', 'UGU':'Cys', 'UGC':'Cys',
'UUA':'Leu', 'UCA':'Ser', 'UAA':None, 'UGA':None,
'UUG':'Leu', 'UCG':'Ser', 'UAG':None, 'UGG':'Trp',
'CUU':'Leu', 'CUC':'Leu', 'CCU':'Pro', 'CCC':'Pro',
'CAU':'His', 'CAC':'His', 'CGU':'Arg', 'CGC':'Arg',
'CUA':'Leu', 'CUG':'Leu', 'CCA':'Pro', 'CCG':'Pro',
'CAA':'Gln', 'CAG':'Gln', 'CGA':'Arg', 'CGG':'Arg',
'AUU':'Ile', 'AUC':'Ile', 'ACU':'Thr', 'ACC':'Thr',
'AAU':'Asn', 'AAC':'Asn', 'AGU':'Ser', 'AGC':'Ser',
'AUA':'Ile', 'ACA':'Thr', 'AAA':'Lys', 'AGA':'Arg',
'AUG':'Met', 'ACG':'Thr', 'AAG':'Lys', 'AGG':'Arg',
'GUU':'Val', 'GUC':'Val', 'GCU':'Ala', 'GCC':'Ala',
'GAU':'Asp', 'GAC':'Asp', 'GGU':'Gly', 'GGC':'Gly',
'GUA':'Val', 'GUG':'Val', 'GCA':'Ala', 'GCG':'Ala',
'GAA':'Glu', 'GAG':'Glu', 'GGA':'Gly', 'GGG':'Gly'}
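As a quick illustration of how such a dictionary is consulted, the following sketch looks up a couple of codons and distinguishes amino acids from stop codons. It is only a sketch: the table SMALL_CODE is a deliberately abbreviated subset of the genetic code and the helper describeCodon is a hypothetical name introduced here, neither being part of the example script.

```python
# Abbreviated, illustrative subset of the genetic code (not the full table)
SMALL_CODE = {'AUG': 'Met', 'UGG': 'Trp', 'UAA': None, 'UAG': None}

def describeCodon(codon, geneticCode):
    """ Hypothetical helper: report what a codon codes for """
    aminoAcid = geneticCode[codon]  # Raises KeyError if the codon is not a key
    if aminoAcid is None:           # None marks a stop codon
        return codon + ' is a stop codon'
    return codon + ' codes for ' + aminoAcid

print(describeCodon('AUG', SMALL_CODE))
print(describeCodon('UAA', SMALL_CODE))
```

Testing with "is None", rather than relying on a missing key, keeps stop codons distinct from codons that are simply absent from the table.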
Now we define a sequence. This initial sequence is really only for testing purposes. In
reality of course we want to accept a variety of different sequences from files and
databases.
dnaSeq = 'ATGGTGCATCTGACTCCTGAGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTG'
Given the nucleotide sequence, we take each group of three nucleotide letters and use
the group as a key to look up the corresponding amino acid code, remembering of course
that we must convert any DNA T residues into RNA U residues (which our genetic code
dictionary requires). Assuming we find an amino acid code we add it to the list which
represents the protein sequence. If we cannot find an amino acid for a codon, then we have
reached a stop codon, whereupon our protein sequence is complete and we can
immediately stop the translation. Note that we define the codon as a three-letter
subsequence using the slice notation seq[i:i+3], remembering that this will take letters from
position i, up to but not including i+3. At the end we pass back the list of amino acid
codes. This operation is put into a Python function, so that we can repeat the operation
with any sequence and genetic code.
def proteinTranslation(seq, geneticCode):
    """ This function translates a nucleic acid sequence into a
    protein sequence, until the end or until it comes across
    a stop codon """
    seq = seq.replace('T','U') # Make sure we have RNA sequence
    proteinSeq = []
    i = 0
    while i+2 < len(seq):
        codon = seq[i:i+3]
        aminoAcid = geneticCode[codon]
        if aminoAcid is None: # Found stop codon
            break
        proteinSeq.append(aminoAcid)
        i += 3
    return proteinSeq
Note that there are many ways in which we could have extracted the groups of three
letters from the input sequence. In this instance we used a while loop, and the loop
continues as long as at least three letters remain, i.e. while the index plus two (i+2)
is still within the length of the sequence, unless the break is triggered first by
a stop codon. Here index i will be the position of the first letter in the codon and i+2 will
be the last letter. Getting these ‘boundary conditions’ correct (so it is i+2 not i+1 or i+3) is
one of the tricky bits of computer programming. Of course at the end of the loop we
increase the index by three for the next round.
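One of the alternative ways of extracting the groups of three letters is to let range() generate the starting positions directly. The sketch below assumes we only want complete codons; the function name splitCodons is hypothetical and is not used elsewhere in the script.

```python
def splitCodons(seq):
    """ Hypothetical helper: split a sequence into complete codons,
    ignoring any incomplete group of one or two trailing letters """
    codons = []
    # Stopping at len(seq)-2 is the same test as i+2 < len(seq)
    for i in range(0, len(seq) - 2, 3):
        codons.append(seq[i:i+3])  # Letters i, i+1 and i+2
    return codons

print(splitCodons('AUGGUGCAUCU'))  # ['AUG', 'GUG', 'CAU']
```

Here the step of three in range() plays the role of the i += 3 at the end of the while loop, and the two leftover letters are discarded.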
To actually run the function on our test sequence, call it by name with the variable
for the test sequence and the variable that holds the genetic code; these get passed to
the function as arguments. The resulting protein sequence is passed back to fill in the
value of the protein3LetterSeq variable.
protein3LetterSeq = proteinTranslation(dnaSeq, STANDARD_GENETIC_CODE)
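To see the early stop in action, the following self-contained sketch runs the same translation logic on a short sequence that contains a stop codon. The names TEST_CODE and testSeq are assumptions made for illustration, and the table holds only the codons needed here rather than the full genetic code.

```python
# Abbreviated table holding only the codons used in this test
TEST_CODE = {'AUG': 'Met', 'GUG': 'Val', 'CAU': 'His',
             'CUG': 'Leu', 'UAA': None}

def proteinTranslation(seq, geneticCode):
    """ Same logic as the function defined earlier, repeated so
    that this sketch runs on its own """
    seq = seq.replace('T', 'U')  # Make sure we have RNA sequence
    proteinSeq = []
    i = 0
    while i+2 < len(seq):
        aminoAcid = geneticCode[seq[i:i+3]]
        if aminoAcid is None:  # Found stop codon
            break
        proteinSeq.append(aminoAcid)
        i += 3
    return proteinSeq

testSeq = 'ATGGTGCATTAACTG'  # The stop codon TAA precedes the final CTG
protein3LetterSeq = proteinTranslation(testSeq, TEST_CODE)
print(protein3LetterSeq)            # ['Met', 'Val', 'His']
print('-'.join(protein3LetterSeq))  # Met-Val-His
```

The final CTG codon is never translated because the loop is left as soon as the stop codon is found.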
Converting a DNA sequence to an RNA sequence is much easier, because all we have
to do is replace T letters with U letters, as we already had to do when using the genetic
code dictionary, and we can use the inbuilt Python functionality (assuming the sequence is
stored as a text string) to do this.
rnaSeq = dnaSeq.replace('T','U')
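One practical caveat is that replace() is case-sensitive, so a lowercase t would be left untouched. A minimal sketch of a more forgiving conversion, assuming we are happy to force the whole sequence to upper case, is given below; the function name dnaToRna is hypothetical.

```python
def dnaToRna(dnaSeq):
    """ Hypothetical helper: convert a DNA string to RNA,
    accepting upper or lower case input letters """
    return dnaSeq.upper().replace('T', 'U')

print(dnaToRna('atgGTGcat'))  # AUGGUGCAU
```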