Python Programming for Biology: Bioinformatics and Beyond

Translate a DNA sequence into protein

Date: 30.12.2021
[Tim J. Stevens, Wayne Boucher] Python Programming

The first example script determines the sequence of amino acids in a protein, starting from a DNA or RNA sequence, by using the genetic code, stored in a Python dictionary, to perform the translation. In practice the situation is generally more complicated, because precisely which section (or sections) of a nucleotide sequence ends up being used is not always clear: there can be the issue of finding a gene amongst a large amount of DNA, and of working out how the RNA that is made (transcribed) from the gene is processed by splicing to give a mature messenger RNA. For now we leave such problems aside.

First, we define a dictionary that contains our genetic code. Here we use strings containing three nucleotide letters as the dictionary’s keys; these are the codons. The value associated with each codon is the three-letter code of the corresponding amino acid, or the None object if it is a stop codon.

STANDARD_GENETIC_CODE = {
  'UUU':'Phe', 'UUC':'Phe', 'UCU':'Ser', 'UCC':'Ser',
  'UAU':'Tyr', 'UAC':'Tyr', 'UGU':'Cys', 'UGC':'Cys',
  'UUA':'Leu', 'UCA':'Ser', 'UAA':None,  'UGA':None,
  'UUG':'Leu', 'UCG':'Ser', 'UAG':None,  'UGG':'Trp',
  'CUU':'Leu', 'CUC':'Leu', 'CCU':'Pro', 'CCC':'Pro',
  'CAU':'His', 'CAC':'His', 'CGU':'Arg', 'CGC':'Arg',
  'CUA':'Leu', 'CUG':'Leu', 'CCA':'Pro', 'CCG':'Pro',
  'CAA':'Gln', 'CAG':'Gln', 'CGA':'Arg', 'CGG':'Arg',
  'AUU':'Ile', 'AUC':'Ile', 'ACU':'Thr', 'ACC':'Thr',
  'AAU':'Asn', 'AAC':'Asn', 'AGU':'Ser', 'AGC':'Ser',
  'AUA':'Ile', 'ACA':'Thr', 'AAA':'Lys', 'AGA':'Arg',
  'AUG':'Met', 'ACG':'Thr', 'AAG':'Lys', 'AGG':'Arg',
  'GUU':'Val', 'GUC':'Val', 'GCU':'Ala', 'GCC':'Ala',
  'GAU':'Asp', 'GAC':'Asp', 'GGU':'Gly', 'GGC':'Gly',
  'GUA':'Val', 'GUG':'Val', 'GCA':'Ala', 'GCG':'Ala',
  'GAA':'Glu', 'GAG':'Glu', 'GGA':'Gly', 'GGG':'Gly'}
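As a quick illustration of how such a dictionary behaves, the sketch below looks up a few codons. To keep it short it uses a four-entry subset, given the hypothetical name partialCode, rather than the full table; the lookups work in the same way with STANDARD_GENETIC_CODE.

```python
# Illustrative subset of the genetic code (hypothetical name, not the full table)
partialCode = {'AUG':'Met', 'UGG':'Trp', 'UAA':None, 'GGC':'Gly'}

print(partialCode['AUG'])          # Met
print(partialCode['UAA'] is None)  # True: a stop codon maps to None

# A square-bracket lookup raises KeyError for a missing key, so membership
# can be tested first with the "in" operator
print('NNN' in partialCode)        # False
```

Because None is stored as an ordinary value, a stop codon is still a valid key; only a codon that is absent from the dictionary altogether causes an error.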

Now we define a sequence. This initial sequence is really only for testing purposes; in reality, of course, we want to accept a variety of different sequences from files and databases.

dnaSeq = 'ATGGTGCATCTGACTCCTGAGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTG'

Given the nucleotide sequence, we take each group of three nucleotide letters and use the group as a key to look up the corresponding amino acid code, remembering of course that we must first convert any DNA T residues into RNA U residues (which our genetic code dictionary requires). If the lookup gives an amino acid code, we add it to the list which represents the protein sequence. If it gives None instead, then we have reached a stop codon, whereupon our protein sequence is complete and we can immediately stop the translation. Note that we define the codon as a three-letter sub-sequence using the slice notation seq[i:i+3], remembering that this will take letters from position i, up to but not including i+3. At the end we pass back the list of amino acid codes. This operation is put into a Python function, so that we can repeat the operation with any sequence and genetic code.

def proteinTranslation(seq, geneticCode):
  """ This function translates a nucleic acid sequence into a
      protein sequence, until the end or until it comes across
      a stop codon """

  seq = seq.replace('T','U') # Make sure we have RNA sequence
  proteinSeq = []

  i = 0
  while i+2 < len(seq):
    codon = seq[i:i+3]
    aminoAcid = geneticCode[codon]

    if aminoAcid is None: # Found stop codon
      break

    proteinSeq.append(aminoAcid)
    i += 3

  return proteinSeq

Note that there are many ways in which we could have extracted the groups of three letters from the input sequence. In this instance we used a while loop, which continues as long as there are still at least three letters remaining, i.e. as long as the index plus two, i+2, is still within the length of the sequence (and the break has not been triggered by a stop codon). Here index i is the position of the first letter in the codon and i+2 is the position of the last. Getting these ‘boundary conditions’ correct (so it is i+2, not i+1 or i+3) is one of the tricky bits of computer programming. Of course, at the end of each pass through the loop we increase the index by three for the next round.
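One of the other ways of stepping through the codons is a for loop over range() with a step of three. The following is a sketch only, not the version used in this chapter: the function name proteinTranslationFor and the small testCode dictionary are made up for illustration, and the dictionary covers just the codons that occur in the example.

```python
def proteinTranslationFor(seq, geneticCode):
  """ Variant of the translation loop using range() with a step of 3 """

  seq = seq.replace('T','U') # Make sure we have RNA sequence
  proteinSeq = []

  # Start positions 0, 3, 6, ...; stopping before len(seq)-2 guarantees
  # that every start position is followed by two more letters
  for i in range(0, len(seq)-2, 3):
    aminoAcid = geneticCode[seq[i:i+3]]

    if aminoAcid is None: # Found stop codon
      break

    proteinSeq.append(aminoAcid)

  return proteinSeq

# Hypothetical subset of the genetic code, enough for this example only
testCode = {'AUG':'Met', 'GUG':'Val', 'CAU':'His', 'UAA':None}

print(proteinTranslationFor('ATGGTGCATTAAGTG', testCode))
# ['Met', 'Val', 'His']
```

The boundary condition has not gone away; it has moved into the second argument of range(), where len(seq)-2 plays the same role as the i+2 test in the while loop.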

To actually run the function on our test sequence, we call it by using its name together with the variable for the test sequence and the variable that holds the genetic code: these get passed to the function as arguments. The resulting protein sequence is passed back and assigned to the protein3LetterSeq variable.

protein3LetterSeq = proteinTranslation(dnaSeq, STANDARD_GENETIC_CODE)
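To check the behaviour of the function it can help to call it on short sequences whose translation is easy to work out by hand. The sketch below repeats the function so that it runs on its own; the dictionary smallCode is a made-up subset covering only the codons in the example, and the input sequences are invented for the test.

```python
def proteinTranslation(seq, geneticCode):
  """ Same function as in the text, repeated so this example is runnable """

  seq = seq.replace('T','U') # Make sure we have RNA sequence
  proteinSeq = []

  i = 0
  while i+2 < len(seq):
    codon = seq[i:i+3]
    aminoAcid = geneticCode[codon]

    if aminoAcid is None: # Found stop codon
      break

    proteinSeq.append(aminoAcid)
    i += 3

  return proteinSeq

# Hypothetical subset of the genetic code for this example
smallCode = {'AUG':'Met', 'GUG':'Val', 'CAU':'His', 'UGA':None}

# Translation stops at the TGA stop codon; the final GTG is never reached
print(proteinTranslation('ATGGTGTGAGTG', smallCode))  # ['Met', 'Val']

# Two trailing letters do not make a full codon and are ignored
print(proteinTranslation('ATGCATGT', smallCode))      # ['Met', 'His']

# A codon that is missing from the dictionary raises KeyError
try:
  proteinTranslation('ATGNNN', smallCode)
except KeyError as err:
  print('Unknown codon:', err)
```

The last case is worth keeping in mind for real data: a sequence containing ambiguity letters such as N, or lower-case letters, would need to be handled before it is passed to the function as written.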

Converting a DNA sequence to an RNA sequence is much easier, because all we have to do is replace T letters with U letters, as we already had to do when using the genetic code dictionary. Assuming the sequence is stored as a text string, we can use Python’s inbuilt string functionality to do this.

rnaSeq = dnaSeq.replace('T','U')
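A minimal sketch of this conversion on a separate example follows. The upper() call and the reverse conversion back to DNA are additions for illustration, assuming the input may contain lower-case letters; the sequence itself is made up.

```python
dnaExample = 'atgGTGcat'  # Made-up sequence with mixed case

# Normalise to upper case first, otherwise a lower-case 't' would be left alone
rnaExample = dnaExample.upper().replace('T','U')
print(rnaExample)  # AUGGUGCAU

# The reverse substitution recovers the (upper-case) DNA sequence
print(rnaExample.replace('U','T'))  # ATGGTGCAT
```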
