This next script estimates the mass of a DNA, RNA or protein molecule (in units of
daltons). This is only an estimate because various residues reversibly bind hydrogen ions
under different conditions (i.e. pH affects whether H+ ions are bound).
Nonetheless this estimate will be useful enough to say where we expect DNA or protein to
appear, for example in a mass spectrometer trace.
We define the function and specify that it takes one argument seq, which is a sequence,
and one argument molType, which states whether we are using a protein sequence, a DNA
sequence or an RNA sequence. Note that we set a default value for molType to be
‘protein’, so that we can omit this argument when the sequence is a protein.
Inside the function we define a dictionary that stores the average molecular weights of
the different kinds of residue. Internally this dictionary contains three inner sub-
dictionaries, one for each of the different molecule types. We access the correct inner
dictionary using the molType as a key. The one-letter residue codes then act as the keys to
the inner dictionary to extract the appropriate molecular masses.
Next we define a variable to hold the running total for the molecular mass. This starts
at the molecular mass of water rather than zero, because the average residue masses in
the dictionary describe residues that are linked on both sides. The two end residues are
linked on only one side and so carry extra atoms (OH at one end and H at the other),
which together add the mass of one water molecule.
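As a quick check of this reasoning, the calculation can be done by hand for a dipeptide.
The sketch below is only illustrative: it uses a cut-down copy of the dictionary, and the
names waterMass and dipeptideMass are hypothetical and not part of the function itself.
Note that the letter "G" is a valid key in both inner dictionaries but refers to a
different residue in each (guanine for DNA, glycine for protein).

```python
# Cut-down copy of the nested dictionary, with only a few residues
residueMasses = {
    "DNA": {"G": 329.21, "C": 289.18},
    "protein": {"G": 57.05, "A": 71.07},
}

# The outer key selects the molecule type, the inner key the residue
massDict = residueMasses["protein"]

# Mass of the extra end atoms (H + OH), i.e. one water molecule
waterMass = 18.02

# Glycine + alanine residues, plus the end atoms
dipeptideMass = waterMass + massDict["G"] + massDict["A"]
print(round(dipeptideMass, 2))
```

Without the water term the total would be short by the mass of the H and OH atoms on
the two free ends of the chain.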
def estimateMolMass(seq, molType='protein'):
    """Calculate the molecular weight of a biological sequence assuming
    normal isotopic ratios and protonation/modification states
    """
    residueMasses = {
        "DNA": {"G":329.21, "C":289.18, "A":323.21, "T":304.19},
        "RNA": {"G":345.21, "C":305.18, "A":329.21, "U":302.16},
        "protein": {"A": 71.07, "R":156.18, "N":114.08, "D":115.08,
                    "C":103.10, "Q":128.13, "E":129.11, "G": 57.05,
                    "H":137.14, "I":113.15, "L":113.15, "K":128.17,
                    "M":131.19, "F":147.17, "P": 97.11, "S": 87.07,
                    "T":101.10, "W":186.20, "Y":163.17, "V": 99.13}}

    massDict = residueMasses[molType]

    # Begin with mass of extra end atoms H + OH
    molMass = 18.02

    for letter in seq:
        molMass += massDict.get(letter, 0.0)

    return molMass
The for loop extracts each element of the sequence in turn, which will be a single
nucleotide or amino acid letter. This letter is then used to look up the appropriate value of
molecular mass in the dictionary. The dictionary's .get() method is used so that a
default value for the mass can be specified, just in case we have a letter in the sequence
that is not in the dictionary. Here the default is 0.0, so unrecognised letters add nothing
to the total; in some circumstances a guess at the average mass of a residue may be a
more appropriate default.
The molecular mass of the current residue is then added to the total, and the for loop
moves on to the next letter in the sequence. Finally the return statement is used so that the
value of the total molecular mass is passed back to the point in the program where the
function was called. To test this function we could do something like:
proteinSeq = 'IRTNGTHMQPLLKLMKFQKFLLELFTLQKRKPEKGYNLPIISLNQ'
proteinMass = estimateMolMass(proteinSeq)
or for DNA, noting that we have to specify the molecule type:
dnaSeq = 'ATGGTGCATCTGACTCCTGAGGAG'  # any DNA sequence; this one is only illustrative
dnaMass = estimateMolMass(dnaSeq, molType='DNA')
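The earlier remark about using an average mass for unrecognised residues, rather than
0.0, could be sketched as follows. This is one possible approach and not a definitive
implementation: the function name estimateMolMassWithFallback and the unknownMass
argument are hypothetical, and taking the unweighted mean of the known residue masses
as the guess is an assumption made here for illustration.

```python
def estimateMolMassWithFallback(seq, molType='protein', unknownMass=None):
    """Variant of estimateMolMass (hypothetical) that substitutes a
    guessed mass for any letter that is missing from the dictionary
    """
    residueMasses = {
        "DNA": {"G":329.21, "C":289.18, "A":323.21, "T":304.19},
        "RNA": {"G":345.21, "C":305.18, "A":329.21, "U":302.16},
        "protein": {"A": 71.07, "R":156.18, "N":114.08, "D":115.08,
                    "C":103.10, "Q":128.13, "E":129.11, "G": 57.05,
                    "H":137.14, "I":113.15, "L":113.15, "K":128.17,
                    "M":131.19, "F":147.17, "P": 97.11, "S": 87.07,
                    "T":101.10, "W":186.20, "Y":163.17, "V": 99.13}}

    massDict = residueMasses[molType]

    if unknownMass is None:
        # Assumption: the unweighted mean of the known residue masses
        # is an acceptable guess for an unrecognised residue
        unknownMass = sum(massDict.values()) / len(massDict)

    # Begin with mass of extra end atoms H + OH
    molMass = 18.02

    for letter in seq:
        # Unknown letters now contribute unknownMass instead of 0.0
        molMass += massDict.get(letter, unknownMass)

    return molMass


# 'X' is not a key in the protein dictionary
print(estimateMolMassWithFallback('GXA'))
print(estimateMolMassWithFallback('GXA', unknownMass=100.0))
```

Whether a mean value is a sensible guess depends on the data. Where an unrecognised
letter indicates a mistake in the sequence, ignoring it (or reporting it) may be the
better choice.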