PDB (Protein Data Bank) files were invented in the 1970s to describe the
three-dimensional coordinates of biological macromolecules. As the name suggests, the
format was initially designed for proteins, but it is now commonly used to represent
DNA, RNA, carbohydrates, lipids, small molecules and any other biologically important
molecule. PDB files can contain the description of multiple molecules and multiple
structures, and can hold lots of other descriptive information. However, in this section we
ignore all the complexities and concentrate only on the parts that specify the spatial
coordinates. A PDB file is both key/value and line oriented, with the key at the start of
each line giving context to the data in the remainder of the line. The coordinates we are
interested in are in records where the line starts with the six characters ‘ATOM  ’ (the
word ATOM followed by two spaces), which can be thought of as the key. The x coordinate
is given in columns 30 to 37, y in columns 38 to 45 and z in columns 46 to 53, assuming
that the first column is column 0.
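As a minimal sketch of this column layout, the snippet below builds a single ATOM record with a fixed-width format string and then slices the key and the coordinates back out. The atom details (serial 1, atom N, residue MET, chain A) and the variable names are invented for illustration; the widths of the fields before the coordinates follow the usual PDB layout and add up to 30 characters. Note that a Python slice excludes its end index, so columns 30 to 37 are written as [30:38].

```python
# Build an illustrative ATOM record; the values are invented for this example.
# The fields before the coordinates occupy 30 characters, so x starts at
# column 30 (counting from 0).
template = ("{:<6}{:>5} {:<4}{:1}{:>3} {:1}{:>4}{:1}   "
            "{:8.3f}{:8.3f}{:8.3f}")
record = template.format("ATOM", 1, " N", "", "MET", "A", 1, "",
                         12.572, -3.140, 0.500)

key = record[:6]            # 'ATOM  ' (note the two trailing spaces)
x = float(record[30:38])    # columns 30 to 37
y = float(record[38:46])    # columns 38 to 45
z = float(record[46:54])    # columns 46 to 53

print(repr(key), x, y, z)
```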
The following example reads a PDB file to calculate the centroid of a structure, the
average position of the atoms. Strictly speaking, this should be weighted by the mass of
each atom, but we ignore that issue here (and in practice it does not make much of a
difference). In a drawing application, if you rotate a molecule on the screen, you generally
want to rotate it about the centroid, otherwise the rotation looks odd.
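To make the distinction between the plain centroid and the mass-weighted average concrete, here is a small sketch that works on coordinates already held in memory. The function names centroid() and centreOfMass() and the two atoms with their masses are made up for illustration and are not part of the example that follows.

```python
def centroid(coords):
    """Unweighted average position of a list of (x, y, z) tuples."""
    n = float(len(coords))
    return tuple(sum(c[i] for c in coords) / n for i in range(3))


def centreOfMass(coords, masses):
    """Average position weighted by the mass of each atom."""
    total = float(sum(masses))
    return tuple(sum(m * c[i] for c, m in zip(coords, masses)) / total
                 for i in range(3))


# Two made-up atoms on the x axis; the second has a third of the mass
coords = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
masses = [12.0, 4.0]

print(centroid(coords))              # (1.0, 0.0, 0.0)
print(centreOfMass(coords, masses))  # (0.5, 0.0, 0.0)
```

With only two atoms of very different mass the two results differ noticeably. The difference matters less for a large structure, where many atoms of broadly similar mass are averaged.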
The function takes the name of the PDB file as an argument, and returns the number of
atoms found as well as the average x, y and z positions. As a PDB reader the function is
very simple and naïve, and in any serious program you would do best to use an existing
and tested function, like the one in the Biopython package. Nonetheless, the function will
serve to illustrate the principles involved.
Initially we open the file object. Next, variables representing the number of atoms and
the totals for the x, y and z coordinates are initialised to zero, before looping through
each of the lines in the file. If a line begins with the desired ‘ATOM  ’ key the atom count
is increased, the coordinates are extracted and the coordinate totals are increased. The file
is closed once the loop is complete. The coordinate data is initially just text characters
from the file and needs to be converted to Python numbers (which can be added
numerically). The Python float() function performs the conversion from text string to
floating point number. So, for example, the string ‘12.572’ would be converted to the
number 12.572.
def calcCentroid(pdbFile):

  fileObj = open(pdbFile, 'r')
  natoms = 0
  xsum = ysum = zsum = 0.0

  for line in fileObj:
    # The key is 'ATOM' followed by two spaces, six characters in all
    if line[:6] == 'ATOM  ':
      natoms += 1
      # Slices exclude the end index, so [30:38] covers columns 30 to 37
      x = float(line[30:38])
      y = float(line[38:46])
      z = float(line[46:54])
      xsum += x
      ysum += y
      zsum += z

  fileObj.close()

  if natoms == 0:
    # No ATOM records: avoid dividing by zero
    xavg = yavg = zavg = 0
  else:
    xavg = xsum / natoms
    yavg = ysum / natoms
    zavg = zsum / natoms

  return (natoms, xavg, yavg, zavg)
Once the looping is done and the additions are complete, the averages are calculated by
dividing the sum of each coordinate by the total number of atoms. Note that if the PDB
file has no atom records the averages are simply set to zero, which also avoids an attempt
to divide by zero. The function is then readily tested:
print(calcCentroid('examples/protein.pdb'))
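If no PDB file is to hand, one way to check the arithmetic is with a variant that accepts any iterable of lines rather than a file name. The sketch below assumes such a variant, here called calcCentroidFromLines() (a hypothetical name, not part of the original example), and feeds it two invented ATOM records in which 24 spaces stand in for the atom details between the key and the coordinates.

```python
def calcCentroidFromLines(lines):
    """Hypothetical variant of calcCentroid() that takes lines, not a file name."""
    natoms = 0
    xsum = ysum = zsum = 0.0
    for line in lines:
        if line[:6] == 'ATOM  ':
            natoms += 1
            xsum += float(line[30:38])
            ysum += float(line[38:46])
            zsum += float(line[46:54])
    if natoms == 0:
        return (0, 0.0, 0.0, 0.0)
    return (natoms, xsum / natoms, ysum / natoms, zsum / natoms)


# Invented test data: the 6-character key plus 24 spaces puts x at column 30
testLines = [
    'REMARK made-up test data',
    'ATOM  ' + ' ' * 24 + '%8.3f%8.3f%8.3f' % (1.0, 2.0, 3.0),
    'ATOM  ' + ' ' * 24 + '%8.3f%8.3f%8.3f' % (3.0, 4.0, 5.0),
]

print(calcCentroidFromLines(testLines))  # (2, 2.0, 3.0, 4.0)
```

Because an open file object is itself an iterable of lines, the same function could also be passed a file opened by the caller.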
Of course it’s possible that someone calls the calcCentroid() function with an argument
that is not a PDB file, or even a file that does not exist. If the file does not exist, or you do
not have permission to read it, then the function will throw a standard Python exception
(IOError, which in Python 3 is another name for OSError) when it tries to open it. If the
file exists but is not a PDB file then most likely there will be no lines starting with the text
‘ATOM  ’ and so the function will just return the tuple (0, 0, 0, 0). It’s also possible in this
case that there is a line starting with ‘ATOM  ’ (by coincidence) but it does not have three
floating point numbers in columns 30 through 53, in which case a standard Python
exception (ValueError) will be thrown when the float() function is called.
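The two failure modes can be reproduced in isolation with a short sketch. It does not call calcCentroid() itself; it only triggers the same open() and float() calls. The file name and the bad line are made up, and the file is looked for inside a freshly created temporary directory so that it is certain not to exist.

```python
import os
import tempfile

# A path inside a new, empty temporary directory cannot refer to a real file
tmpDir = tempfile.mkdtemp()
missing = os.path.join(tmpDir, 'no_such_file.pdb')

try:
    open(missing, 'r')
    openError = None
except IOError as err:       # IOError is an alias of OSError in Python 3
    openError = err

os.rmdir(tmpDir)

# A line with the right key but ordinary text where x should be
badLine = 'ATOM  ' + ' ' * 24 + 'not-a-number'

try:
    float(badLine[30:38])
    floatError = None
except ValueError as err:
    floatError = err

print(type(openError).__name__)
print(type(floatError).__name__)
```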
There is always a question as to how you deal with bad input to a function. There is no
perfect answer. Sometimes you might want to throw standard Python exceptions. In other
cases you might want to check for conditions that might lead to an exception and instead
return some sensible default. Alternatively you might want to throw your own exception
to give a more informative warning to the user, rather than the standard Python one. It is a
matter of taste and circumstance.
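As a sketch of two of these options, the code below handles the coordinate conversion for a single line in two ways: one throws a custom exception with a more informative message, the other returns a default value. The names PdbFormatError, coordsOrRaise() and coordsOrDefault() are hypothetical and chosen only for this illustration.

```python
class PdbFormatError(Exception):
    """Hypothetical exception for ATOM lines whose coordinates cannot be read."""


def coordsOrRaise(line, lineNum):
    """Replace the standard ValueError with a more informative exception."""
    try:
        return (float(line[30:38]), float(line[38:46]), float(line[46:54]))
    except ValueError:
        raise PdbFormatError('line %d: cannot read coordinates from %r'
                             % (lineNum, line[30:54]))


def coordsOrDefault(line, default=None):
    """Return a default value instead of throwing an exception."""
    try:
        return (float(line[30:38]), float(line[38:46]), float(line[46:54]))
    except ValueError:
        return default


# Invented lines: one well formed, one with text in the coordinate columns
goodLine = 'ATOM  ' + ' ' * 24 + '%8.3f%8.3f%8.3f' % (1.0, 2.0, 3.0)
badLine = 'ATOM  ' + ' ' * 24 + 'this is not coordinate data'

print(coordsOrRaise(goodLine, 1))
print(coordsOrDefault(badLine))

try:
    coordsOrRaise(badLine, 2)
except PdbFormatError as err:
    print(err)
```

Which version is preferable depends on the caller: a default keeps a batch job running over many files, whereas an exception makes a malformed file hard to overlook.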