Python Programming for Biology: Bioinformatics and Beyond

[Tim J. Stevens, Wayne Boucher] Python Programming

Reading PDB files

PDB (Protein Data Bank) files were invented in the 1970s to describe the three-dimensional
coordinates of biological macromolecules. As the name suggests, the format was initially
designed for proteins, but the same system is now commonly used to represent DNA, RNA,
carbohydrates, lipids, small molecules and any other biologically important molecule. PDB
files can contain the description of multiple molecules and multiple structures, and can
hold lots of other descriptive information. However, in this section we ignore all the
complexities and concentrate only on the parts that specify the spatial coordinates. A PDB
file is both key/value and line oriented, with the key at the start of each line giving
context to the data in the remainder of the line. The coordinates we are interested in are
in records where the line starts with the six characters ‘ATOM  ’ (with two spaces at the
end), which can be thought of as the key. The x coordinate is given in columns 30 to 37, y
in columns 38 to 45 and z in columns 46 to 53, assuming that the first column is column 0.
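The fixed-column layout can be sketched with plain string slicing. The helper name parseAtomCoords and the sample record below are illustrative assumptions rather than part of the original text; the record is padded so that the coordinate fields begin at column 30.

```python
def parseAtomCoords(line):
    """Return (x, y, z) from an ATOM record, or None for any other line."""
    if line[:6] != 'ATOM  ':  # six-character key: 'ATOM' plus two spaces
        return None
    # A slice excludes its end index, so [30:38] covers columns 30 to 37.
    x = float(line[30:38])
    y = float(line[38:46])
    z = float(line[46:54])
    return (x, y, z)

# Hypothetical record: descriptive fields padded to 30 characters, followed
# by three right-justified, eight-character coordinate fields.
prefix = 'ATOM      1  N   MET A   1'.ljust(30)
sampleLine = prefix + '%8.3f%8.3f%8.3f' % (12.572, 8.301, -4.113)

print(parseAtomCoords(sampleLine))
print(parseAtomCoords('HETATM    1  O   HOH A   2'))  # different key, so None
```

Because the key comparison uses all six characters, a record such as HETATM is skipped even though it has the same column layout.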

The following example reads a PDB file to calculate the centroid of a structure, that is,
the average position of the atoms. Strictly speaking, this should be weighted by the mass
of each atom, but we ignore that issue here (and in practice it does not make much of a
difference). In a drawing application, if you rotate a molecule on the screen, it is
generally desirable to rotate it about the centroid, otherwise the rotation looks odd.
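To make the distinction between the plain average and the weighted average concrete, here is a minimal sketch. The three atoms and their coordinates are invented for illustration, the function names are assumptions, and the masses are standard atomic weights.

```python
# Hypothetical atoms as (element, x, y, z); the coordinates are made up.
atoms = [('N', 0.0, 0.0, 0.0),
         ('C', 1.5, 0.0, 0.0),
         ('O', 1.5, 1.2, 0.0)]

# Standard atomic weights for the elements used above.
masses = {'C': 12.011, 'N': 14.007, 'O': 15.999}

def centroid(atoms):
    """Unweighted average position: every atom counts equally."""
    n = len(atoms)
    return tuple(sum(atom[i] for atom in atoms) / n for i in (1, 2, 3))

def centreOfMass(atoms, masses):
    """Average position with each atom weighted by its mass."""
    total = sum(masses[atom[0]] for atom in atoms)
    return tuple(sum(masses[atom[0]] * atom[i] for atom in atoms) / total
                 for i in (1, 2, 3))

print(centroid(atoms))
print(centreOfMass(atoms, masses))
```

For this toy example the two results differ slightly because the heaviest atom (oxygen) pulls the weighted average towards itself.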

The function takes the name of the PDB file as an argument, and returns the number of atoms
found as well as the average x, y and z positions. As a PDB reader the function is very
simple and naïve, and in any serious program you would do best to use an existing and
tested function, like the one in the Biopython package. Nonetheless, the function will
serve to illustrate the principles involved.

Initially we open the file object. Next, variables representing the number of atoms and the
totals for the x, y and z coordinates are initialised to zero, before looping through each
of the lines in the file. If a line begins with the desired ‘ATOM  ’ key the atom count is
increased, the coordinates are extracted and the coordinate totals are increased. The
coordinate data is initially just text characters from the file and needs to be converted
to Python numbers (which can be added numerically). The Python float() function performs
the conversion from text string to floating point number. So, for example, the string
‘12.572’ would be converted to the number 12.572. Once all of the lines have been
processed the file is closed.
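The difference between adding text and adding numbers can be seen in a short sketch. The two field values are hypothetical, written as the eight-character, space-padded strings that a coordinate slice would produce.

```python
xText = '  12.572'  # eight-character field, padded with leading spaces
yText = '   8.301'

# Adding the raw text joins the two strings; nothing is summed.
print(repr(xText + yText))

# float() ignores the surrounding spaces and gives numbers that can be added.
print(float(xText) + float(yText))
```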

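def calcCentroid(pdbFile):

    # Plain text mode; the older 'rU' mode was removed in Python 3.11.
    fileObj = open(pdbFile, 'r')

    natoms = 0
    xsum = ysum = zsum = 0

    for line in fileObj:
        # Record key: 'ATOM' followed by two spaces (six characters).
        if line[:6] == 'ATOM  ':
            natoms += 1
            x = float(line[30:38])  # columns 30 to 37
            y = float(line[38:46])  # columns 38 to 45
            z = float(line[46:54])  # columns 46 to 53
            xsum += x
            ysum += y
            zsum += z

    fileObj.close()

    # Avoid dividing by zero when the file has no ATOM records.
    if natoms == 0:
        xavg = yavg = zavg = 0
    else:
        xavg = xsum / natoms
        yavg = ysum / natoms
        zavg = zsum / natoms

    return (natoms, xavg, yavg, zavg)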

Once the looping is done and the additions are complete, the averages are calculated by
dividing the sum of each coordinate by the total number of atoms. Note that if the PDB file
has no atom records the averages are simply set to zero, because we cannot divide by zero.
The function is then readily tested:

print(calcCentroid('examples/protein.pdb'))
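If no PDB file is to hand, the behaviour can be checked against a small generated file. The sketch below is an assumption-laden variant, not the function from the text: calcCentroidWith is a hypothetical name, it uses a with block so that the file is closed even if float() fails part way through, and the two-atom file contents are invented.

```python
import os
import tempfile

def calcCentroidWith(pdbFile):
    """Variant of calcCentroid() that closes the file via a 'with' block."""
    natoms = 0
    xsum = ysum = zsum = 0.0
    with open(pdbFile) as fileObj:
        for line in fileObj:
            if line[:6] == 'ATOM  ':
                natoms += 1
                xsum += float(line[30:38])
                ysum += float(line[38:46])
                zsum += float(line[46:54])
    if natoms == 0:
        return (0, 0.0, 0.0, 0.0)
    return (natoms, xsum / natoms, ysum / natoms, zsum / natoms)

# Hypothetical coordinates for a two-atom test structure.
coords = [(1.0, 2.0, 3.0), (3.0, 4.0, 5.0)]

with tempfile.TemporaryDirectory() as tmpDir:
    pdbPath = os.path.join(tmpDir, 'twoAtoms.pdb')
    with open(pdbPath, 'w') as fileObj:
        fileObj.write('HEADER    HYPOTHETICAL TEST STRUCTURE\n')
        for i, xyz in enumerate(coords, 1):
            # Pad the descriptive fields so the coordinates start at column 30.
            prefix = ('ATOM  %5d  CA  ALA A%4d' % (i, i)).ljust(30)
            fileObj.write(prefix + '%8.3f%8.3f%8.3f\n' % xyz)
    result = calcCentroidWith(pdbPath)

print(result)  # (2, 2.0, 3.0, 4.0)
```

The HEADER line is ignored because it does not start with the ATOM key, so only the two atom records contribute to the averages.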

Of course it’s possible that someone calls the calcCentroid() function with an argument
that is not a PDB file, or even a file that does not exist. If the file does not exist, or
you do not have permission to read it, then the function will throw a standard Python
exception (IOError, which in Python 3 is an alias of OSError) when it tries to open it. If
the file exists but is not a PDB file then most likely there will be no lines starting with
the text ‘ATOM  ’ and so the function will just return the tuple (0, 0, 0, 0). It’s also
possible in this case that there is a line starting with ‘ATOM  ’ (by coincidence) but it
does not have three floating point numbers in columns 30 through 53, in which case a
standard Python exception (ValueError) will be thrown when the float() function is called.
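Both failure modes can be reproduced in isolation, without the full function. In this sketch the file name and the text of badLine are made up for illustration.

```python
import os
import tempfile

# Opening a path that does not exist fails at open().
with tempfile.TemporaryDirectory() as tmpDir:
    missingPath = os.path.join(tmpDir, 'missing.pdb')
    try:
        open(missingPath)
    except IOError as err:  # IOError and OSError are the same class in Python 3
        openError = type(err).__name__

print(openError)  # FileNotFoundError, a subclass of OSError

# A line that starts with the key by coincidence but holds ordinary text.
badLine = 'ATOM  is the first word of this line, which is ordinary text.'
try:
    float(badLine[30:38])
except ValueError as err:
    floatError = str(err)

print(floatError)

# A line that is too short gives an empty slice, which float() also rejects.
try:
    float('ATOM  '[30:38])
except ValueError:
    shortLineFails = True

print(shortLineFails)
```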

There is always a question as to how you deal with bad input to a function. There is no
perfect answer. Sometimes you might want to throw standard Python exceptions. In other
cases you might want to check for conditions that might lead to an exception and instead
return some sensible default. Alternatively you might want to throw your own exception to
give a more informative warning to the user, rather than the standard Python one. It is a
matter of taste and circumstance.
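Two of these options can be sketched side by side for a single record. The exception class PdbFormatError and both function names are hypothetical, introduced only for this illustration.

```python
class PdbFormatError(Exception):
    """Hypothetical exception raised for an unparseable ATOM record."""


def readCoordsStrict(line, lineNum):
    """Raise an informative error if the coordinate columns are not numbers."""
    try:
        return tuple(float(line[start:start + 8]) for start in (30, 38, 46))
    except ValueError as err:
        message = 'line %d: expected coordinates in columns 30 to 53' % lineNum
        raise PdbFormatError(message) from err


def readCoordsLenient(line, default=None):
    """Return a default value instead of raising when the record is malformed."""
    try:
        return tuple(float(line[start:start + 8]) for start in (30, 38, 46))
    except ValueError:
        return default


badLine = 'ATOM  records normally hold numbers in fixed columns, not prose.'

print(readCoordsLenient(badLine))  # None

try:
    readCoordsStrict(badLine, 7)
except PdbFormatError as err:
    print('Rejected:', err)
```

The strict version tells the caller which line was at fault, while the lenient version lets a loop skip bad records and carry on; which is preferable depends on how much the caller can do about a damaged file.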
