Populating the model: reading PDB files
In
Chapter 6
we made a quick and dirty attempt at reading structural coordinate
information from a PDB-format file, when we wrote the function calcCentroid(). In that
function we used the information as we read it, and did not bother storing it for later use.
In contrast, here we will create one or more Structure objects, so we can perform any
number of manipulations and interrogations on the data, after it has been loaded. There are
various aspects of a PDB-format file which we will consider when loading its data into our
Python objects:
A PDB file can have multiple structure conformations (models) in it, so our function
will return a list of structures.
A PDB file is record oriented and most records are one line long. The first six
characters of the record determine the record type.
The ‘HEADER’ record gives a short description and a date but most importantly for
us it also provides the PDB identifier.
The ‘TITLE’ record gives a further description of the molecular assembly. We will
use this to specify the name attribute of the structure.
A ‘MODEL’ record indicates when a new structure is starting, and gives numbers for
the conformation. It is optional, so in a PDB file with only one set of coordinates it
can be missing. If there is a ‘MODEL’ record then there is a matching ‘ENDMDL’
record when the description of that conformation finishes.
An ‘ATOM’ record specifies the chain, residue and atom information, including the
coordinates, for one atom. This means that the chain and residue information is
specified over and over again, once for each atom in the residue.
A fuller description of the (current) PDB file format can be found via the wwPDB
website (
http://www.wwpdb.org
). There are many more record types, but we will only use
the ones mentioned above and ignore the remainder. Also, it should be noted that, when
this book was being written, a decision was taken to phase out the (now quite old) PDB
file format and use the mmCIF format instead. However, we stick with the PDB format,
which is still widespread, as an example for teaching Python.
Below we define the getStructuresFromFile() function, which takes the location of a
file and gives back a list of Structure objects. At the top of the function the variable
structure, representing the current working Structure object, is initialised to None. This is
done so that later in the code, when we come across an ‘ATOM’ record for the first time
we know that we need to create a new Structure object. The variable structure is again set
to None when an ‘ENDMDL’ record is found, because that indicates that this is the end of
the current conformation; the next ‘ATOM’ record, if there is another, is for the next
conformation where a new structure would then need to be created. The conformation is
initialised to 0 because there might not be any ‘MODEL’ record. However, if there is a
‘MODEL’ record then instead the conformation is taken from that. The name and pdbId
are initialised to default values. In theory these do not have to be initialised because they
will all be set up in due course, when the ‘TITLE’ and ‘HEADER’ records have been read,
but it might be the case (for non-standard PDB files) that those records are missing. After
the initialisation the remainder of the function involves looping through the lines from the
file to interrogate the various records, making the required Python objects as we go along:
def getStructuresFromFile(fileName):
fileObject = open(fileName)
structure = None
name = 'unknown'
conformation = 0
pdbId = None
structures = []
for line in fileObject:
record = line[0:6].strip()
if record == 'HEADER':
pdbId = line.split()[-1]
elif record == 'TITLE':
name = line[10:].strip()
elif record == 'MODEL':
conformation = int(line[10:14])
elif record == 'ENDMDL':
structure = None
elif record == 'ATOM':
serial = int(line[6:11]) # not used here
atomName = line[12:16].strip()
resName = line[17:20].strip()
chainCode = line[21:22].strip()
seqId = int(line[22:26])
x = float(line[30:38])
y = float(line[38:46])
z = float(line[46:54])
segment = line[72:76].strip()
element = line[76:78].strip()
if chainCode == '':
if segment:
chainCode = segment
else:
chainCode = 'A'
if not structure:
structure = Structure(name, conformation, pdbId)
structures.append(structure)
chain = structure.getChain(chainCode)
if not chain:
chain = Chain(structure, chainCode)
residue = chain.getResidue(seqId)
if not residue:
residue = Residue(chain, seqId, resName)
if not element:
element = name[0] # Have to guess
coords = (x,y,z)
atom = Atom(residue, atomName, coords, element)
fileObject.close()
return structures
Values for the various attributes are extracted from the lines of the file, according to the
kind of record present. When dealing with numeric data like the seqId (an integer) or the
x, y and z coordinates (floating point numbers) the data extracted from the line will
initially be just a string of characters, i.e. not a real Python number object. Hence, for
these values we need to explicitly convert the text with int() or float() as appropriate. The
textual values receive a little processing to remove spaces from their ends, using the strip()
method of strings.
Note that as well as the structure, the chain and residue are also created as and when
they are needed; each time we come across an ‘ATOM’ line we definitely need to make a
new atom, but a new chain or residue is only needed when we come across its first atom.
Here we detect when a new chain is needed when the chainCode changes and a new
residue when the seqId number changes. Naturally, when we start a new chain we will also
make a new residue, given that it doesn’t yet have any children to put atoms into. For the
chainCode we do a little checking to make sure it wasn’t blank
8
and if it is empty we try to
substitute with the segment letter, and if this fails a plain ‘A’. Also, if the chemical
element is missing we take the first character of the atom name as a guess. These are just
two examples of the kind of checking that should be done if you are accepting coordinate
data from a variety of sources and are being rigorous. Here we consider only using official
files from the PDB, so we are confident that they are well formed and conform to all the
standards.
In the above example it may seem odd that we define the atom variable, since in fact it
is not used for anything, and we could just have written
Atom(residue, atomName, coords, element)
Either way, it looks like we are immediately throwing away the object we have just
created, leaving it to be garbage collected. However, the parent residue has a handle to all
its atoms, via both a dictionary and a list, so this does not happen. In turn chain has a
handle to all its residues, and structure has a handle to all its chains. So it all works out
nicely and all we need to pass back at the end of the function is a list of structures, which
contain everything else. The code is then easily tested on an appropriate PDB-format file:
testStructs = getStructuresFromFile('examples/glycophorin.pdb')
structure = testStructs[0]
chain = structure.getChain('A')
for residue in chain.residues:
print(residue.seqId, residue.code)
There is a notable deficiency in the code; the molType of the chain is not specified
explicitly, so it is set to the default value, ‘protein’. This is not very satisfactory if the
chain is DNA or RNA. A cleverer approach would be to try and determine, perhaps from
the residue.code and/or atom names, what the molType is. This raises the issue that it
could take a few ‘ATOM’ records before the molType might be determined accurately, by
which time the chain has long since been created, perhaps with the incorrect molType.
That would be a problem if the molType was not supposed to change, i.e. was frozen. If
we do allow it to change then we could set the molType accurately by looking at what
kinds of atom are present. A function to do this, which looks at a few characteristic atoms
in a residue, might be as follows:
def guessResidueMolType(residue):
if residue.getAtom("CA") and residue.getAtom("N"):
return 'protein'
elif residue.getAtom("C5'") and residue.getAtom("C3'"): # DNA/RNA
if residue.getAtom("02'"):
return 'RNA'
else: # This is "2'-deoxy"
return 'DNA'
return 'other'
If we insisted that the molecule type could not be changed, an alternative method would
be to process the entire PDB file, deduce the molType and then afterwards create the
structure and associated chains etc. We would still need to store the information as we
process the PDB file, and that means some kind of intermediate data structure, though this
could be a simple one involving dictionaries and lists.
Do'stlaringiz bilan baham: |