Python Programming for Biology: Bioinformatics and Beyond

Populating the model: reading PDB files

Download 7,75 Mb.

Pdf ko'rish

bet	106/514
Sana	30.12.2021
Hajmi	7,75 Mb.
	#91066

1 ... 102 103 104 105 106 107 108 109 ... 514

Bog'liq
[Tim J. Stevens, Wayne Boucher] Python Programming

Populating the model: reading PDB files

Chapter 6

we made a quick and dirty attempt at reading structural coordinate

information from a PDB-format file, when we wrote the function calcCentroid(). In that

function we used the information as we read it, and did not bother storing it for later use.

In contrast, here we will create one or more Structure objects, so we can perform any

number of manipulations and interrogations on the data, after it has been loaded. There are

various aspects of a PDB-format file which we will consider when loading its data into our

Python objects:

A PDB file can have multiple structure conformations (models) in it, so our function

will return a list of structures.

A PDB file is record oriented and most records are one line long. The first six

characters of the record determine the record type.

The ‘HEADER’ record gives a short description and a date but most importantly for

us it also provides the PDB identifier.

The ‘TITLE’ record gives a further description of the molecular assembly. We will

use this to specify the name attribute of the structure.

A ‘MODEL’ record indicates when a new structure is starting, and gives numbers for

the conformation. It is optional, so in a PDB file with only one set of coordinates it

can be missing. If there is a ‘MODEL’ record then there is a matching ‘ENDMDL’

record when the description of that conformation finishes.

An ‘ATOM’ record specifies the chain, residue and atom information, including the

coordinates, for one atom. This means that the chain and residue information is

specified over and over again, once for each atom in the residue.

A fuller description of the (current) PDB file format can be found via the wwPDB

website (

http://www.wwpdb.org

). There are many more record types, but we will only use

the ones mentioned above and ignore the remainder. Also, it should be noted that, when

this book was being written, a decision was taken to phase out the (now quite old) PDB

file format and use the mmCIF format instead. However, we stick with the PDB format,

which is still widespread, as an example for teaching Python.

Below we define the getStructuresFromFile() function, which takes the location of a

file and gives back a list of Structure objects. At the top of the function the variable

structure, representing the current working Structure object, is initialised to None. This is

done so that later in the code, when we come across an ‘ATOM’ record for the first time

we know that we need to create a new Structure object. The variable structure is again set

to None when an ‘ENDMDL’ record is found, because that indicates that this is the end of

the current conformation; the next ‘ATOM’ record, if there is another, is for the next

conformation where a new structure would then need to be created. The conformation is

initialised to 0 because there might not be any ‘MODEL’ record. However, if there is a

‘MODEL’ record then instead the conformation is taken from that. The name and pdbId

are initialised to default values. In theory these do not have to be initialised because they

will all be set up in due course, when the ‘TITLE’ and ‘HEADER’ records have been read,

but it might be the case (for non-standard PDB files) that those records are missing. After

the initialisation the remainder of the function involves looping through the lines from the

file to interrogate the various records, making the required Python objects as we go along:

def getStructuresFromFile(fileName):

fileObject = open(fileName)

structure = None

name = 'unknown'

conformation = 0

pdbId = None

structures = []

for line in fileObject:

record = line[0:6].strip()

if record == 'HEADER':

pdbId = line.split()[-1]

elif record == 'TITLE':

name = line[10:].strip()

elif record == 'MODEL':

conformation = int(line[10:14])

elif record == 'ENDMDL':

structure = None

elif record == 'ATOM':

serial = int(line[6:11]) # not used here

atomName = line[12:16].strip()

resName = line[17:20].strip()

chainCode = line[21:22].strip()

seqId = int(line[22:26])

x = float(line[30:38])

y = float(line[38:46])

z = float(line[46:54])

segment = line[72:76].strip()

element = line[76:78].strip()

if chainCode == '':

if segment:

chainCode = segment

else:

chainCode = 'A'

if not structure:

structure = Structure(name, conformation, pdbId)

structures.append(structure)

chain = structure.getChain(chainCode)

if not chain:

chain = Chain(structure, chainCode)

residue = chain.getResidue(seqId)

if not residue:

residue = Residue(chain, seqId, resName)

if not element:

element = name[0] # Have to guess

coords = (x,y,z)

atom = Atom(residue, atomName, coords, element)

fileObject.close()

return structures

Values for the various attributes are extracted from the lines of the file, according to the

kind of record present. When dealing with numeric data like the seqId (an integer) or the

x, y and z coordinates (floating point numbers) the data extracted from the line will

initially be just a string of characters, i.e. not a real Python number object. Hence, for

these values we need to explicitly convert the text with int() or float() as appropriate. The

textual values receive a little processing to remove spaces from their ends, using the strip()

method of strings.

Note that as well as the structure, the chain and residue are also created as and when

they are needed; each time we come across an ‘ATOM’ line we definitely need to make a

new atom, but a new chain or residue is only needed when we come across its first atom.

Here we detect when a new chain is needed when the chainCode changes and a new

residue when the seqId number changes. Naturally, when we start a new chain we will also

make a new residue, given that it doesn’t yet have any children to put atoms into. For the

chainCode we do a little checking to make sure it wasn’t blank

and if it is empty we try to

substitute with the segment letter, and if this fails a plain ‘A’. Also, if the chemical

element is missing we take the first character of the atom name as a guess. These are just

two examples of the kind of checking that should be done if you are accepting coordinate

data from a variety of sources and are being rigorous. Here we consider only using official

files from the PDB, so we are confident that they are well formed and conform to all the

standards.

In the above example it may seem odd that we define the atom variable, since in fact it

is not used for anything, and we could just have written

Atom(residue, atomName, coords, element)

Either way, it looks like we are immediately throwing away the object we have just

created, leaving it to be garbage collected. However, the parent residue has a handle to all

its atoms, via both a dictionary and a list, so this does not happen. In turn chain has a

handle to all its residues, and structure has a handle to all its chains. So it all works out

nicely and all we need to pass back at the end of the function is a list of structures, which

contain everything else. The code is then easily tested on an appropriate PDB-format file:

testStructs = getStructuresFromFile('examples/glycophorin.pdb')

structure = testStructs[0]

chain = structure.getChain('A')

for residue in chain.residues:

print(residue.seqId, residue.code)

There is a notable deficiency in the code; the molType of the chain is not specified

explicitly, so it is set to the default value, ‘protein’. This is not very satisfactory if the

chain is DNA or RNA. A cleverer approach would be to try and determine, perhaps from

the residue.code and/or atom names, what the molType is. This raises the issue that it

could take a few ‘ATOM’ records before the molType might be determined accurately, by

which time the chain has long since been created, perhaps with the incorrect molType.

That would be a problem if the molType was not supposed to change, i.e. was frozen. If

we do allow it to change then we could set the molType accurately by looking at what

kinds of atom are present. A function to do this, which looks at a few characteristic atoms

in a residue, might be as follows:

def guessResidueMolType(residue):

if residue.getAtom("CA") and residue.getAtom("N"):

return 'protein'

elif residue.getAtom("C5'") and residue.getAtom("C3'"): # DNA/RNA

if residue.getAtom("02'"):

return 'RNA'

else: # This is "2'-deoxy"

return 'DNA'

return 'other'

If we insisted that the molecule type could not be changed, an alternative method would

be to process the entire PDB file, deduce the molType and then afterwards create the

structure and associated chains etc. We would still need to store the information as we

process the PDB file, and that means some kind of intermediate data structure, though this

could be a simple one involving dictionaries and lists.

Download 7,75 Mb.

Do'stlaringiz bilan baham:

1 ... 102 103 104 105 106 107 108 109 ... 514