Python Programming for Biology: Bioinformatics and Beyond


Populating the model: reading PDB files



Download 7,75 Mb.
Pdf ko'rish
bet106/514
Sana30.12.2021
Hajmi7,75 Mb.
#91066
1   ...   102   103   104   105   106   107   108   109   ...   514
Bog'liq
[Tim J. Stevens, Wayne Boucher] Python Programming

Populating the model: reading PDB files

In


Chapter  6

 we  made  a  quick  and  dirty  attempt  at  reading  structural  coordinate

information  from  a  PDB-format  file,  when  we  wrote  the  function  calcCentroid().  In  that

function we used the information as we read it, and did not bother storing it for later use.

In  contrast,  here  we  will  create  one  or  more  Structure  objects,  so  we  can  perform  any

number of manipulations and interrogations on the data, after it has been loaded. There are

various aspects of a PDB-format file which we will consider when loading its data into our

Python objects:

A PDB file can have multiple structure conformations (models) in it, so our function

will return a list of structures.

A  PDB  file  is  record  oriented  and  most  records  are  one  line  long.  The  first  six

characters of the record determine the record type.

The ‘HEADER’ record gives a short description and a date but most importantly for

us it also provides the PDB identifier.

The  ‘TITLE’  record  gives  a  further  description  of  the  molecular  assembly.  We  will

use this to specify the name attribute of the structure.

A ‘MODEL’ record indicates when a new structure is starting, and gives numbers for

the conformation. It is optional, so in a PDB file with only one set of coordinates it

can be missing. If there is a ‘MODEL’ record then there is a matching ‘ENDMDL’

record when the description of that conformation finishes.

An ‘ATOM’ record specifies the chain, residue and atom information, including the

coordinates,  for  one  atom.  This  means  that  the  chain  and  residue  information  is

specified over and over again, once for each atom in the residue.

A  fuller  description  of  the  (current)  PDB  file  format  can  be  found  via  the  wwPDB

website (

http://www.wwpdb.org

). There are many more record types, but we will only use

the ones mentioned above and ignore the remainder. Also, it should be noted that, when

this book was being written, a decision was taken to phase out the (now quite old) PDB

file format and use the mmCIF format instead. However, we stick with the PDB format,

which is still widespread, as an example for teaching Python.

Below  we  define  the  getStructuresFromFile()  function,  which  takes  the  location  of  a

file  and  gives  back  a  list  of  Structure  objects.  At  the  top  of  the  function  the  variable

structure, representing the current working Structure object, is initialised to None. This is

done so that later in the code, when we come across an ‘ATOM’ record for the first time

we know that we need to create a new Structure object. The variable structure is again set

to None when an ‘ENDMDL’ record is found, because that indicates that this is the end of

the  current  conformation;  the  next  ‘ATOM’  record,  if  there  is  another,  is  for  the  next

conformation where a new structure would then need to be created. The conformation is

initialised  to  0  because  there  might  not  be  any  ‘MODEL’  record.  However,  if  there  is  a

‘MODEL’ record then instead the conformation  is  taken  from  that.  The  name  and  pdbId

are initialised to default values. In theory these do not have to be initialised because they




will all be set up in due course, when the ‘TITLE’ and ‘HEADER’ records have been read,

but it might be the case (for non-standard PDB files) that those records are missing. After

the initialisation the remainder of the function involves looping through the lines from the

file to interrogate the various records, making the required Python objects as we go along:

def getStructuresFromFile(fileName):

fileObject = open(fileName)

structure = None

name = 'unknown'

conformation = 0

pdbId = None

structures = []

for line in fileObject:

record = line[0:6].strip()

if record == 'HEADER':

pdbId = line.split()[-1]

elif record == 'TITLE':

name = line[10:].strip()

elif record == 'MODEL':

conformation = int(line[10:14])

elif record == 'ENDMDL':

structure = None

elif record == 'ATOM':

serial = int(line[6:11]) # not used here

atomName = line[12:16].strip()

resName = line[17:20].strip()

chainCode = line[21:22].strip()

seqId = int(line[22:26])

x = float(line[30:38])

y = float(line[38:46])

z = float(line[46:54])

segment = line[72:76].strip()

element = line[76:78].strip()

if chainCode == '':

if segment:

chainCode = segment

else:


chainCode = 'A'

if not structure:

structure = Structure(name, conformation, pdbId)

structures.append(structure)

chain = structure.getChain(chainCode)

if not chain:

chain = Chain(structure, chainCode)



residue = chain.getResidue(seqId)

if not residue:

residue = Residue(chain, seqId, resName)

if not element:

element = name[0] # Have to guess

coords = (x,y,z)

atom = Atom(residue, atomName, coords, element)

fileObject.close()

return structures

Values for the various attributes are extracted from the lines of the file, according to the

kind of record present. When dealing with numeric data like the seqId (an integer) or the

x,  y  and  z  coordinates  (floating  point  numbers)  the  data  extracted  from  the  line  will

initially  be  just  a  string  of  characters,  i.e.  not  a  real  Python  number  object.  Hence,  for

these values we need to explicitly convert the text with int() or float() as appropriate. The

textual values receive a little processing to remove spaces from their ends, using the strip()

method of strings.

Note  that  as  well  as  the  structure,  the  chain  and  residue  are  also  created  as  and  when

they are needed; each time we come across an ‘ATOM’ line we definitely need to make a

new atom, but a new chain or residue is only needed when we come across its first atom.

Here  we  detect  when  a  new  chain  is  needed  when  the  chainCode  changes  and  a  new

residue when the seqId number changes. Naturally, when we start a new chain we will also

make a new residue, given that it doesn’t yet have any children to put atoms into. For the

chainCode we do a little checking to make sure it wasn’t blank

8

and if it is empty we try to



substitute  with  the  segment  letter,  and  if  this  fails  a  plain  ‘A’.  Also,  if  the  chemical

element is missing we take the first character of the atom name as a guess. These are just

two examples of the kind of checking that should be done if you are accepting coordinate

data from a variety of sources and are being rigorous. Here we consider only using official

files from the PDB, so we are confident that they are well formed and conform to all the

standards.

In the above example it may seem odd that we define the atom variable, since in fact it

is not used for anything, and we could just have written

Atom(residue, atomName, coords, element)

Either  way,  it  looks  like  we  are  immediately  throwing  away  the  object  we  have  just

created, leaving it to be garbage collected. However, the parent residue has a handle to all

its  atoms,  via  both  a  dictionary  and  a  list,  so  this  does  not  happen.  In  turn  chain  has  a

handle to all its residues, and structure  has  a  handle  to  all  its  chains.  So  it  all  works  out

nicely and all we need to pass back at the end of the function is a list of structures, which

contain everything else. The code is then easily tested on an appropriate PDB-format file:

testStructs = getStructuresFromFile('examples/glycophorin.pdb')

structure = testStructs[0]

chain = structure.getChain('A')




for residue in chain.residues:

print(residue.seqId, residue.code)

There  is  a  notable  deficiency  in  the  code;  the  molType  of  the  chain  is  not  specified

explicitly,  so  it  is  set  to  the  default  value,  ‘protein’.  This  is  not  very  satisfactory  if  the

chain is DNA or RNA. A cleverer approach would be to try and determine, perhaps from

the  residue.code  and/or  atom  names,  what  the  molType  is.  This  raises  the  issue  that  it

could take a few ‘ATOM’ records before the molType might be determined accurately, by

which  time  the  chain  has  long  since  been  created,  perhaps  with  the  incorrect  molType.

That would be a problem if the molType was not supposed to change, i.e. was frozen. If

we  do  allow  it  to  change  then  we  could  set  the  molType  accurately  by  looking  at  what

kinds of atom are present. A function to do this, which looks at a few characteristic atoms

in a residue, might be as follows:

def guessResidueMolType(residue):

if residue.getAtom("CA") and residue.getAtom("N"):

return 'protein'

elif residue.getAtom("C5'") and residue.getAtom("C3'"): # DNA/RNA

if residue.getAtom("02'"):

return 'RNA'

else: # This is "2'-deoxy"

return 'DNA'

return 'other'

If we insisted that the molecule type could not be changed, an alternative method would

be  to  process  the  entire  PDB  file,  deduce  the  molType  and  then  afterwards  create  the

structure  and  associated  chains  etc.  We  would  still  need  to  store  the  information  as  we

process the PDB file, and that means some kind of intermediate data structure, though this

could be a simple one involving dictionaries and lists.




Download 7,75 Mb.

Do'stlaringiz bilan baham:
1   ...   102   103   104   105   106   107   108   109   ...   514




Ma'lumotlar bazasi mualliflik huquqi bilan himoyalangan ©hozir.org 2024
ma'muriyatiga murojaat qiling

kiriting | ro'yxatdan o'tish
    Bosh sahifa
юртда тантана
Боғда битган
Бугун юртда
Эшитганлар жилманглар
Эшитмадим деманглар
битган бодомлар
Yangiariq tumani
qitish marakazi
Raqamli texnologiyalar
ilishida muhokamadan
tasdiqqa tavsiya
tavsiya etilgan
iqtisodiyot kafedrasi
steiermarkischen landesregierung
asarlaringizni yuboring
o'zingizning asarlaringizni
Iltimos faqat
faqat o'zingizning
steierm rkischen
landesregierung fachabteilung
rkischen landesregierung
hamshira loyihasi
loyihasi mavsum
faolyatining oqibatlari
asosiy adabiyotlar
fakulteti ahborot
ahborot havfsizligi
havfsizligi kafedrasi
fanidan bo’yicha
fakulteti iqtisodiyot
boshqaruv fakulteti
chiqarishda boshqaruv
ishlab chiqarishda
iqtisodiyot fakultet
multiservis tarmoqlari
fanidan asosiy
Uzbek fanidan
mavzulari potok
asosidagi multiservis
'aliyyil a'ziym
billahil 'aliyyil
illaa billahil
quvvata illaa
falah' deganida
Kompyuter savodxonligi
bo’yicha mustaqil
'alal falah'
Hayya 'alal
'alas soloh
Hayya 'alas
mavsum boyicha


yuklab olish