Python Programming for Biology: Bioinformatics and Beyond


Adding a structure to the database



Download 7,75 Mb.
Pdf ko'rish
bet310/514
Sana30.12.2021
Hajmi7,75 Mb.
#91066
1   ...   306   307   308   309   310   311   312   313   ...   514
Bog'liq
[Tim J. Stevens, Wayne Boucher] Python Programming

Adding a structure to the database

The basic ideas of how to deal with the Python wrapper around SQL were discussed in the

previous section. Here we provide an implementation of how we can add a structure item,

including  all  of  its  chains,  residues  and  atoms,  into  the  database.  The  example  below

works  for  both  sqlite3  and  MySQLdb  because  we  use  the  simple  function

formatStatement() to insert the correct placeholder text in all the statements, substituting

using the normal Python string format code %s.

def formatStatement(text, placeHolder):

if placeHolder == '%s':

return text

numInserts = text.count('%s')

return text % numInserts*(placeHolder,)

The  code  below  executes  one  SQL  command  for  each  record  being  inserted  into  the

database.  In  general  SQL  commands  are  rather  slow  to  execute.  It  would  be  better  to

bundle  several  SQL  insert  statements  into  each  command.  SQLite  does  not  offer  this

functionality so the method below is the only one used in this case. But MySQL does offer

the ability to bundle several insert statements into each command, and after we discuss the

generic  version  then  we  will  modify  the  code  to  take  advantage  of  this  feature.  In  real-

world  applications  requiring  reasonable  performance,  the  following  code  would  not  be

acceptable (and so SQLite would not be acceptable). But we illustrate here nonetheless for




educational purposes.

The  function  definition  has  three  arguments:  a  database  connection,  a  structure  object

and a placeHolder, which defaults to the MySQLdb value. The function first gets a cursor,

with which interactions with the database will take place.

def addStructureToDb(connection, structure, placeHolder='%s'):

cursor = connection.cursor()

We  first  check  whether  there  is  already  a  structure  with  the  same  pdbId  and

conformation in the database. If so, we raise an error exception.

pdbId = structure.pdbId

conformation = structure.conformation

stmt = "select * from structure where pdbId=%s and conformation=%s"

stmt = formatStatement(stmt, placeHolder)

cursor.execute(stmt, (pdbId, conformation))

if cursor.fetchone():

cursor.close()

msg = 'structure with (pdbId=%s, conformation=%s) already known'

raise Exception(msg % (pdbId, conformation))

If there is any kind of error when inserting the structure into the database we want to

roll  back  (reverse)  the  transaction,  so  we  put  the  entire  code  into  a  try/except  block.  We

then  insert  the  relevant  structure  data  into  the  structure  table.  Note  that  after  a  record  is

inserted into the database, the identifier of that record can be found via cursor.lastrowid.

This identifier is then used when creating the child records. Here it is the structureId.

try:

stmt = "insert into structure (name, pdbId, conformation) " \



"values (%s, %s, %s)"

stmt = formatStatement(stmt, placeHolder)

cursor.execute(stmt, (structure.name, pdbId, conformation))

structureId = cursor.lastrowid

We  then  descend  down  the  rest  of  the  hierarchy:  Chain,  Residue  and  Atom,  in  turn.

First we add the chains.

for chain in structure.chains:

molType = chain.molType

code = chain.code

stmt = "insert into chain (structureId, molType, code) " \

"values (%s, %s, %s)"

stmt = formatStatement(stmt, placeHolder)

cursor.execute(stmt, (structureId, molType, code))

chainId = cursor.lastrowid

Then we add the residues.



for residue in chain.residues:

seqId = residue.seqId

# insert residue into database

stmt = "insert into residue (chainId, seqId, code) " \

"values (%s, %s, %s)"

stmt = formatStatement(stmt, placeHolder)

cursor.execute(stmt, (chainId, seqId, residue.code))

residueId = cursor.lastrowid

Finally we add the atoms. This is the longest, and slowest, part of the operation.

for atom in residue.atoms:

# insert atom into database

(x, y, z) = atom.coords

stmt = "insert into atom " \

"(residueId, name, x, y, z, element) " \

"values (%s, %s, %s, %s, %s, %s)"

stmt = formatStatement(stmt, placeHolder)

cursor.execute(stmt,

(residueId, atom.name, x, y, z, atom.element))

If  there  are  no  errors  then  we  close  the  cursor.  Technically  speaking,  Python  would

eventually close the cursor in any case, but it’s better to be explicit. Finally, we commit the

transaction.

cursor.close()

connection.commit()

If there was an error then we again close the cursor. An error exception could occur for

a few reasons. For example, the database connection might be lost. Or the pdbId might not

have been set (this is not required in the Structure class constructor). If there has been an

exception  we  do  a  rollback,  to  restore  the  previous  state,  and  afterwards  re-raise  the

originating exception object.

except Exception as e: # syntax from Python 2.6

cursor.close()

try:

connection.rollback()



except:

pass


raise e # re-raise original exception

Now we can test the above on data from a PDB file. We only want to run this test code

if the module is run directly, rather than imported from another module, so we use a check

on __name__. The user should specify the database name and PDB file, otherwise an error

is given and the program exited. We then read the structures from a file, using the function

getStructuresFromFile() from

Chapter 8

.



if __name__ == '__main__':

import sys

if len(sys.argv) != 3:

print('need to specify database and PDB file')

sys.exit(1)

database = sys.argv[1]

pdbFile = sys.argv[2]

from Modelling import getStructuresFromFile

structures = getStructuresFromFile(pdbFile)

We then open a connection and pass that object into the addStructureToDb()  function.

This  is  just  a  matter  of  taste.  We  could  instead  have  passed  in  the  database  name  (and

username and password, if needed) and opened, and then at the end, closed the connection

inside the function. In sqlite3 we can open the connection just with the database name:

import sqlite3

connection = sqlite3.connect(database)

placeHolder = '?'

In MySQLdb we also need a username and password. We assume that the user for the

database  is  the  same  as  returned  by  getpass.getuser(),  and  similarly  for  the  password.  In

general this might not be the case, so the test code might need tweaking here. (It is just test

code,  though.)  Accordingly,  the  MySQLdb  alternative  to  the  above  block  of  code  would

be something like:

import MySQLdb

import getpass

user = getpass.getuser()

pwd = getpass.getpass()

connection = MySQLdb.connect(db=database, user=user, passwd=pwd)

placeHolder = '%s'

We then add the structures into the database, one after the other. We wrap the code in a

try/finally  block  so  that  we  always  close  the  connection  whether  or  not  an  error  has

occurred.  Again,  Python  would  do  this  automatically,  but  it’s  better  to  be  explicit.  Note

that the way the code is written, if we get an error adding some structure, then the previous

structures will remain in the database because those transactions will have been committed

in addStructureToDb(). An alternative would be to do the transaction management here, so

that either all the structures are inserted, or none of them is.

try:

for structure in structures:



addStructureToDb(connection, structure, placeHolder)

finally:


connection.close()

Now we consider how to change the code to allow several insert statements into each

SQL  command  that  is  executed.  This  is  possible  in  MySQL  but  not  in  SQLite,  and  the

below is how this kind of application should be implemented in practical situations. The




simplest  change  is  to  insert  the  atom  records  for  a  given  residue  in  one  command,  since

normally  there  are  many  atoms  per  residue,  but  to  leave  the  other  records,  for  the

structure,  chains  and  residues,  inserted  one  at  a  time.  In  this  case,  the  only  modification

required in the code is the innermost loop, which now looks like:

values = []

for atom in residue.atoms:

(x, y, z) = atom.coords

values.extend([residueId, atom.name, x, y, z, atom.element])

nAtoms = len(values) / 6

atomPlaceHolder = '(%s, %s, %s, %s, %s, %s)'

atomPlaceHolder = nAtoms * [atomPlaceHolder]

atomPlaceHolder = ','.join(atomPlaceHolder)

stmt = "insert into atom" \

" (residueId, name, x, y, z, element) values " \

+ atomPlaceHolder

cursor.execute(stmt, values)

The  values  that  are  going  to  be  inserted  into  the  atom  table  are  stored  in  an  array.  It

would  be  natural,  from  a  Python  point  of  view,  to  store  this  as  a  list  of  lists,  with  each

inner list being the six pieces of data relevant for each atom (residue ID, atom name, x, y,

z,  chemical  element).  But  the  way  MySQLdb  works  this  is  not  possible,  and  instead  we

have  to  store  everything  in  one  long  list.  The  variable  nAtoms  is  the  number  of  atom

records being inserted, and this is the length of values divided by 6 because there are six

pieces  of  data  for  each  atom.  The  MySQL  syntax  for  inserting  multiple  records  into  a

database  does  expect  a  list  of  lists,  and  this  gives  rise  to  the  slightly  complicated

construction  of  the  variable  atomPlaceHolder.  For  example,  if  nAtoms  is  2  then

atomPlaceHolder  would  be  the  Python  string  ‘(%s,  %s,  %s,  %s,  %s),(%s,  %s,  %s,  %s,

%s)’.

On  a  test  protein  with  a  total  of  20  structures,  40  chains,  1600  residues  and  26,440



atoms  (around  17  atoms  per  residue)  the  time  to  insert  this  data  into  the  database  was

reduced  by  a  factor  of  around  3.5  in  comparison  with  the  original  method.  So  this  is

definitely  worth  doing  (although  as  noted,  it  cannot  be  done  with  SQLite).  A  further

optimisation is to insert all the structures in one go, then all the chains in one go etc. This

is  left  as  an  exercise  for  the  reader.  On  the  test  protein  this  further  reduced  the  time  to

insert the data by around another 25%, so is a much smaller effect.




Download 7,75 Mb.

Do'stlaringiz bilan baham:
1   ...   306   307   308   309   310   311   312   313   ...   514




Ma'lumotlar bazasi mualliflik huquqi bilan himoyalangan ©hozir.org 2024
ma'muriyatiga murojaat qiling

kiriting | ro'yxatdan o'tish
    Bosh sahifa
юртда тантана
Боғда битган
Бугун юртда
Эшитганлар жилманглар
Эшитмадим деманглар
битган бодомлар
Yangiariq tumani
qitish marakazi
Raqamli texnologiyalar
ilishida muhokamadan
tasdiqqa tavsiya
tavsiya etilgan
iqtisodiyot kafedrasi
steiermarkischen landesregierung
asarlaringizni yuboring
o'zingizning asarlaringizni
Iltimos faqat
faqat o'zingizning
steierm rkischen
landesregierung fachabteilung
rkischen landesregierung
hamshira loyihasi
loyihasi mavsum
faolyatining oqibatlari
asosiy adabiyotlar
fakulteti ahborot
ahborot havfsizligi
havfsizligi kafedrasi
fanidan bo’yicha
fakulteti iqtisodiyot
boshqaruv fakulteti
chiqarishda boshqaruv
ishlab chiqarishda
iqtisodiyot fakultet
multiservis tarmoqlari
fanidan asosiy
Uzbek fanidan
mavzulari potok
asosidagi multiservis
'aliyyil a'ziym
billahil 'aliyyil
illaa billahil
quvvata illaa
falah' deganida
Kompyuter savodxonligi
bo’yicha mustaqil
'alal falah'
Hayya 'alal
'alas soloh
Hayya 'alas
mavsum boyicha


yuklab olish