Adding a structure to the database
The basic ideas of how to deal with the Python wrapper around SQL were discussed in the
previous section. Here we provide an implementation of how we can add a structure item,
including all of its chains, residues and atoms, into the database. The example below
works for both sqlite3 and MySQLdb because we use the simple function
formatStatement() to insert the correct placeholder text in all the statements, substituting
using the normal Python string format code %s.
def formatStatement(text, placeHolder):
if placeHolder == '%s':
return text
numInserts = text.count('%s')
return text % numInserts*(placeHolder,)
The code below executes one SQL command for each record being inserted into the
database. In general SQL commands are rather slow to execute. It would be better to
bundle several SQL insert statements into each command. SQLite does not offer this
functionality so the method below is the only one used in this case. But MySQL does offer
the ability to bundle several insert statements into each command, and after we discuss the
generic version then we will modify the code to take advantage of this feature. In real-
world applications requiring reasonable performance, the following code would not be
acceptable (and so SQLite would not be acceptable). But we illustrate here nonetheless for
educational purposes.
The function definition has three arguments: a database connection, a structure object
and a placeHolder, which defaults to the MySQLdb value. The function first gets a cursor,
with which interactions with the database will take place.
def addStructureToDb(connection, structure, placeHolder='%s'):
cursor = connection.cursor()
We first check whether there is already a structure with the same pdbId and
conformation in the database. If so, we raise an error exception.
pdbId = structure.pdbId
conformation = structure.conformation
stmt = "select * from structure where pdbId=%s and conformation=%s"
stmt = formatStatement(stmt, placeHolder)
cursor.execute(stmt, (pdbId, conformation))
if cursor.fetchone():
cursor.close()
msg = 'structure with (pdbId=%s, conformation=%s) already known'
raise Exception(msg % (pdbId, conformation))
If there is any kind of error when inserting the structure into the database we want to
roll back (reverse) the transaction, so we put the entire code into a try/except block. We
then insert the relevant structure data into the structure table. Note that after a record is
inserted into the database, the identifier of that record can be found via cursor.lastrowid.
This identifier is then used when creating the child records. Here it is the structureId.
try:
stmt = "insert into structure (name, pdbId, conformation) " \
"values (%s, %s, %s)"
stmt = formatStatement(stmt, placeHolder)
cursor.execute(stmt, (structure.name, pdbId, conformation))
structureId = cursor.lastrowid
We then descend down the rest of the hierarchy: Chain, Residue and Atom, in turn.
First we add the chains.
for chain in structure.chains:
molType = chain.molType
code = chain.code
stmt = "insert into chain (structureId, molType, code) " \
"values (%s, %s, %s)"
stmt = formatStatement(stmt, placeHolder)
cursor.execute(stmt, (structureId, molType, code))
chainId = cursor.lastrowid
Then we add the residues.
for residue in chain.residues:
seqId = residue.seqId
# insert residue into database
stmt = "insert into residue (chainId, seqId, code) " \
"values (%s, %s, %s)"
stmt = formatStatement(stmt, placeHolder)
cursor.execute(stmt, (chainId, seqId, residue.code))
residueId = cursor.lastrowid
Finally we add the atoms. This is the longest, and slowest, part of the operation.
for atom in residue.atoms:
# insert atom into database
(x, y, z) = atom.coords
stmt = "insert into atom " \
"(residueId, name, x, y, z, element) " \
"values (%s, %s, %s, %s, %s, %s)"
stmt = formatStatement(stmt, placeHolder)
cursor.execute(stmt,
(residueId, atom.name, x, y, z, atom.element))
If there are no errors then we close the cursor. Technically speaking, Python would
eventually close the cursor in any case, but it’s better to be explicit. Finally, we commit the
transaction.
cursor.close()
connection.commit()
If there was an error then we again close the cursor. An error exception could occur for
a few reasons. For example, the database connection might be lost. Or the pdbId might not
have been set (this is not required in the Structure class constructor). If there has been an
exception we do a rollback, to restore the previous state, and afterwards re-raise the
originating exception object.
except Exception as e: # syntax from Python 2.6
cursor.close()
try:
connection.rollback()
except:
pass
raise e # re-raise original exception
Now we can test the above on data from a PDB file. We only want to run this test code
if the module is run directly, rather than imported from another module, so we use a check
on __name__. The user should specify the database name and PDB file, otherwise an error
is given and the program exited. We then read the structures from a file, using the function
getStructuresFromFile() from
Chapter 8
.
if __name__ == '__main__':
import sys
if len(sys.argv) != 3:
print('need to specify database and PDB file')
sys.exit(1)
database = sys.argv[1]
pdbFile = sys.argv[2]
from Modelling import getStructuresFromFile
structures = getStructuresFromFile(pdbFile)
We then open a connection and pass that object into the addStructureToDb() function.
This is just a matter of taste. We could instead have passed in the database name (and
username and password, if needed) and opened, and then at the end, closed the connection
inside the function. In sqlite3 we can open the connection just with the database name:
import sqlite3
connection = sqlite3.connect(database)
placeHolder = '?'
In MySQLdb we also need a username and password. We assume that the user for the
database is the same as returned by getpass.getuser(), and similarly for the password. In
general this might not be the case, so the test code might need tweaking here. (It is just test
code, though.) Accordingly, the MySQLdb alternative to the above block of code would
be something like:
import MySQLdb
import getpass
user = getpass.getuser()
pwd = getpass.getpass()
connection = MySQLdb.connect(db=database, user=user, passwd=pwd)
placeHolder = '%s'
We then add the structures into the database, one after the other. We wrap the code in a
try/finally block so that we always close the connection whether or not an error has
occurred. Again, Python would do this automatically, but it’s better to be explicit. Note
that the way the code is written, if we get an error adding some structure, then the previous
structures will remain in the database because those transactions will have been committed
in addStructureToDb(). An alternative would be to do the transaction management here, so
that either all the structures are inserted, or none of them is.
try:
for structure in structures:
addStructureToDb(connection, structure, placeHolder)
finally:
connection.close()
Now we consider how to change the code to allow several insert statements into each
SQL command that is executed. This is possible in MySQL but not in SQLite, and the
below is how this kind of application should be implemented in practical situations. The
simplest change is to insert the atom records for a given residue in one command, since
normally there are many atoms per residue, but to leave the other records, for the
structure, chains and residues, inserted one at a time. In this case, the only modification
required in the code is the innermost loop, which now looks like:
values = []
for atom in residue.atoms:
(x, y, z) = atom.coords
values.extend([residueId, atom.name, x, y, z, atom.element])
nAtoms = len(values) / 6
atomPlaceHolder = '(%s, %s, %s, %s, %s, %s)'
atomPlaceHolder = nAtoms * [atomPlaceHolder]
atomPlaceHolder = ','.join(atomPlaceHolder)
stmt = "insert into atom" \
" (residueId, name, x, y, z, element) values " \
+ atomPlaceHolder
cursor.execute(stmt, values)
The values that are going to be inserted into the atom table are stored in an array. It
would be natural, from a Python point of view, to store this as a list of lists, with each
inner list being the six pieces of data relevant for each atom (residue ID, atom name, x, y,
z, chemical element). But the way MySQLdb works this is not possible, and instead we
have to store everything in one long list. The variable nAtoms is the number of atom
records being inserted, and this is the length of values divided by 6 because there are six
pieces of data for each atom. The MySQL syntax for inserting multiple records into a
database does expect a list of lists, and this gives rise to the slightly complicated
construction of the variable atomPlaceHolder. For example, if nAtoms is 2 then
atomPlaceHolder would be the Python string ‘(%s, %s, %s, %s, %s),(%s, %s, %s, %s,
%s)’.
On a test protein with a total of 20 structures, 40 chains, 1600 residues and 26,440
atoms (around 17 atoms per residue) the time to insert this data into the database was
reduced by a factor of around 3.5 in comparison with the original method. So this is
definitely worth doing (although as noted, it cannot be done with SQLite). A further
optimisation is to insert all the structures in one go, then all the chains in one go etc. This
is left as an exercise for the reader. On the test protein this further reduced the time to
insert the data by around another 25%, so is a much smaller effect.
Do'stlaringiz bilan baham: |