Python Programming for Biology: Bioinformatics and Beyond

Adding a structure to the database

Download 7,75 Mb.

Pdf ko'rish

bet	310/514
Sana	30.12.2021
Hajmi	7,75 Mb.
	#91066

1 ... 306 307 308 309 310 311 312 313 ... 514

Bog'liq
[Tim J. Stevens, Wayne Boucher] Python Programming

Adding a structure to the database

The basic ideas of how to deal with the Python wrapper around SQL were discussed in the

previous section. Here we provide an implementation of how we can add a structure item,

including all of its chains, residues and atoms, into the database. The example below

works for both sqlite3 and MySQLdb because we use the simple function

formatStatement() to insert the correct placeholder text in all the statements, substituting

using the normal Python string format code %s.

def formatStatement(text, placeHolder):

if placeHolder == '%s':

return text

numInserts = text.count('%s')

return text % numInserts*(placeHolder,)

The code below executes one SQL command for each record being inserted into the

database. In general SQL commands are rather slow to execute. It would be better to

bundle several SQL insert statements into each command. SQLite does not offer this

functionality so the method below is the only one used in this case. But MySQL does offer

the ability to bundle several insert statements into each command, and after we discuss the

generic version then we will modify the code to take advantage of this feature. In real-

world applications requiring reasonable performance, the following code would not be

acceptable (and so SQLite would not be acceptable). But we illustrate here nonetheless for

educational purposes.

The function definition has three arguments: a database connection, a structure object

and a placeHolder, which defaults to the MySQLdb value. The function first gets a cursor,

with which interactions with the database will take place.

def addStructureToDb(connection, structure, placeHolder='%s'):

cursor = connection.cursor()

We first check whether there is already a structure with the same pdbId and

conformation in the database. If so, we raise an error exception.

pdbId = structure.pdbId

conformation = structure.conformation

stmt = "select * from structure where pdbId=%s and conformation=%s"

stmt = formatStatement(stmt, placeHolder)

cursor.execute(stmt, (pdbId, conformation))

if cursor.fetchone():

cursor.close()

msg = 'structure with (pdbId=%s, conformation=%s) already known'

raise Exception(msg % (pdbId, conformation))

If there is any kind of error when inserting the structure into the database we want to

roll back (reverse) the transaction, so we put the entire code into a try/except block. We

then insert the relevant structure data into the structure table. Note that after a record is

inserted into the database, the identifier of that record can be found via cursor.lastrowid.

This identifier is then used when creating the child records. Here it is the structureId.

try:

stmt = "insert into structure (name, pdbId, conformation) " \

"values (%s, %s, %s)"

stmt = formatStatement(stmt, placeHolder)

cursor.execute(stmt, (structure.name, pdbId, conformation))

structureId = cursor.lastrowid

We then descend down the rest of the hierarchy: Chain, Residue and Atom, in turn.

First we add the chains.

for chain in structure.chains:

molType = chain.molType

code = chain.code

stmt = "insert into chain (structureId, molType, code) " \

"values (%s, %s, %s)"

stmt = formatStatement(stmt, placeHolder)

cursor.execute(stmt, (structureId, molType, code))

chainId = cursor.lastrowid

Then we add the residues.

for residue in chain.residues:

seqId = residue.seqId

# insert residue into database

stmt = "insert into residue (chainId, seqId, code) " \

"values (%s, %s, %s)"

stmt = formatStatement(stmt, placeHolder)

cursor.execute(stmt, (chainId, seqId, residue.code))

residueId = cursor.lastrowid

Finally we add the atoms. This is the longest, and slowest, part of the operation.

for atom in residue.atoms:

# insert atom into database

(x, y, z) = atom.coords

stmt = "insert into atom " \

"(residueId, name, x, y, z, element) " \

"values (%s, %s, %s, %s, %s, %s)"

stmt = formatStatement(stmt, placeHolder)

cursor.execute(stmt,

(residueId, atom.name, x, y, z, atom.element))

If there are no errors then we close the cursor. Technically speaking, Python would

eventually close the cursor in any case, but it’s better to be explicit. Finally, we commit the

transaction.

cursor.close()

connection.commit()

If there was an error then we again close the cursor. An error exception could occur for

a few reasons. For example, the database connection might be lost. Or the pdbId might not

have been set (this is not required in the Structure class constructor). If there has been an

exception we do a rollback, to restore the previous state, and afterwards re-raise the

originating exception object.

except Exception as e: # syntax from Python 2.6

cursor.close()

try:

connection.rollback()

except:

pass

raise e # re-raise original exception

Now we can test the above on data from a PDB file. We only want to run this test code

if the module is run directly, rather than imported from another module, so we use a check

on __name__. The user should specify the database name and PDB file, otherwise an error

is given and the program exited. We then read the structures from a file, using the function

getStructuresFromFile() from

Chapter 8

if __name__ == '__main__':

import sys

if len(sys.argv) != 3:

print('need to specify database and PDB file')

sys.exit(1)

database = sys.argv[1]

pdbFile = sys.argv[2]

from Modelling import getStructuresFromFile

structures = getStructuresFromFile(pdbFile)

We then open a connection and pass that object into the addStructureToDb() function.

This is just a matter of taste. We could instead have passed in the database name (and

username and password, if needed) and opened, and then at the end, closed the connection

inside the function. In sqlite3 we can open the connection just with the database name:

import sqlite3

connection = sqlite3.connect(database)

placeHolder = '?'

In MySQLdb we also need a username and password. We assume that the user for the

database is the same as returned by getpass.getuser(), and similarly for the password. In

general this might not be the case, so the test code might need tweaking here. (It is just test

code, though.) Accordingly, the MySQLdb alternative to the above block of code would

be something like:

import MySQLdb

import getpass

user = getpass.getuser()

pwd = getpass.getpass()

connection = MySQLdb.connect(db=database, user=user, passwd=pwd)

placeHolder = '%s'

We then add the structures into the database, one after the other. We wrap the code in a

try/finally block so that we always close the connection whether or not an error has

occurred. Again, Python would do this automatically, but it’s better to be explicit. Note

that the way the code is written, if we get an error adding some structure, then the previous

structures will remain in the database because those transactions will have been committed

in addStructureToDb(). An alternative would be to do the transaction management here, so

that either all the structures are inserted, or none of them is.

try:

for structure in structures:

addStructureToDb(connection, structure, placeHolder)

finally:

connection.close()

Now we consider how to change the code to allow several insert statements into each

SQL command that is executed. This is possible in MySQL but not in SQLite, and the

below is how this kind of application should be implemented in practical situations. The

simplest change is to insert the atom records for a given residue in one command, since

normally there are many atoms per residue, but to leave the other records, for the

structure, chains and residues, inserted one at a time. In this case, the only modification

required in the code is the innermost loop, which now looks like:

values = []

for atom in residue.atoms:

(x, y, z) = atom.coords

values.extend([residueId, atom.name, x, y, z, atom.element])

nAtoms = len(values) / 6

atomPlaceHolder = '(%s, %s, %s, %s, %s, %s)'

atomPlaceHolder = nAtoms * [atomPlaceHolder]

atomPlaceHolder = ','.join(atomPlaceHolder)

stmt = "insert into atom" \

" (residueId, name, x, y, z, element) values " \

+ atomPlaceHolder

cursor.execute(stmt, values)

The values that are going to be inserted into the atom table are stored in an array. It

would be natural, from a Python point of view, to store this as a list of lists, with each

inner list being the six pieces of data relevant for each atom (residue ID, atom name, x, y,

z, chemical element). But the way MySQLdb works this is not possible, and instead we

have to store everything in one long list. The variable nAtoms is the number of atom

records being inserted, and this is the length of values divided by 6 because there are six

pieces of data for each atom. The MySQL syntax for inserting multiple records into a

database does expect a list of lists, and this gives rise to the slightly complicated

construction of the variable atomPlaceHolder. For example, if nAtoms is 2 then

atomPlaceHolder would be the Python string ‘(%s, %s, %s, %s, %s),(%s, %s, %s, %s,

%s)’.

On a test protein with a total of 20 structures, 40 chains, 1600 residues and 26,440

atoms (around 17 atoms per residue) the time to insert this data into the database was

reduced by a factor of around 3.5 in comparison with the original method. So this is

definitely worth doing (although as noted, it cannot be done with SQLite). A further

optimisation is to insert all the structures in one go, then all the chains in one go etc. This

is left as an exercise for the reader. On the test protein this further reduced the time to

insert the data by around another 25%, so is a much smaller effect.

Download 7,75 Mb.

Do'stlaringiz bilan baham:

1 ... 306 307 308 309 310 311 312 313 ... 514