Using BLAST from Python
Before we get started we need to consider how to create a sequence database that the
BLAST program can search through. This is not a normal sequence file or database, but
rather it is a special, optimised series of files that is special to the BLAST system and
which makes the comparison of sequences efficient. Fortunately there is a program called
‘makeblastdb’, which is available in the BLAST+ package,
18
that can be used to make the
special database from sequences stored in common text-based formats like FASTA.
Before the examples we import all of the external functions and modules that will be
required. Specifically we need the substitution function (sub) from the regular
expression
19
module to help format the sequences for the input files. The call function
from the subprocess is a convenient way to run programs that are external to Python. The
ElementTree module will allow us to read the XML
20
formatted BLAST output.
from re import sub
from subprocess import call
from xml.etree import ElementTree
Now a function is defined to make the special sequence database for BLAST. This is
not a conventional relational database, e.g. where small amounts of data are accessed and
modified, but rather a specially formatted version of sequence data to be queried, which
allows BLAST to efficiently find matches. The function takes a FASTA-formatted file as
input, a distinguishing name for the database, formatdbExe, the location of the external
‘makeblastdb’ program to run and isProtein, a switch to state whether the sequence is
protein and not nucleic acid (RNA or DNA). Note that the input FASTA sequence file is
often very large, representing a whole database (e.g. UniProt) or even a genome.
Obtaining genome sequence information will be discussed later in
Chapter 17
.
def makeBlastDatabase(fastaFile, databaseName,
formatdbExe=None, isProtein=True):
If a value for formatdbExe is not passed in we will assume that the database creation
program is available to the local system as a command called makeblastdb. If the program
name is unavailable, or not set correctly, then an error exception will be triggered when we
come to run the program via call().
if not formatdbExe:
formatdbExe = 'makeblastdb'
Next we define a variable which will indicate the database type. The makeblastdb
program accepts a value ‘prot’ (protein) or ‘nucl’ (nucleic acid) after its -dbtype option to
indicate whether the database is protein or not.
if isProtein:
molType = 'prot'
else:
molType = 'nucl'
Then all of the options and arguments that will be used when running the makeblastdb
program are collected in a list. These would be the values that one would type at a
command line prompt to run the program.
cmdArgs = [formatdbExe,
'-dbtype', molType,
'-in', fastaFile,
'-out', databaseName]
This list is then passed to the call() function to actually run the program, noting that we
catch any exceptions that may occur. If an exception is triggered we print the command
that was tried, together with the original Python error (err) so we can determine what the
problem was.
print('Making BLAST database %s…' % databaseName)
try:
call(cmdArgs)
except Exception as err:
print('BLAST database creation failed')
print('Command used: "%s"' % ' '.join(cmdArgs))
print(err)
return
print(' …done')
With the function defined we can now run the makeblastdb program on a FASTA-
format file as if it were Python. Naturally we only need to do this once for each database,
unless it changes.
fileName = 'EcoliGenome.fasta'
makeBlastDatabase(fileName, 'ECOLI_PROT')
However, we may have to specifically state the location of the makeblastdb program if
the system doesn’t know that location of the executable file:
makeBlastDatabase(fileName, 'ECOLI_PROT', '/usr/bin/makeblastdb')
Next we come to a similar Python wrapper function that will run the actual BLAST
program, given a query sequence and a database, created as described above. For the sake
of simplicity, several of the BLAST input options will be ignored and only part of the
output will be read. However, this function can be expanded to use as much detail as is
required.
21
The Python function that controls the BLAST search takes arguments for the query
sequence (as one-letter codes) and the location of the formatted database to search. There
are optional arguments to specify the BLAST program location (otherwise the command
‘blastp’ will be tried), the maximum cut-off expectation value (the E-value score), the type
of substitution matrix and the number of processing cores to use. It is important to note
that different BLAST programs are used to specify which molecule type (protein or
nucleic acid) is used in the query and independently in the database. Accordingly, the
following programs are available to specify which type of BLAST search will be run:
‘blastp’: protein query against a protein database.
‘blastn’: nucleic acid query against a nucleotide database.
‘tblastn’: protein query against a translated nucleotide database.
‘blastx’: translated nucleic acid query against a protein database.
‘tblastx’: translated nucleic acid query against a translated nucleotide database.
Note that the ‘tblastx’ method translates nucleotide sequence to protein sequence in all
reading frames, so may yield amino acid matches even if the base sequences are quite
different, but naturally only one of the reading frames encodes protein.
The blastSearch() function makes use of the subprocess module again, but this time we
also import Popen and PIPE. The Popen class is a more general way of controlling
external programs than a simple call and we use it here because we are directly controlling
the standard input to and output from BLAST using Python, rather than via the operating
system (an alternative to this approach would be to read and write temporary files). The
PIPE import is a value that allows us to achieve this by setting the stdin and stdout
attributes of Popen so that data will be communicated or ‘piped’ to and from Python.
from subprocess import call, Popen, PIPE
def blastSearch(seq, database, blastExe=None, eCutoff=10,
matrix='BLOSUM62', cpuCores=None):
As with the previous function we set a default name for the program, here assuming
‘blastp’ is available and executable on the system, if the name was not specifically set by
the user. As discussed above, the choice of the specific BLAST program dictates what
type of search is being performed and what molecule type (protein or nucleic acid) the
query and database represent.
if not blastExe:
blastExe = 'blastp'
If the number of processor (CPU) cores was not specified we take the maximum
available as specified by the handy multiprocessing module. Using multiple cores will
allow part of the BLAST search to run more quickly, because certain tasks can be
computed in parallel.
if not cpuCores:
import multiprocessing
cpuCores = multiprocessing.cpu_count()
The query sequence is tidied, making it upper case and inserting a newline ‘\n’ every 60
characters to make regular FASTA-format sequence strings.
querySeq = seq.upper()
querySeq = sub('(.{60})(.)',r'\1\n\2', querySeq)
The sequence is then placed into a string formatted as a FASTA entry by adding an
annotation line (with arbitrary name) before a sequence line. These two lines represent the
data that would be contained in a FASTA file containing a single sequence, which we will
send as input to BLAST.
querySeq = '>UserQuery\n%s\n' % querySeq
The program command arguments are collected in a list as before, noting that we don’t
need to set any input or output files because by default BLAST will use standard input and
output:
cmdArgs = [blastExe,
'-outfmt', '5', # => XML output
'-num_threads', str(cpuCores),
'-db', database,
'-evalue', str(eCutoff),
'-matrix', matrix]
print(' '.join(cmdArgs))
The Popen class is used to make an object that encapsulates the BLAST process, setting
attributes so that the standard input and output data is sent via Python. If there are any
problems with this, e.g. if the BLAST executable is not found, then we catch the exception
and report the error that occurred.
try:
proc = Popen(cmdArgs, stdin=PIPE, stdout=PIPE)
except Exception as err:
print('BLAST command failed')
print('Command used: "%s"' % ' '.join(cmdArgs))
print(err)
return []
Nothing will be run by BLAST until we send it some input data and we do this using
the communicate() method of the process object. This function call gives back two values,
representing standard output and error data respectively. If we hadn’t set stdout=PIPE
above this data would simply be sent to the screen.
stdOutData, stdErrData = proc.communicate(querySeq)
The results are collected by reading through the XML-formatted text contained in the
stdOutData string using the ElementTree module. The fromstring() function will give back
an object representing the tree-like hierarchy of the XML data. To navigate this tree we
start at the base, which is named root here, from which we can follow any node (like
branches) to get to the data within.
results = []
root = ElementTree.fromstring(stdOutData)
From this start point, to collect the required data we use the .find()function, going from
point to point within the structure defined by the XML file. What this function does is to
look for specific elements in the tree-like structure of the XML file which match a
particular text string. Naturally, knowing which things to look for depends on knowing the
schema of the XML file, i.e. what the names of the elements are and how they relate to
one another. See
Chapter 6
for further discussion of XML files. Note that the format of
these XML files could change in future versions of BLAST, so if this example does not
work directly look at the actual output and adapt the following code for the actual XML
elements in the file.
To get to iteration, which will store all the sequences matched in the database (there is
only one iteration in a regular BLAST search) we go to elements named Blast
Output_iterations and then Iteration.
iteration = root.find('BlastOutput_iterations').find('Iteration')
The matches to the query that BLAST finds (if any) will be stored in the XML inside
the ‘Iteration_hits’ section as ‘Hit’ elements. If there are no ‘Iteration_hits’ then we
simply return an empty list for the results.
ihit = iteration.find('Iteration_hits')
if ihit:
hits = ihit.findall('Hit')
else:
return []
Otherwise if there are Hit elements representing the sequence matches to the query in
the database we loop through them to find the Hsp elements (high scoring pairs).
for hit in hits:
hsp = hit.find('Hit_hsps').find('Hsp')
Then the remainder of the database match information is stored in a dictionary using the
name as a key. Note that this is just one example of how to store the results in Python data
structures.
hitDict = {}
The name of the match (the hit) is extracted from the ‘Hit_def’ attribute using
hit.findtext() which we then store in the dictionary with the ‘def’ key. Similarly, its
sequence length is stored as the ‘len’ after conversion of the XML string to a Python
integer:
hitDict['def'] = hit.findtext('Hit_def')
hitDict['len'] = int(hit.findtext('Hit_len'))
For all the values associated with each high-scoring pair we take the tag name for the
value (as used in the XML file) and remove the first four characters to make a simpler key,
thus avoiding all the keys beginning with ‘Hsp_’, e.g. ‘Hsp_score’ becomes ‘score’. The
actual value is then obtained with .findtext() and converted to a Python data type, with
int() or float() as needed. First we collect the floating point values:
for tag in ('Hsp_bit-score', 'Hsp_evalue'):
key = tag[4:]
hitDict[key] = float(hsp.findtext(tag))
Then the integer values:
for tag in ('Hsp_score', 'Hsp_query-from',
'Hsp_query-to', 'Hsp_hit-from',
'Hsp_hit-to', 'Hsp_query-frame',
'Hsp_hit-frame', 'Hsp_identity',
'Hsp_positive', 'Hsp_gaps',
'Hsp_align-len'):
key = tag[4:]
hitDict[key] = int(hsp.findtext(tag, '0'))
And finally the plain text values:
for tag in ('Hsp_qseq', 'Hsp_hseq', 'Hsp_midline'):
key = tag[4:]
hitDict[key] = hsp.findtext(tag)
Once the hit is processed we store the dictionary, with all the extracted values in the
results list. A dictionary could have been used to store the results, but here we wish to
preserve the order of the matches.
results.append(hitDict)
return results
Finally, the Python BLAST search wrapper function can be tested with a query
sequence and a BLAST sequence database created previously.
seq = 'NGTISYTNEAGKIYQLKPNPAVLICRVRGLHLPEKHVTWRGEAIPGSLFDFA' \
'LYFFHNYQALLAKGSGPYFYLPKTQSWQEAAWWSEVFSYAEDRFNLPRGTIK' \
'ATLLIETLPAVFQMDEILHALRDHIVGLNCGRWDYIFSYIKTLKNYPDRVLP'
The location of the database is the file path to the BLAST database files, which
includes the database name, but excludes any file extensions:
database = 'ECOLI_PROT'
The results that come back after running the specified BLAST program (the path for
which should be set appropriately) are a list of names and data dictionaries for any
sequence matches.
results = blastSearch(seq, database'/usr/bin/blastp',
eCutoff=1.0, cpuCores=4)
The various sequences and scores are obtained from the results dictionary for each
match (hit) by using the appropriate keyword:
for hitDict in results:
print(hitDict['def'], hitDict['score'], hitDict['evalue'])
print(hitDict['qseq'])
print(hitDict['midline'])
print(hitDict['hseq'])
Do'stlaringiz bilan baham: |