Python Programming for Biology: Bioinformatics and Beyond

Download 7,75 Mb.

Pdf ko'rish

bet	175/514
Sana	30.12.2021
Hajmi	7,75 Mb.
	#91066

1 ... 171 172 173 174 175 176 177 178 ... 514

Bog'liq
[Tim J. Stevens, Wayne Boucher] Python Programming

Using BLAST from Python

Before we get started we need to consider how to create a sequence database that the

BLAST program can search through. This is not a normal sequence file or database, but

rather it is a special, optimised series of files that is special to the BLAST system and

which makes the comparison of sequences efficient. Fortunately there is a program called

‘makeblastdb’, which is available in the BLAST+ package,

that can be used to make the

special database from sequences stored in common text-based formats like FASTA.

Before the examples we import all of the external functions and modules that will be

required. Specifically we need the substitution function (sub) from the regular

expression

module to help format the sequences for the input files. The call function

from the subprocess is a convenient way to run programs that are external to Python. The

ElementTree module will allow us to read the XML

formatted BLAST output.

from re import sub

from subprocess import call

from xml.etree import ElementTree

Now a function is defined to make the special sequence database for BLAST. This is

not a conventional relational database, e.g. where small amounts of data are accessed and

modified, but rather a specially formatted version of sequence data to be queried, which

allows BLAST to efficiently find matches. The function takes a FASTA-formatted file as

input, a distinguishing name for the database, formatdbExe, the location of the external

‘makeblastdb’ program to run and isProtein, a switch to state whether the sequence is

protein and not nucleic acid (RNA or DNA). Note that the input FASTA sequence file is

often very large, representing a whole database (e.g. UniProt) or even a genome.

Obtaining genome sequence information will be discussed later in

Chapter 17

def makeBlastDatabase(fastaFile, databaseName,

formatdbExe=None, isProtein=True):

If a value for formatdbExe is not passed in we will assume that the database creation

program is available to the local system as a command called makeblastdb. If the program

name is unavailable, or not set correctly, then an error exception will be triggered when we

come to run the program via call().

if not formatdbExe:

formatdbExe = 'makeblastdb'

Next we define a variable which will indicate the database type. The makeblastdb

program accepts a value ‘prot’ (protein) or ‘nucl’ (nucleic acid) after its -dbtype option to

indicate whether the database is protein or not.

if isProtein:

molType = 'prot'

else:

molType = 'nucl'

Then all of the options and arguments that will be used when running the makeblastdb

program are collected in a list. These would be the values that one would type at a

command line prompt to run the program.

cmdArgs = [formatdbExe,

'-dbtype', molType,

'-in', fastaFile,

'-out', databaseName]

This list is then passed to the call() function to actually run the program, noting that we

catch any exceptions that may occur. If an exception is triggered we print the command

that was tried, together with the original Python error (err) so we can determine what the

problem was.

print('Making BLAST database %s…' % databaseName)

try:

call(cmdArgs)

except Exception as err:

print('BLAST database creation failed')

print('Command used: "%s"' % ' '.join(cmdArgs))

print(err)

return

print(' …done')

With the function defined we can now run the makeblastdb program on a FASTA-

format file as if it were Python. Naturally we only need to do this once for each database,

unless it changes.

fileName = 'EcoliGenome.fasta'

makeBlastDatabase(fileName, 'ECOLI_PROT')

However, we may have to specifically state the location of the makeblastdb program if

the system doesn’t know that location of the executable file:

makeBlastDatabase(fileName, 'ECOLI_PROT', '/usr/bin/makeblastdb')

Next we come to a similar Python wrapper function that will run the actual BLAST

program, given a query sequence and a database, created as described above. For the sake

of simplicity, several of the BLAST input options will be ignored and only part of the

output will be read. However, this function can be expanded to use as much detail as is

required.

The Python function that controls the BLAST search takes arguments for the query

sequence (as one-letter codes) and the location of the formatted database to search. There

are optional arguments to specify the BLAST program location (otherwise the command

‘blastp’ will be tried), the maximum cut-off expectation value (the E-value score), the type

of substitution matrix and the number of processing cores to use. It is important to note

that different BLAST programs are used to specify which molecule type (protein or

nucleic acid) is used in the query and independently in the database. Accordingly, the

following programs are available to specify which type of BLAST search will be run:

‘blastp’: protein query against a protein database.

‘blastn’: nucleic acid query against a nucleotide database.

‘tblastn’: protein query against a translated nucleotide database.

‘blastx’: translated nucleic acid query against a protein database.

‘tblastx’: translated nucleic acid query against a translated nucleotide database.

Note that the ‘tblastx’ method translates nucleotide sequence to protein sequence in all

reading frames, so may yield amino acid matches even if the base sequences are quite

different, but naturally only one of the reading frames encodes protein.

The blastSearch() function makes use of the subprocess module again, but this time we

also import Popen and PIPE. The Popen class is a more general way of controlling

external programs than a simple call and we use it here because we are directly controlling

the standard input to and output from BLAST using Python, rather than via the operating

system (an alternative to this approach would be to read and write temporary files). The

PIPE import is a value that allows us to achieve this by setting the stdin and stdout

attributes of Popen so that data will be communicated or ‘piped’ to and from Python.

from subprocess import call, Popen, PIPE

def blastSearch(seq, database, blastExe=None, eCutoff=10,

matrix='BLOSUM62', cpuCores=None):

As with the previous function we set a default name for the program, here assuming

‘blastp’ is available and executable on the system, if the name was not specifically set by

the user. As discussed above, the choice of the specific BLAST program dictates what

type of search is being performed and what molecule type (protein or nucleic acid) the

query and database represent.

if not blastExe:

blastExe = 'blastp'

If the number of processor (CPU) cores was not specified we take the maximum

available as specified by the handy multiprocessing module. Using multiple cores will

allow part of the BLAST search to run more quickly, because certain tasks can be

computed in parallel.

if not cpuCores:

import multiprocessing

cpuCores = multiprocessing.cpu_count()

The query sequence is tidied, making it upper case and inserting a newline ‘\n’ every 60

characters to make regular FASTA-format sequence strings.

querySeq = seq.upper()

querySeq = sub('(.{60})(.)',r'\1\n\2', querySeq)

The sequence is then placed into a string formatted as a FASTA entry by adding an

annotation line (with arbitrary name) before a sequence line. These two lines represent the

data that would be contained in a FASTA file containing a single sequence, which we will

send as input to BLAST.

querySeq = '>UserQuery\n%s\n' % querySeq

The program command arguments are collected in a list as before, noting that we don’t

need to set any input or output files because by default BLAST will use standard input and

output:

cmdArgs = [blastExe,

'-outfmt', '5', # => XML output

'-num_threads', str(cpuCores),

'-db', database,

'-evalue', str(eCutoff),

'-matrix', matrix]

print(' '.join(cmdArgs))

The Popen class is used to make an object that encapsulates the BLAST process, setting

attributes so that the standard input and output data is sent via Python. If there are any

problems with this, e.g. if the BLAST executable is not found, then we catch the exception

and report the error that occurred.

try:

proc = Popen(cmdArgs, stdin=PIPE, stdout=PIPE)

except Exception as err:

print('BLAST command failed')

print('Command used: "%s"' % ' '.join(cmdArgs))

print(err)

return []

Nothing will be run by BLAST until we send it some input data and we do this using

the communicate() method of the process object. This function call gives back two values,

representing standard output and error data respectively. If we hadn’t set stdout=PIPE

above this data would simply be sent to the screen.

stdOutData, stdErrData = proc.communicate(querySeq)

The results are collected by reading through the XML-formatted text contained in the

stdOutData string using the ElementTree module. The fromstring() function will give back

an object representing the tree-like hierarchy of the XML data. To navigate this tree we

start at the base, which is named root here, from which we can follow any node (like

branches) to get to the data within.

results = []

root = ElementTree.fromstring(stdOutData)

From this start point, to collect the required data we use the .find()function, going from

point to point within the structure defined by the XML file. What this function does is to

look for specific elements in the tree-like structure of the XML file which match a

particular text string. Naturally, knowing which things to look for depends on knowing the

schema of the XML file, i.e. what the names of the elements are and how they relate to

one another. See

Chapter 6

for further discussion of XML files. Note that the format of

these XML files could change in future versions of BLAST, so if this example does not

work directly look at the actual output and adapt the following code for the actual XML

elements in the file.

To get to iteration, which will store all the sequences matched in the database (there is

only one iteration in a regular BLAST search) we go to elements named Blast

Output_iterations and then Iteration.

iteration = root.find('BlastOutput_iterations').find('Iteration')

The matches to the query that BLAST finds (if any) will be stored in the XML inside

the ‘Iteration_hits’ section as ‘Hit’ elements. If there are no ‘Iteration_hits’ then we

simply return an empty list for the results.

ihit = iteration.find('Iteration_hits')

if ihit:

hits = ihit.findall('Hit')

else:

return []

Otherwise if there are Hit elements representing the sequence matches to the query in

the database we loop through them to find the Hsp elements (high scoring pairs).

for hit in hits:

hsp = hit.find('Hit_hsps').find('Hsp')

Then the remainder of the database match information is stored in a dictionary using the

name as a key. Note that this is just one example of how to store the results in Python data

structures.

hitDict = {}

The name of the match (the hit) is extracted from the ‘Hit_def’ attribute using

hit.findtext() which we then store in the dictionary with the ‘def’ key. Similarly, its

sequence length is stored as the ‘len’ after conversion of the XML string to a Python

integer:

hitDict['def'] = hit.findtext('Hit_def')

hitDict['len'] = int(hit.findtext('Hit_len'))

For all the values associated with each high-scoring pair we take the tag name for the

value (as used in the XML file) and remove the first four characters to make a simpler key,

thus avoiding all the keys beginning with ‘Hsp_’, e.g. ‘Hsp_score’ becomes ‘score’. The

actual value is then obtained with .findtext() and converted to a Python data type, with

int() or float() as needed. First we collect the floating point values:

for tag in ('Hsp_bit-score', 'Hsp_evalue'):

key = tag[4:]

hitDict[key] = float(hsp.findtext(tag))

Then the integer values:

for tag in ('Hsp_score', 'Hsp_query-from',

'Hsp_query-to', 'Hsp_hit-from',

'Hsp_hit-to', 'Hsp_query-frame',

'Hsp_hit-frame', 'Hsp_identity',

'Hsp_positive', 'Hsp_gaps',

'Hsp_align-len'):

key = tag[4:]

hitDict[key] = int(hsp.findtext(tag, '0'))

And finally the plain text values:

for tag in ('Hsp_qseq', 'Hsp_hseq', 'Hsp_midline'):

key = tag[4:]

hitDict[key] = hsp.findtext(tag)

Once the hit is processed we store the dictionary, with all the extracted values in the

results list. A dictionary could have been used to store the results, but here we wish to

preserve the order of the matches.

results.append(hitDict)

return results

Finally, the Python BLAST search wrapper function can be tested with a query

sequence and a BLAST sequence database created previously.

seq = 'NGTISYTNEAGKIYQLKPNPAVLICRVRGLHLPEKHVTWRGEAIPGSLFDFA' \

'LYFFHNYQALLAKGSGPYFYLPKTQSWQEAAWWSEVFSYAEDRFNLPRGTIK' \

'ATLLIETLPAVFQMDEILHALRDHIVGLNCGRWDYIFSYIKTLKNYPDRVLP'

The location of the database is the file path to the BLAST database files, which

includes the database name, but excludes any file extensions:

database = 'ECOLI_PROT'

The results that come back after running the specified BLAST program (the path for

which should be set appropriately) are a list of names and data dictionaries for any

sequence matches.

results = blastSearch(seq, database'/usr/bin/blastp',

eCutoff=1.0, cpuCores=4)

The various sequences and scores are obtained from the results dictionary for each

match (hit) by using the appropriate keyword:

for hitDict in results:

print(hitDict['def'], hitDict['score'], hitDict['evalue'])

print(hitDict['qseq'])

print(hitDict['midline'])

print(hitDict['hseq'])

Download 7,75 Mb.

Do'stlaringiz bilan baham:

1 ... 171 172 173 174 175 176 177 178 ... 514