Python Programming for Biology: Bioinformatics and Beyond



Download 7,75 Mb.
Pdf ko'rish
bet175/514
Sana30.12.2021
Hajmi7,75 Mb.
#91066
1   ...   171   172   173   174   175   176   177   178   ...   514
Bog'liq
[Tim J. Stevens, Wayne Boucher] Python Programming

Using BLAST from Python

Before  we  get  started  we  need  to  consider  how  to  create  a  sequence  database  that  the

BLAST program can search through. This is not a normal sequence file or database, but

rather  it  is  a  special,  optimised  series  of  files  that  is  special  to  the  BLAST  system  and

which makes the comparison of sequences efficient. Fortunately there is a program called

‘makeblastdb’, which is available in the BLAST+ package,

18

that can be used to make the




special database from sequences stored in common text-based formats like FASTA.

Before  the  examples  we  import  all  of  the  external  functions  and  modules  that  will  be

required.  Specifically  we  need  the  substitution  function  (sub)  from  the  regular

expression

19

 module  to  help  format  the  sequences  for  the  input  files.  The  call  function



from the subprocess is a convenient way to run programs that are external to Python. The

ElementTree module will allow us to read the XML

20

formatted BLAST output.



from re import sub

from subprocess import call

from xml.etree import ElementTree

Now  a  function  is  defined  to  make  the  special  sequence  database  for  BLAST.  This  is

not a conventional relational database, e.g. where small amounts of data are accessed and

modified,  but  rather  a  specially  formatted  version  of  sequence  data  to  be  queried,  which

allows BLAST to efficiently find matches. The function takes a FASTA-formatted file as

input,  a  distinguishing  name  for  the  database,  formatdbExe,  the  location  of  the  external

‘makeblastdb’  program  to  run  and  isProtein,  a  switch  to  state  whether  the  sequence  is

protein and not nucleic acid (RNA or DNA). Note that the input FASTA sequence file is

often  very  large,  representing  a  whole  database  (e.g.  UniProt)  or  even  a  genome.

Obtaining genome sequence information will be discussed later in

Chapter 17

.

def makeBlastDatabase(fastaFile, databaseName,



formatdbExe=None, isProtein=True):

If a value for formatdbExe  is  not  passed  in  we  will  assume  that  the  database  creation

program is available to the local system as a command called makeblastdb. If the program

name is unavailable, or not set correctly, then an error exception will be triggered when we

come to run the program via call().

if not formatdbExe:

formatdbExe = 'makeblastdb'

Next  we  define  a  variable  which  will  indicate  the  database  type.  The  makeblastdb

program accepts a value ‘prot’ (protein) or ‘nucl’ (nucleic acid) after its -dbtype option to

indicate whether the database is protein or not.

if isProtein:

molType = 'prot'

else:

molType = 'nucl'



Then all of the options and arguments that will be used when running the makeblastdb

program  are  collected  in  a  list.  These  would  be  the  values  that  one  would  type  at  a

command line prompt to run the program.

cmdArgs = [formatdbExe,

'-dbtype', molType,

'-in', fastaFile,

'-out', databaseName]

This list is then passed to the call() function to actually run the program, noting that we




catch  any  exceptions  that  may  occur.  If  an  exception  is  triggered  we  print  the  command

that was tried, together with the original Python error (err) so we can determine what the

problem was.

print('Making BLAST database %s…' % databaseName)

try:

call(cmdArgs)



except Exception as err:

print('BLAST database creation failed')

print('Command used: "%s"' % ' '.join(cmdArgs))

print(err)

return

print(' …done')



With  the  function  defined  we  can  now  run  the  makeblastdb  program  on  a  FASTA-

format file as if it were Python. Naturally we only need to do this once for each database,

unless it changes.

fileName = 'EcoliGenome.fasta'

makeBlastDatabase(fileName, 'ECOLI_PROT')

However, we may have to specifically state the location of the makeblastdb program if

the system doesn’t know that location of the executable file:

makeBlastDatabase(fileName, 'ECOLI_PROT', '/usr/bin/makeblastdb')

Next  we  come  to  a  similar  Python  wrapper  function  that  will  run  the  actual  BLAST

program, given a query sequence and a database, created as described above. For the sake

of  simplicity,  several  of  the  BLAST  input  options  will  be  ignored  and  only  part  of  the

output  will  be  read.  However,  this  function  can  be  expanded  to  use  as  much  detail  as  is

required.

21

The  Python  function  that  controls  the  BLAST  search  takes  arguments  for  the  query



sequence (as one-letter codes) and the location of the formatted database to search. There

are optional arguments to specify the BLAST program location (otherwise the command

‘blastp’ will be tried), the maximum cut-off expectation value (the E-value score), the type

of  substitution  matrix  and  the  number  of  processing  cores  to  use.  It  is  important  to  note

that  different  BLAST  programs  are  used  to  specify  which  molecule  type  (protein  or

nucleic  acid)  is  used  in  the  query  and  independently  in  the  database.  Accordingly,  the

following programs are available to specify which type of BLAST search will be run:

‘blastp’: protein query against a protein database.

‘blastn’: nucleic acid query against a nucleotide database.

‘tblastn’: protein query against a translated nucleotide database.

‘blastx’: translated nucleic acid query against a protein database.

‘tblastx’: translated nucleic acid query against a translated nucleotide database.




Note that the ‘tblastx’ method translates nucleotide sequence to protein sequence in all

reading  frames,  so  may  yield  amino  acid  matches  even  if  the  base  sequences  are  quite

different, but naturally only one of the reading frames encodes protein.

The blastSearch() function makes use of the subprocess module again, but this time we

also  import  Popen  and  PIPE.  The  Popen  class  is  a  more  general  way  of  controlling

external programs than a simple call and we use it here because we are directly controlling

the standard input to and output from BLAST using Python, rather than via the operating

system (an alternative to this approach would be to read and write temporary files). The

PIPE  import  is  a  value  that  allows  us  to  achieve  this  by  setting  the  stdin  and  stdout

attributes of Popen so that data will be communicated or ‘piped’ to and from Python.

from subprocess import call, Popen, PIPE

def blastSearch(seq, database, blastExe=None, eCutoff=10,

matrix='BLOSUM62', cpuCores=None):

As  with  the  previous  function  we  set  a  default  name  for  the  program,  here  assuming

‘blastp’ is available and executable on the system, if the name was not specifically set by

the  user.  As  discussed  above,  the  choice  of  the  specific  BLAST  program  dictates  what

type  of  search  is  being  performed  and  what  molecule  type  (protein  or  nucleic  acid)  the

query and database represent.

if not blastExe:

blastExe = 'blastp'

If  the  number  of  processor  (CPU)  cores  was  not  specified  we  take  the  maximum

available  as  specified  by  the  handy  multiprocessing  module.  Using  multiple  cores  will

allow  part  of  the  BLAST  search  to  run  more  quickly,  because  certain  tasks  can  be

computed in parallel.

if not cpuCores:

import multiprocessing

cpuCores = multiprocessing.cpu_count()

The query sequence is tidied, making it upper case and inserting a newline ‘\n’ every 60

characters to make regular FASTA-format sequence strings.

querySeq = seq.upper()

querySeq = sub('(.{60})(.)',r'\1\n\2', querySeq)

The  sequence  is  then  placed  into  a  string  formatted  as  a  FASTA  entry  by  adding  an

annotation line (with arbitrary name) before a sequence line. These two lines represent the

data that would be contained in a FASTA file containing a single sequence, which we will

send as input to BLAST.

querySeq = '>UserQuery\n%s\n' % querySeq

The program command arguments are collected in a list as before, noting that we don’t

need to set any input or output files because by default BLAST will use standard input and

output:



cmdArgs = [blastExe,

'-outfmt', '5', # => XML output

'-num_threads', str(cpuCores),

'-db', database,

'-evalue', str(eCutoff),

'-matrix', matrix]

print(' '.join(cmdArgs))

The Popen class is used to make an object that encapsulates the BLAST process, setting

attributes  so  that  the  standard  input  and  output  data  is  sent  via  Python.  If  there  are  any

problems with this, e.g. if the BLAST executable is not found, then we catch the exception

and report the error that occurred.

try:


proc = Popen(cmdArgs, stdin=PIPE, stdout=PIPE)

except Exception as err:

print('BLAST command failed')

print('Command used: "%s"' % ' '.join(cmdArgs))

print(err)

return []

Nothing will be run by BLAST until we send it some input data and we do this using

the communicate() method of the process object. This function call gives back two values,

representing  standard  output  and  error  data  respectively.  If  we  hadn’t  set  stdout=PIPE

above this data would simply be sent to the screen.

stdOutData, stdErrData = proc.communicate(querySeq)

The  results  are  collected  by  reading  through  the  XML-formatted  text  contained  in  the

stdOutData string using the ElementTree module. The fromstring() function will give back

an  object  representing  the  tree-like  hierarchy  of  the  XML  data.  To  navigate  this  tree  we

start  at  the  base,  which  is  named  root  here,  from  which  we  can  follow  any  node  (like

branches) to get to the data within.

results = []

root = ElementTree.fromstring(stdOutData)

From this start point, to collect the required data we use the .find()function, going from

point to point within the structure defined by the XML file. What this function does is to

look  for  specific  elements  in  the  tree-like  structure  of  the  XML  file  which  match  a

particular text string. Naturally, knowing which things to look for depends on knowing the

schema  of  the  XML  file,  i.e.  what  the  names  of  the  elements  are  and  how  they  relate  to

one  another.  See

Chapter  6

 for  further  discussion  of  XML  files.  Note  that  the  format  of

these  XML  files  could  change  in  future  versions  of  BLAST,  so  if  this  example  does  not

work directly look at the actual output and adapt the following code for the actual XML

elements in the file.

To get to iteration, which will store all the sequences matched in the database (there is

only  one  iteration  in  a  regular  BLAST  search)  we  go  to  elements  named  Blast

Output_iterations and then Iteration.




iteration = root.find('BlastOutput_iterations').find('Iteration')

The matches to the query that BLAST finds (if any) will be stored in the XML inside

the  ‘Iteration_hits’  section  as  ‘Hit’  elements.  If  there  are  no  ‘Iteration_hits’  then  we

simply return an empty list for the results.

ihit = iteration.find('Iteration_hits')

if ihit:


hits = ihit.findall('Hit')

else:


return []

Otherwise if there are Hit elements representing the sequence matches to the query in

the database we loop through them to find the Hsp elements (high scoring pairs).

for hit in hits:

hsp = hit.find('Hit_hsps').find('Hsp')

Then the remainder of the database match information is stored in a dictionary using the

name as a key. Note that this is just one example of how to store the results in Python data

structures.

hitDict = {}

The  name  of  the  match  (the  hit)  is  extracted  from  the  ‘Hit_def’  attribute  using

hit.findtext()  which  we  then  store  in  the  dictionary  with  the  ‘def’  key.  Similarly,  its

sequence  length  is  stored  as  the  ‘len’  after  conversion  of  the  XML  string  to  a  Python

integer:

hitDict['def'] = hit.findtext('Hit_def')

hitDict['len'] = int(hit.findtext('Hit_len'))

For all the values associated with each high-scoring pair we take the tag name for the

value (as used in the XML file) and remove the first four characters to make a simpler key,

thus avoiding all the keys beginning with ‘Hsp_’, e.g. ‘Hsp_score’ becomes ‘score’.  The

actual  value  is  then  obtained  with  .findtext()  and  converted  to  a  Python  data  type,  with

int() or float() as needed. First we collect the floating point values:

for tag in ('Hsp_bit-score', 'Hsp_evalue'):

key = tag[4:]

hitDict[key] = float(hsp.findtext(tag))

Then the integer values:

for tag in ('Hsp_score', 'Hsp_query-from',

'Hsp_query-to', 'Hsp_hit-from',

'Hsp_hit-to', 'Hsp_query-frame',

'Hsp_hit-frame', 'Hsp_identity',

'Hsp_positive', 'Hsp_gaps',

'Hsp_align-len'):

key = tag[4:]

hitDict[key] = int(hsp.findtext(tag, '0'))




And finally the plain text values:

for tag in ('Hsp_qseq', 'Hsp_hseq', 'Hsp_midline'):

key = tag[4:]

hitDict[key] = hsp.findtext(tag)

Once  the  hit  is  processed  we  store  the  dictionary,  with  all  the  extracted  values  in  the

results  list.  A  dictionary  could  have  been  used  to  store  the  results,  but  here  we  wish  to

preserve the order of the matches.

results.append(hitDict)

return results

Finally,  the  Python  BLAST  search  wrapper  function  can  be  tested  with  a  query

sequence and a BLAST sequence database created previously.

seq = 'NGTISYTNEAGKIYQLKPNPAVLICRVRGLHLPEKHVTWRGEAIPGSLFDFA' \

'LYFFHNYQALLAKGSGPYFYLPKTQSWQEAAWWSEVFSYAEDRFNLPRGTIK' \

'ATLLIETLPAVFQMDEILHALRDHIVGLNCGRWDYIFSYIKTLKNYPDRVLP'

The  location  of  the  database  is  the  file  path  to  the  BLAST  database  files,  which

includes the database name, but excludes any file extensions:

database = 'ECOLI_PROT'

The  results  that  come  back  after  running  the  specified  BLAST  program  (the  path  for

which  should  be  set  appropriately)  are  a  list  of  names  and  data  dictionaries  for  any

sequence matches.

results = blastSearch(seq, database'/usr/bin/blastp',

eCutoff=1.0, cpuCores=4)

The  various  sequences  and  scores  are  obtained  from  the  results  dictionary  for  each

match (hit) by using the appropriate keyword:

for hitDict in results:

print(hitDict['def'], hitDict['score'], hitDict['evalue'])

print(hitDict['qseq'])

print(hitDict['midline'])

print(hitDict['hseq'])


Download 7,75 Mb.

Do'stlaringiz bilan baham:
1   ...   171   172   173   174   175   176   177   178   ...   514




Ma'lumotlar bazasi mualliflik huquqi bilan himoyalangan ©hozir.org 2024
ma'muriyatiga murojaat qiling

kiriting | ro'yxatdan o'tish
    Bosh sahifa
юртда тантана
Боғда битган
Бугун юртда
Эшитганлар жилманглар
Эшитмадим деманглар
битган бодомлар
Yangiariq tumani
qitish marakazi
Raqamli texnologiyalar
ilishida muhokamadan
tasdiqqa tavsiya
tavsiya etilgan
iqtisodiyot kafedrasi
steiermarkischen landesregierung
asarlaringizni yuboring
o'zingizning asarlaringizni
Iltimos faqat
faqat o'zingizning
steierm rkischen
landesregierung fachabteilung
rkischen landesregierung
hamshira loyihasi
loyihasi mavsum
faolyatining oqibatlari
asosiy adabiyotlar
fakulteti ahborot
ahborot havfsizligi
havfsizligi kafedrasi
fanidan bo’yicha
fakulteti iqtisodiyot
boshqaruv fakulteti
chiqarishda boshqaruv
ishlab chiqarishda
iqtisodiyot fakultet
multiservis tarmoqlari
fanidan asosiy
Uzbek fanidan
mavzulari potok
asosidagi multiservis
'aliyyil a'ziym
billahil 'aliyyil
illaa billahil
quvvata illaa
falah' deganida
Kompyuter savodxonligi
bo’yicha mustaqil
'alal falah'
Hayya 'alal
'alas soloh
Hayya 'alas
mavsum boyicha


yuklab olish