Python Programming for Biology: Bioinformatics and Beyond



Download 7,75 Mb.
Pdf ko'rish
bet259/514
Sana30.12.2021
Hajmi7,75 Mb.
#91066
1   ...   255   256   257   258   259   260   261   262   ...   514
Bog'liq
[Tim J. Stevens, Wayne Boucher] Python Programming

Indexing a genome

Before considering running the actual alignment the genome sequence data first needs to

be  indexed.  This  process  is  critical  for  the  fast  short-read  alignment  program  operation.

Using a full genome sequence in a textual file format would be exceedingly slow; rather

the chromosomes are processed to give a binary index file where the sequence of a short

read can be looked up efficiently. The indexing process does take a significant amount of

time (and computer memory), but it only has to be done occasionally, when there is a new

release of an improved genome sequence.

The first step to get started is to make an import from the standard Python subprocess

module,  which  will  handle  invoking  the  command  to  run  the  external  indexing  or

alignment  program.  Naturally,  we  also  need  to  specify  the  location  of  the  aligner  within

the local file system; the directory that contains the program executable file(s) is specified

in ALIGNER_PATH.

from subprocess import call

ALIGNER_PATH = '/home/user/programs/bowtie-0.12.7/' # substitute correct

path


For  our  example  program,  Bowtie,  the  indexing  program  will  be  found  in  the

ALIGNER_PATH  directory  specified  above.  The  Python  function  that  controls  the

indexing,  as  illustrated  below,  takes  various  arguments,  which  naturally  include  the

location of the genome sequence files and the various options to control the indexing. The

options that are used here are specific to the particular program and its current version, so

changes  would  need  to  be  made  to  use  other  aligners,  albeit  using  the  same  kinds  of

strategy.

The  function  indexGenome()  takes  the  genomeName,  which  will  be  the  identifier  for

the genome data that will be used later when the actual alignment is made. The fileNames

are  the  names  of  the  sequence  files  and  naturally  outputDir  controls  where  the  indexed

genome files are placed. The other options control a few of the more important aspects of

the  indexing  process;  however,  there  are  many  more  options  that  could  have  been

included. The full list of options (normally issued via the command line) is available in the

main Bowtie documentation.

def indexGenome(genomeName, fileNames, outputDir,

tableSize=10, quiet=True, pack=True):

Firstly, in the function a check is made to decompress any archived genome sequence

files, using the function described above, which will not affect sequence files that are not

compressed. The names of the uncompressed sequence files are appended to a list, which



is then joined by commas into one long text string, fastaFileStr, and it is in this form that

we specify the sequence files when running the indexing program.

fastaFiles = []

for fileName in fileNames:

fileName = uncompressGzFile(fileName)

fastaFiles.append(fileName)

fastaFileStr= ','.join(fastaFiles)

The  next  stage  is  to  assemble  the  options  (the  command  line  arguments)  that  will  be

used when creating the index. These are assembled into the cmdArgs list. In essence this

list  will  contain  all  of  the  things  that  would  be  typed  by  a  person  at  a  command  line

prompt if the program were to be run directly from the operating system. Accordingly, the

first  thing  in  the  list  is  the  full  file  path  of  the  indexing  program,  which  is  the

ALIGNER_PATH joined to the name of the indexing program, which here is bowtie-build

(and is located in that directory). The ‘-f’ option specifies that the input sequence data is

hard-wired to be in FASTA format. If this is not always the case in use, then this option

could  be  changed  by  adding  another  optional  argument  in  the  function  definition.  The

quiet and pack  arguments  of  the  function  are  used  to  specify  whether  the  respective  ‘-q’

and ‘-p’ options end up in the list. Here the quiet option is used to suppress textual output

during  the  indexing  procedure  and  the  pack  is  used  to  compress  data  when  the  index  is

made;  this  makes  indexing  slower,  but  means  it  takes  less  computer  memory.  Note  how

the  example  has  substituted  a  simple  command  line  option  and  replaced  it,  as  far  as

running  from  Python  is  concerned,  with  a  variable  that  has  a  more  immediately

informative name, e.g. ‘-q’ becomes quiet.

cmdArgs = [ALIGNER_PATH+'bowtie-build', '-f']

if quiet:

cmdArgs.append('-q')

if pack:

cmdArgs.append('-p')

The  final  options  to  add  to  the  list  relate  to  the  table  size,  which  is  the  number  of

sequence  positions  from  the  short  query  that  will  be  mapped  at  one  time  to  the  genome

data, the input sequence files and the name tag for the resultant genome index.

cmdArgs += ['-t', str(tableSize), fastaFileStr, genomeName]

The competed list of command options is joined, using spaces, into a line that is printed

out,  to  show  the  user  what  is  being  run.  This  might  result  in  something  like

‘/home/user/programs/bowtie-build  -f  -t  10  genomeFasta/NC_008253.fna  E_coli’.  Then

finally the indexing job is run using call() with the program name and option information.

Note that the output directory is dictated by setting the current working directory (cwd) at

the time the call is made, rather than this being a program input option.

print(' '.join(cmdArgs))

call(cmdArgs, cwd=outputDir)




Under  normal  circumstance  the  indexing  will  only  be  done  occasionally,  but  beware

that  the  procedure  will  often  take  a  significant  amount  of  active  computer  memory.  For

example,  even  with  the  pack  option  being  used  a  mammalian  genome  will  typically

require over three gigabytes of free memory to index.




Download 7,75 Mb.

Do'stlaringiz bilan baham:
1   ...   255   256   257   258   259   260   261   262   ...   514




Ma'lumotlar bazasi mualliflik huquqi bilan himoyalangan ©hozir.org 2024
ma'muriyatiga murojaat qiling

kiriting | ro'yxatdan o'tish
    Bosh sahifa
юртда тантана
Боғда битган
Бугун юртда
Эшитганлар жилманглар
Эшитмадим деманглар
битган бодомлар
Yangiariq tumani
qitish marakazi
Raqamli texnologiyalar
ilishida muhokamadan
tasdiqqa tavsiya
tavsiya etilgan
iqtisodiyot kafedrasi
steiermarkischen landesregierung
asarlaringizni yuboring
o'zingizning asarlaringizni
Iltimos faqat
faqat o'zingizning
steierm rkischen
landesregierung fachabteilung
rkischen landesregierung
hamshira loyihasi
loyihasi mavsum
faolyatining oqibatlari
asosiy adabiyotlar
fakulteti ahborot
ahborot havfsizligi
havfsizligi kafedrasi
fanidan bo’yicha
fakulteti iqtisodiyot
boshqaruv fakulteti
chiqarishda boshqaruv
ishlab chiqarishda
iqtisodiyot fakultet
multiservis tarmoqlari
fanidan asosiy
Uzbek fanidan
mavzulari potok
asosidagi multiservis
'aliyyil a'ziym
billahil 'aliyyil
illaa billahil
quvvata illaa
falah' deganida
Kompyuter savodxonligi
bo’yicha mustaqil
'alal falah'
Hayya 'alal
'alas soloh
Hayya 'alas
mavsum boyicha


yuklab olish