Indexing a genome
Before considering running the actual alignment the genome sequence data first needs to
be indexed. This process is critical for the fast short-read alignment program operation.
Using a full genome sequence in a textual file format would be exceedingly slow; rather
the chromosomes are processed to give a binary index file where the sequence of a short
read can be looked up efficiently. The indexing process does take a significant amount of
time (and computer memory), but it only has to be done occasionally, when there is a new
release of an improved genome sequence.
The first step to get started is to make an import from the standard Python subprocess
module, which will handle invoking the command to run the external indexing or
alignment program. Naturally, we also need to specify the location of the aligner within
the local file system; the directory that contains the program executable file(s) is specified
in ALIGNER_PATH.
from subprocess import call
ALIGNER_PATH = '/home/user/programs/bowtie-0.12.7/' # substitute correct
path
For our example program, Bowtie, the indexing program will be found in the
ALIGNER_PATH directory specified above. The Python function that controls the
indexing, as illustrated below, takes various arguments, which naturally include the
location of the genome sequence files and the various options to control the indexing. The
options that are used here are specific to the particular program and its current version, so
changes would need to be made to use other aligners, albeit using the same kinds of
strategy.
The function indexGenome() takes the genomeName, which will be the identifier for
the genome data that will be used later when the actual alignment is made. The fileNames
are the names of the sequence files and naturally outputDir controls where the indexed
genome files are placed. The other options control a few of the more important aspects of
the indexing process; however, there are many more options that could have been
included. The full list of options (normally issued via the command line) is available in the
main Bowtie documentation.
def indexGenome(genomeName, fileNames, outputDir,
tableSize=10, quiet=True, pack=True):
Firstly, in the function a check is made to decompress any archived genome sequence
files, using the function described above, which will not affect sequence files that are not
compressed. The names of the uncompressed sequence files are appended to a list, which
is then joined by commas into one long text string, fastaFileStr, and it is in this form that
we specify the sequence files when running the indexing program.
fastaFiles = []
for fileName in fileNames:
fileName = uncompressGzFile(fileName)
fastaFiles.append(fileName)
fastaFileStr= ','.join(fastaFiles)
The next stage is to assemble the options (the command line arguments) that will be
used when creating the index. These are assembled into the cmdArgs list. In essence this
list will contain all of the things that would be typed by a person at a command line
prompt if the program were to be run directly from the operating system. Accordingly, the
first thing in the list is the full file path of the indexing program, which is the
ALIGNER_PATH joined to the name of the indexing program, which here is bowtie-build
(and is located in that directory). The ‘-f’ option specifies that the input sequence data is
hard-wired to be in FASTA format. If this is not always the case in use, then this option
could be changed by adding another optional argument in the function definition. The
quiet and pack arguments of the function are used to specify whether the respective ‘-q’
and ‘-p’ options end up in the list. Here the quiet option is used to suppress textual output
during the indexing procedure and the pack is used to compress data when the index is
made; this makes indexing slower, but means it takes less computer memory. Note how
the example has substituted a simple command line option and replaced it, as far as
running from Python is concerned, with a variable that has a more immediately
informative name, e.g. ‘-q’ becomes quiet.
cmdArgs = [ALIGNER_PATH+'bowtie-build', '-f']
if quiet:
cmdArgs.append('-q')
if pack:
cmdArgs.append('-p')
The final options to add to the list relate to the table size, which is the number of
sequence positions from the short query that will be mapped at one time to the genome
data, the input sequence files and the name tag for the resultant genome index.
cmdArgs += ['-t', str(tableSize), fastaFileStr, genomeName]
The competed list of command options is joined, using spaces, into a line that is printed
out, to show the user what is being run. This might result in something like
‘/home/user/programs/bowtie-build -f -t 10 genomeFasta/NC_008253.fna E_coli’. Then
finally the indexing job is run using call() with the program name and option information.
Note that the output directory is dictated by setting the current working directory (cwd) at
the time the call is made, rather than this being a program input option.
print(' '.join(cmdArgs))
call(cmdArgs, cwd=outputDir)
Under normal circumstance the indexing will only be done occasionally, but beware
that the procedure will often take a significant amount of active computer memory. For
example, even with the pack option being used a mammalian genome will typically
require over three gigabytes of free memory to index.
Do'stlaringiz bilan baham: |