Mapping sequences to a genome
In this next section we illustrate how we can use Python to control the process of mapping
large numbers of relatively short DNA sequence reads to a known genome sequence. For
the intensive calculations we will be relying on an existing external program that is quick
and well tested.
Obtaining genome sequences
While we can obtain genome sequences using a web browser, we may sometimes also
wish to automate the process, which can naturally be done with Python scripts. For this
example we will be illustrating how to get data from the NCBI,
8
but hopefully it is clear
how the procedure would be adapted to use other databases. The following downloads
genome sequence information, in the simple FASTA format, in a similar way to the
example used earlier to download Protein Data Bank (PDB) files containing
macromolecular structure data.
On-line resources change, so naturally the examples have been tuned to work with the
way that genomes can be accessed at the time of writing. In the future, however, the
various Internet addresses we use will probably require adjustment. Here we use the File
Transfer Protocol (FTP) service available at the following URL:
FTP_ROOT = 'ftp://ftp.ncbi.nlm.nih.gov/genomes'
And as an example we illustrate the downloading of the following E. coli bacterium
genome data (and naturally we checked with a web browser to see the file names of the
sequences that were available):
GENOME = '/Bacteria/Escherichia_coli_536_uid58531/NC_008253.fna'
The equivalent for a human chromosome sequence would use something like the
following, noting that there is a sub-directory to specify the chromosome:
GENOME = '/H_sapiens/CHR_10/hs_ref_GRCh8_chr10.fa.gz'
We will define a simple Python function downloadFile()that may be built into more
automated scripts to fetch and save the file. For our example the file will be a FASTA-
format sequence. Firstly a few imports are made. The fnmatch module provides a
filename-matching function that makes it easy to choose particular file names that have a
particular file extension (an ending like .fna or .fasta), as an alternative to using regular
expressions (the re module). The urlopen() function will handle all of the communication
with the remote data server. In Python 3 it is in the urllib.request module and in Python2 it
is in the urllib2 module, so we first try to import the former and if that fails try the latter:
import os
from fnmatch import fnmatch
try:
# Python 3
from urllib.request import urlopen
except ImportError:
# Python 2
from urllib2 import urlopen
The download function itself takes two arguments, the URL of the remote file we wish
to acquire and the file path (file name and directory) of where we save the file locally. In
the function we print a message to give the user some information on what is happening,
open a connection to the remote location with urlopen() and read the contents into the
variable data.
def downloadFile(remoteFile, localFile):
print('Downloading to %s …' % localFile)
response = urlopen(remoteFile)
data = response.read()
Then at the end we open the local output file for writing in binary mode.
9
Note that we
are being lazy here and don’t make any checks to ensure that we are not overwriting an
existing file, but this would be easily handled using os.path.exists().
fileObj = open(localFile, 'b')
fileObj.write(data)
fileObj.close()
print(' …done')
This function is simply tested with the following demonstration code for the E. coli
genome file mentioned above (saving the file in the current working directory):
remoteGenome =
FTP_ROOT+'/Bacteria/Escherichia_coli_536_uid58531/NC_008253.fna'
downloadFile(remoteGenome, 'EcoliGenome.fasta')
Moving on from fetching single files, the next example downloads several files with
related locations. For example, the files may contain the different human chromosome
sequences, but are housed in different sub-directories (as is the case for the NCBI’s FTP
site).
The downloadGenomeFiles() function takes a remote directory and a local directory,
rather than using a single specific file name, and searches for all the files that end a
particular way (by default with ‘fna’). Also, in this example we have separated the main
(base) location of the remote site into a variable called url, given that this is not expected
to change much and we can avoid always having to specify the full location of the remote
data. After the initial definition line some tidying is done on the input remoteDir to ensure
that it begins and ends with a ‘/’ character, which is what is expected later in the function.
def downloadGenomeFiles(remoteDir, localDir, fileType='*.fna',
url=FTP_ROOT):
if remoteDir[0] != '/':
remoteDir = '/' + remoteDir
if remoteDir[-1] != '/':
remoteDir = remoteDir + '/'
The full remote path is then the combination of the base URL and the variable remote
path. Note that here we do not use os.path.join(), because this would use different slashes
under different operating systems (‘\’ for Windows, ‘/’ for Max OS X and Linux) and is
appropriate only for local file systems, not remote Internet resources that expect only ‘/’.
remotePath = url + remoteDir
print("Reading %s" % remotePath)
Next the urlopen() function is used and in this case, because the target is a directory,
gets the listing of the file names at the remote location.
response = urllib2.urlopen(remotePath)
data = response.read()
Some empty lists are initialised, which will contain all the file information. The remote
data that is read is one long test string containing all the file names, which we split on the
newline character to give a list of lines.
fileNames = []
filePaths = []
chromosomeDirs = []
lines = data.split('\n')
Then each line from the remote read is considered in a loop, stripping off any
whitespace and skipping blank lines:
for line in lines:
line = line.strip()
if not line:
continue
The actual file name is the last element of an array ([-1]) when we split the line into a
list according to its whitespace. Then if the entry begins with the letters ‘CHR’ it is
deemed to be a sub-directory for individual chromosomes that we need to look into to get
the sequence files, and this chromosome directory is added to the chromosomeDirs list.
fileName = line.split()[-1]
if fileName.startswith('CHR'):
chromosomeDirs.append(fileName + '/')
Otherwise, if the file is not a chromosome sub-directory, a check is made to see if it is a
sequence file with the prescribed ending. Here we use the imported fnmatch() function to
see if the file name contains the pattern specified in the input arguments. This pattern
defaults to ‘*.fna’, which means that the name matches if it ends in .fna; the ‘*’ matches
any number of characters in the file name before the stated ending.
elif fnmatch(fileName, fileType):
fileNames.append(fileName)
continue
We then get the full locations of the individual remote files and download them, to a
local file of the same name, using the downloadFile() function defined above. Here we do
use os.path.join() to make a local file name which is consistent with the operating system
we are using. The os.path.abspath() is used to get the full, long file name, with directories
relative to the root of the local file system: this resolves relative directory specifications
like ‘../’, removes redundancies like ‘dirA/./dirB’ and converts any slashes to the right
kind for the current operating system.
for fileName in fileNames:
filePath = os.path.join(localDir, fileName)
filePath = os.path.abspath(filePath)
filePaths.append(filePath)
downloadFile(url + remoteDir + fileName, filePath)
Given that reading the remote directory may have found some sub-directories for
individual chromosomes we then repeat the whole operation on each sub-directory by
calling the downloadGenomeFiles() function from inside itself, i.e. recursively. This will
then go in to each of those other locations and potentially add more matching files to the
current list. Note that only the first, main function call will give back files to the user, and
all the recursive calls will be absorbed into the filePaths list at this point.
for chromosomeDir in chromosomeDirs:
subDir = remoteDir + chromosomeDir
filePaths += downloadGenomeFiles(subDir, localDir, fileType, url)
return filePaths
At the end the function returns a list of the file paths that were actually saved. We can
then test the function on the human genome at the NCBI site used previously. Note that
this may take a significant time to download, so only do it for real if you really want all of
the human genome data!
filePaths = downloadGenomeFiles('H_sapiens','examples','hs_ref*.fa.gz')
One further convenience function extracts any compressed GZIP archives (ending in
‘.gz’) so they can be used locally without hindrance. This function simply uses standard
Python modules to create open file objects for compressed formats. These can be parsed,
line by line, in much the same way as a regular file. Here the example simply writes out
the uncompressed lines one by one into a new file, but if there is enough memory the
whole file could be read in one go (using .read()). Also, the gzip library could be used to
read compressed files directly inside analysis functions, thus avoiding having to store
uncompressed files. Note that there are also standard Python modules to handle other
compression formats like ‘ZIP’ and ‘BZIP2’; zipfile and bz2.
import gzip
def uncompressGzFile(fileName):
if fileName.endswith('.gz'):
inFileObj = gzip.open(fileName, 'rb')
fileName = fileName[:-3]
outFileObj = open(fileName, 'w')
for line in inFileObj:
outFileObj.write(line)
# Faster alternative, given sufficient memory:
# outFileObj.write(infileObj.read())
inFileObj.close()
outFileObj.close()
return fileName
Next we use an external program that will do the alignment to map the short sequence
reads to a genome sequence. Popular open-source software choices for this at the time of
writing are BOWTIE
10
and BWA,
11
and we will illustrate the use of the former by
wrapping with a convenient Python function, in a similar manner to what was done with
BLAST and ClustalW examples in earlier chapters.
Downloading genome sequence data is usually a simple matter, as described above, of
accessing an on-line repository. The next step is naturally to consider aligning some short-
read data to the genome sequence. For demonstration purposes, there are lots of high-
throughput sequencing data sets that are available via on-line services like the Gene
Expression Omnibus,
12
and which can be used with the scripts we describe below. The
actual genome alignment in these examples will be done by the program called Bowtie,
13
which is just the open-source example we happen to have chosen for this chapter. Python
functions will be illustrated which wrap this external program so that we can use it in
larger programs and automated pipeline scripts. Hence the purpose of the examples is to
easily use existing (tested and efficient) programs, rather than doing everything from
scratch in Python, which would be slower to run and take a long time to describe.
Do'stlaringiz bilan baham: |