Python Programming for Biology: Bioinformatics and Beyond

Mapping sequences to a genome

Download 7,75 Mb.

Pdf ko'rish

bet	258/514
Sana	30.12.2021
Hajmi	7,75 Mb.
	#91066

1 ... 254 255 256 257 258 259 260 261 ... 514

Bog'liq
[Tim J. Stevens, Wayne Boucher] Python Programming

Obtaining genome sequences

Mapping sequences to a genome

In this next section we illustrate how we can use Python to control the process of mapping

large numbers of relatively short DNA sequence reads to a known genome sequence. For

the intensive calculations we will be relying on an existing external program that is quick

and well tested.

Obtaining genome sequences

While we can obtain genome sequences using a web browser, we may sometimes also

wish to automate the process, which can naturally be done with Python scripts. For this

example we will be illustrating how to get data from the NCBI,

but hopefully it is clear

how the procedure would be adapted to use other databases. The following downloads

genome sequence information, in the simple FASTA format, in a similar way to the

example used earlier to download Protein Data Bank (PDB) files containing

macromolecular structure data.

On-line resources change, so naturally the examples have been tuned to work with the

way that genomes can be accessed at the time of writing. In the future, however, the

various Internet addresses we use will probably require adjustment. Here we use the File

Transfer Protocol (FTP) service available at the following URL:

FTP_ROOT = 'ftp://ftp.ncbi.nlm.nih.gov/genomes'

And as an example we illustrate the downloading of the following E. coli bacterium

genome data (and naturally we checked with a web browser to see the file names of the

sequences that were available):

GENOME = '/Bacteria/Escherichia_coli_536_uid58531/NC_008253.fna'

The equivalent for a human chromosome sequence would use something like the

following, noting that there is a sub-directory to specify the chromosome:

GENOME = '/H_sapiens/CHR_10/hs_ref_GRCh8_chr10.fa.gz'

We will define a simple Python function downloadFile()that may be built into more

automated scripts to fetch and save the file. For our example the file will be a FASTA-

format sequence. Firstly a few imports are made. The fnmatch module provides a

filename-matching function that makes it easy to choose particular file names that have a

particular file extension (an ending like .fna or .fasta), as an alternative to using regular

expressions (the re module). The urlopen() function will handle all of the communication

with the remote data server. In Python 3 it is in the urllib.request module and in Python2 it

is in the urllib2 module, so we first try to import the former and if that fails try the latter:

import os

from fnmatch import fnmatch

try:

# Python 3

from urllib.request import urlopen

except ImportError:

# Python 2

from urllib2 import urlopen

The download function itself takes two arguments, the URL of the remote file we wish

to acquire and the file path (file name and directory) of where we save the file locally. In

the function we print a message to give the user some information on what is happening,

open a connection to the remote location with urlopen() and read the contents into the

variable data.

def downloadFile(remoteFile, localFile):

print('Downloading to %s …' % localFile)

response = urlopen(remoteFile)

data = response.read()

Then at the end we open the local output file for writing in binary mode.

Note that we

are being lazy here and don’t make any checks to ensure that we are not overwriting an

existing file, but this would be easily handled using os.path.exists().

fileObj = open(localFile, 'b')

fileObj.write(data)

fileObj.close()

print(' …done')

This function is simply tested with the following demonstration code for the E. coli

genome file mentioned above (saving the file in the current working directory):

remoteGenome =

FTP_ROOT+'/Bacteria/Escherichia_coli_536_uid58531/NC_008253.fna'

downloadFile(remoteGenome, 'EcoliGenome.fasta')

Moving on from fetching single files, the next example downloads several files with

related locations. For example, the files may contain the different human chromosome

sequences, but are housed in different sub-directories (as is the case for the NCBI’s FTP

site).

The downloadGenomeFiles() function takes a remote directory and a local directory,

rather than using a single specific file name, and searches for all the files that end a

particular way (by default with ‘fna’). Also, in this example we have separated the main

(base) location of the remote site into a variable called url, given that this is not expected

to change much and we can avoid always having to specify the full location of the remote

data. After the initial definition line some tidying is done on the input remoteDir to ensure

that it begins and ends with a ‘/’ character, which is what is expected later in the function.

def downloadGenomeFiles(remoteDir, localDir, fileType='*.fna',

url=FTP_ROOT):

if remoteDir[0] != '/':

remoteDir = '/' + remoteDir

if remoteDir[-1] != '/':

remoteDir = remoteDir + '/'

The full remote path is then the combination of the base URL and the variable remote

path. Note that here we do not use os.path.join(), because this would use different slashes

under different operating systems (‘\’ for Windows, ‘/’ for Max OS X and Linux) and is

appropriate only for local file systems, not remote Internet resources that expect only ‘/’.

remotePath = url + remoteDir

print("Reading %s" % remotePath)

Next the urlopen() function is used and in this case, because the target is a directory,

gets the listing of the file names at the remote location.

response = urllib2.urlopen(remotePath)

data = response.read()

Some empty lists are initialised, which will contain all the file information. The remote

data that is read is one long test string containing all the file names, which we split on the

newline character to give a list of lines.

fileNames = []

filePaths = []

chromosomeDirs = []

lines = data.split('\n')

Then each line from the remote read is considered in a loop, stripping off any

whitespace and skipping blank lines:

for line in lines:

line = line.strip()

if not line:

continue

The actual file name is the last element of an array ([-1]) when we split the line into a

list according to its whitespace. Then if the entry begins with the letters ‘CHR’ it is

deemed to be a sub-directory for individual chromosomes that we need to look into to get

the sequence files, and this chromosome directory is added to the chromosomeDirs list.

fileName = line.split()[-1]

if fileName.startswith('CHR'):

chromosomeDirs.append(fileName + '/')

Otherwise, if the file is not a chromosome sub-directory, a check is made to see if it is a

sequence file with the prescribed ending. Here we use the imported fnmatch() function to

see if the file name contains the pattern specified in the input arguments. This pattern

defaults to ‘*.fna’, which means that the name matches if it ends in .fna; the ‘*’ matches

any number of characters in the file name before the stated ending.

elif fnmatch(fileName, fileType):

fileNames.append(fileName)

continue

We then get the full locations of the individual remote files and download them, to a

local file of the same name, using the downloadFile() function defined above. Here we do

use os.path.join() to make a local file name which is consistent with the operating system

we are using. The os.path.abspath() is used to get the full, long file name, with directories

relative to the root of the local file system: this resolves relative directory specifications

like ‘../’, removes redundancies like ‘dirA/./dirB’ and converts any slashes to the right

kind for the current operating system.

for fileName in fileNames:

filePath = os.path.join(localDir, fileName)

filePath = os.path.abspath(filePath)

filePaths.append(filePath)

downloadFile(url + remoteDir + fileName, filePath)

Given that reading the remote directory may have found some sub-directories for

individual chromosomes we then repeat the whole operation on each sub-directory by

calling the downloadGenomeFiles() function from inside itself, i.e. recursively. This will

then go in to each of those other locations and potentially add more matching files to the

current list. Note that only the first, main function call will give back files to the user, and

all the recursive calls will be absorbed into the filePaths list at this point.

for chromosomeDir in chromosomeDirs:

subDir = remoteDir + chromosomeDir

filePaths += downloadGenomeFiles(subDir, localDir, fileType, url)

return filePaths

At the end the function returns a list of the file paths that were actually saved. We can

then test the function on the human genome at the NCBI site used previously. Note that

this may take a significant time to download, so only do it for real if you really want all of

the human genome data!

filePaths = downloadGenomeFiles('H_sapiens','examples','hs_ref*.fa.gz')

One further convenience function extracts any compressed GZIP archives (ending in

‘.gz’) so they can be used locally without hindrance. This function simply uses standard

Python modules to create open file objects for compressed formats. These can be parsed,

line by line, in much the same way as a regular file. Here the example simply writes out

the uncompressed lines one by one into a new file, but if there is enough memory the

whole file could be read in one go (using .read()). Also, the gzip library could be used to

read compressed files directly inside analysis functions, thus avoiding having to store

uncompressed files. Note that there are also standard Python modules to handle other

compression formats like ‘ZIP’ and ‘BZIP2’; zipfile and bz2.

import gzip

def uncompressGzFile(fileName):

if fileName.endswith('.gz'):

inFileObj = gzip.open(fileName, 'rb')

fileName = fileName[:-3]

outFileObj = open(fileName, 'w')

for line in inFileObj:

outFileObj.write(line)

# Faster alternative, given sufficient memory:

# outFileObj.write(infileObj.read())

inFileObj.close()

outFileObj.close()

return fileName

Next we use an external program that will do the alignment to map the short sequence

reads to a genome sequence. Popular open-source software choices for this at the time of

writing are BOWTIE

and BWA,

and we will illustrate the use of the former by

wrapping with a convenient Python function, in a similar manner to what was done with

BLAST and ClustalW examples in earlier chapters.

Downloading genome sequence data is usually a simple matter, as described above, of

accessing an on-line repository. The next step is naturally to consider aligning some short-

read data to the genome sequence. For demonstration purposes, there are lots of high-

throughput sequencing data sets that are available via on-line services like the Gene

Expression Omnibus,

and which can be used with the scripts we describe below. The

actual genome alignment in these examples will be done by the program called Bowtie,

which is just the open-source example we happen to have chosen for this chapter. Python

functions will be illustrated which wrap this external program so that we can use it in

larger programs and automated pipeline scripts. Hence the purpose of the examples is to

easily use existing (tested and efficient) programs, rather than doing everything from

scratch in Python, which would be slower to run and take a long time to describe.

Download 7,75 Mb.

Do'stlaringiz bilan baham:

1 ... 254 255 256 257 258 259 260 261 ... 514