Python Programming for Biology: Bioinformatics and Beyond


Mapping sequences to a genome



Download 7,75 Mb.
Pdf ko'rish
bet258/514
Sana30.12.2021
Hajmi7,75 Mb.
#91066
1   ...   254   255   256   257   258   259   260   261   ...   514
Bog'liq
[Tim J. Stevens, Wayne Boucher] Python Programming

Mapping sequences to a genome

In this next section we illustrate how we can use Python to control the process of mapping

large numbers of relatively short DNA sequence reads to a known genome sequence. For

the intensive calculations we will be relying on an existing external program that is quick

and well tested.

Obtaining genome sequences

While  we  can  obtain  genome  sequences  using  a  web  browser,  we  may  sometimes  also

wish  to  automate  the  process,  which  can  naturally  be  done  with  Python  scripts.  For  this

example we will be illustrating how to get data from the NCBI,

8

but hopefully it is clear



how  the  procedure  would  be  adapted  to  use  other  databases.  The  following  downloads

genome  sequence  information,  in  the  simple  FASTA  format,  in  a  similar  way  to  the

example  used  earlier  to  download  Protein  Data  Bank  (PDB)  files  containing

macromolecular structure data.

On-line resources change, so naturally the examples have been tuned to work with the

way  that  genomes  can  be  accessed  at  the  time  of  writing.  In  the  future,  however,  the

various Internet addresses we use will probably require adjustment. Here we use the File

Transfer Protocol (FTP) service available at the following URL:

FTP_ROOT = 'ftp://ftp.ncbi.nlm.nih.gov/genomes'



And  as  an  example  we  illustrate  the  downloading  of  the  following  E.  coli  bacterium

genome data (and naturally we checked with a web browser to see the file names of the

sequences that were available):

GENOME = '/Bacteria/Escherichia_coli_536_uid58531/NC_008253.fna'

The  equivalent  for  a  human  chromosome  sequence  would  use  something  like  the

following, noting that there is a sub-directory to specify the chromosome:

GENOME = '/H_sapiens/CHR_10/hs_ref_GRCh8_chr10.fa.gz'

We  will  define  a  simple  Python  function  downloadFile()that  may  be  built  into  more

automated  scripts  to  fetch  and  save  the  file.  For  our  example  the  file  will  be  a  FASTA-

format  sequence.  Firstly  a  few  imports  are  made.  The  fnmatch  module  provides  a

filename-matching function that makes it easy to choose particular file names that have a

particular  file  extension  (an  ending  like  .fna  or  .fasta),  as  an  alternative  to  using  regular

expressions (the re module). The urlopen() function will handle all of the communication

with the remote data server. In Python 3 it is in the urllib.request module and in Python2 it

is in the urllib2 module, so we first try to import the former and if that fails try the latter:

import os

from fnmatch import fnmatch

try:


# Python 3

from urllib.request import urlopen

except ImportError:

# Python 2

from urllib2 import urlopen

The download function itself takes two arguments, the URL of the remote file we wish

to acquire and the file path (file name and directory) of where we save the file locally. In

the function we print a message to give the user some information on what is happening,

open  a  connection  to  the  remote  location  with  urlopen()  and  read  the  contents  into  the

variable data.

def downloadFile(remoteFile, localFile):

print('Downloading to %s …' % localFile)

response = urlopen(remoteFile)

data = response.read()

Then at the end we open the local output file for writing in binary mode.

9

Note that we



are  being  lazy  here  and  don’t  make  any  checks  to  ensure  that  we  are  not  overwriting  an

existing file, but this would be easily handled using os.path.exists().

fileObj = open(localFile, 'b')

fileObj.write(data)

fileObj.close()

print(' …done')




This  function  is  simply  tested  with  the  following  demonstration  code  for  the  E.  coli

genome file mentioned above (saving the file in the current working directory):

remoteGenome =

FTP_ROOT+'/Bacteria/Escherichia_coli_536_uid58531/NC_008253.fna'

downloadFile(remoteGenome, 'EcoliGenome.fasta')

Moving  on  from  fetching  single  files,  the  next  example  downloads  several  files  with

related  locations.  For  example,  the  files  may  contain  the  different  human  chromosome

sequences, but are housed in different sub-directories (as is the case for the NCBI’s FTP

site).

The  downloadGenomeFiles()  function  takes  a  remote  directory  and  a  local  directory,



rather  than  using  a  single  specific  file  name,  and  searches  for  all  the  files  that  end  a

particular way (by default with ‘fna’). Also, in this example we have separated the main

(base) location of the remote site into a variable called url, given that this is not expected

to change much and we can avoid always having to specify the full location of the remote

data. After the initial definition line some tidying is done on the input remoteDir to ensure

that it begins and ends with a ‘/’ character, which is what is expected later in the function.

def downloadGenomeFiles(remoteDir, localDir, fileType='*.fna',

url=FTP_ROOT):

if remoteDir[0] != '/':

remoteDir = '/' + remoteDir

if remoteDir[-1] != '/':

remoteDir = remoteDir + '/'

The full remote path is then the combination of the base URL and the variable remote

path. Note that here we do not use os.path.join(), because this would use different slashes

under different operating systems (‘\’ for Windows, ‘/’  for  Max  OS  X  and  Linux)  and  is

appropriate only for local file systems, not remote Internet resources that expect only ‘/’.

remotePath = url + remoteDir

print("Reading %s" % remotePath)

Next  the  urlopen()  function  is  used  and  in  this  case,  because  the  target  is  a  directory,

gets the listing of the file names at the remote location.

response = urllib2.urlopen(remotePath)

data = response.read()

Some empty lists are initialised, which will contain all the file information. The remote

data that is read is one long test string containing all the file names, which we split on the

newline character to give a list of lines.

fileNames = []

filePaths = []

chromosomeDirs = []

lines = data.split('\n')

Then  each  line  from  the  remote  read  is  considered  in  a  loop,  stripping  off  any




whitespace and skipping blank lines:

for line in lines:

line = line.strip()

if not line:

continue

The actual file name is the last element of an array ([-1]) when we split the line into a

list  according  to  its  whitespace.  Then  if  the  entry  begins  with  the  letters  ‘CHR’  it  is

deemed to be a sub-directory for individual chromosomes that we need to look into to get

the sequence files, and this chromosome directory is added to the chromosomeDirs list.

fileName = line.split()[-1]

if fileName.startswith('CHR'):

chromosomeDirs.append(fileName + '/')

Otherwise, if the file is not a chromosome sub-directory, a check is made to see if it is a

sequence file with the prescribed ending. Here we use the imported fnmatch() function to

see  if  the  file  name  contains  the  pattern  specified  in  the  input  arguments.  This  pattern

defaults to ‘*.fna’, which means that the name matches if it ends in .fna; the ‘*’ matches

any number of characters in the file name before the stated ending.

elif fnmatch(fileName, fileType):

fileNames.append(fileName)

continue


We  then  get  the  full  locations  of  the  individual  remote  files  and  download  them,  to  a

local file of the same name, using the downloadFile() function defined above. Here we do

use os.path.join() to make a local file name which is consistent with the operating system

we are using. The os.path.abspath() is used to get the full, long file name, with directories

relative  to  the  root  of  the  local  file  system:  this  resolves  relative  directory  specifications

like  ‘../’,  removes  redundancies  like  ‘dirA/./dirB’  and  converts  any  slashes  to  the  right

kind for the current operating system.

for fileName in fileNames:

filePath = os.path.join(localDir, fileName)

filePath = os.path.abspath(filePath)

filePaths.append(filePath)

downloadFile(url + remoteDir + fileName, filePath)

Given  that  reading  the  remote  directory  may  have  found  some  sub-directories  for

individual  chromosomes  we  then  repeat  the  whole  operation  on  each  sub-directory  by

calling the downloadGenomeFiles() function from inside itself, i.e. recursively. This will

then go in to each of those other locations and potentially add more matching files to the

current list. Note that only the first, main function call will give back files to the user, and

all the recursive calls will be absorbed into the filePaths list at this point.

for chromosomeDir in chromosomeDirs:

subDir = remoteDir + chromosomeDir

filePaths += downloadGenomeFiles(subDir, localDir, fileType, url)



return filePaths

At the end the function returns a list of the file paths that were actually saved. We can

then  test  the  function  on  the  human  genome  at  the  NCBI  site  used  previously.  Note  that

this may take a significant time to download, so only do it for real if you really want all of

the human genome data!

filePaths = downloadGenomeFiles('H_sapiens','examples','hs_ref*.fa.gz')

One  further  convenience  function  extracts  any  compressed  GZIP  archives  (ending  in

‘.gz’)  so  they  can  be  used  locally  without  hindrance.  This  function  simply  uses  standard

Python modules to create open file objects for compressed formats. These can be parsed,

line by line, in much the same way as a regular file. Here the example simply writes out

the  uncompressed  lines  one  by  one  into  a  new  file,  but  if  there  is  enough  memory  the

whole file could be read in one go (using .read()). Also, the gzip library could be used to

read  compressed  files  directly  inside  analysis  functions,  thus  avoiding  having  to  store

uncompressed  files.  Note  that  there  are  also  standard  Python  modules  to  handle  other

compression formats like ‘ZIP’ and ‘BZIP2’; zipfile and bz2.

import gzip

def uncompressGzFile(fileName):

if fileName.endswith('.gz'):

inFileObj = gzip.open(fileName, 'rb')

fileName = fileName[:-3]

outFileObj = open(fileName, 'w')

for line in inFileObj:

outFileObj.write(line)

# Faster alternative, given sufficient memory:

# outFileObj.write(infileObj.read())

inFileObj.close()

outFileObj.close()

return fileName

Next we use an external program that will do the alignment to map the short sequence

reads to a genome sequence. Popular open-source software choices for this at the time of

writing  are  BOWTIE

10

 and  BWA,



11

 and  we  will  illustrate  the  use  of  the  former  by

wrapping with a convenient Python function, in a similar manner to what was done with

BLAST and ClustalW examples in earlier chapters.

Downloading genome sequence data is usually a simple matter, as described above, of

accessing an on-line repository. The next step is naturally to consider aligning some short-

read  data  to  the  genome  sequence.  For  demonstration  purposes,  there  are  lots  of  high-

throughput  sequencing  data  sets  that  are  available  via  on-line  services  like  the  Gene

Expression  Omnibus,

12

 and  which  can  be  used  with  the  scripts  we  describe  below.  The




actual genome alignment in these examples will be done by the program called Bowtie,

13

which is just the open-source example we happen to have chosen for this chapter. Python



functions  will  be  illustrated  which  wrap  this  external  program  so  that  we  can  use  it  in

larger programs and automated pipeline scripts. Hence the purpose of the examples is to

easily  use  existing  (tested  and  efficient)  programs,  rather  than  doing  everything  from

scratch in Python, which would be slower to run and take a long time to describe.




Download 7,75 Mb.

Do'stlaringiz bilan baham:
1   ...   254   255   256   257   258   259   260   261   ...   514




Ma'lumotlar bazasi mualliflik huquqi bilan himoyalangan ©hozir.org 2024
ma'muriyatiga murojaat qiling

kiriting | ro'yxatdan o'tish
    Bosh sahifa
юртда тантана
Боғда битган
Бугун юртда
Эшитганлар жилманглар
Эшитмадим деманглар
битган бодомлар
Yangiariq tumani
qitish marakazi
Raqamli texnologiyalar
ilishida muhokamadan
tasdiqqa tavsiya
tavsiya etilgan
iqtisodiyot kafedrasi
steiermarkischen landesregierung
asarlaringizni yuboring
o'zingizning asarlaringizni
Iltimos faqat
faqat o'zingizning
steierm rkischen
landesregierung fachabteilung
rkischen landesregierung
hamshira loyihasi
loyihasi mavsum
faolyatining oqibatlari
asosiy adabiyotlar
fakulteti ahborot
ahborot havfsizligi
havfsizligi kafedrasi
fanidan bo’yicha
fakulteti iqtisodiyot
boshqaruv fakulteti
chiqarishda boshqaruv
ishlab chiqarishda
iqtisodiyot fakultet
multiservis tarmoqlari
fanidan asosiy
Uzbek fanidan
mavzulari potok
asosidagi multiservis
'aliyyil a'ziym
billahil 'aliyyil
illaa billahil
quvvata illaa
falah' deganida
Kompyuter savodxonligi
bo’yicha mustaqil
'alal falah'
Hayya 'alal
'alas soloh
Hayya 'alas
mavsum boyicha


yuklab olish