Python Programming for Biology: Bioinformatics and Beyond



Date: 30.12.2021
[Tim J. Stevens, Wayne Boucher] Python Programming

Column-delimited formats

Next in this chapter we will look at making new file formats to work with your programs. In general, however, you might consider avoiding this entirely. If there is already a well-defined standard that is used for a particular kind of data, like FASTA for sequences or PDB (or more recently mmCIF) for molecular structures, then that should be the first choice, especially if you want other people or programs to understand your data. Also, for an arbitrary set of data you could use an existing standardised system like XML and benefit from the large number of available Python modules to deal with it.

Nonetheless, there are occasional situations where it makes sense to read and write data in a custom format, especially where the data is fairly simple and the files will only be used in a limited, perhaps internal, set of situations. As an example we will choose a simple file format that is easy to read, easy to write and easy for a human being to understand. It consists of an initial header line that states what the various items of data represent, followed by one line for each of the data elements in a list. On each data line we will use a piece of text (commonly a single character like a space, tab or comma) to separate, or delimit, the various items on that line. Note that we should choose the separator string carefully, so that it is not something that will occur in the data and disrupt the delineation of different items.
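To see why the choice of separator matters, here is a minimal sketch. The row of values is made up purely for illustration. One item contains a comma, so a comma separator yields an extra field when the line is split again, whereas a tab does not collide with the data.

```python
# Hypothetical row: the second item itself contains a comma
row = ['chr1', 'gene A, isoform 2', '0.379']

# Joining with a comma makes the line ambiguous when it is read back
commaLine = ','.join(row)
commaFields = commaLine.split(',')
print(len(commaFields))  # 4 fields recovered, although only 3 were written

# A tab does not occur in the data, so the items survive the round trip
tabLine = '\t'.join(row)
tabFields = tabLine.split('\t')
print(tabFields == row)  # True
```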

In the function below, to write out the data we first create a header line, to indicate what each of the fields represents, and then loop through the list of data to create the remaining lines of the file. There is a check to make sure the heading list is the same size as the first item of data. The heading line is formed using the .join() method of the separator string, which combines all the headings into one text string; this is then written out to the file followed by a newline character. For the data lines the separator joins the formats variable to create a single one-line format, which is the template that says how to convert each row of data into the appropriate line of text, where each item has the correct numerical precision, padding etc. The actual data lines are created from a tuple of each row via the ‘%’ formatting operator and are written out with a newline character.

def writeListFile(fileName, data, headings, formats, separator='\t'):

    # The headings must match the number of items in a row
    if len(data[0]) != len(headings):
        print("Headings length does not match input list")
        return

    fileObj = open(fileName, 'w')

    # Header line: the heading names joined by the separator
    line = separator.join(headings)
    fileObj.write('%s\n' % line)

    # One template for a whole data line
    format = separator.join(formats)
    for row in data:
        line = format % tuple(row)
        fileObj.write('%s\n' % line)

    fileObj.close()
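The way the one-line template is built and applied can be seen in isolation with a short sketch. The row values here are invented for illustration and are not taken from any real file.

```python
separator = '\t'
formats = ['%s', '%d', '%d', '%.3f']

# Join the per-item formats into one template for a whole line
template = separator.join(formats)

# Hypothetical row of data; the tuple supplies one value per format code
row = ['chr1', 100, 250, 0.37912]
line = template % tuple(row)

print(repr(template))  # '%s\t%d\t%d\t%.3f'
print(repr(line))      # 'chr1\t100\t250\t0.379'
```

Note that the ‘%’ operator needs exactly one value for each format code, which is why the row is converted to a tuple of the same length as the formats list.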

To create a specific type of file using this general function, the headings, formats and separator can be fixed inside a wrapper function, so that they are invariant for that type of file. For example, here is a file format specification which uses four items (a string, two integers and a floating point number) on each line, separated by spaces:




def writeChromosomeRegions(fileName, data):

    headings = ['chromo', 'start', 'end', 'value']
    formats = ['%s', '%d', '%d', '%.3f']
    writeListFile(fileName, data, headings, formats, ' ')

This could produce something like:

chromo start end value
chr1 195612601 196518584 0.379
chr1 52408393 196590488 0.361
chr1 193237929 196783789 0.473
chr1 181373059 6104731 0.104
chr2 7015693 7539562 0.508
chr2 9097449 9108209 0.302

The equivalent functions for reading our files are fairly simple. We just need to skip the first line, assuming of course that we already know what the data represents, and then loop through the remainder of the lines. For each data line we remove the trailing newline character with .rstrip(), split the line according to the specified separator (again defaulting to a tab) and put the resulting list as an entry in the larger list, dataList, which is returned at the end of the function. Note that because the values read from the file are just text, we need to convert anything which should not remain a Python string, like numbers or True/False values. This is illustrated below by the use of the converters argument, which contains a list of functions (int, float etc.) to transform the text from the file in the appropriate way. If a conversion is not required for an item then the list simply contains None at that position.

def readListFile(fileName, converters, separator='\t'):

    dataList = []
    # Plain 'r' mode is used: the 'U' flag was removed in Python 3.11
    fileObj = open(fileName, 'r')
    header = fileObj.readline()  # Extract first line

    for line in fileObj:  # Loop through remaining lines
        # Remove the line ending only, so trailing separators are kept
        line = line.rstrip('\r\n')
        data = line.split(separator)

        for index, datum in enumerate(data):
            convertFunc = converters[index]
            if convertFunc:
                data[index] = convertFunc(datum)

        dataList.append(data)

    fileObj.close()

    return dataList
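Converting True/False values needs a little care, because the built-in bool() only tests whether a string is empty. The following is a minimal sketch of a converter that could be placed in the converters list. The name parseBool and the sample line are hypothetical and are not part of the functions above.

```python
def parseBool(text):
    """Hypothetical converter: interpret the text 'True' or 'False'."""
    value = text.strip().lower()
    if value == 'true':
        return True
    if value == 'false':
        return False
    raise ValueError('Not a True/False value: %r' % text)

# bool() is True for any non-empty string, so it cannot be used directly
print(bool('False'))  # True

# Same conversion loop as in readListFile(), applied to one made-up line
converters = [None, int, float, parseBool]
fields = 'chr1\t42\t0.5\tFalse'.split('\t')

for index, datum in enumerate(fields):
    convertFunc = converters[index]
    if convertFunc:
        fields[index] = convertFunc(datum)

print(fields)  # ['chr1', 42, 0.5, False]
```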

We can then use this general file-reading function to make something more specific, i.e. by defining a separator and conversion functions appropriate to a particular job. In the example below we use a space as the separator, leave the first value as text, convert the second and third values to integers and convert the fourth to a floating point value, so that we can read files made with writeChromosomeRegions().

def readChromosomeRegions(fileName):

    converters = [None, int, int, float]
    dataList = readListFile(fileName, converters, ' ')

    return dataList

There is a standard Python module, called ‘csv’ (after Comma Separated Values), which will do most of the above handling of delimited text files. Unfortunately, files must be opened for it in different ways in Python 2 and Python 3. In Python 2 the binary, ‘b’, flag is used to open the file, and in Python 3 an extra newline argument is used instead. The underlying reason for the complication is that the csv module is designed to cope with new lines being present in the middle of an item of data. Hence, the module does not read or write the data with the standard line-by-line method and makes a separate assessment about how to split the data into rows.
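The effect of a new line inside an item can be illustrated with a small sketch, assuming Python 3 and using an in-memory io.StringIO object in place of a real file. The text content is made up for the example.

```python
import csv
import io

# One quoted item contains a line break; the values are invented
text = 'name,comment\r\ngeneA,"first line\nsecond line"\r\n'

# Naive line-by-line splitting breaks the second record in two
naiveLines = text.splitlines()
print(len(naiveLines))  # 3

# csv.reader treats the quoted line break as part of the item
reader = csv.reader(io.StringIO(text, newline=''))
rows = list(reader)
print(len(rows))  # 2
print(rows[1])    # ['geneA', 'first line\nsecond line']
```

Passing newline='' stops Python from translating the line endings itself, leaving the csv module to decide where each row ends.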

To deal with all of this we have created a small function that distinguishes between Python 2 and Python 3 using sys.version_info.major (which gives the value 2 or 3 for the respective versions) and uses the csv module in the correct way. The construction of the function is similar to writeListFile(), where we write a header and then the data lines, but here the actual writing is done using the writerow() method of a csv.writer object. Note also that what we have called the data separator, the csv functions call the delimiter.

import csv
import sys

def writeCsvFile(fileName, data, headings, separator='\t'):

    if sys.version_info.major > 2:
        fileObj = open(fileName, 'w', newline='')
    else:
        fileObj = open(fileName, 'wb')

    writer = csv.writer(fileObj, delimiter=separator)
    writer.writerow(headings)

    for row in data:
        writer.writerow(row)

    fileObj.close()

There are also complications due to the Python version in our csv reader function readCsvFile(), but the file reading itself is fairly straightforward. We simply create a csv.reader object that can be looped through, row by row. We ignore the first (index zero) header row and convert the text to the required data types in the same way we did in readListFile().

def readCsvFile(fileName, converters, separator='\t'):

    dataList = []
    if sys.version_info.major > 2:
        fileObj = open(fileName, 'r', newline='')
    else:
        fileObj = open(fileName, 'rb')

    reader = csv.reader(fileObj, delimiter=separator)

    for n, row in enumerate(reader):
        if n > 0:  # n = 0 is the header, which we ignore
            for index, datum in enumerate(row):
                convertFunc = converters[index]
                if convertFunc:
                    row[index] = convertFunc(datum)

            dataList.append(row)

    fileObj.close()

    return dataList
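If only Python 3 needs to be supported, the version test can be dropped. The following is a sketch under that assumption, not a replacement for the functions above. The names writeCsvFilePy3 and readCsvFilePy3 are hypothetical, the data values are made up, and the with statement is used so that the files are closed automatically.

```python
import csv
import os
import tempfile

def writeCsvFilePy3(fileName, data, headings, separator='\t'):
    # newline='' lets the csv module control the line endings itself
    with open(fileName, 'w', newline='') as fileObj:
        writer = csv.writer(fileObj, delimiter=separator)
        writer.writerow(headings)
        writer.writerows(data)

def readCsvFilePy3(fileName, converters, separator='\t'):
    dataList = []
    with open(fileName, 'r', newline='') as fileObj:
        reader = csv.reader(fileObj, delimiter=separator)
        next(reader, None)  # Skip the header row
        for row in reader:
            # Apply each converter to its item; None leaves the text as it is
            dataList.append([func(datum) if func else datum
                             for func, datum in zip(converters, row)])
    return dataList

# Round trip through a temporary file with made-up values
regions = [['chr1', 100, 250, 0.379], ['chr2', 7, 90, 0.5]]
path = os.path.join(tempfile.mkdtemp(), 'regions.tsv')
writeCsvFilePy3(path, regions, ['chromo', 'start', 'end', 'value'])
result = readCsvFilePy3(path, [None, int, int, float])
print(result == regions)  # True
```

One difference from readCsvFile() is that zip() silently ignores any items beyond the length of the converters list, rather than raising an error.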

