Python Programming for Biology: Bioinformatics and Beyond

[Tim J. Stevens, Wayne Boucher] Python Programming

Column-delimited formats

Next in this chapter we will look at making new file formats to work with your programs. In general, however, you might consider avoiding this entirely. If there is already a well-defined standard that is used for a particular kind of data, like FASTA for sequences or PDB (or more recently mmCIF) for molecular structures, then that should be the first choice, especially if you want other people or programs to understand your data. Also, for an arbitrary set of data you could use an existing standardised system like XML and benefit from the large number of available Python modules to deal with it.

Nonetheless, there are occasional situations where there is a genuine need to read and write data in a custom format, especially where the data is fairly simple and the files will only be used in a limited, perhaps internal, set of situations. As an example we will choose a simple file format, which is easy to read and write, and easy for a human being to understand. This will consist of an initial header line that states what the various items of data represent and then subsequent lines, one for each of the data elements in a list. On each data line we will use a piece of text (commonly a single character like a space, tab or comma) to separate or delimit the various items on that line. Note that we should choose the separator string carefully so that it is not something that will be contained in the data, where it would disrupt the delineation of different items.
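The point about choosing the separator can be illustrated with a small sketch. The record below is hypothetical (it is not taken from any real file); its second item happens to contain a comma, so a comma-separated line cannot be split back into the original three items, whereas a tab-separated line can.

```python
# Hypothetical record: the second item contains a comma
row = ['geneA', 'kinase, putative', '42']

commaLine = ','.join(row)
tabLine = '\t'.join(row)

# Splitting on a comma gives four items, not the original three
commaItems = commaLine.split(',')

# Splitting on a tab recovers the original three items
tabItems = tabLine.split('\t')

print(len(commaItems), commaItems)
print(len(tabItems), tabItems)
```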

In the function below, to write out the data we first create a header line, to indicate what each of the fields represents, and then loop through the list of data to create the remaining lines of the file. There is a check to make sure the heading list is the same size as the first item of data; if it is not, a message is printed and nothing is written. The heading line is formed using the .join() method of the separator string. This combines all the headings into one text string, which is then written out to the file followed by a newline character. For the data lines the separator joins the formats variable to create a single one-line format, which will be the template to say how to convert each row of data into the appropriate line of text, where each item has the correct numerical precision, padding etc. The actual data lines are created from a tuple of each row via the ‘%’ formatting operator and are written out with a newline character.

def writeListFile(fileName, data, headings, formats, separator='\t'):

    # Check the headings against the first row of data
    if len(data[0]) != len(headings):
        print("Headings length does not match input list")
        return

    fileObj = open(fileName, 'w')

    # Header line: heading names joined by the separator
    line = separator.join(headings)
    fileObj.write('%s\n' % line)

    # One-line template used to format every data row
    lineFormat = separator.join(formats)

    for row in data:
        line = lineFormat % tuple(row)
        fileObj.write('%s\n' % line)

    fileObj.close()
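To make the template step concrete, the following sketch builds the header line and the one-line format for a tab separator and applies the format to a single row. The row values are illustrative only; the variable names headerLine, lineFormat and dataLine are chosen for this example.

```python
headings = ['chromo', 'start', 'end', 'value']
formats = ['%s', '%d', '%d', '%.3f']
separator = '\t'

# Joining with the separator gives the header and the row template
headerLine = separator.join(headings)
lineFormat = separator.join(formats)

# Illustrative row; the float is rounded to three decimal places
row = ['chr1', 195612601, 196518584, 0.3791]
dataLine = lineFormat % tuple(row)

print(repr(headerLine))
print(repr(lineFormat))
print(repr(dataLine))
```

Note that the ‘%’ operator needs a tuple here: passing the row as a list would be treated as a single value and raise an error, because the template expects four values.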

To create a specific type of file using this general function, the headings, formats and separator can be fixed inside a wrapper function, so that they are invariant for that type of file. For example, here is a file format specification which uses four items (a string, two integers and a floating point number) on a line, separated by spaces:

def writeChromosomeRegions(fileName, data):

    headings = ['chromo', 'start', 'end', 'value']
    formats = ['%s', '%d', '%d', '%.3f']

    writeListFile(fileName, data, headings, formats, ' ')

This could produce something like:

chromo start end value
chr1 195612601 196518584 0.379
chr1 52408393 196590488 0.361
chr1 193237929 196783789 0.473
chr1 181373059 6104731 0.104
chr2 7015693 7539562 0.508
chr2 9097449 9108209 0.302

The equivalent functions for reading our files are fairly simple. We just need to skip the first line, assuming of course we already know what the data represents, and then loop through the remainder of the lines. For the data lines we remove the trailing newline character with .rstrip() and split them according to the specified separator, again defaulting to a tab, and put the resulting list as an entry in the larger list, dataList, which is returned at the end of the function. Note that because the values read from the files are just text characters we need to appropriately convert anything which should not remain a Python string, like numbers or True/False values. This is illustrated below by the use of the converters argument, which contains a list of functions (int, float etc.) to transform the text from the file in the appropriate way. If a conversion is not required for an item then the list simply contains None.

def readListFile(fileName, converters, separator='\t'):

    dataList = []
    fileObj = open(fileName, 'r')  # The old 'rU' mode was removed in Python 3.11

    header = fileObj.readline()  # Extract first line

    for line in fileObj:  # Loop through remaining lines
        line = line.rstrip()
        data = line.split(separator)

        for index, datum in enumerate(data):
            convertFunc = converters[index]

            if convertFunc:
                data[index] = convertFunc(datum)

        dataList.append(data)

    fileObj.close()

    return dataList
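The conversion step can be tried on a single line of text without opening a file. The sketch below uses one of the sample lines shown earlier. It also illustrates a caveat that is an observation about Python's string methods rather than something stated in the original text: .rstrip() with no argument removes all trailing whitespace, so with a tab separator an empty final item would be lost, whereas .rstrip('\n') removes only the newline.

```python
converters = [None, int, int, float]
line = 'chr2 7015693 7539562 0.508\n'

# Split the line and convert each item with the matching function
fields = line.rstrip('\n').split(' ')
for index, datum in enumerate(fields):
    convertFunc = converters[index]
    if convertFunc:
        fields[index] = convertFunc(datum)

print(fields)

# Caveat: a tab-separated line whose last item is empty
tabLine = 'chr2\t7015693\t\n'
strippedAll = tabLine.rstrip().split('\t')       # Trailing tab is removed too
strippedNewline = tabLine.rstrip('\n').split('\t')  # Empty last item is kept

print(strippedAll)
print(strippedNewline)
```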

We can then use this general file-reading function to make something more specific, i.e. by defining a separator and conversion functions appropriate to a particular job. In the example below we use a space as a separator and leave the first value as text, convert the second and third values to integers and convert the fourth to a floating point value, so that we could read files made with writeChromosomeRegions().

def readChromosomeRegions(fileName):

    converters = [None, int, int, float]
    dataList = readListFile(fileName, converters, ' ')

    return dataList

There is a standard Python module, called ‘csv’ (after Comma Separated Value), which will do most of the above handling of delimited text files. Unfortunately, the files it works with must be opened differently in Python 2 and Python 3. In Python 2 the binary, ‘b’, flag is used to open the file, and in Python 3 an extra newline argument is used instead. The underlying reason for the complication is that the csv module is designed to cope with newlines being present in the middle of an item of data. Hence, the module does not read or write the data with the standard line-by-line method and makes a separate assessment about how to split the data into rows.
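The effect of a newline inside an item can be seen in a short sketch. This assumes Python 3 and uses an in-memory io.StringIO buffer in place of a real file; the two rows are made up for the illustration. Splitting the written text line by line breaks the second row in two, while csv.reader reassembles it.

```python
import csv
import io

# Made-up rows; the second row has a newline inside one item
rows = [['id', 'note'],
        ['seq1', 'first line\nsecond line']]

# newline='' stops the buffer translating the line endings csv writes
buffer = io.StringIO(newline='')
writer = csv.writer(buffer)
writer.writerows(rows)
text = buffer.getvalue()

# Naive line-by-line splitting sees three lines for two rows
naiveLines = text.splitlines()

# csv.reader keeps the quoted item, newline included, in one row
buffer.seek(0)
parsedRows = list(csv.reader(buffer))

print(len(naiveLines))
print(parsedRows)
```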

To deal with all of this we have created a small function that can distinguish between Python 2 and Python 3 using sys.version_info.major (which gives the value 2 or 3 for the respective versions) and use the csv module in the correct way. The construction of the function is similar to writeListFile(), where we write a header and then the data lines, but here the actual writing is done using the writerow() method of a csv.writer object. Also, it is notable that what we have called the data separator, the csv functions call the delimiter.

import csv
import sys

def writeCsvFile(fileName, data, headings, separator='\t'):

    if sys.version_info.major > 2:
        fileObj = open(fileName, 'w', newline='')
    else:
        fileObj = open(fileName, 'wb')

    writer = csv.writer(fileObj, delimiter=separator)
    writer.writerow(headings)

    for row in data:
        writer.writerow(row)

    fileObj.close()

There are also complications due to the Python version in our csv reader function readCsvFile(), but the file reading itself is fairly straightforward. We simply create a csv.reader object that can be looped through, row by row. We ignore the first (index zero) header row and convert the text to the required data types in the same way we did in readListFile().

def readCsvFile(fileName, converters, separator='\t'):

    dataList = []

    if sys.version_info.major > 2:
        fileObj = open(fileName, 'r', newline='')
    else:
        fileObj = open(fileName, 'rb')

    reader = csv.reader(fileObj, delimiter=separator)

    for n, row in enumerate(reader):
        if n > 0:  # n = 0 is the header, which we ignore

            for index, datum in enumerate(row):
                convertFunc = converters[index]

                if convertFunc:
                    row[index] = convertFunc(datum)

            dataList.append(row)

    fileObj.close()

    return dataList
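If only Python 3 needs to be supported, the version check can be dropped. The following is a minimal sketch under that assumption, not a replacement for the functions above: writeCsvRows() and readCsvRows() are hypothetical names chosen for this example, the file is written to a temporary directory, and the two rows are taken from the sample output shown earlier. It writes the rows, reads them back with the converters and checks that the original values are recovered.

```python
import csv
import os
import tempfile

def writeCsvRows(fileName, data, headings, separator='\t'):
    # Python 3 only: newline='' lets the csv module control line endings
    with open(fileName, 'w', newline='') as fileObj:
        writer = csv.writer(fileObj, delimiter=separator)
        writer.writerow(headings)
        writer.writerows(data)

def readCsvRows(fileName, converters, separator='\t'):
    dataList = []
    with open(fileName, 'r', newline='') as fileObj:
        reader = csv.reader(fileObj, delimiter=separator)
        next(reader, None)  # Skip the header row, if there is one
        for row in reader:
            for index, datum in enumerate(row):
                convertFunc = converters[index]
                if convertFunc:
                    row[index] = convertFunc(datum)
            dataList.append(row)
    return dataList

headings = ['chromo', 'start', 'end', 'value']
regions = [['chr1', 195612601, 196518584, 0.379],
           ['chr2', 7015693, 7539562, 0.508]]

# Round trip through a temporary file, which is removed afterwards
tempDir = tempfile.mkdtemp()
filePath = os.path.join(tempDir, 'regions.tsv')
writeCsvRows(filePath, regions, headings)
regionsRead = readCsvRows(filePath, [None, int, int, float])
os.remove(filePath)
os.rmdir(tempDir)

print(regionsRead == regions)
```

The with statement closes the file even if an error occurs part way through, which the explicit close() calls do not guarantee.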
