Column-delimited formats
Next in this chapter we will look at making new file formats to work with your programs.
In general, however, you might consider avoiding this entirely. If there is already a well-
defined standard that is used for a particular kind of data, like FASTA for sequences or
PDB (or more recently mmCIF) for molecular structures, then that should be the first
choice, especially if you want other people or programs to understand your data. Also, for
an arbitrary set of data you could use an existing standardised system like XML and
benefit from the large number of available Python modules to deal with it.
Nonetheless there are occasional situations where there is a genuine need to read and
write data in a custom format, especially where the data is fairly simple and the files will
only be used in a limited, perhaps internal, set of situations. As an example we will choose
a simple file format that is easy for a program to read and write, and for a human being
to understand.
This will consist of an initial header line that states what the various items of data
represent and then subsequent lines, one for each of the data elements in a list. On each
data line we will use a piece of text (commonly a single character like a space, tab or
comma) to separate or delimit the various items on that line. Note that we should choose
the separator string carefully so that it is not something that will be contained in the data
and disrupt the delineation of different items.
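To see why the choice of separator matters, here is a minimal sketch using a made-up record (the field values are purely illustrative) in which one item contains a comma. Joining with a comma and splitting again gives the wrong number of items, whereas a tab survives the round trip.

```python
# Hypothetical record: the second item itself contains a comma
row = ['chr1', 'kinase, putative', '0.379']

commaLine = ','.join(row)
tabLine = '\t'.join(row)

# Splitting on the comma breaks the second item in two
commaItems = commaLine.split(',')

# Splitting on the tab recovers the original items
tabItems = tabLine.split('\t')

print(len(commaItems), len(tabItems))
```

Here the comma-separated line comes back as four items rather than three, which is exactly the kind of disruption a well-chosen separator avoids.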
In the function below, to write out the data we first create a header line, to indicate what
each of the fields represents, and then loop through the list of data to create the remaining
lines of the file. There is a check to make sure the heading list is the same length as the
first row of data. The heading line is formed using the .join() method of the separator
string, which combines all the headings into one text string; this is then written out to the
file, followed by a newline character. For the data lines the separator joins the items of the
formats list to create a single one-line format string, which is the template that says how
to convert each row of data into the appropriate line of text, so that each item has the
correct numerical precision, padding etc. The actual data lines are created from a tuple
of each row via the ‘%’ formatting operator and are written out with a newline character.
def writeListFile(fileName, data, headings, formats, separator='\t'):
    # Check the headings against the first row of data
    if len(data[0]) != len(headings):
        print("Headings length does not match input list")
        return

    fileObj = open(fileName, 'w')

    # Header line
    line = separator.join(headings)
    fileObj.write('%s\n' % line)

    # One template for every data line, e.g. '%s\t%d\t%d\t%.3f'
    lineFormat = separator.join(formats)
    for row in data:
        line = lineFormat % tuple(row)
        fileObj.write('%s\n' % line)

    fileObj.close()
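The way the per-item formats combine into a single line template can be checked in isolation. This is only a small sketch with an illustrative row of values; the variable names are chosen for the example.

```python
formats = ['%s', '%d', '%d', '%.3f']
separator = '\t'

# Join the individual formats into one template for a whole line
lineFormat = separator.join(formats)

# The % operator needs a tuple, so a list row is converted first
row = ['chr1', 195612601, 196518584, 0.37912]
line = lineFormat % tuple(row)

print(repr(lineFormat))
print(repr(line))
```

Note that the floating point value is rounded to three decimal places by the '%.3f' format, so the text written to the file can have less precision than the original data.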
To create a specific type of file using this general function, the headings, formats and
separator can be fixed inside a wrapper function so that they are the same every time it is
called. For example, here is a file format specification which uses four items (a string, two
integers and a floating point number) on a line, separated by spaces:
def writeChromosomeRegions(fileName, data):
    headings = ['chromo', 'start', 'end', 'value']
    formats = ['%s', '%d', '%d', '%.3f']
    writeListFile(fileName, data, headings, formats, ' ')
Which could produce something like:
chromo start end value
chr1 195612601 196518584 0.379
chr1 52408393 196590488 0.361
chr1 193237929 196783789 0.473
chr1 181373059 6104731 0.104
chr2 7015693 7539562 0.508
chr2 9097449 9108209 0.302
The equivalent function for reading our files is fairly simple. We just need to skip the
first line, assuming of course we already know what the data represents, and then loop
through the remainder of the lines. For the data lines we remove the trailing newline
character with .rstrip() and split them according to the specified separator, again
defaulting to a tab character, and put the resulting list as an entry in the larger list,
dataList, which is returned at the end of the function. Note that because the values read
from the files are just text we need to convert anything which should not remain a Python
string, like numbers or True/False values. This is illustrated below by the use of the
converters argument, which contains a list of functions (int, float etc.) to transform the text
from the file in the appropriate way. If a conversion is not required for an item then the list
simply contains None.
def readListFile(fileName, converters, separator='\t'):
    dataList = []
    # Plain 'r' mode: the old 'rU' flag was removed in Python 3.11
    fileObj = open(fileName, 'r')
    header = fileObj.readline()  # Extract first line
    for line in fileObj:  # Loop through remaining lines
        line = line.rstrip('\r\n')  # Strip only the line ending
        data = line.split(separator)
        for index, datum in enumerate(data):
            convertFunc = converters[index]
            if convertFunc:
                data[index] = convertFunc(datum)
        dataList.append(data)
    fileObj.close()
    return dataList
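One point to watch with converters is True/False values: the built-in bool() treats any non-empty string as true, so it cannot be used directly on the text 'False'. The sketch below applies a list of converters to one line of text and uses a small helper, textToBool(), which is a hypothetical name introduced only for this example.

```python
def textToBool(text):
    # Hypothetical helper: only the text 'true' (in any letter case) gives True
    return text.strip().lower() == 'true'

converters = [None, int, float, textToBool]
fields = 'chr1\t42\t0.5\tFalse'.split('\t')

values = []
for convertFunc, datum in zip(converters, fields):
    if convertFunc:
        datum = convertFunc(datum)
    values.append(datum)

print(values)

# The built-in bool() gives the wrong answer for this text
naive = bool('False')
print(naive)
```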
We can then use this general file-reading function to make something more specific, i.e.
by defining a separator and conversion functions appropriate to a particular job. In the
example below we use a space as a separator and leave the first value as text, convert the
second and third values to integers and convert the fourth to a floating point value, i.e. so
we could read files made with writeChromosomeRegions().
def readChromosomeRegions(fileName):
    converters = [None, int, int, float]
    dataList = readListFile(fileName, converters, ' ')
    return dataList
There is a standard Python module, called ‘csv’ (short for comma-separated values), which
will do most of the above handling of delimited text files. Unfortunately, it uses different
methods to open files in Python 2 and Python 3. In Python 2 the binary, ‘b’, flag is used to
open the file, and in Python 3 an extra newline argument is used instead. The underlying
reason for the complication is that the csv module is designed to cope with new lines
being present in the middle of an item of data. Hence, the module does not read or write
the data with the standard line-by-line method and makes a separate assessment about how
to split the data into rows.
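The effect of a new line inside an item can be seen with a short in-memory sketch. This assumes Python 3 and uses io.StringIO in place of a real file; the row contents are made up for illustration.

```python
import csv
import io

# One row whose second item contains a newline character
row = ['chr1', 'first line\nsecond line', '0.379']

buffer = io.StringIO()
writer = csv.writer(buffer, delimiter='\t')
writer.writerow(row)
text = buffer.getvalue()

# Naive line-by-line splitting sees two lines for this single row
naiveLines = text.splitlines()

# csv.reader uses the quoting added by the writer to reassemble the row
reader = csv.reader(io.StringIO(text, newline=''), delimiter='\t')
rows = list(reader)

print(len(naiveLines), len(rows))
```

The writer puts quotation marks around the item that contains the newline, which is how the reader knows the row continues onto the next line.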
To deal with all of this we have created a small function that can distinguish between
Python 2 and Python 3 using sys.version_info.major (which gives the value 2 or 3 for the
respective versions) and use the csv module in the correct way. The construction of the
function is similar to writeListFile, where we write a header and then the data lines, but
here the actual writing is done using the writerow() method of a csv.writer object. Also, it
is notable that what we have called the data separator, the csv functions call the delimiter.
import csv
import sys

def writeCsvFile(fileName, data, headings, separator='\t'):
    if sys.version_info.major > 2:
        fileObj = open(fileName, 'w', newline='')
    else:
        fileObj = open(fileName, 'wb')

    writer = csv.writer(fileObj, delimiter=separator)
    writer.writerow(headings)
    for row in data:
        writer.writerow(row)

    fileObj.close()
There are also complications due to the Python version in our csv reader function
readCsvFile(), but the file reading itself is fairly straightforward. We simply create a
csv.reader object that can be looped through, row by row. We ignore the first (index zero)
header row and convert the text to the required data types in the same way we did in
readListFile().
def readCsvFile(fileName, converters, separator='\t'):
    dataList = []
    if sys.version_info.major > 2:
        fileObj = open(fileName, 'r', newline='')
    else:
        fileObj = open(fileName, 'rb')

    reader = csv.reader(fileObj, delimiter=separator)
    for n, row in enumerate(reader):
        if n > 0:  # n = 0 is the header, which we ignore
            for index, datum in enumerate(row):
                convertFunc = converters[index]
                if convertFunc:
                    row[index] = convertFunc(datum)
            dataList.append(row)

    fileObj.close()
    return dataList
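If only Python 3 needs to be supported, the version check can be dropped. The following is a sketch under that assumption, not a replacement for the functions above: readCsvFilePy3() is a hypothetical name, the with statement closes the file even if a conversion fails, and next() skips the header row. The example data and temporary file are there only to show a round trip.

```python
import csv
import os
import tempfile

def readCsvFilePy3(fileName, converters, separator='\t'):
    # Hypothetical Python 3 only variant of a csv reading function
    dataList = []
    with open(fileName, 'r', newline='') as fileObj:
        reader = csv.reader(fileObj, delimiter=separator)
        next(reader, None)  # Skip the header row, if there is one
        for row in reader:
            for index, datum in enumerate(row):
                convertFunc = converters[index]
                if convertFunc:
                    row[index] = convertFunc(datum)
            dataList.append(row)
    return dataList

# Write a small illustrative file, then read it back
data = [['chr1', 195612601, 196518584, 0.379],
        ['chr2', 7015693, 7539562, 0.508]]
filePath = os.path.join(tempfile.mkdtemp(), 'regions.tsv')

with open(filePath, 'w', newline='') as fileObj:
    writer = csv.writer(fileObj, delimiter='\t')
    writer.writerow(['chromo', 'start', 'end', 'value'])
    writer.writerows(data)

result = readCsvFilePy3(filePath, [None, int, int, float])
print(result)
```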