File reading examples
Reading whitespace-separated files
For our first practical example we will read a simple yet commonly used kind of file,
one where each line has several fields separated by whitespace. By ‘whitespace’ we mean
one or more spaces or tab characters (‘\t’). An example of such a file is the following,
which has a descriptive header line followed by lines with three text fields: the first is
the name of a chromosome, the second is a base-pair position in that chromosome and the
last is an experimentally determined value for that position:
chromosome position value
chr1 3417953 0.74634
chrX 152662801 0.50036
chr7 55281536 0.82376
chr4 9168943 0.73375
chr1 13170641 0.42181
For the purposes of our example we will assume that the above lines are in a file called
‘chromoData.tsv’ which lies in the ‘examples’ sub-directory of the current working
directory, where ‘.tsv’ gives a hint that the format is tab-separated values. In order to
process this file we will first read the separate header line with .readline(), given that it
doesn’t contain data we are interested in. Then we will loop through the remainder of the
lines, by iterating over the file object, and for each line we will use the string method
split() to separate the line into a list of substrings. Without any arguments split() will
separate the fields at any run of whitespace, which is what we want. For a different file
format we could specify a different separator, so, for example, for comma-separated fields
we would use split(',') or for tab-separated fields split('\t'), both of which can
accommodate data items with internal spaces.
fileObj = open('examples/chromoData.tsv')
values = []
header = fileObj.readline() # Don't need this first line
for line in fileObj:
    data = line.split() # Split the line at whitespace
    chromosome, position, value = data
    position = int(position)
    value = float(value)
    values.append(value)
fileObj.close() # Finished reading, so release the file
mean = sum(values)/len(values)
print('Mean value', mean)
For each line we obtain a list with three items and these are extracted into separate
chromosome, position and value variables. Initially these will be text strings, given that
they were just read from the file, but in the case of the position and value we generally
want to convert them from strings into integer and floating point number data types
respectively (though in this simple example we have not used the position). Accordingly
we use the int() and float() functions to do the conversion. Once a variable is a numeric
data type we can then perform mathematical operations, like finding the mean value as
illustrated.
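To make the earlier point about separators concrete, the following short sketch compares split() with no argument against split('\t') and split(','). The three example strings are made up for illustration (only the first resembles a line from the file above) and the variable names are our own.

```python
# Illustrative strings only; they are not read from 'chromoData.tsv'
spaced = 'chr1   3417953\t0.74634\n'
print(spaced.split())      # ['chr1', '3417953', '0.74634']

# A field with an internal space survives only with an explicit separator
tabbed = 'chr 1\t3417953\t0.74634\n'
print(tabbed.split('\t'))  # ['chr 1', '3417953', '0.74634\n']
print(tabbed.split())      # ['chr', '1', '3417953', '0.74634']

# An explicit separator also preserves empty fields
commas = 'chr1,3417953,,0.74634'
print(commas.split(','))   # ['chr1', '3417953', '', '0.74634']
```

Note that with an explicit separator the trailing newline stays attached to the last field, so it is often worth calling rstrip('\n') on the line first. The int() and float() functions tolerate surrounding whitespace, but a field kept as text would retain the newline.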
We will consider field-delimited formats again in the readListFile() function below,
where we handle things in a more general way, allowing different data type conversion
functions and field separators to be specified as function arguments.
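As a rough idea of what such generalisation could look like, here is a minimal sketch that takes the conversion functions and the separator as arguments. This is not the readListFile() function itself: the name readFields, its arguments and the tab-joined sample lines are hypothetical and chosen only for illustration.

```python
def readFields(lines, converters, separator=None):
    # Hypothetical helper: one conversion function per field, in order.
    # separator=None means "split at any whitespace", as with split().
    rows = []
    for line in lines:
        line = line.rstrip('\n')
        if not line.strip():
            continue  # Skip blank lines
        fields = line.split(separator)
        row = [convert(field) for convert, field in zip(converters, fields)]
        rows.append(row)
    return rows

# Sample lines written out by hand with tab separators (an assumption)
sampleLines = ['chr1\t3417953\t0.74634\n',
               'chrX\t152662801\t0.50036\n']

rows = readFields(sampleLines, [str, int, float], separator='\t')
print(rows)  # [['chr1', 3417953, 0.74634], ['chrX', 152662801, 0.50036]]
```

Passing an open file object instead of a list works the same way, because both yield one line per iteration.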