Reading array data
We will define the Microarray class a bit later, but first we will look at what kind of raw,
experimental data it will contain. This will be illustrated by writing functions that extract
data for a Microarray object from the contents of files, specifically text files and image
files. An alternative approach would be to make a blank Microarray object and then have
that load data into itself, but here we aim to first show the kind of actual underlying data
we are dealing with.
Importing text matrices
The load function will assume a simple file format with one line of text per array
element; the columns of data in the file are identifiers for the array coordinates
(these could be the row and column of the array) followed by the actual data signal
value. For example, this could be something like:
A B 1.230
A C 4.510
A D 0.075
B C 4.999
B D 0.258
C D 2.312
We will not assume that all elements of the array are represented and we will not
assume that the array row and column identifiers (i.e. A, B, C and D in the above
example) are either continuous or in any particular order.
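As a quick illustration of this format, each line can be split on whitespace to recover a row label, a column label and a signal value. The sketch below holds the example data in a string rather than a file, purely for demonstration:

```python
# Hypothetical stand-in for the file contents shown above
text = """A B 1.230
A C 4.510
A D 0.075
B C 4.999
B D 0.258
C D 2.312"""

for line in text.splitlines():
  row, col, value = line.split()
  print(row, col, float(value))
```

Note that the labels need not cover every row/column combination, and nothing about the order of lines is assumed.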
The experimental microarray data will be represented as two-dimensional NumPy
arrays. Accordingly, we make various imports from the numpy library for the
mathematical array operations that will need to be done:
from numpy import array, dot, log, sqrt, uint8, zeros
The function to load data from a file and create a Microarray object takes a file name to
load from as the first argument, an identifying name for the array as the second argument
and an optional third argument to state what the default signal value is for an array
element that we have no data for. This last argument may or may not be used depending
on what data we read in, but we should at least be aware of incomplete or failed data
points. The default here is zero, rather than None, given that we are dealing with NumPy
arrays that don’t mix data types.
def loadDataMatrix(fileName, sampleName, default=0.0):

  fileObj = open(fileName, 'r')
Empty sets are created which will contain the identifiers for the rows and columns in
the microarray data. In the end these could be filled with just numbers representing the
array coordinates, or they could be text labels. The only rule about these identifiers is that
because they are in sets they cannot be internally modifiable items like lists (they must be
hashable). The values in these sets will be keys to access the numeric data stored in the
dataDict dictionary.
  rows = set()
  cols = set()
  dataDict = {}
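As an aside, the hashability rule can be demonstrated directly: immutable items like strings and tuples may go into a set, whereas adding a mutable list raises a TypeError. This snippet is purely illustrative and not part of the function:

```python
identifiers = set()
identifiers.add('A')      # strings are hashable
identifiers.add((2, 3))   # so are tuples

try:
  identifiers.add(['B'])  # lists are mutable, hence unhashable
except TypeError:
  print('Lists cannot be set members')
```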
Next we loop through all the lines in the open file object and split each line according to
internal whitespace to give values for the row, column and the numeric value. Naturally
this operation would be different if the format of the file was different.
  for line in fileObj:
    row, col, value = line.split()
A check is made to ensure that each row identifier that comes from the file has an entry
in dataDict. If it does not then we make a new inner dictionary within the main one using
the row identifier as the key. Note that we cannot fill this dictionary, or any other data
collection, in advance because we do not know what rows or columns we have until the
file has been read.
    if row not in dataDict:
      dataDict[row] = {}
The actual signal value, which is a floating point number, is then added to the data by
using the row and column identifiers as keys for the main and inner dictionaries, although
we first check the dictionary to guard against using repeats for the same element. The row
and col that were just used are then added to the set of rows and cols. Because they are set
data types it does not matter if we have seen them before, given that a set automatically
ignores repeated items. Note that the value is converted using float() because initially it is
just a text string loaded from a file, and not a Python number object.
    if col in dataDict[row]:
      print('Repeat entry found for element %s, %s' % (row, col))
      continue

    dataDict[row][col] = float(value)
    rows.add(row)
    cols.add(col)
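The two conversions at work here can be isolated in a short sketch: repeated additions to a set are silently ignored, and float() turns the text loaded from the file into a Python number.

```python
labels = set()
labels.add('A')
labels.add('A')         # repeated additions are silently ignored
print(len(labels))      # the set still holds a single 'A'

value = float('1.230')  # text from a file becomes a Python float
print(value)
```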
After all the lines have been processed we then convert the set of row and column
identifiers to sorted, ordered lists. Only now that the total range of these identifiers has
been collected can they be used to create the axes of the NumPy 2D value array. The sizes
of this array will naturally be based on the number of row and column identifiers, which
are recorded as nRows and nCols.
  rows = sorted(rows)
  cols = sorted(cols)
  nRows = len(rows)
  nCols = len(cols)
The NumPy array to store the values is initialised as an array of zeros of the required
size, with the data type (the last argument) set to be floating point numbers:
  dataMatrix = zeros((nRows, nCols), float)
The rectangular dataMatrix array is then filled by extracting values from the dataDict,
although there is provision to replace missing values with the default value (hence we use
the .get() dictionary call). Note how we use enumerate() to extract index numbers (i and j)
as we loop through row and column identifiers. These indices are then used to fill the
correct position in the array. At the end the ordered list of rows and cols will be used so
that the indices in the NumPy array can be used to look up the original data labels that
they refer to.
  for i, row in enumerate(rows):
    for j, col in enumerate(cols):
      value = dataDict[row].get(col, default)
      dataMatrix[i,j] = value

  fileObj.close()
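The filling loop relies on two idioms worth isolating: enumerate() pairs each sorted label with its array index, and the dictionary .get() call substitutes the default where a column is absent. A small sketch with made-up data:

```python
from numpy import zeros

rowData = {'A': {'B': 1.23},
           'B': {}}          # row 'B' has no recorded columns
labels = sorted(rowData)     # ['A', 'B']

matrix = zeros((2, 2), float)
for i, row in enumerate(labels):
  for j, col in enumerate(labels):
    matrix[i, j] = rowData[row].get(col, 0.0)  # 0.0 where data is missing

print(matrix)
```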
With the data collected we use it to construct a Microarray class of object, made with
the definition described below. This is then passed back from the function.
  return Microarray(sampleName, dataMatrix, rows, cols)
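To try the loader out before the real Microarray class is defined, a minimal stand-in class can be substituted. The sketch below repeats the function in compact form so it can be run on its own, writes the example data to a file and loads it back; the stand-in class and the file name testArray.txt are assumptions for illustration only.

```python
from numpy import zeros

class Microarray:
  """Minimal stand-in for the class defined later; illustration only."""
  def __init__(self, name, data, rows, cols):
    self.name = name
    self.data = data
    self.rows = rows
    self.cols = cols

def loadDataMatrix(fileName, sampleName, default=0.0):
  fileObj = open(fileName, 'r')
  rows = set()
  cols = set()
  dataDict = {}
  for line in fileObj:
    row, col, value = line.split()
    if row not in dataDict:
      dataDict[row] = {}
    if col in dataDict[row]:
      print('Repeat entry found for element %s, %s' % (row, col))
      continue
    dataDict[row][col] = float(value)
    rows.add(row)
    cols.add(col)
  rows = sorted(rows)
  cols = sorted(cols)
  dataMatrix = zeros((len(rows), len(cols)), float)
  for i, row in enumerate(rows):
    for j, col in enumerate(cols):
      dataMatrix[i, j] = dataDict[row].get(col, default)
  fileObj.close()
  return Microarray(sampleName, dataMatrix, rows, cols)

# Write the example data shown earlier to a file and load it back
with open('testArray.txt', 'w') as f:
  f.write('A B 1.230\nA C 4.510\nA D 0.075\n'
          'B C 4.999\nB D 0.258\nC D 2.312\n')

microarray = loadDataMatrix('testArray.txt', 'Test')
print(microarray.rows)   # rows 'A', 'B', 'C' were seen in the first column
print(microarray.cols)   # columns 'B', 'C', 'D' were seen in the second
print(microarray.data)   # missing elements are filled with the default 0.0
```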