Value normalisation
Now the examples move on to methods that adjust and normalise our data. The aim here is
generally to preserve the features and correlations within the data, while adjusting the data
values so that they have a standard range or fit better to some kind of distribution. The
reasons for doing this may simply be mathematical, e.g. to calculate scores or
probabilities, but normalisation has a very important role in making different experiments
(both experiment spots within an array and whole arrays) comparable with one another.
The first function of this kind is used to clip the lowest, base value of the data so that it
does not drop below a specified threshold; elements that have smaller values will be set to
this limit. Note that if the absolute threshold is not specified it is taken to be some
proportion of the maximum value, which in this case arbitrarily defaults to 0.2. This
function would be handy to eliminate erroneous negative values, for example, or to
disregard microarray elements that are deemed to be insignificant because they are below
some noise level. Also, as with the makeImage() function, we allow the channels to be specified, stating which layers of the array data should be considered. If this is not specified it defaults to the indices for all layers: range(self.nChannels). Note that channels
is deliberately converted to a tuple and then placed in a list so that it can be used to index a
subset from the self.data array (this is a consequence of the way NumPy array indices
work).
def clipBaseline(self, threshold=None, channels=None, defaultProp=0.2):

  if not channels:
    channels = range(self.nChannels)

  channels = [tuple(channels)]
The maximum value is found from the required channels of the array data, and if the
threshold (clipping) value is not specified it is calculated with the default proportion.
  maxVal = self.data[channels].max()

  if threshold is None:
    limit = maxVal * defaultProp
  else:
    limit = threshold
By comparing the selected channels of self.data with the limit we generate an array of Boolean values (True/False) that states whether each element is less than the threshold value. The indices of the positions where this is True are provided by the nonzero() function of NumPy arrays. The array elements corresponding to these indices are then set to the lower limit.
  boolArray = self.data[channels] < limit
  indices = boolArray.nonzero()
  self.data[indices] = limit
After clipping, the data is then centred by subtracting the baseline value, to give a new
base value of zero, and finally scaled (effectively stretched) to restore the original
maximum value.
  self.data[channels] -= limit
  self.data[channels] *= maxVal / (maxVal-limit)
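To make the effect of these three steps concrete, here is a minimal standalone sketch (not part of the Microarray class) that applies the same clip, re-centre and stretch logic to a small one-dimensional NumPy array; the signal values and threshold are invented purely for illustration.

import numpy as np

data = np.array([-0.5, 0.2, 1.0, 3.0, 5.0])   # toy signal values
maxVal = data.max()                            # 5.0
limit = 1.0                                    # clipping threshold

indices = (data < limit).nonzero()             # positions of sub-threshold values
data[indices] = limit                          # clip: [1.0, 1.0, 1.0, 3.0, 5.0]

data -= limit                                  # re-centre the baseline at zero
data *= maxVal / (maxVal - limit)              # stretch back to the original maximum

print(data)                                    # [0.  0.  0.  2.5 5. ]

The negative and insignificant values collapse onto the new zero baseline while the largest value is restored to its original magnitude.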
We now consider various simple normalisation and adjustment methods. The
normaliseSd function will scale the data values according to the standard deviation of the measurements: we divide all signal values by the standard deviation. We include the
scale argument so that the data can also be arbitrarily scaled at the same time. This kind of
adjustment is useful when comparing different instances of microarrays, where the actual
range of signals (e.g. detected fluorescence values) probably ought to be the same on
different occasions, but where there is variation in magnitude simply by the way the
microarray is constructed. For example, one microarray might have a systematically larger
amount of substrate printed on it compared to another. If there is confidence that different
arrays are showing the same range of underlying data then this normalisation, according to
standard deviation, is reasonable. The operation is done separately on all of the data layers
(one for each channel), though we could extend the function to accept only a limited
number of channels. The data is multiplied and divided appropriately in an element-by-
element manner in the NumPy array. The standard deviation is obtained using the .std()
method inbuilt into the NumPy array objects, as we describe in Chapter 22.
def normaliseSd(self, scale=1.0):

  for i in range(self.nChannels):
    self.data[i] = self.data[i] * scale / self.data[i].std()
Another similar kind of normalisation is simply to make sure that microarray values are
scaled relative to their mean value. This is done for similar reasons as above, but rather
than saying the variation in values is the same across different arrays, we assume that the
mean values should be the same, or nearly so. This is indeed a reasonable assumption in
many situations.
def normaliseMean(self, scale=1.0):

  for i in range(self.nChannels):
    self.data[i] = self.data[i] * scale / self.data[i].mean()
If we do one of the above normalisations then it is often handy to represent the data
values as positive and negative numbers either side of the mean value, so we can see what
is above or below the average, rather than as a merely positive intensity. This is readily
achieved by subtracting the mean value from all the data:
def centerMean(self):

  for i in range(self.nChannels):
    self.data[i] -= self.data[i].mean()
Combining the centring of the data and scaling by the standard deviation we get a
commonly used operation which is called Z-score normalisation, as we discuss in Chapter 22. All this really means is that we move the data values so that they are centred at zero and scaled to put the standard deviation at the values ±1.0.
def normaliseZscore(self):

  self.centerMean()
  self.normaliseSd()
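As a quick check of what Z-score normalisation achieves, the following standalone sketch (using NumPy only, with made-up random values rather than real microarray data) centres and scales an array directly and prints the resulting mean and standard deviation.

import numpy as np

values = np.random.normal(5.0, 2.0, size=(10, 10))   # arbitrary test signal
zScores = (values - values.mean()) / values.std()    # centre, then scale by SD

print(zScores.mean())   # very close to 0.0
print(zScores.std())    # 1.0, to within rounding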
Another kind of normalisation is to scale the values to some upper limit, e.g. so they are
at most 1.0. This is done by dividing by the maximum value. This operation is useful if
you know what the maximum value in the array is, or tends towards. For example, this
could be a strong reference signal for an element acting as a positive control. Here we add
the option to consider either each data layer separately (perChannel=True) or the
maximum value from all the data (perChannel=False).
def normaliseMax(self, scale=1.0, perChannel=True):

  if perChannel:
    for i in range(self.nChannels):
      self.data[i] = self.data[i] * scale / self.data[i].max()

  else:
    self.data = self.data * scale / self.data.max()
Likewise we could normalise the rows separately, relative to the maximum value in each row. Here the data values are divided by an array representing the maximum value along only one axis, hence axis=1, to get the maximum of the values (over the column positions) for each row. The slice notation [:,None] is a convenient way of changing what would otherwise be a one-dimensional array of maxima, one long row vector, into a two-dimensional array with several rows (one number in each), i.e. one long column. This means that the division scales each row of values by a different number.
def normaliseRowMax(self, scale=1.0):

  for i in range(self.nChannels):
    self.data[i] = self.data[i] * scale / self.data[i].max(axis=1)[:,None]
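The broadcasting involved in the [:,None] step can be illustrated with a small standalone sketch; the numbers are invented purely to show that each row ends up divided by its own maximum.

import numpy as np

layer = np.array([[1.0, 2.0, 4.0],
                  [3.0, 6.0, 6.0]])

rowMax = layer.max(axis=1)          # array([4., 6.]): one maximum per row
print(rowMax[:,None].shape)         # (2, 1): a two-dimensional column of maxima
print(layer / rowMax[:,None])       # [[0.25 0.5  1.  ]
                                    #  [0.5  1.   1.  ]]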
Row normalisation is useful where each row represents a different kind of data. An
example would be where each column corresponds to a different nucleotide (or amino
acid) sequence change in one molecule and each row represents a set of target molecules
that are being bound. By normalising by row we will get the relative signal that illustrates
which sequence changes give the best binding to each target. This may be useful in finding
the change that leads to optimal binding for several targets, but naturally we lose
information about the comparative strength of binding between targets.
Alternatively we can normalise the rows so they are scaled according to their mean
value.
def normaliseRowMean(self, scale=1.0):

  for i in range(self.nChannels):
    self.data[i] = self.data[i] * scale / self.data[i].mean(axis=1)[:,None]
The same operations can be done for the columns in the data; it just depends on how the array is arranged. Note that here we use axis=0 and so do not have to convert the array of maxima into a column vector.
def normaliseColMax(self, scale=1.0):

  for i in range(self.nChannels):
    self.data[i] = self.data[i] * scale / self.data[i].max(axis=0)


def normaliseColMean(self, scale=1.0):

  for i in range(self.nChannels):
    self.data[i] = self.data[i] * scale / self.data[i].mean(axis=0)
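As a companion sketch for the column case, note that the array of column maxima already lines up with the last axis, so plain broadcasting divides each column by its own maximum without any reshaping; again the numbers are purely illustrative.

import numpy as np

layer = np.array([[1.0, 2.0, 4.0],
                  [3.0, 6.0, 6.0]])

colMax = layer.max(axis=0)          # array([3., 6., 6.]): one maximum per column
print(layer / colMax)               # [[0.333... 0.333... 0.666...]
                                    #  [1.       1.       1.      ]]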
If the microarray contains elements that represent control values, i.e. where you know
what the expected results for these are, then you can scale the whole array relative to the
known signal for these reference points. Here the reference values are specified by passing
in the lists of the rows and columns they are found in, which are used to extract the
corresponding values from the data. Taking a slice from self.data using separate row and column specifications [rows, cols], rather than finding specific indices, may seem odd, but is often convenient in NumPy (e.g. this is what .nonzero() gives). The [rows, cols] notation affects the elements where the row and column indices pair up, so it can actually be more efficient than stating separate coordinates. The mean of the reference values is then
used to divide the whole array. To add a bit of diversity we revert to allowing the channels
to be specified to state which data layers to operate on.
def normaliseRefs(self, rows, cols, scale=1.0, channels=None):

  if not channels:
    channels = range(self.nChannels)

  channels = tuple(channels)

  refValues = self.data[channels, rows, cols]

  for i in channels:
    self.data[i] = self.data[i] * scale / refValues[i].mean()
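The paired [rows, cols] style of indexing can be illustrated on a single, made-up data layer; note that the row and column lists pick out individual elements, here (0,0) and (2,3), rather than a rectangular block.

import numpy as np

layer = np.arange(20.0).reshape(4, 5)   # a single 4 x 5 data layer
rows = [0, 2]                           # rows of two reference spots
cols = [0, 3]                           # columns of the same spots

refValues = layer[rows, cols]           # array([ 0., 13.]): elements (0,0) and (2,3)
print(layer / refValues.mean())         # whole layer scaled by the reference mean (6.5)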
A different way of normalising data values, especially for fluorescence intensity data, is to convert them to a logarithmic scale, effectively compressing their dynamic range. Note that
we clip the baseline to remove any negative values and add 1.0 so we don’t take the
logarithm of any zero values. After conversion to a log scale we can then apply other
normalisation techniques to compare different microarrays or rows etc. (Obviously we
cannot do centerMean() before the log conversion, because we want all values to be
positive.)
def normaliseLogMean(self):

  self.clipBaseline(threshold=0.0)

  for i in range(self.nChannels):
    self.data[i] = log( 1.0 + self.data[i] / self.data[i].mean() )
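The compression of the dynamic range is easy to see in a brief standalone sketch using NumPy's log() (the same function that is used, unqualified, in the method above, assuming it was imported from NumPy earlier in the module); the intensities are invented to span three orders of magnitude.

import numpy as np

signal = np.array([1.0, 10.0, 100.0, 1000.0])        # a 1000-fold intensity range
logSignal = np.log(1.0 + signal / signal.mean())     # signal.mean() is 277.75

print(logSignal)   # roughly [0.0036 0.0354 0.31   1.53  ]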
We can test all of the above by creating a Microarray object named testArray that uses
some example data from a text file. We illustrate the result of the various normalisation
methods by writing out an image after each step.
testArray = loadDataMatrix('examples/microarrayData.txt', 'Test')
testArray.makeImage(25).save('testArray.png')

# Log normalise
testArray.normaliseLogMean()
testArray.makeImage(25).save('normaliseLogMean.png')

# Normalise to max and clip
testArray.resetData()
testArray.normaliseMax()
testArray.clipBaseline(0.5)
testArray.makeImage(25).save('clipBaseline.png')

# Normalise to standard deviation
testArray.resetData()
print('Initial SD:', testArray.data.std())
testArray.normaliseSd()
print('Final SD:', testArray.data.std())
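The test script could be extended in the same style to exercise some of the other methods; the following lines are an illustrative continuation rather than part of the original test, assume the same testArray object, and use arbitrary output file names.

# Normalise each row to its own maximum
testArray.resetData()
testArray.normaliseRowMax()
testArray.makeImage(25).save('normaliseRowMax.png')

# Normalise all channels together to the overall maximum
testArray.resetData()
testArray.normaliseMax(perChannel=False)
testArray.makeImage(25).save('normaliseMaxGlobal.png')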
Another handy way to do normalisation, albeit in a less scientific way, is to perform