Python Programming for Biology: Bioinformatics and Beyond




Value normalisation

Now the examples move on to methods that adjust and normalise our data. The aim here is generally to preserve the features and correlations within the data, while adjusting the data values so that they have a standard range or fit better to some kind of distribution. The reasons for doing this may simply be mathematical, e.g. to calculate scores or probabilities, but normalisation has a very important role in making different experiments (both experiment spots within an array and whole arrays) comparable with one another.

The first function of this kind is used to clip the lowest, base value of the data so that it does not drop below a specified threshold; elements that have smaller values will be set to this limit. Note that if the absolute threshold is not specified it is taken to be some proportion of the maximum value, which in this case arbitrarily defaults to 0.2. This function would be handy to eliminate erroneous negative values, for example, or to disregard microarray elements that are deemed to be insignificant because they are below some noise level. Also, as with the makeImage() function, we allow the channels to be specified to state which layers of the array data should be considered. If this is not specified it defaults to the indices for all layers, range(self.nChannels). Note that channels is deliberately converted to a tuple and then placed in a list so that it can be used to index a subset from the self.data array (this is a consequence of the way NumPy array indices work).

def clipBaseline(self, threshold=None, channels=None, defaultProp=0.2):

    if not channels:
        channels = range(self.nChannels)

    channels = [tuple(channels)]

The maximum value is found from the required channels of the array data, and if the threshold (clipping) value is not specified it is calculated with the default proportion.

    maxVal = self.data[channels].max()

    if threshold is None:
        limit = maxVal * defaultProp
    else:
        limit = threshold



By comparing the whole array self.data with the limit we generate an array of Boolean values (True/False) that state whether each element was less than the threshold value. The indices of the positions where this is True are provided by the nonzero() method of NumPy arrays. The array elements corresponding to these indices are then set to the lower limit.

    boolArray = self.data[channels] < limit
    indices = boolArray.nonzero()

    self.data[indices] = limit
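To make the Boolean masking idiom concrete, here is a minimal standalone sketch (using a small one-dimensional NumPy array rather than the Microarray data) of how the comparison, nonzero() and index assignment fit together:

import numpy as np

data = np.array([0.1, 0.5, 0.05, 0.9])
boolArray = data < 0.2         # array([ True, False,  True, False])
indices = boolArray.nonzero()  # (array([0, 2]),) - positions below the limit
data[indices] = 0.2            # set those elements to the limit
print(data)                    # [0.2 0.5 0.2 0.9]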

After clipping, the data is then centred by subtracting the baseline value, to give a new base value of zero, and finally scaled (effectively stretched) to restore the original maximum value.

    self.data[channels] -= limit
    self.data[channels] *= maxVal / (maxVal-limit)
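As a quick worked check of the arithmetic (with illustrative numbers, not values from the example data): if limit is 2.0 and maxVal is 10.0, an element clipped at 2.0 maps to (2.0 - 2.0) * 10.0/8.0 = 0.0, while the old maximum maps to (10.0 - 2.0) * 10.0/8.0 = 10.0; the baseline becomes zero and the maximum is restored.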

We now consider various simple normalisation and adjustment methods. The normaliseSd function will scale the data values according to the standard deviation in the measurement; thus we divide all signal values by the standard deviation. We include the scale argument so that the data can also be arbitrarily scaled at the same time. This kind of adjustment is useful when comparing different instances of microarrays, where the actual range of signals (e.g. detected fluorescence values) probably ought to be the same on different occasions, but where there is variation in magnitude simply because of the way the microarray is constructed. For example, one microarray might have a systematically larger amount of substrate printed on it compared to another. If there is confidence that different arrays are showing the same range of underlying data then this normalisation, according to standard deviation, is reasonable. The operation is done separately on all of the data layers (one for each channel), though we could extend the function to accept only a limited number of channels. The data is multiplied and divided appropriately in an element-by-element manner in the NumPy array. The standard deviation is obtained using the .std() method inbuilt into the NumPy array objects, as we describe in Chapter 22.

def normaliseSd(self, scale=1.0):

    for i in range(self.nChannels):
        self.data[i] = self.data[i] * scale / self.data[i].std()

Another similar kind of normalisation is simply to make sure that microarray values are scaled relative to their mean value. This is done for similar reasons as above, but rather than saying the variation in values is the same across different arrays, we assume that the mean values should be the same, or nearly so. This is indeed a reasonable assumption in many situations.

def normaliseMean(self, scale=1.0):

    for i in range(self.nChannels):
        self.data[i] = self.data[i] * scale / self.data[i].mean()

If we do one of the above normalisations then it is often handy to represent the data values as positive and negative numbers either side of the mean value, so we can see what is above or below the average, rather than as a merely positive intensity. This is readily achieved by subtracting the mean value from all the data:

def centerMean(self):

    for i in range(self.nChannels):
        self.data[i] -= self.data[i].mean()



Combining the centring of the data and scaling by the standard deviation we get a commonly used operation called Z-score normalisation, as we discuss in Chapter 22. All this really means is that we move the data values so that they are centred at zero and scaled to put the standard deviation at the values ±1.0.

def normaliseZscore(self):

    self.centerMean()
    self.normaliseSd()
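As a quick sanity check of the Z-score result (a sketch assuming testArray is a Microarray object like the one loaded at the end of this section), each channel should end up with a mean of zero and a standard deviation of one:

testArray.normaliseZscore()

for i in range(testArray.nChannels):
    print(testArray.data[i].mean())  # close to 0.0
    print(testArray.data[i].std())   # close to 1.0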

Another kind of normalisation is to scale the values to some upper limit, e.g. so they are at most 1.0. This is done by dividing by the maximum value. This operation is useful if you know what the maximum value in the array is, or tends towards. For example, this could be a strong reference signal for an element acting as a positive control. Here we add the option to consider either each data layer separately (perChannel=True) or the maximum value from all the data (perChannel=False).

def normaliseMax(self, scale=1.0, perChannel=True):

    if perChannel:
        for i in range(self.nChannels):
            self.data[i] = self.data[i] * scale / self.data[i].max()

    else:
        self.data = self.data * scale / self.data.max()
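For example (a hypothetical call, reusing the testArray object loaded at the end of this section), normalising all the channels jointly preserves the relative brightness between channels:

# scale so the single brightest element in any channel becomes 1.0
testArray.normaliseMax(perChannel=False)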



Likewise we could normalise the rows separately, relative to the maximum values in each row. Here the data values are divided by an array representing the maximum value along only one axis, hence axis=1, to get the maximum of the values (over the column positions) for each row. The slice notation [:,None] is a convenient way of changing what would otherwise be a one-dimensional array of maxima, in one long row vector, into a two-dimensional array with several rows (one number in each) and one long column. This means the division then scales each row of values by a different number.

def normaliseRowMax(self, scale=1.0):

    for i in range(self.nChannels):
        self.data[i] = self.data[i] * scale / self.data[i].max(axis=1)[:,None]
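The [:,None] reshaping is easier to see in isolation. Here is a minimal standalone sketch (with a small made-up array, independent of the Microarray class):

import numpy as np

a = np.arange(6.0).reshape(2, 3)  # [[0. 1. 2.], [3. 4. 5.]]
rowMax = a.max(axis=1)            # [2. 5.] - shape (2,), one maximum per row
colVec = rowMax[:, None]          # shape (2, 1), a column of row maxima
print(a / colVec)                 # each row is divided by its own maximum
# [[0.   0.5  1. ]
#  [0.6  0.8  1. ]]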

Row normalisation is useful where each row represents a different kind of data. An example would be where each column corresponds to a different nucleotide (or amino acid) sequence change in one molecule and each row represents a set of target molecules that are being bound. By normalising by row we will get the relative signal that illustrates which sequence changes give the best binding to each target. This may be useful in finding the change that leads to optimal binding for several targets, but naturally we lose information about the comparative strength of binding between targets.

Alternatively we can normalise the rows so they are scaled according to their mean value.

def normaliseRowMean(self, scale=1.0):

    for i in range(self.nChannels):
        self.data[i] = self.data[i] * scale / self.data[i].mean(axis=1)[:,None]

The same operations can be done for the columns in the data; it just depends on how the array is arranged. Note that here we use axis=0 and so don't have to convert the array of maxima into a column vector.

def normaliseColMax(self, scale=1.0):

    for i in range(self.nChannels):
        self.data[i] = self.data[i] * scale / self.data[i].max(axis=0)

def normaliseColMean(self, scale=1.0):

    for i in range(self.nChannels):
        self.data[i] = self.data[i] * scale / self.data[i].mean(axis=0)

If the microarray contains elements that represent control values, i.e. where you know what the expected results for these are, then you can scale the whole array relative to the known signal for these reference points. Here the reference values are specified by passing in the lists of the rows and columns they are found in, which are used to extract the corresponding values from the data. Taking a slice from self.data using separate row and column specifications [rows, cols], rather than finding specific indices, may seem odd, but is often convenient in NumPy (e.g. this is what .nonzero() gives). The [rows, cols] notation will affect all of the elements where the paired indices coincide, so can actually be more efficient than stating separate coordinates. The mean of the reference values is then used to divide the whole array. To add a bit of diversity we revert to allowing the channels to be specified to state which data layers to operate on.

def normaliseRefs(self, rows, cols, scale=1.0, channels=None):

    if not channels:
        channels = range(self.nChannels)

    channels = tuple(channels)
    refValues = self.data[channels, rows, cols]

    for i in channels:
        self.data[i] = self.data[i] * scale / refValues[i].mean()
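The paired [rows, cols] indexing used above is a general NumPy feature worth seeing in isolation; here is a minimal standalone sketch (unrelated to the Microarray class):

import numpy as np

m = np.arange(9).reshape(3, 3)  # [[0 1 2], [3 4 5], [6 7 8]]
rows = [0, 1, 2]
cols = [2, 0, 1]
print(m[rows, cols])            # [2 3 7] - one element per (row, col) pair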

A different way of normalising data values, especially for fluorescence intensity data, is to convert them into a logarithmic scale, effectively compressing their dynamic range. Note that we clip the baseline to remove any negative values and add 1.0 so we don't take the logarithm of any zero values. After conversion to a log scale we can then apply other normalisation techniques to compare different microarrays or rows etc. (Obviously we cannot do centerMean() before the log conversion, because we want all values to be positive.)

def normaliseLogMean(self):

    self.clipBaseline(threshold=0.0)

    for i in range(self.nChannels):
        # log here is NumPy's elementwise logarithm (e.g. from numpy
        # import log), so that it operates on whole arrays
        self.data[i] = log( 1.0 + self.data[i] / self.data[i].mean() )
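To see the compression of dynamic range, here is a standalone sketch with made-up intensities spanning three orders of magnitude:

import numpy as np

signal = np.array([1.0, 10.0, 100.0, 1000.0])
logSignal = np.log(1.0 + signal / signal.mean())
print(logSignal)  # roughly [0.0036 0.0354 0.3075 1.5261] - a much narrower spread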

We can test all of the above by creating a Microarray object named testArray that uses some example data from a text file. We illustrate the result of the various normalisation methods by writing out an image at each stage.

testArray = loadDataMatrix('examples/microarrayData.txt', 'Test')
testArray.makeImage(25).save('testArray.png')

# Log normalise
testArray.normaliseLogMean()
testArray.makeImage(25).save('normaliseLogMean.png')

# Normalise to max and clip
testArray.resetData()
testArray.normaliseMax()
testArray.clipBaseline(0.5)
testArray.makeImage(25).save('clipBaseline.png')

# Normalise to standard deviation
testArray.resetData()
print('Initial SD:', testArray.data.std())
testArray.normaliseSd()
print('Final SD:', testArray.data.std())

Another  handy  way  to  do  normalisation,  albeit  in  a  less  scientific  way,  is  to  perform

