Python Programming for Biology: Bioinformatics and Beyond




Value normalisation

Now the examples move on to methods that adjust and normalise our data. The aim here is generally to preserve the features and correlations within the data, while adjusting the data values so that they have a standard range or fit better to some kind of distribution. The reasons for doing this may simply be mathematical, e.g. to calculate scores or probabilities, but normalisation has a very important role in making different experiments (both experiment spots within an array and whole arrays) comparable with one another.

The first function of this kind is used to clip the lowest, base value of the data so that it does not drop below a specified threshold; elements that have smaller values will be set to this limit. Note that if the absolute threshold is not specified it is taken to be some proportion of the maximum value, which in this case arbitrarily defaults to 0.2. This function would be handy to eliminate erroneous negative values, for example, or to disregard microarray elements that are deemed to be insignificant because they are below some noise level. Also, as with the makeImage() function, we allow the channels to be specified to state which layers of the array data should be considered. If this is not specified it defaults to the indices for all layers, range(self.nChannels). Note that channels is deliberately converted to a tuple and then placed in a list so that it can be used to index a subset from the self.data array (this is a consequence of the way NumPy array indices work).

def clipBaseline(self, threshold=None, channels=None, defaultProp=0.2):

    if not channels:
        channels = range(self.nChannels)

    channels = [tuple(channels)]

The maximum value is found from the required channels of the array data, and if the threshold (clipping) value is not specified it is calculated with the default proportion.

    maxVal = self.data[channels].max()

    if threshold is None:
        limit = maxVal * defaultProp
    else:
        limit = threshold



By comparing the whole array self.data with the limit we generate an array of Boolean values (True/False) that state whether each element was less than the threshold value. The indices of the positions where this is True are provided by the nonzero() method of NumPy arrays. The array elements corresponding to these indices are then set to the lower limit.

    boolArray = self.data[channels] < limit
    indices = boolArray.nonzero()

    self.data[indices] = limit
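To make the Boolean masking idiom concrete, here is a minimal standalone sketch (using a small one-dimensional NumPy array rather than the Microarray data) of how the comparison, nonzero() and index assignment fit together:

import numpy as np

data = np.array([0.1, 0.5, 0.05, 0.9])
boolArray = data < 0.2         # array([ True, False,  True, False])
indices = boolArray.nonzero()  # (array([0, 2]),) - positions below the limit
data[indices] = 0.2            # set those elements to the limit
print(data)                    # [0.2 0.5 0.2 0.9]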

After clipping, the data is then centred by subtracting the baseline value, to give a new base value of zero, and finally scaled (effectively stretched) to restore the original maximum value.

    self.data[channels] -= limit
    self.data[channels] *= maxVal / (maxVal-limit)
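As a quick worked check of the arithmetic (with illustrative numbers, not values from the example data): if limit is 2.0 and maxVal is 10.0, an element clipped at 2.0 maps to (2.0 - 2.0) * 10.0/8.0 = 0.0, while the old maximum maps to (10.0 - 2.0) * 10.0/8.0 = 10.0; the baseline becomes zero and the maximum is restored.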

We now consider various simple normalisation and adjustment methods. The normaliseSd function will scale the data values according to the standard deviation in the measurement; thus we divide all signal values by the standard deviation. We include the scale argument so that the data can also be arbitrarily scaled at the same time. This kind of adjustment is useful when comparing different instances of microarrays, where the actual range of signals (e.g. detected fluorescence values) probably ought to be the same on different occasions, but where there is variation in magnitude simply because of the way the microarray is constructed. For example, one microarray might have a systematically larger amount of substrate printed on it compared to another. If there is confidence that different arrays are showing the same range of underlying data then this normalisation, according to standard deviation, is reasonable. The operation is done separately on all of the data layers (one for each channel), though we could extend the function to accept only a limited number of channels. The data is multiplied and divided appropriately in an element-by-element manner in the NumPy array. The standard deviation is obtained using the .std() method inbuilt into the NumPy array objects, as we describe in Chapter 22.

def normaliseSd(self, scale=1.0):

    for i in range(self.nChannels):
        self.data[i] = self.data[i] * scale / self.data[i].std()

Another similar kind of normalisation is simply to make sure that microarray values are scaled relative to their mean value. This is done for similar reasons as above, but rather than saying the variation in values is the same across different arrays, we assume that the mean values should be the same, or nearly so. This is indeed a reasonable assumption in many situations.

def normaliseMean(self, scale=1.0):

    for i in range(self.nChannels):
        self.data[i] = self.data[i] * scale / self.data[i].mean()

If we do one of the above normalisations then it is often handy to represent the data values as positive and negative numbers either side of the mean value, so we can see what is above or below the average, rather than as a merely positive intensity. This is readily achieved by subtracting the mean value from all the data:

def centerMean(self):

    for i in range(self.nChannels):
        self.data[i] -= self.data[i].mean()



Combining the centring of the data and scaling by the standard deviation we get a commonly used operation called Z-score normalisation, as we discuss in Chapter 22. All this really means is that we move the data values so that they are centred at zero and scaled to put the standard deviation at the values ±1.0.

def normaliseZscore(self):

    self.centerMean()
    self.normaliseSd()
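As a quick sanity check of the Z-score result (a sketch assuming testArray is a Microarray object like the one loaded at the end of this section), each channel should end up with a mean of zero and a standard deviation of one:

testArray.normaliseZscore()

for i in range(testArray.nChannels):
    print(testArray.data[i].mean())  # close to 0.0
    print(testArray.data[i].std())   # close to 1.0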

Another kind of normalisation is to scale the values to some upper limit, e.g. so they are at most 1.0. This is done by dividing by the maximum value. This operation is useful if you know what the maximum value in the array is, or tends towards. For example, this could be a strong reference signal for an element acting as a positive control. Here we add the option to consider either each data layer separately (perChannel=True) or the maximum value from all the data (perChannel=False).

def normaliseMax(self, scale=1.0, perChannel=True):

    if perChannel:
        for i in range(self.nChannels):
            self.data[i] = self.data[i] * scale / self.data[i].max()

    else:
        self.data = self.data * scale / self.data.max()
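For example (a hypothetical call, reusing the testArray object loaded at the end of this section), normalising all the channels jointly preserves the relative brightness between channels:

# scale so the single brightest element in any channel becomes 1.0
testArray.normaliseMax(perChannel=False)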



Likewise we could normalise the rows separately, relative to the maximum values in each row. Here the data values are divided by an array representing the maximum value along only one axis, hence axis=1, to get the maximum of the values (over the column positions) for each row. The slice notation [:,None] is a convenient way of changing what would otherwise be a one-dimensional array of maxima, in one long row vector, into a two-dimensional array with several rows (one number in each) and one long column. This means the division then scales each row of values by a different number.

def normaliseRowMax(self, scale=1.0):

    for i in range(self.nChannels):
        self.data[i] = self.data[i] * scale / self.data[i].max(axis=1)[:,None]
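The [:,None] reshaping is easier to see in isolation. Here is a minimal standalone sketch (with a small made-up array, independent of the Microarray class):

import numpy as np

a = np.arange(6.0).reshape(2, 3)  # [[0. 1. 2.], [3. 4. 5.]]
rowMax = a.max(axis=1)            # [2. 5.] - shape (2,), one maximum per row
colVec = rowMax[:, None]          # shape (2, 1), a column of row maxima
print(a / colVec)                 # each row is divided by its own maximum
# [[0.   0.5  1. ]
#  [0.6  0.8  1. ]]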

Row normalisation is useful where each row represents a different kind of data. An example would be where each column corresponds to a different nucleotide (or amino acid) sequence change in one molecule and each row represents a set of target molecules that are being bound. By normalising by row we will get the relative signal that illustrates which sequence changes give the best binding to each target. This may be useful in finding the change that leads to optimal binding for several targets, but naturally we lose information about the comparative strength of binding between targets.

Alternatively we can normalise the rows so they are scaled according to their mean value.

def normaliseRowMean(self, scale=1.0):

    for i in range(self.nChannels):
        self.data[i] = self.data[i] * scale / self.data[i].mean(axis=1)[:,None]

The same operations can be done for the columns in the data; it just depends on how the array is arranged. Note that here we use axis=0 and so don't have to convert the array of maxima into a column vector.

def normaliseColMax(self, scale=1.0):

    for i in range(self.nChannels):
        self.data[i] = self.data[i] * scale / self.data[i].max(axis=0)

def normaliseColMean(self, scale=1.0):

    for i in range(self.nChannels):
        self.data[i] = self.data[i] * scale / self.data[i].mean(axis=0)

If the microarray contains elements that represent control values, i.e. where you know what the expected results for these are, then you can scale the whole array relative to the known signal for these reference points. Here the reference values are specified by passing in the lists of the rows and columns they are found in, which are used to extract the corresponding values from the data. Taking a slice from self.data using separate row and column specifications [rows, cols], rather than finding specific indices, may seem odd, but is often convenient in NumPy (e.g. this is what .nonzero() gives). The [rows, cols] notation will affect all of the elements where the paired indices coincide, so can actually be more efficient than stating separate coordinates. The mean of the reference values is then used to divide the whole array. To add a bit of diversity we revert to allowing the channels to be specified to state which data layers to operate on.

def normaliseRefs(self, rows, cols, scale=1.0, channels=None):

    if not channels:
        channels = range(self.nChannels)

    channels = tuple(channels)
    refValues = self.data[channels, rows, cols]

    for i in channels:
        self.data[i] = self.data[i] * scale / refValues[i].mean()
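The paired [rows, cols] indexing used above is a general NumPy feature worth seeing in isolation; here is a minimal standalone sketch (unrelated to the Microarray class):

import numpy as np

m = np.arange(9).reshape(3, 3)  # [[0 1 2], [3 4 5], [6 7 8]]
rows = [0, 1, 2]
cols = [2, 0, 1]
print(m[rows, cols])            # [2 3 7] - one element per (row, col) pair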

A different way of normalising data values, especially for fluorescence intensity data, is to convert them into a logarithmic scale, effectively compressing their dynamic range. Note that we clip the baseline to remove any negative values and add 1.0 so we don't take the logarithm of any zero values. After conversion to a log scale we can then apply other normalisation techniques to compare different microarrays or rows etc. (Obviously we cannot do centerMean() before the log conversion, because we want all values to be positive.)

def normaliseLogMean(self):

    self.clipBaseline(threshold=0.0)

    for i in range(self.nChannels):
        # log here is NumPy's elementwise logarithm (e.g. from numpy
        # import log), so that it operates on whole arrays
        self.data[i] = log( 1.0 + self.data[i] / self.data[i].mean() )
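To see the compression of dynamic range, here is a standalone sketch with made-up intensities spanning three orders of magnitude:

import numpy as np

signal = np.array([1.0, 10.0, 100.0, 1000.0])
logSignal = np.log(1.0 + signal / signal.mean())
print(logSignal)  # roughly [0.0036 0.0354 0.3075 1.5261] - a much narrower spread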

We can test all of the above by creating a Microarray object named testArray that uses some example data from a text file. We illustrate the result of the various normalisation methods by writing out an image at each stage.

testArray = loadDataMatrix('examples/microarrayData.txt', 'Test')
testArray.makeImage(25).save('testArray.png')

# Log normalise
testArray.normaliseLogMean()
testArray.makeImage(25).save('normaliseLogMean.png')

# Normalise to max and clip
testArray.resetData()
testArray.normaliseMax()
testArray.clipBaseline(0.5)
testArray.makeImage(25).save('clipBaseline.png')

# Normalise to standard deviation
testArray.resetData()
print('Initial SD:', testArray.data.std())
testArray.normaliseSd()
print('Final SD:', testArray.data.std())

Another  handy  way  to  do  normalisation,  albeit  in  a  less  scientific  way,  is  to  perform

