Python Programming for Biology: Bioinformatics and Beyond



Download 7,75 Mb.
Pdf ko'rish
bet297/514
Sana30.12.2021
Hajmi7,75 Mb.
#91066
1   ...   293   294   295   296   297   298   299   300   ...   514
Bog'liq
[Tim J. Stevens, Wayne Boucher] Python Programming

Peak fitting

In order to fit the peak shape, there has to be a model of what the peak shape should be.

For  exponentially  decaying  data,  the  real  part  of  a  continuous  Fourier  transform  of  that

data has what is called a Lorentzian line shape, in any given dimension:

Here A is the amplitude, ω

0

is the position and λ is the linewidth of the peak, which is the



full width of the peak at half-height:

In practice, the data is not continuous, it is discrete, and the discrete Fourier transform is

carried  out  instead  of  the  continuous  one.  This  leads  to  a  slightly  different,  and  more

complicated, line shape than the above, but in practice the difference is small enough close

to  the  central  position  that  the  Lorentzian  line  shape  functional  form  can  be  used  in  the

fitting, and that is what will be done here.

Peaks  with  more  than  one  dimension  have  a  single  height  (amplitude),  but  there  is  a

position  and  a  linewidth  for  each  dimension.  Thus  for  a  Lorentzian  line  shape  the

theoretical value of the data can be modelled as being a product of the amplitude and of

the line shape in each dimension (the index i represents one of the dimensions):

The fit() function is added to the Peak class so that it can be called on a peak  object  via



peak.fit(fitWidth). As discussed in

Chapter 7,

because this function belongs inside a class

definition it can use the self variable to refer to the actual peak object. The first thing the

function does is to define a region which is fitWidth points either side of the integer peak

position  in  each  dimension,  talking  care  that  the  region  doesn’t  fall  off  the  edge  of  the

array, and then it gets self.fitData, the data values in the specified region.  An  alternative

approach would be for a user to specify the region via a graphical user interface.

from scipy import optimize

def fit(self, fitWidth=2):

region = []

numPoints = self.data.shape

for dim, point in enumerate(peak.position):

start = max(point-fitWidth, 0)

end = min(point+fitWidth+1, numPoints[dim])

region.append( (start, end) )

The  data  is  normalised  by  the  peak  height,  to  make  the  fitting  work  on  data  that  has

maximum value at 1, rather than an arbitrary-sized number.

self.fitData = self._getRegionData(region) / self.dataHeight

The  SciPy  function  optimize.fmin  is  used  to  determine  the  optimum  fit.  It  requires  a

fitting function and a list of starting values for the parameters. Here the parameters are the

amplitudeScale  (relative  to  the  peak  height)  and,  for  each  dimension,  an  offset  from  the

peak position and a linewidth scale.

amplitudeScale = 1.0

offset = 0.0

linewidthScale = 1.0

These  are  inserted  into  the  params  list.  The  fitting  function  is  an  unnamed  lambda

function (see

Chapter 5

) that allows the data region to be passed into the fitting function as

an argument before its actual execution. An alternative would be to store the region on the

peak object. There are several optional arguments for optimize.fmin, including xtol, which

allows a specification of how precise the result needs to be.

ndim = self.data.ndim

params = [amplitudeScale]

params.extend(ndim*[offset])

params.extend(ndim*[linewidthScale])

fitFunc = lambda params: self._fitFunc(region, params)

result = optimize.fmin(fitFunc, params, xtol=0.01)

The result comes back as a NumPy array, which is then split into the amplitude, offset

and linewidth parts using the appropriate slices.

amplitudeScale = result[0]

offset = result[1:ndim+1]

linewidthScale = result[ndim+1:]




Although  they  do  not  have  to  be,  the  values  are  converted  to  ordinary  Python  types

using the float() and list() functions and stored as attributes on the peak object.

peak.fitAmplitude = float(amplitudeScale * peak.dataHeight)

peak.fitPosition = list(peak.position + offset)

peak.fitLinewidth = list(linewidthScale * peak.linewidth)

The function _getRegionData does the required extraction of the data in the region from

the full data set. The region determines the selection by using the standard Python slice()

function, which allows us to specify an index which goes from start to end. This is done in

each dimension and the result needs to be converted to a tuple for use as a NumPy index.

def _getRegionData(self, region):

slices = tuple([slice(start, end) for start, end in region])

return self.data[slices]

The  fit-testing  function  _fitFunc  is  where  the  real  work  is  done.  The  SciPy  function

optimize.fmin  calls  this  repeatedly  with  a  specified  choice  of  params  to  determine  how

good  the  fit  is,  until  either  convergence  to  the  solution  is  achieved  or  the  number  of

iterations  reaches  its  maximum  limit.  The  (fixed)  region  is  also  passed  into  the  fitting

function because of the way we defined it as a lambda function. The fitting function first

unpacks  the  params  list  into  the  corresponding  parameters  and  sliceData  is  initially  set

with zeros, but will eventually contain the line shapes for each dimension.

from numpy import array, zeros

def _fitFunc(self, region, params):

ndim = self.data.ndim

amplitudeScale = params[0]

offset = params[1:1+ndim]

linewidthScale = params[1+ndim:]

sliceData = ndim * [0]

A for loop goes  though each data  dimension, calculating test  values for  linewidth  and

testPos by using the input test parameters to adjust the values stored on the peak object.

The start and end points for the dimension are extracted.

for dim in range(ndim):

linewidth = linewidthScale[dim] * self.linewidth[dim]

testPos = offset[dim] + self.position[dim]

(start, end) = region[dim]

A check is made that linewidth > 0 because there are some situations where the SciPy

optimize.fmin function will set the linewidthScale[dim]  parameter  to  be  negative.  If  it  is

negative then the line shape will be zero (in all dimensions, given the final product). An

alternative would be to use the absolute value of this parameter. The Lorentzian line shape

is then calculated (see above equation) for each dimension, in the specified region.

if linewidth > 0:



x = array(range(start, end))

x = (x – testPos) / linewidth

slice1d = 1.0 / (1.0 + 4.0*x*x)

else:


slice1d = zeros(end-start)

sliceData[dim] = slice1d

The final heights for the shape are calculated by multiplying the amplitude by the multi-

dimensional  outer  product  (see  below)  of  the  one-dimensional  shapes  contained  in

sliceData.  Lastly,  the  test-fit  function  must  return  the  difference  between  the  theoretical

line shape and the actual data. Here the root-mean-square difference is used, but that is not

mandatory (we could just use the square).

heights = amplitudeScale * self._outerProduct(sliceData)

diff2 = ((heights-self.fitData)**2).mean()

return sqrt(diff2)

The  individual  line  shapes  are  one-dimensional,  and  the  product  of  these  across  the

different  dimensions  is  required  to  get  the  full  theoretical  shape.  NumPy  has  a  function

outer which will do the appropriate multiplication in two dimensions. In order to do it in

an arbitrary number of dimensions, the outer function needs to be called first on the first

two  dimensions,  then  next  on  that  result  and  the  shape  for  the  third  dimension  etc.  The

result is a two-dimensional array so finally the data needs to be reshaped into the correct

number of dimensions. This is all accomplished in the function _outerProduct():

from numpy import outer

def _outerProduct(self, data):

size = [d.shape[0] for d in data]

product = data[0]

for dim in range(1, len(size)):

product = outer(product, data[dim])

product = product.reshape(size)

return product

There  are  many  improvements  that  could  be  made  to  the  code.  Different  line  shapes

could be allowed to be fit, for example, a ‘Gaussian’ one (where the exponential decay is

squared). A constant vertical offset (constant baseline), or more complicated descriptions

of the baseline, could be fitted to the data. It could be that one or more of the parameters

are somehow known, so should not be fitted. If the peaks overlap then more than one peak

could be fitted at a time, though the number of parameters is proportional to the number of

peaks, and so fitting more than one peak increases the risk of having too many degrees of

freedom with too little data.


Download 7,75 Mb.

Do'stlaringiz bilan baham:
1   ...   293   294   295   296   297   298   299   300   ...   514




Ma'lumotlar bazasi mualliflik huquqi bilan himoyalangan ©hozir.org 2024
ma'muriyatiga murojaat qiling

kiriting | ro'yxatdan o'tish
    Bosh sahifa
юртда тантана
Боғда битган
Бугун юртда
Эшитганлар жилманглар
Эшитмадим деманглар
битган бодомлар
Yangiariq tumani
qitish marakazi
Raqamli texnologiyalar
ilishida muhokamadan
tasdiqqa tavsiya
tavsiya etilgan
iqtisodiyot kafedrasi
steiermarkischen landesregierung
asarlaringizni yuboring
o'zingizning asarlaringizni
Iltimos faqat
faqat o'zingizning
steierm rkischen
landesregierung fachabteilung
rkischen landesregierung
hamshira loyihasi
loyihasi mavsum
faolyatining oqibatlari
asosiy adabiyotlar
fakulteti ahborot
ahborot havfsizligi
havfsizligi kafedrasi
fanidan bo’yicha
fakulteti iqtisodiyot
boshqaruv fakulteti
chiqarishda boshqaruv
ishlab chiqarishda
iqtisodiyot fakultet
multiservis tarmoqlari
fanidan asosiy
Uzbek fanidan
mavzulari potok
asosidagi multiservis
'aliyyil a'ziym
billahil 'aliyyil
illaa billahil
quvvata illaa
falah' deganida
Kompyuter savodxonligi
bo’yicha mustaqil
'alal falah'
Hayya 'alal
'alas soloh
Hayya 'alas
mavsum boyicha


yuklab olish