Python Programming for Biology: Bioinformatics and Beyond

Download 7,75 Mb.

Pdf ko'rish

bet	297/514
Sana	30.12.2021
Hajmi	7,75 Mb.
	#91066

1 ... 293 294 295 296 297 298 299 300 ... 514

Bog'liq
[Tim J. Stevens, Wayne Boucher] Python Programming

Peak fitting

In order to fit the peak shape, there has to be a model of what the peak shape should be.

For exponentially decaying data, the real part of a continuous Fourier transform of that

data has what is called a Lorentzian line shape, in any given dimension:

Here A is the amplitude, ω

is the position and λ is the linewidth of the peak, which is the

full width of the peak at half-height:

In practice, the data is not continuous, it is discrete, and the discrete Fourier transform is

carried out instead of the continuous one. This leads to a slightly different, and more

complicated, line shape than the above, but in practice the difference is small enough close

to the central position that the Lorentzian line shape functional form can be used in the

fitting, and that is what will be done here.

Peaks with more than one dimension have a single height (amplitude), but there is a

position and a linewidth for each dimension. Thus for a Lorentzian line shape the

theoretical value of the data can be modelled as being a product of the amplitude and of

the line shape in each dimension (the index i represents one of the dimensions):

The fit() function is added to the Peak class so that it can be called on a peak object via

peak.fit(fitWidth). As discussed in

Chapter 7,

because this function belongs inside a class

definition it can use the self variable to refer to the actual peak object. The first thing the

function does is to define a region which is fitWidth points either side of the integer peak

position in each dimension, talking care that the region doesn’t fall off the edge of the

array, and then it gets self.fitData, the data values in the specified region. An alternative

approach would be for a user to specify the region via a graphical user interface.

from scipy import optimize

def fit(self, fitWidth=2):

region = []

numPoints = self.data.shape

for dim, point in enumerate(peak.position):

start = max(point-fitWidth, 0)

end = min(point+fitWidth+1, numPoints[dim])

region.append( (start, end) )

The data is normalised by the peak height, to make the fitting work on data that has

maximum value at 1, rather than an arbitrary-sized number.

self.fitData = self._getRegionData(region) / self.dataHeight

The SciPy function optimize.fmin is used to determine the optimum fit. It requires a

fitting function and a list of starting values for the parameters. Here the parameters are the

amplitudeScale (relative to the peak height) and, for each dimension, an offset from the

peak position and a linewidth scale.

amplitudeScale = 1.0

offset = 0.0

linewidthScale = 1.0

These are inserted into the params list. The fitting function is an unnamed lambda

function (see

Chapter 5

) that allows the data region to be passed into the fitting function as

an argument before its actual execution. An alternative would be to store the region on the

peak object. There are several optional arguments for optimize.fmin, including xtol, which

allows a specification of how precise the result needs to be.

ndim = self.data.ndim

params = [amplitudeScale]

params.extend(ndim*[offset])

params.extend(ndim*[linewidthScale])

fitFunc = lambda params: self._fitFunc(region, params)

result = optimize.fmin(fitFunc, params, xtol=0.01)

The result comes back as a NumPy array, which is then split into the amplitude, offset

and linewidth parts using the appropriate slices.

amplitudeScale = result[0]

offset = result[1:ndim+1]

linewidthScale = result[ndim+1:]

Although they do not have to be, the values are converted to ordinary Python types

using the float() and list() functions and stored as attributes on the peak object.

peak.fitAmplitude = float(amplitudeScale * peak.dataHeight)

peak.fitPosition = list(peak.position + offset)

peak.fitLinewidth = list(linewidthScale * peak.linewidth)

The function _getRegionData does the required extraction of the data in the region from

the full data set. The region determines the selection by using the standard Python slice()

function, which allows us to specify an index which goes from start to end. This is done in

each dimension and the result needs to be converted to a tuple for use as a NumPy index.

def _getRegionData(self, region):

slices = tuple([slice(start, end) for start, end in region])

return self.data[slices]

The fit-testing function _fitFunc is where the real work is done. The SciPy function

optimize.fmin calls this repeatedly with a specified choice of params to determine how

good the fit is, until either convergence to the solution is achieved or the number of

iterations reaches its maximum limit. The (fixed) region is also passed into the fitting

function because of the way we defined it as a lambda function. The fitting function first

unpacks the params list into the corresponding parameters and sliceData is initially set

with zeros, but will eventually contain the line shapes for each dimension.

from numpy import array, zeros

def _fitFunc(self, region, params):

ndim = self.data.ndim

amplitudeScale = params[0]

offset = params[1:1+ndim]

linewidthScale = params[1+ndim:]

sliceData = ndim * [0]

A for loop goes though each data dimension, calculating test values for linewidth and

testPos by using the input test parameters to adjust the values stored on the peak object.

The start and end points for the dimension are extracted.

for dim in range(ndim):

linewidth = linewidthScale[dim] * self.linewidth[dim]

testPos = offset[dim] + self.position[dim]

(start, end) = region[dim]

A check is made that linewidth > 0 because there are some situations where the SciPy

optimize.fmin function will set the linewidthScale[dim] parameter to be negative. If it is

negative then the line shape will be zero (in all dimensions, given the final product). An

alternative would be to use the absolute value of this parameter. The Lorentzian line shape

is then calculated (see above equation) for each dimension, in the specified region.

if linewidth > 0:

x = array(range(start, end))

x = (x – testPos) / linewidth

slice1d = 1.0 / (1.0 + 4.0*x*x)

else:

slice1d = zeros(end-start)

sliceData[dim] = slice1d

The final heights for the shape are calculated by multiplying the amplitude by the multi-

dimensional outer product (see below) of the one-dimensional shapes contained in

sliceData. Lastly, the test-fit function must return the difference between the theoretical

line shape and the actual data. Here the root-mean-square difference is used, but that is not

mandatory (we could just use the square).

heights = amplitudeScale * self._outerProduct(sliceData)

diff2 = ((heights-self.fitData)**2).mean()

return sqrt(diff2)

The individual line shapes are one-dimensional, and the product of these across the

different dimensions is required to get the full theoretical shape. NumPy has a function

outer which will do the appropriate multiplication in two dimensions. In order to do it in

an arbitrary number of dimensions, the outer function needs to be called first on the first

two dimensions, then next on that result and the shape for the third dimension etc. The

result is a two-dimensional array so finally the data needs to be reshaped into the correct

number of dimensions. This is all accomplished in the function _outerProduct():

from numpy import outer

def _outerProduct(self, data):

size = [d.shape[0] for d in data]

product = data[0]

for dim in range(1, len(size)):

product = outer(product, data[dim])

product = product.reshape(size)

return product

There are many improvements that could be made to the code. Different line shapes

could be allowed to be fit, for example, a ‘Gaussian’ one (where the exponential decay is

squared). A constant vertical offset (constant baseline), or more complicated descriptions

of the baseline, could be fitted to the data. It could be that one or more of the parameters

are somehow known, so should not be fitted. If the peaks overlap then more than one peak

could be fitted at a time, though the number of parameters is proportional to the number of

peaks, and so fitting more than one peak increases the risk of having too many degrees of

freedom with too little data.

Download 7,75 Mb.

Do'stlaringiz bilan baham:

1 ... 293 294 295 296 297 298 299 300 ... 514