Peak fitting
In order to fit the peak shape, there has to be a model of what the peak shape should be.
For exponentially decaying data, the real part of a continuous Fourier transform of that
data has what is called a Lorentzian line shape, in any given dimension:
Here A is the amplitude, ω
0
is the position and λ is the linewidth of the peak, which is the
full width of the peak at half-height:
In practice, the data is not continuous, it is discrete, and the discrete Fourier transform is
carried out instead of the continuous one. This leads to a slightly different, and more
complicated, line shape than the above, but in practice the difference is small enough close
to the central position that the Lorentzian line shape functional form can be used in the
fitting, and that is what will be done here.
Peaks with more than one dimension have a single height (amplitude), but there is a
position and a linewidth for each dimension. Thus for a Lorentzian line shape the
theoretical value of the data can be modelled as being a product of the amplitude and of
the line shape in each dimension (the index i represents one of the dimensions):
The fit() function is added to the Peak class so that it can be called on a peak object via
peak.fit(fitWidth). As discussed in
Chapter 7,
because this function belongs inside a class
definition it can use the self variable to refer to the actual peak object. The first thing the
function does is to define a region which is fitWidth points either side of the integer peak
position in each dimension, talking care that the region doesn’t fall off the edge of the
array, and then it gets self.fitData, the data values in the specified region. An alternative
approach would be for a user to specify the region via a graphical user interface.
from scipy import optimize
def fit(self, fitWidth=2):
region = []
numPoints = self.data.shape
for dim, point in enumerate(peak.position):
start = max(point-fitWidth, 0)
end = min(point+fitWidth+1, numPoints[dim])
region.append( (start, end) )
The data is normalised by the peak height, to make the fitting work on data that has
maximum value at 1, rather than an arbitrary-sized number.
self.fitData = self._getRegionData(region) / self.dataHeight
The SciPy function optimize.fmin is used to determine the optimum fit. It requires a
fitting function and a list of starting values for the parameters. Here the parameters are the
amplitudeScale (relative to the peak height) and, for each dimension, an offset from the
peak position and a linewidth scale.
amplitudeScale = 1.0
offset = 0.0
linewidthScale = 1.0
These are inserted into the params list. The fitting function is an unnamed lambda
function (see
Chapter 5
) that allows the data region to be passed into the fitting function as
an argument before its actual execution. An alternative would be to store the region on the
peak object. There are several optional arguments for optimize.fmin, including xtol, which
allows a specification of how precise the result needs to be.
ndim = self.data.ndim
params = [amplitudeScale]
params.extend(ndim*[offset])
params.extend(ndim*[linewidthScale])
fitFunc = lambda params: self._fitFunc(region, params)
result = optimize.fmin(fitFunc, params, xtol=0.01)
The result comes back as a NumPy array, which is then split into the amplitude, offset
and linewidth parts using the appropriate slices.
amplitudeScale = result[0]
offset = result[1:ndim+1]
linewidthScale = result[ndim+1:]
Although they do not have to be, the values are converted to ordinary Python types
using the float() and list() functions and stored as attributes on the peak object.
peak.fitAmplitude = float(amplitudeScale * peak.dataHeight)
peak.fitPosition = list(peak.position + offset)
peak.fitLinewidth = list(linewidthScale * peak.linewidth)
The function _getRegionData does the required extraction of the data in the region from
the full data set. The region determines the selection by using the standard Python slice()
function, which allows us to specify an index which goes from start to end. This is done in
each dimension and the result needs to be converted to a tuple for use as a NumPy index.
def _getRegionData(self, region):
slices = tuple([slice(start, end) for start, end in region])
return self.data[slices]
The fit-testing function _fitFunc is where the real work is done. The SciPy function
optimize.fmin calls this repeatedly with a specified choice of params to determine how
good the fit is, until either convergence to the solution is achieved or the number of
iterations reaches its maximum limit. The (fixed) region is also passed into the fitting
function because of the way we defined it as a lambda function. The fitting function first
unpacks the params list into the corresponding parameters and sliceData is initially set
with zeros, but will eventually contain the line shapes for each dimension.
from numpy import array, zeros
def _fitFunc(self, region, params):
ndim = self.data.ndim
amplitudeScale = params[0]
offset = params[1:1+ndim]
linewidthScale = params[1+ndim:]
sliceData = ndim * [0]
A for loop goes though each data dimension, calculating test values for linewidth and
testPos by using the input test parameters to adjust the values stored on the peak object.
The start and end points for the dimension are extracted.
for dim in range(ndim):
linewidth = linewidthScale[dim] * self.linewidth[dim]
testPos = offset[dim] + self.position[dim]
(start, end) = region[dim]
A check is made that linewidth > 0 because there are some situations where the SciPy
optimize.fmin function will set the linewidthScale[dim] parameter to be negative. If it is
negative then the line shape will be zero (in all dimensions, given the final product). An
alternative would be to use the absolute value of this parameter. The Lorentzian line shape
is then calculated (see above equation) for each dimension, in the specified region.
if linewidth > 0:
x = array(range(start, end))
x = (x – testPos) / linewidth
slice1d = 1.0 / (1.0 + 4.0*x*x)
else:
slice1d = zeros(end-start)
sliceData[dim] = slice1d
The final heights for the shape are calculated by multiplying the amplitude by the multi-
dimensional outer product (see below) of the one-dimensional shapes contained in
sliceData. Lastly, the test-fit function must return the difference between the theoretical
line shape and the actual data. Here the root-mean-square difference is used, but that is not
mandatory (we could just use the square).
heights = amplitudeScale * self._outerProduct(sliceData)
diff2 = ((heights-self.fitData)**2).mean()
return sqrt(diff2)
The individual line shapes are one-dimensional, and the product of these across the
different dimensions is required to get the full theoretical shape. NumPy has a function
outer which will do the appropriate multiplication in two dimensions. In order to do it in
an arbitrary number of dimensions, the outer function needs to be called first on the first
two dimensions, then next on that result and the shape for the third dimension etc. The
result is a two-dimensional array so finally the data needs to be reshaped into the correct
number of dimensions. This is all accomplished in the function _outerProduct():
from numpy import outer
def _outerProduct(self, data):
size = [d.shape[0] for d in data]
product = data[0]
for dim in range(1, len(size)):
product = outer(product, data[dim])
product = product.reshape(size)
return product
There are many improvements that could be made to the code. Different line shapes
could be allowed to be fit, for example, a ‘Gaussian’ one (where the exponential decay is
squared). A constant vertical offset (constant baseline), or more complicated descriptions
of the baseline, could be fitted to the data. It could be that one or more of the parameters
are somehow known, so should not be fitted. If the peaks overlap then more than one peak
could be fitted at a time, though the number of parameters is proportional to the number of
peaks, and so fitting more than one peak increases the risk of having too many degrees of
freedom with too little data.
Do'stlaringiz bilan baham: |