Figure 23.7. Principal component analysis of 2D data and its projection onto a
principal axis. As illustrated for a two-dimensional example, the first principal
component (PC 1) is the single linear combination of features (i.e. a direction relative to
the axes) that explains most of the variance in the data. Subsequent principal components
represent orthogonal directions (i.e. at right angles to the other components) that maximise
the remaining variance not covered by earlier principal components, although for this
two-dimensional example PC 2 is entirely determined by PC 1, since only one orthogonal direction remains.
Projecting a data set onto its most important principal component axes allows the
dimensionality of the data set to be reduced, while still preserving as much of the
variance as possible. This can be useful for visualisation and for reducing the complexity
of high-dimensional data sets.
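As a minimal sketch (not part of the book's worked example), such a projection might be computed directly with NumPy by taking the eigenvectors of the covariance matrix and projecting the centred data onto the leading one; the variable names here are illustrative only:

from numpy import cov, dot, linalg, random

data = random.normal(0.0, 1.0, (100, 2))
data[:,1] += 0.8 * data[:,0]                 # add some correlation between columns
centred = data - data.mean(axis=0)           # centre the data on the origin
evals, evecs = linalg.eigh(cov(centred.T))   # eigen-decompose the covariance matrix
order = evals.argsort()[::-1]                # order components by decreasing variance
pc1 = evecs[:,order[0]]                      # first principal component direction
projected = dot(centred, pc1)                # 1D projection of the data onto PC 1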
The LDA optimisation is achieved by finding the matrix (and hence orientation) that
maximises the separation (scatter) between the data sets, relative to the separation within
each data set. The scatter within is simply the sum of the covariance matrices of the
individual data sets, weighted by their sizes, and the scatter between is the difference between their means. In essence we
scale the line between the two means of the data sets by the combined size of the data
scatter for each dimension. The resultant matrix can be used to transform the data into a
new orientation for easy discrimination.
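Stated compactly (a restatement for reference, not from the original text, writing $\mathbf{S}_W$ for the within-class scatter and $\boldsymbol{\mu}_A$, $\boldsymbol{\mu}_B$ for the class centres), the two-class discriminant direction is

$$\mathbf{w} = \mathbf{S}_W^{-1}\,(\boldsymbol{\mu}_A - \boldsymbol{\mu}_B)$$

which is what the code below computes with linalg.inv() and dot().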
First, a Python function is defined that takes two data sets; these could be the
results of a clustering operation or a previously known classification. The NumPy
functions used below are imported beforehand.
from numpy import array, cov, dot, linalg, random

def twoClassLda(dataA, dataB):
Within the function, we first find the averages (centres) of the data sets by taking the
mean along the first axis, i.e. adding the data vectors together and dividing by the total number of points.
    meanA = dataA.mean(axis=0)
    meanB = dataB.mean(axis=0)
Then the cov() function is used to calculate the covariance matrix for each data set
(a measure of how the data dimensions vary together).
    covA = cov(dataA.T)
    covB = cov(dataB.T)
Then the number of points in each data set, less one, is calculated. The subtraction is
present because a single data point gives no scatter, so we count the points beyond the
first one; multiplying each covariance matrix by this factor recovers the unnormalised
scatter, matching the n-1 normalisation used by cov().
    nA = len(dataA) - 1.0
    nB = len(dataB) - 1.0
The scatter within the categories is defined as the sum of the covariance matrices,
weighted according to the sizes of the data sets. The scatter between categories is
simply the separation between the data sets, i.e. the vector from one data centre to
the other.
    scatterWithin = nA * covA + nB * covB
    scatterBetween = meanA - meanB
The discrimination direction between the data sets is the line between the centres
(scatterBetween) 'divided' by the scatter within the data, i.e. multiplied by the inverse
of the within-scatter matrix, for each dimension.
    discrim = dot(linalg.inv(scatterWithin), scatterBetween)
The data sets are transformed using the discrimination matrix, reorienting them along
the line of best separation. These are passed back at the end for inspection.
    transfA = dot(dataA, discrim.T)
    transfB = dot(dataB, discrim.T)
The best guess for the dividing point that separates the data sets is the average of the
two data centres, projected (transformed) to lie along the discriminating direction.
    divide = dot(discrim, (meanA+meanB)) / 2.0

    return transfA, transfB, divide
Here we test the LDA function with two normally distributed, random data sets. One
has a small spread and is moved to the side (by adding array([-10.0,5.0])) and the other
has a wider spread. The two sets should overlap (intermingle).
testData1 = random.normal(0.0, 2.0, (100,2)) + array([-10.0,5.0])
testData2 = random.normal(0.0, 6.0, (100,2))
The test sets can be visualised with matplotlib in the usual way:
from matplotlib import pyplot
x, y = zip(*testData1)
pyplot.scatter(x, y, s=25, c='#404040', marker='o')
x, y = zip(*testData2)
pyplot.scatter(x, y, s=25, c='#FFFFFF', marker='^')
pyplot.show()
Running the function on these data sets, we get two arrays, proj1 and proj2, containing
the transformed data, plus a dividing value, div. These arrays are one-dimensional and
represent the projection of our two data sets onto the discrimination line. We can plot them
as points along a line, which here we separate for clarity with y-values at 0.5 and -0.5. The
dividing value div can be used to draw a line to show where the LDA has estimated the
best boundary between categories (assuming a symmetric, normal distribution).
proj1, proj2, div = twoClassLda(testData1, testData2)
print(div)
x = proj1
y = [0.5] * len(x)
pyplot.scatter(x, y, s=35, c='#404040', marker='o')
x = proj2
y = [-0.5] * len(x)
pyplot.scatter(x, y, s=35, c='#FFFFFF', marker='^')
pyplot.plot((div, div), (1.0, -1.0))
pyplot.show()
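As an aside, and not part of the original example, the same projection could be used to classify a new point: assuming twoClassLda were modified to also return discrim, the point's projection is simply compared with div, taking whichever side the known class A projections occupy. A minimal, hypothetical sketch:

def classifyPoint(point, discrim, divide, projA):
    # Project the new point onto the discrimination direction
    proj = dot(point, discrim)
    # Class A occupies whichever side of the divide its projections fall on
    sideA = projA.mean() > divide
    return 'A' if (proj > divide) == sideA else 'B'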