As the calculation of all of the principal components of a data set is often not required, e.g. we may only be interested in the first two principal components, the next example function can more efficiently extract the principal components one at a time (most significant first) without calculating a full covariance matrix. Each time a component is extracted the data may be transformed to eliminate that component, so that the next most important component can be extracted. This is an iterative method that converges on the principal component, hence we pass in a precision value that determines how close successive estimates must be before the iterative cycles stop.
from numpy import random, dot, array, outer, zeros, sqrt
def extractPrincipalComponent(data, precision=1e-9):
As before, we extract the number of samples (data items) and features from the input
data. We then calculate the mean (centre) of the data so that it can be moved to the zero
point.
    samples, features = data.shape
    meanVec = data.mean(axis=0)
    dataC = data - meanVec
The initial guess at the principal component pc1 is constructed as a random array, with
the same length as the number of features. We also initialise pc0, which will be the value
from the previous cycle, so that we can check for convergence. Initially pc0 is arbitrarily
set to be different to pc1.
    pc1 = random.random(features)
    pc0 = pc1 - 1.0
The main optimisation involves a while loop that compares successive estimates of the principal component. If their difference, summed over all dimensions, is sufficiently small (within the precision value) the loop will stop.
    while abs((pc0-pc1).sum()) > precision:
Inside the loop the principal component vector is improved upon for the next cycle. This involves going through all the items in the (centred) data, scaling each according to its projection onto the current guess for the principal component (with a dot product) and adding them to a total vector, t. When normalised (scaled to length one) this total vector will be the new estimate for the principal component. Here the dot product gives the coincidence of each data vector with the current principal component estimate. The summation gives a vector that is weighted towards the data items with most coincidence. Convergence occurs because the largest correlation in the data has a biased influence on the weights, increasing the overlap between t and pc1 each cycle until it is at a maximum. At this point pc1 represents a fundamental axis in the data, along which most of the variance occurs.
        t = zeros(features)
        for datum in dataC:
            t += dot(datum, pc1) * datum
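As an aside, and not part of the original function, the same weighted sum can be computed in one vectorised step using a matrix product, which some readers may find clearer; this sketch assumes the dataC and pc1 arrays defined above:

        # Equivalent, vectorised replacement for the loop above:
        # project every centred data item onto pc1, weight and sum at once
        t = dot(dataC.T, dot(dataC, pc1))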
At the end of each cycle the previous principal component estimate is stored as pc0 and
the new estimate is found by scaling the t vector to length one.
        pc0 = pc1
        pc1 = t / sqrt(dot(t,t))
Once there is little improvement in the principal component the cycles stop and the
estimate is returned.
    return pc1
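The function above only extracts the single most significant component. As mentioned at the start of this section, further components can be obtained by removing (deflating) each extracted component from the data and repeating the extraction. The following is a minimal sketch of that idea, not from the original text; the wrapper name extractPrincipalComponents is our own invention and it simply reuses the function defined above:

def extractPrincipalComponents(data, n, precision=1e-9):

    dataC = data - data.mean(axis=0)
    components = []

    for i in range(n):
        pc = extractPrincipalComponent(dataC, precision)
        components.append(pc)

        # Deflate: subtract the projection onto the extracted component,
        # so the next most significant component can be found
        dataC = dataC - outer(dot(dataC, pc), pc)

    return array(components)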
We will test both the full and quick PCA functions with the same data set. This will be a
random data set of 100 points in two dimensions, which is transformed by a shearing
matrix, in order to produce a visible correlation in the data. Naturally we are not limited to
only two-dimensional data, but this is easier to illustrate here.
testData = random.normal(0.0, 2.0, (100,2))
shear = array([[2,1],[1,0]])
testData = dot(testData, shear)
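To confirm that the shear has indeed introduced a correlation between the two dimensions, we can inspect the correlation coefficient matrix; this quick check is our own addition rather than part of the original test:

from numpy import corrcoef

# Off-diagonal values well away from zero indicate correlated dimensions
print(corrcoef(testData.T))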
The PCA is performed by the two functions and the results are compared.
pc1 = extractPrincipalComponent(testData)
print('Quick PC1:', pc1)
basis, energy = principalComponentAnalysis(testData, n=2)
print('Full PCA:', basis, energy)
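The quick estimate and the most significant column of the full PCA basis should describe the same axis, although they may differ in sign. As a rough check, assuming here that the most significant component is the first column of basis (the ordering depends on how principalComponentAnalysis was written), the absolute value of their dot product should be close to one:

# Assumes basis[:,0] is the most significant component; a value near 1.0
# means the two vectors point along the same axis, up to sign
print('Overlap:', abs(dot(pc1, basis[:,0])))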
The test data can be plotted in the usual way and we can visualise the first principal
component (pc1) by drawing a line along its direction. We make a line from pc1 that we can plot by scaling the vector (by a factor of ten, to make it visible) on either side of the centre.
from matplotlib import pyplot
x,y = zip(*testData)
pyplot.scatter(x, y, s=20, c='#F0F0F0', marker='o')
x,y = zip(-10*pc1, 10*pc1)
pyplot.plot(x, y)
To illustrate that the basis matrix transforms the data in the right way we can apply it to
the test data, so that the principal component axes are aligned with the axes of our graph.
transformed = dot(testData, basis)
x,y = zip(*transformed)
pyplot.scatter(x, y, s=10, c='#000000', marker='^')
pyplot.show()
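As a further check, which we add here and which is not in the original listing, the covariance matrix of the transformed data should be close to diagonal, confirming that the principal component axes are now aligned with the graph axes:

from numpy import cov

# Off-diagonal elements near zero show the transformed dimensions are uncorrelated
print(cov(transformed.T))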