Principal component analysis
Principal component analysis (PCA) is a relatively simple but widely used technique to
extract the innate trends within data. The principal components of a data set are the vectors (in the same feature space as the data) along which the data items show the greatest variance, i.e. the directions that best separate the data. Mathematically, the principal components are the eigenvectors of the covariance matrix of the data. Sorting these by the size of their eigenvalues identifies the most significant components, which account for most of the variance. Taking fewer
principal components than the number of dimensions present in the input data set allows
for a lower-dimensionality representation of the data (by projecting it onto these
directions) that still contains the important correlations.
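To make this concrete (the notation here is ours, chosen only to match the code below, and does not appear in the original text): if $X_{c}$ is the mean-centred data matrix with one sample per row and $N$ samples in total, then

$$C = \frac{1}{N-1}\, X_{c}^{\mathsf{T}} X_{c}, \qquad C\,\mathbf{v}_{i} = \lambda_{i}\,\mathbf{v}_{i}, \qquad Y = X_{c}\, W_{n},$$

where $C$ is the covariance matrix of the features, $\mathbf{v}_{i}$ and $\lambda_{i}$ are its eigenvectors and eigenvalues, $W_{n}$ has the $n$ eigenvectors with the largest eigenvalues as its columns, and $Y$ is the lower-dimensional projection of the data.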
When we calculate the principal components of a data set we obtain vectors, directions
in the data, which are orthogonal (perpendicular) to one another. Thus each component
vector represents an independent axis and there can be as many components as there are
data dimensions. Some axes are more significant than others for separating the data, and those are the ones we are usually interested in. We can also consider the principal
component vectors as a transformation, which we can apply to our data: to orient it and
stretch it along these orthogonal axes.
Principal component analysis does have limitations, which the programmer should be
aware of, but it is quick and easy, so often worth a try. A classic example where PCA fails
is for ‘checkerboard’ data, i.e. alternating squares of two categories, where there are no simple axes in the data that separate the categories. In such instances more sophisticated, non-linear techniques, such as support vector machines (see Chapter 24), may be used.
In the Python example for PCA we first make the NumPy imports and define the
function which takes data (as an array of arrays) and the number of principal components
we wish to extract.
from numpy import cov, linalg, sqrt, zeros, ones, diag
def principalComponentAnalysis(data, n=2):
First we get the size of the data, in terms of the number of data items (samples) and the
number of dimensions (features).
  samples, features = data.shape
We calculate the average data item (effectively the centre of the data) by finding the mean along the sample axis (axis=0). We then centre the input data on zero (on each feature axis) by subtracting this average. Note that we then transpose the data with .T, turning it sideways, so that we can calculate the covariance between its features.
  meanVec = data.mean(axis=0)
  dataC = (data - meanVec).T
The covariance is estimated using the cov() function and the eigenvalues and
eigenvectors are extracted with the linalg.eig() function. Here the inbuilt NumPy functions
for dealing with arrays and linear algebra really show their value.
  covar = cov(dataC)
  evals, evecs = linalg.eig(covar)
The resulting eigenvalues, which represent the scaling factors along the eigenvectors,
are sorted by size. Here we use the .argsort() function, which gives the indices of the array
in the order in which the values increase. We then use the cunning trick [::-1] for reversing
NumPy arrays.
The eigenvectors are then reordered by using these indices, to put them
in order of decreasing eigenvalue, i.e. most significant component first.
  indices = evals.argsort()[::-1]
  evecs = evecs[:,indices]
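As a small aside, the sorting trick can be illustrated with a made-up three-value array (the values here are chosen purely for demonstration):

from numpy import array

vals = array([0.5, 2.1, 1.3])
print(vals.argsort())        # [0 2 1] : indices that would sort the values in increasing order
print(vals.argsort()[::-1])  # [1 2 0] : reversed, so the index of the largest value comes first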
We can then take the top eigenvectors as the first n principal components, which we call basis because these are the directions onto which we can project our data. The energy is simply the sum of the top n eigenvalues, a measure of how much of the variance the chosen components explain, which is useful for deciding whether more principal components should be considered.
  basis = evecs[:,:n]
  energy = evals[:n].sum()

  # Optionally normalise with respect to the standard deviations to give z-scores
  #sd = sqrt(diag(covar))
  #zscores = dataC.T / sd

  return basis, energy
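As a minimal usage sketch, which is not part of the original text: the data below is randomly generated purely for illustration, and the variable names (testData, centred, projected) are our own.

from numpy import dot, random

# 100 samples of two correlated features, invented for demonstration
testData = random.multivariate_normal([0.0, 0.0], [[2.0, 1.6], [1.6, 2.0]], 100)

basis, energy = principalComponentAnalysis(testData, n=1)

# Project the mean-centred data onto the principal component direction(s)
centred = testData - testData.mean(axis=0)
projected = dot(centred, basis)

print(projected.shape)   # (100, 1): one coordinate per sample along the first component

Using n=2 instead would keep both directions, and the same dot product would simply rotate the two-dimensional data onto its principal axes: the orientation and stretching transformation described above.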