The ‘jump’ method
Next we will consider the situation where you do not know in advance how many clusters
the data should be grouped into, but do not wish to use a threshold-based method. Here we
build on top of the k-means method to try various numbers of clusters
and determine which is the most meaningful. The example function uses what we will term the
‘jump’ method.
It will perform k-means clustering with different values of k and then
assess which of the steps between increasing values of k represents the best compromise
between how well the data is fitted and the complexity of the solution (the number of clusters).
For a given trial value of k, each trial clustering attempt involves the
calculation of what we term the ‘distortion’, which in essence is a measure of the average
spread of data points from the cluster centres, over all the clusters. This spread is adjusted
for the covariance in the data, which indicates how much variation there is in the values
for the various axes, so that the relative scale of the different features does not matter.
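Written as an equation (a sketch based on the description above and the code that follows; the symbols are introduced here and are not part of the original text), the distortion for one clustering attempt with k clusters is

d_k = \frac{1}{p\,k} \sum_{i=1}^{k} \frac{1}{n_i} \sum_{x \in C_i} (x - c_i)^{\mathsf{T}}\,\Gamma^{-1}\,(x - c_i)

where p is the number of dimensions (features), C_i is cluster i with centre c_i and n_i data items, and \Gamma is the covariance matrix estimated from the whole data set.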
The jumpMethodCluster function is constructed to take a set of data, as an array of
arrays, an optional range of k values (numbers of clusters) to test and a number of cycles, which
determines how many times to repeat the clustering for each value of k. This cyclic repetition is important
because different clustering attempts, involving random numbers, may give variable
results.
from numpy import array, cov, dot, linalg, zeros
def jumpMethodCluster(data, kRange=None, cycles=10):
Firstly, we extract the length of the data and the number of dimensions (features). This
is simply the size (.shape in NumPy) of the array. Then we define the range of k values if
none were passed in; by default the range is between two and the number of data items.
Note that the range is specified by an upper limit, which will not be included.
  n, dims = data.shape

  if kRange is None:
    start, limit = (2, n+1)
  else:
    start, limit = kRange
Next we define the power variable, which will be used to adjust the calculated
distortion values according to the number of dimensions in the data (see the primary
reference for an explanation of this). The distortions are collected in a dictionary, keyed by
k values, so this is initially defined as empty. The invCovMat is the inverse of the matrix representing the
covariance between the dimensions of the input data. Here we can make use of NumPy’s cov and
linalg.pinv functions to estimate the covariance matrix and then invert it.
  power = dims/2.0
  distortions = {}
  invCovMat = linalg.pinv(cov(data.T))
Then a loop goes through all of the values of k which are to be tested. For each we
define an array, initially at zero, of average distortions (scaled distances). We will have
one value for each cycle, with the aim being to find the cycle with the best clustering and
hence minimum distortion.
  for k in range(start, limit):
    meanDists = zeros(cycles)
The next for loop exists to provide several clustering attempts, to guard against
clustering where k-means gets stuck in a stable but sub-optimal situation. For each attempt
we define a sum of distortions sumDist, and use the kMeansSpread function to actually
generate the clusters for this cycle and value of k.
    for c in range(cycles):
      sumDist = 0.0
      centers, clusters = kMeansSpread(data, k)
With the clustering attempt made, the results are analysed. We go through each cluster
in turn and calculate diffs, the differences between the data items in the cluster and the
corresponding cluster centre that the k-means algorithm gave.
      for i, cluster in enumerate(clusters):
        size = len(cluster)
        diffs = array(cluster) - centers[i]
The differences are used to calculate the average distortion. At a basic level this
requires that the diff arrays are squared using a dot product (dot(diff.T, diff)), but we also
apply the inverse covariance matrix to scale the values according to the spread of data in
the dimensions (i.e. treat each data feature equivalently). The distortion, which can be
thought of as a measure of distance, is then divided by the size of the cluster, to get an
average for the items in the cluster, and added to the total.
        for j, diff in enumerate(diffs):
          dist = dot(diff.T, dot(diff, invCovMat))
          sumDist += dist / size
The final summation of average distortions in the cycle is then averaged again, over the
number of clusters and dimensions, so that we get the average for the whole clustering
solution.
      meanDists[c] = sumDist / (dims * k)
Finally, after all trial cycles are done the minimum value of the averaged distortions is
selected: the trial that gave the best clustering. This value is adjusted by raising it to the
negative power -power, which scales the result according to how the distortion is expected
to vary with the number of data dimensions (see reference above).
    distortions[k] = min(meanDists) ** (-power)
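In the same notation as before (again a sketch, not from the original text), the value stored for each k is the best trial distortion transformed by this negative power:

\hat{d}_k = \Bigl(\min_{\mathrm{cycles}} d_k\Bigr)^{-p/2}

so a smaller raw distortion gives a larger transformed value.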
With the distortions calculated, it only remains to go through the values to find the
value of k (number of clusters) that corresponds to the biggest improvement. The largest jump in
the distortion value will occur when we reach the optimum value of k, the one that gives
the most notable improvement in the relative separation between the data items and the
cluster centres.
  maxJump = None
  bestK = None

  for k in range(start+1, limit):
    jump = distortions[k] - distortions[k-1]

    if (maxJump is None) or (jump > maxJump):
      maxJump = jump
      bestK = k

  return bestK
To visualise how the jump method works, imagine three real clusters, as illustrated in
Figure 23.6. There will be some large spreads from the cluster centres if we try to use only
two clusters. Using four clusters will tend to split a real cluster, leading to only a modest
reduction in the spread. Scaled properly, according to the effect of dimensionality, the
spread measure (distortion) can be used to suggest the value of k.
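To illustrate how the completed function might be used, the following is a quick test sketch (not from the original text): the data generated here is arbitrary, and it assumes that the kMeansSpread function described earlier in the chapter is available in the same namespace.

from numpy import array, random, vstack

# Three artificial clusters of 2D points with offset centres
testData = vstack([random.normal(0.0, 0.5, (100, 2)),
                   random.normal(3.0, 0.5, (100, 2)),
                   array([0.0, 3.0]) + random.normal(0.0, 0.5, (100, 2))])

bestK = jumpMethodCluster(testData, kRange=(2, 8), cycles=5)
print('Best number of clusters:', bestK)

With three well-separated groups of points the reported value should typically be 3, though the random elements of the clustering mean that individual runs can occasionally differ.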