centres. The idea is that by choosing cluster centres that are spaced further apart the
optimisation converges more quickly and is less likely to get stuck in sub-optimal
positions, which are often characterised by two centres being close and dividing a real
cluster at the expense of a globally better solution. One method that uses better starting points, known as k-means++, increases the chance of a point becoming an initial centre with the square of its distance to the nearest centre specified so far. However, here we will consider another,
slightly more directed approach with a similar aim.
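As an aside, a minimal sketch of the k-means++-style seeding just described might look like the following, assuming data is a two-dimensional NumPy array as elsewhere in this chapter. The function name kMeansPlusPlusSeeds and the use of numpy.random.choice are illustrative choices of ours, not code from this book:

from numpy import array, vstack
from numpy.random import choice, randint

def kMeansPlusPlusSeeds(data, k):

    # First centre: a data point chosen uniformly at random
    n = len(data)
    centers = [data[randint(n)]]

    while len(centers) < k:
        # Squared distance from each point to its nearest centre so far
        diffs = data[:, None, :] - array(centers)[None, :, :]
        distSq = (diffs * diffs).sum(axis=2).min(axis=1)

        # Sample the next centre with probability proportional to distSq
        index = choice(n, p=distSq/distSq.sum())
        centers.append(data[index])

    return vstack(centers)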
For the kMeansSpread function we guess one cluster centre by taking a random data
point and then choose the centres of subsequent clusters by selecting points that are
furthest away from those defined so far. Each cluster centre is chosen by creating an index; these indices are collected in the set indices, which is used at the end to select the corresponding data items and thus create an array of centres. The index for the first cluster is a random integer
between zero (the first data item) and n-1 (the last data item), selected using the standard Python random number module: random.randint(0, n-1).
The remaining indices are added
in a while loop, until a total of k is achieved, by choosing subsequent points which have
minimum radial influence from the centres already chosen.
from numpy import zeros, vstack
from random import randint

def kMeansSpread(data, k):

    n = len(data)
    index = randint(0, n-1)        # first centre: a random data point
    indices = set([index])

    influence = zeros(n)
    while len(indices) < k:
        # add the diminishing radial influence of the latest centre
        diff = data - data[index]
        sumSq = (diff * diff).sum(axis=1) + 1.0
        influence += 1.0 / sumSq

        # the least influenced point is furthest from the centres so far
        index = influence.argmin()

        while index in indices:    # guard against picking a repeat
            index = randint(0, n-1)

        indices.add(index)

    centers = vstack([data[i] for i in indices])

    return kMeans(data, k, centers)
Key to this approach is the influence array. Initially starting with zeros, each time we
add an index for a new centre we calculate the difference from that cluster centre
(data[index]) to all the data, thus creating diff. Each value in the influence array is then
increased by one over sumSq: the sum of this difference squared. Note that sumSq has 1.0
added to each element to avoid dividing by zero, i.e. where a data point coincides exactly
with the previous centre. Also, dividing the scalar 1.0 by the whole array sumSq is an element-wise operation, so we get an array of reciprocals. In effect each centre has a diminishing radial influence and we choose the index where the sum of all these influences is minimised. In some circumstances an index can actually be
picked twice, but we guard against this with a second while loop which checks for repeats
and chooses a random index until an unused one is found. Once k indices are collected, a
list of data items, representing the cluster centres, is created using a list comprehension
and stacked into a NumPy array (vstack joins a sequence of arrays row-wise, in the way that concatenation joins lists). Overall this method is analogous to the concept of repelling charges. In the end the
‘repelled’ centres are a fair starting point for k-means, which is called at the end so its
results are passed back directly at the return.
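To see the element-wise arithmetic in action, consider a tiny worked example with three made-up points on a line:

from numpy import array

points = array([[0.0, 0.0], [1.0, 0.0], [4.0, 0.0]])
diff = points - points[0]                  # difference to the first 'centre'
sumSq = (diff * diff).sum(axis=1) + 1.0    # array([ 1.,  2., 17.])
influence = 1.0 / sumSq                    # element-wise reciprocals
print(influence.argmin())                  # 2 : the furthest point wins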
The function is used, and can be tested, in the same way as kMeans(); a minimal usage sketch follows.
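For instance, assuming kMeans() is defined as earlier in the chapter and returns the centres followed by the cluster memberships, a quick test with artificial data might look like this:

from numpy import vstack
from numpy.random import uniform

testDataA = uniform(0.0, 1.0, (50, 2))   # points spread over the unit square
testDataB = uniform(0.0, 0.2, (50, 2))   # a tighter clump near the origin
testData = vstack([testDataA, testDataB])

centers, clusters = kMeansSpread(testData, 2)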
If speed of execution becomes important then, aside from faster implementations (e.g. using a C-language extension, as described in Chapter 27), further improvements can sometimes be
made to the algorithmic efficiency. For example, if the dimensionality of the data items is
not too large (e.g. three-dimensional) techniques like kD-trees
can be used to more
efficiently calculate the closest centre for each data item. Also, if the data set to be
clustered is large, significant time savings can be made by initially running the algorithm
on a smaller, random subset of the data. The cluster centres from the smaller subset can
then be used as good starting values in a second round of clustering using the whole data set, as sketched below. A point of note with k-means and similar methods is that they tend to only work well
where the clusters are of about the same size and density and are convex (blobs).
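The following sketch illustrates both speed-ups; kMeansLarge, nearestCenters and subsetSize are hypothetical names of our own, SciPy is assumed to be available, and kMeansSpread() is assumed to return the centres followed by the cluster memberships as above:

from numpy.random import choice
from scipy.spatial import cKDTree

def kMeansLarge(data, k, subsetSize=1000):

    # Cluster a random subset first; its centres then seed a second,
    # full-data round of k-means, which usually converges quickly
    n = len(data)
    sample = data[choice(n, min(subsetSize, n), replace=False)]
    centers, clusters = kMeansSpread(sample, k)

    return kMeans(data, k, centers)

def nearestCenters(data, centers):

    # For low-dimensional data a kD-tree finds each point's nearest
    # centre without computing every point-to-centre distance
    tree = cKDTree(centers)
    dists, labels = tree.query(data)

    return labels

Both routines pass the real work back to the functions defined earlier in the chapter, so they inherit the same return values.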