K-means clustering
The next Python example for clustering is an algorithm known as k-means clustering.
Here the ‘k’ refers to the number of clusters that we wish to group our list of data items into.
The ‘means’ refers to the concept of describing each cluster as the central average
position: a geometric average of its members. There are implementations of the k-means
algorithm in the Scientific Python library (scipy.cluster.vq.kmeans and
scipy.cluster.vq.kmeans2), but we will create one from scratch here to illustrate the
methods and to give a basis for modification. Also, there are variants of this clustering
method that can be used, like ‘k-medoids’ or ‘k-medians’, which can also be considered,
but which we will not go into here for reasons of brevity. Compared to the threshold-based
methods already discussed, the k-means method has the advantage that you do not have to
choose a threshold. Although you do have to say how many clusters (k) to make, it is
possible to try several values for k and take the best one. Compared to DBSCAN, k-means
will not work as well with data clusters that have irregular or intermingled shapes, but will
allow separation of regular (‘blobby’) clusters even when they are overlapped.
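As a quick aside, the following sketch (an illustrative example that assumes SciPy is installed; it is not needed for the code developed here) shows the library route mentioned above: scipy.cluster.vq.kmeans returns the cluster centres and a mean distortion value, which can be compared for a few trial values of k when deciding how many clusters to make.
from numpy import random
from scipy.cluster.vq import kmeans, whiten

data = whiten(random.random((500,2)))  # scale each feature to unit variance
for k in (2, 3, 4, 5):
  centers, distortion = kmeans(data, k)  # cluster centres, mean distortion
  print(k, distortion)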
Initially the cluster centres are defined randomly, basing them on the locations of
random data items. Each of the input data items will then become a member of a cluster
by simply determining which centre is closest. The effect is that the discrimination
boundaries between clusters are straight lines.
Most likely the initial guess for the cluster
centres will be bad, so the algorithm goes through a number of iterative cycles to improve
the locations of the centres. Each cycle consists of re-appraising which data items are
members of which cluster and then moving the centres of the clusters to the average of
their memberships. When a cluster centre moves the membership of items may change for
the next cycle, which in turn leads to a new centre, and so the cycle repeats. The k-means
algorithm is simple in that there is no recorded statistic that is being optimised. Instead the
iteration continues until a stable situation is reached. For data sets where there are (k)
obvious clumps of items the algorithm will be fairly reproducible. However, where the
data items are more evenly spread the final clustering will depend on the initial guess of
the centres, i.e. there can be more than one stable solution. If there isn’t a stable solution
you should consider whether it is meaningful to cluster the data in this way; imagine
dividing a regular grid of points into two, where any straight line that makes two groups of
equal size is a possible solution.
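To illustrate this point, the short (hypothetical) check below uses the kMeans function that we define next: clustering an even spread of random points several times can give noticeably different cluster centres from run to run, whereas data with clear clumps converges on much the same answer each time.
from numpy import random
evenData = random.random((500,2))  # evenly spread points, no real clumps
for run in range(3):
  centers, clusters = kMeans(evenData, 2)
  print(run, centers)  # centres may differ between runs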
The Python function that performs k-means clustering takes an array of input data
vectors and the number k, which specifies how many clusters there will be. We also have
the option of passing in some initial guesses for the cluster centres, but by default this is
None and random centres are chosen. To define that initial guess we use the sample
function from Python’s standard random number module, which chooses k different items
from the data, and we then convert the resulting list into a NumPy array.
from numpy import array, random
from random import sample
def kMeans(data, k, centers=None):

  if centers is None:
    centers = array( sample(list(data), k) )  # list() not needed in Python 2
With the initial cluster centres defined we now come to the iterative part. Here we
define the amount of change, which is initially large (proportionately speaking). The while
loop will continue as long as the change between subsequent cycles is big enough,
otherwise we deem the clustering to have reached a stable solution.
  change = 1.0
  while change > 1e-8:
The membership of the clusters is defined by putting each data item into a list, one for
each cluster, according to which cluster centre is closest. Hence clusters is defined as a list
of lists which collects the data items. Note that the membership of clusters may change
each cycle, so we have to repeatedly re-appraise the memberships. The for loop goes
through each data item, here called vector, and subtracts the centres; this will give an array
of difference vectors from the data item to all of the centres. Note that this subtraction
assumes we are working with NumPy arrays, hence subtracting two arrays gives another
array. The array of differences is then squared and then summed up for each vector, so that
we have a list of square distances, with one value for each centre; i.e. an array of (x, y, z)
items becomes (x², y², z²) and then (x² + y² + z²). The closest centre is then the index of
the smallest value in the array dists.argmin(). The data vector is added to the appropriate
list in clusters with the closest index.
    clusters = [[] for x in range(k)]
    for vector in data:
      diffs = centers - vector
      dists = (diffs * diffs).sum(axis=1)
      closest = dists.argmin()
      clusters[closest].append(vector)
With the membership of the clusters defined, the next loop recalculates the centres of
the clusters based on the data vectors that belong to each. We simply go through the list of
clusters and, after converting to a NumPy array, we calculate the average data vector,
taking a summation for each data dimension and dividing by the length of the cluster. The
difference between the old and new cluster centre is calculated by subtraction, then added
as the sum of squares (i.e. length squared) to the total amount of change for this cycle.
Finally in the loop the new centre replaces the old one: centers[i] = center.
    change = 0
    for i, cluster in enumerate(clusters):
      cluster = array(cluster)
      center = cluster.sum(axis=0)/len(cluster)
      diff = center - centers[i]
      change += (diff * diff).sum()
      centers[i] = center

  return centers, clusters
At the end of the function, once the change is small enough, we pass back the last
values of the cluster centres and the list which contains the data items grouped according
to cluster membership.
To test the data we can use a random spread of 1000 data points in two dimensions,
grouping into three clusters:
testDataA = random.random((1000,2)) # No clumps
centers, clusters = kMeans(testDataA, 3)
Or we can add two distinct clumps of two-dimensional points:
from numpy import vstack
testDataB1 = random.normal(0.0, 2.0, (100,2))
testDataB2 = random.normal(7.0, 2.0, (100,2))
testDataB = vstack([testDataB1, testDataB2]) # Two clumps
centers, clusters = kMeans(testDataB, 2)
For each test we can display the result in a scatter graph using the matplotlib library.
Note that to make the plot we extract the x and y coordinates for the data items separately
using zip, because matplotlib.pyplot expects two separate lists (rather than (x,y) points).
After plotting the data with separate colours we also make a scatter plot of the centres of
the clusters with black circles.
from matplotlib import pyplot
colors = ['#FF0000','#00FF00','#0000FF',
          '#FFFF00','#00FFFF','#FF00FF']

for i, cluster in enumerate(clusters):
  x, y = zip(*cluster)
  color = colors[i % len(colors)]
  pyplot.scatter(x, y, c=color, marker='o')
x, y = zip(*centers)
pyplot.scatter(x, y, s=40, c='black', marker='o')
pyplot.show()
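As a final optional check (an illustrative addition, not part of the original listing), we could also print how many data items ended up in each cluster, alongside the corresponding centre:
for i, cluster in enumerate(clusters):
  print(i, len(cluster), centers[i])  # cluster index, size and centre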