Clustering methods
The following Python examples demonstrate a range of clustering methods. Overall we
have tried to give an overview of the different approaches and have chosen examples that
are reasonably easy to explain (and write in Python) but which should also prove useful in
many different situations.
Simple threshold clustering
The first Python example in this chapter involves grouping items of data into clusters in a
fairly simple manner, based upon how close (similar) the items of data are. This method is
somewhat naïve and suffers from a few problems, but is a good place to start in the subject
and we will subsequently show improvements. The algorithm works by initially
considering an item of data as a potentially separate cluster and then expands the cluster
by adding any other items that are sufficiently close, i.e. the distance between data points
is less than a threshold. The general problem with this kind of approach is that it is
sensitive to the arrangement of data points; two genuine clusters can be merged into one
by the presence of outlying points that just happen to span the gap between the clusters.
The method is sometimes described as being ‘too greedy’.
Firstly, in the Python example we define a simple function that calculates the regular,
Euclidean distance between two items of data (feature vectors). Although this is a
reasonable general choice for measuring the similarity between data items, there are some
situations where something else may be more appropriate. A good example of using a
different kind of measure would be for biological sequences, where it is possible to
estimate how similar the sequences are, assuming an evolutionary process.
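For instance, as a purely illustrative sketch (not part of the original example), a naïve sequence ‘distance’ could simply be the fraction of positions that differ between two aligned sequences of equal length; any such function that takes two items and returns a number could be passed to the clustering routines below in place of the Euclidean measure. A realistic measure would normally involve a proper alignment and a substitution matrix.

def seqDissimilarity(seqA, seqB):
    # Hypothetical helper: fraction of mismatched positions between two
    # aligned, equal-length sequences (0.0 means identical, 1.0 means all differ)
    numDiffs = sum(1 for a, b in zip(seqA, seqB) if a != b)
    return numDiffs / float(len(seqA))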
The euclideanDist function will assume the data items are NumPy array objects, so that
we can do some quick manipulations without having to go through loops. The distance is
calculated by first finding the difference between two vectors: another vector, diff. The
square root of the sum of the squares of the values in the diff vector is then the distance we
want. We use the dot() function to do the summing and squaring for us. (See Chapter 9 for an explanation of dot().)
from numpy import dot, sqrt

def euclideanDist(vectorA, vectorB):

    diff = vectorA - vectorB
    return sqrt(dot(diff, diff))
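As a quick sanity check (an invented example, not from the text), two points separated by 3 along one axis and 4 along the other give the familiar distance of 5:

from numpy import array

print(euclideanDist(array([0.0, 0.0]), array([3.0, 4.0])))  # 5.0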
Next is another small helper function that we will use in the clustering. In this case the
findNeighbours function is built to look through all pairs of data items, calculate the
distance between the items (for example, using the euclideanDist function) and record
those items that are sufficiently close to be considered as near neighbours. The three input
arguments are naturally the data itself, the function to get the distance value and a
threshold for defining neighbours. The results will be stored in a Python dictionary
neighbourDict, so this is defined and initially filled with empty lists, one for each item of
data.
def findNeighbours(data, distFunc, threshold):

    neighbourDict = {}
    n = len(data)

    for i in range(n):
        neighbourDict[i] = []

    for i in range(0, n-1):
        for j in range(i+1, n):
            dist = distFunc(data[i], data[j])

            if dist < threshold:
                neighbourDict[i].append(j)
                neighbourDict[j].append(i)

    return neighbourDict
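As a small illustration of the output (using made-up points), three mutually close points and one distant outlier give a dictionary in which the outlier has an empty list of neighbours:

from numpy import array

testData = [array([0.0, 0.0]), array([0.1, 0.0]),
            array([0.0, 0.1]), array([5.0, 5.0])]
print(findNeighbours(testData, euclideanDist, 0.5))
# {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: []}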
The remainder of the function involves going through all pairs of data items. Hence we
have two loops, one with the index i and one with the index j, to get two items. Note how
we have deliberately chosen the ranges for these indices to avoid repetition, i.e. j only takes indices larger than i. The distance is calculated using the function that was passed in, and if the distance is less than the threshold the items with indices i and j are recorded as neighbours; each index goes in the other’s list of neighbours, held in neighbourDict.
Finally we get to the actual simpleCluster clustering function. This takes a list of data,
each item of which is assumed to be a NumPy array, a distance threshold and the function
to make the distance measurement. Initially, the dictionary of neighbours is filled using
findNeighbours, then the remainder of the function works to find clusters by combining
neighbours together into groups.
def simpleCluster(data, threshold, distFunc=euclideanDist):

    neighbourDict = findNeighbours(data, distFunc, threshold)
For the clusters we first define an empty list, which will represent items of data in
discrete groups. The pool variable is a set of numbers, the indices of data items that we
have not yet processed. The aim will be to work through the whole pool and place each index in the appropriate cluster. We are using Python sets here because detecting whether something is in a set is quicker than looking through a list.
    clusters = []
    pool = set(range(len(data)))
Now the task is to process the data, which involves taking an item from the ‘to-do’ pool
(remembering pop() removes something from a set so the pool shrinks) and looking
through its neighbours, which we fetch from the neighbourDict. Because the item was in
the pool it will not yet be in a cluster, thus we define a new one as an empty set and add
the item index i. The item is now considered processed. It may seem premature to treat the item as processed already, but as will become apparent most items will be added in a different way; this is just the seed for the cluster.
    while pool:
        i = pool.pop()
        neighbours = neighbourDict[i]
        cluster = set()
        cluster.add(i)
Once the new cluster is defined we then add to it all of the neighbours, which are known to be close. To do this a new pool, pool2, is defined from the indices j of the neighbours.
        pool2 = set(neighbours)

        while pool2:
            j = pool2.pop()
If the neighbour has not yet been processed (j still in pool) then we process it by
removing it from the main pool and adding it to the current cluster. Its own neighbours then
go into pool2, the set that we are currently working on. In this way neighbours of
neighbours are placed in the same cluster. Note the use of the update() method, which expands a Python set to include any new values from another collection.
            if j in pool:
                pool.remove(j)
                cluster.add(j)

                neighbours2 = neighbourDict[j]
                pool2.update(neighbours2)
This inner loop ends when no more neighbours are placed in the pool for the current
cluster and pool2 is fully processed (empty), whereupon the cluster is added to the list of
clusters. Another cluster will be created by the outer while loop as long as the pool still
contains something to process: items that are not yet members of any cluster.
        clusters.append(cluster)
The final output of the clusters will be given as a list of lists; each sub-list will contain
all of the data items that go together in the same cluster. So far the clusters are only
defined in terms of index numbers, so it is a simple matter to go through each set of
clustered indices and add the actual data items they correspond to (data[i]) to the
clusterData list which will be passed back at the end of the function.
    clusterData = []
    for cluster in clusters:
        clusterData.append([data[i] for i in cluster])

    return clusterData
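Before moving to a larger test, a toy example (invented here for illustration) shows the kind of output to expect: two tight pairs of points and a generous threshold give back two clusters of two items each.

from numpy import array

toyData = [array([0.0, 0.0]), array([0.1, 0.1]),
           array([3.0, 3.0]), array([3.1, 3.1])]
toyClusters = simpleCluster(toyData, 0.5)
print([len(cluster) for cluster in toyClusters])  # [2, 2]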
For testing we will use some random data clusters with a normal (Gaussian)
distribution, readily created using random.normal. Here each data cluster is centred on a
different point but has the same spread and size: 100 points in two dimensions. These are
then combined into a single array of vectors (300 points in two dimensions) using vstack,
which combines the arrays of vectors along their first axis (rows), and shuffled to mix up
the vectors from the different clusters.
from numpy import random, vstack

spread = 0.12
sizeDims = (100,2)
data = [random.normal(( 0.0, 0.0), spread, sizeDims),
        random.normal(( 1.0, 1.0), spread, sizeDims),
        random.normal(( 1.0, 0.0), spread, sizeDims)]

data = vstack(data)
random.shuffle(data) # Randomise order

clusters = simpleCluster(data, 0.10)
The following code then plots the results with different symbols and colours. Note that we plot all the small ‘noise’ clusters that have three or fewer items as black spots. Also, the zip function is handy here because the scatter plotting function requires two separate sequences of x values and y values, but our data has (x,y) pairs. Hence *cluster unpacks the list of (x,y) pairs into separate arguments for zip, which then groups the x values together and the y values together (see Chapter 10).
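As a brief illustration of this unpacking trick (with invented points), zip(*points) turns a list of (x,y) pairs into one tuple of x values and one of y values:

points = [(1, 2), (3, 4), (5, 6)]
xVals, yVals = zip(*points)
print(xVals, yVals)  # (1, 3, 5) (2, 4, 6)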
from matplotlib import pyplot

colors = ['#F0F0F0','#A0A0A0','#505050',
          '#D0D0D0','#808080','#202020']
markers = ['d','o','s','>','^']

i = 0
for cluster in clusters:
    allX, allY = zip(*cluster)

    if len(cluster) > 3:
        color = colors[i % len(colors)]
        marker = markers[i % len(markers)]
        pyplot.scatter(allX, allY, s=30, c=color, marker=marker)
        i += 1

    else:
        pyplot.scatter(allX, allY, s=5, c='black', marker='o')

pyplot.show()