K-means clustering

The next Python example for clustering is an algorithm known as k-means clustering. Here the 'k' refers to the number of clusters that we wish to group our list of data items into. The 'means' refers to the concept of describing each cluster by its central average position: a geometric average of its members. There are implementations of the k-means algorithm in the Scientific Python library (scipy.cluster.vq.kmeans and scipy.cluster.vq.kmeans2), but we will create one from scratch here to illustrate the method and to give a basis for modification. There are also variants of this clustering method, like 'k-medoids' and 'k-medians', which we will not go into here for reasons of brevity. Compared to the threshold-based methods already discussed, the k-means method has the advantage that you do not have to choose a threshold. Although you do have to say how many clusters (k) to make, it is possible to try several values for k and take the best one. Compared to DBSCAN, k-means will not work as well with data clusters that have irregular or intermingled shapes, but it will allow separation of regular ('blobby') clusters even when they overlap.
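For comparison, the Scientific Python routines mentioned above can be used directly. The following is a minimal sketch of that route; the whiten step, which SciPy recommends before k-means, rescales each dimension to unit variance:

from numpy import random
from scipy.cluster.vq import kmeans, vq, whiten

data = random.random((1000, 2))          # example input: 1000 points in 2D
scaled = whiten(data)                    # rescale each dimension to unit variance
centers, distortion = kmeans(scaled, 3)  # find three cluster centres
labels, dists = vq(scaled, centers)      # assign each point to its closest centre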

Initially the cluster centres are defined randomly, basing them on the locations of random data items. Each of the input data items will then become a member of a cluster by simply determining which centre is closest. The effect is that the discrimination boundaries between clusters are straight lines. Most likely the initial guess for the cluster centres will be bad, so the algorithm goes through a number of iterative cycles to improve the locations of the centres. Each cycle consists of re-appraising which data items are members of which cluster and then moving the centres of the clusters to the average of their memberships. When a cluster centre moves, the membership of items may change for the next cycle, which in turn leads to a new centre, and so the cycle repeats. The k-means algorithm is simple in that there is no recorded statistic that is being optimised. Instead the iteration continues until a stable situation is reached. For data sets where there are k obvious clumps of items the algorithm will be fairly reproducible. However, where the data items are more evenly spread the final clustering will depend on the initial guess of the centres, i.e. there can be more than one stable solution. If there isn't a stable solution you should consider whether it is meaningful to cluster the data in this way; imagine dividing a regular grid of points into two, where any straight line that makes two groups of equal size is a possible solution.

The Python function that performs k-means clustering takes an array of input data vectors and the number k, to specify how many clusters there will be. We also have the option of passing in some initial guesses for the cluster centres, but by default this is None and we choose random centres. If the centres are unspecified we define the initial guess using the sample function from Python's standard random number module, which will choose k different items from the data, and then convert the resulting list into a NumPy array.


from numpy import array, random
from random import sample

def kMeans(data, k, centers=None):
    if centers is None:
        centers = array( sample(list(data), k) )  # list() not needed in Python 2
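As an aside, NumPy's own random module offers an equivalent way of picking k distinct starting centres. This is just an alternative sketch, not what the function above uses, and it requires data to be a NumPy array:

# Alternative initial-centre choice: k distinct random row indices
indices = random.choice(len(data), k, replace=False)
centers = data[indices]  # fancy indexing needs a NumPy array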

With the initial cluster centres defined we now come to the iterative part. Here we define the amount of change, which is initially large (proportionately speaking). The while loop will continue as long as the change between subsequent cycles is big enough; otherwise we deem the clustering to have reached a stable solution.

    change = 1.0
    while change > 1e-8:

The membership of the clusters is defined by putting each data item into a list, one for each cluster, according to which cluster centre is closest. Hence clusters is defined as a list of lists which collects the data items. Note that the membership of clusters may change each cycle, so we have to repeatedly re-appraise the memberships. The for loop goes through each data item, here called vector, and subtracts the centres; this will give an array of difference vectors from the data item to all of the centres. Note that this subtraction assumes we are working with NumPy arrays, hence subtracting two arrays gives another array. The array of differences is then squared and then summed up for each vector, so that we have a list of squared distances, with one value for each centre; i.e. an array of (x, y, z) items becomes (x², y², z²) and then x²+y²+z². The index of the closest centre is then found with dists.argmin(), which gives the position of the smallest value in the array. The data vector is added to the list in clusters with the closest index.

        clusters = [[] for x in range(k)]

        for vector in data:
            diffs = centers - vector
            dists = (diffs * diffs).sum(axis=1)
            closest = dists.argmin()
            clusters[closest].append(vector)

With the membership of the clusters defined, the next loop recalculates the centres of the clusters based on the data vectors that belong to each. We simply go through the list of clusters and, after converting to a NumPy array, we calculate the average data vector, taking a summation for each data dimension and dividing by the length of the cluster. The difference between the old and new cluster centre is calculated by subtraction, then added as the sum of squares (i.e. length squared) to the total amount of change for this cycle. Finally in the loop the new centre replaces the old one: centers[i] = center.

        change = 0
        for i, cluster in enumerate(clusters):
            cluster = array(cluster)
            center = cluster.sum(axis=0) / len(cluster)
            diff = center - centers[i]
            change += (diff * diff).sum()
            centers[i] = center

    return centers, clusters

At the end of the function, once the change is small enough, we pass back the last values of the cluster centres and the list which contains the data items grouped according to cluster membership.
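One caveat the function does not handle: if a poor initial guess leaves a cluster with no members, len(cluster) is zero and the averaging fails with a division by zero. A minimal defensive variant of the centre-update loop, under the assumption that an empty cluster is simply re-seeded from a random data item, might look like this:

        # Sketch of a defensive update (hypothetical addition, not in the
        # original function): re-seed any empty cluster from a random item
        change = 0
        for i, cluster in enumerate(clusters):
            if not cluster:  # empty cluster: re-seed rather than average
                centers[i] = data[random.randint(len(data))]
                continue
            cluster = array(cluster)
            center = cluster.sum(axis=0) / len(cluster)
            diff = center - centers[i]
            change += (diff * diff).sum()
            centers[i] = center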

To test the function we can use a random spread of 1000 data points in two dimensions, grouping them into three clusters:

testDataA = random.random((1000,2))  # No clumps
centers, clusters = kMeans(testDataA, 3)
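Because this uniform test data has no real clumps, repeated runs can settle on different stable solutions, as discussed above. A quick check, just as an illustrative sketch, is to run the function twice and compare the centres; the exact values will vary from run to run:

centersA1, clustersA1 = kMeans(testDataA, 3)
centersA2, clustersA2 = kMeans(testDataA, 3)
print(centersA1)  # the two sets of centres will often differ,
print(centersA2)  # reflecting different random initial guesses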

Or we can add two distinct clumps of two-dimensional points:

from numpy import vstack

testDataB1 = random.normal(0.0, 2.0, (100,2))
testDataB2 = random.normal(7.0, 2.0, (100,2))
testDataB = vstack([testDataB1, testDataB2])  # Two clumps
centers, clusters = kMeans(testDataB, 2)
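As mentioned earlier, we can also try several values of k and take the best one. A simple score is the total within-cluster sum of squared distances; it always shrinks as k grows, so in practice one looks for the value of k where it stops improving quickly. The following is a minimal sketch using the kMeans function above, where sumSquares is a hypothetical helper name:

def sumSquares(centers, clusters):
    # Total squared distance of all items from their cluster centres
    total = 0.0
    for center, cluster in zip(centers, clusters):
        diffs = array(cluster) - center
        total += (diffs * diffs).sum()
    return total

for k in range(2, 7):
    cen, clus = kMeans(testDataB, k)
    print(k, sumSquares(cen, clus))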

For each test we can display the result in a scatter graph using the matplotlib library. Note that to make the plot we extract the x and y coordinates for the data items separately using zip, because matplotlib.pyplot expects two separate lists (rather than (x,y) points). After plotting the data with separate colours we also make a scatter plot of the centres of the clusters with black circles.

from matplotlib import pyplot

colors = ['#FF0000','#00FF00','#0000FF',
          '#FFFF00','#00FFFF','#FF00FF']

for i, cluster in enumerate(clusters):
    x, y = zip(*cluster)
    color = colors[i % len(colors)]
    pyplot.scatter(x, y, c=color, marker='o')

x, y = zip(*centers)
pyplot.scatter(x, y, s=40, c='black', marker='o')
pyplot.show()

