centres. The idea is that by choosing cluster centres that are spaced further apart the
optimisation converges more quickly and is less likely to get stuck in sub-optimal
positions, which are often characterised by two centres being close and dividing a real
cluster at the expense of a globally better solution. One method that uses better starting points, known as k-means++, increases the chance of a point becoming an initial centre with the square of its distance to the nearest centre specified so far. However, here we will consider another,
slightly more directed approach with a similar aim.
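As an aside, a minimal sketch of the k-means++-style seeding just described might look like the following, assuming data is a two-dimensional NumPy array as elsewhere in this chapter. The function name kMeansPlusPlusSeeds and the use of numpy.random.choice are illustrative choices of ours, not code from this book:

from numpy import array, vstack
from numpy.random import choice, randint

def kMeansPlusPlusSeeds(data, k):

    # First centre: a data point chosen uniformly at random
    n = len(data)
    centers = [data[randint(n)]]

    while len(centers) < k:
        # Squared distance from each point to its nearest centre so far
        diffs = data[:, None, :] - array(centers)[None, :, :]
        distSq = (diffs * diffs).sum(axis=2).min(axis=1)

        # Sample the next centre with probability proportional to distSq
        index = choice(n, p=distSq/distSq.sum())
        centers.append(data[index])

    return vstack(centers)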
For the kMeansSpread function we guess one cluster centre by taking a random data
point and then choose the centres of subsequent clusters by selecting points that are
furthest away from those defined so far. Each cluster centre is chosen by creating an index; these indices are collected in the set indices, which is used at the end to select the corresponding data items and thus create an array of centres. The index for the first cluster is a random integer
between zero (the first data item) and n-1 (the last data item), selected using the standard Python random number module: random.randint(0, n-1).
The remaining indices are added
in a while loop, until a total of k is achieved, by choosing subsequent points which have
minimum radial influence from the centres already chosen.
from numpy import zeros, vstack
from random import randint

def kMeansSpread(data, k):

    n = len(data)
    index = randint(0, n-1)        # first centre: a random data point
    indices = set([index])

    influence = zeros(n)
    while len(indices) < k:
        # add the diminishing radial influence of the latest centre
        diff = data - data[index]
        sumSq = (diff * diff).sum(axis=1) + 1.0
        influence += 1.0 / sumSq

        # the least influenced point is furthest from the centres so far
        index = influence.argmin()

        while index in indices:    # guard against picking a repeat
            index = randint(0, n-1)

        indices.add(index)

    centers = vstack([data[i] for i in indices])

    return kMeans(data, k, centers)
Key to this approach is the influence array. Initially starting with zeros, each time we
add an index for a new centre we calculate the difference from that cluster centre
(data[index]) to all the data, thus creating diff. Each value in the influence array is then
increased by one over sumSq: the sum of this difference squared. Note that sumSq has 1.0
added to each element to avoid dividing by zero, i.e. where a data point coincides exactly
with the previous centre. Also, dividing the scalar 1.0 by the whole array sumSq is an element-wise operation, so we get an array of reciprocals. In effect each centre has a diminishing radial influence and we choose the index where the sum of all these influences is minimised. In some circumstances an index can actually be
picked twice, but we guard against this with a second while loop which checks for repeats
and chooses a random index until an unused one is found. Once k indices are collected, a
list of data items, representing the cluster centres, is created using a list comprehension
and stacked into a NumPy array (vstack joins a sequence of arrays row-wise, in the way that concatenation joins lists). Overall this method is analogous to the concept of repelling charges. In the end the
‘repelled’ centres are a fair starting point for k-means, which is called at the end so its
results are passed back directly at the return.
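To see the element-wise arithmetic in action, consider a tiny worked example with three made-up points on a line:

from numpy import array

points = array([[0.0, 0.0], [1.0, 0.0], [4.0, 0.0]])
diff = points - points[0]                  # difference to the first 'centre'
sumSq = (diff * diff).sum(axis=1) + 1.0    # array([ 1.,  2., 17.])
influence = 1.0 / sumSq                    # element-wise reciprocals
print(influence.argmin())                  # 2 : the furthest point wins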
The function is used, and can be tested, in the same way as kMeans(); a minimal usage sketch follows.
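For instance, assuming kMeans() is defined as earlier in the chapter and returns the centres followed by the cluster memberships, a quick test with artificial data might look like this:

from numpy import vstack
from numpy.random import uniform

testDataA = uniform(0.0, 1.0, (50, 2))   # points spread over the unit square
testDataB = uniform(0.0, 0.2, (50, 2))   # a tighter clump near the origin
testData = vstack([testDataA, testDataB])

centers, clusters = kMeansSpread(testData, 2)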
If speed of execution becomes important then, aside from faster implementations (e.g. using a C-language extension, as described in Chapter 27), further improvements can sometimes be
made to the algorithmic efficiency. For example, if the dimensionality of the data items is
not too large (e.g. three-dimensional) techniques like kD-trees
can be used to more
efficiently calculate the closest centre for each data item. Also, if the data set to be
clustered is large, significant time savings can be made by initially running the algorithm
on a smaller, random subset of the data. The cluster centres from the smaller subset can
then be used as good starting values in a second round of clustering using the whole data set, as sketched below. A point of note with k-means and similar methods is that they tend to only work well
where the clusters are of about the same size and density and are convex (blobs).
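The following sketch illustrates both speed-ups; kMeansLarge, nearestCenters and subsetSize are hypothetical names of our own, SciPy is assumed to be available, and kMeansSpread() is assumed to return the centres followed by the cluster memberships as above:

from numpy.random import choice
from scipy.spatial import cKDTree

def kMeansLarge(data, k, subsetSize=1000):

    # Cluster a random subset first; its centres then seed a second,
    # full-data round of k-means, which usually converges quickly
    n = len(data)
    sample = data[choice(n, min(subsetSize, n), replace=False)]
    centers, clusters = kMeansSpread(sample, k)

    return kMeans(data, k, centers)

def nearestCenters(data, centers):

    # For low-dimensional data a kD-tree finds each point's nearest
    # centre without computing every point-to-centre distance
    tree = cKDTree(centers)
    dists, labels = tree.query(data)

    return labels

Both routines pass the real work back to the functions defined earlier in the chapter, so they inherit the same return values.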