Python Programming for Biology: Bioinformatics and Beyond



Tim J. Stevens and Wayne Boucher

Clustering methods

The following Python examples demonstrate a range of clustering methods. Overall we have tried to give an overview of the different approaches and have chosen examples that are reasonably easy to explain (and write in Python) but which should also prove useful in many different situations.

Simple threshold clustering

The first Python example in this chapter involves grouping items of data into clusters in a fairly simple manner, based upon how close (similar) the items of data are. This method is somewhat naïve and suffers from a few problems, but it is a good place to start in the subject and we will subsequently show improvements. The algorithm works by initially considering an item of data as a potentially separate cluster and then expands the cluster by adding any other items that are sufficiently close, i.e. where the distance between data points is less than a threshold. The general problem with this kind of approach is that it is sensitive to the arrangement of data points; two genuine clusters can be merged into one by the presence of outlying points that just happen to span the gap between the clusters. The method is sometimes described as being ‘too greedy’.



Firstly, in the Python example we define a simple function that calculates the regular, Euclidean distance between two items of data (feature vectors). Although this is a reasonable general choice for measuring the similarity between data items, there are some situations where something else may be more appropriate. A good example of using a different kind of measure would be for biological sequences, where it is possible to estimate how similar the sequences are, assuming an evolutionary process.

The euclideanDist function will assume the data items are NumPy array objects, so that we can do some quick manipulations without having to go through loops. The distance is calculated by first finding the difference between two vectors: another vector, diff. The square root of the sum of the squares of the values in the diff vector is then the distance we want. We use the dot() function to do the summing and squaring for us. (See Chapter 9 for an explanation of dot().)



from numpy import dot, sqrt

def euclideanDist(vectorA, vectorB):
    diff = vectorA - vectorB
    return sqrt(dot(diff, diff))
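As a quick sanity check of the distance function, the following minimal sketch (the 3-4-5 triangle is just an illustrative input, not from the text) confirms it behaves as expected:

```python
from numpy import array, dot, sqrt

def euclideanDist(vectorA, vectorB):
    # Difference vector, then square root of its dot product with itself
    diff = vectorA - vectorB
    return sqrt(dot(diff, diff))

# A 3-4-5 right triangle gives a known distance
print(euclideanDist(array([0.0, 0.0]), array([3.0, 4.0])))  # 5.0
```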

Next is another small helper function that we will use in the clustering. In this case the findNeighbours function is built to look through all pairs of data items, calculate the distance between the items (for example, using the euclideanDist function) and record those items that are sufficiently close to be considered as near neighbours. The three input arguments are naturally the data itself, the function to get the distance value and a threshold for defining neighbours. The results will be stored in a Python dictionary, neighbourDict, so this is defined and initially filled with empty lists, one for each item of data.


def findNeighbours(data, distFunc, threshold):
    neighbourDict = {}
    n = len(data)
    for i in range(n):
        neighbourDict[i] = []

    for i in range(0, n-1):
        for j in range(i+1, n):
            dist = distFunc(data[i], data[j])
            if dist < threshold:
                neighbourDict[i].append(j)
                neighbourDict[j].append(i)

    return neighbourDict
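A small illustration of findNeighbours may help here; this sketch uses four made-up points (three near the origin, one far away) and an arbitrary threshold of 1.0:

```python
from numpy import array, dot, sqrt

def euclideanDist(vectorA, vectorB):
    diff = vectorA - vectorB
    return sqrt(dot(diff, diff))

def findNeighbours(data, distFunc, threshold):
    # Map each item index to a list of the indices of its near neighbours
    neighbourDict = {}
    n = len(data)
    for i in range(n):
        neighbourDict[i] = []
    for i in range(0, n-1):
        for j in range(i+1, n):
            if distFunc(data[i], data[j]) < threshold:
                neighbourDict[i].append(j)
                neighbourDict[j].append(i)
    return neighbourDict

# Three mutually close points and one distant outlier
points = [array([0.0, 0.0]), array([0.5, 0.0]),
          array([0.0, 0.5]), array([5.0, 5.0])]
print(findNeighbours(points, euclideanDist, 1.0))
# {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: []}
```

The outlier (index 3) ends up with an empty neighbour list, which is exactly how it will later form a tiny ‘noise’ cluster of its own.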

The remainder of the function involves going through all pairs of data items. Hence we have two loops, one with the index i and one with the index j, to get two items. Note how we have deliberately chosen the ranges for these indices to avoid repetition, i.e. the j indices are always larger than i. The distance is calculated by using the function that was passed in, and then if the distance is less than the threshold we record the data items at indices i and j as neighbours; the indices go in each other’s list of neighbours, held in neighbourDict.

Finally we get to the actual simpleCluster clustering function. This takes a list of data, each item of which is assumed to be a NumPy array, a distance threshold and the function to make the distance measurement. Initially, the dictionary of neighbours is filled using findNeighbours, then the remainder of the function works to find clusters by combining neighbours together into groups.

def simpleCluster(data, threshold, distFunc=euclideanDist):
    neighbourDict = findNeighbours(data, distFunc, threshold)

For the clusters we first define an empty list, which will represent items of data in discrete groups. The pool variable is a set of numbers, the indices of data items that we have not yet processed. The aim will be to go through all the pool and place the indices in the appropriate cluster. We are using Python sets here because detecting something in a set is quicker than looking through a list.

    clusters = []
    pool = set(range(len(data)))
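The claim that membership tests on a set are quicker than on a list can be checked with a rough timing sketch (the absolute numbers are machine-dependent; only the comparison matters):

```python
import timeit

# Worst case for the list: the sought value is at the very end,
# so every element must be scanned; the set uses a hash lookup
setup = 'items = list(range(100000)); itemSet = set(items)'
listTime = timeit.timeit('99999 in items', setup=setup, number=100)
setTime = timeit.timeit('99999 in itemSet', setup=setup, number=100)
print(listTime > setTime)  # the hashed set lookup wins easily
```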

Now the task is to process the data, which involves taking an item from the ‘to-do’ pool (remembering pop() removes something from a set, so the pool shrinks) and looking through its neighbours, which we fetch from the neighbourDict. Because the item was in the pool it will not yet be in a cluster, thus we define a new one as an empty set and add the item index i. The item is now considered processed. It may seem dumb to already consider the item processed, but as will become apparent most items will be added in a different way; this is just a start for the cluster.

    while pool:
        i = pool.pop()
        neighbours = neighbourDict[i]
        cluster = set()
        cluster.add(i)

Once the new cluster is defined we then add all of the neighbours which are known to be close to the same cluster. To do this a new pool, pool2, is defined from the indices j of the neighbours.

        pool2 = set(neighbours)

        while pool2:
            j = pool2.pop()

If the neighbour has not yet been processed (j still in pool) then we process it by removing it from the main pool and adding it to the current cluster. Its own neighbours then go into pool2, the set that we are currently working on. In this way neighbours of neighbours are placed in the same cluster. Note the use of update(), which will expand a Python set to include any new values from another group.

            if j in pool:
                pool.remove(j)
                cluster.add(j)
                neighbours2 = neighbourDict[j]
                pool2.update(neighbours2)

This inner loop ends when no more neighbours are placed in the pool for the current cluster and pool2 is fully processed (empty), whereupon the cluster is added to the list of clusters. Another cluster will be created by the outer while loop as long as the pool still contains something to process: those items that are not a member of any cluster seen so far.

        clusters.append(cluster)

The final output of the clusters will be given as a list of lists; each sub-list will contain all of the data items that go together in the same cluster. So far the clusters are only defined in terms of index numbers, so it is a simple matter to go through each set of clustered indices and add the actual data items they correspond to (data[i]) to the clusterData list, which will be passed back at the end of the function.

    clusterData = []
    for cluster in clusters:
        clusterData.append( [data[i] for i in cluster] )

    return clusterData

For testing we will use some random data clusters with a normal (Gaussian) distribution, readily created using random.normal. Here each data cluster is centred on a different point but has the same spread and size: 100 points in two dimensions. These are then combined into a single array of vectors (300 points in two dimensions) using vstack, which combines the arrays of vectors along their first axis (rows), and shuffled to mix up the vectors from the different clusters.

from numpy import random, vstack

spread = 0.12
sizeDims = (100, 2)
data = [random.normal((0.0, 0.0), spread, sizeDims),
        random.normal((1.0, 1.0), spread, sizeDims),
        random.normal((1.0, 0.0), spread, sizeDims)]

data = vstack(data)
random.shuffle(data)  # Randomise order

clusters = simpleCluster(data, 0.10)
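For reference, the fragments above can be assembled and run end to end. The sketch below restates the same functions and test data, adding only a fixed random seed (an addition, purely for repeatability); the printed total confirms that every one of the 300 points lands in exactly one cluster:

```python
from numpy import dot, sqrt, random, vstack

def euclideanDist(vectorA, vectorB):
    diff = vectorA - vectorB
    return sqrt(dot(diff, diff))

def findNeighbours(data, distFunc, threshold):
    # Record, for each item, the indices of items closer than the threshold
    neighbourDict = {}
    n = len(data)
    for i in range(n):
        neighbourDict[i] = []
    for i in range(0, n-1):
        for j in range(i+1, n):
            if distFunc(data[i], data[j]) < threshold:
                neighbourDict[i].append(j)
                neighbourDict[j].append(i)
    return neighbourDict

def simpleCluster(data, threshold, distFunc=euclideanDist):
    neighbourDict = findNeighbours(data, distFunc, threshold)
    clusters = []
    pool = set(range(len(data)))
    while pool:
        # Start a new cluster from any unprocessed item, then absorb
        # neighbours, and neighbours of neighbours, via pool2
        i = pool.pop()
        cluster = set()
        cluster.add(i)
        pool2 = set(neighbourDict[i])
        while pool2:
            j = pool2.pop()
            if j in pool:
                pool.remove(j)
                cluster.add(j)
                pool2.update(neighbourDict[j])
        clusters.append(cluster)
    return [[data[i] for i in cluster] for cluster in clusters]

random.seed(42)  # fixed seed so the run is repeatable (not in the original)
spread = 0.12
sizeDims = (100, 2)
data = vstack([random.normal((0.0, 0.0), spread, sizeDims),
               random.normal((1.0, 1.0), spread, sizeDims),
               random.normal((1.0, 0.0), spread, sizeDims)])
random.shuffle(data)

clusters = simpleCluster(data, 0.10)
print(sum(len(c) for c in clusters))  # partition of all points: 300
```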

The following can then plot the results with different symbols and colours. Note that we plot all the small ‘noise’ clusters, those with three or fewer items, as black spots. Also, the zip function is handy here because the scatter plotting function requires two separate lists of x values and y values, but our data has (x,y) pairs. Hence *cluster expands the one list into its separate (x,y) pairs, and zip groups the x and y values together (see Chapter 10).
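The zip(*…) transposition trick can be seen in isolation with a tiny hypothetical list of pairs:

```python
pairs = [(1, 10), (2, 20), (3, 30)]

# *pairs expands the list into separate tuple arguments, and zip
# then regroups the first and second elements: a transposition
allX, allY = zip(*pairs)
print(allX)  # (1, 2, 3)
print(allY)  # (10, 20, 30)
```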

from matplotlib import pyplot

colors = ['#F0F0F0', '#A0A0A0', '#505050',
          '#D0D0D0', '#808080', '#202020']
markers = ['d', 'o', 's', '>', '^']

i = 0
for cluster in clusters:
    allX, allY = zip(*cluster)

    if len(cluster) > 3:
        color = colors[i % len(colors)]
        marker = markers[i % len(markers)]
        pyplot.scatter(allX, allY, s=30, c=color, marker=marker)
        i += 1

    else:
        pyplot.scatter(allX, allY, s=5, c='black', marker='o')

pyplot.show()



