Hierarchical clustering
The next example moves on to rearranging the data (swapping rows and columns) so that we can better see similarities or correlations in the elements. This rearrangement will involve hierarchical clustering, and uses the same idea as building a phylogenetic tree (see Chapter 14), so that the most similar rows and columns are placed next to one another. The aim is that any rows or columns that show similar patterns will be placed together in easily visualised sub-groups. The image-generating method described earlier can then be used to make a visual representation (see Figure 16.3).
The function __hierarchicalRowCluster takes an array of data (a two-dimensional matrix of values for each colour channel) and uses this to build a matrix of the distances between each pair of rows, so that the element distanceMatrix[i][j] is the distance between row i and row j. By ‘distance’ here we mean a measure of how dissimilar two rows are, which we calculate as the Euclidean distance between the row vectors (the square root of the sum of the squared differences). In other situations you could consider different ways of estimating the similarity between rows.
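The Euclidean measure is not the only option: for gene expression data a correlation-based dissimilarity is often preferred, because it groups rows with similar profiles regardless of their absolute intensity. The short sketch below is illustrative only and not part of the Microarray class; it assumes channelData is a two-dimensional NumPy array with one row of values per spot.

from numpy import corrcoef

def correlationDistances(channelData):
    # corrcoef treats each row as one variable, giving an n x n matrix of
    # Pearson correlation coefficients between rows
    corr = corrcoef(channelData)
    # identical profiles get distance 0.0; anti-correlated profiles get 2.0
    return 1.0 - corr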
It should be noted that the function name begins with a double underscore ‘__’. In Python a function whose name starts, but does not end, with a double underscore is effectively private, so it cannot be called directly by that name from an instance of the class. A function is kept private like this when you don’t want to expose it as a normal method (a function linked to a class) and only use it internally within the class definition. Hence, here we make the clustering function private and use it in the inner workings of Microarray, but do not allow method calls from an object like microarray1.__hierarchicalRowCluster(data).
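Strictly speaking, Python enforces this privacy by name mangling rather than by hiding the function completely, something that can be seen with a small throw-away class (an illustration only, unrelated to the Microarray code):

class Demo:

    def __secret(self):
        return 'hidden'

d = Demo()
#print(d.__secret())       # AttributeError: 'Demo' object has no attribute '__secret'
print(d._Demo__secret())   # prints 'hidden'; the mangled name is still reachable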
def __hierarchicalRowCluster(self, dataMatrix):

    from SeqVariation import neighbourJoinTree

    n = len(dataMatrix[0])
    distanceMatrix = zeros((n, n), float)
We loop through each colour channel in dataMatrix and, for each row of that channel, subtract the row from the whole channel array, giving a matrix of differences relative to that row. The differences are squared and summed along each row, giving the squared distance from this row to every other row. These squared distances are then added to row i of distanceMatrix, so that contributions from each channel accumulate. Note that, although we described the Euclidean distance above, the square root is not actually taken here; the accumulated squared differences are used directly as the measure of dissimilarity.
    for channelData in dataMatrix:
        for i, row in enumerate(channelData):
            diffs = channelData - row
            sqDiffs = diffs * diffs
            sqDists = sqDiffs.sum(axis=1)
            distanceMatrix[i,:] += sqDists
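To make the broadcasting step explicit, here is a small stand-alone check with invented numbers, showing what the inner loop computes for the first row of a single channel:

from numpy import array

channelData = array([[0.0, 0.0],
                     [3.0, 4.0],
                     [6.0, 8.0]])

row = channelData[0]
diffs = channelData - row              # row is subtracted from every row
sqDists = (diffs * diffs).sum(axis=1)
print(sqDists)                         # [  0.  25. 100.] : squared distances to row 0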
The function neighbourJoinTree, which was described previously in Chapter 14, is reused here to create a hierarchical tree from the distance matrix. Note that we use .tolist() to make a list of lists from the array of distances, because the tree-generating function from earlier in the book was not written to work with NumPy arrays.
    tree, joinOrder = neighbourJoinTree(distanceMatrix.tolist())
The hierarchical tree structure is then interrogated by following its branches, effectively flattening it into a single list of node indices, which represents the order of rows for the shuffled matrix. The list rowOrder is initially defined as a copy of the tree, but is then processed to insert the contents of any sub-lists (or tuples) directly into the main list. A while loop is used here because the rowOrder list grows as the tree branches are flattened. We keep an index i, which represents the position in the list currently being processed, and continue until the end of the list is reached. The inner while loop checks whether the list item at position i is an integer (the end of a tree branch); if it is not, the contents of the sub-list (rowOrder[i]) are inserted into the main list using the slice notation [i:i+1], so that the single element holding the sub-list is replaced by its contents. This inner while loop is needed because the new element at position i may itself be another sub-list.
    rowOrder = list(tree)

    i = 0
    while i < len(rowOrder):
        while not isinstance(rowOrder[i], int):
            rowOrder[i:i+1] = rowOrder[i]
        i += 1

    return rowOrder
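The flattening logic can be checked in isolation with a hypothetical nested tree, independently of neighbourJoinTree:

rowOrder = list(((0, 3), (1, (2, 4))))   # an invented tree of row indices

i = 0
while i < len(rowOrder):
    while not isinstance(rowOrder[i], int):
        rowOrder[i:i+1] = rowOrder[i]    # splice the sub-list contents in place
    i += 1

print(rowOrder)   # [0, 3, 1, 2, 4]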
The row clustering routine is actually used by the hierarchicalCluster function, which is not private and so provides the method microarray.hierarchicalCluster(). This first clusters the rows in self.data, giving the new row order. The data array is then transposed (rows swapped for columns, keeping the channel axis first) and passed through the same clustering, which effectively clusters the columns and gives the new column order.
def hierarchicalCluster(self):

    rows = self.__hierarchicalRowCluster(self.data)

    swapped = transpose(self.data, axes=(0,2,1))
    cols = self.__hierarchicalRowCluster(swapped)
The reordered rows and cols are used as indices into the NumPy arrays to shuffle the data according to the hierarchical clustering, noting that we don’t affect self.data itself. The shuffled data is then used to make an entirely new Microarray object, with a different order for its rows and columns.
    data = self.data[:,rows] # Rearrange
    data = data[:,:,cols]
    # data = array(data.tolist()) # to fix PIL.Image bug

    name = self.name + '-Sorted'
    rowData = [self.rowData[i] for i in rows]
    colData = [self.colData[j] for j in cols]

    sortedArray = Microarray(name, data, rowData, colData)

    return sortedArray
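The index-based shuffling used above can be verified on a small invented array, where the first axis plays the role of the colour channel in self.data:

from numpy import arange

data = arange(12).reshape(1, 3, 4)   # one channel, three rows, four columns
rows = [2, 0, 1]
cols = [3, 1, 0, 2]

data = data[:,rows]       # reorder the rows
data = data[:,:,cols]     # then reorder the columns

print(data[0])
# [[11  9  8 10]
#  [ 3  1  0  2]
#  [ 7  5  4  6]]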
When testing the hierarchical clustering we get back a new Microarray object with a
different order of rows and columns:
sortedArray = rgArray.hierarchicalCluster()
sortedArray.makeImage(20).show()
print(rgArray.rowData)
print(sortedArray.rowData)
# [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17] – Original rows
# [4, 13, 2, 11, 6, 15, 5, 14, 7, 16, 1, 10, 3, 12, 8, 17, 0, 9] – Shuffled rows