Python Programming for Biology: Bioinformatics and Beyond

Date: 30.12.2021

[Tim J. Stevens, Wayne Boucher] Python Programming

Figure 23.1. How many seashells in how many groups? Some views of data are better at distinguishing items and clusters than others.

Clustering

Clustering is the process of partitioning data units into discrete groups. Such an operation requires that the similarity (or difference) between units is measured, and then that the members of each group are allocated to give the arrangement that maximises the association of similar items and the separation of dissimilar ones. In practice most of the clustering methods presented here cannot give an immediate analytical solution to this optimisation problem; rather, the process is iterative, with several cycles of improvement until a stable solution is found. As mentioned above, clustering may operate on data items which have a high dimensionality, represented as feature vectors. However, if the analysis is too slow or too complicated, the original data may be transformed (projected) into a lower-dimensionality data set by methods like principal component analysis (PCA) prior to the clustering operation.
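
The iterative cycle described above (measure similarity, allocate members, repeat until stable) can be illustrated with a minimal sketch. It assumes a k-means style scheme with squared Euclidean distance as the measure of difference, which is only one possible choice and is not necessarily the method developed later in the chapter. The function names and the toy data are hypothetical.

```python
import random

def sq_dist(a, b):
    # Squared Euclidean distance between two feature vectors (assumed measure).
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_point(points):
    # Coordinate-wise mean of a non-empty list of vectors.
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def simple_kmeans(data, k, max_cycles=100, seed=0):
    rng = random.Random(seed)
    centres = rng.sample(data, k)  # start from k randomly chosen data items
    labels = None
    for _ in range(max_cycles):
        # Allocation step: each item joins the cluster with the closest centre.
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centres[j]))
                      for p in data]
        if new_labels == labels:
            break  # stable solution: no item changed cluster in this cycle
        labels = new_labels
        # Improvement step: move each centre to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(data, labels) if lab == j]
            if members:
                centres[j] = mean_point(members)
    return labels, centres

# Hypothetical toy data: two well-separated groups of 2D feature vectors.
data = [(1.0, 1.1), (0.9, 1.0), (1.2, 0.8),
        (5.0, 5.2), (5.1, 4.9), (4.8, 5.0)]
labels, centres = simple_kmeans(data, k=2)
print(labels)
```

Which group receives label 0 depends on the random start, so only the grouping itself is meaningful, not the label values.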

Depending on the situation, the process of clustering may work with prior knowledge about the number of clusters, e.g. knowing what the underlying data categories are. Alternatively, the number of clusters may be completely unknown. If the number of clusters is not known then it must be deduced or optimised. Generally, several different trials are run, each of which involves a different number of clusters. Within each trial there is a separate optimisation of how the data items are allocated among that number of clusters. The best number of clusters is then determined from the best overall arrangement across all the trials. It would be possible to place each data item in a separate cluster, thus giving maximum separation, but the objective is to find the best balance between the number of clusters and the degree of separation, rather than only maximising separation.
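
As a rough sketch of the trial procedure just described, the code below runs one clustering per candidate number of clusters and scores each arrangement. The scoring function is an assumption: the text does not say how the balance is measured, so the mean silhouette value is swapped in here as one common choice. It scores single-item clusters as zero, so placing every item in its own cluster is not rewarded. The deterministic farthest-first start, the function names and the toy data are likewise hypothetical.

```python
import math

def farthest_first_centres(data, k):
    # Deterministic start (an assumption, chosen for reproducibility): begin
    # with the first item, then keep adding the item farthest from all
    # centres chosen so far.
    centres = [data[0]]
    while len(centres) < k:
        centres.append(max(data,
                           key=lambda p: min(math.dist(p, c) for c in centres)))
    return centres

def cluster_labels(data, k, max_cycles=100):
    # One trial: optimise the allocation of items for a fixed k.
    centres = farthest_first_centres(data, k)
    labels = None
    for _ in range(max_cycles):
        new_labels = [min(range(k), key=lambda j: math.dist(p, centres[j]))
                      for p in data]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(data, labels) if lab == j]
            if members:
                centres[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels

def mean_silhouette(data, labels):
    # Mean silhouette value; assumes the data contain no duplicate points.
    groups = set(labels)
    if len(groups) < 2:
        return 0.0
    total = 0.0
    for i, p in enumerate(data):
        same = [math.dist(p, q) for j, q in enumerate(data)
                if labels[j] == labels[i] and j != i]
        if not same:
            continue  # an item alone in its cluster contributes zero
        a = sum(same) / len(same)  # mean distance within the item's cluster
        b = min(sum(math.dist(p, q) for q, lab in zip(data, labels) if lab == g)
                / labels.count(g)
                for g in groups if g != labels[i])  # nearest other cluster
        total += (b - a) / max(a, b)
    return total / len(data)

# Hypothetical toy data: three tight groups of three items each.
data = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1),
        (5.0, 5.0), (5.2, 4.9), (4.8, 5.1),
        (9.0, 1.0), (9.2, 0.9), (8.8, 1.1)]

# One trial per candidate number of clusters, then keep the best score.
scores = {k: mean_silhouette(data, cluster_labels(data, k))
          for k in range(2, len(data))}
best_k = max(scores, key=scores.get)
print(best_k)
```

For this toy data the three-cluster trial scores highest, while putting every item in its own cluster gives a mean silhouette of exactly zero.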

Once clusters are defined the result may then be used as a means of predicting classification, i.e. estimating in which cluster a previously unseen piece of data lies. Making a prediction may be as simple as finding which cluster is closest. Alternatively, more advanced approaches, such as the supervised machine learning methods described in Chapter 24, can be used where classification is not so easy. These can learn patterns from training data with known, fixed classifications before predictions are made. One of the machine learning methods presented later, the self-organising map, is notable because it is unsupervised (needs no prior classifications) and thus can be viewed as an alternative to the linear clustering methods presented in this chapter.
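
The simplest prediction mentioned above, finding which cluster is closest, might look like the following sketch. It assumes each cluster is summarised by a single centre and that closeness means Euclidean distance; the function name and the example centres are hypothetical.

```python
import math

def nearest_cluster(item, centres):
    # Return the index of the cluster centre closest to the unseen item.
    distances = [math.dist(item, centre) for centre in centres]
    return distances.index(min(distances))

# Hypothetical centres obtained from an earlier clustering run.
centres = [(1.0, 1.0), (5.0, 5.0)]

print(nearest_cluster((1.3, 0.7), centres))
print(nearest_cluster((4.0, 4.5), centres))
```

Summarising a cluster by its centre works best for compact, roughly spherical clusters; for elongated or irregular ones, distance to the nearest member may be a better guide.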
