Support vector machines
The final example of a machine learning method in this chapter moves away from neural
networks to a different and more recent approach, described generally as kernel methods;
the specific example given here is the support vector machine (SVM). The
SVM can be thought of as being related to the k-nearest neighbour idea, where we
visualise known points of data in a vector space and aim to make predictions on
previously unseen input by looking where a query lies in this space in relation to the
known, training data. Whereas with the k-nearest neighbour method a prediction is made
by looking at a number of known values in the vicinity of a query, for an SVM the training
data vectors are used to define the location of a boundary that separates the vector space
into two regions. A prediction is made for an unseen query vector by seeing on which side
of the decision boundary it occurs. For two dimensions such a decision boundary would be
a line, and in three dimensions it is a plane, but for most SVMs we are working in more,
often many more, than three dimensions (data features), in which case the boundary is a
hyperplane. The name support vector machine derives from the support vectors: the training
data points that lie closest to the boundary, at the edge of the separating margin, and
which thus define the position of the boundary.
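As a concrete illustration of this idea, the following minimal sketch (assuming the
scikit-learn library and some invented feature vectors, neither of which is prescribed by
the text) trains a support vector machine on labelled data points and then predicts the
category of an unseen query according to which side of the learned boundary it falls on:

from sklearn.svm import SVC

trainData = [[1.0, 2.1], [1.5, 1.8], [5.2, 5.9], [6.1, 5.5]]   # known feature vectors
knownClasses = [0, 0, 1, 1]                                    # their known categories

svm = SVC(kernel='linear')             # a linear boundary (hyperplane) in feature space
svm.fit(trainData, knownClasses)

query = [[2.0, 2.0]]
print(svm.predict(query))              # which side of the boundary the query lies on
print(svm.support_vectors_)            # the training points that define that boundary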
One of the useful properties of support vector machines compared to neural network
techniques comes from the fact that they maximise the margin between data categories. If
you imagine the training data to be two distinct, fixed groups the operation finds the
widest region that separates them. The boundary between these two groups is placed in the
centre of the separating region, and the prediction of previously unseen data is made by
finding out on which side of the boundary the query point lies. With a neural network the
effective boundary between data categories will not necessarily be placed fairly in the
middle. This is a particular issue when the training data is not evenly spread (e.g. one
prediction category is significantly more populous) or when the network has been
over-trained. The decision line may then lie much closer to one category than the other,
with the result that unseen query data falling between the two tends to be misclassified,
biased towards a single category. With a support vector machine it is only the support
vectors at the edge of the maximised margin between categories that define the boundary,
so over-training is not possible and the amount of data on either side has far less effect
on the decision line.
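The effect of margin maximisation can be demonstrated with another small sketch, again
assuming scikit-learn and invented data: because only the support vectors at the edge of
the gap determine the boundary, adding many extra training points that lie well inside one
category leaves the decision line essentially unchanged.

from sklearn.svm import SVC

groupA = [[0.0, 0.0], [1.0, 0.5], [0.5, 1.0]]
groupB = [[4.0, 4.0], [5.0, 4.5], [4.5, 5.0]]

svm1 = SVC(kernel='linear', C=1e6)     # a large C approximates a hard margin
svm1.fit(groupA + groupB, [0, 0, 0, 1, 1, 1])
print(svm1.coef_, svm1.intercept_)     # hyperplane placed midway between the groups

# Add extra points deep inside group B; they are far from the margin, so they
# are not support vectors and the boundary stays essentially where it was.
extra = [[6.0 + 0.1 * i, 6.0] for i in range(20)]
svm2 = SVC(kernel='linear', C=1e6)
svm2.fit(groupA + groupB + extra, [0, 0, 0] + [1] * 23)
print(svm2.coef_, svm2.intercept_)     # effectively the same hyperplane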
A support vector machine effectively defines the widest boundary between two
categories of data vector by doing linear optimisation, similar to finding a line of best fit.
However, the input data may not naturally lie in two discrete regions and thus be separable
with a simple plane boundary. Imagine trying to decide which applicants should be
allowed to join the army based on their weight measurements. You wouldn’t expect to be
able to sensibly use a single decision line to separate those who pass and those who fail,
given that you have to distinguish between both those people who are overweight and
those who are underweight. This kind of problem can nonetheless be solved using a single
line, and thus by a support vector machine, because the decision boundary can be placed in
a space of higher dimensionality than the original problem. Referring
again to the army recruits, if instead of having a single weight axis you put the weight
measurements in a semi-circle then you can now separate the pass and fail categories
above and below a single line. It is simply that the data is now in two dimensions, rather
than the original one. The extra dimension has nothing to do with the original data; it has
simply spread it out in a predictable way, and you can still retrieve the original values if
required: in this case from the distance along the curve. Using a known function to spread
data points into more dimensions is often referred to as the kernel trick and gives power
to the SVM to operate on nonlinear problems (e.g. with context dependency) while using
mathematically robust, linear means to make an optimum decision. Although the simple
example mentioned uses two dimensions to achieve separation, a general support vector
machine can use any number of extra dimensions. Although at first this might seem to
arbitrarily complicate the problem, the really clever part of the kernel trick is that within
the algorithm to find the separating boundary the high-dimensionality locations of the data
points do not have to be stored or even calculated. The optimisation can be performed
using only the dot products between pairs of data points in the high-dimensionality space,
and these values can be computed directly from the original feature vectors via the kernel
function. For a more mathematical explanation of
support vector machines and the kernel trick, see the reference given below.
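To make the weight example concrete, the sketch below (with invented weight values and,
once more, an assumed scikit-learn dependency) maps each one-dimensional weight w to the
pair (w, w*w), a different curve from the semi-circle described above but with the same
effect, so that the accepted middle range can be split from both extremes with a single
straight line; a nonlinear kernel achieves an equivalent separation implicitly, without the
extra dimension ever being constructed:

from sklearn.svm import SVC

weights  = [[45.0], [50.0], [65.0], [70.0], [75.0], [95.0], [100.0]]
accepted = [0, 0, 1, 1, 1, 0, 0]       # 1 = accepted, 0 = rejected (under- or overweight)

# Explicit mapping into two dimensions, followed by a simple linear boundary
mapped = [[w[0], w[0] * w[0]] for w in weights]
svmLinear = SVC(kernel='linear').fit(mapped, accepted)
print(svmLinear.predict([[72.0, 72.0 * 72.0]]))   # an intermediate weight, mapped the same way

# The kernel trick: a nonlinear kernel on the original one-dimensional data,
# where the higher-dimensional mapping is never constructed explicitly
svmKernel = SVC(kernel='rbf', gamma=0.01).fit(weights, accepted)
print(svmKernel.predict([[72.0]]))                # expected to fall in the accepted class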