Support vector machine predictions
Using the SVM to make a prediction involves working out on which side of the decision
hyperplane, determined during training, a query feature vector lies. Naturally, the
prediction function takes a query vector as input, together with the training data and its
known categories. We also pass in the kernel function and its parameters, which allow the
coincidence (similarity) of feature vectors to be calculated.
def svmPredict(query, data, knowns, supports, kernelFunc, kernelParams):

    prediction = 0.0
    for j, vector in enumerate(data):
        support = supports[j]

        if support > 0:
            coincidence = kernelFunc(vector, query, *kernelParams) + 1.0
            prediction += coincidence * support * knowns[j]

    return prediction
The SVM prediction is made by going through all of the training data points and
finding those that are support vectors (support > 0). When a support vector is found, its
coincidence with (similarity to) the query is calculated using the kernel function. The degree of
coincidence is multiplied by the amount of support for that training vector and by its known
classification. Given that the known classification of the data vector is +1.0 or −1.0, this
will either add to or subtract from the prediction total; effectively each support vector pulls
the summation to the positive or the negative side. In the end, whether the prediction
value is finally positive or negative determines the category of the query.
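As a small convenience, the sign of the summed prediction can be converted directly into a +1.0 or −1.0 category label. The wrapper below is a hypothetical addition, not part of the original code, and simply calls svmPredict() defined above:
def svmPredictCategory(query, data, knowns, supports, kernelFunc, kernelParams):
    # Hypothetical helper: the sign of the support-weighted sum gives the class
    prediction = svmPredict(query, data, knowns, supports, kernelFunc, kernelParams)

    return 1.0 if prediction > 0.0 else -1.0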
This next function, svmSeparation(), is used to test whether the training data was well
separated into two categories, i.e. reproducing the known classification. We don’t use the
above prediction function because we can reuse the pre-calculated kernelArray for speed.
As before, the known classification is in the form of an array containing values of +1.0 or
−1.0.
def svmSeparation(knowns, supports, kernelArray):

    score = 0.0
    nz = [i for i, val in enumerate(supports) if val > 0]

    for i, known in enumerate(knowns):
        prediction = sum(supports[nz] * knowns[nz] * kernelArray[nz, i])

        if known * prediction > 0.0: # same sign
            score += 1.0

    return 100.0 * score / len(knowns)
Making the prediction uses the same logic as described for svmPredict(),
although here we do it in one line using NumPy array operations, given that we don’t have
to call the kernel function and can use the pre-calculated array instead. It is also notable
that we calculate nz, a list of the indices of the non-zero support values, upfront to help
reduce the number of calculations. To test whether each classification is correct, we
check whether the prediction has the same sign as the known classification. At the end the
function gives back the percentage of correct classifications for the training data.
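To illustrate the NumPy indexing used in the one-line prediction, the following stand-alone sketch uses small made-up arrays (not real training output) to show how the list of non-zero support indices selects the relevant entries:
from numpy import array

# Made-up values, purely to illustrate the indexing
supports = array([0.0, 2.0, 0.0, 1.5])        # indices 1 and 3 are support vectors
knowns = array([1.0, 1.0, -1.0, -1.0])
kernelArray = array([[1.0, 0.2, 0.1, 0.3],
                     [0.2, 1.0, 0.4, 0.6],
                     [0.1, 0.4, 1.0, 0.5],
                     [0.3, 0.6, 0.5, 1.0]])

nz = [i for i, val in enumerate(supports) if val > 0]    # [1, 3]
i = 0   # prediction for the first training point
prediction = sum(supports[nz] * knowns[nz] * kernelArray[nz, i])
# 2.0*1.0*0.2 + 1.5*(-1.0)*0.3 = -0.05, i.e. predicted as the -1 category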
To test out the support vector machine code we will make a fairly simple example that
contains a discontinuous patchwork of points in a two-dimensional plane that have been
placed into one of two categories, each in distinct regions. The following code goes
through a grid of x and y positions, which are normalised to be between 0.0 and 1.0, to
make an alternating chequerboard pattern for the categorisation (−1 or +1), except for the
middle square, which is flipped the other way, resulting in a central cross. This will give a
recognisable shape in the data that we can look for afterwards.
At each grid location the random.normal function from NumPy is used to make a
cluster of points, generating numPoints normally distributed values (with a standard
deviation of 0.2) about the grid centre for each of the x and y axes. The category and the x
and y values for each point are placed in the main catData list. This list is then shuffled to
introduce a random order. The array of known categorisations is extracted as the last index
(-1) for all catData items and the training feature vectors as everything up to the last index
([:,:-1]).
from numpy import array, random

numPoints = 20
catData = []

for x in range(1, 6):
    for y in range(1, 6):
        xNorm = x/6.0 # Normalise range [0,1]
        yNorm = y/6.0

        if (x == 3) and (y == 3):
            category = -1.0
        elif (x % 2) == (y % 2):
            category = 1.0
        else:
            category = -1.0

        xvals = random.normal(xNorm, 0.2, numPoints)
        yvals = random.normal(yNorm, 0.2, numPoints)

        for i in range(numPoints): # xrange in Python 2
            catData.append( (xvals[i], yvals[i], category) )

catData = array(catData)
random.shuffle(catData)

knowns = catData[:,-1]
data = catData[:,:-1]
Running the SVM on this data involves passing in the known classifications, the training
data, a Gaussian kernel function and the parameters for the kernel. After training, the
svmSeparation() function can be used to assess how well the SVM separates the known
categories.
params = (0.1,)
supports, steps, kernelArray = svmTrain(knowns, data, kernelGauss, params)
score = svmSeparation(knowns, supports, kernelArray)
print('Known data: %5.2f%% correct' % ( score ))
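Having trained on the chequerboard data, we can look for the central cross by calling svmPredict() on a regular grid of query points. The following is a minimal sketch; it assumes the same kernelGauss function and params tuple that were used for training above:
for i in range(10):
    for j in range(10):
        query = array([i/10.0, j/10.0])
        prediction = svmPredict(query, data, knowns, supports, kernelGauss, params)

        # The sign of the prediction gives the predicted category at this grid point
        category = '+1' if prediction > 0.0 else '-1'
        print('x=%.1f y=%.1f predicted category: %s' % (query[0], query[1], category))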