Python Programming for Biology: Bioinformatics and Beyond



Tim J. Stevens and Wayne Boucher

Support vector machines

The final example of a machine learning method in this chapter moves away from neural networks to a different and more recent approach, described generally as kernel methods; the specific example given here is known as a support vector machine (SVM). The SVM can be thought of as being related to the k-nearest neighbour idea, where we visualise known points of data in a vector space and aim to make predictions on previously unseen input by looking at where a query lies in this space in relation to the known, training data. Whereas with the k-nearest neighbour method a prediction is made by looking at a number of known values in the vicinity of a query, for an SVM the training data vectors are used to define the location of a boundary that separates the vector space into two regions. A prediction is made for an unseen query vector by seeing on which side of the decision boundary it occurs. In two dimensions such a decision boundary is a line, and in three dimensions it is a plane, but for most SVMs we are working in more, often many more, than three dimensions (data features), in which case the boundary is a hyperplane. The name support vector machine derives from the support vectors: the known training data points that are found at the edge of this boundary, defining its position.
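
To make the idea concrete, below is a minimal sketch using the scikit-learn library (an assumption for illustration; the book's own implementation may differ and the data values here are invented). Training vectors from two categories define the boundary, and an unseen query is classified by the side on which it falls:

import numpy as np
from sklearn.svm import SVC

# Two small clusters of two-dimensional training vectors, one per category
trainData = np.array([[1.0, 1.2], [1.5, 0.9], [0.8, 1.1],
                      [4.0, 4.2], [4.5, 3.9], [3.8, 4.1]])
knownClasses = np.array([0, 0, 0, 1, 1, 1])

svm = SVC(kernel='linear')    # a linear kernel: the boundary is a hyperplane
svm.fit(trainData, knownClasses)

# An unseen query vector is classified by which side of the boundary it lies on
query = np.array([[2.0, 2.0]])
print(svm.predict(query))     # expected to give [0] for this data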

One of the useful properties of support vector machines compared to neural network techniques comes from the fact that they maximise the margin between data categories. If you imagine the training data as two distinct, fixed groups, the operation finds the widest region that separates them. The boundary between the two groups is placed in the centre of this separating region, and the prediction for previously unseen data is made by finding out on which side of the boundary the query point lies. With a neural network the effective boundary between data categories will not always be placed fairly in the middle. This is a particular issue when the training data is not evenly spread (e.g. one prediction category is significantly more populous) or when the network has been over-trained. The decision line may lie much closer to one category than the other, with the result that unseen query data falling between the two will tend to be classified inappropriately, in only one direction. With a support vector machine it is only the support vectors at the edge of the maximised margin between categories that define the boundary, so over-training is not possible and the amount of data on either side does not affect the decision line so much.[13]
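
The following sketch (again assuming scikit-learn, with invented numbers) illustrates this property: only the training points at the edge of the maximised margin are reported as support vectors, and points lying deep inside a populous category play no part in placing the boundary:

import numpy as np
from sklearn.svm import SVC

# Deliberately uneven training data: category 1 is far more populous
data = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],            # category 0
                 [4.0, 4.0], [4.2, 3.8], [3.9, 4.1],            # category 1
                 [5.0, 5.2], [5.5, 4.8], [6.0, 5.5], [5.8, 6.1]])
classes = np.array([0, 0, 0, 1, 1, 1, 1, 1, 1, 1])

svm = SVC(kernel='linear', C=1000.0)   # a large C approximates a hard margin
svm.fit(data, classes)

# Only the points at the edge of the margin define the boundary; the extra
# category-1 points deep inside their own region are not support vectors
print(svm.support_vectors_)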

A support vector machine effectively defines the widest boundary between two categories of data vector by doing linear optimisation, similar to finding a line of best fit. However, the input data may not naturally lie in two discrete regions and thus be separable with a simple planar boundary. Imagine trying to decide which applicants should be allowed to join the army based on their weight measurements. You wouldn’t expect to be able to use a single decision line sensibly to separate those who pass from those who fail, given that you have to distinguish both those people who are overweight and those who are underweight. This kind of problem can nonetheless be solved using a single line, and thus by a support vector machine, because the decision boundary can be placed in a space of higher dimensionality than the original problem. Referring again to the army recruits, if instead of keeping a single weight axis you arrange the weight measurements along a semi-circle, you can now separate the pass and fail categories above and below a single line. It is simply that the data is now in two dimensions rather than the original one. The extra dimension has nothing to do with the original data; it has simply spread the points out in a predictable way, and you can still retrieve the original values if required: in this case from the distance along the curve. Using a known function to spread data points into more dimensions is often referred to as the kernel trick,[14] and it gives the SVM the power to operate on nonlinear problems (e.g. with context dependency) while using mathematically robust, linear means to make an optimum decision. Although the simple example mentioned uses two dimensions to achieve separation, a general support vector machine can use any number of extra dimensions. At first this might seem to complicate the problem arbitrarily, but the really clever part of the kernel trick is that within the algorithm that finds the separating boundary the high-dimensionality locations of the data points never have to be stored or even calculated. The optimisation can be performed using only the dot products between pairs of data points in the high-dimensional space, and these are obtained directly from the original, low-dimensional vectors via the kernel function. For a more mathematical explanation of support vector machines and the kernel trick see the reference given below.[15]
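
The army-recruit example can be sketched in the same hedged way (scikit-learn again assumed; the RBF kernel and the quadratic feature map are illustrative choices among many, and the numbers are invented). With a single ‘weight’ feature the pass group sits between two fail groups, so no one threshold works, but a nonlinear kernel, or an explicit extra dimension, makes the categories separable by a single straight boundary:

import numpy as np
from sklearn.svm import SVC

# One-dimensional 'weight' data: recruits fail if too light or too heavy,
# so no single threshold along the weight axis separates pass from fail
weights = np.array([[45.0], [50.0], [65.0], [70.0], [75.0], [95.0], [105.0]])
passed = np.array([0, 0, 1, 1, 1, 0, 0])   # 1 = pass, 0 = fail

# A nonlinear kernel implicitly spreads the points into extra dimensions,
# where a single hyperplane can separate the two categories
rbfSvm = SVC(kernel='rbf', gamma=0.01).fit(weights, passed)
print(rbfSvm.predict([[44.0], [70.0], [110.0]]))    # e.g. [0 1 0]

# The same idea made explicit: add a second coordinate derived from the
# first (akin to bending the axis into a curve), then separate linearly
def spread(x, centre=72.5):
    return np.hstack([x, (x - centre) ** 2])

flatSvm = SVC(kernel='linear').fit(spread(weights), passed)
print(flatSvm.predict(spread(np.array([[44.0], [70.0], [110.0]]))))  # e.g. [0 1 0]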

