Python Programming for Biology: Bioinformatics and Beyond



Download 7,75 Mb.
Pdf ko'rish
bet393/514
Sana30.12.2021
Hajmi7,75 Mb.
#91066
1   ...   389   390   391   392   393   394   395   396   ...   514
Bog'liq
[Tim J. Stevens, Wayne Boucher] Python Programming

Feature space

The  k-nearest  neighbour  algorithm  described  below  is  a  very  simple  yet  surprisingly

powerful  method  to  perform  classification.  The  idea  is  that  you  describe  the  known,

classified data as points, with coordinate locations, in a feature space. Just like real three-

dimensional space, a feature space is defined by separate axes and a location in that space

is  defined  by  coordinates  on  each  of  the  axes.  A  simple  example  of  a  feature  space  is

colour,  which  may  be  described

1

 by  red,  green  and  blue  axes.  Any  colour  within  and



including the extremes of black (zero red, green and blue) and white (maximum red, green

and blue) is described by a certain amount of red, green and blue, in other words positions

on these axes, and thus a colour may be described as a location vector (red, green, blue).

We may also measure the distance between colour vectors, which is effectively to say how

similar the colours are. Note that with this colour example we don’t have the same kind of

unbounded  continuum  that  we  have  in  real  three-dimensional  space:  it  is  meaningless  to

have a negative colour value and we don’t let the values exceed the maximum intensity;

nothing can be whiter than white.

When dealing with feature spaces in general you can have a mixture of both bounded

and unbounded axes, and as many numbers of ‘dimensions’ as is required to describe your

data,  although  it  is  often  a  good  idea  to  normalise  the  numeric  data  so  that  it  fits  in  a

limited range of values (typically −1 to +1). A good example of high-dimensionality data

occurs in the use of biological sequences. For example, if you have input data that consists

of  DNA  sequences  that  are  five  base  pairs  long,  then  because  we  can  have  one  of  four

independent nucleotide residue types at each position along the length we need a vector of



length 20 (20 ‘dimensions’) to describe a five-letter sequence; 5 positions times 4 residue

types.  In  this  case  with  axes  for  G,  C,  A  and  T,  the  sequence  ‘TATGA’  would  have  the

vector (0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0), where each sequential group of

four numbers corresponds to one position in the DNA sequence. Note that 1 indicates the

presence  of  a  letter  on  its  axis,  and  0  the  absence.  If  we  moved  from  describing  pure

sequences to alignment profiles (i.e. proportions of the residue type at each position) we

could  have  a  sequence  with  fractional  values  instead  of  the  plain  0  or  1.  You  might  be

tempted to encode DNA sequences with a numbered system where, for example, A = 0, T

=  1,  C  =  2,  G  =  3,  so  the  above  example  would  be  (1,0,1,3,0).  However,  this  would  not

work  because  the  different  nucleotides  are  entirely  alternative  states  and  not  points  on  a

continuum where one type is meaningfully ‘closer’ to another.


Download 7,75 Mb.

Do'stlaringiz bilan baham:
1   ...   389   390   391   392   393   394   395   396   ...   514




Ma'lumotlar bazasi mualliflik huquqi bilan himoyalangan ©hozir.org 2024
ma'muriyatiga murojaat qiling

kiriting | ro'yxatdan o'tish
    Bosh sahifa
юртда тантана
Боғда битган
Бугун юртда
Эшитганлар жилманглар
Эшитмадим деманглар
битган бодомлар
Yangiariq tumani
qitish marakazi
Raqamli texnologiyalar
ilishida muhokamadan
tasdiqqa tavsiya
tavsiya etilgan
iqtisodiyot kafedrasi
steiermarkischen landesregierung
asarlaringizni yuboring
o'zingizning asarlaringizni
Iltimos faqat
faqat o'zingizning
steierm rkischen
landesregierung fachabteilung
rkischen landesregierung
hamshira loyihasi
loyihasi mavsum
faolyatining oqibatlari
asosiy adabiyotlar
fakulteti ahborot
ahborot havfsizligi
havfsizligi kafedrasi
fanidan bo’yicha
fakulteti iqtisodiyot
boshqaruv fakulteti
chiqarishda boshqaruv
ishlab chiqarishda
iqtisodiyot fakultet
multiservis tarmoqlari
fanidan asosiy
Uzbek fanidan
mavzulari potok
asosidagi multiservis
'aliyyil a'ziym
billahil 'aliyyil
illaa billahil
quvvata illaa
falah' deganida
Kompyuter savodxonligi
bo’yicha mustaqil
'alal falah'
Hayya 'alal
'alas soloh
Hayya 'alas
mavsum boyicha


yuklab olish