Bioinformatics
BIO-ALGORITHMS AND MED-SYSTEMS
JOURNAL EDITED BY JAGIELLONIAN UNIVERSITY – MEDICAL COLLEGE
Vol. 7, No. 13, 2011, pp. 67-70
A COMBINED SVM-RDA CLASSIFIER
FOR PROTEIN FOLD RECOGNITION
Wiesław Chmielnicki 1, Katarzyna Stąpor 2
1 Jagiellonian University, Faculty of Physics, Astronomy and Applied Computer Science, Kraków, Poland
2 Silesian University of Technology, Institute of Computer Science, Gliwice, Poland
Abstract:
Predicting the three-dimensional (3D) structure of a protein is a key problem in molecular biology. It is also an
interesting problem for statistical recognition methods. Many approaches to this problem have been proposed, using both
discriminative and generative classifiers. This paper presents a classifier that combines the well-known support vector
machine (SVM) classifier with the regularized discriminant analysis (RDA) classifier. It is tested on a real-world data
set. The obtained results are promising and improve on previously published methods.
Keywords:
protein fold recognition, support vector machine,
multi-class classifier, one-versus-one strategy
Introduction
Predicting the three-dimensional (3D) structure of a protein is
a key problem in molecular biology. Proteins manifest their function
through these structures, so it is important to know not only the
sequence of amino acids in a protein molecule, but also how this
sequence is folded. The successful completion of many genome-sequencing
projects means that the number of proteins with a known amino acid
sequence is increasing quickly, while the number of proteins with
a known 3D structure is still relatively small.
There is a variety of approaches to protein structure prediction.
They range from those based on physical principles, through methods
that rely on evolutionary information, to statistical methods based
on machine-learning systems. A survey of these methods can be found
in Rychlewski et al. [22]. In this paper we focus on machine-learning
algorithms (Stąpor [20]).
Several machine-learning methods for predicting protein folds from
amino acid sequences have been proposed in the literature. Ding and
Dubchak [5] experimented with support vector machine (SVM) and neural
network (NN) classifiers. Shen and Chou [9] proposed an ensemble model
based on nearest neighbour classifiers. A modified nearest neighbour
algorithm called K-local hyperplane (HKNN) was used by Okun [14].
Nanni [13] proposed an ensemble of classifiers: Fisher’s linear
classifier and the HKNN classifier.
There are two standard approaches to the classification task.
Generative classifiers use the training data to estimate a probability
model for each class, and test items are then classified by comparing
their probabilities under these models. Discriminative classifiers
try to find the optimal boundaries between classes based on all the
samples of the training data set.
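The difference between the two approaches can be illustrated with a minimal sketch on one-dimensional toy data. Everything in it is an assumption made for illustration: the data values are invented, the generative model is a single Gaussian per class, and the discriminative rule is a simple threshold chosen to minimise training errors. It is not the RDA or SVM classifier used in this paper.

```python
import math

def fit_gaussian(xs):
    """Estimate the mean and variance of a 1-D sample (maximum likelihood)."""
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    return mean, var

def log_density(x, mean, var):
    """Log of the Gaussian density N(x; mean, var)."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

def generative_predict(x, models):
    """Generative rule: pick the class whose model gives x the highest density."""
    return max(models, key=lambda label: log_density(x, *models[label]))

def fit_threshold(samples):
    """Discriminative rule: pick the threshold with the fewest training errors.

    samples is a list of (x, label) pairs with labels 0 and 1;
    the decision rule is 'predict 1 if x > threshold'.
    """
    xs = sorted(x for x, _ in samples)
    candidates = [(a + b) / 2 for a, b in zip(xs, xs[1:])]

    def errors(t):
        return sum((x > t) != bool(label) for x, label in samples)

    return min(candidates, key=errors)

# Toy data, invented for illustration only.
class0 = [1.0, 1.5, 2.0, 2.5]
class1 = [4.0, 4.5, 5.0, 5.5]

# Generative: one probability model per class.
models = {0: fit_gaussian(class0), 1: fit_gaussian(class1)}
# Discriminative: a single boundary fitted on all samples at once.
threshold = fit_threshold([(x, 0) for x in class0] + [(x, 1) for x in class1])

generative_label = generative_predict(3.0, models)
discriminative_label = int(3.0 > threshold)
```

The generative classifier never looks at the other class while fitting a model, whereas the discriminative one uses all samples together to place the boundary.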
This paper presents a classifier that combines the support vector
machine (SVM), a discriminative classifier, with statistical
regularized discriminant analysis (RDA), a generative classifier.
The SVM technique has been used in many application domains and has
outperformed traditional techniques. However, the SVM is a binary
classifier, whereas protein fold recognition is a multi-class problem,
and how to extend a binary classifier effectively to the multi-class
case is still an ongoing research problem. Many methods have been
proposed to deal with this issue.
One of the first and best-known methods is the one-versus-one
strategy with the max-win voting scheme. In this strategy every binary
classifier votes for its preferred class, and in the original scheme
the class with the maximum number of votes is recognized as the
correct class.
However, some of these binary classifiers are unreliable, and their
votes still influence the final classification result. This paper
presents a strategy that assigns a weight (which can be treated as
a measure of reliability) to each vote, based on the values of the
discriminant function from an RDA classifier.
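The voting scheme can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of this paper: the function name one_versus_one_vote, the pairwise decision table and the reliability values are all hypothetical, and the weights are fixed numbers rather than values derived from an RDA discriminant function.

```python
from collections import defaultdict
from itertools import combinations

def one_versus_one_vote(classes, binary_decision, weight=None):
    """Combine pairwise decisions into one multi-class prediction.

    binary_decision(a, b) returns the winner (a or b) of the pair.
    weight(a, b, winner) returns the reliability of that vote; when it is
    None every vote counts as 1, which is plain max-win voting.
    """
    scores = defaultdict(float)
    for a, b in combinations(classes, 2):
        winner = binary_decision(a, b)
        scores[winner] += 1.0 if weight is None else weight(a, b, winner)
    # Ties are broken by the order of `classes` (an arbitrary choice here).
    return max(classes, key=lambda c: scores[c]), dict(scores)

# Hypothetical outcome of the three binary classifiers for one test sample.
classes = ["A", "B", "C"]
decisions = {("A", "B"): "A", ("A", "C"): "C", ("B", "C"): "B"}
# Hypothetical reliability of each binary classifier's vote.
reliability = {("A", "B"): 0.2, ("A", "C"): 0.9, ("B", "C"): 0.4}

plain_label, plain_scores = one_versus_one_vote(
    classes, lambda a, b: decisions[(a, b)]
)
weighted_label, weighted_scores = one_versus_one_vote(
    classes,
    lambda a, b: decisions[(a, b)],
    weight=lambda a, b, winner: reliability[(a, b)],
)
```

In this toy case every class receives exactly one unweighted vote, so max-win voting has to break a three-way tie arbitrarily, while the weighted variant selects the class backed by the most reliable vote.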
The rest of this paper is organized as follows: Section 2
introduces the database and the feature vectors used in these
experiments, Section 3 presents the basics of the RDA classifier,
Section 4 briefly describes the basics of the SVM classifier, Section
5 describes the method of combining the classifiers, and Section
6 presents experimental results and conclusions.