analysis work well [2]. This process is often used in microarray data analysis [3] or text
retrieval [4], fields where the initial number of descriptors is very high and where
dimensionality reduction is crucial before data analysis.
There are numerous theoretical presentations of the SVD. Roughly speaking, starting from
an initial description space $\aleph = \{X_1, \ldots, X_J\}$ of $J$ descriptors (and $n$
examples), we produce a new space of $J$ features $\Psi = \{F_1, \ldots, F_J\}$ with the
following constraints: $\Psi$ is an orthogonal basis; the factor $F_1$ is built from a
projection vector $P_1$ ($\|P_1\| = 1$) so as to maximize the variance of $F_1$,
$v_1 = \mathrm{Var}(F_1)$; the second factor $F_2$ is built from a projection vector $P_2$
($\|P_2\| = 1$) so as to maximize the variance $v_2 = \mathrm{Var}(F_2)$, with $F_2$
independent of (perpendicular to) $F_1$; and so on. The proximity between two individuals
is preserved in the two spaces and, more interestingly, in a subspace of $\Psi$ of dimension
$p$ ($p < J$) the distance between two examples is roughly preserved. The quality of the
approximation can be measured by the sum of the variances of the first $p$ selected factors,
$S_p = \sum_{j=1}^{p} v_j$.
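As a minimal sketch of these properties (assuming NumPy and a small synthetic standardized matrix, not the paper's data or software), the factors can be obtained from a standard SVD and checked for orthogonality and decreasing variances:

```python
import numpy as np

# Illustrative sizes only; the variable names X, F, P, v, S_p mirror the notation above.
rng = np.random.default_rng(0)
n, J, p = 100, 50, 5

X = rng.normal(size=(n, J))
X = (X - X.mean(axis=0)) / X.std(axis=0)   # standardized descriptors

U, s, Vt = np.linalg.svd(X, full_matrices=False)
P = Vt.T                                   # projection vectors P_j (unit-norm columns)
F = X @ P                                  # factors F_j = X P_j

v = F.var(axis=0)                          # v_j = Var(F_j), in decreasing order
S_p = v[:p].sum()                          # quality of the p-factor approximation
print("first factor variances:", v[:p])
print("S_p =", S_p)
print("orthogonal factors:", np.allclose(F.T @ F, np.diag(s**2)))
```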
There is a mathematical relation between SVD and PCA (Principal Component Analysis)
when the descriptors are standardized. If $\aleph'$ is the transpose of $\aleph$, the square
matrix $(\aleph'\aleph)$ is a correlation matrix: $v_1$ is its first eigenvalue and $P_1$ is the
associated eigenvector. Thus, the sum of the variances of the first $p$ selected factors gives
the proportion of variance explained by these factors ($E_p = S_p / J$).
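This relation can be checked numerically. The sketch below reuses the variables X, F, v, p and J from the previous sketch and assumes the correlation matrix is computed as $\aleph'\aleph / n$ for the standardized data:

```python
import numpy as np

R = (X.T @ X) / n                               # correlation matrix of standardized X (1/n normalization assumed)
eigvals = np.sort(np.linalg.eigvalsh(R))[::-1]  # eigenvalues in decreasing order

print("eigenvalues match factor variances:", np.allclose(eigvals[:p], v[:p]))
E_p = v[:p].sum() / J                           # proportion of variance explained by the first p factors
print("E_p =", E_p)
```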
In addition to the dimensionality reduction, which improves the efficiency of the
supervised learning algorithm, this process makes it possible to detect and extract the true
patterns in the data, while the last factors express the noisy information in the dataset. From
this point of view, the SVD is an effective data cleaning process: by selecting the $p$ best
factors, we discard the negligible information contained in the data. Thus, it is possible to
reconstruct an approximate version of the original data from the selected factors and
projection vectors.
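A simple illustration of this reconstruction, again reusing the variables from the first sketch, keeps only the first $p$ factors and projection vectors to form a rank-$p$ approximation of the standardized data:

```python
import numpy as np

X_approx = F[:, :p] @ P[:, :p].T           # rank-p reconstruction from the first p factors
rel_error = np.linalg.norm(X - X_approx) / np.linalg.norm(X)
print("relative reconstruction error:", rel_error)
```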
Regarding the implementation, the challenge was considerable. It was not possible to
use diagonalization techniques on the $8000 \times 8000$ correlation matrix in order to
extract the eigenvalues and eigenvectors. It was therefore necessary to consider the direct
extraction of the singular values from the standardized matrix $\aleph$ with a powerful
algorithm, since the computing time and the memory requirement are major constraints. We
used the NIPALS implementation [5], which interprets the singular value extraction as
successive orthogonal regressions: the first one produces the first factor $F_1$; using the
residuals of this regression, we perform a new regression in order to produce the second
factor $F_2$; and so on. This approach reduces the computations considerably since we can
stop the calculations as soon as the first $p$ factors have been generated. In our experiments,
with $n = 100$ examples and $J = 7000$ descriptors, the first 5 factors are generated in
10 seconds on a standard personal computer running under Windows (Pentium III, 1 GHz,
512 MB RAM). We used TANAGRA [6], an open source data mining software whose source
code is available on the website of the authors (http://eric.univ-lyon2.fr/~ricco/tanagra).
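For illustration only, a generic NIPALS-style extraction can be sketched as follows (this is a simplified example, not the TANAGRA implementation referenced above): each factor is obtained by an iterative regression, and the data matrix is then deflated with the residuals before extracting the next factor, so the loop can stop after the first $p$ factors.

```python
import numpy as np

def nipals(X, n_factors, n_iter=200, tol=1e-10):
    """Extract the first n_factors factors and projection vectors from a standardized matrix X."""
    X = X.copy()
    factors, vectors = [], []
    for _ in range(n_factors):
        t = X[:, 0].copy()                 # initial guess for the factor scores
        for _ in range(n_iter):
            w = X.T @ t / (t @ t)          # regress each descriptor on the current factor
            w /= np.linalg.norm(w)         # unit-norm projection vector (plays the role of P_j)
            t_new = X @ w                  # updated factor scores (plays the role of F_j)
            if np.linalg.norm(t_new - t) < tol:
                t = t_new
                break
            t = t_new
        X -= np.outer(t, w)                # deflation: keep only the residuals
        factors.append(t)
        vectors.append(w)
    return np.column_stack(factors), np.column_stack(vectors)

# F5, P5 = nipals(X_standardized, n_factors=5)   # stop once the first 5 factors are obtained
```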
2.2 SVD and Irrelevant Descriptors
Although the SVD is a very interesting process for dimensionality reduction with a
controlled loss of information, it has a major drawback in a protein classification framework:
the SVD is an unsupervised process. Indeed, to build the factors, it uses all the descriptors,
including those that are irrelevant for a supervised learning task.