- visual data exploration can easily deal with
highly heterogeneous and noisy data,
- visual data exploration is intuitive and requires
no understanding of complex mathematical or
statistical algorithms or parameters.
As a result, visual data exploration usually
allows faster exploration and often provides
better results, especially in cases where automatic
algorithms fail. This leads to a high demand for
visual exploration techniques and makes them
indispensable complements to automatic
exploration techniques. In data mining, new
methods have recently appeared (Poulet, 2004)
that try to involve the user more significantly in the
process and to make more intensive use of
visualization (Aggarwal, 2002). We think it is
important to take user perception into account to
overcome the drawbacks of the dimension selection
process, and we propose a new approach in which
the user's choice is combined with automatic fitness
functions. The automatic fitness functions are
applied to eliminate a large part of the redundant
and noisy solutions, while the interactive fitness is
applied to judge whether a visual interpretation is
understandable or not.
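To make this division of labour concrete, the following minimal Python sketch shows the two-stage evaluation. The automatic fitness used here (mean variance of the selected dimensions), the acceptance threshold, and the console prompt standing in for the analyst's visual judgment are all illustrative assumptions, not the actual implementation of our method:

def automatic_fitness(subset, data):
    # Hypothetical automatic fitness: mean variance of the selected
    # dimensions of a NumPy array. Any cheap internal measure (e.g. a
    # cluster-validity index on the projected data) could be substituted.
    return data[:, sorted(subset)].var(axis=0).mean()

def ask_user(subset):
    # Interactive fitness: the analyst inspects the projection on
    # `subset` (e.g. a scatter-plot matrix) and accepts or rejects it.
    answer = input(f"Is the view on dimensions {sorted(subset)} understandable? [y/n] ")
    return answer.strip().lower() == "y"

def evaluate(population, data, threshold):
    # Stage 1: the automatic fitness eliminates most redundant and
    # noisy dimension subsets without any user effort.
    survivors = [s for s in population if automatic_fitness(s, data) >= threshold]
    # Stage 2: the interactive fitness keeps only the views the user
    # judges understandable.
    return [s for s in survivors if ask_user(s)]

The point of the two stages is that the user is asked to judge only the small fraction of candidates that survive the automatic filter.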
2.3 Dimensionality Selection
Dimension selection attempts to discover the
attributes of a dataset that are the most relevant to
the data-mining task. It is a commonly used and
powerful technique for reducing the dimensionality
of a problem to more manageable levels. Feature
selection involves searching through various feature
subsets and evaluating each of these subsets using
some criterion (Liu and Motoda, 1998). The
evaluation criteria follow one of two basic models:
the wrapper model and the filter model. Wrapper
techniques evaluate each candidate feature subset
using the data-mining algorithm that will ultimately
be applied, while algorithms based on the filter
model examine intrinsic properties of the data to
evaluate the subset prior to data mining. Much of the work in
feature selection has been directed at supervised
learning. The main difference between feature
selection in supervised and unsupervised learning is
the evaluation criterion. Supervised wrapper models
use classification accuracy as a measure of
goodness. The filter-based approaches almost
always rely on the class labels, most commonly
assessing correlations between features and the class
labels. In the unsupervised clustering problem, there
are no universally accepted measures of accuracy
and no class labels. However, there are a number of
methods that adapt feature selection to clustering.
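As an illustration of the wrapper idea in the unsupervised setting, the sketch below scores each candidate subset by clustering the projected data with k-means, assuming scikit-learn is available. The silhouette index used as the quality measure and the exhaustive search over fixed-size subsets are simplifying assumptions for illustration, not the criteria or search strategies of the cited works:

from itertools import combinations
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def wrapper_score(data, subset, k=3):
    # Project the data (a NumPy array) onto the candidate subset and run
    # the clustering algorithm that will ultimately be used, as the
    # wrapper model prescribes. The silhouette index is one possible
    # quality measure; the cited works define their own criteria.
    projected = data[:, list(subset)]
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(projected)
    return silhouette_score(projected, labels)

def best_subset(data, subset_size, k=3):
    # Exhaustive enumeration of fixed-size subsets, for illustration
    # only; practical methods rely on heuristic or genetic search.
    candidates = combinations(range(data.shape[1]), subset_size)
    return max(candidates, key=lambda s: wrapper_score(data, s, k))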
The wrapper method proposed in (Kim & al., 2000)
forms a new feature subset and evaluates the
resulting set by applying a standard k-means
algorithm. The EM clustering algorithm can also be
used in the wrapper framework (Dy and Brodley,
2000). Hybrid methods have also been developed
that use a filter approach as a heuristic and refine the
results with a clustering algorithm. In addition to
using different evaluation criteria, unsupervised
feature selection methods have employed various
search methods in an attempt to scale to large,
high-dimensional datasets. On such datasets, genetic
search becomes a viable heuristic and has
been used with many of the aforementioned criteria
(Boudjeloud and Poulet, 2005a). Another promising
approach focusing on unsupervised classification is
subspace clustering. A survey of subspace clustering
algorithms can be found in (Parsons & al., 2004).
The two main types of subspace clustering
algorithms can be distinguished by the way they
search for subspaces. A naive approach might be to
search through all possible subspaces and use cluster
validation techniques to determine the subspaces
with the best clusters. This is not feasible because
the subset generation problem is intractable: a
dataset with d dimensions has 2^d - 1 non-empty
subspaces, so the search space grows exponentially
with dimensionality. More
sophisticated heuristic search methods are required
and the choice of a search technique determines
many other characteristics of an algorithm. (Parsons
et al., 2004) divide subspace clustering algorithms
into two categories based on how they determine a
measure of locality used to evaluate subspaces. The
bottom-up search method takes advantage of the
downward closure property of density to reduce the
search space, using an APRIORI-style approach. The
top-down subspace clustering approach starts by
finding an initial approximation of the clusters in the
full feature space with equally weighted dimensions.
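The downward closure property exploited by the bottom-up search says that if a k-dimensional subspace contains dense units, then so does every (k-1)-dimensional projection of it; its contrapositive lets the search discard any candidate with a non-dense sub-subspace. Below is a minimal sketch of this APRIORI-style candidate generation and pruning, where the is_dense predicate (testing whether a subspace contains dense units) is a hypothetical stand-in for an actual grid-based density estimate:

from itertools import combinations

def bottom_up_subspaces(n_dims, is_dense):
    # Level 1: keep the single dimensions that contain dense units.
    current = {frozenset([d]) for d in range(n_dims) if is_dense(frozenset([d]))}
    result = set(current)
    k = 2
    while current:
        # Join step: merge pairs of dense (k-1)-dimensional subspaces
        # into k-dimensional candidates.
        candidates = set()
        for a in current:
            for b in current:
                union = a | b
                # Prune step (downward closure): every (k-1)-dimensional
                # projection of a candidate must itself be dense.
                if len(union) == k and all(
                    frozenset(sub) in current for sub in combinations(union, k - 1)
                ):
                    candidates.add(union)
        # Keep only the candidates that actually contain dense units.
        current = {c for c in candidates if is_dense(c)}
        result |= current
        k += 1
    return result

Only the subspaces that survive this pruning are handed to the clustering step, which is what keeps the exponential candidate space manageable.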
Subspace clustering methods require considerable
parameter tuning to produce meaningful results.
The most critical parameters for
top-down algorithms are the number of clusters and
the size of the subspaces, which are often very
difficult to determine a priori. Since subspace size is
a parameter, top-down algorithms tend to find
clusters in the same or similarly sized subspaces. For
techniques that use sampling, the size of the sample
is another critical parameter and can play a large role
in the quality of the final results.
3 SEMI INTERACTIVE GENETIC
ALGORITHM
The large number of dimensions of the data set is
one of the major difficulties encountered in data
mining. We use a genetic algorithm (Boudjeloud and
Poulet, 2004; Boudjeloud and Poulet, 2005a) for
dimension selection, with the individual represented