2 NNGE CATEGORIZATION
The data mining problem implies analyzing a set of
points defined by their geographic coordinates x and y
and their damage or risk level r. Depending on the
considered approach, the risk can be nominal, which
means that each building belongs to a certain risk
class C_r, or numerical, i.e. each building has a risk
probability associated with it, a real number r ∈ [0, 1].
The goal is to find subsets of nearby points, clusters,
which share the same C_r, or at least clusters with
minimum impurity, i.e. most of the cluster members
should belong to the same class or have close r values.
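As an illustration of this setting (the point values and the impurity measure below are our own, introduced only for exposition), a cluster can be represented as a list of points carrying coordinates and a nominal risk class, and its impurity measured as the fraction of members that do not belong to the majority class:

from collections import Counter
from typing import List, Tuple

# A building: (x, y, risk_class); coordinates are geographic,
# the class label is the nominal risk level C_r.
Point = Tuple[float, float, str]

def impurity(cluster: List[Point]) -> float:
    """Fraction of cluster members that do not belong to the majority
    class; 0.0 means a perfectly homogeneous cluster."""
    if not cluster:
        return 0.0
    counts = Counter(p[2] for p in cluster)
    majority = counts.most_common(1)[0][1]
    return 1.0 - majority / len(cluster)

# Example: a cluster of three nearby buildings with one exception
cluster = [(23.1, 45.2, "high"), (23.3, 45.1, "high"), (23.2, 45.4, "low")]
print(impurity(cluster))   # 0.333... -> one of the three points is an exception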
A straightforward approach is to use a
categorization algorithm to describe such subsets of
points. In general, categorization is a task of finding a
target function f that maps each attribute set A that
defines an object into one (or more, each with a
degree of membership) predefined class C. This target
function f is also known as the categorization or
classification model.
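In code, such a target function is simply a mapping from an attribute set to a class label, or to membership degrees over the predefined classes; the toy sketch below is only a generic illustration of this notion, not a model used in the paper:

from typing import Dict, Sequence

def f(attributes: Sequence[float]) -> Dict[str, float]:
    """Toy categorization model: maps an attribute set A to membership
    degrees over predefined classes (thresholds are purely illustrative)."""
    risk = attributes[-1]                 # assume the last attribute is a risk score in [0, 1]
    return {"low": 1.0 - risk, "high": risk}

def classify(attributes: Sequence[float]) -> str:
    degrees = f(attributes)
    return max(degrees, key=degrees.get)  # crisp class = highest membership degree

print(classify([23.1, 45.2, 0.8]))        # -> "high"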
In the literature (Tan, Steinbach & Kumar, 2005;
Han & Kamber, 2000; Mitchell, 1997; Nilsson, 1996)
several types of categorization algorithms are
described. Among the most frequently used are rule-
based methods, prototype-based methods and
exemplar-based methods.
For the particular purpose of our research, the
rule-based categorization seems to be most
appropriate, since we need a non-hierarchical, explicit
partition of data. A nearest-neighbor-based approach
is also suitable, because the prediction phase is
irrelevant in our case: the damage of a building cannot
be predicted by taking into account only the damage of
its neighbors, so we only need to describe the existing
data. In addition, this class of algorithms always
performs well on the training set, with error rates
close to 0.
Such an algorithm is the Non-Nested Generalized
Exemplar, NNGE (Martin, 1995; Witten & Frank,
2000), which forms homogeneous hyper-rectangles
(generalized exemplars) in the attribute space, such
that no exception (an example of another class) is
contained within them. The hyper-rectangles do not
overlap, and in this way the algorithm prevents
over-fitting.
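The sketch below illustrates, in a much simplified form, the central idea of such generalized exemplars: an axis-parallel hyper-rectangle may only grow over a new example if no counter-example of another class would fall inside the grown rectangle. The nearest-exemplar search, splitting and the other details of Martin's algorithm are omitted; an implementation of NNGE is also distributed with the Weka toolkit described by Witten & Frank (2000).

from typing import List, Tuple

Point = Tuple[float, float]

class HyperRectangle:
    """Axis-parallel generalized exemplar over 2-D points of one class."""

    def __init__(self, point: Point, label: str):
        self.lo = list(point)   # lower corner
        self.hi = list(point)   # upper corner
        self.label = label

    def contains(self, p: Point) -> bool:
        return all(lo <= v <= hi for lo, v, hi in zip(self.lo, p, self.hi))

    def try_generalize(self, p: Point, examples: List[Tuple[Point, str]]) -> bool:
        """Extend the rectangle to cover p, but only if no example of a
        different class would end up inside the grown rectangle."""
        lo = [min(l, v) for l, v in zip(self.lo, p)]
        hi = [max(h, v) for h, v in zip(self.hi, p)]
        for q, lbl in examples:
            if lbl != self.label and all(l <= v <= h for l, v, h in zip(lo, q, hi)):
                return False    # an exception would be swallowed: do not grow
        self.lo, self.hi = lo, hi
        return True

r = HyperRectangle((1.0, 1.0), "high")
print(r.try_generalize((2.0, 2.0), [((1.5, 1.5), "low")]))  # False: exception inside
print(r.try_generalize((0.5, 1.2), [((1.5, 1.5), "low")]))  # True: safe to grow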
In order to test the behavior of the algorithm we
used a test problem proposed by Eick, Zeidat, and
Zhao (2004), displayed in figure 1, where different
point colors represent different classes.
The results of the NNGE algorithm are presented in
the same figure. One can see the hyper-rectangles
found by the algorithm, which are 2-D rectangles in
our case. In addition, the convex hull of the cluster
points is emphasized and the internal area of the
convex hull is hatched.
Figure 1: NNGE results for the test problem.
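A display in the spirit of figure 1 can be reproduced with standard tools; the sketch below, which assumes NumPy, SciPy and Matplotlib are available and uses made-up points instead of the actual test data, draws a cluster's points together with its hatched convex hull:

import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial import ConvexHull

# Made-up 2-D cluster; replace with the actual cluster members.
pts = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.5], [0.5, 2.0], [1.2, 1.8]])

hull = ConvexHull(pts)
hull_pts = pts[hull.vertices]          # hull vertices in counter-clockwise order

plt.scatter(pts[:, 0], pts[:, 1], c="black")
plt.fill(hull_pts[:, 0], hull_pts[:, 1],
         facecolor="none", edgecolor="blue", hatch="//")  # hatched hull interior
plt.show()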
The algorithm only discovers axis-parallel hyper-
rectangles; it cannot take into account other
distributions of data. Another disadvantage is that
NNGE can link rather distant points, if there is no
exception example lying between them.
An alternative approach is to use a clustering
method instead of classification, which should also
use the predefined r values of points.
3 K-NEAREST NEIGHBOR GRAPH METHOD OF SUPERVISED CLUSTERING
The goal of cluster analysis is to group the
instances based only on information found in the data
that describes the objects and their relationships, i.e.
their attributes. Objects within a group should be
more similar or related to each other than to objects
from other groups. The greater the similarity (or
homogeneity) within a group and the greater the
difference between groups, the better the clustering.
There are many clustering algorithms known in the
literature: hierarchical (nested) vs. partitional (un-
nested), exclusive vs. overlapping or fuzzy, complete
vs. partial (Tan, Steinbach & Kumar, 2005).
Clustering is typically applied in an unsupervised
learning framework using particular error functions,
e.g. an error function based on the distances inside a
cluster, whose minimization keeps the clusters tight.
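A typical example of such an error function is the within-cluster sum of squared distances to the cluster centroids, which, for instance, k-means minimizes; the snippet below (illustrative only, assuming NumPy) computes it for a given assignment of points to clusters:

import numpy as np

def within_cluster_sse(points: np.ndarray, labels: np.ndarray) -> float:
    """Sum of squared distances from each point to its cluster centroid;
    unsupervised methods such as k-means minimize this value."""
    sse = 0.0
    for c in np.unique(labels):
        members = points[labels == c]
        centroid = members.mean(axis=0)
        sse += ((members - centroid) ** 2).sum()
    return sse

pts = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
print(within_cluster_sse(pts, np.array([0, 0, 1, 1])))  # tight clusters -> small SSE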
An unsupervised approach for the problem
presented in figure 1 would most likely lead to
clustering together all the points in the upper region,
because they are closer to each other from the
topological point of view, even if they belong to
different classes.
Supervised clustering, on the other hand, deviates
from traditional clustering since it is applied on
classified examples with the objective of identifying
clusters that have high probability density with
respect to single classes (Eick, Zeidat & Zhao, 2004).
For our problem, we propose a clustering method
that simultaneously takes into account the topology of