
 
2 NNGE CATEGORIZATION 
The data mining problem implies analyzing a set of 
points defined as geographic coordinates x and y and 
their damage or risk level r. Depending on the 
considered approach, the risk can be nominal, which 
means that each building belongs to a certain risk class C_r, or numerical, i.e. each building has a risk probability associated with it, a real number r ∈ [0, 1]. The goal is to find the subsets of nearby points, clusters, which share the same C_r, or at least
clusters with minimum impurity, i.e. most  of the 
cluster members should belong to the same class or 
have close r values.  
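To make the setting concrete, the following minimal Python sketch shows one possible representation of the points and a nominal impurity measure; the data layout and function names are our own illustration, not part of the original method.

```python
from collections import Counter

# Illustrative representation (an assumption for this sketch): each
# building is a point (x, y, label), where label is either a nominal
# risk class C_r or, in the numerical variant, a probability r in [0, 1].
points = [
    (1.0, 2.0, "A"), (1.2, 2.1, "A"),
    (1.1, 1.9, "B"), (5.0, 5.0, "B"),
]

def impurity(cluster):
    """Fraction of cluster members outside the majority risk class."""
    counts = Counter(label for _, _, label in cluster)
    majority_size = counts.most_common(1)[0][1]
    return 1.0 - majority_size / len(cluster)

# The first three points form a cluster with one "B" among three
# members, so its impurity is 1/3.
print(impurity(points[:3]))
```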
A straightforward approach is to use a 
categorization algorithm to describe such subsets of 
points. In general, categorization is the task of finding a target function f that maps each attribute set A describing an object into one (or more, each with a degree of membership) of the predefined classes C. This target
function  f is also known as the categorization or 
classification model. 
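A toy sketch of such a target function, in both its crisp and fuzzy (degree-of-membership) forms; the attribute names and thresholds are invented purely for illustration.

```python
# Toy illustration of a target function f: attribute set -> class.
# The attribute "r" and the 0.5 threshold are assumptions of this sketch.

def f_crisp(attributes):
    """Maps an attribute set into exactly one predefined class."""
    return "high" if attributes["r"] > 0.5 else "low"

def f_fuzzy(attributes):
    """Maps an attribute set into every class, each with a degree of
    membership (here the degrees sum to 1 by construction)."""
    return {"high": attributes["r"], "low": 1.0 - attributes["r"]}

print(f_crisp({"r": 0.8}))   # the crisp model picks a single class
print(f_fuzzy({"r": 0.8}))   # the fuzzy model grades both classes
```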
In the literature (Tan, Steinbach & Kumar, 2005; 
Han & Kamber, 2000; Mitchell, 1997; Nilsson, 1996) 
several types of categorization algorithms are
described. Among the most frequently used are rule-
based methods, prototype-based methods and 
exemplar-based methods.  
For the particular purpose of our research, the 
rule-based categorization seems to be most 
appropriate, since we need a non-hierarchical, explicit 
partition of data. A nearest-neighbor-based approach is useful because the prediction phase is irrelevant in our case: the damage of a building cannot be predicted from the damage of its neighbors alone. Moreover, this class of algorithms always performs well on the training set, with error rates close to 0.
Such an algorithm is the Non-Nested Generalized 
Exemplar, NNGE (Martin, 1995; Witten & Frank, 
2000), which forms homogeneous hyper-rectangles (generalized exemplars) in the attribute space, such that no exception is contained within them. The
hyper-rectangles do not overlap, and in this way, the 
algorithm prevents over-fitting. 
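The core generalization step can be sketched as follows. This is a simplified 2-D illustration of the exception-free hyper-rectangle idea, not Martin's full NNGE algorithm; all names are our own.

```python
# Minimal sketch of NNGE-style generalization (assumptions: 2-D points,
# a rectangle given as (xmin, ymin, xmax, ymax), nominal classes).

def contains(rect, point):
    xmin, ymin, xmax, ymax = rect
    x, y = point
    return xmin <= x <= xmax and ymin <= y <= ymax

def extend(rect, point):
    """Smallest axis-parallel rectangle covering rect and point."""
    xmin, ymin, xmax, ymax = rect
    x, y = point
    return (min(xmin, x), min(ymin, y), max(xmax, x), max(ymax, y))

def try_generalize(rect, new_point, exceptions):
    """Extend rect to cover new_point, unless a differently-classed
    exception would fall inside the enlarged rectangle."""
    grown = extend(rect, new_point)
    if any(contains(grown, e) for e in exceptions):
        return None  # refuse: hyper-rectangles must stay exception-free
    return grown

rect = (0.0, 0.0, 1.0, 1.0)  # existing single-class exemplar
print(try_generalize(rect, (3.0, 3.0), [(2.0, 2.0)]))  # None: exception inside
print(try_generalize(rect, (1.5, 0.5), [(2.0, 2.0)]))  # (0.0, 0.0, 1.5, 1.0)
```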
  In order to test the behavior of the algorithm we 
used a test problem proposed by Eick, Zeidat, and 
Zhao (2004), displayed in figure 1, where different 
point colors represent different classes. 
The results of the NNGE algorithm are presented in
the same figure. One can see the hyper-rectangles 
found by the algorithm, which are 2-D rectangles in 
our case. In addition, the convex hull of the cluster 
points is emphasized and the internal area of the 
convex hull is hatched. 
 
 
Figure 1: NNGE results for the test problem. 
The algorithm only discovers axis-parallel hyper-
rectangles; it cannot take into account other 
distributions of data. Another disadvantage is that 
NNGE can link rather distant points, if there is no 
exception example lying between them. 
An alternative approach is to use a clustering 
method instead of classification, which should also 
use the predefined r values of points. 
 
3 K-NEAREST NEIGHBOR GRAPH METHOD OF SUPERVISED CLUSTERING
The goal of cluster analysis is to group the
instances based only on information found in the data 
that describes the objects and their relationships, i.e. 
their attributes. Objects within a group should be 
more similar or related to each other than to objects 
from other groups. The greater the similarity (or 
homogeneity) within a group and the greater the difference between groups, the better the clustering.
There are many clustering algorithms known in the 
literature: hierarchical (nested) vs. partitional (un-
nested), exclusive vs. overlapping or fuzzy, complete 
vs. partial (Tan, Steinbach & Kumar, 2005). 
Clustering is typically applied in an unsupervised 
learning framework using particular error functions, 
e.g. an error function that minimizes the distances 
inside a cluster, therefore keeping the clusters tight.  
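One typical error function of this kind is the within-cluster sum of squared distances to the centroid; a minimal sketch for 2-D points (the function name is ours):

```python
# Sketch of a standard unsupervised error function: within-cluster sum
# of squared distances to the cluster centroid (2-D points assumed).
def sse(cluster):
    cx = sum(x for x, _ in cluster) / len(cluster)
    cy = sum(y for _, y in cluster) / len(cluster)
    return sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in cluster)

tight = [(0.0, 0.0), (0.2, 0.0), (0.1, 0.2)]
loose = [(0.0, 0.0), (2.0, 0.0), (1.0, 2.0)]
print(sse(tight) < sse(loose))  # True: the tighter cluster has lower error
```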
An unsupervised approach for the problem 
presented in figure 1 would most likely lead to 
clustering together all the points in the upper region, 
because they are closer to each other from the 
topological point of view, even if they belong to 
different classes. 
Supervised clustering, on the other hand, deviates 
from traditional clustering since it is applied on 
classified examples with the objective of identifying 
clusters that have high probability density with 
respect to single classes (Eick, Zeidat & Zhao, 2004). 
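A supervised objective of this kind can be sketched as class impurity plus a penalty that discourages splitting the data into many tiny pure clusters. This is only in the spirit of Eick, Zeidat & Zhao (2004); the exact penalty form and the beta weight below are illustrative assumptions, not necessarily the authors' formula.

```python
import math

# Hedged sketch of a supervised-clustering objective: overall minority
# fraction (impurity) plus a penalty for using more clusters than there
# are classes. Penalty shape and beta are assumptions of this sketch.
def objective(clusters, n_classes, beta=0.4):
    n = sum(len(c) for c in clusters)
    minority = sum(
        len(c) - max(c.count(lbl) for lbl in set(c)) for c in clusters
    )
    k = len(clusters)
    penalty = math.sqrt((k - n_classes) / n) if k > n_classes else 0.0
    return minority / n + beta * penalty

# One pure cluster and one mixed cluster of class labels: impurity 1/6,
# no penalty since k equals the number of classes.
print(objective([["A", "A", "A"], ["B", "B", "A"]], n_classes=2))
```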
For our problem, we propose a clustering method 
that simultaneously takes into account the topology of 
ICSOFT 2006 - INTERNATIONAL CONFERENCE ON SOFTWARE AND DATA TECHNOLOGIES