properly formalize many classification problems that
derive from web mining problems such as page
ranking algorithms, personalization of search results
and many others; for possible applications see
(Antonellis et al., 2005). Although we can build
trivial solutions for this problem using existing
classification techniques, we study a specific
technique that exploits the semantic information
derived from the decomposition of training
documents into their sentences. Such semantic
analysis shares a lot with passage-based retrieval
techniques (Salton et al., 1996; Hearst et al., 1993)
that further decompose a large text document in
order to identify text segments and text themes so as
to improve the accuracy of information retrieval
queries. It is also connected with previously proposed
phrase-based document indexing techniques
(Hammouda et al., 2004) as an alternative to the
single-word analysis that the simple Vector Space
Model provides.
The rest of the paper is organized as follows. In
Section 2, the definition of the Scalable
Classification Problem is presented, along with an
intuitive description of possible applications. In
Section 3, we study the 20newsgroup dataset,
applying a new text analysis technique so as to
specify a logical interpretation of the ‘user-specific’
classification. Subsequently, different solutions for
the problem are described that are based upon the
reduction of the problem to multiple standard
binary classification problems. Section 5 describes
our Scalable Classification Algorithm, which derives
from the spectral decomposition of the training
documents into the vector representation of their
sentences. Experimental evaluation of the algorithm
is given in Section 6, using two different datasets.
Finally, we summarize our results and address future
work.
2 SCALABLE TEXT CLASSIFICATION PROBLEM
Traditional text classification is the task of assigning
a Boolean value to each pair ⟨d_j, c_i⟩ ∈ D × C, where
D is a domain of documents and C = {c_1, …, c_|C|} is a
set of predefined categories. More formally, we have
the following definition (Sebastiani, 2002):
DEFINITION 1 (STANDARD TEXT CLASSIFICATION) Let
C = {c_1, …, c_|C|} be a set of predefined categories
and D = {d_1, …, d_|D|} a growing set of documents. A
standard text classifier is an approximation function
Φ̂ : D × C → ℜ of the target function Φ : D × C → ℜ
that describes how documents ought to be classified.
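The definition above can be illustrated with a minimal sketch: a classifier Φ̂ maps each (document, category) pair to a real-valued score, and thresholding that score yields the Boolean assignment. All data and names below (`training`, `phi_hat`, the toy categories) are invented for illustration and are not part of the paper's method.

```python
from collections import Counter

# Toy labeled training set: (document text, category). Illustrative only.
training = [
    ("linux kernel open source release", "os"),
    ("openbsd security release mirror", "os"),
    ("stock market shares fall", "finance"),
]

# Build a bag-of-words prototype vector for each category c_i.
prototypes = {}
for text, label in training:
    prototypes.setdefault(label, Counter()).update(text.split())

def phi_hat(document, category):
    """Approximation function Phi-hat : D x C -> R.

    Returns a real-valued score; comparing it against a threshold
    gives the Boolean assignment of Definition 1.
    """
    words = Counter(document.split())
    proto = prototypes.get(category, Counter())
    # Dot product between the document vector and the category prototype.
    return float(sum(count * proto[w] for w, count in words.items()))

print(phi_hat("new openbsd release", "os") > phi_hat("new openbsd release", "finance"))
```

Note that the classifier is entirely determined by the fixed training set: once `prototypes` is built, the score for a given (document, category) pair never changes, which is exactly the static behaviour discussed next.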
Looking further into the definition, it is easy to
see that most parameters of the problem are static. The
definition of the categories relies only on the initial
set that the labeled training documents specify and
cannot be further expanded or limited. Moreover, the
definition of a specific category relies only on the
information that the training documents provide. The
classification function is specified by the
minimization of an effectiveness measure
(Sebastiani, 2002) that shows how closely the
functions Φ̂ and Φ ‘coincide’. Traditionally, this
measure relies on precision and recall, or on other
effectiveness measures that combine these values (e.g.
micro-averaging and macro-averaging). It is then
obvious that, depending on the measure we choose, the
resulting classifiers differ. However, we can argue
that the classification procedure still remains
static; that is, given a classifier and a specific
document, whenever we apply the classifier to that
document, the classification result will remain the
same (by definition).
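The micro- and macro-averaged effectiveness measures mentioned above can be made concrete with a short sketch. The category names and contingency counts below are invented for illustration; the point is only the difference between averaging per-category values (macro) and pooling the contingency tables first (micro).

```python
# Per-category contingency counts: true positives, false positives,
# false negatives. Values are invented for illustration.
counts = {
    "os":      {"tp": 8, "fp": 2, "fn": 1},
    "finance": {"tp": 3, "fp": 1, "fn": 4},
}

def precision(tp, fp):
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    return tp / (tp + fn) if tp + fn else 0.0

# Macro-averaging: compute the measure per category, then average.
macro_p = sum(precision(c["tp"], c["fp"]) for c in counts.values()) / len(counts)

# Micro-averaging: pool the contingency tables, then compute once.
tp = sum(c["tp"] for c in counts.values())
fp = sum(c["fp"] for c in counts.values())
micro_p = precision(tp, fp)

print(round(macro_p, 3), round(micro_p, 3))
```

Macro-averaging weights every category equally, while micro-averaging weights categories by how many decisions fall in each, so the two measures generally rank classifiers differently.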
Web mining techniques that capture user-profile
information in order to improve end-user results
often give rise to text classification problems.
However, these text classification problems involve
dynamic changes in Web users’ behaviour and
‘on-the-fly’ definition of the category topics.
Consider, for example, the text article of Figure 1
and Web users A and B. User A is a journalist looking
for information about Linux in order to write an
article about open source software in general, while
user B is an experienced system administrator looking
for instructions on installing OpenBSD 3.6.
It's official: OpenBSD 3.7 has been released. There are oodles of
new features, including tons of new and improved wireless drivers
(covered here previously), new ports for the Sharp Zaurus and SGI,
improvements to OpenSSH, OpenBGPD, OpenNTPD, CARP, PF, a new
OSPF daemon, new functionality for the already-excellent ports &
packages system, and lots more. As always, please support the
project if you can by buying CDs and t-shirts, or grab the goodness
from your local mirror.
Source: Slashdot.org
Figure 1: Example news article.
A well-trained standard classification system
would then deliver the above document to both
users, as it is clearly related to open source software
and to the OpenBSD operating system. However, it is
obvious that although user A would need such a
decision, it is useless for user B to come across this
article.
Investigating the cause of user B’s
disappointment, we see that standard text
classification systems lack the ability to provide
‘per-user’ results. However, a user’s knowledge of a
topic should be taken into account when providing
him with results. It is more likely that a user
who is aware of a category (e.g. user B knows a lot