new document finding the k nearest neighbours among the training documents. The
resulting classification is then a majority vote over the categories of these neighbours
[4]. Support vector machines try to find a model that minimizes the true error (the
probability of making a classification error) and are based on the structural risk
minimization principle [1]. Machine learning techniques and shallow parsing have
been used in a methodology for authorship attribution by Luyckx and Daelemans [7].
All the above methods, except the statistical tests, are called semi-parametric models
for classification, as they model the underlying distribution with a potentially infinite
number of parameters, selected so that the prediction becomes optimal.
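As an illustration of the k-nearest-neighbour voting scheme mentioned above, consider the following minimal sketch in Python (the vector representation, the Euclidean distance, and the value of k are our own illustrative assumptions, not details from [4]):

    from collections import Counter

    def knn_classify(doc_vec, training_set, k=5):
        """Assign doc_vec the majority category among its k nearest
        training documents (Euclidean distance; a simplifying assumption)."""
        def distance(a, b):
            return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

        # Sort the training documents by distance to the new document.
        neighbours = sorted(training_set,
                            key=lambda item: distance(doc_vec, item[0]))
        # Majority vote over the categories of the k closest neighbours.
        votes = Counter(category for _, category in neighbours[:k])
        return votes.most_common(1)[0][0]

    # Usage: training_set is a list of (feature_vector, category) pairs.
    # author = knn_classify(vectorize(new_document), training_set, k=3)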
The above authorship attribution systems have several disadvantages. First of all,
these systems invariably perform their analysis at the word level. Although word level
analysis seems intuitive, it ignores various morphological features which can be
very important to the identification problem. Consequently, such systems are language
dependent, and techniques that work for one language are usually not applicable
to others. Emphasis must also be given to the difficulty of word
segmentation in many Asian languages. These systems also usually involve a
feature elimination process that reduces the dimensionality of the feature space by
setting thresholds to discard uninformative features [8]. This step can be problematic,
because although rare features contribute less information than common features, they
can still have an important cumulative effect [9].
To avoid these undesirable situations, many researchers have proposed approaches
that operate at the character level [13], [14]. Fuchun et al.
[14] have shown that state-of-the-art performance in authorship attribution can be
achieved by building N-gram language models of the text produced by an author.
These models play the role of author profiles. The standard perplexity measure is then
used as the similarity measure between two profiles. Although these methods are
language independent and do not require any text pre-processing, they still rely on a
training phase during which the system has to build the author's profile using a set of
optimal N-grams. This may be computationally intensive and costly, especially when
larger N-grams are used.
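To make the profile-based approach concrete, the following minimal sketch builds character N-gram profiles and scores a disputed text by perplexity (Python; add-one smoothing and a fixed N are our own simplifying assumptions, whereas [14] uses more elaborate language models):

    import math
    from collections import Counter

    def char_ngrams(text, n):
        """All overlapping character N-grams of a text."""
        return [text[i:i + n] for i in range(len(text) - n + 1)]

    def build_profile(texts, n=3):
        """Character N-gram counts over one author's training texts."""
        counts = Counter()
        for text in texts:
            counts.update(char_ngrams(text, n))
        return counts

    def perplexity(text, profile, n=3):
        """Perplexity of a disputed text under an author profile,
        with add-one smoothing (an illustrative assumption)."""
        total = sum(profile.values())
        vocab = len(profile) + 1  # reserve mass for unseen N-grams
        grams = char_ngrams(text, n)
        log_prob = sum(math.log((profile[g] + 1) / (total + vocab))
                       for g in grams)
        return math.exp(-log_prob / len(grams))

    # The candidate whose profile yields the lowest perplexity on the
    # disputed text is chosen as its author.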
In this paper, we apply an alternative non-parametric approach to the authorship
identification problem using N-grams at the character level (sequences of N
consecutive characters). We compare simple N-gram distributions with the normal
distribution, thus avoiding the extra computational burden of building author
profiles. For a text of unknown authorship, we calculate the distribution of all its
possible N-grams in each author's collection of writings. These
distributions are then compared to the normal distribution using the Kolmogorov-
Smirnov test. The author whose derived distribution behaves most abnormally
is selected as the answer to the authorship identification problem. We expect
the N-grams of the disputed text to be biased towards the correct author and
therefore to be distributed more abnormally in the correct author's collection of
writings than in the other authors' writings. Such an abnormality is captured by the
Kolmogorov-Smirnov test.
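The following minimal sketch computes such an abnormality score for one candidate author (Python with SciPy; the use of raw relative frequencies, the standardization before the test, and the helper name abnormality are illustrative assumptions, not the exact computation of the paper):

    from scipy.stats import kstest

    def abnormality(disputed_text, author_corpus, n=3):
        """KS statistic of the disputed text's N-gram frequencies,
        measured in the author's corpus, against a normal distribution."""
        grams = {disputed_text[i:i + n]
                 for i in range(len(disputed_text) - n + 1)}
        # Frequency of each disputed-text N-gram in the author's writings
        # (str.count finds non-overlapping matches; a simplification).
        freqs = [author_corpus.count(g) / max(len(author_corpus), 1)
                 for g in grams]
        # Standardize before comparing with the standard normal CDF.
        mean = sum(freqs) / len(freqs)
        std = (sum((f - mean) ** 2 for f in freqs) / len(freqs)) ** 0.5 or 1.0
        z = [(f - mean) / std for f in freqs]
        statistic, _ = kstest(z, 'norm')
        return statistic

    # The author with the largest KS statistic (the most "abnormal"
    # distribution) is selected:
    # best = max(authors, key=lambda a: abnormality(text, corpora[a]))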
Our method is language independent and does not require segmentation for languages
such as Chinese or Thai. There is no need for any text pre-processing or higher level
processing, thus avoiding the use of taggers, parsers, feature selection strategies, or
other language dependent NLP tools. Our method is also simple and non-parametric,
without the necessity of building author profiles from training data.