tic web". Like the data mining definition, text min-
ing may be defined as: "The entire process of ex-
tracting relevant information that is not explicitly
present in a document collection". There is a clear
distinction between this and documentary computer
science, which only attempts to show explicitly
present concepts (Delgado, 2002).
In text mining, it should be remembered that we
are dealing with information which is not particularly
structured, and therefore traditional data mining
techniques cannot be applied directly. This lack of
structure is the biggest problem for text mining and requires the
texts to be preprocessed and converted into an "intermediate
form" so that the algorithms and methods
(classification, association, etc.) being used may be
applied to them.
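
As a minimal sketch of what such an "intermediate form" might look like, the following Python fragment converts free text into a bag-of-words mapping (term to count). The stop-word list, the function name `to_intermediate_form` and the sample sentences are illustrative assumptions, not part of the system described here.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real system would use a fuller one.
STOP_WORDS = {"the", "of", "in", "a", "an", "and", "with", "to"}

def to_intermediate_form(text):
    """Convert free text into a bag-of-words (term -> count) mapping."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)

# Illustrative documents (invented for this sketch).
docs = [
    "Acute fracture of the left femur",
    "Fracture of the right femur with infection",
]
vectors = [to_intermediate_form(d) for d in docs]
```

Once every text is reduced to such a structured mapping, standard classification or association algorithms can operate on the counts rather than on the raw text.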
Current research on ontologies and on their design and
development tools (Lambrix, 2003) has produced
new methods and techniques and, above all, a wider,
more ambitious vision of textual data processing,
since ontologies make it possible to act on previously
conceptualized domains. This is particularly interesting
because the ontology is an intermediate element that can be
used as a link between the "raw" textual information,
its preprocessing, and its subsequent treatment
using the usual data mining techniques. More specifically,
ontologies are particularly important in information
retrieval systems, since they provide the
connection between final user applications and databases
and, in the opposite direction, in visualization processes
(Guarino, 1998).
In short, it is clear that this is an interdisciplinary
field which includes elements, methods and tech-
niques from documentary computer science, linguis-
tics, and data mining, with their contributions relat-
ing to information retrieval, information extraction,
clustering, categorization and automatic learning
(among others) (Prados, 2004).
3 WORK DESCRIPTION
The basic idea behind this work is that the medical
language used in diagnostic expressions and determinations
(based on natural language) is sufficiently
controlled and strict, with an adequately formal
grammar, to be used directly in data processing and
documentary searches, without the need for
code systems. This would make documentary
searches more powerful by enabling descriptive and
qualifying aspects of the pathology or
treatment to be included, rather than merely unqualified
searches on the diagnosis (Shankar, 2002). A diagnostic
expression comprises terms with a semantic content which
are relevant for diagnosis. It is first necessary to
analyze the semantic typology of these terms and to
establish the different classes to which a lexical
unit can belong.
Our aim in this paper is to show that this is pos-
sible. In order to do so, we have considered four
classes of term sets: identifiers, locators, qualifiers,
and etiologicals. From this first semantic structure,
we can extract an ontology which conceptually
describes the domain of the medical diagnosis, and
propose a methodology for information storage and
retrieval, extending the restricted possibilities of
traditional searches on the electronic healthcare
record using the defined ontology.
Our first objective is to define what we call the
"semantic classes", each of which represents a kind
of concept used by doctors in their descriptions. Our
second objective is to check that the medical expres-
sions in natural language are in keeping with the
element coordination of each semantic class. Our
third objective is to confirm the hypothesis that the
constituent elements of each semantic class represent
a perfectly determinable and controllable set. A
class may have superclasses, which relate
a given concept to another concept with a more general
medical meaning (Dameron, 2004).
We also define what we call the "semantic class
sequence" (SCS) as the sequence of semantic classes
identified in an expression, in their order of
appearance. By way of example, the SCS "I-E-L" means
"identifier-etiological-locator". Although the purpose of these
SCSs is to check stability in expression composition,
they will also be useful in both the data acquisition
process and the documentary search process from
the data structures of the semantic classes.
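
The derivation of an SCS from an expression can be sketched as follows. This is only an illustration under stated assumptions: the lexicon, the assignment of each term to a class, the one-letter codes other than those in "I-E-L" (here Q for qualifier) and the function name `semantic_class_sequence` are hypothetical and are not taken from the system itself.

```python
# Hypothetical lexicon: each known term mapped to one semantic class.
# I = identifier, L = locator, Q = qualifier (assumed code), E = etiological.
LEXICON = {
    "fracture": "I",
    "pneumonia": "I",
    "femur": "L",
    "lung": "L",
    "acute": "Q",
    "severe": "Q",
    "traumatic": "E",
    "bacterial": "E",
}

def semantic_class_sequence(expression):
    """Return the SCS of an expression as a hyphen-joined string."""
    classes = []
    for token in expression.lower().split():
        cls = LEXICON.get(token.strip(".,;"))
        if cls is not None:  # tokens outside the lexicon are skipped
            classes.append(cls)
    return "-".join(classes)

# Illustrative expression (invented for this sketch).
scs = semantic_class_sequence("Pneumonia, bacterial, left lung")
print(scs)  # I-E-L
```

Tokens that are not in the lexicon (such as "left" above) are simply ignored in this sketch; a real system would need to decide how to treat them.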
The system is enriched by extracting the corresponding
ontology (ONTOARCHINET), with its hierarchies
and properties, which provides a well-defined
conceptual environment and can be used
in the documentary search.
We shall also discuss whether this proposal is
worthwhile and whether it is possible to construct a set of
data structures representing the members of each
semantic class and their properties (associations,
frequency, medical environment, etc.) so that, once
the documentary search has been analyzed (cleaned and
filtered), its SCS can be recognized. Its components
are then extracted in order to perform the search,
narrowing or extending it according to the semantic
components in the ontology.
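
The idea of narrowing or extending a search through the ontology can be illustrated with the sketch below. Everything in it is an assumption made for illustration: the "is-a" hierarchy, the stored records, and the names `SUPERCLASS`, `is_a` and `search` are hypothetical and do not describe the actual ONTOARCHINET structures.

```python
# Hypothetical "is-a" hierarchy: child concept -> more general concept.
SUPERCLASS = {
    "femur": "lower limb",
    "tibia": "lower limb",
    "lower limb": "limb",
    "humerus": "upper limb",
    "upper limb": "limb",
}

def is_a(concept, general):
    """True if `concept` equals `general` or has it as an ancestor."""
    while concept is not None:
        if concept == general:
            return True
        concept = SUPERCLASS.get(concept)
    return False

# Hypothetical stored records, already split by semantic class.
RECORDS = [
    {"id": 1, "I": "fracture", "L": "femur"},
    {"id": 2, "I": "fracture", "L": "humerus"},
    {"id": 3, "I": "contusion", "L": "tibia"},
]

def search(query, extend=False):
    """Return ids of records matching every class given in `query`.

    With extend=True, a record matches when its term is the query
    term or a more specific concept below it in the hierarchy.
    """
    hits = []
    for rec in RECORDS:
        ok = True
        for cls, term in query.items():
            value = rec.get(cls)
            match = is_a(value, term) if extend else value == term
            if not match:
                ok = False
                break
        if ok:
            hits.append(rec["id"])
    return hits

exact = search({"I": "fracture", "L": "lower limb"})
extended = search({"I": "fracture", "L": "lower limb"}, extend=True)
```

In this sketch the exact search finds nothing, because no record stores "lower limb" literally, whereas the extended search also accepts locators that lie below "lower limb" in the assumed hierarchy.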
ICEIS 2006 - INFORMATION SYSTEMS ANALYSIS AND SPECIFICATION