Building a SIS requires a detailed study of the source archives. The quality of the data sources should be checked and data cleaning operations should be performed in order to remove all possible errors that might negatively impact the statistical analysis. The archives are then checked for cross inconsistencies, and finally the data are integrated into a global archive.
2.1 Data Integration and Cleaning
The first steps required to build a SIS are a detailed
analysis of the archives and the development of a
global integration schema, which will drive the subsequent steps. A further activity is the establishment of a mapping between the global integrated schema and the individual archive schemas (local schemas). Finally, the data migration process towards the integrated archive should be specified in detail. During data migration, low quality data issues might emerge and should be resolved, as we will show in Sec. 2.2. Moreover, data loaded into the instance of the global integration schema might prove unsuitable for the analysis, leading to misinterpretations.
For this reason the SIS development process should
be an iterative one, with the aim of progressively tun-
ing the global integration schema and the migration
procedure. Moreover, schemas may not completely
capture the semantics of the data that they describe,
and there may be several plausible mappings between
two schemas. This subjectivity makes it valuable to
have user input to guide the match and essential to
have user validation of the result.
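As an illustration, a mapping between a local archive schema and the global integration schema can be represented, in its simplest form, as a field-level correspondence that drives the migration of records. The following Python sketch is only indicative: the field names and the dictionary-based mapping are hypothetical and do not reflect the actual AMeRIcA schemas.

# Hypothetical field-level mapping from a local (Registry-like) schema
# to the global integration schema.
LOCAL_TO_GLOBAL = {
    "cod_fiscale":  "fiscal_code",
    "indirizzo":    "address",
    "data_nascita": "birth_date",
}

def migrate_record(local_record, mapping=LOCAL_TO_GLOBAL):
    """Project one local record onto the global schema; fields without a
    mapping are returned separately so that a user can validate them."""
    global_record, unmapped = {}, {}
    for field, value in local_record.items():
        target = mapping.get(field)
        if target is not None:
            global_record[target] = value
        else:
            unmapped[field] = value
    return global_record, unmapped

Keeping the unmapped fields explicit supports the iterative tuning of the global integration schema and the user validation of the mapping discussed above.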
2.2 Data Quality Improvement
The main problem in using administrative databases
for statistical and decision making purposes is the
presence of errors that do not affect the regular use of
the archive for administrative purposes. Such errors
are hardly noticed, and, even when discovered, they
are usually tolerated. However, these errors and the low quality of the data can negatively affect statistical analysis. Therefore, data sources need to undergo quality improvement pre-processing before being used as input for any kind of analysis. Administrative databases are employed to access information describing a single item at a time (e.g., the address of a person), while statistical analysis deals with collections of items (e.g., how many people live within an area). This different usage of the archives may unveil simple errors, such as duplicate records, or more complex ones, e.g., inhabitants who are registered in the Registry Office of a neighbouring town rather than in the town where they actually live. Some of these problems may be fixed by performing data cleaning actions whose results have a certain degree of reliability and therefore require manual evaluation, employing various data quality metrics such as accuracy, consistency, completeness, timeliness, and so on (integration quality criteria). Many cleaning techniques can be used; we do not investigate this topic further here, but we would like to highlight that these techniques have different costs in terms of execution time (both for humans and for computers) and that “optimal mix selection” issues arise when resources are scarce. The optimal mix selection is performed by evaluating an execution cost and a quality improvement rate for each candidate operation. The estimation of both values is a heuristic operation, based on experience as well.
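Purely as a sketch of how such a selection could be supported, the following Python fragment ranks hypothetical cleaning operations by estimated quality improvement per unit of execution cost and picks them greedily within a fixed budget; the operation names, costs, and improvement rates are invented for illustration, and a real setting might call for a more refined (e.g., knapsack-style) optimisation.

def select_cleaning_mix(candidates, budget):
    """Greedily pick cleaning operations by improvement/cost ratio until
    the execution cost budget is exhausted."""
    ranked = sorted(candidates,
                    key=lambda op: op["improvement"] / op["cost"],
                    reverse=True)
    selected, spent = [], 0.0
    for op in ranked:
        if spent + op["cost"] <= budget:
            selected.append(op["name"])
            spent += op["cost"]
    return selected, spent

# Hypothetical candidates: cost in person-hours, improvement as an estimated rate.
candidates = [
    {"name": "duplicate record removal", "cost": 8.0,  "improvement": 0.30},
    {"name": "address normalization",    "cost": 20.0, "improvement": 0.45},
    {"name": "fiscal code validation",   "cost": 4.0,  "improvement": 0.15},
]
print(select_cleaning_mix(candidates, budget=25.0))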
3 THE AMeRIcA PROJECT
The concepts illustrated so far are now presented in the context of the AMeRIcA Project. The approach comprises several independent phases: from data integration and quality analysis to the definition of statistical indicators, via the analysis of information sources, database design, the transformation and data management process, and the definition of a multidimensional model for data analysis as decision support. The reference population is
provided by the Registry of the Milan Municipality.
Data on this population are fundamental, since it is impossible to obtain from the Income Office a data provision restricted to a geographic area. A cross reference between the Registry Archives and the Income Archives makes it possible to obtain the desired information. The process of data interpretation, cleaning, and normalization, applied both to single-source and to integrated data, has required a great effort and deep knowledge of the data domain.
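As a purely illustrative sketch, the cross reference could be implemented along the following lines, joining the two archives on a shared person identifier (the Fiscal Code introduced below) and giving precedence to registry values, as discussed in the next paragraph; all field names are hypothetical.

def cross_reference(registry_records, income_records):
    """Link the two archives on the Fiscal Code; where both sources carry
    the same field, the registry value (assumed fresher) is kept."""
    income_by_fc = {rec["fiscal_code"]: rec for rec in income_records}
    linked = []
    for reg in registry_records:
        inc = income_by_fc.get(reg["fiscal_code"])
        if inc is not None:
            # dict merge: registry fields override income fields on conflict
            linked.append({**inc, **reg})
    return linked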
The Income Archive also holds some registry information about people; however, preference has been given to data derived from the Registry Archive, since
it is usually more up to date. In fact, an individual no-
tifies address changes to the Registry Office quickly,
while the Income Office is notified once per year with
the tax declaration form. Records describing the same
person in different archives are identified by the Fis-
cal Code (FC, similar to the US Social Security Num-
ber). Once different records on the same individual have been identified, further information that is significant for the analysis and not present in the Income Archive (e.g., profession, qualification, education, and so on) may be used. However, the limited freshness of some archives would violate the information quality criteria; thus, such additional information has not been included in the analysis. The portion of data in the
AMeRIcA SIS coming from the Income Office refers
to the income returns of both companies and people.
Individuals declare income data by filling in different forms, according to the type of income received and to their properties. Three common basic macro-information