Although the paper focused on business data, the
problem of duplicate or unrecognized entries for the
same entities is evident in all types of stored data.
The applicability of the presented concept to other
domains should therefore be investigated.
A semi-automatic two-tiered approach for record
linkage and entity resolution of business data was pre-
sented in this paper. The approach considers the data
sovereignty of different data sets as well as different
reasons for organization name variations. The topic
of data sovereignty is often neglected in current re-
search, although it is becoming increasingly impor-
tant in practice.
The paper presents a conceptual architecture.
Chapters 4.1 and 4.2 recommend available or
experimental methods for each step of the conceptual
process. The implementation of these methods is
currently in progress; its status is discussed in
chapter 5.
This approach is specifically designed for the case
where the company name is the only feature shared
across data sets for deduplication and entity
resolution of organization data. Therefore, a two-tier
approach was chosen, using fuzzy logic and NLP-based
deep learning techniques.
The first component of the approach is designed
to handle character-based name variations and thus
fuzzy logic-based techniques can be used. First re-
sults show that homogeneous and complete groups,
regarding name deviations at the letter level, can be
The second component of the approach deals with
more complex deviations that have few similarities.
While the first component can be fully automated, the
second requires human-machine interaction.
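The character-level grouping targeted by the first component can be illustrated with a minimal sketch. It is not the implementation described in this paper: it substitutes the similarity ratio from Python's standard `difflib` module for the fuzzy logic-based techniques, and the function names, the threshold of 0.85, and the company names are all hypothetical choices made for illustration.

```python
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase and collapse whitespace (a deliberately minimal normalization)."""
    return " ".join(name.lower().split())


def similarity(a: str, b: str) -> float:
    """Character-based similarity in [0, 1], using difflib's ratio as a stand-in."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def group_names(names: list[str], threshold: float = 0.85) -> list[list[str]]:
    """Greedily assign each name to the first group whose representative
    (its first member) is at least `threshold` similar; otherwise open a new group.
    The threshold is an assumed value, not one taken from the paper."""
    groups: list[list[str]] = []
    for name in names:
        for group in groups:
            if similarity(name, group[0]) >= threshold:
                group.append(name)
                break
        else:
            groups.append([name])
    return groups


# Hypothetical organization names with letter-level variations.
names = [
    "Acme Holding GmbH",
    "Acme Holding  GmbH",
    "Acme Holdng GmbH",
    "Borealis Logistics AG",
    "Borealis Logistic AG",
]
groups = group_names(names)
for group in groups:
    print(group)
```

Variations that share few characters, such as abbreviations or renamed organizations, would stay below any reasonable threshold in a sketch like this, which is the kind of case the second component is meant to handle.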
Overall, this approach has the potential to reduce
the effort required compared to mostly or entirely
manual data curation. In addition, the use of
computationally expensive record linkage and entity
resolution methods can be minimized by using the
Corporation Catalog. The potential of such an approach
could be realized in different areas where there is a
significant need for harmonized data and externally
curated systems are not feasible.
