Although the paper focused on business data, the
problem of duplicate or unrecognized entries for the
same entities occurs in all types of stored data.
The applicability of the presented concept to other
domains should therefore be investigated.
6 CONCLUSION
A semi-automatic two-tiered approach for record
linkage and entity resolution of business data was pre-
sented in this paper. The approach considers the data
sovereignty of different data sets as well as different
reasons for organization name variations. The topic
of data sovereignty is often neglected in current re-
search, although it is becoming increasingly impor-
tant in practice.
The paper presents a conceptual architecture.
Sections 4.1 and 4.2 recommend available or
experimental methods for each step of the conceptual
process. The implementation of these methods is
currently in progress and is discussed in Section 5.
This approach is specifically designed for the case
where the company name is the only feature shared
across data sets that is available for deduplication
and entity resolution of organization data. Therefore,
a two-tier approach was considered, using fuzzy logic
and NLP-based deep learning techniques.
The first component of the approach is designed
to handle character-based name variations, for which
fuzzy logic-based techniques are suitable. Initial
results show that homogeneous and complete groups,
with respect to name deviations at the letter level,
can be achieved.
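To illustrate the kind of character-level grouping the first component performs, the following is a minimal sketch, not the implementation described in this paper. It assumes Python's standard `difflib.SequenceMatcher` ratio as a stand-in similarity measure, a hypothetical threshold of 0.85, and invented company names; the function names `similarity` and `group_names` are illustrative only.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive character-level similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def group_names(names: list[str], threshold: float = 0.85) -> list[list[str]]:
    """Greedily place each name in the first group whose first member is similar enough.

    The threshold is an assumed value for illustration, not one taken from the paper.
    """
    groups: list[list[str]] = []
    for name in names:
        for group in groups:
            if similarity(name, group[0]) >= threshold:
                group.append(name)
                break
        else:
            # No existing group matched, so this name starts a new group.
            groups.append([name])
    return groups


# Invented example names with letter-level deviations.
names = ["Acme Corp", "Acme Corp.", "Acmee Corp", "Globex GmbH", "Globx GmbH"]
groups = group_names(names)
for group in groups:
    print(group)
```

Comparing only against the first member of each group keeps the sketch short, but it makes the result depend on input order; a production system would more likely use blocking and a proper clustering step.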
The second component of the approach deals with
more complex deviations that have few similarities.
While the first component can be fully automated, the
second requires human-machine interaction.
Overall, this approach has the potential to reduce
the effort required compared to largely manual data
curation. In addition, the use of computationally
expensive record linkage and entity resolution
methods can be minimized by using the Corporation
Catalog. The potential of such an approach could
be realized in different areas where there is a
significant need for harmonized data and externally
curated systems are not feasible.
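The way a catalog can reduce calls to expensive matching methods can be sketched as a lookup-before-compute pattern. This section does not specify the internal structure of the Corporation Catalog, so the sketch assumes it behaves like a mapping from normalized name variants to a canonical name. The class `CorporationCatalog`, the function `resolve`, and the stub matcher are hypothetical names introduced only for illustration.

```python
from typing import Callable, Optional


class CorporationCatalog:
    """Hypothetical in-memory catalog: normalized name variant -> canonical name."""

    def __init__(self) -> None:
        self._variants: dict[str, str] = {}

    @staticmethod
    def _key(name: str) -> str:
        # Lowercase and collapse whitespace so trivial variants share one key.
        return " ".join(name.lower().split())

    def lookup(self, name: str) -> Optional[str]:
        return self._variants.get(self._key(name))

    def register(self, name: str, canonical: str) -> None:
        self._variants[self._key(name)] = canonical


def resolve(name: str, catalog: CorporationCatalog,
            expensive_matcher: Callable[[str], str]) -> str:
    """Return the canonical name, calling the expensive matcher only on a catalog miss."""
    known = catalog.lookup(name)
    if known is not None:
        return known
    canonical = expensive_matcher(name)
    catalog.register(name, canonical)
    return canonical


calls = 0


def stub_matcher(name: str) -> str:
    """Stand-in for a costly record linkage step; it only counts invocations."""
    global calls
    calls += 1
    return "Acme Corporation"


catalog = CorporationCatalog()
first = resolve("ACME  Corp", catalog, stub_matcher)
second = resolve("acme corp", catalog, stub_matcher)
print(first, second, calls)
```

In this sketch the second request is answered from the catalog, so the stub matcher runs only once for the two variants.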
DATA 2022 - 11th International Conference on Data Science, Technology and Applications