Duplicate Detection in a Knowledge Base with PIKA
Maxime Prieur, Guillaume Gadek and Bruno Grilheres
Airbus Defence and Space, France
Keywords:
Knowledge Base, Entity Embedding, Duplicate Detection, Graph Convolutional Networks.
Abstract:
This paper explores the use of Graph Neural Network models producing node embeddings in order to address the not yet fully solved problem of detecting similar items stored in a knowledge base. Leveraging pre-trained models for textual semantic similarity, our proposed method PIKA aggregates heterogeneous (structured and unstructured) characteristics of an entity and its neighborhood to produce an embedding vector that can be used in different tasks such as information retrieval or classification. Our method learns specific weights for each piece of information carried by an entity, enabling us to process entities in an inductive fashion.
1 INTRODUCTION
In the last few years, the concept of knowledge bases (KB) has gained a lot of interest. In a KB, elements sharing relations can be linked to each other, and the result can be visualized in the form of a Knowledge Graph, very helpful for tasks like social network analysis, decision making, question answering or product-to-user recommendation (Dai et al., 2020; Xiaohan, 2020). Such a database can be constructed and filled either manually (by people in charge of entering day-to-day information) or automatically (by parsing a continuous flow of data). The latter is part of a bigger task called automatic knowledge base population.
In both cases, there is a risk of adding an already existing element, due to typos in attribute values, several users concurrently updating the knowledge base, or an entity being added with up-to-date information without checking for one already present in the database. These cases are a big concern when administrating the database: they introduce redundancies, pollute the data and degrade performance on the different tasks. In this paper, we propose a solution to avoid this type of error by detecting similar entities in a knowledge base before inserting new items.
To the best of our knowledge, this problem has not been fully addressed yet. An appropriate solution should take into account the different types of entities stored in a database (e.g., Person, Location), be able to scale to a large number of elements (billions) with heterogeneous descriptions, and take advantage of both the entity description (attributes) and its neighborhood (relations and attributes of the neighbors).
Several attempts have been made to determine the similarity between pairs of elements, mostly based on their attributes (Mudgal et al., 2018). However, these approaches face a scalability issue when dealing with large numbers of entities, since the pairwise measure has to be computed for each candidate.
A few methods have been proposed to produce a vector representation of the elements in a knowledge base for predicting links between entities (Wang et al., 2021; Shi and Weninger, 2018). Still, these approaches require either that all of the entities are associated with a descriptive text, or that the entities have no information other than their name. Other approaches adopt the graph neural network framework from (Wang et al., 2019); but like the previous ones, they use a loss based on the probability of a relation between two entities. This is not suited to the problem we are trying to solve and can lower the discriminating power of the information carried by the neighborhood.
In this paper, our main contributions are the fol-
lowing:
we introduce a pre-processing step to aggregate the attributes of an entity into a unique vector used as input for embedding models;
we are the first to propose a fast and reliable GCN model, PIKA, capable of producing node embeddings within a heterogeneous graph, in order to retrieve similar or near-duplicate entities;
we perform an in-depth evaluation and comparison of our proposed model to state-of-the-art
we perform an in-depth evaluation and compar-
ison of our proposed model to state-of-the-art
GCN methods, retrieving generated clones from the TAC KBP database.
The remainder of this article is structured as follows. Section 2 reviews the related work on embedding the entities of a knowledge base. Section 3 gives a detailed description of our model. Section 4 formalizes the experiment conducted to measure embedding quality. Finally, Section 5 presents and discusses the obtained results before outlining what could be explored in future work.
2 RELATED WORK
2.1 Entity Matching
The task of entity matching consists in determining the similarity between two records from the same database or from different databases (entity alignment); this task can be tackled in a number of ways. A hybrid module that uses the alignment of encoded values of the same attribute between a pair of records has been proposed by (Mudgal et al., 2018). The HighwayNet model is a bi-directional RNN with a decomposable attention mechanism; it summarizes attributes before applying a concatenation and an element-wise absolute difference to produce an output which is then classified (Srivastava et al., 2015).
Two other models, Ditto and Entity Matching on Unstructured Data, proposed respectively by (Li et al., 2020) and (Brunner and Stockinger, 2020), chose to fine-tune pre-trained language models such as RoBERTa (Liu et al., 2019), giving the concatenation of the two serialized records as input. These methods do not integrate the graph aspect of a knowledge base, since they are applied to atomic records not linked to each other.
2.2 Relation-based Embedding
Node embedding methods use the neighborhood of a node, generating random walks of fixed length around each node to build a context from which an embedding is computed, as introduced by DeepWalk (Perozzi et al., 2014). This principle of using the context to train encoders relying on the frequency of appearance of the elements is frequently exploited (Nielsen, 2017; Ristoski and Paulheim, 2016; Dong et al., 2017). The first disadvantage of this is that embeddings are obtained on a vocabulary (i.e., the nodes in the graph) of fixed size, which rules out any comparison with new nodes. Moreover, while relying only on the neighborhood is widely applied to community detection tasks, as described in (Fortunato, 2010), relevant, cohesive communities can only be detected by considering both the network and the attributes (e.g., the node-borne semantic information) (Gadek, 2019).
More focused on the existence of a relation between two records, the "translational" approach models links as translations from one entity to another (Bordes et al., 2013; Dettmers et al., 2018). The embeddings are computed using a margin-based loss function reducing the distance of real triplets. RotatE (Sun et al., 2019) treats relations as rotations in a vector space, allowing symmetry, anti-symmetry, inversion and composition to be taken into account. Once again, the problem of transductivity arises. In addition, keeping relation prediction as the training objective is not optimal for the detection of similarities.
2.3 Graph Neural Networks
The recent development of deep-learning architectures, especially convolution layers, has led to an interest in reusing them for graph processing, as illustrated in (Kipf and Welling, 2016). The information around a node can be aggregated and iteratively updated to obtain the representation of a target, in an inductive manner, with the GraphSage model (Hamilton et al., 2017). This kind of method is usually limited to a depth of 1 or 2 layers, both because of an over-smoothing problem (Wu et al., 2019) and to maintain low processing times. Still, GraphSage provides learned, reusable weights to handle new nodes.
Since not all neighbors have the same influence on
an entity, GAT, a GNN model using linear projections
and a self-attention mechanism, has been introduced
(Veličković et al., 2017). Arguing that the attention
used by GAT is not dynamic and struggles to pro-
cess the real influence of the nodes, a modification of
the sequence of operations of the previous model has
been proposed to solve the issue (Brody et al., 2021).
The above-mentioned approaches disregard the heterogeneity of the graph, an aspect that is covered by HinSage, an adaptation of (Hamilton et al., 2017) implemented in StellarGraph (https://stellargraph.readthedocs.io). HinSage groups and averages the vectors of the neighborhood according to the type of the relationship or neighbor. The method called HGT computes the attention coefficients of the neighbors and the messages passed by them, and merges everything into a new vector. All the operations involve class- and relation-specific linear projections, which makes the method complex to use in the case of a knowledge graph because of the large number of different edge types.
2.4 Node Embeddings in Knowledge Graphs
Recent work also suggested combining the translational loss functions from (Bordes et al., 2013) with the fine-tuning of a textual model to encode the description of the entities (Wang et al., 2021; Daza et al., 2020). This approach performs well when provided with a rich textual description, which is however not consistently the case in knowledge databases.
2.5 Discussion
Our review of the related work highlights the need for a solution that can:
Make the most of the additional information brought by the nature of knowledge graphs (i.e., the type of relationships and the class of the nodes);
Carry out the search quickly enough to propose similar entities in real time or near real time among thousands of items;
Operate inductively, so that new nodes can be processed without retraining the model or recomputing the representations of all the nodes;
Transform the node profile into a vector without loss of content, whether the attributes are sparse or rich.
This paper aims to address these needs by proposing an adaptation of GCN models.
3 TASK DESCRIPTION AND MODEL FORMALIZATION
3.1 Problem Definition
A knowledge base is a data structure used to store heterogeneous entities with structured information. This structure can be modeled as an attributed heterogeneous graph, $G = (V, E, \Lambda, M, A)$, where $V$ is the set of nodes (representing entities; the two terms are used interchangeably) constituting the graph, $\varepsilon_{i,j} \in E$ is an oriented link between two nodes, $\Lambda$ is the association of each entity with a class, $M$ is the association of an edge $\varepsilon_{i,j}$ to its type, and $\alpha(i) = \{\alpha(i)_1, \dots, \alpha(i)_k\} \subseteq A$ is the set of attributes possessed by the node $v(i) \in V$. An attribute consists of a name and the embedding of its textual representation, $\alpha(i)_j = (\alpha(i)_{j,name}, \alpha(i)_{j,val})$.
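To make this data model concrete, the following is a minimal sketch of such an attributed heterogeneous graph in code; all class and helper names (Entity, Edge, neighbors, ...) are illustrative, not taken from the paper's implementation.

```python
# Minimal sketch of the attributed heterogeneous graph G = (V, E, Λ, M, A).
# Names are illustrative; only the structure follows Section 3.1.
from dataclasses import dataclass, field

import numpy as np

@dataclass
class Attribute:
    name: str                     # α(i)_{j,name}, e.g. "Nationality"
    value_embedding: np.ndarray   # α(i)_{j,val}: pre-trained encoding of the value

@dataclass
class Entity:
    entity_id: str
    entity_class: str             # Λ: class of the node, e.g. "Per", "Org"
    attributes: list = field(default_factory=list)  # list of Attribute

@dataclass
class Edge:
    source: str                   # oriented link ε_{i,j}: from entity i ...
    target: str                   # ... to entity j
    relation_type: str            # M: type of the edge, e.g. "Work_for"

@dataclass
class KnowledgeGraph:
    entities: dict = field(default_factory=dict)    # entity_id -> Entity
    edges: list = field(default_factory=list)       # list of Edge

    def neighbors(self, entity_id: str):
        """Return (neighbor_id, relation_type) pairs, in both directions."""
        out = [(e.target, e.relation_type) for e in self.edges if e.source == entity_id]
        # Inverse relations are distinguished, as in Section 3.5 (φ(j, i)^{-1}).
        inv = [(e.source, e.relation_type + "^-1") for e in self.edges if e.target == entity_id]
        return out + inv
```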
Figure 1: Example of a knowledge graph (entities such as Tony Stark, Pepper Potts, Stark Industries, The Avengers and New-York, linked by relations such as Spouse_of, Work_for, Live_in, Part_of and Based_in).
As shown in Figure 1, the difficulties with this
type of graph are in the possible variation of attributes
between two nodes of the same class (qualified as
dirty entity matching) and the number of relations.
3.2 Model Architecture
Our model is composed of two parts: the profile embedding, shown in Figure 2, and the graph model, shown in Figure 3. Since graph methods require a vector for each entity, the first step is to summarize the profile information. The second step consists in a graph model adapted to the nature of knowledge graphs. The two parts can be trained separately.
To stabilize and reduce learning time, we use the k-head learning approach introduced by (Vaswani et al., 2017). As described for the GAT model, k-head learning consists in training several versions of the model parameters in parallel. The resulting representations obtained for each head are then averaged into a single vector.
3.3 Input Computation
For the first stage of embedding, we rely on a pre-trained encoder to transform strings into vectors, such as the Universal Sentence Encoder (Cer et al., 2018), but also MPNet (Song et al., 2020) and DistilRoberta, a lighter model adapted from (Liu et al., 2019). These models were trained to measure the semantic similarity between two sentences. For the last two models we used the instances proposed by the sentence-transformers library (https://www.sbert.net); they were trained following the process described by (Reimers and Gurevych, 2019).
We selected the Universal Sentence Encoder as the best trade-off between processing time and efficiency. One could argue that attributes in our case are made up of words that are moreover not often found in the literature (First name, Spouse name, etc.) and that a character-based encoder like FastText would therefore be more appropriate; indeed, FastText is used for those reasons in (Mudgal et al., 2018).
An exhaustive comparison of which encoder is best suited for each kind of attribute is left for future work.
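As an illustration, attribute values could be encoded with the sentence-transformers library as sketched below; the checkpoint name all-mpnet-base-v2 is an assumption, as the paper does not specify the exact model instance used.

```python
# Sketch: encoding attribute values with sentence-transformers.
# The checkpoint name is an assumption; any semantic-similarity model
# from https://www.sbert.net would play the same role here.
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-mpnet-base-v2")

attribute_values = ["Pepper Potts", "American"]
# One vector per attribute value; these become the α(i)_{j,val} of Section 3.1.
value_embeddings = encoder.encode(attribute_values)
print(value_embeddings.shape)  # (2, 768) for this checkpoint
```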
Once all the values of the fields constituting an entity are encoded, we need to merge their information. First, attribute-specific coefficients, $w_{\psi(\alpha(i)_{j,name})}$, are used to weight the feature vectors of the attributes of node $i$; $\psi$ is the function mapping an attribute name to its corresponding weight, and $\alpha(i)_{j,name}$ is the name of the j-th attribute of node $i$. Those weighted vectors are then aggregated. For this step we tested several approaches (a code sketch of the profile encoder is given at the end of this subsection):
Global Accumulation-Pooling: this operation builds the entity embedding by summing up all of the node's weighted features.
Max-Pooling: widely used for methods processing images and reused in (Hamilton et al., 2017), this operation returns the maximum value of the inputs, index-wise.
Extreme-Pooling: since the similarities are measured with vector distances, keeping negative values in the embedding enables a better comparison while retaining the max-pooling behavior.
Equation 1 shows how we compute the profile embedding, $h_p(i)$, of a node $i$ with $k$ attributes, using extreme-pooling. The weights and attribute features have values in $\mathbb{R}^d$, with $d$ the length of the feature vectors; $\odot$ denotes the Hadamard product.

$$h_p(i) = \mathrm{extreme}\left(\left\{\, w_{\psi(\alpha(i)_{j,name})} \odot \alpha(i)_{j,val} \,\right\}\right) \quad \text{for } j \in \{1, \dots, k\} \tag{1}$$
Since the profile does not depend on the relations, once it is computed it only needs to be updated when a change is made to the entity itself. However, the use of specific weights per attribute makes it impossible to infer on unseen field types. In the case where entities have few attributes, the weighting vectors can easily be replaced by matrices.
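As a minimal illustration, the following sketch implements the profile encoder of Equation 1 with extreme-pooling; the interpretation of extreme-pooling as keeping, index-wise, the value of largest magnitude together with its sign is our reading of the description above, and all names are illustrative.

```python
# Sketch of the profile embedding (Equation 1). Names are illustrative.
import torch
import torch.nn as nn

class ProfileEncoder(nn.Module):
    def __init__(self, attribute_names, dim: int = 512):
        super().__init__()
        # ψ: one learned weight vector w in R^d per known attribute name.
        self.weights = nn.ParameterDict(
            {name: nn.Parameter(torch.ones(dim)) for name in attribute_names}
        )

    def forward(self, attributes: dict) -> torch.Tensor:
        # attributes maps α(i)_{j,name} -> α(i)_{j,val} (pre-encoded, shape [d]).
        weighted = torch.stack(
            [self.weights[name] * val for name, val in attributes.items()]
        )  # [k, d]: Hadamard product per attribute
        # Extreme-pooling: keep, per index, the value with the largest
        # absolute magnitude, preserving its sign (our interpretation).
        idx = weighted.abs().argmax(dim=0)                       # [d]
        return weighted.gather(0, idx.unsqueeze(0)).squeeze(0)   # h_p(i), shape [d]
```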
3.4 Taking Into Account Neighbors Through Graph Neural Networks
We have the intuition that the task would be better solved by taking into consideration not only the node profiles, but also their neighborhoods. To do so we rely on a GNN; the state of the art underlines the relevance of a few architecture types: GAT, GATV2, GraphSage, HinSage and HGT. The properties covered by these models are summarized in Table 1.
We adapt these models to take as inputs the vector of the target node and those of all its neighbors. For models taking into account the types of the node and of the neighborhood, we add an entry corresponding to a vector of indices; those indices associate the target node with its class and the neighbors with their relation type.
3.5 PIKA: Performing Identification of Clones in KnowledgeBase Algorithm
We present here in detail PIKA, the model that led to the best performance. It is an adaptation of GAT, adjusted to take advantage of the specifics of knowledge graphs. The following describes the operations performed to produce the node representation.
$$h(i)_{i,j} = w_{\phi(i,j)} \odot h_p(j) \tag{2}$$

First, each neighbor of the target node is multiplied, in an index-wise fashion, with an edge-type-specific weight. $h_p(j)$ is the profile representation of the node $j$, $w$ the weight vector (both in $\mathbb{R}^d$), and $\phi(i,j)$ is the relation type from record $i$ to record $j$ ($\phi(i,j) = \phi(j,i)^{-1}$ being the inverse relation). To enable the consideration of a large number of edge types, we used vectors, since matrices would require the computation of too many parameters. Equations 3, 4 and 5 reproduce the operations explained in (Veličković et al., 2017).
In Equation 3, the attention coefficient, $e(i)_{i,j}$, of each edge linked to the node $i$ is computed by multiplying an attention vector, $\beta \in \mathbb{R}^{2d}$, with the concatenation of the representations of the two entities. These coefficients are then normalized by a softmax in Equation 4.

$$e(i)_{i,j} = \mathrm{LeakyReLU}\big(\beta \cdot [\,h(i)_{i,j} \,\|\, h(i)_{i,i}\,]\big) \tag{3}$$

$$\beta(i)_{i,j} = \frac{e^{e(i)_{i,j}}}{\sum_{k \in Neigh(i)} e^{e(i)_{i,k}}} \tag{4}$$

The last step is an aggregation followed by a non-linear transformation to compute the final representation of the node $i$, $h(i)$. $Neigh(i)$ corresponds to the neighborhood of the node $i$, plus a relation to itself. As several links can exist between two nodes, the neighbor involved can appear more than once in Equation 5.

$$h(i) = \sigma\Big(\sum_{j \in Neigh(i)} \beta(i)_{i,j} \, h(i)_{i,j}\Big) \tag{5}$$
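The following is a hedged sketch of one PIKA convolutional layer implementing Equations 2 to 5; the relation-indexing scheme, the choice of a sigmoid for $\sigma$ and the omission of the k-head averaging of Section 3.2 are our simplifications, not details confirmed by the paper.

```python
# Sketch of a single PIKA convolutional layer (Equations 2-5).
import torch
import torch.nn as nn
import torch.nn.functional as F

class PikaLayer(nn.Module):
    def __init__(self, num_edge_types: int, dim: int = 512):
        super().__init__()
        # One weight vector w_φ per directed edge type; index 0 is reserved
        # here for the self-relation, callers use 1..num_edge_types.
        self.edge_weights = nn.Parameter(torch.ones(num_edge_types + 1, dim))
        self.attention = nn.Parameter(torch.empty(2 * dim))  # β in R^{2d}
        nn.init.normal_(self.attention, std=dim ** -0.5)

    def forward(self, target: torch.Tensor, neighbors: torch.Tensor,
                edge_types: torch.Tensor) -> torch.Tensor:
        # target: [d] profile h_p(i); neighbors: [n, d] profiles h_p(j);
        # edge_types: [n] index of φ(i, j) for each neighbor.
        profiles = torch.cat([target.unsqueeze(0), neighbors], dim=0)    # [n+1, d]
        types = torch.cat([torch.zeros(1, dtype=torch.long), edge_types])
        h = self.edge_weights[types] * profiles                          # Eq. 2
        # Eq. 3: attention logits over [h(i)_{i,j} || h(i)_{i,i}].
        self_rep = h[0].expand_as(h)
        logits = F.leaky_relu(torch.cat([h, self_rep], dim=1) @ self.attention)
        coeffs = torch.softmax(logits, dim=0)                            # Eq. 4
        # Eq. 5, with σ taken as a sigmoid (the paper does not specify it).
        return torch.sigmoid((coeffs.unsqueeze(1) * h).sum(dim=0))
```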
Since the direct neighborhood of a node is sufficient to differentiate between cloned and non-cloned pairs, we limit ourselves to models with only one convolutional layer; this reduces the memory used and the time spent on calculating the representation.
Figure 2: Schema of the profile encoder (each attribute value is encoded, weighted by its attribute-specific weight, and pooled into the encoded profile).
Table 1: Model characteristics — which of the properties Directed, Heterogeneous edges, Node type specific, Attention and Iterative inputs are covered by GAT, GATV2, GATV2*, Graphsage, Graphsage*, HinSage, HGT, PIKA-simple and PIKA.
It also makes it possible to take into account all the neighbors without needing a sampling strategy such as the one used in (Hamilton et al., 2017).
As opposed to the computation of the vector summarizing the profile, if a change occurs at the attribute level, the graph embeddings of the neighboring nodes need to be recalculated.
3.6 Near-duplicate Finding in Vector Spaces
In order to retrieve the most similar entities based on their embeddings, we use an approximate nearest-neighbor search tool, Annoy (https://github.com/spotify/annoy). Inspired by kd-trees, it uses random projections, associating a random hyperplane with each node of the search tree. The drawback of such libraries is the need to update the search index whenever a new node is added. We use the cosine similarity to compare the vectors, as it is the most common measure for handling embeddings.
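A minimal sketch of this retrieval step with Annoy, assuming integer node identifiers and 512-dimensional embeddings; the number of trees is an arbitrary illustration, not the paper's setting.

```python
# Sketch: building and querying an Annoy index over 512-d node embeddings.
import numpy as np
from annoy import AnnoyIndex

DIM = 512
rng = np.random.default_rng(0)
# Dummy embeddings standing in for the vectors produced by PIKA.
embeddings = [(i, rng.normal(size=DIM)) for i in range(1000)]
query_vector = rng.normal(size=DIM)

index = AnnoyIndex(DIM, "angular")  # angular distance ~ cosine similarity
for node_id, vector in embeddings:
    index.add_item(node_id, list(vector))
index.build(10)  # 10 random-projection trees; must be rebuilt to add nodes

# Retrieve the 100 nearest candidates for the query embedding.
candidates = index.get_nns_by_vector(list(query_vector), 100)
```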
3.7 Training Procedure
3.7.1 Implementation
All the models have been implemented using PyTorch and trained as classifiers in a supervised manner, following the framework explained in (Reimers and Gurevych, 2019). This framework consists in training a binary classifier with pairs of entities as input. The classifier can be used at inference time to perform the classification, or its last prediction layer can be removed to directly obtain representative vectors, more useful for the entity retrieval task. Our training procedure only changes the features given to the final classification layer, compared to the framework as initially described: instead of using the concatenation of the two produced embeddings together with their absolute difference, we keep only the difference, since it produced better results in our experiments. Regarding the inputs, only the 4,000 most frequent attribute types are taken into account, along with 2,000 types of edges (doubled, since the direction of the link is considered). The dimension of the final embeddings is set to 512. For training, we use the binary cross-entropy loss function and the Adam optimizer ($lr = 10^{-3}$).
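A minimal sketch of this training setup, assuming the embeddings are produced upstream by the profile or graph encoder; the single linear prediction layer is our simplification of the classification head.

```python
# Sketch of the pair classifier used for training (Section 3.7.1),
# feeding only the absolute difference of the two embeddings.
import torch
import torch.nn as nn

class PairClassifier(nn.Module):
    def __init__(self, dim: int = 512):
        super().__init__()
        self.out = nn.Linear(dim, 1)

    def forward(self, emb_a: torch.Tensor, emb_b: torch.Tensor) -> torch.Tensor:
        return self.out(torch.abs(emb_a - emb_b)).squeeze(-1)  # one logit per pair

model = PairClassifier()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()  # binary cross-entropy on "cloned" labels

# Dummy batch: emb_a, emb_b would come from the profile/graph encoder.
emb_a, emb_b = torch.randn(8, 512), torch.randn(8, 512)
labels = torch.randint(0, 2, (8,)).float()  # 1.0 = cloned pair
loss = loss_fn(model(emb_a, emb_b), labels)
loss.backward()
optimizer.step()
```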
3.7.2 Iterative Training Procedure
The standard training procedure relies on training pairs; the model predicts whether both entities refer to the same real entity or not. We had the intuition that the robustness of the model could be improved here: since the non-analogous pairs are randomly created, the models can easily differentiate between entities that have nothing in common, but may struggle with ambiguous entities.
To address this problem, we propose an iterative training procedure, Algorithm 1, which after each iteration adds the pairs returned by the model as the most similar. The elements from which the most-similar queries are made are the ones in the initial train and validation sets, while the returned instances are the ones not already in one of the sets. Note that this procedure is used for the profile encoder step; using it on the neighborhood aggregation step would first require generating a train/test split with specific care (to avoid having the same neighborhood in both the train and test sets).
Figure 3: Schema of the PIKA model (target node and neighborhood, edge weighting, attention coefficients, weighted sum, node embedding).
Algorithm 1: Iterative training procedure.
Input: sets baseTrain, baseVal
Result: the trained model
train ← baseTrain;
val ← baseVal;
model ← new Model;
while nbIteration < maxIteration do
    model ← train(model, train, val);
    embeddings ← get_embeddings(model);
    most_similar_pairs ← get_similarities(embeddings, baseTrain, baseVal);
    train, val ← get_new_sets(model, train, val);
end
4 EXPERIMENT SETTINGS
4.1 The Datasets
In order to develop, test and evaluate the models, we used the TAC KBP Reference Knowledge Base developed by the Linguistic Data Consortium (LDC). The knowledge base is built from English Wikipedia articles. More than 800,000 entities are divided into four classes: Person (PER), Geo-political entity (GPE), Organization (ORG) and Unknown (UKN). TAC can be modeled as a knowledge graph by using as relations the references to other entities in the descriptive texts.
Generating Clones. As TAC is considered to be free of duplicate entities, it was necessary to generate clones from instances of the base. To do so, we used a custom script which generates, from a model entity drawn randomly (and not already used), a clone with alterations to its relationships and attributes. These alterations reproduce as closely as possible what can be found in our use cases and have arbitrary probabilities of occurrence. Concerning the attributes: field values can be swapped, values can be randomly truncated, characters may be changed or inserted to simulate typing errors, and attributes can be deleted.
Figure 4: Distribution of the dataset (train set: 32,000 nodes; validation set: 19,200 nodes; test set: 12,800 nodes; clones mix: 20,000 nodes, of which 10,000 poor clones and 10,000 rich clones).
Concerning the edges: some can be deleted, some can be added with a random pair of neighbors and predicates, and existing ones can have their relationship type replaced. To prevent bias in the evaluation, an entity can only be used to produce one clone, and a clone cannot itself be used as a model.
To test the methods thoroughly, two types of clones were generated. The first, with few deleted attributes (deletion probability set to 0.2), is qualified as rich clones and corresponds to an exact clone search. The second type, with many deletions (deletion probability set to 0.5), covers the case of a query search and is named poor clones. The other changes have the same probability of occurrence for both types.
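A hedged sketch of the attribute side of such a clone generator; apart from the stated deletion probabilities, all occurrence probabilities and helper names below are illustrative placeholders, since the paper describes its own values as arbitrary.

```python
# Sketch of attribute perturbation for clone generation. Probabilities other
# than the deletion rate are illustrative, not the paper's values.
import random

def perturb_attributes(attributes: dict, p_delete: float) -> dict:
    """p_delete = 0.2 for 'rich' clones, 0.5 for 'poor' clones."""
    clone = {}
    for name, value in attributes.items():
        if random.random() < p_delete:
            continue                                        # attribute deleted
        if len(value) > 3 and random.random() < 0.1:
            value = value[: random.randint(2, len(value) - 1)]  # truncation
        if value and random.random() < 0.1:
            i = random.randrange(len(value))                # simulated typo
            value = value[:i] + random.choice("abcdefghij") + value[i + 1:]
        clone[name] = value
    # Field-value swap between two surviving attributes.
    if len(clone) >= 2 and random.random() < 0.1:
        a, b = random.sample(list(clone), 2)
        clone[a], clone[b] = clone[b], clone[a]
    return clone

rich_clone = perturb_attributes({"Name": "Tony Stark", "Age": "48"}, p_delete=0.2)
```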
Distribution of the Datasets. As shown in Figure 4, the datasets are made of pairs of nodes (entity_A, entity_B), associated with a "cloned" or "non-cloned" label. There are respectively 16,000, 9,600 and 6,400 pairs in the training, validation and test sets. The datasets are made of 50% cloned pairs, and 20% of the false clone pairs have one of their entities with a duplicate in the database, to prevent bias in the training procedure. This distribution sets the number of duplicates at 20,000, i.e. 2.5% of the base: a trade-off between training-set size and a polluted base.
4.2 Evaluation Procedure
The methods are evaluated in two settings, modeling the ability of a method to operate either by proposing similar entities or by directly making a classification:
Search for similar entities: as an information retrieval problem, for every similar pair in the test set we submit the first entity of the tuple as a query. The position of the analogous entity in the list ordered by similarity with the embedding of the query node is then measured.
Clone detection: we take one of the entities of each pair of the test set. The system returns the entity it considers the most similar, then applies a binary classification.
4.3 Metrics
In the context of the search for similar entities, we use the Hit@k score (with $k \in \{1, 10, 100\}$) and the MRR (Mean Reciprocal Rank). The Hit@k score measures the average capacity of a system to return the target within the first k elements, while the MRR evaluates the performance of the system in general.
$$MRR = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{rank_i} \tag{6}$$

$$Hit@k = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \mathbb{1}_{target_i \in top_k} \tag{7}$$
The clone detection task, on the other hand, is seen as a classification problem. Given a pair of entities as input, the classifier predicts whether both represent the same real entity ("clone") or not. The evaluation metrics are Accuracy, Precision, Recall and F1-score.
Note that this problem is a classification task on top of the IR chain: the classifier examines the submitted entity and its closest match. As a result, the recall of the classifier is the Hit@1 of the IR task.
5 RESULTS AND DISCUSSION
We compare our proposed method PIKA with versions of GraphSage, HGT, HinSage, GAT and GATV2, which we implemented in PyTorch and trained under the (Reimers and Gurevych, 2019) framework. In addition to being tested with their standard implementations, some are modified to take into account the specificities of knowledge databases. We also measure the performance of PIKA without the iterative training procedure, under the label PIKA-simple, in order to evaluate the added value of this procedure.
The introduction of iterative training clearly boosts the quality of the model: PIKA outperforms PIKA-simple, its non-iterative counterpart. The gain is even greater when switching from the basic models to the models modified to consider the characteristics of a knowledge graph (marked with a *): we can clearly see a better MRR for the *-versions, with an exception for GATV2* on rich clones. The improvement is due, firstly, to the inclusion of edge orientation, an interesting parameter for places, for example (GPE type): nodes like these, often linked to a myriad of incoming neighbors, are better represented by their outgoing edges. Secondly, our methods differentiate the weights according to the class of the node in the self-attention mechanism; for some types it is preferable to keep more of the node's own information. Lastly, distinguishing the nature of the connections gives the model better adaptability to different topologies.
From these results we can also see that models not adapted to the knowledge graph structure (node-type information, relation-type information) struggle to rely on the information given by the neighborhood: their MRR drops by 0.1 to 0.15 points between the mixed and the poor set. Considering the sparsity of the knowledge base (a median of three edges per node, and 14% of nodes without any link), we believe that methods strongly relying on context will perform even better on denser graphs.
As shown in the classification part of Table 2, the proportion of false positives is still too high (visible as low precision scores) to enable a fully automatic similarity detection system. This can be explained by the fact that no iterative training was done on the classifier with the embeddings obtained by the graph methods. Indeed, iterative training on the graph methods would not have been useful: many nodes have a common neighborhood without being clones, and this ambiguity would have hindered iterative learning.
Regarding the time constraints of graph-based embedding methods, we measured an average of 0.16 s to compute the embedding of an entity and retrieve the most similar item from our knowledge base; this time was obtained on a standard office workstation, against the whole TAC KBP database. This short duration makes it convenient to integrate PIKA into Web-scale systems.
6 CONCLUSION
We presented a two-part model able to summarize the features and the neighborhood of a node into a vector that can be used to compare similarities and to retrieve entities that are close in terms of characteristics and relations.
Table 2: Measures on the sets of poor clones, then rich clones, for the information retrieval task, and on rich clones for the classification task. Models followed by "*" are variants adapted to knowledge graphs.
Poor clones Rich clones Rich clones - classification
Model MRR Hit@1 Hit@100 MRR Hit@1 Hit@100 Acc. Precision Recall F1
GAT 0.257 0.206 0.534 0.532 0.472 0.764 0.320 0.258 0.957 0.406
GATV2 0.636 0.582 0.850 0.810 0.774 0.936 0.542 0.491 0.924 0.641
GATV2* 0.720 0.678 0.886 0.752 0.711 0.903 0.565 0.513 0.978 0.673
Graphsage 0.449 0.388 0.740 0.783 0.752 0.913 0.431 0.423 0.993 0.593
Graphsage* 0.484 0.432 0.764 0.820 0.793 0.928 0.431 0.423 0.993 0.593
HinSage 0.110 0.091 0.235 0.521 0.489 0.699 0.764 0.472 0.705 0.566
HGT 0.141 0.119 0.281 0.450 0.426 0.600 0.564 0.297 0.890 0.446
PIKA-simple 0.817 0.783 0.928 0.918 0.899 0.971 0.653 0.617 0.960 0.751
PIKA 0.856 0.828 0.941 0.959 0.949 0.988 0.786 0.669 0.952 0.786
Since the model weights are not specific to nodes but to attributes and relationships, it is applicable to other graphs with the same types of attributes and edges.
For a more specific application, our method could easily be adapted to assign higher weights to attributes that should be focused on: by submitting pairs of clones with specific modifications during training, the user could detect similarities only on targeted attributes.
Applying iterative training yielded a model more robust to ambiguous entity pairs. Nevertheless, embedding a complex entity prevents an efficient atomic comparison of its attributes (as performed in entity alignment methods), since the information is smoothed into a single vector. Still, the results showed that searched entities, if not returned in first position, are almost always in the top 100 results. Given this, our model can serve as a pre-filtering step on a large database, before performing an entity-by-entity comparison or using a more accurate but slower method on fewer candidates.
Finally, numerical values are not processed in an optimal way by the encoder, since they appear in the database as strings. Text encoders such as USE or MPNet do not specifically handle numerical values: we believe that dedicated processing for such attributes is required in order to reach an acceptable quality for the clone classification task.
REFERENCES
Bordes, A., Usunier, N., Garcia-Duran, A., Weston, J., and
Yakhnenko, O. (2013). Translating embeddings for
modeling multi-relational data. Advances in neural
information processing systems, 26.
Brody, S., Alon, U., and Yahav, E. (2021). How atten-
tive are graph attention networks? arXiv preprint
arXiv:2105.14491.
Brunner, U. and Stockinger, K. (2020). Entity matching
with transformer architectures-a step forward in data
integration. In International Conference on Extending
Database Technology, Copenhagen, 30 March-2 April
2020. OpenProceedings.
Cer, D., Yang, Y., Kong, S.-y., Hua, N., Limtiaco, N., John, R. S., Constant, N., Guajardo-Céspedes, M., Yuan, S., Tar, C., et al. (2018). Universal sentence encoder. arXiv preprint arXiv:1803.11175.
Dai, Y., Wang, S., Xiong, N. N., and Guo, W. (2020). A
survey on knowledge graph embedding: Approaches,
applications and benchmarks. Electronics, 9(5).
Daza, D., Cochez, M., and Groth, P. (2020). Inductive entity
representations from text via link prediction. CoRR,
abs/2010.03496.
Dettmers, T., Minervini, P., Stenetorp, P., and Riedel, S.
(2018). Convolutional 2d knowledge graph embed-
dings.
Dong, Y., Chawla, N. V., and Swami, A. (2017). metap-
ath2vec: Scalable representation learning for hetero-
geneous networks. In Proceedings of the 23rd ACM
SIGKDD international conference on knowledge dis-
covery and data mining, pages 135–144.
Fortunato, S. (2010). Community detection in graphs.
Physics reports, 486(3-5):75–174.
Gadek, G. (2019). From community detection to topical,
interactive group detection in online social networks.
In IEEE/WIC/ACM International Conference on Web
Intelligence-Companion Volume, pages 176–183.
Hamilton, W. L., Ying, R., and Leskovec, J. (2017). In-
ductive representation learning on large graphs. In
Proceedings of the 31st International Conference on
Neural Information Processing Systems, pages 1025–
1035.
Kipf, T. N. and Welling, M. (2016). Semi-supervised clas-
sification with graph convolutional networks. arXiv
preprint arXiv:1609.02907.
Li, Y., Li, J., Suhara, Y., Doan, A., and Tan, W.-C. (2020).
Deep entity matching with pre-trained language mod-
els. arXiv preprint arXiv:2004.00584.
Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D.,
Levy, O., Lewis, M., Zettlemoyer, L., and Stoyanov,
V. (2019). Roberta: A robustly optimized bert pre-
training approach. arXiv preprint arXiv:1907.11692.
Mudgal, S., Li, H., Rekatsinas, T., Doan, A., Park, Y., Kr-
ishnan, G., Deep, R., Arcaute, E., and Raghavendra,
V. (2018). Deep learning for entity matching: A de-
sign space exploration. In Proceedings of the 2018
International Conference on Management of Data,
pages 19–34.
Nielsen, F. Å. (2017). Wembedder: Wikidata entity embedding web service. arXiv preprint arXiv:1710.04099.
Perozzi, B., Al-Rfou, R., and Skiena, S. (2014). Deepwalk:
Online learning of social representations. In Proceed-
ings of the 20th ACM SIGKDD international confer-
ence on Knowledge discovery and data mining, pages
701–710.
Reimers, N. and Gurevych, I. (2019). Sentence-bert: Sen-
tence embeddings using siamese bert-networks. arXiv
preprint arXiv:1908.10084.
Ristoski, P. and Paulheim, H. (2016). Rdf2vec: Rdf graph
embeddings for data mining. In International Seman-
tic Web Conference, pages 498–514. Springer.
Shi, B. and Weninger, T. (2018). Open-world knowledge
graph completion. In Proceedings of the AAAI Con-
ference on Artificial Intelligence, volume 32.
Song, K., Tan, X., Qin, T., Lu, J., and Liu, T.-Y. (2020). Mp-
net: Masked and permuted pre-training for language
understanding. arXiv preprint arXiv:2004.09297.
Srivastava, R. K., Greff, K., and Schmidhuber, J. (2015).
Highway networks. arXiv preprint arXiv:1505.00387.
Sun, Z., Deng, Z.-H., Nie, J.-Y., and Tang, J. (2019). Rotate:
Knowledge graph embedding by relational rotation in
complex space. arXiv preprint arXiv:1902.10197.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones,
L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I.
(2017). Attention is all you need. In Advances in
neural information processing systems, pages 5998–
6008.
Veličković, P., Cucurull, G., Casanova, A., Romero, A., Lio, P., and Bengio, Y. (2017). Graph attention networks. arXiv preprint arXiv:1710.10903.
Wang, P., Han, J., Li, C., and Pan, R. (2019). Logic
attention based neighborhood aggregation for induc-
tive knowledge graph embedding. In Proceedings of
the AAAI Conference on Artificial Intelligence, vol-
ume 33, pages 7152–7159.
Wang, X., Gao, T., Zhu, Z., Zhang, Z., Liu, Z., Li, J., and
Tang, J. (2021). Kepler: A unified model for knowl-
edge embedding and pre-trained language representa-
tion. Transactions of the Association for Computa-
tional Linguistics, 9:176–194.
Wu, F., Souza, A., Zhang, T., Fifty, C., Yu, T., and Wein-
berger, K. (2019). Simplifying graph convolutional
networks. In International conference on machine
learning, pages 6861–6871. PMLR.
Xiaohan, Z. (2020). A survey on application of knowl-
edge graph. Journal of Physics: Conference Series,
1487:012016.