source to document-oriented NoSQL DBs. The target
of the transfer is also a document-oriented NoSQL
DB.
In our case study presented in Section 2, the
(restricted) Data Lake consists of MongoDB NoSQL
DBs and the Data Warehouse is an OrientDB NoSQL
DB. MongoDB and OrientDB share the same
document-oriented model using different
vocabularies; however, the underlying concepts in
terms of data representation are identical and
consistent with the metamodel in Figure 2. Note that
OrientDB offers a richer description capability,
especially in terms of link expression. For the sake of
clarity of presentation, we will use the MongoDB
vocabulary for the source and the OrientDB
vocabulary for the target (Table 1).
Table 1: Correspondence of the source and target vocabularies.

MongoDB (source)          OrientDB (target)
Collection                Class
Document                  Record
Fields                    Property
DBRef link                Reference
Oid (Object identifier)   Rid (Record identifier)

The Transfer module therefore consists of copying
data from several source databases into a single
target database. This transfer must nevertheless
follow certain rules so that the later processing
steps can be carried out.
Rule 1: the names of the transferred classes are
prefixed by the name of the source database from
which they originate; this prevents classes coming
from different source databases from having the same
name in the Data Warehouse.
Rule 2: when documents are transferred, their original
identifiers are kept unchanged; the identifier of
a document is stored in a property of the target record,
in the form (Oid, value of the identifier). This
identifier is used in the Convert module and then
deleted from the record.
Rule 3: the links contained in the documents
(DBRef links) are stored unchanged in the target
records; they will be transformed into references in the
Convert module. The transferred links are
prefixed by the keyword "DBRef".
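The three rules above can be illustrated with a minimal Python sketch. It is not the implementation used in the case study: documents are modelled as plain dictionaries, a DBRef link is assumed to appear in the usual MongoDB form {"$ref": ..., "$id": ...}, and the function name transfer_document and the tuple layout ("DBRef", collection, identifier) are hypothetical choices made only for this illustration.

```python
def transfer_document(db_name, collection, doc):
    """Apply Rules 1-3 to one source document (illustrative sketch only).

    Returns the target class name and the target record as a dict.
    """
    # Rule 1: prefix the class name with the source database name.
    class_name = f"{db_name}.{collection}"

    record = {}
    for field, value in doc.items():
        if field == "_id":
            # Rule 2: keep the original identifier in an "Oid" property.
            record["Oid"] = value
        elif isinstance(value, dict) and "$ref" in value and "$id" in value:
            # Rule 3: store the link unchanged, tagged with the keyword "DBRef"
            # (the tuple layout is an assumption of this sketch).
            record[field] = ("DBRef", value["$ref"], value["$id"])
        else:
            record[field] = value
    return class_name, record


if __name__ == "__main__":
    # Hypothetical sample document from a source database named "B1".
    sample = {"_id": "d1", "name": "Ann", "clinic": {"$ref": "Clinics", "$id": "c9"}}
    print(transfer_document("B1", "Doctors", sample))
```

The link is deliberately left unresolved at this stage, since the rules defer its transformation to the Convert module.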
According to rule 2 above, any record r that is
stored in the Data Warehouse has a property
containing the MongoDB identifier of the original
document. If we consider the set of OrientDB
identifiers (E_Rid) assigned to the records and the set
of MongoDB identifiers (E_Oid) present in the
source, then there exists a bijection from E_Oid to
E_Rid, which we denote Rb: E_Oid →
E_Rid. This property is important because it allows
MongoDB links to be converted into OrientDB links.
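To make the role of Rb concrete, the following Python sketch builds the mapping from the transferred records and uses it to rewrite links. It rests on assumptions that are not in the text: records are held in a dictionary keyed by Rid, each record still carries its "Oid" property, and a transferred link is represented as the tuple ("DBRef", collection, identifier). The function names build_rb and convert_links and the sample Rid values are hypothetical.

```python
def build_rb(records):
    """Build Rb: Oid -> Rid from records keyed by Rid (illustrative sketch)."""
    rb = {}
    for rid, record in records.items():
        oid = record["Oid"]
        if oid in rb:
            # Two records sharing an Oid would break the bijection.
            raise ValueError(f"duplicate Oid {oid!r}: Rb would not be a bijection")
        rb[oid] = rid
    return rb


def convert_links(records, rb):
    """Replace each tagged DBRef link by the Rid it maps to, then drop Oid."""
    converted = {}
    for rid, record in records.items():
        new_record = {}
        for prop, value in record.items():
            if prop == "Oid":
                continue  # the original identifier is no longer needed
            if isinstance(value, tuple) and len(value) == 3 and value[0] == "DBRef":
                new_record[prop] = rb[value[2]]  # look up the target Rid
            else:
                new_record[prop] = value
        converted[rid] = new_record
    return converted


if __name__ == "__main__":
    # Hypothetical records; the Rid values are sample strings.
    records = {
        "#12:0": {"Oid": "d1", "name": "Ann", "clinic": ("DBRef", "Clinics", "c9")},
        "#13:0": {"Oid": "c9", "city": "Paris"},
    }
    print(convert_links(records, build_rb(records)))
```

Because Rb is a bijection, every identifier referenced by a link resolves to exactly one record, which is what makes the lookup in convert_links unambiguous.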
4.2 The Merging Module
The Data Lake is generally made up of a set of
separate databases managed independently. The data
warehouse resulting from the Transfer module may
contain "similar" data sets. For example, in our
medical application, descriptions of insured persons
or lists of doctors appear in several databases in the
Data Lake. These data may concern the same entities
in the real world but do not necessarily have the same
structures (different properties).
The Data Warehouse can therefore include
subsets of classes linked by an equivalence
relation, meaning that they "describe the same entities";
each such subset is called an equivalence group. For example,
in the ENS Data Warehouse of our application, the
classes B1.Doctors, B2.Therapists and
B3.Practitioner constitute an equivalence group (the
prefixes B1, B2 and B3 correspond to the names of
the original databases in the Data Lake).
For each equivalence group 𝐺 in the Data Warehouse,
the module deletes all classes in the group and
replaces them with a single resulting class 𝑋. The
transformation from 𝐺 to 𝑋 is based on a
domain ontology created with the help of business
experts; the 𝑋 class is built automatically from this ontology.
The experts (administrators, managers, decision-
makers, etc.) have in-depth knowledge of one or more
databases contained in the Data Lake.
They are asked to specify the possible semantic
correspondences between the data of the different
databases. For example, in the ENS Data Lake, three
databases contain data describing individuals insured
by mutual insurance companies. The ontology will
indicate an equivalence relation between these data.
The similarity relations between data are stored in a
domain ontology, called "Onto", in the form of a
graph; they are obtained from the specifications
provided by the business experts.
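One possible way to derive equivalence groups from such a graph is to take its connected components. The Python sketch below assumes, purely for illustration, that the ontology can be read as a list of pairwise equivalences between prefixed class names; the function name equivalence_groups and the classes B1.Insured and B2.Members are hypothetical.

```python
from collections import defaultdict


def equivalence_groups(equivalences):
    """Group class names into equivalence groups (connected components).

    `equivalences` is an iterable of (class_a, class_b) pairs; this input
    format is an assumption of the sketch, not the structure of "Onto".
    """
    graph = defaultdict(set)
    for a, b in equivalences:
        graph[a].add(b)
        graph[b].add(a)

    seen, groups = set(), []
    for start in sorted(graph):  # sorted for a deterministic group order
        if start in seen:
            continue
        group, stack = set(), [start]
        while stack:  # iterative depth-first traversal of one component
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            group.add(node)
            stack.extend(graph[node] - seen)
        groups.append(group)
    return groups


if __name__ == "__main__":
    pairs = [
        ("B1.Doctors", "B2.Therapists"),
        ("B2.Therapists", "B3.Practitioner"),
        ("B1.Insured", "B2.Members"),  # hypothetical class names
    ]
    for group in equivalence_groups(pairs):
        print(sorted(group))
```

Taking connected components makes the relation transitive by construction: two classes end up in the same group even when the experts only related each of them to a third class.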
The business experts must define a primary key
(in the sense of the relational model) among the
attributes of the classes of the same group. This key,
made up of a minimal set of properties, makes it
possible to distinguish records and to establish inter-
class correspondences within a group. For example,
the classes of a group containing descriptions of