3 SOLUTION OVERVIEW
We have a DB managed by the NoSQL OrientDB
system. Our process consists of extracting the logical
schema from the DB and presenting it in a form that
can be read by users (developers and
decision-makers). We used the Model-to-Model
transformation approach of MDA (OMG, 2021 April) to
generate the logical schema. We therefore present, in
turn, the characteristics of the OrientDB system, the
principles of MDA and an overview of our process.
OrientDB is a multi-model NoSQL data storage and
manipulation system in the sense that it supports
several data organizations. Given the specificities of
our application, we chose the document-oriented
model whose records (i.e. objects) contain a set of
properties (Key, Value); the values of the properties
can belong to all types of data (atomic, structured and
multivalued). One of the particularities of the OrientDB
system is the possibility of expressing association
links in the form of pointers (reference values)
according to the ODMG DB standard (ODMS, 2021
April). In addition, OrientDB is schemaless because,
for a given class, the schema of the records is not
provided when the class is created.
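As a minimal illustration of this document model, the
sketch below uses plain Python dictionaries as stand-ins
for OrientDB records. The class names, property names and
record identifiers are hypothetical, and the helper
functions are illustrative assumptions rather than part of
any OrientDB API. It shows two records of the same class
with different properties, and an association link stored
as a reference value.

```python
# Hypothetical records keyed by record identifier; plain dicts stand in for
# OrientDB documents, and "@class" is used here to name each record's class.
db = {
    "#10:0": {"@class": "Author", "name": "Ada",
              "emails": ["ada@example.org"]},                 # multivalued
    "#11:0": {"@class": "Book", "title": "Notes", "year": 1843,
              "author": "#10:0"},                             # link (reference)
    "#11:1": {"@class": "Book", "title": "Drafts",            # no "year" here
              "publisher": {"name": "P", "city": "London"}},  # structured
}


def follow_link(db, record_id, prop):
    """Return the record referenced by the link stored in property `prop`."""
    target_id = db[record_id][prop]
    return db[target_id]


def property_sets_by_class(db):
    """For each class, collect the distinct sets of property names observed."""
    seen = {}
    for record in db.values():
        keys = frozenset(k for k in record if k != "@class")
        seen.setdefault(record["@class"], set()).add(keys)
    return seen
```

In this sketch the class "Book" exhibits two distinct sets
of properties, which is the situation a schema extraction
process has to reconcile.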
MDA is a branch of model-driven engineering
(MDE) proposed by the OMG (OMG, 2021 June). It
is a software development architecture that
distinguishes several levels of description (CIM, PIM,
PSM), making it possible to disregard the technical
characteristics of an application at the more abstract
levels. For example, the PSM (Platform Specific Model)
corresponds to descriptions that take into account the
technical characteristics of an implementation
platform. In
addition, MDA offers model transformation
principles and techniques for generating code or,
inversely, extracting the model from existing code.
This involves applying transformation rules on
metamodels describing the starting point (the source)
and the arrival (the target). The Eclipse Foundation
(Eclipse, 2021 April) has developed implementation
tools in accordance with MDA. The objective of our
work is to obtain the (unique) schema of an OrientDB
schemaless DB. MDA offers us extraction principles
that consist of metamodeling the source (DB) and the
target (schema) and then applying transformation rules
from the source to the target.
This solution has the advantage of being applicable
to different document-oriented NoSQL
DBs (managed by OrientDB or by other systems
accepting this model). However, it faces some
difficulties related mainly to the detection of data
types and links; we therefore state some initial
assumptions in Section 5.1.
4 RELATED WORKS
Several NoSQL DB schema extraction tools
have been proposed by software publishers, such as
"Spark Dataframe" (Apache Spark, 2021 Oct),
"Schema-guru" (SnowPlow Analytics, 2021 Oct) and
"Mongodb-schema" (Peter Schmidt, 2021 Oct).
These tools extract the schema of each class (also
called a table or collection) from a DB in JSON
format, but they do not extract the
semantic relationships between objects.
In addition, research work has proposed extracting
more complete schemas from NoSQL DBs. In
(Baazizi et al., 2017), the authors propose a process
of schema extraction from a JSON dataset using the
Map-Reduce system. This process can be summarized
in 2 phases. The Map phase consists of applying a
transformation to each record of a class in order to
deduce the pairs (key, type) from the pairs (key,
value); this step yields one schema per record. The
Reduce phase consists of merging these schemas in
order to provide a global schema for each class. This
process was extended by the same authors by
parameterizing the Reduce phase (Baazizi
et al., 2019b); this allows the user to infer, from the
schemas produced in the Map phase, schemas at
different levels of abstraction.
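The two phases described above can be sketched as follows.
This is a simplified illustration, not the algorithm of
(Baazizi et al., 2017): the type names, the function names
and the choice of merging by a plain union of types per key
are assumptions made for the example, and nested records
are not descended into.

```python
from functools import reduce


def type_of(value):
    """Map a JSON value to an (assumed) type name."""
    if isinstance(value, bool):      # tested before int: bool is an int subtype
        return "Bool"
    if isinstance(value, (int, float)):
        return "Num"
    if isinstance(value, str):
        return "Str"
    if value is None:
        return "Null"
    if isinstance(value, list):
        return "Array"
    if isinstance(value, dict):
        return "Record"
    raise TypeError(f"not a JSON value: {value!r}")


def map_record(record):
    """Map phase: turn (key, value) pairs into (key, {type}) pairs."""
    return {key: {type_of(value)} for key, value in record.items()}


def merge(schema_a, schema_b):
    """Reduce step: union the types observed for each key."""
    merged = {key: set(types) for key, types in schema_a.items()}
    for key, types in schema_b.items():
        merged.setdefault(key, set()).update(types)
    return merged


def infer_schema(records):
    """Apply Map to every record, then fold the per-record schemas together."""
    return reduce(merge, map(map_record, records), {})
```

Because the merge in this sketch is associative and
commutative, the per-record schemas can be combined in any
order, which is what makes the Reduce phase suitable for
distributed execution.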
Another process of schema extraction from an
extended JSON dataset has been proposed in (Frozza
et al., 2018). Extended JSON records support, in
addition to standard types, other data types such as
DBRef (which expresses links between
objects), Date, Long, Timestamp, Binary… The
extraction process consists of 4 successive
steps: i) creation of a raw schema for each record, ii)
grouping of raw schemas in order to obtain a unique
class of JSON objects, iii) unification of schemas and
iv) construction of the global schema for all records
of a class. The processes presented in (Frozza et al.,
2018), (Baazizi et al., 2019b) and (Baazizi et al.,
2017) provide some answers to our problem.
However, the DBs to which they apply contain a
single class of objects; they therefore do not deal
with the links between classes.
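The steps listed above can be sketched as follows for a
single class of records. This is an illustrative reading,
not the implementation of (Frozza et al., 2018): the
function names, the output layout and the interpretation of
step ii as collapsing identical raw schemas are assumptions.
Extended types are recognized through the marker keys used
by MongoDB Extended JSON ("$ref" with "$id" for DBRef,
"$date", "$numberLong", "$timestamp", "$binary").

```python
from collections import Counter

# Marker key -> extended type name (MongoDB Extended JSON conventions).
EXTENDED_TYPES = {"$date": "Date", "$numberLong": "Long",
                  "$timestamp": "Timestamp", "$binary": "Binary"}


def raw_type(value):
    """Name the type of a value, detecting extended JSON types first."""
    if isinstance(value, dict):
        if "$ref" in value and "$id" in value:
            return "DBRef"                       # link to another object
        for marker, name in EXTENDED_TYPES.items():
            if marker in value:
                return name
        return "Object"
    if isinstance(value, bool):
        return "Boolean"
    if isinstance(value, (int, float)):
        return "Number"
    if isinstance(value, str):
        return "String"
    if isinstance(value, list):
        return "Array"
    if value is None:
        return "Null"
    raise TypeError(f"not a JSON value: {value!r}")


def raw_schema(record):
    """Step i: one raw schema (sorted (key, type) pairs) per record."""
    return tuple(sorted((key, raw_type(value)) for key, value in record.items()))


def extract_schema(records):
    raw = [raw_schema(record) for record in records]        # step i
    groups = Counter(raw)                                   # step ii (assumed)
    unified = {}                                            # step iii
    for schema, count in groups.items():
        for key, type_name in schema:
            entry = unified.setdefault(key, {"types": set(), "count": 0})
            entry["types"].add(type_name)
            entry["count"] += count
    total = len(records)                                    # step iv
    return {key: {"types": sorted(entry["types"]),
                  "required": entry["count"] == total}
            for key, entry in unified.items()}
```

In this sketch a DBRef is only reported as a type of the
property that holds it; the class it points to is not
resolved, which mirrors the limitation noted above
regarding links between classes.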
In (Aftab et al., 2020), an automatic process for
transforming a document-oriented NoSQL DB
(MongoDB) into a relational DB has been presented. It
is summarized in three steps: extracting the schema
from the source DB, analyzing it and converting it into
SQL queries according to the format of the target DB
and finally launching ETL processes. The latter
extract the data from the NoSQL DB, process it to
create the SQL queries and then load it into the target