because it employs a wide range of vocabulary from
several languages, to make LD more understandable.
First, LD text is translated into MSA so that
Arabic speakers can understand it.
Second, translation software such as
Google Translate can then be used to translate
the resulting MSA text into another national
language. Many resources, such as a
bilingual dictionary, can be used to translate dialects
into their original languages. Building such
dictionaries plays a crucial role in Natural Language
Processing (NLP) applications not only in machine
translation but also in named entity recognition and
cross-lingual information retrieval. The ultimate
purpose of this work is to
explain a general, language-independent method for creating a bilingual
dictionary for translating dialects into their original
languages. The method also takes into account the
availability of preexisting monolingual dictionaries
for the original languages. In this research, the
creation of a bilingual dictionary for LD-MSA
translation is presented as a case study.
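The dictionary-based first step described above can be pictured with a minimal sketch. It assumes a simple word-for-word lookup, which is only an illustration and not necessarily the method used in this work; the function name `translate_to_msa` and the placeholder entries are hypothetical.

```python
from typing import Dict, List

# Hypothetical toy dictionary: keys stand for dialect words, values for
# their MSA equivalents. Real entries would be Arabic-script strings.
TOY_DIALECT_TO_MSA: Dict[str, str] = {
    "dialect_word_1": "msa_word_1",
    "dialect_word_2": "msa_word_2",
}


def translate_to_msa(tokens: List[str], dictionary: Dict[str, str]) -> List[str]:
    """Replace each dialect token with its dictionary entry.

    Tokens missing from the dictionary are kept unchanged, on the
    assumption that they are already shared with MSA.
    """
    return [dictionary.get(token, token) for token in tokens]


sentence = ["dialect_word_1", "shared_word", "dialect_word_2"]
msa_tokens = translate_to_msa(sentence, TOY_DIALECT_TO_MSA)
print(" ".join(msa_tokens))
```

The resulting MSA text could then be passed to a general-purpose translation system for the second step, which is not shown here.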
2 RELATED WORK
Many researchers have recently become interested in
translating AD into MSA. In their
research, they have employed various approaches to
build the parallel corpora and the bilingual and monolingual
dictionaries that are crucial for building machine
translation systems. This section discusses some
of the most significant studies on creating
dictionaries or corpora for translation between AD
and MSA. Kchaou et al. (2020) published a TD-MSA
parallel corpus, which was collected
from a variety of resources. The Parallel Arabic
DIalectal Corpus (PADIC) is the first resource,
which is a parallel corpus that combines Maghreb
dialects (Algerian, Tunisian, and Moroccan), Levant
dialects (Palestinian and Syrian), and the MSA
(Meftouh et al., 2015). Multi Arabic Dialect
Applications and Resources (MADAR), a TD-MSA
parallel corpus (Bouamor et al., 2018), is the second
resource. They then gathered text from the Tunisian
corpus CONSTitution (TD-CONST), which contains
the Tunisian constitution written in MSA and
translated into the Tunisian dialect. The Tunisian
social media corpus COMments (TD-COM) is
another resource that includes 900 Facebook
comments that were then translated into MSA by a
native speaker. Finally, they created a TD-MSA
bilingual dictionary by aligning the collected parallel
corpora. Starting from two monolingual
morphological dictionaries for TD and MSA,
Sghaier et al. (2020) generated
the necessary resources from scratch and built a
bilingual lexicon that maps TD
words to their MSA equivalents.
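To illustrate how a bilingual dictionary can be derived from sentence-aligned parallel corpora, the sketch below scores word pairs by the Dice coefficient of their sentence-level co-occurrence. This is a deliberately simple stand-in and is not claimed to be the alignment procedure used in the works cited above; the function name `extract_dictionary`, the `min_score` threshold, and the toy sentences are assumptions.

```python
from collections import Counter
from itertools import product
from typing import Dict, List, Tuple


def extract_dictionary(
    pairs: List[Tuple[str, str]], min_score: float = 0.5
) -> Dict[str, str]:
    """Map each source word to its best co-occurring target word.

    pairs: (source sentence, target sentence) tuples, whitespace-tokenized.
    min_score: hypothetical Dice threshold below which a pair is ignored.
    """
    src_count: Counter = Counter()
    tgt_count: Counter = Counter()
    joint: Counter = Counter()
    for src_sent, tgt_sent in pairs:
        src_words = set(src_sent.split())
        tgt_words = set(tgt_sent.split())
        src_count.update(src_words)
        tgt_count.update(tgt_words)
        # Count every (source word, target word) pair seen in one sentence pair.
        joint.update(product(src_words, tgt_words))

    best: Dict[str, Tuple[str, float]] = {}
    for (src, tgt), together in joint.items():
        # Dice coefficient: 2 * joint count / (source count + target count).
        score = 2 * together / (src_count[src] + tgt_count[tgt])
        if score >= min_score and score > best.get(src, ("", 0.0))[1]:
            best[src] = (tgt, score)
    return {src: tgt for src, (tgt, _) in best.items()}


# Toy corpus with placeholder tokens instead of real TD and MSA words.
toy_pairs = [("a b", "x y"), ("a c", "x z"), ("b c", "y z")]
print(extract_dictionary(toy_pairs))
```

Real systems usually rely on statistical word alignment and morphological preprocessing; co-occurrence scoring only conveys the basic idea.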
Salloum et al. (2012) presented
Elissa, a Rule-Based Machine Translation (RBMT)
system that translates a set of
Arabic dialects into MSA using AD-MSA
dictionaries such as the Tharwa dictionary and other
dictionaries they built. Diab et al. (2013) presented
Tharwa, a three-way, large-scale lexicon
that covers Egyptian Arabic, Modern
Standard Arabic, and English. Tharwa is the
first three-way electronic resource for AD that
includes rich and deep linguistic information for
each entry. Egyptian Arabic is the resource's first
pilot dialect, with plans to expand to other
Arabic dialects. The Tharwa entries were gathered from a
variety of sources, both manually and automatically.
Tachicart et al. (2014) introduced a machine
translation system that combines a rule-based approach
and a statistical approach, using tools designed for
Standard Arabic and adapting them to the Moroccan
dialect. To build their bilingual dictionary,
they used the scripts of some television productions
and some MSA dictionaries. The bilingual
dictionary was then extended with
additional online resources to maximize
coverage of the vocabulary of the Moroccan dialect.
Mubarak et al. (2018) presented a
parallel corpus called Dial2MSA, which contains
dialectal Arabic tweets in four main Arabic dialects
(Egyptian, Maghrebi, Levantine, and Gulf) and
their corresponding MSA translations. The tweets
were collected from Twitter and then filtered using a set of
distinctive words for each dialect. The
crowdsourcing platform CrowdFlower was then
used to hire native speakers of each dialect to
translate each tweet into MSA. The final corpus
contains 16,000 Egyptian-MSA pairs, 8,000
Maghrebi-MSA pairs, and 18,000 Gulf-MSA
and Levantine-MSA pairs. In 2022, Roua Torjmen
and Kais Haddar created a bilingual dictionary
from various TD-MSA corpora. The TD-MSA
bilingual dictionary has 4,417 entries and generates
approximately 174,000 forms using derivational
and inflectional grammars (Mubarak et al., 2019).
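The keyword-based filtering of tweets described above for Dial2MSA can be pictured with the following minimal sketch. It assumes that a tweet is assigned to a dialect when it contains at least one word from that dialect's list of distinctive words; the function name `filter_by_dialect`, the word lists, and the tweets are hypothetical placeholders, not data from the cited corpus.

```python
from typing import Dict, List, Set


def filter_by_dialect(
    tweets: List[str], distinctive_words: Dict[str, Set[str]]
) -> Dict[str, List[str]]:
    """Group tweets by dialect using per-dialect distinctive word lists.

    A tweet is kept for a dialect if it shares at least one token with
    that dialect's list; tweets matching no list are dropped.
    """
    selected: Dict[str, List[str]] = {d: [] for d in distinctive_words}
    for tweet in tweets:
        tokens = set(tweet.split())
        for dialect, words in distinctive_words.items():
            if tokens & words:
                selected[dialect].append(tweet)
    return selected


# Placeholder word lists and tweets; real ones would be in Arabic script.
toy_words = {"egyptian": {"egy_marker"}, "gulf": {"glf_marker"}}
toy_tweets = ["hello egy_marker there", "plain text only", "glf_marker again"]
print(filter_by_dialect(toy_tweets, toy_words))
```

Each selected tweet would then be handed to native speakers for translation into MSA, a step that is manual and therefore not modeled here.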