Stacking BERT based Models for Arabic Sentiment Analysis
Hasna Chouikhi (1), Hamza Chniter (2) and Fethi Jarray (1,2)
(1) LIMTIC Laboratory, UTM University, Tunisia
(2) Higher Institute of Computer Science of Medenine, Tunisia
Fethi Jarray: https://orcid.org/0000-0002-5110-1173
Keywords: Arabic Sentiment Analysis, BERT Models, Stacking Approach, Large Scale Dataset.
Abstract: Recently, transformer-based models have shown great success in sentiment analysis and are considered the state of the art for various languages. However, the accuracy of Arabic sentiment analysis still needs improvement. In this work, we propose a stacking architecture for Arabic sentiment analysis that combines different BERT models. We also create a large-scale Arabic sentiment analysis dataset by merging smaller publicly available datasets. The experimental study demonstrates the efficiency of the proposed approach in terms of classification accuracy compared to single-model architectures.
1 INTRODUCTION
Sentiment Analysis (SA) is a Natural Language Processing (NLP) research field that focuses on analyzing people's opinions, sentiments, and emotions. SA techniques are categorized into symbolic and sub-symbolic approaches. The former use lexicons and ontologies (Dragoni et al., 2018) to encode the polarity associated with words and multi-word
expressions. The latter consist of supervised, semi-
supervised and unsupervised machine learning tech-
niques that perform sentiment classification based on
word co-occurrence frequencies. Among all these
techniques, the most popular are based on deep neu-
ral networks. Some hybrid frameworks leverage both
symbolic and sub-symbolic approaches. SA can be
seen as a multi-step process including data retrieval, data extraction, data pre-processing, and feature extraction. Sentiment classification itself comprises three types of tasks: polarity classification, intensity classification, and emotion identification. The first type classifies the text as positive, negative or neutral; the second identifies the polarity degree as very positive, positive, negative or very negative; and the third identifies the emotion, such as sadness, anger or happiness. The Arabic language has a complex nature due to its ambiguity and rich morphological system. This complexity, together with the variety of dialects and the lack of resources, represents a challenge for the
progress of Arabic sentiment analysis research. The
major contributions of our present work are:
- Creating a large-scale Arabic sentiment dataset (LargeASA) for the Arabic sentiment analysis task.
- Proposing an adjusted (hyper-parameter-tuned) ASA-medium BERT model.
- Designing a new stacking approach (also known as stacked generalization) for Arabic sentiment analysis by combining three BERT-based models (Arabic-BERT (Safaya et al., 2020), AraBERT (Antoun et al., 2020) and mBERT (Jacob et al., 2019)).
2 RELATED WORK
The learning-based approaches to ASA can be classified into two categories: classical machine learning approaches and deep learning approaches.
2.1 Classical Machine Learning
Approaches for ASA
Machine learning (ML) methods have broadly been
used for sentiment analysis. ML addresses senti-
ment analysis as a text classification problem. Many
approaches such as support vector machine (SVM),
maximum entropy (ME), naive Bayes (NB) algo-
rithm, and artificial neural networks (ANNs) have
been proposed to handle ASA. NB and SVM are
the most commonly exploited machine learning algorithms for solving the sentiment classification problem (Imran et al., 2018). Al-Rubaiee et al. (2016) performed polarity classification and rating classification using SVM, MNB, and BNB, achieving 90% accuracy for polarity classification and 50% for rating classification.
2.2 Deep Learning Approaches for ASA
Deep learning (DL) has been adopted less widely for Arabic SA than for English SA. (Socher et al., 2013) proposed a recursive neural network approach trained on a constructed sentiment treebank, which improved sentence-level sentiment analysis on English datasets. (Rangel et al., 2019) used a CNN model for SA tasks and the Stanford segmenter to perform tweet tokenization and normalization; they used Word2vec word embeddings with the ASTD dataset.
(Alhumoud et al., 2015a) used an LSTM-CNN model with only two unbalanced classes (positive and negative) among the four classes (objective, subjective positive, subjective negative, and subjective mixed) of ASTD.
Since its appearance in 2018, many pretrained versions of BERT (Devlin et al., 2018) have been proposed for sequence learning tasks such as ASA. The recent trend in sentiment analysis is based on BERT representations. Let us briefly review BERT and the different versions that handle Arabic text. BERT (Bidirectional Encoder Representations from Transformers) is pre-trained by conditioning on both left and right context in all layers, unlike previous language representation models. Applying BERT to an NLP task only requires fine-tuning one additional output layer for the downstream task (see Figure 1).
Figure 1: BERT-based architecture for ASA.
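The following minimal sketch illustrates this fine-tuning setup: a pre-trained BERT encoder with a single added output layer on top of the [CLS] representation. The mBERT checkpoint name, the binary label count, and the example sentence are illustrative assumptions rather than the exact configuration used in this work.

```python
import torch
from torch import nn
from transformers import AutoModel, AutoTokenizer

class BertSentimentClassifier(nn.Module):
    def __init__(self, checkpoint="bert-base-multilingual-cased", num_labels=2):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(checkpoint)
        # The only task-specific addition: one linear output layer on [CLS].
        self.classifier = nn.Linear(self.encoder.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        outputs = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls_repr = outputs.last_hidden_state[:, 0]   # [CLS] token representation
        return self.classifier(cls_repr)             # logits for the sentiment classes

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = BertSentimentClassifier()
batch = tokenizer(["الفيلم رائع"], padding=True, truncation=True, return_tensors="pt")
logits = model(batch["input_ids"], batch["attention_mask"])
```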
The multilingual BERT (mBERT) model (Jacob et al., 2019) is trained on many languages, including Arabic, and serves as a universal language model. (ElJundi et al., 2019) developed an Arabic-specific universal language model (ULM), hULMonA, by fine-tuning the mBERT ULM for ASA; they also collected a benchmark dataset for evaluating ULMs on sentiment analysis. (Safaya et al., 2020) proposed ArabicBERT, a set of pre-trained transformer language models for the Arabic language, and used the base version (bert-base-arabic). (Antoun et al., 2020) created AraBERTv02 based on the BERT model; it was trained on Arabic corpora consisting of internet text and news articles (8.6B tokens). (Lan et al., 2020) introduced GigaBERTv3, a bilingual BERT for English and Arabic pre-trained on large corpora (Gigaword, OSCAR and Wikipedia). (Abdul-Mageed et al., 2020) designed MARBERT and ArBERT, both built on the BERT-base architecture. ArBERT was trained on a collection of Arabic datasets, mostly books and articles written in Modern Standard Arabic (MSA). MARBERT was trained on both Dialectal Arabic (DA) and MSA tweets; it omits the next sentence prediction (NSP) objective because it is trained on short tweets. Additionally, MARBERT and ArBERT were evaluated on the ArSarcasm dataset (Farha and Magdy, 2020). Finally, (Abdelali et al., 2021) trained QARiB (QCRI Arabic and Dialectal BERT) on a collection of Arabic tweets and sentences written in MSA.
3 PROPOSED APPROACHES
In this paper, we build a stacked generalization BERT model by training a meta-learning algorithm to combine the predictions of three BERT models dedicated to the Arabic language (Arabic-BERT, mBERT and AraBERT).
The data pipeline consists of the following steps (a code sketch is given after the list):
- We first apply tokenization to break the input texts into tokens. Figure 2 presents the result of the Arabic-BERT and mBERT tokenizers applied to an example sentence (S). We observe that the Arabic-BERT tokenizer is more appropriate for Arabic because it takes the characteristics of Arabic morphology into account.
- Afterward, we convert each text to BERT's format by adding the special [CLS] token at the beginning of each text and the [SEP] token between sentences and at the end.
Figure 2: Comparison between the Arabic BERT (a) and mBERT (b) tokenizers.
- We map each token to an index based on the pre-trained BERT's vocabulary.
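The snippet below is a minimal sketch of this pipeline using the Hugging Face tokenizer API; the checkpoint name asafaya/bert-medium-arabic and the example sentence are assumptions for illustration.

```python
from transformers import AutoTokenizer

# Assumed checkpoint name for the medium Arabic BERT tokenizer.
tokenizer = AutoTokenizer.from_pretrained("asafaya/bert-medium-arabic")

sentence = "الخدمة ممتازة والفندق نظيف"        # illustrative review text

tokens = tokenizer.tokenize(sentence)          # step 1: split into word pieces
encoded = tokenizer(sentence,                  # steps 2-3: add [CLS]/[SEP] and
                    max_length=128,            # map tokens to vocabulary indices
                    padding="max_length",
                    truncation=True)

print(tokens)
print(encoded["input_ids"][:10])
```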
3.1 ASA based on Arabic BERT Model
(ASA-medium BERT)
In this section, we propose an ASA model based on the Arabic BERT model. As mentioned in the original paper, Arabic BERT is delivered in four versions: bert-mini-arabic, bert-medium-arabic, bert-base-arabic and bert-large-arabic. We applied a grid search strategy to find the best Arabic BERT version with the best hyper-parameters (Chouikhi et al., 2021). Table 1 lists the hyper-parameters of Arabic BERT for ASA obtained after our fine-tuning. We used the AJGT dataset (Alomari et al., 2017) as the test dataset.
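A hedged sketch of this grid search is given below: each Arabic BERT version is fine-tuned with several hyper-parameter settings and the configuration with the best validation accuracy is kept. The checkpoint names and search ranges are illustrative assumptions, and train_and_evaluate is a placeholder standing in for the actual fine-tuning loop.

```python
import itertools

def train_and_evaluate(checkpoint, batch_size, lr, max_length=128, epochs=10):
    # Placeholder: fine-tune `checkpoint` on the training split and return
    # validation accuracy. Replace with the real fine-tuning loop.
    return 0.0

versions = ["asafaya/bert-mini-arabic", "asafaya/bert-medium-arabic",
            "asafaya/bert-base-arabic", "asafaya/bert-large-arabic"]
batch_sizes = [16, 32]
learning_rates = [2e-5, 3e-5, 5e-5]

best = {"accuracy": -1.0}
for ckpt, bs, lr in itertools.product(versions, batch_sizes, learning_rates):
    acc = train_and_evaluate(checkpoint=ckpt, batch_size=bs, lr=lr)
    if acc > best["accuracy"]:
        best = {"checkpoint": ckpt, "batch_size": bs, "lr": lr, "accuracy": acc}

print(best)
```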
Among all the cited works, the approach of (Safaya et al., 2020) is the closest to ours.
Figure 3 depicts the proposed architecture for Arabic SA. Our architecture is composed of three blocks. The first block is the text pre-processing step, where we use an Arabic BERT tokenizer to split words into tokens. The second block is the training model: the Arabic BERT model is used with only 8 encoders (the medium configuration (Safaya et al., 2020)). The outputs of the last four hidden layers are concatenated to obtain a representation of size 512x4x128 with a batch size of 16. The pooled output is concatenated and flattened, then passed through a dense layer and a Softmax function to obtain the final label. The third block is the classifier, where we use a dropout layer for regularization and a fully-connected layer for the output.
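The following minimal sketch, under the assumptions above, illustrates the encoder and classifier blocks: the medium Arabic BERT encoder, the concatenation of its last four hidden states at the [CLS] position, dropout, and a dense layer followed by a Softmax. The checkpoint name is an illustrative assumption.

```python
import torch
from torch import nn
from transformers import AutoModel

class ASAMediumHead(nn.Module):
    def __init__(self, checkpoint="asafaya/bert-medium-arabic", num_labels=2):
        super().__init__()
        # 8-layer medium encoder with hidden size 512; expose all hidden states.
        self.encoder = AutoModel.from_pretrained(checkpoint,
                                                 output_hidden_states=True)
        hidden = self.encoder.config.hidden_size
        self.dropout = nn.Dropout(0.1)
        self.dense = nn.Linear(4 * hidden, num_labels)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        # Concatenate the [CLS] representations of the last four hidden layers.
        last_four = torch.cat([h[:, 0] for h in out.hidden_states[-4:]], dim=-1)
        logits = self.dense(self.dropout(last_four))
        # Softmax gives the final class probabilities (for inference; training
        # would typically apply cross-entropy directly to the logits).
        return torch.softmax(logits, dim=-1)
```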
Table 1: Hyper-parameters of ASA-medium BERT.
Hyper-parameter Value
Batch size 16
Dropout 0.1
Max length 128
Hidden size 512
Learning rate 2e-5
Optimizer AdamW
Epochs 10/20/50
Figure 3: ASA-medium BERT architecture.
Table 1 displays the hyper-parameters of the proposed model. The overall model is trained with the AdamW optimizer. We note that, with hyper-parameter optimization by grid search, we outperform the approach of (Safaya et al., 2020).
Table 2 explains the architectural differences between the ASA-medium BERT, Arabic BERT
Table 2: Architectural differences between ASA-medium BERT, AraBERT and Arabic BERT models.
Batch-size Epochs Layers Activation function
ASA-medium BERT 16 5 8 Softmax
Arabic BERT (Safaya et al., 2020) 16/32 10 12 ReLU
AraBERT (Antoun et al., 2020) 512/128 27 12 Softmax
(Safaya et al., 2020) and AraBERT (Antoun et al., 2020) models. It shows that, with an Arabic tokenizer, the number of encoders in the Arabic BERT model influences the accuracy value.
3.2 Stacking BERT based Model for
ASA
A stacking model is a hierarchical model-ensemble framework in which the predictions generated by various machine learning base models are used as inputs to a meta-model. The objective of the meta-model is to optimally combine the base-model predictions to form a new classifier. In this work we use Medium Arabic BERT (Safaya et al., 2020), AraBERT (Antoun et al., 2020) and mBERT (Jacob et al., 2019) as base models and a fully connected layer as the meta-model (Figure 4).
Figure 4: Stacking BERT based model for ASA.
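A hedged sketch of the stacking idea in Figure 4 is shown below: the class-probability outputs of the (already fine-tuned and frozen) base models are concatenated and fed to a single fully connected layer acting as the meta-model. The shapes and the base_models list are assumptions for illustration.

```python
import torch
from torch import nn

class StackingMetaModel(nn.Module):
    """Fully connected meta-model over concatenated base-model predictions."""

    def __init__(self, num_base_models=3, num_labels=2):
        super().__init__()
        self.meta = nn.Linear(num_base_models * num_labels, num_labels)

    def forward(self, base_probs):
        # base_probs: list of [batch, num_labels] tensors, one per base model.
        stacked = torch.cat(base_probs, dim=-1)
        return self.meta(stacked)

# Usage sketch (base models kept frozen; only the meta-model is trained):
# probs = [m(input_ids, attention_mask).detach() for m in base_models]
# logits = StackingMetaModel()(probs)
```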
In the experiments section, we envisage three stacking scenarios: auto stacking of each BERT model, pairwise stacking, and complete stacking of the three BERT models.
4 EXPERIMENTS AND RESULTS
4.1 Datasets
In this paper we perform experiments on three publicly available datasets, HARD, LABR and AJGT (Table 3), together with their union, LargeASA. All were split into two subsets: 80% for training and 20% for testing (a sketch follows the dataset list below).
- Hotel Arabic Reviews Dataset (HARD) (Elnagar et al., 2018) contains 93,700 reviews. Each review has two parts: positive comments and negative comments. It covers 1,858 hotels and was contributed by 30,889 users.
- Large-scale Arabic Book Reviews (LABR) (Aly and Atiya, 2013) contains over 63,000 book reviews in Arabic.
- Arabic Jordanian General Tweets (AJGT) (Alomari et al., 2017) contains 1,800 tweets annotated as positive or negative.
- Large-scale Arabic Sentiment Analysis (LargeASA): we aggregate the HARD, LABR, and AJGT datasets into a large corpus for ASA. This dataset is publicly available upon request.
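The snippet below is a minimal sketch of how LargeASA can be assembled and split 80/20 into training and test sets. The file names and the label column name are illustrative assumptions; the actual corpus is available from the authors upon request.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Assumed CSV exports of the three source datasets, each with text and label columns.
frames = [pd.read_csv(name) for name in ["hard.csv", "labr.csv", "ajgt.csv"]]
large_asa = pd.concat(frames, ignore_index=True)   # ~158,500 samples in total

train_df, test_df = train_test_split(large_asa,
                                     test_size=0.2,          # 80% train / 20% test
                                     stratify=large_asa["label"],
                                     random_state=42)
```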
Table 3: Statistics of used datasets.
Dataset Samples Labels
LABR 63,000 positive / negative
HARD 93,700 positive / negative
AJGT 1,800 positive / negative
LargeASA 158,500 positive / negative
Table 4 shows how the accuracy varies with the method and dataset used. It shows a close competition between our model (ASA-medium BERT) and that of (Antoun et al., 2020). Our model gives the best result on the LABR, AJGT and ArsenTD-Lev (Baly et al., 2019) datasets, while (Antoun et al., 2020) gives the best result on the ASTD and HARD datasets. We found a slight difference in accuracy between the two works (92.6% compared to 91% on the ASTD dataset (Nabil et al., 2015), and 86.7% compared to 87% on the LABR dataset). However, our model gives a very good result on the ArsenTD-Lev dataset (75% compared to an
Table 4: Comparison between ASA-medium BERT and the previous ASA approaches.
Model \Dataset AJGT LABR HARD ASTD ArsenTD-Lev
CNN(Ghanem et al., 2019) - - - 79% -
LSTM (Shoukry and Rafea, 2012) - 71% - 81% -
LSTM-CNN (Alhumoud et al., 2015b) - - - 81% -
CNN-CROW(Eskander and Rambow, 2015) - - - 72.14% -
DE-CNN-G1(Dahou et al., 2019) 93.06% - - 82.48% -
LR(Harrat et al., 2019) - 84.97% - 87.10% -
GNB(Harrat et al., 2019) - 85% - 86% -
SVM(Aly and Atiya, 2013) - 50% - - -
Arabic-BERT base - - - 71.4% 55.2%
hULMonA (ElJundi et al., 2019) - - 95.7% 69.9% 52.4%
AraBERT 93.8% 86.7% 96.2% 92.6% 59.4%
mBERT 83.6% 83% 95.7% - -
ASA-medium BERT 96.11% 87% 95% 91% 75%
Table 5: Comparison between the base models.
Model \Dataset AJGT LargeASA LABR HARD
ASA-medium BERT 96.11% 90% 87% 95%
mBERT 83.6% 83% 85% 95.7%
AraBERT 93.8% 85% 86.7% 96.2%
Table 6: Auto stacking base models.
Model \Dataset AJGT LargeASA LABR HARD
ASA-medium BERT x2 94% 90% 90% 96%
mBERT x2 83% 86% 87% 95%
AraBERT x2 77% 87% 86% 96%
Table 7: Pairwise Stacking BERT models.
Model \Dataset AJGT LargeASA LABR HARD
ASA-medium BERT + mBERT 94% 90% 90% 96%
ASA-medium BERT + AraBERT 90% 90% 88% 96%
mBERT + AraBERT 78% 88% 88% 95%
accuracy that does not exceed 60% for the other models).
Table 5 shows a comparison between the three base models used in the stacking approach. It shows that Medium Arabic BERT is the most performant and mBERT the least performant. We explore different stacking strategies over these base models to strengthen them.
4.2 Auto Stacking Strategy
We aim to strengthen each base model by stacking it with itself. The numerical results are displayed in Table 6. By cross-comparing Table 5 and Table 6, we conclude that the performance of a model is not improved by auto stacking it with itself. This may be due to the fact that we have only a small number of classes (positive and negative); it may be interesting to check the efficiency of auto stacking for a larger number of classes, such as sentiment analysis with intensity.
4.3 Pairwise Stacking Strategy
In this set of experiments, we stack the base models in a pairwise manner. Table 7 details the results obtained by stacking two different models. It shows that the other BERT-based models can be strengthened by stacking them with ASA-medium BERT.
4.4 Complete Stacking Strategy
Finally, we stack all three base models: Arabic BERT, AraBERT and mBERT (see Table 8). From all the stacking scenarios, we conclude that the best strategy is to auto-stack ASA-medium BERT with itself.
Table 8: Complete Stacking BERT models.
Model \Dataset AJGT LargeASA LABR HARD
ASA-medium BERT + mBERT + AraBERT 93% 91% 88% 95%
5 CONCLUSION
In this paper, we proposed a stacked generalization approach for Arabic sentiment analysis. We used Medium Arabic BERT, AraBERT and mBERT as base models. First, we showed that a single Arabic medium BERT model outperforms the state of the art for ASA. Second, the experimental results showed that the stacking strategy improves the accuracy. As a continuation of this work, we plan to generalize our results to sentiment analysis with intensities.
REFERENCES
Abdelali, A., Hassan, S., Mubarak, H., Darwish, K.,
and Samih, Y. (2021). Pre-training bert on ara-
bic tweets: Practical considerations. arXiv preprint
arXiv:2102.10684.
Abdul-Mageed, M., Elmadany, A., and Nagoudi, E. M. B.
(2020). Arbert & marbert: Deep bidirectional trans-
formers for arabic. arXiv preprint arXiv:2101.01785.
Al-Rubaiee, H., Qiu, R., and Li, D. (2016). Identifying
mubasher software products through sentiment analy-
sis of arabic tweets. In 2016 International Conference
on Industrial Informatics and Computer Systems (CI-
ICS), pages 1–6. IEEE.
Alhumoud, S., Albuhairi, T., and Alohaideb, W. (2015a).
Hybrid sentiment analyser for arabic tweets using r.
In 2015 7th International Joint Conference on Knowl-
edge Discovery, Knowledge Engineering and Knowl-
edge Management (IC3K), volume 1, pages 417–424.
IEEE.
Alhumoud, S., Albuhairi, T., and Alohaideb, W. (2015b).
Hybrid sentiment analyser for arabic tweets using r.
In 2015 7th International Joint Conference on Knowl-
edge Discovery, Knowledge Engineering and Knowl-
edge Management (IC3K), volume 1, pages 417–424.
IEEE.
Alomari, K. M., ElSherif, H. M., and Shaalan, K. (2017).
Arabic tweets sentimental analysis using machine
learning. In International Conference on Industrial,
Engineering and Other Applications of Applied Intel-
ligent Systems, pages 602–610. Springer.
Aly, M. and Atiya, A. (2013). Labr: A large scale ara-
bic book reviews dataset. In Proceedings of the 51st
Annual Meeting of the Association for Computational
Linguistics (Volume 2: Short Papers), pages 494–498.
Antoun, W., Baly, F., and Hajj, H. (2020). Arabert:
Transformer-based model for arabic language under-
standing. arXiv preprint arXiv:2003.00104.
Baly, R., Khaddaj, A., Hajj, H., El-Hajj, W., and Sha-
ban, K. B. (2019). Arsentd-lev: A multi-topic corpus
for target-based sentiment analysis in arabic levantine
tweets. arXiv preprint arXiv:1906.01830.
Chouikhi, H., Chniter, H., and Jarray, F. (2021). Arabic
sentiment analysis using bert model. In 13th Interna-
tional Conference on Computational Collective Intel-
ligence (ICCCI). Springer, Cham.
Dahou, A., Elaziz, M. A., Zhou, J., and Xiong, S. (2019).
Arabic sentiment classification using convolutional
neural network and differential evolution algorithm.
Computational intelligence and neuroscience, 2019.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K.
(2018). Bert: Pre-training of deep bidirectional trans-
formers for language understanding. arXiv preprint
arXiv:1810.04805.
Dragoni, M., Poria, S., and Cambria, E. (2018). Ontosentic-
net: A commonsense ontology for sentiment analysis.
IEEE Intelligent Systems, 33(3):77–85.
ElJundi, O., Antoun, W., El Droubi, N., Hajj, H., El-Hajj,
W., and Shaban, K. (2019). hulmona: The univer-
sal language model in arabic. In Proceedings of the
Fourth Arabic Natural Language Processing Work-
shop, pages 68–77.
Elnagar, A., Khalifa, Y. S., and Einea, A. (2018). Ho-
tel arabic-reviews dataset construction for sentiment
analysis applications. In Intelligent natural language
processing: Trends and applications, pages 35–52.
Springer.
Eskander, R. and Rambow, O. (2015). Slsa: A sentiment
lexicon for standard arabic. In Proceedings of the
2015 conference on empirical methods in natural lan-
guage processing, pages 2545–2550.
Farha, I. A. and Magdy, W. (2020). From arabic sentiment
analysis to sarcasm detection: The arsarcasm dataset.
In Proceedings of the 4th Workshop on Open-Source
Arabic Corpora and Processing Tools, with a Shared
Task on Offensive Language Detection, pages 32–39.
Ghanem, B., Karoui, J., Benamara, F., Moriceau, V., and
Rosso, P. (2019). Idat at fire2019: Overview of the
track on irony detection in arabic tweets. In Proceed-
ings of the 11th Forum for Information Retrieval Eval-
uation, pages 10–13.
Harrat, S., Meftouh, K., and Smaili, K. (2019). Machine
translation for arabic dialects (survey). Information
Processing & Management, 56(2):262–273.
Imran, A., Faiyaz, M., and Akhtar, F. (2018). An enhanced
approach for quantitative prediction of personality in
facebook posts. International Journal of Education
and Management Engineering (IJEME), 8(2):8–19.
Jacob, D., Ming-Wei, C., Kenton, L., and Toutanova, K.
(2019). Bert: Pre-training of deep bidirectional trans-
formers for language understanding. In Proceedings
of the 2019 Conference of the North American Chap-
ter of the Association for Computational Linguistics:
Human Language Technologies, Minneapolis, Min-
nesota, June. Association for Computational Linguis-
tics, 1:4171–4186.
Lan, W., Chen, Y., Xu, W., and Ritter, A. (2020). An empir-
ical study of pre-trained transformers for arabic infor-
mation extraction. In Proceedings of the 2020 Con-
ference on Empirical Methods in Natural Language
Processing (EMNLP), pages 4727–4734.
Nabil, M., Aly, M., and Atiya, A. (2015). Astd: Arabic
sentiment tweets dataset. In Proceedings of the 2015
conference on empirical methods in natural language
processing, pages 2515–2519.
Rangel, F., Rosso, P., Charfi, A., Zaghouani, W., Ghanem,
B., and Sánchez-Junquera, J. (2019). Overview of the
track on author profiling and deception detection in
arabic. In Working Notes of the Forum for Information
Retrieval Evaluation (FIRE 2019). CEUR Workshop
Proceedings. In: CEUR-WS. org, Kolkata, India.
Safaya, A., Abdullatif, M., and Yuret, D. (2020). KUISAIL
at SemEval-2020 task 12: BERT-CNN for offensive
speech identification in social media. In Proceedings
of the Fourteenth Workshop on Semantic Evaluation,
pages 2054–2059, Barcelona (online). International
Committee for Computational Linguistics.
Shoukry, A. and Rafea, A. (2012). Sentence-level arabic
sentiment analysis. In 2012 International Conference
on Collaboration Technologies and Systems (CTS),
pages 546–550. IEEE.
Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning,
C. D., Ng, A. Y., and Potts, C. (2013). Recursive
deep models for semantic compositionality over a sen-
timent treebank. In Proceedings of the 2013 confer-
ence on empirical methods in natural language pro-
cessing, pages 1631–1642.