Fine-grained Detection of Academic Emotions with Spatial Temporal
Graph Attention Networks using Facial Landmarks
Hua Leong Fwa
School of Computing and Information Systems, Singapore Management University, Singapore
https://orcid.org/0000-0002-4472-2481
Keywords:
Spatial, Temporal, Affective States, Facial Landmarks, Graph Attention Network, Gated Recurrent Unit.
Abstract:
With the onset of the Covid-19 pandemic, institutions have adopted online learning as the main lesson
delivery channel. A common criticism of online learning is that sensing of learners' affective states, such as
engagement, is lacking, which degrades the quality of teaching. In this study, we propose automatic sensing of
learners' affective states in an online setting, with web cameras capturing their facial landmarks and head poses.
We postulate that the sparsely connected facial landmarks can be modelled using a Graph Neural Network.
Using the publicly available in-the-wild DAiSEE dataset, we modelled both the spatial and temporal dimensions
of the facial videos with a deep learning architecture consisting of Graph Attention Networks and Gated
Recurrent Units. An ablation study confirmed that the differencing of consecutive frames of facial landmarks
and the addition of head poses enhance the detection performance. The results further demonstrated that the
model performs well in comparison with other models and, more importantly, is suited for implementation
on mobile devices given its low computational requirements.
1 INTRODUCTION
The Covid-19 pandemic has transformed the way
education is delivered globally. With face-to-face
lessons curtailed, educational institutions have little
choice but to transition to online lessons with the help
of video conferencing and related technologies. Many
educators and learners have, however, critiqued that
online lessons are impoverished in affective and
social interaction and thus cannot completely
replace traditional face-to-face delivery.
In a face-to-face classroom setting, teachers con-
tinuously monitor the emotional state of their learners
to gauge their level of understanding and engagement.
Contingent on the learners' cognitive state, the teachers can then adapt their tutoring style dynamically, for example by slowing the pace of delivery or redirecting the learners' attention to the content material, to optimize learning. In an online lesson, on the other hand, learners hide behind the veil of a virtual screen. This is further exacerbated by the large ratio of online learners to a single teacher in a typical online session, which constrains each individual learner's video to a small window on the teacher's console, making it difficult for the teacher to infer their learners' affective state. To enhance learning in online
lesson delivery, it is thus essential to augment online
lessons with automatic sensing of learners’ affective
or cognitive state.
Previous studies on emotion sensing have tapped modalities such as audio (Sezgin et al., 2012; Guhan et al., 2020), body postures (D'Mello and Graesser, 2009), facial images and videos (Whitehill et al., 2008; Grafsgaard et al., 2013), as well as physiological signals such as ECG (Belle et al., 2012). The prevalence of web cameras and their use in various virtual learning platforms has resulted in their extensive use for assessing learners' facial emotions. It is thus no coincidence that the majority of recent works on automatic sensing of learners' emotions leverage facial images and videos captured using web cameras. The affordability, unobtrusiveness and ubiquity of web cameras compared to other sensors further add to their appeal for emotion sensing.
Studies have shown that basic emotions such as
happiness and sadness are not relevant to the learn-
ing process (Pekrun et al., 2002, 2007). Academic
emotions or affect such as frustration, boredom, con-
fusion and engagement are the ones which influence
the learning process. The ability of expert human tu-
tors to achieve enhanced learning outcomes has been
widely attributed to their ability to sense the affec-
tive states of the learners and to continually adapt
their tutoring strategies in response to the dynamically
changing affective states throughout the tutoring ses-
sion. By sustaining incidences of positive affect, e.g. engagement, and suppressing incidences of negative affect, e.g. boredom, the tutor (which can be human or computer) can enhance the learner's interest and motivation in the learning process.
Facial emotion recognition can be broadly cate-
gorized into static and dynamic facial emotion recog-
nition. Static facial emotion recognition techniques
use static facial images as inputs while dynamic facial
emotion recognition techniques use videos as inputs.
Static techniques model only the spatial dimension, whereas dynamic techniques model both the spatial and temporal dimensions. By including the temporal dimension, dynamic facial emotion recognition techniques are known to offer higher detection accuracies than static techniques.
In this paper, we focus on the detection of aca-
demic emotions as they are more relevant in a class-
room or online teaching scenario. The detection of academic emotions is, however, challenging, as they are more subtle than the basic emotions; hence the significance of this research. In an online learning context, with the successful detection of a learner's academic emotional state, e.g. when a learner is disengaged, bored or frustrated, a human or computer tutor can enact appropriate pedagogical interventions to avert detrimental effects such as the learner giving up on the learning altogether.
Changes in micro facial expressions are correlated with facial emotions, and the detection and monitoring of facial landmark movements is one of the established methods for tracking facial expressions. The network of facial landmark points forms a mesh, which makes it amenable to representation as a graph in which each landmark point is spatially related to other landmark points. This motivates our study on modelling facial landmark changes with a Graph Attention Network (GAT) (Veličković et al., 2018) using the Dataset for Affective States in E-Environment (DAiSEE) (Gupta et al., 2016).
In short, our contributions can be summarized as follows:
• We proposed a novel spatial temporal GAT-based model to detect fast-changing intensity levels of academic emotions, namely frustration, boredom, engagement and confusion.
• We captured facial landmark points and head poses as inputs, as opposed to facial images, resulting in a computationally lighter academic emotion detection model that is better suited for deployment on mobile devices.
• Our empirical results showed that the proposed graph-based model outperforms Convolutional Neural Network (CNN) based models such as InceptionNet.
2 RELATED STUDIES
Prior studies have shown a tight coupling between
head pose and engagement. A study by Asteriadis et al. (2009) documented the use of head, eye and hand movements to gauge children's level of interest and attention in the context of reading an electronic document. Gaze Tutor (D'Mello et al., 2012), an intelligent tutoring system, used a commercial eye tracker to identify learners' engagement from their eye gaze patterns and attempted to re-engage the learner with dialogue moves. Tracking eye gaze captures instances of inattention in which the head is stationary while the eyes focus somewhere off-screen, but tracking head pose remains relevant in the numerous instances where the user turns their head away from the screen.
A CNN architecture was proposed in a study by
Murshed et al. (2019) to classify engagement levels
for the DAiSEE dataset. The proposed CNN archi-
tecture is then compared against three other popular
CNN architectures after re-categorizing the original
four levels of engagement into three levels: not engaged, normally engaged and highly engaged. The authors merged the labels of bored, frustrated and confused into the non-engaged category. The proposed model achieved an average accuracy of 92.33% on the DAiSEE dataset. The authors did not model the temporal aspect but instead aggregated the facial frames into sliding windows of width 5.
The Dataset for Affective States in E-Environment
(DAiSEE) is the first multi-label academic emotions
video classification dataset, collected in the wild by Gupta et al. (2016). It consists of 9068 video snippets from 112 users, annotated with the affective states and intensity levels of boredom, frustration, confu-
sion and engagement. The authors used Inception-
Net, C3D and Long-term Recurrent Convolution Net-
work (LRCN) deep learning models to classify the af-
fective states and found the LRCN model to be the
best performing model (with reported accuracies of
53.7%, 57.9%, 72.3% and 73.5% for boredom, en-
gagement, confusion and frustration). The results of
this study confirmed that temporal classifiers tend to
outperform static classifiers for the classification of
affective states related to learning.
Liao et al. (2021) proposed the Deep Facial Spa-
tiotemporal Network (DFSTN), comprising a pre-trained SE-ResNet-50 for extracting facial spatial features, which are passed into an LSTM network with a global attention module to generate an attentional hidden state. With the modelling of both spatial and
temporal features, the authors managed to attain an
accuracy of 58.84% on the four levels of engagement
as labelled in the DAiSEE dataset.
A combination of ResNet and Temporal Convolu-
tional Network (TCN) hybrid deep learning architec-
ture was used in a study by Abedi and Khan (2021)
for engagement detection in the DAiSEE videos. The
pre-trained 2D ResNet extracts the spatial features
while the TCN analyses the temporal changes in the
video frames. A weighted cross-entropy loss function is employed to tackle the imbalanced dataset (a low proportion of lower levels of engagement compared to the higher levels). An accuracy of 63.9% is achieved on the four levels of engagement of the DAiSEE dataset. Note that these two studies focus only on the detection of engagement and not the other academic emotions.
The study by Ngoc et al. (2020) inspired us to model facial landmarks with a graph attention network
for the detection of academic emotions. In their study,
the authors constructed a graph representation of fa-
cial landmarks using Graph Neural Network (GNN)
and employed a Gated Linear Unit (GLU) as the tem-
poral block to predict seven basic emotions, namely
anger, contempt, disgust, fear, happiness, sadness,
and surprise. The authors argue that facial landmark points have sparse relations with spatially adjacent landmark points and, in addition, a continuous relation along the temporal axis of a facial video, which makes them amenable to being modelled with a temporal graph network. The proposed network achieved accuracies of 96.02%, 69.64% and 32.64% on the CK+, MMI and AFEW datasets (for classification of basic emotions, not academic emotions) respectively. This compares well with other state-of-the-art benchmarks and justifies the use of graph neural networks to model facial landmark points for facial emotion recognition.
Inspired by this work and the success of attention mechanisms in previous studies, we employed a GAT for modelling the spatial features of
facial landmarks. The previous study (Liao et al.,
2021) used a pre-trained SE-Resnet-50 for extracting
facial spatial features from facial images. In contrast,
for our study, we used a GAT, which attends over the neighbours of each facial landmark node in a graph structure to compute a hidden representation for each individual frame of facial landmarks. The temporal
dimensions of hidden representations of facial land-
marks and head poses are then modelled with the
Gated Recurrent Unit (Cho et al., 2014). We postu-
late that the inclusion of head pose will improve the
performance of the detection of academic emotions.
We also intend to incorporate this affect sensing mod-
ule in a mobile-based intelligent tutoring system. The computational complexity of the model is integral to implementation on mobile devices, which typically have lower processing power and storage.
3 METHODOLOGY
This section describes the processing pipeline and
proposed model architecture, which uses both a GAT and a Gated Recurrent Unit (GRU), as shown in Fig. 1.
3.1 Facial Landmarks and Head Pose
Extraction
This study uses the DAiSEE dataset. Each DAiSEE video is 10 seconds in duration and is captured at a frame rate of 30 frames per second. The videos were first split into individual frames, resulting in a total of 300 frames (static images) per video. We next used the Multi-Task Cascaded Convolutional Neural Network (MTCNN) (Zhang et al., 2016) algorithm to crop out the facial images. MTCNN is employed here as it balances speed and performance compared to other algorithms.
The cropped facial images were resized to 112 by
112 pixels before being passed to the Practical Facial
Landmarks Detector (PFLD) (Guo et al., 2019) for
extraction of the facial landmarks. PFLD is used here
as it achieves good performance across challenging
datasets with low computational requirements. Out of the 98 facial landmark points extracted by the pre-trained PFLD model, we excluded the points that outline the outer shape of the face as they are not correlated with affective states (Ngoc et al., 2020). In total, 65 facial landmark points outlining the brows, eyes, nose and mouth were used. The head pose angles of roll, pitch and yaw were also extracted from the PFLD model.
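As a concrete illustration of this per-frame preprocessing step, the sketch below assumes the facenet-pytorch implementation of MTCNN and a pre-trained PFLD model wrapped as a PyTorch module; `pfld_model`, `kept_idx` (the indices of the 65 retained landmark points) and the exact output shapes are illustrative assumptions rather than the authors' code.

```python
import torch
from PIL import Image
from facenet_pytorch import MTCNN

mtcnn = MTCNN(image_size=112)  # crop and resize detected faces to 112 x 112

def extract_landmarks_and_pose(frame_path, pfld_model, kept_idx):
    """Return (65, 2) landmarks and (3,) head pose for one frame, or None."""
    face = mtcnn(Image.open(frame_path))   # (3, 112, 112) tensor, or None if no face
    if face is None:
        return None                        # no face detected in this frame
    with torch.no_grad():
        # Assumed PFLD wrapper output: landmarks (1, 98, 2) and pose (1, 3) = roll, pitch, yaw.
        landmarks, pose = pfld_model(face.unsqueeze(0))
    landmarks = landmarks[0, kept_idx]     # keep the 65 brow/eye/nose/mouth points
    return landmarks.numpy(), pose[0].numpy()
```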
Figure 1: Modelling workflow and architecture.
We extracted facial landmark points from each 10-second video; at 30 frames per second, this equates to 300 frames of facial landmarks. Only 30 frames (1 frame picked at a fixed interval of every 10 frames) were used as input into the model. For videos in which fewer than 30 frames of facial landmarks could be extracted, we interpolated additional frames between consecutive frames to make up 30 frames; videos with fewer than 15 frames of extracted facial landmarks were dropped. We postulate that the rate of movement of each facial feature contributes to the tracking of emotions. Thus, across consecutive frames, we calculated the difference of the vertex coordinates for each facial vertex, which represents the rate of change of the facial features, and fed this rate of change into the model instead of the raw facial landmark coordinates.
The facial landmarks are formulated into the shape (N x T x V x 2), where N denotes the batch size, T denotes the temporal width of 29 frames, V denotes the facial landmark vertices, and the last dimension of 2 stores the x and y coordinates of each facial landmark point. We also formulated the head poses into features of shape (N x T x H), where N denotes the batch size and T denotes the temporal width. The last dimension H = 3 stores the roll, pitch and yaw values.
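A minimal sketch of the sampling, differencing and shaping described above is given below, assuming the per-video landmark and head-pose arrays have already been extracted; the function name and the alignment of head poses with the 29 landmark differences are illustrative assumptions.

```python
import numpy as np

def build_video_features(landmarks, head_pose, step=10):
    """landmarks: (300, 65, 2) per-frame coordinates; head_pose: (300, 3).

    Returns landmark differences of shape (29, 65, 2) and head poses of
    shape (29, 3) for a single video.
    """
    sampled_lm = landmarks[::step]          # (30, 65, 2): one frame every 10 frames
    sampled_hp = head_pose[::step]          # (30, 3)
    lm_diff = np.diff(sampled_lm, axis=0)   # (29, 65, 2): rate of change per vertex
    return lm_diff, sampled_hp[1:]          # (29, 3): pose aligned with each difference

# Stacking these over a batch of N videos yields tensors of shape
# (N, 29, 65, 2) and (N, 29, 3), matching the (N x T x V x 2) and
# (N x T x H) formulations above.
```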
Across the 9068 DAiSEE videos, there were 144 videos from which facial landmarks could not be extracted (no face could be detected in the video). Another 47, 49 and 17 videos were excluded from the train, validation and test sets respectively as fewer than 15 frames of facial landmarks could be extracted. In total, only 2.8% of the entire dataset of videos was discarded.
3.2 Graph Construction
Graphs are data structures that allow us to model a
set of objects (vertices) and the relationships between
them (edges) (Zhou et al., 2020). To transform the
facial landmarks into a graph, each facial landmark
point constitutes a vertex on our graph and the edges
are formed with the Delaunay method. The Delau-
nay triangulation method (Lee and Schachter, 1980)
has been used for construction of the graph struc-
ture linking the vertices in a number of previous stud-
ies (Fabian Benitez-Quiroz et al., 2016; Ngoc et al.,
2020) that used facial landmarks for the detection of
facial action units.
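For illustration, the snippet below builds such an edge list with SciPy's Delaunay triangulation over a reference frame of the 65 landmark coordinates; this is a sketch of the construction, not the authors' exact implementation.

```python
import numpy as np
from scipy.spatial import Delaunay

def delaunay_edges(points):
    """points: (65, 2) landmark coordinates from a reference frame.

    Returns a (2, E) array of undirected edges derived from the Delaunay
    triangulation, suitable for building a graph in DGL or PyTorch Geometric.
    """
    tri = Delaunay(points)
    edges = set()
    for a, b, c in tri.simplices:                 # each simplex is a triangle of vertex indices
        for u, v in ((a, b), (b, c), (a, c)):
            edges.add((min(u, v), max(u, v)))     # store each undirected edge once
    src, dst = zip(*sorted(edges))
    return np.array([src, dst])
```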
3.3 Spatial Temporal Graph Attention
Network (STGAT)
The STGAT model is illustrated in Fig.1. The input
to each GAT layer is a set of node features,
$h = \{\vec{h}_1, \vec{h}_2, \ldots, \vec{h}_N\}, \quad \vec{h}_i \in \mathbb{R}^F$  (1)
where N = 65, denoting the number of facial land-
marks and F = 2 as each node feature consists of the
x and y coordinates of the facial landmark point.
Within each GAT layer, we apply a linear trans-
formation to encode the input features into embedded
features.
$z_i^{(l)} = W^{(l)} \vec{h}_i^{(l)}$  (2)
Next, a pair-wise unnormalized attention score is computed between two neighbours using the equation below, with dropout applied during training. The embedded features of the two nodes are concatenated and their dot product is taken with a learnable weight vector $\vec{a}$, before a LeakyReLU is applied to the result.
$e_{ij}^{(l)} = \mathrm{LeakyReLU}\!\left(\vec{a}^{(l)T}\left[z_i^{(l)} \,\|\, z_j^{(l)}\right]\right)$  (3)
The next equation applies a softmax to normalize the
attention scores on each node’s incoming edges.
$\alpha_{ij}^{(l)} = \frac{\exp\left(e_{ij}^{(l)}\right)}{\sum_{k \in \mathcal{N}(i)} \exp\left(e_{ik}^{(l)}\right)}$  (4)
Finally, the embeddings from neighbours are aggre-
gated together and scaled by the attention scores.
$h_i^{(l+1)} = \sigma\!\left(\sum_{j \in \mathcal{N}(i)} \alpha_{ij}^{(l)} z_j^{(l)}\right)$  (5)
We use the Exponential Linear Unit (ELU) as the activation function σ. W is a learnable weight matrix and $\vec{a}$ a learnable weight vector, l denotes the layer, and $\mathcal{N}(i)$ denotes the set of one-hop neighbours of node i.
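To make Eqs. (2)-(5) concrete, below is a minimal single-head GAT layer written in plain PyTorch. It is a sketch of the mechanism only: the actual model uses a multi-head configuration (see Table 1) and could equally be built with an off-the-shelf GAT layer from a graph library.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GATLayer(nn.Module):
    """Single-head graph attention layer mirroring Eqs. (2)-(5)."""

    def __init__(self, in_dim, out_dim, dropout=0.5):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)    # linear transform, Eq. (2)
        self.attn = nn.Linear(2 * out_dim, 1, bias=False)  # learnable vector a, Eq. (3)
        self.dropout = nn.Dropout(dropout)

    def forward(self, h, adj):
        # h: (V, F) node features; adj: (V, V) adjacency with 1 where an edge
        # exists (self-loops included so every node attends to itself).
        z = self.W(h)                                            # (V, out_dim)
        v = z.size(0)
        zi = z.unsqueeze(1).expand(v, v, -1)                     # features of node i
        zj = z.unsqueeze(0).expand(v, v, -1)                     # features of node j
        e = F.leaky_relu(self.attn(torch.cat([zi, zj], -1)))     # Eq. (3), (V, V, 1)
        e = e.squeeze(-1).masked_fill(adj == 0, float("-inf"))   # attend only over neighbours
        alpha = self.dropout(torch.softmax(e, dim=-1))           # Eq. (4)
        return F.elu(alpha @ z)                                  # Eq. (5), with ELU as sigma
```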
A max pooling operation then selects the maximum value across the facial landmark vertices to generate an embedding for the entire facial graph. The facial graph embedding is then concatenated with the head pose embedding (obtained by passing the head pose through a fully connected layer) before being passed to the GRU to model the temporal features. The GRU resolves the vanishing gradient problem typically associated with a standard recurrent neural network. The output from the last GRU step is passed through a Batch Normalization (BN) layer and an ELU activation layer, and then through two fully connected (FC, or linear), BN and ELU layers (FC1 and FC2) to produce the predicted scores for the four levels of boredom, engagement, frustration and confusion.
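A sketch of this temporal head is shown below, consuming per-frame graph embeddings (e.g. from a GAT layer such as the sketch above after max pooling over vertices). The layer sizes loosely follow Table 1 and the fusion order follows the description above, but the exact wiring, default dimensions and the single four-level output head are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class STGATTemporalHead(nn.Module):
    """Fuse per-frame graph embeddings with head pose, run a GRU over time,
    then BN/ELU and two FC blocks to produce the level scores."""

    def __init__(self, graph_dim=48, pose_dim=3, pose_emb=112,
                 gru_hidden=112, n_levels=4):
        super().__init__()
        self.pose_fc = nn.Linear(pose_dim, pose_emb)        # head-pose embedding
        self.gru = nn.GRU(graph_dim + pose_emb, gru_hidden, batch_first=True)
        self.bn = nn.BatchNorm1d(gru_hidden)
        self.fc1 = nn.Sequential(nn.Linear(gru_hidden, 128),
                                 nn.BatchNorm1d(128), nn.ELU())
        self.fc2 = nn.Linear(128, n_levels)                 # 4 intensity levels

    def forward(self, graph_emb, head_pose):
        # graph_emb: (N, T, graph_dim) max-pooled GAT embeddings per frame
        # head_pose: (N, T, pose_dim) roll, pitch, yaw per frame
        x = torch.cat([graph_emb, self.pose_fc(head_pose)], dim=-1)
        _, h_last = self.gru(x)                             # (1, N, gru_hidden)
        h = F.elu(self.bn(h_last.squeeze(0)))
        return self.fc2(self.fc1(h))                        # (N, n_levels) scores
```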
4 MODELLING
Using the Optuna library (Akiba et al., 2019) for hyperparameter tuning, we trained the STGAT on the train dataset and optimized the model for the best set of hyperparameters using the validation dataset. STGAT was trained with the Root Mean Squared Propagation (RMSProp) (Tieleman et al., 2012) optimizer for a maximum of 40 epochs. An early stopping criterion halted a trial if the loss on the validation set did not decrease for 4 consecutive epochs.
We ran a total of 50 Optuna trials and saved the model which gave the lowest loss for each trial, together with its corresponding set of hyperparameter values. The hyperparameters include the learning rate, the hidden layer dimensions of the GATs, GRU and FCs, and the dropout rate of the dropout layers. A batch size of 64 and a learning rate of $1 \times 10^{-3}$ (selected through the Optuna trials) were used. Table 1 shows the final configuration of the model used to generate the accuracies on the test dataset.
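The tuning loop could look roughly like the sketch below; `build_stgat`, `train_one_epoch` and `evaluate` are hypothetical helpers, and the search ranges shown are assumptions rather than the values actually used.

```python
import optuna
import torch

def objective(trial):
    # Hyperparameters searched; the exact ranges are assumptions.
    lr = trial.suggest_float("lr", 1e-4, 1e-2, log=True)
    hidden = trial.suggest_categorical("gru_hidden", [64, 112, 128])
    dropout = trial.suggest_float("dropout", 0.2, 0.6)

    model = build_stgat(gru_hidden=hidden, dropout=dropout)    # hypothetical builder
    optimizer = torch.optim.RMSprop(model.parameters(), lr=lr)

    best_val, patience = float("inf"), 0
    for epoch in range(40):                                    # maximum of 40 epochs
        train_one_epoch(model, optimizer)                      # hypothetical helper
        val_loss = evaluate(model)                             # hypothetical helper
        if val_loss < best_val:
            best_val, patience = val_loss, 0
            torch.save(model.state_dict(), f"trial_{trial.number}.pt")
        else:
            patience += 1
            if patience >= 4:                                  # early stopping after 4 epochs
                break
    return best_val

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=50)                         # 50 Optuna trials
```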
The overall accuracy was calculated by averaging the detection accuracies across the four states of boredom, frustration, engagement and confusion. The model with the best overall accuracy among the trials was selected for the evaluation of performance on the test dataset.
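As a small illustration of this selection criterion (the array shapes are assumptions):

```python
import numpy as np

EMOTIONS = ["boredom", "engagement", "confusion", "frustration"]

def per_emotion_and_overall_accuracy(preds, labels):
    """preds, labels: (N, 4) integer intensity levels, one column per emotion.

    Returns per-emotion accuracies and their unweighted mean, the overall
    accuracy used for model selection above.
    """
    acc = (preds == labels).mean(axis=0)          # accuracy per emotion column
    return dict(zip(EMOTIONS, acc)), float(acc.mean())
```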
4.1 Loss
We used a combination of the softmax and center loss
(Wen et al., 2016) as the final loss function. We noted
that with the softmax loss alone, the proposed model was not able to discriminate between the four levels of academic emotions.
Table 1: STGAT model configuration.
Configuration No. of heads Dropout Input dim. Output dim.
Multi-head GAT 6 0.5 2 80
GAT 1 0.5 480 48
GRU - - 48 112
FC - - 3 112
FC1 - - 224 128
FC2 - - 128 4
Table 2: Accuracies of STGAT face only, STGAT raw landmark models versus STGAT model.
Models Boredom Engagement Confusion Frustration
STGAT face only 41.8% 49.6% 67.5% 78.0%
STGAT raw landmark 40.0% 49.6% 67.5% 78.0%
STGAT 46.2% 49.6% 67.5% 78.0%
With the center loss, the model must learn a center for the deep features of each class and minimize the distance between the features and their class center, resulting in better discrimination. The softmax loss function is defined as
$L_s = -\sum_{i=1}^{m} \log \frac{\exp\left(W_{y_i}^{T} x_i + b_{y_i}\right)}{\sum_{j=1}^{n} \exp\left(W_j^{T} x_i + b_j\right)}$  (6)
where $x_i \in \mathbb{R}^d$ denotes the $i$-th deep feature, belonging to the $y_i$-th class, $W_j \in \mathbb{R}^d$ denotes the $j$-th column of the weight matrix $W \in \mathbb{R}^{d \times n}$, and $b \in \mathbb{R}^n$ is the bias term.
The center loss function is denoted by
$L_c = \frac{1}{2} \sum_{i=1}^{m} \left\| x_i - c_{y_i} \right\|_2^2$  (7)
where $c_{y_i}$ is the class center of the deep features of class $y_i$. The final loss function is then denoted by
$L = L_s + \lambda L_c$  (8)
where λ (which is set to 0.2 in our experiment) is used
for balancing the softmax and center loss functions.
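A compact sketch of this combined loss is given below; the feature dimension, the center initialization and the use of a mean rather than a sum over the batch are implementation assumptions.

```python
import torch
import torch.nn as nn

class SoftmaxCenterLoss(nn.Module):
    """Combined loss L = L_s + lambda * L_c of Eqs. (6)-(8)."""

    def __init__(self, num_classes=4, feat_dim=128, lam=0.2):
        super().__init__()
        self.lam = lam
        self.ce = nn.CrossEntropyLoss()                         # softmax loss, Eq. (6)
        self.centers = nn.Parameter(torch.randn(num_classes, feat_dim))

    def forward(self, logits, features, labels):
        # logits: (N, num_classes), features: (N, feat_dim), labels: (N,) int64
        l_s = self.ce(logits, labels)
        diff = features - self.centers[labels]                  # distance to class centers
        l_c = 0.5 * diff.pow(2).sum(dim=1).mean()               # center loss, Eq. (7)
        return l_s + self.lam * l_c                             # Eq. (8), lambda = 0.2
```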
4.2 Ablation Study
Other than facial landmarks, head pose was also acquired from the PFLD model. To investigate whether the addition of head pose improves the performance of our model, we constructed a model that uses only facial landmarks (referred to as the STGAT face only model), which is similar to the STGAT model outlined in Fig. 1 but without the head pose input to the GRU layer.
To verify that the differencing of facial land-
mark coordinates across consecutive facial landmark
frames enhances the performance of STGAT, we compared it against another model (referred to as STGAT raw landmark) that takes the raw facial landmark coordinates and head poses as input.
We followed the same workflow described in the Methodology section for both the STGAT face only and STGAT raw landmark models: they were trained on the train dataset and optimized for the best set of hyperparameters using the validation dataset. The same optimizer and loss function were used, and the hyperparameters were selected by running Optuna for 50 trials.
As seen in Table 2, although the accuracies for engagement, confusion and frustration are similar for the STGAT and STGAT face only models, the detection accuracy for boredom improves by 4.4 percentage points, from 41.8% to 46.2%, with the inclusion of head pose. Similarly, comparing the STGAT and STGAT raw landmark models, the detection accuracy for boredom improves by 6.2 percentage points (from 40.0% to 46.2%) with the differencing of landmark coordinates.
5 RESULTS AND DISCUSSION
The performance of STGAT as compared to other
models is shown in Table 3. The accuracies for the
detection of academic emotions for the first 5 models
are retrieved from the study by Gupta et al. (2016).
Compared to the other models, STGAT outperforms InceptionNet Video, performs on par with C3D Training, but lags behind LRCN. It is worth noting, however, that the authors stated that they recreated the DAiSEE videos by treating every alternate frame as continual affective states in order for LRCN to achieve a higher performance.
In addition, LRCN consists of a combination of CNNs (to model the spatial dimension) connected to Long Short Term Memory (LSTM) modules for modelling the temporal dimension; this is computationally heavier than our proposed model, which consists of a light pre-trained facial
Table 3: Accuracy of models across all four levels of boredom, engagement, frustration and confusion.
Models Boredom Engagement Confusion Frustration
InceptionNet Frame 36.5% 47.1% 70.3% 78.3%
InceptionNet Video 32.3% 46.4% 66.3% 77.3%
C3D Training 47.2% 48.6% 67.9% 78.3%
C3D FineTuning 45.2% 56.1% 66.3% 79.1%
LRCN 53.7% 57.9% 72.3% 73.5%
STGAT 46.2% 49.6% 67.5% 78.0%
landmark inference model (PFLD), a relatively small GAT (only 65 nodes) and GRUs. To put this in perspective, ResNet-50, a CNN that is 50 layers deep and trained on more than one million images from the ImageNet database, has close to 25.6 million parameters and is about 98 MB in size when compressed. ResNet-50 was used in the studies by Abedi and Khan (2021) and Liao et al. (2021). In contrast, the proposed STGAT model has fewer than 115 thousand parameters and is only 480 KB in size. Even with the inclusion of PFLD's 1.27 million parameters and 6.9 MB size, it is still about 14 times smaller than ResNet-50. Running on a 4-core Intel i3 processor with 16 GB RAM and an Nvidia GeForce GTX 1660 Super GPU, with a batch size of 64, the pipeline took 1.46 ms for face cropping and facial landmark inference. Added to that, STGAT took about 4.8 ms to output the affective states, giving a total inference time of about 6.26 ms. Both training and inference of deep learning models containing millions of parameters demand computational resources that are beyond the constraints of typical mobile devices (Wang et al., 2018). Specifically, the memory and battery resources of mobile devices are key considerations. For model inference, a basic requirement is that the model must be small enough to fit into the limited on-device memory, and even if the model can be compressed to fit, a computationally heavy model will rapidly deplete the battery. Both the computational and storage considerations are pivotal for implementation on mobile devices. A lighter model results in faster detection of the fast-changing affective states of learners and slower battery depletion, while a smaller model can be deployed on a wide range of mobile devices, including those with less on-device memory.
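For readers who wish to reproduce this kind of comparison on their own hardware, a generic sketch for measuring a PyTorch model's parameter count, in-memory size and average inference latency is given below; timings will naturally differ from those reported here depending on hardware, batch size and backend.

```python
import time
import torch

def model_footprint(model, example_inputs, repeats=100):
    """Return (parameter count, size in MB, average latency in ms)."""
    n_params = sum(p.numel() for p in model.parameters())
    size_mb = sum(p.numel() * p.element_size() for p in model.parameters()) / 1e6
    model.eval()
    with torch.no_grad():
        # For GPU models, wrap the loop with torch.cuda.synchronize() for accurate timing.
        start = time.perf_counter()
        for _ in range(repeats):
            model(*example_inputs)
        latency_ms = (time.perf_counter() - start) / repeats * 1000
    return n_params, size_mb, latency_ms
```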
Balancing speed against performance, we chose the pre-trained PFLD 0.25x model for facial landmark detection. The larger pre-trained PFLD models give better detection accuracies but result in slower inference. In addition, PFLD is not the current state-of-the-art model for facial landmark detection. We therefore hypothesize that with better facial landmark detection, the performance of STGAT should further improve.
6 CONCLUSION
In conclusion, we have demonstrated the feasibility of detecting academic emotions by proposing a graph-based model using a GAT and GRU that models both the spatial and temporal dimensions of facial landmarks. The ablation study also showed that the addition of head poses into STGAT and the differencing of consecutive frames of facial landmarks further enhance the detection accuracy, specifically for boredom. Although the results showed that the overall performance of STGAT is not state of the art, the proposed model has low computational requirements and is well suited for implementation on mobile devices. We contend that with the availability of better performing facial landmark detection models, the performance of the facial-landmark graph neural network based model should improve further; this will be a direction for future research.
REFERENCES
Abedi, A. and Khan, S. S. (2021). Improving state-of-the-
art in Detecting Student Engagement with Resnet and
TCN Hybrid Network. In 18th Conference on Robots
and Vision, CRV 2021, pages 151–157.
Akiba, T., Sano, S., Yanase, T., Ohta, T., and Koyama, M.
(2019). Optuna: A next-generation hyperparameter
optimization framework. In Proceedings of the 25th
ACM SIGKDD international conference on knowl-
edge discovery & data mining, pages 2623–2631.
Asteriadis, S., Tzouveli, P., Karpouzis, K., and Kollias, S.
(2009). Estimation of behavioral user state based on
eye gaze and head pose—application in an e-learning
environment. Multimedia Tools and Applications,
41(3):469–493.
Belle, A., Hargraves, R. H., and Najarian, K. (2012). An au-
tomated optimal engagement and attention detection
system using electrocardiogram. Computational and
mathematical methods in medicine, 2012.
Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D.,
Bougares, F., Schwenk, H., and Bengio, Y. (2014).
Learning phrase representations using rnn encoder-
decoder for statistical machine translation. arXiv
preprint arXiv:1406.1078.
D’Mello, S. and Graesser, A. (2009). Automatic detection
of learner’s affect from gross body language. Applied
Artificial Intelligence, 23(2):123–150.
D’Mello, S., Olney, A., Williams, C., and Hays, P. (2012).
Gaze tutor: A gaze-reactive intelligent tutoring sys-
tem. International Journal of human-computer stud-
ies, 70(5):377–398.
Fabian Benitez-Quiroz, C., Srinivasan, R., and Martinez,
A. M. (2016). Emotionet: An accurate, real-time algo-
rithm for the automatic annotation of a million facial
expressions in the wild. In Proceedings of the IEEE
conference on computer vision and pattern recogni-
tion, pages 5562–5570.
Grafsgaard, J., Wiggins, J. B., Boyer, K. E., Wiebe, E. N.,
and Lester, J. (2013). Automatically recognizing fa-
cial expression: Predicting engagement and frustra-
tion. In Educational Data Mining 2013.
Guhan, P., Agarwal, M., Awasthi, N., Reeves, G., Manocha,
D., and Bera, A. (2020). Abc-net: Semi-supervised
multimodal gan-based engagement detection using an
affective, behavioral and cognitive model. arXiv
preprint arXiv:2011.08690.
Guo, X., Li, S., Yu, J., Zhang, J., Ma, J., Ma, L., Liu, W.,
and Ling, H. (2019). Pfld: A practical facial landmark
detector. arXiv preprint arXiv:1902.10859.
Gupta, A., D’Cunha, A., Awasthi, K., and Balasubra-
manian, V. (2016). Daisee: Towards user en-
gagement recognition in the wild. arXiv preprint
arXiv:1609.01885.
Lee, D.-T. and Schachter, B. J. (1980). Two algorithms
for constructing a delaunay triangulation. Interna-
tional Journal of Computer & Information Sciences,
9(3):219–242.
Liao, J., Liang, Y., and Pan, J. (2021). Deep facial spa-
tiotemporal network for engagement prediction in on-
line learning. Applied Intelligence, 51(10):1–13.
Murshed, M., Dewan, M. A. A., Lin, F., and
Wen, D. (2019). Engagement detection in e-
learning environments using convolutional neural
networks. In 2019 IEEE Intl Conf on De-
pendable, Autonomic and Secure Computing, Intl
Conf on Pervasive Intelligence and Computing, Intl
Conf on Cloud and Big Data Computing, Intl
Conf on Cyber Science and Technology Congress
(DASC/PiCom/CBDCom/CyberSciTech), pages 80–
86. IEEE.
Ngoc, Q. T., Lee, S., and Song, B. C. (2020). Fa-
cial landmark-based emotion recognition via directed
graph neural network. Electronics (Switzerland), 9(5).
Pekrun, R., Frenzel, A. C., Goetz, T., and Perry, R. P.
(2007). The control-value theory of achievement emo-
tions: An integrative approach to emotions in educa-
tion. In Emotion in education, pages 13–36. Elsevier.
Pekrun, R., Goetz, T., Titz, W., and Perry, R. P. (2002). Aca-
demic emotions in students’ self-regulated learning
and achievement: A program of qualitative and quan-
titative research. Educational psychologist, 37(2):91–
105.
Sezgin, M. C., Gunsel, B., and Kurt, G. K. (2012). Per-
ceptual audio features for emotion detection. Eurasip
Journal on Audio, Speech, and Music Processing,
2012(1):1–21.
Tieleman, T., Hinton, G., et al. (2012). Lecture 6.5-
rmsprop: Divide the gradient by a running average of
its recent magnitude. COURSERA: Neural networks
for machine learning, 4(2):26–31.
Veličković, P., Casanova, A., Liò, P., Cucurull, G., Romero,
A., and Bengio, Y. (2018). Graph attention networks.
In International Conference on Learning Representa-
tions.
Wang, J., Cao, B., Yu, P., Sun, L., Bao, W., and Zhu, X.
(2018). Deep learning towards mobile applications.
In 2018 IEEE 38th International Conference on Dis-
tributed Computing Systems (ICDCS), pages 1385–
1393. IEEE.
Wen, Y., Zhang, K., Li, Z., and Qiao, Y. (2016). A discrim-
inative feature learning approach for deep face recog-
nition. In European conference on computer vision,
pages 499–515. Springer.
Whitehill, J., Bartlett, M., and Movellan, J. (2008). Auto-
matic facial expression recognition for intelligent tu-
toring systems. In 2008 IEEE Computer Society Con-
ference on Computer Vision and Pattern Recognition
Workshops, pages 1–6. IEEE.
Zhang, K., Zhang, Z., Li, Z., and Qiao, Y. (2016). Joint
face detection and alignment using multitask cascaded
convolutional networks. IEEE Signal Processing Let-
ters, 23(10):1499–1503.
Zhou, J., Cui, G., Hu, S., Zhang, Z., Yang, C., Liu, Z.,
Wang, L., Li, C., and Sun, M. (2020). Graph neu-
ral networks: A review of methods and applications.
AI Open, 1:57–81.