A New Way to Characterize Learning Datasets
Célina Treuillier 1,2 and Anne Boyer 1,2
1 Lorraine University, 34 Cours Léopold, 54000 Nancy, France
2 LORIA, 615 Rue du Jardin Botanique, 54506 Vandoeuvre-lès-Nancy, France
Keywords:
Learning Analytics, Corpus Representativeness, Learning Indicators, Learner Personas.
Abstract:
Students' interactions with Virtual Learning Environments (VLE) produce a large amount of data, known as learning traces, which is commonly used in the Learning Analytics (LA) domain to enhance the learning experience. Digital learning systems are generally based on the processing of these traces and must be able to adapt to different student profiles. However, the information provided in raw traces is diverse and cannot be used directly for the profile identification task: it requires defining learning indicators that are pedagogically relevant and measurable directly from learning traces, and then classifying learner profiles according to these indicators. The paper's main contribution lies in the characterization of LA datasets in terms of both group sizes and observed digital behaviors. It addresses the lack of clearly stated information for LA system developers, who need to ensure that their algorithms do not introduce bias, especially by disfavoring specific categories of students, which would only worsen existing inequalities in the student population. Going further, embodying these identified profiles by translating them into learner personas also contributes to the explainability of LA outcomes by providing easy-to-interpret descriptions of students. These personas consist of fictitious representative student profiles, expressing the different needs and learning objectives to which LA systems must respond.
1 INTRODUCTION
The intensified use of Learning Analytics (LA) has
led to a significant shift in learning: learners can par-
ticipate in a specific course from anywhere and at any
time. While attending different courses online, learn-
ers produce a large amount of data, known as learn-
ing traces, which are commonly processed by several
algorithms to understand learning and potentially im-
prove it (Siemens and Long, 2011). These traces can
be very diverse, and reflect the student’s behaviors on
the learning platform.
Obviously, not all students engage in similar behaviors, in both distance and face-to-face courses, and they thus need to receive adapted, high-quality support (Xu and Jaggars, 2014). All students must be accompanied and supported in their learning tasks, and none should be privileged or, on the contrary, disadvantaged.
These notions of adaptability and equity are essential and need to be ensured when dealing with LA systems (Slade and Prinsloo, 2013). Such systems learn from learning traces to provide results that help actors in their decision-making to improve learning. However, the heterogeneous nature of the data, coupled with the diversity of each student's behavior, makes the task tricky: it requires identifying distinct learning behaviors, allowing the description of learning profiles, directly from the learning traces. Hence, we ask ourselves: how can learning datasets be characterized in terms of the representativeness of learning profiles? (RQ1) We hypothesized that a new and complete description of the dataset could contribute to reducing the inequalities that persist in distance learning by giving important clues to LA actors, and in particular to system developers who implement the algorithms processing such diverse data. Finally, this complete characterization of datasets can contribute to the generalization of fair learning, which is adapted and accessible to all students. Indeed, it could ensure that LA solutions are adapted to the different student profiles identified and that they do not introduce any bias.
However, before we can imagine improving online learning, we must ensure that the results provided are understandable and accepted by LA actors. In this regard, we wondered: how can the explainability of the identified learning profiles be improved? (RQ2)
We thought that translating the identified learning profiles into a narrative and comprehensible form is essential to ensure that all potential users can understand the results, measure the importance of the computed indicators, and act accordingly.
This work is part of the LOLA (Laboratoire Ouvert de Learning Analytics) project, which aims at setting up a collaborative platform on which the different LA actors will be able to share datasets, models, and results. To complete its offer, the platform proposes a full evaluation environment that relies on technical, algorithmic, and pedagogical indicators. The work presented here contributes to the elaboration of this environment.
The paper is organized as follows. Section 2
presents key concepts on which the presented work is
based. The methodology applied to identify the learn-
ing profiles as well as the associated results are pre-
sented in Section 3. The translation of these results in
a narrative and understandable form is then presented
in Section 4. Finally, Section 5 presents conclusions
and some interesting perspectives.
2 RELATED WORKS
2.1 Learning Styles
The learning process relies on cognitive foundations, which are essential to the acquisition of complete knowledge, allowing learners to interact with their environment. The French neuroscientist Dehaene (2013) described four pillars of learning:
Attention: the brain's mechanisms allowing the selection of the information on which the student needs to focus.
Active Engagement: active pedagogy avoiding mind-wandering and thus supporting the adoption of an active behavior toward the learning task.
Information Feedback: feedback in the form of error signals, essential for efficient learning.
Daily Consolidation: regular storage of the received information.
These various concepts result in the adoption of diverse learning behaviors for receiving and processing information. Students present different strategies to deal with the large amount of information they receive every day, and some are more adapted and beneficial than others. In this context, many frameworks, based on different observations (student's personality, information processing, pedagogical preferences, etc.), have been defined to describe learning behaviors (Sadler-Smith, 1997). One of the most popular is the framework described by (Felder et al., 1988), which is still used to describe the style of thousands of learners each year. This well-known framework classifies students into different categories based on the way they perceive the world, reason, receive information, process it, and finally understand it. In total, it allows describing 32 learning styles to which students may belong. It is important to note that learners' behaviors can change over time: one student can be associated with several categories during an extended period of learning.
Interestingly, the framework's authors showed that there are also teaching styles, which may or may not be in line with the learning styles. In the latter case, students may become discouraged, experience declines in performance, or even drop out. For that reason, it is essential for teachers to understand how their students learn, even if this is very difficult when dealing with large groups of students, in order to adapt their pedagogy accordingly and hope to enhance learning. In that sense, (El-Bishouty et al., 2019) showed that Felder and Silverman's framework is applicable in an online learning context and that a course taking different learning styles into account could improve learning. Thus, LA systems processing data about thousands of students simultaneously represent an important tool to support teachers: they participate in the development of Adaptive Learning (Nijhavan and Brusilovsky, 2002), which adapts to students' needs and offers personalized support that helps improve learning performance.
Many researchers have been interested in identi-
fying learner profiles from learning data (Paiva et al.,
2015; Mojarad et al., 2018; Mupinga et al., 2006; Lot-
sari et al., 2014): all of them have relied on different
methods, using different data and therefore provid-
ing different results. The next subsection details some
useful learning indicators that were used in previous
studies, and which are particularly interesting in the
context of our work.
2.2 Learning Indicators
The identification of learning styles must be based on indicators that provide useful insights into students' behaviors. In the context of our study, what sets these indicators apart from existing ones (e.g., in the educational sciences field) is that they must be directly evaluable from the learning data collected about students. They remain, however, based on strong theoretical concepts from the educational sciences, and must reflect relevant learning behaviors, providing useful information about the learning process. These indicators then need to be refined according to the available data, allowing different behaviors to be characterized according to the specific parameters selected from the learning traces. Hence, in this study, we rely on the following definition of learning indicators: "An indicator is an observable that is pedagogically significant, computed or established with the help of observations, and testifying to the quality of interaction, activity, and learning. It is defined according to an observation objective and motivated by an educational objective" (Iksal, 2012).
Many learning indicators may be of interest. However, for this work, only the most significant ones in an online learning context have been computed. The first selected indicator refers to student engagement, which is, as detailed earlier, essential to ensure a quality learning process (Dehaene, 2013). Student engagement has been discussed in many studies: in 2014, (Chi and Wylie, 2014) defined the ICAP model, describing four modes of engagement: Interactive, Constructive, Active, and Passive. Each mode is associated with different learning behaviors, allowing a more or less in-depth learning process, and therefore has different consequences on learning outcomes. In another direction, some researchers have tried to quantify this engagement directly from learning traces, such as (Hussain et al., 2018), who used predictive models to identify low-engagement students. Other studies focus on different learning indicators that are particularly interesting in the context of this work: (Boroujeni et al., 2016), for example, quantified students' regularity to study its impact on learning outcomes. Other authors, such as (Arnold and Pistilli, 2012), detailed an interesting method based on learning traces and demographic data to predict learning performance.
In our case, we want to characterize student profiles according to a broad spectrum of indicators. We have therefore not focused on a single indicator but computed several, based on those mentioned in this section. This set of indicators serves as a basis for a clearer and fuller definition of the behaviors adopted by the students described in a specific dataset (Ben Soussia et al., 2021). We hypothesized that this description would allow characterizing the datasets and would be used to study their representativeness. The choice of these indicators is based on the available data, which can vary significantly depending on the learning traces. The complete methodology is described in Section 3.
3 DESCRIPTION OF LEARNING
BEHAVIORS
3.1 Methodology
The methodology allowing the identification of learner profiles according to selected indicators can be divided into five steps (see Figure 1):
Selection of an LA dataset of interest.
Selection of learning indicators depending on the data available in the selected dataset. Learning traces recorded when students interact with the VLE can be diverse and do not always reflect the same behaviors. Hence, it is essential to systematically select indicators adapted to the studied corpus. A single indicator can be evaluated from different parameters.
Data selection and pre-processing once the indi-
cators are selected. Only the data used to com-
pute indicators are selected and pre-processed to
improve the performance of the model.
Identification of homogeneous groups of students (i.e., students adopting similar behaviors). To do that, an unsupervised classification method grouping data with similar properties seems to be the best solution.
Description of the learning profiles according to
the identified groups and learning indicators se-
lected.
The data selection was performed using R (Ripley et al., 2001). All other steps were performed with the scikit-learn library for Python (Pedregosa et al., 2011). Results are detailed in the following section; a minimal sketch of the pipeline is given below.
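For illustration, the following Python sketch outlines steps 3 to 5 with scikit-learn; the feature matrix X (one row per student, one column per indicator feature) and the chosen number of clusters are placeholders, not values from the paper.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import IsolationForest
from sklearn.metrics import davies_bouldin_score, silhouette_score
from sklearn.preprocessing import RobustScaler

def profile_students(X, k):
    """Scale the indicators, set outliers aside, then cluster the inliers."""
    X_scaled = RobustScaler().fit_transform(X)   # robust to extreme values
    is_outlier = IsolationForest(random_state=0).fit_predict(X_scaled) == -1
    inliers = X_scaled[~is_outlier]
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(inliers)
    print(f"silhouette = {silhouette_score(inliers, labels):.2f}, "
          f"Davies-Bouldin = {davies_bouldin_score(inliers, labels):.2f}")
    return labels, is_outlier

# Toy example: 500 fictitious students described by 5 indicators.
rng = np.random.default_rng(0)
labels, outliers = profile_students(rng.normal(size=(500, 5)), k=4)
```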
3.2 Results
3.2.1 Selection of a Dataset
The methodology was applied to the well-known Open University Learning Analytics Dataset (OULAD) (Kuzilek et al., 2017), which gathers data about 32,593 students involved in distance learning at the Open University, one of the largest distance learning universities worldwide. It is fully anonymized and contains demographic data and interaction data, as well as the results of the various evaluations. The dataset gathers information about 22 courses, called modules, which can be delivered multiple times during the year and are thus differentiated by the year (2013 or 2014) and the starting month (B = February, J = October) of the considered presentation.
Figure 1: Different steps of the applied methodology.
To analyze a homogeneous set of students, we chose a single presentation among those available in OULAD: the February 2013 (2013B) presentation of module DDD, a STEM (Science, Technology, Engineering, and Mathematics) course that involved 1303 students and lasted 240 days, during which 14 evaluations were spread. A sketch of this selection is given below.
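As a sketch, the selection could be done with pandas on the public OULAD CSV files; the file and column names below (studentInfo.csv, studentVle.csv, code_module, code_presentation) follow the published OULAD documentation and should be checked against the reader's copy.

```python
import pandas as pd

MODULE, PRESENTATION = "DDD", "2013B"

# Demographics and final results of the selected presentation.
info = pd.read_csv("studentInfo.csv")
info = info[(info.code_module == MODULE)
            & (info.code_presentation == PRESENTATION)]

# Daily clicks on the VLE materials for the same presentation.
vle = pd.read_csv("studentVle.csv")
vle = vle[(vle.code_module == MODULE)
          & (vle.code_presentation == PRESENTATION)]

print(f"{info.id_student.nunique()} students selected")  # expected: 1303
```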
3.2.2 Selection of Indicators According to the
Selected Dataset
The information contained in OULAD is rich and detailed: we focus mainly on learners' activity on the VLE as well as submission modalities and performance in the exams. Furthermore, it is important to note that the data covers 20 types of material with which users can interact. However, some types of activities have more influence on learning outcomes: forumng, oucontent, homepage, and subpage activities (as named in the dataset) are, for example, the most important predictors of engagement according to (Hussain et al., 2018). We have therefore selected only these four activity types.
Together, the data allowed us to work on 5 learning indicators: engagement (Hussain et al., 2018), performance (Arnold and Pistilli, 2012), regularity (Boroujeni et al., 2016), reactivity (Boroujeni et al., 2016), and curiosity (Pluck and Johnson, 2011). The definition of each indicator and the associated features selected in the dataset are detailed in Table 1; a sketch of their derivation follows.
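Continuing the sketch above, the five indicators could be derived per student as follows; this is a hypothetical rendering in which vle.csv maps each material (id_site) to its activity_type and the date column of assessments.csv holds the deadline, as in the public OULAD documentation.

```python
# Merge clicks with the material catalogue to know each activity type.
catalogue = pd.read_csv("vle.csv")[["id_site", "activity_type"]]
activity = vle.merge(catalogue, on="id_site")

per_student = activity.groupby("id_student").agg(
    engagement=("sum_click", "sum"),         # total clicks on the VLE
    regularity=("date", "nunique"),          # number of distinct active days
    curiosity=("activity_type", "nunique"),  # variety of resources consulted
)

# Performance and reactivity come from the assessment tables.
deadlines = pd.read_csv("assessments.csv")[["id_assessment", "date"]]
submissions = pd.read_csv("studentAssessment.csv").merge(deadlines,
                                                         on="id_assessment")
# Positive delay means a late submission (an assumption of this sketch).
submissions["delay"] = submissions["date_submitted"] - submissions["date"]

per_student = per_student.join(
    submissions.groupby("id_student").agg(performance=("score", "mean"),
                                          reactivity=("delay", "mean")))
```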
3.2.3 Data Pre-processing
Once the data is selected, it undergoes a pre-processing phase, which is necessary to improve the performance of the model. We first handled missing values (NAs) by replacing them: if no activity is recorded or the student did not get a grade (assignment not handed in), the value is replaced by 0. In the latter case, the delay is set to 240, corresponding to the total duration of the considered course.
At this stage, the 1303 students were divided into 4 sub-datasets corresponding to their final result, which can be: withdrawn, fail, pass, or distinction (this information is available in the initial dataset). The common data standardization phase is then applied to rescale the numerical data for better analysis. The standardization methods available in scikit-learn were first compared, and RobustScaler, described as particularly suitable for data containing outliers, was selected. Finally, we applied the IsolationForest algorithm (Liu et al., 2008) to detect outliers. In our context, outliers represent learners adopting atypical behaviors, who cannot be associated with any other students. We emphasize, however, that the identification of these non-standard students is essential: their atypical behaviors do not allow them to benefit from the same support as the other students, so they need to be analyzed differently. From an algorithmic point of view, treating them in parallel enhances the performance of the model. Therefore, inliers and outliers are divided into different sub-datasets. Only inliers are studied in the next phase, but outliers are not set aside: they are simply treated differently and will be described independently to provide a complete characterization of the dataset. A sketch of this phase is given below.
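Continuing the sketch, the pre-processing could be rendered as follows; per_student and info come from the previous snippets, and IsolationForest's contamination level is left at its default here, whereas in practice it would be tuned.

```python
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import RobustScaler

COURSE_LENGTH = 240  # total duration of DDD-2013B, in days

features = per_student.join(info.set_index("id_student")["final_result"])
# Missing submissions get the maximal delay; other NAs become 0.
features["reactivity"] = features["reactivity"].fillna(COURSE_LENGTH)
features = features.fillna(0)

sub_datasets = {}
for outcome, group in features.groupby("final_result"):
    X = RobustScaler().fit_transform(group.drop(columns="final_result"))
    is_outlier = IsolationForest(random_state=0).fit_predict(X) == -1
    sub_datasets[outcome] = (X[~is_outlier], X[is_outlier])  # inliers, outliers
```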
3.2.4 Identification of Homogeneous Groups of
Students
Once inliers and outliers are identified and separated into distinct sub-datasets, the goal is to identify homogeneous groups of students among the inliers, according to the selected learning indicators. In concrete terms, the goal of this stage is to identify subsets S_k composed of profiles P_i, each described by a sequence of learning traces T_{i,j}; outliers are denoted O_p (see Figure 2).
To classify the students of the 4 sub-datasets (withdrawn, fail, pass, distinction) into homogeneous groups, we used the k-means algorithm (Likas et al., 2003), which is well described and adapted to learning datasets (Navarro and Ger, 2018). The performance was evaluated with two metrics:
Table 1: Definition of each learning indicator and associated features selected in OULAD.

Indicator | Definition | Associated features in OULAD (DDD - 2013B)
Performance | Students' learning outcomes in the module | Scores obtained in the 14 assessments of the module [0-100]
Reactivity | Responsiveness to course-related events | Delay between submission date and deadline, for each assessment
Engagement | Students' activity on the VLE | Number of clicks (in total and for each activity)
Regularity | Behavioral patterns | Number of active days and daily behavior (in total and for each activity)
Curiosity | Students' intrinsic motivation | Number of different types of resources consulted
Figure 2: Identification of subsets S_k and outliers O_p.
Silhouette Analysis (Rousseeuw, 1987): measures the distance between each point of a cluster and the points of the other clusters. It has a range of [-1; 1], with values closer to 1 indicating a better classification.
Davies-Bouldin Criterion (Davies and Bouldin, 1979): computes the ratio of within-cluster distances to between-cluster distances for each cluster and its most similar cluster. It has a range of [0; +∞), with values closer to 0 indicating a better classification.
3.2.5 Learner Profiles Description
First, the k-means algorithm was run with various values of k (from 2 to 15), and the quality of each partition was evaluated with the Davies-Bouldin Index and Silhouette Analysis, which made it possible to determine the optimal value of k. Silhouette plots (Rousseeuw, 1987) were displayed to visually identify which partitions perform better (see the example in Figure 3). With the optimal value of k, the index values obtained are relatively good (Davies-Bouldin Index close to 0 and average Silhouette Index close to 1). This indicates a quality partition: the different learners are clustered into the right groups, which are sufficiently separated from each other. A sketch of this sweep is given below.
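As an illustration, the sweep could be implemented as below; inliers stands for one scaled sub-dataset, and picking k by the maximum average silhouette is a simplifying assumption, since the paper combines both indexes with visual inspection of the silhouette plots.

```python
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score, silhouette_score

def sweep_k(inliers, k_range=range(2, 16)):
    """Cluster with every k in k_range and record both quality indexes."""
    scores = {}
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10,
                        random_state=0).fit_predict(inliers)
        scores[k] = (silhouette_score(inliers, labels),
                     davies_bouldin_score(inliers, labels))
    # Simplification: keep the k with the highest average silhouette.
    best = max(scores, key=lambda k: scores[k][0])
    return best, scores
```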
The described methodology allowed the identification of a varying number of homogeneous groups depending on the sub-dataset considered. However, for each of them, there is always one group representing a larger proportion of the dataset, some groups representing a smaller number of students but whose sizes are still quite representative, and, finally, some groups representing only a very small number of students. Thus, we fixed a threshold ε = 10, below which identified subsets are considered outliers and treated like the outliers identified in the pre-processing phase (with the IsolationForest algorithm); a sketch of this reassignment follows.
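A minimal sketch of this reassignment, assuming labels holds the k-means cluster assignment of one sub-dataset:

```python
import numpy as np

EPSILON = 10  # minimum cluster size retained, as fixed in the paper

def demote_small_clusters(labels):
    """Move members of clusters smaller than EPSILON to the outlier set."""
    values, counts = np.unique(labels, return_counts=True)
    small = values[counts < EPSILON]
    is_outlier = np.isin(labels, small)
    return labels[~is_outlier], is_outlier
```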
The largest subset corresponds to the prime persona: it represents the largest proportion of the students described in the considered dataset. The associated indicators then describe the online behavior adopted by the majority of learners. Smaller groups (size > ε) were defined as under-represented personas: they are associated with a considerable number of students who exhibit a particular behavior, different from the one commonly shared in the studied dataset, and who therefore require adapted support. In addition, algorithms processing a dataset containing information about these atypical learners and outliers must be able to recognize them and treat them with the same quality as students in the prime group. Global information about the results is summarized in Table 2.
4 CONCEPTION OF LEARNER
PERSONAS
The homogeneous groups of students identified in the
previous section give useful information about the
various online learning behaviors, to which the LA
systems must adapt. However, LA actors (develop-
ers, researchers, users...) need to understand these
Figure 3: Silhouette analysis for k-means clustering on the pass sub-dataset with k = 8.
Table 2: Summary of the number of inliers and outliers, the optimal value of k, and the Davies-Bouldin and Silhouette index values for each sub-dataset.

Sub-dataset | Inliers | Outliers (IsolationForest) | Optimal k | Davies-Bouldin Index | Silhouette Index
Withdrawn | 427 | 5 | 4 | 0.82 | 0.83
Fail | 357 | 4 | 8 | 0.16 | 0.91
Pass | 451 | 5 | 10 | 0.70 | 0.78
Distinction | 53 | 1 | 6 | 0.05 | 0.88
behaviors, and especially what they mean according to the different learning indicators. To help them in this task, the identified groups were, in a second step, translated into learner personas. The latter were defined by (Brooks and Greer, 2014) as "narrative descriptions of typical learners that can be identified through centroids of machine learning classification processes".
Personas are commonly used during the development phase of digital services, especially in UX design (Lallemand and Gronier, 2016): they represent typical users to whom the service must respond. In our case, the goal of the learner personas is different: they help to enhance the explainability of LA outputs for potential users (teachers, learners), and also to study the representativeness of a corpus.
Thus, each group identified through the presented methodology has been translated into a persona representing a narrative description of a fictitious student: it contains some demographic information (name, gender, and age) associated with a textual description giving essential clues about the learning behavior, according to the learning indicators. This description embodies the results returned by the classification process: anyone who might read it can understand it. A possible implementation is sketched below.
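The paper does not prescribe an implementation, but as an illustration, a persona could be materialized as a small data structure filled from a cluster centroid; the name, age, gender, and sentence template below are invented for the example.

```python
from dataclasses import dataclass

@dataclass
class LearnerPersona:
    name: str            # fictitious demographic information
    age: int
    gender: str
    group_share: float   # proportion of the sub-dataset in this group
    indicators: dict     # centroid values of the learning indicators

    def describe(self) -> str:
        i = self.indicators
        return (f"{self.name} ({self.gender}, {self.age}) represents "
                f"{self.group_share:.0%} of the students: about "
                f"{i['engagement']} clicks, {i['regularity']} active days, "
                f"and {i['curiosity']} resources consulted.")

# Prime persona of the withdrawn sub-dataset (indicator values from
# Section 4.1; the demographic details are invented, as in any persona).
prime = LearnerPersona("Alex", 21, "M", 0.76,
                       {"engagement": 351, "regularity": 22, "curiosity": 45})
print(prime.describe())
```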
In the rest of the paper, the prime persona, an example of an under-represented persona, and an outlier are detailed for each sub-dataset.
4.1 Withdrawn Dataset
To begin, we observe that the majority of students who dropped out (76% of the dataset, 326 students) have very low activity (351 clicks), are very irregular (22 active days), and access very few resources (45). This behavior leads to poor results from the beginning of the course. These students dropped out quickly and did not turn in any more assignments. Other under-represented subgroups are more active (number of clicks > 1000), more regular (between 66 and 86 active days), and more curious (> 100 resources consulted), but give up progressively, with some groups withdrawing more quickly than others. Finally, a surprising outlier of this dataset shows a very active and regular behavior at the beginning of the course (4267 clicks, 178 active days) and seems to be curious (188 consulted resources). Unfortunately, he gives up on the last assignment, which is not handed in, and therefore does not pass the course despite his seemingly ideal behavior (see Figure 4).
Figure 4: Personas of withdrawn dataset.
4.2 Fail Dataset
The majority of students who failed (53% of the dataset, 190 students) are not very active (620 clicks), whatever the type of activity considered. This low activity is associated with a reduced number of resources consulted (73) and fewer active days (43). These students, who are inactive, irregular, and not very curious about the course, obtain low results that do not allow them to succeed, especially since they are not very reactive and do not return all the assignments. Interestingly, some under-represented students were much more active and regular (1871 clicks, 110 active days), and thus consulted a greater number of resources (145). These students turned in all the assignments on time but obtained low scores (grades < 40) and therefore did not pass the module: their learning behavior does not seem efficient enough to allow them to succeed. Finally, we observe that a specific outlier has an intense activity, with an impressive number of clicks (7155), many days of activity (201), and a wide variety of resources consulted (240), but the scores obtained are too low to pass the module (see Figure 5).
Figure 5: Personas of fail dataset.
4.3 Pass Dataset
For successful students, the prime persona represents 69% of the dataset (312 learners). The associated students adopt a very active behavior (2240 clicks), especially on the forums (522 clicks). They also connect to the platform regularly (130 active days) and consult numerous resources (167). This active, regular, and curious learning behavior allows them to succeed in the module by obtaining good results (grades > 60). Other students, less represented, are less active, with half the number of clicks (1113) and far fewer active days (77). These less active and less regular students do not turn in all the assignments, but their correct results nevertheless allow them to validate the module. Finally, the outliers include students with frenetic activity (19196 clicks) spread over 259 active days, during which 439 different resources are consulted. We can easily understand why this type of student is considered an outlier given the adopted behavior (see Figure 6).
Figure 6: Personas of pass dataset.
4.4 Distinction Dataset
Finally, the majority of students earning a distinction (87%, 46 students) are very active (2577 clicks) and regular (146 active days) in the course. In particular, they show high activity on the forums (627 clicks). This behavior allows them to obtain excellent grades (> 80). For this dataset, we do not observe any under-represented personas: the 5 other identified clusters correspond to only one or two students each, who are therefore considered outliers. The most striking one shows very intense activity (17957 clicks) throughout the entire course (260 days of activity) and high curiosity (361 resources consulted). This student also seems to be very active on the forums, with almost 7050 clicks. All of his assignments were handed in on time and his results are brilliant (see Figure 7).
Figure 7: Personas of distinction dataset.
The presented personas are particularly interesting since they describe the different learner profiles present in the selected dataset in a narrative way, understandable to all. The diverse identified profiles are based on the selected indicators, which thus seem to be relevant from a learning point of view. They allow identifying homogeneous groups of students, sufficiently different from each other, who express interesting behaviors in accordance with their final result. Furthermore, the high diversity of the profiles is interesting and reflects, once again, the need to adapt LA systems to all users, who express specific needs. The defined personas contain information about the online behaviors adopted, but the associated methodology also allows us to describe the corpus in terms of group sizes, i.e., the frequency of the described behaviors. In that sense, personas can help developers ensure that their algorithms are adapted to all students and do not introduce bias, for example by privileging students who adopt the most common behavior.
5 CONCLUSIONS AND
PERSPECTIVES
Broadly, the selected classification method (k-means algorithm) was appropriate for identifying groups in which students share similar behaviors according to the selected indicators. The different steps described, from the selection of the dataset to the conception of the personas, allow us to answer RQ1. In parallel, the unprecedented conception of personas based on these identified groups is effective in describing these learning behaviors semantically and thus complements the numerical results returned by the algorithm. These personas can then be shared with all LA actors, who will be able to understand them whether they are technophiles or not, which answers RQ2. Altogether, these elements introduce some interesting insights on how to characterize LA datasets more completely and understandably. This new description of corpora, based on learner personas, could become a powerful tool in the LA field, participating in learning improvement for the entire student population.
Nevertheless, some aspects deserve to be studied and evaluated. First of all, we wonder whether the embodiment of the personas, by giving a name, a gender, and an age to the fictitious student, is relevant in some contexts and does not introduce other biases in the people who have to use them. Cognitive biases may appear and affect the way LA actors interpret and use the personas. A study focusing on this aspect seems to be needed to answer this question.
The automation of persona conception can also be discussed: we ask ourselves whether the redaction of the learning behaviors could be automated with specific models. This seems essential when dealing with very large datasets, in which the description of hundreds of personas implies a large workload. On the other hand, human intervention can reassure users and participate in the enhancement of the systems' explainability. These aspects thus deserve an in-depth analysis to determine the ideal compromise between complete automation and human contribution.
Finally, the presented methodology looks promising and offers interesting results, but it was only applied to a single dataset in this paper. We must now study how the methodology applies to multiple datasets, like those shared on the LOLA platform, from which different learning indicators can be computed. Application to a private dataset has started and is expected to be completed in the near future. This work contributes to establishing the robustness of our method and could allow learner personas to become a privileged tool for LA dataset characterization.
ACKNOWLEDGEMENTS
This work has been done in the framework of the
LOLA project, with the support of the French Min-
istry of Higher Education, Research, and Innovation.
REFERENCES
Arnold, K. E. and Pistilli, M. D. (2012). Course signals at Purdue: Using learning analytics to increase student success. In Proceedings of the 2nd International Conference on Learning Analytics and Knowledge, pages 267–270.
Ben Soussia, A., Roussanaly, A., and Boyer, A. (2021).
An in-depth methodology to predict at-risk learners.
In European Conference on Technology Enhanced
Learning, pages 193–206. Springer.
Boroujeni, M. S., Sharma, K., Kidziński, Ł., Lucignano, L., and Dillenbourg, P. (2016). How to quantify student's regularity? In European Conference on Technology Enhanced Learning, pages 277–291. Springer.
Brooks, C. and Greer, J. (2014). Explaining predictive mod-
els to learning specialists using personas. In Proceed-
ings of the fourth international conference on learning
analytics and knowledge, pages 26–30.
Chi, M. T. and Wylie, R. (2014). The ICAP framework: Linking cognitive engagement to active learning outcomes. Educational Psychologist, 49(4):219–243.
Davies, D. L. and Bouldin, D. W. (1979). A cluster separa-
tion measure. IEEE transactions on pattern analysis
and machine intelligence, (2):224–227.
Dehaene, S. (2013). Les quatre piliers de l'apprentissage, ou ce que nous disent les neurosciences.
El-Bishouty, M. M., Aldraiweesh, A., Alturki, U., Tortorella, R., Yang, J., Chang, T.-W., Graf, S., et al. (2019). Use of Felder and Silverman learning style model for online course design. Educational Technology Research and Development, 67(1):161–177.
Felder, R. M., Silverman, L. K., et al. (1988). Learning and
teaching styles in engineering education. Engineering
education, 78(7):674–681.
Hussain, M., Zhu, W., Zhang, W., and Abidi, S. M. R.
(2018). Student engagement predictions in an e-
learning system and their impact on student course as-
sessment scores. Computational intelligence and neu-
roscience, 2018.
Iksal, S. (2012). Ingénierie de l'observation basée sur la prescription en EIAH. PhD thesis, Université du Maine.
Kuzilek, J., Hlosta, M., and Zdrahal, Z. (2017). Open
university learning analytics dataset. Scientific data,
4(1):1–8.
Lallemand, C. and Gronier, G. (2016). Méthodes de design UX : 30 méthodes fondamentales pour concevoir et évaluer les systèmes interactifs. Eyrolles.
Likas, A., Vlassis, N., and Verbeek, J. J. (2003). The global
k-means clustering algorithm. Pattern recognition,
36(2):451–461.
Liu, F. T., Ting, K. M., and Zhou, Z.-H. (2008). Isolation forest. In 2008 Eighth IEEE International Conference on Data Mining, pages 413–422. IEEE.
Lotsari, E., Verykios, V. S., Panagiotakopoulos, C., and
Kalles, D. (2014). A learning analytics methodology
for student profiling. In Hellenic Conference on Arti-
ficial Intelligence, pages 300–312. Springer.
Mojarad, S., Essa, A., Mojarad, S., and Baker, R. S. (2018).
Data-driven learner profiling based on clustering stu-
dent behaviors: learning consistency, pace and effort.
In International conference on intelligent tutoring sys-
tems, pages 130–139. Springer.
Mupinga, D. M., Nora, R. T., and Yaw, D. C. (2006). The
learning styles, expectations, and needs of online stu-
dents. College teaching, 54(1):185–189.
Navarro, Á. A. M. and Ger, P. M. (2018). Comparison of clustering algorithms for learning analytics with educational datasets. IJIMAI, 5(2):9–16.
Nijhavan, H. and Brusilovsky, P. (2002). A framework
for adaptive e-learning based on distributed re-usable
learning activities. In E-learn: World conference on
e-learning in corporate, government, healthcare, and
higher education, pages 154–161. Association for the
Advancement of Computing in Education (AACE).
Paiva, R. O. A., Bittencourt, I. I., da Silva, A. P., Isotani, S.,
and Jaques, P. (2015). Improving pedagogical recom-
mendations by classifying students according to their
interactional behavior in a gamified learning environ-
ment. In Proceedings of the 30th Annual ACM Sym-
posium on Applied Computing, pages 233–238.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., et al. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.
Pluck, G. and Johnson, H. (2011). Stimulating curiosity
to enhance learning. GESJ: Education Sciences and
Psychology, 2.
Ripley, B. D. et al. (2001). The r project in statistical com-
puting. MSOR Connections. The newsletter of the
LTSN Maths, Stats & OR Network, 1(1):23–25.
Rousseeuw, P. J. (1987). Silhouettes: a graphical aid to
the interpretation and validation of cluster analysis.
Journal of computational and applied mathematics,
20:53–65.
Sadler-Smith, E. (1997). 'Learning style': Frameworks and instruments. Educational Psychology, 17(1-2):51–63.
Siemens, G. and Long, P. (2011). Penetrating the fog: Ana-
lytics in learning and education. EDUCAUSE review,
46(5):30.
Slade, S. and Prinsloo, P. (2013). Learning analytics: Ethi-
cal issues and dilemmas. American Behavioral Scien-
tist, 57(10):1510–1529.
Xu, D. and Jaggars, S. S. (2014). Performance gaps be-
tween online and face-to-face courses: Differences
across types of students and academic subject areas.
The Journal of Higher Education, 85(5):633–659.