dent grades, and then the complexity of the model is increased by using a linear function of multiple variables. After that, classification algorithms such as decision trees, nearest neighbour, support vector machines, and linear discriminant analysis, as well as combinations of classifiers, were used to predict students' final grades. A system that predicts students' grades for the courses they will enroll in during the next enrollment term, by learning patterns from historical data as well as additional information about students, courses and the professors that teach them, is proposed in (Sweeney et al., 2016). Several models were used: Factorization Machines (FM), Random Forests (RF), and Personalized Multi-Linear Regression; the best result was obtained with a hybrid FM-RF method that proved to accurately predict grades for both new and returning students taking both new and existing courses. The feature selection study emphasizes strong connections between instructor characteristics and student performance.
3 METHODOLOGY
Our goal is to build a recommendation system for student progress; we apply several Machine Learning (ML) techniques to find correlations between disciplines, to predict a student's future grade for a discipline based on the grades previously obtained by all students, and to determine which of these ML techniques best suits our use case. From the various classes of Machine Learning algorithms, we investigated one unsupervised learning method, namely clustering, and three supervised learning techniques, i.e. linear regression, random forest regression and neural networks. Our first approach uses clustering techniques to group similar disciplines based on the grades received by students. There are many clustering algorithms
that have emerged over time, some of them being included in the stable releases of various data science libraries due to their maturity and the performance they provide. It is known that there is no "one algorithm matches all problems", but, as the study conducted in (Saxena et al., 2017) concluded, a clustering algorithm well suited for the vast majority of problems is K-Means, from the "Partition" family (Table 1). In Table 1 we summarize the main characteristics of the clustering algorithms we considered: for each algorithm, we list the family of clustering algorithms to which it belongs, its time complexity, its scalability, its suitability for large data sets, and its sensitivity to noise in the data.
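As a toy illustration of how a partition-based algorithm such as K-Means assigns points to the nearest learned centroid (synthetic 2-D data, not our course-grade dataset):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Two well-separated synthetic blobs of 50 points each.
pts = np.vstack([rng.normal(0.0, 0.3, size=(50, 2)),
                 rng.normal(5.0, 0.3, size=(50, 2))])

# K-Means partitions the points into k clusters by iteratively
# assigning each point to the nearest centroid and recomputing centroids.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(pts)
labels = km.labels_  # one cluster label per point
```

With well-separated blobs, all points of a blob end up in the same cluster.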
Our dataset consists of grades obtained by the stu-
dents of an entire Bachelor’s degree series, across
their entire academic route (spanning over 3 years of
BSc studies). The curriculum for such a series contains both compulsory and optional courses. While all students are enrolled in the compulsory courses, an optional course may enroll only a fraction of them.
We used only grades obtained by students for the
compulsory courses, such that we have approximately
the same number of grades for each course (there may
be a small number of students that did not attend the
exam for some courses). This would ease our prepro-
cessing of the data and would make our dataset more
balanced. The dataset was exported from the database
in csv format, each row containing the grades for a
specific student and the header row consists of the stu-
dent id and the 27 compulsory courses. Our aim was to group into the same cluster courses that are similar to each other (similarity being based on the students' grades), so we needed each course on a separate row, with the header containing the student ids and each cell holding the grade that the student from that column obtained for the course on that row. To obtain this, we "transposed" the original dataset. To easily run the steps needed in the clustering process, we used Python and its data science libraries: numpy, pandas, sklearn; the steps performed were as follows:
• loading the dataset into a Pandas1 dataframe
• checking that there are no missing values
• dropping the course column to keep only the numerical data (the grades themselves)
• applying the Elbow method to detect the most suitable number of clusters, since we use a partitioning algorithm (Figure 1)
• running the K-Means clustering algorithm on the dataset using the number of clusters obtained above
• adding the cluster label to each instance of our original dataset (which also contains the course name)
• obtaining the Silhouette score for our clustering scheme (Figure 2)
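The steps above can be sketched in Python as follows. Since the real grade data is not public, the sketch generates a synthetic stand-in with the same layout (one row per course, one column per student); the column names, grade range and random data are our own assumptions:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the real (private) dataset: 27 compulsory
# courses as rows, 120 students as columns, grades between 4 and 10.
# In practice this dataframe is loaded from the exported, transposed csv.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(4, 11, size=(27, 120)),
                  columns=[f"s{i}" for i in range(120)])
df.insert(0, "course", [f"C{i}" for i in range(27)])

# Check that there are no missing values.
assert not df.isnull().values.any()

# Drop the course column, keeping only the numerical grades.
X = df.drop(columns=["course"]).to_numpy()

# Elbow method: inspect the inertia curve over a range of k
# and pick the "knee" as the number of clusters.
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in range(2, 15)]

# Run K-Means with the chosen number of clusters (8 in our experiments).
km = KMeans(n_clusters=8, n_init=10, random_state=0).fit(X)

# Add the cluster label to each course of the original dataset.
df["cluster"] = km.labels_

# Silhouette score for the clustering scheme (in [-1, 1], higher is better).
score = silhouette_score(X, km.labels_)
```

On real grade data the inertia curve, rather than a fixed k, determines the number of clusters passed to the final K-Means run.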
By running these steps we obtained 8 clusters, but the Silhouette score was quite low, so we decided it was not good enough for our solution.
An alternative method we considered was to take into account the direct correlation factor between each pair of courses taught in different semesters, which, in our opinion, is relevant for a student's trajectory. In order to obtain these correlation factors we had to implement some database side
1 https://pandas.pydata.org
Recommendation System for Student Academic Progress
287