Test Quality Assessment and Adaptive Algorithm Based on IRT Models
Alexander A. Kostikov^1,a, Kateryna V. Vlasenko^2,3,b, Iryna V. Lovianova^4,c, Vadim V. Khoroshailo^1,d and Natalia S. Hrudkina^1,e

^1 Donbass State Engineering Academy, 72 Academichna Str., Kramatorsk, 84313, Ukraine
^2 National University of “Kyiv Mohyla Academy”, 2 Skovorody Str., Kyiv, 04070, Ukraine
^3 Limited Liability Company Technical University “Metinvest Polytechnic”, 80 Pivdenne Hwy, Zaporizhzhia, 69008, Ukraine
^4 Kryvyi Rih State Pedagogical University, 54 Gagarin Ave., Kryvyi Rih, 50086, Ukraine

^a https://orcid.org/0000-0003-3503-4836
^b https://orcid.org/0000-0002-8920-5680
^c https://orcid.org/0000-0003-3186-2837
^d https://orcid.org/0000-0001-6539-8329
^e https://orcid.org/0000-0002-0914-8875
Keywords: Adaptive Algorithm, Rasch Model, Item Response Theory (IRT), Information Function of Test, Latent Variables, Birnbaum Model.
Abstract: In this paper, an algorithm for adaptive testing of students' knowledge in distance learning is proposed and its effectiveness in the educational process is assessed. The results of the study are based on the achievements of modern testing theory (IRT). The objective of the study was to build test items that allow the student's achievements to be assessed adequately and to automate the assessment process using an adaptive algorithm. To achieve this goal, mathematical models of modern testing theory were used, namely the Rasch model and the 2-PL and 3-PL Birnbaum models. The important outcomes of this work are a thorough analysis of the developed test items, identification of their shortcomings, and automation of the process of assessing students' knowledge using an adaptive algorithm based on the methods of modern testing theory. The paper provides an overview of the results of the application of modern test theory, a description and block diagram of the proposed algorithm, and the results of its application in the real educational process. The effectiveness of using this algorithm for the objective assessment of students' knowledge has been shown experimentally. The test quality has been assessed using the IRT models.
1 INTRODUCTION
1.1 Motivation and Research Challenges of the Study
Important components of the educational process are the quality control of the assimilation of knowledge in higher education and the assessment of the degree to which students achieve their educational goals. Recently, distance learning has become widespread, and it has become especially relevant in connection with the COVID-19 pandemic. In this regard, there is a problem of the adequacy of knowledge assessment with the help of computer testing. The solution to this problem is of great importance, because it allows one to reduce the time for knowledge control, facilitate the work of teachers, and obtain the assessment result immediately.
While solving this problem, we faced the follow-
ing challenges:
1. Choice of research methodology.
2. Experimental verification of the research results.
Our study relies on the results of modern testing theory, which allow us to adequately assess the quality of test items and to create effective adaptive knowledge assessment algorithms based on them. To test knowledge, test items of varying difficulty were developed for the discipline “Higher Mathematics” in order to assess the level of knowledge of students with different levels of training.
1.2 Problem Statement
Modern approaches to assessing students’ academic
achievements are based on the use of classical testing
theory and Item Response Theory (IRT). The mathematical background of pedagogical measurement theory was created in the works of (Andersen, 1973; Andrich, 2021; Avanesov, 1980; Guttman, 1944; Lord et al., 1968; Maslak et al., 2005; Rasch, 1960; Wright and Linacre, 1987; Wright and Masters, 1982). IRT uses the concept of a latent variable. The term “latent variable (parameter)” is usually understood as a theoretical concept that characterizes a certain hidden property or quality (for example, the level of a student's ability or the difficulty of a test item) which cannot be measured directly. The advantages of classical testing theory are the provision of information about the indicators of the knowledge quality of the participants, the clarity of the performed calculations and the simple interpretation of the processing data. Its main disadvantage is the dependence of the estimates of the participants' parameters on the difficulty of the proposed items. Application of IRT, based on Rasch models, makes the estimated values of the latent parameter “ability level” of the participants independent of the values of the “item difficulty”. This helps to increase the objectivity of the obtained assessments of the students' ability level and allows effective knowledge assessment algorithms to be built.
The purpose of this paper is to develop an adaptive testing algorithm for the objective assessment of students' knowledge in distance learning, which has become especially relevant under the COVID-19 quarantine.
1.3 State of the Art and Review
The educational standards of the new generation are based on a competency-based approach to assessing the quality of a student's training: what is tested is not primarily the student's knowledge, but their readiness to apply it in practice, to act productively in a non-standard situation and to create the required mode of action. Therefore, the quality of training is understood as the degree of the student's readiness to demonstrate the relevant competencies. Generalizing the world experience of implementing the competence-based approach to assessing learning outcomes allows us to draw the following conclusions, which determine the main approaches to assessing the level of competence mastery:
1. Competencies are dynamic: they are not an invariable quality in the structure of a pupil's personality, but are able to develop, improve or completely disappear in the absence of an incentive to manifest them. Therefore, we can talk about the level of competence, assess it quantitatively and monitor it.
2. When assessing learning outcomes, it is necessary to consider them in dynamics, which requires diagnostics of the educational process using monitoring procedures.
3. The level of possession of a competence is a hidden (latent) parameter of the pupil and is not amenable to direct measurement; it can only be estimated with a certain probability. Therefore, a probabilistic approach should be used when evaluating it.
It follows that, in order to create tools for the automated assessment of learning outcomes, it is necessary, first of all, to solve two problems:
1. develop theoretical and methodological foundations for modeling and parameterization of the learning process and of the diagnostic tools used to evaluate its results;
2. theoretically substantiate and implement software-algorithmic means for processing the results of participants' diagnostics (testing, questionnaires), as well as tools for assessing learning outcomes and the quality of diagnostic tools.
The theoretical and methodological basis for solving these problems was provided, first of all, by the studies of Brown (Brown, 1910), Cronbach (Cronbach, 1951), Guilford (Guilford, 1942), Gulliksen (Gulliksen, 1986), Guttman (Guttman, 1944), Kuder and Richardson (Kuder and Richardson, 1937), Luce and Tukey (Luce and Tukey, 1964), Lord et al. (Lord et al., 1968), Sax (Sax, 1989) and Spearman (Spearman, 1910). They developed the theoretical foundations for the creation of diagnostic materials and the classical approach to processing, analyzing and interpreting diagnostic results: the conceptual apparatus of classical test theory, criteria and indicators of the quality of diagnostic tools, and the methodological basics of their design and quality expertise. The issues of scaling and comparison of processing data have also been deeply investigated.
The theoretical basis for the creation of tools for automatic assessment of the results of the educational process received further development with the creation of Item Response Theory (IRT), the foundations of which are set out in the works of (Andrich, 2021, 2005; Andersen, 1973; Bezruczko, 2005; Bond et al., 2020; Andrich et al., 2001; van der Linden and Hambleton, 1997; Ingebo, 1997; Eckes, 2011; Lord, 1980; Perline et al., 1979; Smith and Smith, 2004; Rasch, 1960; Fischer and Molenaar, 1995; Wilson, 2005; Wright and Masters, 1982; Wright, 1977; Wright and Stone, 1979; Wright and Linacre, 1987).
Currently, IRT mathematical models are widely used to assess the quality of test items. In (Tjabolo and Otaya, 2019) the quality of school exam questions was assessed using 1-PL, 2-PL and 3-PL models. As a result of the study, school exam questions were classified into two categories (good and bad) based on the value of the difficulty level of the test items.
In (Amelia and Kriswantoro, 2017) the quality of items in chemistry was also assessed using 1-PL, 2-PL and 3-PL models. With these models, estimates of the students' ability level and of the difficulty level of the test items were obtained, and the difference in the obtained estimates was analyzed. Various adaptive algorithms based on the Rasch model have been proposed in (Al-A'ali, 2006; Zaqoot et al., 2021). Despite the large number of papers devoted to the creation of adaptive algorithms based on IRT models, we could hardly find any reference to works that consider an adaptive algorithm selecting the model that best fits the test data. The adaptive algorithm proposed by us can use any of three IRT models: 1-PL, 2-PL and 3-PL.
2 ALGORITHM OF ADAPTIVE
TESTING BASED ON RASCH
MODEL
Adaptive testing is a type of testing in which the order of presentation of test items and the difficulty of the next item depend on the participant's answers to previous items. Adaptive testing systems are based on statistical models. Very easy and very difficult items are uninformative; therefore, for most tests, the optimal difficulty level corresponds to items that about half of the test participants answer correctly. The difficulty of the test items is determined experimentally: the measurement consists of determining the percentage of participants who gave the correct answer to the item in previous experiments. The problem of developing adaptive algorithms has been considered in (Weiss, 1982; Al-A'ali, 2006; Weiss, 2004).
The Rasch model was used to construct the adaptive testing algorithm. This model is defined by the formula

$$P_{ni} = \frac{\exp(\theta_n - \beta_i)}{1 + \exp(\theta_n - \beta_i)}, \qquad (1)$$

where P_ni is the probability that participant n, n = 1, ..., N, with ability θ_n correctly performs item i, i = 1, ..., I, with difficulty β_i.
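As an illustration, formula (1) can be coded directly; the following minimal R sketch (the function name and the example values are ours) computes the probability of a correct answer:

# Rasch model, formula (1): probability of a correct answer
# theta - ability of the participant (logits), beta - item difficulty (logits)
rasch_prob <- function(theta, beta) {
  exp(theta - beta) / (1 + exp(theta - beta))
}
# Example: a participant with theta = 0.5 and an item with beta = -1
rasch_prob(0.5, -1)   # about 0.82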
To start the algorithm, it is necessary to determine the initial difficulty levels. To this end, at the beginning of the testing session, primary information about the participant's level of preparation is accumulated: the participant receives N_p items of average difficulty. The items used to determine the initial level of the participant are chosen by the teacher. Then, using the received answers, the initial estimate of the student's ability level is calculated, and the current values of the difficulty levels of the test items are recalculated.
The initial estimate of the ability level of the i-th student (in logits) is given by the formula

$$\theta_i^0 = \ln\frac{p_i}{q_i}, \quad i = 1, 2, \ldots, N, \qquad (2)$$

where N is the number of test participants, p_i is the proportion of correct answers of the i-th participant to all items, and q_i = 1 − p_i is the proportion of incorrect answers.
The difficulty level of the test items (in logits) is determined by the formula

$$\beta_j^0 = \ln\frac{q_j}{p_j}, \quad j = 1, 2, \ldots, M, \qquad (3)$$

where M is the number of test items, p_j is the proportion of correct answers of all participants to the j-th test item, and q_j is the proportion of incorrect answers.
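A possible R sketch of formulas (2) and (3), assuming the responses are stored in a 0/1 matrix resp with one row per participant and one column per item (the variable names are ours):

# resp: binary matrix, rows = participants, columns = items
p_i <- rowMeans(resp)            # proportion of correct answers of each participant
theta0 <- log(p_i / (1 - p_i))   # initial ability logits, formula (2)
p_j <- colMeans(resp)            # proportion of correct answers to each item
beta0 <- log((1 - p_j) / p_j)    # initial difficulty logits, formula (3)
# Participants or items with all answers correct (or all incorrect) give
# infinite logits and are excluded later, as described in section 3.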
At the next stage, the initial values θ^0_i of the ability level of the participants and the initial values β^0_j of the difficulty level of the test items (both in logits) are reduced to the same interval scale (Lord et al., 1968). The formula for this transition is based on the idea of reducing the impact of item difficulty on the estimates of the test participants.
First we calculate the average value of the initial logits of the students' ability level

$$\bar{\theta} = \frac{\sum_{i=1}^{N} \theta_i^0}{N}$$

and the standard deviation V of the distribution of the initial values of the parameter θ, where

$$V^2 = \frac{\sum_{i=1}^{N} \left(\theta_i^0 - \bar{\theta}\right)^2}{N - 1}.$$

We then obtain a formula for calculating the difficulty level logit of the j-th item:

$$\beta_j = \bar{\theta} + Y \cdot \beta_j^0, \quad j = 1, \ldots, M, \qquad (4)$$

where

$$Y = \left(1 + \frac{V^2}{2.89}\right)^{1/2}.$$
Similarly, calculating

$$\bar{\beta} = \frac{\sum_{j=1}^{M} \beta_j^0}{M}, \qquad W = \sqrt{\frac{\sum_{j=1}^{M} \left(\beta_j^0 - \bar{\beta}\right)^2}{M - 1}},$$

we get the formula for calculating the ability level logit of the i-th student:

$$\theta_i = \bar{\beta} + X \cdot \theta_i^0, \quad i = 1, \ldots, N, \qquad (5)$$

where

$$X = \left(1 + \frac{W^2}{2.89}\right)^{1/2}.$$
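Continuing the same R sketch, formulas (4) and (5) can be written as follows (infinite initial logits are dropped first):

theta0f <- theta0[is.finite(theta0)]
beta0f <- beta0[is.finite(beta0)]
V2 <- var(theta0f)    # variance of the initial ability logits (denominator N - 1)
W2 <- var(beta0f)     # variance of the initial difficulty logits (denominator M - 1)
Y <- sqrt(1 + V2 / 2.89)
X <- sqrt(1 + W2 / 2.89)
beta_scaled <- mean(theta0f) + Y * beta0f     # formula (4)
theta_scaled <- mean(beta0f) + X * theta0f    # formula (5)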
The obtained values allow the students' ability level to be compared with the item difficulty level. If θ_i − β_j is negative and large in modulus, then the item with difficulty β_j is too difficult for a student with ability level θ_i, and it will not be useful for measuring the knowledge level of the i-th student. If this difference is positive and large in modulus, then the item is too easy and has long been mastered by the student. If θ_i = β_j, then the probability that the student correctly completes the item is equal to 0.5.
The information function I_i(θ) of the i-th item for the Rasch model (1) is defined as the product of the probability of a correct answer P_i(θ) to this item and the probability of an incorrect answer Q_i(θ) (Lord et al., 1968):

$$I_i(\theta) = P_i(\theta) \cdot Q_i(\theta). \qquad (6)$$
Figure 1 shows the information function of the i-th item. It can be seen that a test item whose answer all students know does not provide any information, nor does an item whose answer nobody knows. We get useful information when some participants know the answer to the item and some do not.
Figure 1: Information function of the test task.
The information function of the test is calculated as the sum of the information functions of the test items (Lord et al., 1968):

$$I(\theta) = D^2 \cdot \sum_{j=1}^{M} I_j(\theta), \qquad (7)$$

where D = 1.7 is the correction factor necessary to approximate the logistic distribution to the normal distribution law.
After calculating the information function, the measurement error SE is calculated; its value is used to check the condition for ending the testing procedure. In the Rasch model, the measurement error depends on the ability level θ and is calculated by the formula (Lord et al., 1968):

$$SE(\theta) = \frac{1}{\sqrt{I(\theta)}}. \qquad (8)$$
If the error takes a value less than the threshold set by the teacher, the adaptive testing algorithm ends. Otherwise, the next test item is selected using the value of θ_i calculated by formula (5). The next item is the one whose difficulty level is closest to the current estimate of the participant's ability level. Such an item makes the largest information contribution, and its choice reduces the total number of required test items.
Thus, the developed adaptive testing algorithm consists of the following stages:
1. Selection of 5 items of average difficulty from the bank of questions, determined by the teacher.
2. Finding the initial level of the student's ability θ^0_i and the initial difficulty level of the items β^0_j by formulas (2) and (3).
3. Reduction of the obtained initial values θ^0_i and β^0_j to a single interval scale using formulas (5) and (4).
4. Calculation of the information function of the test items answered by the student using formulas (6) and (7).
5. Finding the measurement error by formula (8).
6. If the measurement error is less than the threshold, the adaptive testing is completed.
7. If not, the next item is selected from the condition |θ_i − β_j| → min.
8. The algorithm is then repeated starting from step 3.
The block diagram of the algorithm is shown in
figure 2. The proposed algorithm can use any of three
models: 1-PL, 2-PL and 3-PL.
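The following R sketch outlines the loop of figure 2 under the Rasch model. It is a simplification of the described procedure: bank, answer, the error threshold se_max and the one-step Newton update of the ability estimate are our own assumptions, not part of the implemented system.

# bank   - data frame with columns item (id) and beta (scaled difficulty in logits)
# answer - function(item_id) returning 1 (correct) or 0 (incorrect)
adaptive_test <- function(bank, answer, theta0 = 0, se_max = 0.4, max_items = 20) {
  asked <- integer(0); resp <- integer(0)
  theta <- theta0; se <- Inf
  repeat {
    free <- setdiff(seq_len(nrow(bank)), asked)
    if (length(free) == 0 || length(asked) >= max_items) break
    nxt <- free[which.min(abs(bank$beta[free] - theta))]    # closest difficulty, step 7
    asked <- c(asked, nxt)
    resp <- c(resp, answer(bank$item[nxt]))
    p <- 1 / (1 + exp(-(theta - bank$beta[asked])))   # Rasch probabilities, formula (1)
    info <- 1.7^2 * sum(p * (1 - p))                  # test information, formula (7)
    se <- 1 / sqrt(info)                              # measurement error, formula (8)
    theta <- theta + sum(resp - p) / sum(p * (1 - p)) # one Newton step for the ability
    if (se < se_max) break                            # step 6: stop when the error is small
  }
  list(theta = theta, se = se, items = bank$item[asked])
}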
Figure 2: Block diagram of the adaptive testing algorithm.
3 RESULTS OF TEST ANALYSIS BASED ON THE RASCH MODEL
Let us consider the procedure for calculating the student ability level parameters θ_i and the item difficulty parameters β_j from empirical data. As initial data we take the results of testing of students in the Moodle system in the discipline “Higher Mathematics” at the Mathematics and Modeling Department of the Donbass State Engineering Academy (table 1). Table 1 shows the records of the first 10 test participants; a total of 50 participants took part in the testing.
The test in this discipline consisted of 20 questions. First, it is necessary to calculate the proportions of correct answers p_i and incorrect answers q_i of the participants. These values are calculated by the formulas

$$p_i = \frac{R_i}{n}, \qquad q_i = 1 - p_i, \qquad (9)$$

where R_i is the number of correct answers given by the i-th participant, i = 1, 2, ..., N, and n is the number of items in the test. For example, for the first participant we have

$$p_1 = \frac{18}{20} = 0.9, \qquad q_1 = 1 - 0.9 = 0.1.$$

The values p_i and q_i are given in table 1.
Next, the initial values θ^0_i of the ability level of the participants are calculated by formula (2). For the first participant we have

$$\theta_1^0 = \ln\frac{0.9}{0.1} = 2.197.$$
Using the statistics module of Moodle, the following characteristics were obtained for the test items: facility index (F), standard deviation (SD), random guess score (RGS), intended weight, effective weight, distinction and distinguishing efficiency. These data are shown in table 2.
Based on the data in table 2, we can estimate the initial values of the item difficulty parameter. By formula (3), for the first item we obtain

$$\beta_1^0 = \ln\frac{2}{98} = -3.892.$$
Table 1: Test results in the Moodle system in the discipline “Higher Mathematics” of the Mathematics and Modeling Department of the Donbass State Engineering Academy.
Participant's number  Score  Number of correct answers  p_i  q_i  θ^0_i
1 90 18 0.9 0.1 2.197225
2 75 15 0.75 0.25 1.098612
3 85 17 0.85 0.15 1.734601
4 100 20 1 0
5 75 15 0.75 0.25 1.098612
6 100 20 1 0
7 90 18 0.9 0.1 2.197225
8 90 18 0.9 0.1 2.197225
9 70 14 0.7 0.3 0.847298
10 85 17 0.85 0.15 1.734601
Table 2: Statistical characteristics obtained using the statistical module of the Moodle system based on the results of final
testing in the discipline “Higher Mathematics”.
Q# F SD RGS Intended weight Effective weight Distinction Distinguishing efficiency
1 98.00% 14.14% 33.33% 5.00% -11.54% -28.62%
2 94.00% 23.99% 33.33% 5.00% 3.41% 6.93% 11.28%
3 90.00% 30.30% 16.67% 5.00% 6.75% 44.07% 65.85%
4 94.00% 23.99% 20.00% 5.00% 4.66% 22.91% 39.11%
5 96.00% 19.79% 20.00% 5.00% 3.34% 11.72% 23.66%
6 90.00% 30.30% 14.29% 5.00% 3.18% -1.53% -2.22%
7 92.00% 27.40% 14.29% 5.00% 6.32% 43.38% 70.79%
8 84.00% 37.03% 20.00% 5.00% 6.48% 26.08% 35.44%
9 88.00% 32.83% 20.00% 5.00% 5.32% 17.26% 23.76%
10 74.00% 44.31% 20.00% 5.00% 9.75% 68.31% 84.84%
11 98.00% 14.14% 20.00% 5.00% 2.85% 14.64% 35.69%
12 100.00% 0.00% 16.67% 5.00% 0.00%
13 94.00% 23.99% 33.33% 5.00% 4.93% 27.00% 45.87%
14 90.00% 30.30% 33.33% 5.00% 5.51% 23.81% 34.88%
15 88.00% 32.83% 25.00% 5.00% 5.32% 17.26% 23.76%
16 90.00% 30.30% 33.33% 5.00% 5.51% 23.81% 33.33%
17 42.00% 49.86% 20.00% 5.00% 5.23% -2.60% -3.57%
18 80.00% 40.41% 33.33% 5.00% 8.11% 45.46% 56.25%
19 56.00% 50.14% 20.00% 5.00% 7.01% 13.80% 17.23%
20 82.00% 38.81% 20.00% 5.00% 6.32% 21.10% 27.68%
The results of calculations of the initial values of the item difficulty parameter are given in table 3. As can be seen from table 3, all participants answered the 12th item correctly, so its estimate is equal to minus infinity. In practice, already at β < −6 the probability of a correct answer is close to one; such items are performed by all participants and become redundant. Items with β > 6 are also useless: they will not be solved by any participant and carry no information about differences in the students' ability levels.
In tables 1 and 3, the values of the parameters θ^0_i and β^0_j are on different interval scales. In order to reduce them to a single scale of standard estimates, it is necessary to calculate the variances V^2 and W^2 using the data from tables 1 and 3. Infinite values are excluded from consideration.
Calculating the variances, we obtain

$$V^2 = \frac{\sum_{i=1}^{N} \left(\theta_i^0 - \bar{\theta}\right)^2}{N - 1} = 0.634, \qquad W^2 = \frac{\sum_{j=1}^{M} \left(\beta_j^0 - \bar{\beta}\right)^2}{M - 1} = 4.873.$$

Next, we calculate the angular coefficients (Lord et al., 1968):

$$Y = \left(1 + \frac{V^2}{2.89}\right)^{1/2} = 1.104, \qquad X = \left(1 + \frac{W^2}{2.89}\right)^{1/2} = 1.63.$$

Table 3: Initial values β^0_j of the item difficulty parameter.
Q#  Progress  p_j  q_j  β^0_j
1  98.00%  0.98  0.02  -3.89182
2  94.00%  0.94  0.06  -2.75154
3  90.00%  0.90  0.10  -2.19722
4  94.00%  0.94  0.06  -2.75154
5  96.00%  0.96  0.04  -3.17805
6  90.00%  0.90  0.10  -2.19722
7  92.00%  0.92  0.08  -2.44235
8  84.00%  0.84  0.16  -1.65823
9  88.00%  0.88  0.12  -1.99243
10  74.00%  0.74  0.26  -1.04597
11  98.00%  0.98  0.02  -3.89182
12  100.00%  1.00  0.00  -∞
13  94.00%  0.94  0.06  -2.75154
14  90.00%  0.90  0.10  -2.19722
15  88.00%  0.88  0.12  -1.99243
16  90.00%  0.90  0.10  -2.19722
17  42.00%  0.42  0.58  0.322773
18  80.00%  0.80  0.20  -1.38629
19  56.00%  0.56  0.44  -0.24116
20  82.00%  0.82  0.18  -1.51635
Next, using the formulas

$$\theta_i = -2.103 + 1.104\,\theta_i^0, \qquad \beta_j = 1.86 + 1.63\,\beta_j^0,$$

we calculate the scaled values of the parameters θ_i and β_j. The scaled parameter values are given in tables 4 and 5. For the analysis of the quality of the test items, we create histograms of the students' ability levels and of the item difficulty levels on the basis of the obtained data. These histograms are shown in figures 3 and 4.
Figure 3: Ability levels histogram.
Figure 4: Item difficulty levels histogram.
The histograms of the distributions of ability levels and item difficulty levels are visually similar to a normal distribution, which is typical for a good test. However, the distribution of item difficulty levels has its maximum at negative values, which indicates that there are more easy than difficult items in the test. The presence of a large number of easy items leads to inflated estimates of the students' level of preparation. This is clearly seen from the histogram of the ability levels of the test participants: the range of ability levels is from -1.8 to 2.3 logits, while the range of item difficulty levels is from -1.8 to 1.2 logits. The sum of the scaled difficulty levels of the test items is -27.93, which means that the test items are on the whole very easy. This test is not balanced; it contains too many easy items. One should strive to make this sum close to zero. Thus, the assessment of latent parameters makes it possible to identify non-informative items that should be excluded from the quiz. The use of the developed adaptive algorithm will make it possible to assess the level of students' knowledge objectively.
Now, for all test items, we construct characteristic curves using the relation

$$P_j = \frac{1}{1 + \exp(-1.702(\theta - \beta_j))}, \qquad (10)$$

where P_j is the probability that a participant with ability θ correctly performs item j, j = 1, ..., M, with difficulty β_j.
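For illustration, curves of the form (10) can be drawn in R; the short vector of difficulties below is only a sample of the values from table 4:

theta <- seq(-6, 6, by = 0.1)
betas <- c(-4.48, -2.63, -1.72, -0.84, 0.16, 2.39)   # several difficulties from table 4
plot(NULL, xlim = range(theta), ylim = c(0, 1),
     xlab = "ability level, logits", ylab = "probability of a correct answer")
for (b in betas) {
  lines(theta, 1 / (1 + exp(-1.702 * (theta - b))))  # characteristic curve, formula (10)
}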
The graph of these characteristic curves is shown in figure 5. It can be seen that the characteristic curves for items 1-2, 4-6, 8-11 and 12-13 coincide, while the curves for items 17-18 and 18-19 are more than 0.5 logits apart. Thus, the characteristic curves are unevenly spaced. Since items with the same difficulty level do not provide additional information when measuring a given level of knowledge, only one of the items that coincide in difficulty should be kept and the rest should be deleted; it is necessary to keep the item that is most different from the remaining items in the test. To create a quality test, it is necessary to remove items 1, 5, 6, 10, 11 and 12 from the test and to add items with difficulties lying in the interval between the difficulties of items 17-18 and 18-19.

Figure 5: Characteristic curves of test items.

Table 4: Scaled values of the item difficulty parameter β_j.
Q#  β^0_j  β_j
1  -3.89182  -4.48367
2  -2.75154  -2.625
3  -2.19722  -1.72148
4  -2.75154  -2.625
5  -3.17805  -3.32023
6  -2.19722  -1.72148
7  -2.44235  -2.12103
8  -1.65823  -0.84291
9  -1.99243  -1.38766
10  -1.04597  0.155071
11  -3.89182  -4.48367
13  -2.75154  -2.625
14  -2.19722  -1.72148
15  -1.99243  -1.38766
16  -2.19722  -1.72148
17  0.322773  2.386121
18  -1.38629  -0.39966
19  -0.24116  1.466906
20  -1.51635  -0.61165

Table 5: Scaled values of the ability level θ_i.
Participant's number  θ^0_i  θ_i
1  2.197225  0.322736
2  1.098612  -0.89013
3  1.734601  -0.188
5  1.098612  -0.89013
7  2.197225  0.322736
8  2.197225  0.322736
9  0.847298  -1.16758
10  1.734601  -0.188
The graph of the information function of the test items and of the test as a whole, defined by formulas (6) and (7), is shown in figure 6. The information function has one clearly expressed maximum, which is a sign of a “good” test. However, it can be seen that the test contains many easy items with difficulties in the interval (-3; -2), which can be excluded from the test. The test also contains many easy items with equal difficulties, which can likewise be excluded without loss of information content. At the same time, more difficult items (with difficulties of 1-2 logits) are clearly lacking, so more complex items need to be added.
Figure 6: Information functions of the test and test items.
The graph of the measurement error, depending on
the level of training, is shown in figure 7.
Figure 7: Measurement error graph.
It can be seen from the graph that the measure-
ment error is large for the values of the ability in the
interval (2, 4), which is associated with the lack of test
items of increased difficulty.
4 RESULTS OF TEST ANALYSIS BY 2PL AND 3PL BIRNBAUM MODELS
The two-parameter (2PL) Birnbaum model differs from the Rasch model by the presence of the parameter a_j, which characterizes the differentiating ability of the j-th item. According to this model, the probability of a correct answer by an examinee with ability level θ to a test item with difficulty β_j is determined by the formula

$$P_j(\theta) = \frac{1}{1 + \exp(-1.7 a_j (\theta - \beta_j))}.$$
The parameter a_j is defined by the relation

$$a_j = \frac{(r_{bis})_j}{\sqrt{1 - (r_{bis})_j^2}},$$

where (r_bis)_j is the biserial correlation coefficient of the j-th item. Often, instead of this coefficient, the point biserial coefficient r_pb^j is used, i.e. the correlation coefficient of each item with the student's individual score:

$$r_{pb}^{j} = \frac{\bar{X}_1 - \bar{X}_0}{s_x}\sqrt{\frac{n_1 n_0}{n(n-1)}}. \qquad (11)$$
Here n_1 is the number of students who completed this item; n_0 is the number of students who did not complete it; n = n_0 + n_1 is the total number of students; X̄_1 is the average individual score of the students who coped with the given item (the ratio of the sum of individual scores of the students who completed this item to n_1); X̄_0 is the average individual score of the students who did not cope with this item (the ratio of the sum of individual scores of the students who did not complete this item to n_0); s_x is the standard deviation of the individual scores of all students.
So, we will assume that

$$a_j \approx \frac{r_{pb}^{j}}{\sqrt{1 - \left(r_{pb}^{j}\right)^2}}. \qquad (12)$$
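A possible R sketch of formulas (11) and (12) for the same 0/1 response matrix resp as above; the individual score of a student is taken here as the row sum of correct answers:

total <- rowSums(resp)     # individual scores of the students
a_est <- sapply(seq_len(ncol(resp)), function(j) {
  x <- resp[, j]
  n1 <- sum(x == 1); n0 <- sum(x == 0); n <- n1 + n0
  r_pb <- (mean(total[x == 1]) - mean(total[x == 0])) / sd(total) *
    sqrt(n1 * n0 / (n * (n - 1)))              # point biserial coefficient, formula (11)
  r_pb / sqrt(1 - r_pb^2)                      # discrimination parameter, formula (12)
})
# Items answered correctly by all (or by no) students yield NaN here and are
# excluded from further analysis.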
The parameter a_j is directly proportional to the slope of the characteristic curve at the inflection point. The greater the value of this parameter, the steeper the characteristic curve and, therefore, the greater the differentiating ability of the item. Therefore, to compare the students' levels of knowledge with each other, it is important to select items depending on the values of the parameter a_j.
Table 6 shows the values of the point biserial coefficient r_pb^j, the parameter a_j and the item difficulties β_j. These parameters are calculated by formulas (11), (12) and (4), respectively.
Table 6: The values of the parameters r_pb^j, a_j and β_j.
Q#      7      3      6      14     16     9      15     8      20     18     10     19     17
β_j    -2.12  -1.72  -1.72  -1.72  -1.72  -1.39  -1.39  -0.84  -0.61  -0.40  0.16   1.47   2.39
r_pb^j -0.01  -0.14  0.00   0.26   0.43   0.40   0.64   0.26   0.59   0.65   0.65   0.49   0.35
a_j    -0.01  -0.15  0.00   0.27   0.47   0.43   0.84   0.27   0.73   0.84   0.85   0.56   0.37

To obtain a test with good distinguishing ability, we use the following recommendations for selecting items. First of all, it is necessary to exclude items 7 and 3, which have a negative value of the discrimination parameter. This means that examinees with a low level of knowledge answer them well while examinees with a high level of knowledge answer them poorly, which is contrary to common sense and is caused by guessing, when a student with a low level of knowledge randomly selects the correct answer. In addition, it is necessary to select items with sufficiently large values of a_j, from the interval (0.5; 2.5); from this point of view, items 6, 14 and 8 are the worst in the test. Further analysis involves selecting the items with the greatest differentiating ability among items of equal difficulty.
Consider items 9 and 15, which have the same difficulty and differ in the parameter a_j: a_9 = 0.43, a_15 = 0.84. According to the one-parameter Rasch model, both items have the same probability curve for a correct answer (curve 1 in figure 8), that is, from the point of view of differentiating ability they are indistinguishable. In the case of the two-parameter model, we obtain two different characteristic curves: a steeper one (2) for item 15 and a less steep one (3) for item 9. Thus, when minimizing the length of the test, item 15 is preferable.
Figure 8: Comparison of the Rasch model and the two-
parameter Birnbaum model for items 9 and 15.
The test in question is a closed-type test with the choice of a single correct answer out of five offered for each item. In such cases, in order to reduce the guessing effect, it is proposed to use the three-parameter Birnbaum model.
This Birnbaum model contains one more parameter, c_j, which characterizes the probability of a correct answer to item j if this answer is guessed and not based on knowledge. In this case, the probability of a correct answer to item j is expressed by the formula

$$P_j(\theta) = c_j + (1 - c_j)\left(1 + \exp(-1.7 a_j (\theta - \beta_j))\right)^{-1},$$

where c_j = 1/k_j and k_j is the number of answer options for item j. In the test under consideration k_j = 5, so c_j = 0.2.
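As a small R sketch (the function and parameter names are ours), the three-parameter model reads:

# 3PL Birnbaum model: probability of a correct answer
p3pl <- function(theta, a, b, c) {
  c + (1 - c) / (1 + exp(-1.7 * a * (theta - b)))
}
# Example: item 15 with a = 0.84, beta = -1.39 and c = 0.2 (five answer options)
p3pl(0, 0.84, -1.39, 0.2)   # about 0.90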
The characteristic curves of these items approach the line P_j(θ) = c_j as θ decreases, so the curves themselves become flatter, which leads to a decrease in the differentiating ability of the test.
Figure 9 shows the probability curves of a correct answer to item 15 as functions of the ability level θ, corresponding to the Rasch model (curve 1), the two-parameter model (curve 3) and the three-parameter model (curve 2).
Figure 9: Characteristic curves for item 15 according to
Rasch and Birnbaum models.
Let us compare the information functions for the Rasch and Birnbaum models. For the Rasch model, the information function is determined by relation (6). For the two-parameter Birnbaum model, the information function is given by the expression

$$I_j(\theta) = 2.89\, a_j^2\, P_j(\theta)\, Q_j(\theta).$$

For the three-parameter Birnbaum model, the information function has the form

$$I_j(\theta) = \frac{2.89\, a_j^2\, (1 - c_j)}{R_j(\theta)\, Q_j(\theta)},$$

where

$$R_j(\theta) = c_j + \exp(1.7 a_j (\theta - \beta_j)), \qquad Q_j(\theta) = \left(1 + \exp(-1.7 a_j (\theta - \beta_j))\right)^2.$$
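These expressions translate directly into R (a sketch with our own function names):

info_2pl <- function(theta, a, b) {
  p <- 1 / (1 + exp(-1.7 * a * (theta - b)))
  2.89 * a^2 * p * (1 - p)                 # 2PL item information
}
info_3pl <- function(theta, a, b, c) {
  R <- c + exp(1.7 * a * (theta - b))
  Q <- (1 + exp(-1.7 * a * (theta - b)))^2
  2.89 * a^2 * (1 - c) / (R * Q)           # 3PL item information
}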
The maximum value of the information function for the Rasch model and for the two-parameter Birnbaum model is reached at the inflection point of the characteristic curve, that is, when the difficulty (in logits) is equal to the ability level (in logits). Thus, for a student with ability θ_i, the items with difficulty values β from the neighborhood of the point θ_i are the most informative.
Figure 10 shows, for items 9 and 15 of the test, the information functions according to the Rasch model (curve 1, common for the two items) and according to the two-parameter model for item 9 (curve 2) and for item 15 (curve 3). The difficulty of these items is equal to -1.39, therefore they are most informative for ability values close to -1.39.
Figure 10: Information functions for items 9 and 15.
In the case of the three-parameter Birnbaum model, the maximum of the information function is reached at the point

$$\theta_{max} = \beta_j + \frac{1}{1.7\, a_j}\ln\left(0.5 + 0.5\sqrt{1 + 8 c_j}\right).$$
For item 9, with difficulty β = -1.39, the maximum of the information function is reached at the point θ_max ≈ -1.02, and for item 15, with the same difficulty, at the point θ_max ≈ -1.2. The information function of the entire test is determined by formula (7). It must have one clearly defined maximum; otherwise the test needs to be improved, and items with difficulties corresponding to the failure areas of the information function should be added to it.
Figure 11 shows the information functions of the entire test based on the Rasch model, the two-parameter (2PL) model and the three-parameter (3PL) Birnbaum model. In the test under consideration, this condition is satisfied, because each curve has one maximum point.
Figure 11: Information functions of test.
So, for the test in the discipline “Higher Mathematics”, two Birnbaum models were built: two- and three-parameter. If the test is to correspond to these models, it is necessary to get rid of some items and to change others. In particular, a norm-referenced test should have a sufficiently high differentiating ability. Therefore, items 9 and 14 should be removed: they are identical in difficulty to items 15 and 16, respectively, but have lower differentiating ability. In addition, it is necessary to replace or change items 6, 8 and 17 so that their differentiating ability increases. Items 7 and 3, which have a negative value of the discrimination parameter, should be removed. As regards the information functions, the graphs built according to the three models, including the Rasch model, did not reveal any contradictions between theory and experiment.
To use an IRT model for assessing the quality of test items, it is necessary that the test results be adequate to this model. This adequacy is assessed using Pearson's chi-squared test. To calculate the value χ²_{β_j} for the j-th item, all test participants are divided according to their ability level into Q intervals, and for each interval the average value θ̄_q of their ability level is found. The algorithm for dividing the range of values of θ into intervals is constructed in such a way that, if the ability level of any participant coincides with an interval border, the border is shifted to the right so that this ability value falls into the previous interval. The statistic χ²_{β_j} is calculated from the expression (Baker and Kim, 2017)
$$\chi^2_{\beta_j} = \sum_{q=1}^{Q} \frac{(x_{qj} - T_q P_{qj})^2}{\delta^2_{\beta_j}}, \qquad (13)$$

where x_qj is the number of test participants with ability level θ̄_q who correctly answered the j-th test item, T_q is the total number of test participants with ability level θ̄_q, and the value δ_{β_j} is calculated by the formula

$$\delta_{\beta_j} = \sqrt{\sum_{q} T_q P_{qj} (1 - P_{qj})}. \qquad (14)$$
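A rough R sketch of how the statistic (13) could be computed by hand is given below; theta_hat, resp_j and prob_fun (the model probability P_qj for the j-th item, defined just below) are our own assumed inputs.

item_chisq <- function(resp_j, theta_hat, prob_fun, Q = 10) {
  grp <- cut(theta_hat, breaks = Q)            # split participants into Q ability intervals
  x_qj <- tapply(resp_j, grp, sum)             # correct answers in each interval
  T_q <- tapply(resp_j, grp, length)           # number of participants in each interval
  th_q <- tapply(theta_hat, grp, mean)         # mean ability in each interval
  P_qj <- prob_fun(th_q)                       # model probability for the j-th item
  delta2 <- sum(T_q * P_qj * (1 - P_qj), na.rm = TRUE)        # formula (14)
  sum((x_qj - T_q * P_qj)^2, na.rm = TRUE) / delta2           # formula (13)
}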
In formulas (13) and (14), P_qj is the probability of successful completion of the j-th item with difficulty β_j by a participant with ability level θ̄_q, as defined by the IRT models.

To assess the adequacy of the test results against the IRT models, we use the R language tools. To do this, we load the test results saved in the input.csv file using the call

data <- read.csv("input.csv")

For data analysis, we use the ltm library of the R language. To use this library, it must be installed and loaded using the commands

install.packages("ltm")
library("ltm")

After that, the following command can be used to estimate the Rasch model for the test data:

data.rasch <- rasch(data, constraint = cbind(ncol(data) + 1, 1))

To view the results of the estimation, we use the command

summary(data.rasch)
As a result, we obtain the following output:
Call:
rasch(data = data,
constraint = cbind(ncol(data) + 1, 1))
Model Summary:
log.Lik AIC BIC
-321.6515 681.3031 717.6315
Coefficients:
value std.err z.vals
Dffclt.v1 -4.3305 1.0338 -4.1888
Dffclt.v11 -4.3305 1.0338 -4.1888
Dffclt.v5 -3.5941 0.7533 -4.7712
Dffclt.v2 -3.1457 0.6323 -4.9751
Dffclt.v4 -3.1458 0.6323 -4.9751
Dffclt.v13 -3.1458 0.6323 -4.9751
Dffclt.v7 -2.8155 0.5619 -5.0104
Dffclt.v3 -2.5500 0.5151 -4.9510
Dffclt.v6 -2.5498 0.5150 -4.9509
Dffclt.v14 -2.5504 0.5151 -4.9511
Dffclt.v16 -2.5503 0.5151 -4.9511
Dffclt.v9 -2.3256 0.4813 -4.8319
Dffclt.v15 -2.3254 0.4813 -4.8318
Dffclt.v8 -1.9529 0.4355 -4.4842
Dffclt.v20 -1.7924 0.4193 -4.2750
Dffclt.v18 -1.6442 0.4060 -4.0502
Dffclt.v10 -1.2505 0.3775 -3.3122
Dffclt.v19 -0.2938 0.3435 -0.8552
Dffclt.v17 0.3859 0.3450 1.1184
Dscrmn 1.0000 NA NA
Integration:
method: Gauss-Hermite
quadrature points: 21
Optimization:
Convergence: 0
max(|grad|): 0.0014
quasi-Newton: BFGS
The output contains the following information: the log-likelihood value (log.Lik), the Akaike information criterion (AIC) and the Bayesian information criterion (BIC). AIC and BIC can be used to compare the relative fit of models on the same data: the lower the AIC and BIC values, the better the model fits the data. The output also includes the item difficulty estimates (Dffclt) with their standard errors and z statistics. Using the functions of the ltm library, we can also assess absolute model fit. This assessment can be conducted using a chi-square test of the null hypothesis that our model fits the data. To determine whether the model fits the individual items, we use the following command:

item.fit(data.rasch, simulate.p.value = FALSE)

After using this command we obtain the following output:
Item-Fit Statistics and P-values
Call:
rasch(data = data,
constraint = cbind(ncol(data) + 1, 1))
Alternative: Items do not fit the model
Ability Categories: 10
Xˆ2 Pr(>Xˆ2)
v1 12.4481 0.0527
v11 12.4481 0.0527
v5 24.2183 0.0005
v2 35.3998 <0.0001
v4 18.3807 0.0053
v13 18.3807 0.0053
v7 12.7657 0.0469
v3 19.6667 0.0032
v6 17.7808 0.0068
v14 8.0073 0.2376
v16 9.8845 0.1296
v9 15.6480 0.0158
v15 15.6579 0.0157
v8 19.6375 0.0032
v20 9.4753 0.1486
v18 15.7466 0.0152
v10 24.7008 0.0004
v19 29.4453 0.0001
v17 19.5219 0.0034
Analysing this result, we can conclude that the model does not accurately fit the responses to the individual items. Item 2 turned out to be the worst; items 19, 10 and 5 are also poorly consistent with the Rasch model. Similar results can be obtained for the 2PL Birnbaum model. To obtain estimates of the latent trait parameters, we use the following commands:

data.2pl <- ltm(data ~ z1)
summary(data.2pl)
As a result, we obtain
Call:
ltm(formula = data ˜ z1)
Model Summary:
log.Lik AIC BIC
-251.8978 579.7955 652.4524
Coefficients:
value std.err z.vals
Dffclt.v1 -3.6349 6.5440 -0.5555
Dffclt.v11 -3.6349 6.5440 -0.5555
Dffclt.v5 -2.8923 4.1718 -0.6933
Dffclt.v2 -2.3626 3.3693 -0.7012
Dffclt.v4 -2.5588 3.4556 -0.7405
Dffclt.v13 -4.1300 5.4315 -0.7604
Dffclt.v7 12.4254 34.7564 0.3575
Dffclt.v3 -9.4923 24.8491 -0.3820
Dffclt.v6 10.1668 23.5444 0.4318
Dffclt.v14 2.9872 2.4793 1.2049
Dffclt.v16 1.7069 1.2336 1.3837
Dffclt.v9 1.6604 1.0279 1.6154
Dffclt.v15 0.7883 151.3151 0.0052
Dffclt.v8 13.1975 51.8541 0.2545
Dffclt.v20 0.9232 0.5943 1.5533
Dffclt.v18 0.6679 2601.7180 0.0003
Dffclt.v10 0.4715 439.4082 0.0011
Dffclt.v19 -0.0204 65.1284 -0.0003
Dffclt.v17 -0.6392 797.0869 -0.0008
Dscrmn.v1 1.3535 1.8244 0.7419
Dscrmn.v11 1.3535 1.8244 0.7419
Dscrmn.v5 1.5005 1.5262 0.9832
Dscrmn.v2 1.7745 1.6673 1.0643
Dscrmn.v4 1.5215 1.4152 1.0751
Dscrmn.v13 0.7628 0.8333 0.9153
Dscrmn.v7 -0.1942 0.5450 -0.3563
Dscrmn.v3 0.2395 0.5988 0.4000
Dscrmn.v6 -0.2130 0.4947 -0.4305
Dscrmn.v14 -0.7469 0.5524 -1.3522
Dscrmn.v16 -1.5396 0.8178 -1.8827
Dscrmn.v9 -1.3673 0.6043 -2.2627
Dscrmn.v15 -27.8310 2871.9894 -0.0097
Dscrmn.v8 -0.1238 0.4912 -0.2521
Dscrmn.v20 -2.2078 0.9535 -2.3155
Dscrmn.v18 -36.3006 70169.1000 -0.0005
Dscrmn.v10 -44.8844 17157.7715 -0.0026
Dscrmn.v19 -38.3693 122239.5199 -0.0003
Dscrmn.v17 -33.1395 20052.6276 -0.0017
Integration:
method: Gauss-Hermite
quadrature points: 21
Optimization:
Convergence: 0
max(|grad|): 0.012
quasi-Newton: BFGS
Comparing the AIC and BIC for the Rasch model (AIC = 681.3031, BIC = 717.6315) and for the 2PL Birnbaum model (AIC = 579.7955, BIC = 652.4524), we can conclude that the 2PL model fits the test data better. However, the function did not correctly estimate the difficulty level for some items, as evidenced by the std.err values. The presence of negative values of the discrimination coefficient indicates that these items do not fit the model. To assess the fit of each item, we use the χ² test:

item.fit(data.2pl, simulate.p.value = FALSE)

The output of this command is shown below:
Item-Fit Statistics and P-values
Call:
ltm(formula = data ˜ z1)
Alternative: Items do not fit the model
Ability Categories: 10
Xˆ2 Pr(>Xˆ2)
v1 17.2949 0.0272
v11 17.2949 0.0272
v5 5.7470 0.6756
v2 8.6983 0.3684
v4 8.6199 0.3754
v13 18.0006 0.0212
v7 27.8472 0.0005
v3 32.8960 0.0001
v6 15.0469 0.0582
v14 19.2362 0.0136
v16 15.7731 0.0457
v9 19.5417 0.0122
v15 0.2435 1
v8 22.2559 0.0045
v20 11.2384 0.1886
v18 0.7094 0.9995
v10 23.5588 0.0027
v19 18.7206 0.0164
v17 2.4617 0.9635
From this output we see that the 2PL Birnbaum model did not fit items 3, 7, 8, 10 and 19. Thus, using the R language tools, it was established which items correspond to the Rasch model and to the two-parameter Birnbaum model.
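For completeness, we note that, to the best of our knowledge, the same ltm package also provides the tpm() function for the three-parameter model, so a similar analysis could be attempted with a call such as

data.3pl <- tpm(data)      # 3PL Birnbaum model, guessing parameter estimated from the data
summary(data.3pl)
item.fit(data.3pl, simulate.p.value = FALSE)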
5 DISCUSSION
The purpose of this paper was to automate the process of testing students' knowledge, which is especially relevant for distance learning. To achieve this goal, an adaptive testing algorithm based on the Rasch model was proposed and the students' knowledge assessment process was modeled using this algorithm. The results of testing their knowledge in the course “Higher Mathematics” obtained in the Moodle system were taken as the initial values of the item difficulties and of the students' ability levels.
As a result of the modeling, the levels of the students' abilities were recalculated, the information functions of the test items and of the entire test were built, and the standard measurement error was calculated as a function of the student's ability level. The analysis of the obtained results allows us to conclude that the test is not balanced and contains too many easy items, namely items 1, 3 and 11. Removing them from the test will reduce the number of test items and speed up the process of determining the student's level of training. The change in the estimate of the student's ability level in the course of testing indicates the need to introduce an adaptive testing system into the educational process, which will improve the quality of assessment of students' knowledge.
These conclusions are confirmed by the works of other authors. Thus, in (Al-A'ali, 2006) it was shown that the use of adaptive testing based on IRT made it possible to reduce the number of test items and to increase the reliability of determining the level of student readiness. The effectiveness of adaptive testing for improving the quality of pedagogical measurements is also evidenced by (Weiss, 1982, 2004).
6 CONCLUSIONS
In connection with the development of distance learn-
ing, the problem of automating the process of evalu-
ating students’ knowledge is becoming important. To
solve this problem, the achievements of modern test-
ing theory IRT were used. Mathematical models of
IRT provide the basis for building an adaptive testing
algorithm that allows you to automate the process of
knowledge assessment.
As a result of this work, the following results were
obtained:
1. An algorithm of adaptive knowledge assessment based on the IRT approaches was proposed. This algorithm consists of an initial assessment of the difficulty level of the test items and of the students' abilities, scaling of these parameters, selection of the next question by minimizing the modulus of their difference, and estimation of the measurement error of the knowledge level using the information function of the proposed questions.
2. The test parameters were evaluated on the basis of
IRT theory, which identified non-informative test
questions that should be excluded from the set of
test items.
3. The correspondence of the experimental data to
the Rasch and 2PL Birnbaum model was assessed
based on the Pearson’s chi-squared test by using
the language R, which made it possible to identify
tasks that require replacement or processing.
4. The quality of the test was examined using three
IRT models (Rasch model, 2PL and 3PL Birn-
baum model), which allowed for a more careful
selection of test items.
The results of the study showed the effectiveness
of using IRT to assess knowledge. An analysis of
these results allows us to conclude that the use of the
IRT methods to build an adaptive algorithm will au-
tomate the process of knowledge assessment and in-
crease the objectivity of assessment in distance learn-
ing. The use of several mathematical models in the
adaptive algorithm makes it possible to choose among
them the one that best fits the experimental data,
which will improve the accuracy of assessing the stu-
dent’s knowledge.
In the future, we plan to improve the adaptive algorithm for assessing educational achievements by combining Knowledge Spaces (KS) and IRT (Muñoz-Merino et al., 2018).
REFERENCES
Al-A’ali, M. (2006). IRT-Item Response Theory As-
sessment for an Adaptive Teaching Assessment Sys-
tem. In Proceedings of the 10th WSEAS Inter-
national Conference on APPLIED MATHEMATICS,
MATH’06, page 518–522, Stevens Point, Wisconsin,
USA. World Scientific and Engineering Academy and
Society (WSEAS).
Amelia, R. N. and Kriswantoro, K. (2017). Implementa-
tion of Item Response Theory for Analysis of Test
Items Quality and Students’ Ability in Chemistry.
JKPK (Jurnal Kimia dan Pendidikan Kimia), 2(1):1–
12. https://doi.org/10.20961/jkpk.v2i1.8512.
Andersen, E. B. (1973). A goodness of fit test for the Rasch
model. Psychometrika, 38(1):123–140. https://doi.
org/10.1007/BF02291180.
Andrich, D. (2005). The Rasch model explained. In Applied
Rasch measurement: A book of exemplars, pages 27–
59. Springer.
Andrich, D. (2021). Rasch Models for Measurement.
Thousand Oaks. https://methods.sagepub.com/book/
rasch-models-for-measurement.
Andrich, D., Sheridan, B., and Luo, G. (2001).
RUMM2010: Rasch Unidimensional Measurement
Models. http://www.rummlab.com.au/.
Avanesov, V. S. (1980). The problem of psychological
tests. Soviet Education, 22(6):6–23. https://doi.org/
10.2753/RES1060-939322066.
Baker, F. B. and Kim, S.-H. (2017). The basics of item
response theory using R. Statistics for Social and Be-
havioral Sciences. Springer Cham. https://doi.org/10.
1007/978-3-319-54205-8.
Bezruczko, N., editor (2005). Rasch measurement in health
sciences. Jam Press Maple Grove, MN.
Bond, T., Yan, Z., and Heene, M. (2020). Applying the
Rasch model: Fundamental measurement in the hu-
man sciences. Routledge, fourth edition.
Brown, W. (1910). Some experimental results in the cor-
relation of mental abilities. British Journal of Psy-
chology, 1904-1920, 3(3):296–322. https://doi.org/10.
1111/j.2044-8295.1910.tb00207.x.
Cronbach, L. J. (1951). Coefficient alpha and the inter-
nal structure of tests. Psychometrika, 16(3):297–334.
https://doi.org/10.1007/BF02310555.
Eckes, T. (2011). Introduction to Many-Facet Rasch Measurement. Peter Lang, Bern, Switzerland. https://www.peterlang.com/view/title/13347.
Fischer, G. H. and Molenaar, I. W., editors (1995). Rasch
models: Foundations, recent developments, and appli-
cations. Springer Science & Business Media. https:
//doi.org/10.1007/978-1-4612-4230-7.
Guilford, J. P. (1942). Fundamental Statistics in Psychology
and Education. McGraw-Hill Book Company, Inc.,
New York, US. https://archive.org/details/in.ernet.dli.
2015.228996.
Gulliksen, H. (1986). Perspective on Educa-
tional Measurement. Applied Psycholog-
ical Measurement, 10(2):109–132. https:
//doi.org/10.1177/014662168601000201.
Guttman, L. (1944). A Basis for Scaling Qualitative Data.
American Sociological Review, 9(2):139–150. https:
//doi.org/10.2307/2086306.
Ingebo, G. S. (1997). Probability in the Measure of Achieve-
ment. Mesa Press.
Kuder, G. F. and Richardson, M. W. (1937). The theory
of the estimation of test reliability. Psychometrika,
2(3):151–160. https://doi.org/10.1007/BF02288391.
Lord, F. M. (1980). Applications of Item Response Theory
To Practical Testing Problems. Routledge. https://doi.
org/10.4324/9780203056615.
Lord, F. M., Novick, M. R., and Birnbaum, A. (1968).
Statistical theories of mental test scores. Addison-
Wesley, Oxford.
Luce, R. D. and Tukey, J. W. (1964). Simultaneous conjoint
measurement: A new type of fundamental measure-
ment. Journal of Mathematical Psychology, 1(1):1–
27. https://doi.org/10.1016/0022-2496(64)90015-X.
Maslak, A. A., Karabatsos, G., Anisimova, T. S., and Os-
ipov, S. A. (2005). Measuring and comparing higher
education quality between countries worldwide. Jour-
nal of Applied Measurement, 6(4):432–442.
Muñoz-Merino, P. J., Novillo, R. G., and Kloos, C. D. (2018). Assessment of skills and adaptive learning for parametric exercises combining knowledge spaces and item response theory. Applied Soft Computing, 68:110–124. https://doi.org/10.1016/j.asoc.2018.03.045.
Perline, R., Wright, B. D., and Wainer, H. (1979).
The Rasch model as additive conjoint measurement.
Applied Psychological Measurement, 3(2):237–255.
https://doi.org/10.1177/014662167900300213.
Rasch, G. (1960). Studies in mathematical psychology: I.
Probabilistic models for some intelligence and attain-
ment tests. Nielsen & Lydiche.
Sax, G. (1989). Principles of educational and psychological
measurement and evaluation. Wadsworth Pub. Co.,
Belmont, 3rd edition.
Smith, E. V. and Smith, R. M., editors (2004). Introduction
to Rasch measurement: Theory, models and applica-
tions. JAM Press.
Spearman, C. (1910). Correlation calculated from faulty
data. British Journal of Psychology, 1904-1920,
3(3):271–295. https://doi.org/10.1111/j.2044-8295.
1910.tb00206.x.
Tjabolo, S. A. and Otaya, L. G. (2019). Quality of School
Exam Tests Based on Item Response Theory. Uni-
versal Journal of Educational Research, 7(10):2156–
2164. https://doi.org/10.13189/ujer.2019.071013.
van der Linden, W. J. and Hambleton, R. K., editors (1997).
Handbook of Modern Item Response Theory. Springer
Science & Business Media. https://doi.org/10.1007/
978-1-4757-2691-6.
Weiss, D. J. (1982). Improving measurement quality and ef-
ficiency with adaptive testing. Applied Psychological
Measurement, 6(4):473–492. https://doi.org/10.1177/
014662168200600408.
Weiss, D. J. (2004). Computerized adaptive testing for ef-
fective and efficient measurement in counseling and
education. Measurement and Evaluation in Counsel-
ing and Development, 37(2):70–84. https://doi.org/10.
1080/07481756.2004.11909751.
Wilson, M. (2005). Constructing Measures: An Item Re-
sponse Modeling Approach. Routledge.
Wright, B. D. (1977). Solving Measurement Problems
with the Rasch Model. Journal of Educational Mea-
surement, 14(2):97–116. http://www.jstor.org/stable/
1434010.
Wright, B. D. and Linacre, J. M. (1987). Dichotomous rasch
model derived from specific objectivity. Rasch mea-
surement transactions, 1(1):5–6. https://www.rasch.
org/rmt/rmt11a.htm.
Wright, B. D. and Masters, G. N. (1982). Rating scale anal-
ysis. Mesa Press, Chicago.
Wright, B. D. and Stone, M. H. (1979).
Best test design. Mesa Press, Chicago.
https://www.rasch.org/BTD RSA/pdf%20[reduced%
20size]/Best%20Test%20Design.pdf.
Zaqoot, W., Oh, L.-B., Koh, E., Seah, L. H., and Teo, H.-H.
(2021). The Use of Rasch Model to Create Adaptive
Practices in e-Learning Systems. In ACIS 2021 Pro-
ceedings, page 69. https://aisel.aisnet.org/acis2021/
69.