cessfully performs the task with 91% accuracy. Furthermore, our Kappa statistic (0.73) confirms a consistent agreement between the algorithm's and the experts' summary labelling. This solution links the benchmark-oriented evaluation of summaries against reference solutions (Lin, 2004) to an applicable approach for real summary understanding.
Future research will focus on the interpretability of the models, that is, on whether we can obtain from the ensemble model a snapshot of the main features driving the final decision. In other words, we want to quantify how much the formatting, semantic and syntactic layers each contribute to the result. This would clarify the potential for new applications concerning the machine-human correlation in decision making for the summary evaluation task.
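To illustrate one possible way of obtaining such a snapshot, the sketch below inspects the impurity-based feature importances of a random-forest learner; the library choice (scikit-learn), the feature names and their grouping into the three layers are hypothetical placeholders, not the features used in this study.

```python
# A minimal sketch, assuming a scikit-learn random forest and hypothetical
# features grouped into formatting, semantic and syntactic layers.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

feature_names = [
    "sentence_position", "sentence_length",      # formatting layer (hypothetical)
    "tfidf_overlap", "embedding_similarity",     # semantic layer (hypothetical)
    "pos_tag_ratio", "parse_tree_depth",         # syntactic layer (hypothetical)
]

# Placeholder data standing in for the summary feature matrix and Good/Bad labels.
rng = np.random.default_rng(0)
X = rng.random((200, len(feature_names)))
y = rng.integers(0, 2, size=200)

model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Impurity-based importances give a first snapshot of which features
# (and hence which layer) drive the final decision.
for name, score in sorted(zip(feature_names, model.feature_importances_),
                          key=lambda pair: -pair[1]):
    print(f"{name}: {score:.3f}")
```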
Lastly, the scope of this study is limited to the Dutch language. Similar performance can be expected in closely related languages such as German and English, yet the challenges increase for more distant language types. Therefore, the modelling techniques should be fine-tuned and adapted to each specific language in order to validate the results presented here.
REFERENCES
Barrios, F., Lopez, F., Argerich, L., and Wachenchauzer,
R. (2016). Variations of the similarity function
of textrank for automated summarization. CoRR,
abs/1602.03606.
Bradley, A. P. (1997). The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition, 30(7):1145–1159.
Breiman, L. (2001). Random forests. Machine learning,
45(1):5–32.
Brin, S. and Page, L. (1998). The anatomy of a large-scale
hypertextual web search engine.
Cohen, J. (1960). A coefficient of agreement for nominal
scales. Educational and psychological measurement,
20(1):37–46.
Güneş, F., Wolfinger, R., and Tan, P.-Y. (2017). Stacked ensemble models for improved prediction accuracy. In Proc. Static Anal. Symp, pages 1–19.
Hearst, M. A., Dumais, S. T., Osuna, E., Platt, J., and
Scholkopf, B. (1998). Support vector machines. IEEE
Intelligent Systems and their Applications, 13(4):18–
28.
Jones, K. S. (1972). A statistical interpretation of term
specificity and its application in retrieval. Journal of
documentation.
Kleinberg, J. M. (1999). Authoritative sources in a hy-
perlinked environment. Journal of the ACM (JACM),
46(5):604–632.
Landis, J. R. and Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, pages 159–174.
Lin, C.-Y. (2004). ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81.
Mani, I. (2001). Automatic summarization. J. Benjamins
Pub. Co.
McHugh, M. L. (2012). Interrater reliability: the kappa statistic. Biochemia Medica, pages 276–282.
Mihalcea, R. and Tarau, P. (2004). TextRank: Bringing order into text. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, pages 404–411.
Schuster, M. and Paliwal, K. K. (1997). Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11):2673–2681.
Wolpert, D. H. (1992). Stacked generalization. Neural net-
works, 5(2):241–259.
Wu, H., Ma, T., Wu, L., Manyumwa, T., and Ji, S.
(2020). Unsupervised reference-free summary qual-
ity evaluation via contrastive learning. arXiv preprint
arXiv:2010.01781.
Zhang, P. (1993). Model selection via multifold cross vali-
dation. The Annals of Statistics, 21(1):299–313.
Zhou, P., Qi, Z., Zheng, S., Xu, J., Bao, H., and Xu, B. (2016). Text classification improved by integrating bidirectional LSTM with two-dimensional max pooling. arXiv preprint arXiv:1611.06639.
APPENDIX
In Table 2 we define $p_o$ as the raters' accuracy. The maximum agreement probability is $P_{\max} = \sum_{i=1}^{k} \min(P_{i+}, P_{+i})$, where $P_{i+}$ and $P_{+i}$ are the row and column probabilities from the original raters' matrix. $p_{Good}$ and $p_{Bad}$ are the probabilities of random agreement for the two summary categories. $p_e$ is the random agreement probability for both categories together, thus $p_e = p_{Good} + p_{Bad}$. The Kappa value is defined as $\kappa = \frac{p_o - p_e}{1 - p_e}$. The maximum value attainable for the unequal distribution of the sample is $\kappa_{\max}$, calculated as $\kappa_{\max} = \frac{P_{\max} - p_e}{1 - p_e}$. The standard error and confidence interval are calculated as $SE_{\kappa} = \sqrt{\frac{p_o (1 - p_o)}{N (1 - p_e)^2}}$ and $CI: \kappa \pm Z_{1-\alpha/2}\, SE_{\kappa}$, respectively.
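For illustration, these quantities can be computed from a 2×2 agreement matrix as in the following sketch; the counts used here are made-up placeholders, not the values of Table 2.

```python
# A minimal sketch of the Appendix quantities for a 2x2 (Good/Bad) rating matrix.
# The counts below are illustrative placeholders, not the matrix of Table 2.
import numpy as np
from scipy.stats import norm

counts = np.array([[40, 5],    # rows: expert label   (Good, Bad)
                   [4, 11]])   # cols: algorithm label (Good, Bad)
N = counts.sum()
P = counts / N                       # joint probabilities
p_o = np.trace(P)                    # raters' accuracy (observed agreement)

row = P.sum(axis=1)                  # P_{i+}
col = P.sum(axis=0)                  # P_{+i}
p_good, p_bad = row * col            # random agreement per category
p_e = p_good + p_bad                 # chance agreement for both categories
p_max = np.minimum(row, col).sum()   # maximum agreement probability P_max

kappa = (p_o - p_e) / (1 - p_e)
kappa_max = (p_max - p_e) / (1 - p_e)
se_kappa = np.sqrt(p_o * (1 - p_o) / (N * (1 - p_e) ** 2))

z = norm.ppf(1 - 0.05 / 2)           # Z_{1-alpha/2} for a 95% interval
ci = (kappa - z * se_kappa, kappa + z * se_kappa)
print(f"kappa={kappa:.3f}, kappa_max={kappa_max:.3f}, "
      f"95% CI=({ci[0]:.3f}, {ci[1]:.3f})")
```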