ticulated mathematically using our proposed decision and explanation fidelity measures, determining when an explanation is deceptive is not always as clear-cut, since it involves a grey area: an explanation might be judged as deceptive, but it might equally be judged as merely inaccurate or simplified. Thus, deception detection is not an easy task: while strong deception is easy to recognize, minor forms are difficult to detect. Furthermore, some form of domain or model knowledge is necessary. This could be data similar or, preferably, identical to the training data of the model under investigation. Domain experts could also provide information in the form of labeled samples or detection rules, i.e., they can inspect model outputs and judge them as faithful or deceptive. Identifying deceptive explanations becomes much easier if model access and training or testing data are available; the task then reduces to comparing the model's outputs and explanations to those suggested by the (training) data, as sketched below. We recommend that regulatory bodies pass laws ensuring that auditors have actual model access, since this simplifies the process of deception detection. It is one step towards ensuring that AI is used for the social good.
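To make this comparison concrete, the following minimal sketch (in Python) illustrates one possible check under the assumption that training-like data and a claimed per-feature attribution vector are available. It is an illustration, not our actual implementation, and names such as fit_reference, claimed_attr, X_train, and y_train are hypothetical: a transparent reference model is fitted on the data, and a claimed explanation is flagged when its feature ranking strongly disagrees with the attributions the reference suggests.

# Minimal sketch (hypothetical, not the paper's implementation): flag claimed
# feature attributions that contradict what training-like data suggests.
import numpy as np
from scipy.stats import spearmanr
from sklearn.linear_model import LogisticRegression


def fit_reference(X_train, y_train):
    """Transparent surrogate whose weights reflect what the data suggests."""
    return LogisticRegression(max_iter=1000).fit(X_train, y_train)


def reference_attribution(ref_model, x):
    """Per-feature contribution of sample x under the linear reference model."""
    return ref_model.coef_[0] * x


def deception_score(claimed_attr, ref_attr):
    """Disagreement between claimed and reference attributions in [0, 1];
    values near 1 mean the claimed explanation contradicts the data."""
    rho, _ = spearmanr(claimed_attr, ref_attr)
    return 0.5 * (1.0 - rho)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(200, 5))
    y_train = (X_train[:, 0] + 0.5 * X_train[:, 1] > 0).astype(int)

    ref = fit_reference(X_train, y_train)
    x = X_train[0]
    faithful = reference_attribution(ref, x)   # honest explanation
    deceptive = -faithful                      # sign-flipped, misleading one

    print(deception_score(faithful, reference_attribution(ref, x)))   # ~0.0
    print(deception_score(deceptive, reference_attribution(ref, x)))  # ~1.0

In practice, the linear surrogate and the rank-correlation score would be replaced by the proposed decision and explanation fidelity measures or by more faithful attribution methods, but the auditing logic remains the same: comparing what is claimed against what the data and the model actually support.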
Detection methods will improve, but so will strategies for lying. Thus, it is important to anticipate weaknesses of detection algorithms that deceitful parties might exploit and to mitigate them early on, e.g., with the aid of generic security methods (Schlegel et al., ). The field of explainability evolves quickly, with many challenges ahead (Meske et al., 2021). This provides ample opportunities for future research to assess methods for the creation and detection of deceptive explanations, e.g., methods explaining features or layers of image processing systems rather than text (Schneider and Vlachos, 2021a).
11 CONCLUSION
Given economic and other incentives, a new cat-and-mouse game between "liars" and "detectors" is emerging in the context of AI. Our work provided a first move in this game: we structured the problem and contributed by showing that detecting deception attempts without domain knowledge is challenging. Our ML models, which utilize domain knowledge through training data, yield good detection accuracy, while unsupervised techniques are only effective against more severe deception attempts or when (detailed) architectural information about the model under investigation is available.
REFERENCES
Adebayo, J., Gilmer, J., Muelly, M., Goodfellow, I., Hardt,
M., and Kim, B. (2018). Sanity checks for saliency
maps. In Neural Information Processing Systems.
Adelani, D., Mai, H., Fang, F., Nguyen, H. H., Yamag-
ishi, J., and Echizen, I. (2019). Generating sentiment-
preserving fake online reviews using neural language
models and their human- and machine-based detec-
tion. arXiv:1907.09177.
Aivodji, U., Arai, H., Fortineau, O., Gambs, S., Hara, S.,
and Tapp, A. (2019). Fairwashing: the risk of ratio-
nalization. In Int. Conf. on Machine Learning (ICML).
Aroyo, A. M., Gonzalez-Billandon, J., Tonelli, A., Sciutti,
A., Gori, M., Sandini, G., and Rea, F. (2018). Can a
humanoid robot spot a liar? In Int. Conf. on Humanoid
Robots, pages 1045–1052.
Bach, S., Binder, A., Montavon, G., Klauschen, F., Müller, K.-R., and Samek, W. (2015). On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PloS one.
Damer, T. E. (2013). Attacking faulty reasoning. Cengage
Learning, Boston, Massachusetts.
Dimanov, B., Bhatt, U., Jamnik, M., and Weller, A. (2020).
You shouldn’t trust me: Learning models which con-
ceal unfairness from multiple explanation methods. In
SafeAI@ AAAI.
Ekman, P. and Friesen, W. V. (1969). Nonverbal leakage
and clues to deception. Psychiatry, 32(1):88–106.
Fukuchi, K., Hara, S., and Maehara, T. (2020). Faking
fairness via stealthily biased sampling. In Proc. of the
AAAI Conference on Artificial Intelligence.
Fusco, F., Vlachos, M., Vasileiadis, V., Wardatzky, K., and
Schneider, J. (2019). Reconet: An interpretable neu-
ral architecture for recommender systems. In Proc.
IJCAI.
Kim, Y. (2014). Convolutional neural networks for sentence
classification. In Proc. Empirical Methods in Natural
Language Processing (EMNLP).
Kowsari, K., Brown, D. E., Heidarysafa, M., Meimandi,
K. J., Gerber, M. S., and Barnes, L. E. (2017). Hdltex:
Hierarchical deep learning for text classification. In
IEEE Int. Conference on Machine Learning and Ap-
plications (ICMLA).
Lai, V. and Tan, C. (2019). On human predictions with ex-
planations and predictions of machine learning mod-
els: A case study on deception detection. In Proceed-
ings of the Conference on Fairness, Accountability,
and Transparency, pages 29–38.
Lakkaraju, H. and Bastani, O. (2019). How do I fool you?:
Manipulating user trust via misleading black box ex-
planations. arXiv preprint arXiv:1911.06473.
Lertvittayakumjorn, P. and Toni, F. (2019). Human-
grounded evaluations of explanation methods for text
classification. arXiv preprint arXiv:1908.11355.
Levine, T. R. (2014). Encyclopedia of deception. Sage Pub-
lications.