Utility of Anonymised Data in Decision Tree Derivation
Jack R. Davies and Jianhua Shao
School of Computer Science & Informatics, Cardiff University, U.K.
Keywords:
Data Anonymisation, Data Utility, Decision Tree.
Abstract:
Privacy Preserving Data Publishing (PPDP) is a practice for anonymising microdata such that it can be publicly
shared. Much work has been carried out on developing methods of data anonymisation, but relatively little
work has been done on examining how useful anonymised data is in supporting data analysis. This paper eval-
uates the utility of k-anonymised data in decision tree derivation and examines how accurate some commonly
used metrics are in estimating this utility. Our results suggest that whilst classification accuracy loss is mini-
mal in most common scenarios, using a small selection of simple metrics when calibrating a k-Anonymisation
could help significantly improve decision tree classification accuracy for anonymised data.
1 INTRODUCTION
With the increase in personal data collection, storage
and use by a growing number of corporations and or-
ganisations, there has been a corresponding rise in the
population’s concern for their privacy. To alleviate
these concerns, governments require that individuals’
privacy is protected in the sharing of sensitive data. To
comply with these requirements, data publishers use
a process known as Privacy Preserving Data Publish-
ing (PPDP) (Fung et al., 2010). The major challenge
with PPDP is to ensure that the privacy of individu-
als is maintained, whilst also retaining the usefulness
of the original data. Naïve approaches such as simply
removing explicit identifiers (e.g., driving license
number) from the data set, or redacting contextual in-
formation, are not sufficient. A more sophisticated
approach is needed whereby the data satisfies given
privacy requirements defined in a privacy model, to
protect against potential attacks.
One such model is k-Anonymity (Samarati and
Sweeney, 1998). k-Anonymity requires each record
in a data set to be indistinguishable from at least k − 1
other records over the set of quasi-identifiers (QIDs).
QIDs are attributes in the data set that are externally
available and can be used to link a record to a spe-
cific individual – an attack known as Record Link-
age. For example, Age and Occupation are possible
QIDs that could be used to identify a specific person
in a data set if their combination of values is unique
within it. A given data set will rarely satisfy
k-Anonymity, so the data will need to be modified
through anonymisation operations. For example,
if the Occupation values ‘Lawyer’ and ‘Doctor’
do not each appear at least k times in a data set, we
can choose to generalise both into ‘Professional’ or
simply ‘{Lawyer, Doctor}’. The published data set will
then contain at least k records with the generalised
value instead, thereby satisfying k-Anonymity.
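As a minimal sketch of the definition above, the following Python checks whether a toy table satisfies k-Anonymity over its QIDs, before and after generalising Occupation. The records, the function names `is_k_anonymous` and `generalise`, and the value mapping are illustrative assumptions, not taken from the experiments in this paper.

```python
from collections import Counter

def is_k_anonymous(records, qids, k):
    """Return True if every combination of QID values occurs at least k times."""
    counts = Counter(tuple(r[q] for q in qids) for r in records)
    return all(c >= k for c in counts.values())

def generalise(records, attribute, mapping):
    """Return a copy of the records with values of `attribute` replaced via `mapping`."""
    return [{**r, attribute: mapping.get(r[attribute], r[attribute])} for r in records]

# Hypothetical toy data: Age and Occupation are the QIDs.
records = [
    {"Age": "30-39", "Occupation": "Lawyer", "Income": "high"},
    {"Age": "30-39", "Occupation": "Doctor", "Income": "high"},
    {"Age": "30-39", "Occupation": "Doctor", "Income": "low"},
]
qids = ["Age", "Occupation"]

# Assumed generalisation: both occupations map to 'Professional'.
mapping = {"Lawyer": "Professional", "Doctor": "Professional"}
generalised = generalise(records, "Occupation", mapping)

print(is_k_anonymous(records, qids, 2))      # False: 'Lawyer' appears only once
print(is_k_anonymous(generalised, qids, 2))  # True: all three records share the same QID values
```

The check groups records by their QID values only; non-QID attributes such as Income are left untouched by the generalisation.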
Whilst k-Anonymity ensures anonymity in the
data, the anonymised data must also retain utility for
its recipients. The degree to which this utility is maintained
is something that has not been comprehensively stud-
ied; this paper attempts to study this in the context
of one particular area of data analysis – classification
using decision trees. We adapt the basic ID3 algorithm
(Quinlan, 1986) to derive decision trees from
anonymised data and we then use this adapted algo-
rithm to train and test decision trees in four differ-
ent scenarios, comparing and evaluating the results
from each scenario to measure the data utility in terms
of classification accuracy. In addition, we measure
the utility of the anonymised data using some metrics
commonly found in the literature and compare these
measures to classification accuracy. This grants in-
sight into the reliability of these metrics in estimating
the utility of anonymised data in decision tree classi-
fication.
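For orientation, the split criterion of basic ID3 (information gain) can be sketched as follows. This is the textbook criterion only, not the adapted algorithm used in this paper, and the toy records and the names `entropy` and `information_gain` are hypothetical. The example hints at why generalisation may affect classification: an attribute that separates the classes perfectly can lose its discriminating power once its values are merged.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(records, attribute, target):
    """Entropy of the target minus the weighted entropy after splitting on attribute."""
    labels = [r[target] for r in records]
    groups = {}
    for r in records:
        groups.setdefault(r[attribute], []).append(r[target])
    remainder = sum(len(g) / len(records) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Hypothetical toy data in which Occupation determines the class exactly.
original = [
    {"Occupation": "Lawyer", "Class": "yes"},
    {"Occupation": "Lawyer", "Class": "yes"},
    {"Occupation": "Doctor", "Class": "no"},
    {"Occupation": "Doctor", "Class": "no"},
]
# Assumed generalisation: every occupation becomes 'Professional'.
generalised = [{**r, "Occupation": "Professional"} for r in original]

gain_original = information_gain(original, "Occupation", "Class")        # 1 bit
gain_generalised = information_gain(generalised, "Occupation", "Class")  # 0 bits
print(gain_original, gain_generalised)
```

This is a deliberately extreme case; in practice the loss depends on how the generalised values align with the class labels.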
The rest of the paper is organised as follows. In
Section 2, we briefly discuss some related work; Section 3
presents essential background information; in
Section 4, we present our experiments, and in Section 5
we report our results; finally, in Section 6, we conclude
the paper.
Davies, J. and Shao, J.
Utility of Anonymised Data in Decision Tree Derivation.
DOI: 10.5220/0010778300003120
In Proceedings of the 8th International Conference on Information Systems Security and Privacy (ICISSP 2022), pages 273-280
ISBN: 978-989-758-553-1; ISSN: 2184-4356
Copyright © 2022 by SCITEPRESS – Science and Technology Publications, Lda. All rights reserved