actions. In other words, the disassociation method
ensures that each combination of m items appears at
least k times in the released dataset. Using the disas-
sociation method, items in transactions are protected
by dividing them into groups such that the items in
each group satisfy the k^m-anonymity requirement.
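The k^m-anonymity requirement described above can be checked mechanically. The following is a minimal sketch, not the disassociation algorithm itself: it assumes that transactions are plain Python sets and that the requirement applies to every combination of up to m items that occurs in the data. The function name `is_km_anonymous` and the toy transactions are hypothetical and used only for illustration.

```python
from collections import Counter
from itertools import combinations


def is_km_anonymous(transactions, k, m):
    """Return True if every combination of up to m items that occurs in
    some transaction occurs in at least k transactions."""
    counts = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))  # canonical order for combination keys
        for size in range(1, m + 1):
            for combo in combinations(items, size):
                counts[combo] += 1
    return all(count >= k for count in counts.values())


# Toy transactions, invented for illustration only.
toy = [{"a", "b"}, {"a", "b"}, {"a", "c"}, {"a", "c"}]

print(is_km_anonymous(toy, k=2, m=2))                 # True
print(is_km_anonymous(toy + [{"b", "c"}], k=2, m=2))  # False: (b, c) occurs once
```

Enumerating all combinations is exponential in m, so this check is only practical for small m and short transactions.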
In this paper, we present a de-anonymisation approach
to attacking transaction data anonymised by
the disassociation method, and we do so by exploit-
ing semantic relationships among the data items to
expose hidden links between them. We use some
well-established measures to score semantic relation-
ships and we heuristically re-construct original trans-
actions from disassociated ones. Our findings show
that the disassociation method may not protect trans-
action data effectively: up to 60% of the disassociated
items can be re-associated, thereby breaking the pri-
vacy of nearly 70% of protected itemsets in disasso-
ciated transactions.
The rest of the paper is organised as follows. In
Section 2, we discuss the work related to this paper. In
Section 3, we give a brief introduction to the disasso-
ciation method. In Section 4, we present our approach
to semantic attack and explain the two key steps of
our attacking approach. In Section 5, we illustrate
how chunks in a disassociated dataset can be attacked
by proposing three heuristic strategies to re-construct
original transactions based on semantic relationships.
In Section 6, we report the experimental results. Fi-
nally, in Section 7, we conclude the paper.
2 RELATED WORKS
In recent years, privacy threats associated with releas-
ing data concerning individuals have been extensively
investigated, leading to the identification of a variety
of possible attacks on published data. One well-publicised
attack is the linkage attack, where an attacker
is assumed to be able to link a record in a dataset to
the record owner by using some external knowledge.
Sweeney (Sweeney, 2002) described an example of a
linkage attack in which records in a medical dataset
published by the Group Insurance Commission in
Massachusetts were matched with the voter registration
list for Cambridge, Massachusetts. Although all the
explicit identifiers in the medical dataset had been
removed, she was able to re-identify the Governor of
Massachusetts, William Weld, by linking his data in
the voter registration list to that in the
medical dataset.
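The mechanics of such a linkage can be sketched as a join on shared attributes. The example below is a minimal illustration only: the field names (`zip`, `birth_date`, `sex`), the function `link_records` and all records are hypothetical and invented here, not taken from the datasets discussed above. A released record counts as re-identified when exactly one external record shares its attribute values.

```python
def link_records(released, external, quasi_identifiers):
    """Pair each released record with the external record that shares its
    quasi-identifier values, but only when that match is unique."""
    index = {}
    for person in external:
        key = tuple(person[q] for q in quasi_identifiers)
        index.setdefault(key, []).append(person)

    matches = []
    for record in released:
        key = tuple(record[q] for q in quasi_identifiers)
        candidates = index.get(key, [])
        if len(candidates) == 1:  # a unique match re-identifies the record owner
            matches.append((candidates[0]["name"], record))
    return matches


# Invented toy records; explicit identifiers are absent from `released`.
released = [
    {"zip": "02138", "birth_date": "1950-01-01", "sex": "M", "diagnosis": "condition A"},
    {"zip": "02139", "birth_date": "1962-05-17", "sex": "F", "diagnosis": "condition B"},
]
external = [
    {"name": "Person X", "zip": "02138", "birth_date": "1950-01-01", "sex": "M"},
    {"name": "Person Y", "zip": "02139", "birth_date": "1962-05-17", "sex": "F"},
    {"name": "Person Z", "zip": "02139", "birth_date": "1962-05-17", "sex": "F"},
]

links = link_records(released, external, ["zip", "birth_date", "sex"])
print([name for name, _ in links])  # ['Person X']
```

The second released record is not linked because two external records share its attribute values, which is the ambiguity that anonymisation methods aim to guarantee.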
Published data can also be attacked by inferences.
This type of attack occurs when an attacker can
deduce sensitive information that they do not have
access to from accessible non-sensitive information
published in the dataset by using a range of techniques
(Farkas and Jajodia, 2002). For example, data analy-
sis or data mining tools can be used to discover sen-
sitive patterns or correlations within data that violate
the privacy of individuals (Turkanovic et al., 2015),
(Clifton and Marks, 1996).
One advanced inference attack is the minimality
attack. In this type of attack, an attacker is assumed
to have knowledge of the anonymisation mechanism
used and the privacy requirements set to anonymise
a dataset. The attacker may obtain this knowledge
by examining the published dataset and the documentation
about the anonymisation algorithm, and then
use this knowledge to break anonymity (Fung et al.,
2010), (Wong et al., 2007), (Cormode et al., 2010a),
(Zhang et al., 2007).
All types of attack described above rely on data
frequency to identify individuals and their associated
sensitive information from a published dataset. They
do not, however, exploit semantic relationships that
may exist among data items when attacking data pri-
vacy. Shao and Ong proposed a method for attacking
set-generalised transactions based on semantic rela-
tionships (Shao and Ong, 2017). To illustrate this type
of attack, consider the example given in Figure 1.
The original transactions in Figure 1 (a) have been
anonymised by set-based generalisation (Loukides
et al., 2011) to produce the result shown in Figure
1 (b), where an item that does not occur frequently
enough is replaced by a set of items. Assuming that in
this case insulin, sneezing and petechiae are sensitive
items that need protection, they are generalised into
a set as shown in Figure 1 (b). As such, an attacker
will not know which sensitive item belongs to which
transaction. However, by exploiting semantic relationships,
an attacker may establish that insulin has a
stronger relationship with diabetes than other items in
transaction (1), hence it is more likely to be the original
item. This type of semantic attack can reduce
the “cover” provided by generalisation by removing some
items, as shown in Figure 1 (d), thereby violating
individuals’ privacy.
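The pruning step in this example can be sketched as follows. This is an illustrative sketch under stated assumptions, not the method of Shao and Ong: the relatedness scores are invented placeholder values rather than measurements, and the function names, the averaging rule and the threshold are hypothetical choices made here.

```python
def relatedness(a, b, scores):
    """Look up a symmetric relatedness score; unknown pairs score 0.0."""
    return scores.get((a, b), scores.get((b, a), 0.0))


def prune_generalised_set(context, candidates, scores, threshold):
    """Keep the candidates whose mean relatedness to the known items of a
    transaction (the non-empty `context`) reaches `threshold`."""
    ranked = []
    for candidate in candidates:
        total = sum(relatedness(candidate, item, scores) for item in context)
        ranked.append((total / len(context), candidate))
    ranked.sort(reverse=True)  # strongest candidates first
    return [candidate for mean, candidate in ranked if mean >= threshold]


# Placeholder scores, invented for illustration only.
scores = {
    ("diabetes", "insulin"): 0.9,
    ("diabetes", "sneezing"): 0.1,
    ("diabetes", "petechiae"): 0.2,
}

kept = prune_generalised_set(
    context=["diabetes"],
    candidates=["insulin", "sneezing", "petechiae"],
    scores=scores,
    threshold=0.5,
)
print(kept)  # ['insulin']
```

With these placeholder values, the generalised set shrinks from three candidates to one, which is the reduction of "cover" described above.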
This type of semantic attack depends on effective
assessment of the likelihood that two or more items
will occur together in a given context. A number of
tools in natural language processing (NLP) can be
used to understand and interpret semantic relationships.
For example, Sánchez et al. (Sánchez et al.,
2013) measure the semantic distance between terms
using point-wise mutual information (PMI) and use
the World Wide Web (WWW) as a corpus to find re-
lated terms (Bouma, 2009), (Sánchez et al., 2012),
(Staddon et al., 2007), (Chow et al., 2009). Chow
Semantic Attack on Disassociated Transactions