2 METHOD
We propose to compare generative models from multiple perspectives, including Uniqueness, Similarity and Utility. In this section, we detail the feature extraction procedure applied to the data, the definitions of these evaluation metrics, and the selection of generative models.
2.1 Feature Extraction
Cystic Fibrosis (CF) is a rare disease that gives rise to different forms of lung dysfunction, eventually leading to progressive respiratory failure. It is a complex disease, and the types and severity of symptoms can differ widely from person to person. In our work, we extract CF patients from the IBM Explorys database, obtaining a total of 10074 patients, roughly one third of the 31199 CF patients in the US [USCFF, 2020]. Patients belong to two subgroups: those who have died or received a lung transplant, labeled with value 0; and those who have survived, labeled with value 1. We remove all samples with no diagnosis codes, as well as duplicates, to enhance synthetic diversity. Our final dataset contains 3184 patients, with ∼80% belonging to the survived subgroup. The EHRs of these patients are then aggregated over time: for each patient, we assign value 1 to the features that appear anywhere in the medical history and value 0 to those features that never appear. The medical data is thus represented as a binary matrix in which each row corresponds to a patient and each column to a medical feature. The predicted survival outcome is based on these medical features, which include comorbidity, lung infection, and therapy variables.
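As a minimal sketch of this aggregation step, assuming a hypothetical long-format event export (the table layout and feature names below are illustrative stand-ins, not the actual Explorys schema):

```python
import pandas as pd

# Hypothetical long-format EHR export: one row per (patient, feature) event over time.
events = pd.DataFrame({
    "patient_id": [1, 1, 2, 3, 3, 3],
    "feature":    ["comorbidity_diabetes", "infection_pseudomonas",
                   "therapy_ivacaftor", "comorbidity_diabetes",
                   "infection_pseudomonas", "therapy_ivacaftor"],
})

# Aggregate over time: a feature is 1 if it ever appears in the patient's history, else 0.
binary_matrix = (
    pd.crosstab(events["patient_id"], events["feature"])
      .clip(upper=1)            # collapse repeated occurrences to presence/absence
      .astype(int)
)
print(binary_matrix)
```

Clipping the event counts at 1 collapses repeated occurrences over time into the presence/absence encoding described above.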
2.2 Uniqueness
Privacy assurances are essential to prevent the leakage
of personal information. However, a synthetic data
generator can achieve perfect evaluation scores by
simply copying the original training data, thus break-
ing the privacy guarantee. Differential privacy (DP)
is one well-known and commonly researched assur-
ance [Dwork et al., 2014]. DP algorithms limit how
much; the output can differ based on whether the in-
put is included; one can learn about a person because
their data was included; and confidence about whether
someone’s data was included. In practice, there are
various distance-based metrics to guarantee such as-
surance. In [Alaa et al., 2021], the authors quantify a
generated sample as authentic if its closest real sam-
ple is at least closer to any other real sample than the
generated sample. Extending this to the case of bi-
nary variables, we could consider hamming distance.
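As an illustrative sketch of that criterion under the Hamming distance (the helper name authenticity and the use of scipy are our own choices for exposition, not an implementation from [Alaa et al., 2021]):

```python
import numpy as np
from scipy.spatial.distance import cdist

def authenticity(real: np.ndarray, synthetic: np.ndarray) -> np.ndarray:
    """Boolean flag per synthetic sample: True if its nearest real sample is
    closer to another real sample than to the synthetic one (Hamming distance)."""
    d_rs = cdist(real, synthetic, metric="hamming")   # real-to-synthetic distances
    d_rr = cdist(real, real, metric="hamming")        # real-to-real distances
    np.fill_diagonal(d_rr, np.inf)                    # ignore self-distances
    nearest_real = d_rs.argmin(axis=0)                # closest real sample per synthetic sample
    return d_rr[nearest_real].min(axis=1) < d_rs[nearest_real, np.arange(synthetic.shape[0])]

real = np.array([[1, 1, 1, 1], [1, 1, 1, 0]])
synth = np.array([[1, 1, 1, 1],   # exact copy of a real sample -> not authentic
                  [0, 0, 0, 0]])  # far from the real data      -> authentic
print(authenticity(real, synth))  # [False  True]
```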
However, since our data is de-personalised and non-
identifiable, we assess our generators based on how
many exact copies are made. We consider the requirement of privacy as Uniqueness: the generator should not simply copy the input data. We calculate the Uniqueness of each model by generating a large number of samples and reporting the percentage that exactly overlap with the original training data.
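A minimal sketch of how this exact-copy overlap could be computed for binary matrices (the function name and toy arrays are illustrative):

```python
import numpy as np

def uniqueness_overlap(real: np.ndarray, synthetic: np.ndarray) -> float:
    """Percentage of synthetic rows that are exact copies of a real training row.

    Both inputs are binary patient-by-feature matrices; a lower overlap
    indicates a generator that copies less of its training data.
    """
    real_rows = {row.tobytes() for row in np.ascontiguousarray(real).astype(np.uint8)}
    copies = sum(row.tobytes() in real_rows
                 for row in np.ascontiguousarray(synthetic).astype(np.uint8))
    return 100.0 * copies / len(synthetic)

# Example: a generator that copies half of its outputs from the training data.
real = np.array([[1, 0, 1], [0, 1, 1]])
synth = np.array([[1, 0, 1], [1, 1, 1]])
print(uniqueness_overlap(real, synth))  # 50.0
```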
2.3 Similarity
It is difficult to measure the Similarity between the synthetic and original datasets with a single score because of the multiple features and data types within the data. In our work, Similarity is therefore measured with four sub-metrics: precision, recall, density and coverage.
In [Sajjadi et al., 2018], the use of precision and recall metrics to assess the output of generative models is proposed. Precision measures fidelity, the degree to which generated samples resemble the original data, while recall measures diversity, i.e. whether generated samples cover the full variability of the original data. The latter is particularly useful for evaluating generative models prone to mode collapse. Precision is defined as the proportion of the synthetic probability distribution that can be generated from the original probability distribution, thus measuring fidelity, and recall symmetrically defines diversity. Precision and recall are effectively calculated as the proportion of samples that lie in the support of the comparative distribution, which assumes uniform density across the data. Alternative metrics, such as density and coverage, have therefore been proposed to ameliorate this issue [Kynkäänniemi et al., 2019, Naeem et al., 2020, Alaa et al., 2021]. We employ the definition and implementation of the density and coverage metrics from [Naeem et al., 2020] for a more accurate Similarity evaluation. Density and coverage address the shortcomings of precision and recall: their lack of robustness to outliers, their failure to detect matching distributions, and their inability to diagnose different types of distribution failure.
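For reference, the implementation released with [Naeem et al., 2020] (the prdc package) reports all four sub-metrics in a single call; the sketch below assumes that package is installed, and nearest_k=5 is an illustrative choice rather than a value prescribed here:

```python
import numpy as np
from prdc import compute_prdc  # reference implementation accompanying Naeem et al., 2020

rng = np.random.default_rng(0)
real = rng.integers(0, 2, size=(500, 30)).astype(float)       # binary patient-by-feature matrix
synthetic = rng.integers(0, 2, size=(500, 30)).astype(float)  # generated samples

# nearest_k controls the k-NN radius used to estimate the data manifolds;
# 5 is an illustrative choice, not a value prescribed in this work.
metrics = compute_prdc(real_features=real, fake_features=synthetic, nearest_k=5)
print(metrics)  # {'precision': ..., 'recall': ..., 'density': ..., 'coverage': ...}
```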
2.4 Utility
To empirically validate the Utility of the generated dataset, we introduce two training and testing settings. Setting A: train the predictive models on the synthetic training set and test their performance on the testing set. Setting B: since the original dataset is imbalanced, train on the training set augmented with synthetic samples to balance the classes, then test on the testing set. Both settings are sketched below.
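An illustrative sketch of the two settings, assuming a logistic-regression classifier, AUROC as the utility score and randomly generated stand-in data; none of these choices are prescribed by this work:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression  # placeholder predictive model
from sklearn.metrics import roc_auc_score            # placeholder utility score

rng = np.random.default_rng(0)

def make_binary(n, p=30):
    """Toy stand-in for a binary patient-by-feature matrix with an imbalanced label."""
    X = rng.integers(0, 2, size=(n, p)).astype(float)
    y = (X[:, 0] + rng.random(n) > 0.4).astype(int)   # roughly 80% label 1 (survived)
    return X, y

# Stand-ins for the real train/test splits and the synthetic samples.
X_train, y_train = make_binary(800)
X_test, y_test = make_binary(200)
X_synth, y_synth = make_binary(800)

def utility(X_tr, y_tr):
    """Fit a predictive model and report AUROC on the held-out test set."""
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])

# Setting A: train on synthetic data only, test on real data.
print("Setting A:", utility(X_synth, y_synth))

# Setting B: augment the imbalanced real training set with synthetic
# minority-class (label 0) samples, then test on real data.
n_extra = int((y_train == 1).sum() - (y_train == 0).sum())
X_extra = X_synth[y_synth == 0][:n_extra]
X_aug = np.vstack([X_train, X_extra])
y_aug = np.concatenate([y_train, np.zeros(len(X_extra), dtype=int)])
print("Setting B:", utility(X_aug, y_aug))
```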
We perform a 5-fold cross-validation, sampling each fold with a proportional representation of each class. After training the synthesisers (de-