Table 1.
List of typical personally identifiable variables of health record data. Adapted from HL7 documentation [12] and Provincial Health Data Centre [13].
Table 2.
Systematic and overlapping steps involved in data de-identification and anonymisation. Compiled from [20,21,23,25,26].
Fig 1.
Anonymising precise numerical values.
Rounding precise values to a fixed number of decimal places or significant figures, or adding a small random jitter, can help preserve k-anonymity whilst retaining variable characteristics and epidemiological meaning (artificial dataset). A: Birthweights (kg) with four-decimal-place precision; B: Birthweights (kg) rounded to one decimal place; C: Precise number of exercise days per year; D: Number of exercise days per year with jitter in the range −5 to +5 days.
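The two transformations in this figure can be sketched in a few lines of Python. This is a minimal illustration on artificial data, not the procedure used to produce the figure: the helper names (`min_group_size`, `round_values`, `jitter_days`), the sample size, the distribution parameters, and the clamping of jittered values to 0–365 days are all assumptions made for the example.

```python
import random
from collections import Counter


def min_group_size(values):
    """Smallest number of records sharing one value (k for a single variable)."""
    return min(Counter(values).values())


def round_values(values, decimals=1):
    """Coarsen precise measurements by rounding to a fixed number of decimals."""
    return [round(v, decimals) for v in values]


def jitter_days(values, spread=5, low=0, high=365, rng=None):
    """Add an integer offset in [-spread, +spread], clamped to [low, high]."""
    rng = rng or random.Random()
    return [min(high, max(low, v + rng.randint(-spread, spread))) for v in values]


# Artificial data (assumed parameters, for illustration only).
rng = random.Random(42)
birthweights = [round(rng.gauss(3.4, 0.5), 4) for _ in range(500)]
exercise_days = [rng.randint(0, 365) for _ in range(500)]

rounded = round_values(birthweights, decimals=1)
jittered = jitter_days(exercise_days, spread=5, rng=rng)

print("k before rounding:", min_group_size(birthweights))
print("k after rounding: ", min_group_size(rounded))
```

Rounding merges groups of identical values, so the smallest group can only stay the same size or grow. Jitter, by contrast, masks the exact value of each record and does not by itself guarantee any particular k.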
Fig 2.
Checking bivariate correlation before and after perturbation.
An exhaustive bivariate correlation matrix shows that pairwise correlations remain broadly similar after perturbation. Red shading indicates positive correlation; blue shading indicates negative correlation. Values within each cell show the correlation coefficient. A: Original dataset; B: Dataset after perturbation of multiple fields.
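A check of this kind can be sketched as follows. The example assumes Pearson correlation (the caption does not state which coefficient was used), and the variables, perturbation settings, and helper names (`pearson`, `correlation_matrix`, `max_abs_difference`) are hypothetical, chosen only to show the comparison on artificial data.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def correlation_matrix(columns):
    """Correlation for every ordered pair of columns, keyed by (name, name)."""
    names = list(columns)
    return {(a, b): pearson(columns[a], columns[b]) for a in names for b in names}


def max_abs_difference(before, after):
    """Largest absolute change in any cell between two correlation matrices."""
    return max(abs(before[key] - after[key]) for key in before)


# Artificial dataset (assumed variables and parameters).
rng = random.Random(1)
age = [rng.randint(20, 80) for _ in range(1000)]
weight = [50 + 0.4 * a + rng.gauss(0, 5) for a in age]
exercise = [rng.randint(0, 365) for _ in range(1000)]
original = {"age": age, "weight": weight, "exercise": exercise}

# Perturb several fields: jitter the integer fields, round the continuous one.
perturbed = {
    "age": [a + rng.randint(-2, 2) for a in age],
    "weight": [float(round(w)) for w in weight],
    "exercise": [e + rng.randint(-5, 5) for e in exercise],
}

before = correlation_matrix(original)
after = correlation_matrix(perturbed)
print("largest change in any coefficient:", max_abs_difference(before, after))
```

A small maximum difference suggests that the perturbation has left the pairwise relationships largely intact. What counts as acceptably small depends on the intended analysis.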
Fig 3.
Checking k-anonymisation before and after perturbation.
Creating categories based on numerical value ranges can increase k-anonymity (artificial dataset; x-axis = value or category, y-axis = count per value or category). A: Exact integer values ranging from 1 to 20; B: Categories derived from the integer values ranging from 1 to 20.
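The effect of categorisation on group sizes can be illustrated with a short sketch. The bin width of 5, the sample size, and the helper name `to_category` are assumptions for the example; the figure does not specify the category boundaries that were used.

```python
import random
from collections import Counter


def to_category(value, width=5, low=1):
    """Map an integer to a range label such as '1-5' or '6-10'."""
    start = low + ((value - low) // width) * width
    return f"{start}-{start + width - 1}"


# Artificial integer variable ranging from 1 to 20 (assumed sample size).
rng = random.Random(0)
values = [rng.randint(1, 20) for _ in range(200)]
categories = [to_category(v) for v in values]

# k for a single variable: the size of the smallest group of identical entries.
k_before = min(Counter(values).values())
k_after = min(Counter(categories).values())

print("k with exact integers:", k_before)
print("k with categories:    ", k_after)
```

Because each category pools several exact values, the smallest category is never smaller than the smallest original group, at the cost of coarser information per record.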