Ten quick tips for protecting health data using de-identification and perturbation of structured datasets

doi:10.1371/journal.pcbi.1013507

Table 1.

Typical personally identifiable variables in health record data. Adapted from HL7 documentation [12] and Provincial Health Data Centre [13].


Table 2.

Systematic and overlapping steps involved in data de-identification and anonymisation. Compiled from [20,21,23,25,26].


Fig 1.

Anonymising precise numerical values.

Rounding precise values to a fixed number of decimal places or significant figures can help achieve k-anonymity whilst retaining variable characteristics and epidemiological meaning (artificial dataset). A: Birthweights (kg) with four-decimal-place precision; B: birthweights (kg) rounded to one decimal place; C: precise number of exercise days per year; D: number of exercise days per year with jitter in the range −5 to +5 days.
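The two operations in this figure can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the authors' code: the function names, the synthetic data, and the decision to clamp jittered values to the range 0 to 365 are all assumptions made for the example.

```python
import random
from collections import Counter


def round_values(values, decimals=1):
    """Round each value to a fixed number of decimal places."""
    return [round(v, decimals) for v in values]


def jitter_days(values, max_shift=5, lower=0, upper=365, seed=None):
    """Add uniform integer noise in [-max_shift, +max_shift].

    Results are clamped to [lower, upper]; the clamping is an assumption
    of this sketch, not something stated in the caption.
    """
    rng = random.Random(seed)
    return [
        min(upper, max(lower, v + rng.randint(-max_shift, max_shift)))
        for v in values
    ]


def smallest_group(values):
    """Size of the rarest value, i.e. k for this single variable."""
    return min(Counter(values).values())


# Artificial data generated purely for illustration.
data_rng = random.Random(0)
birthweights = [round(data_rng.gauss(3.4, 0.5), 4) for _ in range(1000)]
exercise_days = [data_rng.randint(0, 365) for _ in range(1000)]

rounded = round_values(birthweights, decimals=1)
jittered = jitter_days(exercise_days, max_shift=5, seed=1)

print("k before rounding:", smallest_group(birthweights))
print("k after rounding: ", smallest_group(rounded))
```

Rounding merges nearby values into the same group, so the smallest group can only stay the same size or grow. Jitter, by contrast, does not by itself increase group sizes; it breaks the exact link between a record and its true value.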


Fig 2.

Checking bivariate correlation before and after perturbation.

An exhaustive bivariate correlation matrix shows that pairwise correlations remain generally similar after perturbation. Red shading indicates positive correlation and blue shading indicates negative correlation. The value in each cell is the correlation coefficient. A: Original dataset; B: dataset after perturbation of multiple fields.
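A check of this kind can be sketched as follows. The example computes Pearson coefficients for every pair of columns before and after perturbation and reports the largest change. The column names, the synthetic data, and the specific perturbations are hypothetical, and Pearson correlation is assumed here because the caption does not name the coefficient used.

```python
import random
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def correlation_matrix(columns):
    """Return {(name_a, name_b): r} for every unordered pair of columns."""
    return {
        (a, b): pearson(columns[a], columns[b])
        for a, b in combinations(columns, 2)
    }


def max_correlation_shift(original, perturbed):
    """Largest absolute change in any pairwise correlation."""
    before = correlation_matrix(original)
    after = correlation_matrix(perturbed)
    return max(abs(before[pair] - after[pair]) for pair in before)


# Artificial, correlated data generated purely for illustration.
rng = random.Random(42)
height = [rng.gauss(170, 10) for _ in range(2000)]
weight = [70 + 0.9 * (h - 170) + rng.gauss(0, 8) for h in height]
exercise = [150 - 1.5 * (w - 70) + rng.gauss(0, 40) for w in weight]
original = {"height": height, "weight": weight, "exercise": exercise}

# Hypothetical perturbations: coarse rounding and +/-5 jitter.
perturbed = {
    "height": [round(h / 5) * 5 for h in height],
    "weight": [round(w) for w in weight],
    "exercise": [e + rng.randint(-5, 5) for e in exercise],
}

print("largest change in r:", round(max_correlation_shift(original, perturbed), 4))
```

A small maximum shift suggests the perturbation has left the pairwise structure largely intact. What counts as acceptably small depends on the intended analysis.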


Fig 3.

Checking k-anonymisation before and after perturbation.

Grouping numerical values into range-based categories can increase k-anonymity (artificial dataset; x-axis = value or category, y-axis = count per value or category). A: Exact integer values ranging from 1 to 20; B: categories derived from the integer values ranging from 1 to 20.
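The effect shown in this figure can be reproduced with a short sketch. The bin width of 5, the label format, and the helper names are assumptions chosen for illustration. k is taken here as the size of the smallest group for this single variable.

```python
import random
from collections import Counter


def to_category(value, width=5, lowest=1):
    """Map an integer to a range label such as '1-5' or '16-20'."""
    start = lowest + ((value - lowest) // width) * width
    return f"{start}-{start + width - 1}"


def k_of(values):
    """Size of the smallest group of identical values."""
    return min(Counter(values).values())


# Artificial integers between 1 and 20, skewed so that some values are rare.
rng = random.Random(7)
values = [min(20, max(1, round(rng.gauss(8, 4)))) for _ in range(300)]
categories = [to_category(v) for v in values]

print("counts per exact value:", sorted(Counter(values).items()))
print("counts per category:  ", sorted(Counter(categories).items()))
print("k (exact values):", k_of(values))
print("k (categories):  ", k_of(categories))
```

Each category is a union of exact values, so the smallest category is never smaller than the smallest exact-value group. Wider bins raise k further at the cost of resolution.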
