A Game Theoretic Framework for Analyzing Re-Identification Risk

Given the wealth of insights that large databases of personal data can provide, many organizations aim to share data while protecting privacy by sharing de-identified data, but they remain concerned because various demonstrations show that such data can be re-identified. Yet these investigations focus on how attacks can be perpetrated, not on the likelihood that they will be realized. This paper introduces a game theoretic framework that enables a publisher to balance re-identification risk with the value of sharing data, leveraging a natural assumption that a recipient only attempts re-identification if its potential gains outweigh the costs. We apply the framework to a real case study, where the value of the data to the publisher is the actual grant funding dollar amounts from a national sponsor and the re-identification gain of the recipient is the fine paid to a regulator for violation of federal privacy rules. There are three notable findings: 1) it is possible to achieve zero risk, in that the recipient never gains from re-identification, while sharing almost as much data as the optimal solution that allows for a small amount of risk; 2) the zero-risk solution enables sharing much more data than a commonly invoked de-identification policy of the U.S. Health Insurance Portability and Accountability Act (HIPAA); and 3) a sensitivity analysis demonstrates these findings are robust to order-of-magnitude changes in player losses and gains. In combination, these findings suggest that such a framework can enable pragmatic policy decisions about de-identified data sharing.


De-identification Models
In the re-identification game studied in this article, we assume the defender's strategy set is derived from de-identification models. Various models have been developed, but all aim to address identity disclosure risk by transforming the attributes that could be used to ascertain an individual's identity. While there are other privacy concerns (e.g., attribute disclosure [1], presence / absence in a dataset [2], and contribution to a statistical distribution [3]), we focus on identity disclosure because of its direct relationship with existing privacy laws and a broad class of data protection methodologies [4]. We believe that our game theoretic framework will generalize to other privacy models, such as differential privacy [3], which applies random noise to shared data, but we focus on identity disclosure to illustrate the novel perspective that games bring to the de-identification problem.
The operations that can be applied to de-identify a record can broadly be characterized as i) randomization, ii) generalization, and iii) suppression. We focus on generalization and suppression because they are widely adopted in data protection policies and de-identification algorithms. De-identification policies, such as HIPAA's Safe Harbor [5], often use rules in the form of an enumerated list of features that need to be generalized (e.g., a 5-digit Zip code needs to be generalized to its first 3 digits, provided there are more than 20,000 people in the resulting region). Similar rule-based policies have been invoked in other countries, such as Canada [6]. These policies are not necessarily optimal, and so de-identification policy search methods have been proposed to discover policies that maximize data utility while satisfying a risk threshold [7][8][9].
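To illustrate what a rule-based generalization looks like in practice, the Zip code example above can be sketched as follows. This is a minimal sketch rather than a complete Safe Harbor implementation: the function name, the population table, and its counts are hypothetical, and the replacement of small regions with "000" follows the convention in the Safe Harbor text.

```python
def generalize_zip(zip5, population_by_zip3, threshold=20000):
    """Generalize a 5-digit Zip code to its first 3 digits.

    The prefix is kept only when the region it covers holds more than
    `threshold` people; otherwise the prefix is replaced with "000".
    """
    prefix = zip5[:3]
    if population_by_zip3.get(prefix, 0) > threshold:
        return prefix
    return "000"


# Hypothetical population counts per 3-digit prefix (illustration only).
population_by_zip3 = {"372": 650000, "036": 12000}

print(generalize_zip("37203", population_by_zip3))
print(generalize_zip("03601", population_by_zip3))
```

A prefix missing from the table is treated as having zero population, so the sketch errs on the side of suppression.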
Beyond rule-based policies, other approaches have focused on ensuring the dataset itself satisfies a certain level of protection. For instance, k-anonymity [10] requires that each record be indistinguishable from at least k-1 other records with respect to its potentially identifying attributes. While k-anonymity can be achieved through any of the aforementioned operations [11], the most common approach is generalization [12]. We adopt the generalization model without enforcing a specific protection parameter. Rather, we search for a generalization that maximizes the payoff for the publisher of a record.
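The k-anonymity condition can be checked by grouping records on their potentially identifying attributes and inspecting the group sizes. The sketch below assumes records are dictionaries; the attribute names and sample records are hypothetical.

```python
from collections import Counter


def equivalence_class_sizes(records, quasi_identifiers):
    """Count how many records share each combination of attribute values."""
    keys = [tuple(r[a] for a in quasi_identifiers) for r in records]
    return Counter(keys)


def is_k_anonymous(records, quasi_identifiers, k):
    """True if every record matches at least k-1 others on the given attributes."""
    sizes = equivalence_class_sizes(records, quasi_identifiers)
    return all(n >= k for n in sizes.values())


# Hypothetical generalized records (illustration only).
records = [
    {"age": "30-39", "zip": "372**", "diagnosis": "A"},
    {"age": "30-39", "zip": "372**", "diagnosis": "B"},
    {"age": "40-49", "zip": "372**", "diagnosis": "C"},
]

print(is_k_anonymous(records, ["age", "zip"], 2))
print(is_k_anonymous(records, ["zip"], 2))
```

In this sample, the single record in the 40-49 age group violates 2-anonymity on (age, zip), while grouping on zip alone places all three records in one class.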
A number of generalization models have been developed (particularly for k-anonymization), and it is important to clarify which is used in this work. Specifically, we use a full-domain generalization model [13], which is the cross-classification of the domain generalization hierarchy (DGH) for each attribute. This model is the most frequently used in practice, though our framework can be extended to other generalization models, such as full-subtree generalization [14] and multi-dimensional generalization [15].
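Under full-domain generalization, a strategy assigns one hierarchy level to each attribute, so the strategy space is the Cartesian product of the per-attribute levels. The sketch below enumerates that space for two attributes; the hierarchies, level functions, and names are hypothetical and chosen only for illustration.

```python
from itertools import product

# Hypothetical domain generalization hierarchies: level 0 is the raw value,
# and the last level suppresses the attribute entirely.
DGH = {
    "age": [
        lambda v: str(v),
        lambda v: f"{(v // 10) * 10}-{(v // 10) * 10 + 9}",
        lambda v: "*",
    ],
    "zip": [
        lambda v: v,
        lambda v: v[:3] + "**",
        lambda v: "*****",
    ],
}


def lattice(dgh):
    """Enumerate every combination of per-attribute generalization levels."""
    attrs = sorted(dgh)
    level_ranges = [range(len(dgh[a])) for a in attrs]
    return [dict(zip(attrs, levels)) for levels in product(*level_ranges)]


def apply_policy(record, policy, dgh):
    """Generalize each attribute of a record to the level the policy assigns."""
    return {a: dgh[a][level](record[a]) for a, level in policy.items()}


policies = lattice(DGH)
print(len(policies))
print(apply_policy({"age": 34, "zip": "37203"}, {"age": 1, "zip": 1}, DGH))
```

The size of this space grows multiplicatively with the number of attributes and hierarchy levels, which is why the choice of search strategy matters.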

Risk of Privacy Violations
No system is impervious to attack and, thus, re-identification risk assessments must be performed. [16] suggested three models of re-identification risks: i) prosecutor, ii) journalist, and iii) marketer. For these risks, it is assumed there is a published dataset, which is based on a sample of a broader population. The prosecutor and journalist risks correspond to the most re-identifiable record in the dataset and population, respectively. The marketer risk, by contrast, corresponds to the average risk of all records in the dataset. While [17] provides mathematical definitions of these scenarios, it is assumed that the attack will always be attempted. However, as we show in this work, the cost of a privacy violation (e.g., expected loss in terms of a fine) greatly influences this decision. There have been investigations into the cost of privacy violations. For instance, [18] introduced quantitative methods to define privacy violations and their consequences. They provided definitions of sensitivity and severity (of privacy violations), taking into account the degree to which data subjects are concerned about their own privacy. [19] designed a decision theoretic framework to assess privacy risk that accounts for both entity identification and the sensitivity of the disclosed information. However, these models did not consider multiple players in a game with varying strategies.
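The three risk models named above can be illustrated with equivalence class sizes. The following is a simplified sketch, not the exact definitions of [17]: it assumes that the risk of a record is the reciprocal of its class size, that every record in the sample also appears in the population, and that all function and variable names are hypothetical.

```python
from collections import Counter


def risks(sample_keys, population_keys):
    """Return (prosecutor, journalist, marketer) risk under simplified definitions.

    Each key is the tuple of identifying attribute values for one record.
    Assumes every key in the sample also occurs in the population.
    """
    f = Counter(sample_keys)      # class sizes in the published sample
    F = Counter(population_keys)  # class sizes in the broader population
    prosecutor = max(1 / f[k] for k in f)   # worst case within the sample
    journalist = max(1 / F[k] for k in f)   # worst case against the population
    marketer = sum(1 / f[k] for k in sample_keys) / len(sample_keys)  # average
    return prosecutor, journalist, marketer


# Hypothetical class memberships (illustration only).
sample_keys = ["a"] * 4 + ["b"] * 2
population_keys = ["a"] * 8 + ["b"] * 4 + ["c"] * 5

prosecutor, journalist, marketer = risks(sample_keys, population_keys)
print(prosecutor, journalist, marketer)
```

All three quantities describe how re-identifiable the records are, not whether an attack would actually be attempted.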
One of the challenges associated with re-identification is that an adversary must obtain a degree of background knowledge in order to perpetrate their attack. In certain instances, this knowledge may be gained by observation, such as when the adversary sees an ambulance leaving their neighbor's house. Yet such information may be difficult to come by and, thus, it has been suggested that reasonable adversaries are more likely to use resources, such as public records or information brokers, that can be gathered or queried en masse. In this regard, there has been some investigation into the credentials and costs associated with gathering such resources. In particular, [20] illustrated that voter registration records, which have been used for re-identifications, have a wide range in price (from $0 to $17,000), which is set by the state or municipality making them available, and that the amount of information useful for re-identification (e.g., demographics) is not correlated with the price (e.g., the most expensive resource actually had the least amount of information).

Games Applied to Privacy and Security
Game theoretic frameworks have been introduced to model privacy and security problems. [21] proposed a security game model between the defender (e.g., police officers) and the adversary (e.g., terrorists) to optimize the allocation of limited security resources, which is extended in [22] by considering the surveillance cost and partial knowledge of the adversary. In [23], the authors defined a multi-party game to formulate a privacy-preserving distributed data mining problem. However, in this game, every party is both a data publisher and an adversary, which differs from our two-player game. In [24], a normal form game between a user and a service provider was defined for assessing privacy risk. In this setting, the user chooses whether or not to provide private information, while the service provider chooses whether or not to exploit the user's private information. The strategy set in that game is significantly smaller than the one we consider. [25] modeled the location privacy protection problem as a two-player, zero-sum, signaling game. However, the sum of the payoffs of the two players in our game model is not zero. The Stackelberg game model, which we leverage to model the re-identification game, has been used in various contexts. [26], for instance, modeled the adversarial prediction problem as a Stackelberg game between a data generator (leader) and a learner (follower). Here, the leader generates data based on the follower's prediction models to create confusion, while the follower adjusts the prediction models to account for the leader's response. [27] modeled the adversarial prediction problem as a single-shot game, a kind of Stackelberg game like the one used in our model, and explored the conditions for the existence of a unique Nash equilibrium.
[28] introduced the notion of games for auditing the use of medical records in the context of primary care settings and presented a polynomial-time approximation scheme to compute a solution that is arbitrarily close to the optimal solution. Their approach to computing the Stackelberg equilibrium is based on the multiple-LPs technique of Conitzer and Sandholm [29]. In their Stackelberg game model, the defender is the data publisher (e.g., hospital) and the adversary is the data recipient (e.g., employee), which is very similar to the setting of our game model. However, while their approach can only mitigate privacy risk after the data are published, we minimize the risk before the data are published.
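The leader-follower structure of a Stackelberg game can be made concrete with a toy example in which a publisher commits to a sharing strategy and a recipient then decides whether to attack. This is a minimal sketch under stated assumptions, not the model developed in this article: the payoff forms, the strategy names, and every number are hypothetical, and the recipient is assumed not to attack when indifferent.

```python
def follower_best_response(p_success, gain, attack_cost):
    """Recipient attacks only if the expected gain exceeds the cost of attacking."""
    return p_success * gain > attack_cost


def leader_payoff(value, p_success, loss, attacked):
    """Publisher keeps the data value, minus the expected loss if attacked."""
    return value - (p_success * loss if attacked else 0.0)


def solve(strategies, gain, loss, attack_cost):
    """Pick the publisher strategy with the highest payoff, given that the
    recipient best-responds to whichever strategy is chosen."""
    best = None
    for name, value, p_success in strategies:
        attacked = follower_best_response(p_success, gain, attack_cost)
        payoff = leader_payoff(value, p_success, loss, attacked)
        if best is None or payoff > best[1]:
            best = (name, payoff, attacked)
    return best


# Hypothetical strategies: (name, value of sharing, re-identification probability).
strategies = [
    ("raw", 1000, 0.9),
    ("moderate", 800, 0.05),
    ("heavy", 300, 0.001),
]

result = solve(strategies, gain=500, loss=500, attack_cost=50)
print(result)
```

With these numbers, sharing raw data invites an attack, whereas moderate generalization deters the attack while retaining most of the value, so the publisher prefers it to both extremes.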