Yahtzee: An Anonymized Group Level Matching Procedure

doi:10.1371/journal.pone.0055760

Table 1.

Hash Example.


Figure 1.

Step 1 of the Yahtzee procedure begins with the hashing of both datasets using a new salt (see Table 1 above).

In step 2 the group ID is determined for all groups in the origin dataset that meet the group-size condition and is then matched to the same group ID in the destination group-level dataset. Notice that the hashing procedure and group aggregation are the same in both datasets, except that we keep all groups in the destination dataset, regardless of size. This is because we only need to know the group size from the origin dataset to make predictions about behavior in the destination dataset. Once the group-level datasets are matched by group ID, the group-level information is stored and the process is repeated for the chosen number of rounds. In step 3 the group-level data is sent to the holder of the destination dataset so that the group-level values can be assigned to the individual observations, based on the same hashes used to construct the groups during each of the Yahtzee rounds. Once the destination dataset has acquired a sufficient number of group-level values (see Figure 2 for information on determining how many are needed), the combined information can be used to predict the behavior of each individual, which is step 4 of the Yahtzee procedure. For our application, using equations 4, 5 and 6 above, it is possible to predict whether the individual is unregistered, a voter or an abstainer. Finally, it is worth repeating that only the group-level data is passed from the origin to the destination dataset. See the Pseudocode for additional information.
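The rounds described in this caption can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the grouping rule (a salted SHA-256 digest truncated to a short prefix), the `min_size` threshold, and all function and variable names are hypothetical. The sketch also collapses steps 2 and 3 by assigning origin group-level values directly to destination individuals, and it stops before step 4, since the prediction equations are not reproduced here.

```python
import hashlib
import secrets
from collections import defaultdict


def group_id(identifier: str, salt: str, prefix_len: int = 2) -> str:
    """Salted hash truncated to a short prefix (assumed grouping rule)."""
    digest = hashlib.sha256((salt + identifier).encode("utf-8")).hexdigest()
    return digest[:prefix_len]


def origin_group_values(origin: dict, salt: str, min_size: int = 1) -> dict:
    """Aggregate origin outcomes (0 or 1) by group.

    Returns a mapping of group ID to (group size, mean outcome), keeping
    only groups with at least min_size members (assumed threshold).
    """
    members = defaultdict(list)
    for ident, outcome in origin.items():
        members[group_id(ident, salt)].append(outcome)
    return {
        gid: (len(vals), sum(vals) / len(vals))
        for gid, vals in members.items()
        if len(vals) >= min_size
    }


def yahtzee_rounds(origin: dict, destination_ids: list, rounds: int,
                   min_size: int = 1) -> dict:
    """Collect group-level observations for each destination individual."""
    observations = defaultdict(list)
    for _ in range(rounds):
        # A fresh salt each round reshuffles individuals into new groups.
        salt = secrets.token_hex(8)
        # Only this group-level summary leaves the origin holder.
        group_level = origin_group_values(origin, salt, min_size)
        for ident in destination_ids:
            gid = group_id(ident, salt)
            if gid in group_level:
                observations[ident].append(group_level[gid])
    return observations
```

For example, `yahtzee_rounds({"alice": 1, "bob": 0}, ["alice", "carol"], rounds=5)` yields five `(size, mean)` pairs for `"alice"`, one per round, while `"carol"` receives a value only in rounds where her hash prefix happens to coincide with a group present in the origin data.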


Figure 2.

The proportion of correct predictions for participation rates of 30%, 45%, 55%, and 70% (the match rate is held constant at 30% in all four figures) from a simulation of the matching procedure.

The dark line represents the accuracy rate for true participators. The light line represents the accuracy rate for true abstainers. Accuracy increases for both categories as more observations for each individual are obtained from the Yahtzee procedure. Note that the less frequent of the two behaviors requires fewer observations for classification than the more frequent behavior. Two quantities are marked: the number of observations per person necessary to achieve a given level of accuracy for the less frequent behavior, and the number of observations necessary to achieve a given level of accuracy for the more frequent behavior.


Table 2.

Number of Draws.


Correlation.
