Secure Surveillance of Antimicrobial Resistant Organism Colonization or Infection in Ontario Long Term Care Homes

Background: There is stigma attached to the identification of residents carrying antimicrobial resistant organisms (AROs) in long term care homes, yet there is a need to collect data about their prevalence for public health surveillance and intervention purposes. Objective: We conducted a point prevalence study to assess ARO rates in long term care homes in Ontario using a secure data collection system. Methods: All long term care homes in the province were asked to provide colonization or infection counts for methicillin-resistant Staphylococcus aureus (MRSA), vancomycin-resistant enterococci (VRE), and extended-spectrum beta-lactamase (ESBL)-producing organisms as recorded in their electronic medical records, together with the number of current residents. Data were collected online during the October-November 2011 period using a Paillier cryptosystem that allows computation on encrypted data. Results: A provably secure data collection system was implemented. Overall, 82% of the homes in the province responded. MRSA was the most frequent ARO identified, at 3 cases per 100 residents, followed by ESBL at 0.83 per 100 residents and VRE at 0.56 per 100 residents. The microbiological findings and their distribution were consistent with available provincial laboratory data reporting test results for AROs in hospitals. Conclusions: We describe an ARO point prevalence study which demonstrated the feasibility of collecting data securely from long term care homes across the province while providing strong privacy and confidentiality assurances and obtaining high response rates.

To address these challenges, we propose a distributed protocol with the weaker requirement of having only semi-trusted third parties. A semi-trusted third party would not be able to access any of the raw data, even if it wanted to. This means that in the event of a security compromise, staff corruption, or a compelled disclosure, there is no additional risk of the raw data being viewed. A protocol with semi-trusted third parties also removes the requirement that the participating homes completely trust the third party. The only requirement on a semi-trusted third party is that it follow the protocol faithfully.

Ethics Considerations
The use of a secure surveillance system should expedite the ethics review process because there are no major privacy issues to contend with. In addition, because no information that would identify residents, or that would implicate LTCHs, was being collected, we examined whether this kind of study required ethics review at all.
In Canada, REBs follow the Tri-Council Policy Statement: Ethical Conduct for Research Involving Humans (TCPS) [14]. Article 2.4 of the TCPS states that no REB review is required if a research project relies on the secondary use of "anonymous" data. The document goes on to explain that anonymous information and human biological materials are distinct from those that have been coded, or anonymized. Whereas anonymous data never contained identifiers, anonymized data was previously identifiable but later irreversibly stripped of this information.
In this study we collected existing colonization and infection data from the medical records of LTCH residents. The secure protocol ensures that the identity of residents cannot be determined and that individual facility values cannot be computed. This de-identification process is irreversible. However, the mere existence of a private key that can decrypt the data, even if the protocol uses multiple parties to ensure that such decryption does not happen, means that the data cannot be considered anonymous as defined in the TCPS. Furthermore, the data entered by the LTCHs into our system existed originally in identifiable form in electronic medical records, which also means that the data collected are not anonymous (because an identifiable version of that data exists), even if this identifiable data is not collected for the study. Therefore, according to that reasoning, we were collecting anonymized data, which is not exempt from ethics review.
Moreover, even if data is anonymous (or anonymized), some have argued that individuals may have a property interest in their personal information, suggesting the need to consult with affected individuals in some manner [15], although this is not a generally accepted legal theory. On the other hand, it can also be argued that studies with anonymous (or perhaps anonymized) data can still present ethical risks due to the possibility of group harm [16]. For example, a research protocol restricted to the study of anonymized information might nevertheless classify residents on the basis of their ethnicity or race (e.g., those of Aboriginal origin versus those not of Aboriginal origin). It may therefore be possible to draw conclusions about Aboriginal communities on the basis of these research data. These inferences could be harmful to Aboriginal people as a group, for example by underscoring social stigma or associating the group with a higher prevalence of infections or colonizations. An REB would be in a position to make that determination. Therefore, the definitions in Article 2.4 still require secure data collection protocols to go through ethics review, as does the more general need to examine the potential for group harm. We presented this protocol to the Children's Hospital of Eastern Ontario Research Institute REB. The use of a secure protocol expedited the review process considerably, since there was little privacy risk or risk of inadvertent harm to the homes, and there were no evident group harms to consider.

Paillier Cryptosystem
We use the additive homomorphic encryption system proposed by Paillier [17], where n = pq is a product of two large prime numbers, E is the encryption function, and D is the decryption function. In this type of cryptosystem, addition of plaintexts is mapped to multiplication of the corresponding ciphertexts:

D(E(m_1) \cdot E(m_2) \bmod n^2) = m_1 + m_2 \bmod n (1)

The Paillier cryptosystem also allows a limited form of multiplication of an encrypted value:

D(E(m_1)^{m_2} \bmod n^2) = m_1 \cdot m_2 \bmod n (2)

which allows an encrypted value to be multiplied with a plaintext value to obtain their product.
Another property of Paillier encryption is that it is probabilistic. This means that it uses randomness in its encryption algorithm so that when encrypting the same message several times it will, in general, yield different ciphertexts. This property is important to ensure that an adversary would not be able to compare an encrypted message to all possible counts from zero onwards and determine what the encrypted value is.
Depending on the context we also use the simpler notation [[m]] to denote the encryption of m under a semantically secure additively homomorphic public-key encryption algorithm, such as Paillier [17].
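As an illustration of the two homomorphic properties above and of probabilistic encryption, the following toy sketch implements Paillier with deliberately small primes. All parameter values and names are illustrative only; a real deployment would use keys of 1024 bits or more.

```python
import random
from math import gcd, lcm

# Toy Paillier parameters (illustrative only; real keys use large primes)
p_, q_ = 293, 433
n, n2 = p_ * q_, (p_ * q_) ** 2
g = n + 1                            # standard generator choice
lam = lcm(p_ - 1, q_ - 1)            # Carmichael's lambda(n)
L = lambda u: (u - 1) // n           # the L function from Paillier's scheme
mu = pow(L(pow(g, lam, n2)), -1, n)  # decryption constant

def encrypt(m):
    r = random.randrange(1, n)       # fresh randomness: encryption is probabilistic
    while gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c):
    return (L(pow(c, lam, n2)) * mu) % n

# Property (1): multiplying ciphertexts adds the plaintexts
assert decrypt(encrypt(3) * encrypt(5) % n2) == 8
# Property (2): exponentiating a ciphertext by a constant multiplies the plaintext
assert decrypt(pow(encrypt(3), 4, n2)) == 12
# Probabilistic encryption: the same message yields different ciphertexts
assert encrypt(7) != encrypt(7)
```

The last assertion demonstrates why an adversary cannot simply encrypt all possible counts from zero onwards and compare them against an observed ciphertext.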

Weighted Mean
The weighted mean (by number of residents) for a cell is given by:

\bar{y}_{ij} = \frac{\sum_k n_{ijk} \, y_{ijk}}{\sum_k n_{ijk}}

where facility k in cell (i, j) has n_{ijk} residents, c_{ijk} colonizations, and rate y_{ijk} = 100 \, c_{ijk} / n_{ijk} per 100 residents. Since n_{ijk} \, y_{ijk} = 100 \, c_{ijk}, the weighted mean for cell (i, j) can be rewritten as:

\bar{y}_{ij} = 100 \, \frac{\sum_k c_{ijk}}{\sum_k n_{ijk}}

Therefore, the weighted cell mean by number of residents is also the pooled rate of colonizations.
The weighted (marginal) mean for row i is given by:

\bar{y}_{i\cdot} = 100 \, \frac{\sum_j \sum_k c_{ijk}}{\sum_j \sum_k n_{ijk}}

Therefore, the weighted row mean by number of residents is also the pooled rate of colonizations by bed size. Similarly, the weighted column mean by number of residents is also the pooled rate of colonizations by region, given by:

\bar{y}_{\cdot j} = 100 \, \frac{\sum_i \sum_k c_{ijk}}{\sum_i \sum_k n_{ijk}}
The secure computation of these values by the Aggregator is straightforward, as all of the computations are sums. The numerator and denominator are computed separately and sent to the KH, which decrypts the two values, divides them, and multiplies by 100 to obtain the rate. The values sent by the LTCHs to the Aggregator for each cell would be the encrypted counts [[c_{ijk}]] and [[n_{ijk}]].
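The pooled-rate flow above can be sketched end to end. The facility counts and key parameters below are illustrative toy values, not data from the study.

```python
import random
from math import gcd, lcm

# Toy Paillier key material (illustrative parameters only)
p_, q_ = 293, 433
n, n2 = p_ * q_, (p_ * q_) ** 2
g, lam = n + 1, lcm(p_ - 1, q_ - 1)
L = lambda u: (u - 1) // n
mu = pow(L(pow(g, lam, n2)), -1, n)

def enc(m):
    r = random.randrange(1, n)
    while gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def dec(c):
    return (L(pow(c, lam, n2)) * mu) % n

# Each LTCH sends its encrypted colonization and resident counts
homes = [(3, 60), (5, 120), (2, 80)]        # invented (c_k, n_k) pairs
ciphertexts = [(enc(c), enc(r)) for c, r in homes]

# The Aggregator multiplies ciphertexts, which adds the underlying counts,
# without being able to read any individual facility's values
agg_c = agg_n = 1
for ec, er in ciphertexts:
    agg_c, agg_n = (agg_c * ec) % n2, (agg_n * er) % n2

# The KH decrypts only the two totals and forms the pooled rate per 100
total_c, total_n = dec(agg_c), dec(agg_n)
rate = 100 * total_c / total_n
print(round(rate, 2))
```

Note that the KH only ever sees the two aggregate sums, never any per-facility value.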

Weighted Sample Variances
Similarly, we calculated the (unbiased) weighted sample variances to accompany the above means.
The weighted variance for cell (i, j) is:

s^2_{ij} = \frac{\sum_k n_{ijk} \, (y_{ijk} - \bar{y}_{ij})^2}{\sum_k n_{ijk} - \frac{\sum_k n_{ijk}^2}{\sum_k n_{ijk}}}

We summarize the variance equations in Table 2.

Secure Computations for Variances
We focus on the computation of the variance. The variance for a cell is given by:

s^2 = \frac{\sum_k n_k (y_k - \bar{y})^2}{\sum_k n_k - \frac{\sum_k n_k^2}{\sum_k n_k}}

(cell subscripts are dropped for readability). We will determine the numerator and denominator separately. Expanding the numerator gives:

\sum_k n_k (y_k - \bar{y})^2 = \sum_k n_k y_k^2 - \frac{\left(\sum_k n_k y_k\right)^2}{\sum_k n_k}

The second term is easily computed, since \sum_k n_k y_k = 100 \sum_k c_k. The first term can be broken down as follows:

\sum_k n_k y_k^2 = 10^4 \sum_k \frac{c_k^2}{n_k}

so each facility computes c_k^2 / n_k, encrypts it, and sends it along with the other encrypted values. Note that we need to perform scaling for the above fraction before encryption, and the scale factor has to be sent along as well.
The denominator can be expressed as:

\sum_k n_k - \frac{\sum_k n_k^2}{\sum_k n_k}

Each site will also compute the square of its number of residents for each cell, n_k^2. Therefore, the Aggregator would receive from each facility k the following encrypted values: [[c_k]], [[n_k]], [[\lfloor P \, c_k^2 / n_k \rfloor]], and [[n_k^2]], where P is a precision factor. Note that the third value, the division, is constructed to always produce an integer value: the fractional term is multiplied by the precision factor and then truncated.
The Aggregator will homomorphically compute the following values and then send them to the KH:

[[sc]] = \prod_k [[c_k]], \quad [[sn]] = \prod_k [[n_k]], \quad [[sscn]] = \prod_k [[\lfloor P \, c_k^2 / n_k \rfloor]], \quad [[ssn]] = \prod_k [[n_k^2]]

The KH will decrypt these values to obtain sc = \sum_k c_k, sn = \sum_k n_k, sscn \approx P \sum_k c_k^2 / n_k, and ssn = \sum_k n_k^2. Then the KH will compute the standard deviation for a cell as follows:

s = \sqrt{\frac{10^4 \, (sscn / P - sc^2 / sn)}{sn - ssn / sn}}

For a row i the variance is computed in exactly the same way, with the sums taken over all facilities in all cells of row i: each facility sends the same four encrypted values to the Aggregator, the Aggregator computes the row-level encrypted sums and sends them to the KH, and the KH decrypts them and applies the same formula. For columns the computations are similar, with the sums taken over all facilities in all cells of column j.
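The algebraic decomposition above can be checked numerically. The following sketch uses invented facility counts and a hypothetical precision factor P, computing the variance once from the transmitted pieces (sn, sc, sscn, ssn) and once directly from the per-facility rates.

```python
import math

# Invented per-facility data for one cell: (colonizations c_k, residents n_k)
data = [(3, 60), (5, 120), (2, 80), (4, 100)]
P = 10**6                                    # hypothetical precision factor

# The four sums the facilities contribute (shown in the clear for checking;
# in the protocol these arrive at the KH as decrypted aggregates)
sn   = sum(n for _, n in data)               # sum of residents
sc   = sum(c for c, _ in data)               # sum of colonizations
sscn = sum(P * c * c // n for c, n in data)  # truncated, scaled sum of c^2/n
ssn  = sum(n * n for _, n in data)           # sum of squared residents

# KH's reconstruction of the weighted variance from the aggregates
num = 10**4 * (sscn / P - sc**2 / sn)
den = sn - ssn / sn
var = num / den

# Direct weighted variance of the per-100 rates, for comparison
ybar = 100 * sc / sn
direct = sum(n * (100 * c / n - ybar) ** 2 for c, n in data) / den
assert abs(var - direct) < 1e-3              # truncation error is tiny
print(round(math.sqrt(var), 3))              # cell standard deviation
```

The small discrepancy between the two computations comes only from truncating P * c^2 / n to an integer, and shrinks as P grows.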

Security Analysis
In this section we analyze the security of the data collection system in two settings. In the first setting we define an ideal functionality and prove that our protocol implements this functionality in the semi-trusted model. The standard goal for such an analysis is to show that the parties learn nothing more about each other's private inputs than what is revealed by the output itself. In the second setting we analyze what an adversary can learn about the private inputs from the output.
We specifically consider the case of the information provided by the LTCHs to compute the cell standard deviation, because it is a superset of the information provided for the other statistics. A security proof for the more detailed information used for the cell standard deviation therefore also applies to any subset of that information.

Setting 1 -Security of Private Data Collection Protocol
Loosely speaking, a protocol is secure in the semi-trusted adversarial model if a party learns nothing more from participating in the protocol than it learns from the inputs and outputs available to that party.
We proceed by defining an ideal private data collection protocol in the presence of a trusted third party (TTP). We then review our semi-trusted private data collection protocol described earlier, and show that its participants learn nothing more from the execution of this protocol than they would in the ideal implementation; in turn, our private data collection protocol is secure against a semi-trusted adversary.

Ideal Private Data Collection
Private data collection (PDC) is ideally implemented between K long-term care homes H_1, ..., H_K and a trusted third party TTP: each home sends its private input directly to the TTP, which computes the required aggregate statistics and outputs them. In this ideal implementation, no party other than the TTP sees any home's input.

Semi-trusted Private Data Collection Protocol
The semi-trusted private data collection protocol (PDC-semi-trusted) is implemented between K LTCHs H_1, ..., H_K, the Aggregator, and the key holder KH. KH generates a public and private key pair, and broadcasts the public key to all parties. Each home H_i then sends to the Aggregator its encrypted values [[c_i]], [[n_i]], [[\lfloor P \, c_i^2 / n_i \rfloor]], and [[n_i^2]], where P is the precision factor used for the fractional term. The Aggregator homomorphically computes the sums [[sn]], [[sc]], [[sscn]], and [[ssn]] and sends them to KH. KH recovers sn, sc, sscn, and ssn by decrypting the ciphertexts it received from the Aggregator.
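A minimal end-to-end run of this message flow might look as follows; the Paillier parameters, facility counts, and precision factor are all invented toy values.

```python
import random
from math import gcd, lcm

# Toy Paillier key pair generated by KH (illustrative parameters only)
p_, q_ = 293, 433
n, n2 = p_ * q_, (p_ * q_) ** 2
g, lam = n + 1, lcm(p_ - 1, q_ - 1)
L = lambda u: (u - 1) // n
mu = pow(L(pow(g, lam, n2)), -1, n)

def enc(m):
    r = random.randrange(1, n)
    while gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def dec(c):
    return (L(pow(c, lam, n2)) * mu) % n

P = 100                                # toy precision factor
homes = [(3, 60), (5, 120), (2, 80)]   # invented (colonizations, residents)

# Each home H_i sends four ciphertexts to the Aggregator
msgs = [(enc(c), enc(r), enc(P * c * c // r), enc(r * r)) for c, r in homes]

# The Aggregator homomorphically sums each component (product of ciphertexts)
agg = [1, 1, 1, 1]
for m in msgs:
    agg = [(a * x) % n2 for a, x in zip(agg, m)]

# KH decrypts only the four aggregates, never any individual home's values
sc, sn, sscn, ssn = (dec(a) for a in agg)
print(sc, sn, sscn, ssn)
```

The Aggregator handles only semantically secure ciphertexts, while KH handles only the four aggregate sums, matching the separation of roles described above.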

Security of Protocol PDC-Semi-Trusted
We state the claims of security for PDC in the semi-trusted model.

Lemma 1 (Correctness). Protocol PDC-semi-trusted evaluates the PDC functionality with high probability.
The proof is based on the fact that KH receives encryptions of sn, sc, sscn, and ssn.

Lemma 2 (Privacy is preserved between homes). For each home H_i, every other home H_j (j ≠ i) learns nothing about H_i's input.
The proof is based on the fact that homes receive no information from the protocol execution.

Lemma 3 (Privacy is preserved between homes and the Aggregator). If the encryption scheme is semantically secure, then the protocol views of the Aggregator for any two inputs of any home H_i are indistinguishable.
The proof is based on the fact that the only information the Aggregator receives consists of semantically secure encryptions for which it does not know the associated private key.

Lemma 4 (Privacy is preserved between homes and the KH). KH learns nothing about any individual home's input beyond what is implied by the decrypted aggregates.
The proof is based on the fact that KH only receives the encryptions of sn, sc, sscn, and ssn.

Definition (identifying result):
Let r be the result of the ground truth solution s. Let s_1, ..., s_ℓ be the ℓ spurious solutions of r. We call {s, s_1, ..., s_ℓ} the solution set of r. We say that result r is t-identifying if ℓ < t for some threshold t. We say that r is t-non-identifying if ℓ ≥ t.

Definition (anonymity):
Let R be an ensemble of result vectors. Let β be the minimally acceptable probability that a particular result is t-non-identifying. We say the results ensemble R is (t, β)-anonymous if each result r ∈ R is t-non-identifying with probability greater than or equal to β.

Definition (integer partition):
Let n be a positive integer. An integer partition of n is defined as any vector of non-negative integers (a_1, ..., a_k) for which \sum_i a_i = n, with k arbitrary. We define a k-partition of n as any vector of k non-negative integers (a_1, ..., a_k) for which \sum_i a_i = n.

Concrete example of a result with multiple solutions.
Suppose we have 6 sites. Assume it is publicly known that each home houses 60-120 patients, and that the average colonization rate across all sites is 5%. Consider the following scenario, in which the ground truth is as follows: Note that all of these solutions will satisfy the above system of equations for the example.
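The solution-counting idea can be made concrete with a small enumeration. The bounds, total, and ground truth below are toy values chosen so the search is exhaustive; they are not the study's parameters.

```python
from itertools import product

# Toy setting: 3 sites, each with a colonization count in 0..6 (assumed
# bounds, much smaller than the 60-120 patient facilities above)
sites, lo, hi = 3, 0, 6
total = 9                      # the only value the adversary observes
ground_truth = (2, 3, 4)       # an invented ground truth consistent with total

# Enumerate every count vector consistent with the published total
solutions = [v for v in product(range(lo, hi + 1), repeat=sites)
             if sum(v) == total]
spurious = len(solutions) - 1  # every consistent vector except the ground truth
print(len(solutions), spurious)
```

Here the adversary is left with 36 spurious solutions alongside the ground truth, so the published total reveals little about any single site; with fewer sites or more extreme totals the solution set shrinks and the result becomes identifying.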

Simulating the anonymity of results.
We wrote a software tool to estimate whether a particular results ensemble was anonymous or not. For the purposes of this study, we defined a results ensemble in terms of two variables, one of which is the number of sites. Given the high probability of 5-identifiability for 60-patient facilities with high colonization rates, based on our model, we decided to generate predictions at higher rates of colonization than were originally simulated. Note that our model had a low predictive error rate of only 0.098, compared to the null model's error rate of 0.334, and provided a high level of predictive discrimination with a c-index of 0.964.
The following figure shows the predicted probability of 5-identifiability for colonization rates ranging from 7 to 10%. When there are twelve 60-patient facilities, the upper confidence limit is below 0.05 for a colonization rate of 10% or less; when there are eleven 60-patient facilities, the upper confidence limit is below 0.05 for a colonization rate of 8% or less. If we compare to facilities of different capacities, however, we find that far fewer sites are needed to ensure the probability of 5-identifiability is less than 0.05 for a colonization rate of 10% or less. We summarize the minimum number of facilities required in the following table.
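The simulation strategy can be sketched as follows. The site count, bounds, and threshold below are toy assumptions, far smaller than the configurations actually simulated in the study.

```python
import random
from itertools import product

# Toy configuration: 3 sites with counts in 0..6, threshold t = 5
sites, lo, hi, t = 3, 0, 6, 5

def count_solutions(total):
    """Number of per-site count vectors consistent with a published total."""
    return sum(1 for v in product(range(lo, hi + 1), repeat=sites)
               if sum(v) == total)

# Draw random ground truths, publish only their total, and estimate the
# probability that a result is 5-identifying (fewer than t spurious solutions)
random.seed(1)
trials, identifying = 200, 0
for _ in range(trials):
    truth = [random.randint(lo, hi) for _ in range(sites)]
    if count_solutions(sum(truth)) - 1 < t:
        identifying += 1

p_identifying = identifying / trials
print(p_identifying)
```

An ensemble would then be judged anonymous when this estimated probability of t-identifiability stays below the acceptable level, which is the quantity the fitted model predicts for larger configurations.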

The case with no standard deviations per cell
In certain circumstances we may not wish to calculate the standard deviation in a particular cell. In this case the participants can skip computing and decrypting the values [[sscn]] and [[ssn]]. It should be obvious that if these values are omitted, the remaining protocol still computes the weighted means as before, and the security analysis continues to hold, since each party's view contains strictly less information.