Sharing extended summary data from contemporary genetics studies is unlikely to threaten subject privacy

Background Starting from a forensic problem, Homer et al. showed that it was possible to detect an individual who contributes only 0.5% of the DNA in a pool. The finding was extended to show that it is possible to detect whether a subject participated in a small homogeneous GWAS. We denote this as the detection of a subject belonging to a certain cohort (SBCC). Subsequently, Visscher and Hill showed that the power to detect the SBCC signal for an ethnically homogeneous cohort depends roughly on the ratio of the number of independent markers to the total sample size. However, it is not clear whether the same holds for more ethnically diverse cohorts. Later, Masca et al. proposed, as an SBCC test, a regression of the departure from the assumed population frequency of i) the subject genotype on that of ii) the frequency in the cohort of interest. They used simulations to show that the approach has better SBCC detection power than the original Homer method but is impeded by population stratification.
Approach To investigate the possibility of SBCC detection in multi-ethnic cohorts, we generalize the Masca et al. approach by theoretically deriving the correlation between a subject genotype and the cohort reference allele frequencies (RAFs) for stratified cohorts. Based on the derived formula, we theoretically show that, due to background stratification noise, SBCC detection is unlikely even for mildly stratified cohorts of more than around a thousand subjects. Thus, for the vast majority of contemporary cohorts, the fear of compromising privacy via SBCC detection is unfounded.


Introduction
Spurred by stricter NIMH data-sharing requirements, at the beginning of the Genome Wide Association Study (GWAS) era most researchers published summary statistics from their studies in a timely manner, e.g. Z-scores, odds ratios (OR) and even reference allele frequencies (RAFs) by case status. However, this free sharing did not last long before privacy concerns were raised. First, Homer et al. [1], starting from a forensic problem, showed that it was possible to detect an individual who contributes only 0.5% of the DNA in a pool. In the same paper, the authors extended the finding to show that one can detect whether a subject participated in a small (N ≈ 1,500) homogeneous GWAS by using only summary statistics and RAFs. We denote this the detection of a subject belonging to a certain cohort (SBCC). Subsequently, Visscher and Hill [2] used a likelihood ratio (LR) approach to show that the power to detect the SBCC signal for an ethnically homogeneous cohort depends roughly on the ratio of the number of independent markers to the total sample size. Unfortunately, even though Visscher and Hill implied that the power to detect whether a subject is a member of a cohort is much smaller at larger sample sizes, this finding was not enough to avoid a chilling effect on the free sharing of summary data.
Using a Bayesian approach, Clayton [3] investigated the conditions needed for SBCC detection in a homogeneous cohort. He computed Bayes factors for a subject belonging to the case and control groups and derived their upper limit as a function of allele frequency. He also noted that the lack of good reference data makes SBCC detection even harder. In the end, Clayton concluded that "scenarios in which an individual might be identified in this manner are somewhat improbable - particularly when so many SNPs would be needed that linkage disequilibrium could not be ignored (so that any potential invader of privacy would also require access to an individual-level data set from which to estimate the linkage disequilibrium structure)".
Later, Masca et al. [4] proposed, as an SBCC statistic, an empirical regression test of the departure from the assumed population frequency of i) the subject genotype on that of ii) the frequency in the cohort of interest. They used simulations to show that i) their approach is more powerful than that of Homer et al., ii) population stratification impedes SBCC detection and iii) SBCC detection is possible only at smaller sample sizes.
In this paper we attempt to answer the question of whether, from an SBCC perspective, not sharing data is scientifically defensible for present-day GWASs. To answer it we i) theoretically extend the Masca et al. SBCC approach, ii) update it for stratified cohorts and iii) use the approach for SBCC signal testing. As a measure of SBCC signal strength we propose the correlation between a subject genotype and the cohort RAFs (CGR). We show that, for unstratified cohorts, CGR is equivalent to the Visscher and Hill LR, which suggests that our approach is a locally uniformly most powerful (UMP) test under modest stratification. Based on the functional form of the CGR statistic we argue that, for the vast majority of contemporary cohorts, stopping the free sharing of data due to SBCC concerns is not scientifically justified.

Methods
Because disclosing that a subject belongs to the cohort for certain disorders is likely to be much more detrimental than disclosing that he/she belongs to the cohort of a quantitative trait, the focus of this paper is on case-control cohorts. Because i) subjects' contributions to the Z-scores are adjusted for unknown ancestry components, whereas ii) RAFs incorporate solely the subjects' unadjusted contributions, we argue that RAFs are likely to provide much more information on whether a subject belongs to a cohort. Consequently, this paper treats only the privacy concerns relating to the worst-case scenario of sharing case RAFs.
Correlation between case genotype and in-cohort RAF
Assume the cohort under investigation consists of $n$ cases and $n^0$ controls for a certain disorder. Further assume that the cohort samples $m$ subpopulations, with the $i$-th subpopulation having $n_i$ cases and $n_i^0$ controls. Under stratification, an important index of population divergence is Wright's fixation index $F_{st}$, which is the ratio of the variance in subpopulation frequencies to the variance of the allele in the cohort (1). $F_{st}$ was also shown to be the apparent correlation of alleles in the same subpopulation (1). (Alleles from different subpopulations are uncorrelated.) Let $F_i$ denote the correlation of the alleles in the $i$-th subpopulation.
Before proceeding to deduce the correlation between case genotype and in-cohort RAF, i.e. CGR, we establish some basic relationships for the variance and covariance of subjects' genotypes. Assume that $X_1$ and $X_2$ are the additively coded alleles (i.e. the number of reference alleles) of an individual from the $i$-th subpopulation; then the genotype is $G = X_1 + X_2$, and Eq (1) follows. Let $G_1 = X_{11} + X_{12}$ and $G_2 = X_{21} + X_{22}$ be the bi-allelic genotypes of 2 subjects from the same subpopulation (with fixation index $F_i$) or from two different subpopulations. Then

$$\mathrm{Cov}(G_1, G_2) = \mathrm{Cov}(X_{11} + X_{12}, X_{21} + X_{22}) = \mathrm{Cov}(X_{11}, X_{21}) + \mathrm{Cov}(X_{11}, X_{22}) + \mathrm{Cov}(X_{12}, X_{21}) + \mathrm{Cov}(X_{12}, X_{22}) = 4\,\mathrm{Cov}(X_{11}, X_{21}). \qquad (2)$$

Suppose $G_{ij}$, $i = 1, \ldots, m$ and $j = 1, \ldots, n_i$ ($n_i^0$), are the additively coded genotypes at the variant under investigation for the $j$-th individual in the $i$-th subpopulation in the cases (controls). For this variant, having a population RAF of $p$, let $\hat{p}_A$ and $\hat{p}_U$ be the estimated allele frequencies in the affected (cases) and unaffected (controls) subjects, respectively. Suppose studies publicly report a RAF estimate of the form:

$$\hat{p} = \omega \hat{p}_A + (1 - \omega)\hat{p}_U.$$

For example, from a population genetics point of view it might be of interest to report $\hat{p}$ for $\omega = K$, i.e. the population RAF estimate. [Another interesting scenario is to report both $\hat{p}_A$ ($\omega = 1$) and $\hat{p}_U$ ($\omega = 0$).] Assuming that the study reports such $\hat{p}$ estimates for all common SNPs, for privacy considerations it is desirable to compute the expected correlation between a certain case genotype, $G_{i_0, j_0}$, and $\hat{p}$. To this end we start by first estimating $\mathrm{Var}(\hat{p})$ and $E[(G_{i_0, j_0} - 2p)(\hat{p} - p)]$. Relationship Eqs (1) and (2) from above [also in Devlin et al. (1)] can be re-written as:

the correlation of interest becomes:
Further manipulation reduces the correlation to a closed form. If we assume the same $F_{st}$ for all populations and an equal number of cases and controls in each subpopulation, i.e. $F_i = F$ and $n_i = n_i^0 = n/m$, the formula can be approximated for large numbers. Thus, under stratification, the correlation between the genotype of a case ($\omega = 1$, above) and the allele frequency of cases can be approximated by Eq (3). The functional form of Eq (3) was empirically validated [see subsection 1.3 and Fig A in S1 File]. The correlation between a subject genotype and RAF can also be estimated for a subject not belonging to the cohort (subsection 1.1 in SM).
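The behaviour of the correlation in the simplest setting can be checked numerically. The sketch below is a hypothetical illustration, not the validation reported in S1 File: it assumes an unstratified cohort ($F = 0$), independent SNPs, Hardy-Weinberg genotypes and population RAFs drawn uniformly from [0.1, 0.9] (all assumptions of this example, as are the function names). Under these assumptions $\mathrm{Cov}(G, \hat{p}_A) = p(1-p)/n$, $\mathrm{Var}(G) = 2p(1-p)$ and $\mathrm{Var}(\hat{p}_A) = p(1-p)/(2n)$, so the correlation for a cohort member is $1/\sqrt{n}$, while for a subject outside the cohort it is zero.

```python
import math
import random


def pearson(x, y):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def draw_genotype(rng, p):
    """Additively coded genotype: number of reference alleles (0, 1 or 2)."""
    return (rng.random() < p) + (rng.random() < p)


def simulate_cgr(n_cases=50, n_snps=10_000, seed=1):
    """Return (r_member, r_outsider) for one unstratified simulated cohort.

    Illustrative assumptions (not from the paper): independent SNPs,
    Hardy-Weinberg genotypes, population RAFs uniform on [0.1, 0.9].
    """
    rng = random.Random(seed)
    member_dev, outsider_dev, raf_dev = [], [], []
    for _ in range(n_snps):
        p = rng.uniform(0.1, 0.9)
        cases = [draw_genotype(rng, p) for _ in range(n_cases)]
        outsider = draw_genotype(rng, p)      # same population, not in cohort
        p_hat = sum(cases) / (2 * n_cases)    # case RAF (omega = 1)
        member_dev.append(cases[0] - 2 * p)   # G - 2p for a cohort member
        outsider_dev.append(outsider - 2 * p)
        raf_dev.append(p_hat - p)             # p_hat - p
    return pearson(member_dev, raf_dev), pearson(outsider_dev, raf_dev)
```

With `n_cases = 50` the member correlation should fall near $1/\sqrt{50} \approx 0.14$ and the outsider correlation near zero, up to sampling noise of roughly $1/\sqrt{o}$ for $o$ simulated SNPs.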
Using correlation between case genotype and in-cohort RAF to test SBCC
$\rho(F)$ from Eq 3 can be approximated via a first-order Taylor series (for more details, see Eqs B and C in S1 File). Because the $\frac{\sqrt{n}}{m}F$ bias might not be negligible even for moderately sized intracontinental meta-analyses, to test the true correlation due to belonging to the case cohort, $\rho(0)$, the $\frac{\sqrt{n}}{m}F$ bias needs to be subtracted. Based on the above Taylor series approximation, $\rho(0)$ can be estimated by $\hat{\rho}(0) = \hat{\rho}(F) - \frac{\sqrt{n}}{m}\hat{F}$, where $\hat{F}$ is estimated using a relevant and ideal (i.e. with a perfectly matching ethnic distribution) panel of size n@ = $n/k$ ($k \gg 10$ for large meta-analyses). The variance of $\hat{\rho}(0)$ then follows as relationship (4) (Eq 4 in subsection 1.4 of SM), which adds to $1/o$ a stratification term involving $k$ and $m^2$, where $o$ is the equivalent number of independent SNPs in a genome scan. Thus the expectation of the Z-score for testing $\rho(0) = 0$ (subject not in cohort) vs. $\rho(0) > 0$ (which likely yields higher power than testing the more appropriate alternative for $\rho(0)$) can be obtained for subjects in the case cohort. We stress that if non-stratification is assumed (i.e. to eliminate the stratification term in relationship (4)), the equivalent $\chi^2$ test has the noncentrality parameter $\lambda = \mu^2 = o/n$, which is similar to the one deduced by Visscher and Hill using a likelihood ratio (LR) approach when either i) not augmenting the data with a reference panel or ii) being able to use the cohort sample along with the reference panel to estimate $\hat{F}$. Given the desirable properties of LR tests [5] (Theorem 8.3.1, Neyman-Pearson Lemma) and the fact that $F$ is very small in practice (e.g. $F = 0.006$ in the most divergent European populations [6]), it follows that the test based on relationship (4) is UMP or close to UMP for modest stratification.
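To make the unstratified limit concrete, the following minimal sketch converts the noncentrality parameter $\lambda = \mu^2 = o/n$ into the power of a one-sided Z-test. It covers only the no-stratification case and deliberately ignores the stratification term of relationship (4), so it is not the bound used in this paper; the function name and the unit-variance normal approximation for the Z-score are assumptions of the example.

```python
from math import sqrt
from statistics import NormalDist


def unstratified_power(n_snps_indep, n_cases, alpha=0.05):
    """One-sided Z-test power when the Z-score mean is sqrt(o / n).

    Only the no-stratification limit (lambda = mu^2 = o / n) is covered;
    the stratification term of relationship (4) is deliberately ignored.
    """
    std_normal = NormalDist()
    mu = sqrt(n_snps_indep / n_cases)        # expected Z for a cohort member
    z_crit = std_normal.inv_cdf(1 - alpha)   # rejection threshold under H0
    return 1 - std_normal.cdf(z_crit - mu)   # P(Z > z_crit | mean mu)
```

For example, `unstratified_power(50_000, 1_500)` is close to 1, whereas the power falls toward `alpha` as `n_cases` grows for a fixed number of independent SNPs, which is the ratio dependence described by Visscher and Hill.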
Assuming (extremely) conservatively that the number of independent SNPs is $o = 1$, instead of $o = 50{,}000$ as in [2], we compute the upper bound for the probability (power) of detecting a significant signal for subjects belonging to the case cohort at a given type I error, $\alpha$.
Simulated scenarios used to evaluate power to detect SBCC
To give an idea of the power to detect the SBCC signal, we present a range of scenarios inspired by existing data sets. As possible (present and future) values of the parameters we chose a panel sample size of n@ = $n/k \in \{1{,}000;\ 10{,}000;\ 30{,}000;\ 100{,}000\}$ and set the number of subpopulations to $m = \max([n/n_s], 2)$, where $[\cdot]$ denotes rounding to the nearest integer and, rather conservatively (as multiple studies target the same subpopulation), $n_s \in \{700;\ 1{,}400;\ 2{,}800\}$ is the average number of cases per study. The values for the number of cases per study are informed by the analysis of the second schizophrenia cohort from the Psychiatric Genomics Consortium (PGC) [7], which averages 700 cases per study. The assumptions regarding $n_s$ are conservative because i) in many large studies (PGC included) multiple sub-studies target the same subpopulation and ii) as the total sample sizes of meta-analyses increase, the sample sizes coming from each subpopulation are expected to increase.
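The scenario grid described above can be enumerated with a short sketch. The names are hypothetical, the panel-to-cohort ratio is taken as $k = n/$n@, and "rounding to the nearest integer" is implemented with halves rounded up (an assumption of this example; Python's built-in `round` rounds halves to even).

```python
import math
from itertools import product

PANEL_SIZES = (1_000, 10_000, 30_000, 100_000)   # n@, reference panel size
CASES_PER_STUDY = (700, 1_400, 2_800)            # n_s, average cases per study


def n_subpopulations(n_cases, cases_per_study):
    """m = max(round(n / n_s), 2), rounding halves up (assumed convention)."""
    return max(math.floor(n_cases / cases_per_study + 0.5), 2)


def scenario_grid(n_cases):
    """List (n@, n_s, m, k) for one cohort size, with k = n / n@."""
    return [
        (panel, n_s, n_subpopulations(n_cases, n_s), n_cases / panel)
        for panel, n_s in product(PANEL_SIZES, CASES_PER_STUDY)
    ]
```

For a cohort of 30,000 cases with 700 cases per study, this rule gives $m = 43$ subpopulations; for cohorts smaller than about $1.5\,n_s$ the floor of $m = 2$ applies.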

Practical application
We apply the method to the PGC2 schizophrenia (SCZ) study [7], which discovered 108 loci by analyzing a multiethnic cohort that included slightly more than 30,000 cases. Each individual study contributed around 700 cases. We assume that $\hat{F}$ is estimated using the publicly available subpanel of the Haplotype Reference Consortium [8], which contains around n@ = 12,000 subjects.

Results
With these conservative assumptions, we obtain an upper limit for the detection power, q, as a function of sample size, n (Fig 1). These calculations show that, at a type I error of 0.05, there is some modest power to detect the signal of belonging to the case cohort (Fig 1) only when i) the (perfectly matching) panel size is extremely large and ii) the cohort size is lower than 1,000. For more realistic parameter scenarios, the power of detection is practically negligible. For the practical application to PGC2 SCZ, assuming 700 cases per individual study and n@ = 12,000, the power to detect the SBCC signal is around 6.6% for a type I error rate of α = 5%. If the smaller 1000 Genomes reference panels are used instead, phase 1 [9] (n@ = 1,000) and phase 3 [10] (n@ = 2,504), the power decreases to 5.7% and 5.5%, respectively. However, even such detection powers, barely above the false positive rate, are likely overestimates due to poor panel coverage of many PGC2 SCZ subpopulations.

Discussion
SBCC-related privacy concerns do not preclude sharing summary data (even case RAFs), even when analyzing cohorts of rather modest stratification and size. This is because the SBCC signal (for a cohort of size >~1,000) is overwhelmed by the stratification background noise even when very large reference panels are available. Consequently, as far as SBCC detection is concerned, there is no scientifically valid reason why the summary data for most genetic studies, including case RAFs, should not be made publicly available. However, our work does not rule out that data sharing may raise privacy concerns from currently unidentified, non-SBCC vantage points.