Human microbiome privacy risks associated with summary statistics

Recognizing that microbial community composition within the human microbiome is associated with the physiological state of the host has sparked a large number of human microbiome association studies (HMAS). With the increasing size of publicly available HMAS data, the privacy risk is also increasing because HMAS metadata could contain sensitive private information. I demonstrate that a simple test statistic based on the taxonomic profiles of an individual’s microbiome along with summary statistics of HMAS data can reveal the membership of the individual’s microbiome in an HMAS sample. In particular, species-level taxonomic data obtained from small-scale HMAS can be highly vulnerable to privacy risk. Minimal guidelines for HMAS data privacy are suggested, and an assessment of HMAS privacy risk using the simulation method proposed is recommended at the time of study design.


Introduction
Humans have coevolved with an immense number of diverse microorganisms that inhabit our bodies, collectively referred to as the human microbiome [1]. Together with the development of metagenomics, recognizing that microbial community composition within the microbiome is associated with the physiological state of the host has sparked a large number of human microbiome association studies (HMAS), which are also referred to as human metagenomewide association studies (MWAS) [2] in analogy to genome-wide association studies (GWAS) [3]. As in the field of GWAS [4][5][6][7], the privacy risk is increasing with the increasing size of publicly available HMAS data.
The privacy threats of HMAS data are based on the fact that individual microbiomes harbor personally identifiable information in the form of microbial community composition. Several prominent studies demonstrated that individual identity can be revealed using the human microbiome. Fierer et al. [8] showed that an individual who touched an object (e.g., computer keyboard) could be identified by matching the compositional profile of the microbiome on the surface of the object to that of the individual's skin microbiome. While the authors' approach can be properly applied to forensic analyses, similar microbiome-based approaches can also be used to reveal an individual's location or intimate partner as shown by Lax et al. [9]  et al. [10]. Besides, Franzosa et al. [11] presented the possibility of a different type of privacy threat that uses information stored in databases; the authors showed that metagenomic codes, as sets of differentiable features of any given microbiome, can be used to identify individuals in the Human Microbiome Project dataset.
Although the development of biological as well as computational/statistical tools for analyzing individual microbiomes helps us to better understand human microbiomes, such tools can be viewed as double-edged swords. The more a tool has resolving power, the more privacy risks confront people as described above. Unfortunately, we might not be able to prevent privacy threats by attackers who utilize microbiome-based forensic techniques because the attackers only need to seize microbiomes from victims. On the other hand, privacy threats by data breaches can be prevented when the HMAS community develops privacy-preserving methods for HMAS data analysis as shown by Wagner et al. [12] and by storing HMAS data in accesscontrolled databases such as the dbGaP [13], which is currently used as a secure database of human-related genotypic and phenotypic data. However, there is another type of privacy threat that can be caused by the publication of HMAS data. Although raw or detailed data are not presented, summary statistics (e.g., mean frequencies of prokaryotic taxa in study microbiomes) are frequently provided in tables and figures contained in HMAS-based papers. Concerns over privacy breaches due to publishing summary statistics was first raised by Homer et al. [14] with respect to GWAS privacy, and this type of privacy attack was later termed 'attribute disclosure attacks under the summary statistic scenario' by Erlich and Narayanan [15]. To my knowledge, there have been no reports evaluating the privacy risk of HMAS summary statistics, which led me to perform a simple, foundational study in order to urge the HMAS community to be aware of privacy risks associated with HMAS dataset summary statistics.
In this paper, I demonstrate that the membership of an individual in the samples of an HMAS (e.g., case group or control group) can be revealed easily in the summary statistics of taxonomic compositions calculated from the microbiomes of the samples. Using a simple test statistic that was calculated from binary (presence/absence) taxonomic profiles, my simulation studies showed that publication of species-level taxonomic data obtained from small-scale HMAS can be highly vulnerable to privacy risk. This study asserts that the taxonomic profiles of the human microbiome should be treated as sensitive biometric information in that the HMAS metadata could contain the behavioral history of individuals in addition to medical conditions. I propose minimal guidelines for HMAS privacy and suggest that researchers use the simple simulation presented here to assess the privacy risk at the time of study design with the acknowledgment that a more advanced method could have greater resolving power for privacy breach.

Development of the test statistic
Suppose two samples R and C, each with size n R and n C , were drawn independently from population P of human microbiomes. In HMAS, these samples may correspond to the sets of microbiomes of volunteers in the reference group (f! y R i g n R i¼1 ) and case group (f! y C i g n C i¼1 ), respectively; each microbiome is a vector of which the elements represent relative frequencies y S j of j = 1,2,� � �,t, where t is the number of operational taxonomic units (OTUs) at different taxonomic levels from phylum to strain and S2{R, C}. The summary statistics are vectors ! r and ! c, of which the elements are the mean frequencies r j and c j of the OTUs in the pooled data obtained from R and C, respectively. Now consider the microbiome of an individual q, ! y q , of which the elements are simple binary measures of presence/absence of OUT j (i.e., y q j 2 f0; 1g), and suppose we want to determine whether ! y q is a member of R or C using ! r and ! c. First, calculate a distance d for OUT j using the absolute difference between y q j and c j and the absolute difference between y q j and r j as follows: d j ¼ jy q j À r j j À jy q j À c j j Assuming that OTUs are independent and invoking the central limit theorem for the large number of OTUs (t>50) examined in the HMAS, z-score of d j across all OTUs will follow the standard normal distribution, N(0, 1).
Since the t = max(t R , t C , t y ) is large, the variance of d can be estimated reliably by the sample variance s 2 . The test statistic Z was inspired by Homer et al. [14] and Braun et al. [16]. The authors used a similar test statistic calculated from the single nucleotide polymorphism (SNP) genotyping data to identify an individual's genotype in GWAS samples. I modified the original test statistic in order to use the binary taxonomic data in HMAS. Because an individual microbiome randomly drawn from population P should be equally distant from R and C, μ 0 was presumably expected to be zero, i.e., N(0, 1) was a putative null distribution. Thus, under the null hypothesis H 0 : Z ¼ 0ð! y q is a random draw from P), the alternative hypothesis H R : Z < 0ð! y q is a member of R) or H C : Z > 0 ð! y q is a member of C) can be tested with an appropriate significance level α. For example, Z>1.65 rejects H 0 in favor of H C at α = 0.05 (one-tailed test).

Distribution simulations
I examined the feasibility of the test statistic Z in identifying the presence of an individual's microbiome in an HMAS sample with simulated datasets. Suppose that the OTUs correspond to species-level affiliations of microorganisms. Then, the presence of species j in P is a Bernoulli random variable with parameter p j . To model the probability distribution of p j ð! pÞ, I used four different Beta(π 1 , π 2 ) distributions. For P with a small number of high-frequency species, π 1 = 0.1 and π 2 = 1.0 were assumed, and for P with a large number of high-frequency species, π 1 = 1.0 and π 2 = 0.1 were assumed. For P with a high number of high-frequency species as well as a high number of low-frequency species, π 1 = 0.1 and π 2 = 0.1 were assumed, which might be more realistic than the above two distributions in that the high-frequency species may correspond to constitutional or autochthonous prokaryotic populations and the lowfrequency species may correspond to opportunistic or heterochthonous prokaryotic populations across individual microbiomes. In addition, a Beta(1, 1) (uniform) distribution was also used, which represents our ignorance of the distribution of p j according to the principle of indifference.
Because the individual microbiome in R or C is a set of t random draws of y j~B ernoulli(p j ), i.e., ! y � Bernoullið! pÞ, the summary statistics for samples R and C were simulated using ! r � Binomialðn R ; ! pÞ=n R and ! c � Binomialðn C ; ! pÞ=n C , respectively. To construct density curves of true positives for samples R and C, random draws of ! y Rþ and y Cþ j from samples R and C were used to calculate the test statistics Z R+ and Z C+ ðn Rþ Z ¼ n Cþ Z ¼ 100Þ, respectively. Density curves were estimated using the Gaussian kernel density estimator. To construct density curves of the true null distribution, random draws of ! y P from P were used to calculate test statistics Z P ðn P Z ¼ 100Þ. Python code for the simulation is available at https://colab.research.google.com/drive/1dOZi8OSo5qHmF7JPyGAP1I_BQdBVSSiC?usp= sharing.

Overview of simulation results
A simulation study was started with the number of OTUs t =2,000, which roughly reflects the number of species (including uncultivated candidate species) in the human gut microbiome [17], and with n R = n C = 10, which could correspond to small-scale HWAS, under the assumption that p j follows uniform distribution (Fig 1). The null distribution estimated using Z P was very close to N(0, 1) and crisply separated from the distributions of true positives (Z R+ and Z C + ), indicating the feasibility of the microbiome-based identification. The distributions of true positives moved toward the null distribution with increasing n or with decreasing t but moved away from the null distribution with decreasing n or with increasing t. Similar distribution patterns were observed for all the underlying distributions of p j assumed (S1-S3 Figs).
For numerical interpretation of the simulation results, I focused on the probability of type II error at α = 0.05 (β R = P(Z R+ >z α |H R ) or (β C = P(Z C+ <z α |H C ); the probability that the test statistic is not in the H 0 rejection range, given that the alternative hypothesis is true). The power of the test (1−β) is important to a privacy attacker because it would be especially difficult for the attacker with a high β to determine whether the individual under investigation belongs to a particular HMAS sample due to the high false negative rate. α level critical values were obtained from percentiles of Z P distribution because the true null distribution might diverge from N(0, 1). As expected in the density curves, β was far less than 0.01 for the experiments with a large t or small n (S1-S4 Tables).
To formulate the guidelines for human microbiome privacy, the initial simulation study was expanded for virtual HMAS samples with n R = n C = 10, 20,. . .,100,. . .,1000 and with t = 20, 30,. . .,100,. . .,1000,. . .,20000. The method was in general slightly more powerful for p j under the uniform distribution (Fig 2) than for p j under other beta distributions (S4 and S5 Figs). In the resulting contour plots, the yellowish area represents higher β (i.e., lower test power); thus, the HMAS samples located in the dark blue area are considered to be vulnerable to privacy risk. The power of the test decreased notably with increasing HMAS sample size (n) or with a decreasing number of OTUs (t) from which the HMAS summary statistics are calculated, and it is possible to notice a borderline where β decreases considerably, which could help in assessing the privacy risk of the HMAS data.

Effect of sample size
Samples of a large size approximate the population P because lim n R!1 r j ¼ p j or lim n C!1 c j ¼ p j (hence, lim n R ;n C;!1 d j ¼ 0), and the distribution of Z R+ or Z C+ would overlap with the null distribution as shown in the selected density curves, indicating that a large sample size will make the classifying method ineffective. Contrarily, for the samples of a small size, � d significantly deviates from μ 0 even if the difference between r j and p j or between c j and p j is very small. Note that the sample size does not increase with the number of technical replicates used in HMAS data, since the sample points in technical replicates are not independent.
For an unequal sample size (e.g., n R = 1,000 and n C = 10), the samples with a larger size would approximate the null distribution, but the test power for identifying a microbiome in a sample with a smaller size would not be affected (S6 Fig). Considering that taxonomic profiles of individual microbiomes in the case sample could be more homogeneous than those in the control sample, the effective size of the case sample might be smaller than the census size of the case sample. This would result in much higher statistical power for testing whether an individual is a member of the case sample, which is a primary interest of the attackers.  Beta(1, 1)) distribution. Density curves for true positives of samples R (Z R+ ) and C (Z C+ ) are denoted by green and red lines, respectively. Density curves of simulated null distribution and standard normal distribution are denoted by black and gray lines, respectively. Single and double asterisks represent type II error probabilities β<0.05 and β<0.01, respectively. https://doi.org/10.1371/journal.pone.0249528.g001

Effect of the number of OTUs
With the sample size fixed, an increased number of OTUs would increase the test power. This situation can happen when the HMAS data contains taxonomic profiles with a resolution finer than species level. Under the arbitrary speculation that each species is comprised of ca. 10 strains (t = 20,000), the method was very powerful (β�0.01) even for a moderate sample size. This is because the denominator of the test statistic shrinks by increasing t compared to the sample mean � d, which results in a large Z R+ or Z C+ . In the same vein, a reduced number (e.g., t = 20) of OTUs (roughly corresponding to near phylum-level) would decrease the power of the test.

Effect of correlation among OTUs
The occurrence of many OTUs in the human microbiome could be correlated, since the microbiome itself is an ecological community [18]. If the correlation among OTUs is significant, the violation of the assumption that OTUs are independent could change the test power. I analytically evaluated the effect of the OUT correlation on test power. If OTUs are not independent, the variance of � d includes covariance (Cov) terms as follows: where j<j 0 . By lettingr be the average correlation among d j such that Vð � dÞ � 0, the above equation can be written as follows: where 1 þrt Àr � 0. Thus, Vð � dÞ increases ifr is positive (0 <r � 1¼)VðdÞ=t < Vð � dÞ � VðdÞÞ or decreases ifr is negative ðÀ 1=ðt À 1Þ �r < 0¼)0 � Vð � dÞ � VðdÞ=tÞ: One can imagine that the changes in Vð � dÞ result in changes in the distribution of test statistic Z. But the effect ofr on test power is not easy to see with changes in Vð � dÞ. Rather, we can view the denominator term of Vð � dÞ as an effective number of independent OUTs (t e ) as follows: Note that t e = t ifr ¼ 0, 1�t e <t if 0 <r � 1, and t<t e if À 1=ðt À 1Þ �r < 0 (t e !1 as r ! À 1=ðt À 1Þ). For example, if three of 10 OTUs show a strong positive correlation, two of these OTUs cannot contribute to the number of OTUs as equally as other independent OTUs. Thus, the effect of a positiver could be similar to the effect of a decreased number of OTUs, resulting in decreased test power as in the case of linkage disequilibrium among SNPs [16]. However,r can also be negative if negative interactions among OTUs are more prevalent than positive interactions among OTUs, which could subsequently increase test power. Even very small negative correlations (e.g.,r ¼ À 0:001) among a moderate number of OTUs (t = 1000) would increase t e to 10 6 , while very small positive correlations (e.g.,r ¼ þ0:001) decrease t e to 500. The negative correlation overwhelmingly affects t e because t e is a reciprocal function ofr when t is fixed (S7 Fig). Nonetheless, becauser cannot be measured from summary statistics unless provided or assumed, the effect ofr should be investigated more comprehensively in the future.

Additional considerations
I used the binary vector as a query microbiome because it was considered that the binary data is less prone to experimental variations. If a frequency vector is used, the � d could increase because small differences between jy q j À r j j and |y q j À c j | can accumulate across a large number of OTUs. However, it would not always result in a larger Z R+ or Z C+ because the increased sample variance could counteract the increase in � d. Thus, the use of a frequency vector might not guarantee more test power, rather it could make � d more prone to errors in OTU frequencies.
It was assumed that R and C were samples drawn independently from the population of human microbiomes (P). Violation of this assumption could shift the location of the null distribution (μ 0 ). Suppose that populations underlying samples R and C (denoted by R and C, respectively) are all different from the population underlying ! y q ðPÞ. If the systemic difference between R and C is significant, the difference between summary statistics can deviate from zero, i.e., E(Δ) = E(r−c)6 ¼0. Let d � j be the distance metric that includes Δ term as follows: Then, according Braun et al. [16] and using Pðy q j ¼ 0jp j Þ ¼ 1 À p j and Pðy q j ¼ 1jp j Þ ¼ p j , the expected value of the distance metric under the null hypothesis is as follows: Because −1�E(Δ)�1 and 0 < EðpÞ � 1; m � 0 ranges from -1 to 1 which is quite small compared to the distance between the distributions of Z P and Z R+ or between the distributions of Z P and Z C+ in the cases where the test power is very high (e.g., t�2000 and n = 10 or t�20000 and n = 100) (Fig 1). Moreover, as E(p) deviates from its two extreme values (0 or 1) which are highly unrealistic in microbiome composition, the 2E(Δ)E(p) term counteracts the deviation of m � 0 from zero (i.e., m � 0 ! 0 as E(p)!0.5). Thus, I considered that the violation of the assumption that underlying populations are all different would not alter the main results of this study. In fact, if R (reference/control group) and C (case/treatment group) are assumed to be samples drawn from P and C, respectively, the test statistic would become very similar to that used for a two-tailed z-test (or t-test) which can be used to test the null hypothesis that y q j is not a member of C.
Although it might be virtually impossible for the HMAS community to develop an almighty shield that protects data against all possible privacy breaching methods, we do not need to overreact to HMAS privacy concerns since the taxonomic profile of the individual's microbiome could not be a permanent identifier, while a subset of an individual's microbiome may endure for most of the lifetime. Nonetheless, the HMAS community should have guidelines for reducing the privacy risk, which could be kept minimal in order not to obstruct our understanding of the human microbiome.

Concluding remarks
This study showed the possibilities of privacy breaches using the summary statistics drawn from HMAS data. The key finding was that the publication of species composition data obtained from small-scale HMAS (e.g., t�1,000 and n R = n C = 10) easily exposes the privacy of the victims. The HMAS samples with moderate size (n R = n C �100) and t�1000 were on the vicinity of the borderline. I suggest these figures as a basis for HMAS data release policy while acknowledging a sophisticated test statistic that employs better distance metric along with the violation of underlying assumptions could improve the power of the privacy breaching methods. I propose minimal guidelines for HMAS data release as follows: i) increase the sample size as much as manageable, ii) do not publish species composition data even in the form of summary statistics if the sample size is less than 10, iii) avoid publishing subspecies-or strain-level data unless the sample size is far larger than 100. I also suggest that HMAS researchers evaluate the privacy risk at the time of study design using the simulation method presented in this study. Because the test statistic used was relatively robust against the variations in the models for the distribution of p j , HMAS researchers can simply use uniform distribution and their intended t, n R , and n C to estimate the 'minimal' power of the privacy-breaching method.
This study focused only on attribute disclosure attacks under the summary statistic scenario [15]. However, HMAS data privacy concerns are not limited to summary statistics; privacy breaches can occur in various manners at any stage of HMAS data management as described by Wagner et al. [12]. Thus, it is timely that the HMAS community begins comprehensive discussions on HMAS data privacy risks and developing privacy-preserving algorithms for data storage and release.  Table. Summary statistics of simulation results obtained under the assumption that the population OTU frequencies follow a Beta(1, 1) distribution. Type II error probabilities less than 0.05 are in bold. (PDF)