Searching across-cohort relatives in 54,092 GWAS samples via encrypted genotype regression

Explicitly sharing individual level data in genomics studies has many merits comparing to sharing summary statistics, including more strict QCs, common statistical analyses, relative identification and improved statistical power in GWAS, but it is hampered by privacy or ethical constraints. In this study, we developed encG-reg, a regression approach that can detect relatives of various degrees based on encrypted genomic data, which is immune of ethical constraints. The encryption properties of encG-reg are based on the random matrix theory by masking the original genotypic matrix without sacrificing precision of individual-level genotype data. We established a connection between the dimension of a random matrix, which masked genotype matrices, and the required precision of a study for encrypted genotype data. encG-reg has false positive and false negative rates equivalent to sharing original individual level data, and is computationally efficient when searching relatives. We split the UK Biobank into their respective centers, and then encrypted the genotype data. We observed that the relatives estimated using encG-reg was equivalently accurate with the estimation by KING, which is a widely used software but requires original genotype data. In a more complex application, we launched a finely devised multi-center collaboration across 5 research institutes in China, covering 9 cohorts of 54,092 GWAS samples. encG-reg again identified true relatives existing across the cohorts with even different ethnic backgrounds and genotypic qualities. Our study clearly demonstrates that encrypted genomic data can be used for data sharing without loss of information or data sharing barrier.


Introduction
Genomic datasets have reached millions of individuals, and are often encapsulated in well-protected cohorts, in which relatives more than often, given increasing genotyped individuals, spread across cohorts and can be identified once the genomic data are compared [1].Estimating genetic relationship often has clear scientific reasons, such as controlling false positive rates in genome-wide association studies (GWAS) or reducing overfitting in polygenic risk score prediction [2][3][4].Social benefits are recently promoted for available individual genomic data such as relativeness testing and forensic genetic genealogy [5].However, direct-to-consumer (DTC) genetic testing activities along with third-party services pose new privacy and ethical concerns [6]; law enforcement authorities have exploited some of the consumer genomic databases to identify suspects by finding their distant genetic relatives, which has brought privacy concerns to the attention of the general public [7,8].For regulating forensic genetic genealogy, laws, policies, and privacy-protection techniques are in parallel development [9][10][11].
The above progress, nevertheless, often requires individual-level data to be shared which may often be beyond the permitted range of data sharing [12].The encryption methods for genotypes have gone through from the initial one-way cryptographic hashes, to random matrix multiplication, and recently to homomorphic encryption (HE).One-way hashes are leveraged to detect overlapping samples, but it fails if the test genotypes differ, which can be caused by genotyping or imputation errors, even when they are minor [13,14].Privacy-preserving protocols for multi-center GWAS have been brought to public recently [15][16][17][18][19][20], on which HE is mostly based.HE provides high precision for results for certain kinds of computational tasks in genetic studies.However, as it is computationally substantial, often one or two orders of magnitude larger than that of the original cost, its application has been limited to small sets of data, at the scale of several hundreds of samples.
In this study, random matrix theory has been adopted to detect relatedness based on our previous study [21].We developed a novel mitigation strategy called "encrypted genotype regression", hereby encG-reg, which does not require original genotype data to be shared but is capable of identifying relatedness with highly controllable precision of balanced Type I and Type II error rates.Since only encrypted genotype data is exchanged in performing encG-reg, collaborators from different cohorts are able to minimize their concerns about data confidentiality.We explore the statistical properties of encG-reg in theory, simulations, and application of 485,158 UK Biobank (UKB) samples of various ethnicity.In a real-world collaboration that includes 5 genomic centers from north to south China (Beijing, Shanghai, Hangzhou, Guangzhou, and Shenzhen) totaling 54,092 genetically diverse samples were genotyped based on different platforms, and intriguing relatedness was identified between cohorts by encG-reg.Privacy-preserving is context-dependent and is still in development under a particular scenario.Throughout this study, when summary information is exchanged, such as allele frequencies and variant positions, we apply the practical guideline in GWAS meta-analysis (GWAMA), which defines a novel strategy we established for data safety.

Description of the method
In this section, we will first present an analogous sketch of our thinking.Imagine a Go-like board with dimensions n 1 ×n 2 , where each square contains a particle of size θ s or θ l , and we assume θ s <θ l .Additionally, it is often the case that there is a significantly large number of particles of small size θ s compared to those of larger size θ l .Each particle is imperfect because of the handcraft variance of the particle size.Intuitively, criteria are required if we want to correctly pick a particle of size θ l , which is sampled from a normal distribution Nðy l ; s 2 y l Þ, out of many particles of size θ s , which are sampled from Nðy s ; s 2 y s Þ.These criteria include the expectation and sampling variance of the particle sizes θ l we want to pick up, the number of squares on the board (n 1 ×n 2 ), the probability that we would accept for incorrectly picking up a particle of size θ s (Type I error rate α), and the probability that we would miss a real particle of size θ l (Type II error rate β).All these criteria need to be balanced to find a solution that allows us to distinguish θ l from θ s under an acceptable cost, say computational cost and storage cost.
It is evident that the described question fits into the conventional statistical testing scenario, which is "null hypothesis H 0 : Nðy s ; s 2 y s Þ vs alternative hypothesis H 1 : Nðy l ; s 2 y l Þ".Furthermore, expressions for the minimum size of the testing sample can be given by a power calculator that brings out the required Type I and Type II error rates.In our study, θ refers to the relatedness between a pair of individuals and m refers to the number of markers and is related to the sampling variance of relatedness scores.A slightly upscaled concept of m is the effective number of markers m e , which takes into account the squared correlation between m markers, and consequently m e �m.Given these considerations, a smaller m is preferred to minimize the data cost.Other technical march in S1 Text primarily focused on deriving the sampling variance.The variance is approximated using fourth-order moment computation for a fundamental pairwise relatedness estimator, the genetic relationship matrix (GRM).It is further refined through an eighth-order moment computation after multiplying the source genotype matrix by an S matrix (m rows and k columns).Here, the S matrix acts analogously to the private key in a cryptographic scheme.The ground truth is that a larger value of k facilitates better identification between θ l and θ s .However, in line with the approach for determining a small yet sufficient value of m, we also aim to identify the smallest value of k that balances precision and cost.Consequently, the primary objective of this study is to establish a lower bound for m (or m e ) and k.These two values will enable us to determine the minimum conditions required for detecting relatedness.

Ethic statement
For SBWCH cohort, the protocol and written consent were approved by the Institutional Review Board of Shenzhen Baoan Women's and Children's Hospital (LLSC-2021-04-01-10-KS); For CAS cohort, the protocol and written consent were approved by the Institutional Review Board of Beijing Institute of Genomics and Zhongguancun hospital (No.2020H020, No.2021H001, and No.20201229); For ZOC cohort, a written informed consent was obtained from the parents or guardians of the young twins.Ethical approval and DNA data using approval were obtained from the Ethical Committee of Zhongshan Ophthalmic Center; For Fudan cohort, the protocol and written consent were approved by the College of Life Science Fudan University Ethical Review Board; For WBBC cohort, the protocol and written consent were approved by the Westlake University Ethical Review Board.

Overview of the method
Using SNPs, inter-cohort relatedness for pairs of individuals can be inferred from genetic relationship matrix, which is G . X 1 is a matrix of n 1 individuals (rows) and m markers (columns), and so is X 2 .This GRM definition is identical to Eq 9 in Speed and Balding's review paper for GRM (X are standardized by SNP allelic frequencies and its expected sampling variance) [22].The expectation and variance of g ij are We can express y r ¼ 1 2 À � r for the r th degree relatives, so θ r has the expected values of 1, 0.5, 0.25, and 0.125 for the zero (clonemate), first (full sibs, or parent-offspring), second (half sibs, or grandparent-grandchildren), and third degree (first cousins, or great grandparent-great grandchild) of relatives, respectively.Obviously, when there is an inbred or population structure, or a loop in marriage, the realized value of θ r covers a continuous range [23].
Let S be an m×k matrix and its entries are independently sampled from N(0,σ 2 ).X1 ¼ X 1 S presents an ideal one-way encryption technique in private genetic data sharing, and we call X1 "encrypted genotype", hereby encG.When , the approximated precision of which relies on the sampling variance of S matrix (S1 Fig) .The relationship is clear: as the value of k increases, the approximation approach to the optimum becomes more accurate.In this study, we ask whether there are relatedness existing between X 1 and X 2 , and how large k should be in order to reduce the noise and meanwhile is still able to identify the relatives of certain degree.
Based on encG, it is now trustworthy to construct Ĝ12 ¼ , the encrypted GRM.In terms of the matrix element ĝ ij , its expectation and variance are , respectively, in which the term 1þy 2 r k is crept into varðĝ ij Þ in comparing with its counterpart in Eq 1.As SNPs are often in linkage disequilibrium (LD), we introduce the effective number of markers (m e ), which is a population parameter engaged in various genetic analyses [24].The variances of g ij and ĝ ij then become

encG regression (encG-reg)
Another interpretation of encGRM is from the perspective of linear regression, which we regress the i th row of X1 ðx i Þ against the j th row of X2 ðx j Þ to estimate the relatedness.We call this procedure the encG regression (encG-reg).The slope b ij of a simple regression model xj ¼ b ij xi þ e indicates the relatedness score between these two individuals.The expectation and the sampling variance of bij ¼ Compared with encGRM, encG-reg generates a smaller sampling variance and improves statistical power in identifying relatives.

Minimum numbers of m e and k
Given a pair of individuals I) whose relatedness is estimated by GRM and follows the distribution of N y r ; 1þy 2 r m e � � , we ask how to identify them from unrelated pairs with a distribution of ; II) whose relatedness is estimated by encG-reg and follows the distribution of , we ask how to differentiate them from unrelated individual pairs as sam- . This question is analogous to the conventional pattern recognition, which can be solved under the power calculation in the statistical test framework for null verse alternative hypotheses.We consequently need to determine two key parameters.I) the effective number of markers, m e , a population statistic that sets the resolution of GRM itself in detecting relatives.II) the column number of the random matrix, k, an iteration dimension that sets the precision of encG-reg.To determine m e and k, upon Type I error rate (α, false positive rate as aforementioned) and Type II error rate (β, false negative rate), m e should satisfy the below inequality Similar to m e , the minimum number of k is also determined by a certain Type I and Type II error rates, the degree of relatives to be detected, as well as the parameter m e .k should follow the below inequality In particular, α should be under experiment-wise control, say after Bonferroni correction, and consequently upon the total comparisons N ¼ P C i<j n i n j , where there are C cohorts and n i samples in cohort i.Of note, Eq 3 gives a lower bound of the number of markers, while in practice we often have genome-wide SNPs in surplus, such as the case in the UKB example below.As a larger m e leads to a smaller k, it is upon the data to choose a large m e but a small k, or a minimum m e but a large k.The conceptual layout of the method is as described above.The technical details of sampling variance at Eqs 1-2 and statistical power calculation for Eqs 3-4 can be found in S1 Text and the corresponding annotations are given in S1 Table .The properties of all three methods, including GRM, encGRM, and encG-reg, are summarized in Table 1.For a pair of cohorts of sample n 1 and n 2 , the computational time complexity of encG-reg is about Oððn 1 þ n 2 Þmk þ n 1 n 2 kÞ: the first term occurs at the local site of each cohort and the second term occurs at an entrusted computational server for a pair of cohorts.Furthermore, after local encryption by multiplication of S, the size of the X1 that is sent to the central analyst is of dimension n 1 ×k, which approximately represents the space complexity.Although the values of m and k are as determined by Eq 3 and Eq 4, but in practice we may pick a slightly larger m, say 2m, so as to balance time complexity and space complexity.
After the assembly of cohorts, there are options for choosing SNPs upon the experimental design.An exhaustive design denotes the use of intersected SNPs between each pair of cohorts, thus a specific random matrix will be shared with each pair of cohorts.Given C cohorts, there are CðC À 1Þ=2 S matrices generated and each cohort is likely to receive C À 1 different S matrices that matches to C À 1 cohorts.Adopting exhaustive design is possible to maximize the statistical power with the maximized number of SNPs, but the computational, as well as communicational, efforts may overwhelm the organization of a study.In contrast, a parsimony design denotes the use of intersected SNPs among all assembled cohorts, as long as the number of SNPs satisfies the resolution in Eq 3 and Eq 4. Exhaustive design and parsimony design are both validated in the 19 UKB cohorts, each of which had sample sizes greater than 10,000, and parsimony design is further tested in the real world for 9 Chinese cohorts in this study.

Protocol for encG-reg for biobank-scale application
We now sketch a detailed technical protocol with security concern for encG-reg into four steps and two interactions.For the four steps, steps 1 and 3 are performed by each collaborator, and steps 2 and 4 are performed by a central analyst (Fig 2A).For the two interactions, exchanged information, possible attacks, and corresponding preventative strategies are given in examples (Fig 2B).We have provided comprehensive details and practical commands for each step at https://github.com/qixininin/encG-reg.The repository includes two main folders: "Simulation" and "Protocol".The "Simulation" folder includes code and resources for all simulations in this study.The "Protocol" folder contains a user-friendly protocol that outlines the step-bystep process for a group of collaborators using encG-reg.These commands and scripts have been utilized during interactions with our collaborators in multi-center Chinese datasets application.
Step 1 Cohort assembly and intra-cohort quality controls.Basic intra-cohort QCs should be conducted.Summary information such as SNP ID, reference allele, and its frequency are then requested by the central analyst.
Step 2 Inter-cohort quality controls and parameter setup.Using the "geo-geno" relationship often observed in genetic data [25,26], we suggest two inter-cohort QCs.One is called frequency-principal component analysis (fPCA) and another is called fStructure.The technical details of the employed fPCA and fStructure methods can be found in our previous study [21] and is described in the following subsection.The central analyst determines m and k by Eq 3 and Eq 4 based on survived SNPs and passes parameter information to each collaborator along with a SNP list.
Step 3 Encrypt genotype matrix.The m-by-k random matrix, or matrices when an exhaustive design is chosen, is generated and sent to each collaborator.As a positive control, The figure shows the resolution for detecting relatives or overlapping samples with respect to varying number of markers at every row (for better illustration m e was twice that of Eq 3) and the degree of relatives to be detected (r = 0, 1, and 2).The y axis is the relatedness calculated from GRM and the x axis is the estimated relatedness calculated from encG-reg (A) and encGRM (B).Each point represents an individual pair between cohort 1 and cohort 2 (there are 200 × 200 = 40,000 pairs in total), given the simulated relatedness.The dotted line indicates the 95% confidence interval of the relatedness directly estimated from the original genotype (blue) and the encrypted genotype (red).The table provides how m and k are estimated.The columns "under minimal m e " provide benchmark for a parameter, and it is practically to choose 2×m e and then estimate k as shown under the column "practical m e ". https://doi.org/10.1371/journal.pgen.1011037.g001

Time complexity
Oðn

Space complexity a
Oðn i kÞ Oðn i kÞ

Table notes:
This table compares the definition and properties, such as expectation, variance and time complexity between three important matrices, GRM, encGRM, and encG-reg, mentioned in the article.Note that, each element of the random matrix S m×k = {s ij } follows a normal distribution N(0, 1/m); m is the number of markers, also represents the number of rows for random matrix S; k is the number of columns for random matrix S; m e is the number of effective markers, an advanced concept that takes the squared correlation between markers into account; r is the degree of a pair of relatives and θ r is their relatedness score; n 1 and n 2 are the sample sizes of two cohorts.
a Space complexity refers to the dimension of Xi ¼ X i S, which should be transported to the central analyst. https://doi.org/10.1371/journal.pgen.1011037.t001

Cohort assembly
The first two weeks The 3rd and 4th weeks The 5th and 6th weeks

The 7th weeks
Inter-cohort QC

Central analyst
The first interaction The second interaction The mathematical details of encG-reg are simply algebraic, but its inter-cohort implementation involves coordination.(A) We illustrate its key steps, the time cost of which was adapted from the present exercise for 9 Chinese datasets (here simplified as three cohorts).Cohort assembly: It took us about a week to call and got positive responses from our collaborators (See Table 3), who agreed with our research plan.Inter-cohort QC: we received allele frequencies reports from each cohort and started to implement inter-cohort QC according to "geo-geno" analysis (see Fig 6).This step took about two weeks.Encrypt genotypes: upon the choice of the exercise, it could be exhaustive design (see UKB example), which may maximize the statistical power but with increased logistics such as generating pairwise S ij ; in the Chinese cohorts study we used parsimony design, and generated a unique S given 500 SNPs that were chosen from the 7,009 common SNPs.It took about a week to determine the number of SNPs and the dimension of k according to Eq 3 and 4, and to evaluate the effective number of markers.Perform encG-reg and validation: we conducted intercohort encG-reg and validated the results (see Fig 7 and Table 4).It took one week.(B) Two interactions between data owners and central analyst, including example data for exchange and possible attacks and corresponding preventative strategies.
https://doi.org/10.1371/journal.pgen.1011037.g002reference samples will be merged into each cohort.Genotype encryption is realized by the matrix multiplication between the standardized genotype matrix and S.
Step 4 Perform encG-reg.Inter-cohort computing for relatedness will be conducted by the central analyst.A successful implementation will lead to at least positive controls consistently identified as inter-cohort "overlap" and if possible, various sporadic relatedness.
In the above steps, there are two interactions between collaborators and the central analyst: The first interaction.Collaborators send over a list of variants including their allele frequencies.After doing variants selection, central analyst returns a list of intersected variants, together with a randomly generated seed.Re-identifications based on allele-frequency may occur, but a suggested choice of common variants (MAF>0.05)can mostly dispel these misgivings.
The second interaction.Collaborators send over a matrix of encrypted genotypes.After performing encG-reg between each two pairs of cohorts, the central analyst returns identified relatedness or returns relatedness scores directly, based on pre-agreed requests.PCA-attack based on encrypted genotypes may occur, if the correlation structure of variants being approximated in a proper reference population, but one straightforward defense is to use variants that are in linkage equilibrium, and it ensures that the correlation matrix closely approximates to the identity matrix.

Cohort-level quality control using fPCA and fStructure
In this study, we use fPCA and fStructure to examine the data quality at the cohort-level using summary statistics.Both fPCA and fStructure have been previously used in The Genetic Investigation of Anthropometric Traits (GIANT) Consortium [21], and GWAMA for educational attainment [27].fPCA is a principal component analysis based on summary data at the population level, rather than individual-level data.It uses the allele frequency of common markers across all cohorts, which is constructed into matrix P ¼ fp ij g C�M .Here, C represents the number of cohorts and M represents the number of common markers, while p ij denotes the allele frequency for the j th marker in the i th cohort.Using the differences in marker frequencies, fPCA can effectively capture the population structure in PC1 and PC2.fStructure explores the genetic composition of the target cohorts by comparing with reference populations using F st .Again, F st is calculated based on allele frequency of common markers among all cohorts and the reference populations.Suppose there are C cohorts and three reference population.The average F st using common markers (F

Calculate the number of effective markers
The corresponding m e will be estimated from, 1KG-EUR and 1KG-CHN as the reference populations for validation in the UKB cohorts and the Chinese cohorts, respectively.According to its definition, m e ¼ , in which ρ 2 is the squared Pearson's correlation for a pair of SNPs, and m e can be empirically estimated as 1  varðG off Þ , where G off denotes the offdiagonal elements of GRM [24,28,29]. 1  m e describes the global LD for all the included SNPs.For more technical details about m e , please refer to Huang et al [30].Since m e is asymptotically distributed as N m e ; 4m 2 e n 2 � � according to the Delta method, the sampling variance of m e is negligible as long as the studying populations are from similar ancestries, such as the case for Manchester and Oxford cohorts in UKB and the Chinese datasets employed in this study (S2 Table ).For a single-ethnicity population, when SNPs are randomly sampled from the genome, with m<50,000, m e is approximately equal to m, as demonstrated in Chinese cohorts.
In the case of a multi-ethnicity population like UKB, the use of encG-reg is when SNPs of ethnicity-insensitive frequencies are employed.

Verification and comparison
Simulation validation.For a conceptional exploration, we illustrated how m and k would affect the identification of various relatedness.In this case, we ignored the difference between m and m e because SNPs were generated independently.We simulated 200 individuals each for cohort 1 and cohort 2 (n 1 = n 2 = 200).Between cohort 1 and cohort 2, we generated 10 pairs of related samples at a variety of relatedness, i.e., zero-degree/identical, 1st-degree, and 2nddegree relatives, respectively.For better illustration, we set the desired number of markers (m) twice as given by Eq 3 and the corresponding size of k as given by Eq 4 at the experiment-wise Type I error rate of 0.05 (α = 0.05/40,000) and Type II error rate of 0.1 -equivalent to 90% statistical power.We simulated individual-level genotype matrices with the dimension of n 1 ×m and n 2 ×m and the encrypted genotype matrices with the dimension of n 1 ×k and n 2 ×k.Relatedness scores for GRM, encGRM, and encG-reg were calculated accordingly and theoretical distributions were derived under the assumption of multivariate distribution for each degree of relatedness.Fig 1 showed that for encG-reg, in each scenario, sufficient k was able to detect a certain degree of relatedness as long as m could support.Compared with encGRM, encG-reg had a smaller variance and consequently a larger statistical power in detecting relatives.
We then evaluated the properties and performance of encG-reg, GRM, and encGRM in more details.We first validated the derived variances of GRM, encGRM, and encG-reg (as summarized in Table 1).1,000 pairs of relatives were separated in cohort 1 and cohort 2. m = 1,000, 1,250, 1,500, 1,750, and 2,000 independent markers were simulated, and their MAF was sampled from a uniform distribution U(0.05,0.5).Genotype matrices from two cohorts were encrypted by the same m×k random matrix S, whose elements drew from a normal distribution N 0; 1 m À � .We set k to be 1,000, 2,000, 3,000, 4,000, and 5,000, respectively.Both the original and the encrypted genotype matrices were standardized based on the description for the three methods.Observed and theoretical variances were examined among four different degrees of relatedness (identical, 1st-degree, 2nd-degree, and 3rd-degree).The estimated sampling variances of GRM, encGRM and encG-reg matched with the theoretical variance at each level of relatedness (Fig 3).
Multi-ethnic samples validation: UKB in exhaustive and parsimony design.Both exhaustive and parsimony designs were conducted to validate encG-reg on 485,158 UKB multi-ethnicity samples from 19 assessment centers with a sample size greater than 10,000 (S3 Table ), resulting in a total of 110,713,926,381 inter-cohort individual pairs.As the 485,158 UKB samples consist of 94.23% Whites, 1.94% Asian or Asian British, 1.57% Black or Black British, and 2.27% "Other" or "Unknown", it is an ideal dataset to validate whether encG-reg is good to handle diversified samples.Identical/twins, 1st-degree and 2nd-degree relatedness were aimed to be detected by KING-robust ("the rule of thumb") using the real genotypes and by encG-reg using the encrypted genotypes, respectively.We conducted QC on the 784,256 chip SNPs within the 19 cohorts, and the inclusion criteria for autosome SNPs were: (1) minor allele frequency (MAF) > 0.01; (2) Hardy-Weinberg equilibrium (HWE) test p-value > 1e-7; and (3) locus missingness < 0.05.An averaging number of 578,543 SNPs survived from 19 cohorts.In addition, taking account of the multi-ethnicity nature of UKB samples, only SNPs of ethnicity-insensitive frequency, which have indifferent allele frequencies statistically, were

PLOS GENETICS
included.Therefore, we selected ethnicity-insensitive SNPs based on three reference population representing European (99 CEU and 107 TSI samples), African (107 YRI samples), or East Asian (103 CHB and 105 CHS samples) background in 1000 Genome Project.We first performed SNP quality control on three reference population, using the following inclusion criteria: (1) minor allele frequency (MAF) > 0.01; (2) Hardy-Weinberg equilibrium (HWE) test p-value > 1e-7; and (3) locus missingness < 0.05.Next, we conducted association studies on two reference populations at a time, selecting SNPs with a p-value greater than 0.05 (considered insignificant).Once insignificant SNPs were identified between each pair of reference population, we took the intersection between those SNPs to establish the ethnicity-insensitive SNP pool, which contained a total of 299,835 SNPs.These SNPs exhibit consistent allele frequency across different ethnic backgrounds.For more details see S1 Text.
We adopted both exhaustive design and parsimony design in this UKB validation.In the exhaustive design, intersected SNPs were selected between each pair of cohorts, the average number of intersected SNPs was 556,929 and the average number of ethnicity-insensitive SNPs after taking intersection with the ethnicity-insensitive pool was 13,157.In the parsimony design, a total number of 12,858 intersected and also ethnicity-insensitive SNPs among all 19 The observed and theoretical sampling variance of GRM (A1-A4), encGRM (B1-B4) and encG-reg (C1-C4) are given in bar plots.Individual genotypes are simulated with m = 1,000, 1,250, 1,500, 1,750, and 2,000 independent markers.A total number of n 1 = n 2 = 1,000 pairs of relatives are simulated under each different levels of relatedness (r = 0, 1, 2, and 3).As for the encryption, the column number of random matrices are k = 4,000, 5,000, 6,000, 7,000, and 8,000 correspondingly.https://doi.org/10.1371/journal.pgen.1011037.g003cohorts were selected.The numbers of ethnicity-insensitive SNPs intersected between each pair of UKB cohorts in exhaustive design were all given in S4 Table .The number of k for encG-reg was estimated by Eq 4 at a Type I error rate of 0.05 and a Type II error rate of 0.1.To note that, experiment-wise Bonferroni correction is based on the number of paired samples between every two cohorts (N ij ¼ n i n j ) for exhaustive design and the total number of paired samples among all cohorts (N ¼ P C i<j n i n j ) for parsimony design.Performance of encG-reg in two UKB cohorts.We investigated more details of encG-reg at two assessment centers in Manchester (11,502 individuals) and Oxford (12,260 individuals) from UKB white British, which included over 140 million comparisons.We randomly sampled SNPs with different ranges of MAF (0.01 to 0.05, 0.05 to 0.15, 0.15 to 0.25, 0.25 to 0.35, 0.35 to 0.5, and a broad range of 0.05 to 0.5) so as to compare the performance of encG-reg and KING.According to the minimum number of m e and k at the experiment-wise Type I error rate of 0.05 a ¼ on Eq 3 and Eq 4 (Table 2), the minimum requirement for m e was 283 and 1,104 for detecting 1st-degree and 2nd-degree relatedness, respectively.However, since it is m rather than m e that can be directly determined and interacted with the data, we suggested and empirically chose m as twice the minimum number of m e , in order to ensure that the practical derived m e satisfies Eq 3. We randomly selected 566 SNPs (m e = 566, θ 1 = 0.45) and 2,209 SNPs (m e = 2,023, θ 2 = 0.225) for detecting 1st-degree and 2nd-degree relatedness, and the corresponding k were 494 and 2,342, respectively.Against possible noise that may rust statistical power, we also increased k to 1.2k and denoted it as encG-reg+.The average relatedness score, standard deviation, and statistical power were calculated for each detected relative pair after resampling SNPs 100 times.
Out of the 11,502×12,260 = 141,014,520 pairs of inter-cohort individuals, 17 pairs of socalled 1st-degree and 2 pairs of 2nd-degree relatives were found using overall QCed SNPs by KING.The bar plots in Fig 4 compared relatedness scores of the known 1st-degree (m e = 566, k = 494) and 2nd-degree (m e = 2,023, k = 2,342) relatives, estimated by KING, GRM, encG-reg, and encG-reg+ (using 1.2k).In general, encG-reg and encG-reg+, still showed very similar estimations of relatedness score compared with KING.When SNPs were sampled with MAFs between 0.05 and 0.5, the average statistical power reached 0.9 and 0.95 for detecting 1stdegree relatedness by encG-reg and encG-reg+.The overall statistical power was proportional to the MAF; when the MAF of the sampled SNPs was less than 0.05, the statistical power of encG-reg was still close to our theoretical benchmark.In a more refined scope, using the conditional binomial distribution, our analytical result showed that the sampling variance of    After resampling SNPs 100 times, the average GRM score, standard deviation, and statistical power were calculated for each detected relative-pair.The grey dashed line indicates the expected statistical power of 0.9.The solid colored lines indicate the average relatedness scores for certain degrees as estimated by the four methods.17 pairs of so-called 1st-degree and 2 pairs of 2nd-degree relatives were approved using overall SNPs by KING. https://doi.org/10.1371/journal.pgen.1011037.g004 GRM was proportional to Performance of encG-reg in UKB.We verified the exhaustive design of encG-reg in 19 UKB cohorts (totaling over 100 billion inter-cohort individual pairs) by comparing with the results from KING up to the 2nd-degree relatedness (Fig 5A).The average number of intersected SNPs between every two pairs of cohorts was 13,157.The same 38 pairs of identical samples (monozygotic twins in this case) were detected by KING and encG-reg; 7,965, and 6,632 pairs of 1st-degree and 2nd-degree relatedness were inferred by KING, comparing to 7,913 and 7,022 by encG-reg, respectively.It could be seen that encG-reg was quite comparable to KING in practice.Based on individual ID and their recorded ethnicity, consistent relatedness scores were estimated by KING and encG-reg (Fig 5B -5D).Combining geographic distance between 19 cohorts, we discovered that more relatives were detected between adjacent assessment centers, such as Manchester and Bury, Newcastle and Middlesbrough, and Leeds and Sheffield.Besides, consistent numbers of relatedness were inferred by the parsimony design of encG-reg (S5 Table ).The decrease in the number of the detected 2nd-degree relatedness for parsimony design was possibly due to the smaller experiment-wise Type I error rate and thus a more stringent threshold.
As aforementioned the computational time complexity was Oððn 1 þ n 2 Þmk þ n 1 n 2 kÞ.For the example of 19 UKB cohorts, with an average cohort size of n = 25,537, an average of m = 13,157 intersected ethnicity-insensitive markers, and an average of k = 1,381 columns for random matrix, the average time required for each pair-wise encG-reg computation was 9.682 ± 2.700 minutes (S4 Table ).The computations were performed using one thread on an Intel(R) Xeon(R) CPU E7-4870 @ 2.40GHz.The average storage space of data, which was supposed to be transported to the center analyst but not for the UKB in-house demonstration, was proportional to n � k.Both the computational time and the storage space was affordable even for biobank-scale data such as the UKB.Applications 9 multi-center Chinese datasets.We launched a national-scale application for encG-reg in 9 Chinese datasets under the parsimony design to avoid possible computational and communicational costs.Four out of nine datasets were publicly available, while the remaining datasets were recruited from 5 research centers, located in from north to south China, including Beijing, Shanghai, Hangzhou, Guangzhou, and Shenzhen (Table 3).Serving as a proof-of-concept and brief validation of encG-reg in civilian and complex environments, collaboration was organized to detect identical or 1st-degree relatedness samples but without revealing personal medical information.
1KG-CHN (public): We considered two Chinese subpopulations in 1000 Genome Project (1KG) [31], CHB (Han Chinese in Beijing, 103 individuals) and CHS (Southern Han Chinese, 105 individuals) as the reference population and also as a positive control in the cross-cohort test in Chinese datasets.Individuals in the project were genotyped by either whole-genome sequencing or whole-exome sequencing platform.
UKB-CHN (UKB application 41376): The UKB includes 1,653 individuals of self-reported Chinese [32].After genomic assessment, 1,435 were considered as Chinese origin.Individuals in the project were genotyped using the Applied Biosystems UK BiLEVE Axiom Array by Affymetrix, followed by the genotype imputation.

CONVERGE (public):
The CONVERGE consortium aimed to investigate major depressive disorder (MDD) [33].It included 5,303 Chinese women with recurrent MDD and 5,337 controls, who were genotyped with low-coverage whole-genome sequencing followed by imputation.
MESA (accessible after dbGAP application): The Multi-Ethnic Study of Atherosclerosis (MESA), which investigates subclinical cardiovascular disease [34], includes 653 Chinese samples, who were genotyped using Affymetrix Genome-Wide Human Single Nucleotide Polymorphism array 6.0, followed by genotype imputation.
SBWCH Biobank: The Shenzhen Baoan Women's and Children's Hospital (Baoan district, Shenzhen, Guangdong province) Biobank aims to investigate traits and diseases during pregnancy and at birth.30,074 women were included in this study.Maternal genotypes were inferred from the non-invasive prenatal testing (NIPT) low depth whole genome sequencing data using STITCH [36] following the methodological pipeline that we previously published [35].The average genotype imputation accuracy reaches 0.89 after filtration of INFO score 0.4.

CAS and ZOC:
The Chinese Academy of Sciences (CAS) cohort is a prospective cohort study aiming to identify risk factors influencing physical and mental health of Chinese mental workers via a multi-omics approach.Since 2015, the study has recruited 4,109 CAS employees (48.2% male) located in Beijing, China.All participants belong to the research/education sector, and are characterized by a primary of Chinese Han origin (94.1%).DNA was extracted from peripheral blood samples and genotyped on the Infinium Asian Screening Array + Multi-Disease-24 (ASA+MD) BeadChip, a specially designed genotyping array for clinical research of East Asian population with 743,722 variants.For validation purpose, samples were randomly split into CAS1 and CAS2.According to their records, ZOC was consisted of 19 homozygotic and heterozygotic siblings, who were evenly split into CAS1 and CAS2 as internal validation of the method.ZOC is part of the Guangzhou Twin Eye Study (GTES), a prospective cohort study that included monozygotic and dizygotic twins born between 1987 and 2000 as well as their biological parents in Guangzhou, China.Baseline examinations were conducted in 2006, and all participants were invited to attend annual follow-up examinations.Non-fasting peripheral venous blood was collected by a trained nurse at baseline for DNA extraction, and genotyping was performed using the Affymetrix axiom arrays (Affymetrix) at the State Key Laboratory of Ophthalmology at Zhongshan Ophthalmic Center (ZOC) [37].CAS and ZOC cohorts were deeply collaborated for certain studies, and consequently merged to fit this study.
Fudan: A multistage GWAS of glioma were performed in the Han Chinese population, with a total of 3,097 glioma cases and 4,362 controls.All Chinese Han samples used in this study were obtained through collaboration with multiple hospitals (Southern population from Huashan Hospital, Nanjing 1st Hospital, Northern population from Tiantan Hospital and Tangdu Hospital).DNA samples were extracted from blood samples and were genotyped using Illumina Human OmniExpress v1 BeadChips [38].2,008 samples were included for this study.

WBBC:
The Westlake BioBank for Chinese (WBBC) cohort is a population-based prospective study with its major purpose to better understand the effect of genetic and environmental factors on growth and development from youngster to elderly [39].The mean age of the study samples were 18.6 years for males and 18.5 years for females, respectively.The Westlake Bio-Bank WBBC pilot project has finished whole-genome sequencing (WGS) in 4,535 individuals and high-density genotyping in 5,841 individuals [40,41].
The 9 Chinese datasets were reorganized into 9 cohorts (1KG-CHN, UKB-CHN, CON-VERGE, META, SBWCH, CAS1, CAS2, Fudan, and WBBC) and to test encG-reg in the real world.Within CAS1 and CAS2, relatedness if identified by encG-reg would be verified by CAS.As would have been found, among other pairs of cohorts, sporadic relatedness might occur.
Performance of encG-reg in Chinese cohorts.As summarized in Fig 2A , the Chinese cohort study was swiftly organized and completed within about 7 weeks, showing that encGreg was an effective strategy with better ethical assurance.Following intra-cohort QCs and upon received summary information, we examined sample sizes and SNPs in each cohort (Table 3).In total, it included 54,092 samples and generated about 1 billion (N = 930,140,004) pairs of tests.When allele frequencies were compared with that in CONVERGE, the majority of SNPs show consistent allele frequencies across cohorts (S3 Fig and S6 Table).The missing rates and the intersected SNPs were also examined across cohorts (Figs S4-S5 and S7 Table ), after which a total of 7,009 SNPs were in common among 9 cohorts for the parsimony design of encG-reg (Fig 6A).
The results of fPCA and fStucture matched with their expected "geo-geno" mirror in Chinese samples [35].The first eigenvector of fPCA distinguished southern and northern Chinese samples in this study: the SBWCH Biobank (dominantly sampled from Shenzhen, the southmost metropolitan city in mainland China) and CAS cohort (dominantly sampled from Beijing) (Fig 6B and 6C).Using a slightly different illustration strategy, the fStructure results, a counterpart to the well-known Structure plot in population genetics, were also consistent with the reported Chinese background of the 9 cohorts (Fig 6D and 6E).
We offered a list of 500 SNPs to be shared by the collaborators and estimated m e to be 477 (evaluated in 1KG-CHN).The minimum number of k was 710, given the experiment-wise Type I error rate of 0.05 (a ¼ 0:05 930;140;004 , θ 1 = 0.45) and the statistical power of 0.9.Each collaborator then encrypted their genotype matrix by the random matrix S. As foolproof controls, 1KG-CHN samples were consistently identified as "identical" inter-cohort.
Relatives were identified between CAS1 and CAS2, and SBWCH and WBBC (Fig 7A and  7B).The pair-wise encG-reg distributions between cohorts were consistent with our theoretical expectation (Figs 7C and S6).
For anticipated relatives, as each of the 19 Guangzhou twins was split into CAS1 and CAS2, 18 pairs were identified as monozygotic (MZ) or dizygotic (DZ) by encG-reg and verified by intra-cohort IBD calculation in CAS Beijing team (Fig 7D).Remarkably, the pair of recorded twins that was not identified by encG-reg was verified as unrelated by IBD calculation, and ZOC team is conducting further investigation on potential logistic errors.These results demonstrated that encG-reg was reliable with well-controlled Type I and Type II error rates.
Particularly, we illustrated how sporadically related pairs were captured by encG-reg.We detected 14 pairs of inter-cohort relatedness, including 4 pairs of identical samples and 10 pairs of 1st-degree relatives (Table 4).For these sporadic related inter-cohort samples, encGreg exhibited their relatedness in the forms of regression plots and estimated regression coefficients, two examples were given in Fig 7E and 7F.Thirteen out of fourteen pairs of sporadic related pairs were identified between CAS1 and CAS2.This could likely be attributed to the  The histogram shows all estimated relatedness using encG-reg between CAS1 and CAS2, most of which are unrelated pairs and the centralized sampling of the CAS cohort among CAS employees, as it is highly probable that family members are intended to work in the same company.Moreover, despite the higher missing rate in SBWCH, which introduced additional noise, encG-reg still managed to capture one across-cohort relatives between SBWCH and WBBC.To avoid possible breaching of privacy, we refrained from further exploring their relationship as extensively as we did for UKB.

Discussion
One of the early attempts on detecting cross-cohort relatives was limited to detecting overlapping individuals by one-way cryptographic hashes, which offered qualitative but not quantitative conclusions on false positive and false negative rates [13].To settle the question of exact encryption precision, we focused on investigating the intrinsic consequence after genotype encryption with a random matrix and proposed encG-reg.We described the properties of encG-reg in how k and m e influence its precision.This property was well testified in both the UKB example and the collaboration across China.Our investigation led to controllable theoretical probability density function is given as the normal distribution N 0;  encryption precision even under a variety of genotype platforms and datasets with different qualities.It should be noticed, as a proof-of-concept, we only studied additive GRM, which corresponds to IBD 1 .Obviously, our work can be extended to dominance-GRM (or two-allele "IBD = 2" scheme), so as to further split 1st-degree relatives into parent-offspring (IBD 0 = 0, IBD 1 = 1.0, and IBD 2 = 0) and full sibs (IBD 0 = 0.25, IBD 1 = 0.5, and IBD 2 = 0.25).To this point, we have only presented the outcomes concerning identity, 1st-degree, and 2nd-degree relatedness in UKB.This is primarily due to the absence of a distinct definition for true relatedness.To conduct further examination for the exact inference of distant relatives, a dataset with more pedigree information should be employed and a study design with more comprehensive comparisons should be considered [42].
As demonstrated in UKB multi-ethnicity samples, encG-reg could be applied for biobankscale datasets with high precision in comparing with conventional individual-level benchmark methods such as KING and GRM.The evaluation using Chinese cohorts is the first attempt to develop an encrypted method that can be applied in large-scale searching relatives with encrypted genomic data.In this experiment, for convenience and manageability, we only considered parsimony design by using shared SNPs across the 9 Chinese cohorts.Switching to exhaustive design will be a better option if each pair of cohorts conducts encG-reg for their customized degree of relatives.
For either exhaustive design or parsimony design of encG-reg, the core algorithm is algebraic and requires little human information in its implementation.Thus, an automatic central analysis facility that can significantly host and synchronize more cohorts will be attractive.An exhaustive implementation of encG-reg will search even deeper relatedness across cohorts in a highly mobilizing nation like China, in which relatives used to live nearby but now are distant due to industrialization [43].A much deeper implementation of encG-reg will bring out unique resource for conducting biomedical research at large scale as including familial information as demonstrated [44].Last but not least, encG-reg is a developed tool that is with much better protected genomic privacy, and can facilitate necessary relative searching when it is needed.It is not purposed to penetrate membership or other unethical activities.
An attack on the central site may result in the leak of encrypted genotype matrices and estimated relatedness score, but no raw genotype matrices will be leaked.However, since the individual IDs were de-identified by each cohort, such as individual IDs presented in Table 4, no more information can be traced back by other sites or the central site.Moreover, additional secure protection can be implemented at the central site (which can be a cloud server), and this is about the design on an entrusted server.In the worst-case scenario-a colluded central analyst, both S and XS have been leaked, certain adversary attacks, such as PCA-attack, may be carried out.In PCA-attack, X can be adversary reconstructed via principal component analysis on XS, if the correlation structure of SNPs can be approximated in a proper reference population [45].To mitigate the risk of a PCA-attack, one straightforward defence is to use SNPs that are in linkage equilibrium as much as possible (m�m e then), and it ensures that the correlation matrix closely approximates the identity matrix.As noticed, the strategy of masking the original genotype after multiplication of S resembles matrix random projection employed in classifying the transactions as legitimate or fraudulent across financial institutions, and a class of methods of possibly reconstructing X have been discussed [46].Consequently, we are confident for the safety of the whole practice, for now and for the future when the encG-reg grows to a more broadly utilized application.
The homomorphic encryption such as CKKS scheme [47] and Fan and Vercauteren (FV) scheme [48] have recently been employed in developing HE-KING [49,50], which are also applicable to our study.Nevertheless, the computational cost of HE-KING is substantial, for which after encryption the memory cost is often one or two orders of magnitude of the original genotypes.Eq 3 provides a lower bound of SNP numbers for detecting relatedness and consequently can be plugged into HE-KING, a useful quantity that is able to reduce the SNP number and computational cost in particular when analyzing biobank-scale data such as UKB.

Fig 1
provides a phenomenological illustration of how m e and k are weaved together.A simulation R code for Fig 1 can be found at https://github.com/qixininin/encG-reg/blob/main/1-Simulations/Figure1-resolution.R.

Fig 1 .
Fig 1.Resolution for varying relatedness using GRM, encGRM and encG-reg.The figure shows the resolution for detecting relatives or overlapping samples with respect to varying number of markers at every row (for better illustration m e was twice that of Eq 3) and the degree of relatives to be detected (r = 0, 1, and 2).The y axis is the relatedness calculated from GRM and the x axis is the estimated relatedness calculated from encG-reg (A) and encGRM (B).Each point represents an individual pair between cohort 1 and cohort 2 (there are 200 × 200 = 40,000 pairs in total), given the simulated relatedness.The dotted line indicates the 95% confidence interval of the relatedness directly estimated from the original genotype (blue) and the encrypted genotype (red).The table provides how m and k are estimated.The columns "under minimal m e " provide benchmark for a parameter, and it is practically to choose 2×m e and then estimate k as shown under the column "practical m e ".

3 AttacksFig 2 .
Fig 2. Workflow of encG-reg and its practical timeline as exercised in Chinese cohorts.The mathematical details of encG-reg are simply algebraic, but its inter-cohort implementation involves coordination.(A) We illustrate its key steps, the time cost of which was adapted from the present exercise for 9 Chinese datasets (here simplified as three cohorts).Cohort assembly: It took us about a week to call and got positive responses from our collaborators (See Table3), who agreed with our research plan.Inter-cohort QC: we received allele frequencies reports from each cohort and started to implement inter-cohort QC according to "geo-geno" analysis (see Fig6).This step took about two weeks.Encrypt genotypes: upon the choice of the exercise, it could be exhaustive design (see UKB example), which may maximize the statistical power but with increased logistics such as generating pairwise S ij ; in the Chinese cohorts study we used parsimony design, and generated a unique S given 500 SNPs that were chosen from the 7,009 common SNPs.It took about a week to determine the number of SNPs and the dimension of k according to Eq 3 and 4, and to evaluate the effective number of markers.Perform encG-reg and validation: we conducted intercohort encG-reg and validated the results (see Fig 7and Table 4).It took one week.(B) Two interactions between data owners and central analyst, including example data for exchange and possible attacks and corresponding preventative strategies.

Fig 4 .
Fig 4. Influence of minor allele frequencies in detecting relatives in Manchester and Oxford cohorts.The bar plots provide a comparison of relatedness scores for the known 1st-degree and 2nd-degree relatives estimated by KING, GRM, encG-reg, and encG-reg+ at two representative assessment centers (Manchester and Oxford).For each assessment centers, 566 and 2,209 SNPs were randomly selected within specific MAF ranges: 0.01 to 0.05, 0.05 to 0.15, 0.15 to 0.25, 0.25 to 0.35, 0.35 to 0.5, and 0.05 to 0.5.Here, encG-reg+ denotes the use of 1.2-fold of the minimum number of k and IBD denotes twice the relatedness score estimated by KING.After resampling SNPs 100 times, the average GRM score, standard deviation, and statistical power were calculated for each detected relative-pair.The grey dashed line indicates the expected statistical power of 0.9.The solid colored lines indicate the average relatedness scores for certain degrees as estimated by the four methods.17 pairs of so-called 1st-degree and 2 pairs of 2nd-degree relatives were approved using overall SNPs by KING.

Fig 5 .
Fig 5. Resolution for detecting relatives in UKB cohorts by KING and encG-reg under exhaustive design.(A) Chord diagrams show the number of inter-cohort identical/twins, 1st-degree, and 2nd-degree relatedness across 19 UKB assessments with over 10,000 samples.Relatedness was detected and compared between KING and encG-reg under an exhaustive design, encompassing a total of 171 intercohort analyses.In each chord plot, the length of the side edge is proportional to the count of detected relatives between the focal cohort and other cohorts.(B) The scatter plot shows the estimated relatedness score by KING and encG-reg for inter-cohort relative pairs, including identical, 1st-degree, and 2nd-degree pairs.(C) The histogram shows the distribution of relatedness scores estimated by encGreg.(D) The histogram shows the distribution of relatedness scores estimated by KING.https://doi.org/10.1371/journal.pgen.1011037.g005

Fig 6 .
Fig 6.Cohort-level genetic background analyses for Chinese cohorts under parsimony encG-reg analysis.(A) Overview of the intersected SNPs across cohorts, a black dot indicated its corresponding cohort was included.Each row represented one cohort while each column represented one combination of cohorts.Dots linked by lines suggested cohorts in this combination.The height of bars represented the cohort's SNP numbers (rows) or SNP intersection numbers (columns).Inset histogram plots show the distribution of the 7,009 intersected SNPs and the 500 SNPs randomly chosen from the 7,009 SNPs for encG-reg analysis.(B) 7,009 SNPs were used to estimate fPC from the intersection of SNPs for the 9 cohorts.Each triangle represented one Chinese cohort and was placed according to their first two principal component scores (fPC1 and fPC2) derived from the received allele frequencies.(C) Five private datasets have been pinned onto the base map from GADM (https://gadm.org/data.html)using R language.The size of point indicates the sample size of each dataset.(D) Global fStructure plot indicates global-level F st -derived genetic composite projected onto the three external reference populations: 1KG-CHN (CHB and CHS), 1KG-EUR (CEU and TSI), and 1KG-AFR (YRI), respectively; 4,296 of the 7,009 SNPs intersected with the three reference populations were used.(E) Within Chinese fStructure plot indicates within-China genetic composite.The three external references are 1KG-CHB (North Chinese), 1KG-CHS (South Chinese), and 1KG-CDX (Southwest minority Chinese Dai), respectively; 4,809 of the 7,009 SNPs intersected with these three reference populations were used.Along x axis are 9 Chinese cohorts and the height of each bar represents its proportional genetic composition of the three reference populations.Cohort codes: YRI, Yoruba in Ibadan representing African samples; CHB, Han Chinese in Beijing; CHS, Southern Han Chinese; CHN, CHB and CHS together; CEU, Utah Residents with Northern and Western European Ancestry; TSI, Tuscani in Italy; CDX, Chinese Dai in Xishuangbanna.https://doi.org/10.1371/journal.pgen.1011037.g006

Fig 7 .
Fig 7. Detected identical pairs and 1st-degree pairs between Chinese cohorts.(A) The circle plot illustrates identical pairs and (B) 1st-degree pairs across 9 Chinese cohorts.The solid links indicates anticipated relatedness between the CAS cohorts.The dashed link indicates relatedness identified across cohorts.The length of each cohort bar is proportional to their respective sample sizes.(C) The histogram shows all estimated relatedness using encG-reg between CAS1 and CAS2, most of which are unrelated pairs and the

a
Standard deviation (SD) is calculated from SD b ij ¼ ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffifficovðx i ;x j Þ varðx i Þ q, where xi and xj are the vectors of the encrypted genotypes for two individuals.bDue to missing data, the corrected score, is adjusted for the genotype missing rate between the pair of individuals https://doi.org/10.1371/journal.pgen.1011037.t004 ref 1 st ; F ref 2 st , and F ref 3 st ) between i th cohort and the given reference populations are calculated.Finally, a bar plot is employed to show the ratio of 1=F ref 1 st ; 1=F ref 2 st , and 1=F ref 3 st , providing a clear visualization of the genetic composition of the testing cohorts against the reference populations.

Table notes :
We used Manchester(11,502 individuals)and Oxford(12,260 individuals)from UKB white British, totaling 11,502×12,260 = 141,014,520 pairs.The minimum number of m e for detecting identical, 1st-degree, and 2nd-degree relatedness are calculated from Eq 3 at the experiment-wise Type I error rate of 0.05 (α = 0.05/141.014,520)and Type II error rate of 0.1.In practice, the suggested number of markers (m) here is twice that of the minimum number of m e because it is a surplus of SNPs.The empirical m e is estimated from 1KG-EUR.The minimum number of k is calculated from Eq 4 given empirical m e .The column of "1.2×k" is suggested for improved statistical power.
It was noticeable that larger MAFs could lead to a smaller variance of GRM score (S2 Fig), which further resulted in a smaller variance and a higher power of detecting relatives for encGRM and encG-reg.This result is consistent with how MAF affects the statistical power in UKB Manchester and Oxford cohorts.

Table 4 . Supporting evidence for sporadic related pairs. Pair Cohort 1 ID 1 Cohort 2 ID 2 Score (SD a ) Adjusted Score b (SD) Inferred relatedness
The threshold (grey dot line) for rejecting H 0 was calculated by z 1À a=N ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi Here we included 208 controls merged from 1KG-CHN.m e ¼ 477; k 0 ¼ 70; k 1 ¼ 710; N ¼ 930; 140; 004.(D) Relationship verification for 19 Guangdong twins split in CAS cohorts.Dashed lines indicate inference criteria for detecting relatedness of different degrees.Solid line of y = x indicates the agreement between encG-reg and IBD.Points are colored with IBD-inferred relatedness in KING (identical in green, 1st-degree in blue, and unrelated in red) and are shaped according to encG-reg-inferred relatedness (identical in square, 1st-degree in diamond, and unrelated in circle).(E) and (F) Illustration for encG-reg estimation for sporadic related inter-cohort samples.The grey line is the criterion for identical pairs (slope of 1) or 1st-degree pairs (slope of 0.5).The solid lines colored in red are without adjustment for missing values (encG-reg score), and in the bottom (colored in purple) are adjusted for missing values (encG-reg score*).

Table Notes :
IDs were de-identified by each cohort.