Skip to main content
Advertisement
  • Loading metrics

Federated privacy-protected meta- and mega-omics data analysis in multi-center studies with a fully open-source analytic platform

  • Xavier Escriba-Montagut,

    Roles Formal analysis, Software, Writing – original draft, Writing – review & editing

    Affiliations Barcelona Institute for Global Health (ISGlobal), Barcelona, Spain, Universitat Pompeu Fabra (UPF), Barcelona, Spain

  • Yannick Marcon,

    Roles Software, Writing – original draft

    Affiliation Epigeny, St Ouen, France

  • Augusto Anguita-Ruiz,

    Roles Software

    Affiliations Barcelona Institute for Global Health (ISGlobal), Barcelona, Spain, Universitat Pompeu Fabra (UPF), Barcelona, Spain

  • Demetris Avraam,

    Roles Software, Writing – original draft

    Affiliation Department of Public Health, Policy and Systems, University of Liverpool, Liverpool, United Kingdom

  • Jose Urquiza,

    Roles Software

    Affiliations Barcelona Institute for Global Health (ISGlobal), Barcelona, Spain, Universitat Pompeu Fabra (UPF), Barcelona, Spain, Centro de Investigación Biomédica en Red en Epidemiología y Salud Pública (CIBERESP), Barcelona, Spain

  • Andrei S. Morgan,

    Roles Funding acquisition

    Affiliations Université Paris Cité, Centre of Research in Epidemiology and StatisticS (CRESS), Obstetrical Perinatal and Pediatric Epidemiology Research Team (EPOPé), INSERM, INRAE, F-75006, Paris, France, Elizabeth Garrett Anderson Institute for Women’s Health London, University College London, London, United Kingdom

  • Rebecca C. Wilson,

    Roles Funding acquisition

    Affiliation Department of Public Health, Policy and Systems, University of Liverpool, Liverpool, United Kingdom

  • Paul Burton,

    Roles Funding acquisition

    Affiliation Population Health Sciences Institute, Newcastle University, Newcastle, United Kingdom

  • Juan R. Gonzalez

    Roles Conceptualization, Funding acquisition, Software, Supervision, Writing – original draft, Writing – review & editing

    juanr.gonzalez@isglobal.org

    Affiliations Barcelona Institute for Global Health (ISGlobal), Barcelona, Spain, Universitat Pompeu Fabra (UPF), Barcelona, Spain, Centro de Investigación Biomédica en Red en Epidemiología y Salud Pública (CIBERESP), Barcelona, Spain

Abstract

The importance of maintaining data privacy and complying with regulatory requirements is highlighted especially when sharing omic data between different research centers. This challenge is even more pronounced in the scenario where a multi-center effort for collaborative omics studies is necessary. OmicSHIELD is introduced as an open-source tool aimed at overcoming these challenges by enabling privacy-protected federated analysis of sensitive omic data. In order to ensure this, multiple security mechanisms have been included in the software. This innovative tool is capable of managing a wide range of omic data analyses specifically tailored to biomedical research. These include genome and epigenome wide association studies and differential gene expression analyses. OmicSHIELD is designed to support both meta- and mega-analysis, so that it offers a wide range of capabilities for different analysis designs. We present a series of use cases illustrating some examples of how the software addresses real-world analyses of omic data.

Author summary

OmicSHIELD revolutionizes the way researchers can engage with federated omics data, providing a secure framework for conducting different omic data analyses. This innovative platform allows data to stay in their original repositories, thus eliminating data transfer—a crucial feature in an era where data privacy regulations are becoming increasingly stringent. By leveraging advanced techniques like differential privacy, OmicSHIELD aims to mitigate disclosure risks associated with analysis of omics data, while still enabling accurate collaborative research. The platform is highly flexible, supporting processing and analysis of multiple omic data formats. This makes it a useful tool for researchers looking to perform complex analyses across multiple datasets. OmicSHIELD includes active disclosure control checks and the ability to compute a wide range of analytical methods useful to obtain insights from omic data. By prioritizing both analytical power and data privacy, OmicSHIELD addresses the most pressing challenges in omics research today, making it easier for scientists to unlock new insights while maintaining high ethical standards.

Introduction

Contemporary data analytics in health and biological sciences include a central focus on the analysis and interpretation of high volume ‘omics data’ (genomic, epigenomic, or metabolomic data). An important requirement for fully exploiting the potential of such data is to make large amounts of clinical, epidemiological and omic information accessible and interoperable to researchers. This can be achieved through data sharing. Historically, data sharing has been based on central warehousing: this requires data generators to physically transfer data or summary statistics to make them accessible to data analysts. This approach has been adopted, for instance, by most consortia devoted to the analysis of genomic data. Under this setting, each data provider runs, for example, their own genome-wide association studies (GWAS) independently, and shared summary statistics are then meta-analysed [1]. Alternatively, a recent and increasingly used analytic trend known as federated analysis (FA) permits analysis of multiple decentralized datasets, without sharing or accessing individual-level information (i.e., migrating the analysis to the data) [2]. Motivated by the delicate nature of genetic and health data and the ethical and legal issues behind sharing this kind of information, the potential benefits of FA are being increasingly widely optimized [3,4], [5].

Meta-analysis combines summary statistics from multiple studies to derive a unified conclusion. In contrast, mega-analysis (i.e. pooled analysis) in our context, uses the individual-level data from multiple studies to derive combined conclusions without the requirement of physically pooling the data. Instead, we employ specific statistical methodologies that reproduce the same results as if the data would be pooled together and analysed jointly.

Meta-analysis is widely adopted for the combination of GWAS [6] and is also being used in differential gene expression (DGE) and epigenome wide association studies (EWAS) by combining results from different populations [7]. Being capable of performing both meta- and mega-analysis in a federated framework would be a cutting-edge advance for the biomedical field, allowing, among other possibilities, to choose the best and most convenient approach to be applied depending on data characteristics and designs. Most currently available FA systems for omics data are only intended to perform pooled analyses, arguing that this approach substantially increases statistical power [8]. However, pooled analysis in a multi-cohort setting is not recommended when data are heterogeneous among cohorts or when data are not properly harmonized [9] (e.g. gene expression normalized using different methods, or GWAS data maintained on different platforms) as substantive heterogeneity in the nature of the data can lead to biased results. This includes “confounding by study” which can be very severe when an outcome and an explanatory covariate vary in tandem (or in a reciprocal manner) across study populations [10].

Omic FA has a key constraint since it includes sensitive information. It requires ensuring appropriate levels of security and privacy and the judicious application of the stringent regulations implicit to contemporary governance frameworks such as the General Data Protection Regulation (GDPR) in Europe. To address this important issue, different privacy-protecting techniques such as federated learning (FL) in combination with differential privacy (DP) [11], homomorphic encryption (HE) [12], and secure multi-party computation (SMPC) [13] have been developed, some or all of which may be adopted [14]. Algorithms developed by researchers could potentially be used, alongside genotype-phenotype associations from genetic association studies, by an attacker to predict genotypes and phenotypes of target individuals based on genome information shared by individuals or their relatives [15], [16], [17]. A secure FA platform should thus have solutions to minimize the risk of such potential attacks.

The burden of data sharing on multi-center studies that deal with omic data has enormously increased in the last few years. These include large projects such as ORCHESTRA, MIRACUM, LifeCycle, HELIX and ATHLETE [18], [19], [20], [21], [22], [23] among many others. Having software solutions to FA in omic studies using privacy-protected techniques is therefore an urgent need. In the omic setting, infrastructures for federated networks, FA of GWAS (FAHME [8] and sPLINK [24]) and transcriptomic data (Flimma [25]) have been proposed. These existing tools have important limitations including that they may require their own data infrastructures and different programming languages–some of which are not open-source, hence making the implementation of new features difficult. Another important limitation of existing solutions is that downstream analyses (e.g. data visualization, post-omic data analyses) are poorly integrated into analytical pipelines.

In order to overcome existing limitations, we have developed OmicSHIELD which covers many types of omics data including genomics, epigenomics and transcriptomics. OmicSHIELD is designed for analysis of horizontally partitioned data, with no current support for vertically partitioned data. It is developed for DataSHIELD [2], a platform that supports secure federated analysis of sensitive data that never leaves the data owner’s domain. This framework has been used to develop FA in different settings such as machine learning [26] and survival analysis [27]. It allows both mega- and meta-analyses. Two distinct data warehouses existfor the FA DataSHIELD infrastructure, Opal and Armadillo that are elsewhere described [28]. A detailed bookdown can also be accessed to see examples of how Opal can be used for data management (https://isglobal-brge.github.io/resource_bookdown/opal.html). A key feature of OmicSHIELD is that it is open-source, written in R and licensed under GPL, thus facilitating downstream analyses within a single pipeline by interacting with other programming languages (e.g. Python) and with other R or Bioconductor packages. OmicSHIELD incorporates both statisticaldisclosure controls and differential privacy approaches to assure privacy-preserving data analyses. Therefore, our approach has the potential to fulfill the stringent requirements made by data controllers (e.g. hospitals) which often hinder multicentric medical studies, since differentially private learning has been positioned as one of the preferred methods for GDPR-compliant recommender systems [28].

By enabling data to stay within its original repository and minimizing data movement, OmicSHIELD adheres to GDPR principles of data minimization and purpose limitation. Furthermore, OmicSHIELD’s user-based architecture allows for audit trails and transparency, ensuring data usage is compliant with GDPR’s accountability requirements.

OmicSHIELD is designed to enhance biomedical research by facilitating a wide range of omic analyses in federated settings, ensuring data privacy and compliance. In federated data analysis of sensitive omic data, addressing data privacy threats is essential. Our adversarial model focuses on “insiders with authorized access but malicious intent” recognizing the significant risk from within, often overlooked. More precisely, this falls into the honest-but-curious (HBC) adversary, which can be described as “participant who will attempt to learn all possible information from legitimately received messages” [29]. The main threat being controlled is the ability of the adversary to re-identify an individual within a dataset. These insiders, including data custodians, researchers, and technical staff, might exploit their access to infer individual-level information from aggregated data, despite not having direct access to raw, unencrypted data. In the context of data governance, will have the opportunity tosuch action is a misuse of the data. This is why the first mitigation is always the implementation of strong data governance and the use of contractual obligations for the analyst.

To help users leverage these frameworks, we present an online book available at https://isglobal-brge.github.io/OmicSHIELD/. It covers installation, sources of help, and complete workflows illustrating examples of omic data analyses using freely available datasets. It also includes material describing different use cases corresponding to real world data applications from different existing projects. On it, readers will find further use cases than the ones described on this manuscript. Therefore, results presented in this manuscript can be fully reproduced and users can perform additional analyses using other covariates or conditions.

Results

In this section, we initiate the discussion by presenting a broad outline of the functionalities and solutions offered by OmicSHIELD, underscoring its merits and its potential in addressing the challenges delineated in this manuscript, a summary of the key aspects is available in Table 1. Subsequently, we exemplify the practical utility of OmicSHIELD through two real-world examples of omic data analyses employing cutting-edge methodologies. In the first example, we elucidate the procedure for conducting analysis on genomic data, whereas the second example delineates the analysis of transcriptomic and epigenomic data.

Security

There are different disclosure control methods assembled in OmicSHIELD. These can be summarized as: 1) Disclosure traps: the statistical disclosure control method of cell suppression is used to define the minimum cell sizes for functions that perform aggregation or subsettings. 2) Control of output: filtering information that the client receives, e.g. not sending the individual identifiers of a study. The behavior of these techniques can be configured by each study’s data managers thus allowing individual studies to adopt a set of disclosure controls that comply with their own, local required regulations. 3) Differential privacy methods are also implemented to enable an additional layer of protection to results that are returned by study servers to the central analysis node. Differential privacy has been defined as the inability of an attacker to distinguish whether a single individual is present in a dataset [30]. Adding stochastic noise to function outputs is a way of achieving this: different types of noise can be used [31], with fine-tuned Laplace noise being the chosen method for our implementation (see Methods section). This approach has been adopted as a countermeasure to inference attacks using complex queries [32], [33], [34]: as such attacks can make use of GWAS results and allele frequencies, the implemented differential-privacy mechanisms are intended to cause additional difficulty for such attacks. 4) There are other strategies that can be performed using minor allele frequencies (MAF) attacks [35]. To prevent these attacks, we offer an extra layer of protection by having a filter that blocks the output for SNPs with a MAF lower than a pre-specified threshold. This threshold is also configurable by data managers. Furthermore, the OmicSHIELD architecture is user-based, meaning the activity of authorized researchers can be audited to check whether inference attacks have been performed. OmicSHIELD has all filters active by default with sensible values (i.e. ε = 3, resample = 3, MAF = 0.05), nevertheless it is up to the data managers to modify these values according to their needs or privacy budgets.

Omic analytic capabilities

OmicSHIELD contains functionalities to perform three types of omic data analysis: GWAS, DGE and EWAS. Table 2 outlines the main functions available in OmicSHIELD. Fig 1 demonstrates how omic association analyses can be performed using client-side functions (i.e. using the dsOmicsClient package). Data (omic and phenotypes/covariates) are stored in their native formats at different sites (for example, remote servers accessible via https or ssh, Amazon S3, locally,…). Analysis follows the DataSHIELD client-server architecture, implemented through a pair of libraries with dsOmics implemented server-side and dsOmicsClient client-side. Most association analyses involving omic data are based on fitting different generalized linear models (GLMs) for each feature (e.g., SNP, CpG, gene, transcript, …) and this forms the basis of the methods we have implemented, which includes two different types of analysis: pooled and meta-analyses.

thumbnail
Fig 1. Scheme of DataSHIELD implementation of omic-related packages.

The dsOmics package contains functions to perform non-disclosive data analyses of omic data that are managed within Opal using the resourcer package. Omic data normally have two pieces of information, one corresponding to features (CpGs, SNPs, genes, …) and another for phenotypic data (grouping variable, outcome, covariates, …) that can be stored in different resources (e.g. PLINK and table in genomics) or in a specific resource designed for that purpose in R/Bioconductor (e.g. ExpressionSet or RangedSummarizedExperiment). This package should be installed in the Opal server along with their dependencies. The package dsOmicsClient must be available in the client side and contains functions that allow the interaction between the analysis computer and the servers.

https://doi.org/10.1371/journal.pcbi.1012626.g001

thumbnail
Table 2. Main analysis functions of OmicSHIELD.

For the complete list of functions and the complete details refer to the available online guide (https://www.isglobal-brge.github.io/dsOmicsClient).

https://doi.org/10.1371/journal.pcbi.1012626.t002

The “pooled approach” (Fig 2A) is recommended when the researcher wants to analyse omic data from different sources and obtain results as if the data were physically pooled in a single database. The algorithm for pooled regression analysis can be very time consuming for omic data, therefore we implemented a fast algorithm for massive generalized linear models (see Methods). This approach is not recommended for data that is not fully harmonized–it should not be performed when gene expressions are normalized using different methods, or when GWAS is provinent of different data collection technologies, as substantive heterogeneity in the nature of the data can lead to biased results [36]. The “meta-analysis approach” (Fig 2B) overcomes limitations raised when performing pooled analyses. Computation issues are addressed by using scalable and fast methods to perform data analyses at the whole-genome level at each server. Best practice methods to perform meta GWAS should be adopted [37]. We have implemented these methodologies in OmicSHIELD. Adopting a meta-analytic approach is also recommended when there is large heterogeneity in the trait of interest between cohorts.

thumbnail
Fig 2. Types of omics data analyses implemented in dsOmicsClient.

We have implemented two different types of analyses: the virtually pooled and the federated meta-analysis. Panel A shows the “virtually pooled approach” that is recommended when the user wants to analyse omic data from different sources (e.g. studies) and obtain results as if the data were in a single data warehouse. Panel B depicts the “federated meta-analysis approach” that overcomes the limitations raised when performing pooled analyses: computing time and data harmonization. The computation issue is addressed by using scalable and fast methods to perform data analysis at whole-genome level at each server. The data harmonization issue is addressed by combining results using p-values that are independent of how omic data in features have been recorded.

https://doi.org/10.1371/journal.pcbi.1012626.g002

Genomics, polygenic risk scores and principal components analyses

Genomic data are analysed using the GWASTools and GENESIS Bioconductor packages that allow quality control (QC) and GWAS to be performed [38]. As we can also use computational resources [39], we have implemented methods to perform meta-analyses using PLINK and SNPTEST which are standardly used to perform GWAS using genotyped and imputed SNPs, respectively.

We use the PGS Catalog to calculate polygenic risk scores (PRS) from curated literature [40]. As this information can be disclosive, OmicSHIELD calculates PRS for each individual at each server without any interaction between the cohorts. The PRS are stored on the servers and are considered thereafter as any other covariable. That is, PRS can be used as part of an association model.

Computing principal components analyses (PCA) is the standard methodology to address populations in GWAS, but computing federated PCA is missed in other federated GWAS solutions (e.g. FAHME [8] or sPLINK [24]). Existing approaches use principal components (PCs) estimated at each cohort and these covariates are then used for the adjustment of association models. However, PCs should be computed using the entire population to capture genetic differences among individuals [41]. OmicSHIELD can circumvent this issue by adopting a pooled approach, thus providing a better solution than any other available elsewhere.

Differential gene expression analysis and EWAS

The DGE and EWAS meta-analyses provided by OmicSHIELD make use of the widely used Bioconductor ExpressionSet, RangedSummarizedExperiment or GenomicRatioSet data formats to deal with omic and phenotypic (e.g. covariates) information. The main analysis method implemented corresponds to RNA-seq data which are analysed using limma+voom [42]. We have also implemented functions for using other methods such as DESeq2 and edgeR [43] as well as methods to analyse microarray data using limma.

Post-omic analyses and visualization

The pooled approach returns the effect sizes, standard errors, and some specific feature annotations. On the other hand, the meta-analysis approach provides study-specific estimates and standard errors from analyses performed on each server. These results can then be brought together using various meta-analytic techniques. For example, GWAS utilizes effect sizes and standard errors, while DGE and EWAS employ p-values [7]. Both methods are incorporated in OmicSHIELD.

Various visualizations can be created after the completion of an analyses. For example, Manhattan and qq-plots can be generated for all types of analysis, while a locus zoom plot can be made for GWAS. Visualization of data distributions for any feature grouped by a specific variable (e.g. using boxplots) can be performed for DGE and EWAS.

Use case 1: Multi-centric GWAS of CINECA data

To evaluate our new approach to GWAS analysis, we used a public dataset of synthetic genotype data from CINECA (see Data availability statement). This dataset has been split into three different subsets and uploaded to three virtual servers to act as individual study centers. The virtual configuration is illustrated in Fig 3.

thumbnail
Fig 3. Configuration for multi-centric GWAS of CINECA data using OmicSHIELD.

To achieve this configuration, the data from CINECA has been partitioned into three different virtual cohorts, containing 817, 1073 and 614 individuals respectively. Each virtual cohort contains the genotype information of the individuals. The amount of variants present for each individual is 865 thousand.

https://doi.org/10.1371/journal.pcbi.1012626.g003

We compare the results obtained using OmicSHIELD with those obtained by pooling the three datasets into a single dataset and being analysed following a traditional local pipeline. We are interested in assessing associations between SNPs and diabetes, information that is obtained from a variable called `diabetes_diagnosed_doctor`. We adjust for other covariates including sex, age and high-density lipoprotein (HDL) cholesterol. The effect sizes (i.e. beta values) of the top 20 SNPs obtained with OmicSHIELD are compared with the effects obtained from the traditional pipeline. This comparison yielded a mean square error of 5.3 x 10−4 and a bias of -2.6 x 10−3 which is almost negligible in practical terms (bias in the risk of a given SNP is to the order of 1 in 10000). The Manhattan plot in Fig 4 shows that the top hits among the p-values and the general trend of significance levels are accurately replicated using OmicSHIELD. We can see that the noise added by the differential privacy method (ε-privacy: 3) allows trends to be replicated while ensuring that the top hits remain significant. Please refer to the data availability statement section to reproduce the results presented.

thumbnail
Fig 4. Locus zoom plots of the top hit for the original data (left) and pooled fast GWAS (right).

The red line corresponds to a threshold of significance of -log10(P) < 5 x 10−5. Trends and top hits are reproduced on the OmicSHIELD analysis, the major differences being found on the SNPs clearly not relevant.

https://doi.org/10.1371/journal.pcbi.1012626.g004

After demonstrating the efficacy of OmicSHIELD in replicating the significance of p-values and identifying top hits, we further investigate the influence of differential privacy parameters on the usability and interpretability of our results. Differential privacy employs a privacy parameter, epsilon (ε), to balance data utility against privacy. A smaller ε value indicates higher privacy but potentially less accurate results, while a larger ε allows for greater accuracy at the cost of privacy. To quantitatively assess this trade-off, we conducted additional analyses varying ε values at 0.01, 0.05, 0.07, 0.09, 0.1 and 1. These experiments aim to portray how changes in ε affect the overall utility of our federated analysis results, particularly in terms of statistical power and the ability to detect significant associations. The outcomes of these analyses are shown in Fig 5.

thumbnail
Fig 5. Impact of Differential Privacy Parameter (Epsilon) on the Usability of Federated Analysis Results.

This figure presents the analysis of the effect of varying epsilon (ε) values on the detection of significant genetic associations in a federated omic data setting. Epsilon values tested include 0.01 (a), 0.05 (b), 0.07 (c), 0.09 (d), 0.1 (e) and 1 (f).

https://doi.org/10.1371/journal.pcbi.1012626.g005

We observed a notable threshold at ε = 0.09, which emerges as a critical point delineating the trade-off between maintaining privacy and achieving reliable statistical results. Below this ε value, our analysis revealed an increased incidence of false significant hits. Conversely, when ε values are set above this threshold, the stability and reliability of the results significantly improve. This stability indicates that ε = 0.09 represents an optimal balance according to the authors findings. These findings underscore the importance of carefully selecting differential privacy parameters in federated omic data analyses to optimize both privacy protection and data utility.

Use case 2: DGE and EWAS analysis of HELIX data

Here we illustrate how to perform DGE and EWAS of HELIX data (see Data availability statement). The data infrastructure for this project is shown in Fig 6. Transcriptome and epigenome data are stored as ExpressionSet, which is a standard Bioconductor infrastructure to deal with these types of omic data [44]. Note that this Bioconductor object contain phenotypic data (i.e. metadata) encapsulated jointly with the omic data. In both examples, we are interested in comparing gene expression and methylation between males and females focusing only on the autosomes.

thumbnail
Fig 6. Opal Data infrastructure of HELIX project.

The raster data used to generate the map comes from https://www.naturalearthdata.com/.

https://doi.org/10.1371/journal.pcbi.1012626.g006

For the DGE analysis, we analyse microarray data derived using the "Human Transcriptome Array 2.” of Affymetrix. Among the different analyses, we show how to apply meta-analysis with the functions ds.limma() and metaPvalues(). For each microarray probe, this analysis implements multiple generalized linear models separately (one per study) and combines the results using study-specific derived p-values. As a result, we identified a list of 325 probes (mapping 287 genes) differentially expressed between boys and girls of the HELIX project, and which further passed multiple-testing correction filters (P-value threshold = 1.74e-06). As a proof of the ability of OmicSHIELD to be integrated with other R functionalities and Bioconductor packages, we continued the pipeline presenting results of a functional enrichment analysis (FEA). This analysis shows that significant differentially expressed genes participate in processes with evident and previously-described sexual dimorphism such as in the case of “Longevity regulating pathways” (KEGG Pathway:Ia04211).

For the EWAS data we illustrate how to perform DNA methylation differential analysis using microarray data obtained with the “Infinium HumanMethylation450" platform of Illumina. We illustrate how to perform an epigenome-wide meta-analysis with and without adjusting for surrogate variables. We also adjusted our models for confounders including age and ethnicity. Consequently, from the initial list of almost 300k CpGs, we identify a total of 10,417 differential methylated probes between boys and girls from which only three passed the strict Bonferroni multiple-testing correction. Interestingly, two of these three probes, cg12052203 and cg25650246 (mapping the B3GNT1 and RFTN1 respectively), have been previously associated with sex methylation differences (http://www.ewascatalog.org). The FEA showed that significant CpGs map genes participating in processes with strong sex differences such as in the case of bone formation (“Endocrine and other factor-regulated calcium reabsorption”, KEGG PathwI hsa04961). Please refer to the data availability statement section to reproduce the results presented. For the sake of reproducibility, we included all the data on the same server, then different connections to it are performed, this replicates the same situation as having different servers scattered across different countries. The differential privacy settings used on this use case is ε = 3.

Validation and Application of OmicSHIELD using Real-world Methylation Data

The OmicSHIELD infrastructure has been presented as a robust, privacy-protected and open-source analytical platform for federated meta- and mega-omic data analysis in multi-centre studies. To demonstrate the utility and validity of OmicSHIELD, a researcher conducted a real-world case study using methylation data from two independent cohorts of the ATHLETE project, INMA (INfancia y Medio Ambiente) and EDEN (Etude des Déterminants pré et postnatals précoces du développement et de la santé de l’Enfant).

The study aimed to validate OmicSHIELD using INMA methylation data by comparing the EWAS results from both DataSHIELD and local analyses, without implementing differential privacy. This comparison was essential to determine if the results obtained through the OmicSHIELD platform were consistent with those derived from conventional local analysis methods. Upon validation of the consistency and accuracy of the results, the researcher proceeded to use DataSHIELD for the analysis of the EDEN cohort.

The analysis of the INMA methylation data using OmicSHIELD yielded aggregated EWAS results that were in accordance with the results obtained through local analysis, thus demonstrating the reliability and validity of the platform. This validation provided confidence in utilizing OmicSHIELD for further data analysis while maintaining data privacy and adhering to ethical guidelines. Given that this validation was performed under no differential privacy mechanisms, the regression estimates for each model were identical for both approaches (i.e. with and without using OmicSHIELD).

Following the successful validation, the researcher proceeded to analyse the EDEN cohort data exclusively through DataSHIELD, incorporating the results into their publication, which is still being elaborated. The application of OmicSHIELD in this real-world case scenario showcases its potential as a valuable tool for researchers in the field of bioinformatics, allowing for privacy-protected federated data analysis across multiple centres without compromising the integrity of the results.

In conclusion, the validation and application of OmicSHIELD using real-world methylation data from INMA and EDEN cohorts provide strong evidence for its effectiveness and reliability as an open-source analytic platform for privacy-protected data analysis in multi-centre studies. The successful implementation of OmicSHIELD in this real study highlights its potential to significantly advance the field of bioinformatics and facilitate novel discoveries in the omics sciences while mitigating data disclosure risks.

Discussion

We have presented a new software to perform omic analyses using multi-center studies (i.e. federated data) with active disclosure protection during analysis and for the outputs. Such a tool, based on the paradigm of DataSHIELD, provides a great opportunity for researchers to enhance multi-center collaboration by establishing a trustworthy platform that brings the analysis to the data, hence avoiding onerous data-sharing procedures. The solution we present is fully open-source, enabling researchers not only to contribute their own developments, but to personally assess and control the disclosure-prevention and differential-privacy mechanisms. This guarantees the data contributors (e.g. research participants) and custodians (e.g. data controllers) complete transparency on how data is utilized, something that is central to the philosophy of the GDPR but is not achievable with commercially licensed softwares.

OmicSHIELD has implemented state-of-the-art methods in GWAS, transcriptomic and epigenomic data analysis including new methodologies. For example, for GWAS, we have implemented quality control of individual studies before performing pooled- or meta-analysis [37], and non-disclosive pooled principal component analysis (PCA) to adjust for population stratification in pooled GWAS (something not addressed in FAHME [8] or sPLINK [24]). For transcriptomic and epigenomic studies, we have implemented outlier removal and surrogate variable analysis, and differential expression analyses using not only limma and voom but also other existing approaches, including edgeR and DESeq, that could be also used for metagenomics data analyses [43].

OmicSHIELD provides the results one would expect when having all the data combined on a single machine, however, the data are neither shared between study servers nor stored on an intermediate server. Simply stated, data never leaves the study centers where they are hosted and thus remain fully controlled by their data owners. Pooled analysis can be beneficial to improve statistical power. However, this approach is not useful when huge imbalances exist between different datasets. Our software solution includes approaches for researchers to select the best methods taking into consideration how data has been collected or harmonized, different types of study designs, and data heterogeneity. Both implemented methods are privacy-protected using disclosure traps (based on staistical disclosure control methods) and differential privacy. The disclosure traps and differential privacy are configurable at the data source level; the decision of whether to apply differential privacy (and it’s ε-differential privacy) is dictated by the data manager of each site, and it is possible to have different sites using different configurations, thus offering flexibility to multi-center studies where there are different views and regulations on data protection and disclosure risks.

Given the proposed solution relies on differential privacy, it could be discussed that there has to be a system in place that limits the number of queries. That is indeed true, although it is not a mechanism that has to be included into OmicSHIELD, but rather a mechanism provided by the DataSHIELD architecture, given that all the different analysis packages designed for it would benefit. While this is not yet available, but there is active logging of how the users interact with the federated platform, so an active control over the authorized users is possible to take into action.

Currently, there are several International projects that have set up an infrastructure using DataSHIELD to perform FA. These include UnCoVer [19], ATHLETE [21], LifeCycle [20], and InterConnect [45], in addition to national consortia established in Germany (e.g. INTIMIC [46] and MIRACUM [47]) and Sweden [48]. They started by analysing data addressing clinical and epidemiological scientific questions but are now also moving towards including omic data analyses. In this regard, OmicSHIELD will provide a great solution for performing FA in these large consortia and allow the examination, for instance, of the impact of the exposome on the epigenome, discovering new genetic risk factors for persistent COVID or knowing how the exposome impacts human health, among others.

The presented iteration of OmicSHIELD has the potential to reproduce many published papers as well as be the main driver of new investigative projects; nevertheless, there are many ways this software could be expanded in the future. Discussion to determine future directions to be taken will be held with researchers that use our tool for their work; in this way, we can guarantee that future versions of OmicSHIELD will contain functionalities required by real-life projects, ensuring its longevity and quality. Besides having access to new developments of OmicSHIELD, researchers that choose to use the DataSHIELD ecosystem will also benefit from the growing array of available open-source libraries (https://www.datashield.org/help/community-packages), thus enabling them to use a wide variety of tools to perform non-disclosive statistical analyses on their data.

Methods

OmicSHIELD is based upon our recent development, the “resources” architecture which is a new DataSHIELD tool that allows: 1) the use of large data in their original repositories; 2) working with original data formats (e.g. PLINK, VCF, ExpressionSet, RangeSummarizedExperiments); 3) interactions with other programming languages (including shell commands) and software (R, Neuroconductor Bioconductor, Python); and 4) interfacing with “Apache Spark”, a fast and general purpose analytical engine for big data and deep learning [39]. Along this section we describe some of the algorithms, there are two entities mentioned data processors (DPs) and the client. The DPs are the servers where the data is stored and the analysis is performed. The client, typically represented by the researcher, sends instructions to the DPs for data analysis and receives back the results

Differential privacy

Differential privacy can be briefly described as an algorithm that prohibits individual level information from being identified when an output is observed. This concept was introduced by Dwork [49]. Formally, this concept is referred to as ε-differential privacy and is mathematically expressed as:

Where A is a randomizing algorithm that takes a dataset D as input, and ε is a positive real number, called the privacy parameter, that defines the level of privacy (closer to 0 indicates more privacy). S refers to all subsets of the image of A i.e. all possible subsets of the output values of the algorithm A. The inequality refers to datasets D1 and D2 that differ on a single element (i.e., the data of one person). If met, it indicates that after application of the randomizing algorithm, the probability that A(D1) lies in S is less than or equal to the probability that A(D2) is in S multiplied by the constant exp (ε) that gets closer to 1.00 as ε falls to 0.

In order to implement differential privacy into our software, we used the Laplace mechanism. This adds noise to the output of a function, the noise being drawn from a Laplace distribution with mean 0 and scale , where Δf is the l1 norm defined by .

In practical terms, we perform data anonymization through addition of Laplace noise (with zero mean and variance equal to a proportion of the initial variability). We also propose the removal of the noise effect at the analysis stage (to increase the accuracy of the model estimates) through statistical techniques of measurement errors. In the case of GWAS, this method guarantees privacy at variant-level, the disclosure risk could be increased on the event that multiple variants are correlated; given the adversarial model of our software this case is out-of-scope of this research, as DataSHIELD users are required to adhere to established data governance rules for data use. Our software enforces these rules, thereby mitigating further risks.

The assessment of Δf can be particularly complex for certain functions (e.g. limma + voom), hence a resampling method [50] is used to assess Δf. The used resampling method estimates the sensitivity to achieve random differential privacy [51]. This method introduces an additional risk parameter to the differential privacy, γ, which represents the probability that the privacy loss can exceed ε. Following the conditions defined in Theorem 15 of [50] it is used to compute the number of resamples (N) to guarantee differential privacy under this assumption. A lower probability of privacy loss exceeding ε (low γ) will translate to a higher N. The privacy parameters can be configured by the data manager of each study while the default values are set to ε = 3 and γ = 0.1. This method implies that differential privacy is slightly weaker than traditional differential privacy (where sensitivity is assessed analytically) due to the probabilistic nature of its guarantee. However, by choosing a sufficiently small γ, the practical difference in privacy protection becomes negligible for most applications. This resampling method has been used on all the analysis methods that return results to the client. In order to implement the layer of differential privacy to OmicSHIELD, we add the calculated noise to the results that are returned to the client. The resampling algorithm works as follows:

A: Function to which differential privacy is to be applied

D: Complete dataset

D1, D2: Dataset missing a random individual

N: Number of resamples

Step 1: Derive D1, D2 randomly from D

Step 2: Compute A(D1)

Step 3: Compute A(D2)

Step 4: Compute ΔfN = ‖A(D1)−A(D2)‖

Step 5: Redo steps 1–4 N times

Step 6: ΔfN = max(Δf1,2,…,N))

Allele frequency

To compute the allele frequency using horizontally partitioned data, we use the following algorithm, where the multi-layered security approach of OmicSHIELD can be appreciated, precisely on steps 1.2 to 1.4.

DPi: Data processor (Server node)

geno: Encoded genotype data

Step 1: Intermediate allele frequency

       1: Each DPi computes allele_frequency(geno) [MAFi, ni]

       2: Each DPi computes differential privacy Δf

       3: Each DPi applies diff_privacy(MAF_i) [MAF_i]

       4: Each DPi checks MAFfilter (MAFi) [MAFi]

       5: Each DPi returns MAFi to the client

Step 2: Client aggregation

       1: The client receives MAFi and ni

       2: The client sums MAFi and divides by the sum of ni

       3: The client returns the pooled MAF results [final_results]

Block method to calculate pooled PCA

To compute a pooled PCA using horizontally partitioned data, the block method singular value decomposition has been implemented [52] [53]. The pseudo-code of the algorithm is as follows:

DPi: Data processor (Server node)

SVD: Singular value decomposition

geno: Encoded genotype data

Step 1: Intermediate SVD

       1: Each DPi computes SVD(geno) [ui, vi, di]

       2: Each DPi computes ui * di [resi]

       3: Each DPi returns resi to the client

Step 2: Client aggregation

       1: The client receives resi and merges it horizontally [aggregated]

       2: The client computes SVD(aggregated) [final_results]

       3: The client returns the pooled PCA results [final_results]

The algorithm used returns the product of the left singular vectors (ui) and singular values (di) to the client. This guarantees that the original data contained on each node are not reconstructible (i.e. we only share non-disclosive information among servers). Moreover, differential privacy is applied to the results that the client receives following the aforementioned methodology.

This is a general method to compute PCA. We have adapted this approach to allow PCA to be performed on genotype data. To this end, these two extra steps have been implemented:

  • Genotype standardization: , where the genotype is XIj (i is the individual index, j is the SNP index), μj = 2pj is the mean of the 0,1,2-coded genotype (pj being the alternate allele frequency), and σj is the expected standard error under Hardy-Weinberg Equilibrium, that is .
  • Perform PCA using only selected SNPs that have been linked to differentiate ethnic groups [54].

Fast virtually pooled GWAS

The method implemented for fast GWAS is based on research by Sikorska et al[55] using an algorithm adapted to our infrastructure. This method is used to perform GWAS on a pooled setting. Our implementation is described in the following pseudo-code:

DPi: Data processor (Server node)

Step 1: Fit the objective model *

       1: Iteratively each DPi extracts the coefficients of objective model

       2: Client sends to each DPi the coefficient values

       3: Each DPi computes the fitted.values and residuals [FITi, RESi]

Step 2: Using the residuals and genotype information [RESi, GENi]

       1: Each DPi computes RESi–mean (RESi): > YCi

       2: Each DPi computes colSums(YCi * GENi); colSums(GENi); colSums()

and returns to the client [Bi, S1i, S2i]

       3: The client merges [Bi, S1i, S2i] into [B, S1, S2]

       4: The client computes the total number of individuals: > NIND

Step 3: Using [B, S1, S2, NIND, YCi] compute betas and pvalues

       1: The client computes S2−(S12)/NIND: > DEN1

       2: The client computes B/DEN1: > β to obtain the beta values

       3: Each DPi computes colSums() and returns to client: > YC2i

       4: The client merges YC2i: > YC2

       5: The client computes (

       6: The client computes sqrt(σ * (1 / DEN1)): > ERR

       7: The client computes 2 * pnorm(-abs(β/ERR)): > PVAL to obtain the pvalues

* Step 1.1: Model fitted using dsBaseClient::ds.glm, based on the conventional iterative reweighted least squares (IRLS) algorithm [56], a variant of Newton optimization.

The data that is being shared with the client is always a result of an aggregation function (i.e. column sums and different products). For this reason, we added disclosure controls that guarantee that the aggregated data have a minimum number of valid points to prevent data leaks (as returning aggregates of a single individual will leak their data). The sacrifice we make with this approach is a slight imprecision in the results (quantified in the “Results” use case 1 section). This sacrifice is in exchange for computational performance, illustrated in Fig 7. It must be noted that performance is limited by the slowest server, therefore increasing the amount of data providers in the study does not imply increasing computational time as all servers work in parallel. Even though it is not explicitly mentioned on the pseudo-algorithm, differential privacy is also implemented on this algorithm and applied to all the operations that return aggregated results to the client. In this case it is trivial to see that there is no need to perform a sensitivity analysis, given the genetic data is encoded as [0, 1, 2] integer values.

thumbnail
Fig 7. Performance of the fast GWAS function.

The benchmark has been performed using a dataset consisting of 2,504 individuals and increasing SNPs to obtain the different data points. When computing this plot, Differential privacy (DP) was turned off. There are a total of 21 data points with 20 repetitions each one, the results are displayed using a smoothed “loess” line. The trend appears to be linear as more SNPs are added to the analysis, with an approximate rate of computation: 1M SNPs each 90 seconds. The description of the machine used for this benchmark is available on the “Methods” section. The results are subject to great fluctuation depending on the hard drive technology of the machine. Using state-of-the-art solid state drives the performance could be much better.

https://doi.org/10.1371/journal.pcbi.1012626.g007

Meta-level quality control

For researchers interested in performing a meta-level GWAS, it is possible to do this within OmicSHIELD. To accomplish this using established quality control protocols [37], three different plots are made available to assess the quality of the study level data:

  • EAF Plot: Pinpoint harmonization problems.
  • P-Z Plot: Issues with beta estimation. Can pinpoint errors in the pre-processing of the data.
  • SE-N Plot: Can pinpoint issues with trait transformations.

These methods should provide researchers with enough information to assess whether values at the study level can be trusted. Any problems that arise from quality control should be notified to the data owners as OmicSHIELD does not provide tools for overcoming them as each case is likely to require an ad-hoc solution.

Polygenic Risk Scores

The PRS is a statistic that is computed individually, therefore there are no concerns with having data distributed among multiple cohorts as results from individuals have no effect on the other cohorts The PRS is computed as:

Where the genotype is X (i is the individual index, j is the risk SNP index) and βj is the associated risk to the SNP j.

Given that PRS results are individual and not aggregated, they are disclosive. For this reason, the client does not receive any results when performing a PRS, results are instead stored on local study servers. where they can be employed as covariates in subsequent analyses such as GWAS to control for genetic predisposition when identifying novel genetic associations, or utilized within GLM to explore interactions between genetic risk and environmental or lifestyle factors [57].

We can provide information on the SNPs and their beta-values from a given polygenic risk score by using data from the public repository PGS Catalog [40], which contains over 2,000 different literature-curated scores; a harmonization step ensures the risk alleles on the private study servers and the PGS Catalog are aligned.

Baseline computations with GWASTools and SNPRelate

To assess the results yielded by our implementations, we used tools freely available to researchers: GWASTools (1.35.2), GENESIS (2.20.1) and SNPRelate (1.25.2).

Experimental settings

We implemented our solutions on top of R 4.1.1. To evaluate OmicSHIELD, we emulated a network of data repository servers using DSLite 1.2.0. The solution was deployed using a Linux machine with two Intel Xeon Gold 6240 CPUs running at 2.6 GHz with 72 threads running on 36 cores and 256 GB of RAM. The hard drives are SAS drives at 10000 rpm 12Gb/s. Machine resources were equally distributed among the virtual data repositories.

Acknowledgments

We acknowledge Sofia Aguilar on her valuable inputs on performing the methylation studies, dealing with methylation data and her efforts towards assessing the usability of OmicSHIELD on real-world data. We also thank Manuel Huth for his invaluable assistance in discussing differential privacy concepts and providing insightful feedback on the methods described in this paper.

References

  1. 1. Deelen J. et al., “A meta-analysis of genome-wide association studies identifies multiple longevity genes,” Nature Communications 2019 10:1, vol. 10, no. 1, pp. 1–14, Aug. 2019, pmid:31413261
  2. 2. Gaye A. et al., “DataSHIELD: taking the analysis to the data, not the data to the analysis,” Int J Epidemiol, vol. 43, no. 6, pp. 1929–1944, Dec. 2014, pmid:25261970
  3. 3. Sarma K. V. et al., “Federated learning improves site performance in multicenter deep learning without data sharing,” J Am Med Inform Assoc, vol. 28, no. 6, p. 1259, Jun. 2021, pmid:33537772
  4. 4. Sheller M. J. et al., “Federated learning in medicine: facilitating multi-institutional collaborations without sharing patient data,” Sci Rep, vol. 10, no. 1, Dec. 2020, pmid:32724046
  5. 5. Dayan I. et al., “Federated learning for predicting clinical outcomes in patients with COVID-19,” Nat Med, vol. 27, no. 10, pp. 1735–1743, Oct. 2021, pmid:34526699
  6. 6. Evangelou E. and Ioannidis J. P. A., “Meta-analysis methods for genome-wide association studies and beyond,” Nat Rev Genet, vol. 14, no. 6, pp. 379–389, Jun. 2013, pmid:23657481
  7. 7. Toro-Domínguez D., Villatoro-Garciá J. A., Martorell-Marugán J., Román-Montoya Y., Alarcón-Riquelme M. E., and Carmona-Saéz P., “A survey of gene expression meta-analysis: methods and applications,” Brief Bioinform, vol. 22, no. 2, pp. 1694–1705, Mar. 2021, pmid:32095826
  8. 8. Froelicher D. et al., “Truly privacy-preserving federated analytics for precision medicine with multiparty homomorphic encryption,” Nature Communications 2021 12:1, vol. 12, no. 1, pp. 1–10, Oct. 2021, pmid:34635645
  9. 9. Shrier I., Platt R. W., and Steele R. J., “Mega-trials vs. meta-analysis: Precision vs. heterogeneity?,” Contemp Clin Trials, vol. 28, no. 3, pp. 324–328, May 2007, pmid:17188025
  10. 10. Xiang D. and Cai W., “Privacy Protection and Secondary Use of Health Data: Strategies and Methods,” Biomed Res Int, vol. 2021, 2021, pmid:34660798
  11. 11. Dwork C., “Differential privacy,” Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 4052 LNCS, pp. 1–12, 2006,
  12. 12. Mohialden Y. M., Hussien N. M., Salman S. A., and Aljanabi M., “Secure Federated Learning with a Homomorphic Encryption Model,” International Journal Papier Advance and Scientific Review, vol. 4, no. 3, pp. 1–7, Nov. 2023,
  13. 13. Yan Y., Yang G., Gao Y., Zang C., Chen J., and Wang Q., “Multi-Participant Vertical Federated Learning Based Time Series Prediction,” ACM International Conference Proceeding Series, pp. 165–171, Mar. 2022,
  14. 14. He Z., Yu J., Li J., Han Q., Luo G., and Li Y., “Inference Attacks and Controls on Genotypes and Phenotypes for Individual Genomic Data; Inference Attacks and Controls on Genotypes and Phenotypes for Individual Genomic Data,” IEEE/ACM Trans Comput Biol Bioinform, vol. 17, 2020, pmid:29994587
  15. 15. Mateș D. et al., “ORCHESTRA project in Romania-a prospective occupational cohort to study the impact of COVID-19 pandemic on healthcare workers,” vol. 72, no. 1, pp. 54–2021,
  16. 16. Ayoz K., Aysen M., Ayday E., and Ercument Cicek A., “The effect of kinship in re-identification attacks against genomic data sharing beacons,” Bioinformatics, vol. 36, no. Supplement_2, pp. i903–i910, Dec. 2020, pmid:33381836
  17. 17. Ayoz K., Ayday E., and Cicek A. E., “Genome Reconstruction Attacks Against Genomic Data-Sharing Beacons,” Proceedings on Privacy Enhancing Technologies, vol. 2021, no. 3, pp. 28–48, Jul. 2021, pmid:34746296
  18. 18. Hampf C. et al., “A survey on the current status and future perspective of informed consent management in the MIRACUM consortium of the German Medical Informatics Initiative,” Translational Medicine Communications 2021 6:1, vol. 6, no. 1, pp. 1–11, Mar. 2021,
  19. 19. Peñalvo J. L. et al., “Unravelling data for rapid evidence-based response to COVID-19: a summary of the unCoVer protocol,” BMJ Open, vol. 11, no. 11, p. e055630, Nov. 2021, pmid:34794999
  20. 20. Jaddoe V. W. V. et al., “The LifeCycle Project-EU Child Cohort Network: a federated analysis infrastructure and harmonized data of more than 250,000 children and parents,” Eur J Epidemiol, vol. 35, no. 7, pp. 709–724, Jul. 2020, pmid:32705500
  21. 21. Vrijheid a-c M. et al., “Advancing tools for human early lifecourse exposome research and translation (ATHLETE) Project overview,” 2021, pmid:34934888
  22. 22. Vrijheid M. et al., “The human early-life exposome (HELIX): Project rationale and design,” Environ Health Perspect, vol. 122, no. 6, pp. 535–544, 2014, pmid:24610234
  23. 23. Dursi L. J. et al., “CanDIG: Federated network across Canada for multi-omic and health data discovery and analysis,” 2021, pmid:36778585
  24. 24. Nasirigerdeh R. et al., “sPLINK: a hybrid federated tool as a robust alternative to meta-analysis in genome-wide association studies,” Genome Biol, vol. 23, no. 1, pp. 1–24, Dec. 2022,
  25. 25. Zolotareva O. et al., “Flimma: a federated and privacy-aware tool for differential gene expression analysis,” Genome Biol, vol. 22, no. 1, pp. 1–26, Dec. 2021,
  26. 26. Cao H. et al., “dsMTL: a computational framework for privacy-preserving, distributed multi-task machine learning,” Bioinformatics, vol. 38, no. 21, pp. 4919–4926, Oct. 2022, pmid:36073911
  27. 27. Banerjee S. et al., “dsSurvival: Privacy preserving survival models for federated individual patient meta-analysis in DataSHIELD,” BMC Res Notes, vol. 15, no. 1, pp. 1–6, Dec. 2022,
  28. 28. Cummings R. and Desai D., “The Role of Differential Privacy in GDPR Compliance Position Paper”,
  29. 29. Paverd A., Martin A., and Brown I., “Modelling and Automatically Analysing Privacy Properties for Honest-but-Curious Adversaries”, Accessed: May 21, 2024. [Online]. Available: https://www.cs.ox.ac.uk/people/andrew.paverd/casper/
  30. 30. Mironov I., Pandey O., Reingold O., and Vadhan S., “Computational Differential Privacy,” Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 5677 LNCS, pp. 126–142, 2009,
  31. 31. Geng Q. and Viswanath P., “The Optimal Noise-Adding Mechanism in Differential Privacy,” 2014,
  32. 32. Chen J., Wang W. H., and Shi X., “Differential Privacy Protection Against Membership Inference Attack on Machine Learning for Genomic Data,” bioRxiv, p. 2020.08.03.235416, Aug. 2020,
  33. 33. Ye D., Shen S., Zhu T., Liu B., and Zhou W., “One Parameter Defense—Defending against Data Inference Attacks via Differential Privacy,” IEEE Transactions on Information Forensics and Security, vol. 17, pp. 1466–1480, Mar. 2022,
  34. 34. Almadhoun N., Ayday E., and Ulusoy Ö., “Inference attacks against differentially private query results from genomic datasets including dependent tuples,” Bioinformatics, vol. 36, no. Supplement_1, pp. i136–i145, Jul. 2020, pmid:32657411
  35. 35. Ayoz K., Aysen M., Ayday E., and Cicek A. E., “The effect of kinship in re-identification attacks against genomic data sharing beacons”, pmid:33381836
  36. 36. Blettner M., Sauerbrei W., Schlehofer B., Scheuchenpflug T., and Friedenreich C., “Traditional reviews, meta-analyses and pooled analyses in epidemiology.,” Int J Epidemiol, vol. 28, no. 1, pp. 1–9, Feb. 1999, pmid:10195657
  37. 37. Winkler T. W. et al., “Quality control and conduct of genome-wide association meta-analyses,” Nat Protoc, vol. 9, no. 5, pp. 1192–1212, 2014, pmid:24762786
  38. 38. Gogarten S. M. et al., “Genetic association testing using the GENESIS R/Bioconductor package”, pmid:31329242
  39. 39. Marcon Y. et al., “Orchestrating privacy-protected big data analyses of data from different resources with R and DataSHIELD,” PLoS Comput Biol, vol. 17, no. 3, p. e1008880, Mar. 2021, pmid:33784300
  40. 40. Birling M.-C. et al., “The Polygenic Score Catalog as an open database for reproducibility and systematic evaluation,” Nature Genetics 2021 53:4, vol. 53, no. 4, pp. 420–425, Mar. 2021, pmid:33692568
  41. 41. Uffelmann E. et al., “Genome-wide association studies,” Nature Reviews Methods Primers 2021 1:1, vol. 1, no. 1, pp. 1–21, Aug. 2021,
  42. 42. Law C. W., Chen Y., Shi W., and Smyth G. K., “Voom: Precision weights unlock linear model analysis tools for RNA-seq read counts,” Genome Biol, vol. 15, no. 2, pp. 1–17, Feb. 2014, pmid:24485249
  43. 43. Jonsson V., Österlund T., Nerman O., and Kristiansson E., “Statistical evaluation of methods for identification of differentially abundant genes in comparative metagenomics,” BMC Genomics, vol. 17, no. 1, pp. 1–14, Jan. 2016, pmid:26810311
  44. 44. Huber W. et al., “Orchestrating high-throughput genomic analysis with Bioconductor,” 2015, pmid:25633503
  45. 45. Terras J. M. et al., “Fostering the relation and the connectivity between smart homes and grids–InterConnect project,” IET Conference Publications, vol. 2020, no. CP767, pp. 761–764, 2020,
  46. 46. Agamennone V. et al., “HDHL-INTIMIC: A European Knowledge Platform on Food, Diet, Intestinal Microbiomics, and Human Health,” Nutrients 2022, Vol. 14, Page 1881, vol. 14, no. 9, p. 1881, Apr. 2022, pmid:35565847
  47. 47. Prokosch H. U. et al., “MIRACUM: Medical Informatics in Research and Care in University Medicine,” Methods Inf Med, vol. 57, no. S 01, pp. e82–e91, Jul. 2018, pmid:30016814
  48. 48. Sundström J. et al., “Rationale for a Swedish cohort consortium,” https://mc.manuscriptcentral.com/ujms, vol. 124, no. 1, pp. 21–28, Jan. 2019, pmid:30618330
  49. 49. Dwork C. and Roth A., “The Algorithmic Foundations of Differential Privacy,” Foundations and Trends in Theoretical Computer Science, vol. 9, no. 3–4, pp. 211–407, 2014,
  50. 50. Rubinstein B. I. P. and Aldà F., “Pain-Free Random Differential Privacy with Sensitivity Sampling,” 2017.
  51. 51. Hall R., Wasserman L., and Rinaldo A., “Random Differential Privacy,” Journal of Privacy and Confidentiality, vol. 4, no. 2, pp. 43–59, Mar. 2013,
  52. 52. Iwen M. A. and Ong B. W., “A Distributed and Incremental SVD Algorithm for Agglomerative Data Analysis on Large Networks,” SIAM Journal on Matrix Analysis and Applications, vol. 37, no. 4, pp. 1699–1718, 2016,
  53. 53. John E. Kontoghiorghes, “Parallel Algorithms for the Singular Value Decomposition,” pp. 133–180, Dec. 2005,
  54. 54. Huang T., Shu Y., and Cai Y. D., “Genetic differences among ethnic groups,” BMC Genomics, vol. 16, no. 1, pp. 1–10, Dec. 2015,
  55. 55. Sikorska K., Lesaffre E., Groenen P. F. J., and Eilers P. H. C., “GWAS on your notebook: Fast semi-parallel linear and logistic regression for genome-wide association studies,” BMC Bioinformatics, vol. 14, no. 1, pp. 1–11, May 2013, pmid:23711206
  56. 56. Aitkin M., Anderson D., Francis B., and Hinde J., “Statistical Modelling in GLIM 4,” Oxford Statistical Science Series.
  57. 57. Rodrigue A. L. et al., “Specificity of Psychiatric Polygenic Risk Scores and Their Effects on Associated Risk Phenotypes,” Biological psychiatry global open science, vol. 3, no. 3, pp. 519–529, Jul. 2022, pmid:37519455