Assessing the performance of genome-wide association studies for predicting disease risk

To date, more than 3700 genome-wide association studies (GWAS) have been published that look at the genetic contributions of single nucleotide polymorphisms (SNPs) to human conditions or human phenotypes. Through these studies many highly significant SNPs have been identified for hundreds of diseases or medical conditions. However, the extent to which GWAS-identified SNPs or combinations of SNP biomarkers can predict disease risk is not well known. One of the most commonly used approaches to assess the performance of predictive biomarkers is to determine the area under the receiver operating characteristic curve (AUROC). We have developed an R package called G-WIZ to generate ROC curves and calculate the AUROC using summary-level GWAS data. We first tested the performance of G-WIZ by using AUROC values derived from patient-level SNP data, as well as literature-reported AUROC values. We found that G-WIZ predicts the AUROC with <3% error. Next, we used the summary-level GWAS data from GWAS Central to determine the ROC curves and AUROC values for 569 different GWA studies spanning 219 different conditions. Using these data, we found a small number of GWA studies with SNP-derived risk predictors that have very high AUROCs (>0.75). On the other hand, the average GWA study produces a multi-SNP risk predictor with an AUROC of 0.55. Detailed AUROC comparisons indicate that most SNP-derived risk predictions are not as good as clinically based disease risk predictors. All our calculations (ROC curves, AUROCs, explained heritability) are in a publicly accessible database called GWAS-ROCS (http://gwasrocs.ca). The G-WIZ code is freely available for download at https://github.com/jonaspatronjp/GWIZ-Rscript/.


Introduction
A genome-wide association study (GWAS) is a comprehensive genetic analysis of the association between certain observable traits and specific genetic variations in the form of single nucleotide polymorphisms (SNPs). The appeal of genome-wide association (GWA) studies is that they provide a relatively facile approach for detecting potential genetic contributors to common, complex diseases (such as diabetes) or phenotypes (such as body mass index or hair color) using a simple case-control study model. The first GWA study was performed in 2005 [1]. This early work explored the association between certain SNPs and age-related macular degeneration in a study population of 146 individuals (96 cases, 50 controls). To date, thousands of GWA studies looking at almost an equal number of conditions or phenotypes, with study populations as large as 1.3 million, have been published [2]. Many of these GWA studies are now archived in public databases such as GWAS Central [3] and the NHGRI-EBI GWAS Catalog [4].
Public databases such as GWAS Central contain summary-level findings from GWA studies collected on humans [3]. GWAS Central currently houses data from more than 3300 publications corresponding to over 6100 GWA studies, and lists more than 21 million p-values, ranging between $5 \times 10^{-2}$ and $1 \times 10^{-584}$. The p-value for a SNP in a GWA study is the probability of observing an association at least as strong as the one measured if the true odds ratio between the two alleles were one. The typical threshold of significance for most published GWA studies is $p = 5 \times 10^{-8}$. The average odds ratio for a statistically significant SNP is 1.33, with very few SNPs having an odds ratio above 3.0 [5].
[...] Central Collection would have required extensive ethics reviews along with time and resources that were far beyond our means. This necessitated the development of a modeling program (called G-WIZ or Gwas WIZard) that would calculate ROC and AUROC data from summary-level information only. To do so we exploited the fact that almost no SNPs separated by more than 10 kb are in absolute linkage disequilibrium and that the vast majority of reported, disease-significant SNPs are in Hardy-Weinberg equilibrium [16][17][18]. As a result, we assumed the independence of SNPs to create simulated patient populations with specific SNP profiles and assigned health conditions from the publicly available odds ratio (OR) and risk allele frequency (RAF) data. These synthetic populations were designed to be sufficiently large (typically >30,000 individuals) so that statistical anomalies would be averaged out. To assign a single SNP to an individual in the simulated population the following methodology was used: Let $H_s$ denote the number of risk alleles in the healthy group, [...]
Under the assumption that each SNP is independent of the others, we can repeat the above procedure to create a full SNP profile and an assigned health state for each member of the simulated population.
More specifically, a G-WIZ simulation starts by creating a population of individuals assigned as cases or controls in accordance with the selected GWAS Central record. Next, using the risk allele frequency for the controls and the odds ratio between the cases and controls, G-WIZ calculates the risk allele frequency in the cases. Once the risk allele frequencies in both the case and control groups are known, G-WIZ can assign the appropriate SNP profiles to each group. All G-WIZ models were built using all the SNPs reported by each of the respective GWA studies. These SNPs from GWAS Central were reported on the basis of their significance as identified by the original depositors. However, we considered that models created using only a subset of the reported SNPs (feature selection) might still perform better, as this would control for over-parameterization. We tested for this by performing feature selection, using only the SNPs with the lowest p-values; however, no improvements to the models' performance were found. In the end our SNP profiles used every reported SNP. On average, each study had a SNP profile consisting of 6 significant SNPs.
Moreover, the maximum SNP p-value was $9 \times 10^{-6}$ and the minimum SNP p-value was $1 \times 10^{-295}$, indicating that the reported SNPs are all highly significant. Further, each G-WIZ model had on average 34,491 simulated patients (cases and controls). The largest number of SNPs used in any given SNP profile was 50, and the largest synthetic population generated by G-WIZ consisted of 808,380 individuals.
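The population-simulation step described above can be sketched as follows. This is a minimal Python illustration, not the G-WIZ R code: it assumes that the reported OR is an allelic odds ratio, that genotypes follow Hardy-Weinberg proportions within each group, and that SNPs are independent. The function names (`case_raf`, `draw_genotype`, `simulate_population`) and the two example SNPs are hypothetical.

```python
import random

def case_raf(control_raf, odds_ratio):
    """Risk allele frequency in cases implied by the control RAF and an allelic OR."""
    control_odds = control_raf / (1.0 - control_raf)
    case_odds = odds_ratio * control_odds
    return case_odds / (1.0 + case_odds)

def draw_genotype(raf, rng):
    """Number of risk alleles (0, 1 or 2) from two independent allele draws (HWE)."""
    return int(rng.random() < raf) + int(rng.random() < raf)

def simulate_population(snps, n_cases, n_controls, seed=0):
    """snps: list of (control_raf, odds_ratio) pairs.
    Returns a list of (genotype_profile, status) with status 1 = case, 0 = control."""
    rng = random.Random(seed)
    population = []
    for status, size in ((1, n_cases), (0, n_controls)):
        # Cases use the OR-shifted allele frequency; controls use the reported RAF.
        rafs = [case_raf(raf, oratio) if status == 1 else raf for raf, oratio in snps]
        for _ in range(size):
            profile = [draw_genotype(r, rng) for r in rafs]
            population.append((profile, status))
    return population

# Hypothetical two-SNP study (values chosen for illustration only).
snps = [(0.30, 1.33), (0.10, 3.0)]
pop = simulate_population(snps, n_cases=15000, n_controls=15000)
```

Because each SNP is drawn independently, adding a SNP to the profile only adds one more entry to `snps`; any linkage disequilibrium between SNPs is ignored by construction.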

Statistical modeling for ROC curve generation
The creation of simulated populations consisting of full SNP profiles and assigned health states for each of the 569 condition/phenotype studies in GWAS Central allowed us to calculate the corresponding ROC curves and AUROC values. A common modeling method used to generate ROC curves for multi-marker data is logistic regression.
Logistic regression is a statistical method for modeling multiple independent variables (e.g. SNPs) to explain two possible outcomes (e.g. healthy or diseased). Once constructed, a logistic regression model returns a risk score between 0 and 1. A cut-off value can then be chosen (e.g. 0.5); any individual with a risk score above it is classified as 'diseased', and any individual below it is classified as 'healthy'. A plot of the sensitivity against 1-specificity for all possible cut-off values is known as a receiver-operating characteristic (ROC) curve. The classification accuracy of the logistic regression model can then be measured by calculating the area under the ROC curve (AUROC). A perfect model would have an AUROC of 1, while a model with no classification accuracy would have an AUROC of 0.5 [7].
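To make the construction of the ROC curve concrete, the sketch below sweeps the cut-off over a set of risk scores and integrates the resulting curve with the trapezoidal rule. It is a generic Python illustration rather than the G-WIZ implementation; the function names and the six example scores are hypothetical.

```python
def roc_points(scores, labels):
    """(1 - specificity, sensitivity) pairs for every cut-off, strictest first.
    labels: 1 = diseased, 0 = healthy."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Individuals tied at this score cross the cut-off together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points

def auroc(points):
    """Trapezoidal area under the ROC curve."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Hypothetical risk scores for three diseased and three healthy individuals.
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
labels = [1, 1, 0, 1, 0, 0]
example_auroc = auroc(roc_points(scores, labels))
```

In this example 8 of the 9 possible diseased/healthy pairs are ranked correctly, so the area is 8/9.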
We performed logistic regression analysis because it is easy to perform and interpret [19]. Although it is possible that better-performing multi-SNP profiles could have been developed using advanced machine learning algorithms such as neural networks, decision trees, or support vector machines [20][21][22][23][24][25][26], it is also possible to overtrain models with these very powerful pattern-finding tools. Indeed, it is not unusual, during validation studies, to see these models fall short when compared to logistic regression models [20][21][22][23]. These concerns regarding overfitting led us to limit our model complexity and to exclusively use logistic and ridge logistic regression to estimate the classification accuracy of these GWA studies.
Another common issue with regression models containing many explanatory variables is multicollinearity. Multicollinearity increases the variance of parameter estimates, which will affect confidence intervals and hypothesis tests. This can lead to incorrect inferences about relationships between explanatory and response variables [27].
With these issues in mind, we tested for multicollinearity by estimating the variance inflation factor (VIF) prior to building each regression model. The VIF for a given predictor is obtained from a model that regresses that predictor against all the others: it equals $1/(1 - R^2)$, where $R^2$ is the coefficient of determination of that regression, so a predictor that is perfectly explained by the others has an infinite VIF. Multicollinearity was determined to exist when at least two variables showed an inflated coefficient (i.e. when the VIF was infinity). We tried a wide range of other VIF cutoff values (less than infinity); however, the differences in the AUROC estimates were very small (<0.009). Whenever multicollinearity was observed we used ridge logistic regression [28] to generate a biomarker model; otherwise we used a standard logistic regression model. Because standard logistic regression is more easily interpretable than its ridge regression counterpart, we found it appropriate to restrict the use of ridge regression to models with extreme (i.e. divergent) VIF estimates. In total, 566 standard logistic regression models and 3 ridge logistic regression models were constructed for the 569 GWA studies.
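The VIF check can be illustrated with a short, dependency-free Python sketch. It uses the standard definition VIF = 1/(1 - R²), where R² comes from an ordinary least-squares regression of one predictor on the others, and it reports infinity when that fit is exact. The helper names, the tolerance `tol` and the genotype columns are hypothetical, and the sketch assumes the remaining predictors are not themselves perfectly collinear; it is not the G-WIZ code.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def r_squared(y, predictors):
    """R^2 of a least-squares fit of y on the predictors plus an intercept."""
    rows = [[1.0] + [p[i] for p in predictors] for i in range(len(y))]
    k = len(rows[0])
    xtx = [[sum(r[a] * r[b] for r in rows) for b in range(k)] for a in range(k)]
    xty = [sum(r[a] * yi for r, yi in zip(rows, y)) for a in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(bj * rj for bj, rj in zip(beta, r)) for r in rows]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

def vif(columns, j, tol=1e-12):
    """Variance inflation factor of predictor j; infinite when the fit is exact."""
    others = [c for i, c in enumerate(columns) if i != j]
    r2 = r_squared(columns[j], others)
    return float("inf") if 1.0 - r2 < tol else 1.0 / (1.0 - r2)

# Hypothetical risk-allele counts for eight individuals.
snp_a = [0, 1, 2, 1, 0, 2, 1, 0]
snp_b = [1, 0, 1, 2, 0, 1, 2, 1]
snp_a_copy = list(snp_a)  # a SNP in perfect linkage disequilibrium with snp_a

vif_independent = vif([snp_a, snp_b], 0)     # close to 1: little inflation
vif_collinear = vif([snp_a, snp_a_copy], 0)  # infinite: perfect collinearity
```

A divergent value such as `vif_collinear` is the situation in which the text switches from standard to ridge logistic regression.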
To assess the performance of each biomarker or ROC-generative model, the simulated data was randomly split into training and testing sets. In the training set, nested cross-validation (outer 3-fold and inner 2-fold) was used to obtain an estimate of the classification accuracy [28]. Once the model was properly tuned, it was validated using the testing set.
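The nested cross-validation layout (outer 3-fold, inner 2-fold) can be sketched as an index-splitting routine. This Python sketch only produces the splits; model fitting and tuning are left out, and the function names and the sample size of 12 are hypothetical. It does not reproduce the resampling code used by G-WIZ.

```python
import random

def k_fold_indices(indices, k, rng):
    """Shuffle the indices and deal them into k folds of near-equal size."""
    shuffled = indices[:]
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]

def nested_cv_splits(n_samples, outer_k=3, inner_k=2, seed=0):
    """Yield (outer_train, outer_test, inner_splits) for each outer fold.
    inner_splits is a list of (inner_train, inner_validation) pairs drawn
    only from outer_train, so tuning never sees the outer test fold."""
    rng = random.Random(seed)
    outer_folds = k_fold_indices(list(range(n_samples)), outer_k, rng)
    for i, outer_test in enumerate(outer_folds):
        outer_train = [idx for j, fold in enumerate(outer_folds) if j != i for idx in fold]
        inner_folds = k_fold_indices(outer_train, inner_k, rng)
        inner_splits = []
        for m, inner_val in enumerate(inner_folds):
            inner_train = [idx for q, fold in enumerate(inner_folds) if q != m for idx in fold]
            inner_splits.append((inner_train, inner_val))
        yield outer_train, outer_test, inner_splits

# Hypothetical training set of 12 simulated individuals.
splits = list(nested_cv_splits(12))
```

The inner loop would be used to choose any tuning parameter (for example the ridge penalty), and the outer loop to estimate classification accuracy; the held-out testing set is kept apart from both.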

The G-WIZ program
G-WIZ is written in the R programming language [29]. It consists of several modules including a custom-written tool to generate patient populations, the MLR package [30]

SNP-derived heritability calculations
For each study in GWAS Central we also estimated the total variance in disease liability explained (often referred to as the SNP heritability) using the following formula described by Pawitan et al. [32].
Here, $n$ is the number of SNPs in a particular study, and $\mathrm{RAF}_k$ and $\mathrm{OR}_k$ are the risk allele frequency and odds ratio of the $k$th SNP. We created an in-house R script to run this formula on all 569 GWA studies collected from GWAS Central. The results are shown in S1 Table.

Validation
To further ensure that our modeling methods and assumptions were correct, we validated our predictions in two different ways. In the first approach, we used real patient [...] GWAS Cohort. The SNP profiles and risk alleles we used are reported in Table 1, and are the same SNPs and alleles reported as being statistically significant by the WTCCC researchers [15]. For both control datasets we then applied the G-WIZ modeling method to the same set of SNPs and generated a synthetic population with SNP profiles by directly calculating the RAFs and number of cases and controls for each disease study, and by estimating the ORs from the logistic regression coefficients. The true AUROCs were then compared to our G-WIZ calculated AUROCs. Additionally, the shapes of the ROC curves were compared using DeLong's test [33].
Similarly, we observed that the size of the sample population can also lead to differences in AUROC values (of ±0.04), with smaller populations or unbalanced numbers of cases and controls leading to larger differences. [...] The GWA study "Shingles" [57,58] had the lowest predicted AUROC, at 0.50.

Analysis of GWAS Central studies
The logistic regression models that we built used on average 6 SNPs. The largest number of SNPs used in a single model was 50. Six studies were modeled using 50 SNPs and 165 studies (29%) were modeled using only a single SNP (see S3 Table).

GWAS vs. non-GWAS risk prediction performance
One of the motivating factors behind this study was to compare the performance of GWAS-derived or SNP-derived biomarker profiles for disease prediction with other predictive biomarker profiles derived from clinical, metabolomic and/or proteomic (i.e. non-GWAS) data. These data are presented in Table 4.

Calculating SNP heritability using the AUROC
While comparing the AUROC values and heritability estimates plotted in the GWAS-ROCS website, an interesting trend was noted. In particular, the heritability seemed to be well correlated with the square of the AUROC (r = 0.87, see Fig 2). This led to a more detailed investigation regarding the potential rationale for this observation. Upon further reading we found that $\mathrm{AUROC} = (D + 1)/2$, where $D$ is the Somers' rank correlation [84] between risk profile and disease status (1 = diseased, 0 = not diseased). Note that the squared Somers' D rank correlation is in fact the proportion of explained variance [85], and that the definition of heritability is precisely the proportion of explained variance. Thus, in the context of a SNP-only model trying to predict disease status, the squared Somers' D rank correlation is, in fact, the SNP heritability. Rearranging for $D$ and then squaring in the formula above, we find that the SNP heritability is $h^2 = D^2 = (2 \cdot \mathrm{AUROC} - 1)^2$. This result highlights the utility of AUROC calculations not only for assessing the predictive performance of a multi-SNP panel but also for easily and rapidly calculating the heritability of such a SNP panel.
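The conversion between AUROC and SNP heritability is simple enough to express directly. The sketch below assumes the relation AUROC = (D + 1)/2 between the AUROC and Somers' D, so that the heritability is the square of D = 2·AUROC - 1; the function names are hypothetical and the code is a Python illustration rather than part of G-WIZ.

```python
def somers_d(auroc):
    """Somers' D rank correlation implied by an AUROC, from AUROC = (D + 1) / 2."""
    return 2.0 * auroc - 1.0

def snp_heritability_from_auroc(auroc):
    """Proportion of variance explained: the square of Somers' D."""
    return somers_d(auroc) ** 2

def auroc_for_heritability(h2):
    """Inverse relation: the AUROC (at or above 0.5) implied by a given heritability."""
    return (h2 ** 0.5 + 1.0) / 2.0

h2_at_random = snp_heritability_from_auroc(0.50)   # no discrimination
h2_at_075 = snp_heritability_from_auroc(0.75)      # 0.25
auroc_for_half = auroc_for_heritability(0.50)      # about 0.85
```

Under this relation an AUROC of 0.75 corresponds to a heritability of 25%, and roughly 0.85 is needed to reach 50%.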

Comparison with competing methods
As noted earlier, several other methods have been described for estimating AUROC values [8][9][10][11][12][13] or generating ROC curves [11,12] from summary-level GWAS data. The methods are referred to in this paper by the name of the first author. These include the methods by Lu [13], Moonesinghe [9] and Gail [10], which analytically determine AUROC estimates (but not ROC curves), Pepe [12] and Janssens [11] [...]
A screenshot montage illustrating the contents and design of the GWAS-ROCS database is shown in Fig 4. As can be seen from this figure, the GWAS-ROCS website has a simple webpage layout (Fig 4a). There are four tabs at the top of the page: "Browse Study Simulations", "Downloads", "About" and "Contact Us". Clicking on the "Browse Study Simulations" tab allows users to view a scrollable series of images where they can easily browse through the 569 simulated GWA studies produced in this paper (Fig 4b).
Users can sort the GWA studies according to their ID number, the condition/phenotype and the AUROC value. Clicking on a study sends users to a webpage with more information about that specific study (Fig 4c). This includes information such as hyperlinks to the reference GWAS Central study and the original PubMed publication, the number of control and case subjects, the SNP accession IDs, the ORs and RAFs, and the simulated ROC plots, all of which can be found on this page. Additionally, a downloadable *.csv file with the simulated population for that specific GWA study can be found on this page too. The "Downloads" tab gives users a way to quickly and efficiently download the *.csv files with simulated populations and ROC plots, for every single study in GWAS-ROCS (Fig 4d). The "About" tab contains some documentation to help users navigate the site. And finally, the "Contact Us" tab gives users an easy way to contact the GWAS-ROCS team with any questions or concerns they may have.

Discussion
GWA studies have contributed significantly to our understanding of the genetic contributions to disease and disease risk. Hundreds of novel genes have been identified and implicated in various traits or conditions and many of these have led to new biological understandings and insights [86]. With continued improvements to GWA study designs (increased sample sizes, better population selection to remove confounders, more narrowly defined phenotypes) and GWAS analysis it is likely that many more important biological or genetic insights will be gained [86]. While the value of GWA studies is indisputable, there are still lingering concerns over the inability of SNPs to explain as much of the heritable variation as originally hoped (the missing heritability problem [87,88]) or as much of the disease risk as expected [88].
An assessment of genome wide association studies (Patron, Wishart, et al.)
As remarked earlier, GWA studies that explore disease risk do not often adopt the convention used by most other multi-marker risk predictors to assess performance. In particular, the use of multi-component SNP models and the evaluation of ROC curves or AUROC values (C-statistics) is quite rare. Of the >3700 GWAS publications we evaluated, only 112 have published ROC curves or AUROC data. Of these, fewer than 30 provided sufficient data to independently validate their reported ROC or AUROC results.
This has made it difficult to compare the performance of SNP-derived or GWAS-derived biomarkers in disease risk prediction with other types of disease-risk prediction biomarkers or models (clinical, metabolomic, proteomic, etc.). Furthermore, the difference in reporting methodologies between GWA studies (with an emphasis on p-values and odds ratios for individual SNPs) and non-GWA studies (with an emphasis on ROC curves and AUROCs calculated for multiple markers) has also led to an expectation by many non-GWAS specialists, or those with limited statistical training, that the predictive performance of GWAS-derived biomarkers should be much better than that of non-GWAS-derived biomarkers.
Because of this "cultural" difference we undertook this study to help standardize biomarker reporting between GWAS-derived and non-GWAS-derived biomarker profiles.
In particular, we used logistic regression modeling and simulated patient data to generate a comprehensive and publicly available database of GWAS ROC curves, AUROCs and SNP-heritability scores for a large number of conditions (219) and a large number of GWAS studies (569). These data were placed into an open-access database called GWAS-ROCS, which is publicly available at http://gwasrocs.ca.
In creating the GWAS-ROCS database we hoped to accomplish several objectives.
First, we wanted to compile and consolidate an accurate and comprehensive set of SNP-derived AUROCs into a single, open-access site. Second, we wanted to use this consolidated data to systematically analyze interesting trends or features in GWAS AUROC data. One of the trends we wanted to explore in more detail concerned the performance of SNP biomarker panels in disease or phenotype prediction. Our results indicate that the average AUROC for a typical GWAS-derived biomarker profile is low, just 0.55 with a standard deviation of 0.05. This is significantly lower than what we expected given that (the few) published AUROCs typically fall in the range 0.62-0.67 (see S2 Table, [88]). The fact that published GWAS AUROCs tend to be high (~0.65) and unpublished GWAS AUROCs tend to be low (~0.55) suggests that one reason for the paucity of published GWAS AUROCs is that many AUROCs for SNP biomarker profiles are either uninterestingly low (<0.55) or not statistically different from those generated by a random predictor.
Another aspect that we wanted to explore in more detail was the performance of GWAS-derived SNP profiles for disease prediction compared to non-GWAS profiles for predicting identical diseases. As noted previously, the fact that GWAS disease risk assessments are not typically presented or measured in the same way as non-GWAS disease risk assessments has made this kind of comparison difficult [89]. As seen in Table 4, we found that non-genetic factors were generally better at predicting disease than genetic factors. In particular, for those conditions where GWAS-derived AUROCs and non-GWAS (clinical/proteomic/metabolomic) derived AUROCs could be compared, we found that a typical non-GWA study reported AUROCs closer to 0.81, which is significantly higher than the average GWAS-derived biomarker profile of 0.64 (see Table 4). On the other hand, it is important to note that disease risk scores from SNP-derived biomarker profiles can effectively be obtained at birth (many decades prior to the onset of disease), while non-SNP-derived biomarkers are generally only useful a few months or at most a few years prior to the onset of the disease. In this regard, the utility of SNP-biomarker profiles for long-term disease prevention or disease prophylaxis, even if modest compared to non-SNP profiles, is still quite significant.
A third objective of this study was to identify those conditions that appear to exhibit the best AUROC performance with multi-component SNP data. These high AUROC conditions would be expected to have a relatively high "explainable" genetic component with regard to disease risk. From the data compiled in GWAS Central we identified 5 conditions/phenotypes that had an AUROC greater than 0.75 and an estimated heritability of >25%. As can be seen in S4 Table, [...] It is interesting to note that the very first SNP study ever recorded was one done on macular degeneration [1] and that macular degeneration has among the highest levels of heritability and among the highest AUROC values of all conditions we investigated. In many respects, macular degeneration was the equivalent of hitting the "mother lode" for GWA studies.
A fourth objective of this study was to identify those conditions where SNP information appears to be relatively uninformative with regard to disease risk prediction. [...]
A sixth objective of this study was to explore whether certain trends in AUROCs, disease types or heritability estimates could be discerned by analyzing a large AUROC data set. As noted earlier, we found that the SNP-heritability, as determined by Pawitan et al. [32], seemed to be well correlated with the square of the AUROC (r = 0.87, see Fig 2).
This led to the discovery that $h^2 = (2 \cdot \mathrm{AUROC} - 1)^2$, where $h^2$ is the SNP-heritability. We used this newly derived formula to estimate the SNP-heritability for all 569 studies from GWAS Central. These heritability estimates can be found at http://gwasrocs.ca. Moreover, we compared our heritability estimates against the heritability values for 10 different conditions reported in the literature. Table 5 shows that on average our estimates differed from the true values by just 0.013. The formula we derived suggests that an AUROC of approximately 0.85 is needed to explain 50% of the heritability.
[...] of all SNPs in our GWAS Central dataset. This corresponds to a SNP located close to the well-known Alzheimer's disease-associated gene ApoE4 [99]. Using this single SNP alone it was possible to create an Alzheimer's disease risk predictor with an AUROC of 0.62. The addition of 3 more SNPs with p-values of $2 \times 10^{-10}$, $4 \times 10^{-8}$ and $1 \times 10^{-7}$ led to an increase in the Alzheimer's disease risk prediction AUROC to 0.65 [100]. So, despite the extremely low p-values for these Alzheimer's disease SNPs, the influence on the AUROC (and the heritability) was relatively modest. In another interesting example, the GWA study "Coronary Artery Disease" [101,102] had 50 SNPs, and a p-value of a staggering $1 \times 10^{-101}$ for the most significant SNP. However, even with so many SNPs and the inclusion of SNPs with remarkably low p-values, the AUROC of this SNP panel reached just 0.58. Overall, our results indicate that it is not the p-value, but rather the odds ratio (OR), in conjunction with the risk allele frequency (RAF), that is most important for determining biomarker performance in disease risk prediction.
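The dependence of predictive performance on the OR and RAF, rather than on the p-value, can be illustrated for a single SNP without any simulation. The Python sketch below assumes an allelic odds ratio of at least one, Hardy-Weinberg genotype proportions in cases and controls, and a risk score that increases with the number of risk alleles; the AUROC is then the probability that a random case carries more risk alleles than a random control, counting ties as one half. The function names and input values are hypothetical, and this is not the calculation performed by G-WIZ.

```python
def case_raf(control_raf, odds_ratio):
    """Risk allele frequency in cases implied by the control RAF and an allelic OR."""
    case_odds = odds_ratio * control_raf / (1.0 - control_raf)
    return case_odds / (1.0 + case_odds)

def hwe_genotype_probs(raf):
    """Probabilities of carrying 0, 1 or 2 risk alleles under Hardy-Weinberg equilibrium."""
    return [(1.0 - raf) ** 2, 2.0 * raf * (1.0 - raf), raf ** 2]

def single_snp_auroc(control_raf, odds_ratio):
    """P(case count > control count) + 0.5 * P(case count == control count)."""
    controls = hwe_genotype_probs(control_raf)
    cases = hwe_genotype_probs(case_raf(control_raf, odds_ratio))
    area = 0.0
    for g_case, p_case in enumerate(cases):
        for g_ctrl, p_ctrl in enumerate(controls):
            if g_case > g_ctrl:
                area += p_case * p_ctrl
            elif g_case == g_ctrl:
                area += 0.5 * p_case * p_ctrl
    return area

# Hypothetical SNPs: a modest effect, a large effect, and a large effect on a rare allele.
modest = single_snp_auroc(0.30, 1.33)
strong = single_snp_auroc(0.30, 3.0)
strong_but_rare = single_snp_auroc(0.01, 3.0)
```

No p-value or sample size appears anywhere in this calculation: a very large study can push the p-value of the `modest` SNP arbitrarily low without changing its AUROC, while a rare risk allele limits the AUROC even when the OR is large.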
One criticism of our approach to calculating AUROCs from GWAS data is that it is computationally inefficient. In particular, we constructed large, simulated patient populations, and then used those simulated patient/SNP populations to estimate the AUROCs. A more efficient approach would have been to use machine learning methods or statistical techniques [9,10] to predict the AUROC values directly. There are two reasons why we chose the population simulation approach. First, we believed it would be more useful for the scientific community to have access to simulated patient populations (with SNP data). This would allow others to perform their own statistical or modeling experiments. Furthermore, simulated SNP data can be used to create synthetic "patients" for electronic medical record (EMR) testing and training. Indeed, because of ethics and privacy restrictions, access to individual-level GWAS data is often difficult, making generation of realistic genetic data for patients equally difficult. On the other hand, simulated individual-level SNP data (and other 'omics' data) could be of great utility in the development and testing of "next-generation" EMR software and databases with realistic genetic data. As a result, G-WIZ was created as part of a separate EMR project to generate realistic "synthetic" patients with realistic conditions/phenotypes and correspondingly realistic genomic (SNP), metabolomic, proteomic and clinical profiles.
In addition to the appeal of creating synthetic patient data, we also realized that by creating simulated populations we would be able to determine and plot ROC curves (with which we could determine the AUROC values). Having a calculated ROC curve would give us another set of data with which to compare and validate our results. Indeed, we used the G-WIZ generated ROC curves to visually validate a number of the early ROC results during the testing phase of the program.

Conclusion
To summarize, we have created a software tool called G-WIZ to accurately predict GWAS ROC curves and AUROCs from summary level GWAS data. We subsequently compiled data from every sufficiently informative large-scale study in GWAS Central and calculated the corresponding ROC curves and AUROCs using G-WIZ. Using these calculated data, we conducted a number of comparisons to look for interesting results or unexpected trends. In particular, we compared these calculated GWAS AUROCs to typical AUROCs reported in other 'omics' studies and found some striking differences.
We also derived a novel formula to calculate SNP-heritability and calculated the proportion of heritability explained by SNPs for all 569 GWAS Central studies that we analyzed. Through this assessment we were able to make some general suggestions regarding the evaluation and selection of medical conditions that should hopefully yield more significant and useful GWAS outcomes. The results of our G-WIZ calculations, along with other meta-data about each GWA study and the predicted heritability have been placed in an open-access database called GWAS-ROCS.