Conceived and designed the experiments: BKP EH JW MC DS. Performed the experiments: BKP EH JW DJT. Analyzed the data: BKP EH JW DJT ES HT. Contributed reagents/materials/analysis tools: JW DJT ES HT. Wrote the paper: BKP EH DJT ES HT MC DS.

Current address: Section on Ecology and Evolution and Genome Center, University of California Davis, Davis, California, United States of America

Current address: Blavatnik School of Computer Science, Department of Molecular Microbiology and Biotechnology, Tel-Aviv University, Tel-Aviv, Israel

Current address: International Computer Science Institute, Berkeley, California, United States of America

Current address: School of Medicine, Indiana University, Indianapolis, Indiana, United States of America

Current address: Life Technologies, Foster City, California, United States of America

Current address: Locus Development, San Francisco, California, United States of America

All the authors were employed at Navigenics Inc when this study was carried out. No other company mentioned in the author affiliations was involved in the study. This is a primary research article and the data used in this study (WTCCC data) was not generated by the company and is available to all qualified researchers. This does not alter the authors' adherence to all the PLoS ONE policies on sharing data and materials.

The prevalence of common chronic non-communicable diseases (CNCDs) far overshadows the prevalence of both monogenic and infectious diseases combined. All CNCDs, also called complex genetic diseases, have a heritable genetic component that can be used for pre-symptomatic risk assessment. Common single nucleotide polymorphisms (SNPs) that tag risk haplotypes across the genome currently account for a non-trivial portion of the germ-line genetic risk and we will likely continue to identify the remaining missing heritability in the form of rare variants, copy number variants and epigenetic modifications. Here, we describe a novel measure for calculating the lifetime risk of a disease, called the genetic composite index (GCI), and demonstrate its predictive value as a clinical classifier. The GCI only considers summary statistics of the effects of genetic variation and hence does not require the results of large-scale studies simultaneously assessing multiple risk factors. Combining GCI scores with environmental risk information provides an additional tool for clinical decision-making. The GCI can be populated with heritable risk information of any type, and thus represents a framework for CNCD pre-symptomatic risk assessment that can be populated as additional risk information is identified through next-generation technologies.

Common chronic non-communicable diseases (CNCDs) are caused by a combination of genetic and environmental risk factors. These diseases account for the majority of disease burden, and the majority of health care cost, globally. Pre-symptomatic risk assessment of an individual for CNCDs, and personalized management to extend the healthy lifespan and reduce costs, is increasingly a global priority.

Recent advances in genotyping technology have greatly improved our understanding of the genetic risk factors that contribute to such diseases. In particular, whole-genome association studies have uncovered many common variants that increase an individual's risk of developing a disease during his/her lifetime. Since disease prevention will be the most effective means to ensure a healthier population in the coming decades, it is necessary to understand how to integrate inherited genetic risk information into our clinical decision-making process early in life so that we can minimize the chance of developing disease in the future. Low effect size common SNP variants, rare and private variants, DNA copy number variants and epigenetic modifications are together believed to account for most of the inherited risk. When we can fully articulate the relative contribution of each of these elements to any specific disease, and the effects of their interactions with one another, our predictive accuracy will peak.

Accurately estimating an individual's risk of developing a CNCD is a challenging task. To begin, the risk is determined by many factors, including the genetic risk factor load, environmental factors, gender, age, etc., and not all contributing factors are known. It is therefore clear that for most conditions the best risk assessments can only provide a probabilistic estimate. In order to accurately estimate the risk of an individual, one has to take into account the different associated variants, their effect sizes, their frequency in the population, and the factors affecting the individual, such as diet, age, family history and ethnic background, as well as their interactions. Large-scale studies that investigate all of these factors at once are prohibitively expensive to conduct, and to our knowledge, none have been conducted.

Here, we study the performance of risk estimates based on the genetic composition of an individual alone, keeping all other factors fixed. Several approaches for risk estimation based on genetics alone have been proposed in the past.

Similarly to previous approaches, we rely on several assumptions, chief among them the independence of the disease-associated loci. We use simulated data as well as real data to assess the performance of the risk estimates under different conditions. Importantly, we find that the assumption of independence does not greatly affect the generality of our method, and modest SNP-SNP interactions in simulated data do not seem to significantly affect its predictive performance.

In order to measure the quality and effectiveness of GCI and similar methods, it is important to understand their limitations and merits. For example,

We use the Wellcome Trust Case Control Consortium (WTCCC) data

Disease | dbSNP rs id | Relative risk | Relative risk | Frequency | Frequency
Type 2 Diabetes | rs10012946 | 1.1464 | 1.0239 | 0.5000 | 0.4667
Type 2 Diabetes | rs10811661 | 1.3008 | 1.1282 | 0.6667 | 0.2500
Type 2 Diabetes | rs1801282 | 1.4128 | 1.2417 | 0.8667 | 0.1167
Type 2 Diabetes | rs4402960 | 1.1602 | 1.1233 | 0.1167 | 0.3500
Type 2 Diabetes | rs4506565 | 1.6133 | 1.2738 | 0.0847 | 0.3729
Type 2 Diabetes | rs5215 | 1.1681 | 1.0935 | 0.1000 | 0.6167
Type 2 Diabetes | rs8050136 | 1.3609 | 1.1176 | 0.1167 | 0.6667
Type 2 Diabetes | rs9494266 | 1.4909 | 1.2296 | 0.0169 | 0.0847
Type 2 Diabetes | rs10923931 | 1.1948 | 1.0947 | 0.0167 | 0.2000
Type 2 Diabetes | rs4607103 | 1.1392 | 1.0681 | 0.6333 | 0.3500
Type 2 Diabetes | rs7961581 | 1.1355 | 1.0664 | 0.0500 | 0.3667
Type 2 Diabetes | rs864745 | 1.1530 | 1.0747 | 0.3158 | 0.4035
Type 2 Diabetes | rs5015480 | 1.1456 | 1.0451 | 0.3167 | 0.4833
Crohn's Disease | rs10883365 | 1.6154 | 1.1989 | 0.3000 | 0.4000
Crohn's Disease | rs2066845 | 11.4381 | 3.0164 | 0.0000 | 0.0333
Crohn's Disease | rs10489276 | 1.4130 | 1.1888 | 0.0333 | 0.3667
Crohn's Disease | rs1894603 | 1.4608 | 1.2088 | 0.2542 | 0.4407
Crohn's Disease | rs4871611 | 1.1654 | 1.0795 | 0.3667 | 0.5000
Crohn's Disease | rs6679677 | 1.7116 | 1.3085 | 0.7167 | 0.2833
Crohn's Disease | rs17234657 | 2.3052 | 1.5360 | 0.0667 | 0.2000
Crohn's Disease | rs11175593 | 2.3532 | 1.5353 | 0.0000 | 0.0333
Crohn's Disease | rs11584383 | 1.3899 | 1.1790 | 0.4333 | 0.4500
Crohn's Disease | rs1456893 | 1.4371 | 1.1989 | 0.3667 | 0.5333
Crohn's Disease | rs1736135 | 1.3898 | 1.1790 | 0.3000 | 0.5000
Crohn's Disease | rs17582416 | 1.3432 | 1.1590 | 0.1667 | 0.4333
Crohn's Disease | rs2872507 | 1.2527 | 1.1193 | 0.2167 | 0.5000
Crohn's Disease | rs3764147 | 1.5580 | 1.2484 | 0.0847 | 0.3220
Crohn's Disease | rs4263839 | 1.4852 | 1.2188 | 0.4167 | 0.4667
Crohn's Disease | rs744166 | 1.3898 | 1.1790 | 0.3276 | 0.4483
Crohn's Disease | rs762421 | 1.2751 | 1.1292 | 0.2500 | 0.4833
Crohn's Disease | rs10210302 | 1.8433 | 1.1890 | 0.3000 | 0.5000
Crohn's Disease | rs7746082 | 1.3663 | 1.1690 | 0.1017 | 0.4915
Crohn's Disease | rs7927894 | 1.3432 | 1.1591 | 0.2333 | 0.3833
Crohn's Disease | rs9858542 | 1.8316 | 1.0895 | 0.0333 | 0.4167
Crohn's Disease | rs11805303 | 1.8525 | 1.3875 | 0.1000 | 0.3833
Crohn's Disease | rs1000113 | 1.9102 | 1.5354 | 0.0000 | 0.0667
Crohn's Disease | rs2066844 | 3.2543 | 1.9609 | 0.0000 | 0.2203
Crohn's Disease | rs17221417 | 1.9118 | 1.2883 | 0.1000 | 0.5167
Crohn's Disease | rs2542151 | 1.9997 | 1.2980 | 0.0500 | 0.2833
Crohn's Disease | rs10761659 | 1.5461 | 1.2287 | 0.2333 | 0.6333
Rheumatoid Arthritis | rs10118357 | 1.7278 | 1.3152 | 0.2712 | 0.5254
Rheumatoid Arthritis | rs13207033 | 1.7559 | 1.3258 | 0.6667 | 0.3167
Rheumatoid Arthritis | rs6457617 | 5.0847 | 2.3414 | 0.2167 | 0.5667
Rheumatoid Arthritis | rs6679677 | 3.1672 | 1.6847 | 0.0000 | 0.2833
Rheumatoid Arthritis | rs6920220 | 1.7023 | 1.1965 | 0.0000 | 0.3500

As noted before, we use Receiver Operating Characteristic (ROC) curve analysis

One of the assumptions made by the GCI framework is that the disease-associated SNPs are independent. This assumption is useful since the score can then be calculated just from summary data; furthermore, when interactions are modeled based on limited data, there is a risk of over-fitting. Nevertheless, in an attempt to quantify how much information might be lost by the independence assumption, we compared our method with a model that accounts for both SNP-SNP interactions and the marginal contribution of each SNP. Particularly, we used logistic regression to account for the interactions. If the SNPs are s_{1}, s_{2}…s_{n}, then the model assumes that the logit transformation of the binary outcome reflecting disease or non-disease status is β_{0} + Σ_{i} β_{i}s_{i} + Σ_{i<j} β_{ij}s_{i}s_{j}, where β_{i} is the marginal contribution of s_{i} and β_{ij} is the interaction between s_{i} and s_{j}. We first trained the model using the WTCCC data and then generated a ROC curve based on its probability estimates. Since this model takes into account the pair-wise interactions between SNPs, it should be at least as accurate as the GCI score, which does not consider them. Note that the logistic regression model is an optimistic upper bound on the GCI since it can easily over-fit the model to the data; therefore, we are being conservative in our estimation of the information lost under the independence assumption.
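As a minimal sketch of the interaction model described above, the following shows how the predictor of a logistic regression with all pairwise SNP-SNP terms could be assembled. Coding each genotype as a risk-allele count (0, 1 or 2), the function names, and the ordering of the coefficients are assumptions made for illustration; the fitting procedure actually used on the WTCCC data is not shown here.

```python
from itertools import combinations
from math import exp


def interaction_features(genotypes):
    """Return main effects followed by all pairwise products s_i * s_j (i < j)."""
    main = list(genotypes)
    pairs = [genotypes[i] * genotypes[j]
             for i, j in combinations(range(len(genotypes)), 2)]
    return main + pairs


def disease_probability(genotypes, intercept, coefficients):
    """Inverse logit of the linear predictor with pairwise interaction terms.

    `coefficients` holds one value per main effect, then one per SNP pair,
    in the same order as `interaction_features` (an illustrative convention).
    """
    features = interaction_features(genotypes)
    if len(features) != len(coefficients):
        raise ValueError("need one coefficient per SNP and per SNP pair")
    logit = intercept + sum(b * x for b, x in zip(coefficients, features))
    return 1.0 / (1.0 + exp(-logit))
```

With n SNPs the model has n + n(n-1)/2 coefficients besides the intercept, which is why it can over-fit when the training set is small.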

Disease | Heritability | Average Lifetime Risk | Optimal Scenario | GCI score | Logistic Regression
Type 2 Diabetes | 64% | 25.0% | 0.894 | 0.613 | 0.644
Crohn's Disease | 80% | 0.56% | 0.992 | 0.689 | 0.757
Rheumatoid Arthritis | 53% | 1.54% | 0.944 | 0.675 | 0.689

The number of SNPs used in our analysis reflects the current knowledge about the effect of common SNPs on the risk of a disease. These SNPs, however, do not capture many other factors such as epigenetic factors, rare variants, copy number variants and interactions. The question remains how much more accurate we could potentially be when considering genetic factors alone. We shed light on this by comparing our empirical results to theoretical disease models that assume that the disease is affected by both environmental and genetic factors, and that the two factors are independent (see

Formally, the theoretical model uses a phenotype variable P, and it assumes that

Based on

Attempts to estimate the number of causal variants in complex diseases have been made in the past

Our GCI score is based on the assumptions that all SNPs are in linkage equilibrium and that they have independent effects on the risk of the disease. As discussed above, the three examples studied here show no significant difference between the GCI model and a model in which pair-wise dependencies among the SNPs are included through logistic regression. This assumption may not always hold, since we know of some rare examples for which there is evidence of epistasis.

In order to further explore the issue of interactions, we simulated datasets under a model in which a single pair of SNPs is interacting. Formally, the model can be described as follows. Let λ_{i} denote the relative risk of the disease for a particular combination of genotypes (g_{i}) and p denote the average lifetime risk. If all SNPs are independent, the total risk is proportional to ∏_{j} λ_{ij}, where λ_{ij} denotes the relative risk for the j^{th} locus. In the interactions model, we assume that for a particular pair, the relative risk for some combinations of genotypes is γ times larger than the product of their relative risks. For all other SNPs and for all other genotype combinations, relative risks are assumed to be multiplicative. Thus, for example, if SNPs x and y interact, then the relative risk for the pair is γλ_{ix}λ_{iy} for the interacting genotype combinations (g_{ix}, g_{iy}) and λ_{ix}λ_{iy} for all other combinations.

We set the values of λ_{ix}, λ_{iy} for the interacting SNPs x and y so that the relative risks for each of these SNPs under univariate models is equal to what is observed in real data (given in _{i} is the relative risk of individual i based on the interactions model. We choose C so that the fraction of cases is close to the average lifetime risk of the disease.

Let RR, RN and NN denote the observed values of relative risks for any SNP for risk-allele homozygote (2), heterozygotes (1) and non-risk-allele homozygote (0) respectively and let rr, rn and nn denote the respective genotype frequencies. Since λ_{ij} for any locus j can only take 3 possible values corresponding to the 3 possible genotypes, we will denote these by λ_{ij0}, λ_{ij1}, and λ_{ij2} respectively and set _{ix1}, λ_{iy1}, λ_{ix2}, λ_{iy2} for SNPs x and y by solving the following system of equations:

Based on the risks in the interactions model, we assigned disease status labels for 100,000 randomly drawn samples. We used this simulated case-control data to plot ROC curves based on two approaches for risk assessment. First, we calculate the relative risk of an individual according to the true interactions model. Then, we assigned relative risks assuming the independence model. As observed in

Disease | Interaction risk estimate (Simulated Interaction Factor 2) | GCI risk estimate, Multiplicative (Simulated Interaction Factor 2) | Interaction risk estimate (Simulated Interaction Factor 10) | GCI risk estimate, Multiplicative (Simulated Interaction Factor 10)
Crohn's Disease | 0.722 | 0.722 | 0.739 | 0.724
Rheumatoid Arthritis | 0.679 | 0.674 | 0.720 | 0.673
Type 2 Diabetes | 0.597 | 0.594 | 0.607 | 0.595

The ROC curve serves as one metric for evaluating a diagnostic in that it provides a quantitative measure of the ability of the test to distinguish between unaffected and affected individuals. However, when estimating the lifetime risk, the ROC curve alone may not be sufficient if a score does not directly estimate the correct probabilistic measure (i.e. the probability of developing disease in one's lifetime) but instead computes some function of this probability. In particular, for any given pair of score functions, f_{1}(G) and f_{2}(G), the ROC curves of the functions will be identical as long as f_{1} is a monotonic increasing function of f_{2}: if we use f_{1} and f_{2} to estimate risk, we will get exactly the same ROC curves. However, these two functions may give very different lifetime risk estimates to individuals. Therefore, ROC curves alone are not sufficient for tests that report probabilistic risk. For quality assessment, we also need a more informative quantity, the absolute value of relative error between the true risk probability and the estimated risk probability. The relative error is defined as the difference between the estimated and true risk probability divided by the true risk probability. Thus, the absolute value of relative error is given by |estimated risk probability − true risk probability| / true risk probability.
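The point about monotonic transformations can be checked numerically. The sketch below computes the area under the ROC curve through the rank-sum identity and compares a set of risk values with their squares; squaring is only an illustrative monotonic increasing transformation on [0, 1], and the five risk values and labels are made up for the example.

```python
def auc(scores, labels):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) identity."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0  # tied scores share their average rank
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


def abs_relative_error(estimated, true):
    """Absolute value of (estimated - true) / true."""
    return abs(estimated - true) / true


# Hypothetical true lifetime risks and disease labels for five individuals.
true_risk = [0.02, 0.05, 0.10, 0.20, 0.40]
labels = [0, 0, 1, 0, 1]
squared = [p ** 2 for p in true_risk]  # same ranking, very different values

same_auc = auc(true_risk, labels) == auc(squared, labels)
errors = [abs_relative_error(s, p) for s, p in zip(squared, true_risk)]
```

Both scores rank the individuals identically, so `same_auc` is true, while every entry of `errors` is large: the squared score would be a poor estimate of lifetime risk despite having the same ROC curve.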

Since the true probability of developing a disease is unknown, we simulated a scenario in which case-control data is used to calculate the GCI parameters (i.e. the relative risks), and then applied the GCI risk estimates to another independently simulated population. The disease model we used for the simulation assumes that the genetic factors of the disease can be decomposed into a small number of large effects and a large number of small effects that can be approximated by a normal distribution (see

The above procedure was used to generate a simulated set of relative risk values. We then generated 500 individuals randomly according to the theoretical disease model. Since the variables are known for each of these individuals, we know the correct genetic risk to develop the condition. We use these ‘true risks’ as a baseline for the accuracy measure. We compare the GCI based risk estimates to this baseline, as well as a variant of the GCI in which the relative risks are replaced by the odds ratios. We note that methods that calculate disease risk based on prevalence (e.g.

In

In the previous sections, we used only the genetic information to estimate the risk of disease. In order to estimate the potential contribution of known environmental factors to disease prediction, we now consider the case where both environmental and genotypic data are used to estimate risk. Such an example was studied for the case of Type 2 Diabetes in

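Disease | Environmental Variable | Level | Proportion in the population | Relative risk
Type 2 Diabetes | Body Mass Index | <23 | 0.20 | 1.00
Type 2 Diabetes | Body Mass Index | 23–23.9 | 0.16 | 1.00
Type 2 Diabetes | Body Mass Index | 24–24.9 | 0.14 | 1.50
Type 2 Diabetes | Body Mass Index | 25–26.9 | 0.27 | 2.20
Type 2 Diabetes | Body Mass Index | 27–28.9 | 0.14 | 4.40
Type 2 Diabetes | Body Mass Index | 29–30.9 | 0.06 | 6.70
Type 2 Diabetes | Body Mass Index | 31–32.9 | 0.02 | 11.6
Type 2 Diabetes | Body Mass Index | 33–34.9 | 0.01 | 21.3
Type 2 Diabetes | Body Mass Index | ≥35 | 0.01 | 42.1
Type 2 Diabetes | Smoking | Never Smoked | 0.50 | 1.00
Type 2 Diabetes | Smoking | Ex-Smoker | 0.39 | 1.10
Type 2 Diabetes | Smoking | <20 cigs/day | 0.04 | 1.50
Type 2 Diabetes | Smoking | ≥20 cigs/day | 0.07 | 1.70
Crohn's Disease | Smoking | Never Smoked | 0.545 | 1.00
Crohn's Disease | Smoking | Ex-Smoker | 0.245 | 1.70
Crohn's Disease | Smoking | Current-Smoker | 0.198 | 3.00
Rheumatoid Arthritis | Smoking | Never Smoked | 0.498 | 1.00
Rheumatoid Arthritis | Smoking | Ex-Smoker | 0.276 | 1.40
Rheumatoid Arthritis | Smoking | Current-Smoker | 0.227 | 1.30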

We simulated the genotype and environmental factor values for a set of 100,000 individuals based on their known frequencies in the population (See

GENEVA study refers to the Gene Environment Association Studies initiative (

Effect of genetic (15 SNPs given in

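Disease | Environmental Variable | Level | Relative risk
Type 2 Diabetes | Body Mass Index | <23 | 1.00
Type 2 Diabetes | Body Mass Index | ≥23 and <25 | 2.67
Type 2 Diabetes | Body Mass Index | ≥25 and <30 | 7.59
Type 2 Diabetes | Body Mass Index | ≥30 and <35 | 20.1
Type 2 Diabetes | Body Mass Index | ≥35 | 38.8
Type 2 Diabetes | Smoking | Never Smoked | 1.00
Type 2 Diabetes | Smoking | Ex Smoker | 1.23
Type 2 Diabetes | Smoking | Current Smoker | 1.44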

Disease | dbSNP rs id | Relative risk | Relative risk | Frequency | Frequency
Type 2 Diabetes | rs153143 | 1.1586 | 1.0772 | 0.0170 | 0.1670
Type 2 Diabetes | rs11634397 | 1.0961 | 1.0472 | 0.3280 | 0.5340
Type 2 Diabetes | rs8042680 | 1.1112 | 1.0545 | 0.0330 | 0.3670
Type 2 Diabetes | rs10012946 | 1.1464 | 1.0239 | 0.5000 | 0.4667
Type 2 Diabetes | rs10811661 | 1.3008 | 1.1282 | 0.6667 | 0.2500
Type 2 Diabetes | rs1801282 | 1.4128 | 1.2417 | 0.8667 | 0.1167
Type 2 Diabetes | rs4402960 | 1.1602 | 1.1233 | 0.1167 | 0.3500
Type 2 Diabetes | rs4506565 | 1.6133 | 1.2738 | 0.0847 | 0.3729
Type 2 Diabetes | rs5215 | 1.1681 | 1.0935 | 0.1000 | 0.6167
Type 2 Diabetes | rs8050136 | 1.3609 | 1.1176 | 0.1167 | 0.6667
Type 2 Diabetes | rs10923931 | 1.1948 | 1.0947 | 0.0167 | 0.2000
Type 2 Diabetes | rs4607103 | 1.1392 | 1.0681 | 0.6333 | 0.3500
Type 2 Diabetes | rs7961581 | 1.1355 | 1.0664 | 0.0500 | 0.3667
Type 2 Diabetes | rs864745 | 1.1530 | 1.0747 | 0.3158 | 0.4035
Type 2 Diabetes | rs5015480 | 1.1456 | 1.0451 | 0.3167 | 0.4833

The Human Genome Project

We have presented a new method for the estimation of an individual's lifetime risk based on genetic data through a genetic score function (the GCI). The GCI, like all estimates of a particular quantity, requires a set of assumptions that may bias the risk estimates. Particularly, the assumptions made by the GCI score are that the allele frequencies of the causal SNPs and effect sizes are known, and that all the SNPs are independent of each other. We show through simulation studies and by the analysis of the WTCCC data that moderate SNP-SNP interactions have almost no effect on the power of the multiplicative GCI score. However, in principle strong non-additive effects between variants might affect the risk estimates, and thus care has to be taken when interpreting the results. In most scenarios, we expect that such effects will likely be discovered prior to the use of GCI and can be incorporated in the risk calculation. We therefore view this as a minor problem, especially given that no significantly strong SNP-SNP interactions have been uncovered in whole genome association studies performed over the past several years.

We used the ROC curve analysis and the heritability of each of the conditions we considered to find the total genetic variation explained by known variants, compared to the expected genetic variation based on heritability. We find that current scientific knowledge can explain approximately 6%-14% of the total genetic variation for these conditions. This suggests that the risk estimates provided by the GCI may vary considerably in the future, as more genetic variants are found and used for risk estimation (e.g. see

It is clear that next-generation technologies will be used in study designs similar to GWAS to identify additional heritable risk factors for CNCDs. As each new genetic association is validated to the appropriate industry thresholds, this new genetic risk information can be added into the GCI in a scalable fashion, on a disease-by-disease basis to improve the accuracy of the GCI in real time.

Given these interpretations of the GCI score, it is informative to use such a score in order to estimate the risk of an individual based on their genetic data. The medical benefits of such individualized knowledge are intuitive, but have to be clinically proven through prospective studies. The main open question is whether individuals will benefit by change of behavior, early diagnosis or an individualized course of treatment based on their genetic information for actionable CNCDs. We believe that tools such as the GCI score will facilitate such studies and help transition us into the era of personalized preventive medicine.

The datasets used were approved by the relevant boards in Navigenics Inc and University of California Davis.

We consider a disease for which k risk loci have been identified. As done in _{i} is genotype of an individual in locus i, and D represents the event that the individual will develop the disease across his or her lifetime. As noted by

When calculating the risk across multiple SNPs for an individual with genotypes (g_{1},…,g_{n}), we are interested in finding the probability

In order to estimate the lifetime risk of a specific individual, we therefore need to have an estimate of the average lifetime risk Pr(D) across the entire population and the risk of developing the disease across the lifetime of an individual with genotype g_{i}. The former has been estimated for a wide range of conditions using prospective studies

In epidemiology literature, the relative risk is often considered an intuitive and informative measure of risk. The relative risk is defined as λ_{i} = Pr(D|a_{i})/Pr(D|a_{0}), where a_{0}, a_{1}, and a_{2} correspond to the genotypes with 0, 1, and 2 risk alleles. If the relative risks are known, we could estimate Pr(D|a_{i}) by using the following:

Pr(D) = Pr(D|a_{0})Pr(a_{0}) + Pr(D|a_{1})Pr(a_{1}) + Pr(D|a_{2})Pr(a_{2}) (Equation 1)

Equation 1, together with the relative risks, provides three independent equations with three variables, since Pr(a_{i}) can be found by considering a reference population, and Pr(D) is known. Unfortunately, the relative risk cannot be directly calculated in the context of case-control studies and whole-genome association studies. The relative risk can usually be estimated through prospective studies in which a set of healthy individuals is studied over a long period of time. In contrast, odds ratios are normally reported in case-control studies. The odds ratio is the ratio between the odds of carrying the risk allele in cases vs. controls. For rare diseases, the odds ratio is a good approximation of relative risk; however, for common diseases, the odds ratio could result in a misleading estimate of risk, where the odds ratios may be quite high even when the increase in risk is minor.
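The gap between the two measures is easy to see with a small calculation. The sketch below computes both quantities from genotype-specific disease probabilities; it uses the disease odds ratio, which equals the odds ratio of carrying the risk allele in cases versus disease-free controls because the cross-product of a 2×2 table is symmetric. The probabilities are invented for illustration.

```python
def relative_risk(p_risk, p_reference):
    """Ratio of disease probabilities between risk and reference genotypes."""
    return p_risk / p_reference


def odds_ratio(p_risk, p_reference):
    """Ratio of disease odds between risk and reference genotypes."""
    odds_risk = p_risk / (1.0 - p_risk)
    odds_reference = p_reference / (1.0 - p_reference)
    return odds_risk / odds_reference


# Rare disease: both genotypes have a small probability of disease.
rare_rr = relative_risk(0.002, 0.001)
rare_or = odds_ratio(0.002, 0.001)

# Common disease: the same two-fold relative risk, much larger odds ratio.
common_rr = relative_risk(0.50, 0.25)
common_or = odds_ratio(0.50, 0.25)
```

In the rare case the odds ratio stays within a fraction of a percent of the relative risk of 2, whereas in the common case the same relative risk corresponds to an odds ratio of about 3.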

As previously noted

We now turn to the calculation of Pr(D|g_{i}) given that an α fraction of the controls will eventually develop the disease during their lifetime. We consider a locus in which m+1 different alleles are present. This allows us to deal with general scenarios, in which g_{i} may represent any number of interacting SNPs, and where m = 3^{s}, where s is the number of SNPs represented by g_{i}.

We will denote the m+1 possible alleles by a_{0}, a_{1},…, a_{m}, where a_{0} is the non-risk allele, and their respective allele frequencies in the general population as f_{0}, f_{1},…, f_{m}. Given that an α fraction of the controls will eventually develop the condition, we can write the odds ratios as:

Similar to Equation 1, we know that

For a fixed α, we can solve this equation using a binary search on the variable Pr(D|a_{0}); there is exactly one solution between 0 and Pr(D) since the right hand side of this equation is an increasing function of Pr(D|a_{0}) and binary search is guaranteed to find that solution.
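The binary search can be sketched as follows for the simplified case α = 0, i.e. assuming that none of the controls ever develops the disease, so that each odds ratio is the ratio of the disease odds for allele a_{i} to the disease odds for a_{0}. That simplification, the function names, and the example numbers are assumptions for illustration; the general case with α > 0 would replace `genotype_risks` with the corresponding expression.

```python
def genotype_risks(p0, odds_ratios):
    """Pr(D|a_i) implied by baseline risk p0 and odds ratios relative to a_0."""
    return [o * p0 / (1.0 - p0 + o * p0) for o in odds_ratios]


def solve_baseline_risk(odds_ratios, frequencies, lifetime_risk, tol=1e-12):
    """Find Pr(D|a_0) such that sum_i f_i * Pr(D|a_i) equals Pr(D).

    Assumes alpha = 0 and odds ratios >= 1, so the solution lies in (0, Pr(D)).
    """
    lo, hi = 0.0, lifetime_risk
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        risks = genotype_risks(mid, odds_ratios)
        average = sum(f * p for f, p in zip(frequencies, risks))
        if average < lifetime_risk:
            lo = mid  # population average too low: raise the baseline risk
        else:
            hi = mid
    return (lo + hi) / 2.0


# Hypothetical locus: odds ratios for a_0, a_1, a_2 and their frequencies.
example_or = [1.0, 1.2, 1.5]
example_freq = [0.5, 0.4, 0.1]
p0 = solve_baseline_risk(example_or, example_freq, lifetime_risk=0.25)
```

Because the population-average risk increases monotonically with the baseline risk, each halving step discards the half-interval that cannot contain the solution.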

Generally, the value of α is unknown and it has to be determined based on the age characteristics of the study population. For instance, if the control population is a sample from the general population, then α should be taken as the average lifetime risk of the disease. However, if the control population was chosen so that their age range is after the age of onset of the disease, α should be close to 0. When case-control genotype data is given, one can use maximum likelihood estimation to calculate α.

The GCI method essentially provides a way to compute the relative risks of an individual as compared to an individual with non-risk alleles at each of the disease-associated markers. In order to calculate the lifetime risk, we take the product of the relative risks across all loci (this is the overall relative risk of the individual under the multiplicative model) and multiply it by the average lifetime risk of the disease in the population. We then divide this product by the average overall relative risk of the population. To approximate the average relative risk of the population, we assume that the SNPs at different loci are independent of one another (i.e. in linkage equilibrium). Under this assumption, the average overall relative risk of the population is equal to the product of the average relative risks at each disease-associated marker.

If all the marker effects are independent, the relative risk of individual i is equal to λ_{i} = ∏_{j} λ_{ij}, where λ_{ij} denotes the relative risk for the j^{th} locus. Let Pr(D) denote the average lifetime risk of the disease in the population. Then, the GCI lifetime risk probability or GCI score of an individual i is:

GCI_{i} = Pr(D) · ∏_{j} λ_{ij} / ∏_{j} (Σ_{k=0}^{m} f_{jk}λ_{jk})

Here, m+1 alleles are possible at each marker locus and λ_{jk} denotes the relative risk of the k^{th} allele of the j^{th} locus and f_{jk} denotes its frequency in the sample.
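The calculation described in this section (product of the individual's relative risks, divided by the product of the per-locus average relative risks, multiplied by the average lifetime risk) can be sketched in a few lines. The function name, the data layout, and the two example loci are hypothetical and not taken from the tables in this paper.

```python
from math import prod


def gci_score(individual_rr, locus_tables, lifetime_risk):
    """GCI lifetime risk under the multiplicative, independent-loci model.

    individual_rr: relative risk of this individual's genotype at each locus.
    locus_tables:  per locus, a list of (frequency, relative risk) pairs
                   covering every genotype at that locus.
    lifetime_risk: average lifetime risk Pr(D) in the population.
    """
    individual = prod(individual_rr)
    population = prod(sum(f * rr for f, rr in table) for table in locus_tables)
    return lifetime_risk * individual / population


# Two hypothetical loci, each with three genotypes (frequency, relative risk).
tables = [
    [(0.25, 1.0), (0.50, 1.2), (0.25, 1.5)],
    [(0.64, 1.0), (0.32, 1.1), (0.04, 1.3)],
]
score = gci_score([1.5, 1.0], tables, lifetime_risk=0.25)
```

A useful sanity check on this normalization is that, when genotypes at different loci are independent, the frequency-weighted average of the score over all genotype combinations equals Pr(D).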

We compared the GCI score to the optimal risk scores calculated under two different theoretical disease models. These models assume that the disease is affected by both environmental and genetic factors, and that the two factors are independent of each other. We denote the phenotype _{G} and σ_{E} respectively, and that an individual will develop the condition in his or her lifetime if _{G,} σ_{E} and α using the constraint that _{G,} σ_{E}, or α which are difficult to estimate.

In the second model, a variant of the previous model, we assume that _{G1}, and _{i} corresponds to SNPs with large effects and G1 represents many other small genetic effects; if there are enough small genetic effects, we expect that the asymptotic behavior of their sum would be according to a normal distribution. By setting the parameters λ, σ_{G1} and p appropriately, we can control the relative risks of the large effect SNPs. We tune these parameters such that the relative risks are close to values observed in

In this section, we will show that the theoretical genetic maximum of the area under the ROC curve for model 1 depends on the average lifetime risk (ALTR) and the heritability of the disease alone. Let σ_{e} denote the variance in the environmental variable and σ_{g} denote the variance in the genetic variable. In model 1, both genetic (G) and environmental (E) variables are normally distributed. The theoretical maximum of ROC curve is obtained when the genetic variable is known exactly while the environmental variable is unknown. An individual is a true case if

The probability that an individual's genetic variable is greater than some cutoff (c) is given by:

The probability that an individual's genetic variable is greater than the cutoff and the individual is a true case is:

where

By definition heritability,

The integral within the brackets in the previous double integral can be expressed in terms of the error function, erf. Because the cumulative distribution function of normal distribution is given by

Thus, the probability that an individual is a true case and its genetic variable is greater than c can be expressed as:

Similarly, the probability that an individual is a true control and its genetic variable is greater than c i.e.

Therefore, the true positive fraction for any given β only depends on h and ALTR since: _{e} and σ_{g}.
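The claim for model 1 can also be illustrated by simulation. The sketch below draws independent normal genetic and environmental variables, labels an individual a case when their sum exceeds a cutoff chosen to match the average lifetime risk, and scores individuals by the genetic variable alone. Treating heritability as the ratio of genetic variance to total variance, and the sample size and seed, are assumptions of this sketch.

```python
import random
from statistics import NormalDist


def auc(scores, labels):
    """Area under the ROC curve via the rank-sum identity (no tied scores)."""
    order = sorted(range(len(scores)), key=scores.__getitem__)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum = sum(rank for rank, idx in enumerate(order, start=1) if labels[idx])
    return (rank_sum - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


def simulate_max_auc(heritability, lifetime_risk, total_sd=1.0, n=20000, seed=0):
    """AUC obtained when the genetic variable G is known exactly."""
    rng = random.Random(seed)
    sd_g = total_sd * heritability ** 0.5          # genetic standard deviation
    sd_e = total_sd * (1.0 - heritability) ** 0.5  # environmental standard deviation
    # Cutoff on G + E such that a fraction `lifetime_risk` of people are cases.
    cutoff = NormalDist(0.0, total_sd).inv_cdf(1.0 - lifetime_risk)
    genetic, status = [], []
    for _ in range(n):
        g = rng.gauss(0.0, sd_g)
        e = rng.gauss(0.0, sd_e)
        genetic.append(g)
        status.append(1 if g + e > cutoff else 0)
    return auc(genetic, status)


# Same heritability and lifetime risk, different absolute variances.
auc_unit = simulate_max_auc(0.64, 0.25, total_sd=1.0)
auc_scaled = simulate_max_auc(0.64, 0.25, total_sd=4.0)
```

Rescaling both standard deviations together with the cutoff leaves the ranking of individuals and their case status unchanged, so the two estimates agree, in line with the dependence on heritability and average lifetime risk alone.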

In this section, we prove a result similar to that in the previous section for disease model 2. In particular, we will show that if the relative risks of SNPs known to be associated with a disease and the risk-allele frequencies (p_{i}) are fixed, then the theoretical genetic maximum of the area under the ROC curve depends only on the heritability and the average lifetime risk of the disease. In model 2, the genetic variable is given by: _{i}s are distributed according to a Binomial distribution of B(2, p_{i}), where p_{i} is the allele frequency of the risk allele at locus i. B(2, p_{i}) gives the number of risk allele copies in an individual at locus i. _{e}. The phenotype is given by

Heritability for this model is _{i.} By definition, the relative risk of heterozygote is given by:

Let erf denote the error function and erfc denote the complementary error function (i.e. 1 – erf(x)). Since G1+E is _{i})/_{i}s with disease cutoff α represent the solutions for the SNPs for some choice of _{i}s with cutoff of Lα will necessarily be solutions if the standard deviation of G1 and E get changed by a factor of L. This is because z is always a linear combination of λ_{i}s. Therefore, λ_{i}/

By definition, _{i}/_{i} are independent of _{i} values. Then, if

The true positive fraction is defined as: Pr(G>c & G+E>α)/Pr(G+E>α) where c denotes the cutoff for genetic variable. Let

Using the error function to express the cumulative distribution function of the normal distribution, Pr(G>c & G+E>α) is:

Similarly, the probability that an individual is a true control and its genetic variable is greater than c i.e.

Note that _{i}s are fixed. Therefore, the true positive fraction for any given β only depends on the h and ALTR. The same is also true for false positive fraction since _{e}, σ_{g1} and λ_{i}s.

We first note that

Initially, determine the _{i} increases with

1) Determine

2) Determine

n) Determine

If all RN_{j} values are sufficiently close to the observed values, stop. Else go back to step 1.

Simulation experiments indicated that the above heuristic converges to a simultaneous solution for all

Since heritability can vary by population, age, environmental variation, phenotypic definition, sample size or standard error, we sought multiple references and chose the most robust estimate based on the method of calculating heritability, sample size, ancestral origin and study population. If several articles had good methodology, we tried to pick one “in the middle” of the range of reported estimates. For lifetime risk, there are often not multiple references, and sometimes we relied on incidence data.

When lifetime risk data was not available from the literature (for Crohn's disease and Rheumatoid Arthritis), we used incidence data to obtain an estimate of the average lifetime risk (ALTR) using a conversion formula. Namely, we used the following formula:_{i} is the number of live births in the US in the year 2000; each from the appropriate gender and ethnicity. The main assumptions in this formula are: 1. Fixed population size. 2. Maximum life span for all.

We first validated our formula by checking how accurately incidence data estimate lifetime risk, using incidence and lifetime risk data from the Surveillance, Epidemiology, and End Results program of the National Cancer Institute (USA) for a number of common cancers. Using our calculation with incidence data, we estimated the published lifetime risk within 1% for breast, colon, prostate and lung cancers (results not shown). Thus, we are confident that our lifetime risk calculations are reliable.
