A Regression-Based Method for Estimating Risks and Relative Risks in Case-Base Studies

Both the absolute risk and the relative risk (RR) have a crucial role to play in epidemiology. RR is often approximated by the odds ratio (OR) under the rare-disease assumption in a conventional case-control study; however, such a study design does not provide an estimate of the absolute risk. The case-base study is an alternative approach which readily produces an RR estimate without resorting to the rare-disease assumption. However, previous researchers considered only a single dichotomous exposure and did not elaborate on how absolute risks can be estimated in a case-base study. In this paper, the authors propose a logistic model for the case-base study. The model is flexible enough to admit multiple exposures in any measurement scale: binary, categorical or continuous. It can be easily fitted using common statistical packages. With one additional step of simple calculations on the model parameters, one readily obtains relative and absolute risk estimates as well as their confidence intervals. Monte-Carlo simulations show that the proposed method can produce unbiased estimates and confidence intervals with adequate coverage for ORs, RRs and absolute risks. The case-base study, with all its desirable properties and its methods of analysis fully developed in this paper, may become a mainstay in epidemiology.


Introduction
Both the absolute and the relative disease risks have a crucial role to play in epidemiology. The relative risk (RR) is the ratio of the disease risk for individuals at one specific exposure level to the disease risk for those at a reference level. Under the rare-disease assumption, RR is approximated by the odds ratio (OR), which in turn can be conveniently estimated in a case-control study. While an index such as RR or OR may be adequate for etiologic inferences, it is only part of the story. Once a factor has been demonstrated to be a risk factor for the disease, we will often be asked to predict the disease risk of an individual having a specific level of an exposure, that is, the absolute risk. Unfortunately, the conventional case-control study does not provide an estimate of it.
Kupper et al. [1] introduced a hybrid (part case-control, part cohort) design in a defined population (the 'study base'); Miettinen [2] later coined the term 'case-base' study for it. In contrast to the case-control study, which samples the non-diseased subjects in the study base as the control group, the case-base study samples the entire study base without regard to disease status. With such a sampling scheme, the case-base study readily produces an RR estimate without resorting to the rare-disease assumption. Note that the case-base study should not be confused with the 'case-cohort' study introduced by Prentice [3]. The former, like the case-control study, is a retrospective design which ascertains the exposure statuses of subjects in a population retrospectively, while the latter is a prospective cohort study with all the time-to-event information available.
While the case-cohort study has been gaining popularity over the years [3][4][5][6][7][8][9], the case-base study has remained little noticed since its introduction forty years ago. Miettinen [2] derived a variance formula for RR in a case-base study. Sato [10,11] later proposed a more efficient estimator for RR, which is based on maximum likelihood estimation theory. However, these researchers considered only one dichotomous exposure and did not elaborate on how to estimate absolute risks in a case-base study. Without a general-purpose regression method for analyzing the data, it is no wonder that most practicing epidemiologists would not consider the case-base design when planning a study.
In this paper, we develop a logistic model for the case-base study. The model is flexible enough to admit multiple exposures in any measurement scale: binary, categorical or continuous. It can be easily fitted using common statistical packages. With one additional step of simple calculations on the model parameters, one readily obtains relative and absolute risk estimates as well as their confidence intervals. We use Monte-Carlo simulations to study the statistical properties of the proposed method.

Methods
Let the exposure profile of a subject be denoted by a $1 \times m$ row vector $z$. Each element of $z$ can be on a binary, categorical or continuous scale. Let $D$ represent the disease status of a subject, with $D = 1$ for diseased and $D = 0$ for non-diseased. We assume that the disease risk in the study population follows a logistic model:
$$\log \frac{\Pr(D = 1 \mid z)}{\Pr(D = 0 \mid z)} = \mu + z\beta, \tag{1}$$
where $\exp(\mu)$ is the baseline disease odds (the disease odds for those with an exposure profile of $z = 0$ in the population) and $\beta$ is an $m \times 1$ column vector of parameters of interest [$\exp(\beta)$ is a column vector of odds ratios].
In a case-base study, the 'cases' are randomly selected from all the incident diseased subjects in the population. Let $S_1 = 1$ indicate that a diseased subject is recruited in the case sample, and $S_1 = 0$ otherwise. Such a case sampling scheme implies that
$$\Pr(S_1 = 1 \mid D = 1, z) = \phi_1, \qquad \Pr(S_1 = 1 \mid D = 0, z) = 0, \tag{2}$$
or more concisely,
$$\Pr(S_1 = 1 \mid D, z) = D\,\phi_1, \tag{3}$$
where $\phi_1$ is a constant between 0 and 1. The 'controls' of a case-base study are randomly selected from all subjects in the population without regard to their disease status. Let $S_0 = 1$ indicate that a subject is recruited in the control sample, and $S_0 = 0$ otherwise. Such a control sampling scheme implies that
$$\Pr(S_0 = 1 \mid D, z) = \phi_0, \tag{4}$$
where $\phi_0$ is a constant between 0 and 1. The two sampling schemes are independent of each other, that is,
$$\Pr(S_0, S_1 \mid D, z) = \Pr(S_0 \mid D, z)\,\Pr(S_1 \mid D, z). \tag{5}$$
The event $S_0 + S_1 \geq 1$ indicates that a subject is recruited in a case-base study through case sampling, control sampling or both. The recruitment probability of a subject with a disease status of $D$ and an exposure profile of $z$ is
$$\Pr(S_0 + S_1 \geq 1 \mid D, z) = 1 - (1 - \phi_0)(1 - D\phi_1) = \phi_0 + D\,\phi_1 (1 - \phi_0). \tag{6}$$
Let $p$ be the probability that a diseased subject in a case-base study is recruited in the control sample, that is,
$$p = \Pr(S_0 = 1 \mid D = 1, S_0 + S_1 \geq 1) = \frac{\phi_0}{\phi_0 + \phi_1 - \phi_0\phi_1}. \tag{7}$$
$p$ is an important parameter to be used later. From equations 1-7, it can be shown that the disease risk in a case-base sample also follows a logistic model like the one in the population (model 1), albeit with a different intercept:
$$\log \frac{\Pr(D = 1 \mid z, S_0 + S_1 \geq 1)}{\Pr(D = 0 \mid z, S_0 + S_1 \geq 1)} = \mu^* + z\beta, \qquad \mu^* = \mu - \log p. \tag{8}$$
Suppose that a total of $n$ subjects are recruited in a case-base study, indexed by $i$ ($i = 1, \ldots, n$). For the $i$-th subject, the exposure profile, the disease status, and the control and the case sampling statuses are $z_i$, $D_i$, $S_{0,i}$, and $S_{1,i}$, respectively. Given the exposure status of the subjects recruited in the case-base study, each subject provides information on disease and sampling statuses.
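The likelihood function is therefore
$$L = \prod_{i=1}^{n} \Pr(D_i, S_{0,i}, S_{1,i} \mid z_i, S_{0,i} + S_{1,i} \geq 1) = L_1 \times L_2 \times L_3, \tag{9}$$
with
$$L_1 = \prod_{i=1}^{n} \phi_1^{D_i S_{0,i} S_{1,i}} (1 - \phi_1)^{D_i S_{0,i} (1 - S_{1,i})}, \quad L_2 = \prod_{i=1}^{n} p^{D_i S_{0,i}} (1 - p)^{D_i (1 - S_{0,i})}, \quad L_3 = \prod_{i=1}^{n} \frac{\exp[D_i(\mu^* + z_i\beta)]}{1 + \exp(\mu^* + z_i\beta)}.$$
Because equation 9 is composed of three terms, the three sets of parameters ($\phi_1$ in $L_1$, $p$ in $L_2$, and $\mu^*$ and $\beta^t$ in $L_3$) are mutually independent (the second derivatives of the log-likelihood with respect to parameters in different sets are zero).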
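The intercept shift between the population model and the case-base sample model can be checked with a short Bayes-rule calculation. The sketch below is only an illustration: it writes the control and case sampling probabilities as $\phi_0$ and $\phi_1$, and uses $R$ as shorthand for the recruitment event $S_0 + S_1 \geq 1$ ($R$ is a symbol introduced here for brevity, not the authors' notation).

```latex
\begin{align*}
\frac{\Pr(D = 1 \mid z, R)}{\Pr(D = 0 \mid z, R)}
  &= \frac{\Pr(R \mid D = 1, z)}{\Pr(R \mid D = 0, z)}
     \cdot \frac{\Pr(D = 1 \mid z)}{\Pr(D = 0 \mid z)} \\
  &= \frac{\phi_0 + \phi_1 - \phi_0 \phi_1}{\phi_0} \, \exp(\mu + z\beta) \\
  &= \frac{1}{p} \, \exp(\mu + z\beta)
   = \exp\bigl( (\mu - \log p) + z\beta \bigr).
\end{align*}
```

Under these assumptions the slope vector $\beta$ is unchanged by the sampling, and only the intercept moves, from $\mu$ to $\mu - \log p$.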
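Both $L_1$ and $L_2$ in equation 9 are binomial likelihoods. Therefore the maximum likelihood estimates of $\phi_1$ and $p$, and their variances, are
$$\hat{\phi}_1 = \frac{n^{Both}_D}{n^{CN}_D}, \tag{10}$$
$$\mathrm{Var}(\hat{\phi}_1) = \frac{\hat{\phi}_1 (1 - \hat{\phi}_1)}{n^{CN}_D}, \tag{11}$$
$$\hat{p} = \frac{n^{CN}_D}{n_D}, \tag{12}$$
and
$$\mathrm{Var}(\hat{p}) = \frac{\hat{p}(1 - \hat{p})}{n_D}, \tag{13}$$
where $n^{CN}_D$ is the number of diseased subjects recruited in the control sample, $n^{Both}_D$ is the number of diseased subjects recruited in both the case and the control samples, and $n_D$ is the total number of diseased subjects recruited in the case-base study.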
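As a small illustration of these binomial estimates, the sketch below computes a proportion and its estimated variance from counts. The function name `binomial_mle` and the counts are hypothetical. Treating the estimate of the case sampling probability as the fraction of control-sample diseased subjects who also appear in the case sample is an assumption based on reading that likelihood term as binomial.

```python
def binomial_mle(successes, trials):
    """Return the binomial MLE of a proportion and its estimated variance."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need 0 <= successes <= trials and trials > 0")
    estimate = successes / trials
    variance = estimate * (1.0 - estimate) / trials
    return estimate, variance


# Hypothetical counts, for illustration only (not data from the paper).
n_d = 500      # diseased subjects recruited in the case-base study
n_cn_d = 50    # diseased subjects recruited in the control sample
n_both_d = 3   # diseased subjects recruited in both samples

# p: probability that a recruited diseased subject is in the control sample.
p_hat, var_p_hat = binomial_mle(n_cn_d, n_d)

# Case sampling probability (a nuisance parameter), under the assumption above.
phi1_hat, var_phi1_hat = binomial_mle(n_both_d, n_cn_d)
```

Only the estimate of p and its variance are needed for the risk and RR calculations that follow.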
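The $L_3$ in equation 9 is a likelihood for a logistic regression model. To obtain the maximum likelihood estimates of $\mu^*$ and $\beta^t$, we can fit a logistic regression (model 8) to the case-base data. Note that the dependent variable of this logistic regression is the binary disease status, with the diseased subjects coded as '1' and the non-diseased subjects as '0', regardless of whether they were recruited through case sampling, control sampling or both. Any statistical package that performs logistic regression analysis can obtain the estimates $\hat{\mu}^*$ and $\hat{\beta}^t$, together with the variance-covariance matrix of $(\hat{\mu}^*, \hat{\beta}^t)$. This variance-covariance matrix is denoted by $\Sigma$, which is an $(m+1) \times (m+1)$ matrix.
The $\hat{\beta}^t$ above readily provides the maximum likelihood estimates for the logarithms of ORs. As detailed below, the $\hat{p}$ and $\hat{\mu}^*$ above are to be further combined to provide estimates for risks and RRs. First, from model 8, an estimate for $\mu$ in model 1 is
$$\hat{\mu} = \hat{\mu}^* + \log \hat{p}. \tag{14}$$
An estimate of the disease risk for subjects in the population with an exposure profile vector $u$ (a $1 \times m$ row vector) is therefore
$$\widehat{\mathrm{risk}}(u) = \frac{\exp(\hat{\mu}^* + \log \hat{p} + u\hat{\beta})}{1 + \exp(\hat{\mu}^* + \log \hat{p} + u\hat{\beta})}. \tag{15}$$
The variance of the estimate (in logit scale) is
$$\mathrm{Var}\bigl[\mathrm{logit}\,\widehat{\mathrm{risk}}(u)\bigr] = \frac{1 - \hat{p}}{n^{CN}_D} + v \Sigma v^t, \tag{16}$$
where $v = [1 \;\; u]$ is a $1 \times (m+1)$ row vector. An estimate of the RR comparing those with an exposure profile vector $u_1$ with those with $u_0$ is
$$\widehat{\mathrm{RR}} = \frac{\widehat{\mathrm{risk}}(u_1)}{\widehat{\mathrm{risk}}(u_0)}. \tag{17}$$
Using the delta method, the variance of the estimate (in log scale) is
$$\mathrm{Var}\bigl(\log \widehat{\mathrm{RR}}\bigr) = \bigl[\widehat{\mathrm{risk}}(u_0) - \widehat{\mathrm{risk}}(u_1)\bigr]^2 \, \frac{1 - \hat{p}}{n^{CN}_D} + w \Sigma w^t, \tag{18}$$
where $w = [1 - \widehat{\mathrm{risk}}(u_1)]\,v_1 - [1 - \widehat{\mathrm{risk}}(u_0)]\,v_0$, with $v_1 = [1 \;\; u_1]$ and $v_0 = [1 \;\; u_0]$.
Exhibit S1 shows that Sato's formulas [10,11] for the RR estimate and its variance in log scale are a special case of our formulas of equation 17 [...] 10) is not estimable. This has no bearing whatsoever on the current context of estimating risks and relative risks, however, since it is a nuisance parameter anyway.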
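The extra step after the logistic fit can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' code. It assumes the fitted intercept, slope vector and variance-covariance matrix are already available from any logistic regression routine. The function names and input numbers are hypothetical, and the log-RR variance is a first-order delta-method expression derived here under the stated independence of the p estimate and the regression estimates.

```python
import math


def expit(x):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-x))


def quad_form(w, sigma):
    """Compute w * sigma * w^t for a row vector w and a square matrix sigma."""
    k = len(w)
    return sum(w[i] * sigma[i][j] * w[j] for i in range(k) for j in range(k))


def linear_predictor(mu_star, beta, p_hat, u):
    """Population-scale logit of risk: mu_star + log(p_hat) + u * beta."""
    return mu_star + math.log(p_hat) + sum(b * x for b, x in zip(beta, u))


def risk_with_ci(mu_star, beta, sigma, n_cn_d, n_d, u, z_crit=1.96):
    """Risk estimate with a Wald interval built on the logit scale."""
    p_hat = n_cn_d / n_d
    eta = linear_predictor(mu_star, beta, p_hat, u)
    v = [1.0] + list(u)
    # Variance of log(p_hat) plus the variance of the fitted linear predictor.
    var_eta = (1.0 - p_hat) / n_cn_d + quad_form(v, sigma)
    half = z_crit * math.sqrt(var_eta)
    return expit(eta), expit(eta - half), expit(eta + half)


def log_rr_with_var(mu_star, beta, sigma, n_cn_d, n_d, u1, u0):
    """Log RR (u1 versus u0) and its delta-method variance."""
    p_hat = n_cn_d / n_d
    r1 = expit(linear_predictor(mu_star, beta, p_hat, u1))
    r0 = expit(linear_predictor(mu_star, beta, p_hat, u0))
    v1 = [1.0] + list(u1)
    v0 = [1.0] + list(u0)
    # Gradient of log(r1) - log(r0) with respect to (mu_star, beta).
    w = [(1.0 - r1) * a - (1.0 - r0) * b for a, b in zip(v1, v0)]
    var = (r0 - r1) ** 2 * (1.0 - p_hat) / n_cn_d + quad_form(w, sigma)
    return math.log(r1 / r0), var


# Hypothetical fitted values for one binary exposure (not from the paper).
mu_star, beta = -0.2433, [0.9163]
sigma = [[0.010, -0.010], [-0.010, 0.020]]
risk0, lower0, upper0 = risk_with_ci(mu_star, beta, sigma, 50, 500, u=[0])
log_rr, var_log_rr = log_rr_with_var(mu_star, beta, sigma, 50, 500, [1], [0])
```

A Wald interval for the RR then follows by exponentiating `log_rr` plus or minus 1.96 times the square root of `var_log_rr`.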
We perform Monte-Carlo simulations to examine the statistical properties of the proposed method. We consider three scenarios for the exposure. In the first scenario, we assume a binary exposure ($E = 0, 1$). The exposure prevalence (for $E = 1$) is set at 0.3. We assume that the OR comparing $E = 1$ subjects with $E = 0$ subjects is 2.5 ($\beta = \log \mathrm{OR} = 0.9163$). The disease prevalence in the study population is set at 0.1. Thus, the disease risk for $E = 0$ subjects ($\mathrm{risk}_0$) is 0.0727, the disease risk for $E = 1$ subjects ($\mathrm{risk}_1$) is 0.1638, and RR is 2.2543 ($\log \mathrm{RR} = 0.8128$).
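The risks quoted for the first scenario follow from the stated prevalence, exposure prevalence and OR. The sketch below recovers them by solving for the logistic intercept with bisection. The function name `scenario_risks` and the choice of bisection are illustrative and are not taken from the paper.

```python
import math


def expit(x):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-x))


def scenario_risks(prevalence, exposure_prev, odds_ratio, tol=1e-12):
    """Solve for the intercept mu that reproduces the overall prevalence.

    Returns (mu, risk0, risk1, rr) for a single binary exposure.
    """
    beta = math.log(odds_ratio)

    def marginal_risk(mu):
        # Overall prevalence is the exposure-weighted average of the risks.
        return ((1.0 - exposure_prev) * expit(mu)
                + exposure_prev * expit(mu + beta))

    lo, hi = -30.0, 30.0  # marginal_risk is increasing in mu
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if marginal_risk(mid) < prevalence:
            lo = mid
        else:
            hi = mid
    mu = (lo + hi) / 2.0
    risk0, risk1 = expit(mu), expit(mu + beta)
    return mu, risk0, risk1, risk1 / risk0


mu, risk0, risk1, rr = scenario_risks(0.1, 0.3, 2.5)
```

With these inputs the computed risks and RR agree with the values quoted for the first scenario to the reported precision.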
In the third scenario, we assume two binary exposures ($E_1$ and $E_2$). The exposure prevalence is set at 0.3 for $E_1$ and 0.4 for $E_2$. The OR comparing $E_1 = 1$ subjects with $E_1 = 0$ subjects is 2.5 ($\beta_1 = \log \mathrm{OR}_1 = 0.9163$), and the OR comparing $E_2 = 1$ subjects with $E_2 = 0$ subjects is 3 ($\beta_2 = \log \mathrm{OR}_2 = 1.0986$). For simplicity, we assume that $E_1$ and $E_2$ are independent of each other in the population and that there is no multiplicative interaction between $E_1$ and $E_2$ in causing the disease. The disease prevalence in the study population is set at 0.1. Thus, the four disease risks are $\mathrm{risk}_{00} = 0.0431$ (for $E_1 = 0$, $E_2 = 0$), $\mathrm{risk}_{10} = 0.1013$ (for $E_1 = 1$, $E_2 = 0$), $\mathrm{risk}_{01} = 0.1191$ (for $E_1 = 0$, $E_2 = 1$), and $\mathrm{risk}_{11} = 0.2527$ (for $E_1 = 1$, $E_2 = 1$), respectively. The RRs (with $E_1 = 0$, $E_2 = 0$ as the reference level) are $\mathrm{RR}_{10} = 2.3481$ ($\log \mathrm{RR}_{10} = 0.8536$), $\mathrm{RR}_{01} = 2.7618$ ($\log \mathrm{RR}_{01} = 1.0159$), and $\mathrm{RR}_{11} = 5.8578$ ($\log \mathrm{RR}_{11} = 1.7678$), respectively.
The disease probabilities of subjects in the study population are assumed to follow the logistic model in model 1 with the parameter settings given in the preceding paragraphs. A case-base study is conducted in a study population of size 100000 with a case sampling probability ($\phi_1$) of 0.05 and a control sampling probability ($\phi_0$) of 0.005. Under such a sampling scheme, the case-base study is expected to recruit a total of 500 distinct diseased and 500 distinct non-diseased subjects. We use the proposed method to calculate the point estimates and 95% confidence intervals (CIs) for ORs, RRs and risks. For comparison, Sato's [10,11] and Miettinen's [2] methods are also applied.
The simulation was repeated 10,000 times for each setting. The means of the estimates for ORs (in log scale), RRs (in log scale) and risks (in logit scale) are calculated. The variance of an estimate is calculated as the sample variance of the estimates. We also calculate the coverage probability and the average length of the 95% CIs for the estimates.
Table 1 shows the simulation results for a binary exposure. For all methods, the RR estimates are approximately unbiased and the 95% CIs achieve adequate coverage probabilities. However, the variance and the length of the 95% CIs for our method are much smaller than those for Miettinen's method. (Sato's method for the case of one binary exposure is exactly the same as our method.) Only our method can additionally produce estimates for the OR and the risks. From Table 1, we see that these estimates are approximately unbiased and their 95% CIs achieve adequate coverage probabilities.
Table 2 presents the simulation results for an exposure with four levels. It can be seen that our method can produce unbiased estimates and adequate-coverage 95% CIs for ORs, RRs, and risks. Sato's and Miettinen's methods can only produce estimates and 95% CIs for RRs. These two methods do not exploit the constancy in OR per unit change in the exposure variable. Therefore, although they are unbiased and have adequate coverage, they produce considerably larger variances and longer 95% CIs on average than our method. Exhibit S2 presents the simulation results for an exposure with four levels but without the constant OR assumption. We see that our method is still unbiased and has adequate coverage. The RR estimates are now the same as those obtained using Sato's method, though. Exhibit S3 shows that our method can produce unbiased estimates and adequate-coverage 95% CIs for ORs, RRs, and risks when the exposure is on a continuous scale.
Table 3 presents the simulation results for two binary exposures. Similarly, only our method can produce unbiased estimates and adequate-coverage 95% CIs for ORs, RRs, and risks. Sato's and Miettinen's methods can produce unbiased estimates and adequate-coverage 95% CIs for RRs only. These two methods do not exploit the assumption of no interaction between the two exposures. Therefore, the variances and average lengths of the 95% CIs for the two methods are much larger than those for our method. Exhibit S4 presents the simulation results when there is an interaction effect between the two exposures. We see that our method can produce unbiased estimates and adequate-coverage 95% CIs for ORs, RRs, and risks if an interaction term (cross-product term) is incorporated into the regression model. Exhibit S5 presents the simulation results for a confounder. We see that without adjusting for the confounder, one gets estimates that are biased and 95% CIs with inadequate coverage. These problems can be easily fixed by performing a logistic regression analysis with both the study exposure and the confounder as covariates.
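The summary measures used to judge each method across simulation replicates (mean of the estimates, their sample variance, CI coverage probability, and average CI length) can be computed as in the sketch below. The function name `summarize_replicates` and the three toy replicates are hypothetical and are not simulation output from the paper.

```python
import statistics


def summarize_replicates(estimates, lowers, uppers, truth):
    """Summarize simulation replicates for one parameter.

    estimates, lowers and uppers hold the point estimate and the CI limits
    from each replicate, all on the same scale as the true value `truth`.
    """
    covered = sum(lo <= truth <= hi for lo, hi in zip(lowers, uppers))
    lengths = [hi - lo for lo, hi in zip(lowers, uppers)]
    return {
        "mean": statistics.fmean(estimates),
        "variance": statistics.variance(estimates),  # sample variance
        "coverage": covered / len(estimates),
        "avg_length": statistics.fmean(lengths),
    }


# Toy input: three replicates of a log-RR estimate with true value 0.9.
summary = summarize_replicates(
    estimates=[0.8, 0.9, 1.0],
    lowers=[0.5, 0.7, 0.95],
    uppers=[1.1, 1.1, 1.3],
    truth=0.9,
)
```

Comparing the mean with the true value gives the bias, and a coverage close to 0.95 indicates adequate coverage for a 95% CI.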

Results
Exhibit S6 examines situations in which the disease prevalence is lower (0.05 and 0.01). The conclusions about the method comparisons remain the same, except that the precision of the RR and risk estimates is reduced for all methods.

Discussion
Logistic regression is a standard technique for analyzing case-control data. It is also the method of choice for analyzing cohort data if time-to-event information is not available. However, the ORs that it estimates approximate the RRs only under the rare-disease assumption. As such, many methods and recommendations have been proposed to date regarding the estimation of RRs in cohort studies for common outcomes [12][13][14][15][16][17]. For example, Diaz-Quijano [17] described a novel regression-based method for estimating RRs in cohort studies. In his method, all the diseased subjects in the study are duplicated, and the duplicated subjects are re-labeled as non-diseased. (For case-base studies, we can duplicate and re-label the diseased subjects recruited in the control sample.) Then, a logistic model is fitted to the expanded dataset, and the resulting regression coefficients are the estimates for logRRs. For a case-base study, we found that such a data expansion approach produces an unbiased RR estimate for a binary exposure, but with a larger variance and a wider CI than our method; for a four-level exposure, the approach produces biased estimates and CIs with inadequate coverage (results not shown). For a cohort study without time-to-event information, one can also apply our method to estimate ORs, RRs, and risks, except that $p$ (equation 7) is now exactly one and is no longer a parameter to be estimated.
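The data expansion idea described above can be illustrated for the simplest case, a cohort with one binary exposure, where the logistic regression coefficient is the log of the 2 x 2 table odds ratio. The sketch below covers the point estimate only, with hypothetical counts and function names. It does not address the variance, which a naive fit to the expanded data would not estimate correctly without adjustment.

```python
def odds_ratio(a, b, c, d):
    """OR from a 2x2 table: a, b = exposed diseased / non-diseased;
    c, d = unexposed diseased / non-diseased."""
    return (a * d) / (b * c)


def risk_ratio(a, b, c, d):
    """RR from the same cohort table."""
    return (a / (a + b)) / (c / (c + d))


def expanded_odds_ratio(a, b, c, d):
    """OR after duplicating the diseased and re-labeling the copies
    as non-diseased, so each non-diseased cell grows by the diseased count."""
    return odds_ratio(a, b + a, c, d + c)


# Hypothetical cohort counts, for illustration only.
a, b, c, d = 30, 70, 10, 90
rr = risk_ratio(a, b, c, d)
or_original = odds_ratio(a, b, c, d)
or_expanded = expanded_odds_ratio(a, b, c, d)
```

After expansion, the odds in each exposure group equal diseased over total, which is the risk, so the expanded OR coincides with the RR.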
In addition to the usual ORs, a case-base study also provides estimates for risks (equation 15) and RRs (equation 17). From equations 16 and 18, we see that the precision of the estimation is inversely proportional to $(1 - \hat{p})/n^{CN}_D \approx 1/n^{CN}_D$, that is, the larger the $n^{CN}_D$ (number of diseased subjects recruited in the control sample), the more precise the estimate of a risk or an RR. The value of $n^{CN}_D$ depends on the disease prevalence in the population and the sample size of the case-base study (Figure 1A). For a common disease (prevalence > 0.05), a case-base study of 200 distinct subjects (with equal numbers of diseased and non-diseased subjects) is expected to have an $n^{CN}_D$ larger than 5, producing an estimate of disease odds with the upper 95% confidence bound being roughly 5 times its lower bound (Figure 1B). If the disease prevalence is lower (say, prevalence = 0.005), one needs to increase the sample size of the case-base study (2000 subjects) to achieve comparable precision. If a registry system (for the diseased and for the general population as well) is readily available in a population, the sample size is no longer a limiting factor. In such a setting, a case-base study can produce estimates for risks and RRs with reasonable precision, even if the disease is very rare (e.g., $n^{CN}_D \approx 10$ and upper bound/lower bound $\approx 3.5$ when the sample size is 20000 in a population with a disease prevalence of 0.001).
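The sample-size figures above can be reproduced approximately with a back-of-the-envelope calculation. The sketch below rests on two simplifying assumptions that are ours, not the paper's. First, all non-diseased subjects come from the control sample, which contains diseased and non-diseased subjects in the ratio prevalence : (1 - prevalence). Second, the variance of the log disease odds is dominated by the reciprocal of the number of diseased subjects in the control sample. The function names are hypothetical.

```python
import math


def expected_n_cn_d(prevalence, n_distinct):
    """Expected number of diseased subjects in the control sample for a
    study with equal numbers of distinct diseased and non-diseased subjects."""
    n_nondiseased = n_distinct / 2.0
    return n_nondiseased * prevalence / (1.0 - prevalence)


def ci_bound_ratio(n_cn_d, z_crit=1.96):
    """Approximate ratio of upper to lower 95% confidence bound for the
    disease odds, taking Var(log odds) as roughly 1 / n_cn_d."""
    return math.exp(2.0 * z_crit / math.sqrt(n_cn_d))


common = expected_n_cn_d(0.05, 200)        # about 5.3
rarer = expected_n_cn_d(0.005, 2000)       # about 5.0
very_rare = expected_n_cn_d(0.001, 20000)  # about 10.0
ratio_at_10 = ci_bound_ratio(10)           # about 3.5
```

These rough values agree with the orders of magnitude quoted in the text; the exact curves are those of Figure 1.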
In many respects, a case-base design is better than (or at least as good as) the commonly used case-control design. First, as just mentioned, a case-base study provides estimates not only for ORs but also for risks and RRs with reasonable accuracy (if $n^{CN}_D \geq 5$). Second, the control sampling scheme of a case-base study is a simple random sampling of all subjects in the study population without regard to disease status. This means that a researcher can initiate the control recruitment process much earlier in a case-base design (at the outset of the study) than in a case-control design (at the end of the study). Third, although some people could be sampled more than once in a case-base study, the sampling itself incurs minimal cost. The real cost constraint is usually the total number of distinct subjects that are actually recruited. With the same total number of distinct subjects, a case-base study and a case-control study have exactly the same statistical efficiency when it comes to estimating an OR. Finally, as shown in this study, the analysis of a case-base study is no more complicated than that of a case-control study: one needs only to fit a logistic regression model to the data and then do one extra step of simple calculations on the model parameters.

Supporting Information
Exhibit S1 Comparison of Sato's formulas and the formulas derived in this paper when there is only one single binary exposure.

Author Contributions
Conceived and designed the experiments: WCL. Performed the experiments: TTC. Analyzed the data: TTC. Contributed reagents/materials/analysis tools: WCL. Wrote the paper: TTC WCL.