The authors have declared that no competing interests exist.

Conceived and designed the experiments: WCL. Performed the experiments: TTC. Analyzed the data: TTC. Contributed reagents/materials/analysis tools: WCL. Wrote the paper: TTC WCL.

Both the absolute risk and the relative risk (RR) have a crucial role to play in epidemiology. RR is often approximated by the odds ratio (OR) under the rare-disease assumption in a conventional case-control study; however, such a study design does not provide an estimate of the absolute risk. The case-base study is an alternative approach that readily produces RR estimates without resorting to the rare-disease assumption. However, previous researchers considered only a single dichotomous exposure and did not elaborate on how absolute risks can be estimated in a case-base study. In this paper, the authors propose a logistic model for the case-base study. The model is flexible enough to admit multiple exposures on any measurement scale: binary, categorical or continuous. It can be easily fitted using common statistical packages. With one additional step of simple calculations on the model parameters, one readily obtains relative and absolute risk estimates as well as their confidence intervals. Monte-Carlo simulations show that the proposed method can produce unbiased estimates and confidence intervals with adequate coverage for ORs, RRs and absolute risks. The case-base study, with all its desirable properties and with its methods of analysis fully developed in this paper, may become a mainstay in epidemiology.

Both the absolute and the relative disease risks have a crucial role to play in epidemiology. The relative risk (RR) is the ratio of the disease risk for individuals at one specific exposure level to the disease risk for those at a reference level. Under the rare-disease assumption, RR is approximated by the odds ratio (OR), which in turn can be conveniently estimated in a case-control study. While an index such as RR or OR may be adequate for etiologic inferences, it tells only part of the story. Once a factor has been demonstrated to be a risk factor for the disease, we will often be asked to predict the disease risk of an individual having a specific level of an exposure, that is, the absolute risk. Unfortunately, the conventional case-control study does not provide an estimate of it.
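To make the rare-disease approximation concrete, the following minimal sketch computes RR and OR from a pair of risks. The risk values are purely illustrative assumptions and are not taken from this paper; the helper names `odds` and `rr_and_or` are likewise hypothetical.

```python
def odds(p):
    """Convert a probability into odds."""
    return p / (1.0 - p)


def rr_and_or(risk_exposed, risk_reference):
    """Return (relative risk, odds ratio) for two absolute risks."""
    rr = risk_exposed / risk_reference
    or_ = odds(risk_exposed) / odds(risk_reference)
    return rr, or_


# Illustrative (assumed) risks: a rare disease and a common disease,
# both constructed so that the relative risk equals 2.
rare_rr, rare_or = rr_and_or(0.002, 0.001)
common_rr, common_or = rr_and_or(0.4, 0.2)

print(f"rare disease:   RR = {rare_rr:.3f}, OR = {rare_or:.3f}")
print(f"common disease: RR = {common_rr:.3f}, OR = {common_or:.3f}")
```

In the rare-disease pair the OR is nearly identical to the RR, whereas in the common-disease pair the OR clearly overstates it. In both cases the OR alone says nothing about the size of the absolute risks.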

Kupper et al

While the case-cohort study has been gaining popularity over the years

In this paper, we develop a logistic model for the case-base study. The model is flexible enough to admit multiple exposures on any measurement scale: binary, categorical or continuous. It can be easily fitted using common statistical packages. With one additional step of simple calculations on the model parameters, one readily obtains relative and absolute risk estimates as well as their confidence intervals. We use Monte-Carlo simulations to study the statistical properties of the proposed method.

Let the exposure profile of a subject be denoted by a

where

In a case-base study, the ‘cases’ are randomly selected from all the incident diseased subjects in the population. Let

or more concisely,

where

where

The event of

Let

From

Suppose that there are a total of ^{th} subject, the exposure profile, the disease status, and the control and the case sampling statuses are

Because

Both

and

where

The

The

An estimate of the disease risk for subjects in the population with an exposure profile vector

The variance of the estimate (in logit scale) is

where

Using the delta method, the variance of the estimate (in log scale) is

where

Note that if
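As a generic illustration of a delta-method calculation on the log scale, the sketch below computes a log RR and its approximate variance when each risk is the inverse logit of a linear predictor. It rests on assumptions that are not taken from this paper: the coefficient vector `beta` and its covariance matrix `cov` are assumed to describe the population risk model directly, the covariance values are hypothetical, and the function name `log_rr_delta` is invented for the example. It is not necessarily the exact expression derived here.

```python
import math


def expit(x):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-x))


def log_rr_delta(beta, cov, x1, x0):
    """Log RR of profile x1 versus x0 and its delta-method variance.

    Assumes risk(x) = expit(x . beta), with cov the covariance of beta.
    """
    eta1 = sum(b * v for b, v in zip(beta, x1))
    eta0 = sum(b * v for b, v in zip(beta, x0))
    p1, p0 = expit(eta1), expit(eta0)
    log_rr = math.log(p1) - math.log(p0)
    # d/d(beta) of log expit(x . beta) equals (1 - p) * x
    grad = [(1.0 - p1) * a - (1.0 - p0) * b for a, b in zip(x1, x0)]
    k = len(beta)
    var = sum(grad[i] * cov[i][j] * grad[j]
              for i in range(k) for j in range(k))
    return log_rr, var


# Intercept and log OR set to the true values of the binary-exposure
# scenario; the covariance matrix is a hypothetical placeholder.
beta = [-2.5465, 0.9163]
cov = [[0.025, -0.020],
       [-0.020, 0.018]]
log_rr, var = log_rr_delta(beta, cov, x1=[1, 1], x0=[1, 0])
half_width = 1.959964 * math.sqrt(var)
print(f"log RR = {log_rr:.4f}, variance = {var:.5f}")
print(f"95% CI for RR: ({math.exp(log_rr - half_width):.3f}, "
      f"{math.exp(log_rr + half_width):.3f})")
```

The confidence interval is built on the log scale and then exponentiated, which keeps both limits positive.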

We perform Monte-Carlo simulations to examine the statistical properties of the proposed method. We consider three scenarios for the exposure. In the first scenario, we assume a binary exposure (

In the second scenario, we assume an exposure with four levels (

In the third scenario, we assume two binary exposures (_{1} = 0.9163), and the OR comparing _{2} = 1.0986). For simplicity, we assume that

The disease probabilities of subjects in the study population are assumed to follow the logistic model in model 1 with the parameter settings given in the preceding paragraphs. A case-base study is conducted in a study population of size 100000 with a case sampling probability (
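The sampling scheme can be sketched for a single replicate with one binary exposure as follows. The population size and the logistic parameters follow the binary-exposure setting, but the exposure prevalence and both sampling probabilities are hypothetical placeholders, and per-subject Bernoulli sampling is an assumption of this sketch. The estimator shown is the crude ratio of exposure odds among sampled cases to exposure odds in the base sample, used only for illustration; it is not the regression method proposed in this paper.

```python
import math
import random


def expit(x):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-x))


def simulate_case_base(n_pop, exposure_prev, intercept, log_or,
                       case_prob, base_prob, rng):
    """Simulate one case-base study; return exposure counts.

    cases[x]: sampled diseased subjects with exposure x
    base[x]:  base-sample subjects with exposure x, drawn from the
              whole population regardless of disease status
    """
    cases = [0, 0]
    base = [0, 0]
    for _ in range(n_pop):
        x = 1 if rng.random() < exposure_prev else 0
        diseased = rng.random() < expit(intercept + log_or * x)
        if diseased and rng.random() < case_prob:
            cases[x] += 1
        if rng.random() < base_prob:
            base[x] += 1
    return cases, base


def crude_log_rr(cases, base):
    """Log of (exposure odds in cases) / (exposure odds in base sample)."""
    return math.log((cases[1] / cases[0]) / (base[1] / base[0]))


rng = random.Random(2024)
cases, base = simulate_case_base(
    n_pop=100_000,
    exposure_prev=0.3,   # hypothetical
    intercept=-2.5465,   # true logit(risk_0) in the first scenario
    log_or=0.9163,       # true log OR in the first scenario
    case_prob=0.1,       # hypothetical
    base_prob=0.01,      # hypothetical
    rng=rng,
)
print("sampled cases by exposure:", cases)
print("base sample by exposure:  ", base)
print("crude log RR estimate:", round(crude_log_rr(cases, base), 4))
```

Because the base sample represents the whole population rather than only the non-diseased, this exposure-odds ratio targets the RR directly and needs no rare-disease assumption.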

The simulation is repeated 10,000 times for each setting. The means of the estimates for ORs (in log scale), RRs (in log scale) and risks (in logit scale) are calculated. The variance of an estimate is calculated as the sample variance of the estimates across the simulation replicates. We also calculate the coverage probability and the average length of the 95% CIs for the estimates.
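The four summary measures can be computed as in the following sketch. It assumes Wald-type 95% CIs (estimate plus or minus 1.96 standard errors), and the function name `summarize` is hypothetical. The demonstration uses a toy stand-in estimator, the mean of normal draws, rather than the case-base estimator, purely to show the bookkeeping.

```python
import math
import random
import statistics

Z_975 = 1.959964  # 97.5th percentile of the standard normal


def summarize(estimates, std_errors, true_value):
    """Mean, sample variance, CI coverage and average CI length."""
    lowers = [e - Z_975 * s for e, s in zip(estimates, std_errors)]
    uppers = [e + Z_975 * s for e, s in zip(estimates, std_errors)]
    covered = [lo <= true_value <= hi for lo, hi in zip(lowers, uppers)]
    return {
        "mean": statistics.fmean(estimates),
        "variance": statistics.variance(estimates),  # sample variance
        "coverage": sum(covered) / len(covered),
        "avg_length": statistics.fmean(
            hi - lo for lo, hi in zip(lowers, uppers)),
    }


# Toy stand-in: estimate a known mean from n normal draws per replicate.
rng = random.Random(1)
true_value, n, n_reps = 0.9163, 50, 2000
ests, ses = [], []
for _ in range(n_reps):
    sample = [rng.gauss(true_value, 1.0) for _ in range(n)]
    ests.append(statistics.fmean(sample))
    ses.append(statistics.stdev(sample) / math.sqrt(n))

summary = summarize(ests, ses, true_value)
for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```

An estimator with little bias shows a mean close to the true value, and a well-calibrated interval shows coverage close to 0.95.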

| Quantity | The present method | Sato | Miettinen |
|---|---|---|---|
| Estimate [true value] | | | |
| logOR [0.9163] | 0.9191 | - | - |
| logRR [0.8128] | 0.8148 | 0.8149 | 0.8149 |
| logit(risk_{0}) [-2.5465] | -2.5559 | - | - |
| logit(risk_{1}) [-1.6303] | -1.6369 | - | - |
| Variance (×100) | | | |
| logOR | 1.8297 | - | - |
| logRR | 1.3984 | 1.3984 | 1.5017 |
| logit(risk_{0}) | 2.5622 | - | - |
| logit(risk_{1}) | 3.0710 | - | - |
| Coverage probability of 95% CI | | | |
| logOR | 0.9521 | - | - |
| logRR | 0.9518 | 0.9518 | 0.9518 |
| logit(risk_{0}) | 0.9512 | - | - |
| logit(risk_{1}) | 0.9497 | - | - |
| Average length of 95% CI | | | |
| logOR | 0.5324 | - | - |
| logRR | 0.4657 | 0.4657 | 0.4825 |
| logit(risk_{0}) | 0.6220 | - | - |
| logit(risk_{1}) | 0.6818 | - | - |
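The true values in the table for the binary-exposure scenario are reported on the logit and log scales. As a reading aid, the short sketch below back-transforms the two true logit risks to probabilities and checks that they reproduce the tabulated true log OR and log RR. The variable names are hypothetical; the numbers are the bracketed true values from that table.

```python
import math


def expit(x):
    """Inverse logit: maps a logit-scale value back to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


# Bracketed true values from the binary-exposure table.
logit_risk0 = -2.5465
logit_risk1 = -1.6303

risk0 = expit(logit_risk0)
risk1 = expit(logit_risk1)

# The log OR is a difference of logits; the log RR is a difference of log risks.
log_or = logit_risk1 - logit_risk0
log_rr = math.log(risk1) - math.log(risk0)

print(f"risk_0 = {risk0:.4f}, risk_1 = {risk1:.4f}")
print(f"log OR = {log_or:.4f}, log RR = {log_rr:.4f}")
```

Both risks are well above what would usually be called rare, which is why the true log OR and the true log RR in that table differ noticeably.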

| Quantity | The present method | Sato | Miettinen |
|---|---|---|---|
| Estimate [true value] | | | |
| logOR comparing adjacent levels [0.9163] | 0.9189 | - | - |
| logRR_{1} [0.8629] | 0.8655 | 0.8654 | 0.8654 |
| logRR_{2} [1.6569] | 1.6615 | 1.6648 | 1.6668 |
| logRR_{3} [2.3203] | 2.3253 | 2.3278 | 2.3297 |
| logit(risk_{0}) [-3.2708] | -3.2845 | - | - |
| logit(risk_{1}) [-2.3545] | -2.3656 | - | - |
| logit(risk_{2}) [-1.4383] | -1.4468 | - | - |
| logit(risk_{3}) [-0.5220] | -0.5279 | - | - |
| Variance (×100) | | | |
| logOR comparing adjacent levels | 0.4854 | - | - |
| logRR_{1} | 0.4586 | 2.4588 | 2.5149 |
| logRR_{2} | 1.5899 | 3.6685 | 4.0080 |
| logRR_{3} | 2.6760 | 2.9777 | 3.4950 |
| logit(risk_{0}) | 2.9127 | - | - |
| logit(risk_{1}) | 2.3802 | - | - |
| logit(risk_{2}) | 2.8184 | - | - |
| logit(risk_{3}) | 4.2274 | - | - |
| Coverage probability of 95% CI | | | |
| logOR comparing adjacent levels | 0.9536 | - | - |
| logRR_{1} | 0.9533 | 0.9563 | 0.9556 |
| logRR_{2} | 0.9530 | 0.9487 | 0.9493 |
| logRR_{3} | 0.9518 | 0.9526 | 0.9523 |
| logit(risk_{0}) | 0.9518 | - | - |
| logit(risk_{1}) | 0.9504 | - | - |
| logit(risk_{2}) | 0.9505 | - | - |
| logit(risk_{3}) | 0.9505 | - | - |
| Average length of 95% CI | | | |
| logOR comparing adjacent levels | 0.2731 | - | - |
| logRR_{1} | 0.2657 | 0.6243 | 0.6319 |
| logRR_{2} | 0.4952 | 0.7478 | 0.7814 |
| logRR_{3} | 0.6437 | 0.6783 | 0.7330 |
| logit(risk_{0}) | 0.6677 | - | - |
| logit(risk_{1}) | 0.6011 | - | - |
| logit(risk_{2}) | 0.6531 | - | - |
| logit(risk_{3}) | 0.8007 | - | - |

| Quantity | The present method | Sato | Miettinen |
|---|---|---|---|
| Estimate [true value] | | | |
| logOR_{1} [0.9163] | 0.9206 | - | - |
| logOR_{2} [1.0986] | 1.1017 | - | - |
| logRR_{10} [0.8536] | 0.8571 | 0.8580 | 0.8585 |
| logRR_{01} [1.0159] | 1.0184 | 1.0193 | 1.0197 |
| logRR_{11} [1.7678] | 1.7724 | 1.7741 | 1.7754 |
| logit(risk_{00}) [-3.0995] | -3.1087 | - | - |
| logit(risk_{10}) [-2.1832] | -2.1880 | - | - |
| logit(risk_{01}) [-2.0008] | -2.0070 | - | - |
| logit(risk_{11}) [-1.0846] | -1.0863 | - | - |
| Variance (×100) | | | |
| logOR_{1} | 2.0187 | - | - |
| logOR_{2} | 1.8573 | - | - |
| logRR_{10} | 1.7228 | 3.2565 | 3.3754 |
| logRR_{01} | 1.5893 | 2.4743 | 2.5707 |
| logRR_{11} | 3.0231 | 3.0867 | 3.3906 |
| logit(risk_{00}) | 3.1880 | - | - |
| logit(risk_{10}) | 3.5971 | - | - |
| logit(risk_{01}) | 3.0930 | - | - |
| logit(risk_{11}) | 3.8039 | - | - |
| Coverage probability of 95% CI | | | |
| logOR_{1} | 0.9490 | - | - |
| logOR_{2} | 0.9503 | - | - |
| logRR_{10} | 0.9492 | 0.9508 | 0.9509 |
| logRR_{01} | 0.9508 | 0.9510 | 0.9486 |
| logRR_{11} | 0.9484 | 0.9487 | 0.9532 |
| logit(risk_{00}) | 0.9481 | - | - |
| logit(risk_{10}) | 0.9470 | - | - |
| logit(risk_{01}) | 0.9465 | - | - |
| logit(risk_{11}) | 0.9487 | - | - |
| Average length of 95% CI | | | |
| logOR_{1} | 0.5534 | - | - |
| logOR_{2} | 0.5323 | - | - |
| logRR_{10} | 0.5114 | 0.7034 | 0.7161 |
| logRR_{01} | 0.4923 | 0.6149 | 0.6257 |
| logRR_{11} | 0.6788 | 0.6862 | 0.7224 |
| logit(risk_{00}) | 0.6875 | - | - |
| logit(risk_{10}) | 0.7300 | - | - |
| logit(risk_{01}) | 0.6767 | - | - |
| logit(risk_{11}) | 0.7525 | - | - |

Logistic regression is a standard technique for analyzing case-control data. It is also the method of choice for analyzing cohort data when time-to-event information is not available. However, the ORs that it estimates approximate the RRs only under the rare-disease assumption. As such, many methods and recommendations have been proposed to date for estimating RRs in cohort studies for common outcomes

In addition to the usual ORs, a case-base study also provides estimates for risks (

In many respects, a case-base design is better than (or at least as good as) the commonly used case-control design. First, as just mentioned, a case-base study provides estimates not only for ORs but also for risks and RRs with reasonable accuracy (if

Comparison of Sato’s formulas and the formulas derived in this paper when there is only one single binary exposure.

(DOC)

Simulation results for an exposure with four levels but without the constant OR assumption.

(DOCX)

Simulation results when the exposure is in a continuous scale.

(DOCX)

Simulation results when there is an interaction effect between the two exposures.

(DOCX)

Simulation results for a confounder.

(DOCX)

Simulation results when the disease prevalence is lower.

(DOCX)