Evidence-based pregnancy testing in clinical trials: Recommendations from a multi-stakeholder development process

Most clinical trials exclude pregnant women in order to avoid the possibility of adverse embryonic and/or fetal effects. Currently, there are no evidence-based guidelines regarding appropriate methods for identifying early pregnancy among research subjects. This lack of guidance results in wide variation in pregnancy testing plans, leading to the potential for inadequate protection against embryonic or fetal exposure in some cases and unnecessary burdens on research participants in others, as well as inefficiencies caused by disagreements among sponsors, investigators, and regulators. To address this issue, the Clinical Trials Transformation Initiative (CTTI) convened content experts and stakeholders to develop recommendations for pregnancy testing in clinical research based on currently available evidence. Recommendations included: 1) the study protocol should clearly state the rationale for pregnancy testing and the plan for handling positive and indeterminate tests; 2) protocols should include an assessment of the advantages of the pregnancy testing plan (reduced risk of embryo/fetal exposure) versus its burdens (participant burden, study team workload, costs); 3) protocols should assess the participant burdens regarding the likelihood of false negative and false positive results; 4) participant-administered home pregnancy testing should be avoided in clinical trials; and 5) the consent process should describe the extent of knowledge about the study intervention’s potential risk to the embryo/fetus and the limitations and consequences of pregnancy testing. CTTI has also developed an online tool to help implement these recommendations.


II. Structure of the Model
We constructed a Markov model to simulate pregnancy testing of women of reproductive potential enrolled in a clinical trial excluding pregnant women (Figure 1); note that the model could also be applied to the evaluation of postmarketing pregnancy testing protocols. The model consists of 3 mutually exclusive states: Not Pregnant, Undetected Pregnancy, and Detected Pregnancy. In a Markov model, "subjects" can transition between states during a "cycle" or "stage." In this model, the length of one Markov cycle is 1 day. Each day, women in the Not Pregnant state can transition to the Undetected Pregnancy state, with a probability dependent on age, contraceptive method, and time in the menstrual cycle, or remain Not Pregnant. Women with an Undetected Pregnancy will ultimately have their pregnancy detected either through testing done for the clinical trial, or based on signs and symptoms of pregnancy.
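The three-state structure can be illustrated with a minimal sketch. The function names and the constant daily probabilities `p_conceive` and `p_detect` are hypothetical placeholders; in the model described here, the conception probability varies with age, contraceptive method, and menstrual cycle day, and detection depends on testing and symptoms.

```python
import random

STATES = ("Not Pregnant", "Undetected Pregnancy", "Detected Pregnancy")

def step(state, p_conceive, p_detect, rng):
    """Advance one subject by a single 1-day Markov cycle."""
    if state == "Not Pregnant":
        # Either conceive today or remain Not Pregnant.
        return "Undetected Pregnancy" if rng.random() < p_conceive else state
    if state == "Undetected Pregnancy":
        # Detection by a trial test or by signs and symptoms.
        return "Detected Pregnancy" if rng.random() < p_detect else state
    # Detected Pregnancy is treated as absorbing in this sketch.
    return state

def simulate(days, p_conceive, p_detect, seed=0):
    """Return the daily state path of one simulated subject."""
    rng = random.Random(seed)
    state = "Not Pregnant"
    path = [state]
    for _ in range(days):
        state = step(state, p_conceive, p_detect, rng)
        path.append(state)
    return path
```

Running `simulate` many times with different seeds and averaging the results corresponds to the microsimulation approach.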
Pregnancy testing is performed at the start of the simulation, with the probability of true negative, false negative, true positive, and false positive results dependent on the prior probability of pregnancy, the level of hCG in each state (a function of both subject age and the developmental stage of pregnancy), the assay sensitivity for detection of hCG, and the threshold used to define a "positive" test. Based on the duration of the "trial", follow-up testing can be performed at any interval.
The model is run as a stochastic microsimulation: multiple iterations of the model are run, with each iteration representing the experience of an "individual subject." Results from each "subject" are then summed and averaged across all iterations (typically, between 10,000 and 1,000,000). At the start of each iteration, a "subject's" age is drawn from one of the distributions described below. The value for age is then used to draw from distributions for age-specific prevalence of menopause and hysterectomy; "subjects" between the ages of 15 and 55 who are not post-hysterectomy or menopausal are then "assigned" a contraceptive method based on an age-specific distribution. In the base case analysis, we assume that the initial screen is performed independently of the menstrual cycle day; the potential impact of testing timed to the menstrual cycle is examined in subsequent analyses.
Specific types of statistical distributions were chosen based on the type of data and their fit to the reported data. Distributions included:
• Normal: the classic bell curve; mean and median values are equivalent. The potential range of a normal distribution is -∞ to ∞.
• Gamma: gamma distributions are frequently used as an alternative to the normal distribution when values below 0 are not possible.
• Lognormal: the natural logarithms of the values are normally distributed. Like gamma distributions, lognormal distributions have a minimum value of 0, and are often used to describe values that have a particularly wide range (such as hCG levels). Lognormal distributions are also used with ratios, such as relative risks, odds ratios, and hazard ratios.
• Beta: beta distributions have a minimum value of 0 and a maximum of 1.0, and are used for values that are constrained by these limits (such as probabilities when there are only two possible outcomes, such as hysterectomy or no hysterectomy).
• Dirichlet: a Dirichlet distribution is a multivariate generalization of the beta distribution, used when there are more than two possible outcomes or states (such as contraceptive method).
• Weibull: Weibull distributions are used for survival functions and allow variation in the hazard function (i.e., the risk of an event can change with time, rather than remain constant).
• Uniform: the probability of any outcome or value is equal (for example, menstrual cycle day if testing is not timed relative to menses or other marker).
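As an illustration of how a Dirichlet draw relates to the other distributions listed, the sketch below uses the standard construction of normalizing independent gamma draws. The function name and the example weights are hypothetical and are not taken from the model's data tables.

```python
import random

def draw_dirichlet(alphas, rng):
    """Draw one probability vector from a Dirichlet distribution.

    Each component is a gamma draw with shape alpha_i and scale 1;
    dividing by the total yields proportions that sum to 1.
    """
    gammas = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(gammas)
    return [g / total for g in gammas]

# Hypothetical pseudo-counts for three contraceptive method categories.
rng = random.Random(42)
method_probabilities = draw_dirichlet([120.0, 60.0, 20.0], rng)
```

With only two categories this reduces to a beta draw, which is why the beta distribution is used for binary outcomes such as hysterectomy status.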

III. Model Parameters
All models require certain simplifying assumptions and decisions about which data sources to use to generate estimates for the model parameters. As much as possible, we erred on the side of overestimating pregnancy risk. Age is the single most important determinant of pregnancy risk, and therefore the expected age distribution among trial subjects is the single most important predictor of the likelihood of pregnancy at the time of enrollment or during the trial.

A. Subject Age
Reported means and standard deviations for age were taken from 4 published trials of conditions that are relatively common in reproductive age women and frequently treated with drugs with known risks to the fetus from exposure during pregnancy (Table 1), although not every study involved exposure to a potentially teratogenic agent. The studies were identified through a PubMed search using the terms "randomized trial" and the specific condition, and were chosen to provide examples of "typical" age ranges for trials in reproductive age women.
Because normal distributions with these values for mean and standard deviations often resulted in values outside of a plausible range (or the reported range of the study), we characterized the age of each subject population with lognormal distributions, which were a better fit than the alternative gamma distributions.
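One common way to obtain a lognormal distribution from a reported mean and standard deviation is moment matching, sketched below. The fitting method actually used is not stated here, so this is an assumption, and the example mean and standard deviation are hypothetical rather than values from Table 1.

```python
import math
import random

def lognormal_from_mean_sd(mean, sd):
    """Return (mu, sigma) of a lognormal with the given mean and SD."""
    sigma_sq = math.log(1.0 + (sd / mean) ** 2)
    mu = math.log(mean) - sigma_sq / 2.0
    return mu, math.sqrt(sigma_sq)

# Hypothetical trial population: mean age 30 years, SD 8 years.
mu, sigma = lognormal_from_mean_sd(30.0, 8.0)
rng = random.Random(0)
ages = [rng.lognormvariate(mu, sigma) for _ in range(5)]
```

Because lognormal draws are always positive, this avoids the negative or implausibly low ages that a normal distribution with the same mean and standard deviation can produce.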
Although age is the most important determinant of risk of pregnancy among women of reproductive potential, the probability of NOT being at risk of pregnancy is also dependent on age, since both prevalence of hysterectomy and menopause vary with age. Thus, the age distribution in a given clinical research study will affect not only the probability of pregnancy among women who are physiologically able to become pregnant, but also the overall probability of pregnancy in the study.

B. Hysterectomy Status
Age-specific hysterectomy prevalence was estimated directly from the 2010 Behavioral Risk Factor Surveillance System (BRFSS), a population-based survey. 5 We used total population estimates. Because hysterectomy rates vary dramatically among racial/ethnic groups (in particular, African-American women are more likely to undergo hysterectomy at younger ages than other groups), race-specific estimates might be important for specific conditions where prevalence or incidence varies by race/ethnicity.

C. Menopause
Figure 2 depicts the estimated age-specific prevalence of menopause in women with an intact uterus, derived from published data from the Study of Women's Health across the Nation (SWAN). 6 Estimates for ages 41-52 are based on the published Kaplan-Meier curve in Gold et al, 6 while estimates for ages 53-55 are based on fitting a Weibull distribution to the survival function based on the reported median age at menopause for the entire population (51.4 years) and the estimated cumulative probability of menopause at age 52 based on the published K-M curve (52 years). We assumed that all women would be functionally menopausal at age 55, which was consistent with the estimated cumulative probability based on the Weibull distribution (99.2%). The published data from SWAN do not provide sufficient information to estimate confidence intervals around the age-specific prevalence. We also did not incorporate variation in age at menopause based on race/ethnicity, smoking history, or other factors, although, again, these could be readily incorporated if appropriate for a particular scenario.
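A two-parameter Weibull curve can be fitted exactly through two points of a cumulative probability curve, which is one way to carry out the kind of extrapolation described for ages 53-55. The sketch below assumes time is measured in years of age from zero, which may differ from the parameterization actually used; the cumulative probability at age 52 (`P_AT_52`) is a hypothetical placeholder, since the value read from the K-M curve is not given here.

```python
import math

def fit_weibull_two_points(t1, p1, t2, p2):
    """Solve F(t) = 1 - exp(-(t/scale)**shape) through two (t, p) points."""
    y1 = math.log(-math.log(1.0 - p1))
    y2 = math.log(-math.log(1.0 - p2))
    shape = (y2 - y1) / (math.log(t2) - math.log(t1))
    scale = t1 / (-math.log(1.0 - p1)) ** (1.0 / shape)
    return shape, scale

def weibull_cdf(t, shape, scale):
    """Cumulative probability of the event by time t."""
    return 1.0 - math.exp(-((t / scale) ** shape))

MEDIAN_AGE = 51.4   # reported median age at menopause
P_AT_52 = 0.60      # hypothetical placeholder, not from the source

shape, scale = fit_weibull_two_points(MEDIAN_AGE, 0.5, 52.0, P_AT_52)
prevalence_53_to_55 = [weibull_cdf(age, shape, scale) for age in (53, 54, 55)]
```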

D. Menstrual Cycle Characteristics
We used recent estimates of the duration and variability of the menstrual cycle among healthy premenopausal women 7 (Table 3). At the start of each "individual" simulation, a cycle length is drawn from the cycle length distribution, and a value for cycle variability is also drawn. For simulations that last longer than one menstrual cycle, days are added or subtracted to the cycle length based on the variation (i.e., if the variability value is 2, the range of possible values is the initial cycle length value ± (2/0.5) days). Similar calculations are done for luteal phase length and variability. Within each cycle, the day of the serum LH surge was calculated as the cycle length minus the luteal phase length. This method allows for both between- and within-individual variation in menstrual cycle length, while allowing appropriate modeling of day-specific hCG levels in the event of pregnancy.
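The LH surge calculation can be sketched as follows. The normal distributions and their default parameters are hypothetical placeholders, since the distributions in Table 3 are not reproduced here; only the subtraction rule comes from the text.

```python
import random

def lh_surge_day(cycle_length, luteal_length):
    """Day of the serum LH surge: cycle length minus luteal phase length."""
    return cycle_length - luteal_length

def draw_subject_cycle(rng, mean_cycle=29.0, sd_cycle=3.0,
                       mean_luteal=14.0, sd_luteal=1.5):
    """Draw one subject's cycle length and luteal length (whole days).

    All default parameters are illustrative assumptions.
    """
    cycle_length = round(rng.normalvariate(mean_cycle, sd_cycle))
    luteal_length = round(rng.normalvariate(mean_luteal, sd_luteal))
    return cycle_length, luteal_length, lh_surge_day(cycle_length, luteal_length)

rng = random.Random(7)
cycle_length, luteal_length, surge_day = draw_subject_cycle(rng)
```

For a fixed 28-day cycle with a 14-day luteal phase, `lh_surge_day(28, 14)` gives day 14.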
Hormonal contraceptives affect both the probability of ovulation and menstrual patterns. For oral contraceptives, rings, and patches, we assigned a fixed cycle length of 28 days, and a fixed luteal phase length of 14 days. For simplicity, we did not attempt to model the effects of long-acting hormonal contraceptives on bleeding patterns; we did assume that protocols for continued testing in women using these methods would occur at fixed intervals in all cases, rather than being timed to menses.
We did not attempt to model age-specific variability in menstrual cycle length, although this is potentially an important consideration in women as they approach menopause.

E. Age-specific Conception Probability
Estimates of the cycle-specific probability of a detected pregnancy in couples not using contraception are available from a variety of sources, and include couples using natural family planning methods, 8,9 couples actively trying to conceive, 10-12 and couples undergoing artificial insemination where the female partner has normal reproductive function. 13,14 Differences between populations, including prior fertility history, age distribution, coital frequency, intent to conceive, and timing and method of conception confirmation, contribute to variation in the estimates. All of these populations are likely to have higher fecundability than many potential clinical trial subjects with acute or chronic illness.
These estimates range from approximately 20% per cycle in nulliparous women under 30 undergoing donor insemination when only clinical pregnancies were the primary outcome 13 to 40% per cycle in Chinese textile workers aged 20-34 who were actively trying to conceive when daily urines were tested with a highly sensitive hCG assay. 12 For the purposes of this analysis, we used an estimate of 30% per cycle for women 30 and younger, which is consistent with the observed rate in early cycles in a US population-based study, 15 and with a European study of couples using natural family planning methods who had a coital frequency of twice a week. 9 Although all prospective studies show a decline in cycle-specific probability as the number of cycles increases beyond 2 or 3 (representing heterogeneity in underlying fertility), we conservatively assumed that the cycle-specific probability remained constant over the duration of the simulation (which leads to an overestimation of cumulative pregnancy probability).
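Under the constant-probability assumption, the cumulative probability of pregnancy over several cycles follows from simple compounding, as in this minimal sketch (the function name is hypothetical):

```python
def cumulative_pregnancy_probability(p_cycle, n_cycles):
    """Probability of at least one conception in n_cycles cycles,
    assuming the same independent probability in every cycle."""
    return 1.0 - (1.0 - p_cycle) ** n_cycles

# With the 30% per-cycle estimate used for women 30 and younger:
three_cycle_risk = cumulative_pregnancy_probability(0.30, 3)
```

With a 30% per-cycle probability, the cumulative probability over 3 cycles is 1 - 0.7^3 = 0.657. Because observed per-cycle probabilities decline after the first few cycles, this constant-rate figure is an upper bound on the expected cumulative probability.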
We modeled the effect of age using published regression equations based on a Dutch study of donor insemination 13 ; in this study, a "critical age" (31 years) was identified after which there was a statistically significant progressive decline in fecundability.
Although there is evidence that cycle fecundity begins to decrease prior to age 30, 9,16 we again elected to be conservative and assumed no decline in fecundity until age 32, when the hazard ratio derived from the van Noord-Zaadstra paper was applied (Figure 3).

F. Pregnancy Outcome Probabilities
Pregnancy loss after conception is common, with a substantial proportion occurring after implantation but before pregnancy is suspected clinically. 11,12,14,15 Studies using extremely sensitive assays report early pregnancy loss rates (prior to the onset of the next menses) of 22-23%, 12,15 while rates in two studies using less sensitive assays were 13%. 11,14 Confidence intervals for the 3 US-based studies overlapped; while the Chinese study 12 was large and population-based, the population was somewhat different in terms of potential exposures that might affect pregnancy loss (for example, prevalence of exposure to second-hand tobacco smoke in the home was 65%). There were also differences in the definition of "early pregnancy loss" between studies, ranging from detectable hCG with no delay in menses 11 or "clinical detection" 12,15 to losses prior to 23 days after the serum LH surge (9 days after the expected onset of menses). 14 Because the sensitivity of the assay used in the Wilcox study was higher than those in the Lohstroh and Zinaman papers, and higher than current commercially available kits, and because 21% of the subjects with "non-clinical" pregnancy losses in the Wilcox study actually had pregnancy symptoms, 17 which lowers the proportion of truly non-clinical losses to 18%, we elected to use the lower estimates from Lohstroh and Zinaman. Because the proportions were so similar, we pooled the total pregnancies by outcome category for the two studies, and generated a Dirichlet distribution for the conditional probability of each pregnancy outcome, given conception (Table 4).
Although clinical, and probably occult, pregnancy loss rates clearly increase with age, 18 we did not attempt to model the relative conditional probability of pregnancy loss with age. We also did not include ectopic pregnancy, gestational trophoblastic neoplasia, or other uncommon outcomes.

G. Contraceptive Method
For the initial analysis, we assumed that women would continue to use their current method of contraception. Although this is appropriate for estimating the prior probability of pregnancy at the time of the initial screen, protocols may specify methods that may be used during the protocol; the model can be readily adjusted to accommodate protocol-specific methods. We also assumed that the population-based estimates of method choice would be similar to those in a patient population. While this may be true for some conditions, certain methods may be more or less prevalent in other conditions. For example, patients with acne may be more likely to be using oral contraceptives, given their effectiveness in reducing circulating androgen levels, while patients on some anti-epilepsy drugs might be less likely to be on oral contraceptives, given the potential for mutual effects on drug metabolism.
Estimates for the age-specific distribution of specific methods among women using contraception were derived from the 2006 National Survey of Family Growth (NSFG) (Table 5). 19 To minimize the number of categories for modeling, we combined methods with similar effectiveness where appropriate (oral contraceptives, contraceptive rings, and patches as one category; IUDs and hormonal implants, i.e., long-acting reversible contraceptives or LARC, as another). We assumed that women using periodic abstinence as their primary method would be instructed to use barrier methods for the duration of the trial.
Because the NSFG data are based on a weighted survey sample design, and published age-specific proportions and standard errors are based on all women, including women who do not use contraception, we could not directly estimate confidence intervals around the distribution of methods within age groups. To indirectly account for the uncertainty in these estimates, we estimated the number of actual subjects in each age category by fitting a beta distribution to the reported proportion and standard error of women using contraception in each age group, then estimating the number of women using each method based on the reported overall proportion. These numbers were then used to generate a Dirichlet distribution.
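One way to read this procedure is sketched below: a beta distribution is matched to the reported proportion and standard error by the method of moments, the sum of its parameters is taken as the implied number of subjects, and that number is multiplied by each method's overall proportion to give pseudo-counts for a Dirichlet distribution. This is an interpretation rather than a confirmed reconstruction, and the proportions, standard error, and method names are hypothetical.

```python
def beta_from_proportion_se(p, se):
    """Method-of-moments beta parameters for a proportion and its SE."""
    total = p * (1.0 - p) / se ** 2 - 1.0   # alpha + beta
    return p * total, (1.0 - p) * total

def dirichlet_pseudo_counts(p_using, se_using, overall_shares):
    """Implied number of women per method in one age group.

    overall_shares: proportion of ALL women in the age group using
    each method (hypothetical values in the example below).
    """
    alpha, beta = beta_from_proportion_se(p_using, se_using)
    n_implied = alpha + beta                # implied sample size
    return {m: share * n_implied for m, share in overall_shares.items()}

# Hypothetical age group: 62% using contraception, SE 2%.
counts = dirichlet_pseudo_counts(
    0.62, 0.02, {"pill/ring/patch": 0.25, "LARC": 0.07, "barrier": 0.30}
)
```

The resulting pseudo-counts can serve directly as Dirichlet parameters; a smaller standard error implies larger counts and therefore a tighter distribution.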

H. Contraceptive Effectiveness
We used the most recent published estimates of first-year, typical-use, 12-month method-specific failure rates 20 to generate estimates of effectiveness for each method. We made several simplifying assumptions, all of which result in higher estimates of the probability of pregnancy:
• We used "typical-use" rather than "perfect-use" estimates.
• We used first-year estimates; typically, failure rates are highest in the first year of use for any method, due to "selection" (couples with higher baseline fertility are more likely to get pregnant early), variability in user adherence in user-dependent methods such as oral contraception or barrier methods, and, for women in their 30s and 40s, age-dependent declines in fertility.
• We assumed a constant risk across all 12 months; again, failure rates are highest in the first few months after beginning use.
• We assumed that the reported relative reduction in pregnancy risk associated with each method was solely due to method effectiveness, rather than differences in the age distribution or intensity of desire to avoid pregnancy among users.
To generate estimates of the cycle-specific relative reduction in pregnancy probability associated with specific methods:
• We assumed that, during the 12 months used to estimate failure rates, there were 13 menstrual cycles with a mean duration of 28 days.
• Assuming a constant risk of pregnancy and constant cycle-specific effectiveness, we generated estimates of the cycle-specific hazard of pregnancy with each method that were consistent with the reported 12-month failure rate.
• Using the estimated cycle-specific probability for non-contraceptive users as the reference, we then estimated hazard ratios for each method (because of the way the data are reported, we could not generate confidence intervals around these estimates) (Table 6).
We then applied these hazard ratios to the age-specific pregnancy probability described in Section III.E, assuming no differential effect of contraceptive methods on detectable early pregnancy loss.
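The conversion from a 12-month failure rate to a cycle-specific hazard ratio can be sketched as follows, assuming a constant hazard over 13 cycles. The function names are hypothetical, the 9% failure rate is an illustrative input rather than a value quoted from Table 6, and the exact formula used to derive the published hazard ratios may differ.

```python
import math

CYCLES_PER_YEAR = 13  # 13 cycles of 28 days, as assumed in the text

def per_cycle_probability(failure_12mo):
    """Constant per-cycle probability consistent with a 12-month rate."""
    return 1.0 - (1.0 - failure_12mo) ** (1.0 / CYCLES_PER_YEAR)

def per_cycle_hazard(failure_12mo):
    """Constant per-cycle hazard consistent with a 12-month rate."""
    return -math.log(1.0 - failure_12mo) / CYCLES_PER_YEAR

def hazard_ratio(failure_12mo, reference_cycle_probability):
    """Hazard with the method relative to no contraception."""
    reference_hazard = -math.log(1.0 - reference_cycle_probability)
    return per_cycle_hazard(failure_12mo) / reference_hazard

# Illustrative method with a 9% 12-month failure rate, against the
# 30% per-cycle reference probability used for non-users.
hr_example = hazard_ratio(0.09, 0.30)
```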

I. hCG levels in non-pregnant women
We estimated lognormal distributions for values of hCG in IU/L in serum and urine in non-pregnant women, stratified by age, from the reported quartiles in the papers by Cole and Ladner 21 and Snyder et al 22 (Table 7).
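A lognormal distribution can be recovered from reported quartiles because the logarithms of the quartiles lie symmetrically around the log-scale mean. The sketch below shows one such approach; the fitting procedure used for Table 7 is not described, so this is an assumption, and the example quartile values are hypothetical.

```python
import math
from statistics import NormalDist

Z_75 = NormalDist().inv_cdf(0.75)  # about 0.6745

def lognormal_from_quartiles(q1, q3):
    """Return (mu, sigma) of a lognormal matching the 25th and 75th
    percentiles exactly."""
    mu = (math.log(q1) + math.log(q3)) / 2.0
    sigma = (math.log(q3) - math.log(q1)) / (2.0 * Z_75)
    return mu, sigma

# Hypothetical hCG quartiles in IU/L for one age stratum.
mu_hcg, sigma_hcg = lognormal_from_quartiles(0.5, 2.0)
```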

J. hCG levels in pregnancy
We assumed that the first day that hCG would be detectable in serum or urine would be 9 days after the serum LH peak (8 days after the urine peak), which is consistent across all studies which used LH as the marker for ovulation. 7,14,23,24 We generated Dirichlet distributions for the day of first hCG detection for live birth, clinical loss, and nonclinical losses using the values reported in Lohstroh et al 14 ; because the number of events is small, the confidence intervals are quite wide.
We assumed that hCG would not be detectable in urine prior to serum; we again used the data published by Lohstroh et al to generate a Dirichlet distribution for the probability of different lag times between appearance in serum and in urine. 14 The rise in mean hCG after first day of detection is remarkably similar across studies (the values for Nepomnaschy et al 25 are converted to IU/L from ng/mL using the reported conversion factor), with values in urine 7,23,25 being indistinguishable from those in serum 24 (Figure 4). Because of the similarity across studies, we used the values (expressed as lognormal distributions) from Johnson et al, 23 which reflect currently available assays. We assumed that the pattern of increase would be similar in serum and urine; in the event of a delay between the first detectable serum value and first urine value, we assumed the curves would be similar, but would start from the same level (i.e., the curves would be in parallel, with a difference of 1-3 days).
We assumed that patterns of hCG increase in ongoing pregnancies, and falls in losses, would follow reported regression equations. 26,27

K. Sensitivity of hCG assays
We did not attempt to quantify the analytic sensitivity of specific assays, or to quantify the impact of differential sensitivity for specific isoforms of hCG. For the purposes of comparing the probability of specific test outcomes (true and false negatives, true and false positives), we used the following values for analytic sensitivity:
• Serum hCG: 5 IU/L; because many labs report values between 5-14 or 5-24 as "indeterminate," we also used alternative values of 15 and 25 IU/L.
• Urine hCG: 20 IU/L, 25 IU/L, 50 IU/L.
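The way a threshold turns an hCG value into one of the four test outcomes can be sketched as below. Treating a value at or above the threshold as positive is an assumption, and the function name is hypothetical.

```python
SERUM_THRESHOLDS_IU_L = (5, 15, 25)
URINE_THRESHOLDS_IU_L = (20, 25, 50)

def classify_result(hcg_iu_l, pregnant, threshold_iu_l):
    """Label one test result given the true pregnancy state."""
    positive = hcg_iu_l >= threshold_iu_l   # assumed rule: >= is positive
    if pregnant:
        return "true positive" if positive else "false negative"
    return "false positive" if positive else "true negative"

# An early pregnancy with serum hCG of 10 IU/L under each threshold.
early_results = [classify_result(10, True, t) for t in SERUM_THRESHOLDS_IU_L]
```

In this example the same early pregnancy is detected at the 5 IU/L threshold but missed at 15 and 25 IU/L, which is the trade-off the alternative thresholds are meant to explore.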

L. Probability of pregnancy detection in the absence of testing
We used the reported cumulative probability of reporting symptoms of pregnancy from a prospective study of couples attempting to conceive 17 to estimate the daily probability of a clinically detected pregnancy in the absence of testing. Based on the reported median and interquartile ranges using the timing of the LH surge as the reference, we first estimated the cumulative probability for women with live births using a Weibull function, and then calculated the day-to-day conditional probability from this curve. Estimates for women with clinical losses and the 21% of women with "non-clinical" losses with symptoms were generated using the reported day-to-day relative risks applied to the cumulative probabilities, then calculating day-specific conditional probabilities. Figure 5 depicts the derived cumulative probabilities for each group.
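The step from a cumulative Weibull curve to day-to-day conditional probabilities can be sketched as follows. The shape and scale values are hypothetical placeholders, not the fitted parameters behind Figure 5.

```python
import math

def weibull_cdf(t, shape, scale):
    """Cumulative probability of symptom onset by day t after the LH surge."""
    return 1.0 - math.exp(-((t / scale) ** shape))

def daily_conditional_probability(day, shape, scale):
    """Probability of onset on a given day, conditional on no onset
    through the previous day."""
    previous = weibull_cdf(day - 1, shape, scale)
    return (weibull_cdf(day, shape, scale) - previous) / (1.0 - previous)

# Hypothetical parameters for illustration only.
SHAPE, SCALE = 3.0, 25.0
daily_probabilities = [
    daily_conditional_probability(d, SHAPE, SCALE) for d in range(1, 31)
]
```

Multiplying the daily complements (1 minus each conditional probability) recovers the Weibull survival curve, so applying these values one day at a time in the Markov model reproduces the fitted cumulative probability.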