External Validation and Calibration of IVFpredict: A National Prospective Cohort Study of 130,960 In Vitro Fertilisation Cycles

Background Accurately predicting the probability of a live birth after in vitro fertilisation (IVF) is important for patients, healthcare providers and policy makers. Two prediction models (Templeton and IVFpredict) have been previously developed from UK data and are widely used internationally. The more recent of these, IVFpredict, was shown to have greater predictive power in the development dataset. The aim of this study was external validation of the two models and comparison of their predictive ability.
Methods and Findings A total of 130,960 IVF cycles undertaken in the UK in 2008–2010 were used to validate and compare the Templeton and IVFpredict models. Discriminatory power was calculated using the area under the receiver-operator curve and calibration assessed using a calibration plot and Hosmer-Lemeshow statistic. The scaled modified Brier score, with measures of reliability and resolution, was calculated to assess overall accuracy. Both models were compared after updating for current live birth rates to ensure that the average observed and predicted live birth rates were equal. The discriminative power of both methods was comparable: the area under the receiver-operator curve was 0.628 (95% confidence interval (CI): 0.625–0.631) for IVFpredict and 0.616 (95% CI: 0.613–0.620) for the Templeton model. IVFpredict had markedly better calibration and higher diagnostic accuracy, with calibration plot intercept of 0.040 (95% CI: 0.017–0.063) and slope of 0.932 (95% CI: 0.839–1.025) compared with 0.080 (95% CI: 0.044–0.117) and 1.419 (95% CI: 1.149–1.690) for the Templeton model. Both models underestimated the live birth rate, but this was particularly marked in the Templeton model. Updating the models to reflect improvements in live birth rates since the models were developed enhanced their performance, but IVFpredict remained superior.
Conclusion External validation in a large population cohort confirms IVFpredict has superior discrimination and calibration for informing patients, clinicians and healthcare policy makers of the probability of live birth following IVF.



Introduction
For a patient or couple considering in-vitro fertilisation (IVF) the most important prognosis is that of a live birth, and for the clinician advising them it is important to be able to provide an accurate assessment of that prognosis [1]. For policy makers, precise estimates of prognosis are essential to model the population burden of infertility and treatment, and to inform cost-effective healthcare provision [2,3]. As clinicians' assessments of prognosis vary widely [4,5], several prediction models have been developed to give the prognosis of live birth based on patient and couple characteristics and measurements, in order to better inform patients and clinicians [6-14]. Many of these models include measurements that would not be available prior to commencing the first cycle of IVF, and hence have limited ability to inform decisions, and/or have not been externally validated.
Two prediction models have been developed using data from the UK [7,12], where there is a statutory requirement to maintain a national record of every initiated IVF cycle and its outcome. The first of these, the Templeton model [7], has been widely used [1], externally validated [12,15-18], and recommended as the best model in systematic reviews [1,15]. However, it was developed using data from couples who received IVF two decades ago, when successful live birth rates were considerably lower than currently, and before the introduction of intra-cytoplasmic sperm injection (ICSI), which has transformed the treatment of male infertility [19]. Recently, we developed IVFpredict in the largest IVF prediction study to date (144,018 cycles) and added prognostic characteristics, including ICSI, to those used in the Templeton model [12]. We demonstrated that IVFpredict had superior discrimination and calibration to the Templeton model [12], and this model is increasingly used internationally. A recent Dutch study of 5,176 treatment cycles has externally validated IVFpredict but also showed that the Templeton model had similar discrimination and calibration [18]. However, in addition to having a relatively small sample size, that study only included couples with primary infertility and excluded those who used donor eggs. The authors acknowledged that these limitations were particularly likely to adversely affect the calibration of the IVFpredict model. Furthermore, the authors were only able to examine prediction of pregnancy, rather than live birth, though a correction factor was used in an attempt to produce an estimated live birth rate.
The live birth rate from IVF has continued to increase since the development of IVFpredict [20], and the mix of patients referred for IVF has changed [17]. Hence it is possible that IVFpredict, as well as the Templeton model, will need to be updated for accurate use in a new cohort [15]. Indeed, in the Dutch study described above, both the Templeton and IVFpredict models performed better if adjustments were made for the pregnancy/live birth success rate in each cohort [18]. This raises the issue of whether, even with good prediction models that have been externally validated, their application in practice has to take account of success rates in the particular population that the couple, their clinician and policy makers might consider they belong to. Lastly, IVFpredict used broad female age categories, as during its development we were only provided with data that allocated each treatment cycle to the woman's age category publicly reported by the Human Fertilisation and Embryology Authority (HFEA). This has been criticised as likely to result in marked over-fitting of our model [18]. As female age is the strongest predictor of live birth success [7], it is possible that female age alone could accurately predict successful outcome, which would be a simple useful tool for all patients, clinicians and policy makers.
The purpose of this study is to perform validation of the IVFpredict and Templeton prediction models on a new UK cohort of IVF cycles. We aim to compare the predictive ability of the two models and to examine how much each prediction model requires updating in a new sample where success rates vary from those in the cohort used to originally develop the model. We also aim to quantify the value, in terms of predictive ability, of including covariates other than female age in the prediction models. In the validation sample used here we have female age in years at each cycle, and we also explore whether the predictive ability of IVFpredict is improved by using female age as a continuous variable. Since the quantity and quality of validation studies have been criticised [21], we aim to make use of a large validation sample size and perform all of the validation measures recommended in recent literature [21-23].

Ethics statement
The HFEA provided ethical approval for this study. All data were analysed anonymously.

Data
The HFEA has a statutory duty to collect and record information about every assisted conception treatment in the UK. By law, every treatment centre must report certain couple characteristics, treatment details and outcomes for every initiated IVF cycle. The HFEA provided a database of all IVF cycles in the UK in the period 2003-2010. A cycle of IVF was defined as an initiated ovarian stimulation or planned fresh or frozen embryo transfer. Since IVFpredict was built using data from 2003-2007, we restricted the validation sample used here to cycles occurring in 2008-2010. Data on live births for cycles initiated in 2011 onwards were not completely available. We used the same exclusion criteria as our previous study where we developed IVFpredict and compared it with the Templeton model, excluding treatments that are not IVF (i.e. involve donor insemination or gamete/zygote intra-fallopian transfer), involve the storage or donation of eggs, or use frozen embryo transfer [12]. IVFpredict cannot give a prediction for women aged more than 50 years, so cycles from these women were excluded along with cycles for which data on the duration of infertility were missing.

Prediction models
The Templeton and IVFpredict models use a linear predictor that differs with patient and cycle characteristics. This is converted into the predicted probability of a live birth using the logistic transformation. Equivalently, the linear predictor equals the log odds of a live birth.
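As a minimal illustration of the logistic transformation described above, the following sketch converts a linear predictor (log odds) into a predicted probability and back. The function names are hypothetical and chosen for illustration only; the published model coefficients are not reproduced here.

```python
import math


def predicted_probability(linear_predictor: float) -> float:
    """Logistic transformation: convert log odds into a probability."""
    return 1.0 / (1.0 + math.exp(-linear_predictor))


def log_odds(probability: float) -> float:
    """Inverse transformation: convert a probability into log odds."""
    return math.log(probability / (1.0 - probability))


# A linear predictor of zero corresponds to even odds of a live birth.
even_odds_probability = predicted_probability(0.0)

# Round trip: the log odds of a predicted probability recover the
# linear predictor (up to floating-point error).
round_trip = log_odds(predicted_probability(-1.2))
```

Because the transformation is strictly increasing, ranking cycles by linear predictor and ranking them by predicted probability give the same order.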

IVFpredict
The variables used by IVFpredict are: female age (categorized as 18-34, 35-37, 38-39, 40-42, 43-44 and 45-50 years), duration of infertility (less than 1, 1-3, 4-6, 7-9, 10-12 and more than 12 years attempting to conceive), cause of infertility (tubal, ovulatory, endometriosis, cervical, male or combined), number of previous IVF cycles and number of previous unsuccessful IVF cycles, pregnancy history, type of ovulation induction, whether ICSI was used, and whether donor or the patient's own eggs were used [12]. Gonadotropins are now recognized as the optimal agent for induction of multifollicular growth for IVF [24]. In our validation sample the agent used was not recorded for 128,438 cycles (98.1%), but it was recorded as gonadotropins in 2,511 (99.6%) of the cycles where it was reported. We therefore assumed that all cycles used gonadotropins and gave the IVFpredict linear effect attributed to this to all cycles. The interaction terms in IVFpredict (between female age and duration, female age and egg source, ICSI and cause of infertility, and ICSI and number of previous IVF cycles) were included here.

Templeton model
The Templeton model uses female age (considered as a continuous variable with its effect on the log odds represented by a cubic curve), duration of infertility (less than 4, 4-6, 7-12 and more than 12 years), number of previously unsuccessful IVF cycles, pregnancy history, and tubal cause of infertility [7]. The original Templeton model does not include any interactions.

Female age-alone models
We wished to compare both the Templeton and IVFpredict models with a model that predicts live birth outcome using female age alone. However, there was no clear candidate for such a model in the literature, and it would not be appropriate to develop one using the validation data as its performance would be biased in the validation data. Instead, we considered a model in which the predicted probability of live birth decreased as female age increased. The exact shape of the relationship between female age and predicted probability of live birth does not affect the discriminatory power of this model, provided we assume this monotonic relationship. We also considered a model that used the HFEA age categories to inform prediction, again with decreasing live birth rate with increasing female age, in order to measure the reduction in discriminatory power associated with using female age as a categorical rather than continuous variable.

Statistical methods
We used several methods, recommended in recent reviews [21][22][23], to assess the validity of both models and compare their performance in the validation sample. These methods may be thought of as assessing discriminatory power, calibration, or a combination of these two properties of a prediction model. Discriminatory power refers to the ability of a prediction model to discriminate between successful and unsuccessful outcomes. This was assessed using the area under the receiver-operator curve (AUROC). In this context the AUROC is the probability that a model will predict a better prognosis for a randomly-selected cycle that resulted in a live birth than a randomly-selected cycle that did not result in a live birth. We compared the AUROC of IVFpredict with the reported AUROC and confidence intervals (CIs) in the IVFpredict development sample [12], using a Wald test, assuming independence between the development and validation samples.
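The pairwise interpretation of the AUROC, and the Wald comparison of two independent AUROC estimates, can be sketched as follows. This is an illustrative sketch rather than the analysis code used in the study (which was run in Stata): all function names and the toy data are hypothetical, ties are counted as one half by assumption, and the standard error is recovered from a reported 95% CI by assuming the interval is symmetric and normal-based.

```python
import math


def auroc(scores, outcomes):
    """Probability that a randomly chosen live-birth cycle receives a higher
    score than a randomly chosen non-live-birth cycle (ties count one half)."""
    positives = [s for s, y in zip(scores, outcomes) if y == 1]
    negatives = [s for s, y in zip(scores, outcomes) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one cycle with each outcome")
    concordant = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                concordant += 1.0
            elif p == n:
                concordant += 0.5
    return concordant / (len(positives) * len(negatives))


def se_from_ci(lower, upper, z=1.959964):
    """Standard error implied by a symmetric, normal-based 95% CI."""
    return (upper - lower) / (2.0 * z)


def wald_z(est_a, se_a, est_b, se_b):
    """Wald z statistic for the difference of two independent estimates."""
    return (est_a - est_b) / math.sqrt(se_a ** 2 + se_b ** 2)


def two_sided_p(z):
    """Two-sided p-value under a standard normal reference distribution."""
    return math.erfc(abs(z) / math.sqrt(2.0))


# Toy data (hypothetical): predicted probabilities and observed outcomes.
toy_scores = [0.10, 0.40, 0.35, 0.80]
toy_outcomes = [0, 0, 1, 1]
toy_auroc = auroc(toy_scores, toy_outcomes)

# Adding a constant to the log odds preserves the ranking of cycles,
# so the AUROC is unchanged.
shifted = [1.0 / (1.0 + math.exp(-(math.log(s / (1.0 - s)) + 0.1396)))
           for s in toy_scores]
shifted_auroc = auroc(shifted, toy_outcomes)
```

The double loop is quadratic in the number of cycles, so for a sample of this size a rank-based (Mann-Whitney) computation would be the practical choice; the pairwise form is shown only because it mirrors the definition in the text.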
Calibration refers to the similarity between the observed and predicted live birth rate in groups of cycles. We assessed general calibration using calibration plots, which average the observed and predicted live birth rate over deciles of the linear predictor. In a calibration plot, the observed live birth rate is plotted against predicted live birth rate, and perfect calibration is indicated by a straight line, with a gradient of one, through the origin. We used linear regression to estimate the intercept and slope of the closest-fitting straight line to the points on the calibration plot, and assessed departures from perfect calibration using the Hosmer-Lemeshow test [25 p147-156]. We also assessed calibration over patient and cycle characteristics by comparing the observed and predicted live birth rates by female age, egg source, duration of infertility, number of previously unsuccessful IVF cycles, previous live birth from IVF, cause of infertility, and use of ICSI. Here, departures from perfect calibration were assessed with Pearson's chi-squared test and p-values were considered against a Bonferroni-corrected threshold that ensured the family-wise error rate for all tests of calibration over patient and cycle characteristics was not greater than 5%.
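The decile-based calibration summary described above can be sketched as follows. This is an illustrative outline under stated assumptions, not the study's Stata code: the function names are hypothetical, the calibration line is fitted by unweighted least squares to the group means, and the Hosmer-Lemeshow statistic uses the mean predicted probability in each group to approximate the binomial variance.

```python
def decile_groups(predicted, observed, n_groups=10):
    """Sort cycles by predicted probability and return, for each group,
    (group size, mean predicted rate, mean observed rate)."""
    order = sorted(range(len(predicted)), key=lambda i: predicted[i])
    n = len(order)
    groups = []
    for g in range(n_groups):
        idx = order[g * n // n_groups:(g + 1) * n // n_groups]
        if idx:
            mean_pred = sum(predicted[i] for i in idx) / len(idx)
            mean_obs = sum(observed[i] for i in idx) / len(idx)
            groups.append((len(idx), mean_pred, mean_obs))
    return groups


def calibration_line(groups):
    """Least-squares intercept and slope of observed on predicted rates."""
    xs = [g[1] for g in groups]
    ys = [g[2] for g in groups]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def hosmer_lemeshow(groups):
    """Sum over groups of (observed - expected)^2 / (n * p * (1 - p))."""
    return sum(n * (obs - pred) ** 2 / (pred * (1.0 - pred))
               for n, pred, obs in groups)


def bonferroni_threshold(n_tests, family_wise_alpha=0.05):
    """Per-test p-value threshold keeping the family-wise error rate
    at or below family_wise_alpha."""
    return family_wise_alpha / n_tests
```

Perfect calibration corresponds to an intercept of zero, a slope of one, and a Hosmer-Lemeshow statistic of zero. Grouping by deciles of predicted probability is equivalent to grouping by deciles of the linear predictor, since the logistic transformation preserves order.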
The Brier score is a combination of the discriminatory power and calibration of a prediction model [26 p284-287]. We calculated modified Brier scores, defined as the Euclidean distance between observed and predicted live birth rates over deciles of the linear predictor, and scaled them by dividing by the sample variance of the observed live birth rate. The scaled modified Brier scores can be decomposed into measures of reliability and resolution. The reliability is a goodness-of-fit statistic and is related to the Hosmer-Lemeshow test for calibration. The resolution measures the range of probabilities that the prediction model can handle, with a higher value indicating that the model can predict over a larger range of probabilities. This is desirable because IVF treatment is now used in couples ranging from complete infertility to marginally reduced (or even normal) fertility [20]. When the overall observed and predicted live birth rates are equal, as in the updated models, the scaled Brier score is the proportion of variation in observed live birth rates not explained by the prediction model.
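A decomposition of this kind can be sketched using the standard Murphy decomposition of a grouped Brier score into reliability, resolution and uncertainty. This is an assumption made for illustration: the exact definition of the modified score used in the study is not reproduced here, the function name is hypothetical, and the binary-outcome variance (overall rate times one minus overall rate) stands in for the sample variance used for scaling.

```python
def scaled_brier_decomposition(groups):
    """Murphy decomposition over groups of (size, mean predicted rate,
    mean observed rate), scaled by the outcome variance.

    Returns (scaled score, scaled reliability, scaled resolution), where
    scaled score = 1 + scaled reliability - scaled resolution.
    """
    total = sum(n for n, _, _ in groups)
    overall_rate = sum(n * obs for n, _, obs in groups) / total
    uncertainty = overall_rate * (1.0 - overall_rate)

    # Reliability: weighted squared gap between predicted and observed
    # rates within groups (smaller is better).
    reliability = sum(n * (pred - obs) ** 2 for n, pred, obs in groups) / total

    # Resolution: weighted squared gap between group observed rates and
    # the overall observed rate (larger is better).
    resolution = sum(n * (obs - overall_rate) ** 2 for n, _, obs in groups) / total

    scaled_score = (reliability - resolution + uncertainty) / uncertainty
    return scaled_score, reliability / uncertainty, resolution / uncertainty
```

Under this formulation a perfectly reliable model has a scaled score equal to one minus its scaled resolution, which matches the reading of the scaled score as the proportion of variation in observed live birth rates left unexplained.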
It has been suggested that prediction models can be compared by evaluating how many patients are correctly reclassified from one treatment recommendation to another [23]. Since both the IVFpredict and Templeton models (and other models for predicting successful pregnancy/live birth outcome) do not give thresholds of prognosis or treatment recommendations, only the predicted probability of live birth, it is not possible to compare them in this way. However it is possible to assess, for each cycle, which prediction model gave a prognosis closer to the truth. We did this by cross-tabulating the number of cycles resulting in live births against the prediction model that gave a higher probability of success to that cycle. We also calculated a continuous version of the net reclassification index [27], which was calculated as twice the proportion of live births given a higher prognosis by IVFpredict minus twice the proportion of cycles not resulting in a live birth given a higher prognosis by IVFpredict.
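The continuous net reclassification index, as defined in the paragraph above, can be sketched as follows. The function name and argument names are hypothetical; the sketch assumes that the two models never give exactly the same probability to a cycle, which is the condition under which the "twice the proportion" form holds.

```python
def continuous_nri(prob_new, prob_old, outcomes):
    """Twice the proportion of live births given a higher probability by the
    new model, minus twice the proportion of non-live-births given a higher
    probability by the new model."""
    events = [(a, b) for a, b, y in zip(prob_new, prob_old, outcomes) if y == 1]
    non_events = [(a, b) for a, b, y in zip(prob_new, prob_old, outcomes) if y == 0]
    if not events or not non_events:
        raise ValueError("need at least one cycle with each outcome")
    higher_in_events = sum(1 for a, b in events if a > b) / len(events)
    higher_in_non_events = sum(1 for a, b in non_events if a > b) / len(non_events)
    return 2.0 * higher_in_events - 2.0 * higher_in_non_events
```

The index ranges from -2 to 2. A positive value indicates that the new model tends to move live-birth cycles upward and other cycles downward relative to the comparison model; zero indicates no net improvement.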
For each of the IVFpredict and Templeton models we assessed two different predictions. The first was calculated using the original values for the linear predictors. The second prediction was updated for the validation sample by adding a constant term to the linear predictor, calculated numerically by trial-and-improvement to ensure that the average observed and predicted live birth rates in the whole sample were equal. The simpler method of adjusting the linear predictor by adding the difference between the observed and predicted log odds of live birth [28], would not achieve this, nor would adding the difference between the observed log odds of live birth in the validation sample compared with the development sample, as was done in the Dutch study [18]. A more complicated method is to refit the prediction model to the validation sample by re-running the logistic regression [15,17], or to use a calibration plot to find an adjusted prediction (vertical value) corresponding to each predicted prognosis (horizontal value). A drawback of these methods is that they cause the validation sample to become a new development sample, hence the refit model requires further validation, and therefore we did not use these methods. Updating the models, using any of the methods above, would affect their calibration but not their discriminatory power.
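The updating step described above, finding the constant that equates the average predicted and observed live birth rates, can be sketched with a simple bisection search. This is one possible way to carry out the numerical search, not necessarily the procedure used in the study; the function names and search bounds are assumptions.

```python
import math


def mean_predicted(linear_predictors, shift=0.0):
    """Average predicted probability after adding `shift` to every
    linear predictor."""
    return sum(1.0 / (1.0 + math.exp(-(lp + shift)))
               for lp in linear_predictors) / len(linear_predictors)


def intercept_update(linear_predictors, observed_rate,
                     lower=-10.0, upper=10.0, tolerance=1e-10):
    """Constant to add to the linear predictor so that the mean predicted
    probability equals the observed live birth rate.

    The mean predicted probability is strictly increasing in the shift,
    so bisection converges provided the solution lies within the bounds.
    """
    while upper - lower > tolerance:
        middle = (lower + upper) / 2.0
        if mean_predicted(linear_predictors, middle) < observed_rate:
            lower = middle
        else:
            upper = middle
    return (lower + upper) / 2.0
```

Because the logistic transformation is non-linear, the shift found this way generally differs from the difference between the log odds of the observed rate and the log odds of the mean predicted rate; the two coincide only when every cycle has the same linear predictor. Adding a constant leaves the ranking of cycles, and hence the AUROC, unchanged.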
All statistical analyses were performed using Stata version 12 (StataCorp LP).

Results
There were 132,796 eligible IVF cycles that took place in the UK between 2008 and 2010. It was not possible to calculate a predicted probability of live birth for 1.4% of these cycles, due to missing information on duration of infertility or patient age greater than 50 years. The remaining 130,960 cycles were available for validation; details of the formation of the validation sample are given in Fig. 1. Table 1 shows the characteristics of the treatment cycles in this study. Over half (51.7%) of cycles involved ICSI, and the proportion of cycles involving donor eggs increased with female age. There were 33,553 live births, giving a live birth rate of 25.6% per cycle in this cohort.
Table 2 shows the AUROC for the IVFpredict and Templeton models. Despite strong statistical evidence for a difference in discrimination, with IVFpredict performing better than Templeton, the AUROC values were similar for the two models, with both showing good discrimination. IVFpredict had slightly poorer discrimination in this validation cohort than in the original cohort, and both models had better discrimination than a model based on female age alone as a continuous variable. The categorisation of female age resulted in a decrease in AUROC of 0.004, which was 0.7% of the AUROC of IVFpredict, suggesting that there would be a very small increase in AUROC if IVFpredict were to be redesigned to use female age as a continuous variable.

Calibration
[…] line, indicating that observed live birth rates were above those predicted. This was particularly marked in the Templeton model. The actual differences between observed and predicted live birth rates are given in S1 Table.

Effect on calibration of updating the prediction models
Both models required updating to ensure that the predicted live birth rate was, on average, equal to the observed live birth rate in the validation sample. IVFpredict was updated by adding 0.1396 to the linear predictor, thus increasing the predicted odds of live birth by a factor of approximately 1.15 (exp(0.1396)). […] Fig. 3 (S2 Table). Both models showed a closer adherence to the reference line, although the Hosmer-Lemeshow test still showed strong statistical evidence of imperfect calibration in both models. As with the original models, calibration was better for IVFpredict than for the Templeton model. The IVFpredict calibration plot had a slope of 0.867 (95% CI: 0.789-0.946) whilst the Templeton calibration plot had a slope of 0.804 (95% CI: 0.697-0.912). Slopes less than 1 demonstrate that both updated models overestimated the live birth rate in couples with good prognosis, and underestimated the live birth rate in couples with poor prognosis. Since both models have been updated, the average observed and predicted live birth rates are equal, so the intercepts of the calibration plots give no additional information about the calibration of each updated model. Table 3 shows the calibration of both updated models by couple characteristics. IVFpredict did not show differential calibration with respect to cause of infertility and treatment type, whilst the Templeton model showed differential calibration over all variables considered in Table 3.

Calibration by different couple characteristics
Both models underestimated the live birth rate in patients using donor eggs, particularly in women aged 38 years or older, but this was more marked in the updated Templeton model. In women aged 45 years or older using donor eggs the predicted live birth rate from the Templeton model was only 8% (95% CI: 7-9%) of the observed live birth rate, whereas from IVFpredict it was 84% (95% CI: 77-92%). The Templeton model overestimated the live birth rate in women aged 38 years or older using their own eggs and women with a tubal cause of infertility, whilst underestimating the live birth rate in women who had not previously had a live birth from IVF, and couples using ICSI.

Overall performance
Brier scores for all four models (IVFpredict and Templeton, original and updated) are shown in Table 4. The original and updated IVFpredict models had smaller Brier scores than the Templeton model, indicating better predictive accuracy. The updated models had better reliability than the original models, with IVFpredict having the best reliability. The IVFpredict models also had higher resolution. The rates in Table 5 show that, before updating, the Templeton model gave a lower probability of live birth than IVFpredict in 128,615 cycles (98.2%), as the Templeton model severely underestimates the probability of live birth. Thus the Templeton model gave a more accurate prognosis in cycles that did not result in a live birth and IVFpredict gave a more accurate prognosis in the fewer cycles that resulted in a live birth. This comparison is much more meaningful after updating both models. After updating, most women who had a live birth were given a higher probability of live birth by IVFpredict than the Templeton model, and most women who did not have a live birth were given a lower probability of live birth by IVFpredict. That is, after updating, IVFpredict performed better than the Templeton model in terms of correctly […]

Discussion
In this study we have shown that IVFpredict externally validates and has good discrimination and calibration in a large independent cohort. We have further shown that it has better discrimination and calibration than the Templeton model for accurately predicting live birth. This is the first time that IVFpredict has been validated for live birth outcome on a different sample from that used for its development. The observed lower discrimination in comparison to the original development sample is to be expected [21,29], but the magnitude of the difference is tiny (mean difference of 0.01) and the observed AUROC of 0.628 in this validation sample can be thought of as excellent discriminative performance, as it is higher than the supposed maximum of 0.62 expected for a prediction model of IVF success [23]. Female age is the most important predictor of live birth from IVF [30], and IVFpredict has the drawback that it considers age in categories rather than a continuous measure. However, we demonstrated that IVFpredict has better discriminatory power than any model based only on a monotone transformation of female age, showing that the additional predictors used by IVFpredict give additional discriminatory power. As this was a validation study, it was not appropriate to investigate how prediction models could be improved by the inclusion of additional variables. The overall better performance, in terms of discrimination and calibration, of IVFpredict in comparison with the Templeton model in this study, as in our original development cohort, is in contrast to the conclusions of a recent Dutch study that reported similar discrimination and calibration of the two models [18]. 
However, the differences between AUROC and calibration for the two models in that study were minimal and, as discussed in the introduction, the Dutch study was small, included only treatment cycles in couples with primary infertility, could not include information on ICSI or whether women used their own or donor eggs and had pregnancy (rather than live birth) as the measured outcome. Our finding, that the Templeton model underestimated the live birth rate by up to 17% in the current study, is consistent with other studies that have previously demonstrated that the Templeton model was particularly poorly calibrated when applied to contemporary cohorts [15,17]. The likely explanation for this is the increase in the use and success of IVF in the two decades since the Templeton model was originally developed. However, IVFpredict had better calibration even after both models were updated to take account of the live birth rate in the contemporary UK cohort used here. This is likely to be because the Templeton model does not consider ICSI and donor eggs as informative predictors of prognosis and therefore incorrectly estimates, sometimes extremely, the probability of live birth in cycles using these treatment methods.
Our preferred method of updating the prediction models to account for the differences in IVF success in different populations caused the predicted live birth rate to equal the observed live birth rate in the whole validation sample, therefore removing the systematic underestimation exhibited by both models. The updated IVFpredict model had a calibration plot slope closer to the target value than that of the Templeton model, and a better reliability in the Brier score decomposition. Furthermore, the updated Templeton model overestimated the probability of live birth in older women using their own eggs and women with a tubal cause of infertility, and underestimated the probability of live birth from cycles using ICSI and women who had not had a previous live birth from IVF. Both updated models underestimated the probability of live birth in cycles using donor eggs, but this was more marked in the updated Templeton model. The updated IVFpredict model had a smaller Brier score, indicating better overall accuracy, and in particular showed greater resolution in the Brier score decomposition. This shows that IVFpredict has the desirable property of covering a greater range of probabilities than the Templeton model [23], indicating that it carries more predictive prognostic information [31]. Consistent with this, the updated IVFpredict model provided a more accurate prognosis than the updated Templeton model for the majority (52.5%) of the cycles in the validation sample.
Our results suggest that, whichever model is chosen for use in clinical practice, it should be updated or recalibrated before it is used to inform patients of their prognosis. This validation study has shown the potential for extremely misleading predictions if a poorly calibrated, out-of-date model is used. Ideally recalibration should take place on a per-centre basis due to the amount of variation in the live birth rate between treatment centres [15,30,32]. The simplest method of recalibration would involve a treatment centre retrospectively applying a prediction model to their recorded cycles, and calculating how much to add or subtract from the linear predictor to ensure that the average predicted live birth rate equals the observed live birth rate at that centre. Larger centres could create calibration plots, as in Fig. 2, and use these as described above to convert a probability given by the prediction model for certain patient characteristics into a centre-specific probability for that patient. Unfortunately the HFEA dataset used here did not contain centre details for each cycle, so we could not assess the efficacy of per-centre recalibration.
This study benefits from a large validation sample that is highly representative of the target population. It should be noted that, since both the IVFpredict and Templeton models were developed using UK HFEA data, we were only able to perform temporal validation [30]. External validation of the ability of IVFpredict to predict live birth rate in couples with both primary and secondary infertility in a large non-UK population is required to assess its global applicability. Neither IVFpredict nor the Templeton model considers live birth from frozen embryo replacements, hence both models may be disadvantaged given the likelihood of increased use of frozen embryo replacements. No prediction models for live birth after IVF have yet reached the impact analysis stage and external validation is an essential step towards this [1]. In conclusion, this validation study indicates that IVFpredict has superior discrimination and calibration to the Templeton model for informing patients and clinicians in the UK of the prognosis of live birth following IVF. IVFpredict exhibits improved calibration after updating, and recalibration can be applied on a per-centre basis in clinical practice.