A Review and Comparison of Methods for Recreating Individual Patient Data from Published Kaplan-Meier Survival Curves for Economic Evaluations: A Simulation Study

Background In general, the individual patient-level data (IPD) collected in clinical trials are not available to independent researchers conducting economic evaluations; researchers typically have access only to published survival curves and summary statistics. Methods that use published survival curves and summary statistics to reproduce the statistics needed for economic evaluations are therefore essential. Four methods have been identified: two traditional methods, 1) the least squares method and 2) the graphical method; and two more recently proposed methods, by 3) Hoyle and Henley and 4) Guyot et al. The four methods were first reviewed individually and then assessed, through a simulation study, on their ability to estimate mean survival. Methods Scenarios were constructed from combinations of sample sizes, censoring rates and parametric survival distributions. One thousand simulated survival datasets were generated for each scenario; all four methods were applied to each dataset, and maximum likelihood estimation was also applied directly to the simulated IPD for comparison. The uncertainty in the estimate of mean survival time was also captured. Results All methods provided accurate estimates of the mean survival time when the sample size was 500 and a Weibull distribution was used. When the sample size was 100 and the Weibull distribution was used, the Guyot et al. method was almost as accurate as the Hoyle and Henley method; however, larger biases were observed in the traditional methods. When a lognormal distribution was used, the Guyot et al. method produced noticeably less bias and more accurate uncertainty estimates than the Hoyle and Henley method. Conclusions The traditional methods should not be preferred because of their marked overestimation. When the Weibull distribution was used for the fitted model, the Guyot et al. method was almost as accurate as the Hoyle and Henley method. However, when the lognormal distribution was used, the Guyot et al. method was less biased than the Hoyle and Henley method.


Introduction
Decision analytic modeling is often applied as a vehicle for economic evaluations. [1] A commonly used approach in decision analytic modeling is the Markov model. Transition probabilities between Markov states are one of the most important features of this stochastic model, and several methods are available to estimate them. Survival analysis is frequently used for this purpose. Recently, an algorithm was developed by Latimer that aimed to improve the quality of survival analyses included within economic evaluations. [2] The framework proposed by Latimer rests on the assumption that the researchers conducting economic evaluations have access to individual patient-level data (IPD). Typically, IPD originate from clinical trials and observational studies. However, censoring frequently occurs during the follow-up period, and a key feature of survival analysis is its ability to accommodate censoring, which makes it an appropriate method for survival data of this kind.
When IPD are available, the relationship between transition probabilities and time can be estimated from IPD using a survival analysis. The commonly used approach is to fit parametric survival functions to IPD by applying the maximum likelihood estimation (MLE) to estimate the parameters of the distributions that are used for the estimation of the transition probabilities. An estimate of the mean survival time is essential for cost-effectiveness analyses (CEAs); the mean survival time is a measure of the effectiveness in a CEA and can be estimated from parametric survival functions. [3] In addition to the mean survival time, other parameters can also be estimated from survival distributions. The uncertainty of parameters can be captured with a variance-covariance matrix from parametric regression models. [1] However, IPD are often only available to clinical trial sponsors because of confidentiality, and the independent economic evaluation researchers lack access to these data. Independent researchers can typically obtain published Kaplan-Meier curves and summary statistics, e.g., numbers at risk and total number of events, for economic evaluations.
For these reasons, methods that use published survival curves and reported summary statistics to reproduce statistics for economic evaluations are essential for independent researchers to conduct these evaluations. However, few methods are available to obtain this information in the absence of IPD, and only four methods have been identified to recreate IPD for economic evaluations from published Kaplan-Meier (KM) curves. These four methods can be used to estimate the parameters of the fitted parametric survival function. The general equation for transition probabilities is as follows: tp(t_u) = 1 − exp{H(t − u) − H(t)}, where u is the length of the Markov cycle, t is the survival time, tp(t_u) is the transition probability between time points t − u and t, and H(t) is the cumulative hazard function of the parametric distribution. [4] In the following sections, the four identified methods are described briefly and then assessed using a simulation study.
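As an illustration, the transition-probability equation above can be evaluated for any parametric model once its cumulative hazard H(t) is known. The study itself used R; the following is a minimal Python sketch, in which the Weibull parameterization S(t) = exp(−(t/λ)^γ) and all numeric values are assumptions made for the example:

```python
import math

def weibull_cum_hazard(t, lam, gamma):
    """H(t) = (t / lam) ** gamma for a Weibull with scale lam and shape gamma."""
    return (t / lam) ** gamma

def transition_prob(t, u, cum_hazard):
    """tp(t_u) = 1 - exp{H(t - u) - H(t)} for a Markov cycle of length u."""
    return 1.0 - math.exp(cum_hazard(t - u) - cum_hazard(t))

# Example: shape 1 gives a constant hazard of 1/10, so tp depends only on u.
H = lambda t: weibull_cum_hazard(t, 10.0, 1.0)
tp = transition_prob(5.0, 1.0, H)  # probability of the event during cycle (4, 5]
```

With a constant hazard the result reduces to 1 − exp(−u/10) for every cycle, which is a useful sanity check on any implementation.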

Method 1: Least squares method
The least squares method involves the estimation of parameters by minimizing the sum of the squares of the residuals; the residuals represent the differences between the actual data and the estimated data from a model. The least squares method is a common method of statistical estimation and is included in almost every statistical package, e.g., the R package, which makes it accessible to researchers in experimental sciences.
Suppose 20 data points, (x1, y1), (x2, y2), ..., (x20, y20), were extracted from a survival curve that follows a distribution with function y = f(x, β), in which β = (β1, β2). The parameter vector β is estimated using the least squares method, which minimizes the sum of squared residuals.
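To make this concrete, here is a small Python sketch (the study itself used R) that fits a Weibull survival function S(t) = exp(−λt^γ) to extracted points by minimizing the sum of squared residuals; the crude grid search merely stands in for a proper optimiser, and the point values are illustrative:

```python
import math

def sse(points, lam, gamma):
    """Sum of squared residuals between extracted points and the Weibull model."""
    return sum((s - math.exp(-lam * t ** gamma)) ** 2 for t, s in points)

def fit_least_squares(points):
    """Crude grid search over (lambda, gamma); a real analysis would use an optimiser."""
    return min(
        ((lam / 100.0, gam / 100.0)
         for lam in range(1, 101)       # lambda in 0.01 .. 1.00
         for gam in range(10, 301, 5)), # gamma  in 0.10 .. 3.00
        key=lambda p: sse(points, *p),
    )

# Synthetic example: 20 points read off S(t) = exp(-0.1 t^1.5)
pts = [(t, math.exp(-0.1 * t ** 1.5)) for t in [0.5 * k for k in range(1, 21)]]
lam_hat, gam_hat = fit_least_squares(pts)
```

Because the true parameters lie on the grid, the search recovers them exactly here; with real extracted points the fit is only as good as the extraction.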

Method 2: Graphical method
The graphical method used to estimate the parameters of parametric models involves transforming the survivor function into a linear function and fitting a straight line through a series of points extracted from a published Kaplan-Meier survival curve [5]. For example, the survival function of the Weibull distribution is S(t) = exp(−λt^γ), where λ is a scale parameter and γ is a shape parameter. Taking the logarithm gives log(S(t)) = −λt^γ. Taking the logarithm again gives log(−log(S(t))) = log(λ) + γ·log(t). Plotting log(−log(S(t))) against log(t) therefore yields a straight line with slope γ and intercept log(λ), from which the parameters can be estimated. The graphical method typically provides only rough estimates of the parameters, but it has been used because of its simplicity.
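A minimal sketch of this transformation in Python (the study used R; the points and parameter values below are illustrative), fitting the straight line by ordinary least squares in the transformed space:

```python
import math

def graphical_fit(points):
    """Fit log(-log S(t)) = log(lambda) + gamma * log(t) by simple linear regression."""
    xs = [math.log(t) for t, s in points]
    ys = [math.log(-math.log(s)) for t, s in points]
    n = len(points)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    gamma = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
             / sum((x - xbar) ** 2 for x in xs))   # slope = shape
    lam = math.exp(ybar - gamma * xbar)            # exp(intercept) = scale
    return lam, gamma

# Points read off a Weibull curve with lambda = 0.2, gamma = 0.8
pts = [(t, math.exp(-0.2 * t ** 0.8)) for t in [1, 2, 4, 6, 8, 10]]
lam_hat, gam_hat = graphical_fit(pts)
```

On noise-free points the regression recovers the parameters; with extraction error the estimates are correspondingly rough, which is why the method is considered approximate.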

Method 3: Estimation of interval-censored data (Hoyle and Henley, 2011)
Hoyle and Henley [6] proposed a method for the estimation of IPD. The numbers at risk, the total number of patients and the survival probabilities are used to estimate the IPD. The method assumes that censoring occurs at a constant rate within each time interval. The numbers of events and censorings are estimated within sub-intervals of one-quarter the length of each reported interval, which improves the curve fits considerably. When the numbers of patients at risk are not reported, the numbers of events and censorings can be estimated using the method proposed by Tierney et al. [7] The method becomes more accurate as more intervals are reported. The IPD estimated by this method are interval-censored. A spreadsheet that implements the method is available at http://medicine.exeter.ac.uk/media/universityofexeter/medicalschool/profiles/Hoyle_and_Henley_Version_1.1.xls.
Hoyle and Henley compared their method with the two traditional methods, i.e., the least squares and graphical methods, using Monte Carlo simulation. The parameters varied across their simulation scenarios included the sample size, the combinations of Weibull distribution parameters and the censoring type. The results demonstrated that the method proposed by Hoyle and Henley was more accurate than the traditional methods in both fitting the curve and estimating the mean survival time. Furthermore, the estimation of the mean survival time using the proposed method was as accurate as that obtained by applying the MLE directly to actual IPD. One important feature that distinguishes this method from the traditional methods is that it can capture uncertainty.
However, alternative survival models and variations in the proportion of censoring were not included in their simulation scenarios. The authors used only the Weibull distribution; however, many candidate survival models exist, such as the Weibull, gamma, Gompertz, lognormal and log-logistic distributions. The lognormal distribution has often been used instead of the Weibull distribution to estimate mean survival in NICE technology appraisals (TAs). [2] It has also been commonly used in the medical field, particularly for fitting cancer site data. [8] In a realistic trial setting, the censoring rate is often very high; [9] for example, censoring rates as high as 70% are often observed in real clinical data. The MLE is known to provide biased estimates when the data are heavily censored. [10]

Method 4: Estimation of precise survival data: survival data with left or right censoring (Guyot et al., 2012)
In 2012, Guyot et al. developed an iterative algorithm to reproduce IPD, from which survival curves can be reconstructed. [11] The IPD estimated by this method are precise survival times rather than interval-censored data. The method assumes that the censoring rate is constant within each time interval. Initial estimates of the number censored in interval i are obtained; given these, the number censored between extracted KM coordinates k and k + 1 is calculated. The number of events at each extracted KM coordinate and the number of patients at risk at the next coordinate can then be estimated. If the estimated number at risk at the beginning of interval i does not equal the number at risk reported at the start of i, the process is repeated until the two figures match. Similarly, if the reported total number of events exceeds the estimate, the process is repeated until the two figures match.
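The core of the iteration can be conveyed with a deliberately simplified, single-interval Python sketch. This is not Guyot et al.'s published algorithm (which iterates across all intervals and also reconciles the total number of events); the function name, the uniform spreading of censoring times and the matching rule are all illustrative assumptions:

```python
def recreate_interval(coords, s_start, n_start, n_end, t_start, t_end):
    """Guyot-style reconstruction for one risk-table interval (simplified).

    coords:  extracted KM step coordinates [(t, S(t)), ...] inside the interval.
    s_start: survival at the interval start.
    n_start, n_end: reported numbers at risk at the start of this interval
                    and at the start of the next one.
    Tries each candidate number censored c, spreads the censoring times
    uniformly over the interval, derives event counts from the KM drops,
    and keeps the c whose implied number at risk best matches n_end.
    """
    best = None
    for c in range(n_start + 1):
        cens = [t_start + (j + 0.5) * (t_end - t_start) / c for j in range(c)]
        n, s_prev, events, t_prev = n_start, s_start, 0, t_start
        for t, s in coords:
            n -= sum(1 for x in cens if t_prev < x <= t)  # censored before this drop
            d = round(n * (1.0 - s / s_prev))             # deaths implied by the KM step
            events += d
            n -= d
            s_prev, t_prev = s, t
        n -= sum(1 for x in cens if x > t_prev)           # censored after the last drop
        fit = abs(n - n_end)
        if best is None or fit < best[0]:
            best = (fit, c, events)
    return best[1], best[2]  # (number censored, number of events)

# 10 at risk, deaths at t=1 and t=2, one censoring in between, 7 at risk at t=3
n_cens, n_events = recreate_interval([(1, 0.9), (2, 0.7875)], 1.0, 10, 7, 0.0, 3.0)
```

Even this toy version shows the essential idea: the KM drops pin down the event counts once the censoring pattern is guessed, and the reported numbers at risk discipline the guess.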
Further explanation of the method can be found in the original paper. [11] Regarding reproducibility and accuracy, unlike Hoyle and Henley, Guyot et al. did not perform a simulation to test the accuracy of their method. Six pairs of survival curves were recreated from four published articles. The authors assessed accuracy by comparing the 22 survival probabilities, 7 median survival times, 6 hazard ratios and 4 standard errors of the log hazard ratios reported in the four publications with the corresponding values reconstructed by the proposed method. The results demonstrated high accuracy for survival probabilities and median survival times, and reasonable accuracy when the number of individuals at risk or the total number of events was reported.
However, because the method proposed by Guyot et al. was not compared with other traditional methods, it remains unknown whether the accuracy of the proposed method is greater than the other methods previously described.
Because no Monte Carlo simulation was performed to test the accuracy of the proposed method, the effects of different combinations of input values, i.e., truly different scenarios, could not be examined.
There are similarities and differences in the methods proposed by Hoyle and Henley and Guyot et al. Both methods include the use of the survival curve and the number at risk, as well as the assumption that censoring occurs at a constant rate in each time interval for both methods. The differences are summarized in Table 1.
Taylor et al. [12] reviewed the non-linear, Guyot et al. and Hoyle and Henley methods and assessed their impacts on the estimates of survival parameters by extracting data points from published survival curves and applying the three methods to two case studies. Because IPD were unavailable, the accuracy of the three methods could not be compared. The findings indicated that the method used to estimate survival parameters, and hence the transition probabilities between states, can affect cost-effectiveness results.
Although the accuracy of Hoyle and Henley's method has been compared with that of the traditional methods, this comparison did not include the method proposed by Guyot et al., an important method for recreating survival data that has been cited and used frequently. [2,4,13-15] Additionally, it is unclear how the four methods perform in situations in which alternative survival models and various censoring rates are assumed to reflect realistic clinical data. Therefore, we evaluated the four methods using a simulation study.

Simulation settings
To compare the four methods previously described, a simulation study that could generate survival data with censoring was designed and conducted. This study enabled us to compare the performance of the four methods. The simulated survival data were designed to reflect real clinical trial data. This section contains the details of the simulation study design. The simulation settings in this study are similar to Hoyle and Henley's study, with the exceptions that (1) the Guyot et al. method was included, (2) variations in the proportion of censoring were simulated, and (3) the lognormal distribution was used in addition to the Weibull distribution.

Survival times
To simulate data, a number of patients with an underlying survival time must be generated. Two sample sizes were chosen: 100 and 500 individuals. These sample sizes reflect those typically included in clinical trials. [16-18] To simulate survival times with censoring, two survival distributions were required: one for the time-to-event (T_i) and one for the time to censoring (C_i). The Weibull and lognormal distributions are the most commonly used for time-to-event data. [19] In this simulation, we considered these two survival distributions for the uncensored survival time T_i: (1) the Weibull distribution; and (2) the lognormal distribution.
The survival function of the Weibull distribution is S(t) = exp(−(t/λ)^γ), in which λ is a scale parameter and γ is a shape parameter. The mean survival time of the Weibull distribution is E(X) = λΓ(1 + 1/γ), in which Γ is the gamma function. In the simulations, the mean survival time was set to 10 time units, which is common in time-to-event data. To hold the mean survival time at 10 units, different combinations of the scale and shape parameters were chosen. First, three shape parameters were chosen that corresponded to those used by Hoyle and Henley; these cover most situations that would be encountered with the Weibull distribution. Second, the scale parameter was set to make the mean survival time equal to 10 units. The combinations were as follows: 1. decreasing failure rate (shape parameter < 1): γ = 0.6, λ = 6.65; 2. constant failure rate (shape parameter = 1): γ = 1, λ = 10; 3. increasing failure rate (shape parameter > 1): γ = 2, λ = 11.28.
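As a quick check, the three parameter combinations do give a mean of 10 time units under the parameterization S(t) = exp(−(t/λ)^γ), for which E(X) = λΓ(1 + 1/γ):

```python
import math

def weibull_mean(lam, gamma):
    """E(X) = lam * Gamma(1 + 1/gamma) for S(t) = exp(-(t/lam)**gamma)."""
    return lam * math.gamma(1.0 + 1.0 / gamma)

# (shape gamma, scale lambda) combinations from the text
combos = [(0.6, 6.65), (1.0, 10.0), (2.0, 11.28)]
means = [weibull_mean(lam, gam) for gam, lam in combos]  # each close to 10
```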
The survival function of the lognormal distribution is S(t) = 1 − Φ((ln t − μ)/σ), where Φ is the standard normal cumulative distribution function, and μ and σ are the mean and standard deviation, respectively, of the variable's natural logarithm. The mean survival time of the lognormal distribution can be expressed as E(X) = exp(μ + σ^2/2). The mean survival time of the lognormal distribution was also set to 10 units.
The hazard function of the lognormal distribution first increases from zero to a maximum value and then decreases back to zero. The amount that the hazard function increases or decreases primarily depends on the value of σ.
The combinations of the parameters of the lognormal distribution were as follows: 1. σ = 1, μ = 1.802585: The hazard rate first increases from zero to a maximum value and then decreases to zero; 2. σ = 2, μ = 0.3025851: The hazard function essentially decreases over most time values.
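The same check applies to the lognormal combinations, using E(X) = exp(μ + σ^2/2):

```python
import math

def lognormal_mean(mu, sigma):
    """E(X) = exp(mu + sigma**2 / 2) for a lognormal with log-mean mu, log-sd sigma."""
    return math.exp(mu + 0.5 * sigma ** 2)

m1 = lognormal_mean(1.802585, 1.0)   # combination 1, close to 10
m2 = lognormal_mean(0.3025851, 2.0)  # combination 2, close to 10
```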
The types of censoring in the simulation included type I censoring (due to pre-assigned fixed censoring times) and random censoring (due to loss to follow-up). The censoring distributions commonly considered in the literature are the uniform and exponential distributions [20]; in this simulation study, random censoring was generated from an exponential distribution. The means of the exponential distributions were set to attain the desired censoring rates of 26, 42 and 76%, reflecting what occurs in the real world (Table 2).

Entry and follow-up times
The subjects were assumed to enter the study at different times during a three-time-unit period. Entry times greater than 0 were generated from a uniform distribution between times 0 and 3 units. The maximum follow-up time was 12 units, which reflects what often occurs in real data. [21,22] Subjects censored at time 12 units represented those who remained in the study at the end of the follow-up period. Thus, all subjects were followed for a period that ranged between 9 and 12 units, depending on their entry time. The follow-up time was set to 12 units, with the exceptions described in Table 2:
a: The mean of the exponential distribution was set to a very large number (e.g., 10,000,000) to attain the target censoring rate.
b: In addition to setting the mean of the exponential distribution, the study end time was set to 15 to attain the target censoring rate.
c: In addition to setting the mean of the exponential distribution, the study end time was set to 13 to attain the target censoring rate.
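Putting this section and the previous one together, a single simulated dataset could be generated as follows. This is a Python sketch (the study itself used R); the function name, seed and parameter values are illustrative, and administrative (type I) censoring is applied at the study end, time 12, so that a subject entering at time e is followed for 12 − e units:

```python
import math
import random

def simulate_dataset(n, gamma, lam, cens_mean, max_follow=12.0, seed=1):
    """One dataset of (observed time, event indicator) pairs.

    Event times: Weibull, S(t) = exp(-(t/lam)**gamma), via inverse transform.
    Random censoring: exponential with mean cens_mean (loss to follow-up).
    Type I censoring: study ends at max_follow; entry is uniform(0, 3).
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        t_event = lam * (-math.log(1.0 - rng.random())) ** (1.0 / gamma)
        t_drop = rng.expovariate(1.0 / cens_mean)   # random censoring time
        entry = rng.uniform(0.0, 3.0)               # staggered entry
        t_admin = max_follow - entry                # administrative censoring
        t_obs = min(t_event, t_drop, t_admin)
        event = t_obs == t_event                    # True if the event was observed
        data.append((t_obs, event))
    return data

data = simulate_dataset(100, gamma=1.0, lam=10.0, cens_mean=20.0)
cens_rate = 1.0 - sum(e for _, e in data) / len(data)
```

Tuning cens_mean (and, where needed, the study end time, as in footnotes b and c) is what drives the achieved censoring rate toward the 26, 42 or 76% targets.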

Application of the methods
Considering all possible parameter selections, we obtained 30 scenarios: 18 for the Weibull distribution and 12 for the lognormal distribution (Table 3).
Although sample sizes of 100-500 cover the range of trials typically encountered and reflect what is often seen in large trials, [16-18] additional work was performed by simulating larger trials (sample size set to 1000). These simulations were run with γ = 0.6, λ = 6.65 and a censoring rate of 0.76 for the Weibull distribution, and with σ = 2, μ = 0.30 and a censoring rate of 0.76 for the lognormal distribution.
For the Guyot et al. method, the number of events was used; however, this information is often not provided. For the Hoyle and Henley method, 6 time intervals were used in most cases.
For each scenario, 1000 datasets were generated. The four methods previously described were then applied to each of the 1000 datasets, and the MLE was applied directly to the simulated IPD to estimate the parameters of the survival distribution, as is commonly done in cost-effectiveness analyses. For each method, the mean survival time was estimated for each of the 1000 datasets. All simulations were conducted using R for Statistical Computing, version 3.1.1 (R Foundation for Statistical Computing, Vienna, Austria).

Performance evaluation
The measurements used to evaluate the performance of the different methods have been summarized by Burton et al. [23] Bias was calculated as bias = b − β, and percentage bias as 100 × (b − β)/β, in which β is the true value of the estimate of interest and b is the mean estimate of interest over the 1000 simulations performed.
Mean square error (MSE). The MSE is considered a useful measure of overall accuracy because it incorporates both bias and variability. It was calculated as the average squared deviation of the 1000 estimates from the true value, MSE = (1/1000) × Σ(b_i − β)^2, where b_i is the estimate from the i-th simulated dataset; equivalently, the MSE equals the squared bias plus the variance of the estimates.
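These two performance measures can be computed from the per-dataset estimates as follows (a Python sketch with toy numbers; the study used R):

```python
def bias_and_mse(estimates, truth):
    """Percentage bias and MSE over repeated simulations (Burton et al. measures)."""
    n = len(estimates)
    mean_est = sum(estimates) / n
    pct_bias = 100.0 * (mean_est - truth) / truth
    mse = sum((b - truth) ** 2 for b in estimates) / n
    return pct_bias, mse

# Toy example: four estimates of a true mean survival of 10
pct, mse = bias_and_mse([9.0, 10.0, 11.0, 12.0], truth=10.0)
```

Here the mean estimate is 10.5, so the percentage bias is 5% and the MSE is 1.5, combining the squared bias (0.25) with the variance of the estimates (1.25).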

Uncertainty
In CEAs, the measure of effectiveness is the mean survival time over the duration of interest. [24] Thus, consideration of the uncertainty regarding the mean survival time is essential to conduct CEAs. [3] One of the advantages of using the methods proposed by Guyot et al. and Hoyle and Henley to conduct a CEA based on a published KM curve is that these methods can capture uncertainty regarding the mean survival time, unlike traditional methods, which are unable to reconstruct IPD.
To assess how well the two proposed methods capture uncertainty, the standard errors of the mean survival time were estimated both by applying the MLE directly to the actual IPD and by each of the two proposed methods.
We analyzed the uncertainty in the estimate of the mean survival time by applying a bootstrap resampling technique to the parameters of the survival distributions (Weibull and lognormal). For each of the 1000 simulations, the standard deviation was estimated by bootstrap resampling. For each bootstrap resample, the correlated variables were generated using the formula x = y + Tz, where x is the vector of correlated variables, z is a vector of independent standard normal variables, y is the vector of parameter mean values and T is the Cholesky decomposition of the variance-covariance matrix recorded from the simulations. The parameters of the survival distribution were drawn in this way, and the mean survival times were estimated. This procedure was repeated 10,000 times, and the standard deviation of the mean survival time over the 10,000 bootstrap samples was computed. Thus, the standard error of the mean survival time was obtained for each of the 1000 simulations.
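The Cholesky step x = y + Tz can be sketched as follows for a two-parameter survival model (a Python illustration; the parameter means and variance-covariance values are made up for the example):

```python
import math
import random

def cholesky2(v):
    """Lower-triangular Cholesky factor T of a 2x2 covariance matrix v."""
    a = math.sqrt(v[0][0])
    b = v[1][0] / a
    return [[a, 0.0], [b, math.sqrt(v[1][1] - b * b)]]

def draw_params(mean, vcov, rng):
    """One correlated draw x = y + T z of the survival parameters."""
    t = cholesky2(vcov)
    z = [rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)]
    return [mean[i] + t[i][0] * z[0] + t[i][1] * z[1] for i in range(2)]

rng = random.Random(42)
vcov = [[0.04, 0.01], [0.01, 0.09]]        # illustrative variance-covariance matrix
T = cholesky2(vcov)                        # T times its transpose reproduces vcov
draws = [draw_params([1.0, 10.0], vcov, rng) for _ in range(10000)]
```

Each draw would then be converted to a mean survival time, and the standard deviation of those 10,000 values gives the bootstrap standard error described above.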

Results
Simulation results

For the tables and figures in this section, the MLE applied to the actual IPD is referred to as Actual IPD, and the recreation methods are referred to by their proposers. Table 4 and Fig. 1 present the results when the survival time followed a Weibull distribution and the sample size was 100. When the censoring rate was as low as 0.26, all four methods performed well in estimating the mean survival time, and their estimates were as accurate as those obtained using the actual IPD.

When the censoring rate was 0.42, all four methods provided satisfactory estimates of the mean survival time, with the least squares and graphical methods exhibiting small biases in the scenarios with a decreasing hazard rate.
When the censoring rate increased to 0.76, the least squares and graphical methods noticeably overestimated the mean survival time when the hazard rate was decreasing or constant. Because the MLE itself is biased under heavy censoring, the Guyot et al. and Hoyle and Henley methods also provided biased estimates; however, their estimates remained close to those obtained using the actual IPD, in contrast to the much larger biases of the least squares and graphical methods.
When the sample size was 100 and the data followed a Weibull distribution, the accuracy of the estimates of mean survival time depended on the shape of the hazard function and the level of censoring. For decreasing and constant hazard patterns, the biases were greater at higher censoring rates, particularly for the decreasing hazard pattern; these biases result from the use of the MLE when the data are heavily censored. The bias may be substantially greater with decreasing than with constant hazard rates, potentially because the magnitude of the MLE bias is more strongly affected by a decreasing hazard. At a given censoring rate, e.g., 0.76 or 0.42, the magnitude of the bias was affected by the shape of the hazard function of the survival model: the bias was greater with decreasing than with increasing or constant hazard rates. When the shape parameter was 0.6, corresponding to a decreasing hazard function, the percentage biases for Actual IPD, the Guyot et al. method and the Hoyle and Henley method were 59.34, 72.76 and 68.72%, respectively, whereas they were 11.79, 18.90 and 29.02%, respectively, with a constant hazard rate and 0.09, 1.33 and 1.35%, respectively, with an increasing hazard rate. One potential explanation is that there was less random censoring in the increasing hazard case than in the decreasing and constant hazard cases.

Table 5 and Fig. 2 demonstrate that all methods performed well when the sample size was 500. Table 6 and Fig. 3 present the results when the survival time followed a lognormal distribution and the sample size was 100. When σ was 1 and μ was 1.80, all methods performed well at all three levels of censoring. When σ was 2 and μ was 0.30, the magnitude of the bias was strongly affected by the level of censoring.
The greater the degree of censoring, the larger the bias. The least squares and graphical methods clearly overestimated the mean survival time. Table 7 and Fig. 4 present the results when the survival time followed a lognormal distribution and the sample size was 500. When σ was 1 and μ was 1.80, all methods performed well. When σ was 2 and μ was 0.30, a greater bias was observed in the Hoyle and Henley method.
The results demonstrated that the biases were also affected by the shape of the hazard function and the level of censoring in both the lognormal and Weibull distributions, regardless of the sample size. For the same high censoring rate, a greater bias will be produced in the decreasing hazard pattern compared with the other hazard patterns.
Figs. 5 and 6 present the results when the sample size was 1000 and the survival time followed a Weibull and a lognormal distribution, respectively. The performance of all methods in these larger trials was consistent with that observed for a sample size of 500.

Uncertainty results
The results in Fig. 7 indicate that, when the Weibull distribution was used, the Guyot et al. method provided slightly better uncertainty measures of the mean survival time than the Hoyle and Henley method. When the lognormal distribution was used, the mean difference between the standard error of the mean survival time from the actual IPD and the corresponding standard error from the Guyot et al. method was substantially smaller than that for the Hoyle and Henley method.

Discussion
As expected, the application of the MLE to actual IPD was the most accurate method in the simulation study; however, it produced noticeable biases when the level of censoring was high and the hazard function decreased over time. These biases occur because the MLE can be quite biased when the sample size is small or the survival data are heavily censored. [25,26] To address this problem, a modified MLE (MMLE) was proposed to reduce bias in the estimates of the Weibull shape parameter through modification of the profile likelihood. [26] The MLE and MMLE work well for complete, Type I or Type II censored data, but they are not useful for randomly censored data. More recently, Shen [10] proposed a method that can be applied not only to the types of censored data discussed previously but also to others, such as random censoring, progressive Type II censoring and adaptive Type II progressive censoring. If the MMLE or Shen's method were applied to IPD, the bias would be reduced. However, the most commonly used statistical software packages do not implement the MMLE or Shen's method. This debate is beyond the scope of our study, and we believe that bias will be reduced once programs for corrected-MLE methods are included in statistical software.

In general, the least squares and graphical methods provided biased estimates, and in some scenarios the estimates were heavily biased. For the Weibull distribution, the differences in bias between the two proposed methods were not significant. Both proposed methods therefore provided satisfactory estimates of the mean survival time, almost as accurate as those obtained using actual IPD. However, the number of events was used in the simulation for the Guyot et al. method, and this information is often not provided in clinical trials; the simulation design therefore favored the Guyot et al. method.
In addition, the Guyot et al. method required substantially more data preparation than the Hoyle and Henley method. The data prepared for the Guyot et al. method must be sufficient to capture every step in the KM curves. Hundreds of data points were required for the Guyot et al. method, whereas substantially fewer were required for the Hoyle and Henley method; for example, when 6 time points were reported, 24 data points were required for the Hoyle and Henley method. The more data that must be extracted, the larger the probability of bias arising from data extraction. In the simulation study, the actual survival probabilities were used instead of probabilities obtained from published survival curves; the results therefore represent the most accurate results obtainable from the four methods. Six time intervals were used for most scenarios in the simulation. The Hoyle and Henley method works better when more time intervals are reported in the clinical trial, and published data typically include a minimum of 6 time intervals. Therefore, the accuracy of the Hoyle and Henley method assessed here is likely the minimum accuracy obtainable from this method.
Given that both the Guyot et al. and Hoyle and Henley methods provided satisfactory estimates of the mean survival time when the Weibull distribution was fitted, we believe that the Hoyle and Henley method would not perform worse than the Guyot et al. method in the real world. For the lognormal distribution, the Hoyle and Henley method produced more noticeable biases than the other methods, potentially because it is more strongly affected by the long tail of the lognormal distribution. In the simulation, sample sizes of 100 and 500 were chosen because this range covers the trials typically encountered and reflects what is often seen in large trials. [16-18] However, some trials have sample sizes of less than 100 or more than 500. The results showed that the performance of all methods in the larger trials was consistent with that for a sample size of 500. For small trials, it is possible to read the exact drops in the Kaplan-Meier curve and the censoring tick marks directly from the published figure; in these situations the two proposed methods are not needed.

Conclusions
The objective of this article was to review and compare the methods used to recreate individual patient data for economic evaluations from published Kaplan-Meier survival curves using a Monte Carlo simulation. Of the methods assessed in this study, the least squares and graphical methods were the least preferred because of their remarkable overestimation in some scenarios.