A natural history model for planning prostate cancer testing: Calibration and validation using Swedish registry data

Recent prostate cancer screening trials have given conflicting results and it is unclear how to reduce prostate cancer mortality while minimising overdiagnosis and overtreatment. Prostate cancer testing is a partially observable process, and planning for testing requires either extrapolation from randomised controlled trials or, more flexibly, modelling of the cancer natural history. An existing US prostate cancer natural history model (Gulati et al, Biostatistics 2010;11:707-719) did not model for differences in survival between Gleason 6 and 7 cancers and predicted too few Gleason 7 cancers for contemporary Sweden. We re-implemented and re-calibrated the US model to Sweden. We extended the model to more finely describe the disease states, their time to biopsy-detectable cancer and prostate cancer survival. We first calibrated the model to the incidence rate ratio observed in the European Randomised Study of Screening for Prostate Cancer (ERSPC) together with age-specific cancer staging observed in the Stockholm PSA (prostate-specific antigen) and Biopsy Register; we then calibrated age-specific survival by disease states under contemporary testing and treatment using the Swedish National Prostate Cancer Register. After calibration, we were able to closely match observed prostate cancer incidence trends in Sweden. Assuming that patients detected at an earlier stage by screening receive a commensurate survival improvement, we find that the calibrated model replicates the observed mortality reduction in a simulation of ERSPC. Using the resulting model, we predicted incidence and mortality following the introduction of regular testing. Compared with a model of the current testing pattern, organised 8 yearly testing for men aged 55–69 years was predicted to reduce prostate cancer incidence by 14% and increase prostate cancer mortality by 2%. The model is open source and suitable for planning for effective prostate cancer screening into the future.

Introduction Swedish National Prostate Cancer Register. The raw data for those estimates are also potentially identifiable and data access to those data has been restricted by the Research Ethics Board at Umeå University. Research Ethics Boards in Sweden are currently being re-structured and their updated contact details will be available from http://www. epn.se. The authors do not own these data and had no special privileges in accessing the data. Anyone wishing to access the individual level data would need to apply for permission through a Research Ethics Board and from the primary data owners, including the Swedish National Board of Health and Welfare, Statistics Sweden and the four Stockholm clinical laboratories". the prostate cancer mortality RR from the ERSPC screening trial and (b) US prostate cancer incidence. Starting with the model from [12], we re-implemented and extended the model to include additional states by T-stage/M-stage and Gleason scoring as shown in Fig 1 (more recent developments of the FHCRC model are described in the Discussion). We also used more detailed inputs for calibration and validation. Data sources for the model inputs, calibration and validation included: the National Patient Register (including data on cancer Individuals are assumed to be healthy at age 35 years; they may progress to preclinical cancer states with fixed Gleason score, but with progression by T-stage and to metastatic cancer; preclinical cancers may be clinically diagnosed from nine different states, with survival from prostate cancer death modelled from the time of clinical diagnosis; death due to other causes is represented as a competing event. treatment), National Prostate Register (cancer incidence by Gleason score, T-stage and Mstage), Total Population Register (defining the at-risk population), Cause of Death Register, the Stockholm PSA and Biopsy Register (SPBR), and the PCBaSe research database for prostate cancer survival. Details are provided in the Materials and Methods.
Current PSA testing. The Stockholm and Swedish calibration targets were from a PSA tested population. We therefore modelled PSA testing in this population using data from the SPBR (details are provided in PSA testing sub-model). The initial PSA test uptake was modelled with a log-logistic distribution by age and calendar period. The PSA re-testing was modelled by age groups and PSA values. The resulting PSA testing rates by age and calendar period are shown in Fig 2. The probability of having a biopsy following a positive PSA test (i.e. biopsy compliance) was modelled by age and PSA value using the SPBR (see Table B in S1 Appendix).

Model calibration
Model calibration was undertaken in two steps. First, we calibrated for the relative distributions of incident cancers from contemporary Sweden and the screening effect on incidence from the ERSPC. Second, we calibrated to survival from contemporary Sweden.
Calibrating to the relative distributions of cancer staging. The observed relative distributions of incident cancer stages at diagnoses were used as calibration targets for modelling several prostate cancer natural history parameters (Tables 1 and 2). Importantly, we modelled for transitions between T-stages and fitted the relative distributions for the Gleason scores and cancer T and M stages by age groups; see Fig 3. We included different T-stages to support more detailed modelling of treatment and survival. The use of relative proportions allows for the absolute incidence rates to be used for validation. The calibration used a reconstruction of a contemporary Swedish population with data on PSA test uptake, health state proportions at diagnosis, and survival from a screened population. A total of 4392 diagnoses in the ages 50-74 from a three-years interval (2011-2013) were used as calibration targets.
The calibration to cancer staging only included two parameters. As a consequence, the model predictions do not accurately reflect all of the variation by age, T-stage and Gleason score (Fig 3). In particular, the predicted proportions are less accurate at younger ages for Tstages 1-2.
Calibration of screening effects on incidence. In addition to the calibration of the screened Swedish population, we also simulated for the screened and unscreened arms from the ERSPC to replicate results from the 13 years of follow-up [7]; for details on the reconstruction of the ERSPC trial, see Materials and methods. The ERSPC rate ratio of prostate cancer incidence (1.57, 95% confidence interval (CI) 1.51-1.62) was used in the calibration, while the ERSPC mortality RR of prostate cancer was used for validation. To calibrate to the ERSPC incidence rate ratio and to model indirectly for tumour size, we introduced a parameter for the proportion of the time from onset that a T1-T2 cancer would not be biopsy detectable [16]; we estimated that the T1-T2 cancers would on average be undetectable at biopsy for 53% of the time before they progressed to T3-T4 cancers.
Calibrating to survival from diagnosis. To calibrate prostate cancer survival, we compared simulated survival from a contemporary Swedish population, including longitudinal screening and treatment patterns, with observed 10-and 15-year survival from the PCBaSe database [17,18]. The PCBaSe database contained 93014 men from the Swedish National Prostate Cancer Register diagnosed in the period 1998-2014 linked with the health and population registers. Prostate cancer survival was stratified by M stage, Gleason score (� 6, 7, � 8), PSA (< 10, � 10) and ten-year age groups. Predictions from the calibrated model are displayed as Kaplan-Meier survival curves and the observed 10-and 15-year survival are displayed as point estimates with 95% CIs in Fig 4. The calibrated model has a clear separation in survival for men diagnosed with either Gleason 6 or Gleason 7 cancers; see Fig B in the S1 Appendix for a comparison with the FHCRC model, where survival tends to be similar for Gleason 6 and 7 cancers. For the FHCRC model, systematic differences in survival for Gleason 6 and 7 cancers, which are indirectly driven by differences in PSA growth rates and treatment assignment

Model validation
Population prostate cancer incidence. In Fig 5, we compared the age-standardised prostate cancer incidence rates from the simulation with that of Sweden during 1985-2016 which included the introduction of PSA testing. There is evidence for a good fit although the rapid increase in incidence following the introduction of PSA testing was not fully captured. This over-smoothing is possibly due to the PSA uptake sub-model having few degrees of freedom. Importantly, this is a validation and we did not calibrate for prostate cancer incidence rates.
Screening effect on mortality. In contrast to the incidence increase from ERSPC, which was used for calibration, the mortality decrease was used for validation. Using simulations of Additional validations. We provide further validations in section C in S1 Appendix. Fig  D in S1 Appendix shows that the model predicts prostate cancer incidence in Sweden and Stockholm for age for the years preceding the introduction of PSA testing. The age-specific model predictions are less accurate for the years preceding 2016, which suggests that the PSA sub-model does not capture the dynamic changes in earlier PSA testing (Fig E, S1 Appendix).
The model was not calibrated for prostate cancer mortality. For the validation by calendar period, see Fig F in S1 Appendix. The predicted shape of the temporal trends is similar to observed rates, although the level for the predicted mortality rates are lower than those observed. Detailed prostate cancer mortality rates by age and calendar period are shown in Fig  G in S1 Appendix. The predicted rates are similar to the observed rates. Finally, we show all cause mortality rates by age and calendar period (Fig H, S1 Appendix). The model predictions follow closely the observed patterns in Stockholm and Sweden.

Model predictions
When planning for prostate cancer testing policies, the following measures were considered to represent the burden of disease: prostate cancer incidence rate; prostate cancer overdiagnosis rate, where overdiagnosis is defined as the lifetime risk of having a prostate cancer diagnosis that would never have been clinically detected prior to death due to another cause; prostate cancer mortality rate; and life expectancy. We predicted these measures for a policy that replaces the current testing pattern (see Fig 2) with regular prostate cancer testing during ages 55-69 years: regular testing was introduced from 2015 at age 55 years for those born in 1960 and in later birth cohorts. After 15 years, regular testing had replaced current testing for ages 55-69 years. For each policy 100 million life histories were simulated. We assumed that preferences for PSA testing did not change with the introduction of regular testing, such that only men who would undertake testing under current testing would participate in regular testing [19,20]. Our modelling of organised screening specifically addresses the effect of screening intensity for the targeted age groups.
In Fig 6 we predicted prostate cancer incidence and overdiagnosis rate ratios for 20 years of 2-yearly testing, 8-yearly testing and the complete cessation of asymptomatic testing in comparison with the current testing pattern. The 2-yearly testing scenario resulted in a small reduction, RR 0.95 (95% MCI 0.95-0.95), in prostate cancer incidence and a larger decrease in prostate cancer overdiagnosis, RR 0.74 (95% MCI 0.73-0.74), over 20 years compared with the current testing pattern. The less intensive 8-yearly testing scenario substantially reduced the prostate cancer incidence, RR 0.86 (95% MCI 0.86-0.86), and the reduction of prostate cancer overdiagnosis was even larger, RR 0.53 (95% MCI 0.53-0.53), compared with the current testing pattern. The hypothetical cessation of all PSA testing for asymptomatic men in 2015 would result in a substantial decrease, RR 0.72 (95% MCI 0.72-0.73), of prostate cancer incidence compared with the current testing rates over 20 years as well as no overdiagnosis of asymptomatic men.
The purpose of early detection for prostate cancer is to lower prostate cancer mortality and increase the life expectancy. To assess these effects, we predicted mortality rates and life-years gained for the different PSA testing policies (Fig 7). Both the mortality rates and the life-years gained were expressed relative to the current PSA testing pattern. We predict that the broad introduction across the 1946-1965 birth cohorts contributed to the mortality reduction, which, while wearing off towards the end of the 20 year period, causes an increase in mortality, particularly for the 8 yearly testing. The relative effect on the mortality is considerably smaller than the effect on the incidence and while the 2 yearly testing pattern has a similar mortality, RR 0.99 (95% MCI 0.99-1.00), the 8 yearly testing pattern slightly increases the mortality, RR 1.02 (95% MCI 1.01-1.02), as the current uptake pattern for the predicted 20 years. Similarly the 2 yearly testing pattern slightly increased the life expectancy, 0.01 (95% MCI 0.01-0.02) life-years gained per 1,000 persons, and the 8 yearly testing pattern did not noticeably affect the life expectancy, -0.01 (95% MCI -0.01--0.00) life-years gained per 1,000 persons, compared to the current uptake pattern for the predicted 20 years. The hypothetical scenario of cessation of PSA testing for asymptomatic men in 2015 was predicted to significantly increase prostate cancer mortality over 20 years, RR 1.07 (95% MCI 1.07-1.08), and reduce the life expectancy by -0.06 (95% MCI -0.08--0.04) per 1000 persons. The effect is smaller than the 20% prostate cancer mortality reduction observed in the ERSPC study as the current PSA uptake pattern is less intensive than ERSPC and the lower biopsy compliance observed in Sweden (see Table B in S1 Appendix). The modest mortality reductions are potentially explained by relatively high levels of testing under the current PSA testing, and the use of the currently observed biopsy compliance for all predicted scenarios. These reductions are also comparable to the non-significant mortality reduction found in the Prostate, Lung, Colorectal and Ovarian (PLCO) cancer screening trial [8], where there were high levels of PSA testing in the control arm [21].

Discussion
Our aim was to develop, calibrate and validate a prostate cancer natural history model that could be used to evaluate prostate cancer testing. Using extensive Swedish data resources, we extended an older US-calibrated prostate cancer natural history model for the Swedish population and validated the new model. We then used the revised model to predict longer-term patterns of prostate cancer incidence and mortality in Sweden.
One of the challenges with natural history models is finding a model that is biologically meaningful and representative, whilst being mathematically simple and potentially estimable. Another challenge is that many of the parameters of a cancer's natural history are either not observable, such as the initial onset of disease, or are only partially observed at specific time points, such as the size of a tumour at the time of diagnosis. Investigators are divided in how to resolve these challenges. One school uses very simple models with expert judgement for the effectiveness of interventions. The validity of the predictions depend on the accuracy of the experts. A second school uses Markov models fitted to evidence from randomised controlled trials (RCTs) to assess the effectiveness of specific interventions within the follow-up from the RCTs. The validity is limited by the available RCT evidence, with strong limitations for predicting outside of the observed data. A third school uses more detailed natural history models and simulate for individuals. The validity of the predictions primarily reflects the validity of the natural history model. We are firmly in the last of these three schools. We have previously modelled cancer screening using both simple and more complex Markov models, and found issues with validity for the simple models and issues with model complexity for scaling more detailed Markov models to combinations of natural history and test states by time in state [22].
One potential criticism of many microsimulation models for cancer screening is that their complexity is coupled with a lack of model detail and that the source is usually closed. The USfunded CISNET collaboration has provided detailed model documentation [23] and some models (e.g. FHCRC) are available on request. We addressed this criticism by making all of our code open source and easily available (https://github.com/mclements/microsimulation and https://github.com/mclements/prostata). We encourage other microsimulation modellers to make their code openly available, which will lower the entry requirements for other investigators. If the cost of entry remains high, then a closed source consulting model will continue to be predominant.
There are several potential limitations. First, the revised natural history model was less accurate for modelling age dependent incidence (e.g. the incidence was under estimated in the younger ages and overestimates in the older ages; see Fig E in S1 Appendix). We suggest caution when interpreting incidence for Nordic populations for several reasons: the higher Nordic rates may lead to greater absolute declines in rates, leading to more effective screening; and the point estimate for the mortality reduction due to screening was higher in the Göteborg site, although there was no statistically significant heterogeneity between the ERSPC sites [7, p = 0.4]. More accurate modelling at older ages would require a more detailed natural history model. Second, it is difficult to assess whether the natural history model is causal and accurate: the disease process is only partially observed and the biology represented using a simple mathematical representation. Third, the prediction of age-standardised mortality rates were slightly lower than that observed in the Swedish population. This underestimation could be due to e.g. changes in Gleason grading, where the Gleason and T-stage distribution at diagnosis was based on data from patients diagnosed 2011-2013 whereas the survival by stage was based on patients diagnosed 1998-2014.
Finally, as individuals were followed for up to 15 years, the survival calibration was influenced by earlier Gleason grading practices, possibly leading to an overestimation of the risks for Gleason � 6 cancers. Nonetheless, we expect that our predictions will have strong internal validity, as the simulations allow for carefully controlled experimental conditions. Strengths of our approach include the wealth of detailed longitudinal data available from Sweden, and that we have made the model open source. Our natural history model can support an evidence-based approach to assessing whether the introduction of organised re-testing or screening would be effective and cost-effective. The Stockholm Prostata model was branched from the FHCRC prostate cancer model in 2013. Since that time, both the Stockholm Prostata model and the FHCRC model have incorporated a number of similar extensions, including T-stage development and more detailed modelling of Gleason grading [5]. A key difference is that the updated FHCRC model includes a two-parameter model for cancer onset [24]. We are currently investigating whether to incorporate these extensions into the Stockholm Prostata model. A key advantage of the Stockholm Prostata model is the availability of detailed longitudinal data on PSA values and prostate biopsies linked with clinical outcomes. The US CISNET prostate cancer models have historically relied heavily on un-linked, cross-sectional SEER data. In contrast the Swedish registry data have high coverage and are well suited for modelling the disease progression and treatment pathways within men. Our choice of modelling approach included model calibration for some key parameters in both unscreened and screened populations (Table 1). To assess whether the adapted model was valid for Sweden, we compared the model predictions with observed population incidence. This approach demonstrates both the strengths and potential weaknesses of our model.
Our model is now well suited to the health economic evaluation of new prostate cancer screening tests. In particular, we have modelled for Gleason � 6 cancers, which typically have very good prognosis, from Gleason 7 or Gleason � 8 cancers, where the last category has particularly poor prognosis. The new prostate cancer tests have focused on maintaining sensitivity for more aggressive prostate cancers, such as Gleason 7 or higher, with reductions in the incidence of small Gleason � 6 cancers and negative biopsies. Following a reviewer's suggestion, we will further investigate how to model for preferences for prostate cancer testing (see also [19]). Those preferences are expected to affect the effectiveness of new prostate cancer screening interventions in populations with established PSA testing patterns.
From the section on Model predictions, we found evidence to suggest that organised screening would reduce overdiagnosis without increasing mortality compared to current screening practices. Future work is needed to investigate refined screening strategies and the evaluation of cost-effectiveness.

Materials and methods
In this section, we will describe the various data sources used to develop the model, explain the model formulation, outline the methods for the calibration and validation of the model, and finish with a description of the model implementation.

Data sources
We have integrated multiple sources of data in order to extract relevant Swedish prostate cancer statistics for our model. The linkage between the different sources is illustrated in Fig 8. The regional ethics committee in Stockholm approved the study (dnr 2012/438-31/3, dnr 2016/620-32) and the data were analyzed anonymously. Detailed individual data on men who had a PSA test or a prostate core biopsy were extracted from the Stockholm PSA and Biopsy Register. Using the unique Swedish personal identification number [25], we linked the study cohort to a number of population registers, including the National Cancer Register (NCR) and the National Prostate Cancer Register (NPCR). The NCR included data on tumour extent (loco-regional vs distant) and the date of diagnosis; the NPCR contained additional clinical information, including Gleason score and TNM cancer stage classification. The NCR and NPCR were known to cover 96.3% and 94% of cancer patients, respectively [18,26]. A dynamic cohort (with men moving in and out of the Stockholm County) was defined via the Population Register which contains information on migrations within Sweden as well as external migration.
From PCBaSe, a research database linking the NPCR with other health and population registers, we extracted survival following a prostate cancer diagnosis by calendar period, age, stage, PSA value and Gleason score; and treatment modality by calendar period, age and Gleason score.
Data on prostate cancer incidence by calendar year, age, stage and Gleason scores were extracted from the NCR and the NPCR. Prostate cancer mortality by calendar year and age were obtained from the Cause of Death Register. The Total Population Register was used to calculate the male population at risk by calendar period and age.

Model description
As noted in the Results, we re-implemented and extended an earlier model by Gulati et al [12]. The prostate cancer natural history model links PSA growth with prostate cancer progression. The main components of the model are: (i) longitudinal PSA growth; (ii) Gleason score distribution at onset; (iii) transitions between natural history disease states, as shown in Fig 1; (iv) PSA testing; (v) biopsy sensitivity and specificity; (vi) treatment; and (vii) survival. The PSA growth is expressed functionally in Eq (1) and the transitions between states is defined in Eqs (5)- (11). Cancer onset is assumed to be independent of PSA, and that PSA rises faster after cancer onset. The distribution of Gleason score is assumed to be multinomial with the proportions modelled as a function of age at cancer onset, with no de-differentiation (or change in Gleason score) after onset. We expand on these difference components in the following subsections.
Changes from the earlier model. In the absence of Swedish data, we did not update a number of model components from the 2013 model, including: most parameters and the general structure for the longitudinal PSA sub-model; prostate cancer onset; the mean time from onset to metastatic cancer; the rates of clinical detection; the shape of the hazards from prostate cancer diagnosis to death; and the hazard ratios due to prostate cancer treatment. We updated: the population structure and background mortality rates; longitudinal PSA sub-models for Gleason 6 and Gleason 7 cancers; a detailed PSA testing sub-model; biopsy compliance; management of negative biopsies; new states for undiagnosed cancer by T-stage, modelling for progression; the distribution of Gleason score at cancer onset; treatment, including subsequent treatment for men initially assigned to active surveillance; and detailed survival by stage, Gleason score, PSA values and age.
Longitudinal PSA sub-model. The change in PSA values after cancer onset is assumed to differ by Gleason score, with a specific change for Gleason score 6 and below (G6-), Gleason 7 (G7) and Gleason 8 and above (G8+).
where y i (t) is measured PSA at age t for subject i, I(A) is a 1 if A is true and 0 otherwise, t oi is the age of cancer onset and where , is a Gleason-specific random change of slope after cancer onset • � i (t)*N(0, ϕ 2 ), represents measurement error.
The random slopes β * TN(μ, σ 2 ) are truncated normal distributions with support for positive real values. The probability density functions are f(β) = ϕ((β − μ)/σ)/(1 − F(−μ/σ)), where ϕ and F are the probability density function and cumulative distribution functions for standard normal distributions, respectively. The truncated normal distributions ensure that PSA growth is monotonically increasing.
Gleason score distribution. The Gleason score assigned to an individual at cancer onset is dependent on the age at cancer onset according to the probabilities modelled via the multinomial logistic regression in Eqs (2)-(4) as illustrated in Fig 9 (right panel). There is evidence that the FHCRC and Swedish models assume different Gleason score distributions, which may reflect differences in Gleason grading between the US and Swedish populations or changes in Gleason grading over time, where the Swedish data are more recent.
where t � 35. Transition rates between natural history states. Transitions between different states in the model (i.e. healthy, localized states, metastatic states and death) are simulated via events, which occur with different rates.
The disease onset (a transition from the healthy to a localized state) is modelled via a timedependent hazard (from age 35) as which means that the time-to-event follows a Weibull distribution (shape parameter 2 and scale parameter ffi ffi ffi ffi ffi ffi ffi ffi ffi ). The cumulative distribution (the complement of the survival function) for the time to cancer onset is hence R o t ð Þ ¼ 1 À exp À g o 2 t À 35Þ 2 À � À [11]. The event density, which is simply the probability density for the Weibull distribution with parameters as above, represents the rate of cancer onset per unit time (see Fig A in S1 Appendix).
Transitions between disease states are dependent on age (t) and the individual log-PSA-values (logðỹ i ðtÞÞ ¼ logðy i ðtÞÞ À � i ðtÞ). The model includes T-stage transitions within localised cancer states for preclinical cancer. The transition from T1-T2 to T3-T4 is the same for all Gleason categories and is described in Eq (6). γ t is the hazard of transitioning to T3-T4 and the time-dependence comes from the log-PSA levels.
The rate from T3-T4 to metastatic disease is proportional to PSA and γ m which is the metastasis hazard (see Eq (7)). Note that the FHCRC model used γ t to represent the parameter for the transition rate from onset to metastatic [11].
The clinical diagnosis rate for localised cancer onset for Gleason score 7 and lower (Eq (8)) and Gleason score 8 and higher (Eq (9)) are proportional to PSA and g G� lc , which is the clinical diagnosis hazard for localised cancer for the two Gleason score categories. As per the older US model, we combined the Gleason � 6 and 7 scores for these transitions due to a lack of informative data.
The rate to clinical diagnosis after metastatic onset for Gleason score 7 and below (Eq (10)) and Gleason score eight and above (Eq (11)), is proportional to PSA and g G� mc is the post-metastasis clinical diagnosis hazard for the two Gleason score categories. l G7À mc ðtÞ ¼ g G7À mcỹi ðtÞ ð10Þ l G8þ mc ðtÞ ¼ g G8þ mcỹi ðtÞ ð11Þ PSA testing. Diffusion of a new health technology into a population is a dynamic process. This process may reach a stationary state after a longer period of time. For PSA testing, test uptake was distributed across a range of ages over a comparatively short period, such that the PSA test patterns varied substantially by birth cohorts. PSA test uptake is required for calibrating the model to screened populations. The natural history model is calibrated to data that are observed both before and since the introduction of PSA testing. In particular, we have survival data for men diagnosed for prostate cancer from 1998, which is after the introduction of PSA testing. This requires that we accurately model for PSA uptake and re-testing and for treatment to represent the men at risk for prostate cancer incidence, survival and mortality.
The PSA sub-model represents uptake of the PSA test together with the pattern of PSA retesting. Uptake was modelled as: (i) a function of age for cohorts born from 1960; (ii) a function of calendar period multiplied by a factor for birth cohort for birth cohorts born before 1932; and (iii) a mixture of (i) and (ii) for the birth cohorts between 1932 and 1960. Mathematically, age-specific uptake (i) is modelled by the cumulative density function for a log-logistic cure model, such that where t is age at uptake, π 1 is the proportion of men ever having a PSA test, c is the calendar year of birth (or birth cohort), and where a 1 and b 1 are the shape and scale for a log-logistic distribution for those men who ever have a PSA test. The calendar-specific uptake for the older cohorts (ii) is modelled by where π 2 is the proportion of men who ever have a PSA test, and where a 2 and b 2 are the shape and scale for a log-logistic distribution for those men who ever have a PSA test. Finally, for the intermediate birth cohorts, t 1 is sampled from F 1 , t 2 is sampled from F 2 , and t 1 is selected over t 2 with probability (1960 − c)/ (1960 − 1932). PSA re-testing is modelled using a Weibull cure model, such that where t 0 is the age at the previous PSA test, y 0 is the value of the previous PSA test, π 3 is the proportion of men who will ever have a re-test, and where a 3 and b 3 are the shape and scale parameters of the Weibull distribution for those who ever have a re-test. For re-testing, the parameters π, a 3 and b 3 were estimated using a Weibull cure model stratified by the five-year age groups (30-34, 35-39, . . ., 85-89, 90+) and by PSA values ([0, 1), [1,3), [3,10), [10,1)) at the previous PSA test. The parameters π 1 , π 2 , a 1 , a 2 , b 1 and b 2 were calibrated to observed PSA test rates for Stockholm using a Poisson likelihood.
Biopsy sensitivity and compliance. For men who had a PSA value above 3 ng/mL, the proportion complying with a subsequent biopsy varied by PSA values and age, and was estimated from the SPBR (see Table B in S1 Appendix). We also modelled for whether a prostate cancer was biopsy-detectable, assuming that a cancer was not initially detectable for a proportion ϕ lag (16) and (17) of the time from cancer onset to the development of a T3-T4 cancer. Our approach varied from Wever et al. 2010 [16], who modelled for the sensitivity of a PSA test to detect a cancer by stage, irrespective of the time from cancer onset. The probability of a biopsy (Bx) rendering a diagnosis (Dx) depends on the biopsy sensitivity, the biopsy compliance and the probability of cancer: where P(t � t 0 ) is the probability of having had a cancer onset, P(Bx comp (t, PSA|PSA + )) is probability of performing a biopsy after a positive PSA test depending on age and PSA value and P(Bx sens ) is the biopsy sensitivity as expressed below: where Δ = t T3−T4 − t 0 is the time with a T1-T2 cancer. Treatment sub-model. Probabilities for treatment assignment to either active surveillance, radical prostatectomy, radiotherapy or androgen deprivation therapy were assessed from the SBPR. These values were stratified by five year age groups and Gleason score (see Fig  C in S1 Appendix).
Survival sub-model. Survival from cancer diagnosis to death due to screening was calibrated to the NPCR. In summary, we first simulated survival for Sweden for 1998-2014 including screening uptake. The shape for the uncalibrated hazard function used data from the SEER database for localised and metastatic diagnoses. NPCR survival estimates were available at ten and fifteen years after diagnosis by age groups for the period 1998-2014; for non-metastatic cancer, survival was available by Gleason score and for PSA less than 10 ng/mL and for 10 ng/mL and over. We compared the Kaplan-Meier estimates of survival from diagnosis from the simulated data with observed Kaplan-Meier estimates for men diagnosed with prostate cancer in Sweden. For each group, we calculated a calibration hazard ratio from the log of observed survival from NPCR divided by the log of simulated survival. We did not calibrate to observed survival from the pre-PSA era, as such estimates were not available.
One significant modelling challenge is selecting and fitting a mathematical representation for the effect of cancer screening. For cancers with a short period between a screen-detected diagnosis and a counter-factual clinical diagnosis, a common model is to represent differential survival based on changes in stage at diagnosis and changes in treatment. For prostate cancer, there is potentially a long period between screen-detectable prostate cancer and clinical diagnosis. The lead-time between a screen-detected diagnosis and clinical diagnosis is of the order of 10 years [27,28]. We represent the effect of screening on survival S as a function of time t = a − a c , where a c is the possibly counter-factual age of clinical diagnosis, rather than the age of screen-detected diagnosis a s . Then where ClinicalDx and ScreenDx represents either a clinical diagnosis or a screen-detected diagnosis, respectively, and Treatment(a), Stage(a), and PSA(a) are the treatment modality, stage and PSA value at age a, respectively. The treatment sub-model assumes that the hazard ratio from SPCG-4 [29, 0.56] applies comparing both radical prostatectomy and radiation therapy with either watchful waiting or active surveillance. This point estimate is consistent with the point estimate from the PIVOT trial [30], albeit without the latter being significantly different from one. We then model survival as

Implementation of the Stockholm Prostata model
The FHCRC prostate cancer model was implemented in C under an open source GPL licence, although the code has not been widely distributed. We have implemented our extended model using existing C++ simulation libraries and to manage input parameters and output predictions using R. The model is implemented together with an extensible microsimulation framework. It is available from https://github.com/mclements/microsimulation and https://github. com/mclements/prostata under a GPL3 license, allowing for use and reuse, in contrast to most existing microsimulation models which are not open source [31].

Model fitting and calibration
To adapt the extended prostate cancer model to the Swedish context, a number of input parameters were estimated from external data and a smaller set of parameters were estimated using simulation predictions fitted to calibration targets. First, a set of parameters were estimated from available data sources, including parameters describing the longitudinal development of PSA with age, the rate controlling time to cancer onset, and transition between cancer states (Table 1, external inputs). Observed characteristics of the modelled population were collected and used to estimate another set of parameters via simulations of the model. The characteristics used as calibration targets were the distribution of incident prostate cancers by Gleason scores and T-stage, observed survival by age groups and disease stage, together with the PSA uptake and re-testing (Table 1, directly observed inputs). The PSA test uptake prior to 2003 was reconstructed by fitting a model to prostate cancer incidence using later PSA test rates as covariates, and we used survival analysis to estimate re-testing rates prior to cancer diagnosis by age and PSA values. The calibration targets are described in further detail under the sub-section on Calibration methods below. The model was validated against the population of Sweden and the population of Stockholm [32] (Table 1, validation targets). For the validation, we simulated the observed PSA testing pattern and validated the model against the population data for incidence, all-cause mortality and prostate cancer mortality (see Results and section C in S1 Appendix).
Emulating the ERSPC trial. We performed a simulation experiment to emulate the ERSPC trial, where we predicted both the "control" arm and the "screening" arm with 100 million simulated men. Both arms where constructed as flat populations with inclusion between ages 55-69 years after which they where followed for 13 years. For study eligibility, we assumed that the men had not had a prostate cancer diagnosis prior to age 55 years. For the control arm, we assumed no screening. For the screening arm, we assumed four-yearly screening between ages 55 and 69 years. The PSA threshold was assumed to be 3.0 ng/mL, although in fact this varied by study site. We also used the reported biopsy compliance of 85.6%. Treatment and other-cause deaths were assumed to be similar to those observed in Stockholm.
Calibration methods. We used four sets of targets for our calibration procedure (with associated parameters): 1. The relative distribution of Gleason and T-stage for incident prostate cancers in contemporary Sweden (to estimate β 7 , β 8 , γ t , γ m , ϕ lag ) 2. An equality constraint on the mean time from onset to metastatic cancer (to estimate γ t , γ m ) 3. The incidence rate ratio due to screening from the ERSPC (to estimate γ t , γ m , ϕ lag ) 4. Detailed prostate cancer survival for contemporary Sweden.
For the first step, we calibrated for the incidence-related targets 1, 2 and 3 in one likelihood; and then, as a second step, we calibrated for target 4. For targets 1, 2 and 4, we simulated for current PSA testing in Sweden; for target 3, we simulated for both arms of the ERSPC.
For target 1, we used a multinomial likelihood with unknown parameters θ = (β 7 , β 8 , γ t , γ m , ϕ lag ) 0 . The multinomial log-likelihood was defined as where i is an index over age, j is an index over cancer staging, n i is the observed total count in a particular age group, y ij is the observed count for a combination of age and cancer staging, m i (θ) is the simulated total count, and p ij (θ) is the simulated proportion of individuals in a particular disease state (see Eqs (2)-(4) for the multinomial data generating mechanism). The cancer staging for the observed frequencies and simulated proportions were by age and (i) locoregional cancers by combinations of Gleason score and T-stage, and (ii) metastatic prostate cancers. The intercept terms α 7 and α 8 for the distribution of Gleason score at age 35 years were not identifiable and we assumed that α 7 = log(0.2) and α 8 = log(0.002). Half-cell corrections were performed to handle empty cells in the simulated proportions. For target 2, we used a non-linear equality constraint on the expected time from onset to metastatic cancer to ensure identifiability of progression across T-stages. Formally, where � t T1À T2 ðθÞ are the mean simulated transition times from onset to T3-T4, � t T3À T4 ðθÞ are the mean simulated transition times from T3-T4 to metastatic cancer, and � t old is the expected mean time from onset to metastatic cancer from a model without separate T stages (25.9 years from the FHCRC model; [12]).
For target 3, we used a non-linear equality constraint on the simulated incidence rate ratio from the ERSPC, where where IRR is the observed PSA screening incidence rate ratio from the ERSPC study and IRR (θ) is the simulated incidence rate ratio for the emulation of the ERSPC study.
Formally, the log-likelihood l 123 (θ) for targets 1-3 was l 123 ðθÞ ¼ l 1 ðθÞ þ w 2 l 2 ðθÞ þ w 3 l3ðθÞ ð24Þ where w 2 and w 3 are weights for the non-linear constraints. Note that the equality constraints in targets 2 and 3 were formulated in terms of weighted quadratic penalties. The weights were selected so that the constraints were approximately satisfied (w 2 = 1; w 3 = 10 4 ).
To optimise the simulation log-likelihood l 123 (θ), we used the Nelder-Mead optimisation algorithm. For each iteration of the optimisation, we evaluated the log-likelihood by simulating three different scenarios that depended on the parameters θ. From these scenarios, we predicted values that were used in the log-likelihood, including the relative distribution of cancer staging, the mean time from onset to metastatic cancer, and the PSA screening incidence rate ratio for the reconstructed ERSPC trial. The Nelder-Mead algorithm is commonly used to optimise functions for which derivatives are difficult to calculate and for objectives that are not smooth. The standard errors were calculated from the inverse of the Hessian matrix for the negative log-likelihood (see Table 2). Given the simulation likelihood, the calculation of the Hessian matrix required that the step size for the finite differences used a larger step size (0.01).
For the second step and target 4, the distributions of Gleason score, T-stage and metastatic cancer (θ) were kept fixed. Using the mean between the observed 10-and 15-year survival as the calibration target and Kaplan-Meier estimates based on the model simulations, we calculated the hazard ratios by age group, cancer stage, Gleason score and PSA values. The adjust-