Classical mathematical models for prediction of response to chemotherapy and immunotherapy

Classical mathematical models of tumor growth have shaped our understanding of cancer and have broad practical implications for treatment scheduling and dosage. However, even the simplest textbook models have been barely validated in real world-data of human patients. In this study, we fitted a range of differential equation models to tumor volume measurements of patients undergoing chemotherapy or cancer immunotherapy for solid tumors. We used a large dataset of 1472 patients with three or more measurements per target lesion, of which 652 patients had six or more data points. We show that the early treatment response shows only moderate correlation with the final treatment response, demonstrating the need for nuanced models. We then perform a head-to-head comparison of six classical models which are widely used in the field: the Exponential, Logistic, Classic Bertalanffy, General Bertalanffy, Classic Gompertz and General Gompertz model. Several models provide a good fit to tumor volume measurements, with the Gompertz model providing the best balance between goodness of fit and number of parameters. Similarly, when fitting to early treatment data, the general Bertalanffy and Gompertz models yield the lowest mean absolute error to forecasted data, indicating that these models could potentially be effective at predicting treatment outcome. In summary, we provide a quantitative benchmark for classical textbook models and state-of-the art models of human tumor growth. We publicly release an anonymized version of our original data, providing the first benchmark set of human tumor growth data for evaluation of mathematical models.


Introduction
The growth of solid tumors and their response to therapy is hard to predict on the level of individual patients. Similar to other complex systems such as the climate [1,2] or stock markets [3], quantitative mathematical models can be used to describe and forecast the behavior of cancer: this is one of the main objectives of "mathematical oncology" [4,5]. Mathematical models of tumor growth kinetics have improved the understanding of underlying biological mechanisms. [6][7][8] In addition, they have resulted in a number of modeling approaches for cancer treatments including chemotherapy [9,10] and immunotherapy [11,12], improved drug dosage [13,14] and have yielded candidate biomarkers for treatment response [15]. The roots of tumor growth models go back to 1825, when Gompertz published a mathematical model to analyze the population growth [16]. He argued that the number of people alive as a function of their age L(x) declines faster than exponential functions which means that the death rate should be increasing with age. 135 years later, von Bertalanffy addressed the question of "why does an organism grow at all and why after a certain time, does its growth come to stop?" [17] By replacing its concept of an "organism" with a malignant tumor, the answer to this question resulted in a mathematical model for tumor growth. Tumor modeling provides information about the net tumor growth rate, facilitates their comparison among different tumor types [18] and makes it possible to predict the future growth of tumors [19].
A number of "textbook" models have been used in the past to approximate tumor growth with mathematical equations. In addition to the above-mentioned models by Gompertz and von Bertalanffy (each in a "classical" and a more general form), exponential and logistic models are standard approaches to describe tumor growth (Table 1) [20]. Exponential models are able to predict either exponential growth or decay depending on the absolute values of birth and death rates, and the resulting sign of (birth rate − death rate). Logistic models can simulate the fact that tumor growth is limited by nutritional, immunological or spatial constraints by including a carrying capacity into the model at which the tumor volume plateaus. This carrying capacity is included in the per capita growth rate, in line with the observation that tumor growth slows down when the tumor volume becomes large. [21] To be precise, the carrying capacity can be interpreted to comprise a number of biological constraints to tumor cell www.ClinicalStudyDataRequest.com), which is now inactive and has been replaced by the Vivli platform (https://vivli.org, April 2021). Qualified researchers may request access to individual patient level data through the clinical study data request platform (https://vivli.org/). Further details on Roche's criteria for eligible studies are available here (https://vivli.org/members/ourmembers/). For further details on Roche's Global Policy on the Sharing of Clinical Information and how to request access to related clinical study documents, see (https://www.roche.com/research_and_ development/who_we_are_how_we_work/clinical_ trials/our_commitment_to_data_sharing.htm). The original proposal submitted to the CSDR platform is available in S1 Text. In order to enable reproduction of our experiments, we publicly release a fully anonymized subset of the data containing only the tumor volume measurements for the target lesion and the respective study and treatment arm (S1 Data).
proliferation. These constraints include the availability of nutrients and oxygen and thus, the concept of tumor angiogenesis is implicit in the carrying capacity. In addition, the pressure of immune cells attacking tumor cells limits the niche the tumor cells can fill and thus, the concept of antitumor immune response is implicit in the carrying capacity. The Gompertz model is another model which illustrates the experimentally observed decrease in the growth speed of tumors. Similarly to the logistic model it has a sigmoid shape, hence representing limited tumor growth. Its main assumption is the exponential decay of the growth rate [18]. Since this model has been applied in many fields to various problems a few equivalent Gompertz models exist, differing in the chosen re-parametrization. Gompertz and von Bertalanffy growth models are two basic but important models which are commonly used to model tumor volume growth since they have outperformed exponential models in many cases in the past [22].
Unlike in other domains in which mathematical models and practice are strongly linked, the field of mathematical oncology is, by and large, somewhat disconnected from clinical practice of oncology. While in recent years, large quantitative data collections have deepened the genetic [23] and immunological [24][25][26] understanding of solid tumors, even well-established textbook models in mathematical oncology have not been linked with or validated in large amounts of quantitative real-world data. As a result, growth models that form a conceptual backbone of mathematical oncology have never been formally validated in large patientderived datasets. In 2014, Benzekry et al. have systematically validated a range of textbook mathematical models on quantitative data obtained from two mouse models [20]. More recently, Vaghi et al. have extended that study and have validated classical growth models in 833 measurements in 94 animals [27]. These systematic large-scale approaches are highly important to link mathematical oncology to real-world data, but bear one major drawback: since almost all drugs that result in tumor control in mice fail in human experiments [28], mouse-based models are not suitable for human tumor growth estimations [29]. In addition, little validation of textbook models has been performed for tumors undergoing treatment, prompting caution whether unvalidated mathematical models have predictive power for clinical oncology. [30] While modeling of unabated tumor growth has academic relevance, fortunately untreated tumor growth for extended periods of time is rare in clinical practice [30]. Almost all patients with metastatic cancer undergo some type of systemic pharmacotherapy which slows down tumor progression [31].
In this study, we retrospectively collected quantitative measurements of tumor diameter changes over time from Non-Small Cell Lung Cancer (NSCLC) and bladder cancer patients from five large clinical trials. We systematically used this data with each of the standard mathematical models to address two questions: Firstly, how well can existing tumor growth models fit real-world data of patients undergoing treatment? (experiment #1) Secondly, how well can these models predict tumor growth at later disease stages when fitted to early-stage data? (experiment #2).

Ethics statement and data sharing
All experiments were conducted in accordance with the Declaration of Helsinki and the International Ethical Guidelines for Biomedical Research Involving Human Subjects by the Council for International Organizations of Medical Sciences (CIOMS). This study complies with the "Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis" (TRIPOD) statement [32]. All data were obtained in an anonymized way through a proposal to F. Hoffmann-La Roche Ltd. Through the platform "Clinical Study Data Request" (CSDR, www.ClinicalStudyDataRequest.com), which is now inactive and has been replaced by the Vivli platform (https://vivli.org, April 2021). Qualified researchers may request access to individual patient level data through the clinical study data request platform (https://vivli.org/ ). Further details on Roche's criteria for eligible studies are available here (https://vivli.org/ members/ourmembers/). For further details on Roche's Global Policy on the Sharing of Clinical Information and how to request access to related clinical study documents, see (https:// www.roche.com/research_and_development/who_we_are_how_we_work/clinical_trials/our_ commitment_to_data_sharing.htm). The original proposal submitted to the CSDR platform is available in S1 Text. In order to enable reproduction of our experiments, we publicly release a fully anonymized subset of the data containing only the tumor volume measurements for the target lesion and the respective study and treatment arm (S1 Data).

Data acquisition and preprocessing
We used data sets from five different clinical trials (Tables 1 and 2). The purpose of the original studies was the evaluation of the efficacy and safety of Atezolizumab (previously known as MPDL3280A), an immune checkpoint inhibitor directed against the Programmed Death Ligand 1 (PD-L1). In two out of the five trials (GO28753, GO28915), the performance of Atezolizumab was compared to Docetaxel, a chemotherapy drug. In the other three trials, all the participants received Atezolizumab as a treatment and the participants were further categorized into treatment arms or clinical subgroups as defined in the study protocols. One-dimensional longest diameter and shortest diameter of target and non-target lesions as manually measured on CT scans were available from the study database and were reported for each patient at different time intervals (Fig 1A). Because the shortest diameter was only available for a subset of patients, we used only the longest diameter (LD) and converted it to tumor volume (V) by V = LD 3 � 0.5 as described before [33]. Using the maximum value of V in the whole data set, the volumes were normalized to be in the range of 0 and 1 for the whole dataset. Most patients in the data sets had multiple tumor lesions (primary tumor and/or metastases). For simplicity, we refer to these lesions as "tumors". In the data set, one of these tumors for each patient was labeled as a "first target lesion" ('INV-T001'), i.e. an easily measurable lesion for which the diameter was closely monitored over time. In addition, patients usually had one or more "non-target lesions". In this study, we only used the target lesion which was labeled as for all the patients, as this was usually the tumor with the highest number of data points. Each patient has a different number of data points for this selected target lesion. While the intervals between data points were relatively similar (they were on average 50.62 days, with a standard deviation of 6.2 days), the absolute number varied (there were on average 3.63 data points for the selected target lesion per patient with a standard deviation of 3.22). Patients with few data points likely dropped out of the study early due to death or other reasons. To enable robust fitting of mathematical models to the data, we limited our data set to two sets of patients with three or more (or six or more, respectively) data points for the target lesion. Cumulatively, the original data sets had 2693 patients, of which 1472 had three or more data points and 652 had six or more data points available ( Fig 1B).

Patient categorization according to RECIST and trajectory type
For each patient, the ultimate response was encoded according to the response evaluation criteria in solid tumors (RECIST) system [34]. Based on the latest modification of this criteria (RECIST 1.1) [35], four tumor responses to treatments can be defined: Complete Response (CR, disappearance of all target lesions), Partial Response (PR, at least 30% decrease in sum of the longest diameters of target lesions in comparison to the baseline value), Progressive Disease (PD, at least an increase of 20% in the sum of the longest diameters of target lesions in comparison to baseline value) and Stable Disease (SD, when none of the above criteria fits to the tumor response). Because the RECIST system only assesses best response at discrete time points but does not categorize the full tumor volume trajectories, we additionally categorized the patients into three treatment response groups: "up", "down" and "fluctuate" (Fig 1C). For Table 2. Detailed summary of included studies. Data from five studies were used in this work. All studies can be identified either by their clinical trial registry number ("NCT . . .") or by their Roche ID ("GO . . .").

Study Design
NCT01846416 GO28625 FIR [52] -Atezolizumab in PD-L1 + in NSCLC (n = 138), Phase 2 -1) patient with no first treatment -2) patients progress following platinum chemo -3) patients 2L + treated brain metastases -ORR = 32% /21% / 23% NCT01903993 GO28753 POPLAR [53] -After platinum failure: Atezolizumab or Docetaxel in NSCLC n = 287; Phase 2 -1) 144 in Atezolizumab group -2) 143 in docetaxel group -OS 12.6 months / 9.7 months -improvement in OS with higher PD-L1 expression -Atezolizumab improved survival, correlated with expression PD-L1 NCT02031458 GO28754 BIRCH [54] -Atezolizumab in PD-L1 positive advanced or metastatic NSCLC n = 667;  this purpose, we calculated a vector containing the difference of each LD measurement at time point t + 1 to its previous measurement at time point t for each patient. If the LD at t + 1 was bigger than at t, the difference would be positive and vice versa. Patients for whom only the shortest distance measurement was available were excluded from the analysis. The "up" category includes patients whose difference vector values are always positive and patients with a positive difference after the first measurement if the ratio between the sum of all positive values to the sum of all negative values is >2. The "down" category includes patients whose difference vector values are always negative or a negative difference after the first measurement if the overall ratio between the sum of all negative values to the sum of all positive values is >2. The "fluctuate" category contains all patients that correspond to neither up nor down categories. In all five studies, the pattern "up", "down" and "fluctuate" was present (Fig 1D).

Models and experimental design
We predefined six classical mathematical models to be fitted to the data ( ðy i À À yÞ ). Experiment #1 was run on both patient sets separately: The set with all patients with three or more data points and the set with patients with six or more data points. Experiment #2 was aimed at fitting models to the early measurements for each patient, Table 3. Model description and interpretation of the parameters. For all differential equation models in the current study, the model name, equations and variables are listed. � birth rate and growth rate can be combined to one parameter, the effective growth rate.

Model name
Solution of the differential equation Differential equation (with initial condition V(0) = V 0 )

Parameter description Ref.
Exponential   General von Bertalanffy excluding the last three data points and subsequently estimating the predictive accuracy for the excluded data points. The statistical endpoint for experiment #2 was the MAE. Experiment #2 was only run on the set of patients with six or more data points.

Fitting and implementation
All model fitting procedures were implemented in Python 3.7. In particular, we used differential evolution to generate the initial data points for the differential equations. Differential evolution is a stochastic population based method which is used for global optimization problems [36]. Based on these initial guessed parameter values, the "Curve_fit" function (from the python package "scipy") is used to fit the model parameters to the experimental data. This function uses the Trust Region Reflective (trf) algorithm with the non-linear least squares loss function to find an optimal fit of the model parameters to the data points. The inputs to this function are the sorted time and its corresponding tumor volume measurements, the respective mathematical function to fit the data and the maximum number of iterations (we used 1000 iterations in this study). The output of the "Curve_fit" function is the calculated optimum parameters for the selected mathematical function. Having these parameters, it is possible to predict the volume values for each time point and then evaluate the goodness of the fit. The source codes are publicly available at https://github.com/KatherLab/ImmunotherapyModels.git.

Early RECIST status does not correspond well with ultimate treatment response
In clinical routine and clinical trials, RECIST response at early time points during treatment is often used to determine whether a given treatment should be continued [34]. If these initial RECIST results perfectly matched the ultimate RECIST, there would not be a need for more mathematical prediction models. Therefore, we systematically compared the RECIST status at the first, second, third and fourth tumor size evaluation for each patient with the "final" RECIST status as defined in the study protocol. In all treatment arms, we found an imperfect overlap between early and final measurements. Overall, the median concordance between first, second, third and fourth data point and final RECIST was 53.5, 64.0, 63.5 and 78.0, respectively. The same pattern was seen for the concordance between early and final RECIST calculated for only one target lesion ( Fig 1E). Hence, the RECIST classification can be a useful tool to assess therapy response status, but it might be insufficient for therapy response estimation at an early therapy stage. These findings provide a rationale for the use of mathematical models to improve response prediction. In addition, we compared statistically the correlation between the RECIST standard classification categories (CR/PR, SD and PD) with the developed grouping methods (up, down and fluctuate). As the results are summarized in S1 Fig both grouping systems are partially correlated (PD is mostly overlapping with "up", PR/CR with "down" and SD with "fluctuate"). However, the correlation was not perfect and particularly in the OAK study, 37 patients from 95 down category patients are classified as PD and 87 patients out of 133 patients in fluctuate category are classified as PD. This comparison shows that while RECIST is the standard classification system in clinical routine, our grouping method does provide an additional perspective on tumor response categories.

The Gompertz model outperforms other models when fitting clinical data points
We tested how well classical differential equation models (Table 3) can fit tumor volume trajectories under immunotherapy and chemotherapy. To compare these models, we first fitted them to all available data points for all patients with at least six measurements (experiment #1). This set the stage for experiment #2 (Fig 2A and 2B), in which models were fit to all points except the three last points and the predictive power was assessed for each model. In experiment #1, we found that all models provided a good fit to most data points, but the number of poorly fitted points differed between the models. Overall, the General Bertalanffy, the Gompertz and the General Gompertz model had the lowest number of poorly fitted data points (Fig 2C). We quantified this by calculating multiple metrics for the goodness of fit for each model, for each study arm, further stratifying patients in each study arm by the ultimate RECIST response. Again, we found that the General Bertalanffy, the Gompertz and the General Gompertz model consistently outperformed more simple models. (Fig 3A and 3B) The exponential model yielded the worst fit for 13 out of 19 patient groups in this analysis (Fig 3A). To rule out a selection bias, we repeated experiment #1 with all patients with at least three measurements, yielding comparable results (S2A and S2B Fig). Due to their higher degree of freedom, complex models always yield a better fit than simple models to any data set. To account for this, we assessed the Akaike Information Criterion (AIC) which incentivises goodness of fit but penalizes model complexity.
We found that according to the AIC, the General Bertalanffy model consistently yielded the poorest performance compared to the other models (Fig 3C and 3D). This observation also held when all patients with three or more measurements were considered (S2C and S2D Fig). However, the Gompertz model had a low (good) AIC for most study arms, showing that this model give a good balance between goodness of fit and model complexity. To rule out that these effects were obtained by sub-stratifying patients according to their final RECIST status, we repeated experiment #1 with patients sub-stratified as "up", "down" and "fluctuate", thereby considering the shape of the whole timeline for each patient. Again, we found that the General Bertalanffy, the Gompertz  For the "up" and "down" patient groups, the fitted model parameters were generally in a close range. For the "fluctuating" patient group, the fitted model parameters showed a higher variability between the patients, indicating the difficulty of to fit these trajectories (S1 Table). When penalizing for model complexity by using the Akaike Information Criterion, again the Gompertz model provided the best balance between goodness of fit and model complexity (S3G and S3H Fig). In summary, the Gompertz model adequately fitted the response to immunotherapy and chemotherapy across a range of clinically relevant populations, while having only two free parameters (Table 3).

Differential equation models can predict tumor response from early time points
While it is important to assess a model's ability to fit a tumor volume timeline a posteriori, a more clinically relevant problem is to predict final treatment response based on early tumor behavior under therapy. Therefore, we investigated if these models can predict the last data points when only fitted to early treatment response. To investigate this, we held out the three last data points on any given patient, fit the model to all remaining (early) data points and evaluated the mean absolute error from extrapolation to the holdout test measurements (experiment #2). Interestingly, we found that in most patient groups in most treatment arms the holdout data points could be very well predicted with this approach. A remarkable exception was the Classic Bertalanffy Model, which yielded the worst fit on the last three points as assessed by the Mean Absolute Error (S5A and S5B Fig). Overall, the best models for In experiment #1, models were fitted to all available data points for each patient (only for patients with at least 3 or 6 data points, respectively). In experiment #2, models were fitted to all but the last 3 data points for all patients with at least 6 data points. Then, the predictions for the last 3 data points were compared with the actual values. (B) Fit and prediction for three representative patients. (C) Plot of real data points and fitted data points for all models for all studies. A larger deviation from the diagonal indicates a worse fit. Models with a "raincloud" appearance systematically underestimate true tumor volume. https://doi.org/10.1371/journal.pcbi.1009822.g002

PLOS COMPUTATIONAL BIOLOGY
Prediction of tumor response to chemotherapy and immunotherapy predicting holdout measurements were the General Bertalanffy and the Gompertz model (S5A and S5B Fig). When analyzing the predictions of the exponential model ( Fig 4A) and the General Bertalanffy model (Fig 4B) in more detail, we found that for the "up" and "down" patients, the exponential and the General Bertalanffy model visually recapitulated the trajectory of the tumor volume. A notable exception are U-shaped curves present in some of the "fluctuating" patients ( Fig 4A and 4B).

Discussion
Cancer immunotherapy with immune checkpoint inhibitors is now an established part of the therapeutic arsenal for solid tumors [37]. Patterns of response to this class of drugs are more  complex than for classical chemotherapy [38]. Previous studies have at length discussed new response trajectories such as hyperprogression, pseudoprogression [38,39] or delayed response [40] in immune checkpoint inhibitors. Accordingly, simple assessment systems for treatment response such as RECIST are not ideally suited to predict future treatment response for a given patient. Although mathematical models of tumor growth have been used for decades to understand mechanisms of tumor progression and treatment response, they have not been systematically validated in human real-world data of patients undergoing systemic treatment. To our knowledge, the only large systematic evaluation of these models have been performed on mouse tumors [20,27], which function merely as a proxy for human tumors. Moreover, although immunotherapy is a cornerstone of cancer treatment and classical mathematical models are in principle useful to model cancer growth under therapy, they have not been previously applied to large cohorts of patients under immunotherapy. In this study, we present a systematic application of mathematical tumor growth models on a large human dataset of patients undergoing immunotherapy and chemotherapy. We restricted our analysis to six consensus mathematical models selected from [41]. We show that in particular the Gompertz model and the General Bertalanffy can successfully fit the tumor growth trajectory and provide an accurate prediction of ultimate treatment response on the basis of early treatment data. However, we also show that the fit for "fluctuating" patients is lower in all models, and fully Ushaped tumor growth trajectories could not be fitted at all. Comparison of the results between experiment #1 and #2 shows that models perform better if all the data points are used. However, from the clinical point of view, it is very useful if a model can predict the final response points from the early treatment response. This highlights the usefulness of stratifying patients into different categories and, in the future, of using more sophisticated models which can overcome this limitation. Our findings mirror a previous study by Benzekry et al. who demonstrated that the Gompertz model provides a good approximation of tumor growth in mice.
[20] Therefore, our study provides a potential bridge between textbook models of mathematical oncology and oncology practice today, providing evidence that simple mathematical functions can be used to predict immunotherapy response in most patient subsets.
A structural limitation to our study lies in the circumstance that the simple modelling of tumor growth or decay might not be the best predictor for the overall therapy outcome. Although the assessment of tumor growth might be useful to evaluate the drug or therapy regime response, it does not provide overall survival prediction for individual patients. Tumors might show a positive therapy response, but at the same time patients might die from adverse therapy events, infections or other therapy-related problems. Consequently, mathematical models which are solely based on tumor growth data should only be used together with other prognostic and predictive factors in clinical routine. Another limitation is the fact that by setting a threshold of at least six measurements at six points of time per patient, we had to exclude a part of patients from our final analysis. We mitigated this problem by repeating the analysis for patients with at least three data points, but this could still represent a selection bias by neglecting early study drop-outs and early cancer-related deaths. Other data-related limitations are that for some patients, only very few points can be present during the initial dynamics which might create problems. In the future, the availability of more complex datasets could allow researchers to build more complex models, thereby capturing more nuanced details of tumor growth. In practice, this is limited by the availability of structured data in oncology. In addition, in line with previous studies performed on mouse data, we used very simple mathematical models in this study [20,27]. Such models are a strong simplification of the reality of solid tumors, which are multicellular structures with a distinct spatial architecture [42]. Fundamentally, the key question is: how granular should a model be? This has been discussed extensively in the literature [8,[43][44][45][46][47]. More complex models have been proposed for modeling tumor growth under immunotherapy which could improve the fit to the data, for example the Kuznetsov model [12] and game theoretical models [48][49][50]. As a starting point for the analysis of more complex models of computational oncology in real-life human datasets of various cancer types, we provide our raw data for re-use by other groups. In addition to non-spatial models like the ordinary differential equation (ODE) models in this study, other studies have explored the use of spatial models in the context of cancer immunotherapy. [45,46] However, in these studies we found that it is very hard to fit the parameters of spatial models to clinical routine data. Even simple spatial models have >25 free parameters, which means that for every patient at least 25 measurements are needed (ideally much more). In comparison, the ordinary differential equation (ODE) models in our study are much simpler and they only have two or three free parameters. This simplicity enables fitting the model parameters to routine clinical data such as the databases used in our study. Furthermore, the use of non-spatial models is supported by theoretical considerations. Solid tumors consist of billions of cells which show some mobility in the immediates spatial vicinity. Tumors are not perfectly homogeneous in the spatial dimension, but if we assume that the relevant biological processes are sufficiently similar in distinct parts of the tumor, spatial patterns do not have to be explicitly modeled, but can be implicit as in ODE models. Ultimately, complex spatial models and simplistic ODE models are both very valuable tools which could be implemented in the clinic in different situations. Our present study provides the first large-scale evidence for the usefulness of ODE models. Future studies should investigate more complex models in similar experimental approaches. In general, clinical utility remains the ultimate benchmark, as was pointed out by Gerlee [51], "a model that is disconnected from reality in terms of mechanisms and dynamics is acceptable, as long as it does the job of predicting".
Ultimately, after refinement and prospective validation, such models could conceivably be used in the clinic to provide guidance on treatment recommendations for cancer patients. Unlike molecular biology-based biomarkers in the field of oncology, mathematical models could potentially improve response prediction for individual cancer patients based on ubiquitously available routine data.