Intercenter validation of a knowledge based model for automated planning of volumetric modulated arc therapy for prostate cancer. The experience of the German RapidPlan Consortium

Purpose: To evaluate the performance of a model-based optimisation process for volumetric modulated arc therapy applied to prostate cancer in a multicentric cooperative group. The RapidPlan (RP) knowledge-based engine was tested for the planning of volumetric modulated arc therapy with RapidArc on prostate cancer patients. The study was conducted in the frame of the German RapidPlan Consortium (GRC).
Methods and materials: 43 patients from one institute of the GRC were used to build and train an RP model. This was then shared with all members of the GRC plus an external site from a different country to increase the heterogeneity of the patient sample. An in silico multicentric validation of the model was performed at planning level by comparing RP against reference plans optimized according to institutional procedures. A total of 60 patients from 7 institutes were used.
Results: On average, the automated RP based plans were fully consistent with the manually optimised set, with a modest tendency to improvement in the medium-to-high dose region. A per-site stratification identified different patterns of performance of the model, with some organs at risk better spared with the manual approach and others with the automated approach, but in all cases the RP data fulfilled the clinical acceptability requirements. Discrepancies in performance were due to different contouring protocols or to different emphasis put on the optimization of the manual cases.
Conclusions: The multicentric validation demonstrated that it was possible to satisfactorily optimize patients from all participating centres with the knowledge based model. In the presence of possibly significant differences in the contouring protocols, the automated plans, though acceptable and fulfilling the benchmark goals, might benefit from further fine tuning of the constraints. The study demonstrates that, at least for the case of prostate cancer patients, it is possible to share models among different clinical institutes in a cooperative framework.



Introduction
Knowledge-based radiotherapy treatment planning (KBP in the following) is a concept that has been pioneered over the past several years, and the results published by the developers of the main algorithms [1][2][3][4][5][6][7] suggest that the definition of the appropriate dose-volume constraints for inverse planning can be largely automated and individualized. The basic idea is to develop engines which, after mining the historical planning data, can predict the expected dose volume histograms (DVH) for any organ at risk (OAR) of any new patient. Different solutions have been proposed, from look-up engines to machine learning based systems, but all aim to automate one of the most critical and subjective phases of the inverse planning process. In fact, a sub-optimal choice of the planning constraints might lead to sub-optimal final plans, but identifying the best (or best balanced) set of constraints might be too hard a task in real-world practice. Reasonably, limits in knowledge, expertise, resources and time are the main risk factors for inadequate planning. Among the solutions available for clinical practice, one commercial system, RapidPlan (RP), has already been intensively investigated in the literature [8][9][10][11][12][13][14][15][16][17][18][19][20][21]. These studies showed some general trends, including a generally improved plan quality, a reduced inter-clinician variability, and the possibility to transfer planning expertise from more experienced centers to less experienced institutions. RP has been introduced in clinical practice, and Hussein et al [18] summarized well the main features of the system and the main aspects of the construction of a predictive model.
In summary, RP is an engine that takes the geometrical features of the patients and correlates these to the previously achieved dosimetry to generate appropriate estimates of the achievable dose distributions for prospective cases and the automated definition of the optimization constraints.
The German RapidPlan Consortium (GRC) was created as a cooperative initiative among six radiation oncology institutes across Germany with the aim of facilitating the learning phase of this new technology and shortening the time needed to move from the pre-clinical to the clinical implementation of the KB process. Members of the GRC are both academic institutes and departments of regional community hospitals or of private networks of hospitals. This mix of different sizes, levels of available infrastructure and human resources made the group representative of the various types of clinical centres in the wider radiation oncology community. The activity of the GRC has consisted, so far, of several tasks including: i) individual learning and assessment of the RapidPlan KBP system; ii) development, training and in-house validation (closed-loop) of some models per center; iii) cross validation of the models with other centres of the GRC (open-loop tests) and, more recently, the execution of a multicentric experiment mimicking the possible broader sharing of a validated model.
The aim of this latter study, summarized in this report, was primarily to demonstrate that a model developed by one centre of the GRC could be used by all other members, with the addition of an external member with possibly different clinical characteristics.
The paradigmatic clinical case selected for the study was high risk prostate cancer and the treatment of the pelvic volume with volumetric modulated arc therapy (VMAT). Fogliata et al. [9] and Hussein et al [18] reported on the development of RP models for the prostate, but for settings different from the one presented here. Furthermore, the results presented in those studies relate to single institute experience.
The experiment was to be considered positive if all the institutes were able to generate plans acceptable with respect to the institutional goals and if the average over the entire cohort fulfilled the benchmark requirements with the RP system. A more quantitative assessment also aimed to determine the possible improvement in plan quality induced by the use of RP and/or to determine the limits or weaknesses of the system.

Material and methods
The RapidPlan model: a KB predictive model for RapidPlan was built for high risk prostate cancer patients. The model was designed for the treatment of the pelvic volume (including the prostate, the seminal vesicles and the pelvic nodes, partial treatment and no simultaneous boost concept applied). The dose prescription was set to 50.4Gy (in 28 fractions). All plans were normalised to the mean target dose. The OARs considered for the model definition and training were the bladder, the rectum, the femoral heads and the small intestine. The contouring of the small intestine actually accounted for the bowel bag, trying to encompass all possible positions of the intestinal loops. The margin from the clinical target volume (CTV) to the planning target volume (PTV) was defined as 7mm isotropic.
The model was trained with a set of 43 patients, all selected by a single institute of the Consortium using a TrueBeam linear accelerator (Varian Medical Systems, Palo Alto, US) with the VMAT RapidArc technique and 2 full arcs. The photon beam energy selected for the model training dataset was 15MV, but the study design did not make this a mandatory requirement for the tests. The dose plan optimisation, the dose calculation and the model training were performed using the corresponding algorithms of the Eclipse planning system version 13.6.23. The model validation was performed by inspection of the potential outliers and of the quality of the treatment plans selected for the training. The final model version, approved for the study, was free of any major outlier that could negatively influence the estimation power of the model.
The objectives defined for the DVH constraints definition are summarized in Table 1.
The inter-center test: the model was distributed among all the Consortium centers plus one additional site outside the core group and applied to a number of patients per site. These centers are labelled S1 to S7 and include the center where the model was built. A total of 60 test plans were developed based on this model (the number of cases per center was 10, 7, 6, 7, 13, 10 and 7 respectively).
Each RapidPlan case was compared against a corresponding plan, manually optimized. To provide a "real world" assessment of the robustness of the DVH prediction power of the RP model, each center was requested to include in the study typical patients belonging to the high risk category but without any modification of the local contouring strategies. While the technique (2 full RapidArcs) was requested to be kept for all cases, the beam energy was left free and it was selected to be either 6MV or 15MV according to local practice.
The RapidPlan optimization was performed for all 60 test cases without any interactive intervention on the process. No additional control structures were used for the optimization.
The manually optimized plans were instead designed according to the standard procedures of the individual clinics, with the possible use of help structures and with interactive modification of the constraints if needed. Although the clinically acceptable dose-volume constraints depend on the individual center, the general benchmark objectives to be met for the average of the entire cohort were defined as follows. For the target coverage, D98% ≥ 98% (95%) for the CTV (PTV respectively). For the organs at risk, the maximum dose D1% to the bowel and to the femoral heads was required to be lower than 50Gy and 40Gy respectively. For the rectum the constraints were mean dose <36Gy and V50% <10%, and for the bladder mean dose <36Gy and V50Gy <20%. D1% <50Gy was the ideal objective; since it could not be met because of the overlap between the PTV and the bladder or rectum, the ALARA (as low as reasonably achievable) principle was applied in this case.
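As an illustration of how the cohort-average benchmark objectives listed above could be checked programmatically, the following minimal Python sketch encodes them as a lookup table. The metric names, the function `check_benchmarks` and the example values are hypothetical and not part of the study; the rectum dose-volume objective is left out because its dose level is not stated unambiguously in the text.

```python
import operator

# Benchmark objectives for the cohort average, as listed in the text.
# The dictionary keys are hypothetical names chosen for this sketch.
# The rectum dose-volume objective is omitted (dose level unclear in the text).
BENCHMARKS = {
    "CTV_D98_pct": (operator.ge, 98.0),        # D98% >= 98% of prescription
    "PTV_D98_pct": (operator.ge, 95.0),        # D98% >= 95% of prescription
    "Bowel_D1_Gy": (operator.lt, 50.0),        # D1% < 50 Gy
    "FemoralHeads_D1_Gy": (operator.lt, 40.0), # D1% < 40 Gy
    "Rectum_mean_Gy": (operator.lt, 36.0),     # mean dose < 36 Gy
    "Bladder_mean_Gy": (operator.lt, 36.0),    # mean dose < 36 Gy
    "Bladder_V50Gy_pct": (operator.lt, 20.0),  # V50Gy < 20%
}


def check_benchmarks(cohort_means):
    """Return {metric: True/False} for every benchmark present in the input."""
    return {
        name: compare(cohort_means[name], limit)
        for name, (compare, limit) in BENCHMARKS.items()
        if name in cohort_means
    }


if __name__ == "__main__":
    example = {"CTV_D98_pct": 98.5, "Rectum_mean_Gy": 37.0}  # made-up values
    print(check_benchmarks(example))
```

Keeping the objectives in a single table makes it straightforward to apply the same check to the manual and to the RP plan sets.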
Data analysis: all DVHs from all 60 test cases were exported from Eclipse and centrally analysed by one of the sites using the same metrics and calculation tools for all. Standard quantitative and qualitative assessment of the DVHs was performed by inspecting a number of dose-volume parameters for the targets (aiming at coverage and homogeneity information) and for the OARs (aiming at meaningful metrics for organ sparing). In particular, the Conformity Index (CI) was defined as the ratio between the body volume covered by the 95% isodose and the volume of the PTV, while the Homogeneity was measured as the difference between D5% and D95% divided by the mean dose to the PTV.
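The two indices defined above reduce to simple ratios. The sketch below shows them in Python; the function and argument names are hypothetical, and the numerical values in the example are invented purely for illustration.

```python
def conformity_index(v95_body_cc, ptv_volume_cc):
    """CI: body volume covered by the 95% isodose divided by the PTV volume."""
    return v95_body_cc / ptv_volume_cc


def homogeneity_index(d5_gy, d95_gy, mean_dose_gy):
    """Homogeneity: (D5% - D95%) divided by the mean dose to the PTV."""
    return (d5_gy - d95_gy) / mean_dose_gy


if __name__ == "__main__":
    # Invented example values, not taken from the study.
    print(conformity_index(1100.0, 1000.0))
    print(homogeneity_index(52.0, 48.5, 50.4))
```

With these definitions a CI close to 1 indicates that the 95% isodose volume matches the PTV volume, and a smaller homogeneity value indicates a more uniform target dose.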
The boxplots used to report differences between the various datasets show five statistics for each parameter: the median (solid line), the first and third quartiles (bottom and top ends of the box) and the whiskers, which correspond to 1.5 times the height of the box or, if no cases extend that far, to the minimum or maximum values (for normally distributed data this should correspond to approximately the 95% confidence interval). Any points plotted beyond the whiskers are outliers.
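The boxplot statistics described above can be sketched as follows. This is a minimal illustration that assumes the common Tukey convention (whiskers drawn to the most extreme data point lying within 1.5 box heights of the box) and the "inclusive" quartile method of the Python standard library; the plotting software used in the study may compute quartiles slightly differently, and the function name is hypothetical.

```python
import statistics


def boxplot_stats(values):
    """Median, quartiles, whisker ends and outliers (Tukey convention assumed)."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    box_height = q3 - q1
    low_fence = q1 - 1.5 * box_height
    high_fence = q3 + 1.5 * box_height
    inside = [x for x in data if low_fence <= x <= high_fence]
    return {
        "median": median,
        "q1": q1,
        "q3": q3,
        "whisker_low": min(inside),   # equals the minimum if nothing lies below
        "whisker_high": max(inside),  # equals the maximum if nothing lies above
        "outliers": [x for x in data if x < low_fence or x > high_fence],
    }


if __name__ == "__main__":
    print(boxplot_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 30]))  # invented sample
```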

Results
To appraise the quality of the input cases used for the model construction and training, Fig 1 presents the distribution of DVHs for the various structures (targets and OARs) used for the model training, with a typical case outlined together with its prediction band. The most widely spread group was the small bowel, as reasonably expected given its broad contouring definition.
The heterogeneity of the test population was appraised in terms of the variability of the volumes of the targets and the two main organs at risk (bladder and rectum). Fig 2 shows the boxplots from the analysis of the volumes of the PTV, the bladder and the rectum. The dashed line represents the median value for the entire cohort of 60 patients. There were differences in the contouring strategy for the PTV as well as for the rectum, both statistically significant with a one-way analysis of variance (p = 0.01 and p = 0.02 respectively), while the observed difference for the bladder was not statistically significant (p = 0.09). The overlap volume between the PTV and the bladder or rectum was significantly different among centers (p<0.001 for rectum and p = 0.01 for bladder) when analysed in absolute or in percentage terms. The dosimetric results are summarised in Table 2 for the CTV and the PTV and in Table 3 for the various OARs. CI was of course defined and reported only for the PTV. The data are reported for each site individually as well as for the entire cohort (All) and expressed as mean values, while the interpatient variability is reported at 1 standard deviation level.
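For readers unfamiliar with the one-way analysis of variance used to compare structure volumes among centers, the following sketch computes the F statistic from per-center samples using only the Python standard library. The function name and the sample data are hypothetical; obtaining the p-value additionally requires the F distribution, which a statistics package (for example `scipy.stats.f_oneway`) would provide.

```python
from statistics import fmean


def one_way_anova_f(groups):
    """F statistic of a one-way ANOVA: between-group over within-group variance."""
    k = len(groups)                       # number of groups (centers)
    n = sum(len(g) for g in groups)       # total number of observations
    grand_mean = fmean(x for g in groups for x in g)
    means = [fmean(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


if __name__ == "__main__":
    # Invented volumes for three fictitious centers, for illustration only.
    print(one_way_anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]]))
```

A large F value means that the spread between center means is large compared with the spread within each center, which is what a systematic difference in contouring strategy would produce.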
Concerning the target volumes (Table 2), no remarkable difference was observed among the parameters in terms of homogeneity and conformity. Some differences were statistically significant (p<0.05, marked with a *). Nevertheless, the absolute difference observed in those cases is small. For example, we observed a difference of 0.2-0.4Gy (i.e. ~0.5-1% of the prescribed dose) for the mean dose to the CTV, or 0.4Gy (0.8%) for D98%. In general, the RP and CL (clinical, manually optimised) plans were equivalent.
The analysis of the OAR data is less straightforward and reflects the different clinical strategies and preferences among the various centers (in the CL sets) as well as the impact of the different contouring rules (and therefore different overlaps). This can be better visualized in the accompanying figures. The mean dose to the bladder improved, on average, by 0.8Gy; the improvement was 2.5% for V40Gy and 1.1% for V45Gy; a worsening of 0.9Gy and 1.9% was observed for D1% and V50Gy respectively. For the rectum, an average improvement of 0.6Gy was observed for the mean dose and of 2.4% and 0.4% for V40Gy and V45Gy, while D1% and V50Gy worsened by 0.3Gy and 1.6% respectively. Lastly, for the small bowel, the average improvement was 1.9Gy for the mean dose and 1.5Gy for D1%. Despite the significant differences in the degree of overlap between the PTV and the bladder or rectum observed among centers, no further correlation was found between the degree of overlap and any dosimetric parameter.
A side cross-validation test was performed to validate the usability of the same model irrespective of the beam energy selected for the plans. The average DVH for the target volumes and some of the organs at risk from one center are shown in Fig 6. No differences were observed between the plans optimised for 6 or 15 MV photons and all based on the same model.

Discussion
The scope of the investigation was to understand whether a knowledge based model aiming to automate inverse planning procedures, developed in one center, was effectively applicable to other institutions of the same or a similar geographical and cultural area. The "affinity" among the centres would imply reasonably similar practice and protocols and some homogeneity in the patient population. The model analysed was intended to be a broad scope one for the treatment of the pelvic volume in the high risk group of prostate cancer patients. The testing centers were not required to adhere strictly to the model definitions (in terms of contouring rules, for example); rather, the scope of the study was to appraise the possibility of using the same model within a 'real world' environment mimicking daily practice in different types of institutes. The institutes belonging to the GRC ranged from relatively small private departments (members of a network) to public regional community hospitals and larger academic centers. The results demonstrated that, on average, the use of KBP tools allowed some improvement in the sparing of the organs at risk compared to routine clinical practice without compromising the coverage of the target volumes. In a site-by-site analysis, the performance of the automated planning showed different patterns, and this was attributed primarily to two factors: the heterogeneity in the contouring rules (i.e. a heterogeneity in the "patients" set) and different priorities in the sparing of the OARs (e.g. some centres emphasising bladder over rectum or vice versa). A significant metric of the heterogeneity of the test dataset was provided by the variability of the overlap between the PTV and the bladder or the rectum.
On the one hand, this makes evident the different contouring protocols even within culturally homogeneous groups; on the other hand, it justifies to some extent the differences observed in the degree of sparing of the whole OARs as reported in the results. Indeed, the RapidPlan system used for the study does not fully account for any overlapping region, giving priority to the target volumes in the generation of the objective lines and constraints.
The findings reported here confirm the quite obvious need for an accurate review and validation when a KB model generated and tested by other clinics is going to be applied for the first time in any new institute. Any unsupervised application of a KB model (or in general of a technique not developed in-house) should be discouraged. Berry et al [20] published an interesting study investigating whether the use of KBP would make it possible to identify systematic variation in IMRT planning between different satellites of the same institute. The study was performed for intensity modulated treatment of esophageal cancer, and a model built at the main campus was distributed to the others for testing and benchmarking. Their results proved that this was the case and that the use of KBP made it possible to identify differences among the campuses, possibly due to different levels of expertise, workload or preferences. Berry's findings and ours are consistent, and from both it can be derived that KBP methods can be used either to harmonize practice or to facilitate the identification of challenges. In any case, KBP can lead to improvements in planning quality.
Li et al [21] aimed to investigate whether the use of KBP could facilitate adherence to clinical trial requirements and also improve the quality of the plans, based on the assumption that lack of quality control is one relevant factor impacting on the outcome of the trials. This is in line with a recent publication where it was shown that the clinical outcome is strongly correlated with the volume of patients treated in an institute [22]. Li demonstrated that a properly built model made it possible to outperform manual planning in all protocol-specific dose volume objectives. In the frame of our study, this implies that, in a multicentric cooperative initiative, adherence to guidelines or to recommendations could be facilitated and strengthened by the use of KBP methods.
One unconventional feature of the present study is the mix of cases planned for 6MV or for 15MV photons. Despite the apparent inconsistency, this was felt to be a strength, primarily because it proves that the same model can be applied to somewhat different strategies, even at the level of beam energy, and still be adequate in the quality of the results. More specifically, the energy selection should not constitute a problem with VMAT plans where full arcs are adopted, as in the present study, since the quality of the plans is very much comparable, as was demonstrated quite early in the RapidArc era [23][24]. To further demonstrate the robustness of the model with respect to the energy selection, one centre generated two sets of plans with RP for 6 or 15MV photon beams; the detailed results, not shown in this report for contingent reasons, demonstrated the complete equivalence of the two sets, as illustrated by the average DVHs in Fig 6.

Conclusions
The multicentric validation demonstrated that it was possible to satisfactorily optimize patients from all participating centers with the knowledge based model. In the presence of possibly significant differences in the contouring protocols, the automated plans, though acceptable and fulfilling the benchmark goals, might benefit from further fine tuning of the constraints.

Author Contributions
Conceptualization: CS OW LC.