Diagnostic value and relative weight of sequence-specific magnetic resonance features in characterizing clinically significant prostate cancers

Purpose To assess the diagnostic weight of sequence-specific magnetic resonance features in characterizing clinically significant prostate cancers (csPCa). Materials and methods We used a prospective database of 262 patients who underwent T2-weighted, diffusion-weighted, and dynamic contrast-enhanced (DCE) imaging before prostatectomy. For each lesion, two independent readers (R1, R2) prospectively defined nine features: shape, volume (V_Max), signal abnormality on each pulse sequence, number of pulse sequences with a marked (S_Max) and non-visible (S_Min) abnormality, likelihood of extracapsular extension (ECE) and PSA density (dPSA). Overall likelihood of malignancy was assessed using a 5-level Likert score. Features were evaluated using the area under the receiver operating characteristic curve (AUC). csPCa was defined as Gleason ≥7 cancer (csPCa-A), Gleason ≥7(4+3) cancer (csPCa-B) or Gleason ≥7 cancer with histological extraprostatic extension (csPCa-C). Results For csPCa-A, the Signal1 model (S_Max+S_Min) provided the best combination of signal-related variables, for both readers. The performance was improved by adding V_Max, ECE and/or dPSA, but not shape. All models performed better with DCE findings than without. When moving from csPCa-A to csPCa-B and csPCa-C definitions, the added value of V_Max, dPSA and ECE increased as compared to signal-related variables, and the added value of DCE decreased. For R1, the best models were Signal1+ECE+dPSA (AUC = 0.805 [95%CI:0.757–0.866]), Signal1+V_Max+dPSA (AUC = 0.823 [95%CI:0.760–0.893]) and Signal1+ECE+dPSA (AUC = 0.840 [95%CI:0.774–0.907]) for csPCa-A, csPCa-B and csPCa-C respectively. The AUCs of the corresponding Likert scores were 0.844 [95%CI:0.806–0.877, p = 0.11], 0.841 [95%CI:0.799–0.876, p = 0.52] and 0.849 [95%CI:0.811–0.884, p = 0.49], respectively.
For R2, the best models were Signal1+V_Max+dPSA (AUC = 0.790 [95%CI:0.731–0.857]), Signal1+V_Max (AUC = 0.813 [95%CI:0.746–0.882]) and Signal1+ECE+V_Max (AUC = 0.843 [95%CI:0.781–0.907]) for csPCa-A, csPCa-B and csPCa-C respectively. The AUCs of the corresponding Likert scores were 0.829 [95%CI:0.791–0.868, p = 0.13], 0.790 [95%CI:0.742–0.841, p = 0.12] and 0.808 [95%CI:0.764–0.845, p = 0.006], respectively. Conclusion Combinations of simple variables can match the Likert score’s results. The optimal combination depends on the definition of csPCa.



Introduction
Multiparametric Magnetic Resonance (MR) imaging can detect clinically significant prostate cancer (csPCa) with good accuracy [1][2][3][4][5][6][7]. Unfortunately, its interpretation needs expertise. Indeed, prostate focal lesions can have very different appearances from one MR pulse sequence to another, and it may be difficult to distinguish, within the large number of combinations of shape and signal abnormalities, those that are benign from those that are malignant.
Because it is impossible to definitely characterize as benign or malignant all prostate focal lesions, the use of a 5-point subjective score has been widely encouraged to describe the level of suspicion of a given lesion [8,9]. This so-called Likert score is a highly significant predictor of the malignant nature of prostate focal lesions [3,10-13]. However, because there are no descriptions of specific criteria to be used in the scoring process, the Likert score relies heavily on the reader's experience. Therefore, some research groups tried to set up more objective scoring systems to improve inter-reader agreement [14][15][16][17].
In 2012, the European Society of Urogenital Radiology (ESUR) endorsed the Prostate Imaging Reporting and Data System (PIRADS) score [18]. It was shown to be a significant predictor of malignancy and aggressive behavior [19,20]. However, it did not outperform the Likert score, at least for experienced readers, and did not improve inter-reader agreement either [11,21,22]. Two main limitations were identified. First, it gave the same diagnostic weight to all pulse sequences. Second, the interpretation of dynamic contrast-enhanced (DCE) imaging relied mostly on the shape of the enhancement curve, which was shown to be a poor predictor of malignancy [11].
In 2015, the ESUR and the American College of Radiology endorsed the so-called PIRADS v2 score [23,24] [25,26], but, again, some limitations were pointed out [27]. It does not seem to improve inter-reader agreement as compared to the PIRADS v1 score, even after training [28,29], it results in a high false positive rate [30], and it has been outperformed by alternative in-house scores [17,28]. These results have led some authors to suggest that there might be structural limits to the ability of any score based on MR imaging to allow detection of prostate cancer with high specificity [28].
The PIRADS scores were not based on an analysis of a large dataset, but rather were a result of expert opinion. Like others [17], we hypothesize that processing large prospective databases detailing sequence-specific findings may help further refine prostate MR scoring systems. We therefore undertook this study to assess the relative diagnostic weight of sequence-specific MR features in characterizing aggressive cancers in the peripheral zone (PZ), using a prospectively acquired radiologic-pathologic database of patients treated by prostatectomy.

Radiologic-pathologic database
As of September 2008, all patients who underwent prostate multiparametric MR imaging before radical prostatectomy at our institution (Hospices Civils de Lyon) were invited to have their radiologic and pathologic data entered in a prospective database approved by our Institutional Review Board (Comité de Protection des Personnes Sud-Est IV). All patients gave written informed consent. The database was used for other studies evaluating prostate cancer detection rates [3] and accuracy of tumor volume estimation [31] at multiparametric MR imaging, existing MR scoring systems [11] and diagnostic accuracy of quantitative MR parameters [32][33][34]. These studies do not overlap with the present one, which aims at understanding the relative weight of non-quantitative sequence-specific findings in the characterization of aggressive prostate cancers, in order to improve current MR scoring systems.

MR image analysis
MR examinations comprised at least T2W, DW and DCE imaging at 1.5T and 3T, but protocol parameters varied as per the standard of care at the time of the examination (S1 Table). Upon inclusion, preoperative MR examinations were prospectively analyzed by two senior readers blinded to clinical and histological data. Reader 1 (R1, OR) and 2 (R2, FB) had respectively 11 years and 1 year of experience at the start of the database in 2008. Readers independently described all prostate suspicious focal lesions. They took into account all PZ lesions showing low-signal intensity on T2W images and/or on apparent diffusion coefficient (ADC) maps, and/or showing early enhancement at visual inspection of DCE images.
First, readers delineated each lesion on all three MR pulse sequence images using the Osirix software (Osirix imaging software, Geneva, Switzerland). This allowed the calculation of the lesion volume on each MR pulse sequence.
Second, readers specified the lesion shape on the images of the three MR pulse sequences, using the following list: not visible; ill-defined area with indistinct margins; linear lesion perpendicular to the capsule; linear lesion parallel to the capsule; triangular lesion; nodular lesion without mass effect on the adjacent transition zone (TZ) or capsule; nodular lesion with mass effect on the adjacent TZ or capsule.
Finally, they assessed the likelihood of malignancy of each lesion, using the following subjective 5-level Likert score: 1, definitely benign; 2, likely benign; 3, indeterminate; 4, likely malignant; 5, definitely malignant. Because a Likert score of 1/5 was used only in areas with normal appearance on all pulse sequences, focal lesions had, by definition, a Likert score ≥2/5.

Comparison of MR and histopathologic findings
Whole-mount sections of prostatectomy specimens were obtained every 3 mm according to guidelines [35]. A single uropathologist (FML), with 10 years of experience at the start of the database and blinded to the readers' assessments, assigned individual Gleason scores to all cancer foci and delineated them on the glass cover of whole-mount sections. Then, the readers and the uropathologist compared MR and histopathologic findings. The pathologist decided which MR lesions matched the positions of histologic cancers. These matching lesions were considered true positives only if their largest diameter was 50-150% of the diameter of the corresponding cancer, to minimize chance detection [36,37].
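The size criterion used to accept a matching lesion as a true positive can be written as a one-line check. The sketch below is illustrative only: the function name, the millimetre units and the inclusive bounds are assumptions, and the positional match decided by the pathologist is taken as already established.

```python
def passes_size_criterion(mr_diameter_mm: float, cancer_diameter_mm: float,
                          low: float = 0.5, high: float = 1.5) -> bool:
    """Size check for an MR lesion already matched to a histologic cancer.

    Returns True when the largest MR diameter is 50-150% of the cancer
    diameter. Inclusive bounds are an assumption made for this sketch.
    """
    if mr_diameter_mm <= 0 or cancer_diameter_mm <= 0:
        raise ValueError("diameters must be positive")
    ratio = mr_diameter_mm / cancer_diameter_mm
    return low <= ratio <= high


# Hypothetical lesions: a 12 mm MR lesion matched to a 10 mm cancer passes,
# a 4 mm MR lesion matched to the same cancer does not.
print(passes_size_criterion(12, 10))
print(passes_size_criterion(4, 10))
```

A lesion that fails this check would be treated as a chance overlap rather than a detection, which is the purpose stated for the criterion.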

Assessment of variables and combination of variables in characterizing csPCa-A in PZ
We first assessed the performance of the 9 variables described in Table 1 in characterizing, in PZ, csPCa defined as Gleason ≥7 cancers, i.e. as cancers with an International Society of Urological Pathology (ISUP) grade group ≥2 (csPCa-A) [38]. The variables comprised the PSA density (dPSA) and eight variables describing MR lesions (S_T2, S_DW, S_DCE, S_Max, S_Min, Shape, ECE and Vmax). For statistical analysis, ill-defined areas and linear lesions perpendicular to the capsule were grouped in a single category.
Then we assessed the performance of three combinations of variables related to the lesions' signal. Signal1 model comprised S_Max and S_Min, Signal2 model comprised S_Max, S_Min and S_DW, and Signal3 model comprised S_T2, S_DW and S_DCE. Signal1a, Signal2a and Signal3a models used the three MR pulse sequences. Signal1b, Signal2b and Signal3b models did not use DCE imaging. In Signal2a model, S_Max and S_Min used only the results of T2W and DCE imaging since DW findings were coded by S_DW. In Signal2b model, S_Max and S_Min used only the results of T2W imaging. These models were compared to S_Max as a standalone variable, with (S_Maxa) and without (S_Maxb) use of DCE imaging.
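As an illustration of how the two count variables behind the Signal1 model could be derived from the per-sequence grades, here is a minimal sketch. The grade labels ("marked", "moderate", "non-visible") and the function name are assumptions; the actual coding is the one given in Table 1.

```python
from typing import Optional, Tuple

MARKED = "marked"            # assumed label for a marked signal abnormality
NON_VISIBLE = "non-visible"  # assumed label for a lesion not seen on a sequence


def signal_counts(s_t2: str, s_dw: str, s_dce: Optional[str] = None) -> Tuple[int, int]:
    """Return (S_Max, S_Min) for one lesion.

    S_Max counts the pulse sequences with a marked abnormality and S_Min
    the pulse sequences on which the lesion is not visible. Leaving s_dce
    as None mimics the 'b' variants, which ignore DCE imaging.
    """
    grades = [s_t2, s_dw]
    if s_dce is not None:
        grades.append(s_dce)
    s_max = sum(g == MARKED for g in grades)
    s_min = sum(g == NON_VISIBLE for g in grades)
    return s_max, s_min


# Lesion graded marked / marked / moderate on T2W / DW / DCE imaging
print(signal_counts("marked", "marked", "moderate"))  # Signal1a inputs: (2, 0)
print(signal_counts("marked", "marked"))              # Signal1b inputs: (2, 0)
```

The same helper could feed the Signal2 variants by passing only the sequences that S_Max and S_Min are allowed to use there.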
Signal1 model was selected for the rest of the analysis. The variables Shape, ECE, Vmax and dPSA were sequentially added to it. Each resulting multivariable model was assessed first using the findings of the three MR pulse sequences and then without DCE imaging.
In the database, extraprostatic extension was assessed at the lesion level (i.e., it was specified whether each lesion showed features of extraprostatic extension at pathological examination). This allowed combining the Gleason score and extraprostatic extension features at the lesion level.
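Because Gleason score and extraprostatic extension are both available at the lesion level, the three csPCa definitions used in this study can be derived per lesion. The sketch below is a hypothetical encoding: the class and field names are invented for illustration, and "Gleason ≥7(4+3)" is read here as a Gleason sum above 7, or a sum of 7 with primary pattern 4.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Lesion:
    primary: int    # primary Gleason pattern
    secondary: int  # secondary Gleason pattern
    epe: bool       # histological extraprostatic extension for this lesion


def cspca_flags(lesion: Lesion) -> Dict[str, bool]:
    """Classify one lesion under the csPCa-A, csPCa-B and csPCa-C definitions."""
    score = lesion.primary + lesion.secondary
    is_a = score >= 7                                        # Gleason >= 7
    is_b = score > 7 or (score == 7 and lesion.primary == 4)  # Gleason >= 7(4+3)
    is_c = is_a and lesion.epe                               # Gleason >= 7 with extension
    return {"csPCa-A": is_a, "csPCa-B": is_b, "csPCa-C": is_c}


# A Gleason 9 (4+5) cancer without extraprostatic extension
print(cspca_flags(Lesion(primary=4, secondary=5, epe=False)))
```

Under this reading, a Gleason 7 (3+4) lesion is csPCa-A but not csPCa-B, and becomes csPCa-C only when extraprostatic extension is present.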

Statistical analysis
The analysis units were the lesions identified by each reader. The probability that a lesion corresponded to a csPCa was modelled using a logistic regression for each of the 9 studied variables, and for the different combinations of variables described above. Receiver operating characteristic (ROC) curves were built using the probabilities of csPCa predicted by the different models. The diagnostic performance of each variable and combination of variables was quantified using the area under the ROC curve (AUC). Because the same data were used for developing the models and assessing their performance, the AUC can be overestimated. This is often referred to as optimism. We therefore used a bootstrap procedure to estimate a corrected AUC as proposed by Harrell [39] (S1 Appendix). To account for the clustered structure of the data, the bootstrap procedures used a clustered resampling at the patient level. A similar bootstrap procedure was used for model-to-model comparison and for constructing confidence intervals. All interval estimations given in this paper are 95% confidence intervals (95%CI). Analyses were performed using R software version 3.2.4 (http://cran.r-project.org).
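The optimism correction with patient-level resampling can be sketched as follows. This is not the study's code (the analyses were run in R and the exact procedure is in S1 Appendix): it is a plain-Python illustration in which the function names, the data layout and the crude gradient-ascent logistic fit are all assumptions.

```python
import math
import random


def auc(scores, labels):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def predict(w, x):
    """Predicted probability from weights w = [intercept, coefficients...]."""
    z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], x))
    z = max(-30.0, min(30.0, z))  # clamp to keep exp() within range
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, steps=500, lr=0.1):
    """Crude logistic regression by gradient ascent (illustrative stand-in for a GLM)."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(steps):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = yi - predict(w, xi)
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj + lr * gj / len(X) for wj, gj in zip(w, grad)]
    return w


def optimism_corrected_auc(lesions, n_boot=200, seed=0):
    """lesions: list of (patient_id, features, label), one tuple per lesion.

    Returns (apparent AUC, optimism-corrected AUC).
    """
    rng = random.Random(seed)
    by_patient = {}
    for pid, x, lab in lesions:
        by_patient.setdefault(pid, []).append((x, lab))
    patients = sorted(by_patient)
    X = [x for _, x, _ in lesions]
    y = [lab for _, _, lab in lesions]
    w = fit_logistic(X, y)
    apparent = auc([predict(w, x) for x in X], y)
    optimism = []
    for _ in range(n_boot):
        # Clustered resampling: draw patients with replacement, keep all their lesions.
        rows = [r for pid in rng.choices(patients, k=len(patients)) for r in by_patient[pid]]
        Xb = [x for x, _ in rows]
        yb = [lab for _, lab in rows]
        if len(set(yb)) < 2:
            continue  # AUC is undefined without both classes
        wb = fit_logistic(Xb, yb)
        auc_boot = auc([predict(wb, x) for x in Xb], yb)  # bootstrap model, bootstrap data
        auc_orig = auc([predict(wb, x) for x in X], y)    # bootstrap model, original data
        optimism.append(auc_boot - auc_orig)
    return apparent, apparent - sum(optimism) / len(optimism)
```

Resampling whole patients rather than individual lesions keeps lesions from the same prostate together, which is what the clustered structure of the data calls for.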

Study population
At the time of analysis, the database contained 262 patients imaged between September 2008 and February 2013. MR imaging was performed at 1.5T on scanner A (n = 72) or at 3T on scanner B (n = 113) or C (n = 77). The patients' median age, PSA level and PSA density at the time of imaging were 62 years (interquartile range (IQR), 58-66), 6.5 ng/mL (IQR, 5.0-9.9) and 0.16 ng/mL/mL (IQR, 0.12-0.25).
S2 Table shows the distribution of individual variables for tissue classes and readers. S_Max obtained the highest AUC for both readers (Table 2). For R1, it was significantly different from that of S_T2 (p = 0.002), S_DW (p = 0.01) and S_DCE (p = 0.005). For R2, it was significantly different from that of S_T2 (p = 0.001) and S_DCE (p = 0.011) but not from that of S_DW (p = 0.094).

Assessment of signal-related models in characterizing csPCa-A in PZ
Table 3 shows the results of S_Maxa-b, Signal1a-b, Signal2a-b and Signal3a-b models. For both readers, all models performed better with use of DCE imaging.
Fig 1. [...] contrast-enhanced image. One suspicious lesion was described by both readers in the right peripheral zone (A-C, arrow). The lesion was noted as nodular without mass effect by both readers. S_T2, S_DW and S_DCE were respectively marked, marked and moderate for both readers. V_Max was 2.0 cc and 2.1 cc for readers 1 and 2 respectively. The ECE and Likert scores were respectively 2/5 and 5/5 for both readers. Analysis of the prostatectomy specimen showed a matching Gleason 9 (4+5) cancer with a histological volume of 1.6 cc. https://doi.org/10.1371/journal.pone.0178901.g001

Assessment of other multivariable models in characterizing csPCa-A in PZ
Table 4 shows the results of the models associating Signal1 model and other variables. When DCE imaging was used, the addition of other variables either decreased the diagnostic performance or slightly improved it, with a maximal ΔAUC of +0.011 for R1 and +0.021 for R2. In the table, for each reader, the models achieving the highest AUC value are highlighted in green, and the models achieving an AUC value inferior to the highest AUC value by 0.01 or less are highlighted in yellow.
For R1, the Signal1a+ECE+dPSA model obtained the highest AUC (0.805 [95%CI: 0.757-0.866]). It performed better with use of DCE imaging, but the difference was not statistically significant (p = 0.104). As compared to Signal1a model, Signal1a+ECE+dPSA model achieved a ΔAUC of +0.011, but the difference was not significant (p = 0.167).
For R2, the Signal1a+Vmax+dPSA model obtained the highest AUC (0.790 [95%CI: 0.731-0.857]). It performed better with use of DCE imaging, but the difference was not statistically significant (p = 0.104). As compared to Signal1a model, Signal1a+Vmax+dPSA model achieved a ΔAUC of +0.021, but the difference was not significant (p = 0.215).

Assessment of single variables and multivariable models in characterizing csPCa-C in PZ
When DCE imaging was used, the addition of other variables to the Signal 1a model tended to increase the AUC, with a maximal ΔAUC of +0.031 for R1 and +0.076 for R2 (Table 6). When DCE was not used, all multivariable models performed better than Signal1b model with a maximum ΔAUC of +0.065 for R1 and +0.133 for R2.
AUC values tended to be higher when DCE imaging was used, with a maximal ΔAUC of +0.032 for R1 and +0.041 for R2. However, two models performed better without DCE imaging for reader 2, including the model providing the best AUC value.
The AUC of the Likert score was 0.849 (95%CI: 0.811-0.884) for R1 and 0.808 (95%CI: 0.764-0.845) for R2. The difference between the AUC of the best multivariable model and that of the Likert score was not significant for R1 (p = 0.49) but was significant for R2 (p = 0.006).
Table 7 summarizes the changes in model performance with changes in csPCa definition. When moving from definition csPCa-A to definition csPCa-C, the ΔAUC between the Signal1 model and the best multivariable models tended to increase, and the ΔAUC between models with and without DCE tended to decrease. These trends were more pronounced for R2 than for R1.
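A ΔAUC between two scores assigned to the same lesions, such as a multivariable model and the Likert score, can be given a bootstrap interval with the same patient-level resampling. The sketch below is an assumption-laden illustration, not the procedure of S1 Appendix: the function names, the tuple layout and the simple percentile interval are all choices made for this example.

```python
import random


def auc(scores, labels):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def delta_auc(rows):
    """AUC of the first score minus AUC of the second score on the same lesions."""
    labels = [y for _, _, y in rows]
    return auc([a for a, _, _ in rows], labels) - auc([b for _, b, _ in rows], labels)


def delta_auc_interval(lesions, n_boot=1000, seed=0, level=0.95):
    """lesions: list of (patient_id, score_a, score_b, label), one tuple per lesion.

    Returns (observed delta, lower bound, upper bound) using a percentile
    interval over patient-level bootstrap samples.
    """
    rng = random.Random(seed)
    by_patient = {}
    for pid, a, b, y in lesions:
        by_patient.setdefault(pid, []).append((a, b, y))
    patients = sorted(by_patient)
    observed = delta_auc([r for pid in patients for r in by_patient[pid]])
    deltas = []
    for _ in range(n_boot):
        # Paired, clustered resampling: both scores travel with their lesion and patient.
        rows = [r for pid in rng.choices(patients, k=len(patients)) for r in by_patient[pid]]
        if len({y for _, _, y in rows}) < 2:
            continue  # AUC is undefined without both classes
        deltas.append(delta_auc(rows))
    deltas.sort()
    alpha = (1.0 - level) / 2.0
    lower = deltas[int(alpha * (len(deltas) - 1))]
    upper = deltas[int((1.0 - alpha) * (len(deltas) - 1))]
    return observed, lower, upper
```

Under this sketch, an interval that excludes zero would point to a difference between the two scores, which is one common way to read such a comparison.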

Discussion
The purpose of this study was not to build a new scoring system that could compete with existing ones, but rather to define the most informative MR features in characterizing csPCa in PZ, and to assess the relative diagnostic weight of these features, and how they could be optimally combined. Answering these questions is a prerequisite for improving existing scores in the future. To achieve this purpose, we used a radiologic-pathologic database containing a detailed description of sequence-specific features of all lesions visible at pre-operative prostate multiparametric MR imaging. This description was made prospectively by two independent readers and was then compared to the findings of prostatectomy specimens used as reference.
We used several definitions for csPCa since there is currently no consensus on this matter. As a primary objective, we defined csPCa as cancers with a Gleason score ≥7 because the metastatic and lethal potential of Gleason 6 cancers is low [40], and because this definition is commonly used [41]. Using this definition, we built the best multivariate models through a stepwise approach. Then, the models were assessed using more stringent definitions for csPCa.
Unsurprisingly, S_DW gave consistently better results than S_T2 and S_DCE for both readers, whatever the definition used for csPCa. This confirms that DW imaging is the most informative pulse sequence in PZ. Nonetheless, combining the results of the three pulse sequences into S_Max resulted in a substantial improvement in the diagnostic performance for both readers. This is in line with the good results obtained at the National Institutes of Health (NIH) with an in-house score taking into account only the number of positive pulse sequences [16,42], and points out that signal abnormalities remain the most informative features and should play a central role in any scoring system.
The Signal1-3 models consisted of three different combinations of the signal-based variables. They provided only marginal improvement as compared to S_Max. In particular, the Signal2 model, which was an attempt to increase the diagnostic weight of DW imaging among the signal-based variables, failed to improve the characterization of csPCa. Thus, although DW imaging is the most informative pulse sequence in PZ, its optimal combination with the other pulse sequences remains to be defined.
Table 6. Performances of multivariable models in characterizing csPCa-C in PZ. Bold characters indicate models that performed better without DCE imaging.
Table 7 footnotes: (1) A positive difference indicates a better performance of the multivariable models. (2) A positive difference indicates a better performance of the model using DCE imaging. https://doi.org/10.1371/journal.pone.0178901.t007
Because the Signal1 model tended to give the best results, it was selected for the next step that assessed the added value of variables that were not related to signal abnormalities. Three variables (dPSA, ECE and V_Max) showed consistent added value and all best multivariable models included at least one of them. Their added value over signal-based variables (i.e. the ΔAUC between the best multivariable model and the Signal1 model) tended to increase when DCE imaging was not used and when more stringent definitions of csPCa were used. Unsurprisingly, the diagnostic weight of ECE increased when the definition of csPCa included extraprostatic extension (csPCa-C), and the diagnostic weight of V_Max increased when the definition of csPCa was restricted to more aggressive tumors (csPCa-B instead of csPCa-A).
Thus, an international consensus on the definition of csPCa is becoming crucial, since this will impact the scoring system to be used on multiparametric MR imaging. Interestingly, good results have recently been reported with a refinement of the NIH score combining ECE features and the number of positive pulse sequences [28]. Our results are in line with this report and suggest that ECE features not only assess prostate cancer extracapsular extension, but also help characterize the nature of the lesion. dPSA was the only non-MR variable included in this study. The fact that it consistently provided information independent of MR features is in line with a recent study that found that combining the PIRADS v2 score with dPSA improved prostate lesion characterization [43]. If the aim of scoring systems is to assess the likelihood of presence of csPCa, it might therefore be necessary to associate MR features and clinical or biochemical features in the future. Taking into account the shape of the lesions decreased the performance of almost all models. This strongly suggests that shape is not a good predictor of csPCa, even if the PIRADS v2 and other scoring systems [15,17] use it to characterize focal lesions in PZ. The poor diagnostic value of shape had already been found in another study [14] and may be due to the fact that it remains a very subjective feature.
There is currently a controversy about the added value of DCE imaging, as compared to T2W and DW imaging. DCE imaging may indeed help detect small cancers, but may also increase the number of false positive findings [23,44-47]. Several groups failed to find clear added value for DCE imaging when MR images were interpreted visually or using scoring systems [25,48-50]. Nonetheless, in most quantitative studies aimed at characterizing prostate lesions [34,51-54], DCE-derived parameters were part of the best final models. Our study is in line with these quantitative studies. When we removed DCE findings, we observed a systematic decrease in the diagnostic performances of nearly all models, for both readers, suggesting that DCE imaging does provide information. The best way to incorporate this information into a scoring system remains to be defined. Interestingly, the added value of DCE imaging tended to decrease for both readers when more stringent definitions of csPCa were used.
In the last part of our study, we compared the best models to the Likert score prospectively assigned to each lesion. For the most experienced reader, the Likert score constantly outperformed the best model, even if the difference was never statistically significant and tended to decrease as more stringent definitions of csPCa were used. For the least experienced reader, however, the best model outperformed the Likert score for csPCa-B and csPCa-C, and the difference was statistically significant for csPCa-C. Most of the image features we used were subjective and this may be seen as a contradiction with our initial goal to obtain more objective scoring systems. Unless an entirely quantitative approach is used for all pulse sequences, subjective assessment of images remains unavoidable. Existing scoring systems share the same limitation, as shown by the PIRADS v2 score that distinguishes indistinct hypointense (score 2), mildly/moderately hypointense (score 3) or markedly hypointense (scores 4-5) lesions on ADC maps. However, our results suggest that breaking down the diagnostic process into separate features, and combining these features into predefined models may help less experienced readers better characterise MR lesions.
Our study has some limitations. The description of the sequence-specific features was done prospectively. Although this could be seen as a strength of the study, it also induced two limitations. First, we were not able to compare the best models with the PIRADS v2 score, which had not been released at the start of the study. However, our main purpose was not to compare our models to existing scores, but to understand the features' relative weight in characterizing csPCa. Second, the features noted in the database for TZ lesions were mostly based on signal abnormalities. Features such as homogeneous pattern, presence of a capsule, and apical or anterior location, which are now known as major predictors of malignancy in TZ [55,56], were not prospectively recorded. As a result, we chose not to use the database for assessing diagnostic models in TZ. Another limitation is that both readers were from the same institution and may have characterized prostate lesions in a similar way. The best multivariable models may have been different with readers from other institutions. Finally, our study mixed patients imaged at different field strengths with varying protocols. Although this resulted in a heterogeneous population, it may better reflect the population seen in daily routine.

Conclusion
The number of pulse sequences showing marked signal abnormality for a given lesion (S_Max) was one of the most informative variables for both readers, whatever the definition used for csPCa. A moderate improvement could be obtained by taking into account, in addition to S_Max, the number of negative pulse sequences, the presence of extracapsular extension features, the volume of the lesion and the PSA density. The added value of the three latter variables depended on the definition used for csPCa and tended to increase when more stringent definitions were used. Removing DCE findings decreased performance in nearly all models, but the difference decreased when more stringent definitions were used for csPCa. Finally, the Likert score outperformed the best multivariable models for the most experienced reader whatever the definition used for csPCa. For the other reader, the multivariable models outperformed the Likert score for csPCa-B and csPCa-C definitions, and the difference was significant for csPCa-C. This suggests that scoring systems based on semi-objective variables may help less-experienced radiologists.
Supporting information S1 Table.