Diagnostic accuracy of magnetic resonance imaging techniques for treatment response evaluation in patients with head and neck tumors, a systematic review and meta-analysis

Background Novel advanced MRI techniques are investigated in patients treated for head and neck tumors, as conventional anatomical MRI is unreliable for differentiating tumor from treatment-related imaging changes. Purpose As the diagnostic accuracy of MRI techniques to detect residual or recurrent tumor during or after treatment is variably reported in the literature, we performed a systematic review and meta-analysis. Data sources PubMed, EMBASE and Web of Science were searched from their first record to September 23, 2014. Study selection Studies reporting the diagnostic accuracy of anatomical MRI, ADC, perfusion or spectroscopy to identify tumor response confirmed by histology or follow-up in patients treated for head and neck tumors were selected by two authors independently. Data analysis Two authors independently performed data extraction, including true positives, false positives, true negatives, false negatives and general study characteristics. Meta-analysis was performed using bivariate random effects models when ≥5 studies per test were included. Data synthesis We identified 16 relevant studies with anatomical MRI and ADC. No perfusion or spectroscopy studies were identified. Pooled analysis of anatomical MRI of the primary site (11 studies, N = 854) displayed a sensitivity of 84% (95%CI 72–92) and specificity of 82% (71–89). ADC of the primary site (6 studies, N = 287) showed a pooled sensitivity of 89% (74–96) and specificity of 86% (69–94). Limitations The main limitations are the low but comparable quality of the included studies and the variability between the studies. Conclusions The higher diagnostic accuracy of ADC values over anatomical MRI for the primary tumor location emphasizes the relevance of including DWI with ADC for response evaluation of treated head and neck tumor patients.


Introduction
Head and neck tumors are a devastating disease, being the seventh leading cancer with respect to incidence and the eighth with respect to mortality rates [1]. Incidence is even higher in developing countries than in developed countries [2]. Patients with head and neck tumors follow an intensive and expensive treatment regimen, most often consisting of concomitant chemoradiotherapy. Surgery is not standard in the majority of patients with locally advanced tumors, but is frequently performed in other patient groups [3]. Side effects of treatment are substantial and impair quality of life [4]. Furthermore, many patients with locally advanced tumors demonstrate an inadequate treatment response [5]. Imaging follow-up is thus essential to evaluate treatment response and to tailor treatment in individual patients.
Conventional anatomical MRI techniques are commonly used for treatment evaluation, but are often not able to reliably identify treatment response [6]. Surgery as well as chemoradiotherapy induce false positive results through changes in the affected area, including fibrosis and necrosis [7]. These benign treatment-induced changes should be differentiated from true residual or recurrent tumor on imaging to prevent unjustified discontinuation or initiation of therapy. On the other hand, missing a residual or recurrent tumor also results in inadequate treatment for the patient.
Several recent studies have shown encouraging results using diffusion weighted imaging (DWI) for the detection of recurrent head and neck tumors, including the calculated apparent diffusion coefficient (ADC) as a potentially valuable imaging biomarker for treatment response evaluation [8]. In addition, perfusion and magnetic resonance spectroscopy (MRS) are promising techniques [9,10]. This is further supported by a recent overview [11]. However, an overview of the diagnostic accuracy of these advanced MRI techniques is not available as a systematic review or meta-analysis [8-11].
This prompted us to conduct a meta-analysis of the diagnostic accuracy of anatomical and advanced MRI techniques for tumor residual or recurrence in patients treated for head and neck tumors. We hypothesized that the advanced MRI techniques perform better than anatomical MRI techniques in the differentiation of tumor from treatment induced imaging changes.

Methods
Our systematic review was performed according to the Preferred Reporting Items for Systematic reviews and Meta-Analyses (PRISMA, see S1 PRISMA Checklist) criteria and the AMSTAR guidelines [12,13]. Furthermore, the Cochrane Handbook for Systematic Reviews of Diagnostic Test Accuracy was used. A review protocol was written prior to the start of the study (available upon request).

Data sources and search strategy
PubMed, EMBASE and Web of Science were searched by AH and HW in separate sessions using the same search strategy from their first records to September 23, 2014. Database keywords and text words were searched using head and neck tumors, MRI techniques, treatment options and treatment response, including the subcategories and variants of these words, as search terms (see S1 Text). No filters were used, but studies in non-English languages were excluded manually later. References of included studies were further hand searched. An effort was made to include unpublished data by searching EMBASE for conference proceedings and by contacting authors in case insufficient details were described to generate 2x2 tables.

Selection criteria
We searched for studies with patients who were treated for newly diagnosed head and neck tumors. Studies reporting on patients with tumors (squamous cell carcinoma) of the oral cavity, pharynx or larynx were included. The reference standard had to determine the treatment effect, i.e. tumor recurrence or therapy-induced changes, by clinical follow-up, imaging follow-up, histology or a combination of these. Studies were included if a 2x2 table could be constructed for the anatomical or advanced MRI data using the full text or additional data requested from the authors.
We excluded studies of patients with salivary gland neoplasms, thyroid gland neoplasms, parathyroid neoplasms, facial neoplasms, esophagus neoplasms or tracheal neoplasms. Studies in which an MRI system <1.0 Tesla was used were excluded, since these data differ substantially from data obtained in current common clinical practice using MRI systems ≥1.0 Tesla.

Study selection
Study selection, data extraction and study quality assessment were done independently by two authors (AH and HW), and discrepancies were resolved by discussion. Possible inclusion was assessed first based upon title and secondly based upon abstract. The full text was assessed for eligibility if the abstract suggested relevance. Subsequently, the article was included if it fulfilled the inclusion criteria of our study. References of included studies were hand searched.

Data extraction and quality assessment
Data extraction was done with the use of a data extraction form. The main data extracted consisted of the number of true positives, false positives, false negatives and true negatives. We further extracted data on study design, total number of patients, number of males/females, mean and range of patients' age, patient selection criteria, imaging characteristics, reference standard (histology/imaging follow-up/clinical follow-up) and definition of tumor or treatment changes. In case of incomplete 2x2 tables, the corresponding author was contacted and requested to provide the data required to generate 2x2 tables. The quality of included studies was assessed using the Quality Assessment of Diagnostic Accuracy Studies tool, QUADAS-2 [14].

Data synthesis
Sensitivity and specificity with 95% confidence interval (CI) were generated for anatomical MRI and advanced MRI with RevMan 5.3 (Cochrane Collaboration, Copenhagen, Denmark). When several time points were measured in one study, we used the one that was closest to 6 weeks posttreatment for the main analysis, because that is the most commonly used imaging follow-up moment in our experience. Furthermore, diagnostic accuracy was evaluated in subgroups for the intratreatment, early posttreatment and late posttreatment scan moment. These were set at 2 weeks after the start of treatment, 6 weeks after the end of treatment and ≥3 months after the end of treatment, respectively, or the nearest available time point.
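The per-study step described above can be sketched as follows. This is a minimal illustration, not the RevMan implementation: the Wilson score interval is an assumed choice of 95% CI method, and the function names and the 2x2 counts are hypothetical.

```python
import math


def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a proportion (assumed CI method, z=1.96 for 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z ** 2 / total
    centre = (p + z ** 2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    return centre - half, centre + half


def accuracy_from_2x2(tp, fp, fn, tn):
    """Sensitivity, specificity and tumor prevalence from one study's 2x2 table."""
    diseased = tp + fn       # patients with residual or recurrent tumor
    non_diseased = tn + fp   # patients with treatment-related changes only
    return {
        "sensitivity": tp / diseased,
        "sensitivity_ci": wilson_interval(tp, diseased),
        "specificity": tn / non_diseased,
        "specificity_ci": wilson_interval(tn, non_diseased),
        "prevalence": diseased / (diseased + non_diseased),
    }


# Hypothetical counts for illustration only (not taken from any included study).
result = accuracy_from_2x2(tp=18, fp=5, fn=2, tn=25)
print(result)
```

With small per-study numbers the intervals become wide, which is why the forest plots are inspected for both the point estimates and the CI widths.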
Bivariate random effects models [15] were used to generate pooled estimates of the sensitivity, specificity, positive likelihood ratio and negative likelihood ratio with a 95% confidence interval for each index test when 5 or more studies were included. The sensitivity and specificity were displayed together with a hierarchical summary receiver operating characteristic (HSROC) curve. We fitted meta-regression in bivariate models and compared sensitivity and specificity of anatomical MRI versus ADC with a likelihood ratio test. The direct comparisons of MRI techniques per study were tested with a two-sample Z test for proportions. The metandi module was used for meta-analysis of diagnostic test accuracy studies in STATA version 12.1 (College Station, Texas, USA). As suggested by the Cochrane Diagnostic Test Accuracy group, no formal statistical tests of study heterogeneity or funnel plot asymmetry were performed, as these tests are inaccurate. However, the random effects model we used takes heterogeneity into account. Heterogeneity was assessed by visual inspection of the forest plots. We evaluated whether differences in selection (high risk population versus follow-up of all patients) could explain identified heterogeneity. In case of outliers, we evaluated whether bias related to specific study characteristics could explain the result and performed a sensitivity analysis without the outlier to show the influence on the test outcome.
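Two of the quantities named above can be illustrated with a short sketch. It is not the metandi bivariate model: the likelihood ratios are computed naively from point estimates of sensitivity and specificity, and the Z test uses the pooled-proportion form, which is an assumption about the variant used. Function names and the example counts are hypothetical.

```python
import math


def likelihood_ratios(sensitivity, specificity):
    """Positive and negative likelihood ratio from point estimates."""
    lr_pos = sensitivity / (1 - specificity)
    lr_neg = (1 - sensitivity) / specificity
    return lr_pos, lr_neg


def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided two-sample Z test for proportions (pooled standard error)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        # Both proportions are 0 or both are 1: no detectable difference.
        return 0.0, 1.0
    z = (p1 - p2) / se
    # Two-sided p-value: 2 * (1 - Phi(|z|)) equals erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Pooled anatomical MRI estimates as reported in the abstract (84% / 82%).
lr_pos, lr_neg = likelihood_ratios(0.84, 0.82)

# Hypothetical counts: test A detects 18 of 20 tumors, test B detects 12 of 20.
z, p = two_proportion_z_test(18, 20, 12, 20)
print(lr_pos, lr_neg, z, p)
```

The Z test treats the two groups as independent samples; when both tests are applied to the same patients, a paired test would be an alternative design choice.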
Potential clinical implication was illustrated by calculating the number of missed tumors and total misclassifications using the pooled sensitivity and specificity results for a hypothetical cohort of 100 patients treated for head and neck tumors. Overall prevalence of tumor residual or recurrence in this cohort was based on the mean prevalence of tumor in our included studies.
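The arithmetic behind this illustration can be sketched as below, using the pooled primary-site estimates reported in the abstract and a prevalence of 25%. The function name is hypothetical, and the values are unrounded expected counts; how they are rounded to whole patients is not specified here.

```python
def cohort_misclassification(n_patients, prevalence, sensitivity, specificity):
    """Expected missed tumors and false positives in a hypothetical cohort."""
    with_tumor = n_patients * prevalence
    without_tumor = n_patients - with_tumor
    missed = with_tumor * (1 - sensitivity)               # false negatives
    false_positives = without_tumor * (1 - specificity)   # treatment changes read as tumor
    return {
        "missed_tumors": missed,
        "false_positives": false_positives,
        "total_misclassified": missed + false_positives,
    }


# Pooled primary-site estimates from the abstract; cohort of 100, prevalence 25%.
anatomical = cohort_misclassification(100, 0.25, sensitivity=0.84, specificity=0.82)
adc = cohort_misclassification(100, 0.25, sensitivity=0.89, specificity=0.86)
print(anatomical)
print(adc)
```

Because the calculation is linear in prevalence, the same function can be rerun with a different prevalence to see how the balance between missed tumors and false positives shifts.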

Description of studies
Our electronic search revealed a total of 2096 unduplicated references, of which 23 references were eligible for inclusion in the meta-analysis (Fig 1; Tables 1 and 2). Seven references were excluded because the authors were unable to provide the requested information to generate a 2x2 table [32-38]. Two references of the initial 23 were based on the same patient population [29,30], of which data from the first publication were considered to be leading, although both were very similar. One reference described two separate populations of patients [18], group A [18]A and group B [18]B, respectively. This resulted in the inclusion of 16 patient populations (studies) in the meta-analysis with data from 15 references.
The included diffusion studies concerned 1087 patients with a mean age of 48 years, of whom 78% were male. Mean tumor prevalence was 25% (range 2-83), without differences between intratreatment (range 9-21) and posttreatment tumor prevalence (range 2-83). As the tumor prevalence was overlapping for studies that performed follow-up of all patients (prevalence range  and studies that selected patients with a suspicion of tumor recurrence or a high risk population (prevalence range 2-83), we combined these groups in the further analysis. As some studies described both anatomical and advanced MRI or both primary and nodal sites, we had a total of 11 studies (854 patients) for anatomical MRI of the primary tumor site, 6 ADC studies of the primary site (287 patients), 4 anatomical MRI studies of the nodal sites (310 patients) and 2 ADC studies of the nodal site (68 patients). No studies concerning perfusion or spectroscopy MRI were available. The definition for differentiating tumor from treatment effects was variable between studies and described for each separately (Tables 1 and 2).

Methodological quality of included studies
The methodological quality of the included studies is summarized in Fig 2. In the patient selection domain, four studies were considered to be at high risk of bias due to inappropriate exclusion criteria, as patients with less than 1 year of disease-free follow-up [17] or patients with less than 2 years of follow-up of the primary site were excluded in these studies [24-26]. Requiring such a long disease-free period creates a selection bias favoring patients without tumor recurrence. We considered another six studies to be at high risk of bias because a non-random selection was carried out, as only patients at high risk for recurrence or with a suspicion of recurrence were included [18]. In the reference standard domain, four studies were classified as being at high risk because the index test results were known when interpreting the reference standard [18]A, [18]B, [19,31]. Eight studies were judged as unclear risk because it was unclear whether the results of the index test were known during the interpretation of the reference standard [16,20,21,24-28]. The remaining four studies were considered to be at low risk of bias [17,22,23,29].
Thus, as most studies showed a high risk of bias in the patient selection and the flow and timing domains, and the index test and reference standard domains were mostly unclear, study quality was classified as low.
For the assessment of applicability, the included participants and setting, the conduct and interpretation of the index test, and the reference standard in each of the included studies raised no concerns about their match with the review question. All studies fulfilled the inclusion criteria of the review.

Main findings primary site
The forest plot of anatomical MRI (11 studies with 854 patients) for the primary tumor location showed a reasonably homogeneous specificity (see S1 Fig). The sensitivity showed more variation in the CIs, which were wide in 2 studies [18]B, [19]. No outliers were detected.
Although the pooled sensitivity and specificity of ADC were higher, this difference was not significant (p = 0.457 and p = 0.626, respectively). Two studies compared anatomical MRI of the primary site with ADC directly. The first study demonstrated a sensitivity of 72% for anatomical MRI and a sensitivity of 94% for ADC (p = 0.079). Specificity was 57% and 100%, respectively (p = 0.002) [28]. The second study showed a sensitivity of 75% for anatomical MRI and a sensitivity of 100% for ADC (p<0.023) and a specificity of 73% and 95%, respectively (p<0.047) [29].
To illustrate the clinical implication of our findings, we calculated the number of missed tumors and the total number of misclassified patients in a hypothetical population of 100 head and neck tumor patients, using the residual or recurrent tumor prevalence of 25% found in our meta-analysis. This calculation showed that follow-up with anatomical MRI would result in 4 missed tumors and that 13 patients would receive unjustified treatment. Implementation of ADC

Main findings nodal site
The forest plot of the data for the nodal site for anatomical MRI (4 studies with 310 patients) showed narrow, overlapping confidence intervals for the sensitivity and specificity, with the exception of the sensitivity of one study [29] and the specificity of another study [16] (see S1 Fig).
Anatomical MRI of the nodal sites showed a sensitivity range of 67-90% and a specificity range of 33-97%, but there were too few studies to calculate pooled estimates. The forest plot of the ADC of the nodal site (2 studies with 68 patients) showed overlapping, but wide confidence intervals. ADC showed a sensitivity range of 73-78% and a specificity range of 88-100%. Two studies compared anatomical MRI and ADC of the nodal site directly [16,29]. In the first study, the sensitivity was 67% and the specificity was 73% for anatomical MRI [29]. ADC demonstrated a non-significantly higher diagnostic accuracy with a sensitivity and specificity of 78% and 88%, respectively (p = 0.601 and p = 0.087, respectively) [29]. Similar results were demonstrated by the second study, with a sensitivity of 87% and a specificity of 33% for anatomical MRI and 73% and 100% for ADC, respectively (p = 0.338 and p = 0.082) [16].

Discussion
By using the statistical strategy of a systematic meta-analysis, we were able to demonstrate a benefit of DWI with derived ADC data over conventional anatomical MRI sequences. Pooled ADC values showed a higher sensitivity (89%) and specificity (86%) than anatomical MRI for the primary site (84% and 82%, respectively), while similar results were demonstrated for the fewer studies concerning nodal sites. The higher sensitivity and specificity of ADC values for tumor recurrence are also confirmed by the few available direct comparisons.
The relation between the performance of anatomical MRI and ADC has been unclear until now, as most studies reported diagnostic accuracy data of only anatomical MRI [17-19,24,26,27,31] or of only ADC [16,21-23]. Only a few have investigated both, and in only 2 studies was the diagnostic accuracy reported of both anatomical MRI and DWI with derived ADC data for the primary tumor site [28,29]. These direct comparisons are less prone to bias than indirect comparisons. Both studies confirmed the higher diagnostic accuracy of ADC data over anatomical MRI found in our meta-analysis for the primary site, with a statistically significant higher specificity for ADC in both studies and a statistically significant higher sensitivity in one of them [28,29]. A similar higher diagnostic accuracy was displayed in the two studies with a direct comparison for the nodal site, although this was not statistically significant [16,29].
Different ADC thresholds for the differentiation between treatment effects and tumor residual/recurrence were used, ranging from 1.16-1.46 ×10⁻³ mm²/s for absolute values or 14-53% for relative differences (see also Table 1). This implies that the thresholds used cannot be transferred across hospital sites. Even within studies different cut-off values were used [29,30]. ADC values are also known to show intratumoral variation, with low ADC values for solid tumor components and high ADC values for necrotic areas, which can be a caveat in drawing regions of interest [39]. This might be the reason for the different strategies used in the region of interest analyses. Whole tumor volume possibly included necrotic areas [22]. The studies targeting the most conspicuous area can be assumed to exclude necrosis [23], while necrosis is certainly excluded in the studies stated to target the most conspicuous area excluding necrosis [27] or the complete solid component excluding necrosis [16,29]. One study did not provide details about the region of interest analysis, hindering a judgment about the quality [21].
Despite the variation in thresholds, tumor heterogeneity and different b-values, ADC data still outperformed anatomical MRI techniques. Because of the limited number of studies we were not able to assess the diagnostic accuracy of ADC and MRI in various threshold subgroups. However, implementation in clinical practice would benefit from standardized and validated ADC threshold values and region of interest analysis. This lack of standardization and the current high variability also hinder the formulation of advice regarding the best cut-off value to be used in clinical practice. Nevertheless, this meta-analysis demonstrates what many radiologists experience in daily practice, namely that adding a diffusion sequence to the anatomical sequences enhances treatment evaluation.
Numbers of patients excluded due to susceptibility artefacts in the head and neck area were provided in some studies (see Tables 1 and 2). This is a known limitation of DWI sequences, but the current limited data suggest that it is a problem in a minority of the patients. Small primary tumor size was an exclusion criterion in only two studies [23,25]. The sensitivity and specificity reported in studies excluding tumors smaller than 6 mm, however, did not show a significantly higher accuracy than those in studies without size limitations. Other factors, such as claustrophobia, played a minimal role.
Data for perfusion and spectroscopy studies were searched, but were not yet available for inclusion in our meta-analysis. Perfusion is, however, feasible and has already been shown to predict survival before treatment or to predict tumor response early in the treatment [9,40]. The potential value of perfusion is also shown by high diagnostic accuracies in treatment response evaluation in patients with brain tumors [41]. Spectroscopy is even less studied, although its feasibility has been demonstrated in head and neck tumors. However, its diagnostic accuracy currently remains speculative [10].
The main analysis included predominantly posttreatment studies, but also a few intratreatment studies. Combining both was considered to be justified, as MRI aims in both to identify viable tumor, although the question differs slightly. Intratreatment MRI aims at differentiating responders from non-responders to adapt the treatment in non-responders, while posttreatment MRI is used to select patients for additional therapy when tumor is shown. The overlapping diagnostic accuracy supports the legitimacy of combining intratreatment and posttreatment MRI.
Identifying non-responders and responders early after treatment start or even before treatment would be optimal. The few intratreatment studies in our data suggest a preference for using ADC data over anatomical MRI for this [17,22,29]. Studies predicting treatment response before the start of treatment also favor ADC for primary and nodal sites [34,37,42]. Although good, the performance is until now too variable for wide clinical implementation. It would probably benefit from more precise coregistration to anatomical MRI, but also from more clinical trials in a large population for validation of DWI early after the start of treatment [43]. Identifying the non-responders with ADC as a potential biomarker early during treatment may enable treatment tailoring and may avoid possible side effects of an ineffective and expensive treatment regimen [44]. Prediction of clinical outcome would be of interest as well.
FDG-PET is frequently used for treatment response assessment with high sensitivity but lower specificity [45]. Compared to FDG-PET, ADC can be performed earlier to assess treatment response. FDG-PET is less reliable in the first months after treatment, with false positive results due to inflammation, granulation and scar tissue [46]. ADC can be performed in this period, but false positives and false negatives are not fully excluded. True restricted diffusion can be seen in an abscess or with inflammation, although central enhancement as shown in tumor would be lacking. Scar tissue can display low ADC, but normally in combination with a lack of diffusion restriction. This distinguishes scar tissue from tumor, which shows low values on the ADC map together with diffusion restriction [47]. Minimal to absent enhancement of scar tissue helps in further differentiation from tumor. Included studies used only ADC values for calculations and therefore likely underestimated the accuracy of diffusion weighted MRI. Combining anatomical MRI with diffusion weighted MRI including b-maps, ADC maps and post contrast images would probably demonstrate even higher diagnostic accuracy in clinical practice. The higher specificity (fewer false positives) of ADC compared to anatomical MRI results in a reduction of unnecessary and costly initiation of treatment in patients with treatment-related changes. It might also reduce the number of patients who are falsely interpreted on anatomical MRI as having tumor progression, resulting in incorrect continuation of therapy. Moreover, the higher sensitivity (fewer false negatives) of ADC contributes to decreasing the number of missed patients with tumor recurrence.
Multimodal imaging with PET/MR systems is a potential area for further research to increase the diagnostic accuracy of treatment response evaluation, both early after the start of treatment and later posttreatment [47].
In general, the methodological quality of the included studies was similar, but low. This might also explain the wider confidence interval in some studies [18]B, but could not provide a convincing explanation for others [16,19,29]. The heterogeneity of patient selection, reference standards or relatively small group size might provide additional sources of variation. This is a reflection of the complexity of the field; however, this variation is an important limitation of the current study. Especially the variability in the definition used to identify tumor residual or recurrence compared to treatment effects, as shown in Tables 1 and 2, might be a limiting factor. Furthermore, as discussed above and also displayed in these tables, different b-values and ADC thresholds were used in the different studies. Although it can still be concluded that ADC helps in the differentiation of tumor residual or recurrence from treatment-related effects such as fibrosis, this variability hinders stronger conclusions and firm implementation in clinical practice. Further research should also focus on comparing all imaging techniques in the same population using direct comparisons to ensure a higher quality. In such a study, the same reference standard should be applied in a large consecutive cohort of patients. This would also allow subgroup analyses to search for the sources of heterogeneity in the diagnostic performance of the MRI sequences.

Conclusions
To conclude, a higher diagnostic accuracy of ADC values over anatomical MRI in patients with treated head and neck tumors is demonstrated in this meta-analysis. It should be kept in mind that this was only statistically significant for the direct comparison of the primary tumor site and not convincing for the direct comparisons of the nodal site. Nevertheless, these findings emphasize the relevance of including DWI with ADC for response evaluation of treated head and neck tumor patients.