Reliability of Task-Based fMRI for Preoperative Planning: A Test-Retest Study in Brain Tumor Patients and Healthy Controls

Background: Functional magnetic resonance imaging (fMRI) continues to develop as a clinical tool for patients with brain cancer, offering data that may directly influence surgical decisions. Unfortunately, routine integration of preoperative fMRI has been limited by concerns about reliability. Many pertinent studies have been undertaken in healthy controls, but work involving brain tumor patients has been limited. To develop fMRI fully as a clinical tool, it will be critical to examine these reliability issues among patients with brain tumors. The present work is the first to extensively characterize differences in activation map quality between brain tumor patients and healthy controls, including the effects of tumor grade and the chosen behavioral testing paradigm on reliability outcomes.
Method: Test-retest data were collected for a group of low-grade (n = 6) and high-grade (n = 6) glioma patients, and for matched healthy controls (n = 12), who performed motor and language tasks during a single fMRI session. Reliability was characterized by the spatial overlap and displacement of brain activity clusters, BOLD signal stability, and the laterality index. Significance testing was performed to assess differences in reliability between patients and controls, between low-grade and high-grade patients, and between different fMRI testing paradigms.
Results: There were few significant differences in fMRI reliability measures between patients and controls. Reliability was significantly lower when comparing high-grade tumor patients to controls, or to low-grade tumor patients. The motor task produced more reliable activation patterns than the language tasks, as did the rhyming task in comparison to the phonemic fluency task.
Conclusion: In low-grade glioma patients, fMRI data are as reliable as those from healthy control subjects. For high-grade glioma patients, further investigation is required to determine the underlying causes of reduced reliability. To maximize reliability outcomes, testing paradigms should be carefully selected to generate robust activation patterns.



Introduction
To address the shortcomings of previous test-retest studies in relation to preoperative fMRI in brain tumor patients, our work reports on a cohort of brain tumor patients and patient-matched healthy controls who repeated a battery of motor and language tasks during a single fMRI session. Test-retest reliability was investigated by applying a novel single-subject preprocessing pipeline optimization algorithm that yields state-of-the-art activation maps [15,16], and by using a variety of metrics to make inferences about reproducibility within and across cohorts (i.e. low-grade vs. high-grade tumors, patients vs. controls), as well as across different testing paradigms (i.e. motor task vs. language task, language task A vs. language task B).

Subjects
With approval from the Research Ethics Boards at Sunnybrook Health Sciences Centre, Toronto, Canada, and St. Michael's Hospital, Toronto, Canada, eighteen brain tumor patients (10 male, 8 female; mean age 43.2 ± 13.7) were recruited to participate in this research study. Recruitment criteria included: clinical or radiological evidence of probable low-grade glioma (LGG; World Health Organization (WHO) grades I-II) or high-grade glioma (HGG; WHO grades III-IV) near or within eloquent brain regions (i.e. sensory, motor, or language areas), no contraindications to MRI (e.g. severe claustrophobia, metallic implants), and no other major neurological or psychological disorder. Patient demographics are listed in Table 1.
All but three patients were right-handed. On histopathology of subsequent surgical samples, 9 patients were found to have HGG, 8 to have LGG, and one patient to have brain metastases from primary lung cancer. Apart from this patient (P11) who presented with left and right  1). For each patient, one healthy control matched in age (± 2 years), sex, and handedness was also recruited to the study. For each volunteer subject, fMRI was performed during a single visit to Sunnybrook Research Institute (SRI), Toronto, Canada. Prior to participation, informed written consent was obtained from each subject and a 15-minute training session was undertaken to allow subjects to familiarize themselves with the behavioral tests to be performed, as well as with an fMRI-compatible tablet system that was used to deliver the tests and record subject performance [17].
The tablet technology was equipped with a writing stylus and computer software (E-Prime, Psychology Software Tools, Sharpsburg, PA) capable of administering a variety of cognitive tasks. With the subject lying in the magnet bore, the tablet platform rested over the torso at a comfortable angle for interaction. Visual stimuli were transmitted to the subjects via a liquid crystal display projector (Avotec, Inc., Stuart, FL) onto a rear-projection screen that was visible through an angled mirror attached to the head coil (diagonal viewing angle, 25.3°). The tablet system was used for this test-retest fMRI study because the technology has recently been implemented into awake craniotomy procedures to expand the behavioral testing repertoire available during intraoperative mapping, and to improve standardization of behavioral tests during pre- and intra-operative mapping [18].

Image Acquisition
All subjects were imaged using a research-dedicated 3T MRI system (MR750, GE Healthcare, Waukesha, WI) equipped with an 8-channel head receiver coil and peripherals for recording cardiac and respiratory signals (photoplethysmograph and bellows, respectively). The protocol consisted of initial localizer images followed by IR-FSPGR (inversion recovery prepared fast spoiled gradient echo) T1-weighted axial anatomical imaging (repetition time (TR)/echo time (TE)/flip angle (θ) = 82 ms/3.2 ms/8 degrees; field of view (FOV) = 22 cm × 22 cm; 190 slices; slice thickness = 1 mm) and fMRI using repeated spiral in/out T2*-weighted imaging (TR/TE/θ = 2000 ms/30 ms/70 degrees; FOV = 20 cm × 20 cm; 30 slices; slice thickness = 4.5 mm) to record BOLD hemodynamic responses to neural activity [19]. During the 1-hour imaging session, patients performed a battery of up to three behavioral tests in duplicate, with a 20-minute test-retest interval; identical procedures were used for the matched healthy controls. The randomized battery consisted of a hand motor task ("hand squeezing"), a rhyming task [20] and a written phonemic fluency task [21], as described below.

fMRI Task Battery
Hand Squeezing (HS). Subjects were given a latex squeeze toy in the hand contralateral to the hemisphere of tumor dominance (control subjects used the same hand as their patient match) and instructed to squeeze continuously at a self-directed, comfortable pace. Eight 15 s task blocks alternated with 15 s blocks of rest for a total run time of 240 s. The toy generated a 'squeak' sound that was audible from the MRI console, providing an appropriate monitor of task performance.
Rhyming Words (RW). Subjects were presented with a pair of words and instructed to decide silently if the word pairs rhymed. Forced-choice "yes" or "no" responses were recorded by pressing an icon on the tablet, and monitored from a computer at the MRI console. During each 18 s task block, 6 different word pairs were each displayed for 3 s. Eight separate task blocks were alternated with a baseline condition, in which line pairs were shown and subjects had to determine if the line pairs were alike in volume and orientation. The total run time was 300 s. The procedure for making tablet responses during the baseline condition and during the rhyming task was identical.
Phonemic Fluency (PF). Subjects were presented with a single letter and given 60 s to produce as many words as possible beginning with that letter. Written responses were recorded on the tablet using the stylus, and monitored from the MRI console. Three PF task blocks were administered, alternating with a 20 s baseline condition that involved writing varying lengths (self-chosen) of symbol strings composed of double-loops (e.g. "8", "88", "888", etc.), then followed by 16 s of rest. Different 3-letter combinations of equal task demands (C-F-L and P-R-W) [22] were used for the test and re-test run to minimize memory and learning effects.
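As an illustration of the block timing described for the hand squeezing task (eight 15 s task blocks alternating with 15 s of rest, sampled at TR = 2 s), the following minimal sketch builds a task/rest indicator with one entry per fMRI volume. The function name `boxcar_design` and the choice to start the run with a task block are assumptions made for illustration; the block order is not stated above, and this is not the study's own stimulus or analysis code.

```python
def boxcar_design(n_blocks, task_s, rest_s, tr_s, task_first=True):
    """Return a list with one 0/1 entry per volume (1 = task, 0 = rest)."""
    cycle_s = task_s + rest_s
    n_volumes = int(round(n_blocks * cycle_s / tr_s))
    design = []
    for v in range(n_volumes):
        # Time of this volume's onset within the current task/rest cycle.
        t_in_cycle = (v * tr_s) % cycle_s
        if task_first:
            in_task = t_in_cycle < task_s
        else:
            in_task = t_in_cycle >= rest_s
        design.append(1 if in_task else 0)
    return design


# Hand squeezing run: 8 x (15 s task + 15 s rest) = 240 s, TR = 2 s.
# Assumption: the run begins with a task block.
hs_design = boxcar_design(n_blocks=8, task_s=15, rest_s=15, tr_s=2.0)
run_time_s = len(hs_design) * 2.0
```

Because 15 s blocks are not an integer multiple of the 2 s TR, volumes are labelled here by their onset time; a real analysis would typically convolve the block timing with a hemodynamic response model instead.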

Data Analysis
Preprocessing was performed within the NPAIRS (Nonparametric, Prediction, Activation, Influence, and Reproducibility reSampling) framework [16]. For both patient and control groups, this provided optimized single-subject preprocessing pipelines that yielded the most reproducible and task-predictive activation maps. In this framework, the quality of fMRI data was evaluated for a given preprocessing pipeline by splitting the task run temporally into two halves and analyzing the split-halves independently. Reproducibility (R) was computed as the Pearson correlation between split-half activation maps; prediction (P) was computed by using the analysis model of one split-half to classify scans from the other split-half, based on Bayesian posterior probability. The (P,R) measures were computed for all possible combinations from a set of pipeline steps turned on/off (see immediately below), and the combination minimizing the Euclidean distance measure $\sqrt{(1 - P)^2 + (1 - R)^2}$ was selected as the optimal pipeline, for each subject and task run [15]. The series of pipeline choices, some of which were extracted from the Analysis of Functional NeuroImages (AFNI) software package (version: 2011_12_21_1014) [23], included: motion correction (AFNI, 3dvolreg), outlier censoring and interpolation [24], physiological correction of cardiac and respiratory data (AFNI, 3dretroicor), slice-timing correction (AFNI, 3dTshift), spatial smoothing with an isotropic 6 mm Gaussian filter (AFNI, 3dmerge), temporal detrending (0th-3rd order Legendre polynomials), motion parameter regression (using Principal Components of the 6 rigid-body motion parameters that account for >85% of motion variance), task covariates, global signal removal by regressing the first Principal Component of the fMRI data [25], and data-driven physiological correction using the PHYCAA+ algorithm [26].
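The selection rule described above, choosing the pipeline whose (P, R) pair lies closest to the ideal point (1, 1), can be sketched as follows. The step names, the three candidate pipelines, and their (P, R) scores are hypothetical values chosen for illustration; they are not results from this study, and the sketch does not reproduce the NPAIRS split-half resampling that would produce such scores.

```python
import math
from itertools import product


def distance_from_ideal(p, r):
    """Euclidean distance of (P, R) from perfect prediction and reproducibility."""
    return math.hypot(1.0 - p, 1.0 - r)


def select_pipeline(scores):
    """scores maps a pipeline (tuple of on/off flags) to its (P, R) pair."""
    return min(scores, key=lambda pipeline: distance_from_ideal(*scores[pipeline]))


# Hypothetical subset of optional steps; each can be switched on or off.
steps = ("motion_correction", "spatial_smoothing", "global_signal_removal")
all_pipelines = list(product((False, True), repeat=len(steps)))

# Hypothetical (P, R) scores for three of the candidate pipelines.
scores = {
    (True, True, False): (0.80, 0.60),
    (True, False, False): (0.75, 0.40),
    (False, True, True): (0.70, 0.65),
}
best = select_pipeline(scores)
```

With N optional steps there are 2^N on/off combinations, so the exhaustive search grows quickly; in this toy example the first pipeline wins because it has the smallest combined shortfall in P and R.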
For all tested processing pipelines, a univariate Gaussian Naïve Bayes model was used to estimate brain activity; this is the predictive formulation of the widely-used General Linear Model (GLM), which allows us to measure prediction accuracy of pipelines. Correction for multiple comparisons was performed using a false discovery rate [27] threshold of q < 0.05, applied to the whole brain and region-of-interest (ROI) activation maps. The maps for individual subjects were then scrutinized with a small variable threshold, $t_s$ (z-score < 1), to minimize any residual supra-threshold voxels potentially caused by artifact, and to stabilize active clusters and their spatial extent. Datasets were also created for two additional thresholds, $t_s^-$ and $t_s^+$, at ± 20% of the initial threshold, such that reliability metrics could be evaluated at multiple thresholds. This approach has been previously used in the literature to good effect, as it minimizes errors in interpreting activation maps associated with a fixed threshold [28].
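For readers unfamiliar with false discovery rate control, the sketch below shows one common way a q < 0.05 criterion is turned into a voxelwise p-value cutoff, using the Benjamini-Hochberg step-up procedure. That this specific procedure corresponds to the method of reference [27] is an assumption, and the p-values are invented for illustration.

```python
def bh_cutoff(p_values, q=0.05):
    """Return the largest p-value passing the Benjamini-Hochberg criterion.

    Returns None when no p-value passes, i.e. no voxel is declared active.
    """
    m = len(p_values)
    cutoff = None
    for rank, p in enumerate(sorted(p_values), start=1):
        # Step-up rule: keep the largest p with p <= q * rank / m.
        if p <= q * rank / m:
            cutoff = p
    return cutoff


# Hypothetical voxelwise p-values (not data from this study).
p_vals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]
cutoff = bh_cutoff(p_vals, q=0.05)
active = [cutoff is not None and p <= cutoff for p in p_vals]
```

Here only the two smallest p-values survive, even though five are below 0.05 uncorrected, which illustrates why the corrected maps are more conservative.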
The procedure for delineating each ROI started with visual assessment of the group average t-statistic data for controls, which showed the expected patterns of activity for each fMRI task (Fig 2, Table 2). Brain regions labelled in Fig 2 for each task (e.g. precentral gyrus, middle frontal gyrus) were identified from the Talairach-Tournoux Atlas (N27) and combined to create a mask of all ROIs. This was done using the "draw dataset" plugin built into the AFNI graphical user interface (GUI). The affine transformation of native spatial coordinates to Talairach coordinates, and vice versa, was fully automated and immediately updated on the GUI. When anatomical regions were shifted or distorted in patients by the tumor volume, ROIs were modified manually by one experienced individual highly knowledgeable in functional neuroanatomy (M.M.).

Metrics of Test-Retest Reliability
To assess test-retest reliability in the cohorts of brain tumor patients and matched healthy controls, a comprehensive set of metrics was adopted. The reliability of single-subject maps was characterized by the spatial overlap and displacement of brain activity clusters, the stability of the BOLD signal, and the laterality index across test and re-test runs. In addition to exploring group differences in test-retest reliability between patients and controls, these metrics were also used to explore the difference in reliability between fMRI tasks (i.e. motor vs. language, and language vs. language). Lastly, the effect of tumor grade on fMRI reliability was assessed by comparing results for HGG patients (n = 6) and LGG patients (n = 6). Although such a comparison has not been reported in the literature to date, it was expected that HGG patients would show decreased fMRI reliability in relation to LGG patients, in keeping with the effects of more aggressive and invasive disease, and likely greater disruption of neurological function. For all comparisons, statistical testing was performed using a two-sided Wilcoxon rank sum test at the p < 0.05 significance level. A description of the individual metrics is provided below.
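To make the group comparison concrete, the following sketch implements a two-sided Wilcoxon rank sum test by exhaustively enumerating group labelings, which is practical for groups of six. Whether the study used an exact or an approximate p-value is not stated, so the exact enumeration is an assumption; the function names and the overlap values are hypothetical and are not data from this study.

```python
from itertools import combinations


def midranks(values):
    """Rank values from 1..N, assigning tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks


def rank_sum_test(x, y):
    """Return (rank sum of x, exact two-sided p-value)."""
    ranks = midranks(list(x) + list(y))
    n_x, n = len(x), len(x) + len(y)
    w_obs = sum(ranks[:n_x])
    w_mean = n_x * (n + 1) / 2  # expected rank sum under the null hypothesis
    dev_obs = abs(w_obs - w_mean)
    total = extreme = 0
    for idx in combinations(range(n), n_x):
        total += 1
        w = sum(ranks[i] for i in idx)
        if abs(w - w_mean) >= dev_obs - 1e-9:  # tolerance for float midranks
            extreme += 1
    return w_obs, extreme / total


# Hypothetical Jaccard overlaps for two groups of six subjects.
group_a = [0.18, 0.15, 0.20, 0.22, 0.14, 0.19]
group_b = [0.34, 0.30, 0.41, 0.28, 0.36, 0.45]
w, p = rank_sum_test(group_a, group_b)
significant = p < 0.05
```

With 6 versus 6 subjects there are only 924 labelings, so enumeration is instant; for 12 versus 12 there are about 2.7 million, where a normal approximation or a statistics library would be the more usual choice.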
Spatial overlap. The degree of spatial overlap of active voxels was measured between test and retest sessions via the Jaccard Index, $J_o$, and the Dice coefficient, $D_o$:
$$J_o = \frac{V_o}{V_1 + V_2 - V_o}, \qquad D_o = \frac{2 V_o}{V_1 + V_2}$$
where $V_o$ is the number of overlapping voxels across test sessions, and $V_1$ and $V_2$ are the number of active voxels in the individual test sessions, respectively. These resulting values are often interpreted as the 'percentage overlap', ranging between 0% (no overlap) and 100% (perfect overlap). The $J_o$ measure tends to estimate lower overlap than $D_o$, with the greatest differences occurring at intermediate overlap values; the $D_o$ measure is also known to be susceptible to "aliasing" effects, where different patterns produce highly similar overlap values [29]. Both $J_o$ and $D_o$ were calculated at the whole brain and ROI level, and the average across $t_s$, $t_s^-$, and $t_s^+$ was computed to yield final values for each subject.
Cluster displacement. The displacement of brain activity clusters across test and retest runs was computed by a Euclidean distance measurement. At the primary threshold, $t_s$, where clusters were identified as most stable (minimal changes in cluster extent and volume) for individual subjects, the 3dclust command in AFNI (first-nearest neighbour; cluster threshold = 5 voxels) was used to extract centre of mass (COM) and peak coordinates for clusters residing within the selected ROIs. Each cluster maintaining a unique spatial position in both the test and re-test activation maps had a COM and peak displacement measure recorded, which was then averaged with measures from other nearby clusters. Lower displacement values indicated better reliability for this measure.
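The overlap and displacement measures described above can be sketched in a few lines, assuming the standard definitions of the Jaccard index and Dice coefficient. Active voxels are represented here as sets of integer (x, y, z) indices, and the centre of mass is unweighted; both are simplifying assumptions (AFNI's 3dclust was the tool actually used), and the voxel sets are invented for illustration.

```python
import math


def jaccard(active_1, active_2):
    """Overlap divided by the union of the two sets of active voxels."""
    v_o = len(active_1 & active_2)
    union = len(active_1) + len(active_2) - v_o
    return v_o / union if union else 0.0


def dice(active_1, active_2):
    """Twice the overlap divided by the summed sizes of the two sets."""
    v_o = len(active_1 & active_2)
    total = len(active_1) + len(active_2)
    return 2 * v_o / total if total else 0.0


def centre_of_mass(voxels):
    """Unweighted mean coordinate of a cluster (an assumption of this sketch)."""
    n = len(voxels)
    return tuple(sum(v[axis] for v in voxels) / n for axis in range(3))


def displacement(point_1, point_2):
    """Euclidean distance between two coordinates, in the units supplied."""
    return math.dist(point_1, point_2)


# Hypothetical test and retest clusters along one axis (voxel indices).
test_run = {(0, 0, 0), (1, 0, 0), (2, 0, 0), (3, 0, 0)}
retest_run = {(2, 0, 0), (3, 0, 0), (4, 0, 0), (5, 0, 0)}

j_o = jaccard(test_run, retest_run)
d_o = dice(test_run, retest_run)
com_shift = displacement(centre_of_mass(test_run), centre_of_mass(retest_run))
```

In this toy case half of each cluster overlaps, giving a Dice value of 0.5 but a Jaccard value of one third, which illustrates why the Jaccard index reports lower overlap at intermediate values. Displacements computed on voxel indices would need to be scaled by the voxel dimensions to be expressed in mm.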
Stability of BOLD signal. The stability of the BOLD signal amplitude was calculated based on the $t_{diff}$ metric adopted by Gorgolewski et al. (2013) to evaluate the between-session variance of t values for single-subject test-retest data [30]. The metric is calculated simply from unthresholded t-statistic maps and is inversely related to calculation of the intra-class correlation coefficient (ICC) for 2 sessions. For each task, $t_{diff}$ was computed across the whole brain and within selected ROIs according to
$$t_{diff} = \frac{1}{n} \sum_{i=1}^{n} (t_{i1} - t_{i2})^2$$
where $n$ is the number of voxels and $t_{i1}$ and $t_{i2}$ are the $i$th voxel t-values of brain activity in the first and second test sessions, respectively. Test-retest data with lower $t_{diff}$ values are interpreted as more reliable, whereas higher $t_{diff}$ values indicate less reliability.
Laterality index. Test-retest reliability was also assessed with respect to within-subject laterality of brain activity across the two cerebral hemispheres, given the usefulness of laterality assessments in preoperative fMRI. The assessment was made in terms of the laterality index (LI) according to
$$LI = \frac{N_{left} - N_{right}}{N_{left} + N_{right}}$$
where $N_{left}$ and $N_{right}$ represent brain map quantities from the left and right hemisphere, respectively. These brain map quantities are commonly characterized according to the extent of activated brain volume (number of active voxels in ROIs), $LI_e$; or by brain map signal intensity (mean or sum of t values or β coefficients of active voxels in ROIs), $LI_m$ [31]. It has been argued that the latter is a more robust measure of laterality, although Jansen et al. (2006) reported that reproducibility was comparable for $LI_e$ and $LI_m$ measured in healthy subjects. Given these findings, both $LI_e$ and $LI_m$ (based on the sum of t values) are reported in the present work to compare with previous reports while making inferences about the robustness of these metrics in brain tumor patients.
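A minimal sketch of the two quantities defined above follows. It assumes that the stability metric is the mean squared voxelwise difference between the two unthresholded t-maps, and that the laterality index is the usual normalized left-right difference; the function names and all input values are hypothetical.

```python
def t_diff(t_map_1, t_map_2):
    """Mean squared difference between voxelwise t-values of two sessions."""
    if len(t_map_1) != len(t_map_2):
        raise ValueError("t-maps must contain the same number of voxels")
    n = len(t_map_1)
    return sum((a - b) ** 2 for a, b in zip(t_map_1, t_map_2)) / n


def laterality_index(n_left, n_right):
    """(left - right) / (left + right); +1 is fully left, -1 fully right."""
    total = n_left + n_right
    if total == 0:
        raise ValueError("no activity in either hemisphere")
    return (n_left - n_right) / total


# Hypothetical t-values for four voxels in the test and retest runs.
test_t = [2.0, 0.5, -1.0, 3.0]
retest_t = [1.0, 0.5, 0.0, 5.0]
stability = t_diff(test_t, retest_t)

# Extent-based LI from active voxel counts, and magnitude-based LI from
# summed t-values of active voxels (both sets of numbers are invented).
li_extent = laterality_index(120, 40)
li_magnitude = laterality_index(310.5, 95.5)
```

The same `laterality_index` function serves both variants because only the hemispheric quantity changes: a voxel count for the extent-based index and a summed t-value for the magnitude-based index.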
Both $LI_e$ and $LI_m$ were calculated at $t_s$, $t_s^-$, and $t_s^+$ to account for the threshold-dependence of LI [32,33], which can highly influence results and thus reliability. In addition, both LIs were calculated based on ROIs within the inferior frontal gyrus and the superior temporal gyrus to include Broca's area and Wernicke's area, respectively. Laterality was determined according to a classification system commonly used in the literature [34]: an LI value greater than 0.2 was classified as left-lateralized, less than -0.2 was classified as right-lateralized, and the remaining LI range was classified as bilateral. Reliability of language laterality was subsequently assessed by two separate methods. In method A, language laterality was deemed reliable if the same classification was produced for both test and re-test runs for $LI_e$ and/or $LI_m$, and across all thresholds: $t_s$, $t_s^-$, and $t_s^+$. In method B, the individual subject LI values for the test runs were plotted against the analogous values for the re-test runs. For each functional task, the Pearson correlation coefficient r was computed to assess the consistency of the relative ordering of subject LI values within each cohort.
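The classification rule and the two reliability checks can be illustrated with the sketch below. The reading of method A used here (test and re-test must receive the same label at every threshold) is one interpretation of the description above, and all LI values are invented for illustration rather than taken from the study.

```python
import math


def classify(li, cutoff=0.2):
    """Label an LI value using the +/- 0.2 convention described in the text."""
    if li > cutoff:
        return "left"
    if li < -cutoff:
        return "right"
    return "bilateral"


def reliable_method_a(test_lis, retest_lis):
    """True if test and re-test labels agree at every threshold supplied."""
    return all(classify(a) == classify(b) for a, b in zip(test_lis, retest_lis))


def pearson_r(x, y):
    """Pearson correlation between paired test and re-test values."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    return s_xy / math.sqrt(s_xx * s_yy)


# Method A for one hypothetical subject: LI at the three thresholds.
subject_test = [0.50, 0.46, 0.55]
subject_retest = [0.42, 0.38, 0.47]
stable = reliable_method_a(subject_test, subject_retest)
unstable = reliable_method_a(subject_test, [0.42, 0.15, 0.47])

# Method B for a hypothetical cohort: one LI per subject and run.
cohort_test = [0.50, 0.10, -0.30, 0.62]
cohort_retest = [0.44, 0.18, -0.25, 0.58]
r = pearson_r(cohort_test, cohort_retest)
```

Method A is sensitive to values near the ±0.2 boundaries, where a small change in LI flips the label, whereas method B ignores the labels and asks only whether subjects keep their relative ordering.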

Results
Based on qualitative assessment during the initial training session, all subjects complied with instructions during task performance and demonstrated competent use of the tablet. Given that some subjects performed only a subset of the behavioral tasks, a total of 12 patient datasets (6 LGG, 6 HGG) were acquired for each of the three behavioral tasks. Similarly, 12 healthy control datasets were acquired for comparison. Subject performance scores in the test and retest runs were highly similar for both the RW task and PF task. The majority of subjects had a rhyming accuracy of >90% and generated approximately 6-13 words per 60 s task block during the PF test. HGG patients yielded some of the poorest performance scores, with rhyming accuracy falling below 80% (primarily due to slow reaction time) and only 3-4 words generated per 60 s task block of PF.
Anatomical regions of interest were shifted in one HGG patient (P02) and as a result the task ROIs were manually extended posteriorly to include the pre- and post-central gyri. The most significantly activated language areas (Table 2) in the control group maps (Fig 2) included Brodmann areas 6, 9, 22 (Wernicke's area), 45 (Broca's area), and 46. Group-level hand motor activity was dominated by the right precentral gyrus; however, left hemispheric activity was seen within the precentral gyrus at the single-subject level in some cases.
All but two patients (P12, P16) went on to receive surgical treatment in which fMRI data were utilized in conjunction with intraoperative brain mapping by direct cortical stimulation (DCS), demonstrating good concordance. The remaining patients underwent conventional surgical resection without intraoperative mapping.

Patients vs. Controls
Group means and standard deviations for all test-retest metrics are summarized in Table 3, for patients and controls and for each fMRI task that was investigated. Fig 3 shows the metric data for individual subjects (excluding $D_o$, which produced similar trends to $J_o$, and LI, which is reported below). Considerable variability was observed within and between groups, and only the whole brain $J_o$ (patients: 0.26±0.11, controls: 0.38±0.10) and $D_o$ (patients: 0.40±0.13, controls: 0.55±0.10) showed significant group differences, specifically for the HS task. Fig 4 illustrates one aspect of the variability within the patient group, displaying overlap results for those individual patients who produced the highest and lowest Jaccard indices within each functional task. Some additional trends were notable in Fig 3 for COM and peak displacement values, as well as for $t_{diff}$ values. Apart from the PF task, where displacement values were high compared to other tasks for both patients and controls, average COM and peak displacement values for the control group were smaller (i.e. more reliable) than average measures in the patient group. The variability in COM values was smaller than for peak displacement values in both groups, and for both metrics, variability was smaller in the control group than in the patient group. Specifically, the COM displacement within the control group was well within 5 mm on average for both HS and RW tasks, with a maximum range just exceeding 5 mm. For patients, the average COM displacement was 5 mm or less across both tasks, but the maximum range was substantially larger than for controls, at approximately 8 mm. The range of observed peak displacements was particularly high in the patient group, with a maximum peak displacement value for the HS task of nearly 23 mm, and an analogous value of approximately 17 mm for the RW task.
Patient and control $t_{diff}$ values were most variable for the PF task. In agreement with the RW task, the PF task showed that, on average, $t_{diff}$ was greater (i.e. less reliable) in patients than in controls. Although the opposite trend was observed in the motor task, all tasks were consistent with respect to trends in the within- and between-group variability of $t_{diff}$, which was smaller for ROI than for whole brain analyses.
Based on the stringent criteria for method A (i.e. laterality consistently characterized across $LI_e$, $LI_m$, $t_s$, $t_s^-$, and $t_s^+$), the laterality index was found to be reproducible in 92% of patients and 100% of controls for the RW task, versus a contrasting trend of 92% of patients and 67% of controls for the PF task. When $LI_e$ and $LI_m$ were treated as separate criteria (Table 4), percentage values increased solely for the latter metric, thus indicating better reliability when LI is measured according to brain map signal intensities. The Pearson correlation coefficients calculated in method B are tabulated in Table 4 and representative plots are illustrated in Fig 5. Similar to the findings in method A, r-values were greater (more reliable) in the controls than in the patients for the RW task, and vice versa for the PF task. However, in both the RW task and PF task results, data plotted for the individual patients fell near or within the confidence interval (CI) outlined for the control group (Fig 5), and furthermore demonstrated stronger lateralization in comparison to the controls. There was a greater amount of variability seen within the control group for the PF task versus the RW task, reflected in the larger CIs seen in Fig 5. Correlations between test and re-test run $LI_e$ were minimally-to-moderately better than $LI_m$ correlations, with the exception of the PF task in control subjects, where the opposite effect was observed.

HGG vs. LGG
Regarding the effect of tumor grade on fMRI reliability, group means and standard deviations for all test-retest metrics are tabulated in Table 5, and plotted per subject in Fig 6 (excluding $D_o$ and LI). The only statistically significant results were found for whole brain $J_o$ (HGG: 0.18±0.04, LGG: 0.34±0.09) and $D_o$ (HGG: 0.30±0.05, LGG: 0.49±0.09) measurements for the motor task, as similarly reported for patient-control group comparisons. There were consistent trends for mean $J_o$ and $D_o$ values across all tasks, showing that mean overlap for HGG patients is reduced slightly compared to LGG patients. The COM and peak displacement also had consistent trends across tasks, showing that HGG patients had slightly poorer reliability based on these metrics. However, the $t_{diff}$ metric did not show consistent trends across tasks. For example, in Fig 6 the rhyming task demonstrated a reduced $t_{diff}$ (better reliability) for HGG patients compared to LGG patients when analyzed across the whole brain, yet an increased $t_{diff}$ (poorer reliability) for HGG patients when analyzed within the ROI.
Table 4. Reliability metrics for language laterality in controls, HGG patients and LGG patients. $LI_e$ and $LI_m$ represent the laterality index characterized by the extent of activated brain volume, and the sum of brain map signal intensity, respectively. The Pearson correlation coefficients, r, are reported with their respective confidence interval (CI) in brackets. Columns: Metric; Patients (n = 12); Controls (n = 12); HGG (n = 6); LGG.
The laterality index was reproducible in 100% of LGG patients versus 83.3% of HGG patients (Table 4) for both the RW task and PF task, under the most stringent reliability criteria. When $LI_m$ was considered separately from $LI_e$, the classification of laterality was improved in one HGG patient for the RW task. Trends in the Pearson correlation coefficients, however, were inconsistent and contradictory to trends identified through method A.

Motor vs. Language
Although the effects across patients, controls, and tumor grade were primarily identified through trends in the data, significant effects were observed across tasks, though fewer in the patient group (Table 6). Spatial overlap measures were on average comparable for the HS and RW tasks, whereas results for the PF task were significantly lower. For example, in the control group, a Jaccard index (whole brain) of 0.38±0.10 was reported for both the HS and RW tasks, versus a Jaccard index of 0.22±0.08 for the PF task. COM and peak displacement values were also comparable for the HS task and RW task, whereas the values were doubled or nearly doubled for PF, associated with a much poorer reliability outcome. Statistically significant differences in $t_{diff}$ were identified in both comparisons (i.e. HS vs. RW, HS vs. PF), solely for ROI measurements.

Rhyming vs. Phonemic Fluency
Spatial overlap, displacement, and $t_{diff}$ metrics consistently demonstrated better reliability outcomes for the RW task than for the PF task (Table 6). Statistically significant differences between tasks for $J_o$ and $D_o$ were restricted to whole brain measures in both patients and controls, whereas significant results were limited to the control group for the displacement and $t_{diff}$ metrics. The classification of language laterality (i.e. left, right, or bilateral) was equally reproducible in patients across tasks, but in the control group, language laterality was reproducible in a greater number of subjects for the RW task. Additionally, LI values for test and re-test runs were better correlated in the RW task for controls (Fig 5 and Table 4); however, this was contradicted in the patient group.

Discussion & Conclusions
Functional MRI is gaining popularity in clinical applications such as preoperative planning for brain tumor surgery, but it is very important that the brain activity maps derived from the fMRI BOLD signal are of sufficient quality to support their intended use. The main method for assessing fMRI quality is to perform test-retest studies and analyze the resulting data in terms of various reliability metrics that quantify the constancy of the measured signals. Because the fMRI test-retest literature focuses predominantly on healthy controls, the present study was undertaken to fill a gap in existing knowledge about clinically-relevant fMRI data quality by assessing multiple reliability metrics in a cohort of brain tumor patients. This included the use of single-subject optimization of the fMRI preprocessing pipeline to generate robust results, division of the patient group to study the influence of tumor grade on fMRI reliability metrics, and comparison with patient-matched healthy controls under identical behavioral testing conditions across three different tasks for mapping motor and language regions.

Patients & Controls
The main finding from this study is that, for the most part, fMRI reliability metrics for the sample population of brain tumor patients are comparable to those for healthy controls. It was demonstrated that reliability metric outcomes for both patients and controls are highly variable and dependent on the tumor grade in patients, choice of behavioral task, reliability metric, and the level of analysis (whole brain versus ROI). Each of these factors is discussed subsequently; however, the effect of tumor grade is given precedence due to the novelty of the finding. Specifically, fMRI reliability metrics were slightly worse on average for HGG patients when compared to those for LGG patients. In addition, the reliability metrics for LGG patients and controls were very similar. The broad implication is thus that the characteristics of healthy control fMRI test-retest data are very likely applicable to LGG patients, and that previous findings from the literature can be used to make inferences about reliability for preoperative planning in patients with low-grade tumors. This is important, given that the growth in preoperative fMRI is being driven in part by an evolving clinical predisposition toward early intervention for LGG.
The differences in fMRI reliability metrics observed between HGG and LGG patients are consistent with two factors influencing fMRI signals that have been previously described in the literature. First, it has been reported that tumor-induced neurovascular uncoupling (ti-NVU) of the BOLD signal is more pronounced in HGG than in LGG, due to increased hyperperfusion of the vasculature [35]. The ti-NVU effect is anticipated to have most impact as the lesion-to-activation distance (LAD) approaches zero in high-grade tumors, where large volume effects are often exhibited.
The second important factor involves the influence of substantial preoperative neurological deficit in patients with high-grade tumors (Table 1), resulting in poorer execution of fMRI tasks or fatigue on test-retest experiments. In particular, clinical data show significant correlations between HGG and low Karnofsky Performance Scores [36].
Although both of these factors are likely to have influenced the results of this study to some extent, their importance and relative contributions are difficult to ascertain in practice because of the experimental design and patient variability. Interestingly, no obvious relationship was detected between fMRI reliability metrics and LAD, or between reliability metrics and fMRI performance scores. Fig 4 clearly demonstrates examples where the most reliable activation maps (as measured by overlap of activation clusters) were obtained immediately adjacent to LGG tumors (P12, P13), but also examples where overlap was the worst for activations in the hemisphere contralateral to the LGG tumor site (P03). Regarding fMRI performance scores, HGG patients did produce some of the lowest performances, but these were highly patient-dependent and poorly correlated with reliability outcomes. For example, one patient (P06), who presented with a grade III anaplastic astrocytoma, performed at a high level on all fMRI tasks, producing up to 13 words per minute in the PF task. Yet, reliability metric outcomes remained poor for this subject and were significantly lower than average results for LGG patients and controls. Tumor grade was found to be the only strong predictor of fMRI reliability metrics: the HGG patient group consistently yielded slightly worse metrics on average in comparison to the LGG group, regardless of between-patient variations in LAD or performance scores.
Putting the above discussion in context, it is also important to emphasize that the small sample size of the present study limits a fuller understanding of the relationship between fMRI reliability and tumor grade. Further investigation of this relationship in a larger cohort of patients is left to future work. Nonetheless, the present results provide strong enough evidence to suggest that clinicians should use some additional level of caution when interpreting activation maps in patients with suspected high-grade tumors. Preoperative evaluation of these patients is crucial so that the neuroimaging technician is aware of any language or motor deficits, and can improve reliability of the data by repeating fMRI paradigms that engage the compromised brain networks.

Behavioral Tasks
Differences in fMRI reliability metrics between patients and controls were evaluated for three different behavioral tasks: a hand squeezing task, a forced-choice rhyming task, and a written phonemic fluency task. Subject performance scores agree well with previous findings in the literature. For example, Golestanirad et al. (2015) reported a similar finding for the written PF task using the tablet system, where subjects generated an average of 12.1 ± 2.7 words per minute across all letters (F, A, S, D, or C) [21]. Zec et al. (1999) also found similar results with overt speech, reporting an average of 12.2 ± 4.8, 11.0 ± 5.0, and 13.4 ± 4.6 words for letters F, A, and S, respectively [37]. Group maps generated for each task were also consistent with previous literature, based on regions of peak brain activity (Table 2 and Fig 2) [20,21]. A statistically significant difference in group means, driven primarily by the HGG patients, was found solely for the hand squeezing task. This task yielded the most reliable activation maps, in agreement with previous findings in the literature suggesting better reliability for motor versus language tasks [11,38]. Language task differences in fMRI reliability metrics between patients and controls lacked statistical significance, consistent with the high inter-subject variability previously observed for tasks with higher cognitive components than simple motor tasks [39]. This effect is further supported by the observation that fMRI reliability metrics differed much less between the HS and RW tasks than between the HS and PF tasks. The cognitive demands of the RW task, including visual scanning, sustained attention, constrained response selection, and language processing, were relatively modest. In contrast, the written PF task is a measure of executive function, with higher cognitive demands including mental flexibility, unconstrained search of the mental lexicon, and written coordination [20].
The unconstrained nature of the PF task means that performance depends on multiple factors, including the size of the working lexicon and the ability to switch between clusters of similar word types once a cluster is depleted. Furthermore, the task becomes more difficult over time as fluency decreases, with consequent changes in brain activity that can reduce fMRI signals and lead to poorer reliability [21]. There is evidence in the literature of better reliability for the RW task versus the PF task [20], including easier quantification of behavioral performance, more focal activation patterns, and stronger hemispheric lateralization for determining language laterality. Despite these differences, similar peri-Sylvian activation patterns have been reported for the RW and PF tasks (Fig 2) [20]. This is valuable information for researchers and clinicians conducting fMRI studies in populations such as patients with brain tumors, who may have neurological deficits that affect their performance capabilities. A task of lower cognitive difficulty (i.e. RW) can likely be used to produce a desired activation pattern in a more robust and reliable manner than a task that is more cognitively challenging (i.e. PF) and thus prone to poorer performance scores and noisier brain activation signals. In any event, multiple tasks with overlapping activation patterns should be applied during preoperative fMRI to avoid misidentification of eloquent regions, although it could be that the PF task solely provides confirmatory evidence.

Reliability Metrics & Level of Analysis
To quantify fMRI test-retest reliability in a comprehensive manner, the present study investigated the spatial overlap of active clusters (via the Jaccard index and Dice coefficient), the COM and peak displacement of active clusters, the stability of the BOLD signal across test sessions (via t_diff), and the laterality index. The spatial overlap results agreed well with previous literature. For example, an average Dice coefficient of 45% and a Jaccard index of 33% have been reported in healthy subjects [11], and a range of 23-100% has been reported in patients with low-grade neoplasms. COM and peak displacement measures also fell within the range previously reported for healthy subjects [40,41] and patients [28,42], and the between-session t-value variance, t_diff, followed trends reported by Gorgolewski et al. (2013) [30].
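The two displacement measures can be illustrated with a minimal sketch on toy data. It assumes a t-weighted COM taken over all suprathreshold voxels and an isotropic 3 mm voxel size; neither choice is taken from the present study, and the function names are hypothetical.

```python
import numpy as np

def center_of_mass(t_map, mask, voxel_size_mm):
    """t-weighted center of mass of the masked voxels, in mm (assumed definition)."""
    coords = np.argwhere(mask)        # voxel indices, row-major order
    weights = t_map[mask]             # t-values in the same row-major order
    com_vox = (coords * weights[:, None]).sum(axis=0) / weights.sum()
    return com_vox * np.asarray(voxel_size_mm, dtype=float)

def peak_location(t_map, mask, voxel_size_mm):
    """Location of the maximum t-value inside the mask, in mm."""
    masked = np.where(mask, t_map, -np.inf)
    peak_vox = np.unravel_index(np.argmax(masked), t_map.shape)
    return np.asarray(peak_vox, dtype=float) * np.asarray(voxel_size_mm, dtype=float)

def displacement_mm(a, b):
    """Euclidean distance between two locations given in mm."""
    return float(np.linalg.norm(np.asarray(a) - np.asarray(b)))

# Toy test and retest maps: one cluster shifted by one voxel along the first
# axis, with the retest peak sitting on the cluster edge instead of its center.
shape = (10, 10, 10)
test = np.zeros(shape)
test[4:7, 4:7, 4:7] = 3.0
test[5, 5, 5] = 6.0
retest = np.zeros(shape)
retest[5:8, 4:7, 4:7] = 3.0
retest[7, 5, 5] = 6.0

threshold = 2.5                       # illustrative t threshold
voxel = (3.0, 3.0, 3.0)               # assumed voxel size in mm

com_shift = displacement_mm(
    center_of_mass(test, test > threshold, voxel),
    center_of_mass(retest, retest > threshold, voxel),
)
peak_shift = displacement_mm(
    peak_location(test, test > threshold, voxel),
    peak_location(retest, retest > threshold, voxel),
)
print(f"COM displacement:  {com_shift:.2f} mm")
print(f"Peak displacement: {peak_shift:.2f} mm")
```

In this toy case the peak moves by two voxels while the COM moves by only slightly more than one, because the COM averages over the whole cluster and a single extreme voxel carries limited weight.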
Results were highly variable depending on the choice of metric as well as the level of analysis (i.e. whole brain vs. ROI). Spatial overlap measures showed the most stable trends in the data. The Dice coefficient and Jaccard index were highly correlated, differing primarily in magnitude. Although the Dice coefficient has been reported more extensively in the literature, it has been argued that the Jaccard index is a more suitable metric providing a more natural quantification of overlap [29,43].
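The close relationship between the two overlap measures follows from their definitions: for the same pair of binary maps, D = 2J / (1 + J), so the two metrics rank map pairs identically and differ only in magnitude. The sketch below illustrates this on toy masks; the function names are hypothetical, and returning 0 for two empty maps is a convention assumed here, not one stated in the study.

```python
import numpy as np

def jaccard(a, b):
    """Jaccard index: |A ∩ B| / |A ∪ B| for two binary activation masks."""
    a, b = np.asarray(a, dtype=bool), np.asarray(b, dtype=bool)
    union = np.logical_or(a, b).sum()
    return float(np.logical_and(a, b).sum() / union) if union else 0.0

def dice(a, b):
    """Dice coefficient: 2|A ∩ B| / (|A| + |B|) for two binary activation masks."""
    a, b = np.asarray(a, dtype=bool), np.asarray(b, dtype=bool)
    total = a.sum() + b.sum()
    return float(2 * np.logical_and(a, b).sum() / total) if total else 0.0

# Toy thresholded maps: two 4 x 4 clusters offset by one voxel in each direction.
test_mask = np.zeros((10, 10), dtype=bool)
retest_mask = np.zeros((10, 10), dtype=bool)
test_mask[2:6, 2:6] = True
retest_mask[3:7, 3:7] = True

j = jaccard(test_mask, retest_mask)
d = dice(test_mask, retest_mask)
print(f"Jaccard = {j:.3f}, Dice = {d:.3f}, 2J/(1+J) = {2 * j / (1 + j):.3f}")
```

Because the mapping is nonlinear, group averages of the two metrics need not satisfy the identity exactly, even though every individual pair of maps does.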
Previous studies have reported that ROI analyses yield higher reproducibility metrics than whole brain analyses [14,30,44], similar to the results reported here for spatial overlap and t_diff measures. This is likely due to the removal of unstable and/or false positive activations when an ROI is constructed; however, reliability remains highly dependent on appropriate ROI selection [31]. In some cases, the use of ROI analysis may have only a marginal influence on the outcome of one reliability metric in comparison to another. This effect is seen to an extent with the J_o and t_diff metrics, where the ROI analysis had a greater influence on the former (Figs 3 and 6).
The displacement of COM coordinates was found to be more reproducible than that of peak coordinates of activity; previous studies reported similar findings, noting the high variability in peak coordinates [42]. In fact, the peak coordinates of clusters varied by as much as 23 mm in the present patient group, and Wurnig et al. (2013) reported measures as high as 45 mm. This poses a problem for clinicians, who are often naturally inclined to look for regions of peak activity in a color-coded fMRI activation map in the absence of any COM cursor data. Although the COM of a given cluster may be more reliable, this does not necessarily mean that it is representative of the true spatial coordinates of neural activity. Thus, an investigation of fMRI concordance with intraoperative mapping data, on the basis of peak versus COM coordinates, would be a valuable contribution to this field.
Ruff et al. (2008) evaluated the reproducibility of the LI across three different language tasks and a range of p-values, concluding that language laterality by fMRI is threshold- and task-dependent [32]. The latter was similarly shown by Nadkarni et al. (2015), who explored laterality within expressive versus receptive language tasks [45]. The present study also found a task dependence: individual test and retest LIs were better correlated for the RW task than for the PF task, with the former used much more frequently to determine lateralization by fMRI [46,47]. Nonetheless, for both the RW task and the PF task, laterality was classified as left, right, or bilateral consistently across thresholds in a high number of patients (11/12 or 92%). This supports previous data which have shown high concordance between fMRI language lateralization and intraoperative mapping data [48]. Together these data suggest that fMRI is suitable for determining language laterality in brain tumor patients, provided that the methods for calculating LI are standardized. Although no formal methods have been outlined, some have proposed the use of functionally driven ROIs, as well as multiple cognitive tasks, thresholds, and definitions pertinent to LI (e.g. LI_e, LI_m) [31].
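As an illustration of why standardization matters, the sketch below computes a voxel-count LI, LI = (L - R) / (L + R), at two thresholds and classifies the result. The voxel-count form, the ±0.2 cut-off for bilaterality, the ROI layout, and the function names are all assumptions made for illustration; they are not taken from the present study.

```python
import numpy as np

def laterality_index(t_map, left_roi, right_roi, threshold):
    """Voxel-count LI: (L - R) / (L + R) over suprathreshold ROI voxels."""
    left = np.count_nonzero(t_map[left_roi] > threshold)
    right = np.count_nonzero(t_map[right_roi] > threshold)
    total = left + right
    return (left - right) / total if total else 0.0

def classify_laterality(li, cutoff=0.2):
    """Map an LI value to a label; the cutoff is an assumed convention."""
    if li > cutoff:
        return "left"
    if li < -cutoff:
        return "right"
    return "bilateral"

# Toy t-map: columns 0-4 stand in for a left-hemisphere ROI, 5-9 for the right.
t_map = np.zeros((6, 10))
t_map[0:3, 0:5] = 6.0     # strong left activation
t_map[3:6, 0:5] = 4.0     # weaker left activation
t_map[0:2, 5:10] = 4.0    # weaker right activation

left_roi = np.zeros(t_map.shape, dtype=bool)
left_roi[:, 0:5] = True
right_roi = ~left_roi

thresholds = (3.0, 5.0)   # illustrative t thresholds
li_values = [laterality_index(t_map, left_roi, right_roi, t) for t in thresholds]
labels = [classify_laterality(li) for li in li_values]
for t, li, label in zip(thresholds, li_values, labels):
    print(f"t > {t}: LI = {li:.2f} ({label})")
```

Here the LI value changes with the threshold while the classification stays the same, which mirrors the distinction drawn above between the threshold dependence of LI and the consistency of the resulting laterality label.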
Although not explicitly evaluated here, the choice of statistical threshold used to map brain activity is an important factor that influences fMRI reliability. For example, McKinsey et al. (2010) demonstrated in a group of 24 brain tumor patients that deviations of ±20% from a standard fMRI threshold (t ranging from 2.8 to 26.4 across subjects) had no significant effect on COM, number of activated voxels, or the reproducibility of the location of activated voxels [28]. More recently, Stevens et al. (2013) conducted data-driven analyses in 8 patients, reporting a decrease in spatial overlap of approximately 10% when the threshold was varied from t = 2.6 to t = 6.6 [49]. It is worth noting, however, that the rate of decrease slowed with increasingly higher fMRI thresholds, and individual results were highly variable. Irrespective of the slightly different findings in the literature, the potential volatility of fMRI thresholds is widely appreciated as a concern. The present study has addressed the issue indirectly through the use of individually-optimized preprocessing pipelines, and through evaluation of reliability metrics across t_s, t_s-, and t_s+, as reasonable attempts towards minimizing within-subject effects and ensuring that between-subject variability has more influence on reported reliability measures.
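A threshold sweep of this kind can be sketched as follows. The example assumes that the lower and upper thresholds lie 20% below and above a subject threshold, borrowing the ±20% figure from McKinsey et al. purely for illustration; the actual definitions of t_s- and t_s+ are not restated in this section, and the synthetic maps and names below are hypothetical.

```python
import numpy as np

def jaccard(a, b):
    """Jaccard index of two binary masks (0 if both are empty, by convention)."""
    union = np.logical_or(a, b).sum()
    return float(np.logical_and(a, b).sum() / union) if union else 0.0

def synthetic_t_map(center, shape=(21, 21), peak=8.0, sigma=3.0):
    """Smooth synthetic activation blob; a stand-in for a real t-map."""
    rows, cols = np.indices(shape)
    r2 = (rows - center[0]) ** 2 + (cols - center[1]) ** 2
    return peak * np.exp(-r2 / (2.0 * sigma ** 2))

# Test and retest maps: the same blob displaced by one voxel.
test = synthetic_t_map(center=(10, 10))
retest = synthetic_t_map(center=(11, 10))

t_s = 4.0                                  # hypothetical subject threshold
sweep = {"lower": 0.8 * t_s, "standard": t_s, "upper": 1.2 * t_s}

overlap = {
    name: jaccard(test > t, retest > t) for name, t in sweep.items()
}
for name, value in overlap.items():
    print(f"{name:>8} threshold (t > {sweep[name]:.1f}): Jaccard = {value:.3f}")
```

Reporting a metric at all three thresholds, rather than at one, gives a simple check on whether a reliability result is an artifact of the chosen cut-off.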

Data Preprocessing Strategies
Preprocessing strategies were implemented in the analysis to mitigate subject-specific artifacts, including physiological noise and head motion, which are likely more pronounced in a patient population. Churchill et al. (2013) demonstrated using the NPAIRS framework that single-subject pipeline choices can significantly affect data outcomes, and that the use of individually-optimized pipelines over standard pipelines improves fMRI reliability [15,16]. Thus, their algorithm was applied to single-subject data, thereby controlling for the above-mentioned artifacts while allowing other confounding factors (e.g. brain tumor) to be addressed. However, this is not the only reliable method that has been reported in the literature. In addition to the NPAIRS framework, others have used empirical receiver operating characteristic (ROC) analyses to optimize single-subject pipelines [50,51]. Recently, Stevens et al. (2015) reported a significant increase in fMRI reliability for brain tumor patients using a novel ROC-based approach to select optimal pipeline choices [38]. Average reliability measures in their patient group increased from 0.58 ± 0.03 to 0.65 ± 0.02 through the use of optimized preprocessing pipelines, thereby narrowing the gap in reliability outcomes between the patients and a cohort of healthy controls (0.72 ± 0.02).
There are therefore significant benefits to adopting one such method when dealing with clinical populations where inter-subject variability is high. However, it is also worth noting that the NPAIRS optimal preprocessing pipelines selected for patients and controls were carefully scrutinized and found to have no clear and consistent differences. This suggests that the variability attributable to fMRI signal artifact is comparable in both groups and that pipeline optimization is of general benefit, rather than a benefit solely in patient populations.

Clinical Relevance & Suitability
Quantitative measures of reliability are primarily used to validate fMRI for preoperative planning; however, to properly assess its validity, a discussion of the clinical relevance is required. One very important question to consider when interpreting the results is where fMRI reliability is most important to the surgeon. In the majority of test-retest studies, reliability is measured across the whole brain and functionally driven ROIs, while in clinical practice the surgeon is only concerned with the regions immediately adjacent to the tumor that are exposed in the craniotomy window. Thus, it is the reliability of activations adjacent to or near the tumor that is of clinical relevance. In Fig 4, examples were provided in LGG patients where activations adjacent to the tumor demonstrated good reliability results via the Jaccard index and Dice coefficient. Although reliability might be good in some patients, the variability across subjects remains high and there is a need for practical thresholds to help determine the acceptable measure of reliability in a real clinical situation.
Functional MRI maps may be a valuable adjunct to determine the craniotomy extent, direct targets for DCS, and perform subcortical mapping. In a worst-case scenario, the consequences of poor reliability may be that an fMRI finding causes the surgeon to expose more of the brain than necessary, or that the surgeon expends a few minutes of additional surgical time to perform subcortical mapping. Here, fMRI serves as a guide for planning; DCS remains the gold standard used to direct the surgical process in real time [52-56].
In addition to characterizing fMRI reliability, it is also very important to carefully evaluate how well fMRI and DCS results agree. Overall, the neurosurgical opinion was that fMRI and DCS were in good agreement for each of the patients studied here, and that fMRI made valuable contributions to surgical decisions as a consequence. A more detailed and quantitative analysis of these sentiments is beyond the scope of the present work. Similar to fMRI, DCS remains an imperfect technique and therefore there is a broad need to understand and quantify the error introduced by various sources of technical and biological variability that may influence concordance between DCS and fMRI results. Our laboratory intends to report on these issues in the very near future.
Results from this study suggest that preoperative fMRI is a suitable clinical tool for patients diagnosed with LGG, whereas reliability decreases somewhat for patients with HGG. Further study of fMRI reliability in HGG patients will be useful. The choice of behavioral task, reproducibility metric, and level of analysis all influence fMRI reliability and whether differences can be observed between patients and healthy controls. Although the demands of fMRI reliability may be lessened when the clinical consequences of false information in fMRI maps are less severe to the patient, quantitative thresholds for reliability in LGG and HGG patients are of value and obtainable. Toward this goal, the test-retest data presented here must be augmented by additional patient recruitment, and further validated by intraoperative brain mapping data. This is the subject of on-going work in our laboratory, in an attempt to establish a flexible threshold for reliability that will ease the use of fMRI in practice, enabling it to be applied confidently in the clinical decision-making process.