The Reporting of Observational Clinical Functional Magnetic Resonance Imaging Studies: A Systematic Review

Introduction: Complete reporting assists readers in confirming the methodological rigor and validity of findings and allows replication. The reporting quality of observational functional magnetic resonance imaging (fMRI) studies involving clinical participants is unclear.
Objectives: We sought to determine the quality of reporting in observational fMRI studies involving clinical participants.
Methods: We searched OVID MEDLINE for fMRI studies published in six leading journals between January 2010 and December 2011. Three independent reviewers abstracted data from articles using an 83-item checklist adapted from the guidelines proposed by Poldrack et al. (Neuroimage 2008; 40: 409–14). We calculated the percentage of articles reporting each item of the checklist and the percentage of reported items per article.
Results: A random sample of 100 eligible articles was included in the study. Thirty-one items were reported by fewer than 50% of the articles and 13 items were reported by fewer than 20% of the articles. The median percentage of reported items per article was 51% (ranging from 30% to 78%). Although most articles reported statistical methods for within-subject modeling (92%) and for between-subject group modeling (97%), none of the articles reported observed effect sizes for any negative finding (0%). Few articles reported justifications for fixed-effect inferences used for group modeling (3%) and temporal autocorrelations used to account for within-subject variances and correlations (18%). Other under-reported areas included whether and how the task design was optimized for efficiency (22%) and distributions of inter-trial intervals (23%).
Conclusions: This study indicates that substantial improvement in the reporting of observational clinical fMRI studies is required. Poldrack et al.'s guidelines provide a means of improving overall reporting quality. Nonetheless, these guidelines are lengthy and may be at odds with strict word limits for publication; creating a shortened version of Poldrack et al.'s checklist that contains the most relevant items may be useful in this regard.


Introduction
In the past decade, the use of functional MRI (fMRI) in cognitive neuroscience has increased substantially [1,2]. Given that fMRI is increasingly applied to the study of clinical disorders (e.g., [3][4][5][6][7][8]), and considering the vulnerability of clinical participants, there is an ethical imperative for scientists to apply rigorous methodology and to provide adequate reporting. Rigorous methodology is required in order to uphold the promises typically made to participants during the consent process, namely that the study will help investigators to understand their conditions. Complete reporting with sufficient detail permits readers to verify the methodological rigor of a study [9], consider the validity of findings [10][11][12][13][14], and extend and replicate the findings [9][10][11][12][13][15][16][17]. Yet recent evidence indicates that the fMRI literature as a whole often omits key details from methods sections, such as sample size calculations, whether temporal autocorrelations were modeled, descriptions of slice-timing and motion correction, and slice order and coverage of functional brain images [18], and omits related parameter estimates (i.e., effect size and variance components) from results sections [19].
Standard guidelines have been developed to aid authors in reporting their research, such as the Consolidated Standards for Reporting Trials (CONSORT) [10] and the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) initiative [9]. Recently, Poldrack and his colleagues proposed guidelines specifically for reporting fMRI studies [14]. Although many authors have suggested endorsing the guidelines proposed by Poldrack et al. to improve the quality, transparency and consistency of results in fMRI reports [2,18,20,21], few systematic reviews have been conducted to appraise the quality of reporting based on these guidelines. Although a study by Carp [18] recently examined adherence to Poldrack et al.'s guidelines in randomly selected fMRI studies published since 2007, it included few studies involving clinical populations. Thus, the reporting quality in clinical fMRI studies remains unclear. Given the unique challenges (e.g., technical, interpretive, and methodological) that confront clinical fMRI studies, reporting details on design, subject characteristics, analyses and interpretation is recommended to enhance the reproducibility of results in this subset of fMRI studies. Therefore, we expect that reporting in clinical fMRI studies is different from that of the overall fMRI literature.
Moreover, in our experience and according to anecdotal evidence, the majority of fMRI studies are observational (i.e., they are not designed to randomize participants to test the efficacy and safety of a therapeutic intervention), and such studies are less scrutinized than randomized clinical trials with experimental interventions; for example, randomized trials have to be registered with clinicaltrials.gov. Therefore, we aimed to systematically evaluate the quality of reporting in observational fMRI studies involving clinical human participants (i.e., individuals who either have a disease or are at risk of developing a disease) using a checklist adapted from the guidelines proposed by Poldrack et al. In this study, we set out to address the following two questions: (1) what percentage of articles reported each item of the fMRI-specific guideline, and (2) what percentage of items was reported per article?

Search Strategy and Eligible Journals
We searched OVID MEDLINE in January 2012 using keyword search terms (e.g., functional magnetic resonance imaging) combined with the acronym (i.e., fMRI) for articles published in 2010 and 2011, in the English language, and involving human participants. Compared with journals in general, top journals are cited more frequently (e.g., they have higher impact factors (IF)) and are more scrutinized prior to publication (e.g., they have lower manuscript acceptance rates). Furthermore, studies have indicated that high IF and low manuscript acceptance rates of journals are associated with higher methodological rigor of the articles they publish [22][23][24][25][26]. We therefore constrained our selection to six leading journals. From the Journal Citation Report 2010, we selected four journals with a high IF in the category "Neurosciences", namely Neuron (IF 14.9), Nature Neuroscience (IF 14.2), Brain (IF 9.2), and Journal of Neuroscience (IF 7.3); one journal with the highest IF in the category "Neuroimaging" (NeuroImage, IF 5.94); and one journal that contributes a great number of fMRI articles [18] and has a high IF (Proceedings of the National Academy of Sciences of the United States of America, IF 9.8). More details on the search strategy can be found in Table S1. Duplicate articles were removed.

Eligibility Criteria for Studies and Study Selection
We included articles that were peer-reviewed, full reports of observational fMRI studies involving human clinical participants and that used a block, event-related, or mixed design for the fMRI paradigm. We excluded articles that were published only in abstract form and those that were editorials, letters, comments or reviews. Genetic and resting-state observational fMRI studies, fMRI studies other than observational studies (e.g., randomized clinical trials), and studies of connectivity were also excluded. As studies of connectivity aim to identify and quantify the correlations between brain regions [27], these studies have a different reporting focus vis-à-vis fMRI data analyses. For example, they report Psycho-Physiological Interaction analyses to estimate effective connectivity or functional coupling rather than data preprocessing steps, which have been demonstrated to have significant impacts on the quality of data and the reliability and interpretation of fMRI results [28] [29]. However, the reporting essentials for effective connectivity studies are not reflected in the currently available guidelines, including those proposed by Poldrack et al. Because our study aimed to evaluate the quality of reporting based on Poldrack et al.'s guidelines, we excluded this type of study to ensure consistency.
In this study, we decided to include a target sample size of 100 articles that had to meet the predefined inclusion and exclusion criteria. We therefore randomly selected and assessed the eligibility of articles among the unique citations, which were identified from the initial search strategy and after the duplicates were removed, until 100 articles were reached.
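The quota-based selection described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' actual procedure: the function name `screen_until_quota`, the random seed, and the toy eligibility rule are all assumptions introduced here, whereas in the study eligibility was judged by reviewers against the stated criteria.

```python
import random

def screen_until_quota(citations, is_eligible, quota=100, seed=0):
    """Screen citations in random order until `quota` eligible ones are found.

    Returns the included citations and the number screened to reach the quota.
    """
    rng = random.Random(seed)      # fixed seed so the order is reproducible
    order = list(citations)
    rng.shuffle(order)             # random screening order
    included, screened = [], 0
    for citation in order:
        if len(included) >= quota:
            break
        screened += 1
        if is_eligible(citation):
            included.append(citation)
    return included, screened

# Toy data: 1120 hypothetical citation IDs; the rule below is a stand-in
# for the reviewers' eligibility assessment, not a real criterion.
toy_citations = list(range(1, 1121))
toy_rule = lambda cid: cid % 10 == 0
included, screened = screen_until_quota(toy_citations, toy_rule, quota=100, seed=42)
```

Recording `screened` alongside `included` is what allows a flow diagram to report how many articles were assessed to reach the target.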

Data Extraction
We created an electronic data extraction form containing 83 items adapted from the guidelines proposed by Poldrack et al. [14] to assess the reporting of study articles, which we piloted using a random selection of four studies reviewed by three independent reviewers (QG, MP, and WT). Through the pilot testing, we modified the abstraction form by deleting three items (Unwarping of B0 distortions; Describe any data quality control measures; any additional operations, e.g., masking out parts of the image) from Poldrack et al.'s original checklist. We excluded these three items because we found that assessing them required too much subjectivity, meaning that biases among reviewers' judgments were very high. Excluding them meant we were better able to achieve a common perception and interpretation of definitions among the items we did evaluate, and hence increased between-reviewer agreement. The observed proportion of agreement on judgments between any two reviewers was 0.78 or higher. Final abstraction forms were devised prior to use (see Table S2). The data were extracted from each article and any online supplements. Items were answered with "Reported", "Not Reported", or "Not Applicable".
Three authors (QG, MP, and WT), blinded to each other's assessments, abstracted the reporting of each article independently. Instead of all three raters reviewing all articles, we decided to have two reviewers rate each article. To determine the number of articles needed to be evaluated by the second reviewer to ensure a desired level of reliability, we performed a sample size calculation [30,31]. The sample size of 50 was chosen so as to estimate the kappa for the inter-rater agreement within a margin of error of 0.3 with 95% confidence, assuming that the true kappa would be 0.6 or more and that the proportion of agreements by chance was 0.7 or less (see File S2). The first reviewer (QG) evaluated all 100 articles, of which 50 articles were randomly selected for the second reviewer (MP), and the other 50 articles were given to the third reviewer (WT) for abstraction; each article was therefore rated by two reviewers.
After completion of the independent assessments, any disagreements between a pair of reviewers (i.e., QG and MP; QG and WT) were resolved by discussion between the two reviewers and, if necessary, by involving the third reviewer or an expert (GH) until consensus was reached. The raw data collected from the 100 studies are available in the online Supporting Information (see File S4).

Statistical Analysis
We calculated the percentage of studies that reported each evaluation item and a 95% confidence interval (CI) using an exact binomial method [32]. We then estimated the median, minimum and maximum percentages of reported items for each article.
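As an illustration of an exact binomial interval, the sketch below computes a Clopper-Pearson interval using only the Python standard library. It is an assumption that this is the variant of the exact method meant by [32]; the authors used SAS, and the function names here are introduced purely for illustration.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def _solve_increasing(g, target, iterations=100):
    """Bisection for g(p) = target, where g is increasing in p on [0, 1]."""
    lo, hi = 0.0, 1.0
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if g(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def clopper_pearson(x, n, alpha=0.05):
    """Exact (Clopper-Pearson) two-sided CI for a proportion x/n."""
    # Lower bound: P(X >= x | p) = alpha/2 (defined as 0 when x == 0).
    lower = 0.0 if x == 0 else _solve_increasing(
        lambda p: 1 - binom_cdf(x - 1, n, p), alpha / 2)
    # Upper bound: P(X <= x | p) = alpha/2, i.e. P(X > x | p) = 1 - alpha/2
    # (defined as 1 when x == n).
    upper = 1.0 if x == n else _solve_increasing(
        lambda p: 1 - binom_cdf(x, n, p), 1 - alpha / 2)
    return lower, upper

# An item reported by 18 of 100 articles, and an item reported by none.
ci_18 = clopper_pearson(18, 100)
ci_0 = clopper_pearson(0, 100)
```

Unlike the normal-approximation interval, the exact interval remains informative for an item reported by 0 of 100 articles: its lower bound is 0 and its upper bound is 1 - 0.025^(1/100), or roughly 3.6%.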
Inter-rater agreement was assessed using the prevalence-adjusted bias-adjusted kappa (PABAK) coefficient [33]. When the prevalence of a rating is very high or low, the value of kappa may indicate a low level of agreement even though the observed percentage of agreement is high, a phenomenon known as the kappa paradox [34]. Hence, we used PABAK [33] to address this paradox and to better interpret the inter-rater agreement. Kappa coefficient results were interpreted based on the scale proposed by Byrt [35].
We performed a sample size calculation to determine the number of articles to be included in the extraction and analysis. A sample size of 100 was chosen so that with 95% confidence, we would be able to quantify the true percentage of articles that reported each item to within 10% (see File S1). All statistical analyses were conducted using SAS 9.2 software (Cary, NC).
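The kappa paradox mentioned in this section can be illustrated with a small numerical sketch. The code assumes a two-category rating ("Reported" versus "Not Reported"), for which PABAK reduces to 2·Po − 1, where Po is the observed proportion of agreement; how "Not Applicable" ratings were handled is not stated here, and the counts below are invented for illustration only.

```python
def observed_agreement(a, b, c, d):
    """Proportion of items on which two raters agree.

    2x2 table counts: a = both 'Reported', d = both 'Not Reported',
    b = rater 1 'Reported' / rater 2 'Not Reported', c = the reverse.
    """
    return (a + d) / (a + b + c + d)

def cohen_kappa(a, b, c, d):
    """Cohen's kappa: agreement corrected for chance using the marginals."""
    n = a + b + c + d
    po = observed_agreement(a, b, c, d)
    pe = ((a + b) * (a + c) + (c + d) * (b + d)) / n**2
    return (po - pe) / (1 - pe)

def pabak(a, b, c, d):
    """Prevalence-adjusted bias-adjusted kappa for two categories."""
    return 2 * observed_agreement(a, b, c, d) - 1

# Hypothetical item that almost every article reports (high prevalence):
# the raters agree on 92 of 100 articles.
table = (90, 4, 4, 2)
po = observed_agreement(*table)   # 0.92
kappa = cohen_kappa(*table)       # about 0.29 despite 92% agreement
adjusted = pabak(*table)          # 0.84
```

With these invented counts the raters agree on 92% of articles, yet Cohen's kappa is only about 0.29 because chance agreement is very high when nearly every article reports the item; PABAK, at 0.84, tracks the observed agreement more closely.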

Study Selection
After removing the duplicates, the initial search strategy identified 1120 unique articles. We screened the articles in a random order for eligibility until the quota of 100 eligible articles was reached. To reach this target, we assessed 1100 articles (see Figure S1 for a flow diagram). The list of the 100 eligible articles is included in File S3.

Study Characteristics
Among the 100 included articles published in six leading journals in 2010 and 2011, about 60% came from the journal NeuroImage. The majority of study designs were cross-sectional (94%). The funding source was reported in 78% of the articles, and came primarily from two or more different sources (77%) rather than from industry alone (1%). Fifty-three percent of included articles were published in 2010 and the remaining 47% in 2011. The median total number of subjects was 34 (first quartile (Q1) = 26, third quartile (Q3) = 48), ranging from 8 to 126, and most studies (79%) had a sample size of no more than 50 (see Table 1).

Items Commonly Reported
Of the 83 items, 22 items were reported by 85% or more of the 100 included articles. Specifically, all of the studies reported sample sizes. Most studies further described the manufacturer, field strength and model name of the scanner and the pulse sequence type (98%), statistical methods used for group modeling (97%), subjects' characteristics such as age and gender (94%), statistical methods used for within-subject modeling (92%), eligibility criteria on selecting subjects (91%), and whether statistical inferences were corrected for multiple comparisons (90%). Similarly, 86% of the articles reported how regions of interest (ROIs) were defined. Of 86 articles that reported analyses not conducted on the whole brain, 80 (93%) explained how regions were determined (see Tables 2-10).

Items Not Commonly Reported
Among the 83 items, a total of 31 items were reported by no more than 50% of the included articles; 13 items were reported by fewer than 20% of the articles. Critically, and in sharp contrast to Poldrack's guidelines, none of the studies reported observed effect sizes if they failed to reject the null hypothesis. Only one article (3%, 1/31) provided justifications for using fixed-effect inferences for group modeling. Other items that were insufficiently reported included slice-timing and motion corrections (12/100), temporal autocorrelation modeling used to account for within-subject variances and correlations (18/100), whether and how the task design was optimized for efficiency if it was an event-related design (22%, 8/35), distributions of inter-stimulus intervals (ISI), whether ISI was variable (23%, 9/39), statistical methods for repeated measurements (24/100), and smoothness and resolution element (RESEL) count if family-wise error (FWE) was found by random

Reported Items per Article
The median (minimum, maximum) percentage of reported items per article was 51% (30%, 78%).
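A minimal sketch of how a per-article summary of this kind can be computed is given below. It assumes that "Not Applicable" items are excluded from each article's denominator, which the text does not state explicitly; the article names and ratings are invented for illustration.

```python
from statistics import median

def percent_reported(ratings):
    """Percentage of applicable items rated 'Reported' for one article.

    Assumption: 'Not Applicable' items are dropped from the denominator.
    """
    applicable = [r for r in ratings if r != "Not Applicable"]
    reported = sum(r == "Reported" for r in applicable)
    return 100.0 * reported / len(applicable)

# Hypothetical ratings for three articles on a four-item checklist.
articles = {
    "article_A": ["Reported", "Reported", "Not Reported", "Not Applicable"],
    "article_B": ["Reported", "Not Reported", "Not Reported", "Not Reported"],
    "article_C": ["Reported", "Reported", "Reported", "Not Reported"],
}

percentages = {name: percent_reported(r) for name, r in articles.items()}
summary = (
    median(percentages.values()),  # median percentage across articles
    min(percentages.values()),     # minimum
    max(percentages.values()),     # maximum
)
```

If "Not Applicable" items were instead kept in the denominator, the percentages for articles with many inapplicable items would be lower, so the choice of denominator should accompany any such summary.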
The inter-rater agreement was very good (PABAK > 0.8) for 31 items, good (0.6 < PABAK ≤ 0.8) for 31 items, fair (0.4 < PABAK ≤ 0.6) for 20 items, and slight (PABAK = 0.34) for one item.
Table 2. Percentage of articles that reported each item, inter-rater agreement on the item, and whether the item should be included in a future shortened checklist, relating to "Experimental Design".

Specifics on Reported Items
Manuscript quality hinges not only on whether an item was reported, but the specifics of the method that was used. Here we describe manuscripts' methodological choices regarding software, spatial smoothing, temporal filtering and thresholding for statistical significance.
Spatial smoothing reduces noise and hence increases the signal-to-noise ratio while reducing the resolution of the data [36,37]. Therefore, it is important to specify the extent to which spatial smoothing has been applied. Specifically, the size of the smoothing kernel determines how much the data are smoothed, which affects the within-subject variability of estimates [38]. Reporting smoothing parameters helps readers to determine the balance between improving the sensitivity and maintaining the resolution of the functional image. As can be seen in Table 12, the majority of studies reported using spatial smoothing (88/100), with 95.5% (84/88) specifying a type of kernel. The widths of the smoothing kernel ranged from 3 mm to 12 mm with a median width of 8 mm. The most frequent kernel width was 8 mm (42%, 37/88). Other common widths included 6 mm (29.5%, 26/88), 9 mm (8%, 7/88), and 10 mm (5.7%, 5/88). The widths used by fewer than 5 studies were 5 mm, 12 mm, 4 mm, 4.2 mm and 3 mm. None of the studies justified their choices of smoothing kernel.
Table 5. Percentage of articles that reported each item, inter-rater agreement on the item, and whether the item should be included in a future shortened checklist, relating to "Data Preprocessing".
As with spatial smoothing, temporal filtering aims to increase the signal-to-noise ratio. Since most of the noise in fMRI is low frequency, high-pass filtering improves the ratio more than low-pass filtering, and is almost as good as band-pass filtering [36,39]. Specifying the filter cut-off parameter helps readers understand the temporal filtering process. Most studies (61/100) reported whether temporal filtering was used. Of the 60 studies that reported actual use of temporal filtering, most (95%, 57/60) used high-pass filtering. Only a few studies used low-pass (1.7%, 1/60) or band-pass (3.3%, 2/60) temporal filtering. Forty-eight studies reported the filter cut-off, among which the high-pass filtering cut-off ranged from 2.8 s to 318 s with a median and mode value of 128 s, compared to low-pass filtering with a single cut-off value of 6.7 s.
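To make the reported parameters more concrete, the sketch below converts a kernel width and a filter cut-off into equivalent quantities. It assumes that the reported kernel widths are full widths at half maximum (FWHM) of a Gaussian kernel and that cut-offs are given as periods in seconds; both are common conventions but are assumptions here, as are the function names.

```python
import math

def fwhm_to_sigma(fwhm_mm):
    """Standard deviation (mm) of a Gaussian kernel with the given FWHM (mm)."""
    return fwhm_mm / (2.0 * math.sqrt(2.0 * math.log(2.0)))

def cutoff_period_to_hz(period_s):
    """Cut-off frequency (Hz) corresponding to a cut-off period (s)."""
    return 1.0 / period_s

sigma_8mm = fwhm_to_sigma(8.0)            # roughly 3.4 mm
highpass_hz = cutoff_period_to_hz(128.0)  # 1/128 Hz, i.e. about 0.0078 Hz
```

Under these assumptions, the most common choices in the sample (an 8 mm kernel and a 128 s high-pass cut-off) correspond to a Gaussian standard deviation of roughly 3.4 mm and to removal of signal fluctuations slower than about 0.008 Hz.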

Discussion
This study identified some reporting practices in observational clinical fMRI studies that met expectations and other areas where reporting was less than adequate. In particular, only one quarter of the items from the reporting guidelines recommended by Poldrack et al. (2008) were reported adequately. Indeed, only one half of the recommended items were routinely reported in each article. Moreover, one third of the items were reported by fewer than half of the articles. Less adequately reported items were distributed across the following categories: experimental design, inter-subject registration and smoothing, data preprocessing, statistical modeling, and statistical inference on ROI analysis. These results indicate that substantial room for improvement exists in the reporting of observational clinical fMRI studies.
Specifically, improvement in reporting important details is recommended in areas such as observed effect sizes in the results section when study results are negative, justifications for fixed-effect inferences used for group modeling, and the temporal autocorrelation matrix used to account for within-subject variance and correlations. As effect sizes observed from statistically significant regions overestimate true effect sizes [46,47], including values from non-significant regions (e.g., those that are identified from similar previous studies) would help provide a more realistic range of effect size estimates and reduce the risk of bias arising from reporting on active regions only. Given the existence of temporal autocorrelation in fMRI time series, incorporating an autocorrelation structure increases the accuracy of variance estimates. Reporting temporal autocorrelation estimates enables proper power analyses based on the method proposed by Mumford and Nichols [48]. Whereas findings from fixed-effect inferences reflect only the particular cohort of subjects studied, random-effect inferences generalize findings to the population at large from which the study sample was drawn [49]. The current recommendation is to use random-effect inferences for between-subject group modeling and fixed-effect inferences for single-subject modeling. Providing justifications for using fixed effects for group modeling would enhance understanding and interpretation.
This study differed substantially from the one existing review of fMRI reporting [18] in the number of items, definitions of items, study population and study design. For example, although Carp's study used a single reviewer, we conducted a systematic review using duplicate abstraction, measuring inter-rater agreement and resolving disagreements through consensus.
Moreover, our study focused on observational studies with clinical participants; in contrast, Carp evaluated fMRI studies in general, which may not capture many studies involving clinical participants. There are also some notable differences in results between the two studies. For example, in the current study around one-third reported the distribution of inter-trial intervals, compared to one-twelfth in Carp's study. About one half reported the number of subjects rejected from analyses with reasons for rejection in our study, which is one quarter greater than that of Carp's study. Similarly, less than one-third of the articles in our study reported the following four methodological items but still showed better reporting than those in Carp's study: how potentially confounding variables were matched across groups for group comparisons, whether autocorrelations were modeled, whether equal variance was assumed across groups for multiple group designs, and the number of RESELs and image smoothness for studies using FWE correction. Unfortunately, we are unable to identify the specific factors associated with these differences between the current study and Carp's study; the factors might be the type of clinical participants involved in the study, the impact factors of the journals, or the exclusion of studies of connectivity. Future research may be helpful in this regard by comparing reporting quality between studies with and without clinical participants, between high and low impact factor journals, and between samples that include and exclude studies of connectivity. Although different, both studies did detect some commonality in important items that are frequently absent from published reports, indicating that incomplete reporting challenges the evaluation, understanding and interpretation of study findings, and limits the use of results for synthesis, e.g., for meta-analyses.
Table 7. Percentage of articles that reported each item, inter-rater agreement on the item, and whether the item should be included in a future shortened checklist, relating to "Statistical Modeling".
Complete reporting becomes particularly important for studies involving clinical populations, where ensuring methodological rigor is necessary to uphold investigators' promises to their participants that their participation will help society to better understand the nature of their condition. Our findings point towards the need for substantial improvement in this regard. In several other fields of health research, it has been demonstrated that journals adopting standard reporting guidelines (e.g., the CONSORT statement) have better quality of reporting than those that do not [50][51][52]; thus, the use of guidelines in the fMRI literature may help improve the quality of reporting as well.
Table 8. Percentage of articles that reported each item, inter-rater agreement on the item, and whether the item should be included in a future shortened checklist, relating to "Statistical Inference on Statistic Image (thresholding)".
Implementation of the guidelines for reporting fMRI studies proposed by Poldrack and his colleagues (2008) does face some challenges. Firstly, authors often have strict word limits and the current guidelines are lengthy, making it important to identify which items are most essential. Secondly, some items are relevant to the quality of reporting observational clinical studies but are not covered in Poldrack et al.'s guidelines (for example, sample size calculations in the methods section, characteristics of clinical participants, and participation data flow diagrams to better understand potential bias due to non-participation [53]). Since reporting guidelines are evolving documents [54], we suggest dividing the list of items that should be reported into those that are essential, which should be placed in the manuscript itself, and those that are helpful to report, which can be included as online supplements. Some methodological parameters have more impact than others [28,55] and hence should be considered essential items. Some journals (e.g., Nature) have recently removed space limitations on methods sections; however, since this is not a widespread practice, it would still be useful to distinguish between essential and helpful items. In addition to text-based reporting, some items can be reported in the form of source code (e.g., for data collection and statistical analyses) [56] and machine-readable information compatible with different imaging analysis packages [57]. Our recommendation for creating a list of essential items is not intended to supplant the existing guidelines but is rather a suggestion to consider during the next update of the guidelines.
We hope that our suggestions will lead to more discussion and future consensus regarding what is in fact essential to report in the manuscript itself for observational clinical fMRI studies. For example, the consensus can be reached through a consensus meeting involving a variety of experts in this area, in a similar way that the standard CONSORT guideline was created. Involving journal editors in the process and having their endorsement of the guidelines would encourage researchers to comply with the new standards.
The present study has several limitations. First, the findings reflect the quality of reporting of observational clinical fMRI studies published in six top neuroscience journals between 2010 and 2011, and may not apply to journals in general. These results most likely overestimate true rates of reporting. Second, several items on the checklist used for evaluation in this systematic review involve subjectivity. However, using duplicate review and consensus for any disagreements helped to reduce differences in interpretation between reviewers.

Conclusion
This study has highlighted under-reported areas in observational fMRI studies involving clinical participants and points towards a need for improvement. Adherence to the guidelines for fMRI studies proposed by Poldrack and his colleagues could help improve the quality of reporting. Considering that the guidelines are evolving and need continual updates, we suggest constructing a checklist that captures the essential items to report in order to accommodate practical needs, and enforcing the reporting guidelines through the means proposed above, such as endorsement by journal editors.

Supporting Information