MRIQC: Predicting Quality in Manual MRI Assessment Protocols Using No-Reference Image Quality Measures.

Quality control of MR images is essential for excluding problematic acquisitions and avoiding bias in subsequent image processing and analysis. However, the visual inspection of individual images is time-consuming and limited by both intra- and inter-rater variance. The difficulty of visual inspection scales with study size and with the heterogeneity of multi-site data. Here, we describe a tool for the automated assessment of T1-weighted MR images of the brain: MRIQC. MRIQC calculates a set of quality measures from each image and uses them as features in a binary (include/exclude) classifier. The classifier was designed to ensure generalization to new samples acquired in different centers and using different scanning parameters from our training dataset. To achieve that goal, the classifier was trained on the Autism Brain Imaging Data Exchange (ABIDE) dataset (N=1102), acquired at 17 locations with heterogeneous scanning parameters. We selected random forests from a set of models and pre-processing options using nested cross-validation on the ABIDE dataset. We report a performance of ∼89% accuracy for the best model evaluated with nested cross-validation. The best performing classifier was then evaluated on a held-out (unseen) dataset, unrelated to ABIDE and labeled by a different expert, yielding ∼73% accuracy. The MRIQC software package and the trained classifier are released as an open source project, so that individual researchers and large consortia can readily curate their data regardless of the size of their databases. Robust QC is crucial for the early identification of structured imaging artifacts in ongoing acquisition efforts, and helps detect individual substandard images that may bias downstream analyses.

Image analysis can lead to erroneous conclusions when the original data are of low quality. MRI images are unlikely to be artifact-free, and assessing the quality of images produced by MR scanning systems has long been a challenging issue [1]. Traditionally, all images in the sample under analysis are visually inspected by one or more experts, and those showing an insufficient level of quality are excluded (some examples are given in Fig 1A). Visual assessment is time consuming and prone to variability due to inter-rater differences (see Fig 1B), as well as intra-rater differences arising from factors

such as practice or fatigue. An additional concern is that some artifacts evade human detection entirely [2], for example those due to an improper choice of acquisition parameters. Even though magnetic resonance (MR) systems undergo periodic inspections and service, some machine-related artifacts persist unnoticed due to lenient vendor quality checks and drift from the system calibration settings. In our experience, automated Quality Control (QC) protocols help detect these issues early in the processing stream. The current trend of neuroimaging towards acquiring very large samples across multiple scanning sites [3-5] introduces additional concerns. These large-scale imaging efforts render the visual inspection of every image infeasible and add the possibility of between-site variability. Therefore, there is a need for fully-automated, robust, and minimally biased QC protocols. These properties are difficult to achieve for three reasons: 1) the absence of a gold standard impedes the definition of sensitive quality metrics; 2) human experts introduce biases with their visual assessment; and 3) cross-study and inter-site acquisition differences also introduce uncharacterized variability. Machine-specific artifacts have traditionally been tracked down quantitatively using phantoms [6]. However, many forms of image degradation are participant-specific or arise from practical settings (see Fig 1), […] preprocessed-connectomes project (PCP), and the UK Biobank [10].

Figure 1. Visual assessment of MR scans and the quality control of the ABIDE dataset. A. Two images with prominent artifacts from the ABIDE database are presented. An example scan (top) is shown with severe motion artifacts. The arrows point to signal spillover through the phase-encoding axis (right-to-left, RL) due to eye movements (green) and vessel pulsations (red). A second example scan (bottom) shows severe coil artifacts. This figure caption is extended in Figure SI1. B. Infographic of the visual assessment of the T1-weighted (T1w) MR images of the ABIDE dataset performed by three different experts, split by scanning site. Each scanning site has one stripe with three rows of colored circles, except sites with large samples, where the ratings are wrapped into two stripes. Each circle represents the rating of one image by one of the experts, with the color encoding the quality label (green is "accept", yellow is "doubtful", red is "exclude", and white denotes missing ratings). Each row is a different expert; thus, the inter-rater consistency can be checked column-wise. Perfect agreement occurs when the three circles of a column show the same color (for example, the first image of the "CALTECH" scanning site). Some images yielded no agreement across raters (e.g. the second participant in the "PITT" sample). Next to each site label, the spread of ratings is reported: rows are the three raters, and columns the ratings, the first (green) for "accept", the second (gray) for "doubtful", and the third (red) for "exclude". The aggregated rates (all sites of ABIDE) are presented in the top-right box.

The hypothesis behind this study is that we can predict the quality ratings of an expert on previously unseen datasets (with dataset-specific scanning parameters) in a supervised learning approach that uses features derived from a broad selection of IQMs. (A measure is called "no-reference" when no ground-truth of the same image without degradation is available.) To demonstrate that the trained classifier correctly predicts the quality of new data, we used two unrelated databases to configure the training and held-out (test) datasets [11]. We first select the best performing model on the training dataset using a grid strategy in a nested cross-validation setup. We use the ABIDE database [4] for the training set because its data were acquired at 17 different scanning sites with varying acquisition parameters (Table 1). These data show great variability in terms of imaging settings and parameters, which represents the heterogeneity of real-world data. The best performing classifier is then trained on the full ABIDE dataset, and tested on the held-out dataset [12] to assess whether the performance on unseen data falls within the range predicted by the nested cross-validation.

The contributions of this work are summarized as follows. […] A summary of the acquisition parameters is presented in Table 1, and a full-detail table in Table SI1.

Labeling protocol. The labeling process is aided by surface reconstruction, using the so-called white (WM-GM interface) and pial (delineating the outer interface of the cortex) surfaces as visual cues for the rater. We utilize FreeSurfer [13] to reconstruct the surfaces. FreeSurfer has recently been proposed as a visual aid tool to assess T1w images [14]. For run-time considerations, and to avoid circular evaluations of FreeSurfer, this tool is not used in the MRIQC workflow (see The MRIQC tool section).

The following protocol was used for the manual assessment of T1w images: 1) The […]

During the visualization, the rater assessed the overall quality of the image. The white and pial contours were used as evaluation surrogates, given that "exclude" images usually exhibit imperfections and inaccuracies on these surfaces. When the expert found general quality issues or the reconstructed surfaces revealed more specific artifacts, the "exclude" label was assigned and the rater noted a brief description, for example: "low signal-to-noise ratio (SNR)", "poor image contrast", "ringing artifacts", "head motion", etc. The images in ds030 were randomized before rating.

Table 1. Summary table of the train and test datasets. The ABIDE dataset is publicly available^a, and contains images acquired at 17 sites, with a diverse set of acquisition settings and parameters. This heterogeneity makes it a good candidate to train machine learning models that can generalize well to novel samples from other sites. We selected ds030 [12] from OpenfMRI^b as the held-out dataset to evaluate the performance on data unrelated to the training set. A table summarizing the heterogeneity of parameters within the ABIDE dataset and also ds030 is provided as supplemental material (Table SI1).

The MRIQC tool. MRIQC is an open-source project, developed under the following software engineering principles. 1) Modularity and integrability: MRIQC implements a nipype [15] workflow (see Fig 2) to integrate modular sub-workflows that rely upon third-party software toolboxes such as FSL [16], ANTs [17] and AFNI [18]. 2) Minimal preprocessing: the workflow should be as minimal as possible to estimate the IQMs on the original data or their minimally processed derivatives. 3) Interoperability and standards: MRIQC follows the brain imaging data structure (BIDS, [19]), and it adopts the BIDS-App [20] standard. An example of the ease of running MRIQC is presented in Listing SI1. 4) Reliability and robustness: the software undergoes frequent vetting sprints by testing its robustness against data variability (acquisition parameters, physiological differences, etc.) using images from the OpenfMRI resource. Reliability is checked and maintained with the use of a continuous integration service.

Figure 2. MRIQC's processing data flow. Images undergo an optimized processing pipeline to: 1) realign images in a conformed space using AFNI realign; 2) estimate the INU using N4ITK; 3) skull-strip using AFNI 3dSkullStrip; 4) segment brain tissue using FSL FAST; 5) compute an air/tissue mask using the magnitude of the gradient image [8]; 6) map an exclusion mask defined in MNI space into subject space using an affine registration scheme with ANTs; 7) compute an air mask excluding the region below the plane crossing the nasio-cerebellum axis; 8) compute artifactual regions [8]; 9) compute a surrounding air mask without artifacts; and 10) project all the computed masks and segmentations to the original (native) space of the image volume.

Extracting the Image Quality Metrics. The final steps of the MRIQC workflow compute the different IQMs, and a summary JSON file per subject is generated. The IQMs can be grouped into four broad categories (see Table 2).

CNR
The contrast-to-noise ratio [22] is an extension of the SNR calculation that evaluates how well separated the tissue distributions of GM and WM are. Higher values indicate better quality.

SNR
MRIQC includes the signal-to-noise ratio calculation proposed by Dietrich et al. [23], using the air background as the noise reference.
Additionally, for images that have undergone noise reduction processing, or for the more complex noise realizations of current parallel acquisitions, a simplified calculation using the within-tissue variance is also provided.
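A minimal numpy sketch of the SNR and CNR measures described above, assuming a flattened image array `img` and boolean tissue/air masks. It follows the textbook definitions (Dietrich's Rayleigh-corrected air noise; a CNR denominator pooling air, WM and GM variances), not MRIQC's exact implementation.

```python
import numpy as np

# Rayleigh correction: the std. dev. of magnitude-image background noise
# underestimates the true Gaussian noise sigma by a factor sqrt((4 - pi) / 2).
DIETRICH_FACTOR = np.sqrt(2.0 / (4.0 - np.pi))


def snr_dietrich(img, signal_mask, air_mask):
    """SNR using the air background as noise reference (Dietrich et al.)."""
    return img[signal_mask].mean() / (DIETRICH_FACTOR * img[air_mask].std())


def snr_within_tissue(img, tissue_mask):
    """Simplified SNR for denoised/parallel-imaging data: within-tissue variance."""
    vals = img[tissue_mask]
    return vals.mean() / vals.std()


def cnr(img, wm_mask, gm_mask, air_mask):
    """CNR: separation of the WM and GM intensity distributions."""
    noise = np.sqrt(img[air_mask].var() + img[wm_mask].var() + img[gm_mask].var())
    return np.abs(img[wm_mask].mean() - img[gm_mask].mean()) / noise
```

Higher values of both measures indicate better quality; the masks would come from the tissue segmentation and air-mask steps of the workflow.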

QI 2
The second quality index of [8] is a calculation of the goodness-of-fit of a χ² distribution on the air mask, once the artifactual intensities detected for computing the QI 1 index have been removed.

EFC
The entropy-focus criterion [24] uses the Shannon entropy of voxel intensities as an indication of ghosting and blurring induced by head motion. Lower values are better.
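The entropy-focus criterion can be sketched as follows; this is one common formulation of the Atkinson entropy criterion (entropy of intensities normalized by the image "brightness"), offered as an illustration rather than MRIQC's exact normalization.

```python
import numpy as np


def efc(img, eps=1e-16):
    """Entropy-focus criterion: Shannon entropy of voxel intensities
    normalized by the image brightness B = sqrt(sum(x^2)). Ghosting and
    motion blur spread intensity over many voxels, raising the entropy."""
    x = np.abs(np.asarray(img, dtype=float)).ravel()
    b = np.sqrt(np.sum(x ** 2))
    p = x / (b + eps)
    return float(-np.sum(p * np.log(p + eps)))
```

A sharply focused image (intensity concentrated in few voxels) yields a lower EFC than a blurred one of the same energy.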

FBER
The foreground-background energy ratio is calculated as the mean energy of image values within the head relative to the mean energy of image values in the air mask. Consequently, higher values are better.
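A short sketch of the FBER definition above, taking "energy" as the squared intensity; the mask names are assumptions for illustration.

```python
import numpy as np


def fber(img, head_mask, air_mask):
    """Foreground-background energy ratio: mean energy (squared intensity)
    inside the head over the mean energy in the surrounding air."""
    fg = np.mean(np.square(img[head_mask].astype(float)))
    bg = np.mean(np.square(img[air_mask].astype(float)))
    return float(fg / bg)
```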

INU
MRIQC measures the location and spread of the bias field estimated by the intensity non-uniformity (INU) correction. Values with a small spread located around 1.0 are better.

QI 1
The first quality index of [8] measures the amount of artifactual intensities in the air surrounding the head above the nasio-cerebellar axis. The smaller the QI 1, the better.

WM2MAX
The white-matter to maximum intensity ratio is the median intensity within the WM mask over the 95th percentile of the full intensity distribution, which captures the existence of long tails due to hyper-intensity of the carotid vessels and fat. Values should be around the interval [0.6, 0.8].
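The WM2MAX definition translates almost directly into code; a minimal sketch following the text, with the mask name assumed:

```python
import numpy as np


def wm2max(img, wm_mask):
    """Median WM intensity over the 95th percentile of the full intensity
    distribution. Values far outside [0.6, 0.8] suggest a long right tail,
    e.g. hyper-intense fat or carotid vessels."""
    return float(np.median(img[wm_mask]) / np.percentile(img, 95))
```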

Other measures FWHM
The full-width half-maximum is an estimation of the blurriness of the image using AFNI's 3dFWHMx. Smaller is better.

ICVs
Estimation of the intracranial volume fraction of each tissue, calculated from the FSL FAST segmentation. Normative values fall around 20%, 45% and 35% for cerebrospinal fluid (CSF), WM and GM, respectively.

rPVE
The residual partial volume effect feature is a tissue-wise sum of partial volumes that fall in the range [5%-95%] of the total volume of a pixel, computed on the partial volume maps generated by FSL FAST. Smaller residual partial volume effects (rPVEs) are better.

SSTATs
Several summary statistics (mean, standard deviation, 5th and 95th percentiles, and kurtosis) are computed within the following regions of interest: background, CSF, WM, and GM.

TPMs
Overlap of tissue probability maps estimated from the image and the corresponding maps from the ICBM nonlinear-asymmetric 2009c template [25].
After the extraction of IQMs from all the images in our sample, a group report is generated (Fig 3). The group report shows a scatter plot for each of the IQMs, making it particularly easy to identify the cases that are outliers for each metric. The plots are interactive, such that clicking on any particular sample opens the corresponding individual report of that case. Examples of group and individual reports for the ABIDE dataset are available online at mriqc.org.

Figure 3. MRIQC generates one individual report per subject in the input folder and one group report including all subjects. To visually assess MRI samples, the first step (1) is opening the group report. This report shows boxplots and strip-plots for each of the IQMs. Looking at the distributions, it is possible to find images of potentially low quality, as they are generally reflected as outliers in one or more strip-plots. For instance, in (2), hovering over a suspicious sample within the coefficient of joint variation (CJV) plot, the subject identifier is presented ("sub-51296"). Clicking on that sample will open the individual report for that specific subject (3). This particular example of an individual report is available online at https://web.stanford.edu/group/poldracklab/mriqc/reports/sub-51296_T1w.html.

Supervised classification
Our supervised learning approach to predicting the binary ratings of a human expert is structured in two steps. First, we perform a preliminary model selection and evaluation using repeated (×1000) and nested cross-validation on the ABIDE dataset (see Step 1: Tested models and selection). Then, a second optimization in a refined grid of hyper-parameters for the previously selected model is performed with a single-loop cross-validation on the ABIDE dataset. The best performing model of this second cross-validation step is evaluated using the held-out dataset (see Step 2: Validation on the held-out dataset). The cross-validation workflows are built upon scikit-learn [26] and run using the Stampede supercomputer at the Texas Advanced Computing Center, University of Texas, TX, USA.
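The nested scheme above (model selection in an inner loop, unbiased evaluation in an outer loop) can be sketched with scikit-learn. The data and the reduced hyper-parameter grid here are placeholders standing in for the 56 IQM features and the study's actual grid.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

# Placeholder data standing in for the 56 IQM features of ~1100 ABIDE images.
X, y = make_classification(n_samples=300, n_features=56, random_state=0)

# Inner loop: model selection over a small, illustrative hyper-parameter grid.
inner = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [10, 50], "max_depth": [5, 20]},
    scoring="roc_auc",
    cv=StratifiedKFold(n_splits=3),
)

# Outer loop: evaluates the whole selection procedure, not a single fit,
# so the reported score is not biased by the hyper-parameter search.
outer_scores = cross_val_score(inner, X, y, scoring="roc_auc",
                               cv=StratifiedKFold(n_splits=3))
print(outer_scores.mean())
```

Repeating this procedure with shuffled splits yields the distribution of cross-validated scores discussed later.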

Step 1: Tested models and selection. Based on the number of features (56) and the training data available (∼1100 data points), we compare two families of classifiers: support-vector classifiers (SVCs) and random forests classifiers (RFCs).

Given the diversity of scanning sites, in the model selection loop we also investigate the need for normalizing (zscoring) the features. In the following, models including a preliminary zscoring show the suffix "-zs", while those using the original features without such transformation are noted with the suffix "-nzs".

The support-vector machine classifier (SVC). A support-vector machine [27] finds a hyperplane in the high-dimensional space of the features that robustly classifies the data. The SVC then uses the hyperplane to decide the class that is assigned to new samples in the space of features. Two hyper-parameters define the support-vector machine algorithm: a kernel function that defines the similarity between data points to ultimately compute a distance to the hyperplane, and a regularization weight C. In particular, we analyzed here the linear SVC implementation (hereafter, "SVC-lin") and the one based on radial basis functions (denoted by "SVC-rbf"). During model selection, we evaluated the regularization weight C and the γ parameter (kernel width) of the SVC-rbf.
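An "SVC-rbf-zs" variant can be sketched as a scikit-learn pipeline: features are z-scored before the RBF-kernel SVC, and both C and the kernel width γ are tuned. The data and grid values are illustrative placeholders, not the study's actual settings.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Placeholder feature matrix standing in for the 56 IQMs.
X, y = make_classification(n_samples=200, n_features=56, random_state=0)

# Z-scoring step ("-zs") followed by the RBF-kernel SVC; grid over C and gamma.
svc_rbf_zs = Pipeline([("zs", StandardScaler()), ("svc", SVC(kernel="rbf"))])
search = GridSearchCV(
    svc_rbf_zs,
    param_grid={"svc__C": [0.1, 1.0, 10.0], "svc__gamma": ["scale", 0.01]},
    scoring="roc_auc",
    cv=3,
)
search.fit(X, y)
print(search.best_params_)
```

Dropping the `StandardScaler` step yields the corresponding "-nzs" variant; wrapping the scaler inside the pipeline ensures z-scoring parameters are learned only on each training fold.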

The random forests classifier (RFC). Random forests [28] are a nonparametric ensemble learning method that builds multiple decision trees. A RFC then assigns to each new sample the mode of the predicted classes of all decision trees in the ensemble. Random forests are driven by a larger number of hyper-parameters. Particularly, in this work we analyze the maximum tree depth, the minimum number of samples per split, and the total number of decision trees.

Feature ranking. One tool to improve the interpretability of the RFC is the calculation of feature rankings [28] by means of variable importance or Gini importance. Since we use scikit-learn, the implementation uses Gini importance, defined for a single tree as the total decrease in node impurity weighted by the probability of reaching that node. We finally report the median feature importance over all trees of the ensemble.
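The median-over-trees ranking described above can be sketched as follows. Note that scikit-learn's `feature_importances_` attribute is the *mean* Gini importance over trees, so the median is taken explicitly over the ensemble; the synthetic data are a placeholder.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Placeholder data with a few genuinely informative features.
X, y = make_classification(n_samples=300, n_features=10, n_informative=3,
                           random_state=0)
rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Per-tree Gini importances, aggregated with the median over the ensemble.
median_importance = np.median(
    [tree.feature_importances_ for tree in rf.estimators_], axis=0)
ranking = np.argsort(median_importance)[::-1]  # most important features first
print(ranking[:3])
```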

Step 2: Validation on the held-out dataset. In the second step, we cross-validated the model selected in step 1, optimizing a grid […] The averaged scores for the 10-fold strategy were: AUC=0.848 (σ=±0.135) and an accuracy of 88.6% (σ=±11.5%). These results indicated that there is no practical difference between the two split strategies as regards model selection through cross-validation on this dataset. Therefore, since the averaged scores using LoSo cross-validation in the inner loop were slightly higher, it was selected as the split strategy for the cross-validation in step 2. Note that the split strategy is not a model feature, and thus this decision can be made based on the results of the outer loop of nested cross-validation, as opposed to the model selection that is done based on the inner cross-validation loops.

Figure 4. In orange, the results corresponding to LoSo. In general, the 10-fold splitting was more optimistic for all models, whereas the LoSo scores are closer to the results obtained in the outer (evaluation) loop. In all iteration loops, regardless of split strategy and cross-validation repetition, the RFC-nzs achieved the best score, with varying parameters. As expected, zscoring the features was necessary for both SVC-lin and SVC-rbf to exhibit acceptable performances, but always below that of the RFC-nzs. (Right) Evaluation (outer loop). On the right-hand panel, colors again represent the split strategy used in the inner loop. With colored markers, the average cross-validated score is annotated in a box with the µ symbol; below, the spread of the distribution is noted. Please note that, since the scores are bounded above by 1.0, the values of the standard deviation σ are probably underestimated. The distributions of nested cross-validated scores for both AUC and accuracy were rather independent of the split strategy used in the inner loop. The results for both metrics obtained in the evaluation of the held-out dataset (ds030) are represented against the corresponding distribution of nested cross-validation scores, showing that the performance on unseen data falls very close to one standard deviation below the average score. In this case, the average cross-validated score was higher for LoSo (AUC=0.862/accuracy≈89.4%) as compared to the 10-fold split (AUC=0.848/accuracy≈88.6%). Also, the spread of cross-validated scores is slightly lower for LoSo (AUC=±0.121/accuracy=±9.95% vs. AUC=±0.135/accuracy=±11.5%).
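The LoSo splitting strategy maps directly onto scikit-learn's `LeaveOneGroupOut`: each fold holds out all images from one scanning site, so evaluation always happens on a site unseen during training. The site labels and data below are hypothetical stand-ins for ABIDE's 17 sites.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

# Hypothetical sample: 17 sites of 10 images each, standing in for ABIDE.
X, y = make_classification(n_samples=170, n_features=56, random_state=0)
sites = np.repeat(np.arange(17), 10)

# Leave-one-site-out cross-validation: one fold (and one score) per site.
loso = LeaveOneGroupOut()
scores = cross_val_score(
    RandomForestClassifier(n_estimators=20, random_state=0),
    X, y, groups=sites, cv=loso)
print(len(scores))  # one accuracy score per held-out site
```

Because no images from the held-out site ever reach the training folds, LoSo scores tend to be less optimistic than 10-fold scores, as reported above.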

The best performing model and parameters, selected as the maximum average AUC score in the inner loop across all repetitions of the nested cross-validation, was the RFC-nzs, with 50 trees (n_estimators), a maximum tree depth (max_depth) of 20, and a minimum number of samples per split of 2 (n_estimators=50, max_depth=20, min_samples_split=2). The AUC on the evaluation set was 0.695 and the accuracy 72.83%. We also analyzed the relevance of each feature in the overall forest decision (Fig 5). The most relevant features are the coefficient of joint variation (CJV) and the SNR measured on the WM tissue mask.
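Fitting the final RFC-nzs with the selected hyper-parameters and scoring it on a held-out set can be sketched as follows; the synthetic train/test arrays are stand-ins for the ABIDE and ds030 feature matrices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score

# Stand-ins for the ABIDE training set and the ds030 held-out set.
X_train, y_train = make_classification(n_samples=300, n_features=56,
                                       random_state=0)
X_test, y_test = make_classification(n_samples=100, n_features=56,
                                     random_state=1)

# RFC-nzs: raw (non z-scored) features with the selected hyper-parameters.
clf = RandomForestClassifier(n_estimators=50, max_depth=20,
                             min_samples_split=2,
                             random_state=0).fit(X_train, y_train)

auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
acc = accuracy_score(y_test, clf.predict(X_test))
print(round(auc, 3), round(acc, 3))
```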

We propose a quantitative approach to the quality control of T1w MRI of the brain, enabling the automatic identification of sub-standard acquisitions. Quality control protocols are implemented to exclude faulty datasets that can bias the final results. Human brain images can be degraded by various sources of artifacts, related to the scanning device and settings or due to the participants themselves. Machine-derived artifacts are efficiently mitigated in a quantitative manner with calibration. However, due to the lack of reliable quality quantification tools, subject-specific artifacts and drifts from the service settings are assessed visually. The visual inspection of every MRI acquisition of the brain is a time-consuming and bias-prone task that would ideally be replaced by decision algorithms. Automating the QC process is particularly necessary for ongoing studies such as the UK Biobank that will collect data from tens of thousands of individuals.

Previous efforts [7,30] in the quantification of image quality included the definition of image-quality metrics (IQMs). However, that approach was unfeasible for total automation due to the limited sensitivity of the available IQMs to the most prevalent artifacts. Subsequent efforts [8] were focused on specific samples, […]

We used nested cross-validation to select the most predictive classifier, ensuring that the evaluation loop was unbiased, using a leave-one-site-out (LoSo) splitting strategy. In this cross-validation scheme, the accuracy is bounded below by that measured during the test validation loop. Therefore, the final classifier is ultimately trained using all the available data to push its predictive accuracy above the evaluated performance.

MRIQC is one effort to standardize methodologies that make data-driven and objective QC decisions. Automated QC can provide unbiased exclusion criteria for neuroimaging studies, helping avoid "cherry-picking" of data. A second potential application is the use of automated QC predictions as data descriptors to support the recently born "data papers" track of many journals and public databases like OpenfMRI [31]. The ultimate goal of the proposed classifier is its inclusion in automatic […]

[…] pre-registered the report, drafted the manuscript, ran the experiments, and interpreted the results. KJG devised the machine learning approach to quality control, coordinated the project, contributed to MRIQC and the cross-validation workflow, pre-registered the report, and interpreted the results. MS rated the ABIDE dataset and helped understand the problems of inter- and intra-rater variability. DB rated the ds030 dataset. OOK contributed to the design of the cross-validation workflow and interpreted the results.

Listing SI1. Running MRIQC. The BIDS standard makes MRIQC compatible with almost any input dataset without the need for custom settings. Since all the metadata associated with the dataset are found in bids-data/, the following example would run without further settings. The second positional argument, out/, indicates where the outputs will be written; finally, the participant keyword instructs MRIQC to run the first-level analysis as specified in BIDS Apps.
mriqc bids-data/ out/ participant
mriqc bids-data/ out/ participant --participant_label S001 S002
Listing SI2. Running MRIQC, group level. If the participant level was run setting some --participant_label, the group level is not triggered by default. It can be run manually, pointing the input data folder to the derivatives folder generated by the participant-level analysis:

mriqc out/derivatives/ out/ group
Listing SI3. Predicting quality. Although the group run level will generate a CSV table with the quality label predicted for each sample, it is possible to run the classifier individually:

mriqc_clf --load-classifier -X aMRIQC.csv -o mypredictions.csv

The default classifier can be replaced by a custom one using:

mriqc_clf --load-classifier my_custom_classifier.pklz -X aMRIQC.csv -o mypredictions.csv

The documentation website contains more detailed information on how to train custom classifiers or generate refined results from predictions: http://mriqc.readthedocs.io/en/latest/classifier.html.

Figure SI1. Extended caption of Fig 1A. An example scan (top) is shown with severe motion artifacts. The reduced contrast between tissues and the ringing intensity waves in the anterior region of the brain in the presented slice suggest that a large head movement occurred during acquisition. The green arrows point to signal spillover due to eye movements through the phase-encoding axis (in this case, right-to-left, RL). Oftentimes, the RL or LR axes are selected for phase-encoding because the signal leakage from the eyeballs does not overlap with brain tissue, as opposed to selecting anterior-posterior directions. However, the red arrows point to signal spillover caused by vessel pulsations. Given the location of the vessel, in this case the signal leakage overlaps brain tissue and affects the quality of this image. The phase-encoding axis has less bandwidth and is thus more sensitive to movement; for that reason, it is generally selected to have the shortest field of view. A second example scan (bottom) shows severe coil artifacts.