MRIQC: Advancing the automatic prediction of image quality in MRI from unseen sites

Quality control of MRI is essential for excluding problematic acquisitions and avoiding bias in subsequent image processing and analysis. Visual inspection is subjective and impractical for large-scale datasets. Although automated quality assessments have been demonstrated on single-site datasets, it is unclear whether such solutions can generalize to unseen data acquired at new sites. Here, we introduce the MRI Quality Control tool (MRIQC), which extracts quality measures and fits a binary (accept/exclude) classifier. The tool can be run both locally and as a free online service via the OpenNeuro.org portal. The classifier is trained on a publicly available, multi-site dataset (17 sites, N = 1102). We perform model selection, evaluating different normalization and feature-exclusion approaches aimed at maximizing across-site generalization, and estimate an accuracy of 76%±13% on new sites using leave-one-site-out cross-validation. We confirm that result on a held-out dataset (2 sites, N = 265), also obtaining 76% accuracy. Even though the performance of the trained classifier is statistically above chance, we show that it is susceptible to site effects and unable to account for artifacts specific to new sites. MRIQC performs with high accuracy in intra-site prediction, but performance on unseen sites leaves room for improvement, which might require more labeled data and new approaches to between-site variability. Overcoming these limitations is crucial for a more objective quality assessment of neuroimaging data, and to enable the analysis of extremely large, multi-site samples.
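The leave-one-site-out scheme used for the estimate above can be sketched in plain Python: each site is held out in turn as the test fold, while all remaining sites form the training fold. This is only a schematic of the cross-validation split, not MRIQC's implementation, and the site names are hypothetical:

```python
from collections import defaultdict

def leave_one_site_out(samples):
    """Yield (site, train_idx, test_idx) splits, holding out one site per fold.

    `samples` is a list of (sample_id, site) pairs; every sample from the
    held-out site goes to the test fold, all others to the training fold.
    """
    by_site = defaultdict(list)
    for idx, (_, site) in enumerate(samples):
        by_site[site].append(idx)
    for held_out in sorted(by_site):
        train_idx = [i for i, (_, s) in enumerate(samples) if s != held_out]
        yield held_out, train_idx, by_site[held_out]

# Example with three hypothetical sites
data = [("s1", "SiteA"), ("s2", "SiteA"), ("s3", "SiteB"), ("s4", "SiteC")]
for site, train, test in leave_one_site_out(data):
    print(site, train, test)  # e.g. SiteA [2, 3] [0, 1]
```

Averaging the per-fold accuracies (one per held-out site) yields the cross-site estimate reported in the abstract.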

Eq 1 in S1 File Definition of the accuracy score (ACC). The accuracy score is calculated as the total of true positive (TP) and true negative (TN) samples over the total of samples: ACC = (TP + TN) / (P + N), where P and N denote the totals of positive and negative samples.
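In code, Eq 1 (and the recall of Eq 2, defined next) amount to simple ratios over the confusion-matrix counts; a minimal Python sketch with illustrative counts:

```python
def accuracy(tp, tn, p, n):
    """ACC: true positives plus true negatives over all samples (Eq 1)."""
    return (tp + tn) / (p + n)

def recall(tp, p):
    """Sensitivity to the "exclude" class: TP over all positives (Eq 2)."""
    return tp / p

# Illustrative counts: 50 positives ("exclude"), 50 negatives ("accept")
print(accuracy(tp=40, tn=45, p=50, n=50))  # 0.85
print(recall(tp=40, p=50))                 # 0.8
```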
Eq 2 in S1 File Definition of the sensitivity to the "exclude" class (or recall). The recall is calculated as the total of true positive (TP) samples over the total of positive (manually rated as "exclude") samples: REC = TP / P.

Block 3 in S1 File Converting datasets into BIDS The BIDS standard is built on top of well-established file formats such as NIfTI and JSON, and uses an easy-to-adopt file and folder naming structure. Many tools can help users convert their data to BIDS. For example, dcm2niix (https://github.com/rordenlab/dcm2niix) is a robust DICOM and PAR/REC to NIfTI conversion tool that generates BIDS-compatible JSON files. It can be used to build custom scripts tailored to the IT infrastructure available at one's neuroimaging center. HeuDiConv (https://github.com/nipy/heudiconv) is a more comprehensive solution that can be configured to take folders of unorganized DICOM files as input and produce a BIDS-compatible file/folder structure as output. If the data are already in NIfTI format and there is a need to preserve disk space, symbolic links can be used to create a view of the dataset compatible with BIDS. For more information about software compatible with BIDS, see http://bids.neuroimaging.io.
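The symbolic-link approach mentioned above can be sketched as follows. The helper name and the source layout are hypothetical; only the target naming (`sub-<label>/anat/sub-<label>_T1w.nii.gz`) follows the BIDS convention for T1-weighted images:

```python
import os

def link_as_bids(src_nifti, bids_root, subject):
    """Expose an existing T1w NIfTI file under a BIDS-style path via a symlink,
    so no disk space is consumed by a duplicate copy."""
    anat_dir = os.path.join(bids_root, f"sub-{subject}", "anat")
    os.makedirs(anat_dir, exist_ok=True)
    target = os.path.join(anat_dir, f"sub-{subject}_T1w.nii.gz")
    if not os.path.lexists(target):
        os.symlink(os.path.abspath(src_nifti), target)
    return target

# Hypothetical usage: expose scans/subj01_mprage.nii.gz as
# bids-data/sub-01/anat/sub-01_T1w.nii.gz
# link_as_bids("scans/subj01_mprage.nii.gz", "bids-data", "01")
```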
Block 4 in S1 File Running MRIQC The BIDS standard makes MRIQC compatible with almost any input dataset without the need for custom settings. Since all the metadata associated with the dataset are found in bids-data/, the following example runs without further settings:

    mriqc bids-data/ out/ participant

The second positional argument, out/, indicates where the outputs will be written; finally, the participant keyword instructs MRIQC to run the first-level analysis as specified in BIDS Apps.
Running MRIQC - Group Level. If the participant level was run with some --participant_label set, the group level is not triggered by default. It can be run manually, pointing the input data folder to the derivatives folder generated by the participant-level analysis:

    mriqc out/derivatives/ out/ group

Predicting quality. Although the group-level run will generate a CSV table with the quality label predicted for each sample, the classifier can also be run individually, and the default classifier can be replaced by a custom one. The documentation website contains more detailed information on how to train custom classifiers or generate refined results from predictions: http://mriqc.readthedocs.io/en/latest/classifier.html.
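As a sketch of consuming the group-level CSV table mentioned above, flagged images can be listed with the standard library. The column names `subject_id` and `prediction` are assumptions for illustration, not MRIQC's actual output schema:

```python
import csv
import io

def flagged_subjects(csv_text):
    """Return the subject IDs whose predicted quality label is "exclude"."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row["subject_id"] for row in reader
            if row["prediction"] == "exclude"]

# Illustrative table mimicking a predictions CSV
table = "subject_id,prediction\nsub-01,accept\nsub-02,exclude\nsub-03,accept\n"
print(flagged_subjects(table))  # ['sub-02']
```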
Block 5 in S1 File Impact of the labeling protocol and variability sources In an earlier version of this manuscript, we reused an existing quality assessment of the ABIDE dataset that followed a different labeling protocol. We addressed the issue with a second manual assessment of the full ABIDE dataset, this time following the labeling protocol. In this second assessment, the expert who had originally rated the ABIDE dataset re-assessed 601 images. We use those 601 images to evaluate the intra-rater variability under different labeling protocols in Figure A.
In the earlier version of the manuscript, which used Protocol B for labeling, we introduced an unplanned inter-rater bias, since the held-out dataset could not be rated by the same expert who rated the ABIDE dataset. However, by changing the rater between training and test datasets, we implicitly controlled for the risk of overestimating the classification ACC due to systematic errors (those idiosyncratic to one particular expert) in the labels of the training set. This problem was fixed when we used Protocol A and two different experts rated 601 data points of ABIDE (with an overlap of 100 images). An analysis of all these sources of variability is presented in Figure B. On the left, the confusion matrix for a single rater using two different labeling protocols is presented. Protocol A corresponds to the procedures used in this work, described in section Labeling protocol. Protocol B corresponds to the previously existing assessment, which was based on the corrections applied to the surface reconstruction of FreeSurfer. The Cohen's Kappa index of agreement for the 3-class labeling was κ=0.29; calculated on binarized labels, the index increases to κ=0.48. Therefore, the labeling protocol defines, in practice, the concept of "quality" for the application at hand. The selection of the labeling procedures requires a trade-off between the time it takes to rate a new data point to increase the training set and the class-noise derived from the unreliability of such ratings. The protocol also determines the interpretation of the quality categories by raters. As seen in the figure, only 1% of the images were deemed "doubtful" under both protocols, while 19% were accepted with Protocol B and rated "doubtful" with Protocol A. A similar effect occurs for the 17% of images that were accepted with Protocol B (after applying surface corrections) and excluded with Protocol A.
Figure B Quality control of the ABIDE dataset.
Info-graphic of the visual assessment of the T1w images of the ABIDE dataset, performed by two different experts and split by scanning site. Each scanning site has one stripe with three rows of colored circles, except sites with large samples, where the ratings are wrapped into two stripes. Each circle represents one rating of one image by one of the experts, with the color encoding the quality label (green is "accept", gray is "doubtful", red is "exclude", and missing dots represent missing ratings). The second and third rows are ratings from the same expert, using the two different protocols A and B. Protocol A corresponds to that used in this work to train and test the classifier. Protocol B is an alternative labeling protocol that used the surfaces reconstructed by FreeSurfer after manually correcting errors. Perfect agreement between ratings occurs when the three circles of a column show the same color (for example, the first participant of the "OLIN" scanning site). Some images yielded no agreement across raters and protocols (e.g., the first participant in the "MAX MUN" sample). For the intra-rater reliability (2nd and 3rd rows), a consistent direction of discrepancy is observed, whereby images with poorer quality were considered good after manual corrections in the corresponding protocol (3rd row). Next to each site label, the frequency of quality labels is reported, using the same color code. The aggregated (all sites of ABIDE) frequencies are presented in the top-right box.
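The κ values reported above follow Cohen's definition of chance-corrected agreement, which can be sketched in pure Python. The example label vectors below are illustrative, not the actual ABIDE ratings:

```python
from collections import Counter

def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two label sequences,
    where p_o is the observed agreement and p_e the agreement expected
    by chance from each rater's label frequencies."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[c] * freq_b[c]
              for c in set(rater_a) | set(rater_b)) / n ** 2
    return (p_o - p_e) / (1 - p_e)

# Illustrative binarized labels (1 = "exclude", 0 = "accept")
a = [1, 1, 0, 0, 1, 0, 0, 0]
b = [1, 0, 0, 0, 1, 0, 1, 0]
print(round(cohen_kappa(a, b), 2))  # 0.47
```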

PLOS
Block 6 in S1 File The idiosyncratic ghost of DS030. We found the ghosting artifact of DS030 (see Fig 7B) in around 18% of the images. Since there are no examples in the training set with a comparable ghosting artifact, we asked what the performance on the held-out dataset would have been, had we removed those artifacts from the test sample. In an exploratory experiment, we removed those data points (file mriqc/data/csv/manual_qc/ds030_ghosts_x.csv in the GitHub repository) from the test set and ran the evaluation experiment again. The results are summarized in Table 1 in S1 File.
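The exploratory removal described above amounts to a set difference on image identifiers; a minimal sketch with the standard library (the identifiers are illustrative, and in practice the flagged list would be read from the CSV file named above):

```python
def drop_flagged(test_ids, flagged_ids):
    """Remove images flagged as ghosting-affected from the test sample,
    preserving the order of the remaining identifiers."""
    flagged = set(flagged_ids)
    return [i for i in test_ids if i not in flagged]

# Illustrative identifiers, not the actual DS030 subject list
held_out = ["img-001", "img-002", "img-003", "img-004"]
ghosts = ["img-002", "img-004"]
print(drop_flagged(held_out, ghosts))  # ['img-001', 'img-003']
```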