Removing facial features from structural MRI images biases visual quality assessment

doi:10.1371/journal.pbio.3003149

Fig 1.

An example of a T₁w image before and after defacing.

Defacing is typically implemented by zeroing the voxels around the face (left-hand side, panel “T1-weighted”). The “Background noise” panel shows two visualizations extracted from the MRIQC visual report, in which a window selects the lowest intensities, which are then inverted to enhance patterns in the background. The red arrows indicate aliases produced by eye motion along the anterior–posterior axis, which had the lowest bandwidth in the example. These aliases are easy to notice in front of the ocular globes in the “Background noise” panel of the “nondefaced” image because no other signal sources are present there. This aliasing also spreads in the opposite direction, overlapping brain tissue, but that overlap is often very hard to notice against the signals of interest within the brain. The corresponding “defaced” version of the “Background noise” panel shows how defacing eliminates valuable information for quality assessment.

Table 1.

Sensitivity analyses for rm-ANOVA and LME comparisons.

Using G*Power [33], we determined that our rm-ANOVA modeling would confirm differences in manual ratings of f = 0.218 or larger. This sensitivity corresponds to η²_p = 0.045 (i.e., a medium effect size) following Equation A in S1 Text. In the rm-ANOVA sensitivity analysis, we set two groups (defaced/nondefaced) and four measurements (4 raters) with a total sample size of 185 subjects from the HH site, 90% power, α = 0.02, a nonsphericity correction of 0.34, and a correlation among repeated measures of 0.1. Note that this sensitivity analysis is conservative, as we expected a much higher correlation among repeated measures, which would reduce the detectable effect size. Remaining conservative, we iteratively tried different nonsphericity correction values and kept the lowest one possible to maximize the detectable effect size. With G*Power (Fig K in S1 Text), we also estimated the noncentrality parameter λ associated with the likelihood ratio test, which is a proxy for the effect size, yielding λ = 13.017. The degrees of freedom of the likelihood ratio test correspond to the difference in parameter count between the two nested LMEs compared (see Table F in S1 Text).
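The conversion from Cohen's f to an η²-type effect size can be checked with a few lines of Python. This is a minimal sketch that assumes Equation A in S1 Text (not reproduced here) is the standard relation η² = f² / (1 + f²); the function name is hypothetical.

```python
def cohens_f_to_eta_squared(f: float) -> float:
    """Convert Cohen's f to eta squared, assuming eta^2 = f^2 / (1 + f^2)."""
    f_squared = f * f
    return f_squared / (1.0 + f_squared)


# Detectable effect size reported for the rm-ANOVA sensitivity analysis.
eta_squared = cohens_f_to_eta_squared(0.218)
print(f"eta squared for f = 0.218: {eta_squared:.3f}")
```

Under this assumption, f = 0.218 rounds to the 0.045 quoted in the caption.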

Fig 2.

Defacing biases human assessment of image quality, particularly when image quality is low.

We examined biases with an “optimized” version of the BA plot, in which the x-axis represents the rating assigned to the nondefaced version of an image. Corresponding “standard” BA plots, in which the x-axis shows the average of the two ratings [37], are reported in Fig B in S1 Text. Rating pairs where the defaced image’s quality was underestimated with respect to the nondefaced one (rating difference > 0) are represented in yellow. Conversely, pairs where the defaced image’s quality was overestimated (rating difference < 0) are in purple. Pairs within the 95% LoA are represented with dim colors, and the LoA boundaries are indicated with dashed colored lines annotated with their value. For example, the LoA for all raters pooled together was [−0.91, 0.81] (left panel). Finally, the bias is represented by a gray or black dashed line with a label reporting its value and the corresponding 95% CI (parametric estimation). All raters displayed 95% LoA exceeding one unit, indicating that defacing introduces large variability in human assessments. All raters had negative, albeit small, biases, indicating that they systematically rated defaced images higher. However, these biases were statistically significant only for Raters 1 and 3 and for the four raters aggregated together, as indicated by the bias label and line colored in black. Relevant statistics (bias, LoA, 95% CI) are reported in Table B in S1 Text, and the full report of statistics, including 95% CIs calculated by both parametric and non-parametric means, is distributed within the S6 Data file. Source tabular data for the BA analysis and results in Table B in S1 Text are found within the S1 Data file.
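The bias and limits of agreement shown in a BA plot can be sketched as follows. This is an illustrative example only: the ratings are made up, the function name is hypothetical, and the sign convention (nondefaced minus defaced) is inferred from the caption's statement that a negative bias means defaced images were rated higher. The LoA use the usual parametric form, bias ± 1.96 standard deviations of the differences.

```python
from statistics import mean, stdev


def bland_altman(nondefaced, defaced, z=1.96):
    """Return the bias and 95% limits of agreement of paired ratings.

    Differences are computed as nondefaced - defaced (assumed convention).
    """
    diffs = [a - b for a, b in zip(nondefaced, defaced)]
    bias = mean(diffs)
    spread = stdev(diffs)  # sample standard deviation of the differences
    return bias, (bias - z * spread, bias + z * spread)


# Hypothetical ratings for six images (not data from the study).
nondefaced = [1.0, 2.0, 2.5, 3.0, 3.5, 4.0]
defaced = [1.5, 2.5, 2.5, 3.0, 3.5, 4.0]

bias, (lower, upper) = bland_altman(nondefaced, defaced)
print(f"bias = {bias:.3f}, 95% LoA = [{lower:.3f}, {upper:.3f}]")
```

In this toy data only the two lowest-rated images change after defacing, which yields a negative bias, mirroring the pattern described in the caption.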

Table 2.

The sensitivity analysis indicated that the rm-MANOVA was able to confirm differences in IQMs of f = 0.16 (i.e., a medium effect) or greater.

We ran the sensitivity analysis with G*Power, setting three groups (3 sites) and two measurements (defaced/nondefaced) with N = 580 (number of T1w per subject) per condition, with 90% power, and α = 0.02.

Fig 3.

Rater 4 issued visibly different ratings, generally more optimistic, than the other raters in both (defaced and nondefaced) conditions.

The gray lines highlight the evolution of the rating between the nondefaced image and its defaced counterpart. The full white line in the violin plot represents the median of the distribution, while the dashed white lines represent the 25% and 75% quantiles. Comparing the median of the rating distribution from the nondefaced vs. defaced images, it is visible that different raters presented different bias magnitudes. Our most experienced rater (Rater 1) showed the largest bias. Rater 4’s rating distribution diverged from that of the other raters, being more optimistic overall about the quality of the images. Rater 4 also displayed a lower spread in quality assessments, which translated into the narrowest 95% LoA (Fig 2). Lastly, low ratings tended to be more biased by defacing as they showed a steeper evolution line, sometimes jumping one unit or more (equivalent to switching categories in the appreciation of quality, e.g., going from “poor” to “acceptable”). Higher ratings displayed gaps within 0.5 units. BA plots support the same observation (Fig 2 and Fig B in S1 Text). Source tabular data to generate this figure are found within the S1 Data.

Table 3.

The linear mixed-effect models (LME) with defacing as a fixed effect explained significantly more variance than a “baseline” counterpart without defacing.

Model comparison with a likelihood ratio test yielded p_FDR = 0.0183 after FDR correction. Complete reporting of the pre-registered comparison and the additional exploratory analyses is provided in Tables F and G in S1 Text.
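A likelihood ratio test between two nested models can be sketched as below. The log-likelihood values are hypothetical, and the example assumes a single degree of freedom (one extra fixed effect for defacing); the actual degrees of freedom are given in Table F in S1 Text and are not reproduced here. For one degree of freedom, the chi-square tail probability has the closed form erfc(√(LR/2)), so no external library is needed. FDR correction is not included.

```python
import math


def likelihood_ratio_test_df1(loglik_baseline, loglik_full):
    """Likelihood ratio test for nested models differing by ONE parameter.

    Returns the test statistic and its chi-square (df = 1) p-value.
    """
    statistic = 2.0 * (loglik_full - loglik_baseline)
    statistic = max(statistic, 0.0)  # guard against tiny negative values
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# Hypothetical log-likelihoods (not values from the study).
lr, p = likelihood_ratio_test_df1(loglik_baseline=-1000.0, loglik_full=-997.0)
print(f"LR = {lr:.2f}, uncorrected p = {p:.4f}")
```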

Table 4.

Results of repeated-measures MANOVA on the projected IQMs.

We found neither an effect of site nor an effect of defacing on the principal components extracted from the IQMs. However, applying IQM standardization and PCA separately per site mitigated the site effects, revealing a defacing bias. Although significant, the defacing bias is associated with a negligible effect size. Effect size is reported as partial η² (η²_p) and was computed with the function F_to_eta2 from the R package effectsize [46], which implements the formula given in Equation D in S1 Text with df = numDF and df_error = denDF. p_FDR corresponds to p-values controlled for false discovery rate (FDR).
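For readers working outside R, the conversion from an F statistic to partial η² can be sketched in Python. This assumes Equation D in S1 Text (not reproduced here) is the usual relation η²_p = (F · df) / (F · df + df_error); the function name and the input values are hypothetical.

```python
def f_to_partial_eta_squared(f_stat: float, df: float, df_error: float) -> float:
    """Partial eta squared from an F statistic and its degrees of freedom.

    df is the numerator (effect) df; df_error is the denominator df.
    """
    effect = f_stat * df
    return effect / (effect + df_error)


# Hypothetical F test (not a result from the study): F(1, 96) = 4.0.
eta_p2 = f_to_partial_eta_squared(f_stat=4.0, df=1, df_error=96)
print(f"partial eta squared = {eta_p2:.3f}")
```

With a large denominator df, even a significant F statistic maps to a small η²_p, which is how an effect can be significant yet negligible in size.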

Table 5.

Study design template.

This table summarizes the link between the hypotheses, research questions, analysis plans, sensitivity analysis, and prospective interpretation given different outcomes.
