Fig 1.
(Left) Depth Data Collection: We used the NYU Depth V2 dataset, which contains 654 test images of indoor scenes, and selected eight target points per image (example images with red crosses). To collect human data, participants provided absolute depth judgments (in meters) for each labeled point (A, B, C, D) by adjusting bars against reference lines, as shown in the example task screen. We evaluated a diverse set of 64 DNN models, varying in training strategy (supervised, self-supervised, hybrid, generative), training data characteristics (i.i.d. vs. o.o.d. datasets, single vs. multiple datasets), and backbone architecture (CNN-based, Transformer-based, Diffusion-based). The image was generated by ChatGPT (GPT-4o, 2025) to avoid potential copyright issues and is intended for illustrative purposes only. (Center) Decomposing Errors: We decomposed errors in depth judgments using two fitting methods. (1) Exponential fitting captured the global compression of perceived depth at far range as well as its expansion at near range. (2) Per-image affine fitting decomposed the errors for each image into scale, shift, horizontal shear, and vertical shear components, schematically depicted as orange shapes transformed from the ground truth (gray shapes). (Right) Comparing Errors: We compared error patterns between humans and DNNs by calculating partial correlations between their raw errors, as well as correlations between their affine coefficients and between their residual errors, allowing for a quantitative assessment of human-DNN similarity in depth judgments.
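The per-image affine decomposition described in this caption can be sketched as an ordinary least-squares fit. The functional form below (judged depth as a linear function of true depth and image coordinates), the function name `fit_affine`, and the synthetic values are assumptions made for illustration; the caption itself only names the four components.

```python
import numpy as np

def fit_affine(x, y, z_true, z_judged):
    """Least-squares fit of z_judged ~ a_z*z_true + a_x*x + a_y*y + b for one image.

    Assumed convention: x and y are image coordinates of the target points,
    a_z is the scale, b the shift, a_x the horizontal shear, a_y the vertical shear.
    """
    x, y, z_true, z_judged = (np.asarray(v, dtype=float) for v in (x, y, z_true, z_judged))
    design = np.column_stack([z_true, x, y, np.ones_like(z_true)])
    coef, *_ = np.linalg.lstsq(design, z_judged, rcond=None)
    residual = z_judged - design @ coef  # what the affine model cannot explain
    a_z, a_x, a_y, b = coef
    return {"scale": a_z, "shear_x": a_x, "shear_y": a_y, "shift": b, "residual": residual}

# Synthetic image with eight target points (made-up values, not data from the study).
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, 8)
y = rng.uniform(-1.0, 1.0, 8)
z_true = rng.uniform(1.0, 8.0, 8)
z_judged = 0.8 * z_true + 0.1 * x - 0.2 * y + 0.5
fit = fit_affine(x, y, z_true, z_judged)
print({k: round(float(v), 3) for k, v in fit.items() if k != "residual"})
```

With eight points and four coefficients per image the system is overdetermined, so the residual vector carries the part of the judgment error that no affine transformation accounts for.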
Fig 2.
Scatter plot of averaged human depth judgments for each data point.
The red curve represents the exponential fitting results.
Table 1.
Results of affine components and inter-individual similarity of human depth judgments.
Fig 3.
Performance of regression models in capturing human depth judgments.
The left panel presents the coefficients of determination (R²), and the right panel illustrates the similarity to human depth judgments, measured by Pearson partial correlations (controlling for ground truth) between model estimates and human data. Error bars denote the 95% confidence intervals from 1,000 random half-splits of participants.
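A Pearson partial correlation controlling for ground truth can be computed by regressing the ground truth out of both variables and correlating the residuals. The sketch below is one standard way to do this; the helper name `partial_corr` and the synthetic data are illustrative assumptions, not the authors' code.

```python
import numpy as np

def partial_corr(a, b, control):
    """Pearson correlation between a and b after linearly regressing `control` out of both."""
    a, b, control = (np.asarray(v, dtype=float) for v in (a, b, control))
    design = np.column_stack([control, np.ones_like(control)])
    res_a = a - design @ np.linalg.lstsq(design, a, rcond=None)[0]
    res_b = b - design @ np.linalg.lstsq(design, b, rcond=None)[0]
    return float(np.corrcoef(res_a, res_b)[0, 1])

# Synthetic example: both "observers" share the same error pattern around ground truth.
rng = np.random.default_rng(1)
ground_truth = rng.uniform(1.0, 8.0, 100)
shared_error = rng.normal(0.0, 0.5, 100)
human = ground_truth + shared_error
model = 3.0 * ground_truth + 2.0 * shared_error - 1.0
r = partial_corr(human, model, ground_truth)
print(round(r, 6))
```

Because the ground truth is removed first, two observers that are both accurate but err in unrelated ways would score near zero, while shared error patterns push the value toward one.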
Table 2.
The model outputs were categorized based on the type of depth information provided: absolute depth (no symbol), relative depth, and disparity (the latter two each indicated by a symbol in the table). The full version of this table is available in the Supporting Information.
Fig 4.
Comparison of human and DNN error patterns.
(A) Accuracy as measured by RMSE for raw error. (B) Similarity between humans and DNNs based on Pearson partial correlation. For both absolute (left) and scale-recovered (right) analyses, the inter-human partial correlations were calculated from absolute data, serving as a reference benchmark for human-level consistency.
Fig 5.
Scatter plot showing the relationship between RMSE and human similarity.
The horizontal axis denotes (A) the original metric depth RMSE, (B) the scale-shift invariant RMSE, and (C) the affine-invariant RMSE.
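The difference between metric RMSE and scale-shift invariant RMSE can be illustrated as follows. The alignment here is a least-squares scale and shift fitted in depth space, which is an assumption; some evaluation protocols align in disparity space instead, and the caption does not say which was used.

```python
import numpy as np

def rmse(pred, target):
    """Root-mean-square error between predicted and target depths."""
    pred, target = np.asarray(pred, dtype=float), np.asarray(target, dtype=float)
    return float(np.sqrt(np.mean((pred - target) ** 2)))

def scale_shift_invariant_rmse(pred, target):
    """RMSE after the best least-squares scale and shift have been applied to `pred`."""
    pred, target = np.asarray(pred, dtype=float), np.asarray(target, dtype=float)
    design = np.column_stack([pred, np.ones_like(pred)])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return rmse(design @ coef, target)

# A prediction that is wrong only by a global scale and shift (made-up values).
target = np.array([1.0, 2.0, 3.5, 5.0, 7.0])
pred = 2.0 * target + 1.0
metric_rmse = rmse(pred, target)
ssi_rmse = scale_shift_invariant_rmse(pred, target)
print(metric_rmse, ssi_rmse)
```

An affine-invariant variant would, under the same assumption, add the image coordinates of each point as further columns of the design matrix before computing the error.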
Fig 6.
Exponential fitting coefficients for 31 DNNs that produce absolute depth values.
The figure consists of three subplots: scale component (C), exponent component (γ), and shift component (β).
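For readers who want to reproduce a fit with these three coefficients, the sketch below assumes the functional form judged = C · true^γ + β. That form is inferred from the coefficient names (scale, exponent, shift) and is not stated in the caption. It uses a grid search over γ with a linear solve for C and β at each step; the grid range is an arbitrary choice.

```python
import numpy as np

def fit_power_law(z_true, z_judged, gammas=np.linspace(0.1, 2.0, 191)):
    """Fit z_judged ~ C * z_true**gamma + beta (assumed form) by grid search over gamma."""
    z_true = np.asarray(z_true, dtype=float)
    z_judged = np.asarray(z_judged, dtype=float)
    best = None
    for g in gammas:
        # For a fixed exponent the model is linear in C and beta.
        design = np.column_stack([z_true ** g, np.ones_like(z_true)])
        coef, *_ = np.linalg.lstsq(design, z_judged, rcond=None)
        sse = float(np.sum((design @ coef - z_judged) ** 2))
        if best is None or sse < best[0]:
            best = (sse, float(coef[0]), float(g), float(coef[1]))
    _, C, gamma, beta = best
    return C, gamma, beta

# Synthetic judgments with a compressive exponent (made-up values).
z_true = np.linspace(0.5, 10.0, 50)
z_judged = 1.5 * z_true ** 0.7 + 0.3
C, gamma, beta = fit_power_law(z_true, z_judged)
print(C, gamma, beta)
```

Under this assumed form, γ below 1 corresponds to compression of far depths, which matches the global compression effect the exponential fit is meant to capture.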
Fig 7.
Averaged affine components for human data and 31 DNNs that produce absolute depth values.
The figure comprises five subplots: scale component (a_z), shift component (b), horizontal shear component (a_x), vertical shear component (a_y), and RMSE for residual error. Error bars represent the 95% confidence intervals of the mean values, computed via bootstrap random sampling across individual data points.
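A bootstrap confidence interval of a mean, as used for these error bars, can be sketched with a percentile bootstrap. The percentile method, the number of resamples, and the synthetic coefficient values are assumptions; the caption specifies only that resampling was done across individual data points.

```python
import numpy as np

def bootstrap_ci_mean(values, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    # Each row is one resample (with replacement) of the original data points.
    idx = rng.integers(0, len(values), size=(n_boot, len(values)))
    means = values[idx].mean(axis=1)
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)

# Made-up scale coefficients for 200 data points.
coefficients = np.random.default_rng(2).normal(1.0, 0.2, 200)
lo, hi = bootstrap_ci_mean(coefficients)
print(lo, coefficients.mean(), hi)
```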
Fig 8.
Relationship between model accuracy and similarity to human judgments for 31 DNNs that output absolute depth values.
(A) Scatter plots illustrating the relationship between scale-shift invariant RMSE and human similarity across different affine components. Marker colors indicate the type of training datasets, while marker shapes represent the training strategy used. (B) Spearman correlation coefficients reflecting the relationship between RMSE and human similarity rankings. (C) Spearman correlation coefficients for model rankings based on human similarity across affine components.
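The Spearman coefficients in panels (B) and (C) are Pearson correlations computed on ranks. A minimal NumPy-only sketch follows; the helper names and the four example models are hypothetical, and in practice a library routine such as `scipy.stats.spearmanr` would normally be used.

```python
import numpy as np

def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values))
    ranks[order] = np.arange(1, len(values) + 1)
    for v in np.unique(values):
        mask = values == v
        ranks[mask] = ranks[mask].mean()
    return ranks

def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the rank-transformed inputs."""
    return float(np.corrcoef(average_ranks(a), average_ranks(b))[0, 1])

# Hypothetical models: lower RMSE goes with higher human similarity.
model_rmse = [0.3, 0.5, 0.9, 1.2]
human_similarity = [0.8, 0.6, 0.5, 0.1]
rho = spearman(model_rmse, human_similarity)
print(rho)
```

Because only ranks enter the computation, the coefficient is unchanged by any monotonic rescaling of either axis.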
Fig 9.
Relationship between model accuracy and similarity to human judgments for 64 DNNs using scale-recovered data.
(A) Scatter plots illustrating the relationship between scale-shift invariant RMSE and human similarity across different affine components. Marker colors indicate the type of training datasets, while marker shapes represent the training strategy used. (B) Spearman correlation coefficients reflecting the relationship between RMSE and human similarity rankings. (C) Spearman correlation coefficients for model rankings based on human similarity across affine components.
Fig 10.
Human-DNN similarity for selected models.
(A) Variation across different training datasets, with marker colors indicating dataset type. (B) Variation across different backbone architectures, with marker colors indicating architecture type.
Fig 11.
Results of ordinal-level analyses (rank and residual-rank) for humans and 31 DNNs producing absolute depth values.
(A) Ordinal error rates of humans and DNNs for raw and residual data. (B) Scatter plots illustrating the relationship between scale-shift invariant RMSE and human similarity. Marker colors indicate the type of training dataset used, while marker shapes represent the training strategies employed.
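One plausible reading of an ordinal error rate is the fraction of point pairs whose judged depth order disagrees with the ground-truth order. The sketch below implements that reading; the definition and the function name are assumptions, since the caption does not define the metric.

```python
from itertools import combinations

def sign(value):
    """Return -1, 0, or 1 according to the sign of `value`."""
    return (value > 0) - (value < 0)

def ordinal_error_rate(z_true, z_judged):
    """Fraction of point pairs whose judged order differs from the true order (assumed definition)."""
    pairs = list(combinations(range(len(z_true)), 2))
    wrong = sum(
        sign(z_true[i] - z_true[j]) != sign(z_judged[i] - z_judged[j])
        for i, j in pairs
    )
    return wrong / len(pairs)

# Four points in one image (made-up values); only the middle two are judged in the wrong order.
z_true = [1.0, 2.0, 3.0, 4.0]
z_judged = [1.1, 2.5, 2.0, 4.2]
rate = ordinal_error_rate(z_true, z_judged)
print(rate)
```

Applying the same function to the residuals of a fitted model rather than to raw judgments would give a residual-rank counterpart under the same assumed definition.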
Fig 12.
Example of the task screen for collecting the supplemental dataset (relative depth judgments).
All instructions were displayed in Japanese in the actual experiment. The image was generated by ChatGPT (GPT-4o, 2025) to avoid potential copyright issues and is intended for illustrative purposes only.