Hierarchical abstraction drives human-like 3-D shape processing in deep learning models

doi:10.1371/journal.pcbi.1014047

Fig 1.

Example stimuli used in the experiment.

Each object is visualized as a sparse point cloud sampled from its 3D surface. Colors represent depth, with red indicating proximity and blue indicating distance. In the experiments, point clouds were displayed in black and presented as rotating GIFs.

Fig 2.

The architectures of DGCNN (top) and Point Transformer (bottom).

N: number of points in each point cloud. MLP: Multi-layer perceptron, consisting of multiple fully connected layers. ⊕: concatenation.

Table 1.

Comparison and key differences between a convolution-based model (DGCNN) and a transformer-based model (Point Transformer).

Fig 3.

Point cloud stimuli used in Experiment 1 by varying dot density and object orientation.

Colors represent depth, with red indicating proximity and blue indicating distance. In the experiments, the stimuli were presented as black points with rotation in depth, viewed from a horizontal viewpoint.
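
The two manipulations named in this caption, dot density and object orientation, can be illustrated with a short sketch. It is not the authors' stimulus code. The function names, the random subsampling scheme, the choice of y as the vertical axis, and the treatment of inversion as a 180-degree rotation about the z axis are all assumptions made for illustration.

```python
import random


def subsample(points, proportion, seed=0):
    """Keep a random subset of points (hypothetical density manipulation).

    points: non-empty list of (x, y, z) tuples.
    proportion: fraction of points to keep, in (0, 1].
    """
    if not 0 < proportion <= 1:
        raise ValueError("proportion must be in (0, 1]")
    rng = random.Random(seed)
    # Keep at least one point so the stimulus is never empty.
    k = max(1, round(len(points) * proportion))
    return rng.sample(points, k)


def invert(points):
    """Turn the object upside down.

    Assumption: y is the vertical axis and inversion is a 180-degree
    rotation about the z axis, which maps (x, y, z) to (-x, -y, z).
    """
    return [(-x, -y, z) for x, y, z in points]
```

For example, `invert(subsample(cloud, 0.25))` would give an inverted stimulus with a quarter of the original points, where `cloud` is any list of 3D tuples.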

Fig 4.

Accuracy of human participants and models as a function of point density for upright point clouds (top) and inverted point clouds (bottom).

Error bars represent 95% confidence intervals around the mean accuracy, estimated across participants for human performance and across stimuli for model performance at each proportion level.
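
As a rough illustration of how such error bars can be obtained, the sketch below computes a 95% confidence interval around a mean accuracy. The caption does not say which estimator was used. The normal approximation (mean plus or minus 1.96 standard errors) and the function name `mean_ci95` are assumptions, not the authors' procedure.

```python
import math
from statistics import mean, stdev


def mean_ci95(values):
    """Return (mean, lower, upper) for a 95% confidence interval.

    Assumption: a normal approximation, mean +/- 1.96 * SEM. The values
    would be per-participant accuracies for humans, or per-stimulus
    correctness scores (0 or 1) for a model.
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    m = mean(values)
    sem = stdev(values) / math.sqrt(n)  # standard error of the mean
    half_width = 1.96 * sem
    return m, m - half_width, m + half_width
```

With small samples, a t-based or bootstrap interval would be wider than this approximation.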

Fig 5.

Confusion matrices for human responses, DGCNN predictions, and Point Transformer predictions in the upright condition of Experiment 1.

Rows represent true labels and columns represent predicted labels.
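
The row and column convention can be made concrete with a minimal sketch. The function name and the two-class example are hypothetical and are not taken from the paper's code or data.

```python
def confusion_matrix(true_labels, predicted_labels, classes):
    """Count responses with rows as true labels and columns as predictions.

    matrix[i][j] is the number of trials whose true class is classes[i]
    and whose predicted (or reported) class is classes[j].
    """
    index = {label: i for i, label in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for true, predicted in zip(true_labels, predicted_labels):
        matrix[index[true]][index[predicted]] += 1
    return matrix


# Illustrative labels only: one car trial is misreported as a chair.
example = confusion_matrix(
    ["car", "car", "chair"],
    ["car", "chair", "chair"],
    classes=["car", "chair"],
)
# example == [[1, 1], [0, 1]]
```

Dividing each row by its sum would give the proportion of responses per true category.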

Fig 6.

Lego-like point cloud stimuli used in Experiment 2.

Points were uniformly sampled from voxel surfaces converted from point clouds, with varying voxel sizes determining the degree of local deformation. Higher voxel size indicates more local deformation.
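
One way to picture this construction is the sketch below. It snaps points to a voxel grid, then samples uniformly from the cube faces that are not shared with another occupied voxel. This is a guess at the general idea, not the authors' pipeline. The function name `legoize`, the restriction to exposed faces, and the seeding are assumptions.

```python
import math
import random


def legoize(points, voxel_size, n_samples, seed=0):
    """Resample a point cloud from the surface of its voxelization.

    points: non-empty list of (x, y, z) tuples.
    voxel_size: cube edge length; larger values give coarser shapes.
    """
    if not points:
        raise ValueError("points must be non-empty")
    rng = random.Random(seed)
    # Each point marks the grid cell (voxel) that contains it as occupied.
    occupied = {tuple(math.floor(c / voxel_size) for c in p) for p in points}
    # Collect faces whose neighbouring voxel is empty (assumed "surface").
    faces = []
    for voxel in sorted(occupied):
        for axis in range(3):
            for side in (0, 1):
                neighbour = list(voxel)
                neighbour[axis] += 1 if side else -1
                if tuple(neighbour) not in occupied:
                    faces.append((voxel, axis, side))
    samples = []
    for _ in range(n_samples):
        # All faces have equal area, so a uniformly chosen face plus a
        # uniform position on it is uniform over the exposed surface.
        voxel, axis, side = rng.choice(faces)
        point = [(voxel[k] + rng.random()) * voxel_size for k in range(3)]
        point[axis] = (voxel[axis] + side) * voxel_size
        samples.append(tuple(point))
    return samples
```

Increasing `voxel_size` merges more of the original points into each cube, which corresponds to the stronger local deformation described in the caption.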

Fig 7.

Accuracy of human participants and models on Lego-like point clouds with varying voxel sizes.

Error bars represent 95% confidence intervals around the mean accuracy, estimated across participants for human performance and across stimuli for model performance at each voxel size level.

Fig 8.

Confusion matrices for human responses, DGCNN predictions, and Point Transformer predictions in Experiment 2.

Rows represent true labels and columns represent predicted labels.

Fig 9.

Scrambled stimuli used in Experiment 3.

Each row shows five objects (airplane, car, chair, lamp, and table) in either the original (top row) or scrambled (bottom row) condition. Scrambling was performed separately for DGCNN and Point Transformer models while preserving part identity.

Fig 10.

Recognition performance of humans and models under scrambling manipulations.

Top panel: human classification accuracy for intact and scrambled point clouds. Middle panel: DGCNN model accuracy for intact and scrambled point clouds. Bottom panel: Point Transformer model accuracy. Error bars represent 95% confidence intervals around the mean accuracy, estimated across participants for human performance and across stimuli for model performance.
