Fig 1.
Example stimuli used in the experiment.
Each object is visualized as a sparse point cloud sampled from its 3D surface. Colors represent depth, with red indicating proximity and blue indicating distance. In the experiments, point clouds were displayed in black and presented as rotating GIFs.
Fig 2.
The architectures of DGCNN (top) and Point Transformer (bottom).
N: number of points in each point cloud. MLP: Multi-layer perceptron, consisting of multiple fully connected layers. ⊕: concatenation.
Table 1.
Comparison and key differences between a convolution-based model (DGCNN) and a transformer-based model (Point Transformer).
Fig 3.
Point cloud stimuli used in Experiment 1, generated by varying dot density and object orientation.
Colors represent depth, with red indicating proximity and blue indicating distance. In the experiments, the stimuli were presented as black points rotating in depth, viewed from a horizontal viewpoint.
Fig 4.
Accuracy of human participants and models as a function of point density for upright point clouds (top) and inverted point clouds (bottom).
Error bars represent 95% confidence intervals around the mean accuracy, estimated across participants for human performance and across stimuli for model performance at each point density level.
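As an illustration of how such error bars can be obtained, the following minimal sketch computes a mean and a 95% confidence interval from a list of accuracies. The caption does not say how the intervals were estimated, so the normal approximation (1.96 standard errors) is an assumption here, and the function name `mean_ci95` and the accuracy values are hypothetical.

```python
import math
import statistics


def mean_ci95(values):
    """Return (mean, lower, upper) for a normal-approximation 95% CI.

    Assumption: 1.96 * SEM; the source does not state the method used.
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values to estimate a CI")
    mean = statistics.fmean(values)
    # Standard error of the mean from the sample standard deviation
    sem = statistics.stdev(values) / math.sqrt(n)
    half_width = 1.96 * sem
    return mean, mean - half_width, mean + half_width


# Hypothetical per-participant accuracies at one condition level
accuracies = [0.80, 0.90, 0.70, 0.85, 0.75]
mean, lower, upper = mean_ci95(accuracies)
print(f"mean={mean:.3f}, 95% CI=[{lower:.3f}, {upper:.3f}]")
```

For human data the list would hold one accuracy per participant; for model data, one correctness score per stimulus. With small samples, a t-based interval would be wider than this approximation.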
Fig 5.
Confusion matrices for human responses, DGCNN predictions, and Point Transformer predictions in the upright condition of Experiment 1.
Rows represent true labels and columns represent predicted labels.
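The row and column convention can be made concrete with a short sketch that tallies responses into a matrix indexed as `matrix[true][predicted]`. The class names and responses below are illustrative only and are not data from the study; `confusion_matrix` is a hypothetical helper.

```python
def confusion_matrix(true_labels, predicted_labels, classes):
    """Count responses with rows = true labels and columns = predicted labels."""
    index = {name: i for i, name in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for true, predicted in zip(true_labels, predicted_labels):
        matrix[index[true]][index[predicted]] += 1
    return matrix


# Illustrative labels only (not results from the experiments)
classes = ["airplane", "car", "chair"]
true_labels = ["airplane", "airplane", "car", "chair", "chair"]
predicted = ["airplane", "car", "car", "chair", "airplane"]

for name, row in zip(classes, confusion_matrix(true_labels, predicted, classes)):
    print(f"{name:>8}: {row}")
```

Reading along a row shows how stimuli of one true category were distributed across response categories; the diagonal holds the correct responses.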
Fig 6.
Lego-like point cloud stimuli used in Experiment 2.
Points were uniformly sampled from voxel surfaces converted from point clouds, with varying voxel sizes determining the degree of local deformation. Larger voxel sizes produce greater local deformation.
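One way to picture this manipulation is the sketch below: it snaps a point cloud to a voxel grid, keeps only cube faces not shared with another occupied voxel, and samples points uniformly on those faces. This is an assumed reading of the caption, not the authors' pipeline; the function names, the exposed-face rule, and the example coordinates are all hypothetical.

```python
import math
import random

# The six cube faces as (axis, direction) pairs
FACES = [(0, -1), (0, 1), (1, -1), (1, 1), (2, -1), (2, 1)]


def voxelize(points, voxel_size):
    """Return the set of integer voxel indices occupied by at least one point."""
    return {tuple(math.floor(c / voxel_size) for c in p) for p in points}


def exposed_faces(voxels):
    """List (voxel, axis, direction) for faces with no occupied neighbor."""
    faces = []
    for voxel in sorted(voxels):
        for axis, direction in FACES:
            neighbor = list(voxel)
            neighbor[axis] += direction
            if tuple(neighbor) not in voxels:
                faces.append((voxel, axis, direction))
    return faces


def sample_voxel_surface(points, voxel_size, n_samples, seed=0):
    """Sample points uniformly over the exposed faces of the voxelized cloud."""
    rng = random.Random(seed)
    faces = exposed_faces(voxelize(points, voxel_size))
    samples = []
    for _ in range(n_samples):
        # All faces have equal area, so a uniform face choice is area-uniform
        voxel, axis, direction = rng.choice(faces)
        coords = [(voxel[k] + rng.random()) * voxel_size for k in range(3)]
        # Pin the coordinate along the face normal to the face plane
        coords[axis] = (voxel[axis] + (1 if direction > 0 else 0)) * voxel_size
        samples.append(tuple(coords))
    return samples


# Hypothetical two-point cloud; a larger voxel_size gives a blockier surface
cloud = [(0.1, 0.1, 0.1), (0.9, 0.2, 0.3)]
print(sample_voxel_surface(cloud, voxel_size=0.5, n_samples=3))
```

Increasing `voxel_size` merges nearby points into fewer, larger cubes, which removes fine local shape while keeping the coarse global outline.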
Fig 7.
Accuracy of human participants and models on Lego-like point clouds with varying voxel sizes.
Error bars represent 95% confidence intervals around the mean accuracy, estimated across participants for human performance and across stimuli for model performance at each voxel size level.
Fig 8.
Confusion matrices for human responses, DGCNN predictions, and Point Transformer predictions in Experiment 2.
Rows represent true labels and columns represent predicted labels.
Fig 9.
Scrambled stimuli used in Experiment 3.
Each row shows five objects (airplane, car, chair, lamp, and table) in either the original (top row) or scrambled (bottom row) condition. Scrambling was performed separately for DGCNN and Point Transformer models while preserving part identity.
Fig 10.
Recognition performance of humans and models under scrambling manipulations.
Top panel: human classification accuracy for intact and scrambled point clouds. Middle panel: DGCNN accuracy for intact and scrambled point clouds. Bottom panel: Point Transformer accuracy for intact and scrambled point clouds. Error bars represent 95% confidence intervals around the mean accuracy, estimated across participants for human performance and across stimuli for model performance.
Fig 11.
Accuracy of variants of Point Transformers as a function of point density.
Original = the original Point Transformer, NoAttn = attention removed, NoPE = position encoding removed, NoDs = downsampling removed.
Fig 12.
Accuracy of DGCNN variants as a function of point density.
DGCNN + DS = DGCNN with an additional downsampling layer after each EdgeConv layer.
Fig 13.
Correlation between model and human accuracy patterns across all experiment conditions.
Models with downsampling show significantly stronger alignment with human performance.
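For readers who want to reproduce this kind of alignment measure, the sketch below correlates per-condition accuracies of a model with those of human participants. The caption does not name the correlation measure, so Pearson's r is an assumption, and the function name and accuracy values are hypothetical placeholders rather than reported results.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical accuracies, one entry per experimental condition
human_acc = [0.95, 0.90, 0.70, 0.50, 0.85]
model_acc = [0.99, 0.80, 0.60, 0.30, 0.90]
print(f"r = {pearson_r(human_acc, model_acc):.3f}")
```

Each list holds one mean accuracy per condition, in the same condition order, so the coefficient reflects how similarly accuracy rises and falls across conditions rather than how high it is overall.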