
Hierarchical abstraction drives human-like 3-D shape processing in deep learning models

  • Shuhao Fu ,

    Roles Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Validation, Visualization, Writing – original draft, Writing – review & editing

    fushuhao@g.ucla.edu

    Affiliation Department of Psychology, University of California Los Angeles, Los Angeles, California, United States of America

  • Philip J. Kellman,

    Roles Conceptualization, Supervision, Validation, Writing – review & editing

    Affiliation Department of Psychology, University of California Los Angeles, Los Angeles, California, United States of America

  • Hongjing Lu

    Roles Conceptualization, Funding acquisition, Investigation, Methodology, Supervision, Validation, Writing – original draft, Writing – review & editing

    Affiliations Department of Psychology, University of California Los Angeles, Los Angeles, California, United States of America, Department of Statistics, University of California Los Angeles, Los Angeles, California, United States of America


This is an uncorrected proof.

Abstract

Both humans and deep learning models can recognize objects from 3D shapes depicted with sparse visual information, such as a set of points randomly sampled from the surfaces of 3D objects (termed a point cloud). Although deep learning models achieve human-like performance in recognizing objects from 3D shapes, it remains unclear whether these models develop 3D shape representations similar to those used by human vision for object recognition. Evidence suggests that training with about 10,000 object instances enables models to acquire representations of local geometric structures in 3D shapes. We hypothesize, however, that their representations of 3D global shapes are still limited. To test this hypothesis, we conducted three human experiments systematically manipulating point density and object orientation (Experiment 1), local geometric structure (Experiment 2), and part configuration (Experiment 3). Human performance was stable across conditions in the first two experiments, but declined significantly in the part-scrambled condition of the final experiment. We compared human performance with two types of deep learning architectures: convolution-based models (e.g., DGCNN) and transformer-based models (e.g., Point Transformer). The transformer-based models more closely captured human performance patterns across experimental conditions. Ablation simulations revealed that this advantage is largely driven by progressive downsampling operations that enable hierarchical abstraction of 3D shapes.

Author summary

Humans can recognize 3D objects at a glance, even when they are depicted only as sparse sets of dots sampled from their surfaces, known as point clouds. We asked whether modern deep learning systems rely on the same kind of shape representations as humans or achieve recognition in different ways. This study combined human experiments with model evaluations. Participants viewed point cloud objects while we made recognition progressively more difficult by reducing the number of dots, flipping objects upside down, distorting local geometric properties, or scrambling parts into new configurations. Humans remained highly accurate in most cases but struggled when the part configuration was disrupted, highlighting a strong dependence on global 3D shape. We then compared two leading deep learning models, and identified the critical computational components responsible for achieving human-like performance. Our results showed that progressive downsampling, which constructs increasingly abstract shape representations, is the primary factor underlying human-like robustness, whereas attention mechanisms contribute only secondarily.

Introduction

Objects in the natural world possess physical properties such as geometric shape, volume, and material composition. The human visual system is highly efficient at extracting these properties from visual input, often within a brief glance at an image. Among the various object attributes, the ability to perceive and recognize three-dimensional (3D) shape is widely regarded as fundamental for everyday behaviors such as navigation, object manipulation, and interaction with the external environment. A substantial body of research [1–5] has demonstrated that human object recognition does not merely rely on memorizing collections of two-dimensional (2D) retinal images across viewpoints. Instead, humans construct internal 3D representations of objects that support robust recognition. 3D object representations provide an object-centered description of global shape by encoding the spatial relations among an object’s features, often referred to as a structural description. When this structural information is available in the visual input, perception based on global shape remains robust to variations in object appearance arising from changes in viewpoint, occlusion, illumination, and other imaging conditions.

The importance of 3D shape perception is further underscored by its early emergence in development. Sensitivity to 3D structure is found in human infancy [6], suggesting that mechanisms for representing 3D shape from visual input are present early in life. As development progresses, toddlers between 18 and 24 months exhibit a pronounced “shape bias” in word learning, increasingly generalizing object names based on shape rather than texture, color, or other perceptual features as their vocabularies expand [7]. Although sensitivity to texture and other visual cues continues to mature, these features become secondary to shape in guiding object recognition and naming. Notably, even when texture information is entirely absent from the visual input, humans are still capable of accurately recognizing objects based solely on their 3D geometric structure.

A striking demonstration of this capacity is human recognition of objects presented as point clouds, consisting of discrete points sampled along object surfaces (see Fig 1). Despite the absence of continuous contours, shading, and texture, humans readily recognize 3D objects from such minimal visual information [8–12]. These behavioral findings highlight the robustness and flexibility of human 3D shape perception. Meanwhile, neuroscience research indicates that shape representations emerge through a hierarchical sequence of processing stages along the ventral visual pathway in the brain. This pathway extends from posterior occipitotemporal regions of the inferior temporal cortex (IT), including the lateral occipital cortex (LO), to more anterior regions encompassing the fusiform gyrus. Object-related information is ultimately transmitted to the anterior temporal lobe (ATL), where it is integrated into multimodal semantic representations of objects [13,14]. While prior research has emphasized the contribution of the ventral pathway to object recognition, more recent findings point to a more sophisticated, distributed network underlying global shape processing. The ventral pathway appears to be specialized for extracting local shape features and supporting object recognition based on these features. However, recognition of objects defined by global shape depends on interactions between the dorsal and ventral pathways, which together facilitate the formation of 3D object representations that are robust to variations in viewpoint and occlusion [15].

Fig 1. Example stimuli used in the experiment.

Each object is visualized as a sparse point cloud sampled from its 3D surface. Colors represent depth, with red indicating proximity and blue indicating distance. In the experiments, point clouds were displayed in black and presented as rotating GIFs.

https://doi.org/10.1371/journal.pcbi.1014047.g001

In parallel with advances in behavioral and neuroscience research on 3D shape perception, recent progress in deep learning has led to the development of specialized architectures for object recognition from 3D point clouds. Two major classes of models have emerged. The first class includes graph-based architectures such as Dynamic Graph CNN (DGCNN), which extend graph neural network approaches by dynamically constructing local neighborhoods in feature space to learn local geometric structure [16]. Earlier models such as PointNet introduced foundational methods by learning spatial features from raw 3D data of point clouds [17]. DGCNN-based models have demonstrated human-like performance across a range of 3D object recognition tasks.

The second class comprises transformer-based architectures adapted for 3D data of point clouds, including the Point Transformer [18], which leverage self-attention mechanisms to capture relations among points. These transformer-based models likewise achieve human-level performance in 3D object recognition. Despite their comparable behavioral performance, it remains unclear whether these different model classes develop internal representations of 3D shape that are similar to, or distinct from, those employed by humans.

Previous research on image-based object recognition has shown that pre-trained deep convolutional neural networks (DCNNs) [19–21] can acquire internal representations that differ from those of humans, despite exhibiting similar behavioral performance in typical testing conditions. For example, Baker et al. [22] found that DCNNs struggled to classify objects in images based on their 2D global shape. In one experiment, they presented CNNs with objects that preserved the 2D global shape but were filled with textures from other objects. The networks showed a strong bias for classifying based on textures rather than shapes. Further experiments revealed that CNNs could not reliably classify objects based on outlines alone, indicating a reliance on local features rather than global shapes. In a separate experiment, these investigators found that adding minute serrations along the bounding contours of silhouettes that networks could otherwise classify correctly reduced network classification to below-chance performance. In contrast, human object classification was not disrupted by this manipulation. These findings suggest that while CNNs can access local shape features in images, they do not form the global shape representations crucial for human-like object recognition ([23]; see [24] for a review).

To address whether deep learning models trained to recognize 3D objects from point clouds acquire representations similar to or different from those of humans, it is necessary to combine systematic experimentation with model ablation and intervention approaches to identify the core computational components underlying 3D shape recognition in both humans and models. In this paper we compared 3D object recognition in humans and deep learning models. Through a set of experimental manipulations, we analyzed how both humans and models recognize 3D objects, particularly in challenging conditions where local and global shape features were manipulated. This included disrupting local geometric properties, presenting objects from unusual viewpoints, and varying the global configuration of object parts. We then tested the two types of deep learning models using the same tasks, and conducted ablation/intervention studies to identify the core computational mechanisms underlying 3D object recognition in the DNN models.

Modeling methods

Dataset

We used stimuli selected from the publicly available point cloud dataset, ModelNet40 [25]. The ModelNet40 dataset includes a set of 3D CAD models from 40 object categories. There are 12,311 3D CAD objects in total, with 9,843 objects used for training and 2,468 for testing. The point clouds consist of 1,024 points sampled uniformly from the surface of each 3D CAD model.

Deep learning models

We evaluated two deep learning models for 3D object recognition from point cloud data: a convolution-based model (DGCNN; [16]) and a transformer-based model (Point Transformer; [18]). Both models take the 3D coordinates of points sampled from a 3D shape and generate feature embeddings for object classification. A more detailed illustration of the model architectures is shown in Fig 2, and a summary comparison between the two models is provided in Table 1.

Fig 2. The architectures of DGCNN (top) and Point Transformer (bottom).

N: number of points in each point cloud. MLP: Multi-layer perceptron, consisting of multiple fully connected layers. ⊕: concatenation.

https://doi.org/10.1371/journal.pcbi.1014047.g002

Table 1. Comparison and key differences between a convolution-based model (DGCNN) and a transformer-based model (Point Transformer).

https://doi.org/10.1371/journal.pcbi.1014047.t001

DGCNN processes point clouds as graphs using EdgeConv layers, which extract local geometric features by comparing each point to its neighbors in feature space. It is important to note that the neighbor points are defined in the feature space of each layer, not by their positions in 3D space. Unlike traditional CNNs that operate on regular grids in 2D images or 3D space, DGCNN dynamically updates the neighborhood structure in each layer, enabling it to capture complex 3D geometric properties through local feature aggregation and global max pooling. Our implementation used 1,024 points and 20 nearest neighbors, based on the publicly available code [26].
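The EdgeConv operation described above can be sketched in a few lines of NumPy. This is an illustrative simplification, not the DGCNN implementation: a single linear map plus ReLU stands in for the shared MLP, and `weight` is a hypothetical parameter matrix.

```python
import numpy as np

def edge_conv(features, k, weight):
    """Simplified EdgeConv step: for each point, gather its k nearest
    neighbors in *feature* space (not 3D space), form edge features
    (x_i, x_j - x_i), apply a shared linear map + ReLU, and max-pool
    over the neighborhood."""
    # Pairwise distances in feature space.
    d = np.linalg.norm(features[:, None, :] - features[None, :, :], axis=-1)
    knn = np.argsort(d, axis=1)[:, 1:k + 1]             # exclude the point itself
    center = np.repeat(features[:, None, :], k, axis=1)  # x_i, repeated per neighbor
    neigh = features[knn]                                # x_j
    edge = np.concatenate([center, neigh - center], -1)  # (n, k, 2f)
    out = np.maximum(edge @ weight, 0.0)                 # shared "MLP" + ReLU
    return out.max(axis=1)                               # max over neighbors
```

Because the neighborhoods are recomputed from the current features at every layer, points that are distant in 3D but similar in feature space can become neighbors, which is the key difference from grid-based convolution.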

Point Transformer employs a self-attention mechanism inspired by transformer architectures in natural language processing and computer vision [27,28]. Each Point Transformer layer integrates information from neighboring points using attention weights based on spatial proximity and feature similarity. Transition Down layers progressively reduce the number of points in the input set, enabling hierarchical abstraction of point cloud representations. These layers select a representative subset of points via farthest point sampling [29] and aggregate local features from their k-nearest neighbors, allowing the network to capture increasingly coarse-grained geometric and semantic structures across layers. This downsampling process is analogous to hierarchical visual processing in the human visual system, where increasingly abstract and spatially compressed representations are formed at successive stages. In our implementation, we used the number of nearest neighbors specified in the original paper, and based our code on the publicly available implementation [30].
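Farthest point sampling, the point-selection step of the Transition Down layers, can be sketched as follows. This is a minimal NumPy version of the standard algorithm, not the authors' code:

```python
import numpy as np

def farthest_point_sampling(points, m):
    """Select m representative points: start from the first point, then
    repeatedly add the point farthest from the set chosen so far."""
    n = points.shape[0]
    chosen = np.zeros(m, dtype=int)
    # Distance from every point to the nearest already-chosen point.
    dist = np.full(n, np.inf)
    for i in range(m):
        if i > 0:
            chosen[i] = np.argmax(dist)  # farthest remaining point
        d = np.linalg.norm(points - points[chosen[i]], axis=1)
        dist = np.minimum(dist, d)
    return chosen
```

Because each new point maximizes its distance to the current subset, the sampled points spread evenly over the shape, preserving coarse global structure even as the point count drops sharply.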

While both models operate on point cloud inputs, DGCNN emphasizes local geometric structure and information pooling from neighbors that share similar shape features, whereas Point Transformer combines hierarchical pooling and attention to capture context-dependent relations between a point and its neighbor points (defined in 3D space). To align with human experiments, we trained both models on the ModelNet40 dataset and extracted logits corresponding to the ten object categories shown to participants, enabling direct comparisons of recognition performance. Both models were trained using the same data augmentation procedures, including random point dropout, random scaling, and random shifting of the input point clouds.

Training process

To ensure a fair comparison among models, we adopted the data augmentation protocol implemented in the original Point Transformer repository [30] during training. Specifically, each point cloud was first subjected to random point dropout, where the dropout ratio, defined as the proportion of points to be removed, was uniformly sampled from the range . The selected points were replaced by the coordinates of the first point in the cloud. Following this, each point cloud was uniformly scaled by a random factor drawn from the interval and randomly translated along the x, y, and z axes, with the translation magnitude for each axis independently sampled from a uniform distribution over .
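A sketch of this augmentation pipeline is below. The numeric ranges did not survive in the text above, so the defaults here (`max_dropout`, `scale_lo`, `scale_hi`, `shift_max`) are placeholders commonly seen in public point cloud repositories, not values taken from the paper:

```python
import numpy as np

def augment(points, rng, max_dropout=0.875, scale_lo=0.8, scale_hi=1.25,
            shift_max=0.1):
    """Training-time augmentation sketch: random point dropout, random
    uniform scaling, and random per-axis translation. All numeric ranges
    are placeholder assumptions, not values from the paper."""
    pts = points.copy()
    # Dropout: replace a random subset of points with the first point.
    ratio = rng.uniform(0.0, max_dropout)
    mask = rng.random(len(pts)) < ratio
    pts[mask] = pts[0]
    # Uniform scale, then an independent shift along each axis.
    pts *= rng.uniform(scale_lo, scale_hi)
    pts += rng.uniform(-shift_max, shift_max, size=3)
    return pts
```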

For all models, training was conducted using the Adam optimizer with an initial learning rate of . Each model was trained for 200 epochs, and a step learning rate scheduler was employed to progressively reduce the learning rate during training, decreasing it by a factor of 0.3 every 50 epochs.
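The resulting learning-rate schedule is simple to express in closed form; `base_lr` is a placeholder, since the paper's initial value did not survive extraction:

```python
def learning_rate(epoch, base_lr, decay=0.3, step=50):
    """Step schedule described in the text: multiply the learning rate
    by `decay` every `step` epochs (epochs 0-49 use base_lr, 50-99 use
    base_lr * 0.3, and so on)."""
    return base_lr * decay ** (epoch // step)
```

Over the 200 training epochs this yields four plateaus, ending at 0.3³ ≈ 0.027 of the initial rate.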

Results

General procedure

We used the same general procedure across all human experiments unless otherwise specified. Participants viewed a point cloud object in each trial. The stimulus was displayed for 3 seconds, after which ten buttons, each labeled with a different object name, appeared for selection. Their task was to select the object category that best matched the presented point cloud object. The ten object categories were airplane, bottle, bowl, chair, cup, lamp, person, piano, stool, and table. Each point cloud stimulus was presented as a GIF rotating 10 degrees per frame around the vertical axis. The GIF was displayed at 10 frames per second, completing a full 360-degree rotation in 3.6 seconds.
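The rotating display can be reproduced by applying a rotation about the vertical axis in 10-degree steps, giving 36 frames per full revolution. The sketch below assumes the y axis is vertical, a convention the text does not specify:

```python
import numpy as np

def rotation_frames(points, step_deg=10):
    """Per-frame point positions for a GIF rotating about the vertical
    (assumed y) axis: 360 / step_deg frames per full revolution."""
    frames = []
    for k in range(360 // step_deg):
        t = np.deg2rad(k * step_deg)
        # Rotation matrix about the y axis.
        r = np.array([[np.cos(t), 0.0, np.sin(t)],
                      [0.0, 1.0, 0.0],
                      [-np.sin(t), 0.0, np.cos(t)]])
        frames.append(points @ r.T)
    return frames
```

At 10 frames per second, the 36 frames complete the 360-degree rotation in 3.6 seconds, matching the display described above.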

Participants first completed a practice trial showing a rotating point cloud of a plant. They had to select the correct object category before proceeding to the experimental trials. If participants selected a wrong object category during practice, then the practice trial was repeated, and a hint message was displayed below the ten category buttons, directing participants to select the “Plant” button. This practice trial aimed to familiarize participants with the point-cloud display and ensure they understood the recognition task. The object category, plant, in the practice trial was not included in the subsequent experimental trials, and the plant stimulus was not subjected to any experimental manipulations of point density, orientation, or deformation.

The experimental trials were similar to the practice trial, except that no feedback was provided. At the end of the experiment, demographic information was collected, and participants were presented with debriefing information about the study.

Experiment 1: Point density and object orientation

In the first experiment, we investigated the recognition performance of both human participants and neural network models by manipulating: (1) point density in point cloud displays and (2) object orientation. By combining these conditions, we examined the robustness of human and model recognition across different levels of local detail and atypical viewpoints.

Participants.

This study involved direct recruitment of human participants. Two groups of participants were recruited through the UCLA Subject Pool. For the upright condition, 56 participants were recruited (45 female, 11 male), with one participant excluded for reporting a lack of seriousness, resulting in a final sample of 55 participants (mean age = 19.8, SD = 1.4). For the inverted condition, 47 participants were recruited (40 female, 7 male), with a mean age of 20.4 years (SD = 1.5). The average completion time for both conditions was approximately 10 minutes, with slight variation between groups.

Stimuli.

The stimuli were selected from the test set of the ModelNet40 dataset to ensure a fair comparison between human participants and deep neural network (DNN) models. We selected 7 object instances from each of the ten categories; hence, the experiment included 70 different 3D object shapes. Each object stimulus was transformed by varying (1) point density: each point cloud was randomly downsampled to one of seven proportions (20%, 30%, 40%, 50%, 60%, 80%, and 100% of 1,024 points); and (2) object orientation: upright vs. inverted. Therefore, we generated 7 objects × 7 point densities × 10 categories = 490 stimuli for the upright condition and 490 stimuli for the inverted condition. Example stimuli with different densities and orientations are shown in Fig 3. Each point cloud stimulus was presented as a GIF rotating 10 degrees per frame around the vertical axis. The GIF was displayed at 10 frames per second, completing a full 360-degree rotation in 3.6 seconds.
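The two stimulus manipulations can be sketched as follows. Inversion is implemented here as negating the vertical axis (assumed to be y, which the text does not specify):

```python
import numpy as np

def make_stimulus(points, proportion, inverted, rng):
    """Experiment 1 stimulus sketch: keep a random `proportion` of the
    points, and optionally invert the object by negating the vertical
    (assumed y) axis."""
    n_keep = int(round(len(points) * proportion))
    idx = rng.choice(len(points), size=n_keep, replace=False)
    pts = points[idx].copy()
    if inverted:
        pts[:, 1] = -pts[:, 1]  # flip upside down
    return pts
```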

Fig 3. Point cloud stimuli used in Experiment 1 by varying dot density and object orientation.

Colors represent depth, with red indicating proximity and blue indicating distance. In the experiments, the stimuli were presented as black points with rotation in depth, viewed from a horizontal viewpoint.

https://doi.org/10.1371/journal.pcbi.1014047.g003

In the experiment, the participants were randomly assigned to either the upright or inverted conditions. Each participant viewed one object exemplar only once, with a random permutation of dot density. In other words, each participant viewed all seven objects from one category, and each object instance was displayed only once with randomly assigned point density. This resulted in a total of 70 trials per participant. The trials were randomized for each participant. For the models, we used all combinations, 490 objects x 2 conditions = 980 stimuli, for testing.

We adopted a between-subject design for orientation to avoid presenting the same objects in both upright and inverted positions, which could introduce familiarity effects and substantially increase the number of trials. In contrast, point density was varied within subjects to efficiently assess recognition robustness to visual sparsity while keeping the experiment length manageable. Each object exemplar was presented only once, and different exemplars from the same category were used across density levels, minimizing potential bias from prior exposure. We recognize that an alternative design—assigning distinct sets of objects to upright and inverted conditions—could also control for cross-condition familiarity. However, we prioritized a between-subject design for orientation to ensure clear interpretability of orientation effects and maintain a practical experiment duration.

Results.

The results of the experiment are presented in Fig 4. Despite the sparse information provided in point cloud displays, human participants consistently demonstrated high accuracy across all levels of point density. Their performance ranged from 86.2% to 95.3%, with only a slight decline as the point density decreased. For instance, when the number of points was reduced to 20% of the original points, the mean accuracy was 86.4% (CI = [83.5%, 89.2%]), and at 30%, the mean accuracy was 88.9% (CI = [86.3%, 91.5%]). This result suggests that humans are highly resilient to reduced point density, maintaining reliable recognition performance even with sparse displays of 3D objects.

Fig 4. Accuracy of human participants and models as a function of point density for upright point clouds (top) and inverted point clouds (bottom).

Error bars represent 95% confidence intervals around the mean accuracy, estimated across participants for human performance and across stimuli for model performance at each proportion level.

https://doi.org/10.1371/journal.pcbi.1014047.g004

A mixed-design ANOVA was conducted with point density as a within-subject factor and orientation (upright vs. inverted) as a between-subject factor, with mean recognition accuracy as the dependent variable. The analysis revealed a significant main effect of orientation, , , indicating that participants in the upright condition achieved higher accuracy than those in the inverted condition. There was also a significant main effect of point density, , , showing that recognition performance varied across density levels, with lower accuracy for sparser point clouds. Moreover, the interaction between orientation and point density was significant, , , suggesting that the decline in recognition accuracy at lower point densities was more pronounced for inverted objects. These results indicate that both orientation and point density strongly influence recognition performance, and that orientation modulates the effect of density on visual recognition.

The performance of the DGCNN model, however, was markedly affected by the reduction in point density. While the model’s accuracy approached human performance at high point densities (above 30%), it declined sharply at lower densities. Specifically, the accuracy dropped to 64.3% at 30% point density and 48.6% at 20%. In contrast, the Point Transformer model exhibited greater robustness to displays with low point density. For upright displays, its accuracy remained high across all levels of point density, ranging from 94.3% at 50% to 87.1% at 20%, showing similar performance robustness as humans.

To further examine error structure beyond overall accuracy, we compared the category-level confusion patterns of human participants with those produced by DGCNN and the Point Transformer (Fig 5). We quantified similarity by computing Pearson correlations between the off-diagonal elements of the confusion matrices (i.e., considering only misclassification patterns and excluding correct responses). The Point Transformer showed a strong correspondence with human error patterns (, ), whereas DGCNN exhibited a weaker, though still significant, correspondence (, ). These results indicate that the Point Transformer more closely captures the structure of human category confusions than DGCNN.
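The confusion-matrix comparison, correlating only the off-diagonal (misclassification) entries, can be computed as:

```python
import numpy as np

def offdiag_correlation(cm_a, cm_b):
    """Pearson correlation between the off-diagonal entries of two
    confusion matrices, i.e. comparing only misclassification patterns
    and excluding correct responses on the diagonal."""
    mask = ~np.eye(cm_a.shape[0], dtype=bool)
    a = cm_a[mask].astype(float)
    b = cm_b[mask].astype(float)
    return np.corrcoef(a, b)[0, 1]
```

Excluding the diagonal matters because correct responses dominate both matrices; including them would inflate the correlation regardless of whether the errors themselves align.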

Fig 5. Confusion matrices for human responses, DGCNN predictions, and Point Transformer predictions in Experiment 1 upright condition.

Rows represent true labels and columns represent predicted labels.

https://doi.org/10.1371/journal.pcbi.1014047.g005

In the inverted condition, human participants showed an inversion effect, with lower accuracy than in the upright condition but still significantly above chance. Furthermore, humans consistently outperformed the deep learning models for the inverted objects, highlighting their adaptability to changes in viewpoint. Unsurprisingly, the models performed significantly worse with inverted 3D objects, as such stimuli were absent from their training set. This illustrates a general limitation of deep learning models: their strong reliance on the training distribution. It is worth noting, however, that the Point Transformer model performed better than the DGCNN model overall and also showed a smaller reduction in the inverted condition, particularly at lower point densities.

Experiment 2: Lego-like point clouds

In Experiment 2, we systematically deformed the local geometric features of point clouds while preserving their global shapes. This was achieved by converting point clouds into voxel grid displays, analogous to reducing the resolution of an image. A larger voxel size leads to lower spatial resolution in the point cloud and increased local deformation. The process involved generating a voxel grid from the point cloud, sampling points on the voxel surfaces, and normalizing the sampled points. The stimuli section below details the methodology and the corresponding implementation.

The idea of introducing Lego-like point clouds was inspired by the sawtooth images of Baker et al. [22], in which sawtooth edges were added to silhouette images to disrupt local contour features. In our 3D point cloud displays, we similarly aimed to disrupt local 3D curvatures by converting point clouds into Lego-like displays while keeping the global shape almost unchanged.

Participants.

This study involved direct recruitment of human participants. A total of 60 participants were recruited through the university’s Subject Pool. The sample comprised 49 females, 9 males, 1 non-binary individual, and 1 participant who preferred not to disclose their gender. The mean age of the participants was 20.6 years (SD = 3.9). The average completion time for the experiment was 6.3 minutes (SD = 2.3).

Stimuli.

We first converted each point cloud into a voxel grid representation using the Open3D library. A point cloud consists of discrete data points capturing an object’s surface geometry, while voxels are small cubic units dividing 3D space into a regular grid, analogous to pixels in 2D images but extending into 3D space. To convert a point cloud into a voxel grid, we superimposed a voxel grid over the point cloud, determined voxel occupancy by checking which voxels contained points, and then created a structured, blocky representation of the object.

After obtaining the voxel grid display of the point clouds, we sampled points uniformly on the surface of this grid. These sampled points were aggregated to form a new point cloud constrained by the voxel grid. Varying the voxel size allowed us to sample the point cloud at different deformation resolutions. A visualization of the Lego-like point cloud stimuli across different voxel sizes is presented in Fig 6.
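The voxelization-and-resampling procedure (performed in the paper with the Open3D library) can be approximated in plain NumPy as follows. This is a conceptual sketch of the pipeline, not the authors' implementation:

```python
import numpy as np

def lego_point_cloud(points, voxel_size, n_samples, rng):
    """Lego-like stimulus sketch: quantize points to a voxel grid, then
    sample points uniformly on the faces of occupied voxels."""
    # Occupied voxels: unique integer grid coordinates of the points.
    occ = np.unique(np.floor(points / voxel_size).astype(int), axis=0)
    vox = occ[rng.integers(len(occ), size=n_samples)]  # pick a voxel per sample
    face = rng.integers(6, size=n_samples)             # pick one of 6 cube faces
    uv = rng.random((n_samples, 2))                    # position within the face
    axis = face // 2                                   # which axis the face fixes
    side = (face % 2).astype(float)                    # near (0) or far (1) face
    out = np.empty((n_samples, 3))
    for i in range(n_samples):
        coords = np.empty(3)
        free = [a for a in range(3) if a != axis[i]]
        coords[axis[i]] = side[i]
        coords[free[0]], coords[free[1]] = uv[i]
        out[i] = (vox[i] + coords) * voxel_size        # back to world coordinates
    return out
```

Larger `voxel_size` values coarsen the blocks, reproducing the increasing local deformation shown in Fig 6 while the overall silhouette of the object is retained.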

Fig 6. Lego-like point cloud stimuli used in Experiment 2.

Points were uniformly sampled from voxel surfaces converted from point clouds, with varying voxel sizes determining the degree of local deformation. Higher voxel size indicates more local deformation.

https://doi.org/10.1371/journal.pcbi.1014047.g006

The resulting point cloud from the voxel sampling was then normalized to center it at the origin and scale it to fit within a unit sphere. This normalization did not affect the GIFs presented to human participants but was crucial for ensuring a spatial scale consistent with the training set for the machine learning models.
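A minimal version of this normalization, assuming centroid centering (the text does not specify which center was used):

```python
import numpy as np

def normalize_to_unit_sphere(points):
    """Center the cloud at the origin (centroid assumed) and scale it so
    the farthest point lies on the unit sphere."""
    pts = points - points.mean(axis=0)
    radius = np.linalg.norm(pts, axis=1).max()
    return pts / radius
```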

For the stimuli, we used the same 70 objects from 10 categories (that is, 7 object instances for each category) as in Experiment 1. Each object was sampled with four different voxel sizes: 0.01, 0.05, 0.1, and 0.2. Four instances of objects were randomly sampled from one category, and each instance was randomly assigned a voxel size, which made up a total of 40 stimuli per participant.

Results.

The results of Experiment 2 are depicted in Fig 7. Human participants demonstrated relatively stable accuracy within a range of small local deformation, and performance then gradually decreased with greater deformation at larger voxel sizes. Human performance remained almost the same when voxel size increased from 0.01 (mean accuracy ) to 0.05 (mean accuracy ). Accuracy then gradually declined as the voxel size increased to 0.1 (mean accuracy ) and 0.2 (mean accuracy ). This suggests that human participants can effectively recognize objects even when local geometric features are deformed, provided the global shape remains intact.

Fig 7. Accuracy of human participants and models on Lego-like point clouds with varying voxel sizes.

Error bars represent 95% confidence intervals around the mean accuracy, estimated across participants for human performance and across stimuli for model performance at each voxel size level.

https://doi.org/10.1371/journal.pcbi.1014047.g007

Both models achieved performance levels comparable to human observers at small voxel sizes, suggesting that they generalize well under small local deformations. However, as voxel size increased, performance began to decline for both models. Notably, the Point Transformer exhibited a gradual decline in accuracy that closely mirrored the human trend, suggesting that it captures a similar sensitivity to degradation in local 3D curvatures. In contrast, DGCNN maintained stable performance up to voxel size 0.1 but showed a sharp drop at 0.2, indicating a threshold-like effect leading to brittle performance. Specifically, DGCNN’s accuracy dropped from 95.71% at voxel size 0.1 to 74.29% at voxel size 0.2.

We further analyzed the error patterns by visualizing the confusion matrices for human participants, the DGCNN model, and the Point Transformer model, as shown in Fig 8. To quantify the similarity between human and model error patterns, we computed Pearson correlations between the off-diagonal elements of the confusion matrices (i.e., excluding correct classifications). The Point Transformer exhibited a moderate and statistically significant correlation with human error patterns (, ), indicating a stronger alignment with human perceptual judgments at the category level. In contrast, DGCNN showed a weaker but still significant correlation with human confusion (, ), suggesting some overlap in error patterns but a less human-like profile overall.

Fig 8. Confusion matrices for human responses, DGCNN predictions, and Point Transformer predictions in Experiment 2.

Rows represent true labels and columns represent predicted labels.

https://doi.org/10.1371/journal.pcbi.1014047.g008

Experiment 3: part-scrambled point clouds

In Experiment 3, we disrupted the global shape properties while preserving the local geometric features of point clouds, in contrast to Experiment 2, which disrupted local curvatures but preserved global 3D shapes. Specifically, we used the part segmentation from the ShapeNet dataset [31] (ShapeNet-Part), which provides part labels for each point cloud, e.g., the wings, engines, body, and tail of an airplane. These parts were then randomly scrambled by placing them in different spatial locations to generate new point clouds that preserved local geometric features (parts) while rendering the global shape almost unrecognizable. If both human participants and deep learning models relied mainly on local geometric cues for object recognition, the part-scrambling manipulation would have no impact on recognition performance. Conversely, if recognition were based on global 3D shapes, this manipulation would lead to a substantial reduction in performance.

Participants.

This study involved direct recruitment of human participants. A total of 89 participants were recruited through the university’s Subject Pool, with 4 participants excluded for reporting a lack of seriousness. The final sample comprised 85 participants, including 69 females, 14 males, and 2 participants who preferred not to disclose their gender. The mean age of the participants was 20.2 years (SD = 1.74). The average completion time for the experiment was 20.56 minutes (SD = 9.73).

Stimuli.

To generate part-scrambled point clouds, we used the part segmentation annotations provided by the ShapeNet-Part dataset [31], which is a subset of the larger ShapeNet repository and includes fine-grained part labels for 3D models in certain object categories. Since the object categories in ShapeNet-Part differed from those in the ModelNet40 dataset used to train our models, we focused on five overlapping categories: airplane, car, chair, lamp, and table.

For each object, we started with the part segments provided by the ShapeNet-Part dataset. If any individual part segment occupied more than 30% of the total point cloud (e.g., the body of a car), we further subdivided that segment using spectral clustering. Specifically, we applied spectral clustering with 3 clusters and a nearest-neighbor affinity to break large segments into finer parts. The number of part segments ranged from 3 to 12 for the experimental stimuli.

After segmentation, each part was subjected to random transformations to disrupt the global configuration while preserving local geometric structure. One part was randomly designated as the anchor and centered at the origin by subtracting its centroid; this anchor remained fixed. Each remaining part was (1) centered by subtracting its own centroid, and (2) displaced by adding a random 3D offset sampled uniformly from a cube of side length (i.e., in the range , with ). Finally, the entire scrambled point cloud was globally normalized to fit within a unit sphere to match the input scale used during model training.
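The scrambling procedure above can be sketched as follows (an illustrative reimplementation, not the exact stimulus code; the `offset` value is a placeholder, as the displacement range is specified in the text, and the function name is ours):

```python
import numpy as np

def scramble_parts(points, labels, offset=0.5, rng=None):
    """Scramble the parts of a segmented point cloud.

    points: (N, 3) array; labels: (N,) integer part labels.
    offset: half side length of the cube from which random
            displacements are drawn (placeholder value).
    """
    rng = np.random.default_rng() if rng is None else rng
    out = points.copy()
    parts = np.unique(labels)
    anchor = rng.choice(parts)                  # one part stays fixed
    for p in parts:
        m = labels == p
        out[m] -= out[m].mean(axis=0)           # center each part at the origin
        if p != anchor:
            out[m] += rng.uniform(-offset, offset, size=3)
    # Globally normalize to the unit sphere, matching the training input scale.
    out -= out.mean(axis=0)
    out /= np.linalg.norm(out, axis=1).max()
    return out
```

Each part keeps its internal point-to-point geometry (only a rigid translation is applied), so local features survive while the part configuration is destroyed.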

Given the vast space of possible part permutations, not all scrambled variants are informative or recognizable by models. To ensure experimental control, we selected part-scrambled point clouds for which each model (DGCNN or Point Transformer) produced the same classification label as for the original, unscrambled point cloud.

For each model, we randomly selected 10 original point clouds from each of the five object categories, resulting in 50 object instances per model. For each object instance, we generated one part-scrambled version, yielding 50 part-scrambled stimuli. This resulted in 100 stimuli per model (50 original + 50 scrambled). Across the two models, the total number of stimuli was 200 (see Fig 9 for examples).

Fig 9. Scrambled stimuli used in Experiment 3.

Each row shows five objects (airplane, car, chair, lamp, and table) in either the original (top row) or scrambled (bottom row) condition. Scrambling was performed separately for DGCNN and Point Transformer models while preserving part identity.

https://doi.org/10.1371/journal.pcbi.1014047.g009

Results.

The results of Experiment 3 are presented in Fig 10. Humans performed near ceiling on intact point clouds (mean accuracy: , 95% CI = ), with consistently high accuracy across all five object categories. However, recognition accuracy dropped sharply when global shape was disrupted by part scrambling. For stimuli scrambled according to DGCNN predictions, accuracy fell to (CI = ). For stimuli scrambled based on Point Transformer predictions, accuracy was slightly higher at (CI = ).

Fig 10. Recognition performance of humans and models under scrambling manipulations.

Top panel, Human classification accuracy for intact and scrambled point clouds. Middle panel, DGCNN model accuracy for intact and scrambled point clouds. Bottom panel, Point Transformer model accuracy. Error bars represent 95% confidence intervals around the mean accuracy, estimated across participants for human performance and across stimuli for model performance.

https://doi.org/10.1371/journal.pcbi.1014047.g010

At the category level, the largest decrease was observed for tables, which dropped from in the intact condition to (DGCNN-scrambled) and (Point Transformer–scrambled). Significant declines were also observed for the car and chair categories, while airplanes showed a relatively smaller performance reduction (approximately ) in the part-scrambling conditions.

We next assessed model robustness by testing each model on scrambled stimuli generated from the other model’s predictions. Fig 10 middle and bottom panels show recognition performance from DGCNN and Point Transformer, respectively. DGCNN achieved an average accuracy of on Point Transformer–scrambled inputs, whereas Point Transformer performed notably worse, achieving only on DGCNN-scrambled inputs. This asymmetry suggests that Point Transformer relies more heavily on global spatial structure for object classification and is more sensitive to disruptions in overall shape configuration. In contrast, DGCNN appears more robust to such disruptions, likely due to its bias toward local geometric features.

To further assess the alignment between model and human behavior, we computed Pearson correlations between the human and model confusion matrices (excluding the diagonal). The correlation between Point Transformer and human confusion was modestly positive (, ), while the correlation for DGCNN was near zero (, ). These results indicate limited correspondence between model and human error patterns, though Point Transformer showed slightly better alignment.

Finally, we compared the accuracy patterns of the DGCNN and Point Transformer to human accuracy patterns across all experimental conditions in the three experiments. By pooling performance data across all tested conditions (varying point density, object orientation, local deformation, and part scrambling), we computed Pearson correlations between each model’s accuracies and human responses. We found that the Point Transformer model showed a higher correlation () with human performance than did the DGCNN (). For a comprehensive analysis, we refer readers to section “Model and human performance correlation.”

Ablation simulation results using variants of the Point Transformer

To identify which computational mechanisms in the Point Transformer model contribute to its global shape sensitivity, we systematically evaluated three primary differences between the Point Transformer and the DGCNN: (1) the attention mechanism, (2) position encoding, and (3) the downsampling operation supporting hierarchical pooling and abstraction of 3D shapes. The attention mechanism uses vector self-attention, an extension of standard self-attention that produces a learned attention vector rather than a scalar for each neighbor, to differentially weight the influence of each local neighbor based on both content similarity and spatial configuration. Points with high similarity receive higher attention weights during this integration process. The position encoding component enables 3D point coordinates to contribute to the similarity computation in every layer: similarity is computed by integrating both the similarity of feature embeddings and the spatial proximity between points in 3D space. Finally, the downsampling operation, implemented via Transition Down modules, progressively reduces the number of points across layers at a fixed ratio of the N input points, thereby pooling local information into increasingly abstract global representations [18]. Downsampling thus implements an inductive bias of pooling information from local to global through abstraction, mimicking hierarchical visual processing in biological systems.
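The role of the Transition Down module can be illustrated with a schematic NumPy implementation (a simplification that omits the learned MLP applied before pooling; the function names are ours). Farthest point sampling keeps a well-spread subset of points, and each kept point max-pools the features of its k nearest neighbors in the full cloud:

```python
import numpy as np

def farthest_point_sampling(points, m, start=0):
    """Greedily pick m well-spread points from an (N, 3) cloud."""
    chosen = [start]
    dist = np.linalg.norm(points - points[start], axis=1)
    for _ in range(m - 1):
        nxt = int(dist.argmax())                # point farthest from the chosen set
        chosen.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(points - points[nxt], axis=1))
    return np.array(chosen)

def transition_down(points, feats, ratio=4, k=3):
    """Schematic Transition Down: keep N/ratio points via farthest point
    sampling and max-pool each kept point's k-nearest-neighbor features."""
    m = max(1, points.shape[0] // ratio)
    idx = farthest_point_sampling(points, m)
    centers = points[idx]
    # k nearest neighbors of each sampled center in the full cloud.
    d = np.linalg.norm(points[None, :, :] - centers[:, None, :], axis=-1)
    knn = np.argsort(d, axis=1)[:, :k]
    pooled = feats[knn].max(axis=1)             # (m, C): max over each neighborhood
    return centers, pooled
```

Each application shrinks the point set by the downsampling ratio, so successive layers summarize ever-larger regions of the object, the hierarchical abstraction that the ablations identify as critical.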

By selectively removing each component from the Point Transformer model, we trained multiple variants of the Point Transformer and then assessed their performance on the same point cloud stimuli used in the human experiments. In these ablation simulations, the model variants were trained and tested using the same procedure as the original Point Transformer model, except that the component of interest was removed.

Fig 11 illustrates the accuracy of different Point Transformer variants as a function of point density, using the stimuli from Experiment 1. The original Point Transformer model (the “Original” variant in the plot) achieves the highest performance across all proportions. The “NoAttn” and “NoPE” variants, which remove the attention mechanism and the position encoding respectively, experience a mild accuracy drop compared to the original. This suggests that while attention and position encoding are beneficial, the model maintains substantial effectiveness in their absence.

Fig 11. Accuracy of variants of Point Transformers as a function of point density.

Original = the original Point Transformer, NoAttn = Remove attention, NoPE = Remove position encoding, NoDS = Remove downsampling.

https://doi.org/10.1371/journal.pcbi.1014047.g011

In contrast, the variant lacking the downsampling mechanism (“NoDS”) demonstrated performance comparable to the original model at higher point densities but exhibited significant accuracy declines at lower point densities. This result underscores the critical role of the downsampling mechanism in generalizing across varying degrees of point density. Downsampling facilitates hierarchical processing within the model, enabling global integration of local shape information and reducing dependence on local features.

Interestingly, this effect mirrors the hierarchical processing found in the human brain’s ventral visual pathway, as evidenced by the increasing receptive field size of neurons from low-level to high-level visual areas, which allows information to be processed from local to global scales. Similarly, downsampling in the Point Transformer enforces global shape sensitivity by computing representations for progressively fewer points sampled from the 3D shape in later layers. By reducing the number of points covering an entire object, downsampling allows the model to integrate local features into a coherent global representation, making it more robust to various data transformations.

Intervention simulation: enhancing DGCNN with downsampling

Given the critical role of the downsampling mechanism in fostering global shape sensitivity in the Point Transformer model, we next examined whether adding this computational mechanism to the standard DGCNN architecture could enhance its performance. We added the Transition Down layer from the Point Transformer after each EdgeConv layer of the DGCNN, matching the structural depth of the original Point Transformer.
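Schematically, the modified architecture alternates EdgeConv-style neighborhood aggregation with a Transition-Down-style subsampling step. The toy NumPy sketch below (our illustration: no learned MLPs, and strided subsampling stands in for farthest point sampling) shows how the point set shrinks while the feature dimension grows:

```python
import numpy as np

def edgeconv(points, feats, k=8):
    """Minimal EdgeConv stand-in: for each point, max-pool the feature
    differences to its k nearest neighbors (no learned MLP, for illustration)."""
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    knn = np.argsort(d, axis=1)[:, 1:k + 1]     # skip self at column 0
    edge = feats[knn] - feats[:, None, :]
    return np.concatenate([feats, edge.max(axis=1)], axis=1)

def dgcnn_ds_forward(points, feats, n_layers=3, ratio=4):
    """Alternate EdgeConv with a Transition-Down-style subsampling step,
    so the point set shrinks (e.g., 256 -> 64 -> 16 -> 4) each layer."""
    for _ in range(n_layers):
        feats = edgeconv(points, feats)
        keep = np.arange(0, points.shape[0], ratio)  # FPS in the real model
        points, feats = points[keep], feats[keep]
    return points, feats
```

Because each layer halves the feature dimension's growth against a shrinking point set, later layers describe large regions of the object with abstract features, the same local-to-global pooling that the ablations credit for robustness.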

As demonstrated in Fig 12a, incorporating the Transition Down layer significantly improved the DGCNN’s robustness to point density. The original DGCNN exhibited significant performance degradation at point densities less than 40%. In contrast, the modified DGCNN with the Transition Down layer (“DGCNN + DS”) increased the accuracy for the lowest point density conditions. For example, performance increased from 0.486 to 0.829 for the 20% point density, aligning closely with both human performance and the original Point Transformer. Results in Fig 12b suggest that DGCNN + DS displays a progressive reduction in accuracy as voxel size increases, consistent with the performance trend observed for the Point Transformer and humans. The original DGCNN exhibits an abrupt decline at voxel size 0.2 as discussed in Experiment 2. Quantitative correlation analyses are reported in the following section.

Fig 12. Accuracy of variants of DGCNN models as a function of varying point density.

DGCNN + DS = DGCNN model with an additional Downsampling layer after each EdgeConv layer.

https://doi.org/10.1371/journal.pcbi.1014047.g012

The addition of the Transition Down layer effectively compels the model to construct more abstract, hierarchical representations of 3D objects, shifting its focus towards global shape characteristics rather than relying excessively on local features. Crucially, this demonstrates that Downsampling can effectively bridge the gap between seemingly distinct model architectures, transformer-based and convolution-based, highlighting a shared computational mechanism toward robust shape recognition. Within the scope of point cloud recognition, where inputs are irregular, unordered surface samples and density perturbations directly alter local neighborhood statistics, hierarchical aggregation enabled by downsampling emerges as a key contributor to stable and human-like shape recognition. The consistent benefit of downsampling across both convolution-based and transformer-based point-cloud models further suggests that hierarchical abstraction provides a general and effective strategy for improving robustness to sampling variability in 3D object recognition.

Comparison of model and human performance

We compared the accuracy patterns of the DGCNN and Point Transformer, along with their respective variants, DGCNN + DS and Point Transformer noDS, to human accuracy patterns across all experimental conditions in the three experiments. By pooling performance data across all tested conditions, we computed Pearson correlations between each model’s accuracies and human responses.

As shown in Fig 13, both model variants that incorporate the Downsampling mechanism (DGCNN + DS and Point Transformer) exhibited substantially higher correlations with human performance than their non-downsampling counterparts (DGCNN and Point Transformer noDS). Specifically, the correlation between human performance and DGCNN + DS was , and for the original Point Transformer it was , both significantly higher than the correlation with the original DGCNN () and Point Transformer noDS (). These differences were statistically supported by comparisons such as DGCNN vs. DGCNN + DS () and DGCNN vs. Point Transformer (), indicating that the introduction of downsampling yields not only improved model robustness but also closer alignment to human accuracy profiles.

Fig 13. Correlation between model and human accuracy patterns across all experiment conditions.

Models with downsampling show significantly stronger alignment with human performance.

https://doi.org/10.1371/journal.pcbi.1014047.g013

In addition to the two primary architectures, we evaluated three further point-cloud classification models, PointNet++ [29], PointNeXt [32], and the Point Cloud Transformer (PCT) [33], all of which incorporate a downsampling mechanism and were trained under the same conditions. These models were tested on the same stimulus set from the three experiments, and their accuracy patterns were compared with human performance using the same correlation-based analysis. All three models exhibited moderate-to-strong alignment with human accuracy patterns: PointNet++ yielded a correlation of , PointNeXt yielded , and PCT yielded .

This finding provides converging evidence that hierarchical processing implemented via a downsampling mechanism plays a central role in eliciting human-like recognition of 3D shapes, and the enhanced correspondence across structurally distinct architectures underscores its broad applicability. Accordingly, downsampling-based models align more closely with human performance than non-downsampling variants such as DGCNN and Point Transformer noDS.

Conclusion and discussion

Across three behavioral experiments and a series of modeling analyses, we demonstrated that humans exhibit robustness in recognizing 3D objects, even under challenging conditions such as sparse input, inversion, or disrupted local geometry. This is the first study to employ point cloud stimuli across a broad range of object categories and a large number of object instances. While current deep learning models—DGCNN and Point Transformer—achieved comparable accuracy to humans under standard conditions, their robustness to deformation under untrained conditions varied substantially. Our results build on and extend previous research, confirming that deep learning models of object recognition are adept at extracting local features (both image and geometric), but struggle to capture global shapes in both 2D and 3D.

By testing different deep learning models, however, we also demonstrated that their sensitivity to global shapes varies significantly. In particular, the Point Transformer model consistently mirrored human performance qualitatively, showing gradual declines when reducing point density and introducing local geometric deformation. In contrast, the DGCNN model was more brittle, with sharp drops in performance in these untrained conditions. These findings indicate that some deep learning models relying predominantly on local geometric features, such as DGCNN, lack the holistic processing strategies characteristic of human perception.

To identify the core computational mechanism in support of global shape sensitivity, our ablation studies revealed that the Downsampling mechanism in Point Transformer contributes the most to robust 3D object recognition. We confirmed this by introducing this mechanism into the DGCNN model, which significantly improved its performance and alignment with human responses. This finding challenges the common assumption that attention mechanisms are the main contributors to generalization in transformer-based models. Instead, we propose that abstraction through downsampling, mimicking hierarchical visual processes in biological systems, plays a more critical role in producing robust 3D shape recognition.

Our results highlight the importance of global shape representations for human-like 3D object recognition. We found that integrating cognitively-inspired mechanisms, such as abstraction through hierarchical processing, into deep learning models can improve their generalization and robustness. However, the global shape sensitivity in human vision may arise from multiple routes. The visual system provides the brain with a wealth of information about the physical properties of objects.

Beyond simply recognizing categories, accurately estimating 3D shapes is crucial for action, motor planning, and reasoning. Global shape sensitivity, both 2D and 3D, arises early in human development, and evidence from both cross-species and human infant studies suggests that it builds on evolutionarily endowed mechanisms [34]. In addition to these early-developing perceptual capacities, active vision further enhances global shape sensitivity through interactions between perception and action following the onset of goal-directed reaching and locomotion. Such interactions can be viewed as a form of multisensory data augmentation for acquiring global shape representations of 3D objects. Although point cloud displays are novel for human observers and differ substantially from how 3D objects are typically encountered in the natural environment, humans nonetheless perform robustly on these stimuli. By contrast, data augmentation in deep learning models remains relatively limited in scope. In the present work, we followed the standard model implementation by augmenting training data through variations in point density and viewing angle, which improved model performance but is unlikely to fully account for human-level robustness. We therefore suggest that humans rely on richer forms of data augmentation, such as active exploration and manipulation of objects, to support the acquisition of 3D shape representations. Because current machine vision models lack both innate, specialized perceptual mechanisms and embodied cognition, their ability to acquire global shape representations is limited.

By systematically manipulating local geometric features and global shape structure, our paradigm provides a well-controlled empirical test using point clouds for examining both the development of shape perception and its impairments in aging and clinical populations. Furthermore, our study demonstrates that introducing an inductive bias through architectural design is a promising way to learn 3D shapes of objects. However, other approaches could further enhance this ability. Future work could explore other biologically inspired architectures to test whether similar global shape sensitivity emerges through different computational principles. For example, spiking neural networks, which model temporal coding and neural dynamics more faithfully to biological neurons, could offer an alternative and more efficient framework than attention-based transformer architectures for examining the origins of global shape bias [35,36]. Comparing such models with hierarchical architectures could clarify whether human-like robustness arises primarily from structural organization or from the intrinsic dynamics of neural computation. In parallel, future studies could incorporate recurrent feedback of biological vision systems [37,38], or diversify training data with egocentric, action-based experiences that expose models to objects from multiple viewpoints and contexts of interaction [39–41]. Exploring these directions would be a significant step toward developing more biologically grounded, capable, and agentic AI systems.

Acknowledgments

We thank Daniel Tjan and Zhiqi Zhang for their invaluable assistance with data collection.

References

  1. 1. Wallach H, O’Connell DN. The kinetic depth effect. J Exp Psychol. 1953;45(4):205–17. pmid:13052853
  2. 2. Marr D. Vision: A computational investigation into the human representation and processing of visual information. MIT Press. 1982.
  3. 3. Biederman I. Recognition-by-components: a theory of human image understanding. Psychol Rev. 1987;94(2):115–47. pmid:3575582
  4. 4. Hummel JE, Biederman I. Dynamic binding in a neural network for shape recognition. Psychol Rev. 1992;99(3):480–517. pmid:1502274
  5. 5. Liu Z, Knill DC, Kersten D. Object classification for human and ideal observers. Vision Res. 1995;35(4):549–68. pmid:7900295
  6. 6. Kellman PJ. Perception of three-dimensional form by human infants. Percept Psychophys. 1984;36(4):353–8. pmid:6522232
  7. 7. Yee M, Jones SS, Smith LB. Changes in visual object recognition precede the shape bias in early noun learning. Front Psychol. 2012;3:533. pmid:23227015
  8. 8. Treue S, Husain M, Andersen RA. Human perception of structure from motion. Vision Res. 1991;31(1):59–75. pmid:2006555
  9. 9. Treue S, Andersen RA, Ando H, Hildreth EC. Structure-from-motion: perceptual evidence for surface interpolation. Vision Res. 1995;35(1):139–48. pmid:7839603
  10. 10. Murray SO, Olshausen BA, Woods DL. Processing shape, motion and three-dimensional shape-from-motion in the human cortex. Cereb Cortex. 2003;13(5):508–16. pmid:12679297
  11. 11. Wagemans J, Elder JH, Kubovy M, Palmer SE, Peterson MA, Singh M, et al. A century of Gestalt psychology in visual perception: I. Perceptual grouping and figure-ground organization. Psychol Bull. 2012;138(6):1172–217. pmid:22845751
  12. 12. Guo Y, Wang H, Hu Q, Liu H, Liu L, Bennamoun M. Deep Learning for 3D Point Clouds: A Survey. IEEE Trans Pattern Anal Mach Intell. 2021;43(12):4338–64. pmid:32750799
  13. 13. Grill-Spector K, Kushnir T, Hendler T, Malach R. The dynamics of object-selective activation correlate with recognition performance in humans. Nat Neurosci. 2000;3(8):837–43. pmid:10903579
  14. 14. Kar K, Kubilius J, Schmidt K, Issa EB, DiCarlo JJ. Evidence that recurrent circuits are critical to the ventral stream’s execution of core object recognition behavior. Nat Neurosci. 2019;22(6):974–83. pmid:31036945
  15. 15. Ayzenberg V, Behrmann M. Does the brain’s ventral visual pathway compute object shape?. Trends in Cognitive Sciences. 2022;26(12):1119–32.
  16. 16. Wang Y, Sun Y, Liu Z, Sarma SE, Bronstein MM, Solomon JM. Dynamic Graph CNN for Learning on Point Clouds. ACM Trans Graph. 2019;38(5):1–12.
  17. 17. Charles RQ, Su H, Kaichun M, Guibas LJ. PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017. 77–85. https://doi.org/10.1109/cvpr.2017.16
  18. 18. Zhao H, Jiang L, Jia J, Torr PHS, Koltun V. Point Transformer. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021. 16259–68.
  19. 19. Krizhevsky A, Sutskever I, Hinton GE. ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, 2012. 1097–105.
  20. 20. Simonyan K, Zisserman A. Very Deep Convolutional Networks for Large-Scale Image Recognition. CoRR. 2014.
  21. 21. He K, Zhang X, Ren S, Sun J. Deep Residual Learning for Image Recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016. 770–8. https://doi.org/10.1109/cvpr.2016.90
  22. 22. Baker N, Lu H, Erlikhman G, Kellman PJ. Local features and global shape information in object classification by deep convolutional neural networks. Vision Res. 2020;172:46–61. pmid:32413803
  23. 23. Geirhos R, Rubisch P, Michaelis C, Bethge M, Wichmann FA, Brendel W. ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. In: International Conference on Learning Representations (ICLR), 2019.
  24. 24. Bowers JS, Malhotra G, Dujmović M, Llera Montero M, Tsvetkov C, Biscione V, et al. Deep problems with neural network models of human vision. Behav Brain Sci. 2022;46:e385. pmid:36453586
  25. 25. Wu Z, Song S, Khosla A, Yu F, Zhang L, Tang X, et al. 3D ShapeNets: A deep representation for volumetric shapes. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015. 1912–20. https://doi.org/10.1109/cvpr.2015.7298801
  26. 26. An Tao. dgcnn.pytorch; 2025. [software] Available from: https://github.com/antao97/dgcnn.pytorch
  27. 27. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN. Attention is All You Need. Advances in Neural Information Processing Systems. 2017. p. 5998–6008.
  28. 28. Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: International Conference on Learning Representations (ICLR), 2021.
  29. 29. Qi CR, Yi L, Su H, Guibas LJ. PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space. In: Advances in Neural Information Processing Systems, 2017. 5099–108.
  30. 30. You Y. Point-Transformers. 2025.
  31. 31. Yi L, Kim VG, Ceylan D, Shen I-C, Yan M, Su H, et al. A scalable active framework for region annotation in 3D shape collections. ACM Trans Graph. 2016;35(6):1–12.
  32. 32. Elhoseiny M, Ghanem B, Hammoud H, Li Y, Mai J, Peng H, et al. PointNeXt: Revisiting PointNet++ with Improved Training and Scaling Strategies. In: Advances in Neural Information Processing Systems 35, 2022. 23192–204. https://doi.org/10.52202/068431-1685
  33. 33. Guo M-H, Cai J-X, Liu Z-N, Mu T-J, Martin RR, Hu S-M. PCT: Point cloud transformer. Comp Visual Med. 2021;7(2):187–99.
  34. 34. Kellman PJ, Arterberry ME. The cradle of knowledge: Development of perception in infancy. Cambridge, MA: MIT Press. 2000.
  35. 35. Maass W. Networks of spiking neurons: The third generation of neural network models. Neural Networks. 1997;10(9):1659–71.
  36. 36. Stanojevic A, Woźniak S, Bellec G, Cherubini G, Pantazi A, Gerstner W. High-performance deep spiking neural networks with 0.3 spikes per neuron. Nat Commun. 2024;15(1):6793. pmid:39122775
  37. 37. Spoerer CJ, Kietzmann TC, Mehrer J, Charest I, Kriegeskorte N. Recurrent neural networks can explain flexible trading of speed and accuracy in biological vision. PLoS Comput Biol. 2020;16(10):e1008215. pmid:33006992
  38. 38. van Bergen RS, Kriegeskorte N. Going in circles is the way forward: the role of recurrence in visual inference. Curr Opin Neurobiol. 2020;65:176–93. pmid:33279795
  39. 39. Simonyan K, Zisserman A. Two-Stream Convolutional Networks for Action Recognition in Videos. In: Advances in Neural Information Processing Systems, 2014. 568–76.
  40. 40. Vong WK, Wang W, Orhan AE, Lake BM. Grounded language acquisition through the eyes and ears of a single child. Science. 2024;383(6682):504–11. pmid:38300999
  41. 41. O’Connell TP, Bonnen T, Friedman Y, Tewari A, Sitzmann V, Tenenbaum JB, et al. Approximating Human-Level 3D Visual Inferences With Deep Neural Networks. Open Mind (Camb). 2025;9:305–24. pmid:40013087