
Improving fine-grained food classification using deep residual learning and selective state space models

Abstract

Background

Food classification is the foundation for developing food vision tasks and plays a key role in the burgeoning field of computational nutrition. Because food recognition demands fine-grained classification, Convolutional Neural Network (CNN) backbones require additional structural design, whereas Vision Transformers (ViTs), which contain self-attention modules, incur increased computational complexity.

Methods

We propose the ResVMamba model and validate its performance on a complex food dataset. Unlike previous fine-grained classification models that rely heavily on attention mechanisms or hierarchical feature extraction, our method leverages a novel residual learning strategy within a state-space framework to improve representation learning. This approach enables the model to efficiently capture both global and local dependencies, surpassing the computational efficiency of Vision Transformers (ViTs) while maintaining high accuracy. We introduce an academically overlooked food dataset, CNFOOD-241, and compare it with other food databases.

Results

The proposed ResVMamba surpasses current state-of-the-art (SOTA) models, achieving a Top-1 classification accuracy of 81.70% and a Top-5 accuracy of 96.83%. These findings show that our methodology establishes a new SOTA benchmark for food recognition on the CNFOOD-241 dataset.

Conclusions

We pioneer the integration of a residual learning framework within the VMamba model to concurrently harness both global and local state features. The code can be obtained on GitHub: https://github.com/ChiShengChen/ResVMamba.

Introduction

Food plays a crucial role in human life, and with the rise of modern technology and significant changes in lifestyle and dietary habits, there has been an increasing emphasis on food computing [1]. Within this field, food recognition is a key area of research in computer vision and machine learning. Despite significant advancements, food image classification still faces several challenges, including low inter-class variation and high intra-class variation, variations in lighting and viewpoint, and occlusion in plated dishes. These challenges often lead to misclassification, limiting the real-world applicability of food recognition models. Deep learning has demonstrated remarkable adaptability across various domains, enabling breakthroughs in fields ranging from financial modeling and geomechanics [2–4] and supply chain optimization [5] to biomedical engineering [6–9]. In food classification, Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) have achieved promising results, but they often struggle with feature extraction in fine-grained food categories [10]. To address these challenges, we propose an approach that integrates Deep Residual Learning and Selective State Space Models, leveraging the strengths of both techniques to enhance feature representation and improve classification accuracy. In industry, food image classification can be utilized for automating restaurant cooking processes, enabling self-checkout systems, and managing kitchen waste. Furthermore, food recognition is essential for various health-related applications, including nutritional analysis and dietary habit management.

In addition to the effects of the capture environment, image noise, and image quality, the biggest challenge in food classification is low inter-class variation combined with high intra-class variation. The same food category can appear differently depending on cooking methods, seasonings, plating styles, and other preparation factors. Across different food categories, even subtle differences in ingredients can produce visually similar but semantically different dishes, such as shredded pork fried rice versus shrimp fried rice. Addressing these issues requires techniques that capture fine-grained features to distinguish between food classes.

Fine-grained visual classification represents a formidable task within the field of computer vision, seeking to identify various subcategories within a broader category, such as species of birds [11], medical images [12], aircraft [13], pets [14], flowers [15], and natural images [16]. Food image recognition is also an important branch of Fine-Grained Visual Classification (FGVC) [17]. The pivotal point in fine-grained classification is that, in addition to learning global features, the model should be able to integrate local features, combining global and local information to achieve better recognition capabilities. Existing methods primarily focus on either developing model subnets for localizing discriminative features or improving feature learning strategies. However, achieving an optimal balance between local and global feature extraction remains a challenge.

In recent years, a new Structured State Space Sequence (S4) model, augmented with a selection mechanism and scan-based computation (S6) and colloquially termed Mamba, has emerged as a promising alternative to Transformers due to its superior computational efficiency and capability to model long-range dependencies. The VMamba model [18], which incorporates the Mamba mechanism into image tasks such as classification, currently establishes the state of the art (SOTA) on the ImageNet dataset [19]. It retains the advantage of capturing both local and global information from input images, as ViTs do, while also improving model speed. However, there is still a lack of research on the application of VMamba to fine-grained datasets. This study therefore applies VMamba to food images and introduces ResVMamba, a novel model specifically designed for fine-grained datasets. ResVMamba enhances the global-local feature integration capability of food image classification models by combining the efficiency of VMamba with a residual learning mechanism.

A well-defined dataset significantly influences the development of possible research topics and the feature-learning capabilities of models. In this study, we utilized the CNFOOD-241 dataset [20]. CNFOOD-241 is a Chinese food dataset created by expanding ChineseFoodNet [21], correcting incorrect labels and increasing the number of images and food categories. In addition to model training, we provide a comparative analysis of CNFOOD-241 and other food datasets, illustrating its suitability for research. Unlike other food databases, CNFOOD-241 preserves the aspect ratio of images and standardizes their size to 600 × 600 pixels. This preprocessing step prevents image deformation during data augmentation, which could otherwise lead models to learn incorrect semantic features. Furthermore, CNFOOD-241 exhibits a relatively imbalanced data distribution, making it a more challenging dataset for fine-grained food classification. By introducing ResVMamba, this study aims to advance FGVC in food recognition and provide a more efficient and accurate solution for real-world applications.
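The aspect-ratio-preserving standardization described above can be sketched as a simple letterbox (pad-to-square) operation. This is an illustrative sketch only: CNFOOD-241's actual preprocessing pipeline is not detailed here, and a complete implementation would first rescale the longer side to 600 pixels.

```python
import numpy as np

def pad_to_square(img, size=600, fill=0):
    """Center an (H, W, C) image on a size x size canvas without changing
    its aspect ratio. Assumes the image already fits within `size`
    (a real pipeline would first rescale the longer side to `size`)."""
    H, W, C = img.shape
    canvas = np.full((size, size, C), fill, dtype=img.dtype)
    top, left = (size - H) // 2, (size - W) // 2
    canvas[top:top + H, left:left + W] = img
    return canvas
```

Because the image content is only translated, never stretched, downstream augmentations (crops, flips) cannot introduce the aspect-ratio distortion the text warns about.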

The contributions of this work are stated as follows:

  • We provide comparative studies on the food dataset and clarify the research value of CNFOOD241. To enhance the rigor of our study, we have further partitioned the dataset into separate test and validation segments as a new fine-grained image classification benchmark.
  • We first introduce the state space model into fine-grained image classification, and the proposed ResVMamba outperforms state-of-the-art approaches on the CNFOOD-241 dataset.

Related work

Food recognition datasets

In the burgeoning field of food computation, the proliferation of food datasets has marked a significant advancement, drawing widespread academic and practical interest. From the inception of datasets like ETH Food-101 [22], which introduced over one hundred thousand images of Western food varieties, to the expansive collections of ISIA Food-500 [23] and Food2K [24] encompassing nearly four hundred thousand and over a million images respectively, the evolution is notable. These datasets, predominantly sourced through web scraping, have been instrumental in advancing computational gastronomy and nutrition studies. However, they share a critical limitation: the lack of uniformity in the size distribution of images across different categories. This variance can lead to substantial discrepancies in some categories, where a few images might significantly exceed the average size of others, potentially skewing the dataset’s overall utility and introducing biases in the processing and classification results obtained after resizing images for analysis.

The issue of image size inconsistency poses challenges in maintaining the accuracy and reliability of computational models, especially those reliant on CNNs, ViTs and other image-processing architectures designed to extract detailed features from visual inputs. As depicted in Table 1, the disparity in image sizes may affect the performance of these models, leading to deviations in the extracted category-specific information and potentially impacting the overall effectiveness of the computational analysis.

thumbnail
Table 1. Image size statistics comparison of current open datasets of food recognition (pixels).

https://doi.org/10.1371/journal.pone.0322695.t001

In response to these challenges, our search for a more consistent and high-resolution dataset led us to CNFOOD-241. Among publicly available food datasets with uniform image sizes, such as UNICT-FD889 [25], Vireo Food-172 [26], and UNICT-FD1200 [27], CNFOOD-241 distinguishes itself by offering the highest resolution. This characteristic renders it an exceptional resource for conducting detailed image analyses within the food computation domain, facilitating more accurate and reliable studies in food recognition, nutritional analysis, and other related areas.

Food image recognition

Early food recognition systems primarily used traditional machine learning algorithms. Researchers extracted handcrafted features from images using methods such as color histograms [28], Scale-Invariant Feature Transform (SIFT), Histogram of Oriented Gradients (HOG) [29], Gabor textures [28], and Local Binary Patterns (LBP) [29]. These extracted features were then fed into classifiers such as SVMs [30] for categorization. While achieving reasonable performance, these early methods relied heavily on manual feature engineering and were limited by image quality and variability in food appearances.

The emergence of deep learning revolutionized food identification research. Researchers began applying convolutional neural networks, such as AlexNet [31], ResNet-50 [32], EfficientNet, and Inception v3 [33], to food image data using transfer learning methods. This approach eliminated the need for manual feature extraction and allowed models to learn hierarchical visual representations from large-scale labeled datasets. Subsequent research enhanced food recognition capabilities using ensemble networks, multi-task learning, and other techniques. Notably, PRENet [24] emerged as a milestone in food recognition, integrating three different branches, each tailored for capturing different aspects of food images. By fusing features from low and high-level layers, PRENet achieved SOTA performance on CNN models.

Recently, ViTs have gained popularity in food image analysis due to their ability to capture long-range dependencies. ViT models divide images into patches and represent them as sequence data, applying self-attention to capture relationships between patches. Ongoing research continues to integrate ViTs with data augmentation, semi-supervised learning, multi-model fusion, and other techniques, pushing the boundaries of food understanding from images. Our research pioneers the application of the State Space Model to food recognition and aims to bring breakthroughs to this field.

State space model on visual recognition

Recent research predominantly utilizes CNNs and ViTs for classifying food categories. However, these models have lately been outperformed in large-scale image classification challenges by a new generation of models based on Structured State Space for Sequence (S4) [34] modeling. The improvement of S4 models with a selection mechanism and scan-based execution (S6) [35], informally known as Mamba, has been shown to outclass the Transformer architecture in handling long sequences. Several Mamba models have been applied to vision-related tasks: VMamba and Vision Mamba [18] use Mamba for visual downstream tasks such as image classification and object detection, with Vision Mamba focusing more on inference speed and GPU-memory efficiency, while U-Mamba [36], VM-UNet [37], and MambaMorph [38] replace convolutional or downsampling blocks with Mamba blocks for medical image segmentation tasks. However, the exploration of VMamba for fine-grained data and food recognition tasks remains insufficient; we therefore apply a VMamba-based model to these downstream tasks.

Deep residual learning on space state model

Deep residual learning, introduced by ResNet [39], is a technique employed in training deep neural networks that offers several notable advantages. Its primary advantage lies in addressing challenges such as vanishing and exploding gradients, which commonly impede the training of deep networks. By introducing residual blocks and skip connections, deep residual learning facilitates the flow of gradients throughout the network, effectively mitigating the issue of vanishing gradients. Consequently, it enables the training of deeper neural networks without sacrificing performance. Deeper networks afford the extraction of more complex features, thereby enhancing the model's representational capacity. Additionally, the training process converges more efficiently due to expedited gradient propagation via skip connections, resulting in reduced training time and computational costs. However, to the best of our knowledge, there is a lack of research applying residual learning to VMamba. Hence, we introduce a residual learning structure into a VMamba-based model in this work.

Methods

In this section, we first introduce the preliminary knowledge of VMamba, then propose the details of our ResVMamba structure.

State space models

State Space Models (SSMs) are widely recognized as linear systems with time-invariant properties, mapping an input $x(t) \in \mathbb{R}$ to an output $y(t) \in \mathbb{R}$ through a hidden state $h(t) \in \mathbb{R}^N$. These systems are mathematically formulated as linear ordinary differential equations (ODEs), as depicted in Equation (1), where the model's parameters are denoted by $A \in \mathbb{R}^{N \times N}$, $B \in \mathbb{R}^{N \times 1}$, $C \in \mathbb{R}^{1 \times N}$ for a system state of dimension $N$, and the direct link $D \in \mathbb{R}$. The state's derivative and output signals are described by the following equations:

$$\dot{h}(t) = A\,h(t) + B\,x(t), \qquad y(t) = C\,h(t) + D\,x(t) \tag{1}$$

Discretization

When integrated into deep learning algorithms, State Space Models (SSMs), inherently continuous-time constructs, present substantial challenges. The discretization process is thus imperative.

The primary aim of discretization is to transmute the continuous ODE into a discrete function. This conversion is vital for aligning the model with the input data's sample rate, thereby enabling computationally efficient operations [40]. Given the input $x_k \in \mathbb{R}^{L \times D}$, a sampled vector from a signal sequence of length $L$, the ODE [41] (Eq. 1) can be discretized with timestep $\Delta$ employing the zeroth-order hold approach:

$$h_k = \bar{A}\,h_{k-1} + \bar{B}\,x_k, \qquad y_k = \bar{C}\,h_k \tag{2}$$

where $\bar{A} = e^{\Delta A}$, $\bar{B} = (\Delta A)^{-1}(e^{\Delta A} - I)\,\Delta B$, and $\bar{C} = C$. Following common practice, the approximation of $\bar{B}$ through a first-order Taylor series is refined as:

$$\bar{B} = (\Delta A)^{-1}(e^{\Delta A} - I)\,\Delta B \approx \Delta B \tag{3}$$
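The zeroth-order-hold recurrence of Eqs. (2) and (3) can be sketched in NumPy for the diagonal-$A$ case used by S4/Mamba-style models. This is an illustrative sketch of the math, not the paper's implementation; the parameter values below are arbitrary.

```python
import numpy as np

def zoh_discretize_diag(A_diag, B, delta):
    """Zero-order-hold discretization of h' = A h + B x with diagonal A,
    so exp(delta * A) is elementwise (Eq. 2's A_bar, B_bar)."""
    A_bar = np.exp(delta * A_diag)
    # B_bar = (Delta A)^{-1} (e^{Delta A} - I) Delta B; for small delta
    # this reduces to the first-order Taylor approximation delta * B (Eq. 3).
    B_bar = (A_bar - 1.0) / (delta * A_diag) * (delta * B)
    return A_bar, B_bar

def ssm_scan(A_bar, B_bar, C, x_seq):
    """Run the discrete recurrence h_k = A_bar h_{k-1} + B_bar x_k, y_k = C h_k."""
    h = np.zeros_like(A_bar)
    ys = []
    for x_k in x_seq:
        h = A_bar * h + B_bar * x_k
        ys.append(float(C @ h))
    return np.array(ys)
```

For small $\Delta$, `B_bar` is numerically close to `delta * B`, which is exactly the Taylor simplification of Eq. (3).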

2D selective scan mechanism

The VMamba model introduces a novel Selective Scan Mechanism (S6), diverging from traditional Linear Time-Invariant (LTI) State Space Models (SSMs). This S6 mechanism, central to the VMamba framework, derives the matrices $B$, $C$, and the timestep $\Delta$ from the input data $x$, imbuing the system with contextual responsiveness and dynamic weights.

Furthermore, the Cross-Scan Module (CSM) is introduced to enhance spatial integration across the image. It unfolds image patches into sequences along rows and columns, and performs scanning across four directions, thereby enabling any pixel to integrate information from all others in different trajectories. These sequences are then reconfigured into a single image, culminating in a merged, information-rich new image.
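The unfold-scan-merge idea can be illustrated with a minimal sketch. This is our simplified reading of the CSM; the official VMamba code differs in details such as how the four sequences are processed before merging.

```python
import numpy as np

def cross_scan(x):
    """Unfold an (H, W, C) feature map into four 1D scan orders:
    row-major, reversed row-major, column-major, reversed column-major."""
    H, W, C = x.shape
    rowwise = x.reshape(H * W, C)                     # left-to-right, top-to-bottom
    colwise = x.transpose(1, 0, 2).reshape(H * W, C)  # top-to-bottom, left-to-right
    return np.stack([rowwise, rowwise[::-1], colwise, colwise[::-1]])

def cross_merge(seqs, H, W):
    """Invert the four scans and sum them back into one (H, W, C) map."""
    C = seqs.shape[-1]
    rowwise = seqs[0] + seqs[1][::-1]
    colwise = (seqs[2] + seqs[3][::-1]).reshape(W, H, C).transpose(1, 0, 2)
    return rowwise.reshape(H, W, C) + colwise
```

In the real module, each of the four sequences would pass through an S6 scan before merging; here the merge of the raw scans simply recovers four copies of the input, which makes the invertibility of the unfolding easy to verify.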

VMamba model

The overall architecture of the VMamba model has been illustrated in previous literature [18]. The VMamba architecture, showcased in Fig 1, commences by partitioning the input image into patches through a stem module, emulating ViTs but maintaining the 2D structure without flattening the patches into a 1D sequence. This approach yields a feature map with dimensions $H/4 \times W/4 \times C_1$. VMamba stacks a series of VSS blocks (Fig 2) atop this feature map to construct "Stage 1," preserving its dimensions. Hierarchical structures in VMamba are established via down-sampling of "Stage 1" through a patch merging process. More VSS blocks are then integrated, reducing the output resolution to $H/8 \times W/8$ for "Stage 2." This down-sampling is reiterated to form "Stage 3" and "Stage 4," with resolutions of $H/16 \times W/16$ and $H/32 \times W/32$, respectively. The resulting hierarchical design mirrors the multi-scale representation characteristic of renowned CNN models and some ViTs. Thus, VMamba's architecture emerges as a comprehensive and versatile candidate for a variety of vision-related applications with analogous requirements.
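Under the standard 4x/8x/16x/32x down-sampling of the four stages, the per-stage feature-map shapes can be computed directly. The channel widths below are the common VMamba-Tiny/Small defaults and are an assumption on our part, not values stated in this paper.

```python
def vmamba_stage_shapes(H, W, widths=(96, 192, 384, 768)):
    """Spatial resolution and channel width of the four VMamba stages for an
    H x W input: stage i (0-indexed) runs at H/2^(i+2) x W/2^(i+2)."""
    return [(H // (4 * 2 ** i), W // (4 * 2 ** i), c) for i, c in enumerate(widths)]
```

For a 224 x 224 input this yields 56 x 56, 28 x 28, 14 x 14, and 7 x 7 feature maps, matching the pyramid of typical hierarchical CNN backbones.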

thumbnail
Fig 1. The comparison between VMamba and our proposed ResVMamba.

https://doi.org/10.1371/journal.pone.0322695.g001

ResVMamba model

Inspired by ResNet, we propose a new type of VMamba model, ResVMamba, a VMamba with a residual learning mechanism. The ResVMamba architecture, as illustrated in Fig 1, is an advanced model configuration designed for efficient processing within the realm of computer vision. This architecture begins with a stem module that processes the input image, followed by a series of VSS blocks arranged sequentially across four distinct stages.

Distinct from the original VMamba framework, the ResVMamba architecture not only employs the VMamba structure as its backbone but also integrates raw data directly into the feature map. To distinguish it from the residual structure inside the VSS block, we call this the global-residual mechanism. This integration is anticipated to facilitate the sharing of global image features in conjunction with the information processed through the VSS blocks. The intention behind this design is to harness both the localized details captured by individual VSS blocks and the overarching global features inherent in the unprocessed input, thereby enriching the model's representational capacity and enhancing its performance on tasks requiring a comprehensive understanding of the visual data.
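A minimal functional sketch of this global-residual idea is given below. It is our illustrative reading, not the paper's exact code: the average-pool stands in for whatever learned projection maps raw pixels to a stage's resolution, and a real projection would also match channel widths.

```python
import numpy as np

def patch_downsample(img, stride):
    """Average-pool raw (H, W, C) pixels to the stage's spatial resolution.
    Stand-in for a learned projection (an assumption on our part)."""
    H, W, C = img.shape
    return img.reshape(H // stride, stride, W // stride, stride, C).mean(axis=(1, 3))

def stage_with_global_residual(stage_fn, feat, img, stride=4):
    """Global-residual sketch: local features from a VSS stage (stage_fn)
    summed with a projection of the unprocessed input image."""
    return stage_fn(feat) + patch_downsample(img, stride)
```

The key point is that the raw image bypasses the stage entirely, so global appearance information reaches deeper layers unattenuated, alongside the per-block residuals already inside each VSS block.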

Implementation details

In accordance with the protocol proposed in previous work [18], ResVMamba undergoes a comprehensive training regimen on CNFOOD-241. The backbone uses VMamba-S, trained over 150 epochs with a warmup period covering the first 20 epochs and a batch size of 128. The training schema uses the AdamW optimizer with beta parameters (0.9, 0.999) and momentum fixed at 0.9. A cosine decay schedule modulates the learning rate, starting from an initial learning rate of 1 × 10⁻³ with a weight decay of 0.05. Training is augmented with label smoothing of 0.1 and an exponential moving average (EMA); no further training strategies are deployed. The VMamba-S baseline was transfer-trained on CNFOOD-241 from the pretrained weights of Liu et al. [18] with a batch size of 32; all other training strategies follow the VMamba defaults. CMAL-Net was trained from PyTorch's ResNet-50 pretrained weights following its original settings on GitHub, and the other models were trained from ImageNet-1K pretrained weights loaded from Hugging Face via the timm module, with an initial learning rate of 1 × 10⁻⁴ and the AdamW optimizer. Moreover, to establish CNFOOD-241 as a more equitable benchmark, we randomly partitioned the dataset into training, validation, and test subsets. The training and validation sets were split in a 7:3 ratio through a randomized selection process; the resulting training set contains 119,514 images and the validation set 51,354 images across all 241 categories. The test set is the original 'val600x600' folder of the CNFOOD-241 dataset, which includes 20,943 images (Fig 3). The train-validation split list is available on GitHub: https://github.com/ChiShengChen/ResVMamba.
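The warmup-plus-cosine schedule with the stated hyperparameters (150 epochs, 20 warmup epochs, initial learning rate 1 × 10⁻³) can be sketched as follows. The linear warmup shape and a floor of zero are our assumptions; the paper does not specify them.

```python
import math

def lr_at_epoch(epoch, total_epochs=150, warmup_epochs=20, base_lr=1e-3, min_lr=0.0):
    """Learning rate at a given epoch: linear warmup for the first
    `warmup_epochs`, then cosine decay from base_lr down to min_lr."""
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

The rate ramps to 1 × 10⁻³ by epoch 20 and decays smoothly toward zero by epoch 150, which is the shape the cosine schedule in the text describes.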

Results

Dataset

In this work, to fairly evaluate the models' performance, we use CNFOOD-241, as it possesses several notable characteristics:

Balanced distribution on image sizes across categories

The CNFOOD-241 dataset possesses the largest (almost two hundred thousand images) uniform-sized (600 × 600) image collection among publicly available food datasets.

Unbalanced number of images across categories on CNFOOD-241

In order to evaluate the imbalance in the number of images per category throughout the dataset, we calculate the normalized entropy: we divide the entropy by $\log(n)$, where $n$ is the total number of categories. This normalized entropy, denoted as $H_{\mathrm{norm}}$, is calculated as follows:

$$H_{\mathrm{norm}} = \frac{H}{\log(n)} \tag{4}$$

Given the original definition of entropy, where $p_i$ is the proportion of images belonging to category $i$:

$$H = -\sum_{i=1}^{n} p_i \log(p_i) \tag{5}$$

The normalized entropy ranges from 0 to 1: $H_{\mathrm{norm}} = 1$ indicates a perfectly balanced dataset, with each category having an equal share of the data, while $H_{\mathrm{norm}} = 0$ indicates complete imbalance, where all instances belong to a single category. This normalization allows for an easier comparison of entropy values across datasets with different numbers of categories, providing a standardized measure of category balance. The entropy results are shown in Table 2; we observe that CNFOOD-241 exhibits a greater degree of class imbalance than other datasets, which further enhances the challenging nature of this dataset.
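The normalized entropy of Eqs. (4) and (5) can be computed directly from a list of per-image category labels:

```python
import math
from collections import Counter

def normalized_entropy(labels):
    """Normalized entropy H / log(n) of a label list: 1.0 for perfectly
    balanced classes, values approaching 0 for severe imbalance."""
    counts = Counter(labels)
    n = len(counts)
    if n <= 1:
        return 0.0  # a single category carries no entropy
    total = sum(counts.values())
    h = -sum((c / total) * math.log(c / total) for c in counts.values())
    return h / math.log(n)
```

For example, a 99:1 split over two classes scores roughly 0.08, far below the 1.0 of a perfectly balanced split, which is the kind of gap Table 2 quantifies across datasets.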

thumbnail
Table 2. Comparison of entropy on current food recognition datasets.

https://doi.org/10.1371/journal.pone.0322695.t002

Performance metrics

The performance of the models is evaluated by top-$k$ accuracy, defined as the proportion of test samples for which the correct label is among the top $k$ labels predicted by the model. Mathematically, it can be expressed as:

$$\text{Top-}k\ \text{accuracy} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\left(y_i \in T_k(x_i)\right) \tag{6}$$

where:

  • $N$ is the total number of samples in the test set.
  • $y_i$ is the true label for the $i$-th sample.
  • $T_k(x_i)$ is the set of top $k$ predictions made by the model for the $i$-th sample.
  • $\mathbb{1}(\cdot)$ is the indicator function, which is 1 if $y_i \in T_k(x_i)$ (the true label is among the top $k$ predictions) and 0 otherwise.

This metric is particularly useful for evaluating models on tasks where the goal is to provide a set of potential labels for each input and the exact rank within the top k is not critically important.
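Eq. (6) translates directly into a few lines of NumPy:

```python
import numpy as np

def top_k_accuracy(scores, labels, k=5):
    """Top-k accuracy per Eq. (6): the fraction of samples whose true label
    appears among the k highest-scoring class predictions.

    scores: (N, num_classes) array of model scores; labels: length-N labels.
    """
    topk = np.argsort(scores, axis=1)[:, -k:]  # indices of the k largest scores
    hits = [label in row for row, label in zip(topk, labels)]
    return float(np.mean(hits))
```

Top-1 accuracy is the usual classification accuracy; Top-5 credits a sample whenever the true class is anywhere in the model's five best guesses.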

Comparisons with State-of-the-Art (SOTA) methods

We observe that VMamba-S with ImageNet-1K pretrained weights reaches the SOTA on CNFOOD-241, and that ResVMamba surpasses VMamba-S, further improving classification accuracy to 81.70% (Table 3).

thumbnail
Table 3. Comparison of our approach (ResVMamba) to other baselines on CNFOOD-241 with our split method.

https://doi.org/10.1371/journal.pone.0322695.t003

Discussion

In the domain of computer vision, food recognition is categorically placed within the realm of FGVC, a field distinguished by its focus on distinguishing between closely related subcategories within a broader category. This area has seen the development and application of several state-of-the-art (SOTA) models, each contributing to advancements in dataset-specific performance. By comparing the CNFOOD-241 dataset with other food databases, we highlighted the characteristics of CNFOOD241 as a high-resolution, data-imbalanced, and therefore a challenging dataset. For the datasets of Food2K and ETH Food-101, PRENet achieves top-1 accuracies of 83.75% and 91.13%, respectively. However, its top-1 accuracy on CNFOOD-241 is only 76.2%, demonstrating the considerable difficulty of CNFOOD-241. The significant drop in PRENet’s performance (from 91.13% on ETH Food-101 to 76.2% on CNFOOD-241) highlights the dataset’s complexity. This challenge primarily arises from the low inter-class variation (e.g., visually similar dishes such as different styles of dumplings) and high intra-class variation (e.g., the same dish appearing in different lighting conditions, angles, and occlusions). Such characteristics make CNFOOD-241 a more challenging benchmark for FGVC tasks, as models need to develop stronger discriminative feature learning capabilities.

The unveiling of the CNFOOD-241 dataset marks a significant advancement in fulfilling the essential demand for high-quality, uniform datasets within the domain of food computation, facilitating novel pathways for research and innovation. Experimental evidence indicates that this dataset presents a considerable challenge. To address the challenge, we introduce ResVMamba, an enhanced version of the original VMamba model that incorporates a residual deep learning structure to improve its performance in processing complex food datasets. Residual deep learning mitigates the vanishing gradient problem by allowing gradients to flow through the network more effectively, hence enabling the model to learn better representations, especially in deep architectures [39].

VMamba models [18] based on the State Space Model are regarded as outperforming ViTs on large image datasets like ImageNet [19]. This model is designed to improve performance on intricate image classification tasks. In this study, we successfully integrated the residual deep learning structure into the VMamba model. Our data show that VMamba has superior performance compared to CMAL-Net in fine-grained food recognition on CNFOOD241, with a notable improvement of 2.02% in top-1 accuracy. Furthermore, results indicated that incorporating a residual architecture on VMamba (ResVMamba) can further enhance accuracy by 1.12%, validating the effectiveness of deep residual learning in FGVC. Therefore, ResVMamba is well-suited for handling high-resolution and data-imbalanced scenarios, making it ideal for real-world applications in food recognition.

The improvement observed in ResVMamba can be attributed to its hybrid design, which leverages state space models (SSMs) to efficiently capture long-range dependencies while retaining CNN-like locality. Compared to ViTs, which rely on computationally expensive self-attention mechanisms, SSMs process sequences linearly in time complexity, making them particularly suitable for high-resolution food images in CNFOOD-241. Additionally, the integration of residual learning into VMamba contributes to improved model convergence and feature extraction. Residual connections facilitate gradient propagation, mitigating the vanishing gradient issue commonly encountered in deep architectures. This enables ResVMamba to better capture discriminative fine-grained details, such as subtle texture and shape differences in food images.

From our results, CMAL-Net [42] stands out as the secondary SOTA on CNFOOD-241, having been constructed by integrating three expert modules with a CNN-based backbone. Each expert module processes feature maps from specific layers, delivering both a categorical prediction and an attention region. This attention region not only highlights areas of interest within the images but also serves as a means of data augmentation for the other expert modules, thereby enhancing the model's overall accuracy and robustness. EfficientNet [43] demonstrates an accuracy of 78.48% on CNFOOD-241, utilizing Network Architecture Search (NAS) and Compound Model Scaling to optimize performance. ConvNeXt [44], inspired by the Swin Transformer [45] architecture, reimagines CNNs to surpass the Swin Transformer's performance on ImageNet, marking a significant achievement in model design.

The experimental results confirm that VMamba, when enhanced with residual learning, can outperform ViTs and CNNs in fine-grained classification tasks. The introduction of a residual structure allows the model to retain both local texture details and high-level semantic features, addressing the challenges of intra-class variations in food images. Additionally, compared to CNN-based methods like ResNet and EfficientNet, our model leverages the sequence modeling capabilities of state space models to capture long-range dependencies more effectively. These results suggest that state space models have significant potential beyond traditional sequence modeling applications and can be further explored in other fine-grained classification domains.

Our proposed model ResVMamba sets a new benchmark for state-of-the-art (SOTA) performance in food recognition tasks, demonstrating its effectiveness in complex food dataset on the CNFOOD-241 dataset. Deep residual learning enhances the model’s generalization capabilities by effectively fitting training data while mitigating the risk of overfitting. Future research should continue exploring the classification capabilities of the ResVMamba model on a larger scale. Beyond food classification, our findings suggest that state space models (SSMs) with residual learning can serve as a promising alternative to traditional CNNs and ViTs in fine-grained visual classification (FGVC) across various domains, including medical imaging, biological species identification, and industrial defect detection. Future research can explore the extension of ResVMamba to these domains, as well as investigate techniques such as adaptive residual scaling or multi-scale feature fusion to further enhance model robustness and generalization.

While CNFOOD-241 is a challenging dataset for fine-grained food classification, its geographic and cultural diversity remains limited. Additionally, although ResVMamba demonstrates strong performance in this domain, further studies are needed to assess its effectiveness across other fine-grained visual classification tasks. Moreover, the computational cost of ResVMamba is higher than that of traditional CNN-based models, which may impact its deployment in resource-constrained environments. Future work should focus on expanding dataset diversity and optimizing computational efficiency to enhance model generalizability and practical applicability.

Conclusion

This study proposes the ResVMamba model, which integrates residual learning into the VMamba architecture for the first time in fine-grained food classification. Our comparisons with other food databases demonstrate that ResVMamba outperforms current SOTA models in fine-grained food classification with an accuracy of 81.70%. We demonstrate the potential of state space models in food image analysis. This research pioneers the integration of a residual learning framework within the VMamba model, enabling the effective utilization of both global and local feature states, which enhances its capability to tackle complex food recognition tasks. Future work will explore the application of ResVMamba to multi-modal food analysis and its integration into nutritional assessment systems.

Acknowledgments

We thank Kurh-Life Technology Co., Ltd. for providing an A100 cloud computing resource on Alicloud for model training and fine-tuning.

References

  1. Min W, Jiang S, Liu L, Rui Y, Jain R. A survey on food computing. ACM Comput Surv. 2019;52(5):1–36.
  2. Alakbari FS, Mohyaldinn ME, Ayoub MA, Muhsan AS, Abdulkadir SJ, Hussein IA, et al. Prediction of critical total drawdown in sand production from gas wells: Machine learning approach. Can J Chem Eng. 2022;101(5):2493–509.
  3. Alakbari FS, Mohyaldinn ME, Ayoub MA, Muhsan AS. Deep learning approach for robust prediction of reservoir bubble point pressure. ACS Omega. 2021;6(33):21499–513. pmid:34471753
  4. Alakbari FS, Mohyaldinn ME, Ayoub MA, Hussein IA, Muhsan AS, Ridha S, et al. A gated recurrent unit model to predict Poisson's ratio using deep learning. J Rock Mechan Geotech Eng. 2024;16(1):123–35.
  5. Chen CS, Chen YJ. Optimizing supply chain networks with the power of graph neural networks. arXiv:250106221 [Preprint]. 2025 [cited 2025 Jan 7]. Available from:
  6. Wu R, Zhang T, Xu F. Cross-market arbitrage strategies based on deep learning. Acade J Sociol Manage. 2024:20–6.
  7. Chen S, He H. Stock price prediction using convolutional neural network. IOP Conf Ser: Mater Sci Eng. 2018;435:012026.
  8. Lai S-L, Chen C-S, Lin B-R, Chang R-F. Intraoperative detection of surgical gauze using deep convolutional neural network. Ann Biomed Eng. 2023;51(2):352–62. pmid:35972601
  9. Chen CS, Tsai AHW, Huang SC. Quantum multimodal contrastive learning framework. arXiv:240813919 [Preprint]. 2024 [cited 2024 Aug 25]. Available from:
  10. Chen C-S, Yang Y-H, Chen G-Y, Chang S-H. Food classification for dietary support using fine-grained visual recognition with the HERBS Network. 2024.
  11. Van Horn G, Branson S, Farrell R, Haber S, Barry J, Ipeirotis P, et al. Building a bird recognition app and large scale dataset with citizen scientists: The fine print in fine-grained dataset collection. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2015. https://doi.org/10.1109/cvpr.2015.7298658
  12. Zhou Y, Wang B, Huang L, Cui S, Shao L. A benchmark for studying diabetic retinopathy: segmentation, grading, and transferability. IEEE Trans Med Imaging. 2021;40(3):818–28. pmid:33180722
  13. Maji S, Kannala J, Rahtu E, Blaschko M, Vedaldi A. Fine-grained visual classification of aircraft. Technical report [Preprint]. 2013 [cited 2013 Jun 21]. Available from:
  14. 14. Parkhi OM, Vedaldi A, Zisserman A, Jawahar CV. Cats and dogs. 2012 IEEE Conference on Computer Vision and Pattern Recognition. 2012. p. 3498–505. https://doi.org/10.1109/cvpr.2012.6248092
  15. 15. Nilsback M-E, Zisserman A. Automated flower classification over a large number of classes. 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing. 2008. p. 722–9. https://doi.org/10.1109/icvgip.2008.47
  16. 16. Van Horn G, Cole E, Beery S, Wilber K, Belongie S, MacAodha O. Benchmarking representation learning for natural world image collections. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2021. https://doi.org/10.1109/cvpr46437.2021.01269
  17. 17. Singla A, Yuan L, Ebrahimi T. Food/non-food image classification and food categorization using pre-trained GoogLeNet Model. Proceedings of the 2nd International Workshop on Multimedia Assisted Dietary Management. 2016. p. 3–11. https://doi.org/10.1145/2986035.2986039
  18. 18. Liu Y, Tian Y, Zhao Y, Yu H, Xie L, Wang Y, et al. Vmamba: visual state space model. Technical report. [Preprint]. 2024 [cited 2024 Jan 18. ]. Available from:
  19. 19. Deng J, Dong W, Socher R, Li L-J, Kai Li, Li Fei-Fei. ImageNet: A large-scale hierarchical image database. 2009 IEEE Conference on Computer Vision and Pattern Recognition. 2009. p. 248–55. https://doi.org/10.1109/cvpr.2009.5206848
  20. 20. Fan B, Li W, Dong L, Li J, Nie Z. Automatic Chinese Food recognition based on a stacking fusion model. Annu Int Conf IEEE Eng Med Biol Soc. 2023;2023:1–4. pmid:38083522
  21. 21. Chen X, Zhou H, Zhu Y, Diao L. Chinesefoodnet: a large-scale image dataset for chinese food recognition. arXiv preprint arXiv:170502743. 2017 [cited 2017 May 8. ]. Available from: https://doi.org/10.48550/arXiv.1705.02743
  22. 22. Havard TA, Jones TJ, Kavanagh JL. Analogue experiments to investigate magma mixing within dykes. Bull Volcanol. 2025;87(4):29. pmid:40207182
  23. 23. Min W, Liu L, Wang Z, Luo Z, Wei X, Wei X, et al . Isia food-500: A dataset for large-scale food recognition via stacked global-local attention network. arXiv:2008.05655. [Preprint] 2020: [cited 2020 Aug 13. ] Available from:
  24. 24. Min W, Wang Z, Liu Y, Luo M, Kang L, Wei X, et al. Large scale visual food recognition. IEEE Trans Pattern Anal Mach Intell. 2023;45(8):9932–49. pmid:37021867
  25. 25. Farinella G, Allegra D, Stanco F. A benchmark dataset to study the representation of food images. Computer Vision – ECCV. 2014;2015:584–99.
  26. 26. Chen J, Ngo C. Deep-based ingredient recognition for cooking recipe retrieval. Proceedings of the 24th ACM international conference on Multimedia. 2016. https://doi.org/10.1145/2964284.2964315
  27. 27. Farinella GM, Allegra D, Moltisanti M, Stanco F, Battiato S. Retrieval and classification of food images. Comput Biol Med. 2016;77:23–39. pmid:27498058
  28. 28. Taichi J, Keiji Y. A food image recognition system with Multiple Kernel Learning. 2009 16th IEEE International Conference on Image Processing (ICIP). 2009. p. 285–8. https://doi.org/10.1109/icip.2009.5413400
  29. 29. Ravi D, Lo B, Yang G-Z. Real-time food intake classification and energy expenditure estimation on a mobile device. 2015 IEEE 12th International Conference on Wearable and Implantable Body Sensor Networks (BSN). 2015. p. 1–6. https://doi.org/10.1109/bsn.2015.7299410
  30. 30. Pontil M, Verri A. Properties of support vector machines. Neural Comput. 1998;10(4):955–74. pmid:9573414
  31. 31. Rahmat RA, Kutty SB. Malaysian Food Recognition using Alexnet CNN and Transfer Learning. 2021 IEEE 11th IEEE Symposium on Computer Applications & Industrial Electronics (ISCAIE). 2021. https://doi.org/10.1109/iscaie51753.2021.9431833
  32. 32. Zahisham Z, Lee CP, Lim KM. Food recognition with ResNet-50. 2020 IEEE 2nd International Conference on Artificial Intelligence in Engineering and Technology (IICAIET). 2020:1–5. https://doi.org/10.1109/iicaiet49801.2020.9257825
  33. 33. Hassannejad H, Matrella G, Ciampolini P, De Munari I, Mordonini M, Cagnoni S. Food image recognition using very deep convolutional networks. Proceedings of the 2nd International Workshop on Multimedia Assisted Dietary Management. 2016:41–9. https://doi.org/10.1145/2986035.2986042
  34. 34. Gu A, Goel K, Re C. Efficiently modeling long sequences with structured state spaces. International Conference on Learning Representations. arXiv:2111.00396. [Preprint] 2021 [cited 2021 Oct 31. ].
  35. 35. Dao T, Gu A. Mamba: linear-time sequence modeling with selective state spaces. arXiv 2312.00752. 2024 [cited 2024 May 31. ]
  36. 36. Wang B, Ma J. Li F. U-mamba: enhancing long-range dependency for biomedical image segmentation. arXiv:2401.04722. 2024 [cited 2024 Jan 9. ] Available from:
  37. 37. Xiang S, Ruan J. Vm-unet: Vision mamba unet for medical image segmentation. arXiv: 2024 [cited 2024 Feb 4]. Available from: https://doi.org/10.48550/arXiv.2402.02491.
  38. 38. Meng C, Guo T, Wang Y. Mambamorph: a mamba-based backbone with contrastive feature learning for deformable mr-ct registration. arXiv. 2024 [cited 2024 Mar 13. ] Available from:
  39. 39. He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2016. https://doi.org/10.1109/cvpr.2016.90
  40. 40. Gu A, Johnson I, Goel K, Saab K, Dao T, Rudra A, et al. Combining recurrent, convolutional, and continuous-time models with linear state space layers. Adv Neural Inf Process Syst. 2021.
  41. 41. Gupta A, Gu A, Berant J. Diagonal state spaces are as effective as structured state spaces. Adv Neural Inf Process Syst. 2022. https://doi.org/10.48550/arXiv.2203.14343
  42. 42. Liu D, Zhao L, Wang Y, Kato J. Learn from each other to classify better: cross-layer mutual attention learning for fine-grained visual classification. Pattern Recog. 2023;140:109550.
  43. 43. Le Q, Tan M. Efficientnet: Rethinking model scaling for convolutional neural networks. Preprint. 2020 [cited 2020 Sep 11. ]. Available from: https://doi.org/10.48550/arXiv.1905.11946
  44. 44. Liu Z, Mao H, Wu C, Feichtenhofer C, Darrell T, S X. A convnet for the 2020s. arXiv:2201.03545. 2022 [cited 2022 Mar 2. ]. Available from:
  45. 45. Liu Z, Lin Y, Cao Y, Hu H, Wei Y, Zhang Z, et al. Swin transformer: hierarchical vision transformer using shifted windows. 2021 IEEE/CVF International Conference on Computer Vision (ICCV). 2021. p. 9992–10002. https://doi.org/10.1109/iccv48922.2021.00986