Abstract
Visual design element recognition and analysis play a critical role in various applications, ranging from creative design to cultural artifact preservation. However, existing methods often struggle with accurately identifying and understanding complex, multimodal design elements in real-world scenarios. To address this, we propose an integrated model that combines the Swin Transformer for precise image segmentation, multi-scale feature fusion for robust type recognition, and a multimodal large language model (LLM) for fine-grained image understanding. Experimental results on the ETHZ Shape Classes, ImageNet, and COCO datasets demonstrate that the proposed model outperforms state-of-the-art methods, achieving 88.6% segmentation accuracy and a 92.3% F1 score in multimodal tasks. These findings highlight the model’s potential as an effective tool for advanced design element recognition and analysis. The source code for this study is available at https://github.com/LIU-WENBO/Multi-Feature-Design-Elements-Recognition.
Citation: Wenbo L (2025) An integrated framework for multi-feature fusion and intelligent recognition of design elements: Challenges and solutions. PLoS One 20(12): e0339277. https://doi.org/10.1371/journal.pone.0339277
Editor: Nattapol Aunsri, Mae Fah Luang University, THAILAND
Received: February 23, 2025; Accepted: December 1, 2025; Published: December 26, 2025
Copyright: © 2025 Liu Wenbo. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The URLs for downloading the three datasets are as follows: (1) https://vision.ee.ethz.ch/datsets.html (2) https://image-net.org/challenges/LSVRC/index.php (3) https://cocodataset.org/#download.
Funding: The author(s) received no specific funding for this work.
Competing interests: The authors have no competing interests.
Introduction
The recognition and fine-grained understanding of visual design elements [1] play a crucial role in fields such as industrial design [2], advertising [3], brand identity [4], artwork [5], fashion [6], and packaging design [7]. In these areas, the precise application of visual components—such as color, shape, texture, typography, and layout—is key to conveying messages, evoking emotions, and establishing unique identities. By developing AI models capable of detailed analysis of these elements, we can significantly enhance both creative processes and market-driven outcomes.
In industrial design, fine-grained visual understanding enables AI to assist designers in optimizing product aesthetics and functionality. By analyzing intricate details of form, materials, and ergonomics, AI systems can provide real-time feedback, aiding in the rapid iteration of design concepts. In advertising, the ability to recognize the interplay between visual components allows for more impactful campaigns. AI can help tailor advertisements to consumer preferences, dynamically adapting visual elements for maximum engagement and effectiveness. In brand identity, maintaining consistency in visual assets—such as logos, colors, and typography—is essential. AI can ensure that design elements align with brand guidelines across various media, preserving the integrity of brand identity. Additionally, AI can provide valuable insights into market trends and competitors, helping brands differentiate themselves visually. [8–10]
In the fashion industry, AI’s fine-grained understanding of design elements can identify emerging trends and guide designers in creating collections that resonate with consumers [11–13]. Similarly, in artwork, AI can analyze historical styles and techniques, providing inspiration for contemporary creations. In packaging design, fine-grained visual analysis helps ensure that packaging not only attracts consumer attention but also effectively communicates the product’s essence. AI models can predict how different design elements will influence consumer perception, optimizing packaging for both aesthetic appeal and functionality.
Multimodal large models (MLMs) [14–16] have significantly advanced the recognition and fine-grained understanding of visual design elements, offering valuable tools for various creative industries, including industrial design, advertising, brand identity, fashion, and packaging. These models, such as CLIP [17] and DALL·E [18], are pre-trained on vast datasets of image-text pairs, allowing them to capture deep semantic relationships between visual elements (such as color, shape, and texture) and textual descriptions. In the context of visual design, MLMs can analyze intricate design features, recognizing visual patterns and correlating them with specific design principles or brand identities. This capability allows AI systems to support designers by providing insights into design aesthetics, automating design processes, and offering recommendations based on an understanding of both the visual and contextual aspects of the design. For example, in advertising and branding, MLMs can ensure consistency across various media platforms, aligning visual elements such as logos, typography, and color schemes with a brand’s identity and message.
However, the use of MLMs in fine-grained visual design understanding presents certain challenges. One of the primary limitations is their heavy reliance on large-scale, high-quality annotated datasets. In design domains, gathering such datasets can be time-consuming and expensive, as it requires expertise in both design principles and detailed annotation of visual elements. Additionally, despite the impressive capabilities of MLMs, they still struggle with complex design nuances that require cultural or contextual interpretation. For instance, understanding the subtleties of design styles across different cultural or historical contexts may be difficult for AI models, as these aspects often require human intuition and experience. Furthermore, training and deploying these models require substantial computational resources, which can limit their accessibility for smaller design studios or businesses with fewer technical resources. While MLMs can automate certain aspects of design, the need for human oversight in more intricate design decisions remains crucial. Overall, while multimodal models offer significant advantages in improving efficiency and supporting creativity in visual design, their limitations in data dependency, fine-grained understanding, and computational cost need to be addressed for broader practical adoption. Table 1 shows a comparative analysis of these visual design element recognition approaches.
Currently, the main techniques related to the recognition and fine-grained understanding of visual design elements include the following five approaches:
- CLIP (Contrastive Language-Image Pretraining): CLIP is a multimodal model trained on a large dataset of image-text pairs, where it learns to align image features with corresponding textual descriptions using contrastive learning. The model consists of a vision encoder (such as a ResNet or Vision Transformer) and a text encoder (typically a Transformer model). By mapping both images and text into a shared semantic space, CLIP can perform tasks such as zero-shot image classification and visual question answering. CLIP excels in generalizing across a wide range of image-text tasks, making it highly versatile for recognizing and interpreting visual design elements. It can understand design aesthetics by associating textual descriptions (e.g., “modern logo design,” “minimalist color palette”) with corresponding images, helping designers identify visual elements based on specific design language. While CLIP is powerful for general recognition tasks, it can struggle with fine-grained design details, such as differentiating subtle variations in texture or intricate design patterns. Additionally, CLIP’s reliance on large, pre-existing datasets means it might not be fully adapted to domain-specific design nuances without additional fine-tuning. [19–21]
- DeepLabV3+ (Semantic Image Segmentation): DeepLabV3+ is a semantic segmentation model that uses an encoder-decoder architecture with a modified atrous convolutional structure to capture multi-scale context in images. It outputs pixel-wise classifications of image regions, enabling fine-grained segmentation of visual elements in design images, such as shapes, patterns, or textures. For design tasks, DeepLabV3+ can accurately identify and segment distinct visual elements like borders, shapes, and colors within a design. This fine-grained segmentation is crucial for tasks such as logo extraction, background removal, and identifying color schemes in design work. DeepLabV3+ may struggle with more abstract design elements, such as visual balance or symmetry, and cannot easily interpret higher-level design principles like harmony or contrast. Additionally, segmentation models often require significant labeled data, which can be challenging to gather in design-specific contexts. [22–24]
- YOLO (You Only Look Once): YOLO is an object detection model that frames the problem of detecting objects in images as a single regression problem, predicting bounding boxes and class labels in one go. YOLO uses a single convolutional network to detect objects at various scales and outputs a set of bounding boxes for each detected object. YOLO is extremely fast and can detect visual design elements such as logos, icons, and product shapes in real-time. Its ability to perform object detection on images with high speed and accuracy makes it particularly useful for analyzing visual components in dynamic environments like digital marketing or e-commerce websites, where design elements need constant monitoring. While YOLO is fast and effective, it tends to have lower precision when detecting smaller objects or fine-grained design details. The model’s performance can degrade when handling complex or overlapping design elements, which are common in visual compositions like advertisements or fashion layouts. [25–27]
- StyleGAN2 (Generative Adversarial Network for Style Transfer): StyleGAN2 is a generative model designed for high-quality image synthesis, particularly excelling in producing images with a wide variety of styles and fine-grained textures. It uses a generator and discriminator architecture in GANs, where the generator creates synthetic images, and the discriminator evaluates their realism. StyleGAN2 can be applied to design tasks such as generating new design elements or styles based on given input images. For instance, it can generate new visual assets that follow a specific artistic style or design trend, making it highly useful for branding and packaging design, where consistent style generation is required. The model can produce visually convincing designs but may lack deeper understanding of design principles like color theory, visual hierarchy, or brand-specific guidelines. Fine-tuning StyleGAN2 to respect design consistency across various elements requires additional domain-specific training. [28–30]
- ViT (Vision Transformer): ViT applies transformer-based architectures to images, treating image patches as sequences, similar to how language models process text. The model captures long-range dependencies between image patches, making it effective in understanding global and local image features simultaneously. ViT excels in recognizing high-level visual patterns and global design structures, such as color palettes, layouts, and composition. Its ability to capture fine details in large-scale images makes it valuable for identifying design patterns, especially in complex or abstract design tasks. ViT models can be computationally expensive, especially when dealing with high-resolution images typical in design work. Additionally, ViT may not always capture low-level visual features (such as textures) as well as CNN-based models [31], making it less effective for tasks that require very detailed pixel-level recognition in design elements. [32–34]
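The contrastive alignment that underlies CLIP-style models can be illustrated with a minimal sketch: image and text embeddings are projected onto a shared unit sphere, and a prompt is selected by temperature-scaled cosine similarity. The vectors below are hand-made stand-ins for encoder outputs, not real CLIP features.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Project embeddings onto the unit sphere so dot products equal cosine similarity."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def zero_shot_classify(image_emb, text_embs, temperature=0.07):
    """CLIP-style zero-shot scoring: cosine similarity between one image
    embedding and each text-prompt embedding, scaled and softmax-normalized."""
    image_emb = l2_normalize(image_emb)
    text_embs = l2_normalize(text_embs)
    logits = text_embs @ image_emb / temperature
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

# Toy 4-d embeddings standing in for encoder outputs.
prompts = ["minimalist logo", "ornate pattern", "bold typography"]
text_embs = np.array([[1.0, 0.0, 0.0, 0.0],
                      [0.0, 1.0, 0.0, 0.0],
                      [0.0, 0.0, 1.0, 0.0]])
image_emb = np.array([0.9, 0.1, 0.2, 0.0])  # closest to the first prompt

probs = zero_shot_classify(image_emb, text_embs)
print(prompts[int(np.argmax(probs))])  # → "minimalist logo"
```

In a real pipeline the embeddings would come from pretrained vision and text encoders; the scoring step itself is unchanged.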
This study focuses on the development of an integrated model for the efficient recognition and in-depth analysis of visual design elements. The model combines image segmentation using the Swin Transformer, multi-scale feature fusion for visual design element type recognition, and fine-grained image understanding through a multimodal large language model. First, the Swin Transformer [35,36] is employed for the precise segmentation of visual design elements, leveraging its powerful ability to capture both local and global information. This enables the model to accurately distinguish between different design components, even in intricate design layouts. Next, a multi-scale feature fusion approach is utilized to enhance the recognition of various design element types, such as shapes, colors, and textures, by integrating features across multiple levels. This ensures stable and effective recognition, even in complex design environments. Lastly, by incorporating a multimodal LLM, the model benefits from fine-grained image understanding through joint analysis of visual content and textual descriptions. This allows the model not only to recognize the physical components of design but also to comprehend their contextual meanings and creative intentions. The advantages of this model lie in its multi-layered, cross-modal approach, which provides both high precision and efficiency in recognizing visual design elements. The use of the Swin Transformer ensures the model’s adaptability to the complex and detailed nature of design elements, enabling efficient fine-grained image understanding. Furthermore, the integration of the multimodal LLM offers a deeper semantic understanding, allowing the model to better interpret the intent and context behind design choices. This combined approach enhances the model’s ability to handle the variability and complexity inherent in real-world design tasks, making it a robust tool for advanced visual design element recognition and analysis.
The three innovations of this study include:
- This study innovatively combines the Swin Transformer for precise segmentation of visual design elements, leveraging its ability to capture both local and global information, thus enhancing the accuracy of detail recognition.
- A multi-scale feature fusion approach is employed for design element type recognition, strengthening the model’s ability to identify various design elements such as shapes, colors, and textures through multi-layered feature integration.
- By integrating a multi-modal large language model for fine-grained image understanding, the model conducts a joint analysis of visual content and textual descriptions, enabling a deeper interpretation of design elements’ meaning and context beyond mere object recognition.
The remainder of this paper is organized as follows. Section 2 reviews related work. Section 3 presents the proposed methods: an overview, image segmentation of visual design elements based on the Swin Transformer, visual design element type recognition based on multi-scale feature fusion, and multimodal LLM-based fine-grained image understanding of visual design elements. Section 4 describes the experiments, including implementation details, comparative experiments, and an ablation study. Section 5 concludes the paper.
Related work
Visual design elements.
Visual design elements refer to the fundamental components that constitute a visual design, including shapes, colors, lines, textures, and space. These elements are combined and organized to convey specific visual information, emotions, and themes, playing a critical role in both artistic and functional design. They serve as the building blocks through which designers can express creativity, structure, and intent. The key types of visual design elements include shapes and graphics, which influence the form and visual appeal; colors, which communicate emotions and establish atmosphere; lines, which define space and rhythm; textures, which enhance depth and tactile perception; and space, which determines the organization and clarity of the design. [37–39]
The use of visual design elements holds significant value in various design fields, including branding, user interface design, advertising, and product packaging. These elements not only enhance the aesthetic appeal and recognizability of a design but also guide the user’s visual experience, contributing to greater engagement and brand loyalty. By carefully combining and innovating with these elements, designers can achieve more precise and effective communication, increasing the competitive edge of their designs in the market. Furthermore, the strategic application of these elements allows for the creation of designs that are not only visually striking but also functional, ensuring that they resonate with users on both emotional and practical levels.
Overview of visual design element recognition techniques.
Visual design element recognition [40] refers to the automated process of identifying and classifying fundamental components of visual designs, such as shapes, colors, lines, textures, and spatial arrangements [41]. This process is critical in various fields, including digital design analysis, content-based image retrieval [42], art and fashion recognition, and design optimization. By enabling machines to understand and interpret the building blocks of visual compositions, these techniques facilitate more intelligent design analysis, pattern recognition, and automation in creative workflows [43]. The underlying principle of visual design element recognition involves extracting and analyzing low- to high-level features from visual data and mapping these features to predefined categories or design elements.
The techniques used for visual design element recognition can be broadly classified into several categories, including traditional image processing, feature-based methods, deep learning-based methods, and hybrid approaches that combine multiple techniques. Each of these methods has its own set of strengths and weaknesses depending on the complexity of the design environment and the specific requirements of the task.
- Traditional Image Processing Methods: These methods rely on low-level image processing techniques, such as edge detection [44], thresholding [45], color histograms [46], and texture segmentation [47]. Common algorithms include Canny edge detection [48], Sobel filters [49], and Fourier transforms [50]. These methods are computationally efficient and simple to implement, making them suitable for real-time applications and resource-constrained environments. However, they struggle with complex, non-linear relationships in design elements, and are less effective when designs contain noise, occlusions, or intricate textures. Additionally, traditional methods often fail to capture the semantic meaning of design elements, limiting their effectiveness in tasks requiring deeper contextual understanding. [51]
- Feature-Based Recognition Methods: These approaches focus on detecting distinctive features from images, such as corners, edges, or keypoints. Algorithms like SIFT (Scale-Invariant Feature Transform) [52], HOG (Histogram of Oriented Gradients) [53], and SURF (Speeded Up Robust Features) [54] extract these features to match and classify design elements. Feature-based methods offer better performance than traditional methods, especially in handling variations in scale, rotation, and lighting. They are widely used for object recognition and scene parsing. However, these methods tend to be computationally expensive, and their performance can degrade in the presence of cluttered or noisy design elements. Moreover, they typically focus on specific elements like shapes and textures, without capturing higher-level design structures or semantic meaning. [55]
- Deep Learning-Based Methods: In recent years, deep learning, particularly Convolutional Neural Networks (CNNs) [56] and Transformer-based models [57], has become the dominant approach for visual element recognition. CNNs excel at learning hierarchical features directly from raw pixel data, enabling them to identify complex patterns and relationships in visual designs. Recent innovations, such as the Swin Transformer [36] and Vision Transformers (ViTs) [58], have extended these capabilities by capturing both local and global context in design elements. These methods have shown significant improvements in accuracy and generalization, especially on large-scale and complex design datasets. However, deep learning methods require large annotated datasets for training and extensive computational resources, and can be slow to adapt to new design trends without further fine-tuning or transfer learning. Furthermore, the “black-box” nature of deep models can make it difficult to interpret the reasoning behind their classifications. [59]
- Hybrid Approaches: Hybrid methods combine traditional image processing or feature-based techniques with deep learning models to leverage the strengths of both approaches. For example, a hybrid model might first use edge detection to extract basic shapes and then apply a CNN or Transformer model to classify and interpret these shapes within the context of the design. Hybrid methods aim to improve both efficiency and robustness by providing a balance between interpretability and recognition performance. However, these methods introduce additional complexity and may increase computational costs, particularly when combining multiple deep learning models or integrating various feature extraction techniques. [60]
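To make the traditional image-processing category concrete, the following is a minimal Sobel gradient-magnitude edge detector in pure NumPy, the kind of low-level primitive a hybrid pipeline might use before passing shapes to a learned classifier. This is a teaching sketch; production code would use an optimized library such as OpenCV.

```python
import numpy as np

def sobel_edges(img):
    """Gradient-magnitude edge map via 3x3 Sobel filters (zero-padded borders)."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)  # horizontal gradient
    ky = kx.T                                                          # vertical gradient
    padded = np.pad(img.astype(float), 1)
    h, w = img.shape
    gx = np.zeros((h, w))
    gy = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            win = padded[i:i + 3, j:j + 3]
            gx[i, j] = (win * kx).sum()
            gy[i, j] = (win * ky).sum()
    return np.hypot(gx, gy)

# A bright square on a dark background: edges fire on the border, not inside.
img = np.zeros((8, 8))
img[2:6, 2:6] = 1.0
edges = sobel_edges(img)
```

As the survey notes, such filters are cheap and interpretable but capture no semantics; they only localize intensity transitions.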
Recent research has made significant strides in improving the accuracy and adaptability of visual design element recognition. One of the key developments has been the integration of multi-modal models that combine visual data with textual information, often using multi-modal transformers or vision-language models such as CLIP [61]. These models can understand both the visual characteristics and the semantic meaning of design elements, leading to better recognition of context-specific design components.
Another major advancement is the use of self-supervised learning techniques, which allow models to learn from unannotated data by leveraging data augmentation, contrastive learning, or pretext tasks [62]. This approach is particularly beneficial in design contexts where labeled data is scarce or expensive to obtain. Furthermore, advancements in few-shot learning [63] and transfer learning [64] have enabled deep learning models to generalize better with smaller datasets, making it easier to apply these techniques to specific design domains or emerging design trends. In addition, spatial reasoning and attention mechanisms have become a focal point of research, particularly with the introduction of Transformer-based models. [65] These models excel at capturing long-range dependencies and complex spatial relationships in design elements, enabling more accurate recognition in intricate or dynamic design environments. Despite these advancements, challenges remain, such as handling real-time recognition in large-scale applications, improving interpretability for design professionals, and reducing the computational burden of deep learning models. Ongoing research continues to focus on enhancing model efficiency, robustness, and adaptability to new design contexts.
The combination of Swin Transformer-based segmentation with multi-scale feature extraction has been explored in previous work (e.g., semantic segmentation and fine-grained categorization tasks), and multimodal LLMs have become standard in vision-language tasks (e.g., variants of CLIP or VL-BERT); see Table 2. Viewed component by component, this work may therefore appear incremental rather than seminal. However, this study provides targeted improvements for specific applications involving visual design elements (e.g., brand logos, packaging), especially through the end-to-end integration of the three components to handle complex design layouts and semantic intents.
While the individual components of our framework have foundations in existing literature, the novelty of this work lies in the end-to-end integration and its specific architectural design tailored for fine-grained visual design element analysis. As detailed in Table 2, our approach differs from prior works in several key aspects. Unlike the base Swin Transformer which focuses on general hierarchical vision tasks, our framework incorporates a dedicated multi-scale feature fusion mechanism followed by a multimodal LLM fine-tuned for design semantics. This allows for a cohesive flow from precise segmentation to semantic interpretation of design intent, a progression not present in the original architecture. Compared to multi-scale fusion methods like MSCPN [67] that rely on covariance pooling, our integration with the Swin backbone provides a more powerful hierarchical feature extractor. Furthermore, while multimodal LLMs are indeed becoming standard, our application involves a specialized cross-modal attention layer post-segmentation and a domain-specific fine-tuning strategy using design-centric datasets, moving beyond the general alignment goals of models surveyed in Literature [14].
Fine-grained understanding of visual design elements based on multi-modal LLM.
Fine-grained understanding of visual design elements using multi-modal large language models (LLMs) [70] integrates both visual and textual data to enable detailed recognition and interpretation of design components, such as shapes, colors, textures, and their contextual meanings. Unlike traditional image recognition, this approach incorporates semantic information from textual descriptions, allowing for more nuanced analysis of design elements and their relationships within a composition. The goal is to capture subtle features and understand the design intent behind each element [71,72].
Multi-modal LLMs leverage both visual and textual information. Vision models, such as Convolutional Neural Networks (CNNs) or Vision Transformers (ViTs), extract detailed visual features from images, while natural language models process textual data. These models are typically fused at various stages to combine insights from both modalities. Vision-language pretraining models like CLIP and ALIGN [73] align image and text features in a shared embedding space, allowing the model to recognize design elements in fine detail by relating visual patterns to textual descriptions. Transformer-based models (e.g., VL-BERT [74]) use attention mechanisms to enhance this process by focusing on specific visual regions and their corresponding textual meanings.
The primary advantage of multi-modal LLMs is their ability to combine visual recognition with semantic context, allowing for a more accurate and context-aware understanding of design elements. This is especially useful for tasks like art interpretation, design composition analysis, and fashion recognition, where understanding the relationships between elements is key. However, these models face several challenges, including high computational costs, especially for high-resolution images or large datasets. Additionally, achieving effective fusion between visual and textual information can be difficult, leading to misalignments or inefficiencies.
Recent research has focused on improving fusion techniques, such as dynamic fusion [75–77], to better integrate visual and textual data. Self-supervised learning [78] and few-shot learning [79] are being used to enhance fine-grained recognition with minimal labeled data, addressing the challenge of domain-specific adaptation. Moreover, advancements in spatial reasoning and attention mechanisms allow models to better capture intricate design details and their contextual significance [80]. Efforts to improve model interpretability, such as attention visualization [81] and explainable AI (XAI) [82], are also advancing, making it easier to understand model decisions.
Method
Overview.
This study presents an integrated model designed to efficiently recognize and analyze visual design elements by combining advanced image segmentation, multi-scale feature fusion, and multimodal understanding. The model employs the Swin Transformer for precise segmentation, leveraging its capability to capture both local and global visual features, ensuring accurate differentiation of design components even in intricate layouts. Following segmentation, a multi-scale feature fusion mechanism integrates features such as shape, color, and texture across multiple levels to enhance the recognition of design element types, achieving robustness in complex design environments. Finally, a multimodal large language model is incorporated to provide fine-grained understanding by jointly analyzing visual and textual information. This enables the model not only to identify physical design components but also to interpret their contextual meanings and creative intents. The combination of these methods ensures high precision, adaptability to complex designs, and a deep semantic understanding of visual design elements. The general structure of the model is shown in Fig 1.
The selection of feature extraction techniques was driven by three fundamental requirements of visual design analysis: compositional hierarchy preservation, cross-scale pattern recognition, and semantic-concept alignment. The Swin Transformer’s shifted window attention was prioritized over conventional CNNs for its unique ability to maintain spatial relationships across design elements while efficiently modeling both local details and global layouts. This architectural choice directly addresses the hierarchical nature of design compositions, where elements maintain meaningful spatial relationships at multiple scales. The multi-scale fusion mechanism was specifically designed to capture the characteristic scale variations in design elements, from fine textures to overall shapes. Finally, the cross-modal attention framework bridges the gap between low-level visual features and high-level design concepts, enabling the model to understand design elements in their proper semantic context. These interconnected choices form a cohesive feature extraction pipeline tailored to the nuanced demands of computational design analysis.
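The multi-scale fusion idea can be sketched as upsampling coarser stage outputs to the finest resolution and concatenating channels. This is a generic fusion baseline shown for illustration only; the exact fusion operator in the proposed framework may differ, and the stage shapes below follow common Swin-T settings rather than the trained model.

```python
import numpy as np

def fuse_multiscale(feature_maps):
    """Fuse stage outputs of decreasing resolution by nearest-neighbor
    upsampling to the finest grid, then channel-wise concatenation."""
    H, W = feature_maps[0].shape[:2]
    upsampled = []
    for F in feature_maps:
        h, w, c = F.shape
        rows = np.repeat(np.arange(h), H // h)   # replicate rows to height H
        cols = np.repeat(np.arange(w), W // w)   # replicate cols to width W
        upsampled.append(F[rows][:, cols])
    return np.concatenate(upsampled, axis=-1)

rng = np.random.default_rng(0)
stages = [rng.normal(size=(56, 56, 96)),    # fine stage
          rng.normal(size=(28, 28, 192)),   # middle stage
          rng.normal(size=(14, 14, 384))]   # coarse stage
fused = fuse_multiscale(stages)
print(fused.shape)  # → (56, 56, 672): all scales aligned and concatenated
```

The fused tensor carries fine textures (from the first stage) and global layout cues (from the last) at every spatial location, which is the property the fusion mechanism is designed to exploit.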
The integrated framework proposed in the paper distinguishes itself from prior models through a specialized end-to-end pipeline architecture that sequentially integrates Swin Transformer-based segmentation, multi-scale feature fusion, and multimodal LLM fine-tuning without relying on standard vector-only or early-fusion strategies seen in earlier works. Unlike the Swin Transformer in Liu et al. [36], which focuses on hierarchical vision tasks via shifted window attention but lacks multimodal integration, this framework employs a custom cross-modal attention layer post-segmentation to align visual hierarchies with textual descriptors, avoiding the computational overhead of dynamic fusion in surveys like Yin et al. [14]. Compared to multi-layer visual fusion in Lin et al. [83], where features are fused across LLM layers for general multimodal tasks, the paper’s approach incorporates a domain-specific LLM fine-tuning strategy using low-rank adaptation (LoRA) on paired visual-text design datasets, emphasizing semantic interpretation of design intents (e.g., color harmony or layout context) rather than broad alignment. This contrasts with covariance pooling in MSCPN [67], which aggregates multi-scale features via second-order statistics but omits LLM-driven semantics, or the RGBD fusion in Swin-Transformer Indoor Segmentation [66], which is modality-limited and lacks fine-grained textual guidance.
While the Swin Transformer and multi-scale feature fusion are established techniques in general computer vision, and multimodal LLMs have been applied to domains like document understanding [84], the novelty of our framework lies in the specific architectural integration tailored for fine-grained visual design element analysis. Unlike prior works that utilize these components in isolation or for different objectives, our pipeline is architected as a sequential, end-to-end process: precise segmentation via Swin Transformer, followed by multi-scale fusion capturing element attributes, culminating in a multimodal LLM fine-tuned specifically for design semantics. A key differentiator is the custom cross-modal attention mechanism, which is designed to align the hierarchical visual features output by the Swin Transformer with textual prompts for interpreting design-specific concepts like ‘color harmony’ or ‘layout intent,’ a focus absent in general-purpose models. This targeted integration, moving beyond vector concatenation or early fusion seen in other works, is optimized for the nuanced recognition and semantic interpretation required in design analysis, justifying the architectural novelty of the proposed framework.
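The role of a cross-modal attention layer can be illustrated with a single-head sketch in which segmented visual features act as queries over text-token keys and values, mixing textual semantics into each visual region. All dimensions and weight matrices here are illustrative placeholders, not the trained model's parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(visual, text, Wq, Wk, Wv):
    """Single-head cross-attention: visual tokens (queries) attend to text
    tokens (keys/values); each output row is a text-conditioned visual feature."""
    Q, K, V = visual @ Wq, text @ Wk, text @ Wv
    d_k = K.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))  # (n_visual, n_text), rows sum to 1
    return weights @ V                          # (n_visual, d_v)

rng = np.random.default_rng(1)
visual = rng.normal(size=(16, 32))  # e.g., 16 segmented-region features
text = rng.normal(size=(5, 32))     # e.g., 5 prompt-token embeddings
Wq, Wk, Wv = (rng.normal(size=(32, 32)) for _ in range(3))
attended = cross_modal_attention(visual, text, Wq, Wk, Wv)
```

This is the generic mechanism; the framework's design-specific behavior comes from fine-tuning such a layer on paired visual-text design data rather than from the attention formula itself.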
Image segmentation of visual design elements based on Swin Transformer.
The image segmentation of visual design elements using the Swin Transformer is a multi-step process that leverages its hierarchical architecture and self-attention mechanism to achieve high precision in complex layouts. The process begins by dividing the input image $I \in \mathbb{R}^{H \times W \times C}$, where H, W, and C represent the image height, width, and the number of channels, respectively, into non-overlapping $P \times P$ patches. Each patch is embedded into a fixed-dimensional feature vector $x_i \in \mathbb{R}^{D}$, forming the input sequence:
$X = \{x_1, x_2, \ldots, x_N\}, \quad N = \frac{HW}{P^2},$
where P is the patch size, and N is the total number of patches.
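The patch-partition and embedding step can be sketched in a few lines of NumPy. The projection matrix below is a random stand-in for the learned linear embedding, and the shapes (32×32×3 input, patch size 4, embedding dimension 96) are illustrative values only:

```python
import numpy as np

def patch_embed(image, patch_size=4, embed_dim=96, rng=None):
    """Split an (H, W, C) image into non-overlapping P x P patches and
    project each flattened patch to an embed_dim-dimensional token.
    The projection matrix is random here; in the model it is learned."""
    if rng is None:
        rng = np.random.default_rng(0)
    H, W, C = image.shape
    P = patch_size
    assert H % P == 0 and W % P == 0, "H and W must be divisible by P"
    # Rearrange into N = H*W / P^2 patches, each flattened to P*P*C values
    patches = image.reshape(H // P, P, W // P, P, C).transpose(0, 2, 1, 3, 4)
    patches = patches.reshape(-1, P * P * C)
    W_proj = rng.standard_normal((P * P * C, embed_dim)) * 0.02
    return patches @ W_proj  # input sequence X, shape (N, embed_dim)

tokens = patch_embed(np.zeros((32, 32, 3)))
print(tokens.shape)  # (64, 96): N = 32*32 / 4^2 = 64 patches
```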
These patch embeddings are processed through a series of Swin Transformer layers, which perform shifted window-based self-attention to compute attention weights for local regions while enabling cross-window communication:
$\text{Attention}(Q, K, V) = \text{Softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}} + B\right)V,$
where Q, K, and V are the query, key, and value matrices derived from the input features; $d_k$ is the dimension of the key; and B represents the positional encoding bias for each window.
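A minimal NumPy sketch of the windowed attention above, including the additive bias B. All tensors are random stand-ins for a single 7×7 window of tokens; the real model applies this per window across the feature map:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def window_attention(Q, K, V, B):
    """Scaled dot-product attention within one local window, with the
    additive relative position bias B from the equation above."""
    d_k = K.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d_k) + B)  # (N, N) attention map
    return A @ V

rng = np.random.default_rng(0)
N, d = 49, 32  # one 7x7 window of tokens
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
B = 0.1 * rng.standard_normal((N, N))  # relative position bias
out = window_attention(Q, K, V, B)
print(out.shape)  # (49, 32)
```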
The hierarchical structure of the Swin Transformer aggregates features across multiple levels, progressively reducing spatial dimensions while increasing channel depth. At each level, the output feature map is refined through downsampling and self-attention, enabling the model to capture both local and global context:
$F^{(l+1)} = \text{SwinBlock}\big(\text{Downsample}(F^{(l)})\big), \quad F^{(l)} \in \mathbb{R}^{H_l \times W_l \times C_l},$
where l denotes the current layer, and $H_l$, $W_l$, and $C_l$ are the spatial dimensions and channel depth at layer l.
Finally, the segmented output $\hat{Y} \in \mathbb{R}^{H \times W \times K}$ is obtained by applying an up-sampling and classification layer to generate pixel-wise predictions, where K is the number of segmentation classes. The segmentation loss is computed using a combined cross-entropy and Dice loss:
$\mathcal{L}_{\text{seg}} = -\sum_{i}\sum_{k} y_{i,k}\log \hat{y}_{i,k} + \lambda\left(1 - \frac{2\sum_{i} y_{i,k}\,\hat{y}_{i,k}}{\sum_{i} y_{i,k} + \sum_{i} \hat{y}_{i,k}}\right),$
where $y_{i,k}$ and $\hat{y}_{i,k}$ are the ground truth and predicted probabilities for pixel i and class k, respectively, and λ balances the two loss terms.
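The combined loss can be sketched in NumPy as follows. This is a simplified version operating on flattened pixel-probability arrays, with the per-class Dice terms averaged; a perfect prediction drives both terms toward zero:

```python
import numpy as np

def combined_seg_loss(y_true, y_pred, lam=0.5, eps=1e-7):
    """Cross-entropy plus lam-weighted Dice loss over flattened pixels.
    y_true: one-hot ground truth (N_pixels, K); y_pred: probabilities."""
    ce = -np.mean(np.sum(y_true * np.log(y_pred + eps), axis=1))
    inter = np.sum(y_true * y_pred, axis=0)            # per-class overlap
    dice = 1.0 - np.mean((2 * inter + eps) /
                         (y_true.sum(axis=0) + y_pred.sum(axis=0) + eps))
    return ce + lam * dice

y = np.eye(3)[[0, 1, 2, 1]]  # 4 pixels, 3 classes, one-hot
print(combined_seg_loss(y, y))  # ~0 for a perfect prediction
```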
This process ensures accurate segmentation of visual design elements by leveraging the Swin Transformer’s ability to model long-range dependencies and multi-scale features, making it suitable for complex design layouts. The principle of image segmentation of visual design elements based on Swin Transformer is shown in Fig 2.
Visual design element type recognition based on multi-scale feature fusion.
The recognition of visual design element types using multi-scale feature fusion involves the extraction and integration of features from multiple spatial and semantic levels to enhance robustness and accuracy in identifying design attributes such as shapes, colors, and textures. This process leverages a hierarchical structure to capture both fine-grained details and global context.
1. Feature Extraction The input segmented design elements $S \in \mathbb{R}^{H \times W \times K}$, where H and W are the height and width of the segmented element, and K is the number of segmentation classes, are passed through multiple convolutional layers to extract features at different levels:
$F^{(l)} = \text{Conv}_{k_l}\big(F^{(l-1)}\big), \quad F^{(0)} = S, \quad l = 1, \ldots, L,$
where $F^{(l)} \in \mathbb{R}^{H_l \times W_l \times C_l}$ represents the feature map at layer l, $\text{Conv}_{k_l}$ is a convolution operation with kernel size $k_l$, and L is the total number of layers. Spatial dimensions $H_l$ and $W_l$ progressively decrease through pooling operations, while channel depth $C_l$ increases to capture higher-level semantics.
2. Multi-Scale Feature Representation Features from different layers are resized and concatenated to form a multi-scale representation:
$F_{\text{ms}} = \bigoplus_{l=1}^{L} \text{Resize}\big(F^{(l)}\big),$
where ⨁ denotes channel-wise concatenation, and $\text{Resize}(\cdot)$ adjusts the spatial dimensions of $F^{(l)}$ to match the target resolution $H_t \times W_t$, ensuring alignment across scales.
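The resize-and-concatenate step can be illustrated with a nearest-neighbour resize. This is a simplification: the model may well use bilinear or learned upsampling, and the three feature-map shapes below are hypothetical:

```python
import numpy as np

def nearest_resize(F, Ht, Wt):
    """Nearest-neighbour resize of an (H, W, C) feature map to (Ht, Wt, C)."""
    H, W, _ = F.shape
    rows = np.arange(Ht) * H // Ht
    cols = np.arange(Wt) * W // Wt
    return F[rows][:, cols]

def multi_scale_fuse(features, Ht, Wt):
    """Resize each scale to the target resolution, then concatenate
    channel-wise (the bigoplus in the equation above)."""
    return np.concatenate([nearest_resize(F, Ht, Wt) for F in features],
                          axis=-1)

scales = [np.ones((32, 32, 64)), np.ones((16, 16, 128)), np.ones((8, 8, 256))]
fused = multi_scale_fuse(scales, 32, 32)
print(fused.shape)  # (32, 32, 448): channels 64 + 128 + 256
```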
3. Type-Specific Feature Aggregation To focus on specific design element attributes, the concatenated features are passed through attribute-specific attention modules. For each attribute a (e.g., shape, color, texture), the attention mechanism is defined as:
$F_a = \text{Softmax}\!\left(\frac{(W_Q^a F_{\text{ms}})(W_K^a F_{\text{ms}})^{\top}}{\sqrt{d}}\right)(W_V^a F_{\text{ms}}),$
where the softmax term is the attention map for attribute a, and $W_Q^a$, $W_K^a$, and $W_V^a$ are learnable projection matrices for query, key, and value, respectively. The aggregated feature $F_a$ encodes the attribute-specific information.
4. Classification of Element Types The attribute-specific features are combined and passed through a classification head to predict the type of the visual design element:
$\hat{p} = \text{Softmax}\!\Big(W \cdot \text{Flatten}\big(\textstyle\bigoplus_a F_a\big) + b\Big),$
where $\hat{p} \in \mathbb{R}^{T}$ represents the probabilities for T element types, W and b are the weights and bias of the classification layer, and $\text{Flatten}(\cdot)$ converts the combined feature map into a vector.
5. Loss Function for Training The model is trained using a weighted cross-entropy loss to account for class imbalances:
$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{t=1}^{T} w_t\, y_{i,t} \log \hat{p}_{i,t},$
where N is the number of training samples, $y_{i,t}$ is the ground truth label for class t of sample i, $\hat{p}_{i,t}$ is the predicted probability, and $w_t$ is the weight for class t.
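A compact NumPy version of the weighted cross-entropy. The class weights and uniform predictions here are hypothetical, chosen only to show how a rare class is up-weighted:

```python
import numpy as np

def weighted_ce(y_true, y_pred, w, eps=1e-7):
    """Weighted cross-entropy: rare classes receive larger weights w_t.
    y_true: one-hot (N, T); y_pred: probabilities (N, T); w: (T,)."""
    return -np.mean(np.sum(w * y_true * np.log(y_pred + eps), axis=1))

y = np.eye(3)[[0, 0, 0, 1]]    # class 0 is common, class 1 is rare
w = np.array([0.5, 1.5, 1.0])  # up-weight the rare class
p = np.full((4, 3), 1 / 3)     # uniform predictions
loss = weighted_ce(y, p, w)
print(loss)  # = mean(w of true class) * log(3) = 0.75 * log 3
```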
This method enables precise and adaptive recognition of design element types, making it highly effective for real-world applications involving diverse and intricate visual designs. The principle of visual design element type recognition based on multi-scale feature fusion is shown in Fig 3.
Multi-modal LLM-based fine-grained image understanding of visual design elements.
The use of a multi-modal large language model for fine-grained image understanding involves the integration of visual and textual modalities to enhance the comprehension of visual design elements. This process combines deep visual feature extraction with natural language representations, enabling tasks such as counting, position detection, color recognition, text recognition, celebrity identification, scene understanding, landmark detection, and artwork identification.
Cross-Modal Alignment via Attention Mechanisms To enable joint reasoning, the visual and textual features are fused using a cross-modal attention mechanism. The fused representation is computed as:
$F_{\text{fused}} = \text{Softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}\right)V, \quad Q = W_Q F_v, \; K = W_K F_t, \; V = W_V F_t,$
where Q, K, and V are the query, key, and value projections of the visual features $F_v$ and textual features $F_t$, respectively; $W_Q$, $W_K$, and $W_V$ are learnable matrices; and the softmax term is the attention map, determining how visual and textual features interact.
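The asymmetry of the fusion, with queries from the visual stream and keys/values from the text stream, can be sketched as follows. All features and projection matrices are random stand-ins, and the token counts (49 visual, 16 textual) are illustrative:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_fuse(F_v, F_t, W_Q, W_K, W_V):
    """Visual tokens attend over textual tokens: Q comes from the visual
    stream, K and V from the textual stream."""
    Q, K, V = F_v @ W_Q, F_t @ W_K, F_t @ W_V
    A = softmax(Q @ K.T / np.sqrt(K.shape[-1]))  # (N_v, N_t) attention map
    return A @ V                                  # fused features, (N_v, d)

rng = np.random.default_rng(0)
d = 512  # cross-modal feature dimension, as in the model configuration
F_v, F_t = rng.standard_normal((49, d)), rng.standard_normal((16, d))
W_Q, W_K, W_V = (0.02 * rng.standard_normal((d, d)) for _ in range(3))
fused = cross_modal_fuse(F_v, F_t, W_Q, W_K, W_V)
print(fused.shape)  # (49, 512)
```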
Task-Specific Output Heads
- 1. Counting
To count visual design elements, the fused features are passed through a regression head:
$\hat{c} = W_c F_{\text{fused}} + b_c,$
where $\hat{c}$ is the predicted count, and $W_c$ and $b_c$ are learnable weights and biases.
- 2. Position Detection
For position detection, the model outputs bounding boxes using a localization head:
$\hat{b} = (x, y, w, h) = W_b F_{\text{fused}} + b_b,$
where $\hat{b}$ represents the bounding box parameters (center coordinates, width, height).
- 3. Color Recognition
Color classification is achieved by predicting probabilities over a predefined set of color categories C:
$\hat{p}_{\text{color}} = \text{Softmax}\big(W_{\text{col}} F_{\text{fused}} + b_{\text{col}}\big),$
where $\hat{p}_{\text{color}} \in \mathbb{R}^{|C|}$ contains probabilities for each color class.
- 4. Text Recognition
For text recognition, the model decodes visual text features using a sequence-to-sequence framework:
$\hat{T} = \text{Decoder}\big(F_{\text{fused}}\big),$
where $\hat{T}$ is the reconstructed text sequence.
- 5. Celebrity, Scene, Landmark, and Artwork Recognition
For tasks requiring classification or identification, the fused features are mapped to class probabilities:
$\hat{p}_{\text{cls}} = \text{Softmax}\big(W_{\text{cls}} F_{\text{fused}} + b_{\text{cls}}\big),$
where $\hat{p}_{\text{cls}} \in \mathbb{R}^{K}$ represents probabilities over K classes.
Loss Functions
To train the model, task-specific losses are combined into a unified objective:
$\mathcal{L} = \lambda_1 \mathcal{L}_{\text{count}} + \lambda_2 \mathcal{L}_{\text{box}} + \lambda_3 \mathcal{L}_{\text{text}} + \lambda_4 \mathcal{L}_{\text{cls}},$
where $\mathcal{L}_{\text{count}}$ is the MSE loss for counting, $\mathcal{L}_{\text{box}}$ is the bounding box regression loss, $\mathcal{L}_{\text{text}}$ is the sequence generation loss for text recognition (e.g., cross-entropy), $\mathcal{L}_{\text{cls}}$ is the classification loss, and $\lambda_1, \ldots, \lambda_4$ are weights for balancing task-specific losses.
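The unified objective is simply a weighted sum of the per-task losses; a sketch with hypothetical loss values and weights:

```python
def combined_objective(losses, weights):
    """Weighted sum of task-specific losses in the unified objective.
    `losses` and `weights` are dicts keyed by task name."""
    return sum(weights[task] * losses[task] for task in losses)

# Hypothetical per-task losses and lambda weights for illustration
losses = {"count": 0.8, "box": 1.2, "text": 0.5, "cls": 0.3}
weights = {"count": 1.0, "box": 2.0, "text": 1.0, "cls": 1.0}
total = combined_objective(losses, weights)
print(total)  # 0.8 + 2.4 + 0.5 + 0.3 = 4.0
```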
The principle of multi-modal LLM-based fine-grained image understanding of visual design elements is shown in Fig 4.
Complexity analysis.
The computational complexity of our proposed model consists of three main components: (1) Swin Transformer segmentation, (2) multi-scale feature fusion, and (3) multimodal LLM understanding. For an input image of size $H \times W$, the Swin Transformer's complexity is $O(M^2 \cdot H_l W_l \cdot C_l)$ per layer, where $C_l$ is the channel dimension at layer l, and M is the window size (typically $M = 7$). This is more efficient than standard Vision Transformers' $O((HW)^2 \cdot C)$ complexity due to the local window attention mechanism. The multi-scale feature fusion operates at $O(S \cdot HW \cdot C)$, where S is the number of scales, maintaining linear complexity with respect to feature dimensions. The multimodal LLM's cross-modal attention has $O(N_v \cdot N_t \cdot d)$ complexity for visual-textual fusion, where $N_v$ and $N_t$ are the visual and textual feature dimensions, respectively, and d is the attention dimension. Compared to DETR ($O((HW)^2 \cdot C)$) and Mask R-CNN ($O(HW \cdot C^2 \cdot k^2)$, where k is the kernel size), our model achieves better performance with comparable complexity through efficient hierarchical processing and optimized feature fusion. The total memory requirement scales linearly with input size ($O(HW)$) due to the Swin Transformer's shifted window approach, making it practical for high-resolution design analysis.
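The efficiency gain over global attention follows directly from the two cost expressions: their ratio reduces to $HW / M^2$. A back-of-the-envelope comparison (operation counts only, constants and projections ignored; the 56×56×96 stage is an illustrative Swin-like configuration):

```python
def swin_attention_ops(H, W, C, M=7):
    """Rough per-layer cost of window attention, O(M^2 * H*W * C):
    each token attends only to the M x M tokens in its own window."""
    return M * M * H * W * C

def vit_attention_ops(H, W, C):
    """Global self-attention over all H*W tokens, O((H*W)^2 * C)."""
    return (H * W) ** 2 * C

H, W, C = 56, 56, 96  # an early-stage feature map
ratio = vit_attention_ops(H, W, C) / swin_attention_ops(H, W, C)
print(f"global attention costs ~{ratio:.0f}x window attention")  # HW/M^2 = 64
```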
Experiment
Experimental design.
Experimental Design In order to evaluate the efficacy of the multi-modal LLM-based fine-grained image understanding approach for visual design elements, several experiments were conducted. These experiments aimed to assess the model’s performance on tasks such as counting, position detection, color recognition, text recognition, celebrity identification, scene recognition, landmark detection, and artwork identification. Two main experimental categories were designed: baseline experiments, which focused on evaluating single-task performance, and integrated experiments, which tested the model’s capability in handling multiple tasks simultaneously. The experiments were conducted using well-established datasets, including COCO, ETHZ Shape Classes, and ImageNet, ensuring a comprehensive evaluation across diverse design layouts and task scenarios.
Experimental Setup The experiments were carried out on a high-performance computing workstation equipped with the following hardware and software configuration: two NVIDIA A100 40GB GPUs, 64GB of RAM, and an Intel Xeon 64-core CPU. The software stack included Python 3.8, PyTorch 1.10, CUDA 11.1, and various libraries for data preprocessing and model training, such as Transformers 4.14 and OpenCV 4.5. The operating system was Ubuntu 20.04. Data loading and preprocessing were optimized using multi-threading to accelerate the training process. Each experiment was repeated five times to ensure result stability, and the final performance metrics were averaged over these runs.
Model Configuration Key hyperparameters of the model were set as follows: The visual encoder was based on the Swin Transformer, with a hidden dimension of 384, a patch size of 4x4, 12 layers, and 12 attention heads, resulting in an output dimension of 768. The textual encoder utilized a BERT-base model with a hidden size of 768 and a maximum sequence length of 128 tokens. The cross-modal attention mechanism used a feature dimension of 512, with the query, key, and value projection matrices set to 512. The Adam optimizer was used with an initial learning rate of 1e-4 and weight decay of 1e-5, applying a cosine annealing learning rate scheduler. The batch size was set to 32, and gradient accumulation was utilized to optimize training efficiency.
Experimental Details The experimental procedure involved two main phases: pre-training and fine-tuning. During the pre-training phase, the model was initially trained on large-scale datasets such as ImageNet and COCO for 50 epochs to learn basic visual features. In the fine-tuning phase, the model was further trained on task-specific datasets for an additional 20 epochs. Data augmentation techniques, such as random cropping, rotation, and color jittering, were applied to enhance the model’s robustness. The training and evaluation datasets were split into 80% for training and 20% for validation. Additionally, 10-fold cross-validation was used to ensure the reliability and generalizability of the results.
Evaluation Metrics The performance of the model was evaluated using a variety of metrics tailored to each task. For the counting task, the mean absolute error (MAE) was computed, which quantifies the difference between predicted and ground truth counts. Position detection was evaluated using the average localization error (ALE), which measures the Euclidean distance between predicted and true bounding boxes. For color recognition, accuracy was calculated, representing the proportion of correctly classified colors. Text recognition performance was assessed using Character Error Rate (CER) and Word Error Rate (WER). In tasks such as celebrity identification, scene recognition, landmark detection, and artwork classification, the top-1 accuracy was used to evaluate the model’s classification performance. These metrics allowed for a comprehensive assessment of the model’s performance across various visual design tasks, ensuring its suitability for real-world applications.
Training Details The training process employed the Adam optimizer with an initial learning rate of $1 \times 10^{-4}$ and weight decay of $1 \times 10^{-5}$, utilizing a cosine annealing learning rate scheduler. Training was conducted for a total of 70 epochs, comprising 50 epochs of pre-training on large-scale datasets (ImageNet and COCO) followed by 20 epochs of task-specific fine-tuning. The stopping criterion was based on monitoring the validation loss, with early stopping implemented if no improvement was observed for 10 consecutive epochs to prevent overfitting. The minimization of the combined segmentation and classification loss functions was considered achieved when the validation loss plateaued, indicating model convergence.
Model Training Details To mitigate overfitting and ensure robust generalization, several regularization techniques were employed during training, and the loss on both the training and validation sets was closely monitored. Specifically, weight decay with a coefficient of $1 \times 10^{-5}$ was applied to all learnable parameters to constrain model complexity. Additionally, dropout layers with a rate of 0.1 were incorporated within the multi-scale feature fusion modules. Data augmentation techniques, including random cropping, rotation, and color jittering (brightness = 0.2, contrast = 0.2, saturation = 0.2, hue = 0.1), were used extensively to increase the diversity of the training data. Early stopping was implemented based on the validation loss, halting training if no improvement was observed for 10 consecutive epochs. The comparison of training and validation loss curves confirmed the effectiveness of these regularization measures in preventing overfitting and promoting stable convergence.
Dataset and data preprocessing.
The following three datasets were used in this study.
ETHZ Shape Classes The ETHZ Shape Classes dataset is a specialized dataset designed for shape-based object recognition tasks. It contains 255 images divided into five object classes: Applelogos, Bottles, Giraffes, Mugs, and Swans. The dataset focuses on objects with strong, distinct shapes in cluttered backgrounds, challenging models to accurately identify contours and boundaries. Images are provided in varying resolutions, ensuring diverse and realistic scenarios. Key fields in the dataset include object masks, object class labels, and image metadata. In this study, ETHZ Shape Classes is particularly advantageous for evaluating shape recognition capabilities, as its emphasis on distinctive object contours aligns well with the model’s segmentation and recognition tasks.
ImageNet The ImageNet dataset is a large-scale visual database widely used for pre-training deep learning models. It comprises over 14 million labeled images spanning 1,000 object classes, making it a cornerstone for developing robust image recognition systems. Images in ImageNet are high-resolution and come with class labels and associated bounding boxes for object localization tasks. The dataset’s diversity and scale make it ideal for pre-training the Swin Transformer in this study, allowing the model to learn generalizable visual features. Its comprehensive coverage of object categories also facilitates the development of recognition models that perform well on diverse and complex design elements.
COCO The COCO (Common Objects in Context) dataset is a comprehensive dataset for object detection, segmentation, and captioning tasks. It contains over 330,000 images, including 200,000 labeled with 80 object categories and 1.5 million object instances. Key fields include object bounding boxes, segmentation masks, image captions, and object categories. COCO stands out for its emphasis on contextual relationships between objects, providing a rich resource for models requiring both fine-grained object detection and contextual understanding. In this study, COCO is used to evaluate the model’s ability to segment and analyze visual design elements in complex, multi-object scenes, leveraging its annotated masks and captions for fine-grained multimodal tasks.
For data preprocessing, we implemented a rigorous pipeline to maintain consistency while preserving design integrity. This included standardized normalization procedures to handle variations in image formats and resolutions. Spatial transformations were applied judiciously to augment training data without distorting essential design characteristics. All preprocessing steps were validated through visual inspection to ensure they maintained the semantic meaning of design elements.
The augmentation strategy was specifically designed to address common challenges in design recognition tasks. We employed geometric transformations that respect design principles, such as aspect-ratio-preserving scaling and rotation within reasonable bounds. Color space adjustments were carefully calibrated to maintain the perceptual qualities of design elements while introducing necessary variability.
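A minimal sketch of the calibrated colour-space adjustment described above, covering brightness and contrast only (the study also jitters saturation and hue; the scaling scheme here is a simplified stand-in, not the exact implementation):

```python
import numpy as np

def color_jitter(img, brightness=0.2, contrast=0.2, rng=None):
    """Randomly scale brightness and contrast of a float image in [0, 1]
    within the given ranges, preserving the mean for the contrast step."""
    if rng is None:
        rng = np.random.default_rng(0)
    b = 1 + rng.uniform(-brightness, brightness)  # brightness factor
    c = 1 + rng.uniform(-contrast, contrast)      # contrast factor
    mean = img.mean()
    out = (img * b - mean) * c + mean
    return np.clip(out, 0.0, 1.0)  # keep values in valid range

img = np.full((8, 8, 3), 0.5)
aug = color_jitter(img)
print(aug.shape, float(aug.min()) >= 0.0, float(aug.max()) <= 1.0)
```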
Sample Dataset Implementation.
To further elucidate the proposed framework, we provide a brief implementation on a sample dataset. The sample comprises two representative images: one containing a logo design and another with a textured pattern. The image is first processed by the Swin Transformer for segmentation. For instance, the logo image is segmented into distinct regions corresponding to shapes and text. The segmented output is then passed through the multi-scale feature fusion module. Features extracted at different scales are integrated, enhancing the recognition of element types such as ‘circle’ for the logo shape and ‘serif font’ for the text. Finally, the multimodal LLM analyzes the fused visual features alongside a textual prompt. The LLM’s cross-modal attention mechanism aligns the visual segments with semantic concepts, outputting a fine-grained understanding like “A circular logo with bold, serif text conveying a classic brand identity.” This step-by-step demonstration on sample data clarifies the integrated workflow of our framework.
Comparison study with SOTA models.
To comprehensively evaluate the performance of the proposed model, a series of comparative experiments were conducted against several state-of-the-art (SOTA) models, including DETR, Mask R-CNN, Vision Transformer (ViT), and LayoutLMv3. The experiments focused on three key datasets: ETHZ Shape Classes, ImageNet, and COCO. The evaluation metrics included segmentation accuracy, mean Average Precision (mAP) for object detection, and F1 score for multimodal tasks such as text-guided element recognition. Each model was trained and tested under identical experimental conditions to ensure fair comparison. The results, summarized in Table 2, demonstrate the superiority of the proposed model in addressing various visual design tasks.
Detailed Results On the COCO dataset, the proposed model achieved a segmentation accuracy of 88.6%, significantly outperforming DETR (83.4%), Mask R-CNN (85.2%), and ViT (84.8%). For object detection tasks, the proposed model obtained a 51.4% mAP, compared to 47.2% for DETR, 49.3% for Mask R-CNN, and 48.7% for ViT (see Fig 5). In multimodal tasks, such as text-guided element recognition, the proposed model achieved an F1 score of 92.3%, which is markedly higher than LayoutLMv3's 88.5% (see Table 3).
On the ETHZ Shape Classes dataset, which emphasizes shape recognition in cluttered scenes, the proposed model demonstrated outstanding performance with a classification accuracy of 93.1%, surpassing Mask R-CNN (89.7%) and ViT (91.2%). This highlights the proposed model’s effectiveness in capturing fine-grained visual features and distinguishing complex shapes.
On the ImageNet dataset, the proposed model achieved a top-1 accuracy of 92.4%, outperforming ViT (90.1%) and other SOTA models. The results confirm the proposed model’s ability to generalize across large-scale datasets and handle diverse visual categories effectively.
The proposed model’s multi-layered approach, combining Swin Transformer-based segmentation, multi-scale feature fusion for recognition, and multimodal LLM for fine-grained understanding, provides substantial improvements across all tasks. The high segmentation accuracy and object detection mAP on COCO demonstrate its ability to handle complex scenes with multiple objects. The high F1 score for text-guided tasks highlights its strength in multimodal reasoning, which is essential for understanding contextual and semantic relationships in visual design. These results establish the proposed model as a robust and versatile tool for visual design analysis, outperforming SOTA models on diverse benchmarks.
Comparison study.
Five comparative experiments were then conducted to verify the performance of the proposed model from different perspectives.
Comparison Study 1: Segmentation Accuracy on Complex Shapes. This experiment compared the segmentation accuracy of the proposed model against DETR, Mask R-CNN, and ViT on the ETHZ Shape Classes dataset. The dataset's focus on objects with complex shapes and cluttered backgrounds made it an ideal benchmark for evaluating shape segmentation performance. Models were tested on 255 images across five shape classes, with segmentation accuracy measured using the Intersection over Union (IoU) metric. The proposed model achieved an IoU of 87.5%, significantly outperforming DETR (80.2%), Mask R-CNN (84.1%), and ViT (82.3%) (see Table 4).
The proposed model's superior performance is attributed to its Swin Transformer-based segmentation, which captures both local and global shape features effectively (see Fig 6). The multi-scale attention mechanism ensures fine-grained segmentation even in cluttered backgrounds, while DETR and Mask R-CNN struggle to maintain accuracy in such complex scenarios due to their reliance on pre-defined anchors or less robust global context modeling.
Comparison Study 2: Object Detection in Multi-Object Scenes. This study compared the object detection performance of the proposed model with DETR, YOLOv5, and Mask R-CNN on the COCO dataset. The test set included images containing multiple overlapping objects, and the detection performance was evaluated using mean Average Precision (mAP) at an IoU threshold of 0.5. The proposed model achieved an mAP of 51.4%, outperforming DETR (47.2%), YOLOv5 (48.6%), and Mask R-CNN (49.3%) (see Table 5).
The advantage of the proposed model lies in its multi-scale feature fusion, which effectively integrates low-level spatial features and high-level semantic information (see Fig 7). This enables better detection of small and occluded objects, whereas DETR and YOLOv5 struggle with occlusion, and Mask R-CNN exhibits limited performance due to its reliance on fixed-region proposals.
Comparison Study 3: Multimodal Text-Guided Recognition. This experiment evaluated the text-guided recognition performance of the proposed model and LayoutLMv3 on the COCO dataset. Tasks included recognizing visual elements based on textual descriptions, such as "red circle" or "landmark Eiffel Tower." The F1 score was used as the evaluation metric. The proposed model achieved an F1 score of 92.3%, outperforming LayoutLMv3, which scored 88.5% (see Table 6).
The superiority of the proposed model is due to its multimodal large language model (LLM), which effectively combines textual and visual features (see Fig 8). The cross-modal attention mechanism enhances the understanding of semantic relationships between textual descriptions and visual elements, while LayoutLMv3 struggles with complex text-visual correlations in detailed scenarios.
Comparison Study 4: Color and Texture Recognition. This study compared the proposed model with Mask R-CNN, ViT, and DETR for color and texture recognition tasks on a subset of the ImageNet dataset. Accuracy in correctly identifying the dominant color and texture was used as the evaluation metric. The proposed model achieved a recognition accuracy of 94.2%, surpassing Mask R-CNN (89.8%), ViT (91.4%), and DETR (88.7%) (see Table 7).
The proposed model's higher accuracy is attributed to its multi-scale feature fusion, which combines spatial and textural information (see Fig 9). This allows for more precise recognition of subtle textures and color variations, while the other models are limited by less sophisticated feature aggregation techniques.
Comparison Study 5: Scene and Landmark Recognition. This experiment evaluated the scene and landmark recognition capabilities of the proposed model against ViT, LayoutLMv3, and YOLOv5 on the COCO dataset. The task involved identifying scenes (e.g., beach, forest) and landmarks (e.g., Eiffel Tower), with top-1 accuracy as the evaluation metric. The proposed model achieved a top-1 accuracy of 93.7%, compared to ViT (90.3%), LayoutLMv3 (88.9%), and YOLOv5 (89.6%) (see Table 8).
The proposed model excels in scene and landmark recognition due to its fine-grained image understanding enabled by the multimodal LLM (see Fig 10). The integration of visual and contextual information allows it to discern subtle cues in scene composition and landmark features. The other models lack the advanced multimodal attention mechanisms required for such nuanced understanding.
Ablation study results and analysis.
To validate the contributions of each module in the proposed model, we conducted a series of ablation experiments (results are shown in Fig 11). The experiments evaluated the impact of removing or replacing key components: the Swin Transformer for segmentation, multi-scale feature fusion, and the multimodal large language model (LLM) for fine-grained image understanding. Performance was measured using segmentation accuracy, mAP for object detection, and F1 score for multimodal tasks on the COCO dataset.
Ablation study #1: Impact of Swin Transformer for Segmentation
We replaced the Swin Transformer module with a standard ResNet-50-based segmentation backbone to assess its contribution. Segmentation accuracy and IoU were measured for visual design element segmentation on the COCO dataset.
The Swin Transformer significantly improved segmentation accuracy (+4.3%) and IoU (+6.3%) (see Table 9). This is due to its ability to capture hierarchical global and local features through shifted window attention, which is particularly effective in complex layouts. ResNet-50 lacks this nuanced hierarchical feature aggregation, leading to poorer performance.
Ablation study #2: Impact of Multi-Scale Feature Fusion
The multi-scale feature fusion module was removed, and a single-scale feature extraction approach was employed. Object detection mAP was measured to evaluate performance on the COCO dataset.
The multi-scale feature fusion improved object detection mAP by 4.7% (see Table 10). This module enhances performance by combining low-level spatial and high-level semantic features, enabling the model to detect objects of varying sizes and complexities. Without multi-scale fusion, the model struggled with smaller objects and occluded scenarios.
Ablation study #3: Impact of Multimodal LLM for Fine-Grained Understanding
The multimodal LLM was replaced with a traditional image-only CNN for analyzing text-guided tasks, such as color recognition and landmark identification. F1 scores were measured for text-guided recognition on COCO.
The multimodal LLM significantly improved the F1 score (+5.8%) (see Table 11). Its cross-modal attention mechanism effectively integrates textual and visual information, enabling nuanced understanding of context and semantics. The image-only CNN lacked the capacity to leverage textual guidance, resulting in lower performance.
The ablation experiments validate the effectiveness of the proposed model’s modular design. Each component provides critical improvements to specific tasks, and their combined use ensures robust, high-performance visual design element analysis. These findings highlight the innovative and complementary nature of the proposed architecture.
Ablation Study #4: Impact of Integrated Pipeline Architecture
To rigorously validate the synergistic contribution of the proposed integrated pipeline, which combines Swin Transformer-based segmentation, multi-scale feature fusion, and a multimodal LLM for fine-grained design element recognition, we conducted an ablation study to assess the impact of removing individual components. The experiment evaluates the full model against variants where: (1) the Swin Transformer is replaced with a ResNet-50 backbone (lacking hierarchical attention), (2) multi-scale feature fusion is removed in favor of single-scale features, and (3) the multimodal LLM is substituted with a CNN-based image-only model. Performance was measured across the ETHZ Shape Classes dataset (segmentation, IoU), the ImageNet dataset (color/texture recognition, accuracy), and the COCO dataset (object detection, mAP; text-guided recognition, F1 score). The results, presented in Table 12, demonstrate significant performance degradation when any component is removed, confirming that the integrated architecture drives the model's superior performance (e.g., the 92.3% F1 score on COCO) through the synergistic combination of components, rather than the isolated strengths of the Swin Transformer's efficiency or the LLM's contextual understanding.
The results in Table 12 highlight the critical role of each component. Removing the Swin Transformer reduces IoU by 6.3% on ETHZ Shape Classes due to the loss of hierarchical shifted window attention, which is essential for capturing complex design layouts. Omitting multi-scale feature fusion leads to a 4.7% drop in COCO mAP, as single-scale features struggle with objects of varying sizes. Replacing the multimodal LLM with a CNN-based model causes a 5.8% decrease in COCO F1 score, underscoring the LLM’s importance in aligning visual and textual semantics for fine-grained design intent interpretation. These findings confirm that the proposed pipeline’s novelty lies in its integrated architecture, which synergistically combines these components to achieve robust performance across diverse design recognition tasks.
Ablation Study on Integrated Framework Contribution
To quantitatively validate that the performance superiority stems from the synergistic integration of the proposed framework’s components rather than their isolated capabilities, we conducted a comprehensive ablation study. The full model was compared against three variants: 1) w/o Swin: The Swin Transformer was replaced with a ResNet-50 backbone for segmentation; 2) w/o Multi-Scale: The multi-scale feature fusion module was removed, using only single-scale features from the final layer of the backbone; 3) w/o LLM: The multimodal LLM was replaced with a standard CNN-based model for image-only understanding, disabling cross-modal analysis. Each variant was evaluated on the ETHZ Shape Classes (segmentation IoU), ImageNet (color/texture recognition accuracy), COCO (object detection mAP), and COCO (text-guided recognition F1 score) datasets. The results demonstrate a significant and consistent performance drop across all tasks and datasets when any key component is ablated. For instance, on the COCO dataset, removing the multimodal LLM led to a 5.8% decrease in the F1-score for text-guided recognition, while replacing the Swin Transformer caused a 4.6% reduction in object detection mAP. These results conclusively show that the high performance reported is not attributable to any single component but is a direct outcome of their carefully engineered integration within the end-to-end pipeline, thereby solidifying the framework’s novelty.
Conclusion and outlook
Conclusion
This study presents an innovative model for the efficient recognition and fine-grained analysis of visual design elements, integrating advanced segmentation, feature fusion, and multimodal understanding techniques. By employing the Swin Transformer, the model achieves precise segmentation of visual design components, leveraging both local and global context. The multi-scale feature fusion enhances the recognition of diverse design element types, enabling robust performance in complex scenarios. Furthermore, the incorporation of a multimodal large language model facilitates a deeper semantic understanding of design elements by bridging visual and textual modalities. Extensive experiments on ETHZ Shape Classes, ImageNet, and COCO datasets demonstrate the model’s superiority over state-of-the-art methods, with significant improvements in segmentation accuracy, detection performance, and multimodal reasoning tasks. Ablation studies confirm the contributions of each module, highlighting their complementary roles in the model’s architecture. These results underscore the model’s potential as a powerful tool for advancing research and applications in design element recognition and analysis.
Outlook
Future research directions will explore transformer-based architectures for enhanced feature fusion capabilities. Specifically, we plan to investigate: (1) multi-scale vision transformers with cross-attention mechanisms for improved global-local feature integration, (2) dynamic channel-wise attention for adaptive feature weighting, and (3) hybrid CNN-transformer architectures that combine the strengths of convolutional operations for local feature extraction with self-attention for long-range dependency modeling. The integration of such architectures could further enhance our model’s ability to handle complex feature interactions while maintaining computational efficiency.
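The dynamic channel-wise attention proposed in direction (2) can be sketched as a squeeze-and-excitation-style gate: pool each channel to a scalar, pass the descriptors through a small two-layer network, and rescale the channels by the resulting gates. The implementation below is a minimal pure-Python illustration; the weight matrices `w1` and `w2` and all shapes are hypothetical, not part of the proposed model.

```python
import math

def channel_attention(feature_maps, w1, w2):
    """Squeeze-and-excitation-style channel weighting (illustrative sketch).

    feature_maps: list of C channels, each a 2D list (H x W).
    w1, w2: weights of the two small fully connected layers (hypothetical).
    Returns the feature maps rescaled by per-channel gates in (0, 1).
    """
    # Squeeze: global average pool each channel to a scalar descriptor.
    desc = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
            for ch in feature_maps]
    # Excitation: FC -> ReLU -> FC -> sigmoid.
    hidden = [max(0.0, sum(w * d for w, d in zip(row, desc))) for row in w1]
    gates = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
             for row in w2]
    # Rescale: weight every channel by its learned gate.
    return [[[v * g for v in row] for row in ch]
            for ch, g in zip(feature_maps, gates)]
```

In a full model the gate weights would be learned end-to-end; here identity matrices suffice to show the mechanism.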
One notable limitation of this study lies in the model’s dependency on high-quality and diverse training data, particularly for multimodal tasks. The multimodal large language model component requires extensive, accurately labeled visual-text paired datasets for training; in niche domains, such as artistic design or cultural artifacts, obtaining such datasets can be challenging. Furthermore, the model’s complexity and computational demands, especially for large-scale design images, may impact its real-time performance and scalability. These constraints could hinder the deployment and broader application of the model in practical scenarios. To address the data dependency issue, semi-supervised learning methods or generative adversarial networks could be employed to expand limited labeled datasets by generating diverse, high-quality visual-text pairs, and pre-trained cross-modal models, such as CLIP or DALL-E, could be utilized for data augmentation in underrepresented domains. To mitigate computational complexity, model compression techniques such as knowledge distillation, pruning, and quantization can be implemented to reduce computational costs, while lightweight multimodal LLMs such as MiniGPT-4, or parameter-efficient adaptation such as LoRA fine-tuning, can enhance inference efficiency. By integrating cloud and edge computing, the computational load can be distributed, enabling improved real-time performance and scalability, thereby enhancing the model’s applicability to practical use cases.
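Of the compression techniques mentioned, knowledge distillation is the most self-contained to illustrate. The sketch below shows the standard soft-target loss: the teacher’s logits are softened with a temperature so the student learns inter-class similarities rather than only the argmax. The temperature value and logits are hypothetical; this is not code from the proposed framework.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=4.0):
    """Soft-target KL divergence used in knowledge distillation (sketch).

    Scaled by T^2, following common practice, so gradient magnitudes stay
    comparable across temperature settings.
    """
    p = softmax(teacher_logits, temperature)  # teacher distribution
    q = softmax(student_logits, temperature)  # student distribution
    return temperature ** 2 * sum(pi * math.log(pi / qi)
                                  for pi, qi in zip(p, q))
```

When the student matches the teacher exactly the loss is zero; any mismatch yields a strictly positive penalty, which is the signal a smaller student network is trained on.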
Another significant limitation of this study is the model’s reliance on computationally intensive components, such as the Swin Transformer and the multimodal large language model, which require substantial memory and processing power. This dependency may restrict the model’s accessibility for deployment on resource-constrained devices or in real-time applications. Additionally, while the model excels on benchmark datasets, its generalization to unseen, domain-specific datasets, such as those in unconventional or low-resource visual design contexts, remains uncertain. These limitations highlight potential challenges in extending the model to broader, more diverse real-world scenarios. To address the computational intensity, optimization techniques such as low-rank approximation, weight pruning, and lightweight model architectures like MobileNet or TinyBERT can be adopted to reduce resource demands. For improving generalization, fine-tuning the model on domain-specific data through transfer learning could be explored, and self-supervised pretraining on large-scale unlabeled visual design data may improve robustness and adaptability. Integrating edge computing for local tasks with cloud computing for more intensive operations can also enhance scalability and real-time performance, enabling the model to serve diverse application scenarios effectively.
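Among the optimization techniques listed, unstructured magnitude pruning is the simplest to make concrete: zero out the fraction of weights with the smallest absolute value, then fine-tune the sparsified model. The function below is a hypothetical illustration of the idea, not an implementation from the proposed framework.

```python
def prune_by_magnitude(weights, sparsity):
    """Unstructured magnitude pruning (illustrative sketch).

    weights: 2D list of floats (one weight matrix).
    sparsity: fraction of weights to zero out, e.g. 0.5.
    Ties at the threshold may prune slightly more than requested.
    """
    flat = sorted(abs(w) for row in weights for w in row)
    k = int(len(flat) * sparsity)
    threshold = flat[k - 1] if k > 0 else -1.0
    return [[0.0 if abs(w) <= threshold else w for w in row]
            for row in weights]
```

In practice the pruned positions are stored as a mask so they stay zero during subsequent fine-tuning, and structured variants prune whole channels to realize actual speedups on hardware.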
References
- 1. Wei X-S, Song Y-Z, Aodha OM, Wu J, Peng Y, Tang J, et al. Fine-grained image analysis with deep learning: a survey. IEEE Trans Pattern Anal Mach Intell. 2022;44(12):8927–48. pmid:34752384
- 2. Alcaide-Marzal J, Diego-Mas JA, Acosta-Zazueta G. A 3D shape generative method for aesthetic product design. Design Studies. 2020;66:144–76.
- 3. Liu W. Retracted: Research on the application of multimedia elements in visual communication art under the internet background. Mobile Information Systems. 2021;2021(1).
- 4. Liu L, Dzyabura D, Mizik N. Visual Listening In: Extracting Brand Image Portrayed on Social Media. Marketing Science. 2020;39(4):669–86.
- 5. Yilma BA, Leiva LA. The elements of visual art recommendation: Learning latent semantic representations of paintings. In: Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems. 2023. 1–17.
- 6. Bashirzadeh Y, Mai R, Faure C. How rich is too rich? Visual design elements in digital marketing communications. International Journal of Research in Marketing. 2022;39(1):58–76.
- 7. Vermeir I, Roose G. Visual Design Cues Impacting Food Choice: A Review and Future Research Agenda. Foods. 2020;9(10):1495. pmid:33086720
- 8. Auernhammer J, Roth B. The origin and evolution of Stanford University’s design thinking: From product design to design thinking in innovation management. Journal of Product Innovation Management. 2021;38(1):623–44.
- 9. Attias N, Danai O, Abitbol T, Tarazi E, Ezov N, Pereman I, et al. Mycelium bio-composites in industrial design and architecture: Comparative review and experimental analysis. Journal of Cleaner Production. 2020;246:119037.
- 10. Yıldız BS, Pholdee N, Bureerat S, Erdaş MU, Yıldız AR, Sait SM. Comparison of the political optimization algorithm, the Archimedes optimization algorithm and the Levy flight algorithm for design optimization in industry. Materials Testing. 2021;63(4):356–9.
- 11. Yogantari MV. Visual Exploration Using Acrylic Paint on Used Fashion Items for Sustainable Use. IJPR. 2020;24(3):2574–9.
- 12. Zhao D, Zou Q. Neural network models in fashion design recommendation with interactive visualization methods. 2023.
- 13. Chung D. A design proposal for fashion and sustainability: AR-driven personalized designs for environmental sustainability. Korea Design Forum. 2023;81:39–52.
- 14. Yin S, Fu C, Zhao S, Li K, Sun X, Xu T, et al. A survey on multimodal large language models. Natl Sci Rev. 2024;11(12):nwae403. pmid:39679213
- 15. Song S, Li X, Li S, Zhao S, Yu J, Ma J, et al. How to bridge the gap between modalities: A comprehensive survey on multimodal large language model. arXiv preprint. 2023.
- 16. Huang J, Zhang J. A survey on evaluation of multimodal large language models. arXiv preprint. 2024.
- 17. Zhao Z, Liu Y, Wu H, Wang M, Li Y, Wang S, et al. CLIP in medical imaging: a comprehensive survey. 2023. https://arxiv.org/abs/2312.07353
- 18. Marcus G, Davis E, Aaronson S. A very preliminary analysis of DALL-E 2. arXiv preprint arXiv:2204.13807. 2022.
- 19. Shen S, Li LH, Tan H, Bansal M, Rohrbach A, Chang K-W, Yao Z, Keutzer K. How much can CLIP benefit vision-and-language tasks? arXiv preprint arXiv:2107.06383. 2021.
- 20. Khandelwal A, Weihs L, Mottaghi R, Kembhavi A. Simple but effective: CLIP embeddings for embodied AI. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022. 14829–38.
- 21. Liu J, Zhang Y, Chen JN, Xiao J, Lu Y, Landman BA, et al. CLIP-driven universal model for organ segmentation and tumor detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023. 21152–64.
- 22. Peng H, Xue C, Shao Y, Chen K, Xiong J, Xie Z, et al. Semantic Segmentation of Litchi Branches Using DeepLabV3+ Model. IEEE Access. 2020;8:164546–55.
- 23. Wang Y, Yang L, Liu X, Yan P. An improved semantic segmentation algorithm for high-resolution remote sensing images based on DeepLabv3. Sci Rep. 2024;14(1):9716. pmid:38678060
- 24. Du S, Du S, Liu B, Zhang X. Incorporating DeepLabv3+ and object-based image analysis for semantic segmentation of very high resolution remote sensing images. International Journal of Digital Earth. 2020;14(3):357–78.
- 25. Jiang P, Ergu D, Liu F, Cai Y, Ma B. A Review of Yolo Algorithm Developments. Procedia Computer Science. 2022;199:1066–73.
- 26. Diwan T, Anirudh G, Tembhurne JV. Object detection using YOLO: challenges, architectural successors, datasets and applications. Multimed Tools Appl. 2023;82(6):9243–75. pmid:35968414
- 27. Hussain M. YOLO-v1 to YOLO-v8, the Rise of YOLO and Its Complementary Nature toward Digital Manufacturing and Industrial Defect Detection. Machines. 2023;11(7):677.
- 28. Viazovetskyi Y, Ivashkin V, Kashin E. StyleGAN2 distillation for feed-forward image manipulation. In: Computer Vision–ECCV 2020: 16th European Conference. Glasgow, UK, 2020. 170–86.
- 29. Ayanthi D, Munasinghe S. Text-to-face generation with StyleGAN2. arXiv preprint. 2022. https://doi.org/10.48550/arXiv.2205.12512
- 30. Back J. Fine-tuning StyleGAN2 for cartoon face generation. arXiv preprint. 2021. https://doi.org/10.48550/arXiv.2106.12445
- 31. Nigam S. Forecasting time series using convolutional neural network with multiplicative neuron. Applied Soft Computing. 2025;174:112921.
- 32. Khan S, Naseer M, Hayat M, Zamir SW, Khan FS, Shah M. Transformers in Vision: A Survey. ACM Comput Surv. 2022;54(10s):1–41.
- 33. Han K, Wang Y, Chen H, Chen X, Guo J, Liu Z, et al. A Survey on Vision Transformer. IEEE Trans Pattern Anal Mach Intell. 2023;45(1):87–110. pmid:35180075
- 34. Islam K. Recent advances in vision transformer: A survey and outlook of recent work. arXiv preprint. 2022. https://doi.org/10.48550/arXiv.2203.01536
- 35. Liu Z, Hu H, Lin Y, Yao Z, Xie Z, Wei Y, et al. Swin transformer v2: Scaling up capacity and resolution. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022. 12009–19.
- 36. Liu Z, Lin Y, Cao Y, Hu H, Wei Y, Zhang Z, et al. Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021. 10012–22.
- 37. Gao Y (Lisa), Wu L, Shin J, Mattila AS. Visual Design, Message Content, and Benefit Type: The Case of A Cause-Related Marketing Campaign. Journal of Hospitality & Tourism Research. 2020;44(5):761–79.
- 38. Lu J. Innovative application of recombinant traditional visual elements in graphic design. IJCAI. 2022;46(1).
- 39. Sisodia A, Burnap A, Kumar V. Generative Interpretable Visual Design: Using Disentanglement for Visual Conjoint Analysis. Journal of Marketing Research. 2024;62(3):405–28.
- 40. Xia K. A Deep Learning-Based Method for Visual Element Recognition and Openness Evaluation of Street Interfaces. In: 2025 6th International Conference on Computer Vision, Image and Deep Learning (CVIDL). 2025. 1165–8. https://doi.org/10.1109/cvidl65390.2025.11085530
- 41. Kumar R, Naaz S. Exploring the depth of elements and principles of visual design. ShodhKosh J Vis Per Arts. 2023;4(2ECVPAMIAP).
- 42. Hameed IM, Abdulhussain SH, Mahmmod BM. Content-based image retrieval: A review of recent trends. Cogent Engineering. 2021;8(1).
- 43. Huang L, Zheng P. Human-Computer Collaborative Visual Design Creation Assisted by Artificial Intelligence. ACM Trans Asian Low-Resour Lang Inf Process. 2023;22(9):1–21.
- 44. Sun R, Lei T, Chen Q, Wang Z, Du X, Zhao W, et al. Survey of Image Edge Detection. Front Signal Process. 2022;2.
- 45. Seelaboyina R, Vishwakarma R. Different thresholding techniques in image processing: A review. In: ICDSMLA 2021: Proceedings of the 3rd International Conference on Data Science, Machine Learning and Applications. 2023. 23–9.
- 46. Hamdani H, Septiarini A, Sunyoto A, Suyanto S, Utaminingrum F. Detection of oil palm leaf disease based on color histogram and supervised classifier. Optik. 2021;245:167753.
- 47. Aouat S, Ait-hammi I, Hamouchene I. A new approach for texture segmentation based on the Gray Level Co-occurrence Matrix. Multimed Tools Appl. 2021;80(16):24027–52.
- 48. Agrawal H, Desai K. Canny edge detection: a comprehensive review. IJTRS. 2024;9(Spl):27–35.
- 49. Ranjan R, Avasthi V. Edge detection using guided sobel image filtering. Wireless Pers Commun. 2023;132(1):651–77.
- 50. Fu Z, Lin Y, Yang D, Yang S. Fractional Fourier Transforms Meet Riesz Potentials and Image Processing. SIAM J Imaging Sci. 2024;17(1):476–500.
- 51. Zhang X, de Greef L, Swearngin A, White S, Murray K, Yu L, et al. Screen Recognition: Creating Accessibility Metadata for Mobile Applications from Pixels. In: Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems. 2021. 1–15. https://doi.org/10.1145/3411764.3445186
- 52. Li H, Wang L, Zhao T, Zhao W. Local-Peak Scale-Invariant Feature Transform for Fast and Random Image Stitching. Sensors (Basel). 2024;24(17):5759. pmid:39275669
- 53. Marini M, Hariyanto H. Implementation of the HOG (histogram of oriented gradients) method for pedestrian detection in video images [in Indonesian]. Innovative: Journal of Social Science Research. 2024;4:13964–70.
- 54. Verma K, Ghosh D, Kumar A. Visual tracking in unstabilized real time videos using SURF. J Ambient Intell Human Comput. 2019;15(1):809–27.
- 55. Peng Z, Huang W, Gu S, Xie L, Wang Y, Jiao J, et al. Conformer: Local features coupling global representations for visual recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021. 367–76.
- 56. Brahmi W, Jdey I, Drira F. Exploring the role of Convolutional Neural Networks (CNN) in dental radiography segmentation: A comprehensive Systematic Literature Review. Engineering Applications of Artificial Intelligence. 2024;133:108510.
- 57. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al. Attention is all you need. Advances in Neural Information Processing Systems. 2017;30.
- 58. Thisanke H, Deshan C, Chamith K, Seneviratne S, Vidanaarachchi R, Herath D. Semantic segmentation using Vision Transformers: A survey. Engineering Applications of Artificial Intelligence. 2023;126:106669.
- 59. Tripathi M. Analysis of Convolutional Neural Network based Image Classification Techniques. JIIP. 2021;3(2):100–17.
- 60. Li Y, Yao T, Pan Y, Mei T. Contextual Transformer Networks for Visual Recognition. IEEE Trans Pattern Anal Mach Intell. 2023;45(2):1489–500. pmid:35363608
- 61. Wang J, Chan KCK, Loy CC. Exploring CLIP for Assessing the Look and Feel of Images. AAAI. 2023;37(2):2555–63.
- 62. Bardes A, Ponce J, LeCun Y. Vicregl: Self-supervised learning of local visual features. Advances in Neural Information Processing Systems. 2022;35:8799–810.
- 63. Ren W, Tang Y, Sun Q, Zhao C, Han Q-L. Visual Semantic Segmentation Based on Few/Zero-Shot Learning: An Overview. IEEE/CAA J Autom Sinica. 2024;11(5):1106–26.
- 64. Paymode AS, Malode VB. Transfer Learning for Multi-Crop Leaf Disease Image Classification using Convolutional Neural Network VGG. Artificial Intelligence in Agriculture. 2022;6:23–33.
- 65. Zhang Q, Yang YB. Rest: An efficient transformer for visual recognition. Advances in neural information processing systems. 2021;34:15475–85.
- 66. Cui L, Jing X, Wang Y, Huan Y, Xu Y, Zhang Q. Improved Swin Transformer-Based Semantic Segmentation of Postearthquake Dense Buildings in Urban Areas Using Remote Sensing Images. IEEE J Sel Top Appl Earth Observations Remote Sensing. 2023;16:369–85.
- 67. Qian L, Yu T, Yang J. Multi-Scale Feature Fusion of Covariance Pooling Networks for Fine-Grained Visual Recognition. Sensors (Basel). 2023;23(8):3970. pmid:37112311
- 68. Cheng Y, Zhang Z, Yang M, Nie H, Li C, Wu X, et al. Graphic Design with Large Multimodal Model. AAAI. 2025;39(3):2473–81.
- 69. Zou X, Zhang W, Zhao N. From fragment to one piece: A survey on AI-driven graphic design. arXiv preprint. 2025.
- 70. Liang Z, Xu Y, Hong Y, Shang P, Wang Q, Fu Q, et al. A Survey of Multimodal Large Language Models. In: Proceedings of the 3rd International Conference on Computer, Artificial Intelligence and Control Engineering. 2024. 405–9. https://doi.org/10.1145/3672758.3672824
- 71. Jiao Q, Chen D, Huang Y, Li Y, Shen Y. Enhancing multimodal large language models with vision detection models: an empirical study. 2024. https://arxiv.org/abs/2401.17981
- 72. Lin Z, Liu C, Zhang R, Gao P, Qiu L, Xiao H, et al. SPHINX: The joint mixing of weights, tasks, and visual embeddings for multi-modal large language models. arXiv preprint. 2023. https://doi.org/10.48550/arXiv.2311.07575
- 73. Och FJ, Ney H. A Systematic Comparison of Various Statistical Alignment Models. Computational Linguistics. 2003;29(1):19–51.
- 74. Su W, Zhu X, Cao Y, Li B, Lu L, Wei F, et al. VL-BERT: Pre-training of generic visual-linguistic representations. arXiv preprint. 2019. https://doi.org/10.48550/arXiv.1908.08530
- 75. Wang B, Li B, Gao T, Li L, Wang H, Zhao C, et al. DAMF: A Semantic-Guided Dynamic Attention Framework for Visual-Haptic-Textual Multimodal Fusion. Knowledge-Based Systems. 2025:114244.
- 76. Sahu G, Vechtomova O. Dynamic fusion for multimodal data. arXiv preprint. 2019. https://doi.org/10.48550/arXiv.1911.03821
- 77. Wang S, Zhang J, Zong C. Learning Multimodal Word Representation via Dynamic Fusion Methods. AAAI. 2018;32(1).
- 78. Shu Y, Van den Hengel A, Liu L. Learning Common Rationale to Improve Self-Supervised Representation for Fine-Grained Visual Recognition Problems. In: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2023. 11392–401. https://doi.org/10.1109/cvpr52729.2023.01096
- 79. Zha Z, Tang H, Sun Y, Tang J. Boosting Few-Shot Fine-Grained Recognition With Background Suppression and Foreground Alignment. IEEE Trans Circuits Syst Video Technol. 2023;33(8):3947–61.
- 80. Liu F, Emerson G, Collier N. Visual spatial reasoning. Transactions of the Association for Computational Linguistics. 2023;11:635–51.
- 81. Vig J. A multiscale visualization of attention in the transformer model. arXiv preprint. 2019. https://doi.org/10.48550/arXiv.1906.05714
- 82. Xu F, Uszkoreit H, Du Y, Fan W, Zhao D, Zhu J. Explainable AI: A brief survey on history, research areas, approaches and challenges. In: CCF International Conference on Natural Language Processing and Chinese Computing. 2019. 563–74.
- 83. Lin J, Chen H, Fan Y, Fan Y, Jin X, Su H, et al. Multi-Layer Visual Feature Fusion in Multimodal LLMs: Methods, Analysis, and Best Practices. In: Proceedings of the Computer Vision and Pattern Recognition Conference. 2025. 4156–66.
- 84. Huang Y, Lv T, Cui L, Lu Y, Wei F. LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking. In: Proceedings of the 30th ACM International Conference on Multimedia. 2022. 4083–91. https://doi.org/10.1145/3503161.3548112