Abstract
Accurate interpretation of single-line diagrams (SLDs) is crucial for analyzing electrical systems, as they encapsulate vital information about operational safety and efficiency in a simplified format. Traditional SLD processing methods rely on manual inspection and basic image analysis, which are computationally intensive, error-prone, and require extensive preprocessing. Although deep learning has been applied to symbol classification, existing models often fail to capture fine-grained symbol details, leading to misclassification. To address these limitations, this study proposes a hybrid deep learning-based symbol classification method. A newly created dataset was benchmarked using state-of-the-art deep learning models, and an optimal model was systematically designed, developed, and tested. The proposed approach integrates a Hybrid Residual Attention Module (HRAM) to enhance the model’s ability to identify fine-grained symbol details and a Proximity-aware Loss Function to improve performance in cluttered regions by penalizing misclassifications based on the spatial proximity of neighboring symbols. These modifications result in an optimized method for semantic processing in symbol classification tasks. The proposed model achieves 93.5% mean average precision (mAP), a 3.8% improvement over the top-performing baseline, alongside a 19.6% reduction in model parameters. These advancements contribute to more efficient and accurate semantic processing of SLDs, paving the way for improved analysis of electrical system diagrams.
Citation: Bhanbhro H, Kwang Hooi Y, Kusakunniran W, Zakaria MNB, Hashmi SAM, Amur ZH, et al. (2026) ESC-YOLOv8: An enhanced deep learning framework for semantic understanding of single-line diagram imagery. PLoS One 21(3): e0340719. https://doi.org/10.1371/journal.pone.0340719
Editor: Rajesh Kumar, National Institute of Technology, India (Institute of National Importance), INDIA
Received: March 7, 2025; Accepted: December 25, 2025; Published: March 11, 2026
Copyright: © 2026 Bhanbhro et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The data underlying the results presented in this study are subject to confidentiality agreements and belong to PETRONAS Group Technical Solutions. Due to legal and contractual restrictions, the data cannot be shared publicly. Data access requests may be considered on a case-by-case basis and can be directed to the Universiti Teknologi PETRONAS Research Management Centre (email: info@utp.edu.my) with prior written permission from PETRONAS.
Funding: Not available at the time of submission.
Competing interests: The authors have declared that no competing interests exist.
Abbreviations: ASM, Active Shape Models; ASME, American Society of Mechanical Engineers; CNN, Convolutional Neural Network; DETR, Detection Transformer; HOG, Histogram of Oriented Gradients; HRAM, Hybrid Residual Attention Module; ICSET, International Conference on System Engineering and Technology; IOGP, International Association of Oil & Gas Producers; NMS, Non-Maximum Suppression; P&ID, Piping and Instrumentation Diagrams; PaL, Proximity-aware Loss; RGB, Red, green, and blue; RPN, Region Proposal Network; SGD, Stochastic Gradient Descent; SIFT, Scale-Invariant Feature Transform; SLD, Single-line diagram; SSD, Single Shot MultiBox Detector; YOLO, You Only Look Once
Introduction
The pursuit of automated scene interpretation has witnessed remarkable progress, propelled by advancements in machine learning methodologies [1]. However, the ability of machines to furnish comprehensive semantic descriptions of natural scenes derived from digital images remains conspicuously constrained, falling significantly short of human capabilities [2]. This disparity, often referred to as the “semantic gap” underscores the challenges inherent in endowing machines with the capacity to discern and interpret the intricate relationships between objects and their contextual surroundings [3]. Consequently, there is a growing need to leverage high-level context obtained from object detectors and scene classifiers to bridge this gap [4]. The recent progress in deep learning has introduced sophisticated instruments capable of acquiring semantic, high-level, and deeper features, offering avenues to tackle the limitations inherent in conventional architectures [1]. These tools are essential for enhancing the interpretation of complex visual data, especially in specialized domains like electrical engineering.
In this domain, SLDs serve as a fundamental visual language for representing electrical power systems, wherein the semantic processing of these diagrams is pivotal for tasks such as power system analysis, fault diagnosis, and automated design. The motivation of this work stems from the pressing need to develop robust, efficient, and generalizable methods for automated SLD interpretation, especially in industrial environments where manual processing is time-consuming and error-prone.
There is a growing demand for digital systems capable of processing and analyzing SLDs, driven by the need for efficiency, accuracy, and integration into modern workflows [3]. Digitization allows industries to transition from paper-based formats, which are prone to degradation and loss, to digital formats that can be easily edited, stored, and shared across teams using advanced software [1]. However, many organizations, especially those managing legacy projects, continue to rely on outdated paper-based or scanned drawings, which lack the interactive features needed for modern data extraction and integration [5,6]. A survey by the American Society of Mechanical Engineers (ASME) found that nearly 60% of engineering firms still maintain critical drawings in paper or non-editable digital formats, highlighting the urgent need for digitization [2].
Digitizing these drawings not only simplifies information extraction but also enhances the ability to update designs as components are replaced or modified due to maintenance over the lifecycle of a plant (Transforming Legacy Drawings into Digital Assets). This digital transformation enables project teams to maintain up-to-date inventories, streamline project management, and ensure compliance with evolving safety and regulatory standards [7]. For example, digitized SLDs are particularly valuable in power distribution and industrial settings where real-time access to updated schematics can significantly reduce downtime during troubleshooting and repairs [8,9]. Moreover, the International Association of Oil & Gas Producers (IOGP) has reported that digitized maintenance and design records can help to reduce operational inefficiencies by up to 25%, underscoring the financial and safety benefits of digital engineering drawings [10].
Against this backdrop, the objectives of this paper are to (i) design and develop an enhanced YOLOv8-based model for symbol detection and classification in SLDs, (ii) integrate novel mechanisms to reduce misclassifications and improve generalization, and (iii) validate the model across diverse datasets to demonstrate robustness and efficiency.
Recent advancements in deep learning have created new opportunities to address these challenges [10]. These models are particularly suited for symbol classification in SLDs due to their ability to learn complex patterns (as illustrated in Fig 1) and features that conventional methods struggle to capture. However, their application in SLDs remains underexplored due to the complex nature of SLD images and the need for extensive, annotated datasets of engineering symbols [11]. The visually similar and symmetrical nature of many symbols complicates differentiation, which can result in misclassification, adversely impacting system diagnostics and project timelines. Furthermore, the scarcity of publicly available, well-annotated datasets poses a significant barrier, limiting the development and testing of deep learning models tailored for SLDs [12]. This leads to the interesting possibility of classifying symbols using only a deep learning image classification model and a symbol image dataset.
To address these challenges, this study makes the following key contributions:
- A novel SLD image dataset designed for symbol classification tasks.
- An advanced deep learning model that integrates the HRAM to enhance feature extraction and capture fine-grained symbol details.
- A Proximity-aware Loss Function (PaL) customized to improve semantic processing and classification accuracy in dense and cluttered regions of SLDs.
Building on this approach, this study proposes a novel deep learning-based symbol classification model that leverages SLD images exclusively for classifying prevalent symbols. The proposed solution begins with annotated and reviewed SLD images, which are used to train the deep learning model. By automating symbol classification, this approach not only addresses the limitations of traditional methods but also reduces the reliance on specialized human expertise for operational and maintenance tasks. This innovation streamlines the process, enhancing efficiency and accuracy in SLD interpretation.
The remainder of this manuscript is organized as follows. Section 2 presents a detailed review of related work, focusing on existing methods and recent advances in symbol classification for SLDs. Section 3 describes the proposed methodology, including dataset development, model benchmarking, and the design of enhanced deep learning architecture. Section 4 reports the experimental results and performance evaluation of the proposed approach. Finally, Section 5 concludes the study, summarizing key findings and outlining directions for future research.
Related works
Existing methodologies for interpreting SLDs frequently encounter challenges in accurately extracting and interpreting complex symbols and relationships, often relying on rule-based systems or traditional image processing techniques that lack the adaptability to handle variations in diagram styles and complexities.
Traditional methods for symbol classification.
Traditional methods for symbol classification in engineering drawings, such as Template Matching, Rule-Based Methods, Feature-Based Classification, and Statistical Shape Modeling, have been widely used due to their simplicity and interpretability [13]. However, they struggle with symbol variability, overlapping elements, and complex layouts, particularly in SLDs [14]. Template Matching, implemented in OpenCV, relies on similarity measures but is highly sensitive to resolution, occlusion, and environmental complexity, leading to high false positives and negatives [14]. Heuristic and Rule-Based Methods, like those developed by Lee et al., use predefined geometric rules for classification but lack adaptability to new symbols and require extensive manual updates. Feature-Based Classification, using techniques like Histogram of Oriented Gradients (HOG) and Scale-Invariant Feature Transform (SIFT) with SVM or KNN classifiers, performs well on clean images but suffers in noisy or cluttered environments where symbols overlap [15].
Statistical Shape Modeling, particularly Active Shape Models (ASM), captures shape variations and is effective for deformable symbols, as demonstrated in [16] for circuit diagrams. However, ASM is computationally expensive, requires extensive preprocessing, and struggles with closely packed symbols, as noted in [17] for CAD environments.
Despite their contributions, these traditional methods are inherently limited in handling the complexities of real-world engineering drawings, highlighting the need for more robust and adaptive deep learning approaches for symbol classification [18].
Deep learning-based and transformer-based symbol classification.
The advancement of deep learning has greatly enhanced the performance of symbol recognition in technical drawings. Convolutional Neural Networks (CNNs) have shown remarkable accuracy in classifying electrical symbols and identifying their positions within SLDs [19]. These models are particularly useful for symbol detection, feature extraction, and structural interpretation due to their ability to learn hierarchical patterns from data. Two-stage and one-stage CNN models have been widely utilized, each with their unique strengths and limitations [20].
Two-stage object detectors, such as Faster R-CNN, Mask R-CNN, and Cascade R-CNN, employ a Region Proposal Network (RPN) to identify object regions before refining them in a classification stage, improving the detection of small or overlapping symbols in engineering drawings, as seen in Fig 2 [21]. These models leverage convolutional backbones and anchor-based mechanisms to handle varying symbol scales and orientations but often suffer from slower inference and higher computational demands [21]. Zhang et al. achieved 89% mAP in electrical schematics using CNNs but noted challenges with overlapping symbols and poor visual quality [22]. Kim et al. applied You Only Look Once (YOLO) for real-time symbol detection in SLDs, achieving a 90% detection rate but struggling with densely packed elements [23]. Liu et al. focused on symbol classification in P&IDs, reporting 87% mAP but highlighting difficulties with occluded symbols [24,25]. In their study, [26] proposed a hybrid CNN-LSTM approach to capture sequential symbol relationships in SLDs, reaching 84% mAP on synthetic datasets but facing generalizability issues in real-world applications. These studies highlight the strengths and limitations of two-stage models, emphasizing the need for adaptable solutions to handle real-world engineering drawings effectively.
One-stage models, such as YOLO, SSD (Single Shot MultiBox Detector), and EfficientNet, have transformed symbol classification tasks with their exceptional efficiency and real-time detection capabilities [27], as illustrated in Fig 3. Unlike two-stage detectors, which separate object proposal generation and classification, one-stage models integrate both processes into a single step, enabling rapid inference essential for applications like automated engineering drawing analysis [28]. YOLO’s grid-based approach simultaneously predicts bounding boxes and class probabilities, making it highly effective for processing dense, complex images [29]. However, these models often face challenges in balancing speed and accuracy, particularly with small, densely packed symbols in cluttered environments, where they may miss fine details or generate false positives [30]. Redmon et al. [28] pioneered YOLO, revolutionizing object detection by unifying classification and localization, establishing its prominence in engineering drawing analysis.
Recent studies have explored YOLO-based architectures for semantic understanding in diverse domains beyond engineering diagrams. For example, Qureshi et al. [30] proposed a hybrid approach combining semantic segmentation and YOLO detection for aerial vehicle imagery, achieving robust performance in complex environments with occlusions and varying scales. Their work underscores the adaptability of YOLO frameworks for tasks requiring precise object localization and classification in cluttered scenes, which aligns with the challenges addressed in SLD interpretation. Integrating attention mechanisms and custom loss functions, as in our proposed ESC-YOLOv8, further extends these principles to industrial diagram analysis.
In addition to aerial and industrial applications, YOLO-based models have been adapted for highly complex environments such as underwater scenes. Wang et al. [31] introduced YOLO-DBS, an improved YOLOv8 architecture optimized for detecting targets in challenging underwater imagery characterized by low visibility and clutter. Their approach leverages architectural refinements to enhance detection accuracy and efficiency under adverse conditions. This work demonstrates the versatility of YOLOv8 and reinforces the need for domain-specific enhancements, similar to our integration of HRAM and Proximity-aware Loss for symbol classification in densely packed SLDs.
Despite advancements in deep learning for symbol classification, challenges remain, such as the need for extensive datasets, addressing class imbalances, and enhancing model robustness against noise and varying conditions [29]. As summarized in Table 1, current research highlights the necessity for more adaptable deep learning models specifically designed to handle the complexities of engineering drawings, ensuring both accuracy and efficiency in symbol classification across diverse applications.
To further strengthen the adaptability and scalability of deep learning-based approaches, it is valuable to draw insights from adjacent research fields that tackle similar challenges in large-scale and complex systems. For instance, in the domain of edge computing, the study in [30] proposed a latency- and privacy-aware resource allocation framework for vehicular edge computing. Their work demonstrates how distributed and edge-based architectures can improve system responsiveness and data security, offering highly relevant strategies when deploying real-time deep learning models for industrial SLD analysis. Similarly, [32,33] applied large language models to method-level bug severity prediction using software metrics, showing how integrating domain-specific features with advanced deep learning architectures can enhance classification accuracy and robustness.
In addition, the management of large and heterogeneous datasets, such as varied SLDs from multiple industrial sources, benefits from dynamic resource provisioning and approximation strategies. The authors of [34,35] introduced the Data Variety Aware Resource Provisioning Architecture (DV-ARPA), a framework designed for big data resource provisioning, aligning with the need to handle diverse symbol representations efficiently. Complementing this, a study presented Gallup Approximation (Gapprox), which applies approximation techniques to big data processing, balancing computational cost with result accuracy [36,37,38]. These approaches provide valuable guidance for optimizing deep learning pipelines used in SLD symbol classification, particularly when aiming for industrial-scale deployment where both speed and precision are critical. Together, these related works provide a strong foundation for enhancing the scalability, efficiency, and reliability of deep learning-based symbol classification systems used in industrial power system analysis.
To address the limitations of traditional one-stage and two-stage models, recent research has introduced transformer-based architectures that significantly improve symbol recognition in structured diagrams. For example, SwinIR (Swin Transformer for Image Restoration), built on the Swin Transformer, excels in image restoration tasks by modeling long-range dependencies and leveraging hierarchical representations, making it highly effective for capturing fine details in dense engineering diagrams [39,40]. Building on this, Swin2SR (Swin Transformer Version 2 for Super-Resolution) extends the Swin Transformer Version 2 to enhance training stability and performance, especially under compressed image conditions [39,40]. In the detection domain, transformer-based frameworks like DETR (Detection Transformer) provide an end-to-end approach that eliminates the need for components such as non-maximum suppression by directly predicting object sets, offering improved precision for complex object detection tasks [41,42]. These innovations, along with hybrid models combining CNN backbones with transformers, present promising directions for advancing symbol classification accuracy, scalability, and robustness in technical drawings.
Beyond CNN/transformer architectures, graph-based learning provides a principled way to exploit the relational structure inherent in SLDs. Liu et al. [43] present a formal model for multi-agent Q-learning on graphs, in which agents coordinate decisions using graph topology to optimize task performance. While their work targets generic graph environments rather than engineering drawings, the formalism suggests a natural extension for SLD analysis: symbols and conductors can be modeled as nodes and edges, enabling agents to learn context-aware decisions about symbol classification and connection interpretation. Such graph-centric reinforcement learning could complement our ESC-YOLOv8 by providing post-detection relational reasoning (e.g., resolving ambiguities in dense regions through topology-aware policies).
Despite advancements, deep learning models like YOLO and SSD struggle with symbol classification in cluttered or occluded engineering drawings, as traditional loss functions lack adequate guidance [44,45]. Challenges like overlapping lines, varying scales, and inconsistent annotations complicate loss function design. Future research should focus on adaptive loss functions to enhance training efficiency and accuracy across diverse datasets [46–51].
Methods
The methodology (as presented in Fig 4) comprises three layers: dataset development, preprocessing, and class imbalance handling. Finally, we benchmark baseline models and introduce our proposed network, evaluating its performance through systematic ablation studies. This comprehensive approach is designed to enhance symbol recognition accuracy in complex engineering drawings [44].
Novel dataset development
For the experiments in this research work, we chose to work with SLDs (Fig 5). This study aims to develop a comprehensive and scalable dataset following established guidelines, specifically tailored to optimize deep learning model training for improved classification accuracy of SLD symbols. By doing so, the performance of deep learning models in recognizing and accurately classifying electrical symbols can be greatly enhanced. The acquired dataset of 6,700 images comprises scanned drawings containing widely used symbols. Additionally, the SLDs are of varying quality, which makes the dataset suitable for evaluation purposes.
Data exploring & preprocessing guidelines.
SLD images are cluttered with text and symbols, often lacking distinctive features and containing noise from scanning. The original SLD sheets are large images, 7500 × 5250 pixels. To speed up the training process, we divided each sheet into a 6 × 4 grid, resulting in 24 sub-images (patches) of approximately 1250 × 1300 pixels, considerably smaller than the original sheets.
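This patching step can be sketched as below; the function name and the choice to truncate remainder pixels from non-divisible dimensions are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def split_into_patches(sheet, rows=4, cols=6):
    """Split a large drawing sheet into a rows x cols grid of patches.

    Remainder pixels from non-divisible dimensions are truncated, which
    is why patch sizes are approximate (an assumption of this sketch).
    """
    h, w = sheet.shape[:2]
    ph, pw = h // rows, w // cols
    patches = []
    for r in range(rows):
        for c in range(cols):
            patches.append(sheet[r * ph:(r + 1) * ph, c * pw:(c + 1) * pw])
    return patches

# A 5250 x 7500 sheet (height x width) yields 24 patches of 1312 x 1250.
sheet = np.zeros((5250, 7500), dtype=np.uint8)
patches = split_into_patches(sheet)
```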
Preprocessing techniques, including (i) gray processing and (ii) text removal, are applied to enhance model performance. Gray processing converts red, green, and blue (RGB) images to grayscale using the weighted average method (Eq 1) [39], while text removal employs Easy Optical Character Recognition (EasyOCR) with thresholds (e.g., OCR confidence > 0.7 for text removal) and in-painting to eliminate non-essential elements, as seen in Fig 6.
This equation represents the grayscale conversion formula, where R, G, and B are the red, green, and blue color channel intensities, respectively, and F is the final grayscale intensity. The weighted coefficients reflect human visual sensitivity to each color channel, giving more weight to green and less to blue. Applying this transformation simplifies the image data by reducing it to a single intensity channel, which enhances computational efficiency and reduces complexity during subsequent preprocessing and model training steps.
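The conversion described above can be sketched as follows. The ITU-R BT.601 luminance coefficients (0.299, 0.587, 0.114) are the standard choice matching the description (most weight to green, least to blue); the exact coefficients of the paper's Eq 1 are assumed here.

```python
import numpy as np

def to_grayscale(rgb):
    """Weighted-average grayscale conversion: F = 0.299R + 0.587G + 0.114B.

    rgb: array of shape (..., 3) with channels in R, G, B order.
    The BT.601 weights are an assumption standing in for Eq 1.
    """
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    return 0.299 * r + 0.587 * g + 0.114 * b
```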
Class distribution.
Training a deep learning model requires fully annotated images. To this end, we used RoboFlow to annotate the collection of SLD diagrams. The resulting annotations, covering nine unique classes, were produced in a two-step process: (1) drawing bounding boxes around symbols with unique colors, and (2) assigning class labels, excluding mismatched images. The distribution of classes used in this dataset is detailed in Table 2.
The annotated dataset captures information for nine distinct symbol classes, stored in a file format that includes the x and y coordinates of each symbol’s bounding box center, along with its width and height. A total of 17,085 symbols were labeled across these classes. However, the dataset exhibits significant class imbalance, as illustrated in Table 2. To mitigate the resulting biases and keep the dataset diverse and representative, a carefully designed augmentation pipeline was implemented, specifically targeting the underrepresented SLD classes to increase the presence of minority symbols.
Data augmentation enhances dataset quantity and quality by introducing variability and diversity, crucial for training robust deep learning models. Techniques like geometric flips, brightness, random erasing, and image contrast are responsible for balancing the minor classes to ensure unbiased performance, as outlined in Table 3.
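A minimal sketch of such a pipeline is shown below; the one-transform-per-call design and the specific parameter ranges are illustrative assumptions, not the configuration from Table 3.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(patch):
    """Apply one randomly chosen augmentation to a grayscale patch:
    geometric flip, brightness shift, random erasing, or contrast change.
    Parameter ranges are assumptions for illustration."""
    choice = rng.integers(4)
    out = patch.astype(np.float32)
    if choice == 0:                               # horizontal flip
        out = out[:, ::-1]
    elif choice == 1:                             # brightness shift
        out = out + rng.uniform(-30, 30)
    elif choice == 2:                             # random erasing
        h, w = out.shape[:2]
        y, x = rng.integers(h // 2), rng.integers(w // 2)
        out[y:y + h // 4, x:x + w // 4] = 0
    else:                                         # contrast scaling
        out = (out - out.mean()) * rng.uniform(0.8, 1.2) + out.mean()
    return np.clip(out, 0, 255).astype(np.uint8)
```

Applying such transforms only to patches containing minority-class symbols raises their effective frequency without altering the majority classes.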
Model benchmarking methodology
Benchmarking involves evaluating the performance of various deep learning models on a dataset to ensure unbiased assessment and identify areas for improvement. This section outlines the experimental design, hardware/software environments, and performance metrics used to assess the proposed dataset’s effectiveness.
Establishing baseline accuracy for the new dataset involves evaluating state-of-the-art deep learning models, including YOLO versions (v7 to v10) and YOLO-World. The experiments incorporate both one-stage and two-stage detection models to capture a broader perspective on performance, balancing speed and accuracy, as given below in Table 4.
Hyperparameter configurations.
Each of the listed deep learning models is trained, validated, and tested on the proposed dataset. A standard set of model training parameters [44] is defined and applied in all experiments; the details are presented in Table 5.
Table 5 outlines the training parameters: Stochastic Gradient Descent (SGD) was used as the optimizer for its stability in convergence; an initial learning rate of 0.001 was selected to ensure gradual updates; training ran for 100 epochs with a batch size of 16 to balance learning efficiency and memory constraints. Graphics Processing Unit (GPU) execution accelerated training. An Intersection over Union (IoU) threshold of 0.7 was chosen to enforce stricter localization accuracy, while the maximum number of detections was capped at 300 to limit redundancy. Non-Maximum Suppression (NMS) was disabled to retain overlapping detections, and validation was performed every 50 iterations for timely performance monitoring.
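For reference, these settings can be collected into a single configuration object; the key names below are illustrative, while the values are those reported in the text.

```python
# Training configuration mirroring the settings described above
# (key names are assumptions; values are from the text).
train_config = {
    "optimizer": "SGD",            # stable convergence
    "initial_lr": 0.001,           # gradual updates
    "epochs": 100,
    "batch_size": 16,              # balances efficiency and memory
    "device": "cuda",              # GPU-accelerated training
    "iou_threshold": 0.7,          # stricter localization accuracy
    "max_detections": 300,         # caps redundant predictions
    "use_nms": False,              # retain overlapping detections
    "val_interval_iters": 50,      # timely performance monitoring
}
```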
Hardware & software setup.
A standardized hardware and software environment ensures consistency and reproducibility during benchmarking. The setup includes an Intel Core i9 13900HX CPU, 32GB RAM, NVIDIA GeForce RTX 4090 GPU, Windows 11 Pro, and Python 3.10. This configuration was selected to support efficient training and evaluation of deep learning models with high computational demands.
Proposed deep learning model for symbol classification
This section details the selection and refinement of the reference model based on benchmarking results. The top-performing model is analyzed, fine-tuned, and enhanced with architectural improvements to maximize classification accuracy.
Based on benchmarking results, models are evaluated using F1, recall, and mAP to identify the most effective architecture. Models including YOLOv8, YOLOv10, and YOLO-World are compared, and the model with the best balance of these metrics is selected as the reference for further optimization.
Benchmarking results identified YOLOv8 as the best-performing model based on F1, recall, and mAP. The proposed model builds on YOLOv8 and is named ‘Enhanced Symbol Classification YOLOv8 (ESC-YOLOv8)’; it enhances the attention mechanisms and loss functions for improved feature extraction and classification. Fig 8 illustrates the model architecture. Each modification is introduced incrementally, refining the architecture for optimal performance. The proposed changes are as follows:
Model-1: Hybrid residual attention module.
The HRAM integrates channel, spatial, and input features in parallel, unlike sequential methods such as the Convolutional Block Attention Module (CBAM), enhancing feature extraction efficiency. HRAM (as seen in Fig 9) reduces computational overhead, preserves fine-grained details, and accelerates inference by minimizing layers, making it highly effective for symbol classification in SLDs. Our design draws on the principle of leveraging attention for fine-grained feature extraction, similar to approaches in other domains such as that of Zhao et al. [52], who applied full-domain convolutional attention for cross-lingual font style transfer.
In HRAM, channel attention is computed by applying global average and max pooling across the spatial dimensions, followed by fully connected layers to generate an attention map, represented by Eq 2:

Mc = σ(W1 vavg + W2 vmax)　(Eq 2)
In this equation, Mc represents the channel attention map, which is computed by combining two types of pooled information: the average-pooled feature vector (vavg) and the max-pooled feature vector (vmax). The learnable weights W1 and W2 adjust the contribution of each pooled feature, while the sigmoid activation function sigma (σ) normalizes the combined result to produce an attention map in the range [0,1]. This map highlights the most important feature channels, allowing the model to emphasize critical channel-level information and suppress less relevant channels during feature extraction. Spatial attention is computed by pooling across the channel dimension and applying a convolutional layer, as shown in Eq 3:

Ms = σ(Conv([Favg; Fmax]))　(Eq 3)
In this equation, Ms denotes the spatial attention map, which is generated by first concatenating the average-pooled feature map Favg and the max-pooled feature map Fmax along the channel dimension. This combined feature map is then passed through a convolutional layer (Conv) to capture spatial relationships and local interactions across the feature map. Finally, the sigmoid activation function σ normalizes the output to a range between 0 and 1, producing an attention map that highlights important spatial regions in the feature map, allowing the network to focus on critical spatial patterns during classification. The final feature map is updated by applying the attention maps multiplicatively in Eq 4 [53]:

Fout = F ⊙ Mc ⊙ Ms　(Eq 4)
In this equation, Fout represents the final refined feature map obtained after applying both channel and spatial attention mechanisms. The original input feature map F is element-wise multiplied by the channel attention map Mc and the spatial attention map Ms, effectively reweighing the feature map to emphasize both important channels and critical spatial regions. This combined attention refinement enhances the network’s ability to focus on the most informative features, improving the accuracy and robustness of the symbol classification task, especially in dense and cluttered diagrams.
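The HRAM refinement described by Eqs 2–4 can be sketched in NumPy as below. The weight shapes, and the reduction of the spatial convolution to a single scalar weight, are assumptions made purely to keep the sketch dependency-free; they are not the paper's implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hram(feature, w1, w2, conv_w):
    """Sketch of HRAM refinement on a (C, H, W) feature map.

    w1, w2: (C, C) weights for the pooled vectors; conv_w: a scalar
    standing in for the spatial convolution (an assumption)."""
    v_avg = feature.mean(axis=(1, 2))          # global average pooling
    v_max = feature.max(axis=(1, 2))           # global max pooling
    m_c = sigmoid(w1 @ v_avg + w2 @ v_max)     # Eq 2: channel attention
    f_avg = feature.mean(axis=0)               # pooling across channels
    f_max = feature.max(axis=0)
    m_s = sigmoid(conv_w * (f_avg + f_max))    # Eq 3: spatial attention
    return feature * m_c[:, None, None] * m_s  # Eq 4: F reweighed by Mc, Ms
```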
Model 2- proximity-aware loss function (PaL).
In dense object detection, standard loss functions such as YOLOv8’s Varifocal Loss (VFL) often fail to distinguish overlapping or closely positioned objects, resulting in merged or ambiguous predictions. To address this limitation, we propose the PaL, Fig 10, which augments VFL with a spatial penalty that discourages predictions with insufficient separation.
Proximity Penalty Term: The Varifocal Loss (VFL) balances confidence scores with ground truth labels, focusing on difficult cases. A proximity penalty term penalizes predictions where bounding boxes are too close, enforcing spatial separation based on a threshold. This ensures distinct object detection, as defined in Eq 5:

Lproximity = λ Σi≠j I(d(Bi, Bj) < dthreshold) / (d(Bi, Bj) + ∊)　(Eq 5)
Here, Bi and Bj represent the bounding boxes of objects i and j, and λ is a scaling factor that adjusts the strength of the penalty. ∊ is a small constant that avoids division by zero, and I(d(Bi, Bj) < dthreshold) is an indicator function that activates when the distance between Bi and Bj falls below the threshold. The PaL hyperparameters were set to λ = 1.2 and a distance threshold of 12 pixels. The penalty grows as the distance between bounding boxes decreases, ensuring that the model penalizes overly close bounding boxes while still maintaining separate detections for both objects.
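The penalty term can be sketched in plain Python as follows. The paper does not specify which inter-box distance is used, so this sketch assumes the Euclidean distance between box centers; the default hyperparameters match the values stated above (λ = 1.2, threshold = 12 pixels).

```python
import math

def center_distance(box_a, box_b):
    """Euclidean distance between centers of two (x1, y1, x2, y2) boxes."""
    ax, ay = (box_a[0] + box_a[2]) / 2, (box_a[1] + box_a[3]) / 2
    bx, by = (box_b[0] + box_b[2]) / 2, (box_b[1] + box_b[3]) / 2
    return math.hypot(ax - bx, ay - by)

def proximity_penalty(boxes, lam=1.2, d_threshold=12.0, eps=1e-6):
    """Sum lam / (d + eps) over all box pairs closer than d_threshold."""
    penalty = 0.0
    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            d = center_distance(boxes[i], boxes[j])
            if d < d_threshold:  # indicator I(d < d_threshold)
                penalty += lam / (d + eps)
    return penalty
```

Well-separated boxes contribute nothing, while the penalty grows sharply as two boxes approach each other.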
Proximity-Aware Loss: The Proximity-aware Loss Function integrates Varifocal Loss and a proximity penalty to improve classification in dense object scenarios. The complete loss function is given in Eq 6:
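Based on the description below, the combined loss can be written as (a reconstruction; the λ scaling is assumed to live inside the penalty term as defined in Eq 5):

```latex
L_{\text{proximity-aware}} = L_{\mathrm{VFL}} + L_{\mathrm{proximity}} \qquad \text{(Eq 6)}
```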
In this equation, Lproximity-aware represents the final loss function used to train the ESC-YOLOv8 model. It combines the standard Varifocal Loss (LVFL), which focuses on balancing classification confidence and localization accuracy, with the proximity penalty term (Lproximity), which enforces spatial separation between closely positioned bounding boxes. By integrating these two components, the model is encouraged to not only improve its classification predictions but also maintain distinct detections in densely packed regions. This combined loss formulation directly addresses the challenges of overlapping or clustered symbols commonly found in SLDs, enhancing both detection robustness and fine-grained localization.
Additionally, the use of a class-weighted formulation of the Varifocal Loss helps mitigate class imbalance by assigning higher weights to underrepresented symbol classes based on their inverse frequency. This ensures that rare classes, such as Ammeter and Generator, contribute more significantly to the loss during training, leading to improved representation and classification performance across all symbol types.
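The inverse-frequency class weighting described above can be sketched as follows. The normalization step (scaling so the average weight is 1) is an assumption for illustration; the paper does not state how the weights are normalized.

```python
from collections import Counter

def inverse_frequency_weights(labels, normalize=True):
    """Per-class weights proportional to the inverse of each class's frequency."""
    counts = Counter(labels)
    weights = {c: 1.0 / n for c, n in counts.items()}
    if normalize:
        # scale so the mean weight is 1, keeping the overall loss magnitude stable
        mean_w = sum(weights.values()) / len(weights)
        weights = {c: w / mean_w for c, w in weights.items()}
    return weights
```

A rare class such as Ammeter thus receives a weight several times larger than a common class such as Switch, amplifying its contribution to the loss.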
This combined loss function encourages the network to correctly classify objects while penalizing predictions that place bounding boxes too close to one another.
Proposed algorithm: ESC-YOLOv8 workflow.
To provide a clear overview of the proposed ESC-YOLOv8 model and guide the implementation process, we outline its full workflow in the form of a step-by-step algorithm in algorithm 1. This algorithm details each phase, from data preparation to model design, training, evaluation, and result analysis, ensuring reproducibility and clarity for both researchers and practitioners working on symbol classification in SLDs.
The presented algorithm offers a structured breakdown of the ESC-YOLOv8 workflow, highlighting how each stage contributes to the overall system performance. By systematically integrating advanced components such as the HRAM and the Proximity-aware Loss Function (PaL), the algorithm ensures that the model is optimized not only for accuracy but also for efficiency and scalability. This formalized representation also facilitates easier adaptation and extension in future work, enabling researchers to build upon the described approach for related tasks in industrial diagram analysis.
Performance assessment of the model
Model performance is evaluated on a test set using F1, recall, mAP, and the confusion matrix, ensuring a thorough assessment of classification accuracy and error analysis.
Precision and recall are the two most commonly used metrics for evaluating a model [54], and their definitions are provided below. In multi-class classification, precision measures the proportion of true positives among all positive predictions, assessing the model’s accuracy in class assignment. It is defined in Eq 7 [39]:
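Following the standard definition:

```latex
\mathrm{Precision} = \frac{TP}{TP + FP} \qquad \text{(Eq 7)}
```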
In multi-class classification, recall measures the model’s ability to identify all instances of a class [52]. It is calculated as the ratio of true positives to the sum of true positives and false negatives, as shown in Eq 8 [39]:
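Following the standard definition:

```latex
\mathrm{Recall} = \frac{TP}{TP + FN} \qquad \text{(Eq 8)}
```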
F1-score is another widely used metric in multi-class classification, especially when evaluating performance on imbalanced datasets. It represents the harmonic mean of precision and recall, providing a balanced measure that accounts for both false positives and false negatives. It is defined in Eq 9 [39]:
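As the harmonic mean of the two preceding metrics:

```latex
F_1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \qquad \text{(Eq 9)}
```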
The confusion matrix summarizes a classification model’s performance. It compares predicted labels against actual labels, with TP, FN, FP, and TN representing True Positive, False Negative, False Positive, and True Negative, respectively.
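The per-class metrics can be derived directly from a confusion matrix, as sketched below. The nested-dict representation (`confusion[actual][predicted] -> count`) is an illustrative choice, not the paper's evaluation code.

```python
def precision(tp, fp):
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    return tp / (tp + fn) if tp + fn else 0.0

def f1_score(p, r):
    return 2 * p * r / (p + r) if p + r else 0.0

def per_class_metrics(confusion, classes):
    """confusion[actual][predicted] -> count; returns {class: (P, R, F1)}."""
    metrics = {}
    for c in classes:
        tp = confusion.get(c, {}).get(c, 0)
        # false positives: other classes predicted as c
        fp = sum(confusion.get(a, {}).get(c, 0) for a in classes if a != c)
        # false negatives: class c predicted as something else
        fn = sum(n for pred, n in confusion.get(c, {}).items() if pred != c)
        p, r = precision(tp, fp), recall(tp, fn)
        metrics[c] = (p, r, f1_score(p, r))
    return metrics
```

For example, a model that confuses a few motors and voltmeters yields per-class scores that expose exactly which class pair drives the errors.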
Results
This section of the study focuses on various results produced during the development of the dataset and benchmarking experiments. Additionally, the model enhancement results are also presented and discussed.
Model benchmarking results
The dataset’s suitability was evaluated using five deep learning models, benchmarked under standard conditions, with results detailed in the following text.
Table 6 presents benchmark results for SLD symbol classification, with YOLOv8 achieving the highest mAP (89.7%), outperforming YOLOv10 (88.3%), YOLOv9 (86.5%), YOLOv7 (80.6%), and YOLO-World (82.8%), the last of which struggled with dense and overlapping symbols. YOLOv8 excelled in classifying switches (91%) and motors (97.8%) but showed lower mAP for complex symbols such as delta (86.7%) and ammeters (86.0%). These results confirm YOLOv8’s superiority in symbol detection across diverse engineering drawings.
Proposed model results
This section presents the evaluation of the proposed model and its architectural enhancements through a series of experiments. The improvements focus on enhancing feature extraction and symbol localization in dense and complex SLDs. To validate the effectiveness of each modification, a step-by-step ablation study was performed, measuring the impact of attention mechanisms, custom loss functions, and class rebalancing techniques on classification performance.
Ablation study.
To better understand the individual impact of each proposed enhancement, an expanded ablation study was conducted. Table 7 presents the baseline YOLOv8 model and its progressive modifications.
Model-1 enhances the baseline YOLOv8 by introducing improved attention to key symbol features, but it still struggles to separate overlapping symbols such as diodes and deltas. To overcome this limitation, Model-2 integrates a Proximity-aware Loss Function into the YOLOv8 architecture. This addition improves bounding box separation and reduces errors caused by spatial proximity, particularly in densely populated regions. ESC-YOLOv8 is then developed by combining the enhancements of both Model-1 and Model-2, resulting in a more robust and accurate symbol classification system. As shown in Fig 11, the confusion matrix comparison reveals significant improvements in prediction accuracy for ESC-YOLOv8 over the intermediate models.
The proposed ESC-YOLOv8 model integrates attention and a custom loss function, enhancing symbol classification by improving focus, precision, and adaptability, as shown in Fig 12.
As presented in Table 8, the proposed model outperforms existing approaches due to its novel integration of attention mechanisms and a Proximity-aware Loss Function. The findings indicate significant improvements in symbol detection and classification, particularly in densely packed regions. This is evident in the comparative performance metrics, where the proposed model consistently achieves higher accuracy and precision than other models.
Fig 13a highlights several classification errors observed in baseline models. For instance, a motor is misclassified as a voltmeter, and a voltmeter is incorrectly detected as an ammeter. Additionally, a voltmeter near a dense wiring junction is entirely missed, illustrating a failure to detect symbols in cluttered or overlapping regions. These errors are common in dense layouts where symbols are closely spaced and visually similar, leading to confusion in feature extraction and bounding box assignment. In contrast, the proposed model, shown in Fig 13b, demonstrates improved robustness by accurately classifying and localizing symbols with minimal errors, effectively addressing challenges in dense regions through enhanced attention and proximity-aware learning.
Cross-dataset evaluation
To evaluate the robustness and generalization of the proposed ESC-YOLOv8 model, a cross-dataset validation was performed using an independent set of 1200 SLD images sourced from Government College University Hyderabad, Pakistan. The images, covering the same nine symbol classes with diverse layouts and complexities, were ethically acquired, preprocessed, annotated, and augmented to 3600 images to address class imbalance and ensure representativeness. The dataset was split into 80% training, 10% validation, and 10% testing sets. Benchmarking on this dataset with YOLOv8, YOLOv9, and YOLOv10 showed performance drops compared to the original dataset, while ESC-YOLOv8 consistently outperformed these baselines, achieving the highest mAP along with balanced precision and recall, as shown in Table 9. These results confirm that the HRAM and the Proximity-aware Loss Function enhance model robustness and reliability on unseen data, reducing the risks of overfitting and dataset-specific bias, and supporting deployment in real-world industrial applications.
As shown in Table 9, ESC-YOLOv8 achieved the best performance across all evaluation metrics during cross-dataset validation. While YOLOv8, YOLOv9, and YOLOv10 demonstrated competitive accuracy, their performance dropped compared to the original dataset, indicating sensitivity to dataset variations. In contrast, ESC-YOLOv8 maintained higher stability, recording 92.8% F1, 94.1% recall, and 91.7% mAP. These results confirm that the integration of the HRAM and the Proximity-aware Loss Function improves robustness and generalization, enabling the model to perform reliably on unseen data from different sources.
Computational efficiency analysis
The computational efficiency of ESC-YOLOv8 was evaluated against the strongest baseline models identified in the benchmarking study, namely YOLOv8, YOLOv9, and YOLOv10, which demonstrated the highest overall performance in Section 4.1. As presented in Table 10, ESC-YOLOv8 achieves a 19.6% reduction in parameters compared to the baseline while sustaining competitive inference speed. This parameter reduction is further illustrated in Fig 12, which highlights the comparative efficiency between the baseline YOLOv8 and the proposed ESC-YOLOv8. Collectively, these results demonstrate that the proposed model maintains a favorable balance between accuracy and efficiency, reinforcing its suitability for deployment in industrial environments where both precision and resource optimization are critical.
Error patterns were further examined using the confusion matrices in Fig 11 and qualitative visualizations in Fig 13. The main misclassifications were observed between visually similar classes such as voltmeter and motor, and in cases of occlusion by dense connection lines. These errors highlight the challenges of symbol overlap and low contrast, which may be mitigated in future work by incorporating higher resolution patches and advanced preprocessing.
Conclusion
This research proposes ESC-YOLOv8 to enhance the semantic understanding of SLDs through symbol classification, using an HRAM and a Proximity-aware Loss Function that refine feature extraction and improve symbol localization, particularly in densely packed regions. Benchmarking results show an mAP of 93.5%, surpassing YOLOv8’s 89.7%, while reducing model parameters from 11.2 million to 9 million. Cross-dataset evaluations further demonstrate the model’s robustness, achieving 92.8% F1, 94.1% recall, and 91.7% mAP on unseen datasets, whereas YOLOv8, YOLOv9, and YOLOv10 experienced performance drops, confirming that the proposed enhancements improve generalization and resilience to dataset variations. Despite these improvements, limitations remain, including potential performance drops with highly diverse or rare symbol types, sensitivity to real-world diagram noise, annotation inconsistencies, and challenges in scaling to very large diagrams or real-time edge deployment. Future research can address these gaps by expanding the dataset with more diverse diagrams, integrating transformer- or graph-based architectures for improved relational understanding, developing lightweight or adaptive models for scalability and edge applications, and exploring automated annotation methods and advanced cross-domain evaluations to further enhance robustness and generalization.
Declaration of generative AI and AI-assisted technologies in the writing process
During the preparation of this work, the author(s) used a Large Language Model (LLM) to survey the literature and better understand existing methods. After using this tool/service, the author(s) reviewed and edited the content as needed and take(s) full responsibility for the content of the publication.
Acknowledgments
This research was funded by Universiti Teknologi PETRONAS (UTP) under the YUTP-PRG (015PBC-070) research grant scheme, with additional funding and institutional support provided by the Faculty of Information and Communication Technology (FICT), Mahidol University.
References
- 1. Fleuret F, Li T, Dubout C, Wampler EK, Yantis S, Geman D. Comparing machines and humans on a visual categorization test. Proc Natl Acad Sci U S A. 2011;108(43):17621–5. pmid:22006295
- 2. Bhanbhro H, Kwang Hooi Y, Kusakunniran W, Amur ZH. A Symbol Recognition System for Single-Line Diagrams Developed Using a Deep-Learning Approach. Applied Sciences. 2023;13(15):8816.
- 3. Bhanbhro H, et al. Modern deep learning approaches for symbol detection in complex engineering drawings. In: Proc 2022 Int Conf Digit Transform Intell (ICDI); 2022.
- 4. Zhao Z-Q, Zheng P, Xu S-T, Wu X. Object Detection With Deep Learning: A Review. IEEE Trans Neural Netw Learn Syst. 2019;30(11):3212–32. pmid:30703038
- 5. Love PED, Zhou J, Matthews J. Systems information modeling: From file exchanges to model sharing for electrical instrumentation and control systems. Automation in Construction. 2016;67:48–59.
- 6. American Society of Mechanical Engineers. The state of mechanical engineering: Today and beyond. New York: ASME; 2012. Available from: https://www.asme.org/wwwasmeorg/media/resourcefiles/campaigns/marketing/2012/the-state-of-mechanical-engineering-survey.pdf
- 7. Intelligent Project Solutions. Transforming legacy drawings into digital assets: Overcoming industry challenges. 2024. Available from: https://ips-ai.com/resource-centre/blogs/transforming-legacy-drawings-into-digital-assets-overcoming-industry-challenges/
- 8. Moreno-García CF, Elyan E, Jayne C. New trends on digitisation of complex engineering drawings. Neural Comput & Applic. 2018;31(6):1695–712.
- 9. Mani S, Dubey SR, Singh SK. Automatic digitization of engineering diagrams using deep learning and graph search. In: Proc IEEE/CVF Conf Comput Vis Pattern Recognit Workshops (CVPRW); 2020. p. 904–5.
- 10. Bhanbhro H, Hooi YK, Hassan Z. Modern approaches towards object detection of complex engineering drawings. In: Proc 2022 Int Conf Digit Transform Intell (ICDI); 2022.
- 11. Elyan E, Jamieson L, Ali-Gombe A. Deep learning for symbols detection and classification in engineering drawings. Neural Netw. 2020;129:91–102. pmid:32502800
- 12. Jamieson L, Francisco Moreno-García C, Elyan E. A review of deep learning methods for digitisation of complex documents and engineering diagrams. Artif Intell Rev. 2024;57(6).
- 13. Tahir MA, Bouridane A, Kurugollu F. Simultaneous feature selection and feature weighting using Hybrid Tabu Search/K-nearest neighbor classifier. Pattern Recognition Letters. 2007;28(4):438–46.
- 14. Mitterbaur M. A data-driven approach to identifying spare parts suitable for additive manufacturing through the digitization of legacy engineering drawings. 2023. Available from: https://repositum.tuwien.at/handle/20.500.12708/188145
- 15. Mohd Yazed MS, Ahmad Shaubari EF, Yap MH. A Review of Neural Network Approach on Engineering Drawing Recognition and Future Directions. JOIV : Int J Inform Visualization. 2023;7(4):2513.
- 16. Cootes TF, Taylor CJ, Cooper DH, Graham J. Active Shape Models-Their Training and Application. Computer Vision and Image Understanding. 1995;61(1):38–59.
- 17. Liu J, Udupa JK. Oriented active shape models. IEEE Trans Med Imaging. 2009;28(4):571–84. pmid:19336277
- 18. Çiçek S, Ferikoğlu A, Pehlivan İ. A new 3D chaotic system: Dynamical analysis, electronic circuit design, active control synchronization and chaotic masking communication application. Optik. 2016;127(8):4024–30.
- 19. Li Y, Wang X, Zhang Z. Deep learning-based symbol recognition in technical drawings: A case study on single-line diagrams. IEEE Trans Pattern Anal Mach Intell. 2020;42(8):1567–80.
- 20. Ren S, He K, Girshick R, Sun J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans Pattern Anal Mach Intell. 2017;39(6):1137-49.
- 21. Zhang Y, Li X, Wang H. CNN-based symbol recognition in electrical schematics: challenges and solutions. IEEE Trans Ind Informat. 2020;16(5):3456–65.
- 22. Kim J, Park S, Lee T. Real-time symbol detection in single-line diagrams using YOLO. IEEE Access. 2020;8:123456–65.
- 23. Liu X, Chen Y, Wang Z. Symbol classification in P&IDs: A deep learning approach. IEEE Transactions on Systems, Man, and Cybernetics: Systems. 2021;51(4):2345–55.
- 24. Redmon J, Divvala S, Girshick R, Farhadi A. You Only Look Once: Unified, real-time object detection. In: Proc IEEE Conf Comput Vis Pattern Recognit (CVPR); 2016. p. 779–88.
- 25. Liu W, et al. SSD: Single Shot MultiBox Detector. IEEE Trans Pattern Anal Mach Intell. 2018;40(4):835–47.
- 26. Tan M, Le Q. EfficientNet: Rethinking model scaling for convolutional neural networks. In: Proc Int Conf Mach Learn (ICML); 2019. p. 6105–14.
- 27. Redmon J, Farhadi A. YOLOv3: An incremental improvement. arXiv [Preprint]. 2018 Apr. Available from: https://arxiv.org/abs/1804.02767
- 28. Kumar A, Gupta S, Singh R. Challenges and future directions in deep learning-based symbol classification for engineering drawings. IEEE Trans Neural Netw Learn Syst. 2022;33(5):2101–12.
- 29. Dosovitskiy A, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In: Proc Adv Neural Inf Process Syst (NeurIPS); 2021. p. 1–12.
- 30. Qureshi AM, Abdul Haleem B, Abdulwahab A, Naif Al M, Mohammad A, Nouf Abdullah A, et al. Semantic Segmentation and YOLO Detector over Aerial Vehicle Images. Computers, Materials & Continua. 2024;80(2).
- 31. Wang X, Song X, Li Z, Wang H. YOLO-DBS: Efficient Target Detection in Complex Underwater Scene Images Based on Improved YOLOv8. J Ocean Univ China. 2025;24(4):979–92.
- 32. Zhou J, et al. Graph neural networks: A review of methods and applications. IEEE Trans Neural Netw Learn Syst. 2021;32(1):4–24.
- 33. Bento J, Paixão T, Alvarez AB. Performance Evaluation of YOLOv8, YOLOv9, YOLOv10, and YOLOv11 for Stamp Detection in Scanned Documents. Applied Sciences. 2025;15(6):3154.
- 34. Wang Y, Chen X, Liu Z. Contextual loss for improved symbol recognition in technical drawings. IEEE Trans Pattern Anal Mach Intell. 2022;44(8):4567–78.
- 35. Akhtar MU. Missing link prediction in complex networks. Int J Sci Eng Res. 2018;9:82–7.
- 36. Cheng T, et al. YOLO-World: Real-time open-vocabulary object detection. In: Proc IEEE/CVF Conf Comput Vis Pattern Recognit (CVPR); 2024.
- 37. Mashhadi E, Ahmadvand H, Hemmati H. Method-level bug severity prediction using source code metrics and LLMs. In: Proc IEEE Int Symp Softw Rel Eng (ISSRE); 2023.
- 38. Muzammul M, Li X. Comprehensive review of deep learning-based tiny object detection: challenges, strategies, and future directions. Knowl Inf Syst. 2025;67(5):3825–913.
- 39. Chen S, Liu Y, Yang M. Adaptive loss functions for improved symbol recognition in complex engineering drawings. IEEE Trans Neural Netw Learn Syst. 2023;34(6):1234–45.
- 40. Ahmadvand H, Goudarzi M, Foroutan F. Gapprox: using Gallup approach for approximation in Big Data processing. J Big Data. 2019;6(1).
- 41. Bhanbhro H, Hooi YK, Zakaria MNB, Hassan Z, Pitafi S. Single-line electrical drawings (SLED): A multiclass dataset benchmarked by deep neural networks. In: Proc IEEE 13th Int Conf Syst Eng Technol (ICSET); 2023. p. 66–71.
- 42. Ikram S, Sarwar Bajwa I, Gyawali S, Ikram A, Alsubaie N. Enhancing Object Detection in Assistive Technology for the Visually Impaired: A DETR-Based Approach. IEEE Access. 2025;13:71647–61.
- 43. Liu J, Jiang G, Chu C, Li Y, Wang Z, Hu S. A formal model for multiagent Q-learning on graphs. Sci China Inf Sci. 2025;68(9).
- 44. Bhanbhro H, et al. Symbol detection in a multi-class dataset based on single-line diagrams using deep learning models. Int J Adv Comput Sci Appl. 2023;14(8).
- 45. Moorthy S, et al. Hybrid multi-attention transformer for robust video object detection. Eng Appl Artif Intell. 2025;139:109606.
- 46. Huang Z, Shen Y, Zhou M, Chen M, Yang H, Li S, et al. High spatial resolution infrared measurement method for transient temperature field based on 3D-SwinIR super-resolutions. Rev Sci Instrum. 2025;96(4):045107. pmid:40261104
- 47. Goh KW, Surono S, Afiatin MF, Mahmudah KR, Irsalinda N, Chaimanee M, et al. Comparison of activation functions in convolutional neural network for poisson noisy image classification. Emerg Sci J. 2024;8(2):592–602.
- 48. Worachairungreung M, Kulpanich N, Sae-ngow P, Thanakunwutthirot K, Anurak K, Hemwan P. Classification of Coconut Trees Within Plantations from UAV Images Using Deep Learning with Faster R-CNN and Mask R-CNN. J Hum Earth Future. 2024;5(4):560–73.
- 49. Alhawsawi AN, Khan SD, Rehman FU. Enhanced YOLOv8-Based Model with Context Enrichment Module for Crowd Counting in Complex Drone Imagery. Remote Sensing. 2024;16(22):4175.
- 50. Khan SD, Alarabi L, Basalamah S. A unified deep learning framework of multi-scale detectors for geo-spatial object detection in high-resolution satellite images. Arab J Sci Eng. 2022;47(8):9489–504.
- 51. He K, Gkioxari G, Dollar P, Girshick R. Mask R-CNN. IEEE Trans Pattern Anal Mach Intell. 2020;42(2):386–97. pmid:29994331
- 52. Zhao H, Ji T, Rosin PL, Lai Y-K, Meng W, Wang Y. Cross-lingual font style transfer with full-domain convolutional attention. Pattern Recognition. 2024;155:110709.
- 53. Wang J, Zhang L, Li H. Challenges in symbol classification for cluttered engineering drawings: A study on loss function limitations. IEEE Trans Image Process. 2022;31:5678–90.
- 54. Guan S, Lin Y, Lin G, Su P, Huang S, Meng X, et al. Real-Time Detection and Counting of Wheat Spikes Based on Improved YOLOv10. Agronomy. 2024;14(9):1936.
- 55. Bhanbhro H, Hooi YK, Zakaria MNB, Kusakunniran W, Amur ZH. MCBAN: A Small Object Detection Multi-Convolutional Block Attention Network. CMC. 2024;81(2):2243–59.