
Particle swarm optimization-based NLP methods for optimizing automatic document classification and retrieval

Abstract

Text classification plays an essential role in natural language processing and is commonly used in tasks like categorizing news, sentiment analysis, and retrieving relevant information. However, existing models often struggle to perform well on multi-class tasks or complex documents. To overcome these limitations, we propose the PBX model, which integrates both deep learning and traditional machine learning techniques. By utilizing BERT for text pre-training and combining it with the ConvXGB module for classification, the model significantly boosts performance. Hyperparameters are optimized using Particle Swarm Optimization (PSO), enhancing overall accuracy. We tested the model on several datasets, including 20 Newsgroups, Reuters-21578, and AG News, where it outperformed existing models in accuracy, precision, recall, and F1 score. In particular, the PBX model achieved a remarkable 95.0% accuracy and 94.9% F1 score on the AG News dataset. Ablation experiments further validate the contributions of PSO, BERT, and ConvXGB. Future work will focus on improving performance for smaller or ambiguous categories and expanding its practical use across various applications.

Introduction

In the wake of the rapid evolution of information technology, the archival management domain grapples with the formidable challenges of processing and administering vast volumes of data. The pursuit of efficient and precise automatic classification and retrieval of diverse archival materials has emerged as a pivotal concern within the realms of information management and intelligent technologies [1]. Notably, in sectors such as law, finance, and healthcare, the accurate categorization and retrieval of documents are indispensable for bolstering work efficiency and elevating the accuracy of decision-making processes [2–4]. Nevertheless, traditional text classification approaches grounded in manual rules and shallow feature extraction techniques often prove ill-equipped to adapt to the increasingly intricate document structures and the ever-changing requirements of different domains. Consequently, their performance in high-precision tasks leaves much to be desired [5].

Historically, text classification relied on traditional machine learning methods such as Support Vector Machines (SVM), Naive Bayes, and Decision Trees [6–8], which often involved manual feature extraction techniques like TF-IDF and the bag-of-words model [9]. However, these approaches struggle with long texts, complex semantics, and multi-class classification tasks, leading to limited performance and generalization [10]. Recent advances in deep learning, particularly in Natural Language Processing (NLP), have provided more effective solutions. Pre-trained models like BERT (Bidirectional Encoder Representations from Transformers) leverage deep learning to capture contextual information, greatly enhancing classification accuracy [11]. Despite BERT’s remarkable prowess in semantic comprehension, it is not without its drawbacks. These include the challenges associated with hyperparameter tuning, the high computational resource demands, and the need to further augment the model’s capacity to understand complex documents in certain specialized domains [12,13]. Additionally, a solitary deep-learning model often struggles to achieve optimal performance, especially in scenarios involving multi-level feature extraction and multi-class classification tasks, where a single model may be incapable of comprehensively capturing all the pertinent information within the text [14].

This paper introduces the PBX Model (PSO-BERT-ConvXGB), a novel hybrid approach that combines deep learning and traditional machine learning techniques to improve the accuracy and efficiency of the automatic archive classification and retrieval system. The model leverages BERT for semantic understanding, uses CNN for feature extraction, applies XGBoost for classification, and optimizes hyperparameters through the PSO algorithm to significantly enhance its performance.

The principal contributions of this paper can be distilled into the following three aspects:

  • The integration of BERT and CNN for multi-level feature extraction, which not only combines deep-learning and traditional machine-learning paradigms but also employs convolutional operations to delve deeper into the local features within the text, thereby improving the model’s classification performance for complex texts.
  • The introduction of the Particle Swarm Optimization (PSO) algorithm to automate the hyperparameter tuning of BERT and XGBoost, effectively addressing the inefficiencies and instability inherent in traditional hyperparameter adjustment methods.
  • The utilization of multiple public datasets for experimentation, evaluating the model’s adaptability and performance in document classification across different domains, and validating the model’s effectiveness and generalization capabilities in multi-field document classification and retrieval.

Related works

Text classification and document retrieval technologies

Text classification and document retrieval are fundamental tasks within Natural Language Processing (NLP), with extensive research and practical applications [10]. Traditional methods typically rely on feature extraction techniques like TF-IDF (Term Frequency-Inverse Document Frequency) and the Bag-of-Words model, which represent text by analyzing the frequency of words. These approaches work well for simple tasks but fail to capture the nuanced semantic relationships in complex or domain-specific documents, as they overlook the context in which words appear [15]. This limitation becomes especially evident in multi-class classification scenarios.

The advent of deep learning has shifted the focus to models built on CNN, RNN, and Transformer architectures, such as BERT, which have become the dominant techniques in text classification [16]. BERT, in particular, enhances semantic comprehension and boosts classification accuracy by leveraging its bidirectional Transformer structure to understand contextual information [17]. Despite the success of pre-trained models like BERT in various classification tasks, challenges such as difficulties in hyperparameter tuning, high computational costs, and long training times persist [18]. These issues can reduce the model’s efficiency in real-world applications.

Additionally, while deep learning models excel at feature learning, a single model often fails to capture all levels of textual features, especially in complex and multi-class classification tasks [19]. As a result, hybrid approaches combining traditional machine learning algorithms with deep learning models are gaining popularity [20]. For example, the XGBoost model, as a powerful traditional machine learning algorithm, can make efficient classification decisions through gradient-boosted decision trees (GBDT). Combining BERT with XGBoost can fully exploit BERT’s advantages in semantic understanding while taking advantage of XGBoost’s efficiency in classification tasks. This study proposes an optimized model that integrates BERT with ConvXGB, enhanced by PSO, aiming to address the limitations of current methods in high-dimensional feature spaces and multi-class classification tasks. To demonstrate the advantages and disadvantages of various text classification methods more clearly, this paper compares several mainstream algorithms and their characteristics in Table 1.

Table 1. Existing text classification techniques and their advantages and disadvantages.

https://doi.org/10.1371/journal.pone.0325851.t001

Optimization algorithms in text classification

Optimization algorithms play a crucial role in improving the performance of models in text classification tasks. These algorithms have been extensively used for hyperparameter tuning, feature selection, and model structure adjustment [29,30]. Traditional optimization techniques like Grid Search and Random Search involve either systematically searching through all possible hyperparameter combinations or randomly selecting values to find the best solution. However, as the size of text classification tasks grows, the computational cost of these methods becomes prohibitively high, especially when applied to deep learning models, which require significant time and resources [31–33]. To address these challenges, heuristic optimization methods such as PSO have gained popularity. PSO mimics the movement of particles in a search space, allowing for efficient exploration of the hyperparameter space at a lower computational cost. It has demonstrated good performance in tuning deep learning model hyperparameters [34].

PSO offers a more effective balance between global and local search efficiency compared to traditional techniques like Grid Search and Random Search, especially in high-dimensional, complex search spaces [35,36]. PSO is now commonly used to optimize hyperparameters in deep learning models, including BERT, by adjusting parameters such as learning rate, batch size, and training epochs to improve classification accuracy and model training efficiency [37,38]. Despite its success in various domains, there are still challenges regarding the balance between search efficiency and the complexity of the parameter space, as well as improving the model’s generalization ability [1]. This paper presents a solution by combining PSO with BERT and XGBoost to optimize hyperparameters and enhance performance in multi-class text classification tasks, offering a new approach to overcoming the current limitations of these methods.

Methodology

Overall model framework

The model introduced in this paper combines deep learning and traditional machine learning techniques to enhance the accuracy and efficiency of automatic archive classification and retrieval. It is composed of four key components: the BERT pre-training module, a CNN feature extractor and an XGBoost classifier (which together form the ConvXGB module), and the PSO module. These components work in synergy, leveraging their individual strengths to collaboratively tackle complex text classification tasks. As shown in Fig 1, each module of the model works together, forming a closed-loop optimization process through the flow of information.

Fig 1. The overall architecture of the PBX model, including the collaborative working process of the four main modules.

The input text undergoes semantic processing in the BERT pre-training module, and then local feature extraction is carried out through CNN. The extracted features are fed into XGBoost for classification decision-making. The PSO is responsible for optimizing the hyperparameters of BERT, CNN, and XGBoost.

https://doi.org/10.1371/journal.pone.0325851.g001

At the heart of the model, the BERT pre-training module converts input text into vector representations that capture deep semantic meaning. Using its bidirectional Transformer architecture, BERT extracts subtle semantic cues from the text by considering the full context, allowing the model to better understand complex relationships and provide robust semantic support for subsequent tasks. This model is then fine-tuned to optimize its performance for specific tasks and datasets, enhancing its capability in processing domain-specific text.

Once BERT completes the semantic encoding, the text data is passed to the ConvXGB module, which integrates the strengths of Convolutional Neural Networks (CNN) and the XGBoost classifier. The CNN’s role is to extract local features from BERT’s output, capturing important keywords, phrases, and other relevant features through its convolutional and pooling layers. After processing, these features are flattened and fed into XGBoost. XGBoost employs its gradient boosting tree algorithm to classify the extracted features, efficiently handling non-linear relationships and adjusting weights automatically, thus improving accuracy and performance in multi-class classification tasks.

To ensure optimal module performance, the Particle Swarm Optimization (PSO) module plays a crucial role in the overall model. The PSO algorithm searches for and fine-tunes the hyperparameters of BERT, CNN, and XGBoost by simulating the behavior of particle swarms in the solution space. This allows each module to adapt to various datasets while maintaining efficient training and classification performance. By introducing PSO, the need for manual hyperparameter tuning is minimized, and the risk of getting stuck in local optima is avoided through global optimization, ultimately improving the model’s overall performance. The detailed steps of the proposed PBX model are summarized in Algorithm 1.

Algorithm 1. PBX model pseudocode.

Through the collaborative work of these four modules, the model can progressively optimize each stage, from extracting the deep semantics of the text, to capturing local features, to generating classification decisions, ultimately achieving more efficient and accurate archive classification and retrieval. The following sections explore the detailed implementation of each module and the ways in which they cooperate.
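The end-to-end flow of the four modules can be sketched with placeholder components. Everything below is illustrative stub logic, not the actual models: the real pipeline uses a fine-tuned BERT, a CNN feature extractor, XGBoost, and a PSO search over the hyperparameters.

```python
# High-level PBX flow (illustrative stubs only -- the real modules are a
# fine-tuned BERT, a CNN feature extractor, XGBoost, and a PSO loop).

def bert_encode(texts, hparams):
    """Stand-in for BERT: map each text to a fixed-length 'semantic' vector."""
    return [[len(t) * hparams["lr"], t.count(" ")] for t in texts]

def cnn_extract(vectors, hparams):
    """Stand-in for the CNN: a trivial local transform of each vector."""
    return [[v * hparams["scale"] for v in vec] for vec in vectors]

def xgb_classify(features, hparams):
    """Stand-in for XGBoost: a threshold rule in place of boosted trees."""
    return [int(sum(f) > hparams["threshold"]) for f in features]

def pbx_pipeline(texts, hparams):
    """BERT -> CNN -> XGBoost; hparams is what PSO would normally search for."""
    return xgb_classify(cnn_extract(bert_encode(texts, hparams), hparams), hparams)

hparams = {"lr": 0.1, "scale": 1.0, "threshold": 3.0}
labels = pbx_pipeline(["short text", "a much longer example document here"], hparams)
```

In the real model, PSO repeatedly calls a pipeline like `pbx_pipeline` with candidate `hparams` and keeps the setting that minimizes validation loss.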

BERT pre-training module

The BERT pre-training module is a key component of the model presented in this paper. It converts the input text into context-aware vector representations rich in semantic information [39], enabling the model to effectively capture intricate dependencies and the underlying semantic structures between words within the text. As shown in Fig 2, the core functions of this module include tokenizing the input text, generating word embeddings, adding position encodings, processing through bidirectional Transformers, and finally generating context representations. The design of this process ensures that BERT can consider context information simultaneously when processing text, thus generating text representations with rich semantics.

Fig 2. The BERT pre-training module, including tokenization of the input text, word embeddings, position encoding, bidirectional Transformer processing, and the final generated context representation.

https://doi.org/10.1371/journal.pone.0325851.g002

The input text is initially processed by tokenization, which breaks the text into smaller sub-word units. BERT uses the WordPiece tokenizer, a technique that efficiently handles out-of-vocabulary (OOV) words and reduces vocabulary size by utilizing sub-word units, thus improving processing efficiency [40]. Each token is mapped to a high-dimensional vector through the embedding layer. The embedding for each token xi is represented as E(xi), and the combination of word vectors for all tokens in the text forms the input matrix:

X = [E(x_1), E(x_2), \ldots, E(x_n)]  (1)

where X \in \mathbb{R}^{n \times d}, and d is the dimension of the word embedding, usually 768 or higher. To capture sequential information, BERT adds position encodings to the word vectors, allowing the model to recognize the relative position of each token in the text. Let the position encoding matrix be P(X). The text embedding after position encoding is:

X' = X + P(X)  (2)

where X' \in \mathbb{R}^{n \times d} represents the position-aware input matrix, and P(X) is the position encoding matrix.
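As a concrete toy illustration of this step, the NumPy sketch below adds a position encoding to random stand-in embeddings. One hedge: the sinusoidal form used here is the original Transformer variant, chosen only to keep the example self-contained; BERT itself learns its position embeddings during pre-training.

```python
import numpy as np

def position_encoding(n, d):
    """Sinusoidal P(X) (original Transformer form; BERT actually
    learns its position embeddings -- this is just a stand-in)."""
    pos = np.arange(n)[:, None]          # token positions 0..n-1
    i = np.arange(d)[None, :]            # embedding dimensions 0..d-1
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

rng = np.random.default_rng(0)
n, d = 6, 8                              # 6 tokens, toy dimension (BERT uses 768)
X = rng.normal(size=(n, d))              # stand-in for [E(x_1), ..., E(x_n)]
X_prime = X + position_encoding(n, d)    # Eq. (2): X' = X + P(X)
```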

BERT utilizes a bidirectional Transformer architecture, with the core mechanism being self-attention. This process dynamically adjusts each token’s representation by calculating its similarity to other tokens, considering both left and right contexts simultaneously [41]. This bidirectional approach offers significant advantages over traditional unidirectional models in capturing syntactic and semantic relationships in longer texts. During self-attention computation, BERT employs query (Q), key (K), and value (V) matrices for weighted summation:

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left( \frac{Q K^{\top}}{\sqrt{d_k}} \right) V  (3)

where Q and K are the query and key matrices, V is the value matrix, and dk is the dimension of the key. By calculating attention weights between tokens, the model effectively captures relationships within the text, enhancing the accuracy of token representations.
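Eq. (3) can be sketched in a few lines of NumPy. This is a minimal single-head, unbatched illustration, not the multi-head batched version BERT actually runs:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention, Eq. (3) -- single head, no batching."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # token-to-token similarities
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V, weights

rng = np.random.default_rng(1)
n, d_k = 4, 8                          # 4 tokens, toy key dimension
Q = rng.normal(size=(n, d_k))
K = rng.normal(size=(n, d_k))
V = rng.normal(size=(n, d_k))
out, w = attention(Q, K, V)            # one context-mixed vector per token
```

Each row of `w` sums to 1, so every output row is a convex combination of the value vectors, weighted by contextual similarity.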

BERT performs attention calculations across multiple subspaces in parallel, and the output of each Transformer layer is updated as follows:

H^{(l)} = \mathrm{FFN}\left( \mathrm{MultiHead}(H^{(l-1)}) \right)  (4)

where H^{(l)} represents the output of the l-th layer, and FFN refers to the feed-forward network, which applies non-linear transformations to each token, strengthening the model’s ability to express complex patterns. BERT refines deep semantic features by stacking multiple Transformer layers.

In BERT’s final output, each token receives a context-dependent representation. For text classification tasks, BERT aggregates the representations of the entire sentence using the [CLS] (classification) token, which serves as the foundation for subsequent classification. The output from the [CLS] token is then passed through a fully-connected layer, mapping it to the category space and generating the predicted category probability:

\hat{y} = \mathrm{softmax}(W h_{[\mathrm{CLS}]} + b)  (5)

where W and b are the weights and biases of the classification layer, h_{[\mathrm{CLS}]} is the vector corresponding to the [CLS] token, and \hat{y} represents the predicted category probability. This output is optimized using the cross-entropy loss function, fine-tuning the BERT model for the specific task:

L = -\sum_{i} y_i \log \hat{y}_i  (6)

where y_i is the true label, and \hat{y}_i is the predicted probability. This process allows BERT to adjust its internal parameters to better fit the task at hand, improving its performance in specific applications.
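A minimal sketch of this classification head, Eqs. (5) and (6), with made-up toy weights (in the real model, W and b are learned during fine-tuning and the [CLS] vector comes from BERT's final layer):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(y_true, y_prob):
    """Eq. (6): L = -sum_i y_i * log(y_hat_i), with one-hot labels."""
    return -float(np.sum(y_true * np.log(y_prob)))

h_cls = np.array([0.5, -1.0, 2.0, 0.1])          # toy [CLS] vector (d = 4)
W = np.array([[0.1] * 4, [0.2] * 4, [0.3] * 4])  # 3 classes x 4 dims (invented)
b = np.zeros(3)
y_hat = softmax(W @ h_cls + b)                   # Eq. (5)
y_true = np.array([0.0, 0.0, 1.0])               # true class is index 2
loss = cross_entropy(y_true, y_hat)
```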

Through the above mechanisms, BERT generates rich context information and deep-level semantic features, providing strong support for subsequent feature extraction and classification tasks. Through fine-tuning, BERT can flexibly adapt to different downstream tasks, improving the performance of the model in various text classification tasks. In the framework of this paper, the BERT pre-training module provides high-quality semantic representations for the subsequent ConvXGB module, promoting the improvement of the overall performance of the model.

ConvXGB module

The ConvXGB module plays a vital role in the proposed model. It extracts relevant features from the context representations produced by the BERT pre-training module and utilizes these features for the final document classification task [42]. Fig 3 illustrates the architecture of the ConvXGB module, which is primarily composed of a CNN and an XGBoost classifier. Within this module, the CNN is responsible for capturing local features from the text representations generated by BERT, while the XGBoost classifier uses these features to perform the final classification.

Fig 3. Architecture diagram of the ConvXGB module, including the collaborative working process of CNN feature extraction and the XGBoost classifier.

https://doi.org/10.1371/journal.pone.0325851.g003

The CNN’s primary function is to extract local features from the context representations generated by BERT [43]. After the input text is processed through the BERT pre-training module, the context vectors of each token form a matrix X' \in \mathbb{R}^{n \times d}, where n represents the number of tokens and d denotes the dimension of the vectors generated by BERT. This matrix is fed into the CNN, which performs convolution operations to capture local features within the text.

During the convolution process, the CNN extracts features from local regions of the input matrix by sliding a convolutional kernel. Let the kernel be W \in \mathbb{R}^{k \times d}, where k is the size of the kernel. The convolution operation is performed as follows:

c_i = f(W \cdot X'_{i:i+k-1} + b)  (7)

Here, c_i is the result of the convolution at the i-th position, representing the local features extracted from the text. The CNN progressively captures local features, such as keywords and phrases, by applying multiple convolutional kernels and layers. The output from each convolutional layer is processed by a pooling layer, typically using max-pooling, to reduce dimensionality while retaining key information. The pooled feature map is calculated as:

P = \max(c_1, c_2, \ldots, c_{n-k+1})  (8)

In this equation, P is the feature map after pooling, which is compressed in size while preserving the most significant local features.
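The convolution-plus-max-pooling step of Eqs. (7) and (8) can be sketched for a single kernel as follows. This toy version uses tanh as the non-linearity f and random stand-in data; the real module applies many kernels in parallel:

```python
import numpy as np

def conv1d(X, W, b=0.0, f=np.tanh):
    """Eq. (7): slide a k x d kernel W down the n x d token matrix."""
    n, _ = X.shape
    k = W.shape[0]
    return np.array([f(np.sum(W * X[i:i + k]) + b) for i in range(n - k + 1)])

def max_pool(c):
    """Eq. (8): keep only the strongest local response."""
    return c.max()

rng = np.random.default_rng(2)
X = rng.normal(size=(10, 4))   # 10 tokens, 4-dim toy embeddings
W = rng.normal(size=(3, 4))    # kernel spanning k = 3 consecutive tokens
c = conv1d(X, W)               # 10 - 3 + 1 = 8 local features
pooled = max_pool(c)           # single pooled feature for this kernel
```

With m different kernels, the m pooled values are concatenated and flattened into the feature vector handed to XGBoost.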

After the CNN extracts the local features, the resulting feature map is flattened and passed to the XGBoost classifier. XGBoost, an efficient Gradient-Boosting Decision Tree (GBDT) algorithm, builds a strong classifier by combining multiple weak classifiers, thereby improving classification accuracy [44]. It excels at handling complex, non-linear relationships and optimizing the model’s generalization ability through an ensemble approach.

The XGBoost classifier updates the weights of its trees using the gradient-boosting method during each training iteration. Let the model output be \hat{y}_i; then XGBoost’s prediction process is expressed as:

\hat{y}_i = \sum_{t=1}^{T} \alpha_t f_t(x_i)  (9)

where T is the number of trees, f_t(x_i) is the output of the t-th tree, \alpha_t is the weight of the tree, and \hat{y}_i represents the prediction of the decision-tree ensemble.
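The additive prediction of Eq. (9) can be sketched with hand-made decision stumps standing in for boosted trees. Note that real XGBoost trees are grown greedily on gradient statistics; the stumps and weights below are invented purely to illustrate the weighted sum:

```python
# Eq. (9): an XGBoost-style prediction is an additive sum of tree outputs.
# Toy stand-in: "trees" are depth-1 stumps, each with a weight alpha_t.

def stump(threshold, left, right):
    """A depth-1 'tree': returns `left` below the threshold, else `right`."""
    return lambda x: left if x < threshold else right

trees = [stump(0.5, -1.0, 1.0), stump(1.5, -0.5, 0.5), stump(2.5, 0.0, 2.0)]
alphas = [0.6, 0.3, 0.1]          # invented per-tree weights

def predict(x):
    """Weighted sum over the ensemble: y_hat = sum_t alpha_t * f_t(x)."""
    return sum(a * f(x) for a, f in zip(alphas, trees))

lo, hi = predict(0.0), predict(3.0)   # -0.75 and 0.95 for these toy stumps
```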

During training, XGBoost updates the tree weights by minimizing the loss function, which uses cross-entropy loss:

L = \sum_{i} \ell(y_i, \hat{y}_i) + \sum_{t=1}^{T} \Omega(f_t)  (10)

where y_i is the true label, \hat{y}_i is the predicted label, and \Omega(f_t) is the regularization term to prevent overfitting.

XGBoost’s ability to automatically weight features allows it to focus on the most important features. In this model, the features extracted by the CNN are passed to XGBoost, which then classifies them to produce the final document category prediction.

The combination of CNN and XGBoost enables the model to give full play to the advantages of both. Features are extracted from the semantic representations of BERT by the CNN to obtain local features and then passed to XGBoost for classification. XGBoost weights these features through its powerful gradient-boosting algorithm to generate the final classification result. Since the CNN can capture local information in the text and XGBoost can perform refined classification on these features, this combination greatly improves the accuracy and efficiency of the model in complex text classification tasks.

Particle Swarm Optimization (PSO) module

In the PBX model, the Particle Swarm Optimization (PSO) module plays a key role in fine-tuning the hyperparameters of both BERT and ConvXGB, thereby enhancing the overall classification performance of the model. By simulating swarm intelligence search processes, PSO efficiently explores the high-dimensional hyperparameter space to find optimal solutions, overcoming the inefficiencies of traditional parameter tuning methods [45]. The architecture of the PSO module is shown in Fig 4, illustrating how PSO adjusts hyperparameters across different modules to optimize the final classification outcomes.

Fig 4. The optimization process of PSO for hyperparameters such as the learning rate of BERT, the kernel size of CNN, and the tree depth of XGBoost, and its mechanism of action in improving the classification performance of the model to achieve the global optimal solution.

https://doi.org/10.1371/journal.pone.0325851.g004

The PSO algorithm works by having particles move through the solution space, continuously updating their positions to seek the best possible solution [46]. Each particle represents a set of hyperparameters, with its position denoted as x_i = (x_{i1}, x_{i2}, \ldots, x_{id}), where d represents the number of hyperparameters. The particle’s velocity is updated using the following equation:

v_{id} = w\, v_{id} + c_1 r_1 (p_{id} - x_{id}) + c_2 r_2 (g_d - x_{id})  (11)

where w is the inertia weight, c_1 and c_2 are acceleration constants, r_1 and r_2 are random numbers, p_{id} is the particle’s best previous position, and g_d is the global best position. Through this formula, particles update their velocities (and then their positions, via x_{id} = x_{id} + v_{id}) in each iteration to find better hyperparameter combinations.
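The update rule of Eq. (11) is straightforward to implement end to end. The sketch below is a minimal pure-Python PSO minimizing a toy 2-D quadratic; in the PBX model the objective would instead be the validation loss of a candidate hyperparameter setting, which is far more expensive to evaluate, and the inertia and acceleration constants here are common defaults, not values from the paper:

```python
import random

def pso(f, dim=2, n_particles=20, iters=100, w=0.7, c1=1.5, c2=1.5, seed=42):
    """Minimal PSO, Eq. (11): each velocity is pulled toward the particle's
    own best position (p_id) and the swarm's global best (g_d)."""
    rng = random.Random(seed)
    pos = [[rng.uniform(-5, 5) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                 # each particle's best-so-far
    gbest = min(pbest, key=f)[:]                # swarm's best-so-far
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]          # position update
            if f(pos[i]) < f(pbest[i]):
                pbest[i] = pos[i][:]
                if f(pbest[i]) < f(gbest):
                    gbest = pbest[i][:]
    return gbest

# Toy objective: a smooth bowl with minimum at (1, 2).
best = pso(lambda x: (x[0] - 1) ** 2 + (x[1] - 2) ** 2)   # close to [1.0, 2.0]
```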

In the BERT module, PSO primarily optimizes the learning rate \eta, batch size B, and the number of training epochs E. The goal of this optimization is to minimize the classification loss function L_{\mathrm{BERT}}:

\min_{\eta,\, B,\, E} L_{\mathrm{BERT}}(\eta, B, E)  (12)

PSO adjusts these hyperparameters to ensure that the BERT model can converge quickly and stably during the fine-tuning process, thereby improving the classification accuracy. In the CNN part of the ConvXGB module, PSO is used to optimize the kernel size k, number of convolutional layers L, and stride s. The CNN extracts local features through convolutional operations, and these hyperparameters determine the efficiency and quality of feature extraction. The loss function of the CNN module is:

\min_{k,\, L,\, s} L_{\mathrm{CNN}}(k, L, s)  (13)

PSO optimizes these hyperparameters so that the CNN can effectively extract key features from the context vectors of BERT, improving the representativeness of the features.

In the XGBoost part, PSO mainly optimizes the tree depth D, learning rate \eta, subsample ratio r_s, and column sample ratio r_c. These hyperparameters control the complexity of each tree and the training process. The loss function of XGBoost can be expressed as:

\min_{D,\, \eta,\, r_s,\, r_c} L_{\mathrm{XGB}}(D, \eta, r_s, r_c)  (14)

PSO adjusts these hyperparameters to help XGBoost find the best classification boundary in different feature spaces, thereby improving the classification accuracy. Ultimately, the goal of the entire model is to minimize the sum of the loss functions of all modules:

L_{\mathrm{total}} = L_{\mathrm{BERT}} + L_{\mathrm{CNN}} + L_{\mathrm{XGB}}  (15)

PSO optimizes these hyperparameter combinations to maximize the classification performance of the model, ensuring that the entire model can achieve the best results in the document classification task. Through the intelligent optimization of PSO, the model in this paper can automatically adjust the hyperparameters of each module, avoiding the inefficiency of traditional manual parameter-tuning. The PSO module not only improves the training efficiency of the model but also enables the model to show stronger adaptability and accuracy in a variety of text classification tasks.

Experiment

Datasets

In this study, we used three widely recognized public text classification datasets: 20 Newsgroups, Reuters-21578, and AG News. These datasets span various domains and categories, providing a comprehensive evaluation of the PBX model across different tasks. Detailed information about these datasets is presented in Table 2.

Table 2. Detailed information of the experimental datasets, including the field, number of documents, number of categories, and a brief description of each dataset.

https://doi.org/10.1371/journal.pone.0325851.t002

These three datasets were chosen as the basis of the experiments because they cover different fields and categories, effectively testing the adaptability and performance of the model in multiple scenarios. 20 Newsgroups is a classic multi-class text classification dataset covering topics from sports to politics, which can evaluate the model’s performance on complex-category tasks. Reuters-21578 focuses on financial news and involves multi-label classification problems, testing the model’s text classification ability in a specific domain. AG News is a large-scale news classification task, testing the model’s training efficiency and accuracy when processing large-scale data.

In terms of data preprocessing, each dataset was first cleaned to remove irrelevant symbols and stopwords, and the text was converted to a uniform case. The WordPiece tokenizer was then used to tokenize the text, ensuring that the model can effectively handle out-of-vocabulary words. To align with the input requirements of the BERT model, the text from each dataset was transformed into a fixed-length sequence of tokens, with shorter texts padded accordingly. Additionally, to ensure fairness in the model evaluation, the datasets were split into training, validation, and test sets following the standard ratio of 70% for training, 15% for validation, and 15% for testing. This standardization guarantees that the experimental results are comparable. To further address the overfitting that may occur, especially with high-dimensional features or small datasets, we paid special attention to regularization: during data preprocessing and model training, we introduced techniques such as Dropout and L2 regularization to reduce the risk of overfitting and improve the generalization ability of the model, and we applied appropriate regularization between the BERT and CNN layers to further enhance the model’s adaptability on complex, high-dimensional datasets. Using these datasets, we can assess the model’s classification performance across various fields and dataset sizes, as well as evaluate its practical adaptability and overall effectiveness.
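The 70/15/15 split described above can be sketched as follows. This is a generic shuffled split with a hypothetical helper name; the paper does not specify its exact splitting code:

```python
import random

def split_dataset(docs, labels, ratios=(0.70, 0.15, 0.15), seed=42):
    """Shuffle and split a dataset into train/validation/test (70/15/15)."""
    idx = list(range(len(docs)))
    random.Random(seed).shuffle(idx)                 # fixed seed: reproducible
    n_train = round(ratios[0] * len(idx))
    n_val = round(ratios[1] * len(idx))
    parts = (idx[:n_train],
             idx[n_train:n_train + n_val],
             idx[n_train + n_val:])
    return [([docs[i] for i in p], [labels[i] for i in p]) for p in parts]

docs = [f"document {i}" for i in range(100)]         # toy corpus
labels = [i % 4 for i in range(100)]                 # toy 4-class labels
train, val, test = split_dataset(docs, labels)
```

A stratified split (preserving per-class proportions in each part) would be a natural refinement for the imbalanced categories discussed later.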

Experimental environment and parameter

In the experiments of this paper, all model training and testing were carried out under a unified experimental environment to ensure the comparability and reproducibility of the results. The hardware and software environments of the experiments, as well as the parameter settings of the models, were strictly controlled to ensure the efficiency and accuracy of the experiments. Table 3 shows the specific details of the experimental environment and parameter settings.

Table 3. Experimental Environment and Parameter Settings, including detailed information such as hardware configuration, deep-learning framework version, model hyperparameters, and training time.

https://doi.org/10.1371/journal.pone.0325851.t003

The experimental environment configuration in this paper adopts high-performance GPUs and sufficient memory to ensure the efficient training of large-scale datasets and complex models. The used deep-learning frameworks (TensorFlow and PyTorch) can support the training of BERT and CNN, while XGBoost handles classification tasks through its efficient gradient-boosting algorithm. The PSO algorithm quickly performs hyperparameter optimization through the computing power accelerated by the GPU, ensuring that each training can be carried out efficiently and accurately.

Evaluation metrics

In this paper, we evaluate the performance of the PBX (PSO-BERT-ConvXGB) model using four key metrics: Accuracy, Precision, Recall, and F1 Score. These metrics offer a comprehensive view of the model’s performance, especially when dealing with multi-class classification or class imbalance.

Accuracy measures the overall proportion of correct predictions:

\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}  (16)

While accuracy is useful, it may not reflect performance in imbalanced datasets, so we also use Precision and Recall. Precision calculates the proportion of correct positive predictions out of all predicted positives:

\mathrm{Precision} = \frac{TP}{TP + FP}  (17)

Recall evaluates the ability of the model to identify actual positive instances:

\mathrm{Recall} = \frac{TP}{TP + FN}  (18)

To balance these two factors, we use the F1 Score, the harmonic mean of Precision and Recall:

F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}  (19)

These metrics allow for a balanced assessment of the PBX model’s accuracy, efficiency, and robustness across different datasets.
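For the binary case, Eqs. (16)-(19) follow directly from the confusion counts. A minimal sketch (multi-class evaluation, as used on these datasets, would compute these per class and then macro- or micro-average):

```python
def classification_metrics(y_true, y_pred):
    """Binary accuracy, precision, recall, and F1 -- Eqs. (16)-(19)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0   # guard empty denominators
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1

y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # toy gold labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # toy predictions: 1 FN, 1 FP
acc, prec, rec, f1 = classification_metrics(y_true, y_pred)   # all 0.75 here
```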

Comparative experiments

Comparison of PBX Model with Six Advanced Text Classification Models: This table presents the performance of the PBX model alongside six other state-of-the-art text classification models, including TextCNN, FastText, DPCNN, LightGBM, RoBERTa, and XLNet. The models are compared based on accuracy, precision, recall, and F1 score across multiple datasets, with a focus on evaluating the PBX model’s strengths in handling complex semantic understanding and large-scale datasets in multi-class classification tasks. The results are summarized in Table 4.

Table 4. Comparison of the performance results of the PBX model and six other advanced models in terms of accuracy, precision, recall, and F1 score on the 20 Newsgroups, Reuters-21578, and AG News datasets.

https://doi.org/10.1371/journal.pone.0325851.t004

The results of the comparison demonstrate that the PBX model outperforms all other models on the three datasets (20 Newsgroups, Reuters-21578, and AG News), achieving remarkable results across the board. Particularly on the AG News dataset, the model excelled, achieving the highest accuracy (95.0%) and F1 score (94.9%), surpassing all deep learning and ensemble models, including RoBERTa and XLNet. This highlights the model’s superior performance in large-scale text classification tasks.

For the 20 Newsgroups and Reuters-21578 datasets, the PBX model achieved accuracies of 91.2% and 84.3%, respectively. When compared to models such as RoBERTa and TextCNN, the PBX model showed improvements of 1% to 2%. Notably, it significantly outperformed other models in precision, recall, and F1 score. The PBX model was especially effective in handling complex class distributions and multi-label classification challenges. Compared to traditional machine learning methods such as SVM and LightGBM, the PBX model not only offers clear advantages in accuracy but also benefits from enhanced hyperparameter tuning through PSO optimization, improving the robustness and generalization of the model. While methods like SVM and LightGBM perform well on simpler datasets, the PBX model has a distinct edge when dealing with more complex and diverse tasks like those presented in 20 Newsgroups and Reuters-21578.

By combining BERT’s semantic understanding, XGBoost’s powerful classification capabilities, and PSO optimization for hyperparameter tuning, the PBX model adapts to various datasets, achieving optimal performance. This demonstrates that the PBX model not only leverages the semantic strengths of deep learning models but also enhances its overall performance through optimization techniques, particularly excelling in high-dimensional and complex text classification tasks.

To present the experimental results more clearly, we visualized the performance of the PBX model, emphasizing its superior results across all datasets (Fig 5). The visualization confirms that the dynamic parameter adjustment enabled by PSO, combined with BERT's semantic representations and XGBoost's classification power, allows the PBX model to consistently outperform the other models across diverse classification tasks.

Fig 5. Visualization of the comparative experimental results.

https://doi.org/10.1371/journal.pone.0325851.g005

From a technical perspective, the advantage of the PBX model over existing models lies in its fusion strategy. The BERT model provides PBX with powerful contextual semantic representations, while XGBoost handles the nonlinear relationships among features in the classification stage. The PSO algorithm automatically optimizes hyperparameters, avoiding the inefficiency of manual tuning and thereby improving the accuracy and robustness of the model. Compared with single models (such as RoBERTa and XLNet) or traditional machine learning methods (such as SVM and LightGBM), the PBX model's combination of multi-level feature extraction and model optimization gives it a stronger advantage on complex, high-dimensional datasets.
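The role PSO plays in this pipeline can be illustrated with a minimal sketch. The swarm update below follows the standard PSO formulation; the objective here is a toy quadratic surrogate standing in for "validation loss as a function of two hyperparameters" (e.g. an XGBoost learning rate and tree depth), since the real objective would train and validate the classifier at each candidate point.

```python
import random

def pso_minimize(objective, bounds, n_particles=20, n_iters=60,
                 w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimal particle swarm optimizer over a box-constrained space.
    Velocity update: v <- w*v + c1*r1*(pbest - x) + c2*r2*(gbest - x)."""
    rng = random.Random(seed)
    dim = len(bounds)
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]
    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                # clamp each coordinate back into its search bounds
                pos[i][d] = min(max(pos[i][d] + vel[i][d],
                                    bounds[d][0]), bounds[d][1])
            val = objective(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val

# Hypothetical surrogate: loss minimized at learning rate 0.1, depth 6.
surrogate = lambda x: (x[0] - 0.1) ** 2 + (x[1] - 6.0) ** 2
best, best_val = pso_minimize(surrogate, bounds=[(0.01, 0.5), (3, 12)])
```

The particle count, iteration budget, and inertia/acceleration coefficients shown here are illustrative defaults, not the values used in the paper's experiments.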

Ablation experiments

To further assess the contribution of each module in the PBX model, we evaluated five configurations: the complete model and four ablated variants obtained by individually removing the BERT pre-training module, the CNN module, the XGBoost module, and the PSO optimization module. These experiments let us measure the impact of each component on overall performance and examine how the integration of PSO, BERT, CNN, and XGBoost influences classification results. The results, including accuracy, precision, recall, and F1 score for each dataset, are presented in Table 5.
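The ablation protocol amounts to running one shared train/evaluate routine over every configuration. The sketch below makes that structure explicit; `train_and_evaluate` is a hypothetical stand-in for the full PBX pipeline, not the authors' code.

```python
# Five configurations: the complete model plus four single-module ablations.
CONFIGS = {
    "complete":   {"bert": True,  "cnn": True,  "xgboost": True,  "pso": True},
    "no_bert":    {"bert": False, "cnn": True,  "xgboost": True,  "pso": True},
    "no_cnn":     {"bert": True,  "cnn": False, "xgboost": True,  "pso": True},
    "no_xgboost": {"bert": True,  "cnn": True,  "xgboost": False, "pso": True},
    "no_pso":     {"bert": True,  "cnn": True,  "xgboost": True,  "pso": False},
}

def train_and_evaluate(flags):
    # Placeholder: would build the PBX pipeline with only the enabled
    # modules and return the test-split metrics reported in Table 5.
    return {"accuracy": None, "precision": None, "recall": None, "f1": None}

results = {name: train_and_evaluate(flags) for name, flags in CONFIGS.items()}
```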

Table 5. Ablation Experiment Results: Performance of the PBX model on the 20 Newsgroups, Reuters-21578, and AG News datasets, evaluating accuracy, precision, recall, and F1 score under various configurations, including the removal of the BERT pre-training module, CNN module, XGBoost module, and PSO optimization module.

https://doi.org/10.1371/journal.pone.0325851.t005

As the ablation results in Table 5 show, the complete PBX model performs best on all datasets and evaluation metrics, significantly outperforming every configuration with a single module removed. On the AG News dataset in particular, the complete model's accuracy (95.0%) and F1 score (94.9%) clearly exceed those of the other configurations.

Removing the BERT module degrades performance on all datasets; the drops in accuracy and precision on 20 Newsgroups and AG News are especially pronounced, confirming the importance of BERT pre-training for semantic understanding and classification. Removing the CNN module causes a comparatively small decline, though recall and F1 score on 20 Newsgroups decrease slightly, indicating that the CNN contributes by extracting local features. Removing the XGBoost module leads to a significant drop in overall performance, particularly on 20 Newsgroups and AG News, highlighting the essential role of XGBoost in tree-based classification and in handling complex features. Finally, removing the PSO optimization module also causes a noticeable decline, especially on 20 Newsgroups, where accuracy and F1 score fall well below those of the full model; this shows that PSO-driven hyperparameter tuning is crucial to the model's performance.

Visualization results

Fig 6 displays the confusion matrices of the PBX model on the three datasets. Each matrix shows how the model assigns documents to the various categories. The PBX model achieves strong performance on all three datasets, with accurate classifications across categories; prediction accuracy is especially high for high-frequency categories such as business, technology, and sports. Overall, the model's performance is balanced across document types and misclassifications are few, indicating that it effectively captures the semantic features of the text.
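For reference, a confusion matrix like those in Fig 6 is built by counting (true label, predicted label) pairs. This small sketch follows the figure's axis convention (rows index the true label, columns the predicted label):

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """Count matrix: m[t][p] = number of documents with true class t
    that were predicted as class p."""
    m = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        m[t][p] += 1
    return m
```

Diagonal entries are correct classifications; off-diagonal mass in a row shows which categories a given class is confused with.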

Fig 6. Confusion matrices of the PBX Model on the 20 Newsgroups, Reuters-21578, and AG News datasets.

(The horizontal and vertical coordinates represent the predicted labels and the true labels respectively. The color blocks represent different document types of categories: 1 - Social Science, 2 - Sports, 3 - Politics, 4 - Business, 5 - Technology, 6 - Health, 7 - Entertainment, 8 - Education.)

https://doi.org/10.1371/journal.pone.0325851.g006

On the 20 Newsgroups dataset, the PBX model is stable across most categories, with particularly high accuracy on business and technology documents; even for harder categories such as social science, it produces reasonably accurate classifications. The results on the AG News dataset are especially strong: the model accurately distinguishes the different news types, with small errors on sports and politics documents, further confirming the strength of the PBX model.

However, despite this strong overall performance, classification accuracy on some small categories (such as education) is noticeably lower. Future work can therefore improve overall accuracy by further optimizing the model, particularly its performance on small-sample categories.

Fig 7 shows the Top-3 classification predictions of the PBX model on the three datasets. The bar chart for each document displays the model's three highest-ranked categories and their confidence levels. For most documents the model assigns high confidence to the Top-1 category, demonstrating strong classification capability. On the 20 Newsgroups and AG News datasets in particular, the top-ranked predictions are highly accurate, indicating that the PBX model effectively captures the semantic features of documents and produces a clear category ranking for each one.

Fig 7. Visualization of the Top-3 classification prediction results of the PBX model.

(For each document (on the vertical axis), there are three predicted categories (bars on the horizontal axis). The numbers represent the labels of the predicted categories, and the length of the bars indicates the confidence level of the model for that category.)

https://doi.org/10.1371/journal.pone.0325851.g007

However, although the Top-1 prediction is confident for most documents, for some documents the confidence levels of the Top-1 through Top-3 predictions are close together. In the Reuters-21578 dataset, for example, several documents show only small gaps among their Top-3 predicted categories, indicating that the model is uncertain about their classification. This is likely related to similarity among categories in the dataset or to ambiguity in the document content.
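Extracting a Top-3 ranking of this kind, and quantifying the near-tie uncertainty just described, is a simple sort over the per-class probability vector. The sketch below is illustrative; the helper names are our own, not part of the PBX implementation.

```python
def top_k_predictions(probs, k=3):
    """Return the k highest-scoring (class_index, confidence) pairs,
    in descending order of confidence."""
    ranked = sorted(enumerate(probs), key=lambda ip: ip[1], reverse=True)
    return ranked[:k]

def top_margin(probs):
    """Gap between the Top-1 and Top-2 confidences; a small margin
    signals an uncertain classification, as seen on some Reuters
    documents in Fig 7."""
    (_, p1), (_, p2) = top_k_predictions(probs, k=2)
    return p1 - p2
```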

Fig 8 shows the prediction error distribution of the PBX model on the three datasets, using histograms with Kernel Density Estimation (KDE) overlays. For each dataset, most errors are concentrated in a small range: the absolute difference between the predicted and true labels is small, indicating that most documents are classified accurately. A minority of large errors remains, showing that some documents are difficult to classify, possibly because of ambiguous content or similarity among categories.

Fig 8. Prediction error distribution and its smoothing estimation based on the PBX model.

https://doi.org/10.1371/journal.pone.0325851.g008

The 20 Newsgroups dataset in particular shows a relatively flat error distribution, with larger prediction errors on some document categories; this is likely related to the diversity and complexity of the dataset, especially for categories with ambiguous boundaries where accurate classification is difficult. On the Reuters-21578 and AG News datasets the errors are smaller, but a small number of misclassified documents remain, suggesting that accuracy on certain categories can still be improved.
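The quantities plotted in Fig 8 can be reproduced with a few lines. Note that the absolute label error treats class indices as ordinal, following the figure; the fixed-bandwidth Gaussian KDE below is a simple stand-in for whatever smoothing the plotting library applied.

```python
import math

def label_errors(y_true, y_pred):
    """Absolute difference between predicted and true label indices,
    the quantity histogrammed in Fig 8."""
    return [abs(t - p) for t, p in zip(y_true, y_pred)]

def gaussian_kde(samples, x, bandwidth=0.5):
    """Smoothed density estimate at point x: average of Gaussian
    kernels centered on each sample."""
    n = len(samples)
    coef = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))
    return coef * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2)
                      for s in samples)
```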

Discussion

The experimental results above show that the model performs strongly across datasets. The combination of PSO optimization, BERT pre-training, and ConvXGB effectively improves classification accuracy and offers clear advantages in hyperparameter tuning. The ablation experiments verify each module's contribution to performance, in particular the important roles of PSO optimization and XGBoost in strengthening the model's classification ability.

Although the model performs well overall, some shortcomings remain. Accuracy on small categories (such as education) is lower, which may be related to insufficient sample sizes and the complexity of the text content. Misclassifications persist in some categories, especially those with ambiguous boundaries, which the model struggles to distinguish.

To address these problems, future research will focus on several improvements. First, we will use data augmentation to expand the training data for small-sample categories: generating synthetic samples or applying text-generation methods (such as augmentation and recombination based on pre-trained models) should help the model better learn the characteristics of these minority categories and improve their classification accuracy. Second, we will introduce domain-specific knowledge, such as domain vocabularies or semantic information, to improve the model's understanding of small-sample categories; this knowledge can come from manually annotated features or from fine-tuning pre-trained models on data from related fields. Third, to prevent overfitting on small-sample categories, we will refine the model's regularization and optimization strategies, for example by adding Dropout and L2 regularization to improve generalization on these categories. Together, these optimizations will further improve the adaptability and practical applicability of the PBX model.
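The two regularizers named above are standard techniques; the minimal sketch below shows their mechanics (an L2-penalized gradient step and inverted dropout) and is purely illustrative, not part of the PBX training code.

```python
import random

def l2_gradient_step(w, grad, lr=0.01, weight_decay=1e-4):
    """One SGD step with L2 regularization: the penalty (lambda/2)*||w||^2
    adds lambda*w to the gradient, shrinking weights toward zero."""
    return [wi - lr * (gi + weight_decay * wi) for wi, gi in zip(w, grad)]

def inverted_dropout(activations, p=0.5, rng=None):
    """Zero each activation with probability p and rescale survivors by
    1/(1-p), so the expected activation is unchanged at test time."""
    rng = rng or random.Random(0)
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]
```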

Conclusion

As natural language processing technology continues to advance, text classification has become a central task in many domains, particularly when dealing with large volumes of documents and complex classification challenges. However, current models often struggle to integrate the strengths of deep learning and traditional machine learning, especially on multi-class and intricate datasets. This paper introduced the PBX model, which combines the pre-trained BERT model, the strong classification capabilities of ConvXGB, and hyperparameter optimization via Particle Swarm Optimization (PSO), yielding significant improvements in text classification performance. Experimental results show that the PBX model outperforms competing models in accuracy, precision, recall, and F1 score on the 20 Newsgroups, Reuters-21578, and AG News datasets, excelling in particular on multi-class text classification tasks.

Although the PBX model performs well in most experiments, shortcomings remain, especially in classification accuracy on small-sample categories and in distinguishing ambiguous categories; when categories overlap in content or are highly similar, classification quality suffers. Future research will focus on several improvements. First, we plan to further optimize accuracy on small-sample categories, exploring data augmentation and the introduction of domain knowledge to improve performance on imbalanced datasets. Second, we will strengthen the model's ability to distinguish ambiguous categories, for instance through attention mechanisms or dedicated Transformer architectures, especially where categories strongly overlap. Finally, we will continue to refine the fusion of BERT and XGBoost and improve the PSO hyperparameter search strategy to increase the model's robustness and generalization across text classification tasks.

References

  1. Minaee S, Kalchbrenner N, Cambria E, Nikzad N, Chenaghlu M, Gao J. Deep learning–based text classification: a comprehensive review. ACM Comput Surv. 2021;54(3):1–40.
  2. Li Q, Peng H, Li J, Xia C, Yang R, Sun L, et al. A survey on text classification: from traditional to deep learning. ACM Trans Intell Syst Technol. 2022;13(2):1–41.
  3. Xing X, Wang B, Ning X, Wang G, Tiwari P. Short-term OD flow prediction for urban rail transit control: a multi-graph spatiotemporal fusion approach. Inf Fusion. 2025;118:102950.
  4. Zhang L, Liu J, Wei Y, An D, Ning X. Self-supervised learning-based multi-source spectral fusion for fruit quality evaluation: a case study in mango fruit ripeness prediction. Inf Fusion. 2025;117:102814.
  5. Chen H, Wu L, Chen J, Lu W, Ding J. A comparative study of automated legal text classification using random forests and deep learning. Inf Process Manag. 2022;59(2):102798.
  6. Wahba Y, Madhavji N, Steinbacher J. A comparison of SVM against pre-trained language models (PLMs) for text classification tasks. In: International Conference on Machine Learning, Optimization, and Data Science. 2022. p. 304–13.
  7. Ying Y, Mursitama TN. Effectiveness of the news text classification test using the naïve Bayes' classification text mining method. J Phys: Conf Ser. 2021;1764(1):012105. IOP Publishing.
  8. Yuvaraj N, Chang V, Gobinathan B, Pinagapani A, Kannan S, Dhiman G. Automatic detection of cyberbullying using multi-feature based artificial intelligence with deep decision tree classification. Comput Electr Eng. 2021;92:107186.
  9. Hassan SU, Ahamed J, Ahmad K. Analytics of machine learning-based algorithms for text classification. Sustain Oper Comput. 2022;3:238–48.
  10. Gasparetto A, Marcuzzo M, Zangari A, Albarelli A. A survey on text classification algorithms: from text to predictions. Inf. 2022;13(2):83.
  11. Hossain MdR, Hoque MM, Siddique N, Sarker IH. Bengali text document categorization based on very deep convolution neural network. Exp Syst Appl. 2021;184:115394.
  12. Liang Y, Li H, Guo B, Yu Z, Zheng X, Samtani S, et al. Fusion of heterogeneous attention mechanisms in multi-view convolutional neural network for text classification. Inf Sci. 2021;548:295–312.
  13. Huang J, Yu X, An D, Ning X, Liu J, Tiwari P. Uniformity and deformation: a benchmark for multi-fish real-time tracking in the farming. Exp Syst Appl. 2025;264:125653.
  14. Lavanya PM, Sasikala E. Deep learning techniques on text classification using Natural language processing (NLP) in social healthcare network: a comprehensive survey. In: 2021 3rd International Conference on Signal Processing and Communication (ICPSC). IEEE; 2021. p. 603–9.
  15. Ray A, Kolekar MH, Balasubramanian R, Hafiane A. Transfer learning enhanced vision-based human activity recognition: a decade-long analysis. Int J Inf Manag Data Insights. 2023;3(1):100142.
  16. Arslan Y, Allix K, Veiber L, Lothritz C, Bissyandé TF, Klein J, et al. A comparison of pre-trained language models for multi-class text classification in the financial domain. In: Companion Proceedings of the Web Conference. 2021. p. 260–8.
  17. Fesseha A, Xiong S, Emiru ED, Diallo M, Dahou A. Text classification based on convolutional neural networks and word embedding for low-resource languages: Tigrinya. Inf. 2021;12(2):52.
  18. Malekzadeh M, Hajibabaee P, Heidari M, Zad S, Uzuner O, Jones JH. Review of graph neural network in text classification. In: 2021 IEEE 12th Annual Ubiquitous Computing, Electronics & Mobile Communication Conference (UEMCON). IEEE; 2021. p. 0084–0091.
  19. Jang J, Kim Y, Choi K, Suh S. Sequential targeting: a continual learning approach for data imbalance in text classification. Exp Syst Appl. 2021;179:115067.
  20. Asudani DS, Nagwani NK, Singh P. Impact of word embedding models on text analytics in deep learning environment: a review. Artif Intell Rev. 2023;56(9):10345–425.
  21. Zen BP, Susanto I, Putriyani K, Sintiya S. Automatic document classification for tempo news articles about covid 19 based on term frequency, inverse document frequency (TF-IDF), and Vector Space Model (VSM). AIP Conf Proc. 2024;2952(1):1–6.
  22. Xiao L, Li Q, Ma Q, Shen J, Yang Y, Li D. Text classification algorithm of tourist attractions subcategories with modified TF-IDF and Word2Vec. PLoS One. 2024;19(10):e0305095. pmid:39423226
  23. Sathya J, Fernandez FMH. Effective automatic cyberbullying detection using a hybrid approach SVM and NLP. In: 2024 International Conference on Advances in Data Engineering and Intelligent Computing Systems (ADICS). 2024. p. 1–6.
  24. Zhang P, Ma Z, Ren Z, Wang H, Zhang C, Wan Q, et al. Design of an automatic classification system for educational reform documents based on naive bayes algorithm. Mathematics. 2024;12(8):1127.
  25. Lokker C, Abdelkader W, Bagheri E, Parrish R, Cotoi C, Navarro T, et al. Boosting efficiency in a clinical literature surveillance system with LightGBM. PLOS Digit Health. 2024;3(9):e0000299. pmid:39312500
  26. Li Q, Wang D, Liu F, Yu J, Jia Z. LightGBM hybrid model based DEM correction for forested areas. PLoS One. 2024;19(10):e0309025. pmid:39374230
  27. Hazim LR, Ata O. Textual authenticity in the AI era: evaluating BERT and RoBERTa with logistic regression and neural networks for text classification. In: 2024 International Symposium on Electronics and Telecommunications (ISETC). IEEE; 2024. p. 1–6.
  28. Kang Q, Wang D, Zhang X, Wei Y, Liang A. Industrial classification algorithm for enterprises based on XLNET model. In: 2024 International Conference on Computational Linguistics and Natural Language Processing (CLNLP). 2024. p. 91–5. https://doi.org/10.1109/clnlp64123.2024.00025
  29. Memiş E, Akarkamçı H, Yeniad M, Rahebi J, Lopez-Guede JM. Comparative study for sentiment analysis of financial tweets with deep learning methods. Appl Sci. 2024;14(2):588.
  30. Zhang H, Yu L, Wang G, Tian S, Yu Z, Li W, et al. Cross-modal knowledge transfer for 3D point clouds via graph offset prediction. Pattern Recogn. 2025;162:111351.
  31. Abiodun EO, Alabdulatif A, Abiodun OI, Alawida M, Alabdulatif A, Alkhawaldeh RS. A systematic review of emerging feature selection optimization methods for optimal text classification: the present state and prospective opportunities. Neural Comput Appl. 2021;33(22):15091–118. pmid:34404964
  32. Kumari AA, Bhagat A, Henge SK. Classification of diabetic retinopathy severity using deep learning techniques on retinal images. Cybernet Syst. 2024:1–25.
  33. Daghrir J, Tlig L, Bouchouicha M, Litaiem N, Zeglaoui F, Sayadi M. Texture characterization fuzzy logic-based model for melanoma diagnosis. Cybernet Syst. 2023:1–19.
  34. Abualigah L, Gandomi AH, Elaziz MA, Hamad HA, Omari M, Alshinwan M, et al. Advances in meta-heuristic optimization algorithms in big data text clustering. Electron. 2021;10(2):101.
  35. Asif M, Nagra AA, Ahmad MB, Masood K. Feature selection empowered by self-inertia weight adaptive particle swarm optimization for text classification. Appl Artif Intell. 2022;36(1):2004345.
  36. DeMatteo C, Jakubowski J, Stazyk K, Randall S, Perrotta S, Zhang R. The headaches of developing a concussion app for youth: balancing clinical goals and technology. Int J E-Health Med Commun. 2024;15(1):1–20.
  37. Alzanin SM, Gumaei A, Haque MA, Muaad AY. An optimized Arabic multilabel text classification approach using genetic algorithm and ensemble learning. Appl Sci. 2023;13(18):10264.
  38. Almayyan WI, AlGhannam BA. Detection of kidney diseases: importance of feature selection and classifiers. IJEHMC. 2024;15(1):1–21.
  39. Chen X, Cong P, Lv S. A long-text classification method of Chinese news based on BERT and CNN. IEEE Access. 2022;10:34046–57.
  40. Lin Y, Meng Y, Sun X, Han Q, Kuang K, Li J, et al. BertGCN: transductive text classification by combining GCN and BERT. arXiv preprint. 2021.
  41. Onan A. Hierarchical graph-based text classification framework with contextual node embedding and BERT-based dynamic fusion. J King Saud Univ Comput Inf Sci. 2023;35(7):101610.
  42. Thongsuwan S, Jaiyen S, Padcharoen A, Agarwal P. ConvXGB: a new deep learning model for classification problems based on CNN and XGBoost. Nucl Eng Technol. 2021;53(2):522–31.
  43. Kumar A, Bhatt BR, Anitha P, Yadav AK, Devi KK, Joshi VC. A new diagnosis using a Parkinson's Disease XGBoost and CNN-based classification model using ML techniques. In: 2022 International Conference on Advanced Computing Technologies and Applications (ICACTA). IEEE; 2022. p. 1–6.
  44. Wicaksono GW, Oktaviana UN, Prasetyo SN, Sari TI, Hidayah NP, Yunus NR, et al. Classification of industrial relations dispute court verdict document with XGBoost and bidirectional LSTM. JOIV: Int J Informatics Vis. 2023;7(3–2):1041–7.
  45. Alhaj YA, Dahou A, Al-Qaness MA, Abualigah L, Abbasi AA, Almaweri NAO. A novel text classification technique using improved particle swarm optimization: a case study of Arabic language. Future Internet. 2022;14(7):194.
  46. Dodda R, Babu AS. Text document clustering using modified particle swarm optimization with k-means model. Int J Artif Intell Tools. 2024;33(01):2350061.
  47. Albishre K, Albathan M, Li Y. Effective 20 newsgroups dataset cleaning. In: 2015 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT). vol. 3. IEEE; 2015. p. 98–101.
  48. Liu X. Text analysis and multi-label classification on articles of 5 topics of the Reuters-21578 dataset: a case study of gaining insights into data and enabling language-aware data products with machine learning.
  49. Zhang X, Zhao J, LeCun Y. Character-level convolutional networks for text classification. Adv Neural Inf Process Syst. 2015;28.
  50. Chen H, Zhang Z, Huang S, Hu J, Ni W, Liu J. TextCNN-based ensemble learning model for Japanese text multi-classification. Comput Electric Eng. 2023;109:108751.
  51. Umer M, Imtiaz Z, Ahmad M, Nappi M, Medaglia C, Choi GS, et al. Impact of convolutional neural network and FastText embedding on text classification. Multim Tools Appl. 2023;82(4):5569–85.
  52. Zhang M, Pang J, Cai J, Huo Y, Yang C, Xiong H. DPCNN-based models for text classification. In: 2023 IEEE 10th International Conference on Cyber Security and Cloud Computing (CSCloud). IEEE; 2023. p. 363–8.
  53. Lubis AR, Prayudani S, Fatmi Y, Nugroho O. Classifying news based on Indonesian news using LightGBM. In: 2022 International Conference on Computer Engineering, Network, and Intelligent Multimedia (CENIM). IEEE; 2022. p. 162–6.
  54. Guo Z, Zhu L, Han L. Research on short text classification based on RoBERTa-TextRCNN. In: 2021 International Conference on Computer Information Science and Artificial Intelligence (CISAI). IEEE; 2021. p. 845–9.
  55. Arabadzhieva-Kalcheva N, Kovachev I. Comparison of BERT and XLNet accuracy with classical methods and algorithms in text classification. In: 2021 International Conference on Biomedical Innovations and Applications (BIA). vol. 1. IEEE; 2022. p. 74–6.