
Machine learning and deep learning-based approach to categorize Bengali comments on social networks using fused dataset

Abstract

With the advancement of the contemporary web and the rapid adoption of social media platforms such as YouTube, Twitter, and Facebook, dealing with certain highly personal problems has become much easier. At the same time, the far-reaching consequences of online harassment require immediate preventative steps that safeguard psychological wellness and scholarly achievement through detection at an early stage. This paper aims to help eliminate online harassment and create a criticism-free online environment. In the paper, we use a variety of attributes to evaluate a large number of Bengali comments. Cleansed data is fed to machine learning (ML) methods after natural language processing, using term frequency-inverse document frequency (TF-IDF) with a count vectorizer. In addition, we used tokenization with padding to feed our deep learning (DL) models. Using mathematical visualization and natural language processing, online bullying can be detected quickly. Multi-layer Perceptron (MLP), K-Nearest Neighbors (K-NN), Extreme Gradient Boosting (XGBoost), Adaptive Boosting (AdaBoost), Logistic Regression (LR), Random Forest (RF), Bagging, Stochastic Gradient Descent (SGD), Voting, and Stacking classifiers are employed in this research. We expanded our investigation to include DL frameworks: Deep Neural Networks (DNN), Convolutional Neural Networks (CNN), Convolutional-Long Short-Term Memory (C-LSTM), and Bidirectional Long Short-Term Memory (BiLSTM) are all implemented. Since a large amount of data is required to recognize harassing behavior precisely, we combined two datasets, producing 94,000 Bengali comments from different points of view.
After evaluating the ML and DL models, we found that a hybrid model (MLP+SGD+LR) performed most effectively, with 99.34% accuracy, 99.34% precision, 99.33% recall, and a 99.34% F1 score on the multi-label task. For binary classification, the model reached 99.41% accuracy.

Introduction

Social media platforms are vibrant spaces for communicating potential, inspiring ideas, stories, and important information. Due to the development of high-speed internet and communications technologies, a large number of people from all walks of life have joined social networks (SN) and share their opinions about a wide range of topics [1, 2]. Nowadays, a significant issue on social media is abuse or threats, known as cyberbullying. Social media platforms act as a robust medium for conversation [3] and also provide various facilities and encourage ideas. As of July 2023, there were 4.88 billion active social networking user identities worldwide, representing 60.6% of the world’s population. There are 53 million Bangladeshi users of Instagram, Messenger, and Facebook in Meta’s global advertising audience. At the beginning of 2023, 38.9 percent of the Bangladeshi population used the internet, amounting to 66.94 million users. As of January 2023, 26.0 percent of Bangladesh’s population, or 44.70 million people, were active on social media. “Fig 1” shows the number of users (in millions) of the social networking sites available in Bangladesh: Facebook (43.25), YouTube (34.4), Instagram (4.45), Messenger (20.35), LinkedIn (5.9), and Twitter (1.05) [4]. Because Bengali is the native tongue of Bangladesh, it is reasonable to expect that a sizable proportion of these people use the Bengali language on social networking platforms. There are about 228 million native speakers of Bengali globally, mostly in the Indian states of Tripura, Assam, and West Bengal, but also in significant numbers in the UK, the USA, and the Arabian Peninsula; 160 million of them are Bangladeshis [5].

Fig 1. Social media user distribution in Bangladesh (2023).

https://doi.org/10.1371/journal.pone.0308862.g001

Businesses, governmental organizations, and event planners can fully comprehend people’s sentiments and perceptions by analyzing the data obtained via SN. However, because of the exponential rise in SN users, the data abounds with a vast quantity of remarks and posts, making it difficult for people to precisely extract pertinent information from the texts [6]. To address human limitations, hidden and in-sight information must be extracted automatically from online-generated text. As social media platforms keep growing, the frequency of cyberbullying is increasing [7]. People are regularly harassed by strangers and unauthorized users on social media platforms [8]. The nation has seen an enormous rise in internet access in the past few years, coupled with a growing proclivity to use online platforms, such as social networking sites, for educational activities. As a result, the number of victims of cyberbullying has risen, leading to mental health problems among learners. According to recent research, 73.71% of cyberbullying victims are women [9]. According to the "Bangladesh Cybercrime Trend 2023" report published by the Cyber Crime Awareness Foundation (CAF), 52.21% of reported online incidents were related to cyberbullying and abusive social media posts, with undergraduates making up nearly all of those affected [10]. Cyberbullying harms individuals’ mental health, resulting in higher instances of anxiety and sadness. Cyberbullying and cyberstalking affect people psychologically. Abusers use social media platforms’ anonymity to their advantage, enabling their vicious behavior to go unpunished. Additionally, as harassment increases in frequency over time, things get worse [11].

Identifying online abusers and holding them responsible is more difficult than with a traditional bully. Perpetrators employ phony IDs, names, and locations to harass people using technological devices and services (social websites, phones, electronic mail, and more), and they may also utilize encrypted networks to conceal their identity and location from others. Furthermore, because cybercrime happens via online tools and technology, it can reach vast audiences quickly, making it more harmful than face-to-face incidents. Additionally, bullying is a never-ending form of humiliation that leaves victims feeling helpless. That is why we require an automated solution to assist in identifying and preventing the majority of bullying.

Plenty of investigations have been conducted to recognize insulting posts and comments in English and various other languages [12, 13]. However, only a few investigations concentrate on harassment identification in the Bengali language [5, 11]. As a result, this research provides an opportunity to contribute to the identification of Bengali internet harassment and to protect users from this kind of harassment. There are numerous methods for identifying cyberbullying in Bengali, but their performance in detecting bullying is insufficient for shielding people from Bengali cyberbullying. Consequently, there is currently a pressing need for a stronger approach geared toward detecting a variety of abusive Bengali written material. Techniques that utilize ML as well as DL may prove highly successful in recognizing and eliminating abusive Bengali material on social networking sites. To identify cyberbullying, it is essential to investigate and develop more sophisticated, fast, and flexible detection techniques. By identifying the shortcomings of current methods, we work toward innovative approaches that can keep pace with the ever-changing nature of cyberbullying.

Our paper is organized into several components. In the first section, we highlight the related work in the field of cyberbullying detection and offensive-text classification. The composition of both datasets is described in the methodology section. We cleaned the data according to a variety of criteria, including eliminating special characters, multiple spaces, punctuation marks, non-Bengali characters, and numbers, among others. Following data preprocessing, we extract features for the ML classifiers using a TF-IDF transformation with Count-Vectorizer; tokenization with padding is used to feed our DL models. We establish a vocabulary of over 20,000 terms for ML feature extraction, which suits our large combined dataset. We describe several ML and DL algorithms thoroughly and present each with its corresponding mathematical formulation. For each strategy, we discuss the initial stages, classifier selection and setup, and the final implementation. The findings of all approaches are addressed and contrasted to identify the problems and constraints that were discovered. The conclusion of our paper summarizes and evaluates what our research means for this field as well as for the larger community in general. The main outcomes of the present study are as follows:

  • A hybrid machine-learning method for detecting online harassment.
  • Cleansed data according to several criteria, such as removing punctuation, numerous spaces, particular characters, non-Bengali text, and numbers.
  • For the training and testing of the framework, two datasets are combined.
  • Count-Vectorizer, along with TF-IDF for ML and tokenization with padding for DL models are used to extract the feature.
  • For efficient undersampling, the Instance Hardness Threshold (IHT) was employed.
  • K-Fold cross-validation was used to evaluate the model robustly.
  • Tested for feature significance and model efficacy using ANOVA and Chi2 tests.
  • Examined the performance metrics between the ML, DL, and hybrid models to identify the most successful model.
  • Created a web application that allows users to interact with the most optimal model.
  • Suggested using a mixed machine-learning approach to identify online harassment.

Related work

Multiple research projects have been carried out in the area of cyberbullying detection, with a particular focus on the Bengali language. The results of these studies show that efficient ways to combat cyberbullying on social media platforms are needed, which serves as a compelling argument for the current research. The relevant works on multi-class Sentiment Analysis (SA) in Bengali and English are covered in this section. We concentrated on important features of SA papers, including dataset size, number of classes, employed techniques, and outcomes. Different linguistic perspectives have emerged in response to multi-class SA in recent years.

With the introduction of a CNN and LSTM-based classifier that achieved 85.8% accuracy as well as 86% F1 scores upon a set of 42,036 comments on Facebook, Haque et al. [5] addressed the difficulties associated with multi-class analysis of sentiment in Bengali social media comments. By demonstrating its effectiveness in real-world sentiment detection through a web application integration, the suggested model outperforms baseline techniques.

Utilizing an encoder-decoder-based LSTM network, Das et al. [11] proposed a different approach for identifying instances of hateful language in Bengali. To address the multi-class hateful-language problem, they employed TF-IDF, word2vec, and a 1D CNN model in a network using LSTM. An encoder-decoder model, a popular NLP tool, was presented in the paper as a way to categorize user feedback submitted in Bengali on Facebook profiles.

Eshan et al. [12] examined a variety of machine learning algorithms, including Random Forest, multinomial Naïve Bayes, and Support Vector Machines with linear, Radial Basis Function (RBF), polynomial, and sigmoid kernels. They contrasted these algorithms using unigram, bigram, and trigram features extracted with CountVectorizer and TfidfVectorizer, and found that the SVM with a linear kernel produced the most effective outcomes.

A Gated Recurrent Unit (GRU) model was developed by Ishmam et al. [13] to classify feedback from users on social media sites. A total of 5,126 pieces of Bengali feedback were gathered for the research and divided into five categories: political discourse, spiritual remarks, encouragement, insults, and discrimination toward race and religion. The GRU model recognized hateful speech with an accuracy of 70.10%.

A binary and multilabel classifier was presented by Ahmed et al. [14] to recognize abusive statements on Facebook pages. A total of 44,001 user comments from well-known public Facebook pages were examined for the study and divided into classes such as non-bully, sexual, threat, troll, and religious. Their NN + Ensemble method produced a multilabel classification accuracy of 85% and a binary classification accuracy of 87.91%.

A model for identifying cyberbullying in texts written in Bangla and Romanized was developed by Ahmed et al. [15] using ML and DL techniques. Three social media datasets were produced by their research: one of them for Bangla, another for Romanized Bangla, and one combined dataset. In the combined dataset, the ML algorithm Multinomial Naive Bayes (MNB) achieved an accuracy rate of 80%.

A Bengali-language method for identifying cyberbullying on social media was proposed by Emon et al. [16]. They used 44,001 Bengali comments from Facebook to test different transformer models, such as Bengali BERT, Bengali DistilBERT, and XLM-RoBERTa. Out of all the models, the XLM-RoBERTa model had the highest accuracy rate (85%) and F1 score (86%).

A technique for using ML algorithms to recognize abusive language in Bangla was proposed by Mahmud et al. [17]. Using logistic regression (LR) and annotated translated Bengali corpora, they were able to identify bullying in Bengali with a 97% accuracy rate.

63,000 Bengali Facebook comments from various celebrity pages were compiled by Khan et al. [18] in order to group fans’ sentiments toward the celebrity into five categories: happy, excited, upset, shocked, and content. The feature extractor they used to train SVM, NB, RF, KNN, and NN was TF-IDF. They used the SVM classifier to predict a person’s attitude toward a celebrity with a 62% accuracy rate. Even though they employed a sizable dataset for their investigation, the dataset’s class imbalance issue resulted in a low accuracy score.

The HS-BAN Bengali hate speech dataset, which has more than 50,000 labeled statements, was made available by Romim et al. [19]. They investigated linguistic features along with artificial neural network-based methods to develop a benchmark hateful-speech detection system for Bengali. Their comparisons demonstrated that word embeddings trained on informal texts performed better than those trained on formal texts, with a Bi-LSTM model using FastText informal word embeddings achieving an F1 score of 86.78%.

Utilizing machine learning techniques such as NB, J48, SVM, and KNN, Akhter et al. [20] completed an analogous binary categorization task to identify Bengali cyberbullying comments. A collection of 2,400 Bengali comments labeled as bullying or not was used in their tests. They used TF-IDF as a feature extractor to train the SVM classification algorithm and achieved 97% accuracy. Nevertheless, multi-class SA was not present in the experiment, which was restricted to binary categorization.

In order to detect cyberbullying in Bengali on social media, Akhter et al. [21] developed a strong hybrid machine-learning model that achieved high accuracy rates of 98.57% and 98.82% in binary as well as multilabel identification. Effective text preparation, feature extraction with TFIDF, and dataset normalization with the instance hardness threshold are all part of their methodology.

The previous research on hate recognition in multiple languages, including Bengali, abusive content identification, and cyberbullying detection is summarized here. These studies have provided insightful information, but their findings still have shortcomings. A few demonstrated lower accuracy rates, underscoring the need for improvement, while others limited generalizability by concentrating on particular languages and text types. It was common to rely on certain ML algorithms and approaches, which necessitates investigating a larger variety of strategies. Inadequate rationale for classification decisions and a restricted number of dataset categories were also noted. Our research intends to create a strong hybrid machine-learning approach that covers multiple harmful content categories in Bengali to close these gaps. We investigate different ML algorithms and DL approaches, offer thorough explanations for categorizations, and further the development of techniques for detecting and preventing cyberbullying. “Table 1” summarizes prior research and its limitations. To overcome these limitations, we employed 8 different ML algorithms and 4 distinct DL models, as well as ensemble approaches employing voting classifiers along with stacking, to determine the most effective result from the best model. Our method processes huge quantities of data, assisting in the detection of bullying, and it covers five classes, each with distinct characteristics.

Materials and methods

The following section outlines our suggested approach and the various ML and DL algorithms implemented in the framework. First, we describe how the proposed approach operates; after that, the ML and DL algorithms are briefly reviewed.

“Fig 2” depicts our proposed model’s workflow. The proposed model is divided into eight major sections, the first of which is the collection of Bengali bullying text data. We began by gathering information from two openly accessible databases of Bengali social media comments. We then combined them into a single dataset separated into five distinct groups: troll, religious, sexual, threat, and not bullying. The text was then preprocessed to remove websites, punctuation, digits, emojis, special signs, and Bengali stop words. To transform the text data into numerical form, we use a count vectorizer with a TF-IDF transformer to feed our ML classifiers, and tokenization with padding to feed our DL models. We use IHT to resample the dataset against potential class imbalance during training and testing. To split the data, we used k-fold cross-validation, and we applied ML classifiers such as Stacking, Voting, SGD, LR, RF, MLP, Bagging, XGBoost, AdaBoost, and K-NN. For comparative analysis, DL models such as BiGRU, CNN, CLSTM, DNN, and RNN were also used. We assessed performance using traditional evaluation metrics such as accuracy, precision, recall, and F1 score. Finally, we developed a web application based on the best-performing models.
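As a minimal sketch of the evaluation loop in this workflow (5-fold cross-validation scored by accuracy), using synthetic features in place of the vectorized comments; the estimator choice here is illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the vectorized Bengali comments.
X, y = make_classification(n_samples=300, n_informative=5, random_state=0)

# 5-fold stratified cross-validation; each fold is scored by accuracy.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=cv, scoring="accuracy")
print(round(scores.mean(), 2))
```

The same loop applies unchanged to any of the classifiers listed above; only the estimator passed to `cross_val_score` changes.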

Dataset collection

The lack of thoroughly assessed, publicly available Bengali databases has been a significant disadvantage for multi-class Bengali sentiment analysis. The majority of existing investigations depend on privately collected datasets, often focusing on different problems and including only a small amount of Bengali text. Initially, we gathered two datasets from Kaggle [22, 23] for multi-class cyberbullying analysis. Overcoming the constraints of previous research requires analyzing an enormous quantity of data, which is why we integrate the two datasets in this work. We merged them into a single MS Excel file featuring approximately 94,000 samples, each labeled as Not Bully, Troll, Sexual, Religious, or Threat. After that, we split the data into 80% for training and 20% for testing.
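The fuse-then-split step can be sketched as follows; the tiny inline frames and the column names "comment"/"label" are illustrative assumptions standing in for the two Kaggle CSVs:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-ins for the two Kaggle datasets (columns are assumed names).
df_a = pd.DataFrame({"comment": ["ভালো ভিডিও", "বাজে মন্তব্য", "ধন্যবাদ ভাই"],
                     "label": ["Not Bully", "Troll", "Not Bully"]})
df_b = pd.DataFrame({"comment": ["খারাপ কথা", "চমৎকার"],
                     "label": ["Threat", "Not Bully"]})

# Fuse the two sources into one dataset, then hold out 20% for testing.
merged = pd.concat([df_a, df_b], ignore_index=True)
train, test = train_test_split(merged, test_size=0.2, random_state=42)
print(len(merged), len(train), len(test))
```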

“Table 2” presents an in-depth breakdown of content across the categories "Not Bully," "Troll," "Sexual," "Religious," and "Threat." In terms of total comments, the "Not Bully" class has the most with 33,579, followed by "Troll" with 23,193, "Sexual" with 18,026, "Religious" with 15,424, and lastly "Threat" with 3,778. In terms of total words, the "Not Bully" category has 390,688, "Troll" 317,191, "Sexual" 345,339, "Religious" 392,196, and "Threat" 60,269. “Fig 3” represents the percentage of comments in each class. This knowledge is essential for comprehending the makeup and features of the various content classes, laying the groundwork for further analysis, specifically in the realm of natural language processing (NLP) and content moderation. “Fig 4” shows the word cloud of our combined dataset.

The labels in the dataset are described below:

  • Not Bully: This class represents comments that are not related to bullying.
  • Troll: This class represents comments that have been identified as troll-related.
  • Sexual: This class represents comments that have been recognized to contain sexual content.
  • Religious: This class represents comments that have been discovered to contain religious content.
  • Threat: This class represents comments that have been recognized as containing threats.

Data preprocessing

Data preprocessing is the procedure of preparing raw data so that it is suitable for fitting a model; it constitutes one of the most essential steps in developing a framework. During preliminary cleaning we handle empty strings and remove every web link, comma, and special symbol from the content. The content is subsequently converted to tokens, and every Bengali stop word and overly common word is removed. Following that, we use lemmatization to keep each word's base form. In addition, we handle Banglish sentences, which mix Bengali and English, by first detecting the Bengali portions, then separating the Bengali and English segments, transliterating the English into Bengali words, and finally merging the resulting tokens. This cleansing helps to reduce input size, preserve important details, and prepare the data for model building with little difficulty [24]. “Fig 5” shows the density of comment lengths before and after preprocessing the dataset.
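The core cleaning steps (dropping URLs, digits, punctuation, non-Bengali characters, and stop words) can be sketched as below; the stop-word set is a tiny illustrative subset, and the lemmatization and Banglish-transliteration steps are omitted for brevity:

```python
import re

# Tiny illustrative subset of a Bengali stop-word list.
BENGALI_STOPWORDS = {"এবং", "কিন্তু", "যে"}

def clean_comment(text: str) -> str:
    """Remove URLs, then everything outside the Bengali Unicode block
    (U+0980-U+09FF), then stop words."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)   # web links
    text = re.sub(r"[^\u0980-\u09FF\s]", " ", text)      # digits, punctuation, Latin
    tokens = [t for t in text.split() if t not in BENGALI_STOPWORDS]
    return " ".join(tokens)

print(clean_comment("ভালো এবং খারাপ 123 http://x.com !!"))
```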

Fig 5. Density plot of comments length before and after preprocessing the dataset.

https://doi.org/10.1371/journal.pone.0308862.g005

From this plot, we see that before preprocessing, sentences contained up to 210 words; most sentences contained 0 to 30 words, with a length of approximately 10 words occurring most frequently. After applying our preprocessing approach, the maximum sentence length decreased to 150 words, a length of approximately 8 words became the most frequent, and the density increased compared to before preprocessing. “Fig 6” shows, column by column, the original comments before preprocessing and the processed comments after preprocessing, along with translations of the processed comments from the Bangla dataset.

Fig 6. Original and processed text of Bengali text dataset.

https://doi.org/10.1371/journal.pone.0308862.g006

Feature extractions techniques

Feature extraction is a dimensionality-reduction technique employed in ML that represents high-dimensional input through a smaller set of low-dimensional attributes. The effectiveness of an ML model may be substantially enhanced, and the amount of computation reduced, once pertinent characteristics are extracted [25, 26]. In text classification, every ML and DL classifier treats words as numbers rather than as natural language content. Before employing any algorithm, natural language terms must be transformed into vector form, enabling text sentences to be represented as vectors. TF-IDF, CV, and tokenization methods can be employed to transform each vocabulary item into a feature vector. These methods capture the representation of sentences in an established vector space as real-valued vectors [27]. The distributed vector format implies that words can be effectively represented within a low-dimensional feature space, which proved beneficial in prior multi-class categorization investigations. In the present research, feature extraction was employed in two steps: ML classifiers were trained using TF-IDF with CV, and DL models were trained using tokenization and padding.

Feature extractions for machine learning classifiers.

We combine the Count Vectorizer (CV) with TF-IDF [28] to create a powerful text-processing pipeline. CV tokenizes the data, producing a matrix of word counts. This matrix is then transformed by the TF-IDF transformer, which weights terms based on their frequency and rarity across the data. The resulting matrix gives a rich representation of the text data, combining both term frequencies and term importance. This hybrid representation improves ML classifiers for tasks such as cyberbullying analysis and classification, as well as other NLP applications, by giving a better account of term importance in documents. “Fig 7” shows an example of the CV and TF-IDF transformation representations of a sample dataset; the translated version of the Bangla text is in brackets.
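A minimal sketch of the CV-then-TF-IDF pipeline on toy comments; since sklearn's default token pattern splits Bengali vowel signs away from their consonants, whitespace tokenization is used here instead:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

docs = ["ভালো ভিডিও", "খারাপ ভিডিও", "ভালো মানুষ"]  # toy comments

# analyzer=str.split: whitespace tokenization, safe for Bengali script.
cv = CountVectorizer(analyzer=str.split)
counts = cv.fit_transform(docs)                      # term-count matrix
weighted = TfidfTransformer().fit_transform(counts)  # reweight by frequency/rarity
print(counts.shape, weighted.shape)
```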

Fig 7. Example of CV & TF-IDF transformation representation of a comment.

https://doi.org/10.1371/journal.pone.0308862.g007

The technique is expressed as Eq (1), where Cv denotes applying the Count Vectorizer to tokenize and count words (yielding a count matrix), Tf denotes the TF-IDF transformation applied to the matrix obtained from CV, and Hy is the final matrix that pairs term frequencies with TF-IDF importance weighting:

(1) Hy = Tf(Cv(X)), where X is the collection of raw comments.

Tokenization and padding for deep learning models.

Tokenization plays a vital role when analyzing raw text for cyberbullying tasks. Tokenization is a method of separating written material into meaningful tokens, which are fragments of the original content. We tokenized the documents at the word level and assigned an integer symbol to every word. The dataset includes phrases of varied lengths, which can affect the accuracy of the categorization techniques. As a result, we used padding to standardize the length of every phrase: extra pad symbols are appended to short phrases, and phrases that exceed the maximum allowed length are truncated [29]. “Fig 8” shows an example of the tokenization and padding representation of a sample comment, with the translated version of the comment in brackets below.
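The steps above can be sketched in a few lines; in practice a Keras-style tokenizer and `pad_sequences` would typically be used, but this dependency-free stand-in shows the same id-assignment, truncation, and zero-padding:

```python
def fit_tokenizer(texts):
    """Assign an integer id to every word; ids start at 1 so 0 can pad."""
    vocab = {}
    for t in texts:
        for w in t.split():
            vocab.setdefault(w, len(vocab) + 1)
    return vocab

def encode_and_pad(texts, vocab, maxlen):
    seqs = []
    for t in texts:
        ids = [vocab.get(w, 0) for w in t.split()][:maxlen]  # truncate long comments
        seqs.append(ids + [0] * (maxlen - len(ids)))          # zero-pad short ones
    return seqs

texts = ["ভালো ভিডিও", "খুব ভালো মানুষ ভাই"]
vocab = fit_tokenizer(texts)
print(encode_and_pad(texts, vocab, maxlen=3))  # → [[1, 2, 0], [3, 1, 4]]
```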

Fig 8. Example of tokenization and padding representation of a comment.

https://doi.org/10.1371/journal.pone.0308862.g008

Resampling the data

Resampling entails repeatedly drawing samples from the original data [30]. Because we had a large amount of textual data, we used an undersampling approach, which allowed us to create a balanced dataset that best reflects reality for detecting Bengali bullying. This way of leveling imbalanced data retains all of the samples in the minority class and reduces the share of the majority class. The undersampling is performed with the Instance Hardness Threshold (IHT) method [31], which filters out the hardest majority-class instances to shrink the dominant class and reduce class imbalance. As the learning algorithm inside IHT we used logistic regression with cross-validation; logistic regression is easy to employ, to understand, and to train.
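The imbalanced-learn library provides `InstanceHardnessThreshold` for this; as a dependency-light sketch of the same idea, instance hardness can be computed directly as one minus the cross-validated probability of the true class under logistic regression, keeping only the easiest majority samples:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Imbalanced toy data standing in for the vectorized comments.
X, y = make_classification(n_samples=400, weights=[0.8, 0.2], random_state=42)

# Hardness = 1 - P(true class) under cross-validated logistic regression.
proba = cross_val_predict(LogisticRegression(max_iter=1000), X, y,
                          cv=5, method="predict_proba")
hardness = 1.0 - proba[np.arange(len(y)), y]

# Keep all minority samples; keep only the easiest majority samples
# until the two classes are balanced.
minority = np.bincount(y).argmin()
n_keep = int((y == minority).sum())
maj_idx = np.where(y != minority)[0]
keep_maj = maj_idx[np.argsort(hardness[maj_idx])[:n_keep]]
idx = np.sort(np.concatenate([np.where(y == minority)[0], keep_maj]))
X_res, y_res = X[idx], y[idx]
print(len(y), len(y_res))
```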

Machine learning classifiers

In order to create a framework to recognize Bengali bullying, this research measures performance using ten classifiers: Stacking, Voting, SGD, LR, RF, MLP, Bagging, XGBoost, AdaBoost, and K-NN. A variety of performance indicators have been used to evaluate each classifier. The different ML techniques used in this study for prediction and categorization are outlined in this section.

Stacking ensemble.

To improve the accuracy of predictions, ensemble learning is a combined ML method that draws on the forecasting capabilities of several base algorithms [32]. Three categories of methods are used in ensemble development: bagging, boosting, and stacking. In this research, several ML algorithms are first employed within a single-level stacking arrangement. Lastly, the finalized estimates are returned by fitting a LogitBoost meta-learner to the forecasts from each base classifier. “Fig 9” shows the arrangement of the stacking ensemble.
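A stacking sketch with sklearn on synthetic data; logistic regression stands in as the meta-learner here (sklearn ships no LogitBoost), and the two base learners are illustrative choices from the paper's classifier list:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Level-0 base learners; the final estimator combines their forecasts.
stack = StackingClassifier(
    estimators=[("sgd", SGDClassifier(random_state=0)),
                ("rf", RandomForestClassifier(random_state=0))],
    final_estimator=LogisticRegression(max_iter=1000))
stack.fit(X_tr, y_tr)
print(round(stack.score(X_te, y_te), 2))
```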

Voting classifier.

A voting classifier is an ensemble-learning method that employs several separate classification algorithms and aggregates their forecasts, potentially outperforming any single algorithm [33]. With hard voting, the ensemble predicts the class label that receives the majority of the base classifiers' votes. With soft voting, the class probabilities of the base classifiers are averaged, and the class with the highest mean probability is predicted [34]. “Fig 10” shows the arrangement of the voting ensemble model.
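The hard/soft distinction can be sketched with sklearn's `VotingClassifier`; the three base learners below are illustrative picks from the paper's classifier list:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=400, random_state=0)
estimators = [("lr", LogisticRegression(max_iter=1000)),
              ("rf", RandomForestClassifier(random_state=0)),
              ("knn", KNeighborsClassifier())]

hard = VotingClassifier(estimators, voting="hard").fit(X, y)  # majority of labels
soft = VotingClassifier(estimators, voting="soft").fit(X, y)  # argmax of mean probabilities
print(round(hard.score(X, y), 2), round(soft.score(X, y), 2))
```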

Stochastic gradient descent.

The SGD method iteratively optimizes an objective function with suitable smoothness properties [35]. It determines the size of each step from the gradient of the objective. Because it substitutes an estimate computed from an arbitrarily chosen portion of the data for the actual gradient calculated over the entire training set, it is efficient and can be seen as a stochastic approximation of gradient descent [36]. Rather than precisely calculating the gradient of G_n(j_s), at every step it estimates this value using a single randomly drawn instance k_r:

(2) j_(s+1) = j_s − η ∇G_(k_r)(j_s)
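The single-example gradient step can be sketched on a toy least-squares problem (all data synthetic; the step size and iteration count are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + rng.normal(scale=0.01, size=200)

# SGD: at each step, one random example k stands in for the full gradient.
w, eta = np.zeros(3), 0.05
for _ in range(2000):
    k = rng.integers(len(X))
    grad = (X[k] @ w - y[k]) * X[k]   # single-example gradient estimate
    w -= eta * grad
print(np.round(w, 2))                  # close to true_w
```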

Logistic regression.

The LR objective is to use the training and evaluation data as a foundation for classifying comments into multiple categories. It performs admirably when applied to new data [37]. Its numerical formula is:

(3) K = 1 / (1 + e^−(b_0 + b_1x_1 + … + b_nx_n))

where K denotes the dependent value and the remaining quantities are the parameters of the expression.

Multi-layer perceptron.

In each set of inputs, a neural network known as the MLP creates a unique vector in order to function. Included in the MLP multilayered perceptron structure are an informational stage, an output, and a hidden stage [38]. Eq (4) represents the logical modification of the initial information by every single layer of the Multi-Layer Process (MLP), implementing the stimulation operation that every single layer as well as utilizing weights and biases, as defined by Pacifici et al. [37]. The vector that is being input is S, the bias vector is m(i), the value of the weight matrix for layer i is denoted by T(i), and the activated operation is represented by Ψ.

h(i) = Ψ(T(i) · S + m(i)) (4)
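The per-layer transform of Eq (4) can be sketched directly in NumPy (the weights below are random, for illustration only):

```python
import numpy as np

def mlp_layer(S, T, m, act=np.tanh):
    # One MLP layer: weight matrix T, bias vector m, activation Ψ (here tanh)
    return act(T @ S + m)

rng = np.random.default_rng(0)
S = rng.normal(size=4)                          # input vector
T1, m1 = rng.normal(size=(8, 4)), np.zeros(8)   # hidden-layer parameters
T2, m2 = rng.normal(size=(3, 8)), np.zeros(3)   # output-layer parameters
hidden = mlp_layer(S, T1, m1)
output = mlp_layer(hidden, T2, m2)
```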

K-nearest neighbor.

This method is employed for both regression and classification tasks. In text models such as [39], the data set's documents are represented as feature vectors with class labels. Eq (5) describes the classification of a newly arrived comment according to its feature vector pi and its resemblance to the feature vectors pj of its K nearest neighbors, computed with a vector similarity measure. The indicator function in the equation ensures that each neighbor's contribution is counted only when pi does not already appear in the set Zm.

(5)(6)
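A minimal scikit-learn K-NN sketch (synthetic data for illustration); a cosine metric is a common similarity choice for sparse TF-IDF text vectors:

```python
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, random_state=3)
# A new point takes the majority label of its K most similar neighbors.
knn = KNeighborsClassifier(n_neighbors=5, metric="cosine")
knn.fit(X, y)
pred = knn.predict(X[:10])
```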

Random Forest.

A random forest is an ensemble of decision trees (DTs) whose branches are grown during training [40]. The algorithm combines predictions from multiple trees in an ensemble approach: the RF assigns the class that receives the most votes across all the trees in the forest. The mathematical formula for the RF classification algorithm is: (7)
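A minimal scikit-learn random forest sketch (synthetic data; hyperparameters illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, random_state=4)
# Each tree votes for a class; the forest returns the majority vote.
rf = RandomForestClassifier(n_estimators=100, random_state=4)
rf.fit(X, y)
acc = rf.score(X, y)
```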

Bagging classifier.

Bagging, short for bootstrap aggregating, was developed in 1996 by Leo Breiman [41]. Consider the case of classification into two categories, X: Z → {−1, 1}, with respect to a particular training collection Z. The bagging approach generates a sequence of classifiers Xn, n = 1, …, N in response to modifications of the training data. Eq (8) gives the combined prediction as the weighted average of the separate classifier predictions. The weights kn, n = 1, …, N are selected to boost the influence of the more accurate classifiers on the final estimate.

(8)
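A minimal scikit-learn bagging sketch (synthetic data; by default the base learner is a decision tree):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

X, y = make_classification(n_samples=300, random_state=5)
# Each base learner is fit on a bootstrap resample of the training data;
# the individual predictions are then aggregated.
bag = BaggingClassifier(n_estimators=25, random_state=5)
bag.fit(X, y)
acc = bag.score(X, y)
```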

Extreme Gradient Boosting.

XGBoost, an advanced tree-based approach, applies the gradient-boosting idea [42]. XGBoost employs a more regularized model formalization than previous gradient-boosting techniques to prevent overfitting [43]. In Eq (9), S(r) represents the XGBoost classifier's final prediction, P is the total number of trees (boosting stages), Γp denotes the step size (learning rate) at iteration p, and hp(r) denotes the p-th weak classifier (tree) in the ensemble.

(9)
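The paper uses the XGBoost library itself; as a dependency-light sketch of the same gradient-boosting idea, scikit-learn's GradientBoostingClassifier exposes the corresponding quantities (`n_estimators` for P, `learning_rate` for Γp):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=300, random_state=6)
gbt = GradientBoostingClassifier(n_estimators=100,   # P boosting stages
                                 learning_rate=0.1,  # step size at each stage
                                 random_state=6)
gbt.fit(X, y)
acc = gbt.score(X, y)
```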

Adaptive boosting.

This technique boosts an ensemble of weak learners, each typically a simple rule that thresholds a single feature [44]. In Eq (10), ti(r) denotes the weak classifier at stage i, and si denotes the corresponding weight assigned to it.

(10)
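A minimal scikit-learn AdaBoost sketch (synthetic data; by default each weak learner is a depth-1 decision stump):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=300, random_state=7)
# Each stage fits a weak learner and assigns it a weight; the weighted
# votes of all stages form the final classifier.
ada = AdaBoostClassifier(n_estimators=50, random_state=7)
ada.fit(X, y)
acc = ada.score(X, y)
```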

Deep learning models

In order to create a framework to recognize Bengali bullying, this research also measures performance using five DL models: BiGRU, CNN, CLSTM, DNN, and RNN. A variety of performance indicators have been used to evaluate each model. This section outlines the DL techniques used in this study for the prediction and categorization strategies.

Deep neural network.

A DNN is a deep, multi-layer extension of the basic neural network. Its complex and well-structured hierarchy underlies this model's adaptability and demonstrates its usefulness in a variety of applications [45]. In essence, the DNN is a cutting-edge approach that builds on representation-learning research.

Recurrent neural network.

Recurrent neural networks (RNNs) [46] are neural networks in which the connections among neurons form a directed cycle. The main function of RNNs is to process sequential data using the internal memory that these directed cycles provide. In Eq (11), ht is the hidden state at time t, xt is the input at time t, Whh represents the recurrent weights, Wxh represents the input weights, bh is the bias term, and f is the activation function (commonly sigmoid or tanh).

h_t = f(W_hh · h_(t−1) + W_xh · x_t + b_h) (11)
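The recurrence of Eq (11) can be sketched in NumPy (random weights and a toy input sequence, for illustration only):

```python
import numpy as np

def rnn_step(h_prev, x_t, W_hh, W_xh, b_h, f=np.tanh):
    # Eq (11): new hidden state from previous state and current input
    return f(W_hh @ h_prev + W_xh @ x_t + b_h)

rng = np.random.default_rng(0)
n_hidden, n_in = 5, 3
W_hh = rng.normal(size=(n_hidden, n_hidden))
W_xh = rng.normal(size=(n_hidden, n_in))
b_h = np.zeros(n_hidden)

h = np.zeros(n_hidden)                      # initial hidden state
for x_t in rng.normal(size=(7, n_in)):      # a sequence of 7 input vectors
    h = rnn_step(h, x_t, W_hh, W_xh, b_h)   # memory carried across steps
```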

Convolutional neural network.

CNNs were developed to address two-dimensional problems. Among the layers that make up a typical CNN are Conv2D, max pooling, flattening, and a fully connected layer [47]. In Eq (12), y stands for the output, W for the filter coefficients (kernels), X for the input, b for the biases, * for the convolution operation, and f for the activation function.

y = f(W ∗ X + b) (12)
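Eq (12) can be sketched in NumPy for a single kernel (as in most DL libraries, the "convolution" is implemented as cross-correlation; the input and kernel below are toy values):

```python
import numpy as np

def conv2d_valid(X, W, b):
    # Eq (12): y = f(W * X + b) for one kernel, 'valid' padding, ReLU as f
    kh, kw = W.shape
    out = np.empty((X.shape[0] - kh + 1, X.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(W * X[i:i + kh, j:j + kw]) + b
    return np.maximum(out, 0.0)  # ReLU activation

X = np.arange(25, dtype=float).reshape(5, 5)  # toy 5x5 input "image"
W = np.ones((3, 3)) / 9.0                     # averaging kernel, for illustration
y = conv2d_valid(X, W, b=0.0)
```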

Convolutional Long Short-Term Memory.

This model is similar to the one described above: after the CNN extracts characteristics, an LSTM layer processes the resulting reduced-dimensionality feature sequences [48]. Eq (13) combines a CLSTM network's convolutional and recurrent operations. The hidden state at time t is represented by ht, the input at time t by xt, the input-to-hidden convolutional weights by Wih, the hidden-to-hidden recurrent weights by Uhh, the bias term by bh, the convolution operation by *, and the activation function (usually sigmoid or tanh) by f.

h_t = f(W_ih ∗ x_t + U_hh ∗ h_(t−1) + b_h) (13)

Bidirectional Gated Recurrent Unit.

To regulate the flow of textual information across time steps, the GRU pairs candidate hidden states and simplifies the LSTM's three gates into reset and update gates [49]. Because a standard GRU attends only to past context and disregards the influence of subsequent words on the current word, it can introduce bias [50]. For this reason, a bidirectional variant, BiGRU, was employed in this study.

Experimental results and discussions

This section covers the findings, an explanation of the recommended strategy, and the system design. It presents the classification results of ten ML algorithms (Stacking, Voting, SGD, LR, RF, MLP, Bagging, XGBoost, AdaBoost, and K-NN) as well as five DL models (BiGRU, CNN, CLSTM, DNN, and RNN). We assessed each classifier separately and computed the confusion matrix and ROC curve for each model to find the best fit for our dataset.

Experimental setup

The proposed online-harassment detection architecture was developed in Python using ML, DL, and NLP libraries. The models were trained and evaluated on a local machine using the Visual Studio Code text editor with NumPy (1.26.1), Matplotlib (3.8.0), Scikit-Learn (1.3.2), Pandas (2.1.2), Keras (2.15.0), TensorFlow (2.15.0), and PyTorch (2.1.1) on a 64-bit Windows 10 computer. The machine has 16 GB of RAM, an AMD Ryzen 9 5900HS x64 processor at 3.30 GHz with Radeon Graphics, and an NVIDIA GeForce RTX 3050 Ti GPU with 4 GB of GPU memory.

ANOVA and Chi2 tests

The ANOVA test has a p-value of 0.0. The p-value is the likelihood, assuming the null hypothesis is true, of obtaining test results at least as extreme as those recorded during the investigation, as shown in “Table 3”. A p-value of 0.0 in this instance indicates that the null hypothesis is strongly refuted by the evidence. The ANOVA test's F-value, also known as the F-statistic, is roughly 20.99. The F-value measures the variability between the group means relative to the variability within the groups; a higher F-value indicates greater separation between the group means relative to the within-group variability. The Chi2 test has a p-value of 1.0. In a similar vein, if the null hypothesis is correct, this p-value represents the likelihood of obtaining test results at least as extreme as those observed. A p-value of 1.0 suggests there is no significant difference between the observed and expected frequencies. The Chi2 value is roughly 825,999.76. The Chi2 statistic quantifies the discrepancy between the observed and expected frequencies in the contingency table. In conclusion, the ANOVA test indicates a substantial difference between the group means, whereas the Chi-square test shows no statistically significant difference between the observed and expected frequencies.
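Both tests can be reproduced with SciPy; the groups and contingency table below are hypothetical stand-ins, not the paper's data:

```python
import numpy as np
from scipy.stats import chi2_contingency, f_oneway

rng = np.random.default_rng(0)
# Hypothetical feature values for three comment classes
g1 = rng.normal(0.0, 1.0, 50)
g2 = rng.normal(0.5, 1.0, 50)
g3 = rng.normal(1.0, 1.0, 50)
f_stat, p_anova = f_oneway(g1, g2, g3)   # between- vs. within-group variance

# Hypothetical observed label/feature contingency table
table = np.array([[30, 20], [25, 25]])
chi2, p_chi2, dof, expected = chi2_contingency(table)
```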

Parameters used

The most important parameters of the various classifiers employed in this study are presented in the following “Table 4”, which offers a thorough comparison of them.

thumbnail
Table 4. Comparative overview of the parameters of all DL and ML algorithms used in this study.

https://doi.org/10.1371/journal.pone.0308862.t004

Performance metrics

Several statistical measures, such as the F1 score, precision, recall, and accuracy, are employed to assess the efficiency of the suggested framework. The confusion matrix yields four categories of outcomes: true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). "TP" means a positive instance was correctly identified. "TN" denotes a negative instance that the suggested approach correctly identified as negative. "FP" refers to the situation where a negative instance was mistakenly identified as positive by the suggested framework. "FN" denotes the situation in which the suggested framework incorrectly classified a positive instance as negative. Eqs 14–17 compute these metrics [51].

Accuracy = (TP + TN) / (TP + TN + FP + FN) (14)
Precision = TP / (TP + FP) (15)
Recall = TP / (TP + FN) (16)
F1 score = 2 × (Precision × Recall) / (Precision + Recall) (17)
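A direct implementation of Eqs 14–17 from confusion-matrix counts (the counts below are toy values for illustration):

```python
def classification_scores(tp, tn, fp, fn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)          # Eq 14
    precision = tp / (tp + fp)                          # Eq 15
    recall = tp / (tp + fn)                             # Eq 16
    f1 = 2 * precision * recall / (precision + recall)  # Eq 17
    return accuracy, precision, recall, f1

acc, prec, rec, f1 = classification_scores(tp=90, tn=85, fp=10, fn=15)
```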

Results analysis

We evaluated the proposed technique on both binary and multi-label categorization and found that it achieves notably strong detection results for Bengali bullying. For multi-class categorization, the proposed ML and DL techniques were trained on the labeled comment data to locate and categorize several kinds of internet-based abuse (religious, troll, threat, sexual, and non-bullying). Multiple supervised classification methods were applied, and the models were assessed using accuracy, precision, recall, and the F1 score. The hybrid stacking model performed best, achieving 99.34% accuracy on the multi-class classification task, which highlights the versatility of our methodology. The framework correctly classifies 3,753 comments, and the accuracy, precision, recall, and F1 score all confirm the efficiency of our approach. This examination is represented graphically in “Fig 11(a)”, in which the confusion matrix shows only twenty-five incorrectly categorized comments across the large test set. In particular, as “Fig 11(e)–11(j)” illustrate, LR, MLP, RF, SGD, Voting, and XGBoost stand out with exceptional precision, while AdaBoost, Bagging, and K-NN (“Fig 11(b)–11(d)”) achieve lower accuracy and a higher error rate.

thumbnail
Fig 11. Training and validation results: confusion matrices of all ML models.

(a) Stacking, (b) AdaBoost, (c) Bagging, (d) K-NN, (e) LR, (f) MLP, (g) RF, (h) SGD, (i) Voting, (j) XGBoost.

https://doi.org/10.1371/journal.pone.0308862.g011

Moving on to binary categorization, our hybrid stacking model again demonstrates its value. With just 85 cases of incorrect categorization, it correctly classifies 13,346 comments, demonstrating the flexibility and efficacy of our system on a larger data set. It achieves an accuracy of 99.41%, precision of 99.41%, recall of 99.41%, and F1 score of 99.41%. “Fig 12(a)” presents a detailed illustration of the corresponding confusion matrix for the binary classification, demonstrating the system's capacity to handle the subtleties of binary labeling. In addition to the conventional measurements, our research includes a thorough examination of the ROC curve, shown in “Fig 12(b)”. This curve illuminates the model's performance across a range of thresholds, providing detailed insight into its trade-off between sensitivity and specificity. The ROC curve deepens the assessment, enabling an enhanced understanding of each model's characteristics.

thumbnail
Fig 12. Confusion matrix and ROC curve of hybrid stacking model with binary classifications.

(a) Confusion Matrix, (b) ROC curve.

https://doi.org/10.1371/journal.pone.0308862.g012

In this work, we use Receiver Operating Characteristic (ROC) curves to graphically depict multiclass categorization performance, an essential component in assessing the effectiveness of different classification methods. ROC curves plot sensitivity (the true positive rate, TPR) against 1 − specificity (the false positive rate), providing insight into a classifier's capacity to differentiate among multiple categories. Our study illustrates the ROC curve for each classifier in the multiclass scenario by sweeping over a wide range of threshold values. The ROC curve normally starts at (0,0), where every instance is classified as negative, and ends at (1,1), where every instance is classified as positive. The shape of this curve represents the trade-off between correctly classifying positive instances and incorrectly classifying negative instances as positive. The multiclass ROC curves for each classifier AdaBoost, Bagging, K-NN, LR, MLP, RF, SGD, Voting, XGBoost, and our stacking model are depicted in “Fig 13”, and “Table 5” shows the AUC score of all ML models. Each panel, “Fig 13(a)–13(j)”, represents a distinct classifier and displays its own ROC curve, allowing a sophisticated comprehension of each classifier's efficiency across a range of thresholds. When evaluating the efficacy of a classifier, an ROC curve is a useful tool, especially when dealing with multilabel examples. One important measurement obtained from the ROC curve is the Area Under the Curve (AUC), where a greater AUC typically denotes better model accuracy.
“Fig 13” gives a thorough summary of the way every algorithm and hybrid ML model handles the complexities of distinguishing among the five categories taken into consideration in our research.

thumbnail
Fig 13. ROC curves of all ML models over every class.

(a) Adaboost, (b) Bagging, (c) K-NN, (d) LR, (e) MLP, (f) RF, (g) SGD, (h) Voting, (i) XGBoost, (j) Stacking.

https://doi.org/10.1371/journal.pone.0308862.g013

“Table 5” shows the AUC scores of the ML models across the various detection classes. Most models performed extremely well at identifying non-bully, troll, sexual, religious, and threat content, with AUC scores near or equal to 1.00. However, AdaBoost and K-NN performed slightly worse in some classes, with AUC values ranging from 0.59 to 0.99. Overall, the results show that the models are highly effective, with the majority achieving near-perfect AUC scores across all classes.

A wide range of outcome measurements, including the F1 score, recall, accuracy, and precision, is summarized in “Table 6”; “Fig 14” presents the same information as a bar chart. Focusing on the ML models, both the stacking and SGD algorithms perform exceptionally well, with notably high precision of 99.34% and 99.13%, respectively, along with similarly high recall and F1 scores, demonstrating their effectiveness at making precise predictions. The Voting and MLP algorithms also perform well, with recall of 98.99% and 98.91% and precision of 99% and 98.92%, respectively; the corresponding accuracies of 98.99% and 98.91% demonstrate how consistently accurate they are. Conversely, AdaBoost exhibits lower recall (31.63%) and precision (39.75%), resulting in an F1 score of 21.49%. Similarly, the Bagging and K-NN models lag behind, with accuracies of 64.24% and 72.50%, respectively. The XGBoost, LR, and RF models perform in a balanced manner, achieving accuracies above 97%; although they do not match stacking and SGD, their excellent precision, recall, and F1 scores confirm their dependability for this categorization task. Recall measures the model's capacity to retrieve all real positive instances; precision measures how accurate the model's positive predictions are; and the F1 score balances the two, providing a thorough evaluation of an algorithm's overall performance.
Here, stacking and SGD stand out as algorithms that both predict positives precisely and capture nearly all genuine positives. Overall, “Table 6” along with “Fig 14” clearly illustrates the value of this careful assessment, which highlights the specific advantages and disadvantages of every ML framework. It also gives a clear picture of how every approach performs across the important metrics and confirms that stacking and SGD are effective in this multilabel categorization task.

thumbnail
Fig 14. Accuracy, precision, recall, and F1 score of all ML models on the evaluation data.

https://doi.org/10.1371/journal.pone.0308862.g014

thumbnail
Table 6. Classification report of precision, recall, F1 score, and accuracy for all ML models.

https://doi.org/10.1371/journal.pone.0308862.t006

We use multiple DL models, including DNN, CNN, BiGRU, RNN, and CLSTM, to analyze the comment data sets and classify the various types of internet abuse. We assess the predictions using several supervised classification techniques and major performance indicators, including accuracy, precision, recall, and the F1 score. Interestingly, our DL algorithms perform exceptionally well. The BiGRU approach, for example, reached an accuracy of 90.63% on the multi-class task, demonstrating how flexible and effective our approach is at managing different types of cyberbullying. The comprehensive analysis is illustrated graphically in “Fig 15(a)”, in which the confusion matrix reveals 354 incorrectly categorized comments out of the large test set; the model correctly classifies 3,424 comments, and the thorough evaluation of accuracy, precision, recall, and F1 score confirms the dependability of our method. Notable is the contribution of the individual DL algorithms, namely BiGRU, CLSTM, CNN, DNN, and RNN, to the general efficacy of our methodology, as depicted in “Fig 15(a)–15(e)” in that order. This demonstrates how well the predictive models handle the complexities of multi-label grouping for identifying harassment in Bengali. Our investigation of DL models confirms their importance in the field of online-misconduct identification by demonstrating their capacity both to complement and, in some cases, to outperform conventional ML methods. Adapting our technique to DL models broadens its application and offers a flexible, comprehensive solution to the complex problem of classifying and identifying different types of internet abuse in Bengali-language content.

thumbnail
Fig 15. Training and validation results: confusion matrices of all DL models.

(a) BiGRU, (b) CLSTM, (c) CNN, (d) DNN, (e) RNN.

https://doi.org/10.1371/journal.pone.0308862.g015

“Fig 16” and “Table 7” below show the multiclass ROC curves and AUC scores for each DL model: BiGRU, CLSTM, CNN, DNN, and RNN. A unique ROC curve is displayed for each classifier in “Fig 16(a)–16(e)”. This comprehensive analysis allows a sophisticated understanding of each classifier's efficiency across a range of thresholds. “Fig 16” provides an extensive summary of how each model addresses the challenge of differentiating among the five categories that our study examined.

thumbnail
Fig 16. ROC curves of all DL models over every class.

(a) BiGRU, (b) CLSTM, (c) CNN, (d) RNN, (e) DNN.

https://doi.org/10.1371/journal.pone.0308862.g016

thumbnail
Table 7. AUC score comparison table for all the deep learning algorithms.

https://doi.org/10.1371/journal.pone.0308862.t007

“Table 7” compares the AUC scores of the deep learning models across the different content types. BiGRU, CNN, CLSTM, DNN, and RNN all achieved high AUC scores, indicating strong performance across classes. While the AUC scores differ slightly between models and classes, the models consistently perform well in detecting non-bully, troll, sexual, religious, and threat content.

We analyze the DL models using the same key metrics. Regarding precision, recall, and F1 scores, both the BiGRU and CNN algorithms perform exceptionally well, with notably high precision of 90.73% and 90.03%, respectively, along with similarly high recall and F1 scores. The CLSTM, DNN, and RNN algorithms also perform well, with recall of 89.36%, 89.20%, and 88.41% and precision of 89.59%, 89.57%, and 88.37%, respectively. The corresponding accuracies of 89.36%, 89.20%, and 88.41% demonstrate how consistently accurate they are; although they do not match the precision of BiGRU and CNN, their strong precision, recall, and F1 scores confirm their dependability for this categorization task. Overall, “Table 8” along with “Fig 17” clearly illustrates the value of this careful assessment, which highlights the specific advantages and disadvantages of every DL framework. It also gives a clear picture of how every approach performs across the important metrics and confirms that BiGRU and CNN are effective in this multilabel categorization task.

thumbnail
Fig 17. Accuracy, precision, recall, and F1 score of all DL models on the evaluation data.

https://doi.org/10.1371/journal.pone.0308862.g017

thumbnail
Table 8. Classification report of precision, recall, F1 score, and accuracy for all DL models.

https://doi.org/10.1371/journal.pone.0308862.t008

Discussion

This paper presents an in-depth examination of our hybrid ML strategy, which is specifically designed to identify instances of online bullying in Bengali-language text, drawing on the experimental findings and the subsequent discussion. We provide a thorough examination of a wide range of ML and DL scenarios, illuminating both the particular benefits of each one and the overall effectiveness of the structure that was created. Ten algorithms, namely Stacking, Voting, SGD, LR, RF, MLP, Bagging, XGBoost, AdaBoost, and K-NN, are examined in detail for both binary and multi-label classification as part of the inspection of ML models. The best performer is the hybrid stacking model, which achieves a remarkable accuracy of 99.34% on multi-class classification. The algorithm's remarkable precision, recall, and F1 score across an assortment of internet-abuse categories, including sexually explicit material, threats, trolls, religious backgrounds, and non-bullying comments, highlight its adaptability and general effectiveness. The careful examination of confusion matrices alongside ROC curves enables a more thorough grasp of the pros and cons of each ML model. Models with notable precision include LR, MLP, RF, SGD, Voting, and XGBoost. Models with lower accuracy and a higher rate of statistical errors include AdaBoost, Bagging, and K-NN. This detailed analysis helps identify which approaches are best suited to tackle the complex issues that online harassment in Bengali raises. The research also highlights the potential of DL algorithms, among them BiGRU, CNN, CLSTM, DNN, and RNN, for classifying the various forms of cyberbullying within Bengali text.
With an accuracy rate of 90.63%, the BiGRU model stands out, specifically highlighting the effectiveness of DL methods in addressing the convoluted nature of internet abuse. Precision, recall, and F1 scores are also used to evaluate the DL frameworks; BiGRU and CNN both demonstrate noteworthy precision. This adds to the growing body of evidence supporting the utility of DL frameworks in the field of online-abuse identification.

A thorough examination of past research suggests that our study is a trailblazer in the field of NLP-based detection methods. The current study uses a collection of 94,000 instances, which is significantly larger than the typical dataset sizes in similar studies. By managing a demanding five-class multi-class problem, our investigation shows a sophisticated comprehension of various language subtleties. Most striking is the remarkable accuracy of 99.34%, which attests to the system's effectiveness in properly identifying instances in every group. This excellent accuracy highlights both the suggested system's resilience and its potential for practical use. Furthermore, each category frequently achieves 99.34% precision, recall, and F1 score, indicating a well-balanced model that excels at retrieving a high proportion of relevant data. “Fig 18” shows the classification report of the hybrid stacking model.

The advantage of our approach is reinforced by a comparison with prior research in “Table 9”. Haque et al. [5] utilized various models including SVC, SGD, RF, LR, MNB, DT, CLSTM, BiGRU, BiLSTM, and LSTM, achieving an accuracy of 85.80% with a dataset size of 42,036. Their limitation was lower detection accuracy. Das et al. [11] employed RNN, attention mechanism, LSTM, GRU, and CNN, achieving an accuracy of 77% with a dataset size of 7,425. They faced challenges with lower detection accuracy and limited dataset coverage. Eshan et al. [12] used RF, MNB, and SVM, achieving an accuracy of 75% with a dataset size of 2,500. They also faced limitations in accuracy and dataset coverage. Ishmam et al. [13] employed GRU, Adaboost, RF, NV, and SVC, achieving an accuracy of 70.10% with a dataset size of 5,126. They encountered limitations in accuracy and dataset coverage. Ahmed et al. [14] applied Random Forest, SVM, KNN, Naïve Bayes, and a hybrid neural network (CNN-SVM), achieving an accuracy of 87.91% for CNN-SVM and 85% for SVM with a dataset size of 44,001. Their limitations included being limited to specific categories and having a dataset with limited accuracy. Ahmed et al. [15] utilized MNB, SVM, LR, XGBoost, CNN, LSTM, BLSTM, and GRU, achieving an accuracy of 80% for MNB with a dataset size of 12,000. They faced limitations in accuracy and dataset coverage, and achieved an AUC score of 0.80. Emon et al. [16] employed LinearSVC, LR, MNB, RF, ANN, RNN, and LSTM, achieving an accuracy of 82.20% with a dataset size of 4,700. They faced limitations due to a relatively small dataset and lower detection accuracy. Mahmud et al. [17] utilized LR, MB, DT, RF, SVM, AdaBoost, GB, SGD, ET, KNN, and MLP, achieving an accuracy of 97% with a dataset size of 3,000. They faced limitations due to a relatively small dataset and a limited preprocessing approach. M. Khan et al. [18] employed SVM, NB, RF, KNN, and NN, achieving an accuracy of 62% with a dataset size of 63,000.
Their limitations included lower detection accuracy and only 2% of data labeled for the "Religious" class. Romim et al. [19] utilized Bi-LSTM, achieving an F1-score of 86.78% with a dataset size of 50,314. Their study was limited to hate speech detection in Bengali language, and they did not discuss the potential impact of the dataset. Akhter et al. [20] utilized NB, J48, SVM, and KNN, achieving an accuracy of 97.73% with a dataset size of 2,400, and also achieved an AUC score of 0.54. They faced limitations in accuracy and dataset coverage. Akhter et al. [21] employed DT, RF, LR, and MLP, achieving an accuracy of 98.82% for MLP and 98.57% for LR with a dataset size of 44,001, and also achieved an AUC score of 0.997. They did not use DL models, and their models’ execution time was high.

thumbnail
Table 9. Comparison of the proposed model with previous investigations.

https://doi.org/10.1371/journal.pone.0308862.t009

Using a 94,000-comment dataset, our work applied TF-IDF with a Count Vectorizer and tokenization, resulting in an accuracy of 99.34% on the 5-class classification task and 99.41% on the 2-class classification task, with an AUC score of 1.0. In terms of accuracy, our method outperforms current approaches. Compared with other studies, we achieve significantly higher accuracy by using TF-IDF with a Count Vectorizer and tokenization. Furthermore, our approach addresses the drawbacks noted in earlier research, including reduced detection precision, restricted dataset scope, and particular linguistic limitations.

Most importantly, our feature extraction combines a Count Vectorizer with a TF-IDF transformer and tokenization, a thoughtful integration of methods for enhanced performance. This comprehensive approach enhances the versatility and effectiveness of the model. The Bengali Cyberbullying Detector (BCBD), a web-based application built on the hybrid stacking approach (SGD+MLP+LR), is a prime example of a real-life application. By providing a combination of a large dataset, skillful multi-class categorization, and a strong feature extraction method, this work advances the state of NLP-based detection procedures.
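A minimal scikit-learn sketch of this feature-extraction-plus-stacking arrangement; the toy English comments stand in for the Bengali data, and all hyperparameters are illustrative rather than the paper's tuned values:

```python
from sklearn.ensemble import StackingClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline

texts = ["good kind comment", "awful abusive words", "nice helpful post",
         "bad rude insult", "lovely supportive reply", "hateful nasty remark"]
labels = [0, 1, 0, 1, 0, 1]   # 0 = non-bullying, 1 = bullying (toy labels)

# SGD and MLP are the base learners; LR is the meta-learner on their outputs.
stack = StackingClassifier(
    estimators=[("sgd", SGDClassifier(random_state=0)),
                ("mlp", MLPClassifier(hidden_layer_sizes=(16,), max_iter=500,
                                      random_state=0))],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=2,
)
model = Pipeline([("counts", CountVectorizer()),   # token counts
                  ("tfidf", TfidfTransformer()),   # TF-IDF weighting
                  ("clf", stack)])
model.fit(texts, labels)
pred = model.predict(["good supportive post"])
```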

Building web application

Based on the suggested hybrid stacking approach (SGD+MLP+LR), we developed an online tool designated the Bengali Cyberbullying Detector (BCBD) for performing multi-class categorization of Bengali texts, built on the Flask platform. By classifying online bullying into five different categories, BCBD aims to empower individuals to recognize the true nature of online bullying within Bengali text. The significance of BCBD lies in its capability to help users decode and comprehend the actual online-bullying content present in a given text. This categorization allows users to respond quickly to the content without any confusion. “Fig 19” shows a visual representation of BCBD, demonstrating how users input their target text. The classification results are also displayed, indicating the predicted category for each input. The successful integration of the SGD, MLP, and LR components in our approach enhances the accuracy and effectiveness of BCBD. This tool not only contributes to the field of natural language processing but also addresses the pivotal issue of cyberbullying, particularly in the Bengali context.
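A minimal Flask sketch of how such a tool could expose a classifier over HTTP; the classify function below is a dummy placeholder, not the trained SGD+MLP+LR stack, and the route name is an assumption:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

CATEGORIES = ["non-bullying", "troll", "threat", "sexual", "religious"]

def classify(text):
    # Placeholder: the real BCBD would call the fitted stacking pipeline here.
    return CATEGORIES[len(text) % len(CATEGORIES)]

@app.route("/predict", methods=["POST"])
def predict():
    # Read the submitted comment and return its predicted category as JSON.
    text = request.form.get("text", "")
    return jsonify({"text": text, "category": classify(text)})
```

In production the app would load the pickled pipeline once at startup and serve it behind a WSGI server rather than Flask's development server.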

Conclusion and future work

This research addresses multi-class online bullying in Bengali comments on social networking sites, an issue complicated by the scarcity of prior study and by the difficulty of identifying perpetrators who hide from direct confrontation. A hybrid ML approach to detecting Bengali online bullying is proposed, including text preprocessing, a combination of feature extraction techniques for the ML classifiers, tokenization for the DL models, and data balancing using IHT. ML techniques (Stacking, Voting, SGD, LR, RF, MLP, Bagging, XGBoost, AdaBoost, and K-NN) and DL models (BiGRU, CNN, CLSTM, DNN, and RNN) were applied with the k-fold cross-validation technique, and efficiency was measured through different indicators. We categorized 94,000 Bengali online comments into five different groups using data gathered from two publicly available datasets. During the research, we demonstrated that multi-class cyberbullying detection can attain an exceptional level of correctness in this language through effective preprocessing and feature extraction using different methodologies. The algorithm accurately detected Bengali online harassment on social media platforms with success rates of about 99.41% and 99.34% in binary and multiclass categorization, respectively, by employing the hybrid ML approach, and with 90.63% accuracy employing the DL algorithm BiGRU. According to our findings, the proposed model might prove beneficial to automated Bengali cyberbullying surveillance systems. There are a few limitations to the present investigation. First, its primary emphasis on detecting Bengali cyberbullying through a hybrid ML technique is limited to specific languages and cultural environments. The use of openly accessible datasets containing a limited range of comments might not accurately reflect the wide variety of cyberbullying incidents.
The investigation also does not examine transformer-based methods, which have shown potential in natural language processing tasks. Furthermore, the difficulties and concerns of practical deployment are not investigated. Addressing these drawbacks would improve the dependability as well as the relevance of future cyberbullying detection studies.
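The hybrid stacking approach summarized above (TF-IDF features feeding SGD and MLP base learners with an LR meta-learner) can be sketched with scikit-learn. The toy comments, labels, and hyperparameters below are illustrative placeholders, not the study's actual data or configuration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier, LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import StackingClassifier
from sklearn.pipeline import make_pipeline

# Toy stand-ins for the labelled Bengali comments (placeholders only).
texts = ["ভালো মানুষ", "খারাপ কথা", "চমৎকার পোস্ট", "বাজে মন্তব্য"] * 10
labels = ["not_bully", "bully", "not_bully", "bully"] * 10

# Stacking ensemble with SGD and MLP as base learners and LR as the
# meta-learner, mirroring the SGD+MLP+LR hybrid; hyperparameters here
# are illustrative, not tuned values from the paper.
stack = StackingClassifier(
    estimators=[
        ("sgd", SGDClassifier(max_iter=1000, random_state=0)),
        ("mlp", MLPClassifier(hidden_layer_sizes=(32,), max_iter=300,
                              random_state=0)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
)

# TF-IDF features feed the ensemble, as in the proposed approach.
model = make_pipeline(TfidfVectorizer(), stack)
model.fit(texts, labels)
print(model.predict(["খারাপ কথা"])[0])
```

`StackingClassifier` trains the meta-learner on cross-validated predictions of the base learners, which is what lets the LR layer correct for the individual weaknesses of SGD and MLP rather than simply averaging them.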

Although the outcomes are good, there is always room for improvement. In future work, we will conduct studies to tackle the problem of imbalanced classes by broadening the training data. We would also like to investigate transformer-based techniques: methods such as BERT, RoBERTa, and ELECTRA, among others, could be applied to multilingual cyberbullying datasets covering a variety of cyberbullying classifications. This line of investigation has the potential to improve the effectiveness of online harassment surveillance systems. In parallel, we are researching video and image filtering, because both also serve as vehicles for harassment on social networking sites.
