Threatening language detection from Urdu data with deep sequential model

Ashraf Ullah; Khair Ullah Khan; Aurangzeb Khan; Sheikh Tahir Bakhsh; Atta Ur Rahman; Sajida Akbar; Bibi Saqia

doi:10.1371/journal.pone.0290915

Abstract

The Urdu language is spoken and written on different social media platforms like Twitter, WhatsApp, Facebook, and YouTube. However, due to the lack of Urdu Language Processing (ULP) libraries, it is quite challenging to identify threats in the Urdu textual and sequential data shared on social media. Urdu data therefore needs to be preprocessed as efficiently as English data, which requires stemming and data cleaning libraries for Urdu. Different lexical and machine learning-based techniques have been introduced in the literature, but all of them are limited by the unavailability of an online Urdu vocabulary. This research introduces an Urdu language vocabulary, including a stop words list and a stemming dictionary, to preprocess Urdu data as efficiently as English. This reduces the input size of Urdu sentences and removes redundant and noisy information. Finally, a deep sequential model based on Long Short-Term Memory (LSTM) units is trained, evaluated, and tested on the preprocessed data. The proposed methodology achieves an accuracy of 82%, which is higher than that of existing methods.

Citation: Ullah A, Khan KU, Khan A, Bakhsh ST, Rahman AU, Akbar S, et al. (2024) Threatening language detection from Urdu data with deep sequential model. PLoS ONE 19(6): e0290915. https://doi.org/10.1371/journal.pone.0290915

Editor: Toqir Rana, The University of Lahore, PAKISTAN

Received: September 21, 2023; Accepted: March 26, 2024; Published: June 6, 2024

Copyright: © 2024 Ullah et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: The dataset is publicly available on GitHub at https://github.com/ashraf8484/Augmented-Threatening-Language-Urdu-Dataset. The same link is given in the revised manuscript as a footnote on page 8.

Funding: The source of funding is Cardiff Metropolitan University UK. The study benefitted significantly from the funder's integral involvement. They actively participated in experiments and offered valuable guidance throughout the analysis process. Their support significantly contributed to the overall success and robustness of the study.

Competing interests: The authors have declared that no competing interests exist.

1. Introduction

According to Twitter, a threat is a statement of intent to inflict serious physical harm on, or to kill, a person or an entire group. More generally, a threat is a statement expressing severe harm, either bodily or in another form. For instance, "Keep your mouth shut, or you will be seen red." The word "red" can be read as a threat to injure someone or, in the worst case, to kill someone. This kind of remark is thus considered a vile aspersion. Twitter has taken some initiatives to stop the spread of threatening remarks on the platform. For instance, a timeout feature suspends accounts that are found to be abusive for a number of hours. Even so, more effort is required to detect threatening and offensive content in different languages, since abusive language is still prevalent on social media.

Social networks have become essential parts of our daily routine with emerging communication technology and the Internet, and the number of social media users is rising sharply. For instance, StatInvestor (https://statinvestor.com/) reported that from 2010 to 2021, the number of social media users tripled from 0.97 billion to 3.02 billion. Twitter has around 353 million active users per month, according to Statista (https://www.statista.com/), with 200 billion annual tweets. Twitter is used for writing, reading, and sharing short texts with a character limit of 280 per tweet. It is considered one of the most popular social media platforms. Such platforms have huge numbers of people with cultural, religious, ethnic, and linguistic diversity [1]. At the same time, freedom of speech is said to be restricted by censorship of free expression as Twitter has been used as a medium to commit cyber crimes like spamming, cyber-bullying, malware spreading, and phishing [2]. Furthermore, the challenging issues include encouragement to self-harm and sexual offending [3–6]. This can instigate threats of physical violence and gender-based violence [7]. GamerGate, a controversial online movement that emerged in 2014 within the gaming community, is an example in which women received death and rape threats from video game enthusiasts on the Twitter platform.

Some users also use Twitter to threaten other people by posting threatening posts. This has initiated a growing body of research to investigate the use of threatening content in social media by detecting threatening content and abusive language [8, 9]. Given the anguish it causes social media users, further research in automatic threatening language detection is vitally important to solve this problem in large platforms like Twitter. While the technology for the automatic detection of threatening language is still in its early stage, Twitter has gone beyond only English to detect threatening language [10, 11]. Different studies investigated the detection of threatening languages automatically in different languages like German, Italian, Arabic, Indonesian, Bengali, and Dutch [7, 12–16]. These studies worked on linguistic features and lexical resources for the detection of threatening language automatically.

Urdu is a major Indo-Aryan language primarily spoken in Pakistan and India. It has a rich history and is known for its poetic and literary traditions. Urdu encompasses many emotional words and phrases that can effectively convey sentiments and emotions. This allows sentiment analysis models to capture and analyze nuanced sentiments expressed in Urdu text. Threat detection in Urdu text poses a significant challenge due to limited data availability, linguistic diversity, and regional variations within Urdu-speaking communities. ML and DL models require diverse training datasets to address these variations and detect threatening content accurately. The proliferation of Urdu language usage across various social media platforms like Twitter, WhatsApp, Facebook, and YouTube presents a significant challenge due to the absence of robust Urdu Language Processing (ULP) libraries. The lack of adequate resources hampers the identification and mitigation of potential threats concealed within textual and sequential data shared in Urdu.

Current literature underscores the limitations imposed by the absence of online Urdu vocabulary, constraining the applicability of lexical and machine learning-based techniques in effectively processing Urdu data. Existing methodologies fall short in efficiently preprocessing Urdu content akin to the processing capabilities available for English.

In response to this pressing need, this research endeavors to bridge the gap by introducing a comprehensive Urdu language vocabulary, featuring a meticulously curated stop words list and a dedicated stemming dictionary. The goal is to establish an efficient preprocessing framework for Urdu data, aimed at reducing input sentence size and eliminating redundant and noisy information.

Moreover, the study proposes the utilization of a deep sequential model leveraging Long Short-Term Memory (LSTM) units. This model is trained on the meticulously preprocessed Urdu data, subsequently evaluated and tested to assess its performance.

The primary objective of this research is to ascertain the effectiveness of the proposed methodology in enhancing predictive performance. The validation results demonstrate a promising accuracy rate of 82%, surpassing the efficacy of existing methodologies.

There are a few key flaws and differences in Urdu language processing and social media threat identification that need to be taken into consideration. First off, even though efforts have been made to identify language that poses a threat in a variety of languages, including Urdu, the preponderance of English-centric approaches leaves a significant gap in addressing the particular complexities and subtleties present in Urdu linguistic structures and cultural contexts, impeding the creation of effective threat detection systems. Second, the lack of extensive libraries for Urdu language processing makes it more difficult to identify risks in Urdu material on social media, which restricts the use of cutting-edge machine learning methods. Furthermore, as vocabulary and syntax variances present challenges to reliable threat recognition, geographical variations and various linguistic characteristics among Urdu-speaking groups further complicate the creation of broadly applicable threat detection models. Moreover, a major bottleneck that compromises threat detection models’ efficacy and generalizability is the lack of annotated datasets for training and assessment. In order to advance Urdu language processing and threat detection, these issues must be resolved. This calls for the creation of novel approaches to get around these restrictions and promote the advancement of more potent strategies for detecting and averting threats on a variety of social media platforms.

This research’s scope includes solving the difficulties associated with processing Urdu language, especially when it comes to threat identification on social media. The study tries to improve threat detection model efficiency by providing a broad vocabulary in Urdu and suggesting sophisticated preprocessing methods. Furthermore, the use of deep sequential models, including Long Short-Term Memory (LSTM) units, emphasises how critical it is to advance Urdu language processing technology in order to better identify and predict threats. The present study effectively mitigates dangers in Urdu textual data across many social media platforms, so contributing to the wider goal of fostering safer online environments. It also closes significant gaps in existing approaches.

Therefore, the fundamental focus of this study is to address the challenges inherent in Urdu language processing by establishing a robust preprocessing framework and validating its efficacy through the implementation of advanced deep sequential models, thus contributing to improved threat identification and prediction in Urdu textual data across social media platforms.

1.1. Contributions

In this paper, we propose to employ advanced augmentation techniques to address the challenge of limited data availability for Urdu. Furthermore, we preprocess the data to remove noise efficiently with the help of the proposed Urdu stop words list and stemming dictionary. We apply deeper data augmentation steps, which lead to better training of LSTMs [17] for threatening tweet detection. Our model, trained on this preprocessed data, yields better results than the existing state-of-the-art methods [18, 19]. We summarize our research contributions as follows:

We proposed to apply enhanced data augmentation steps.
We generated Urdu stop words list and stemming dictionary.
We efficiently performed preprocessing of data to remove the noise.
Finally, the proposed LSTM architecture is trained for threatening language detection.

1.2. Paper arrangement

The rest of the paper is organized as follows. Section 2 describes the available state-of-the-art and some key contributions. Section 3 explains the proposed methodology and network architecture. Section 4 briefly discusses the experimental setup. Section 5 presents and discusses the results and the comparison with other methods. Section 6 concludes the paper and gives some future directions.

2. Literature review

Detecting threatening language is challenging, particularly when it must be distinguished from other derogatory content, or even from benign content that shares some overlapping words. Negative terms are commonly used for sarcasm or amusement; for example, blood, stab, murder, death, and kill are commonly used words. The Natural Language Processing (NLP) community is working on detecting threatening language on social media platforms like Twitter, WhatsApp, Facebook, YouTube, blogs, and Instagram [2, 20–29]. Various studies relied on chi-square feature selection and lexicon-based techniques for the automatic detection of threatening language. Furthermore, character n-grams [12, 21, 22, 30–32], word n-grams, and a combination of both have been used by many researchers [1, 12, 16, 22–24, 26, 33, 34].

Some studies also use machine learning techniques and multilingual datasets to detect threatening language automatically. For instance, various studies proposed Logistic Regression (LR) classifiers and Support Vector Machines (SVMs) to identify offensive speech in blogs, tweets, Reddit, articles, and Facebook [12, 16, 21, 23, 24, 26, 29, 31, 33]. Likewise, Naïve Bayes (NB) was used to identify derogatory remarks in news groups and YouTube comments, whereas Decision Trees (DT) were used to detect threatening language in Turkish Instagram content and tweets [1, 22, 29, 31, 35]. Furthermore, a single study used K-nearest neighbors (KNN) on datasets of Instagram posts, comments, and tweets in the Turkish language [29].

Deep Learning models are also used in the latest studies to detect threatening language. These techniques were also employed for depression diagnosis and provided the best results [36]. For instance, Convolutional Neural Networks (CNNs) were used to detect threatening content on Facebook and Twitter in German and English [1, 6, 12, 13, 20–25, 27, 30, 31, 33, 35]. Those studies reported that CNNs surpassed other neural network-based models. A BERT model has been used to improve news article categorization performance in Bangla, where blog postings, newspaper articles, and social media posts are all gaining popularity among the large Bangla-speaking populace [37]. Deep Learning algorithms have also been used to identify various forms of music. Bengali music is distinctive in its substance, and there is still a lot to learn about applying DL techniques to it. As a result, Bengali music genre categorization is a relatively recent topic of study in the field of Deep Learning [38]. Furthermore, in the Bengali language, the LSTM model, which is a Recurrent Neural Network architecture, was used by a few research studies to detect threatening language on Facebook [12, 23, 26, 31]. Graph Convolutional Networks and BiLSTMs were applied to English Twitter content [21, 23, 30, 32]. Other researchers studied threatening language detection in languages like German, English, Turkish, Italian, Bengali, Danish, Arabic, Japanese, Spanish, Portuguese, and Indonesian [1, 6, 12, 13, 15, 20–26, 29–35, 39].

Recent research has focused on detecting threatening language in Urdu data with machine learning [18, 19]. The first threat-labeled Urdu dataset was proposed in 2021 and contains only 3,564 tweets [18]. Both studies applied different ML and DL methods, including Multi-Layer Perceptron (MLP), SVM, Extra Tree (ET), Bernoulli Naive Bayes (BNB), and Logistic Regression (LR) classifiers. The best accuracy reported by both studies is around 75%, which is good but can be improved. The issue is that handcrafted feature selection methods are used for the detection due to the unavailability of Urdu preprocessing libraries. In contrast, our research proposes an Urdu stop words list and a stemming dictionary for efficient preprocessing and better classification. Table 1 provides brief information about the available research on threat detection in different languages. As the table shows, only two research articles have focused on threat detection from Urdu data.

Table 1. Literature performed for threats detection on Twitter data.

https://doi.org/10.1371/journal.pone.0290915.t001

3. Methodology

The proposed methodology that is followed in this research comprises data acquisition (gathering and augmentation), preprocessing (data cleaning, stemming, tokenization, vectorization, padding), and modeling along with analysis (LSTM training, validation, testing, and comparison) as depicted in Fig 1.

Fig 1. Proposed methodology block diagram.

https://doi.org/10.1371/journal.pone.0290915.g001

3.1. Data acquisition

3.1.1. Dataset selection.

This research selected an Urdu dataset of 3,564 tweets, each manually labeled as threaten or non-threaten [18]. The dataset contains 1,782 threatening sentences and 1,782 non-threatening sentences. This number of tweets per class is small for a DL model, so it is increased with the help of a data augmentation technique called back-translation. Sample sentences with their respective labels from the dataset are:

Non-Threaten بھگوڑی لیگ صرف اپنے مفاد تک ہے
Threaten بکواس مت کرو

3.1.2. Data augmentation.

Due to the limited labeled data available for the Urdu language, data enhancement is required to resolve the overfitting and underfitting of a deep learning model. If the quantity of data is small, the chances of model underfitting and overfitting are high [40]. The better solution is to acquire more data manually under the supervision of domain experts, but that is a laborious and time-consuming job. Another way is to apply data augmentation techniques to enhance the data [40]. In this research, we have used back-translation to increase the number of samples. As the name suggests, back-translation translates text from one language to another and then translates the result back to the first language. Back-translation is done using a Machine Translation Service (MTS) and depends on the random variations generated by the service. We have used the Google Translation API from Python. After augmentation, the total number of threatening sentences and non-threatening sentences is 7128 each. The dataset is publicly available on GitHub (https://github.com/ashraf8484/Augmented-Threatening-Language-Urdu-Dataset).

Original Tweet:بھگوڑی لیگ صرف اپنے مفاد تک
Translated Tweet: The fugitive league is only up to its own interests
Back-translated: مفرور لیگ صرف اپنے فائدے کے لیے ہے
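
The back-translation step described above can be sketched as follows. This is a minimal illustration, not the authors' code: the `translate` argument is a hypothetical callable standing in for whichever machine translation client is used (the paper names the Google Translation API but gives no call details), and `toy_translate` is a lookup stub built from the example above so that the sketch runs offline.

```python
from typing import Callable, List, Tuple

# Hypothetical signature: translate(text, source_language, target_language) -> str
Translator = Callable[[str, str, str], str]


def back_translate(text: str, translate: Translator,
                   src: str = "ur", pivot: str = "en") -> str:
    """Translate src -> pivot -> src and return the round-tripped text."""
    pivot_text = translate(text, src, pivot)
    return translate(pivot_text, pivot, src)


def augment(texts: List[str], labels: List[str],
            translate: Translator) -> Tuple[List[str], List[str]]:
    """Append one back-translated copy of each text, keeping its label."""
    aug_texts, aug_labels = list(texts), list(labels)
    for text, label in zip(texts, labels):
        candidate = back_translate(text, translate)
        # Skip round trips that return nothing or the unchanged sentence.
        if candidate and candidate != text:
            aug_texts.append(candidate)
            aug_labels.append(label)
    return aug_texts, aug_labels


# Offline stub built from the example in the text (assumption: a real
# service would be called here instead).
ORIGINAL = "بھگوڑی لیگ صرف اپنے مفاد تک"
ENGLISH = "The fugitive league is only up to its own interests"
BACK = "مفرور لیگ صرف اپنے فائدے کے لیے ہے"


def toy_translate(text: str, src: str, dest: str) -> str:
    table = {(ORIGINAL, "ur", "en"): ENGLISH, (ENGLISH, "en", "ur"): BACK}
    return table.get((text, src, dest), text)


texts, labels = augment([ORIGINAL], ["Non-Threaten"], toy_translate)
print(len(texts), labels)
```

Each augmented sentence inherits the label of its source sentence, which assumes that the round trip does not change whether the sentence is threatening.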

3.2. Preprocessing

Preprocessing is always considered one of the essential steps for any ML or DL model. Without preprocessing, the models may remain underfit due to noise.

3.2.1. Data cleaning.

The data must be cleaned to enhance its quality and improve the performance of any automated model [40]. Data cleaning consists of removing special characters, white spaces, digits, and stop words. A stop word is a frequently used word of a language that carries little information. Due to the unavailability of Urdu stop words, this research has introduced a list of 270 Urdu stop words generated from a translated English stop words list and manual inspection.
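
A minimal sketch of these cleaning steps is shown below. The regular expressions and the `STOP_WORDS` set are assumptions made for illustration: the set is a small hand-picked subset, not the 270-word list introduced in this research, and the character ranges simply keep the Unicode Arabic-script block used by Urdu.

```python
import re

# Illustrative subset only; the paper's 270-word list is not reproduced here.
STOP_WORDS = {"ہے", "تک", "صرف", "اپنے", "کا", "کی", "کے"}

# ASCII, Arabic-Indic, and extended Arabic-Indic digits.
_DIGITS = re.compile(r"[0-9\u0660-\u0669\u06F0-\u06F9]")
# Urdu comma, semicolon, question mark, and full stop.
_URDU_PUNCT = re.compile(r"[\u060C\u061B\u061F\u06D4]")
# Anything left that is neither Arabic-script nor whitespace.
_NON_URDU = re.compile(r"[^\u0600-\u06FF\s]")


def clean_text(text: str) -> list:
    """Remove digits, special characters, extra spaces, and stop words."""
    text = _DIGITS.sub(" ", text)
    text = _URDU_PUNCT.sub(" ", text)
    text = _NON_URDU.sub(" ", text)
    tokens = text.split()  # split() also collapses repeated white space
    return [tok for tok in tokens if tok not in STOP_WORDS]


print(clean_text("بھگوڑی لیگ 2021 صرف اپنے مفاد تک ہے !!"))
```

Replacing unwanted characters with a space rather than deleting them avoids accidentally gluing two neighbouring words together.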

3.3. Stemming

In linguistics, a single word can appear in different forms depending on the tense in which it is used. For example, the word ‘Read’ can be used as ‘Reads’, ‘Reading’, or ‘Readable.’ All variants of a word share the same core meaning, but computers and machines treat each variant as a different word. In Urdu, the number of variants is much larger. Table 2 shows sample variants of an Urdu word, where V_n denotes the possible variants of the root word ‘پڑھ’. The total number of variants is more than 10, whereas the table shows just 2 variants for the same root word in English. Another issue is that no Urdu stemming dictionary exists that maps a word’s different forms to its root word. In this research, a low-level Urdu stemming dictionary is proposed and was created under the inspection of a literature expert. The generated Urdu dictionary contains approximately 1800 unique root words.

Table 2. Sample stemming words.

https://doi.org/10.1371/journal.pone.0290915.t002
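
Dictionary-based stemming of this kind reduces to a lookup from each variant to its root. The sketch below assumes a plain Python mapping; the handful of variants listed for the root word ‘پڑھ’ are common inflected forms chosen for illustration and are not copied from Table 2 or from the authors' dictionary.

```python
# Illustrative entries only: variant -> root. The dictionary proposed in
# the paper covers roughly 1800 root words and is not reproduced here.
STEM_DICT = {
    "پڑھنا": "پڑھ",
    "پڑھتا": "پڑھ",
    "پڑھتی": "پڑھ",
    "پڑھتے": "پڑھ",
    "پڑھا": "پڑھ",
    "پڑھی": "پڑھ",
    "پڑھیں": "پڑھ",
}


def stem(token: str, stem_dict: dict = STEM_DICT) -> str:
    """Return the root of a token, or the token itself if it is unknown."""
    return stem_dict.get(token, token)


def stem_tokens(tokens: list, stem_dict: dict = STEM_DICT) -> list:
    """Stem every token of an already tokenized tweet."""
    return [stem(tok, stem_dict) for tok in tokens]


roots = set(stem_tokens(list(STEM_DICT)))
print(len(STEM_DICT), "variants map to", len(roots), "root")
```

Unknown words pass through unchanged, so coverage depends entirely on the dictionary, which matches the limitation noted in the conclusions.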

3.4. Vectorization

Vectorization is the process of transforming textual data into numerical representations, as computers primarily work with numerical information. Various vectorization techniques are available for this conversion, including indexing, count vectorizers, N-grams, and term frequencies. This research used the indexing method due to its simplicity and versatility across different grammatical styles. In the indexing method, each distinct word, known as a token, is extracted from the dataset. Subsequently, each unique token is assigned a unique numerical value, an index. For example, in Table 3 the word ‘بھگوڑی’ is assigned the index number ‘17’. Similarly, all the other unique words are given a unique index number, and all the words are replaced with their respective indices. Table 3 shows the sentence discussed in section 3.1.1 in its vectorized form. Any automated model prefers input vectors of a fixed length, while the tweets in the resulting vectorized dataset have different lengths. Therefore, the same padding is applied across the vectorized dataset so that every tweet has a length of 200.

Table 3. Vectorization Example.

https://doi.org/10.1371/journal.pone.0290915.t003
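
The indexing and padding steps can be sketched in a few lines. The sketch assumes that index 0 is reserved for padding and that padding is appended at the end of each sequence; the paper does not state either choice, and the index values produced here depend on token order, so they will not match the value 17 shown in Table 3.

```python
MAX_LEN = 200  # fixed tweet length stated in the text


def build_index(tokenized_tweets: list) -> dict:
    """Assign each distinct token a unique index, starting at 1."""
    index = {}
    for tokens in tokenized_tweets:
        for token in tokens:
            if token not in index:
                index[token] = len(index) + 1  # 0 is kept for padding
    return index


def vectorize(tokens: list, index: dict) -> list:
    """Replace tokens by their indices, dropping unseen tokens."""
    return [index[tok] for tok in tokens if tok in index]


def pad(sequence: list, max_len: int = MAX_LEN, pad_value: int = 0) -> list:
    """Truncate or right-pad a sequence to exactly max_len entries."""
    sequence = sequence[:max_len]
    return sequence + [pad_value] * (max_len - len(sequence))


tweets = [["a", "b"], ["b", "c"]]  # placeholder tokens for illustration
index = build_index(tweets)
padded = [pad(vectorize(t, index)) for t in tweets]
print(index, len(padded[0]), padded[1][:4])
```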

3.5. Deep sequential model

This research used Long Short-Term Memory (LSTM) model [17] as a deep sequential model. An LSTM model takes care of Long-Term Memory (LTM) and Short-Term Memory (STM), as the name suggests. There are four different gates used in an LSTM unit. These gates are:

Forget Gate: The functionality of this gate is to forget less useful or infrequent information inside LTM.
Learn Gate: A current input (an event) and STM are merged to remember the recently learned information from STM and applied to that event.
Remember Gate: As the name suggests, the main functionality of this gate is to remember the previous information up to a certain limit. The LTM information that was not forgotten is merged in this gate with the event and STM, and the result serves as the updated version of LTM.
Use Gate: The functionality of Use Gate is to predict the current event’s output using STM, LTM, and an Event.

A simple LSTM architecture is shown in Fig 2. The boxes in red, yellow, and blue are different parameters and activation functions of the above-mentioned gates.

Fig 2. An LSTM unit.

https://doi.org/10.1371/journal.pone.0290915.g002
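
For reference, the commonly used LSTM formulation can be written as below. Mapping it onto the informal gate names used above is our reading rather than something stated in the paper: the event is the input x_t, STM is the hidden state h_t, and LTM is the cell state c_t. Under that reading, f_t plays the role of the Forget Gate, i_t together with the candidate \tilde{c}_t the Learn Gate, the update of c_t the Remember Gate, and o_t with the final product the Use Gate. Here \sigma is the logistic sigmoid and \odot the element-wise product.

```latex
\begin{align}
f_t &= \sigma\left(W_f\,[h_{t-1}, x_t] + b_f\right) \\
i_t &= \sigma\left(W_i\,[h_{t-1}, x_t] + b_i\right) \\
\tilde{c}_t &= \tanh\left(W_c\,[h_{t-1}, x_t] + b_c\right) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
o_t &= \sigma\left(W_o\,[h_{t-1}, x_t] + b_o\right) \\
h_t &= o_t \odot \tanh\left(c_t\right)
\end{align}
```

The four weight matrices W_f, W_i, W_c, and W_o are the reason the parameter count of an LSTM layer is four times that of a single fully connected block of the same size.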

4. Experimental setup

The proposed methodology was implemented on a Windows 10 operating system, using the Python 3.10 programming language on a Core i7 desktop system with 16 GB of RAM. The LSTM model is trained with the help of the TensorFlow library. The performance measures used for evaluation and testing are accuracy, f1-score, precision, and recall, given in Eqs (1), (3), (4), and (2), respectively, in terms of True Positive (TP), False Positive (FP), False Negative (FN), and True Negative (TN).

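$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{1}$$

$$\text{Recall} = \frac{TP}{TP + FN} \tag{2}$$

$$\text{F1-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{3}$$

$$\text{Precision} = \frac{TP}{TP + FP} \tag{4}$$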

4.1. LSTM architecture

After preprocessing, the cleaned data is given to the LSTM model shown in Fig 3. The input layer has a shape of 200 and is followed by an embedding layer initialized with random weights. The total number of parameters at this layer is 7,150,200. The model then applies an LSTM layer with an output shape of 256 neurons and a total of 467,968 trainable parameters, followed by a dense layer with 32,896 trainable parameters. The final output layer has 2 units, one per label, because the task is a binary classification problem.

Fig 3. Model architecture.

https://doi.org/10.1371/journal.pone.0290915.g003
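
The parameter counts quoted above can be checked with the standard formulas for embedding, LSTM, and dense layers. In the sketch below only the 256 LSTM units come from the text. The embedding dimension of 200, the vocabulary size of 35,751, and the dense layer width of 128 are values we inferred because they reproduce the reported counts exactly; they are not stated in the paper and other layer settings (for example, layers without biases) would give different numbers.

```python
def embedding_params(vocab_size: int, embed_dim: int) -> int:
    """One embed_dim-sized vector per vocabulary entry."""
    return vocab_size * embed_dim


def lstm_params(input_dim: int, units: int) -> int:
    """Four gates, each with input weights, recurrent weights, and a bias."""
    return 4 * (units * (input_dim + units) + units)


def dense_params(inputs: int, units: int) -> int:
    """Weight matrix plus one bias per output unit."""
    return inputs * units + units


LSTM_UNITS = 256     # stated in the text
EMBED_DIM = 200      # inferred: the only value that yields 467,968
VOCAB_SIZE = 35751   # inferred: 7,150,200 / 200
DENSE_UNITS = 128    # inferred: 256 * 128 + 128 = 32,896

print(embedding_params(VOCAB_SIZE, EMBED_DIM))  # 7150200
print(lstm_params(EMBED_DIM, LSTM_UNITS))       # 467968
print(dense_params(LSTM_UNITS, DENSE_UNITS))    # 32896
```

Under the same assumptions, a final 2-unit output layer on top of the 128-unit dense layer would add 258 parameters.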

4.2. Parameter and hyperparameter configuration

Table 4 lists the parameter and hyperparameter configuration of the LSTM model. Early stopping helps the model converge and avoids overfitting the LSTM model. The initial learning rate is set to 0.001. Early stopping monitored the validation accuracy during training, and the patience for stopping the training process was set to 4.

Table 4. Parameters configuration.

https://doi.org/10.1371/journal.pone.0290915.t004
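
The early-stopping rule with a patience of 4 can be illustrated without any deep learning library. The function below is a simplified sketch of the usual patience logic (training stops once the monitored value has failed to improve for `patience` consecutive epochs); it is not the TensorFlow implementation, and the accuracy history is made up for the example.

```python
def early_stopping_epoch(val_accuracy: list, patience: int = 4) -> int:
    """Return the 1-based epoch at which training would stop."""
    best = float("-inf")
    wait = 0
    for epoch, acc in enumerate(val_accuracy, start=1):
        if acc > best:
            best = acc   # improvement: remember it and reset the counter
            wait = 0
        else:
            wait += 1    # no improvement in this epoch
            if wait >= patience:
                return epoch
    return len(val_accuracy)  # stopping criterion never met


# Hypothetical validation accuracies, one value per epoch.
history = [0.60, 0.70, 0.75, 0.74, 0.75, 0.73, 0.74, 0.80]
print(early_stopping_epoch(history))  # 7
```

In this made-up history the best value is reached at epoch 3, the next four epochs do not exceed it, and training stops at epoch 7 before the later improvement is seen, which is the trade-off that the patience setting controls.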

5. Results and discussion

This section gives an overview of the best results achieved with the parameters and hyperparameters given in Table 4. First, the dataset is split into training and test sets with a ratio of 80:20. Before being passed to the proposed model, the training data is further split in an 80:20 ratio into training and validation sets. Moreover, the callbacks from Table 4 are applied during training, based on the validation loss, to save the best weights.
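
The nested 80:20 split can be sketched as follows. The helper below is a plain shuffled split written for illustration; the paper does not say which splitting routine, random seed, or stratification was used, and the 1000 placeholder items are only there to make the resulting proportions easy to read.

```python
import random


def split(items: list, holdout_fraction: float = 0.2, seed: int = 0):
    """Shuffle a copy of items and hold out the given fraction."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_holdout = round(len(items) * holdout_fraction)
    return items[n_holdout:], items[:n_holdout]


data = list(range(1000))                       # placeholder samples
train_val, test = split(data, 0.2, seed=42)    # 80:20 train/test
train, val = split(train_val, 0.2, seed=42)    # 80:20 train/validation
print(len(train), len(val), len(test))         # 640 160 200
```

Applying 80:20 twice leaves 64% of the data for training, 16% for validation, and 20% for testing.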

5.1. Training

Fig 4a shows the graph which depicts the training accuracy versus validation accuracy.

Fig 4. LSTM model performance while training and validation.

https://doi.org/10.1371/journal.pone.0290915.g004

The x-axis represents the number of epochs, and the y-axis shows the accuracy value. The graph shows that the validation accuracy increases with the number of epochs, so the model is converging. There is no overfitting on the training data, as the difference between the training and validation performance measures is negligible. Also, Fig 4b shows that the model keeps learning over the epochs and stops training at epoch 15, after meeting the stopping criteria.

5.2. Testing

The model performance is checked on the testing data after obtaining the optimum weights on the training and validation sets. The confusion matrix calculated on the testing data is given in Table 5, which shows that TP, TN, FP, and FN were 601, 568, 145, and 112, respectively. FN are the cases that were actually threats but were detected as non-threats. Such sentences might be dangerous, so their number should be reduced. The proposed methodology predicts more threats than non-threats, which is desirable for this task.

Table 5. Confusion matrix on testing data.

https://doi.org/10.1371/journal.pone.0290915.t005

Furthermore, the performance measures on the testing dataset are provided in Table 6. The best accuracy, precision, recall, and f1-score achieved by the proposed methodology are 81.97%, 82.04%, 81.97%, and 81.96%, respectively. All the performance measures are almost the same, which shows that the proposed methodology does not over-fit the LSTM architecture.

Table 6. Comparison of classes.

https://doi.org/10.1371/journal.pone.0290915.t006
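
As a worked check, the measures can be recomputed from the confusion matrix counts reported for the testing data (TP = 601, TN = 568, FP = 145, FN = 112). The sketch computes precision, recall, and f1-score per class and then averages the two classes with equal weight. The averaging scheme is an assumption on our part: the paper does not state it, but this choice gives values that agree with the reported 81.97%, 82.04%, 81.97%, and 81.96% to within 0.01 percentage points.

```python
TP, TN, FP, FN = 601, 568, 145, 112  # counts reported for the testing data


def precision_recall_f1(tp: int, fp: int, fn: int):
    """Per-class precision, recall, and f1-score."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


accuracy = (TP + TN) / (TP + TN + FP + FN)

# Threat is the positive class; for the non-threat class the roles of
# the counts swap (its true positives are TN, its false positives are FN).
threat = precision_recall_f1(TP, FP, FN)
non_threat = precision_recall_f1(TN, FN, FP)

# Assumption: unweighted (macro) average over the two classes.
macro = tuple((a + b) / 2 for a, b in zip(threat, non_threat))

print(f"accuracy  {accuracy:.4f}")   # 0.8198
print(f"precision {macro[0]:.4f}")   # 0.8205
print(f"recall    {macro[1]:.4f}")   # 0.8198
print(f"f1-score  {macro[2]:.4f}")   # 0.8197
```

Both classes have 713 test samples (601 + 112 and 568 + 145), so a support-weighted average would give the same numbers as the unweighted one.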

5.3. Comparison with literature

As discussed in section 2, only two papers have worked on detecting threatening language from Urdu data [18, 19]. This research compares its overall method and performance with these state-of-the-art methods from the literature. In the first study, some well-known ML methods, including SVM, LR, and MLP, were trained on features selected from the proposed dataset [18]. That study did not use any preprocessing to improve threat detection, which is not recommended [40]. Similarly, the other study used feature selection and removed stop words from the dataset but did not use any stemming process.

In contrast, the proposed research lets the LSTM model select features and tune the parameters automatically, in addition to removing stop words and stemming words to their respective roots. Table 7 briefly compares the proposed method with the available research, to the best of our knowledge. We have further validated the proposed flow of preprocessing steps by performing experiments without data preprocessing and comparing the accuracy. Fig 5 compares the performance measures of the proposed method and the state-of-the-art literature. Our proposed method outperforms the existing methods by a good margin due to the efficient data preprocessing and the data enhancement through data augmentation. Without the feature selection and data preprocessing steps, the accuracy is lower than that of the available methods. Our proposed methodology has improved accuracy, precision, recall, and f1-score by 7.96%, 11.2%, 6.32%, and 7.97%, respectively, which is a great achievement for the Urdu community.

Fig 5. Performance comparison with the literature.

https://doi.org/10.1371/journal.pone.0290915.g005

Table 7. Methodology comparison between the proposed work and literature.

https://doi.org/10.1371/journal.pone.0290915.t007

For a single forward pass through an LSTM layer, the time complexity is approximately O(T * N^2), where T is the sequence length and N is the number of LSTM units. However, during training, multiple iterations over the data are needed, so the overall complexity is higher.
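
To give a sense of scale, the estimate can be evaluated for the sequence length of 200 and the 256 LSTM units used in this work. The function below is only a back-of-the-envelope count of the dominant term; it ignores constant factors such as the four gates and the input dimension.

```python
def lstm_forward_cost(seq_len: int, units: int) -> int:
    """Dominant term of one forward pass: seq_len * units squared."""
    return seq_len * units ** 2


print(lstm_forward_cost(200, 256))  # 13107200
```

That is roughly 13 million basic operations per tweet and forward pass, before constant factors.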

The runtime complexity of an SVM primarily depends on the number of support vectors, denoted by ’nSV’, and the number of input features, denoted by ’n’. For a linear SVM, the training time complexity is typically O(n * nSV), which makes it efficient for high-dimensional data, although it can be slow if the number of support vectors is very high.

6. Conclusions and future directions

After performing different experiments on noisy, limited, clean, and enhanced data, we conclude that when the data is noisy, the model underfits during training and a DL model does not converge. Data augmentation enhances the data and adds diversity to it, which helps avoid underfitting and overfitting of a model. For any automated model, the data should be clean. Because textual data is often very noisy, it is difficult to train a well-performing model on it without cleaning.

Data preprocessing and cleaning of textual data are essential for better performance and for preventing a model’s underfitting. In this research, we have achieved the best possible results for detecting threats in the Urdu language using the proposed data processing steps with the help of the proposed Urdu stop words list and stemming dictionary. Due to the diverse nature of Urdu, the stemming dictionary is limited to the dataset used. This dictionary can be extended in the future to enhance the performance of any automated model. Moreover, transfer learning-based sequential models can also be applied to the efficiently preprocessed data to enhance the detection rate further.

References

1. Mehdad Y.; Tetreault J. Do characters abuse more than words? In Proceedings of the Proceedings of the 17th annual meeting of the special interest group on discourse and dialogue, 2016, pp. 299–303.
- View Article
- Google Scholar
2. Balakrishnan V.; Khan S.; Fernandez T.; Arabnia H.R. Cyberbullying detection on twitter using Big Five and Dark Triad features. Personality and individual differences 2019, 141, 252–257.
- View Article
- Google Scholar
3. Schmidt A.; Wiegand M. A survey on hate speech detection using natural language processing. In Proceedings of the Proceedings of the fifth international workshop on natural language processing for social media, 2017, pp. 1–10.
- View Article
- Google Scholar
4. Badjatiya, P.; Gupta, S.; Gupta, M.; Varma, V. Deep learning for hate speech detection in tweets. In Proceedings of the Proceedings of the 26th international conference on World Wide Web companion, 2017, pp. 759–760.
5. Wang, X.; Liu, Y.; Sun, C.J.; Wang, B.; Wang, X. Predicting polarities of tweets by composing word embeddings with long short-term memory. In Proceedings of the Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2015, pp. 1343–1353.
6. Xiang, G.; Fan, B.; Wang, L.; Hong, J.; Rose, C. Detecting offensive tweets via topical feature discovery over a large scale twitter corpus. In Proceedings of the Proceedings of the 21st ACM international conference on Information and knowledge management, 2012, pp. 1980–1984.
7. Del Vigna12, F.; Cimino23, A.; Dell’Orletta, F.; Petrocchi, M.; Tesconi, M. Hate me, hate me not: Hate speech detection on facebook. In Proceedings of the Proceedings of the first Italian conference on cybersecurity (ITASEC17), 2017, pp. 86–95.
8. Behzadan, V.; Aguirre, C.; Bose, A.; Hsu, W. Corpus and deep learning classifier for collection of cyber threat indicators in twitter stream. In Proceedings of the 2018 IEEE International Conference on Big Data (Big Data). IEEE, 2018, pp. 5002–5007.
9. Kok S.; Abdullah A.; Jhanjhi N.; Supramaniam M. Ransomware, threat and detection techniques: A review. Int. J. Comput. Sci. Netw. Secur 2019, 19, 136.
- View Article
- Google Scholar
10. Davidson, T.; Warmsley, D.; Macy, M.; Weber, I. Automated hate speech detection and the problem of offensive language. In Proceedings of the Proceedings of the international AAAI conference on web and social media, 2017, Vol. 11, pp. 512–515.
11. Ashraf, N.; Mustafa, R.; Sidorov, G.; Gelbukh, A. Individual vs. group violent threats classification in online discussions. In Proceedings of the Companion Proceedings of the Web Conference 2020, 2020, pp. 629–633.
12. Chakraborty, P.; Seddiqui, M.H. Threat and abusive language detection on social media in bengali language. In Proceedings of the 2019 1st International Conference on Advances in Science, Engineering and Robotics Technology (ICASERT). IEEE, 2019, pp. 1–6.
13. Eder E.; Krieg-Holz U.; Hahn U. At the lower end of language—Exploring the vulgar and obscene side of German. In Proceedings of the Proceedings of the third workshop on abusive language online, 2019, pp. 119–128.
- View Article
- Google Scholar
14. Oostdijk, N.; van Halteren, H. N-gram-based recognition of threatening tweets. In Proceedings of the Computational Linguistics and Intelligent Text Processing: 14th International Conference, CICLing 2013, Samos, Greece, March 24–30, 2013, Proceedings, Part II 14. Springer, 2013, pp. 183–196.
15. Polignano M.; Basile P.; De Gemmis M.; Semeraro G. Hate Speech Detection through AlBERTo Italian Language Understanding Model. In Proceedings of the NL4AI@ AI* IA, 2019, pp. 1–13.
- View Article
- Google Scholar
16. Alakrot A.; Murray L.; Nikolov N.S. Towards accurate detection of offensive language in online communication in arabic. Procedia computer science 2018, 142, 315–320.
- View Article
- Google Scholar
17. Hochreiter S.; Schmidhuber J. Long short-term memory. Neural computation 1997, 9, 1735–1780. pmid:9377276
- View Article
- PubMed/NCBI
- Google Scholar
18. Amjad M.; Ashraf N.; Zhila A.; Sidorov G.; Zubiaga A.; Gelbukh A. Threatening language detection and target identification in Urdu tweets. IEEE Access 2021, 9, 128302–128313.
- View Article
- Google Scholar
19. Mehmood A.; Farooq M.S.; Naseem A.; Rustam F.; Villar M.G.; Rodríguez C.L.; et al. Threatening URDU Language Detection from Tweets Using Machine Learning. Applied Sciences 2022, 12, 10342.
- View Article
- Google Scholar
20. Razavi, A.H.; Inkpen, D.; Uritsky, S.; Matwin, S. Offensive language detection using multi-level classification. In Proceedings of the Advances in Artificial Intelligence: 23rd Canadian Conference on Artificial Intelligence, Canadian AI 2010, Ottawa, Canada, May 31–June 2, 2010. Proceedings 23. Springer, 2010, pp. 16–27.
21. Park, J.H.; Fung, P. One-step and two-step classification for abusive language detection on twitter. arXiv preprint arXiv:1706.01206 2017.
22. Chen, Y.; Zhou, Y.; Zhu, S.; Xu, H. Detecting offensive language in social media to protect adolescent online safety. In Proceedings of the 2012 International Conference on Privacy, Security, Risk and Trust and 2012 International Confernece on Social Computing. IEEE, 2012, pp. 71–80.
23. Zampieri, M.; Malmasi, S.; Nakov, P.; Rosenthal, S.; Farra, N.; Kumar, R. Predicting the type and target of offensive posts in social media. arXiv preprint arXiv:1902.09666 2019.
24. Rani P.; Ojha A.K. KMI-coling at SemEval-2019 task 6: exploring N-grams for offensive language detection. In Proceedings of the Proceedings of the 13th International Workshop on Semantic Evaluation, 2019, pp. 668–671.
- View Article
- Google Scholar
25. Lee H.S.; Lee H.R.; Park J.U.; Han Y.S. An abusive text detection system based on enhanced abusive and non-abusive word lists. Decision Support Systems 2018, 113, 22–31.
- View Article
- Google Scholar
26. Ishisaka, T.; Yamamoto, K. Detecting nasty comments from BBS posts. In Proceedings of the Proceedings of the 24th Pacific Asia Conference on Language, Information and Computation, 2010, pp. 645–652.
27. Ptaszynski, M.E.; Masui, F. Automatic cyberbullying detection: Emerging research and opportunities: Emerging research and opportunities 2018.
28. Zhao Y.; Zhang Y. Comparison of decision tree methods for finding active objects. Advances in Space Research 2008, 41, 1955–1959.
- View Article
- Google Scholar
29. Özel, S.A.; Saraç, E.; Akdemir, S.; Aksu, H. Detection of cyberbullying on social media messages in Turkish. In Proceedings of the 2017 International Conference on Computer Science and Engineering (UBMK). IEEE, 2017, pp. 366–370.
30. Mishra, P.; Del Tredici, M.; Yannakoudakis, H.; Shutova, E. Abusive language detection with graph convolutional networks. arXiv preprint arXiv:1904.04073 2019.
31. Lee, Y.; Yoon, S.; Jung, K. Comparative studies of detecting abusive language on twitter. arXiv preprint arXiv:1808.10245 2018.
32. Sigurbergsson, G.I.; Derczynski, L. Offensive language and hate speech detection for Danish. arXiv preprint arXiv:1908.04531 2019.
33. Burnap P.; Williams M.L. Us and them: identifying cyber hate on Twitter across multiple protected characteristics. EPJ Data science 2016, 5, 1–15. pmid:32355598
- View Article
- PubMed/NCBI
- Google Scholar
34. Gómez-Adorno H.; Enguix G.B.; Sierra G.; Sánchez O.; Quezada D. A Machine Learning Approach for Detecting Aggressive Tweets in Spanish. In Proceedings of the IberEval@ SEPLN, 2018, pp. 102–107.
- View Article
- Google Scholar
35. Pelle R.; Alcântara C.; Moreira V.P. A classifier ensemble for offensive text detection. In Proceedings of the Proceedings of the 24th Brazilian Symposium on Multimedia and the Web, 2018, pp. 237–243.
- View Article
- Google Scholar
36. Hasib K.M., et al., Depression Detection From Social Networks Data Based on Machine Learning and Deep Learning Techniques: An Interrogative Survey. IEEE Transactions on Computational Social Systems, 2023, 1568–1586.
- View Article
- Google Scholar
37. Hasib K.M., et al., Strategies for enhancing the performance of news article classification in Bangla: Handling imbalance and interpretation. Engineering Applications of Artificial Intelligence, 2023. 125: p. 106688.
- View Article
- Google Scholar
38. Hasib K.M., et al., Bmnet-5: A novel approach of neural network to classify the genre of bengali music based on audio features. IEEE Access, 2022. 10: p. 108545–108563.
- View Article
- Google Scholar
39. Febriana, T.; Budiarto, A. Twitter dataset for hate speech and cyberbullying detection in Indonesian language. In Proceedings of the 2019 International Conference on Information Management and Technology (ICIMTech). IEEE, 2019, Vol. 1, pp. 379–382.
40. Duong H.T.; Nguyen-Thi T.A. A review: preprocessing techniques and data augmentation for sentiment analysis. Computational Social Networks 2021, 8, 1–16.
- View Article
- Google Scholar

