Abstract
The analysis of Arabic Twitter data sets is a highly active research topic, particularly since the outbreak of COVID-19 and subsequent attempts to understand public sentiment related to the pandemic. This activity is partially driven by the high number of Arabic Twitter users, around 164 million. Word embedding models are a vital tool for analysing Twitter data sets, as they are one of the essential methods for transforming words into numbers that can be processed by machine learning (ML) algorithms. In this work, we introduce a new model, Arab2Vec, that can be used in Twitter-based natural language processing (NLP) applications. Arab2Vec was constructed using a vast data set of approximately 186,000,000 tweets from 2008 to 2021 from all Arabic Twitter sources. This makes Arab2Vec the most up-to-date word embedding model available to researchers for Twitter-based applications. The model is compared with existing models from the literature. The reported results demonstrate superior performance in the number of recognised words and in F1 score on classification tasks with known data sets, as well as the ability to work with emojis. We also incorporate skip-grams with negative sampling, an approach that other Arabic models have not previously used. Nine versions of Arab2Vec are produced; these models differ in available features, the number of words trained on, speed, and so on. This paper provides Arab2Vec as an open-source project for users to employ in research. It describes the data collection methods, the data pre-processing and cleaning steps, the effort to build these nine models, and experiments to validate them qualitatively and quantitatively.
Citation: Hamdy A, Youssef A, Ryan C (2025) Arab2Vec: An Arabic word embedding model for use in Twitter NLP applications. PLoS One 20(8): e0328369. https://doi.org/10.1371/journal.pone.0328369
Editor: Junaid Rashid, Sejong University, KOREA, REPUBLIC OF
Received: January 12, 2023; Accepted: June 29, 2025; Published: August 29, 2025
Copyright: © 2025 Hamdy et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All our models are available from this repository: https://github.com/Abdelrahmanrezk/AraETEWordVec.
Funding: The author(s) received no specific funding for this work.
Competing interests: The authors have declared that no competing interests exist.
Introduction
Arabic Twitter analysis is an important research topic because Twitter contains a vast number of Arabic tweets, with an estimated 164 million monthly active users constantly adding to the total [15]. However, working with the Arabic language is challenging for two reasons: first, there are relatively few data sets and pre-trained models, and second, the language itself is more complex to work with than Western languages. For example, Arabic uses diacritics that can change the meaning of a sentence: سكر means sugar, but with diacritics, سَكَرَ means “got drunk”.
Word embedding models, such as the popular and successful Word2Vec model [3], convert words into vectors of numbers suitable for ML models. These vectors capture the semantic properties of words, which helps ML models perform better on their tasks. The models require training on huge text data sets to extract hidden relationships between the words in the text, enabling them to produce embeddings that help integrate text data into ML and deep learning applications. This has many applications in NLP, such as sentiment analysis [2], textual entailment [14], information retrieval [5], and question answering [4].
In this work, we propose a novel open-source Arabic word embedding model trained on a large dataset of approximately 186 million tweets. Our model compares favourably with existing models in the literature, achieving a vocabulary size of 2,027,042 words, significantly higher than the previous best of 1,476,715. A key strength of our model is its ability to recognise words that earlier models miss, particularly those related to COVID-19, as it is trained on more recent data. Additionally, it effectively handles emojis, treating them interchangeably with words, and recognises English words commonly found in Arabic tweets, such as “it” or “A+”, which prior models typically ignore. We evaluate the model on two datasets: a COVID-related dataset [17] and the Arabic Sentiment Tweets Dataset (ASTD) [13]. In summary, our model addresses several challenges that earlier Arabic word embeddings faced, including limited vocabulary, poor emoji handling, and an inability to process embedded English or COVID-19-related terms, offering significant improvements across all these areas.
The proposed model has several advantages over other models in the literature, as can be seen in Table 1. These can be summarised as follows:
- It is an open-source model trained on 186M tweets and can therefore recognise more words (2,027,042, approximately 33% more) than previous models;
- It recognises COVID-19-related words and emojis with higher accuracy as it was trained on a newer data set;
- It has a variant trained using negative sampling; as the results show, this variant achieves higher accuracy than previous models;
- It outperforms the two state-of-the-art models from the literature (AraVec [1], ArWordVec [9]) on the two tested data sets (COVID-19, ASTD).
The paper is organised as follows. The Literature Survey section reviews related work, and the Background Information section describes the background knowledge needed to understand the proposed research. The Proposed Methodology section then describes the main steps of the proposed framework, and the following section discusses the computational effort involved in those steps. The next two sections provide, respectively, a qualitative and a quantitative comparison between the proposed model and two well-known models from the literature, using publicly available data. The Discussion section summarises the main findings of the work, and the final section draws conclusions and outlines future work.
Literature survey
Here, we briefly introduce work related to our proposal, along with some interesting applications from the literature. In [1], the authors proposed a pre-trained word embedding model for the Arabic context using the Word2Vec model. The trained models are general, distributed word embeddings trained on text data collected from the Internet, Wikipedia, and Twitter; the authors proposed six different models across the three Arabic content domains. The Twitter model was trained on 77,600,000 Arabic tweets from 2016 to 2018. This model is used for comparison in our work; we refer to it as the AraVec model. A set of word embedding models built specifically for Twitter data was introduced by [9]. The authors compared their models with previous models from the literature and demonstrated superior performance on a word similarity task. We also compare our model with this set of models, which we refer to as ArWordVec. ArWordVec was also tested on a multi-class sentiment analysis task, where AraVec achieved a best accuracy of 68.93 and ArWordVec a best accuracy of 69.93. However, as our comparisons will show, ArWordVec can only recognise a relatively small number of words.
A multilingual word embedding model was introduced by [10]. This was trained with more than 100 languages, using their corresponding Wikipedia sites. The model showed superior performance compared to the then state-of-the-art English models. In [11], the authors trained an embedding model on 157 languages and evaluated it on ten languages. The results show good performance when compared with previous models. ArbEngVec [8] is a bilingual English-Arabic word embedding model. The model was trained on an enormous data set of 93 million sentence pairs extracted from the Open Parallel Corpus Project (OPUS), containing 90 languages and more than 2.7 billion parallel sentences.
The model was successfully evaluated using both intrinsic and extrinsic evaluations, where intrinsic evaluation tests the quality of the model in general, independent of a specific task, while extrinsic evaluation tests model quality when undertaking a specific NLP task. In [6], a word embedding model for medical and health applications in Arabic was introduced. Three trained models (Word2Vec, fastText, and Glove) were compared and evaluated, with the results showing a superior performance of Word2Vec and fastText [16] over the Glove model [7].
These pre-trained models have been used in various NLP applications. In [19], a convolutional neural network (CNN) was trained using the Twitter continuous bag of words model (AraVec) for an Arabic tweet sentiment classification task. In [20], AraVec models generate the feature vectors for words (tokens from tweets); these feature vectors are used to train a long short-term memory (LSTM) model for emotion analysis of Arabic tweets. When compared to a Support Vector Machine (SVM), a Random Forest (RF), and a fully connected deep neural network, this technique produced the best performance, improving the validation result by 9% over the previous best SVM result. In [21], the authors used AraVec and ArWordVec for text classification of Arabic social media tweets; in this application, ArWordVec achieved higher performance than AraVec. The authors employed three machine learning classification algorithms: RF, SVM, and Gaussian Naive Bayes. In [25], 26 different text pre-processing techniques were applied to Arabic tweets before building a classifier to identify tweets containing health-related information. The authors’ analysis aimed to determine how pre-processing methods influence traditional algorithms; they found that most strategies did not improve the accuracy of the base model. They used different AraVec and ArWordVec models and tested them on two data sets. In [24], the authors used the pre-trained ArWordVec to build their methodology for sentiment analysis of the Arabic language.
Background information
Word2Vec is a word representation method with numerous uses in NLP applications. Initially proposed by Mikolov and his team at Google, Word2Vec converts each input token (word) to a real-valued vector. Similar words that appear in the same context are assigned similar vectors (similar representations). Word2Vec is a neural network model that takes tokens as input and produces the vector representations. Similarities between word representations yield semantic features, and semantic relationships are frequently preserved under vector operations on word vectors. For instance, the vector for King minus the vector for Man plus the vector for Woman is close to the vector for Queen [18].
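The King/Queen analogy can be made concrete with hand-picked toy vectors and cosine similarity. The vectors below are invented for illustration only; a real Word2Vec model learns such vectors (typically 100-300 dimensions) from the corpus.

```python
import numpy as np

# Toy 3-dimensional vectors chosen by hand purely to illustrate the idea.
vectors = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.2, 0.9, 0.1]),
    "woman": np.array([0.2, 0.1, 0.9]),
    "apple": np.array([0.1, 0.5, 0.5]),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# king - man + woman should land closest to queen
target = vectors["king"] - vectors["man"] + vectors["woman"]
best = max(
    (w for w in vectors if w not in {"king", "man", "woman"}),
    key=lambda w: cosine(target, vectors[w]),
)
print(best)  # queen
```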
In this paper, we use three variants of the Word2Vec model. The following subsections briefly discuss the differences between them. Fig 1 shows a block diagram of the Skip-gram and CBOW algorithms.
Continuous Bag of Words (CBOW)
In this model, we train a neural network to predict a target word from its neighbouring words, essentially predicting the word from context. The neighbouring words could be a single word or several words. The order of context words makes no difference, hence the name bag of words. A limitation of CBOW is that it gives all context words equal importance when making a prediction [22]. However, this is not the case in practice since some words have a higher predictive value.
Skip-Gram (SG)
An alternative is the Skip-gram model, essentially the opposite of CBOW: it tries to predict the surrounding context words from the input (centre) word.
CBOW is generally faster to train than Skip-gram, but Skip-gram can work better with small data sets [3].
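The difference between the two training objectives can be illustrated by the (input, output) pairs each one generates from a sentence. This is an illustrative sketch of pair generation, not any library's internals:

```python
def training_pairs(tokens, window=2, mode="cbow"):
    """Generate (input, output) training pairs from a token list.

    CBOW:      input = the list of context words, output = the centre word.
    Skip-gram: input = the centre word, output = one context word per pair.
    """
    pairs = []
    for i, target in enumerate(tokens):
        context = [tokens[j]
                   for j in range(max(0, i - window),
                                  min(len(tokens), i + window + 1))
                   if j != i]
        if mode == "cbow":
            pairs.append((context, target))
        else:  # skip-gram
            pairs.extend((target, c) for c in context)
    return pairs

sent = ["the", "cat", "sat", "on", "the", "mat"]
print(training_pairs(sent, window=1, mode="cbow")[0])  # (['cat'], 'the')
print(training_pairs(sent, window=1, mode="sg")[:2])   # [('the', 'cat'), ('cat', 'the')]
```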
Skip-Gram With Negative Sampling (SGNS)
Skip-gram models can take a long time to train because of the large number of neuron weights that must be updated. To mitigate this, the Skip-gram creators proposed a technique called “negative sampling”. In a typical neural network, training modifies all of the weights for each training sample; with negative sampling, only a tiny portion of the weights is modified per sample.
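In the original Word2Vec implementation, the “negative” words are drawn from the unigram (word frequency) distribution raised to the 3/4 power, which slightly down-weights very frequent words. A minimal sketch of that sampling distribution:

```python
from collections import Counter

def negative_sampling_dist(tokens, power=0.75):
    """Unigram distribution raised to the 3/4 power, as used by Word2Vec
    to pick negative words (frequent words are down-weighted slightly)."""
    counts = Counter(tokens)
    weights = {w: c ** power for w, c in counts.items()}
    total = sum(weights.values())
    return {w: v / total for w, v in weights.items()}

corpus = ["the"] * 16 + ["cat"] * 1
dist = negative_sampling_dist(corpus)
# 'the' is 16x more frequent but is sampled only 16**0.75 = 8x more often
print(round(dist["the"] / dist["cat"], 2))  # 8.0
```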
Proposed methodology
Fig 2 shows the proposed framework of this paper.
Data collection
Twitter data has become increasingly important and attractive to NLP researchers in recent years.
This is partly because tweets can easily be queried using a Twitter developer account, and because they carry tags for location and other user information. Our data set comprises 377,000,000 tweets collected between 2008 and 2021 from all Arabic Twitter sources. We select tweets by examining the language metadata that Twitter attaches to each tweet (its language field). The data is used for research purposes only, and the data analysis complies with Twitter’s terms and conditions. We process this data set to produce useful training data for the models.
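Selecting tweets by their language metadata can be sketched as below. The `lang` field name and the record layout are assumptions for illustration (matching the Twitter API's usual metadata), not the authors' actual code:

```python
def filter_arabic(tweets):
    """Keep only tweets whose language metadata marks them as Arabic."""
    return [t for t in tweets if t.get("lang") == "ar"]

sample = [
    {"id": 1, "lang": "ar", "text": "مرحبا"},   # "hello" in Arabic
    {"id": 2, "lang": "en", "text": "hello"},
]
print(filter_arabic(sample))  # keeps only tweet 1
```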
Data pre-processing
Pre-processing the data is a crucial step in any NLP model to prepare the data for the word embedding model to process. The pre-processing steps we undertook included the following:
Remove duplicate Tweets and filter Tweets
We first filtered the tweets to remove duplicates, as they increased the model training time without benefit. This reduced the data set size from 377,000,000 to 186,000,000 tweets.
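Deduplication of this kind can be sketched as a single order-preserving pass over the tweet texts:

```python
def deduplicate(tweets):
    """Drop exact duplicate tweet texts, keeping first-occurrence order."""
    seen = set()
    unique = []
    for text in tweets:
        if text not in seen:
            seen.add(text)
            unique.append(text)
    return unique

print(len(deduplicate(["a", "b", "a", "c", "b"])))  # 3
```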
Cleaning Arabic text
There were also several tasks specific to Arabic text that were required to clean the text:
- Replace URLs with a single Arabic word. Some tweets contain a URL; these URLs are replaced with رابطويب, which is Arabic for “web site link”.
- Replace mentions with a single Arabic word. A mention on Twitter is a reference to a person or a brand; the embedding model has nothing to learn from these, so we replace them with حسابشخصي, an Arabic word meaning “personal account”.
- Remove diacritics (Tashkeel). Arabic uses diacritics to change the pronunciation of words. We found that most tweets do not contain diacritics, so we removed them before training the model.
- Remove characters repeated more than two times sequentially. Some tweets contain repeated characters, such as هههههههه and سلااااااام, meaning “hhhhhhh” (akin to lol, an indication of amusement or laughter) and “peace”, respectively. These are edited so that if a character is repeated more than two times in a word, all but the first occurrence are removed.
- Separate numbers attached to words. Some tweets contain a word with a number attached to it; in this step, we remove any numbers connected to a word.
- Reduce emojis repeated sequentially. Some Twitter users use the same emoji several times in sequence to indicate a particularly strong emotion; this step removes that repetition.
- Remove extraneous spaces. Some tweets contain more than one space in sequence; in this step, we remove the repetition.
After these steps, tokenisation is applied using the Treebank tokeniser from Python’s NLTK library.
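The cleaning steps above can be sketched with regular expressions. The exact patterns the authors used are not published here, so these regexes are illustrative; the Unicode range `\u064B-\u0652` covers the common Tashkeel marks:

```python
import re

DIACRITICS = re.compile(r"[\u064B-\u0652]")          # tashkeel marks
URL = re.compile(r"https?://\S+")
MENTION = re.compile(r"@\w+")
REPEATS = re.compile(r"(.)\1{2,}", flags=re.DOTALL)  # a char repeated 3+ times

def clean_tweet(text):
    text = URL.sub("رابطويب", text)        # Arabic for "web site link"
    text = MENTION.sub("حسابشخصي", text)   # Arabic for "personal account"
    text = DIACRITICS.sub("", text)
    text = REPEATS.sub(r"\1", text)        # keep only the first occurrence
    return re.sub(r"\s+", " ", text).strip()  # collapse extraneous spaces

print(clean_tweet("سلااااااام @user https://t.co/x"))
```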
Computational effort in these steps
This section discusses the time consumed in each of the previous steps.
The training took around four days per model, varying from 75 to 85 hours each; we produced nine models in total. These comprise three tri-gram models (CBOW, Skip-gram, and Skip-gram with negative sampling) with a word count threshold (the number of words examined when calculating a context) of 300; another three tri-gram models (CBOW, Skip-gram, Skip-gram with negative sampling) with a word count threshold of 100; and three uni-gram models (CBOW, Skip-gram, Skip-gram with negative sampling) with a word count threshold of 100. Tri-gram models process text three words at a time, while uni-gram models do so one word at a time. We use tri-grams because they allow the model more freedom in training by not limiting it to smaller n-grams. We provide models of different sizes and features so that the Arabic NLP research community can choose whichever suits a given application. (The source code and all models are downloadable from the following link: https://github.com/Abdelrahmanrezk/AraETEWordVec)
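The nine variants can be enumerated as configuration tuples. The tuple layout below is a sketch of the combinations described above, not the authors' training script:

```python
from itertools import product

# Three training algorithms x three (n-gram, threshold) variants = 9 models.
ALGORITHMS = ["cbow", "skip-gram", "skip-gram-negative-sampling"]
VARIANTS = [("tri-gram", 300), ("tri-gram", 100), ("uni-gram", 100)]

configs = [(algo, ngram, threshold)
           for (ngram, threshold), algo in product(VARIANTS, ALGORITHMS)]
print(len(configs))  # 9 models in total
```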
Model evaluation and comparison
We compare the performance of our model and that of the AraVec and ArWordVec models in four ways:
- The total number of words recognised, which depends on the number of tweets that the model is trained on;
- Word similarity, which shows what words the model finds similar (and with what similarity score) to an input word;
- Clustering of positive and negative words (in this step, we test the model with a subset of words that can be clustered into two groups, negative and positive, and we draw the output representation from the model to see if similar words are getting representations that are close to each other or not);
- Clustering of named entities (in this step, we choose a random set of named entities that fall under different categories, such as person and country names, and see if the model can cluster them into similar groups).
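The two clustering tests require projecting high-dimensional word vectors down to two dimensions for plotting. The authors' exact projection method is not specified here; a minimal PCA projection via numpy's SVD is one common way to produce such plots:

```python
import numpy as np

def project_2d(vectors):
    """Project word vectors to 2-D with PCA (via SVD) so that clusters
    of similar words can be drawn and inspected visually."""
    X = np.asarray(vectors, dtype=float)
    X = X - X.mean(axis=0)                 # centre the data
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    return X @ vt[:2].T                    # coordinates in the top-2 PCs

# Two hand-made 'clusters' of 4-D vectors stand in for real embeddings.
emb = [[1, 1, 0, 0], [1.1, 0.9, 0, 0], [0, 0, 1, 1], [0, 0.1, 1.1, 1]]
coords = project_2d(emb)
print(coords.shape)  # (4, 2)
```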
Number of words recognized
We initially compare our model with the others based on the number of recognised words. As Table 2 shows, our model recognises many more words than the others.
Word similarity
In this section, we examine the performance of each model when trying to suggest similar words to a random selection of words. We start with the word خريج, which means graduate in English. Fig 3 shows the results obtained from each of the three models. In Fig 3, the bar on the left shows how similar the model believed a particular suggested word is, while the word itself is next (in Arabic), followed by its English translation.
The results show that all three models successfully recognise the word. Both ArWordVec and AraVec produce five words that are syntactically related to it, like خريجين, which means graduates. Arab2Vec, on the other hand, produces more words that are semantically related, like تخصص (speciality), كليه (college) and جامعه (university). The second word used to test the three models is بنغازي (Benghazi), a city in Libya. Fig 4 shows the words most similar to Benghazi. The word was not recognised by the ArWordVec model, while the AraVec and Arab2Vec models produced similar results. Next, we test the models using the pink flower emoji; Fig 5 shows the emojis most similar to it.
The ArWordVec model was not able to recognise the pink flower emoji, while the AraVec model produces words that are not relevant to it. The Arab2Vec model, on the other hand, successfully produced several emojis similar in meaning to the pink flower emoji. Next, we test the models with the rolling-on-the-floor-laughing emoji, as shown in Fig 6. Recall that “hh” is often used in Arabic to denote amusement or laughter. Arab2Vec is the only model that recognises this emoji with good accuracy; the other two models (AraVec and ArWordVec) failed to recognise it.
Clustering of positive and negative words
This exercise tests whether our model can capture the similarities between words from different clusters. In this test, introduced by [1], we use a small subset of words from two clusters (negative and positive) and check whether the model captures the similarities between them. The words used are provided in Table 3:
Fig 7 shows the clustering of positive and negative words for the AraVec model; the arrangement of the words indicates that no meaningful clustering has taken place. Fig 8 shows the corresponding results for the Arab2Vec model; here the positioning of the words indicates two clear clusters, showing that Arab2Vec clusters the words correctly into two classes. ArWordVec could not cluster these words because it was trained on fewer words.
Clustering of named entities
In this section, we compare the models using different named entities: names of countries, people, months, social media platforms, organisation types, internet-related devices, electronics companies, and military equipment and vehicle types. Several words are chosen randomly to represent each entity, and the same words are used with every model to ensure a fair comparison. Table 4 lists the words used in this experiment and their entity categories.
Fig 9 shows the clustering of named entities results from the Arab2Vec model. In contrast, Fig 10 shows the clustering of named entities results from the AraVec model.
It can be seen from the figures that Arab2Vec clustered the named entities into separate regions, whereas in the AraVec results the two clusters of electronics companies and internet devices overlap and cannot be separated by that model. Table 4 shows the named entities used in this experiment along with their Arabic translations.
The results could not be compared with the ArWordVec model because it failed to recognise some of the words, owing to its smaller vocabulary.
Experimental comparison of the models
In this section, we test our proposed models experimentally to calculate the accuracy difference between the models proposed in the literature and our model. These comparisons are conducted using publicly available datasets. Note that it is not possible for us to redistribute the training data as this would violate copyright.
COVID-19 data set classification
In this problem, we use 10,000 tweets classified as COVID-19-related and non-COVID-19-related tweets; this data set [26] is freely available for download from (https://github.com/SarahAlqurashi/COVID-19-Arabic-Tweets-Dataset).
In this experiment, we use a set of ML models (logistic regression, support vector classification (SVC), AdaBoost and Gradient Boost (GB)) and compare their F1 scores, taking the maximum F1 score achieved by each. These experiments were conducted with each of Arab2Vec, AraVec and ArWordVec. The AdaBoost algorithm usually yields the highest F1 score, as shown in Table 5. We also employ a deep learning model (an LSTM with batch normalisation; its hyper-parameters are shown in Table 5). Table 6 compares the different models: the second column shows the F1 scores of the ML model (AdaBoost), while the third column shows the F1 scores of the deep learning models. Table 7 compares the F1 scores of the best-performing ML or DL models on the COVID data set.
The results show that our model achieves higher accuracy in classifying COVID-19 data sets with both ML (AdaBoost) and deep learning models.
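The F1 score used throughout these comparisons is the harmonic mean of precision and recall; a minimal sketch for the binary case:

```python
def f1_score_binary(y_true, y_pred):
    """F1 = harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(f1_score_binary([1, 1, 0, 0], [1, 0, 0, 1]))  # 0.5
```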
ASTD classification
The Arabic Sentiment Tweets Dataset (ASTD) is a data set of four classes [13] (objective, subjective positive, subjective negative and subjective neutral). This data set is also freely downloadable (https://github.com/mahmoudnabil/ASTD). As in the previous experiment, we first used the same set of machine learning models (logistic regression (LR), support vector classification (SVC), AdaBoost and Gradient Boost (GB)) and compared their F1 scores. Table 8 shows that AdaBoost gives the best F1 score, so we use it in our comparisons; the class distribution of the tweets is also shown in Table 8. Table 9 compares each embedding model’s machine learning and deep learning F1 scores: the first column compares AdaBoost’s F1 score under each embedding model, while the second compares the deep learning models (LSTM with batch normalisation). Table 10 compares the F1 scores of the best-performing ML or DL models on the ASTD data set.
The results show that our model performs better than the others when training DL models. However, the ArWordVec CBOW model achieves higher accuracy with the ML models.
Discussion
Word embedding models are key components of most NLP applications. Building Arabic embedding models presents unique challenges due to the complexities of the Arabic language. In this work, a novel Arabic word embedding model is proposed. The proposed word embedding models are trained on 186 million tweets, resulting in a model that recognises 33% more words than previously available models in the literature. Furthermore, the proposed model can recognise COVID-related words and emojis, providing an additional advantage over prior models. This work offers NLP researchers and practitioners nine distinct models of varying sizes tailored for different Arabic NLP applications. One of these models is trained using negative sampling, representing a novel contribution that has not been previously available in the literature. The proposed models are evaluated in four ways: the total number of recognised words, word similarity, clustering of positive and negative words, and clustering of named entities. Experimental results indicate that the proposed models demonstrate significant performance improvements over existing models such as AraVec and ArWordVec.
An alternate approach would be to employ a Large Language Model (LLM) to perform this task. In particular, LLMs’ ability to create context-aware representations is attractive. However, a dedicated word embedding model like the one proposed here can reasonably be expected to be more efficient and resource-friendly than fine-tuning a large language model for word embedding tasks. Future work will explore this trade-off further.
Conclusion
A novel word embedding model for Arabic, Arab2Vec, is introduced. The model is trained on a vast dataset and demonstrates superior performance compared to the state-of-the-art models, AraVec and ArWordVec.
The Arab2Vec model was evaluated and compared against AraVec and ArWordVec in four aspects: the total number of recognised words, word similarity, clustering of positive and negative words, and clustering of named entities. Additionally, the model was tested experimentally on two distinct datasets: COVID-19 and ASTD.
Arab2Vec offers several advantages over existing models. First, it can recognise almost two million words, compared to the previous high of 1,476,715. Second, it effectively handles new terms introduced on Twitter, such as COVID-19-related vocabulary. Third, unlike other models in the literature, it can process emojis, broadening its utility in social media analysis.
Another key contribution of this work is including models trained using negative sampling, which has not been previously introduced in AraVec or ArWordVec. Arab2Vec is being made available to the NLP and AI communities for research purposes, with all source code and models offered freely.
We plan to train additional word embedding models, such as GloVe and fastText, and compare our approach with alternative methods, including transformer-based ones.
References
- 1. Soliman AB, Eissa K, El-Beltagy SR. Aravec: a set of Arabic word embedding models for use in Arabic NLP. Procedia Computer Science. 2017;117:256–65.
- 2. Heikal M, Torki M, El-Makky N. Sentiment analysis of Arabic tweets using deep learning. Procedia Computer Science. 2018;142:114–22.
- 3. Mikolov T, Chen K, Corrado G, Dean J. Efficient estimation of word representations in vector space. arXiv preprint 2013. https://arxiv.org/abs/1301.3781
- 4. Othman N, Faiz R, Smaïli K. Enhancing question retrieval in community question answering using word embeddings. Procedia Computer Science. 2019;159:485–94.
- 5. El Mahdaouy A, El Alaoui SO, Gaussier E. Improving Arabic information retrieval using word embedding similarities. International Journal of Speech Technology. 2018;21(1):121–36.
- 6. Habib M, Faris MF, Alomari A, Faris H. AltibbiVec: a word embedding model for medical and health applications in the Arabic language. IEEE Access. 2021;9:133875–88.
- 7. Pennington J, Socher R, Manning CD. GloVe: Global vectors for word representation. In: EMNLP. 2014. p. 1532–43.
- 8. Lachraf R, Ayachi Y, Abdelali A, Schwab D. ArbEngVec: Arabic-English cross-lingual word embedding model. In: The Fourth Arabic NLP Workshop. 2019.
- 9. Fouad MM, Mahany A, Aljohani N, Abbasi RA, Hassan S-U. ArWordVec: efficient word embedding models for Arabic tweets. Soft Computing. 2020;24(11):8061–8.
- 10. Al-Rfou R, Perozzi B, Skiena S. Polyglot: distributed word representations for multilingual NLP. arXiv preprint 2013. https://arxiv.org/abs/1307.1662
- 11. Grave E, Bojanowski P, Gupta P, Joulin A, Mikolov T. Learning word vectors for 157 languages. arXiv preprint 2018. https://arxiv.org/abs/1802.06893
- 12. Alayba AM, Palade V, England M, Iqbal R. Improving sentiment analysis in Arabic using word representation. In: IEEE ASAR. 2018. p. 13–8.
- 13. Nabil M, Aly M, Atiya A. ASTD: Arabic sentiment tweets dataset. In: EMNLP. 2015. p. 2515–9.
- 14. Almarwani N, Diab M. Arabic textual entailment with word embeddings. In: Third Arabic NLP Workshop. 2017. p. 185–90.
- 15. Abdelali A, Mubarak H, Samih Y, Hassan S, Darwish K. QADI: Arabic dialect identification in the wild. In: Sixth Arabic NLP Workshop. 2021. p. 1–10.
- 16. Bojanowski P, Grave E, Joulin A, Mikolov T. Enriching word vectors with subword information. TACL. 2017;5:135–46.
- 17. Hamdy A, Youssef A, Ryan C. Arabic hands-on analysis, clustering and classification of a large Arabic Twitter dataset on COVID-19. Int J Simul-Syst Sci Technol. 2021;22(1):6-1.
- 18. Al-Azani S, El-Alfy E-SM. Using word embedding and ensemble learning for highly imbalanced data sentiment analysis in short Arabic text. Procedia Computer Science. 2017;109:359–66.
- 19. Dahou A, Elaziz MA, Zhou J, Xiong S. Arabic sentiment classification using convolutional neural network and differential evolution algorithm. Comput Intell Neurosci. 2019;2019:2537689. pmid:30936911
- 20. Khalil EA, Hakim EMF, El Houby HK. Deep learning for emotion analysis in Arabic tweets. Journal of Big Data. 2021;8(1):1–15.
- 21. Alzanin SM, Azmi AM, Aboalsamh HA. Short text classification for Arabic social media tweets. Journal of King Saud University - Computer and Information Sciences. 2022.
- 22. Sonkar S, Waters AE, Baraniuk RG. Attention word embedding. arXiv preprint 2020. https://arxiv.org/abs/2006.00988
- 23. Balla HAMN, Llorens Salvador M, Delany SJ. Arabic medical community question answering using ON-LSTM and CNN. In: ICMLC. 2022. p. 298–307.
- 24. Elsamadony O, Keshk A, Abdelatey A. Sentiment analysis for Arabic language using word embedding. In: ICENCO. 2021. p. 51–6.
- 25. Albalawi Y, Buckley J, Nikolov NS. Investigating the impact of pre-processing techniques and pre-trained word embeddings in detecting Arabic health information on social media. J Big Data. 2021;8(1):95. pmid:34249602
- 26. Alqurashi S, Alhindi A, Alanazi E. Large Arabic Twitter dataset on COVID-19. arXiv preprint 2020. https://arxiv.org/abs/2004.04315