Deep learning-based idiomatic expression recognition for the Amharic language

Idiomatic expressions are built into all languages and are common in ordinary conversation. Idioms are difficult to understand because their meanings cannot be deduced directly from the words that compose them. Previous studies reported that idiomatic expressions affect many natural language processing tasks in the Amharic language. However, most natural language processing models used with the Amharic language, such as machine translation, semantic analysis, sentiment analysis, information retrieval, question answering, and next-word prediction, do not consider idiomatic expressions. In this paper, we therefore propose a convolutional neural network (CNN) with a FastText embedding model for detecting idioms in Amharic text. To test the proposed model's performance, we collected 1700 idiomatic and 1600 non-idiomatic expressions from Amharic books and evaluated the proposed model on this dataset. We employed an 80/10/10 splitting ratio to train, validate, and test the proposed idiom recognition model. The proposed model's learning accuracy on the training dataset is 98%, and it achieves 80% accuracy on the testing dataset. We compared the proposed model to machine learning models such as K-Nearest Neighbor (KNN), Support Vector Machine (SVM), and Random Forest classifiers. According to the experimental results, the proposed model produces promising results.


Introduction
In recent years, the development of deep learning in neural networks has improved performance in many natural language processing (NLP) tasks. In natural language processing, neural networks are used for the development of machine translation, speech recognition, text generation, text mining, and named entity recognition.
An idiomatic expression is a collection of words that has a meaning different from that of the individual words in it. The meaning of an idiom cannot be interpreted directly from the meanings of the words that construct it [1]. Idiomatic expressions are an important part of all natural languages [2]. Detecting this type of expression in Amharic text helps individuals who are not familiar with the language. For example, the expression " " can be directly translated as "he drops his face", but the actual meaning is "he becomes sad".
Recognizing idiomatic expressions in a given text plays an important role in the implementation of tasks such as machine translation, speech recognition, sentiment analysis, and dialog systems within the respective language. Amharic is one of the languages grouped under the Semitic language family and has more than 4000 idiomatic expressions [3].
The paper by [4] presents the use of Skip-Thought Vectors to create distributed representations that encode features predictive for idiom token classification. The authors showed that classifiers using these representations perform competitively with the state of the art in idiom token classification. Their models use only the sentence containing the target phrase as input and are thus less dependent on a potentially inaccurate or incomplete model of discourse context. They further demonstrate the feasibility of using these representations to train a competitive general idiom token classifier.
The authors of [5] proposed an idiomatic expression detection method based on the assumption that idioms and their literal counterparts do not occur in the same contexts. Their model first computes the inner product of context word vectors with the vector representing a target expression. Because literal vectors predict local contexts well, their inner product with contexts should be greater than that of idiomatic ones; this distinguishes literals from idioms. The model then computes literal and idiomatic scatter (covariance) matrices from local contexts in word vector space. Because the scatter matrices represent context distributions, the difference between the distributions is measured with the Frobenius norm.
Idiomatic expression in language has a detrimental impact on NLP task performance [8]. However, to the best of our knowledge, there is no Amharic natural language processing model that considers idiomatic expressions. This inspired us to create a deep learning-based Amharic idiomatic phrase identification system. This study focuses on the construction of a CNN using the FastText model to detect the presence of idiomatic terms in an Amharic text. The overall contributions of the study are summarized as follows:
1. Prepare a general-purpose Amharic idiomatic expression dataset that can be used by other studies in the future.
2. Propose a deep learning model that incorporates CNN with FastText to recognize idioms from Amharic texts.
3. Evaluate the performance of the proposed recognition model with various evaluation metrics.
The remainder of the paper is structured as follows. Section 2 presents the comprehensive methodology of the proposed work in detail. Section 3 describes the experimental results. Section 4 presents the outcome and a discussion of it. Finally, Section 5 concludes the paper.

Materials And Methods
This study focuses on the development of a deep learning model using FastText to detect the presence of idiomatic terms in an Amharic text. Figure 1 below depicts the proposed idiomatic expression recognition architecture for the Amharic language. Pre-processing, word embedding, and learning modules are all components of the proposed automatic idiomatic expression identification system, which contains all tasks from data collection up to evaluation.

Dataset
The dataset used in this study was gathered from two Amharic books, " " (idiomatic expressions in Amharic) and " " (love up to the grave) [3,9]. Most idiomatic expressions in the books are two to four tokens long, so the dataset contains only idiomatic expressions of two to four tokens. There are more than four thousand idioms in the Amharic writing system; we collected 1700 idiomatic expressions from the aforementioned books, all of which can be found directly in the books themselves. In addition to idioms, we also collected phrases that are not classified as idiomatic expressions in the books in order to train the proposed model. After collecting the data, we applied the following preprocessing modules to clean it up and make the learning phase as easy as possible.
i. Normalization: The Amharic writing system has different letters (" ") that are read with the same pronunciation, and there are no rules to distinguish their meanings. As a result, the same concept or name of an object may be written with any of these letters in Amharic. This inflates the number of features extracted for processing or analysis. To overcome this redundancy, we normalize the characters that share a pronunciation to one canonical letter, as shown in Table 1 below [10].
ii. Stemming: Stemming is the process of reducing inflected words to their stem, base, or root form. Amharic is one of the morphologically rich Semitic languages [11], so different terms can share the same stem, and stemming helps reduce the size of the feature space for processing. In this study, we used the HornMorpho stemmer developed by Michael Gasser [12]. HornMorpho is a Python library developed for the morphological analysis of three Ethiopian languages: Amharic, Afan Oromo, and Tigrigna.
iii. Stop word removal: In Amharic, common words (e.g., " ") that carry little weight in text processing tasks are called stop words. Stop words are eliminated to save the computational time wasted in processing them. Amharic does not have a well-prepared list of stop words; we therefore removed stop words using the list compiled by [10]. In addition to stop word removal, we also replaced numbers with their names in alphabetic characters (" "). For example, in "2 ", the digit 2 can be changed to two (" ") to produce " ". This replacement is done by keeping a map of key-value relations between digits and the alphabetic description of each digit.
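The preprocessing steps above can be sketched as follows. This is a minimal illustration, not the exact pipeline used in the study: the mappings are tiny hypothetical samples, ASCII placeholders stand in for the Amharic characters, and the HornMorpho stemming step is omitted.

```python
# Minimal sketch of the preprocessing pipeline: normalization, stop-word
# removal, and digit replacement. All tokens and mappings below are
# hypothetical ASCII stand-ins for the Amharic originals.
NORMALIZATION_MAP = {"HA2": "HA", "SE2": "SE"}  # variant letter -> canonical letter
STOP_WORDS = {"NA", "GIN"}                      # sample stop-word list
DIGIT_NAMES = {"2": "TWO", "3": "THREE"}        # digit -> alphabetic name

def preprocess(tokens):
    out = []
    for tok in tokens:
        tok = NORMALIZATION_MAP.get(tok, tok)   # i. normalization
        if tok in STOP_WORDS:                   # iii. stop-word removal
            continue
        tok = DIGIT_NAMES.get(tok, tok)         # digit replacement
        out.append(tok)
    return out

print(preprocess(["HA2", "NA", "2", "BET"]))    # -> ['HA', 'TWO', 'BET']
```

In the actual system, a stemming call (HornMorpho) would sit between normalization and stop-word removal.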

Text representation
Machine learning and deep learning models cannot take raw text as input, so encoding is required [13]. One of the text encoding algorithms that converts a given text into a vector is word2vec, a set of neural network models used to represent words in a vector space. Words that appear in similar contexts are clustered together, while words with no contextual similarity lie far apart in the vector space. However, word2vec fails to generate vectors for words that are not in the training vocabulary.
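One way around this out-of-vocabulary limitation, used by subword models such as FastText, is to compose a word vector from character n-gram vectors. The sketch below is a toy NumPy illustration of that idea only, not the pre-trained Facebook model used in the study; the hash-bucket count, dimension, and n-gram range are arbitrary assumptions.

```python
import hashlib
import numpy as np

DIM = 8           # toy embedding dimension (assumption)
N_BUCKETS = 1000  # toy number of n-gram hash buckets (assumption)

rng = np.random.default_rng(0)
bucket_vectors = rng.normal(size=(N_BUCKETS, DIM))  # one vector per n-gram bucket

def char_ngrams(word, n_min=3, n_max=4):
    # FastText wraps the word in boundary markers before extracting n-grams.
    w = f"<{word}>"
    return [w[i:i + n] for n in range(n_min, n_max + 1) for i in range(len(w) - n + 1)]

def word_vector(word):
    # Even a word never seen in training gets a vector: the mean of the
    # vectors of its character n-gram buckets.
    grams = char_ngrams(word)
    idx = [int(hashlib.md5(g.encode()).hexdigest(), 16) % N_BUCKETS for g in grams]
    return bucket_vectors[idx].mean(axis=0)

print(word_vector("unseenword").shape)  # a vector despite no training vocabulary
```

The real pre-trained model additionally sums learned whole-word vectors with the subword vectors; this toy version keeps only the subword part to show why OOV lookup still works.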
FastText is one of the state-of-the-art word embedding models developed by Facebook. Facebook provides pre-trained FastText embedding models for 157 languages, one of which is Amharic. The strength of FastText embedding is that it can create a vector for a given term even if the term is not in the training vocabulary; it does so by taking the character-level n-grams of the term into account. For this reason, we used the pre-trained FastText word embedding to create vectors of both idiomatic and non-idiomatic (literal) expressions for training and testing the proposed detection model. Algorithm 1 below builds the vector for each word using the pre-trained FastText model.

Algorithm 1:
    for every idiom (j) in the corpus
        for every token (i) in idiom (j)
            vector[j][i] = model.get_vector(token(i))
        endfor
    endfor

Learning Model

Convolutional Neural Network
We need a learning model to determine whether a particular phrase is idiomatic or not. A convolutional neural network is an advanced neural network model used to discover patterns and relationships between data items based on their relative positions [11]. Using a 1D structure (word order) in the convolutional layer, a CNN can automatically learn effective text feature representations from large amounts of text. It captures local relationships among neighboring words in terms of context windows, and its pooling layers extract global features. A CNN is a neural network made up of several convolutional and pooling layers.
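The two operations described above, a 1D convolution over a context window of word vectors followed by pooling, can be illustrated in plain NumPy. This is a sketch of the mechanism only, with arbitrary toy dimensions, not the architecture or hyperparameters of the proposed model.

```python
import numpy as np

def conv1d(seq, filt):
    # seq:  (n_tokens, dim) sequence of word vectors for a phrase
    # filt: (width, dim) one convolutional filter spanning a context window
    width = filt.shape[0]
    n_out = seq.shape[0] - width + 1
    # Slide the filter over the word order (1D structure) to get local features.
    return np.array([np.sum(seq[i:i + width] * filt) for i in range(n_out)])

def global_max_pool(feature_map):
    # Pooling keeps the strongest local response as one global feature.
    return feature_map.max()

rng = np.random.default_rng(1)
sentence = rng.normal(size=(4, 300))  # e.g. a 4-token phrase of 300-d word vectors
filt = rng.normal(size=(3, 300))      # context window of 3 neighboring words
fmap = conv1d(sentence, filt)
print(fmap.shape)                     # (2,): one response per window position
```

A real CNN classifier stacks many such filters, applies a nonlinearity, and feeds the pooled features to a dense output layer.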

K-Nearest Neighbor Classi er
K-Nearest Neighbors is a basic yet effective non-parametric supervised classification technique. The KNN classifier is a widely used pattern recognition classifier because of its effective performance, efficient output, and simplicity. It is frequently utilized in pattern recognition, machine learning, text classification, data mining, object identification, and a variety of other domains [14]. The KNN method classifies by analogy: it compares the unknown data point to the training data points it most resembles, with similarity measured by the Euclidean distance. Attribute values are scaled so that features with larger ranges do not outweigh those with smaller ranges. In KNN classification, the unknown pattern is assigned the most predominant class among the classes of its nearest neighbors. In the event of a tie between two classes, the class with the minimum average distance to the unknown pattern is assigned. A global distance function can be computed by combining several local distance functions based on individual attributes [15].
The performance of the recognition models is evaluated with the following standard metrics:
Accuracy = (Tp + Tn) / (Tp + Tn + Fp + Fn)
Precision = Tp / (Tp + Fp)
Recall = Tp / (Tp + Fn)
F1-score = (2 × Precision × Recall) / (Precision + Recall)
where Tp denotes true positives, Tn denotes true negatives, Fp denotes false positives, and Fn denotes false negatives.
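The KNN decision rule described above, Euclidean distance to the training points followed by a majority vote among the k nearest, can be sketched in a few lines. This is an illustrative toy, not the study's experimental setup; the 2-d points and labels are hypothetical.

```python
import numpy as np

def knn_predict(train_X, train_y, x, k=3):
    # Euclidean distance from the unknown point to every training point.
    dists = np.linalg.norm(train_X - x, axis=1)
    nearest = np.argsort(dists)[:k]
    # Assign the most predominant class among the k nearest neighbors.
    labels, counts = np.unique(train_y[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Hypothetical 2-d feature vectors; 0 = literal phrase, 1 = idiomatic phrase.
train_X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.8]])
train_y = np.array([0, 0, 1, 1])
print(knn_predict(train_X, train_y, np.array([4.9, 5.1])))  # -> 1
```

The tie-breaking rule mentioned in the text (minimum average distance between tied classes) is omitted here for brevity; with odd k and two classes, no tie can occur.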

Results And Discussions
Experiments were carried out in a Windows 10 environment on a machine equipped with a Core i7 processor and 16 GB of RAM. The experimental setups used to develop the proposed Amharic idiomatic expression recognition system are shown in Table 3 below.

Training and validating the Model
We divided the data with a split ratio of 80%, 10%, and 10% for training, validating, and testing the proposed model, respectively. To train the proposed CNN model, we tuned the hyperparameters using a grid search strategy. The hyperparameter values used in this study are shown in Table 3 below.
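The 80/10/10 split can be sketched as follows. This is a generic illustration assuming a simple shuffled split; the paper does not specify whether stratification or a fixed seed was used.

```python
import random

def split_80_10_10(samples, seed=42):
    # Shuffle once with a fixed seed, then cut into 80% train,
    # 10% validation, and 10% test partitions.
    data = list(samples)
    random.Random(seed).shuffle(data)
    n_train = int(0.8 * len(data))
    n_val = int(0.1 * len(data))
    return (data[:n_train],
            data[n_train:n_train + n_val],
            data[n_train + n_val:])

dataset = list(range(3300))  # 1700 idiomatic + 1600 literal examples
train, val, test = split_80_10_10(dataset)
print(len(train), len(val), len(test))  # -> 2640 330 330
```

Grid search over the hyperparameters would then train one model per parameter combination on `train` and keep the combination scoring best on `val`.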

Testing the model
With the testing dataset and the evaluation metrics listed in Table 2 above, we assess the effectiveness of the proposed Amharic idiomatic expression recognition model. Figure 5 below shows the experimental results of the proposed scheme in terms of accuracy, precision, recall, and F1-measure.
As shown in Fig. 5, the proposed Amharic idiomatic expression recognition system, which uses a CNN with FastText word embedding, achieved an accuracy, precision, recall, and F1-score of 80%, 70%, 77.78%, and 73.68%, respectively.
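These four scores follow directly from the metric definitions given in the methods section. The snippet below computes them from confusion-matrix counts; the counts used here are hypothetical (the paper does not report its confusion matrix), chosen only to show counts that reproduce the reported percentages.

```python
def metrics(tp, tn, fp, fn):
    # Standard classification metrics from confusion-matrix counts.
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Hypothetical counts on a toy test set of 50 examples; they happen to
# reproduce the reported 80% / 70% / 77.78% / 73.68%.
acc, p, r, f1 = metrics(tp=14, tn=26, fp=6, fn=4)
print(round(acc, 2), round(p, 2), round(r, 4), round(f1, 4))
```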

Comparison of the proposed model with other models
Two factors must be taken into account to justify that a model works effectively [19]: first, examining the model's numerical output, and second, contrasting its performance with that of other models applied to the same dataset. We therefore compared the proposed model's performance with some of the machine learning models employed in earlier studies [20], namely the KNN, SVM, and Random Forest classifiers. The comparison result is shown in Table 4 below. According to the results in Table 4, CNN is effective at identifying idioms in Amharic. This is because the features of idiomatic expressions in the Amharic language are captured better with the help of FastText's embedding [21].

Conclusion
Different NLP models are now being developed for the Amharic language without taking idiomatic expressions into account. This misleads the models, since the meanings of idioms differ from the meanings that may be inferred from the words that make them up. Idioms are one of the most fascinating and difficult aspects of Amharic vocabulary. Machine learning algorithms do not process raw text as input, so texts must be encoded into another format. As part of this encoding, we produced a vector for each word used in this study with a pre-trained FastText word embedding. The experimental findings show that, compared to the other models evaluated in this study, the proposed CNN with FastText embedding is more effective at detecting Amharic idioms. The proposed approach can therefore be applied to natural language processing tasks requiring the detection of idiomatic expressions, such as machine translation, sentiment analysis, and question-answering systems. However, due to the limited size of the dataset, the model's performance still requires improvement. Additional data from Amharic holy books can be added to enhance the proposed model's performance. In the future, we plan to conduct Amharic machine translation by incorporating this model as a component.

Figures
Figure 1. The architecture of the proposed idiomatic expression recognition system

Figure 3. Training accuracy of the proposed model

Table 1
Normalization of characters having the same pronunciations.

Table 3
After the model has been trained with the abovementioned parameters and the training dataset, its training accuracy and training loss are displayed in Figs. 3 and 4 below. Since the training accuracy grows as the number of epochs increases, the model does a good job of learning from the data. In addition, the training loss declines as the number of epochs rises. This shows that the model picks up idiomatic expression features from the training set.

Table 4
Comparison of the proposed model with SVM, KNN, and Random Forest. All the results shown in Table 4 above were produced with the same dataset and the same word embedding model, FastText. In addition, we compared the performance of the proposed idiomatic recognition model (CNN with FastText embedding) against other word representation models, namely Term Frequency-Inverse Document Frequency (TF-IDF) and one-hot encoding vectors. The result is depicted in Table 5 below.

Table 5
Comparison of different word vector representations