Abstract
Citations illustrate the link between citing and cited documents. Citations are used to analyze different facets of scholarly achievement, such as a journal's impact factor, an author's ranking, and peers' judgment. However, all citations are given equal weight when computing these important metrics, even though academics contend that not all citations carry equal significance. Predominantly, such rankings rely on quantitative measures while the qualitative aspect is ignored. For a fair evaluation, qualitative assessment of citations is needed in addition to quantitative counting. Many existing works that take a qualitative approach treat the task as binary classification, categorizing citations as important or unimportant. This study addresses the multi-class task of citation sentiment classification on imbalanced data and presents a novel framework for sentiment analysis of in-text citations in research articles. In the proposed technique, features are extracted using a convolutional neural network (CNN), and classification is performed using a voting classifier that combines logistic regression (LR) and stochastic gradient descent (SGD). The class imbalance problem is handled with the synthetic minority oversampling technique (SMOTE). Extensive experiments compare the proposed approach against machine learning models trained on SMOTE-generated data with term frequency (TF) and term frequency-inverse document frequency (TF-IDF) features to evaluate its efficacy for citation analysis. The proposed voting classifier using CNN features achieves an accuracy, precision, recall, and F1 score of 0.99 each. This work not only advances the field of sentiment analysis of academic citations but also underscores the importance of incorporating qualitative aspects when evaluating the impact and sentiments conveyed through citations.
Citation: Alnowaiser K (2024) Scientific text citation analysis using CNN features and ensemble learning model. PLoS ONE 19(5): e0302304. https://doi.org/10.1371/journal.pone.0302304
Editor: Mohamed Hammad, Menoufia University, EGYPT
Received: August 2, 2023; Accepted: April 2, 2024; Published: May 28, 2024
Copyright: © 2024 Khaled Alnowaiser. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: Data relevant to this paper are available from GitHub at https://github.com/nowaiserk/Plosone.
Funding: This study was supported by Prince Sattam bin Abdulaziz University in the form of a grant to KA [PSAU/2024/R/1445].
Competing interests: The authors have declared that no competing interests exist.
Introduction
Scientific publications have increased substantially over the past decade, and a large number of researchers worldwide are publishing their work in the form of articles and books. Consequently, many publications of varying scientific quality and impact are available today, and reviewing and rating published scientific articles is in great demand. The literature offers a wide range of evaluation standards for scientific writing; nonetheless, the citation count is one of the most important evaluation indicators. The number of citations is frequently utilized to assess a paper's or researcher's impact [1–3]. In addition, it has served as the foundation for many additional measures like the h-index [4], impact factor, i-10 index [5], and other assessment parameters for researchers, conferences, journals, and research institutions [6, 7].
Scientific papers are linked to other cutting-edge scientific literature. In scientific contexts, the terms “citation” and “reference” refer to the referenced work of other scholars [8]. Citing the work of other academics shows that scholars value their contributions and creates a connection between the referred study and the citing paper [9]. Citations are essential for evaluating the quality of a work; however, the sheer volume of material used in citations makes information extraction exceedingly challenging. Another difficulty in assessing the impact of a publication is the rapid expansion of scientific literature. A citation conveys both the effect of a publication and the significance of its authors. Nonetheless, counting citations is a quantitative metric that does not describe any qualitative aspect of a citation. The number of times a research paper is cited by other researchers is known as the citation count, and it can be deceptive when judging the quality of a research study [10]. Because of co-authorship, citations might be skewed and created with the purpose of inflating the reference count [11]. A paper is occasionally cited only to point out shortcomings and suggestions for improvement, yet this sort of citation is also counted when calculating citation indexes [12]. Ranking methods cannot highlight highly effective research works by counting citations alone.
Citation sentiment analysis is a relatively understudied field of research. The majority of citations in research articles explain the research findings without expressing an opinion; thus, the majority of citations are objective [12]. Some authors examined sources in which criticism is frequently expressed in a courteous manner. Detecting the negative context of citations is a challenging task, as their text is frequently implicit and concealed with contrasting terms. The positive context of a citation involves its discussion with respect to design and evaluation criteria. Giving weight to the sentiments can help decide the text's polarity. In the past, researchers have worked on detecting sentiments in text using different approaches such as n-grams, dependency relations, lexical features, structure-based features, and many other state-of-the-art approaches [13]. Sentiment extraction from citations is a challenging task, as in most cases sentiments are hidden in citations and difficult to analyze. Studies show that most citation sentiments are neutral because they are used merely to support a phenomenon [14]. The sentiment present in the citation text expresses the author's real feeling towards another author's research. Predominantly, the sentiments of the author remain hidden because scientific papers express sentiments in an objective way [14].
Bibliometric measurement is the most important application of citation sentiment analysis, and citation sentiment analysis contributes to its improvement. The prior method for determining an article's influence was to tally the number of times it was mentioned. Citation sentiment analysis, on the other hand, may be used to assign weights to each citation text based on the sentiments of the citations [15]. Most of the time, the feelings indicated in the citation text are concealed, making it challenging to determine whether they are positive, negative, or neutral [16]. Humans can easily read the citation language and recognize the sentiment it expresses, but training a model to automatically predict the polarity of sentiments is a complex and demanding undertaking [17]. The citation text's sentiment polarity appears to be neutral for the most part, with negative or positive attitudes often concealed. Many methods, including lexical analysis and feature extraction, can be utilized for hidden sentiment analysis. Using machine learning for automated categorization of in-text citation sentiments, this work makes the following significant contributions.
- A machine learning-based framework is devised to perform sentiment analysis of in-text citations of research articles. To improve the effectiveness of sentiment analysis, the proposed model combines a voting classifier with features generated by a CNN. In addition, various machine learning models are utilized in this regard: decision tree (DT), AdaBoost, logistic regression (LR), stochastic gradient descent (SGD), random forest (RF), extra tree classifier (ETC), support vector machine (SVM), and a voting classifier combining LR and SGD, referred to as VC(LR+SGD).
- An effective text generation technique is proposed to handle the class imbalance problem when classifying citation sentiment as positive, negative, or neutral.
- The efficacy of the proposed approach is analyzed with and without the synthetic minority oversampling technique (SMOTE). The performance of the voting classifier in combination with CNN-based features is analyzed in terms of standard performance evaluation metrics.
The remainder of the article is organized as follows. Several significant works connected to the current investigation are described in the Related Work section. The proposed technique, the dataset, and the models employed are described in depth in the Materials and Methods section. Results and their discussion are covered in the Results and Discussions section, and the conclusion is presented at the end of the paper.
Related work
Several metrics have been established over the years to judge the quality of a research publication or its authors. For example, the h-index is a significant metric for determining an author's rank and prominence [4]. In addition, the impact factor and eigenvector are used for the same purpose [18, 19]. Nonetheless, these are quantitative indicators, and qualitative approaches to determine an article's rank have not been well researched. Citation sentiment is a relatively new avenue that may be used to overcome the constraints of quantitative techniques for analyzing the relevance of research publications. For example, Kochhar and Ojha [20] designed a hybrid approach in which objective and subjective variables are integrated to assess the effect of a research publication. The study coupled the research paper's and author's impact factors with citation sentiments for this goal. Later phases involve labeling each lemma and calculating its score with SentiWordNet.
Some studies use citation sentiment alongside objective measures, while others consider only the sentiment itself. Ikram and Afzal [21] proposed a multilevel, aspect-based citation text analysis system. With the help of the material immediately surrounding the reference, several aspects are first derived from the citation phrases. A linguistic rule-based technique is utilized to determine the polarity of these aspects in sentiment analysis. The support vector machine with N-gram features achieves the highest sentiment categorization accuracy. Similarly, Nguyen et al. [22] presented a deep neural network for sentiment analysis of text. An LSTM is used with word2vec word embeddings, while data imbalance is handled using SMOTE. Results demonstrate that the proposed system outperforms the conventional SVM.
Athar and Teufel [23] worked on context-based sentiment classification. The impact of window size on various techniques is examined through a variety of studies. The N-gram results demonstrate how adding context expands the vocabulary and influences performance. Ghosh and Shah [24] investigated the importance of features for ranking a scientific article. The study performed citation sentiment analysis on the ACL paper collection. A few carefully chosen characteristics are used to train the models, including sentiment score, N-grams with positive and negative polarity, self-citation, part-of-speech tags, and sentiment text. The findings demonstrate that Digging provides the greatest accuracy score of 80.61%.
The authors in [25] divided citations into two categories, “influential” and “non-influential”, using a machine learning method (SVM) for this classification. A hundred research papers from the ACL Anthology were included in the dataset. Five characteristics were used: citation frequency, similarity, context, position, and other factors. A study [26] presented a content-based method for sentiment analysis of in-text citations, utilizing binary classification to differentiate between significant and unimportant citations. The sentiment and cosine similarity scores were used as features in binary classification algorithms, filling a gap in the state-of-the-art literature where the ideal model for sentiment analysis was unknown. The work applied automated sentiment analysis to extracted in-text citation material, tested several classification models, and improved the understanding of sentiment analysis in citation settings.
Previously, numerous methods for identifying noteworthy citations were presented, including content-based, metadata-based, and bibliography-based approaches. However, even though the results were cutting-edge, they still needed improvement. The authors in [27] used a two-module technique comprising section-wise citation counting and sentiment analysis of citation sentences. The first module used a neural network and multiple regression for automated weight assignment, while the second module used sentiment analysis for sentence-based categorization utilizing random forest, support vector machine, and multilayer perceptron. Another study [28] proposed a new machine learning framework to distinguish important from non-important citations by analyzing syntactic and contextual information. Using three feature selection algorithms and three classifiers, the study identified key features for differentiation. Experimentation on two datasets demonstrated the framework's superior classification performance compared to contemporary research, highlighting the significance of both syntactic and contextual features in identifying important citations.
Several methods concentrate primarily on determining a citation's sentiment. For instance, Liu [9] proposed a system utilizing word2vec for citation text analysis. From ACL collections, sentence vectors are developed using word embeddings. The positive and negative polarity is used to calculate the polarity of the citation. SVM is employed with chosen characteristics to assess the citation's sentiment. Results indicate that manual feature engineering performs better when determining the polarity of a citation. Mercier et al. [10] proposed a deep learning system called ImpactCite. The system is built on XLNet and addresses both sentiment classification and intent classification, which reveals the citation's meaning. Results indicate that ImpactCite yields micro-F1 and macro-F1 scores of 88.13 and 88.93, respectively, outperforming existing citation sentiment approaches.
A review of the available research indicates two key points: machine learning techniques outperform traditional approaches, and dataset imbalance has a considerable impact on the efficacy of such approaches. In this regard, this work employs a machine learning-based technique enhanced with CNN-based features using SMOTE to address the dataset imbalance issue.
Materials and methods
This section explains the methodology and procedures employed in this study in detail. Fig 2 depicts the architecture of the proposed technique. Starting with data retrieval, the approach generates text with SMOTE to balance the dataset. Feature extraction is then carried out using term frequency (TF), term frequency-inverse document frequency (TF-IDF), and CNN. The data is split into training and testing sets, and the selected machine learning models are used for sentiment classification.
Citation sentiment corpus
This study utilizes the ‘citation sentiment corpus’ taken from the ACL Anthology Network [12]. The dataset contains 8,736 citation texts annotated by human experts. Of the 8,736 citation records, 829 are positive, 280 are negative, and 7,627 are neutral. These 8,736 citations were extracted from 194 research articles. After SMOTE upsampling, 1,000 citation records of each class are used to balance the dataset. The dataset comprises ‘Source_Paper ID’, ‘Target_Paper ID’, ‘Citation_Text’, and ‘Sentiment’. The ‘Source_Paper ID’ is the citing paper's ID representing the source of the text, ‘Target_Paper ID’ is the cited paper's ID, ‘Citation_Text’ is the original text containing the citation, and ‘Sentiment’ is the target class label, which can be ‘positive’, ‘negative’, or ‘neutral’. Table 1 shows a few example records from the collection.
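As a quick illustration, a minimal pandas sketch for loading and inspecting the corpus is given below; the file name is a hypothetical placeholder, while the column names and class counts follow the description above.

import pandas as pd

df = pd.read_csv("citation_sentiment_corpus.csv")  # hypothetical file name
print(df.columns.tolist())
# ['Source_Paper ID', 'Target_Paper ID', 'Citation_Text', 'Sentiment']
print(df["Sentiment"].value_counts())  # reported counts: 7,627 neutral, 829 positive, 280 negative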
Machine learning classifiers
Supervised machine learning algorithms are extensively used to solve classification and regression problems [29]. Tree-based and regression-based algorithms are used in this study, with 8 different supervised algorithms applied to the classification problem. Table 2 provides implementation details and hyperparameter settings for these machine learning models. To find the best parameters, grid search is used: different values for each parameter are tried within a specified range and the model's performance is evaluated for each. Every parameter goes through this procedure, and the values that optimize the model's performance are selected at the end, as in the sketch below.
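The following is a minimal sketch of this grid search with scikit-learn; the parameter ranges and the random stand-in data are illustrative assumptions, not the exact grids of Table 2.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X_train = np.random.rand(100, 20)        # stand-in for extracted text features
y_train = np.random.randint(0, 3, 100)   # three sentiment classes

param_grid = {"n_estimators": [100, 200, 300], "max_depth": [None, 10, 50]}  # example grid
search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)             # evaluates every parameter combination
print(search.best_params_, search.best_score_)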
Decision tree.
DT is a supervised machine learning model that learns discrete rules from data features to predict target variables [30, 31]. The main benefit of DT is the interpretable decision rules and the feature subsets that appear at different classification steps. A DT comprises leaf nodes and internal nodes connected by branches. Every leaf node denotes a target class, while internal nodes denote features connected through branches to perform classification. The efficacy of a DT depends on how well it is trained on the dataset.
AdaBoost classifier.
AdaBoost, short for adaptive boosting, is an ensemble learning classifier that utilizes the boosting method to train weak learners [31]. It combines many weak learners, recursively training them on copies of the original corpus, where each weak learner focuses on the hard-to-classify outliers [32]. It is a meta-estimator approach that uses N weak learners with different assigned weights during training.
Logistic regression.
LR is basically designed for binary classification, but in this research the One-vs-All (OvA) technique is used for multi-class classification: for each class, a binary model is trained to distinguish that class from all other classes, and the final prediction is the class with the highest probability among the individual classifiers. LR is a statistical algorithm that uses different variables to compute the final result. It is a regression-based model that estimates class probabilities and therefore performs well on categorical data. To estimate the probability and ascertain the link between dependent and independent variables, LR employs the logistic function [33].
Stochastic gradient classifier.
The working of SGD is similar to LR and SVM. For multi-class classification, SGD proves to be a robust classifier, as it aggregates multiple binary classifiers in a one-versus-all scheme. SGD randomly selects examples from the batch, so its hyperparameters need correct values to achieve precise results. It is highly sensitive to feature scaling [34].
Random forest.
RF comprises numerous decision trees that work separately to produce predictions, while the ultimate outcome is decided by majority vote. Its error rate is much lower than that of other classifiers, which is attributed to the low correlation among trees [35]. Different split criteria can be used in RF; here the dataset is split on the basis of the Gini index as the cost function. RF utilizes the bagging approach, in which multiple classifiers are trained on bootstrapped data to minimize variance.
Extra tree classifier.
ETC employs a meta-estimator that trains several weak learners on random samples from the dataset to enhance the prediction outcome [36, 37]. Like RF, it is an ensemble model widely utilized for classification problems. ETC differs from RF in the way the trees in the forest are constructed: it uses the actual data for training, unlike RF, which uses bootstrapped data samples. At every node, a tree considers a random sample of k features and chooses the best one for splitting. These random feature samples lead to multiple de-correlated DTs.
Support vector classifier.
Like LR, SVC is basically designed for binary classification, but in this research it is used in a One-vs-All (OvA), also called One-vs-Rest (OvR), configuration with a linear kernel (OneVsRestClassifier(SVC(kernel=‘linear’))). For each class, a binary classification model is trained to distinguish that class from all other classes, and the final prediction is the class with the highest probability among the individual classifiers. SVC, first proposed by Cortes and Vapnik, is a binary classification technique that may be expanded to handle multi-class problems [38]. Support vector machines are a powerful technique for nonlinear classification, outlier detection, and regression. A major drawback of SVC is that it does not give good results on a small-sized corpus because it relies on cross-validation of the data.
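A minimal sketch of this One-vs-Rest configuration in scikit-learn follows; the stand-in data are assumptions, while the wrapper call mirrors the one quoted above.

import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X = np.random.rand(60, 10)        # stand-in feature matrix
y = np.random.randint(0, 3, 60)   # positive / negative / neutral labels

ovr_svc = OneVsRestClassifier(SVC(kernel="linear"))  # one binary SVC per class
ovr_svc.fit(X, y)
print(ovr_svc.predict(X[:5]))     # class whose binary classifier scores highest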
Voting classifier.
Recently, voting classifiers have shown better performance on many tasks than traditional models. In a voting classifier, multiple classifiers can be combined, subject to training time constraints, and their outputs are aggregated to produce the voting outcome. Every model forecasts the target label, and the classifiers vote among themselves to select the target class label [40]. Soft or hard voting can be used: soft voting averages the probability values of each class across classifiers, while in hard voting the class predicted by the majority of classifiers wins. This study combines LR and SGD in a soft voting classifier, as sketched below.
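A hedged sketch of the VC(LR+SGD) soft-voting ensemble in scikit-learn is shown below; the log-loss setting for SGD is an assumption made so that soft voting can average class probabilities.

import numpy as np
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier

X = np.random.rand(60, 10)        # stand-in features
y = np.random.randint(0, 3, 60)   # three sentiment classes

vc = VotingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)),
                ("sgd", SGDClassifier(loss="log_loss"))],  # log loss exposes predict_proba
    voting="soft")                # soft voting averages the base models' probabilities
vc.fit(X, y)
print(vc.predict(X[:5]))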
Feature extraction
The technique of finding meaningful features in the data for good and efficient training of a machine learning model is known as feature engineering. Feature engineering techniques can help machine learning algorithms perform better: by separating the valuable features from the raw data, the consistency and accuracy of the learning algorithm are improved. In this work, TF and TF-IDF vectorization features are used along with SMOTE upsampling. The strengths and weaknesses of these techniques are discussed in Table 3, and a vectorization sketch follows below.
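As a sketch of the two vectorization schemes, assuming scikit-learn's standard vectorizers and toy citation texts:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

texts = ["This method outperforms prior work",
         "The approach fails on noisy data"]   # toy citation texts

tf = CountVectorizer().fit_transform(texts)      # TF: raw term counts
tfidf = TfidfVectorizer().fit_transform(texts)   # TF-IDF: counts reweighted by term rarity
print(tf.shape, tfidf.shape)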
Dealing with dataset imbalance
This study utilizes SMOTE and CNN-based features with a voting classifier to address the issue of an unbalanced dataset.
Using synthetic minority oversampling technique.
SMOTE is a popular oversampling method for addressing the issue of unbalanced data. It leverages Euclidean distance to generate random synthetic data for the minority class from its closest neighbors [41]. Because they are created from the original characteristics, newly produced instances resemble the original data quite closely. SMOTE is not a good choice for high-dimensional data because it creates extra noise. A recent study applies SMOTE to predict heart failure cases [42]. Machine learning with SMOTE shows reasonable results but still does not quite compete with deep learning models [43]. This study uses SMOTE to generate a new training dataset, as in the sketch below.
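A minimal sketch of SMOTE with the imbalanced-learn library follows; the skewed stand-in labels are an assumption echoing the corpus distribution.

import numpy as np
from imblearn.over_sampling import SMOTE

X = np.random.rand(200, 10)                    # stand-in features
y = np.array([2] * 170 + [0] * 20 + [1] * 10)  # skewed classes, like the corpus

# SMOTE interpolates between a minority sample and its nearest neighbors.
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print(np.bincount(y), "->", np.bincount(y_res))  # classes balanced after resampling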
Architecture of convolutional neural network for feature extraction.
In this work, the deep learning model CNN is used as a feature extraction technique for scientific paper citation analysis [44–46]. CNN is a widely used deep learning architecture mostly applied to classification tasks. Since a deep learning system has the capacity to extract features, the convolved features are used for scientific paper citation sentiment analysis. The CNN model used here has four layers: an embedding layer, a convolutional layer, a pooling layer, and a flatten layer. For citation sentiment analysis, the first layer is an embedding layer with a vocabulary size of 12,000 and an output dimension of 100. The convolutional layer has 500 filters, a kernel size of 2, and a rectified linear unit (ReLU) activation function. The third layer is a max pooling layer with pool size 2, applied to the output of the convolutional layer to retain the most significant feature maps. The output is ultimately transformed into a 1D array using a flatten layer.
For example, consider a tuple $(fs_i, tc_i)$ from the citation sentiment analysis dataset, where $fs$ denotes the feature set, $tc$ denotes the target class, and $i$ is the index of the tuple. The embedding layer is used as a transformation tool to convert the training set into the needed input.
$EL = \mathrm{Embedding}(V_s, I, O_s)$ (1)
$EO_s = EL(fs_i)$ (2)
where $EL$ denotes the embedding layer and $EO_s$ its output, which is the input of the convolutional layer. The $EL$ has three parameters: the vocabulary size $V_s$, the input length $I$, and the output dimension $O_s$.
In this study, the $EL$ vocabulary size is set to 12,000, meaning the $EL$ accepts token indices from 0 to 12,000. The input length is $I = 42$ and the output dimension is $O_s = 100$. The $EL$ processes all the input and passes its output to the CNN for additional processing. The $EL$ output shape is $EO_s = (\mathrm{None}, 42, 100)$.
$C_{out} = \mathrm{ReLU}(\mathrm{Conv1D}(EO_s, F, k))$ (3)
The convolutional layer output is computed from the $EL$ output. The convolutional layer is implemented with $F = 500$ filters and a kernel size of $k = 2$. Using the ReLU activation function, all negative values are set to zero while all other values are left unaltered.
$F_{map} = \mathrm{MaxPooling}(C_{out}, P_s, S)$ (4)
The max pooling layer is used to extract the salient features; a pool of size 2 is used for this purpose. $F_{map}$ denotes the features after max pooling, $P_s = 2$ is the size of the pooling window, and $S = 2$ is the stride. The last, flatten layer is utilized for the data transformation. Using the above-mentioned steps, 251,470 features are obtained for training the classifiers.
$F_{flat} = \mathrm{Flatten}(F_{map})$ (5)
To convert the 2D data into 1D, a flatten layer is used. Machine learning models perform well on 1D data, which is the primary driver behind this conversion. The aforementioned procedure is conducted during the training of the models.
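Putting Eqs (1)-(5) together, a hedged Keras sketch of the feature extractor follows; the layer sizes match the stated hyperparameters (vocabulary 12,000, input length 42, output dimension 100, 500 filters, kernel size 2, pool size 2), while the tokenization of citation texts into padded index sequences is assumed.

import numpy as np
from tensorflow.keras import Input, Model
from tensorflow.keras.layers import Embedding, Conv1D, MaxPooling1D, Flatten

X_seq = np.random.randint(0, 12000, size=(32, 42))  # stand-in padded token sequences

inp = Input(shape=(42,))
x = Embedding(input_dim=12000, output_dim=100)(inp)            # Eqs (1)-(2)
x = Conv1D(filters=500, kernel_size=2, activation="relu")(x)   # Eq (3)
x = MaxPooling1D(pool_size=2)(x)                               # Eq (4)
out = Flatten()(x)                                             # Eq (5)
extractor = Model(inp, out)

features = extractor.predict(X_seq)  # convolved features fed to the classifiers
print(features.shape)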
Proposed methodology
Ensemble models are becoming more prevalent and have led to greater accuracy and efficiency in classification tasks. By merging multiple classifiers, it is possible to enhance performance beyond what individual models can achieve. In this work, an ensemble learning model is employed to enhance scientific paper citation sentiment classification. The proposed method extracts features with a CNN and uses a voting classifier that unites LR and SGD through the soft voting criterion. The ultimate output is determined by the class that receives the most votes. The proposed ensemble model, as outlined in Algorithm 1, operates as follows:
$P(c) = \frac{1}{2}\left(P_{LR}(c) + P_{SGD}(c)\right)$ (6)
The prediction probabilities for each test sample are provided by $P_{LR}$ and $P_{SGD}$. These probabilities, as illustrated in Fig 1, pass through the soft voting criterion combining LR and SGD.
To demonstrate the capabilities of the proposed approach, consider an example. When a sample is evaluated by both LR and SGD, it is assigned a probability score for each class. Suppose we have three classes, Class 1 (Positive), Class 2 (Negative), and Class 3 (Neutral), with likelihood scores of 0.4, 0.5, and 0.6, respectively, according to the LR model. For the same classes, the probability scores are 0.6, 0.7, and 0.9, respectively, according to the SGD model. Let P(x) be the probability score of x, where x belongs to one of the dataset's three classes. The probabilities for the three classes are computed as follows:
- P(1) = (0.4 + 0.6)/2 = 0.50
- P(2) = (0.5 + 0.7)/2 = 0.60
- P(3) = (0.6 + 0.9)/2 = 0.75
The ultimate prediction will be 3 (Neutral class), as it has the highest probability score, as shown below:
$\hat{y} = \arg\max_{c}\ \frac{1}{2}\left(P_{LR}(c) + P_{SGD}(c)\right)$ (7)
VC(LR+SGD) determines the final output by averaging the estimated probabilities from the two classifiers and selecting the class with the highest average probability. The study's proposed citation sentiment analysis framework, illustrated in Fig 2, involves the ensemble model VC(LR+SGD), which combines two machine learning models. This study employs the citation sentiment analysis dataset acquired from the University of Cambridge lab.
To assess the proposed model, the ‘citation sentiment analysis dataset’ is used in two stages. In the first stage, citation sentiment analysis is performed using TF and TF-IDF features, both alone and with SMOTE. In the second stage of the experiments, the dataset is processed for machine learning models using convolutional features. The data are split into two parts, with 70% allocated for training and 30% reserved for testing. This approach, known as the training-testing split, is a common method in machine learning for assessing the accuracy of a model on new and unseen data.
Algorithm 1 Ensembling LR and SGD.
Input: input data
MLR = Trained LR
MSGD = Trained SGD
for i = 1 to M do
if MLR ≠ 0 & MSGD ≠ 0 & training_set ≠ 0 then
Decision function = $\arg\max_c \frac{1}{2}\left(P_{LR}(c \mid x_i) + P_{SGD}(c \mid x_i)\right)$
end if
return final label
end for
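A hedged Python rendering of Algorithm 1 is given below, averaging the class probabilities of the two fitted models as in Eqs (6) and (7); the stand-in data are assumptions.

import numpy as np
from sklearn.linear_model import LogisticRegression, SGDClassifier

X_train = np.random.rand(100, 10)          # stand-in training features
y_train = np.random.randint(0, 3, 100)     # three sentiment classes
X_test = np.random.rand(10, 10)

m_lr = LogisticRegression(max_iter=1000).fit(X_train, y_train)   # trained LR
m_sgd = SGDClassifier(loss="log_loss").fit(X_train, y_train)     # trained SGD

# Eq (6): average the class probabilities; Eq (7): take the argmax.
p_mean = (m_lr.predict_proba(X_test) + m_sgd.predict_proba(X_test)) / 2.0
final_labels = p_mean.argmax(axis=1)
print(final_labels)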
Results
The performance of the classifiers is evaluated using different evaluation parameters for citation text analysis. This work uses accuracy, precision, recall, and F1 score as the evaluation metrics. For the implementation of the machine learning algorithms, the scikit-learn library and NLTK are utilized in Python. For training and testing, the data are divided in a 0.7:0.3 ratio, as in the sketch below.
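A sketch of the 0.7:0.3 split and the four evaluation metrics with scikit-learn follows; macro averaging for the multi-class scores and the stand-in data are assumptions.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.linear_model import LogisticRegression

X = np.random.rand(200, 10)            # stand-in features
y = np.random.randint(0, 3, 200)       # three sentiment classes

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)
y_pred = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict(X_te)

acc = accuracy_score(y_te, y_pred)
prec, rec, f1, _ = precision_recall_fscore_support(y_te, y_pred, average="macro")
print(f"accuracy={acc:.4f} precision={prec:.2f} recall={rec:.2f} f1={f1:.2f}")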
Performance of classifiers using TF without SMOTE
The efficiency of the classifiers is compared using TF without SMOTE to analyze citation text. The voting classifier achieves the greatest accuracy of 0.9122, according to the results shown in Table 4. SVC obtains 0.8961, the second-highest accuracy. LR, RF, and ETC show almost identical precision, recall, and F1 scores for citation sentiment analysis. Across all models, DT has the poorest results using TF, with a 0.8473 accuracy score.
Performance of classifiers using TF with SMOTE
Supervised machine learning models are evaluated using TF features with SMOTE. From Table 5, it can be clearly observed that combining TF with SMOTE substantially improves the performance of all classifiers for sentiment analysis of citation text. The best model is RF, which achieves an accuracy of 0.9729, precision of 0.98, recall of 0.96, and F1 score of 0.97. DT, LR, SGD, RF, and SVC show values higher than 0.90 for all evaluation metrics. AdaBoost performs the poorest, with accuracy, precision, recall, and F1 score values of 0.8361, 0.84, 0.79, and 0.82, respectively. The performance of the VC is poor in this case because both constituent classifiers learn feature patterns in the same way (linear feature capturing); when the features are very similar, both classifiers are likely to make similar types of errors and thus perform poorly, as observed here.
Performance of classifiers using TF-IDF without SMOTE
Without utilizing SMOTE, the outcomes of classifiers that use the feature extraction method TF-IDF are compared. The accuracy, precision, recall, and F1 score comparison of classifiers employing TF-IDF is shown in Table 6. It is observed that the voting classifier outperforms other models with an accuracy score of 0.9122 and 0.90 values each for precision, recall, and F1. SVC shows a marginally lower performance with a 0.8961 accuracy score, 0.87 precision, 0.89 recall, and 0.87 F1 score. For citation sentiment analysis, RF and ETC produce comparable findings with accuracy scores of 0.8760 and 0.8775, respectively.
Performance of classifiers using TF-IDF with SMOTE
After applying SMOTE, the performance of the models is also assessed using TF-IDF. Results given in Table 7 compare classifiers using TF-IDF on the SMOTE-balanced dataset for analyzing sentiments of citation text. It is evident that classifiers using TF-IDF with SMOTE perform better than those using TF-IDF without SMOTE. With an accuracy of 0.9729, precision of 0.98, and F1 score of 0.96, RF has the best outcomes. All models show significant improvement in classification accuracy after applying SMOTE; SVC achieves values higher than 0.96 on all evaluation measures. The VC, however, performs worse here than without SMOTE. Several factors could contribute to this decrease. First, the synthetic samples generated by SMOTE may introduce noise or artificial patterns that do not align well with the true distribution of the data, which can adversely impact the decision boundaries learned by the VC. Second, the specific characteristics of the VC, which combines multiple base classifiers, may interact with the synthetic samples in a way that hinders its ability to generalize to the true underlying data distribution.
Performance of classifiers using CNN features
Finally, the results of the classifiers are compared using CNN-extracted features to analyze citation text. Results are given in Table 8, which shows that all models achieve better results than those obtained with TF. DT, LR, SGD, RF, ETC, SVC, and the voting classifier achieve accuracy scores higher than 0.94. RF and SVC show similar precision with a 0.97 score, the second-highest. SVC achieves a 0.97 recall and F1 score. However, ETC achieves the highest results with 0.9922 accuracy and 0.99 precision, recall, and F1 score.
Results of cross-validation
To further verify the effectiveness of the proposed strategy, 5-fold cross-validation is also conducted; the findings are shown in Table 9 and a sketch of the procedure follows below. Notably, the suggested model has an average accuracy of 0.995, while its average precision, recall, and F1 score are 0.995, 0.997, and 0.996, respectively.
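A minimal sketch of this 5-fold cross-validation over the ensemble, assuming scikit-learn's cross_val_score and stand-in data:

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier

X = np.random.rand(150, 10)          # stand-in features
y = np.random.randint(0, 3, 150)     # three sentiment classes

vc = VotingClassifier([("lr", LogisticRegression(max_iter=1000)),
                       ("sgd", SGDClassifier(loss="log_loss"))], voting="soft")
scores = cross_val_score(vc, X, y, cv=5, scoring="accuracy")  # accuracy per fold
print(scores, scores.mean())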
Discussions
The comparison of the classifiers using TF and TF-IDF with SMOTE and CNN-based text synthesis is shown in Fig 3. When employed with SMOTE-balanced data, the voting classifier demonstrates superiority over all other classifiers. When combined with TF-IDF and SMOTE, RF outperforms every other combination of TF-IDF with a classifier, achieving an accuracy of 0.9829, precision of 0.98, recall of 0.96, and F1 score of 0.97. Lastly, the findings show that the right mix of feature extraction approaches is essential for supervised machine learning models to be effective. For the analysis of imbalanced text data, a data balancing technique such as SMOTE improves the performance of the classifiers. The tree-based algorithm RF presents better results when trained on SMOTE-balanced data using TF-IDF features for sentiment analysis of citation text. With proper hyperparameter tuning, RF reduces variance across trees at the cost of a slight increase in bias. Results reveal that balancing the data with SMOTE before training improves classifier performance; classifiers are not trained well when classes are imbalanced in text classification. When used properly with the TF-IDF feature extraction technique for sentiment analysis of citation text, tree-based models generalize better and outperform other models.
Though SMOTE is very useful for improving the performance of models with class imbalance problems, it also has some limitations. For example, many of the generated synthetic samples lie in one direction, complicating the decision process for the classifiers. SMOTE can also create a large number of noisy data samples, adding noise to the dataset. Table 10 presents the training and testing accuracy comparison of TF and TF-IDF features with SMOTE. The models' training and testing accuracy values differ significantly from one another. DT, RF, ETC, and the voting classifier show 100% training accuracy but lower testing accuracy when trained on TF with SMOTE. ETC, SVC, and the voting classifier show 100% training accuracy but lower testing accuracy when trained on TF-IDF with SMOTE. Hence, the results indicate overfitting of the models.
Comparative analysis with cutting-edge methods
The proposed model's performance is compared with state-of-the-art research based on feature engineering and learning models [47]. This research is selected for comparison because it also utilized four different types of feature engineering to optimize citation sentiment analysis. Our work shares the same aim but introduces a new feature engineering technique (CNN features) that outperforms all techniques used in [47]. Table 11 displays the results of all models and reveals that the voting classifier using CNN features yields the highest performance among all models. In [47], CNN performed best with an accuracy of 0.93, precision of 0.94, recall of 0.96, and F1 score of 0.95. In contrast, the proposed VC(LR+SGD) with CNN features shows superior performance with 0.99 accuracy, precision, recall, and F1 score.
Conclusions
Citation sentiment analysis has emerged as an attractive solution to complement the limitations of quantitative measures like citation count and h-index. Analyzing the positive, negative, and neutral sentiments of citing authors helps determine the importance of a research article. Current citation sentiment analysis approaches face two challenges: low accuracy and dataset imbalance. This research resolves the issue of dataset imbalance using SMOTE and improves accuracy using CNN-based feature extraction and a voting classifier for classifying the sentiments of citation text.
Extensive experiments are performed using TF, TF-IDF, and CNN features with SMOTE to analyze model performance. Experimental results show that VC(LR+SGD) is the best-performing model for citation text classification when applied to CNN-extracted features. The proposed approach achieves the highest accuracy score of 0.9922, while precision, recall, and F1 score are 0.99 each. It is observed that models exhibit overfitting when trained on SMOTE-generated data, but CNN features do not exhibit this issue. Future work includes ensembles of machine and deep learning models with a fusion of hand-crafted and word embedding features. A second direction is hybrid feature selection, such as merging two different word embeddings with PCA.
References
- 1. Garfield E. The use of journal impact factors and citation analysis for evaluation of science. 41st Annual Meeting Of The Council Of Biology Editors, Salt Lake City, UT (1998).
- 2. Herther N. Research evaluation and citation analysis: Key issues and implications. The Electronic Library. 27 pp. 361–375 (2009,6).
- 3. Oppenheim C. The correlation between citation counts and the 1992 research assessment exercise ratings for British research in genetics, anatomy and archaeology. Journal Of Documentation. 53, 477–487 (1997).
- 4. Hirsch J. An index to quantify an individual’s scientific research output. Proceedings Of The National Academy Of Sciences. 102, 16569–16572 (2005). pmid:16275915
- 5. Garfield E. The history and meaning of the journal impact factor. Jama. 295, 90–93 (2006). pmid:16391221
- 6. Moed H., Colledge L., Reedijk J., Moya-Anegon F., Guerrero-Bote V., Plume A. & Amin M. Citation-based metrics are appropriate tools in journal assessment provided that they are accurate and used in an informed way. Scientometrics. 92, 367–376 (2012).
- 7. Wildgaard L., Schneider J. & Larsen B. A review of the characteristics of 108 author-level bibliometric indicators. Scientometrics. 101, 125–158 (2014).
- 8. Hjerppe R. Supplement to a “Bibliography of bibliometrics and citation indexing & analysis” (Trita-lib-2013). Scientometrics. 4, 241–273 (1982).
- 9. Bar-Ilan J. & Halevi G. Post retraction citations in context: a case study. Scientometrics. 113, 547–565 (2017). pmid:29056790
- 10. Huggett S. Journal bibliometrics indicators and citation ethics: A discussion of current issues. Atherosclerosis. 230, 275–277 (2013). pmid:24075756
- 11. Bornmann L. & Daniel H. What do citation counts measure? A review of studies on citing behavior. Journal Of Documentation. 64, 45–80 (2008).
- 12. Athar A. Sentiment analysis of citations using sentence structure-based features. Proceedings Of The ACL 2011 Student Session. pp. 81–87 (2011).
- 13. Yu B. Automated citation sentiment analysis: what can we learn from biomedical researchers. Proceedings Of The American Society For Information Science And Technology. 50, 1–9 (2013).
- 14. Athar A. Sentiment analysis of citations using sentence structure-based features. Proceedings Of The ACL 2011 Student Session. pp. 81–87 (2011).
- 15. Yousif A., Niu Z., Tarus J. & Ahmad A. A survey on sentiment analysis of scientific citations. Artificial Intelligence Review. 52, 1805–1838 (2019).
- 16. Xu J., Zhang Y., Wu Y., Wang J., Dong X. & Xu H. Citation sentiment analysis in clinical trial papers. AMIA Annual Symposium Proceedings. 2015 pp. 1334 (2015). pmid:26958274
- 17. Amjad Z. & Ihsan I. VerbNet based citation sentiment class assignment using machine learning. International Journal Of Advanced Computer Science And Applications. 11, 621–627 (2020).
- 18. Pan R. & Fortunato S. Author Impact Factor: tracking the dynamics of individual scientific impact. Scientific Reports. 4, 1–7 (2014). pmid:24814674
- 19. West J., Jensen M., Dandrea R., Gordon G. & Bergstrom C. Author-level Eigenfactor metrics: Evaluating the influence of authors, institutions, and countries within the social science research network community. Journal Of The American Society For Information Science And Technology. 64, 787–801 (2013).
- 20. Kochhar S. & Ojha U. Index for objective measurement of a research paper based on sentiment analysis. ICT Express. 6, 253–257 (2020).
- 21. Ikram M. & Afzal M. Aspect based citation sentiment analysis using linguistic patterns for better comprehension of scientific knowledge. Scientometrics. 119, 73–95 (2019).
- 22. Nguyen D., Vo K., Pham D., Nguyen M. & Quan T. A deep architecture for sentiment analysis of news articles. International Conference On Computer Science, Applied Mathematics And Applications. pp. 129–140 (2017).
- 23. Athar A. & Teufel S. Context-enhanced citation sentiment detection. Proceedings Of The 2012 Conference Of The North American Chapter Of The Association For Computational Linguistics: Human Language Technologies. pp. 597–601 (2012).
- 24. Ghosh S. & Shah C. Identifying Citation Sentiment and its Influence while Indexing Scientific Papers. Proceedings Of The 53rd Hawaii International Conference On System Sciences (2020).
- 25. Zhu X., Turney P., Lemire D. & Vellino A. Measuring academic influence: Not all citations are equal. Journal Of The Association For Information Science And Technology. 66, 408–427 (2015).
- 26. Aljuaid H., Iftikhar R., Ahmad S., Asif M. & Afzal M. Important citation identification using sentiment analysis of in-text citations. Telematics And Informatics. 56 pp. 101492 (2021).
- 27. Nazir S., Asif M., Ahmad S., Aljuaid H., Iftikhar R., Nawaz Z. et al. Important Citation Identification by Exploding the Sentiment Analysis and Section-Wise In-Text Citation Weights. IEEE Access. 10 pp. 87990–88000 (2022).
- 28. Wang M., Zhang J., Jiao S., Zhang X., Zhu N. & Chen G. Important citation identification by exploiting the syntactic and contextual information of citations. Scientometrics. 125 pp. 2109–2129 (2020).
- 29. Safavian S. & Landgrebe D. A survey of decision tree classifier methodology. IEEE Transactions On Systems, Man, And Cybernetics. 21, 660–674 (1991).
- 30. Brijain M., Patel R., Kushik M. & Rana K. A survey on decision tree algorithm for classification. International Journal Of Science And Research (IJSR). (2014).
- 31. Zhang Y., Zhang H., Cai J. & Yang B. A weighted voting classifier based on differential evolution. Abstract And Applied Analysis. 2014 (2014).
- 32. Freund Y. & Schapire R. A decision-theoretic generalization of on-line learning and an application to boosting. Journal Of Computer And System Sciences. 55, 119–139 (1997).
- 33. Sebastiani F. Machine learning in automated text categorization. ACM Computing Surveys (CSUR). 34, 1–47 (2002).
- 34. Zadrozny B. & Elkan C. Transforming classifier scores into accurate multiclass probability estimates. Proceedings Of The Eighth ACM SIGKDD International Conference On Knowledge Discovery And Data Mining. pp. 694–699 (2002).
- 35. Gregorutti B., Michel B. & Saint-Pierre P. Correlation and variable importance in random forests. Statistics And Computing. 27, 659–678 (2017).
- 36. Rustam F., Ashraf I., Mehmood A., Ullah S. & Choi G. Tweets classification on the base of sentiments for US airline companies. Entropy. 21, 1078 (2019).
- 37. Safavian S. & Landgrebe D. A survey of decision tree classifier methodology. IEEE Transactions On Systems, Man, And Cybernetics. 21, 660–674 (1991).
- 38. Cortes C. & Vapnik V. Support-vector networks. Machine Learning. 20, 273–297 (1995).
- 39. Umer M., Ashraf I., Mehmood A., Ullah S. & Choi G. Predicting numeric ratings for google apps using text features and ensemble learning. ETRI Journal. 43, 95–108 (2021).
- 40. Catal C. & Nangir M. A sentiment classification model based on multiple classifiers. Applied Soft Computing. 50 pp. 135–141 (2017).
- 41. Chawla N. Data mining for imbalanced datasets: An overview. Data Mining And Knowledge Discovery Handbook. pp. 875–886 (2009).
- 42. Ishaq A., Sadiq S., Umer M., Ullah S., Mirjalili S., Rupapara V. & Nappi M. Improving the prediction of heart failure patients’ survival using SMOTE and effective data mining techniques. IEEE Access. 9 pp. 39707–39716 (2021).
- 43. Umer M., Sadiq S., Karamti H., Karamti W., Majeed R. & Nappi M. IoT Based Smart Monitoring of Patients’ with Acute Heart Failure. Sensors. 22, 2431 (2022). pmid:35408045
- 44. Ashraf I., Hur S. & Park Y. Application of deep convolutional neural networks and smartphone sensors for indoor localization. Applied Sciences. 9, 2337 (2019).
- 45. Rustam F., Siddique M., Siddiqui H., Ullah S., Mehmood A., Ashraf I. et al. Wireless capsule endoscopy bleeding images classification using CNN based model. IEEE Access. 9 pp. 33675–33688 (2021).
- 46. Khan S., Rahmani H., Shah S. & Bennamoun M. A guide to convolutional neural networks for computer vision. Synthesis Lectures On Computer Vision. 8, 1–207 (2018).
- 47. Karim M., Missen M., Umer M., Sadiq S., Mohamed A. & Ashraf I. Citation context analysis using combined feature embedding and deep convolutional neural network model. Applied Sciences. 12, 3203 (2022).