
Enhancing tertiary students’ programming skills with an explainable Educational Data Mining approach

  • Md Rashedul Islam,

    Roles Data curation, Methodology, Software, Validation, Visualization, Writing – original draft

    Affiliation Department of Computer Science and Engineering, Hajee Mohammad Danesh Science and Technology University, Dinajpur, Bangladesh

  • Adiba Mahjabin Nitu,

    Roles Conceptualization, Formal analysis, Supervision, Validation, Writing – review & editing

    Affiliation Department of Computer Science and Engineering, Hajee Mohammad Danesh Science and Technology University, Dinajpur, Bangladesh

  • Md Abu Marjan,

    Roles Conceptualization, Data curation, Software, Validation, Writing – review & editing

    Affiliation Department of Computer Science and Engineering, Hajee Mohammad Danesh Science and Technology University, Dinajpur, Bangladesh

  • Md Palash Uddin ,

    Roles Conceptualization, Data curation, Formal analysis, Methodology, Supervision, Writing – review & editing

    palash_cse@hstu.ac.bd

    Affiliation Department of Computer Science and Engineering, Hajee Mohammad Danesh Science and Technology University, Dinajpur, Bangladesh

  • Masud Ibn Afjal,

    Roles Conceptualization, Formal analysis, Methodology, Supervision, Validation, Writing – review & editing

    Affiliation Department of Computer Science and Engineering, Hajee Mohammad Danesh Science and Technology University, Dinajpur, Bangladesh

  • Md Abdulla Al Mamun

    Roles Conceptualization, Supervision, Validation, Writing – review & editing

    Affiliation Department of Computer Science and Engineering, Hajee Mohammad Danesh Science and Technology University, Dinajpur, Bangladesh

Abstract

Educational Data Mining (EDM) holds promise in uncovering insights from educational data to predict and enhance students’ performance. This paper presents an advanced EDM system tailored for classifying and improving tertiary students’ programming skills. Our approach emphasizes effective feature engineering, appropriate classification techniques, and the integration of Explainable Artificial Intelligence (XAI) to elucidate model decisions. Through rigorous experimentation, including an ablation study and an evaluation of six machine learning algorithms, we introduce a novel ensemble method, Stacking-SRDA, which outperforms the other models in accuracy, precision, recall, F1-score, and ROC analysis, with its advantage confirmed by the McNemar test. Leveraging XAI tools, we provide insights into model interpretability. Additionally, we propose a system that identifies programming skill gaps among weaker students and offers tailored recommendations for skill enhancement.

1 Introduction

The field of Educational Data Mining (EDM) is an interdisciplinary research area dedicated to developing tools for analyzing data collected from educational settings. Using statistical, Data Mining (DM), and Machine Learning (ML) techniques, EDM seeks to uncover hidden patterns [1]. By delving into unique educational data, EDM aims to understand student performance and learning environments better [2]. Widely adopted in higher education, EDM contributes to student-centered approaches and real-time predictive insights [3–5], focusing on refining learning processes through precise modeling of student behavior and performance. Recent research has explored EDM applications in higher education, including dropout prediction, academic performance forecasting [6–9], and behavior analysis [10–13].

The modern era’s emphasis on technology has spurred increased interest in Computer Science (CS) and related fields among students, given the promising job market. Programming skills stand out as crucial for success in CS, encompassing language proficiency, mathematical acumen, problem-solving abilities, creativity, communication skills, and adaptability. While EDM has been applied to assess programming skills, existing approaches have limitations. For instance, Pathan et al. [14] and Sunday et al. [15] focus on specific programming languages or courses, while Marjan et al. [16] predict tertiary-level programming skills without incorporating data representation techniques. To address these limitations, we present an enhanced EDM system with explainability, aiming to (i) improve the classification accuracy of students’ programming performance; (ii) identify key factors influencing classification; and (iii) develop a skill gap identification system with skill enhancement recommendations. Our approach employs fundamental ML algorithms and ensemble learning, alongside Explainable Artificial Intelligence (XAI) tools such as SHAPASH, ELI5, and Local Interpretable Model Agnostic Explanations (LIME).

Our study introduces several innovative aspects, including a customized ensemble ML approach with specific modifications to improve the prediction and enhancement of tertiary students’ programming skills. We integrated XAI techniques to add transparency and interpretability to the EDM process, significantly advancing beyond traditional black-box models. Additionally, our comprehensive evaluation framework employs multiple performance metrics and cross-validation techniques to ensure the reliability and generalizability of our results. Finally, we provide actionable insights and recommendations for educators, directly impacting teaching strategies and student support mechanisms. As such, our key contributions include:

  • Introduction of an effective data pre-processing technique and stacking ensemble model, achieving superior classification accuracy.
  • Utilization of XAI tools to explore model interpretability and feature importance.
  • Development of a recommendation system for improving programming skills, incorporating skill gap identification.

The subsequent sections are organized as follows: Section 2 reviews related EDM works. Section 3 outlines our proposed explainable EDM system’s approach, including dataset preprocessing and ML classifiers. Section 4 presents experiments and classification results. The utilization of XAI tools is detailed in Section 5, and Section 6 summarizes findings and conclusions.

2 Literature review

One of the most widely studied topics in EDM is the analysis and prediction of students’ performance. In a study [17], the relationship between students’ academic performance and their involvement in extracurricular activities is examined using three ML algorithms: Random Forest (RF), Decision Tree (DT), and k-Nearest Neighbor (KNN). The results show that DT outperformed the other algorithms, achieving an F1-score of 84% and an accuracy of 85%. In another study [18], DM techniques and video learning analytics are employed to forecast students’ final performance. The study aims to predict students’ semester grades using video learning analytics and data from the learning management system, student information system, and mobile applications. Eight different classification algorithms are applied, and the results indicate that RF classifiers achieve an accuracy level of 88.3%. Anjana Pradeep [19] conducted a study on student dropout and failure prediction at M.G. University, Kerala, India, from 2013–2018. Several classification algorithms, including induction rules and DTs, were employed. The results showed that the AD Tree algorithm, when applied with the most relevant attributes, achieved an accuracy of 92%. The study in [20] focuses on evaluating ML models using student interaction data from a Virtual Learning Environment (VLE). CatBoost achieves the best result with 94.64% accuracy; however, the study is based on a specific course from a particular institute, which limits its generalizability. Overall, these studies illustrate the effectiveness of ML algorithms in analyzing the collected data and predicting students’ performance in different educational contexts. However, the researchers did not suggest any methods to improve classification accuracy or provide practical implementation guidelines, despite the potential for improvement in these areas.

There is a lot of attention given to enhancing CSE students’ proficiency in computer programming. To aid tutors in this work, a DT-based model was presented in [14] to categorize students into three categories (Good, Average, and Poor) based on their C programming skills. The authors used a very small dataset (70 samples) and 16 features in this research. The DT algorithm obtained 87% accuracy; the reliability of the study could be improved if more data were collected. The experiment presented in [15] was based on a dataset from a JAVA-based “Introduction to Computer Programming” course. They analyzed student log data collected from the Department of Mathematics and Computer Science Unit, including metrics like Assignment Completed (ASC), Class Test Score (CTS), Class Attendance (CATT), and Class Lab Work (CLW). For the analysis, they utilized data mining strategies such as the ID3 and J48 Decision Tree algorithms. According to their findings, J48 achieved an accuracy of 87%. In another work [16], researchers predicted university-level students’ programming skills and proposed a mechanism to enhance them. They divided the class into four categories, Excellent, Good, Average, and Weak, based on the students’ levels of programming expertise. The authors built a dataset relevant to this objective and investigated six fundamental ML algorithms. Their testing revealed that RF achieved an accuracy of 93%. The authors did not explain predictions for individual classes, which could have been an interesting part of the research. Table 1 summarizes the existing work on EDM with objectives and classification performance. However, there has been a significant gap in improving performance prediction as well as in exploring which features most effectively enhance performance in computer programming education.

Table 1. Works in the existing literature and comparison with the proposed EDM system.

https://doi.org/10.1371/journal.pone.0307536.t001

Recently, XAI has emerged as a vibrant research area, as evidenced by the growing number of scholarly articles and dedicated conferences. It is worth noting that the literature encompasses XAI explainers tailored for specific models, along with explanations that have either global or local scopes. In the article cited in [24], valuable insights and significant features are discussed to identify the reasons for student dropout in schools using XAI tools such as LIME and SHAP. Such XAI approaches can serve as a foundation for granting learners greater responsibility and control over their learning [25]. Particularly, explanations of AI, including the language used and the details provided, should facilitate teachers, students, and parents in recognizing personal relevance [26], thereby empowering them to make informed decisions regarding the implementation and application of AI in ways that align with their values. As these works recommend, XAI can explore black-box models. In particular, to predict students’ performance in programming, XAI can be an emerging solution to explore hidden patterns or reasons behind individual classification. In this work, we have incorporated our findings and the literature gap from previous research. Our study focuses on the gap between traditional EDM and XAI. We have introduced a customized ensemble ML approach with specific modifications to improve the prediction. Table 1 represents the comparison between existing EDM and our proposed EDM system in terms of classification.

3 Proposed explainable EDM methodology

3.1 Overview

Fig 1 depicts the operational steps of the proposed EDM approach. Dataset preparation stands as the initial and pivotal phase within our EDM framework. Following necessary preprocessing and feature engineering, the dataset is partitioned into various training and testing ratios and utilized with different ML models, including Logistic Regression (LR), DT, Support Vector Machine (SVM), Artificial Neural Network (ANN), k-NN, Naive Bayes Classifier (NBC), and RF. Subsequently, a stacking ensemble model is employed. Performance evaluation and assessment are conducted based on experimental outcomes and evaluation metrics. Finally, to enhance model interpretability, we leverage XAI tools such as LIME, SHAPASH, and ELI5.

3.2 Dataset description

The dataset comprises 1720 samples and encompasses data from Computer Science and Engineering (CSE) students attending various universities in Bangladesh [16]. Table 2 represents all features and values (levels) of each feature. It encompasses a broad spectrum of information, including students’ proficiency in programming languages, problem-solving experiences in both online and onsite programming contests, computational skills, creativity, course results, and other technological experiences. With 36 features and 1 target label, the dataset aims to predict students’ proficiency level, categorized into four classes: Excellent, Good, Average, and Weak.

Table 2. Feature names and levels of the utilized dataset.

https://doi.org/10.1371/journal.pone.0307536.t002

3.3 Dataset preprocessing

Table 2 illustrates the number of unique values for each feature. This dataset contains unstructured data for ML tasks, with all data represented as string values. Consequently, a conversion of these data into numerical values is necessary. To achieve this conversion and ensure appropriate data representation, we explore an effective feature engineering process. Furthermore, to address the class imbalance, we investigate the Synthetic Minority Oversampling Technique (SMOTE) and NearMiss data balancing techniques.

3.3.1 Feature engineering.

Algorithm 1 represents the proposed feature engineering process. In this process, we explore both Label Encoding (LE) and One Hot Encoding (OHE) techniques to convert categorical and string values into numerical values. Initially, we calculate the number of unique values, Cμ, for each of the 36 features. Through several trials, we determine the optimized value for all Cμ to encode the features as binary mappings, thereby avoiding overfitting to the model. Our findings indicate that if Cμ is less than 7, binary mapping of the feature is most suitable. For binary mapping, we employ OHE. For features where Cμ is greater than or equal to 7, we utilize LE. Specifically, we employ LE for 8 features (feature numbers 14, 18, 19, 20, 21, 32, 33, and 34 from Table 2) and binary mapping for the remaining 28 features. Finally, we combine the results of these two encoding methods to obtain our final dataset after feature engineering, denoted as Sp.

Algorithm 1 Proposed Feature Engineering Procedure

Input: Raw Dataset (S), Number of Features (n)

Output: Preprocessed dataset after Feature Engineering Sp

1: Procedure (FeatureEngineering():)

2:  for i = 1 to n do

3:   Cμ[i] ← Unique(i) ▹ Calculate number of unique values, Cμ, of each feature

4:  end for

5:  for i = 1 to n do

6:   if Cμ[i] < 7 (trial and error manner) then

7:    Rb ← binary mapping (OHE)(Cμ[i])

8:   else

9:    Rle ← label encoding(Cμ[i])

10:   end if

11:  end for

12:  Sp ← Concat(Rb, Rle)

13:  return Sp

14: end procedure

One Hot Encoding. Textual information is not directly interpretable by ML algorithms, which necessitates conversion to numeric values. In this research, data preprocessing involved encoding textual information as one-hot vectors. This technique enhances model comprehension [27]. OHE creates a new variable for each distinct level of a categorical attribute or feature. Each class or category is then mapped to a binary variable, taking values of 0 or 1. Here, 0 denotes the absence of the class or category, while 1 signifies its presence. Despite potentially increasing the dimensionality of feature vectors, especially when dealing with high cardinality, OHE remains widely adopted due to its simplicity [28].

Label Encoding. LE is a technique utilized in ML and data analysis to transform categorical variables into numerical labels. This technique assigns a unique integer value to each category [29]. LE proves beneficial when the categorical variable possesses a natural order, such as “low”, “medium”, and “high”. Assigning integer values based on this natural order can be particularly advantageous in such scenarios.
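As an illustration of Algorithm 1, the encoding step can be sketched in Python with pandas roughly as follows; the DataFrame name, the threshold of 7 unique values, and the example file and column names are taken from the description above or are purely hypothetical.

import pandas as pd

def feature_engineering(df: pd.DataFrame, threshold: int = 7) -> pd.DataFrame:
    """Encode each feature: one-hot (binary mapping) if it has fewer than
    `threshold` unique values, label encoding otherwise."""
    ohe_cols, le_cols = [], []
    for col in df.columns:
        if df[col].nunique() < threshold:
            ohe_cols.append(col)   # binary mapping via OHE
        else:
            le_cols.append(col)    # integer codes via LE

    # One-hot encode the low-cardinality features
    encoded = pd.get_dummies(df[ohe_cols], columns=ohe_cols, dtype=int)

    # Label-encode the high-cardinality features
    for col in le_cols:
        encoded[col] = df[col].astype("category").cat.codes

    return encoded  # final ML-trainable dataset S_p

# Hypothetical usage:
# raw = pd.read_csv("students_programming.csv")
# X = feature_engineering(raw.drop(columns=["Performance"]))
# y = raw["Performance"]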

3.3.2 Data balancing.

In DM implementations, data inconsistency often poses a challenge. Specifically, data imbalance denotes a scenario where one class has significantly more samples than others, impacting classifier performance. To tackle this issue, various preprocessing techniques, such as oversampling, undersampling, synthetic minority oversampling, and others, are commonly employed. In our efforts to create a balanced dataset, we utilized SMOTE in conjunction with NearMiss. Table 3 presents the results after applying SMOTE and NearMiss to the dataset, which serves as our final ML trainable dataset.

Table 3. State of the dataset before and after data balancing using SMOTE and NearMiss.

https://doi.org/10.1371/journal.pone.0307536.t003

Synthetic Minority Oversampling Technique. SMOTE stands as the most common and effective oversampling method across numerous application domains, extensively utilized for handling imbalanced data [30]. In class-imbalanced datasets, one class typically comprises significantly fewer instances than the others, potentially leading to a biased model that favors the majority class. To address this, SMOTE augments the number of instances in the minority class by generating synthetic samples resembling the existing minority samples. The SMOTE algorithm accomplishes this by creating new samples through interpolation among existing minority class samples. Specifically, SMOTE selects k nearest neighbors for each minority sample from the same class and generates new samples by interpolating between the chosen sample and its neighbors [31].

NearMiss. NearMiss stands as a popular undersampling technique employed to mitigate the class imbalance issue in ML [32]. In class-imbalanced datasets, where one class comprises significantly fewer instances than others, there’s a risk of bias toward the majority class, potentially leading to a skewed model. NearMiss tackles this problem by reducing the number of instances in the majority class to achieve a balanced dataset and enhance model performance. The NearMiss algorithm achieves this by selecting samples from the majority class that are closest to the instances of the minority class. Because the retained majority-class examples lie near the minority class, this method preserves the decision boundary between the classes. Consequently, after this selection of majority-class samples, the resulting dataset exhibits a balanced class distribution.
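A minimal sketch of the two balancing strategies using the imbalanced-learn library is given below; X and y denote the encoded feature matrix and class labels, and the parameters shown are library defaults rather than the exact settings of this study.

from collections import Counter
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import NearMiss

def balance_with_smote(X, y, random_state=42):
    """Oversample minority classes by interpolating between a sample and
    its k nearest minority-class neighbors."""
    return SMOTE(k_neighbors=5, random_state=random_state).fit_resample(X, y)

def balance_with_nearmiss(X, y):
    """Undersample the majority class, keeping samples close to the minority."""
    return NearMiss(version=1).fit_resample(X, y)

# Hypothetical usage: inspect class counts before and after balancing
# print(Counter(y))
# X_bal, y_bal = balance_with_smote(X, y)
# print(Counter(y_bal))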

3.4 Model training

After preprocessing, we proceed to train the ML algorithms to identify the most effective classifier for predicting students’ programming skills using the prepared dataset. Multiple experiments are conducted to minimize the disparity between observed and predicted values. Specifically, we focus on the SVM, NBC, LR, RF, DT, ANN, and KNN ML models. We employ a grid search strategy as necessary to optimize the output of all ML classifiers. Table 4 illustrates the hyperparameter tuning values for the utilized ML classifiers. Subsequently, we apply an ensemble learning method known as stacking to enhance prediction accuracy. Our proposed stacking method, Stacking-SRDA, employs SVM, DT, and RF as base models, with an ANN serving as the meta-model.

Table 4. Hyperparameter tuning values for the ML classifier.

https://doi.org/10.1371/journal.pone.0307536.t004
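A minimal sketch of the grid-search tuning step with scikit-learn is shown below; the RF parameter grid is purely illustrative and does not reproduce the exact search space behind Table 4.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Illustrative grid for RF; the other classifiers are tuned analogously
param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 10, 20],
    "min_samples_split": [2, 5],
}

grid = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid=param_grid,
    scoring="accuracy",
    cv=5,        # cross-validation on the training split only
    n_jobs=-1,
)
# grid.fit(X_train, y_train)
# print(grid.best_params_, grid.best_score_)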

3.4.1 Logistic regression.

Logistic Regression (LR) is a supervised linear ML algorithm used for classification rather than regression, despite its name. It is widely used for classification problems [33] in fields such as medicine, social science, and EDM [21, 22]. LR computes the likelihood that a given input belongs to a specific category by fitting a logistic function to the input data. To reduce the discrepancy between the predicted probabilities and the actual outcomes, the model’s parameters are adjusted during training.

3.4.2 Decision tree.

A DT classifier is a popular supervised ML algorithm that can be used for both regression and classification problems [34]. The DT algorithm predicts the class of a sample by recursively splitting the feature space into smaller regions based on the value of the features until each region contains only samples of a single class. In the tree structure, each internal node represents a test on a feature, and each branch indicates the conclusion of the test, leading to a child node corresponding to a subset of the samples in the decision tree classifier. The leaves of the tree represent the predicted class for each subset.

3.4.3 Naive Bayes Classifier.

NBC is a probabilistic classifier based on the Bayes theorem [35]. In this model, the presence of one feature in a class is treated as though it were unrelated to any other feature’s presence. This assumption simplifies the calculation of probabilities and allows for efficient classification of data with a large number of features. The NBC can be mathematically represented as follows:

\[ P(y \mid x) = \frac{P(x \mid y)\, P(y)}{P(x)} \tag{1} \]

where P(x) is the marginal probability of sample x, P(y) is the prior probability of class y, P(y | x) is the posterior probability of class y given sample x, and P(x | y) is the likelihood of sample x given class y.

3.4.4 Support Vector Machine.

The SVM classifier can be applied to both binary and multi-class classification tasks. The SVM classifier works by finding a hyperplane in a high-dimensional feature space that best separates the data into different classes [36]. An SVM classifier can be visualized in a two-dimensional feature space, where the hyperplane is a line that separates the two classes. The support vectors are the points nearest to the hyperplane in each class and are used to define the margin. The hyperplane determines the decision boundary of the SVM classifier, and the predicted class for a new sample is based on which side of the hyperplane it falls on. In this paper, we apply the RBF kernel to the SVM classifier.

3.4.5 k–Nearest Neighbor.

KNN is an instance-based ML model used as a non-parametric supervised learning method [37]. KNN’s training phase is quicker than that of competing classifiers, but its testing phase is slower and more resource-intensive. The classification in KNN is determined by the k-value: KNN decides the class of a sample data point by majority voting among its nearest neighbors, where the k-value specifies the number of neighbors considered. The distance from the sample data point is used to find the neighbors [38].

3.4.6 Artificial Neural Network.

An ANN classifier is a type of ML algorithm inspired by the function and structure of the human brain [39]. It is designed to learn from given data and make predictions based on that learning by utilizing a feedforward neural network as the underlying model. The ANN classifier consists of an input layer, one or more hidden layers, and an output layer. The input layer receives the feature vectors of the input data, the hidden layers perform calculations on the input data, and the output layer generates the predicted class label [40]. The ANN classifier passes the input data through the layers of neurons, each of which performs a weighted sum of its inputs and applies an activation function to produce an output. During training, the weights and biases of the neurons are adjusted to reduce the difference between the predicted and actual output.

3.4.7 Random Forest.

RF is a decision tree-based ensemble ML algorithm [41]. It builds several decision trees on different subsets of the main dataset and aggregates their results to increase the predictive accuracy on the given dataset. The algorithm can predict categorical and continuous data using classification and regression methods. Each decision tree is fitted to a randomly drawn (bootstrapped) subset of the given dataset and randomly selects a subset of relevant features when splitting the data samples. The RF classifier thus fits a variety of bootstrapped datasets on various decision trees, and the predicted result is obtained by aggregating the responses of all the independent trees. Simply put, instead of depending on one decision tree, RF takes the prediction from each individual tree and, based on averaging or the majority vote of the predictions, produces the final output [42]. We have used RF with the hyperparameter n_estimators = 100.

3.4.8 Ensemble Learning-Stacking.

Ensemble stacking is a popular technique used in ML to improve the accuracy of predictions by combining the outputs of multiple models [43]. In an ensemble stacking approach, multiple base models are trained on the same training data using different algorithms and/or hyperparameters. The outputs of these models are then combined, often by training a meta-model on top of the outputs of the base models [44]. In this study, we have proposed a Stacking-SRDA model, where SVM, RF, and DT are taken as the base models and ANN is taken as a meta-model.
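A minimal scikit-learn sketch of the Stacking-SRDA configuration described above (SVM, RF, and DT as base models, ANN as the meta-model) follows; the individual hyperparameters are placeholders, not the tuned values of Table 4.

from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

base_learners = [
    ("svm", SVC(kernel="rbf", probability=True, random_state=42)),
    ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
    ("dt", DecisionTreeClassifier(random_state=42)),
]

# ANN (multi-layer perceptron) acts as the meta-model over the base outputs
stacking_srda = StackingClassifier(
    estimators=base_learners,
    final_estimator=MLPClassifier(hidden_layer_sizes=(64,), max_iter=1000,
                                  random_state=42),
    stack_method="predict_proba",  # feed class probabilities to the meta-model
    cv=5,
)
# stacking_srda.fit(X_train, y_train)
# y_pred = stacking_srda.predict(X_test)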

3.5 Performance evaluation

To evaluate the performance of the ML algorithms for predicting students’ programming performance, we employ the key widely used performance measurement metrics: accuracy, precision, recall, F1-score, RMSE, and Cohen’s Kappa (C Kappa) [45]. Accuracy is referred to as the testing accuracy of an ML model [46]; it states the percentage of the whole dataset for which the actual value agrees with the predicted value, and it helps to classify the students. Precision calculates the positive predictive value, i.e., the probability that a positive prediction is correct [47]. In our setting, precision refers to the percentage of students who are correctly classified into the specified categories (excellent, good, average, and weak) according to their performance in programming. Recall specifies the proportion of actual positives that are correctly identified by the ML model [48]. The F1-score is obtained from recall and precision as their harmonic mean. For an imbalanced dataset, precision, recall, and F1-score are sometimes assumed to be more effective performance measurement metrics than accuracy [49]. During evaluation, an ML model produces True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN) counts, from which these metrics are constructed as follows:

\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{2} \]

\[ \text{Precision} = \frac{TP}{TP + FP} \tag{3} \]

\[ \text{Recall} = \frac{TP}{TP + FN} \tag{4} \]

\[ \text{F1-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{5} \]

A typical metric for assessing a prediction model’s error is the Root Mean Square Error (RMSE) [50]. Because of squaring, larger errors contribute more significantly to the average size of the deviations between predicted and actual values. RMSE is defined as:

\[ \text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - p_i)^2} \tag{6} \]

where n is the number of instances, y_i is the actual value, and p_i is the predicted value. Cohen’s Kappa (C Kappa) is used to assess inter-rater agreement on categorical items [51]. It is a more reliable measure than a simple percentage agreement since it takes into account the possibility that the agreement happened by chance. It is defined as:

\[ \kappa = \frac{p_0 - p_e}{1 - p_e} \tag{7} \]

where p_0 is the observed and p_e is the expected agreement proportion.
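For reference, these metrics can be computed with scikit-learn roughly as follows; the macro averaging and the integer-encoded labels assumed for RMSE are illustrative choices, not necessarily the exact settings used in the experiments.

import numpy as np
from sklearn.metrics import (accuracy_score, cohen_kappa_score, f1_score,
                             precision_score, recall_score)

def evaluate(y_true, y_pred):
    """Compute the metrics of Eqs. (2)-(7) for multi-class predictions."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, average="macro"),
        "recall": recall_score(y_true, y_pred, average="macro"),
        "f1": f1_score(y_true, y_pred, average="macro"),
        # RMSE here assumes integer-encoded class labels
        "rmse": np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)),
        "c_kappa": cohen_kappa_score(y_true, y_pred),
    }

# Hypothetical usage: evaluate(y_test, stacking_srda.predict(X_test))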

3.5.1 Receiver Operating Characteristic curve.

The ROC curve is also considered a performance measure for ML algorithms. The ROC curve plots the relationship between the true-positive rate (sensitivity) and the false-positive rate (1 − specificity).
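A sketch of how per-class ROC curves can be obtained for this four-class problem with a one-vs-rest formulation in scikit-learn is given below; the fitted probabilistic model and test split are assumed to exist.

from sklearn.metrics import auc, roc_curve
from sklearn.preprocessing import label_binarize

def per_class_roc(model, X_test, y_test):
    """One-vs-rest ROC curve and AUC for each proficiency class."""
    classes = model.classes_              # keeps label and column order aligned
    y_bin = label_binarize(y_test, classes=classes)
    y_score = model.predict_proba(X_test)
    curves = {}
    for i, cls in enumerate(classes):
        fpr, tpr, _ = roc_curve(y_bin[:, i], y_score[:, i])
        curves[cls] = (fpr, tpr, auc(fpr, tpr))
    return curves

# Hypothetical usage: curves = per_class_roc(stacking_srda, X_test, y_test)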

3.5.2 Significance test (McNemar test).

To compare the significance of Stacking-SRDA against the other ML classifiers, we employ the McNemar test. The McNemar test is a statistical test applied to compare the proportions of two paired binary samples to determine whether they come from the same population [52]. It is often used when there are only two possible outcomes for a given observation or variable, such as in medical diagnosis or in evaluating the performance of an ML algorithm [53]. It is commonly used to assess whether there is a significant difference in the classification performance of two models or algorithms on a given dataset. The McNemar test compares the proportion of cases where two models make the same prediction and the proportion where they make different predictions.
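One possible way to run the pairwise McNemar comparison is sketched below using the mlxtend library; this is an illustrative implementation, not necessarily the one used in this study.

import numpy as np
from mlxtend.evaluate import mcnemar, mcnemar_table

def compare_classifiers(y_true, y_pred_a, y_pred_b):
    """McNemar test on the paired correct/incorrect predictions of two models."""
    table = mcnemar_table(y_target=np.asarray(y_true),
                          y_model1=np.asarray(y_pred_a),
                          y_model2=np.asarray(y_pred_b))
    chi2, p_value = mcnemar(ary=table, corrected=True)
    return chi2, p_value

# Hypothetical usage: a small p-value indicates a significant difference
# chi2, p = compare_classifiers(y_test, pred_stacking, pred_rf)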

3.6 Explainable Artificial Intelligence

To enhance the transparency of models, XAI has emerged as a solution to convert AI from a black-box ML model to a grey box. The goal of XAI is to develop a range of techniques that produce more easily explainable models while retaining high levels of performance [54]. This approach may be particularly useful for educational institutions when making decisions. Several approaches to developing XAI include rule-based systems, decision trees, and neural network visualization techniques [55]. These approaches aim to provide a range of explanations for how AI models work, including global explanations that provide an overall understanding of the model’s behavior, as well as local explanations that provide detailed insights into individual predictions or decisions made by the model [56].

3.6.1 Global explainability.

Global explanations are useful for understanding the model’s general behavior and how it performs on average across different types of inputs. They are also important for identifying potential biases or errors in the model that could be corrected or improved. In this study, we use SHAPASH, ELI5, and GWO for global explainability.

SHAPASH. Shapash XAI is an open-source Python library that provides various tools and techniques for building XAI models [57]. Global explanations give an overall understanding of how a model works and what factors are most important in determining its outputs, while local explanations provide detailed insights into individual predictions or decisions made by the model. Shapash is compatible with a wide range of ML frameworks and libraries, including scikit-learn, XGBoost, and TensorFlow.

ELI5. ELI5 is a technique in the field of XAI that is aimed at providing clear and understandable insights into the key factors that influence a model’s predictions [57]. It is designed to use language that non-experts can comprehend, making it a useful tool for a wide range of applications. The ELI5 package provides various functions and tools for implementing ELI5 in Python.
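A short sketch of the permutation-importance computation with the eli5 package is given below; the fitted model, the DataFrame-style test split, and the accuracy scoring are assumptions.

import eli5
from eli5.sklearn import PermutationImportance

def permutation_importance_report(model, X_test, y_test):
    """Mean Decrease Accuracy (MDA) per feature for an already-fitted model."""
    perm = PermutationImportance(model, scoring="accuracy",
                                 random_state=42).fit(X_test, y_test)
    explanation = eli5.explain_weights(perm, feature_names=list(X_test.columns))
    return eli5.format_as_text(explanation)

# Hypothetical usage:
# print(permutation_importance_report(stacking_srda, X_test, y_test))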

Grey Wolf Optimization (GWO). GWO is a new and interpretable metaheuristic algorithm. It is based on the hunting and social structure of grey wolves [58]. To solve optimization challenges, GWO imitates the cooperative hunting style and leadership structure of grey wolves. Numerous optimization problems in the data mining [59, 60] domains have been efficiently solved by it because of its simplicity, ease of implementation, and capacity to successfully strike a balance between exploration and exploitation. It offers the global explainability of the constructed model, which represents the model’s reliability.

3.6.2 Local explainability.

Local explanations are useful for understanding how the model arrived at a specific prediction or decision and can help identify cases where the model may have made errors or reached unreasonable conclusions. In this study, we use LIME for local explainability.

Local Interpretable Model Agnostic Explanations. Another popular XAI technique is LIME, which involves creating a local approximation of the model to generate clear and interpretable explanations for its predictions [57]. Like ELI5, LIME is implemented using a dedicated package called lime, which provides a range of functions and tools for generating and interpreting LIME explanations. One of the key benefits of LIME is that it provides a way to generate local explanations that are specific to individual predictions. This can be particularly useful in applications where it is important to understand the reasoning behind individual decisions made by an ML model, such as in medical diagnosis or credit scoring.
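A sketch of a LIME explanation for a single student record is shown below; the class-name ordering and the selected row index are illustrative assumptions.

import numpy as np
from lime.lime_tabular import LimeTabularExplainer

def explain_student(model, X_train, X_test, row_index=0):
    """LIME explanation of one student's predicted proficiency class."""
    explainer = LimeTabularExplainer(
        training_data=np.asarray(X_train),
        feature_names=list(X_train.columns),
        class_names=list(model.classes_),   # assumed to match predict_proba order
        mode="classification",
    )
    exp = explainer.explain_instance(
        data_row=np.asarray(X_test)[row_index],
        predict_fn=model.predict_proba,
        num_features=10,
        top_labels=1,                       # explain the top predicted class
    )
    label = exp.available_labels()[0]
    return exp.as_list(label=label)         # (feature condition, weight) pairs

# Hypothetical usage: explain_student(stacking_srda, X_train, X_test, row_index=5)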

4 Result and discussion

4.1 Classification result without proposed preprocessing pipeline

To classify students’ programming performance using ML algorithms, we divided the dataset into an 80:20 ratio. Table 5 shows the ML classifiers’ performance without applying the proposed preprocessing pipeline (data balancing and feature engineering). All values represent the mean of 5 trials of experiments with the standard deviation of accuracy. Here, RF obtains the highest performance among the ML algorithms, showing 0.92 ± 0.02 accuracy with 92% precision and recall and a 91% f1-score. The other algorithms also perform consistently, with NBC showing the lowest accuracy and f1-score at 71%.

Table 5. Classification results without the proposed preprocessing pipeline.

All values represent the mean value of 5 trials of experiments.

https://doi.org/10.1371/journal.pone.0307536.t005

To improve prediction performance, we individually apply SMOTE and NearMiss as data balancing techniques with feature engineering. The following results represent the classifier’s performance after applying data balancing techniques with feature engineering.

4.2 Classification performance after applying SMOTE and feature engineering

To improve the classifier’s performance, we apply SMOTE as a data balancing technique and feature engineering to the dataset. The experimental result is shown in Table 6. The overall performance of all the classifiers improves compared to the results in Table 5. We perform 5 trials of experiments for all of the algorithms and calculate all values of the performance measurement metrics; we also calculate the standard deviation of accuracy. Here, RF achieves 0.95 ± 0.02 accuracy, precision, recall, and f1-score. ANN and RF yield the lowest RMSE values of 0.16 and 0.17, respectively. RF produces the best C.Kappa score, with a value of 0.93. SVM, ANN, and DT show more than 90% accuracy, and NBC and KNN obtain more than 85% accuracy.

Table 6. Classification results using ML algorithms after applying SMOTE and feature engineering, with a training and testing ratio of 80:20.

All values represent the mean value of 5 trials of experiments.

https://doi.org/10.1371/journal.pone.0307536.t006

Table 6 represents the performance of the ML classifiers at an 80:20 training and testing ratio. To check the stability of our model with different data availability, as well as the rate of performance degradation, we repeated the same analysis at 50:50 and 30:70 ratios. Table 7 represents the performance of the ML algorithms with a 50:50 training and testing ratio, and Table 8 represents the performance with a 30:70 training and testing ratio. In these two experiments, the results show that RF performs better than the other models, with 0.93 ± 0.02 accuracy for the 50:50 split and 0.91 ± 0.02 accuracy for the 30:70 split. In both cases, RF achieves the lowest RMSE scores of 0.19 and 0.21, and it also achieves the best C Kappa coefficient.

Table 7. Classification results using ML algorithms after applying SMOTE and feature engineering, with a training and testing ratio of 50:50.

All values represent the mean value of 5 trials of experiments.

https://doi.org/10.1371/journal.pone.0307536.t007

Table 8. Classification results using ML algorithms after applying SMOTE and feature engineering, with a training and testing ratio of 30:70.

All values represent the mean value of 5 trials of experiments.

https://doi.org/10.1371/journal.pone.0307536.t008

4.3 Classification performance after applying NearMiss and feature engineering

After applying SMOTE as an oversampling technique, we obtained better performance compared to the results in Table 5. We also apply the undersampling technique NearMiss together with feature engineering to the ML-trainable dataset. For every algorithm, we run five trials and determine every value for every performance assessment metric, and we also compute the standard deviation of accuracy. The experimental results are shown in Table 9. Here, RF achieves 0.93 ± 0.02 accuracy, precision, recall, and f1-score. SVM and ANN perform the same, as they obtain 92% accuracy, precision, recall, and f1-score. LR, NBC, and ANN achieve 91%, 85%, and 84%, respectively. LR and RF show the lowest RMSE values.

Table 9. Classification results using ML algorithms after applying NearMiss and feature engineering, with a training and testing ratio of 80:20.

All values represent the mean value of 5 trials of experiments.

https://doi.org/10.1371/journal.pone.0307536.t009

Table 9 represents the performance of the ML classifiers at an 80:20 training and testing ratio after applying NearMiss and feature engineering. To check the tolerance of the classifiers, we repeated the same analysis at 50:50 and 30:70 ratios. Table 10 represents the performance of the ML algorithms with a 50:50 training and testing ratio, and Table 11 represents the performance with a 30:70 training and testing ratio. In the 50:50 experiment, the results show that RF achieves the highest values with 0.90 ± 0.02 accuracy, precision, and recall and a 90% f1-score; in addition, RF achieves the best RMSE and C Kappa scores. For the 30:70 training and testing ratio, the results show that RF achieves the best performance with 0.87 ± 0.00 accuracy, precision, recall, and f1-score, with an RMSE value of 0.24 and a C Kappa score of 0.83.

Table 10. Classification results using ML algorithms after applying NearMiss and feature engineering, with a training and testing ratio of 50:50.

All values represent the mean value of 5 trials of experiments.

https://doi.org/10.1371/journal.pone.0307536.t010

Table 11. Classification results using ML algorithms after applying NearMiss and feature engineering, with a training and testing ratio of 30:70.

All values represent the mean value of 5 trials of experiments.

https://doi.org/10.1371/journal.pone.0307536.t011

Our findings demonstrate that using data balancing techniques such as SMOTE and NearMiss, together with our proposed feature engineering, results in a notable improvement in ML model performance compared to the imbalanced dataset. These gains were consistent across numerous evaluation metrics, demonstrating the efficacy of our methodology.

4.4 Performance of the proposed ensemble model

From Sections 4.1, 4.2, and 4.3 above, we conclude that the performance of the classifiers is superior when we apply SMOTE and feature engineering. To improve the accuracy further, we analyze different combinations of classifiers to build an ensemble method and finally obtain the best combination, which constitutes our proposed ensemble, Stacking-SRDA. Using this Stacking-SRDA, we test our dataset at 80:20, 50:50, and 30:70 training and testing ratios. The classification results are tabulated in Table 12. In all cases, the proposed ensemble outperforms the benchmark ML algorithms. At 80:20, Stacking-SRDA achieves the highest values with 96% accuracy, precision, recall, and f1-score. At the 50:50 ratio, this model obtains 93% accuracy, precision, and recall and a 92% f1-score. Stacking-SRDA achieves the highest values of 92% accuracy, precision, recall, and f1-score at the 30:70 ratio.

Table 12. Classification results using Stacking-SRDA with 80:20, 50:50, and 30:70 training and testing ratios.

All values represent the mean value of 5 trials of experiments.

https://doi.org/10.1371/journal.pone.0307536.t012

4.5 10-fold cross-validation results of the ML algorithms

We also examine the findings of 10-fold cross-validation to firmly evaluate model performance. This procedure splits the dataset into 10 different subsets, and the model is iteratively trained and tested on the various folds. The outcome demonstrates no overfitting, low variance across folds, and unbiased classifier performance. Experimental results are depicted in Table 13. The result shows that Stacking-SRDA produces consistently better results than the other algorithms.

Table 13. 10-fold cross-validation accuracy of the ML algorithms.

https://doi.org/10.1371/journal.pone.0307536.t013
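The 10-fold cross-validation protocol can be reproduced in outline with scikit-learn as follows; the stratified, shuffled splitting and the accuracy scoring are assumptions about the exact setup.

from sklearn.model_selection import StratifiedKFold, cross_val_score

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
# scores = cross_val_score(stacking_srda, X, y, cv=cv, scoring="accuracy")
# print(f"10-fold accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")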

4.6 Significance test result

We also employ a significance test using McNemar’s test. McNemar’s test compares two classifiers and indicates how significantly one classifier differs from the other. As such, we compare Stacking-SRDA with the seven key ML classifiers (LR, SVM, ANN, NBC, DT, KNN, and RF). Table 14 shows the results of the McNemar test. In the McNemar test, if the P-value is less than 0.0001, then the hypothesis model is significant at the 1% level, and if 0.0001 < P-value < 0.05, then the hypothesis model is significant at the 5% level compared with the other model. The results show that Stacking-SRDA is significant at the 1% level compared with LR, SVM, DT, NBC, and KNN, and at the 5% level compared with RF and ANN, where the 1% level indicates the strongest significance and the 5% level indicates weaker significance than 1%.

Table 14. McNemar test results of Stacking-SRDA compared with the seven key ML classifiers.

https://doi.org/10.1371/journal.pone.0307536.t014

4.7 Ablation study on feature engineering

We perform data distribution-level and algorithm-level ablation studies to investigate the effect of each component of our proposed feature engineering. In the feature engineering process, we use both OHE and LE techniques to transform the categorical or string values into numerical values. First, we remove the OHE component of the proposed feature engineering and train the ML algorithms with the LE technique alone. Then, we remove LE from the proposed model and train the ML algorithms with the OHE technique alone. Additionally, we provide the results produced by the proposed feature engineering model. The results are provided in Table 15. Using LE alone achieves 94% accuracy, and OHE alone obtains 95% accuracy. From the results, it can be seen that our proposed model, which achieves 96% accuracy, outperforms the two single data conversion techniques in terms of the performance measurement metrics.

We also analyze the ROC curve to evaluate the goodness of the fit. The ROC curve for individual classes of the key ML classifier is shown in Fig 2. Additionally, we provide the ROC curve for all classifiers’ best performance in Fig 3, where it can be seen that Stacking-SRDA and RF achieve higher ROC values than the other ML models used in this experiment.

Fig 2. ROC curve performance of individual class using Stacking-SRDA.

https://doi.org/10.1371/journal.pone.0307536.g002

Fig 3. ROC curve using NBC, DT, SVM, ANN, RF, KNN, and Stacking-SRDA.

https://doi.org/10.1371/journal.pone.0307536.g003

5 Explanation of the model using explainable AI

Finally, we use explainable AI tools such as LIME, SHAPASH, and ELI5 on the ML-trainable dataset to describe how the model works.

5.1 Result using Global XAI

As global XAI tools, we have used SHAPASH and ELI5. At first, we used SHAPASH to determine feature importance. Fig 4 represents the feature importance of the model using SHAPASH. This bar chart represents the sum of the absolute contribution values of each feature and shows the most important features in descending order. The figure shows that the “problem solved number” is the most important feature, followed by onsite participation, experience, technical experience, and use of STL, respectively. From the result, it is noticeable that a student who solves a large number of problems is considered a more skilled programmer. According to this explanation, we can suggest that early-stage programmers solve a large number of problems and join onsite contests to gather experience and technical knowledge. A student who uses STL instead of raw coding can save time and space while solving problems, which will help them become an excellent programmer.

Another tool we use is ELI5, which computes “permutation importance” or “Mean Decrease Accuracy (MDA)”. Table 16 represents the permutation importance results for the trained classifier. As we can observe from this output, ELI5 shows the contribution of each feature to predicting the output. In ELI5, a prediction is mostly the sum of the positive attributions, inclusive of bias. For example, if we remove the “problem solved number” feature from the dataset, the accuracy of the classifiers decreases by 0.1015 ± 0.0383, and for the “Onsite Participation” feature, the accuracy decreases by 0.0935 ± 0.0363.

Furthermore, we use GWO as an interpretable metaheuristic approach to find the features’ importance. Fig 5 represents the feature importance of the dataset using GWO. From this chart, our findings demonstrate that “problem solved number”, “Onsite participation”, “technical experience”, and “coding curiosity” are among the more important features. However, “SSC HSC result” and “Mentor” have no influence on the model.

5.2 Local XAI using LIME

Moreover, because it offers easily interpretable visualizations and a straightforward, smooth implementation for our dataset and models, we employed LIME for local explainability to explain the individual target classes. Figs 6 to 9 represent individual class explanations, which show the weight or importance of the features for each class. We first considered a situation where the model predicted that a student was weak in programming, as depicted in Fig 6. We randomly took a sample from the dataset that is the record of a weak programmer. From the feature importances, we find that the top features are problem-solved number, onsite participation, experience, and use of STL. From Fig 4, we find that this student did not give much importance to solving a large number of problems, gathering experience, or using STL; that is why the weights of these features negatively influence the model toward predicting the student as weak in programming, whereas the student focused on features of lower importance. These weighted features significantly influence the model’s prediction of a weak student in programming.

Fig 6. Local explainability using LIME for the “Weak” class.

https://doi.org/10.1371/journal.pone.0307536.g006

Fig 7. Local explainability using LIME for the ‘Average’ class.

https://doi.org/10.1371/journal.pone.0307536.g007

Fig 8. Local explainability using LIME for the ‘Good’ class.

https://doi.org/10.1371/journal.pone.0307536.g008

Fig 9. Local explainability using LIME for the ‘Excellent’ class.

https://doi.org/10.1371/journal.pone.0307536.g009

Fig 7 represents the status of an average-level student in programming. It shows that the student has some experience participating in a programming contest for problem-solving with a good learning speed. That’s why certain weights of these features positively influence the model to predict an average student in programming.

For a good and excellent level programmer, it is noticeable in Figs 8 and 9 that the student focused on regular practice of problem-solving and participated in an onsite programming contest to gather experience. These weighted features have significant importance for the model to predict a good or excellent student in programming.

5.3 Implementation guideline and recommendation for programming skill gap identification system

Finally, with the results of our proposed ensemble model, Stacking-SRDA, and the XAI analysis, we have designed a programming skill gap identification system for weak students with recommendations, and we provide an implementation guideline for this EDM system in an educational infrastructure. Fig 10 represents the proposed design of programming skill enhancement with recommendations and implementation guidelines. To implement this EDM system in an educational setting, an environment needs to be set up where the necessary server and software will be installed. In the data collection step, this server must be integrated with the student information management system. This integration will collect performance data on students’ different skills. The data need to be pre-processed to create ML-trainable data. Our proposed Stacking-SRDA will then be applied to this ML-trainable data. The system will measure the performance of the students’ skills, provide AI explanations for each skill, classify the students’ programming abilities, and identify skill gaps. Finally, it will recommend how much a student needs to improve in each skill to become an excellent programmer. Students will receive their recommendations on their dashboards. A continuous feedback system is needed to monitor the students’ performance and help them develop their programming skills.

Fig 10. Proposed design of a programming skill gap identification system with the recommendation.

https://doi.org/10.1371/journal.pone.0307536.g010

6 Conclusion and future work

In this paper, we have proposed an explainable EDM system that predicts students’ performance in programming more accurately than previous models and introduced effective model interpretability. To achieve better accuracy than the previous models, we investigated an effective feature engineering process and an ensemble learning model. In the feature engineering process, we employ an ablation study, and our findings manifest that the combination of OHE and LE performs better. The classification module forecasts the current status of a student as excellent, good, average, or weak. To classify students’ performance, we train the dataset with key ML classifiers and employ a stacking ensemble learning model. To evaluate performance, we perform our experiments with three training and testing ratios, as well as 10-fold cross-validation. All the experimental results show that our proposed Stacking-SRDA obtains a 96% accuracy level in predicting students’ performance in computer programming. In this way, the proposed EDM system outperforms all the previous models in terms of the performance measurement metrics. To explain the model, we have utilized the XAI tools LIME, SHAPASH, GWO, and ELI5, which bring interpretability to our proposed EDM system. The XAI tools present the most significant features of an excellent programmer. Finally, we have proposed a programming skill gap identification system for weak students with recommendations and a guideline to implement this EDM system. The result of this system will help weak programmers pay more attention to their weaknesses to improve their programming skills.

In the future, this study could be employed in real-world EDM settings within the existing curriculum, and a web-based recommendation system could be developed with the help of the experimental results from the ML model and the findings from the XAI tools. This recommendation system would classify students according to their performance and help weak programmers identify their programming skill gaps. After implementing this EDM system, we could also select some existing and established EDM systems for a comprehensive evaluation. These findings will help develop more effective tools and educational strategies for improving programming ability.

References

  1. Baker RS, Martin T, Rossi LM. Educational data mining and learning analytics. The Wiley handbook of cognition and assessment: Frameworks, methodologies, and applications. 2016; p. 379–396.
  2. Livieris IE, Kotsilieris T, Tampakas V, Pintelas P. Improving the evaluation process of students’ performance utilizing a decision support software. Neural Computing and Applications. 2019;31:1683–1694.
  3. Aldowah H, Al-Samarraie H, Fauzy WM. Educational data mining and learning analytics for 21st century higher education: A review and synthesis. Telematics and Informatics. 2019;37:13–49.
  4. Cardoso Silva Filho RL, Garg A, Brito K, Adeodato PJL, Carnoy M. Beyond scores: A machine learning approach to comparing educational system effectiveness. PLOS ONE. 2023;18(10):e0289260. pmid:37883478
  5. Yin C, Tang D, Zhang F, Tang Q, Feng Y, He Z. Students learning performance prediction based on feature extraction algorithm and attention-based bidirectional gated recurrent unit network. PLOS ONE. 2023;18(10):e0286156. pmid:37878591
  6. Waheed H, Hassan SU, Nawaz R, Aljohani NR, Chen G, Gasevic D. Early prediction of learners at risk in self-paced education: A neural network approach. Expert Systems with Applications. 2023;213:118868.
  7. Turabieh H, Azwari SA, Rokaya M, Alosaimi W, Alharbi A, Alhakami W, et al. Enhanced Harris Hawks optimization as a feature selection for the prediction of student performance. Computing. 2021;103:1417–1438.
  8. Gao L, Zhao Z, Li C, Zhao J, Zeng Q. Deep cognitive diagnosis model for predicting students’ performance. Future Generation Computer Systems. 2022;126:252–262.
  9. Abuzinadah N, Umer M, Ishaq A, Al Hejaili A, Alsubai S, Eshmawi A, et al. Role of convolutional features and machine learning for predicting student academic performance from MOODLE data. PLOS ONE. 2023;18(11):e0293061. pmid:37939093
  10. Karthikeyan VG, Thangaraj P, Karthik S. Towards developing hybrid educational data mining model (HEDM) for efficient and accurate student performance evaluation. Soft Computing. 2020;24(24):18477–18487.
  11. Crivei LM, Czibula G, Ciubotariu G, Dindelegan M. Unsupervised learning based mining of academic data sets for students’ performance analysis. In: 2020 IEEE 14th International Symposium on Applied Computational Intelligence and Informatics (SACI). IEEE; 2020. p. 000011–000016.
  12. Okoye K, Arrona-Palacios A, Camacho-Zuñiga C, Achem JAG, Escamilla J, Hosseini S. Towards teaching analytics: a contextual model for analysis of students’ evaluation of teaching through text mining and machine learning classification. Education and Information Technologies. 2022; p. 1–43. pmid:34658654
  13. Larabi-Marie-Sainte S, Jan R, Al-Matouq A, Alabduhadi S. The impact of timetable on student’s absences and performance. PLOS ONE. 2021;16(6):e0253256. pmid:34170914
  14. Pathan AA, Hasan M, Ahmed MF, Farid DM. Educational data mining: A mining model for developing students’ programming skills. In: The 8th International Conference on Software, Knowledge, Information Management and Applications (SKIMA 2014). IEEE; 2014. p. 1–5.
  15. Sunday K, Ocheja P, Hussain S, Oyelere S, Samson B, Agbo F. Analyzing student performance in programming education using classification techniques. International Journal of Emerging Technologies in Learning (iJET). 2020;15(2):127–144.
  16. Marjan MA, Uddin MP, Ibn Afjal M. An Educational Data Mining System For Predicting And Enhancing Tertiary Students’ Programming Skill. The Computer Journal. 2022.
  17. Sharma N, Appukutti S, Garg U, Mukherjee J, Mishra S. Analysis of Student’s Academic Performance based on their Time Spent on Extra-Curricular Activities using Machine Learning Techniques. International Journal of Modern Education and Computer Science. 2023;15(1):46.
  18. Hasan R, Palaniappan S, Mahmood S, Abbas A, Sarker KU, Sattar MU. Predicting student performance in higher educational institutions using video learning analytics and data mining techniques. Applied Sciences. 2020;10(11):3894.
  19. Amare MY, Šimonová S. Global challenges of students dropout: A prediction model development using machine learning algorithms on higher education datasets. In: SHS Web of Conferences, Volume 129 (2021). EDP Sciences-Web of Conferences; 2021.
  20. Alruwais N, Zakariah M. Student-Engagement Detection in Classroom Using Machine Learning Algorithm. Electronics. 2023;12(3):731.
  21. Adekitan AI, Salau O. The impact of engineering students’ performance in the first three years on their graduation result using educational data mining. Heliyon. 2019;5(2):e01250. pmid:30886917
  22. Kabathova J, Drlik M. Towards predicting student’s dropout in university courses using different machine learning techniques. Applied Sciences. 2021;11(7):3130.
  23. Rodrigues RL, Ramos JLC, Silva JCS, Dourado RA, Gomes AS. Forecasting students’ performance through self-regulated learning behavioral analysis. International Journal of Distance Education Technologies (IJDET). 2019;17(3):52–74.
  24. Khosravi H, Shum SB, Chen G, Conati C, Tsai YS, Kay J, et al. Explainable artificial intelligence in education. Computers and Education: Artificial Intelligence. 2022;3:100074.
  25. Abdi S, Khosravi H, Sadiq S, Gasevic D. Complementing educational recommender systems with open learner models. In: Proceedings of the Tenth International Conference on Learning Analytics & Knowledge; 2020. p. 360–365.
  26. Srinivasan R, Chander A. Explanation perspectives from the cognitive sciences—A survey. In: Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence; 2021. p. 4812–4818.
  27. Seger C. An investigation of categorical variable encoding techniques in machine learning: binary versus one-hot and feature hashing; 2018.
  28. Cerda P, Varoquaux G, Kégl B. Similarity encoding for learning with dirty categorical variables. Machine Learning. 2018;107(8-10):1477–1494.
  29. Saxena S. Here’s All you Need to Know About Encoding Categorical Data (with Python code). Analytics Vidhya. https://www.analyticsvidhya.com/blog/2020/08/types-of-categorical-data-encoding/.
  30. Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP. SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research. 2002;16:321–357.
  31. Demidova L, Klyueva I. SVM classification: Optimization with the SMOTE algorithm for the class imbalance problem. In: 2017 6th Mediterranean Conference on Embedded Computing (MECO). IEEE; 2017. p. 1–4.
  32. Bao L, Juan C, Li J, Zhang Y. Boosted near-miss under-sampling on SVM ensembles for concept detection in large-scale imbalanced datasets. Neurocomputing. 2016;172:198–206.
  33. Bisong E. Logistic regression. In: Building machine learning and deep learning models on Google Cloud Platform: A comprehensive guide for beginners. 2019; p. 243–250.
  34. Murthy SK. Automatic construction of decision trees from data: A multi-disciplinary survey. Data Mining and Knowledge Discovery. 1998;2:345–389.
  35. Taheri S, Mammadov M. Learning the naive Bayes classifier with optimization models. International Journal of Applied Mathematics and Computer Science. 2013;23(4):787–795.
  36. Cortes C, Vapnik V. Support-vector networks. Machine Learning. 1995;20:273–297.
  37. Coomans D, Massart DL. Alternative k-nearest neighbour rules in supervised pattern recognition: Part 1. k-Nearest neighbour classification by using alternative voting rules. Analytica Chimica Acta. 1982;136:15–27.
  38. Adeniyi DA, Wei Z, Yongquan Y. Automated web usage data mining and recommendation system using K-Nearest Neighbor (KNN) classification method. Applied Computing and Informatics. 2016;12(1):90–108.
  39. Chen WS, Du YK. Using neural networks and data mining techniques for the financial distress prediction model. Expert Systems with Applications. 2009;36(2):4075–4086.
  40. Marbouti F, Diefes-Dux HA, Madhavan K. Models for early prediction of at-risk students in a course using standards-based grading. Computers & Education. 2016;103:1–15.
  41. Breiman L. Random forests. Machine Learning. 2001;45:5–32.
  42. Zhang C, Ma Y. Ensemble machine learning: methods and applications. Springer; 2012.
  43. Dietterich TG. Ensemble methods in machine learning. In: Multiple Classifier Systems: First International Workshop, MCS 2000, Cagliari, Italy, June 21–23, 2000, Proceedings 1. Springer; 2000. p. 1–15.
  44. Pavlyshenko B. Using stacking approaches for machine learning models. In: 2018 IEEE Second International Conference on Data Stream Mining & Processing (DSMP). IEEE; 2018. p. 255–258.
  45. Devasia T, Vinushree T, Hegde V. Prediction of students performance using Educational Data Mining. In: 2016 International Conference on Data Mining and Advanced Computing (SAPIENCE). IEEE; 2016. p. 91–95.
  46. De Albuquerque RM, Bezerra AA, de Souza DA, do Nascimento LBP, de Mesquita Sá JJ, do Nascimento JC. Using neural networks to predict the future performance of students. In: 2015 International Symposium on Computers in Education (SIIE). IEEE; 2015. p. 109–113.
  47. Kaur K, Kaur K. Analyzing the effect of difficulty level of a course on students performance prediction using data mining. In: 2015 1st International Conference on Next Generation Computing Technologies (NGCT). IEEE; 2015. p. 756–761.
  48. Annamalai S, Udendhran R, Vimal S. An intelligent grid network based on cloud computing infrastructures. In: Novel Practices and Trends in Grid and Cloud Computing. IGI Global; 2019. p. 59–73.
  49. Wu H, Yang S, Huang Z, He J, Wang X. Type 2 diabetes mellitus prediction model based on data mining. Informatics in Medicine Unlocked. 2018;10:100–107.
  50. Hodson TO. Root mean square error (RMSE) or mean absolute error (MAE): When to use them or not. Geoscientific Model Development Discussions. 2022;2022:1–10.
  51. Vieira SM, Kaymak U, Sousa JM. Cohen’s kappa coefficient as a performance measure for feature selection. In: International Conference on Fuzzy Systems. IEEE; 2010. p. 1–8.
  52. Lachenbruch PA. McNemar test. Wiley StatsRef: Statistics Reference Online. 2014.
  53. Pembury Smith MQ, Ruxton GD. Effective use of the McNemar test. Behavioral Ecology and Sociobiology. 2020;74:1–9.
  54. Van Lent M, Fisher W, Mancuso M. An explainable artificial intelligence system for small-unit tactical behavior. In: Proceedings of the National Conference on Artificial Intelligence. AAAI Press/MIT Press; 2004. p. 900–907.
  55. Holzinger A, Langs G, Denk H, Zatloukal K, Müller H. Causability and explainability of artificial intelligence in medicine. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery. 2019;9(4):e1312. pmid:32089788
  56. Guidotti R, Monreale A, Ruggieri S, Turini F, Giannotti F, Pedreschi D. A survey of methods for explaining black box models. ACM Computing Surveys (CSUR). 2018;51(5):1–42.
  57. Melo E, Silva I, Costa DG, Viegas CM, Barros TM. On the Use of eXplainable Artificial Intelligence to Evaluate School Dropout. Education Sciences. 2022;12(12):845.
  58. Mirjalili S, Mirjalili SM, Lewis A. Grey wolf optimizer. Advances in Engineering Software. 2014;69:46–61.
  59. Wen J, Zhang G, Zhang H, Yin W, Ma J. Speculative text mining for document-level sentiment classification. Neurocomputing. 2020;412:52–62.
  60. Yildirim G. A novel grid-based many-objective swarm intelligence approach for sentiment analysis in social media. Neurocomputing. 2022;503:173–188.