Optimized ensemble deep learning for predictive analysis of student achievement

Kaitong Wang

doi:10.1371/journal.pone.0309141

Abstract

Education is essential for individuals to lead fulfilling lives and attain greatness by enhancing their value. It improves self-assurance and enables individuals to navigate the complexities of modern society effectively. Despite the obstacles it faces, education continues to develop. The objective of numerous pedagogical approaches is to enhance academic performance. The development of technology, especially artificial intelligence, has caused a significant change in learning, making instructional materials easily accessible anytime and anywhere. Higher education institutions are adding technology to conventional teaching strategies to improve learning. This work presents an innovative approach to student performance prediction in educational settings. The strategy combines the DistilBERT with LSTM (DBTM) hybrid approach with the Spotted Hyena Optimizer (SHO) to tune parameters. In terms of accuracy, log loss, and execution time, the model improved significantly over earlier models. The proposed method effectively addresses the challenges presented by the increasing volume of data in graduate and postgraduate programs. It produces exceptional performance metrics, including a 15-25% decrease in processing time through optimization, 98.7% accuracy, and 0.03% log loss. This work additionally demonstrates the effectiveness of DBTM-SHO in handling extensive datasets and makes an important contribution to educational data mining. It provides a robust foundation for organizations facing the challenges of evaluating student achievement in the era of vast data.

Citation: Wang K (2024) Optimized ensemble deep learning for predictive analysis of student achievement. PLoS ONE 19(8): e0309141. https://doi.org/10.1371/journal.pone.0309141

Editor: Hikmat Ullah Khan, University of Sargodha, PAKISTAN

Received: May 28, 2024; Accepted: August 7, 2024; Published: August 26, 2024

Copyright: © 2024 Kaitong Wang. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: The data underlying the results presented in the study are available from https://analyse.kmi.open.ac.uk/open_dataset.

Funding: The author(s) received no specific funding for this work.

Competing interests: The authors have declared that no competing interests exist.

Introduction

Educational Data Mining (EDUM) aims to use data collected by educational institutions to acquire valuable insights and support tactical decision-making on major problem areas. The EDUM methods illustrated by [1] are multi-faceted; they include forecasting approaches, model discovery, and association mining. These approaches allow for the continuous progression of data mining implementations in education.

The basic objective of the vast EDUM domain is to use neural network analysis to find the relationships that influence learning results. Evaluating how well students do in class is a must for those in charge of higher education [2]. We must first identify the variables that impact test scores to use this prediction approach to its full potential. Only then can we intervene at the right time to boost performance by enhancing the aspects linked to success.

Several approaches are used in predictive models of student success, e.g., fuzzy logical inference framework (FLIF), Bayes networks (BN), Support Vector Regression (SVR), random forest, and Decision Trees [3]. Feature selection (FS) can be a key preprocessing step for assessing how the number of attributes affects prediction reliability. FS aims to eliminate redundant and unrelated features so that the remaining data represent the underlying concepts correctly. As a result, prediction accuracy and processing time improve [4].

There are many uses of machine learning, including image processing, text recognition, robotics, and text categorization. Deep Neural Networks (DNNs) are one family of models behind these uses [5]. Their adaptability across such diverse tasks shows that they can predict and differentiate results accurately. As further evidence, they also solve many scheduling problems in wireless technology scenarios while meeting energy requirements [6].

The integration of computational learning with the concepts of EDUM could fundamentally shape and legitimize achievements in learning that have proven out of reach for traditional education. Such integration could obtain impressive outcomes that traditional settings cannot match. However, these outcomes can only be fully understood and achieved through this integration. The integration will help educators, school administrators, and legislators understand students’ complex performance and develop feasible measures of academic and learning progression within the transformed learning environment. It will also help researchers improve the accuracy of performance evaluation while establishing an education environment that can withstand the vulnerabilities affecting the learning achievements of modern students.

Integration of Artificial Intelligence (AI)

In the recent transformation of several industries, including education, machine learning (ML) and AI have played an important part. Our research proposes a new hybrid model that uses AI and ML for better predictive analysis of student success. The ultimate objective is to predict with high confidence how well students perform. The Spotted Hyena Optimizer (SHO) and DistilBERT with LSTM (DBTM) are sophisticated AI/ML components for predictive analytics. These advanced AI technologies reveal more subtle patterns in student data than conventional methods could. This gets us closer to our objectives of understanding student success and developing better prediction models.

Schools often face the challenge of managing large amounts of student data. AI and ML systems excel at efficiently processing and analyzing massive datasets. Our approach merges, cleanses, and preprocesses data from several sources using AI-driven algorithms to provide high-quality inputs for precise predictions. The validity of our work depends on this ability, which addresses the challenge of managing massive amounts of data in academic settings.

Feature engineering and selection are critical parts of our method. ML techniques such as Random Forest and Elastic Net are used for feature extraction and selection. These techniques help extract the most meaningful features from the dataset and thereby improve the model’s prediction power. This aligns with our overarching objective of optimizing accuracy by emphasizing the most valuable characteristics. For our model, we leverage AI’s contextual and sequential pattern understanding abilities using a combination of DistilBERT and LSTM. This dual method fits the purpose of our study and is intended to allow more accurate predictions of student performance. Incorporating AI into our prediction model has one big advantage: it can identify complex patterns and correlations in the data.

Our study also depends heavily on adaptive optimization using SHO, which optimizes the parameters so that the model operates well across many datasets and circumstances. The capacity of SHO to fine-tune the model parameters, together with the dependability and resilience of the model for real-time application in education, supports our stated aim of increasing prediction accuracy. Our work aims to provide practical insights into student performance using AI and ML approaches. Educators can use this information to modify their lessons so that they cater effectively to each student. The result is a boost in academic performance, typically with fewer students at risk. By applying artificial intelligence and machine learning, predictive capabilities can be enhanced while still attaining our overall aim of leveraging technology to improve education significantly.

Contributions

This work significantly advances the subject of Educational Data Mining and predictive evaluation of student performance in various respects:

Novel Hybrid Model: We propose a unique hybrid model of DistilBERT with LSTM (DBTM) combined with the Spotted Hyena Optimizer (SHO) for educational performance prediction tasks. This approach takes advantage of sequential pattern recognition and contextual learning, making prediction more accurate.
Improved Performance Metrics: DBTM-SHO achieves significantly better accuracy, execution time, and log loss than earlier models. Our experiments demonstrate 98.7% accuracy, a 15-25% reduction in execution time, and a log loss of 0.03%.
Data handling: We provide a complete data preprocessing pipeline for student performance data that merges data from different sources, resolves discrepancies, and applies advanced feature engineering. This yields a high-quality dataset for analysis and modeling.
Scalability: By incorporating the Spotted Hyena Optimizer, the proposed DBTM model becomes scalable and can adapt to different inputs; hence, it can be used in real-time applications. Owing to this dynamic parameter tuning, the model remains robust under different conditions and datasets.
Practical Relevance for Education: Our research defines an AI-powered framework to monitor and enhance academic performance, which can be effectively embedded into the education ecosystem because it rests on a strong foundation. The system would help educators intervene with at-risk students sooner and provide the necessary support to help them learn better.

Related work

Many studies in the large body of academic literature tackle the challenging problem of assessing students’ academic achievement [7]. Using DT, NN models, and SVM, the authors of [8] looked at students’ online activities to forecast their academic achievement. The findings highlight a striking association between internet usage patterns and academic achievement: the frequency of internet usage is positively associated with academic performance, while the amount of internet traffic generated is negatively associated. Another study [9] uses data collected from submission forms to create a neural-network-based classification algorithm. Three separate components of academic achievement prediction are the focus of this research. The first dimension looks at what goes into determining how far along the academic progression scale a student is. The second dimension concerns data mining methods for developing and testing prediction models. If students’ Grade Point Averages (GPAs) over multiple years are taken into account, it is possible to predict their Cumulative Grade Point Averages (CGPAs) [10]. Researchers searching for the best methodology employ many classification strategies such as Tree Ensemble, Random Forest, Naïve Bayes, Stochastic Neural Network, Decision Tree, and Logistic Regression.

Factors including parental occupation, educational attainment, and student demographics are also considered in the study [11]. It achieves an accuracy rate of 71.3% using classification methods including rule-based categorization, Naïve Bayes, and Decision Trees. There is a proliferation of frameworks as researchers develop individualized models using different features and classification schemes. By considering variables like prior course grades, significance, graduation date, campus, and country, the J48 decision tree algorithm [12] is used to predict students’ ultimate GPA. Meanwhile, [13, 14] evaluate five classifiers (ID3, J48, Naïve Bayes, Neural Network, and Bayes Network) using factors like activity, attendance, midterm examinations, and other data. The tool expands by utilizing logistic regression and support vector machines to identify pupils with potential for success based only on their prior academic performance. In addition, [15] constructs two parallel models that integrate demographic data and survey results, utilizing naïve Bayes and Bayesian networks; the naïve Bayes technique is found to be the most precise. However, difficulties remain when the number of features and factors contributing to the problem rises. This requires tools that can effectively analyze data in the setting of a rising student population. Concurrently, the author of [16] focuses on using regression analysis techniques to forecast student success in online learning. The author assesses modern regression algorithms to examine their applicability for accurate educational forecasting and analysis. By acquiring important insights, teachers may lower student failures and improve decision-making processes using machine-learning approaches. The author details ML methods validated via experiments, including logistic regression, neural networks, Bayesian networks, and support vector machines (SVMs). These approaches have enhanced prediction accuracy and provided valuable data for improving teaching practices.

The research conducted by [17] considers cognitive and social factors to forecast student success. By combining social network research with more conventional metrics of academic achievement, the authors seek to understand how students’ social connections affect their grades. They gained insight into how intellectual and social elements impact students’ growth by finding strong correlations between students’ cooperative activities, social network frameworks, and overall academic achievement. In addition, a similar study [18] analyzes a large amount of textual information, such as student assignments and forum posts, using NLP technology, which allows students at risk to be identified quickly. The analysis of students’ emotional activity showed that this information can be considered a predictor of early dropout from training programs. The researchers developed an Early Warning System that monitors students’ risk status and uses NLP to decide whether a student is having difficulties. This suggests that participants’ informal activities are a vital signal of emerging problems and dissatisfaction.

The accuracy of data analysis in predicting students’ performance in online learning environments is the primary focus of research on online education. The study [19] aims to develop learner performance prediction models using machine learning methods like clustering and forecasting. This is accomplished by examining website data, including task completion times, learner engagement, and responses to instructional materials. The study advances the body of predictive analytics research, especially in addressing the unique characteristics of online learning settings.

These studies, taken together, show the variety of statistical analysis techniques used to evaluate student achievement as summarized in Table 1. To raise the standard of education, researchers are always expanding the parameters of predictive modeling. They are leveraging natural language processing, emphasizing social components, and considering the challenges of distance learning.

Table 1. Literature review on predicting academic achievement.

https://doi.org/10.1371/journal.pone.0309141.t001

Proposed system model

The proposed method’s initial phase consolidates student performance data from multiple CSV files and datasets into a unified data frame for analysis. Data discrepancies, including inconsistencies and omissions, are resolved during preprocessing. As part of the preprocessing phase, a correlation matrix is generated to identify correlations between attributes. Feature engineering is an essential part of the process and includes both feature extraction and feature selection. Elastic Net is used for feature extraction, whereas Random Forest is used to assess the significance of the selected features. The selected data are further examined through exploratory data analysis. The proposed classification ensemble, DBTM, combines DistilBERT with an LSTM to capture sequential patterns effectively. The feature importance analysis also reflects the structural changes of the data by utilizing the central limit theorem. The Spotted Hyena Optimizer is applied to boost the classification model’s adaptability by tuning the DBTM parameters. The framework is evaluated using various performance criteria to assess the classification outputs and to check that the model is suitable for real-time use. Fig 1 presents the proposed framework.

Fig 1. Proposed framework.

https://doi.org/10.1371/journal.pone.0309141.g001

Dataset description

The data collection comprises many CSV files, each of which systematically records distinct features of student information, assessment, and participation in the educational setting [20].

The dataset contains 200,000 samples from the Open University’s learning analytics. Student data, assessments, and student activity in the classroom are all represented in the many CSV files that comprise this dataset. Module codes, presentation details, assessment types, student characteristics, VLE activity on resources, and performance metrics are all included in the files. Before the dataset was analyzed, efforts were made to handle missing values and maintain data quality. These specifics are essential for grasping the scope and nature of the dataset underlying this paper’s research and findings. The subsequent subsections describe each CSV file comprehensively, as Fig 2 illustrates. Furthermore, Table 2 shows the details of the dataset CSV files.

Fig 2. Dataset attributes (Abstract view).

https://doi.org/10.1371/journal.pone.0309141.g002

Table 2. Description of data files.

https://doi.org/10.1371/journal.pone.0309141.t002

With this data, we can examine students’ course progress, performance indicators, and engagement with the module presentations. Each CSV file may easily be linked to the others using common IDs, allowing for a comprehensive study of the educational environment.

Data preprocessing

Consolidating data from many CSV files was the first step in our data preparation process. The dataset obtained from the k-th CSV file is denoted as DT_k, while the aggregated dataset is represented by DT. Integration was achieved by merging on common identifiers:

$$DT = \bigcup_{k} DT_k \tag{1}$$

Aggregating all the data with the union operation (∪) yields a comprehensive dataset. The dataset was then analyzed to find inconsistencies such as incorrect entries and empty cells. To handle missing data, let N_kl be the missing-value parameter in the k-th row and l-th column. Missing-data imputation strategies were applied to process the missing data and restore a complete dataset [21]. (2)

We used data cleaning methods to fix discrepancies and ensure everything was represented consistently.
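As a minimal sketch of the consolidation step, the following pure-Python example joins two small tables on a shared identifier. The column names (`id_student`, `final_result`, `score`) and the toy records are illustrative assumptions, not values taken from the study, and an inner join is only one possible way to realize the merge.

```python
from collections import defaultdict


def merge_on_key(left, right, key):
    """Inner-join two lists of dict records on a shared identifier column."""
    # Index the right-hand table by the identifier for fast lookup.
    index = defaultdict(list)
    for row in right:
        index[row[key]].append(row)

    merged = []
    for row in left:
        # Emit one combined record per matching right-hand row.
        for match in index.get(row[key], []):
            merged.append({**row, **match})
    return merged


# Hypothetical toy tables standing in for two of the CSV files (DT_1, DT_2).
students = [
    {"id_student": 1, "final_result": "Pass"},
    {"id_student": 2, "final_result": "Fail"},
]
scores = [
    {"id_student": 1, "score": 78},
    {"id_student": 1, "score": 85},
    {"id_student": 2, "score": 40},
]

dt = merge_on_key(students, scores, "id_student")
for record in dt:
    print(record)
```

On real data of this size, a data-frame library would normally perform the same join; the hand-written version is used here only to make the role of the common identifier explicit.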

Matrix of correlations: To identify relationships between different attributes, a correlation matrix C was generated. The correlation coefficient E_kl between variables Y_k and Y_l was calculated as follows [21]:

$$E_{kl} = \frac{\mathrm{cov}(Y_k, Y_l)}{\sigma_{Y_k}\,\sigma_{Y_l}} \tag{3}$$

The term cov(Y_k, Y_l) represents the covariance between Y_k and Y_l, while σ_{Y_k} and σ_{Y_l} denote the standard deviations of Y_k and Y_l, respectively.
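The correlation coefficient described above (covariance divided by the product of the standard deviations) can be sketched in a few lines of Python. The attribute names and values below are made up for illustration; they are not drawn from the dataset.

```python
from math import sqrt


def pearson(y_k, y_l):
    """Correlation coefficient: covariance over the product of standard deviations."""
    n = len(y_k)
    mean_k = sum(y_k) / n
    mean_l = sum(y_l) / n
    cov = sum((a - mean_k) * (b - mean_l) for a, b in zip(y_k, y_l)) / n
    std_k = sqrt(sum((a - mean_k) ** 2 for a in y_k) / n)
    std_l = sqrt(sum((b - mean_l) ** 2 for b in y_l) / n)
    return cov / (std_k * std_l)


def correlation_matrix(columns):
    """Build the matrix C as a nested dict keyed by attribute name."""
    names = list(columns)
    return {k: {l: pearson(columns[k], columns[l]) for l in names} for k in names}


# Hypothetical numeric attributes (toy values).
data = {
    "score": [40, 55, 70, 85],
    "clicks": [10, 20, 30, 40],
    "absences": [8, 6, 4, 2],
}

C = correlation_matrix(data)
print(round(C["score"]["clicks"], 3), round(C["score"]["absences"], 3))
```

In this toy data `clicks` rises linearly with `score` and `absences` falls linearly, so the two coefficients sit at the extremes of the [-1, 1] range.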

Feature engineering and EDA

Feature relevance for predictive modeling was evaluated as part of feature engineering. Each feature X_i was assigned a significance value (IM_i) using a Random Forest model, as shown in Fig 3. In addition, Elastic Net, a linear regression model that incorporates both L1 and L2 regularization, was employed for feature extraction [22]: (4) where θ is the coefficient vector, δ determines the regularization strength, γ regulates the L1/L2 ratio, and z is the response variable.
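For orientation, one common way to write an Elastic Net objective with the symbols defined above is shown below. The feature matrix X (whose columns are the features X_i) and the exact scaling of each term are assumptions made for illustration; the paper's own formulation may differ in constants.

$$\hat{\theta} = \arg\min_{\theta}\; \lVert z - X\theta \rVert_2^2 + \delta\left(\gamma\,\lVert\theta\rVert_1 + (1-\gamma)\,\lVert\theta\rVert_2^2\right)$$

With γ = 1 the penalty reduces to the lasso (pure L1), and with γ = 0 it reduces to ridge regression (pure L2). Coefficients that the L1 term drives exactly to zero mark features that can be dropped, which is what makes the model useful for feature extraction.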

Fig 3. Random forest feature importance.

https://doi.org/10.1371/journal.pone.0309141.g003

Exploratory Data Analysis (EDA) entails employing statistical analysis and visualizations to uncover the inherent patterns in the dataset, facilitating the detection of any anomalies and providing a comprehensive understanding of its structure. The meticulous compilation of the data enabled a stable and refined dataset for further examination, establishing the groundwork for additional analysis.

Classification with hybrid DBTM

The proposed DBTM (DistilBERT with LSTM) model achieves effective student performance prediction by integrating sequential pattern recognition with transformer-based contextual learning. The method applies DistilBERT (a smaller version of BERT) to capture context from a broad range of data, such as academic transcripts, demographic factors, and student activity logs. The resulting contextual embeddings, denoted H_DB, represent the relationships among the elements of the input data.

The model features an LSTM layer to account for the temporal patterns prevalent in student interactions. The DistilBERT output (H_DB) is used as the input sequence (W_LSTM) to the LSTM, which generates further sequential embeddings denoted H_LSTM. These sequential embeddings encode time information on student activity. The complete representation is then formed by concatenating the contextual embeddings (H_DB) with the sequential embeddings (H_LSTM). This hybrid representation (H_Combined) is the final input to the classification task. Model Architecture: There are three primary parts to the architecture:

Contextual embeddings (H_DB) are generated from the raw input data, written here as X, via DistilBERT processing [23]: $$H_{DB} = \mathrm{DistilBERT}(X) \tag{5}$$
The LSTM performs sequential pattern recognition by generating sequential embeddings (H_LSTM) from the DistilBERT output (H_DB), which serves as its input sequence [24]: $$H_{LSTM} = \mathrm{LSTM}(H_{DB}) \tag{6}$$
Concatenating the sequential embeddings (H_LSTM) with the contextual embeddings (H_DB) generates a comprehensive representation (H_Combined): $$H_{Combined} = \mathrm{Concat}(H_{DB}, H_{LSTM}) \tag{7}$$
Using the DBTM model, classification is performed for tasks such as predicting student achievement categories (e.g., grades or pass/fail). The combined representation (H_Combined) is passed to fully connected (FC) layers, and a softmax activation generates the probability distribution across classes. Thus, the model combines the local pattern recognition of the LSTM with the global context understanding of DistilBERT, providing an accurate predictive framework for academic success. The combined classification scheme [25] is given in Eq (8): $$\hat{y} = \mathrm{softmax}\left(\mathrm{FC}(H_{Combined})\right) \tag{8}$$
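The concatenation and softmax classification steps described above can be illustrated with a small, dependency-free sketch. The embedding values, the single dense layer, and its weights are hypothetical toy numbers; in the actual model H_DB and H_LSTM would come from DistilBERT and the LSTM, and the weights would be learned during training.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dense(vector, weights, bias):
    """One fully connected layer: a weight row and a bias per output class."""
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


# Hypothetical toy embeddings standing in for the two encoder outputs.
h_db = [0.2, -0.1, 0.4]      # contextual embedding (H_DB)
h_lstm = [0.5, 0.3]          # sequential embedding (H_LSTM)

# Concatenation step: the hybrid representation H_Combined.
h_combined = h_db + h_lstm

# Hypothetical weights for a two-class output (e.g. fail / pass).
weights = [
    [0.1, 0.0, 0.2, 0.3, -0.1],
    [-0.2, 0.1, 0.0, 0.1, 0.4],
]
bias = [0.0, 0.3]

probs = softmax(dense(h_combined, weights, bias))
predicted = max(range(len(probs)), key=probs.__getitem__)
print(probs, predicted)
```

The point of the sketch is the data flow: the classifier sees both embeddings side by side, so it can weigh contextual and temporal evidence jointly rather than in separate heads.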

The model improves its accuracy in predicting student achievement through an integrated approach that uses both local sequential information and global context. Fig 4 shows the DBTM framework.

Fig 4. Proposed DBTM architecture of data flow.

https://doi.org/10.1371/journal.pone.0309141.g004

Tuning with SHO

This tuning approach uses the Spotted Hyena Optimizer (SHO) [26] to optimize the parameters (ϕ) of the DistilBERT with LSTM (DBTM) model, thereby improving its forecasts of student success in higher education. The optimization process is initialized with an objective function F(ϕ) and an initial pack of hyenas, P(0), in which each member holds a set of parameter settings. The objective function includes key performance metrics tracked at different epochs of model training, i.e., recall, precision, and F1-score.

In each iterative cycle, the hyena pack performs hunting and exploration. Each hyena carries a different set of parameters, which is evaluated with the fitness function. Cooperation and interaction between hyenas enable many candidate configurations to be explored. The population update is based on evolutionary principles; it aims to keep the configurations with higher fitness alive, thus moving the pack of hyenas closer and closer to an optimal configuration. The collective knowledge of the pack is used to modify the DBTM settings. This series of adaptations and adjustments is expressed as: (9)

Changes in learning rates, dropout rates, and other critical factors are used to refine the model’s structure. One essential property of SHO is its ability to respond to changes in data. Because its parameters are adjusted in this way, the DBTM model is more adaptable to changes in student performance data, which makes it scalable. This iterative optimization loop refines the parameter settings, leading to the optimum set (ϕ) that maximizes the fitness function. Algorithm 1 summarizes the tuning of DBTM using SHO.

Algorithm 1 Tuning using SHO

Input: Clean data from the feature engineering step
Objective function: F(ψ), representing the fitness of the DBTM model parameters
Initial pack of hyenas: H^(0) with diverse parameter configurations
Maximum number of iterations: T
Output: Optimal set of parameters ψ*

Initialization: Release a pack of hyenas, each embodying a distinct parameter configuration: H^(0) ← ReleaseHyenaPack()
Set the iteration counter k = 0
Optimization loop: while k < T do
    a. Hunting phase: evaluate the fitness of each hyena’s parameter configuration: F(ψ^(k))
    b. Update hyena pack based on the fitness values: H^(k+1) ← UpdateHyenaPack(H^(k), F)
    c. Exploration: facilitate the exploration of novel parameter configurations through hyena collaboration and communication
    d. Update parameters of the DBTM model based on hyena collaboration: ψ^(k+1) ← UpdateParameters(ψ^(k), H^(k+1))
    Increment iteration counter: k ← k + 1
end while
Result: ψ* ← arg max_ψ F(ψ)
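The loop structure of Algorithm 1 can be sketched in Python as follows. This is a simplified illustration, not the authors' implementation: the fitness function is a hypothetical stand-in for validating a trained DBTM model, the two tuned parameters (learning rate and dropout) and their bounds are assumptions, the update is an encircling-style move toward the best hyena modeled loosely on SHO, the hunting-cluster step is omitted, and the greedy replacement rule is an added simplification.

```python
import random


def fitness(params):
    """Hypothetical stand-in for F: peaks at learning_rate=0.01, dropout=0.3."""
    learning_rate, dropout = params
    return -((learning_rate - 0.01) ** 2 * 1e4 + (dropout - 0.3) ** 2)


# Assumed search bounds for (learning_rate, dropout).
BOUNDS = [(0.0001, 0.1), (0.0, 0.6)]


def clip(value, low, high):
    return max(low, min(high, value))


def tune(pack_size=12, max_iter=40, seed=0):
    rng = random.Random(seed)
    # Initialization: release a pack with diverse parameter configurations.
    pack = [[rng.uniform(lo, hi) for lo, hi in BOUNDS] for _ in range(pack_size)]
    best = max(pack, key=fitness)
    history = [fitness(best)]

    for k in range(max_iter):
        # Control value shrinks from 5 toward 0, shifting from exploration
        # to exploitation as iterations progress.
        h = 5.0 - k * (5.0 / max_iter)
        new_pack = []
        for hyena in pack:
            candidate = []
            for d, (lo, hi) in enumerate(BOUNDS):
                b = 2.0 * rng.random()
                e = 2.0 * h * rng.random() - h
                distance = abs(b * best[d] - hyena[d])
                candidate.append(clip(best[d] - e * distance, lo, hi))
            # Simplification: keep the move only if it improves fitness.
            if fitness(candidate) > fitness(hyena):
                new_pack.append(candidate)
            else:
                new_pack.append(hyena)
        pack = new_pack
        best = max(pack + [best], key=fitness)
        history.append(fitness(best))
    return best, history


best_params, history = tune()
print(best_params, history[-1])
```

In a real tuning run, each call to `fitness` would train or validate the model with the candidate parameters, so the pack size and iteration budget directly determine the computational cost.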

The result is a parameter configuration that materially increases the prediction power of the DBTM model. These improvements have been confirmed by thorough investigation and analysis using various datasets across different studies, among which study11 has further supported the updated DBTM model in real-life settings. Meanwhile, the analytical framework for predicting students’ academic success benefits in practice from the simple parameter optimization process that the Spotted Hyena Optimizer provides. Fig 5 describes the tuning process.

Fig 5. Parameter tuning process.

https://doi.org/10.1371/journal.pone.0309141.g005

Performance assessment

The efficacy of the proposed hybrid technique is evaluated using several classification measures, such as the confusion matrix, log loss, statistical analysis, and execution time. Using these assessment indicators, we analyze the effectiveness of categorization algorithms. The techniques are illustrated in Fig 6.
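To make the measures concrete, the sketch below computes a confusion matrix, accuracy, and log loss from scratch. The labels, predictions, and probabilities are invented toy values used only to show the calculations; they are not results from the study.

```python
import math


def confusion_matrix(y_true, y_pred, labels):
    """Rows are true classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


def accuracy(matrix):
    """Share of samples on the diagonal of the confusion matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def log_loss(y_true, probabilities, labels, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    index = {label: i for i, label in enumerate(labels)}
    total = 0.0
    for t, probs in zip(y_true, probabilities):
        p = min(max(probs[index[t]], eps), 1 - eps)  # avoid log(0)
        total -= math.log(p)
    return total / len(y_true)


# Toy example with two classes.
labels = ["Fail", "Pass"]
y_true = ["Pass", "Fail", "Pass", "Pass"]
y_pred = ["Pass", "Fail", "Fail", "Pass"]
probs = [[0.1, 0.9], [0.8, 0.2], [0.6, 0.4], [0.3, 0.7]]

cm = confusion_matrix(y_true, y_pred, labels)
acc = accuracy(cm)
loss = log_loss(y_true, probs, labels)
print(cm, acc, round(loss, 4))
```

Accuracy only counts whether the top prediction is right, whereas log loss also penalizes confident wrong probabilities, which is why the two are reported together.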

Fig 6. Performance metrics.

https://doi.org/10.1371/journal.pone.0309141.g006

Simulation and results

This study aims to improve the precision of predicting student performance. The simulations were run on a machine with an Intel Core i7 11th Gen CPU operating at 2.4 GHz, coupled with a 4GB RTC graphics card. Python is the main language, and the simulations are run in the Spyder IDE of Anaconda.

The first stage is importing the dataset into the framework and preparing it for preprocessing. The dataset contains missing values, which are resolved using two approaches. Any row with more than 50% missing values is excluded. Alternatively, when the share of missing values in a row is below 50%, the missing entries are imputed.
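The row-exclusion rule can be sketched as follows. The 50% threshold comes from the text; the imputation method is not specified in the source, so the column-mean fill used here is purely a placeholder assumption, as are the column names and values.

```python
def drop_sparse_rows(rows, threshold=0.5):
    """Exclude rows in which more than `threshold` of the values are missing."""
    kept = []
    for row in rows:
        missing = sum(1 for value in row.values() if value is None)
        if missing / len(row) <= threshold:
            kept.append(row)
    return kept


def impute_column_means(rows, columns):
    """Placeholder imputation (assumption): fill gaps with the column mean."""
    means = {}
    for col in columns:
        observed = [r[col] for r in rows if r[col] is not None]
        means[col] = sum(observed) / len(observed)
    return [
        {k: (means[k] if v is None and k in means else v) for k, v in r.items()}
        for r in rows
    ]


# Hypothetical records; None marks a missing value.
records = [
    {"score": 80.0, "clicks": 120.0, "credits": 60.0, "attempts": 0.0},
    {"score": None, "clicks": 40.0, "credits": 30.0, "attempts": 1.0},
    {"score": None, "clicks": None, "credits": None, "attempts": 2.0},
    {"score": 60.0, "clicks": None, "credits": 60.0, "attempts": 0.0},
]

kept = drop_sparse_rows(records)
clean = impute_column_means(kept, ["score", "clicks"])
print(len(kept), clean)
```

Dropping the sparse rows before computing the fill values keeps nearly empty records from influencing the imputed numbers.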

Exploratory Data Analysis (EDA) is conducted on the dataset to acquire insights and understand the data. Fig 7 displays a histogram of the distribution of assessment results obtained from the OULAD dataset. Score ranges are shown on the x-axis, and the frequency of occurrences within each range is shown on the y-axis. A kernel density estimate (KDE) curve is superimposed to offer a smoothed, normalized representation of the data distribution. The histogram clearly shows that most students’ results are concentrated within specific intervals, as the conspicuous peaks show. The KDE curve facilitates the identification of any underlying trends or patterns in the assessment scores. This graphical representation facilitates the rapid assessment of outliers and central tendencies. The distribution’s tails can indicate the presence of exceptionally high or low scores.
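A histogram with a superimposed KDE curve can be reproduced in outline without plotting libraries. The sketch below bins a handful of made-up scores and evaluates a Gaussian kernel density estimate; the scores, the bin count, and the bandwidth of 8 are illustrative assumptions rather than settings reported in the study.

```python
import math


def histogram(samples, low, high, bins):
    """Count samples per equal-width bin; the top edge falls in the last bin."""
    width = (high - low) / bins
    counts = [0] * bins
    for s in samples:
        i = min(int((s - low) / width), bins - 1)
        counts[i] += 1
    return counts


def gaussian_kde(samples, bandwidth):
    """Return a density function built from one Gaussian kernel per sample."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


# Hypothetical assessment scores on a 0-100 scale.
scores = [35, 42, 58, 61, 64, 70, 72, 75, 78, 81, 85, 93]

counts = histogram(scores, 0, 100, 10)
kde = gaussian_kde(scores, bandwidth=8.0)

print(counts)
print(round(kde(72), 4), round(kde(10), 6))
```

The bin counts correspond to the bar heights and the density values to the smooth curve; a larger bandwidth flattens the curve, while a smaller one makes it follow individual scores more closely.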

Fig 7. Distribution of dataset (Assessment score).

https://doi.org/10.1371/journal.pone.0309141.g007

The relationship between student involvement and academic results is shown in Fig 8. A count plot of distinct VLE activity categories is shown on the left side of the figure, illustrating the varied degrees of student participation. A distinct color palette makes it simpler to differentiate between activities and to identify engagement patterns and how frequently each type is used. The right side of the figure focuses on the average number of interactions (clicks) by students’ final results. Students with better grades, particularly those who earn distinctions, use the virtual learning environment (VLE) considerably more than their peers who do worse. Taken as a whole, these numbers show how vital participation is for students to do well and illuminate the connections between various VLE activities and grades.

Fig 8. Activity types and final result.

https://doi.org/10.1371/journal.pone.0309141.g008

Fig 9 shows how students’ participation in the VLE relates to their assessment results. As a measure of students’ academic achievement, the x-axis displays assessment results, while the y-axis displays the total number of clicks students made in the VLE, measuring their online activity.

Fig 9. Distribution of dataset (Relation between score and VLE).

https://doi.org/10.1371/journal.pone.0309141.g009

Patterns and trends within the data can be observed in Fig 9. A positive correlation between assessment scores and Virtual Learning Environment (VLE) clicks is suggested if the graph’s points show an increasing trend from left to right; in that case, learners who actively participate in the virtual classroom typically achieve higher test scores. Conversely, a lack of clear trends among the points indicates a weak or nonexistent relationship between VLE engagement and academic performance.

Fig 10 shows the dataset’s component correlation matrix, which provides useful information about the correlations between different assessment measures. On the heatmap, different colors represent different degrees of association. Lighter shades show weak correlations, dark red shows strong positive correlations, and dark blue shows major negative correlations. The numerical annotations within each heatmap cell represent specific correlation coefficients ranging from -1 to 1. By interpreting these coefficients, readers can discern trends and associations between pairs of variables. Negative coefficients suggest an inverse relationship, while positive coefficients indicate variables that tend to move in tandem. The heatmap facilitates the identification of clusters of factors with strong correlations, guiding further research efforts.

Fig 10. Numeric attributes correlation.

https://doi.org/10.1371/journal.pone.0309141.g010
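
A correlation matrix like the one in Fig 10 can be sketched as follows. This assumes Pearson correlation, which the text does not state explicitly. The column names and values are hypothetical and serve only to show the structure of the computation.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Return a nested dict with the pairwise correlation of every column pair."""
    names = list(columns)
    return {a: {b: pearson_r(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical numeric attributes; illustrative values only.
data = {
    "score": [35, 50, 62, 71, 88],
    "sum_click": [120, 300, 410, 520, 900],
    "days_late": [9, 7, 4, 3, 0],
}
matrix = correlation_matrix(data)
```

In this toy data the score and click columns rise together, so their coefficient is close to 1, while the hypothetical `days_late` column falls as scores rise and yields a negative coefficient.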

Table 3 presents a performance evaluation of our proposed method alongside existing approaches documented in the literature, comparing our results with those obtained by other researchers using a range of methodologies. DBTM-SHO performs well on all metrics. These results indicate how effective the architecture is and suggest its potential for use at scale.

Table 3. Analyzing the new method’s performance in contrast to the current one.

https://doi.org/10.1371/journal.pone.0309141.t003
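
For reference, accuracy and log loss, two of the metrics discussed in this paper, can be computed as in the sketch below. It uses the standard definitions (fraction of correct predictions, and mean negative log-likelihood of the true class). The labels and probabilities are hypothetical and unrelated to the values reported in Table 3.

```python
import math


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def log_loss(y_true, probs, eps=1e-15):
    """Mean negative log-likelihood of the true class.

    y_true holds class indices; probs holds one list of class
    probabilities per sample. Probabilities are clipped to avoid log(0).
    """
    total = 0.0
    for t, p in zip(y_true, probs):
        p_true = min(max(p[t], eps), 1 - eps)
        total += -math.log(p_true)
    return total / len(y_true)


# Hypothetical three-class example; illustrative values only.
y_true = [0, 1, 2, 1]
probs = [
    [0.9, 0.05, 0.05],
    [0.1, 0.8, 0.1],
    [0.2, 0.2, 0.6],
    [0.5, 0.4, 0.1],
]
# Predicted class = index of the largest probability.
y_pred = [max(range(len(p)), key=p.__getitem__) for p in probs]

acc = accuracy(y_true, y_pred)
ll = log_loss(y_true, probs)
```

Log loss penalizes confident wrong predictions more heavily than accuracy does, which is why the two metrics are usually reported together.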

Table 4 compares the classification approach proposed in this research with other approaches through a statistical analysis, which helps establish how valid and valuable the results are. The table shows positive results across all research variables, which supports the robustness of our strategy. This statistical analysis validates the performance obtained by our method and, more importantly, sheds light on the architecture underlying the classification. The positive correlations among the variables help confirm that our classification algorithms are reliable and valuable, and possibly practical in real-world use.

Table 4. Statistical analysis results on the OULAD dataset.

https://doi.org/10.1371/journal.pone.0309141.t004

Fig 11 depicts a sensitivity study of how various parameters affect the model’s efficacy. The bar chart shows the effect of each parameter on overall performance: the height of each bar gives the size of that parameter’s effect, and a red trend line shows the overall trend across parameters. Parameters above the benchmark line (the dotted green line at 0.1) have an above-average effect; those below it have a below-average influence. The numerical value of each parameter’s influence is printed next to its bar. This sensitivity analysis can help researchers and practitioners make informed decisions during model optimization by highlighting the elements that most affect model accuracy.

Fig 11. Sensitivity study to determine how factors affect model performance.

https://doi.org/10.1371/journal.pone.0309141.g011
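
The text does not specify how the sensitivity values in Fig 11 were obtained. One common approach is one-at-a-time perturbation, sketched below purely as an assumption: each parameter is raised by a fixed relative step while the others stay at their baseline, and the absolute change in the score is recorded. The score function `toy_score`, the parameter names, and the reuse of 0.1 as a threshold are hypothetical stand-ins.

```python
def one_at_a_time_sensitivity(score_fn, baseline, rel_step=0.1):
    """Absolute change in score when each parameter is raised by rel_step.

    All other parameters are held at their baseline values.
    """
    base_score = score_fn(baseline)
    effects = {}
    for name, value in baseline.items():
        perturbed = dict(baseline)
        perturbed[name] = value * (1 + rel_step)
        effects[name] = abs(score_fn(perturbed) - base_score)
    return effects


def toy_score(params):
    """Hypothetical stand-in for a model's performance as a function of its parameters."""
    return 3.0 * params["param_a"] + 0.2 * params["param_b"]


baseline = {"param_a": 1.0, "param_b": 1.0}
effects = one_at_a_time_sensitivity(toy_score, baseline)

# Compare each effect with a benchmark line (0.1, as in the figure description).
BENCHMARK = 0.1
above_benchmark = {name: effect > BENCHMARK for name, effect in effects.items()}
```

In a real study, `score_fn` would retrain or re-evaluate the model for each perturbed configuration, which is far more expensive than this toy function.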

Fig 12 compares the execution time of the proposed DBTM-SHO technique and other existing methods as a function of data size. Data sizes from 10,000 to 200,000 are plotted on the x-axis and execution times in seconds on the y-axis. Each line on the graph represents a different approach, and the markers indicate the time required at each data size. As the figure shows, execution time can vary significantly depending on the approach and the dataset size. DBTM-SHO’s execution time remains relatively consistent as the volume of data increases. The processing efficacy of SVM and NB is more notable when the datasets are larger. The distinct execution-time patterns of NN, DT, LR, and ResNet in response to data size demonstrate that the approaches have different processing demands and scalability characteristics.

Fig 12. Connection between data amount and duration of execution.

https://doi.org/10.1371/journal.pone.0309141.g012
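
A measurement of execution time against data size, as plotted in Fig 12, can be sketched with a simple timing harness. This is an illustrative assumption about how such a curve might be produced, not the procedure used in the paper. The function `toy_method` is a hypothetical stand-in for a real classifier, and only the data-size range follows the figure description.

```python
import time


def time_method(method, sizes, make_data):
    """Run `method` once per data size and record the wall-clock seconds."""
    timings = {}
    for n in sizes:
        data = make_data(n)  # data generation is excluded from the timing
        start = time.perf_counter()
        method(data)
        timings[n] = time.perf_counter() - start
    return timings


def toy_method(data):
    """Hypothetical stand-in for training or running a model on `data`."""
    return sum(x * x for x in data)


sizes = [10_000, 50_000, 100_000, 200_000]
timings = time_method(toy_method, sizes, lambda n: list(range(n)))
```

Single runs are noisy, so a careful comparison would repeat each measurement several times and report the mean or median per data size.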

Conclusion

The results of this study demonstrate that educational institutions can benefit from using enhanced ML algorithms to predict how well their students will perform. By building a comprehensive framework that unifies data preparation, feature engineering, and classification with the hybrid DBTM-SHO model, we achieved significant gains in accuracy, a lower log loss, and a shorter execution time. The model achieves 98.7% accuracy, 0.03% log loss, and a 15%–25% reduction in execution time through optimization, demonstrating its capacity to manage the expanding volumes of data in graduate and postgraduate programs efficiently. These findings underscore the model’s potential as a valuable instrument for examining educational data, and they matter to everyone involved in education. The proposed methodology establishes a robust framework for the timely identification of students who might be at risk, thereby facilitating tailored support strategies and interventions. By implementing predictive analytics, educators and administrators can improve student outcomes, make better use of existing resources, and raise the overall efficacy of the educational system. In addition, combining optimization algorithms with advanced machine learning methods streamlines data-driven decision-making and facilitates individualized learning experiences, thereby improving student achievement.

Limitations and future research directions

The study’s findings are encouraging; however, some constraints must be addressed to make the model more applicable and generalizable. First, the model may not transfer to other educational settings because it was developed exclusively on the OULAD dataset; settings and student demographics may vary greatly. Future research should assess this approach on multiple datasets from various institutions to establish its robustness and flexibility. Second, this work predicts student performance from structured data: demographic information, module-related information, and records of interaction within the virtual classroom. The model could be improved further by adding unstructured data sources such as forum entries and textual notes. Applying NLP techniques to such text data would provide a fuller picture of students’ learning behaviors and performance predictors. Third, the computational overhead and resource requirements of the proposed hybrid model (DBTM-SHO) must be considered. Future work should focus on reducing computational cost without impairing predictive accuracy. Model reduction, feature selection, and distributed computing architectures could help address these challenges and make the final model better suited to online educational applications.

References

1. Baek C, Doleck T. Educational data mining versus learning analytics: A review of publications from 2015 to 2019. Interactive Learning Environments. 2023;31(6):3828–50.
2. Gohar AS, Qouta MM. Challenges of improving the quality of academic supervision of postgraduate studies at the Faculty of Education, Damietta University. Journal of Educational Issues. 2021;7(1):113–37.
3. Pimentel JS, Ospina R, Ara A. Learning time acceleration in support vector regression: A case study in educational data mining. Stats. 2021;4(3):682–700.
4. Chen Y, Zhai L. A comparative study on student performance prediction using machine learning. Education and Information Technologies. 2023;28(9):12039–57.
5. Almusawi HA, Durugbo CM. Linking task-technology fit, innovativeness, and teacher readiness using structural equation modelling. Education and Information Technologies. 2024;1(1):1–30.
6. Poole AH, Agosto D, Greenberg J, Lin X, Yan E. Where do we stand? Diversity, equity, inclusion, and social justice in North American library and information science education. Journal of Education for Library and Information Science. 2021;62(3):258–86.
7. Kapucu MS, Özcan H, Aypay A. Predicting secondary school students’ academic performance in science course by machine learning. International Journal of Technology in Education and Science. 2024;8(1):41–62.
8. Chui KT, Liu RW, Zhao M, De Pablos PO. Predicting students’ performance with school and family tutoring using generative adversarial network-based deep support vector machine. IEEE Access. 2020;8:86745–52.
9. Tao T, Sun C, Wu Z, Yang J, Wang J. Deep neural network-based prediction and early warning of student grades and recommendations for similar learning approaches. Applied Sciences. 2022;12(15):7733.
10. Alangari N, Alturki R. Predicting students’ final GPA using 15 classification algorithms. Romanian Journal of Information Science and Technology. 2020;23(3):238–49.
11. Aleid MA, Aldhyani THH, Khalaf OI, Algburi S. Modelling and predicting student flexibility in online learning using machine learning: Students’ academic performance. International Journal of Computing and Digital Systems. 2024;15(1):1–20.
12. Hoque MI, Azad AK, Tuhin MAH, Salehin ZU. University students’ result analysis and prediction system by decision tree algorithm. Advances in Science, Technology, and Engineering Systems Journal. 2020;5:115–22.
13. Patel P, Thakkar T, Patel M, Trivedi A. A review: An approach for secondary school students performance using machine learning and data mining. International Journal of Intelligent Systems and Applications in Engineering. 2024;12(14):0–11.
14. Cruz-Jesus F, Castelli M, Oliveira T, Mendes R, Nunes C, Sa-Velho M, et al. Using artificial intelligence methods to assess academic achievement in public high schools of a European Union country. Heliyon. 2020;6(6):1455–71. pmid:32551378
15. Albahli S. Efficient hyperparameter tuning for predicting student performance with Bayesian optimization. Multimedia Tools and Applications. 2023;34(3):1–25.
16. Hussain S, Khan MQ. Student-Performulator: Predicting students’ academic performance at secondary and intermediate level using machine learning. Annals of Data Science. 2023;10(3):637–55. pmid:38624826
17. Wahono B, Chang CY, Retnowati A. Exploring a direct relationship between students’ problem-solving abilities and academic achievement: A STEM education at a coffee plantation area. Journal of Turkish Science Education. 2020;17(2):211–24.
18. Shaik T, Tao X, Li Y, Dann C, McDonald J, Redmond P, et al. A review of the trends and challenges in adopting natural language processing methods for education feedback analysis. IEEE Access. 2022;10:56720–39.
19. Rafique A, Khan MS, Jamal MH, Tasadduq M, Rustam F, Lee E, et al. Integrating learning analytics and collaborative learning for improving student’s academic performance. IEEE Access. 2021;9:167812–26.
20. Kuzilek J, Hlosta M, Zdrahal Z. Open University learning analytics dataset. Scientific Data. 2017;4(1):1–8. pmid:29182599
21. Mishra P, Biancolillo A, Roger JM, Marini F, Rutledge DN. New data preprocessing trends based on ensemble of multiple preprocessing techniques. TrAC Trends in Analytical Chemistry. 2020;132:116045.
22. Zou H, Hastie T. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society Series B: Statistical Methodology. 2005;67(2):301–20.
23. Saha U, Mahmud MS, Keya M, Lucky EAE, Khushbu SA, Noori SRH, et al. Exploring public attitude towards children by leveraging emoji to track out sentiment using distil-BERT a fine-tuned model. In: International Conference on Image Processing and Capsule Networks. Cham: Springer International Publishing; 2022. p. 332–46.
24. Landi F, Baraldi L, Cornia M, Cucchiara R. Working memory connections for LSTM. Neural Networks. 2021;144:334–41. pmid:34547671
25. Buffoni L, Civitelli E, Giambagli L, Chicchi L, Fanelli D. Spectral pruning of fully connected layers. Sci Rep. 2022;12(1):11201. pmid:35778586
26. Dhiman G, Kumar V. Spotted Hyena Optimizer: A novel bio-inspired based metaheuristic technique for engineering applications. Advances in Engineering Software. 2017;114:48–70.
27. Hussain S, Gaftandzhieva S, Maniruzzaman M, Doneva R, Muhsin ZF. Regression analysis of student academic performance using deep learning. Education and Information Technologies. 2021;26(1):783–98.
28. Wang F, Ying Y. Evaluation of students’ innovation and entrepreneurship ability based on ResNet network. Mobile Information Systems. 2022.
