
Research on learning achievement classification based on machine learning

  • Jianwei Dong ,

    Contributed equally to this work with: Jianwei Dong, Ruishuang Sun

    Roles Funding acquisition, Investigation

    1982086289@qq.com

    Affiliations College of Educational Science, Xinjiang Normal University, Urumqi, China, College of Educational Science, Xinjiang Teacher’s College, Urumqi, China

  • Ruishuang Sun ,

    Contributed equally to this work with: Jianwei Dong, Ruishuang Sun

    Roles Conceptualization, Data curation, Formal analysis

    Affiliation School of Software, Xinjiang University, Urumqi, China

  • Zhipeng Yan,

    Roles Methodology, Project administration, Resources

    Affiliation School of Computer Science and Technology, Xinjiang University, Urumqi, China

  • Meilun Shi,

    Roles Writing – original draft, Writing – review & editing

    Affiliation School of Computer Science and Technology, Xinjiang University, Urumqi, China

  • Xinyu Bi

    Roles Supervision, Validation

    Affiliation School of Computer Science and Technology, Xinjiang University, Urumqi, China

Abstract

Academic achievement is an important index for measuring the quality of education and students’ learning outcomes. Reasonable and accurate prediction of academic achievement can help teachers improve their educational methods, and it also provides data support for the formulation of education policies. However, traditional methods for classifying academic performance suffer from low accuracy, limited ability to handle nonlinear relationships, and poor handling of data sparsity. Motivated by this, our study analyzes various student characteristics, including personal information, academic performance, attendance, family background, and extracurricular activities, offering a comprehensive view of the factors affecting students’ academic performance. To improve the accuracy and robustness of student performance classification, we adopted a Gaussian Distribution based Oversampling technique (GDO) combined with multiple Deep Learning (DL) and Machine Learning (ML) models. We explored the application of different ML and DL models to classifying student grades, and evaluated their performance under different feature combinations and data augmentation settings. In addition, we checked the validity of the synthetic data using variance-homogeneity tests and p-values, and studied how the oversampling rate affects actual classification results. The results show that the RBFN model based on educational-habit features performs best after GDO data augmentation, with an accuracy of 94.12% and an F1 score of 94.46%. These results provide valuable references for the classification of student grades and the development of intervention strategies. Our study offers new methods and perspectives for educational data analysis and promotes the intelligent development of education systems.

Introduction

In the era of big data, education-related data has seen a significant increase in both content and quantity. How to realize educational data mining has become an important research topic in the current education field [1]. Our work effectively utilizes various educational data and employs methods such as relationship mining [2], prediction [3-5], and clustering [6-8] to accurately evaluate and understand student behavior and performance, and it also helps students flexibly adjust their learning strategies. As is well known, academic achievement is of great significance in the modern education system: it serves as a core indicator for measuring students’ educational activities and academic performance. Accurately estimating students’ academic performance can improve the quality of student education, reveal the laws of educational development, and enrich educational models.

In education policy and research, estimating student grades is crucial for analyzing and improving the education system. By analyzing the various characteristics of students, we can understand what influences their academic achievement; by analyzing data, we can spot future trends in grades and monitor students’ academic achievement with early warnings. Currently, structured data in education is becoming increasingly rich and complex. Research [9] has found that students’ achievements are not only directly related to their grades in various subjects, classroom conditions, and learning abilities, but also closely related to latent characteristics such as family background, extracurricular activities, and family education. In addition, the impact of academic integrity, i.e. honesty, trust, fairness, respect, responsibility and courage, on student achievement can be explored [10]. Thus, the classification of student achievements should identify and analyze potential learning difficulties from a more comprehensive perspective, find what affects student performance, and give advice on how to improve, thereby improving the quality of education [11] and promoting the comprehensive development of students.

Due to the increasing complexity of educational data, models for classifying student grades need significant advantages in dealing with multidimensional features, especially nonlinear ones. In addition, in small-scale studies or datasets of specific classes, the amount of data is often limited. Current classification methods for academic achievement can be roughly divided into two categories: statistical methods and machine learning methods. Statistical methods mainly include linear regression [12], logistic regression [13], etc. These methods usually have simple principles and convenient calculations, and can quickly obtain classification results for students’ academic achievement. However, their prediction accuracy is limited, making it difficult to capture complex nonlinear relationships; they also fail to fully use the rich information in educational data and are not good at finding hidden factors or learning from them. Compared to statistical methods, machine learning based methods perform well in handling nonlinear relationships and multidimensional features and in mining complex patterns in data. Typically, K-means clustering and Long Short-Term Memory (LSTM) networks are used to model and predict student performance based on students’ reported behaviours and preferences [14]. In addition, machine learning methods include decision trees [15-17], support vector machines [18-20], random forests [21], neural networks [22,23], etc. They are able to better handle the multidimensional features in student achievement prediction, resulting in higher prediction accuracy. However, on limited small datasets, machine learning methods are prone to overfitting. They also have relatively high requirements in computational complexity, data volume, and training time, and are more dependent on data.

Limitations in data volume and multidimensional data can lead to inaccuracies in learning evaluation. These issues also leave teachers without an understanding of students’ current learning status and may mislead educational decisions, seriously affecting the actual quality of education. Therefore, to address nonlinear relationships, multidimensional features, and limited data volume, this article proposes an innovative method: a radial basis function network with Gaussian distribution based oversampling (GDO-RBFN). This method can effectively alleviate data imbalance and limited data volume. Firstly, we perform Gaussian distribution based sampling on the limited data, generating samples from the statistical features of the original data. This provides sufficient data for training subsequent models and weakens the impact of class imbalance in the original data. Secondly, we feed the generated dataset into the RBFN model, whose excellent nonlinear fitting and generalization abilities give it an advantage in predicting student academic performance [24]. The effective combination of the two methods improves the accuracy of predicting students’ academic performance. It motivates students to set reasonable learning goals and enhances learning motivation, and it also helps teachers to teach, optimize teaching methods, and allocate educational resources reasonably. In summary, the contributions of this article are as follows:

  • This article proposes a model called GDO-RBFN, which innovatively combines Gaussian distribution based oversampling and radial basis function networks. It expands the original dataset through Gaussian distribution sampling, effectively alleviating the problems of limited educational data and imbalanced data categories.
  • This article uses a Radial Basis Function Network (RBFN) as a strong nonlinear fitting model that can effectively capture high-dimensional features, and generates more representative sample data. This significantly improves the predictive performance of the GDO-RBFN model in student academic achievement classification tasks.
  • We compared six mainstream performance prediction models on an education dataset with class imbalance, using multiple metrics such as Accuracy, Precision, and Recall for evaluation. The results validate the significant advantages of the GDO-RBFN model in handling limited data and mining feature relationships. Compared with other models, our model has better overall performance: from the perspectives of data volume and features, it solves the practical problem of inaccurate classification of student grades, providing educators with a more accurate predictive analysis tool.

The rest of this article is organized as follows: The second section reviews the development and application of data mining methods and GDO technology in the field of education. The third section first introduces the framework of our proposed GDO-RBFN method, and then focuses on the data augmentation procedure and the RBFN model. The fourth section introduces the experimental setup and evaluation criteria, and comprehensively discusses the experimental results. The fifth section introduces the limitations and future development of this study.

Review

Development of EDM technology

Educational Data Mining (EDM) technology applies interdisciplinary theoretical knowledge and practical techniques, drawn from pedagogy, computer science, and statistics, to solve practical educational problems. Its most important form remains extracting valuable information from massive amounts of structured educational data. It then analyzes and explores students’ learning characteristics, behaviors, and emotions, as well as course difficulty and the correlations between courses. Finally, it establishes an effective analytical model to predict students’ learning ability or academic achievement, and conducts accurate assessment and classification. Around the 1980s, it was widely believed from a psychological and cognitive perspective that factors such as student motivation, planning, and learning styles were associated with student performance. Cortez [25] used multiple classification algorithms to make reasonable predictions on the academic performance of Portuguese high school students; his results showed that the decision tree had a significant effect and accurate predictive performance. Peña-Ayala [26] conducted an in-depth analysis of the application of Data Mining (DM) techniques, such as clustering analysis, classification algorithms, and regression models, in the field of education. Anzer [27] predicted individual course grades and used linear regression to predict final grades, identifying the factors most relevant to final grades from students’ performance before, during, and after class. Hussain [28] evaluated students’ academic achievement using 12 characteristics representing academic and personal qualities; he compared RF, PART, BayesNet and other methods, and found that RF still had a significant advantage in prediction accuracy. Minn [29] proposed a novel model that combined multi-task learning and graph neural networks. He significantly enhanced the model’s understanding and prediction of students’ knowledge status by modeling the dependency relationships between knowledge points; the model can also evaluate students’ mastery of different knowledge points. Fan Yang [30] proposed using a classification BP-NN for student performance estimation. This model estimates students’ attributes by referring to their prior knowledge and the attributes of other students with similar features. Mahmoud [31] analyzed and compared the results of each classifier used, including K-means, maximum likelihood, support vector machines, and neural networks, integrating different types of classification. Olabanjo [32] collected students’ past learning records and their psychological abilities; on this basis, an RBFN model was trained to predict student performance, with principal component analysis also used to evaluate performance.

Application of sampling technology

Sampling techniques play a crucial role in data analysis and modeling, especially when samples are limited. They amplify limited samples, generate new data points, and expand the size of the dataset, thus improving the effectiveness of training and the accuracy of prediction. For example, k-INOS [33] is a minority-class oversampling method based on the neighborhood-of-influence approach; it improves the handling of imbalanced data, for instance in amyloid protein prediction, by preventing noise samples from contaminating the oversampling process. OPTIC-SMOTE [34] is an improved SMOTE model based on density clustering; it significantly improves the authenticity and representativeness of synthesized samples by removing noisy samples and fully exploiting boundary-sample information. ADPPTC [35] can accurately complete tasks within a specific time frame, ensuring optimality and prescribed-time stability. GAME [36] is a generative adversarial method for minority classes: it adjusts the parameters of the local linear model to approach the majority class by extending the sampling margin of the data-generation and adversarial stages, aiming to enhance the diversity and representativeness of synthesized samples and to avoid minority samples being confined to a finite sample space. CHSMOTE [37] is a convex-hull-based SMOTE algorithm: it selects boundary majority samples as initial samples and identifies the sample-synthesis region by checking whether the constructed convex hull contains boundary majority samples, generating more samples with valid information by expanding the generation range of synthetic samples. Compared to SMOTE-like sampling, Gaussian Distribution based Oversampling (GDO) [38] generates samples from the statistical features of the data, simulating different feature patterns and distributions.
At the same time, GDO can effectively reduce the influence of noise samples and enhance sample diversity, and it avoids the over-density problem of SMOTE-type methods. In the analysis of structured educational data, class-specific datasets often face insufficient data volume. This issue limits the model’s ability to provide in-depth understanding and accurate assessment of student behaviors and performance: a limited amount of data may leave the model under-trained, unable to capture the connections and deeper meaning behind students’ behaviors. To cope with this problem, this study introduces the GDO resampling technique. It generates virtual student learning records that simulate different modes of learning and subject understanding. The generated data effectively extend the original dataset, support the analysis of student learning trends and personalized education needs, and help assess the effectiveness of teaching methods. This approach reduces reliance on a limited sample of actual data and provides a more comprehensive and reliable information base for analyzing educational data.

Materials and methods

Modelling framework

The framework of the model is shown in Fig 1. This study proposes GDO-RBFN, a classification model for students’ expected achievement. The aim is to weaken the effect of class imbalance through GDO-based data augmentation while exploiting RBFN’s strength in deep mining of high-dimensional features, so that the underlying characteristics of the data are captured more effectively. The core framework is divided into two parts: the GDO data augmentation part and the RBFN classification part. The GDO method performs data augmentation by calculating information weights and computing new samples. For the generated samples, validity is tested using t-tests and the data are standardized. Four feature categories were chosen for the study: overall characteristics, personal problems, family problems, and educational habits. The RBFN model is used for classification and model evaluation, comparing predicted scores with actual scores. At the same time, it analyses the results and guides students toward performance improvement.

Fig 1. Classification model for predicting student performance GDO-RBFN.

https://doi.org/10.1371/journal.pone.0325713.g001

GDO data amplification

GDO generates diverse samples by modelling different feature patterns and distributions based on the statistical characteristics of the data. Its purpose is to balance the class distribution of the original data with augmented samples, thereby weakening the effect of class imbalance. GDO is broadly divided into three parts: probabilistic selection of anchor instances, generation of new instances, and oversampling probability analysis.

  • Calculate the Anchor Selection Probability for Each Minority Class Instance.
    Measuring the importance of data information is based on the distance and density of minority-class instances, and the probability from information weights is calculated as follows:
    Considering that different instances carry different information, we focus on distinguishing the differences between minority instances with a density factor M(Xi) and a distance factor C(Xi). Here Ni denotes the set of K-nearest neighbors of a minority instance Xi, Ni(min) denotes the minority-class instances in Ni, and Ni(maj) denotes the majority-class instances in Ni. The density factor M(Xi) is the proportion of majority-class instances among the K-nearest neighbors of Xi:
    M(Xi) = |Ni(maj)| / K
    The distance factor C(Xi) is computed from the distances between Xi and its majority-class neighbors Ni(maj), so that instances closer to the majority class carry more information. M(Xi) and C(Xi) are added together to get the information weight:
    I(Xi) = M(Xi) + C(Xi)
  • Selection of Anchor Instances and New Generation of Samples.
    Our study uses a roulette-wheel algorithm to select anchor instances from the minority instances based on the anchor selection probability. The selection probability of each minority instance Xi as an anchor is its normalized information weight:
    P(Xi) = I(Xi) / Σj I(Xj)

    Anchor instances are picked iteratively, and new minority samples following a Gaussian distribution are generated around them until the minority class reaches the size of the majority class. The quantity to be synthesized is:
    Ga = |Smaj| − |Smin|
    where Smin denotes the set of minority-class instances and Smaj the set of majority-class instances; Ga is the number of samples to synthesize.
    Roulette Algorithm Selection
    Each time an anchor instance is to be selected, a uniformly distributed random number r in (0, 1] is generated and compared with the cumulative probabilities. The instance Xi is selected if the following condition is satisfied:
    Σ(j=1..i−1) P(Xj) < r ≤ Σ(j=1..i) P(Xj)
    With this scheme, the expected number of times each minority instance Xi is selected as an anchor is:
    W(Xi) = Ga · P(Xi)
    Since real situations contain several unknown factors, failing to select appropriate robustness measures may lead to performance degradation or instability of the system [39]. The biggest advantage of the roulette-wheel algorithm is that minority instances carrying more information are sampled multiple times, which also improves the quality of the newly generated minority instances.
  • Generation of New Instances.
    Based on the above operations, appropriate anchors are selected to generate new instances of higher quality. The generation steps based on the Gaussian distribution are as follows:
    An arbitrary direction is chosen as the starting direction vector of the anchor instance Xi, with the end point of the direction denoted V, as shown in Fig 2 below.
    Xi and V lie in the same feature space, and the direction vector q is given by:
    q = OV − OXi
    where O is the origin of the coordinates, and OV and OXi are the position vectors of points V and Xi.
    Determine the Distance Between the New Instance and the Anchor Instance
    The distance r between the new instance and the anchor instance is drawn from a Gaussian distribution N(0, σ²) with mean 0 and standard deviation σ, where σ is controlled by a scaling factor. The scale factor λ between this distance and the length of the direction vector q is then:
    λ = r / |q|
    So, the final new instance coordinates can be expressed as:
    Xnew = Xi + λ · q

GDO generates new samples based on the statistical characteristics of the data, effectively alleviating the problems of limited data volume and unbalanced data classes. Compared with traditional oversampling methods such as SMOTE, GDO generates samples based on the statistical characteristics of the data, which can effectively reduce the influence of noise samples, enhance sample diversity, and thereby improve the performance of the model. The RBFN, discussed below, further improves the model’s performance in classifying students’ academic performance by virtue of its good nonlinear fitting and generalisation abilities.
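The GDO procedure described above can be sketched as follows. This is a minimal NumPy illustration, not the authors’ implementation: the distance factor is approximated as the mean neighbor distance, and the parameter names (`k`, `alpha`) are our assumptions.

```python
import numpy as np

def gdo_oversample(X_min, X_maj, k=5, alpha=0.2, rng=None):
    """Sketch of Gaussian Distribution based Oversampling (GDO).

    Generates |X_maj| - |X_min| synthetic minority samples around
    probabilistically selected anchor instances.
    """
    rng = np.random.default_rng(rng)
    X_all = np.vstack([X_min, X_maj])
    n_min = len(X_min)

    # Information weight I(Xi) = M(Xi) + C(Xi) for each minority instance.
    weights = np.empty(n_min)
    for i, x in enumerate(X_min):
        d = np.linalg.norm(X_all - x, axis=1)
        d[i] = np.inf                        # exclude the instance itself
        nn = np.argsort(d)[:k]               # indices of the K nearest neighbors
        m = np.mean(nn >= n_min)             # density factor: share of majority neighbors
        c = d[nn].mean()                     # distance factor (approximated here)
        weights[i] = m + c
    p = weights / weights.sum()              # anchor selection probability P(Xi)

    ga = len(X_maj) - n_min                  # Ga: number of samples to synthesize
    new = []
    for _ in range(ga):
        i = rng.choice(n_min, p=p)           # roulette-wheel anchor selection
        v = rng.normal(size=X_min.shape[1])  # random direction vector q
        q = v / np.linalg.norm(v)
        r = rng.normal(0.0, alpha)           # Gaussian distance N(0, sigma^2)
        new.append(X_min[i] + r * q)         # new instance along the direction
    return np.vstack([X_min, np.array(new)])
```

After the call, the minority class is expanded to the size of the majority class while every original minority sample is preserved.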

The module of RBFN

The RBFN model was first proposed by Moody and Darken as a typical feed-forward neural network. The output of a radial basis function network is a linear combination of the input radial basis function and neuron parameters. As shown in Fig 3, it is designed to overcome the limitations of traditional neural networks in dealing with complex nonlinear problems. By introducing a radial function as a non-linear activation function, RBFN is able to effectively capture potentially complex patterns and relationships in data. The model consists of an input layer, a radial function layer and an output layer to classify a standardized new dataset.

  • Feature selection and Standardization.
    We assigned the original features in the dataset in different combinations: overall features (all features), personal problems, family problems and educational habits. And these features were standardized with a mean of 0 and a variance of 1.
  • Radial Function Output Layer.
    The radial basis function maps the input features to a higher-dimensional space. Each radial function is characterized by a centre cj and a width parameter σj, and is defined below:
    φj(x) = exp(−‖x − cj‖² / (2σj²))
    where ‖x − cj‖ denotes the distance between the input sample x and the centre cj, and σj denotes the width controlling the basis function.
  • The Layer of Output.
    The output is a weighted linear combination of the radial function outputs, giving the final classification with the following formula:
    y = Σ(j=1..M) wj φj(x) + b

where y is the output, wj is the weight, b is the bias, and M is the number of radial functions.

The loss function is used to measure the difference between predicted and actual values. The classification loss used for the experiment is the cross-entropy loss function:

L = −(1/N) Σ(i=1..N) Σ(c=1..C) yi,c · log(pi,c)

L is the value of the loss function, N is the number of samples, and C is the number of categories. yi,c is the actual value of sample i on category c, and pi,c is the predicted probability of sample i on category c.

Through the above steps, the radial basis function network can effectively classify student performance. The model tunes its parameters by standardizing the input features, mapping them to high dimensions in the radial basis function layer, combining the outputs linearly in the output layer, and minimising the loss function. Thus, the model achieves accurate prediction of student performance.
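The forward pass just described can be sketched in a few lines. This is a minimal NumPy illustration, not the authors’ PyTorch implementation; the centres, widths, and weights here stand in for trained parameters.

```python
import numpy as np

def rbfn_forward(X, centers, widths, W, b):
    """RBFN forward pass: Gaussian basis layer, then a linear output layer."""
    # phi[n, j] = exp(-||x_n - c_j||^2 / (2 * sigma_j^2))
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    phi = np.exp(-d2 / (2 * widths ** 2))
    return phi @ W + b                       # logits, one row per sample

def cross_entropy(logits, labels):
    """Mean cross-entropy loss over N samples and C classes."""
    z = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    n = len(labels)
    return -np.log(p[np.arange(n), labels]).mean()
```

Standardized features go in; the argmax of the logits gives the predicted grade class, and the loss drives the tuning of the weights, centres, and widths during training.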

Experiment

To improve the running speed and performance of the model, the learning rate lr of the RBFN was set to a default of 0.005 for both datasets, the embedding dimension was set to 32, and the number of neurons in the hidden layer to 300. For the GDO module, we adopt a neighbor number of 5, a covariance coefficient of 0.2, and a target oversampling rate that is dynamically adjusted according to the different feature groups.

Datasets

The datasets we used are publicly available from the UCI repository and fall into two types: the higher education student performance evaluation dataset and the student performance dataset, which can be downloaded from [https://archive.ics.uci.edu/dataset1] and [https://archive.ics.uci.edu/dataset2] respectively. The characteristics of the two datasets are summarized in Tables 1 and 2. For ease of reference, we call them the evaluation dataset and the performance dataset in subsequent descriptions. This study applies different machine learning and deep learning methods to these datasets to predict actual student achievement. By analyzing the multiple student characteristics provided in the data, such as personal information, academic performance, living habits, and family background, we aim to understand how these factors affect students’ academic achievement. The resulting predictions can in turn inform teaching methods and educational decision-making.

Table 1. Higher education students performance evaluation dataset feature description.

https://doi.org/10.1371/journal.pone.0325713.t001

Table 2. Student performance dataset feature description.

https://doi.org/10.1371/journal.pone.0325713.t002

The evaluation dataset has 146 records with 32 features, of which features 1–10 describe individual problems, 11–16 family problems, and the rest educational habits. Student achievements are divided into eight categories: AA, BA, BB, CB, CC, DC, DD, and FF. The experiment grouped them into three classes based on performance: excellent (class A), qualified (class B), and unqualified (class C). To validate the effectiveness of GDO-RBFN, we similarly conducted experiments on the performance dataset, which contains 396 records with 32 features, where features 1–12 are individual problems, 13–25 family problems, and 26–32 educational habits, with student performance categorized in the same way.
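The regrouping of the eight letter grades into three classes can be expressed as a simple lookup. The exact grade-to-class cut points below are our illustrative assumption, since the text does not spell out which letter grades fall into each class.

```python
# Hypothetical grouping of the eight letter grades into three classes:
# A = excellent, B = qualified, C = unqualified (cut points are assumed).
GRADE_TO_CLASS = {
    "AA": "A", "BA": "A", "BB": "A",
    "CB": "B", "CC": "B", "DC": "B",
    "DD": "C", "FF": "C",
}

def relabel(grades):
    """Map raw letter grades to the three-class target used in the experiments."""
    return [GRADE_TO_CLASS[g] for g in grades]
```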

Evaluation indicators

In the field of education, accurately evaluating model performance is particularly important for classifying student grades. The confusion matrix provides a structured method to summarize the accuracy of model predictions. Its entries are true positives (TP), false negatives (FN), false positives (FP), and true negatives (TN), as shown in Table 3. It helps to better classify students’ academic performance in this research.

By using the following indicators, more effective and accurate student intervention and support strategies can be developed from the prediction results of the different models, ensuring that educational resources help the students who truly need assistance and improving overall educational outcomes. The following are the indicator formulas and the reasons for selecting them as experimental evaluation indicators. Indicator calculation: we use the weighted averages of Accuracy, Precision, Recall, and F1 as the final measures. Specifically, the precision, recall, and F1 score of each category are weighted by the category’s proportion in the dataset and then summed to obtain the weighted average over the entire dataset. This ensures that each category is treated fairly in the evaluation, thereby more accurately reflecting the performance of the model. Accuracy is calculated using the following formula:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
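The weighted averaging described above can be computed directly from the confusion matrix. A minimal sketch (support-weighted precision, recall, and F1, equivalent to scikit-learn’s `average='weighted'` behaviour):

```python
import numpy as np

def weighted_metrics(cm):
    """Accuracy plus support-weighted precision, recall, F1 from a confusion matrix.

    cm[i, j] = number of samples of true class i predicted as class j.
    """
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    support = cm.sum(axis=1)                  # true samples per class
    precision = tp / np.maximum(cm.sum(axis=0), 1e-12)
    recall = tp / np.maximum(support, 1e-12)
    f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
    w = support / support.sum()               # class proportions in the dataset
    accuracy = tp.sum() / cm.sum()
    return accuracy, (w * precision).sum(), (w * recall).sum(), (w * f1).sum()
```

Each per-class score is weighted by that class’s share of the data, so minority classes are neither ignored nor over-counted.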

  • Logistic Regression (LR): A classic linear classification model, in common use since the mid-20th century, mainly applied to binary classification problems. Its core principle is to map a linear combination of features to a probability between 0 and 1 through the sigmoid function.
  • Support Vector Machine (SVM): A supervised algorithm proposed by Vladimir Vapnik and Corinna Cortes in the 1990s. It is suitable for linear and nonlinear classification, regression, and anomaly detection tasks, and performs classification by maximizing the inter-class margin.
  • Random Forest (RF): An ensemble learning algorithm proposed by Leo Breiman in 2001 that aggregates multiple decision trees. It improves the accuracy and robustness of the model through voting or averaging of predicted values.
  • Multilayer Perceptron (MLP): A classic feedforward neural network structure, widely studied and promoted as early as the 1980s. It is composed of multiple layers of neurons, each fully connected to the next layer.
  • Artificial Neural Network (ANN): A relatively old concept, though the development and application of modern neural networks have accelerated since the 1990s. It simulates the neural network of the human brain and is used to solve various machine learning problems.
  • Bidirectional Long Short-Term Memory (BILSTM): A deep learning model proposed by Schuster and Paliwal in 1997 for processing sequential data; it considers past and future contextual information at every time step.
  • Radial Basis Function Network (RBFN): A classic neural network structure widely studied and applied in the late 1980s and early 1990s. Its hidden layer uses radial basis functions as activation functions, and it is typically used for regression and classification problems.

Results

We predict students’ grades from various characteristics according to the experimental design. The experiment can be roughly divided into two parts: overall feature classification and classification based on GDO-amplified data. Overall feature classification includes the classification of personal issues, family issues, and educational habits. To further enhance the generalisation ability of the model and validate its applicability across different cohorts, 10-fold cross-validation was carried out: the original dataset was partitioned into 10 mutually exclusive subsets, and each time 9 of them were used as the training set and the remaining subset as the test set. The process was repeated 10 times so that each sample had the opportunity to serve as a test sample. The experiments built the RBFN model on PyTorch, and Adam was used for optimisation.
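The 10-fold protocol above can be sketched without any ML library: shuffle the indices once, split them into 10 near-equal folds, and rotate the held-out fold. A minimal illustration (the actual experiments train the PyTorch models inside each fold):

```python
import numpy as np

def kfold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    idx = np.random.default_rng(seed).permutation(n_samples)
    folds = np.array_split(idx, k)            # near-equal, mutually exclusive folds
    for i in range(k):
        test = folds[i]
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train, test
```

For the 146-record evaluation dataset, each test fold holds 14 or 15 samples, and every sample is tested exactly once across the 10 rounds.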

Comparison of pre-amplification models

In this experiment, three machine learning models (LR, SVM, RF) and four deep learning models (MLP, ANN, BILSTM, RBFN) were used to classify the data, and the model performance was evaluated based on accuracy, precision, recall, and F1 score. The results of the categorization by overall characteristics are given in Table 4.

Table 4. Classify by overall characteristics. The best results are in bold and the second best results are in italics (%).

https://doi.org/10.1371/journal.pone.0325713.t004
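The four evaluation metrics can be computed as below. This sketch uses macro averaging across the three grade classes; the paper does not state its averaging scheme, so treat macro averaging as one plausible choice, and the toy labels as hypothetical.

```python
def macro_metrics(y_true, y_pred, labels):
    """Accuracy plus macro-averaged precision, recall, and F1 score."""
    acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    precs, recs, f1s = [], [], []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precs.append(prec); recs.append(rec); f1s.append(f1)
    n = len(labels)
    return acc, sum(precs) / n, sum(recs) / n, sum(f1s) / n

# Toy example with the three grade levels A, B, C
y_true = ["A", "A", "B", "B", "C", "C"]
y_pred = ["A", "B", "B", "B", "C", "A"]
acc, prec, rec, f1 = macro_metrics(y_true, y_pred, ["A", "B", "C"])
print(round(acc, 3))  # 0.667
```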

We first classify on the overall feature set, i.e. all 32 features of the dataset, using the machine learning models LR, SVM, and RF and the deep learning models MLP, ANN, BILSTM, and RBFN. Among all models, RF performs best, with higher accuracy, recall, and F1 score than the others. This indicates that RF has significant advantages in processing high-dimensional, complex datasets, as its ensemble structure reduces overfitting and is robust to outliers. The LR model achieves the best precision, rarely misclassifying students into the wrong grade; although it handles linearly separable data well, it has limitations with complex, nonlinear features. SVM has natural advantages on small samples and high-dimensional data, but the experimental data apparently contain too many nonlinear features, so its actual effect is not ideal. MLP can capture nonlinear features in the data through its stacked layers of neurons, but its performance is slightly lower than RF and RBFN. Compared with MLP, ANN has a slight advantage in nonlinear feature processing, but its actual performance is still inferior to RF. BILSTM can capture bidirectional dependencies in sequence data, but in this experiment it did not significantly surpass the other models. RBFN is second only to RF on the remaining indicators, demonstrating its potential in capturing nonlinear features of the data.

Personal issues are described by features 1–10 of the overall feature set. As shown in Table 5, every model except logistic regression shows a clear decrease compared with the overall features. The best model on the personal-issue features is again RF, with accuracy down 2.86%, precision up 0.51%, recall down 2.6%, and F1 score down 1.09% relative to the overall features. This indicates that when the number of features decreases, especially when the key family-issue and educational-habit features are absent, the construction of the decision trees and the overall performance of the model are affected. The indicators of RBFN also dropped by about 3%: when key features are missing, the model's ability to capture data complexity declines, lowering its classification performance. Only LR is not affected like the other models, indicating that logistic regression handles linearly separable data well and retains strong classification ability on the personal-issue features; despite the absence of family issues and educational habits, its overall performance is not significantly affected.

Table 5. Classify by personal issues. The best results are in bold and the second best results are in italics (%).

https://doi.org/10.1371/journal.pone.0325713.t005

Features 11–16 of the overall set describe family problems, and the impact of classifying on family features alone varies across models, as shown in Table 6. The best performing models are LR and RBFN. RF and RBFN are more sensitive to changes in the feature set: when only family features are used, accuracy and the other indicators decline markedly. In particular, RF captures information from the family features far less effectively than from the overall features, and its performance drops significantly. LR improves slightly in accuracy and precision while its remaining indicators decrease, indicating that family characteristics help the LR model capture potential patterns in student performance, though at a slight cost on the other metrics. The feature change also affects ANN, BILSTM, and SVM, but not significantly.

Table 6. Classified by family issues. The best results are in bold and the second best results are in italics (%).

https://doi.org/10.1371/journal.pone.0325713.t006

The last 16 features of the overall set are the educational habit features. As shown in Table 7, using only these 16 features markedly improves classification accuracy compared with the overall features, indicating the crucial role of educational habits in grade classification. Educational habit features, including course interest, study time, and exam preparation, effectively capture potential differences in academic performance; these factors have a significant impact on student performance and directly reflect students' learning ability and academic level. According to the experimental results, the best model on the educational habit features is RBFN, which improves accuracy by 2.7%, precision by 5.6%, recall by 2.71%, and F1 score by 4.15%. RBFN performs well on high-dimensional data and nonlinear relationships, which explains its marked improvement on these features. The limited gains of the LR and RF models indicate that the educational habit features improve their accuracy only marginally.

Table 7. Classified by educational habits. The best results are in bold and the second best results are in italics (%).

https://doi.org/10.1371/journal.pone.0325713.t007

Comparison of amplified models

The data used in the experiment have three grade levels: A, B, and C. The original dataset contains only 145 entries, of which the most common category accounts for 120 cases; the remaining two categories were amplified in the experiment. We mainly used oversampling based on a Gaussian distribution, generating new sample points from Gaussian distributions to provide sufficient data for model training and achieve more accurate classification of student grades. The results of classification by overall characteristics after amplification are shown in Table 8.

Table 8. Classify according to overall characteristics after amplification. The best results are in bold and the second best results are in italics (%).

https://doi.org/10.1371/journal.pone.0325713.t008
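The amplification step can be illustrated with a simplified Gaussian-distribution-based oversampling sketch. The full GDO method additionally weights anchor samples by their neighbourhood composition; this minimal variant only perturbs randomly chosen minority samples with Gaussian noise, and the noise scale `sigma` is an assumed parameter, not one from the paper.

```python
import numpy as np

def gdo_oversample(X_min, n_new, sigma=0.1, seed=0):
    """Simplified Gaussian-distribution-based oversampling: new minority
    samples are drawn around existing ones, with Gaussian perturbations
    scaled by the per-feature standard deviation of the minority class."""
    rng = np.random.default_rng(seed)
    scale = X_min.std(axis=0) * sigma                    # per-feature noise scale
    anchors = X_min[rng.integers(0, len(X_min), size=n_new)]
    return anchors + rng.normal(0.0, 1.0, size=anchors.shape) * scale

# Toy usage: the majority grade has 120 cases, so a hypothetical
# 15-sample minority grade is amplified up to the same size.
X_minority = np.random.default_rng(1).normal(size=(15, 32))
X_synth = gdo_oversample(X_minority, n_new=120 - 15)
print(X_synth.shape)  # (105, 32)
```

Because the perturbations are small relative to the class spread, the synthetic points stay near the minority manifold, which is what the homogeneity-of-variance checks later in the paper verify.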

After amplification with Gaussian distribution based oversampling (GDO), every model showed significant improvements in accuracy, precision, recall, and F1 score. Specifically, the performance indicators of the LR, SVM, RF, MLP, ANN, BILSTM, and RBFN models improved by nearly 10% compared with the unamplified data. The RBFN model in particular performs best on all indicators, with an accuracy of 95.00%, precision of 95.52%, recall of 95.00%, and F1 score of 94.95%; it adapts extremely well to GDO, far exceeding RF, the best model before amplification. This suggests that RBFN requires a certain volume of training data, and that a sufficient volume allows it to capture latent feature relationships more fully. In addition, GDO amplification provides additional realistic data, weakening the negative impact of class imbalance on model performance and improving performance on the overall classification task. This further demonstrates the effectiveness of the GDO method in addressing data imbalance and the strong feature-capturing ability of the RBFN model given sufficient training data.

Predicting student performance from the first 10 features of the overall set (the personal-issue features) after amplification significantly improved the models' classification ability, with SVM, RF, MLP, and RBFN showing the largest gains, as shown in Table 9. This indicates that GDO amplification not only increases the sample size but also enhances the models' ability to capture and learn complex features, so they still perform well when classifying data with fewer features. LR shows a clear decrease, because its performance suffers when the newly generated samples do not fully conform to the linear distribution of the original data; its ability to process the data is also limited by the increased complexity and number of nonlinear features.

Table 9. Categorize by personal questions after amplification. The best results are in bold and the second best results are in italics (%).

https://doi.org/10.1371/journal.pone.0325713.t009

For classification on the family-problem features, data augmentation significantly improved every model, as shown in Table 10. The RF model improved on all indicators, reaching an accuracy of 92.22%, indicating that it captures the family-problem features better after augmentation. The RBFN model also performed well, reaching an accuracy of 88.06% and demonstrating its advantages in capturing data complexity and feature relationships. In contrast, the LR model shows a slight decrease in accuracy and precision, indicating that the family features somewhat hamper its classification ability. SVM, MLP, ANN, and BILSTM all improved in classification performance; although their gains are not as large as those of RF and RBFN, they remain stable. This indicates that the data provided by GDO amplification effectively improve classification on the family-problem features and enhance the overall performance of the models.

Table 10. Categorize by family issues after amplification. The best results are in bold and the second best results are in italics (%).

https://doi.org/10.1371/journal.pone.0325713.t010

For classification on the educational habit features, every model showed a significant improvement after GDO amplification, as shown in Table 11. The RBFN model performs excellently, with an accuracy of up to 93.58%, precision of 94.6%, recall of 93.58%, and F1 score of 92.92%, demonstrating its advantages in processing high-dimensional features and nonlinear relationships. The performance gains of SVM and the other models are also substantial, especially in accuracy and precision, which highlights the necessity of amplifying the data with GDO. The improvement of the LR model is comparatively limited, as it is better suited to linear classification. After GDO amplification, the latent feature-capturing ability of RBFN is effectively amplified.

Table 11. Expand and classify according to individual educational habits. The best results are in bold and the second best results are in italics (%).

https://doi.org/10.1371/journal.pone.0325713.t011

In the above experiments, with or without GDO amplification, the most effective bases for classifying student performance are the overall features and the educational habit features. After GDO amplification, however, every method improved significantly, especially the RBFN model, which shows clear superiority in handling nonlinear relationships and high-dimensional features. On the data side, GDO amplification enlarges the minority classes to a certain extent, and the additional training data allow the models to better capture relationships between latent features, alleviating the class-imbalance problem. In summary, the adopted GDO-RBFN approach delivers the most satisfactory performance in grade classification.

Discussion

Reliability of synthetic data

Since part of the data were synthesised using GDO, we conducted experiments to verify the credibility of the synthesised data, using homogeneity of variance, P-value, distribution overlap, and Fréchet distance as evaluation indicators; the results are shown in Table 12. Starting with homogeneity of variance: the P-values of all four categories exceed 0.05, with the overall dataset at 0.8387, personal problems at 0.9210, family problems at 0.9075, and educational habits at 0.9125. This indicates no significant difference in variance between the original and synthetic data in any category, so the homogeneity-of-variance assumption holds. In terms of distribution overlap and Fréchet distance, the overlap values of the four categories lie between 0.94 and 0.97, close to 1, indicating that the distributions of the raw and synthetic data overlap strongly on each feature. The Fréchet distances lie between 0.02 and 0.04, all below 0.05, indicating that the distributions of the synthetic and original data are very similar and that the synthetic data are of high quality. In summary, the distributions of the synthetic data in these categories are consistent with the original data with high confidence, further verifying the effectiveness of the GDO amplification method in generating credible synthetic data.
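The distribution-overlap and Fréchet-distance checks can be illustrated as follows, assuming the overlap is measured by histogram intersection and the Fréchet distance is computed between Gaussian fits of a feature; the paper does not spell out its exact estimators, so this is one plausible reading rather than the authors' implementation.

```python
import numpy as np

def overlap_coefficient(a, b, bins=30):
    """Histogram-intersection overlap of two 1-D samples (1 means identical)."""
    lo, hi = min(a.min(), b.min()), max(a.max(), b.max())
    pa, _ = np.histogram(a, bins=bins, range=(lo, hi))
    pb, _ = np.histogram(b, bins=bins, range=(lo, hi))
    return np.minimum(pa / pa.sum(), pb / pb.sum()).sum()

def frechet_1d(a, b):
    """Fréchet distance between 1-D Gaussian fits of the two samples."""
    return float(np.sqrt((a.mean() - b.mean()) ** 2 + (a.std() - b.std()) ** 2))

# Toy check: a "synthetic" sample drawn very close to the "real" distribution
# should score high overlap and a small Fréchet distance.
rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, 2000)
synth = rng.normal(0.02, 1.01, 2000)
print(round(overlap_coefficient(real, synth), 2), round(frechet_1d(real, synth), 3))
```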

Sensitivity parameter analysis

Fig 4 shows the performance metrics (accuracy, precision, and F1 score) of the model under different feature combinations as a function of oversampling rate p. When integrating all features, as the oversampling rate increases, various performance indicators gradually improve, reaching the highest value at an oversampling rate of 1, indicating a significant improvement in model performance with moderate oversampling. When using personal problem features, the accuracy and F1 score significantly decrease at an oversampling rate of 0.2, but gradually increase thereafter, reaching optimal results at 0.8, indicating that personal problem features are more sensitive to low oversampling rates. When applying family problem features, the performance of the model significantly decreases at low oversampling rates (0.2), but gradually improves with increasing oversampling rates and reaches optimal results at 1. When using educational habit features, the performance index slightly decreases at an oversampling rate of 0.6, but reaches its optimal effect at 0.8, indicating that educational habit features are sensitive to moderate oversampling. Overall, different feature combinations have varying sensitivities to oversampling rates, but moderate oversampling (typically between 0.6 and 1) can significantly improve model performance.

k denotes the number of nearest neighbours considered when calculating the weights of the minority class samples. The choice of k has a significant impact on these weights: smaller values of k may cause the generated samples to be over-concentrated, while larger values may introduce noise. The comparison experiments are reported below.

The experimental comparison in Table 13 reveals that when the value of k is 5, all the evaluation indexes of educational habit feature classification are optimal for both datasets. We can optimise the performance of the GDO method by adjusting the size of k.

Table 13. Effect of the k-value in GDO on the accuracy of classification of educational habit features. The best results are in bold (%).

https://doi.org/10.1371/journal.pone.0325713.t013
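One plausible neighbour-weighting scheme consistent with the description above can be sketched as follows. The exact weighting rule used by GDO is not reproduced here; the fraction-of-majority-neighbours rule below, and all names and toy data, are assumptions for illustration.

```python
import numpy as np

def minority_weights(X_min, X_maj, k=5):
    """Weight each minority sample by the fraction of majority-class points
    among its k nearest neighbours (pooled over both classes): samples near
    the class boundary receive larger weights and anchor more synthetic data."""
    X_all = np.vstack([X_min, X_maj])
    is_maj = np.array([False] * len(X_min) + [True] * len(X_maj))
    w = np.empty(len(X_min))
    for i, x in enumerate(X_min):
        d = np.linalg.norm(X_all - x, axis=1)
        d[i] = np.inf                       # exclude the sample itself
        nn = np.argsort(d)[:k]              # indices of the k nearest neighbours
        w[i] = is_maj[nn].mean()
    total = w.sum()
    return w / total if total else np.full(len(X_min), 1 / len(X_min))

# Toy usage mirroring the paper's imbalance: 15 minority vs 120 majority samples
rng = np.random.default_rng(0)
X_min = rng.normal(0, 1, size=(15, 4))
X_maj = rng.normal(2, 1, size=(120, 4))
w = minority_weights(X_min, X_maj, k=5)
print(w.shape, round(float(w.sum()), 6))
```

Varying `k` here reproduces the trade-off described above: a small k concentrates the weight on a few borderline samples, while a large k smears it toward noisy regions.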

Limitations

From the explorations in Tables 4–8, we found that, when grouping and classifying by different features, the RBFN method we proposed did not lead on all datasets and evaluation metrics. This may be due to its limited local sensitivity and adaptability to high-dimensional data, and to an insufficient match between the data characteristics and RBFN's local kernel functions. The fact that we used fixed hyperparameters without optimising them for each dataset may also be one reason why RBFN did not lead across the board.

Secondly, combining Tables 6 and 10, the ranking of our GDO-RBFN actually decreased after GDO augmentation. Although GDO generates new samples from a Gaussian distribution, the generated samples may still be highly similar to one another, leaving sample diversity insufficient, which can affect the model's generalization to minority-class samples. Furthermore, the synthetic samples generated by GDO may deviate from the true distribution of the original data, causing the model to learn spurious patterns on the augmented dataset and reducing generalization on the original test set. Educational data are particularly sensitive to this, since student characteristics often have complex underlying structures.

In the future, we will enhance the ability of RBFN to capture global features by introducing an attention mechanism or hierarchical radial basis functions, and use adaptive parameter optimisation to improve the model's adaptability to different data distributions. In addition, the loss function will be customised for the specific needs of educational scenarios. For the data augmentation strategy, we propose a conditional augmentation strategy that controls the range and intensity of the generated data, perturbing only non-critical features while preserving the core features, so as to make our model better suited to educational classification applications.

Conclusion

The GDO-RBFN method proposed in this study combines Gaussian distribution based data augmentation with the RBFN model for student performance classification, effectively alleviating the impact of limited data volume and class imbalance. The method shows significant potential and advantages in processing complex feature data, especially in handling nonlinear relationships and multidimensional features, and can effectively classify students' grades. However, when applied to small datasets or specific populations, the practical effectiveness of GDO-RBFN is limited by the quality and quantity of the data, and the model's generalization and adaptability need strengthening so that it remains universal and effective across different educational backgrounds and student groups. Feature engineering and ensemble techniques offer potential for future application, and we will extend this research accordingly, exploring further deep learning methods to uncover deep patterns and correlated features in student performance prediction and to address the challenges of data complexity and diversity. In the future, we will develop more educational data models to provide educators with more accurate analysis tools, helping them better adjust students' learning strategies and ultimately improving the quality of education toward personalized and overall optimization.
