
Exploration of designing an automatic classifier for questions containing code snippets—A case study of Oracle SQL certification exam questions

Abstract

This study uses the Oracle SQL certification exam questions to explore the design of automatic classifiers for exam questions containing code snippets. SQL’s question classification assigns a class label in the exam topics to a question. With this classification, questions can be selected from the test bank according to the testing scope to assemble a more suitable test paper. Classifying questions containing code snippets is more challenging than classifying questions with general text descriptions. In this study, we use factorial experiments to identify the effects of the factors of the feature representation scheme and the machine learning method on the performance of the question classifiers. Our experiment results showed the classifier with the TF-IDF scheme and Logistic Regression model performed best in the weighted macro-average AUC and F1 performance indices. The classifier with TF-IDF and Support Vector Machine performed best in weighted macro-average Precision. Moreover, the feature representation scheme was the main factor affecting the classifier’s performance, followed by the machine learning method, over all the performance indices.

Introduction

Question classification associates a class label in the testing criteria with a question. It is a subfield of document classification. Compared to general documents, questions are shorter, which makes question classification more challenging [13].

The classification of questions in learning assessment is one of the applications of the automatic classifier [2]. When building a test bank, teachers must classify the questions into different exam topics for assembling a test paper with a specific scope according to the testing purpose. For example, Bloom’s Taxonomy is a classification of educational objectives to assess students’ learning outcomes, including levels of remembering, understanding, applying, analyzing, evaluating, and creating. Questions could be classified using Bloom’s Taxonomy to assess the learner’s cognitive ability. Alternatively, questions could be classified according to the exam topics to assess the learner’s proficiency in specific subjects.

Classifying questions is more challenging for programming course exams because the question stem or options often contain code snippets in addition to the general text descriptions. The code snippets’ particular syntax, such as structure, symbols, and variables, increases the complexity of the question classification. Taking the Oracle SQL certification exam questions as an example, the question stem could describe the report’s requirements, and the options provide various SELECT statements. Or, the question stem could be a SQL code snippet, and the question’s options could list the descriptions of the code intent. The SQL code snippets contain keywords (such as SELECT, UPDATE, WITH, etc.), field names, table names, and literals. All these make the question classifier design more sophisticated.

SQL exam questions are classified to build a test bank from which test papers for learning assessment are assembled. Teachers can collect historical questions from different sources and classify them by exam topic in the test bank. Then, they select questions on specific topics from the test bank when assembling a test paper according to the teaching progress. Classifying questions manually is time-consuming and increases the teaching workload. An automatic question classifier can help teachers build a test bank effortlessly and reduce the workload.

The literature on question classifiers in learning assessment primarily focuses on classifying questions with pure text content according to cognitive levels [4, 5] or knowledge domains [6, 7]. However, classifiers for questions containing code snippets have received little attention, and this study fills that gap in the literature.

The study contributes to teaching by helping teachers automatically classify historical questions to build a test bank. For example, one can apply our study’s results to the test bank management module of the automatic test paper assembling system developed by [8]. When importing questions, the system suggests possible topic labels to reduce the classification workload. Then, teachers will have more time to conduct formative or summative assessments to provide students with learning feedback and improve teaching quality.

The study aims to compare various question classifier designs to explore the effects of the feature representation schemes and machine learning methods on the classifier’s performance for classifying Oracle SQL certification exam questions. We use a factorial experimental design to identify the effects and the best combination of the feature representation schemes and machine learning methods in the metrics of weighted macro-average Area Under Curve (AUC), weighted macro-average F1 score, and weighted macro-average precision.

The rest of the paper is organized as follows. Section two reviews the literature on the question classifiers. Section three presents the research design, including the dataset, the feature representation schemes, the machine learning methods, the search for the optimal parameter for ML methods, the indicators for evaluation, and the experimental design. Section four describes the experiment results and discusses the findings. Finally, the last section concludes the study.

Literature review

Question classifiers assign a question to a class from a set of non-overlapping classes. The task is multi-label classification if a question may carry multiple class labels, or single-label classification if only one class label is allowed [9].

Most question classifications applied in the learning assessment belong to the single-label classification problem under the multi-class labels, for example, classifying questions to a cognitive level according to Bloom’s Taxonomy [10, 11]. The question length in the learning assessment is generally shorter than the length of other documents. Because of the shorter content, the design of the question classifier is more challenging than the document classifier’s [13].

Since the datasets, feature representation schemes, and machine learning methods all impact question classifiers, the remainder of this review is divided into three subsections: Datasets, Feature Representation and Word Weights, and Machine Learning Methods.

Datasets

Classifier development depends on the dataset and the knowledge domain. A classifier developed from a dataset in one domain is challenging to apply to another [6]. Words in the dataset may have varying weights across classification schemes for various domains [4]. Besides, the class number of the classification scheme and the dataset size affect the classifier’s design [11, 12].

The literature’s datasets cover several domains, such as business and marketing, computer science, computer programming, and operating systems. Since classification depends highly on the knowledge domain, most studies adopt custom classification schemes [3], such as the schemes for the science subject test [6] or the biomedical exam [7]. There are also general classification schemes, such as Bloom’s Taxonomy [13] or Costa Levels of Questioning [14]. Table 1 summarizes the knowledge domains, classification schemes, and the sizes of the datasets from the literature to develop the question classifiers.

Table 1. The datasets and classification schemes used in the literature to develop question classifiers in various knowledge domains.

https://doi.org/10.1371/journal.pone.0309050.t001

Feature representation and word weights

The question’s feature representation is one factor that impacts the classifiers. Previous studies mostly use the Term Frequency-Inverse Document Frequency (TF-IDF) to extract features [3]. Lilleberg et al. [25] pointed out that using the TF-IDF and word embedding schemes (such as Word2Vec) simultaneously performs better than TF-IDF alone because word embedding complements the semantic information that the TF-IDF cannot capture. Mohammed and Omar [13] employed the Term Frequency (TF)-Inverse Document Frequency (IDF) based on Part-Of-Speech (POS), abbreviated as TFPOS-IDF, to determine the word weights. The TFPOS-IDF gives different weights to words based on their POS tags to modify the term frequency in the TF-IDF scheme. Then, Word2Vec is used to extract the semantic, dense features of the words and combine them with their TFPOS-IDF representations. The resultant question feature vectors are dense and can reduce computational complexity and improve learning performance [26].

When words have characteristics that identify a class in the classification scheme, allocating higher weights to these words in the representation scheme can improve the classification performance. When using Bloom’s Taxonomy to classify questions, the Enhanced TF-IDF (E-TFIDF) gives verbs more weight, followed by nouns and adjectives [18]. The TFPOS-IDF uses the same concept and further distinguishes the verb types to provide different weights. The TFPOS-IDF performs better than the E-TFIDF and TF-IDF when using the support vector machine [4]. Gani et al. [12] proposed the Enhanced TFPOS-IDF (ETFPOS-IDF), which distinguishes verbs into Bloom’s Taxonomy and supporting verbs and gives them different weights. Compared to TF-IDF, the ETFPOS-IDF can improve the accuracy and F1-Measure by 5.2% and 5.7%, respectively.

Machine learning models

Besides the dataset and the feature representation scheme, the machine learning model affects the classification performance. The support vector machine (SVM) is the most used model in the literature [3]. In most cases, SVM outperforms other models, including the k-Nearest Neighbor (k-NN) [17, 18, 23], Naive Bayes (NB) [17, 18], Linear Regression [13], and Random Forest [12]. However, as an exception, Abduljabbar and Omar [20] pointed out that k-NN performs better than SVM and NB.

We also found advanced models used in the literature. Osadi et al. [24] proposed an ensemble learning by combining rule-based, SVM, k-NN, and NB models and aggregating these results with the majority voting method to classify questions to cognitive levels in Bloom’s Taxonomy. If the dataset is large enough, at least 500 questions, one may consider deep learning models for classifying questions, such as BERT [11], LSTM [10], and CNN [5], because the deep learning models perform better than machine learning models in large datasets. Table 2 summarizes the datasets and models used in the literature to develop question classifiers; Table 3 summarizes the performance of the winning models in different datasets.

Table 2. The datasets and the machine learning models used in the literature to develop question classifiers.

https://doi.org/10.1371/journal.pone.0309050.t002

Table 3. The performance of the winning machine learning models in different datasets in the literature.

https://doi.org/10.1371/journal.pone.0309050.t003

In summary, the design of the question classifier depends on the knowledge domain and the classification scheme. Previous studies on question classifiers in learning assessment mainly focus on classifying questions with pure text content. Most studies use TF-IDF to extract features from the questions. Other studies propose TF-IDF variants that give different weights to words according to the characteristics of the classification scheme to improve performance. SVM is one of the machine learning models that performs well in most studies. Deep learning models perform better than machine learning models on large datasets. However, classifiers for questions containing code snippets have received little attention. This study fills the literature gap by extending the question classifier’s design to questions mixing text and code snippets.

Methodology

The study adopts the experimental design method to explore the effects of the feature representation schemes and machine learning models on the performance of the question classifiers for questions containing code snippets.

The study has the following assumptions and criteria to classify questions in the dataset:

  • The topics in the question classification scheme have a basic-to-advanced interdependency: the advanced topics cover the knowledge scope of the basic ones.
  • If a question covers multiple topics, we label it with the most advanced topic.
  • This study employs the Oracle SQL Expert exam topics as the classification scheme [27]. The exam topics are organized in a two-level hierarchy. Each topic can be further divided into multiple sub-topics. However, due to the limited number of questions in the dataset, only the first-level topics are used in the study. The topics and the number of questions in each topic are shown in Table 4. In the table, higher topic numbers indicate more advanced topics.
Table 4. The first-level topics in the Oracle SQL Expert exam and the number of questions in each topic in the dataset.

https://doi.org/10.1371/journal.pone.0309050.t004

Dataset

The dataset contains 171 questions written in English and formatted with Markdown syntax and symbols. The question content may contain general text descriptions, markdown symbols, and SQL code snippets. The code snippets may appear in the question stem or the answer options.

The text in the SQL code snippets can be further divided into several types:

  • SQL reserved words, e.g., CREATE TABLE, SELECT, INSERT, UPDATE, etc.
  • Literals, e.g., numbers, strings, date strings, etc.
  • Data type keywords
  • Operators
  • Pseudo-columns, e.g., NEXTVAL
  • Schema object names, e.g., names for the table, column, function, sequence, etc.

Questions in the dataset were annotated with topic identifications according to the assumptions and criteria mentioned above. Table 4 summarizes the categorization. The topic "Using Subqueries to Solve Queries" contains the most questions, 27 items accounting for 15%. The two topics "Managing Views" and "Relational Database Concepts" contain the fewest questions; each has four items, accounting for 2%. The dataset is highly unbalanced: the topic with the most questions contains nearly seven times as many as the least.

Data preprocessing

The data are pre-processed before extracting the features according to the following steps:

  1. Lowercase conversion and removal of special symbols: Convert all text to lowercase. Remove Markdown syntax, such as the image syntax “![]()” and table rules such as “---+---”. However, SQL operators, such as +, -, *, /, %, _, etc., are retained.
  2. Lemmatization: The study used the spaCy en_core_web_sm 3.6.0 package [28] for lemmatization.
  3. Revise the incorrect lemmatization results: For example, the column name “ord_no” is lemmatized to “ord _ no,” which loses the original meaning.
  4. Stop word removal: Remove the stop words from the content, but do not include the SQL keywords. This study used the NLTK package [29] to handle the stop words.

The study does not perform stemming in the pre-processing because stemming makes words lose their original meanings and degrades the classification performance. For example, the SELECT keyword in a SQL statement and the words "selecting" and "selection" in general text would all be converted to "select" after stemming, losing the contextual distinction.
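The pre-processing steps above can be sketched as follows. This is a simplified illustration: the spaCy lemmatization step is omitted, and the stop-word and SQL-keyword sets below are small stand-ins for the NLTK stop-word list and the full SQL keyword set used in the study.

```python
import re

# Stand-in lists for illustration only; the study uses NLTK's stop words
# minus the SQL keywords.
STOP_WORDS = {"the", "a", "an", "of", "to", "is", "are", "and"}
SQL_KEYWORDS = {"select", "from", "where", "insert", "update", "with"}

def preprocess(text: str) -> list[str]:
    """Steps 1 and 4: lowercase, strip Markdown image/table syntax while
    keeping SQL operators, then drop stop words that are not SQL keywords."""
    text = text.lower()
    text = re.sub(r"!\[[^\]]*\]\([^)]*\)", " ", text)  # Markdown image syntax
    text = re.sub(r"[-+]{3,}", " ", text)              # Markdown table rules like ---+---
    # Keep identifiers such as ord_no whole, and keep single SQL operators.
    tokens = re.findall(r"[a-z0-9_]+|[+\-*/%<>=]", text)
    return [t for t in tokens if t in SQL_KEYWORDS or t not in STOP_WORDS]
```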

Feature representation

The study considers three feature representation schemes commonly used in the literature: Term Frequency-Inverse Document Frequency (TF-IDF), Word2Vec, and FastText.

The TF-IDF value indicates the importance of a word in a document. When a word appears frequently in a single document but rarely elsewhere, the word is an essential feature of that document. The TF-IDF value of word i in question j is f_ij = tf_ij × idf_i, where tf_ij is the frequency of word i in question j, and idf_i measures the scarcity of word i across all documents. The rarer the word, the larger the value of idf_i.

The process of using TF-IDF to vectorize a question involves two steps. In the fitting step, the TF-IDF model is trained on the training dataset to build the vocabulary and calculate idf_i for each word in the vocabulary. Then, in the transform step, the learned vocabulary and the idf_i values are used to encode a new question as a vector: the frequency of each vocabulary word in the new question is multiplied by that word’s idf_i value. Specifically, let V be the vocabulary. Then, the vector of a new question j′ is:

f_j′ = (tf_ij′ × idf_i), for i ∈ V (1)

where tf_ij′ is the frequency of word i in the new question j′.
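The fit/transform steps above can be sketched in pure Python. The smoothed idf formula below follows scikit-learn's variant, log((1 + n) / (1 + df)) + 1, which is an assumption; the paper does not state its exact idf form.

```python
import math
from collections import Counter

def fit_idf(train_questions: list[list[str]]) -> dict[str, float]:
    """Fitting step: build the vocabulary and compute idf_i for each word
    from its document frequency in the training questions."""
    n = len(train_questions)
    df = Counter()
    for q in train_questions:
        df.update(set(q))  # count each word once per question
    return {w: math.log((1 + n) / (1 + df[w])) + 1 for w in df}

def transform(question: list[str], idf: dict[str, float]) -> dict[str, float]:
    """Transform step: encode a new question as f_ij = tf_ij * idf_i over
    the learned vocabulary; words outside the vocabulary are ignored."""
    tf = Counter(question)
    return {w: tf[w] * idf[w] for w in tf if w in idf}
```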

The second considered scheme, Word2Vec, is a word embedding model in natural language processing. Word2Vec represents a word with a fixed-dimension vector; words whose vectors are close to each other have similar meanings. Compared with the TF-IDF scheme, Word2Vec can capture the word’s semantic meaning. The study used Google’s 300-dimensional pre-trained word embeddings [30].

The last scheme, FastText, is also a word embedding model in natural language processing [31]. Instead of whole-word vectors, FastText generates embeddings of the character n-grams that compose the word. When encountering a word unseen in the training dataset, FastText can split the word into character n-grams and combine their embeddings to generate the word embedding, which overcomes the Out-of-Vocabulary (OOV) problem.
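As an illustration of FastText's sub-word idea, the character n-grams of a word can be extracted as follows. The n range here is an arbitrary choice for brevity; FastText's defaults use n from 3 to 6.

```python
def char_ngrams(word: str, n_min: int = 3, n_max: int = 4) -> list[str]:
    """Character n-grams with FastText-style boundary markers '<' and '>'.
    An OOV word can still be represented by combining the embeddings of
    these sub-word units."""
    marked = f"<{word}>"
    return [marked[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(marked) - n + 1)]
```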

We generate a question embedding by averaging the word embeddings of the words in the question. Let e_i = (e_i1, …, e_im) denote the embedding vector for word i, where e_ik is the k-th element of the embedding vector, and m is the embedding dimension. Additionally, let V_j be the set of words in question j. Then, the embedding vector for question j is:

e_j = (1/|V_j|) Σ_{i ∈ V_j} e_i (2)
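Eq (2), averaging word embeddings into a question embedding, can be sketched as:

```python
def question_embedding(words: list[str],
                       emb: dict[str, list[float]]) -> list[float]:
    """Eq (2): average the m-dimensional word embeddings of the words in a
    question; words without an embedding are skipped."""
    m = len(next(iter(emb.values())))
    vecs = [emb[w] for w in words if w in emb]
    if not vecs:
        return [0.0] * m
    return [sum(v[k] for v in vecs) / len(vecs) for k in range(m)]
```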

Machine learning models

The study considers four commonly used models in the literature: Multinomial Naive Bayes (MNB), Logistic Regression (LOGREG), Linear Support Vector Machine (LSVM), and Support Vector Machine (SVM). The following describes the characteristics and parameters of each method.

The MNB model has a high learning bias and is suitable for small amounts of data; it is commonly used as a baseline for comparison in the literature [32]. MNB’s parameter α ≥ 0, a pseudo-count value, smooths the likelihood of a word in a class to avoid the zero-probability problem: it gives words absent from the training samples a non-zero probability, preventing zero probabilities in subsequent computations. A larger α value produces a stronger smoothing effect and a simpler model.

LOGREG performs well in high-dimensional, sparse data. The parameter C controls the regularization strength when using L2 regularization in the LOGREG. A larger C value results in a smaller regularization strength and a more complex model [33].

LSVM is the most used method in the literature and performs very well [3]. Like the parameter in LOGREG, the parameter C controls the regularization strength when using L2 regularization in the LSVM.

Besides the linear classifiers LOGREG and LSVM, the study also considers the non-linear classifier SVM with the Radial Basis Function (RBF) kernel. The RBF kernel has two parameters: C and γ. As in the previous two classifiers, the parameter C controls the regularization strength and thus the model complexity. The parameter γ controls the reach of each training observation’s influence: the larger the γ value, the smaller the influence of distant observations, which yields a more complex, less smooth decision boundary [34].
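The effect of γ can be illustrated with a direct computation of the RBF kernel value. This is a sketch for intuition, not the study's implementation (the study uses Scikit-Learn's SVM).

```python
import math

def rbf_kernel(x: list[float], y: list[float], gamma: float) -> float:
    """RBF kernel K(x, y) = exp(-gamma * ||x - y||^2). A larger gamma shrinks
    each training point's region of influence, so distant observations
    contribute less to the decision function."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)
```

For two points at distance 2, gamma = 10 gives a near-zero similarity while gamma = 0.1 still treats them as related, which is why large gamma produces a wigglier decision boundary.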

Parameter optimization

This study employs stratified K-fold cross-validation to evaluate the generalization ability of the models in the learning phase. The stratified split of the training dataset ensures that each fold contains learning instances in all classes, which is suitable for the unbalanced dataset.

In our dataset, the 01 and 13 topics contain the fewest questions, with only four questions each. The stratified split divides the dataset into the training and test datasets. In the training dataset, the 01 and 13 topics contain only three questions each. Therefore, the study uses three folds in the cross-validation to identify the model parameters with the best generalization ability.

This study uses grid search to identify model parameters that yield good generalization ability. The grid search performs cross-validation for each parameter combination and selects the best one. The parameter search points for each model are set as follows:

  • MNB parameter α: logarithmic scale interval [10^-2, 10^4], divided into 20 equal parts.
  • C parameter in LOGREG, LSVM, and SVM: logarithmic scale interval [10^-2, 10^10], divided into 20 equal parts.
  • γ parameter in SVM: logarithmic scale interval [10^-9, 10^3], divided into 13 equal parts.
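The search points above can be generated as equally spaced points on a log10 scale. This sketch interprets "20 equal parts" as the interval split into 20 segments, giving 21 grid points; the authors' exact point count is an assumption.

```python
def log_grid(exp_lo: float, exp_hi: float, parts: int) -> list[float]:
    """Equally spaced points on a log10 scale, e.g. alpha over [1e-2, 1e4]
    split into 20 parts (21 grid points)."""
    step = (exp_hi - exp_lo) / parts
    return [10 ** (exp_lo + i * step) for i in range(parts + 1)]
```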

When the grid search algorithm searches for the best parameters, the study uses the One-versus-Rest (OvR) strategy to decompose the multi-class problem into multiple binary classification problems. The study uses the Area Under the Curve (AUC) of the Receiver Operating Characteristic (ROC) to guide the search, as AUC is a simple and robust metric for comparing the effectiveness of classification models.
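For intuition, the AUC used to guide the search can be computed directly as the probability that a randomly chosen positive instance is scored above a randomly chosen negative one (ties count one half). This is a pure-Python sketch, not the study's implementation.

```python
def auc_score(scores: list[float], labels: list[int]) -> float:
    """Rank-based AUC: fraction of (positive, negative) pairs where the
    positive instance receives the higher score. Under OvR, labels are 1
    for the target class and 0 for the rest."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```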

Evaluation metrics

The study measures classifier performance with the following metrics: weighted precision, weighted F1-score, and weighted AUC, because of the unbalanced question dataset in the study. The weighted metrics account for class imbalance by weighting each class’s metric by its proportion. These metrics are defined as follows.

The recall rate measures the ability of the classifier to detect positive instances among all actual positive instances (the true positive and false negative cases). Let TP, FP, TN, and FN be the numbers of true positive, false positive, true negative, and false negative instances, respectively. Under the One-vs-Rest (OvR) multiclass strategy, the recall rate for class c is:

Recall_c = TP_c / (TP_c + FN_c) (3)

Let w_c be the proportion of class c in the dataset. Then, the weighted macro-average recall rate over all classes is:

wRecall = Σ_{c=1}^{L} w_c × Recall_c (4)

where L is the total number of classes in the dataset.

The precision rate measures the ability of the classifier to identify positive instances among all predicted positive instances (the TP and FP cases). The precision rate of class c is:

Precision_c = TP_c / (TP_c + FP_c) (5)

Then, the weighted macro-average precision rate over all classes is:

wP = Σ_{c=1}^{L} w_c × Precision_c (6)

The F1 score is the harmonic mean of the precision and recall rates, balancing the two; when the precision and recall rates are unequal, the F1 score decreases. The F1 score of class c is:

F1_c = 2 × Precision_c × Recall_c / (Precision_c + Recall_c) (7)

Then, the weighted macro-average F1 score over all classes is:

wF1 = Σ_{c=1}^{L} w_c × F1_c (8)

The ROC curve is not affected by an imbalanced dataset and is suitable for evaluating classifiers. Plotting an ROC curve uses the false positive rate (FPR) as the X axis and the true positive rate (TPR) as the Y axis. The TPR is the same as the recall rate. The area under the ROC curve (AUC) quantifies the classifier performance: a large AUC value indicates the TPR exceeds the FPR, meaning the classifier performs better. A random-guess classifier has an AUC of 0.5, and a perfect classifier achieves 1.

The study employs the weighted macro-average FPR and TPR to calculate the AUC value of the model. The FPR of class c is:

FPR_c = FP_c / (FP_c + TN_c) (9)

Then, the weighted macro-average FPR over all classes is:

wFPR = Σ_{c=1}^{L} w_c × FPR_c (10)

Since the TPR is the same as the recall rate, the weighted macro-average TPR over all classes is given by Eq (4).
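Eqs (3) through (10) can be combined into one helper that computes the weighted macro-averages from per-class OvR counts. This is a sketch; the (TP, FP, TN, FN) tuples and class weights in the example are illustrative.

```python
def weighted_metrics(counts: dict[str, tuple[int, int, int, int]],
                     weights: dict[str, float]) -> dict[str, float]:
    """Per-class recall, precision, F1, and FPR from (TP, FP, TN, FN)
    counts under OvR, combined with the class proportions w_c."""
    out = {"recall": 0.0, "precision": 0.0, "f1": 0.0, "fpr": 0.0}
    for c, (tp, fp, tn, fn) in counts.items():
        r = tp / (tp + fn) if tp + fn else 0.0      # Eq (3)
        p = tp / (tp + fp) if tp + fp else 0.0      # Eq (5)
        f1 = 2 * p * r / (p + r) if p + r else 0.0  # Eq (7)
        fpr = fp / (fp + tn) if fp + tn else 0.0    # Eq (9)
        w = weights[c]
        out["recall"] += w * r
        out["precision"] += w * p
        out["f1"] += w * f1
        out["fpr"] += w * fpr
    return out
```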

The accuracy rate is biased in an unbalanced dataset because it is dominated by the majority class [35]. Since the question dataset in the study is unbalanced, the study does not employ the accuracy rate for evaluation. Additionally, the study does not adopt the weighted macro-average recall rate since it equals the accuracy rate.

Experiment design and data analysis procedure

Factors and responses.

The study employs factorial experiments to examine the effects of feature representation schemes and machine learning models on the classifier’s performance. The experiment considers two factors and three responses, as shown in Table 5. The first factor is the feature representation scheme, which has three levels: TF-IDF, Word2Vec, and FastText.

The second factor is the machine learning model, which has four levels: Logistic Regression, Multinomial Naive Bayes, Linear Support Vector Machine, and Support Vector Machine with the Radial Basis Function kernel.

The response variables include weighted macro-average AUC, weighted macro-average precision, and weighted macro-average F1 score.

The two factors result in 12 groups. The experiment runs 300 replicate trials for each group, resulting in 3600 trials in total.

Power analysis.

This study conducts a power analysis with the G*Power software [36] to ensure the experiment has adequate statistical power. The power of the experiment is 1.000 under the following conditions: type I error α = 0.05, factor degrees of freedom (3−1)×(4−1) = 6, sample size 3600, and effect size η² = 0.025.

Program implementation and data analysis tools.

The study implements the classifiers and the grid search with the Scikit-Learn library [37]. R language and related packages are used for data analysis and plotting.

Analysis procedure and methods.

The data analysis procedure is as follows:

  1. Descriptive statistics and data distribution analysis: Analyze the mean, standard deviation, skewness, and kurtosis of the three responses. The normal distribution and homogeneity of variance across groups are tested.
  2. Analysis of Variance: The parametric ANOVA is performed if the response variable meets the normal distribution and homogeneity of variance assumptions. Otherwise, the non-parametric Aligned Ranks Transformation ANOVA (ART ANOVA) is used. ART ANOVA does not require the data to meet the above two assumptions and can analyze both main and interaction effects [38, 39], which makes it more suitable than the Scheirer-Ray-Hare Test [40].
  3. Post-hoc Analysis: If the response variable meets the assumptions of the normal distribution and homogeneity of variance, the Tukey HSD post-hoc analysis is performed. Otherwise, the Aligned Ranks Transformation Contrast is used to find the differences and effect sizes between the levels of each factor.
  4. Effect Size measurements in ANOVA and t-test: η2 is employed to measure the effect size of a significant effect in ANOVA. The thresholds for small, medium, and large effect sizes are 0.01, 0.06, and 0.14, respectively. Cohen’s d measures the effect size of the t-test. The thresholds for small, medium, and large effect sizes are 0.2, 0.5, and 0.8, respectively.

For conciseness, the nomenclature in the S1 Appendix is used to express the statistical results.

Experiment results and discussion

Experiment results

Table 6 summarizes the experiment results for the three performance metrics. The factors FRS, MLM, and their interaction significantly affect the weighted macro-average AUC with large effects. The classifier using the TF-IDF scheme with the LOGREG model achieves the best performance, with a mean of 93.812%. Next, for the weighted macro-average precision, the FRS factor and the FRS-MLM interaction have significant, large effects, while the MLM factor has a medium effect. The classifier with the TF-IDF scheme and the SVM model performs best, with a mean of 72.574% on the weighted macro-average precision. Lastly, the factors FRS, MLM, and their interaction significantly affect the weighted macro-average F1-score with large effect sizes. The classifier using the TF-IDF scheme and the LOGREG model achieves the best mean, averaging 81.574% on the weighted macro-average F1-score. The following subsections present the details.

Table 6. Summary of the experiment results for the three performance metrics: weighted macro-average AUC (wAUC), weighted macro-average precision (wP), and weighted macro-average F1-score (wF1).

https://doi.org/10.1371/journal.pone.0309050.t006

Weighted macro-average AUC (wAUC).

The mean, median, mode, standard deviation, skewness, and kurtosis of all wAUC values are 86.581%, 87.092%, 86.093%, 6.31, -0.699, and 0.273, respectively. The wAUC values are not normally distributed (Anderson-Darling Stat = 27.489, p < 0.001) and do not have homogeneity of variance (Levene’s Test F = 33.844, p < 2.2e-16). Fig 1 shows the wAUC distributions of the twelve groups. The groups at the TF-IDF level have smaller variances than the others.

Fig 1. The distributions of the weighted macro-average AUC values for groups of various combinations of feature representation schemes and machine learning models.

https://doi.org/10.1371/journal.pone.0309050.g001

Aligned Ranks Transformation ANOVA (ART ANOVA) was used to analyze the variances since the wAUC values do not meet the normality and homogeneity of variance assumptions.

The FRS factor (F = 3423.55, p < 2.22e-16, DF = 2) and the MLM factor (F = 757.71, p < 2.22e-16, DF = 3) were significant; so was the interaction (F = 343.70, p < 2.22e-16, DF = 6). The FRS factor generated the largest effect size (η² = 0.656), followed by the MLM factor (η² = 0.388) and the interaction (η² = 0.365).

The means for the factors and their interaction are shown in Fig 2. In the FRS factor, the TF-IDF level performed the best (mT = 92.924), and the worst was the FastText level (mF = 82.934). The post-hoc analysis indicated that the TF-IDF produced a significantly larger effect size than the other two levels (tTW = 67.404, p < 0.0001, dTW = 3.05; tTF = 75.269, p < 0.0001, dTF = 2.09).

Fig 2. The mean analysis of the weighted macro-average AUC values for FRS, MLM factors, and their interactions.

https://doi.org/10.1371/journal.pone.0309050.g002

For the MLM factor, LOGREG had the best mean (mL = 88.900), and MNB had the worst (mM = 82.671), as shown in Fig 2. The post-hoc analysis indicated that the LOGREG produced a significantly large effect size compared to the MNB (tLM = 44.320, p < 0.0001, dLM = 0.909). Although the mean of the LOGREG was greater than that of the LSVM, the effect size was significantly small (tLLS = 17.439, p < 0.0001, dLLS = 0.459). Likewise, the effect size between the LOGREG and SVM was significantly small (tLS = 6.944,p < 0.0001, dLS = 0.203).

For the interaction of FRS and MLM factors, the group with the best mean was (TF-IDF, LOGREG) with m(T,L) = 93.812, followed by (TF-IDF, MNB) with m(T,M) = 93.056, as shown in Fig 2. However, no significant difference existed between the two groups (t(T,L)−(T,M) = 3.597, p = 0.0170). The mean of the group (TF-IDF, LSVM) (m(T,LS) = 92.3) was close to that of the group (TF-IDF, SVM) (m(T,S) = 92.5), and no significant difference existed between the two groups (t(T,LS)−(T,S) = -1.062, p = 0.9961). The group with the worst mean was (FastText, MNB) with m(F,M) = 73.706. The effect size between the best and worst groups was significantly large (t(T,L)−(F,M) = 72.962, p < 0.0001, d(T,L)−(F,M) = 5.07).

Since the variance of the wAUC values was not homogeneous across groups, the post-hoc analysis examined the effect sizes at the four quantiles, as shown in Fig 3. In Fig 3, the best group (TF-IDF, LOGREG) is set as the reference group, and the three horizontal dashed lines denote the thresholds for small, medium, and large effect sizes. The reference group has the best performance at all quantiles; the differences between it and the other groups range from small to large effect sizes.

Fig 3. The effect size analysis on the weighted macro-average AUC values in the four quantiles for the interactions between the FRS and MLM factors.

https://doi.org/10.1371/journal.pone.0309050.g003

Weighted macro-average precision (wP).

The mean, median, mode, standard deviation, skewness, and kurtosis of all wP values are 66.284%, 67.232%, 70.172%, 9.553, -0.470, and -0.035, respectively. The wP values are not normally distributed (Anderson-Darling Stat = 13.4768, p < 0.001) and do not have homogeneous variance across groups (Levene’s Test F = 38.273, p < 2.2e-16). Fig 4 shows the wP distributions in all factor-level combinations. The data in the group (Word2Vec, MNB) are bimodally distributed. The groups with the TF-IDF scheme exhibit less variance than the others.
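The normality and homogeneity-of-variance checks reported here correspond to standard SciPy routines. The sketch below uses synthetic stand-in data, not the paper's measurements, purely to illustrate how such statistics are obtained:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical wP-like samples for three groups (illustrative only).
groups = [rng.normal(loc, 5, 100) for loc in (62, 67, 72)]
pooled = np.concatenate(groups)

ad = stats.anderson(pooled, dist='norm')   # Anderson-Darling normality check
lev_stat, lev_p = stats.levene(*groups)    # Levene's homogeneity-of-variance test
print(f"Anderson-Darling stat = {ad.statistic:.3f}")
print(f"Levene F = {lev_stat:.3f}, p = {lev_p:.4f}")
```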

Fig 4. The distributions of the weighted macro-average precision values for groups of various combinations of feature representation schemes and machine learning models.

https://doi.org/10.1371/journal.pone.0309050.g004

This study uses the non-parametric ART ANOVA to analyze the variances of the wP values. The FRS factor produced a significantly large effect size (F = 774.47, p < 2.22e-16, DF = 2, ηp² = 0.302), as did the interaction of the FRS and MLM factors (F = 110.70, p < 2.22e-16, DF = 6, ηp² = 0.156). The MLM factor, in contrast, produced a significantly medium effect size (F = 191.16, p < 2.22e-16, DF = 3, ηp² = 0.138).

Fig 5 shows the mean analysis of the wP values for the FRS, MLM factors, and their interaction. In the FRS factor, the TF-IDF level contributed the best mean (mT = 72.378), and the worst was the FastText level (mF = 61.895). The post-hoc analysis indicated that the TF-IDF produced a significantly larger effect size than the FastText (tTF = 37.769, p < 0.0001, dTF = 1.33).

Fig 5. The mean analysis of the weighted macro-average precision values for FRS, MLM factors, and their interactions.

https://doi.org/10.1371/journal.pone.0309050.g005

For the MLM factor, the LSVM level contributed the best mean (mLS = 70.632), and the worst was the MNB level (mM = 62.562), as shown in Fig 5. The post-hoc analysis indicated that their difference was significantly medium (tLSM = 23.672, p < 0.0001, dLSM = 0.795). In contrast, the means of the LOGREG and SVM levels were close (mL = 66.184, mS = 76.76), and no significant difference existed between the two levels (tLS = 1.565, p = 0.3988).

As for the interactions, the groups with the TF-IDF scheme but different machine learning models performed quite closely. There were no significant differences between the groups. The means for the various machine learning models from high to low were SVM (m(T,S) = 72.574), MNB (m(T,M) = 72.515), LOGREG (m(T,L) = 72.484), and LSVM (m(T,LS) = 71.939), given the TF-IDF scheme.

The worst group was (FastText, MNB) with m(F,M) = 51.584. Compared to the worst group, the best group (TF-IDF, SVM) generated a significantly large effect size (t(T,S)−(F,M) = 31.713, p < 0.0001, d(T,S)−(F,M) = 3.11).

As the variance of the wP values was not homogeneous across groups, the post-hoc analysis examined the effect sizes within each quantile, as shown in Fig 6. The reference group is (TF-IDF, SVM), which has the best mean of the wP values among all groups. The reference group performed the best in the first two quantiles. However, in the third and fourth quantiles, the best groups became (TF-IDF, LOGREG) and (Word2Vec, MNB), respectively. In the fourth quantile, the mean of the group (Word2Vec, MNB) was greater than that of the reference group, and the effect size was significantly small (d = 0.264).

Fig 6. The effect size analysis on the weighted macro-average precision values in the four quantiles for the interactions between the FRS and MLM factors.

https://doi.org/10.1371/journal.pone.0309050.g006

Weighted macro-average F1 (wF1).

The mean, median, mode, standard deviation, skewness, and kurtosis of all wF1 values are 50.829%, 53.342%, 58.216%, 12.796, -0.672, and 0.141, respectively. The wF1 values are not normally distributed (Anderson-Darling Stat = 37.477, p < 0.001) and do not have homogeneous variance across groups (Levene’s Test F = 29.637, p < 2.2e-16). Fig 7 shows the wF1 distributions in all factor-level combinations. A bimodal shape occurred in the group (Word2Vec, MNB). Like the results for the wAUC and wP, the groups with the TF-IDF scheme exhibited less variance than the others.
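The weighted macro-average indices used throughout correspond to scikit-learn's "weighted" averaging, in which per-class scores are weighted by class support. A minimal illustration with made-up labels (not the study's data):

```python
from sklearn.metrics import f1_score, precision_score

# Toy three-class ground truth and predictions (illustrative only).
y_true = [0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 2, 0]

# 'weighted' averaging: per-class precision/F1 weighted by class support,
# matching the paper's wP and wF1 definitions.
wp  = precision_score(y_true, y_pred, average="weighted", zero_division=0)
wf1 = f1_score(y_true, y_pred, average="weighted")
print(round(wp, 3), round(wf1, 3))
```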

Fig 7. The distributions of the weighted macro-average F1 values for groups of various combinations of feature representation schemes and machine learning models.

https://doi.org/10.1371/journal.pone.0309050.g007

This study uses non-parametric ART ANOVA to analyze the variances since the distribution of wF1 values did not meet the assumptions for the parametric ANOVA. The FRS factor (F = 2329.63, p < 2.22e-16, DF = 2), the MLM factor (F = 1320.74, p < 2.22e-16, DF = 3), and their interaction (F = 217.99, p < 2.22e-16, DF = 6) were all statistically significant. The effect sizes of the FRS, MLM, and their interaction were large (ηp² = 0.565, 0.525, and 0.267, respectively).

The mean analysis of the factors and their interactions is shown in Fig 8. In the FRS factor, the TF-IDF level contributed the best mean (mT = 61.468), followed by the Word2Vec level (mW = 45.623), and the worst was the FastText level (mF = 45.396). As indicated by the post-hoc analysis, the TF-IDF generated a significantly larger effect size than the FastText (tTF = 59.649, p < 0.0001, dTF = 1.72). Nevertheless, no significant difference existed between the Word2Vec and FastText levels.

Fig 8. The mean analysis of the weighted macro-average F1 values for FRS, MLM factors, and their interactions.

https://doi.org/10.1371/journal.pone.0309050.g008

As for the MLM factor, the best and worst levels were LOGREG (mL = 57.929) and MNB (mM = 41.938), respectively. Their difference in effect size was significantly large (tLM = 52.077, p < 0.0001, dLM = 1.330).

Fig 8 also shows the mean analysis of the interactions between the FRS and MLM factors. The best group was (TF-IDF, LOGREG) with m(T,L) = 63.545, and the worst was (FastText, MNB) with m(F,M) = 32.71. Additionally, their difference in effect size was significantly large (t(T,L)−(F,M) = 54.802, p < 0.0001, d(T,L)−(F,M) = 4.120).

Due to the uneven variance of the wF1 values across groups, the study examined the effect sizes within the four quantiles, as shown in Fig 9. The figure uses the group (TF-IDF, LOGREG), which has the best mean, as the reference for comparing effect sizes. Unlike the wP case, the best group remained the same in all quantiles. Note that the effect sizes of all groups were almost flat from the first to the fourth quantiles, except for the group (Word2Vec, MNB), whose effect size increased after the second quantile.

Fig 9. The effect size analysis on the weighted macro-average F1 values in the four quantiles for the interactions between the FRS and MLM factors.

https://doi.org/10.1371/journal.pone.0309050.g009

Discussion

According to the experiment results, the feature representation scheme impacts the classifier’s performance the most, followed by the machine learning model and the interaction between the two. The feature representation scheme converts the document/question features into numerical representations for machine learning models. Poor feature representations reduce the classifier’s performance. Many studies have focused on finding good feature representations to improve the classifier’s performance, such as the feature representation for question classification [12, 13, 18] or document classification [41, 42]. In the experiments, the feature representation scheme factor had the largest effect size, consistent with the literature.

A good feature representation scheme improves the classifier’s performance and its consistency across different machine learning models. As shown in the mean analysis of the interactions in Figs 2, 5, and 8, the classifiers with the TF-IDF scheme had better performance and smaller performance variances across different machine learning models than those with the Word2Vec and FastText schemes. This implies that the quality of the feature representation scheme should be prioritized when designing a question classifier.
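As an illustration of pairing one feature representation scheme with several machine learning models, the following scikit-learn sketch builds TF-IDF pipelines for the four model families studied. The SQL-style questions and topic labels are toy stand-ins, not the paper's 171-question dataset:

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.naive_bayes import MultinomialNB

# Toy SQL-style questions with hypothetical topic labels.
X = ["SELECT ename FROM emp WHERE sal > 1000",
     "CREATE TABLE dept (deptno NUMBER PRIMARY KEY)",
     "SELECT deptno, COUNT(*) FROM emp GROUP BY deptno",
     "ALTER TABLE emp ADD hiredate DATE",
     "SELECT * FROM emp JOIN dept USING (deptno)",
     "DROP TABLE dept CASCADE CONSTRAINTS"]
y = ["query", "ddl", "query", "ddl", "query", "ddl"]

models = {"LOGREG": LogisticRegression(max_iter=1000),
          "SVM": SVC(),
          "LSVM": LinearSVC(),
          "MNB": MultinomialNB()}

preds = {}
for name, model in models.items():
    # Same feature representation scheme, different learning model.
    clf = make_pipeline(TfidfVectorizer(), model).fit(X, y)
    preds[name] = clf.predict(["SELECT job FROM emp WHERE deptno = 10"])[0]
    print(name, preds[name])
```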

When comparing the feature representation schemes, the experiment results indicated that the classifiers with the TF-IDF scheme outperformed those with the Word2Vec and FastText schemes in all performance metrics. Martinčić-Ipšić et al. [43] found that the TF-IDF method is not inferior to the word embedding method. Dessí et al. [44] and Khanna et al. [45] also reported that the TF-IDF method outperformed the word embedding scheme in document classification, especially for short documents [46].

There might be two reasons for the inferior performance of the word embedding scheme in question classification. First, although the word embedding scheme captures the semantics of words, it does not consider the importance of words in the feature representation [47]. Second, the pre-trained word embedding schemes used in this study might not have learned the vocabulary of the SQL syntax or the table and column names. Table and column names in SQL statements are often abbreviations or combinations of multiple words. These abbreviations and compound names can cause out-of-vocabulary problems in the pre-trained word embedding schemes. As a result, these schemes might lose some domain knowledge and harm the feature representation quality.
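The out-of-vocabulary problem can be illustrated with a toy membership check. The vocabulary and tokens below are hypothetical; real pre-trained models carry far larger word lists, yet still miss domain identifiers such as abbreviated table names:

```python
# Toy stand-in for a pre-trained embedding's vocabulary (hypothetical).
pretrained_vocab = {"select", "from", "where", "table", "employee", "salary"}

# Tokens from a hypothetical SQL exam question.
question_tokens = ["select", "empno", "ename", "from", "emp_v2", "where", "sal_grade"]

# Abbreviated/compound identifiers receive no embedding vector.
oov = [t for t in question_tokens if t not in pretrained_vocab]
print("out-of-vocabulary tokens:", oov)
```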

Nevertheless, the word embedding scheme sometimes outperforms the TF-IDF scheme. Arora et al. [48] reported that the word embedding scheme outperformed the TF-IDF scheme in classifying the instances in the TagMyNews dataset. They trained their own word embeddings instead of using pre-trained ones. This implies that if word embeddings can effectively represent the semantic information in the knowledge domain, word embedding would be more effective than TF-IDF.

As for the machine learning models, no single model performed the best in all performance metrics in the experiment. The logistic regression model outperformed the others in the weighted AUC and weighted F1 metrics; the linear support vector machine outperformed the others in the weighted precision metric, as indicated by the experiment results.

Many studies in question classification reported that the SVM model outperformed the others [12, 13, 15, 18, 23]. These studies mainly used the TF-IDF scheme. In the study’s experiment, when using the TF-IDF scheme, the logistic regression model outperformed the others in the weighted AUC and weighted F1 metrics, and so did the SVM model in the weighted precision metric. Our experiment results are partially consistent with the literature.

Does the SVM model outperform the logistic regression model in all cases? No consensus exists in the literature. Pranckevičius and Marcinkevičius [49] reported that the logistic regression model outperformed the SVM model in classifying text reviews using the TF-IDF scheme. Musa [50] reported that the logistic regression model outperformed the SVM model in predicting the relevance of heart disease.

Theoretically, the loss functions of the SVM and logistic regression models behave similarly. Hence, they should have similar performance. If the separation between class instances is clear, the SVM model would outperform the logistic regression model. On the other hand, if the class instances overlap, the logistic regression model would outperform the SVM model [34]. Empirically, Salazar et al. [51] reported that the data distribution impacts the performance of the two models. If the data distribution is univariate, the logistic regression model outperforms the SVM model. Otherwise, the SVM model outperforms the logistic regression model. The data distributions in all groups in the experiment were univariate. Hence, it is reasonable that the logistic regression model sometimes outperformed the SVM model.
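The similarity of the two loss functions can be seen numerically: the hinge loss max(0, 1 − m) of the SVM and the logistic loss log(1 + e^(−m)) of logistic regression track each other closely over the signed margin m = y·f(x). A small NumPy sketch:

```python
import numpy as np

margins = np.linspace(-2, 3, 11)          # signed margin m = y * f(x)
hinge = np.maximum(0.0, 1.0 - margins)    # SVM (hinge) loss
logistic = np.log1p(np.exp(-margins))     # logistic-regression (log) loss

# Both penalize misclassified points (m < 0) heavily and nearly vanish
# for confidently correct ones, hence the similar model behaviour.
for m, h, l in zip(margins, hinge, logistic):
    print(f"margin {m:5.1f}: hinge {h:5.2f}, logistic {l:5.2f}")
```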

To summarize the discussion, the factors that impact the performance of classifiers include (i) the data distribution, (ii) the quality of the feature representation schema, (iii) the machine learning model, and (iv) the parameter settings. Hence, one should first consider the feature representation schema. Then, choose the appropriate machine learning model, considering the data distribution. Finally, tune the parameter settings of the machine learning model to achieve the best performance.

Conclusion

The study explores factors that affect the design of automatic classifiers for questions containing code snippets through factorial experiments, taking Oracle SQL certification exam questions as examples. Our research results show that the classifier with TF-IDF and Logistic Regression performed best in the weighted macro-average AUC and weighted macro-average F1; the classifier with TF-IDF and Support Vector Machine performed best in the weighted macro-average Precision. Moreover, the experiment results indicate that the feature representation scheme produces a more significant effect size than the machine learning method on the performance of the question classifiers. A good feature representation scheme can improve the performance and consistency of different machine learning methods. In addition, the logistic regression and SVM models performed better than the linear SVM and MNB models.

Based on the experiment results and literature, this study concludes that the data distribution, the quality of the feature representation scheme, the machine learning models, and the parameter settings of the models are the main factors affecting the performance of the classifiers for questions containing code snippets. The feature representation scheme should be the first consideration when designing a question classifier. Upon deciding on the feature representation scheme, one then chooses the appropriate machine learning model, considering the data distribution. Finally, one fine-tunes the parameter settings of the machine learning model to achieve the best performance.
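The suggested procedure (fix the feature representation scheme, choose a model, then tune its parameters) can be sketched with scikit-learn's GridSearchCV. The data and parameter grid below are illustrative assumptions, not the study's actual settings:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Toy labelled questions (hypothetical stand-ins for a test bank).
X = ["SELECT * FROM emp", "CREATE TABLE t (c NUMBER)",
     "SELECT COUNT(*) FROM dept", "ALTER TABLE t ADD d DATE",
     "SELECT ename FROM emp WHERE sal > 100", "DROP TABLE t"]
y = ["query", "ddl", "query", "ddl", "query", "ddl"]

# Step 1: fix the feature representation scheme (TF-IDF).
# Step 2: choose a model (logistic regression here).
pipe = Pipeline([("tfidf", TfidfVectorizer()),
                 ("clf", LogisticRegression(max_iter=1000))])

# Step 3: tune parameters against a weighted-F1 criterion.
grid = GridSearchCV(pipe,
                    {"tfidf__ngram_range": [(1, 1), (1, 2)],
                     "clf__C": [0.1, 1.0, 10.0]},
                    cv=2, scoring="f1_weighted")
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```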

The contributions of this study are twofold. First, the study explores the design of automatic classifiers for questions containing code snippets. Hence, it fills the gap in the literature on question classification, which mainly focuses on questions with general text descriptions. Second, the study’s results enable teachers/practitioners to build a question classification system to suggest a topic for a question. That can help them automatically classify the collected historical questions to accelerate building a test bank and reduce teachers’/practitioners’ workload.

Due to limited research resources, the study has the following limitations: (1) the number of questions in the dataset is 171; if the number of questions increases, other advanced models, such as neural networks or deep learning models, may be employed; (2) this study uses pre-trained word embedding schemes, and the results might differ if self-trained word embedding schemes are used; (3) this study mainly discusses four machine learning models: logistic regression, SVM, linear SVM, and Multinomial Naive Bayes; other machine learning models or ensemble learning methods may produce different results; (4) this study only focuses on SQL certification exam questions, so whether one can directly extend our research results to questions containing code snippets in other programming languages (such as Java, Python, etc.) needs further research.

In the future, one can develop ensemble learning models to improve the performance of the classifiers. Alternatively, one can establish classifiers with the incremental learning method to learn new instances online to adapt to the pattern changes when adding new questions to the test bank.

Supporting information

S1 Appendix. Nomenclatures used to express the statistical results.

https://doi.org/10.1371/journal.pone.0309050.s001

(DOCX)

References

1. Li X, Roth D. Learning question classifiers: The role of semantic information. Nat Lang Eng. 2002;12(3):229–49.
2. Sangodiah A, Muniandy M, Heng LE. Question classification using statistical approach: A complete review. J Theor Appl Inf Technol. 2015;71(3):386–95.
3. Silva VA, Bittencourt I, Maldonado JC. Automatic question classifiers: A systematic review. IEEE Trans Learn Technol. 2019;12(4):485–502.
4. Sangodiah A, Fui YT, Heng LE, Jalil NA, Ayyasamy RK, Meian KH. A comparative analysis on term weighting in exam question classification. 5th International Symposium on Multidisciplinary Studies and Innovative Technologies (ISMSIT); 2021: IEEE.
5. Laddha MD, Lokare VT, Kiwelekar AW, Netak LD. Classifications of the summative assessment for revised Bloom’s taxonomy by using deep learning. Int J Latest Trends Eng Technol. 2021;69(3):211–8.
6. Xu D, Jansen PA, Martin J, Xie Z, Yadav V, Madabushi HT, et al., editors. Multi-class hierarchical question classification for multiple choice science exams. Proceedings of the Twelfth Language Resources and Evaluation Conference; 2020; Marseille, France: European Language Resources Association (ELRA).
7. Wasim M, Asim MN, Ghani Khan MU, Mahmood W. Multi-label biomedical question classification for lexical answer type prediction. J Biomed Inform. 2019;93:103143. pmid:30872137
8. Chen HY, Liu YC, Liu XQ, Chiu V. The development of an automatic test assembly system for a formative assessment in mastery learning instruction: Case of the SQL mastery course. IEEE Access. 2023;11:95974–88.
9. Tsoumakas G, Katakis I. Multi-label classification: An overview. Int J Data Warehous. 2007;3(3):1–13.
10. Shaikh S, Daudpotta SM, Imran AS. Bloom’s learning outcomes’ automatic classification using LSTM and pretrained word embeddings. IEEE Access. 2021;9:117887–909.
11. Zhang J, Wong CP, Giacaman N, Luxton-Reilly A, editors. Automated classification of computing education questions using Bloom’s taxonomy. Proceedings of the 23rd Australasian Computing Education Conference; 2021.
12. Gani MO, Ayyasamy RK, Alhashmi SM, Sangodiah A, Fui YT. ETFPOS-IDF: A novel term weighting scheme for examination question classification based on Bloom’s taxonomy. IEEE Access. 2022;10:132777–85.
13. Mohammed M, Omar N. Question classification based on Bloom’s taxonomy cognitive domain using modified TF-IDF and word2vec. PLoS One. 2020;15(3):e0230442. pmid:32191738
14. Shanthi P, Krishnamurthi I. A semantic approach for question classification using register linear based model. Middle-East J Sci Res. 2015;34(4):685–94.
15. Yahya AA, Osama A. Automatic classification of questions into Bloom’s cognitive levels using support vector machines. The International Arab Conference on Information Technology; Naif Arab University for Security Science (NAUSS), Riyadh, Saudi Arabia; 2011.
16. Yahya AA, Toukal Z, Osman A, editors. Bloom’s taxonomy–based classification for item bank questions using support vector machines. Modern Advances in Intelligent Systems and Tools; 2012; Berlin, Heidelberg: Springer.
17. Yahya AA, Osman A, Taleb A, Alattab AA. Analyzing the cognitive level of classroom questions using machine learning techniques. Procedia Soc Behav. 2013;97:587–95.
18. Mohammed M, Omar N. Question classification based on Bloom’s taxonomy using enhanced TF-IDF. Int J Adv Sci Eng Inf Technol. 2018;8(4–2):1679.
19. Haris SS, Omar N. A rule-based approach in Bloom’s taxonomy question classification through natural language processing. 7th International Conference on Computing and Convergence Technology (ICCCT); Seoul, Korea (South): IEEE; 2012. p. 410–4.
20. Abduljabbar DA, Omar N. Exam questions classification based on Bloom’s taxonomy cognitive level using classifiers combination. J Theor Appl Inf Technol. 2015;78:447–55.
21. Sanders K, Ahmadzadeh M, Clear T, Edwards SH, Goldweber M, Johnson C, et al., editors. The Canterbury QuestionBank: Building a repository of multiple-choice CS1 and CS2 questions. Proceedings of the ITiCSE Working Group Reports Conference on Innovation and Technology in Computer Science Education—Working Group Reports; 2013; Canterbury, England, United Kingdom: Association for Computing Machinery.
22. Sangodiah A, Ahmad R, Ahmad WFW. Taxonomy based features in question classification using support vector machine. J Theor Appl Inf Technol. 2017;95(12):2814–23.
23. Patil SK, Shreyas MM. A comparative study of question bank classification based on revised Bloom’s taxonomy using SVM and K-NN. 2017 2nd International Conference On Emerging Computation and Information Technologies (ICECIT); Tumakuru, India: IEEE; 2017. p. 1–7.
24. Osadi KA, Fernando M, Welgama WV. Ensemble classifier based approach for classification of examination questions into Bloom’s taxonomy cognitive levels. Int J Comput Appl. 2017;162:1–6.
25. Lilleberg J, Zhu Y, Zhang Y. Support vector machines and word2vec for text classification with semantic features. 2015 IEEE 14th International Conference on Cognitive Informatics & Cognitive Computing (ICCI*CC); 2015. p. 136–40.
26. Vajjala S, Majumder B, Gupta A, Surana H. Practical natural language processing. O’Reilly Media; 2020.
27. Oracle [Internet]. Oracle database SQL exam number: 1z0-071. Oracle University; [cited 2022 Sep 10]. Available from: https://education.oracle.com/oracle-database-sql/pexam_1Z0-071.
28. spaCy [Internet]. en_core_web_sm: Trained pipelines for English. Version 3.6.0 [cited 2023 Aug 10]. Available from: https://spacy.io/models/en#en_core_web_sm.
29. NLTK [Internet]. Natural Language Toolkit. Version 3.8 [cited 2023 Aug 10]. Available from: https://www.nltk.org/.
30. Google [Internet]. Word2vec; 2013 [cited 2023 Aug 10]. Available from: https://code.google.com/archive/p/word2vec/.
31. Facebook Inc. [Internet]. Word vectors for 157 languages; 2022 [cited 2023 Aug 30]. Available from: https://fasttext.cc/docs/en/crawl-vectors.html.
32. Manning CD, Raghavan P, Schütze H. Introduction to information retrieval. Cambridge University Press; 2008.
33. Müller AC, Guido S. Introduction to machine learning with Python. O’Reilly; 2016.
34. James G, Witten D, Hastie T, Tibshirani R. An introduction to statistical learning: With applications in R. 2nd ed. New York: Springer; 2021.
35. Provost F, Fawcett T. Data science for business. O’Reilly Media; 2013.
36. Faul F, Erdfelder E, Lang A-G, Buchner A. G*Power 3: A flexible statistical power analysis program for the social, behavioral, and biomedical sciences. Behav Res Methods. 2007;39:175–91. pmid:17695343
37. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: Machine learning in Python. J Mach Learn Res. 2011;12:2825–30.
38. Wobbrock JO, Findlater L, Gergle D, Higgins JJ, editors. The aligned rank transform for nonparametric factorial analyses using only ANOVA procedures. Proceedings of the SIGCHI Conference on Human Factors in Computing Systems; 2011; New York, NY, USA: Association for Computing Machinery.
39. Mangiafico S. Summary and analysis of extension program evaluation in R. Version 1.20.05, revised 2023 ed. Rutgers Cooperative Extension, New Brunswick, NJ; 2016.
40. Toothaker LE, Chang H-s. On "the analysis of ranked data derived from completely randomized factorial designs". J Educ Behav Stat. 1980;5(2):169–76.
41. Yetisgen-Yildiz M, Pratt W, editors. The effect of feature representation on MEDLINE document classification. AMIA Annual Symposium Proceedings; 2005: AMIA Symposium.
42. Yilmaz S, Toklu S. A deep learning analysis on question classification task using word2vec representations. Neural Comput Appl. 2020;32(7):2909–28.
43. Martinčić-Ipšić S, Miličić T, Todorovski L. The influence of feature representation of text on the performance of document classification. Appl Sci. 2019;9(4):743.
44. Dessí D, Helaoui R, Kumar V, Recupero DR, Riboni D. TF-IDF vs word embeddings for morbidity identification in clinical notes: An initial study. arXiv preprint arXiv:2105.09632. 2021.
45. Khanna S, Tiwari B, Das P, Das AK. A comparative study on various text classification methods. In: Das AK, Nayak J, Naik B, Dutta S, Pelusi D, editors. Computational intelligence in pattern recognition. Singapore: Springer Singapore; 2020. p. 539–49.
46. Piskorski J, Jacquet G, editors. TF-IDF character n-grams versus word embedding-based models for fine-grained event classification: A preliminary study. Proceedings of the Workshop on Automated Extraction of Socio-political Events from News; 2020; Marseille, France: European Language Resources Association (ELRA).
47. Zhang T, Wang L, editors. Research on text classification method based on word2vec and improved TF-IDF. Advances in Intelligent Systems and Interactive Applications (IISA 2019); 2020: Springer, Cham.
48. Arora M, Mittal V, Aggarwal P, editors. Enactment of TF-IDF and word2vec on text categorization. Proceedings of 3rd International Conference on Computing Informatics and Networks; 2021; Singapore: Springer Singapore.
49. Pranckevičius T, Marcinkevičius V. Comparison of Naive Bayes, random forest, decision tree, support vector machines, and logistic regression classifiers for text reviews classification. Balt J Mod Comput. 2017;5(2):221–32.
50. Musa AB. Comparative study on classification performance between support vector machine and logistic regression. Int J Mach Learn Cybern. 2012;4:13–24.
51. Salazar DA, Vélez JI, Salazar JC. Comparison between SVM and logistic regression: Which one is better to discriminate? Rev Colomb Estad. 2012;35(SPE2):223–37.