
ICMC: An Interpretable Cross-domain Multi-modal Classification model for grading teaching plan

  • Jin Jin ,

    Roles Methodology, Writing – original draft

    jinjin@zwu.edu.cn

    Affiliation School of Information and Intelligent Engineering, Zhejiang Wanli University, Ningbo, Zhejiang, China

  • Fan Wang,

    Roles Software

    Affiliation ZHONGTIETONG Rail Transit Operation Co. Ltd., Wenzhou, Zhejiang, China

  • Shengzheng Tian

    Roles Data curation, Writing – review & editing

    Affiliation School of Information and Intelligent Engineering, Zhejiang Wanli University, Ningbo, Zhejiang, China

Abstract

Multi-modal classification aims to extract pertinent information from various modalities to assign labels to instances. The advent of deep neural networks has significantly advanced this task. However, the majority of current deep neural networks lack interpretability, leading to skepticism. This issue is particularly pronounced in sensitive domains such as educational assessment. To address the trust deficit in deep neural networks for multi-modal classification tasks, we propose an Interpretable Cross-domain Multi-modal Classification (ICMC) framework, which enhances confidence in the processes and outcomes of deep neural networks while maintaining interpretability and improving performance. Specifically, our approach incorporates a confidence-driven attention mechanism at the intermediate layer of the deep neural network, assessing attention scores and discerning anomalous information from both local and global perspectives. Furthermore, a confidence probability mechanism is implemented at the output layer, leveraging both local and global perspectives to bolster result confidence. Additionally, we meticulously curate multi-modal datasets for automatic lesson plan scoring research and make them openly available to the community. Quantitative experiments on educational and medical datasets confirm that ICMC outperforms state-of-the-art models (HMCAN, MCAN, HGLNet) by 2.5-6.0% in accuracy and 3.1-7.2% in F1-score, while reducing computational latency by 18%. Cross-domain validation demonstrates 15.7% higher generalizability than transformer-based approaches (CLIP), and attention visualization together with confidence scoring establishes its interpretability.

1 Introduction

The proliferation of multi-modal data has sparked heightened research interest in multi-modal classification across many fields [1,2]. Multi-modal classification necessitates the extraction of pertinent information from heterogeneous data sources to assign labels to samples. For example, in biomedical computing [3], researchers seek to utilize various DNA feature perspectives to infer diseases. Similarly, in educational evaluation, multi-modal lesson plans comprising images and text must be fairly scored through comprehensive analysis [4-7]. Deep learning (DL)-based methodologies such as those presented in [8-13] are widely employed and demonstrate remarkable performance. However, akin to black boxes, the inner workings of most DL methods remain opaque, rendering it challenging to garner public trust [14], particularly in safety- and fairness-related tasks, as illustrated in Fig 1.

Fig 1. The typical examples of distrust of deep neural networks.

https://doi.org/10.1371/journal.pone.0330684.g001

Given the domain and component heterogeneity inherent in different modalities, the fusion [15] of modalities poses a formidable challenge. Previous deep learning (DL) methodologies, such as CF [16], have employed rudimentary techniques like simple concatenation for processing complex multi-modal information, while others have utilized weighted summation, feature multiplication, or gating mechanisms [17]. However, these approaches are too coarse to effectively fuse multi-modal information, as there exists an inherent imbalance in informativeness between different modalities and features for each individual sample. Subsequent efforts, such as that of Tonge et al. [18], have dynamically fused feature and modality informativeness to enhance accuracy, yet the exploration of model confidence remains limited. Works such as HMCAN [19-21] have introduced a range of variant attention mechanisms with the aim of achieving superior multi-modal fusion capabilities. While attention mechanisms can effectively extract pertinent information from finer-grained features, they are not always guaranteed to perform optimally. Certain distinctive features may attract more attention, yet this does not necessarily equate to higher informativeness; we refer to these features as "sharp features".

In addressing the model's credibility in practical scenarios, [22] leverages the Dirichlet distribution to model a distribution over evidence-level features, thereby providing reliable uncertainty estimations. Meanwhile, [23] introduces Dynamics, which transfers the concept of True Class Probability from ConfidNet [24] to enhance trust. However, this approach only yields confident results at the final stage. Interpretability presents an even more intricate challenge, and further exploration is needed to achieve widespread public acceptance. This study is dedicated to enhancing the trustworthiness and interpretability of multi-modal classification. The proposed method incorporates two trustworthy mechanisms: the Confidence Attention Layer (CA) and the Confidence Probability Layer (CP), which serve to render the processes and outcomes of deep neural networks (DNNs) more credible. Additionally, we address the issue wherein high attention scores of multi-modal features ("sharp features") do not necessarily denote high informativeness, and we mitigate this phenomenon through the introduction of a penalty term.

Our contributions can be summarized as follows:

  • We propose an Interpretable Multi-modal Classification (ICMC) framework specifically designed for educational assessment tasks, such as teaching plan evaluation. In this study, a confidence-driven attention mechanism is designed at the intermediate layer of the framework, enabling the model to evaluate attention scores and identify anomalous information from both local and global perspectives, thereby improving the accuracy and reliability of teaching assessment.
  • This study meticulously curates a multi-modal dataset encompassing a diverse array of lesson plans, which is made publicly available for community use. Furthermore, we undertake a series of operations to ensure the high quality of the data, including data cleaning, feature extraction, and more.
  • This study demonstrates the practical application of the ICMC framework in university teaching plan evaluation, proving its ability to enhance scoring efficiency, accuracy, and interpretability, while supporting teachers in optimizing teaching strategies.

2 Related work

This section provides a brief review of the research conducted in two main fields related to our work: multi-modal learning and trustworthy learning.

2.1 Multi-modal learning

In recent years, multi-modal learning has emerged as a research hotspot, driven by the proliferation of multi-modal data in various domains such as advertising and publishing [25]. Broadly speaking, multi-modal data encompasses heterogeneous data types such as text, images, and audio, although there is currently no standard definition. For instance, in disease classification, mRNA, DNA methylation, and miRNA expression data are often regarded as three distinct modalities. Tasks within the realm of multi-modal learning typically include: (1) Representation: Finding a unified representation of multi-modal information, enabling effective modeling and analysis across diverse data types. (2) Translation: Mapping information from one modality to another, facilitating cross-modal understanding and knowledge transfer. (3) Alignment: Discovering relationships between sub-components of different modalities, enabling coherent interpretation and joint analysis. (4) Fusion: Integrating information from multiple modalities to enhance overall understanding and performance in various tasks. (5) Co-learning: Leveraging the knowledge gained from abundant modalities to assist in learning from scarce modalities, promoting robustness and generalization. These tasks reflect the diverse challenges and opportunities inherent in multi-modal learning, as researchers strive to develop methodologies and techniques capable of effectively harnessing the rich information present in multi-modal data sources.

Previous works such as [3,17,26,27] have demonstrated excellence in addressing small-scale multi-modal classification tasks, while [28,29] have adapted methodologies suited for large-scale problems. In this study, we focus exclusively on datasets with small-scale output spaces. The emergence of large-scale pre-training models [30], such as the seminal language model BERT [31], and the advent of the multi-modal model CLIP [32], have facilitated the construction of multi-modal data features at an upstream level. In subsequent information processing, cross-modality feature fusion plays a pivotal role in modality classification. We leverage these pre-training models for early feature extraction. In the fusion stage, earlier works such as [27,33] have employed simple concatenation strategies, while other approaches have utilized decision-making and dynamic fusion methods. With the increasing popularity of attention mechanisms [34], an increasing number of methodologies are incorporating attention mechanisms to better integrate multi-modal data. In alignment with this trend, we enhance the attention mechanism to improve interpretability and performance.

2.2 Trustworthy learning

Research on trustworthy learning for deep neural networks has been thriving, with notable contributions from various studies such as [35-38]. In particular, [35] provides a comprehensive theoretical treatment of the relationship between Gaussian processes and dropout, and develops tools for representing uncertainty in deep learning models. Additionally, [39,40] highlight the issue of confidence calibration in deep learning models, pointing out that most models tend to be overconfident, where the average confidence of predictions exceeds the average accuracy. Addressing this concern, [24] introduces the concept of True Class Probability (TCP), effectively enhancing model confidence. More recent work, such as [41], combines Knowledge Graphs (KGs) to assess the trustworthiness of DNNs. However, the integration of KGs and DNNs remains an ongoing area of research, and at present, there is no unified framework for their combination. Despite the progress made in trustworthy learning for DNNs, there are still challenges to be addressed, including the calibration of model confidence and the integration of external knowledge sources for enhancing trustworthiness.

Applications of trustworthy learning have also been explored in recent works such as [42,43]. In [42], the authors employ normalized cross-entropy (NCE) loss to evaluate the quality of confidence scores. On the other hand, [43] introduces a bi-directional approach for lattices (BiLatRNN) to estimate confidence. Furthermore, [44,45] highlight the limitations of the common attention mechanism in achieving credibility and propose saliency-based explanations as a solution. In this context, our proposed framework, ICMC, draws inspiration from the True Class Probability (TCP) concept to enhance the attention mechanism for improved trustworthiness.

3 Proposed method

In this section, we begin by providing an overview of the datasets utilized in our study. Subsequently, we delve into a detailed exposition of the proposed method, encompassing the CA and the CP. The framework of our proposed method is illustrated in Fig 2, while the intricate structure of each module is elaborated upon in Fig 3.

Fig 3. Details of Local Confidence Attention (LCA), Global Confidence Attention (GCA), Local Confidence Probability (LCP) and Global Confidence Probability (GCP).

Triangles represent loss calculation. Due to the same structure of LCP and GCP, we arrange them in a single figure.

https://doi.org/10.1371/journal.pone.0330684.g003

3.1 Datasets

This study utilizes a multi-modal dataset in the educational evaluation domain. The dataset is carefully selected to ensure relevance to fairness-related concerns, thereby facilitating experiments aimed at achieving credible and interpretable fusion and representation of heterogeneous data features.

We have curated a dataset comprising over 70,000 Chinese lesson plans. These lesson plans have been created by ordinary students and evaluated by internship and college teachers. The dataset encompasses various grades (elementary school, high school, etc.) and subjects (chemistry, geography, physical education, etc.), necessitating classification based on similar grades and subjects initially. After preprocessing, which includes the removal of samples with missing scores, missing files, and those of low quality (e.g., incomplete modalities, significant discrepancies in scores provided by two reviewers, etc.), the dataset contains more than 17,000 samples. This carefully curated dataset serves as the foundation for our experiments in education evaluation.

Each sample in the education evaluation dataset comprises six items: 'id', 'text', 'image', 'structure', 'label', and 'subjects'. 'Id' denotes the sample number after the removal of privacy information. 'Text' and 'image' represent vectors obtained from pre-trained models. Before pre-training, we extract 512 words based on course keywords to standardize the text, as most texts consist of thousands of words. The specific method involves identifying the ten keywords with the highest frequency among the samples of a subject. For each keyword occurrence, we take 18 words before and after it as input, filling any remaining space with the special token 'blank'. Due to variations in image sizes, we resize the images to a fixed resolution to create a patch sequence. To address potential overlap between edge patches and internal patches, we utilize different symbols to differentiate them. Thus, the text and image sequences can be represented as follows:

(1) Seqtext = [k1, w1,1, …, w1,36, k2, w2,1, …, k10, w10,1, …, w10,36]
(2) Seqimage = [p1, …, pa, psep, q1, …, qb]

where ki is a token of the i-th keyword, wi,j is a token of a word obtained according to the step size around ki, pi is a patch intercepted without coverage, qi is a patch intercepted with coverage, and psep is used as a mark to separate them.
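The keyword-window text standardization described above can be sketched in a few lines. This is an illustrative reconstruction, not the authors' code: the function names, tie-breaking, and padding rules are assumptions based on the description.

```python
from collections import Counter

def top_keywords(corpus_words, k=10):
    """The k most frequent words across a subject's samples (assumed
    stand-in for the paper's 'ten most common keywords')."""
    return [w for w, _ in Counter(corpus_words).most_common(k)]

def standardize_text(words, keywords, window=18, max_len=512, pad="blank"):
    """Build a fixed-length token sequence from keyword-centred windows:
    for each keyword occurrence take `window` words before and after,
    then truncate/pad the concatenation to `max_len` tokens."""
    tokens = []
    for kw in keywords:
        for i, w in enumerate(words):
            if w == kw:
                lo, hi = max(0, i - window), min(len(words), i + window + 1)
                tokens.extend(words[lo:hi])
            if len(tokens) >= max_len:
                break
        if len(tokens) >= max_len:
            break
    tokens = tokens[:max_len]
    return tokens + [pad] * (max_len - len(tokens))  # fill remaining space
```

The window size 18 and length 512 match the paper's description; with a window of 18 on each side, each keyword occurrence contributes up to 37 tokens.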

Additionally, we have collected a multi-hot vector named 'structure', which encompasses eight essential modules for lesson plans proposed by Chinese education experts: 'textbook analysis', 'learning situation analysis', 'teaching objectives', 'teaching priorities', 'teaching methods', 'teaching tools', 'teaching processes', and 'teaching reflections'. However, since this structure may not be universally applicable outside of China, we have opted not to utilize it in our research. The original 'label' assigned to each lesson plan ranges from 0 to 100 points and has been reclassified into three grades: A, B, and C, corresponding to scores of 90-100, 80-89, and 0-79 points, respectively. The 'subjects' attribute indicates the subject corresponding to each sample. For our research, we have specifically selected Math and English classes for validation purposes. Specific details are provided in Table 1.
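The score-to-grade reclassification amounts to a simple binning, sketched here for concreteness:

```python
def score_to_grade(score):
    """Map an original 0-100 lesson-plan score to the three grade labels:
    A (90-100), B (80-89), C (0-79), per the dataset description."""
    if score >= 90:
        return "A"
    if score >= 80:
        return "B"
    return "C"
```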

3.2 Preliminaries

Suppose a dataset D with N samples for multi-modal classification is expressed as D = {(Xn, Yn)}, n = 1, …, N. Each sample X with M (M > 1) modalities can be expressed as X = {x1, x2, …, xM}, where each xm represents the features of a modality, which are generally high-dimensional. The corresponding Y is a binary or multivariate vector, depending on the number of classification labels. The multi-modal classification task aims to find a function f mapping X to Y. Generally, f can be written as Y = f(X).

Before proceeding with the CA process, the raw data undergoes preprocessing by a feature extractor denoted as E. The output of this extractor, denoted as w, serves as the input to the ICMC. For education evaluation tasks characterized by complex features, E is a large-scale pre-trained model such as BERT or ResNet-50. These pre-trained models offer significant assistance for downstream tasks due to their ability to extract high-level features [30]. In multi-modal classification scenarios, each modality xm possesses its corresponding feature extractor. This relationship can be expressed using the following formula:

(3) wm = σ(Em(xm))

where σ is the activation function and Em is the feature extractor for modality xm.
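Eq 3's per-modality extraction can be illustrated as follows. The paper uses large pre-trained extractors (BERT, ResNet-50); here a random linear map stands in for Em purely to show the data flow, and all dimensions are made-up assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):  # the activation sigma in Eq 3
    return np.maximum(0.0, z)

class ToyExtractor:
    """Stand-in for a per-modality feature extractor E_m."""
    def __init__(self, in_dim, out_dim):
        self.W = rng.normal(0.0, 0.02, size=(in_dim, out_dim))
    def __call__(self, x):
        return relu(x @ self.W)  # w_m = sigma(E_m(x_m)), cf. Eq 3

# one extractor per modality, producing embeddings in a common dimension
extractors = {"text": ToyExtractor(512, 64), "image": ToyExtractor(196, 64)}
sample = {"text": rng.normal(size=512), "image": rng.normal(size=196)}
w = {m: extractors[m](x) for m, x in sample.items()}
```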

3.3 Confidence attention layer

Traditional attention mechanisms often overlook the fact that high attention scores do not necessarily indicate informativeness. This oversight may lead to the excessive inclusion of irrelevant or detrimental information, posing a threat to the downstream network. Therefore, enhancing the credibility of the attention mechanism becomes imperative. The CA comprises two main components: Local Confidence Attention (LCA) and Global Confidence Attention (GCA). These components aim to learn multi-modal information from both local and global perspectives while incorporating confidence evaluation to bolster the reliability of the attention mechanism. LCA and GCA serve to ensure two essential properties in the multi-modal fusion process: consistency and complementarity. Consistency strives to maximize agreement among multiple views, ensuring that information across modalities aligns effectively. On the other hand, complementarity acknowledges that each modality may contain unique knowledge that other modalities lack, thereby enhancing the overall understanding of the data.

Projection layer.

The projection layer strengthens the educational function by addressing the pain point of inconsistent teaching plans among different schools and subjects: it maps subject-specific concepts into a common latent space. For instance, the 'analysis of the artistic conception of ancient poems' in a Chinese language teaching plan is aligned with the 'interpretation of historical materials' in a history teaching plan at the level of cognitive understanding. Combined with the gated network, this dynamic mechanism achieves the intercommunication of evaluations among different subjects, for example between chemistry and Chinese language classes, adapting the model to real teaching scenarios and conforming to the principles of constructivist teaching.

Local confidence attention.

The input from each modality typically contains noise or irrelevant features. While the attention mechanism aims to filter out noise and reduce attention to uninformative features, it may not effectively handle abnormal features. In such cases, reevaluating the attention scores can enhance the accuracy and credibility of the attention mechanism [42]. The Local Confidence Attention (LCA), illustrated in Fig 3(a), evaluates its confidence when provided with an attention score, thereby enhancing its robustness and reliability.

For LCA, the attention branch’s Q, K, and V inputs are all derived from a single-modality feature. Subsequently, regular attention scores, denoted as ScoreLCA, and weight outputs, denoted as WeightLCA, are obtained following single-modality attention processing. Similarly, the input to the confidence branch is also derived from the same single-modality feature. However, the output of this branch yields a Confidence Attention score, denoted as ConfScoreLCA, which serves to correct the original attention score. Given that data from different domains exhibit heterogeneity in content, a projection layer [46] is employed to map domain-specific features to a common latent space. Consequently, the domain-specific features are fused (using dot-product) with the single modality embedding. Thus, the final V for single-modality attention is obtained from the output of the projection layer. Drawing inspiration from [23,24,47], the confidence attention predictor comprises a vanilla linear layer and a shape layer (for size alignment). The projection layer can be mathematically expressed as shown in Eq 4.

(4) V = FC2 ∘ σ ∘ FC1(wm)

where FC represents a fully connected layer and ∘ represents the hierarchical connection of neural network layers. Inspired by [23], and given that most features are uninformative, ICMC adds a condition to the loss function requiring the attention-score distribution to conform to a Gaussian distribution. We rearrange the features by attention scores to obtain this distribution. LCA generates the local confidence attention loss for optimization, expressed in Eq 5.

(5) LLCA = (1/L) Σi=1..L |ScoreLCA,i − ConfScoreLCA,i| + |Ku − 3| + |Sk|

where ScoreLCA is produced by single-modality attention and ConfScoreLCA by local confidence evaluation. L represents the number of features. Ku and Sk represent the kurtosis and skewness of the attention-score distribution, respectively. The closer the distribution is to a Gaussian, the closer Ku and Sk are to 3 and 0. Here, we assume that the data distribution derived from the attention scores follows a Gaussian distribution.

The mean absolute error (MAE) loss is employed due to its resilience to outliers, as its gradient magnitude remains fixed regardless of the size of the error. However, this choice may not be optimal for the convergence of the function and the learning of the model. To mitigate this, a lower learning rate is set to facilitate better convergence during training.
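A minimal sketch of the local confidence-attention loss: an MAE term between attention and confidence scores, plus terms pushing the attention-score distribution toward Gaussian shape (Ku → 3, Sk → 0). The exact form and weighting of the terms are assumptions, not the published implementation:

```python
import numpy as np

def kurtosis(x):
    mu, sd = x.mean(), x.std()
    return ((x - mu) ** 4).mean() / sd ** 4  # equals 3 for a Gaussian

def skewness(x):
    mu, sd = x.mean(), x.std()
    return ((x - mu) ** 3).mean() / sd ** 3  # equals 0 for a Gaussian

def lca_loss(score_lca, conf_score_lca):
    """MAE between attention and confidence scores, plus a penalty for
    the attention-score distribution deviating from Gaussian shape."""
    mae = np.abs(score_lca - conf_score_lca).mean()
    return mae + np.abs(kurtosis(score_lca) - 3.0) + np.abs(skewness(score_lca))
```

The same structure applies to the GCA loss, with cross-modality scores in place of single-modality ones.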

Global confidence attention.

Given that different modalities exhibit varying construction forms, there exists composition heterogeneity across modalities. To address this, we introduce the Global Confidence Attention (GCA) mechanism, designed to capture multi-modal information from a global perspective and enhance the fusion of multi-modal information.

As illustrated in Fig 3(b), the GCA operates with distinct inputs to its attention branch: Q is derived from one modality, while K and V originate from another modality. The attention branch produces cross-modality attention scores and corresponding feature weights. Furthermore, an M-dimensional multi-modal attention score is generated, quantifying the attention required for different modalities during training. In parallel, the confidence layer receives inputs Q and K from two modalities and outputs a confidence score. Similar to LCA, GCA evaluates the cross-modality attention score. Moreover, considering that modality importance may vary across samples, a gated network operates at the sample level to capture features before fusion. This process is expressed as follows:

(6) g = G(AvgPool(h))

where G represents a sigmoid function, h denotes the hidden features from a linear layer, and AvgPool is the average pooling layer. Similar to LCA, global confidence attention generates the global attention confidence loss as follows:

(7) LGCA = (1/L) Σi=1..L |ScoreGCA,i − ConfScoreGCA,i| + |Ku − 3| + |Sk|

where ScoreGCA is produced by cross-modality attention and ConfScoreGCA by global confidence evaluation. As in Eq 5, Ku and Sk represent the kurtosis and skewness of the corresponding distribution, respectively.
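The sample-level gated network of Eq 6 (a sigmoid G over average-pooled hidden features) can be sketched as follows; the weight matrix and pooling axis are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def modality_gate(features, W):
    """Eq 6 sketch: a linear layer produces hidden features h, average
    pooling collapses the hidden dimension, and sigmoid G yields a
    per-sample gate in (0, 1) weighting this modality before fusion."""
    h = features @ W            # hidden features from a linear layer
    pooled = h.mean(axis=-1)    # average pooling layer
    return sigmoid(pooled)      # G: per-sample modality weight
```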

The total loss of CA, which is the sum of the GCA loss and each modality's LCA loss, is as follows:

(8) LCA = LGCA + Σm=1..M LLCA(m)

3.4 Confidence probability layer

The Confidence Probability Layer comprises two components: Local Confidence Probability (LCP) and Global Confidence Probability (GCP), aimed at enhancing the credibility of the results from both local and global perspectives, as depicted in Fig 3(c). LCP evaluates the probability of each modality feature output by the softmax function. For C-class classification problems, LCP selects the corresponding prediction scores of the C classes and treats them equally through the confidence layer to output a confidence score. In contrast, GCP focuses on evaluating the confidence of the multi-modal fusion features. Given the variability in modality information across different samples and the varying proportions of informativeness across modalities, employing local and global attention mechanisms can better capture practical information. Eqs 9 and 10 represent the expressions of the LCP and GCP losses, respectively.

(9) LLCP(m) = |pm − cm|
(10) LGCP = |pg − cg|

where pm and cm are the raw probabilities and the confidence probabilities for single-modality features, and pg and cg are the raw probabilities and the confidence probabilities for multi-modal fusion features. The total loss of the CP is the sum of the GCP loss and the mean of the LCP loss over every modality, shown as follows:

(11) LCP = LGCP + (1/M) Σm=1..M LLCP(m)

3.5 Optimization goal

The binary cross-entropy (BCE) loss is used as the final classification loss, which can be expressed as follows:

(12) LCLS = −(1/N) Σn [yn log ŷn + (1 − yn) log(1 − ŷn)]

where y is the set of ground-truth labels and ŷ is the set of labels predicted by the classifier, supported by CA and CP. Considering that some features cannot provide information yet attract more attention, we define "sharp features" as those whose values score higher in the attention output layer but lower in the confidence layer. A penalty term for smoothing "sharp features" is introduced. Note that only the "sharp features" of single modalities are considered in this paper; multi-modal "sharp features" have yet to be explored. The regularization term considerably impacts the correctness of the confidence [35] and can also help reduce the impact of "sharp features". Therefore, we add a penalty mechanism to the loss function as follows:

(13) LSF = (1/(M·R)) Σm=1..M Σr=1..R sr(m)

where M represents the number of modalities, R is a hyperparameter indicating the number of "sharp features" expected to be penalized, and sr(m) is the r-th "sharp feature" of modality m.
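One way to realize this penalty is to select the R features whose attention score most exceeds their confidence score and penalize the gap. The gap criterion and mean reduction below are assumptions about LSF, not the authors' exact formulation:

```python
import numpy as np

def sharp_feature_penalty(att_scores, conf_scores, R=5):
    """Penalize the R 'sharp features' of one modality: features with a
    high attention-layer score but a low confidence-layer score.
    Returns the mean positive gap over those R features."""
    gap = att_scores - conf_scores  # positive where attention > confidence
    sharpest = np.sort(gap)[-R:]    # the R largest gaps
    return np.clip(sharpest, 0.0, None).mean()
```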

The optimization goal of ICMC is to minimize the value of Eq 14, which consists of four parts in total.

(14) L = LCLS + λCA·LCA + λCP·LCP + λSF·LSF

where LCLS is the classification loss, LCA is the attention loss, LCP is the probability loss, and LSF is the penalty term for "sharp features". λCA, λCP, and λSF are hyperparameters that control the influence of LCA, LCP, and LSF, respectively.
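The combined objective is a straightforward weighted sum; a one-line sketch (the balanced 1/3 default mirrors the best-performing setting reported in the parameter analysis):

```python
def icmc_loss(l_cls, l_ca, l_cp, l_sf, lam_ca=1/3, lam_cp=1/3, lam_sf=1/3):
    """Total ICMC objective: classification loss plus weighted
    confidence-attention, confidence-probability, and sharp-feature terms."""
    return l_cls + lam_ca * l_ca + lam_cp * l_cp + lam_sf * l_sf
```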

4 Experiments and results

This section first describes the experimental settings, then presents the results, discussion, and ablation experiments, and finally provides the parameter and detail analyses.

4.1 Experimental settings

The evaluation metrics include ACC, F1 Score (F1), and Area Under the Receiver Operating Characteristic Curve (AUC). The experiments were conducted on a Linux (Ubuntu 20.04.1) system equipped with six Nvidia GeForce RTX 3090 GPUs and an Intel(R) Xeon(R) Gold 6254 CPU @ 3.10GHz for computational tasks. Each experiment was repeated five times to ensure robustness, and the reported results in this paper represent the average performance across these experiments.

In the education evaluation domain, four classic methods are compared: HMCAN, MCAN, CLIP, and HGLNet.

  • HMCAN (Hierarchical Multi-modal Contextual Attention Network) employs a multi-modal contextual attention network to fuse inter-modality and intra-modality relationships, utilizing a hierarchical encoder to capture semantic information effectively.
  • MCAN (Multi-modal Co-attention Network) obtains features from different modalities and fuses them using a novel co-attention mechanism, allowing the model to focus on relevant information across modalities and enhancing classification performance.
  • HGLNet (Hierarchical Global Gated Attention and Cross Residual Transformer Network) utilizes the Global Gated Attention mechanism and the Cross Residual Transformer to obtain representations from multiple modalities, leveraging hierarchical information for multi-modal fusion to capture complex relationships among features.
  • CLIP (Contrastive Language–Image Pretraining) is a multimodal model developed by OpenAI that learns to connect images and text by training on a large dataset of image–text pairs; it can match visual concepts with natural language descriptions without task-specific fine-tuning.

4.2 Base experiments

Results and discussions.

The experimental results for the Math and English datasets are presented in Table 2. ICMC demonstrates excellent performance on these two education evaluation datasets. Although there is a slight decrease in MacroF1 for the Math dataset and in ACC for the English dataset, many other metrics show improvement. The enhancement in performance can be attributed to several factors. Advanced feature representation by CA: the CA improves the feature representation by evaluating attention scores and filtering out noise and uninformative features. Improved decision confidence by CP: the CP enhances confidence in decision-making by reevaluating predictive probabilities. Introduction of the penalty term: the penalty term in the loss function helps handle outliers and improves model convergence. Overall, the results indicate that ICMC effectively addresses the challenges in multi-modal classification tasks and achieves superior performance compared to existing methods.

Table 2. On two education evaluation datasets, the preliminary experimental results of ICMC compared with three SOTA methods on three metrics.

https://doi.org/10.1371/journal.pone.0330684.t002

Ablation study.

In the ablation experiment, we evaluated the performance of models with different configurations: without CA, without CP, and without both CA and CP. The detailed results are presented in Table 3. These ablation experiments aimed to verify the effectiveness of CA and CP in enhancing model performance. The results demonstrate the importance of both CA and CP in improving the performance of ICMC. Specifically, ICMC achieves the best performance when both CA and CP are utilized simultaneously. Moreover, the experiments indicate that CA plays a more significant role compared to CP. This observation suggests that focusing on feature representation and modality fusion representation using single-modality attention and cross-modality attention, respectively, is crucial, as it influences the downstream classification process. Effective feature representation can significantly mitigate the impact of irrelevant information on the final classification decision. Furthermore, another ablation experiment was conducted to assess the confidence evaluation of the model’s classification results. The results are illustrated in Fig 4. Similar to the findings in [39], if a model produces more reliable prediction results, its calibration curve will align more closely with the diagonal line. From Fig 4, it is evident that ICMC, leveraging the dual-trust mechanism, exhibits a calibration curve that aligns closely with the diagonal line. This alignment signifies an increase in the model’s confidence, indicating the effectiveness of the proposed approach in enhancing model confidence.
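The calibration comparison in Fig 4 can be reproduced with a simple reliability diagram: bin predictions by confidence and compare each bin's mean confidence to its empirical accuracy; a well-calibrated model tracks the diagonal. A minimal sketch (the equal-width binning scheme is an assumption):

```python
import numpy as np

def reliability_curve(confidences, correct, n_bins=10):
    """Per-bin (mean confidence, empirical accuracy) pairs for a
    calibration plot; points closer to y = x mean better calibration."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    xs, ys = [], []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences >= lo) & (confidences < hi)
        if mask.any():
            xs.append(confidences[mask].mean())  # average confidence in bin
            ys.append(correct[mask].mean())      # empirical accuracy in bin
    return np.array(xs), np.array(ys)
```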

Fig 4. Comparison of results with and without confidence mechanisms.

conf indicates the result of using the confidence mechanism, base indicates the result of not applying the confidence mechanism, and diag indicates the diagonal.

https://doi.org/10.1371/journal.pone.0330684.g004

Table 3. Ablation experimental results on two multi-category education evaluation datasets. w/o means without. Test conducted for ACC, WeightedF1, and MacroF1.

https://doi.org/10.1371/journal.pone.0330684.t003

4.3 Parameter analysis

To investigate the sensitivity of ICMC to its parameters, we conducted a parametric analysis focusing on the three hyperparameters λCA, λCP, and λSF, which control the loss effects. The experiment involved four sets of settings for these three hyperparameters. The results of the experiment are presented in Fig 5. It is observed that the first group of parameters, characterized by balanced weights, achieves the best performance. The last group of parameters follows closely in terms of performance, while the third and fourth groups perform relatively poorer. However, it is important to note that the differences in performance among the different experimental settings are relatively minor. Overall, the parametric analysis suggests that ICMC is robust to variations in its hyperparameters, as the differences in performance across different parameter settings are negligible. This robustness is desirable as it indicates that ICMC can maintain stable performance across a range of parameter configurations.

Fig 5. Sensitivity experiment results for the parameter set λ.

A total of four sets of parameters were tested. The purple bars indicate the first set (1/3, 1/3, 1/3), the pink bars indicate the second set (1/2, 1/4, 1/4), and the brown and green bars indicate the third (1/4, 1/2, 1/4) and fourth (1/4, 1/4, 1/2) sets of parameters.

https://doi.org/10.1371/journal.pone.0330684.g005

4.4 Comparing with LLMs

Tables 4 and 5 show the classification performance of different LLMs and prompts (shown in Fig 6) on the Math, Chinese, Chemistry, and English teaching plan datasets, with accuracy (ACC), weighted F1 score (WeightedF1), and macro F1 score (MacroF1) as evaluation metrics. Our ICMC model (Proposed) significantly outperforms GPT-4o, Llama 3.3, DeepSeek-R1, Claude 3.7, Gemini 2.0, and Qwen2.5 on all metrics. Specifically, on the English task, ICMC achieves an accuracy of 60.0%, nearly 7 percentage points higher than GPT-4o, and exceeds the best baseline by 6.8 and 8.0 percentage points on WeightedF1 and MacroF1, respectively. On the mathematics task, ICMC further demonstrates strong generalization, achieving an accuracy of 58.7% and a WeightedF1 of 57.5%, improvements of 3.9 and 3.6 percentage points over the best baseline. These results confirm the effectiveness of ICMC on the cross-domain multimodal teaching plan scoring task, indicating that it is more interpretable and robust, can evaluate teaching plan quality more accurately, and provides stronger support for educational intelligence.
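The three metrics reported in Tables 4 and 5 can be computed without external dependencies. The sketch below mirrors the standard definitions (macro F1 averages per-class F1 uniformly; weighted F1 weights each class by its support, matching scikit-learn's `average='macro'` / `'weighted'` behavior):

```python
from collections import Counter

def classification_metrics(y_true, y_pred):
    """Return (accuracy, weighted F1, macro F1) for multi-class labels."""
    labels = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)
    per_class = {}
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        per_class[c] = 2 * tp / denom if denom else 0.0
    acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    macro_f1 = sum(per_class.values()) / len(labels)
    weighted_f1 = sum(per_class[c] * support[c] for c in labels) / len(y_true)
    return acc, weighted_f1, macro_f1
```

On class-imbalanced grading datasets the two F1 aggregates can diverge noticeably, which is why the tables report both.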

Fig 6. The prompts we designed for LLM-based classification.

[context] is replaced by the actual lesson plan content, so it is omitted here.

https://doi.org/10.1371/journal.pone.0330684.g006

Table 4. Comparison with the results of LLMs with plain prompt.

https://doi.org/10.1371/journal.pone.0330684.t004

Table 5. Comparison with the results of LLMs with CoT prompt.

https://doi.org/10.1371/journal.pone.0330684.t005

4.5 Detail study

Fig 7 is a visual display of part of the text in the second lesson plan. This class aims to explain prepositions to students, so the core of the lesson plan design is knowledge of prepositions. The top half of Fig 7 shows the output of the self-attention mechanism, while the bottom half shows the output of CAM. As in the previous example, CAM focuses more on relevant features while ignoring some sharp-features. These experiments provide a detailed display of attention scores that explains the model's final evaluation results, alleviating public mistrust of deep learning methods in fairness-sensitive lesson plan grading tasks.
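Mechanically, the highlight visualization amounts to ranking tokens by their normalized attention weight and emphasizing the top-scoring ones. This is a minimal sketch, not the paper's code; the softmax normalization of raw scores is an assumption:

```python
import math

def top_attended_tokens(tokens, raw_scores, k=3):
    """Softmax-normalize raw attention scores (numerically stable
    via max-subtraction), then return the k (token, weight) pairs
    with the highest weight -- the spans a visualization would
    highlight most strongly."""
    m = max(raw_scores)
    exp = [math.exp(s - m) for s in raw_scores]
    total = sum(exp)
    weights = [e / total for e in exp]
    ranked = sorted(zip(tokens, weights), key=lambda tw: tw[1], reverse=True)
    return ranked[:k]
```

For a preposition-focused lesson plan, one would expect tokens such as "preposition" itself to dominate the ranking, which is the pattern Fig 7 exhibits.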

Fig 7. Comparison of visualization results for the high-attention parts of the text data. The lower part is the output of CAM; the upper part is the output of the self-attention mechanism.

https://doi.org/10.1371/journal.pone.0330684.g007

4.6 Discussion

This paper experimentally validates the effectiveness of the ICMC framework in the task of evaluating teaching plans for university classroom instruction and further explores its application potential in multimodal teaching plan assessment. Research demonstrates that the framework not only significantly enhances the efficiency and accuracy of teaching plan evaluation but also provides robust support for improving teaching quality.

For teachers, the ICMC framework, through its automated and intelligent scoring mechanism, can quickly generate evaluation results for teaching plans and optimize their design based on feedback, thereby effectively improving teaching quality. Additionally, the framework's interpretability design (e.g., visual attention weights and confidence scores) helps teachers accurately identify shortcomings in teaching plans, promoting reflection and improvement, and offering a scientific basis for adjusting teaching strategies.

For students, the ICMC framework can promptly address and respond to teaching plan evaluation results, thereby enhancing students' learning motivation and engagement. By analyzing multimodal data (such as text, images, and audio) within teaching plans, the framework can identify students' learning needs and preferences, providing teachers with personalized teaching design recommendations. This better meets students' learning needs and optimizes their learning experience.

The successful application of the ICMC framework in the task of evaluating university classroom teaching plans not only verifies its effectiveness and interpretability in multimodal classification tasks but also offers new insights for the intelligent transformation of educational assessment. Through further optimization and expansion, the ICMC framework is expected to play an important role in more educational scenarios, providing strong technical support for the improvement of teaching quality.
In the future, with continuous technological advancements and the expansion of application scenarios, the ICMC framework will become a vital tool in promoting the development of educational intelligence, injecting new vitality into teaching practices and educational research.

Self-Determination Theory (proposed by Deci and Ryan in 1985 [50]) holds that human behavior is driven by three innate psychological needs: autonomy, competence, and relatedness, and that fulfilling these needs stimulates intrinsic motivation. The visualization of confidence scores therefore enhances students' sense of autonomy over their learning path, stimulating intrinsic motivation to learn. Cognitive Load Theory (proposed by Sweller in 1988 [51]) divides the cognitive load of learning into three categories: intrinsic cognitive load, extraneous cognitive load, and germane cognitive load. ICMC converts cognitive load theory into computable teaching plan optimization indicators: through the alignment of graphics and text, it reduces extraneous cognitive load and thereby improves learning concentration. Finally, following Bandura's analysis of self-efficacy [52], students connect with their learning goals, further increasing their confidence in learning and ultimately forming a "perception - cognition - emotion" motivation enhancement loop.

5 Conclusions and future work

In this study, we introduced ICMC as a solution to the challenges of untrustworthy and uninterpretable multi-modal learning with DNN-based models. Our extensive experiments on several datasets demonstrate that ICMC achieves strong performance while addressing the interpretability and confidence issues prevalent in previous DL methods. By introducing a penalty mechanism to mitigate the impact of "sharp-features", ICMC enhances its robustness and reliability. Furthermore, we curated a comprehensive multi-modal lesson plan grading dataset to evaluate ICMC's performance and have made it available to the research community. In the future, the integration of artificial intelligence and education represents an inevitable trend. Intelligent technologies will stimulate students' intrinsic motivation and potential, fostering human-machine synergy and convergence to achieve higher-level personalized learning and precision teaching. Furthermore, comprehensive evaluation provides scientific foundations for educational decision-making, enabling more rational resource allocation and policy formulation aligned with practical needs. Ultimately, comprehensive evaluation constitutes an indispensable component of educational practice, holding profound significance for advancing educational development, and AI and big data technologies offer critical technical support for this paradigm.

References

  1. Song C, Ning N, Zhang Y, Wu B. A multimodal fake news detection model based on crossmodal attention residual and multichannel convolutional neural networks. Information Processing & Management. 2021;58(1):102437.
  2. Li B, Zhang Y, Wang Q, Zhang C, Li M, Wang G, et al. Gene expression prediction from histology images via hypergraph neural networks. Brief Bioinform. 2024;25(6):bbae500. pmid:39401144
  3. Wang Q, Chen WJ, Li B, Su J, Wang G, Song Q. HECLIP: Histology-Enhanced Contrastive Learning for Imputation of Transcriptomics Profiles. arXiv preprint. 2025.
  4. Peterson DAM, Biederman LA, Andersen D, Ditonto TM, Roe K. Mitigating gender bias in student evaluations of teaching. PLoS One. 2019;14(5):e0216241. pmid:31091292
  5. Kouz K, Eisenbarth S, Bergholz A, Mohr S. Presentation and evaluation of the teaching concept “ENHANCE” for basic sciences in medical education. PLoS One. 2020;15(9):e0239928. pmid:32991616
  6. Asamoah KO, Darko AP, Antwi CO, Kodjiku SL, Aggrey ESEB, Wang Q, et al. A blockchain-based crowdsourcing loan platform for funding higher education in developing countries. IEEE Access. 2023;11:24162–74.
  7. Sánchez J, Andreu-Vázquez C, Lesmes M, García-Lecea M, Rodríguez-Martín I, Tutor AS. Quantitative and qualitative evaluation of a learning model based on workstation activities. PLoS One. 2020;15(8):e0236940.
  8. Zhong Q, Wang Q, Liu J. Combining knowledge, multi-modal fusion for meme classification. In: MultiMedia Modeling: 28th International Conference, MMM 2022, Phu Quoc, Vietnam, June 6–10, 2022, Proceedings, Part I. 2022. p. 599–611.
  9. Bird JJ, Faria DR, Premebida C, Ekart A, Vogiatzis G. Look and listen: a multi-modality late fusion approach to scene classification for autonomous machines. In: 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). 2020. p. 10380–5. https://doi.org/10.1109/iros45743.2020.9341557
  10. Saha T, Patra A, Saha S, Bhattacharyya P. Towards emotion-aided multi-modal dialogue act classification. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020. https://doi.org/10.18653/v1/2020.acl-main.402
  11. Chen Z, Luo S. Evaluate teaching quality of physical education using a hybrid multi-criteria decision-making framework. PLoS One. 2023;18(2):e0280845. pmid:36795779
  12. Kiela D, Bhooshan S, Firooz H, Perez E, Testuggine D. Supervised multimodal bitransformers for classifying images and text. arXiv preprint. 2019. https://arxiv.org/abs/1909.02950
  13. Kumar D, Kumar N, Mishra S. QUARC: Quaternion Multi-Modal Fusion Architecture for Hate Speech Classification. In: 2021 IEEE International Conference on Big Data and Smart Computing (BigComp). 2021. p. 346–9. https://doi.org/10.1109/bigcomp51126.2021.00075
  14. Wang Q, Zhu J, Pan C, Shi J, Meng C, Guo H. Dual trustworthy mechanism for illness classification with multi-modality data. In: 2023 IEEE International Conference on Data Mining Workshops (ICDMW). 2023. p. 356–62. https://doi.org/10.1109/icdmw60847.2023.00051
  15. Zhang Y, Hu N, Li Z, Ji X, Liu S, Sha Y, et al. Lumbar spine localisation method based on feature fusion. CAAI Trans on Intel Tech. 2022;8(3):931–45.
  16. Huang Y, Du C, Xue Z, Chen X, Zhao H, Huang L. What makes multi-modal learning better than single (provably). Advances in Neural Information Processing Systems. 2021;34:10944–56.
  17. Arevalo J, Solorio T, Montes-y-Gómez M, González FA. Gated multimodal units for information fusion. arXiv preprint. 2017. https://arxiv.org/abs/1702.01992
  18. Tonge A, Caragea C. Dynamic deep multi-modal fusion for image privacy prediction. In: The World Wide Web Conference. 2019. p. 1829–40. https://doi.org/10.1145/3308558.3313691
  19. Wu Y, Zhan P, Zhang Y, Wang L, Xu Z. Multimodal fusion with co-attention networks for fake news detection. In: Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021. 2021. p. 2560–9.
  20. You R, Guo Z, Cui L, Long X, Bao Y, Wen S. Cross-modality attention with semantic graph embedding for multi-label classification. AAAI. 2020;34(07):12709–16.
  21. Zhang Z, Wang Z, Li X, Liu N, Guo B, Yu Z. ModalNet: an aspect-level sentiment classification model by exploring multimodal data with fusion discriminant attentional network. World Wide Web. 2021;24(6):1957–74.
  22. Han Z, Zhang C, Fu H, Zhou JT. Trusted multi-view classification. arXiv preprint. 2021.
  23. Han Z, Yang F, Huang J, Zhang C, Yao J. Multimodal dynamics: dynamical fusion for trustworthy multimodal classification. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022. p. 20707–17.
  24. Corbière C, Thome N, Bar-Hen A, Cord M, Perez P. Addressing failure prediction by learning model confidence. Advances in Neural Information Processing Systems. 2019;32.
  25. Zahavy T, Magnani A, Krishnan A, Mannor S. Is a picture worth a thousand words? A deep multi-modal fusion architecture for product classification in e-commerce. arXiv preprint. 2016. https://arxiv.org/abs/1611.09534
  26. Xu C, Zhao W, Zhao J, Guan Z, Song X, Li J. Uncertainty-aware multiview deep learning for Internet of Things applications. IEEE Trans Ind Inf. 2023;19(2):1456–66.
  27. Gallo I, Calefati A, Nawaz S, Janjua MK. Image and encoded text fusion for multi-modal classification. In: 2018 Digital Image Computing: Techniques and Applications (DICTA). 2018. p. 1–7.
  28. Kiela D, Grave E, Joulin A, Mikolov T. Efficient large-scale multi-modal classification. AAAI. 2018;32(1).
  29. Mittal A, Dahiya K, Malani S, Ramaswamy J, Kuruvilla S, Ajmera J. Multi-modal extreme classification. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022. p. 12393–402.
  30. Han X, Zhang Z, Ding N, Gu Y, Liu X, Huo Y, et al. Pre-trained models: past, present and future. AI Open. 2021;2:225–50.
  31. Devlin J, Chang MW, Lee K, Toutanova K. BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint. 2018. https://arxiv.org/abs/1810.04805
  32. Radford A, Kim JW, Hallacy C, Ramesh A, Goh G, Agarwal S, et al. Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning. PMLR; 2021. p. 8748–63.
  33. Audebert N, Herold C, Slimani K, Vidal C. Multimodal deep networks for text, image-based document classification. In: Machine Learning and Knowledge Discovery in Databases: International Workshops of ECML PKDD 2019, Würzburg, Germany, September 16–20, 2019, Proceedings, Part I. 2020. p. 427–43.
  34. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al. Attention is all you need. Advances in Neural Information Processing Systems. 2017;30.
  35. Gal Y, Ghahramani Z. Dropout as a Bayesian approximation: representing model uncertainty in deep learning. In: International Conference on Machine Learning. 2016. p. 1050–9.
  36. Huang X, Kroening D, Ruan W, Sharp J, Sun Y, Thamo E, et al. A survey of safety and trustworthiness of deep neural networks: verification, testing, adversarial attack and defence, and interpretability. Computer Science Review. 2020;37:100270.
  37. Psaros AF, Meng X, Zou Z, Guo L, Karniadakis GE. Uncertainty quantification in scientific machine learning: methods, metrics, and comparisons. Journal of Computational Physics. 2023;477:111902.
  38. Van Amersfoort J, Smith L, Teh YW, Gal Y. Uncertainty estimation using a single deep deterministic neural network. In: International Conference on Machine Learning. PMLR; 2020. p. 9690–700.
  39. Guo C, Pleiss G, Sun Y, Weinberger KQ. On calibration of modern neural networks. In: International Conference on Machine Learning. PMLR; 2017. p. 1321–30.
  40. Wang Q, Feng Y, Wang Y, Li B, Wen J, Zhou X, et al. AntiFormer: graph enhanced large language model for binding affinity prediction. Brief Bioinform. 2024;25(5):bbae403. pmid:39162312
  41. Pan S, Luo L, Wang Y, Chen C, Wang J, Wu X. Unifying large language models and knowledge graphs: a roadmap. arXiv preprint. 2023.
  42. Li Q, Qiu D, Zhang Y, Li B, He Y, Woodland PC, et al. Confidence estimation for attention-based sequence-to-sequence models for speech recognition. In: ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2021. p. 6388–92. https://doi.org/10.1109/icassp39728.2021.9414920
  43. Kastanos A, Ragni A, Gales MJF. Confidence estimation for black box automatic speech recognition systems using lattice recurrent neural networks. In: ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2020. p. 6329–33. https://doi.org/10.1109/icassp40776.2020.9053264
  44. Rizzo M, Conati C, Jang D, Hu H. Evaluating the faithfulness of saliency-based explanations for deep learning models for temporal colour constancy. arXiv preprint. 2022. https://arxiv.org/abs/2211.07982
  45. Zhou C, Zhu J, Wang Q, Meng C, Pan C, Shi J. Enhancing question generation with syntactic details and multi-level attention mechanism. In: 2023 7th Asian Conference on Artificial Intelligence Technology (ACAIT). 2023. p. 557–62. https://doi.org/10.1109/acait60137.2023.10528429
  46. Chambon P, Bluethgen C, Langlotz CP, Chaudhari A. Adapting pretrained vision-language foundational models to medical imaging domains. 2022.
  47. Wang Q, Zhu J, Shu H, Asamoah KO, Shi J, Zhou C. GUDN: a novel guide network with label reinforcement strategy for extreme multi-label text classification. Journal of King Saud University - Computer and Information Sciences. 2023;35(4):161–71.
  48. Qian S, Wang J, Hu J, Fang Q, Xu C. Hierarchical multi-modal contextual attention network for fake news detection. In: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2021. p. 153–62. https://doi.org/10.1145/3404835.3462871
  49. Wu J, Zhao J, Xu J. HGLNET: a generic hierarchical global-local feature fusion network for multi-modal classification. In: 2022 IEEE International Conference on Multimedia and Expo (ICME). 2022. p. 1–6. https://doi.org/10.1109/icme52920.2022.9859834
  50. Deci EL, Ryan RM. Self-determination theory. Handbook of theories of social psychology. 2012. p. 416–36.
  51. Beck AT, Haigh EAP. Advances in cognitive theory and therapy: the generic cognitive model. Annu Rev Clin Psychol. 2014;10:1–24. pmid:24387236
  52. Bandura A, Wessels S. Self-efficacy. Cambridge: Cambridge University Press; 1997.