A novel Z-number based multi-stage assessment framework for problem-based learning in practical courses

  • Limin Yu,

    Roles Conceptualization, Methodology

    Affiliation Shandong Jiaotong University, Jinan, China

  • Zhe Chen,

    Roles Conceptualization, Data curation, Formal analysis, Funding acquisition, Methodology, Writing – original draft, Writing – review & editing

    czsdjtu@outlook.com

    Affiliation Shandong Jiaotong University, Jinan, China

  • Zeyu Qin

    Roles Conceptualization, Formal analysis, Writing – original draft

    Affiliation Shandong Jiaotong University, Jinan, China

Abstract

Confronting the growing disparity between standardized evaluation systems and personalized competency development in practical education, this study proposes a novel data-driven framework integrating Multi-Criteria Group Decision Making (MCGDM) to enhance curriculum assessment within Problem-Based Learning (PBL) environments. Specifically, the framework incorporates Z-number theory to effectively capture the uncertainty and reliability inherent in expert evaluations, addressing the challenges posed by subjective and imprecise human judgments. The well-established Multi-Attributive Border Approximation Area Comparison (MABAC) method is extended through the integration of Z-number modeling to enhance robustness in ranking and decision-making processes. A multi-stage assessment process is designed, encompassing a Pass check phase, a Score determination phase, and a Grading phase, aligned with the pedagogical principles of PBL. Furthermore, a hybrid Entropy-Criteria Importance Through Intercriteria Correlation (CRITIC) weighting scheme under Z-number representation is introduced to objectively determine the importance of evaluation criteria, considering both information dispersion and inter-criteria correlation. The proposed method is applied to a real-life case study involving 24 students, evaluated by multiple stakeholder groups, including peer teams, instructors, and industry experts. Sensitivity and comparative analyses suggest that the proposed framework provides a robust and reliability-aware assessment procedure within the studied course context. The findings indicate its potential to support a more transparent and structured evaluation process, although broader generalizability requires further validation across cohorts, courses, and institutions.

1 Introduction

Problem-Based Learning (PBL), originally developed in the 1960s to reform medical education, has since evolved into a transformative pedagogical approach across a wide range of disciplines, including engineering [1]. Its focus on inquiry, collaboration, and problem-solving aligns with modern engineering education and the UN Sustainable Development Goals (SDGs) [2]. In recent years, PBL has been increasingly integrated into engineering curricula to foster interdisciplinary competencies, technological innovation, and systems thinking. Applications range from AI-based waste management systems and wireless sensor networks in smart agriculture to virtual reality simulations for green manufacturing processes. These implementations reflect PBL’s growing relevance as a means of equipping students with critical thinking, technical fluency, and collaborative skills essential for tackling complex, real-world engineering problems. The following literature review further explores the theoretical foundations, implementation strategies, and educational outcomes of PBL within engineering disciplines.

With support from the SDGs, the integration of emerging technologies into engineering education has advanced interdisciplinary competency development through PBL. For example, in an AI course by Vargas, students worked on a smart waste-sorting system project, optimizing Convolutional Neural Network (CNN) algorithms and addressing data bias, which improved their AI ethics awareness by 25% [3]. Similarly, Wang and Cui’s Wireless Sensor Network (WSN) project in agricultural engineering enhanced students’ ability to integrate hardware design with cloud-based data management [4]. In green manufacturing, Chiou’s team used a Virtual Reality (VR) platform to optimize a solar panel production line, fostering a systems-level engineering mindset [5].

Research in educational psychology confirms that structured collaboration is key to PBL’s success, grounded in Vygotsky’s sociocultural theory and Chickering and Gamson’s work on best practices in education [6]. Teamwork is essential in PBL, as students tackle complex problems requiring diverse perspectives and collective reasoning. This collaboration enhances critical thinking and mirrors real-world professional contexts, where problem-solving is collaborative. Studies show that peer interaction in PBL increases motivation, accountability, and knowledge retention, making teamwork not just a necessity but a powerful pedagogical tool for academic and professional growth.

While PBL encourages students to learn and solve problems collaboratively, its workflow can be characterized as a constructive, self-directed, collaborative, and contextual process [7]. Within this framework, each student is expected to engage in independent learning. The typical workflow of PBL is illustrated in Fig 1 and comprises six main stages. Before the process begins, students are organized into small groups. In the problem presentation stage, students are introduced to a complex, ill-structured, and open-ended problem. They then proceed to clarify the problem, identifying its inputs, expected outputs, the necessary knowledge to acquire, and formulating initial hypotheses. During the self-directed learning stage, each student or subgroup conducts independent research aligned with the established learning objectives. Subsequently, students reconvene to share their findings; the group engages in critical discussions to synthesize the newly acquired knowledge, revise initial hypotheses, and deepen their problem analysis through collaborative reasoning. Based on these discussions and the accumulated knowledge, students develop a solution, explanation, or response to the problem, which may take the form of a report, presentation, model, or other deliverable. Finally, students typically engage in reflective activities focused on the knowledge acquired, the learning methods employed, group dynamics, and the strengths and limitations of their proposed solutions. Self-assessment and peer assessment are often incorporated into this final stage to reinforce critical reflection and continuous improvement.

As the final stage of the PBL pedagogical approach, evaluation serves not only as a mechanism for feedback and culmination but, more importantly, as a critical tool for fostering deep thinking and metacognitive reflection in students [8,9]. Given PBL’s emphasis on self-directed inquiry and collaborative learning, traditional knowledge-recall assessments are insufficient for comprehensively measuring learning outcomes. Instead, a well-designed evaluation system should emphasize students’ holistic performance throughout the problem-solving process, encompassing analytical thinking, self-regulated learning, teamwork proficiency, and the application of integrated knowledge [10]. By combining formative and summative assessment strategies, educators can evaluate learners’ progress more comprehensively while providing timely, constructive feedback. Simultaneously, students gain clearer insights into their strengths and weaknesses through reflective practices, thus promoting continuous learning improvement. Moreover, assessment functions as a motivational catalyst, enhancing students’ engagement with learning objectives and strengthening their academic commitment. In this regard, assessment in PBL transcends its traditional evaluative role to become a pivotal driver of deeper learning engagement and comprehensive competency development [11].

Despite the widespread recognition of assessment’s pivotal role in fostering deep learning within PBL environments [12], several critical gaps remain in current evaluation practices:

  • The multi-source assessment paradigm, involving instructors, tutors, industry experts, and peer evaluators, lacks robust mechanisms to reconcile heterogeneous data types and varying authority weights among stakeholders, often resulting in oversimplified averaging approaches that dilute domain-specific expertise. Moreover, the varying reliability of different evaluators is seldom accounted for, leading to potential biases and inconsistent aggregation of assessment results [4].
  • While team collaboration constitutes a cornerstone of PBL pedagogy, existing evaluation systems fail to adequately decouple individual contributions from group outcomes, perpetuating measurement uncertainties caused by social loafing artifacts and cognitive biases such as halo effects and anchoring biases [13,14].
  • Prevailing assessment frameworks prioritize static competency snapshots over dynamic capability trajectories, offering limited insights into learners’ metacognitive evolution across iterative problem-solving phases [12].

Although prior studies have advocated mixed-method assessments, few provide operationalizable models addressing these interconnected challenges through mathematically rigorous yet pedagogically meaningful integrations. This gap underscores the urgent need for an adaptive evaluation architecture that systematically harmonizes multi-stakeholder perspectives, quantifies latent competencies, and mitigates subjectivity risks, an imperative this study addresses through its novel MCGDM-driven assessment framework.

The remainder of this paper is structured as follows. Section 2 reviews the theoretical foundations underpinning this study, including PBL in engineering education, MCGDM in course evaluation, Z-number theory, and the Multi-Attributive Border Approximation Area Comparison (MABAC) method. Section 3 introduces the key definitions and the proposed grading framework based on the Z-number-enhanced MABAC (Z-MABAC) method, which consists of the Pass Check phase, the Score Determination phase, and the Grading phase, in alignment with the evaluation process of PBL. Section 4 presents the implementation of the proposed framework through a real-world case study conducted in an educational context, illustrating the step-by-step application of the method. Section 5 provides a comparative analysis with other established Z-number-based methods to examine the robustness and effectiveness of the proposed approach. Finally, Section 6 summarizes the major findings, practical implications, and future research directions of this work.

2 Literature review

This section is divided into four parts. Section 2.1 reviews PBL in engineering education, Section 2.2 surveys the application of MCGDM in course evaluation, Section 2.3 introduces the concept and development of Z-numbers, and Section 2.4 presents the MABAC method and its integration with Z-numbers to address uncertainty and reliability in decision-making contexts. A summary of the key literature reviewed is presented in Table 1.

Table 1. Summary of key literature in PBL, MCGDM, Z-numbers, and MABAC methods.

https://doi.org/10.1371/journal.pone.0349114.t001

2.1 PBL in engineering education

Problem-Based Learning (PBL) has gained significant momentum in engineering education in recent years, driven largely by the demands of Industry 4.0 and the increasing emphasis on interdisciplinary, experiential learning. In biomedical engineering, a longitudinal three-year study conducted at Georgia Tech and Emory University implemented an advanced PBL framework tailored for AI applications in biomedicine. The program involved 92 undergraduate and 156 graduate students, resulting in notable outcomes including 16 student publications and the development of computational methods addressing real-world biomedical problems [15]. Similarly, in electromechanical engineering, research introduced a PBL model centered on real-world problem-solving and critical thinking. Student feedback highlighted high levels of satisfaction, especially in areas such as problem analysis, solution design, and the integration of theoretical knowledge with industrial practice [16]. Industrial engineering programs have also embraced PBL methodologies; for instance, the “Insights 4.0” initiative combined PBL with project-based learning, leading to increased student engagement and improved academic performance. Participants reported higher final grades alongside enhanced collaboration skills and both technical and soft competencies, underscoring PBL’s effectiveness in this domain [17]. Moreover, during the COVID-19 pandemic, engineering programs adapted by integrating at-home laboratories with PBL approaches to maintain educational quality [18]. This strategy not only preserved academic performance but also elevated student motivation and self-efficacy, demonstrating PBL’s efficacy in remote and hybrid learning environments.

Collectively, these examples underscore PBL’s versatility and profound impact within engineering education, equipping students with critical thinking, practical problem-solving abilities, and the adaptability essential for navigating today’s complex, interdisciplinary challenges. As educational demands evolve, PBL continues to stand out as a robust framework for preparing future-ready engineers.

2.2 MCGDM in course evaluation

MCGDM has proven effective in addressing the complexity inherent in student course evaluation. Traditional evaluation methods often rely on single or limited criteria, which fail to capture the multidimensional nature of learning outcomes. In contrast, MCGDM provides a structured framework that integrates perspectives from multiple stakeholders, such as students, instructors, and educational experts, thereby enabling a more comprehensive and equitable assessment. For example, an empirical study at Gulf University employed the Analytic Hierarchy Process (AHP) to evaluate student contributions in cooperative learning, successfully resolving conflicts in subjective assessments and facilitating more accurate credit distribution [19]. Additionally, MCGDM methods have been applied to optimize teaching plans and grading processes in university courses, as demonstrated in studies on energy market education and chemistry performance evaluations [20].

A key strength of MCGDM lies in its capacity to simultaneously consider a broad range of criteria, including cognitive skills, creativity, teamwork, and presentation quality, without reducing evaluation outcomes to oversimplified metrics. Techniques such as AHP, Technique for Order Preference by Similarity to Ideal Solution (TOPSIS), and Simple Additive Weighting (SAW) enable the structured weighting and ranking of these diverse factors.

This approach is exemplified by Kara and Yıldırım’s study [25], which employed AHP and TOPSIS to assess high school chemistry students’ performance tasks, ensuring balanced consideration of psychomotor, cognitive, and affective domains. Similarly, Yıldırım et al. [21] developed an integrated intuitionistic Z-AHP and Z-TOPSIS methodology for hydrogen storage technology selection, explicitly incorporating both judgment reliability and decision-makers’ hesitancy under intuitionistic fuzzy environments. Moreover, MCGDM frameworks have demonstrated enhancements in fairness and transparency, particularly within group-based learning environments and interdisciplinary evaluations [26,27]. These methods also provide systematic alignment between teaching outcomes and program objectives, a critical consideration in outcome-based education models [28].

To further address the fuzziness and subjectivity inherent in human evaluations, researchers have integrated fuzzy logic, hesitant fuzzy sets, and probabilistic reasoning into MCGDM models. For instance, the Fuzzy Ordered Priority Approach for Multi-Criteria decision-making (FOPA-MC) model combines fuzzy logic with group decision-making for peer evaluation, effectively capturing the vagueness of subjective inputs [22]. Similarly, probabilistic hesitant fuzzy TODIM (an acronym in Portuguese for Interactive and Multi-Criteria Decision Making) and Evaluation based on Distance from Average Solution (EDAS) methods have been applied in evaluating Chinese language instruction quality, yielding promising results in mitigating evaluator hesitation and ambiguity [29].

Despite these advancements, several challenges remain. Fuzzy MCGDM models often involve high computational complexity and can be sensitive to initial assumptions such as criteria weights or decision-maker preferences. Furthermore, the need for adequate training of evaluators to use these advanced tools poses a risk to the consistency and reliability of assessment outcomes. A recent systematic review highlights the necessity for standardized frameworks and broader empirical validation across diverse disciplines [30].

In response to these challenges, researchers are exploring more robust tools for managing uncertain information. Among these, Z-number theory stands out for its ability to handle incomplete data while explicitly characterizing the reliability of information. Its unique capacity to fuse fuzzy and probabilistic information offers a flexible and accurate framework for modeling uncertainty, thereby more faithfully reflecting the complexity of real-world decision-making scenarios [23].

2.3 Z-number

In the realm of MCGDM, the complexity of decision structures and evaluation criteria often complicates the process, even for experts who are thoroughly familiar with the available alternatives. To address this challenge, the concept of the Z-number has been introduced to enhance the representation and measurement of the reliability associated with linguistic information. A Z-number is defined as $Z = (A, B)$, where $A$ represents the expert’s evaluation or preference, and $B$ denotes the degree of reliability or possibility associated with $A$. This dual-dimensional structure provides a novel approach to modeling both uncertainty and reliability within decision-making contexts [23].

To promote the understanding and application of Z-numbers, extensive research has been conducted, leading to the development of various extended forms, such as the Z*-number, Z-advanced number, and Z+-number, which have been applied to a range of MCGDM problems [31]. However, due to the inherent complexity of directly manipulating and solving Z-numbers in practical applications, scholars have explored transforming Z-numbers into more tractable representations, such as Triangular Fuzzy Numbers (TFNs) and trapezoidal fuzzy numbers, thereby leveraging existing fuzzy set methodologies to solve decision-making problems more efficiently [24].

Kang delved into Z-number operations and established a transformation rule to convert Z-numbers into fuzzy numbers and trapezoidal fuzzy numbers [32]. Božanić transformed Z-numbers into TFNs and applied them in camp location selection by integrating level-based weight assessment and multi-attributive ideal-real comparative analysis [33]. Garg proposed a method for converting Z-numbers into granulated Z-numbers, further expanding the application scope of Z-numbers [34].

These studies indicate that converting Z-numbers into TFNs or trapezoidal fuzzy numbers has become a widely accepted and validated approach for facilitating their practical application in decision-making. This transformation not only simplifies the computational process but also ensures the effective utilization of Z-number theory in various contexts. While Z-numbers have demonstrated remarkable capabilities in addressing uncertainty and reliability, they still fall short in fully capturing the subjectivity of Decision-Makers (DMs) in certain scenarios, such as mortise and tenon structure selection. To comprehensively address the intertwined issues of subjectivity, uncertainty, and reliability, this study integrates rough numbers with Z-numbers.
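The conversion idea reviewed above can be illustrated with a minimal sketch. It assumes a commonly used centroid-based rule of the kind described for Z-number-to-fuzzy-number conversion: the reliability part is reduced to a crisp value by its centroid, and the evaluation part is scaled by the square root of that value. The class names `TFN` and `ZNumber`, the function names, and the sample rating are illustrative assumptions, and the rule shown here is not necessarily the exact conversion adopted later in this paper.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass(frozen=True)
class TFN:
    """Triangular fuzzy number (lower, modal, upper)."""
    l: float
    m: float
    u: float


@dataclass(frozen=True)
class ZNumber:
    """Z = (A, B): A is the evaluation, B its reliability, both as TFNs."""
    A: TFN
    B: TFN


def reliability_centroid(b: TFN) -> float:
    # Centroid of a triangular membership function: (l + m + u) / 3.
    return (b.l + b.m + b.u) / 3.0


def z_to_tfn(z: ZNumber) -> TFN:
    # Assumed rule: fold the crisp reliability alpha into A via sqrt(alpha).
    k = sqrt(reliability_centroid(z.B))
    return TFN(k * z.A.l, k * z.A.m, k * z.A.u)


# Hypothetical rating: evaluation (0.6, 0.75, 0.9) with reliability (0.7, 0.8, 0.9).
z = ZNumber(A=TFN(0.6, 0.75, 0.9), B=TFN(0.7, 0.8, 0.9))
converted = z_to_tfn(z)
```

Under this rule, a less reliable judgment shrinks the evaluation toward zero, which is how reliability ends up influencing the downstream fuzzy arithmetic.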

2.4 Multi-attributive border approximation area comparison (MABAC) method

Since the MABAC method was introduced by Pamučar and Ćirović, it has been applied to Multi-Attribute Group Decision-Making (MAGDM) problems, such as the selection of transport and processing resources at logistics centers [35]. Its efficiency was demonstrated by comparing it with the COPRAS, TOPSIS, MOORA, and VIKOR methods. Subsequently, Pamučar improved the original MABAC technique, enhancing its applicability in more complex decision-making scenarios [36]. Zhao integrated cumulative prospect theory into the original MABAC approach and developed the cumulative prospect theory IF-MABAC method [37]. Liu and Zhang proposed a MABAC method for dealing with complex and uncertain decision-making situations, which considers the distance between alternatives and the Border Approximation Area (BAA) [38]. Jana developed an MCGDM technique based on a traditional MABAC model with bipolar fuzzy numbers by introducing two aggregation operators [39].

In the MABAC method, for each criterion, the Border Approximation Area (BAA) is determined by calculating the geometric mean of the corresponding elements in the weighted decision matrix. Alternatives whose criterion values fall within the upper approximation area are considered close to the ideal solution for that specific criterion, whereas those located in the lower approximation area are regarded as closer to the anti-ideal solution, as illustrated in Fig 2. After calculating the distances from the BAA across all criteria, the arithmetic mean of these distances is computed for each alternative to derive its overall performance score. Finally, alternatives are ranked based on these aggregated values, with higher arithmetic means indicating superior performance.

Fig 2. The upper and the lower approximation area.

https://doi.org/10.1371/journal.pone.0349114.g002
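The BAA-based ranking procedure described above can be sketched for crisp (non-fuzzy) inputs as follows. This is a minimal illustration rather than the Z-number-based variant proposed in this paper: the function name `mabac_scores`, the min-max normalisation, the weighting form w_j(n_ij + 1), and the sample data are assumptions drawn from the conventional crisp MABAC formulation. Distances are averaged over the criteria, following the description in the text; the conventional formulation sums them instead, which yields the same ranking.

```python
from math import prod


def mabac_scores(matrix, weights, benefit):
    """Crisp MABAC: rows are alternatives, columns are criteria."""
    m = len(matrix)
    cols = list(zip(*matrix))

    # Steps 1-2: min-max normalise each criterion, then weight as w_j * (n_ij + 1).
    weighted = []
    for row in matrix:
        v_row = []
        for j, x in enumerate(row):
            lo, hi = min(cols[j]), max(cols[j])
            span = hi - lo
            if span == 0:
                n_ij = 0.0  # constant column carries no ranking information
            elif benefit[j]:
                n_ij = (x - lo) / span
            else:
                n_ij = (hi - x) / span
            v_row.append(weights[j] * (n_ij + 1.0))
        weighted.append(v_row)

    # Step 3: BAA per criterion = geometric mean of the weighted column.
    baa = [prod(col) ** (1.0 / m) for col in zip(*weighted)]

    # Steps 4-5: signed distance to the BAA, averaged over the criteria.
    n = len(weights)
    return [sum(v - g for v, g in zip(v_row, baa)) / n for v_row in weighted]


# Hypothetical data: three students, two benefit-type criteria, equal weights.
scores = mabac_scores([[90, 80], [70, 60], [80, 70]], [0.5, 0.5], [True, True])
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
```

A positive score places an alternative in the upper approximation area on balance, and a negative score places it in the lower one.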

Integrating Z-number theory into the MABAC method further strengthens the robustness of the evaluation process under conditions of uncertainty. The Z-number-based MABAC not only considers experts’ preference information but also explicitly incorporates the reliability of these judgments. This enhancement enables more transparent, interpretable, and reliable decision outcomes compared with conventional MABAC or other traditional evaluation methods [24].

3 Methodology

This study is a retrospective analysis of student assessment data collected in a PBL practical course delivered as part of the regular curriculum. The study protocol for the retrospective research use of these records was reviewed and approved by the ethics committee of the School of Art & Design, Shandong Jiaotong University (final approval dated 30 October 2025). Written informed consent was obtained from all participating students.

The corresponding author, who served as the course instructor, obtained permission from the relevant academic unit and administration before using these assessment records for research purposes. During normal course delivery and grade administration, the corresponding author necessarily had access to identifiable student information (including names and student ID numbers) as part of routine educational management, independent of the present research.

After the course had been completed and all grades had been finalised, and before any research analyses were initiated, the corresponding author exported the assessment records and anonymised them specifically for research use. In this anonymisation process, all direct identifiers were removed and replaced with randomly generated study codes; class and group labels were recoded, and any potentially identifying free-text information was removed or aggregated. The re-identification key linking study codes to individual students was stored securely by the corresponding author and was not shared with the other co-authors. Thus, for the purposes of this study, all co-authors only accessed a fully anonymised dataset and did not handle any identifiable information once the study formally started.

The study protocol, covering the retrospective use of these anonymised records, was reviewed and approved by the Ethics Committee of the relevant institution, which functions as an accredited Institutional Review Board (IRB). The consent form provided to students explained the study’s purpose, procedures, potential risks, and confidentiality safeguards. All procedures comply with the ethical standards of the Declaration of Helsinki and the Belmont Report.

In this PBL course, student performance is categorized into five distinct grades: A, B, C, D, and E, corresponding to score intervals of [90, 100], [80, 90), [70, 80), [60, 70), and below 60 (Fail), respectively. To develop a comprehensive evaluation system tailored for practice-oriented courses in application-focused universities, it is necessary to first define the overarching structure of the practical learning process. Considering that practical learning involves the development of multiple competencies, this study adopts a hierarchical analytical framework, which is commonly used to evaluate student performance across various stages and dimensions of learning in practice-based education.
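The score intervals listed at the start of this paragraph are half-open, so boundary scores such as 80 or 90 fall into the higher grade. A minimal sketch of that mapping is given below; the function name `grade_from_score` and the range check are illustrative assumptions, and the sketch covers only the interval lookup, not the multi-stage procedure that produces the score.

```python
def grade_from_score(score: float) -> str:
    """Map a 0-100 course score to the five-level grade scale."""
    if not 0 <= score <= 100:
        raise ValueError("score must lie in [0, 100]")
    if score >= 90:
        return "A"  # [90, 100]
    if score >= 80:
        return "B"  # [80, 90)
    if score >= 70:
        return "C"  # [70, 80)
    if score >= 60:
        return "D"  # [60, 70)
    return "E"      # below 60 (Fail)
```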

As illustrated in Fig 3, the evaluation framework is constructed around two primary components: continuous assessment and final assessment. Continuous assessment prioritizes students’ ongoing engagement throughout the course, focusing on process-oriented behaviors such as teamwork, class participation, problem-solving, and individual reflection. In contrast, the final assessment emphasizes the quality of project deliverables and comprehensive individual performance at the course’s conclusion. After extensive literature review, expert consultation, and consideration of operational features of application-oriented institutions, a hierarchical assessment model was developed. At the top tier (Level 1), the model encapsulates overall course evaluation. This is decomposed at Level 2 into continuous and final assessment categories. Each category is further broken down into sub-categories and specific measurable indicators at Level 3, which constitute the criteria evaluated by experts. Each sub-category designates specific evaluators, including individual instructors, instructor teams, industry experts, and peer groups, to ensure balanced and multifaceted assessment perspectives.

Fig 3. The multi-level analytic hierarchy framework.

https://doi.org/10.1371/journal.pone.0349114.g003

However, it is important to acknowledge that despite the structured multi-criteria assessment framework and diversified evaluator composition, challenges persist in ensuring the reliability of subjective judgments, particularly those made by student peers. In practice-oriented courses that involve Peer Assessment (PA) as a component of continuous evaluation, students often lack sufficient domain expertise and evaluative consistency. Their assessments are prone to variability and uncertainty, influenced by factors such as limited judgmental maturity, interpersonal bias, or inconsistent interpretation of evaluation criteria. These limitations introduce epistemic and reliability-related uncertainties into the decision-making process.

To strengthen the theoretical justification of the Level 3 criteria, Table 2 provides a literature support mapping for all fifteen criteria in Fig 3. Each criterion from C1 to C15 is linked to a representative study in educational assessment and PBL-related research, ensuring that every criterion is grounded in established constructs and widely used evaluation dimensions in practice-oriented courses. By organizing the criteria under the Level 2 dimensions and associating each criterion with distinct supporting evidence, Table 2 clarifies the rationale for selecting these criteria and improves the transparency and academic rigor of the assessment design.

Table 2. Summary of supporting literature for the 15 Level 3 assessment criteria.

https://doi.org/10.1371/journal.pone.0349114.t002

To address these concerns, this study integrates Z-number theory into the evaluation framework. Unlike traditional TFNs, Z-numbers incorporate a reliability measure alongside linguistic information, enabling the model to capture both the vagueness and the perceived trustworthiness of peer evaluations. By combining Z-number theory with the MABAC method, the proposed approach enhances the robustness of the evaluation under uncertainty. The methodological process, from Z-number construction to final ranking via MABAC, is illustrated in Fig 4. This procedure consists of three main phases: Pass Check, Score Determination, and Grading.

Fig 4. Framework of PBL practical course student assessment.

https://doi.org/10.1371/journal.pone.0349114.g004

3.1 Pass check phase

The evaluation process begins with the Pass Check phase, during which instructors perform an initial qualitative screening of each student’s overall engagement and integrity throughout the course. This phase serves as a gatekeeping mechanism to ensure that only those who meet the fundamental requirements of participation are eligible for further assessment. The evaluation in this stage focuses on key behavioral indicators, including class attendance, task engagement, ethical conduct, and active involvement in team-based project activities.

Students who exhibit severe deficiencies are assigned a Fail status. These deficiencies may include chronic absenteeism, failure to complete assigned project tasks, or breaches of academic integrity. Examples of misconduct include plagiarism and other violations of ethical standards. These students are directly categorized into Grade E, which reflects a critically low level of engagement incompatible with the pedagogical goals of PBL courses. Notably, the assignment of an E grade is reserved for exceptional circumstances and is exceedingly rare in practice.

To uphold the fairness and credibility of the evaluation process, students who receive a Fail status are excluded from the subsequent performance ranking. This exclusion applies not only to their own assessments but also to the peer evaluations they provided. There are two main reasons for this. First, students who fail to meet basic participation standards likely have an incomplete or biased understanding of team dynamics and peer contributions. Second, including evaluations involving these students could introduce noise and distortion into the decision matrix, thereby compromising the consistency and reliability of the group decision-making results. Their removal ensures the integrity and robustness of the evaluation process.
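The exclusion rule described above can be sketched as a simple filter applied before the decision matrix is built. The sketch assumes that peer ratings are stored per (rater, ratee) pair of individual students; the function name `apply_pass_check`, the data layout, and the sample values are hypothetical, and the course may in practice record peer evaluations at team level.

```python
def apply_pass_check(students, failed, peer_ratings):
    """Drop failed students and every peer rating they gave or received."""
    failed = set(failed)
    eligible = [s for s in students if s not in failed]
    kept = {
        (rater, ratee): rating
        for (rater, ratee), rating in peer_ratings.items()
        if rater not in failed and ratee not in failed
    }
    return eligible, kept


# Hypothetical example: S3 fails the Pass Check.
students = ["S1", "S2", "S3"]
peer_ratings = {
    ("S1", "S2"): "good",
    ("S2", "S1"): "fair",
    ("S3", "S1"): "poor",  # given by a failed student: removed
    ("S1", "S3"): "fair",  # received by a failed student: removed
}
eligible, kept = apply_pass_check(students, ["S3"], peer_ratings)
```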

Given the subjectivity and complexity inherent in evaluating performance within PBL-oriented curricula, students who pass this initial screening proceed to the Score Determination phase, where their performance is systematically assessed using the proposed Z-MABAC method.

3.2 Score determination phase

Following the preliminary filtering conducted in the Pass Check phase, the evaluation process advances to the Score Determination phase. Given the diverse nature of PBL courses, a comprehensive evaluation must incorporate multiple perspectives and criteria. To this end, the evaluation framework includes 15 criteria spanning both continuous and final assessment components, each reflecting a specific dimension of student performance. These criteria are assessed by a heterogeneous group of evaluators, including the Instructor (main course teacher), an Instructor Group, an Industry Expert, and Peer groups composed of student teams.

To support a more structured, consistent, and reliability-aware evaluation of student performance in the studied PBL setting, the Score Determination phase adopts a two-stage decision-making framework based on Z-number theory and fuzzy MCGDM. The process begins with the transformation of linguistic evaluations into TFNs, followed by two major stages. In Stage 1, criteria weights are determined using instructor group evaluations. In Stage 2, the Z-MABAC method is used to rank the alternatives (students). This two-stage architecture is illustrated in Fig 5 and described in detail below.

3.2.1 Preparation: Converting Z-numbers into TFNs.

Subjective assessments often involve uncertainty and inconsistent reliability. This issue is particularly evident in peer evaluations. To address this challenge, the present study adopts the Z-number formalism. In this formalism, each evaluation is expressed as a Z-number $Z = (\tilde{A}, \tilde{B})$. Here $\tilde{A}$ denotes the restriction on the values of the variable: its membership function $\mu_{\tilde{A}}: X \to [0,1]$ assigns a value in this interval to each element $x \in X$, so that $\tilde{A} = \{(x, \mu_{\tilde{A}}(x)) \mid x \in X\}$. $\tilde{B}$ denotes the reliability measure, for which the membership function $\mu_{\tilde{B}}$ is defined analogously.

Before proceeding with the evaluation stages, all qualitative assessments expressed as Z-numbers must be converted into a computationally tractable format. This applies to two sources of evaluation data: the instructor group’s Z-number assessments of criteria importance (used in Stage 1) and the evaluators’ Z-number assessments of student performance across criteria (used in Stage 2).

As shown in Table 3, a linguistic-to-fuzzy mapping was established for both importance and reliability, ensuring consistent interpretation across expert groups. For instance, if an evaluator considers a criterion to be Very Important and is Very Certain about this judgment, the assessment is represented as a Z-number whose restriction part and reliability part are the TFNs that Table 3 assigns to the linguistic terms Very Important (FA) and Very Certain (VC), respectively.

Table 3. Z-number linguistic scale for criteria evaluation (important & reliable).

https://doi.org/10.1371/journal.pone.0349114.t003

The weight of each expert in the aggregation process is determined based on a two-level structure:

  • For each criterion, the relative weights of contributing evaluator groups are defined in accordance with the course’s instructional design.
  • Within each evaluator group, all individual members are assigned equal weight, reflecting a uniform contribution assumption.

For example, consider Criterion C1, which is jointly evaluated by peer students and the course instructor, each contributing 50% to the final score. In this context, peer evaluations are conducted within student teams, and each student is assessed only by their teammates, excluding self-assessment. Suppose a team consists of five members. For any given student in the team, the remaining four peers each contribute equally to the peer evaluation component. Therefore, each peer is assigned a weight of 0.125 (i.e., 0.5 ÷ 4), and the instructor holds a weight of 0.5 in the aggregation of this criterion.
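The two-level weighting in this example can be reproduced with a short sketch. The function name and the evaluator identifiers (`P1` to `P4`, `T`) are hypothetical; the only rule taken from the text is that each group's share of a criterion is split equally among that group's members.

```python
def individual_weights(group_shares, group_members):
    """Split each group's share of a criterion equally among its members.

    group_shares:  {group name: share of the criterion}, shares sum to 1
    group_members: {group name: list of evaluator ids in that group}
    """
    if abs(sum(group_shares.values()) - 1.0) > 1e-9:
        raise ValueError("group shares must sum to 1")
    weights = {}
    for group, share in group_shares.items():
        members = group_members[group]
        for evaluator in members:
            weights[evaluator] = share / len(members)
    return weights


# Criterion C1 for one student in a five-member team: the four teammates
# share the peer half, the course instructor holds the other half.
w = individual_weights(
    {"peers": 0.5, "instructor": 0.5},
    {"peers": ["P1", "P2", "P3", "P4"], "instructor": ["T"]},
)
print(w["P1"], w["T"])  # 0.125 0.5
```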

After standardizing all evaluations into Z-number representations, the assessment proceeds using the Z-MABAC method. This method facilitates aggregation, normalization, and comparison of Z-valued assessments across criteria and evaluators, while preserving both performance ratings and their associated reliability.

To make the Z-number applicable for computational processing, each Z-number can be transformed into a single representative TFN. This is achieved by computing the expected value of the fuzzy set modified by the reliability function, as proposed by Zadeh. The restriction and reliability components introduced above are represented by triangular membership functions, and the defuzzified value α can be calculated by the following equation.

(1)

Here, the membership function appearing in Eq. (1) is that of the fuzzy restriction component, modulated by the reliability component. This transformation enables the integration of both subjective evaluations and their associated confidence levels into a unified TFN format suitable for further processing in MABAC.

Then we have

(2)

The Z-number can then be converted into a TFN as follows:

(3)

This transformation adjusts the evaluated importance or performance by accounting for the evaluator’s confidence, thereby ensuring that assessments with higher uncertainty exert proportionally less influence on the final decision. To facilitate clarity and reproducibility, all relevant symbols and definitions used throughout the following stages are summarized in Table 4.
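As an illustration of how reliability can discount a rating, the following minimal sketch uses a conversion rule that is common in the Z-number literature: α is taken as the centroid of the reliability TFN, and the restriction TFN is scaled by √α. This rule is an assumption made for illustration and may differ in detail from Eqs. (1) to (3); the TFN values are hypothetical and are not taken from Table 3.

```python
import math


def tfn_centroid(tfn):
    """Centroid of a triangular fuzzy number (l, m, u)."""
    l, m, u = tfn
    return (l + m + u) / 3.0


def z_to_tfn(restriction, reliability):
    """Fold the reliability TFN into the restriction TFN.

    Assumed rule: alpha = centroid of the reliability TFN, and every
    vertex of the restriction TFN is multiplied by sqrt(alpha).
    """
    alpha = tfn_centroid(reliability)
    scale = math.sqrt(alpha)
    return tuple(scale * x for x in restriction)


# The same rating given with high and with low confidence (hypothetical TFNs).
confident = z_to_tfn((0.7, 0.9, 1.0), (0.8, 0.9, 1.0))
hesitant = z_to_tfn((0.7, 0.9, 1.0), (0.1, 0.3, 0.5))
print(confident[1] > hesitant[1])  # True
```

Under this rule the low-confidence rating shrinks more strongly, which is the behaviour the paragraph above describes.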

3.2.2 Stage 1: Criteria weight determination.

This stage determines the objective weights of the evaluation criteria based on instructor group assessments. To capture both the dispersion of criteria information and the mutual independence among criteria, a hybrid weighting method combining Entropy and CRITIC is adopted. In this hybrid scheme, the Entropy component reflects the degree of dispersion in the evaluation information, while the CRITIC component weights each criterion by its contrast intensity and its conflict with the other criteria. Specifically, CRITIC assigns a larger weight to a criterion whose values vary strongly and correlate weakly with those of the other criteria in the decision matrix [55,56]. The detailed computational steps are presented as follows.

Step 1: Each instructor provides a Z-number evaluation of the importance of every criterion. After converting these Z-numbers to TFNs, they are aggregated to form a consensus fuzzy value for each criterion:

(4)

where K is the number of experts in the instructor group.

Step 2: Each aggregated TFN is normalized across all criteria:

(5)

Step 3: Compute the entropy of each criterion by the following equations.

(6)(7)

Step 4: Compute the standard deviation and inter-criteria correlation to derive CRITIC weights by the following equations.

(8)(9)

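Step 5: The final hybrid weights combine the Entropy and CRITIC weights:

$$w_j = \lambda\, w_j^{E} + (1 - \lambda)\, w_j^{C}$$

where $\lambda \in [0,1]$ controls the fusion degree: λ close to 1 emphasizes the Entropy-based weights $w_j^{E}$, while λ close to 0 emphasizes the CRITIC-based weights $w_j^{C}$. It is typically set to $\lambda = 0.5$, the value used in the case study. The pseudocode of this stage is shown in Fig 6.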
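The weighting logic of Steps 3 to 5 can be sketched on crisp numbers. The sketch below is a simplified stand-in rather than the fuzzy computation of Eqs. (4) to (9): it assumes that the TFNs have already been reduced to positive crisp scores, that rows are instructors and columns are criteria, and that the textbook Entropy and CRITIC formulas apply. The matrix `X` is hypothetical, and the parameter `lam` plays the role of λ.

```python
import math


def entropy_weights(X):
    """Shannon-entropy weights for the columns of a positive matrix X."""
    rows, cols = len(X), len(X[0])
    divergence = []
    for j in range(cols):
        col = [X[i][j] for i in range(rows)]
        total = sum(col)
        p = [v / total for v in col]
        e = -sum(q * math.log(q) for q in p if q > 0) / math.log(rows)
        divergence.append(1.0 - e)  # more dispersion -> larger weight
    s = sum(divergence)
    return [d / s for d in divergence]


def pearson(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)


def critic_weights(X):
    """CRITIC weights: standard deviation times summed (1 - correlation)."""
    rows, cols = len(X), len(X[0])
    norm = []
    for j in range(cols):  # min-max normalise each column (no constant columns)
        col = [X[i][j] for i in range(rows)]
        lo, hi = min(col), max(col)
        norm.append([(v - lo) / (hi - lo) for v in col])
    info = []
    for j in range(cols):
        mean = sum(norm[j]) / rows
        sigma = math.sqrt(sum((v - mean) ** 2 for v in norm[j]) / (rows - 1))
        conflict = sum(1.0 - pearson(norm[j], norm[k]) for k in range(cols))
        info.append(sigma * conflict)
    s = sum(info)
    return [c / s for c in info]


def hybrid_weights(X, lam=0.5):
    """Convex combination: lam * entropy weight + (1 - lam) * CRITIC weight."""
    we, wc = entropy_weights(X), critic_weights(X)
    return [lam * a + (1.0 - lam) * b for a, b in zip(we, wc)]


# Hypothetical crisp importance scores: 5 instructors x 3 criteria.
X = [[0.9, 0.5, 0.7], [0.8, 0.6, 0.7], [0.9, 0.4, 0.6],
     [0.7, 0.7, 0.8], [0.8, 0.5, 0.9]]
w = hybrid_weights(X, lam=0.5)
```

Because both component vectors are normalised to sum to one, any convex combination of them also sums to one, so no further normalisation is needed after the fusion step.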

3.2.3 Stage 2: Students performance ranking via Z-MABAC method.

With the criteria weights determined, the Z-MABAC method is applied to rank student alternatives based on expert and peer evaluations. Each expert provides a Z-number for every student and criterion that the expert assesses. The Z-numbers are converted into TFNs according to Eqs. (1) to (3) at the preparation stage.

Step 1: Aggregate TFNs across experts. The TFNs for each student i and criterion j are combined using a weighted average:

(10)

This yields a unified fuzzy decision matrix containing one aggregated TFN per student and criterion.

Step 2: Normalize each element using fuzzy min–max normalization. For benefit criteria:

(11)

In this study, all 15 criteria are formulated as benefit-type, meaning that higher values indicate better performance. Therefore, normalization follows the benefit-type formula above for all criteria. For cost-type criteria, the numerator would instead be the criterion's maximum value minus the observed value.

Step 3: Multiply the normalized TFNs by the attribute weights derived from the Entropy-CRITIC hybrid method:

(12)

Step 4: Compute the BAA for each criterion:

(13)

Step 5: The final MABAC score is the distance from the BAA:

(14)

This distance can be calculated using the vertex method:

(15)

Here, and represent the TFNs of and respectively.

Step 6: Final Score and Ranking. Aggregate the distances across all criteria for each alternative:

(16)

Rank all students in descending order of this aggregated score. A higher score indicates stronger performance relative to the group under the weighted, uncertainty-aware multi-criteria framework. The pseudocode of this stage is shown in Fig 7.
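Steps 2 to 6 can be sketched for TFN inputs as follows. Several details are assumptions borrowed from the classic MABAC formulation rather than taken from Eqs. (11) to (16): the weighted value is computed as w·(n + 1), the BAA is the component-wise geometric mean over students, and the vertex distance is signed according to whether a student's weighted TFN lies above or below the BAA. The function name and the example TFNs are hypothetical.

```python
import math


def fuzzy_mabac(decision, weights):
    """Rank alternatives from a matrix of TFNs (l, m, u), benefit criteria only.

    decision[i][j] is the aggregated TFN of student i on criterion j.
    Returns (scores, order), where order lists student indices best-first.
    """
    m, n = len(decision), len(decision[0])
    scores = [0.0] * m
    for j in range(n):
        col = [decision[i][j] for i in range(m)]
        lo = min(t[0] for t in col)
        hi = max(t[2] for t in col)
        span = hi - lo  # assumed non-zero
        # Steps 2-3: min-max normalisation, then weighting (assumed w * (n + 1)).
        v = [tuple(weights[j] * ((x - lo) / span + 1.0) for x in t) for t in col]
        # Step 4: BAA as component-wise geometric mean over all students.
        g = tuple(math.prod(t[c] for t in v) ** (1.0 / m) for c in range(3))
        for i, t in enumerate(v):
            # Step 5: vertex distance to the BAA, signed by relative position.
            d = math.sqrt(sum((a - b) ** 2 for a, b in zip(t, g)) / 3.0)
            sign = 1.0 if sum(t) >= sum(g) else -1.0
            scores[i] += sign * d  # Step 6: sum over criteria
    order = sorted(range(m), key=lambda i: scores[i], reverse=True)
    return scores, order


# Three hypothetical students on two criteria.
decision = [
    [(0.7, 0.9, 1.0), (0.5, 0.7, 0.9)],
    [(0.3, 0.5, 0.7), (0.3, 0.5, 0.7)],
    [(0.1, 0.3, 0.5), (0.1, 0.3, 0.5)],
]
scores, order = fuzzy_mabac(decision, weights=[0.6, 0.4])
print(order)  # [0, 1, 2]
```

With signed distances, students above the border area on most criteria receive positive totals and those below receive negative totals, which matches the sign pattern of the scores reported later in the case study.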

3.3 The grading phase

With the exclusion of students who failed to meet the basic participation or ethical standards during the Pass Check phase, the remaining student cohort proceeds to the final Grading process. These students have demonstrated adequate engagement and are deemed eligible for ranking and grade assignments. Based on institutional records and empirical observations from prior course implementations, the distribution of grades among this qualified population generally adheres to a predefined ratio.

Let $\pi = (\pi_A, \pi_B, \pi_C, \pi_D)$ denote the proportion vector for Grades A through D, where $\pi_g \ge 0$ and $\sum_g \pi_g = 1$. In this study, the adopted distribution is $\pi = (0.20, 0.45, 0.30, 0.05)$. Let M be the total number of students who passed the initial screening and are eligible for ranking. Students are sorted in descending order based on their final composite performance scores obtained through the Z-MABAC method. The grade assignment is conducted based on percentile thresholds derived from the distribution vector π, with ranking intervals determined as shown in Table 5. For example, assuming M = 24, the resulting grade intervals are as follows:

Table 5. Grade assignment intervals based on percentile thresholds.

https://doi.org/10.1371/journal.pone.0349114.t005

  • Grade A: Students ranked 1st to 5th
  • Grade B: Students ranked 6th to 16th
  • Grade C: Students ranked 17th to 23rd
  • Grade D: Students ranked 24th
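The intervals listed above can be reproduced with a short sketch. It assumes that each cumulative percentage (20%, 65%, 95%, 100%) is converted into a last rank by rounding up; this rounding rule is an assumption that happens to match the listed intervals for M = 24, and the function name is hypothetical. Integer arithmetic is used so that floating-point drift cannot shift a boundary.

```python
def grade_boundaries(m, percents=(("A", 20), ("B", 45), ("C", 30), ("D", 5))):
    """Return {grade: (first_rank, last_rank)} for m ranked students.

    Each cumulative percentage is turned into a last rank by rounding up
    (assumed rule).
    """
    bounds, start, cumulative = {}, 1, 0
    for grade, pct in percents:
        cumulative += pct
        end = -(-cumulative * m // 100)  # integer ceil(cumulative * m / 100)
        if end >= start:
            bounds[grade] = (start, end)
        start = end + 1
    return bounds


print(grade_boundaries(24))
# {'A': (1, 5), 'B': (6, 16), 'C': (17, 23), 'D': (24, 24)}
```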

In order to emphasize the technical advantages of the Z-MABAC method compared to simpler evaluation approaches, Table 6 provides a comparative analysis across several key dimensions.

Table 6. Comparison of traditional and proposed assessment methods.

https://doi.org/10.1371/journal.pone.0349114.t006

This grading mechanism ensures a consistent and fair mapping between performance rankings and grade levels, grounded in institutional experience.

4 Case study

4.1 Background description and the pass check phase

This study was conducted in the context of the course “Design investigation and analysis”, involving 24 students divided into four peer groups of six students each. In the present study, no student was disqualified during this phase. All 24 students demonstrated sufficient participation, attendance, and ethical behavior, and therefore passed the initial screening. As a result, the performance evaluation proceeded with the full cohort, without the need to assign Grade E.

Fig 8 shows the fifteen assessment criteria, which cover key dimensions of student learning outcomes and behaviors, including teamwork, task completion, communication, attendance, problem analysis, stage report quality, individual contributions, reflection, innovation, feasibility, technical application, presentation quality, professional knowledge mastery, problem-solving skills, and communication abilities.

Fig 8. The assessment criteria in the course “Design investigation and analysis”.

https://doi.org/10.1371/journal.pone.0349114.g008

The evaluation process incorporated a multi-expert structure to capture diverse perspectives: the peer groups, a core instructor team consisting of five members (T1 to T5), a primary instructor, and an industry expert. Each evaluator cluster was assigned specific roles in assessing fifteen criteria related to student performance, ensuring a comprehensive and balanced appraisal framework.

Expert weights were assigned according to the cluster roles. Peers and instructors shared equal influence on C1, C3, C7, and C8, while other criteria were evaluated solely by the instructor group or shared between the instructor team and the industry expert. For example, criteria C6, C9, and C10 were weighted 60% by the instructor team and 40% by the industry expert, reflecting the relevance of practical industry insights. The individual expert weights were then computed from these group-level shares following the two-level scheme described in Section 3.2.1, and the resulting values are presented in Table 7.

To address the inherent uncertainty and subjectivity in human evaluation, Z-numbers were employed to represent each expert’s judgment as a pair of fuzzy values: the restriction and reliability of the assessment. As shown in Table 3, linguistic terms for restriction ranged from “Very Inaccurate / Very Poor / Very Unimportant” to “Very Accurate / Excellent / Very Important,” modeled as TFNs, while reliability levels ranged from “Very Low / Very Uncertain” to “Almost Certain / Extremely Certain,” also expressed as TFNs.

4.2 Implementation of score determination and grading phases based on the proposed model

4.2.1 Stage 1: Criteria weight determination.

In this stage, the importance weights of the evaluation criteria were determined following the procedure outlined in Section 3.2. Specifically, the Z-number-based approach was employed to capture both the perceived importance and the associated reliability of expert assessments. Take instructor T1 as an example: T1 rated the importance of each criterion using linguistic terms mapped to TFNs. Reliability levels, representing the evaluator’s confidence in each assessment, were similarly expressed through linguistic-to-fuzzy mappings. The detailed linguistic evaluations and their corresponding fuzzy representations are presented in Table 8. The complete linguistic evaluation tables for criteria importance (instructors T1–T5) are provided in the Supporting Information (S2 File).

Table 8. Linguistic evaluations of criteria importance and TFN transformations (instructor T1).

https://doi.org/10.1371/journal.pone.0349114.t008

Table 8 lists the original linguistic restriction and reliability evaluations and their corresponding TFNs. The Z-number for each criterion was converted into a unified TFN according to Eqs. (1) to (3). The resulting converted TFNs for importance assessments are shown in the “Converted TFN” column.

Following the Z-number conversion, the Entropy-CRITIC hybrid method was adopted to compute the final attribute weights. By integrating these two complementary perspectives, the final weight vector achieves greater robustness and objectivity. The resulting weights from both methods, as well as the aggregated combined weights, are summarized in Table 9 (λ = 0.5).

Table 9. Final attribute weights derived from the Entropy-CRITIC hybrid method.

https://doi.org/10.1371/journal.pone.0349114.t009

These combined weights were subsequently utilized in the Z-MABAC-based ranking stage to ensure the consistency and fairness of MCGDM under uncertainty.

4.2.2 Stage 2: Students performance ranking via Z-MABAC method.

In the second stage, the Z-MABAC method was employed to calculate the final performance scores and rankings of the 24 students based on the multi-source Z-number evaluations collected in the previous phase. The raw evaluation dataset comprised a total of 1,512 entries (Table 10), covering the assessments of all students across fifteen criteria (C1–C15) by various evaluators including peers, instructors, and an industry expert. Each evaluation record included the Z-number linguistic assessments for both restriction (performance rating) and reliability (confidence), expressed as triangular fuzzy numbers (TFNs), alongside the corresponding expert weight. The full expert evaluation dataset based on Z-number used in this study is available in the Supporting Information (S1 File).

Table 10. Expert evaluation data based on Z-number (Partial Data).

https://doi.org/10.1371/journal.pone.0349114.t010

Following the transformation of the Z-number assessments into TFNs and the application of expert weights, a weighted normalized decision matrix was constructed, as shown in Table 11. The matrix reflects the aggregated and normalized performance of each student across the 15 criteria, incorporating both restriction and reliability through the Z-number framework. For brevity, only a subset of students’ normalized scores is presented here.

Table 11. Weighted normalized matrix of Z-number assessments (partial data).

https://doi.org/10.1371/journal.pone.0349114.t011

After constructing the weighted normalized matrix, the Z-MABAC method was applied to compute the final closeness coefficients (MABAC scores) for each student. These scores reflect the integrated performance outcomes across all criteria, considering both the reliability and importance of the evaluations provided by multiple stakeholders. Table 12 presents the final results, where students are ranked according to their MABAC scores in descending order. Scores were rounded to three decimal places for reporting clarity.

Table 12. Final Z-MABAC scores and student ranking.

https://doi.org/10.1371/journal.pone.0349114.t012

Following the ranking results obtained through the Z-MABAC method, this stage applies the grading criteria defined in Section 3.3 to categorize students into four performance bands. According to the grading policy, students falling into the top 20% are assigned Grade A, the next 45% are assigned Grade B, followed by 30% assigned Grade C, and the bottom 5% assigned Grade D. Given that the class size is M = 24, the detailed categorization based on MABAC scores is presented in Table 13.

Table 13. Grading results based on Z-MABAC ranking.

https://doi.org/10.1371/journal.pone.0349114.t013

To further summarize the dispersion of the Z-MABAC results after grading, Fig 9 visualizes the distribution of Z-MABAC scores across grade categories. The boxplot provides a compact view of the median and variability within each grade group, based on the assignments reported in Table 13. As shown in Fig 9, Grade A corresponds to consistently positive and higher Z-MABAC scores, Grade B remains mostly positive but closer to zero with moderate spread, while Grade C shifts into the negative range with a wider dispersion. Grade D contains a single student with the lowest score, which appears as a single point.

Fig 9. Boxplot of Z-MABAC scores across grade categories.

https://doi.org/10.1371/journal.pone.0349114.g009

5 Discussion

5.1 Sensitivity analysis

To evaluate the robustness of the proposed MABAC method under variations of key parameters, a comprehensive sensitivity analysis was conducted focusing on two dimensions: the variation of the confidence parameter λ in the Z-number framework, and incremental changes in individual attribute weights.

5.1.1 Impact of criteria weight perturbations.

The effect of amplifying each criterion’s weight individually was analyzed by recalculating the MABAC scores and ranks for the student alternatives. Spearman rank correlation coefficients between the original ranking and each perturbed ranking were computed to quantify stability. As shown in Fig 10, all Spearman coefficients remain above 0.98, indicating that the ranking order is highly stable against single-attribute weight increases.

Fig 10. Spearman rank correlation coefficients under incremental increases of individual criterion weight.

https://doi.org/10.1371/journal.pone.0349114.g010

Correspondingly, the score variation under attribute weight increments is visualized in Fig 11, where individual student scores exhibit smooth and minor fluctuations, further confirming the robustness of MABAC with respect to moderate changes in attribute importance.

Fig 11. Line plot of MABAC score variations for students with individual criterion weight increment.

https://doi.org/10.1371/journal.pone.0349114.g011
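The stability check used in this subsection can be sketched in a few lines. The sketch assumes that scores contain no ties, so that the closed-form Spearman formula applies, and that a weight perturbation multiplies one criterion's weight by a factor and then renormalises all weights; the paper's exact increment scheme may differ. Function names and example values are hypothetical.

```python
def ranks(scores):
    """Rank positions (1 = best) for a list of scores, assuming no ties."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    r = [0] * len(scores)
    for position, i in enumerate(order, start=1):
        r[i] = position
    return r


def spearman(scores_a, scores_b):
    """Spearman rank correlation: 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    ra, rb = ranks(scores_a), ranks(scores_b)
    n = len(ra)
    d2 = sum((x - y) ** 2 for x, y in zip(ra, rb))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


def amplify_weight(weights, j, factor):
    """Scale criterion j's weight by `factor`, then renormalise to sum 1."""
    bumped = list(weights)
    bumped[j] *= factor
    total = sum(bumped)
    return [w / total for w in bumped]


# Two hypothetical score vectors for four students.
baseline = [0.9, 0.5, 0.1, 0.3]
perturbed = [0.8, 0.2, 0.1, 0.6]
rho = spearman(baseline, perturbed)
new_weights = amplify_weight([0.5, 0.3, 0.2], j=0, factor=1.2)
```

In a full sensitivity run, each perturbed weight vector would be passed to the ranking procedure and the resulting scores compared with the baseline through `spearman`.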

5.1.2 Effect of λ variation in the Z-number model.

The confidence parameter λ, which balances the influence of the membership and reliability functions in the Z-number, was varied systematically from 0.1 to 0.9 to assess its impact on student rankings and scores. Spearman rank correlations between rankings at different λ values and the original ranking are illustrated in Fig 12. These correlations are all above 0.99, indicating very high rank stability across the entire λ range.

Fig 12. Spearman rank correlation coefficients under varying λ values in the Z-number model.

https://doi.org/10.1371/journal.pone.0349114.g012

The sensitivity of individual MABAC scores to λ changes is depicted in Fig 13. While scores vary progressively with λ, no abrupt changes or rank reversals occur, demonstrating that the proposed model maintains consistent evaluations even when the confidence weighting in Z-numbers is altered.

Fig 13. Line plot of MABAC score variations for students with different λ values.

https://doi.org/10.1371/journal.pone.0349114.g013

The sensitivity analyses collectively indicate that the MABAC method integrated with Z-numbers exhibits strong robustness to variations in both criterion weights and the λ parameter. The Spearman rank correlation coefficients exceed 0.98 in all tested settings, indicating that the relative ordering of alternatives is largely unaffected by these perturbations. This robustness is crucial for ensuring reliable decision-making in uncertain environments where subjective weights and confidence levels may vary.

5.2 Comparison analysis

To evaluate the consistency and discriminative capabilities of different decision-making methods, we compare eight approaches based on both Z-number and traditional TFN representations. The methods include Z-MABAC, Z-TOPSIS, Z-CODAS, Z-MARCOS, and their corresponding TFN-based versions. The Spearman rank correlation matrix is presented in Fig 14, while the ranking comparison of students across all methods is illustrated in Fig 15.

Fig 14. The Spearman correlation across 8 methods.

https://doi.org/10.1371/journal.pone.0349114.g014

Fig 15. The ranking comparison of students across 8 methods.

https://doi.org/10.1371/journal.pone.0349114.g015

5.2.1 Consistency analysis via Spearman correlation.

As shown in Fig 14, the Spearman rank correlation coefficients among the Z-number-based methods are exceptionally high, all exceeding 0.99. In particular, Z-MABAC shows a near-perfect correlation with Z-TOPSIS (ρ = 0.998), Z-CODAS (ρ = 0.997), and Z-MARCOS (ρ = 0.996). This high degree of consistency supports the reliability of the Z-number framework in producing stable and coherent ranking outputs across different MCGDM techniques.

In contrast, TFN-based methods, which consider only the restrictions, exhibit relatively lower correlations. This is especially true of TFN-MARCOS, which shows a noticeably weaker correlation with both the Z-number methods and the other TFN approaches. This indicates that ignoring the reliability component in fuzzy evaluations may lead to inconsistencies in ranking, particularly in edge cases where uncertainty plays a significant role.

5.2.2 Top and bottom rank consistency.

Fig 15 reveals that the highest-ranked student under Z-MABAC is S21, which is fully consistent with Z-CODAS and Z-MARCOS, while S9 is consistently ranked second or first in most methods. Conversely, S10 is ranked the lowest (rank 24) across all eight methods, showing excellent agreement on the least preferred alternative.

Such alignment at the extremes indicates that both Z-number and TFN-based models are robust in identifying the best and worst-performing alternatives. However, discrepancies in middle ranks are more noticeable in TFN-MARCOS and TFN-CODAS compared to their Z-number counterparts.

5.2.3 Impact of reliability modeling through Z-numbers.

The integration of reliability in Z-numbers appears to enhance the stability and coherence of the rankings. Compared to TFN-based counterparts, Z-number methods yield tighter clustering of rankings, fewer rank inversions, and higher Spearman correlations.

This evidence shows the value of incorporating the reliability component of Z-numbers, which modulates the fuzzy restriction by the decision-maker’s confidence. The improved agreement among Z-number-based methods suggests that this additional layer of information leads to more informed, trustworthy, and consistent evaluations.

Furthermore, in comparison to TFN-based models, Z-number methods show a better balance between sensitivity and robustness, retaining the flexibility of fuzzy modeling while mitigating subjectivity through reliability weighting.

5.2.4 Computational complexity of Stage 1 and Stage 2.

To assess the computational efficiency of the methods used in this study, the complexity of Stage 1 and Stage 2 is analyzed as follows:

  1. Stage 1 (Criteria Weight Determination): The time complexity of this stage is $O(N \cdot K)$, where N is the number of criteria and K is the number of evaluators. Based on the actual data in this study, with $N = 15$ criteria and $K = 5$ evaluators, this corresponds to about $15 \times 5 = 75$ basic operations.
  2. Stage 2 (Students’ Performance Ranking via Z-MABAC Method): The time complexity of this stage is $O(M \cdot N \cdot K)$, where M is the number of students, N is the number of criteria, and K is the number of evaluators. With $M = 24$ students, $N = 15$ criteria, and $K = 5$ evaluators, this corresponds to about $24 \times 15 \times 5 = 1800$ basic operations.

This complexity analysis highlights the computational requirements of Stage 1 and Stage 2, providing insights into their efficiency in practical applications.

5.3 Inter-rater reliability analysis

To complement the robustness and comparison analyses presented above, an empirical inter-rater reliability (IRR) analysis was conducted based on the evaluator-level data reported in S1 and S2 Files. Specifically, S2 File provides the linguistic evaluations of criteria importance made by instructors T1–T5, together with the corresponding TFN transformations, and was therefore used to assess agreement in the criteria-weight determination stage. In contrast, S1 File contains the evaluator-level records for the case study, including criterion, student ID, evaluator identity, restriction and reliability linguistic ratings, the corresponding TFNs, and evaluator weights. From this file, the ratings of instructors T1–T5 on the six instructor-assessed criteria (C6, C9, C10, C11, C12, and C15) across the 24 students were extracted to evaluate agreement in the teacher-evaluation stage.

Because the original judgments were expressed as linguistic terms and transformed into triangular fuzzy numbers, the centroid values of the restriction TFNs were used as numeric scores for IRR estimation. A two-way random-effects intraclass correlation coefficient (ICC) with absolute agreement was adopted, and both the single-measure ICC [ICC(2,1)] and the average-measure ICC [ICC(2,k)] were reported.
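For readers who wish to reproduce this type of analysis, the two coefficients can be computed from the two-way ANOVA mean squares. The sketch below follows the standard Shrout and Fleiss definitions of ICC(2,1) and ICC(2,k) and assumes a complete subjects-by-raters table with no missing cells. The function names are hypothetical, and the example table is the well-known six-target, four-judge illustration from the ICC literature, not data from S1 or S2 File.

```python
def tfn_centroid(tfn):
    """Numeric score for a restriction TFN (l, m, u), as used for the IRR input."""
    return sum(tfn) / 3.0


def icc2(ratings):
    """ICC(2,1) and ICC(2,k) for an n-subjects x k-raters table."""
    n, k = len(ratings), len(ratings[0])
    grand = sum(map(sum, ratings)) / (n * k)
    row_means = [sum(r) / k for r in ratings]
    col_means = [sum(r[j] for r in ratings) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for r in ratings for x in r)
    ms_rows = ss_rows / (n - 1)                      # between-subjects
    ms_cols = ss_cols / (k - 1)                      # between-raters
    ms_err = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    single = (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )
    average = (ms_rows - ms_err) / (ms_rows + (ms_cols - ms_err) / n)
    return single, average


# Classic illustration: 6 rated targets (rows) x 4 judges (columns).
table = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
single, average = icc2(table)
print(round(single, 2), round(average, 2))  # 0.29 0.62
```

The average-measure value is always at least as large as the single-measure value when agreement is positive, which is why a panel can reach acceptable reliability even when individual raters agree only weakly.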

First, agreement among the five instructors (T1–T5) in the criteria-weight determination stage was examined based on the 15 criteria recorded in S2 File. As summarized in Table 14, the results yielded ICC(2,1) = 0.351 and ICC(2,5) = 0.730. This indicates that, although agreement between individual instructors was limited, the aggregated evaluations of the five-member instructor panel achieved acceptable reliability for group-based criteria weighting.

Table 14. Inter-rater reliability results for fixed instructor panels.

https://doi.org/10.1371/journal.pone.0349114.t014

Second, agreement among the same instructor panel was examined in the teacher-evaluation stage using the records in S1 File for the six instructor-assessed criteria (C6, C9, C10, C11, C12, and C15) across the 24 students. The pooled analysis over all student–criterion combinations produced ICC(2,1) = 0.726 and ICC(2,5) = 0.930, indicating strong agreement among instructors, particularly when their evaluations were aggregated. The overall IRR results for Stage 1 and Stage 2 are presented in Table 14.

A criterion-wise analysis further confirmed this pattern. As shown in Table 15, the average-measure ICC values were 0.903 for C6, 0.886 for C9, 0.897 for C10, 0.879 for C11, 0.959 for C12, and 0.973 for C15. These findings provide an empirical complement to the Z-number-based modeling of evaluator reliability. In other words, besides representing uncertainty and confidence through Z-numbers, the observed agreement among the fixed instructor panels also supports the internal consistency of the proposed framework within the present case study. However, these findings should still be interpreted in light of the single-course and single-institution context.

Table 15. Criterion-wise inter-rater reliability for instructor-assessed criteria.

https://doi.org/10.1371/journal.pone.0349114.t015

5.4 Limitations

The empirical validation in this study is based on a single cohort of 24 students from one practical course at one institution. Therefore, the present findings should be interpreted as case-based evidence of feasibility and internal robustness within the studied setting, rather than as definitive evidence of general applicability. In particular, the evaluator composition, course design, and grading policy adopted in this case may differ from those in other disciplines, institutions, or assessment settings. Accordingly, claims regarding fairness, transparency, and practical applicability should be understood as context-dependent and limited to the current course environment. Future research should examine the proposed framework across multiple cohorts, courses, institutions, and evaluator configurations to establish broader external validity.

In addition, although the present study includes a formal empirical inter-rater reliability analysis for the fixed instructor panels involved in the criteria-weight determination stage and the teacher-evaluation stage, this empirical check remains limited to comparable evaluator sets with complete repeated-rating structures. Other components of the framework, such as peer evaluations and single-expert judgments, were not examined through the same ICC-based design. Therefore, the reported IRR results should be understood as a partial empirical complement to the Z-number-based reliability modeling, rather than a full agreement assessment across all evaluator types. Future research should extend empirical agreement analysis to more diverse evaluator configurations and educational settings to further strengthen the empirical basis of the framework.

A further limitation concerns the elicitation of confidence information in the Z-number assessments. In the current implementation, some reliability inputs depend on evaluators’ self-reported confidence expressed through linguistic terms. Although this design is consistent with the conceptual structure of Z-numbers, such confidence judgments may still be affected by individual response tendencies, subjective bias, or differences in evaluators’ interpretation of linguistic labels. In addition, the final grade allocation in this case study follows an institution-specific proportion-based grading policy, which may not be directly transferable to all educational contexts. Future studies may improve the framework by combining self-reported reliability with empirical calibration strategies, behavioral indicators, or alternative grading policies that are better aligned with different institutional requirements.

6 Conclusion

This study proposes a novel Z-number-based multi-stage assessment framework for PBL practical courses. By integrating Z-number modeling, an Entropy-CRITIC weighting scheme, and the Z-MABAC ranking method, the framework provides a structured procedure for combining multi-source evaluations under uncertainty and varying levels of confidence. Within the studied course context, the results indicate that the proposed framework can support a more transparent, reliability-aware, and systematic assessment process than conventional single-dimensional grading approaches.

The case study demonstrates the feasibility of applying the framework to a real PBL course involving multiple evaluator roles, including instructors, peer groups, and an industry expert. The sensitivity and comparative analyses further show that the ranking results remain relatively stable under the tested settings, suggesting internal robustness of the proposed procedure within this specific application. In this sense, the framework offers a practical way to organize process-oriented and outcome-oriented assessment information in a unified decision-making structure.

At the same time, the findings should be interpreted with appropriate caution. The empirical evidence is derived from a single cohort of 24 students in one course at one institution, and the study does not yet provide broader external validation across different educational settings. In addition, although evaluator reliability is modeled through Z-numbers and an inter-rater reliability analysis was conducted for the fixed instructor panels, further empirical work is needed to extend formal agreement measures to the remaining evaluator types and to replicate the results more widely. Therefore, the present study should be regarded as a context-specific demonstration of feasibility rather than conclusive evidence of general applicability.

Overall, the proposed multi-stage Z-number-based MCGDM framework provides a structured approach for incorporating both evaluation restrictions and reliability information into student performance assessment. The sensitivity and comparison analyses suggest that the framework can generate stable and coherent ranking results, and the additional inter-rater reliability analysis offers empirical support for the consistency of the fixed instructor panels involved in this case study. Future research should examine the framework across multiple cohorts, courses, institutions, and evaluator configurations, and further extend empirical agreement analysis to more diverse assessment settings.

Supporting information

S1 File. Expert evaluation data based on Z-number.

Anonymized expert evaluation dataset used in this study, including each criterion’s restriction/reliability linguistic ratings, their corresponding triangular fuzzy numbers (TFNs), and expert weights.

https://doi.org/10.1371/journal.pone.0349114.s001

(PDF)

S2 File. Linguistic evaluations of criteria importance (instructors T1–T5).

Anonymized linguistic evaluation tables for criteria importance provided by instructors (T1–T5), together with the corresponding TFN mappings for restriction and reliability.

https://doi.org/10.1371/journal.pone.0349114.s002

(PDF)

References

  1. Barrows HS. A taxonomy of problem-based learning methods. Med Educ. 1986;20(6):481–6. pmid:3796328
  2. UNESCO. Education for sustainable development goals: learning objectives. 2017.
  3. Vargas M, Nuñez T, Alfaro M, Fuertes G, Gutierrez S, Ternero R. A project based learning approach for teaching artificial intelligence to undergraduate students. 2020. https://repositorio.umayor.cl/xmlui/handle/sibum/8280
  4. Wang Y, Cui S. Enhance Project-based Learning Experience for Undergraduate Students with Wireless Sensor Network. In: 2015 ASEE Annual Conference and Exposition Proceedings. 26.653.1-26.653.11.
  5. Chiou R, Fegade T, Wu Y-C, Tseng B, Mauk M, Husanu I. Project-based learning with implementation of virtual reality for green energy manufacturing education. 2020.
  6. Ramírez D, Ramírez P. Making Practical Experience: Teaching Thermodynamics, Ethics, and Sustainable Development with PBL at a Bioenergy Plant. In: 2015 ASEE Annual Conference and Exposition Proceedings. 26.1125.1-26.1125.13.
  7. Yew EHJ, Goh K. Problem-Based Learning: An Overview of its Process and Impact on Learning. Health Professions Education. 2016;2(2):75–9.
  8. Kaushik M. Evaluating a first-year engineering course for project based learning (PBL) essentials. Procedia Comput Sci. 2020;172:364–9.
  9. Ngereja B, Hussein B, Andersen B. Does project-based learning (PBL) promote student learning? A performance evaluation. Educ Sci. 2020;10:330.
  10. Lavado-Anguera S, Velasco-Quintana PJ, Terrón-López MJ. Project-based learning (PBL) as an experiential pedagogical methodology in engineering education: a review of the literature. Educ Sci. 2024;14:617.
  11. Baidal-Bustamante E, Mora C, Alvarez-Alvarado MS. STEAM Project-Based Learning Approach to Enhance Teaching-Learning Process in the Topic of Pascal’s Principle. IEEE Trans Educ. 2023;66(6):632–41.
  12. van der Vleuten CPM, Schuwirth LWT. Assessment in the context of problem-based learning. Adv Health Sci Educ Theory Pract. 2019;24(5):903–14. pmid:31578642
  13. Jiang D, Dahl B, Du X. A systematic review of engineering students in intercultural teamwork: characteristics, challenges, and coping strategies. Educ Sci. 2023;13:540.
  14. Tadjer H, Lafifi Y, Seridi-Bouchelaghem H. A New Approach for Assessing Learners in an Online Problem Based Learning Environment. Learning and Performance Assessment. IGI Global. 2020. 307–24.
  15. Nnamdi MC, Tamo JB, Shi W, Wang MD. Advancing Problem-Based Learning in Biomedical Engineering in the Era of Generative AI. arXiv. 2025.
  16. Mahtani K, Guerrero JM, Decroix J. Implementing innovation in project-based learning in electro-mechanical engineering education. International Journal of Mechanical Engineering Education. 2024;54(1):129–52.
  17. Rodriguez-Sanchez C, Orellana R, Barbosa PRF, Borromeo S, Vaquero J. Insights 4.0: transformative learning in industrial engineering through problem‐based learning and project‐based learning. 2025. https://doi.org/10.1002/cae.22736
  18. Ariza JÁ. Can in-home laboratories foster learning, self-efficacy, and motivation during the COVID-19 pandemic? -- a case study in two engineering programs. arXiv. 2022.
  19. Alsamaray HS. AHP as Multi-criteria Decision Making Technique, Empirical Study in Cooperative Learning at Gulf University. ESJ. 2017;13(13):272.
  20. Alcázar-Ortega M, Montuori L, Rodríguez-García J, Vargas-Salgado C. Multi-Criteria Evaluation Method in the Field of University Education: Application to a Course on Energy Markets. Knowledge. 2023;3(1):40–52.
  21. Haktanır E, Kahraman C. Integrated AHP & TOPSIS methodology using intuitionistic Z-numbers: An application on hydrogen storage technology selection. Expert Systems with Applications. 2024;239:122382.
  22. Capuano N, Caballé S, Percannella G, Ritrovato P. FOPA-MC: fuzzy multi-criteria group decision making for peer assessment. Soft Comput. 2020;24(23):17679–92.
  23. Muhammad Farhan Hakim Nik Badrul Alam N, Muhammad Naim Ku Khalif K, Izzati Jaini N. Application of Intuitionistic Z-Numbers in Supplier Selection. Intelligent Automation & Soft Computing. 2023;35(1):47–61.
  24. Tan Y, Chen Z, Wang B, Ma Q, Wei J. A Z-number and MABAC method based on reliability analysis and evaluation of product design concept. Eksploatacja i Niezawodność – Maintenance and Reliability. 2024.
  25. Yildirim M, Gebraeel NZ, Sun XA. Integrated Predictive Analytics and Optimization for Opportunistic Maintenance and Operations in Wind Farms. IEEE Trans Power Syst. 2017;32(6):4319–28.
  26. Kwok RC-W, Zhou D, Zhang Q, Ma J. A Fuzzy Multi-Criteria Decision Making Model for IS Student Group Project Assessment. Group Decis Negot. 2006;16(1):25–42.
  27. Ilieva G. Fuzzy Group Full Consistency Method for Weight Determination. Cybernetics and Information Technologies. 2020;20(2):50–8.
  28. Laguna-Sánchez P, Palomo J, de la Fuente-Cabrero C, de Castro-Pardo M. A Multiple Criteria Decision Making Approach to Designing Teaching Plans in Higher Education Institutions. Mathematics. 2020;9(1):9.
  29. Xia F. Optimized multiple-attribute group decision-making through employing probabilistic hesitant fuzzy TODIM and EDAS technique and application to teaching quality evaluation of international Chinese course in higher vocational colleges. 2024.
  30. Yüksel FŞ, Kayadelen AN, Antmen F. A systematic literature review on multi-criteria decision making in higher education. International Journal of Assessment Tools in Education. 2023;10(1):12–28.
  31. Shang B, Chen Z, Ma Q, Tan Y. A comprehensive mortise and tenon structure selection method based on Pugh’s controlled convergence and rough Z-number MABAC method. PLoS One. 2023;18(5):e0283704. pmid:37200387
  32. Kang B, Wei D, Li Y, Deng Y. A method of converting Z-number to classical fuzzy number. Journal of Information & Computational Science. 2012;9:703–9.
  33. Božanić D, Jurišić D, Erkić D. LBWA – Z-MAIRCA Model Supporting Decision Making in the Army. Oper Res Eng Sci Theor Appl. 2020;3(2).
  34. Garg A, Maiti J, Kumar A. Granulized Z‐OWA aggregation operator and its application in fuzzy risk assessment. Int J of Intelligent Sys. 2021;37(2):1479–508.
  35. Pamučar D, Ćirović G. The selection of transport and handling resources in logistics centers using Multi-Attributive Border Approximation area Comparison (MABAC). Expert Systems with Applications. 2015;42(6):3016–28.
  36. Pamucar D, Petrovic I, Cirovic G. Modification of the best-worst and MABAC methods: a novel approach based on interval-valued fuzzy-rough numbers. Expert Systems with Applications. 2018;91:89–106.
  37. Zhao M, Wei G, Chen X, Wei Y. Intuitionistic fuzzy MABAC method based on cumulative prospect theory for multiple attribute group decision making. Int J of Intelligent Sys. 2021;36(11):6337–59.
  38. Liu P, Zhang P. A normal wiggly hesitant fuzzy MABAC method based on CCSD and prospect theory for multiple attribute decision making. Int J Intell Syst. 2020;36(1):447–77.
  39. Jana C. Multiple attribute group decision-making method based on extended bipolar fuzzy MABAC approach. Comp Appl Math. 2021;40(6).
  40. Schürmann V, Marquardt N, Bodemer D. Conceptualization and Measurement of Peer Collaboration in Higher Education: A Systematic Review. Small Group Research. 2023;55(1):89–138.
  41. Zen Z, Reflianto, Syamsuar, Ariani F. Academic achievement: the effect of project-based online learning method and student engagement. Heliyon. 2022;8(11):e11509. pmid:36411883
  42. Thite S, Ravishankar J, Tomeo-Reyes I, Ortiz AM. Design of a simple rubric to peer-evaluate the teamwork skills of engineering students. Eur J Eng Educ. 2024;49:623–46.
  43. Adesina OO, Adesina OA, Adelopo I, Afrifa GA. Managing group work: the impact of peer assessment on student engagement. Account Educ. 2023;32:90–113.
  44. Chen T, Zhao Y-J, Huang F-Q, Liu Q, Li Y, Alolga RN, et al. The effect of problem-based learning on improving problem-solving, self-directed learning, and critical thinking ability for the pharmacy students: A randomized controlled trial and meta-analysis. PLoS One. 2024;19(12):e0314017. pmid:39621602
  45. Gomez-del Rio T, Rodriguez J. Design and assessment of a project-based learning in a laboratory for integrating knowledge and improving engineering design skills. Educ Chem Eng. 2022;40:17–28.
  46. Katsenos I, Pierrakeas C. Assessing individual contributions of team members to team achievement by combining peer assessments and digital presence in an academic environment. Educ Sci. 2025;15:279–95.
  47. Rook L, Dean BA, Tofa M, Ellmers G, Villeneuve-Smith S, Kennedy M. Student feedback for integrating reflection into the higher education curriculum; a cross disciplinary study. High Educ Res Dev. 2025;:1–15.
  48. Abdellatif R, El-Wakeel H. Assessing creative outcomes in studio-based learning: a comparative assessment of analytical rubrics. International Journal of Design Creativity and Innovation. 2024;13(1):41–66.
  49. Yakob M, Sari RP, Hasibuan MP, Nahadi N, Anwar S, El Islami RAZ. The feasibility authentic assessment instrument through virtual laboratory learning and its effect on increasing students’ scientific performance. J Balt Sci Educ. 2023;22:631–40.
  50. Vargas H, Heradio R, Farias G, Lei Z, de la Torre L. A Pragmatic Framework for Assessing Learning Outcomes in Competency-Based Courses. IEEE Trans Educ. 2024;67(2):224–33.
  51. Di Palma R, Beausaert S, Mahr D, Heller J, Hilken T. Does Using Virtual Reality to Enhance Students’ Presentation Skills Work? The Role of Feedback and Presence. Computer Assisted Learning. 2025;41(5).
  52. Syeed MM, Shihavuddin A, Uddin MF, Hasan M, Khan RH. Outcome Based Education (OBE): Defining the Process and Practice for Engineering Education. IEEE Access. 2022;10:119170–92.
  53. Zhang R, Shi J, Zhang J. Research on the Quality of Collaboration in Project-Based Learning Based on Group Awareness. Sustainability. 2023;15(15):11901.
  54. Taylor B, Kisby F, Reedy A. Rubrics in higher education: an exploration of undergraduate students’ understanding and perspectives. Assessment & Evaluation in Higher Education. 2024;49:799–809.
  55. Ye F, Sun J, Wang Y, Nedjah N, Bu W. A novel method for the performance evaluation of institutionalized collaborative innovation using an improved G1-CRITIC comprehensive evaluation model. Journal of Innovation & Knowledge. 2023;8(1):100289.
  56. Zhou B, Chen J, Wu Q, Pamučar D, Wang W, Zhou L. Risk priority evaluation of power transformer parts based on hybrid FMEA framework under hesitant fuzzy environment. FU Mech Eng. 2022;20(2):399.