
A novel multi-objective dynamic flexible job shop scheduling algorithm using reinforced learning based black widow spider algorithm

  • Kashif Akram ,

    Roles Conceptualization, Software, Writing – original draft

    kakram.dme19smme@student.nust.edu.pk (KA); mkhan7@pmu.edu.sa (MK)

    Affiliation School of Mechanical & Manufacturing Engineering (SMME), Campus H-12, National University of Sciences & Technology (NUST), Islamabad, Pakistan

  • Muhammad Usman Bhutta,

    Roles Supervision

    Affiliations School of Mechanical & Manufacturing Engineering (SMME), Campus H-12, National University of Sciences & Technology (NUST), Islamabad, Pakistan, Anglia Ruskin University (ARU), Bishop Hall Ln, Chelmsford CM1 1SQ, United Kingdom

  • Shahid Ikramullah Butt,

    Roles Validation

    Affiliation School of Mechanical & Manufacturing Engineering (SMME), Campus H-12, National University of Sciences & Technology (NUST), Islamabad, Pakistan

  • Muhammad Rizwan,

    Roles Formal analysis

    Affiliation School of Mechanical & Manufacturing Engineering (SMME), Campus H-12, National University of Sciences & Technology (NUST), Islamabad, Pakistan

  • Muhammad Salman Khan,

    Roles Writing – review & editing

    Affiliation School of Mechanical & Manufacturing Engineering (SMME), Campus H-12, National University of Sciences & Technology (NUST), Islamabad, Pakistan

  • Mushtaq Khan ,

    Roles Writing – review & editing


    Affiliation Mechanical Engineering Department, College of Engineering, Prince Mohammad Bin Fahd University, Al Khobar,  Saudi Arabia

  • Alamzeb Khan

    Roles Validation

    Affiliation School of Mechanical & Manufacturing Engineering (SMME), Campus H-12, National University of Sciences & Technology (NUST), Islamabad, Pakistan

Abstract

In today’s fast-paced manufacturing environments, solving the flexible job shop scheduling problem (FJSP) has become essential due to the swift design-to-manufacturing-to-consumer cycle and frequent disruptive events such as new job arrivals. This study proposes a novel reinforcement learning based black widow spider algorithm (BWSA-RL) to address the multi-objective dynamic flexible job shop scheduling problem (MODFJSP). The algorithm utilizes a hybrid reinforcement learning framework for dynamic adjustment of the procreation and mutation rates of BWSA-RL. The switch between SARSA and Q-learning is achieved through a novel conversion operator based on the sparsity of the Q-tables. To enhance Pareto front diversity, a novel hybrid crowding distance metric (HCD) is introduced. Additionally, a rescheduling heuristic is proposed to accommodate new job arrivals. A comprehensive experimental regime was applied to validate the proposed novelties against 30 benchmark instances. The mathematical model was validated with mixed integer linear programming (MILP). The conversion condition operator and the HCD metric were benchmarked against two other approaches, demonstrating their effectiveness in balancing exploration and exploitation while maintaining solution diversity. BWSA-RL was benchmarked against four state-of-the-art algorithms, outperforming them in 83.3% of the instances. BWSA-RL demonstrated its potential as a robust approach for MODFJSP, balancing energy efficiency against operational goals such as makespan, due-date conformance and schedule stability.

1. Introduction

Efficient production scheduling is essential for meeting the demands of modern-day manufacturing, characterized by increased variability, reduced batch sizes, and extensive customizations [1]. The NP-hard flexible job shop scheduling problem (FJSP) is a widely recognized challenge in current manufacturing industries, such as semiconductor manufacturing, automobile parts manufacturing, and transportation logistics [2]. The dynamic flexible job shop scheduling problem (DFJSP) involves assigning and sequencing operations across multiple machines while adapting to disruptions like machine breakdowns and priority changes to maintain manufacturing efficiency and stability. Additionally, real-world scheduling often requires optimizing multiple conflicting objectives, leading to the multi-objective dynamic flexible job shop scheduling problem (MODFJSP). The complexity of MODFJSP demands advanced algorithms to balance competing goals [3], typically addressed through weighted-objective methods or Pareto-based approaches, the latter offering production managers greater flexibility in selecting optimal scheduling solutions [3,4].

Since FJSP’s introduction by Brucker and Schlie [5], it has attracted significant research interest due to its application potential in real-world manufacturing systems. The complexity of FJSP makes exact methods impractical for solving large-scale instances. Metaheuristic algorithms have emerged as a preferred solution approach, balancing solution quality with computational efficiency. Researchers have explored various metaheuristic approaches, including the multi-objective discrete Jaya algorithm [6], non-dominated sorting genetic algorithm (NSGA-II) [7], and game-theory-based strategies [8] to enhance solution quality. Additionally, the improved spider monkey optimization [9] and hybrid shuffled frog leaping algorithm [10], have been proposed to improve scheduling efficiency in dynamic environments.

Energy consumption has become a key consideration in scheduling, with recent studies incorporating energy efficiency into MODFJSP objectives. Various optimization techniques, such as memetic NSGA-II [11], enhanced NSGA-II [12], and improved grey wolf optimization [13], have been developed to minimize makespan while optimizing energy consumption. Bi-population evolutionary algorithms [14] and iterative tabu search methods [15] have further refined energy-efficient scheduling by balancing completion times and real-time energy tariffs. Other approaches, such as knowledge-based evolutionary algorithms [16] and genetic programming based hyper-heuristics [17], leverage machine learning principles to enhance scheduling efficiency while reducing energy consumption. These studies emphasize the need for multi-objective optimization strategies that consider both production efficiency and environmental and economic sustainability.

The integration of reinforcement learning (RL) into metaheuristic algorithms has further advanced FJSP solutions, allowing for adaptive decision-making in dynamic environments. Deep Q-learning [18], proximal policy optimization [19], and artificial bee colony-based RL approaches [20] have demonstrated improved scheduling flexibility and performance. Reinforcement learning models, including graph neural networks [21] and graph reinforcement learning [22], have been applied to optimize sequencing and machine assignments. Hybrid approaches that combine RL with NSGA-II [23], and transformer networks [24] have also been explored to enhance scheduling under uncertainties such as machine breakdowns and variable processing times. These methods highlight the potential of RL-driven optimization frameworks in handling complex scheduling challenges.

Job priority consideration in FJSP remains a relatively underexplored area, despite its importance in real-world applications where jobs differ in urgency and penalties for delays. Researchers have developed priority-based scheduling models using heuristics such as critical ratio and earliest start time, particularly in sectors like automobile repair and just-in-time manufacturing [25]. Hybrid metaheuristic approaches, including tabu-variable neighborhood search [26] and quantum annealing-based algorithms [27], have been introduced to optimize job sequencing while considering priority constraints. Other studies have incorporated outsourcing and weighted penalty mechanisms [28] to balance overdue days and delay penalties. These contributions underscore the need for advanced scheduling models that integrate job priority levels to enhance real-world applicability.

Existing FJSP research largely assumes equal job priority, overlooking realistic variations and their impact on scheduling solutions. Additionally, there is a lack of heuristics designed to reduce instability from new job insertions while considering priority constraints. Existing RL implementations rely on a rigid SARSA-to-Q-learning switch, which limits adaptability and efficiency, highlighting the need for a dynamic switching mechanism. Moreover, Pareto-based solutions often use Euclidean or Hamming distance metrics, which can lead to inaccuracies due to scaling effects, necessitating an improved crowding distance metric. To address these gaps, this study proposes a Pareto-optimal black widow spider algorithm (BWSA) for solving MODFJSP while optimizing makespan, total energy consumption, average due-date penalty, and schedule instability. RL is incorporated to dynamically adjust procreation and mutation rates, while a rescheduling heuristic is developed to minimize instability caused by new job insertions. The main contributions of this study are summarized as follows.

  1. A MILP model that integrates optimization objectives and accommodates three levels of job priorities.
  2. A novel hybrid implementation of BWSA and RL for solving MODFJSP.
  3. A novel conversion condition operator to control switching from SARSA to Q-learning.
  4. A novel hybrid crowding distance metric to promote diversity in evolutionary populations.
  5. A novel rescheduling heuristic to manage new job insertion.

The rest of the paper is organized as follows: section 1 provides the introduction, literature review and research gap; section 2 presents the mathematical model; section 3 delves into the details of the proposed algorithm; section 4 presents computational results; and section 5 provides conclusions. The research methodology is explained in Fig 1; the green highlighted boxes are the novelties proposed in this research paper.

Fig 1. Research methodology and overall organization of the proposed study.

https://doi.org/10.1371/journal.pone.0347108.g001

2. Problem formulation

2.1. Problem model

An MODFJSP can be defined as a system of n jobs J_i, where i = 1, 2, …, n, which are to be processed on m machines M_k, where k = 1, 2, …, m. Each job J_i consists of operations O_ij, where j = 1, 2, …, n_i. Every operation O_ij can be performed on a given set of candidate machines M_ij, where M_ij = {machines capable of performing operation O_ij}.

This research divides scheduling into stages: Stage 1 focuses on optimizing makespan (MK), total energy consumption (TEC), and average due-date penalty (ADP), forming a Pareto front of elite solutions for managers to select from. Stage 2 and beyond handle rescheduling due to disruptions, introducing an additional objective, instability (INS), which measures deviations from the previous schedule. A 4 × 3 sample problem is presented in Table 1, with its energy requirements detailed in Table 2. If a job’s operation cannot be performed on a specific machine, it is marked with an infinity sign in Table 1. In the next section a formal mathematical model of the proposed problem is presented.

Table 2. Estimated process, idle and on/off energies of the sample problem.

https://doi.org/10.1371/journal.pone.0347108.t002

2.2. Mathematical model

In this section, a mathematical model is proposed that integrates the four objective functions and the relevant constraints. The required indices, parameters, and variables are defined prior to introducing the mathematical model.

  • Indices
    • i :- job iteration index
    • j :- operation iteration index
    • k :- machine iteration index
    • p, p′ :- indices representing the operational positions on machine k
    • s :- stage index, where s = 1 denotes the initial stage and s > 1 corresponds to the rescheduling stages
  • Parameters
    • total number of jobs to be scheduled during stage s
    • total number of operations of job i during stage s
    • total number of available machines
    • total number of operations assigned to machine k in stage s
    • end time of the last operation on machine k during stage s
    • processing time of an operation on machine k
    • energy consumption rate for processing an operation on machine k
    • idling energy consumption rate for keeping machine k on standby for the next job
    • energy needed for one complete on/off cycle of machine k
    • operation start time during stage s
    • operation finish time during stage s
    • start time of an operation assigned to machine k in position p at stage s
    • completion time of an operation assigned to machine k in position p at stage s
    • weight assigned to high-priority jobs for the ADP calculation
    • weight assigned to normal-priority jobs for the ADP calculation
    • weight assigned to low-priority jobs for the ADP calculation
    • completion time of the ith job
    • due date of the ith job
    • release time of the first operation of job i
    • an arbitrarily large number
  • Variables
    • binary assignment variables indicating whether an operation occupies position p on machine k during stage s

Objective functions:

  1. Minimize makespan (MK)
(1)
  2. Minimize total energy consumption (TEC)
(2)
  3. Minimize average due date penalty (ADP)
(3)
  4. Minimize instability (INS)
(4)
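The printed forms of equations (1)–(4) can be sketched in standard notation as follows. This is a reconstruction consistent with the parameter definitions above, not necessarily the authors' exact formulation: C_i and D_i denote the completion time and due date of job i, w_i its priority weight drawn from {w_h, w_n, w_l}, n^s the number of jobs in stage s, and the three energy terms per machine correspond to the processing, idling, and on/off energies of Table 2.

```latex
\begin{align}
\mathrm{MK}  &= \max_{1 \le i \le n^{s}} C_i \tag{1}\\
\mathrm{TEC} &= \sum_{k=1}^{m}\bigl(E_k^{\mathrm{proc}} + E_k^{\mathrm{idle}} + E_k^{\mathrm{on/off}}\bigr) \tag{2}\\
\mathrm{ADP} &= \frac{1}{n^{s}}\sum_{i=1}^{n^{s}} w_i \max\bigl(0,\; C_i - D_i\bigr) \tag{3}\\
\mathrm{INS} &= \bigl|\{\, O_{ij} : \text{machine assignment or start time differs from the previous schedule} \,\}\bigr| \tag{4}
\end{align}
```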

The objective functions are constrained by

(5)(6)(7)

Equation (5) restricts each operation from being loaded more than once, equation (6) restricts the simultaneous loading of more than one operation on a single machine, and the inherent precedence constraints are ensured through equation (7).

(8)(9)(10)(11)

Equations (8)–(9) and (10)–(11) link the start and finish times of operations with machine k, respectively.

(12)

Equation (12) ensures that the first operation of any job is scheduled no earlier than its release time. For the initial scheduling stage, the release time is zero, while for subsequent rescheduling stages it equals the dynamic event time.

(13)

Equation (13) restricts the assignment of more than one operation to overlapping timeslots on machine k. The following assumptions apply in this study.

  1. Pre-emption is not allowed once an operation begins on a machine.
  2. Processing times for all job operations are deterministic, including setup, loading, and unloading times.
  3. Machine energy consumption remains constant throughout scheduling.

3. Proposed reinforcement learning based black widow spider algorithm (BWSA-RL)

3.1. Overview of BWSA-RL framework

This section introduces the proposed BWSA-RL for solving the MODFJSP. The proposed approach integrates a population-based evolutionary optimization framework with a reinforcement learning (RL) controller that adaptively regulates key evolutionary parameters during the search process.

The core optimization engine of BWSA-RL is derived from the black widow spider algorithm (BWSA), which evolves a population of candidate schedules through biologically inspired operators such as procreation, mutation, and cannibalism. In contrast to conventional BWSA variants that rely on static parameter settings, BWSA-RL introduces an adaptive control mechanism in which RL dynamically adjusts the procreation rate and mutation rate based on the observed search performance. This adaptive strategy enables the algorithm to respond effectively to different optimization phases and problem scales. In the following paragraph an overview of the working of the proposed algorithm is provided.

Fig 2 illustrates the overall workflow of the proposed BWSA-RL framework. The algorithm starts by generating an initial population of feasible solutions using a hybrid initialization strategy; this initial population is then evaluated and sorted into Pareto fronts through non-dominated sorting. The Q-tables for the procreation and mutation rates are initialized with zeros, the initial state is determined, and the values of the two rates are extracted from the action table as per the RL policy. Procreation and mutation are performed using the RL-suggested rates, and the newly generated offspring solutions are kept in a container population. After completion of these evolutionary processes, the parent and offspring populations are merged into a new population. The merged population is evaluated, sorted into Pareto fronts, and the hybrid crowding distance metric is calculated. Then cannibalism, an elitist selection mechanism, is performed on the merged population until its size reduces to the original population size, after which all members of the parent population are replaced with the survivors. At the end of each generation, the reward is calculated, the next state is determined as per the policy, and the relevant update of the Q-tables is performed by either SARSA or Q-learning. These processes repeat until a stopping condition is met, i.e., the maximum number of generations is reached or no improvement is seen in the elite Pareto front over the last 20 generations. The time and space complexity of BWSA-RL are expressed in terms of the total number of operations to be scheduled.
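The generational loop described above can be sketched as follows. This is a minimal control-flow sketch, not the authors' implementation: the procreate/mutate stubs, the scalar `evaluate` function standing in for non-dominated sorting with HCD, and the fixed parameter values (which the paper instead draws from the RL policy) are all assumptions.

```python
import random

# Placeholder operators: BWSA-RL uses the crossover/mutation of section 3.4;
# these stubs only preserve the population-flow structure of the main loop.
def procreate(P, pr):
    n_pairs = max(1, int(pr * len(P)) // 2)
    return [random.choice(P) for _ in range(2 * n_pairs)]

def mutate(P, mr):
    return [random.choice(P) for _ in range(max(1, int(mr * len(P))))]

def bwsa_rl_sketch(init_population, evaluate, n_generations=50, patience=20):
    P = list(init_population)
    best, stall = None, 0
    for g in range(n_generations):
        pr, mr = 0.6, 0.2                   # in the paper, set by the RL policy
        Q = procreate(P, pr) + mutate(P, mr)  # offspring container population
        R = P + Q                           # merge parents and offspring
        R.sort(key=evaluate)                # stand-in for Pareto sorting + HCD
        P = R[:len(init_population)]        # cannibalism keeps the strongest
        score = evaluate(P[0])
        if best is None or score < best:
            best, stall = score, 0
        elif (stall := stall + 1) >= patience:  # no-improvement stopping rule
            break
    return P
```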

Fig 2. BWSA-RL framework illustrating the algorithmic control flow and integration of the reinforcement learning module.

https://doi.org/10.1371/journal.pone.0347108.g002

3.2. Encoding and decoding

A two-vector encoding method [29] is adopted for this study: the process sequence is stored in one vector named the operation sequence vector (OSV), and the machine allocation is stored in another vector named the machine assignment vector (MAV). The size of both the OSV and the MAV equals the total number of operations to be scheduled in the current stage. The operation precedence constraint is embedded in the OSV, and the corresponding machine codes are stored in the MAV. Fig 3a shows a two-vector encoding for an optimal solution to the sample problem presented in Table 1. Reading from left to right, the first 2 represents the first operation of job 2, which is to be performed on machine 1; similarly, the second appearance of 2 represents the second operation of job 2, which is to be processed on machine 2.

Fig 3. Encoding and decoding of OSV and MAV. a) an encoded solution, b) step by step decoding using G&T algorithm.

https://doi.org/10.1371/journal.pone.0347108.g003

To decode the encoded solutions, the G&T algorithm [30,31] is employed. This algorithm generates solutions in the active region of the solution space, which contains the optimal solutions [31]. Step-by-step decoding is explained in Fig 3b: reading the OSV and MAV from left to right, the first operation is loaded on machine 1 for 2 processing units of time. Then the second operation is loaded on machine 2 for 2 units of time; to maintain the precedence constraint, this operation must start on or after the completion time of its preceding operation. In this way all subsequent operations are scheduled on the machines.
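As a concrete illustration of the bookkeeping involved, the following sketch decodes an OSV/MAV pair into a semi-active schedule. It is not the full G&T active-schedule procedure used in the paper, and the `ptime` dictionary keyed by (job, operation index, machine) is an assumed input format.

```python
from collections import defaultdict

def decode(osv, mav, ptime):
    """Return {(job, op_index): (start, finish)} for the encoded solution."""
    machine_free = defaultdict(int)  # earliest free time of each machine
    job_ready = defaultdict(int)     # finish time of each job's last operation
    op_counter = defaultdict(int)    # k-th appearance of a job in the OSV
                                     # denotes that job's k-th operation
    schedule = {}
    for job, mach in zip(osv, mav):
        op = op_counter[job]
        op_counter[job] += 1
        # respect both machine availability and operation precedence
        start = max(machine_free[mach], job_ready[job])
        finish = start + ptime[(job, op, mach)]
        machine_free[mach] = finish
        job_ready[job] = finish
        schedule[(job, op)] = (start, finish)
    return schedule
```

Running it on the small example from the text (job 2's first two operations, then an operation of job 1) reproduces the left-to-right loading described above.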

3.3. Population initialization

The quality and diversity of the initial population significantly impact the convergence of evolutionary algorithms to optimal or near-optimal solutions [2]. This study employs a set of initialization rules, each targeting a specific objective. For makespan minimization, three rules are used: the most operations remaining rule [32], the most work remaining rule [33], and the global minimum time rule [34]. Total energy consumption is optimized using the minimum energy rule and the minimum energy load rule [35]. Average due-date penalty is addressed through a priority-based rule that encodes jobs in high-to-low priority order. Schedule instability is reduced using a stability-considered rule that assigns operations to the same machines as in the previous schedule. Additionally, random initialization of operation sequence vectors (OSV) and machine assignment vectors (MAV) is incorporated to enhance population diversity.

3.4. Evolutionary operators of BWSA-RL

3.4.1. Procreation.

The goal of procreation is to create new individuals through the mating of parent spiders from the current population. The total number of matings is controlled through the procreation rate parameter; its expression is given in equation 14.

(14)

This parameter scales with the total number of individuals in the population: the higher the procreation rate, the more crossover operations are performed, and vice versa. Parent selection is performed using tournament selection. Two random individuals from the population are chosen, and the stronger solution is selected as P₁ based on rank. If both have the same rank, the crowding distance (section 3.6) acts as a tiebreaker, favoring the higher value. The process is repeated to select the second parent, P₂. Both parents are then subjected to procreation and produce two child solutions, which are stored in the offspring container population; this mechanism is explained in Fig 4.
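The tournament step can be sketched as follows; representing individuals as dicts with precomputed 'rank' and 'hcd' fields is an illustrative assumption, not the paper's data structure.

```python
import random

def tournament_select(population, rng=random):
    """Binary tournament: lower Pareto rank wins; higher crowding
    distance (HCD) breaks ties between equally ranked individuals."""
    a, b = rng.sample(population, 2)
    if a['rank'] != b['rank']:
        return a if a['rank'] < b['rank'] else b  # lower rank dominates
    return a if a['hcd'] >= b['hcd'] else b       # tie: favour higher HCD
```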

Fig 4. Procreation mechanism for parent selection based on a tournament selection strategy.

https://doi.org/10.1371/journal.pone.0347108.g004

Procreation is completed in two phases: in the first phase the OSVs, and in the second phase the MAVs, of both parents pass genes to the offspring. The OSV of the child spider is generated in the following four steps, explained graphically in Fig 5a; the particulars of these steps are given below.

Fig 5. Procreation process for generating offspring spiders from parents P1 and P2:(a) OSV generation steps, and (b) MAV generation steps.

https://doi.org/10.1371/journal.pone.0347108.g005

  1. Step 1: two random numbers r1 and r2 are generated such that r1 < r2 and 1 ≤ r1, r2 ≤ q, where q is the total number of elements in the OSV (refer to step one of Fig 5a for an example).
  2. Step 2: for every operation in P2 that falls between r1 and r2, its corresponding operation in P1 is identified and removed to create a reduced P1. For instance, in P2, job number 2 is found at the third position. The first appearance of job 2 in the OSV of P1 is at the first position, so this entry is removed from P1.
  3. Step 3: the genes of P2 between r1 and r2 are copied into the child solution.
  4. Step 4: the remaining genes of the reduced P1 are copied, in order of appearance, into the empty locations to form the complete OSV of the child.

The MAV of the child spider is generated in the following two steps, explained graphically in Fig 5b; the particulars of these steps are given below.

  1. Step 1: the machine assignments between the previously generated r1 and r2 are copied from the MAV of P2, as shown in Fig 5b. For example, the third position in the OSV of the child represents an operation donated by P2, so the machine assignment of that operation is copied from the MAV of P2 to this location.
  2. Step 2: the remaining machine assignments are copied from the MAV of P1. For example, the first location of the OSV of the child represents an operation whose machine assignment is found at the second position of the MAV of P1, so it is copied from there.

The procreation technique ensures feasible child solutions without needing a repair mechanism. The second offspring is generated by swapping the parents and following the same procedure. After procreation, the stronger parent is designated as the mother and the other as the father; the weaker father is removed (cannibalized) from the population. This approach enhances diversity and promotes the inheritance of stronger characteristics in the BWSA-RL algorithm.
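Steps 1–4 for the OSV and 1–2 for the MAV can be sketched as below. The occurrence-counting helper and the convention of removing first appearances follow the worked example above; function and variable names are illustrative, and the crossover points r1, r2 are passed in explicitly for clarity.

```python
from collections import Counter

def _op_ids(osv):
    """Label each gene with its (job, occurrence) identity; the k-th
    appearance of a job number in an OSV denotes its k-th operation."""
    seen, ids = Counter(), []
    for job in osv:
        ids.append((job, seen[job]))
        seen[job] += 1
    return ids

def procreate(p1_osv, p1_mav, p2_osv, p2_mav, r1, r2):
    q = len(p1_osv)
    # Step 2: remove, per job, as many early appearances from P1 as the
    # P2 segment contributes (the worked example removes first appearances)
    need = Counter(p2_osv[r1:r2 + 1])
    leftovers = []
    for gene in p1_osv:
        if need[gene] > 0:
            need[gene] -= 1
        else:
            leftovers.append(gene)
    # Steps 3-4: copy the P2 segment, fill remaining slots from P1 in order
    child_osv, it = [], iter(leftovers)
    for pos in range(q):
        child_osv.append(p2_osv[pos] if r1 <= pos <= r2 else next(it))
    # MAV phase: each position's operation takes its machine from the
    # parent that donated it (P2 inside the segment, P1 elsewhere)
    mach1 = dict(zip(_op_ids(p1_osv), p1_mav))
    mach2 = dict(zip(_op_ids(p2_osv), p2_mav))
    child_ids = _op_ids(child_osv)
    child_mav = [(mach2 if r1 <= pos <= r2 else mach1)[child_ids[pos]]
                 for pos in range(q)]
    return child_osv, child_mav
```

Because the child keeps the same multiset of job numbers as its parents, the result is always a feasible OSV, matching the claim that no repair mechanism is needed.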

3.4.2. Mutation.

The purpose of mutation is to create diversity through the arbitrary introduction of characteristics that may not be present in the current population. The total number of mutations is controlled through the mutation rate parameter; its expression is given in equation 15.

(15)

This parameter scales with the total number of members in the population. To perform mutation, a solution is randomly selected from the population; the mutation is completed in two stages, as depicted in Fig 6.

Fig 6. Mutation operator consisting of (a) operation swapping and (b) random reassignment of machines.

https://doi.org/10.1371/journal.pone.0347108.g006

  1. Step 1: a random value r is selected from the interval [1, q − 1], where q represents the total gene count in the OSV or MAV.
  2. Step 2: the operations at positions r and r + 1 in OSV are swapped, and a random machine is assigned to each in MAV.

The proposed mutation strategy of BWSA-RL always results in feasible solutions; therefore, a repair strategy is not required.
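The two mutation steps can be sketched as follows; the `candidates` map from job number to eligible machines is an assumed input (the paper draws reassignments from the candidate machine sets M_ij per operation).

```python
import random

def mutate(osv, mav, candidates, rng=random):
    q = len(osv)
    r = rng.randrange(q - 1)                # Step 1: r in [0, q-2]
    osv, mav = list(osv), list(mav)
    osv[r], osv[r + 1] = osv[r + 1], osv[r]   # Step 2: swap adjacent operations
    mav[r] = rng.choice(candidates[osv[r]])   # ...and reassign both affected
    mav[r + 1] = rng.choice(candidates[osv[r + 1]])  # positions to random machines
    return osv, mav
```

Swapping adjacent genes never reorders two operations of the same job, which is why the result stays feasible without repair.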

3.4.3. Cannibalism and population update.

Cannibalization involves removing weaker individuals from the population, ensuring that only stronger and more diverse solutions are passed on to future generations. The merged population is divided into Pareto fronts, and the hybrid crowding distance of each member is subsequently calculated. The complete process of cannibalization is summarized in the following steps, and a flowchart of the process is shown in Fig 7.

Fig 7. Cannibalization process illustrating the elimination of weaker solution spiders from the population.

https://doi.org/10.1371/journal.pone.0347108.g007

  1. Step 1: evaluate the merged population, identify its Pareto fronts through non-dominated sorting, calculate the hybrid crowding distance, and delete all members of population P.
  2. Step 2: set the front iteration variable to 1.
  3. Step 3: if the combined number of members in population P and the current Pareto front does not exceed the maximum population size, go to step 4; otherwise, jump to step 5.
  4. Step 4: copy the current Pareto front to population P, then go to step 6.
  5. Step 5: sort the current Pareto front by crowding distance in descending order and sequentially copy individuals to P until the member count reaches the maximum population size.
  6. Step 6: if the number of members in P equals the maximum population size, go to step 7; else advance to the next Pareto front and go to step 3.
  7. Step 7: return population P to the main loop of the BWSA-RL algorithm.
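Steps 1–7 can be sketched with a plain non-dominated sort and the classic NSGA-II crowding distance standing in for the paper's hybrid metric (HCD); all names are illustrative and every objective is assumed to be minimized.

```python
def dominates(a, b):
    """True if objective vector a Pareto-dominates b (minimization)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def non_dominated_fronts(objs):
    remaining = set(range(len(objs)))
    fronts = []
    while remaining:
        front = [i for i in remaining
                 if not any(dominates(objs[j], objs[i]) for j in remaining if j != i)]
        fronts.append(front)
        remaining -= set(front)
    return fronts

def crowding(front, objs):
    """Classic crowding distance (a stand-in for the hybrid HCD metric)."""
    dist = {i: 0.0 for i in front}
    for m in range(len(objs[front[0]])):
        order = sorted(front, key=lambda i: objs[i][m])
        dist[order[0]] = dist[order[-1]] = float('inf')  # keep the extremes
        span = (objs[order[-1]][m] - objs[order[0]][m]) or 1.0
        for a, b, c in zip(order, order[1:], order[2:]):
            dist[b] += (objs[c][m] - objs[a][m]) / span
    return dist

def cannibalize(objs, max_size):
    """Return indices of the survivors of the merged population."""
    survivors = []
    for front in non_dominated_fronts(objs):
        if len(survivors) + len(front) <= max_size:    # Steps 3-4: whole front fits
            survivors += front
        else:                                          # Step 5: truncate by distance
            d = crowding(front, objs)
            front.sort(key=lambda i: d[i], reverse=True)
            survivors += front[:max_size - len(survivors)]
        if len(survivors) == max_size:                 # Step 6: population is full
            break
    return survivors
```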

3.5. RL based dynamic parameter adaptation strategy

In the proposed BWSA-RL framework, reinforcement learning is employed as a supervisory control mechanism rather than as a direct solution generator. The RL component does not construct scheduling solutions or modify individual chromosomes explicitly. Instead, it dynamically regulates the evolutionary behavior of BWSA by adapting key algorithmic parameters during the optimization process.

In the proposed framework, RL is used to adaptively control two key parameters of the BWSA algorithm, namely the procreation rate and mutation rate . These parameters were selected because they directly regulate the exploration–exploitation trade-off of the search process. Specifically, influences the generation of new candidate solutions and thus controls population diversity at a global level, whereas introduces stochastic perturbations that help the algorithm escape local optima. Compared to other parameters, and have the most significant impact on search dynamics while maintaining low computational overhead. Therefore, adapting these parameters through reinforcement learning enables dynamic search control without modifying the underlying solution representation or increasing algorithmic complexity. In static-parameter evolutionary algorithms, selecting suitable values for these parameters is challenging and highly problem-dependent. BWSA-RL addresses this limitation by enabling parameter adaptation based on real-time feedback from the search process.

At each generation, the RL agent observes the current search state, which reflects the population’s recent performance in terms of convergence and diversity across multiple objectives. Based on this state, the agent selects an action corresponding to a predefined range of parameter values. The selected parameters are then applied in the next evolutionary cycle. After the population is updated, a reward signal, derived from changes in hypervolume, is computed and used to update the RL policy.

To ensure stable learning and efficient convergence, a hybrid learning strategy is adopted in which SARSA is employed during the early stages of optimization, followed by Q-learning in later generations. The on-policy nature of SARSA provides conservative and stable updates when the population is highly diverse and the search landscape is uncertain. As the optimization progresses and state–action values become more reliable, Q-learning is activated to accelerate convergence through greedy exploitation of learned policies.

Through this adaptive control mechanism, reinforcement learning enables BWSA-RL to autonomously adjust its search behavior in response to problem dynamics, thereby improving robustness, scalability, and solution quality without manual parameter tuning.

3.5.1. Introduction of SARSA and Q-learning.

RL enables an agent to learn optimal strategies by interacting with its environment. The agent observes its state, takes an action, and receives feedback as a reward. Positive rewards increase the likelihood of repeating an action, while negative rewards decrease it. Over time, the agent learns to maximize cumulative rewards for optimal decision-making. This learning loop is illustrated in Fig 8: the agent observes the current state s_t of the environment and takes an action a_t; as a result, the environment transitions to state s_{t+1} and returns a reward r_{t+1}; based on the new state and reward, the agent takes a new action a_{t+1}. This process continues until the agent learns to maximize the total accumulated reward.

Fig 8. State-agent interaction diagram illustrating the reinforcement learning mechanism.

https://doi.org/10.1371/journal.pone.0347108.g008

The framework of the SARSA and Q-learning algorithms is similar; the only difference lies in the calculation of the future reward used to update the Q-tables. Equations 16 and 17 give the Q-table updates for SARSA and Q-learning respectively.

Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma\, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right] \quad (16)

Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right] \quad (17)

Where Q(s_t, a_t) is the Q-table value for state s_t and action a_t, \alpha is the learning rate and \gamma is the discount factor used for discounting future rewards. Q-learning, an off-policy method, updates Q-values based on the maximum future reward, giving faster convergence but risking local optima. SARSA, an on-policy approach, learns more cautiously, reducing the chance of local optima but converging more slowly.
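The two update rules can be written directly from equations (16) and (17); the dict-based Q-table and the default α, γ values here are illustrative choices, not the paper's.

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """Eq. (16): on-policy target uses the action actually taken next."""
    target = r + gamma * Q.get((s_next, a_next), 0.0)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Eq. (17): off-policy target uses the greedy action in the next state."""
    best = max(Q.get((s_next, b), 0.0) for b in actions)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (r + gamma * best - Q.get((s, a), 0.0))
```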

3.5.2. Conversion condition operator: a SARSA to Q-learning switching mechanism.

This study utilizes both SARSA and Q-learning for updating the Q-tables. Initially, BWSA-RL uses the SARSA algorithm, but after a certain time the algorithm switches to Q-learning. The conversion condition for this transition is based on the proposed sparseness S, defined by the following equation:

S = \frac{Z_p + Z_m}{2\, N_s N_a} \quad (18)

Where Z_p and Z_m are the numbers of zero values in the procreation and mutation Q-tables respectively, and N_s and N_a are the total numbers of states and actions respectively. Initially, all elements of the Q-table matrices in BWSA-RL are zeros, resulting in S = 1. As the search progresses and the algorithm continues scoring rewards, these zeros begin to diminish, causing S to decrease. BWSA-RL switches to Q-learning when S \le S_T, where S_T is the sparseness threshold.

The proposed sparseness S reflects the proportion of unvisited state–action pairs. A high value of S indicates that the agent has not sufficiently explored the state–action space, making SARSA more suitable due to its on-policy nature and stable learning behavior under limited information. As learning progresses, S decreases, indicating that the agent has accumulated sufficient experience across different states. At this stage, switching to Q-learning enables more aggressive exploitation of learned policies through its off-policy update rule. Therefore, Q-table sparseness serves as an intuitive and data-driven indicator of the exploration maturity of the learning process.

The threshold parameter S_T determines the point at which the learning strategy transitions from SARSA to Q-learning. A lower value of S_T would trigger early switching, potentially leading to premature exploitation, whereas a higher value may delay the transition and increase computational overhead. Thus, S_T represents a trade-off between exploration completeness and convergence speed. The value of S_T is empirically determined through the Taguchi design of experiments in Section 4.3.
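The switching test of equation (18) reduces to counting zero entries across the two tables; the list-of-lists table layout and the example threshold value below are assumptions (the paper tunes S_T via the Taguchi design of experiments).

```python
def sparseness(q_proc, q_mut):
    """Fraction of still-zero entries across both n_states x n_actions tables."""
    cells = 2 * len(q_proc) * len(q_proc[0])
    zeros = sum(v == 0 for table in (q_proc, q_mut) for row in table for v in row)
    return zeros / cells

def pick_update_rule(q_proc, q_mut, threshold=0.3):
    """SARSA while exploration is immature (S > S_T), Q-learning afterwards."""
    return 'sarsa' if sparseness(q_proc, q_mut) > threshold else 'q_learning'
```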

3.5.3. State definition.

In the context of RL, the state can be considered the agent’s modeling and encoding of its environment [36]. An ideal state definition must encompass all characteristics of the population, and defining states for multi-objective algorithms is even more challenging [2]. This research proposes a state definition based on changes in the average values of all four optimization objectives: \Delta MK, \Delta TEC, \Delta ADP and \Delta INS are defined as the differences in average MK, TEC, ADP and INS between the previous and current generations. The formal expressions for these differences are given in equations 19–22.

ΔMK = (1/N) Σ_{i=1}^{N} MK_i^g − (1/N) Σ_{i=1}^{N} MK_i^{g−1}   (19)

ΔTEC = (1/N) Σ_{i=1}^{N} TEC_i^g − (1/N) Σ_{i=1}^{N} TEC_i^{g−1}   (20)

ΔADP = (1/N) Σ_{i=1}^{N} ADP_i^g − (1/N) Σ_{i=1}^{N} ADP_i^{g−1}   (21)

ΔINS = (1/N) Σ_{i=1}^{N} INS_i^g − (1/N) Σ_{i=1}^{N} INS_i^{g−1}   (22)

Where MK_i^g and MK_i^{g−1} represent the makespan of the ith member of the population for the gth and (g−1)th generations respectively. TEC_i^g, TEC_i^{g−1}, ADP_i^g, ADP_i^{g−1}, INS_i^g and INS_i^{g−1} are defined similarly. N is the total number of solutions in the population. Following the calculation of these delta values, Table 3 is utilized to assess the current state of the population.
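The delta computation of equations 19–22 reduces to a difference of column means over two generations. A minimal sketch, assuming each population is stored as an (N, 4) array whose columns hold MK, TEC, ADP and INS (the column order is an illustrative assumption):

```python
import numpy as np

def objective_deltas(prev_pop, curr_pop):
    """Change in average MK, TEC, ADP and INS between two generations.

    prev_pop, curr_pop: (N, 4) arrays of objective values, one row per
    solution; columns are MK, TEC, ADP, INS (order is illustrative).
    """
    return np.asarray(curr_pop).mean(axis=0) - np.asarray(prev_pop).mean(axis=0)
```

The resulting four deltas are then mapped to a discrete state via the lookup described in Table 3.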

3.5.4. Action set.

This study employs two action sets comprising potential values for the procreation rate P_p and the mutation rate P_m, with each action set containing eight options. Typical values of P_p and P_m lie in the ranges 0.4 to 0.8 and 0.05 to 0.45 respectively, and the interval size is set at 0.05. The eight actions for P_p are [0.4, 0.45), [0.45, 0.50), [0.50, 0.55), [0.55, 0.60), [0.60, 0.65), [0.65, 0.70), [0.70, 0.75) and [0.75, 0.80]. Similarly, the eight actions for P_m are [0.05, 0.10), [0.10, 0.15), [0.15, 0.20), [0.20, 0.25), [0.25, 0.30), [0.30, 0.35), [0.35, 0.40) and [0.40, 0.45]. For example, if the second action is selected for P_p then the value of this parameter is set randomly between 0.45 and 0.50. The action values for P_m are calculated in a similar manner. The expressions for the calculation of P_p and P_m are given in equations 23 and 24.

P_p = P_p^s + r × (P_p^e − P_p^s)   (23)

P_m = P_m^s + r × (P_m^e − P_m^s)   (24)

Where P_p^s, P_p^e, P_m^s and P_m^e are the interval start and end values of the selected procreation and mutation actions respectively, and r is a random number in [0, 1].
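The interval-based action sampling of equations 23 and 24 can be sketched as follows; the list names are illustrative, and the intervals mirror the eight actions enumerated in the text.

```python
import random

# Eight 0.05-wide intervals for the procreation rate, [0.4, 0.8].
PROCREATION_ACTIONS = [(0.40 + 0.05 * k, 0.45 + 0.05 * k) for k in range(8)]
# Eight 0.05-wide intervals for the mutation rate, [0.05, 0.45].
MUTATION_ACTIONS = [(0.05 + 0.05 * k, 0.10 + 0.05 * k) for k in range(8)]

def sample_rate(action_index, actions):
    """Draw the parameter uniformly from the selected action's interval,
    i.e., start + r * (end - start) with r ~ U[0, 1)."""
    start, end = actions[action_index]
    return start + random.random() * (end - start)
```

Selecting action 1 for the procreation rate, for example, yields a value drawn from [0.45, 0.50), exactly as in the worked example above.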

3.5.5. Reward method.

After execution of the selected action a_t, the environment generates feedback in the form of a reward r_t, which determines the quality of the selected action a_t for the state s_t. This reward could be positive, indicating improvement, or negative, indicating detriment to the overall health of the population. This study employs the hyper-volume ratio (HVR) to calculate the reward; hyper-volume (HV) is a measure of both convergence and diversity [37]. The expressions to calculate HVR and the reward are given in equations 25 and 26 respectively.

HVR = HV_g / HV_{g−1}   (25)

(26)

Where HV_g and HV_{g−1} are the HV of the gth and (g−1)th generations respectively; HVR is the ratio of HV_g to HV_{g−1}. Instead of using a fixed constant reward, this study uses a dynamic reward based on the current value of HVR: higher values of HVR yield a larger reward and vice versa. Negative rewards are not used, as they might cause convergence instability [38].
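A minimal sketch of the HVR-based reward. The exact scaling used in the paper is not reproduced here; the mapping below (reward grows with HVR, clipped at zero so it is never negative) is an assumption consistent with the description above.

```python
def hvr_reward(hv_curr, hv_prev):
    """Non-negative reward derived from the hyper-volume ratio.

    The max(0, HVR - 1) mapping is illustrative: it rewards any growth
    of the hyper-volume and never returns a negative value.
    """
    if hv_prev <= 0:
        return 0.0  # no reference front yet, no reward
    hvr = hv_curr / hv_prev
    return max(0.0, hvr - 1.0)
```

An HVR above 1 (the front's hyper-volume grew) earns a proportionally larger reward, while stagnation or shrinkage earns zero rather than a destabilizing negative value.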

3.5.6. Epsilon greedy policy.

The action selection policy governs the agent's selection of an action a_t in the given state s_t. This study incorporates the epsilon-greedy policy, which maintains a balance between exploration and exploitation of the search space through the choice of the parameter ε. The action is selected as per the following expression.

a_t = argmax_{a ∈ A} Q(s_t, a) if r ≤ ε; otherwise a_t = a random action from A   (27)

Where ε is the greedy rate, r is a random number in the range [0, 1] and A is the action set. If r ≤ ε then, for the current state, the action with the maximum Q-value is selected; otherwise an action is selected randomly.
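Equation 27 can be sketched directly; the list-based Q-row representation and function name are illustrative assumptions.

```python
import random

def epsilon_greedy(q_row, epsilon=0.9):
    """Select the greedy action with probability epsilon, else explore.

    q_row: Q-values of every action for the current state.
    Returns the index of the chosen action.
    """
    if random.random() <= epsilon:
        # exploit: action with the maximum Q-value
        return max(range(len(q_row)), key=lambda a: q_row[a])
    # explore: uniformly random action
    return random.randrange(len(q_row))
```

With the calibrated ε = 0.9 (Section 4.3), the agent exploits the learned Q-values roughly nine times out of ten and explores otherwise.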

3.6. Diversity preservation with Hybrid cosine distance (HCD)

Non-dominated sorting ranks solutions into Pareto fronts, with the best front assigned rank 1 and subsequent fronts ranked in ascending order. Solutions within the same rank are mutually non-dominated and considered equally optimal. To differentiate between them, a crowding distance metric is used [39], measuring solution diversity in objective and variable spaces. A higher crowding distance indicates a more promising solution within the same rank. Maintaining diversity is crucial for ensuring the effectiveness and efficiency of evolutionary algorithms.

This study proposes a novel hybrid cosine distance metric (HCD). The traditional method of evaluating crowding distance is to calculate the Euclidean distance between a reference solution and its neighbors in objective space. This traditional method has two limitations: first, it relies on Euclidean distance, which can sometimes be misleading because of scale effects [40]; second, MODFJSP is a discrete combinatorial optimization problem, so many different permutations in variable space may map to the same solution in objective space. The proposed HCD overcomes these limitations by using cosine distance in place of Euclidean distance. Because it considers the angle between two vectors rather than their magnitudes, cosine distance yields a better estimate of their separation. Additionally, the proposed HCD hybridizes the cosine distances of both the variable and objective domains. The expressions to calculate HCD for the ith member of the population are given below.

CD_i^OS = 1 − (OSV_i · OSV_{i+1}) / (‖OSV_i‖ ‖OSV_{i+1}‖)   (28)

CD_i^MA = 1 − (MAV_i · MAV_{i+1}) / (‖MAV_i‖ ‖MAV_{i+1}‖)   (29)

CD_i^OBJ = 1 − (OBJ_i · OBJ_{i+1}) / (‖OBJ_i‖ ‖OBJ_{i+1}‖)   (30)

HCD_i = CD_i^OS + CD_i^MA + CD_i^OBJ   (31)

Where CD_i^OS, CD_i^MA and CD_i^OBJ are the cosine distances of the operation sequence, machine assignment and objective function vectors respectively. OSV_i, OSV_{i+1}, MAV_i, MAV_{i+1}, OBJ_i and OBJ_{i+1} represent the OSV, MAV and objective value vectors for the ith and (i+1)th solutions respectively. The pseudocode for the calculation of HCD is given in Algorithm 1.

Algorithm 1: Hybrid Cosine Distance (HCD) Operator

Input: All Pareto Fronts

Output: Crowding Distance Data Structure HCD

1 HCD ← Initialize with zeros

2 For Each PF in Pareto Fronts

3 Sort PF by objective functions f1, f2, f3, f4

4 N ← Count of total individuals in PF

5 For i = 1 to N − 1

6 CD_i^OS ← cosine distance between OSV of P_i and P_{i+1}

7 CD_i^MA ← cosine distance between MAV of P_i and P_{i+1}

8 CD_i^OBJ ← cosine distance between objectives of P_i and P_{i+1}

9 HCD(i) ← CD_i^OS + CD_i^MA + CD_i^OBJ

10 End For

11 End For Each

12 Return HCD

13 End
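The core of Algorithm 1 can be sketched in executable form. The neighbor pairing (solution i against solution i+1) and the additive combination of the three cosine distances follow the listing; the function names and vector encodings are illustrative assumptions.

```python
import numpy as np

def cosine_distance(u, v):
    """1 - cosine similarity; larger values mean greater angular separation."""
    u, v = np.asarray(u, float), np.asarray(v, float)
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return 1.0 - float(u @ v) / denom if denom else 0.0

def hybrid_cosine_distance(osv_i, osv_j, mav_i, mav_j, obj_i, obj_j):
    """Sum of cosine distances over the operation-sequence, machine-assignment
    and objective vectors of two neighboring solutions (equation 31)."""
    return (cosine_distance(osv_i, osv_j)
            + cosine_distance(mav_i, mav_j)
            + cosine_distance(obj_i, obj_j))
```

Note that scale-invariance is exactly why cosine distance is used here: a solution vector and any positive multiple of it have distance zero, while orthogonal vectors reach the maximum distance of 1 per component.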

3.7. Dynamic event handling and rescheduling heuristic

Real-world manufacturing shops face multiple disruptions, including new job insertions, tool wear, machine breakdowns, job cancellations, priority changes, and process alterations, all of which add complexity to the scheduling problem [41]. To align with practical scenarios, this study focuses on the most common disruption, new job insertion, in MODFJSP. Such disruptions introduce instability, defined as the deviation from the original schedule, impacting resource organization, tooling, and personnel planning. High instability leads to excessive change management, increased costs, missed due dates, and resource wastage. Therefore, an efficient rescheduling technique must prioritize minimizing instability to maintain schedule reliability and operational efficiency.

The rescheduling framework begins with scheduling the initial job set using BWSA-RL, optimizing MK, TEC, and ADP. After execution, a Pareto front of elite solutions is generated, from which production managers select a schedule for implementation. This schedule remains active until an external disruption occurs. When a disruption arises, it is integrated into the running schedule using the proposed heuristic, ensuring minimal instability. The detailed framework is given in Fig 9.

thumbnail
Fig 9. Proposed rescheduling framework for handling the insertion of new jobs in a dynamic scheduling environment.

https://doi.org/10.1371/journal.pone.0347108.g009

3.7.1. Insertion of new job.

The proposed heuristic for incorporating new jobs into a running schedule is based on the priority of the incoming job, with the aim of minimizing instability. The decision matrix for rescheduling is given in Table 4; the particulars of its steps are given below.

thumbnail
Table 4. Priority based decision matrix for rescheduling after arrival of new job.

https://doi.org/10.1371/journal.pone.0347108.t004

  1. Step 1: a new job arrives.
  2. Step 2: determine the priority of the new job.
  3. Step 3: if the priority is high then go to step 4; if the priority is normal then go to step 6; if the priority is low then go to step 8.
  4. Step 4: freeze all high priority jobs to their current machine assignments and timeslot allocations.
  5. Step 5: reschedule all normal and low priority jobs along with the new job using BWSA-RL, then go to step 10.
  6. Step 6: freeze all high and normal priority jobs to their current machine assignments and timeslot allocations.
  7. Step 7: reschedule all low priority jobs along with the new job using BWSA-RL, then go to step 10.
  8. Step 8: freeze all existing jobs to their current machine assignments and timeslot allocations.
  9. Step 9: reschedule the new job with BWSA-RL, then go to step 10.
  10. Step 10: return to the main loop of the rescheduling framework.
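The freeze/reschedule split that steps 3 through 9 apply can be sketched as a small partitioning routine; the job representation and function name are illustrative, and the actual rescheduling of the non-frozen set is delegated to BWSA-RL in the paper's framework.

```python
def partition_for_rescheduling(jobs, new_job_priority):
    """Split existing jobs into frozen and reschedulable sets per Table 4.

    jobs: iterable of (job_id, priority) pairs, priority in
    {"high", "normal", "low"}. Jobs at or above the new job's priority
    keep their machine assignments and timeslots; the rest are
    rescheduled together with the new job.
    """
    if new_job_priority == "high":
        frozen_levels = {"high"}
    elif new_job_priority == "normal":
        frozen_levels = {"high", "normal"}
    else:  # low-priority arrival: freeze everything, reschedule only it
        frozen_levels = {"high", "normal", "low"}
    frozen = [j for j, p in jobs if p in frozen_levels]
    reschedule = [j for j, p in jobs if p not in frozen_levels]
    return frozen, reschedule
```

Freezing everything at or above the incoming priority is what keeps instability low: the higher a running job's priority, the less its assignments are allowed to move.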

4. Computational results and discussions

This section provides details of extensive computational experiments and evaluations of the proposed BWSA-RL and rescheduling heuristics. The following experimental regime has been designed to assess the performance of BWSA-RL.

  1. Taguchi design of experiments is used to determine the optimal RL hyper parameters.
  2. MILP model is executed in IBM CPLEX and compared with BWSA-RL for validation.
  3. Conversion condition operator is experimentally evaluated and benchmarked against two alternative approaches.
  4. Hybrid crowding distance metric (HCD) is benchmarked against two previously published crowding distance operators.
  5. Effectiveness of the ADP objective function is evaluated, and optimal weight levels are experimentally determined.
  6. BWSA-RL is benchmarked against four state-of-the-art algorithms published in leading journals.
  7. Rescheduling heuristic effectiveness under dynamic conditions is tested through experiments.

The proposed algorithm is implemented in C#.NET and executed on an Intel Core i7 processor with 8 GB RAM. The source code for BWSA-RL can be downloaded from https://github.com/K-Akram/BWSA-RL-Code and is also included in the supporting files of this paper as SourceCode.Zip. Before presenting the experimental results, instance generation and performance metrics are discussed in the following sections.

4.1. Instance generation

The proposed algorithm incorporates job priority and allows for rescheduling to incorporate new jobs. Since no existing benchmark problems cover all these aspects, thirty problems have been generated to evaluate the algorithm's performance [42,43]. The benchmark problems range from 25 to 500 operations, 5 to 50 jobs and 3 to 20 machines. Processing times are generated in the range of 1–10 time units. On/Off, idling and processing energies are generated in the ranges of 0–0.3, 0.2–0.4 and 0.5–1.0 energy units respectively. Additionally, each problem includes three randomly generated dynamic new job insertion events. The proposed set of problems is named P01 to P30 and can be downloaded from https://github.com/K-Akram/Problem-Set-RL-P; it is also included in the supporting files of this paper as ProblemSet.Zip.

4.2. Performance metrics

This study evaluates algorithm performance using three standard metrics: Set Coverage (C), Generational Distance (GD), and Inverse Generational Distance (IGD) [44]. Given the absence of known true Pareto fronts for the benchmark problems, the best Pareto front generated across all runs is adopted as the reference front, denoted PF*.

  1. Set Coverage (C): This metric compares two Pareto fronts in terms of dominance. Let A and B be two Pareto fronts; then the percentage C(A, B) is defined as
C(A, B) = |{ b ∈ B : ∃ a ∈ A, a dominates b }| / |B|   (32)

Where |B| denotes the total number of solutions in Pareto front B. This metric is computed in pairs, i.e., C(A, B) and C(B, A); Pareto front A is considered dominant only if C(A, B) > C(B, A).

  2. Generational Distance (GD): GD measures the average Euclidean distance from solutions in a Pareto front PF to the closest points in the benchmark front PF*. A smaller GD value indicates better algorithm performance. It is computed as follows.
GD = (1/|PF|) Σ_{x ∈ PF} min_{y ∈ PF*} d(x, y)   (33)
  3. Inverse Generational Distance (IGD): IGD measures the average Euclidean distance from solutions in PF* to the closest points in PF. It is computed as follows.
IGD = (1/|PF*|) Σ_{y ∈ PF*} min_{x ∈ PF} d(x, y)   (34)

A lower IGD value reflects better quality of the generated Pareto front. A zero IGD indicates that every solution in PF exists in the reference front PF*, though not necessarily vice versa. When both GD and IGD equal zero, PF and PF* are entirely identical.
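The three descriptive metrics can be sketched compactly; minimization of all objectives and a strict-dominance definition for set coverage are assumptions (some formulations use weak dominance), and solutions are represented as objective tuples.

```python
import math

def dominates(a, b):
    """a Pareto-dominates b: no worse on every objective, strictly better
    on at least one (all objectives minimized)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def set_coverage(front_a, front_b):
    """C(A, B): fraction of B dominated by at least one member of A."""
    covered = sum(1 for b in front_b if any(dominates(a, b) for a in front_a))
    return covered / len(front_b)

def _avg_min_distance(from_front, to_front):
    """Average Euclidean distance from each point to its nearest neighbor."""
    return sum(min(math.dist(p, q) for q in to_front)
               for p in from_front) / len(from_front)

def gd(pf, pf_star):
    """Generational distance: PF towards the reference front PF*."""
    return _avg_min_distance(pf, pf_star)

def igd(pf, pf_star):
    """Inverse generational distance: PF* towards PF."""
    return _avg_min_distance(pf_star, pf)
```

Note how GD and IGD are the same computation with the roles of the two fronts swapped, which is why a zero GD alone does not guarantee coverage of the whole reference front.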

  4. Statistical Tests: To statistically compare the performance of multiple algorithms across multiple benchmark instances, this study employs the Friedman test, a widely used non-parametric statistical test for ranking-based comparisons in metaheuristic optimization. The Friedman test is particularly suitable for stochastic optimization algorithms, as it does not assume normality of the underlying data and is robust to outliers. However, its application is justified when performance measures are obtained from repeated independent runs and when comparisons involve the same set of problem instances across all algorithms. To assess the Friedman test's assumptions of non-normality and non-homogeneity, the Shapiro–Wilk test is used to examine the normality of the performance data, and Levene's test is applied to evaluate the homogeneity of variances among algorithms. Together, these statistical tests establish a rigorous and appropriate framework for comparative performance evaluation, complementing the descriptive metrics C, GD and IGD used in this study.
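In practice the assumption checks and the Friedman test itself are available off-the-shelf (e.g., `scipy.stats.shapiro`, `scipy.stats.levene`, `scipy.stats.friedmanchisquare`). The mean-rank computation underlying the Friedman statistic, which produces the rank tables reported later, can be sketched self-contained as follows; ties are ignored for brevity, which is an assumption a full analysis would not make.

```python
def friedman_mean_ranks(results):
    """Mean ranks and Friedman chi-square statistic for paired results.

    results: dict mapping algorithm name -> list of metric values (lower
    is better), aligned across the same benchmark instances for every
    algorithm. Ties are ignored for brevity.
    """
    names = list(results)
    k = len(names)                              # number of algorithms
    n = len(next(iter(results.values())))       # number of instances
    rank_sums = dict.fromkeys(names, 0.0)
    for i in range(n):
        # rank algorithms 1..k on instance i (rank 1 = best value)
        ordered = sorted(names, key=lambda name: results[name][i])
        for rank, name in enumerate(ordered, start=1):
            rank_sums[name] += rank
    chi2 = 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums.values()) \
        - 3.0 * n * (k + 1)
    return {name: rank_sums[name] / n for name in names}, chi2
```

The statistic is compared against a chi-square distribution with k − 1 degrees of freedom to obtain the p-value; the mean ranks are what appear in the comparison tables.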

4.3. Parameter calibration through Taguchi design of experiment (DOE)

Optimizing parameter settings is crucial for maximizing algorithm performance [45]. The reinforcement learning portion of the proposed BWSA-RL has a set of four hyperparameters, i.e., sparseness threshold Ts, learning rate α, discount rate γ and epsilon greedy value ε. Four levels of each parameter are tested on two problems, P10 and P17. The levels of the four parameters are listed in Table 5, and the IGD results of these tests are listed in Table 6. The analysis was conducted using Minitab (v21.2), and the main effect plots for the problems are depicted in Fig 10. Based on this analysis, the final optimal parameter values are determined as follows: Ts = 0.4, α = 0.6, γ = 0.4, and ε = 0.9.

thumbnail
Table 5. Taguchi design of experiment parameters with four selected levels.

https://doi.org/10.1371/journal.pone.0347108.t005

thumbnail
Table 6. IGD results for P10 and P17 for Taguchi design of experiment.

https://doi.org/10.1371/journal.pone.0347108.t006

thumbnail
Fig 10. Taguchi design of experiment showing main effect plots for problems P10 and P17.

https://doi.org/10.1371/journal.pone.0347108.g010

4.4. Validation of proposed BWSA-RL model

The proposed model’s validity was tested on ten small-scale benchmark problems (S01 to S10) with total operations ranging from 4 to 16, solved using IBM ILOG CPLEX optimization studio (v22.1.1). It is important to note that these small-scale problems are selected due to the exponential computational complexity of exact methods. Benchmark instances are accessible via the link given in Section 5.1. Due to CPLEX’s intrinsic limitation of simultaneous optimization of multiple objectives, three primary objectives, i.e., makespan MK, total energy consumption TEC, and average due-date penalty ADP were optimized individually using OPL. CPLEX was configured with a 3600 second runtime limit, after which it terminates if an optimal solution was not found.

For comparison, BWSA-RL was run with a population size of 180, a maximum of 200 generations, Ts = 0.4, α = 0.6, γ = 0.4, and ε = 0.9. Table 7 summarizes the best results achieved by each algorithm on the optimization objectives and lists the total number of Pareto front solutions produced by BWSA-RL. CPLEX solved instances S01 to S09 optimally but was unable to solve S10. Notably, BWSA-RL not only found optimal solutions for all solvable instances faster than CPLEX but also outperformed CPLEX on S10. These results suggest that while CPLEX is suitable for smaller problem instances, more sophisticated algorithms like BWSA-RL are necessary for larger-scale problems. The CPLEX results should be interpreted as a baseline stability check rather than a comprehensive validation of multi-objective performance.

thumbnail
Table 7. CPLEX versus BWSA-RL: Test results for problems S01 to S10.

https://doi.org/10.1371/journal.pone.0347108.t007

4.5. Conversion condition operator testing and benchmarking

The proposed conversion operator, designed to switch from SARSA to Q-learning, was evaluated and benchmarked against two alternatives. Three versions of BWSA-RL were implemented for comparison: BWSA-RL (A) with the proposed conversion operator; BWSA-Fixed (B), which switches to Q-learning after completing half of the total generations; and BWSA-SLGA (C), with the conversion operator proposed by [46]. The expression for this operator is given in equation 35.

(35)

where g represents the current iteration number, n_s represents the total number of states and n_a represents the total number of actions. To minimize the influence of random factors, each benchmark problem was executed 30 times using a population size of 180 and a maximum of 200 generations. The three performance metrics were calculated for each problem.

The C-metric comparisons of the proposed conversion operator with the fixed and SLGA operators are presented in Fig 11a and Fig 11b respectively. The data show that the proposed operator consistently outperformed the two benchmark approaches. The comparison of GD and IGD is shown in Fig 12; lower GD values suggest that BWSA-RL achieves closer proximity to the true Pareto front than BWSA-Fixed and BWSA-SLGA. Similarly, the lower IGD values for BWSA-RL indicate better distribution and diversity of solutions along the Pareto front. These performance metrics highlight the superior convergence and spread of BWSA-RL in comparison to the fixed and SLGA operators. The complete data for all three metrics are given in S1 Table.

thumbnail
Fig 11. C-metric comparison of the proposed conversion operator (BWSA-RL) with BWSA-Fixed and BWSA-SLGA operators.

https://doi.org/10.1371/journal.pone.0347108.g011

thumbnail
Fig 12. Comparison of GD and IGD metrics for the proposed conversion operator (BWSA-RL) against BWSA-Fixed and BWSA-SLGA operators.

https://doi.org/10.1371/journal.pone.0347108.g012

The Friedman test was employed to statistically evaluate the performance differences among the compared algorithms. Prior to its application, the underlying assumptions regarding data distribution were examined. Specifically, Levene’s test was conducted to assess the homogeneity of variances, while the Shapiro–Wilk test was used to examine data normality. As reported in Table 8, Levene’s test yielded a p-value less than 0.05, indicating significant variance heterogeneity among the datasets, and the Shapiro–Wilk test results confirmed that the IGD data deviate from normality. These findings justify the adoption of a non-parametric statistical approach. Consequently, the Friedman test was applied to the IGD results, and the obtained p-value (p < 0.05) indicates that the observed performance differences among the algorithms are statistically significant and not attributable to random variation. Furthermore, the mean rank values presented in Table 9 demonstrate that the proposed BWSA-RL method achieves the best overall ranking, confirming its superior performance compared to the other benchmarking conversion condition operators.

thumbnail
Table 8. Levene and Shapiro-Wilk test results for IGD data of comparison with other conversion condition operators.

https://doi.org/10.1371/journal.pone.0347108.t008

thumbnail
Table 9. Friedman test results for IGD values of comparison with other conversion condition operators.

https://doi.org/10.1371/journal.pone.0347108.t009

The superior performance of BWSA-RL can be attributed to the proposed dynamic sparsity-based operator, which adjusts the switching process from SARSA to Q-learning based on the learning itself. Unlike the fixed operator, which prematurely transitions to Q-learning halfway through the generations, or the SLGA operator, which follows a static approach, the dynamic operator ensures adaptability and balances exploration and exploitation effectively. In conclusion, the results validate the efficacy of the proposed conversion operator.

4.6. Effectiveness of hybrid crowding distance metric (HCD)

The HCD is benchmarked against two other crowding distance metrics proposed in well reputed journals. These selected benchmarking crowding distance metrics are modified crowding distance operator (MCDO) proposed by [6] and Hamming distance and Euclidean distance (HDED) proposed by [8]. Three versions of BWSA-RL were coded using HCD, MCDO and HDED, and these versions were designated as HCD (A), MCDO (B), and HDED (C). To eliminate any chance factor each benchmark instance was run 30 times and performance metrics were calculated for each run.

The C-metric values, presented in graphical format in Fig 13, clearly demonstrate that the results obtained using the HCD approach are superior to those achieved with other benchmark crowding distance metrics. HDED performed better for problems P01 and P02 but as the problem size increased HCD started producing better results. The GD and IGD values for the three methods are summarized in box plots shown in Fig 14. These visual representations provide a clear comparison, highlighting the performance advantages of the proposed HCD approach. Specifically, HCD consistently achieves lower GD and IGD values compared to HDED and MCDO, emphasizing its ability to generate Pareto fronts with enhanced convergence and diversity. Moreover, the tighter spread of HCD results, as reflected in the compact interquartile range, showcases its robustness and stability across different runs. These findings validate the effectiveness of the HCD method in producing high-quality, well-distributed solutions. The complete data of all three metrics are given in S2 Table.

thumbnail
Fig 13. C-metric comparison between (a) HCD and MCDO and (b) HCD and HDED, highlighting the superior performance of HCD.

https://doi.org/10.1371/journal.pone.0347108.g013

thumbnail
Fig 14. GD and IGD comparison of HCD with MCDO and HDED using (a) box plots of GD and (b) box plots of IGD metrics.

https://doi.org/10.1371/journal.pone.0347108.g014

To examine whether statistically significant differences exist among the evaluated crowding distance metrics, a Friedman non-parametric test was performed. Before conducting this analysis, the distributional characteristics of the IGD data were investigated. Variance homogeneity was assessed using Levene’s test, while the normality of the data was evaluated using the Shapiro–Wilk test. As summarized in Table 10, Levene’s test indicates significant variance heterogeneity (p < 0.05), and the Shapiro–Wilk test shows deviations from normality for most metrics. Although the non-normality assumption is violated for the HDED metric (p > 0.05), this isolated case does not invalidate the use of the Friedman test. These observations collectively support the selection of a non-parametric statistical framework. The Friedman test results, shown in Table 11, yielded a p-value less than 0.05, confirming that the observed performance differences among the crowding distance metrics are statistically significant. Moreover, the mean rank analysis reveals that the proposed HCD consistently attains the best ranking, indicating its superior performance relative to the other benchmark crowding distance measures.

thumbnail
Table 10. Levene and Shapiro-Wilk test results for IGD data of crowding distance comparison.

https://doi.org/10.1371/journal.pone.0347108.t010

thumbnail
Table 11. Friedman test results for IGD data of crowding distance comparison.

https://doi.org/10.1371/journal.pone.0347108.t011

The results highlight the effectiveness and utility of the HCD approach. This superior performance is primarily attributed to its method of assessing solution scarcity around the reference point, which effectively integrates information from both the variable and objective domains. Additionally, the use of cosine distance further improves the results by addressing scaling issues, a common limitation of the Euclidean distance.

4.7. Effectiveness of weights on due date compliance

This section gauges the effect of various weight settings on the average due-date penalty (ADP) function stated in equation 3. The ADP function has three control weights, w_H, w_N and w_L, for high, normal and low priority jobs respectively. To test the effects of these weights on non-compliance (NC) with due dates, MK, TEC and ADP, the following test has been designed.

Step 1: due date for each job is calculated with the following equation.

d_j = λ Σ_{o=1}^{n_j} [ Σ_{k=1}^{m} t_{jok} x_{jok} / Σ_{k=1}^{m} x_{jok} ]   (36)

Where λ is the relaxation factor (assumed to be 1.2 in this study), n_j is the total number of operations for job j, m is the total number of machines, t_{jok} is the processing time of operation o of job j on machine k, and x_{jok} = 1 if operation o of job j can be performed on machine k and 0 otherwise.
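The due-date rule of equation 36, i.e., the relaxed sum of each operation's mean processing time over its eligible machines, can be sketched as follows; the matrix-style inputs and function name are illustrative assumptions.

```python
def due_date(processing_times, eligibility, relaxation=1.2):
    """Due date of a job per equation 36.

    processing_times[o][k]: time of operation o on machine k.
    eligibility[o][k]: 1 if operation o can run on machine k, else 0.
    The due date is the relaxation factor times the sum over operations
    of the average processing time across eligible machines.
    """
    total = 0.0
    for t_row, x_row in zip(processing_times, eligibility):
        eligible = [t for t, x in zip(t_row, x_row) if x]
        total += sum(eligible) / len(eligible)
    return relaxation * total
```

For a single operation taking 2 or 4 time units on its two eligible machines, the due date works out to 1.2 × 3 = 3.6 time units.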

Step 2: five levels of weight settings are selected; these settings are listed in Table 12.

thumbnail
Table 12. Setting of weights to evaluate effectiveness ADP objective on due-date compliance.

https://doi.org/10.1371/journal.pone.0347108.t012

Step 3: all benchmark problems are solved 30 times with each weight setting, and the due-date non-compliance values NCH, NCN and NCL for high, normal and low priority jobs are calculated through equations 37–39. The average NC values for each weight setting are shown in Fig 15.

thumbnail
Fig 15. Comparison of average due-date non-conformance across different weight configurations.

https://doi.org/10.1371/journal.pone.0347108.g015

NCH = (1/|PF|) Σ_{s ∈ PF} |{ j : priority(j) = high, C_j^s > d_j }| / n   (37)

NCN = (1/|PF|) Σ_{s ∈ PF} |{ j : priority(j) = normal, C_j^s > d_j }| / n   (38)

NCL = (1/|PF|) Σ_{s ∈ PF} |{ j : priority(j) = low, C_j^s > d_j }| / n   (39)

Where PF is the Pareto front of elite solutions, |PF| is the number of solutions in PF, n is the total number of jobs, and d_j and C_j^s are the due date and completion time of job j for solution s in PF respectively.
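The non-compliance average can be sketched as follows; the dictionary-based representation of solutions and priorities is an illustrative assumption, and the job count is normalized per priority level here rather than over all jobs, which is one plausible reading of the equations.

```python
def non_compliance(pareto_front, due_dates, priorities, level):
    """Average fraction of jobs of one priority level missing their due
    date, taken over all solutions in the Pareto front.

    pareto_front: list of solutions, each a dict job_id -> completion time.
    due_dates:    dict job_id -> due date.
    priorities:   dict job_id -> "high" | "normal" | "low".
    """
    jobs = [j for j, p in priorities.items() if p == level]
    if not jobs or not pareto_front:
        return 0.0
    total = 0.0
    for completion in pareto_front:
        late = sum(1 for j in jobs if completion[j] > due_dates[j])
        total += late / len(jobs)
    return total / len(pareto_front)
```

Averaging over the whole elite front, rather than a single schedule, is what makes the metric reflect the algorithm's overall tendency toward due-date compliance at each priority level.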

Step 4: by using results obtained in this run, average MK, TEC and ADP are also calculated and results are shown in Fig 16.

thumbnail
Fig 16. Effect of different weight settings on optimization objectives:(a) makespan, (b) total energy consumption, and (c) average due-date penalty.

https://doi.org/10.1371/journal.pone.0347108.g016

By studying Fig 15 it can be observed that Setting 1 serves as the experimental control; for this setting there is no significant difference in NC across the three priorities, but as the weight assigned to high-priority jobs increases, due-date compliance for high-priority jobs also increases.

To study the effect of the weights on objective values, the average MK, TEC and ADP are plotted in Fig 16. From the figure it can be inferred that the weight settings have no significant impact on MK and TEC. The ADP is affected by the weight settings, and the most significant parameter affecting the ADP value is the weight assigned to high-priority jobs. The complete data for the effects of the weights on NC are given in S3 Table, and for the effects of the weights on the optimization objectives in S4 Table.

The experimental analysis demonstrates that the weighting structure embedded in the ADP objective plays a decisive role in controlling due-date compliance, particularly for high-priority jobs. As the high-priority weight increases, the optimization process places greater emphasis on reducing deviations from due dates for urgent jobs, thereby guiding the search toward schedules that prioritize timely completion of high-priority tasks. This targeted emphasis explains the observed improvement in due-date compliance without introducing notable trade-offs in makespan or total energy consumption. Since MK and TEC are governed primarily by processing sequences and machine assignments, adjustments in due-date penalty weights do not significantly alter their behavior. The results indicate that the weighted ADP formulation provides an effective and flexible mechanism for incorporating job priorities into the scheduling process. Among the tested configurations, an intermediate weight setting offers a balanced compromise, achieving improved compliance for urgent and normal jobs while preserving overall solution quality, making it well suited for practical multi-priority scheduling environments.

4.8. Comparison of BWSA-RL with other algorithms

BWSA-RL was compared with four state-of-the-art algorithms published in reputed journals. The selection criteria for the benchmark algorithms were: a multi-objective optimization algorithm, preferably proposed to solve FJSP, published in a high-impact-factor journal, and preferably a reinforcement-learning-based algorithm. As per these criteria, four algorithms were selected: 1) evolutionary algorithm incorporating reinforcement learning (EARL), published in 2024 [2]; 2) multi-objective Q-learning based hyper-heuristic with bi-criteria selection (QHH-BS), published in 2022 [47]; 3) reinforcement learning multi-objective evolutionary algorithm (RMOEAD), published in 2022 [38]; and 4) enhanced non-dominated sorting genetic algorithm (ENSGA), published in 2023 [12].

The first three algorithms (EARL, QHH-BS and RMOEAD) are RL-based, similar to the proposed BWSA-RL, while ENSGA is a non-RL technique selected specifically for benchmarking against an efficient non-RL-based method. All benchmarking algorithms, except QHH-BS, were originally proposed for solving the multi-objective flexible job shop scheduling problem and were implemented using the parameter configurations recommended in their respective original studies. Although QHH-BS was introduced for a closely related mixed shop scheduling environment, its underlying optimization mechanism is generic and applicable to the problem considered in this work. Each benchmarking instance was run 30 times to eliminate chance factors; the three performance metrics were calculated and are shown in Table 13. The first portion of Table 13 and Fig 17 show the C-metric comparisons of the four benchmark algorithms with BWSA-RL. From this data it is evident that BWSA-RL generated superior C-metric values: while ENSGA yielded better C-metric results for the initial two problems, BWSA-RL demonstrated dominant performance as the problem size increased, surpassing all competing algorithms. The second and third sections of Table 13 and the box plots of Fig 18 compare the GD and IGD results respectively; the experimental data indicate that BWSA-RL performed consistently better than the other benchmark algorithms and produced lower GD and IGD values. The last column of Table 13 lists the win status of BWSA-RL, where a win and a loss are indicated by '+' and '−' signs respectively. The criterion for a win is that BWSA-RL must perform better in terms of all performance metrics. By this criterion, BWSA-RL scored 25 wins out of 30 benchmark problems, an overall win rate of 83.3%. The five losses occurred for problems P01, P02, P04, P07 and P11, which are small-scale instances.
These exceptions can be attributed to the small-scale nature of the affected instances, which limits the learning horizon available to the reinforcement learning component. Under such conditions, the adaptive policy may not fully converge before the maximum allowable iterations run out. In contrast, larger problem instances provide richer state–action interactions, enabling more effective learning of procreation and mutation strategies and resulting in superior performance.

thumbnail
Table 13. Comparison of BWSA-RL with other algorithms for performance metrics.

https://doi.org/10.1371/journal.pone.0347108.t013

thumbnail
Fig 17. C-metric comparison of BWSA-RL with four competing algorithms:(a) EARL, (b) QHH-BS, (c) RMOEAD, and (d) ENSGA.

https://doi.org/10.1371/journal.pone.0347108.g017

thumbnail
Fig 18. Comparison of GD and IGD metrics between BWSA-RL and competing algorithms using (a) GD box plots and (b) IGD box plots.

https://doi.org/10.1371/journal.pone.0347108.g018

To get a statistical comparison of BWSA-RL with four other algorithms, the Friedman test was performed on IGD values of compared algorithms. The non-homogeneity and non-normality were tested with Levene and Shapiro–Wilk tests, respectively, and the results were summarized in Table 14. The Levene test with a p-value < 0.05 suggested rejection of the null hypothesis, indicating that the IGD values were non-homogeneous. Similarly, the null hypothesis could also be rejected for the Shapiro–Wilk test with p-values < 0.05 for all algorithms, except RMOEAD. After establishing non-homogeneity and non-normality, the Friedman test was conducted, and the results were shown in Table 15. BWSA-RL had the highest mean rank value of 1.2 with a p-value < 0.05, indicating a significant difference in the compared algorithms’ performance. The next runners-up were EARL, QHH-BS, RMOEAD, and ENSGA with mean rank values of 2.5, 2.8, 4.1, and 4.4, respectively. Overall, the statistical results confirmed superior performance of the BWSA-RL algorithm. Fig 19 displays the convergence plots of MK, TEC, and ADP for problem P15. Visual inspection of the figure confirms that BWSA-RL converges faster to superior objective values within the same number of iterations compared to other algorithms.

Table 14. Levene and Shapiro–Wilk test results for IGD data of comparison with other algorithms.

https://doi.org/10.1371/journal.pone.0347108.t014

Table 15. Friedman test results for IGD values of comparison with other algorithms.

https://doi.org/10.1371/journal.pone.0347108.t015

Fig 19. Convergence behavior of BWSA-RL for benchmark problem P15 in terms of (a) makespan, (b) total energy consumption, and (c) average due-date penalty.

https://doi.org/10.1371/journal.pone.0347108.g019

The performance of the proposed BWSA-RL is assessed using the C-metric, GD, IGD, and convergence rate, and benchmarked against four state-of-the-art algorithms: EARL, QHH-BS, RMOEAD, and ENSGA. A complete examination of the data shows the superior performance of BWSA-RL in solving the MODFJSP. This overall gain can plausibly be attributed to the multilayer novelties introduced: reinforcement-learning-based parameter control, improved evolutionary strategies imitating the mating behavior of black widow spiders, and the hybrid cosine crowding distance metric, which boosts diversity among populations. In the next section, the experimental analysis of the rescheduling module is discussed.
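For reference, the three front-quality indicators used throughout this comparison can be sketched as below, assuming their standard definitions (minimization objectives; lower GD/IGD is better; C(A, B) is the fraction of B dominated by A):

```python
# Illustrative implementations of the standard Pareto-front quality metrics
# (textbook definitions; not the authors' evaluation code).
import numpy as np

def gd(front, ref):
    """Generational Distance: mean distance from each obtained point
    to its nearest point on the reference (true) front."""
    front, ref = np.asarray(front, float), np.asarray(ref, float)
    d = np.linalg.norm(front[:, None, :] - ref[None, :, :], axis=2)
    return d.min(axis=1).mean()

def igd(front, ref):
    """Inverted Generational Distance: GD with the roles swapped, so it
    also rewards coverage of the reference front."""
    return gd(ref, front)

def dominates(a, b):
    """Pareto dominance for minimization objectives."""
    return bool(np.all(a <= b) and np.any(a < b))

def c_metric(A, B):
    """C(A, B): fraction of solutions in B dominated by at least one in A."""
    A, B = np.asarray(A, float), np.asarray(B, float)
    return float(np.mean([any(dominates(a, b) for a in A) for b in B]))

# Toy example: an obtained front uniformly 0.1 worse than the reference.
ref = np.array([[0.0, 1.0], [0.5, 0.5], [1.0, 0.0]])
obtained = ref + 0.1
```

On this toy data, `gd(obtained, ref)` equals 0.1·√2 and `c_metric(ref, obtained)` equals 1.0, since every obtained point is dominated by its reference counterpart.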

4.9. Effectiveness of rescheduling heuristics

This study of the MODFJSP proposes heuristics for handling the insertion of new jobs with the aim of minimizing the instability objective. The proposed heuristic, referred to as RS2, was compared with the complete rescheduling technique, RS1, in which all jobs are rescheduled after every disruptive event. To test the effectiveness of the rescheduling heuristic, three random job insertions were generated for each benchmark problem. Each problem was run 30 times with both RS1 and RS2 to mitigate stochastic effects. MK, TEC, ADP, and INS were recorded for each run, with average values listed in S5 Table and data plots shown in Fig 20. Figs 20a, 20b, and 20c show that RS1 and RS2 produced comparable results in terms of MK, TEC, and ADP, with RS1 holding a slight edge. Fig 20d demonstrates that RS2 significantly outperforms RS1 in maintaining schedule stability, and this advantage becomes more pronounced as the instance size grows. From the data analysis, it can be inferred that the differences in MK, TEC, and ADP are minimal, while the schedules produced by RS2 are significantly more robust in terms of stability.

Fig 20. Comparative performance of complete rescheduling (RS1) and the proposed heuristic (RS2) for (a) makespan, (b) total energy consumption, (c) average due-date penalty, and (d) instability, demonstrating the superior stability of RS2 with minimal impact on other objectives.

https://doi.org/10.1371/journal.pone.0347108.g020

To statistically assess the differences between RS1 and the proposed heuristic RS2, variance-based significance testing was conducted for all optimization objectives. The p-values reported in Table 16 indicate that the differences observed for MK, TEC, and ADP are not statistically significant, as all corresponding p-values exceed the 0.05 significance level. This suggests that the performance of RS2 remains comparable to RS1 with respect to these objectives. In contrast, the p-value associated with INS is below 0.05, indicating a statistically significant difference between the two rescheduling strategies. This result confirms that RS2 achieves a significant improvement in schedule stability compared to complete rescheduling. A more detailed breakdown of the statistical outcomes for the benchmark instances is provided in the online supplementary material (S5 Table).
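As a sketch of this per-objective, variance-based testing, Levene's test can be applied to the paired RS1/RS2 samples. The data below are synthetic with hypothetical effect sizes (RS2 given a markedly smaller INS spread), not the paper's results:

```python
# Per-objective variance comparison of RS1 vs RS2 (synthetic illustration).
import numpy as np
from scipy.stats import levene

rng = np.random.default_rng(1)
objectives = ["MK", "TEC", "ADP", "INS"]

# 30 runs per technique; RS1 and RS2 comparable except for INS, where RS2
# is assumed (for illustration) to be much more stable.
rs1 = {o: rng.normal(100, 5, size=30) for o in objectives}
rs2 = {o: rng.normal(100, 5, size=30) for o in objectives}
rs2["INS"] = rng.normal(20, 1, size=30)

p_values = {o: levene(rs1[o], rs2[o])[1] for o in objectives}
significant = {o: p < 0.05 for o, p in p_values.items()}
```

With the real data, only the INS entry of `significant` is expected to be true, matching Table 16.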

Table 16. P-value comparison of variance for the results of RS1 and RS2 for all optimization objectives.

https://doi.org/10.1371/journal.pone.0347108.t016

The proposed rescheduling heuristics effectively maintain system stability without significantly affecting MK, TEC, and ADP. This is achieved through a freeze-and-reschedule strategy, which preserves optimized operations while only rescheduling necessary ones. By maintaining prior optimization benefits, the approach ensures minimal disruption and allows further optimization for new and lower-priority jobs. This results in robust and efficient performance in dynamic scheduling scenarios.
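The freeze-and-reschedule idea can be sketched as follows. The data model and the `(job, machine)` keying are hypothetical simplifications, not the paper's implementation: operations already started by the disruption time are frozen, and only the remainder, together with the newly inserted job's operations, is handed back to the optimizer.

```python
# Minimal sketch of freeze-and-reschedule on a toy schedule
# (hypothetical data model; not the paper's implementation).
from dataclasses import dataclass

@dataclass
class Operation:
    job: int
    machine: int
    start: float
    end: float

def freeze_and_split(schedule, t_event):
    """Partition a schedule at disruption time t_event."""
    frozen = [op for op in schedule if op.start < t_event]    # keep unchanged
    pending = [op for op in schedule if op.start >= t_event]  # re-optimize
    return frozen, pending

def instability(before, after):
    """Count operations whose start time changed between two schedules --
    one simple way to quantify an INS-style objective."""
    prior = {(op.job, op.machine): op.start for op in before}
    return sum(prior.get((op.job, op.machine)) != op.start for op in after)

schedule = [Operation(0, 0, 0.0, 3.0), Operation(1, 1, 1.0, 4.0),
            Operation(0, 1, 4.0, 6.0), Operation(1, 0, 5.0, 7.0)]
frozen, pending = freeze_and_split(schedule, t_event=3.5)
```

Here the two operations starting before t = 3.5 stay fixed, which is what bounds instability while leaving MK, TEC, and ADP nearly unaffected.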

4.10. Managerial insights

This study presents BWSA-RL, an RL-based metaheuristic designed for the MODFJSP, featuring dynamic adaptation of procreation and mutation rates, an improved hybrid crowding distance operator, and a heuristic for managing new job insertions. The algorithm autonomously adjusts parameters based on population dynamics, enhancing search efficiency. Job priority integration minimizes earliness and tardiness, making it adaptable to diverse priority levels. Tested on small- to large-scale problems, it delivers results within 500 seconds for instances with 500 operations. Real-world job shops face frequent new job insertions, causing schedule instability; this study mitigates that by strategically incorporating priorities into rescheduling. BWSA-RL’s superior performance and ease of coding make it practical for industrial applications. While this study adopts three priority levels for simplicity, the algorithm can be extended to accommodate custom priority structures, enhancing flexibility for job shop managers. These features establish BWSA-RL as an effective metaheuristic for modern job shops.

5. Conclusions and future directions

This work focuses on the MODFJSP, incorporating new job arrivals as dynamic events. It aims to minimize makespan, total energy consumption, due-date penalties, and schedule instability. A three-level job priority system is introduced to improve due-date adherence. A reinforcement learning module integrates SARSA and Q-learning with a novel dynamic switching operator for a better exploration–exploitation balance. Additionally, a hybrid crowding distance metric using cosine distances is introduced to improve diversity assessment. A population-based evolutionary algorithm, BWSA-RL, inspired by the mating behavior of black widow spiders, is developed to handle these challenges effectively.
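The hybrid cosine crowding distance is defined in the methodology; the sketch below shows only one plausible reading of such a hybrid (an assumed combination, not the paper's exact formula): the classical NSGA-II crowding distance augmented with each solution's minimum cosine distance to the rest of the front, so that angular spread also contributes to diversity.

```python
# One plausible hybrid crowding distance (assumed combination, for
# illustration only -- the paper defines its own formula).
import numpy as np

def crowding_distance(F):
    """Classical NSGA-II crowding distance over objective matrix F (n x m)."""
    n, m = F.shape
    d = np.zeros(n)
    for j in range(m):
        order = np.argsort(F[:, j])
        d[order[0]] = d[order[-1]] = np.inf   # boundary solutions kept
        span = F[order[-1], j] - F[order[0], j]
        if span == 0:
            span = 1.0
        d[order[1:-1]] += (F[order[2:], j] - F[order[:-2], j]) / span
    return d

def cosine_distance(u, v):
    return 1.0 - (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)

def hybrid_cosine_crowding(F):
    """Normalized crowding distance plus each solution's minimum cosine
    distance to any other front member (hypothetical hybridization)."""
    F = np.asarray(F, float)
    cd = crowding_distance(F)
    n = len(F)
    angular = np.array([min(cosine_distance(F[i], F[k])
                            for k in range(n) if k != i) for i in range(n)])
    finite = cd[np.isfinite(cd)]
    scale = finite.max() if finite.size and finite.max() > 0 else 1.0
    return np.where(np.isfinite(cd), cd / scale, np.inf) + angular

# Toy bi-objective front: boundary points keep infinite distance; interior
# points gain an angular-spread bonus on top of the classical measure.
F = np.array([[1., 5.], [2., 4.], [3., 3.], [4., 2.], [5., 1.]])
h = hybrid_cosine_crowding(F)
```

The intended effect, as in the paper, is that solutions pointing in under-represented directions of objective space receive extra diversity credit.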

Key contributions include a MILP model tested with IBM CPLEX for validation, an innovative conversion condition operator for dynamic RL switching, and an ablation study confirming the hybrid RL approach’s effectiveness. The proposed ADP objective is evaluated for job priorities, showing strong due-date adherence with minimal impact on makespan and energy consumption. The novel crowding distance metric outperforms existing approaches, and BWSA-RL is benchmarked against four state-of-the-art algorithms, demonstrating superior performance across key evaluation metrics. The rescheduling heuristic successfully minimizes instability.

Despite the promising results, this study has certain limitations that open avenues for future research. First, the experimental evaluation is conducted on synthetically generated benchmark instances; although these problems are widely used and carefully designed, validating the proposed approach on real industrial case studies would further strengthen its practical relevance. Second, the dynamic environment considered in this work is limited to new job insertions, while other common disruptions, such as machine breakdowns, processing-time variability, and priority changes, remain unexplored. Third, the comparison with exact optimization (CPLEX) is restricted to small-scale instances and single-objective formulations, and thus does not fully reflect the complexity of the multi-objective dynamic scheduling problem addressed in this study. In addition, the reinforcement learning component is employed for adaptive parameter control rather than direct policy learning of scheduling decisions, which may limit its ability to fully exploit long-term learning potential. Finally, BWSA-RL copies and maintains multiple populations during execution, which may incur a higher computational cost for very large-scale problems. Future studies may extend the proposed framework by incorporating additional dynamic events, investigating deep reinforcement learning for direct decision-making, and improving computational efficiency for large-scale industrial applications. Quantum-inspired reinforcement learning approaches, such as quantum policy learning, offer potential improvements in efficiency and stability and represent a promising direction for future research on the DFJSP.

Supporting information

S1 Table. Conversion condition operator analysis results of BWSA-RL for C-metric, GD and IGD.

https://doi.org/10.1371/journal.pone.0347108.s001

(DOCX)

S2 Table. Comparison of performance of crowding distance operators for C-metric, GD and IGD.

https://doi.org/10.1371/journal.pone.0347108.s002

(DOCX)

S3 Table. Due-dates non-conformance data for all five weight settings.

https://doi.org/10.1371/journal.pone.0347108.s003

(DOCX)

S4 Table. Effect of each weight setting on optimization objectives.

https://doi.org/10.1371/journal.pone.0347108.s004

(DOCX)

S5 Table. Comparison of RS1 and RS2 for optimization objectives and detailed statistical analysis.

https://doi.org/10.1371/journal.pone.0347108.s005

(DOCX)
