Abstract
Reconfigurable assembly lines have emerged as a vital manufacturing paradigm for meeting the growing demand for customized, multi-variety products. This study considers the reconfigurable assembly line scheduling problem, which involves optimizing the product sequence under three simultaneous objectives: minimizing reconfiguration cost, equalizing production workload, and leveling logistics. This study formulates a novel, linearized multi-objective mathematical model that rectifies deficiencies in prior formulations. A novel Q-learning-based multi-objective hyper-heuristic algorithm is proposed. The algorithm integrates multiple metaheuristic operators within a unified search framework, including particle swarm optimization, teaching–learning-based optimization, the whale optimization algorithm, and the grey wolf optimizer. Q-learning adaptively selects the most promising operator at each search stage based on real-time performance feedback. Moreover, the algorithm incorporates a new density-aware leader selection strategy with a survival-time decay factor to select the global best solution for population evolution. This strategy favors superior solutions in sparse regions and increases selection pressure on high-quality individuals. A numerical case study demonstrates that the models, combined with the ε-constraint method, yield a set of Pareto solutions. A computational study on 120 generated benchmark instances demonstrates that the proposed methodology outperforms nine other high-performing multi-objective algorithms.
Citation: Zhao H, Huang X, Liu G, Li Z, Chen F, Lu G (2026) Learning-based multi-objective hyper-heuristic algorithm for reconfigurable assembly line scheduling problems. PLoS One 21(5): e0348884. https://doi.org/10.1371/journal.pone.0348884
Editor: Babak Aslani, Memorial Sloan Kettering Cancer Center, UNITED STATES OF AMERICA
Received: February 11, 2026; Accepted: April 22, 2026; Published: May 20, 2026
Copyright: © 2026 Zhao et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All data, code, and computational results of this study are accessible on GitHub at https://github.com/zixiangliwust/Instances_RALSP under the MIT license.
Funding: This project is partially supported by National Natural Science Foundation of China under grant 62173260 and Hubei Provincial Natural Science Foundation of China under grant 2026AFD061.
Competing interests: The authors have declared that no competing interests exist.
1. Introduction
With the trend toward customization, the modern market increasingly demands multi-variety, small-lot products. The manufacturing industry is therefore developing new, flexible manufacturing systems to replace traditional production models [1]. The reconfigurable manufacturing system (RMS), introduced by Koren, Heisel [2], is designed to achieve high flexibility. An RMS consists of modular machines whose production capacity and functionality can be modified by adding or removing hardware components and by changing software. Hence, an RMS can respond rapidly to fluctuations in market demand [3,4]. Compared with other systems, an RMS can switch swiftly between different products. This facilitates small-lot production and meets demand for multi-variety and customized output. An RMS can effectively combine the high throughput of traditional assembly lines with the high flexibility of cellular systems [5]. For producing a single product type, a basic single-model assembly line suffices [6,7]. When multiple products must be assembled, there are two options: the mixed-model assembly line and the multi-model assembly line. A mixed-model assembly line assembles different products simultaneously on the same line. These products are usually structurally similar and can be represented by a joint precedence graph [8,9], so the reconfiguration costs and times for switching between them are usually negligible. Conversely, if product variations are significant, multi-model assembly lines are employed. Here, products are assembled in separate batches, and the line must be physically reconfigured between batches, which incurs non-negligible reconfiguration costs and time. Unlike mixed-model lines, multi-model lines assemble only one product type at any given time.
Fig 1 illustrates three assembly line configurations under different production demands. In this figure, a single-model assembly line produces one product type (represented by triangles). A mixed-model assembly line assembles several products simultaneously, where different shapes denote different products. In contrast, a multi-model assembly line produces different products in separate batches and requires physical reconfiguration between batches (indicated by “R” in the diagram). In this configuration, different product sequences can lead to different reconfiguration costs, and the sequence also influences production workload balance and logistics leveling. Consequently, optimizing the product sequence is essential for achieving high efficiency and low cost on reconfigurable assembly lines.
Although several studies have been published, gaps remain in research on the reconfigurable assembly line scheduling problem (RALSP). First, published mathematical models often suffer from nonlinearities and logical inconsistencies, so the resulting formulations cannot be solved directly by standard optimization tools. Second, while some metaheuristics have been applied, there is a lack of intelligent, adaptive hyper-heuristic frameworks capable of dynamically leveraging the strengths of multiple search operators for the multi-objective RALSP. Third, the integration of machine learning techniques such as Q-learning for adaptive operator selection within such a framework remains underexplored. In this study, a more realistic RALSP is formulated with three objectives: minimizing reconfiguration cost, equalizing production workload, and leveling logistics. A Q-learning-based multi-objective hyper-heuristic algorithm (QMOHH) is developed to obtain a set of Pareto solutions. The main contributions of this paper are as follows.
- 1) A novel and tractable multi-objective mathematical model is constructed to minimize reconfiguration cost, equalize production workload, and level logistics. This model advances beyond prior work by presenting three fully linearized mixed integer linear programming (MILP) sub-models. It rectifies errors found in previously published formulations and can be solved directly, which supports benchmark generation and exact analysis. A numerical case study demonstrates that the models, combined with the ε-constraint method, yield a set of Pareto solutions.
- 2) The proposed QMOHH is a Q‑learning‑based hyper‑heuristic framework that dynamically orchestrates multiple operators derived from four metaheuristics: particle swarm optimization (PSO), teaching–learning-based optimization (TLBO), the whale optimization algorithm (WOA), and the grey wolf optimizer (GWO). It also employs a new density-aware leader selection strategy with a survival-time decay factor to select the global best solution for population evolution. This design is intended to maintain an effective balance between thorough exploration and intensive exploitation.
- 3) A comprehensive comparative study is conducted to evaluate the performance of the proposed QMOHH. The algorithm is compared with nine other state‑of‑the‑art multi‑objective algorithms, including the generalized differential evolution algorithm (GDE3), the multi-objective evolutionary algorithm based on decomposition (MOEAD), and the non-dominated sorting genetic algorithm II (NSGA-II). The comparison uses 120 generated benchmark instances and three performance indicators, and the experimental results show that QMOHH consistently outperforms the competing algorithms.
The remainder of this paper is organized as follows. Section 2 reviews the relevant literature. Section 3 describes the problem and provides the mathematical formulation. Section 4 introduces the proposed QMOHH in detail. Section 5 provides a numerical case study to clarify the main features of the problem under consideration. Section 6 evaluates the improvements of the proposed QMOHH and provides a comparative study. Finally, Section 7 concludes this study and suggests future research directions.
2. Literature review
This section reviews the relevant literature on the RALSP, hyper-heuristic methods, and Q-learning integration. It then identifies the research gaps and contributions.
2.1 Previous work on RALSP
Research on the RALSP focuses on improving flexibility, multi-objective optimization, and adaptation to dynamic environments. Among the pioneering works, Koren, Heisel [2] defined the RMS paradigm using modular machine tools and open-architecture controllers. Subsequently, researchers have extended RMS research in various directions. Hasan, Jain [4] proposed a “Service Level” index to quantify reconfiguration effort. Colledani, Gyulai [10] developed an integrated method for system design and reconfiguration planning under uncertain demand. Dou, Li [11] adopted NSGA-II for bi-objective optimization of cost and tardiness in reconfigurable flow lines.
For multi-objective optimization of the RALSP, Goyal and Jain [12] combined MOPSO with maximum deviation theory. Yuan, Deng [13] considered cloud manufacturing and used distance-sorting PSO. Prasad and Jayswal [14] incorporated reconfiguration effort, profit-to-cost ratio, and due dates using Shannon entropy. For modular products and dynamic environments, Pattanaik and Jena [15] used a Pareto-based heuristic, Yuan, Yu [16] designed a memetic algorithm, and Yang, Liu [17] used hybrid PSO for assembly sequence and equipment selection. Some attention has also been paid to reconfiguration efficiency and sustainability. Yelles-Chaouche, Gurevsky [18] formulated a MILP for task reassignment cost. Tremblet, Yelles-Chaouche [19] considered uncertain product arrival. Delorme and Gianessi [20] minimized power peaks. In addition, Gholami, Delorme [21] used fuzzy logic for sustainable supply chains.
Applications of hyper-heuristics to the RALSP are limited, so this section reviews the integration of hyper-heuristics and reinforcement learning in related scheduling problems. Cano-Belmán, Ríos-Mercado [22] developed a scatter search-based hyper-heuristic for mixed-model sequencing. Mosadegh, Fatemi Ghomi [23] developed a Q‑learning‑based simulated annealing algorithm. Later, Özbakır and Seçme [24] proposed a hyper-heuristic for stochastic parallel lines, and Zhou and Zhao [25] developed a hyper-heuristic to optimize material feeding. Guo, Liu [26] proposed a hyper-heuristic for integrated process planning and scheduling. Lu, Gao [27] solved PCB assembly line scheduling problems with a hyper-heuristic optimizer. Reviews of reinforcement learning-based hyper-heuristics and their scheduling applications can be found in Li, Wei [28] and Vela, Valencia-Rivera [29].
Beyond hyper-heuristics, integrating reinforcement learning (especially Q-learning) with metaheuristic algorithms has shown promising performance through adaptive scheduling. For instance, Yan and Wang [30] proposed a double-layer Q-learning algorithm for aircraft assembly scheduling, and Zhang, Tang [31] developed a Q-learning-based multi-objective evolutionary algorithm for assembly line balancing. More recently, Meng, Li [32] proposed a Q-learning-inspired differential evolution algorithm for mixed-model assembly line balancing and sequencing. Rauf, Mumtaz [33] developed a multi-objective intelligent hybrid genetic algorithm integrated with Q-learning-based parameter tuning, and Wen, Liu [34] developed PSO with Q-learning for aircraft pulsating assembly line scheduling. For broader reviews, see Bortolini, Galizia [5] on RMS, Gianassi, Leoni [1] on resource management in mixed-model and multi-model assembly lines, and Battaïa, Delorme [35] on line balancing and model sequencing.
2.2 Research gap and contributions
The literature review reveals several research gaps. First, existing mathematical models for the RALSP may suffer from nonlinearities and logical inconsistencies. They are usually developed to describe the problem and cannot be solved by standard optimization tools. Second, while some metaheuristics have been developed, there is a lack of studies on intelligent, adaptive hyper-heuristic frameworks that can dynamically leverage the strengths of multiple metaheuristics for the multi-objective RALSP. Third, the integration of machine learning (e.g., Q-learning) into metaheuristics or hyper-heuristics for adaptive operator and parameter selection still needs more attention. To address these gaps, the main contributions of this paper are as follows.
- 1) Formulation of a novel and tractable multi-objective mathematical model: This study develops three fully linearized MILP sub‑models—one for each objective: reconfiguration cost, workload equalization, and logistics leveling. This formulation corrects flaws in earlier models and is directly solvable by commercial solvers, providing exact benchmarks for small to medium instances.
- 2) Development of a new and effective hyper-heuristic: The proposed QMOHH uses Q-learning to select one promising operator from eight metaheuristic operators. It also uses a density-aware leader selection strategy with a survival-time decay factor to select the global best solution for population evolution. This strategy allocates more computational effort to isolated Pareto solutions and gives every archived solution a chance to be explored over the iterations, thereby balancing exploration and exploitation.
- 3) Comprehensive experimental study and validation: This study evaluates QMOHH on 120 generated benchmark instances and compares it with nine other popular multi‑objective algorithms. Computational results and statistical analysis demonstrate the superiority of QMOHH in convergence, diversity, and overall solution quality.
The proposed QMOHH differs from existing reinforcement-learning-assisted metaheuristics in three key aspects. 1) Most studies embed Q-learning into a single metaheuristic (e.g., Mosadegh, Fatemi Ghomi [23]; Zhang, Tang [31]). QMOHH instead acts as a hyper-heuristic that selects among eight complete operators from four metaheuristics (PSO, TLBO, WOA, GWO), leveraging diverse search behaviors within a unified framework. 2) Existing approaches often use algorithm-specific states (e.g., diversity, temperature) and scalar fitness rewards. QMOHH defines the state purely by search progress and the reward as the number of newly added non-dominated solutions, directly aligning adaptation with multi-objective optimization goals. 3) Unlike crowding-distance-based or random selection, QMOHH introduces a survival-time decay factor that reduces the attractiveness of long-surviving solutions, preventing stagnation and promoting exploration in sparse regions.
3. Problem description and model formulations
This section first describes the problem under consideration in detail, including its characteristics and assumptions. It then presents the complete mathematical formulation, which consists of three interconnected sub-models.
3.1 Problem description
In practice, assembly lines are primarily categorized into three types, as illustrated in Fig 2. Single-model assembly lines are dedicated to the high-volume production of a single product type, where no sequencing problem arises due to fixed product variety, and the primary optimization objective is line balancing—assigning assembly tasks to stations based on one or several criteria. Mixed-model assembly lines assemble a family of products that are similar in structure, function, and assembly methods to meet multi-variety, small-batch demands, requiring simultaneous consideration of both line balancing and product sequencing. Reconfigurable assembly lines, designed for customized and small-batch production, sequence a set of structurally similar but distinct products by rapidly switching between them via hardware and software adjustments. In most studies, line balancing in a reconfigurable assembly line is assumed to be predetermined, so optimization focuses on product sequencing with objectives of minimizing reconfiguration cost, equalizing production workload, and leveling material logistics.
The RALSP addressed in this study involves sequencing product types from the same family. Let $M$ denote the number of product types. The demand for the $m$th product type is $d_m$, and the total demand is $D = \sum_{m=1}^{M} d_m$. The number of stations and the processing time of each product at each station are predetermined and fixed. To simplify the problem, the literature often uses a minimum production cycle [13]. Here, the original demand $d_m$ is replaced by $d_m / g$, where $g$ is the greatest common divisor of the demands $d_m$ of all $M$ product types. Consequently, the number of products to be sequenced is reduced to $D / g$. For example, suppose the demands for four products (A, B, C, D) are 200, 200, 300, and 100, respectively. Their greatest common divisor is 100, so the minimum production cycle is [2, 2, 3, 1].
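The reduction to a minimum production cycle is a single greatest-common-divisor computation. The Python sketch below is illustrative only; the function name `minimum_production_cycle` is hypothetical and is not taken from the authors' repository.

```python
from functools import reduce
from math import gcd


def minimum_production_cycle(demands):
    """Divide every demand by the GCD of all demands.

    Returns the reduced demands and their sum, i.e. the number of
    products to be sequenced in one minimum production cycle.
    """
    g = reduce(gcd, demands)  # greatest common divisor of all demands
    reduced = [d // g for d in demands]
    return reduced, sum(reduced)


# Demands of products A, B, C, D from the example above
reduced, n_positions = minimum_production_cycle([200, 200, 300, 100])
print(reduced, n_positions)  # [2, 2, 3, 1] 8
```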
Before presenting the mathematical formulation, the following basic assumptions are stated: 1) the number of product types and stations is fixed and known; 2) the reconfiguration cost for switching between any two product types is known in advance; 3) the part requirements of each product type and the associated part-frequency constraints are known; 4) the processing time of each product type at each station and the constant interval between successive product startups are known; 5) all stations are used to assemble every product, and the assembly of each product starts at the first station and finishes at the last; and 6) the transportation time for moving products between stations is negligible.
3.2 Mathematical model
This section presents the complete and tractable mathematical formulation for the RALSP. A key contribution of this study is the development of three fully linearized MILP models, each dedicated to one of the three conflicting objectives: reconfiguration cost minimization, production workload equalization, and logistics leveling. Unlike previous studies that often present conceptual or nonlinear formulations that are challenging to solve directly, the developed models are designed for solvability by standard MILP solvers (e.g., CPLEX, Gurobi), providing exact benchmarks and enabling practical deployment for small-to-medium instances. The parameters and variables are first described below.
The first model (Model 1) addresses the objective of minimizing the total reconfiguration cost incurred when switching between different product types on the assembly line. In multi-model assembly lines, reconfiguration costs are high and cannot be neglected. Reconfiguration costs consist of expenses caused by changing hardware components, changing tools, reprogramming software, and downtime during changeovers. The minimization of reconfiguration cost is crucial to maintain the production efficiency and the profit of the assembly line. The decision variables for Model 1 are first introduced as follows.
Model 1 minimizes the total reconfiguration cost. Equation (1) sums the costs of transitions between consecutive products, including the transition from the last product back to the first. Constraints (2) and (3) ensure that each position holds exactly one product and that demand is satisfied. Constraints (4)–(6) linearize the products of binary variables for consecutive positions using auxiliary variables. Constraints (7)–(9) handle the cyclic transition. The complete linearization is discussed further in Section 3.3.
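Two parts of Model 1 are easy to check in code: the cyclic cost of Equation (1) and the product-of-binaries linearization. The sketch below assumes that constraints (4)–(6) take the standard three-inequality form z ≤ x, z ≤ y, z ≥ x + y − 1, because the text does not reproduce them. The two-product cost matrix is hypothetical.

```python
from itertools import product


def reconfiguration_cost(sequence, cost):
    """Total changeover cost of a cyclic product sequence.

    cost[a][b] is the (assumed) cost of switching from product a to b;
    the modulo index adds the transition from the last product back to the first.
    """
    n = len(sequence)
    return sum(cost[sequence[k]][sequence[(k + 1) % n]] for k in range(n))


def linearized_product_values(x, y):
    """All binary z satisfying z <= x, z <= y and z >= x + y - 1."""
    return [z for z in (0, 1) if z <= x and z <= y and z >= x + y - 1]


# The three inequalities leave exactly one feasible z, and it equals x * y.
for x, y in product((0, 1), repeat=2):
    assert linearized_product_values(x, y) == [x * y]

# Hypothetical 2-product cost matrix (no cost when the product does not change)
cost = {"A": {"A": 0, "B": 3}, "B": {"A": 2, "B": 0}}
print(reconfiguration_cost(["A", "B", "B"], cost))  # 3 + 0 + 2 = 5
```

Because these constraints pin z to x·y for binary inputs, the objective can stay linear in z. A solver then never sees a product of decision variables.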
In assembly lines, a steady supply of parts to stations is important for preventing bottlenecks, avoiding shortages, and reducing in-process inventories. The second model (Model 2) accounts for the supply of parts to stations by optimizing workload equalization, which it achieves by minimizing violations of part-frequency constraints. A part-frequency constraint for part $p$ states that, within any sliding window of $w_p$ consecutive product positions, at most $u_p$ products may require part $p$. Violations of part-frequency constraints can disrupt material flow, increase logistics complexity, and cause uneven station loads. Hence, minimizing these violations matters in real applications because it promotes smoother production and better resource utilization. The decision variables used in Model 2 are defined below.
Model 2 minimizes violations of part-frequency constraints. Equation (11) sums all violations. Constraints (12)–(13) are the same assignment and demand constraints as in Model 1. Equation (14) computes part usage within sliding windows using modulo arithmetic, so windows wrap around from the last position to the first. Constraints (15)–(16) use the big-M method to detect violations: if the usage of a part within a window exceeds its permitted maximum, the corresponding violation indicator is forced to 1. The linearization is detailed in Section 3.3.
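The following Python sketch evaluates Model 2's objective for a given sequence. It makes several assumptions. A violation is counted once per violated window, because Equation (11) is not reproduced here. `requires[p]` lists the product types that use part p, and `limits[p] = (u, w)` means at most u such products may appear in any w consecutive positions. The part data and all function names are hypothetical.

```python
def part_frequency_violations(sequence, requires, limits):
    """Count violated sliding windows along a cyclic product sequence.

    requires[p]: set of product types that use part p (hypothetical input)
    limits[p]:   (u, w), i.e. at most u products needing part p within any
                 w consecutive positions
    Modulo indexing lets windows wrap from the last position to the first.
    """
    n = len(sequence)
    violations = 0
    for part, (u, w) in limits.items():
        for start in range(n):
            usage = sum(
                1 for k in range(w) if sequence[(start + k) % n] in requires[part]
            )
            if usage > u:  # a big-M constraint would force the indicator to 1
                violations += 1
    return violations


def feasible_indicators(usage, u, big_m):
    """Binary y values allowed by the big-M constraint usage - u <= big_m * y."""
    return [y for y in (0, 1) if usage - u <= big_m * y]


requires = {"p1": {"C"}}  # hypothetical: only product C uses part p1
limits = {"p1": (1, 2)}  # at most one C in any two consecutive positions
print(part_frequency_violations(list("ABCACDBC"), requires, limits))  # 0
print(part_frequency_violations(list("AABBCCCD"), requires, limits))  # 2
```

`feasible_indicators` shows the big-M logic in isolation. When usage exceeds the limit, only y = 1 remains feasible. Otherwise both values are feasible, and minimization drives y to 0.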
In mixed-model and reconfigurable lines, it is often desirable to have a constant or nearly constant output rate for each product variant, a concept known as “rate smoothing” or “level scheduling”. A constant rate helps stabilize downstream processes, including part supply, material handling, and delivery schedules. The third model (Model 3) addresses this situation by optimizing logistics leveling. It does so by minimizing the deviation of actual cumulative production from an ideal, evenly spaced production rate for each product type. The decision variables for Model 3 are introduced as follows.
Model 3 improves logistics leveling by minimizing the deviation from ideal production rates. Equation (18) gives the linearized total absolute deviation. Constraints (19)–(20) are the assignment and demand constraints. Equations (21)–(24) compute station completion times with inter-station intervals. Constraints (25)–(26) calculate cumulative production counts. Equation (27) defines the makespan. Equation (29) linearizes the deviation using positive and negative slack variables. Constraint (28), however, contains a bilinear term, which makes the model a bilinear program. To ensure exact solvability with standard MILP solvers and to establish a reliable benchmark, this study implements the ε-constraint method in two sequential linear stages. Stage 1 minimizes the total completion time subject to constraints (19)–(27) and the original assignment constraints, obtaining its optimal value. Stage 2 then fixes the total completion time at this value in constraint (28), which therefore becomes linear. It then solves the resulting pure MILP to minimize the total absolute deviation (objective (18)). This decomposition guarantees that both stages are linear and can be solved to optimality. It ensures that the obtained solution is Pareto-optimal with respect to the third objective while keeping each stage linear.
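The time-based formulation of Model 3 cannot be reproduced from the text alone. The sketch below therefore uses a simpler, position-based proxy that is common in level scheduling. It assumes that after k products, the ideal cumulative output of type m is k·d_m/D. Its only purpose is to illustrate how splitting each gap into positive and negative slack recovers the absolute deviation, as Equation (29) does. It is not the paper's Model 3.

```python
def leveling_deviation(sequence, demands):
    """Position-based level-scheduling deviation (a simplified proxy).

    After k products, the ideal cumulative output of type m is k * d_m / D.
    Each gap is split into slack parts s_plus, s_minus >= 0 with
    gap = s_plus - s_minus; their sum equals abs(gap).
    """
    total = sum(demands.values())
    produced = {m: 0 for m in demands}
    deviation = 0.0
    for k, item in enumerate(sequence, start=1):
        produced[item] += 1
        for m, d in demands.items():
            gap = produced[m] - k * d / total
            s_plus, s_minus = max(gap, 0.0), max(-gap, 0.0)
            deviation += s_plus + s_minus
    return deviation


demands = {"A": 2, "B": 2}
print(leveling_deviation(["A", "B", "A", "B"], demands))  # 2.0
print(leveling_deviation(["A", "A", "B", "B"], demands))  # 4.0
```

As expected, the interleaved sequence deviates less from an even production rate than the batched one.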
3.3 Analysis of the formulated models
The three MILP sub-models presented in Section 3.2 are designed to be directly solvable by standard commercial solvers, thereby providing exact benchmarks for small-to-medium instances. To highlight the advancements achieved, this section compares the proposed formulations with prior work, particularly the models introduced by Yuan, Deng [13].
The formulations in Yuan, Deng [13] serve as a conceptual framework for the RALSP but suffer from fundamental limitations that prevent direct solution by MILP solvers. 1) Their workload equalization model defines the objective via a conditional (if-then) rule. Such a rule cannot be expressed as linear constraints without auxiliary binary variables and big-M transformations. 2) Their delayed workload model involves max() functions in constraints (8) and (9), which are nonlinear and non-convex. MILP solvers cannot handle them without linearization steps, and the formulation omits those steps. 3) Moreover, the three sub-models use inconsistent variable definitions and constraint structures. This makes it impossible to combine them into a single multi-objective MILP model that can be solved simultaneously. In contrast, the models in this study share a common variable set and can be addressed jointly through ε-constraint or weighted-sum methods.
Consequently, while the earlier work provides valuable insights into the problem structure, it does not yield a solvable MILP formulation. The models proposed in this study overcome these limitations through complete linearization. Model 1 linearizes products of binary variables using auxiliary variables and linear constraints (4)–(9). Model 2 encodes part-frequency constraint violations with the big-M method, transforming the original logical condition into linear inequalities (15)–(16). Model 3 handles the bilinear term in constraint (28) via a two-stage ε-constraint approach and linearizes absolute deviations with positive and negative slack variables. These linearizations ensure that all three models are directly solvable by commercial solvers such as CPLEX and Gurobi. The solvers provide exact optimal solutions for small-to-medium instances and reliable benchmarks for evaluating heuristic algorithms. For practical multi-objective optimization, the linear models can be combined using weighted-sum or ε-constraint methods while maintaining computational tractability.
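To make the ε-constraint idea concrete, the sketch below traces a bi-objective Pareto front on a tiny instance whose candidate objective pairs can be enumerated. Exhaustive search stands in for the MILP solver. The sketch assumes discrete objective values, so a strict inequality on the second objective is enough to move to the next front point. The function name and sample points are hypothetical.

```python
def epsilon_constraint_front(points):
    """Trace a bi-objective Pareto front with the epsilon-constraint method.

    points: (f1, f2) objective pairs of candidate solutions, both minimized.
    Each step minimizes f1 subject to f2 < epsilon, then tightens epsilon to
    the f2 value just found. Exhaustive search replaces the MILP solver.
    """
    front = []
    epsilon = float("inf")
    while True:
        feasible = [p for p in points if p[1] < epsilon]
        if not feasible:
            return front
        best = min(feasible)  # lexicographic: min f1, ties broken by min f2
        front.append(best)
        epsilon = best[1]


candidates = [(1, 5), (2, 3), (3, 3), (4, 1), (5, 5)]
print(epsilon_constraint_front(candidates))  # [(1, 5), (2, 3), (4, 1)]
```

Breaking ties on f2 keeps weakly dominated points such as (3, 3) out of the front.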
4. Proposed methodology
To address the multi-objective RALSP formulated in Section 3, this study proposes a novel QMOHH. The algorithm employs a Q-learning controller to adaptively select from a pool of eight heuristic operators derived from four established metaheuristics: PSO, TLBO, WOA, and GWO. To enhance search performance, QMOHH also employs a new density-aware leader selection strategy with a survival-time decay factor to select the global best solution for population evolution. This strategy is designed to exploit high-quality solutions while preventing over-reliance on any single individual.
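This section does not give the formulas of the density-aware leader selection strategy. The sketch below is therefore a hypothetical reading of the description, not the authors' method. It measures sparsity as the Euclidean distance to the nearest other archive member in normalized objective space. It lowers the weight of long-surviving members with a geometric factor `decay ** age`, and it draws the leader by roulette wheel. The distance measure, `decay=0.9`, and the roulette choice are all assumptions.

```python
import math
import random


def leader_weights(objectives, ages, decay=0.9):
    """Hypothetical selection weights for archive members.

    objectives: objective vectors, assumed already normalized to [0, 1]
    ages:       iterations each member has survived in the archive
    Isolated members (large nearest-neighbor distance) score higher, and
    decay ** age reduces the attractiveness of long-surviving members.
    """
    weights = []
    for i, fi in enumerate(objectives):
        others = [math.dist(fi, fj) for j, fj in enumerate(objectives) if j != i]
        sparsity = min(others) if others else 1.0
        weights.append(sparsity * decay ** ages[i])
    return weights


def select_leader(archive, objectives, ages, rng=random):
    """Roulette-wheel choice of the global best from the archive."""
    weights = leader_weights(objectives, ages)
    if sum(weights) == 0:  # e.g. duplicate points: fall back to a uniform choice
        return rng.choice(archive)
    return rng.choices(archive, weights=weights, k=1)[0]
```

Under these assumptions, an isolated archive member is picked more often than two members that crowd each other. Each extra iteration in the archive shrinks a member's weight by 10%.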
4.1 Main procedure of QMOHH
The framework of the proposed QMOHH is outlined in Algorithm 1. QMOHH begins by initializing a population, along with a Pareto archive for maintaining non-dominated solutions. A Q-learning agent is also initialized to guide operator selection.
In each iteration, the algorithm determines its current state from the evaluation progress and uses an ε-greedy policy to select one of the eight heuristic operators. If the selected operator requires a global best solution, one is chosen from the archive using the density-aware leader selection strategy. The selected operator then generates new solutions, which are evaluated. Afterwards, the personal best solutions and the Pareto archive are updated where necessary. The algorithm computes a reward equal to the number of newly added non-dominated solutions and uses it to update the Q-table. This main loop repeats until the termination criterion is met (in this study, a maximum number of evaluations). The Pareto archive is then returned as the approximate Pareto-optimal set obtained by QMOHH. Through Q-learning and the density-aware leader selection strategy, QMOHH aims to maintain a proper balance between exploration and exploitation throughout the search.
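The operator-selection loop described above can be sketched as a small tabular Q-learning agent. The text specifies only three elements: the state depends on evaluations/max_evaluations, an ε-greedy policy picks among eight operators, and the reward is the number of newly added non-dominated solutions. Everything else in this sketch is an assumption. That includes splitting progress into `n_states` equal bins, the values of `alpha`, `gamma`, and `epsilon`, and the class name `OperatorSelector`.

```python
import random


class OperatorSelector:
    """Tabular Q-learning over (progress-bin state, operator) pairs.

    n_states, alpha, gamma and epsilon are assumed values; the text states
    only that the state depends on evaluations / max_evaluations and that an
    epsilon-greedy policy chooses among eight operators.
    """

    def __init__(self, n_operators=8, n_states=4, alpha=0.1, gamma=0.9,
                 epsilon=0.1, seed=None):
        self.n_states = n_states
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        self.q = [[0.0] * n_operators for _ in range(n_states)]
        self.rng = random.Random(seed)

    def state(self, evaluations, max_evaluations):
        """Map search progress in [0, 1] to one of n_states equal bins."""
        progress = min(evaluations / max_evaluations, 1.0)
        return min(int(progress * self.n_states), self.n_states - 1)

    def select(self, s):
        """Epsilon-greedy: explore at random, otherwise a best-valued operator."""
        row = self.q[s]
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(row))
        best = max(row)
        return self.rng.choice([a for a, v in enumerate(row) if v == best])

    def update(self, s, a, reward, s_next):
        """Standard Q-learning update toward reward + gamma * max Q(s_next, .)."""
        target = reward + self.gamma * max(self.q[s_next])
        self.q[s][a] += self.alpha * (target - self.q[s][a])


agent = OperatorSelector(seed=0)
s = agent.state(250, 1000)  # 25% of the budget used -> state 1
a = agent.select(s)
# Suppose operator a added 3 new non-dominated solutions to the archive
agent.update(s, a, reward=3, s_next=agent.state(300, 1000))
```

Ties among equal Q-values are broken at random, so the agent does not keep favoring operator 0 early in the run when all Q-values are still zero.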
Algorithm 1. Proposed Q-learning-based multi-objective hyper-heuristic
Input: Instance data; parameters (population size N and maximum number of evaluations max_evaluations)
Output: An approximate Pareto archive
1: Initialize population P randomly under the demand constraints;
2: Evaluate the initial population by computing the objective values of each solution;
3: Update the Pareto archive with the initial population;
4: Initialize the personal best solutions: P_best ← P;
5: Create the states, actions, and Q-table to initialize the Q-learning agent;
6: Determine the current state from the current progress (evaluations/max_evaluations);
7: While evaluations < max_evaluations do
8: Determine the current state from evaluations/max_evaluations;
9: Select an operator for population evolution with the Q-learning agent (ε-greedy);
10: If the selected operator requires a global leader then
11: Select the global best from the Pareto archive with the density-aware leader selection strategy;
12: End-if
13: Generate a new population P_new with the selected operator;
14: Evaluate the new population and update the evaluation count;
15: Update each personal best in P_best if the corresponding new solution dominates it;
16: Update the Pareto archive with the non-dominated solutions from P_new;
17: reward ← number of newly added non-dominated solutions;
18: next_state ← state determined by the evaluation progress;
19: Update the Q-table;
20: P ← select N solutions from P ∪ P_new;
21: current_state ← next_state;
22: End-while
23: Return the Pareto archive // final approximate Pareto-optimal set
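Steps 16 and 17 of Algorithm 1 combine the archive update with the reward computation. A minimal sketch follows, under two assumptions. Archive entries are plain objective tuples with all objectives minimized. The reward counts the candidates from this batch that are still in the archive after the update, which is one reading of "newly added". The function names are hypothetical.

```python
def dominates(a, b):
    """True if objective vector a Pareto-dominates b (all objectives minimized)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def update_archive(archive, candidates):
    """Merge candidates into a non-dominated archive; return (archive, reward).

    archive, candidates: lists of objective tuples.
    reward: number of batch members that survive in the updated archive.
    """
    new_archive = list(archive)
    for c in candidates:
        if any(dominates(a, c) or a == c for a in new_archive):
            continue  # dominated or duplicate: archive unchanged
        new_archive = [a for a in new_archive if not dominates(c, a)]
        new_archive.append(c)
    old = set(archive)
    reward = sum(1 for a in new_archive if a not in old)
    return new_archive, reward


archive, reward = update_archive([(2, 2)], [(1, 3), (3, 3), (1, 1)])
print(archive, reward)  # [(1, 1)] 1
```

In the example, (1, 3) enters the archive and is later displaced by (1, 1) from the same batch, so only one new solution earns reward.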
4.2 Solution representation
To evaluate the objectives of the RALSP, a solution must specify a product sequence whose length equals the number of products in the minimum production cycle. Each position in the sequence holds one product. The sequence must satisfy the demand constraint of each product type, i.e., each type appears exactly as many times as its reduced demand. Suppose that there are four product types (A, B, C, D) with demands [2, 2, 3, 1]. One feasible sequence is [A, B, C, A, C, D, B, C].
The proposed QMOHH encodes a solution as a string of floating-point numbers, one per sequence position. The random-key method decodes these numbers into a feasible product sequence. Fig 3 shows an example solution representation for four product types (A, B, C, D) with a minimum production cycle of [2, 2, 3, 1]. Eight floating-point numbers are randomly generated within [0.00, 1.00], and the original product sequence is [A, A, B, B, C, C, C, D]. The floating-point numbers are sorted in ascending order, and the original sequence is reordered accordingly, which yields the final product sequence [B, C, C, B, A, A, D, C].
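Random-key decoding can be sketched in a few lines of Python. The key values below are illustrative. They were chosen to reproduce the product sequence quoted in the text and are not the values shown in Fig 3. The function name `decode` is hypothetical.

```python
def decode(keys, base):
    """Random-key decoding: reorder the base sequence by ascending key value.

    keys: floats in [0, 1], one per position
    base: products listed by type, e.g. [A, A, B, B, C, C, C, D]
    Any key vector yields a permutation of base, so demands always hold.
    """
    order = sorted(range(len(keys)), key=keys.__getitem__)
    return [base[i] for i in order]


base = ["A", "A", "B", "B", "C", "C", "C", "D"]
# Illustrative keys chosen to reproduce the sequence in the text (not Fig 3's values)
keys = [0.47, 0.58, 0.05, 0.31, 0.12, 0.23, 0.90, 0.66]
print(decode(keys, base))  # ['B', 'C', 'C', 'B', 'A', 'A', 'D', 'C']
```

Decoding only permutes the base sequence, so every product type keeps its demand count, whatever values a metaheuristic operator writes into the keys.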
Once the product sequence has been decoded, the three objectives can be computed directly according to the three linearized MILP models in Section 3.2. Specifically, the reconfiguration cost is obtained by summing the costs incurred between consecutive product pairs in the sequence. Production workload equalization is calculated by summing the violations of the part-frequency constraints along the sequence. Logistics leveling is calculated by summing the deviations of the actual cumulative production from the ideal, evenly spaced production rate of each product type. Decoding only reorders the original sequence, so every product type's demand constraint is satisfied for any string of floating-point numbers of the required length. This encoding scheme therefore guarantees feasible sequences throughout the search process and facilitates the application of the metaheuristic operators described in Section 4.3.
4.3 Metaheuristic operators
The proposed QMOHH employs eight heuristic operators derived from four well-established metaheuristics. Each operator is adapted to modify the floating-point encoding in order to obtain diverse, feasible product sequences.
PSO-based operators: Three operators are adapted from PSO [36]. The first applies the basic position update. The second combines the position update with uniform mutation to emphasize exploration. The third combines the position update with non-uniform mutation, whose strength decreases over time to achieve phased refinement.
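The three PSO-based operators could look roughly as follows on a random-key vector. This is a sketch under assumptions, not the paper's parameterization. It uses the standard inertia-weight velocity update with assumed coefficients `w`, `c1`, `c2`, and it clips keys to [0, 1]. The non-uniform mutation follows the common Michalewicz form, in which the step shrinks as the run progresses; `rate` and `b` are assumed values.

```python
import random


def clip(v):
    """Keep random keys inside [0, 1] (assumed bound handling)."""
    return min(max(v, 0.0), 1.0)


def pso_update(x, v, p_best, g_best, w=0.7, c1=1.5, c2=1.5, rng=random):
    """Basic PSO velocity/position update on a random-key vector."""
    new_v = [
        w * vi + c1 * rng.random() * (pi - xi) + c2 * rng.random() * (gi - xi)
        for xi, vi, pi, gi in zip(x, v, p_best, g_best)
    ]
    new_x = [clip(xi + vi) for xi, vi in zip(x, new_v)]
    return new_x, new_v


def uniform_mutation(x, rate=0.1, rng=random):
    """Replace each key with a fresh uniform value with probability rate."""
    return [rng.random() if rng.random() < rate else xi for xi in x]


def non_uniform_mutation(x, t, t_max, rate=0.1, b=5.0, rng=random):
    """Michalewicz-style mutation whose step shrinks to zero as t reaches t_max."""
    frac = min(t / t_max, 1.0)
    out = []
    for xi in x:
        if rng.random() < rate:
            shrink = 1.0 - rng.random() ** ((1.0 - frac) ** b)
            if rng.random() < 0.5:
                xi = xi + (1.0 - xi) * shrink  # move toward the upper bound 1
            else:
                xi = xi - xi * shrink  # move toward the lower bound 0
        out.append(clip(xi))
    return out


rng = random.Random(1)
x = [0.2, 0.5, 0.9]
x1, v1 = pso_update(x, [0.0, 0.0, 0.0], x, [1.0, 1.0, 1.0], rng=rng)
x2 = uniform_mutation(x1, rng=rng)
x3 = non_uniform_mutation(x1, t=10, t_max=100, rng=rng)
```

A mutated key vector always decodes to a feasible sequence, so these operators never need a repair step.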
TLBO-based operators: Three operators are derived from TLBO [37]: the teacher-phase operator, which simulates a teacher guiding the students; the learning-phase operator, which simulates peer-to-peer learning among students; and the mixed-phase operator, which combines the teacher and learning phases to integrate global and local search.
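A sketch of the three TLBO-based operators on random keys follows. To adapt the learner phase to multiple objectives, it assumes that a learner moves away from a peer it Pareto-dominates and toward the peer otherwise; the mixed operator simply applies the teacher phase and then the learner phase to the re-evaluated population. The `evaluate` callback and all function names are hypothetical.

```python
import numpy as np


def dominates(a, b):
    """Pareto dominance for minimization."""
    return bool(np.all(a <= b) and np.any(a < b))


def teacher_phase(X, teacher, rng):
    """Shift the class towards the teacher (the leader) and away from the mean."""
    tf = rng.integers(1, 3)                      # teaching factor: 1 or 2
    r = rng.random(X.shape)
    return np.clip(X + r * (teacher - tf * X.mean(axis=0)), 0.0, 1.0)


def learner_phase(X, F, rng):
    """Peer learning; F holds the objective vectors of the rows of X."""
    n = len(X)
    Y = X.copy()
    for i in range(n):
        j = rng.choice([k for k in range(n) if k != i])   # random peer
        r = rng.random(X.shape[1])
        if dominates(F[i], F[j]):    # move away from a dominated peer
            Y[i] = X[i] + r * (X[i] - X[j])
        else:                        # otherwise learn from the peer
            Y[i] = X[i] + r * (X[j] - X[i])
    return np.clip(Y, 0.0, 1.0)


def mixed_phase(X, teacher, evaluate, rng):
    """Teacher phase, re-evaluation, then learner phase."""
    Xt = teacher_phase(X, teacher, rng)
    return learner_phase(Xt, evaluate(Xt), rng)
```

The learner phase needs no global best, which matches the remark below that every operator except the learning-phase operator depends on the selected leader.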
WOA-based operator: This operator probabilistically implements the three characteristic behaviors of WOA [38]: encircling prey, bubble-net attacking, and random search. Adaptive parameters shift the emphasis from exploration to exploitation as the optimization proceeds.
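The WOA-based operator might look like the sketch below, which follows the standard WOA update rules (a linearly decreasing parameter `a`, an even choice between encircling/search and the logarithmic spiral, and spiral constant `b = 1`). Drawing one scalar coefficient A per whale and clipping keys to [0, 1] are assumptions, not details taken from the paper.

```python
import numpy as np


def woa_step(X, leader, t, t_max, rng, b=1.0):
    """One WOA move on random keys; X is (n, d), leader is (d,)."""
    a = 2.0 * (1.0 - t / t_max)              # decreases linearly from 2 to 0
    Y = np.empty_like(X)
    for i, x in enumerate(X):
        A = 2.0 * a * rng.random() - a      # |A| < 1 favours exploitation
        C = 2.0 * rng.random()
        if rng.random() < 0.5:
            # encircle the leader, or search around a random whale when |A| >= 1
            ref = leader if abs(A) < 1.0 else X[rng.integers(len(X))]
            Y[i] = ref - A * np.abs(C * ref - x)
        else:
            # bubble-net attack: logarithmic spiral towards the leader
            l = rng.uniform(-1.0, 1.0)
            Y[i] = np.abs(leader - x) * np.exp(b * l) * np.cos(2.0 * np.pi * l) + leader
    return np.clip(Y, 0.0, 1.0)
```

Early in the run `a` is close to 2, so |A| often exceeds 1 and random search dominates; near the end |A| stays below 1 and the moves concentrate around the leader.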
GWO-based operator: Based on GWO [39], this operator uses a hierarchy of three leading solutions (alpha, beta, and delta) selected from the archive to guide population updates. Using three leaders concentrates the search on the regions between Pareto solutions.
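A minimal GWO-style update with three archive leaders is sketched below, following the standard GWO equations. How alpha, beta, and delta are picked from the archive is not shown; the function simply receives them as arguments, and its name is hypothetical.

```python
import numpy as np


def gwo_step(X, alpha, beta, delta, t, t_max, rng):
    """Move every wolf to the average of the positions suggested by the three
    leaders; the coefficient a decays linearly from 2 to 0."""
    a = 2.0 * (1.0 - t / t_max)
    Y = np.zeros_like(X)
    for leader in (alpha, beta, delta):
        A = 2.0 * a * rng.random(X.shape) - a
        C = 2.0 * rng.random(X.shape)
        Y += leader - A * np.abs(C * leader - X)
    return np.clip(Y / 3.0, 0.0, 1.0)
```

Averaging the three leader-guided positions is what places new solutions between the leaders rather than on top of any single one.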
Note that, in multi-objective optimization, every heuristic operator except the learning-phase operator requires a global best solution, which is obtained in each iteration with the new density-aware leader selection strategy. All operators modify the continuous encoding, and any new continuous encoding can be decoded into a feasible product sequence as described in Section 4.2; hence, the demand constraints are always satisfied.
Among the eight heuristic operators, some emphasize global exploration and others local refinement. Together they give the Q-learning mechanism a rich set of strategies to select from dynamically, enabling effective adaptation to the evolving solution space.
4.4 Q-Learning mechanism for operator selection
Q-learning is a model-free, value-based reinforcement learning algorithm that has been widely used in the literature. It estimates the expected cumulative reward (Q-value) of taking a specific action in a given state [31]. The algorithm maintains a Q-table in which each entry $Q(s, a)$ indicates the expected utility of choosing action $a$ in state $s$. The Q-values are updated iteratively using the expression [31]

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right],$$

where $\alpha$ is the learning rate, $\gamma$ is the discount factor, $r$ is the immediate reward received after taking action $a$ in state $s$, and $s'$ is the next state.
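The update rule can be checked with a toy two-state, two-action table. The parameter values are illustrative only (the calibrated values are defined in the authors' source code): with α = 0.5, γ = 0.5, r = 3, and max Q(s', ·) = 2, the entry moves from 0 to 0 + 0.5 × (3 + 0.5 × 2 − 0) = 2.

```python
def q_update(Q, s, a, r, s_next, alpha, gamma):
    """One tabular Q-learning step; Q is indexed as Q[state][action]."""
    td_target = r + gamma * max(Q[s_next])
    Q[s][a] += alpha * (td_target - Q[s][a])
    return Q[s][a]


Q = [[0.0, 0.0], [1.0, 2.0]]          # toy table: 2 states x 2 actions
print(q_update(Q, s=0, a=1, r=3.0, s_next=1, alpha=0.5, gamma=0.5))  # 2.0
```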
In the proposed QMOHH, Q-learning adaptively selects one of the eight heuristic operators at each iteration. The state representation, action set, and reward function are defined first; the main loop then performs action selection and policy updates iteratively.
State representation: The state is defined by the progress of the optimization. The evaluation progress, i.e., the ratio of completed evaluations to the maximum number of allowed evaluations, is discretized into ten intervals labeled 0.0 to 0.9. This discretization yields a state space that captures the different stages of the search and makes the Q-learning policy phase-dependent. Unlike approaches that rely on problem-specific metrics (e.g., population diversity or temperature), this state definition is problem-independent and requires no instance-specific tuning, which facilitates generalization to other scheduling problems.
Action set: The action set contains eight elements corresponding to eight heuristic operators in Section 4.3, including three PSO-based operators, three TLBO-based operators, one WOA-based operator, and one GWO-based operator.
Reward function: After the selected operator is applied and the new population is evaluated, the reward is calculated as the number of new non-dominated solutions from the newly generated population that are added to the external Pareto archive. This design aligns the Q-learning objective directly with the goal of multi-objective optimization (expanding the Pareto front) rather than relying on the scalar fitness improvements commonly used in single-objective or parameter-tuning contexts.
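The reward can be computed while updating the archive, as in the sketch below. It assumes that objective vectors are tuples of values to be minimized, that duplicates of existing archive members are rejected, and that a candidate counts toward the reward only if it is still in the archive after the whole batch is processed; these are interpretation choices, not details confirmed by the paper.

```python
def dominates(a, b):
    """True if objective vector a Pareto-dominates b (minimization)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def update_archive(archive, candidates):
    """Insert candidates into a Pareto archive; return (new_archive, reward)."""
    archive = list(archive)
    added = []
    for c in candidates:
        if any(m == c or dominates(m, c) for m in archive):
            continue                                  # rejected: dominated or duplicate
        archive = [m for m in archive if not dominates(c, m)]
        archive.append(c)
        added.append(c)
    reward = sum(1 for c in added if c in archive)    # new members that survived
    return archive, reward


archive, reward = update_archive([(1, 5), (5, 1)], [(2, 2), (6, 6)])
print(archive, reward)  # [(1, 5), (5, 1), (2, 2)] 1
```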
Action selection and policy update: At each iteration, the current state is determined, and one heuristic operator is then selected from the Q-table using an ε-greedy policy. The selected operator generates a new population, which is evaluated to compute the reward. Finally, the Q-table is updated with the standard Q-learning update rule using this reward.
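The state mapping and ε-greedy selection can be sketched as follows. The linear decay to 0.1 follows the calibration described in Section 6.1, but the initial value `eps0 = 0.9` is a placeholder because the calibrated value is given only in the source code; random tie-breaking among equal Q-values is also an assumption.

```python
import random

N_STATES, N_ACTIONS = 10, 8   # ten progress intervals, eight operators


def state_of(evals, max_evals):
    """Map evaluation progress to one of ten intervals (0.0, 0.1, ..., 0.9)."""
    return min(int(N_STATES * evals / max_evals), N_STATES - 1)


def epsilon_at(evals, max_evals, eps0=0.9, eps_min=0.1):
    """Linear decay from eps0 (placeholder value) to eps_min."""
    frac = min(evals / max_evals, 1.0)
    return eps0 - (eps0 - eps_min) * frac


def select_operator(Q, s, eps, rng):
    """epsilon-greedy choice among the eight operators, random tie-breaking."""
    if rng.random() < eps:
        return rng.randrange(N_ACTIONS)
    best = max(Q[s])
    return rng.choice([a for a, q in enumerate(Q[s]) if q == best])


rng = random.Random(0)
Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
s = state_of(25_000, 100_000)                  # -> state 2 (progress 0.2)
op = select_operator(Q, s, epsilon_at(25_000, 100_000), rng)
```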
4.5 New density-aware leader selection strategy
A global best solution (gbest) is needed to guide population evolution in each iteration, and the way gbest is selected from the Pareto archive strongly affects the search direction. Unlike the method of Coello, Pulido [36], the proposed QMOHH employs a new density-aware leader selection strategy that incorporates a survival-time decay factor. For each archive solution, the strategy computes a modified crowding distance from three quantities: the original crowding distance of Deb, Pratap [40], the solution's selection time in iterations (i.e., how often it has been selected as the leader), and a decay rate (e.g., 0.5). This modification reduces the attractiveness of long-surviving solutions, even if they reside in sparse regions. The survival-time decay factor can be understood intuitively as a "freshness" penalty: a solution that has been selected as the global leader many times becomes progressively less likely to be chosen again, giving newer or less frequently selected solutions an opportunity to guide the population. This mechanism keeps the algorithm from being trapped around a few dominant individuals, promotes continued exploration of sparser regions of the Pareto front, and thus prevents search stagnation. The leader selection process is illustrated in Fig 4.
Overall, the crowding-distance term consistently favors leaders from less crowded regions of the Pareto front, while the survival-time decay factor lowers the selection probability of repeatedly chosen leaders, so the strategy balances diversity-driven exploration with the exploitation of known high-quality areas over time.
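The following sketch shows one way to implement such a strategy. The exact formula for the modified crowding distance is not reproduced in this text, so the exponential form `cd * exp(-lam * t)`, the capping of infinite boundary distances, and the roulette-wheel selection are all assumptions; the crowding distance itself follows the usual NSGA-II definition.

```python
import numpy as np


def crowding_distance(F):
    """NSGA-II crowding distance for an archive of objective vectors F (n x m)."""
    F = np.asarray(F, dtype=float)
    n, m = F.shape
    if n <= 2:
        return np.full(n, np.inf)
    cd = np.zeros(n)
    for j in range(m):
        order = np.argsort(F[:, j])
        span = F[order[-1], j] - F[order[0], j]
        cd[order[0]] = cd[order[-1]] = np.inf        # keep boundary solutions
        if span > 0:
            cd[order[1:-1]] += (F[order[2:], j] - F[order[:-2], j]) / span
    return cd


def select_leader(F, times_selected, lam, rng):
    """Pick a leader index by roulette wheel; times_selected is updated in place."""
    cd = crowding_distance(F)
    finite = cd[np.isfinite(cd)]
    cap = 2.0 * finite.max() if finite.size and finite.max() > 0 else 1.0
    cd = np.where(np.isinf(cd), cap, cd)              # make boundary points finite
    # ASSUMED decay form; subtracting the minimum count only rescales all weights
    decay = np.exp(-lam * (times_selected - times_selected.min()))
    modified = cd * decay
    k = int(rng.choice(len(cd), p=modified / modified.sum()))
    times_selected[k] += 1
    return k
```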
5 Numerical case study
In this section, the developed models and QMOHH are applied to a real-world case introduced by Yuan, Deng [13]. The case is taken from Changzhou AMEC&GBM Motor Company, which produces a family of DC electric motors on a reconfigurable assembly line. Like other reconfigurable systems, this assembly line must produce multiple product variants in small batches while minimizing changeover effort and maintaining a smooth production flow.
This real-world case consists of five stations producing four distinct motor models of the same product family: 62ZYT001, 100ZYT001, 62ZYT-SUV, and 78ZYT001, denoted as products A, B, C, and D, respectively. The demand for products is [3–5] under the minimum production cycle, which results in a product sequencing problem whose length equals the total demand of the cycle. Relevant production data are summarized in Table 1 (reconfiguration cost matrix, in thousand CNY), Table 2 (part-requirement matrix, where "1" indicates that the part is required), and Table 3 (operation time per station, in time-units). The following part-frequency constraints apply: Part P1: (2; 3); Part P2: (3; 5); Part P3: (3; 4); Part P4: (1; 2). A fixed interval of 3 time-units is assumed between successive product startups.
To solve this multi-objective problem with the models, the ε-constraint method is first used to obtain Pareto-optimal solutions. First, each of the three MILP models (Model 1, Model 2, and Model 3) is solved individually to obtain the minimum value of its objective. All three objectives are then evaluated on the product sequences returned by these models, and the worst value observed for each objective is used to approximate its maximum. Second, a grid of ε values is created for two of the objectives while the third serves as the primary objective. An interval step of 5 is used to generate the Pareto set: the range of each constrained objective is divided into 5 equal parts, yielding 6 ε values per objective. Including a relaxed case (ε = ∞) for each constraint, 7 × 7 = 49 subproblems are solved for a given primary objective, and cycling through all three primary objectives leads to 3 × 49 = 147 MILP runs. The step size can be adjusted to obtain a finer or coarser approximation of the Pareto frontier. Third, the MILP is solved for every combination of ε values. When the first objective is the primary objective, the following problem is solved for each combination $(\varepsilon_2, \varepsilon_3)$:

$$\min_{x \in X} \; f_1(x) \quad \text{s.t.} \quad f_2(x) \le \varepsilon_2, \quad f_3(x) \le \varepsilon_3,$$

where $f_1$, $f_2$, and $f_3$ denote the reconfiguration cost, production workload equalization, and logistics leveling objectives, respectively, and $X$ is the feasible set defined by the model constraints. Here, the bilinear term in Model 3 is handled by a two-stage linearization. Finally, all obtained solutions are merged, and a dominance filter is applied to extract the non-dominated set.
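The subproblem count can be verified with a short script. The objective ranges below are hypothetical payoff-table values; only the grid construction (five equal parts, six ε values plus ε = ∞) and the final dominance filter follow the description above.

```python
import itertools
import math


def epsilon_grid(lo, hi, parts=5):
    """parts + 1 equally spaced epsilon values over [lo, hi], plus the relaxed case."""
    step = (hi - lo) / parts
    return [lo + k * step for k in range(parts + 1)] + [math.inf]


def subproblems(ranges, parts=5):
    """Enumerate (primary objective, {constrained objective: epsilon}) pairs."""
    runs = []
    for primary in ranges:
        others = [o for o in ranges if o != primary]
        grids = [epsilon_grid(*ranges[o], parts) for o in others]
        for eps in itertools.product(*grids):
            runs.append((primary, dict(zip(others, eps))))
    return runs


def dom(a, b):
    """Pareto dominance for minimization."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def non_dominated(points):
    """Dominance filter applied to the merged solutions (duplicates removed)."""
    unique = list(dict.fromkeys(points))
    return [p for p in unique if not any(dom(q, p) for q in unique)]


ranges = {"f1": (10.0, 70.0), "f2": (5.0, 30.0), "f3": (20.0, 70.0)}  # hypothetical
print(len(subproblems(ranges)))  # 147
```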
All models terminate when the optimal solution is found or the computation time reaches 3600 seconds (s). Table 4 presents the 13 Pareto-optimal solutions obtained with the ε-constraint method. Reconfiguration cost ranges from 13 to 67, workload violation from 7 to 24, and logistics leveling deviation from 22.51 to 64.23. Solutions with low reconfiguration cost (e.g., 13) have higher workload violation and higher logistics leveling deviation, whereas solutions with reduced workload violation or logistics leveling deviation require substantially higher reconfiguration cost. These findings confirm clear trade-offs among the objectives. This diverse set of non-dominated solutions offers the decision-maker many scheduling alternatives, demonstrating the effectiveness of the formulated models.
The proposed QMOHH is then run 10 times and the best results are recorded, yielding 85 non-dominated solutions. As Fig 5 shows, QMOHH extends the discovered Pareto frontier across the search regions, demonstrating its strength in broad exploration and its ability to approximate an extensive, well-distributed frontier. Notably, some Pareto solutions obtained by the developed models are dominated by Pareto solutions found by QMOHH. This has two causes: first, the discrete interval step of the ε-constraint method limits exploration of the continuous trade-off space; second, the bilinear term in Model 3 requires a two-stage decomposition, which may further restrict the search space for the third objective in the final stage.
For the ε-constraint method, the average computation time per MILP run is 1.78 s, and the total computation time is 261.99 s. For QMOHH, the average computation time per run is 8.06 s, and the total is 80.59 s. In short, this case study first verifies that the developed mathematical models can generate a set of high-quality Pareto solutions that capture the fundamental trade-offs among the objectives. Moreover, QMOHH shows strong exploration capability in the complex multi-objective landscape, uncovering a broad range of high-quality trade-off solutions in less total computation time. The case study thus shows that QMOHH can provide decision-makers with a comprehensive approximation of the true Pareto-optimal set.
6. Experimental results
This section presents a comparative study evaluating the performance of the proposed QMOHH. Section 6.1 introduces the test instances and the compared algorithms. Section 6.2 evaluates the MILP models with the ε-constraint method, and Section 6.3 evaluates the performance improvements of the proposed QMOHH. Finally, Section 6.4 presents the comparative study, in which QMOHH is compared with nine other multi-objective algorithms.
6.1 Experimental design
To evaluate the proposed QMOHH, 120 test instances are generated based on realistic production scenarios. Each instance consists of four product types. The minimum production cycle (product demands) is defined using 20 predetermined patterns, with total demand per cycle ranging from 10 to 29. In each pattern, the greatest common divisor of the four demands is 1, so the cycle cannot be reduced further. The number of stations takes six values (5, 10, 15, 20, 25, and 30), with 20 instances for each value. The processing time of each product at each station is drawn uniformly between 6 and 25 time-units. Reconfiguration costs form a 4 × 4 matrix with a zero diagonal and off-diagonal entries drawn uniformly from 0 to 12 (in thousand CNY). Part requirements are generated by first drawing a random integer between 0 and 8 for each product-part pair; the binary requirement is set to 1 if this integer is greater than 0 and to 0 otherwise. The number of part types per instance is randomly set between 12 and 16. For the part-frequency constraints, the window length is selected at random from a set of candidate values, and the maximum allowable count is chosen at random between 1 and an upper bound. The interval time between successive product startups is fixed at 5 time-units. Instances are classified by scale according to the number of stations: instances with 5 or 10 stations are small-scale, those with 15 or 20 stations are medium-scale, and those with 25 or 30 stations are large-scale. The instance set thus comprises 40 small-scale, 40 medium-scale, and 40 large-scale instances, 120 in total. All instances, along with the source code of all algorithms, are publicly available at https://github.com/zixiangliwust/Instances_RALSP under the MIT license.
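The generation rules above can be sketched as follows. Integer-valued processing times and costs are an assumption, the demand pattern [2, 3, 4, 1] is a hypothetical example rather than one of the 20 predetermined patterns, the function name is hypothetical, and part-frequency parameters are omitted because their candidate values are not listed in this text.

```python
import math
import random
from functools import reduce

SCALE = {5: "small", 10: "small", 15: "medium", 20: "medium", 25: "large", 30: "large"}


def generate_instance(n_stations, demand, rng):
    """Random instance following the rules in the text (integer values assumed)."""
    if reduce(math.gcd, demand) != 1:
        raise ValueError("demand pattern must be irreducible (gcd = 1)")
    n_products = len(demand)
    proc = [[rng.randint(6, 25) for _ in range(n_stations)] for _ in range(n_products)]
    cost = [[0 if i == j else rng.randint(0, 12) for j in range(n_products)]
            for i in range(n_products)]
    n_parts = rng.randint(12, 16)
    requires = [[1 if rng.randint(0, 8) > 0 else 0 for _ in range(n_parts)]
                for _ in range(n_products)]
    return {"demand": list(demand), "proc": proc, "cost": cost,
            "requires": requires, "startup_interval": 5,
            "scale": SCALE[n_stations]}


inst = generate_instance(25, [2, 3, 4, 1], random.Random(7))  # hypothetical pattern
```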
Three indicators are used to evaluate the Pareto solution sets: GD (generational distance) measures convergence to the reference front; the ε-indicator quantifies the smallest factor needed for the obtained set to dominate the reference front; and 1-NHV (the complement of the normalized hypervolume) assesses both convergence and diversity. Smaller values indicate better performance for all three indicators. These indicators have been widely applied in the literature [41,42], and detailed calculation procedures are available in the source code on GitHub (see Data Availability).
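For readers who want to reproduce the first two indicators, the sketch below gives one common definition of each; it is not taken from the authors' code. GD is computed as the mean distance to the nearest reference point (other variants use the root of the summed squared distances), and the ε-indicator uses the multiplicative form, which requires strictly positive objective values; the additive variant replaces the ratio with a difference.

```python
import math


def generational_distance(A, R):
    """Mean Euclidean distance from each obtained point to its nearest reference point."""
    return sum(min(math.dist(a, r) for r in R) for a in A) / len(A)


def multiplicative_epsilon(A, R):
    """Smallest eps such that every reference point r is weakly dominated by some
    a in A after scaling, i.e. a_j <= eps * r_j for all j (minimization)."""
    return max(min(max(aj / rj for aj, rj in zip(a, r)) for a in A) for r in R)


R = [(1.0, 2.0), (2.0, 1.0)]            # toy reference front
A = [(2.0, 2.0)]                        # toy obtained front
print(generational_distance(A, R), multiplicative_epsilon(A, R))  # 1.0 2.0
```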
To evaluate the performance of the proposed QMOHH, it is compared with nine other multi-objective algorithms: the multi-objective simulated annealing algorithm (MOSA) [43], multi-objective restarted simulated annealing (MRSA) [41], the multi-objective artificial bee colony algorithm (MOABC) [44], the improved multi-objective artificial bee colony (IMOABC) [42], GDE3 [45], MOEAD [46], NSGA-II [40], the multi-objective PSO algorithm (MOPSO) [36], and the speed-constrained multi-objective PSO (SMPSO) [47].
The termination criterion for all tested algorithms is a total of 100,000 evaluations. The linearized MILP models are solved with IBM ILOG CPLEX Optimization Studio (version 22.1.1). All tested algorithms are implemented in Python 3.10 and run on a personal computer with an Intel Core Ultra 9 185H processor, with parallel computation used to accelerate the experiments. For each independent run, the random seed is set from the current system time so that the 10 runs per instance are statistically independent.
All algorithms are calibrated with the Taguchi experimental design method to ensure a fair comparison. A set of 30 representative instances covering small, medium, and large scales is used for calibration. For each algorithm, an orthogonal array is used to evaluate five key parameters at four levels. The response metric is the average 1-NHV over 10 independent runs, and the optimal parameter combination for each algorithm is selected based on the signal-to-noise ratio. The calibrated parameters for QMOHH are the population size, learning rate, discount factor, initial ε-greedy exploration probability (decaying linearly to 0.1), survival-time decay factor, and uniform and non-uniform mutation probabilities (with perturbation 0.5), with termination at 100,000 evaluations. These values are explicitly defined in the source code, which is publicly available at https://github.com/zixiangliwust/Instances_RALSP. The same calibration procedure is applied to the nine compared algorithms, whose final parameter settings can also be found in the code.
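The signal-to-noise analysis can be illustrated as below, assuming the standard smaller-is-better Taguchi ratio SN = −10 log10(mean(y²)) applied to the 1-NHV replicates. The toy design is a hypothetical 2 × 2 example, not the orthogonal array used in the study, and the function names are hypothetical.

```python
import math
from statistics import mean


def sn_smaller_is_better(values):
    """Taguchi S/N ratio for a smaller-is-better response such as 1-NHV."""
    return -10.0 * math.log10(mean([v * v for v in values]))


def best_levels(design, responses):
    """design[r][f] = level index of factor f in run r; responses[r] = replicate
    1-NHV values of run r. Returns, per factor, the level with the highest mean S/N."""
    sn = [sn_smaller_is_better(rep) for rep in responses]
    best = []
    for f in range(len(design[0])):
        by_level = {}
        for run, s in zip(design, sn):
            by_level.setdefault(run[f], []).append(s)
        best.append(max(by_level, key=lambda lv: mean(by_level[lv])))
    return best


design = [(0, 0), (0, 1), (1, 0), (1, 1)]                 # toy 2-factor design
responses = [[0.10, 0.12], [0.11, 0.10], [0.30, 0.28], [0.29, 0.31]]
print(best_levels(design, responses))
```

A higher S/N ratio corresponds to smaller and less variable 1-NHV values, which is why the level with the maximum mean ratio is kept.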
6.2 Evaluating the MILP model with ε-constraint method
The results in Table 5 illustrate the computational behavior of the MILP models under the ε-constraint method with a time limit of 300 s per subproblem; the column total time (s) reports the total computation time. As the number of stations or the total number of products increases, the total runtime rises substantially. For a small instance, all 147 ε-combinations are solved to optimality within 20 s, yielding 40 valid solutions and 3 Pareto points. In contrast, for the largest instance, the total runtime exceeds 7888 s, and only 5 of the 147 subproblems reach optimality. Instances with more products (i.e., longer product sequences) consistently require longer solution times, even when the number of stations remains moderate. For example, for a fixed number of stations, increasing the total number of products from 10 to 29 raises the total runtime from 23 s to nearly 600 s, while the number of optimal subproblems drops from 36 to 10.
The computational cost becomes prohibitive when both the number of stations and the total number of products exceed moderate values. Consequently, the MILP models are best suited for small-scale instances, for which exact Pareto frontiers can be obtained within acceptable time, whereas larger problems call for metaheuristics such as QMOHH. Moreover, finer ε-grids (i.e., smaller interval steps) would further increase the number of subproblems, making the approach even more computationally demanding.
These findings confirm that the developed MILP formulations provide exact benchmarks for small-to-medium instances, whereas larger problems require efficient metaheuristic methods.
6.3 Evaluating the improvements of QMOHH
To validate the contribution of each proposed component, the complete QMOHH is compared with four variants, defined below. All four variants and the complete QMOHH are evaluated on the 120 benchmark instances described in Section 6.1 under the same experimental protocol.
- MOHH-v1: Only the three PSO-based operators are used. This variant establishes a PSO-centric baseline to assess the value of using all eight heuristic operators.
- MOHH-v2: The Q-learning mechanism is disabled, and heuristic operators are selected at random. This variant assesses the value of adaptive operator selection with Q-learning.
- MOHH-v3: The Q-learning controller is replaced with a Deep Q-Network (DQN). This variant examines the impact of a different reinforcement learning architecture.
- MOHH-v4: The standard gbest selection method of MOPSO replaces the new density-aware leader selection strategy. This variant tests the advantage of the new strategy.
In this section, the 1-NHV indicator is used to measure the performance of the five algorithms; a smaller 1-NHV value denotes better overall solution quality. The average 1-NHV values over the 120 instances are presented in Table 6. The Friedman test is conducted on the mean ranks of the algorithms, and Fig 6 depicts the means plot of the average ranks of the MOHH methods in terms of 1-NHV.
The results in Table 6 confirm the superiority of QMOHH, which achieves the lowest average 1-NHV (0.232). MOHH-v1 performs worse (0.290), underscoring the benefit of integrating a diverse set of metaheuristic operators. MOHH-v2 also performs worse (0.248), indicating that disabling the Q-learning mechanism leads to noticeable degradation and highlighting the critical role of adaptive operator selection. Notably, MOHH-v3 exhibits the poorest performance among the compared methods, suggesting that the Q-learning framework is more suitable and efficient than a more complex deep reinforcement learning approach for online operator selection. MOHH-v4 is also worse (0.242), confirming the effectiveness of the new density-aware leader selection strategy.
This ablation study indicates that each core component of QMOHH (the Q-learning-based adaptive operator selection, the ensemble of diverse heuristics, and the new density-aware leader selection strategy) contributes to its search capability and final solution quality for the multi-objective RALSP.
6.4 Comparative study
To comprehensively evaluate the performance of the proposed algorithm, QMOHH is compared with the nine state-of-the-art multi-objective algorithms listed in Section 6.1. All algorithms are evaluated on the same 120 benchmark instances under the identical termination criterion, and each algorithm is executed 10 times on each instance. The parameter settings of each algorithm are determined through the Taguchi calibration described in Section 6.1, a procedure widely used in the literature.
Table 7 reports the average 1-NHV values over 10 independent runs on the 120 instances. For space reasons, only the results of the best four algorithms are presented here; detailed results for all evaluation indicators are available upon request. Since a smaller 1-NHV value indicates better performance, the algorithms can be ranked in increasing order of their overall average 1-NHV values. QMOHH ranks first with the best overall average (0.198), demonstrating its superiority over all compared algorithms; MOPSO ranks second (0.222), SMPSO third (0.236), and MOEAD fourth (0.287). Among all methods, MOSA and MRSA exhibit the weakest performance.
This ranking shows that QMOHH has a clear advantage over the compared methods in terms of 1-NHV. This advantage is attributed to adaptive operator selection with the Q-learning mechanism and the new density-aware leader selection strategy, which together help QMOHH balance exploration and exploitation on the multi-objective RALSP.
To validate the performance differences statistically, a non-parametric test is used because the normality assumption required by ANOVA is violated. The Friedman test is conducted separately on the ranks of the algorithms for each performance indicator (GD, the ε-indicator, and 1-NHV). It rejects the null hypothesis (p < 0.01) for all three indicators, confirming statistically significant differences among the algorithms' performances.
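The mean ranks and the Friedman statistic can be computed as in the sketch below (rank 1 is best, tied values receive averaged ranks, and no tie correction is applied to the statistic). The statistic is compared with a χ² distribution with k − 1 degrees of freedom; SciPy's `scipy.stats.friedmanchisquare` offers a ready-made implementation that also corrects for ties. The scores are toy values, not results from the study.

```python
def average_ranks(row):
    """Rank the values in one row (1 = smallest = best); ties share the mean rank."""
    order = sorted(range(len(row)), key=lambda j: row[j])
    ranks = [0.0] * len(row)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and row[order[j + 1]] == row[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def friedman(scores):
    """scores[instance][algorithm] -> (mean ranks, Friedman chi-square statistic)."""
    n, k = len(scores), len(scores[0])
    ranks = [average_ranks(row) for row in scores]
    mean_ranks = [sum(r[j] for r in ranks) / n for j in range(k)]
    chi2 = 12 * n / (k * (k + 1)) * (sum(m * m for m in mean_ranks) - k * (k + 1) ** 2 / 4)
    return mean_ranks, chi2


scores = [[0.10, 0.20, 0.30]] * 4      # toy 1-NHV values: 4 instances x 3 algorithms
print(friedman(scores))  # ([1.0, 2.0, 3.0], 8.0)
```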
Fig 7 illustrates the means plot of the average ranks of the algorithms in terms of GD. QMOHH achieves the lowest average rank, indicating superior convergence. Except for MOPSO, the confidence intervals of QMOHH and the other algorithms do not overlap; hence, QMOHH statistically outperforms the remaining eight algorithms in terms of GD.
Similarly, Fig 8 presents the results for the ε-indicator. QMOHH again attains the lowest average rank. Here, the confidence intervals of MOPSO, SMPSO, and QMOHH overlap, which is common when several high-performing algorithms are compared. Because the confidence intervals of QMOHH and the other seven algorithms do not overlap, QMOHH statistically outperforms these seven algorithms in terms of the ε-indicator. In short, QMOHH again achieves the lowest mean rank and competitive performance.
Finally, Fig 9 shows the ranking results for the 1-NHV indicator. Consistent with the other indicators, QMOHH secures the top rank. The confidence intervals of QMOHH and MOPSO overlap, but QMOHH's lower average rank suggests a better ability to balance convergence and diversity in the obtained Pareto front.
To further examine pairwise differences, Dunn's multiple comparisons test is conducted as a post-hoc analysis following the Friedman test. The results show that QMOHH significantly outperforms all compared algorithms except MOPSO and SMPSO on all three evaluation indicators. These findings are consistent with the overlapping confidence intervals in Figs 7–9 and show that QMOHH is statistically superior to the majority of the state-of-the-art algorithms and competitive with the best of them. Together with the non-parametric statistical analysis, the comparative study supports the effectiveness and robustness of the proposed QMOHH for the multi-objective RALSP: despite some overlap in confidence intervals, QMOHH achieves the top ranking on all three performance indicators.
7. Conclusions and future research directions
Faced with increasing demand for customized and multi-variety production, reconfigurable assembly lines have gained prominence as a flexible manufacturing paradigm. This study tackles the multi-objective reconfigurable assembly line scheduling problem, which aims to minimize reconfiguration cost, production workload equalization, and logistics leveling simultaneously. A foundational contribution is a new, linearized mathematical model comprising three MILP formulations, which enables exact problem characterization and provides a solid benchmark for algorithmic evaluation. Given the NP-hard nature of this problem, a novel Q-learning-based multi-objective hyper-heuristic algorithm is proposed. The algorithm operates within a unified search framework that dynamically selects among multiple metaheuristic operators, including particle swarm optimization, teaching–learning-based optimization, the whale optimization algorithm, and the grey wolf optimizer. A Q-learning mechanism adaptively chooses the most promising operator at each search stage based on real-time performance feedback. The algorithm further incorporates a new density-aware leader selection strategy with a survival-time decay factor to select the global best solution for population evolution. This strategy adaptively balances exploration and exploitation by favoring superior solutions in sparse regions and increasing selection pressure on high-quality individuals over the iterations.
The case study demonstrates that the models combined with the ε-constraint method can obtain a set of Pareto solutions and that the proposed hyper-heuristic can effectively obtain a set of high-quality Pareto solutions, which validates the necessity and practicality of a multi-objective approach. Ablation experiments with four variants of QMOHH confirm that the Q-learning-driven adaptive operator selection, the integration of diverse metaheuristic operators, and the new density-aware leader selection strategy each contribute to the algorithm's performance, since the simplified versions lacking these components perform worse. Extensive experiments on 120 generated benchmark instances further support the proposed methodology. In comparisons with nine state-of-the-art multi-objective algorithms, including multi-objective particle swarm optimization, the non-dominated sorting genetic algorithm II, generalized differential evolution, and multi-objective simulated annealing, the proposed algorithm achieved the best average rank on all three evaluation indicators, and statistical analysis confirms significant improvements over most of the compared algorithms, demonstrating the effectiveness of the Q-learning-driven operator selection and the new density-aware leader selection strategy.
The proposed algorithm could be embedded in a production decision-support system to help line managers determine appropriate product sequences. Because the proposed model and QMOHH yield a set of high-quality Pareto solutions, the line manager can select the scheduling scheme best suited to real-time production requirements. The developed model and QMOHH can thus enhance the operational efficiency and responsiveness of reconfigurable assembly lines.
Despite the promising results, several limitations should be acknowledged. First, the proposed MILP models, although fully linearized and solvable by commercial solvers, are best suited for small-to-medium instances because their solution time grows rapidly with instance size. Second, the hyper-heuristic parameters are calibrated with the Taguchi method on a representative set of instances; although QMOHH performs robustly within the tested ranges, optimal parameter settings may vary for different problem structures or scales, and further tuning may be required in new applications. Third, the benchmark instances, although generated from realistic production scenarios, are synthetic, so overfitting to these instances cannot be completely ruled out. Future work includes validation on additional real-world industrial cases to further assess the generalization capability of the proposed approach. The flexible hyper-heuristic framework could also be extended to other complex scheduling problems, such as mixed-model line balancing and integrated lot-streaming and scheduling in reconfigurable systems. The algorithm could further be extended to handle dynamic disruptions or real-time data-driven scheduling to enhance its practicality in industrial applications. Finally, the simultaneous optimization of line balancing and product sequencing in reconfigurable environments is another promising research avenue.
References
- 1. Gianassi M, Leoni L, Fantozzi IC, De Carlo F, Tucci M. Mixed-model and multi-model assembly lines: a systematic literature review on resource management. J Manuf Syst. 2025;82:632–57.
- 2. Koren Y, Heisel U, Jovane F, Moriwaki T, Pritschow G, Ulsoy G. Reconfigurable manufacturing systems. CIRP Ann. 1999;48(2):527–40.
- 3. Yelles-Chaouche AR, Gurevsky E, Brahimi N, Dolgui A. Reconfigurable manufacturing systems from an optimisation perspective: a focused review of literature. Int J Prod Res. 2020;59(21):6400–18.
- 4. Hasan F, Jain PK, Kumar D. Service level as performance index for reconfigurable manufacturing system involving multiple part families. Procedia Eng. 2014;69:814–21.
- 5. Bortolini M, Galizia FG, Mora C. Reconfigurable manufacturing systems: literature review and research trend. J Manuf Syst. 2018;49:93–106.
- 6. Battaïa O, Dolgui A. Hybridizations in line balancing problems: a comprehensive review on new trends and formulations. Int J Prod Econ. 2022;250:108673.
- 7. Boysen N, Schulze P, Scholl A. Assembly line balancing: what happened in the last fifteen years? Eur J Oper Res. 2022;301(3):797–814.
- 8. Boysen N, Fliedner M, Scholl A. Assembly line balancing: which model to use when? Int J Prod Econ. 2008;111(2):509–28.
- 9. Razali MM, Kamarudin NH, Ab. Rashid MFF, Mohd Rose AN. Recent trend in mixed-model assembly line balancing optimization using soft computing approaches. EC. 2019;36(2):622–45.
- 10. Colledani M, Gyulai D, Monostori L, Urgo M, Unglert J, Van Houten F. Design and management of reconfigurable assembly lines in the automotive industry. CIRP Ann. 2016;65(1):441–6.
- 11. Dou J, Li J, Su C. Bi-objective optimization of integrating configuration generation and scheduling for reconfigurable flow lines using NSGA-II. Int J Adv Manuf Technol. 2016;86(5–8):1945–62.
- 12. Goyal KK, Jain PK. Design of reconfigurable flow lines using MOPSO and maximum deviation theory. Int J Adv Manuf Technol. 2015.
- 13. Yuan M, Deng K, Chaovalitwongse WA, Cheng S. Multi-objective optimal scheduling of reconfigurable assembly line for cloud manufacturing. Optim Methods Softw. 2016;32(3):581–93.
- 14. Prasad D, Jayswal SC. Reconfigurability consideration and scheduling of products in a manufacturing industry. Int J Prod Res. 2017;56(19):6430–49.
- 15. Pattanaik LN, Jena A. Tri-objective optimisation of mixed model reconfigurable assembly system for modular products. Int J Comput Integr Manuf. 2018;32(1):72–82.
- 16. Yuan M, Yu H, Huang J, Ji A. Reconfigurable assembly line balancing for cloud manufacturing. J Intell Manuf. 2018;30(6):2391–405.
- 17. Yang J, Liu F, Dong Y, Cao Y, Cao Y. Multiple-objective optimization of a reconfigurable assembly system via equipment selection and sequence planning. Comput Ind Eng. 2022;172:108519.
- 18. Yelles-Chaouche AR, Gurevsky E, Brahimi N, Dolgui A. Minimizing task reassignments under balancing multi-product reconfigurable manufacturing lines. Comput Ind Eng. 2022;173:108660.
- 19. Tremblet D, Yelles-Chaouche AR, Gurevsky E, Brahimi N, Dolgui A. Optimizing task reassignments for reconfigurable multi-model assembly lines with unknown order of product arrival. J Manuf Syst. 2023;67:190–200.
- 20. Delorme X, Gianessi P. Line balancing and task scheduling to minimise power peak of reconfigurable manufacturing systems. Int J Prod Res. 2023;62(14):5061–86.
- 21. Gholami H, Delorme X, Dolgui A. An intelligent data-driven model for sustainable-resilient supplier scrutiny and selection in sustainable reconfigurable manufacturing systems. Int J Prod Res. 2025;64(7):2591–615.
- 22. Cano-Belmán J, Ríos-Mercado RZ, Bautista J. A scatter search based hyper-heuristic for sequencing a mixed-model assembly line. J Heuristics. 2010;16(6):749–70.
- 23. Mosadegh H, Fatemi Ghomi SMT, Süer GA. Stochastic mixed-model assembly line sequencing problem: mathematical modeling and Q-learning based simulated annealing hyper-heuristics. Eur J Oper Res. 2020;282(2):530–44.
- 24. Özbakır L, Seçme G. A hyper-heuristic approach for stochastic parallel assembly line balancing problems with equipment costs. Oper Res Int J. 2020;22(1):577–614.
- 25. Zhou B, Zhao L. A hyper-heuristic algorithm-based automatic monorail shuttle system for material feeding optimization in mixed-model assembly lines. Soft Comput. 2023;28(4):3083–105.
- 26. Guo H, Liu J, Zhuang C, Dong H, Zhang F. A hyper-heuristic for dynamic integrated process planning and scheduling problem with reconfigurable manufacturing cells. IEEE Trans Syst Man Cybern Syst. 2025;55(6):3892–905.
- 27. Lu G, Gao H, Li Z, Yu X, Wang T, Qiu J, et al. Hyper-heuristic optimization using multifeature fusion estimator for PCB assembly lines with linear-aligned-heads surface mounters. IEEE Trans Cybern. 2025;55(8):3879–90. pmid:40266864
- 28. Li C, Wei X, Wang J, Wang S, Zhang S. A review of reinforcement learning based hyper-heuristics. PeerJ Comput Sci. 2024;10:e2141. pmid:38983203
- 29. Vela A, Valencia-Rivera GH, Cruz-Duarte JM, Ortiz-Bayliss JC, Amaya I. Hyper-heuristics and scheduling problems: strategies, application areas, and performance metrics. IEEE Access. 2025;13:14983–97.
- 30. Yan Q, Wang H. Double-layer Q-learning-based joint decision-making of dual resource-constrained aircraft assembly scheduling and flexible preventive maintenance. IEEE Trans Aerosp Electron Syst. 2022;58(6):4938–52.
- 31. Zhang Z, Tang Q, Chica M, Li Z. Reinforcement learning-based multiobjective evolutionary algorithm for mixed-model multimanned assembly line balancing under uncertain demand. IEEE Trans Cybern. 2024;54(5):2914–27. pmid:37018615
- 32. Meng K, Li S, Han Z. Optimizing mixed-model assembly line efficiency under uncertain demand: a Q-learning-inspired differential evolution algorithm. Comput Ind Eng. 2025;200:110743.
- 33. Rauf M, Mumtaz J, Adeel R, Minhas KA, Usman M. An efficient Q-learning-based multi-objective intelligent hybrid genetic algorithm for mixed-model assembly line efficiency. Symmetry. 2025;17(6):811.
- 34. Wen X, Liu H, Zhang X, Wang H, Zhang Y, Ye G, et al. A Q-learning improved particle swarm optimization for aircraft pulsating assembly line scheduling problem considering skilled operator allocation. Comput Mater Contin. 2026;86(1):1–27.
- 35. Battaïa O, Delorme X, Dolgui A, Haddou-Benderbal H. New trends in line balancing and model sequencing in assembly, disassembly and machining environments. Comput Ind Eng. 2025;207:111210.
- 36. Coello CAC, Pulido GT, Lechuga MS. Handling multiple objectives with particle swarm optimization. IEEE Trans Evol Comput. 2004;8(3):256–79.
- 37. Rao RV, Savsani VJ, Vakharia DP. Teaching–learning-based optimization: a novel method for constrained mechanical design optimization problems. Comput Aided Des. 2011;43(3):303–15.
- 38. Mirjalili S, Lewis A. The whale optimization algorithm. Adv Eng Softw. 2016;95:51–67.
- 39. Mirjalili S, Mirjalili SM, Lewis A. Grey wolf optimizer. Adv Eng Softw. 2014;69:46–61.
- 40. Deb K, Pratap A, Agarwal S, Meyarivan T. A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Trans Evol Comput. 2002;6(2):182–97.
- 41. Li Z, Tang Q, Zhang L. Minimizing energy consumption and cycle time in two-sided robotic assembly line systems using restarted simulated annealing algorithm. J Clean Prod. 2016;135:508–22.
- 42. Li Z, Janardhanan MN, Ponnambalam SG. Cost-oriented robotic assembly line balancing problem with setup times: multi-objective algorithms. J Intell Manuf. 2020;32(4):989–1007.
- 43. Cakir B, Altiparmak F, Dengiz B. Multi-objective optimization of a stochastic assembly line balancing: a hybrid simulated annealing algorithm. Comput Ind Eng. 2011;60(3):376–84.
- 44. Saif U, Guan Z, Liu W, Wang B, Zhang C. Multi-objective artificial bee colony algorithm for simultaneous sequencing and balancing of mixed model assembly line. Int J Adv Manuf Technol. 2014;75(9–12):1809–27.
- 45. Kukkonen S, Lampinen J. GDE3: the third evolution step of generalized differential evolution. In: 2005 IEEE Congress on Evolutionary Computation; 2005.
- 46. Zhang Q, Li H. MOEA/D: a multiobjective evolutionary algorithm based on decomposition. IEEE Trans Evol Comput. 2007;11(6):712–31.
- 47. Nebro AJ, Durillo JJ, Garcia-Nieto J, Coello CAC, Luna F, Alba E. SMPSO: a new PSO-based metaheuristic for multi-objective optimization. In: 2009 IEEE Symposium on Computational Intelligence in Multi-Criteria Decision-Making (MCDM); 2009.