A metamorphic testing approach for event sequences

Test oracles are commonly used in software testing to determine the correctness of the execution results of test cases. However, the testing of many software systems faces the test oracle problem: a test oracle may not always be available, or it may be available but too expensive to apply. One such software system is a system involving abundant business processes. This paper focuses on the testing of business-process-based software systems and proposes a metamorphic testing approach for event sequences, called MTES, to alleviate the oracle problem. We utilized event sequences to represent business processes and then applied the technique of metamorphic testing to test the system without using test oracles. To apply metamorphic testing, we studied the general rules for identifying metamorphic relations for business processes and further demonstrated specific metamorphic relations for individual case studies. Three case studies were conducted to evaluate the effectiveness of our approach. The experimental results show that our approach is feasible and effective in testing the applications with rich business processes. In addition, this paper summarizes the experimental findings and proposes guidelines for selecting good metamorphic relations for business processes.


Introduction
Software is widely used in various fields and greatly promotes the development of society. However, software faults have caused massive disasters. Software quality assurance has therefore become a critical activity in the software industry, and software testing is an effective method to ensure software quality. Many techniques have been proposed to guide test case selection and test automation to improve the effectiveness of software testing. Most of these techniques rest on an underlying assumption that an oracle (a mechanism through which testers can verify the correctness of the test outputs) is attainable. However, in many practical applications, a test oracle is not attainable, or it is attainable but too expensive to apply. These two situations are known as the oracle problem [1][2][3] and are challenging problems in software testing. In real-life applications, a system often consists of many subsystems or services that involve a large number of business processes and data transformations. Such a system is very difficult to test. Testers not only need to identify business processes and construct many test inputs but also have to determine the expected outputs. This process is error-prone and expensive. For example, a bank system normally involves many complex transaction processes from various terminals and frequently processes transactions in batches. To test such a system thoroughly, testers have to identify a large number of business processes, construct a large number of test cases and calculate the expected outputs manually. Determining and comparing the expected outputs is time-consuming and error-prone. Therefore, test oracles are expensive to apply, and the testing of such software systems faces the oracle problem.
Traditionally, one way to test a system that suffers from the test oracle problem is to use a 'pseudo-oracle' [4], in which multiple implementations of an algorithm are executed and at least one fault is detected if the outputs differ. This method is not always feasible because it is very costly, and different people can make the same type of mistake. Another method is a 'partial oracle' [5], which can verify the correctness or incorrectness of test outputs according to a certain condition or range. For instance, the output of sin 38° should not be greater than 1 or less than −1. This method is relatively simple and inexpensive, but it is suitable only for limited cases. Metamorphic testing (MT) has been proposed to alleviate the oracle problem [6,7]. To address the oracle problem, MT uses relations over multiple inputs and outputs, namely, metamorphic relations (MRs), to verify the test results. If an MR is violated, at least one fault is detected. MT is a simple, effective and automatable method that requires no test oracles [8,9]. Many researchers have applied MT in various applications in different domains, such as numerical analysis [10], machine learning [11], bioinformatics [12,13], middleware applications [14], embedded software [15], the National Aeronautics and Space Administration (NASA) data access toolkit [16], cybersecurity [17], compilers [18,19], search engines [20] and geographic systems [21]. Additionally, MT has been integrated with other testing and analysis techniques, such as fault-based testing [7], program slicing [22] and symbolic execution [23]. A comprehensive survey of MT introduces its application areas, research results and challenges [24].
In the software industry, a system usually includes a large number of interactions and business processes. End-users pay more attention to the correctness of business processes. The application of MT to test business processes is challenging. Two prominent problems exist. One problem is how to represent test inputs in MT for business processes. To test the system thoroughly, testers need to construct various test scenarios from the users' perspectives to reflect business processes. These test scenarios should also be regarded as test inputs in the testing of business-process-based systems. Test scenarios are basically expressed in natural language. How to express the relations between different test scenarios in MT must be studied. One possible approach is to formalize test scenarios just like normal test inputs in MT for business processes. Another problem is how to construct an MR for business processes. The MR requires multiple relations between different test scenarios, composite test inputs and outputs, among which the key challenge is to construct the relations among different test scenarios.
Some previous studies regarding event sequence testing can help us solve the first problem. Belli et al. proposed event sequence graphs (ESGs) to represent a user's actions in graphical user interface (GUI) testing [25]. Memon proposed a scalable event-flow model of GUI-based applications to present all possible event sequences on a GUI [26]. Sabharwall et al. proposed an event-flow model to generate and express test scenarios [27]. A sequence generation approach to business process testing was proposed based on test case composition and colored petri nets [28]. In addition, solutions to event sequence testing, such as sequence covering arrays [29], better bounds [30], integrating event-based testing and structure testing [31], have been proposed. Clearly, using event sequences is an intuitive approach to test business-process-based systems. The abovementioned methods of event sequence generation provide guidance regarding the formal description of test scenarios and facilitate the descriptions of the test input and MRs. The foremost step of MT for event sequences is to construct useful MRs between event sequences. Although previous studies presented some principles for constructing good MRs in MT (see the section on related work), they did not include how to construct MRs between event sequences. This paper proposes an MT approach for event sequences. We utilize event sequences to represent business processes and then construct MRs for event sequences to test business-process-based software systems without using test oracles. To apply this method, we study general rules that we call 'properties between event sequences' to identify MRs for event sequences. Three case studies are conducted to demonstrate the specific MRs. The experimental results and findings demonstrate the effectiveness of our approach.

Metamorphic testing
MT can be used to test systems with or without test oracles [32]. Instead of focusing on the verification of the correctness of each individual output, MT identifies various MRs to verify the relations among multiple inputs and their outputs. In general, one or more MRs are first identified based on knowledge about the intended algorithm or functionality of the software under test (SUT). Then, the source test cases are generated using traditional testing techniques, such as random testing [33], fault-based testing [7], black-box testing and white-box testing. Given a source test case, its follow-up test case is constructed by using the relevant MR. These source and follow-up test cases are further executed on the SUT, and their outputs are checked against the MR. If the MR is violated, then the SUT must be faulty.
A simple example that illustrates MT is a program P that calculates the median of a set of numbers. The correctness of P is difficult to verify when the number of elements in the set is large. However, the algorithm of P has some well-defined properties. One property is that when every input number is increased by the same real number x, the resulting median is also increased by x. Based on this property, we can define an MR as follows: suppose the source test input is {s1, s2, ..., sn} (n is the number of input elements and n ≥ 1), and the follow-up test case is constructed as {s1 + 10, s2 + 10, ..., sn + 10} based on the source test case; then, we have P(s1 + 10, s2 + 10, ..., sn + 10) = P(s1, s2, ..., sn) + 10. The source and follow-up test cases are both executed by the program P, and their outputs are compared. If this MR is violated, there exists at least one fault in the program.
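This MR can be checked mechanically. The sketch below uses Python's `statistics.median` to stand in for the program P; the function name `check_median_mr` and the choice of a shift constant are illustrative, not part of the original example.

```python
import random
import statistics

def check_median_mr(source_input, shift=10):
    """Check the shift MR for a median program:
    P(s1 + k, ..., sn + k) should equal P(s1, ..., sn) + k."""
    follow_up = [x + shift for x in source_input]   # follow-up test case
    source_out = statistics.median(source_input)    # P(source)
    follow_out = statistics.median(follow_up)       # P(follow-up)
    # The MR is violated if the outputs differ by anything other than `shift`.
    return abs(follow_out - (source_out + shift)) < 1e-9

# A random source test case; note that the oracle (the true median) is never needed.
src = [random.uniform(-1000.0, 1000.0) for _ in range(101)]
assert check_median_mr(src)
```

Because only the *relation* between the two outputs is verified, the test runs fully automatically even when the correct median of the random input is unknown.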
MT provides an effective verification mechanism of test outputs for applications with the oracle problem. Rather than verifying the individual output of one execution, MT determines whether an MR is violated on the basis of multiple executions. The method is simple to implement and independent of the programming language. Additionally, the automation of MT is easy. We can write simple scripts to automatically generate follow-up test cases and compare test outputs. MT has been applied in a wide range of applications. Although some general rules have been proposed to select good MRs, how to identify an MR for business processes has rarely been studied. We study this issue in this paper.

Business process and event sequence graph
A business process is a series of activities performed in a coordinated manner to achieve a business goal [34]. A business process can be described as an ESG, which is a directed graph that depicts events and event interactions in a simplified way [25]. An illustrative example of an ESG is shown in Fig 1. A node denotes an event, which indicates a user's action or an operation call with inputs to the SUT. An arrowed line represents the interaction between two events. Two pseudo-nodes, '[' and ']', inserted into an ESG do not represent real events but rather mark the entry and exit of the ESG.
A test scenario of a business process depicts a sequence of operations or interactions between a user and a system. The scenario can be described as an event sequence composed of well-organized events. Thus, various event sequences representing test scenarios of business processes can be generated based on an ESG and search methods, such as depth-first or breadth-first search. These sequences can be 1-, 2-, 3-, ..., n-way event sequences and can be tested based on event coverage or basis path testing. To obtain an event coverage of 100%, all events must be performed at least once. Basis path testing is a white-box testing method that finds linearly independent paths of execution in the control flow graph (CFG) to test a program. A linearly independent path, which we call a basis path, is a path through a CFG with at least one node different from the nodes of the other paths. An ESG is similar to the CFG of a program. All linearly independent paths are constructed and executed to cover all branches of event sequences in an ESG. For instance, in Fig 1, event b is executed after event a is performed, and events c and d follow event b. Thus, the event sequence ⟨a, b, c, d⟩ depicts a business process scenario that traverses the path a → b → c → d. This path covers all events for an event coverage of 100% but covers only one linearly independent path of execution, i.e., a → b → c → d. More event sequences, such as ⟨a, c, d⟩ and ⟨a, b, d⟩, should be performed to obtain greater path coverage.
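Enumerating entry-to-exit paths of an ESG is a straightforward graph search. The sketch below assumes an adjacency-list encoding consistent with the paths discussed for Fig 1; the dictionary literal is a hypothetical reconstruction of that graph, not taken from the paper.

```python
def event_sequences(esg, entry, exit_):
    """Enumerate all simple entry-to-exit paths of an ESG via depth-first
    search. Each returned path is one candidate test scenario."""
    paths = []
    def dfs(node, path):
        if node == exit_:
            paths.append(path)
            return
        for nxt in esg.get(node, []):
            if nxt not in path:          # keep paths simple: no repeated event
                dfs(nxt, path + [nxt])
    dfs(entry, [entry])
    return paths

# A hypothetical ESG consistent with the sequences discussed for Fig 1.
esg = {'a': ['b', 'c'], 'b': ['c', 'd'], 'c': ['d'], 'd': []}
print(event_sequences(esg, 'a', 'd'))
# -> [['a', 'b', 'c', 'd'], ['a', 'b', 'd'], ['a', 'c', 'd']]
```

The three enumerated paths together cover every branch of this ESG, which is exactly the coverage goal of basis path testing described above.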
The three basic business process scenarios are shown in Fig 2. Fig 2A shows a scenario in which only a single event is tested. In some cases, an event can be tested after another event. These two events can be closely or slightly related. For instance, event2 in Fig 2B is executed after event1 is performed successfully. This is a typical sequential event sequence. Sometimes, a loop event can be executed many times, and the input of a later iteration may come from the output of the previous one. Fig 2C describes a loop event sequence. Other scenarios can be built by combining the basic scenarios. Fig 3A shows a scenario that combines a sequential event sequence ⟨event1, event2⟩ with a loop event sequence ⟨event3_1, ..., event3_n⟩ (the subscripts 1, ..., n represent the number of loop executions). In Fig 3B, the scenario can be divided into two parallel sequential event sequences ⟨event1, event2⟩ and ⟨event1, event3⟩ when the condition 'and' holds. If the condition 'or' holds, only one of the event sequences ⟨event1, event2⟩ or ⟨event1, event3⟩ is executed.

Running examples
A spreadsheet is an end-user application that displays a table of information for end-users' tasks, such as data analysis, mathematical computation and office work. A typical spreadsheet may consist of hundreds or even thousands of cells into which input data (e.g., text and numbers) and formulas are entered. Spreadsheets are error-prone for end-users due to mistyped input data and formulas. Moreover, incorrect input data and formula faults can spread from the upstream cells to the downstream cells that depend on the upstream input data or computation results. These errors are difficult to detect. Although oracles are available for spreadsheets, the testing is time-consuming and prone to human error because testers generally must manually calculate the 'expected' results. This issue causes the oracle problem in spreadsheet testing, which has been reported in many previous studies [35][36][37]. The following two examples involve a formula fault and incorrect input data. Example 1 is a spreadsheet in which cells A2-A301 list the daily sales amounts, and cell A302 uses a faulty formula '=SUM(A2:A300)/300' instead of the correct one '=SUM(A2:A301)/300' to calculate the average daily sales amount. To verify the correctness of this spreadsheet, a tester manually calculates the expected result with a calculator and compares it with the 'actual' value in cell A302. This manual computation is time-consuming and prone to error owing to the large amount of data. MT can use some properties to alleviate this problem. An example of an MR is given as follows. We execute two test cases and compare whether their output results satisfy MR1. If this MR is not satisfied, there exists a fault in this spreadsheet.
MR1: If all daily sales amounts in cells A2-A301 increase by a constant k, the average daily sales amount in cell A302 will increase by k.
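MR1 can detect the faulty formula without any manual calculation. The sketch below models both the correct and the faulty A302 formulas as Python functions (the function names and the constant k are illustrative):

```python
import random

def average_daily_sales(sales):
    """Models the correct formula '=SUM(A2:A301)/300'."""
    return sum(sales) / 300

def buggy_average_daily_sales(sales):
    """Models the faulty formula '=SUM(A2:A300)/300', which drops cell A301."""
    return sum(sales[:-1]) / 300

def violates_mr1(program, sales, k=7.5):
    """MR1: raising every daily amount by k must raise the average by k."""
    source_out = program(sales)
    follow_out = program([s + k for s in sales])  # all 300 amounts increase by k
    return abs(follow_out - (source_out + k)) > 1e-6

sales = [random.uniform(0.0, 500.0) for _ in range(300)]
assert not violates_mr1(average_daily_sales, sales)    # correct formula passes
assert violates_mr1(buggy_average_daily_sales, sales)  # faulty formula is caught
```

The faulty version shifts the output by only 299k/300 instead of k, so MR1 flags it on every test group, regardless of the actual sales figures.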
Example 2 shows a spreadsheet involving multistep computations in Fig 4. Each of the columns from B to F displays one salesman's data. All salesmen's daily sales amounts are stored in rows 2 to 8. Each of the cells from B9 to F9 stores each salesman's weekly sales amount calculated via the summation formula. For instance, the value in cell B9 is calculated by the formula '=SUM(B2:B8)'. Cells B10-F10 show the sales commission ratios for all salesmen's weekly sales amounts. The weekly sales commissions in cells B11-F11 are obtained by multiplying the weekly sales amounts by the sales commission ratios. For example, the value in cell B11 is calculated using the formula '=B9*B10'. Finally, the total of all salesmen's sales commissions is calculated using the formula '=SUM(B11:F11)' in cell G11. Here, cell B10 stores not the correct sales commission ratio of 1% but rather the wrong input value of 0.9%. The subsequent weekly sales commission in cell B11 is therefore also faulty, which further causes a faulty result in cell G11.
Generally, a tester computes the results in cells B9-F9 manually. Then, these results are manually multiplied by the values of cells B10-F10 to obtain the weekly sales commissions in cells B11-F11. Finally, the values in cells B11-F11 are calculated to generate the expected result in cell G11. From this process, we can see that a tester manually performs mathematical calculations 11 times to obtain the expected result for such a simple spreadsheet. If the spreadsheet includes a large amount of data, it will be even more expensive to obtain the oracle due to the considerable number of error-prone manual computations.
MT can simply use an MR to solve the oracle problem in spreadsheet testing. The above process implies three events: calculate the weekly sales amount, calculate the weekly sales commission and calculate the total sales commission. We can use an event sequence to represent the process of the multistep computation and construct the ESG in Fig 5. An example of an MR for this event sequence is as follows.
MR2: For the event sequence 'calculate the weekly sales amount, calculate the weekly sales commission, calculate the total sales commission', the total sales commission in cell G11 should increase by the constant 0.01 × m if all daily sales amounts in the spreadsheet increase by a constant m.
MT for the event sequence can test the spreadsheet more easily by executing only two groups of test data. Certainly, the correctness of the spreadsheet can be further verified in a finer-grained manner, such as by considering each salesman's weekly sales commission, and the following MR3, an extension of MR2, can be used.
MR3: For the event sequence 'calculate the weekly sales amount, calculate the weekly sales commission, calculate the total sales commission', each weekly sales commission in cells B11-F11 and the total sales commission in cell G11 will increase by the constant 0.01 × m if all daily sales amounts in the spreadsheet increase by a constant m.
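The following sketch models the Fig 4 pipeline and applies a simplified variant of MR2 in which the increase m is applied to a single daily sales cell, so that with the intended 1% ratio the expected change in G11 is exactly 0.01 × m. The function names and the uniform test data are illustrative assumptions, not taken from the paper.

```python
def total_commission(daily, ratios):
    """Models the Fig 4 spreadsheet: `daily` holds B2:F8 (one list of seven
    amounts per salesman), `ratios` holds B10:F10, and the return value is
    G11 = sum of weekly_sum * ratio over the five salesmen."""
    weekly = [sum(col) for col in daily]               # B9:F9
    return sum(w * r for w, r in zip(weekly, ratios))  # G11

def violates_mr2(daily, ratios, salesman=0, day=0, m=100.0):
    """Simplified MR2: bumping one daily cell by m should raise G11 by 0.01*m
    when every commission ratio is the intended 1%."""
    src = total_commission(daily, ratios)
    bumped = [col[:] for col in daily]
    bumped[salesman][day] += m                         # follow-up test case
    fol = total_commission(bumped, ratios)
    return abs(fol - (src + 0.01 * m)) > 1e-6

daily = [[100.0] * 7 for _ in range(5)]                # 5 salesmen x 7 days
correct = [0.01] * 5
faulty = [0.009] + [0.01] * 4                          # B10 mistyped as 0.9%
assert not violates_mr2(daily, correct)
assert violates_mr2(daily, faulty)                     # incorrect input caught
```

With the mistyped ratio, the bump to the first salesman's column raises G11 by 0.009 × m rather than 0.01 × m, so the MR exposes the incorrect input value without computing any of the eleven intermediate results by hand.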

Methods
To test business-process-based software systems, we propose a method of metamorphic testing for event sequences (MTES) without using test oracles. In contrast to traditional MT, MTES focuses on the testing of business processes with not only input and output sequences but also event sequences. Therefore, the procedure of testing business processes by MTES in Fig 6 is slightly different from that of traditional MT. In general, the process includes the following steps.
• Test scenarios are identified from the business processes of a system, and event sequences are generated. Each event sequence represents a test scenario of a business process.
• An MR between event sequences is designed based on the properties of the event sequences, input sequences and output sequences.
• The source test case (E, I) is generated. E is one of these event sequences. I is the input sequence that triggers the event sequence E; it can be generated by techniques such as random testing [33] or fault-based testing [7].
• The follow-up test case (E′, I′) is constructed based on the source test case (E, I) and the given MR. E′ and E may or may not be identical. I′ is the input sequence triggering the event sequence E′.
• The two test cases are executed in the system, and the corresponding output sequences, O and O′, are checked to determine whether they violate the MR. If the MR is violated, the tested business processes are faulty. Note that testers can compare the ultimate outputs or the intermediate outputs, and all outputs or only some outputs, of the source and follow-up output sequences.
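The steps above can be sketched as a minimal harness. The sketch assumes the system under test is a callable that maps an event sequence and input sequence to an output sequence; the harness name, the toy bank system and the doubling MR are all illustrative, not prescribed by the paper.

```python
def run_mtes(system, source, derive_follow_up, output_relation):
    """Execute one MTES test group. `system(events, inputs)` returns the
    output sequence O for one business-process scenario; `source` is the
    source test case (E, I); `derive_follow_up` builds (E', I') from it;
    `output_relation(O, O')` checks the MR. Returns True if the MR is
    violated, i.e. the tested business process is faulty."""
    events, inputs = source
    f_events, f_inputs = derive_follow_up(events, inputs)
    out = system(events, inputs)
    f_out = system(f_events, f_inputs)
    return not output_relation(out, f_out)

# Toy system: a 'deposit' event adds to a balance, any event reports it.
def bank(events, inputs, start=0.0):
    balance, outputs = start, []
    for ev, arg in zip(events, inputs):
        if ev == 'deposit':
            balance += arg
        outputs.append(balance)              # output after each event
    return outputs

# Example MR: doubling every deposit doubles every reported balance.
violated = run_mtes(
    bank,
    (['deposit', 'inquiry'], [100.0, None]),
    lambda E, I: (E, [2 * x if x is not None else None for x in I]),
    lambda O, Of: all(abs(2 * a - b) < 1e-9 for a, b in zip(O, Of)),
)
assert not violated
```

Because `output_relation` receives the full output sequences, a tester can compare the ultimate outputs, the intermediate outputs, or any subset of them, as noted in the last step.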
The key issue in MTES is to identify the metamorphic relation between event sequences. Compared with a traditional MR, a metamorphic relation between event sequences involves not only the properties among multiple input sequences and output sequences but also the properties between event sequences. We propose general rules to construct the follow-up event sequences, which are the properties of the event sequences listed in Table 1.
A metamorphic relation between event sequences is defined as follows: if there exists one relation R I between (E, I) and (E′, I′) and another relation R O between O and O′, then R O is always satisfied whenever R I is satisfied. In this formulation, (E, I) → O denotes the source execution, (E′, I′) → O′ denotes the follow-up execution, and O′ is called the follow-up output sequence. Furthermore, the source and follow-up event sequences may be a single event, an event sequence with multiple events, or a combination of them. Therefore, MRs for event sequences can be categorized into three types according to the operations used to construct the follow-up test cases: MRs based on a fixed single-event sequence, MRs based on a fixed multi-event sequence, and MRs based on varied event sequences. Consider the example in Fig 5. We obtain an event sequence and construct the metamorphic relation MR3. In this MR, the source event sequence E s is denoted as 'calculate the weekly sales amount, calculate the weekly sales commission, calculate the total sales commission', and the source input sequence I s is the set S of all sales amounts from all salesmen. The follow-up test case (E f, I f) can be constructed such that the follow-up event sequence E f is the same as the source event sequence E s, and the values of all elements in the follow-up input sequence I f are greater by the constant m than those in the source input sequence I s. Certainly, testers can also select single events or varied event sequences to test the system more comprehensively.
for this type of critical transaction without the need to calculate test outputs. Case 3 tests a complex autoscaling process of a virtual cluster, which is related not only to the elastic cloud management system but also to the OpenStack cloud platform. A test oracle is not available for this process because of the unpredictable resource utilization of this cluster. Therefore, we use MTES to test the autoscaling mechanism of a virtual cluster. In the experimental procedure, the following common methods are used to set up the experiments.
Test case generation. In terms of event sequence generation, we use an ESG to manually generate the source event sequences based on basis path testing. The follow-up event sequences are constructed based on the source event sequences and some of the related properties. In our case studies, we select key event sequences to implement the experiments. These event sequences are sufficient to illustrate our approach. For each MR, we use the random testing technique to generate the source input sequence based on the source event sequence. We then combine the source input sequence with the source event sequence to generate the source test case, and the follow-up test case is constructed based on the MR and the source test case. In this way, a series of test groups (each of which includes a source test case and a follow-up test case) is composed. In the following case studies, some constraints exist in the test groups.
• The numerical inputs involving money, such as the transaction amount and the balance, must be positive. The numerical outputs involving money can be made negative by setting the parameters of these systems to compare the mathematical relations between the source and follow-up outputs.
• All inputs and outputs involving money must be kept to two digits after the decimal point; that is, rounding to two decimal places is used in the calculations involving money.
• The transaction amount of an ATM withdrawal in this paper cannot exceed 5000 and must be a multiple of 50. The transaction amount of any deposit cannot exceed 200000.
• With respect to an event sequence, the input of an event derived from the output of the previous event is also affected by the input of the previous event.
Mutant generation. The mutation analysis technique applies mutation operators to inject faults into a program and thus generates various mutants to evaluate the effectiveness of a test method. A mutant is generally a program with one statement or expression mutated by a mutation operator. If a mutant exhibits a behavior different from that of the SUT, the mutant is killed, and the fault is detected. Mutants generated by mutation operators are similar to real faults [38]. We use the muJava [39] tool to automatically generate mutants for the program under test. muJava provides two types of mutation operators: method-level operators and class-level operators. In this paper, we focus on faults for which incorrect outputs are produced, such as errors in calculation, logic and conditions. Therefore, we use only a few method-level operators (arithmetic, relational and conditional operators) to generate mutants. Each mutant is a program with one mutated statement. An equivalent mutant is a mutated program that is behaviorally equivalent to the original and cannot be killed by any test case. We select only killable mutants (i.e., non-equivalent mutants [39,40]), excluding the mutants that cause crashes, exceptions and obvious errors, in case studies 1 and 2. Because the system in case study 3 is implemented in the JavaScript and Python languages, mutants cannot be generated automatically by mutation tools. Three different program versions with real faults are provided to evaluate the effectiveness of our approach in case study 3.
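For intuition, the following hand-written illustration mimics one method-level arithmetic-operator mutant of the kind muJava injects (it is not muJava output) and shows how an MR-based test group kills it without an oracle. The fee formula is the 0.1% deposit fee used later in this section.

```python
def fee(amount):
    """Original program: transaction fee is 0.1% of the deposit amount."""
    return 0.001 * amount

def fee_mutant(amount):
    """Hand-written AOR (arithmetic operator replacement) mutant:
    '*' mutated to '/', mimicking a muJava method-level operator."""
    return 0.001 / amount

def kills(program):
    """MR: doubling the amount should double the fee (no oracle needed)."""
    src, fol = program(4000.0), program(8000.0)
    return abs(fol - 2 * src) > 1e-9   # True => behavior differs => killed

assert not kills(fee)       # the original satisfies the MR
assert kills(fee_mutant)    # the mutant violates it and is killed
```

A mutant that happened to still satisfy every chosen MR would survive, which is exactly why the mutation score below measures how many non-equivalent mutants the MRs manage to kill.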
Effective measurement. Clearly, the MTES we propose is feasible in theory, but its effectiveness requires further validation in practical applications. We conduct three case studies to investigate this issue in terms of two metrics.
The first metric is the mutation score (MS), an intuitive indicator of the effectiveness of MT, which is defined as MS = N k / N n, where N k denotes the number of killed mutants and N n denotes the number of all non-equivalent mutants. The second metric is the fault-detection rate (FDR), which is defined as FDR = N v / N a, where N v denotes the number of test cases that cause their outputs to violate an MR and N a denotes the total number of test cases. We adopt MS as the metric to assess the effectiveness of our approach in case studies 1 and 2. Because mutation analysis is not used in case study 3, we use FDR as the metric there. This metric can more realistically reflect the effectiveness of our approach because of the real faults in that case. To compare the source and follow-up output results, we write scripts to automatically determine whether they violate the MRs.
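Both metrics are simple ratios; a direct transcription of the definitions (the function names are illustrative, and the sample counts are made-up inputs, not experimental results):

```python
def mutation_score(n_killed, n_nonequivalent):
    """MS = Nk / Nn: fraction of non-equivalent mutants killed."""
    return n_killed / n_nonequivalent

def fault_detection_rate(n_violating, n_total):
    """FDR = Nv / Na: fraction of test cases whose outputs violate an MR."""
    return n_violating / n_total

# Hypothetical counts, purely to exercise the formulas.
assert mutation_score(18, 24) == 0.75
assert fault_detection_rate(6, 30) == 0.2
```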
Imprecision. The problem of imprecision arises when test outputs are compared. A loss of precision occurs in floating-point operations in Java, which can cause test outputs to violate an MR even if the test outputs are actually correct. In addition, rounding errors can also cause false positives. For example, the transaction fee of a deposit is calculated based on the formula 0.001 × A, where A denotes the deposit amount. If we deposit 4124.23 onto a card with a balance of 2000.00 in an MR, we obtain a new balance of 6120.11. If we deposit 8248.46 onto a card with a balance of 4000.00, the new balance should theoretically be double the previous output, i.e., 12240.22. However, the actual result is only 12240.21 due to a rounding error. We may incorrectly conclude that the program is faulty because the outputs violate the MR. These problems are solved by setting thresholds in the comparison of test outputs such that no violation is reported if the difference in test outputs is within the threshold.
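The threshold idea can be sketched directly on the deposit example. The helper names and the 0.02 threshold are illustrative; the fee formula and the figures are those quoted above.

```python
def satisfies_with_threshold(expected, actual, threshold=0.02):
    """Compare two outputs under an MR while tolerating floating-point and
    rounding imprecision: no violation is reported if the difference is
    within the threshold (0.02 is an illustrative value for two-decimal
    currency)."""
    return abs(expected - actual) <= threshold

def deposit(balance, amount):
    """New balance after a deposit: fee = round(0.001 * A, 2) is deducted."""
    return round(balance + amount - round(0.001 * amount, 2), 2)

src = deposit(2000.00, 4124.23)   # 6120.11
fol = deposit(4000.00, 8248.46)   # 12240.21, not exactly 2 * src
assert not satisfies_with_threshold(2 * src, fol, threshold=0.0)  # naive check flags it
assert satisfies_with_threshold(2 * src, fol)                     # threshold absorbs it
```

The exact comparison reports a spurious violation of 0.01 caused solely by rounding the fee, while the thresholded comparison correctly accepts the outputs.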

Case study 1
A simplified electricity bill payment system. Fig 7 shows the ESG of a simplified electricity bill payment system for a community. Four main events (i.e., functions) are included: account balance inquiry, account recharge, electricity bill inquiry and online payment. The implementation of these functions consists of 180 lines of core code written in Java that mainly achieve the numerical calculations of these functions, connection to a MySQL database and SQL queries. When a consumer logs into this system, he can check his account balance by implementing the event 'account balance inquiry'. To increase his account balance, he can also deposit money into his account by implementing the event 'account recharge'. Furthermore, he can obtain his electricity bill to know his monthly electricity fee by implementing the event 'electricity bill inquiry'. The monthly electricity fee is calculated using the electricity price and the monthly electricity consumption of a consumer. Then, the fee is deducted from his account balance by implementing the event 'online payment'. The event 'online payment' cannot be executed until the event 'electricity bill inquiry' is implemented successfully. The classes of electricity prices are shown in Table 2. The electricity price E p varies with the number of family members F m and the cumulative annual electricity consumption C ca, which is the total amount of electricity consumed by a consumer in one year. According to this price table, each family pays the electricity bill from their online account monthly. In December of each year, a low-income family is compensated by CNY98.45, that is, CNY98.45 is deposited into its account.
The input of an account recharge is the 2-tuple (N, A), where N denotes the account number and A denotes the recharge amount. The input of an account balance inquiry is a user's account number N. The outputs of an account recharge and an account balance inquiry are both denoted as (N, B), where B is the new balance. The input of an electricity bill inquiry is the 2-tuple (N, M), and its output is the 5-tuple (N, M, C m, F, C a), where M denotes the month considered, C m denotes the monthly electricity consumption, C a denotes the annual electricity consumption, the electricity fee F is calculated using the formula F = E p × C m, and the cumulative annual electricity consumption C ca in Table 2 is obtained based on the formula C ca = C m + C a. The input of an online payment is the output of an electricity bill inquiry. The new balance B in the output (N, B) of an online payment is calculated using the formula B = B o − F, where B o is the balance before paying the electricity bill.
Metamorphic relations of a simplified electricity bill payment system. We create accounts N 1 , N 2 , N 3 , and N 4 with the same balance B 0 for normal-income families with three members, four members, five members and six members, respectively. Account N 5 with balance B 0 + M is for a normal-income family with five members. Account N 6 with balance B 0 is for a low-income family with three members. To design MRs between event sequences, the following basic properties of this system are first identified.
If the follow-up monthly electricity consumption C mf is twice the source monthly electricity consumption C ms and both the source and follow-up cumulative annual electricity consumptions do not exceed 2520 kWh, that is, C as + C ms ≤ 2520 and C af + C mf ≤ 2520, then the follow-up electricity fee F f should be twice the source electricity fee F s. Given the fixed multi-event sequence ⟨Account Recharge, Electricity Bill Inquiry, Online Payment⟩ and the source input sequence I s = ⟨Account Recharge(N 1, A), Electricity Bill Inquiry(N 1, 5), Online Payment(N 1, 5, C ms, F s, C as)⟩, the follow-up input sequence I f = ⟨Account Recharge(N 2, A + K), Electricity Bill Inquiry(N 2, 5), Online Payment(N 2, 5, C ms + C, F f, C af)⟩ can be constructed by changing the account number from N 1 to N 2, separately adding the positive integers K and C to the recharge amount A and the monthly electricity consumption C ms, and changing the monthly electricity fee from F s to F f and the annual electricity consumption from C as to C af. The corresponding source and follow-up output sequences then contain the balances after executing the first event 'Account Recharge' as intermediate outputs, and B s and B f denote the final source and follow-up balances for account numbers N 1 and N 2. Thus, we can design the metamorphic relations MR3 and MR4. MR3: If both the source and follow-up cumulative annual electricity consumptions are within the range (0, 2520], that is, C as + C ms ≤ 2520 and C af + C mf ≤ 2520, then the follow-up final balance B f should satisfy the relation B f = B s + K − 0.5469C. MR4: If the source cumulative annual electricity consumption C ms + C as is within the range (0, 2520] and the follow-up annual electricity consumption C af is within the range (4800, +∞), we will obtain the following output relation for the follow-up final balance:
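MR3 can be checked mechanically under the assumption that both scenarios stay in the first price tier; the tier-one price of 0.5469 is the constant implied by MR3 (an assumption here, since Table 2 is not reproduced), and all function names are illustrative.

```python
E_P_TIER1 = 0.5469   # tier-one electricity price implied by MR3 (assumption)

def final_balance(b0, recharge, c_m, price=E_P_TIER1):
    """Final balance after the sequence <Account Recharge, Electricity Bill
    Inquiry, Online Payment>: the recharge is added to the balance, then the
    fee F = price * C_m is deducted."""
    return b0 + recharge - price * c_m

def violates_mr3(b0, a, c_ms, k, c):
    """MR3: B_f = B_s + K - 0.5469*C when both scenarios stay within
    the (0, 2520] tier."""
    b_s = final_balance(b0, a, c_ms)           # source scenario, account N1
    b_f = final_balance(b0, a + k, c_ms + c)   # follow-up scenario, account N2
    return abs(b_f - (b_s + k - E_P_TIER1 * c)) > 1e-6

assert not violates_mr3(b0=500.0, a=300.0, c_ms=150.0, k=50.0, c=20.0)
```

A fault in the recharge or fee-deduction logic would break this balance relation even though neither final balance is predicted in advance.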

MR5:
Supposing that the source input sequence is denoted as I_s = ⟨Account Recharge(N_3, A), Electricity Bill Inquiry(N_3, 5), Online Payment(N_3, 5, C_ms, F_s, C_as)⟩, the follow-up input sequence I_f = ⟨Account Recharge(N_4, A), Electricity Bill Inquiry(N_4, 5), Online Payment(N_4, 5, C_ms + C, F_f, C_af)⟩ can be constructed by changing account number N_3 with five family members to account number N_4 with six family members, adding a positive integer C to the monthly electricity consumption C_ms, changing the monthly electricity fee from F_s to F_f and changing the annual electricity consumption from C_as to C_af. Supposing the source annual electricity consumption C_as > 2520, the source cumulative annual electricity consumption C_as + C_ms <= 3720, the follow-up annual electricity consumption C_af > 3720 and the follow-up cumulative annual electricity consumption C_af + C_ms + C <= 4800, the output sequences should satisfy the relation B_f = B_s − 0.05·C_ms − 0.5969·C. MR6: Supposing that the source input sequence is described as I_s = ⟨Account Recharge(N_2, A), Electricity Bill Inquiry(N_2, 5), Online Payment(N_2, 5, C_ms, F_s, C_as)⟩, the follow-up input sequence I_f = ⟨Account Recharge(N_3, A + K), Electricity Bill Inquiry(N_3, 5), Online Payment(N_3, 5, 2·C_ms, F_f, C_af)⟩ can be constructed by changing account number N_2 with four members to account number N_3 with five members, adding a positive integer K to the recharge amount A, multiplying the monthly electricity consumption C_ms by 2, changing the electricity fee from F_s to F_f and changing the annual electricity consumption from C_as to C_af.
If there exist a source annual electricity consumption C_as > 2520, a source cumulative annual electricity consumption C_as + C_ms <= 4800 and a follow-up annual electricity consumption C_af > 4800, we can obtain the corresponding output relation for the follow-up final balance. Given the fixed multi-event sequence E_s = E_f = ⟨Electricity Bill Inquiry, Online Payment⟩, the source and follow-up output sequences are denoted analogously, where M_o denotes the month considered. Supposing an account N_7 with balance B_0 + M is from a normal-income family with three members, we can construct the following two MRs. MR7: Given the source input sequence I_s = ⟨Electricity Bill Inquiry(N_1, 5), Online Payment(N_1, 5, C_ms, F_s, C_as)⟩, the follow-up input sequence I_f = ⟨Electricity Bill Inquiry(N_7, 5), Online Payment(N_7, 5, C_ms + C, F_f, C_af)⟩ can be constructed by changing account number N_1 with balance B_0 to account number N_7 with balance B_0 + M, adding a positive integer C to the monthly electricity consumption C_ms, changing the electricity fee from F_s to F_f and changing the annual electricity consumption from C_as to C_af. If the source and follow-up cumulative annual electricity consumptions are both within the range (0, 2520], that is, C_as + C_ms <= 2520 and C_af + C_ms + C <= 2520, we can obtain the corresponding output relation. MR8: Given the source input sequence I_s = ⟨Electricity Bill Inquiry(N_6, 12), Online Payment(N_6, 12, C_ms, F_s, C_as)⟩, the follow-up input sequence I_f = ⟨Electricity Bill Inquiry(N_7, 12), Online Payment(N_7, 12, C_ms + C, F_f, C_af)⟩ can be constructed by changing account number N_6 with low income and balance B_0 to account number N_7 with normal income and balance B_0 + M, adding a positive integer C to the monthly electricity consumption C_ms, changing the electricity fee from F_s to F_f and changing the annual electricity consumption from C_as to C_af.
If the source and follow-up cumulative annual electricity consumptions are both within the range (0, 2520], that is, C_as + C_ms <= 2520 and C_af + C_ms + C <= 2520, we will obtain the corresponding output relation for the follow-up final balance, where B1_s and B1_f refer to the first source and follow-up balances, and B_s and B_f refer to the final source and follow-up balances in the source and follow-up output sequences. If the source and follow-up cumulative annual electricity consumptions are both within the range (0, 2520], that is, C_as + C_ms <= 2520 and C_af + C_ms + C <= 2520, we can obtain the relation for the follow-up final balance B_f = B_s + M − 0.5469·C. MR11: Compared with MR10, MR11 uses account number N_3 with balance B_0 and account number N_5 with balance B_0 + M, and has different relations between the source and follow-up input sequences, that is, C_as > 2520, C_as + C_ms <= 3720, C_af > 3720 and C_af + C_ms + C <= 4800. Thus, the source and follow-up output sequences should satisfy the relation for the follow-up final balance B_f = B_s − 0.05·C_ms − 0.5969·C. MR12: Based on the source event sequence E_s, the follow-up event sequence E_f = Account Recharge is constructed by deleting the events 'Electricity Bill Inquiry' and 'Online Payment' from the source event sequence. If the source input sequence is described as I_s = ⟨Account Recharge(N_1, A), Electricity Bill Inquiry(N_1, 5), Online Payment(N_1, 5, C_ms, F_s, C_as)⟩, the follow-up input sequence I_f = Account Recharge(N_2, A + K) can be constructed by changing the account number from N_1 to N_2 and adding a positive integer K to the recharge amount A. The corresponding source and follow-up output sequences are separately denoted as O_s = ⟨(N_1, B1_s), (N_1, 5, C_ms, F_s, C_as), (N_1, B_s)⟩ and O_f = (N_2, B_f).
If the source cumulative annual electricity consumption is within the range (0, 2520], that is, C_as + C_ms <= 2520, the output relation B_f = B_s + K + 0.5469·C_ms should hold.
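As a concrete check, the MR3 relation above can be exercised against a toy billing model. This is a minimal sketch: the tier-1 rate of 0.5469 yuan/kWh is taken from MR3's output relation, while the function names and the account values are illustrative assumptions, not the actual system under test.

```python
RATE_TIER1 = 0.5469  # yuan/kWh; tier-1 rate taken from MR3's relation

def electricity_fee(monthly_kwh, annual_kwh_before):
    # Assumes the whole month is billed at tier 1 (<= 2520 kWh/year).
    assert annual_kwh_before + monthly_kwh <= 2520
    return RATE_TIER1 * monthly_kwh

def run_sequence(balance, recharge, monthly_kwh, annual_kwh_before):
    # <Account Recharge, Electricity Bill Inquiry, Online Payment>
    balance += recharge                                    # Account Recharge
    fee = electricity_fee(monthly_kwh, annual_kwh_before)  # Bill Inquiry
    return balance - fee                                   # Online Payment

B0, A, K, C_ms, C, C_as = 100.0, 50.0, 10, 200.0, 30.0, 1000.0
B_s = run_sequence(B0, A, C_ms, C_as)          # source test case
B_f = run_sequence(B0, A + K, C_ms + C, C_as)  # follow-up test case
# MR3: B_f = B_s + K - 0.5469 * C
assert abs(B_f - (B_s + K - RATE_TIER1 * C)) < 1e-6
```

On a correct implementation the assertion passes without the tester ever computing an expected balance by hand, which is the point of the metamorphic check.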

MR13:
Based on the source event sequence E_s, the follow-up event sequence E_f = ⟨Electricity Bill Inquiry, Online Payment, Account Recharge⟩ is constructed by permuting the order of events in the source event sequence. Given the source input sequence I_s = ⟨Account Recharge(N_1, A), Electricity Bill Inquiry(N_1, 5), Online Payment(N_1, 5, C_ms, F_s, C_as)⟩, the follow-up input sequence I_f = ⟨Electricity Bill Inquiry(N_2, 5), Online Payment(N_2, 5, C_ms, F_s, C_as), Account Recharge(N_2, A)⟩ is constructed by changing account number N_1 with three members to account number N_2 with four members. If the cumulative annual electricity consumption is within the range (0, 2520], that is, C_as + C_ms <= 2520, the source and follow-up final balances should be the same. Experimental results and analysis. We use mutation analysis to generate 548 mutants, excluding the equivalent mutants and those that lead to exceptions, crashes and obvious errors. Furthermore, we generate 200 test groups (each group includes one source and one follow-up test case) for each MR. All test groups are executed, and their output sequences are compared. MSs are calculated, and the results are shown in Table 3. We obtain the following findings. 2. An MR based on varied event sequences normally has a higher fault-detection capability than an MR based on a fixed event sequence, provided that they have the same source event sequence and input sequence but different follow-up event sequences and input sequences. For example, MR10 and MR12 are more effective than MR3 due to their different follow-up event sequences and input relations. Likewise, MR11 is more effective than MR5. MR9 is more effective than MR1 because MR9 continuously executes the account recharge event twice rather than once, as in MR1. Furthermore, the MS of MR10 exceeds the sum of the MSs of MR1 and MR7, although the account balance inquiry event yields no mutants.
This result occurs because MR10 has different event sequences for the source and follow-up test cases.
3. MRs with different input and output relations have different effectiveness. The effectiveness of MT for event sequences is also affected by factors other than the event sequences, such as the input and output relations. For instance, MR3-MR6 have different fault-detection capabilities. To investigate the effectiveness of the MRs in detail, we further analyze the results for different types of mutants. All mutants fall into three categories: • mathematics mutants, in which statements involving mathematical calculations are mutated by arithmetic operators, such as '+' instead of '-'.
• off-by-one mutants, in which variables are adjusted by one, such as inserting '++' before or after variables.
• condition mutants, in which the condition statements are mutated by relational operators or conditional operators, such as using '<' instead of '>' or inserting '!' before a conditional expression.
We classify the mutants into 188 mathematics mutants, 167 off-by-one mutants and 193 condition mutants. The MSs of the MRs are presented with respect to mutant type in Table 4. Each MR has a different sensitivity to each type of mutant. MR1, MR3 and MR7 are not sensitive to off-by-one mutants, with an MS of 0%. Although MR3-MR6 are designed on the basis of the same event sequence, they have different sensitivities to different types of mutants. MR3 cannot kill any off-by-one mutant, and MR4 is sensitive to mathematics and condition mutants. MR5 and MR6 both have relatively high sensitivities to all types of mutants, with MSs greater than 10%. MR8 presents higher sensitivity to all types of mutants than does MR7 due to its richer input relations. MR10 and MR12 kill more mathematics and off-by-one mutants than does MR3. MR9 kills the same number of mathematics mutants and more off-by-one mutants than does MR1. MR1 has the same source and follow-up event sequences, whereas MR9 goes through different event sequences. MR1, MR9 and MR13 cannot kill any condition mutant. No test cases from MR1 and MR9 go through the mutated statements in these condition mutants. MR13 executes some of the mutated statements but still produces an MS of 0%.
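The insensitivity of purely additive MRs such as MR3 to off-by-one mutants can be illustrated with a small sketch. The tier-1 rate and the "mutant" below are assumptions for illustration: the mutation increments the consumption variable, a typical off-by-one mutation, and the constant offset it introduces cancels out in the MR3 difference.

```python
RATE = 0.5469  # assumed tier-1 rate, taken from MR3's relation

def fee_original(c):
    return RATE * c

def fee_mutant(c):
    return RATE * (c + 1)   # off-by-one mutant: 'c' replaced by 'c + 1'

def final_balance(fee, b0, recharge, c_month):
    return b0 + recharge - fee(c_month)

B0, A, K, C_ms, C = 100.0, 50.0, 10, 200.0, 30.0
for fee in (fee_original, fee_mutant):
    B_s = final_balance(fee, B0, A, C_ms)          # source execution
    B_f = final_balance(fee, B0, A + K, C_ms + C)  # follow-up execution
    # MR3 holds for BOTH versions: the constant 0.5469 offset added by
    # the mutation appears in both B_s and B_f and cancels in B_f - B_s,
    # so the off-by-one mutant survives this MR.
    assert abs(B_f - (B_s + K - RATE * C)) < 1e-6
```

This matches the finding above that MRs built from addition transformations have difficulty detecting off-by-one mutants.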

Case study 2
A simplified interbank transaction system. The process of interbank transactions is shown in Fig 9A. The acquirer (receiving bank) receives the card transaction details from various terminals and transmits them to the issuer through an intermediate process system (CUPS). The issuer (issuing bank) processes these transactions and replies to the acquirer. The system under test is a simplified program from the transaction process system of the issuer. Three main features are offered, as shown in Fig 9B: interbank ATM withdrawal, interbank counter deposit and deposit cancellation. The deposit cancellation event can occur only after a counter deposit is completed successfully.
The transaction fee criteria are shown in Table 5. An interbank ATM withdrawal includes two types of transaction fees, which apply to transactions from the same city as the issuer and transactions from a different city. For an interbank counter deposit, three types of transaction fees exist according to the transaction amount A. We implement our approach on fine-grained modules, such as the modules of interbank ATM withdrawal, counter deposit and deposit cancellation.
Metamorphic relations of interbank ATM withdrawal. For an interbank ATM withdrawal event, the input triggering the event is a 5-tuple (N, A, C_a, C_i, B_0), where N refers to the card number, A refers to the transaction amount, C_a and C_i, respectively, refer to the city codes of the acquirer and the issuer, and B_0 refers to the initial balance of card number N. Moreover, C_a = C_i indicates that the transaction received by the acquirer is from the same city as the issuer, whereas C_a ≠ C_i indicates that the transaction comes from a different city. The output of an interbank ATM withdrawal is a 3-tuple (R, F, B), where R, F and B represent the response code, transaction fee and balance after the transaction, respectively. Note that in this case study, the initial balance B_0 is usually sufficient unless stated otherwise. The transaction fees (F_s and F_f) and balances (B_s and B_f) should satisfy the relations F_f = F_s + 0.01(K − 1)·A and B_f = K·B_s + 2(K − 1).
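The fee relation F_f = F_s + 0.01(K − 1)·A can be checked against an assumed fee schedule. The schedule below (a flat 2 yuan plus 1% of the amount for a different-city transaction) is an assumption chosen to be consistent with the stated relation; the actual criteria are in Table 5.

```python
def atm_fee_different_city(amount):
    # Assumed schedule: 2 yuan flat fee plus 1% of the amount for a
    # transaction from a different city than the issuer (see Table 5).
    return 2.0 + 0.01 * amount

A, K = 500.0, 3
F_s = atm_fee_different_city(A)        # source: withdraw amount A
F_f = atm_fee_different_city(K * A)    # follow-up: amount multiplied by K
# Stated relation: F_f = F_s + 0.01 * (K - 1) * A
assert abs(F_f - (F_s + 0.01 * (K - 1) * A)) < 1e-9
```

Under this schedule the flat part of the fee cancels, so only the proportional part scales with K, which is exactly what the relation encodes.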

MR1.3:
If the source test case is a transaction from the same city as the issuer, and the follow-up test case is a transaction from a different city, that is, C_a = C_i and C′_a ≠ C′_i, where the superscripts '1' and '2', respectively, refer to the first and second events in the source and follow-up output sequences, the total transaction fee F_s and the final balance B_s in the source output sequence can be calculated using the formulas F_s = F1_s + F2_s and B_s = B2_s, and those in the follow-up output sequence can be calculated analogously (F_f = F1_f + F2_f and B_f = B2_f). If the follow-up input sequence is constructed by permuting the order of the two transaction amounts A_1 and A_2 of the source input sequence, we can obtain the same source and follow-up response codes, total transaction fees and final balances.

MR based on varied event sequences.
If we sequentially withdraw cash A_1 and A_2 from the same card, the new balance should be related to that calculated by withdrawing cash A_1 + A_2 once. Therefore, we suppose E_s = ⟨ATM withdrawal, ATM withdrawal⟩ is the source event sequence that sequentially withdraws cash twice, with a corresponding source input sequence. Thus, we can construct the follow-up test case (E_f, I_f), where the follow-up event sequence E_f = ATM withdrawal is constructed by deleting an ATM withdrawal event from the source event sequence. The follow-up input sequence is constructed by withdrawing cash A_1 + A_2 once, changing the card number from N to N′, changing the city code of the acquirer from C_a to C′_a and changing the city code of the issuer from C_i to C′_i.
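The intuition behind this varied-event-sequence MR can be sketched under an assumed same-city flat fee of 2 yuan per ATM withdrawal (the actual criteria are in Table 5): withdrawing twice incurs one extra fee compared with withdrawing the combined amount once.

```python
FEE = 2.0  # assumed flat same-city fee per ATM withdrawal

def withdraw(balance, amount):
    return balance - amount - FEE

B0, A1, A2 = 1000.0, 100.0, 250.0
# Source: withdraw A1, then A2.  Follow-up: withdraw A1 + A2 once.
B_s = withdraw(withdraw(B0, A1), A2)
B_f = withdraw(B0, A1 + A2)
# One fewer withdrawal event means exactly one fewer flat fee:
assert abs(B_f - (B_s + FEE)) < 1e-9
```

The output relation here ties the two final balances together without requiring an expected balance to be computed independently.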

MR1.11:
The difference of this MR from MR1.10 is that the source test case is a transaction from the same city as the issuer but the follow-up test case is a transaction from a different city, that is, C_a = C_i and C′_a ≠ C′_i.

Metamorphic relations of interbank counter deposit.
For the interbank counter deposit event, the input is a 4-tuple (N, S, A, B_0), where N, S, A and B_0 represent the card number, sequence number, transaction amount and initial balance of the card, respectively. The output is identical in form to the output of an interbank ATM withdrawal, namely, response code R, transaction fee F and new balance B.

MR based on a fixed single-event sequence.
We suppose that the source test case is (counter deposit, I_s) and that the follow-up test case is (counter deposit, I_f), given the source input sequence I_s = counter deposit(N, S, A, B_0).

Metamorphic relations of deposit cancellation.
We investigate an event sequence that executes a deposit cancellation after sequentially executing two counter deposits. The aim of this test is to check whether the deposit cancellation event can correctly cancel a deposit transaction. Suppose the source test case is denoted as (E_s, I_s), where E_s is ⟨counter deposit, counter deposit, deposit cancellation⟩ and I_s is ⟨counter deposit(N, S_1, A_1, B_0), counter deposit(N, S_2, A_2, B1_s), deposit cancellation(S_2, A_2)⟩. The input of deposit cancellation(S_2, A_2) is derived from the input of the second counter deposit event, which means that the second transaction is withdrawn. The source output sequence can be denoted accordingly. The source and follow-up output sequences should then have the same response code and satisfy the follow-up transaction fee relation F1_f + F2_f = F1_s + 50 and the corresponding follow-up final balance relation.

MR3.6:
Compared with MR3.5, the follow-up event sequence of this MR, E_f = counter deposit, is constructed by deleting a counter deposit event and its corresponding deposit cancellation event from the source event sequence. The corresponding follow-up input and output sequences are represented as I_f = counter deposit(N, S_1, A_1, B_0) and O_f = (R_f, F_f, B_f). In this case, the source and follow-up output sequences should have the same response code, total transaction fee and final balance. Experimental results and analysis. For each MR, we use random testing to generate the source input sequences. Considering the limitations of ATM withdrawal, we generate 50, 200 and 200 valid test groups for each MR from ATM withdrawal, counter deposit and deposit cancellation, respectively. Then, we use mutation analysis to separately generate 65 and 58 mutants for the modules of ATM withdrawal and counter deposit. The event sequence involving the two modules of counter deposit and deposit cancellation includes 85 non-equivalent mutants. We execute all test groups, compare their output sequences and evaluate the effectiveness in terms of MS. Table 6 summarizes the MS of each ATM withdrawal MR for all mutants. MRs based on varied event sequences have higher fault-detection capabilities. MR1.11 is the strongest and kills nearly 90% of all mutants, whereas MR1.5, based on a fixed single-event sequence, is the weakest and kills only 16.92% of all mutants. For the same type of metamorphic relation, different MRs have different fault-detection capabilities. For instance, MR1.3 is more effective than the other MRs based on a fixed single-event sequence, MR1.8 is more effective than the other MRs based on a fixed multi-event sequence, and MR1.11 is more effective than MR1.10, which is also based on varied event sequences. Further analysis reveals that MRs that conduct the executions of the source and follow-up test cases in different ways are more likely to reveal faults.
In MR1.11, the execution of the follow-up test case involves a more dissimilar input sequence, event sequence and execution path than those of the source test case, whereas MR1.10 involves only a different input sequence and a different event sequence. For a fixed multi-event sequence, MR1.8 includes different input sequences and different execution paths, whereas the other MRs include only different input sequences. The same situation occurs for the MRs based on a fixed single-event sequence, except for MR1.5. MR1.5 is less effective than MR1.4 even though it involves more different execution paths. Further observation indicates that MR1.5 includes only one output parameter, while MR1.4 includes three output parameters. MR1.5 has a 'loose' output relation, which deteriorates its fault-detection effectiveness. Table 7 shows the MS of each MR for the counter deposit event for all mutants. The MRs derived from different test scenarios have different fault-detection effectiveness. For instance, MR2.11, which is based on varied event sequences, is the strongest metamorphic relation and kills 81.03% of all mutants. MR2.3, which is based on a fixed single-event sequence, kills 74.14% of all mutants, while MR2.8, which is based on a fixed multi-event sequence, kills only 12.07% of all mutants. In addition, the MRs with greater differences in the executions of the SUT have higher fault-detection capabilities. For example, MR2.3 kills more mutants than do the other MRs based on a fixed single-event sequence because the execution of its follow-up test case involves more different execution paths and richer input and output relations. MR2.6 is more effective than the other MRs based on a fixed multi-event sequence because of its more different execution paths and richer input relations. MRs based on varied event sequences are usually more effective. For instance, MR2.11 is the most effective of all MRs due to its more different event sequences and execution paths.
MR2.9 and MR2.10 are more effective than the other MRs due to their more different event sequences, except the abovementioned MR2.3, which has more different execution paths. The same phenomenon exists in Table 8, which shows the results of the event sequence involving counter deposit and deposit cancellation. MR3.1-MR3.6 have different fault-detection capabilities. MR3.2 is the best metamorphic relation, killing 85.88% of all mutants, whereas the worst metamorphic relation, MR3.1, kills only 63.53% of all mutants. Furthermore, the best MRs are those that make the executions of the source and follow-up test cases as different as possible. For instance, MR3.2 and MR3.4 are more effective than the other MRs because they involve more different input sequences and execution paths. Although both MR3.5 and MR3.6 involve varied event sequences, the executions of their source and follow-up test cases partially go through the same execution path and input sequence. Therefore, MR3.5 and MR3.6 are less effective than MR3.2 and MR3.4. Moreover, the effectiveness of a metamorphic relation is related to multiple factors.
We further analyze the experimental results with respect to different types of mutants. Table 9 shows that each MR for ATM withdrawal has variable sensitivity to different types of mutants. For instance, MR1.4 can kill 86.96% of mathematics mutants and 75% of condition mutants, but it cannot kill any off-by-one mutant. MR1.3, MR1.8 and MR1.11 are sensitive to all types of mutants, and their MSs are identical for condition mutants. Among these three MRs, MR1.3 has a slightly lower MS than the others for mathematics mutants, and MR1.11 has the highest MS, up to 100%, for off-by-one mutants. Table 10 shows that MR2.4, MR2.7 and MR2.8 are insensitive to off-by-one mutants and cannot kill any of them. However, MR2.2 and MR2.10 are the MRs most sensitive to off-by-one mutants, with MSs of 100%. MR2.6 and MR2.7 are very sensitive to mathematics mutants, with MSs of 80%. Among all MRs, MR2.11 is the strongest MR and is sensitive to all types of mutants, whereas MR2.8 is the weakest MR and kills only 35% of mathematics mutants.
The same situation exists in Table 11. Each MR has different sensitivities to different types of mutants. MR3.4 kills 100% of mathematics mutants and 80% of condition mutants, but it kills only 46.67% of off-by-one mutants. MR3.2 is sensitive to all types of mutants, with MSs of 80% or higher.

Case study 3
An elastic cloud management system. Cloud computing has been widely applied in the information technology (IT) industry with rich resources and a pay-as-you-go cost model. Cloud computing integrates various computational, storage and network resources into a large pool to serve the resource demands of a large number of users simultaneously. Based on virtualization techniques, users can request various virtual machines (VMs) and virtual clusters as needed. Users can also release some or all VMs when they no longer need as many resources.
Autoscaling is an effective method to ensure the quality of service of users' applications. Autoscaling can dynamically reallocate resources to enhance application performance or reduce users' cost when the resource utilization is above or below a preset threshold. For example, a virtual cluster with 10 VMs is created to run a web application on a cloud platform. When the average resource utilization of this virtual cluster (e.g., CPU utilization) exceeds a preset threshold (e.g., 80%) during a fixed observation period, the application performance will decrease. At this moment, the cluster will automatically add one or more VMs according to the predefined autoscaling strategy to improve the application performance. Conversely, one or more VMs can be removed to reduce users' resource cost when the average resource utilization of the cluster is below a preset threshold. Figs 13 and 14 show an elastic cloud management system and its ESG, respectively. This system manages an Openstack platform composed of 13 physical servers (1 controller node, 1 network node, 1 storage node and 10 compute nodes). A round-robin scheduling strategy is applied to determine on which compute node a VM will be created. The elastic cloud management system includes many components, three of which relate mainly to autoscaling: the cluster deployment and running component, the monitor component and the autoscaling controller. A user submits a request for a three-tier web application cluster to this system, including the number and configuration of the requested VMs and the runtime environment of the web application. The cluster deployment and running component automatically creates the VMs and deploys the application on them, completing the creation of the web application cluster. When the cluster is running, the monitor component collects the real-time resource utilizations of the VMs and periodically saves the data in a MongoDB database.
Simultaneously, the autoscaling controller periodically retrieves the data (i.e., resource utilization of the VMs) from the MongoDB database to compute the average resource utilization of the cluster and to determine whether the cluster can increase or decrease the number of VMs according to the autoscaling strategy. If the average resource utilization of the cluster exceeds the predefined upper threshold of the autoscaling strategy, it will trigger the Openstack controller to create new VMs and add them to the cluster. Conversely, VMs will be removed from the cluster if the average resource utilization of this cluster is below the predefined lower threshold. The implementation of autoscaling is closely related to the monitoring of the VMs, the determination of the autoscaling controller and the VM provision of the Openstack controller. If any component fails, the autoscaling of the cluster will not succeed. The autoscaling process can be described as an event sequence ⟨VM monitoring, autoscaling determination, VM provision⟩. The resource utilizations of the VMs are affected by various factors, such as user behavior and other VMs sharing the same physical resources. These factors are time-varying and unpredictable, so we cannot predict the average resource utilization of the cluster. Thus, we cannot determine the quantity of VMs to add or remove. A test oracle is not attainable in this process, which is an apparent oracle problem. MT can be used to alleviate this problem.
Metamorphic relations of autoscaling on an elastic cloud management system. In general, the evaluation of the autoscaling of an elastic cloud management system includes the following two perspectives.
• How accurately are the resources provided according to the workload variation and autoscaling strategy?
• How quickly or timely are the resources provided in an elastic cloud management system or platform?
The following two metrics are considered in this case study as good indicators of autoscaling.
• Scaling resource ability: the ability to scale out or scale in resources to match workload variation.
• Scaling resource time: the response time to scale out or scale in resources.
For any web application cluster in this case study, the upper and lower thresholds of CPU utilization are set to 80% and 20% in its autoscaling strategy, respectively. Each autoscaling has two evaluation periods, and each evaluation period lasts for a certain determination time. The cluster will not scale out a virtual machine until the average resource utilizations in the two evaluation periods both exceed 80%. Moreover, if the average CPU utilizations in two consecutive evaluation periods are both below 20%, the elastic cloud management system will scale in one VM from the web application cluster. A virtual machine can usually be provided within a minute. In fact, the time for scaling out may be delayed due to network speed and disk I/O speed. We suppose that two completely identical web application clusters, including the same resource and running environment, exist. The clusters can scale out or scale in resources according to the same autoscaling strategy. The following MRs can be constructed.
1. MR1: If the same workload is imposed on two identical clusters during the same observation period, they will increase by the same number of VMs.

MR2:
In contrast to MR1, the two identical clusters will decrease by the same number of VMs when their workloads decrease by the same amount during the same observation period.
3. MR3: During the same observation period, if two identical clusters are both stressed with the same workload that causes their CPU utilizations to exceed 80%, they should scale out the same number of VMs, and their response time for scaling out VMs should be similar at the minute level. That is, if one cluster scales out one VM within t minutes, then the scaling-out time of the other cluster should be in the range t ± 1 minutes.
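The two-consecutive-period decision rule described above can be sketched as follows. The function name and the period representation are illustrative; the thresholds of 80% and 20% come from the autoscaling strategy in the text.

```python
def autoscale_decision(period_utils):
    """Decide a scaling action from per-period average CPU utilizations (%).

    Scale out one VM only when the last two consecutive evaluation
    periods both exceed 80%; scale in one VM only when both are below 20%.
    """
    if len(period_utils) < 2:
        return "none"
    last_two = period_utils[-2:]
    if all(u > 80 for u in last_two):
        return "scale_out"   # add one VM to the cluster
    if all(u < 20 for u in last_two):
        return "scale_in"    # remove one VM from the cluster
    return "none"

assert autoscale_decision([85, 90]) == "scale_out"
assert autoscale_decision([15, 10]) == "scale_in"
assert autoscale_decision([85, 50]) == "none"   # only one hot period
```

Requiring two consecutive periods avoids reacting to transient spikes, which is why a single over-threshold period leaves the cluster unchanged.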
Experimental results and analysis. To test the autoscaling of a cloud management system, we need to simulate the workload of an application to trigger resource autoscaling. The load test software 'webbench' is used to impose a workload on a web application. This software can concurrently simulate thousands of requests per second to visit a web application, which can cause the resource utilization of the cluster running this web application to increase sharply. Autoscaling of this cluster can thus be triggered. Note that workload generation is not the event being tested but the method used to generate test cases. We create two groups of clusters with the same resource configuration but different operating systems. One group includes two clusters using the 'CentOS 6.5 Server' operating system, and the other group includes two clusters running the 'Ubuntu 14.04 Desktop' operating system. Each cluster includes 1 load-balance (LB) service, 3 Tomcat web servers based on VMs and 1 MySQL database server based on a VM. Each cluster is reset to the initial quantity of VMs before each test is implemented. Furthermore, these VMs are all 2vCPU/2G/40G (2-core CPU, 2 GB memory and 40 GB disk).
For each MR, we use the 'webbench' software to generate workloads as test cases. For example, the identical source and follow-up test cases can be generated by executing the command 'webbench -c m -t h http://192.168.80.12/' for MR1, where m and h can be set to random values within the ranges [3000, 20000] and [0, 3600], respectively. That is, m concurrent processes visit a web site to generate workloads within h seconds. The resource utilization increases sharply to over 80% and then remains above 80%. For MR2, we first stress two identical clusters to make their resource utilizations exceed 20% during the same period, and then interrupt the stress operations simultaneously. Thus, their workloads decrease quickly, and these clusters will scale in VMs. The source and follow-up test cases are both generated via the above process. For MR3, the source and follow-up test cases can be constructed in the same manner as those of MR1. For each MR, we separately construct 100 source test cases and 100 follow-up test cases to test each group of clusters.
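The test-case generation above can be sketched as a small command builder. The helper name is illustrative; the URL, flag names and parameter ranges are those given in the text for the 'webbench' command.

```python
import random

def make_webbench_command(url="http://192.168.80.12/"):
    # m concurrent clients and h seconds of load, drawn from the
    # ranges used in the case study.
    m = random.randint(3000, 20000)
    h = random.randint(0, 3600)
    return f"webbench -c {m} -t {h} {url}"

source_cmd = make_webbench_command()
follow_up_cmd = source_cmd   # MR1: identical source and follow-up workloads
assert follow_up_cmd == source_cmd
```

For MR1 the metamorphic transformation is the identity on the workload, so the source command string is simply reused for the follow-up test case.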
Autoscaling of a cluster involves not only the three components of the elastic cloud management system but also the related component of the Openstack cloud platform. The components are developed based on different programming languages and operating systems. Mutation analysis is not suitable for testing this system. We use three different program versions (V1.0, V2.0 and V3.0) to verify the effectiveness of our approach in the development process of the system. Each program version provides the functions of monitoring, autoscaling and resource provisioning. Program V1.0 is the first version submitted by the development team. The first version was revised to program V2.0 because of some faults. In the new program V2.0, the CPU monitoring interval is set to 600 s. The autoscaling determination time is set to 600 s per evaluation period. In the further revised program V3.0, the CPU monitoring interval is set to 300 s, and the autoscaling determination time per evaluation period is the same as that in program V2.0.
We used the above three MRs to test each program version. All test cases were executed, and their outputs were compared to verify whether they violated these MRs. The experimental results are presented in Table 12. For program V1.0, MR1 and MR2 are not violated, with an FDR of 0%, while MR3 is violated with an FDR of 100%. No cluster adds or removes any VM under the different application workloads. The development team reviewed the program and found that the user of the ceilometer component had no right to access the MongoDB database. Therefore, the corresponding monitoring data were not saved to the MongoDB database, and autoscaling was not triggered. For program V2.0, all MRs are violated. The development team found that the cloud management system retrieved insufficient data from the MongoDB database in some cases, which prevented the autoscaling from being triggered to scale out resources. In general, the monitoring data are first saved to the MongoDB database; then, the autoscaling controller retrieves the monitoring data from the database to determine whether to trigger autoscaling. The determination time of autoscaling per evaluation period should be longer than the monitoring interval to obtain sufficient data. This problem of program V2.0 is fixed in program V3.0. According to the experimental results of program V3.0, only MR3 is violated. This result demonstrates that resetting the monitoring interval greatly alleviates the problem of insufficient data. However, the response time problem of resource provisioning remains in the autoscaling process. The development team found that the GUIs of the 'Ubuntu 14.04 Desktop' operating system on some VMs did not start or started very slowly, which caused a longer time for scaling out resources and violated MR3. MR3 is more effective than the others due to its richer output relations (i.e., scenarios). Three actual problems are found in the testing process with MTES.
One is the image problem from the VM provisioning component of the Openstack cloud platform, and the others are configuration problems from the monitoring component. The results show that MTES is applicable and simple in the domain of cloud computing.
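The configuration constraint identified above (the determination time per evaluation period must exceed the monitoring interval) and the way MR violations are detected can be sketched as follows. This is a minimal illustration with hypothetical function names, not the actual MTES harness or the system's real MR definitions:

```python
# Toy model of the V2.0 misconfiguration: if the autoscaling determination
# period is not longer than the monitoring interval, too few samples are
# retrieved from the database and scaling out is never triggered.
def vms_added(workload, monitor_interval_s, decision_period_s):
    if decision_period_s <= monitor_interval_s:
        return 0  # insufficient monitoring data: autoscaling never fires
    return 1 if workload == "high" else 0

def violates(source_out, followup_out, expected_relation):
    # An MR is violated when the observed outputs do not satisfy the
    # expected output relation.
    return not expected_relation(source_out, followup_out)

# Hypothetical MR: a high-workload follow-up execution should scale out
# more VMs than a low-workload source execution.
more_vms = lambda src, fol: fol > src

# V3.0-style configuration (300 s monitoring, 600 s determination): MR holds.
print(violates(vms_added("low", 300, 600), vms_added("high", 300, 600), more_vms))  # False

# V2.0-style configuration (600 s for both): MR violated.
print(violates(vms_added("low", 600, 600), vms_added("high", 600, 600), more_vms))  # True
```

The sketch also shows why MTES needs no oracle here: the verdict comes from comparing the two executions against the relation, not from knowing the correct VM count in advance.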

Summary of the experimental findings
According to the above results and analysis for the case studies, we summarize the experimental findings as follows.
1. MRs based on different event sequences have higher fault-detection capabilities because they exercise a greater variety of test scenarios.
2. MRs with richer input and output relations have higher fault-detection capabilities.
3. Different MRs have different sensitivities to different types of mutants. MRs whose input-sequence transformation is an addition or a permutation have difficulty detecting off-by-one mutants.
4. Good MRs are those that make the source execution and the follow-up execution as different as possible. This confirms the findings of two previous studies [41,42]. Furthermore, the differences between the source and follow-up executions in this paper include different event sequences, different execution paths, and different input and output parameters.
5. MRs based on event sequences exhibit high effectiveness, with a mutation score (MS) of up to 89% in fine-grained module testing and 100% for some types of mutants in case study 2. However, only 39.23% of all mutants are killed in the system testing of case study 1. Therefore, MTES is not always efficient, but it makes it easy for end users to test systems with rich business processes.

Discussion
We concluded in our previous work [43] that MT is a cost-effective approach for practical applications with mathematical functions. In this paper, we propose an approach to constructing MRs between event sequences, which can yield multiple types of metamorphic relations for testing the various business processes of actual applications. The effectiveness and applicability of the proposed approach are validated via case studies.

More general application in different domains
In the IT industry, an increasing number of applications integrate several systems or services to provide business processes. In particular, many cloud applications integrate a large number of cloud services and involve various business-process scenarios; the oracle problem has therefore become a critical issue. In practice, users pay more attention to the correctness of business processes, yet previous studies have seldom employed MT techniques to test them. Our approach applies MT to test business-process-based software systems, and its applicability and effectiveness are verified through three case studies in different domains. The proposed method is a general approach that can be used in applications from other domains. Additionally, our approach introduces the MTES process to verify the correctness of business processes, refining MT in terms of the identification of business processes and the construction of MRs. In MTES, business-process scenarios are first identified based on the domain knowledge of experts or users; the corresponding event sequences are then organized to construct MRs. These general components of MTES are suitable for applications in different domains.
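The MTES flow just described (identify a business-process scenario, organize its event sequences, construct an MR, then execute and compare) can be sketched in miniature. All names here are hypothetical, and the toy account system merely stands in for a real SUT:

```python
def execute(events, step):
    """Run an event sequence through a (simulated) system under test."""
    state = 0
    for event, amount in events:
        state = step(state, event, amount)
    return state

def account(balance, event, amount):
    # Toy business process: deposits and withdrawals on an account balance.
    return balance + amount if event == "deposit" else balance - amount

# Steps 1-2: a business-process scenario expressed as event sequences.
source_seq = [("deposit", 100), ("withdraw", 30)]
followup_seq = [("withdraw", 30), ("deposit", 100)]  # permuted sequence

# Step 3: the MR (permutation transformation, equality output relation)
# states that reordering independent events leaves the balance unchanged.
src_out = execute(source_seq, account)
fol_out = execute(followup_seq, account)
print(src_out == fol_out)  # the MR holds if this prints True
```

Note that no expected balance is ever computed by hand; only the relation between the two executions is checked, which is the essence of alleviating the oracle problem.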
More importantly, we not only use the rules from previous studies to construct MRs between event sequences but also extend the guidance on constructing good MRs. Good MRs should make the executions of the SUT as different as possible and make the input and output relations as rich as possible. Moreover, the differences between executions should include not only different execution paths but also different event sequences and different input and output parameters.
MTES not only alleviates the oracle problem in business-process testing but also makes the testing of some business processes easier and more efficient. Generally, business processes that do have test oracles involve simple mathematical or logical relations that must be tested with a large quantity of test data. MTES can exploit simple relations for testing without manual, error-prone computations, which is simpler and more efficient than regular testing. Combining MTES with automated techniques in the future will further improve its test efficiency and promote its wide application in the IT industry.

Limitations
MTES is promising for testing business-process-based software systems, but our approach has certain limitations in the construction of MRs. As more events are involved, MRs between event sequences become more difficult to construct. If a user's business processes are 4-, 5-, . . ., n-way event sequences, the number of input and output relations in the MRs will increase substantially. Furthermore, the large number of source and follow-up test cases based on these event sequences and MRs will be difficult to generate, so the cost of implementing MTES will be very high. We could alternatively design simple MRs (e.g., non-equalities) to verify the correctness of business processes and thereby reduce the cost of MTES. However, in this paper, our approach does not focus on the cost of MTES but rather on its feasibility and effectiveness for software systems from different domains. Therefore, we design various MRs with respect to different business processes, different execution paths, and different input and output relations to validate the approach. Although these MRs exhibit higher fault-detection capabilities than those with single-event scenarios, they are relatively complex and difficult to construct. Two open questions remain: how much do the numbers of MRs and test cases increase when moving from single events to 2-, 3-, . . ., n-way event sequences, and what is the most suitable dimension for an event sequence to balance the effectiveness and cost of MTES? These problems require additional research.
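To make the cost concern concrete, consider only permutation-based follow-up sequences: the number of candidate reorderings of an n-way event sequence already grows factorially. The following back-of-the-envelope sketch is our own illustration of that growth, not a cost model taken from the experiments:

```python
from math import factorial

def candidate_permutations(n):
    """Distinct reorderings of an n-way event sequence; each is a
    potential follow-up sequence for a permutation-based MR."""
    return factorial(n)

for n in range(1, 6):
    print(n, candidate_permutations(n))
# Grows from 1 reordering at n=1 to 120 at n=5, before even counting
# addition or deletion transformations or the input/output relations.
```

Even selecting a small subset of these reorderings as follow-up test cases quickly multiplies the execution and comparison effort, which motivates the question of the most suitable sequence dimension.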

Validity
The primary threat to internal validity is the implementation of MTES, such as test case generation, test execution and comparison of test outputs. We tested the implementation at the unit level and system level and checked the data thoroughly. We also adopted measures to resolve the problems related to floating-point precision and rounding when test outputs are compared. These steps ensured the quality of our experiments.
The threat to external validity is mainly related to the systems under test. The system under test in case study 2 was used in our previous study [43], and the system in case study 1 is similar to it; both are simplified programs with mathematical functions drawn from real-life applications. Although these systems are small, they share common characteristics with business-process scenarios. The system in case study 3 is a real-life elastic cloud management system involving a complex cloud resource environment and complex event relations, in which the oracle problem is prominent. These three systems from different domains are representative and meaningful for expanding the application of MTES in the software industry. It is also worthwhile to further investigate the effectiveness of our approach on other classes of systems in the software industry.
Another threat to external validity is the mutants automatically generated by the muJava tool in case studies 1 and 2. Although mutants generated by mutation operators are similar to real faults [38], they are not real faults and may be limited in type. However, mutation analysis has been widely used to evaluate the effectiveness of testing methods, so this threat is acceptable. In addition, we use three program versions with real faults in case study 3 to validate our approach, and the experimental results are also promising.
The primary threat to construct validity is the measurement of effectiveness. We use the MS and the fault-detection rate as metrics of the effectiveness of the MRs; both metrics have been widely used in the literature. Another threat to construct validity is the construction of MRs for event sequences. Because these MRs involve various business processes of different systems from multiple domains, we may not be fully acquainted with all of them. Experts from these domains gave us professional guidance to ensure the correctness of the constructed MRs, thereby greatly reducing this threat.

Related work
Some researchers have applied MT to system testing and integration testing. Murphy et al. proposed an automatic system testing approach and its implementation framework [8]. Their study focused on the automation of MT, such as automatic input transformations, parallel executions and output comparisons of applications, whereas our approach focuses on the construction of MRs between event sequences. Chan et al. proposed the concept of checkpoints, which provides a convenient way to conduct integration testing of middleware-based applications [44]. They used the relations between the source and follow-up input sequences at checkpoints to test the program, which is, to some extent, similar to our approach. However, our approach includes not only the relations between the source and follow-up input sequences but also the relations between the source and follow-up event sequences, making it more specific and feasible for practical applications.
Some researchers have applied MT in the domains of banking and cloud computing. Chan et al. proposed a metamorphic approach for online service testing and conducted a case study on a foreign exchange dealing service application [45]. They used the successful test cases of offline testing as the source test cases for online testing, but they assumed that test oracles were available for offline testing; our method makes no such assumption. Sun et al. proposed an MT framework for web services and conducted a case study on a transfer function of a bank system [46]. However, they designed only simple MRs, most of which were non-equalities. In this paper, we consider different business-process scenarios to design different types of MRs, demonstrating that MT is suitable and effective for systems with various business-process scenarios. A methodology has also been proposed to semi-automatically test and validate cloud models by combining simulation techniques and MT [47]. That method simulates different cloud models and constructs different MRs for performance experiments, validating the usefulness and applicability of MT in cloud computing. In contrast, our approach focuses on functional testing of a cloud management platform. We provide an effective approach to constructing MRs between event sequences to test business processes, which can easily be extended to applications from different domains.
To some extent, we draw on event sequence generation and test case generation from GUI testing [25,26,28,48], but we further integrate these generation methods with MT and propose MTES to test business-process-based software systems. Moreover, these GUI testing methods regard only directly interacting events as an event sequence, whereas we also regard related events as an event sequence.
Additionally, some researchers have proposed principles for constructing good MRs. Murphy et al. [49] suggested input transformation rules, such as permutation, addition and multiplication, for constructing MRs for mathematical functions. Chen et al. [50] proposed the METRIC identification methodology based on the category-choice framework and developed a generator tool, MR-GEN, to help users identify MRs from specifications in a systematic manner; this methodology improved the applicability, effectiveness and automation of MT. Mayer and Guderlei [51] found that some MRs involving linear equations, as well as those close to the implementation, are limited in fault-detection capability, and proposed that good MRs should have rich semantics. Sun et al. proposed a methodology (μMT) for acquiring MRs by means of data mutation [52], in which data mutation operators generate valid mutated test cases as follow-up test cases and the output relations are derived from the input relations via mapping rules. Ding and Zhang proposed an approach to iteratively refine MRs for adequate testing [53]: initial MRs are first constructed to implement mutation testing, and their effectiveness is then evaluated to refine the MRs iteratively. Liu et al. [54] proposed a composition approach for MRs to achieve higher cost-effectiveness with respect to an event, algorithm or function. Although these approaches indicate how to construct good MRs, they do not provide guidance for constructing MRs for event sequences. This paper proposes some general rules, called properties between event sequences, to construct MRs for business processes.

Conclusion
Many studies have demonstrated that MT is an effective approach for testing programs that suffer from the oracle problem. However, most of these studies have not considered the rich business-process scenarios found in the software industry, so the applicability of MT requires further validation. In this paper, we propose an MT approach for event sequences that can be used to systematically test applications with rich business processes. We conduct three case studies in different domains to illustrate our approach, and the experimental results demonstrate its feasibility and effectiveness. The results also confirm the previous finding that good MRs are those that make the executions as different as possible. Furthermore, this paper considers more differences between the source and follow-up executions, such as different event sequences and different input and output parameters and relations. We find that MRs based on different event sequences have higher fault-detection capabilities than those based on the same event sequence, and that MRs with richer input and output relations have higher fault-detection capabilities. To improve the practical impact of our proposed approach, more experimental studies involving real-world software applications suffering from the oracle problem should be conducted; this will be an important aspect of our future work.