A testing-coverage software reliability model considering fault removal efficiency and error generation

In this paper, we propose a software reliability model based on a nonhomogeneous Poisson process (NHPP) that considers not only error generation but also fault removal efficiency combined with testing coverage information. During the past four decades, many NHPP-based software reliability growth models (SRGMs) have been proposed to estimate software reliability measures, and most share the following assumptions: 1) the fault detection rate commonly changes during the testing phase; 2) as a result of imperfect debugging, fault removal is accompanied by a fault re-introduction rate. However, few SRGMs in the literature differentiate between fault detection and fault removal, i.e. they seldom consider imperfect fault removal efficiency. In a practical software development process, fault removal efficiency cannot always be perfect: detected failures may not be removed completely, the original faults may persist, and new faults may be introduced meanwhile, which is referred to as the imperfect debugging phenomenon. In this study, a model is developed that incorporates the fault introduction rate, fault removal efficiency and testing coverage into software reliability evaluation, using testing coverage to express the fault detection rate and fault removal efficiency to model fault repair. We compare the performance of the proposed model with several existing NHPP SRGMs on three sets of real failure data using five criteria. The results show that the proposed model gives better fitting and predictive performance.


Introduction
Due to software's ever-increasing usage and crucial role in safety-critical systems, high-quality software products are in great demand, because the failure of safety-critical software may cause catastrophic loss of life and property. Therefore, software reliability, defined as the probability of failure-free software operation in a specified environment for a specified period of time [1], has become one of the most important customer-oriented characteristics of software quality. Recent work has considered the uncertainty of operating environments and proposed a software reliability model with a Vtub-shaped fault-detection rate [29]. Peng et al. studied the fault detection process (FDP) and fault correction process (FCP) with the incorporation of a testing effort function and imperfect debugging [30].
However, most recent models consider "error generation" while neglecting imperfect fault removal efficiency. An in-depth and comprehensive treatment of fault removal efficiency is therefore important for building more precise reliability models.
Moreover, researchers have suggested that the accuracy of SRGMs can be further improved by considering the influence of real issues arising during the testing process [31][32][33]. Testing coverage is considered one of the most important such factors.
Testing coverage is a good metric for identifying the effectiveness and completeness of testing. Many time-dependent testing coverage functions (TCFs) have been proposed in terms of different distributions, such as Logarithmic-exponential (L-E) [34], S-shaped [35], Rayleigh [36], Weibull & Logistic [37] and Lognormal [38]. Based on different TCFs, software reliability models have also been developed to express the relationship between testing coverage and the cumulative number of detected faults, such as Beta, Hyper-exponential (H-E) [33], L-E [34], Rayleigh [36], and other coverage-based SRGMs [33,35,37].
In this study, we propose a model considering not only error generation, but also imperfect fault removal efficiency incorporating testing coverage. The rest of the paper is organized as follows. In Section 2, we give a brief overview of NHPP and the assumptions for the proposed model, then present the establishment of the proposed model, and several existing SRGMs are also presented. In Section 3, we state two parameter estimation methods and five criteria for models' performance comparison. In Section 4, we compare the descriptive and predictive performance of this model with several existing NHPP SRGMs on three representative failure data sets, together with the sensitivity analysis. Finally, we draw the conclusions in Section 5.

Software reliability modeling

Basic assumptions
The model presented in this paper is based on an NHPP, which is used to describe the failure phenomenon during the testing phase. The counting process {N(t), t ≥ 0} of the NHPP is given by

\Pr\{N(t) = n\} = \frac{[m(t)]^n}{n!} e^{-m(t)}, \quad n = 0, 1, 2, \ldots \quad (1)

The mean value function m(t) is given by

m(t) = \int_0^t \lambda(u)\, du \quad (2)

where λ(u) is the fault intensity function. The proposed model is based on the following assumptions:

1. Software fault occurrence and removal follow an NHPP.
2. The software failure rate at any time is a function of the fault detection rate and the number of faults remaining in the software at that time. The fault detection rate can be expressed by c'(t)/(1 - c(t)), where c(t) is the percentage of the code that has been examined up to time t and c'(t) is the derivative of the testing coverage function.
3. When a software failure is detected, debugging starts immediately, and either (a) the total number of faults is reduced by one with probability p, or (b) the total number of faults remains the same with probability 1 - p.

4. During the fault repair process, whether or not the fault is removed completely, new faults are introduced with a constant probability α.
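The NHPP counting process underlying these assumptions can be sketched numerically. The following Python snippet (an illustrative sketch, not part of the original paper; the constant intensity λ(u) = 2 is an arbitrary example) evaluates the Poisson probability Pr{N(t) = n} and approximates the mean value function m(t) as the integral of λ(u) by the trapezoidal rule:

```python
import math

def nhpp_pmf(n, m_t):
    """Pr{N(t) = n} = m(t)^n / n! * exp(-m(t)) for an NHPP."""
    return m_t ** n / math.factorial(n) * math.exp(-m_t)

def mean_value(lam, t, steps=10000):
    """m(t) = integral of lambda(u) over [0, t], via the trapezoidal rule."""
    h = t / steps
    interior = sum(lam(i * h) for i in range(1, steps))
    return h * (0.5 * lam(0.0) + interior + 0.5 * lam(t))

# Example: a constant fault intensity lambda(u) = 2 gives m(t) = 2t,
# so N(3) is Poisson-distributed with mean 6.
m3 = mean_value(lambda u: 2.0, 3.0)
prob_six_faults = nhpp_pmf(6, m3)
```

For a time-varying intensity, only the `lam` callable changes; the rest of the computation is identical.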

Model development
Let c(t) represent the percentage of the code that has been covered up to time t. Here c(t) refers to any kind of coverage, e.g. statement coverage, branch coverage, C-use coverage, P-use coverage etc. Obviously, c(t) is an increasing function of testing time t. Usually it increases very fast at the beginning of the testing process, as more test cases are executed to examine the software; after a certain time point, the increase of testing coverage flattens, because less new coverage is achieved while the residual faults are being detected [35]. Thus, a concave or S-shaped function may be used to model the testing coverage function. Apparently, (1 - c(t)) denotes the percentage of code that has not been examined by test cases up to time t, and the derivative of the testing coverage function, c'(t), denotes the coverage rate. The function c'(t)/(1 - c(t)) is therefore recommended to denote the fault detection rate [35,37], which has become the common assumption of SRGMs considering testing coverage. Based on the above assumptions, the mean value function considering both fault removal efficiency and testing coverage can be obtained by solving the following differential equation:

\frac{dm(t)}{dt} = \beta\, \frac{c'(t)}{1 - c(t)}\, [a(t) - p\, m(t)] \quad (3)

where a(t) represents the fault content function of the software, β is a proportionality constant, and p is the fault removal efficiency, meaning that a fraction p of the detected faults can be removed successfully during development. m(t) denotes the expected number of faults detected up to time t, so p m(t) is the expected number of faults that have been eliminated completely, and [a(t) - p m(t)] represents the expected number of faults remaining in the software at time t. It should be noted that when β = 1 and p ≠ 1, the proposed model has the same form as the one recommended in [21], and when β ≠ 1 and p = 1, we obtain the same form as recommended in [22]. Existing models generally assume that p equals 100% [18].
From Assumption 4, the total fault content a(t) is a linear function of the expected number of faults detected up to time t:

a(t) = a + \alpha\, m(t) \quad (4)

where a denotes the number of faults initially present in the software system before testing starts and α > 0. Substituting a(t) from Eq (4) into Eq (3) and solving under the initial condition m(0) = 0, we obtain

m(t) = \frac{a}{p - \alpha} \left[ 1 - \left( \frac{1 - c(t)}{1 - c(0)} \right)^{\beta (p - \alpha)} \right] \quad (5)

where c(0) is the testing coverage at t = 0. The software reliability function based on the NHPP is

R(x \mid t) = e^{-[m(t + x) - m(t)]} \quad (6)

where m(t) is given by Eq (5).

A new model with testing coverage
Substituting different testing coverage functions c(t) into Eq (5) yields different mean value functions m(t). As mentioned above, the increasing rate of testing coverage may first increase and then decrease; that is, the testing coverage function may follow an S-shaped trend and is suitably described by an S-shaped curve. The inflection S-shaped function is a representative S-shaped function that is flexible, applicable in many cases and has been applied to software reliability modeling [8], so the following function is used here to describe the testing coverage:

c(t) = \frac{A (1 - e^{-rt})}{1 + c\, e^{-rt}} \quad (7)

where A denotes the maximum percentage of testing coverage, r is the shape parameter and c is the scale parameter. Clearly, c(0) = 0 at t = 0. Substituting Eq (7) into Eq (5), we obtain the mean value function

m(t) = \frac{a}{p - \alpha} \left[ 1 - \left( \frac{1 - A + (A + c)\, e^{-rt}}{1 + c\, e^{-rt}} \right)^{\beta (p - \alpha)} \right] \quad (8)

It should be noted that error generation, fault removal efficiency and testing coverage are all combined in the proposed model. Table 1 gives a summary of several existing NHPP models and the proposed model.
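The closed-form model obtained by combining Eqs (5) and (7) is simple to evaluate. The sketch below is illustrative only: the parameter values are arbitrary placeholders (not fitted estimates from any data set in the paper), and `cc` stands for the scale parameter c to avoid clashing with the coverage function's name.

```python
import math

def coverage(t, A, r, cc):
    """Inflection S-shaped testing coverage c(t) -- Eq (7)."""
    e = math.exp(-r * t)
    return A * (1.0 - e) / (1.0 + cc * e)

def mean_value(t, a, alpha, beta, p, A, r, cc):
    """Mean value function m(t) from Eq (5) with the coverage of Eq (7),
    using c(0) = 0."""
    ct = coverage(t, A, r, cc)
    return a / (p - alpha) * (1.0 - (1.0 - ct) ** (beta * (p - alpha)))

def reliability(x, t, params):
    """R(x | t) = exp(-(m(t+x) - m(t))): probability of no failure in (t, t+x]."""
    return math.exp(-(mean_value(t + x, *params) - mean_value(t, *params)))

# Arbitrary illustrative parameters: a, alpha, beta, p, A, r, cc.
params = (100.0, 0.2, 1.0, 0.9, 0.95, 0.1, 2.0)
m_50 = mean_value(50.0, *params)          # expected faults detected by t = 50
r_1_50 = reliability(1.0, 50.0, params)   # P(no failure in (50, 51])
```

Note that m(t) saturates at a/(p - α) · [1 - (1 - A)^{β(p-α)}] as t grows, so the expected total number of detected faults exceeds the initial fault content a whenever α > 0.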

Parameter estimation methods and model comparison criteria
Theoretically, once the analytical expression for m(t) is derived, its parameters can be estimated by the maximum likelihood estimation (MLE) method or the least squares estimation (LSE) method. MLE is one of the most useful techniques for deriving estimators because, compared with other estimation methods, maximum likelihood estimates possess desirable statistical properties such as consistency and asymptotic efficiency.

Table 1. Summary of the software reliability models and their mean value functions.

Maximum likelihood estimation (MLE)
Since all the failure data are expressed as pairs (t_i, y_i) (i = 1, 2, ..., n), where y_i is the cumulative number of faults detected in (0, t_i], then based on the definition of the NHPP the likelihood function is given by

L = \prod_{i=1}^{n} \frac{[m(t_i) - m(t_{i-1})]^{y_i - y_{i-1}}}{(y_i - y_{i-1})!}\, e^{-[m(t_i) - m(t_{i-1})]} \quad (9)

with t_0 = 0 and y_0 = 0. The logarithmic form of the likelihood function is

\ln L = \sum_{i=1}^{n} \left\{ (y_i - y_{i-1}) \ln[m(t_i) - m(t_{i-1})] - [m(t_i) - m(t_{i-1})] - \ln[(y_i - y_{i-1})!] \right\} \quad (10)

By taking the derivative of Eq (10) with respect to each parameter in m(t) and setting the results equal to zero, we obtain the likelihood equations for the proposed model. Solving these equations simultaneously yields the maximum likelihood estimates of all parameters.
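In practice the likelihood equations rarely have closed-form solutions, so Eq (10) is maximized numerically. The sketch below is not the authors' code: it uses NumPy/SciPy and, purely to keep the example small, a two-parameter exponential mean value function m(t) = a(1 - e^{-bt}) instead of the proposed model; the data values are synthetic.

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_likelihood(theta, t, y, mean_value):
    """Negative of the grouped-data log-likelihood in Eq (10); the constant
    ln((y_i - y_{i-1})!) term is dropped as it does not depend on theta."""
    m = np.array([mean_value(ti, *theta) for ti in t])
    dm = np.diff(np.concatenate(([0.0], m)))                       # m(t_i) - m(t_{i-1})
    dy = np.diff(np.concatenate(([0.0], np.asarray(y, dtype=float))))
    if np.any(dm <= 0.0):
        return np.inf                                              # invalid parameter region
    return -float(np.sum(dy * np.log(dm) - dm))

# Illustrative two-parameter mean value function (Goel-Okumoto form):
def m_exp(t, a, b):
    return a * (1.0 - np.exp(-b * t))

t = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [4, 6, 8, 9, 9]                 # synthetic cumulative detected faults
result = minimize(neg_log_likelihood, x0=[20.0, 0.1],
                  args=(t, y, m_exp), method="Nelder-Mead")
a_hat, b_hat = result.x
```

The same `neg_log_likelihood` works unchanged for the proposed model; only the `mean_value` callable and the initial guess `x0` (now seven parameters) change.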

Least square estimation (LSE)
LSE is based on minimizing the sum of the squared distances between the fitted curve and the actual data points. The sum of squared distances is given by

S = \sum_{i=1}^{n} [m(t_i) - y_i]^2 \quad (12)

By taking the derivative of Eq (12) with respect to each parameter in m(t) and setting the results equal to zero, we obtain the normal equations for the proposed model. Solving these equations simultaneously yields the least squares estimates of all parameters.
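As with MLE, the normal equations of Eq (12) are usually solved numerically. A sketch using SciPy's `curve_fit`, which minimizes exactly the sum of squares in Eq (12); the two-parameter exponential form and the data values here are illustrative, not taken from the paper:

```python
import numpy as np
from scipy.optimize import curve_fit

def m_exp(t, a, b):
    """Illustrative two-parameter mean value function a*(1 - exp(-b*t))."""
    return a * (1.0 - np.exp(-b * t))

t = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([5, 9, 12, 14, 16, 17, 18, 18], dtype=float)  # cumulative faults

# curve_fit finds the parameters minimizing sum_i (m(t_i) - y_i)^2 -- Eq (12)
params, _ = curve_fit(m_exp, t, y, p0=[20.0, 0.3])
a_hat, b_hat = params
sse = float(np.sum((m_exp(t, *params) - y) ** 2))
```

Fitting the proposed model only requires replacing `m_exp` with the seven-parameter mean value function and supplying a suitable initial guess `p0`.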

Criteria for models' descriptive power comparison
Here we use four criteria to examine the descriptive performance of SRGMs. The first criterion is the mean squared error (MSE), which is defined as follows [41]:

MSE = \frac{\sum_{i=1}^{n} [\hat{m}(t_i) - y_i]^2}{n - N} \quad (14)

where n is the number of observations, \hat{m}(t_i) is the estimated cumulative number of faults up to time t_i according to the fitted mean value function (i = 1, 2, ..., n), and N is the number of parameters in the model.
In practice, when comparing models with different numbers of parameters, it is considered unfair to compare them without penalizing the models with more parameters. The MSE defined above therefore includes a penalty term with respect to the degrees of freedom: the more parameters a model has, the larger the penalty. A lower MSE value indicates a better goodness of fit.
The second criterion used to examine the fitting power of SRGMs is the correlation index of the regression curve equation (R^2):

R^2 = 1 - \frac{\sum_{i=1}^{n} [y_i - \hat{m}(t_i)]^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} \quad (15)

where \bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i. The larger R^2 is, the better the model's performance.
The third criterion used to evaluate the performance of SRGMs is the adjusted R^2 (Adjusted_R^2):

Adjusted\_R^2 = 1 - \frac{(1 - R^2)(n - 1)}{n - P - 1} \quad (16)

where R^2 on the right side of Eq (16) is given by Eq (15) and P is the number of predictors in the fitted model. The larger Adjusted_R^2 is, the better the model's goodness of fit; here, too, a penalty with respect to the number of parameters is applied. The fourth criterion is AIC, which measures the ability of a model to maximize the likelihood function while accounting for the degrees of freedom used during fitting:

AIC = -2 \ln L + 2N \quad (17)

where L is the value of the likelihood function at its maximum and N is the number of parameters in the model. A lower AIC value indicates a better goodness of fit. Note that AIC, like MSE and Adjusted_R^2, assigns a larger penalty to a model with more parameters.
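The four descriptive criteria are straightforward to compute once the fitted values are available. A sketch follows; note that P in Eq (16) is taken equal to the number of model parameters N here, which is an assumption on our part.

```python
import numpy as np

def fit_criteria(y, m_hat, n_params, max_log_lik):
    """MSE (with degrees-of-freedom penalty), R^2 -- Eq (15),
    Adjusted_R^2 -- Eq (16), and AIC = -2*ln(L) + 2*N."""
    y = np.asarray(y, dtype=float)
    m_hat = np.asarray(m_hat, dtype=float)
    n = len(y)
    ss_res = float(np.sum((m_hat - y) ** 2))
    mse = ss_res / (n - n_params)                               # penalized MSE
    r2 = 1.0 - ss_res / float(np.sum((y - y.mean()) ** 2))      # Eq (15)
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - n_params - 1)    # Eq (16)
    aic = -2.0 * max_log_lik + 2.0 * n_params                   # AIC
    return mse, r2, adj_r2, aic

# Example: a perfect fit gives MSE = 0 and R^2 = Adjusted_R^2 = 1.
mse, r2, adj_r2, aic = fit_criteria([4, 6, 8, 9], [4, 6, 8, 9], 2, -10.0)
```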

Criteria for models' predictive power comparison
Here we use the SSE criterion to examine the predictive power of SRGMs. SSE is the sum of squared prediction errors, expressed as follows [42]:

SSE = \sum_{i=m}^{n} [y_i - \hat{m}(t_i)]^2 \quad (18)

Assume that by the end of the testing time t_n, a total of y_n faults have been detected. First, we use the data points up to time t_{m-1} (t_{m-1} < t_n) to estimate the parameters of m(t); substituting the estimated parameters into the mean value function then yields the predicted cumulative number of faults \hat{m}(t_m) by time t_m (t_m < t_n), while y_m is the actual number of faults detected by t_m. The procedure is then repeated for successive time points t_i (i = m + 1, m + 2, ..., n) until t_n.
Therefore, the smaller the SSE, the better the model's predictive performance.
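The rolling prediction procedure described above can be sketched as follows; the one-parameter linear `fit`/`mean_value` pair at the bottom is a toy stand-in, used only to make the example self-contained.

```python
def predictive_sse(t, y, fit, mean_value, start):
    """SSE criterion: for each i >= start, re-estimate the parameters on the
    data up to t_{i-1}, predict m_hat(t_i), and accumulate (y_i - m_hat(t_i))^2."""
    sse = 0.0
    for i in range(start, len(t)):
        params = fit(t[:i], y[:i])                # estimate on earlier points only
        sse += (y[i] - mean_value(t[i], *params)) ** 2
    return sse

# Toy stand-ins: a one-parameter linear "model" fitted through the last point.
fit_linear = lambda ts, ys: (ys[-1] / ts[-1],)
m_linear = lambda ti, slope: slope * ti

t = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 6.0, 8.0, 10.0]                    # exactly linear data
sse = predictive_sse(t, y, fit_linear, m_linear, start=3)
```

Any of the estimation routines above (MLE or LSE) can be passed in as `fit`, and the proposed model's mean value function as `mean_value`.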

Model analysis with real application
Here we examine the performance of the proposed model compared with several existing NHPP models based on three representative data sets.

Monitor and control system data
The first data set is large in size and was collected from testing a real monitor and control system (Data Set 1, DS-1) [43]; it has been widely used in many studies, such as [42]. The details are recorded in Table 2, with the time unit being day. A total of 481 faults were observed within 111 days. All data points are used to fit the models and estimate the models' parameters.
The model parameters estimated by both LSE and MLE are given in Table 3. The MSE, R^2 and Adjusted R^2 values are obtained from the parameters estimated by the LSE method, and the AIC values from the parameters estimated by the MLE method. Compared with all other models on the MSE, R^2 and Adjusted R^2 criteria, the proposed model displays the smallest MSE value and the largest R^2 and Adjusted_R^2 values, at 239.4231, 0.9899 and 0.9894 respectively. Although the inflection S-shaped model also fits well, its values are still inferior to those of the proposed model. The AIC value of the proposed model, 647.31, is not the smallest among all models: it is slightly larger than those of the inflection S-shaped model (641.8546), the delayed S-shaped model (644.0284) and the P-Z model (645.8532), but far smaller than that of the Yamada exponential model (727.8002). We can therefore conclude that the descriptive power of the proposed model is better than that of the other models.
Moreover, additional information can be acquired from the parameter estimates of the proposed model. For instance, under the LSE method, the fault removal efficiency is 64.43%, which is less than the average value reported in [20] (the fault removal efficiency ranged from 45% to 99% with an average of 72%). Therefore, more resources should be allocated to enhance fault removal efficiency. The fault removal probability per failure is 0.6443, a rather low value, and the fault introduction rate is 0.2301, which is not very low; this means that models neglecting imperfect debugging would introduce considerable deviation here. The initial fault content is estimated to be 408; together with the 0.2301 fault introduction rate and the 481 detected faults, the expected total number of faults is 519. Thus at 111 days, the assumed stopping time, there are still about 38 faults latent in the software. A fault introduction rate of 0.2301 means that on average 1 fault is introduced for every 4 faults removed, so more training should be given to the testers to improve their testing skill.

Tandem computer data
In this section, we examine the models using another data set, collected from Tandem Computers Release #1 (Data Set 2, DS-2) [2], which has also been widely used in many studies, such as [7,42,44]. DS-2 is small in size and the failure data are tabulated in Table 4, with the time unit being week. A total of 100 faults were detected within about 20 weeks. All data points are used to fit the models and estimate the models' parameters.
As in the first case study, we use both LSE and MLE to estimate the models' parameters, which are recorded in Table 5. The MSE, R^2 and Adjusted R^2 values are calculated from the parameters obtained by the LSE method, and the AIC values from those obtained by the MLE method.
Compared with all other models on the MSE, R^2 and Adjusted R^2 criteria, the proposed model displays the smallest MSE value and the largest R^2 and Adjusted_R^2 values, at 8.8385, 0.9929 and 0.9897 respectively. Although the inflection S-shaped model also fits well, its values of 10.5647, 0.989 and 0.9877 are still inferior to those of the proposed model. The AIC value of the proposed model is not the smallest, being slightly larger than those of the inflection S-shaped and G-O models, but it is much smaller than the other models' values. We can therefore conclude that the proposed model performs comparatively better in goodness-of-fit behavior.
Moreover, additional information can be acquired from the parameter estimates of the proposed model. For instance, under the LSE method, the fault removal efficiency is 82.48%, slightly higher than the average value reported in [20], indicating that the skill level of the testing team is above average. The initial fault content is estimated to be 59, the fault introduction rate is 0.5148, and the expected total number of detected faults is 110. Thus at 20 weeks, the assumed stopping time, there are still about 10 faults remaining in the software, meaning that some faults remain at the end of the testing phase. A fault introduction probability of 0.5148 means that on average 1 fault is introduced for every 2 faults removed, so more training should be given to the testers to improve their testing skill. The testing coverage function of the proposed model, based on the parameters estimated by LSE on DS-2, is illustrated in Fig 4; the coverage again follows the trend described earlier. Fig 5 depicts the fault detection rate of DS-2 over time: it displays a non-increasing S-shaped trend whose rate of decrease varies by phase, first accelerating and then slowing. The fitting comparison of all models for DS-2 is illustrated in Fig 6, which shows that the proposed model fits the actual data better than all other models.

PL/I database application
In this section, we examine the models using another data set, cited from Ohba (Data Set 3, DS-3) [5], which has also been widely used in many studies, such as [22,45,46]. The failure data are given in Table 6, with the time unit being week. A total of 328 faults were detected within about 19 weeks. All data points are used to fit the models and estimate the models' parameters.
Here we use only MLE to estimate the models' parameters, which are recorded in Table 7. The MSE, R^2, Adjusted R^2 and AIC values are all calculated from the parameters obtained by MLE.
Compared with all other models on the MSE, R^2, Adjusted R^2 and AIC criteria, the proposed model displays the smallest MSE and AIC values and the largest R^2 and Adjusted_R^2 values, at 93.8861, 203.3359, 0.9943 and 0.9906 respectively. We can therefore conclude that the proposed model performs best among all models in goodness-of-fit behavior.
Moreover, additional information can be acquired from the parameter estimates of the proposed model. For instance, the fault removal efficiency is 60%, below the average value reported in [20], indicating that the skill of the testing team should be improved. The initial fault content is estimated to be 628, the fault introduction rate is 0.5, and the expected total number of faults is 792 at 19 weeks. Thus at 19 weeks, the assumed stopping time, there are still about 464 faults remaining in the software, meaning that many faults remain at the end of the testing phase. A fault introduction probability of 0.5 means that on average 1 fault is introduced for every 2 faults removed, so more training should be given to the testers to improve their testing skill.
The testing coverage function of the proposed model, based on the parameters estimated by MLE on DS-3, is illustrated in Fig 7; the coverage again follows the trend described earlier. Fig 8 depicts the fault detection rate of DS-3 over time: it displays an S-shaped trend, first increasing and then decreasing, with a rate of decrease that varies by phase, first accelerating and then slowing. The fitting comparison of all models for DS-3 is illustrated in Fig 9, which shows that the proposed model fits the actual data better than all other models.

Comparison of models' predictive power
In order to validate the proposed model's predictive power, we divide each of the three data sets into two parts. For DS-1, we use the first 80% of the data points to estimate the models' parameters and the remaining data to compare the models' predictive power; for DS-2, we use the first 90%. Here both the LSE and MLE methods are used to estimate the models' parameters. In terms of the SSE values listed in Table 8, the proposed model offers the smallest values, at 2.6775 and 0.7418. The closest competitors, the delayed S-shaped model with an SSE of 3.5409 on DS-1 and the Yamada Rayleigh model with an SSE of 1.6944 on DS-2, are within about twice the proposed model's values, while the other models have SSE values from 3 to 1672 times as large.
For DS-3, we use the first 95% of the data points to estimate the models' parameters and the remaining data to compare the models' predictive power. Here only the MLE method is used. From Table 8, it can be seen that the proposed model provides both the lowest SSE value (0.2231) and the smallest AIC value (199.7570).
Therefore, according to these results listed in Table 8, we can conclude that the proposed model provides better prediction performance.

Sensitivity analysis
Here we conduct a sensitivity analysis to study each parameter's impact on the robustness of the proposed model, varying one parameter while the other parameters are held at fixed values. Due to limited space, we only give the results based on DS-1 and DS-2; the same conclusions can be obtained on DS-3 as well. Figs 10(a)-10(g) show the effects of varying the parameters a, α, β, p, A, c and r in the proposed model on DS-1. It can be seen that the cumulative number of detected faults changes appreciably as the initial fault number a, the fault introduction rate α, the fault removal efficiency p, the proportionality constant β, the maximum testing coverage percentage A, the scale parameter c and the shape parameter r are varied. Thus, parameters a, α, β, p, A, c and r are all influential parameters in the proposed model. Figs 11(a)-11(g) show similar results for DS-2, so these parameters can be regarded as influential in the proposed model.
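The one-at-a-time perturbation used in this sensitivity analysis can be sketched as below. The base parameter values are loosely modeled on the magnitudes reported for DS-1 but are not the fitted estimates; `cc` again stands for the scale parameter c, and the ±10% perturbation size is an arbitrary choice for illustration.

```python
import math

def coverage(t, A, r, cc):
    """Inflection S-shaped testing coverage c(t) -- Eq (7)."""
    e = math.exp(-r * t)
    return A * (1.0 - e) / (1.0 + cc * e)

def mean_value(t, a, alpha, beta, p, A, r, cc):
    """Mean value function m(t) from Eqs (5) and (7), with c(0) = 0."""
    ct = coverage(t, A, r, cc)
    return a / (p - alpha) * (1.0 - (1.0 - ct) ** (beta * (p - alpha)))

# Illustrative base values (same order of magnitude as the DS-1 estimates,
# but hypothetical): a, alpha, beta, p, A, r, cc.
base = dict(a=400.0, alpha=0.23, beta=1.0, p=0.64, A=0.9, r=0.05, cc=5.0)

# Perturb one parameter at a time by +/-10%, keep the rest fixed,
# and record the expected cumulative fault count at t = 100.
effects = {}
for name, value in base.items():
    lo = mean_value(100.0, **dict(base, **{name: value * 0.9}))
    hi = mean_value(100.0, **dict(base, **{name: value * 1.1}))
    effects[name] = (lo, hi)
```

Plotting each `(lo, hi)` pair against the baseline m(t) curve reproduces the kind of comparison shown in Figs 10 and 11: a parameter is influential when its perturbation visibly shifts the cumulative fault curve.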

Conclusions
In this paper, an NHPP-based imperfect debugging model is developed that incorporates both error generation and imperfect fault removal efficiency, together with an inflection S-shaped testing coverage function used to express the fault detection rate. Comparisons of this model with several other existing NHPP models have been presented in detail. Three widely used failure data sets are used to validate the goodness-of-fit and predictive performance of the proposed model, and five comparison criteria are used to evaluate model performance; the results show that the proposed model fits and predicts better, which is encouraging. Furthermore, the sensitivity analysis shows that parameters a, α, β, p, A, c and r are influential parameters in the proposed model. The limitations of the proposed model are as follows:

1. The quantity and variety of fault data sets in our experiments are limited. As is well known, using more data sets, and more kinds of data sets, gives a more convincing demonstration of a model's performance; however, we used only three real-world data sets to validate the model. Thus, fault data sets of greater quantity and variety are needed in future work to give a more in-depth validation.
2. The proposed model assumes the fault removal efficiency to be a constant to simplify the mathematical derivation and calculation. In a practical software debugging process, however, the fault removal efficiency may change over time, because it depends on many factors such as the skill of the testing team, the complexity of the software system, the testing strategy and the testing environment. Thus, the fault removal efficiency may take more complicated forms, e.g. a time-dependent function rather than a constant, and more forms of the fault removal efficiency function should be studied in future research.