Finite population distribution function estimation with dual use of auxiliary information under simple and stratified random sampling

The main purpose of this paper is to propose two new estimators for estimating the finite population distribution function under simple and stratified random sampling schemes using supplementary information on the distribution function, mean and ranks of the auxiliary variable. The mathematical expressions for the bias and mean squared error of the proposed estimators are derived under the first order of approximation. The theoretical and empirical studies showed that the proposed estimators uniformly perform better than the existing estimators in terms of the percentage relative efficiency.


Introduction
In survey sampling, the use of suitable auxiliary information improves the precision of estimators of the unknown population parameter(s). Several estimators of population parameters, including the population mean, median, total, distribution function, quantiles, etc., exist in the literature, and they require supplementary information on one or more auxiliary variables along with the information on the study variable. A number of studies have been published on the estimation of the population mean. Some important references to the population mean estimation using auxiliary information include Murthy [1], Sisodia and Dwivedi [2], Srivastava and Jhajj [3], Rao [4], Upadhyaya and Singh [5], Singh [6], Kadilar and Cingi [7], Kadilar and Cingi [8], Gupta and Shabbir [9], Grover and Kaur [10], Grover and Kaur [11], Lu [12], Muneer et al. [13], Shabbir and Gupta [14], and Gupta and Yadav [15]. In these studies, the authors proposed improved ratio, product, and regression type estimators for estimating the finite population mean, each using a single auxiliary variable. In a recent study, Haq et al. [16] suggested using ranks of the auxiliary variable as an additional auxiliary variable to increase the precision of the estimator of the population mean in simple random sampling. However, to the best of our knowledge, there is no study concerning the use of two auxiliary variables for the estimation of the finite population distribution function.
The problem of estimating the finite population cumulative distribution function (CDF) arises when the interest lies in finding the proportion of the values of the study variable that are less than or equal to a certain value. There are situations where estimating the CDF is deemed necessary.
For example, for a nutritionist, it is interesting to know the proportion of the population that consumes 25% or more of the calorie intake from saturated fat. In the literature, many authors have estimated the CDF using information about one or more auxiliary variables. Chambers and Dunstan [17] suggested an estimator of the CDF that requires information on both the study and auxiliary variables. Similarly, Rao et al. [18] and Rao [19] proposed ratio and difference/regression estimators of the CDF under a general sampling design. Kuk [20] suggested a kernel method for estimating the CDF using the auxiliary information. Ahmed and Abu-Dayyeh [21] estimated the CDF using the information on multiple auxiliary variables. Rueda et al. [22] used a calibration approach to develop an estimator of the CDF. Singh et al. [23] considered the problem of estimating the CDF and quantiles with the use of auxiliary information at the estimation stage of a survey. Moreover, Yaqub and Shabbir [24] considered a generalised class of estimators for estimating the CDF in the presence of non-response. Chen and Chen [25] investigated the injury severities of truck drivers in single- and multi-vehicle accidents on rural highways, while Zeng et al. [26] worked on a multivariate random-parameter Tobit model for analysing highway crash rates by injury severity. Dong et al. [27] investigated the differences between single-vehicle and multi-vehicle accident probability using a mixed logit model, Chen et al. [28] analysed hourly crash likelihood using an unbalanced panel data mixed logit model and real-time driving environmental big data, Zeng et al. [29] suggested jointly modelling area-level crash rates by severity, and Zeng et al. [30] used spatial joint analysis for zonal daytime and night-time crash frequencies using a Bayesian bivariate conditional autoregressive model. However, these estimators only used one auxiliary variate.
In this paper, we propose two new families of estimators for estimating the CDF using the information on the distribution function, ranks, and mean of the auxiliary variable under simple random sampling and stratified random sampling. The bias and mean squared errors (MSEs) of the existing and proposed estimators of the CDF are derived under the first order of approximation. The theoretical and numerical comparisons showed that the proposed estimators are more precise than the existing adapted estimators when estimating the CDF of a finite population.

Notation in simple random sampling
Consider a finite population O = {1,2,...,N} of N distinct units. In order to estimate the finite population distribution function, a sample of n units is drawn from O using simple random sampling without replacement. Suppose Y and X are the study and auxiliary variables, respectively. Let Z denote the ranks of X, let $I(Y \le y)$ be the indicator variable based on Y, and let $I(X \le x)$ be the indicator variable based on X. Similarly, are the population and sample means of X (Z), respectively.
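The indicator variables above turn distribution function estimation into the estimation of a mean: the proportion of units with a value at or below the threshold. The following minimal sketch illustrates this under simple random sampling without replacement. The function names, the toy population and the thresholds `y0` and `x0` are hypothetical and chosen only for illustration.

```python
import random


def indicator_mean(values, threshold):
    """Proportion of values that are <= threshold, i.e. the mean of I(V <= v)."""
    return sum(1 for v in values if v <= threshold) / len(values)


def srswor(population_size, n, seed=0):
    """Draw n unit labels from {0, ..., N-1} without replacement."""
    rng = random.Random(seed)
    return rng.sample(range(population_size), n)


# Toy population (hypothetical values, for illustration only).
Y = [12, 7, 30, 18, 25, 9, 14, 21, 16, 11]
X = [5, 3, 14, 8, 12, 4, 6, 10, 7, 5]
y0, x0 = 15, 6

# Population distribution functions F(y) and F(x) at the chosen thresholds.
F_y = indicator_mean(Y, y0)
F_x = indicator_mean(X, x0)

# Sample counterparts computed from one simple random sample.
sample = srswor(len(Y), n=4, seed=1)
F_hat_y = indicator_mean([Y[i] for i in sample], y0)
F_hat_x = indicator_mean([X[i] for i in sample], x0)
```

Here `F_x` plays the role of the known auxiliary quantity, while `F_hat_y` and `F_hat_x` are the sample quantities that the estimators combine.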
In order to obtain the biases and mean squared errors (MSEs) of the adapted and proposed estimators of F(y), we consider the following relative error terms. Let such that $E(e_i) = 0$ for i = 0,1,2,3, where $E(\cdot)$ is the mathematical expectation of $(\cdot)$. Let
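One reading of the relative error terms that is consistent with the moments used later in the text, such as the bias $F(y)(V_{0200} - V_{1100})$ of the ratio estimator and the optimum $k_{(opt)} = F(y)V_{1100}/(F(x)V_{0200})$, is sketched below. The symbols $\bar{X}$, $\bar{Z}$ for the population means and $\hat{\bar{X}}$, $\hat{\bar{Z}}$ for the sample means are assumptions, as is the pairing of $e_2$ with X and $e_3$ with Z.

```latex
e_0 = \frac{\hat{F}(y) - F(y)}{F(y)}, \qquad
e_1 = \frac{\hat{F}(x) - F(x)}{F(x)}, \qquad
e_2 = \frac{\hat{\bar{X}} - \bar{X}}{\bar{X}}, \qquad
e_3 = \frac{\hat{\bar{Z}} - \bar{Z}}{\bar{Z}},

V_{rstu} = E\left[ e_0^{r}\, e_1^{s}\, e_2^{t}\, e_3^{u} \right].
```

Under this reading, $V_{2000}$ and $V_{0200}$ are the relative variances of $\hat{F}(y)$ and $\hat{F}(x)$, and $V_{1100}$ is their relative covariance.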

Adapted estimators in simple random sampling
In this section, some estimators of finite population mean are adapted for estimating the finite CDF under simple random sampling. The biases and MSEs of these adapted estimators are derived under the first order of approximation.

PLOS ONE
The bias and MSE of $\hat{F}_2(y)$, to the first order of approximation, are
$$\mathrm{Bias}(\hat{F}_2(y)) \cong F(y)(V_{0200} - V_{1100});$$
If $R_{12} > C_2/(2C_1)$, then $\hat{F}_2(y)$ is better than $\hat{F}_1(y)$ in terms of MSE.
3. Murthy [32] adapted product estimator of F(y) is
The bias and MSE of $\hat{F}_3(y)$, to the first order of approximation, are
If $-C_2/(2C_1) > R_{12}$, then $\hat{F}_3(y)$ is better than $\hat{F}_1(y)$ in terms of MSE.
where k is an unknown constant. Here, $\hat{F}_4(y)$ is an unbiased estimator of F(y). The minimum variance of $\hat{F}_4(y)$ at the optimum value $k_{(opt)} = (F(y)V_{1100})/(F(x)V_{0200})$ is
Here, (8) may be written as
5. Rao [4] adapted difference-type estimator of F(y) is
where $k_1$ and $k_2$ are unknown constants. The bias and MSE of $\hat{F}_5(y)$, to the first order of approximation, are
The optimum values of $k_1$ and $k_2$, determined by minimizing (11), are
The minimum MSE of $\hat{F}_5(y)$ at the optimum values of $k_1$ and $k_2$ is
Here, (12) may be written as
6. Singh et al. [33] adapted generalized ratio-type exponential estimator of F(y) is
$$\hat{F}_6(y) = \hat{F}(y)\exp\left(\frac{a(F(x) - \hat{F}(x))}{a(F(x) + \hat{F}(x)) + 2b}\right)$$
where a and b are known constants. The bias and MSE of $\hat{F}_6(y)$, to the first order of approximation, are
where $\theta = aF(x)/(aF(x) + b)$.
7. Grover and Kaur [11] adapted generalized class of ratio-type exponential estimator of F(y) is
$$\hat{F}_7(y) = \left[k_3\hat{F}(y) + k_4(F(x) - \hat{F}(x))\right]\exp\left(\frac{a(F(x) - \hat{F}(x))}{a(F(x) + \hat{F}(x)) + 2b}\right)$$
where $k_3$ and $k_4$ are unknown constants. The bias and MSE of $\hat{F}_7(y)$, to the first order of approximation, are
The optimum values of $k_3$ and $k_4$, determined by minimizing (17), are
The simplified minimum MSE of $\hat{F}_7(y)$ at the optimum values of $k_3$ and $k_4$ is
Here, (18) may be written as
which shows that $\hat{F}_7(y)$ is more precise than $\hat{F}_4(y)$.
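The adapted estimators in this section can be sketched as plain functions of the sample quantities. The two exponential forms follow the expressions for $\hat{F}_6(y)$ and $\hat{F}_7(y)$ in the text. The ratio, product and difference forms are assumptions: they are the usual mean estimators with means replaced by distribution functions, and they agree with the stated bias of $\hat{F}_2(y)$ and the stated $k_{(opt)}$. All function names and numerical inputs are hypothetical.

```python
import math


def ratio_estimator(F_hat_y, F_hat_x, F_x):
    """Assumed ratio form: F_hat(y) * F(x) / F_hat(x)."""
    return F_hat_y * F_x / F_hat_x


def product_estimator(F_hat_y, F_hat_x, F_x):
    """Assumed product form: F_hat(y) * F_hat(x) / F(x)."""
    return F_hat_y * F_hat_x / F_x


def difference_estimator(F_hat_y, F_hat_x, F_x, k):
    """Assumed difference form: F_hat(y) + k * (F(x) - F_hat(x))."""
    return F_hat_y + k * (F_x - F_hat_x)


def difference_k_opt(F_y, F_x, V1100, V0200):
    """Optimum k as stated in the text: F(y) V1100 / (F(x) V0200)."""
    return F_y * V1100 / (F_x * V0200)


def exp_factor(F_hat_x, F_x, a, b):
    """exp{ a (F(x) - F_hat(x)) / (a (F(x) + F_hat(x)) + 2b) }."""
    return math.exp(a * (F_x - F_hat_x) / (a * (F_x + F_hat_x) + 2.0 * b))


def singh_exponential(F_hat_y, F_hat_x, F_x, a=1.0, b=0.0):
    """Ratio-type exponential estimator with known constants a and b."""
    return F_hat_y * exp_factor(F_hat_x, F_x, a, b)


def grover_kaur(F_hat_y, F_hat_x, F_x, k3, k4, a=1.0, b=0.0):
    """Generalized class: [k3 F_hat(y) + k4 (F(x) - F_hat(x))] * exp factor."""
    return (k3 * F_hat_y + k4 * (F_x - F_hat_x)) * exp_factor(F_hat_x, F_x, a, b)


# Hypothetical sample quantities, for illustration only.
F_hat_y, F_hat_x, F_x = 0.40, 0.25, 0.50
estimates = {
    "ratio": ratio_estimator(F_hat_y, F_hat_x, F_x),
    "product": product_estimator(F_hat_y, F_hat_x, F_x),
    "difference": difference_estimator(F_hat_y, F_hat_x, F_x, k=1.0),
    "exponential": singh_exponential(F_hat_y, F_hat_x, F_x),
}
```

When the sample distribution function of X equals its population value, the exponential factor is 1 and every estimator with unit weight on $\hat{F}(y)$ reduces to $\hat{F}(y)$.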

Proposed estimators in simple random sampling
The precision of an estimator increases by using suitable auxiliary information at the estimation stage. In previous studies, the sample distribution function of the auxiliary variable was used to improve the efficiencies of the existing distribution function estimators. In a recent study, Haq et al. [16] suggested using ranks of the auxiliary variable as an additional auxiliary variable to increase the precision of an estimator of the population mean. On similar lines, we use additional auxiliary information on sample means of the auxiliary and ranked-auxiliary variables along with the sample distribution function estimators of F(y) and F(x) to estimate the finite CDF. For this purpose, we propose two families of estimators for estimating F(y).
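The additional auxiliary quantities mentioned here, the ranks of the auxiliary variable and the means of X and of its ranks, can be computed as in the sketch below. Assigning average ranks to tied values is an assumption, since the text does not say how ties are handled. The data and the sample labels are hypothetical.

```python
def ranks(values):
    """Ranks 1..N of the values; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2 + 1  # mean of 1-based positions pos+1..end+1
        for j in range(pos, end + 1):
            result[order[j]] = average_rank
        pos = end + 1
    return result


def mean(values):
    return sum(values) / len(values)


# Toy auxiliary variable (hypothetical values, for illustration only).
X = [5, 3, 14, 8, 12, 4, 6, 10, 7, 5]
Z = ranks(X)

# Known population means of X and of its ranks.
X_bar, Z_bar = mean(X), mean(Z)

# Sample means for one hypothetical sample of unit labels.
sample = [0, 3, 6, 9]
x_bar = mean([X[i] for i in sample])
z_bar = mean([Z[i] for i in sample])
```

The population mean of the ranks is always (N + 1)/2, so it is known whenever the population size is known.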

First proposed family of estimators
For the first family of estimators, the sample mean of the auxiliary variable is used as an additional auxiliary variable; whilst in the second family of estimators, the sample mean of the ranked auxiliary variable is used as an additional auxiliary variable.

The simplified minimum MSE of $\hat{F}_8(y)$ at the optimum values of $k_5$, $k_6$ and $k_7$ is
where $R^2_{1.23} =$
Here, (24) may be written as
where
It can be seen that $\hat{F}_8(y)$ is more precise than $\hat{F}_4(y)$.

Second proposed family of estimators
On similar lines, the second proposed family of estimators for estimating F(y) is given by
$$\hat{F}_9(y) = \left[\cdots\ \frac{F(x) - \hat{F}(x)}{F(x)}\ \cdots\right]\exp\left(\frac{a(F(x) - \hat{F}(x))}{a(F(x) + \hat{F}(x)) + 2b}\right)$$
where $k_8$, $k_9$ and $k_{10}$ are unknown constants, and a ($\ne 0$) and b are either two real numbers or functions of known population parameters of $I(X \le x)$, like $R_{12}$, $\beta_2$ (coefficient of kurtosis), $C_2$, etc. The estimator $\hat{F}_9(y)$ can also be written as
Simplifying (27) and keeping terms only up to the second power of $e_i$'s, we can write
The bias and MSE of $\hat{F}_9(y)$, to the first order of approximation, are
$$\cdots\ F(y)V_{1100} - 2k_8k_{10}F(y)V_{1001} - \theta k_{10}F(y)V_{0101}$$
The optimum values of $k_8$, $k_9$ and $k_{10}$, determined by minimizing (29), are
The simplified minimum MSE of $\hat{F}_9(y)$ at the optimum values of $k_8$, $k_9$ and $k_{10}$ is
where $R^2_{1.24} =$
Here, (30) may be written as
where
It is clear that $\hat{F}_9(y)$ is more precise than $\hat{F}_4(y)$.
Table 1 lists some members of the Singh et al. [33], Grover and Kaur [11], and proposed families of estimators for selected choices of a and b.

12. From (13) and (31),
13. From (15) and (31),
14. From (19) and (31),
The proposed families of estimators are always more precise than the adapted estimators, as the above conditions (i)-(xiv) are always true.

Empirical study in simple random sampling
In this section, we conduct a numerical study to investigate the performances of the adapted and proposed CDF estimators. For this purpose, five populations are considered. The summary statistics of these populations are reported in Table 2. The percentage relative efficiency (PRE) of an estimator $\hat{F}_i(y)$ with respect to $\hat{F}_1(y)$ is
$$\mathrm{PRE}(\hat{F}_i(y), \hat{F}_1(y)) = \frac{\mathrm{Var}(\hat{F}_1(y))}{\mathrm{MSE}(\hat{F}_i(y))} \times 100,$$
where i = 2,3,...,9.
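The PRE comparison can be sketched in a few lines, assuming the usual definition of 100 times the variance of the baseline estimator divided by the MSE of the competing estimator. The function name and the variance and MSE values below are hypothetical and are not taken from the tables of this study.

```python
def percentage_relative_efficiency(var_baseline, mse_estimator):
    """PRE of an estimator relative to a baseline: 100 * Var(baseline) / MSE(estimator)."""
    if var_baseline < 0 or mse_estimator <= 0:
        raise ValueError("variance must be non-negative and MSE must be positive")
    return 100.0 * var_baseline / mse_estimator


# Hypothetical variance of the baseline and MSEs of three competitors.
var_F1 = 0.0040
mse_by_estimator = {"F2": 0.0032, "F4": 0.0025, "F8": 0.0020}

pre_table = {
    name: percentage_relative_efficiency(var_F1, mse)
    for name, mse in mse_by_estimator.items()
}
```

A PRE above 100 means the estimator has a smaller MSE than the baseline, so larger values indicate higher precision.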
The PREs of the distribution function estimators, computed from the five populations, are given in Tables 3-7. From these numerical results, it is observed that the PREs of all families of estimators change with the choice of a or b. It is further noted that the proposed families of estimators are more precise than the adapted distribution function estimators of Cochran [31], Murthy [32], Rao [4], Singh et al. [33] and Grover and Kaur [11], in terms of PRE. For data sets I, III, IV and V, the second proposed family of estimators performs better than the first proposed family, whereas for data set II the first proposed family performs better than the second.

Five data sets are used for numerical illustration. Samples of different sizes are selected from these populations using simple random sampling. The minimum MSEs of the proposed families of estimators are given in Eqs 25 and 31. The adapted estimators and the proposed families of estimators are then compared with each other in terms of their PRE values.

Notation in stratified random sampling
Consider a finite population O = {1,2,...,N} of N distinct units, which is divided into L homogeneous strata, where the size of the hth stratum is $N_h$, for h = 1,2,...,L, such that $\sum_{h=1}^{L} N_h = N$.
be the population (sample) means of X and Z under stratified random sampling, respectively, where

In order to obtain the biases and MSEs of the adapted and proposed estimators of F(y) under stratified random sampling, the following relative error terms are considered. Let such that $E(e_i) = 0$ for i = 1,2,3,4, where $E(\cdot)$ is the mathematical expectation of $(\cdot)$. Let

Adapted estimators in stratified random sampling
In this section, some estimators of finite population mean are adapted for estimating the finite CDF under stratified random sampling. The biases and MSEs of these adapted estimators are derived under the first order of approximation.

The traditional unbiased estimator of F(y) is
The variance of $\hat{F}^{SRS}_{st}(y)$ is
where m is an unknown constant. Here, $\hat{F}^{Reg}_{st}(y)$ is an unbiased estimator of F(y). The minimum variance of $\hat{F}^{Reg}_{st}(y)$ at the optimum value $m_{(opt)} = (F(y)\psi_{1100})/(F(x)\psi_{0200})$ is
Here, (39) may be written as
where $m_1$ and $m_2$ are unknown constants. The bias and MSE of $\hat{F}^{R,D}_{st}(y)$, to the first order of approximation, are
The optimum values of $m_1$ and $m_2$, determined by minimizing (42)
Here, (43) may be written as
The optimum values of $m_3$ and $m_4$, determined by minimizing (48)
Here, (49) may be written as
where $m_5$, $m_6$ and $m_7$ are unknown constants, and $a_{st}$ ($\ne 0$) and $b_{st}$ are either two real numbers or functions of known population parameters of $I(X \le x)$, like $R_{12}$, $\beta_{2(st)} =$
The estimator $\hat{F}^{Prop1}_{st}(y)$ can also be written as
Simplifying (52) and keeping terms only up to the second power of $z_i$'s, we can write
$\hat{F}^{Prop1}_{st}(y) - F(y)$
where $m_8$, $m_9$ and $m_{10}$ are unknown constants, and $a_{st}$ ($\ne 0$) and $b_{st}$ are either two real numbers or functions of known population parameters of $I(X \le x)$, like $R_{12}$
The estimator $\hat{F}^{Prop2}_{st}(y)$ can also be written as
Simplifying (58)
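The traditional stratified estimator that opens this section can be sketched as a weighted sum of within-stratum sample proportions. The sketch assumes the standard textbook forms: weights $W_h = N_h/N$, simple random sampling without replacement within each stratum, and the usual finite population variance formula. Function names and data are hypothetical.

```python
def stratified_cdf(samples_by_stratum, stratum_sizes, y0):
    """Weighted sum over strata of the sample proportion of values <= y0."""
    N = sum(stratum_sizes)
    estimate = 0.0
    for sample, N_h in zip(samples_by_stratum, stratum_sizes):
        W_h = N_h / N  # stratum weight
        F_hat_h = sum(1 for v in sample if v <= y0) / len(sample)
        estimate += W_h * F_hat_h
    return estimate


def stratified_cdf_variance(strata_populations, sample_sizes, y0):
    """Variance of the stratified estimator; each stratum needs N_h >= 2."""
    N = sum(len(pop) for pop in strata_populations)
    variance = 0.0
    for pop, n_h in zip(strata_populations, sample_sizes):
        N_h = len(pop)
        P_h = sum(1 for v in pop if v <= y0) / N_h
        # Population variance of the indicator with divisor N_h - 1.
        S2_h = N_h * P_h * (1.0 - P_h) / (N_h - 1)
        variance += (N_h / N) ** 2 * (1.0 / n_h - 1.0 / N_h) * S2_h
    return variance


# Two hypothetical strata of sizes 60 and 40, with small samples from each.
estimate = stratified_cdf([[10, 20, 30, 40], [15, 25]], [60, 40], y0=20)
```

The variance vanishes when every stratum is fully enumerated, which is a quick sanity check on the finite population correction.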

9. From (35) and (62),
10. From (37) and (62)
The proposed families of estimators are always more precise than the adapted estimators, as the above conditions (i)-(xiv) are always true.

Empirical study in stratified random sampling
In this section, we conduct a numerical study to investigate the performances of the adapted and proposed CDF estimators. For this purpose, four populations are considered. The summary statistics of these populations are reported in Tables 9-12. As in simple random sampling, the PRE of the ith estimator is computed with respect to the traditional estimator $\hat{F}^{SRS}_{st}(y)$, where i = 2,3,...,9.
Samples of sizes 180, 180, 140 and 140 are drawn from the four populations using stratified random sampling, with the sample size computed for each stratum h. The minimum MSEs of the proposed families of estimators are computed from Eqs 56 and 62, respectively. Tables 9-12 also report the descriptive statistics of the strata and the sample sizes.
The PREs of the distribution function estimators, computed from the four populations, are given in Tables 13-16. From these numerical results, it is observed that the PREs of all families of estimators change with the choices of a and b. It is further noted that the proposed families of estimators are more precise than the adapted distribution function estimators of Cochran [31], Murthy [32], Rao [4], Singh et al. [33] and Grover and Kaur [11], in terms of PRE. For all data sets, the second proposed family of estimators performs better than the first proposed family. The PREs rise for the choices (a = 1, b = $R_{12}$), (a = $R_{12}$, b = $C_2$) and (a = $R_{12}$, b = $\beta_{2(st)}$), and fall slightly for a = 1 and b = NF(x).

Conclusion
In this paper, we have proposed two new families of estimators for estimating the finite population distribution function under simple and stratified random sampling schemes. The proposed estimators require supplementary information about the sample mean and the ranks of the auxiliary variable. The biases and MSEs of the proposed families of estimators were derived under the first order of approximation. Based on theoretical and numerical comparative studies, it can be concluded that the proposed families of estimators are more precise than their existing counterparts. Thus, we recommend using the sample mean and the ranks of the auxiliary variable with the proposed families of estimators for estimating the finite population distribution function under simple or stratified random sampling. It would be interesting to extend the suggested estimators to two-phase and stratified two-phase sampling schemes. Furthermore, the proposed estimators could also be generalised by utilising information on multiple auxiliary variables.