Estimation of finite population distribution function with dual use of auxiliary information under non-response

In this paper, we propose two new families of estimators for estimating the finite population distribution function in the presence of non-response under simple random sampling. The proposed estimators require information on the sample distribution functions of the study and auxiliary variables, and additional information on either the sample mean or the ranks of the auxiliary variable. We consider two situations of non-response: (i) non-response on both the study and auxiliary variables, and (ii) non-response only on the study variable. The performance of the proposed estimators is compared with that of the existing estimators available in the literature, both theoretically and numerically. The proposed estimators are found to be more precise than the adapted distribution function estimators in terms of the percentage relative efficiency.


Introduction
One of the common problems in sample surveys is non-response, and the resulting bias is a serious concern in survey studies. Non-response arises for many reasons, including language barriers, illness, refusal or non-acceptance, questionnaires sent to a wrong return address or received by another person, etc. Sometimes, sample survey experts use auxiliary information to improve the precision of estimators. As expected, non-response not only reduces the precision of an estimator but also increases its bias. A number of research articles have been published on the estimation of the population mean under non-response in order to control the non-response bias and to increase the efficiency of estimators. Hansen and Hurwitz [1] suggested that a sub-sample of the initial non-respondents be re-contacted using a more expensive method. They made the first attempt by mail questionnaire and the second by personal interview. They were also the first to develop an estimator of the population mean in the presence of non-response. For some related works on non-response, we refer to Rao [2], Khare and Srivastava [3], Khare and Sinha [4,5], Olufadi and Kumar [6], Muneer et al. [7], Pal and Singh [8], Ahmad and Shabbir [9] and the references cited therein.
The problem of estimating the finite population cumulative distribution function (CDF) arises when the interest lies in knowing the proportion of values of the study variable that are less than or equal to a certain value. There are different situations where estimating the CDF is deemed necessary. For example, an economist may want to know whether 27% or more of the Pakistani population lacks skills. Similarly, a soil scientist may be interested in estimating the distribution of clay percent in the soil. In addition, policy-makers may be interested in knowing the proportion of people in a developing country who live below the poverty line.
In the survey sampling literature, several authors have estimated the CDF using information on one or more auxiliary variables. Chambers and Dunstan [10] suggested an estimator of the CDF that requires information on both the study and auxiliary variables. On similar lines, Rao et al. [11] and Rao [12] proposed ratio and difference/regression estimators of the CDF under a general sampling design. Kuk [13] suggested a kernel method for estimating the CDF using auxiliary information. Ahmed and Abu-Dayyeh [14] estimated the CDF using information on multiple auxiliary variables. Chen and Wu [15] suggested a method for estimating the CDF and quantiles using the model-calibrated pseudo empirical likelihood. Rueda et al. [16] used a calibration approach to devise an estimator of the CDF. Singh et al. [17] considered the problem of estimating the CDF and quantiles with the use of auxiliary information at the estimation stage of a survey. Moreover, Chen et al. [18] investigated the injury severities of truck drivers in single- and multi-vehicle accidents on rural highways, Zeng et al. [19] worked on a multivariate random-parameters tobit model for analyzing highway crash rates by injury severity, and Yaqub and Shabbir [20] considered a generalized class of estimators for estimating the CDF in the presence of non-response. Dong et al. [21] investigated the differences between single-vehicle and multi-vehicle accident probability using a mixed logit model, Chen et al. [22] analyzed hourly crash likelihood using an unbalanced panel data mixed logit model and real-time driving environmental big data, Zeng et al. [23] jointly modeled area-level crash rates by severity, and Zeng et al. [24] used spatial joint analysis of zonal daytime and nighttime crash frequencies with a Bayesian bivariate conditional autoregressive model. Hussain et al. [25] proposed two new families of estimators of the population distribution function under non-response with simple random sampling, using supplementary information on the auxiliary variable and the exponential function, and Hussain et al. [26] proposed two new families of estimators of the finite population distribution function under simple and stratified random sampling schemes, using supplementary information on the distribution function, mean and ranks of the auxiliary variable.
The issue of non-response in survey sampling threatens to undermine the validity of inferences drawn from estimates based on those surveys. High non-response rates increase the risk of bias in estimates and affect survey design, data collection, estimation and analysis (Plewes et al. [27]). With these issues in mind, we propose two new families of estimators for estimating the CDF using information on the sample distribution function, mean and ranks of the auxiliary variable along with the information on the sample distribution function of the study variable under simple random sampling in the presence of non-response. The biases and mean squared errors (MSEs) of the existing and proposed estimators of the CDF are derived under the first order of approximation. The theoretical and numerical comparisons reveal that the proposed estimators are more precise than the existing adapted estimators when estimating the CDF.
The rest of the paper is organized as follows. In Section 2, some notations are given. In Section 3, we adapt some estimators of the finite population mean for estimating the finite CDF. The proposed estimators are given in Section 4. In Sections 5 and 6, theoretical and numerical comparisons are conducted, respectively. Finally, conclusions are drawn in Section 7.

Notations
A sample of $n$ units is drawn from the population using simple random sampling without replacement. It is assumed that out of the $n$ units, $n_1$ units respond but $n_2 = n - n_1$ units do not respond. Clearly, the $n_1$ and $n_2$ units belong to the respondent and non-respondent groups, respectively. Moreover, a sub-sample of $r = n_2/k$ units, where $k > 1$, is drawn from the $n_2$ units using simple random sampling without replacement, and this time the response is obtained from all $r$ units.
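The sampling scheme above can be illustrated with a minimal sketch. It assumes that the target is F(y), the population mean of the indicator I(Y_i ≤ y), that every followed-up unit responds, and that the two sample distribution functions are combined with weights n_1/n and n_2/n in the manner of Hansen and Hurwitz [1]. The function name, its arguments and the simulated population are hypothetical and serve only as an illustration.

```python
import random

def hansen_hurwitz_cdf(y_pop, responds, y0, n, k, rng):
    """Hansen-Hurwitz-type estimate of F(y0) under non-response (illustrative sketch)."""
    units = rng.sample(range(len(y_pop)), n)            # SRSWOR of n units
    resp = [i for i in units if responds[i]]            # n1 respondents
    nonresp = [i for i in units if not responds[i]]     # n2 non-respondents
    n1, n2 = len(resp), len(nonresp)

    def edf(idx):
        # Sample distribution function: mean of I(Y_i <= y0) over the given units
        return sum(y_pop[i] <= y0 for i in idx) / len(idx)

    if n2 == 0:
        return edf(resp)
    r = max(1, round(n2 / k))                           # sub-sample size r = n2 / k
    sub = rng.sample(nonresp, r)                        # followed-up non-respondents
    f1 = edf(resp) if n1 > 0 else 0.0
    return (n1 / n) * f1 + (n2 / n) * edf(sub)

# Assumed toy population: the last 25% of units are non-respondents.
rng = random.Random(1)
N = 400
y_pop = [rng.gauss(50, 10) for _ in range(N)]
responds = [i < 300 for i in range(N)]
true_F = sum(v <= 50 for v in y_pop) / N
est = [hansen_hurwitz_cdf(y_pop, responds, 50, n=80, k=2, rng=rng) for _ in range(2000)]
mc_mean = sum(est) / len(est)   # should lie close to true_F if the estimator is unbiased
```

Averaging over repeated samples gives a rough check of unbiasedness: the Monte Carlo mean should be close to the true value of F(y) in the simulated population.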
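Throughout, the subscripts 1, 2, 3 and 4 refer to $I(Y \le y)$, $I(X \le x)$, $X$ and $Z$ (the ranks of $X$), respectively, and the subscript $(2)$ denotes the non-response group of size $N_2$.
$S_3^2 = \sum_{i=1}^{N}(X_i - \bar{X})^2/(N-1)$: the population variance of $X$,
$S_{3(2)}^2 = \sum_{i=1}^{N_2}(X_i - \bar{X}_{(2)})^2/(N_2-1)$: the population variance of $X$ for non-response group,
$S_{4(2)}^2 = \sum_{i=1}^{N_2}(Z_i - \bar{Z}_{(2)})^2/(N_2-1)$: the population variance of $Z$ for non-response group,
$C_1 = S_1/F(y)$: the population coefficient of variation of $I(Y \le y)$,
$C_2 = S_2/F(x)$: the population coefficient of variation of $I(X \le x)$,
$C_{1(2)} = S_{1(2)}/F_{(2)}(y)$: the population coefficient of variation of $I(Y \le y)$ for non-response group,
$C_{2(2)} = S_{2(2)}/F_{(2)}(x)$: the population coefficient of variation of $I(X \le x)$ for non-response group,
$C_{3(2)} = S_{3(2)}/\bar{X}_{(2)}$: the population coefficient of variation of $X$ for non-response group,
$C_{4(2)} = S_{4(2)}/\bar{Z}_{(2)}$: the population coefficient of variation of $Z$ for non-response group,
$S_{12(2)} = \sum_{i=1}^{N_2}\{(I(Y_i \le y) - F_{(2)}(y))(I(X_i \le x) - F_{(2)}(x))\}/(N_2-1)$: the population covariance between $I(Y \le y)$ and $I(X \le x)$ for non-response group,
$S_{13(2)} = \sum_{i=1}^{N_2}\{(I(Y_i \le y) - F_{(2)}(y))(X_i - \bar{X}_{(2)})\}/(N_2-1)$: the population covariance between $I(Y \le y)$ and $X$ for non-response group,
$S_{23(2)} = \sum_{i=1}^{N_2}\{(I(X_i \le x) - F_{(2)}(x))(X_i - \bar{X}_{(2)})\}/(N_2-1)$: the population covariance between $I(X \le x)$ and $X$ for non-response group,
$S_{14(2)} = \sum_{i=1}^{N_2}\{(I(Y_i \le y) - F_{(2)}(y))(Z_i - \bar{Z}_{(2)})\}/(N_2-1)$: the population covariance between $I(Y \le y)$ and $Z$ for non-response group,
$r_{12(2)} = S_{12(2)}/(S_{1(2)}S_{2(2)})$: the population correlation coefficient between $I(Y \le y)$ and $I(X \le x)$ for non-response group,
$r_{13(2)} = S_{13(2)}/(S_{1(2)}S_{3(2)})$: the population correlation coefficient between $I(Y \le y)$ and $X$ for non-response group,
$r_{23(2)} = S_{23(2)}/(S_{2(2)}S_{3(2)})$: the population correlation coefficient between $I(X \le x)$ and $X$ for non-response group,
$r_{14(2)} = S_{14(2)}/(S_{1(2)}S_{4(2)})$: the population correlation coefficient between $I(Y \le y)$ and $Z$ for non-response group,
$r_{24(2)} = S_{24(2)}/(S_{2(2)}S_{4(2)})$: the population correlation coefficient between $I(X \le x)$ and $Z$ for non-response group.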
Yaqub and Shabbir [20] suggested an unbiased estimator of F(y) under non-response.
In order to obtain the biases and MSEs of the existing and proposed estimators, relative error terms such as $e_1^{*}$, based on $\hat{F}^{*}(y) - F(y)$, are considered, where $E(\cdot)$ stands for the mathematical expectation of $(\cdot)$. The quantities $Q_{rstu}$ used below carry the indices $r,s,t,u = 1,2,3,4$. Here,
$r^2_{1.23} = (r_{12}^2 + r_{13}^2 - 2r_{12}r_{13}r_{23})/(1 - r_{23}^2)$: the coefficient of multiple determination of $I(Y \le y)$ on $I(X \le x)$ and $X$ with situation-I under non-response,
$r^2_{1.23(2)} = (r_{12(2)}^2 + r_{13(2)}^2 - 2r_{12(2)}r_{13(2)}r_{23(2)})/(1 - r_{23(2)}^2)$: the coefficient of multiple determination of $I(Y \le y)$ on $I(X \le x)$ and $X$ with situation-II under non-response,
$r^2_{1.24} = (r_{12}^2 + r_{14}^2 - 2r_{12}r_{14}r_{24})/(1 - r_{24}^2)$: the coefficient of multiple determination of $I(Y \le y)$ on $I(X \le x)$ and $Z$ with situation-I under non-response,
$r^2_{1.24(2)} = (r_{12(2)}^2 + r_{14(2)}^2 - 2r_{12(2)}r_{14(2)}r_{24(2)})/(1 - r_{24(2)}^2)$: the coefficient of multiple determination of $I(Y \le y)$ on $I(X \le x)$ and $Z$ with situation-II under non-response.
Under non-response, two situations are considered. Situation-I refers to non-response on both the study and auxiliary variables, while situation-II refers to non-response only on the study variable. For notational convenience, we follow the notations given in Table 1.

Adapted estimators
In this section, some estimators of finite population mean are adapted for estimating the finite CDF under non-response with simple random sampling. Moreover, the biases and MSEs of these adapted estimators are derived under the first order of approximation.
1. Cochran [28] adapted ratio estimator of F(y), denoted $\hat{F}_2(y)$. The bias of $\hat{F}_2(y)$, to the first order of approximation, is $\mathrm{Bias}(\hat{F}_2(y)) \cong F(y)(Q_{0200} - Q_{1100})$; its MSE is obtained to the same order.
2. Murthy [29] adapted product estimator of F(y), denoted $\hat{F}_3(y)$. Its bias and MSE are obtained to the first order of approximation.
3. The adapted difference estimator of F(y), denoted $\hat{F}_4(y)$, involves an unknown constant $k$. Here, $\hat{F}_4(y)$ is an unbiased estimator of $F(y)$. Its variance is minimized at the optimum value $k_{(opt)} = (F(y)Q_{1100})/(F(x)Q_{0200})$.
4. Rao [30] adapted difference-type estimator of F(y), denoted $\hat{F}_5(y)$, involves unknown constants $k_1$ and $k_2$. Its bias and MSE are obtained to the first order of approximation. The optimum values of $k_1$ and $k_2$ are determined by minimizing the MSE, which gives the minimum MSE of $\hat{F}_5(y)$.
5. The estimator $\hat{F}_6(y)$, adapted from Singh et al. [31], involves known constants $a$ and $b$. Its bias and MSE, to the first order of approximation, are expressed in terms of $\theta = aF(x)/(aF(x)+b)$.
6. The estimator $\hat{F}_7(y)$, adapted from Grover and Kaur [32], involves unknown constants $k_3$ and $k_4$. The simplified minimum MSE of $\hat{F}_7(y)$ at the optimum values of $k_3$ and $k_4$ shows that $\hat{F}_7(y)$ is more precise than $\hat{F}_4(y)$.
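The first-order results quoted for the ratio and difference estimators can be retraced with a short derivation. The sketch below assumes relative errors $e_1 = (\hat{F}(y) - F(y))/F(y)$ and $e_2 = (\hat{F}(x) - F(x))/F(x)$ with zero expectation, and $Q_{2000} = E(e_1^2)$, $Q_{0200} = E(e_2^2)$, $Q_{1100} = E(e_1 e_2)$. Here $\hat{F}$ stands for whichever sample distribution function estimator applies in the situation at hand, and the estimator forms are the usual ratio and difference forms, assumed for illustration rather than quoted from the paper.

```latex
% Assumed notation: e_1, e_2 are relative errors with E(e_1) = E(e_2) = 0.
\begin{align*}
\hat{F}_2(y) &= \hat{F}(y)\,\frac{F(x)}{\hat{F}(x)}
  = F(y)(1+e_1)(1+e_2)^{-1}
  \cong F(y)\left(1 + e_1 - e_2 + e_2^2 - e_1 e_2\right), \\
\mathrm{Bias}(\hat{F}_2(y)) &\cong F(y)\left(Q_{0200} - Q_{1100}\right), \\
\mathrm{MSE}(\hat{F}_2(y)) &\cong F^2(y)\left(Q_{2000} + Q_{0200} - 2Q_{1100}\right), \\[4pt]
\hat{F}_4(y) &= \hat{F}(y) + k\left(F(x) - \hat{F}(x)\right), \\
\mathrm{Var}(\hat{F}_4(y)) &= F^2(y)Q_{2000} + k^2 F^2(x) Q_{0200} - 2k\,F(y)F(x)\,Q_{1100}, \\
\frac{\partial\,\mathrm{Var}}{\partial k} = 0 \;&\Rightarrow\; k_{(opt)} = \frac{F(y)\,Q_{1100}}{F(x)\,Q_{0200}}, \\
\mathrm{Var}_{\min}(\hat{F}_4(y)) &= F^2(y)\left(Q_{2000} - \frac{Q_{1100}^2}{Q_{0200}}\right).
\end{align*}
```

The bias of the ratio estimator and the optimum $k$ recovered in this way agree with the expressions stated in the text, which serves as a consistency check on the assumed notation.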

Proposed estimators
The precision of an estimator improves when appropriate auxiliary information is used at the estimation stage. In previous studies, the sample distribution function of the auxiliary variable was used to improve the efficiency of the existing distribution function estimators. In a recent study, Haq et al. [33] suggested using the ranks of the auxiliary variable as an additional auxiliary variable to increase the precision of an estimator of the population mean.
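To make the idea of a ranked-auxiliary variable concrete, the following sketch computes the ranks of the auxiliary variable and the three auxiliary summaries discussed here: the sample distribution function of X at a point, the sample mean of X, and the sample mean of the ranks. It assumes that ranks are assigned over the whole population and that ties receive average ranks; both choices, as well as the function names, are illustrative assumptions.

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions (assumed tie rule)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    z = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # mean of the 1-based positions i+1 .. j+1
        for t in range(i, j + 1):
            z[order[t]] = avg
        i = j + 1
    return z

def auxiliary_summaries(x_pop, sample_idx, x0):
    """Return (sample distribution function of X at x0, sample mean of X, sample mean of ranks)."""
    z_pop = average_ranks(x_pop)       # Z_i: rank of X_i within the population
    n = len(sample_idx)
    f_x = sum(x_pop[i] <= x0 for i in sample_idx) / n
    x_bar = sum(x_pop[i] for i in sample_idx) / n
    z_bar = sum(z_pop[i] for i in sample_idx) / n
    return f_x, x_bar, z_bar

x_pop = [30, 10, 20, 20]
z_pop = average_ranks(x_pop)                                      # [4.0, 1.0, 2.5, 2.5]
summaries = auxiliary_summaries(x_pop, sample_idx=[0, 2], x0=20)  # (0.5, 25.0, 3.25)
```

Because the ranks are a monotone transformation of X, they carry information that is correlated with the study variable but is less sensitive to extreme values of X, which is one plausible reason for using them alongside the sample distribution function.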
On similar lines, we use additional auxiliary information on the sample means of the auxiliary and ranked-auxiliary variables, along with the sample distribution function estimators of F(y) and F(x), to estimate the finite CDF. For this purpose, we propose two families of estimators for estimating F(y). The first family uses the sample mean of the auxiliary variable as an additional auxiliary variable, and the second family uses the sample mean of the ranked-auxiliary variable. On the lines of $\hat{F}_5(y)$ and $\hat{F}_6(y)$, the first proposed family of estimators for estimating F(y) is denoted $\hat{F}_8(y)$, where $k_5$, $k_6$ and $k_7$ are unknown constants, and $a$ ($\neq 0$) and $b$ are either two real numbers or functions of known population parameters of $I(X \le x)$, like $R_{12}$, $\beta_2$ (coefficient of kurtosis), $C_2$, etc. The estimator $\hat{F}_8(y)$ also admits an equivalent alternative expression.
Table 2 lists some members of the Singh et al. [31], Grover and Kaur [32] and proposed families of estimators for selected choices of a and b.

Empirical study
In this section, we conduct a numerical study to assess the performance of the adapted and proposed distribution function estimators. For this purpose, four populations are considered. The summary statistics of these populations are reported in Tables 3-6. The percentage relative efficiency (PRE) of an estimator $\hat{F}_i(y)$ with respect to $\hat{F}_H(y)$ is $\mathrm{PRE} = \dfrac{\mathrm{Var}(\hat{F}_H(y))}{\mathrm{MSE}(\hat{F}_i(y))} \times 100$, where i = 2,3,...,9.
The PREs of the distribution function estimators, computed from the four populations, are given in Tables 7-14.
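The PRE compares the variance of the reference estimator with the MSE of a competing estimator, scaled by 100, so that values above 100 favour the competitor. The paper evaluates it from theoretical first-order expressions; as a hedged alternative, the sketch below computes an empirical PRE from replicated estimates and a known true value. The helper names and the toy numbers are hypothetical.

```python
def empirical_mse(estimates, truth):
    """Mean squared error of replicated estimates around the known true value."""
    return sum((e - truth) ** 2 for e in estimates) / len(estimates)

def percent_relative_efficiency(ref_estimates, new_estimates, truth):
    """PRE of the new estimator with respect to the reference estimator."""
    return 100.0 * empirical_mse(ref_estimates, truth) / empirical_mse(new_estimates, truth)

# Toy replicates (assumed values): the new estimator varies half as much as the reference.
ref = [0.40, 0.60, 0.50, 0.50]
new = [0.45, 0.55, 0.50, 0.50]
pre = percent_relative_efficiency(ref, new, truth=0.50)   # approximately 400
```

Halving the deviations divides the MSE by four, so the PRE in this toy case is about 400, meaning the competing estimator is roughly four times as efficient as the reference.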
Population I - Gujarati [34]
Y: The eggs produced in 1990 (millions) and X: The price per dozen (cents) in 1991. The last 25% of the units in the population are treated as the non-response units.

Population II - Gujarati [34]
Y: The eggs produced in 1990 (millions) and X: The price per dozen (cents) in 1990. The last 25% of the units in the population are treated as the non-response units.

Population III - Singh [35]
Y: Duration of sleep of persons with age more than 50 years and X: The age of persons in years. The last 25% of the units in the population are treated as the non-response units.

Population IV - Singh [35]
Y: The estimated number of fish caught by marine recreational fishermen in year 1995 and X: The estimated number of fish caught by marine recreational fishermen in year 199. The last 25% of the units in the population are treated as the non-response units.
As mentioned above, we used four data sets for numerical illustration. We also considered different sample sizes from these populations. We then used simple random sampling and evaluated the minimum MSEs of the proposed families of estimators given in Eqs 28 and 21. The proposed families of estimators and the adapted estimators were compared with each other with respect to their PREs. The PREs are presented in Tables 7-14 for both situations of non-response, and the summary statistics of the populations are given in Tables 3-6. The numerical results in Tables 7-14 show that the PREs of all families of estimators change with the choices of a and b. It is further noted that the proposed families of estimators are more precise than the adapted distribution function estimators of Hansen and Hurwitz [1], Cochran [28], Murthy [29], Rao [30], Singh et al. [31] and Grover and Kaur [32] in terms of the PRE. For some data sets, the second proposed family of estimators performs better than the first proposed family under both situations of non-response, while for other data sets the first proposed family performs better than the second under both situations.

Conclusion
In this paper, we have proposed two new families of estimators for estimating the finite population distribution function in the presence of non-response under simple random sampling. The proposed estimators require information on the sample distribution function of the study variable and the distribution function of the auxiliary variable, with additional information on either the sample mean or the ranks of the auxiliary variable. The bias and MSE of the proposed families of estimators were derived under the first order of approximation. Based on the theoretical and numerical comparisons (four populations were considered in the numerical investigation), the proposed families of estimators turned out to be more precise than the adapted estimators under both situations of non-response. Thus, we recommend the use of the proposed families of estimators over the existing estimators considered in this paper in new surveys for estimating the finite population distribution function in the presence of non-response.