Model based estimation of population total in presence of non-ignorable non-response

The problem of handling non-ignorable non-response has been typically addressed under the design-based approach using the well-known sub-sampling technique introduced by Hansen and Hurwitz [1946, Journal of the American Statistical Association, Vol 41(236), Page 517- 529]. Alternatively, the model-based paradigm emphasizes on utilizing the underlying model relationship between the outcome variable and one or more covariate(s) whose population values are known prior to the survey. This article utilizes the model relationship between the study variable and covariate(s) for handling non-ignorable non-response and obtaining an unbiased estimator for the population total under the sub-sampling technique. The main idea is to combine the estimates obtained from the sample on first call and the sub-sample from second call using separate model relationships. The contribution of this paper helps us in providing unbiased estimates with an improved efficiency under model-based paradigm in presence of non-ignorable non-response. The provided method is more economical than the available estimators under callback methods as we are working sub-sampling and also increase response rate as a stronger mode of interview is employed for data collection. A numerical study using Monte Carlo is presented to illustrate the behavior of the proposed and the efficiency comparison.


Introduction
In statistical investigations, once data collection is completed, one has to bear some, perhaps a considerable amount of non-response. Although a significant resource can be employed to improve data collection process to avoid the non-response about 20% non-response rate is commonly accepted. Item non-response occurs when one or more questions in the questionnaire are left unanswered during the survey. While a unit non-response occurs when one or more unit(s) do not response at all or are missing. Non-response in sample surveys leads to non-sampling error in estimation of the population parameters and yields biased estimates which ultimately spoils inference about the population of interest. When non-response occurs completely at random then the best way to deal with is to impute the projected values of the outcome variable corresponding to non-respondents. On contrary, when non-response factor (e.g, age, sex or/and income status etc.) is correlated with the outcome variable then the usual imputation methods fail to cope with the situation. In such situations, the population a1111111111 a1111111111 a1111111111 a1111111111 a1111111111 parameters and the behavior of the population may differ among the responding population (respondents) and the responding populations (non-respondent).
There are several approaches for checking whether there is a difference between the populations of respondents and non-respondents and evaluating potential bias due to non-response: (i) specific follow-up of non-respondents and (ii) analysis of the characteristics of respondents and non-respondents which are known prior to survey. [1] used demographic information (education, age, employment status, state of residence, field of employment etc.) to compare the respondents and the non-respondents. Information regarding non-respondents may come from previous surveys of same population (in the case of longitudinal surveys or with rotation groups) or by using some external data sources (e.g. administrative data etc.). [2] suggested a method for adjustment of non-ignorable non-response in studies involving one or more additional attempts to contact initial non-responders. [3] worked on changing in survey estimates as a function of additional calls under the similar protocol as well as under a different protocol. [4] considered the use of level-of-effort paradata to model the mechanism of non-response in surveys and for adjusting non-response bias, specially bias that is not missing at random (NMAR) or non-ignorable. The approach was based on unconditional maximum likelihood estimation model that adapted and extended the prior work to cope with the complexities encountered in large-scale surveys.
For similar situation, [5] examined whether non-participation in a census-based health study was related with poorer health status, using the Hordaland Health Study conducted in western Norway in 1997-1999. They aimed to determine whether health problems were overrepresented in nonparticipants and to explore the consequences of participation bias on relation between outcomes and exposures. Statistical techniques for dealing with non-ignorable non-response based on a propensity-to-respond score has been developed by [6] assuming both item as well as unit non-response. Moreover, [7] proposed an approach of increasing blood supply by collecting blood more frequently from the selected donors for studying the relationship between ageing the population and blood transfusion. The primary aim of their proposed INTERVAL trial was to observe whether donation intervals can be acceptably and safely decreased to optimize blood supply while maintaining the health status of donors. The health status of a cohort of 1991 Gulf War veterans was periodically assessed by [8]. They compared various health outcomes of veterans with those of their peers in military who were not posted to the Gulf. Another example in which one can make utilization of sub-sampling method can be found in [9], where missing data and incomplete randomized interventions were common. These problems complicate the analysis as well as interpretation of controlled randomized trials (CRT), and are rarely handled well in practice. [10] modeled the nonresponse probabilities as logistic functions of the survey variable and related covariates in the survey with callback. They proposed maximum likelihood semi-parametric estimators of the parameters in the response probabilities. They further proposed, an efficient estimator for the mean of the study variable using the estimated response probabilities. The method was employed to data taken from the Singapore Life Panel Survey, a survey of health spending utilizing a census-based sample of individuals 50-70 years old, assuming that non-response was related to the health status.
In real surveys, as discussed in above cited works, non-response occurrence is not missing at random (NMAR) or, in other word, it is non-ignorable. When the occurrence of nonresponse in sample survey is related to the outcome of the survey, a valid statistical inference about the target population is quiet difficult. One can make efficient utilization of the sub-sampling method instead of call back. To fill this gap, [11] introduced a well known procedure for sub-sampling (follow-up) the non-respondents. The method includes sub-sampling a portion of non-respondents from the first sample with the assumption that some stronger mode of interview is applied for the purpose of sub-sampling non-respondents, consequently, all persons give full response on second call. On the basis of sub-sampling procedure introduced by [11], many authors including [12,13], [14] and [15] worked on mean estimation under designed-based approach ignoring model relationship between the study variable and the known covariates. [16] suggested Hansen and Hurwitz [11] type estimator under Bayesian paradigm using squared error loss function (SELF). Later on [17] considered Bayesian approach of estimation under a general model using [11] technique. In survey sampling, usually one assumes the population as a finite collection of distinct and countable units. The measurements on the variable under investigation in the population are considered to be non-stochastic. The focus lies in estimation of population parameters i.e. functions of the population measurements on the study variable in the population (such as mean, total, proportion etc), which are also non-stochastic consequently. A sample is considered just as a smaller collection of population units and inference is carried out typically under the probability distribution formed by the random mechanism employed to draw the sample, which is termed as sampling design (S.D). Desirable properties of the estimators such as unbiasedness and efficiency are established by averaging out the values of the estimators over all possible samples.
While in model-based inference, a population is considered as a collection of realizations of a set of stochastic variables with a specified but unknown mean and a specified variance (usually assumed to be known). While a sample is a collection of identically distributed and independent variables for some fixed S.D. The parameters to be estimated are characteristics of the distribution of the original stochastic variables such as mean, and lower order moments, which are assumed to be constant quantities under the frequentist point of view.
Under model-based statistical inference [18] worked on estimation of a finite population mean. [19] attempted to obtain optimal model-unbiased estimators of the population mean and total using least square (LS) estimation method and the well known Gauss-Markov Theorem (GMT) assuming linear population model. [20] introduced the linear least-square prediction approach for estimation of finite population parameters under two-stage sampling. Other related works on estimation of mean and total under model-based approach can be found in [21], [22], [23], [24], [25], [26], [27] and [28]. [29] adapted mixed model prediction in small areas. Furthermore, [30] compared the model-based approach with model-assisted approach. For an updated comparison of the model-based and the designed based frameworks see [31]. A detailed review of the model-based estimation can also be found in [32]. As we already mentioned that the presence of non-response in sample surveys not only creates problem of small sample size but also spoils the inference when the behavior (underlying model relationship) of the population of respondents and non-respondents are different.
In current article, a model unbiased linear predictor for the population total in presence of non-ignorable non-response is proposed assuming unit non-response. The sub-sampling technique introduced by [11] is used to obtain samples under a fixed sampling design. We provide a revision of model-based approach for estimation of superpopulation total in Section 2. Our proposed estimator and its properties under assumed model are given in Section 3. Some shortcomings of the proposed estimation technique and their possible solutions are discussed in Section 4. A numerical study with real data set and a Monte Carlo simulation are respectively provided in Sections 5 and 6. A discussion with concluding remarks is given in Section 7.

Model-based estimation of population total
Consider a finite population of N distinct units U = {1, 2, ‥i‥, N}. Let y = (y i , i 2 U) be the vector of the realized values of a stochastic vector Y = (Y i , i 2 U) of order N × 1 and x = (x ij , i 2 U, j = 0, 1, 2, . . ., p) be a matrix of (p+1) auxiliary variables whose values are assumed to be known for every unit in U. We start with multiple linear regression model Y = xβ + �, where β = (β 0 , β 1 , . . .., β p ) T and � = (� 1 , � 2 , . . .., � N ) T be the vectors of regression coefficients and the random error terms respectively. Let s = {1, 2, 3, . . ., n} be a member of S of all possible samples of size n that can be drawn from U using some S.D. Further, the random vector of the study variable Y, the known auxiliary matrix x and the random error vector � are splited into sampled (s) and non-sampled ð� sÞ as: The population total T y (which is assumed to be random under model-based approach) is expressed as is the vector containing 1's for every units in population. For obtaining population mean W are taken as vector of 1/N for all units. Optimal values of w 0 i s are found by minimizing the prediction variance which is considered as good practice in model based approach [32]. For further statistical inference about the estimated parameter assumption of normally distributed error term is also necessary specially in case of small sample sizes. After observing y s as the realized values of Y s the problem is to predict sub-vector Y � s using the information contained in the sample and the auxiliary information through model relationship between the study variable and the auxiliary variable (s). Under linear population model, a predictor for y s which is obtained by minimizing the sum of squared residuals. The model-based estimator given in [33] iŝ Note that the total estimator given in (1) works only when error terms are iid with zero mean and constant variance [27].T y posses all the properties with respect to the model as the predictor of y � s does [18,19]. When all OLS assumptions fulfill the estimatorT y is model unbiased with the model-variance after averaging over all possible sample of same S.D.
where the subscript D is used to show that the expectation is applied with respect to S.D and I N−n is the identity matrix of order (N − n) × (N − n). Setting p = 0 the linear regression model reduces to homogeneous population model i.e. Y = x 0 β 0 + �, where x 0 is vectors of 1's. Care should be taken while selecting a suitable set of predictors which comes under the domain of variable selection (inclusion and exclusion) [34]. Moreover, when variance of the error term depends on some function of the auxiliary variable(s), weighted least square (WLS) estimator is preferred for estimating β as alternative to OLS. Moreover, if the number of regressors exceeds number of observations in the sample then ridge regression is preferred [27,35]. We discuss these problems for our proposal later in Section 4.

Model-based estimation of population total in presence of nonresponse
In voluntary surveys, a common threat to the validity of the survey estimates is the problem of non-response. Different surveys possess different response rates, the surveys that ask questions which seem interesting and relevant to the respondents are tend to achieve the highest response rates. In recent years, response rates have been declined even in popular surveys, and, as a consequence, worries about non-response bias have been increased. As we discussed in introduction section that non-response is considered as problematic only if the population of non-respondents is an informative sample of the total sample. Unfortunately, this appears almost in majority of practical applications. In household surveys, for instance, there is a lot of evidence that non-respondents are often younger than respondents, and that women are more likely to persuade to take part than men. Similarly, response rates are also tend to be lower in deprived areas than the areas with abundance of facilities. All of these examples show that the pattern of achieved samples for surveys mostly do not reflect the population that is meant to represent very well. These surveys typically may over-represent women, and the persons elder than certain age. And often under-represent those living in less developed cities and deprived areas. When values of such demographic variable(s) are known for whole target population, we can stratify the population as the respondents and the non-respondents. The problem is then to choose a variable which more accurately stratifies the population as respondents and non-respondents. Suppose that R is a stratification vector defined as R = (R i , i 2 U), where R i = 1(0) according to the ith unit belongs to the population of respondents (non-respondents). In case of missing completely at random (MCAR) non-response factor R and the study variable Y are uncorrelated and one can ignore the non-response or just apply different imputation techniques [36]. When the stratification variable R is related to the study variable Y, the model for the respondents differs from that of the non-respondents such as in above example the population models may differ among men and women, youngers and elders and deprived and settle areas. To capture this difference, we specify the model of respondents and non-respondents in the population separately according to the values of R such that where β r and β nr are the vectors of regression coefficients corresponding to the respondents the non-respondents respectively. Consequently, we get sub-populations U 1 and U 2 such that where U 1 and U 2 are the subsets of U denoting populations of respondents and non-respondents with sizes N 1 and N 2 respectively. It is assumed that the error terms are independently and identically distributed (IID) with means E m (� 1 ) = E m (� 2 ) = 0 with model variances V m ð� 1 Þ ¼ s 2 1 I N1 ; and V m ð� 2 Þ ¼ s 2 2 I N2 , where I N 1 and I N 2 are the identity matrix of order N 1 and N 2 respectively. Separation of model is straight forward when we have exact knowledge about the occurrence of non-response and a related stratification variable which is almost impossible in real world problem. As it is not possible to have such information that separates the underlying model exactly into the respondents and the non-respondents. One way to overcome this problem may be to use two phase sampling for obtaining information on stratification variable. In which we select a larger sample on first phase and observe the stratification variable (i.e. respondents are marked as respondents according to their behavior to respond the first phase survey are observe such factor which cause non-response) and estimate the proportions of units fall in sub-populations i.e. λ 1 = N 1 /N and λ 2 = N 2 /N. These information then can be used at second phase for estimating population parameters of the study variable. Before going toward our proposal, we discuss the estimation of population total without sub-sampling non-respondents which help us in knowing how the non-response creates biasedness in estimation of total.

Estimation of total without sub-sampling
For a sample s of size n assume that only n 1 units respond while remaining n 2 units don't respond. The prediction problem given in Section 2 becomes , are vectors of weights associated with n 1 respondents, N 1 − n 1 nonsampled units from responding population, and N 2 units from non-responding population is unknown and can be predicted using sample at hand and the auxiliary information for the non-responded and non-sampled values. A predictive estimator for population total based on respondents only, can be found as follow: where b 1 is the vector of OLS estimates of β 1 based on n 1 respondents. The model bias ofT y1 is See Appendix A1 for proof.T y1 is unbiased estimate of T y if the vectors of coefficients for the responding and non-responding sub-populations are same i.e. β 1 = β 2 , this is equivalent to regression imputation. This situation occurs when Behavior of the responding and the nonresponding populations are same allowing us to ignore the non-response just as reduced sample size. We obtain model mean squared error (M-MSE) of the total estimatorT y1 as The subscript m shows that expectation is applied over model. The model-mean squared error (M-MSE) given in (7) depends on random sample under designed-based point of view. Consequently, it varies with sampling fluctuations. To obtain a fix value, we apply expectation with respect to S.D.

Estimation of total with sub-sampling
As we already discussed, there are several approaches for handling the problem of nonresponse in sample literature. A suitable approach may be chosen according to the type of non-response (full or partial), the accessibility of the auxiliary variable(s) and the validity of the underlying response model for handling the problem. In general, re weighting is used to deal with full (non-availability of units) non-response. Imputation is preferably applied for dealing with partial non-response although it can be applied for full non-response if appropriate auxiliary information is available. Re-weighting eliminates or at least reduces total nonresponse bias [36,37]. While the sub-sampling method introduced by [11] provides a good adjustment for non-response bias and yield unbiased estimator for the population mean when the non-response variable R is significantly correlated with the survey outcome.
In this study, we develop a model-based estimator for population total by adjusting nonresponse using sub-sampling procedure. As the models described in (3) and (4) have different parameters it is inevitable to obtain information about both sub-populations. The sample information obtained from respondents alone leads to biased estimate for the population total of the whole population. For estimating the relationship between the study and the auxiliary variables for the population of non-respondents and estimating total, we need some information from non-respondents as well. The sampling mechanism in Section 2 is based on the respondents from first sample which don't provide any information about the population model of non-respondents. The sub-sampling introduced by [11] is the best alternative to handle such situation of non-response which assumes the mode of data collection on first round was inexpensive and then a more stronger mode of interview is employed for sub-sampling non-respondents. The rationale behind taking a sub-sample instead of following all non-respondent is the fact that taking information from all non-respondents by using stronger mode of interview increases survey cost. Sometime randomized response techniques (more expensive and complex) are applied to gather information on second call [38]. The method assumes sub-sampling � n 2 ¼ n 2 k (k > 1) units from n 2 units selected and not respond on first round, using some stronger mode of interview (face to face survey, telephonic survey etc). The estimation process covers two prediction problems (i) predicting N 1 − n 1 non-sampled units from the sample taken from the first round using model given in (3) and (ii) predicting N 2 − n 2 (non-sampled)+ n 2 − � n 2 (non-responded) units on the basis of sample obtained on second round using the model relationship given in (4). Let � s 2 be the sub-sample of size � n 2 selected from s 2 and � � s 2 ¼ U 2 À � s 2 be the set representing non-sampled values from the population of non-respondents. Now the outcome vector for respondents is further partitioned as The matrix x, the vector W and the random error vector � are also partitioned into sampled and non-sampled parts in same way.
The population total of the study character is now expressed as replacing known values of the response units, we have The first part is predicted on the basis of sample obtained on first round along with model given in (3) and the second part is predicted on the basis of sample obtained on second round and the model given in (4). Under the sub-sampling technique a linear unbiased predictor for T y iŝ are the hessian matrix for the first round sample and sub-sample respectively. The well-known GMT provides the evidence that the OLS estimators are the best linear unbiased estimators (BLUE) of the parameters β 1 and β 2 when the observations obtained on first round sample s 1 and the second round sample � � s 2 follows two different population models with independently and identically distributed error terms. In designed based point of view, the selection of sub-sample � s 2 depends on the selection of s 1 , hence the assumption of independence is no more valid. To proceed we need the assumption of model independency only. The separation of population as the respondents and the nonrespondent is based on the values of R which is already discussed in previous section. The role of the variable R is same as the role of stratification variable in stratified sampling which is merely used to separate populations into respondents and non-respondents. Hence more correlation between the non-response factor (R) and the study variable is a requirement for using the sub-sampling approach. The case of low correlation between the study variable and the non-response variable can be handled through weighting adjustment and imputation techniques discussed in literature review. However the literature of sub-sampling technique reveals that the efficiency of the sub-sampling estimator is not affected by this correlation. But in case of presence of significant correlation proceeding with just respondents on first call may produce invalid and inconsistent statistical inference.
Note that respondents on first sample always represent the responding population U 1 . While the non-respondents on first sample may or may not represent the population of the non-respondent U 2 as it depends on the degree of relationship between R and Y and the nature of occurrence of non-response (whether it is ignorable or not). The model bias ofT � y is derived in Appendix A 2, and given by T � y is model unbiased if all of the OLS assumptions are satisfied for the populations of the respondents and non-respondents. Assuming unbiasedness model variance of the total estimator under non-response is obtained as Taking expectation with respect to S.D we get

Estimation of total with sub-sampling under super-collinearity and heteroscedasticity
While applying linear regression model for predicting the non-sampled values from the population of non-respondents the number of input variables (regressors) may greatly exceeds the number of observations i.e. � n 2 < ðp þ 1Þ as we are sub-sampling a relatively small portion of non-respondents. In such situations, fitting the full model to the nonrespondents without penalization will result in wider prediction intervals, and the normal equations may not have trivial solution as the matrix H � s 2 does not possess the full rank property. It is not possible to estimate the parameters of the model when H � s 2 is singular i.e. not of full rank. This situation is called super-collinearity or ill-conditioning. The problem of super-collinearity can be solved using ridge regression. To get an estimate for β 2 , when there is super-collinearity in x 2 , we use ad-hoc fix method proposed by [40] for resolving singularity of H � s 2 . We simply replace H ¼ H � s 2 by HðvÞ ¼ H � s 2 þ vI pþ1 with v 2 [0, 1]. The scalar v is called tuning parameter or penalty parameter. A clearly defined estimator for β 2 obtained even for high-dimensional data matrix ð� n 2 � pÞ for a strictly positive v is b 2 ðvÞ ¼ HðvÞ (8), we obtain a partially ridge regression (PRR) estimator (as the concept of ridge regression is used for non-responding part only) for population total which is given bŷ The expressions for model-bias and expected model-MSE of the PRR estimator of the total in presence of non-response are obtained by replacing H(v) by H in (10) and (11). Following [41] a range for v in which the model-MSE of x � � s 2 b 2 ðvÞ is smaller than the model-variance of where ψ 2 is the minimum eigen-value of the matrix ðH � s 2 Þ À 1 À . PRR is also applicable for predicting non-sampled respondents when n 1 < (p + 1) leading to super-collinearity in the respondents.
Another major problem that arises in estimation of so called superpopulation parameters is the violation of assumption of homoscedasticity is violated. In presence of heteroscedasticity one has

units specific variances for respondents and non-respondents respectively. Here
, where x 1i and x 2i are the vectors of the auxiliary variables corresponding to the ith unit in respondents and non-respondents respectively. In such situations, OLS estimators for the regression coefficients may have higher variances. If we have information about the variance structure for the populations of respondents and non-respondents (assuming zero correlation between the units), we can adopt weighted least square (WLS) method of estimation. The WLS estimators of β 1 and β 2 The sub-matrices are also diagonal assuming zero correlation between the error terms corresponding to the respobndents and the non-respondents. A WLS estimator for T y in presence of non-response is obtained by replacing b 1wls and b 2wls by b 1 and b 2 respectively in (8). It is assumed that the variance structures of the responding and non-responding population are known and depend on covariates whose values are known for each population unit. In practice, for many types of data set, the structure of weights (inverse of variance) is usually unknown, so one has to perform an ordinary least squares (OLS) regression first to estimate the variance structure and obtain estimates for the population regression coefficients after performing an iterative process which is commonly known as generalized least square (GLS).

Application
A real data set taken from [42] is applied to investigate the behavior of our proposed modelbased estimator. The data set is given as supporting information S1 Data 7 to this paper. The data consist of 748 blood donors on following variables: y = Monetary total blood donated in c.c., x 1 = Time (months since first donation), x 2 = Recency (months since last donation) and x 3 = Frequency (total number of donation). Considering the above 748 blood donors as our population of interest, we select a sample of size 100 using simple random sampling without replacement. The scatter plot matrix between the variables in the sample selected on first call and the sub-sample collected on the second call represents the relationship between the variables in the population of respondents and nonrespondents for response rates λ 2 = 0.4 (Figs 1 and 2) and λ 2 = 0.4 (Figs 3 and 4). Fig 1 shows the relation between the study variable y and the predictors x 1 , x 2 , and x 3 for the sampled respondents which shows that the study variable y, is highly related to x 3 and moderately related to x 1 but weakly related to x 2 . Fig 2 portrays the relationship between variables for the sub-sampled non-respondents which is different from the relationship in Fig 1 which shows the relevancy of the data to our proposed sampling mechanism. One can observe the similar relationship between the variables for λ = 0.2 from upper triangle of Figs 3 and 4. Hence our proposal works here as the relationship between the total monetary blood donated and its three determinants have different relationship for the population of the respondents and the sub-population of the non-respondents which is the main assumption of our data collection mechanism. We select half (k = 2) of the non-respondent selected on first call for sub-sampling on second call.  Further to see the magnitude of the prediction error, we provide a bootstrap sampling procedure taking different non-response rate (say λ 2 ) in the population. We generate a new variable R associated with each 748 cases which posses value 1 if the ith unit has an outcome greater than the λ 2 th percentile of all the y values in the data set otherwise zero.
1. A sample of size n (for n = 100, 200) is taken from the data using simple random sampling without replacement and divide it them into the respondents and the non-respondents according to the value of R and observe n 1 and n 2 .
2. Select a sub-sample of size � n 2 ¼ n 2 k (taking k = 2,4) from n 2 non-respondents again using simple random sampling without replacement and compute the estimator using information obtain from first and second samples. We take p = 2 to avoid the problem of super-collinearity in our situation.

Repeat
Step 2, 2000 times to get expected value from the sub-sampling. The sub-sampling does not alter results ofT 1 as it is based on sample from respondents only.
4. Repeat Steps 1-3, 5000 times to obtain a stable value of prediction variance and bias for both estimators.
The prediction bias and variances are computed as follows: The RB and RMSE for the [11]-type estimator for the population total are obtained by replacingT y1 byT � y in Eqs (16) and (17) respectively. Table 1 provides relative bias (RB) and relative mean squared error (RMSE) of the total estimator based on the sample on first call for different combinations of n, λ 2 and k. Model based estimation in non-response The results in Table 1 are reported assuming non-response rate λ 2 at 50%, 25%, and 10%. RB of both estimators go to zero as non-response rate falls toward zero which assures that for full response it vanishes while the sub-sampling method produce ignorable bias as compared to direct method which is the attractive feature of this method. Further from Table 1, one can observe that RMSE is smaller in case of sub-sampling non-respondents, i.e. taking interview of additional non-respondents through some stronger mode of interview, for every choices of λ 2 . RB and RMSE ofT � y tend to increase with decrease in non-response rate in the population which shows that our proposed technique works well for higher non-response rates as compared to lower smaller ones. RB and RMSE of the model based total estimator go down while increasing sub-sample size � n 2 (decreasing k) as expected. Further, this error decreases when population has smaller non-response rate λ 2 . In upcoming section, we provide a simulation study to provide a detailed picture of the performance of estimators in terms of design bias and mean squared error.

Simulation study
To see the long run behavior of the proposed estimators in terms of bias and efficiency, a simulation study, generating a hypothetical population, is conducted. Following [43], a matrix z = (z ij , i = 1, 2, 3, . . ., N, j = 1, 2, . . ., p) with p variate each generated from N(100, 1), has been constructed with N = 10,000 observations. The ijth element of the auxiliary matrix x is computed as x ij = (1 − ρ) 0.5 × z ij + ρ × z ij , where ρ is the degree of linear relationship between x and z to be fixed in advance. The vector of the study variable (y) is then obtained by using the relationship y = xγ + �, where γ is the vector of coefficients which are computed as the averaged eigen vectors corresponding to the eigen values of H = x T x that are greater than unity and � * N(0, σ 2 I N ) is randomly generated error term. It is assumed that the variance is of homoscedastic nature with constant diagonal σ 2 . We fix σ 2 at 0.01, 0.1 and 1. The data consist of (y, 1 N , x, R i ), where 1 N is the vector of 1's. R i takes value 1 if the ith value of variable y falls in a threshold lower than (1 − λ 2 )th quantile in the population, where λ 2 is non-response rate in the population. In real life, we suggest to choose R in form of some observable covariates or latent variables. The simulation study is conducted in following three steps.
• Take a random sample of size n from the population generated through the mechanism described above and split it into n 1 respondents and n 2 non-respondents according to the values of R i .
• Select a sub-sample of size � n 2 from n 2 non-respondents for fix k.
• Estimate the population total (T y ) using estimated models from samples obtained on Steps 1 and 2.
• Simulate Steps 2-3 500 times and average the values of estimates.
• Repeat Steps 1-4 2000 times to obtain prediction errors to obtain 2000 estimated values.
The bias (B) and mean squared error (MSE) of the proposed total estimators are computed using the formula given in Eqs (16) and (17) respectively after removing the denominators as the generated values are already standardized. The subscript v is used for the results where prediction is performed using PRR.
Tables 2-4 provide the bias of the PPR estimator and mean squared error of both estimators for different combinations of σ 2 , λ 2 , ρ, n and k in nested order. We obtain results for p = 5 and p = 8 but the result for p = 5 is not reported here for the sake of space. Tables 2, 3 and 4 provide the prediction error measures (B and MSE) for σ 2 = 0.01, σ 2 = 0.1 and σ 2 = 1 respectively. From Tables 2-4, one can see that the bias of the PRR total estimator tends to increase with increase in k. This implies selecting a smaller sub-sample increases the bias in estimation due to sampling error although this bias depends on the magnitude of the tuning parameter v. MSE of the total estimator under multiple regression and PRR both increase with increase in k which shows that MSE of the estimators grows with smaller sub-samples from non-respondents. The PRR total estimator is more sensitive to the change in k, in terms of MSE, as the optimum value of the tuning parameter v is estimated from sub-sample. In practice v might be computed using data available from previous surveys of the same population or through expert judgment. The estimation methods of v by minimization of prediction error are available in [43]. Moreover, whatever model we use for prediction, the MSE values of the total estimators depend on the sample size of respondents and sub-sample of non-respondents. The simulated results are provided for sample size 100, 150 and 250 with sub-sample size inversely proportional to k = 1.5, k = 2 and k = 3. It can be noticed that MSE values are increasing with increase in k. Comparing two portions of Tables 2-4, we observe that the MSE of proposed estimators fall when non-response rate increases which conflicts the efficiency property of the [11] estimator. The reason is the use of separate models and increasing λ 2 from 0.2 to 0.4 implies(i.e. we are using Model (2) for 40% of the data) which is the main contribution of our proposal in terms of increased precision. Apart from the design parameters, the data generating process also effects the efficiency of the total estimator which can be seen from three different columnpanels (for three different choices of the parameter ρ) assuming that the correlation between the variables X and Z are same for all choices of j of Tables 2-4.

Conclusion
This article is concerned with utilization of model relationship between the outcome variable and one or more covariate(s) for efficient estimation of population total of the outcome variable in surveys with non-ignorable non-response. A model based version of [11] sub-sampling technique is suggested which assumes that the responding and non-responding population have different models. This assumption may hold for majority of real world situations where the occurrence of non-response is observable like a stratification variable. In public health surveys the non-response occurrence is based on the gender, ethical affiliation, age and other demographic factors of the respondents. In such situations, respondents and non-respondents may have different models. The method assumes that a stratification variable is available to divide the population into respondents and non-respondents which is difficult to obtain in most of real surveys although a two phase sampling method can provide a better stratification variable to divide the population into respondents and non-respondents. It is shown that under linear population model (linear in parameter as well as in variables) the total estimator with sub-sampling is model-unbiased and has smaller model-variance as compared to predictive estimator based on sampled respondents only. The linearity assumption emphasizes on linear in parameters but not restricted to the linearity in variable. Polynomial regression models are also useful for handling non-response in demographic surveys using age as the predictor. The problem of non-response can be well handled using polynomial regression models which is an open area to work in future. While sub-sampling non-respondents the number of observations may become smaller than the number of regressors included in the model leading to problem of super-collinearity. To cope with super-collinearity problem, we suggest a version of ridge regression named, called PRR, for predicting the non-sampled non-respondents. WLS and GLS are suggested for obtaining estimates of the regression coefficients for respondents and non-respondents when error terms for at least one model is of heteroscedasticity nature.
To confirm mathematical expressions a numerical study with blood transfusion data has been carried out. The suggested method is applicable to telephonic or web household surveys where households are first contacted with email or telephone call and then non-respondents are followed via face to face surveys where it seems logical to select a sub-sample of non-respondents through more expensive mode (face to face).