Optimum second call imputation in PPS sampling

The current study deals with the imputation of item non-response in probability proportional to size (PPS) sampling. A new imputation procedure is proposed for a quantitative sensitive study variable. It uses the known covariance between the study variable and the auxiliary variable and accounts for non-response to a randomization mechanism at the second call. An empirical study is conducted at the optimum values of $k_{og}$ and $n_{og}$ to compare the ratio, difference, and proposed estimators with the Hansen-Hurwitz estimator.


Introduction
Survey sampling is a technique utilized in almost every field of life to estimate finite population parameters from a limited number of responses. There are many sample selection procedures that provide reliable data by selecting a representative sample. In equal probability sampling schemes, the probability of selection is equal for all units in the target population. If the units vary in size, equal probability sampling may not give appropriate importance to the large or small units in the population. Appropriate importance is assigned by allocating unequal probabilities of selection to the different units in the population. Thus, when units differ in size and the variable under study is correlated with auxiliary information such as size, the selection probabilities may be assigned in proportion to the sizes. For example:
1. Colleges with a large number of educational departments are likely to have more students and more faculty members. For the allocation of funds, it may well be desirable to adopt a scheme in which colleges are selected with probabilities proportional to their numbers of students or departments.
2. In an industrial survey, the number of workers may be taken as the size of an industrial area.
3. In biological studies, the number of patients may be selected according to the size of the hospital.
In all of these cases, the sampling units are selected in proportion to the size of the auxiliary information associated with each unit; this is called sampling with probability proportional to size (PPS). It is well known that the proper use of auxiliary information at the estimation stage, the design stage, or both helps to improve the performance of the resulting estimators. Ratio, product, and regression estimators are good examples in this context. In many real-life situations, non-response or refusals may affect the reliability and accuracy of data sets. These refusals occur for many reasons, such as the time of the survey (summer or winter vacations, office hours, etc.), survey contents (embarrassing questions, double-barreled questions, etc.), respondent burden (irrelevant questions, length of the questionnaire, etc.), or data collection methods (telephone or mail surveys, personal interviews, etc.).
Initially, [1] introduced the idea of sub-sampling the non-respondents of the first call by dividing the population into two strata: respondents and non-respondents at the first call. A detailed discussion of the estimator proposed in [1] is given in Subsections 1.1 and 1.2 for the simple random and PPS sampling schemes, respectively.

Sample selection in simple random sampling
Let $\Omega$ be a finite population of $N$ units. Let $y_i$ and $(x_i, z_i)$ be the values of the study variable $(y)$ and the auxiliary variables $(x, z)$, respectively, for $i = 1, 2, \ldots, N$. Assume that $x_i$ has a high positive correlation and $z_i$ a low positive correlation with the study variable $y_i$. So, $x_i$ is used at the estimation stage and $z_i$ is used at the sample selection stage. Let a sample $I = (I_1, I_2, \ldots, I_n)$ of size $n$ be selected using the simple random sampling without replacement (SRSWOR) scheme. Assume that $n_{1s}$ units respond at the first call and report their responses $y_{i(1)}$, and that $n_{2s}$ units do not respond at the first call. Further, a sample of size $r_{1s} = n_{2s}/k$, where $k > 1$, is drawn from the $n_{2s}$ non-respondents; these units report their responses $y_{i(2)}$ and belong to group $G_1$. The remaining $r_{2s} = r_{1s}(k - 1)$ units, who refuse to report their responses, belong to group $G_2$.
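Thus, the sub-sampling estimate of the population mean is given by

$$\bar{y}^{*} = w_{1s}\,\bar{y}_{(1)} + w_{2s}\,\bar{y}_{(2)}, \tag{1}$$

where $w_{1s} = n_{1s}/n$, $w_{2s} = n_{2s}/n$, $\bar{y}_{(1)} = \frac{1}{n_{1s}}\sum_{i=1}^{n_{1s}} y_{i(1)}$, and $\bar{y}_{(2)} = \frac{1}{r_{1s}}\sum_{i=1}^{r_{1s}} y_{i(2)}$ is the mean of the $r_{1s}$ respondents at the second call. The variance of $\bar{y}^{*}$ is given by (2).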
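As a minimal sketch of the sub-sampling estimate, assuming the usual two-part form in which the first-call mean and the second-call subsample mean are weighted by $n_{1s}/n$ and $n_{2s}/n$, the Python snippet below works through a toy example. The function name `hansen_hurwitz_mean` and all data values are hypothetical and serve illustration only.

```python
import random
from statistics import mean

def hansen_hurwitz_mean(first_call, second_call, n):
    """Weighted mean: (n1/n) * first-call mean + (n2/n) * second-call subsample mean."""
    n1 = len(first_call)
    n2 = n - n1                      # non-respondents at the first call
    return (n1 / n) * mean(first_call) + (n2 / n) * mean(second_call)

# Hypothetical toy data (illustration only).
first_call = [12.0, 15.0, 11.0, 14.0, 13.0, 15.0]   # n1 = 6 first-call respondents
nonrespondents = [20.0, 22.0, 18.0, 24.0]           # n2 = 4, values unseen until recontact
k = 2                                               # inverse sub-sampling rate, k > 1
r1 = len(nonrespondents) // k                       # r1 = n2 / k units are recontacted

rng = random.Random(0)
second_call = rng.sample(nonrespondents, r1)        # SRSWOR subsample of non-respondents
estimate = hansen_hurwitz_mean(first_call, second_call, n=10)
print(round(estimate, 3))
```

Because the second-call subsample is weighted by the full non-respondent share $n_{2s}/n$, each recontacted unit effectively stands in for $k$ non-respondents.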

Selection of sample with PPS sampling
In the PPS sampling scheme, units are selected into the sample with probability proportional to a given measure of size, where the size is measured by suitable available auxiliary information. Let a sample $s = (s_1, s_2, s_3, \ldots, s_n)$ of size $n$ be selected using the PPS with replacement sampling scheme. Assume that $n_1$ units respond at the first call and report their responses $u_{i(1)} = y_{i(1)}/z_i$, and that $n_2$ units do not respond at the first call. Further, a sample of size $r_1 = n_2/k$ is drawn from the $n_2$ non-respondents; these units report their responses $u_{i(2)} = y_{i(2)}/z_i$ and belong to group $G_1$. The remaining $r_2 = r_1(k - 1)$ units, who refuse to report their responses, belong to group $G_2$. Thus, the Hansen-Hurwitz estimator under the PPS sampling scheme can be modified as in (3), where $n_1$ and $r_1$ are the PPS respondent units at the first and second calls, respectively. The variance of $\bar{u}^{*}$ is given by (4).
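The selection step can be illustrated with a short Python sketch. It is not the authors' code. It assumes that unit $i$ is drawn with probability $p_i = \text{size}_i / \sum_j \text{size}_j$ and that the transformed response is $u_i = y_i/(N p_i)$, which is one common way to read $u_i = y_i/z_i$ when $z_i$ is a normalized size measure. The function names and the data are hypothetical.

```python
import random

def pps_wr_sample(sizes, n, rng):
    """Draw n unit indices with replacement, with probability proportional to size."""
    return rng.choices(range(len(sizes)), weights=sizes, k=n)

def pps_mean_estimate(y, sizes, sample):
    """Average of u_i = y_i / (N * p_i) over the draws, with p_i = size_i / sum(sizes)."""
    N = len(sizes)
    total_size = sum(sizes)
    u = [y[i] * total_size / (N * sizes[i]) for i in sample]
    return sum(u) / len(u)

# Hypothetical population of four units (illustration only).
sizes = [10, 20, 30, 40]            # auxiliary size measure
y = [21.0, 38.0, 62.0, 79.0]        # study variable, roughly proportional to size

rng = random.Random(1)
sample = pps_wr_sample(sizes, n=3, rng=rng)
estimate = pps_mean_estimate(y, sizes, sample)
print(sample, round(estimate, 2))
```

When the study variable is exactly proportional to the size measure, every $u_i$ equals the population mean, so the estimate has zero variance. This is why PPS selection pays off when $y$ and the size measure are strongly related.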

Statement of the problem
When the variables of interest are sensitive or embarrassing in nature, respondents are reluctant to report their true responses or may refuse to respond. Several statistical models are available in the literature to protect the confidentiality and privacy of interviewees by hiding their identities, which helps to reduce the non-response bias. The pioneering idea of the randomized response technique (RRT) was described by [2] to handle the high rate of refusals due to the sensitive nature of questions. Such refusals commonly occur in the analysis of demographic and economic variables. Interested readers are referred to [3][4][5][6][7][8][9], and many others. [10,11] use randomized response models (RRMs) to obtain the true status of the interviewee on the second attempt. The estimators proposed by these researchers can perform better than the traditional ones. The aim of this investigation is to study values missing completely at random (MCAR) at the second call, when the interviewees are reluctant to use RRMs. For the non-respondents of the first call, different additive, multiplicative, and subtractive models might be utilized to make respondents feel that their privacy is secured along with their truthful response.
To create a sense of privacy protection among the non-respondents of the first call, we modify the linear randomized response model proposed by [12]. From the $n_2$ non-respondents of the first call, the scrambled response is obtained using the model of [12].

Privacy protection at second call.
Let the $i$th respondent draw two cards, $S_{1i}$ and $S_{2i}$, from two independent decks of cards, say $D_1$ and $D_2$, respectively, which are uncorrelated with $y$. At the second call, the $i$th respondent can report the scrambled response as in (5). Let $E_3$ and $V_3$ be, respectively, the expected value and variance over the scrambling device, under which the moments of $S_{1i}$ and $S_{2i}$ are assumed known.

PLOS ONE
Also, let $\hat{u}_{i(2)}$ be a suitable transformation of the randomized response for the $i$th unit whose expectation under model (5) coincides with the true response $y_i$. At the second call, out of the $n_2$ non-respondents of the first call, only $r_1$ interviewees give their scrambled responses, and the remaining $r_2$ units give neither their true nor their scrambled responses.
Let $\bar{\hat{u}}_{(r_1)} = \frac{1}{r_1}\sum_{i=1}^{r_1}\hat{u}_{i(r_1)}$ be the sample mean of the respondent class at the second attempt.
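To illustrate how a two-deck scrambling device can hide a response while still allowing an unbiased transformation, the Python sketch below assumes a linear scrambling of the form $Z_i = S_{1i}\,y_i + S_{2i}$ with known deck means $\theta_1 = E(S_{1i}) \neq 0$ and $\theta_2 = E(S_{2i})$. This specific form, the deck contents, and the function names are assumptions made for illustration; the exact model (5) used in the paper is not reproduced here.

```python
import random
from itertools import product
from statistics import mean

# Hypothetical decks (illustration only); both are independent of the true value y.
DECK_1 = [0.8, 0.9, 1.0, 1.1, 1.2]      # multiplicative scrambling cards
DECK_2 = [-2.0, -1.0, 0.0, 1.0, 2.0]    # additive scrambling cards
THETA_1 = mean(DECK_1)                  # known mean of deck 1
THETA_2 = mean(DECK_2)                  # known mean of deck 2

def scramble(y, rng):
    """Respondent side: draw one card from each deck and report S1 * y + S2."""
    s1 = rng.choice(DECK_1)
    s2 = rng.choice(DECK_2)
    return s1 * y + s2

def unscramble(z):
    """Analyst side: transform the report so that its expectation equals y."""
    return (z - THETA_2) / THETA_1

y_true = 50.0
report = scramble(y_true, random.Random(0))   # a single masked report

# Average the transformed report over every equally likely card pair.
all_reports = [s1 * y_true + s2 for s1, s2 in product(DECK_1, DECK_2)]
expected_value = mean(unscramble(z) for z in all_reports)
print(round(expected_value, 6))
```

The interviewer sees only the scrambled report, never the cards, so an individual answer stays masked while the average of the transformed reports remains centered on the true value.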

Modifying existing literature
In this section, we modify the existing literature in line with the statement of the problem. The most commonly used imputation procedures are discussed in Subsections 2.1, 2.2, and 2.3.

Mean estimator
In this section, our focus is to impute the missing $r_2$ values using the conventional method of imputation. Under the resulting missing structure, the whole population is divided into the strata $\Omega_{(1)}$ and $\Omega_{(2)}$ having $N_1$ and $N_2$ units, respectively. Furthermore, $\Omega_{(2)}$ is divided into two groups $G_1$ and $G_2$ of $R_1$ and $R_2$ units, respectively, where $N_1$, $N_2$, $R_1$, and $R_2$ are known in advance. For the case of scrambled responses at the second call, the Hansen-Hurwitz point estimator of the population mean $(\bar{Y})$ can be modified as in (9). So, we have the following lemmas.

Lemma 2.1
The variance of $\bar{\hat{u}}_{(r_1)}$ is obtained as follows.
Proof. Let $E_j$ and $V_j$, $j = 1, 2$, be the expected values and variances for given $n_2$ and $r_1$, respectively. The result then follows from the definition of variance.
Corollary 2.1.1. It is important to note that $V\{\bar{\hat{u}}_{(r_1)}\}$ requires the second moment ($\mu_{2u}$) of $y$, which is generally unknown. [13] suggested two possible ways to acquire $\mu_{2u}$: (i) guess it from prior information or a pilot survey, and (ii) obtain a sample estimate of $\mu_{2u}$, keeping in mind the sensitive nature of $u_i$.
Lemma 2.2. The variance of $\bar{\hat{u}}^{*}$ is obtained as follows.
Proof. Let $E_m$ and $V_m$, $m = 4, 5$, be the expected values and variances for given $N_1$ and $N_2$, respectively. Applying the definition of variance and ignoring the correction factor $\left(1 - \frac{n}{N}\right)$ for ease of computation, we obtain (14).
Comparing (4) and (14), we see that the variance of the modified estimator is higher than that of the Hansen-Hurwitz estimator. It means that $\bar{\hat{u}}^{*}$ is less efficient than $\bar{u}^{*}$. The objective of our study is to increase trust and confidence among interviewees that their privacy is secure along with their true answers. Moreover, non-response at the first call might occur due to non-availability or inability to provide the required information. Therefore, at the second call, those people may be willing to report their responses directly, even when sensitive characteristics are investigated. For this purpose, the randomization in stages should be re-expounded as an optional randomized response (ORR) procedure, which permits respondents to divulge the direct or true response without using RRT, as given by (15), where

$$t_i = \begin{cases} 1 & \text{if the } i\text{th respondent reports the direct response,} \\ 0 & \text{otherwise.} \end{cases}$$

It is easy to show that the unbiased estimator of $\bar{Y}$ is derived by substituting (15) in (9), and that its variance involves $(1 - t_i)\phi_i$ instead of $\phi_i$ in (14). Furthermore, ORR reduces the variance and privacy at various values of $t_i$ for the non-respondents at the first call.

Ratio estimator
Initially, [14] took into account the utility of auxiliary information at the estimation stage by defining the ratio estimator. The traditional ratio estimator can be modified for the imputation of missing scrambled responses at the second call. The modification yields a point estimator for the sub-population $\Omega_{(2)}$, the Hansen-Hurwitz ratio estimator of the population mean $(\bar{Y})$, and the variance of the modified ratio estimator.
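For orientation, the conventional ratio method of imputation can be sketched in Python as follows. This is the textbook form, in which each missing value is replaced by the observed ratio of totals times the unit's auxiliary value. It is an assumption that the modified scheme starts from this form, and the names `ratio_impute`, `u_observed`, `v_observed`, and `v_missing` are hypothetical.

```python
def ratio_impute(u_observed, v_observed, v_missing):
    """Impute each missing u by b * v, where b = sum(u_observed) / sum(v_observed)."""
    b = sum(u_observed) / sum(v_observed)
    return [b * v for v in v_missing]

# Hypothetical second-call data (illustration only).
u_observed = [10.0, 20.0]     # transformed responses from the r1 responding units
v_observed = [5.0, 10.0]      # auxiliary values of the responding units
v_missing = [4.0, 6.0]        # auxiliary values of the r2 units with missing responses

imputed = ratio_impute(u_observed, v_observed, v_missing)
completed_mean = (sum(u_observed) + sum(imputed)) / (len(u_observed) + len(imputed))
print(imputed, completed_mean)
```

The mean of the completed data equals the respondent mean multiplied by the ratio of the overall auxiliary mean to the respondent auxiliary mean, which is the familiar ratio-estimator form.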

Difference estimator
Now, we consider the difference estimator for explaining the missing structure of scrambled responses, in which $d$ is an unknown constant. It yields a point estimator of the sub-population mean of $\Omega_{(2)}$ and a combined version of the modified Hansen-Hurwitz estimator, $\bar{\hat{u}}^{*}_{(d)}$. At the optimum value of $d$, which depends on $\rho_{uv(2)}$, the variance of $\bar{\hat{u}}^{*}_{(d)}$ reduces to an expression involving $C^{2}_{\hat{d}(2)} = C^{2}_{u}\left(1 - \rho^{2}_{uv(2)}\right)$.
The problem of estimating population parameters using higher order moments of the auxiliary variable was considered by [15][16][17]. Later, [18][19][20], among others, also considered known higher order moments of the auxiliary variable for the estimation of finite population parameters. In the theory of survey sampling, it is a well-established result that the use of higher order moments of the auxiliary variable plays a pivotal role in estimating the finite population mean of the study variable. This literature inspired the researchers to impute the missing values at the second call using the known covariance between the study variable and the auxiliary variable.

Proposed imputation procedure
Initially, [21] improved the conventional mean estimator by using a tuning constant $\alpha_{(s)}$ in the case of missing values, which leads to a Searls-type estimator of $\bar{u}_{(2)}$. Searls's approach uses the known coefficient of variation to increase the efficiency of the estimation procedure. The optimum value of $\alpha_{(s)}$ depends on $C_{u(2)}$, $C_{v(2)}$, and $\rho_{uv(2)}$, which are stable quantities. The stability of these constants has been explored by numerous researchers, such as [22][23][24]. Therefore, the present investigation searches for an optimum imputation method that uses the covariance between the study and auxiliary variables. The imputation of item non-response is defined with constants $\alpha_1$, $\alpha_2$, and $\alpha_3$, which are suitably chosen and determined by minimizing the resultant mean square error. From this imputation, the point estimator of the population mean and the modified version of the Hansen-Hurwitz difference estimator, $\bar{\hat{u}}^{*}_{(p)}$, are defined. The variance of $\bar{\hat{u}}^{*}_{(p)}$ is given by (30), in terms of the moments $\mu_{ab(2)}$. The optimum values of $\alpha_j$, $j = 1, 2, 3$, are obtained by minimizing (30).
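The role of a tuning constant can be seen in the classical Searls setting, sketched below in Python. The sketch assumes the simplest case, a scaled sample mean $\alpha\,\bar{u}$ with known coefficient of variation $C$ and no finite population correction, for which the MSE-minimizing constant is $\alpha = 1/(1 + C^2/n)$. It does not reproduce the paper's $\alpha_{(s)}$, which also involves the auxiliary variable, and the function names are hypothetical.

```python
def searls_alpha(cv, n):
    """MSE-optimal multiplier for the sample mean when the coefficient of variation is known."""
    return 1.0 / (1.0 + cv ** 2 / n)

def mse_scaled_mean(alpha, mu, cv, n):
    """MSE of alpha * (sample mean): alpha^2 * variance + squared bias."""
    variance = (mu * cv) ** 2 / n
    bias = (alpha - 1.0) * mu
    return alpha ** 2 * variance + bias ** 2

# Hypothetical values (illustration only).
mu, cv, n = 10.0, 1.0, 4
alpha = searls_alpha(cv, n)
mse_unscaled = mse_scaled_mean(1.0, mu, cv, n)    # ordinary sample mean
mse_optimal = mse_scaled_mean(alpha, mu, cv, n)   # Searls-type scaled mean
print(alpha, mse_unscaled, mse_optimal)
```

The scaled mean accepts a small bias in exchange for a larger reduction in variance, which is the same trade-off the constants $\alpha_1$, $\alpha_2$, and $\alpha_3$ are chosen to optimize.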

Remark 1.
The second term in $V(\bar{\hat{u}}^{*}_{(p)})_{\min}$ vanishes if $k = 1$. This happens when each non-respondent of the first call is interviewed at the second call.

Choice of sampling fractions
We shall deduce the optimum values of $k$ and $n$ that minimize the variance at a specified cost. The cost function for the proposed model is based on the following four components:
1. $C_0$ = overhead cost.
2. $C_1$ = per unit cost of collecting the response by mail inquiry at the first call.
3. $C_2$ = per unit cost of obtaining the scrambled response from the non-respondent group of the first call.
4. $C_3$ = per unit cost of editing, processing, or imputing the missing $r_2$ values.
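As a rough illustration of how the four components might combine, the Python sketch below assumes a linear cost $C_0 + C_1 n + C_2 r_1 + C_3 r_2$ and replaces $n_2$ by its expected value $n W_2$, where $W_2$ denotes the first-call non-response rate. Both the linear form and the symbol $W_2$ are assumptions made for illustration and need not match the paper's cost function. All function names and numbers are hypothetical.

```python
def total_cost(c0, c1, c2, c3, n, r1, r2):
    """Assumed linear cost: overhead + first call + scrambled second call + imputation."""
    return c0 + c1 * n + c2 * r1 + c3 * r2

def expected_cost(c0, c1, c2, c3, n, k, w2):
    """Expected cost when n2 is replaced by n * w2, with r1 = n2 / k and r2 = r1 * (k - 1)."""
    n2 = n * w2
    r1 = n2 / k
    r2 = r1 * (k - 1)
    return total_cost(c0, c1, c2, c3, n, r1, r2)

# Hypothetical unit costs and design values (illustration only).
cost_k2 = expected_cost(c0=100.0, c1=1.0, c2=5.0, c3=2.0, n=200, k=2, w2=0.4)
cost_k1 = expected_cost(c0=100.0, c1=1.0, c2=5.0, c3=2.0, n=200, k=1, w2=0.4)
print(cost_k2, cost_k1)
```

Under this assumed form, a larger $k$ shifts units from the costlier scrambled second call to imputation, which lowers the expected cost whenever $C_3 < C_2$ but leaves more values to be imputed.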
The cost function is given by (33). Note that $C^{*}$ is the total cost, which varies from sample to sample. So, we use the expected cost (34), obtained by applying the expectation to (33). So, we have the following lemma.
Lemma 4.1. The optimum values of $k$ and $n$ for the minimum expected cost, $k_{ov}$ and $n_{ov}$, are given by (35) and (36), respectively, where $g = c, d,$ and $p$.
Proof. Let the variance $V(\bar{\hat{u}}^{*}_{(g)})$ be fixed at $V_0$, i.e., $V(\bar{\hat{u}}^{*}_{(g)}) = V_0$. The Lagrange function is then given by (37), where $\xi$ is a Lagrange multiplier. Differentiating (37) with respect to $n$, setting $\partial L/\partial n = 0$, and ignoring $\delta_j$, $j = 1, 2$, gives (38). Using the approximation of $V(\bar{\hat{u}}^{*}_{(g)})_{\min}$ in (39) and substituting (38) in (39) gives $\sqrt{\xi}$ in (40). Substituting (40) in (38) gives (41), which is the required optimum sample size ($n_{ov}$). Now, we differentiate (37) with respect to $k$ and set $\partial L/\partial k = 0$, which gives (42). Using (38) in (42) gives the required optimum value of $k$, namely $k_{ov}$.
For the numerical comparison, we consider hypothetical values of the unknown constants, where $r$ is the assumed response rate, which is 40% of $n_{og}$. The optimum values of the relative efficiencies (R.Es) of $\bar{\hat{u}}^{*}_{(g)}$ are given in Table 1. Table 1 shows the optimum values of $k$, $n$, and R.E(j) for the estimators, i.e., the modified ratio and difference estimators. Under this hypothetical population, the modified estimators $\bar{\hat{u}}^{*}_{(g)}$ perform better than the traditional Hansen-Hurwitz estimator $(\bar{u}^{*})$. We also observe that the optimum value of $n_{og}$ is approximately similar for all $\bar{\hat{u}}^{*}_{(p)}$, so the optimum sample size $n_{op}$ is used for the relative comparison between the existing and proposed imputation estimators.
From Table 1, we observe the following relationships between $C_2$, $C_3$, $r_1$, $V_o$, $k_{op}$, $n_{op}$, and R.E(j).
1. The values of $V_o$ and $n_{op}$ have an inverse relationship with $C_2$ and $C_3$.
2. $C_2$ and $C_3$ have a positive relationship with R.E(j). As the costs of scrambling the response and of imputation increase, the relative efficiencies of $\bar{\hat{u}}^{*}_{(g)}$ improve significantly.
From the numerical findings, we conclude that the proposed imputation procedure at the second call performs better than the existing and traditional Hansen-Hurwitz estimators at various values of $C_2$, $C_3$, $r_1$, and $V_o$.

Conclusion
The problem of non-response bias in a sensitive quantitative study variable has been reduced by sub-sampling the non-respondents, viz. the Hansen and Hurwitz (1946) procedure. A new imputation mechanism has been defined using the known covariance between the study variable and the auxiliary variable. The optimum sample size is also derived for a given set of unit costs ($C_q$, $q = 0, 1, 2, 3$), $r_1$, and $V_o$. From Table 1, the proposed imputation method outperforms the ratio, difference, and Hansen-Hurwitz estimators.
When the processing, editing, or imputing cost per unit is high, the proposed imputation strategy performs better than its counterparts. Our proposed imputation procedure is also useful when there are serious concerns about non-response bias or refusals, due to the sensitive nature of the study variable, that are difficult to ignore.
Supporting information
S1 Code. In this research, a hypothetical data set is used, which can easily be regenerated at the given parameter values with the help of available statistical software.