Abstract
The current study deals with the imputation of item non-response in probability proportional to size (PPS) sampling. A new imputation procedure is proposed for a quantitative sensitive study variable. It uses the known covariance between the study variable and the auxiliary variable and accounts for non-response to the randomization mechanism on the second call. An empirical study is conducted at the optimum values of kog and nog to compare the ratio, difference, and proposed estimators with the Hansen-Hurwitz estimator.
Citation: Sohil F, Sohail MU, Shabbir J (2022) Optimum second call imputation in PPS sampling. PLoS ONE 17(1): e0261834. https://doi.org/10.1371/journal.pone.0261834
Editor: Dejan Dragan, Univerza v Mariboru, SLOVENIA
Received: June 19, 2021; Accepted: December 12, 2021; Published: January 21, 2022
Copyright: © 2022 Sohil et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: In this research, a hypothetical data set is used which can be easily regenerated at the given value of parameters with the help of available statistical software. The parameters are included in the paper and its Supporting information files.
Funding: The authors received no specific funding for this work.
Competing interests: The authors have declared that no competing interests exist.
1 Introduction
Survey sampling is a technique that is utilized in almost every field of life to estimate finite population parameters with limited response. There are many sample selection procedures, which provide reliable data by selecting a representative sample. In equal probability sampling schemes, the probability of selection is equal for all the units in the target population. If units vary in size, equal probability sampling may not give appropriate importance to large or small units in the population. Appropriate importance is assigned by allocating unequal probabilities of selection to the different units in the population. Thus, when units differ in size and the variable under study is correlated with auxiliary information such as size, the selection probabilities may be assigned in proportion to their sizes. For example,
- Colleges with a large number of educational departments are likely to have more students and more faculty members. For funds allocation, it may well be desirable to adopt a selection scheme in which colleges are selected with probabilities proportional to their numbers of students or departments.
- In an industrial survey, the number of workers may be taken as the size measure of an industrial area.
- In biological studies, the number of patients may be selected according to the size of the hospital.
In all of these cases, the sampling units are selected with probability proportional to the size of the auxiliary information associated with each unit; this is called sampling with probability proportional to size (PPS). It is well known that the proper use of auxiliary information at the estimation stage, at the design stage, or at both stages helps improve the performance of the resulting estimators. Ratio, product, and regression estimators are good examples in this context.
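As a minimal sketch of PPS selection (not taken from the paper's supporting code), the following Python snippet draws a with-replacement sample with probabilities proportional to a size measure and averages y_i/(N p_i) over the drawn units. The function names and the toy data are illustrative assumptions.

```python
import random

def pps_sample(sizes, n, rng):
    """Draw n unit indices with replacement, with probability proportional to size."""
    return rng.choices(range(len(sizes)), weights=sizes, k=n)

def pps_mean_estimate(y, sizes, sample):
    """Average of y_i / (N * p_i) over the sampled units, where p_i = size_i / total size."""
    N = len(y)
    total = sum(sizes)
    return sum(y[i] / (N * sizes[i] / total) for i in sample) / len(sample)

# Toy population (hypothetical): y is exactly proportional to size,
# so every draw contributes the same value and the estimate equals the true mean.
sizes = [10, 20, 30, 40]
y = [2.0 * s for s in sizes]
rng = random.Random(1)
sample = pps_sample(sizes, n=3, rng=rng)
estimate = pps_mean_estimate(y, sizes, sample)
true_mean = sum(y) / len(y)
```

When the study variable is only approximately proportional to the size measure, the individual terms y_i/(N p_i) vary little, which is why PPS selection tends to give a small variance in that situation.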
In many real-life situations, non-response or refusals may affect the reliability and accuracy of data sets. These refusals occur for many reasons, such as the time of the survey (during summer or winter vacations, office hours, etc.), survey contents (embarrassing questions, double-barreled questions, etc.), respondent burden (irrelevant questions, length of questionnaire, etc.), or data collection methods (telephone or mail surveys, personal interviews, etc.).
Initially, [1] introduced the idea of sub-sampling the non-respondents of the first call by dividing the population into two strata: respondents and non-respondents at first call. A detailed discussion of this estimator is given in Subsections 1.1 and 1.2 for the simple random and PPS sampling schemes, respectively.
1.1 Sample selection in simple random sampling
Let Ω = {Ω1, Ω2, Ω3, ⋯, ΩN} be a finite population of N units. Let yi and (xi, zi) be the values of the study variable (y) and the auxiliary variables (x, z), respectively, for i = 1, 2, ⋯, N. Assume that xi has a high positive correlation and zi a low positive correlation with the study variable (yi). So, xi is used at the estimation stage and zi is used at the sample selection stage. Let a sample {(ℑ = ℑ1, ℑ2, ⋯, ℑn)} of size n be selected using the simple random sampling without replacement (SRSWOR) scheme. Assume that n1s units respond at first call and report their responses yi(1), and that n2s units do not respond at first call. Further, a sample of size r1s = n2s/k, where k > 1, is drawn from the n2s non-respondents; these units report their responses yi(2) and belong to group G1, while the remaining r2s = r1s(k − 1) units refuse to respond and belong to group G2.
Thus, the sub-sampling estimate for population mean, is given by
(1)
where
,
,
,
and r be the respondents. The variance of
is given by
(2)
where
,
,
,
,
,
, and
.
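For orientation, the classical Hansen and Hurwitz [1] sub-sampling estimator under simple random sampling is usually written as below. The notation is the common textbook one and is an assumption here; it may differ from the symbols used in (1) and (2).

```latex
\bar{y}^{*} = \frac{n_{1}\,\bar{y}_{1} + n_{2}\,\bar{y}_{2r}}{n},
\qquad
V(\bar{y}^{*}) = \left(\frac{1}{n} - \frac{1}{N}\right) S_{y}^{2}
  + \frac{W_{2}\,(k-1)}{n}\, S_{y(2)}^{2},
\qquad
W_{2} = \frac{N_{2}}{N}.
```

Here \bar{y}_{1} is the mean of the n_1 first-call respondents, \bar{y}_{2r} is the mean of the r = n_2/k sub-sampled non-respondents, S_y^2 is the population variance of y, and S_{y(2)}^2 is the variance of y within the non-response stratum. The second variance term is the price of sub-sampling and disappears when k = 1.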
1.2 Selection of sample with PPS sampling
In the PPS sampling scheme, units are selected with probability proportional to a given measure of size, where the size is measured by suitable available auxiliary information. Let ui = yi/(Nπi) and vi = xi/(Nπi), where and also let
and
be the unbiased estimators of population means and their variances are
and
, respectively, where
. It is also assumed that the average value of ui is approximately equal to average value of yi.
Let a sample {si = (s1, s2, s3, ⋯, sn)} of size n be selected using PPS with replacement sampling scheme. Assume that n1 units respond at first call, report their responses ui(1) = yi(1)/(N1 π1i), where and n2 units do not respond at first call. Further, a sample of size,
, is drawn from n2 non-respondent group, report their responses ui(2) = yi(2)/(N2π2i), where
, belong to group G1, and r2 = r1(k − 1) are those who refuse to report their responses and belong to group G2. Thus, the Hansen-Hurwitz estimator under the PPS sampling scheme can be modified as:
(3)
where n1 and r1 are the PPS respondent units at first and second calls, respectively. The variance of
is given by
(4)
where
,
,
, and
.
1.3 Statement of the problem
When variables of interest are sensitive or embarrassing in nature, respondents are reluctant to report their true responses or may refuse to respond. Several statistical models are available in the literature to protect the confidentiality and privacy of interviewees by hiding their identities, which helps reduce the non-response bias. A pioneering idea of the randomized response technique (RRT) was described by [2] to handle the high rate of refusals due to the sensitive nature of questions. Commonly, these refusals occur in surveys of demographic and economic variables. Interested readers are referred to [3–9], among many others. [10, 11] use randomized response models (RRMs) to obtain the true status of interviewees on the second attempt. The estimators proposed by these researchers can perform better than the traditional ones.
The aim of this investigation is to study values that are missing completely at random (MCAR) at the second call, when the interviewees are reluctant to use RRMs. For the non-respondents of the first call, different additive, multiplicative, and subtractive models might be utilized to assure respondents that their privacy is protected even when they respond truthfully.
To create this feeling of privacy protection among non-respondents of the first call, we modify the linear randomized response model proposed by [12]. From the n2 non-respondents of the first call, the scrambled response is obtained using the model of [12].
1.3.1 Privacy protection at second call.
Let the ith respondent draw two cards, i.e., S1i and S2i, from two independent decks of cards, say D1 and D2, respectively, which are uncorrelated with y. At the second call, the ith respondent can report the scrambled response as follows:
(5)
Let E3 and V3 be, respectively, the expected value and variance over the scrambled device. We assume that E3(S1i) = θ1, E3(S2i) = θ2, and
with
and
. Also let
be the suitable transformation of randomized response for the ith unit whose expectation under (5) model coincides with the true response yi, as:
(6)
with
(7)
where
and
.
At the second call, out of the n2 non-respondents of the first call, only r1 interviewees give scrambled responses; the remaining r2 units give neither true nor scrambled responses. Let be the sample mean of the respondent class at the second attempt.
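The privacy mechanism of this subsection can be sketched as follows. Because the exact form of (5) is not reproduced in this text, the snippet assumes a linear scrambling z = S1·y + S2, for which (z − θ2)/θ1 has expectation y over the device. The normal scrambling distributions and all names are hypothetical.

```python
import random

def scramble(y, s1, s2):
    """Linear scrambling (assumed form): the respondent reports z = s1 * y + s2."""
    return s1 * y + s2

def unscramble(z, theta1, theta2):
    """Transformation with expectation y over the device when E(S1)=theta1, E(S2)=theta2."""
    return (z - theta2) / theta1

# Hypothetical scrambling distributions: S1 ~ Normal(1, 0.1), S2 ~ Normal(0, 2).
theta1, theta2 = 1.0, 0.0
rng = random.Random(42)
y_true = 50.0
draws = [
    unscramble(scramble(y_true, rng.gauss(theta1, 0.1), rng.gauss(theta2, 2.0)),
               theta1, theta2)
    for _ in range(20000)
]
mean_recovered = sum(draws) / len(draws)  # close to y_true on average
```

Any single transformed response is noisy, which is what protects the respondent; only the average over many responses recovers the true value.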
2 Modifying existing literature
In this section, we modify the existing literature as per the statement of the problem. The most commonly used imputation procedures are discussed in Subsections 2.1, 2.2, and 2.3.
2.1 Mean estimator
In this section, our focus is to impute the missing r2 values by using a conventional method of imputation. The missing structure is defined as follows:
(8)
Hence, the whole population is divided into the strata Ω(1) and Ω(2), having N1 and N2 units, respectively. Furthermore, Ω(2) is divided into two groups G1 and G2 of R1 and R2 units, respectively, where N1, N2, R1 and R2 are known in advance. For the case of scrambled responses at second call, the Hansen-Hurwitz point estimator of the population mean can be modified as:
(9)
So, we have the following Lemmas.
Lemma 2.1 The variance of , is given by
(10)
Proof. Let Ej and Vj, j = 1, 2, be the expected values and variances for given n2 and r1, respectively. Then, by the definition of variance, we have
(11)
Corollary 2.1.1. It is important to note that requires the second moment (μ2u) of y, which is generally unknown. [13] suggested two possible ways to acquire μ2u: (i) guess it from the prior information or pilot survey and (ii) obtain the sample estimate to derive the information about μ2u by keeping in mind the sensitive nature of ui.
Lemma 2.2. The variance of , is given by
(12)
Proof. Let Em and Vm, m = 4, 5, be the expected values and variances for given N1 and N2, respectively. By definition, we have
(13)
Ignoring the correction factor for ease of computation, we have
(14)
Corollary 2.2.1. From (4) and (14), we see that the variance of modified estimator is higher than Hansen-Hurwitz estimator. It means that is less efficient than
.
The objective of our study is to increase trust and confidence among interviewees that their privacy is secure even when they give true answers. Moreover, non-response at the first call might occur due to non-availability or inability to provide the required information. Therefore, at the second call, some people may be willing to report their responses directly, even when sensitive characteristics are investigated. For this purpose, the randomization in stages should be reformulated as an optional randomized response (ORR) procedure, which permits respondents to divulge the direct or true response without using RRT, and is given by
(15)
where
It is easy to show that the unbiased estimator for is derived by replacing (15) in (9) and its variance becomes (1 − ti)ϕi instead of ϕi, in (14). Furthermore, ORR reduces the variance and privacy at various values of ti for the non-respondents at first call.
2.2 Ratio estimator
Initially, [14] takes into account the utility of auxiliary information at estimation stage by defining the ratio estimator for population. The traditional ratio estimator can be modified for the imputation of missing scrambled responses at second call, as:
(16)
where
,
, and
.
The point estimator for sub-population (Ω(2)), is given by
(17)
The Hansen-Hurwitz ratio estimator for population mean , is given by
(18)
The variance of modified ratio estimator is given by
(19)
where
,
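The idea behind ratio-type imputation can be illustrated with a generic sketch. It is not the paper's (16), which is not reproduced in this text: a missing second-call response is replaced by the auxiliary value multiplied by the ratio of the observed response total to the observed auxiliary total. The data and the function name are hypothetical.

```python
def ratio_impute(responses, aux):
    """Fill None entries with (sum of observed responses / sum of their aux values) * aux_i."""
    observed = [(u, v) for u, v in zip(responses, aux) if u is not None]
    ratio = sum(u for u, _ in observed) / sum(v for _, v in observed)
    return [u if u is not None else ratio * v for u, v in zip(responses, aux)]

# Hypothetical data: two of five second-call units gave no response.
responses = [10.0, 20.0, None, 30.0, None]
aux = [5.0, 10.0, 8.0, 15.0, 12.0]
completed = ratio_impute(responses, aux)
```

Ratio imputation works best when the response is roughly proportional to the auxiliary variable, as in this toy data where the observed ratio is exactly 2.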
2.3 Difference estimator
Now, we consider the difference estimator for explaining missing structure of scrambled responses, as:
(20)
where d is an unknown constant.
The point estimator for sub-population mean (Ω), is given by
(21)
The combined version of modified Hansen-Hurwitz estimator is given by
(22)
The variance of estimator, is stated as
(23)
where
.
When , variance of
reduces to
(24)
where
.
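A generic difference-type imputation can be sketched in the same spirit. It is not the paper's (20): a missing response is replaced by the observed mean plus d times the deviation of the unit's auxiliary value from the observed auxiliary mean. Taking d as the least-squares slope is one common choice and is an assumption here, as are the data and function names.

```python
def _observed_pairs(responses, aux):
    """Pairs (u, v) for units that actually responded."""
    return [(u, v) for u, v in zip(responses, aux) if u is not None]

def regression_slope(responses, aux):
    """Least-squares slope of the observed responses on the auxiliary values."""
    obs = _observed_pairs(responses, aux)
    u_bar = sum(u for u, _ in obs) / len(obs)
    v_bar = sum(v for _, v in obs) / len(obs)
    s_uv = sum((u - u_bar) * (v - v_bar) for u, v in obs)
    s_vv = sum((v - v_bar) ** 2 for _, v in obs)
    return s_uv / s_vv

def difference_impute(responses, aux, d):
    """Fill None entries with u_bar + d * (aux_i - v_bar), using observed means."""
    obs = _observed_pairs(responses, aux)
    u_bar = sum(u for u, _ in obs) / len(obs)
    v_bar = sum(v for _, v in obs) / len(obs)
    return [u if u is not None else u_bar + d * (v - v_bar)
            for u, v in zip(responses, aux)]

# Hypothetical data: two of five second-call units gave no response.
responses = [10.0, 20.0, None, 30.0, None]
aux = [5.0, 10.0, 8.0, 15.0, 12.0]
d = regression_slope(responses, aux)
completed = difference_impute(responses, aux, d)
```

Unlike the ratio form, the difference form does not require the relationship between the response and the auxiliary variable to pass through the origin.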
The problem of estimating population parameters using higher order moments of the auxiliary variable was considered by [15–17]. Later, [18–20], among others, also considered known higher order moments of the auxiliary variable for estimating finite population parameters. In the theory of survey sampling, it is a well-established result that the use of higher order moments of the auxiliary variable plays a pivotal role in estimating the finite population mean of the study variable. This literature motivates imputing the missing values at second call by using the known covariance between the study variable and the auxiliary variable.
3 Proposed imputation procedure
Initially, [21] improves the conventional mean estimator by using a tuning constant (α(s)), in the case of missing values, as:
(25)
which leads to Searls’s type estimator for
is given by
(26)
Searls’s approach uses the known coefficient of variation to increase the efficiency of the estimation procedure. The optimum value of α(s) depends on ,
and
, which are stable quantities. The stability of these constants has been explored by numerous researchers, such as [22–24]. Therefore, the present investigation searches for an optimum imputation method that uses the covariance between the study and auxiliary variables. The imputation of item non-response is given by
(27)
where α1, α2, and α3 are suitably chosen constants, determined by minimizing the resultant mean square error. The point estimator for the population mean is defined as:
(28)
The modified version of Hansen-Hurwitz difference estimator is given by
(29)
The variance of , is given by
(30)
where
The optimum values of αj, j = (1, 2, 3) are obtained, respectively, by minimizing (30), as follows:
(31)
where
,
and
is the coefficient of multiple determination of u on v and
.
Substituting (31) in (30), the variance of , is given by
(32)
where
.
Remark 1. The second term in vanishes if k = 1. This happens when every non-respondent of the first call is interviewed at the second call.
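As background for the tuning-constant idea of [21] used at the start of this section, the standard argument for a shrinkage estimator of the form α ȳ can be sketched as follows, ignoring the finite population correction. The symbols \bar{Y}, C_y, and n are generic and are not the paper's notation.

```latex
\mathrm{MSE}(\alpha\,\bar{y})
  = \alpha^{2}\,\bar{Y}^{2}\,\frac{C_{y}^{2}}{n} + (\alpha - 1)^{2}\,\bar{Y}^{2},
\qquad
\frac{\partial\,\mathrm{MSE}}{\partial \alpha} = 0
\;\Rightarrow\;
\alpha_{\mathrm{opt}} = \frac{1}{1 + C_{y}^{2}/n},
\qquad
\mathrm{MSE}_{\min} = \frac{\bar{Y}^{2}\,C_{y}^{2}/n}{1 + C_{y}^{2}/n}.
```

The minimum mean square error is smaller than the variance \bar{Y}^{2} C_{y}^{2}/n of the unshrunk mean, which is why a known or stable coefficient of variation can be exploited. The constants α1, α2, and α3 in (27) are chosen by the same kind of minimization.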
4 Choice of sampling fractions
We shall deduce the optimum values of k and n that minimize the variance at specified cost. The cost function for the proposed model is based on following four components, as:
- C0 = overhead cost.
- C1 = per unit cost for collecting the response by mail inquiry at first call.
- C2 = the unit cost for obtaining the scrambled response from the non-respondent group of first call.
- C3 = cost per unit for editing, processing or imputing the missing r2 values.
Thus, the cost function is given by
(33)
Note that the total cost C* varies from sample to sample. So, we use the expected cost; taking the expectation of (33), we have
(34)
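The cost components listed above can be combined in a short numerical sketch. Since (33) and (34) are not reproduced in this text, the snippet assumes a linear cost C0 + C1·n + C2·r1 + C3·r2 with r1 = n2/k, r2 = n2(k − 1)/k, and E(n2) = n·w2, where w2 is a hypothetical expected non-response proportion. All numeric values are illustrative.

```python
def total_cost(n, r1, r2, c0, c1, c2, c3):
    """Linear cost: overhead + first-call cost + scrambled-response cost + imputation cost."""
    return c0 + c1 * n + c2 * r1 + c3 * r2

def expected_cost(n, k, w2, c0, c1, c2, c3):
    """Expected cost under the assumptions E(n2) = n * w2, r1 = n2 / k, r2 = n2 * (k - 1) / k."""
    n2 = n * w2
    return total_cost(n, n2 / k, n2 * (k - 1) / k, c0, c1, c2, c3)

# Illustrative values: 100 units, every second non-respondent gives a scrambled response.
cost = expected_cost(n=100, k=2, w2=0.3, c0=50, c1=1, c2=5, c3=2)
```

With these inputs the expected numbers of scrambled and imputed responses are 15 each, so the expected cost is 50 + 100 + 75 + 30 = 255.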
So, we have the following Lemma, as:
Lemma 4.1. The optimum values of k and n for the minimum expected cost are, respectively, given by
(35)
and
(36)
where g = c, d, and p.
Proof. Let the variance be a fixed V0, i.e
, then the Lagrange function, is given by
(37)
where ξ is a Lagrange multiplier. Differentiating (37) with respect to n, equating to zero i.e
and ignoring δj, j = 1, 2, we have
which implies
(38)
Substituting (38) in (39), we have
(40)
Substituting (40) in (38), we have
(41)
which is the required optimum sample size (nov). Now, we differentiate (37) with respect to k and equate to zero i.e
. Then, we have
which implies
(42)
Using (38) in (42), we have
(43)
which is the required optimum value of k.
Corollary 4.1.1. The optimum values of n and k are proportional to the expected cost (C*). To get optimum values of k and n, that, , we simply substitute
and
in (35) and (36).
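The trade-off behind Lemma 4.1 can be illustrated with a generic stand-in, since (35) and (36) are not reproduced in this text. The sketch assumes a variance of the form (A + B(k − 1))/n and a variable cost n(a + b/k), where A, B, a, and b are hypothetical aggregated constants. Fixing the variance at v0 and minimizing the cost gives k = sqrt(b(A − B)/(aB)), which the snippet checks against a brute-force grid search.

```python
import math

def optimum_k_n(A, B, a, b, v0):
    """Minimise n * (a + b / k) subject to (A + B * (k - 1)) / n = v0 (requires A > B > 0)."""
    k_opt = math.sqrt(b * (A - B) / (a * B))
    n_opt = (A + B * (k_opt - 1)) / v0
    return k_opt, n_opt

def variable_cost(k, A, B, a, b, v0):
    """Variable cost at a given k once n is fixed by the variance constraint."""
    n = (A + B * (k - 1)) / v0
    return n * (a + b / k)

# Hypothetical inputs: variance components A, B; cost coefficients a, b; target variance v0.
A, B, a, b, v0 = 40.0, 10.0, 1.0, 3.0, 0.5
k_opt, n_opt = optimum_k_n(A, B, a, b, v0)

# Brute-force check on a grid of k values from 1.00 to 10.00 in steps of 0.01.
grid = [1 + i / 100 for i in range(901)]
k_grid = min(grid, key=lambda k: variable_cost(k, A, B, a, b, v0))
```

In this toy setting the closed form gives k = 3 and n = 120, and the grid search selects the same k. A larger second-call cost b pushes k up, meaning that fewer non-respondents are followed.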
5 Empirical comparison
On the lines of [25], the relative comparison of with respect to
is considered by generating a hypothetical population under following key steps, as:
- Let two independent populations, say Ω(x) and Ω(z), of size 1000 be generated from the gamma distribution using the following parameter values:
(44)
The study variable is generated by
(45)
- Split the populations into two strata having N1 = 690 and N2 = 310 units.
- Assume that, out of N2 units, R1 units provide responses using (5) and the remaining R2 = (N2 − R1) units refuse to give their true or scrambled responses.
- Impute the missing R2 values by using
and Suv.
- Repeat the process
times. The variance of the given estimator is obtained by using following expression, as:
(46)
and the relative efficiency (R.E) of
is obtained by using the following expression
(47)
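The simulation steps above can be outlined in Python using only the standard library. The gamma parameters, the model for the study variable, and the two toy estimators are hypothetical stand-ins, because (44), (45), and the estimators themselves are not reproduced in this text. The relative efficiency is assumed to be the percentage ratio of the reference variance to the candidate variance.

```python
import random

def empirical_variance(estimates, true_mean):
    """Mean squared deviation of replicated estimates from the population mean."""
    return sum((e - true_mean) ** 2 for e in estimates) / len(estimates)

def relative_efficiency(var_reference, var_candidate):
    """Ratio of variances in percent; values above 100 favour the candidate."""
    return 100.0 * var_reference / var_candidate

rng = random.Random(2022)
N, N1 = 1000, 690

# Hypothetical gamma parameters (shape, scale) for the two auxiliary populations.
x = [rng.gammavariate(2.0, 3.0) for _ in range(N)]
z = [rng.gammavariate(1.5, 2.0) for _ in range(N)]
# Hypothetical study variable: linear in x and z plus noise.
y = [5.0 + 2.0 * xi + 0.5 * zi + rng.gauss(0.0, 1.0) for xi, zi in zip(x, z)]

stratum1, stratum2 = y[:N1], y[N1:]  # first-call respondents and non-respondents
true_mean = sum(y) / N

# Replicate two toy estimators: sample means based on 50 and on 100 units.
reps = 2000
est_small = [sum(rng.sample(y, 50)) / 50 for _ in range(reps)]
est_large = [sum(rng.sample(y, 100)) / 100 for _ in range(reps)]
re = relative_efficiency(empirical_variance(est_small, true_mean),
                         empirical_variance(est_large, true_mean))
```

In a full replication of the study, the two toy estimators would be replaced by the Hansen-Hurwitz estimator and the imputation-based estimators computed on PPS samples with scrambled second-call responses.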
For the numerical comparison, we consider the following values of un-known constants, as:
where r is assumed response rate, which is 40% of nog. The optimum values of relative efficiencies (R.Es) of
are given in Table 1.
Table 1 shows the optimum values of k, n, and R.E(j) of estimators i.e modified ratio and difference estimators. Under this hypothetical population, the modified estimators i.e perform better as compared to traditional Hansen-Hurwitz estimator
. We also observe that the optimum value of nog is approximately similar for all
, so optimum sample of size nop is used for the relative comparison between existing and proposed imputation estimators.
From Table 1, we observe the following relationships among C2, C3, r1, Vo, kop, nop, and R.E(j).
- The values of Vo and nop have inverse relationship with C2 and C3.
- C2 and C3 have a positive relationship with R.E(j). As the costs of scrambling responses and imputation increase, the relative efficiencies of
improve significantly.
- r1 has a negative association with kop and nop. The values of kop and nop decrease as r1 increases.
- Vo also has the inverse relationship with kop and nop. As the value of Vo increases, the values of kop and nop decrease.
- The relative efficiencies of
are also inversely correlated with r1 and Vo. The values of R.E(j) decrease as Vo and nop increase.
From the numerical findings, we conclude that the proposed imputation procedure at second call performs better than the existing estimators and the traditional Hansen-Hurwitz estimator at various values of C2, C3, r1 and Vo.
6 Conclusion
The problem of non-response bias in a sensitive quantitative study variable has been reduced by sub-sampling the non-respondents, following the Hansen and Hurwitz (1946) procedure. A new imputation mechanism has been defined by using the known covariance between the study variable and the auxiliary variable. The optimum sample size is also derived for a given set of unit costs (Cq, q = 0, 1, 2, 3), r1, and Vo. From Table 1, the proposed imputation method outperforms the ratio, difference, and Hansen-Hurwitz estimators.
When the processing, editing, or imputing cost per unit is high, the proposed imputation strategy can perform better than its counterparts. Our proposed imputation procedure is also useful when there are serious concerns about non-response bias or refusals, due to the sensitive nature of the study variable, that are difficult to ignore.
Supporting information
S1 Code. In this research a hypothetical data set is used which can be easily regenerated at the given value of parameters with the help of available statistical software.
https://doi.org/10.1371/journal.pone.0261834.s001
(R)
Acknowledgments
We are grateful to the reviewers and the associate editor for their in-depth comments, which improved the quality of the article.
References
- 1. Hansen M. H. and Hurwitz W. N. (1946). The problem of non-response in sample surveys. Journal of the American Statistical Association, 41(236):517–529. pmid:20279350
- 2. Abul-Ela A.-L. A., Greenberg G. G., and Horvitz D. G. (1967). A multi-proportions randomized response model. Journal of the American Statistical Association, 62(319):990–1008.
- 3. Warner S. L. (1965). Randomized response: A survey technique for eliminating evasive answer bias. Journal of the American Statistical Association, 60(309):63–69. pmid:12261830
- 4. Greenberg B. G., Abul-Ela A.-L. A., Simmons W. R., and Horvitz D. G. (1969). The unrelated question randomized response model: Theoretical framework. Journal of the American Statistical Association, 64(326):520–539.
- 5. Moors J. (1971). Optimization of the unrelated question randomized response model. Journal of the American Statistical Association, 66(335):627–629.
- 6. Folsom R. E., Greenberg B. G., Horvitz D. G., and Abernathy J. R. (1973). The two alternate questions randomized response model for human surveys. Journal of the American Statistical Association, 68(343):525–530.
- 7. Eichhorn B. H. and Hayre L. S. (1983). Scrambled randomized response methods for obtaining sensitive quantitative data. Journal of Statistical Planning and Inference, 7(4):307–316.
- 8. Mangat N. and Singh R. (1990). An alternative randomized response procedure. Biometrika, pages 439–442.
- 9. Gjestvang C. R. and Singh S. (2006). A new randomized response model. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(3):523–530.
- 10. Diana G., Riaz S., and Shabbir J. (2014). Hansen and Hurwitz estimator with scrambled response on the second call. Journal of Applied Statistics, 41(3):596–611.
- 11. Ahmed S., Shabbir J., and Gupta S. (2017). Use of scrambled response model in estimating the finite population mean in presence of non response when coefficient of variation is known. Communications in Statistics-Theory and Methods, 46(17):8435–8449.
- 12. Diana G. and Perri P. F. (2010a). New scrambled response models for estimating the mean of a sensitive quantitative character. Journal of Applied Statistics, 37(11):1875–1890.
- 13. Diana G. and Perri P. F. (2010b). New scrambled response models for estimating the mean of a sensitive quantitative character. Journal of Applied Statistics, 37(11):1875–1890.
- 14. Cochran W. (1940). The estimation of the yields of cereal experiments by sampling for the ratio of grain to total produce. The Journal of Agricultural Science, 30(02):262–275.
- 15. Srivastava S. K. and Jhajj H. S. (1981). A class of estimators of the population mean in survey sampling using auxiliary information. Biometrika, 68(1):341–343.
- 16. Isaki C. T. (1983). Variance estimation using auxiliary information. Journal of the American Statistical Association, 78(381):117–123.
- 17. Singh S. and Horn S. (1998). An alternative estimator for multi-character surveys. Metrika, 48(2):99–107.
- 18. Mohamed C., Sedory S. A., and Singh S. (2016). Imputation using higher order moments of an auxiliary variable. Communications in Statistics-Simulation and Computation, 46(8):6588–6617.
- 19. Sohail M. U., Shabbir J., and Ahmed S. (2017). Modified class of ratio and regression type estimators for imputing scrambling response. Pakistan Journal of Statistics, 33(4):277–300.
- 20. Bhushan S., Pratap Pandey A., and Pandey A. (2018). On optimality of imputation methods for estimation of population mean using higher order moment of an auxiliary variable. Communications in Statistics-Simulation and Computation, pages 1–15.
- 21. Searls D. T. (1964). The utilization of a known coefficient of variation in the estimation procedure. Journal of the American Statistical Association, 59(308):1225–1226.
- 22. Murthy M. N. (1967). Sampling theory and methods. Calcutta-35: Statistical Publishing Society, 204/1, Barrackpore Trunk Road, India.
- 23. Reddy V. (1978). A study on the use of prior knowledge on certain population parameters in estimation. Sankhya C, 40:29–37.
- 24. Singh S. (2009). A new method of imputation in survey sampling. Statistics, 43(5):499–511.
- 25. Okafor F. C. and Hyunshik L. (2000). Double sampling for ratio and regression estimation with sub-sampling the non-respondents. Survey Methodology, 26(2):183–188.