Estimation of finite population mean for a sensitive variable using dual auxiliary information in the presence of measurement errors

In this study, we propose a new improved estimator of the population mean of a sensitive variable in the presence of measurement error under simple and stratified random sampling. The estimator uses the auxiliary information as well as the ranks of the auxiliary variable. Theoretical and numerical studies show that the proposed estimator performs better than the existing estimators considered in this study.


Introduction
In survey sampling, it is usually assumed that all observations on the characteristics under study are recorded carefully, so that the information obtained is error free. In practice this assumption often fails for many reasons, including non-response, which may arise from refusal of respondents to give information, absence from home, lack of interest, or the sensitivity of the question. In analysis, a basic assumption is that all observations are measured correctly. In the multiple regression model, it is assumed that all observations on the study variable and the auxiliary variable are recorded without any error. In many situations these assumptions are violated for the following reasons. (i) Some variables are qualitative and hard to measure (e.g., intelligence, taste, ability, climate, education, poverty etc.), so dummy variables are used and observations are recorded in terms of the values of the dummy variables. (ii) In applications, some variables are clearly defined but it is hard to take correct observations on them (e.g., age is either under-reported or over-reported in complete years). (iii) Some variables are conceptually well defined but it is hard to take correct observations on them; instead, observations are taken on closely related variables (e.g., level of education is measured by the number of years of schooling). In all the above cases, it is not possible to obtain the true value of the variable; instead, it is recorded with error. Measurement error (ME) is thus the difference between the observed and the true value. ME also arises from the use of an imperfect measure of the true values of variables. Suppose we are interested in the average level of anxiety among students. We take a random sample of students and measure their level of anxiety. Then we calculate the mean level of anxiety, i.e., the sample mean.
The normality assumption says that if you repeat this process many times and plot the sample means, the distribution will be normal. Usually measurement error and randomized response are studied separately using the known auxiliary or additional information. In reality, when the variable of interest is sensitive, the respondents hesitate to provide the personal information, which gives rise to measurement error.
A few researchers have discussed the problem of measurement error in estimating the population mean. [1] discussed some important sources of measurement error in survey data. [2] studied the estimation of the population mean in the presence of measurement error for ratio-product type estimators. [3] and [4] presented the ratio method of estimation in the presence of measurement error. This work was extended further by [5]. [6], [7] and [8] studied measurement error and non-response together. [9] suggested an estimator of the population mean in the presence of measurement error and non-response under stratified random sampling.
In survey sampling, when the variable of interest is sensitive, the respondents hesitate to provide their personal information. A direct survey on a sensitive question increases the relative bias. [10] introduced the randomized response technique (RRT), which reduces the possible bias and is used to obtain the true information while ensuring the privacy of the respondents. The randomized response model (RRM) was extended by [11] to the estimation of the mean of a sensitive quantitative variable. [12] introduced the scrambled randomized response method. [13] proposed the optional RRT method, and this work was extended further by [14]. [15] used the scrambled response technique for the estimation of the population mean when the coefficient of variation is known. [16] used empirical Bayes estimation for a sensitive variable. [17] studied the estimation of the population mean of a sensitive variable in the presence of non-sensitive auxiliary information. [18] and [19] studied improved estimation of the population mean in simple and stratified random sampling.
When the correlation between the study variable and the auxiliary variable is sufficiently high, the ranks of the auxiliary variable are also correlated with the study variable, and consequently the precision of the estimator increases. [20] suggested using the ranks of the auxiliary variable to obtain efficient estimates. Little literature is available on estimating the population mean of a sensitive variable in the presence of measurement error based on the dual use of the auxiliary information.
The present paper is organized as follows. Section 2 gives the existing estimators and an improved proposed estimator of the population mean of a sensitive variable in the presence of measurement error under simple random sampling. Both theoretical and numerical comparisons are made in Section 2. In Section 3, some existing estimators are reviewed and an improved class of estimators is suggested for estimating the finite population mean by incorporating both measurement error and sensitive information simultaneously under stratified random sampling. The efficiency comparison, numerical results and simulation study are also presented in Section 3. The conclusion is given in Section 4.

Estimators under simple random sampling
Let $\Omega$ be a finite population of size $N$. Suppose that a simple random sample of size $n$ is drawn from $\Omega$ by using simple random sampling without replacement. Let $Y$ be the sensitive study variable, which is not observed directly, and $X$ be the non-sensitive auxiliary variable, which has positive correlation with $Y$. Let $R_x$ be the ranks of the auxiliary variable $X$. Let $S$ be a scrambling variable which is independent of $Y$ and $X$. We assume that $S$ has zero mean and variance $S_S^2$. The respondent is asked to give a scrambled response for the study variable $Y$, given by $Z = Y + S$, and in addition is asked to provide a true response for $X$.
Let $(x_i, r_{x,i}, y_i, z_i)$ be the observed values and $(X_i, R_{x,i}, Y_i, Z_i)$ be the actual values of the variables $(X, R_x, Y, Z)$, respectively. Then the measurement errors are $V_i = x_i - X_i$, $U_i = z_i - Z_i$ and $T_i = r_{x,i} - R_{x,i}$. These measurement errors are assumed to be uncorrelated and normally distributed with zero mean and variances $S_V^2$, $S_U^2$ and $S_T^2$, respectively. Let $S_X^2$, $S_{R_x}^2$ and $S_Z^2$ be the population variances, and let $\rho_{XZ}$, $\rho_{XR_x}$ and $\rho_{ZR_x}$ be the coefficients of correlation between the variables in their subscripts.
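The response model above can be illustrated with a short simulation. The following is a minimal sketch under stated assumptions, not the authors' code: the helper names `scramble_and_observe` and `ranks`, the seeds and all numerical settings are hypothetical, and ranks are simply computed from whatever values are passed in.

```python
import random
import statistics


def scramble_and_observe(y_true, x_true, sd_s, sd_u, sd_v, seed=0):
    """Simulate reported values under additive scrambling plus measurement error.

    Hypothetical helper: z_i = (Y_i + S_i) + U_i and x_i = X_i + V_i, where
    S, U and V are independent zero-mean normal draws with standard
    deviations sd_s, sd_u and sd_v.
    """
    rng = random.Random(seed)
    z_obs, x_obs = [], []
    for y, x in zip(y_true, x_true):
        s = rng.gauss(0.0, sd_s)  # scrambling variable S
        u = rng.gauss(0.0, sd_u)  # measurement error U on Z
        v = rng.gauss(0.0, sd_v)  # measurement error V on X
        z_obs.append(y + s + u)
        x_obs.append(x + v)
    return z_obs, x_obs


def ranks(values):
    """Return ranks 1..n of the values (ties broken by position)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


if __name__ == "__main__":
    rng = random.Random(1)
    x_pop = [rng.gauss(50.0, 5.0) for _ in range(5000)]
    y_pop = [2.0 + 0.8 * x + rng.gauss(0.0, 2.0) for x in x_pop]
    z, x = scramble_and_observe(y_pop, x_pop, sd_s=1.0, sd_u=0.5, sd_v=0.5)
    # Zero-mean scrambling and error leave the mean roughly unchanged,
    # while the variance of the observed x is inflated by about sd_v ** 2.
    print(statistics.fmean(y_pop), statistics.fmean(z))
    print(statistics.variance(x_pop), statistics.variance(x))
    print(ranks(x)[:5])
```

Because $S$, $U$ and $V$ have zero mean, the mean of the reported values stays close to the mean of $Y$, while the observed variances are inflated by the corresponding error variances.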

Existing estimators in literature
In this section we consider the following existing estimators.
Mean estimator. The usual unbiased mean per unit estimator is given by
$$\bar{y}_0 = \bar{z},$$
where $\bar{z}$ is given in Eq (12). The variance of $\bar{y}_0$ is given by
where $\lambda = (n^{-1} - N^{-1})$.
Ratio estimator. The traditional ratio estimator is given by
$$\bar{y}_R = \bar{z}\left(\frac{\bar{X}}{\bar{x}}\right),$$
where $\bar{x}$ is the sample mean (see Eq (13)) and $\bar{X}$ is the known population mean. The bias and mean square error of $\bar{y}_R$, to the first degree of approximation, are given by
Difference estimator. The usual difference estimator is given by
$$\bar{y}_D = \bar{z} + d(\bar{X} - \bar{x}),$$
where $d$ is a constant whose value is to be determined optimally. The minimum variance of $\bar{y}_D$ is given by
where the optimum value of $d$ is
$$d_{(opt)} = \frac{\rho_{XZ} S_Z S_X}{S_X^2 + S_V^2}.$$
Khalil randomized response estimator. Recently, [21] proposed the generalized randomized response estimator, given by
where $\bar{w} = \phi(\alpha\bar{x} + \gamma) + (1 - \phi)(\alpha\bar{X} + \gamma)$ and $\bar{W} = \alpha\bar{X} + \gamma$; $k$ and $g$ are constants, and $\phi$ is assumed to be an unknown constant to be determined. Also, $\alpha\ (\neq 0)$ and $\gamma$ are assumed to be some known parameters of the auxiliary variable $X$. The bias and minimum MSE of $\bar{y}_K$, to the first degree of approximation, are given by
and
which is exactly equal to the variance of the difference estimator $\bar{y}_D$, but $\bar{y}_D$ is preferable over $\bar{y}_K$ because of unbiasedness.
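The three classical estimators can be sketched in a few lines of code. This is an illustrative sketch assuming the textbook forms $\bar{z}$, $\bar{z}\bar{X}/\bar{x}$ and $\bar{z} + d(\bar{X} - \bar{x})$, with the optimum $d$ taken as $\rho_{XZ} S_Z S_X / (S_X^2 + S_V^2)$; the function names are hypothetical.

```python
import statistics


def mean_estimator(z):
    """Sample mean of the scrambled responses."""
    return statistics.fmean(z)


def ratio_estimator(z, x, x_bar_pop):
    """Ratio estimator: sample mean of z scaled by X-bar / x-bar."""
    return statistics.fmean(z) * x_bar_pop / statistics.fmean(x)


def difference_estimator(z, x, x_bar_pop, d):
    """Difference estimator with a user-supplied constant d."""
    return statistics.fmean(z) + d * (x_bar_pop - statistics.fmean(x))


def d_opt(rho_xz, s_z, s_x, s_v):
    """Optimum d; the measurement error variance s_v**2 enters the denominator."""
    return rho_xz * s_z * s_x / (s_x ** 2 + s_v ** 2)


if __name__ == "__main__":
    z = [2.0, 4.0, 6.0]
    x = [1.0, 2.0, 3.0]
    print(mean_estimator(z))
    print(ratio_estimator(z, x, x_bar_pop=4.0))
    print(difference_estimator(z, x, x_bar_pop=4.0, d=0.5))
    # Measurement error in x shrinks the optimum coefficient.
    print(d_opt(0.8, 2.0, 1.0, 0.0), d_opt(0.8, 2.0, 1.0, 1.0))
```

With no measurement error in $X$ ($S_V^2 = 0$), `d_opt` reduces to the usual regression coefficient $\rho_{XZ} S_Z / S_X$; a positive $S_V^2$ pulls the coefficient toward zero.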

The proposed estimator
We propose an improved randomized response estimator for estimating the population mean of the sensitive variable, dealing with the problem of measurement error. Measurement error is considered on both the study and the auxiliary variables.
The proposed estimator is given by
where $m_1$ and $m_2$ are constants whose values are to be determined. For obtaining the bias and mean square error, we assume that
Adding $\delta_Z$ and $\delta_U$, we get
Dividing both sides by $n$ and then simplifying, we get
Similarly, we can write
In order to get the bias and MSE of the proposed estimator, we consider the following relative error terms:
Solving Eq (11) in terms of errors, we have
Further simplifying, and keeping the terms up to power 2, we have
On the lines of [22] and [23], we use the approximation method to derive the MSE of our proposed estimator in simple and stratified random sampling. The signal to noise ratio can easily be obtained by using the expression $C.V = S.D/\text{Mean}$.
Using the above equation, the bias of $\bar{y}_P$ is given by
Squaring and taking expectation in Eq (16), we have
The optimum values of $m_1$ and $m_2$ are
and
Substituting the optimum values of $m_1$ and $m_2$ in Eq (18), we get the minimum MSE of $\bar{y}_P$, given by
where,
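To illustrate the general idea of using both the auxiliary variable and its ranks, the sketch below implements a plain two-variable difference-type estimator, $\bar{z} + m_1(\bar{X} - \bar{x}) + m_2(\bar{R}_x - \bar{r}_x)$. This is an assumption made for illustration only and is not the proposed estimator $\bar{y}_P$; the function names are hypothetical, and the coefficients are the variance-minimizing values obtained by solving the 2x2 normal equations with sample moments.

```python
import statistics


def sample_cov(a, b):
    """Unbiased sample covariance of two equal-length sequences."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / (len(a) - 1)


def optimal_coefficients(z, x, r):
    """Solve [[s_xx, s_xr], [s_xr, s_rr]] [m1, m2]^T = [s_zx, s_zr]^T."""
    s_xx, s_rr, s_xr = sample_cov(x, x), sample_cov(r, r), sample_cov(x, r)
    s_zx, s_zr = sample_cov(z, x), sample_cov(z, r)
    det = s_xx * s_rr - s_xr ** 2
    if det == 0:
        raise ValueError("x and r are collinear; coefficients are not unique")
    m1 = (s_zx * s_rr - s_zr * s_xr) / det  # Cramer's rule
    m2 = (s_zr * s_xx - s_zx * s_xr) / det
    return m1, m2


def dual_difference_estimator(z, x, r, x_bar_pop, r_bar_pop, m1, m2):
    """Illustrative estimator adjusting the mean of z by both x and its ranks."""
    return (statistics.fmean(z)
            + m1 * (x_bar_pop - statistics.fmean(x))
            + m2 * (r_bar_pop - statistics.fmean(r)))


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0, 4.0]
    r = [1.0, 3.0, 2.0, 4.0]
    z = [2.0 * xi + 3.0 * ri for xi, ri in zip(x, r)]
    m1, m2 = optimal_coefficients(z, x, r)
    print(m1, m2)
    print(dual_difference_estimator(z, x, r, 3.0, 3.0, m1, m2))
```

When the sample moments are computed from observed values, the variance of $x$ already includes the measurement error variance, in the same way that $S_X^2 + S_V^2$ appears in the denominator of the optimum $d$ for the difference estimator.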

Numerical results
In this section two populations are generated for simulation study and two are based on real data sets.

Simulation study.
We have generated two populations of size 1,000 from a multivariate normal distribution with different covariance matrices. The results of the simulation are given in Tables 1 and 2. The population means and covariance matrices are given below:
The covariance matrices describe the joint distribution of the sensitive variable $Y$, the auxiliary variable $X$ and the ranks of the auxiliary variable $R_x$. There is high correlation in Population I and weak correlation in Population II. The scrambling response $S$ is distributed as $N(0, 0.01\sigma_X)$. The response variable is $Z = Y + S$. We estimate the MSE using $k = 1000$ samples of various sizes selected from each population. Three different sample sizes, $n = 100, 150, 200$, are taken from both populations. The expression is given below:
where $i = 0, R, D, K, P$.
Tables 1 and 2 show that the proposed estimator $\bar{y}_P$ performs better than all the other existing estimators for both populations. The MSE of the proposed estimator is smaller for Population I than for Population II because the correlation between the variables is higher in Population I. As the sample size increases, the MSE of all the estimators decreases. The MSEs of the difference estimator $\bar{y}_D$ and the Khalil estimator $\bar{y}_K$ are the same, but $\bar{y}_D$ is preferable over $\bar{y}_K$ because of unbiasedness.
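The simulation procedure can be sketched as a small Monte Carlo loop. The code below assumes the usual empirical MSE, that is, the average over $k$ replications of the squared difference between the estimate and the target mean; the function name `empirical_mse`, the seed and the toy population are hypothetical and do not reproduce the settings of Tables 1 and 2.

```python
import random
import statistics


def empirical_mse(population, estimator, n, k, target, seed=0):
    """Average squared error of `estimator` over k SRSWOR samples of size n."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(k):
        sample = rng.sample(population, n)  # simple random sample without replacement
        total += (estimator(sample) - target) ** 2
    return total / k


if __name__ == "__main__":
    rng = random.Random(3)
    z_pop = [rng.gauss(10.0, 2.0) for _ in range(1000)]
    target = statistics.fmean(z_pop)
    for n in (100, 150, 200):
        mse = empirical_mse(z_pop, statistics.fmean, n=n, k=1000, target=target)
        theory = (1 / n - 1 / len(z_pop)) * statistics.variance(z_pop)
        print(n, mse, theory)
```

For the sample mean, the empirical value should be close to $\lambda S_Z^2$ computed from the same population, which gives a quick sanity check on the loop before it is applied to other estimators.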

Application to real data.
In this section we have considered two data sets for numerical comparisons. Both data sets consist of 654 observations. The data summary is given below (see Tables 3 and 4) and results are given in Tables 5 and 6.

Population III (Source: [24])
Population IV (Source: [24])
In both populations the study and the auxiliary variables are identical, but the scrambling responses are different. The correlation coefficients for both populations are $\rho_{XY} = 0.7564$, $\rho_{XR_x} = 0.7831$ and $\rho_{YR_x} = 0.6161$. In Populations III and IV, smoke (No = 0, Yes = 1) and sex (Female = 0, Male = 1) are taken as the scrambling responses, respectively. Tables 5 and 6 show that the proposed estimator $\bar{y}_P$ is more efficient than all the other considered estimators in both populations (III and IV). The MSEs of the difference estimator $\bar{y}_D$ and the Khalil estimator $\bar{y}_K$ are equivalent, but $\bar{y}_D$ is preferable over $\bar{y}_K$ because of unbiasedness.

Estimators under stratified random sampling
Consider a finite population of $N$ identifiable units partitioned into $L$ homogeneous subgroups called strata, such that the $h$th stratum consists of $N_h$ units, where $h = 1, 2, \ldots, L$ and $\sum_{h=1}^{L} N_h = N$. Let $Y_h$ be the sensitive variable, which is not observed directly, and $X_h$ be the non-sensitive auxiliary variable, which has a positive correlation with $Y_h$. Let $R_{x,h}$ be the ranks of the auxiliary variable $X_h$ and $S_h$ be a scrambling variable which is independent of $Y_h$ and $X_h$. $S_h$ has zero mean and variance $S_{S_h}^2$. The respondent is asked to give a scrambled response for the study variable $Y_h$, given by $Z_h = Y_h + S_h$, and is additionally asked to provide a true response for $X_h$.
A simple random sample of size $n_h$ is drawn without replacement from each stratum such that $\sum_{h=1}^{L} n_h = n$. Let $(x_{hi}, r_{x,hi}, y_{hi}, z_{hi})$ be the observed values and $(X_{hi}, R_{x,hi}, Y_{hi}, Z_{hi})$ be the actual values of the variables $(X_h, R_{x,h}, Y_h, Z_h)$ for the $i$th ($i = 1, 2, \ldots, n_h$) sampled unit in the $h$th stratum. Then the measurement errors are $V_{hi} = x_{hi} - X_{hi}$, $U_{hi} = z_{hi} - Z_{hi}$ and $T_{hi} = r_{x,hi} - R_{x,hi}$. These measurement errors are assumed to be uncorrelated and normally distributed with zero mean and variances $S_{hV}^2$, $S_{hU}^2$ and $S_{hT}^2$, respectively. Let $S_{hX}^2$, $S_{hR_x}^2$ and $S_{hZ}^2$ be the population variances, and let $\rho_{hXZ}$, $\rho_{hXR_x}$ and $\rho_{hZR_x}$ be the coefficients of correlation between the variables in their subscripts.

Existing estimators in literature
In this section we consider the following existing estimators.
Mean estimator. The usual unbiased mean per unit estimator is given by
$$\bar{y}_{S(0)} = \sum_{h=1}^{L} P_h \bar{z}_h,$$
where $P_h = N_h/N$ is the known stratum weight and $\bar{z}_h$ is the mean of the sensitive variable $Z_h$ in stratum $h$ (see Eq (33)). The variance of $\bar{y}_{S(0)}$ is given by
where $\lambda_h = (n_h^{-1} - N_h^{-1})$.
Ratio estimator. The traditional ratio estimator is given by
$$\bar{y}_{S(R)} = \sum_{h=1}^{L} P_h \bar{z}_h \left(\frac{\bar{X}_h}{\bar{x}_h}\right),$$
where $\bar{X}_h$ is the known population mean and $\bar{x}_h$ is the sample mean of the auxiliary variable in stratum $h$ (see Eq (34)). The bias and mean square error of $\bar{y}_{S(R)}$ are given by
$B(\bar{y}_{S(R)}) \cong$
and
$MSE(\bar{y}_{S(R)}) \cong$
where
Difference estimator. The usual difference estimator is given by
$$\bar{y}_{S(D)} = \sum_{h=1}^{L} P_h \left[\bar{z}_h + d_h(\bar{X}_h - \bar{x}_h)\right],$$
where $d_h$ is a constant whose value is to be determined optimally. The minimum variance of $\bar{y}_{S(D)}$ is given by
where
$$d_{h(opt)} = \frac{\rho_{hXZ} S_{hZ} S_{hX}}{S_{hX}^2 + S_{hV}^2}.$$
Khalil randomized response estimator. [21] proposed the estimator, which is given by,
and $g$ are constants, and $\phi_h$ is assumed to be an unknown constant whose value is to be determined. Also, $\alpha_h\ (\neq 0)$ and $\gamma_h$ are assumed to be some known parameters of the auxiliary variable $X$. The bias and minimum MSE of $\bar{y}_{S(K)}$ are given by
$B(\bar{y}_{S(K)}) \cong$
and
$MSE(\bar{y}_{S(K)})_{min} \cong$
which is exactly equal to the variance of the difference estimator $\bar{y}_{S(D)}$, but $\bar{y}_{S(D)}$ is preferable over $\bar{y}_{S(K)}$ because of unbiasedness.
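The stratified versions combine the stratum-level estimates with the weights $P_h = N_h/N$. The following is a minimal sketch assuming separate (stratum-by-stratum) forms of the mean and difference estimators; the function names and the toy inputs are hypothetical.

```python
import statistics


def stratum_weights(stratum_sizes):
    """Weights P_h = N_h / N for a list of stratum sizes N_h."""
    total = sum(stratum_sizes)
    return [size / total for size in stratum_sizes]


def stratified_mean(z_samples, stratum_sizes):
    """Weighted sum of the stratum sample means of z."""
    weights = stratum_weights(stratum_sizes)
    return sum(p * statistics.fmean(z) for p, z in zip(weights, z_samples))


def stratified_difference(z_samples, x_samples, x_bar_pops, d_values, stratum_sizes):
    """Weighted sum of stratum-level difference estimators."""
    weights = stratum_weights(stratum_sizes)
    total = 0.0
    for p, z, x, x_bar, d in zip(weights, z_samples, x_samples, x_bar_pops, d_values):
        total += p * (statistics.fmean(z) + d * (x_bar - statistics.fmean(x)))
    return total


if __name__ == "__main__":
    sizes = [300, 700]
    z_samples = [[1.0, 3.0], [10.0, 20.0]]
    x_samples = [[1.0, 1.0], [2.0, 4.0]]
    print(stratum_weights(sizes))
    print(stratified_mean(z_samples, sizes))
    print(stratified_difference(z_samples, x_samples, [2.0, 3.0], [1.0, 2.0], sizes))
```

Setting every $d_h$ to zero recovers the stratified mean estimator, which makes the two functions easy to check against each other.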

The proposed estimator
An improved randomized response estimator for estimating the population mean of a sensitive variable in the presence of measurement error is proposed. A scrambling response of $Y_h$ is observed in the form $Z_h = Y_h + S_h$, where $S_h$ is distributed as $N(0, S_{S_h}^2)$. The suggested estimator is given by
where $m_{1h}$ and $m_{2h}$ are constants whose values are to be determined, and $\bar{R}_{x_h}$ and $\bar{r}_{x_h}$ are the population mean and the sample mean of the ranks of the auxiliary variable, respectively (see Eq (35)). For obtaining the bias and mean square error, we define:
Adding $\delta_{hZ}$ and $\delta_{hU}$, we get
Dividing both sides by $n_h$ and then simplifying, we get
Similarly, we can get
In order to get the bias and MSE of the suggested estimator, we consider the following relative error terms:
$E(e_{jh}) = 0$, $j = 0, 1, 2$,
and $E(e_{1h} e_{2h}) =$
Using Eq (32) in terms of errors, we have
Further simplifying, and keeping the terms up to power 2, we have
Using the above equation, the bias of $\bar{y}_{S(P)}$ is given by
Squaring and then taking expectations of Eq (37), we have
$MSE(\bar{y}_{S(P)}) \cong$
From Eq (39), the optimum values of $m_{1h}$ and $m_{2h}$ are
and
Substituting the optimum values of $m_{1h}$ and $m_{2h}$ in Eq (39), the minimum MSE of $\bar{y}_{S(P)}$ is given by
where,
$(\rho_{hXZ} S_{hX} S_{hZ})(\rho_{hXR_x} S_{hX} S_{hR_x})(\rho_{hZR_x} S_{hZ} S_{hR_x})$.

Efficiency comparison
The efficiency comparison of $\bar{y}_{S(0)}$, $\bar{y}_{S(R)}$, $\bar{y}_{S(D)}$ and $\bar{y}_{S(K)}$ with respect to $\bar{y}_{S(P)}$ is given by

From Eqs
The proposed class of estimators is more efficient than the other existing estimators when Conditions 1 to 4 above are satisfied.

Numerical results
In this section two populations are generated for the simulation study and one real data set is used.

Simulation study.
We have generated two populations of size 1,000 from a multivariate normal distribution with different covariance matrices. The results are given in Tables 7 and 8. The means and covariance matrices are given below.
Population V.
where $i = 0, R, D, K, P$.
Tables 7 and 8 show that the estimator $\bar{y}_{S(P)}$ performs better than the estimators $\bar{y}_{S(0)}$, $\bar{y}_{S(R)}$, $\bar{y}_{S(D)}$ and $\bar{y}_{S(K)}$. The efficiency of the estimator $\bar{y}_{S(P)}$ improves when there is sufficient correlation between the study variable and the auxiliary variable. As the sample size increases, the MSE values decrease. Since the MSEs of $\bar{y}_{S(D)}$ and $\bar{y}_{S(K)}$ are equal, their numerical results are also identical for both populations.

Application to real data.
In this section we consider a real-life data set for numerical comparisons. Stratum I consists of 318 observations and Stratum II contains 336 observations. The data summary is given below (see Tables 9 and 10). The results are given in Table 11. $\rho_{1XY} = 0.7564$, $\rho_{1XR_x} = 0.7831$ and $\rho_{1YR_x} = 0.6151$. In Table 11, we observe that the estimator $\bar{y}_{S(P)}$ performs better than the estimators $\bar{y}_{S(0)}$, $\bar{y}_{S(R)}$, $\bar{y}_{S(D)}$ and $\bar{y}_{S(K)}$. The estimators $\bar{y}_{S(D)}$ and $\bar{y}_{S(K)}$ have the same MSEs, but $\bar{y}_{S(D)}$ is preferable due to unbiasedness. As the sample size increases, the MSE values decrease, as expected.

Conclusion
In the present paper, we have proposed a new improved estimator of the finite population mean that incorporates additional information on the auxiliary variable as well as on the ranks of the auxiliary variable in the presence of measurement error under simple and stratified random sampling. Through the simulation study and real-life data sets (see Tables 1, 2, 5, 6, 7, 8 and 11), it is observed that the proposed estimators $\bar{y}_P$ and $\bar{y}_{S(P)}$ perform better than the existing estimators, particularly when there is sufficient correlation between the study variable and the auxiliary variable. It is also concluded that the difference estimator and the estimator of [21] are equally efficient, but the difference estimator is preferable due to unbiasedness.
Supporting information
S1 File. Data used in the manuscript "S1_File.csv". (CSV)
Funding acquisition: Erum Zahid.