Abstract
In social surveys, the randomized response technique is a popular method for collecting reliable information on sensitive variables. Over the past few decades, it has been common practice for survey researchers to develop new randomized response techniques and show their improvement over previous models. In the majority of the available studies, the authors tend to report only those findings which are favorable to their proposed models, and they often hide the situations where their proposed models perform worse than the existing models. This approach results in biased comparisons between models, which may influence practitioners’ choice of a randomized response technique for real-life problems. We conduct a neutral comparative study of four available quantitative randomized response techniques using separate and combined metrics of respondents’ privacy level and model efficiency. Our findings show that, depending on the situation at hand, some models may be better than others for a particular choice of parameter values and constants, but become less efficient when a different set of parameter values is considered. The mathematical conditions for efficiency of different models have also been obtained.
Citation: Azeem M, Shabbir J, Salahuddin N, Hussain S, Ijaz M (2023) A comparative study of randomized response techniques using separate and combined metrics of efficiency and privacy. PLoS ONE 18(10): e0293628. https://doi.org/10.1371/journal.pone.0293628
Editor: Viacheslav Kovtun, Institute of Theoretical and Applied Informatics Polish Academy of Sciences: Instytut Informatyki Teoretycznej i Stosowanej Polskiej Akademii Nauk, UKRAINE
Received: July 5, 2023; Accepted: October 17, 2023; Published: October 27, 2023
Copyright: © 2023 Azeem et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All relevant data are available within the manuscript.
Funding: The author(s) received no specific funding for this work.
Competing interests: The authors have declared that no competing interests exist.
1. Introduction
In sample surveys, researchers almost always need to cope with high rates of refusal and false responses when collecting information related to sensitive traits. Examples of such sensitive characteristics include cheating in an examination, illegally earned income, tax payable, and abortions. In an attempt to devise a strategy for obtaining reliable data from respondents in surveys on sensitive issues, Warner [1] suggested the randomized response procedure. Initially, this procedure was developed for situations in which researchers have to collect information on qualitative characteristics. Warner [2] extended the original procedure for qualitative variables to the case of quantitative variables by introducing additive noise. Eichhorn and Hayre [3] proposed an improved variant of the quantitative randomized response strategy that uses multiplicative noise in place of additive noise.
Gupta et al. [4] presented a randomization technique in which respondents are given an option to provide their true response or use a randomization procedure to report a random response. If any of the respondents chooses to report a scrambled answer, he/she has to apply additive scrambling to answer the sensitive question. An improved multiplicative variant of the Gupta et al. [4] scrambling method was introduced in the research study of Bar-Lev et al. [5]. Later on, Gjestvang and Singh [6] presented an improved optional method which utilized an additive random variable. Diana and Perri [7] presented an improved randomized response strategy by using additive as well as multiplicative random noise. Likewise, Al-Sobhi et al. [8] suggested an additive-subtractive scrambling method for quantitative data.
Gupta et al. [9] proposed a unified metric that combines the privacy level and efficiency into a single quantity. Later, Narjis and Shabbir [10] proposed a modification of the Gjestvang and Singh [6] procedure and showed that it is more efficient than the original technique. Khalil et al. [11] analyzed the impact of observational error on estimation of the mean of a finite population. Gupta et al. [12] proposed an improved randomization technique and showed its improvement over the Diana and Perri [7] randomization procedure. The comparison took into account both the privacy level and the model’s efficiency.
Singh et al. [13] studied the elimination of the influence of non-response using a randomized scrambling technique. Gupta et al. [14] developed an estimator of the population variance using the Diana and Perri [7] randomization strategy. Saleem and Sanaullah [15] suggested estimators of the population mean under randomized response techniques. Zapata et al. [16] suggested an improvement to Warner’s [1] technique. In a recent study, Azeem [17] proposed a weighted measure of efficiency and privacy for assessing the performance of randomized response techniques. Murtaza et al. [18] analyzed randomized response models under the assumption of correlated variables. Further research on various forms of randomized response techniques can be found in Yan et al. [19], Young et al. [20], Zhang et al. [21], Azeem et al. [22], Azeem [23], and Azeem and Salam [24], among others.
2. Selected randomized response models for comparison
Suppose our population under consideration contains a total of N elements and we draw a simple random sample containing n elements from our population. We denote the quantitative sensitive variable of interest by Y, and we also consider an additive random variable, say S. We also assume that E(Yi) = μY, E(S) = θ, Var(Yi) = σY², and Var(S) = σS², where σY² and σS² denote, respectively, the population variance of Y and S; and let μY and θ be the notations for the population mean of the variable of interest Y and random variable S, respectively. Further, let T be the notation for another scrambling variable, for which we assume that E(T) = 1, along with Var(T) = σT². Finally, we assume that all three variables under consideration are independent of each other.
2.1 The Gjestvang and Singh [6] quantitative model
Using the Gjestvang and Singh [6] optional scrambling model, the observed responses may be expressed as:
(1)
where α and β denote the constants and are determined by the interviewer.
An estimator of μY using the Gjestvang and Singh [6] scrambling procedure may be expressed as:
(2)
The variance of
can be derived as:
(3)
2.2 Diana and Perri [7] model
The randomization strategy of Diana and Perri [7] uses both additive and multiplicative scrambling. The observed responses based on this model are given as:
(4)
Assuming E(S) = 0, an unbiased estimator of the sensitive variable using the Diana and Perri [7] model can be expressed as:
(5)
The variance of
can be derived as:
(6)
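The unbiasedness and variance claims for a model with both additive and multiplicative scrambling can be checked numerically. The sketch below assumes the response takes the usual form Z = TY + S with E(T) = 1 and E(S) = 0; this form, the function names, and all numeric values are assumptions chosen for illustration, not values from the study. Under independence, the per-response variance of such a model works out to σY² + σT²(σY² + μY²) + σS², and the simulation compares that expression with the empirical variance.

```python
import random
import statistics


def scramble_mult_add(y_values, sd_s, sd_t, rng):
    """Return z = t*y + s for each y (assumed additive-multiplicative form).

    t ~ Normal(1, sd_t) and s ~ Normal(0, sd_s), drawn independently of y.
    """
    return [rng.gauss(1.0, sd_t) * y + rng.gauss(0.0, sd_s) for y in y_values]


def unit_variance_theory(mu_y, var_y, var_s, var_t):
    """Var(TY + S) for independent T, Y, S with E(T) = 1 and E(S) = 0."""
    return var_y + var_t * (var_y + mu_y ** 2) + var_s


rng = random.Random(2023)

# Hypothetical population of the sensitive variable (illustrative values only).
mu_y, sd_y = 3.0, 0.5
y = [rng.gauss(mu_y, sd_y) for _ in range(100_000)]

z = scramble_mult_add(y, sd_s=0.7, sd_t=0.4, rng=rng)

mean_z = statistics.fmean(z)
var_z = statistics.pvariance(z)
var_theory = unit_variance_theory(mu_y, sd_y ** 2, 0.7 ** 2, 0.4 ** 2)

print(f"mean of scrambled responses: {mean_z:.3f} (target mu_Y = {mu_y})")
print(f"empirical variance: {var_z:.3f}, theoretical variance: {var_theory:.3f}")
```

Dividing the per-response variance by n gives the variance of the sample mean under simple random sampling with replacement. The μY² term shows why multiplicative scrambling becomes costly when the mean of the sensitive variable is large relative to its spread.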
2.3 Narjis and Shabbir [10] optional scrambling model
Using the Narjis and Shabbir [10] optional scrambling model, the observed responses can be expressed as:
(7)
where γ is a constant and its value is chosen by the interviewer before the survey is conducted.
An estimator of μY using the Narjis and Shabbir [10] optional technique may be written as:
(8)
where Zi has been provided in Eq (7).
2.4 Gupta et al. [12] scrambling technique
Using the Gupta et al. [12] scrambling technique, the observed responses are:
(10)
where W indicates the sensitivity level, and A denotes a constant such that 0 < A < 1. Using the Gupta et al. [12] technique and assuming E(S) = 0, an unbiased estimator can be expressed as:
(11)
The variance of
can be derived as:
(12)
3. Re-formulating the variance expressions for comparison
For the purpose of unbiased estimation of population mean, the Diana and Perri [7] and the Gupta et al. [12] models assumed that E(S) = 0. On the other hand, the Gjestvang and Singh [6] and the Narjis and Shabbir [10] models assumed that E(S) = θ. In order to make the comparison simple and uniform, we assume that E(S) = 0 for all four models. Thus, the sampling variance under the Gjestvang and Singh [6] technique may be re-written as:
(13)
In the same way, the variance of the sample mean using the Narjis and Shabbir [10] optional technique may be re-written as:
(14)
The Narjis and Shabbir [10] and the Gupta et al. [12] models use different notations for the probabilities of the various types of responses. For the purpose of comparison, we express the variance of the mean under both models in identical notation. Equating the probability of true response for the Narjis and Shabbir [10] and the Gupta et al. [12] models, we get:
(15)
or
(16)
Equating the probability of additive scrambling of the Narjis and Shabbir [10] and the Gupta et al. [12] models, we get:
or
or
(17)
Thus
(18)
and
(19)
Using Eq (16) to Eq (19), the variance of the mean using the Gupta et al. [12] technique can be written as:
(20)
4. Performance evaluation
A measure of privacy level was presented by Yan et al. [19], which is given below:
(21)
From Eq (21) one may observe that a larger value of ∇ is preferable as it shows a higher level of respondents’ privacy.
The Gupta et al. [9] combined metric of privacy level and efficiency can be expressed as:
(22)
From Eq (22), it is clear that smaller values of δ are desirable.
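Both metrics can be estimated by simulation whenever true and scrambled values are available side by side. The sketch below assumes the commonly used definitions ∇ = E[(Z − Y)²] for the privacy level and δ = Var(sample mean)/∇ for the combined metric; these definitions, the helper names, and the simple additive model Z = Y + S used as the example are assumptions for illustration, not the four models compared in this study.

```python
import random
import statistics


def privacy_level(true_values, reported_values):
    """Empirical nabla: mean squared gap between reported and true values."""
    squared_gaps = [(z - y) ** 2 for y, z in zip(true_values, reported_values)]
    return statistics.fmean(squared_gaps)


def combined_metric(reported_values, sample_size, nabla):
    """Empirical delta: variance of the mean of `sample_size` responses over nabla."""
    var_of_mean = statistics.pvariance(reported_values) / sample_size
    return var_of_mean / nabla


rng = random.Random(11)

# Hypothetical true values and a plain additive scrambling Z = Y + S.
y = [rng.gauss(3.0, 0.5) for _ in range(100_000)]
z = [yi + rng.gauss(0.0, 1.0) for yi in y]

nabla = privacy_level(y, z)
delta = combined_metric(z, sample_size=40, nabla=nabla)

# For this additive example, theory gives nabla = sigma_S^2 = 1 and
# delta = (sigma_Y^2 + sigma_S^2) / (n * sigma_S^2) = 1.25 / 40.
print(f"nabla ~ {nabla:.3f}, delta ~ {delta:.4f}")
```

Because δ divides the variance by the privacy level, a model can achieve a small δ either by being precise or by offering strong privacy, which is why the separate and combined metrics can rank the same models differently.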
The measure of privacy using the Gjestvang and Singh [6] method can be written as:
(23)
The combined metric of efficiency and privacy level using the Gjestvang and Singh [6] method may be derived as follows:
(24)
The measure of privacy level using the Diana and Perri [7] model is given by:
(25)
The combined metric of privacy and efficiency using the Diana and Perri [7] model can be expressed as:
(26)
Using the Narjis and Shabbir [10] optional model, the metric of privacy-level may be derived as:
(27)
The combined metric of efficiency and privacy level using the Narjis and Shabbir [10] technique can be derived as:
(28)
Finally, the metric of privacy level using the Gupta et al. [12] optional technique can be written as:
(29)
The combined metric of the privacy level and model efficiency using the Gupta et al. [12] technique can be obtained as:
(30)
5. Conditions for efficiency
Here we present the mathematical conditions under which one model is more efficient than another.
5.1 Gupta et al. [12] quantitative model vs. Narjis and Shabbir [10] scrambling model
The Gupta et al. [12] scrambling technique is more precise than the Narjis and Shabbir [10] technique, if:
or
or
or
(31)
5.2 Gupta et al. [12] quantitative model vs. Gjestvang and Singh [6] quantitative model
The Gupta et al. [12] technique is more precise than the Gjestvang and Singh [6] model, if:
or
or
or
(32)
5.3 Gupta et al. [12] model vs. Diana and Perri [7] model
The Gupta et al. [12] technique is more precise than the Diana and Perri [7] model, if
or
or
(33)
Condition (33) is always true.
5.4 Narjis and Shabbir [10] scrambling model vs. Diana and Perri [7] model
The Narjis and Shabbir [10] scrambling model will be more precise than the Diana and Perri [7] model, if
or
or
(34)
6. A real-world survey
We applied the four selected techniques in a practical survey of 40 undergraduate students selected from those registered in the Department of Mathematics of the University of Malakand, Pakistan. We were interested in estimating the mean grade point average (GPA) of the students. Each selected participant was given a deck of 100 cards and a calculator. Each card displayed a random number for each of the two scrambling variables T and S, generated from a normal distribution. For the scrambling variable S, the normal distribution had mean 0 and variance 0.5. Likewise, for the variable T, the distribution had mean 1 and variance 0.5.
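A deck of this kind could be generated as in the following minimal sketch. The function name, the dictionary layout, and the rounding to three decimals are assumptions for illustration; only the deck size and the distribution parameters come from the description above. Note that Python's `random.gauss` expects a standard deviation, so the stated variance of 0.5 is converted with a square root.

```python
import math
import random
import statistics


def build_deck(n_cards=100, var_s=0.5, var_t=0.5, seed=None):
    """Return a list of cards, each holding one S value and one T value.

    S ~ Normal(mean 0, variance var_s); T ~ Normal(mean 1, variance var_t).
    """
    rng = random.Random(seed)
    sd_s = math.sqrt(var_s)  # gauss() takes a standard deviation, not a variance
    sd_t = math.sqrt(var_t)
    return [
        {"S": round(rng.gauss(0.0, sd_s), 3), "T": round(rng.gauss(1.0, sd_t), 3)}
        for _ in range(n_cards)
    ]


deck = build_deck(seed=1)
print(f"cards in deck: {len(deck)}")
print(f"first card: {deck[0]}")
print(f"mean of S on the cards: {statistics.fmean([c['S'] for c in deck]):.3f}")
print(f"mean of T on the cards: {statistics.fmean([c['T'] for c in deck]):.3f}")
```

With only 100 cards, the means of S and T on the deck will differ somewhat from 0 and 1, so the realized deck rather than the nominal distribution determines the scrambling actually applied.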
The survey procedure using a randomized response technique has been presented in Fig 1.
The values of the constants α, β, and γ were chosen based on some prior knowledge about the population under study. If prior information is not available, a pilot survey may be conducted to obtain an estimate of the constants. For this survey, we decided to choose α = 3, β = 3 and γ = 4, so that ,
, and
. Using these choices of constants, the observed response function under the Gjestvang and Singh [6] model may be expressed as follows:
(37)
The observed response under the Diana and Perri [7] model may be expressed as follows:
(38)
The observed response using the Narjis and Shabbir [10] model may be expressed as follows:
(39)
The observed response using the Gupta et al. [12] scrambling method may be expressed as follows:
(40)
Using Eq (37), each card displayed one of the two types of instructions for the Gjestvang and Singh [6] model:
50 out of 100 cards had the instruction: “Add 3 times the value of S with your GPA and report the number you get.”
The remaining 50 cards had the instruction: “Subtract 3 times the value of S from your GPA and report the number you get.”
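The two card instructions above can be simulated to see why the sample mean of the reported values still targets the mean GPA. The sketch below is an illustration under stated assumptions: the true GPAs are randomly generated placeholders (not the survey data), S is drawn afresh for each response rather than read from a physical card, and the function names are hypothetical.

```python
import random
import statistics

ALPHA = 3  # multiplier on the "add" cards
BETA = 3   # multiplier on the "subtract" cards
SD_S = 0.5 ** 0.5  # S has variance 0.5


def build_instruction_deck(n_cards=100, n_add=50):
    """Deck with n_add 'add' cards and the rest 'subtract' cards."""
    return ["add"] * n_add + ["subtract"] * (n_cards - n_add)


def report(gpa, deck, rng):
    """One respondent draws a card at random and reports the scrambled GPA."""
    card = rng.choice(deck)
    s = rng.gauss(0.0, SD_S)
    return gpa + ALPHA * s if card == "add" else gpa - BETA * s


rng = random.Random(7)

# Hypothetical true GPAs for 40 students (placeholders, not the survey data).
true_gpas = [round(rng.uniform(2.0, 4.0), 2) for _ in range(40)]
deck = build_instruction_deck()

# Repeat the 40-student survey many times to see the estimator's behaviour.
estimates = []
for _ in range(2000):
    responses = [report(g, deck, rng) for g in true_gpas]
    estimates.append(statistics.fmean(responses))

one_survey = [report(g, deck, rng) for g in true_gpas]
print(f"true mean GPA: {statistics.fmean(true_gpas):.3f}")
print(f"average estimate over repeated surveys: {statistics.fmean(estimates):.3f}")
print(f"range of one survey's responses: {min(one_survey):.3f} to {max(one_survey):.3f}")
```

Individual reported values can fall well outside the 0 to 4 GPA range, including below zero, because 3S has a standard deviation of about 2.1. That spread is what protects the respondent, while the scrambling terms cancel on average across the sample.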
In the same manner, the survey was repeated four times with different instructions on each card for each model. Each student was instructed to choose one card at random and read the instructions on the selected card to report the response. To ensure privacy protection, the participants were also asked neither to disclose the card they selected, nor their true GPA. The respondents were only required to report the scrambled / masked value. The responses recorded from the participants under different models are presented in Tables 1–4.
It may be observed that the responses recorded under the Diana and Perri [7] model are all positive, ranging from 0.807 to 5.712. The feasible range of a student’s GPA is 0 to 4; however, the scrambling process produced a few out-of-range responses. The Gupta et al. [12] model produced only one negative response, with the responses ranging from -0.267 to 5.458. The Gjestvang and Singh [6] model and the Narjis and Shabbir [10] model both produced several negative and out-of-range responses. The sample means of the data in Tables 1–4 are 2.35, 2.85, 2.31, and 2.54, respectively.
7. Numerical illustration
Table 5 displays the sampling variance under different models for various choices of values of α, β, and γ. Tables 6 and 7 present ∇ and δ values, respectively, using various randomized response models.
8. Discussion and conclusion
The present study presented a detailed comparison of four available quantitative randomized response techniques: (i) the Gjestvang and Singh [6] model, (ii) the Diana and Perri [7] model, (iii) the Narjis and Shabbir [10] scrambling technique, and (iv) the Gupta et al. [12] optional model. The mathematical conditions for efficiency comparison of the various models were obtained. We found that some of the efficiency conditions are always true, whereas others are not. Table 5 shows that when σT = 0.4, the Gupta et al. [12] technique appears to be the most efficient among all four randomized response techniques. However, for σT = 1, the Narjis and Shabbir [10] randomization technique is more precise than the Gupta et al. [12] model for a variety of choices of α, β, and γ. It is also clear that, since the Gjestvang and Singh [6] and the Narjis and Shabbir [10] models do not use the multiplicative scrambling variable T, the variance of the mean under these models does not change when σT changes. Moreover, the variance of the mean under the Diana and Perri [7] model does not change when the values of α, β, and γ change. It is also observed that the Diana and Perri [7] scrambling model is the worst among all four models in terms of efficiency.
The quality of randomized response models cannot be judged solely on model efficiency. Protection of the respondents’ privacy is also an important aspect in judging the usefulness of randomized response techniques. The level of privacy may be quantified by the values of ∇, where a larger value indicates a better privacy level. Table 6 shows the ∇ values for various values of α, β, and γ. It indicates that for σT = 0.4, the Gjestvang and Singh [6] optional model is the best among all four models when privacy protection of the respondents is taken into account. The performance of the Diana and Perri [7] model is also observed to be better than that of the Gupta et al. [12] model. However, when σT = 1, the Diana and Perri [7] model becomes the best among all models for most choices of α, β, and γ.
Finally, the overall usefulness of the four quantitative models is compared using δ values, with the results shown in Table 7. It is clearly observed in Table 7 that the Diana and Perri [7] model performs best with respect to δ values when σT = 0.4. However, for σT = 1, the Gupta et al. [12] scrambling method reduces the values of δ to a minimum level, which makes it the most useful of the four models in terms of overall quality.
We conclude that a randomized response model which performs best in one situation may perform worst in another. In practical problems, researchers should therefore be aware of the situation at hand when choosing a particular randomized response model for collecting data from respondents. Researchers may prefer one randomized response technique when respondent privacy is more important than efficiency. On the other hand, if model efficiency is more important to the researcher than privacy protection, a different randomized response model may be more useful.
The present study compared four available randomized response models. We found that, depending on the choice of parameters, one model can perform better than another, and vice versa. The current study is limited to quantitative models; however, in many practical problems, the variable under consideration may be qualitative. Therefore, a neutral comparative analysis of qualitative models would be of interest. Moreover, the current study is limited to the case of no correlation among variables. In practice, some degree of correlation may exist among variables, which may affect the findings of the comparison. We therefore recommend that future researchers perform a comparative assessment of randomized response models assuming correlated variables, as it may give further interesting results.
References
- 1. Warner SL. Randomized response: A survey technique for eliminating evasive answer bias. Journal of the American Statistical Association. (1965); 60(309): 63–69. https://doi.org/10.1080/01621459.1965.10480775 pmid:12261830
- 2. Warner SL. The linear randomized response model. Journal of the American Statistical Association. (1971); 66(336): 884–888. https://doi.org/10.1080/01621459.1971.10482364
- 3. Eichhorn BH, Hayre LS. Scrambled randomized response methods for obtaining sensitive quantitative data. Journal of Statistical Planning and Inference. (1983); 7: 307–316.
- 4. Gupta S, Gupta B, Singh S. Estimation of sensitivity level of personal interview survey questions. Journal of Statistical Planning and Inference. (2002); 100(2): 239–247.
- 5. Bar-Lev SK, Bobovitch E, Boukai B. A note on randomized response models for quantitative data. Metrika. (2004); 60(3): 255–260.
- 6. Gjestvang CR, Singh S. An improved randomized response model: Estimation of mean. Journal of Applied Statistics. (2009); 36(12): 1361–1367.
- 7. Diana G, Perri PF. A class of estimators of quantitative sensitive data. Statistical Papers. (2011); 52(3): 633–650.
- 8. Al-Sobhi MM, Hussain Z, Al-Zahrani B, Singh HP, Tarray TA. Improved randomized response approaches for additive scrambling models. Mathematical Population Studies. (2016); 23: 205–221.
- 9. Gupta S, Mehta S, Shabbir J, Khalil S. A unified measure of respondent privacy and model efficiency in quantitative RRT models. Journal of Statistical Theory and Practice. (2018); 12(3): 506–511.
- 10. Narjis G, Shabbir J. An efficient new scrambled response model for estimating sensitive population mean in successive sampling. Communications in Statistics–Simulation and Computation. (2021); 1–18. https://doi.org/10.1080/03610918.2021.1986528
- 11. Khalil S, Zhang Q, Gupta S. Mean estimation of sensitive variables under measurement errors using optional RRT models. Communications in Statistics–Simulation and Computation. (2021); 50(5): 1417–1426.
- 12. Gupta S, Zhang J, Khalil S, Sapra P. Mitigating lack of trust in quantitative randomized response technique models. Communications in Statistics–Simulation and Computation. (2022); 1–9. https://doi.org/10.1080/03610918.2022.2082477
- 13. Singh C, Kamal M, Singh GN, Kim JM. Study to alter the nuisance effect of non-response using scrambled mechanism. Risk Management and Healthcare Policy. (2021); 1595–1613. pmid:33889040
- 14. Gupta S, Aloraini B, Qureshi MN, Khalil S. Variance estimation using randomized response technique. REVSTAT–Statistical Journal. (2020); 18(2): 165–176.
- 15. Saleem I, Sanaullah A. Estimation of mean of a sensitive variable using efficient exponential-type estimators in stratified sampling. Journal of Statistical Computation and Simulation. (2022); 92(2): 232–248. https://doi.org/10.1080/00949655.2021.1940182
- 16. Zapata Z, Sedory SA, Singh S. An innovative improvement in Warner’s randomized response device for evasive answer bias. Journal of Statistical Computation and Simulation. (2022). https://doi.org/10.1080/00949655.2022.2101649
- 17. Azeem M. Introducing a weighted measure of privacy and efficiency for comparison of quantitative randomized response models. Pakistan Journal of Statistics. (2023); 39(3): 377–385.
- 18. Murtaza M, Singh S, Hussain Z. Use of correlated scrambling variables in quantitative randomized response technique. Biometrical Journal. (2021); 63(1): 134–147. pmid:33103272
- 19. Yan Z, Wang J, Lai J. An efficiency and protection degree-based comparison among the quantitative randomized response strategies. Communications in Statistics–Theory and Methods. (2008); 38(3): 400–408.
- 20. Young A, Gupta S, Parks RA. A binary unrelated-question RRT model accounting for untruthful responding. Involve, A Journal of Mathematics. (2019); 12(7): 1163–1173.
- 21. Zhang Q, Khalil S, Gupta S. Mean estimation in the simultaneous presence of measurement errors and non-response using optional RRT models under stratified sampling. Journal of Statistical Computation and Simulation. (2021); 91(17): 3492–3504.
- 22. Azeem M, Hussain S, Ijaz M, Salahuddin N. An improved quantitative randomized response technique for data collection in sensitive surveys. Quality and Quantity. https://doi.org/10.1007/s11135-023-01652-5
- 23. Azeem M. Using the exponential function of scrambling variable in quantitative randomized response models. Mathematical Methods in the Applied Sciences. (2023). https://doi.org/10.1002/mma.9295
- 24. Azeem M, Salam A. Introducing an efficient alternative technique to optional quantitative randomized response models. Methodology. (2023); 19(1): 24–42. https://doi.org/10.5964/meth.9921