Abstract
In this paper, we propose a partial randomized response technique for collecting reliable sensitive data to estimate a population proportion under the ranked set sampling (RSS) scheme using auxiliary information. The idea is to increase the confidence and (or) cooperation of respondents by offering them the option of either a ‘direct’ or a ‘randomized’ response to the sensitive question. This option is reasonable because the perceived sensitivity of an inquiry can vary among respondents. The properties of the proposed method are discussed and compared with existing randomized response techniques. A cost analysis is also carried out to demonstrate the superiority of the suggested method. Finally, an application to a clinical trial on AIDS is included.
Citation: Abbasi AM, Shad MY, Ahmed A (2022) On partial randomized response model using ranked set sampling. PLoS ONE 17(11): e0277497. https://doi.org/10.1371/journal.pone.0277497
Editor: Beatriz Cobo, University of Granada: Universidad de Granada, SPAIN
Received: April 12, 2022; Accepted: October 8, 2022; Published: November 29, 2022
Copyright: © 2022 Abbasi et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All relevant data are within the paper and its Supporting information files.
Funding: The authors received no specific funding for this work.
Competing interests: The authors have declared that no competing interests exist.
1 Introduction
Let Y = {y1, y2, …, yM} be a finite population in which estimating the proportion of individuals having a sensitive attribute is of prime interest. When asked for such data, a respondent may be reluctant to disclose the truth, or may provide incorrect information, owing to privacy concerns. Therefore, such data collected through conventional (direct) methods are likely to contain response bias. In this connection, [1] proposed the randomized response method with the objective of collecting trustworthy data while protecting the privacy of the interviewee (respondent). This method, based on a randomizing device, requires the interviewee to report ‘yes’ or ‘no’ according to the statement selected by the randomizing device, without revealing that statement to the interviewer. This technique is encouraging for respondents because the interviewer cannot identify which question has been answered. Following this pioneering work in the simple random sampling (SRS) framework, researchers have proposed a variety of methods that protect the respondent’s privacy in different ways and (or) give more precise estimates under the SRS scheme. For example, see [2–5] and references therein.
McIntyre [6] devised the RSS scheme as an alternative to the conventional SRS scheme for investigating population characteristics. This method ranks small sets of units by eye, or by any other method that involves minimal cost, before the final sample is acquired. To choose a sample of size m under the RSS scheme, the experimenter first identifies m sets of units, each of size m, from the target population and ranks the units within each set. Then, for i = 1, 2, …, m, the ith smallest unit is selected from the ith set. The same process can be repeated r times, if required, to obtain a sample of size mr. After the development of the RSS scheme, [7] established its mathematical foundation. To cope with situations in which direct ranking of the main variable is troublesome, [8] suggested indirect ranking using a concomitant variable that is highly correlated with the variable of interest. In this method, ranking is done on an auxiliary variable to choose the corresponding units of the main variable. For instance, in a forest survey we may wish to estimate the height of trees, but ranking trees by height may be problematic. On the other hand, the diameter of a tree, which is highly correlated with its height, is easy to measure.
To select m (≥ 2) units, the RSS scheme based on an auxiliary variable X can be described as follows: (i) construct m sets of units, each of size m, from the bivariate population (X, Y); (ii) in each set, obtain the exact measurement of X and arrange Y with respect to X; (iii) select the Y value corresponding to the ith (i = 1, 2, …, m) ordered unit of X in the ith set. Steps (i)–(iii) can be repeated r times, if desired, to obtain a final sample of size N = mr. For more literature on and applications of the RSS design, the interested reader is referred to [9–13].
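Steps (i)–(iii) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `draw_rss`, the synthetic population, and the choice to draw each set without replacement are hypothetical and are not taken from the paper.

```python
import math
import random

def draw_rss(population, m, r, rng):
    """Concomitant-based ranked set sample of size m * r.

    population: list of (x, y) pairs, where x is the auxiliary variable.
    Returns a list of (rank, cycle, x, y) tuples.
    """
    sample = []
    for cycle in range(1, r + 1):
        for rank in range(1, m + 1):
            # Step (i): one set of m units (drawn without replacement; an assumption).
            units = rng.sample(population, m)
            # Step (ii): order the set using the auxiliary variable X only.
            units.sort(key=lambda unit: unit[0])
            # Step (iii): keep the Y value attached to the rank-th ordered X.
            x, y = units[rank - 1]
            sample.append((rank, cycle, x, y))
    return sample

# Hypothetical population: X ~ N(2, 1) and Y | X = x ~ Bernoulli(inverse logit(-5 + 5x)).
rng = random.Random(2022)
population = []
for _ in range(200):
    x = rng.gauss(2.0, 1.0)
    prob = 1.0 / (1.0 + math.exp(-(-5.0 + 5.0 * x)))
    population.append((x, 1 if rng.random() < prob else 0))

sample = draw_rss(population, m=4, r=2, rng=rng)
for rank, cycle, x, y in sample:
    print(f"cycle {cycle}, rank {rank}: x = {x:.2f}, y = {y}")
```

Only one unit per set is measured on Y, so a sample of size N = mr requires m²r units to be ranked on X.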
Terpstra and Liudahl [14] investigated estimation of the usual (insensitive) population proportion using the RSS scheme and compared it with its SRS rival. Following [14], [15] suggested a new proportion estimator in the RSS scheme and claimed that it works better than that of [14]. Later, [16] pointed out some shortcomings of the estimator in [15] and suggested new, improved estimators in RSS schemes.
Recently, [17] introduced a new sensitive proportion estimator in the RSS scheme that outperforms the SRS-based competitor suggested by [1]. In this study, keeping in view the differing perceptions of a sensitive inquiry, we suggest a mixed (partial randomized response) method for acquiring reliable data using the RSS scheme, wherein ranking is done via a continuous auxiliary variable. We discuss its properties and compare it with the SRS and RSS competitors given in [1] and [17], respectively. The rest of this study is arranged as follows. A review of existing sensitive proportion estimators is given in Section 2. The proposed estimator, along with its properties, is developed in Section 3. An application to AIDS data is considered in Section 4. A cost analysis is carried out in Section 5, and final remarks are included in Section 6.
2 A review of existing methods
Let Y1j, Y2j, …, Ymj denote a simple random sample with replacement (SRSWR) of size m in the jth cycle, for j = 1, 2, …, r, and let π denote the population proportion possessing the sensitive attribute A. To collect a truthful response, each interviewee is provided with a randomizing device, say a spinner, to choose one of the following two statements without revealing it to the interviewer:
- (a) I have the sensitive attribute A
- (b) I do not have the sensitive attribute A
with respective pre-determined selection probabilities p ≠ 0.5 and 1 − p. Each interviewee uses the device and reports ‘yes’ or ‘no’ according to the selected statement and his/her true status. Then the probability of a ‘yes’ response from the ith (i = 1, 2, …, m) respondent in the jth cycle can be written as
$$P(\text{yes}) = p\pi + (1-p)(1-\pi).$$
Let m1 be the total number of ‘yes’ responses in the sample of size N = mr. [1] established the maximum likelihood (ML) estimate of π as
$$\hat{\pi}_w = \frac{m_1/N - (1-p)}{2p-1},$$
and showed that $\hat{\pi}_w$ is unbiased with variance
$$Var(\hat{\pi}_w) = \frac{\pi(1-\pi)}{N} + \frac{p(1-p)}{N(2p-1)^2}. \tag{2.1}$$
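The randomizing device and the estimator of [1] can be illustrated with a small simulation. The sketch below uses the standard form of Warner's estimator, (m1/N − (1 − p))/(2p − 1), and of its variance; the function names, the seed, and the simulated true proportion of 0.3 are hypothetical choices made only for this example.

```python
import random

def warner_response(has_attribute, p, rng):
    """One randomized answer: 1 for 'yes', 0 for 'no'."""
    # With probability p the device selects statement (a), otherwise statement (b).
    statement_a = rng.random() < p
    # 'yes' means the selected statement matches the respondent's true status.
    return 1 if statement_a == has_attribute else 0

def warner_estimate(m1, n, p):
    """Estimate of pi from m1 'yes' answers among n respondents."""
    return (m1 / n - (1 - p)) / (2 * p - 1)

def warner_variance(pi, n, p):
    """Sampling variance plus the extra noise introduced by the device."""
    return pi * (1 - pi) / n + p * (1 - p) / (n * (2 * p - 1) ** 2)

rng = random.Random(1)
p, n, true_pi = 0.2, 5000, 0.3          # hypothetical settings
truth = [rng.random() < true_pi for _ in range(n)]
m1 = sum(warner_response(t, p, rng) for t in truth)
print(round(warner_estimate(m1, n, p), 3), round(warner_variance(true_pi, n, p), 6))
```

The second variance term grows without bound as p approaches 0.5, which is why p ≠ 0.5 is required.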
Let Y be a Bernoulli variate and X a continuous insensitive auxiliary variable having cumulative distribution function (cdf) FX(x). Let Y given X = x also follow a Bernoulli distribution, i.e., $Y \mid X = x \sim \text{Bernoulli}(g(x))$, where g(x) is a function with range (0, 1). From [17], the marginal distribution of Y is Bernoulli with parameter π = E[g(X)], where ‘E’ stands for expected value. In this study we assume that π = E[g(β0 + β1X)], β0, β1 ∈ ℜ, g(⋅) is an inverse logit or probit link function, and X follows either a normal distribution with mean 2 and variance 1 or a standard uniform distribution.
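As an illustration of the parameter π = E[g(β0 + β1X)], the expectation can be approximated numerically. The sketch below assumes the inverse logit link and a simple midpoint rule; the function names, integration limits and step count are hypothetical choices, and the pair (β0, β1) = (−5, 5) is one of the settings examined later in the paper.

```python
import math

def inv_logit(t):
    """Inverse logit link g(t) = 1 / (1 + exp(-t))."""
    return 1.0 / (1.0 + math.exp(-t))

def normal_pdf(x, mu=2.0, sigma=1.0):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def uniform_pdf(x):
    return 1.0 if 0.0 <= x <= 1.0 else 0.0

def expected_g(beta0, beta1, pdf, lo, hi, steps=20000):
    """Midpoint-rule approximation of pi = E[g(beta0 + beta1 * X)]."""
    h = (hi - lo) / steps
    total = 0.0
    for s in range(steps):
        x = lo + (s + 0.5) * h
        total += inv_logit(beta0 + beta1 * x) * pdf(x)
    return total * h

pi_normal = expected_g(-5.0, 5.0, normal_pdf, -6.0, 10.0)   # X ~ N(2, 1), truncated at 8 sd
pi_uniform = expected_g(-5.0, 5.0, uniform_pdf, 0.0, 1.0)   # X ~ U(0, 1)
print(round(pi_normal, 3), round(pi_uniform, 3))
```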
Let {(Y[i]j, X(i)j): i = 1, 2, …, m; j = 1, 2, …, r} be an RSS of size N, where Y[i]j denotes the ith imperfectly ranked unit in the jth cycle and X(i)j is the ith perfectly ranked unit in the jth cycle. From [14], Y[i] is Bernoulli with mean π[i] = E[g(β0 + β1X(i))] and variance $\pi_{[i]}(1-\pi_{[i]})$. [14] introduced the insensitive proportion estimator in RSS, denoted here by $\hat{\pi}_{t}$, which is given by
$$\hat{\pi}_{t} = \frac{1}{mr}\sum_{i=1}^{m}\sum_{j=1}^{r} Y_{[i]j}. \tag{2.2}$$
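Let the respondents under the RSS scheme be directed to select one of the two aforesaid statements (a) and (b) using the randomizing device suggested in [1]. The respondent reports ‘yes’ (‘no’) according to the outcome of the device and his/her actual status. Let Y[i]j = 1 if the ith ranked unit reports ‘yes’. Then
$$P(Y_{[i]j} = 1) = p\pi_{[i]} + (1-p)(1-\pi_{[i]}).$$
Let $Y_{[i]1}, Y_{[i]2}, \ldots, Y_{[i]r}$ be independent and identically distributed (i.i.d.) Bernoulli randomized responses with parameter pπ[i] + (1 − p)(1 − π[i]). Then the ML estimate of π[i] for the given data $y_{[i]1}, y_{[i]2}, \ldots, y_{[i]r}$ is
$$\hat{\pi}_{[i]} = \frac{\hat{\lambda}_{[i]} - (1-p)}{2p-1},$$
where $\hat{\lambda}_{[i]} = r_{[i]}/r$ and $r_{[i]} = \sum_{j=1}^{r} y_{[i]j}$ is the total number of successes observed under the ith ranked unit. Now, the overall estimate of π, using the relation $\pi = \frac{1}{m}\sum_{i=1}^{m}\pi_{[i]}$ (see [17]), is given by
$$\hat{\pi}_a = \frac{1}{m}\sum_{i=1}^{m}\hat{\pi}_{[i]} \tag{2.3}$$
and its associated variance is
$$Var(\hat{\pi}_a) = \frac{1}{mN}\sum_{i=1}^{m}\pi_{[i]}(1-\pi_{[i]}) + \frac{p(1-p)}{N(2p-1)^2}. \tag{2.4}$$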
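The rank-wise correction described above can be sketched as follows. The sketch assumes that each rank's share of ‘yes’ answers is corrected by the Warner-type formula and that the corrected values are then averaged over ranks; the variance function assumes independent ranks with π[i](1 − π[i]) plus randomization noise per rank. All function names and the toy data are hypothetical.

```python
def rank_estimates(responses, p):
    """responses[i] holds the r randomized 0/1 answers observed for rank i + 1."""
    estimates = []
    for answers in responses:
        lam_hat = sum(answers) / len(answers)          # share of 'yes' for this rank
        # Bias correction for the device; the result may fall outside [0, 1].
        estimates.append((lam_hat - (1 - p)) / (2 * p - 1))
    return estimates

def rss_rr_estimate(responses, p):
    """Average of the rank-wise estimates."""
    estimates = rank_estimates(responses, p)
    return sum(estimates) / len(estimates)

def rss_rr_variance(pi_ranked, r, p):
    """Variance under independent ranks; pi_ranked[i] is pi_[i+1]."""
    m = len(pi_ranked)
    noise = p * (1 - p) / (2 * p - 1) ** 2
    return sum(q * (1 - q) + noise for q in pi_ranked) / (m * m * r)

# Toy data: m = 4 ranks, r = 2 cycles, p = 0.2.
toy = [[1, 0], [0, 0], [1, 1], [0, 1]]
print(rank_estimates(toy, 0.2), rss_rr_estimate(toy, 0.2))
```

When all π[i] are equal the variance coincides with the SRS-based value, and any spread among the π[i] reduces it, which is the source of the gain from ranking.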
3 The proposed method
In this section, we propose a partial randomized response model to gather reliable data for estimating a sensitive population proportion under the RSS scheme. This technique is applicable when some respondents prefer a ‘direct response’ to a ‘randomized response’ for a sensitive inquiry. Suppose that, taking cost and time into account, the experimenter decides to collect k ≥ 2 ‘direct responses’ by the technique introduced by [14] and the remaining m − k ≥ 2 units through the randomized response technique suggested by [17]. Assuming that k² + (m − k)² identifiable units are available, and following the lines of Eqs (2.2) and (2.3), we propose a mixed (partial randomized response) model given by
$$\hat{\pi}_k = \frac{1}{N}\left[\sum_{i=1}^{k}\sum_{j=1}^{r} Y_{[i]j} + \frac{\sum_{g=1}^{m-k}\sum_{j=1}^{r} Y'_{[g]j} - N'(1-p)}{2p-1}\right], \tag{3.1}$$
where N′ = (N − rk). It is obvious that Y[i]j and $Y'_{[g]j}$ are Bernoulli variates with respective parameters π[i] and λ[g] = (2p − 1)π[g] + (1 − p). It is noteworthy that the estimator $\hat{\pi}_a$ of [17], the estimator of [14] in Eq (2.2) and the estimator $\hat{\pi}_w$ of [1] are special cases of $\hat{\pi}_k$. For k = 0, $\hat{\pi}_k$ reduces to $\hat{\pi}_a$. Similarly, for k = m, $\hat{\pi}_k$ becomes the estimator of [14]; when k = 0 and m = 1, $\hat{\pi}_k$ simplifies to $\hat{\pi}_w$.
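The way direct and randomized responses are pooled can be sketched in code. The sketch assumes that the mixed estimator adds the direct ‘yes’ counts to the bias-corrected randomized ‘yes’ counts and divides by N, with N′ = N − rk randomized responses; this form is an assumption made for illustration (it reproduces the special cases k = 0 and k = m stated above), and the function name and toy data are hypothetical.

```python
def partial_rr_estimate(direct, randomized, p):
    """Mixed estimate from k direct and (m - k) randomized response lists.

    direct:     k lists, each with r direct 0/1 answers.
    randomized: (m - k) lists, each with r randomized 0/1 answers.
    """
    r = len(direct[0]) if direct else len(randomized[0])
    n = r * (len(direct) + len(randomized))      # N = m * r
    n_prime = r * len(randomized)                # N' = N - r * k
    direct_total = sum(sum(answers) for answers in direct)
    randomized_total = sum(sum(answers) for answers in randomized)
    # Remove the expected contribution of statement (b), then rescale by 2p - 1.
    corrected = (randomized_total - n_prime * (1 - p)) / (2 * p - 1) if randomized else 0.0
    return (direct_total + corrected) / n

randomized_only = [[1, 0], [0, 0], [1, 1], [0, 1]]   # k = 0, m = 4, r = 2
direct_only = [[1, 0], [1, 1]]                       # k = m = 2, r = 2
mixed = partial_rr_estimate([[1, 0], [1, 1]], [[0, 0], [1, 0]], p=0.2)  # k = 2, m = 4
print(partial_rr_estimate([], randomized_only, 0.2),
      partial_rr_estimate(direct_only, [], 0.2),
      round(mixed, 4))
```

With k = 0 the function returns the average of the rank-wise corrected estimates, and with k = m it returns the plain sample proportion of the direct answers.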
Lemma: The proposed estimator possesses the following properties:
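- (i) It is an unbiased estimator, i.e., $E(\hat{\pi}_k) = \pi$.
- (ii) It is more precise than the estimators $\hat{\pi}_a$ of [17] and $\hat{\pi}_w$ of [1], i.e., $Var(\hat{\pi}_k) \le Var(\hat{\pi}_a) \le Var(\hat{\pi}_w)$.
Proof:
(i) From Eq (3.1), we have
$$E(\hat{\pi}_k) = \frac{1}{N}\left[r\sum_{i=1}^{k}\pi_{[i]} + \frac{r\sum_{g=1}^{m-k}\lambda_{[g]} - N'(1-p)}{2p-1}\right] = \frac{1}{N}\left[r\sum_{i=1}^{k}\pi_{[i]} + r\sum_{g=1}^{m-k}\pi_{[g]}\right] = \frac{rk\pi + r(m-k)\pi}{N} = \pi.$$
This completes proof (i).
To prove (ii), again using Eq (3.1), we have
$$Var(\hat{\pi}_k) = \frac{1}{mN}\left[\sum_{i=1}^{k}\pi_{[i]}(1-\pi_{[i]}) + \sum_{g=1}^{m-k}\pi_{[g]}(1-\pi_{[g]})\right] + \frac{(m-k)\,p(1-p)}{mN(2p-1)^2}. \tag{3.2}$$
Since the term $\frac{k\,p(1-p)}{mN(2p-1)^2}$, by which the randomization component of Eq (2.4) exceeds that of Eq (3.2), is always greater than or equal to zero, it is easy to observe that $Var(\hat{\pi}_k) \le Var(\hat{\pi}_a)$. Moreover, [17] showed that $Var(\hat{\pi}_a) \le Var(\hat{\pi}_w)$. This completes proof (ii).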
The relative precision (RP) of $\hat{\pi}_k$ with respect to $\hat{\pi}_h$, h = a, w, can be evaluated by the formula
$$RP = \frac{Var(\hat{\pi}_h)}{Var(\hat{\pi}_k)}. \tag{3.3}$$
From [17], the limiting distribution of the proposed estimator, when m is fixed and r → ∞, is given by
$$\sqrt{N}\,(\hat{\pi}_k - \pi) \xrightarrow{d} N\!\left(0,\ \frac{1}{m}\left[\sum_{i=1}^{k}\pi_{[i]}(1-\pi_{[i]}) + \sum_{g=1}^{m-k}\left\{\pi_{[g]}(1-\pi_{[g]}) + \frac{p(1-p)}{(2p-1)^2}\right\}\right]\right). \tag{3.4}$$
Some interesting results can be obtained from Eq (3.4). For k = 0, it gives the result presented in [17]. Furthermore, for k = m or p = 0 (1), it reduces to the result of [14] under direct inquiry about the attribute of interest. Moreover, when both p and m are equal to 1, we get the conventional method of direct interaction with the respondent under SRS. Note that these cases are not frequent, but they can occur. Finally, the variance given in Eq (3.4) can be estimated by substituting π[i] by $\hat{\pi}_{[i]}$. In this way, the estimated variance of the ith ranked unit becomes independent of g(⋅). This validates asymptotic inference from the suggested method.
3.1 Comparison of proportion estimators
In this section, we examine the behavior of the RP given in Eq (3.3) under the inverse logit link function g(β0 + β1X) for different choices of (β0, β1) ∈ {(−10, 7), (−5, 5), (−2, 6)} and p ∈ [0.1, 0.9], assuming that X follows (i) a normal distribution with mean 2 and variance 1, i.e., N(2,1), or (ii) a uniform distribution over the range 0 to 1, i.e., U(0,1). Because the RP formula is independent of r, we vary m (= 4, 5) instead of r to examine the performance of $\hat{\pi}_k$. Furthermore, the magnitude of the correlation coefficient ρ between X and Y is computed under the inverse logit link function g(β0 + β1X) for each of the above pairs (β0, β1) when X follows N(2,1) or U(0,1). The resulting ρ values are close to {0.7, 0.6, 0.5} and {0.6, 0.4, 0.1}, respectively. The results are obtained by numerical integration using Mathematica.
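A rough numerical counterpart of this comparison can be sketched in Python for X ~ N(2,1) and the inverse logit link. Several assumptions are made and are not taken from the paper: the ranked probabilities π[i] are obtained from the order-statistic density by a midpoint rule; the k direct units are ranked in sets of size k and the m − k randomized units in sets of size m − k; and each rank contributes π[i](1 − π[i]), plus the noise p(1 − p)/(2p − 1)² when randomized. All function names are hypothetical.

```python
import math

def inv_logit(t):
    return 1.0 / (1.0 + math.exp(-t))

def normal_pdf(x, mu=2.0, sigma=1.0):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def normal_cdf(x, mu=2.0, sigma=1.0):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def ranked_probs(beta0, beta1, size, lo=-6.0, hi=10.0, steps=4000):
    """pi_[i] = E[g(beta0 + beta1 * X_(i))] for i = 1..size (midpoint rule)."""
    h = (hi - lo) / steps
    probs = []
    for i in range(1, size + 1):
        const = i * math.comb(size, i)            # size! / ((i-1)! (size-i)!)
        total = 0.0
        for s in range(steps):
            x = lo + (s + 0.5) * h
            u = normal_cdf(x)
            density = const * u ** (i - 1) * (1.0 - u) ** (size - i) * normal_pdf(x)
            total += inv_logit(beta0 + beta1 * x) * density
        probs.append(total * h)
    return probs

def relative_precision(beta0, beta1, m, k, p):
    """Var(Warner-type estimator) / Var(mixed estimator); N cancels in the ratio."""
    noise = p * (1 - p) / (2 * p - 1) ** 2
    pi = sum(ranked_probs(beta0, beta1, m)) / m
    direct = ranked_probs(beta0, beta1, k) if k > 0 else []
    randomized = ranked_probs(beta0, beta1, m - k) if k < m else []
    var_w = pi * (1 - pi) + noise
    var_k = (sum(q * (1 - q) for q in direct)
             + sum(q * (1 - q) + noise for q in randomized)) / m
    return var_w / var_k

for k in (0, 2):
    print(k, round(relative_precision(-5.0, 5.0, m=4, k=k, p=0.3), 3))
```

Under these assumptions the ratio depends on p only through p(1 − p)/(2p − 1)², so it takes the same value at p and 1 − p.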
In S1 Fig, we present the results obtained when the inverse logit link function is used, X follows the above-mentioned distributions, and m = 4. Similarly, S2 Fig depicts the RP results under the same setup, except that m = 5. S3 and S4 Figs show the RP of $\hat{\pi}_k$ with respect to $\hat{\pi}_a$ for m = 4 and m = 5, respectively.
It can be observed from S1 and S2 Figs that RP is an increasing function of m (k) and (or) ρ. It is also symmetric about p = 0.5. In other words, for given m and ρ, one can choose either p or 1 − p for the sensitive statement (a) without changing the RP value. For a fixed m, as expected, the maximum (minimum) gain in terms of RP is achieved at the largest (smallest) value of k. Recall that at the minimum k = 0, the RP curve shows the behavior of the estimator of [17]; it lies at the bottom (i.e., is less precise) compared with the cases when k > 0.
It is also easy to observe from S3 and S4 Figs that the RP of $\hat{\pi}_k$ with respect to $\hat{\pi}_a$, as expected, becomes unity when k = 0. However, $\hat{\pi}_k$ outperforms $\hat{\pi}_a$ when k > 0. During this study, we also examined the behavior of the RP values under the inverse probit link function, which is almost the same as that under the inverse logit link function. For instance, see the RP values in S5 and S6 Figs when m = 4, 5. All these results strongly support our proposed estimator without compromising the privacy of the respondent. In addition, $\hat{\pi}_k$ is more flexible than $\hat{\pi}_a$ and $\hat{\pi}_w$.
4 Application
We first collected a real data set on the ages of 50 AIDS patients from a local government hospital in Rawalpindi, Pakistan. Each respondent was then approached and asked whether or not he/she had had a sexual relation with a sex worker. The respondent was given the option to respond either directly using SRS or via the randomizing device of [1] with p = 0.2 and 1 − p = 0.8. All interviewees were assured that their identity would never be disclosed, and it was made clear that this information could help in their treatment. They were sufficiently convinced that almost half of them (25 patients) consented to disclose the true information directly. After converting the ‘yes’ and ‘no’ responses into binary responses, we computed a true sensitive proportion π = 0.473. We also computed the mean and variance of X and its correlation with Y, as given by
and ρX,Y = 0.417. The main purpose of this survey was to obtain the true proportion and to compare the performance of the estimators discussed above.
4.1 Estimation of sensitive proportion
The data acquired in this section are now used to estimate the sensitive proportion under the proposed model and the other models discussed above. To draw a sample of size N = 8 under the partial randomized response model with m = 4, the following process is repeated r = 2 times. Assuming k = 0, draw m² units and divide them into m sets of size m. The units in each set are ranked with respect to X, and the corresponding Y values are then recorded. The resulting data are displayed in S1 Table, whose last column presents the randomized response data. Similarly, S2 Table gives the layout of the data when k = 2. Note that in S1 and S2 Tables, Y[a]bc denotes the ath imperfectly ranked unit in the bth set in the cth cycle. However, the set index b is omitted from the final data given in the last columns of S1 and S2 Tables for simplicity. We also obtained data for m = 5 and r = 2 at k = 0, 2, 3, but these are not tabulated owing to page constraints.
From S1 and S2 Tables, we have computed, using Eqs (2.3) and (3.1), ,
at k = 2. To estimate sensitive proportion under [1] procedure, we drew a sample of size N = 8 by SRS method and got
and
.
4.2 Performance evaluation
In this subsection, we assess the relative performance of the proposed estimator with respect to the existing estimators discussed above for m = 4, 5 and k = 0, 2, 3. To this end, we used the data acquired above and computed the variance estimates of $\hat{\pi}_h$, h = a, k, w, for p = 0.2. From these numerical results, we computed the RP values shown in S3 Table.
The results given in S3 Table are consistent with those plotted in S1 and S2 Figs. Thus, the partial randomized response model is an efficient alternative to the existing $\hat{\pi}_w$ and $\hat{\pi}_a$, and can be used confidently to obtain an accurate estimate without compromising the privacy of the respondent.
5 Cost analysis
In survey sampling, several factors such as cost, time and accuracy or precision are taken into account when choosing an appropriate sampling design. Generally, cost is the main concern in almost all sample surveys. However, this important factor was neglected in the previous sections by assuming that no cost is associated with the ranking of sampling units.
Following [18], we develop a cost model in RSS to assess the performance of the estimators. Let cs denote the cost of stratification attached to each measured unit in ranked set sampling. In common practice, this is the cost of drawing m − 1 units and accomplishing the judgment ordering of the m units of a set. Similarly, cq denotes the cost of drawing and quantifying a unit without classification or ranking.
Now, the relative efficiency (RE) is defined as the ratio of the variances of the estimators under SRS and RSS, with the assumption that the total cost, say C, is the same for both sampling schemes. Again following [18], the RE of $\hat{\pi}_k$ with respect to $\hat{\pi}_w$ is given by
(5.1)
It is clear from Eq (5.1) that, for fixed cq, the magnitude of RE decreases as cs increases. Moreover, the maximum RE value is attained when cs = 0. To present the behavior of Eq (5.1) graphically, we consider the combinations (cq, cs) ∈ {(20, 2), (20, 4)}, measured in dollars. Note that the variance estimates of $\hat{\pi}_h$, h = a, k, w, are the same as those obtained in Subsection 4.2.
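The effect of the ranking cost can be illustrated with a simple equal-budget calculation. The sketch below assumes, purely for illustration, that a budget C buys C/cq units under SRS and C/(cq + cs) measured units under RSS, so that the cost-adjusted RE equals RP × cq/(cq + cs); this form may differ from Eq (5.1). The function names and the RP value of 1.5 are hypothetical.

```python
def cost_adjusted_re(rp, cq, cs):
    """Equal-budget relative efficiency under the assumed cost model.

    rp: relative precision at equal sample size (Var_SRS / Var_RSS).
    cq: cost of drawing and quantifying one unit.
    cs: ranking (stratification) cost per measured unit.
    """
    return rp * cq / (cq + cs)

def break_even_rp(cq, cs):
    """Smallest RP for which RSS is still worthwhile (RE >= 1)."""
    return 1.0 + cs / cq

rp = 1.5  # hypothetical relative precision
for cq, cs in [(20, 0), (20, 2), (20, 4)]:
    print(cq, cs, round(cost_adjusted_re(rp, cq, cs), 3), round(break_even_rp(cq, cs), 3))
```

Under this assumed model, RE equals RP when cs = 0 and falls as cs grows, matching the qualitative behavior described above.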
In S7–S10 Figs, we present the RE values of $\hat{\pi}_k$ relative to the existing estimators for different combinations of (cq, cs) at m = 4. Likewise, the corresponding RE values for different (cq, cs) at m = 5 are displayed in S11–S14 Figs.
As expected, RE rapidly decreases as cs increases and vice-versa. However, the proposed model still remains superior. Hence, it is appropriate to estimate sensitive proportion using partial randomized response model when there is negligible (minimum) cost involved for ranking of units.
6 Conclusion
This study has proposed a partial randomized response model for efficiently estimating a sensitive proportion under the RSS scheme, wherein ranking is done via a continuous auxiliary variable. Both mathematical and numerical results support the suggested model relative to the existing models. Moreover, the limiting distribution of the new estimator is discussed, and some interesting results are derived from it. The graphical representation of the numerical results reveals that the RP values of $\hat{\pi}_k$ with respect to $\hat{\pi}_w$ are symmetric about p = 0.5 and become larger as m (k) or the magnitude of the correlation between X and Y increases. The RP values of $\hat{\pi}_k$ with respect to $\hat{\pi}_a$ show a similar trend, except that $\hat{\pi}_k$ also works efficiently at low correlation (ρ < 0.5). Finally, the cost analysis also supports the new model without compromising the privacy of the respondents. Moreover, because the proposed model provides the options of both direct and indirect (randomized) response, it is flexible. Furthermore, the cost of a direct response is far less than that of an indirect query. Hence, the proposed model is highly recommended for sensitive proportion estimation, being flexible, economical and efficient.
Supporting information
S1 Fig. RP of $\hat{\pi}_k$ vs $\hat{\pi}_w$ under different distributions and values of ρ and p when m = 4.
https://doi.org/10.1371/journal.pone.0277497.s001
(TIF)
S2 Fig. RP of $\hat{\pi}_k$ vs $\hat{\pi}_w$ under different distributions and values of ρ and p when m = 5.
https://doi.org/10.1371/journal.pone.0277497.s002
(TIF)
S3 Fig. RP of $\hat{\pi}_k$ vs $\hat{\pi}_a$ under different distributions and values of ρ and p when m = 4.
https://doi.org/10.1371/journal.pone.0277497.s003
(TIF)
S4 Fig. RP of $\hat{\pi}_k$ vs $\hat{\pi}_a$ under different distributions and values of ρ and p when m = 5.
https://doi.org/10.1371/journal.pone.0277497.s004
(TIF)
S5 Fig.
vs
for different distributions, ρ and p using probit link function when m = 4.
https://doi.org/10.1371/journal.pone.0277497.s005
(TIF)
S6 Fig.
vs
for different distributions, ρ and p using probit link function when m = 5.
https://doi.org/10.1371/journal.pone.0277497.s006
(TIF)
S7 Fig.
vs
under different distributions, values of ρ and p when m = 4, cq = $20 and cs = $2.
https://doi.org/10.1371/journal.pone.0277497.s007
(TIF)
S8 Fig.
vs
under different distributions, values of ρ and p when m = 4, cq = $20 and cs = $4.
https://doi.org/10.1371/journal.pone.0277497.s008
(TIF)
S9 Fig.
vs
under different distributions, values of ρ and p when m = 4, cq = $20 and cs = $2.
https://doi.org/10.1371/journal.pone.0277497.s009
(TIF)
S10 Fig.
vs
under different distributions, values of ρ and p when m = 4, cq = $20 and cs = $4.
https://doi.org/10.1371/journal.pone.0277497.s010
(TIF)
S11 Fig.
vs
under different distributions, values of ρ and p when m = 5, cq = $20 and cs = $2.
https://doi.org/10.1371/journal.pone.0277497.s011
(TIF)
S12 Fig.
vs
under different distributions, values of ρ and p when m = 5, cq = $20 and cs = $4.
https://doi.org/10.1371/journal.pone.0277497.s012
(TIF)
S13 Fig.
vs
under different distributions, values of ρ and p when m = 5, cq = $20 and cs = $2.
https://doi.org/10.1371/journal.pone.0277497.s013
(TIF)
S14 Fig.
vs
under different distributions, values of ρ and p when m = 5, cq = $20 and cs = $4.
https://doi.org/10.1371/journal.pone.0277497.s014
(TIF)
S1 Table. A partial randomized response real data when m = 4 and k = 0.
https://doi.org/10.1371/journal.pone.0277497.s015
(PDF)
S2 Table. A partial randomized response real data when m = 4 and k = 2.
https://doi.org/10.1371/journal.pone.0277497.s016
(PDF)
Acknowledgments
The authors are grateful to an Academic Editor and two anonymous reviewers for providing useful comments that substantially improved the previous version of this study.
References
- 1. Warner SL. Randomized response: A survey technique for eliminating evasive answer bias. Journal of the American Statistical Association. 1965; 60(309): 63–69. pmid:12261830
- 2. Horvitz DG, Shah BV, and Simmons WR. The unrelated question randomized response model. Proceedings of the Social Statistics Section. Journal of the American Statistical Association. 1967; 65–72.
- 3. Greenberg BG, Abul-Ela A-LA, Simmons WR, Horvitz DG. The unrelated question randomized response model: Theoretical framework. Journal of the American Statistical Association. 1969;64(326): 520–539.
- 4. Kuk AYC. Asking sensitive questions indirectly. Biometrika. 1990; 77(2): 436–438.
- 5. Mangat NS. An improved randomized response strategy. Journal of the Royal Statistical Society. 1994; 56(1): 93–95.
- 6. McIntyre GA. A method for unbiased selective sampling, using ranked sets. Australian Journal of Agricultural Research. 1952; 3(4): 385–390.
- 7. Takahasi K and Wakimoto K. On unbiased estimates of the population mean based on the sample stratified by means of ordering. Annals of the Institute of Statistical Mathematics. 1968;20(1): 1–31.
- 8. Stokes SL. Ranked set sampling with concomitant variables. Communications in Statistics-Theory and Methods. 1977; 6(12): 1207–1211.
- 9. Frey J. A note on ranked set sampling using a covariate. Journal of Statistical Planning and Inference. 2011;141(2): 809–816.
- 10. Zamanzade E and Vock M. Variance estimation in ranked set sampling using a concomitant variable. Statistics & Probability Letters. 2015;105: 1–5.
- 11. Zamanzade E and Mohammadi M. Some Modified Mean Estimators in Ranked Set Sampling Using a Covariate. Journal of Statistical Theory and Applications. 2016; 15(2): 142–152.
- 12. Zamanzade E, Mahdizadeh M. Distribution function estimation using concomitant-based ranked set sampling. Hacettepe Journal of Mathematics and Statistics. 2018; 47(3): 755–761.
- 13. Haq A, Brown J, Moltchanova E, and Al Omari AI. Partial ranked set sampling design. Environmetrics. 2013; 24(3): 201–207.
- 14. Terpstra JT and Liudahl LA. Concomitant-based rank set sampling proportion estimates. Statistics in Medicine. 2004; 23(13): 2061–2070. pmid:15211603
- 15. Zamanzade E and Mahdizadeh M. A more efficient proportion estimator in ranked set sampling. Statistics & Probability Letters. 2017; 129: 28–33.
- 16. Abbasi AM and Shad MY. Estimation of population proportion using concomitant based ranked set sampling. Communications in Statistics-Theory and Methods. 2022; 51(9): 2689–2709.
- 17. Abbasi AM and Shad MY. Sensitive proportion in ranked set sampling. PLoS One. 2021; 16(8): e0256699. pmid:34464414
- 18. Dell TR and Clutter JL. Ranked set sampling theory with order statistics background. Biometrics. 1972; 545–555.