Sensitive proportion in ranked set sampling

Abstract

This paper considers concomitant-based rank set sampling (CRSS) for estimating a sensitive proportion. It is shown that the CRSS procedure provides an unbiased estimator of the population sensitive proportion, and that this estimator is always more precise than the corresponding sample sensitive proportion of Warner (1965), which is based on simple random sampling (SRS), without increasing the sampling cost. Additionally, a new estimator based on the ratio method is introduced under the CRSS protocol, preserving the respondent’s confidentiality through a randomizing device. Numerical results for these estimators are obtained by numerical integration. An application to real data is also given to support the methods.

1 Introduction

In some social surveys, we may need to estimate the proportion of a population having a sensitive attribute, such as drug addicts, heroin users and non-taxpayers, about which people are not inclined to respond truthfully. In such situations, direct questioning may yield evasive or ambiguous answers, or no response at all. To overcome these problems, [1] proposed the randomized response (RR) technique under a simple random sampling (SRS) plan, with the objective of collecting truthful answers while fully preserving the respondent’s privacy. This method uses a randomizing device, such as a spinning arrow or a deck of cards, to procure truthful information on the sensitive attribute. The respondent answers ‘yes’ or ‘no’ according to the outcome produced by the randomizing device. Because the interviewer is kept unaware of that outcome, the respondent’s status cannot be identified from his/her response. Since the development of the first randomized response model [1], numerous variants have been suggested to obtain more reliable estimates of the sensitive attribute by increasing the respondent’s degree of privacy. A comprehensive literature review is not attempted here; some noteworthy work under the SRS plan can be found in [2–5] and the references cited therein.

Ranked set sampling (RSS) was introduced by [6] as an efficient alternative to simple random sampling (SRS) for estimating pasture and forage yields. RSS ranks the units within small sets, either visually or through concomitant information, before selecting the final sample for actual quantification. The theory of the RSS procedure was developed by [7]. More details and applications of RSS (CRSS) can be found in [8–13].

The estimation of a non-sensitive population proportion under CRSS was investigated by [14]. Thereafter, [15] introduced a new proportion estimator in the CRSS framework and showed that it works better than that of [14] without using extra resources. Recently, [16] highlighted some drawbacks of the estimator given in [15] and proposed new improved estimators. Moreover, [17] showed how RSS can be applied to ordered categorical variables for estimating the probabilities of all categories. They used ordinal logistic regression to aid in ranking the ordinal variable of interest.

The idea of using RSS to estimate a sensitive attribute is similar to its application in the inference problems discussed above. The ranking can be carried out visually or by using a concomitant variable, which should be non-sensitive but statistically correlated with the study attribute. The following two examples illustrate how concomitant information can be used for ranking the units:

  1. Example 1. Suppose that we want to estimate the proportion of drug addicts in a social survey through the RR technique. We can easily rank (order) two or more units at a glance with respect to either their facial expressions or their ages.
  2. Example 2. Suppose that the parameter under study is the proportion of non-taxpayers. We can order two or more households at a glance with respect to either their living styles or their house sizes.

Recently, [18] adopted the model-based ranking approach introduced by [17] for studying a sensitive proportion using concomitant-based rank set sampling. This ranking method requires success probabilities estimated by fitting a logistic regression. The main concern with it is that the concomitant information is not used directly to rank the study variable; instead, a logistic model must be fitted to data from previous studies, and the ranking is done on the basis of the fitted probabilities. In this paper, a new efficient estimator is proposed using CRSS that overcomes the drawbacks of the procedure in [18] and also outperforms the estimator of [1]. Furthermore, a new estimator based on the ratio method is introduced under the CRSS plan.

2 Background

Let (Y, X) denote a bivariate random variable, where the sensitive study attribute Y follows a Bernoulli distribution and X is a continuous nonsensitive concomitant with cumulative distribution function (cdf) FX(x). Suppose that the conditional distribution of Y given X = x is also Bernoulli, denoted by B(1, g(x)), where g(x) ∈ (0, 1) could be the inverse logit (probit) link function defined as $g(x) = e^{\beta_0 + \beta_1 x}/(1 + e^{\beta_0 + \beta_1 x})$ ($g(x) = \Phi(w)$), where w = 0.551(β0 + β1 x), Φ(⋅) is the standard normal cdf, and β0, β1 ∈ ℜ.

It follows that the marginal distribution of Y is B(1, π) with mean π = E[g(X)] and variance π(1 − π). For a more detailed discussion, the interested reader may consult [14]. The CRSS plan for selecting m (≥ 2) units can be described as follows:

  1. Step 1 Identify m² units from (Y, X) and divide them into m sets of size m.
  2. Step 2 In each set of size m, obtain the exact measurement of X for every unit, then rank the units within the set according to their values of X.
  3. Step 3 Obtain the Y value corresponding to the ith (i = 1, 2, …, m) ordered unit of X in the ith set.
  4. Step 4 Repeat Steps 1–3 for n cycles, if required, to obtain a sample of size k = mn.
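The four steps above can be sketched as a small simulation. This is only an illustration under stated assumptions: X is drawn from a normal distribution with mean 2 and variance 1, Y given X follows the inverse logit link, and the function names and the default values of `beta0` and `beta1` are hypothetical choices, not taken from the paper.

```python
import math
import random


def inverse_logit(x, beta0, beta1):
    """Success probability g(x) under an inverse logit link (assumed form)."""
    return 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x)))


def crss_sample(m, n, rng, beta0=-2.0, beta1=1.0):
    """Draw a concomitant-based ranked set sample of size k = m * n.

    Returns a list of tuples (rank i, cycle j, X_(i)j, Y_[i]j).
    """
    sample = []
    for cycle in range(1, n + 1):            # Step 4: repeat for n cycles
        for rank in range(1, m + 1):         # Step 1: one set of m units per rank
            # Step 2: measure X on every unit of the set and order the set by X.
            xs = sorted(rng.gauss(2.0, 1.0) for _ in range(m))
            x_sel = xs[rank - 1]             # ith ordered unit from the ith set
            # Step 3: observe Y only for the selected unit.
            y_sel = 1 if rng.random() < inverse_logit(x_sel, beta0, beta1) else 0
            sample.append((rank, cycle, x_sel, y_sel))
    return sample
```

With m = 5 and n = 2, the sketch measures X on 50 units but records Y for only 10 of them, which is where the saving in quantification cost comes from.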

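Let {(Y[1]j, X(1)j), (Y[2]j, X(2)j), …, (Y[i]j, X(i)j), …, (Y[m]j, X(m)j)} be a bivariate ranked set sample of size m in the jth cycle, j = 1, 2, …, n, where Y[i]j denotes the ith imperfectly ranked unit in the jth cycle and X(i)j denotes the ith perfectly ranked unit in the jth cycle. Note that the square bracket [⋅] denotes imperfect ranking and (⋅) denotes perfect ranking. Again, from [14], Y[i] is B(1, π[i]) with mean (probability) π[i] = E[g(X(i))] and variance π[i](1 − π[i]). Let f(i)(x) be the probability density function (pdf) and F(i)(x) the cumulative distribution function (cdf) of the order statistic (OS) X(i), and let fX(x) be the pdf of X; then we have

$$f_{(i)}(x) = \frac{m!}{(i-1)!\,(m-i)!}\,[F_X(x)]^{i-1}\,[1-F_X(x)]^{m-i}\,f_X(x) \tag{2.1}$$

and

$$F_{(i)}(x) = \sum_{r=i}^{m}\binom{m}{r}\,[F_X(x)]^{r}\,[1-F_X(x)]^{m-r}.$$

The mean and variance of X(i) are given by

$$\mu_{(i)} = \int x\, f_{(i)}(x)\,dx \quad\text{and}\quad \sigma^2_{(i)} = \int (x-\mu_{(i)})^2 f_{(i)}(x)\,dx,$$

respectively; see [19].

The success (‘yes’) probability of Y[i], as given in [14], is numerically computed as

$$\pi_{[i]} = \int g(x)\, f_{(i)}(x)\,dx.$$

The covariance between X(i) and Y[i] is defined as

$$\mathrm{Cov}(X_{(i)}, Y_{[i]}) = \int x\, g(x)\, f_{(i)}(x)\,dx - \mu_{(i)}\,\pi_{[i]}.$$

By virtue of partitioning, as defined in [20], we have

$$f_X(x) = \frac{1}{m}\sum_{i=1}^{m} f_{(i)}(x). \tag{2.2}$$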
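As a rough illustration of how the success probabilities π[i] = E[g(X(i))] might be obtained by numerical integration, the sketch below uses Simpson's rule with the standard order-statistic density. It assumes X is normal with mean 2 and variance 1 and an inverse logit link; the function names, the integration range of ten standard deviations, and the step count are our own choices rather than the authors' Mathematica setup.

```python
import math


def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def normal_cdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def inverse_logit(x, beta0, beta1):
    return 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x)))


def order_stat_pdf(x, i, m, mu, sigma):
    """Density of the ith order statistic in a sample of size m from N(mu, sigma^2)."""
    coef = math.factorial(m) / (math.factorial(i - 1) * math.factorial(m - i))
    big_f = normal_cdf(x, mu, sigma)
    return coef * big_f ** (i - 1) * (1.0 - big_f) ** (m - i) * normal_pdf(x, mu, sigma)


def simpson(func, a, b, steps=2000):
    """Composite Simpson rule; steps must be even."""
    h = (b - a) / steps
    total = func(a) + func(b)
    for s in range(1, steps):
        total += (4 if s % 2 else 2) * func(a + s * h)
    return total * h / 3.0


def rank_success_prob(i, m, beta0, beta1, mu=2.0, sigma=1.0):
    """pi_[i]: integral of g(x) * f_(i)(x) over (effectively) the real line."""
    lo, hi = mu - 10.0 * sigma, mu + 10.0 * sigma
    return simpson(
        lambda x: inverse_logit(x, beta0, beta1) * order_stat_pdf(x, i, m, mu, sigma),
        lo, hi)


def overall_success_prob(beta0, beta1, mu=2.0, sigma=1.0):
    """pi = E[g(X)], integrated against the parent density of X."""
    lo, hi = mu - 10.0 * sigma, mu + 10.0 * sigma
    return simpson(
        lambda x: inverse_logit(x, beta0, beta1) * normal_pdf(x, mu, sigma), lo, hi)
```

Because the order-statistic densities average to the parent density, the mean of the m values returned by `rank_success_prob` should agree with `overall_success_prob` up to rounding, which gives a convenient sanity check on the integration.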

The following well-known notations will be used in this study.

Also the following relationships will be used in this paper.

For more detail, see the References [20, 21].

3 Warner’s model under CRSS

As this study involves the randomized response (RR) procedure, it is important to give an overview of the basic RR procedure of [1]. Let Y1j, Y2j, …, Ymj be a simple random sample with replacement (SRSWR) of size m in the jth cycle, j = 1, 2, …, n. Each respondent is provided with a suitable randomizing device, say a spinner, for selecting one of two statements: (a) I have the sensitive attribute A; (b) I do not have the sensitive attribute A, with pre-assigned selection probabilities p ≠ 0.5 and 1 − p, respectively. Each respondent spins the spinner and reports ‘yes’ (‘no’) if his/her status matches (does not match) the statement selected by the randomizing device. Because the interviewer is kept unaware of the outcome of the randomizing device, the respondent can comfortably report his/her actual status truthfully. Then the probability of a ‘yes’ response from the ith (i = 1, 2, …, m) respondent in the jth cycle is given by $P(Y_{ij} = 1) = \lambda = p\pi + (1-p)(1-\pi)$.

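Let m1 denote the number of ‘yes’ responses in the sample of size k = mn; then [1] derived the maximum likelihood estimate of π as $\hat{\pi}_w = \{\hat{\lambda} - (1-p)\}/(2p-1)$, where $\hat{\lambda} = m_1/k$ is the estimate of λ. The estimator is unbiased and its variance is given by

$$\mathrm{Var}(\hat{\pi}_w) = \frac{\pi(1-\pi)}{k} + \frac{p(1-p)}{k(2p-1)^2}. \tag{3.1}$$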
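Warner's procedure described above can be illustrated with a short simulation. The sketch assumes the standard forms of Warner's estimator, (λ̂ − (1 − p))/(2p − 1) with λ̂ the observed proportion of ‘yes’ responses, and of its variance, π(1 − π)/k + p(1 − p)/(k(2p − 1)²); the function names are hypothetical.

```python
import random


def warner_response(has_attribute, p, rng):
    """One randomized response: 1 ('yes') if the respondent's status matches
    the statement picked by the device, else 0 ('no')."""
    statement_a = rng.random() < p           # statement (a) is picked with probability p
    return 1 if statement_a == bool(has_attribute) else 0


def simulate_warner(pi, p, k, rng):
    """Simulate k respondents whose true status is Bernoulli(pi)."""
    return [warner_response(rng.random() < pi, p, rng) for _ in range(k)]


def warner_estimate(responses, p):
    """Warner's estimator of pi from 0/1 randomized responses."""
    if p == 0.5:
        raise ValueError("p = 0.5 makes the estimator undefined")
    lam_hat = sum(responses) / len(responses)
    return (lam_hat - (1.0 - p)) / (2.0 * p - 1.0)


def warner_variance(pi, p, k):
    """Variance of Warner's estimator for a sample of size k."""
    return pi * (1.0 - pi) / k + p * (1.0 - p) / (k * (2.0 * p - 1.0) ** 2)
```

For instance, with p = 0.2 and 6 ‘yes’ answers out of 10 (the figures quoted for the SRS sample in Section 5), `warner_estimate` gives (0.6 − 0.8)/(−0.6) ≈ 0.33.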

Now, suppose that the respondents are selected using the CRSS design and are instructed to choose one of the two above-mentioned statements (a) and (b) by using the given randomizing device. Each respondent reports ‘yes’ (‘no’) according to the outcome of the randomizing device and his/her actual status. A complete layout of the ith response under CRSS is given in S1 Fig. Let Y[i]j = 1 if the ith ranked unit reports ‘yes’, and Y[i]j = 0 otherwise. Then $P(Y_{[i]j} = 1) = \lambda_{[i]} = p\pi_{[i]} + (1-p)(1-\pi_{[i]})$.

Let Y[i]1, Y[i]2, …, Y[i]n be independent and identically distributed (i.i.d.) Bernoulli randomized responses under the CRSS plan with parameter λ[i] = pπ[i] + (1 − p)(1 − π[i]); then the likelihood function of π[i] given the data Y[i]j, j = 1, 2, …, n, is

$$L(\pi_{[i]}) = \lambda_{[i]}^{Z_i}\,(1-\lambda_{[i]})^{\,n-Z_i}, \tag{3.2}$$

where $Z_i = \sum_{j=1}^{n} Y_{[i]j}$ is the total number of successes observed for the ith ranked unit. Clearly, Zi is a binomial variate with parameters n and pπ[i] + (1 − p)(1 − π[i]). The joint likelihood function of π[i], i = 1, 2, …, m, given the CRSS data ycrss = {Y[i]j, i = 1, 2, …, m; j = 1, 2, …, n} is

$$L(\pi_{[1]},\ldots,\pi_{[m]} \mid y_{crss}) = \prod_{i=1}^{m} \lambda_{[i]}^{Z_i}\,(1-\lambda_{[i]})^{\,n-Z_i}. \tag{3.3}$$

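Note that the likelihood function in (3.3) is too complicated to yield the ML estimate of π directly. Moreover, the variance of the resulting estimator would not be available in closed form. To avoid this, we estimate each π[i] separately using the likelihood function in (3.2) and then combine these individual proportions through the relation $\pi = \frac{1}{m}\sum_{i=1}^{m}\pi_{[i]}$ to obtain an overall estimate of π. The log of the likelihood function (3.2) is

$$\ln L(\pi_{[i]}) = Z_i \ln \lambda_{[i]} + (n - Z_i)\ln(1-\lambda_{[i]}), \qquad \lambda_{[i]} = (2p-1)\pi_{[i]} + (1-p),$$

and the necessary condition on π[i] for a maximum gives

$$\frac{(2p-1)\,Z_i}{\lambda_{[i]}} - \frac{(2p-1)\,(n-Z_i)}{1-\lambda_{[i]}} = 0.$$

After simplification, we obtain $\hat{\pi}_{[i]} = \{\hat{\lambda}_{[i]} - (1-p)\}/(2p-1)$, where $\hat{\lambda}_{[i]} = Z_i/n$. Hence, the proposed estimator of π under the CRSS plan is given by

$$\hat{\pi}_{crss} = \frac{1}{m}\sum_{i=1}^{m}\hat{\pi}_{[i]}. \tag{3.4}$$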
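A minimal sketch of the rank-by-rank estimation just described: each π[i] is estimated from the number of ‘yes’ responses Zi at rank i using Warner's transformation, and the m estimates are then averaged. The function name and the example counts in the usage note are hypothetical.

```python
def crss_rr_estimate(yes_counts, n, p):
    """Estimate pi from CRSS randomized responses.

    yes_counts[i] is Z_i, the number of 'yes' responses among the n units
    of rank i + 1; p is the selection probability of statement (a).
    Returns the overall estimate and the per-rank estimates.
    """
    if p == 0.5:
        raise ValueError("p = 0.5 makes the estimator undefined")
    # Per-rank estimate: transform the 'yes' proportion Z_i / n back to pi_[i].
    per_rank = [((z / n) - (1.0 - p)) / (2.0 * p - 1.0) for z in yes_counts]
    overall = sum(per_rank) / len(per_rank)
    return overall, per_rank
```

For example, `crss_rr_estimate([2, 1, 1, 0, 1], n=2, p=0.2)` averages five per-rank estimates. Individual per-rank values can fall outside [0, 1] when n is small, which is a known feature of Warner-type estimators rather than a defect of the sketch.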

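Theorem: Let {Y[i]j, i = 1, 2, …, m; j = 1, 2, …, n} be a ranked set sample of size k. Then

  1. (i) the estimator $\hat{\pi}_{crss}$ in (3.4) is an unbiased estimator of the population proportion π, i.e., $E(\hat{\pi}_{crss}) = \pi$;
  2. (ii) $\hat{\pi}_{crss}$ is more precise than Warner's estimator $\hat{\pi}_w$ in (3.1), i.e., $\mathrm{Var}(\hat{\pi}_{crss}) \le \mathrm{Var}(\hat{\pi}_w)$.

Proof:

  1. (i) From (3.4), we have
    $$E(\hat{\pi}_{crss}) = \frac{1}{m}\sum_{i=1}^{m} \frac{E(\hat{\lambda}_{[i]}) - (1-p)}{2p-1}, \qquad \hat{\lambda}_{[i]} = Z_i/n.$$
    But
    $$E(\hat{\lambda}_{[i]}) = \frac{E(Z_i)}{n} = \lambda_{[i]} = (2p-1)\pi_{[i]} + (1-p).$$
    Hence
    $$E(\hat{\pi}_{crss}) = \frac{1}{m}\sum_{i=1}^{m}\pi_{[i]} = \pi.$$
    This completes proof (i).
  2. (ii) From (3.4), the variance of $\hat{\pi}_{crss}$ is
    $$\mathrm{Var}(\hat{\pi}_{crss}) = \frac{1}{m^2}\sum_{i=1}^{m} \mathrm{Var}\!\left(\frac{\hat{\lambda}_{[i]} - (1-p)}{2p-1}\right). \tag{3.5}$$

Since the variance of a constant term is zero, (3.5) reduces to

$$\mathrm{Var}(\hat{\pi}_{crss}) = \frac{1}{m^2 n\,(2p-1)^2}\sum_{i=1}^{m} \lambda_{[i]}(1-\lambda_{[i]}), \tag{3.6}$$

where λ[i] = pπ[i] + (1 − p)(1 − π[i]) = (2p − 1)π[i] + (1 − p).

Now, substituting the value of λ[i] in (3.6) and simplifying gives

$$\mathrm{Var}(\hat{\pi}_{crss}) = \frac{1}{m^2 n}\sum_{i=1}^{m}\pi_{[i]}(1-\pi_{[i]}) + \frac{p(1-p)}{mn(2p-1)^2} = \mathrm{Var}(\hat{\pi}_w) - \frac{1}{m^2 n}\sum_{i=1}^{m}(\pi_{[i]} - \pi)^2.$$

We have used the fact that $\frac{1}{m}\sum_{i=1}^{m}\pi_{[i]} = \pi$. Note that $\sum_{i=1}^{m}(\pi_{[i]} - \pi)^2 \ge 0$, hence $\mathrm{Var}(\hat{\pi}_{crss}) \le \mathrm{Var}(\hat{\pi}_w)$. This completes the proof of (ii).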

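The relative efficiency (RE) of the CRSS estimator $\hat{\pi}_{crss}$ with respect to Warner's estimator $\hat{\pi}_w$ can be examined through the ratio

$$RE = \frac{\mathrm{Var}(\hat{\pi}_w)}{\mathrm{Var}(\hat{\pi}_{crss})} = \frac{\pi(1-\pi) + \dfrac{p(1-p)}{(2p-1)^2}}{\dfrac{1}{m}\sum_{i=1}^{m}\pi_{[i]}(1-\pi_{[i]}) + \dfrac{p(1-p)}{(2p-1)^2}}. \tag{3.7}$$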

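The expression (3.7) is never less than unity, irrespective of the choice of g(⋅), subject to the condition that p ≠ 0.5, and it exceeds unity whenever the π[i] are not all equal. In other words, the CRSS estimator is a superior alternative to Warner's estimator. It may be noted that when p = 0.5 the factor 2p − 1 vanishes, so both variances, and consequently the RE, are undefined. It is also clear from (3.7) that the RE is independent of the number of cycles n, i.e., the relative gain cannot be improved by increasing n. The following result also holds from [14] when m is fixed and n → ∞:

$$\sqrt{k}\,(\hat{\pi}_{crss} - \pi) \xrightarrow{d} N\!\left(0,\; \frac{1}{m}\sum_{i=1}^{m}\pi_{[i]}(1-\pi_{[i]}) + \frac{p(1-p)}{(2p-1)^2}\right). \tag{3.8}$$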

We can see that when m = 1, (3.8) simplifies to Warner’s result [1] under SRS. Furthermore, for the choice p = 1, i.e., when the randomizing device selects the sensitive statement with certainty and the respondent has no privacy, (3.8) reduces to the procedure of [14], in which the respondent is asked directly about the attribute of interest under CRSS. The choice p = 0, i.e., no chance of the randomizing device selecting the sensitive statement, also brings (3.8) to the procedure of [14]. Moreover, if both p and m are equal to 1, (3.8) becomes the conventional method of direct questioning under SRS. However, for a precise and reliable estimate, the conditions m ≥ 2 and 0 < p < 0.5 (or 0.5 < p < 1) are required. Finally, a consistent estimator of the variance in (3.8) can be obtained by replacing each π[i] with its sample estimate. In this way the variance estimate becomes free of g(⋅). Hence, asymptotic inference can easily be derived under the CRSS plan.

3.1 Numerical illustration

We investigate the RE of the proposed CRSS estimator with respect to Warner's estimator by using the expression (3.7) for different choices of β0, β1 and 0.1 ≤ p ≤ 0.9, assuming that X follows (i) a normal distribution with mean 2 and variance 1, or (ii) a uniform distribution over the range 0 to 1. Recall that the RE formula in (3.7) is independent of n; hence we vary m (= 2, 3, 4, 5) instead of n to evaluate the performance of the CRSS estimator. Furthermore, the magnitude of the correlation coefficient ρ between X and Y is also computed under the inverse logit (probit) link function. All results are obtained by numerical integration, as described in Section 2, using Mathematica, and are displayed in S1-S4 Tables in S1 File.

As expected, the RE is an increasing function of m and/or ρ. It is also symmetric about p = 0.5. In other words, for given m and ρ, it does not matter whether one assigns the design parameter p or 1 − p to the aforesaid sensitive statement (a). However, respondents’ cooperation can be increased by choosing p or 1 − p, whichever is assigned to statement (a), from the interval [0.1, 0.5), and at the same time one can achieve reasonable precision for a suitable choice of m and/or ρ. In general, the results under both link functions are almost the same.
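The properties noted above can be checked with a small sketch. It relies on expressions that follow from Warner's variance and from the binomial variance of the per-rank ‘yes’ counts: up to the common factor 1/(mn), the SRS variance is π(1 − π) + p(1 − p)/(2p − 1)² and the CRSS variance is (1/m) Σ π[i](1 − π[i]) + p(1 − p)/(2p − 1)². The function name and the rank probabilities used in the usage note are hypothetical, not values from the paper's tables.

```python
def relative_efficiency(pi_ranks, p):
    """RE of the CRSS estimator relative to Warner's SRS estimator.

    pi_ranks holds the per-rank success probabilities pi_[1], ..., pi_[m];
    their mean is the population proportion pi. The common factor 1 / (m * n)
    cancels in the ratio, so the number of cycles n does not appear.
    """
    if p == 0.5:
        raise ValueError("RE is undefined at p = 0.5")
    m = len(pi_ranks)
    pi = sum(pi_ranks) / m
    device_noise = p * (1.0 - p) / (2.0 * p - 1.0) ** 2
    var_srs = pi * (1.0 - pi) + device_noise
    var_crss = sum(q * (1.0 - q) for q in pi_ranks) / m + device_noise
    return var_srs / var_crss
```

With, say, `pi_ranks = [0.1, 0.2, 0.3, 0.4, 0.5]`, the function returns the same value for p = 0.2 and p = 0.8 (symmetry about 0.5), and it returns 1 when all rank probabilities coincide, that is, when the ranking carries no information about Y.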

4 Sensitive proportion using ratio method

In survey sampling, concomitant information is commonly used to improve the precision of estimators of non-sensitive quantities. Such information is used at the design phase to select an appropriate sample, directly at the estimation phase through ratio (product) or regression methods, or at both phases. As regards a sensitive proportion, a few attempts have been made to use concomitant information at the design or estimation phase under the SRS plan. For example, [22] constructed a ratio estimator for a sensitive proportion under the SRS plan. The randomizing device for this method consists of a deck of cards showing the two aforesaid statements (a) and (b). In addition, each individual is required to disclose his/her true value of the nonsensitive concomitant X. The sensitive proportion under this scenario is then estimated from a ratio estimator of λ. The expressions for the bias and MSE of this estimator are, respectively, given by (4.1) and (4.2). [22] showed that this ratio estimator is more precise than Warner’s estimator [1] under some suitable conditions. Here, it is important to point out that the concomitant variable used in the method of [22] is binary. However, for a continuous concomitant variable the functional form remains the same, except that the computations shift to numerical integration. Thereafter, [23] extended this work and presented a general form of the estimator under the SRS plan. To the best of our knowledge, no attempt has been made so far to use concomitant information at both the design and estimation stages to optimize the gain in precision when estimating a sensitive proportion under the CRSS plan. This motivated us to fill this gap in the literature and suggest a new improved procedure.

Let {(Y[i]j, X(i)j); i = 1, 2, …, m; j = 1, 2, …, n} be a bivariate ranked set sample. The respondents are instructed to select one of the two aforesaid statements (a) and (b) by using a randomizing device and to report ‘yes’ (‘no’) according to the statement selected by the device and their actual status. In addition, each individual is asked to provide his/her true value of X. Now, along the lines of the estimator in [22], we propose an estimator under the CRSS plan, labeled (4.3), which is based on a ratio estimator of λ computed from the CRSS sample. To derive the bias and MSE of the suggested ratio estimator up to the first order of approximation, we proceed as follows:

Let such that E(ξ0) = 0 = E(ξ1). Following [20, 21] and keeping in view randomized response model, we have and

For proof, see S1 Appendix.

Now, expressing (4.3) in terms of ξi, i = 0, 1, we have (4.4)

Applying expectation on both sides of (4.4), we have

So the bias expression in the final form is given by (4.5)

We see that the bias of the proposed ratio estimator approaches zero as k becomes infinitely large, indicating that it is a consistent estimator of π. To obtain the MSE of the proposed ratio estimator up to the first order of approximation, we extract the following expression from (4.4) (4.6)

By the definition of MSE, from (4.6), we have or (4.7)

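Since the second term on the right side of (4.7) is always positive, the MSE of the proposed ratio estimator under CRSS does not exceed that of the SRS ratio estimator of [22], provided p ≠ 0.5. In other words, the expression (4.7) reveals that the proposed estimator is more reliable (has less risk) than its SRS counterpart.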
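To make the idea of a ratio-type estimator of a sensitive proportion concrete, the sketch below assumes the classical ratio form, in which the observed ‘yes’ proportion is multiplied by the known population mean of X and divided by the sample mean of X, followed by Warner's transformation. This form is an assumption made for illustration: the displayed estimator (4.3) is not reproduced here and may differ in detail. The function and argument names are hypothetical.

```python
def ratio_rr_estimate(responses, x_values, mu_x, p):
    """Ratio-type estimate of a sensitive proportion (assumed classical form).

    responses: 0/1 randomized responses from the sampled units
    x_values:  reported concomitant values of the same units
    mu_x:      known population mean of the concomitant X
    p:         selection probability of statement (a)
    """
    if p == 0.5:
        raise ValueError("p = 0.5 makes the estimator undefined")
    y_bar = sum(responses) / len(responses)
    x_bar = sum(x_values) / len(x_values)
    lam_ratio = y_bar * mu_x / x_bar         # classical ratio adjustment (assumption)
    return (lam_ratio - (1.0 - p)) / (2.0 * p - 1.0)
```

When the sample mean of X equals the population mean, the adjustment factor is 1 and the sketch reduces to the unadjusted Warner-type estimate; otherwise the ‘yes’ proportion is scaled up or down according to how the sampled X values compare with the population.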

The relative efficiency of the proposed ratio estimator under CRSS with respect to its SRS counterpart can be measured by examining (4.8)

The numerical results, for different choices of m and p when X follows (i) a normal distribution or (ii) a uniform distribution, are computed by numerical integration using Mathematica. The RE results, obtained from the expression (4.8), are reported in S5-S8 Tables in S1 File. Note that for the choice p = 0.5 the RE is undefined, so we have omitted the RE values for p = 0.5. As expected, all results in S5-S8 Tables in S1 File are greater than 1, and the RE is an increasing function of m, i.e., more precise results can be obtained by increasing m. It can be observed from S5-S8 Tables in S1 File that there is no symmetry between the RE values obtained over the intervals 0.1 ≤ p < 0.5 and 0.5 < p ≤ 0.9, unlike what was observed for the CRSS and Warner estimators of Section 3 (see S1-S4 Tables in S1 File). However, as all RE values are greater than unity, the proposed ratio estimator can be considered an efficient alternative to the SRS ratio estimator.

5 An application to real data

Following [24], we conducted a small-scale survey to collect a primary data set of 500 male students at Quaid-i-Azam University, Islamabad. In this survey, each student was asked about his age and a sensitive attribute: whether or not he has a ‘girl-friend’. At our request, the students made themselves available for this activity and promised to respond truthfully via Warner’s [1] randomizing device with p = 0.2. We considered ‘age’ as the concomitant variable X and ‘girl-friend’ as the sensitive attribute Y. The purpose of gathering these data was to obtain population quantities such as the mean and variance of X, along with the proportion of the study attribute Y and the correlation coefficient ρ; the latter two are π = 0.30 and ρ = 0.35, respectively.

Treating the above data as the population, we took a concomitant-based ranked set sample of size k = 5(2) = 10 as follows. We selected m² = 25 students by simple random sampling with replacement and randomly partitioned them into 5 sets, each of size 5. The students in each set were then ranked with respect to X, and the ith ranked student was selected from the ith set (i = 1, 2, …, 5) to estimate π. A layout of the CRSS method is given in S9 Table in S1 File, where Y[i]jk denotes the ith judgment (imperfect) order statistic of the student in the jth set at the kth cycle and X(i)jk denotes the ith perfect order statistic of the student in the jth set at the kth cycle. In the final acquired data, we have omitted the jth set information for the sake of brevity. On the other hand, under the simple random sampling plan, 6 out of 10 students reported ‘yes’, that is, the observed proportion of ‘yes’ responses is 0.6. From the data given in S9 Table in S1 File, we have computed some estimates and their associated variances (MSEs) for illustration, as given below.

From (3.1) and (3.6), we have and . Similarly, from (4.2) and (4.7), we have and .

As expected, both the Warner and CRSS estimates are very close to the true π. It can be observed that the variance of the CRSS estimator is less than that of Warner's estimator, which supports using the CRSS estimator instead of Warner's for the estimation of π. Similarly, the MSE of the proposed ratio estimator under CRSS is less than that of its SRS counterpart, indicating that the proposed ratio method for estimating a sensitive attribute is better than the ordinary estimator given in [22]. Moreover, we can expect further improvement in these results by accounting for multiple concomitants, as advised in [14]; this work is in progress.

6 Conclusion

In this study, we have suggested an efficient alternative to Warner’s model [1] for estimating a sensitive proportion under the CRSS plan. Additionally, a new estimator based on the ratio method has been proposed using CRSS and compared with its SRS counterpart given in [22]. Both mathematical and numerical results support our proposed estimators.

In future research, it would be interesting to explore effects on the results in Bayesian framework under ranked set sampling methods.

Finally, we would like to discuss the case of generalizing the proposed ratio estimator of λ so as to incorporate concomitant information along with its known parameters for further enhancing the accuracy of the results. It is worth mentioning that one can also estimate λ via the exponential ratio estimator [25]. Thus, two general families of estimators for λ are presented as where a ≠ 0 and b are known parameters of X. For a specific problem, either of them can be selected to better estimate the sensitive proportion, as opposed to the existing procedure of [25]. Hence, this study has provided different options to the experimenter for obtaining a precise measure of the sensitive proportion.

Supporting information

S1 Fig. Probability tree diagram of ith response.

https://doi.org/10.1371/journal.pone.0256699.s002

(TIF)

S1 File. Relative efficiencies of the proposed methods and a layout of real data set.

https://doi.org/10.1371/journal.pone.0256699.s003

(PDF)

Acknowledgments

The authors are thankful to an Academic Editor and two anonymous reviewers for providing useful comments that substantially improved the previous version of the article.

References

  1. Warner SL (1965). Randomized response: A survey technique for eliminating evasive answer bias. Journal of the American Statistical Association 60: 63–69. pmid:12261830
  2. Horvitz DG, Shah BV, and Simmons WR (1967). The unrelated question randomized response model. Proceedings of the Social Statistics Section, Journal of the American Statistical Association: 65–72.
  3. Greenberg BG, Abul-Ela A-LA, Simmons WR, and Horvitz DG (1969). The unrelated question randomized response model: Theoretical framework. Journal of the American Statistical Association 64: 520–539.
  4. Kuk AYC (1990). Asking sensitive questions indirectly. Biometrika 77: 436–438.
  5. Mangat NS (1994). An improved randomized response strategy. Journal of the Royal Statistical Society: Series B (Methodological) 56: 93–95.
  6. McIntyre GA (1952). A method for unbiased selective sampling, using ranked sets. Australian Journal of Agricultural Research 3: 385–390.
  7. Takahasi K and Wakimoto K (1968). On unbiased estimates of the population mean based on the sample stratified by means of ordering. Annals of the Institute of Statistical Mathematics 20: 1–31.
  8. Stokes SL (1977). Ranked set sampling with concomitant variables. Communications in Statistics-Theory and Methods 6: 1207–1211.
  9. Frey J (2011). A note on ranked set sampling using a covariate. Journal of Statistical Planning and Inference 141: 809–816.
  10. Zamanzade E and Vock M (2015). Variance estimation in ranked set sampling using a concomitant variable. Statistics & Probability Letters 105: 1–5.
  11. Zamanzade E and Mohammadi M (2016). Some Modified Mean Estimators in Ranked Set Sampling Using a Covariate. Journal of Statistical Theory and Applications 15: 142–152.
  12. Zamanzade E and Mahdizadeh M (2018). Distribution function estimation using concomitant-based ranked set sampling. Hacettepe Journal of Mathematics and Statistics 47: 755–761.
  13. Ashour SK and Abdallah MS (2019). New distribution function estimators and tests of perfect ranking in concomitant-based ranked set sampling. Communications in Statistics-Simulation and Computation. 1–26.
  14. Terpstra JT and Liudahl LA (2004). Concomitant-based rank set sampling proportion estimates. Statistics in Medicine 23: 2061–2070. pmid:15211603
  15. Zamanzade E and Mahdizadeh M (2017). A more efficient proportion estimator in ranked set sampling. Statistics & Probability Letters 129: 28–33.
  16. Abbasi AM and Shad MY (2021). Estimation of population proportion using concomitant-based ranked set sampling. Communications in Statistics-Theory and Methods.
  17. Chen H, Stasny EA, and Wolfe DA (2008). Ranked set sampling for ordered categorical variables. Canadian Journal of Statistics 36: 179–191.
  18. Santiago A, Sautto JM, and Bouza CN (2019). Randomized estimation a proportion using ranked set sampling and Warners procedure. Investigacion Operacional 40: 356–361.
  19. David HA and Nagaraja HN (2003). Order Statistics. 3rd Edition. New York, John Wiley & Sons.
  20. Dell TR and Clutter JL (1972). Ranked set sampling theory with order statistics background. Biometrics 545–555.
  21. Samawi HM and Muttlak HA (1996). Estimation of ratio using rank set sampling. Biometrical Journal 38: 753–764.
  22. Yan Z (2006). Ratio method of estimation of population proportion using randomized response technique. Model Assisted Statistics and Applications 1: 125–130.
  23. Diana G and Perri PF (2009). Estimating a sensitive proportion through randomized response procedures based on auxiliary information. Statistical Papers 50: 661–672.
  24. Al-Sobhi MM, Hussain Z, and Al-Zahrani B (2014). General Randomized Response Techniques Using Polya’s Urn Process as a Randomization Device. PLoS ONE 9(12). pmid:25541936
  25. Bahl S and Tuteja RK (1991). Ratio and product type exponential estimators. Journal of Information and Optimization Sciences 12: 159–164.