General Randomized Response Techniques Using Polya's Urn Process as a Randomization Device

This paper proposes improvements to the randomized response techniques of [1] and [2]. The proposed techniques use Polya's urn process (see [3]) as the randomization device for obtaining data from respondents. One of the suggested techniques requires the respondent to report the number of draws needed to observe a fixed number of cards of a certain type. In the other, the respondent reports the number of cards of a certain type observed in a fixed number of draws. Based on the information collected through the suggested techniques, two unbiased estimators of the proportion of a sensitive attribute are suggested. A detailed comparative simulation study is also reported, and the results are supported by a small-scale survey.


Introduction
Surveys and questionnaires are common statistical tools for obtaining data about attitudes, behaviors, emotions, and so forth. An important assumption of any survey technique is that respondents are completely truthful in their reporting. However, the legitimacy of this assumption is dubious when investigators ask questions that most people would hesitate to answer publicly. Examples are questions that disclose whether the respondent engages in illicit behavior or has a socially undesirable trait, or questions about a trait that the respondent finds embarrassing or too personal to reveal openly. Faced with such questions, some respondents in a sample will decline to respond or will misreport. Either type of avoidance introduces bias into the collected information. Hence, there are serious procedural hurdles to conducting surveys in studies in which a stigmatizing characteristic is associated with the phenomenon of concern.
Editor: Enrique Hernandez-Lemus, National Institute of Genomic Medicine, Mexico
[1] introduced an ingenious randomized response technique (RRT) in which the response follows a Bernoulli distribution. A number of RRTs, such as those of [4], [10] and [9], are special cases of the technique of [1]. [19] reported that the family of RRTs of [1] performs better than Simmons' family in terms of efficiency and privacy protection. Recently, [2] improved the RRT of [1] by introducing the geometric distribution as a randomization device.
There is an extensive literature on the applications of urn models in fields such as genetics, capture-recapture models, computer theory, biology, and learning processes. Some applications of urn models in epidemiology are noted by [20], [21] and [22], among many others. Applications to botany, lexicology and numismatics are mentioned by [23]. Interested readers may refer to [24] and [25] for a detailed account of the applications of urn models in randomization structures.
As far as RRTs are concerned, an urn scheme may be used to determine which question is asked in a survey enquiry. The pioneering RRT proposed by [4], and further extended by [18] and [26], is a striking example. The use of an urn model as the randomization device for the RRT of [4] may be briefly explained as follows. The respondent randomly draws a card from an urn containing $g$ green and $r$ red cards and then replaces it. If a green card is drawn, the respondent answers yes or no to the statement "I am a member of the sensitive group"; if a red card is drawn, the statement is "I am not a member of the sensitive group". The interviewer is unaware of the color of the card drawn by each respondent, but the probability of drawing a green card, $g/(g+r)$, is known. Under the assumptions of random drawing of cards and truthful reporting, the total number of yes responses in a sample of $n$ respondents follows a binomial distribution with parameters $n$ and $p\,g/(g+r) + (1-p)\,r/(g+r)$, where $p$ denotes the proportion of the population in the sensitive group.
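The urn mechanism described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the seed, and the values of `n`, `p`, `g` and `r` are hypothetical, and the moment estimator shown is the standard one for this design rather than a formula quoted from the text. The chance of a green card is $g/(g+r)$, and a truthful respondent says yes exactly when the drawn statement is true.

```python
import random

def warner_yes_probability(p, g, r):
    """P(yes) when a green card carries 'I am a member' and a red card its negation."""
    green = g / (g + r)                      # known chance of a green card
    return green * p + (1 - green) * (1 - p)

def simulate_warner(n, p, g, r, rng):
    """Count 'yes' answers from n truthful respondents (hypothetical helper)."""
    yes = 0
    for _ in range(n):
        member = rng.random() < p            # hidden true status
        green = rng.random() < g / (g + r)   # colour seen only by the respondent
        yes += int(member == green)          # 'yes' iff the drawn statement is true
    return yes

def warner_estimate(yes, n, g, r):
    """Moment estimator of p; undefined when g == r."""
    green = g / (g + r)
    return (yes / n - (1 - green)) / (2 * green - 1)

rng = random.Random(1)                       # fixed seed, illustration only
n, p, g, r = 5000, 0.2, 7, 3
yes = simulate_warner(n, p, g, r, rng)
print(round(warner_estimate(yes, n, g, r), 3))
```

The estimator breaks down when $g = r$, because the yes-probability then equals 1/2 whatever the value of $p$.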
Of the many urn models, Polya's urn model is very popular in statistics because it generalizes the binomial, hypergeometric, and beta-Bernoulli (beta-binomial) distributions through a single formula. In the present study, we apply Polya's urn scheme to randomize the responses. It is important to note that different discrete distributions, such as the binomial, hypergeometric, negative binomial, geometric, negative hypergeometric, beta-binomial and uniform, can be generated through Polya's urn scheme. Thus, a randomization device based on Polya's urn scheme is more flexible and may be regarded as a generalization of devices based on the above-mentioned distributions. The idea is taken from the RRT of [2] (a special case of the RRT of [1]), which yields a response following a geometric distribution. The rest of the paper is organized as follows. In the next section, we present the RRTs of [1] and [2]. Two new estimators using Polya's urn process are suggested in Section 3. Section 4 contains the discussion and conclusion of the study. A real-life example is presented in Section 5.

Some Recent Related RRTs
In this section, we present brief summaries of the RRTs of [1] and [2] and introduce the notation. Let $U = (u_1, u_2, \ldots, u_\infty)$ be an infinite dichotomous population in which every individual belongs either to a sensitive group $G$ (possessing a sensitive attribute) or to its complement $\bar{G}$. The problem is to estimate $p$ ($0 < p < 1$), the unknown proportion of population members in group $G$. To do so, a sample $s = (u_1, u_2, \ldots, u_n)$ of size $n$ is drawn from the population $U$ by simple random sampling with replacement. Because of the sensitive nature of the attribute under study, a direct question regarding membership in $G$ is not expected to elicit cooperation from the respondents. Thus, an alternative procedure such as an RRT is needed if we are to procure reliable data on the sensitive attribute.
Two of the background RRTs are discussed in the following subsections.

Kuk [1] RRT
In this RRT, a respondent who belongs to the sensitive group $G$ is instructed to use a deck of cards in which a proportion $h_1$ of the cards bear the statement "$I \in G$", whereas a respondent who belongs to the non-sensitive group $\bar{G}$ is requested to use a different deck in which a proportion $h_2$ of the cards bear the statement "$I \notin G$". The probability of a yes response in the [1] model is given by
$$P(\text{yes})_{Kuk} = h_{Kuk} = h_1 p + h_2 (1 - p). \tag{2.1}$$
An unbiased estimator of $p$ is given by
$$\hat{p}_{Kuk} = \frac{n_1/n - h_2}{h_1 - h_2}, \quad h_1 \neq h_2, \tag{2.2}$$
where $n_1$ is the observed number of yes responses in the sample $s$ and follows a binomial distribution with parameters $h_{Kuk} = h_1 p + h_2 (1 - p)$ and $n$. Thus the variance of $\hat{p}_{Kuk}$ is given by
$$V(\hat{p}_{Kuk}) = \frac{h_{Kuk}(1 - h_{Kuk})}{n (h_1 - h_2)^2}. \tag{2.3}$$
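As a minimal sketch of how the [1] model might be applied, the code below solves (2.1) for $p$ after replacing the yes-probability by the sample proportion, which gives $(n_1/n - h_2)/(h_1 - h_2)$, and scales the binomial variance accordingly. The function names, the seed and the parameter values are assumptions made for illustration only.

```python
import random

def kuk_estimate(n_yes, n, h1, h2):
    """Solve (2.1) for p after replacing P(yes) by the sample proportion n_yes / n."""
    return (n_yes / n - h2) / (h1 - h2)

def kuk_variance(p, n, h1, h2):
    """Binomial variance of n_yes / n, scaled by 1 / (h1 - h2)^2."""
    h = h1 * p + h2 * (1 - p)                # P(yes), equation (2.1)
    return h * (1 - h) / (n * (h1 - h2) ** 2)

def simulate_kuk(n, p, h1, h2, rng):
    """Number of 'yes' reports from n truthful respondents (hypothetical helper)."""
    n_yes = 0
    for _ in range(n):
        in_g = rng.random() < p              # hidden group membership
        n_yes += int(rng.random() < (h1 if in_g else h2))
    return n_yes

rng = random.Random(7)                       # fixed seed, illustration only
n, p, h1, h2 = 2000, 0.2, 0.7, 0.2
estimate = kuk_estimate(simulate_kuk(n, p, h1, h2, rng), n, h1, h2)
print(round(estimate, 3), round(kuk_variance(p, n, h1, h2), 6))
```

The variance grows without bound as $h_1$ approaches $h_2$, which is why the two decks must differ clearly in composition.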

Singh and Grewal [2] RRT
In this RRT, each respondent is provided with two decks of cards, as in the RRT of [1]. In the first deck, $h_1^*$ is the proportion of cards with the statement "$I \in G$" and $(1 - h_1^*)$ is the proportion of cards with the statement "$I \notin G$". In the second deck, $h_2^*$ is the proportion of cards with the statement "$I \notin G$" and $(1 - h_2^*)$ is the proportion of cards with the statement "$I \in G$". Up to this point, the design is the same as that of [1]. A respondent who belongs to the sensitive group $G$ is instructed to draw cards one by one, with replacement, from the first deck until he/she obtains the first card bearing the statement of his/her own status, and is requested to report the number of cards drawn, say $X$. A respondent who belongs to the non-sensitive group $\bar{G}$ is instructed to do the same with the second deck and to report the number of cards drawn, say $Y$. Since cards are drawn with replacement, $X$ and $Y$ follow geometric distributions with parameters $h_1^*$ and $h_2^*$, respectively. If $Z_i$ denotes the number of cards reported by the $i$th respondent, then it can be written as
$$Z_i = a_i X_i + (1 - a_i) Y_i,$$
where $a_i$ is a Bernoulli random variable with $E(a_i) = p$. The expectation of the reported number of cards is given by
$$E(Z_i) = \frac{p}{h_1^*} + \frac{1 - p}{h_2^*}. \tag{2.4}$$
An unbiased estimator of $p$ proposed by [2] is given by
$$\hat{p}_{SG} = \frac{\bar{Z} - 1/h_2^*}{1/h_1^* - 1/h_2^*}, \quad \bar{Z} = \frac{1}{n}\sum_{i=1}^{n} Z_i, \tag{2.5}$$
with variance given by
$$V(\hat{p}_{SG}) = \frac{p\,(1 - h_1^*)/h_1^{*2} + (1 - p)(1 - h_2^*)/h_2^{*2} + p(1 - p)\,(1/h_1^* - 1/h_2^*)^2}{n\,(1/h_1^* - 1/h_2^*)^2}. \tag{2.6}$$
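The geometric device of [2] can be illustrated with a short simulation. The sketch below assumes the usual geometric moments (mean $1/h$ and variance $(1-h)/h^2$ for the number of draws up to and including the first success) and solves (2.4) for $p$ using the sample mean of the reports. All function names, the seed and the parameter values are hypothetical.

```python
import random

def geometric_draws(theta, rng):
    """Draws (with replacement) up to and including the first own-status card."""
    draws = 1
    while rng.random() >= theta:
        draws += 1
    return draws

def sg_estimate(z, h1s, h2s):
    """Solve (2.4) for p with E(Z_i) replaced by the sample mean of the reports."""
    z_bar = sum(z) / len(z)
    return (z_bar - 1 / h2s) / (1 / h1s - 1 / h2s)

def sg_variance(p, n, h1s, h2s):
    """Variance of the estimator from the two-component geometric mixture."""
    mu_x, mu_y = 1 / h1s, 1 / h2s
    var_x, var_y = (1 - h1s) / h1s ** 2, (1 - h2s) / h2s ** 2
    var_z = p * var_x + (1 - p) * var_y + p * (1 - p) * (mu_x - mu_y) ** 2
    return var_z / (n * (mu_x - mu_y) ** 2)

rng = random.Random(11)                      # fixed seed, illustration only
n, p, h1s, h2s = 2000, 0.2, 0.7, 0.2
z = [geometric_draws(h1s if rng.random() < p else h2s, rng) for _ in range(n)]
print(round(sg_estimate(z, h1s, h2s), 3))
```

Because the reported value is a count rather than a yes/no answer, each response carries more information about group membership than in the Bernoulli design, at the price of a longer task for the respondent.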

Proposed RRTs
In this section, we present two new RRTs using Polya's urn scheme.

Proposed RRT I
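A more general RRT is explained below. Consider two decks, each containing two types of cards, red and green. Deck 1 contains $a_1$ red and $b_1$ green cards, and deck 2 contains $a_2$ red and $b_2$ green cards. Each respondent belonging to the sensitive (non-sensitive) group is requested to use deck 1 (deck 2) and to draw $n_1$ ($n_2$) cards at random, one by one. After each draw, he/she is requested to replace the card drawn and add $c_1$ ($c_2$) cards of the same color. A respondent belonging to the sensitive (non-sensitive) group is then required to report the number of red cards drawn, say $X'$ ($Y'$), in the $n_1$ ($n_2$) draws. Here $X'$ and $Y'$ have the distributions $f_1(x')$ and $f_1(y')$, respectively, whose functional forms are given by
$$f_1(x') = \binom{n_1}{x'} \frac{\prod_{i=0}^{x'-1}(a_1 + i c_1)\,\prod_{j=0}^{n_1 - x' - 1}(b_1 + j c_1)}{\prod_{k=0}^{n_1 - 1}(a_1 + b_1 + k c_1)}, \quad x' = 0, 1, \ldots, n_1, \tag{3.1}$$
and
$$f_1(y') = \binom{n_2}{y'} \frac{\prod_{i=0}^{y'-1}(a_2 + i c_2)\,\prod_{j=0}^{n_2 - y' - 1}(b_2 + j c_2)}{\prod_{k=0}^{n_2 - 1}(a_2 + b_2 + k c_2)}, \quad y' = 0, 1, \ldots, n_2. \tag{3.2}$$
The response $Z'_i$ from the $i$th respondent may be written as
$$Z'_i = a_i X'_i + (1 - a_i) Y'_i, \tag{3.3}$$
where $a_i$ is the Bernoulli random variable defined above (not to be confused with the card counts $a_1$ and $a_2$), $E(X'_i) = m_{X'} = n_1 a_1/(a_1 + b_1)$ and $E(Y'_i) = m_{Y'} = n_2 a_2/(a_2 + b_2)$. Thus, the expected response may be written as
$$E(Z'_i) = p\, m_{X'} + (1 - p)\, m_{Y'}. \tag{3.4}$$
By solving (3.4) for $p$ and estimating $E(Z'_i)$ by $\bar{Z}' = (1/n)\sum_{i=1}^{n} Z'_i$, an unbiased estimator of $p$ is suggested as follows:
$$\hat{p}_1 = \frac{\bar{Z}' - m_{Y'}}{m_{X'} - m_{Y'}}, \quad m_{X'} \neq m_{Y'}. \tag{3.5}$$
Its variance is given by
$$V(\hat{p}_1) = \frac{p\, s^2_{X'} + (1 - p)\, s^2_{Y'} + p(1 - p)(m_{X'} - m_{Y'})^2}{n\,(m_{X'} - m_{Y'})^2}, \tag{3.6}$$
where $s^2_{X'} = n_1 a_1 b_1 (a_1 + b_1 + n_1 c_1)/[(a_1 + b_1)^2 (a_1 + b_1 + c_1)]$ and $s^2_{Y'} = n_2 a_2 b_2 (a_2 + b_2 + n_2 c_2)/[(a_2 + b_2)^2 (a_2 + b_2 + c_2)]$ are the variances of $X'$ and $Y'$.
The following remarks are in order.
Remark 1: It is interesting to see that the reported response $Z'_i$ follows a two-component mixture distribution with $p$ and $(1 - p)$ as the mixing probabilities. For $c_1 = c_2 = 1$, the distribution of the response $Z'_i$ is a mixture of two beta-binomial distributions with parameters $(n_1, a_1, b_1)$ and $(n_2, a_2, b_2)$.
Remark 2: For $c_1 = c_2 = 0$, the distribution of $Z'_i$ is a mixture of two binomial distributions with parameters $n_1$, $p_1 = a_1/(a_1 + b_1)$ and $n_2$, $p_2 = a_2/(a_2 + b_2)$.
Remark 3: For $c_1 = c_2 = -1$, the distribution of $Z'_i$ is a mixture of two hypergeometric distributions with parameters $(a_1 + b_1, a_1, n_1)$ and $(a_2 + b_2, a_2, n_2)$. In this case we must have $n_1 \le a_1 + b_1$ and $n_2 \le a_2 + b_2$.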
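To make the mechanics of the proposed RRT I concrete, the following sketch evaluates the probability of observing a given number of red cards in a fixed number of Polya draws, simulates respondents, and solves (3.4) for $p$. It is an illustration only: the pmf coded here is the standard Polya-Eggenberger form (a binomial coefficient times products of the evolving card counts), and the deck compositions, seed and function names are hypothetical choices, not values taken from the study.

```python
import random
from math import comb, prod

def polya_pmf(x, n, a, b, c):
    """P(x red cards in n draws) when c cards of the drawn colour are added each time."""
    red = prod(a + i * c for i in range(x))
    green = prod(b + j * c for j in range(n - x))
    total = prod(a + b + k * c for k in range(n))
    return comb(n, x) * red * green / total

def polya_draw(n, a, b, c, rng):
    """Simulate one respondent: number of red cards seen in n draws."""
    reds = 0
    for _ in range(n):
        if rng.random() < a / (a + b):
            reds, a = reds + 1, a + c        # red drawn: urn gains c red cards
        else:
            b += c                           # green drawn: urn gains c green cards
    return reds

def rrt1_estimate(z, m_x, m_y):
    """Solve E(Z') = p*m_x + (1 - p)*m_y for p using the sample mean of the reports."""
    return (sum(z) / len(z) - m_y) / (m_x - m_y)

# Hypothetical design: deck 1 = (8 red, 2 green), deck 2 = (2 red, 8 green).
n1 = n2 = 5
deck1, deck2, c = (8, 2), (2, 8), 1
m_x = n1 * deck1[0] / sum(deck1)             # mean report in the sensitive group
m_y = n2 * deck2[0] / sum(deck2)             # mean report in the non-sensitive group

rng = random.Random(3)                       # fixed seed, illustration only
p_true, n = 0.2, 2000
z = [polya_draw(n1, *deck1, c, rng) if rng.random() < p_true
     else polya_draw(n2, *deck2, c, rng) for _ in range(n)]
print(round(rrt1_estimate(z, m_x, m_y), 3))
```

Setting `c = 0` reproduces binomial reports and `c = -1` hypergeometric ones (with the number of draws not exceeding the deck size), so the same code covers the special cases listed in the remarks.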

Proposed RRT II
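The proposed RRT II works in a fashion similar to that of the proposed RRT I. Here, we assume that $c_1 = c_2 = 1$, and respondents are requested to report the number of draws needed to observe a fixed number, say $r_1$ and $r_2$, of red cards. Let $X''$ ($Y''$) denote the number of draws from urn 1 (urn 2) required to observe $r_1$ ($r_2$) red cards. Now $X''$ and $Y''$ have the distributions given by
$$P(X'' = x'') = \binom{x'' - 1}{r_1 - 1} \frac{\prod_{i=0}^{r_1 - 1}(a_1 + i c_1)\,\prod_{j=0}^{x'' - r_1 - 1}(b_1 + j c_1)}{\prod_{k=0}^{x'' - 1}(a_1 + b_1 + k c_1)}, \quad x'' = r_1, r_1 + 1, r_1 + 2, \ldots, \infty, \tag{3.7}$$
and
$$P(Y'' = y'') = \binom{y'' - 1}{r_2 - 1} \frac{\prod_{i=0}^{r_2 - 1}(a_2 + i c_2)\,\prod_{j=0}^{y'' - r_2 - 1}(b_2 + j c_2)}{\prod_{k=0}^{y'' - 1}(a_2 + b_2 + k c_2)}, \quad y'' = r_2, r_2 + 1, r_2 + 2, \ldots, \infty. \tag{3.8}$$
The response $Z''_i$ from the $i$th respondent may be written as $Z''_i = a_i X''_i + (1 - a_i) Y''_i$, where $a_i$ is a random variable defined as above, $E(X''_i) = m_{X''}$ and $E(Y''_i) = m_{Y''}$. Thus, the expected response may be written as
$$E(Z''_i) = p\, m_{X''} + (1 - p)\, m_{Y''}. \tag{3.9}$$
Following the steps in subsection 3.1, by solving (3.9) for $p$ and estimating $E(Z''_i)$ by $\bar{Z}'' = n^{-1}\sum_{i=1}^{n} Z''_i$, an unbiased estimator of $p$ is suggested as
$$\hat{p}_2 = \frac{\bar{Z}'' - m_{Y''}}{m_{X''} - m_{Y''}}, \quad m_{X''} \neq m_{Y''}. \tag{3.10}$$
Its variance is given by
$$V(\hat{p}_2) = \frac{p\, s^2_{X''} + (1 - p)\, s^2_{Y''} + p(1 - p)(m_{X''} - m_{Y''})^2}{n\,(m_{X''} - m_{Y''})^2}, \tag{3.11}$$
where $s^2_{X''}$ and $s^2_{Y''}$ denote the variances of $X''$ and $Y''$.
Remark 5: For $c_1 = c_2 = 0$, the distribution of $Z''_i$ is a mixture of two negative binomial distributions with parameters $r_1$, $p_1 = a_1/(a_1 + b_1)$ and $r_2$, $p_2 = a_2/(a_2 + b_2)$.
Remark 8: If $a_1 = b_1 = c_1$ and $a_2 = b_2 = c_2$, the distribution of $Z''_i$ is a mixture of two uniform distributions.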
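The waiting-time device of the proposed RRT II may be sketched as follows. The pmf coded below is the probability that the $r$th red card appears on draw $x$ when $c$ cards of the drawn color are added after each draw. The closed-form mean $r(a+b-c)/(a-c)$, valid for $a > c$, is a standard result that is assumed here and not quoted from the text. The deck compositions, the seed and all function names are hypothetical.

```python
import random
from math import comb, prod

def inverse_polya_pmf(x, r, a, b, c):
    """P(the r-th red card appears on draw x); zero outside the support."""
    if x < r:
        return 0.0
    total = prod(a + b + k * c for k in range(x))
    if total == 0:                           # urn exhausted (only possible for c < 0)
        return 0.0
    red = prod(a + i * c for i in range(r))
    green = prod(b + j * c for j in range(x - r))
    return comb(x - 1, r - 1) * red * green / total

def waiting_time_mean(r, a, b, c):
    """Closed-form mean r(a+b-c)/(a-c), valid for a > c (assumed standard result)."""
    return r * (a + b - c) / (a - c)

def waiting_time_draw(r, a, b, c, rng):
    """Simulate one respondent: draws needed to see r red cards (c >= 0 assumed)."""
    draws = reds = 0
    while reds < r:
        draws += 1
        if rng.random() < a / (a + b):
            reds, a = reds + 1, a + c
        else:
            b += c
    return draws

def rrt2_estimate(z, m_x, m_y):
    """Solve E(Z'') = p*m_x + (1 - p)*m_y for p using the sample mean of the reports."""
    return (sum(z) / len(z) - m_y) / (m_x - m_y)

# Hypothetical design with c = 0, so that reports are negative binomial.
r, c = 4, 0
deck1, deck2 = (7, 3), (2, 8)
m_x = waiting_time_mean(r, *deck1, c)
m_y = waiting_time_mean(r, *deck2, c)

rng = random.Random(5)                       # fixed seed, illustration only
p_true, n = 0.2, 2000
z = [waiting_time_draw(r, *deck1, c, rng) if rng.random() < p_true
     else waiting_time_draw(r, *deck2, c, rng) for _ in range(n)]
print(round(rrt2_estimate(z, m_x, m_y), 3))
```

For $c > 0$ the waiting time has a heavy right tail, and its variance is finite only when the initial number of red cards is large relative to $c$, which is worth checking before choosing a design.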

Discussion and Conclusion
Since our objective in this study was to introduce an application of Polya's urn process for obtaining data on sensitive variables, we did not intend to carry out a full-fledged comparative study of the proposed estimators against other estimators. However, to give an idea, we considered the estimator $\hat{p}_2$ and compared it with the estimators of [1] and [2], assuming $c_1 = c_2 = 0$, $r_1 = r_2 = 4$, $h_1 = 0.7$ and $h_2 = 0.2$. The reason for setting $h_1 = 0.7$ and $h_2 = 0.2$ is that the [1] model is at its best when $|h_1 - h_2|$ is at its maximum. As mentioned in Remark 5 above, for $c_1 = c_2 = 0$ we have $p_j = a_j (a_j + b_j)^{-1} = h_j^*$ for $j = 1, 2$. The relative efficiency (RE) of the estimator $\hat{p}_2$ relative to $\hat{p}_{Kuk}$ and $\hat{p}_{SG}$ is defined as $RE_1 = V(\hat{p}_{Kuk})/V(\hat{p}_2)$ and $RE_2 = V(\hat{p}_{SG})/V(\hat{p}_2)$, respectively. The RE results are displayed in S1 Table, available in the supporting information files. From S1 Table, it is observed that the proposed estimator is more efficient than those of [1] and [2]. For the situations where $c_1 = c_2 = 1$ and $c_1 = c_2 = -1$, we carried out a somewhat more detailed comparison. Since $n_1$ (or $n_2$) is the fixed number of cards to be drawn in the proposed RRT I and $r_1$ (or $r_2$) is the pre-decided number of cards of a certain type in the proposed RRT II, we fixed $n_1 = r_1$ and $n_2 = r_2$ so that the proposed estimators could be compared with each other on an equal footing. The REs of the proposed estimators $\hat{p}_1$ and $\hat{p}_2$ relative to $\hat{p}_{Kuk}$ and $\hat{p}_{SG}$ are arranged in S2-S4 Tables and S5-S10 Tables, respectively. It was observed that the RE of the proposed estimators relative to $\hat{p}_{Kuk}$ increases with $a_1$ when $c_1 = c_2 = 1$ (see S2 and S3 Tables). The proposed estimators are also more efficient when we take $c_1 = c_2 = -1$ (see S4 Table). The RE of the proposed estimators behaves in the same way when they are compared with $\hat{p}_{SG}$ (see S5-S10 Tables). From S2-S10 Tables, it is evident that the proposed estimators outperform the two competing estimators $\hat{p}_{Kuk}$ and $\hat{p}_{SG}$.
Also, it can be observed that the RE of both estimators is higher (lower) for larger (smaller) $p$ when either $c_1 = c_2 = 1$ or $c_1 = c_2 = -1$, whereas for $c_1 = c_2 = 0$ the situation is reversed. The RE of the proposed estimators increases with the difference between $n_1$ ($r_1$) and $n_2$ ($r_2$). The overall finding is that the proposed estimator $\hat{p}_1$ is more efficient than $\hat{p}_2$. That is, reporting the number of cards of a certain type in a fixed number of draws is more useful than asking the respondent to keep drawing cards until he/she observes a pre-decided number of cards of one kind.
It is to be noted that the variances $V(\hat{p}_1)$ and $V(\hat{p}_2)$ are decreasing functions of $|m_{X'} - m_{Y'}|$ and $|m_{X''} - m_{Y''}|$, respectively. Thus, the variances of the proposed estimators may be reduced to a desired level by choosing the values of $a_1, a_2, b_1, b_2, n_1, n_2, r_1$ and $r_2$ so that $|m_{X'} - m_{Y'}|$ and $|m_{X''} - m_{Y''}|$ are as large as possible. Moreover, Polya's urn process generates different distributions. Thus, Polya's urn process is a more general and more flexible scheme for generating a randomized response that follows a desired distribution. Additionally, the proposed RRTs require no additional sampling cost, and every respondent uses the same randomization device. These two features may be considered extra advantages of the proposal.
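As a rough illustration of how relative efficiencies of this kind can be computed, the sketch below evaluates the variance ratios for the setting $c_1 = c_2 = 0$, $r_1 = r_2 = 4$, $h_1 = 0.7$ and $h_2 = 0.2$. It rests on several assumptions that are not confirmed by the text: the [1] estimator is taken with a single Bernoulli report per respondent, the [2] estimator is treated as the $r = 1$ case of the waiting-time design, and the negative binomial moments $r/h$ and $r(1-h)/h^2$ are used. The function names are hypothetical, and the values printed need not coincide with those in the supplementary tables.

```python
def mixture_estimator_variance(p, n, mu_x, mu_y, var_x, var_y):
    """V[(Z_bar - mu_y)/(mu_x - mu_y)] when Z = X w.p. p and Z = Y w.p. 1 - p."""
    var_z = p * var_x + (1 - p) * var_y + p * (1 - p) * (mu_x - mu_y) ** 2
    return var_z / (n * (mu_x - mu_y) ** 2)

def var_kuk(p, n, h1, h2):
    """Variance of the single-report Bernoulli estimator (assumed form)."""
    h = h1 * p + h2 * (1 - p)
    return h * (1 - h) / (n * (h1 - h2) ** 2)

def var_waiting_time(p, n, r, h1, h2):
    """c1 = c2 = 0: negative binomial reports with r successes; r = 1 is geometric."""
    return mixture_estimator_variance(
        p, n, r / h1, r / h2, r * (1 - h1) / h1 ** 2, r * (1 - h2) / h2 ** 2)

def relative_efficiencies(p, r=4, h1=0.7, h2=0.2, n=1):
    """Return (RE_1, RE_2); the sample size n cancels in both ratios."""
    v2 = var_waiting_time(p, n, r, h1, h2)
    return var_kuk(p, n, h1, h2) / v2, var_waiting_time(p, n, 1, h1, h2) / v2

for p in (0.1, 0.3, 0.5):
    re1, re2 = relative_efficiencies(p)
    print(p, round(re1, 3), round(re2, 3))
```

Values above 1 indicate that, under these assumptions, the waiting-time estimator has the smaller variance.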

A practical example
As a practical example, we conducted a small-scale survey with a sample of size 100. Consider the population of students currently enrolled in different programs at Quaid-i-Azam University, Islamabad. The students were verbally requested to volunteer for this survey and were assured that their identity would not be disclosed in any way. From this population, we took 1000 students, 200 of whom had been using marijuana during the last six months. The purpose was to obtain a population with a known proportion of marijuana users, that is, $p = 0.2$. Since the simulation results show that the proposed RRT I is better than the others, we decided to apply the proposed RRT I in the actual application. A simple random sample of 100 students (out of the 1000 selected students) was drawn with replacement, and every selected student was given two urns, each containing red and green cards. Urn 1 contained 1000 red and 300 green cards, and urn 2 contained 100 red and 40 green cards. To generate data through the proposed RRT I, the student was asked to draw 3 cards at random from urn 1 if he/she had used marijuana at least once in the last six months, and 3 cards from urn 2 otherwise. At each draw, he/she was directed to return the drawn card together with one additional card of the same color (i.e., $c_1 = c_2 = 1$). After drawing the cards, he/she was requested to report the number of red cards. To generate responses through the RRTs of [1] and [2], we fixed $h_1 = h_1^* = 0.7$ and $h_2 = h_2^* = 0.3$. It is to be noted that the same respondents were used to generate the responses through the three randomization devices considered in this study. The data obtained through these randomization devices are presented in S11-S13 Tables. The estimates of the proportion of students who had used marijuana at least once during the last six months are $\hat{p}_1 = 0.204$, $\hat{p}_{Kuk} = 0.12$ and $\hat{p}_{SG} = 0.2475$.
From these estimates, it is clear that the proposed RRT I provided the estimate closest to the population proportion, $p = 0.2$. Hence, in this small-scale survey, the proposed RRT I was more accurate than the other RRTs considered.