Parameter Estimation in Stratified Cluster Sampling under Randomized Response Models for Sensitive Question Survey

Randomized response is a research method for obtaining accurate answers to sensitive questions in structured sample surveys. Simple random sampling is widely used in surveys of sensitive questions but is hard to apply to large target populations. On the other hand, more sophisticated sampling designs and their corresponding formulas are seldom applied to sensitive question surveys. In this work, we developed a series of formulas for parameter estimation in cluster sampling and stratified cluster sampling under two kinds of randomized response models, using classical sampling theory and the total probability formula. We also report the performance of these sampling methods and formulas in a survey of premarital sex and cheating on exams at Soochow University. The reliability of the survey methods and formulas for sensitive question surveys was found to be high.


Introduction
In surveys, researchers sometimes ask about sensitive topics such as sex, income and drug abuse. Respondents often try to avoid answering sensitive questions because they are concerned about privacy. Some refuse to cooperate with researchers or give false answers. This results in response bias and reporting errors in surveys on sensitive topics [1]. The randomized response technique was developed by Warner to collect sensitive information [2]. Because the method was argued to protect privacy, response bias could potentially be removed. Modified randomized response models were subsequently proposed for different kinds of sensitive question surveys. For example, the Simmons model was introduced for dichotomous sensitive questions [3] [4]. The additive model was designed for quantitative sensitive questions and has the advantages of a simple design, a small required sample size and reduced sampling error [5] [6].
Simple random sampling has been considered a useful sampling method under randomized response models for sensitive question surveys [7]. It is easy to understand and to apply with a random number table. However, the method is susceptible to sampling error if the sample is neither large nor representative, and it is cumbersome and costly when sampling from a large target population [8] [9]. Compared with simple random sampling, more sophisticated sampling methods may be more suitable for a large population. For instance, stratified sampling uses prior information about the population to make the sampling procedure more efficient and to reduce sampling error, while cluster sampling, which separates the population into clusters, has the advantages of lower cost and greater convenience [10] [11]. These methods are not widely used in sensitive question surveys, however, because they are harder to understand than simple random sampling. As a result, formulas for simple random sampling may be incorrectly used in place of those for a more complex but more efficient sampling procedure. Even when effective sampling has been performed, the reliability of sampling methods under randomized response models has seldom been evaluated [12] [13].
In this work, we present designs for cluster sampling and stratified cluster sampling under two kinds of randomized response models and derive the corresponding formulas for parameter estimation. These complex sampling methods were successfully employed in a survey of premarital sex and cheating on exams at Soochow University and may be more applicable to large populations.

Dichotomous sensitive question survey
Simmons model on dichotomous sensitive question survey. In the Simmons model, the randomized device is a bag containing a mixture of red balls and white balls. Before responding, a respondent randomly draws a ball from the bag, notes its color and then puts it back. If a red ball is drawn, the respondent gives a yes or no answer to a question about sensitive character A; if a white ball is drawn, the respondent gives a yes or no answer to another question about non-sensitive character B. The balls are identical except for color, and the probability of drawing a red ball should not be 50% [5].
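The randomizing device can be illustrated with a short simulation. This is a minimal sketch rather than part of the survey protocol: the function name `simmons_answer` and the parameter name `p_red` (the probability of drawing a red ball) are assumptions introduced here for illustration, and the value 0.6 in the usage lines is only an example setting.

```python
import random


def simmons_answer(has_a: bool, has_b: bool, p_red: float, rng: random.Random) -> bool:
    """Simulate one respondent's yes/no answer under the Simmons model.

    has_a: whether the respondent has sensitive character A
    has_b: whether the respondent has non-sensitive character B
    p_red: probability of drawing a red ball (hypothetical parameter name)
    """
    drew_red = rng.random() < p_red  # red ball -> answer the sensitive question
    return has_a if drew_red else has_b  # white ball -> answer the unrelated question


# Usage: the interviewer records only the answer, never the ball color.
rng = random.Random(2024)
answers = [simmons_answer(True, False, 0.6, rng) for _ in range(5)]
```

Because the ball is put back after each draw, every respondent faces the same drawing probability, which is what allows the proportion of "yes" answers to be related to the proportion of character A.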
Cluster sampling under Simmons model. Cluster sampling is convenient in scientific research. The cluster sampling process under the Simmons model is as follows: (1) The population is divided into several clusters (primary units), each composed of secondary units. (2) Some clusters are randomly selected from the population. (3) The Simmons model described above is applied to all the secondary units of the selected clusters for the dichotomous sensitive question survey.
Stratified cluster sampling under Simmons model. Stratified sampling is useful in reducing sampling error. Building on cluster sampling, the stratified cluster sampling process under the Simmons model is as follows: (1) The population is divided into strata according to its characteristics. (2) Each stratum is divided into clusters (primary units). (3) Each cluster is composed of secondary units. (4) Clusters are randomly drawn from each stratum, and the Simmons model described above is applied to all the secondary units of the selected clusters to conduct the dichotomous sensitive question survey.

Quantitative sensitive question survey
The additive model on quantitative sensitive question. In the additive model [5], a randomized device is designed to randomly generate an integer between 0 and 9. The device contains ten identical balls, each labeled with one of the ten integers. Through a randomization process, each respondent draws an integer-labeled ball and adds the integer to the numerical value of his or her response to the sensitive question to get a final result.
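The additive device can be sketched in a few lines. The names `DEVICE_VALUES`, `DEVICE_MEAN` and `additive_report` are hypothetical and are introduced here only for illustration; the sketch assumes each of the ten balls is equally likely to be drawn.

```python
import random

DEVICE_VALUES = range(10)  # balls labeled 0, 1, ..., 9
DEVICE_MEAN = sum(DEVICE_VALUES) / len(DEVICE_VALUES)  # mean of the labels, 4.5


def additive_report(true_value: int, rng: random.Random) -> int:
    """Return the final result a respondent reports: true value plus drawn integer."""
    return true_value + rng.choice(DEVICE_VALUES)


# Usage: a respondent whose true value is 3 reports some integer from 3 to 12.
rng = random.Random(7)
report = additive_report(3, rng)
```

A single report does not reveal the respondent's true value, but the mean of the device labels is known, so the mean of the true values in a group can still be recovered from the mean of the reports.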
Cluster sampling under the additive model on quantitative sensitive question. The cluster sampling design described above is used, and the additive model is applied to the subunits of the selected clusters for the quantitative sensitive question survey.

Formula Deductions
Estimation of the population proportion for sensitive question survey
Formulas for cluster sampling. Suppose the population is divided into $N$ clusters, the $i$th cluster contains $M_i$ subunits, and $n$ clusters are drawn at random from the population.
Estimation of the population proportion and its variance is described below. Suppose the proportion of sensitive character A in the $i$th cluster is $\pi_i$, the $i$th cluster contains $a_i$ secondary units with character A (so that $\pi_i = a_i / M_i$), and the proportion of character A in the population is $\pi$.
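① When each cluster contains $M$ secondary units, the estimator of $\pi$ is [8]

$$\hat{p} = \frac{1}{nM}\sum_{i=1}^{n} a_i = \frac{1}{n}\sum_{i=1}^{n} \pi_i \tag{1}$$

The estimator of the variance of $\hat{p}$ can be stated as [8]

$$v(\hat{p}) = \frac{1-f}{n(n-1)}\sum_{i=1}^{n}\left(\frac{a_i}{M} - \hat{p}\right)^2 = \frac{1-f}{n(n-1)}\sum_{i=1}^{n}\left(\pi_i - \hat{p}\right)^2 \tag{2}$$

where $f = nM / NM = n / N$ is the sampling ratio.

② When the clusters are not all the same size, we can approximately get the estimator of $\pi$

$$\hat{p} = \frac{\sum_{i=1}^{n} a_i}{\sum_{i=1}^{n} M_i} \tag{3}$$

The estimator of the variance of $\hat{p}$ can be obtained as

$$v(\hat{p}) = \frac{1-f}{n\bar{M}^2}\cdot\frac{\sum_{i=1}^{n}\left(a_i - \hat{p}M_i\right)^2}{n-1} \tag{4}$$

where $\bar{M} = \sum_{i=1}^{n} M_i / n$ is the mean number of subunits per sampled cluster and $f = n/N$. From the equation $\pi_i = a_i / M_i$, we can get

$$v(\hat{p}) = \frac{1-f}{n(n-1)\bar{M}^2}\sum_{i=1}^{n} M_i^2\left(\pi_i - \hat{p}\right)^2 \tag{5}$$

Calculation of $\pi_i$ and $a_i$ is described below. Let $P$ denote the proportion of the sensitive question in the randomized device of the Simmons model, and let $R_i$ denote the proportion of people with non-sensitive character B in the $i$th cluster, which is known or can be obtained. $\lambda_i$ denotes the proportion of "yes" responses in the $i$th cluster and $\pi_i$ denotes the proportion of character A in the $i$th cluster. From the total probability formula [14], $\lambda_i = P\pi_i + (1-P)R_i$, so we can get

$$\pi_i = \frac{\lambda_i - (1-P)R_i}{P} \tag{6}$$

and

$$a_i = M_i\pi_i, \quad i = 1,2,\ldots,n$$

Formulas for stratified cluster sampling. Suppose the population is composed of $L$ strata, the $h$th stratum contains $N_h$ clusters (primary units), and the $i$th cluster of the $h$th stratum contains $M_{ih}$ secondary units. The population contains $N$ secondary units, and $n_h$ clusters are randomly drawn from the $h$th stratum.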
Estimation of the population proportion in the $h$th stratum and its variance is described below. Suppose $\pi_{ih}$ denotes the proportion of subunits with sensitive character A in the $i$th cluster of the $h$th stratum, and there are $a_{ih}$ subunits with character A in that cluster.
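① When each cluster in the $h$th stratum contains $M_h$ subunits (clusters in different strata can be of different sizes), from formula 1 the estimator of the population proportion $\pi_h$ for cluster sampling in the $h$th stratum is

$$\hat{p}_h = \frac{1}{n_h M_h}\sum_{i=1}^{n_h} a_{ih} = \frac{1}{n_h}\sum_{i=1}^{n_h}\pi_{ih} \tag{7}$$

From formula 2, the estimator of the variance of $\hat{p}_h$ (the estimator of $V(\hat{p}_h)$) can be stated as

$$v(\hat{p}_h) = \frac{1-f_h}{n_h(n_h-1)}\sum_{i=1}^{n_h}\left(\pi_{ih} - \hat{p}_h\right)^2 \tag{8}$$

where $f_h = n_h M_h / N_h M_h = n_h / N_h$ is the sampling ratio of the $h$th stratum.

② When the clusters are not all the same size, from formula 3 the estimator of the population proportion in the $h$th stratum can be obtained as

$$\hat{p}_h = \frac{\sum_{i=1}^{n_h} a_{ih}}{\sum_{i=1}^{n_h} M_{ih}} \tag{9}$$

From formula 4, the estimator of the variance of $\hat{p}_h$ can be obtained as

$$v(\hat{p}_h) = \frac{1-f_h}{n_h\bar{M}_h^2}\cdot\frac{\sum_{i=1}^{n_h}\left(a_{ih} - \hat{p}_h M_{ih}\right)^2}{n_h-1} \tag{10}$$

and by formula 5, we can get

$$v(\hat{p}_h) = \frac{1-f_h}{n_h(n_h-1)\bar{M}_h^2}\sum_{i=1}^{n_h} M_{ih}^2\left(\pi_{ih} - \hat{p}_h\right)^2 \tag{11}$$

where $\bar{M}_h = \sum_{i=1}^{n_h} M_{ih} / n_h$ is the mean number of subunits per cluster in the $h$th stratum and $f_h = n_h / N_h$ is the sampling ratio in the $h$th stratum.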
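The within-stratum estimators can be checked numerically. The sketch below implements the ratio estimator of the proportion (sum of $a_{ih}$ over sum of $M_{ih}$) and the simplified variance form written in terms of $M_{ih}^2(\pi_{ih} - \hat{p}_h)^2$; when all clusters are the same size it reduces to the equal-size case. The function name `cluster_proportion`, its argument names and the example numbers are assumptions made for illustration, not values from the survey.

```python
def cluster_proportion(pis, sizes, n_clusters_total):
    """Estimate a proportion and its variance from sampled clusters.

    pis:              estimated proportion of character A in each sampled cluster
    sizes:            number of subunits M_i in each sampled cluster
    n_clusters_total: number of clusters N in the population (or stratum)
    """
    n = len(pis)
    if n < 2 or n != len(sizes):
        raise ValueError("need at least two clusters and matching lengths")
    counts = [m * p for m, p in zip(sizes, pis)]       # a_i = M_i * pi_i
    p_hat = sum(counts) / sum(sizes)                    # ratio estimator
    f = n / n_clusters_total                            # sampling ratio
    m_bar = sum(sizes) / n                              # mean cluster size
    ss = sum(m ** 2 * (p - p_hat) ** 2 for m, p in zip(sizes, pis))
    v_hat = (1 - f) / (n * (n - 1) * m_bar ** 2) * ss
    return p_hat, v_hat


# Hypothetical example: two clusters of 10 subunits drawn from 4 clusters.
p_hat, v_hat = cluster_proportion([0.2, 0.4], [10, 10], 4)
```

With equal cluster sizes the factor $M^2/\bar{M}^2$ equals 1, so the same function covers both cases; keeping a single code path avoids choosing the wrong formula by mistake.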
Estimation of the population proportion and its variance is described below. The estimator of the population proportion can be stated as [8]

$$\hat{p} = \sum_{h=1}^{L} W_h\hat{p}_h \tag{12}$$

where $W_h = \sum_{i=1}^{N_h} M_{ih} / N$ is the relative size of the $h$th stratum according to the number of subunits.
As the samples of the strata are independent of each other, from formula 12 we can get the variance of $\hat{p}$ [15]

$$V(\hat{p}) = \sum_{h=1}^{L} W_h^2 V(\hat{p}_h) \tag{13}$$

Depending on the cluster sizes in the $h$th stratum, we can use formula 8, 10 or 11 to estimate $V(\hat{p}_h)$ in formula 13.
Calculation of $\pi_{ih}$ and $a_{ih}$ is described below. Suppose the proportion of subunits with the unrelated non-sensitive character B in the $i$th cluster of the $h$th stratum is $R_{ih}$, which is known or can be acquired by a special survey. $\lambda_{ih}$ denotes the proportion of "yes" answers in the $i$th cluster of the $h$th stratum and $\pi_{ih}$ denotes the proportion of character A in that cluster. From the total probability formula [14], $\lambda_{ih} = P\pi_{ih} + (1-P)R_{ih}$, so we can get

$$\pi_{ih} = \frac{\lambda_{ih} - (1-P)R_{ih}}{P} \tag{14}$$

and

$$a_{ih} = M_{ih}\pi_{ih}, \quad i = 1,2,\ldots,n_h;\ h = 1,2,\ldots,L$$
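The step from observed "yes" answers to the cluster proportion of character A can be sketched as follows. The function name `simmons_pi`, its parameter names and the example numbers (a cluster of 50 respondents with 20 "yes" answers, $R = 0.5$, $P = 0.6$) are hypothetical and serve only to illustrate the inversion of the total probability relation $\lambda = P\pi + (1-P)R$.

```python
def simmons_pi(yes_count: int, cluster_size: int, r_b: float, p_red: float) -> float:
    """Recover the proportion of character A in one cluster.

    yes_count:    number of "yes" answers observed in the cluster
    cluster_size: number of respondents in the cluster
    r_b:          known proportion with non-sensitive character B
    p_red:        probability of being directed to the sensitive question
    """
    if not 0 < p_red <= 1:
        raise ValueError("p_red must lie in (0, 1]")
    lam = yes_count / cluster_size                 # observed "yes" proportion
    return (lam - (1 - p_red) * r_b) / p_red       # pi = (lambda - (1 - P) R) / P


# Hypothetical cluster: 20 "yes" answers out of 50, R = 0.5, P = 0.6.
pi_cluster = simmons_pi(20, 50, 0.5, 0.6)
a_cluster = 50 * pi_cluster                        # estimated count with character A
```

In small clusters the observed "yes" proportion can fall below $(1-P)R$ by chance, in which case this estimate is negative; the sketch returns the raw value and leaves any handling of that case to the analyst.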

Estimation of the population mean for sensitive question survey
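Formulas for cluster sampling. Estimation of the population mean and its variance is described below.

① When $M_i = M$, let $\mu_i$ denote the mean of the $i$th cluster. Replacing $a_i$ by the cluster total $y_i$ in formula 1, the estimator of the population mean can be stated as

$$\hat{m} = \frac{1}{nM}\sum_{i=1}^{n} y_i = \frac{1}{n}\sum_{i=1}^{n}\mu_i \tag{15}$$

When $a_i$ is also replaced by $y_i$ in formula 2, the estimator of the variance of $\hat{m}$ can be obtained as

$$v(\hat{m}) = \frac{1-f}{n(n-1)}\sum_{i=1}^{n}\left(\frac{y_i}{M} - \hat{m}\right)^2 = \frac{1-f}{n(n-1)}\sum_{i=1}^{n}\left(\mu_i - \hat{m}\right)^2 \tag{16}$$

where $f = nM / NM = n / N$ is the sampling ratio.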
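② When the clusters are not all the same size, the estimator of the population mean is [16]

$$\hat{m} = \frac{\sum_{i=1}^{n} y_i}{\sum_{i=1}^{n} M_i} \tag{17}$$

By replacing $a_i$ by $y_i$ in formula 4, the estimator of the variance of $\hat{m}$ can also be obtained as [8]

$$v(\hat{m}) = \frac{1-f}{n\bar{M}^2}\cdot\frac{\sum_{i=1}^{n}\left(y_i - \hat{m}M_i\right)^2}{n-1} \tag{18}$$

where $\bar{M} = \sum_{i=1}^{n} M_i / n$ is the mean number of subunits per cluster and $f = n/N$. When $\hat{p}$ is replaced by $\hat{m}$ (and $\pi_i$ by $\mu_i$) in formula 5, we can get

$$v(\hat{m}) = \frac{1-f}{n(n-1)\bar{M}^2}\sum_{i=1}^{n} M_i^2\left(\mu_i - \hat{m}\right)^2 \tag{19}$$

Calculation of $\mu_i$ and $y_i$ is described below. Suppose $\mu_i$ denotes the mean of the sensitive variable in the $i$th cluster, $\mu_{iZ}$ denotes the mean of the numerical values of the answers in the $i$th cluster, and $\mu_Y$ denotes the mean of all the random numbers in the randomized device. Then from the properties of means [15], we can get

$$\mu_i = \mu_{iZ} - \mu_Y \tag{20}$$

and

$$y_i = M_i\mu_i, \quad i = 1,2,\ldots,n$$

Formulas for stratified cluster sampling. Estimation of the population mean of the $h$th stratum and its variance is described below.

① When $M_{ih} = M_h$, from formula 15 we can get the estimator of the population mean

$$\hat{m}_h = \frac{1}{n_h M_h}\sum_{i=1}^{n_h} y_{ih} = \frac{1}{n_h}\sum_{i=1}^{n_h}\mu_{ih} \tag{21}$$

From formula 16, the estimator of the variance of $\hat{m}_h$ can be obtained as

$$v(\hat{m}_h) = \frac{1-f_h}{n_h(n_h-1)}\sum_{i=1}^{n_h}\left(\mu_{ih} - \hat{m}_h\right)^2 \tag{22}$$

② When the clusters in the $h$th stratum are not all the same size, from formula 17 the estimator of the population mean ($\mu_h$) of the $h$th stratum is

$$\hat{m}_h = \frac{\sum_{i=1}^{n_h} y_{ih}}{\sum_{i=1}^{n_h} M_{ih}} \tag{23}$$

From formula 18, the estimator of the variance of $\hat{m}_h$ can be stated as

$$v(\hat{m}_h) = \frac{1-f_h}{n_h\bar{M}_h^2}\cdot\frac{\sum_{i=1}^{n_h}\left(y_{ih} - \hat{m}_h M_{ih}\right)^2}{n_h-1} \tag{24}$$

And from formula 19, we can get a simplified formula

$$v(\hat{m}_h) = \frac{1-f_h}{n_h(n_h-1)\bar{M}_h^2}\sum_{i=1}^{n_h} M_{ih}^2\left(\mu_{ih} - \hat{m}_h\right)^2 \tag{25}$$

Estimation of the population mean and its variance is described below. By replacing $\hat{p}_h$ by $\hat{m}_h$ in formula 12, we can get the estimator of the population mean [8]:

$$\hat{m} = \sum_{h=1}^{L} W_h\hat{m}_h \tag{26}$$

As the samples of the strata are independent, by formula 26 we can get

$$V(\hat{m}) = \sum_{h=1}^{L} W_h^2 V(\hat{m}_h) \tag{27}$$

Depending on the cluster sizes in the $h$th stratum, we can use formula 22, 24 or 25 to estimate $V(\hat{m}_h)$ in formula 27.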
Calculation of $\mu_{ih}$ and $y_{ih}$ is described below. Let $\mu_{ih}$ denote the mean of the sensitive variable in the $i$th cluster of the $h$th stratum, $\mu_{ihZ}$ denote the average value of all the answers in the $i$th cluster of the $h$th stratum, and $\mu_Y$ denote the average value of all the random numbers in the randomized device. Then from the properties of means [15], we can get

$$\mu_{ih} = \mu_{ihZ} - \mu_Y \tag{28}$$

And then we can get

$$y_{ih} = M_{ih}\mu_{ih}, \quad i = 1,2,\ldots,n_h;\ h = 1,2,\ldots,L$$
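The estimation of a stratum mean from additive-model reports can be sketched end to end. The code subtracts the device mean from each cluster's mean report, forms the cluster totals, and applies the ratio estimator with the totals-based variance form for clusters of unequal size. The function name `stratum_mean`, the constant name `MU_Y` and the example reports are assumptions for illustration; the sketch further assumes that every subunit of a sampled cluster responds (so the number of reports equals the cluster size) and that the device labels 0 to 9 are equally likely.

```python
from statistics import mean

MU_Y = 4.5  # mean of the device labels 0..9, assuming equally likely draws


def stratum_mean(reports_by_cluster, n_clusters_total):
    """Estimate a stratum mean and its variance from additive-model reports.

    reports_by_cluster: one list of reported sums per sampled cluster
    n_clusters_total:   number of clusters in the stratum
    """
    n = len(reports_by_cluster)
    if n < 2:
        raise ValueError("need at least two sampled clusters")
    sizes = [len(c) for c in reports_by_cluster]              # M_i
    mus = [mean(c) - MU_Y for c in reports_by_cluster]        # mu_i = mu_iZ - mu_Y
    totals = [m * mu for m, mu in zip(sizes, mus)]            # y_i = M_i * mu_i
    m_hat = sum(totals) / sum(sizes)                          # ratio estimator
    f = n / n_clusters_total                                  # sampling ratio
    m_bar = sum(sizes) / n                                    # mean cluster size
    ss = sum((y - m_hat * m) ** 2 for y, m in zip(totals, sizes))
    v_hat = (1 - f) / (n * m_bar ** 2) * ss / (n - 1)
    return m_hat, v_hat


# Hypothetical reports from two clusters drawn out of four.
m_hat, v_hat = stratum_mean([[5, 6], [7, 8, 9]], 4)
```

Working from the cluster totals rather than the cluster means keeps the code close to the ratio-estimator form, and the two variance expressions agree because $y_i - \hat{m}M_i = M_i(\mu_i - \hat{m})$.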

Applications
Let the students on the Dushu Lake Campus of Soochow University be the target population, which is divided into two strata. Defining undergraduates as the first stratum, which contains 9689 students, and graduates as the second stratum, which contains 1890 students, we get $W_1 = 9689/(9689 + 1890) \approx 0.84$ and $W_2 \approx 0.16$. Each class was treated as a cluster, and the clusters in each stratum were approximately the same size. Twenty clusters containing 1080 students were randomly drawn from the undergraduates and 18 clusters containing 818 students were drawn from the graduates. Each student was surveyed twice at different times, so 3796 survey responses were collected in total. All the questionnaires were returned and all of them were valid (100%). The database was established using Excel 2003 and all the data were analyzed with SAS 9.13.

Survey on a population proportion
In our randomized device, there were 6 red balls and 4 white balls of the same size and weight in a bag. With no one else present, each student drew a ball from the bag at random. If a red ball was drawn, the student answered the sensitive question of whether he or she had had premarital sex. If a white ball was drawn, the student answered the unrelated question of whether he or she was male. The student could only answer yes or no. The true proportion of male students ($R_{ih}$) in each cluster of each stratum also had to be obtained.

Ethics Statement. The participants provided their written informed consent to participate in this study anonymously. The data were also collected and analyzed anonymously. This research, including the consent procedure, was approved by the Ethics Committee of Soochow University.
The proportion of students having premarital sex in each class. The survey on premarital sex among students from 38 classes at Soochow University was conducted twice by stratified cluster sampling under the Simmons model. From formula 14, we can get the premarital sex rate $\pi_{i1}$ ($i = 1,2,\ldots,20$) in the first survey of undergraduates and $\pi'_{i1}$ ($i = 1,2,\ldots,20$) in the second survey of undergraduates, and the premarital sex rate $\pi_{i2}$ ($i = 1,2,\ldots,18$) in the first survey of graduates and $\pi'_{i2}$ ($i = 1,2,\ldots,18$) in the second survey of graduates (Table 1).

Estimation of premarital sex proportion and its variance in each stratum. By the first survey and formula 7, we can get the estimator of the premarital sex ratio of undergraduates:

$$\hat{p}_1 = \frac{1}{20}\sum_{i=1}^{20}\pi_{i1} = \frac{1}{20}(0.2624 + 0.1631 + \cdots + 0.1926) = 0.1683$$

From formula 8, we can get the estimator of the variance of $\hat{p}_1$:

$$v(\hat{p}_1) = \frac{1-f_1}{20(20-1)}\left[(0.2624 - 0.1683)^2 + (0.1631 - 0.1683)^2 + \cdots + (0.1926 - 0.1683)^2\right] = 0.0002$$

By the first survey and formula 7, we can also get the estimator of the premarital sex ratio of graduates:

$$\hat{p}_2 = \frac{1}{18}\sum_{i=1}^{18}\pi_{i2} = \frac{1}{18}(0.2348 + 0.22296 + \cdots + 0.2803) = 0.2457$$

And from formula 8, we can get the estimator of the variance of $\hat{p}_2$:

$$v(\hat{p}_2) = \frac{1-f_2}{18(18-1)}\left[(0.2348 - 0.2457)^2 + (0.22296 - 0.2457)^2 + \cdots + (0.2803 - 0.2457)^2\right] = 0.0001$$

Estimation of the premarital sex proportion and its variance of the students on Dushu Lake Campus of Soochow University. By formula 12, we can get the estimator of the premarital sex ratio of the students:

$$\hat{p} = W_1\hat{p}_1 + W_2\hat{p}_2$$

And from formula 13, we can get the estimator of the variance of $\hat{p}$:

$$v(\hat{p}) = W_1^2 v(\hat{p}_1) + W_2^2 v(\hat{p}_2)$$

Thus, the 95% confidence interval of the population proportion is given by

$$\hat{p} \pm 1.96\sqrt{v(\hat{p})}$$

Reliability evaluation. Using SAS 9.13, the data from the repeated surveys of the 38 classes were subjected to the arcsine square-root transformation. Correlation analysis showed that the results of the two repeated sample surveys were consistent and that the reliability of our survey methods and formulas was high (Pearson product-moment correlation coefficient r = 0.8843, P<0.0001).
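The combination of the two strata into a campus-wide estimate and a 95% interval can be reproduced approximately in code. This is a minimal sketch: the function name `stratified_interval` is hypothetical, the inputs are the rounded stratum figures printed in the text (0.1683 and 0.0002 for undergraduates, 0.2457 and 0.0001 for graduates) together with the stratum sizes 9689 and 1890, and the weights are computed without rounding, so the output is only approximate and need not match published values to the last digit.

```python
import math


def stratified_interval(weights, estimates, variances, z=1.96):
    """Combine stratum estimates and return (estimate, variance, (low, high))."""
    est = sum(w * e for w, e in zip(weights, estimates))       # sum of W_h * p_h
    var = sum(w * w * v for w, v in zip(weights, variances))   # sum of W_h^2 * v(p_h)
    half = z * math.sqrt(var)                                   # normal-approximation half-width
    return est, var, (est - half, est + half)


# Stratum sizes and rounded stratum results as reported in the text.
sizes = [9689, 1890]
weights = [s / sum(sizes) for s in sizes]
p, v, ci = stratified_interval(weights, [0.1683, 0.2457], [0.0002, 0.0001])
print(f"p = {p:.4f}, v = {v:.6f}, 95% CI = ({ci[0]:.4f}, {ci[1]:.4f})")
```

Because the stratum variances are given to only one significant digit, the interval width in particular should be read as a rough indication.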

Survey on a population mean
In our randomized device, there were 10 balls of identical size in a bag, labeled with the integers from 0 to 9. Each selected student was asked to draw a ball from the bag, add the integer on it to the number of times he or she had cheated on exams in the last two semesters, and write down the final result.
The mean of times of cheating on exams in each class. From the two repeated surveys on cheating on exams in the last two semesters, conducted by stratified cluster sampling under the additive model, and by formula 28, we can get the mean number of times of cheating on exams of undergraduates in the 20 classes in the first survey ($\mu_{i1}$, $i = 1,2,\ldots,20$) and in the second survey ($\mu'_{i1}$, $i = 1,2,\ldots,20$), as well as the mean number of times of cheating on exams of graduates in the 18 classes in the first survey ($\mu_{i2}$, $i = 1,2,\ldots,18$) and in the second survey ($\mu'_{i2}$, $i = 1,2,\ldots,18$) (Table 2).
Estimation of the population mean and its variance of cheating times in each stratum. Estimation of the population mean of cheating times in each stratum is described below. By the first survey on undergraduates and formula 21, we can get the estimator of the population mean of cheating times of undergraduates in the last two semesters:

$$\hat{m}_1 = \frac{1}{20}\sum_{i=1}^{20}\mu_{i1} = \frac{1}{20}(1.0532 + 2.0250 + \cdots + 0.5222) = 1.0337$$

By the first survey on graduates and formula 21, we can also get the estimator of the population mean of cheating times of graduates in the last two semesters:

$$\hat{m}_2 = \frac{1}{18}\sum_{i=1}^{18}\mu_{i2} = \frac{1}{18}(1.3182 + 2.4302 + \cdots + 0.9000) = 1.0354$$

Estimation of the variance of the population mean of cheating times in each stratum is described below. From the first survey on undergraduates and formula 22, we can get the estimator of the variance of the population mean of cheating times of undergraduates in the last two semesters:

$$v(\hat{m}_1) = \frac{1-f_1}{20(20-1)}\left[(1.0532 - 1.0337)^2 + (2.0250 - 1.0337)^2 + \cdots + (0.5222 - 1.0337)^2\right]$$

From the first survey on graduates and formula 22, we can also get the estimator of the variance of the population mean of cheating times of graduates in the last two semesters:

$$v(\hat{m}_2) = \frac{1-f_2}{18(18-1)}\left[(1.3182 - 1.0354)^2 + (2.4302 - 1.0354)^2 + \cdots + (0.9000 - 1.0354)^2\right]$$

Estimation of the population mean and its variance of cheating times of students on Dushu Lake Campus of Soochow University. By formula 26, we can get the estimator of the population mean of cheating times of students on Dushu Lake Campus of Soochow University:

$$\hat{m} = W_1\hat{m}_1 + W_2\hat{m}_2$$

Reliability evaluation. Correlation analysis was applied to the data of the two repeated surveys under the additive model in the 20 undergraduate classes using SAS 9.13. The Shapiro-Wilk W test was applied to $\mu_{i1}$ and $\mu'_{i1}$; the corresponding values of W were 0.9616 and 0.9288 and the P values were 0.5768 and 0.1464, respectively, so both sets of data were consistent with a normal distribution. The analysis showed that the agreement between the results of the two repeated cluster samplings in the first stratum was high (Pearson product-moment correlation coefficient r = 0.95755, P<0.0001).
Rank correlation analysis was applied to the data of the two repeated surveys under the additive model in the 18 graduate classes using SAS 9.13, because these data were not normally distributed, and it also showed that the agreement was high (Spearman rank correlation coefficient $r_s$ = 0.90243, P<0.0001). Rank correlation analysis was also applied to the data of all 38 classes, which were likewise not normally distributed, and the results of the two repeated stratified cluster samplings were found to be consistent (Spearman rank correlation coefficient $r_s$ = 0.90311, P<0.0001), which indicates that our survey methods and the relevant formulas were reliable.
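The test-retest statistics used in the reliability evaluations (the arcsine square-root transformation, the product-moment correlation and the Spearman rank correlation) can be sketched without a statistics package. The analyses in this study were run in SAS; the pure-Python functions below are only an illustrative equivalent, all function names are hypothetical, and the example class values are made up because the per-class data of Tables 1 and 2 are not reproduced here. P values are not computed.

```python
import math


def arcsine_sqrt(p: float) -> float:
    """Arcsine square-root transform of a proportion (requires 0 <= p <= 1)."""
    return math.asin(math.sqrt(p))


def pearson(x, y) -> float:
    """Product-moment correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(x, y) -> float:
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up first- and second-survey proportions for four classes.
first = [0.26, 0.16, 0.19, 0.22]
second = [0.24, 0.15, 0.21, 0.23]
r = pearson([arcsine_sqrt(p) for p in first], [arcsine_sqrt(p) for p in second])
r_s = spearman(first, second)
```

The transformation is applied only to proportions; cluster estimates that fall outside the interval from 0 to 1 would need to be handled before it can be used.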

Discussion
Sensitive question surveys are very important in social and medical research, especially for the prevention of AIDS in China. Having passed through the introduction and growth periods of the epidemic, China now faces the threat of an AIDS outbreak. Preventing AIDS requires accurate data, yet many people may be unwilling to answer such sensitive questions truthfully. The survey methods and formulas for parameter estimation proposed in this study will help to obtain reliable data for preventing sexually transmitted diseases and improving public health.
Randomized response models have been widely used to encourage cooperation in sensitive question surveys. A meta-analysis of 38 relevant publications from 1965 to 2000 showed that randomized response models have a significant advantage in accuracy and reliability over traditional survey methods [17]. Statisticians have provided many sampling methods that could be used in designs for sensitive question surveys. However, most applications so far have been limited to simple random sampling, and studies evaluating the reliability and validity of sample surveys on sensitive questions are also rare.
In this work, formulas for parameter estimation in cluster sampling and stratified cluster sampling were derived under two randomized response models, so that both dichotomous and quantitative data on sensitive issues can be obtained. Under the two models, cluster sampling and stratified cluster sampling were successfully applied to the survey of premarital sex and cheating on exams at Soochow University. The evaluation of test-retest reliability demonstrated that the agreement between the results of the two repeated surveys was high, indicating that our survey methods and statistical formulas are highly reliable.
In cluster and stratified cluster sampling, the sample size is generally large, so the sample proportion and sample mean approximately follow the normal distribution. With the derived formulas, we can obtain estimators of population proportions, means and their variances; we can construct interval estimates of those proportions and means; and we can further compare the proportions and means of the strata by t-test, Z test, analysis of variance or rank test.
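As one illustration of such a comparison, a Z test for the difference between two stratum estimates can be sketched as follows. It assumes that the two stratum samples are independent and that the estimators are approximately normal; the function names `z_statistic` and `two_sided_p` are hypothetical, and the inputs in the usage lines are the rounded stratum proportions and variances quoted in the Applications section, so the result is only indicative.

```python
import math


def z_statistic(est1: float, var1: float, est2: float, var2: float) -> float:
    """Z statistic for the difference of two independent estimates."""
    return (est1 - est2) / math.sqrt(var1 + var2)


def two_sided_p(z: float) -> float:
    """Two-sided P value under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2))


# Graduates versus undergraduates, using the rounded figures from the text.
z = z_statistic(0.2457, 0.0001, 0.1683, 0.0002)
p_value = two_sided_p(z)
```

The variance of the difference is the sum of the two stratum variances only because the strata are sampled independently; a comparison of the two repeated surveys of the same classes would need to account for their correlation.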
Multistage sampling, cluster sampling, stratified cluster sampling and other complex sampling methods, together with the corresponding formulas under randomized response models for surveys of quantitative and qualitative sensitive issues, are under study in this project. In this study, the cluster sampling and stratified cluster sampling methods and the relevant formulas under two randomized response models were shown to be reliable, and they can potentially be used for sensitive question surveys of large populations.