Using convenient stratification criteria such as geographical regions or other natural conditions like age, gender, etc., is not beneficial in order to maximize the precision of the estimates of variables of interest. Thus, one has to look for an efficient stratification design to divide the whole population into homogeneous strata that achieves higher precision in the estimation. In this paper, a procedure for determining Optimum Stratum Boundaries (OSB) and Optimum Sample Sizes (OSS) for each stratum of a variable of interest in health surveys is developed. The determination of OSB and OSS based on the study variable is not feasible in practice since the study variable is not available prior to the survey. Since many variables in health surveys are generally skewed, the proposed technique considers the readily-available auxiliary variables to determine the OSB and OSS. This stratification problem is formulated into a Mathematical Programming Problem (MPP) that seeks minimization of the variance of the estimated population parameter under Neyman allocation. It is then solved for the OSB by using a dynamic programming (DP) technique. A numerical example with a real data set of a population, aiming to estimate the Haemoglobin content in women in a national Iron Deficiency Anaemia survey, is presented to illustrate the procedure developed in this paper. Upon comparisons with other methods available in literature, results reveal that the proposed approach yields a substantial gain in efficiency over the other methods. A simulation study also reveals similar results.
Citation: Reddy KG, Khan MGM, Khan S (2018) Optimum strata boundaries and sample sizes in health surveys using auxiliary variables. PLoS ONE 13(4): e0194787. https://doi.org/10.1371/journal.pone.0194787
Editor: Gajendra P. S. Raghava, Indraprastha Institute of Information Technology, INDIA
Received: August 20, 2017; Accepted: March 10, 2018; Published: April 5, 2018
Copyright: © 2018 Reddy et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All relevant data are within the paper and its Supporting Information files.
Funding: The authors received no specific funding for this work.
Competing interests: The authors have declared that no competing interests exist.
Stratified random sampling is an important sampling technique used to estimate the prevalence of diseases and risk factors such as diabetes, anaemia, obesity, hypertension, and smoking. In stratified sampling, the sampling frame is divided into a number (say, L) of non-overlapping groups, or strata, in such a way that the strata are internally homogeneous with respect to the variable (or main variable) under study, because that maximizes the precision of the estimator of the parameter of interest concerning the study variable, e.g. its mean. An advantage of a stratified sampling design is that when a stratum is homogeneous, the measurements of the study variable (y) vary little from each other, and a precise estimate of y can be obtained from a small sample in that stratum. Thus, by combining these estimates from all L strata, the design produces a gain in the precision of the estimate of the variable in the whole population. However, in most practical situations, especially in health surveys, it is difficult to construct such optimum strata, and hence health surveyors more often stratify the population in the most convenient manner, such as by geographical regions (e.g. North, Central, South, etc.), administrative regions (e.g. provinces, districts, etc.) or other natural criteria (e.g. urban-rural, sex, age, etc.). Stratification by convenience is not always a reasonable criterion, as the strata so obtained may not be internally homogeneous with respect to a variable of interest. Thus, one has to look for the Optimum Stratum Boundaries (OSB) that maximize the precision of the estimators.
The problem of determining OSB for a variable, when its frequency distribution is known, is well known in the sampling literature. The basic consideration involved in determining OSB is that the strata should be as internally homogeneous as possible, that is, in order to achieve maximum precision, the stratum variances should be as small as possible [1, 2]. When a single variable is of interest and the stratification is based on this study variable, the ideal situation is that the distribution of the study variable is known, so that the OSB can be determined by cutting the range of its distribution at suitable points. This problem of determining the OSB, when the estimation and stratification variables are the same, was first discussed by Dalenius. He presented a set of minimal equations which are usually difficult to solve for the OSB because of their implicit nature. Hence, several authors have subsequently attempted to determine approximately optimum stratum boundaries [4–9].
Many authors have also attempted to determine the global OSB. For example, Unnithan proposed an iterative method that requires a suitable initial solution. For a skewed population where a certainty stratum is necessary (that is, where extremely large units are isolated and included in the sample with certainty so that they do not influence sampling variability), Lavallée proposed an algorithm to construct stratum boundaries for a power-allocated stratified sample (in which an exponent q, with 0 < q < 1, is applied to the stratum population value under Neyman allocation to allow for a sufficient spread of the sample allocation). Later on, Hidiroglou presented a more general form of the algorithm. After reviewing Lavallée and Hidiroglou’s algorithm, a modified algorithm that incorporated the different relationships between the stratification and study variables was proposed [13, 14].
There are several other algorithms available in the literature. For example, Niemiro proposed a random search method, and the simplex method was used to present a new method of stratification. Later on, Kozak presented a modified random search algorithm. Gunning proposed an alternative method of approximate stratification based on a geometric progression. This approach was compared with three other methods [8, 11, 20], which confirmed that the geometric progression method is more efficient. The usefulness of Gunning and Horgan’s geometric progression method was studied further, and it was found to be less efficient than Lavallée and Hidiroglou’s algorithm [22, 23].
Another kind of stratification method proposed in the literature is due to Khan et al. [24–30]. When the distributions of the study variables were known, they formulated the problems of determining OSB as optimization problems, which were solved by developing computational techniques based on Dynamic Programming (DP). The DP technique was first proposed by Bühler & Deutler, and it was also used for determining the OSB that divide the population domain of two stratification variables into distinct subsets such that the precision of the variables of interest is maximized [11, 32].
Numerous studies have also been undertaken in which auxiliary variable(s), which can be historical data, are used to improve the precision of the estimates of the study variable y. When the frequency distribution of the auxiliary variable, x, is known, several approximation methods for determining OSB using the auxiliary variable(s) have been suggested and discussed by many authors [1, 9, 33–43].
In this paper, a procedure for determining the OSB and sample size for each stratum of a variable of interest in health surveys is developed. The determination of the OSB and sample sizes based directly on the survey variable (y) is not feasible in practice, since it is unavailable prior to conducting the survey. Hence, optimum stratification is based on multiple auxiliary variables (x1, x2, …, xp) that are readily available in health surveys. It is assumed that the population values of the study variable y are available as realizations of a stochastic background variable, or at least can be obtained as proxy values of y from previous or other recent surveys, and that y follows a regression model on the auxiliary variable(s) [2, 14, 30, 44–46]. Moreover, y is often highly correlated with x, such that the regression of y upon x has homoscedastic errors. In situations like this, stratification can be achieved using the auxiliary variable(s). The application of the proposed methodology is demonstrated with empirical investigations using real and simulated datasets. This research deals with the problem of stratification for a study variable using the many auxiliary variables that are found in a multivariate survey. In health surveys, these auxiliary variables normally follow positively skewed distributions such as the Weibull, Gamma, or Log-normal. Thus, this research investigates whether the proposed parametric-based mathematical programming approach for determining the OSB yields a gain in efficiency over other methods that are well known in the literature. It also examines whether the proposed method works for skewed distributions such as the Weibull or Gamma when both linear and nonlinear regression models are used in the MPP formulation of the stratification problem.
The problem of determining OSB is redefined as the problem of determining Optimum Strata Widths (OSW) and is formulated as a Mathematical Programming Problem (MPP) that seeks minimization of the variance of the estimated population parameter. Since the formulated MPP can be viewed as a multistage decision problem, it is solved using a DP technique. These OSB are then used to compute the sample size for each stratum under Neyman allocation. A numerical example with a real data set from a skewed population, where the auxiliary variables follow Weibull distributions, is presented to illustrate the proposed procedure. The results are compared with Dalenius & Hodges’ cum $\sqrt{f}$ method, Gunning & Horgan’s geometric method and Lavallée & Hidiroglou’s method.
The general formulation of the problem of OSB as an MPP
Let the population be stratified into a fixed number L of strata based on p auxiliary variables, x1, x2, …, xp, and suppose that the estimation of the mean of the study variable y is of interest. If a simple random sample of size nh is to be drawn from the hth stratum with sample mean $\bar{y}_h$, then an unbiased stratified sample mean, $\bar{y}_{st}$, is given by
$$\bar{y}_{st} = \sum_{h=1}^{L} W_h \bar{y}_h \qquad (1)$$
where Wh = Nh/N is the proportion of the population contained in the hth stratum for the study variable y, N is the total number of units in the population and is assumed to be known, and Nh is the unknown total number of units in the hth stratum. Then the variance of $\bar{y}_{st}$ is given by
$$V(\bar{y}_{st}) = \sum_{h=1}^{L} \left(\frac{1}{n_h} - \frac{1}{N_h}\right) W_h^2 \sigma_{hy}^2 \qquad (2)$$
The finite population correction factors in (2) could be ignored [8, 9, 20, 36]. Thus, under the Neyman allocation, that is,
$$n_h = n \cdot \frac{W_h \sigma_{hy}}{\sum_{h=1}^{L} W_h \sigma_{hy}} \qquad (3)$$
the variance (2) reduces to
$$V(\bar{y}_{st}) = \frac{\left(\sum_{h=1}^{L} W_h \sigma_{hy}\right)^2}{n} \qquad (4)$$
where σhy is the standard deviation of y in the hth stratum; h = 1, 2, …, L and n is the preassigned total sample size.
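The Neyman allocation and the resulting variance can be illustrated with a minimal Python sketch. It assumes the standard forms n_h = n·W_h·σ_h / Σ W_h·σ_h and V = (Σ W_h·σ_h)² / n with the finite population correction ignored; the function names and the three-stratum values are hypothetical and serve only as an illustration.

```python
from math import fsum

def neyman_allocation(weights, sigmas, n):
    """Stratum sample sizes n_h = n * W_h * sigma_h / sum(W_h * sigma_h)."""
    products = [w * s for w, s in zip(weights, sigmas)]
    total = fsum(products)
    return [n * p / total for p in products]

def neyman_variance(weights, sigmas, n):
    """Variance of the stratified mean under Neyman allocation (fpc ignored)."""
    return fsum(w * s for w, s in zip(weights, sigmas)) ** 2 / n

# Hypothetical three-stratum population (illustrative values only).
W = [0.5, 0.3, 0.2]          # stratum weights W_h
sigma = [1.0, 2.0, 4.0]      # stratum standard deviations sigma_hy
n_h = neyman_allocation(W, sigma, n=100)
var = neyman_variance(W, sigma, n=100)

# Cross-check: plugging n_h into sum(W_h^2 * sigma_h^2 / n_h) gives the same value.
var_direct = fsum(w * w * s * s / m for w, s, m in zip(W, sigma, n_h))
```

The allocations are left as real numbers here; in practice they would be rounded to integers, which is a design choice outside the scope of this sketch.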
Consider that the study variable has a regression model of the form:
$$y = \lambda(x_1, x_2, \ldots, x_p) + \epsilon \qquad (5)$$
where λ(x1, x2, …, xp) is a linear (or nonlinear) function of xi; (i = 1, 2, …, p) and ϵ is an error term such that E(ϵ|x1, x2, …, xp) = 0 and V(ϵ|x1, x2, …, xp) = ψ(x1, x2, …, xp) > 0 for all xi. The parameters in λ are assumed to be known from a recent survey.
Assuming that λ and ϵ are uncorrelated, it follows that
$$\sigma_{hy}^2 = \sigma_{h\lambda}^2 + \sigma_{h\epsilon}^2 \qquad (6)$$
where $\sigma_{h\lambda}^2$ denotes the variance of λ(x1, x2, …, xp) in the hth stratum and $\sigma_{h\epsilon}^2$ is the variance of ϵ in the hth stratum. Eq (6) assumes homoscedasticity, i.e., homogeneity of the variance of ϵ over the distribution of the predictors xi (i = 1, 2, …, p), given the stratum h.
Let f(xi) be the estimated frequency functions of the auxiliary variables, xi (i = 1, 2, …, p), that are used for the stratification of the main variable. If the population mean of the study variable y is to be estimated over a range (a, b) under the allocation (3), then the problem of determining the strata boundaries of y is to cut up the range (a, b) at (L − 1) intermediate points, a = y0 ≤ y1 ≤ y2 ≤ … ≤ yL−1 ≤ yL = b, such that (4) is minimum. Since the study variable is not available at the design stage, the range (a, b) could either be the compromise range derived from all the auxiliary variables or an estimated range that best explains the study variable, possibly chosen from previous surveys.
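For a fixed sample size n, minimizing the expression on the right-hand side of (4) is equivalent to minimizing $\sum_{h=1}^{L} W_h \sigma_{hy}$. Thus, from (6), the following is minimized:
$$\sum_{h=1}^{L} W_h \sqrt{\sigma_{h\lambda}^2 + \sigma_{h\epsilon}^2} \qquad (7)$$
If f(xi) are known and integrable frequency functions of the auxiliary variables, then for the given λ(x1, x2, …, xp), the first term inside the square root in (7) can be expressed as a function of the boundary points (yh−1, yh) by finding the stratum weight Whxi, mean μhxi and variance $\sigma_{hx_i}^2$ of the ith auxiliary variable xi using the following expressions:
$$W_{hx_i} = \int_{y_{h-1}}^{y_h} f(x_i)\, dx_i \qquad (8)$$
$$\mu_{hx_i} = \frac{1}{W_{hx_i}} \int_{y_{h-1}}^{y_h} x_i f(x_i)\, dx_i \qquad (9)$$
$$\sigma_{hx_i}^2 = \frac{1}{W_{hx_i}} \int_{y_{h-1}}^{y_h} x_i^2 f(x_i)\, dx_i - \mu_{hx_i}^2 \qquad (10)$$
where i = 1, 2, …, p.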
The quantities computed by Eqs (8)–(10) may differ between auxiliary variables since they depend on the best-fit frequency distributions, for example, Weibull, Gamma, or any other skewed distribution. Even if two or more auxiliary variables are characterized by the same distribution, the quantities in (8)–(10) may still differ because their estimated parameters will generally be different.
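As a rough illustration of how the stratum weight, mean and variance of an auxiliary variable could be evaluated for an arbitrary density, the Python sketch below integrates f(x), x·f(x) and x²·f(x) between two boundary points with a composite Simpson rule. The helper names and the unit exponential density are assumptions made for the example; they are not taken from the paper, which works with fitted Weibull densities.

```python
from math import exp

def simpson(func, lo, hi, steps=1000):
    """Composite Simpson rule on [lo, hi]; `steps` must be even."""
    h = (hi - lo) / steps
    total = func(lo) + func(hi)
    for j in range(1, steps):
        total += (4 if j % 2 else 2) * func(lo + j * h)
    return total * h / 3

def stratum_moments(density, lo, hi):
    """Weight, mean and variance of the part of `density` lying in [lo, hi]."""
    weight = simpson(density, lo, hi)
    mean = simpson(lambda x: x * density(x), lo, hi) / weight
    second = simpson(lambda x: x * x * density(x), lo, hi) / weight
    return weight, mean, second - mean ** 2

# Illustrative density only: unit exponential, stratum [0, 1].
density = lambda x: exp(-x)
w, mu, var = stratum_moments(density, 0.0, 1.0)
```

Numerical integration is used so that the same routine works for any integrable density; when closed forms exist they are preferable.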
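Note that the second term in (7) is also obtained as a function of the boundary points using the frequency function of the regression error. Thus, the objective function (7) could be expressed as a function of the boundary points (yh−1, yh) only: $\phi_h(y_{h-1}, y_h) = W_h \sqrt{\sigma_{h\lambda}^2 + \sigma_{h\epsilon}^2}$. Then, the problem of determining the OSB can be expressed as the following optimization problem to determine y1, y2, …, yL−1:
$$\text{Minimize } \sum_{h=1}^{L} \phi_h(y_{h-1}, y_h) \quad \text{subject to } a = y_0 \le y_1 \le y_2 \le \ldots \le y_{L-1} \le y_L = b \qquad (11)$$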
We further define lh = yh − yh−1; h = 1, 2, …, L, where lh ≥ 0 denotes the range or width of the hth stratum. From this, the range of the distribution of y, d = b − a, can be expressed as a function of the stratum widths:
$$\sum_{h=1}^{L} l_h = \sum_{h=1}^{L} (y_h - y_{h-1}) = b - a = d \qquad (12)$$
Note that if y0 is known, the first term, ϕ1(l1, y0), in the objective function of the MPP (14) is a function of l1 alone. Once l1 is known, the second term ϕ2(l2, y1) becomes a function of l2 alone, and so on. Due to this special nature of the functions, the objective of the MPP (14) may be treated as a function of lh alone and is expressed as:
$$\text{Minimize } \sum_{h=1}^{L} \phi_h(l_h) \quad \text{subject to } \sum_{h=1}^{L} l_h = d, \quad l_h \ge 0;\ h = 1, 2, \ldots, L \qquad (15)$$
In real-world situations, the study variable is unknown at the design stage, hence, readily-available auxiliary variables can be used to create OSB. The proposed technique carries out optimization through the MPP (15) on the compromise range (d) derived from all auxiliary variables. The technique also assumes that the parameters of the regression model in (15) are known from a recent survey. The best-fit distributions of the auxiliary variables, xi, are used in the formulation of MPP (15).
The solution procedure using dynamic programming technique
The problem (15) is a multistage decision problem in which the objective function and the constraint are separable functions of lh, which allows us to use a DP technique. Dynamic programming determines the optimum solution of a multi-variable problem by decomposing it into stages, each stage comprising a single-variable subproblem. A DP model is basically a recursive equation based on Bellman’s principle of optimality. This recursive equation links the different stages of the problem in a manner which guarantees that each stage’s optimal feasible solution is also optimal and feasible for the entire problem.
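Consider the following subproblem of (15) for the first k (< L) strata:
$$\text{Minimize } \sum_{h=1}^{k} \phi_h(l_h) \quad \text{subject to } \sum_{h=1}^{k} l_h = d_k, \quad l_h \ge 0;\ h = 1, 2, \ldots, k \qquad (16)$$
where dk < d is the total width available for division into k strata, or the state value at stage k. Note that dk = d for k = L, and the transformation functions are given by
$$d_k = l_1 + l_2 + \cdots + l_k, \quad d_{k-1} = d_k - l_k, \quad \ldots, \quad d_1 = d_2 - l_2 = l_1.$$
Let Φk(dk) denote the minimum value of the objective function of (16), that is,
$$\Phi_k(d_k) = \min\left[\sum_{h=1}^{k} \phi_h(l_h) \;\middle|\; \sum_{h=1}^{k} l_h = d_k,\ l_h \ge 0;\ h = 1, 2, \ldots, k\right]$$
With the above definition of Φk(dk), the MPP (15) is equivalent to finding ΦL(d) recursively by finding Φk(dk) for k = 1, 2, …, L and 0 ≤ dk ≤ d. We can write:
$$\Phi_k(d_k) = \min_{0 \le l_k \le d_k}\left[\phi_k(l_k) + \Phi_{k-1}(d_k - l_k)\right], \quad k \ge 2 \qquad (17)$$
For the first stage, that is, for k = 1:
$$\Phi_1(d_1) = \phi_1(d_1) \;\Rightarrow\; l_1^{*} = d_1 \qquad (18)$$
where $l_1^{*} = d_1$ is the optimum width of the first stratum. The relations (17) and (18) are first solved in a forward manner for k = 1, 2, …, L to determine the optimum value of each subproblem, and then traced in a backward manner to recover the optimum widths and hence the OSB.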
The application of the above solution procedure is summarized in Appendix A in order to determine the OSB for MPP (15).
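The forward recursion and backward pass described above can be sketched in Python on a discretized grid of widths. This is only an illustrative sketch, not the C++ program used in the paper: the grid resolution, the function name `optimum_boundaries`, and the requirement that every stratum spans at least one grid step are assumptions. The callable `phi(width, lower)` is assumed to return the contribution W_h·σ_h of a stratum that starts at `lower`; the example uses a uniform density, for which equal-width strata are known to be optimal.

```python
from math import sqrt, inf

def optimum_boundaries(phi, a, d, L, grid=120):
    """Minimise sum_h phi(l_h, y_{h-1}) subject to sum_h l_h = d on a grid."""
    step = d / grid
    # best[k][j]: minimum objective for k strata covering [a, a + j*step]
    best = [[inf] * (grid + 1) for _ in range(L + 1)]
    choice = [[0] * (grid + 1) for _ in range(L + 1)]
    for j in range(1, grid + 1):            # stage 1: l_1 = d_1
        best[1][j] = phi(j * step, a)
        choice[1][j] = j
    for k in range(2, L + 1):               # forward recursion over stages
        for j in range(k, grid + 1):        # state d_k = j*step
            for i in range(1, j - k + 2):   # width of stratum k in grid steps
                prev = j - i
                value = phi(i * step, a + prev * step) + best[k - 1][prev]
                if value < best[k][j]:
                    best[k][j], choice[k][j] = value, i
    # backward pass: recover boundaries from stage L down to stage 1
    boundaries, j = [a + d], grid
    for k in range(L, 0, -1):
        j -= choice[k][j]
        boundaries.append(a + j * step)
    return best[L][grid], boundaries[::-1]

# Uniform density on [0, 1]: W_h = l/d and sigma_h = l/sqrt(12).
d = 1.0
uniform_phi = lambda width, lower: (width / d) * (width / sqrt(12))
value, osb = optimum_boundaries(uniform_phi, a=0.0, d=d, L=4)
```

The cost grows with L times the square of the grid size, so a finer grid trades running time for boundary precision.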
Determination of optimum sample size
When the OSB (yh−1, yh) are determined as discussed in Sections 2-3, the optimum sample sizes nh; h = 1, 2, …, L that minimize the variance of the estimate can easily be computed.
If the study variable follows the regression model (5) with the auxiliary variables across the strata, then, using (2) and (7), the sample sizes nh are obtained for a fixed total sample of size n under Neyman allocation for h = 1, 2, …, L as follows:
$$n_h = n \cdot \frac{W_h \sqrt{\sigma_{h\lambda}^2 + \sigma_{h\epsilon}^2}}{\sum_{h=1}^{L} W_h \sqrt{\sigma_{h\lambda}^2 + \sigma_{h\epsilon}^2}} \qquad (19)$$
where Wh, $\sigma_{h\lambda}^2$ and $\sigma_{h\epsilon}^2$ are derived in terms of the optimum boundary points (yh−1, yh).
It is also worth noting that the OSB (yh−1, yh) through the MPP (15) are obtained such that nh must satisfy the restrictions:
$$1 \le n_h \le N_h \qquad (20)$$
where Nh = NWh. The restriction 1 ≤ nh ensures that every stratum contributes at least one unit to the sample, and the restriction nh ≤ Nh avoids oversampling.
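A short Python sketch of how the allocation and its restrictions might be checked together is given below. It assumes that the stratum standard deviation is the square root of the sum of the two variance components and that the restriction is 1 ≤ n_h ≤ N·W_h; the function name and all numerical values are hypothetical.

```python
from math import sqrt

def allocate_and_check(N, weights, var_lambda, var_eps, n):
    """Neyman-type allocation from variance components, plus a bounds check."""
    # Stratum standard deviation of y from its two components.
    sd = [sqrt(vl + ve) for vl, ve in zip(var_lambda, var_eps)]
    total = sum(w * s for w, s in zip(weights, sd))
    sizes = [n * w * s / total for w, s in zip(weights, sd)]
    # A stratum is feasible when 1 <= n_h <= N_h, with N_h = N * W_h.
    feasible = [1 <= nh <= N * w for nh, w in zip(sizes, weights)]
    return sizes, feasible

# Hypothetical population of N = 1000 units in three strata.
weights = [0.6, 0.3, 0.1]
var_lambda = [3.0, 8.0, 15.0]    # variance of the regression part
var_eps = [1.0, 1.0, 1.0]        # homoscedastic error variance
sizes, ok = allocate_and_check(1000, weights, var_lambda, var_eps, n=100)
big_sizes, big_ok = allocate_and_check(1000, weights, var_lambda, var_eps, n=700)
```

With n = 700 the third stratum would need more units than it contains, which is the situation the upper restriction is meant to flag.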
Determination of optimum number of strata
This is one of the first issues that need to be considered in an optimal stratification design; however, it can depend on the OSB and the allocation of sample units among the strata. The goal of stratification is to make all strata as homogeneous as possible, which implies that the larger the number of strata, the greater the homogeneity within a stratum. This results in a reduction in the total variance of $\bar{y}_{st}$, that is, $V(\bar{y}_{st})$. However, an increase in the number of strata may involve extra cost and resources in planning and drawing the samples.
The problem of determining the optimum number of strata was first discussed by Dalenius, who postulated that, for a uniformly distributed variable, $V(\bar{y}_{st})$ is inversely proportional to L². Later, Cochran investigated the effect of the number of strata on $V(\bar{y}_{st})$ for some skewed populations with Neyman allocation. He confirmed that this relationship holds for skewed distributions and that the rate of reduction in $V(\bar{y}_{st})$ is independent of the skewness of the population. The results indicated that only a small reduction in variance is to be expected beyond L = 6 unless the correlation between the auxiliary information and the survey population is greater than 0.95.
To apply the above idea to the current situation of the optimal number of strata, assume that the fpc is negligible and consider the distribution of the data to be approximately uniform, as done by Cochran. Then the range of the distribution of values of y on [a, b] is d = b − a, and hence the variance of the distribution is S² = d²/12. The variance of the sample mean for a simple random sample of size n can therefore be calculated as:
$$V(\bar{y}) = \frac{S^2}{n} = \frac{d^2}{12n} \qquad (21)$$
Thus, if the simple case of creating L strata of equal size is considered, each stratum variance would be $S_h^2 = d^2/(12L^2)$. It follows from Wh = 1/L and Eq (4) that
$$V(\bar{y}_{st}) = \frac{1}{n}\left(\sum_{h=1}^{L} W_h S_h\right)^2 = \frac{d^2}{12nL^2} = \frac{V(\bar{y})}{L^2} \qquad (22)$$
This reveals that the variance of the sample mean is inversely proportional to the square of the number of strata, L. This, however, does not consider the relationship between the auxiliary variables and the study population. It can be extended by considering a linear relationship given in Eq (5). As suggested by Cochran, in this case, using (5), it can be shown that
$$V(\bar{y}_{st}) = \frac{S_y^2}{n}\left[\frac{\rho^2}{L^2} + (1 - \rho^2)\right] \qquad (23)$$
where ρ is the correlation between the study variable and the auxiliary variable. This again shows that the variance is inversely related to the square of the number of strata, apart from the term (1 − ρ²), which stratification cannot remove. Applying (23), one can empirically study the effect of increasing the number of strata. To complete this analysis, a cost function that relates cost to L, for planning and executing a survey, is required. However, whatever the form of the cost function, it has been shown that an increase in L beyond 6 will seldom be profitable. Thus, if the extra cost of planning and executing the survey that is incurred due to an increase in the number of strata is not of much importance, a reasonable approach to determining the optimum number of strata is as follows:
Compute $V(\bar{y}_{st})$ for L = 1, 2, …, k, where k is the largest candidate value of L. $V(\bar{y}_{st})$ decreases as L increases and is minimum when L = k. Therefore, a surveyor may choose the optimum number of strata at the point where a further increase in L is not useful because it gives only a small decrease in $V(\bar{y}_{st})$. The approach is illustrated in Fig 1, which is a hypothetical plot of $V(\bar{y}_{st})$ against L. One can choose the desired number of strata as the point at which the “elbow” in the curve becomes apparent. Clearly, this requires judgment on the part of the surveyors.
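The elbow idea can be made concrete with a small Python sketch. It assumes that the variance ratio under the linear model takes the form commonly attributed to Cochran, ρ²/L² + (1 − ρ²), relative to simple random sampling; the stopping threshold `min_gain` and the function names are hypothetical choices, standing in for the surveyor's judgment.

```python
def relative_variance(L, rho):
    """V(stratified mean) / V(simple random mean) under the linear model."""
    return rho ** 2 / L ** 2 + (1.0 - rho ** 2)

def choose_strata(rho, max_L=10, min_gain=0.01):
    """Smallest L after which one more stratum lowers the ratio by < min_gain."""
    for L in range(1, max_L):
        gain = relative_variance(L, rho) - relative_variance(L + 1, rho)
        if gain < min_gain:
            return L
    return max_L

# Variance ratios for a hypothetical correlation of 0.9.
ratios = [relative_variance(L, 0.9) for L in range(1, 8)]
best_L = choose_strata(0.9)
```

A lower correlation flattens the curve sooner, so fewer strata are selected for the same threshold.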
Construction of OSB with Weibull auxiliary variables
The Weibull distribution is a two or three-parameter family of continuous probability distributions. Because of its versatility in the fitting of a variety of distributions, it is one of the most widely used distributions in applied statistics, especially in survival analysis, mortality or failure analysis, reliability, engineering to model manufacturing and delivery times, in extreme value theory and weather forecasting. Due to its moderately skewed profile, it also characterizes well a wide range of health data, including health monitoring data, epidemiological data such as episode durations of depression and gene expressions data [51–54].
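If all the auxiliary variables, xi; i = 1, 2, …, p, approximately follow a Weibull distribution on the interval [xi,0, xi,L], the three-parameter probability density function is given by:
$$f(x_i; r_i, \theta_i, \gamma_i) = \frac{r_i}{\theta_i}\left(\frac{x_i - \gamma_i}{\theta_i}\right)^{r_i - 1} \exp\left[-\left(\frac{x_i - \gamma_i}{\theta_i}\right)^{r_i}\right], \quad x_i \ge \gamma_i \qquad (24)$$
where ri > 0 is the shape parameter, θi > 0 is the scale parameter and γi is the location parameter of the distribution of the ith auxiliary variable.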
The shape parameter gives the Weibull distribution its flexibility. By changing the value of the shape parameter, the distribution can model a wide variety of data that follows the Exponential distribution, the Rayleigh distribution, the Normal distribution or even the approximate Log-normal distribution.
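The effect of the shape parameter can be checked with a minimal Python sketch of the three-parameter Weibull density and distribution function. The function names are hypothetical; the special cases used in the example (shape 1 gives the Exponential, shape 2 gives the Rayleigh) are standard properties of the Weibull family.

```python
from math import exp

def weibull_pdf(x, shape, scale, loc=0.0):
    """Three-parameter Weibull density; returns 0 at or below the location."""
    if x <= loc:
        return 0.0   # also avoids 0 ** negative power when shape < 1
    z = (x - loc) / scale
    return (shape / scale) * z ** (shape - 1) * exp(-z ** shape)

def weibull_cdf(x, shape, scale, loc=0.0):
    """Three-parameter Weibull distribution function."""
    if x <= loc:
        return 0.0
    return 1.0 - exp(-((x - loc) / scale) ** shape)

# shape = 1: Exponential with mean `scale`.
exp_density = weibull_pdf(2.0, shape=1.0, scale=2.0)
# shape = 2, scale = sigma * sqrt(2): Rayleigh with parameter sigma = 1.
rayleigh_cdf = weibull_cdf(1.0, shape=2.0, scale=2 ** 0.5)
```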
Scaling the auxiliary variables
While stratifying the study variable based on multiple auxiliary variables, the raw data for the different auxiliary variables are generally on different scales (e.g., kg, mg, dollar, etc.). The values of one variable may be less or more spread out than those of other variables. With the auxiliary variables exhibiting different distributions, the range, minimum and maximum values of these auxiliary variables will generally differ from each other. This may affect the convergence of the MPP (15) and hence its ability to determine the OSB accurately. One way to counter this problem is to standardize each variable by subtracting its mean and dividing by its standard deviation.
Another method, which this paper uses, is a simple scaling procedure whereby every variable is divided by its maximum value. While maintaining the original distribution of the auxiliary variables, this scaling procedure brings the auxiliary variables closer to each other, which in turn helps in reducing the overall range, or the search space, of the optimal solution. One must note that the solution procedure of the dynamic programming technique is generally advisable and feasible for small sets of units (N ≤ 20); hence, scaling is a necessary means for faster convergence to an optimal solution. The MPP (15), when solved, provides the OSB of the scaled study variable, and the OSB for the original study variable can be obtained by the usual re-scaling procedure.
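The scaling and re-scaling steps might look as follows in Python. This is a sketch under the assumption that re-scaling simply multiplies the scaled boundaries by the maximum (or assumed maximum) of the study variable; the function names and the numbers are hypothetical.

```python
def scale_by_max(values):
    """Divide every observation by the maximum (assumes a positive maximum)."""
    top = max(values)
    return [v / top for v in values], top

def rescale_boundaries(scaled_boundaries, top):
    """Map boundaries found on the scaled variable back to original units."""
    return [b * top for b in scaled_boundaries]

# Hypothetical auxiliary observations in their original units.
iron = [20.0, 50.0, 100.0]
iron_scaled, iron_max = scale_by_max(iron)

# Hypothetical scaled boundaries and an assumed maximum of the study variable.
boundaries = rescale_boundaries([0.25, 0.5], top=16.0)
```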
Estimating the regression model
To illustrate the estimation of the regression model in formulating the problem of determining OSB as an MPP for a population with more than one auxiliary variable, we use health survey data on Anaemia, which were obtained from the 2004 Fiji National Nutritional Survey conducted by the National Food and Nutrition Centre (Fiji) and funded by AusAID, UNICEF and the Government of Fiji. The data included a micronutrient survey where blood samples were drawn from women of childbearing age and measurements were made to record levels of Haemoglobin, Iron and Folate amongst many other variables. Whilst only tabulations are publicly available from http://ghdx.healthdata.org/record/fiji-national-nutrition-survey-2004, the data used for the purpose of applying the proposed method are accessible from http://repository.usp.ac.fj/id/eprint/10439, where the main aim is to estimate the Haemoglobin content in Fijian women. The data were fully anonymized before being made accessible. The data cannot be de-anonymized because there are no public datasets available to cross-reference.
The data has the following three characteristics for each woman:
- Level of Haemoglobin
- Level of Iron
- Level of Folate
Suppose that a survey on Iron Deficiency Anaemia is to be conducted in a country, where Haemoglobin (y) is the variable of interest and is to be stratified. Then, the levels of Iron and Folate collected in this study may be the reasonable choice for the auxiliary variables, x1 and x2. In this example, Haemoglobin is available to us but in reality the main variable might not be available prior to the survey. Thus, Haemoglobin will be used purely as an example for numerical illustrations and comparison purposes.
To estimate the Haemoglobin content (y) in women, a multiple regression model (given by Eq (5)) was fitted using scaled data from the survey mentioned above. It was observed that the data significantly fitted a linear regression model with Iron and Folate levels (p < 0.001); the estimated parameters for these two predictors were also highly significant (p < 0.001).
The coefficient of determination (R² = SSR/SST), with an Adjusted R-squared value of 12.54%, was found to be among the highest for the linear model when compared with the model summaries of the non-linear models available in standard statistical packages. Thus, this model fits the data best among those considered and gives no reason to consider an alternative model. There is a weak positive linear relationship between Haemoglobin and Iron (r = 0.350, p < 2.2e−16), and between Haemoglobin and Folate (r = 0.161, p < 1.31e−05). Therefore, the Haemoglobin content (y), Iron level (x1) and Folate level (x2) are reasonably assumed to follow a linear regression model of the form given in (5): (25)
This idea can be applied in the realistic situation where the main variable is not available. The beta weights of the regression model and the initial and final values could be taken as rough estimates from prior surveys.
Estimating the distribution of the auxiliary variables
To determine the distributions of the auxiliary variables, f(x1) and f(x2), relative frequency histograms for Iron and Folate are constructed. The two histograms presented in Figs 2 and 3 reveal that the distributions of both auxiliary variables are right-skewed and match 3P Weibull distribution with different parameters.
Using the Kolmogorov-Smirnov test for each of the two variables, the maximum difference (D) between the observed distribution and the Weibull distribution is found to be non-significant (all p-values > 0.05). This supports the assumption that both variables follow 3P Weibull distributions, where the parameters are obtained by maximum likelihood estimation (MLE).
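For readers who want to see how the statistic D is formed, the following Python sketch computes the largest gap between the empirical distribution function of a sample and a hypothesized Weibull distribution function. It computes D only and does not produce a p-value; the sample values and the Weibull parameters are hypothetical and are not the fitted values from the survey.

```python
from math import exp

def weibull_cdf(x, shape, scale, loc=0.0):
    """Distribution function of a three-parameter Weibull variable."""
    if x <= loc:
        return 0.0
    return 1.0 - exp(-((x - loc) / scale) ** shape)

def ks_statistic(sample, cdf):
    """Largest gap between the empirical CDF of `sample` and `cdf`."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d

# Hypothetical scaled observations and parameters (illustrative only).
sample = [0.12, 0.25, 0.31, 0.47, 0.58, 0.90]
D = ks_statistic(sample, lambda x: weibull_cdf(x, shape=1.5, scale=0.5))
```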
Formulation of the MPP with Weibull distribution
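Consider that y has a linear regression on xi; (i = 1, 2, …, p). Then, from (5), the function λ(x1, x2, …, xp) is of the form:
$$\lambda(x_1, x_2, \ldots, x_p) = \beta_0 + \sum_{i=1}^{p} \beta_i x_i \qquad (26)$$
Assume that the model in (26) holds for all strata. Then,
$$\sigma_{h\lambda}^2 = \sum_{i=1}^{p} \beta_i^2 \sigma_{hx_i}^2 \qquad (27)$$
Let all the auxiliary variables, xi, follow a 3P Weibull distribution (i.e., xi ∼ W(ri, θi, γi)) with density function given by (24). By using (8)–(10), the quantities Whxi, μhxi, and $\sigma_{hx_i}^2$ can be obtained as functions of the boundary points (yh−1, yh). Using the substitution yh = yh−1 + lh, they are presented as follows:
$$W_{hx_i} = \exp\left[-\left(\frac{y_{h-1} - \gamma_i}{\theta_i}\right)^{r_i}\right] - \exp\left[-\left(\frac{y_{h-1} + l_h - \gamma_i}{\theta_i}\right)^{r_i}\right] \qquad (28)$$
μhxi can be expressed as:
$$\mu_{hx_i} = \frac{1}{W_{hx_i}} \int_{y_{h-1}}^{y_{h-1} + l_h} x_i f(x_i)\, dx_i \qquad (29)$$
Let Γ(r, x) and Q(r, s) denote the upper incomplete gamma function and the regularized incomplete gamma function, respectively, given by
$$\Gamma(r, x) = \int_{x}^{\infty} t^{r-1} e^{-t}\, dt \qquad (30)$$
$$Q(r, s) = \frac{\Gamma(r, s)}{\Gamma(r)} \qquad (31)$$
Then, using Eqs (28)–(31), μhxi can be simplified to
$$\mu_{hx_i} = \gamma_i + \frac{\theta_i\, \Gamma\!\left(\frac{1}{r_i} + 1\right)\left[Q\!\left(\frac{1}{r_i} + 1, \left(\frac{y_{h-1} - \gamma_i}{\theta_i}\right)^{r_i}\right) - Q\!\left(\frac{1}{r_i} + 1, \left(\frac{y_{h-1} + l_h - \gamma_i}{\theta_i}\right)^{r_i}\right)\right]}{W_{hx_i}} \qquad (32)$$
Similarly, the quantity $\sigma_{hx_i}^2$ reduces to
$$\sigma_{hx_i}^2 = \frac{\theta_i^2\, \Gamma\!\left(\frac{2}{r_i} + 1\right)\left[Q\!\left(\frac{2}{r_i} + 1, \left(\frac{y_{h-1} - \gamma_i}{\theta_i}\right)^{r_i}\right) - Q\!\left(\frac{2}{r_i} + 1, \left(\frac{y_{h-1} + l_h - \gamma_i}{\theta_i}\right)^{r_i}\right)\right]}{W_{hx_i}} - \left(\mu_{hx_i} - \gamma_i\right)^2 \qquad (33)$$
where Whxi and μhxi are given by Eqs (28) and (32) respectively.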
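The Weibull stratum quantities can be sketched numerically in Python without a library routine for the incomplete gamma function, because only differences of the form Γ(s, A) − Γ(s, B) are needed, and each such difference equals the integral of t^(s−1)·e^(−t) over [A, B]. The sketch below evaluates that integral with a composite Simpson rule. The function names are hypothetical, the stratum is assumed to start at or above the location parameter, and the example uses shape 1 (an exponential variable) only because its truncated moments are easy to verify by hand.

```python
from math import exp

def simpson(func, lo, hi, steps=2000):
    """Composite Simpson rule on [lo, hi]; `steps` must be even."""
    h = (hi - lo) / steps
    total = func(lo) + func(hi)
    for j in range(1, steps):
        total += (4 if j % 2 else 2) * func(lo + j * h)
    return total * h / 3

def weibull_stratum(lower, width, shape, scale, loc=0.0):
    """Weight, mean and variance of a 3P Weibull variable on a stratum."""
    t_lo = ((lower - loc) / scale) ** shape
    t_hi = ((lower + width - loc) / scale) ** shape
    weight = exp(-t_lo) - exp(-t_hi)
    # Gamma(s, t_lo) - Gamma(s, t_hi) = integral of t**(s-1) * exp(-t) on [t_lo, t_hi]
    m1 = simpson(lambda t: t ** (1 / shape) * exp(-t), t_lo, t_hi)
    m2 = simpson(lambda t: t ** (2 / shape) * exp(-t), t_lo, t_hi)
    shifted_mean = scale * m1 / weight          # mean of (x - loc) in the stratum
    variance = scale ** 2 * m2 / weight - shifted_mean ** 2
    return weight, loc + shifted_mean, variance

# Exponential special case (shape = 1) on the stratum [2, 3] with location 2.
w, mu, var = weibull_stratum(lower=2.0, width=1.0, shape=1.0, scale=1.0, loc=2.0)
```

For shape parameters below 1 the integrand is less smooth near zero, so more Simpson steps or a dedicated incomplete gamma routine would be advisable.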
Since the auxiliary variables follow Weibull distributions, Wh and in the first term of (7) are given by (28) and (33) respectively. Thus, for the ith auxiliary variable, is (34) Using (34), the formulated MPP given in (15) could be generalised and expressed as the following MPP in order to determine the OSB for the main variable: (35) where d in Eq (35) is the estimated or hypothetical range of the main study variable, βi are the regression coefficients, θi and ri are parameters of the 3P Weibull distributions of ith auxiliary variable, Γ(⋅) is the upper incomplete gamma function and Q(⋅) is the upper regularized incomplete gamma function. Whereas, the term can be computed when the distribution of ϵ is known. For the current model, since this error term is normally distributed, the distribution is given by: (36) Then, following from (8)–(10), Wh and σhϵ are obtained as: (37) (38) where and h = 1, 2, …, L.
In this section, numerical results are presented to illustrate the application of the proposed technique to a real and a simulated population. The OSB for the main variable are obtained and presented together with the values of the objective function for L = 2, 3, …, 6 for different regression models.
The real data, as discussed earlier in Section 6.2, have Haemoglobin as the study variable, while Iron and Folate are auxiliary variables that follow Weibull distributions with their estimated parameters. Haemoglobin is used here purely for comparison purposes; in reality, the main variable is not available. Using the recursive Eqs (17) and (18), the MPP (35) with d = 10.9 (range of the main variable) is solved by executing a C++ computer program developed to implement the proposed DP technique. R code was also developed for computing quantities such as the initial value (x0) of the distribution, the regression coefficients (βi), the Weibull parameters (shape ri, scale θi, location γi), the range (d) of the distribution, etc., which are required for determining the OSB using the C++ program. Users can easily stratify a population by executing the C++ program for the given values of L, x0, d, n, etc. in an open source IDE such as Dev-C++. The C++ program and R code can be made available on request from the authors.
The results for the OSB (yh) along with optimum sample sizes (nh) and the values of the objective function are presented in Table 1 for the following regression models: (39)
A skewed population with two auxiliary variables (x1 and x2) and the study variable (y), each of size N = 5000, were randomly generated using the R software. This data had a relatively weak linear relationship between y and x1 (r = 0.014, p = 0.34), and a weak linear relationship as well between y and x2 (r = 0.023, p = 0.11). The simulated data was different from the real data in the sense that it had a very low predictive power in its regression models (Adj. R2 = 0.03%). The ANOVA results from multiple linear regression also indicated a non-statistically significant model fit (p = 0.175).
For the simulated data, the OSB (yh) along with optimum sample sizes (nh) and ∑Wh σh values are presented in Table 2 for the following regression models:
Various other investigations related to OSB, sample size and the performance of the proposed technique, in both real and simulated data, are carried out and discussed in the following section and subsections.
Results and discussion
Primarily, this paper involves the use of multiple auxiliary variables in determining the OSB for the study variable. The performance of the proposed method is also investigated with respect to several pertinent issues, namely:
- Comparison of results using single and multiple auxiliary variables;
- Comparison with other established methods of stratification in literature;
- Determination of the optimum number of strata;
- Comparison of stratification using another skewed distribution, such as the 3P Gamma;
- Sensitivity of the proposed method with linear regression against nonlinear regression;
- Consistency of the results obtained for real data with a simulated data set.
Thus, in the following subsections, comparative results are presented for the three models that are used to create OSB for the main variable in real and simulated data. This is done to ascertain the effects of using a single auxiliary variable and multiple auxiliary variables in terms of the changes observed in the OSB, sample sizes and the ∑Wh σh values achieved for L = 2, …, 6. Together with results on the performance of the proposed method against other methods, results for the 3P Gamma distribution and nonlinear regression are also presented.
Use of single and multiple auxiliary variables
Tables 1 and 2 present the OSB, sample sizes and ∑Wh σh values for real and simulated data respectively. For real data, the ∑Wh σh values are lowest in Model 2, which uses Folate, while for simulated data they are lowest in Model 4, which uses variable x1. The ∑Wh σh values for the other models (i.e., Models 1 and 3 in real data and Models 5 and 6 in simulated data) are close to each other. In all the models, the ∑Wh σh values of the main variables in both real and simulated data appear to decline exponentially as L increases. It must also be noted that in Table 2, the OSB and OSS are equivalent in all three models. This is because the simulated data set is quite large, which results in a very precise fit of the distribution and hence equivalent OSB in all three models.
The findings in both data sets are similar in the sense that a single auxiliary variable model may perform either better or worse than the model with multiple auxiliary variables. In real data, Model 2 performs better than Models 1 and 3 while in simulated data, Model 4 performs better than Models 5 and 6. This may be because the auxiliary variables in Model 2 (real data) and Model 4 (simulated data) have a much weaker correlation with the dependent variable (see Tables 3 and 4). Tables 3 and 4 present key statistics such as the correlations, a measure of regression error (RSE) and goodness of fit (AIC) for all the models in both data sets. It appears that the model whose auxiliary variable(s) have the lowest correlation and Adjusted R2 and the highest RSE or AIC performs the best under the proposed method. Thus, the proposed method of stratification works best with uncorrelated auxiliary variable(s).
Comparison with other available methods
To compare the performance of the proposed method, the following univariate methods available in the literature are considered:
- Cum method;
- Geometric method;
- Lavallée and Hidiroglou's (L-H) method with Kozak's algorithm.
The stratification package developed by Baillargeon and Rivest [56] in the R statistical software is used to determine the OSB and sample sizes for the main study variable, Haemoglobin. The OSB are then used to compute the variance of the estimated mean (i.e., the values of the objective function or ∑Wh σh) in each of the six models so that a comparative analysis can be carried out between the established methods and the proposed method. Note that comparisons are only possible here because the main variable is available to us in this example. The three methods above need the main variable to work out the OSB, whereas the proposed method can work on auxiliary variable(s) to compute OSB for the main variable, with a few assumptions on the main variable.
The results, based on Models 1, 2 and 3 for real data, are given in Table 5, which presents the ∑Wh σh values of the estimate for the Cum method, the Geometric method, Lavallée and Hidiroglou’s method and the proposed method with a fixed total sample size of n = 500, for L = 2, 3, …, 6. The efficiencies of the proposed DP method over the other three methods are also presented in the table.
Upon examination of these results, it is noted that when a single auxiliary variable (Model 1) is used to determine OSB, the proposed method performs considerably better than the three methods, with efficiency gains of about 2% to 50% for L = 2, 3, …, 6. Model 2 also produces much more efficient OSB than the other methods, with gains from about 1% to 49%, which is quite similar to Model 1. Model 3 yields gains from about 2% to 50%, almost identical to Model 1. Thus, with the use of auxiliary variables, either single or both, the proposed method increases the precision of the estimate compared to other univariate methods.
For simulated data, Table 7 presents the ∑Wh σh values for the three methods along with the proposed method for the three different models (Models 4-6) together with the efficiencies of the proposed method over the others. The results generally support the findings obtained for real data. Compared to all other methods, the proposed method increases the precision by about 11% to 63% in Model 4 and 21% to 133% in Model 6. For Model 5, the proposed method increases the precision by about 26% to 127% against the Cum method and 25% to 135% against the L-H (Kozak) method. It does not perform as well against the Geometric method. Table 8 provides the OSB and sample sizes using the other methods, which can be compared with the results of the proposed method presented in Table 2.
When considering Weibull distribution cases, the sample allocations under the proposed method (which uses Neyman allocation given by (19)) are given in Tables 1 and 2 for real and simulated data respectively. In this method, the stratum size (Nh) as well as the variability (Sh) of the auxiliary variable(s) affects the stratum sample sizes (nh), i.e., nh ∝ Wh Sh. It is noticeable that for both real and simulated examples, the stratum sample sizes given by the proposed method are slightly different from those given by the other methods presented in Tables 6 and 8. This is because of the differences in the OSB, and hence in the Wh, between the methods.
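The proportionality nh ∝ Wh Sh can be illustrated with a short Python sketch. This is not the authors' C++ or R code: the function name `neyman_allocation` and the three-stratum input values are hypothetical, and the result is left unrounded, so any rounding rule is up to the user.

```python
def neyman_allocation(weights, std_devs, n_total):
    """Split n_total across strata in proportion to W_h * S_h."""
    products = [w * s for w, s in zip(weights, std_devs)]
    total = sum(products)
    if total <= 0:
        raise ValueError("sum of W_h * S_h must be positive")
    # Real-valued allocation; rounding to integers is not handled here.
    return [n_total * p / total for p in products]


# Hypothetical stratum weights and standard deviations (not from the paper),
# with the total sample size n = 500 used in the comparisons.
allocation = neyman_allocation([0.5, 0.3, 0.2], [1.0, 2.0, 4.0], 500)
print([round(a, 1) for a in allocation])
```

A small but highly variable stratum can therefore receive more sample than a large homogeneous one, which is why different OSB lead to different stratum sample sizes.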
To substantiate the results, bootstrap re-sampling is used to investigate the behaviour of the findings made earlier on the real dataset. A large number (n = 10,000) of independent re-samples are drawn with replacement from the population data. The re-samples are of the same size as the Anaemia data (N = 724), creating many variants of the original data. Since there are three variables in the Anaemia population, bootstrap re-sampling is done on individuals, so that each re-sampled population contains all three variables. From the large number of bootstrap re-samples, results for only 5 randomly selected samples are presented for the sake of brevity. We consider all three models given by equations in (39).
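Re-sampling on individuals can be sketched in Python as follows. The sketch assumes each individual is stored as one record holding all three variables; the function name `bootstrap_population` and the toy records are hypothetical and are not taken from the Anaemia data.

```python
import random


def bootstrap_population(records, rng):
    """Draw len(records) records with replacement, keeping each record intact."""
    return [rng.choice(records) for _ in range(len(records))]


# Hypothetical records of (haemoglobin, iron, folate); illustrative values only.
population = [(12.1, 60.0, 8.2), (10.4, 35.5, 6.9), (13.0, 80.1, 9.4)]
rng = random.Random(1)  # fixed seed so the sketch is repeatable
resample = bootstrap_population(population, rng)
print(len(resample))
```

Because whole records are drawn, the relationship between the study variable and the auxiliary variables within each individual is preserved in every re-sample.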
For all five bootstrap samples, Tables 9–13 present the OSB, OSS (nh) and variances for all three models, calculated using the proposed method. It is again observed that Model 2 has the lowest variance, which means that it is the best of the three models to use. To further investigate why Model 2 is the best, Table 14 is drawn up. The results are found to be consistent in all five bootstrap samples. Model 2 performs the best because it has a low correlation with the main variable together with a high RSE, a very low adjusted R2 and the highest AIC amongst the three models. Thus, whether single or multiple auxiliary variables (i.e., all models studied herein) are used in the formulation of the stratification problem, the gains in efficiency of the proposed method over other established methods are substantial. These are given in Tables 15–19, where the variances given by the proposed method are lower than those of the other methods. Hence, the bootstrap re-sampling procedure yields findings consistent with those seen in the original Anaemia data.
Number of strata
To study the relationship between the number of strata and the ∑Wh σh value, an investigation is carried out for the real and simulated data using the six models. The ∑Wh σh values are calculated using the proposed method for L = 2, 3, …, 20. The results are presented in Figs 4 and 5, where the curves nearly coincide and all decrease exponentially. After L = 7, where the “elbow” is found, the rate of decrease in the ∑Wh σh values is much smaller than that seen from L = 2 to 7. For argument’s sake, one might even be comfortable with L = 6 as the appropriate number of strata. The finding supports the investigation carried out by Cochran that constructing more than six strata is not very useful in terms of the relative gain in efficiency or the reduction of ∑Wh σh. All six models in real and simulated data are very similar in their relative gains in efficiency, and one can easily pick out L = 7 where the “elbow” appears, indicating that the percentage reduction thereafter is not worth investing in for a sample survey, since additional costs are involved with an increase in the number of strata. Increasing the number of strata beyond 7 may not be a good trade-off for a small gain in the precision of the estimates.
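One simple way to make the diminishing gain concrete is to compute the percentage reduction in ∑Wh σh each time a stratum is added. The Python sketch below is only an illustration: the paper identifies the “elbow” visually from Figs 4 and 5, whereas the threshold rule, the function names and the input values here are all assumptions.

```python
def percent_reductions(objective_values):
    """Percentage drop in the objective each time one more stratum is added."""
    return [
        100.0 * (prev - curr) / prev
        for prev, curr in zip(objective_values, objective_values[1:])
    ]


def first_small_gain(num_strata, objective_values, threshold_pct):
    """Smallest L beyond which one extra stratum gains less than threshold_pct."""
    drops = percent_reductions(objective_values)
    for strata_count, drop in zip(num_strata, drops):
        if drop < threshold_pct:
            return strata_count
    return None  # every additional stratum still gains at least threshold_pct


# Hypothetical objective values for L = 2, ..., 6 (not taken from the tables).
strata = [2, 3, 4, 5, 6]
values = [0.50, 0.30, 0.20, 0.19, 0.185]
print(first_small_gain(strata, values, threshold_pct=10.0))
```

The threshold expresses how much precision gain is considered worth the extra cost of another stratum, so it would need to be chosen per survey.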
Using skewed distributions other than Weibull
The distribution of the auxiliary variable can vary depending on how well the data fits a particular skewed distribution based on the best MLE of its parameters. Weibull is selected in this paper due to its versatility in fitting skewed distributions, especially for health data. To compare the performance of the Weibull distribution against another skewed distribution, both auxiliary variables in the real and simulated data are fitted with a 3P Gamma distribution, which also has a moderately skewed profile. Three different linear regression models are again used for the comparison of results. The associated MPP is formulated and solved using the DP technique.
The OSB, sample sizes and ∑Wh σh values are presented in Table 20 for the real data while Table 21 is for the simulated data. Similar to the results obtained under the Weibull distribution, the results for Gamma show that the OSB are slightly different from each other in all three models. To compare the performance of the Gamma results against those obtained under the Weibull distribution for the real data, the ∑Wh σh values from Table 20 are compared with those in Table 1. They reveal that fitting the data with the Weibull distribution yields a much more efficient set of OSB than fitting the data with the Gamma distribution. This holds true for both single and multiple auxiliary variables. Results for the simulated data in Tables 2 and 21 reveal similar findings. This may be due to the fact that Weibull was a better fit than Gamma for the two auxiliary variables.
Linear versus nonlinear regression
As shown in (5), the proposed method can incorporate linear as well as nonlinear regression for construction of OSB. In the preceding sections, it has been discussed that linear regression performs well in real as well as simulated data. To investigate the sensitivity of the method to linear versus nonlinear regression, a simple case of quadratic regression is fitted in this section. Consider that the study variables are to be stratified using a single auxiliary variable (e.g., Iron in real data and x2 in simulated data). Then, λ(x) in (5) reduces to: (40) (41) The ANOVA results for this quadratic regression reveal that the model is statistically significant (p-value < 0.001) for both real and simulated data.
Using the procedures discussed in Sections 3–7, the OSB and sample sizes are determined. Table 22 presents the results along with the ∑Wh σh values for real and simulated data. The results reveal that for both data sets, the ∑Wh σh values from linear regression (Model 3 from Table 1 and Model 6 from Table 2) are lower than those from the nonlinear regression model, which means that linear regression performs better than nonlinear regression.
To investigate this further, Table 23 presents some key statistical measures, such as the measure of regression error (RSE) and goodness of fit (AIC), showing how the model under nonlinear regression performs against the models under linear regression for both real and simulated data. The measures for linear regression are presented in Tables 3 and 4. They reveal that the results are consistent with the findings earlier in the paper, namely that the model with the lowest Adjusted R2 and the highest RSE or AIC performs the best. Thus, the linear regression model performs better than the nonlinear regression model.
Stratified random sampling is an efficient and widely used sampling technique in health surveys to estimate the prevalence of diseases and many other parameters. Often, the surveyors encounter two major difficulties prior to drawing the samples: (i) constructing the optimum strata within which the units are as homogeneous as possible and (ii) determining the optimum sample size to be drawn from each stratum, so that the precision of the estimates of parameters of the study or target variables is maximized. In this paper, a parametric-based method is proposed to address these two problems, which can be used to estimate parameters with more precision.
The optimum stratification based on the study variable is not feasible in practice since it is unknown prior to conducting the survey. Thus, the proposed technique uses auxiliary information in designing the sampling plan. This paper investigates how the use of one or more auxiliary variables influences the OSB, and hence the efficiency of the stratum boundaries, by fitting a distribution of the Weibull family that characterizes many health variables. It also investigates the sensitivity of the OSB and the performance of the proposed method when fitting other skewed distributions such as Gamma. Together with investigating the optimum number of strata, the paper also examines the sensitivity of the proposed method to linear and nonlinear regression modelling techniques.
The problem of finding the OSB is formulated as an MPP that seeks minimization of the variance of the estimated population parameter and solved using a DP technique. The solution procedure is implemented through a C++ computer program, together with an R script that facilitates the computation of the OSB through the C++ program. Both materials can be made available on request from the authors. After the OSB are obtained, they are used to compute the optimum sample size for each stratum using Neyman allocation. Numerical examples using a real data set and a simulated data set are presented to illustrate the application, the sensitivity and the usefulness of the proposed technique. This paper also presents the results from the cum method, the geometric method and the generalized Lavallée and Hidiroglou’s method [11, 18] for a comparative analysis.
It can be concluded that in the construction of strata for health populations, the use of either single or multiple auxiliary variables leads to substantial gains in the precision of the estimates over other available methods. It was also established that using uncorrelated auxiliary variable(s) to determine OSB for the main variable leads to much more efficient results. When another skewed distribution such as Gamma was used to characterize the distribution of the auxiliary variables, it performed well but not quite as accurately as Weibull. Hence, the best-fit distribution should always be chosen for more accurate calculation of OSB. It was also found that when linear regression was used in formulating the problem of stratification, it performed better than nonlinear regression. However, this depends on the data, and one must always choose the best regression technique to represent the relationship between the variables.
The following steps are followed in implementing the DP technique to solve the MPP for the OSB:
- Step 1: Start at k = 1. Set Φ0(d0) = 0.
- Step 2: Calculate Φ1(d1), the minimum value of the RHS of (18) for l1 = d1, 0 ≤ l1 ≤ d1, and 0 ≤ d1 ≤ d.
- Step 3: Record Φ1(d1) and l1.
- Step 4: For k = 2, express the state variable as dk−1 = dk − lk.
- Step 5: Set Φk(dk) = 0 if lk > dk, where 0 ≤ dk ≤ d.
- Step 6: Calculate Φk(dk), the minimum value of the RHS of (17) for lk, 0 ≤ lk ≤ dk.
- Step 7: Record Φk(dk) and lk.
- Step 8: For k = 3, …, L, go to Step 4.
- Step 9: At k = L, ΦL(d) is obtained and hence the optimum value of lL is obtained.
- Step 10: At k = L − 1, using the backward calculation for dL−1 = dL − lL, read the value of ΦL−1(dL−1) and hence the optimum value of lL−1.
- Step 11: Repeat Step 10 until the optimum value of l1 is obtained from Φ1(d1).
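The forward recursion and the backward read-out in the steps above can be sketched in Python as follows. This is an illustrative sketch only, not the authors' C++ program: the range d is discretized into grid steps, every stratum is forced to have a width of at least one step, and the per-stratum term is passed in as a function `cost` because the expressions in Eqs (17) and (18) are not reproduced here. The example `uniform_cost` assumes a uniform density, for which Wh σh equals the squared stratum width divided by d√12; it is purely a stand-in for the Weibull-based expression, and all names are hypothetical.

```python
import math


def optimum_widths(num_strata, grid_points, cost):
    """Minimise the summed stratum cost over widths l_1..l_L (in grid steps).

    cost(start, end) is the contribution of a stratum that covers grid
    steps start..end, measured from the initial value x0.
    """
    inf = math.inf
    # phi[k][dk]: best objective when the first k strata span dk grid steps.
    phi = [[inf] * (grid_points + 1) for _ in range(num_strata + 1)]
    best_l = [[0] * (grid_points + 1) for _ in range(num_strata + 1)]
    phi[0][0] = 0.0  # Phi_0(d_0) = 0
    for k in range(1, num_strata + 1):
        for dk in range(k, grid_points + 1):
            # Leave at least one grid step for each of the k - 1 earlier strata.
            for lk in range(1, dk - (k - 1) + 1):
                previous = phi[k - 1][dk - lk]  # state d_{k-1} = d_k - l_k
                if previous == inf:
                    continue
                value = cost(dk - lk, dk) + previous
                if value < phi[k][dk]:
                    phi[k][dk] = value
                    best_l[k][dk] = lk
    # Backward pass: read l_L first, then l_{L-1}, ..., l_1.
    widths = []
    dk = grid_points
    for k in range(num_strata, 0, -1):
        lk = best_l[k][dk]
        widths.append(lk)
        dk -= lk
    widths.reverse()
    return phi[num_strata][grid_points], widths


def uniform_cost(total_width):
    """W_h * sigma_h for a uniform density over total_width (an assumption)."""
    def cost(start, end):
        width = end - start
        return width * width / (total_width * math.sqrt(12))
    return cost


objective, widths = optimum_widths(4, 100, uniform_cost(100))
print(objective, widths)
```

The boundaries follow by accumulating the widths from x0, with each grid step equal to d divided by the number of grid points. A finer grid gives more precise boundaries at a cost that grows with the square of the number of grid points.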
The authors are grateful to the Academic Editor and Reviewers for their invaluable comments and suggestions to improve the manuscript.
- 1. Cochran WG. Sampling techniques. New York: Wiley and Sons; 1977.
- 2. Lohr S. Sampling: design and analysis. Nelson Education; 2009.
- 3. Dalenius T. The problem of optimum stratification. Scandinavian Actuarial Journal. 1950;(3-4):203–213.
- 4. Dalenius T, Gurney M. The problem of optimum stratification. II. Scandinavian Actuarial Journal. 1951;1951(1-2):133–148.
- 5. Mahalanobis PC. Some aspects of the design of sample surveys. Sankhyā: The Indian Journal of Statistics. 1952; p. 1–7.
- 6. Hansen MH, Hurwitz WN. On the Theory of Sampling from Finite Populations. The Annals of Mathematical Statistics. 1943;14(4):333–362.
- 7. Aoyama H. A study of the stratified random sampling. Annals of the Institute of Statistical Mathematics. 1954;6(1):1–36.
- 8. Ekman G. An Approximation Useful in Univariate Stratification. The Annals of Mathematical Statistics. 1959;30(1):219–229.
- 9. Sethi VK. A note on optimum stratification of populations for estimating the population means. Australian Journal of Statistics. 1963;5(1):20–33.
- 10. Unnithan VKG. The minimum variance boundary points of stratification. Sankhya. 1978;40(C):60–72.
- 11. Lavallée P, Hidiroglou M. On the stratification of skewed populations. Survey methodology. 1988;14(1):33–43.
- 12. Hidiroglou MA, Srinath KP. Problems associated with designing subannual business surveys. Journal of Business & Economic Statistics. 1993;11(4):397–405.
- 13. Sweet EM, Sigman RS. Evaluation of model-assisted procedures for stratifying skewed populations using auxiliary data. In: Proceedings of the Section on Survey Research Methods. vol. 1; 1995. p. 491–496.
- 14. Rivest LP. A generalization of the Lavallée and Hidiroglou algorithm for stratification in business surveys. Survey Methodology. 2002;28(2):191–198.
- 15. Niemiro W. Optimal construction of strata using random search method. Wiadomosci statystyczne. 1999;10:1–9.
- 16. Nelder JA, Mead R. A simplex method for function minimization. The computer journal. 1965;7(4):308–313.
- 17. Lednicki B, Wieczorkowski R. Optimal stratification and sample allocation between subpopulations and strata. Statistics in transition. 2003;6(2):287–305.
- 18. Kozak M. Optimal stratification using random search method in agricultural surveys. Statistics in Transition. 2004;6(5):797–806.
- 19. Gunning P, Horgan JM. A new algorithm for the construction of stratum boundaries in skewed populations. Survey Methodology. 2004;30(2):159–166.
- 20. Dalenius T, Hodges JL Jr. Minimum variance stratification. Journal of the American Statistical Association. 1959;54(285):88–101.
- 21. Horgan JM. Stratification of Skewed Populations: A review. International Statistical Review. 2006;74(1):67–76.
- 22. Kozak M, Verma MR. Geometric versus optimization approach to stratification: A comparison of efficiency. Survey Methodology. 2006;32(2):157.
- 23. Kozak M, Verma MR, Zielinski A. Modern approach to optimum stratification: Review and perspectives. Statistics in Transition. 2007;8(2):223–250.
- 24. Khan EA, Khan MGM, Ahsan MJ. Optimum stratification: a mathematical programming approach. Calcutta Statistical Association Bulletin. 2002;52:323–333.
- 25. Khan MGM, Sehar N, Ahsan MJ. Optimum stratification for exponential study variable under Neyman allocation. Journal of the Indian Society of Agricultural Statistics. 2005;59(2):146–150.
- 26. Khan MGM, Ahmad N, Khan S. Determining the Optimum Stratum Boundaries Using Mathematical Programming. Journal of Mathematical Modelling and Algorithms. 2009;8(4):1–15.
- 27. Nand N. Determining the Optimum Strata Boundary Points using Mathematical Programming. Survey Methodology. 2003;34(2):1–3.
- 28. Khan MGM, Nand N, Ahmad N. Determining the optimum strata boundary points using dynamic programming. Survey Methodology. 2008;34(2):205–214.
- 29. Nand N, Khan MGM. Optimum Stratification for Cauchy and Power Type Study Variable. Journal of Applied Statistical Science. 2009;16(4):453.
- 30. Khan MGM, Reddy KG, Rao DK. Designing stratified sampling in economic and business surveys. Journal of Applied Statistics. 2015; p. 1–20.
- 31. Bühler W, Deutler T. Optimal stratification and grouping by dynamic programming. Metrika. 1975;22(1):161–175.
- 32. Lavallée P. Two-way Optimal Stratification Using Dynamic Programming. In: Proceedings of the Section on Survey Research Methods. Virginia: American Statistical Association; 1988.
- 33. Dalenius T. Sampling in Sweden: contributions to the methods and theories of sample survey practice. Almqvist and Wiksell; 1957.
- 34. Dalenius T, Hodges JL. The choice of stratification points. Scandinavian Actuarial Journal. 1957;1957(3-4):198–203.
- 35. Taga Y. On optimum stratification for the objective variable based on concomitant variables using prior information. Annals of the Institute of Statistical Mathematics. 1967;19(1):101–129.
- 36. Serfling RJ. Approximately optimal stratification. Journal of the American Statistical Association. 1968;63(324):1298–1309.
- 37. Singh R, Sukhatme BV. Optimum stratification. Annals of the Institute of Statistical Mathematics. 1969;21(1):515–528.
- 38. Singh R, Sukhatme BV. Optimum stratification in sampling with varying probabilities. Annals of the Institute of Statistical Mathematics. 1972;24(1):485–494.
- 39. Singh R, Sukhatme BV. Optimum stratification with ratio and regression methods of estimation. Annals of the Institute of Statistical Mathematics. 1973;25(1):627–633.
- 40. Singh R, Parkash D. Optimum stratification for equal allocation. Annals of the Institute of Statistical Mathematics. 1975;27(1):273–280.
- 41. Mehta SK, Singh R, Kishore L. On optimum stratification for allocation proportional to strata totals. Journal of Indian Statistical Association. 1996;34:9–19.
- 42. Rizvi SEH, Gupta JP, Bhargava M. Optimum stratification based on auxiliary variable for compromise allocation. Metron. 2002;60(3-4):201–215.
- 43. Gupta RK, Singh R, Mahajan PK. Approximate optimum strata boundaries for ratio and regression estimators. Aligarh Journal of Statistics. 2005;25:49–55.
- 44. Singh S. Advanced Sampling Theory With Applications: How Michael “Selected” Amy. vol. 2. Springer Science & Business Media; 2003.
- 45. Thomsen I. A comparison of approximately optimal stratification given proportional allocation with other methods of stratification and allocation. Metrika. 1976;23(1):15–25.
- 46. Hedlin D. On the stratification of highly skewed populations. Stockholm University. Mathematical Statistics; 1998.
- 47. Neyman J. On the two different aspects of the representative method: the method of stratified sampling and the method of purposive selection. Journal of the Royal Statistical Society. 1934; p. 558–625.
- 48. Bellman RE. Dynamic Programming. Princeton, N.J.: Princeton University Press; 1957.
- 49. Taha HA. Operations Research: An Introduction. New Jersey: Pearson Education, Inc.; 2007.
- 50. Cochran WG. Comparison of methods for determining stratum boundaries. Bulletin of the International Statistical Institute. 1961;38(2):345–358.
- 51. Patten SB. A major depression prognosis calculator based on episode duration. Clinical Practice and Epidemiology in Mental Health. 2006;2(1):13. pmid:16774672
- 52. Wahed AS, Luong TM, Jeong J. A new generalization of Weibull distribution with application to a breast cancer data set. Statistics in medicine. 2009;28(16):2077–2094. pmid:19424958
- 53. Niu G, Singh S, Holland SW, Pecht M. Health monitoring of electronic products based on Mahalanobis distance and Weibull decision metrics. Microelectronics Reliability. 2011;51(2):279–284.
- 54. Wang H, Wang Z, Li X, Gong B, Feng L, Zhou Y. A robust approach based on Weibull distribution for clustering gene expression data. Algorithms for Molecular Biology. 2011;6(1):14. pmid:21624141
- 55. Hansen P, Jaumard B. Cluster analysis and mathematical programming. Mathematical programming. 1997;79(1-3):191–215.
- 56. Baillargeon S, Rivest LP. The construction of stratified designs in R with the package stratification. Survey Methodology. 2011;37(1):53–65.