Estimation of the distribution function of a finite population utilizing auxiliary information in the context of non-response within complex survey sampling

Mohsin Abbas; Muhammad Ahmed Shehzad; Haris Khurram; Mahwish Rabia

doi:10.1371/journal.pone.0322660

Abstract

This study focuses on estimating a finite population cumulative distribution function (CDF) using two-stage and three-stage cluster sampling under non-response. This work is then extended to estimate the finite population CDF under non-response using stratified two-stage and three-stage cluster sampling. We propose two distinct families of CDF estimators, specifically designed for these complex surveys, namely classical ratio/product-type and exponential ratio/product-type. Furthermore, we introduce a difference estimator for the CDF under non-response, utilizing ancillary information about the variances and covariances of the estimators under these complex schemes. We provide mathematical expressions for the biases and mean squared errors of the proposed CDF estimators, based on first-order approximation. To evaluate the performance of the proposed estimators, we conduct extensive simulations and assess their efficiency. The simulation results demonstrate that the proposed families of estimators perform well under different sampling scenarios. Our findings indicate that the difference CDF estimators are more precise than the other estimators discussed. We support our theoretical claims by analyzing real datasets.

Citation: Abbas M, Ahmed Shehzad M, Khurram H, Rabia M (2025) Estimation of the distribution function of a finite population utilizing auxiliary information in the context of non-response within complex survey sampling. PLoS One 20(5): e0322660. https://doi.org/10.1371/journal.pone.0322660

Editor: Laleh Tafakori, RMIT University, AUSTRALIA

Received: December 21, 2024; Accepted: March 25, 2025; Published: May 22, 2025

Copyright: © 2025 Abbas et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: The datasets supporting the findings of this study are available in the Figshare repository at https://doi.org/10.6084/m9.figshare.28151576.v1 and https://doi.org/10.6084/m9.figshare.28151780.v1 for Populations I and II, respectively. Relevant data are also provided in the supplementary information section.

Funding: The author(s) received no specific funding for this work.

Competing interests: The authors have declared that no competing interests exist.

1 Introduction

Non-response is a critical issue in survey sampling that can lead to biased estimates and inefficiencies in statistical inference. In complex survey sampling schemes such as two-stage cluster sampling (2SCS), stratified two-stage cluster sampling (S2SCS), three-stage cluster sampling (3SCS), and stratified three-stage cluster sampling (S3SCS), the challenge of non-response becomes even more pronounced due to the hierarchical structure of the data and the varying probabilities of response at different stages. Despite the extensive literature on CDF estimation, existing methods primarily focus on simple random sampling (SRS) or basic stratified sampling, with limited attention to the impact of non-response in complex sampling frameworks.

A significant gap in the literature is the absence of comprehensive methods for estimating the finite population CDF under non-response within these complex sampling schemes while effectively incorporating auxiliary information. Traditional estimation approaches often neglect the potential of utilizing ancillary information such as variances and covariances of estimators, which could significantly enhance estimation efficiency. Moreover, the use of ratio/product-type and exponential ratio/product-type estimators in handling non-response within complex sampling designs remains under-explored.

It is common for surveys to have incomplete data due to certain reasons. For example, in an opinion survey, the selected family may have moved to another place or the selected person may have passed away. In a mailed questionnaire, many respondents may not reply. This problem of incomplete sample data due to the non-availability of information from the respondent is called the problem of non-response. The estimate attained from such deficient data can be inaccurate, especially when there are differences between the respondents and non-respondents. One way to address the problem of non-response in a mailed questionnaire survey is to divide the population into two groups: respondents and non-respondents. From the non-respondent group, a sub-sample is selected and personally interviewed to obtain the necessary information. This technique is known as sub-sampling and was introduced by [7]. By making this extra effort, an unbiased estimator can be obtained for the non-response cases.

Estimating the CDF for a finite population is a fundamental problem in survey sampling, particularly in scenarios where auxiliary information is available. Accurate estimation of the CDF is crucial in various fields, including public health, economics, and social sciences, where decision-making often relies on understanding the distribution of a population’s characteristics. However, the complexity of survey designs, such as multi-stage cluster sampling with or without stratification, coupled with the challenge of non-response, complicates the estimation process. Non-response, a common issue in surveys, can introduce significant bias, leading to inaccurate estimates if not properly accounted for.

The estimation of the CDF in finite populations has been extensively studied in the context of survey sampling. Early works by [4] and [22] laid the foundation for the use of supplementary information in improving the precision of survey estimates. The ratio and product estimators, introduced by these pioneers, have been particularly popular due to their ability to reduce the variance of estimators by leveraging the correlation between the study variable and auxiliary information. These estimators have since been widely applied in various sampling schemes, including SRS and stratified sampling; see [2,6,11–13] for more details.

In 2SCS, the population is divided into primary sampling units (PSUs), a subset of the PSUs is then selected, and secondary sampling units (SSUs) are drawn from the selected PSUs. 3SCS adds a further level by selecting tertiary sampling units (TSUs) from within the chosen SSUs. For more details, see [4,8,10] and the references cited therein. S2SCS and S3SCS further improve precision by dividing the population into distinct strata based on certain characteristics before applying the sampling stages. This ensures better representation of subgroups, reduces variability, and improves the accuracy of estimates, particularly when the population is heterogeneous. These methods are particularly effective for large, dispersed, and diverse populations. For more details about these sampling schemes, see [1,8,15,17–19,21,22,24,25,30].

With the increasing complexity of survey designs, particularly in multi-stage sampling, researchers have sought to adapt these classical estimators to more complex scenarios. [8] provided early treatments of multi-stage sampling, highlighting the challenges and potential solutions for variance estimation. Following this, several authors, including [1,3,5,13,19,26,29,32], extended ratio and product estimators to multi-stage sampling frameworks, demonstrating their effectiveness in reducing bias and variance.

The issue of non-response has been a persistent challenge in survey sampling, leading to biased estimates if not properly addressed. [16] provided a comprehensive treatment of methods to handle non-response, including imputation techniques and weighting adjustments. In the context of CDF estimation, [23] proposed methods for handling non-response, but these approaches often involve complex adjustments that may not be feasible in all survey contexts.

Recent advancements have seen the development of exponential ratio and product estimators, which have been shown to offer improved performance in certain situations. These estimators, discussed by [29], are particularly useful when the relationship between the study variable and auxiliary information is non-linear. However, despite these innovations, the need for more precise estimators, particularly in complex sampling designs with non-response, remains.

To address this gap, this study develops novel estimation strategies for CDF estimation under non-response in 2SCS/S2SCS, 3SCS/S3SCS schemes. By incorporating auxiliary information, we propose two new families of CDF estimators: classical ratio/product-type and exponential ratio/product-type estimators. Additionally, we introduce a difference estimator specifically designed to enhance estimation accuracy. We derive mathematical expressions for the biases and mean squared errors of these estimators using first-order approximation and validate their performance with real-world datasets. Our findings indicate that the difference CDF estimators provide greater precision compared to the other proposed estimators, making them a more reliable approach for handling non-response in complex survey designs.

The paper is structured as follows: Section 2 offers a concise overview of CDF estimation using the SRS scheme under non-response. Section 3 extends the discussion to CDF estimation under 2SCS/S2SCS and 3SCS/S3SCS schemes, also considering non-response. In Section 4 we derive mathematical expressions for the covariances associated with the CDF estimators within the 2SCS/S2SCS and 3SCS/S3SCS frameworks. Section 5 introduces two families of estimators (ratio/product and exponential ratio/product) designed to estimate the population CDF under non-response, together with a difference estimator. An empirical analysis is presented in Section 6. Finally, Section 7 provides a summary of the key findings and concludes the paper.

2 Estimation of finite population CDF under non-response

Suppose the target population of M units is divided into two homogeneous groups, referred to as strata: (i) the response group; and (ii) the non-response group. Let $M_1$ denote the number of population units in the response group and $M_2$ the number of units in the non-response group, so that $M = M_1 + M_2$. With this information, we can express the finite population CDF, G(y), in the following way:

$$G(y) = W_1\,G_1(y) + W_2\,G_2(y), \tag{1}$$

where $W_1 = M_1/M$ and $W_2 = M_2/M$ are the weights, and $G_1(y)$ and $G_2(y)$ are the CDFs computed from the response group and the non-response group, respectively. Here, the indices 1 and 2 correspond to the response and non-response groups, respectively.

The CDF G(y) of a random variable Y gives the probability that Y takes a value less than or equal to a specific value y. This value of y could be the mean, the median, a quantile or any other threshold value of Y. For a finite population of M units,

$$G(y) = \frac{1}{M}\sum_{i=1}^{M} I(Y_i \le y). \tag{2}$$

The function $I(Y_i \le y)$ is called the indicator function because it indicates whether Y_i is less than or equal to the given value y: it equals 1 if so and 0 otherwise. Indicator functions for the other variables are defined in the same way.
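As a concrete illustration of the indicator-based definition, the short sketch below computes a finite population CDF at a threshold. The function names and the five BMI values are hypothetical and are not taken from the datasets used in this paper.

```python
import numpy as np

def indicator(values, threshold):
    """Return 1 where a value is <= threshold and 0 otherwise."""
    return (np.asarray(values, dtype=float) <= threshold).astype(int)

def finite_population_cdf(values, threshold):
    """Proportion of population units whose value is <= threshold."""
    return float(indicator(values, threshold).mean())

# Hypothetical population of five BMI values (illustration only).
bmi = np.array([17.2, 18.5, 21.0, 24.3, 30.1])
share_underweight = finite_population_cdf(bmi, 18.5)  # two of five units qualify
```

Evaluating the CDF at a single threshold therefore reduces to estimating a population proportion, which is why the proportion of underweight individuals can serve as the target parameter in the empirical study.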

Our goal is to estimate G(y) for a finite population under non-response. To achieve this, we select a sample of m units from a population of M units. Out of these m units, $m_1$ units respond to the survey while $m_2$ units do not, so that $m = m_1 + m_2$. The non-responding units are contacted once again by personal effort, such as telephone, email, or any other method. A sub-sample of size $r = m_2/c$, where c is a number equal to or greater than 1, is then drawn from the $m_2$ non-responding units, and it is assumed that all r units respond to the survey. SRS without replacement is used in all stages of sampling. An unbiased estimator of G(y) under non-response can be obtained from the CDF estimators $\hat{G}_1(y)$ and $\hat{G}_{2r}(y)$, which are based on $m_1$ and r units, respectively. Following [7], this estimator is

$$\hat{G}^{*}(y) = w_1\,\hat{G}_1(y) + w_2\,\hat{G}_{2r}(y), \tag{3}$$

where $w_1 = m_1/m$ and $w_2 = m_2/m$ are the sample weights of the response and non-response groups.

It is straightforward to show that the estimator in Eq (3) is unbiased for G(y). Its variance can be stated as follows:

(4)

where, , , and . For more details, see [33,9] and references cited therein.
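The sub-sampling procedure of this section can be sketched in code. The block below is a minimal illustration, assuming SRS without replacement, a sub-sample of size r equal to the number of non-respondents divided by c (rounded to an integer), and full response in the follow-up. The function name `hansen_hurwitz_cdf`, the synthetic population and the response mechanism are all hypothetical.

```python
import numpy as np

def hansen_hurwitz_cdf(y_sample, responded, threshold, c, rng):
    """Sub-sampling estimate of G(threshold) from one SRSWOR sample.

    y_sample  : study values of the m sampled units
    responded : boolean array, True for first-contact respondents
    c         : inverse sub-sampling rate (c >= 1); r = m2 / c units are followed up
    """
    y_sample = np.asarray(y_sample, dtype=float)
    responded = np.asarray(responded, dtype=bool)
    m = y_sample.size
    m1 = int(responded.sum())
    m2 = m - m1
    ind = (y_sample <= threshold).astype(float)

    g1 = ind[responded].mean() if m1 > 0 else 0.0
    if m2 == 0:
        return float(g1)
    r = max(1, int(round(m2 / c)))  # size of the follow-up sub-sample
    followed_up = rng.choice(np.flatnonzero(~responded), size=r, replace=False)
    g2r = ind[followed_up].mean()
    return float((m1 / m) * g1 + (m2 / m) * g2r)

# Hypothetical population in which low values respond less often.
rng = np.random.default_rng(0)
M, m, c, y0 = 2000, 200, 2.0, 18.5
y_pop = rng.normal(22.0, 4.0, M)
resp_pop = rng.random(M) < np.where(y_pop < 20.0, 0.6, 0.85)
true_g = float(np.mean(y_pop <= y0))

estimates = []
for _ in range(2000):
    idx = rng.choice(M, size=m, replace=False)
    estimates.append(hansen_hurwitz_cdf(y_pop[idx], resp_pop[idx], y0, c, rng))
mean_estimate = float(np.mean(estimates))
```

Averaged over repeated samples, the estimate should sit close to the true population value even though respondents and non-respondents differ, which is the practical meaning of unbiasedness here. With c = 1 every non-respondent is followed up and the estimator reduces to the ordinary sample CDF.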

3 Estimation of CDF under non-response in complex survey sampling

In this section, we develop unbiased estimators of G(y) for 2SCS/S2SCS and 3SCS/S3SCS under non-response. These estimators are used in the subsequent sections.

3.1 The 2SCS with non-response

The 2SCS method selects a sample through two stages. The population comprises N₁ PSUs, each with N_2,i SSUs. Let Y_i,j denote the characteristic value for the jth SSU in the ith PSU, where N_2,i is the total number of SSUs. Additionally, and represent the units in the response and non-response groups, such that . Based on this, G(y) under 2SCS is expressed as follows:

(5)

where,

Here, , and and correspond to the average cluster size and CDF, for the response group and non-response groups in the ith PSU, respectively.

To estimate the finite population CDF under 2SCS while accounting for non-response, we select a sample of n₁ PSUs from the N₁ PSUs by SRS and then draw a sub-sample of n_2,i SSUs from the ith selected PSU. Of these n_2,i SSUs, some units respond and the remaining units do not. Following the sub-sampling technique of [7], we draw a sub-sample of the non-respondents in the ith selected PSU by SRS, as in Section 2, and then interview all units in that sub-sample. Samples are drawn by SRS without replacement at every stage of the 2SCS scheme. For 2SCS with non-response, we propose the following estimator of G(y):

(6)

where,

(7)

Here, and are the weights and and are the CDF based on and units, respectively.

In the following lemmas, we study the mathematical properties of the estimator in Eq (6).

Lemma 1: Under 2SCS, the estimator in Eq (6) is an unbiased estimator of G(y).

Proof: By assigning indices “1”, and “2” to represent the first, and second stages of sampling, respectively, we can express it as follows:

(8)

Note that may be observed as a simple random sample of size n₁ from in the first steps of sampling. For more details, see [10] and the references cited therein.

Lemma 2: The variance of the estimator in Eq (6) is given by

where

(9)

Proof: We know that

(10)

From Eq (8), . Thus by using this result:

(11)

(12)

For more details, the readers may see [33,9] and the references cited therein.
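To make the two-stage procedure concrete, the sketch below implements a standard unbiased two-stage form in which each selected PSU's CDF is weighted by its size relative to the average PSU size. This weighting is an assumption about the general shape of the estimator rather than a transcription of Eq (6). For brevity it also assumes complete response within PSUs; the non-response version would replace the within-PSU sample CDF with the sub-sampling estimator of Section 2. All names and the synthetic clusters are hypothetical.

```python
import numpy as np

def two_stage_cdf(clusters, n1, n2, threshold, rng):
    """Two-stage cluster estimate of G(threshold).

    clusters : list of 1-D arrays, one array of study values per PSU
    n1       : number of PSUs drawn by SRSWOR
    n2       : number of SSUs drawn by SRSWOR within each selected PSU
    """
    sizes = np.array([len(c) for c in clusters], dtype=float)
    mean_size = sizes.mean()                 # average PSU size in the population
    chosen = rng.choice(len(clusters), size=n1, replace=False)
    total = 0.0
    for i in chosen:
        take = min(n2, len(clusters[i]))
        ssu = rng.choice(clusters[i], size=take, replace=False)
        g_i = np.mean(ssu <= threshold)      # within-PSU sample CDF
        total += (sizes[i] / mean_size) * g_i
    return float(total / n1)

# Hypothetical population: 20 PSUs of unequal size.
rng = np.random.default_rng(42)
psu_sizes = rng.integers(30, 81, size=20)
clusters = [rng.normal(20.0 + 0.05 * s, 4.0, int(s)) for s in psu_sizes]
true_g = float(np.mean(np.concatenate(clusters) <= 18.5))

draws = [two_stage_cdf(clusters, 5, 10, 18.5, rng) for _ in range(3000)]
mean_estimate = float(np.mean(draws))
```

The size weighting matters because the PSUs are unequal: an unweighted average of within-PSU CDFs would estimate the mean of the PSU-level CDFs, not the CDF of the pooled population.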

3.2 The S2SCS with non-response

Consider a finite population that is divided into L strata, with each stratum comprising N_1,h PSUs, where . Each PSU contains N_2,i,h SSUs for and . Let Y_i,j,h represent the characteristic value for the jth SSU within the ith PSU of the hth stratum, where . Furthermore, let and denote the counts of units in the response and non-response groups, respectively, such that . Utilizing this information under the S2SCS framework, the CDF for finite population can be expressed as:

(13)

where

are the hth stratum’s average cluster sizes and CDF, respectively.

To estimate G(y) with S2SCS under non-response, n_1,h PSUs are chosen from the hth stratum. The sample sizes n_1,h are assigned according to an allocation rule, such as equal, proportional or Neyman allocation. Next, a sub-sample of n_2,i,h SSUs is selected from the ith selected PSU of the hth stratum. At both stages of S2SCS, samples are drawn by SRS without replacement. Of the n_2,i,h SSUs, some units respond and the remaining units do not. Adopting the technique of [7] for sub-sampling the non-respondents, the unbiased estimator of G(y) under S2SCS can be written as:

(14)

where

In the following lemma, we examine the mathematical properties of the estimator in Eq (14) under S2SCS with non-response.

Lemma 3: The estimator in Eq (14) is an unbiased estimator of G(y), and its variance is given by

(15)

The proof is similar to the proofs of Lemmas 1 and 2. The stratum-level expressions have the same form as their 2SCS counterparts, except that they are computed within the hth stratum.
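The allocation step and the stratified combination can be sketched as follows. This is a minimal illustration under two assumptions: the stratum spreads needed for Neyman allocation are supplied by the user, and stratum-level CDF estimates are combined with weights proportional to stratum sizes, which is the usual stratified form. The function names and numbers are hypothetical.

```python
import numpy as np

def allocate_psus(total_n1, psu_counts, spreads=None):
    """Split a total of sampled PSUs across strata.

    psu_counts : number of PSUs in each stratum
    spreads    : optional per-stratum standard deviations; if given, Neyman
                 allocation is used, otherwise proportional allocation
    """
    psu_counts = np.asarray(psu_counts, dtype=float)
    if spreads is None:
        weights = psu_counts
    else:
        weights = psu_counts * np.asarray(spreads, dtype=float)
    shares = total_n1 * weights / weights.sum()
    # Simple rounding with at least one PSU per stratum; totals may drift by one.
    return np.maximum(1, np.rint(shares)).astype(int)

def combine_strata(stratum_cdfs, stratum_unit_counts):
    """Weight stratum-level CDF estimates by relative stratum size."""
    counts = np.asarray(stratum_unit_counts, dtype=float)
    return float(np.dot(counts / counts.sum(), stratum_cdfs))

proportional = allocate_psus(10, [60, 40])
neyman = allocate_psus(10, [60, 40], spreads=[1.0, 3.0])
overall = combine_strata([0.2, 0.4], [300, 100])
```

Neyman allocation shifts sample towards the more variable stratum, which is why the second stratum receives more PSUs in the example despite being smaller.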

3.3 3SCS with non-response

The 3SCS extends the 2SCS method by adding a third stage. In 3SCS, the population U consists of N₁ PSUs, each containing N_2,i SSUs, which in turn contain N_3,ij TSUs. Let Y_ijk denote the characteristic value of the kth TSU within the jth SSU of the ith PSU, where i = 1, 2, …, N₁, j = 1, 2, …, N_2,i and k = 1, 2, …, N_3,ij. Furthermore, the N_3,ij TSUs of each SSU are split into a response group and a non-response group. Under 3SCS with non-response, the finite population CDF may be written as

(16)

where

(17)

In this context, and represent the weights, while signifies the average size of the cluster. Additionally, let and denote the CDF derived from the jth SSU of the ith PSU for the response group and the non-response group, respectively.

To estimate G(y) under 3SCS with non-response, an unbiased estimator of the population CDF can be derived as follows. First, a sample of n₁ PSUs is chosen. Next, from each selected PSU, a sample of n_2,i SSUs is drawn. Finally, within each selected SSU, a sample of n_3,ij TSUs is selected. The n_3,ij sampled TSUs split into respondents and non-respondents, and a sub-sample is drawn from the non-respondents. Under 3SCS with non-response at the third stage, an unbiased estimator of G(y) is given by:

(18)

where

(19)

and , , have their usual meanings. and are the CDF computed from response and non-response group, respectively. It should be noted that under 3SCS with non-response, SRS without replacement is used in all stages of sampling. For more detail, see [33,1] and references cited therein.

In the following lemmas, we examine the mathematical properties of the estimator in Eq (18).

Lemma 4: Under 3SCS, the estimator in Eq (18) is an unbiased estimator of G(y).

Proof. By assigning indices “1”, “2”, and “3” to represent the first, second, and third stages of sampling, respectively, we can express it as follows:

(20)

Lemma 5: The variance of the estimator in Eq (18) is given by

(21)

where

(22)

Proof:

(23)

From Eq (20), we can write

(24)

and

(25)

Also from Eq (20), we have

(26)

Now take expectations of Eq (20) to get

(27)

Now Eq (21) follows by replacing Eq (24), Eq (25) and Eq (27) in Eq (23), which completes the proof.

3.4 S3SCS with non-response

S3SCS begins by dividing the entire population U into L strata on the basis of certain characteristics, such as region or income level. The hth stratum (h = 1, 2, …, L) contains N_1,h PSUs, each with N_2,i,h SSUs, each of which has N_3,ij,h TSUs. Let Y_ijk,h be the characteristic under study for the kth TSU in the jth SSU of the ith PSU within the hth stratum, for i = 1, 2, …, N_1,h, j = 1, 2, …, N_2,i,h and k = 1, 2, …, N_3,ij,h. Further, the N_3,ij,h TSUs of each SSU are split into a response group and a non-response group. Under S3SCS with non-response, the population CDF may be written as:

(28)

To estimate the population CDF using S3SCS under non-response, sampling proceeds in several stages within each stratum. First, a sample of n_1,h PSUs is selected from the N_1,h PSUs. From each selected PSU, n_2,i,h SSUs are sampled from the N_2,i,h SSUs. Lastly, n_3,ij,h TSUs are chosen from the N_3,ij,h TSUs of each sampled SSU. Of the n_3,ij,h sampled TSUs, some units respond and the remaining units do not. Adopting the technique of [7] for sub-sampling the non-respondents, the unbiased estimator of the population CDF, G(y), under S3SCS with non-response can be written as:

(29)

where

Here, the symbols have their usual meanings.

In the following lemma, we examine the properties of the CDF estimator in Eq (29) under S3SCS with non-response.

Lemma 6: The estimator in Eq (29) is an unbiased estimator of G(y), and its variance is given by

(30)

Proof. The proof is similar to the proofs of Lemmas 4 and 5. The stratum-level expressions have the same form as their 3SCS counterparts, except that they are evaluated within the hth stratum.

4 Covariance computation and estimation in the presence of non-response

Here, we use the complex survey sampling schemes discussed in Section 3 to derive mathematical expressions for the covariances of the CDF estimators under non-response.

4.1 Two-stage and stratified two-stage cluster sampling

In a finite population U, let Y denote the study variable and X the auxiliary variable. To estimate (G(y),G(x)) under 2SCS and S2SCS, we use the corresponding pairs of CDF estimators based on (Y,X). The CDF estimators based on X are constructed along the same lines as those based on Y.

Lemma 7: Under the 2SCS scheme, the covariance between the estimators of G(y) and G(x) is given by

(31)

where

(32)

(33)

(34)

Proof. Here, , , and retain their standard interpretations. The demonstration of this Lemma can be found in the S3 Appendix.

Lemma 8: Under the S2SCS scheme, the covariance between the estimators of G(y) and G(x) is given by

(35)

Proof. The proof is similar to that of Lemma 7. The stratum-level covariance has the same form as its 2SCS counterpart, except that it is evaluated within the hth stratum.

4.2 Three-stage and stratified three-stage cluster sampling

Under 3SCS and S3SCS with non-response, we again consider the pairs of CDF estimators of (G(y),G(x)) that are based on (Y,X).

Lemma 9: Under the 3SCS scheme, the covariance between the estimators of G(y) and G(x) is given by

(36)

where

(37)

(38)

(39)

(40)

and .

Proof. Here, , , and retain their standard interpretations. The demonstration of this Lemma can be accessed in the S3 Appendix.

Lemma 10: Under the S3SCS scheme, the covariance between the estimators of G(y) and G(x) is given by

(41)

Proof. The proof is similar to that of Lemma 9. The stratum-level covariance has the same form as its 3SCS counterpart, except that it is computed within the hth stratum.

5 CDF estimation under non-response using auxiliary information

In this section, two families of estimators, namely ratio/product and exponential ratio/product, are developed. They utilize auxiliary information to estimate the CDF of the population, G(y), under complex survey sampling schemes with non-response. To obtain the biases and MSEs of the suggested families of estimators of G(y), we define the following relative error terms. The relative error associated with the estimation of G(y) is defined as:

and the relative error associated with the estimation of G(x) is defined as:

(42)

such that . Furthermore, we define the variance and covariance parameters as follows:

(43)

which gives

where the first parameter is the relative variance of the estimator of G(y), which quantifies the dispersion of the estimated CDF around its true value, normalized by the square of G(y); the second is the relative variance of the estimator of G(x), normalized by the square of G(x); and the third is the relative covariance between the estimators of G(y) and G(x), normalized by G(y)G(x), which indicates the degree of association between the two estimators. Throughout, S denotes the sampling scheme from which a CDF estimator is derived and corresponds to 2S, 3S, S2S, or S3S.

5.1 First proposed family of CDF estimators

We suggest a class of ratio and product-type estimators, analogous to those introduced by [1] and [14], for estimating the population CDF under non-response.

(44)

where a (different from 0) and b are either real numbers or functions of known parameters of the auxiliary variable X, such as its coefficient of variation, correlation coefficient, coefficient of skewness and coefficient of kurtosis. The two remaining scalars are chosen to minimize the MSE of the estimator. Different members of the family can be obtained by selecting appropriate values of a, b and these scalars. Table 1 displays some members of the family for various choices of these values.

Table 1. Several members of the suggested families of CDF estimators under non-response

https://doi.org/10.1371/journal.pone.0322660.t001

The bias and MSE of the first family can be approximated by expressing the two CDF estimators in terms of the relative errors defined above. The right-hand side (RHS) of Eq (44) can then be written in terms of these error terms:

(45)

where + . By expanding the RHS of Eq (45) and retaining terms up to the second power of the error terms, we have

(46)

To obtain the bias of the first family, subtract G(y) from both sides of Eq (46), take expectations and retain terms up to the first order of approximation:

(47)

Retaining terms up to the first order of approximation, we can write Eq (46) as follows:

(48)

To obtain the MSE of the first family, square both sides of Eq (48), take expectations and retain terms up to the first order:

(49)

The minimal MSE of is given by

(50)

(51)

where, is the correlation coefficient between and . The minimum MSE of can be obtained by utilizing the optimum value of , say .
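For intuition, the sketch below codes the simplest ratio-type and product-type forms together with the textbook first-order MSE of a ratio estimator. These are assumed to correspond to basic members of the family; the general form with the constants a and b and the scalars of Eq (44) is not reproduced. The function names and the numerical inputs are hypothetical.

```python
def ratio_cdf(g_y_hat, g_x_hat, g_x):
    """Classical ratio-type adjustment: scale by G(x) / estimated G(x)."""
    return g_y_hat * g_x / g_x_hat

def product_cdf(g_y_hat, g_x_hat, g_x):
    """Classical product-type adjustment: scale by estimated G(x) / G(x)."""
    return g_y_hat * g_x_hat / g_x

def ratio_mse_first_order(g_y, g_x, var_y, var_x, cov_yx):
    """First-order MSE of the classical ratio estimator.

    var_y, var_x : variances of the two CDF estimators under the chosen scheme
    cov_yx       : their covariance under the same scheme
    """
    ratio = g_y / g_x
    return var_y + ratio**2 * var_x - 2.0 * ratio * cov_yx

# Hypothetical inputs (not taken from Tables 2 or 3).
adjusted = ratio_cdf(0.18, 0.36, 0.40)
mse_ratio = ratio_mse_first_order(0.2, 0.4, 0.004, 0.006, 0.003)
```

The ratio form helps when the two CDF estimators are positively correlated, since an underestimate of G(x) then signals a likely underestimate of G(y) and the adjustment pulls the estimate upward. The product form plays the same role under negative correlation.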

5.2 Second proposed family of CDF estimators

We propose another class of exponential ratio/product estimators similar to those proposed by [1] and [28] to estimate the population CDF G(y) under non-response given as:

(52)

where a, b and the remaining constants have their usual meanings. Some members of this family, for various values of these constants, are listed in Table 1. The bias and MSE of the second family can be obtained by expressing Eq (52) in terms of the relative error terms:

(53)

where . By expanding the RHS of Eq (53) and retaining terms up to the second power of the error terms, we have

(54)

To obtain the bias of the second family, subtract G(y) from both sides of Eq (54), take expectations and retain terms up to the first order of approximation:

(55)

Retaining terms up to the first order of approximation, we can write Eq (54) as follows:

(56)

Under the first order of approximation, we obtain the MSE of the second family by squaring both sides of Eq (56) and taking expectations:

(57)

The minimal MSE of the second family, obtained at the optimum value of its scalar, is given by

(58)

which is equivalent to that of .

A variety of estimators can be derived from the two suggested families, given in Eqs (44) and (52), by selecting appropriate values for a, b and the remaining constants. By substituting specific values of these constants into Eqs (47), (49), (55), and (57), we can derive the first-order approximations of the bias and MSE/variance of the respective members of the two families.
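A simple exponential ratio-type member can be sketched in the same spirit. The form below is the standard exponential ratio estimator from the survey sampling literature and is assumed, not confirmed, to match a basic member of the second family. Its first-order MSE is coded alongside for comparison. Function names and inputs are hypothetical.

```python
import math

def exp_ratio_cdf(g_y_hat, g_x_hat, g_x):
    """Exponential ratio-type adjustment of the estimated CDF of Y."""
    return g_y_hat * math.exp((g_x - g_x_hat) / (g_x + g_x_hat))

def exp_ratio_mse_first_order(g_y, g_x, var_y, var_x, cov_yx):
    """First-order MSE of the standard exponential ratio estimator."""
    ratio = g_y / g_x
    return var_y + 0.25 * ratio**2 * var_x - ratio * cov_yx

# Hypothetical inputs, identical to those used for the ratio-type sketch.
mse_exp = exp_ratio_mse_first_order(0.2, 0.4, 0.004, 0.006, 0.003)
lower_bound = 0.004 - 0.003**2 / 0.006   # var_y * (1 - rho^2)
```

The exponential form damps the adjustment relative to the plain ratio, so it tends to do better when the correlation between the two CDF estimators is moderate. No fixed member can fall below the common minimal MSE, which is what the optimized scalar attains.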

5.3 Difference CDF estimator

Additionally, supplementary information about the variances and covariances of the estimators can be used to improve the precision of CDF estimators. Under an S sampling scheme, consider the ratio of the covariance between the estimators of G(y) and G(x) to the variance of the estimator of G(x), i.e.

(59)

For situations involving non-response, the difference estimator for the population CDF, say , which relies on , , G(x), and , is expressed as:

(60)

where the estimator is expressed as a linear combination of the two CDF estimators, and G(x) denotes the population CDF of X under the S sampling scheme. It can be shown that this difference estimator is unbiased for G(y), and its variance can be obtained by expressing Eq (60) in terms of the relative error terms:

(61)

To calculate the variance of the difference estimator, square both sides of Eq (61) and take expectations.

(62)

Substituting the ratio defined in Eq (59) gives the simplified variance of the difference estimator:

(63)

which is identical to the minimal MSE of the two proposed families.
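The difference estimator and its variance can be sketched as follows, assuming the usual regression-type form in which the estimated CDF of Y is corrected by a multiple of the gap between the known and estimated CDF of X. The coefficient is the covariance-to-variance ratio described above. Function names and numerical inputs are hypothetical.

```python
def optimal_coefficient(cov_yx, var_x):
    """Ratio of the covariance of the two estimators to the variance of the X estimator."""
    return cov_yx / var_x

def difference_cdf(g_y_hat, g_x_hat, g_x, beta):
    """Difference estimator: correct the Y estimate using the known G(x)."""
    return g_y_hat + beta * (g_x - g_x_hat)

def difference_variance(var_y, var_x, cov_yx):
    """Variance at the optimal coefficient, equal to var_y * (1 - rho^2)."""
    return var_y - cov_yx**2 / var_x

# Hypothetical variance and covariance inputs.
beta = optimal_coefficient(0.003, 0.006)
estimate = difference_cdf(0.18, 0.36, 0.40, beta)
variance = difference_variance(0.004, 0.006, 0.003)
```

Because the correction term has zero expectation when the coefficient is a fixed constant, the estimator stays unbiased, and the variance reduction depends only on the squared correlation between the two CDF estimators.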

6 Empirical study

In this section, we use a simulation study and real datasets to calculate the relative efficiencies (REs) of the proposed CDF estimators of G(y) under non-response, relative to the S sampling scheme-based estimator of Section 3.

6.1 Population I

A dataset sourced from the Centers for Disease Control (CDC) is associated with the Second National Health and Nutrition Examination Survey (NHANES-II). It comprises 10,351 units representing the entire non-institutionalized civilian (NIC) population of the United States (US), including all 50 states and the District of Columbia. The data are separated into four geographic regions (REGs): midwestern, southern, northeastern, and western, each further subdivided into specific locations (LOCs). Random numbers are generated from a Bernoulli distribution with success probability 0.50 to split the dataset into two strata, with 0 indicating Stratum-I and 1 indicating Stratum-II. The dataset can be found at: https://www.stata-press.com/data/r15/svy.html. Our goal is to determine the percentage of underweight individuals in the NIC US population using body mass index (BMI) as the study variable, denoted by Y, and weight as the auxiliary variable, denoted by X. An individual is classified as underweight if their BMI is less than or equal to y = 18.50, and the average weight of the NIC US population is x = 71.8975. For the 2SCS/S2SCS design, REG and BMI are treated as PSU and SSU, respectively. In the 3SCS/S3SCS design, REG, LOC, and BMI are treated as PSU, SSU, and TSU, respectively. Below are the constants for the population.

To illustrate the above, the required values based on an S sampling scheme are computed and given below.

Table 2. Values based on an S sampling scheme under non-response, utilizing Population I

https://doi.org/10.1371/journal.pone.0322660.t002

Note: Stratifying is abbreviated as .

6.2 Population II

These data are taken from the study conducted by [27] in Multan district, southern Punjab, Pakistan, from January 2020 to March 2020. The dataset includes 1040 units, categorized into two strata: Stratum-I (males) and Stratum-II (females). In this study, BMI and weight are used as the study and auxiliary variables, respectively. The objective is to estimate the proportion of underweight children by utilizing the auxiliary variable, with x = 35.76 representing the average weight, under the S sampling scheme. For the 2SCS/S2SCS design, socioeconomic status (SES) and BMI are treated as PSU and SSU, respectively. In the 3SCS/S3SCS design, SES, Age, and BMI are treated as PSU, SSU, and TSU, respectively. The population constants are as follows:

To illustrate the above, the required values based on an S sampling scheme are computed below.

Table 3. Values based on an S sampling scheme under non-response, utilizing Population II

https://doi.org/10.1371/journal.pone.0322660.t003

We present the following expressions to compute REs of the proposed CDF estimators of finite population distribution function under S sampling schemes with non-response relative to .

where . Tables 4 and 5 report the REs of these CDF estimators.

Table 4. Evaluating REs of the proposed CDF estimators of G(y) under non-response relative to the S sampling scheme-based estimator, utilizing Population I

https://doi.org/10.1371/journal.pone.0322660.t004

Table 5. Evaluating REs of the proposed CDF estimators of G(y) under non-response relative to the S sampling scheme-based estimator, utilizing Population II

https://doi.org/10.1371/journal.pone.0322660.t005

Based on the S sampling scheme, the REs of the proposed CDF estimators were computed for different values of n₁, n_2,i, and n_3,ij using the aforementioned datasets. The proposed CDF estimators that utilize auxiliary information appear to be marginally more efficient than those that do not, as indicated by RE values greater than one. Furthermore, CDF estimators for S sampling schemes with stratification are slightly more efficient than those without stratification. Efficiency tends to increase with the number of sampling stages, whereas the effect of sample size is less clear-cut: RE typically increases as the sample size at the various sampling stages increases, and vice versa.

6.3 Simulation study

To further validate our findings, we estimated the MSE by simulation, averaging the squared difference between the estimator and the true parameter over repeated samples. The proposed families of CDF estimators were simulated under different non-response rates and sample sizes, and their corresponding MSEs were calculated. The MSE/variance of the CDF estimators under non-response is computed by drawing 10,000 samples from Population I under a given sampling scheme. The simulated MSEs of the estimators for both families under an S sampling scheme are computed as:

(64)

(65)

(66)

where . The relative precision (RP) of and with respect to under an S-sampling scheme is given by:

(67)
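The simulation logic described above can be sketched in code. The following is a minimal illustration under stated assumptions, not the authors' implementation: the population is synthetic rather than Population I, the design is a balanced two-stage cluster sample, non-response is handled with a Hansen-Hurwitz-type follow-up subsample, the ratio-type estimator is the plain classical form, and RP is taken to be the ratio of the two simulated MSEs. All function names, population sizes, and the follow-up fraction `k` are hypothetical.

```python
import numpy as np

def simulated_mse(estimates, truth):
    """Monte Carlo MSE: mean squared difference from the true value."""
    estimates = np.asarray(estimates, dtype=float)
    return float(np.mean((estimates - truth) ** 2))

def relative_precision(mse_baseline, mse_proposed):
    """RP of a proposed estimator w.r.t. a baseline (assumed: ratio of MSEs)."""
    return mse_baseline / mse_proposed

def hh_means(ind, responds, rng, k=2):
    """Hansen-Hurwitz-type means of indicator columns within one cluster.

    ind      : (m, p) array of 0/1 indicators for the m sampled units
    responds : (m,) boolean array, True for first-attempt respondents
    k        : inverse follow-up fraction for non-respondents (hypothetical)
    """
    resp, nonresp = ind[responds], ind[~responds]
    m2 = nonresp.shape[0]
    if m2 == 0:
        return resp.mean(axis=0)
    rows = rng.choice(m2, size=max(1, m2 // k), replace=False)
    return (resp.sum(axis=0) + m2 * nonresp[rows].mean(axis=0)) / ind.shape[0]

def run_simulation(n_rep=1000, seed=1):
    rng = np.random.default_rng(seed)
    N, M, n1, n2 = 40, 50, 10, 20            # hypothetical sizes
    # Synthetic study variable y and auxiliary variable x with a cluster effect.
    x = rng.gamma(2.0, 2.0, size=(N, M)) + rng.normal(0.0, 1.0, size=(N, 1))
    y = 1.5 * x + rng.normal(0.0, 2.0, size=(N, M))
    responds = rng.random((N, M)) < 0.75     # 75% response rate
    # Indicators I(y <= t_y) and I(x <= t_x), evaluated at the medians.
    ind = np.stack([y <= np.median(y), x <= np.median(x)], axis=-1).astype(float)
    G_true, F_true = ind[..., 0].mean(), ind[..., 1].mean()

    plain, ratio = np.empty(n_rep), np.empty(n_rep)
    for b in range(n_rep):
        cluster_means = []
        for i in rng.choice(N, size=n1, replace=False):      # stage 1: PSUs
            units = rng.choice(M, size=n2, replace=False)    # stage 2: SSUs
            cluster_means.append(hh_means(ind[i, units], responds[i, units], rng))
        G_hat, F_hat = np.mean(cluster_means, axis=0)
        plain[b] = G_hat                      # no auxiliary information
        ratio[b] = G_hat * F_true / F_hat     # classical ratio-type adjustment
    mse_plain = simulated_mse(plain, G_true)
    mse_ratio = simulated_mse(ratio, G_true)
    return mse_plain, mse_ratio, relative_precision(mse_plain, mse_ratio)

if __name__ == "__main__":
    mse_plain, mse_ratio, rp = run_simulation()
    print(f"MSE plain={mse_plain:.5f}, MSE ratio={mse_ratio:.5f}, RP={rp:.3f}")
```

The same follow-up subsample is used for both indicator columns so that the study and auxiliary estimates stay paired within each replicate. Raising `n_rep` to 10,000 would match the number of samples stated in the text, at a proportionally higher run time.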

We then calculated the RP between the estimators and found that our proposed classical ratio/product-type, exponential ratio/product-type, and difference CDF estimators performed well compared to existing methods. Additionally, we extended our simulation analysis by computing the MSE for different sample sizes of PSUs, SSUs, and TSUs under 2SCS/S2SCS, and 3SCS/S3SCS schemes. To assess the robustness of our estimators under varying levels of non-response, we evaluated their efficiency at response rates of 75% and 80%. For brevity of discussion, we considered c = 1 in our calculations.

While estimating the MSE through simulation, we adopted a structured sampling design for selecting PSUs, SSUs, and TSUs under the different sampling schemes; the design is shown in Table 6. This framework was applied across 2SCS, 3SCS, and their stratified counterparts, ensuring a systematic and realistic representation of hierarchical data selection. This structured approach allowed us to analyze the performance of the proposed estimators under varying levels of non-response effectively.

Table 6. The structured sampling design for the estimation of MSE through simulation under non-response utilizing Population I

https://doi.org/10.1371/journal.pone.0322660.t006
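The hierarchical selection of PSUs, SSUs, and TSUs can be illustrated with a short sketch. It assumes a balanced population (every PSU holds the same number of SSUs, and every SSU the same number of TSUs) and simple random sampling without replacement at each stage; neither assumption is taken from Table 6. The function names and population dimensions are hypothetical, while the sample sizes n₁ = 2, n_2,i = 12, and n_3,ij = 20 are the ones quoted in the discussion of the 3SCS results.

```python
import numpy as np

def three_stage_sample(shape, n1, n2, n3, rng):
    """Draw (psu, ssu, tsu) index triples by SRSWOR at each of three stages.

    shape : (N1, N2, N3) = PSUs, SSUs per PSU, TSUs per SSU (balanced case)
    """
    N1, N2, N3 = shape
    triples = []
    for i in rng.choice(N1, size=n1, replace=False):          # stage 1: PSUs
        for j in rng.choice(N2, size=n2, replace=False):      # stage 2: SSUs
            for k in rng.choice(N3, size=n3, replace=False):  # stage 3: TSUs
                triples.append((int(i), int(j), int(k)))
    return np.array(triples)

def stratified_three_stage_sample(strata_shapes, n1, n2, n3, rng):
    """Apply the same three-stage selection independently within each stratum."""
    return {h: three_stage_sample(shape, n1, n2, n3, rng)
            for h, shape in strata_shapes.items()}

rng = np.random.default_rng(0)
idx = three_stage_sample((10, 30, 50), n1=2, n2=12, n3=20, rng=rng)
print(idx.shape)  # (480, 3): 2 * 12 * 20 selected units

by_stratum = stratified_three_stage_sample(
    {"h1": (6, 30, 50), "h2": (4, 30, 50)}, n1=2, n2=12, n3=20, rng=rng
)
```

Dropping the innermost loop gives the two-stage analogue. In an unbalanced population, the stage sizes would have to be looked up per PSU and per SSU instead of being read from a single `shape` tuple.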

Overall, the proposed families of estimators performed well under different sampling conditions. However, the proposed difference estimator demonstrated slightly higher efficiency compared to both the estimators without auxiliary information and the other proposed families of estimators. Although the efficiency gains of the difference estimator were modest, it consistently outperformed other estimators across all sampling schemes. Notably, the maximum gain in precision was found in the ratio/product-type estimators, particularly under the 3SCS scheme, where it reached 15% when considering a sampling structure of n₁ = 2, n_2,i = 12, and n_3,ij = 20.

The consistency between the theoretical and simulation-based MSE results further validates the reliability of the proposed estimators in handling non-response within complex survey sampling frameworks. The detailed results of these simulation studies under various complex survey sampling schemes and non-response rates can be found in the supplementary file (S1 Tables), from Table 1 to Table 16.

7 Conclusion

This paper considered 2SCS and 3SCS schemes with and without stratification under non-response in order to estimate the finite population CDF. Two families of estimators, namely exponential and ratio/product-type, have been proposed. Additionally, the CDF has also been estimated using a difference estimator. Under the first order of approximation, mathematical expressions have been developed for the biases and MSE of the prospective CDF estimators under these sampling schemes. Real datasets were used to substantiate and illustrate the proposed theory. The CDF estimators proposed for non-response in complex survey sampling with auxiliary information appeared to be marginally more efficient than those without ancillary details. Our findings revealed that the difference estimators performed well in terms of RE compared to the other estimators discussed in this study. Moreover, the results of the simulation study further confirmed the efficiency of the proposed families of estimators, demonstrating their superior performance across different sampling scenarios. The simulations provided additional evidence that the proposed families of estimators consistently outperformed other estimators, reinforcing our theoretical findings.

Table 7. List of abbreviations and acronyms

https://doi.org/10.1371/journal.pone.0322660.t007

Supporting information

S1 Tables. Simulation Results of the RP of the Proposed CDF Estimators.

This file contains the RP of the proposed CDF estimators under non-response using auxiliary information within the 2SCS/3SCS and S2SCS/S3SCS schemes.

https://doi.org/10.1371/journal.pone.0322660.s001

(TEX)

S2 Files. Datasets for Population I, and II.

This file contains datasets supporting the findings of this study.

https://doi.org/10.1371/journal.pone.0322660.s002

(XLSX)

S3 Appendix. Proofs for Lemma 7 and 9.

This file contains demonstration of Lemma 7 and Lemma 9.

https://doi.org/10.1371/journal.pone.0322660.s003

(PDF)

References

1. Abbas M, Haq A. Estimation of finite population distribution function with auxiliary information in a complex survey sampling. SORT. 2022;46(1):67–94. http://dx.doi.org/10.2436/20.8080.02.118
2. Aditya K, Sud U, Chandra H, Biswas A. Calibration based regression type estimator of the population total under two stage sampling design. J Indian Soc Agr Stat. 2016;70:19–24.
3. Al-Saleh MF, Samuh MH. On multistage ranked set sampling for distribution and median estimation. Comput Stat Data Anal. 2008;52(4):2066–78.
4. Cochran W. Sampling techniques. 3rd edn. John Wiley & Sons; 1970.
5. Francisco C, Fuller W. Estimation of the distribution function with a complex survey. In: JSM Proceedings of the Survey Research Methods Section. American Statistical Association; 1986, pp. 37–45.
6. Garg N, Srivastava M. A general class of estimators of a finite population mean using multi-auxiliary information under two stage sampling scheme. J Reliab Statist Stud. 2009;2:101–19.
7. Hansen M, Hurwitz W. On the theory of sampling from finite populations. Ann Math Stat. 1943;14:333–62.
8. Hansen M, Hurwitz W, Madow W. Sample survey methods and theory. Vol. I. Methods and applications. New York: John Wiley & Sons; 1953.
9. Haq A. Estimation of the distribution function under hybrid ranked set sampling. J Stat Comput Simul. 2016;87(2):313–27.
10. Haq A, Abbas M, Khan M. Estimation of finite population distribution function in a complex survey sampling. Commun Stat Theory Methods. 2021;52(8):2574–96.
11. Hussain S, Ahmad S, Saleem M, Akhtar S. Finite population distribution function estimation with dual use of auxiliary information under simple and stratified random sampling. PLoS One. 2020;15(9):e0239098. pmid:32986764
12. Iftikhar A, Shi H, Hussain S, Abbas M, Ullah K. Efficient estimators of finite population mean based on extreme values in simple random sampling. Math Probl Eng. 2022;2022(1):5866085.
13. Jabeen R, Sanaullah A, Hanif M. Generalized estimator for estimating population mean under two stage sampling. Pak J Stat. 2014;30(4).
14. Khoshnevisan M, Singh R, Chauhan P, Sawan N. A general family of estimators for estimating population mean using known value of some population parameter(s). Far East J Theor Stat. 2007;22:181–91.
15. Lee SE, Lee PR, Shin K-I. A composite estimator for stratified two stage cluster sampling. CSAM. 2016;23(1):47–55.
16. Little R, Rubin D. Statistical analysis with missing data. Hoboken, NJ: John Wiley & Sons.
17. Mahalanobis P. A sample survey of the acreage under jute in Bengal. Sankhya. 1940;1:511–30.
18. Murthy M. Sampling theory and methods. Calcutta: Statistical Publishing Society; 1967.
19. Nafiu LA, Oshungade IO, Adewara A. Alternative estimation method for a three-stage cluster sampling in finite population. Am J Math Stat. 2012;2(6):199–205.
20. Stokes SL, Sager TW. Characterization of a ranked-set sample with application to estimating distribution functions. J Am Stat Assoc. 1988;83(402):374–81.
21. Preston J. Rescaled bootstrap for stratified multistage sampling. Surv Methodol. 2009;35:227–34.
22. Raj D. Sampling theory. New York: McGraw-Hill Book Company; 1968.
23. Rao J, Rao J. Small area estimation. New York: John Wiley & Sons; 1999.
24. Rustagi R. Some theory of the prediction approach to two stage and stratified two stage cluster sampling. PhD thesis. The Ohio State University.
25. Särndal CE, Swensson B, Wretman J. Model assisted survey sampling. Springer Science & Business Media.
26. Sedransk N, Sedransk J. Distinguishing among distributions using data from complex sample designs. J Am Stat Assoc. 1979;74(368):754–60.
27. Shehzad MA, Khurram H, Iqbal Z, Parveen M, Shabbir MN. Nutritional status and growth centiles using anthropometric measures of school-aged children and adolescents from Multan district. Arch Pediatr. 2022;29(2):133–9. pmid:34955308
28. Singh R, Chauhan P, Sawan N, Smarandache F. Improvement in estimating the population mean using exponential estimator in simple random sampling. Int J Stat Econ. 2009;3:13–8.
29. Singh R, Solanki C. A new approach to handle non-response in sample surveys. J Stat Res. 2012;46:69–82.
30. Singh R, Vishwakarma GK, Gupta P, Pareek S. An alternative approach to estimation of population mean in two-stage sampling. Math Theory Model. 2013;3:48–53.
31. Singh S. Advanced sampling theory with applications: How Michael “selected” Amy, vol. 2. Dordrecht: Springer; 2003.
32. Sukhatme P, Sukhatme B, Sukhatme S, Asok C. Sampling theory of surveys with applications. Ames: Iowa State University Press; 1970.
33. Yaqub M, Shabbir J. Estimation of population distribution function in the presence of non-response. Hacettepe J Math Stat. 2016;46(131):1–1.
