Estimation of the distribution function of a finite population utilizing auxiliary information in the context of non-response within complex survey sampling

  • Mohsin Abbas ,

    Roles Conceptualization, Formal analysis, Methodology, Writing – original draft, Writing – review & editing

    abbasmohsin202@gmail.com

    Affiliation Department of Statistics, Bahauddin Zakariya University, Multan, Pakistan

  • Muhammad Ahmed Shehzad,

    Roles Methodology, Supervision, Validation

    Affiliation Department of Statistics, Bahauddin Zakariya University, Multan, Pakistan

  • Haris Khurram,

    Roles Investigation, Methodology, Software

    Affiliation Department of Science and Humanities, National University of Computer and Emerging Sciences, Chiniot - Faisalabad Campus, Pakistan

  • Mahwish Rabia

    Roles Data curation, Formal analysis, Validation

    Affiliation Department of Statistics, Government College Women University Sialkot, Pakistan

Abstract

This study focuses on estimating a finite population cumulative distribution function (CDF) using two-stage and three-stage cluster sampling under non-response. This work is then extended to estimate the finite population CDF under non-response using stratified two-stage and three-stage cluster sampling. We propose two distinct families of CDF estimators, specifically designed for these complex surveys, namely classical ratio/product-type and exponential ratio/product-type. Furthermore, we introduce a difference estimator for the CDF under non-response, utilizing ancillary information about the variances and covariances of the estimators under these complex schemes. We provide mathematical expressions for the biases and mean squared errors of the proposed CDF estimators, based on first-order approximation. To evaluate the performance of the proposed estimators, we conduct extensive simulations and assess their efficiency. The simulation results demonstrate that the proposed families of estimators perform well under different sampling scenarios. Our findings indicate that the difference CDF estimators are more precise than the other estimators discussed. We support our theoretical claims by analyzing real datasets.

1 Introduction

Non-response is a critical issue in survey sampling that can lead to biased estimates and inefficiencies in statistical inference. In complex survey sampling schemes such as two-stage cluster sampling (2SCS), stratified two-stage cluster sampling (S2SCS), three-stage cluster sampling (3SCS), and stratified three-stage cluster sampling (S3SCS), the challenge of non-response becomes even more pronounced due to the hierarchical structure of the data and the varying probabilities of response at different stages. Despite the extensive literature on CDF estimation, existing methods primarily focus on simple random sampling (SRS) or basic stratified sampling, with limited attention to the impact of non-response in complex sampling frameworks.

A significant gap in the literature is the absence of comprehensive methods for estimating the finite population CDF under non-response within these complex sampling schemes while effectively incorporating auxiliary information. Traditional estimation approaches often neglect the potential of utilizing ancillary information such as variances and covariances of estimators, which could significantly enhance estimation efficiency. Moreover, the use of ratio/product-type and exponential ratio/product-type estimators in handling non-response within complex sampling designs remains under-explored.

It is common for surveys to have incomplete data due to certain reasons. For example, in an opinion survey, the selected family may have moved to another place or the selected person may have passed away. In a mailed questionnaire, many respondents may not reply. This problem of incomplete sample data due to the non-availability of information from the respondent is called the problem of non-response. The estimate attained from such deficient data can be inaccurate, especially when there are differences between the respondents and non-respondents. One way to address the problem of non-response in a mailed questionnaire survey is to divide the population into two groups: respondents and non-respondents. From the non-respondent group, a sub-sample is selected and personally interviewed to obtain the necessary information. This technique is known as sub-sampling and was introduced by [7]. By making this extra effort, an unbiased estimator can be obtained for the non-response cases.

Estimating the CDF for a finite population is a fundamental problem in survey sampling, particularly in scenarios where auxiliary information is available. Accurate estimation of the CDF is crucial in various fields, including public health, economics, and social sciences, where decision-making often relies on understanding the distribution of a population’s characteristics. However, the complexity of survey designs, such as multi-stage cluster sampling with or without stratification, coupled with the challenge of non-response, complicates the estimation process. Non-response, a common issue in surveys, can introduce significant bias, leading to inaccurate estimates if not properly accounted for.

The estimation of the CDF in finite populations has been extensively studied in the context of survey sampling. Early works by [4] and [22] laid the foundation for the use of supplementary information in improving the precision of survey estimates. The ratio and product estimators, introduced by these pioneers, have been particularly popular due to their ability to reduce the variance of estimators by leveraging the correlation between the study variable and auxiliary information. These estimators have since been widely applied in various sampling schemes, including SRS and stratified sampling; see [2,6,11–13] for more details.

In 2SCS, the population is divided into primary sampling units (PSUs), a subset of the PSUs is then selected, and secondary sampling units (SSUs) are drawn from the selected PSUs. 3SCS adds a further level by selecting tertiary sampling units (TSUs) from within the chosen SSUs. For more details, see [4,8,10] and the references cited therein. S2SCS and S3SCS further improve precision by dividing the population into distinct strata based on certain characteristics before applying the sampling stages. This ensures better representation of subgroups, reduces variability, and improves the accuracy of estimates, particularly when the population is heterogeneous. These methods are particularly effective for large, dispersed, and diverse populations. For more details about these sampling schemes, see [1,8,15,17–19,21,22,24,25,30].

With the increasing complexity of survey designs, particularly in multi-stage sampling, researchers have sought to adapt these classical estimators to more complex scenarios. [8] provided early treatments of multi-stage sampling, highlighting the challenges and potential solutions for variance estimation. Following this, several authors, including [1,3,5,13,19,26,29,32], extended ratio and product estimators to multi-stage sampling frameworks, demonstrating their effectiveness in reducing bias and variance.

The issue of non-response has been a persistent challenge in survey sampling, leading to biased estimates if not properly addressed. [16] provided a comprehensive treatment of methods to handle non-response, including imputation techniques and weighting adjustments. In the context of CDF estimation, [23] proposed methods for handling non-response, but these approaches often involve complex adjustments that may not be feasible in all survey contexts.

Recent advancements have seen the development of exponential ratio and product estimators, which have been shown to offer improved performance in certain situations. These estimators, discussed by [29], are particularly useful when the relationship between the study variable and auxiliary information is non-linear. However, despite these innovations, the need for more precise estimators, particularly in complex sampling designs with non-response, remains.

To address this gap, this study develops novel strategies for CDF estimation under non-response in the 2SCS/S2SCS and 3SCS/S3SCS schemes. By incorporating auxiliary information, we propose two new families of CDF estimators: classical ratio/product-type and exponential ratio/product-type estimators. Additionally, we introduce a difference estimator specifically designed to enhance estimation accuracy. We derive mathematical expressions for the biases and mean squared errors of these estimators using first-order approximation and validate their performance with real-world datasets. Our findings indicate that the difference CDF estimators provide greater precision than the other proposed estimators, making them a more reliable approach for handling non-response in complex survey designs.

The paper is structured as follows: Section 2 offers a concise overview of CDF estimation using the SRS scheme under non-response. Section 3 extends the discussion to CDF estimation under 2SCS/S2SCS and 3SCS/S3SCS schemes, also considering non-response. In Section 4 we derive precise mathematical formulations for the covariances associated with the CDF estimators within the 2SCS/S2SCS and 3SCS/S3SCS frameworks. Section 5 introduces two families of estimators (ratio/product and exponential ratio/product), together with a difference estimator, designed to estimate the population CDF under non-response conditions. An empirical analysis is presented in Section 6. Finally, Section 7 provides a summary of the key findings and concludes the paper.

2 Estimation of finite population CDF under non-response

Consider a target population partitioned into two homogeneous groups, referred to as strata: (i) the response group; and (ii) the non-response group. Let $M_1$ denote the number of population units in the response group and $M_2$ the number of units in the non-response group, so that $M = M_1 + M_2$. With this information, we can express the finite population CDF, G(y), in the following way:

$$G(y) = W_1 G_1(y) + W_2 G_2(y), \tag{1}$$

where $W_1 = M_1/M$ and $W_2 = M_2/M$ are the weights, and $G_1(y)$ and $G_2(y)$ are the CDFs computed from the response group and the non-response group, respectively. Here, the indices 1 and 2 correspond to the response and non-response groups, respectively.

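The CDF G(y) of a random variable Y gives the probability that Y takes a value less than or equal to a specific value y. This value of y could be the mean, the median, a quantile, or any other threshold value of Y. For a finite population of M units,

$$G(y) = \frac{1}{M}\sum_{i=1}^{M} I(Y_i \le y), \qquad I(Y_i \le y) = \begin{cases} 1, & Y_i \le y, \\ 0, & \text{otherwise}. \end{cases} \tag{2}$$

The function $I(\cdot)$ is called an indicator function, since it indicates whether $Y_i$ is less than or equal to a given value y or not. The indicator function is defined in the same way for the other variables.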
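As a minimal sketch of the indicator-function definition above, the finite-population CDF at a threshold is simply the share of units whose value does not exceed it. The function name `population_cdf` and the toy values are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def population_cdf(values, y):
    """Finite-population CDF at y: the mean of the indicator I(Y_i <= y)."""
    values = np.asarray(values, dtype=float)
    indicator = values <= y          # True where the unit is at or below the threshold
    return float(np.mean(indicator))

# Toy population of four hypothetical BMI values; two of them are <= 18.5.
toy_bmi = [17.0, 18.5, 22.0, 30.0]
share_underweight = population_cdf(toy_bmi, 18.5)
```

Evaluating the function at a median or a quantile works the same way; only the threshold changes.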

Our goal is to estimate G(y) for a finite population under non-response. To achieve this, we select a sample of m units from a population of M units. Out of these m units, $m_1$ units respond to the survey while $m_2$ units do not. The non-responding units are then contacted again through personal effort, such as telephone, email, or any other method. A sub-sample of size $r = m_2/c$, where $c \ge 1$, is drawn from the $m_2$ non-responding units, and it is assumed that all r units respond to the survey. SRS without replacement is used at all stages of sampling. An unbiased estimator of G(y) under non-response can be obtained from the CDF estimators $\hat{G}_1(y)$ and $\hat{G}_{2r}(y)$, which are based on $m_1$ and r units, respectively. Following [7], it is given by

$$\hat{G}(y) = w_1 \hat{G}_1(y) + w_2 \hat{G}_{2r}(y), \tag{3}$$

where $w_1 = m_1/m$ and $w_2 = m_2/m$ have their usual meanings.

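It is straightforward to show that $\hat{G}(y)$ is an unbiased estimator of G(y). Additionally, the variance of $\hat{G}(y)$ can be stated as follows:

$$\mathrm{Var}\big(\hat{G}(y)\big) = \left(\frac{1}{m} - \frac{1}{M}\right) S_y^2 + \frac{W_2 (c - 1)}{m}\, S_{y(2)}^2, \tag{4}$$

where $W_2 = M_2/M$, $S_y^2$ is the population variance of the indicator $I(Y \le y)$, and $S_{y(2)}^2$ is the corresponding variance within the non-response group. For more details, see [33,9] and references cited therein.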
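The sub-sampling scheme of this section can be sketched in a few lines. The function names, argument names, and toy numbers below are illustrative assumptions; the variance function uses the standard Hansen-Hurwitz form, which is assumed here to correspond to Eq (4).

```python
import numpy as np

def hansen_hurwitz_cdf(y_resp, y_sub, m2, y0):
    """CDF estimate w1*G1_hat + w2*G2r_hat, with w1 = m1/m and w2 = m2/m.

    y_resp : values from the m1 first-phase respondents
    y_sub  : values from the r followed-up non-respondents (r = m2 / c)
    m2     : number of first-phase non-respondents
    y0     : threshold at which the CDF is evaluated
    """
    y_resp = np.asarray(y_resp, dtype=float)
    y_sub = np.asarray(y_sub, dtype=float)
    m1 = y_resp.size
    m = m1 + m2
    if m == 0:
        raise ValueError("the sample is empty")
    if m2 > 0 and y_sub.size == 0:
        raise ValueError("a follow-up sub-sample is needed when m2 > 0")
    g1 = np.mean(y_resp <= y0) if m1 > 0 else 0.0
    g2r = np.mean(y_sub <= y0) if m2 > 0 else 0.0
    return float((m1 * g1 + m2 * g2r) / m)

def hansen_hurwitz_variance(m, M, c, W2, S2_all, S2_nonresp):
    """(1/m - 1/M) * S^2 + W2 * (c - 1) / m * S_(2)^2 (assumed form of Eq (4))."""
    return (1.0 / m - 1.0 / M) * S2_all + W2 * (c - 1.0) / m * S2_nonresp

# Toy data: 4 respondents, 4 non-respondents, follow-up of r = 2 units (c = 2).
estimate = hansen_hurwitz_cdf([17, 20, 25, 19], [18, 30], m2=4, y0=18.5)
```

With c = 1 every non-respondent is followed up and the second variance term vanishes, which is the setting the simulation study later adopts.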

3 Estimation of CDF under non-response in complex survey sampling

In this section, we develop unbiased estimators of G(y) for 2SCS/S2SCS and 3SCS/S3SCS under non-response. These estimators will be utilized in the subsequent sections.

3.1 The 2SCS with non-response

The 2SCS method selects a sample through two stages. The population comprises N1 PSUs, the ith of which contains N2,i SSUs. Let Yi,j denote the characteristic value for the jth SSU in the ith PSU, where i = 1, …, N1 and j = 1, …, N2,i. Additionally, let $N_{2,i(1)}$ and $N_{2,i(2)}$ represent the numbers of units in the response and non-response groups of the ith PSU, such that $N_{2,i} = N_{2,i(1)} + N_{2,i(2)}$. Based on this, G(y) under 2SCS is expressed as follows:

(5)

where,

Here, , and and correspond to the average cluster size and CDF, for the response group and non-response groups in the ith PSU, respectively.

To estimate the finite population CDF under 2SCS while accounting for non-response, we select a sample of n1 PSUs from the N1 PSUs by SRS, and then from the ith selected PSU a sub-sample of n2,i SSUs is chosen. Out of the n2,i SSUs, $n_{2,i(1)}$ units respond and $n_{2,i(2)}$ units do not. Following the sub-sampling technique of [7] for non-respondents, we select a sub-sample of $r_{2,i}$ units from the $n_{2,i(2)}$ non-respondents of the ith selected PSU using SRS, such that $r_{2,i} = n_{2,i(2)}/c$, where $c \ge 1$, and then interview all $r_{2,i}$ units. The samples are drawn using SRS without replacement at all stages of sampling in the 2SCS scheme. For the 2SCS with non-response, we propose an estimator of G(y) given by:

(6)

where,

(7)

Here, and are the weights and and are the CDF based on and units, respectively.
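A minimal sketch of how a two-stage estimate of this kind can be assembled is given below. It assumes, as one common unbiased construction, that each sampled PSU contributes its within-PSU sub-sampling estimate weighted by its size relative to the average cluster size; this weighting, the function names, and the dictionary keys are illustrative assumptions and may differ from the exact form of Eqs (6) and (7).

```python
import numpy as np

def psu_cdf(y_resp, y_sub, n_nonresp, y0):
    """Within-PSU CDF estimate under sub-sampling of non-respondents."""
    y_resp = np.asarray(y_resp, dtype=float)
    y_sub = np.asarray(y_sub, dtype=float)
    n_resp = y_resp.size
    hits_resp = np.sum(y_resp <= y0)
    # The follow-up sub-sample stands in for all n_nonresp non-respondents.
    share_nonresp = np.mean(y_sub <= y0) if n_nonresp > 0 else 0.0
    return float((hits_resp + n_nonresp * share_nonresp) / (n_resp + n_nonresp))

def two_stage_cdf(psus, mean_cluster_size, y0):
    """Average of size-weighted PSU estimates over the n1 sampled PSUs.

    psus : list of dicts with hypothetical keys
           "N2" (PSU size), "y_resp", "y_sub", "n_nonresp"
    mean_cluster_size : population average number of SSUs per PSU
    """
    total = 0.0
    for psu in psus:
        u_i = psu["N2"] / mean_cluster_size   # assumed relative-size weight
        total += u_i * psu_cdf(psu["y_resp"], psu["y_sub"], psu["n_nonresp"], y0)
    return total / len(psus)

sample = [
    {"N2": 10, "y_resp": [17, 20], "y_sub": [18], "n_nonresp": 2},
    {"N2": 30, "y_resp": [25, 30, 16, 19], "y_sub": [], "n_nonresp": 0},
]
estimate = two_stage_cdf(sample, mean_cluster_size=20, y0=18.5)
```

The second PSU has no non-respondents, so its contribution reduces to the ordinary sample proportion.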

In the next lemmas, we study the mathematical properties of $\hat{G}_{2S}(y)$.

Lemma 1: Under the 2SCS, $\hat{G}_{2S}(y)$ is an unbiased estimator of G(y).

Proof: By assigning indices “1”, and “2” to represent the first, and second stages of sampling, respectively, we can express it as follows:

(8)

Note that may be observed as a simple random sample of size n1 from in the first steps of sampling. For more details, see [10] and the references cited therein.

Lemma 2: The variance of $\hat{G}_{2S}(y)$ is given by

where

(9)

Proof: We know that

(10)

From Eq (8), . Thus by using this result:

(11)(12)

For more details, the readers may see [33,9] and the references cited therein.

3.2 The S2SCS with non-response

Consider a finite population that is divided into L strata, with the hth stratum comprising N1,h PSUs, where h = 1, …, L. Each PSU contains N2,i,h SSUs for i = 1, …, N1,h and h = 1, …, L. Let Yi,j,h represent the characteristic value for the jth SSU within the ith PSU of the hth stratum, where j = 1, …, N2,i,h. Furthermore, let $N_{2,i,h(1)}$ and $N_{2,i,h(2)}$ denote the counts of units in the response and non-response groups, respectively, such that $N_{2,i,h} = N_{2,i,h(1)} + N_{2,i,h(2)}$. Utilizing this information under the S2SCS framework, the CDF of the finite population can be expressed as:

(13)

where

are the hth stratum’s average cluster sizes and CDF, respectively.

To estimate G(y) with S2SCS under non-response, n1,h PSUs are chosen from the hth stratum. The sample sizes n1,h are allotted according to an allocation scheme, such as equal, Neyman, or proportional allocation. Further, a sub-sample of n2,i,h SSUs is selected from the ith selected PSU of the hth stratum. At both stages of sampling under S2SCS, the samples are drawn using SRS without replacement. Out of the n2,i,h SSUs, $n_{2,i,h(1)}$ units respond and $n_{2,i,h(2)}$ units do not. Adopting the technique of [7] for sub-sampling non-respondents, the unbiased estimator of G(y) under S2SCS can be written as:

(14)

where

In the following lemma, we examine the mathematical properties of the estimator $\hat{G}_{S2S}(y)$ with non-response under S2SCS.

Lemma 3: $\hat{G}_{S2S}(y)$ is an unbiased estimator of G(y). The variance of $\hat{G}_{S2S}(y)$ is given by

(15)

The proof is similar to those of Lemmas 1 and 2. The mathematical expression for $\hat{G}_{S2S}(y)$ is similar to that for $\hat{G}_{2S}(y)$, with the exception that the former is computed within the hth stratum.
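The allocation schemes mentioned for the stratum sample sizes n1,h (equal, proportional, Neyman) can be sketched as follows. The function `allocate`, its arguments, and the toy stratum sizes are illustrative assumptions; the function returns fractional sizes and leaves rounding to the user.

```python
import numpy as np

def allocate(n_total, N_h, S_h=None, method="proportional"):
    """Split n_total first-stage units across strata.

    N_h : numbers of PSUs per stratum
    S_h : per-stratum standard deviations (needed for Neyman allocation only)
    """
    N_h = np.asarray(N_h, dtype=float)
    if method == "equal":
        weights = np.ones_like(N_h)
    elif method == "proportional":
        weights = N_h                       # n_h proportional to N_h
    elif method == "neyman":
        if S_h is None:
            raise ValueError("Neyman allocation needs stratum standard deviations")
        weights = N_h * np.asarray(S_h, dtype=float)   # n_h proportional to N_h * S_h
    else:
        raise ValueError(f"unknown method: {method}")
    return n_total * weights / weights.sum()

equal = allocate(10, [100, 300], method="equal")
proportional = allocate(10, [100, 300], method="proportional")
neyman = allocate(10, [100, 300], S_h=[2.0, 1.0], method="neyman")
```

In this toy case Neyman allocation shifts sample toward the smaller but more variable stratum relative to proportional allocation.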

3.3 3SCS with non-response

The 3SCS extends the 2SCS method by adding a third stage. In 3SCS, the population U consists of N1 PSUs, each containing N2,i SSUs, which in turn include N3,ij TSUs. Let Yijk denote the characteristic value associated with the kth TSU located within the jth SSU of the ith PSU, where i = 1, …, N1, j = 1, …, N2,i, and k = 1, …, N3,ij. Furthermore, $N_{3,ij} = N_{3,ij(1)} + N_{3,ij(2)}$, where $N_{3,ij(1)}$ and $N_{3,ij(2)}$ are the numbers of TSUs in the response and non-response groups, respectively. Under 3SCS with non-response, the finite population CDF may be written as

(16)

where

(17)

In this context, and represent the weights, while signifies the average size of the cluster. Additionally, let and denote the CDF derived from the jth SSU of the ith PSU for the response group and the non-response group, respectively.

To estimate G(y) under 3SCS with non-response, an unbiased estimator of the population CDF can be derived as follows. First, a sample of n1 PSUs is chosen. Next, from each selected PSU, a sample of n2,i SSUs is drawn. Finally, within each selected SSU, a sample of n3,ij TSUs is selected. Further, let $n_{3,ij} = n_{3,ij(1)} + n_{3,ij(2)}$, where $n_{3,ij(1)}$ and $n_{3,ij(2)}$ are the numbers of responding and non-responding TSUs, respectively. A sub-sample of size $r_{3,ij} = n_{3,ij(2)}/c$, where $c \ge 1$, is drawn from the $n_{3,ij(2)}$ non-responding units. Under the 3SCS with non-response at the third stage, an unbiased estimator of G(y) is given by:

(18)

where

(19)

and , , have their usual meanings. and are the CDF computed from response and non-response group, respectively. It should be noted that under 3SCS with non-response, SRS without replacement is used in all stages of sampling. For more detail, see [33,1] and references cited therein.

In the following lemmas, we examine the mathematical properties of $\hat{G}_{3S}(y)$.

Lemma 4: Under 3SCS, $\hat{G}_{3S}(y)$ is an unbiased estimator of G(y).

Proof. By assigning indices “1”, “2”, and “3” to represent the first, second, and third stages of sampling, respectively, we can express it as follows:

(20)

Lemma 5: The variance of $\hat{G}_{3S}(y)$ is given by

(21)

where

(22)

Proof:

(23)

From Eq (20), we can write

(24)

and

(25)

Also from Eq (20), we have

(26)

Now take expectations of Eq (20) to get

(27)

Now Eq (21) follows by replacing Eq (24), Eq (25) and Eq (27) in Eq (23), which completes the proof.

3.4 S3SCS under Non-Response

S3SCS begins by dividing the entire population U into L strata, based on characteristics such as region or income level. Within the hth stratum, the population contains N1,h PSUs, each with N2,i,h SSUs, each of which has N3,ij,h TSUs, for h = 1, …, L. Let Yijk,h be the characteristic under study for the kth TSU in the jth SSU of the ith PSU within the hth stratum, for i = 1, …, N1,h, j = 1, …, N2,i,h, and k = 1, …, N3,ij,h. Further, let $N_{3,ij,h} = N_{3,ij,h(1)} + N_{3,ij,h(2)}$, where $N_{3,ij,h(1)}$ and $N_{3,ij,h(2)}$ are the numbers of TSUs in the response and non-response groups, respectively. Under S3SCS with non-response, the population CDF may be written as:

(28)

In estimating the population CDF using S3SCS under non-response, the sampling process consists of several stages within each stratum. First, a sample of n1,h PSUs is selected from the N1,h PSUs. From each selected PSU, n2,i,h SSUs are sampled from the N2,i,h SSUs. Lastly, n3,ij,h TSUs are chosen from the N3,ij,h TSUs of each sampled SSU. Out of the n3,ij,h TSUs, $n_{3,ij,h(1)}$ units respond and $n_{3,ij,h(2)}$ units do not, such that $n_{3,ij,h} = n_{3,ij,h(1)} + n_{3,ij,h(2)}$. Adopting the technique of [7] for sub-sampling non-respondents, the unbiased estimator of the population CDF, G(y), under S3SCS with non-response can be written as:

(29)

where

Here, and have there usual meanings.

In the following lemma, we examine the properties of the CDF estimator $\hat{G}_{S3S}(y)$ under non-response within the S3SCS framework.

Lemma 6: $\hat{G}_{S3S}(y)$ is an unbiased estimator of G(y). The variance of $\hat{G}_{S3S}(y)$ is given by

(30)

Proof. The proof is similar to those of Lemmas 4 and 5. It is completed by observing that the mathematical expression for $\hat{G}_{S3S}(y)$ is similar to that for $\hat{G}_{3S}(y)$, except that the former is computed within the hth stratum.

4 Covariance computation and estimation in the presence of non-response

Here, we use the complex survey sampling schemes discussed above to derive mathematical expressions for the covariances of the CDF estimators under non-response developed in Section 3.

4.1 Two-stage and stratified two-stage cluster sampling

In a finite population U, let Y denote the study variable and X the auxiliary variable. To estimate (G(y), G(x)) within the frameworks of 2SCS and S2SCS, let $(\hat{G}_{2S}(y), \hat{G}_{2S}(x))$ and $(\hat{G}_{S2S}(y), \hat{G}_{S2S}(x))$ denote the corresponding CDF estimators based on (Y, X). The CDF estimators based on X are computed along the same lines as those based on Y.

Lemma 7: Under the 2SCS scheme, the covariance between $\hat{G}_{2S}(y)$ and $\hat{G}_{2S}(x)$ is given by

(31)

where

(32)(33)(34)

Proof. Here, , , and retain their standard interpretations. The demonstration of this Lemma can be found in the S3 Appendix.

Lemma 8: Under the S2SCS scheme, the covariance between $\hat{G}_{S2S}(y)$ and $\hat{G}_{S2S}(x)$ is given by

(35)

Proof. The proof is similar to that of Lemma 7. It is completed by observing that the expression for the covariance under S2SCS is similar to that under 2SCS, except that the former is computed within the hth stratum.

4.2 Three-stage and stratified three-stage cluster sampling

Under 3SCS and S3SCS with non-response, let $(\hat{G}_{3S}(y), \hat{G}_{3S}(x))$ and $(\hat{G}_{S3S}(y), \hat{G}_{S3S}(x))$ be the respective CDF estimators of (G(y), G(x)) based on (Y, X).

Lemma 9: Under the 3SCS scheme, the covariance between $\hat{G}_{3S}(y)$ and $\hat{G}_{3S}(x)$ is given by

(36)

where

(37)(38)(39)(40)

and .

Proof. Here, , , and retain their standard interpretations. The demonstration of this Lemma can be accessed in the S3 Appendix.

Lemma 10: Under the S3SCS scheme, the covariance between $\hat{G}_{S3S}(y)$ and $\hat{G}_{S3S}(x)$ is given by

(41)

Proof. The proof is similar to that of Lemma 9. It is completed by observing that the expression for the covariance under S3SCS is similar to that under 3SCS, except that the former is computed within the hth stratum.

5 CDF estimation under non-response using auxiliary information

In this section, two families of estimators, namely ratio/product and exponential ratio/product, are developed, which use auxiliary information to estimate the population CDF, G(y), under complex survey sampling schemes with non-response. To obtain the biases and MSEs of the suggested families of estimators of G(y), we use the following relative error terms. Let $e_0$ represent the relative error associated with the estimation of G(y), defined as

$$e_0 = \frac{\hat{G}_S(y) - G(y)}{G(y)},$$

and let $e_1$ represent the relative error associated with the estimation of G(x), defined as

$$e_1 = \frac{\hat{G}_S(x) - G(x)}{G(x)}, \tag{42}$$

such that $E(e_0) = E(e_1) = 0$. Furthermore, we define the variance and covariance parameters as follows:

$$V_{rs} = E\big(e_0^{r} e_1^{s}\big), \tag{43}$$

which gives

$$V_{20} = \frac{\mathrm{Var}\big(\hat{G}_S(y)\big)}{G^2(y)}, \qquad V_{02} = \frac{\mathrm{Var}\big(\hat{G}_S(x)\big)}{G^2(x)}, \qquad V_{11} = \frac{\mathrm{Cov}\big(\hat{G}_S(y), \hat{G}_S(x)\big)}{G(y)\,G(x)},$$

where $V_{20}$ represents the relative variance of the estimator of G(y), quantifying the dispersion of the estimated CDF around its true value, normalized by $G^2(y)$; $V_{02}$ represents the relative variance of the estimator of G(x), similarly normalized by $G^2(x)$; and $V_{11}$ represents the relative covariance of the estimators of G(y) and G(x), normalized by G(y)G(x), indicating the degree of association between the two estimators. Here, $\hat{G}_S(\cdot)$ represents a CDF estimator derived from the sampling scheme S, where S corresponds to 2S, 3S, S2S, or S3S.

5.1 First proposed family of CDF estimators

We suggest a class of ratio and product-type estimators, analogous to those introduced by [1] and [14], for estimating the population CDF under non-response.

(44)

where a is different from 0 and b as real numbers or functions of the known parameters of the supplementary variable X such as coefficient of variation , correlation coefficient , coefficient of skewness and coefficient of kurtosis etc. The selected scalars, and , minimize the MSE of . Different estimators of may be developed by selecting appropriate a, b, , and values. In Table 1, we display some members of for various values of a, b, , and .

Table 1. Several members of the suggested families of CDF estimators under non-response

https://doi.org/10.1371/journal.pone.0322660.t001

The bias and MSE of this family can be approximated by writing the estimators of G(y) and G(x) in terms of their relative errors. The expression on the right-hand side (RHS) of Eq (44) can then be written as:

(45)

where  +  . By expanding the RHS of Eq (45) and retaining terms up to the second power of the relative error terms, we have

(46)

To obtain the bias of this family, subtract G(y) from both sides of Eq (46), take expectations, and retain terms up to the first order of approximation:

(47)

Retaining only first-order terms, Eq (46) can be written as follows:

(48)

To obtain the MSE of this family, square both sides of Eq (48), take expectations, and retain terms up to the first order of approximation:

(49)

The minimum MSE of this family is given by

(50)(51)

where $\rho$ is the correlation coefficient between $\hat{G}_S(y)$ and $\hat{G}_S(x)$. The minimum MSE is attained at the optimum value of the tuning scalar.

5.2 Second proposed family of CDF estimators

We propose another class of exponential ratio/product estimators similar to those proposed by [1] and [28] to estimate the population CDF G(y) under non-response given as:

(52)

where a, b, and the scalars have their usual meanings. Some members of this family for various values of a, b, and the scalars are listed in Table 1. The bias and MSE of this family can be obtained by expressing Eq (52) in terms of the relative error terms as:

(53)

where  . By expanding the RHS of Eq (53) and retaining terms up to the second power of the relative error terms, we have

(54)

To obtain the bias of this family, subtract G(y) from both sides of Eq (54), take expectations, and retain terms up to the first order of approximation:

(55)

Retaining only first-order terms, Eq (54) can be written as follows:

(56)

Under the first order of approximation, we obtain the MSE of this family by squaring both sides of Eq (56) and taking expectations.

(57)

The minimum MSE of this family, attained at the optimum value of the tuning scalar, is given by

(58)

which is equivalent to that of the first family.

A variety of estimators can be derived from the two suggested families, as shown in Eqs (44) and (52), by selecting appropriate values for the parameters a, b, and the scalars. By substituting specific values of these constants into Eqs (47), (49), (55), and (57), we can derive the first-order approximations of the bias and MSE/variance for the respective members of the two families.

5.3 Difference CDF Estimator

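Additionally, supplementary information about the variances and covariances of the estimators can be used to improve the precision of CDF estimators. Let $\lambda_S$ denote the ratio of the covariance between $\hat{G}_S(y)$ and $\hat{G}_S(x)$ to the variance of $\hat{G}_S(x)$ under an S sampling scheme, i.e.

$$\lambda_S = \frac{\mathrm{Cov}\big(\hat{G}_S(y), \hat{G}_S(x)\big)}{\mathrm{Var}\big(\hat{G}_S(x)\big)}. \tag{59}$$

For situations involving non-response, the difference estimator of the population CDF, say $\hat{G}_{D,S}(y)$, which relies on $\hat{G}_S(y)$, $\hat{G}_S(x)$, G(x), and $\lambda_S$, is expressed as:

$$\hat{G}_{D,S}(y) = \hat{G}_S(y) + \lambda_S \big[G(x) - \hat{G}_S(x)\big], \tag{60}$$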

where the estimator is a linear combination of the CDF estimators based on Y and X, and G(x) denotes the known population CDF of the auxiliary variable under the S sampling scheme. It can be shown that the difference estimator is unbiased for G(y), and its variance can be obtained by expressing Eq (60) in terms of the relative error terms as:

(61)

To calculate the variance of the difference estimator, square both sides of Eq (61) and take expectations.

(62)

The simplified expression for the variance of the difference estimator is obtained by substituting the coefficient defined in Eq (59), which gives

(63)

which is identical to the minimum MSE of the two proposed families.
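The difference estimator and its variance can be illustrated with a short sketch. It assumes the usual form, in which the estimate for Y is adjusted by the covariance-to-variance ratio times the gap between the known and estimated CDF of X; all function names and numerical inputs are hypothetical, not values from the paper.

```python
def difference_coefficient(cov_yx, var_x):
    """Ratio of Cov(G_hat(y), G_hat(x)) to Var(G_hat(x))."""
    return cov_yx / var_x

def difference_cdf(g_y_hat, g_x_hat, G_x, lam):
    """Adjust the Y-based estimate by lam times the error observed on X."""
    return g_y_hat + lam * (G_x - g_x_hat)

def difference_variance(var_y, var_x, cov_yx):
    """Var(G_hat(y)) + lam^2 Var(G_hat(x)) - 2 lam Cov, at lam = Cov / Var(G_hat(x))."""
    lam = difference_coefficient(cov_yx, var_x)
    return var_y + lam ** 2 * var_x - 2.0 * lam * cov_yx

# Hypothetical variances and covariance of the two CDF estimators.
var_y, var_x, cov_yx = 0.004, 0.009, 0.003
lam = difference_coefficient(cov_yx, var_x)
adjusted = difference_cdf(g_y_hat=0.10, g_x_hat=0.48, G_x=0.51, lam=lam)
variance = difference_variance(var_y, var_x, cov_yx)

# The same variance written as Var(G_hat(y)) * (1 - rho^2).
rho_squared = cov_yx ** 2 / (var_y * var_x)
variance_via_rho = var_y * (1.0 - rho_squared)
```

The two variance expressions agree, which is the sense in which the difference estimator matches the minimum MSE of the two families.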

6 Empirical Study

In this section, we use real datasets and a simulation study to compute the relative efficiencies (REs) of the proposed CDF estimators of G(y) under non-response, relative to the S sampling scheme-based estimator $\hat{G}_S(y)$.

6.1 Population I

The first dataset, sourced from the Centers for Disease Control (CDC), comes from the Second National Health and Nutrition Examination Survey (NHANES-II). It comprises 10,351 units, representing the entire non-institutionalized civilian (NIC) population of the United States (US), including all 50 states and the District of Columbia. The data are separated into four geographic regions (REGs): midwestern, southern, northeastern, and western, each further subdivided into specific locations (LOCs). Random numbers are generated from a Bernoulli distribution with success probability 0.50 to split the dataset into two strata, with 0 indicating Stratum-I and 1 indicating Stratum-II. The dataset can be found at: https://www.stata-press.com/data/r15/svy.html. Our goal is to determine the percentage of underweight individuals in the NIC US population using body mass index (BMI) as the study variable, denoted by “Y”, and weight as the auxiliary variable, denoted by “X”. An individual is classified as underweight if their BMI is less than or equal to y = 18.50, with the average weight of the NIC US population being x = 71.8975. For the 2SCS/S2SCS design, REG and BMI are treated as PSU and SSU. In the 3SCS/S3SCS design, REG, LOC, and BMI are considered as PSU, SSU, and TSUs, respectively. Below are the constants for the population.
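The stratification step and the target quantity described above can be sketched as follows. The BMI values here are synthetic draws used purely for illustration (they are not the NHANES-II data), and the function name `stratified_cdf` and the seed are assumptions.

```python
import numpy as np

rng = np.random.default_rng(2024)

# Synthetic stand-in for the BMI column; NOT the NHANES-II data.
bmi = rng.normal(25.0, 4.5, size=1000)

# Bernoulli(0.5) labels: 0 for Stratum-I, 1 for Stratum-II.
stratum = rng.binomial(1, 0.5, size=bmi.size)

def stratified_cdf(values, stratum, y0):
    """Population CDF at y0 written as a stratum-weighted sum of stratum CDFs."""
    values = np.asarray(values, dtype=float)
    stratum = np.asarray(stratum)
    total = 0.0
    for h in np.unique(stratum):
        in_h = stratum == h
        W_h = in_h.mean()                       # stratum share of the population
        total += W_h * np.mean(values[in_h] <= y0)
    return float(total)

overall = float(np.mean(bmi <= 18.5))           # share classified as underweight
by_strata = stratified_cdf(bmi, stratum, 18.5)  # same quantity, stratum by stratum
```

Because the strata are assigned at random, they differ only by chance, and the weighted stratum CDFs add back up to the overall proportion.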

To demonstrate what we have aforesaid, the values of based on an S sampling scheme are computed, which are given below.

Table 2. The values based on an S sampling scheme under non-response utilizing Population I

https://doi.org/10.1371/journal.pone.0322660.t002

Note: Stratifying is abbreviated as .

6.2 Population II

These data are taken from the study conducted by [27] in Multan district, southern Punjab, Pakistan, from January 2020 to March 2020. The dataset includes 1040 units, categorized into two strata: Stratum-I (males) and Stratum-II (females). In this study, BMI and weight are used as the study and auxiliary variables, respectively. The objective is to estimate the proportion of underweight children by utilizing the auxiliary variable, with x = 35.76 representing the average weight, under the S sampling scheme. For the 2SCS/S2SCS design, socioeconomic status (SES) and BMI are treated as PSU and SSU. In the 3SCS/S3SCS design, SES, Age, and BMI are considered as PSU, SSU, and TSUs, respectively. The population constants are as follows:

To demonstrate what we have aforesaid, the values of primarily based on an S sampling scheme are computed below.

Table 3. The values based on an S sampling scheme under non-response utilizing Population II

https://doi.org/10.1371/journal.pone.0322660.t003

We use the following expressions to compute the REs of the proposed CDF estimators of the finite population distribution function under S sampling schemes with non-response, relative to $\hat{G}_S(y)$.

where  . Tables 4 and 5 report the REs of these CDF estimators.
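As a hedged illustration of how such REs are computed, the sketch below takes RE to be the variance of the reference estimator divided by the MSE of a proposed estimator, so that values above one favor the proposed estimator. The function names and inputs are hypothetical; the second function assumes the minimum MSE has the form Var(1 − ρ²).

```python
def relative_efficiency(var_reference, mse_proposed):
    """RE = Var(reference estimator) / MSE(proposed estimator)."""
    if mse_proposed <= 0:
        raise ValueError("the MSE of the proposed estimator must be positive")
    return var_reference / mse_proposed

def re_at_minimum_mse(rho):
    """RE when the proposed estimator attains Var * (1 - rho^2) (assumed form)."""
    if not -1.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between -1 and 1")
    return 1.0 / (1.0 - rho ** 2)

re_example = relative_efficiency(0.004, 0.003)
re_from_rho = re_at_minimum_mse(0.5)
```

Under this assumed form the gain depends only on the correlation between the two CDF estimators, which is why weak correlation yields only marginal improvements.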

Table 4. Evaluating REs of proposed CDF estimator of G(y) under non-response relative to utilizing Population I

https://doi.org/10.1371/journal.pone.0322660.t004

Table 5. Evaluating REs of the proposed CDF estimator of G(y) under non-response relative to utilizing Population-II

https://doi.org/10.1371/journal.pone.0322660.t005

Based on the S sampling scheme, the REs of the proposed CDF estimators were computed for different values of n1, n2,i, and n3,ij using the aforementioned datasets. The proposed CDF estimators that utilize auxiliary information appear to be marginally more efficient than those that do not, as indicated by RE values greater than one. Furthermore, CDF estimators under sampling schemes with stratification are slightly more efficient than those without stratification. Efficiency tends to increase with the number of sampling stages, although the impact of increasing the sample size is less clear-cut. Typically, RE tends to increase as the sample size at the various sampling stages increases, and vice versa.

6.3 Simulation Study

To further validate our findings, we estimated the MSE by simulation, averaging the squared difference between the estimator and the true parameter over repeated samples. The proposed families of CDF estimators were simulated under different non-response rates and sample sizes, and their corresponding MSEs were calculated. The MSEs/variances of the CDF estimators under non-response are computed by drawing 10,000 samples from Population I under a given sampling scheme. The simulated MSEs of the estimators for both families under an S sampling scheme are computed as:

(64)(65)(66)

where . The relative precision (RP) of and with respect to under an S-sampling scheme is given by:

(67)
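The simulation recipe (average squared error over repeated samples, then a ratio of MSEs) can be sketched as below. To stay short, the demo is a deliberate simplification: it uses SRS without replacement from a synthetic population with no non-response, rather than the multi-stage designs of the paper, and compares the plain sample CDF with a difference-type estimator. All names, the seed, and the population parameters are assumptions.

```python
import numpy as np

def simulated_mse(estimates, true_value):
    """Average squared deviation of replicated estimates from the true value."""
    estimates = np.asarray(estimates, dtype=float)
    return float(np.mean((estimates - true_value) ** 2))

def relative_precision(mse_reference, mse_proposed):
    """Values above one mean the proposed estimator has the smaller MSE."""
    return mse_reference / mse_proposed

def run_demo(n_reps=2000, n=100, seed=0):
    rng = np.random.default_rng(seed)
    # Synthetic population: auxiliary x and a correlated study variable y.
    x = rng.normal(70.0, 12.0, size=2000)
    y = 0.3 * x + rng.normal(0.0, 2.0, size=2000)
    y0, x0 = np.quantile(y, 0.2), x.mean()
    iy, ix = (y <= y0).astype(float), (x <= x0).astype(float)
    G_y, G_x = iy.mean(), ix.mean()
    # Under SRS the Cov/Var ratio of the estimators equals the population ratio.
    lam = np.cov(iy, ix)[0, 1] / np.var(ix, ddof=1)
    base, diff = [], []
    for _ in range(n_reps):
        idx = rng.choice(x.size, size=n, replace=False)
        g_y, g_x = iy[idx].mean(), ix[idx].mean()
        base.append(g_y)                          # estimator without auxiliary data
        diff.append(g_y + lam * (G_x - g_x))      # difference-type estimator
    mse_base = simulated_mse(base, G_y)
    mse_diff = simulated_mse(diff, G_y)
    return mse_base, mse_diff, relative_precision(mse_base, mse_diff)
```

Calling `run_demo()` returns the two simulated MSEs and their ratio; replacing the SRS draw with a multi-stage draw and a non-response step would bring the sketch closer to the designs studied here.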

We then calculated the RP between the estimators and found that our proposed classical ratio/product-type, exponential ratio/product-type, and difference CDF estimators performed well compared to existing methods. Additionally, we extended our simulation analysis by computing the MSE for different sample sizes of PSUs, SSUs, and TSUs under 2SCS/S2SCS, and 3SCS/S3SCS schemes. To assess the robustness of our estimators under varying levels of non-response, we evaluated their efficiency at response rates of 75% and 80%. For brevity of discussion, we considered c = 1 in our calculations.

While estimating the MSE through simulation, we adopted a structured sampling design for selecting PSUs, SSUs, and TSUs under the different sampling schemes, as shown in Table 6. This framework was applied across 2SCS, 3SCS, and their stratified counterparts, ensuring a systematic and realistic representation of hierarchical data selection. This structured approach allowed us to analyze the performance of the proposed estimators effectively under varying levels of non-response.

Table 6. The structured sampling design for the estimation of MSE through simulation under non-response utilizing Population I

https://doi.org/10.1371/journal.pone.0322660.t006
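The hierarchical selection of PSUs, SSUs, and TSUs can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes simple random sampling without replacement at every stage, ignores non-response and stratification, and uses a weighted (Hájek-type) CDF estimate built from the inverse selection fractions. The helper names `draw_three_stage` and `cdf_estimate` and the synthetic population are hypothetical.

```python
import numpy as np

def draw_three_stage(pop, n1, n2, n3, rng):
    """Draw a three-stage cluster sample by SRSWOR at each stage.

    pop is a list of PSUs; each PSU is a list of SSUs; each SSU is a
    1-D NumPy array of TSU values. Requires n1, n2, n3 not to exceed the
    corresponding unit counts. Returns (values, weight) pairs, one per
    selected SSU.
    """
    sample = []
    N1 = len(pop)
    for i in rng.choice(N1, size=n1, replace=False):      # stage 1: PSUs
        psu = pop[i]
        N2 = len(psu)
        for j in rng.choice(N2, size=n2, replace=False):  # stage 2: SSUs
            ssu = psu[j]
            N3 = len(ssu)
            k = rng.choice(N3, size=n3, replace=False)    # stage 3: TSUs
            # Design weight: product of inverse sampling fractions.
            w = (N1 / n1) * (N2 / n2) * (N3 / n3)
            sample.append((ssu[k], w))
    return sample

def cdf_estimate(sample, t):
    """Weighted proportion of sampled TSU values not exceeding t."""
    num = sum(w * np.sum(vals <= t) for vals, w in sample)
    den = sum(w * len(vals) for vals, w in sample)
    return num / den

# Hypothetical population: 6 PSUs, 15 SSUs per PSU, 30 TSUs per SSU.
rng = np.random.default_rng(0)
pop = [[rng.gamma(2.0, 1.0, size=30) for _ in range(15)] for _ in range(6)]
t = 2.0
F_true = np.mean(np.concatenate([s for psu in pop for s in psu]) <= t)

# Sample sizes n1 = 2, n2 = 12, n3 = 20, matching one structure in the text.
sample = draw_three_stage(pop, n1=2, n2=12, n3=20, rng=rng)
F_hat = cdf_estimate(sample, t)
print(f"true F({t}) = {F_true:.4f}, estimate = {F_hat:.4f}")
```

Dropping the innermost stage (taking every TSU in each selected SSU) gives the two-stage analogue of the same selection.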

Overall, the proposed families of estimators performed well under different sampling conditions. However, the proposed difference estimator demonstrated slightly higher efficiency compared to both the estimators without auxiliary information and the other proposed families of estimators. Although the efficiency gains of the difference estimator were modest, it consistently outperformed other estimators across all sampling schemes. Notably, the maximum gain in precision was found in the ratio/product-type estimators, particularly under the 3SCS scheme, where it reached 15% when considering a sampling structure of n1 = 2, n2,i = 12, and n3,ij = 20.

The consistency between the theoretical and simulation-based MSE results further validates the reliability of the proposed estimators in handling non-response within complex survey sampling frameworks. The detailed results of these simulation studies under various complex survey sampling schemes and non-response rates can be found in the supplementary file (S1 Tables), from Table 1 to Table 16.

7 Conclusion

This paper considered 2SCS and 3SCS schemes, with and without stratification, to estimate the finite population CDF under non-response. Two families of estimators, namely classical ratio/product-type and exponential ratio/product-type, have been proposed. Additionally, the CDF has been estimated using a difference estimator. Under the first order of approximation, mathematical expressions have been developed for the biases and MSEs of the proposed CDF estimators under these sampling schemes. Real datasets were used to substantiate and illustrate the proposed theory. The CDF estimators proposed for non-response in complex survey sampling with auxiliary information appeared to be marginally more efficient than those without auxiliary information. Our findings revealed that the difference estimators performed well in terms of RE compared to the other estimators discussed in this study. Moreover, the simulation study provided additional evidence for the efficiency of the proposed families of estimators across different sampling scenarios, reinforcing our theoretical findings.

Supporting information

S1 Tables. Simulation Results of the RP of the Proposed CDF Estimators.

This file contains the RP of the proposed CDF estimators under non-response using auxiliary information within the 2SCS/3SCS and S2SCS/S3SCS schemes.

https://doi.org/10.1371/journal.pone.0322660.s001

(TEX)

S2 Files. Datasets for Population I, and II.

This file contains datasets supporting the findings of this study.

https://doi.org/10.1371/journal.pone.0322660.s002

(XLSX)

S3 Appendix. Proofs for Lemma 7 and 9.

This file contains the proofs of Lemma 7 and Lemma 9.

https://doi.org/10.1371/journal.pone.0322660.s003

(PDF)

References

  1. Abbas M, Haq A. Estimation of finite population distribution function with auxiliary information in a complex survey sampling. SORT. 2022;46(1):67–94. http://dx.doi.org/10.2436/20.8080.02.118
  2. Aditya K, Sud U, Chandra H, Biswas A. Calibration based regression type estimator of the population total under two stage sampling design. J Indian Soc Agr Stat. 2016;70:19–24.
  3. Al-Saleh MF, Samuh MH. On multistage ranked set sampling for distribution and median estimation. Comput Stat Data Anal. 2008;52(4):2066–78.
  4. Cochran W. Sampling techniques. 3rd edn. John Wiley & Sons; 1970.
  5. Francisco C, Fuller W. Estimation of the distribution function with a complex survey. In: JSM Proceedings of the Survey Research Methods Section. American Statistical Association; 1986, pp. 37–45.
  6. Garg N, Srivastava M. A general class of estimators of a finite population mean using multi-auxiliary information under two stage sampling scheme. J Reliab Statist Stud. 2009;2:101–19.
  7. Hansen M, Hurwitz W. On the theory of sampling from finite populations. Ann Math Stat. 1943;14:333–62.
  8. Hansen M, Hurwitz W, Madow W. Sample survey methods and theory. Vol. I. Methods and applications. New York: John Wiley & Sons; 1953.
  9. Haq A. Estimation of the distribution function under hybrid ranked set sampling. J Stat Comput Simul. 2016;87(2):313–27.
  10. Haq A, Abbas M, Khan M. Estimation of finite population distribution function in a complex survey sampling. Commun Stat Theory Methods. 2021;52(8):2574–96.
  11. Hussain S, Ahmad S, Saleem M, Akhtar S. Finite population distribution function estimation with dual use of auxiliary information under simple and stratified random sampling. PLoS One. 2020;15(9):e0239098. pmid:32986764
  12. Iftikhar A, Shi H, Hussain S, Abbas M, Ullah K. Efficient estimators of finite population mean based on extreme values in simple random sampling. Math Probl Eng. 2022;2022(1):5866085.
  13. Jabeen R, Sanaullah A, Hanif M. Generalized estimator for estimating population mean under two stage sampling. Pak J Stat. 2014;30(4).
  14. Khoshnevisan M, Singh R, Chauhan P, Sawan N. A general family of estimators for estimating population mean using known value of some population parameter(s). Far East J Theor Stat. 2007;22:181–91.
  15. Lee SE, Lee PR, Shin K-I. A composite estimator for stratified two stage cluster sampling. CSAM. 2016;23(1):47–55.
  16. Little R, Rubin D. Statistical analysis with missing data. Hoboken, NJ: John Wiley & Sons.
  17. Mahalanobis P. A sample survey of the acreage under jute in Bengal. Sankhya. 1940;1:511–30.
  18. Murthy M. Sampling theory and methods. Calcutta: Statistical Publishing Society; 1967.
  19. Nafiu LA, Oshungade IO, Adewara A. Alternative estimation method for a three-stage cluster sampling in finite population. Am J Math Stat. 2012;2(6):199–205.
  20. Stokes SL, Sager TW. Characterization of a ranked-set sample with application to estimating distribution functions. J Am Stat Assoc. 1988;83(402):374–81.
  21. Preston J. Rescaled bootstrap for stratified multistage sampling. Surv Methodol. 2009;35:227–34.
  22. Raj D. Sampling theory. New York: McGraw-Hill Book Company; 1968.
  23. Rao J, Rao J. Small area estimation. New York: John Wiley & Sons; 1999.
  24. Rustagi R. Some theory of the prediction approach to two stage and stratified two stage cluster sampling. PhD thesis. The Ohio State University.
  25. Särndal CE, Swensson B, Wretman J. Model assisted survey sampling. Springer Science & Business Media.
  26. Sedransk N, Sedransk J. Distinguishing among distributions using data from complex sample designs. J Am Stat Assoc. 1979;74(368):754–60.
  27. Shehzad MA, Khurram H, Iqbal Z, Parveen M, Shabbir MN. Nutritional status and growth centiles using anthropometric measures of school-aged children and adolescents from Multan district. Arch Pediatr. 2022;29(2):133–9. pmid:34955308
  28. Singh R, Chauhan P, Sawan N, Smarandache F. Improvement in estimating the population mean using exponential estimator in simple random sampling. Int J Stat Econ. 2009;3:13–8.
  29. Singh R, Solanki C. A new approach to handle non-response in sample surveys. J Stat Res. 2012;46:69–82.
  30. Singh R, Vishwakarma GK, Gupta P, Pareek S. An alternative approach to estimation of population mean in two-stage sampling. Math Theory Model. 2013;3:48–53.
  31. Singh S. Advanced sampling theory with applications: How Michael “selected” Amy, vol. 2. Dordrecht: Springer; 2003.
  32. Sukhatme P, Sukhatme B, Sukhatme S, Asok C. Sampling theory of surveys with applications. Ames: Iowa State University Press; 1970.
  33. Yaqub M, Shabbir J. Estimation of population distribution function in the presence of non-response. Hacettepe J Math Stat. 2016;46(131):1–1.