How Big of a Problem is Analytic Error in Secondary Analyses of Survey Data?

Secondary analyses of survey data collected from large probability samples of persons or establishments further scientific progress in many fields. The complex design features of these samples improve data collection efficiency, but also require analysts to account for these features when conducting analysis. Unfortunately, many secondary analysts from fields outside of statistics, biostatistics, and survey methodology do not have adequate training in this area, and as a result may apply incorrect statistical methods when analyzing these survey data sets. This in turn could lead to the publication of incorrect inferences based on the survey data that effectively negate the resources dedicated to these surveys. In this article, we build on the results of a preliminary meta-analysis of 100 peer-reviewed journal articles presenting analyses of data from a variety of national health surveys, which suggested that analytic errors may be extremely prevalent in these types of investigations. We first perform a meta-analysis of a stratified random sample of 145 additional research products analyzing survey data from the Scientists and Engineers Statistical Data System (SESTAT), which describes features of the U.S. Science and Engineering workforce, and examine trends in the prevalence of analytic error across the decades used to stratify the sample. We once again find that analytic errors appear to be quite prevalent in these studies. Next, we present several example analyses of real SESTAT data, and demonstrate that a failure to perform these analyses correctly can result in substantially biased estimates with standard errors that do not adequately reflect complex sample design features. Collectively, the results of this investigation suggest that reviewers of this type of research need to pay much closer attention to the analytic methods employed by researchers attempting to publish or present secondary analyses of survey data.


Introduction
Secondary analyses of survey data sets collected from large probability samples of persons or establishments further scientific progress in many academic fields, including (but not limited to) education, sociology, and public health. The samples underlying these data sets, while enabling inferences about population characteristics or relationships between variables of interest in a finite population of interest, are often "complex" in nature, employing sampling strategies such as stratification of the population and cluster sampling [1][2]. These complex sample design features improve the cost efficiency of survey data collection, but also require secondary analysts to employ approaches that account for the effects of the complex sampling statistically [3].
Unfortunately, many secondary analysts of these data sets do not have formal training in survey statistics, and may apply incorrect analytic methods when analyzing these data sets as a result. The application of standard statistical methods to these data sets can lead to incorrect population inferences, which effectively negates the resources dedicated to the survey data collection. This potential analytic error on the part of secondary analysts defines an important part of the widely researched Total Survey Error (TSE) framework [4][5][6][7][8]. Yet this component of TSE has received almost no research attention relative to the other sources of survey error that define this framework.
In this article, we extend prior knowledge about the magnitude of the analytic error problem by: 1) reviewing representative samples of research products presenting analyses of three different nationally representative survey data sets, to understand the statistical approaches that users of these data employed; 2) identifying evidence of apparent analytic errors in the studies, and quantifying the prevalence of the different types of errors over time across the studies; 3) attempting to isolate sources of the apparent analytic errors based on the dissemination format (formal journal article, book chapter, technical report, conference presentation, etc.); and 4) demonstrating the implications of making analytic errors for inferences based on analyses of survey data. The results of this study suggest that analytic error is a significant problem in these types of research investigations, and these findings have important implications for peer reviewers and the scientific community more generally.

Alternative Approaches to Survey Data Analysis
There are generally two schools of thought in the survey statistics literature with regard to correct theoretical approaches to the analysis of survey data arising from complex samples [9]. First, the design-based analysis approach is characterized by 1) the use of sampling weights for unbiased estimation of parameters describing finite populations (e.g., means, proportions, regression coefficients, etc.), where the weights may be adjusted for survey nonresponse and calibrated to reflect known population features [10], and 2) non-parametric estimation of the variances of weighted estimates using either codes describing complex sampling features (such as sampling stratum codes, or codes describing sampling clusters) or replicate weight variables [11]. The primary historical developments underlying design-based analysis approaches can be found in Neyman [12], Hansen, Hurwitz and Madow [13], Kish [1], Cochran [14], Binder [15] and Korn and Graubard [16]; design-based methods for variance estimation are discussed at length in Wolter [11], Heeringa, West and Berglund [2] and Valliant, Dever and Kreuter [10]. Second, the model-based analysis approach ignores the notion of a finite population, and assumes that the survey data arise from an infinite data generation process governed by a probability model, where estimation of the parameters that define that model is the focus of the analysis. Model-based approaches have generally come to rely on various forms of multilevel (or hierarchical linear) models, or Bayesian approaches [17][18]. The complex sampling features essentially become predictors in these models, entering as either fixed effects (for strata that are fixed by design across hypothetical repeated samples) or random effects (for randomly sampled clusters). 
The analyst also needs to decide whether to use the sampling weights to estimate the parameters of the probability model [19][20][21][22][23], or include the weights as covariates to "control" for the relationships of features used to define the weights with the dependent variable [16]. This decision is not clearly guided by any theoretical results, and has been a source of controversy among statisticians [16,19,24,25,26].
There are thus alternative "correct" approaches that a secondary user of survey data can take when analyzing complex sample survey data. Recent publications have even attempted to unite these two broader types of approaches into single analytic paradigms [24,27,28]. Unfortunately, analysts of survey data from fields outside of statistics and survey methodology generally do not have the benefit of technical training in these alternative approaches. This lack of training can lead to analytic errors in published analyses of survey data when methods appropriate for "standard" simple random samples (or independent and identically distributed data) that are taught as a critical component of many degree programs are applied when analyzing the data. The key point for analysts is that the sample design features are accounted for, regardless of the approach used. A failure to do this can lead to biased estimates and incorrect inferences [16].
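As a rough illustration of the design-based approach described above, the following Python sketch computes a weighted estimate of a mean and a generic replication-based variance estimate of the form scale × Σ(θ_r − θ)². The toy data, the function names, and the `scale` multiplier are hypothetical; the appropriate multiplier depends on the replication method documented by the data producer, and the delete-one jackknife used here is only an assumption for the sake of the example.

```python
from math import sqrt

def weighted_mean(values, weights):
    """Weighted estimate of a population mean."""
    total_weight = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total_weight

def replicate_variance(values, full_weights, replicate_weights, scale):
    """Generic replication variance: scale * sum_r (theta_r - theta)^2.

    `scale` is supplied by the caller (an assumption here); it depends on
    the replication method used to construct the replicate weights.
    """
    theta = weighted_mean(values, full_weights)
    squared_devs = [
        (weighted_mean(values, rep) - theta) ** 2 for rep in replicate_weights
    ]
    return scale * sum(squared_devs)

# Hypothetical toy data: four respondents with full-sample weights.
y = [1.0, 0.0, 1.0, 1.0]
w = [10.0, 30.0, 20.0, 40.0]
n = len(y)

# Hypothetical delete-one jackknife replicates: drop one respondent per
# replicate and rescale the remaining weights by n / (n - 1).
reps = [
    [0.0 if i == r else wi * n / (n - 1) for i, wi in enumerate(w)]
    for r in range(n)
]

estimate = weighted_mean(y, w)
variance = replicate_variance(y, w, reps, scale=(n - 1) / n)
std_error = sqrt(variance)
```

An analysis that used the weights but replaced `replicate_variance` with a simple random sampling formula would reproduce `estimate` while reporting a different, and generally inappropriate, standard error.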

Contributions to the Existing Literature
The study presented in this article makes several unique contributions to the very small amount of existing literature on analytic error. We build on an initial pilot study of analytic error in 100 published, peer-reviewed journal articles, which found that the failure to use one of the correct analytic approaches described above in published secondary analyses of a variety of public health-related complex sample survey data sets is in fact quite common [29][30]. While the current study also focuses on potential analytic errors in secondary analyses of complex sample survey data, it makes several unique contributions relative to this initial pilot study:
1. We consider the possibility of analytic error in additional types of research products aside from peer-reviewed journal articles, including conference / proceedings papers, technical reports, and book chapters;
2. We draw a formal stratified sample of 145 unique research products, treating different decades as sampling strata, enabling us to assess trends in the types of analysis approaches used across different decades;
online for secondary analysis, and they each arise from samples with complex designs. This requires secondary users of these data sets to employ appropriate estimation methods accounting for the features of the sample designs when analyzing the data. We specifically chose to focus on the SESTAT surveys for three primary reasons:
1. An established body of literature spanning multiple decades has made use of SESTAT data to describe the characteristics of the U.S. workforce, and these data allow users to make timely inferences about important topics regarding the advanced education of the U.S. workforce and trends in its characteristics. In short, we shift the substantive focus of this study to research aimed at describing the scientific capabilities of the U.S. workforce, rather than public health outcomes (as in the pilot study).
2. The NCSES currently employs a fairly unique mechanism to make design information available to public data users for analysis. Final adjusted sampling weights are provided in all public-use SESTAT data sets, but the necessary replicate weights and design codes for variance estimation purposes are presently only available upon request. This distinguishes the SESTAT data sets from the other public health survey data sets analyzed in the initial pilot study [29][30], each of which included all of this design information in their public-use data files. SESTAT data users need to read the online documentation very carefully to understand the need to request data files containing the replicate weights and other design information for variance estimation purposes, and this introduces an increased risk of analytic error due to a failure to fully account for complex sampling features. We wanted to assess what analysts of SESTAT data were doing in their studies, given this somewhat unique mechanism for obtaining the public-use data and sample design information.
3. The NCSES is currently making a concerted effort to improve their documentation and also understand the analytic approaches being employed by public users of NCSES survey data (including SESTAT data). In line with these goals, the NCSES recently called for research proposals aiming to improve the analytic methods employed by SESTAT data users (originally National Science Foundation Program Solicitation 12-545, and now 15-521). The present study was part of this evaluation objective.

Background: SESTAT
Per the official SESTAT web site, "This integrated data system is a unique source of longitudinal information on the education and employment of the college-educated U.S. science and engineering workforce" (see the web site provided above for more information). Table 1 outlines the complex sampling features associated with each of these three survey programs, in addition to recently updated counts of the unique Google Scholar (GS) links associated with each survey (as a proxy measure of the research activity related to each survey). At present, the NCSES includes final adjusted sampling weights for estimation purposes in all public-use SESTAT data files. The NCSES also makes replicate weights capturing these essential sample design features available to public users of the SESTAT data upon request for design-based variance estimation purposes [31]. Interested readers can consult Valliant, Dever and Kreuter [10] or Wolter [11] for more information on design-based variance estimation using replicate weights. Detailed codes describing sampling strata and sampling clusters, which would be especially important for model-based analysis approaches, are also available via restricted-use agreements. Individuals who request the replicate weights or establish these restricted-use agreements are provided with metadata files describing these replicate weight variables and sample design codes [32]. SESTAT data users also have two additional options for taking complex sampling features into account in their analyses: 1. Use a free online analysis tool that provides correct design-based standard errors based on the replicate weights, for straightforward descriptive and tabular analyses; or 2. Use a generalized variance function (GVF) approach to variance estimation [11], incorporating aggregate design effect information provided for estimates computed using SESTAT data [32][33].
Unfortunately, despite the public availability of this information and the opportunity to access the necessary data for appropriate variance estimation by request or via restricted-use agreements, SESTAT data users may ignore the documentation provided or may not be appropriately alerted to the importance of using these variables for variance estimation if they do not search the SESTAT web site carefully. This could adversely affect the inferences that users make based on these three survey data sets. If only the final sampling weights are used in design-based estimation, and the complex sampling features representing stratification and cluster sampling are ignored, variance estimates may be biased, ultimately affecting confidence intervals and tests of significance. Users of the SESTAT data employing model-based approaches also need to consider what role these weights and sample design codes will play in the probability models that they specify for their variables.

Sampling of Research Products
We sampled 50 research products presenting analyses of data from each of the three SESTAT surveys using the following methodology. Within Google Scholar (scholar.google.com), a search term was submitted including the name of the survey in quotations, and the word "analysis" (e.g., "National Survey of College Graduates" analysis). The size of the set of search results (N) was then treated as the size of the "population" of related products; for example, 655 links or "citations" were identified (October 2015) when submitting the above search term to Google Scholar. Specific year ranges for the products were specified in Google Scholar to ensure appropriate sorting of the identified products by time (implicit stratification), given that time is likely an important factor in the prevalence of analytic errors. More specifically, we hypothesized that knowledge about (and software enabling) appropriate analytic methods for complex sample survey data has become more widely disseminated in recent years, meaning that we expected time and the prevalence of various errors to be negatively correlated. This implicit stratification of the identified research products by time was done to ensure that we had a representative picture of the analytic error problem across different time periods, considering the lifetimes of each of the SESTAT surveys. This operation therefore resulted in a list of products that was implicitly stratified by year, and the sampling interval (k = N / 50) was determined in such a way that one of the 50 products would be selected from each interval. Fifty (50) products were then sampled using systematic sampling based on fractional intervals [1], and a stratified random sampling model was used for making inferences based on the resulting sample, with strata defined by collapsing adjacent intervals.
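The selection step described above can be sketched in Python as follows. This is a minimal illustration of systematic sampling with a fractional interval, not the procedure's actual implementation: the function name, the seed, and the rounding rule (rounding each selection point up to the next whole list position) are assumptions, while N = 655 and n = 50 come from the example in the text.

```python
import math
import random

def systematic_sample(population_size, sample_size, rng):
    """Select 1-based list positions by systematic sampling with a
    fractional sampling interval k = N / n."""
    k = population_size / sample_size       # fractional sampling interval
    start = k * (1.0 - rng.random())        # random start in (0, k]
    positions = []
    for j in range(sample_size):
        # Round each fractional selection point up to a whole position.
        position = math.ceil(start + j * k)
        # Guard against floating-point round-off at the end of the list.
        positions.append(min(position, population_size))
    return positions

rng = random.Random(2015)  # arbitrary seed so the sketch is reproducible
selected = systematic_sample(655, 50, rng)
```

Because the list is sorted by year before selection, each of the 50 intervals contributes exactly one product, which is what produces the implicit stratification by time.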
The University of Michigan-Ann Arbor provides its researchers with free online access to JSTOR and nearly all major academic journals, so we did not experience any access problems for the journals in which peer-reviewed journal articles appeared.
Research products were found to be "eligible" if they actually presented original analyses of the survey data and did not simply refer to other articles presenting analyses of these survey data sets. Products also needed to be readily accessible in electronic format. If 50 "eligible" products were not identified, an additional systematic sample of the required size (e.g., an additional 10 products given 10 ineligible products) was selected. In total, 232 research products were sampled across the three surveys following this procedure, and 82 were excluded based on review of the abstracts (see Fig 1 for the PRISMA flow chart; see also the supporting PRISMA checklist in S2 Text). Some products presented analyses of the fully integrated SESTAT database, meaning that data from all three surveys were analyzed simultaneously. These products were only coded once to represent one of the three SESTAT surveys. When a working paper did not provide a date of availability online, we inferred the date from the most recent publication that it cited.
A complete reference list for all sampled products can be found in S1 Text. We note that some of the sampled research products were working papers that the authors indicated should not be cited without permission, and we do not cite these papers directly at any point in this study, only reporting results in aggregate.

Coding Operations
After 50 research products were sampled for each of the three surveys, our research team reviewed and coded each sampled product in a qualitative fashion, recording responses to the following questions:
1. In what year was the product made available for public viewing?
2. Did the analysis account (in some fashion) for the survey weights?
3. Did the analysis account (in some fashion) for the sample design features (e.g., stratification, cluster sampling, replicate weights) in variance estimation?
4. Did the authors appear to use a design-based approach or a model-based approach in the analysis (where the "model-based" approaches include those that ignore sampling features entirely in the model specification)?
5. Did the authors appear to use appropriate statistical software procedures?
6. Did the authors use appropriate methods for subpopulation analysis when design-based methods were employed, per Chapter 4 of Heeringa et al. [2]?
7. How did the authors describe their inferences: with respect to the target population for a given survey (appropriate: e.g., ". . .an estimated 60% of this population spent four years on their Ph.D."), or with respect to the sample in hand (inappropriate: e.g., ". . .60% of the sample spent four years on their Ph.D.")?
8. Was the sampled product a formal journal article, a book chapter / technical report, or a paper presented at a conference (appearing in conference proceedings)?
We systematically recorded answers to each of these questions for each of the 150 sampled research products, and then coded the answers into binary indicators of the various approaches used (e.g., "used design-based approach" or "ignored sample design features in variance estimation"). Two (2) of the 50 sampled SDR products and three (3) of the 50 sampled NSRCG products were found to be general SESTAT review articles upon reading the text in detail, and we were unable to find additional "eligible" SDR or NSRCG products from the same time periods as these five (5) sampled products in a detailed search of the available literature. This resulted in a sample of 145 research products for analysis (see Fig 1 above). Indeed, many of the unique Google Scholar links presented in Table 1 were for research products that referenced analyses of data from these three surveys, but did not formally present analyses of data from these surveys (i.e., these research products were ineligible for this study). The appropriate codes for each of the 145 products were reviewed and agreed upon by the entire research team, each of whom reviewed all of the products.
The final data set of coded articles analyzed in this study (available in Excel format; see S1 Data) represents a body of evidence with regard to analytic approaches that have been employed by analysts of SESTAT data sets since their initial public availability. We note that in assigning these codes, we are not classifying particular approaches as "correct" or "incorrect," but rather painting an empirical picture of the types of approaches that tend to be described in research products by analysts of these data. Evidence of consistent failure to account for sample design features in the analyses would suggest potentially high prevalence of analytic errors in these products.

Statistical Analyses
We employed standard descriptive techniques for estimation based on stratified random samples to compute estimates of the prevalence of each type of analytic approach (both overall and for each survey), in addition to standard errors of the estimates (based on the stratified random sampling model). We also generated descriptive plots indicating the estimated prevalence of each type of analytic approach as a function of the year of publication, for each of the three survey programs. We specifically focused on publication decades (e.g., 2000 or earlier, 2001-2010, etc.) when assessing the trends.
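For readers unfamiliar with estimation under a stratified random sampling model, the Python sketch below shows the textbook estimator of a proportion and its standard error. The stratum counts are hypothetical and purely illustrative; they are not the counts from this study, and the function name is an assumption.

```python
from math import sqrt

def stratified_proportion(strata):
    """Estimate a population proportion and its standard error under
    stratified random sampling.

    Each stratum is a tuple (N_h, n_h, a_h): population size, sample size,
    and number of sampled units with the attribute of interest.
    """
    total_N = sum(N_h for N_h, _, _ in strata)
    estimate = 0.0
    variance = 0.0
    for N_h, n_h, a_h in strata:
        W_h = N_h / total_N          # stratum share of the population
        p_h = a_h / n_h              # within-stratum sample proportion
        fpc = 1.0 - n_h / N_h        # finite population correction
        estimate += W_h * p_h
        variance += W_h ** 2 * fpc * p_h * (1.0 - p_h) / (n_h - 1)
    return estimate, sqrt(variance)

# Hypothetical strata (e.g., groups of adjacent sampling intervals).
strata = [(100, 10, 4), (200, 20, 10), (300, 20, 12)]
prevalence, std_error = stratified_proportion(strata)
```

With these illustrative inputs the estimated prevalence is the population-share-weighted average of the three stratum proportions (0.4, 0.5, and 0.6).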
We then fit logistic regression models to these data, where individual products are grouped within the individual surveys. The binary indicator for a given type of analytic approach (e.g., using the sampling weights in estimation) was the dependent variable in these models, and the models included fixed effects of publication year (mean-centered within each survey) and possibly other functional forms of publication year (depending on the observed trends). The models also included fixed effects of survey name and interactions between the survey name and year (to determine whether the prevalence of particular approaches is changing over time in a different fashion for the different surveys). The models were fitted using the -logit- command in the Stata software (Version 14+). These trend analyses were purely exploratory; we did not have any a priori expectations with regard to the variation in prevalence or trends between the SESTAT surveys. Given the recent proliferation of software for analyzing survey data and quality references on the topic, we did expect to see an overall decrease in the prevalence of analytic errors as a function of publication year; this was suggested by West et al. [29]. We also examined relationships between the type of product (book chapter / technical report, conference presentation / proceedings paper, or formal journal article) and the prevalence of each type of error, separately for each survey and overall.
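The predictor structure of these models can be sketched as follows. This Python fragment only illustrates how the covariates might be set up (year mean-centered within survey, survey indicators, and survey-by-year interactions) before being passed to any logistic regression routine; it is not the Stata code used in the study, and the function name, column names, choice of reference survey, and example products are all hypothetical.

```python
from collections import defaultdict

def build_design_rows(products, reference_survey):
    """Build predictor rows: intercept, centered year, survey dummies,
    and survey-by-year interactions.

    `products` is a list of (survey, year) tuples.
    """
    years_by_survey = defaultdict(list)
    for survey, year in products:
        years_by_survey[survey].append(year)
    mean_year = {s: sum(ys) / len(ys) for s, ys in years_by_survey.items()}
    other_surveys = sorted(s for s in years_by_survey if s != reference_survey)

    rows = []
    for survey, year in products:
        centered = year - mean_year[survey]   # mean-centered within survey
        dummies = [1.0 if survey == s else 0.0 for s in other_surveys]
        interactions = [d * centered for d in dummies]
        rows.append([1.0, centered] + dummies + interactions)

    columns = (["intercept", "year_c"]
               + [f"survey_{s}" for s in other_surveys]
               + [f"survey_{s}_x_year_c" for s in other_surveys])
    return columns, rows

# Hypothetical products; the years are illustrative only.
products = [("NSCG", 1998), ("NSCG", 2010), ("SDR", 2004), ("SDR", 2012),
            ("NSRCG", 2001), ("NSRCG", 2007)]
columns, rows = build_design_rows(products, reference_survey="SDR")
```

Centering year within survey keeps the survey main effects interpretable as differences at each survey's average publication year, while the interaction columns allow the time trend to differ by survey.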
Next, for the peer-reviewed journal articles, we examined the relationships of journal-level features with indicators of the various analytic approaches, assessing the relationships of all journal features described above with the binary indicators in an exploratory fashion. Fixed effects of these covariates were added to the logistic regression models fitted to the data recorded from the articles, enabling identification of significant journal-level correlates of the analytic approaches used. We also analyzed the co-occurrence of particular analytic approaches (e.g., failing to use weights and failing to use specialized variance estimation methods). To this end, binary indicators of co-occurrence of the various possible approaches were constructed and then modeled using the same approaches described above for the individual indicators.
Finally, we considered the implications of making analytic errors for inferences related to key variables measured in two of the three SESTAT surveys. We did not consider example analyses of the NSRCG data, as this survey was discontinued in 2010 and absorbed into the NSCG. We focused on possible errors made when using a design-based approach, given that this approach is more widely used by non-statisticians and more readily available in existing software. We first reviewed the research products that we sampled and worked with NSF program officers affiliated with the two surveys to identify key variables that are frequently analyzed by researchers working with these data, in addition to regression models that may be of substantive interest to researchers. Next, we requested the replicate weights for each of the two surveys from NCSES staff, in addition to documentation describing the use of these replicate weights.
For each of the key variables and models from the two surveys listed in Table 2 below, we then considered three alternative approaches to making inferences about descriptive parameters (means, percentages) and analytic parameters (regression coefficients), which included the calculation of estimated standard errors for the estimated parameters and 95% confidence intervals for the parameters:
1. Fully accounting for the complex sampling features, using the weights in estimation and the replicate weights for variance estimation;
2. Using the weights in estimation and Taylor Series Linearization (TSL) for variance estimation (which recognizes variance in the weights), but ignoring the replicate weights (which capture complex sampling features such as stratification) when estimating the variances; and
3. Completely ignoring the complex sampling features.
Table 2. Key variables and regression models analyzed from two of the three SESTAT surveys to assess the implications of making analytic errors for inferences related to descriptive and regression parameters.
When comparing the results from approach 2) to approach 1), we computed the ratio of the estimated variances to assess the effect of ignoring the replicate weights (and therefore the complex sample design features) on the variance estimate. When comparing the results from approach 3) to approach 1), we estimated both the bias in the unweighted estimate (defined as the difference between the unweighted and weighted estimate, treating the weighted estimate as unbiased) and the overall misspecification effect [34] on the variance estimate due to completely ignoring the complex sampling features. All analyses were performed using the SURVEYMEANS, SURVEYFREQ, SURVEYREG, and SURVEYLOGISTIC procedures in the SAS software (Version 9.4; SAS Institute, Cary, NC). The S2 Data file in the supporting information contains the SAS code used to download the public-use SESTAT data files, generate the variables for analysis, and perform all of the analyses.
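The comparisons among the three approaches reduce to simple arithmetic on the estimates and their estimated variances, as the Python sketch below illustrates. The input values are hypothetical, and the direction of each ratio is an assumption: the text does not state which variance is placed in the numerator, so the sketch divides the approach 2) variance by the approach 1) variance, and defines the misspecification effect as the approach 1) variance divided by the approach 3) variance.

```python
def compare_approaches(weighted_est, var_replicate, var_tsl,
                       unweighted_est, var_srs):
    """Summarize the comparisons among the three analysis approaches.

    var_replicate: variance from approach 1 (weights + replicate weights)
    var_tsl:       variance from approach 2 (weights + linearization only)
    var_srs:       variance from approach 3 (design ignored entirely)
    """
    return {
        # Approach 2 vs. 1: effect of ignoring the replicate weights
        # (ratio direction is an assumption).
        "variance_ratio": var_tsl / var_replicate,
        # Approach 3 vs. 1: bias, treating the weighted estimate as unbiased.
        "bias": unweighted_est - weighted_est,
        # Approach 3 vs. 1: misspecification effect on the variance
        # (ratio direction is an assumption).
        "meff": var_replicate / var_srs,
    }

# Hypothetical inputs for a single estimated proportion.
summary = compare_approaches(
    weighted_est=0.60, var_replicate=0.0004, var_tsl=0.0003,
    unweighted_est=0.65, var_srs=0.0002,
)
```

Under these definitions, a variance ratio below 1 or a misspecification effect above 1 indicates that the simplified approach understates the sampling variance relative to the full design-based analysis.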

Results
For each of the three SESTAT surveys individually (and also across all three surveys), Table 3 presents prevalence estimates based on binary indicators of different analytic approaches employed across all years represented in the samples. From Table 3, we see initial evidence of some variation across the surveys in the frequency with which investigators employ certain types of approaches. Research products presenting analyses of the SDR and NSCG data are slightly more likely to use the available sampling weights in estimation, but not more likely to use appropriate variance estimation techniques. Design-based approaches were much more common in NSCG products, and SDR products are more likely to describe results with respect to the larger target population. Overall, we found that the sampling weights available in the public-use data files were accounted for in only about half of the sampled research products, and appropriate variance estimation and/or subpopulation estimation was rarely used (7.6% of publications and 10.7% of publications using design-based approaches, respectively). Nearly 75% of the sampled publications described results with respect to the population rather than the sample, and while a failure to do this is a relatively minor type of error, this is important when describing inferences arising from these types of analyses. Fig 2 presents trends in the prevalence of each of these analytic approaches as a function of the decade in which a sampled research product was first available, for each of the three surveys. Fig 2 does not present evidence of any significant trends in the prevalence of the different types of analytic approaches over time; that is, the prevalence of using these approaches is fairly stable, centered on the overall estimates for each survey in Table 3.
While there is slight evidence of an increase over time in the proportion of research products appropriately using weights in estimation for the NSCG survey (the top-left panel of Fig 2), there do not appear to be consistent trends in the use of appropriate variance estimation (the top-right panel), appropriate subpopulation estimation when design-based approaches are used (the lower-left panel), or descriptions of results with respect to the larger target population (the lower-right panel). In fact, the probability of appropriately describing results with respect to the larger target population (rather than the sample) for the NSRCG is decreasing over time.
The plots in Fig 2 suggest that appropriate variance estimation and subpopulation estimation are rarely performed across the three surveys, and that the probability of this behavior is not changing over time. Logistic regression models fitted to each indicator confirmed these visual assessments, with no significant decade effects or significant interactions between decade and survey. We did find in these models that when adjusting for decade, the odds of using a design-based approach in the NSCG were more than three times higher compared to the SDR (adjusted odds ratio = 3.1, 95% CI = 1.3-7.4), consistent with what we found in Table 3. We also found that the odds of describing results with respect to the larger target population were nearly six times higher in the SDR when compared to the NSRCG (adjusted odds ratio = 5.8, 95% CI = 1.7-19.1) and more than five times higher when compared to the NSCG (adjusted odds ratio = 5.4, 95% CI = 1.6-17.5), consistent with the results in Fig 2 and Table 3. Given the findings that users of the NSCG were more likely to employ design-based approaches but less likely to describe results with respect to the larger target population, we also examined whether the use of weights in estimation or the use of a design-based approach to the analysis increased the probability of describing results with respect to the target population. Interestingly, when adjusting for decade and survey, the use of weights in estimation and the use of a design-based approach did not significantly affect the probability of describing results with respect to the target population (as opposed to the sample). This finding reflects a possible disconnect between the use of appropriate methods and the use of appropriate language to describe the results of these types of analyses. Table 4 shows the estimated prevalence of each type of error across different types of research products, for each survey.
In none of these analyses did we find a significant association between the type of product and the indicator of the approach used, suggesting that the same approaches tend to be used regardless of the type of product. In general, conference papers were the least likely to account for weights in estimation (42.9%), despite being the most likely to use design-based approaches (64.3%). This was an interesting result, and we reviewed what was happening in particular in the case of the NSCG, where there were four sampled conference papers (only one of which used weights in estimation, resulting in the 25% reported in Table 4). Three of the four sampled conference papers appeared to use a design-based analysis rather than specifying formal models for the variables of interest (i.e., essentially assuming a simple random sample), but two of the three simply failed to mention the use of weights in estimation at any point. Notably, across the three surveys, published journal articles were the least likely types of scientific products to perform appropriate variance estimation (4.0%) and perform appropriate subpopulation analyses when design-based approaches were employed (7.1%).
We next assessed bivariate associations between journal-specific factors and each coded indicator variable. First, Table 5 presents some descriptive characteristics of the journals in which peer-reviewed articles were published. We note in Table 5 that the journals in which these articles were published rarely provide statistical guidance with regard to analyzing complex sample survey data on their websites or in their submission guidelines, and that less than 50% of the articles (overall) were published in journals with dedicated statisticians on their review boards.
Second, considering the associations of journal-specific factors with the indicators analyzed above, we found that NSCG articles published in journals with dedicated statistical reviewers were substantially more likely to employ design-based approaches (92.9% vs. 58.8% in journals without dedicated statistical reviewers, p < 0.05), suggesting that statistical reviewers will typically require authors to at least consider design features in their analysis. We also found that all of the published journal articles using appropriate variance estimation techniques and appropriate subpopulation analysis approaches were published in journals with dedicated statistical reviewers. We see these results as motivation for future practice, where forcing authors to think carefully about complex sampling features (regardless of the approach used) may reduce potential analytic errors. This is especially important in light of the finding that journal articles were the least likely to use these appropriate variance estimation approaches. Finally, there were other common issues that emerged when we were reviewing the sampled research products. We often noted references to the presentation of "robust" standard errors (usually in the footnotes of tables), without any additional clarification of how these standard errors were computed. "Robust" standard errors could refer to a number of different types of variance estimators, and simply referring to "robust" standard errors does not clarify whether complex sampling features (such as stratification, which would generally result in more precise estimates) were accounted for in their computation. Furthermore, many of the articles coded as using model-based approaches did not account for the complex sampling features in any way in the model specification. 
When model-based approaches are used, it is important to make sure that features of the sample designs (e.g., sampling strata in the SDR) are at the very least included in some way in the models, to make the sample design features ignorable in the context of the larger overall estimation objectives. This includes the sampling weights, which were quite often ignored when these "model-based" approaches were employed. Finally, we found that explicit mention of the names of statistical software procedures used to perform the analyses was exceedingly rare (only 5 of the 145 sampled products). This type of information can help to make the analysis approaches used more transparent, and will also help to enable reproducible research.

Implications of Making Analytic Errors
Before presenting the results of our analyses based on the alternative analytic approaches, we begin with some theoretical expectations to guide our review and interpretation of the results. First, considering the 2010 SDR, the online SESTAT documentation (https://ncsesdata.nsf.gov/doctoratework/2010/sdr_2010_tech_notes.pdf) indicates that a stratified sample design was employed, resulting in unequal probabilities of selection for persons from different sampling strata. Weights were constructed for SDR respondents that reflected the unequal probabilities of selection and also adjustments for differential nonresponse across strata, and replicate weights were constructed that reflected the stratified sample design and also captured uncertainty in the nonresponse adjustments for variance estimation purposes [31][32].
In theory, one would therefore expect that descriptive population estimates based on survey variables with values that vary widely across selected sampling strata subject to oversampling (e.g., those with disabilities and ethnic minorities) would be subject to bias if the SDR respondent weights were ignored in estimation. In terms of variance estimation, the use of the highly variable respondent weights in estimation would be expected to increase variance estimates [1], but the stratified sampling would be expected to decrease variance estimates for descriptive estimates based on variables with values that vary widely across the strata. This is due to the fact that stratified sampling based on variables that are homogeneous within strata and heterogeneous between strata will increase the precision of survey estimates [1]. Using the weights only in estimation (and ignoring the replicate weights for variance estimation purposes) could thus result in increases in variance estimates that would not be offset by the expected gains in precision due to stratified sampling, especially for those variables with values that varied across sampling strata. For SDR variables that are not strongly associated with the sampling strata, the use of weights in estimation and accounting for the replicate weights would generally lead to an increase in variance estimates relative to ignoring the design features entirely. Expected effects of the SDR sample design would therefore depend on the variable being analyzed, but we would expect that using the weights only in estimation may be problematic for the efficiency of estimates based on variables strongly associated with the SDR sampling strata.
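These expectations can be illustrated with a small simulation. The following is a minimal sketch using an entirely hypothetical two-stratum population (the stratum names, sizes, means, and sample sizes are assumptions chosen for illustration and are not SDR values). It shows that oversampling a stratum whose values differ from the rest of the population biases the unweighted mean, that the weighted mean recovers the population mean, and that the variability of the weights inflates variance by roughly the factor 1 + cv²(w) in Kish's well-known approximation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population: a small stratum that is oversampled and a large one.
pop_sizes = {"oversampled": 10_000, "other": 90_000}
pop_means = {"oversampled": 40.0, "other": 50.0}
sample_sizes = {"oversampled": 500, "other": 500}

y_parts, w_parts, pop_total = [], [], 0.0
for stratum, N_h in pop_sizes.items():
    pop_h = rng.normal(pop_means[stratum], 5.0, size=N_h)
    pop_total += pop_h.sum()
    n_h = sample_sizes[stratum]
    # Simple random sample without replacement within the stratum.
    y_parts.append(rng.choice(pop_h, size=n_h, replace=False))
    # Base weight = inverse of the selection probability n_h / N_h.
    w_parts.append(np.full(n_h, N_h / n_h))

y = np.concatenate(y_parts)
w = np.concatenate(w_parts)

pop_mean = pop_total / sum(pop_sizes.values())
unweighted_mean = y.mean()                    # ignores unequal selection probabilities
weighted_mean = np.sum(w * y) / np.sum(w)     # design-weighted estimate

# Kish's approximate variance inflation due to unequal weights: 1 + cv^2(w).
kish_factor = 1.0 + np.var(w) / np.mean(w) ** 2

print(f"population mean  : {pop_mean:.2f}")
print(f"unweighted mean  : {unweighted_mean:.2f}")
print(f"weighted mean    : {weighted_mean:.2f}")
print(f"Kish 1 + cv^2(w) : {kish_factor:.2f}")
```

The Kish factor captures only the loss of precision from variable weights; it does not reflect the gains from stratification, which is why variance estimation that uses the weights alone can overstate the variance for variables strongly related to the strata.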
Next, considering the 2010 NSCG, a stratified sample design was also employed, only using the American Community Survey (ACS) as a sampling frame [35]. This procedure once again resulted in unequal probabilities of selection for persons from different sampling strata, in part due to oversampling of particular subgroups based on ACS information (e.g., whether or not a person had a science and engineering degree) and the use of probability proportionate to size (PPS) sampling of persons within strata, mainly based on ACS weights [35]. Weights were also constructed for NSCG respondents that reflected the unequal probabilities of selection and adjustments for differential nonresponse across strata, and replicate weights were constructed that reflected the stratified sample design and again captured uncertainty in the nonresponse adjustments for variance estimation purposes [31][32]. We therefore have similar theoretical expectations regarding the effects of ignoring either the weights or the replicate weights on descriptive estimation and variance estimation: estimates based on variables that are related to the NSCG sampling strata (e.g., correlates of having a science and engineering degree) will tend to be biased if the weights are ignored, and the use of replicate weights for variance estimation has the potential to capture gains in sampling efficiency from the stratified sampling and offset some of the increases in the variance of estimates due to the use of the highly variable weights in estimation. We do note that fully accounting for complex sample designs that also feature cluster sampling within strata (unlike the SDR and NSCG) when estimating variances, via replicate weights or stratum and cluster codes, would likely increase standard errors further relative to the use of weights only, due to the inefficiencies introduced by cluster sampling [2,11]. This would not be our general expectation here, given the sample designs used for the SDR and the NSCG.
Finally, considering theoretical expectations with regard to the estimation of regression models, the use of the SDR and NSCG weights in estimation will generally lead to unbiased population estimates of regression coefficients. However, the use of weights in estimation could also lead to inefficient estimates of regression coefficients (i.e., estimates with standard errors that are excessively large) if a model has been well-specified and the weights do not provide any information about the estimated coefficients [16,36]. Expected gains in the efficiency of estimates due to stratified sampling would likely not be as large in the case of estimated regression coefficients as in the case of descriptive parameters like means and proportions, given that complex samples are typically designed with descriptive parameters in mind [1]. We remind readers that this is an active area of research, where several methods have been developed to examine whether survey weights should be used when estimating regression coefficients [36]. We examine changes in estimates due to the use of weights in estimation in this section, and whether the stratified sampling does tend to partially offset losses in efficiency due to the use of weights when estimating the regression coefficients of interest.
We now consider some observations related to the estimation of descriptive parameters from the two surveys. For each of the key variables identified in the two surveys (Table 2), Table 6 presents estimates of percentages or means, estimated standard errors of the estimates, and confidence intervals for the descriptive parameters, using the three alternative analytic approaches (the latter two of which involve some form of analytic error). We also include the aforementioned measures of bias in the unweighted estimates, along with the ratios of variance estimates that enable comparisons of estimated variances when fully accounting for the complex sample design versus accounting for the weights only, and when fully accounting for the complex sample design versus ignoring it entirely (the misspecification effect, or MEFF). The results in Table 6 demonstrate that a failure to account for the complex sample design features in analysis can have severe implications for descriptive estimates and inferences based on those estimates. First, considering the 2010 SDR, a failure to use the final SDR weights in the analysis generally has modest implications for the estimates (see Fig 3), with the most extreme changes noted for race / ethnicity. This was expected in theory, given that these demographic features were used to define sampling strata with different sampling rates. Slight changes in inference are observed for percentages describing distributions of current salary, attending professional meetings in the past year, major fields of Science and Engineering (S & E), and labor force status.
More noticeable in the case of the 2010 SDR is consistent evidence of a failure to use the replicate weights in variance estimation leading to variance estimates that tend to be too large, as was expected in theory. For a few estimates, use of the replicate weights tends to increase the variance estimates (MEFFs greater than 1), but for most estimates, the replicate weights capture gains in precision of the estimates (MEFFs less than 1) due to the stratified sampling employed in the SDR (Table 1). This pattern is apparent in Fig 4, where the majority of the standard errors for the estimates fall below the 45-degree line. This means that standard errors based on the replicate weights are smaller than standard errors for the same estimates that reflect variance in the survey weights only. These observed gains in efficiency would be lost if analysts failed to account for the stratified sampling in variance estimation. Simply using the final weights alone in analysis (in the absence of the replicate weights) does not adequately capture these important gains in efficiency. Next, considering the estimates for the 2010 NSCG in Table 6, we find that a failure to use the final NSCG weights in estimation has much more severe implications for the resulting estimates relative to the SDR. For the vast majority of the estimates (and especially those related to working in science and engineering fields, as expected), there are substantial changes in the sizes of the estimates when using the weights for estimation (see Fig 3), and inferences would change noticeably regardless of the variance estimation approach employed. These large biases in the unweighted estimates underscore the importance of using the final NSCG weights correctly in estimation; the weights are highly correlated with several of the key measures of interest. 
Examining the ratios of variances, we note that the misspecification effects tend to be greater than 1, suggesting a general increase in the variance of the estimates that is primarily being driven by the highly variable respondent weights in the NSCG (see Fig 5). However, similar to the SDR analyses, we once again note that a failure to fully account for the stratified sampling (i.e., just using the weights in the analysis) would lead to variance estimates that are too large; this pattern is once again evident in Fig 4. While the misspecification effects still tend to be greater than 1 due to the variable respondent weights (Fig 5), a failure to fully account for the stratified sampling would result in variance estimates that were excessively large, and overly conservative inferences.
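The variance ratios discussed above can be made concrete with a short sketch. The code below is illustrative only and is not the software used for the analyses in this article. It computes a replicate-weight variance estimate of a weighted mean in the generic form v = c Σ_r (θ_r − θ)², a weights-only linearization variance that ignores strata, and a variance that ignores the design entirely, and then forms the two ratios. The multiplier `scale` (c) depends on the replication method documented for a given survey and is treated here as an input. Because the SESTAT replicate weights are not reproduced here, the example builds delete-one jackknife replicate weights for toy data, which is an assumption made purely so that the code runs; such toy replicates do not encode any stratification.

```python
import numpy as np

def weighted_mean(y, w):
    return np.sum(w * y) / np.sum(w)

def replicate_variance(y, full_w, rep_w, scale):
    """Generic replication variance: scale * sum_r (theta_r - theta)^2."""
    theta = weighted_mean(y, full_w)
    theta_r = np.array([weighted_mean(y, rep_w[:, r])
                        for r in range(rep_w.shape[1])])
    return scale * np.sum((theta_r - theta) ** 2)

def weights_only_variance(y, w):
    """Linearization variance of the weighted mean, using the weights but
    treating the sample as unstratified and selected with replacement."""
    n = len(y)
    z = w * (y - weighted_mean(y, w)) / np.sum(w)
    return n / (n - 1) * np.sum(z ** 2)

def srs_variance(y):
    """Variance of the unweighted mean, ignoring the design entirely."""
    return np.var(y, ddof=1) / len(y)

def jk1_replicate_weights(w):
    """Toy delete-one jackknife replicates (assumption for illustration)."""
    n = len(w)
    rep_w = np.tile(w[:, None], (1, n)) * n / (n - 1)
    np.fill_diagonal(rep_w, 0.0)   # replicate r drops unit r
    return rep_w, (n - 1) / n      # replicate weights and their multiplier

# Toy data: hypothetical outcome and weights, not SESTAT values.
rng = np.random.default_rng(1)
y = rng.normal(50.0, 10.0, size=200)
w = rng.uniform(1.0, 5.0, size=200)
rep_w, scale = jk1_replicate_weights(w)

v_full = replicate_variance(y, w, rep_w, scale)
ratio_vs_weights_only = v_full / weights_only_variance(y, w)
meff = v_full / srs_variance(y)    # misspecification effect
print(ratio_vs_weights_only, meff)
```

With real replicate weights that reflect stratified sampling, a ratio of the replicate-based variance to the weights-only variance below 1 corresponds to the precision gains described above, while a MEFF above 1 indicates that ignoring the design entirely would understate the variance.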
We now consider the implications of making analytic errors for inferences related to regression model parameters. Table 7 presents estimated regression parameters (along with estimated standard errors and 95% confidence intervals) in the models described in Table 2 for each of the two surveys, following the three different analytic approaches. We also include the aforementioned variance ratios, in addition to design-adjusted multi-parameter Wald tests for the terms included in the models, enabling overall (or omnibus) conclusions about the importance of the terms included in the models (e.g., the overall importance of the major degree field × race/ethnicity interaction in the logistic regression model for the probability of having a salary greater than $150K, based on the 2010 SDR data).
First considering the estimated models for the 2010 SDR in Table 7, we find that overall inference related to the importance of the major degree field × race/ethnicity interaction in the model predicting salary greater than $150K would change depending on whether the complex sampling features were taken into account. When fully accounting for the complex sampling features, one would conclude that this interaction is significant (based on the design-adjusted Wald test), whereas simply accounting for the weights or ignoring the design features entirely would lead to different conclusions altogether. Closer inspection of the results reveals that this change in inference is largely due to increased precision of the estimates when accounting for the stratified sample design of the SDR via the replicate weights (which was possible in theory); several ratios of variance estimates based on fully accounting for the complex sampling (versus using the weights only) are less than 1, and this pattern is evident in Fig 4. In this first model, the misspecification effects that would arise when completely ignoring the complex sampling features vary slightly around 1.0 and depend on the estimate (Fig 5).
In the second SDR model, we find that accounting for the complex sampling features does not have a large impact on inferences related to the relationships of principal job category and race / ethnicity with hours worked per week, suggesting that this model was fairly well-specified. Regardless of the analysis approach used, we would conclude that the differences between the race / ethnicity groups in the distribution of hours worked per week clearly depend on the principal job category. We do note that for this model, fully accounting for the complex sampling features tends to result in MEFF values that are greater than 1, suggesting that the SDR stratification resulted in larger gains in the efficiency of the estimated coefficients for the salary model than for the model predicting hours worked per week. In general, failing to account for the complex sampling features in the second model (for hours worked per week) would simply lead to slightly understated standard errors. Avoiding these slight losses in the efficiency of the estimates by ignoring both the respondent weights and the replicate weights may not be problematic if the model was well-specified and the weights are not carrying any information about the estimated coefficients [16,36].
Table 7. Estimated regression parameters, standard errors, Wald tests, confidence intervals, and misspecification effects in four regression models fitted to data from the 2010 SDR and NSCG surveys when following the three alternative analytic approaches.
Next, considering the first NSCG model for log-transformed current salary, we find that a failure to incorporate the final NSCG weights in estimation would lead to completely different inference regarding the main effect of having a science and engineering degree on salary. When ignoring the NSCG weights, there is no evidence of those with a science and engineering degree having a different mean salary from those with a different degree (given the non-significant interaction between gender and type of degree). However, when using the weights in estimation, we see evidence of a much larger positive (and significant) effect of having a science and engineering degree on expected current salary. A failure to account for the weights would thus lead to a completely different conclusion regarding the benefits of having a degree in this area (see Fig 3). We also see evidence of substantially understated standard errors for the estimated regression coefficients when completely ignoring the complex sampling features, with none of the misspecification effects falling below 2.0 (and one as large as 9.0); this pattern is evident in Fig 5. Most of the impact of the complex sampling on the standard errors comes from the variance in the weights, as accounting for the additional complex sampling features via the replicate weights does not lead to substantial changes in the estimated standard errors.
Finally, considering the second NSCG model for the binary indicator of having a job in a science and engineering field, we see that ignoring the weights in estimation leads to substantial changes in the estimates and corresponding inferences. When ignoring the weights in estimation, one would conclude that there is strong evidence of an interaction between gender and race / ethnicity when predicting the probability of having a science and engineering job. When accounting for the weights in estimation, there is no longer evidence of a significant interaction, and the estimated coefficients shift substantially. In addition, we once again see evidence of substantially understated standard errors for the estimated regression coefficients when completely ignoring the complex sampling features, with none of the misspecification effects falling below 1.7 (and one as large as 4.0); see Fig 5. We also see additional evidence of fully accounting for the stratified sampling (via the replicate weights) introducing more efficiency in the estimates relative to just using the weights alone (Fig 4). As was expected, we therefore see consistent evidence in both the SDR and the NSCG of the possible gains in efficiency from fully accounting for the stratified sampling via the replicate weights (relative to using the respondent weights alone), whether generating descriptive estimates or estimating regression models.
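For readers who wish to see the mechanics behind replicate-based inference for regression coefficients, the following is a minimal sketch for a linear model only; it is not the procedure used to produce Table 7, and the logistic models would require an iterative fit instead. It estimates weighted least squares coefficients with the full-sample weights, re-estimates them with each set of replicate weights, forms the covariance matrix as c Σ_r (β_r − β)(β_r − β)′, and computes an unadjusted multi-parameter Wald statistic. The data, the variable names, and the toy delete-one jackknife replicate weights are assumptions for illustration, and the degrees-of-freedom adjustment that survey software applies to obtain a design-adjusted test is not shown.

```python
import numpy as np

def wls_coef(X, y, w):
    """Weighted least squares: solve (X'WX) beta = X'Wy."""
    Xw = X * w[:, None]
    return np.linalg.solve(Xw.T @ X, Xw.T @ y)

def replicate_cov(X, y, full_w, rep_w, scale):
    """Full-sample coefficients and their replication-based covariance."""
    beta = wls_coef(X, y, full_w)
    diffs = np.array([wls_coef(X, y, rep_w[:, r]) - beta
                      for r in range(rep_w.shape[1])])
    return beta, scale * diffs.T @ diffs

def wald_statistic(beta, cov, idx):
    """Unadjusted Wald statistic for H0: beta[idx] = 0."""
    b = beta[idx]
    V = cov[np.ix_(idx, idx)]
    return float(b @ np.linalg.solve(V, b))

# Toy data (hypothetical): outcome depends on x1 but not on x2.
rng = np.random.default_rng(2)
n = 200
x1, x2 = rng.normal(size=n), rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])
y = 1.0 + 2.0 * x1 + rng.normal(size=n)
w = rng.uniform(1.0, 5.0, size=n)

# Toy delete-one jackknife replicate weights (assumption for illustration).
rep_w = np.tile(w[:, None], (1, n)) * n / (n - 1)
np.fill_diagonal(rep_w, 0.0)
scale = (n - 1) / n

beta, cov = replicate_cov(X, y, w, rep_w, scale)
std_errors = np.sqrt(np.diag(cov))
wald_x1 = wald_statistic(beta, cov, [1])        # single-term test
wald_joint = wald_statistic(beta, cov, [1, 2])  # omnibus test of both slopes
print(beta, std_errors, wald_x1, wald_joint)
```

Comparing `std_errors` from this approach against standard errors computed with the weights only, or with no design information at all, yields ratios analogous to the variance ratios and misspecification effects reported for the regression models.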

Discussion
We highlight six key findings in this study:
1. The sampled research products rarely accounted for the complex design features of the samples underlying the SESTAT survey data, and these prevalence rates did not vary across the three SESTAT surveys: only 55% of the products incorporated the publicly-available sampling weights into the analyses, only 8% of the products accounted for the complex sampling features when estimating variances, and only 11% of the products presenting design-based analyses performed appropriate subpopulation analyses accounting for the complex sampling [2].
2. Slightly more than half of the sampled products (56%) used design-based (vs. model-based) approaches (especially NSCG products), and while the majority of the products (74%) described results with respect to the target populations of the SESTAT samples (especially SDR products), accounting for sampling weights or using design-based approaches was not associated with this method of describing the results.
3. There was no evidence of trends in the prevalence of the different analytic approaches over time.
4. Different types of products did not vary in terms of the prevalence of the approaches used, but peer-reviewed journal articles had the lowest rates of accounting for the complex sampling features when estimating variances.
5. The presence of statistical reviewers on the editorial boards of peer-reviewed journals (44% of the articles published in peer-reviewed journals) increased the probability of accounting for the complex sampling features in analysis.
6. A failure to fully account for the complex sampling features of the SESTAT data sets in analysis has critical implications for inferences related to popular descriptive and analytic (regression) parameters based on these data.
The first two findings are largely consistent with the results of one pilot study of the analytic error problem that examined 100 research products from different (i.e., non-SESTAT) surveys [29][30], such as the National Health and Nutrition Examination Survey (NHANES), and provide additional evidence suggesting that secondary analysts may be making analytic errors quite frequently when working with public-use survey data sets. A failure to account for sampling weights in estimation can substantially bias population estimates of key descriptive parameters, and a failure to account for complex sampling features when estimating the variances of estimates can lead to incorrect statements regarding sampling variability. Furthermore, if roughly half of secondary analysts are using model-based approaches to analyze the SESTAT data, these models need to account for the complex sampling features in some way to make sure that they are not informative regarding the estimates of interest, and we rarely found evidence of this approach being used. The next four findings contribute unique knowledge about the analytic error problem. This study assessed the prevalence of apparent analytic errors in different types of research products (including conference proceedings papers and book chapters), for different subject matter (describing the college-educated science and engineering work force in the U.S.), across multiple decades (with a stratified sample of research products, with decades treated as strata). We also extended prior knowledge related to the problem of analytic error by presenting the implications of actually making analytic errors when using the public-use SESTAT data files (given that selected SESTAT design information is only available upon request), finding several examples of inferences that would change substantially when failing to account for the complex sampling features.
The relatively static prevalence of SESTAT investigators employing "appropriate" analytic approaches over time raises concern about whether there has been sufficient dissemination of knowledge across different fields with regard to appropriate techniques for analyzing survey data. Taken together with 1) the relatively low prevalence of appropriate approaches found in this study and 2) the finding that peer-reviewed articles in journals with dedicated statistical reviewers were more likely to use theoretically appropriate approaches, we feel that reviewers and consumers of these research products should take more care in making sure that appropriate methods for survey data analysis have been employed by the study authors. This type of peer feedback can play an important role in dissemination of knowledge about the importance of using these methods to avoid analytic errors when making inferences about larger populations based on survey data. While word limits for academic journals (which we were only able to determine for about half of the peer-reviewed articles) may ultimately lead to the removal of details describing the analytic methods used in a given study, we feel that transparency regarding analytic methods is essential for enabling reproducible research and confirming that a given study has employed analytic techniques appropriate for survey data.
Furthermore, web sites providing guidance for individuals submitting manuscripts should explicitly indicate that any products presenting secondary analyses of complex sample survey data need to demonstrate that they have sufficiently incorporated any available complex sampling features into the analyses presented. As outlined earlier in this paper, there are many theoretically sound design-based and model-based approaches currently available to secondary analysts that can be implemented using standard statistical software (especially design-based methods), so there is no reason that secondary analysts should not be at least considering the effects of these complex sampling features on their analyses in future publications. Adding these requirements to web sites accepting these types of research products will also help to ensure that analysts are taking all steps to avoid the possibility of making analytic errors. We also encourage faculty and researchers from more applied fields teaching courses on research methods to place more emphasis on analytic techniques for survey data in their courses. This will also help to enhance the dissemination of knowledge regarding appropriate analytic techniques to different fields.
Finally, additional replications of this study using other survey data sources in general would provide more empirical background regarding the magnitude of this problem. The two studies conducted to date (including this one) have analyzed a total sample of 245 scientific products, and while the review and coding of these publications is fairly time-intensive, this is still a very small sample of all research products that have ever presented secondary analyses of complex sample survey data. For example, one could consider other major national surveys focusing on educational subject matter, such as the Programme for International Student Assessment (PISA) or the National Assessment of Educational Progress (NAEP). We used Google Scholar to identify (in a non-random fashion) 10 of the most frequently cited peer-reviewed journal articles presenting analyses of PISA and NAEP data, just for illustration purposes. We found that among 7 articles presenting analyses of PISA data and 3 articles presenting analyses of NAEP data (see S1 Text), two articles completely ignored the complex sampling features (weights and variance estimation codes) in analyses, one article ignored the weights in estimation but accounted for the complex sampling features in variance estimation, and two articles used incorrect methods for subpopulation analysis. The prevalence of analytic error may well vary across different types of survey programs, and could be a function of the documentation available for secondary analysts or the ease with which one can obtain data describing the complex sampling features. Additional evidence of potential analytic errors in other contexts would further underscore the importance of educating researchers and scientists from other fields about the implications of not performing these analyses correctly, and also considering analytic error as an essential component of the larger Total Survey Error (TSE) framework.