Do Pressures to Publish Increase Scientists' Bias? An Empirical Support from US States Data

The growing competition and “publish or perish” culture in academia might conflict with the objectivity and integrity of research, because it forces scientists to produce “publishable” results at all costs. Papers are less likely to be published and to be cited if they report “negative” results (results that fail to support the tested hypothesis). Therefore, if publication pressures increase scientific bias, the frequency of “positive” results in the literature should be higher in the more competitive and “productive” academic environments. This study verified this hypothesis by measuring the frequency of positive results in a large random sample of papers with a corresponding author based in the US. Across all disciplines, papers were more likely to support a tested hypothesis if their corresponding authors were working in states that, according to NSF data, produced more academic papers per capita. The size of this effect increased when controlling for state's per capita R&D expenditure and for study characteristics that previous research showed to correlate with the frequency of positive results, including discipline and methodology. Although the confounding effect of institutions' prestige could not be excluded (researchers in the more productive universities could be the most clever and successful in their experiments), these results support the hypothesis that competitive academic environments increase not only scientists' productivity but also their bias. The same phenomenon might be observed in other countries where academic competition and pressures to publish are high.


Introduction
The objectivity and integrity of contemporary science face many threats. A cause of particular concern is the growing competition for research funding and academic positions, which, combined with an increasing use of bibliometric parameters to evaluate careers (e.g. number of publications and the impact factor of the journals they appeared in), pressures scientists into continuously producing “publishable” results [1].
Competition is encouraged in scientifically advanced countries because it increases the efficiency and productivity of researchers [2]. The flip side of the coin, however, is that it might conflict with their objectivity and integrity, because the success of a scientific paper partly depends on its outcome. In many fields of research, papers are more likely to be published [3,4,5,6], to be cited by colleagues [7,8,9] and to be accepted by high-profile journals [10] if they report results that are “positive”, a term which in this paper will indicate all results that support the experimental hypothesis against an alternative or a “null” hypothesis of no effect, whether or not tests of statistical significance are used.
Words like “positive”, “significant”, “negative” or “null” are common scientific jargon, but are obviously misleading, because all results are equally relevant to science, as long as they have been produced by sound logic and methods [11,12]. Yet, literature surveys and meta-analyses have extensively documented an excess of positive and/or statistically significant results in fields and subfields of, for example, biomedicine [13], biology [14], ecology and evolution [15], psychology [16], economics [17] and sociology [18].
Many factors contribute to this publication bias against negative results, which is rooted in the psychology and sociology of science. Like all human beings, scientists are confirmation-biased (i.e. tend to select information that supports their hypotheses about the world) [19,20,21], and they are far from indifferent to the outcome of their own research: positive results make them happy and negative ones disappointed [22]. This bias is likely to be reinforced by a positive feedback from the scientific community. Since papers reporting positive results attract more interest and are cited more often, journal editors and peer reviewers might tend to favour them, which will further increase the desirability of a positive outcome to researchers, particularly if their careers are evaluated by counting the number of papers listed in their CVs and the impact factor of the journals they are published in.
Confronted with a “negative” result, therefore, a scientist might be tempted to either not spend time publishing it (what is often called the “file-drawer effect”, because negative papers are imagined to lie in scientists' drawers) or to turn it somehow into a positive result. This can be done by re-formulating the hypothesis (sometimes referred to as HARKing: Hypothesizing After the Results are Known [23]), by selecting the results to be published [24], by tweaking data or analyses to “improve” the outcome, or by willingly and consciously falsifying them [25]. Data fabrication and falsification are probably rare, but other questionable research practices might be relatively common [26].
Quantitative studies have repeatedly shown that financial interests can influence the outcome of biomedical research [27,28] but they appear to have neglected the much more widespread conflict of interest created by scientists' need to publish. Yet, fears that the professionalization of research might compromise its objectivity and integrity had been expressed already in the 19th century [29]. Since then, the competitiveness and precariousness of scientific careers have increased [30], and evidence that this might encourage misconduct has accumulated. Scientists in focus groups suggested that the need to compete in academia is a threat to scientific integrity [1], and those guilty of scientific misconduct often invoke excessive pressures to produce as a partial justification for their actions [31]. Surveys suggest that competitive research environments decrease the likelihood of following scientific ideals [32] and increase the likelihood of witnessing scientific misconduct [33] (but see [34]). However, no direct, quantitative study has verified the connection between pressures to publish and bias in the scientific literature, so the existence and gravity of the problem are still a matter of speculation and debate [35].
To verify this hypothesis, this study analysed a random sample of papers published between 2000 and 2007 that had a corresponding author based in the US. These papers, published in all disciplines, declared to have tested a hypothesis, and it was determined whether they concluded to have found a “positive” (full or partial) or a “negative” support for the tested hypothesis. Using data compiled by the National Science Foundation, the proportion of “positive” results was then regressed against a sheer measure of academic productivity: the number of articles published per-capita (i.e. per doctorate holder in academia) in each US state, controlling for the effects of per-capita research expenditure. NSF data provides an accurate proxy of a state's academic productivity, because it controls for multiple authorship by counting papers fractionally. Since the probability for a paper to report a positive result depends significantly on its methodology, on whether it tests one or more hypotheses, on the discipline it belongs to and particularly on whether the discipline is pure or applied [36], these confounding effects were controlled for in the regression models.

Results
A total of 1316 papers were included in the analysis. All US states and the federal district were represented in the sample, except Delaware. The number of papers per state varied between 1 and 150 (mean: 26.32±4.16 SE), and the percentage of positive results between 25% and 100% (mean: 82.38±15.15 STDV, Figure 1). The number of papers from each state in the sample was almost perfectly correlated with the total number of papers that each state had published in 2003 according to NSF (Pearson's r = 0.968, N = 50, P<0.001), as well as any other year for which data was available (i.e. 1997, 2001 and 2005, r≥0.963 and P<0.001 in all cases). This shows the sample to be highly representative of academic publication patterns in the US.
The effect of per capita academic productivity remained highly significant when controlling for expenditure and for characteristics of study: broad methodological category, papers testing one vs. multiple hypotheses, and pure vs. applied discipline (Table 1, Nagelkerke R² = 0.051). Similar results were obtained when controlling for the effect of discipline instead of methodology (Table 2, Nagelkerke R² = 0.065). Adding an interaction term of discipline by academic productivity did not improve the model significantly overall (Wald = 20.424, df = 19, p = 0.369), although contrasting each discipline's interaction term with that of Space Science showed significantly positive interaction effects for Pharmacology and Toxicology and for Neuroscience and Behaviour.
The proportion of papers published between 2000 and 2007 that supported the tested hypothesis was completely uncorrelated with the total (i.e. non per capita) number of doctorate holders, total number of papers and total R&D expenditure (b = 0±0 and p≥0.223 for all three cases). Controlling for any of these parameters did not alter the results of the regression in any meaningful way.

Sensitivity analyses
The analyses were run using 2003 data from the Science and Engineering Indicators 2006 report [37], because this year had the most complete data series (all parameters in the report had been calculated for that year), and because it fell almost in the middle of

Discussion
In a random sample of 1316 papers that declared to have “tested a hypothesis” in all disciplines, outcomes could be significantly predicted by knowing the addresses of the corresponding authors: those based in US states where researchers publish more papers per capita were significantly more likely to report positive results, independently of their discipline, methodology and research expenditure. The probability for a study to yield support for the tested hypothesis depends on several research-specific factors, primarily on whether the hypothesis tested is actually true and how much statistical power is available to reject the null hypothesis [38]. However, the geographical origin of the corresponding author should not, in theory, be relevant, nor should parameters measuring the sheer quantity of publications per capita. Although, as discussed below, not all confounding factors in the study could be controlled for, these results support the hypothesis that competitive academic environments increase not only the productivity of researchers, but also their bias against “negative” results.
All main sources of sampling and methodological bias in this study were controlled for. The number of papers from each state in the sample was almost perfectly correlated with the actual number of papers that each state produced in any given year, which confirms that the sampling of papers was completely randomised with respect to address (as well as any other study characteristic including the particular hypothesis tested and the methods employed), and therefore that the sample was highly representative of the US research panorama. The total number of papers, total R&D and total number of doctorate holders were completely uncorrelated to the proportion of positive results, ruling out the possibility that different frequencies of positive results between states are due to sampling effects. Although the analyses were all conducted by one author, expectancy biases can be excluded, because the classification of papers as positive or negative was completely blind to the corresponding address in the paper, and the US states' data were obtained from an independent source (NSF).

Table 1. Logistic regression slope, standard error, Wald test with statistical significance, odds ratio and 95% confidence interval of the probability for a paper to report a positive result, depending on the following study characteristics: per capita academic productivity of US state of corresponding author, per capita R&D academic expenditure of US state of corresponding author, papers testing more than one hypothesis (only the first of which was considered in this study), papers published in pure as opposed to applied disciplines, and methodological category of paper.

We can also exclude that the association between productivity and positive results was an artifact of the effects of methodologies and disciplines of papers (which are elsewhere shown to be significant predictors of positive results [36]), because controlling for these factors increased the size and statistical significance of the regression, suggesting that the effect is truly cross-disciplinary. In sum, these results are likely to represent a genuine pattern characterising academic research in the US.
An unavoidable confounding factor in this study is the quality and prestige of academic institutions, which is intrinsically linked to the productivity of their resident researchers. Indeed, official rankings of universities often include parameters measuring publication rates [39] (although the validity of such rankings is controversial [40,41]). Therefore, it could be argued that the more productive states are also the ones hosting the “best” universities, which provide better academic structures (laboratories, libraries, etc.) and more advanced and stimulating intellectual environments. This could make scientists better at picking up the right hypotheses and more successful in testing them, increasing their chances to obtain true positive results. Separating this quality-of-institution effect from that of bias induced by pressures to publish is difficult, because the two factors are strictly linked: the best universities are also the most competitive, and thus presumably the ones where pressures to produce are highest.
However, the quality-of-institution effect is unlikely to fully explain the findings of this study for at least two reasons. First, if structures and resources are really important, then positive results should also tend to increase where more R&D expenditure is available, but a negative (though not statistically significant) trend was observed instead. Second, the variability in frequency of positive results between states is too high to be reasonably explained by the quality factor alone. At one extreme, states yielded as few as 1 in 4 papers that supported the tested hypothesis; at the other extreme, numerous states reported between 95% and 100% positive results, including academically productive ones like Michigan (N = 54 papers in this sample), Ohio (N = 47), District of Columbia (N = 18) and Nebraska (N = 13). In the absence of bias of any kind, this would mean that corresponding authors in these states almost never failed to find support for the hypotheses they tested. But negative results are virtually inevitable, unless all the hypotheses tested were true, experiments were designed and conducted perfectly, and the statistical power available were always 100%, which it rarely is, and is usually much lower [42,43,44,45,46].
As a matter of fact, the prestige of institutions could be expected to have the opposite influence on published results, in analogy with what has been observed by comparing countries. In the biomedical literature, the statistical significance of results tends to be lower in papers from high-income countries, which suggests that journal editors tend to reject papers from low-income countries unless they have particularly “good” results [47]. If there were a similar editorial bias favouring highly prestigious universities in the US (and some studies suggest that there is [9,48]), then the more productive states (prestigious institutions) should be allowed to publish more negative results.
A possibility that needs to be considered in all regression analyses is whether the cause-effect relationship could be reversed: could some states be more productive precisely because their researchers tend to do many cheap and non-explorative studies (i.e. many simple experiments that test relatively trivial hypotheses)? This appears unlikely, because it would contradict the observation that the most productive institutions are also the more prestigious, and therefore the ones where the most important research tends to be done.
What happened to the missing negative results? As explained in the Introduction, presumably they either went completely unpublished or were somehow turned into positive through selective reporting, post-hoc re-interpretation, and alteration of methods, analyses and data. The relative frequency of these behaviours remains to be established, but the simple nonpublication of results is unlikely to be the only explanation. If it were, then we should have to assume that authors in the more productive states are even more productive than they appear, but wastefully do not publish many negative results they get.
Since positive results in this study are estimated using what is declared in the papers, we cannot exclude the possibility that authors in more productive states simply tend to write the sentence “test the hypothesis” more often when they get positive results. However, it would be problematic to explain why this should be the case and, if it were, then we would still have to understand if and how negative results are published. Ultimately, such an association of word usage with socio-economic parameters would still suggest that publication pressures have some measurable effect on how research is conducted and/or presented.
Selective reporting, reinterpreting and altering results are commonly considered “questionable research practices”: behaviours that might or might not represent falsification of results, depending on whether they express an intention to deceive. There is no doubt that negative results produced by a methodological flaw should either be corrected or not be published at all, and it is likely that many scientists select or manipulate their negative results because they sincerely think their experiments went wrong somewhere: maybe the sample was too small or too heterogeneous, some measurements were inaccurate and should be discarded, the hypothesis should be reformulated, etc. However, in most circumstances this might be nothing more than a “gut feeling” [49]. Moreover, positive results should be treated with the same scrutiny and rigour applied to negative ones, but in all likelihood they are not. This latter form of neglect is probably one of the main sources of bias in science.
Adding an interaction term of discipline by productivity did not increase the accuracy of the model significantly. Although we are currently unable to measure the statistical power of interaction terms in complex logistic regression models, the lack of significance suggests that large disciplinary differences in the effect of publication pressures are unlikely. Interestingly, however, some interdisciplinary variability was observed: Pharmacology and Toxicology, and Neuroscience and Behaviour had a significantly stronger association between productivity and positive results compared to Space Science. Of course, since we had 20 disciplines in the model, the significance of these two terms could be due to chance alone. However, we cannot exclude that a study with higher statistical power could confirm this result and reveal other small, but nonetheless interesting differences between fields.
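The remark that two significant terms "could be due to chance alone" can be made concrete with a short calculation. The sketch below assumes 19 contrasts against Space Science (matching the df = 19 reported in the Results), a significance threshold of 0.05, and mutually independent tests. The threshold and the independence are illustrative assumptions, not values stated by the study.

```python
import math

def prob_at_least(k, n_tests, alpha=0.05):
    """Probability of k or more false positives among n_tests independent
    tests at level alpha (binomial model; independence is an assumption)."""
    below = sum(
        math.comb(n_tests, j) * alpha**j * (1 - alpha) ** (n_tests - j)
        for j in range(k)
    )
    return 1.0 - below

N_CONTRASTS = 19   # assumed: one contrast per discipline other than the reference
ALPHA = 0.05       # assumed significance threshold

p_one_or_more = prob_at_least(1, N_CONTRASTS, ALPHA)
p_two_or_more = prob_at_least(2, N_CONTRASTS, ALPHA)
expected_false_positives = N_CONTRASTS * ALPHA

print(f"P(at least 1 spurious term) = {p_one_or_more:.3f}")
print(f"P(at least 2 spurious terms) = {p_two_or_more:.3f}")
print(f"Expected spurious terms      = {expected_false_positives:.2f}")
```

Under these assumptions roughly one spurious significant term is expected, and two or more occur about a quarter of the time, which is consistent with the caution expressed above.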
This study focused on the United States primarily because they are one of the most scientifically productive countries, and are academically diversified but linguistically and culturally rather homogeneous, which eliminated the confounding effect of editorial biases against particular countries, cultures or languages. Moreover, the research output and expenditure of all US states are recorded and reported by NSF periodically and with great accuracy, yielding a reliable dataset. Academic competition might be particularly high in US universities [1], but is surely not unique to them. Therefore, the detrimental effects of the publish-or-perish culture could be manifest in other countries around the world.

Materials and Methods
The sample of papers used in this study was part of a larger sample used to compare bias between disciplines [36]. Papers in this larger sample were obtained with the following method. The search string “test* the hypothes*” was used to search all 10837 journals in the Essential Science Indicators database, which classifies journals univocally in 22 disciplines. Only papers published between 2000 and 2007 were sampled. When the number of papers retrieved from one discipline exceeded 150, papers were selected using a random number generator. In one discipline, Plant and Animal Sciences, an additional 50 papers were analysed, in order to increase the statistical power of comparisons involving behavioural studies on non-humans (see below for details on methodological categories). By examining the abstract and/or full-text, it was determined whether the authors of each paper had concluded to have found a positive (full or partial) or negative (null or negative) support. If more than one hypothesis was being tested, only the first one to appear in the text was considered. We excluded meeting abstracts and papers that either did not test a hypothesis or for which sufficient information to determine the outcome was lacking.
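The capping rule described above (keep every retrieved paper unless a discipline returns more than 150, in which case draw at random) can be sketched as follows. The discipline names, paper identifiers and seed are hypothetical placeholders; the actual random number generator used in the study is not specified.

```python
import random

def sample_discipline(paper_ids, cap=150, seed=None):
    """Return all papers when there are at most `cap`;
    otherwise return a simple random sample of size `cap`."""
    papers = list(paper_ids)
    if len(papers) <= cap:
        return papers
    rng = random.Random(seed)      # seeded only to make the sketch reproducible
    return rng.sample(papers, cap) # sampling without replacement

# Hypothetical search results: one large and one small discipline.
retrieved = {
    "Hypothetical Discipline A": [f"A{i}" for i in range(400)],
    "Hypothetical Discipline B": [f"B{i}" for i in range(90)],
}
sampled = {name: sample_discipline(ids, seed=1) for name, ids in retrieved.items()}

for name, ids in sampled.items():
    print(name, len(ids))
```

Sampling without replacement keeps each paper at most once, so every retrieved paper in a capped discipline has the same chance of inclusion.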
All data was extracted by the author. An untrained assistant who was given basic written instructions (similar to the paragraph above, plus a few explanatory examples) scored papers the same way as the author in 18 out of 20 cases, and picked up exactly the same sentences for hypothesis and conclusions in all but three cases. The discrepancies were easily explained, showing that the procedure is objective and replicable.
To identify methodological categories, the outcome of each paper was classified according to a set of binary variables: 1-outcome measured on biological material; 2-outcome measured on human material; 3-outcome exclusively behavioural (measures of behaviours and interactions between individuals, which in studies on people included surveys, interviews and social and economic data); 4-outcome exclusively non-behavioural (physical, chemical and other measurable parameters including weight, height, death, presence/absence, number of individuals, etc.). Biological studies in vitro for which the human/non-human classification was uncertain were classified as non-human. Disciplines were attributed based on how the ESI database had classified the journal in which the paper appeared, and the pure-applied status of discipline followed classifications identified in previous studies (for further details see [36]).
From this larger sample, all papers with a corresponding address in the US were selected, and the US state of each was recorded. Data on state academic R&D expenditure, number of doctorate holders in academia and number of papers published were taken directly from the State Indicators section of the Science and Engineering Indicators 2006 report [37]. This report compiles data from three different sources: Thomson ISI (Science Citation Index and Social Sciences Citation Index); National Science Foundation, Division of Science Resources Statistics (Survey of Doctorate Recipients); and National Science Foundation, Division of Science Resources Statistics (Academic Research and Development Expenditures). When counting the number of papers by state, NSF corrects for multiple authorship by dividing each paper by the number of institutions involved. The scoring of papers as “positive” and “negative” was completely blind to the corresponding author's address. As explained in the Results section, data from other reports were extracted and used for sensitivity analyses.
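Fractional counting, as described above, gives each institution on a paper an equal share of that paper, and the shares are then summed by state. The sketch below illustrates the idea on three made-up papers. The data layout (one list of institution states per paper) is an assumption for illustration and is not the NSF's actual procedure or file format.

```python
from collections import defaultdict
from fractions import Fraction

def fractional_counts(papers):
    """papers: iterable of lists, each holding the state of every
    institution on one paper. Returns the fractional paper count per state."""
    totals = defaultdict(Fraction)
    for institution_states in papers:
        if not institution_states:
            continue                      # skip papers with no recorded institution
        share = Fraction(1, len(institution_states))
        for state in institution_states:
            totals[state] += share        # each institution earns an equal share
    return dict(totals)

# Made-up example: one single-institution paper and two collaborations.
papers = [
    ["MI"],
    ["MI", "OH"],
    ["OH", "NE", "NE", "MI"],
]
counts = fractional_counts(papers)

for state, value in sorted(counts.items()):
    print(state, float(value))
```

Because every paper contributes exactly one unit in total, the state counts always add up to the number of papers, which is what prevents collaborations from inflating a state's apparent output.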

Statistical analyses
The ability of independent variables to predict the outcome of a paper was tested by standard logistic regression analysis, fitting a model in the form:

$$\operatorname{logit}(Y) = \ln\left(\frac{p_i}{1-p_i}\right) = b_0 + b_1 X_{i1} + b_2 X_{i2} + \dots + b_n X_{in}$$

in which $p_i$ is the probability of the ith paper of reporting a positive result, $X_1$ is the number of papers published per capita (per doctorate holder in academia) in the state of the corresponding author of the ith paper, $X_2$ is the ith paper's state R&D expenditure per capita, and $X_n$ represents the various characteristics of the ith paper that were controlled for in the models (e.g. dummy variables for methodology, discipline, etc.) as specified in the Results section. Statistical significance of the effect of each variable was calculated through Wald's test. Except where specified, all parameter estimates are reported with their standard error. The relative fit of regression models was estimated with Nagelkerke's adjusted R².
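To illustrate the quantities named above (slope, standard error, Wald statistic, odds ratio and Nagelkerke's R²), the following minimal sketch fits a logistic model with a single predictor by Newton-Raphson. It is not the SPSS procedure used in the study: the one-predictor restriction, the function names and the eight toy observations are all assumptions made for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def score_and_information(x, y, b0, b1):
    """Gradient of the log-likelihood and entries of the 2x2 information matrix."""
    g0 = g1 = h00 = h01 = h11 = 0.0
    for xi, yi in zip(x, y):
        p = sigmoid(b0 + b1 * xi)
        w = p * (1.0 - p)
        g0 += yi - p
        g1 += (yi - p) * xi
        h00 += w
        h01 += w * xi
        h11 += w * xi * xi
    return g0, g1, h00, h01, h11

def fit_logistic(x, y, iterations=25):
    """Fit logit(p) = b0 + b1*x; return (b0, b1, standard error of b1)."""
    b0 = b1 = 0.0
    for _ in range(iterations):
        g0, g1, h00, h01, h11 = score_and_information(x, y, b0, b1)
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det   # Newton step: b += H^-1 g
        b1 += (h00 * g1 - h01 * g0) / det
    _, _, h00, h01, h11 = score_and_information(x, y, b0, b1)
    det = h00 * h11 - h01 * h01
    return b0, b1, math.sqrt(h00 / det)     # Var(b1) is the [1][1] entry of H^-1

def log_likelihood(x, y, b0, b1):
    return sum(
        yi * math.log(sigmoid(b0 + b1 * xi))
        + (1 - yi) * math.log(1.0 - sigmoid(b0 + b1 * xi))
        for xi, yi in zip(x, y)
    )

def nagelkerke_r2(ll_null, ll_model, n):
    cox_snell = 1.0 - math.exp(2.0 * (ll_null - ll_model) / n)
    return cox_snell / (1.0 - math.exp(2.0 * ll_null / n))

# Toy data (invented): x is a 0/1 predictor, y = 1 marks a "positive" paper.
x = [0, 0, 0, 0, 1, 1, 1, 1]
y = [1, 1, 0, 0, 1, 1, 1, 0]

b0, b1, se_b1 = fit_logistic(x, y)
wald = (b1 / se_b1) ** 2
odds_ratio = math.exp(b1)
ci_95 = (math.exp(b1 - 1.96 * se_b1), math.exp(b1 + 1.96 * se_b1))
mean_y = sum(y) / len(y)
ll_null = log_likelihood(x, y, math.log(mean_y / (1 - mean_y)), 0.0)
r2 = nagelkerke_r2(ll_null, log_likelihood(x, y, b0, b1), len(y))

print(f"b1 = {b1:.4f}, SE = {se_b1:.4f}, Wald = {wald:.4f}")
print(f"odds ratio = {odds_ratio:.4f}, 95% CI = ({ci_95[0]:.3f}, {ci_95[1]:.3f})")
print(f"Nagelkerke R2 = {r2:.4f}")
```

With a binary predictor the fitted odds ratio equals the ratio of the odds in the two groups (here 3:1 against 1:1), which offers a quick sanity check on the optimiser. A model with several covariates, as in the study, would need a general matrix solver or a statistics package.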
Multicollinearity among independent variables was tested by examining tolerance and Variance Inflation Factors for all variables in the model. All variables had tolerance ≥ 0.42 and VIF ≤ 2.383 except one of the methodological dummy variables (tolerance = 0.34 and VIF = 2.942). To avoid this (modest) sign of possible collinearity, methodological categories were reduced to the minimum number that previous analyses have shown to differ significantly in the frequency of positive results: purely physical and chemical, biological non-behavioural, and behavioural and mixed studies on humans and on non-humans [36]. This removed any presence of collinearity in the model. All analyses were produced using the SPSS statistical package.
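Tolerance and VIF are two views of the same quantity: tolerance is 1 − R² from regressing one predictor on the others, and VIF is its reciprocal. The sketch below shows the relationship for the simplest case of two predictors, where R² reduces to the squared Pearson correlation. The predictor values are invented, and the two-predictor restriction is a simplification of the multi-variable models used in the study.

```python
import math

def pearson_r(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)

def tolerance_and_vif(a, b):
    """With only two predictors, R^2 of one on the other is r^2."""
    tolerance = 1.0 - pearson_r(a, b) ** 2
    return tolerance, 1.0 / tolerance

# Invented predictor values, for illustration only.
x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 5]
tolerance, vif = tolerance_and_vif(x1, x2)
print(f"tolerance = {tolerance:.2f}, VIF = {vif:.3f}")

# The reported pairs agree with VIF = 1 / tolerance up to rounding of the tolerance.
for reported_tolerance, reported_vif in [(0.42, 2.383), (0.34, 2.942)]:
    print(reported_tolerance, round(1.0 / reported_tolerance, 3), reported_vif)
```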

Figures
Confidence intervals in the graphs were obtained independently from the statistical analyses, using a logit transformation to calculate the proportion of positive results and its standard error, where p is the proportion of negative results and n is the total number of papers. Values for the high and low confidence limits were calculated, and the final result was back-transformed into percentages using equations for the proportion and the percentage, respectively, the latter being % = 100P. In the back-transformation, x is either P_logit or each of the corresponding 95% CI values.
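As a rough illustration of a logit-based confidence interval, the sketch below uses the textbook forms: logit(p) = ln(p/(1−p)), a delta-method standard error of sqrt(1/(n·p·(1−p))), and the inverse transformation e^x/(e^x+1). These specific formulas, the 1.96 multiplier and the example values are assumptions. They are the standard choices, but the study's own equations may differ in detail.

```python
import math

def logit_ci_percent(p, n, z=1.96):
    """Return (estimate, lower, upper) in percent for a proportion p
    observed on n papers, with the interval built on the logit scale.
    Requires 0 < p < 1, since the logit is undefined at 0 and 1."""
    p_logit = math.log(p / (1.0 - p))
    se_logit = math.sqrt(1.0 / (n * p * (1.0 - p)))   # assumed delta-method SE

    def back_transform(x):
        proportion = math.exp(x) / (math.exp(x) + 1.0)
        return 100.0 * proportion

    return (
        back_transform(p_logit),
        back_transform(p_logit - z * se_logit),
        back_transform(p_logit + z * se_logit),
    )

# Hypothetical state: 80 of 100 sampled papers in one outcome category.
estimate, lower, upper = logit_ci_percent(0.8, 100)
print(f"{estimate:.1f}% (95% CI {lower:.1f}% to {upper:.1f}%)")
```

Building the interval on the logit scale keeps both limits between 0% and 100% and makes it asymmetric around the estimate, which suits proportions close to the extremes such as those reported for several states.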