An observational analysis of the trope “A p-value of < 0.05 was considered statistically significant” and other cut-and-paste statistical methods

Appropriate descriptions of statistical methods are essential for evaluating research quality and reproducibility. Despite continued efforts to improve reporting in publications, inadequate descriptions of statistical methods persist. At times, reading statistical methods sections can conjure feelings of déjà vu, with content resembling cut-and-pasted or "boilerplate" text from already published work. Instances of boilerplate text suggest a mechanistic approach to statistical analysis, where the same default methods are being used and described using standardized text. To investigate the extent of this practice, we analyzed text extracted from published statistical methods sections from PLOS ONE and the Australian and New Zealand Clinical Trials Registry (ANZCTR). Topic modeling was applied to analyze data from 111,731 papers published in PLOS ONE and 9,523 studies registered with the ANZCTR. PLOS ONE topics emphasized definitions of statistical significance, software and descriptive statistics. One in three PLOS ONE papers contained at least one sentence that was a direct copy from another paper. 12,675 papers (11%) closely matched the sentence "a p-value < 0.05 was considered statistically significant". Common topics across ANZCTR studies differentiated between study designs and analysis methods, with matching text found in approximately 3% of sections. Our findings quantify a serious problem affecting the reporting of statistical methods and shed light on perceptions about the communication of statistics as part of the scientific process. Results further emphasize the importance of rigorous statistical review to ensure that adequate descriptions of methods are prioritized over relatively minor details such as p-values and software when reporting research outcomes.


Introduction
An ideal statistical analysis uses appropriate methods to draw insights from data and inform the research questions. Unfortunately, many current statistical analyses are far from ideal, with researchers often using the wrong methods, misinterpreting the results, or failing to adequately check their assumptions [1]. Some researchers take a "mechanistic" approach to statistics, copying the few methods they know regardless of their appropriateness, and then going through the motions of the analysis [2]. Applying this form of methodological illiteracy is at odds with the principles of scientific inquiry, yet continues to pervade published scientific research [3]. This paradox has been exemplified during the COVID-19 pandemic, which has led to unprecedented levels of published research of largely poor quality [4,5]. Many researchers lack adequate training in research methods, and approach statistics with trepidation and even ignorance [6,7]. However, using the wrong statistical methods can cause real harm [6,8] and bad statistical practices are being used to abet weak science [2]. Statistical mistakes are a key source of research waste and are contributing to the current reproducibility crisis in science [9]. Even when the correct methods are used, many researchers fail to describe them adequately, making it difficult to reproduce the results [10,11]. Poor statistical methods might not be caught by reviewers, as they may not be qualified to judge the statistics. A recent survey of editors found that only 23% of health and medical journals used expert statistical review for all articles [12], which was little different from a survey from 22 years ago [13].
There is guidance for researchers on how to write up their statistical methods and results. The International Committee of Medical Journal Editors recommend that researchers should: "Describe statistical methods with enough detail to enable a knowledgeable reader with access to the original data to judge its appropriateness for the study and to verify the reported results" [14]. More detailed guidance is given by the SAMPL and EQUATOR guidelines [15,16], covering all aspects of research reporting tailored to different study designs. Both of these guidelines were led by Doug Altman, who spoke for many years about the need for better statistical reporting. The awareness and use of these guidelines could be improved. There were 303 Google Scholar citations to the SAMPL paper (as at 8 October 2021), which is a good citation count for most papers, but is low considering the millions of papers that use statistical analysis.
A potential contributor to poor reporting is the temptation for researchers to re-use descriptions of the same default statistical methods, to make their papers resemble those of their peers and increase perceived chances of publication [17]. As these default choices become more common, valid criticism by reviewers and journal editors becomes increasingly difficult, as past use may be argued by researchers as offering precedent for the conduct of analysis within their discipline [18]. Two statisticians on this paper (AB and NW) have heard researchers admit that they have copied-and-pasted their statistical methods sections from other papers. To investigate the extent of this practice, we applied topic modelling to analyze text within statistical methods sections, as part of published journal articles and clinical trial protocols. Modelling results were used to estimate the extent to which researchers are using cut-and-paste or 'boilerplate' statistical methods sections. Boilerplate text is that "which can be reused in new contexts or applications without significant changes to the original" [19]. The use of boilerplate text indicates that researchers are emphasizing the same details about chosen statistical analyses, and potentially giving little thought to the conduct and transparent reporting of the statistical methods used.

Data sources
We used two openly available data sources to find statistical methods sections: research articles published in PLOS ONE and study protocols registered on the Australian and New Zealand Clinical Trials Registry (ANZCTR). Data sources were chosen as examples of common research outputs that include descriptions of statistical methods that are either planned or were used for analyzing studies.
Public Library of Science (PLOS ONE). PLOS ONE is an open access mega-journal that publishes original research across a wide range of scientific fields. Article submissions are handled by an academic editor who selects peer reviewers based on their self-nominated area(s) of expertise. Currently, 324 of 9,648 academic editors (3%) list the keywords "statistics (mathematics)" or "statistical methods" in their expertise (web search on 25-May-2021, https://journals.plos.org/plosone/static/editorial-board). Submissions do not undergo formal statistical review. Instead, reviewers are required to assess submissions against several publication criteria, including whether: "Experiments, statistics, and other analyses are performed to a high technical standard and are described in sufficient detail" [20]. All reviewers are asked the question: "Has the statistical analysis been performed appropriately and rigorously?", with the possible responses of "Yes", "No" and "I don't know". In September 2019, author instructions were updated to allow citations of established materials, methods and protocols, provided sufficient details are given for approaches to be understood independently of chosen references [21]. Authors are encouraged to follow published reporting guidelines such as EQUATOR, to ensure that chosen statistical methods are appropriate for the study design, and adequate details are provided to enable independent replication of results.
Data on all PLOS ONE articles can be accessed via the PLOS Application Programming Interface (API). This enabled us to conduct searches of full-text articles and analyze data on articles' text content and general attributes such as publication date and field(s) of research. All available papers regardless of publication date were considered. We applied a two-step approach to identify statistical methods sections: Step 1: Targeted API searches were completed using the R package 'rplos' [22]. Search queries targeted analysis-related terms, combining the words "data" or "statistical" with one of: "analysis", "analyses", "method", "methodology" or "model(l)ing". Terms could appear anywhere within the main body of the article, to account for the placement of relevant text in different sections, for example, in the Material and Methods section versus Results. Search results were indexed by a unique Digital Object Identifier (DOI). Attribute data collected per DOI included journal volume and subject classification(s).
Step 2: PLOS ONE does not use standardized headings to preface statistical methods sections. To address this, we performed partial matching on available headings against frequently used terms in initial search results: 'Statistical analysis', 'Statistical analyses', 'Statistical method', 'Statistical methods', 'Statistics', 'Data analysis' and 'Data analyses'. All available data were downloaded on 3 July 2020.
Code to complete steps 1 and 2 is available at https://github.com/agbarnett/stats_section/code/plosone.

Australia and New Zealand Clinical Trials Registry (ANZCTR). The ANZCTR was established in 2005 as part of a coordinated global effort to improve research quality and transparency in clinical trials reporting; observational studies can also be registered. All studies registered on ANZCTR are publicly available and can be searched via an online portal (https://www.anzctr.org.au).
Details required for registration follow a standardized template [23], which covers participant eligibility, the intervention(s) being evaluated, study design and outcomes. The information provided must be in English. Studies are not peer reviewed.
For the statistical methods section, researchers are asked to provide a brief description of sample size calculations, statistical methods and planned analyses, although this section is not compulsory [23]. Studies are reviewed by ANZCTR staff for completeness of key information, which does not include the completeness of the statistical methods sections.
All studies available on ANZCTR were downloaded on 1 February 2020 in XML format. For our analysis, we used all text available in the "Statistical methods" section. We also collated basic information about the study including the study type (interventional or observational), submission date, number of funders and target sample size. These variables were chosen as we believed they might influence the completeness of the statistical methods section. For example, we hypothesized that larger studies and those with funding would be more complete. We were also interested in changes over time.
Studies prior to 2013 were excluded as the statistical methods section appeared to be introduced in 2013. Some studies were first registered on the alternative trial database clinicaltrials.gov and then also posted to ANZCTR. We excluded these studies because almost all had no completed statistical methods section, as this section is not included in clinicaltrials.gov.

Statistical methods
Full-text processing. Text cleaning aimed to standardize notation and statistical terminology, whilst minimizing changes to article style and formatting. R code used for data extraction and cleaning is available from https://github.com/agbarnett/stats_section.
Mathematical notation was converted from Unicode characters to plain text. Symbols outside of Unicode blocks, including '%' (percent) and '<' ('less-than'), were converted into plain text. General formatting was removed, including carriage returns, punctuation marks, in-text references (e.g. "[42]"), centered equations, and other non-ASCII characters. Bracketed text was retained with brackets removed to maximize content for analysis. Stop words including pronouns, contractions and selected prepositions were removed. We retained selected stop words that, if excluded, may have changed the context of the statistical methods being described, for example 'between' and 'against'.
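The cleaning steps above can be sketched as follows. This is an illustrative Python sketch only; the study's actual pipeline was written in R, and the stop-word and keep-word lists here are hypothetical stand-ins for the full lists used:

```python
import re

# Hypothetical, abbreviated lists for this sketch; the study retained
# context-bearing words such as "between" and "against".
STOP_WORDS = {"the", "a", "an", "of", "was", "were", "to", "and", "in"}
KEEP_WORDS = {"between", "against"}

def clean_section(text: str) -> str:
    """Roughly mirror the cleaning steps: convert symbols to plain text,
    strip in-text references and brackets, then drop stop words."""
    text = text.replace("%", " percent ").replace("<", " less-than ")
    text = re.sub(r"\[\s*\d+(?:\s*[,-]\s*\d+)*\s*\]", " ", text)  # in-text refs, e.g. "[42]"
    text = re.sub(r"[()\[\]]", " ", text)                          # keep bracketed text, drop brackets
    text = re.sub(r"[^\w\s.-]", " ", text)                         # other punctuation / non-ASCII
    tokens = [t.lower().strip(".") for t in text.split()]
    tokens = [t for t in tokens if t and (t not in STOP_WORDS or t in KEEP_WORDS)]
    return " ".join(tokens)

print(clean_section("A p-value of < 0.05 was considered statistically significant [42]."))
```

Running the sketch on the trope sentence keeps only the analysis-relevant tokens, e.g. "p-value less-than 0.05 considered statistically significant".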
Analysis of missing statistical methods sections. Statistical methods sections were missing for some studies downloaded from ANZCTR, including sections labelled as "Not applicable", "Nil" or "None". Since these studies would be excluded from topic modeling, we examined whether particular types of studies were more likely to have a missing statistical methods section. The analysis used a logistic regression model estimated in a Bayesian framework ([27]; www.r-inla.org), with a missing statistical methods section (yes/no) as the dependent variable. The independent variables were date, study type, number of funders and target sample size, which was log2-transformed because of a large positive skew. Results were reported as odds ratios with 95% credible intervals.
Topic modelling. Text from statistical methods sections was analyzed using Non-Negative Matrix Factorization (NMF). NMF is an established approach for topic modelling, and provides an effective solution for text-based clustering when dealing with high-dimensional data [28,29].
For N studies, let P ∈ ℝ^(M×N) denote a content matrix of text from statistical methods sections, comprising M unique terms. Text clustering algorithms for identifying common topics across studies require P to be represented with a vector space model. In our case, unique terms in P are weighted using the tf-idf (term frequency × inverse document frequency) scheme, to account for the relative importance of common and rare terms.
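As a concrete illustration of the tf-idf weighting (a minimal Python sketch; the study's analysis was done in R), a term appearing in every section receives zero weight, while rare terms are up-weighted:

```python
import math
from collections import Counter

def tfidf(docs):
    """Compute tf-idf weights per document: term frequency scaled by the
    inverse document frequency log(N / df), down-weighting ubiquitous terms."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))  # document frequency
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()})
    return weights

# Toy tokenized "sections"
docs = [["p-value", "significant", "spss"],
        ["p-value", "significant", "prism"],
        ["regression", "model", "p-value"]]
w = tfidf(docs)
# "p-value" appears in every section, so its idf is log(3/3) = 0
```

Here "spss" (appearing in one section) receives a higher weight than "significant" (appearing in two), which in turn outweighs "p-value" (appearing in all three).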
A common problem facing text clustering algorithms is the curse of dimensionality, due to the high number of terms in the doc × term matrix representation [30,31]. Text-based methods based on distance, density or probability therefore face difficulties in high-dimensional settings [32-34]. Specifically, the difference between distances to near and far points becomes negligible [31]. This behavior directly affects the performance of distance-based clustering methods such as k-means [35] in accurately identifying the subgroups (topics) present in the data. Furthermore, the sparseness associated with high-dimensional matrix representations does not allow for differentiation between topics based on density differences [32,36].
To address these limitations, NMF deals with high-dimensional data by mapping it to a lower-dimensional space. This mapping is achieved by approximating P with two factor matrices: W ∈ ℝ^(M×g) and H ∈ ℝ^(N×g) [31], such that P ≈ WH^T. The number of subgroups of common topics inferred from the data is given by g.
The matrix factorization process approximates the lower-dimensional non-negative factor matrices W and H such that they represent the high-dimensional P with the least error. Estimation of W and H is achieved by optimizing an objective function; for NMF, the Frobenius norm is used, equivalent to minimizing the sum of squares over all elements of P:

minimize ‖P − WH^T‖_F^2 subject to W ≥ 0, H ≥ 0.

Following estimation, H contains the information regarding topic membership for all studies. In our case, topic membership (1, . . ., g) for a statistical methods section is inferred from the maximum coefficient value in the corresponding row of H, also known as the topic coherence score. For our two datasets, we applied NMF with g = 10 topics.
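A minimal sketch of NMF with the standard multiplicative updates (plain Python for illustration only; the matrix shapes follow the notation above, with documents as columns of P, and this is not the study's implementation):

```python
import random

def matmul(A, B):
    """Multiply two list-of-lists matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(r) for r in zip(*A)]

def nmf(P, g, iters=200, seed=1):
    """Approximate P (M x N, terms x documents) by W H^T, with W (M x g) and
    H (N x g) non-negative, reducing the Frobenius norm ||P - W H^T||_F."""
    rng = random.Random(seed)
    M, N = len(P), len(P[0])
    W = [[rng.random() for _ in range(g)] for _ in range(M)]
    H = [[rng.random() for _ in range(g)] for _ in range(N)]
    eps = 1e-9  # guard against division by zero
    for _ in range(iters):
        # Multiplicative update for H: H <- H * (P^T W) / (H W^T W)
        PtW = matmul(transpose(P), W)
        HWtW = matmul(H, matmul(transpose(W), W))
        H = [[H[i][j] * PtW[i][j] / (HWtW[i][j] + eps) for j in range(g)] for i in range(N)]
        # Multiplicative update for W: W <- W * (P H) / (W H^T H)
        PH = matmul(P, H)
        WHtH = matmul(W, matmul(transpose(H), H))
        W = [[W[i][j] * PH[i][j] / (WHtH[i][j] + eps) for j in range(g)] for i in range(M)]
    return W, H

# Toy term-by-document matrix with two obvious topics (two blocks of terms)
P = [[1, 1, 0, 0],
     [1, 1, 0, 0],
     [0, 0, 1, 1],
     [0, 0, 1, 1]]
W, H = nmf(P, g=2)
# Topic membership of document j: index of the largest coefficient in row j of H
topics = [max(range(2), key=lambda k: H[j][k]) for j in range(4)]
```

On this toy matrix the first two documents share one topic and the last two share the other, mirroring how section-level topic membership is read off the rows of H.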
Content analysis. Results were summarized by word clouds and n-gram analysis to identify frequently occurring terms within topics. Evidence of boilerplate text was assessed at the section and sentence levels using a modified version of the Jaccard similarity index. We chose the Jaccard index as an easy-to-interpret measure; for two pieces of tokenized text A and B, we defined the similarity score as J(A, B) = |A ∩ B|/|B|. Calculating similarities relative to a target piece of text (B) allowed us to identify instances of similar text either as a complete sentence, or embedded within larger sentences. Analyses considered text tokenized at the word level, with locality-sensitive hashing applied to reduce the number of pairwise comparisons [37]. Instances of boilerplate text were defined by a Jaccard index of 0.9 or higher.
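The modified index can be sketched as below (Python for illustration; the analysis used R, and the locality-sensitive hashing step that prunes pairwise comparisons is omitted here):

```python
def jaccard_to_target(text_a: str, text_b: str) -> float:
    """Modified Jaccard index J(A, B) = |A ∩ B| / |B| on word tokens,
    computed relative to a target sentence B, so that B embedded inside
    a longer sentence A still scores 1.0."""
    a = set(text_a.lower().split())
    b = set(text_b.lower().split())
    return len(a & b) / len(b)

target = "a p-value less than 0.05 was considered statistically significant"
candidate = ("all tests were two-sided and a p-value less than 0.05 "
             "was considered statistically significant")
score = jaccard_to_target(candidate, target)  # flagged as boilerplate if >= 0.9
```

Because the denominator is |B| rather than |A ∪ B|, padding the target sentence with extra words does not lower the score, which is what allows embedded boilerplate to be detected.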

Public Library of Science (PLOS ONE)
Targeted keyword searches using the PLOS ONE application programming interface (API) returned 131,847 papers, of which 111,731 (85%) included a statistical methods section (S1 Fig). In the final sample, 94,608 (85%) papers returned an exact match against one or more common section headings: 63,982 for 'statistical analysis', 13,343 for 'statistical analyses' and 13,510 for 'data analysis'. All papers included "Biology and life sciences" (n = 107,584), "Earth sciences" (n = 7,605) and/or "Computer and information sciences" (n = 5,190) in their top 3 subject classifications.
Statistical methods sections had a median length of 129 words and inter-quartile range of 63 to 258 words. 7,701 articles (7%) had a statistical methods section of 500 words or more. 19,077 articles (17%) had statistical methods sections with 50 words or less, equal to the length of this paragraph.
Topics reflected the use of statistical software (topics 3 and 5), descriptive statistics (topic 6), group-based hypothesis testing (topics 1 and 4) and statistical significance (topics 1 and 9) (Fig 1). Also identified were topics related to regression (topic 2), meta-analysis (topic 7) and experimental designs (topic 10). At the section level, 528 studies (0.47%) were a direct cut-and-paste from another paper; 37,333 studies (33%) included at least one exact match at the sentence level.
Definitions of statistical significance at α = 0.05 were the most common form of boilerplate text, found in approximately 1 in 10 of all included studies (Table 1). Topic 1 (n = 3,775) combined statistical significance with Student's t-test. Topic 9 (n = 6,104) focused on multiple thresholds for declaring statistical significance such as "*p < 0.05, **p < 0.01 and ***p < 0.001", a practice that has been criticized [38]. Minor variations of this phrase were identified in 40% of all studies assigned to this topic.
Statistical software topics differentiated between GraphPad Prism (topic 3: n = 9,879) and SPSS (topic 5: n = 9,574). Targeted searches for the n-gram "GraphPad Prism" returned 6,844 potential matches, including 263 studies that used the boilerplate text "statistical analysis was performed using GraphPad Prism" (Table 1). Common variants included software version (e.g. "version 5.0 for windows") and location information (e.g. "La Jollie/San Diego CA USA"). Similar instances were identified for "SPSS" in topic 5, with 539 out of 9,005 studies (6%) identified as boilerplate text. Software details in both topics were frequently paired with hypothesis testing methods and definitions of statistical significance (S2 Fig).
Boilerplate text for descriptive statistics reflected the presentation of data as means plus or minus standard errors or standard deviations (Topic 6: 321/4,746 studies; 6.7%). In topic 2, an example of recycled text was "Continuous variables were expressed as mean ± standard deviation" (494 studies; 2.5%). Similar to other topics, descriptions were often paired with univariate hypothesis tests followed by more complex analyses, software and statements of statistical significance (S3 Fig).

Australia and New Zealand Clinical Trials Registry (ANZCTR)
We downloaded 28,008 studies and found that 9,523 (34%) had a completed statistical methods section (S1 Fig). The median length of sections was 136 words with an inter-quartile range of 74 to 230 words. Eight sections contained only one word, including "ANOVA", "SPSS" and even "SSPS".
Observational studies were less likely to have a missing statistical methods section compared with interventional studies (Table 2). Missing sections became less likely over time. Studies with more funders and a larger target sample size were less likely to have a missing statistical methods section.
Since studies registered with ANZCTR described planned analyses, we hypothesized that some studies did not specify statistical methods because they had yet to consult with a statistician. Targeted searches for "statistician" across all topics returned 381 studies, with examples including "Statistical analysis will be done in collaboration with a statistician" and "Pilot study at this point will use a statistician professionally to determine sample size calculations as required".
Evaluation of boilerplate text revealed that sections from 484 studies (5.1%) were close matches and 251 (2.6%) were an exact cut-and-paste from another study (Table 3). At the sentence level, the proportion of studies with shared text varied by topic, from 12% in topic 5 (pilot studies) to 38% in topic 3 (Student's t-test) (S4 Fig).

Thematic analysis of n-grams differentiated between study design and statistical methods topics (S5 Fig). At the n-gram level, we noted the use of similar methods across multiple topics. For example, while topic 3 (Student's t-test) was dominated by mentions of group-based hypothesis tests as expected, the same topic also referenced linear modelling/regression methods and descriptive statistics. Similarly, the use of linear modelling/regression methods was referenced across multiple topics covering quantitative and qualitative methods.

Among study design topics, matching sentences highlighted the planned use of intention-to-treat analysis and descriptive statistics. For topic 6 (safety and tolerability studies), approximately 1 in 3 studies had evidence of boilerplate text at the sentence level, which included different combinations of summary statistics for presenting study variables. In contrast, topic 4 (efficacy and safety studies) returned 211 matches against the n-gram "95 percent"; subsequent review identified 28 studies that were close matches to the phrase "at a confidence level of 95% and a precision around the estimate of 5%, a minimum of 73 patients will be included". Among methods topics, definitions of statistical significance were a recurring theme. Some topics simply stated the main analysis method, for example, "descriptive statistics" (topic 7; 16 exact matches). Examples of close matching sentences and Jaccard similarity scores are given in Table 4.

Discussion
The aim of our analysis was to identify common themes in statistical methods sections, both in terms of chosen methods and how these methods are being communicated. Our findings provide evidence of boilerplate statistical methods sections, resulting from likely cut-and-pasting and slight modifications to existing text descriptions. Results from topic modeling further identified distinct themes across statistical methods sections that emphasized details about study design, chosen methods, p-values and software. This is a strong sign of the ritualistic practice of statistics where researchers go through the motions rather than using conscientious practice [2].

Despite the extensive array of statistical tests available, our results show that authors are often reporting the same few methods. In related work, a content-based analysis of ecology and conservation journals summarized trends in linear modelling using n-grams including "t-test", "ANOVA" and "regression"; results provided evidence of a movement towards model-based inference [39]. We found that Student's t-test and ANOVA were commonly cited methods for comparing groups in both PLOS ONE and ANZCTR datasets. For statistical methods sections in PLOS ONE, we also found that many studies followed a generic template, combining chosen statistical methods with descriptive statistics for summarizing data, statements of statistical significance and/or choice of software. When investigating cases of boilerplate text, results based on n-grams versus close matches at the sentence level varied considerably by topic. These findings suggest that there is a tendency for researchers to default to the same common statistical methods when completing analyses, in line with the view of statistical analysis as a mechanistic process. However, for studies that use the same statistical methods, the text used to describe important details may vary.
Defining statistical significance at p < 0.05 was the most common example of boilerplate text in both datasets. The widespread use of statistical significance is troubling given the bright-line thinking it engenders [40] and the common misinterpretations of p-values [41,42]. Nonetheless, conflicting views about the use of statistical significance remain. In a follow-up survey of signatories to an article calling for the end of statistical significance [43], 22% of respondents said they were likely to continue using the concept in future publications [44]. Reasons cited included the mindful use of statistical significance in combination with other evidence, and concerns about the feasibility of abandoning statistical significance given its engrained usage in published literature. At the same time, null hypothesis significance testing has been cited as a root cause fueling the reproducibility crisis, and a problem that has been difficult to shift [45].

[Table 3 note: The number of studies with Jaccard similarity scores greater than or equal to 0.9 from pairwise comparisons is presented; the number of studies with cut-and-pasted text is given in brackets. https://doi.org/10.1371/journal.pone.0264360.t003]

Two topics identified in the PLOS ONE dataset highlighted statistical software. Similarly, some sections extracted from ANZCTR only stated the software, implying that this was the primary criterion for statistical analysis. As Doug Altman said, "Many people think that all you need to do statistics is a computer and appropriate software" [6]. This is far from the truth; whilst it is important for researchers to mention the software and version used for reproducibility purposes, it is a minor detail compared with explaining which methods were used and why.
One reason inadequate methods sections get published is that many journals do not use statistical reviewers, despite empirical evidence showing they improve manuscript quality [12]. It is possible that the exact details of statistical methods are viewed as relatively unimportant by authors and reviewers, and something that can be read last or even skipped [46]. Some journals foster this lack of importance by putting the methods section last. Statistical methods sections may be getting less scrutiny than other sections because of their position in the paper, their relatively low word counts, and because they so often contain boilerplate text. Another potential reason authors resort to boilerplate text is the overly-critical approach to statistics by some reviewers, who pounce on anything outside the accepted dogma [47].

[Table 4 rows (flattened in extraction): "Analyses will be conducted on an intention-to-treat basis" — 1,630, 0.6 (0.5 to 0.7), 191; "Baseline characteristics will be summarised using descriptive statistics" — 1,375, 0.5 (0.5 to 0.63).]

[Table 4 note: The number of matches to each sentence was based on a Jaccard score of 0.9 or higher. Potential matches refers to the number of studies that contained the target n-gram at least once. https://doi.org/10.1371/journal.pone.0264360.t004]

Whilst checklists are a useful tool to improve statistical reporting, peer review by non-statistical reviewers and editors cannot replace expert appraisal of the appropriateness of the statistical methods used [48]. Mechanisms to encourage authors to share their analysis code would provide an alternative route for checking what statistical methods were used. This is not a perfect solution, as we still want authors to accurately report their methods in their paper, but it does increase transparency. A recent paper found that code sharing was very low in biomedical papers, with just 2% of a sample of over 6,000 papers sharing code [49]. The introduction of incentives for code sharing, such as article badges, has to date shown limited efficacy [50]; however, further research in this area may offer potential solutions for promoting reproducibility.
Our approach for identifying boilerplate text was not intended as a form of plagiarism detection, but rather as evidence of standardized descriptions being used. For simple study designs, a boilerplate description might be adequate to promote consistency in reporting and meet reporting requirements. For example, ANZCTR sections commonly reported sample size justifications and planned analyses using intention-to-treat principles. Beyond statistical methods sections, initiatives such as 2WeekSR have been developed to streamline the completion of systematic reviews, including the use of automation to generate consistent descriptions of results suitable for use in papers [51]. However, if boilerplate descriptions are to be used, they must provide readers with sufficient details to confirm that appropriate methods were used and enable independent verification of results. Unfortunately, this is not always the case. For example, a study of papers that used ANOVA found 95% did not contain the information needed to determine what type of ANOVA was performed. This lack of information could well be because the authors used a boilerplate statistical methods section that was missing key details.
Our analysis focused on studies with a clearly marked statistical methods section, based on predefined section headings. It is therefore possible that some of the papers excluded from our analysis conducted statistical analyses but placed descriptions elsewhere. For PLOS ONE, excluded papers may have described statistical methods as part of the supplementary material, which tends to be less structured than the main text. Similarly, since submissions to both PLOS ONE and ANZCTR do not undergo compulsory statistical review, our results may not be generalizable to all journals and registries, especially those that consistently use a statistical reviewer. Given the large sample sizes for both datasets, it was not feasible to check whether papers used the correct methods.

Supporting information

[Fig, caption start truncated in extraction] …2 and 6). General themes for statistical methods were based on targeted word searches and categorized into statistical significance, descriptive statistics, parametric hypothesis tests, nonparametric hypothesis tests, linear modelling/regression and software. The most frequent combinations of themes are given on the x-axis, with the corresponding number of studies on the y-axis. (TIF)

S4 Fig. Total matching sentences by topic for the ANZCTR dataset. A match was defined as any pair of sentences between ANZCTR studies with a Jaccard score equal to 0.9 or higher. (TIF)

S5 Fig. Summary of close matches at the sentence level (x-axis) by ANZCTR themes inferred from common n-grams (y-axis), organized by study design (A) and methods-based (B) topics. A close match was defined as any pair of sentences between ANZCTR studies with a Jaccard score equal to 0.9 or higher. (TIF)