Gene name errors: Lessons not learned

Erroneous conversion of gene names into dates and other data types has been a frustration for computational biologists for years. We hypothesized that such errors in supplementary files might diminish after a report in 2016 highlighting the extent of the problem. To assess this, we performed a scan of supplementary files published in PubMed Central from 2014 to 2020. Overall, gene name errors continued to accumulate unabated in the period after 2016. Improved scanning software that we developed identified gene name errors in 30.9% (3,436/11,117) of articles with supplementary Excel gene lists, a figure significantly higher than previously estimated. This is because gene names are converted not just to dates and floating-point numbers, but also to the internal date format (five-digit numbers). These findings further reinforce that spreadsheets are ill-suited to use with large genomic data.


Author summary
Autocorrection is a feature of modern software including messaging apps, word processors and spreadsheets. It is designed to avoid data entry errors, but "autocorrect fails" can lead to information being distorted in undesired and sometimes humorous ways. What is not funny, though, is having genomics spreadsheets suffer from auto-conversion of gene names like SEPT8, DEC1 and MARCH3 into dates, a problem first characterised in 2004. A 2016 article on this topic led the HUGO Gene Nomenclature Committee to change many of these gene names to be less susceptible to autocorrect. Despite this, our work here shows that gene name autocorrect errors continue to accumulate in supplementary genomics spreadsheet files at a rapid pace. To avoid this and other reproducibility problems with spreadsheets, big changes are required in the way genomics scientists analyse and share data. We provide several practical steps researchers can take to avoid gene name errors and reiterate that big genomics data analysis is better suited to Python/R notebooks than to spreadsheets.

Background
It is a well-documented problem that spreadsheet software inadvertently converts gene symbols to dates and floating-point numbers, with these errors propagating downstream to annotation sets and other databases [1]. Previous work shows that gene name errors are made while researchers analyse and prepare supplementary files for publication [2]. A screen of 18 journals found that one fifth of publications with supplementary Excel gene lists contained errors (704/3,597). It remains unknown how frequent gene name errors are outside of these 18 journals, and whether the attention of previous publications has resulted in the mitigation of the problem.
PLOS Computational Biology | https://doi.org/10.1371/journal.pcbi.1008984 | July 30, 2021
Notably, software developers are beginning to remedy the problem at their end, with some packages like LibreOffice now resisting the conversion of gene symbols to dates (Version: 6.4.6.2). In addition, a recent announcement by the HUGO Gene Nomenclature Committee (HGNC) outlined plans for specifically changing gene symbols to avoid auto-correction [3]. For example, SEPT1 becomes SEPTIN1 and MARCH1 becomes MARCHF1. It will likely take months and perhaps years for the new gene symbols to appear in publications.
Although changes to gene names and software will help, they won't solve the overarching problem with spreadsheets: (i) errors occur silently, (ii) errors can be hidden amongst thousands of rows of data, and (iii) spreadsheets are difficult to audit. Research shows that errors are surprisingly common in the business setting [4], which raises the question as to how common such errors are in science. The difficulty in auditing spreadsheets makes them generally incompatible with the principles of computational reproducibility [5].
Our main goal here is to examine whether gene name errors have diminished since 2016 or continue to be a problem. We also assess the behaviour of current spreadsheet software in converting gene names to dates and identify Excel date genes across Eukarya. We follow this up with a screen of supplementary files from genomics-related PubMed Central (PMC) publications in the period 2014 to 2020.

Testing spreadsheet software
We tested the propensity of various spreadsheet software to convert gene names into dates after importing a set of strings containing human gene names by (i) opening a text file, (ii) pasting data, and (iii) directly typing (Table 1). We found that Microsoft Excel and Google Sheets converted these data to dates in all three modes of import. LibreOffice and Gnumeric did not convert gene names to dates in our tests here. The date conversion behaviour of Excel and Google Sheets could be circumvented by formatting the destination cells as "plain text" prior to pasting or typing. Nevertheless, this result shows that using LibreOffice or Gnumeric is safer than using Excel or Google Sheets.

Identifying Excel date genes across kingdoms
Although recent changes have been made to human and mouse gene names to prevent conversion to dates [3], it is uncertain whether such changes have propagated through to other species. To assess this, we downloaded all eukaryotic gene names available in Ensembl, imported them into Excel and collected all genes that were converted to dates. In total there were 1,544 gene names converted to dates, from 104 taxa (Tables 2 and S1). Although most affected gene names were vertebrate in origin, there were gene names affected in all groups.

Gene name errors by year
In order to determine whether gene name errors in supplementary files remain a problem, we undertook a screen of genomics-related publications in PMC. We collated a list of 166,139 genomics articles published between 2014 and 2020, and screened them using an enhanced script. In addition to identifying conversions to standard date formats (eg: 3/1/2016, Mar-3, 3-Mar) and floating-point numbers (eg: 9.33E+22), this script also recognises five-digit numbers as likely to be the result of gene name errors as this is the internal date format used by spreadsheets [1].
The results of this screen are shown in Table 3. From this set of publications, 32,841 had supplementary files in Excel format (with "xls" or "xlsx" suffixes). Of these, 11,117 publications were detected to contain at least one list of gene symbols. The software detected 3,470 publications with suspected gene name errors. After manually opening each spreadsheet file (5,136 files), we identified 34 publications as being false positives, leaving 3,436 publications with confirmed gene name errors (S2 and S3 Tables). These publications contain a total of 5,086 spreadsheets with gene name errors. The proportion of publications with Excel gene lists that contain errors was 30.9% (3,436/11,117), substantially higher than previously reported [2].
In the period 2014-2020, both the number of publications with Excel gene lists and the number of publications affected by gene name errors increased, with a pause in the period 2016-2018 (Fig 1A and 1B). On the other hand, the proportion of papers with Excel gene lists affected by errors remained stable over this period (Fig 1C). This result suggests gene name errors did not substantially reduce in the period after 2016 as we had hypothesized.
Next, to determine whether five-digit numbers explain the higher observed proportion of errors, we investigated a subset of 2,160 affected spreadsheet files to determine the frequency of error types. Dates in Mar-1 or 1-Mar format accounted for 1,797 files (83.2%). Errors in DD/MM/YYYY format accounted for 19 files (0.88%) and floating-point numbers for 4 files (0.18%). Five-digit numbers accounted for 340 files (15.7%), indicating that this error type is sufficiently common to account for the discrepancy between this and the previous report (S4 Table). When these five-digit numbers are formatted as standard dates, 292 (85.9%) fall in the months of March and September, which is consistent with gene name errors.
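The four error types tallied above can be illustrated with a small classifier. This is a minimal sketch rather than the published screening script: the regular expressions, the label names and the function `classify_cell` are assumptions made for illustration, and the month abbreviations assume an English locale.

```python
import re
from collections import Counter

# English month abbreviations as displayed by spreadsheets (assumption).
MONTHS = "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"

# Illustrative patterns only; they are not taken from the published script.
PATTERNS = [
    # "Mar-1" or "1-Mar" style dates
    ("month_day", re.compile(rf"^(?:\d{{1,2}}-(?:{MONTHS})|(?:{MONTHS})-\d{{1,2}})$", re.IGNORECASE)),
    # "3/1/2016" style dates
    ("full_date", re.compile(r"^\d{1,2}/\d{1,2}/\d{4}$")),
    # "9.33E+22" style floating-point numbers
    ("float", re.compile(r"^\d+(?:\.\d+)?E[+-]?\d+$", re.IGNORECASE)),
    # five-digit numbers (internal date format)
    ("five_digit", re.compile(r"^\d{5}$")),
]


def classify_cell(value):
    """Return the suspected error type for one cell, or None if it looks intact."""
    text = value.strip()
    for label, pattern in PATTERNS:
        if pattern.match(text):
            return label
    return None


# Hypothetical gene-symbol column containing one example of each error type.
column = ["TP53", "1-Mar", "Sep-8", "3/1/2016", "9.33E+22", "42614", "ACTB"]
tally = Counter(label for label in map(classify_cell, column) if label)
print(dict(tally))
```

A five-digit number is only suspicious when it sits in a column of gene symbols, so a check like this is meaningful only after a column has been identified as a gene list.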

Gene name errors by organism
Next, we investigated whether the rate of gene name errors was dependent on the organism under study. We found that the frequency of gene name errors was highest for mouse and human datasets, while lower for Arabidopsis, chicken and rice (Table 4). We then assessed whether there was any correlation between error proportion and the journal impact factor (JIF) for the set of 37 journals. A scatterplot of JIF and proportion of affected articles is shown in Fig 2. A correlation analysis indicated a statistically significant association using the Pearson (p = 0.0052, r = 0.462) and Spearman (p = 1.95E-04; ρ = 0.589) methods.
Next we assessed the temporal trends for the three journals with most gene name errors (Fig 3). Nature Communications showed a strong increase in articles with Excel gene lists and gene name errors over the period, while the proportion of affected articles recorded an increase from 33.3% to 39.5% in the period 2014-2020. PLOS ONE showed a trend of decreasing numbers of articles with supplementary Excel gene lists and number of affected articles but the proportion of affected articles was relatively flat over this time. Scientific Reports recorded a strong increase in articles with supplementary Excel files in the period 2014-2017 but has since remained stable. The proportion of affected articles in this journal did not show any consistent trend over this period.

Novel error types
While we are familiar with common SEPT and MARCH conversions, we observed a variety of additional novel error modes. Some of these were likely related to locale language settings. In a few cases, the human gene AGO2 was converted to Aug-02 (eg: PMC5537504 & PMC6244004), which may be due to Excel working in languages such as Italian, Spanish or Portuguese. Similarly, the gene MEI1 was seen to be converted to May-01 (eg: PMC6065148 & PMC5877863), which could be due to the similarity with the Dutch word for May (mei). In one article (PMC5908809), TAMM41 was apparently converted to "Jan-41" due to similarity with the month of January in Finnish (tammikuu). There were also several cases where the dates appeared to be unrelated to Excel date genes. For example, article PMC6330011 S4 Table contained the following: "Feb-97, Aug-97, Nov-97, Feb-98, Aug-98". Information in other columns of the spreadsheet indicated that these originated from SEPT, MARCH and DEC gene names. Cells containing Aug-97 through Aug-11 corresponded to SEPT2 to SEPT14 and SEP15. Article PMC5989470 showed evidence that the protein name "jun-1" was converted to "May-31". We posit that this type of error is caused by the spreadsheet evaluating protein names like "jun-1" as the month of June minus 1.

Other observations were more puzzling. There were two papers where it appears the P2RY1 gene (Ensembl identifier ENSGALG00000016687) was converted to "7"; possibly a problem in an upstream database. In one sheet (Table S5 of article PMC6506828), the numeric value "3002" was observed in the gene symbol column beside "NM_198411", corresponding to Inverted Formin 2 (Inf2). Perhaps the spreadsheet interpreted "Inf2" as a numerical value.

Discussion
We hypothesized that, after a previous publication in 2016 received substantial attention in technology and social media spheres, researchers and publishers would be aware of the issue of gene name conversion in spreadsheets and the prevalence of such errors would decline. On the contrary, this work demonstrates that overall there has been no substantial change in the rate of gene name errors in the period 2014 to 2020. Indeed, the proportion of articles with Excel gene lists containing gene name errors was significantly higher here as compared to a previous report (30.9% and 19.6% respectively) [2]. This is due to two main contributors. Firstly, the articles here were sampled from PMC as compared to a set of 18 major genomics journals. Secondly, this work identified gene names becoming converted to the internal date format, which accounts for ~15% of such errors detected here. These numbers correspond to the number of days since 1st January 1900; indeed, this is how spreadsheet software stores date information internally. Gene names can become converted to five-digit numbers by first being converted to dates upon import, followed by the cell formatting being changed to "number" or "text"; the change becomes permanent when the spreadsheet is saved.
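The relationship between five-digit numbers and dates can be made concrete with a short sketch. It assumes the 1900 date system used by Excel, in which serial 1 is 1st January 1900 and the year 1900 is treated as a leap year, so an epoch of 30th December 1899 gives correct dates for serials of 61 and above. The function names are hypothetical and the March/September check is only the heuristic suggested by the results, not a definitive test.

```python
from datetime import date, timedelta

# With Excel's 1900 date system, serial 1 is 1 January 1900. Because Excel
# counts a non-existent 29 February 1900, an epoch of 30 December 1899 gives
# the correct calendar date for every serial from 61 (1 March 1900) onwards.
EXCEL_EPOCH = date(1899, 12, 30)


def serial_to_date(serial):
    """Convert a spreadsheet date serial (61 or greater) to a calendar date."""
    if serial < 61:
        raise ValueError("serials below 61 are affected by the 1900 leap-year quirk")
    return EXCEL_EPOCH + timedelta(days=serial)


def date_to_serial(day):
    """Convert a calendar date on or after 1 March 1900 to its serial number."""
    return (day - EXCEL_EPOCH).days


def in_march_or_september(serial):
    """Heuristic: does the serial fall in the months expected for MARCH/SEPT genes?"""
    return serial_to_date(serial).month in (3, 9)


# SEPT1 typed into a spreadsheet during 2016 becomes 1 September 2016,
# which is stored internally as the five-digit number 42614.
print(date_to_serial(date(2016, 9, 1)))
print(serial_to_date(42614).isoformat())
```

Five-digit serials cover dates from 1927 to 2173, so any date produced by a recent conversion falls in this range.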
Another take-away from this study is that articles with supplementary Excel gene lists in highly reputable journals like Cell, Nature and Proc Natl Acad Sci USA more frequently contained gene name errors as compared to their counterparts with lower JIF scores. This may seem counterintuitive, but is consistent with previous analysis [2]. Although it has been suggested that articles in highly prestigious journals are of an inferior methodological quality [6], the simpler explanation is that the number and size of supplementary gene lists accompanying articles is the main contributor to this trend (although we have not examined this hypothesis quantitatively). This is likely a contributing factor to why so many gene name errors were identified in Nature Communications. This journal recommends authors provide source data which contain the raw data underlying any graphs and charts, resulting in more data in attached Excel files. Additionally, this is a prolific and fast-growing multidisciplinary journal with 6,448 published articles in 2020 and ~15% year-on-year growth since 2014. Concerningly, the proportion of papers in Nature Communications with supplementary Excel gene lists affected by gene name errors also increased in the period 2014-2020 (Fig 3).
There are limitations to this study that need to be pointed out. For convenience, we only screened open access articles in PMC and so this might not be representative of the work in paywalled articles. Moreover, we screened a subset of PMC articles that contained the keyword "genom*" in the abstract or title. Out of 3,291,704 articles in PMC published in the period 2014-2020, we included only 166,139 (~5.0%). There are likely many gene name errors outside of this sample of articles and there is a chance that such errors appear at varying rates in the articles not analysed here.
The updated screening software yielded a slightly higher fraction of false positives, but this was circumvented by systematically opening each file manually for verification. Our script only identified vertical gene lists, so there were likely some in the horizontal orientation that were missed.
There has been a great deal of discussion around who is responsible for the persistence of gene name errors over time. The software developers surely must take some blame because these conversions occur without any user notifications, and the date conversion feature is not one that can be disabled. In their defence, we must understand that Excel and other spreadsheet software were designed only for lightweight data entry and calculation, not for analysis of data containing many thousands of rows. Reviewers are doing their best with limited time but can do better with regards to quality checking supplementary files. Journal editors have yet to put in place systems to identify gene name errors before they are published. Surely some blame rests on the researchers who inadvertently make these mistakes. In particular, senior authors need to take leadership in picking up such errors when they arise, but more importantly, they need to provide training opportunities and promote a culture of reproducibility in the groups they lead. Academic faculty need to ensure that biology graduates are trained in contemporary skills to conduct data-driven research that goes beyond appropriate use of spreadsheets. This needs to include competence in scripted computer languages, statistical analysis and computational reproducibility [5]. From the researcher's perspective, there are several practical ways that such errors can be avoided (Box 1).
The HGNC has taken the initiative to change the most susceptible gene names, but this will not entirely solve the problem. There are a number of gene names that could be converted if the user computer is set up to use a non-English language. While human, mouse, and rat gene names have been changed, such changes are yet to take place for other species such as D. rerio, C. elegans, D. melanogaster and A. thaliana (See S1 Table). Open-source tools are being developed to circumvent these errors. Truke is a web service that identifies and corrects corrupted gene names in affected files [7], while EscapeExcel is a tool designed to prevent gene name conversions from happening by protecting strings before import [8]. HGNChelper is an R package that recognises and fixes human gene symbols converted to dates [9]. It appears that these developments are not having a major impact yet because gene name errors continue to grow year-on-year and the proportion of affected articles has remained stable since 2014 (Table 3 and Fig 1).
It has been argued that gene name errors are of little consequence to the conclusions of a scientific publication [10]; however, our view is that they are a symptom of a larger problem: overreliance on spreadsheets leads to errors occurring silently in large data files, and such errors are exceedingly difficult for researchers, reviewers, and editorial staff to identify. Previous spreadsheet research in the business setting indicates that errors exist in 0.9% to 1.9% of formula cells and that, from a sample of 50 spreadsheets, only seven were error-free [11]. In the healthcare sector, an analysis of data entry errors into a clinical pathology spreadsheet found errors in 0.5% to 6.4% of cells [12], while a systematic analysis of spreadsheet errors in a hospital setting found critical errors in 11 of 12 spreadsheets analyzed [13]. In the biomedical research setting we know that spreadsheet errors can occur and impact downstream work involving clinical drug trials [14]. Despite this potential risk, there has yet to be a systematic assessment of the full taxonomy of spreadsheet errors in biomedical research, so we don't know how frequently they occur.
It must be noted that a blanket ban on spreadsheets as supplementary files is unlikely to mitigate gene name errors entirely, as many researchers might simply export their working spreadsheets to flat text files (errors included). Rather, raising standards around computer code sharing, code review, and reproducibility measures is more likely to deliver lasting improvements in the quality of published research.
In summary, this work demonstrates that gene name errors in supplementary data files of research articles are more frequent than previously appreciated and are not declining over time. Eliminating gene name errors will require major changes to researcher practices which are unlikely to happen in the near term. To monitor gene name errors in PMC we have set up an automated reporting system that will be updated monthly (URL: http://ziemann-lab.net/public/gene_name_errors/).

Box 1. Tips to avoid gene name errors
• Scripted analyses are preferred over spreadsheets. Gene name to date conversion is a bug specific to spreadsheets and doesn't occur in scripted computer languages like Python or R. In addition, analyses conducted with Python and R notebooks (eg: Jupyter or Rmarkdown) capture computational methods and results in a stepwise fashion, meaning these workflows can be more readily audited. These notebooks can therefore achieve a higher level of computational reproducibility than spreadsheets. Although this requires a big investment in learning a computer language, this investment pays off in the longer term.
• If a spreadsheet must be used, then LibreOffice is recommended because it will prevent such errors from occurring. This will not remedy other error types.
• If using Excel is unavoidable, then take great care importing the data. If opening a TSV or CSV file, use the data import wizard to ensure that each column of data is formatted appropriately. For example, columns containing gene names should be formatted as "free text", genomic coordinates formatted as "integers" and gene expression measurements as "numeric".
• Instead of spreadsheets, share genomic data as "flat text" files. These typically have the suffixes "csv", "tsv" or "txt". These are native formats for computer languages and suitable for long term data archiving. Excel formats such as "xls" or "xlsx" are proprietary and future development is decided by Microsoft.
• If it is unavoidable to use a spreadsheet with genomic data, verify that gene names are intact. To do this, sort columns containing gene names in ascending order. This will bring dates and numbers to the top of the column so it is obvious whether any gene symbols have been converted. Alternatively, use the Truke web tool to identify such errors (http://maplab.imppc.org/truke/).
• Assume that there are Excel date gene names in your organism of interest. Although human and mouse SEPT and MARCH gene names have been changed to avoid such errors, there are many taxa across Eukarya that are yet to see similar changes. Excel date gene names may also be prevalent in Prokarya.
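To illustrate the Box 1 tips on scripted analysis and flat text files, the sketch below writes a small tab-separated table and reads it back with Python's standard `csv` module. The table contents and column names are hypothetical. The point is that every field is read as a plain string and numeric columns are converted explicitly, so no gene symbol is silently reinterpreted.

```python
import csv
import io

# Hypothetical differential expression table with date-prone gene symbols.
rows = [
    ("gene", "log2FC"),
    ("SEPT8", "1.25"),
    ("MARCH3", "-0.50"),
    ("DEC1", "2.00"),
]

# Write the table as tab-separated flat text (an in-memory file for the demo).
buffer = io.StringIO()
csv.writer(buffer, delimiter="\t", lineterminator="\n").writerows(rows)

# Read it back: the csv module returns every field as a string, so column
# types are only changed where the analyst asks for it.
buffer.seek(0)
genes = []
fold_changes = []
for record in csv.DictReader(buffer, delimiter="\t"):
    genes.append(record["gene"])                  # kept as text
    fold_changes.append(float(record["log2FC"]))  # explicit numeric conversion

print(genes)
print(fold_changes)
```

This mirrors the import wizard advice: gene names are declared as text and expression values as numeric, except that here the choice is recorded in code and can be audited.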

Characterising spreadsheet software behaviours
We tested the default behaviour of four different spreadsheet programs (Microsoft Excel 365 MSO version, Google Sheets (accessed 4th June 2021), LibreOffice v6.4.6.2, and Gnumeric v1.12.46) by entering the list of strings shown in Box 2. These data were entered into spreadsheets by (i) opening directly from a text file with csv or tsv suffix, (ii) typing directly into cells, and (iii) pasting from a separate text file. We then observed and recorded the propensity of these programs to perform date conversion of the gene symbols.

Screening gene names that get converted to dates
All eukaryotic gene annotation files were downloaded from Ensembl (Vertebrates v102, Metazoa v49, Plants v49, Fungi v49 and Protists v49). Gene names were extracted from the GTF files and imported into Excel together with the taxa name (species/strain). The gene name column was sorted to bring cells containing dates to the top of the sheet, where we counted the number of date conversions per taxon.

Searching PMC
PMC (URL: https://www.ncbi.nlm.nih.gov/pmc/) was our starting point for shortlisting open-access publications to screen. We did not screen every publication in PMC because most do not include genomic data. By searching for publications with the keyword "genom*" in the title or abstract, we were able to reduce the number of articles screened by ~95%. For example, in the year 2015, there were 405,251 articles published, but only 21,213 had the keyword "genom*" in the title or abstract. We used this approach to create lists of PMC identifiers by year for the period 2014 to 2020.
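The extraction of gene names from GTF files, described under "Screening gene names that get converted to dates", could look roughly like the sketch below. It assumes the usual Ensembl GTF layout of nine tab-separated fields with a `gene_name "..."` entry in the attribute field. The function name and the example records (including their coordinates and identifiers) are invented for illustration and are not the code or data used in the study.

```python
import re

# Matches the gene_name attribute in the ninth GTF field.
GENE_NAME = re.compile(r'gene_name "([^"]+)"')


def gene_names_from_gtf(lines):
    """Collect the distinct gene names found on 'gene' feature lines of a GTF file."""
    names = set()
    for line in lines:
        if line.startswith("#"):          # header and comment lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue
        match = GENE_NAME.search(fields[8])
        if match:                         # some genes have no gene_name attribute
            names.add(match.group(1))
    return sorted(names)


# Invented GTF records: coordinates and identifiers are placeholders.
example = [
    "#!genome-build example\n",
    '1\tensembl\tgene\t100\t900\t.\t+\t.\tgene_id "GENE0001"; gene_name "sept8"; gene_biotype "protein_coding";\n',
    '1\tensembl\texon\t100\t200\t.\t+\t.\tgene_id "GENE0001"; gene_name "sept8";\n',
    '2\tensembl\tgene\t500\t1500\t.\t-\t.\tgene_id "GENE0002"; gene_biotype "protein_coding";\n',
    '3\tensembl\tgene\t10\t400\t.\t+\t.\tgene_id "GENE0003"; gene_name "march3"; gene_biotype "protein_coding";\n',
]
print(gene_names_from_gtf(example))
```

The resulting names can be written to a text file alongside the taxon name and then opened in the spreadsheet program being tested.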

Updated software for scanning for gene name errors
A shell script was used to perform the following. Each PMC publication in the shortlist was downloaded as an HTML file. Links to files with .xls or .xlsx suffixes in the HTML were extracted; these were assumed to be supplementary Excel files. Each Excel file was downloaded, and file metadata was scanned to confirm it was an Excel file and not simply a tabular text file with an incorrect suffix. True Excel files were extracted with an R script (R v4.0.0) using the readxl package v1.3.1 (https://CRAN.R-project.org/package=readxl) into tabular files. Other text-based files with .xls or .xlsx suffixes were processed with ssconvert v1.12.46 to tabular files. As per a previous study [2], these tabular data underwent screening for columns that contained gene symbols. Those columns with five or more gene symbols were considered to be gene lists and underwent screening for erroneous conversions, such as date formats and scientific numbers. The main difference from the previous study is that this script also recognises five-digit numbers (internal date format). Analysis logs were processed and brought together with the corresponding journal name to yield a list of supplementary files suspected to contain a gene name error.
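Two of the steps above, extracting links to Excel files from an article's HTML and deciding whether a column is a gene list, can be sketched in Python as follows. This is an illustration under stated assumptions and not the shell and R code used in the study: the class and function names, the example HTML and the `known_symbols` set are hypothetical, and matching against a set of known symbols is one plausible way to recognise gene symbols.

```python
from html.parser import HTMLParser


class ExcelLinkParser(HTMLParser):
    """Collect href values of <a> tags that end in .xls or .xlsx."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and href.lower().endswith((".xls", ".xlsx")):
            self.links.append(href)


def excel_links(html):
    """Return links that are assumed to point to supplementary Excel files."""
    parser = ExcelLinkParser()
    parser.feed(html)
    return parser.links


def is_gene_list(column, known_symbols, min_hits=5):
    """A column counts as a gene list if it holds at least min_hits known symbols."""
    hits = sum(1 for cell in column if cell.strip() in known_symbols)
    return hits >= min_hits


# Hypothetical article HTML and reference symbols.
page = (
    '<a href="/articles/example/bin/supp_table1.xlsx">S1 Table</a>'
    '<a href="/articles/example/bin/figure1.png">Fig 1</a>'
    '<a href="/articles/example/bin/supp_table2.XLS">S2 Table</a>'
)
known_symbols = {"TP53", "ACTB", "GAPDH", "MYC", "EGFR", "BRCA1"}
column = ["TP53", "ACTB", "1-Mar", "GAPDH", "MYC", "EGFR"]

print(excel_links(page))
print(is_gene_list(column, known_symbols))
```

A column passing `is_gene_list` would then be checked cell by cell for dates, floating-point numbers and five-digit numbers.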

Verification and data visualisation
Each of these suspect files was downloaded and opened with either Excel or LibreOffice Calc to confirm the presence of gene name errors. To do this, columns appearing to contain gene names were sorted such that numeric values (dates) were brought to the top of the sheet. Summary data were loaded into R v4.1.0 for analysis and visualisation. The two-sided Pearson and Spearman correlation tests were executed in R.
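For readers unfamiliar with the two correlation measures, the sketch below computes the Pearson and Spearman coefficients from first principles. It is not the R code used for the analysis, it does not compute p-values, and the input values are toy numbers chosen to show how the two measures differ, not data from this study.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / math.sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks; tied values share the mean of the positions they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Toy values: y rises with x, but not linearly.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
print(round(pearson(x, y), 3), round(spearman(x, y), 3))
```

Because the Spearman coefficient depends only on ranks, it equals 1 for any strictly increasing relationship, whereas the Pearson coefficient is below 1 unless the relationship is linear.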
Supporting information
S1 Table. Excel files suspected to contain gene name errors by the screening software but turned out to be false positives.