EMBL2checklists: A Python package to facilitate the user-friendly submission of plant DNA barcoding sequences to ENA

Background The submission of DNA sequences to public sequence databases is an essential but insufficiently automated step in the process of generating and disseminating novel DNA sequence data. Despite the centrality of database submissions to biological research, few software tools facilitate the preparation of sequence data for database submissions, especially for sequences generated via plant DNA barcoding. Current submission procedures can be complex and prohibitively time-consuming for any but a small number of input sequences. A user-friendly software tool is needed that streamlines the file preparation for database submissions of DNA sequences that are commonly generated in plant DNA barcoding.
Methods A Python package was developed that converts DNA sequences from the common EMBL and GenBank flat file formats to submission-ready, tab-delimited spreadsheets (so-called “checklists”) for a subsequent upload to the public sequence database of the European Nucleotide Archive (ENA). The software tool, titled “EMBL2checklists”, automatically converts DNA sequences, their annotation features, and associated metadata into the idiosyncratic format of marker-specific ENA checklists and, thus, generates output that can be uploaded via the interactive Webin submission system of ENA.
Results EMBL2checklists provides a simple, platform-independent tool that automates the conversion of common plant DNA barcoding sequences into easily editable spreadsheets that require no further processing before their upload to ENA via the interactive Webin submission system. The software is equipped with an intuitive graphical interface as well as an efficient command-line interface for its operation. The utility of the software is illustrated by its application in the submission of DNA sequences of two recent plant phylogenetic investigations and one fungal metagenomic study.
Discussion EMBL2checklists bridges the gap between common software suites for DNA sequence assembly and annotation and the interactive data submission process of ENA. It represents an easy-to-use solution for plant biologists without bioinformatics expertise to generate submission-ready checklists from common plant DNA sequence data. It allows the post-processing of checklists as well as work-sharing during the submission process and solves a critical bottleneck in the effort to increase participation in public data sharing.

Only a few software tools assist in the preparation of DNA sequence data for submission to public sequence databases, despite the centrality of this process for generating and disseminating novel biological data. Contemporary biological research depends on the preservation, curation, and reproducibility of the data under study [1,2], and the submission of analyzed data to publicly accessible databases constitutes one of the most important best practices in biology [3,4], particularly in the era of big data [5]. DNA sequences generated to identify and characterize novel organisms or uncharted biodiversity must typically be submitted to public sequence databases before publication of the research is granted [6,7]. Compliance with this prerequisite remains mixed [8][9][10]. Several large nucleotide sequence repositories accept DNA sequence submissions, including GenBank [11], the European Nucleotide Archive (ENA; [12]) and the DNA Data Bank of Japan [13]. These repositories coordinate their policies and operations under the umbrella of the International Nucleotide Sequence Database Collaboration (INSDC; [14]), but each database employs custom procedures for sequence upload and data submission. ENA, for example, channels the submission of annotated DNA sequences through the Webin submission framework (https://www.ebi.ac.uk/ena/submit/sra/; [15]), which, in its interactive version, operates with pre-formatted, tab-delimited spreadsheets (called "annotation checklists" or "templates") that are filled out by the user and then uploaded for submission. In order to account for different types of annotated DNA sequences (e.g., coding vs. non-coding, nuclear vs. organellar origin), a series of pre-tailored spreadsheets (hereafter "checklists") was developed by ENA, each with its idiosyncratic, tab-delimited fields of information. Users of ENA must choose the correct spreadsheet for their data submission (Fig. 1), and different types of DNA sequences must be submitted via separate data uploads. Since June 2017, the submission process through Webin has been automated and now includes automatic validation procedures for annotation features, taxonomic metadata, and sequence integrity [12]. Despite the centrality of data sharing to biological research, the range of user-friendly software tools that assist in data preparation for database submission is perceived as low [3]. Indeed, very few contemporary, user-friendly software tools exist (e.g., Geneious [16]) that facilitate the preparatory steps of annotated DNA sequences prior to uploading them to public sequence repositories. Software tools that assist with, and are specifically customized for, the preparation of common plant DNA barcoding sequences are particularly sparse.

Unlike sequence submissions to GenBank, the preparation of DNA sequence data for submission to ENA is insufficiently facilitated, highlighting the demand for software that converts annotated DNA sequences into submission-ready checklists. Upon DNA sequencing, researchers often utilize user-friendly software suites such as Artemis [17], Geneious or PhyDE [18] for the assembly and annotation of DNA sequences. Some of these suites (e.g., Artemis, Geneious) enable the conversion of annotated DNA sequences to file formats that are easily submittable to GenBank, either by producing files in a [...]

[...] checklists are uploaded or generated online; and a programmatic route, in which both checklists and pre-formatted flat files can be submitted to the ENA server [12].

[...] ("plant DNA barcodes"; [22]) have become a key method in botanical research [23].

Thousands of DNA sequences have been generated in investigations on suitable plant DNA barcoding markers [24][25][26], and plant DNA barcoding is now routinely applied across evolutionary, ecological and conservation research [22,23,27], even in regional studies [28][29][30]. Future investigations applying plant DNA barcoding will invariably require the submission of novel DNA sequences to public sequence repositories [7,27], and a user-friendly, streamlined data submission process can be instrumental to their data sharing process [3]. In fact, plant DNA barcoding sequences lend themselves to the application of software tools that streamline and automate their submission process, because common barcoding markers display (a) a general homogeneity in sequence length and gene synteny, at least within most target lineages [31,32], (b) a general absence of structural inversions or strong secondary structure [32,33] [...]

[...] input and returns properly formatted checklists that are ready for data upload to ENA via the interactive Webin submission system (Fig. 2).

The software EMBL2checklists was designed to convert annotated DNA sequences and associated metadata into six different Webin checklist types. Sequence submission via the Webin submission system is primarily conducted via marker-specific checklists [15]. These checklists are pre-tailored to the idiosyncrasies of different genome regions, with different checklists displaying marker-specific customizations in order to capture the distinct information of the genomic regions under study (Fig. 1). A conversion of annotated DNA sequences into Webin checklists, thus, needs to take these marker-specific customizations into account. For example, a checklist that contains sequence information on the plastid trnK/matK region will be more complex than a checklist on the nuclear ribosomal external transcribed spacer (ETS) due to the location of the gene matK inside the group II intron of the tRNA gene for lysine (trnK-UUU; [34]) [...]

[...] correct annotation features and feature qualifiers as input (Table 2), EMBL2checklists can generate marker-specific checklists for a series of DNA markers that are commonly employed in plant DNA barcoding. These markers are: (i) a common gene intron (e.g., trnL intron; [35]); (ii) a common intergenic spacer (IGS; e.g., trnH-psbA; [36]); (iii) the plastid trnK/matK region [37]; (iv) the nuclear ribosomal rRNA-encoding rDNA genes (e.g., 18S rDNA; [38]); (v) the nuclear ribosomal internal transcribed spacer (ITS; [24]); and (vi) the nuclear ribosomal ETS [39].

EMBL2checklists is able to convert multiple sequence records contained in an input flat file into a single Webin checklist. EMBL- or GenBank-formatted flat files may contain multiple sequence records, each with a specific set of annotation features and sequence metadata. EMBL2checklists accepts such a user-selected flat file as input, parses each sequence record individually, and writes the parsed information to the output file. Specifically, the software converts the sequence information contained in each sequence record into a separate line of the resulting checklist. Programmatically, EMBL2checklists parses the flat file via the Biopython library [40] and then iterates through the sequence records, processing one record at a time (Fig. 3). During each iteration, the DNA sequence of a record, its annotation features, and its associated metadata are extracted. The extracted information is formatted to a pre-tailored, [...]

[...] consists of two main processes that are executed sequentially for each sequence record: input audit and data processing (Fig. 3). Upon initialization, EMBL2checklists [...]

[...] qualifier "SEDIMENT" cannot be parsed from such a sequence record (Table 2).
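The record-by-record conversion described above can be illustrated with a minimal sketch. It is not the code of EMBL2checklists: plain dictionaries stand in for the sequence records that the software obtains via Biopython, and the column names are hypothetical placeholders, since real Webin checklists define their own marker-specific fields.

```python
import csv
import io

# Hypothetical column names, chosen for illustration only.
COLUMNS = ["ENTRYNUMBER", "ORGANISM_NAME", "SEQUENCE"]


def records_to_checklist(records):
    """Return tab-delimited text with a header and one row per record."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(COLUMNS)
    # Process one record at a time; each record becomes one checklist line.
    for number, record in enumerate(records, start=1):
        writer.writerow([number, record["organism"], record["sequence"]])
    return buffer.getvalue()


# Placeholder records standing in for parsed flat-file entries.
records = [
    {"organism": "Species one", "sequence": "ATGC"},
    {"organism": "Species two", "sequence": "GGCC"},
]
checklist_text = records_to_checklist(records)
```

Keeping one row per record means the resulting file can be opened and edited in any spreadsheet program before upload.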

Sequence records that fail the evaluation of minimal feature prerequisites are skipped, whereas those that fail the parsing of correct marker abbreviations terminate the entire software execution (Fig. 3) because the latter error is indicative of an incorrect checklist selection by the user. Upon successful input audit, data processing of the sequence [...]

[...] Table 2) and the completeness of the intron ("5' PARTIAL" and "3' PARTIAL") is determined by [...]

[...] EMBL2checklists because its information is mandatory for certain Webin checklist types. It is typically answered with "yes" if the DNA sequences under study were generated as part of a metabarcoding experiment. Upon specifying each of these parameters, EMBL2checklists begins to process the input file.
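The two failure modes of the input audit (skipping a record versus terminating the run) can be sketched as follows. This is an illustration under stated assumptions, not the implementation of EMBL2checklists: the record structure, the function name `audit_records`, the exception `MarkerMismatchError`, and the order of the two checks are all hypothetical.

```python
class MarkerMismatchError(Exception):
    """Signals that a record does not fit the selected checklist type."""


def audit_records(records, expected_marker, required_features):
    """Split records into accepted and skipped IDs.

    Records lacking a required feature are skipped; a record carrying a
    different marker than expected aborts the whole run.
    """
    accepted, skipped = [], []
    for record in records:
        # Assumption: the feature check runs before the marker check.
        if not required_features.issubset(record["features"]):
            skipped.append(record["id"])
            continue
        if record["marker"] != expected_marker:
            raise MarkerMismatchError(
                "Record %s has marker %s, expected %s"
                % (record["id"], record["marker"], expected_marker)
            )
        accepted.append(record["id"])
    return accepted, skipped


# Placeholder records: the second one lacks a required feature.
records = [
    {"id": "rec1", "marker": "ITS", "features": {"source", "misc_RNA"}},
    {"id": "rec2", "marker": "ITS", "features": {"source"}},
]
accepted, skipped = audit_records(records, "ITS", {"source", "misc_RNA"})
```

Raising an exception for a marker mismatch, rather than skipping the record, mirrors the rationale given above: such an error suggests that the wrong checklist type was selected for the entire file.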

Command-line and graphical user interface
EMBL2checklists was developed for classical plant biologists and bioinformaticians alike. Thus, the software is equipped with a graphical user interface (GUI) as well as a command-line interface (CLI) for its operation (Fig. 2). The GUI is based on the Python library "Tkinter" [42] and designed to provide an intuitive and easy-to-use [...]

[...] window of the GUI. The GUI can be accessed through file "EMBL2checklists GUI.py" of the scripts folder. More information on the design and functionality of the GUI of EMBL2checklists is available in [43]. The CLI employs functions of the Python library "argparse" [44] and allows more experienced users to execute the software via the [...]

The CLI can be accessed through file "EMBL2checklists CLI.py" of the scripts folder.
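An argparse-based command-line interface of the kind described above could look roughly like the following sketch. All option names (`--infile`, `--outfile`, `--cltype`, `--env`) and the checklist type labels are hypothetical and chosen for illustration; the actual options of EMBL2checklists may differ.

```python
import argparse

# Hypothetical labels for the six checklist types named in the text.
CHECKLIST_TYPES = ["gene_intron", "IGS", "trnK_matK", "rRNA", "ITS", "ETS"]


def build_parser():
    """Build an argument parser with illustrative, assumed option names."""
    parser = argparse.ArgumentParser(
        description="Convert a flat file into a tab-delimited checklist."
    )
    parser.add_argument("--infile", required=True,
                        help="EMBL- or GenBank-formatted input flat file")
    parser.add_argument("--outfile", required=True,
                        help="path of the checklist to be written")
    parser.add_argument("--cltype", required=True, choices=CHECKLIST_TYPES,
                        help="checklist type matching the DNA marker")
    parser.add_argument("--env", choices=["yes", "no"], default="no",
                        help="whether the sequences are environmental samples")
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(
    ["--infile", "input.embl", "--outfile", "output.tsv", "--cltype", "ITS"]
)
```

Restricting the checklist type to a fixed set of choices lets argparse reject an unsupported type before any input file is read.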

Release, installation and operation
EMBL2checklists was written in Python 2.7 [45] and is, thus, platform independent. It can be executed on any system equipped with a Python 2 interpreter once the necessary Python dependencies are installed. The software uses three separate Python packages as dependencies: Biopython [40], argparse [44] and Tkinter [42]. [...]

[...] investigation [46][47][48]. The two plant phylogenetic investigations utilized common plant DNA barcoding markers to infer the phylogenetic history of select plant lineages [46,47]. The fungal metagenomic investigation utilized nrDNA barcodes to characterize arbuscular mycorrhizal soil fungi [48]. In each case, EMBL2checklists was used to convert flat files in GenBank format that were generated from sets of assembled and annotated sequences via the software suite Geneious. Upon conversion to checklists, the sequence data were uploaded to ENA via the interactive Webin submission system, and accession numbers were received from ENA by email within 48 hours of submission.

Post-processing of checklists and work-sharing
[...] the submitter is associated with the data through the Webin submission service prior to data upload, irrespective of the checklist type. Hence, EMBL2checklists does not need to be executed by the same person who conducts the data upload or has generated the sequence but allows a work-sharing strategy in which one person (or section of a workflow) conducts the data conversion via EMBL2checklists, while another person (or section of a workflow) conducts the data submission. Work-sharing may be helpful if the sequence submission process is centralized within a lab or academic institution, allowing those researchers who prepare the data for submission to ENA to be different from those who actually conduct the data upload.

Data converters and other ENA submission strategies
The paucity of file formats acceptable for data submission to public sequence databases is one of the main bottlenecks in the effort to increase participation in public data sharing and has spurred the recent development of various data converters. The software EMBL2checklists is one of several current projects that aim to provide automated data conversion between the EMBL or GenBank flat file format and data formats that are commonly parsed by biological software and databases [20,49,50]. The underlying aim of many of these projects is to simplify the conversion process of sequence data into file formats that are accepted during submissions to public sequence databases [20,49]. Given the custom validation criteria and the idiosyncratic submission procedures employed by many of these databases, such data converters represent an [...]

The lack of automated conversion between EMBL- or GenBank-formatted flat files and submission-ready Webin checklists represented a gap that compelled many researchers to conduct manual data processing before submitting data to the public sequence database ENA. By developing the software EMBL2checklists, we have filled this gap. EMBL2checklists is designed as an easy-to-use software application that bridges the gap between common software suites for DNA sequence assembly and annotation and the [...]

EMBL2checklists can be employed to prepare the most common plant DNA barcoding marker sequences for upload and submission to ENA via the interactive Webin [...]

[...] EMBL2checklists is best illustrated by its application during the submission preparation of hundreds of DNA sequences submitted to ENA during the publication process of several recent investigations [46][47][48]. With the development of EMBL2checklists, we hope to provide a useful software tool to plant biologists and bioinformaticians alike, increase the amount of sequence data deposited to public sequence databases [7] and advance the idea of publicly shared research data [9,10]. By extension, we believe that EMBL2checklists may play an important role in future data management and data stewardship of plant DNA sequence data under the FAIR data principles [52,53].