Consensus assessment of the contamination level of publicly available cyanobacterial genomes

Publicly available genomes are crucial for phylogenetic and metagenomic studies, in which contaminating sequences can be the cause of major problems. This issue is expected to be especially important for Cyanobacteria because axenic strains are notoriously difficult to obtain and keep in culture. Yet, despite their great scientific interest, no data are currently available concerning the quality of publicly available cyanobacterial genomes. As reliably detecting contaminants is a complex task, we designed a pipeline combining six methods in a consensus strategy to assess the contamination level of 440 genome assemblies of Cyanobacteria. Two methods are based on published reference databases of ribosomal genes (SSU rRNA 16S and ribosomal proteins), one is indirectly based on a reference database of marker genes (CheckM), and three are based on complete genome analysis. Among those genome-wide methods, Kraken and DIAMOND blastx share the same reference database that we derived from Ensembl Bacteria, whereas CONCOCT does not require any reference database, instead relying on differences in DNA tetramer frequencies. Given that all the six methods appear to have their own strengths and limitations, we used the consensus of their rankings to infer that >5% of cyanobacterial genome assemblies are highly contaminated by foreign DNA (i.e., contaminants were detected by 5 or 6 methods). Our results will help researchers to check the quality of publicly available genomic data before use in their own analyses. Moreover, we argue that journals should make mandatory the submission of raw read data along with genome assemblies in order to facilitate the detection of contaminants in sequence databases.


Introduction
Publicly available genomes are the basic ingredient of numerous studies, from single-gene functional studies to multi-genome phylogenetic inferences. Their quality (e.g., completeness, a1111111111 a1111111111 a1111111111 a1111111111 a1111111111 or freshwater habitats (as picoplankton or benthic mats) to hypersaline environments or hot springs, in polar, temperate and tropical regions, with the exception of acidic waters [23]. Usually, contaminants are organisms that live in close proximity to the sequenced organism (i.e., parasites, symbionts, epiphytes) [6,10] or that are simultaneously studied [6,8]. It is for instance well known that growth of picocyanobacteria (Synechococcus and Prochlorococcus) is improved by mutualistic interactions with heterotrophic organisms, which notably remove toxic reactive oxygen species from these phototroph/heterotroph systems [24][25][26][27]. Cyanobacteria are also known to excrete different compounds, including polysaccharides, proteins, nucleic acids, osmolytes, in their immediate environment [28,29]. Bacteria from other phyla, such as Bacteroidetes and Proteobacteria, feed on those extracellular productions and thus live in close relationship with Cyanobacteria [28,30,31]. Because of this trophic coupling, the majority of available cyanobacterial strains are not axenic, hence resulting in potential genome contamination. In this respect, we recently noticed that six genome assemblies (GCA_000472885.1, GCA_000817745.1, GCA_000817775.1, GCA_000817785.1, GCA_000817735.1, GCA_000828075.1) presented two or even three copies of many proteins usually encoded by a single gene, one being the likely genuine cyanobacterial protein and the other one(s) being closely related to non-cyanobacterial species (data not shown).
As these observations prompted us to investigate the issue thoroughly, we designed a novel strategy, based on state-of-the-art tools, to evaluate the contamination level of publicly available genomes. Altogether, our analyses allowed us to establish a global ranking of 440 cyanobacterial genome assemblies, from the most contaminated to the least contaminated. These results suggest that genome producers should strive to upload genomic sequences devoid of contamination and that genome consumers should be cautious when analyzing such data in downstream studies. Moreover, we advocate the stance that raw sequencing reads should be released along with newly assembled genomes, so as to help genome scientists to evaluate the quality of primary data.

Case study strategy developed for the assessment of the contamination level of cyanobacteria
To evaluate the contamination status of public cyanobacterial genome assemblies, we implemented a strategy based on combining evidence across multiple independent methods (Fig 1). The rationale was that, given the complexity of the task, no single method would be expected to reach both maximal sensitivity and specificity. Moreover, we were interested not only in quantifying the overall contamination level of cyanobacterial genomes, but also in recognizing and eliminating foreign regions or scaffolds, so as to generate "decontaminated" assemblies. Hence, we first considered a few reliable loci, i.e., genes that have an extremely low probability of horizontal gene transfer (HGT), such as SSU rRNA (16S) and ribosomal proteins. We used RNAmmer/SINA [32,33], an approach based on a reference SSU rRNA (16S) database, and "42", a BLASTX-based program coupled with a reference ribosomal protein database to detect sequences in cyanobacterial genomes that are phylogenetically related to non-cyanobacterial organisms. Second, we used CheckM [14], a comprehensive package based on the phylogenetic labelling of lineage-specific marker genes, thereby allowing us to probe a larger part of the cyanobacterial assemblies. Third, we explored three genome-wide methods to maximize our detection power. We investigated their parameterization by comparing their results with those based on ribosomal genes, considered as the gold standard. Hence, we tested Kraken [34], a widely used metagenomic classifier based on long (signature) DNA kmers (21-31 nt), against a curated reference database derived from Ensembl Bacteria, and CONCOCT [35], a short DNA kmer (4-6 nt) metagenomic binning package that does not require a reference database, but designed to take advantage of sequencing coverage. As a third genome-wide method, we turned to DIAMOND blastx [36], using the same reference database as with Kraken. Altogether, our analyses allowed us to establish a global ranking of the 440 publicly available cyanobacterial genome assemblies, from the most contaminated to the least contaminated. We conclude that about 20 assemblies are highly contaminated by sequences from foreign phyla, whereas >200 additional assemblies are likely to be at least slightly contaminated. Finally, we provide download links to alternative versions of these assemblies, in which contaminating regions (or whole scaffolds) have been masked.

Properties of the cyanobacterial genome assemblies
We downloaded 440 cyanobacterial genome assemblies, altogether representing the eight orders defined in Komarek et al. [37], as well as the newly erected Gloeomargaritales [38]. These genomes correspond to 421 different organisms, the remaining 19 assemblies being updated versions of existing genomes or independent assemblies (or sequencing plus assembly) of the same strain (S1 Table). An overview of the morphology, the habitat and the taxonomy of the strains composing our dataset is given in S1 Fig. QUAST analyses allowed us to probe the size and the fragmentation level of the assemblies (Fig 2). While many unicellular Cyanobacteria have a genome size around 2 Mbp, filamentous, and especially heterocystous, Cyanobacteria are among the bacterial organisms featuring the largest genomes (Fig 2A). However, these analyses also revealed that 11 assemblies were <500 kbp, which strongly suggests that they do not correspond to complete genomes, whereas some others are suspiciously large for Cyanobacteria (5 assemblies >15 Mbp, including 2 >68 Mbp but with >90% "N" nucleotides). Regarding the fragmentation level, even if some genomes are assembled into a low number of scaffolds (e.g., one chromosome and a few plasmids), a large part of our dataset consists in genomes represented by more fragmented assemblies, with 57% of the genomes having >20 scaffolds (Fig 2B and S1 Table).

Ribosomal genes as a first estimator of the contamination level
Since C. Woese and co-workers [39], the SSU rRNA (16S) gene is considered the "gold standard" taxonomic marker. Consequently, large, broadly-sampled and trustworthy reference databases have been available for a long time [40], as well as specialized software to predict (e.g., RNAmmer, [32]) and classify (e.g., SINA, [33]) rRNA genes of unknown origin with both high sensitivity and specificity. Strikingly, 85 assemblies of our dataset appeared devoid of any SSU rRNA (16S) sequence, including 6 of the 13 ultrasmall "genomes" (S1 Table). Moreover, 4 assemblies contained only unclassified sequences and 1 assembly only one noncyanobacterial sequence. Among the 350 assemblies featuring at least one cyanobacterial sequence, 15 contained at least one sequence of non-cyanobacterial origin (7 assemblies had 1, 5 had 2, and 3 had 3).
Albeit the presence of one or more foreign SSU rRNA (16S) gene(s) almost certainly reveals a contaminated genome assembly, it only represents a single gene [41], even if it can be found in multiple copies (up to four loci) in Cyanobacteria [42]. To increase the odds of identifying sequences of non-cyanobacterial sources, we turned to a reference database of ribosomal protein genes [43]. While the latter are much less sampled than SSU rRNA (16S) genes, they have the advantage of representing about 50 loci spread over about 10 operons [44]. However, they are not totally immune to HGT (e.g., rps14 [45]). That is why we first inferred phylogenetic  trees for all alignments built from the database to identify and remove xenologous reference sequences. Mining of the cyanobacterial genome assemblies against ribosomal protein alignments with our own software "42" (which controls for orthology relationships; available at https://bitbucket.org/dbaurain/42/) showed that 21 assemblies contained at least one ribosomal protein gene of foreign origin (8 assemblies between 1 and 8 proteins, 9 assemblies between 16 and 41, and 4 assemblies between 56 and 80; S2 Table). As expected, almost all assemblies featuring at least one foreign SSU rRNA (16S) gene also showed at least one foreign ribosomal protein gene (14 out of 16 assemblies). This allowed us to classify assemblies into three categories, based on the number of ribosomal gene methods identifying contaminating sequences in each assembly: 2 (14 assemblies), 1 (9 assemblies) and 0 (417 assemblies).

Lineage-specific marker genes as an extended estimator of the contamination level
Even if ribosomal gene methods are sensitive, their power is limited because ribosomal genes represent a small fraction of a cyanobacterial genome (about 0.4-1.0%). Indeed, should ribosomal genes be partially or completely missing from the contaminating fraction of an assembly, it would lead to an underestimation of the contamination level. To mitigate this risk, we turned to CheckM [14], a two-step contamination detection tool based on lineage-specific marker genes, 104 bacterial genes and 150 archeal genes (both including ribosomal proteins, as with "42").
According to the classification used by Parks et al. [14], CheckM results indicated that 12 cyanobacterial genome assemblies were very highly contaminated (>15%), 2 highly contaminated (>10% to 15%), 7 moderately contaminated (>5% to 10%), 301 lowly contaminated ( 5%), whereas only 118 assemblies were not contaminated (= 0%). Interestingly, three assemblies (GCA_000472885.1, GCA_000828085.1, GCA_000817785.1) showed a level of contamination >100%, along with a completeness of 100%, thereby suggesting the presence of more than two contaminant organisms in each of them (multiple occurrences of the same markers; Table 1). All the 29 assemblies flagged as contaminated by at least one of the two ribosomal gene methods were all also tagged by CheckM. However, 4 assemblies (GCF_000828075.2, GCA_000341585.1, PRJNA165539, GCA_000775285.1) contaminated at >10% in terms of ribosomal proteins were only tagged as lowly contaminated by CheckM (3.41%, 1.96%, 0.88%, 0.84%, respectively). The reason for this discrepancy is likely that CheckM organizes marker genes into sets of collocated genes for estimating contamination. Indeed, genes that reside in close proximity to each other do not provide independent information regarding the overall level of contamination within a genome.

Genome-wide estimation of the contamination level using long DNA kmers
To confirm marker-based results and improve our detection power, we looked for a genomewide method and used a metagenomic software package relying on the classification of long (21-31 nt) signature DNA kmers, Kraken [34]. Kraken splits genomes into kmers that it organizes in a taxonomic tree that is then queried to classify raw sequencing reads to taxa of increasing ranks, depending on the conservation of their component kmers (see [34] for details). In this work, we "simulated" raw sequencing reads by splitting the cyanobacterial scaffolds into pseudo-reads of 250 nt and fed them to Kraken in order to analyze the taxonomic composition of each assembly. To minimize potential issues due to incomplete or aberrant genomes, we limited the genome-wide analyses reported in this section to the 343 assemblies that were not too small (!500 kbp) nor too large ( 15,000 kbp) and that contained at least one cyanobacterial SSU rRNA (16S) gene(s). However, all 440 assemblies were eventually analyzed and the corresponding results are shown in S2 Table, in which those 97 atypical assemblies are denoted by various symbols.
Three variables affect Kraken ability to classify sequences: the kmer size, the reference genome database, and the confidence parameter. To maximize the fraction of classified sequences (including those from evolutionarily isolated assemblies containing many unique kmers), we used a kmer size of 21 nt and built a curated database (27,762 genomes) from the release 30 of Ensembl Bacteria [46]. These important methodological choices are discussed in S1 Appendix (see also S2 and S3A and S3B Figs).
The confidence threshold is a parameter meant to adjust the trade-off between specificity and precision (in terms of taxonomic ranks). Because it has a large impact on Kraken behavior, fine-tuning this parameter is a crucial step. To this end, we compared Kraken results obtained on the 343 typical assemblies at three different confidence thresholds (0.02, 0.04 and 0.06) to the results obtained on the same assemblies with the ribosomal gene methods (categories 0, 1 and 2), here considered as the gold standard ( Fig 3A). We derived a contamination level from Kraken results for each assembly by summing the classifications that corresponded to noncyanobacterial sources. For some assemblies, a non-negligible pool of classified sequences were actually classified to the high-ranking "Bacteria" taxon (or to the lower "Terrabacteria" taxon). While these sequences were genuinely part of the total (100%), we did not include them in our contaminating fraction, nor in our cyanobacterial fraction and labelled them as "unknown". For our purposes, lowering Kraken precision thus affects its sensitivity, since a reduced precision increases the share of these classified-yet-unknown sequences. Based on Fig 3A, and with the aim of minimizing the fraction of unclassified/unknown sequences (sensitivity) without wrongly tagging as contaminated too many apparently clean genomes (specificity), we chose to work with the 0.04 confidence threshold to analyze our dataset (S2 Table). Overall, Kraken analyses provided an independent confirmation that 137 cyanobacterial genome assemblies are contaminated at a level !1% (255 at a level >0%). However, a number of these assemblies appear to have contaminating fractions well above the median, despite the SSU rRNA (16S) and ribosomal protein analyses suggesting these genomes are free of contamination (outliers in Fig 3A).
Contaminated fractions of the 343 cyanobacterial genome assemblies (expressed in %) were estimated with Kraken using three different confidence thresholds (a), or with DIAMOND blastx using three different hit number thresholds (b), or with CONCOCT using three different kmer sizes (c), then partitioned into three sets of assemblies, based on the number of ribosomal gene methods (SSU rRNA (16S) and ribosomal proteins) identifying contaminating sequences in each assembly (0, 1 or 2). Upper and lower whiskers extend from the hinge to the largest and lowest value no further than 1.5 Ã IQR from the hinge, respectively. Data points beyond the end of the whiskers are outliers and are plotted individually. The values given on top of each series in panels a and b are the median and IQR for the unclassified fractions across the 343 cyanobacterial assemblies, independently of the ribosomal category.

Genome-wide estimation of the contamination level using whole proteomes
Our thorough optimization of Kraken analyses led us to conclude that it is of limited use for our purpose, because it looks for known signature kmers. Given the fast evolutionary rate of nucleotide sequences, genomes of organisms not closely related to any reference genome cannot be classified. Moreover, combining all signature kmers into a single data structure makes it impossible to exclude self-matches when probing an assembly for contamination. This makes Kraken results uninformative for any genome used to build the reference database. To address this problem, we developed a detection protocol based on DIAMOND blastx [36], an ultrafast clone of the BLAST algorithm. In contrast to Kraken, DIAMOND blastx can classify organisms that are distant from all those of the reference database (protein sequences being much more conserved than nucleotide sequences), while still running fast (20,000 times faster than the NCBI BLAST+ implementation) [47]. Besides, it allows skipping self-matches when parsing its output for last-common ancestor (LCA) classification of pseudo-reads. We used the same curated database as for Kraken, except that reference sequences were conceptual translations of the genes instead of whole genomic sequences. Our LCA inference algorithm was inspired from MEGAN [48]. Using a tree-like collapsing of best hits taxonomy down to a bit score threshold expressed as a percentage of the highest bit score, it often allows labelling pseudo-reads based on more than a single best hit.
The better sensitivity of protein searches improved the classification of distant assemblies (S3C Fig; slope = -243, Pearson r = -0.76, P-value = 7.92e-35). However, the unclassified fraction of the assemblies was larger with DIAMOND blastx (median = 14.5%, IQR = 14.6%) than with Kraken (median = 0.2%, IQR = 11.4%). This reduced power of DIAMOND blastx is due to the fact that protein-based searches only target coding sequences. Accordingly, the unclassified fraction fits well with the fraction of non-coding sequences in the assemblies (median = 15.9%, IQR = 7.3%).
The potential of DIAMOND blastx was then evaluated exactly as for Kraken, using the 343 cyanobacterial genome assemblies and the ribosomal gene results as the gold standard. A striking difference was that the contamination levels estimated with DIAMOND blastx were much narrower, while still showing a good agreement with ribosomal gene categories (Fig 3B). Since the bit-score threshold did not make any obvious difference in our setup, we settled on a value of 5% of the best hit bit score, as in MEGAN (S2 Table). Even though DIAMOND blastx contamination levels were in line with Kraken levels across all ribosomal gene categories (compare medians of the middle series between panels a and b of Fig 3), all cyanobacterial genome assemblies appeared to contain a non-null fraction of foreign sequences with DIAMOND blastx. In contrast, 181 assemblies were devoid of any foreign sequence with Kraken, including the 170 untestable assemblies (NA in S2 Table) that are part of its reference database.
As a contribution towards better publicly available genome assemblies, we provide download links to our curated reference database derived from Ensembl Bacteria 30, a post-processor script for DIAMOND blastx output (https://figshare.com/s/344be979f5f2237ad735), and "decontaminated" versions of the 440 cyanobacterial genome assemblies (based on DIA-MOND blastx results) (https://figshare.com/s/08193cb71f0a65d99c69).

Genome-wide estimation of the contamination level using short DNA kmers
As explained in S1 Appendix, an issue with database-based methods, such as Kraken and DIA-MOND blastx, is that assemblies from Cyanobacteria that are isolated from a phylogenetic point of view can be very difficult to investigate due to the proper lack of related reference genomes in the database. To estimate the contamination level while accounting for sequences originating from not yet sampled organisms, we resorted to a program that does not require a reference database: CONCOCT [35]. This software clusters assembly sequences into non-hierarchical groups based on a Principal Component Analysis (PCA) of short (4-6 nt) DNA kmer frequencies. We hypothesized that a high number of CONCOCT groups would hint to a mixture of several genomes into a single assembly. However, we failed to demonstrate any correlation between the number of CONCOCT groups based on kmer frequencies and the DIAMOND blastx contamination level (e.g., for kmer size = 4: Pearson r = -0.04, P-value = 0.31).
By looking at detailed CONCOCT results, we noticed that many cyanobacterial genome assemblies had an overwhelming share of their genomic sequence included in the most abundant CONCOCT group (median = 94.45%, IQR = 15.28%). We thus confidently assumed that this group represented the genuine organism. For practical purposes, we further considered all the other groups as contaminants, which allowed us to compute contamination level estimates by summing the latter fractions. Of course, this is an approximation, because some minor groups might actually correspond to atypical regions of the genuine organism (e.g., repeated elements, plasmids). Nevertheless, when applying this simple metric to tetramer frequencies of the 343 typical assemblies, it performed well in comparison to ribosomal gene methods ( Fig  3C), in spite of a non-negligible number of assemblies wrongly flagged as highly contaminated (outliers in Fig 3C). In contrast, it did not work with pentamers and hexamers, because the most abundant CONCOCT group at these kmer sizes often corresponds to a (much) lower share of the genomic sequences (median = 90.94%, IQR = 21.56% for kmer size = 5 and median = 1.89%, IQR = 62.76% for kmer size = 6).

Global ranking of cyanobacterial genome assemblies
As the three genome-wide methods had largely confirmed the results obtained with ribosomal genes and lineage-specific marker genes on a subset of 343 cyanobacterial genome assemblies, we decided to use the contamination levels estimated through all six methods to build a consensus global ranking of the 440 assemblies composing our dataset (S2 Table). Congruence between individual methods and with this global ranking was investigated using Spearman rank correlations (S3 Table). Ribosomal gene methods correlated weakly when considering either 440 (rho = 0.32/0.37) or 343 assemblies (rho = 0.34/0.35), but worked much better on the top-50 (rho = 0.74/0.79). This is hardly surprising given that, being less sensitive than genome-wide methods for their limited number of interrogated loci, they can only find contaminants in highly contaminated assemblies. On the opposite, the database-free CONCOCT correlated well on the 440 assemblies (rho = 0.63) and this also result held for the 343 assemblies (rho = 0.61). Yet, it did not when focusing on the top-50 (rho = 0.14), as expected from its reasonable but rough metric. The situation for Kraken was similar to that of CONCOCT, with rho = 0.59 (440), rho = 0.59 (343), rho = 0.32 (50), suggesting that in spite of the shortcomings of its monolithic reference database (see S1 Appendix), Kraken remains useful for real applications. CheckM was the method that correlated the best on average-rho = 0.75 (440), rho = 0.76 (343), rho = 0.73 (50)-followed by DIAMOND blastx-rho = 0.61 (440), rho = 0.68 (343), rho = 0.75 (50). Interestingly, even the best methods had their outliers (e.g., CheckM: GCA_001456025.1 at rank 83, Kraken: GCA_001039265.1 at rank 36, DIAMOND blastx: GCA_000484535.1 at rank 119). This seems unavoidable as pipelines involving several programs and databases are difficult to develop and test exhaustively [49]he contamination level is more valuable than relying on a single approach.
Based on this global ranking, we defined three sets of assemblies. The top-21 (19 different genomes) contains assemblies tagged as contaminated by at least five of the six detection methods (except GCF_001904775.1, which is tagged by four methods) and thus corresponds to those with the highest level of contaminants. Three assemblies (GCA_000472885.1, GCA_000817785.1, GCA_000828085.1) even show CheckM contamination levels >100%, suggesting contamination by more than one organism. For these 21 assemblies, we explored the distribution of contaminants over the genomic scaffolds using DIAMOND blastx (S4 Fig). In many cases, it is very clear that at least two types of scaffolds co-exist, one corresponding to cyanobacterial segments (in green) and the other to contaminant segments (in red). Such a partition is often reflected in the GC-content of the scaffolds (e.g., GCA_000472885.1, GCA_000817745.1, GCF_001637395.1), and ribosomal gene (both SSU rRNA (16S) and ribosomal proteins) hits nearly always co-localize with the expected type of scaffold (e.g., GCA_000341585.2). This indicates that the different methods are congruent even within a single genome assembly and that their consensus faithfully reflects the underlying contamination level. After the top-21, a second part of the ranking (down to rank 230) contains assemblies that are only slightly contaminated. This is where discrepancies between individual methods are the most common (e.g., GCF_001456025.1 at rank 83, GCA_000291825.1 at rank 227). Finally, the last part of the ranking (210 assemblies) contains genomes that have a very low level of contaminants (generally <1% for both CheckM and DIAMOND blastx estimates).

Taxonomic analysis of contaminants
According to DIAMOND blastx, the two most frequent contaminant phyla in our top-21 ranking of contaminated assemblies are Proteobacteria and Bacteroidetes (Fig 4; see S5 Fig for  a similar analysis for all 440 assemblies). Proteobacteria contaminate 19 assemblies of the top-21 at a level !1%, Bacteroidetes contaminate 5 assemblies, whereas Firmicutes and Chloroflexi contaminate 1 assembly. Other contaminants (Acidobacteria, Actinobacteria) are only found at <1% in cyanobacterial genome assemblies. The detection of Proteobacteria can be explained by the presence of these bacteria in cyanobacterial polysaccharide sheaths, which are a source of nutrition for them [28]. Such sheath colonization often hinders efforts to make cyanobacterial cultures axenic [13]. Hence, the complete genome of Blastomonas sp., a heterotrophic proteobacterium living in close association with Cyanobium sp., has been recently obtained from a non-axenic culture of the latter [29]. Regarding Bacteroidetes, members of this phylum can degrade polymeric compounds, and one isolate was hypothesized to feed on cyanobacterial biomass [50]. Fig 4 further shows that some assemblies have a high level of unclassified sequences (up to 58%). As aforementioned, with methods based on reference databases, the efficiency of classification heavily depends of the diversity of the database. It means that sequences from uncommon organisms, such as extremophiles, are difficult to classify because they are under-represented in reference databases. As an example, Cyanobacteria bacterium JGI 0000014-E08 (PRJNA165539; unpublished), which was collected in a lagoon with a reportedly unique microbial diversity (https://www.ncbi.nlm.nih.gov/bioproject/?term=PRJNA165539), displays 58.4% of unclassified sequences. Moreover, a part of the unclassified fraction of this genome is probably of cyanobacterial origin, since cyanobacterial ribosomal gene hits are located on scaffolds with a high level of unclassified segments (S4 Fig), which also highlights the usefulness of combining the results of multiple methods, especially those of genome-wide approaches that output detailed taxonomic reports (i.e., per scaffold).
The 21 top-ranking contaminated cyanobacterial genome assemblies (see Table 1) were analyzed with DIAMOND blastx, as explained in the main text (using a LCA approach against a protein version of our curated Ensembl 30 database). Taxonomic classifications (expressed Contamination level of cyanobacterial genomes in % of the genomic sequence) are summarized at the phylum level. The "unclassified" classification corresponds to sequences that do not match any reference protein in the database, whereas "unknown" corresponds to high-ranking LCAs (Bacteria or Terrabacteria). "others" include the following phyla (in descending order of frequency): Spirochaetes, Deinococcus-Thermus, Candidatus Tectomicrobia, Verrucomicrobia, Nitrospirae, Tenericutes, Armatimonadetes, Fusobacteria, Thermotogae, Gemmatimonadetes, Synergistetes, Chlamydiae, Thaumarchaeota, Thermodesulfobacteria, Crenarchaeota, Chrysiogenetes, Dictyoglomi. Complete results for the 440 assemblies are available in S2 Fig, whereas an overview of the contaminant genera found in the 21 top-ranking assemblies in given in S5 Table. Validation using the sequencing coverage On average, contaminated genome assemblies have a higher number of scaffolds than non-contaminated assemblies, owing to both the decrease in sequencing coverage and the increase in assembly complexity associated with the mixing of several organisms in the same "genome" (Pearson r = 0.36, P-value = 2.71e-15). In some cases, however, chimerical scaffolds containing a mixture of cyanobacterial and non-cyanobacterial segments can be observed (e.g., GCF_001482745.1, GCA_000643395.1). Confirmed by detailed taxonomic analyses of the scaffolds (S4 Table), this result suggests that these scaffolds are probably the products of a too greedy concatenation of contigs during the finishing steps of genome sequencing. Ideally, such a bold interpretation should be supported by additional evidence, so as to exclude false positives due to methodological artifacts (e.g., reference database issues, lack of sensitivity of BLAST heuristics) and/or HGT.
To distinguish between genuine contamination and other events, we tried to take advantage of the sequencing coverage, i.e., by monitoring the number of reads mapping at every position of an assembly. Indeed, contaminant DNA is expected to be less abundant than genuine DNA (or at least not in equal proportion), which should result into non-cyanobacterial scaffolds and/ or sub-scaffolds having a lower (or at least different) coverage than their cyanobacterial counterparts [35,51]. In principle, the coverage along a genome can easily be obtained by remapping the raw sequencing reads to the corresponding assembly. Unfortunately, a careful search in NCBI SRA for raw sequencing reads corresponding to the 440 cyanobacterial genome assemblies revealed that such data files were only available for 83 (19%) assemblies, of which 63 (14%) were really usable for computing the sequencing coverage. In particular, coverage could be computed for only one (yet the most) highly contaminated assembly (GCA_000472885.1 at rank 1). This large "genome" (15.8 Mbp) is composed of two populations of scaffolds, each one characterized by a distinct value of GC-content ( Fig 5A). As for most other assemblies of the top-21 (see S4 Fig), this parameter agrees with our classification of scaffolds as either cyanobacterial (in green) or contaminant (in red), whether using ribosomal genes or DIAMOND blastx for classification. Hence, scaffolds characterized by a high GC-content are those that likely originate from the contaminating organism, rather than from the supposedly sequenced cyanobacterium. As expected, differences in sequencing coverage correlate well with the split into two populations, cyanobacterial scaffolds having a median coverage of 68 (IQR = 9), whereas contaminant scaffolds drop to 26 (IQR = 13). In contrast, when running similar analyses on a noncontaminated cyanobacterial genome assembly (GCF_000214075.1, rank = 379), both the GCcontent and the sequencing coverage remain largely uniform, except for plasmid segments visible at extremities of the GC-content range (Fig 5B). This confirms that this assembly only contains cyanobacterial scaffolds, as predicted without using the sequencing coverage.
For non-cyanobacterial segments in assemblies for which coverage cannot be obtained, it might be difficult to decide between a contamination (i.e., the experimentally-dependent introduction of a foreign DNA sequence in a genome assembly), a HGT event (i.e., the natural introduction of a foreign DNA fragment into a genome to be sequenced) or a methodological artefact (e.g., a chimerical or contaminated genome in the reference database). With respect to artefacts, we are quite confident that they should rarely affect assemblies in our top-21, because of the high level of congruence observed between the detection methods, including published approaches (e.g., CheckM) and in-house protocols based on robust algorithms (e.g., MEGANlike LCA inference), using a large and curated reference database. Regarding HGT, it is assumed to be common in prokaryotes [52]. However, recent studies do not necessarily agree on the extent of the phenomenon (e.g., [53,54]), and most authors argue that HGT is more frequent among closely related organisms [55][56][57][58]. Hence, HGT has been shown to occur within Cyanobacteria [59][60][61][62], often at different rates [63,64], depending on the evolutionary constraints acting on the transferred genes, i.e., "stand-alone" functions vs operon-encoded functions [61,[65][66][67][68][69]. In the context of the present study, aimed at assessing the contamination level of publicly available genomes, our opinion is that HGT should not be a major issue Contamination level of cyanobacterial genomes because we only consider as contaminants the segments that are clearly non-cyanobacterial (i.e., we do not address intra-cyanobacterial contamination) and because real HGT events are more likely to be found in the classified-yet-unknown fraction of the assemblies. Indeed, as HGT makes look highly similar sequences from organisms potentially very distant from a taxonomic point of view, LCA-based methods (Kraken, DIAMOND blastx) label them with highranking taxa, such as Bacteria, that our post-processors then convert to "unknown". Nevertheless, studies specifically designed to assess the extent of HGT in existing genomes should probably double-check that some of their results are not the product of contamination rather than transfer, whether in reference genomes or in genome assemblies under investigation.
Two representative cyanobacterial genome assemblies, the highly contaminated GCA_000472885.1 (A) and the non-contaminated GCF_000214075.1 (B), were split into segments of 10,000 nt for detailed analysis. On the X axis, segment-containing scaffolds were sorted by increasing average GC-content, with vertical thin lines representing scaffold boundaries. Top panels: segments were classified as either cyanobacterial (green), contaminated (red), unknown (grey) or unclassified (white), using DIAMOND blastx. The overlaid light blue curve is the GC-content of the corresponding segments. Open symbols give the locations of the detected ribosomal genes (circles: SSU rRNA (16S), squares: ribosomal proteins), either from cyanobacterial sources (green) or from non-cyanobacterial sources (red). Bottom panels: the light red curve gives the sequencing coverage of the segments.

Conclusion
Among the 440 cyanobacterial genome assemblies surveyed in this study, we concluded that 21 were highly contaminated (corresponding to 19 different genomes). Proteobacteria and Bacteroidetes are the most common contaminants, probably due to their close proximity with some Cyanobacteria in the environment, and maybe because of trophic interactions complicating the generation of axenic strains [28]. The six genomes that initially raised our attention, and motivated us to design a consensus pipeline for detecting contaminants, all occupy the top-tier of our global ranking (GCA_000472885.1, rank 1; GCA_000817745.1, rank 3; GCA_000817785.1, rank 6 GCA_000817775.1, rank 8; GCA_000817735.1, rank 9; GCA_000828075.1, rank 13). Moreover, 10 of the top-21 contaminated assemblies discovered in this study have now been removed from RefSeq, or have at least been flagged as suspicious in their metadata by NCBI curators, suggesting that our ranking procedure is accurate.
The vast majority of publicly available cyanobacterial genomes are only slightly contaminated (209 assemblies) or likely non-contaminated (210 assemblies). However, the impact of an error (in particular, a contamination) can be very detrimental [3]. For instance, researchers may infer that a particular enzyme is present in some cyanobacterium, whereas in reality it is absent from the phylum, or that some Cyanobacteria are much more prone to HGT events than others. As researchers, we are more interested by exceptions than by those that play along the rules. Hence, finding an orthologous gene in >400 cyanobacterial genomes attracts less attention (and impact) than a gene specific to Bacteroidetes in a handful of Cyanobacteria. In other words, we are much more prone to ingenuously study erroneous than correct data, thereby amplifying small errors and polluting the scientific literature with incorrect statements resulting from the analysis of contaminated genomes.
It is therefore of prime importance to eliminate as many errors as possible from genomic sequences that now constitute a mainstay of numerous research domains, and this is especially important for newly generated genomes making their way up to reference databases used for taxonomic classification. To this end, releasing the raw reads at the same time as the assembled genome should become mandatory. The availability of such data would allow others not only to repeat (or improve) the assembly process, but also spot obvious contaminants based on discrepancies in sequencing coverage. However, such a policy would likely be insufficient, because one cannot expect every genomicist to look for contaminants in tens of thousand of bacterial genomes by downloading and re-mapping raw reads on the corresponding assemblies. To ease the task of the large scientific community using genomic data, the sequencing coverage of each position, as well as other key statistics obtained during the assembly process (e.g., the number of non-mapping reads, the number of incongruently mapping paired ends) should be published along with the genome assembly. This would allow researchers to discard from their homology searches all the regions with a low coverage (probable contaminants) or with a high coverage (probable repeated elements). Technically, making available the .bam files in addition to the raw read files would provide most of the information needed by others. Implementing this constraint is then simply a matter of editorial policy, exactly as when, decades ago, Molecular Biology and Evolution decided to reject any manuscript featuring a phylogenetic tree devoid of statistical support (e.g., bootstrap). In contrast, factoring genomic meta-data (sequencing coverage, paired-end incongruence, etc.) into homology searches is likely to be more difficult. Indeed, it will require devising a standard format and modifying numerous software packages to take advantage of it, but it is certainly worth the effort if we want to make the most of genomic data.

Cyanobacterial genome assemblies
Cyanobacterial genome assemblies have been collected from four databases using a custom Perl script. The latter uses the assembly number to query public sequence databases according to a predetermined order, so that the genome redundancy across banks is handled consistently. Genomes were first collected from the release 30 of Ensembl Bacteria [46], then from the NCBI (with a priority on RefSeq [70], then from PATRIC [71], and finally from the bacterial part of the JGI genome portal [72]). A first batch of downloads was carried out on the 14th of April 2016 and a second batch on the 11th of December 2016.
Assemblies were first analyzed using QUAST 2.3 [2] with default values. The number of scaffolds, the number of scaffolds >1000 nt, the total length of the genomes, the total length of the genomes based on scaffolds >1000 nt, the length of the largest scaffold, the GC-content, the N50/N75 values, the L50/L75 values and the number of N's per 100 kbp were collected and summarized in S1 Table, along with the morphology, the habitat and the taxonomy of each corresponding strain.

SSU rRNA (16S) analyses
SSU rRNA (16S) genes in cyanobacterial genome assemblies were predicted with RNAmmer 1.2, a rRNA predictor using hidden Markov models [32]. The putative extracted SSU rRNA sequences were taxonomically classified by SINA 1.2.11, a multiple sequence alignment and classifier tool (associated with SILVA rRNA database) [33]. SINA was used in combination with the non-redundant and curated release 123.1 of SILVA, which contains 645,151 SSU rRNA (16S) sequences for the classification [40]. A rough estimate of the contamination level in each assembly was computed by taking the ratio of the number of SSU rRNA (16S) genes classified to foreign (i.e., non-cyanobacterial) Bacteria to the total number of predicted SSU rRNA (16S) genes.
The distance matrix used for benchmarking reference databases was computed by aligning the SSU rRNA (16S) sequences with MAFFT 7.273 [73], then selecting the conserved positions of the alignment with BMGE 1.12 [74] and finally calculating the distance matrix with IQPNNI 3.3.2, a tree-reconstruction algorithm based on maximum likelihood (ML) and used here with the GTR+Γ 4 model [75].

Ribosomal protein analyses
The contamination level estimates of cyanobacterial genome assemblies were refined using a broadly-sampled dataset of 52 multiple sequence alignments (MSAs) of ribosomal proteins aligned with MAFFT. The dataset was assembled from a ribosomal protein database for prokaryotes, RiboDB 1.4.0 [43], available at https://ribodb.univ-lyon1.fr/. These protein sequences were used as a reference to detect and classify contaminating sequences with "42", a program whose aim is to add (and to optionally align) sequences to a pre-existing MSA while controlling for orthology relationships (Baurain et al., to be published elsewhere; https://bitbucket. org/dbaurain/42/). Here, we used "42"'s ability to spot contaminants by attempting to add ribosomal proteins from each of our 440 cyanobacterial genomes to each MSA of our dataset of ribosomal MSAs. In practice, "42" searches each genome for sequences that are homologous to sequences of the current MSA and then, through advanced heuristics, sort out orthologues from possible paralogues. Each orthologous sequence is further classified by computing the LCA of the three sequences of the MSA that are the most similar to the newly added orthologue, with a minimum identity threshold of 70% and a minimum alignment length threshold of 30 amino acids. By default, we expect that the inferred LCA is in agreement with the taxonomy of the genome from which comes the orthologue, i.e., of cyanobacterial origin. If not, the orthologue is considered as a contaminant of the genome assembly (or remains unclassified if not matching anything close enough in the reference database).
We used RAxML trees inferred under the LG+F model [76] and the CAT approximation to check whether reference Cyanobacteria were monophyletic for all 52 ribosomal proteins. When automated parsing of the bipartitions suggested that it was not the case, visual inspection of the ML tree allowed us to decide between HGT (6 genes) and sequence divergence/ reconstruction error (3 genes) as the cause for the non-monophyly. Before proceeding, a few xenologous sequences were thus removed from 6 MSAs (L12, L31, L36, S16, S21, uS4). Moreover, 6 MSAs (L10, L25, L28, L29, L32, S21) were completely discarded due to poor performance of "42" (i.e., no or too few added sequences, orthology assessment issues, for further details, see S6 Table). The remaining 46 MSAs represented 178,634 sequences sampled across a variety of 3474 different bacterial and archaeal strains in total. They were used to estimate the contamination levels as with SSU rRNA (16S) genes, but by summing over all individual ribosomal proteins found in each cyanobacterial genome assembly, i.e., ignoring their possible collocation on the chromosome.

CheckM analyses
We analysed the 440 cyanobacterial genome assemblies with CheckM 1.0.7. CheckM uses a concatenation of predicted ribosomal proteins to automatically place the assembly under study in a reference genome tree. Then, it estimates the completeness and the contamination level of the assembly by searching the sequence for lineage-specific marker genes provided with the software (see [14]). We used the typical automatical workflow option lineage_wf. The contamination percentage extracted from the output was directly considered as the contamination level estimate.

Kraken analyses
Kraken 0.10.5-beta is a program that assigns taxonomic labels to genomic DNA sequences [34]. It uses exact (signature) kmer (21, 25, 31 nt) alignment to associate a taxonomic tree with a kmer database that it builds itself according to a default or user-specified genome set [24]. We used a newly created curated database derived from Ensembl Bacteria 30 (see S1 Appendix for details) with three different confidence thresholds (0.02, 0.04 and 0.06) to analyze cyanobacterial pseudo-reads. Kraken labels were further post-processed to classify pseudo-reads as either cyanobacterial, contaminating (non-cyanobacterial), unknown (too high-ranking LCA, such as "Bacteria" or "Terrabacteria") or unclassified (no hit in the database). Finally, a global contamination level was computed for each assembly by taking the ratio of the number of non-cyanobacterial pseudo-reads to the total number of pseudo-reads.

DIAMOND blastx analyses
To check the effect of using proteins instead of whole nucleotide genomes on detection sensitivity, we repeated most Kraken analyses with DIAMOND 0.8.22.84 blastx [36]. For this, we built a DIA-MOND database from the corresponding 27,762 complete proteomes of our curated Ensembl 30 genome set. Cyanobacterial genome assemblies, still split into pseudo-reads of 250 nt, were BLASTed against this protein database. Each pseudo-read was then labelled by computing the LCA of its best hits (excluding self-matches), provided they had a bit-score !80 and within 95% of the bit-score of the first hit (MEGAN-like algorithm [48]). We tested three different thresholds for the minimal number of hits required for LCA inference (1, 2 or 3) and chose to use a minimal number of 1 hit. From there, pseudo-read classification and contamination level estimation were carried out exactly as for Kraken. Decontaminated cyanobacterial genome assemblies, in which regions corresponding to contaminating pseudo-reads have been removed (or only masked with N's), are available for download at https://figshare.com/s/08193cb71f0a65d99c69. For S5 Table, the MEGAN-like algorithm was run using a bit-score threshold of 99% (instead of 95%).

CONCOCT analyses
We also analyzed cyanobacterial genome assemblies with CONCOCT 0.4.1 [35] using three default values for kmer size (4, 5 and 6). For each assembly, the script cut_up_fasta.py (provided with CONCOCT) was used to cut the genome sequences into non-overlapping segments of 10,000 nt. Dummy coverage files with a value of 1 for every segment were also created. CONCOCT was used on the resulting files with default values: a length threshold of 1000 nt for the segments and a number of groups of 400 for the clustering based on the principal component analysis (PCA) of kmer frequencies. For each assembly, we computed the total number of segments for the whole assembly and counted the number of segments associated with every group determined by CONCOCT. Then, we selected the group containing the largest number of segments and considered it to be the "core" genome without any contamination or recent HGT. To estimate the contamination level of an assembly, we used the simple following formula: one minus the ratio between the number of segments belonging to the largest group and the total number of segments.
Importantly, CONCOCT was designed for metagenomic analyses and requires coverage data to realize a meaningful clustering of sequencing reads after the PCA [35]. This is also the case for most other programs created for metagenomics, such as MetaBat [51], VizBin [77], GroopM [78], MaxBin [79] and the older CompostBin [80]. Unfortunately, sequencing coverage could be obtained for only 63 assemblies (see Validation using the sequencing coverage for details). Yet, when we reran CONCOCT using coverage files, we did not observe any large difference in clustering with respect to the results obtained without coverage (data not shown). However, none of these 63 assemblies (including 7 from the 97 atypical assemblies), except Mastigocoleus testarum BC008 (GCA_000472885.1, unpublished), were highly contaminated (DIA-MOND blastx level <2%). Thus, the use of real coverage values might have had little effect on the clustering results. This interpretation is strengthened by the observation that real coverage values were indeed relatively constant within each of the 62 non-contaminated assemblies, as assessed through the corresponding quartile coefficients of dispersion (median = 0.055, IQR = 0.053).

Global ranking
Since all methods were in general agreement, we decided to retain all of them in our final global ranking of cyanobacterial genome assemblies. This consensus ranking was computed by averaging the ranks of each assembly in the individual rankings obtained for each of the six methods (S2 Table). For the 170 assemblies included in the reference genome database, Kraken results were considered as missing values (NA) when taking the rank average. Spearman rank correlations were computed for all methods and the global ranking, considering all 440 assemblies, only the 343 typical assemblies, or only the 50 top-ranking assemblies out of the 440 (S3 Table), using the R option pairwise.complete.obs to handle missing values.

Validation using the sequencing coverage
To confirm that foreign sequences in contaminated cyanobacterial genomes were genuine contaminants, we checked that their sequencing coverage was different from the coverage of cyanobacterial sequences. Coverage was estimated using BBMap (http://bbmap.sourceforge.net/) from raw read data (when available). However, this was possible for only 63 of the 83 available SRA files: 12 genomes had avf_fold values <5 and 8 genomes returned no result at all. Among the 63 assemblies with coverage, only one (GCA_000472885.1) was part of our top-21 ranking, all other assemblies corresponding to slightly or non-contaminated genomes.
For each assembly, sequencing coverage, GC-content (computed with a custom Perl script) and taxonomic classification (cyanobacterial, contaminated, unknown and unclassified sequence fractions, as determined by DIAMOND blastx) were computed on 10,000-nt segments (as for CONCOCT) and used for generating combined genome maps. In the corresponding plots, scaffolds are arranged by GC-content, which allows highlighting the covariation of the latter with contamination level and sequencing coverage. As an additional congruence test, the positions of the SSU rRNA (16S) and ribosomal protein genes classified to Cyanobacteria and/or to contaminant organisms were also plotted below their corresponding scaffolds. The top-21 ranking assemblies were split into segments of 10,000 nt for detailed analysis. On the X axis, segment-containing scaffolds were sorted by increasing average GC-content, with vertical thin lines representing scaffold boundaries. Segments were classified as either cyanobacterial (green), contaminated (red), unknown (grey) or unclassified (white), using DIAMOND blastx. The overlaid light blue curve is the GC-content of the corresponding segments. Open symbols give the locations of the detected ribosomal genes (circles: SSU rRNA (16S), squares: ribosomal proteins), either from cyanobacterial sources (green) or from non-cyanobacterial sources (red). (PDF)  Table) were analyzed with DIAMOND blastx, as explained in the main text (using a LCA approach against a protein version of our curated Ensembl 30 database). Taxonomic classifications (expressed in % of the genomic sequence) are summarized at the phylum level. The "unclassified" classification corresponds to sequences that do not match any reference protein in the database, whereas "unknown" corresponds to high-ranking LCAs (Bacteria or Terrabacteria). "others" include the following phyla (in descending order of frequency): Euryarchaeota, Nitrospinae, Deinococcus-Thermus, Candidatus Tectomicrobia, Verrucomicrobia, Spirochaetes, Nitrospirae, Armatimonadetes, Aquificae, Synergistetes, Gemmatimonadetes, Chlamydiae, Thermotogae, Fusobacteria, Deferribacteres, Thaumarchaeota, Thermodesulfobacteria, Tenericutes, Chrysiogenetes, Crenarchaeota, Lentisphaerae, Caldiserica, Dictyoglomi. (PDF) S1 Table. List of the public cyanobacterial genome assemblies used in this study. The table gives the accession, the name (binomial incl. strain), the taxonomy (order, NCBI lineage), the ecology (morphology and habitat), and the assembly properties (as determined by QUAST) of the 440 assemblies surveyed. Assemblies are sorted by NCBI lineage. ( Ã ) indicates assemblies for which raw read data are in principle available for download from NCBI SRA; (˚) indicates assemblies that are devoid of SSU rRNA (16S) classified as Cyanobacteria; (+) indicates assemblies that are too large (>15,000 kbp); (-) indicates assemblies that are too small (<500 kbp). (XLSX) S2 Table. Global ranking of the cyanobacterial genome assemblies. The table gives the accession, the name (binomial incl. strain), the main assembly properties (as determined by QUAST), and the ranking results of the 440 assemblies surveyed. Assemblies are sorted from the most contaminated to the less contaminated. ( Ã ) indicates assemblies for which raw read data are in principle available for download from NCBI SRA; (˚) indicates assemblies that are devoid of SSU rRNA (16S) classified as Cyanobacteria; (+) indicates assemblies that are too large (>15,000 kbp); (-) indicates assemblies that are too small (<500 kbp). Assemblies included in the reference database have their Kraken contamination level set to missing data (NA). (XLSX) S3 Table. Correlation of all six methods with the global ranking. Spearman rank correlations were computed for all six methods and the global ranking, considering either all 440 assemblies, or only the 343 typical assemblies, or only the 50 top-ranking assemblies out of the 440. (XLSX) S4 Table. Taxonomic analysis of the scaffolds of the assemblies in the top-21 ranking. The table gives the DIAMOND blastx taxonomic classifications (in %, at the phylum level) of each scaffold of the top-21 assemblies (one sheet per assembly). For each scaffold, the GC-content (in %) and the scaffold length (in nt) are also reported. (XLSX) S5 Table. Taxonomic overview of the contaminant genera of the assemblies in the top-21 ranking. Only contaminants that could be identified at least at the genus level are considered. Redundant assemblies were only analyzed once. Absolute counts are in pseudo-reads, whereas relative counts are expressed with respect to the total number of pseudo-reads indentified at least at the genus level. (XLSX) S6 Table. Summary of the optimization of the "42" analyses. The table gives the number of ribosomal proteins added to each MSA using different combinations of "42" parameters. The final combination is highlighted on a red background. (XLSX)