Genome size distributions in bacteria and archaea are strongly linked to evolutionary history at broad phylogenetic scales

The evolutionary forces that determine genome size in bacteria and archaea have been the subject of intense debate over the last few decades. Although the preferential loss of genes observed in prokaryotes is explained through the deletional bias, factors promoting and preventing the fixation of such gene losses often remain unclear. Importantly, statistical analyses on this topic typically do not consider the potential bias introduced by the shared ancestry of many lineages, which is critical when using species as data points because of the potential non-independence of residuals. In this study, we investigated the genome size distributions across a broad diversity of bacteria and archaea to evaluate whether this trait is phylogenetically conserved at broad phylogenetic scales. After model fit, Pagel’s lambda indicated a strong phylogenetic signal in genome size data, suggesting that the diversification of this trait is influenced by shared evolutionary histories. We used a phylogenetic generalized least-squares analysis (PGLS) to test whether phylogeny influences the predictability of genome size from dN/dS ratios and 16S copy number, two variables that have been previously linked to genome size. These results confirm that failure to account for evolutionary history can lead to biased interpretations of genome size predictors. Overall, our results indicate that although bacteria and archaea can rapidly gain and lose genetic material through gene transfers and deletions, respectively, phylogenetic signal for genome size distributions can still be recovered at broad phylogenetic scales, and that this signal should be taken into account when inferring the drivers of genome size evolution.


Introduction
Bacterial and archaeal genomes are densely packed with genes and contain relatively little non-coding DNA, and therefore an increase in genome size translates directly into more genes [1][2][3]. In contrast, multicellular eukaryotes generally show genome expansion due to the proliferation of non-coding DNA as a consequence of high genetic drift [2]. The depletion of non-functional elements in prokaryotes is explained through the bias towards more deletions than insertions; newly acquired or existing genes are removed if selection on those genes is insufficient for their maintenance in the population [4][5][6]. Although narrowly constrained when compared with eukaryotes, prokaryotic genome sizes still vary by over one order of magnitude. Assuming an intrinsic deletion bias across all prokaryotes, it remains unclear what evolutionary forces determine which genes are maintained and which are lost, and what determines the variability of genome sizes across the broad diversity of bacteria and archaea.
Multiple individual factors have been hypothesized to be primary drivers of genome size in bacteria and archaea. Early studies suggested that effective population size (Ne) may be the primary force that determines genome size and fluidity in prokaryotes [7,8]. For example, genome reduction has been observed in host-dependent bacteria that have small Ne and correspondingly high levels of genetic drift due to population contractions. Under such evolutionary constraints, slightly deleterious deletions accumulate and cause overall genome reduction [9][10][11][12][13]. Paradoxically, later studies focusing on abundant free-living planktonic lineages in the ocean suggested that genome reduction can also be observed in bacteria with larger Ne that experience strong purifying selection [14][15][16][17]. In this case selection favors genomic economization, such as the removal of paralogs and intergenic sequences. Factors other than Ne and the strength of purifying selection have also been postulated to play a role in determining prokaryotic genome size. Recently, one study suggested that environmental stress leads to genome streamlining in soil bacteria [18], and other genomics studies have suggested that habitat complexity and ecological strategy [19], as well as the capability to use oxygen [20], may also play major roles in determining genome size in bacteria and archaea. Mutation rate has also been proposed to be a major factor determining genome size [21,22]. In particular, it was suggested that a high mutation rate would be the primary cause of genome reduction in both streamlined and host-dependent bacteria due to the erosion of genes, loss of function, and subsequent deletion [21][22][23]. However, other studies analyzing the mutation rate of the abundant picocyanobacterium Prochlorococcus show estimates similar to those of Escherichia coli, casting doubt on the view that high mutation rates drive genome reduction in all cases [24,25].
Given the large number of forces that have been proposed to be primary determinants of genome size, it remains largely unknown whether genome size in prokaryotes is driven by unique variables, their interaction, or variables that have specific influence depending on the lineage. Importantly, most statistical analyses exploring the association between genome size and other traits have typically not used phylogenetic comparative methods that are necessary when using species as data points. Shared evolutionary history may obscure the relationship between traits because the phylogenetic dependence between lineages leads to the violation of the statistical assumption of independence in residuals. Thus, conventional statistical methods can lead to overestimation of the strength of the association between traits [26,27]. In this study, we estimated the phylogenetic signal of genome size across a broad diversity of bacterial and archaeal genomes available on the Genome Taxonomy Database (GTDB) [28,29]. Although genome size has been shown to change rapidly in prokaryotes due to HGT and gene loss, we sought to test if this trait still bore a phylogenetic signal across broad phylogenetic scales. Moreover, because previous studies have suggested that effective population size or ecological niche are potential drivers of genome size [3,8], we evaluated whether correlations with these factors would change if evolutionary history was taken into account. Our work provides important insights into the complex mechanisms that shape genome size in bacteria and archaea, and the importance of considering shared evolutionary relationships when studying its evolution to avoid bias in the association between traits.

Genome size distribution across major phyla of bacteria and archaea
In order to explore the distribution of genome size across the Tree of Life of bacteria and archaea, and to measure phylogenetic signal across broad phylogenetic scales, we built a phylogenetic tree using one representative genome of 836 genera belonging to 33 phyla available on the GTDB. For the reconstruction of this phylogeny, we used a set of ribosomal proteins and RNA polymerase subunits that we have previously benchmarked [30]. The size of genomes in our analysis and across the phylogeny varied by more than one order of magnitude (0.6-14.3 Mbp, Figs 1A and 2). The minimum and maximum corresponded to two bacterial lineages with contrasting lifestyles: the endosymbiont Buchnera aphidicola of the phylum Proteobacteria and the free-living actinobacterium Nonomuraea sp. (Figs 1A and 2). The greatest within-phylum variation of genome size was observed for the phyla Actinobacteria and Cyanobacteria, whereas Patescibacteria had the shortest mean genome length (Fig 1A). We also evaluated the difference in genome size found within the genera used in our study, which we report here as the variance (Fig 1B) and the difference between the largest and smallest genomes within each genus (S1 Fig). Most of the genera used in our analysis (571 out of 836) showed a difference smaller than 1 Mbp, but some genera exhibited a wide range of genome sizes; for example, the genera Streptomyces and Nonomuraea showed a difference of 6.29 and 6.06 Mbp between the smallest and the largest genomes, respectively (Figs 1B and S1). The large difference found between the largest and smallest genome of some of the genera in our dataset is consistent with previous observations of considerable differences in the genome size and genome content of many closely related taxa [31][32][33][34].

Genome size in bacteria and archaea is strongly dependent on phylogenetic history at broad evolutionary scales
Although it is well known that genome size can vary markedly between closely related bacteria and archaea [31][32][33][34], it is still possible that overall genome size distributions are linked to evolutionary history at broad phylogenetic scales, which we define here as anything broader than the genus level according to the GTDB classification (Fig 2). Due to the shared evolutionary history of some lineages, traits of related groups often resemble each other more than those of randomly selected species in the same phylogenetic tree (phylogenetic signal) [35][36][37]. We therefore sought to investigate the phylogenetic signal of genome size distributions in our genome dataset (Fig 2). Phylogenetic methods are needed to analyze these associations because any study involving statistical analyses and species as data points potentially violates the assumption of independence of residuals [26,38].
When studying phylogenetic signal, it is recommended to measure it at two different levels: 1) in traits' raw data and 2) in the residuals resulting from statistical models (e.g., regressions) [39]. As a first approximation, we assessed whether genome size distribution data show phylogenetic signal by estimating Blomberg's K [35] on the genome size of the GTDB genome dataset (Fig 2). Values of Blomberg's K between 0 and 1 indicate that the sizes of closely related genomes resemble each other but less than expected under the Brownian Motion model (BM) of trait evolution, where trait variation is proportional to phylogenetic distance [26]. Conversely, a K of 1 is evidence of genome size variation according to the Brownian Motion expectation [35]. We observed phylogenetic signal in genome size data that is strong but different to what would be expected under the Brownian Motion model (BM) (K = 0.51, P = 0.001), suggesting that although genome size shows phylogenetic signal, variation is not fully explained through phylogenetic distance in our tree [40].
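To make the K statistic concrete, the following is a minimal sketch of how Blomberg's K can be computed from a trait vector and a phylogenetic variance-covariance matrix. It is an illustration only and is not the code used in this study; the function name `blomberg_k`, the four-tip covariance matrix, and the trait values are hypothetical, and significance testing by tip randomization is omitted.

```python
import numpy as np

def blomberg_k(x, C):
    """Blomberg's K for trait values x and phylogenetic covariance matrix C.

    C[i, j] is the branch length shared by tips i and j (root to their most
    recent common ancestor); C[i, i] is the root-to-tip distance of tip i.
    """
    x = np.asarray(x, dtype=float)
    C = np.asarray(C, dtype=float)
    n = x.size
    C_inv = np.linalg.inv(C)
    # Phylogenetically corrected mean (GLS estimate of the root state).
    a_hat = C_inv.sum(axis=0) @ x / C_inv.sum()
    r = x - a_hat
    mse0 = r @ r / (n - 1)          # mean squared error ignoring the tree
    mse = r @ C_inv @ r / (n - 1)   # mean squared error given the tree
    # Expected MSE0/MSE ratio if the trait evolved under Brownian Motion.
    expected = (np.trace(C) - n / C_inv.sum()) / (n - 1)
    return (mse0 / mse) / expected

# Hypothetical ultrametric tree ((A,B),(C,D)) with total depth 1.0 and
# 0.5 units of shared history within each pair of sister tips.
C_toy = np.array([
    [1.0, 0.5, 0.0, 0.0],
    [0.5, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.5],
    [0.0, 0.0, 0.5, 1.0],
])
clustered = [1.0, 1.1, 5.0, 5.2]   # sister tips have similar genome sizes (Mbp)
scattered = [1.0, 5.0, 1.1, 5.2]   # same values, similarity does not follow the tree

print("K, clustered trait:", blomberg_k(clustered, C_toy))
print("K, scattered trait:", blomberg_k(scattered, C_toy))
```

In this toy example the trait that tracks the tree yields the larger K, which is the behaviour the statistic is designed to capture.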
In addition, we tested the fit of different models of trait evolution for genome size, including Brownian Motion [40], Ornstein-Uhlenbeck [41], Early-Burst [42], a diffusion model, Pagel's model [43], a drift model, and a white-noise model (non-phylogenetic signal) (Table 1). According to a likelihood ratio test (P<0.001 when compared with the next-best likelihood), Pagel's model showed the best fit (Table 1) with a lambda value of 0.90 (P<0.001). Pagel's lambda (λ) represents how strongly phylogenetic relationships predict the observed pattern of variation of a trait at the tips of a phylogeny, and varies from 0 (no phylogenetic signal) to 1 (phylogenetic signal under BM) [43]. Although we obtained different estimates for Blomberg's K and Pagel's λ, we considered λ to be more reliable because this metric is more robust than Blomberg's K in situations of erroneous branch lengths [44]. Our λ estimate supports our conclusion that genome size data in bacteria and archaea show phylogenetic signal. These findings indicate that genome size in bacteria and archaea does not evolve independently of broad evolutionary relationships. To confirm that our phylogenetic signal estimates are not unduly influenced by the phylogenetic scale that we examined, we repeated our analyses using a larger set of genomes consisting of multiple representatives for each genus (S1 Data) and we observed a similar trend (S1 Table), suggesting that the phylogenetic signal trend observed in genome size data is not the result of a biased taxonomic representation. Moreover, for our genus-level tree we estimated kappa (κ) and delta (δ) on genome size data, two parameters that describe the mode of evolution of a trait (punctuated vs gradual) and the rate change across the phylogeny (acceleration vs deceleration), respectively [45]. Our estimates (κ = 0.24 and δ = 3) are consistent with a gradual and late diversification of genome size in bacteria and archaea, which might indicate lineage-specific adaptations [43,45].
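The way λ rescales the tree can be illustrated with a short sketch: the off-diagonal entries of the phylogenetic covariance matrix (shared history) are multiplied by λ while the tip variances are left unchanged, and λ is chosen to maximize the likelihood of the trait data. The code below is a hedged illustration under simplifying assumptions, not the procedure used in this study: the function names, the grid search over λ, and the four-tip covariance matrix are hypothetical.

```python
import numpy as np

def lambda_transform(C, lam):
    """Multiply shared-history covariances by lam; keep tip variances unchanged."""
    C = np.asarray(C, dtype=float)
    out = C * lam
    np.fill_diagonal(out, np.diag(C))
    return out

def profile_log_likelihood(x, V):
    """Multivariate-normal log-likelihood of x with covariance sigma2 * V,
    evaluated at the maximum-likelihood mean and sigma2."""
    x = np.asarray(x, dtype=float)
    n = x.size
    V_inv = np.linalg.inv(V)
    a_hat = V_inv.sum(axis=0) @ x / V_inv.sum()   # GLS mean
    r = x - a_hat
    sigma2 = r @ V_inv @ r / n                    # ML rate estimate
    _, logdet = np.linalg.slogdet(V)
    return -0.5 * (n * np.log(2.0 * np.pi * sigma2) + logdet + n)

def fit_lambda(x, C, grid=np.linspace(0.0, 1.0, 101)):
    """Grid-search estimate of Pagel's lambda (a simple stand-in for an optimizer)."""
    lls = [profile_log_likelihood(x, lambda_transform(C, lam)) for lam in grid]
    best = int(np.argmax(lls))
    return float(grid[best]), float(lls[best])

# Hypothetical four-tip tree ((A,B),(C,D)), total depth 1.0.
C_toy = np.array([
    [1.0, 0.5, 0.0, 0.0],
    [0.5, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.5],
    [0.0, 0.0, 0.5, 1.0],
])
lam_hat, ll_hat = fit_lambda([1.0, 1.1, 5.0, 5.2], C_toy)
print("lambda estimate:", lam_hat, "log-likelihood:", ll_hat)
```

Setting λ = 0 in `lambda_transform` gives a star phylogeny (the white-noise case), and λ = 1 returns the original Brownian Motion covariance, which is why λ is read as a measure of phylogenetic signal.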
Because phylogenetic signal estimates can be biased due to sample size [46], we measured phylogenetic signal within each phylum (Fig 1A). Our results indicate that most of the phyla with a small sample size (<25 genomes) showed remarkably large or small K and λ values (S3 Fig), consistent with previous findings that small sample sizes lead to biased estimates [38,46].

PLOS GENETICS
We did not observe a linear increase in λ values with the number of genomes tested, however, suggesting that the large lambda estimate found in our overall genome size data is not associated with our large sample size (S3 Fig).

Non-phylogenetic regression overestimates the effect of dN/dS on genome size
We next explored whether the residuals resulting from the statistical association between genome size and other traits show phylogenetic signal. Previous studies have suggested that high levels of genetic drift are related to a decrease in genome size in bacteria [8,47]. However, such studies were based on a limited set of genomes available at the time and did not include a broad repertoire of streamlined genomes, which are notable for their small genomes and large effective population sizes [12,48]. We first investigated whether this trend is maintained when including a broader diversity of taxa by calculating pairwise dN/dS values for each genus in the GTDB genomes dataset. Our non-phylogenetic generalized least squares (GLS) model showed a positive and significant but weak correlation between genome size and dN/dS (P<0.001, Pseudo-R² = 0.04, Table 2, Fig 3A). This result contrasts with earlier studies reporting a strong relationship between genome content and dN/dS [8,47]; we attribute this large discrepancy to the broad taxonomic representation in our dataset, which includes small genomes under both strong purifying selection and genetic drift [12]. Interestingly, when considering phylogeny through the better-fitting Pagel's model, our phylogenetic generalized least squares model (PGLS) showed poorer predictability and a non-significant relationship between both variables (P = 0.5, Pseudo-R² = 0.0006, Table 2, Fig 3A). Similar results were found in a study that analyzed the phylogenetic signal associated with genome size across prokaryotes and eukaryotes [49,50]. In this previous study, the authors showed that the phylogenetic signal found in genome size data caused a biased association between Ne·μ (approximated through nucleotide diversity) and other genetic traits, including genome size [49,50].
Our PGLS analysis indicates not only that genome size data show phylogenetic signal, but also that the residuals of our regression models bear this signal (Table 2), confirming the need for phylogeny-based methods when studying the evolution of genome size [46,51]. We also calculated the lambda parameter on our dN/dS data, and the value found (λ = 0.68; 95% CI = 0.56-0.77) indicates a relatively high phylogenetic signal for this variable, suggesting that phylogenetically related microorganisms tend to experience similar levels of selection. Altogether, these results suggest that correlations between dN/dS and genome size found previously are largely driven by poor sampling and artifacts that arise by not specifically accounting for the recent shared evolutionary history of many lineages [26].
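The difference between the non-phylogenetic and the phylogenetic regression can be sketched in a few lines: both use the generalized least squares estimator, and they differ only in the residual covariance matrix that is assumed (identity for independent species, a λ-transformed phylogenetic covariance for PGLS). The example below is an illustrative sketch rather than the analysis performed here; the function `gls_fit`, the four-tip covariance matrix, the value λ = 0.9, and the trait values are all hypothetical.

```python
import numpy as np

def gls_fit(y, X, V):
    """Generalized least squares: beta = (X' V^-1 X)^-1 X' V^-1 y.

    V is the assumed residual covariance structure (up to a scalar).
    Returns the coefficient estimates and their standard errors.
    """
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    V_inv = np.linalg.inv(V)
    XtVi = X.T @ V_inv
    cov_unscaled = np.linalg.inv(XtVi @ X)
    beta = cov_unscaled @ XtVi @ y
    resid = y - X @ beta
    n, p = X.shape
    sigma2 = resid @ V_inv @ resid / (n - p)   # residual variance estimate
    se = np.sqrt(np.diag(sigma2 * cov_unscaled))
    return beta, se

# Hypothetical data for four tips of a tree ((A,B),(C,D)).
genome_size = np.array([1.2, 1.5, 6.0, 6.4])   # response (Mbp)
predictor = np.array([1.0, 2.0, 5.0, 6.0])     # e.g. 16S rRNA copies
X = np.column_stack([np.ones_like(predictor), predictor])

C_toy = np.array([
    [1.0, 0.5, 0.0, 0.0],
    [0.5, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.5],
    [0.0, 0.0, 0.5, 1.0],
])
lam = 0.9                                        # assumed value, for illustration
V_pgls = C_toy * lam
np.fill_diagonal(V_pgls, np.diag(C_toy))

beta_ols, se_ols = gls_fit(genome_size, X, np.eye(4))   # species treated as independent
beta_pgls, se_pgls = gls_fit(genome_size, X, V_pgls)    # shared ancestry modelled
print("non-phylogenetic slope and SE:", beta_ols[1], se_ols[1])
print("phylogenetic slope and SE:", beta_pgls[1], se_pgls[1])
```

With real data, λ would be estimated jointly with the regression coefficients rather than fixed in advance.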
Although our results indicate that dN/dS is a poor predictor of genome size in bacteria and archaea (Fig 3A), it is worth mentioning that dN/dS only reflects recent evolutionary constraints due to saturation of substitutions at synonymous sites [52,53]. Therefore, we do not discount that genome reduction may be driven in part by processes such as population bottlenecks and periods of relaxed selection that happened in the past but are not reflected in dN/dS estimations. This scenario has been suggested for Prochlorococcus, in which the genome simplification observed in this clade could be the result of periods of relaxed selection experienced in the past [53].

Ecological strategy plays a role in genome size evolution in bacteria and archaea
In addition to testing the effect of the strength of selection on genome size, we also assessed the predictability of genome size from 16S rRNA copies as an approximation to ecological strategy using both GLS and PGLS. Previous studies have shown that copies of the rrn operon can be a predictor of the number of ribosomes that a cell can produce simultaneously, and that this reflects the ecological strategy in microorganisms [54,55]. A large number of rrn copies is associated with the ability to adapt quickly to fluctuating environmental conditions (i.e., "boom and bust" strategies) [56], while multiple rrn copies would confer a metabolic burden to slow-growing microorganisms living in stable or low-nutrient environments because of ribosome overproduction [54]. Similarly to what we observed for dN/dS, we found a weak, positive, and significant relationship between genome size and 16S rRNA copies when using GLS (P<0.001, Pseudo-R² = 0.01, Table 2, Fig 3B). Interestingly, we still observed a significant relationship when accounting for the phylogenetic signal in the residuals through a PGLS analysis (P = 0.003, Pseudo-R² = 0.01, Table 2, Fig 3B). However, the Pagel's lambda of this model was not significantly different from 1 (Table 2), indicating that the residuals of this model show a distribution closer to the BM expectation. After fitting under the BM, we still observed a positive and significant relationship between genome size and 16S rRNA copies (P<0.001, Pseudo-R² = 0.02). Although the predictability of 16S rRNA is weak under both BM and Pagel's model, our findings suggest that environmental complexity plays a role in genome size independently of phylogenetic relationships. This is consistent with the observation that larger genomes tend to be found in environments with temporal variability and diversity of resources [57,58]. In addition to fitting our model using dN/dS and 16S rRNA copies individually as predictors, we fitted an additive model with both variables (Table 2). An ANOVA test showed that a model including both variables does not significantly improve the fit when compared with the model based on 16S rRNA copies as a unique predictor variable (P = 0.48).

Table 2. Statistics of the regression models relating genome size with dN/dS and 16S rRNA as predictor variables using generalized least squares (GLS) and phylogenetic generalized least squares (PGLS) analyses. We highlighted the models that were statistically significant (α = 0.05).

A hypothesis for the evolutionary processes that shape genome size in bacteria and archaea
According to our phylogenetic comparative framework (Tables 1 and 2 and Fig 3), lineages with recent shared evolutionary history tend to maintain similar sizes since the divergence from their common ancestor. Nevertheless the pattern of variation in genome size data (λ =   [59], ecological strategy [19,60], and mutation rate [21][22][23], our findings strongly suggest that genome size is a complex trait determined by the interaction of multiple variables, and that the relative importance of these factors may vary across lineages. Phylogenetic signal estimates can vary across phylogenetic scales [61], and it is therefore possible that the strong phylogenetic signal found in our analyses is weaker or not observed at finer scales. This is particularly expected in clades that experience rapid genome turnover due to the acquisition and loss of genes through horizontal gene transfer (HGT) events and deletions, respectively. For example, genome contraction events are expected in endosymbionts like Buchnera and Blattabacterium, which are thought to derive from a large-genome ancestor [10], and which frequently undergo bottlenecks and periods of diversity loss [9,10,62]. Such loss of genes and diversity is exacerbated by the near absence of homologous recombination in vertically transmitted endosymbionts [63]. These observations are consistent with the relatively high dN/dS value and small genome size that we observed for Buchnera and Blattabacterium (Fig 4). In contrast, some abundant marine clades inhabiting the open ocean such as Prochlorococcus and Pelagibacter have undergone long periods of adaptation and specialization to their stable environments [64,65]. The open ocean is characterized by chronically oligotrophic nutrient conditions that are stable throughout the year [66], and genes that are under relaxed selection are therefore pseudogenized and lost [12]. The latter is supported by the unusual growth requirements and low number of transcriptional regulators found in Pelagibacter, which is expected to limit its response to changing environmental conditions [67,68]. Consistent with these observations, we observed low dN/dS values, small genome size, and fewer 16S rRNA copies for these streamlined bacteria (Fig 4). The small genomes observed in both endosymbionts and free-living planktonic lineages are therefore likely the result of distinct evolutionary processes, as previously proposed [17].
In contrast to the genome simplification observed in host-dependent and streamlined prokaryotes, genome expansion is expected in free-living lineages that inhabit complex environments like soils or sediments, where microenvironments with strikingly different abiotic conditions can be found millimeters apart [69]. Although temporal diversity declines and sweeps for specific gene variants are likely to occur in soil prokaryotes due to rapidly changing environmental conditions [69,70], larger genomes may be selected in these environmental realms due to variable abiotic and biotic constraints. Indeed, a study exploring the genes enriched in larger genomes of soil prokaryotes found that these genomes contained a larger proportion of genes involved in regulation and secondary metabolism, and were depleted in genes related to translation, replication, cell division, and nucleotide metabolism when compared with smaller genomes [60]. These environmental and genomic findings are consistent with the large genome sizes, intermediate dN/dS, and multiple 16S rRNA copies estimated in our study for soil microorganisms of the genera Actinomyces, Actinoplanes, and Myxococcus (Fig 4), the latter showing complex fruiting body development [71]. It is interesting to note that the largest genomes analyzed in our study (>6 Mbp) tend to experience intermediate levels of purifying selection (dN/dS), suggesting that neither extremely high nor extremely low purifying selection is conducive to genomic expansion events.

Conclusions
Despite the growing number of genomes available in public databases, the evolutionary processes and factors driving genome size and content in bacteria and archaea are continuously debated. Several studies have proposed ecological strategies, the strength of purifying selection, and mutation rate as prominent forces that individually determine prokaryotic genome size. Our statistical approach shows that, at broad phylogenetic scales, evolutionary history plays a large role in structuring genome size distributions across bacteria and archaea. Genome size is therefore not independent of phylogeny, and a failure to account for this can lead to misleading associations between traits. In some ways our finding of a strong phylogenetic signal in genome size in prokaryotes across broad evolutionary timescales is paradoxical given the well-known variability of prokaryotic genome size within species and between closely related lineages [31][32][33][34]. These two realities need not conflict, however; for example, it is possible that genome size fluctuates rapidly at short evolutionary timescales but remains relatively constant due to an overall balancing of gene gain and loss over long periods of time. The significant but weak relationship between genome size and 16S rRNA copies suggests that besides phylogenetic history, ecological strategy plays a role in shaping genome size in bacteria and archaea, although this single trait is insufficient to completely represent ecological strategies. Future studies will be necessary to evaluate the evolution of genome size on a lineage-by-lineage basis. However, the strong phylogenetic signal observed in genome size data indicates that analyses involving this trait cannot consider species as phylogenetically independent; phylogenetic relatedness should therefore be assessed and taken into account in order to avoid simplified models and biased associations between traits.

Genomes compilation and phylogenetic reconstruction
In order to estimate the phylogenetic signal in genome size data at a broad phylogenetic scale, we compiled a genome dataset that included a broad diversity of bacteria and archaea. All the representative genomes available on the Genome Taxonomy Database (GTDB) (Release 05-RS95; 17th July 2020) [28,29] were filtered based on completeness (≥95%) and contamination (≤5%) and then classified at the class level. Genomes belonging to the phylum Patescibacteria (also known as Candidate Phyla Radiation or CPR) were filtered using the parameters completeness ≥80% and contamination ≤5%. After filtering and classification, classes with more than 500 genomes were randomly downsampled to 500 genomes. The resulting genomes were clustered based on their taxonomic identity at the genus level and genera with fewer than two genomes were discarded from further analyses. Our final dataset consisted of 4380 genomes classified in 836 genera. For phylogenetic reconstruction, we randomly selected one genome from each genus (referred to hereafter as the GTDB genomes dataset) and used the MarkerFinder pipeline reported previously [30]. This pipeline consisted of the identification of 27 ribosomal proteins and three RNA polymerase genes (Ribosomal-RNAP set) [72] using HMMER3. The resulting individual sequences were aligned with ClustalOmega and concatenated. We trimmed the concatenated alignment with trimAl [73] using the option -gt 0.1. The Ribosomal-RNAP alignment was then used to build the phylogenetic tree with IQ-TREE 1.6.12 [74] with the substitution model LG+R10 and the options -wbt, -bb 1000, and -runs 10 [75][76][77]. The resulting phylogeny was manually inspected on iTOL [78] (Fig 2). The raw phylogenetic tree is included in S1 Phylogeny.
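The filtering and subsampling steps described above can be summarized in a short sketch. This is an illustration of the stated thresholds only and not the pipeline used in this study: the record fields (`phylum`, `class`, `genus`, `completeness`, `contamination`), the function names, and the fixed random seed are assumptions introduced for the example.

```python
import random
from collections import defaultdict

def passes_quality(genome):
    """Completeness >= 95% (>= 80% for Patescibacteria) and contamination <= 5%."""
    min_completeness = 80.0 if genome["phylum"] == "Patescibacteria" else 95.0
    return (genome["completeness"] >= min_completeness
            and genome["contamination"] <= 5.0)

def select_genomes(genomes, max_per_class=500, min_per_genus=2, seed=0):
    """Return (genus -> retained genomes, genus -> one random representative)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible

    # 1) Quality filter, then group by class.
    by_class = defaultdict(list)
    for genome in genomes:
        if passes_quality(genome):
            by_class[genome["class"]].append(genome)

    # 2) Randomly downsample over-represented classes.
    kept = []
    for members in by_class.values():
        if len(members) > max_per_class:
            members = rng.sample(members, max_per_class)
        kept.extend(members)

    # 3) Cluster by genus and discard genera with fewer than two genomes.
    by_genus = defaultdict(list)
    for genome in kept:
        by_genus[genome["genus"]].append(genome)
    by_genus = {g: m for g, m in by_genus.items() if len(m) >= min_per_genus}

    # 4) Pick one representative per genus for the phylogeny.
    representatives = {g: rng.choice(m) for g, m in by_genus.items()}
    return by_genus, representatives
```

Here the genus clusters would feed the within-genus dN/dS estimates, while the representatives would be used for tree building.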

dN/dS estimation and rrn genes identification
To investigate whether the phylogenetic signal in genome size data leads to biased associations with other variables like the strength of selection and ecological strategy, we estimated the ratio of nonsynonymous to synonymous substitution rates (dN/dS) within each genus cluster of our GTDB genomes dataset using two sets of conserved marker genes, checkm_bact and checkm_arch for bacteria and archaea, respectively [79]. Genomes used to calculate the dN/dS for each genus cluster are reported in S1 Data. The open reading frames (ORFs) retrieved from the GTDB were compared to the HMMs of the checkm_bact (120 marker genes) and checkm_arch (122 marker genes) marker sets using the hmmsearch tool available in HMMER v. 3.2.1 with the reported model-specific cutoffs [80]. We aligned the amino acid sequences for each marker gene and each genus cluster individually using ClustalOmega [81], and then converted amino acid alignments into codon alignments using PAL2NAL with the parameter -nogap [82]. We used the resulting codon alignments to estimate the pairwise dN/dS for each pair of genomes using the maximum likelihood approximation (codeML) available in PAML 4.9h (runmode = -2) [83]. In order to avoid bias associated with divergence, dN/dS estimates with dS>1.5 were removed due to potential saturation. We also discarded pairwise comparisons with dS<0.1 because these might represent dN/dS values calculated from genomes of the same population. Moreover, dN/dS values >10 were considered artifactual [48]. Genomes with fewer than 25 dN/dS estimates remaining after filtering were discarded. We used the resulting median dN/dS of our representative genomes for further analysis. In order to examine the effect of marker gene choice on the final dN/dS estimates, we randomly selected 40 genera and identified their core genes with CoreCruncher [84], using usearch [85] and the default parameters except for -score 80. We estimated the pairwise dN/dS for each core gene using the approach described above and estimated the median dN/dS for our genus-representative genomes. A linear regression between the dN/dS values resulting from core genes and the 120 marker genes set for the 40 genera showed similar results (S2 Fig); we therefore used the latter for further statistical analyses. In addition, we predicted ribosomal RNA genes in our representative genomes as an approximation to ecological strategy using Barrnap with the default parameters (barrnap 0.9: rapid ribosomal RNA prediction; https://github.com/tseemann/barrnap). Genome size, 16S rRNA copies, and dN/dS values for the GTDB representative genomes dataset are reported in S2 Data.
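The filtering rules applied to the pairwise estimates can be written compactly as follows. This is a sketch of the stated thresholds under the assumption that the pairwise dN and dS values for one genome have already been parsed into tuples; the function name `median_dnds` and its parameter names are hypothetical and do not correspond to any tool used in this study.

```python
from statistics import median

def median_dnds(pairs, min_ds=0.1, max_ds=1.5, max_dnds=10.0, min_estimates=25):
    """Median dN/dS over pairwise (dN, dS) estimates after filtering.

    Estimates are dropped when dS > max_ds (possible saturation), when
    dS < min_ds (possibly the same population), or when dN/dS > max_dnds
    (treated as artifactual). Returns None if fewer than min_estimates
    values remain, mirroring the removal of such genomes.
    """
    ratios = []
    for dn, ds in pairs:
        if ds < min_ds or ds > max_ds:
            continue
        ratio = dn / ds
        if ratio > max_dnds:
            continue
        ratios.append(ratio)
    if len(ratios) < min_estimates:
        return None
    return median(ratios)

# Hypothetical input: 30 usable estimates plus three that fail the filters.
pairs = [(0.05, 0.5)] * 30 + [(0.02, 2.0), (0.001, 0.01), (6.0, 0.5)]
print(median_dnds(pairs))
```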

Statistical analyses
Due to the tendency of related species to resemble each other because of their shared phylogenetic ancestry, we assessed the suitability of a phylogeny-based method for our regression analyses by first estimating Blomberg's K on genome size data [35] using the phylosignal function in R [86]. This parameter represents the phylogenetic signal in a continuous trait: K = 0 indicates no phylogenetic signal, and K = 1 indicates that the trait analyzed evolves as expected under Brownian Motion [40,87]. In addition, we also tested the fit of different trait evolution models, including Brownian Motion [40], Ornstein-Uhlenbeck [41], Early-Burst [42], a diffusion model, Pagel's model [43], a drift model, and a white-noise model (non-phylogenetic). We tested the predictability of genome size from dN/dS and 16S rRNA copies as predictor variables using the "glm" function available in R. Since we detected phylogenetic signal in genome size data, we additionally accounted for potential phylogenetic non-independence in the residuals using the PGLS method with the function pgls in the R package caper [88] and Pagel's model [43], as well as the function gls available in the package ape [89]. We additionally tested the effect of sample size on the calculation of Blomberg's K and Pagel's lambda by estimating these parameters within each phylum (Figs 1A and S3). The trait data and the phylogeny used in these analyses can be found in S2 Data and S1 Phylogeny, respectively. In addition to testing phylogenetic signal in a broad-scale phylogeny (S2 Data, Fig 2), we built a phylogenetic tree with multiple representative genomes for each genus using the same IQ-TREE workflow as for the rest of our analyses. The genome size data and the phylogenetic tree used for this analysis can be found in S1 Data and S2 Phylogeny, respectively.