A Systematic Computational Analysis of Biosynthetic Gene Cluster Evolution: Lessons for Engineering Biosynthesis

doi:10.1371/journal.pcbi.1004016

Figure 1.

The rapid and dynamic evolution of BGCs differs from the evolution of ribosomal gene clusters and primary metabolism.

a, Distributions of the best-matching sequence homologs with respect to organism similarity (based on 16S rRNA) for predicted BGCs and histidine operons suggest significant differences in the ways they evolve. b, Number of detected rearrangements, indels and duplications plotted against the average percent identity in the aligned gene cluster pairs from which the events were deduced, for predicted BGCs (top) and ribosomal gene clusters (bottom). Ribosomal gene clusters were selected for comparison based on their relatively large sizes (∼10–15 kb) compared to primary metabolic operons; to obtain a fair comparison with BGCs, only gene clusters of sizes 5–15 kb were taken into account. Counts are based on a systematic comparison of all gene clusters in our data set that share regions of >1000 bp with >70% identity, in which events were inferred from alignments of such 1000 bp blocks. Of the 10,096 BGC pairs meeting these criteria, 1,750 had a rearrangement, 1,140 had an indel, and 135 had a duplication; each of these events was far more common than the corresponding evolutionary event in gene clusters encoding the translation apparatus. Interestingly, while indels and rearrangements could be detected in ∼16% and ∼19% of BGCs of all sizes, duplications were found far more commonly in gene clusters with sizes of >40 kb (7.6%) than in gene clusters with sizes of 10–20 kb (0.3%), suggesting a possible role for duplication and divergence in the evolution of large gene clusters. c, Size distribution of inserted/deleted fragments during recent gene cluster evolution, based on the indel analysis.
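The per-pair event frequencies implied by the counts in panel b can be worked out directly. The sketch below is illustrative arithmetic only: it uses the pair total and the three event counts quoted in the caption, and the variable and function names are assumptions introduced here. The resulting fractions are per aligned BGC pair, so they are not directly comparable to the per-BGC percentages (∼16% and ∼19%) quoted in the same caption.

```python
# Counts quoted in the Figure 1b caption
# (BGC pairs sharing regions of >1000 bp with >70% identity).
TOTAL_PAIRS = 10_096
EVENT_COUNTS = {"rearrangement": 1_750, "indel": 1_140, "duplication": 135}


def event_fractions(counts, total):
    """Return the fraction of aligned pairs showing each event type."""
    return {event: n / total for event, n in counts.items()}


fractions = event_fractions(EVENT_COUNTS, TOTAL_PAIRS)
for event, frac in fractions.items():
    print(f"{event}: {frac:.1%} of pairs")
```

Under these counts, roughly 17% of pairs show a rearrangement, 11% an indel, and 1% a duplication.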


Figure 2.

Complex BGC architectures evolve through new combinations of sub-clusters that are shared between multiple gene cluster types.

a, Network of sub-clusters shared among 34 known BGCs. Nodes represent BGCs, and node size indicates the number of sub-clusters present in the gene cluster that are shared with other BGCs within the network. Edges represent shared sub-clusters and are color-coded by sub-cluster. The pattern of sharing indicates that many sub-clusters are regularly transferred between BGCs of different types. When interpreting this analysis, it should be kept in mind that in rare cases different biosynthetic routes (and hence, different sub-clusters) exist towards the same moiety. b, A sub-network from panel a showing the shared sub-clusters among the BGCs for rubradirin, rifamycin, simocyclinone, everninomicin, and polyketomycin, as well as the chemical moieties encoded by the sub-clusters.
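The network layout described for panel a (BGCs as nodes, shared sub-clusters as edges, node size as the number of shared sub-clusters) can be sketched as a small data structure. This is a minimal illustration, not the authors' pipeline: the BGC and sub-cluster labels are hypothetical placeholders, and the function name is an assumption.

```python
from itertools import combinations

# Hypothetical placeholder data: BGC name -> set of sub-cluster labels it contains.
bgc_subclusters = {
    "BGC_A": {"sub1", "sub2"},
    "BGC_B": {"sub1", "sub3"},
    "BGC_C": {"sub2", "sub3", "sub4"},
    "BGC_D": {"sub5"},
}


def shared_subcluster_network(bgcs):
    """Return (edges, node_sizes) for a sub-cluster sharing network.

    edges maps an unordered BGC pair to the set of sub-clusters they share;
    node_sizes counts, per BGC, its sub-clusters found in at least one other BGC.
    """
    edges = {}
    for a, b in combinations(sorted(bgcs), 2):
        shared = bgcs[a] & bgcs[b]
        if shared:
            edges[(a, b)] = shared

    node_sizes = {}
    for name, subs in bgcs.items():
        # Union of the sub-clusters of every other BGC in the network.
        others = set().union(*(s for n, s in bgcs.items() if n != name))
        node_sizes[name] = len(subs & others)
    return edges, node_sizes


edges, node_sizes = shared_subcluster_network(bgc_subclusters)
```

Storing the shared labels on each edge keeps enough information to color edges by sub-cluster, as in the figure.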


Figure 3.

Unexpected evolutionary relationships within the rapamycin family.

a, Distinct scaffolds produced by pathways from related BGCs. The scatter plot shows the relationship between the sequence homology of a pair of BGCs (x-axis) and the structural homology of their small molecule products (y-axis), compared to rapamycin and its BGC. Each circle represents a gene cluster and its small molecule product. Meridamycin and FK520 are closely related to rapamycin, as are their BGCs. While the pladienolide BGC is closely related to the rapamycin BGC, the structure of pladienolide itself is not very similar to that of rapamycin. In particular, pladienolide has a much smaller macrocycle and lacks shikimate- or pipecolate-derived moieties, and, as a result, binds to a distinct protein target. Structural similarity is estimated by the Tanimoto coefficient using linear-path fingerprints (FP2) from Open Babel [67], while sequence homology is represented as the Jaccard index defined on pairs of Pfam domains that share sequence identities within the top 10th percentile of all-pair sequence identities. The number of domain pairs that share sequence identities within the top 10th percentile and the sequence identity of all domain pairs are shown as point sizes and colors, respectively. b, The role of concerted evolution in homogenizing domains within a BGC. Phylogenetic trees of KS and AT domains from the rapamycin, FK520, meridamycin, and pladienolide BGCs are shown (for detailed trees with accession numbers and bootstrap values, see Figure S11). The KS and AT sequences largely cluster into BGC-specific clades; for the AT domains, this is even the case for two different clusters encoding the same compound (meridamycin), showing the ability of concerted evolution to homogenize domains within a BGC. c, Chemical structures of rapamycin, meridamycin, FK520 and pladienolide. The sub-structure shared among rapamycin, meridamycin and FK520 is colored red, and the domains responsible for the biosynthesis of this sub-structure in each molecule are indicated with red circles in b.
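Both similarity measures named in panel a reduce to the same set ratio. The sketch below is a simplified illustration under stated assumptions: it does not call Open Babel or compute FP2 fingerprints, it does not reproduce the top-10th-percentile filter on domain pairs, and the fingerprint bit positions and domain labels are made-up placeholders.

```python
def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B| of two sets (0.0 if both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# For binary fingerprints, the Tanimoto coefficient equals the Jaccard index
# computed on the sets of "on" bit positions.
tanimoto = jaccard

# Hypothetical fingerprints: positions of the set bits for two molecules.
fp_x = {1, 4, 7, 9, 12}
fp_y = {1, 4, 8, 9, 15}
structural_similarity = tanimoto(fp_x, fp_y)  # 3 shared bits out of 7 distinct

# Hypothetical domain content of two BGCs (placeholder labels, no percentile filter).
domains_x = {"KS", "AT", "ACP", "KR", "DH"}
domains_y = {"KS", "AT", "ACP", "ER"}
sequence_similarity = jaccard(domains_x, domains_y)  # 3 shared out of 6 distinct

print(f"structural: {structural_similarity:.3f}, sequence: {sequence_similarity:.3f}")
```

Plotting one value against the other for each BGC, relative to a reference such as rapamycin, gives the kind of scatter plot the caption describes.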


Figure 4.

Qualitative model for the evolution of NRPS/PKS domains.

After modules are duplicated, they may become ‘trapped’ in a cycle in which small sequence divergences are counterbalanced by internal recombinations that drive concerted evolution. Through strong diversifying selection (or sufficient drift), domains may break out of this cycle towards domain sequences that are protected from concerted evolution, either by functional divergence and subsequent stabilizing selection on the new function, or by reduced internal recombination rates due to larger sequence differences between the domains. This sequence divergence may occur through cumulative mutation or through recombination with other gene clusters (or other modules within the same gene cluster).


Figure 5.

Diverse and distinct modes of evolution for PKS and NRPS BGCs.

a, Scatter plot showing the first two principal components from a principal component analysis (PCA) of different evolutionary characteristics of BGCs encoding different classes of NRPs and PKs. The first two principal components explain 63% of the variance. BGCs encoding members of the same family (e.g., lipopeptides, glycopeptides or macrolides) tend to cluster together, suggesting that their family members evolve in similar ways, while different families cluster apart from each other, suggesting distinct modes of evolution. Colors indicate distinct classes of BGCs. b, Scatter plot showing the two features of BGCs (internal similarity index and vertical evolution index) that, of the 25 measured features, underlie most of the variation. The internal similarity index indicates how similar domains in a BGC are to other domains within the same BGC. The vertical evolution index indicates how closely related a BGC is to the BGCs harboring the closest relatives of its constituent domains (see Methods for more details). Colors indicate distinct classes of BGCs, as in panel a. c–f, Domain architecture plots of PKSs and NRPSs show distinct modes of evolution: c, internal duplication with concerted evolution; d, N-terminal additions by module duplication and recombination; e, domain swapping with other BGCs; and f, mixed evolution. Geometric shapes indicate domain types (see legend); domain colors indicate the internal homology p-value of each domain to its closest relative within the same gene cluster, within the total distribution of all similarities between domains of the same type in the entire data set: hence, domains colored red are most similar, while domains colored blue are most dissimilar.
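The "fraction of variance" statement in panel a refers to the share of total variance carried by each principal component. The toy sketch below shows how that share is computed for just two features, using the closed-form eigenvalues of a 2×2 covariance matrix. It is a deliberate simplification: the analysis in the figure used 25 features, and the index values here are made up for illustration rather than taken from the paper.

```python
import math


def explained_variance_2d(xs, ys):
    """Fraction of total variance captured by each principal component
    of a two-feature data set, from the eigenvalues of its covariance matrix."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    # Closed-form eigenvalues of the symmetric matrix [[sxx, sxy], [sxy, syy]].
    mean = (sxx + syy) / 2
    half_gap = math.hypot((sxx - syy) / 2, sxy)
    lam1, lam2 = mean + half_gap, mean - half_gap
    total = lam1 + lam2
    return lam1 / total, lam2 / total


# Made-up index values for five hypothetical BGCs (not data from the paper).
internal_similarity = [0.10, 0.25, 0.40, 0.70, 0.90]
vertical_evolution = [0.80, 0.75, 0.50, 0.35, 0.10]

pc1, pc2 = explained_variance_2d(internal_similarity, vertical_evolution)
print(f"PC1: {pc1:.1%} of variance, PC2: {pc2:.1%}")
```

With many features, the same ratio is taken over all eigenvalues of the covariance matrix, and the shares of the first two components are summed to obtain a figure such as the 63% quoted above.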
