Knowledge-guided data mining on the standardized architecture of NRPS: Subtypes, novel motifs, and sequence entanglements

doi:10.1371/journal.pcbi.1011100

Fig 1.

Overview of the motif-and-intermotif architecture of canonical NRPS domains.

A. Organization of the C domain. Rectangles in the first panel show the seven C motifs (black) and eight inter-motif regions (colored); the last of these is the inter-motif region between the C7 and A1 motifs. The heights of the rectangles indicate average sequence identities. The widths of the rectangles indicate average lengths in amino acids, with the average lengths of the inter-motif regions printed on their rectangles. Violin plots in the second panel mark the border positions of domains annotated by antiSMASH 5.1 with default options (start and end borders of the C domain: gray; start of the A domain: blue). The third panel shows the distributions of the end positions of the eight inter-motif regions, counted from their preceding conserved motifs. The fourth panel shows the mutual information (M.I.) between residues in the C domain and chirality subtypes. B. Same as A, but for the A and T domains. The region starts with the A1 motif and ends before the C1 motif. In the violin plot, the start and end borders of the A domain are colored blue; the start and end borders of the T domain are colored yellow. The fourth panel shows the mutual information between residues and A domain substrate specificity (see Methods for details). C. Same as A, but for the E domain. The region starts after the T1 motif and ends before the C1 motif. In the violin plot, the start and end borders of the E domain are colored black; the start of the C domain is colored gray. D. Same as A, but for the TE domain. The region starts after the T1 motif and ends with the regions after the annotated TE domain. In the violin plot, the start and end borders of the TE domain are colored red.
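The fourth panels report mutual information between an alignment position and a categorical label (chirality subtype or substrate specificity). As a minimal sketch of that quantity, assuming a plain empirical estimate in bits with no gap handling or finite-sample correction (the paper's Methods may differ), the function name `mutual_information` and the toy labels below are illustrative only:

```python
import math
from collections import Counter


def mutual_information(column, labels):
    """Empirical mutual information (bits) between one alignment column
    and a per-sequence class label (e.g. a subtype). Hypothetical helper."""
    if len(column) != len(labels):
        raise ValueError("column and labels must have the same length")
    n = len(column)
    joint = Counter(zip(column, labels))   # counts of (residue, label) pairs
    residue_counts = Counter(column)       # marginal counts of residues
    label_counts = Counter(labels)         # marginal counts of labels
    mi = 0.0
    for (residue, label), count in joint.items():
        p_xy = count / n
        # p_xy / (p_x * p_y), written with counts to limit rounding error
        ratio = count * n / (residue_counts[residue] * label_counts[label])
        mi += p_xy * math.log2(ratio)
    return mi


# Toy example: the residue perfectly tracks the label in the first column
# and is independent of it in the second.
informative = mutual_information("AAGG", ["L", "L", "D", "D"])
uninformative = mutual_information("AGAG", ["L", "L", "D", "D"])
```

A column whose residues track the label perfectly reaches the label entropy, while a column independent of the label scores zero.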

Fig 2.

C domain subtype analysis and representative NRPS organizations in bacteria and fungi.

A. Maximum-likelihood phylogenetic tree of the condensation domain superfamily. Subtype classification and sequences are described in the main text and the Methods. Subtypes are indicated by colors; subtypes exclusive to fungi are underlined, and subtypes found predominantly in bacteria are marked by asterisks. The tree is rooted using papA and WES as outgroups [65] (black shading). The L-clade and D-clade are indicated by blue and red shading, respectively. B. Domains adjacent to different C domain subtypes in bacteria and fungi. C. Subtype distribution in 83,489 bacterial C domains and 34,269 fungal C domains. C domains with HMM scores above the empirical threshold of 200 are annotated with their predicted subtypes; all others are marked as “Low-confidence”. D. Sequence logos for the C3 or E2 motif from different C domain subtypes and for the T1 or ACP1 motif adjacent to each subtype. Sequences from bacteria are marked in red, and sequences from fungi are marked in blue. E. Frequent NRPS organizations with known representative examples in bacteria and fungi.
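The thresholding rule in panel C can be sketched as follows. This is an illustration only: the function names, the input format (pairs of predicted subtype and HMM score), and the subtype names `subtype_a` and `subtype_b` are hypothetical, and only the threshold of 200 and the “Low-confidence” label come from the caption.

```python
from collections import Counter

LOW_CONFIDENCE = "Low-confidence"


def annotate(predictions, threshold=200.0):
    """Keep the predicted subtype only when its HMM score is strictly
    above the threshold; otherwise label the domain as low-confidence."""
    return [subtype if score > threshold else LOW_CONFIDENCE
            for subtype, score in predictions]


def subtype_distribution(predictions, threshold=200.0):
    """Fraction of domains assigned to each label after thresholding."""
    labels = annotate(predictions, threshold)
    counts = Counter(labels)
    return {label: count / len(labels) for label, count in counts.items()}


# Hypothetical (subtype, HMM score) pairs for four C domains.
example = [("subtype_a", 350.0), ("subtype_a", 210.5),
           ("subtype_b", 199.9), ("subtype_b", 200.0)]
labels = annotate(example)
distribution = subtype_distribution(example)
```

A score exactly equal to 200 is treated as low-confidence here because the caption says “above”; whether the original pipeline uses a strict or inclusive comparison is not stated.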

Fig 3.

Analysis of amino acid frequency reveals potential new motifs with implications for structural flexibility.

A. Amino acid frequency and gap frequency along the multiple sequence alignment (MSA) of the NRPS C+A+T modules. In the bottom panel, bar heights indicate the frequency of the most frequent amino acid at each position. Bars in the known core motifs of the C, A, and T domains are colored blue, orange, and yellow, respectively. The horizontal red dashed line marks the 0.95 frequency level. Domain boundaries annotated by Pfam are marked by red triangles. The colored patch above the amino acid frequency indicates gap frequency. Three potential new motifs (positions 1183, 1960, and 2435/2447 in the MSA) are marked by blue dashed boxes. The upper panel shows the sequence logo and the gap frequency near the three potential new motifs. B. Chemical interactions and secondary structures surrounding the second potential new motif in (A) in the substrate donation state (PDB: 6MFY). Hydrogen bonds near the most conserved Gly are shown as blue dashed lines. Covalent bonds are shown as black lines. Secondary structures, such as beta-sheets, are drawn as bold gray arrows. Known A domain motifs adjacent to the related residues are shown in the orange box. C. Same as B, but in the thiolation state (PDB: 6MG0). D. Same as B, but in the condensation state (PDB: 6MFZ).
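The per-position quantities in panel A (frequency of the most frequent amino acid, and gap frequency) can be computed along these lines. This is a rough sketch under stated assumptions: frequencies are taken over all sequences, gaps included in the denominator, and `-` is the gap character; the paper may normalize differently. The helper names and the toy alignment are illustrative.

```python
from collections import Counter

GAP = "-"


def column_profile(alignment):
    """For each MSA column, return (most frequent residue, its frequency,
    gap frequency). Frequencies use the total number of sequences."""
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    n = len(alignment)
    profile = []
    for column in zip(*alignment):
        gap_freq = column.count(GAP) / n
        residues = Counter(c for c in column if c != GAP)
        if residues:
            residue, count = residues.most_common(1)[0]
            profile.append((residue, count / n, gap_freq))
        else:
            profile.append((None, 0.0, gap_freq))  # all-gap column
    return profile


def highly_conserved(profile, level=0.95):
    """Indices (0-based) of columns whose top residue reaches `level`."""
    return [i for i, (_, freq, _) in enumerate(profile) if freq >= level]


toy_alignment = ["GAC", "G-C", "GTC", "GAA"]
toy_profile = column_profile(toy_alignment)
conserved_columns = highly_conserved(toy_profile)
```

Columns at or above the 0.95 level that fall outside the known core motifs are the kind of position the figure highlights as candidate new motifs.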

Table 1.

Representative chemically conserved residues in or interacting with the G-motif.

Fig 4.

Mutations of G-motif G409 in FmqC support the importance of the conserved domain in the biosynthesis of fumiquinazoline C.

A. Residues in and near the G-motif in the structure of FmqC predicted by Phyre2 [96]. G409 is the conserved glycine in the G-motif. Residues in the G-motif are marked in blue. The residues N397 and S491 (equivalent to F493 and N577 in LgrA, Fig 3B–3D), which may collide with G409, are marked in yellow and cyan. B. Same as A, but with the simulated mutation G409W. The mutated tryptophan is marked in magenta. C. The fmq gene cluster responsible for the production of FQC. The two NRPSs, fmqA and fmqC, are filled in red with their substrate selectivity marked. Ant: the non-proteinogenic amino acid anthranilate. C* represents a truncated and presumably inactive C domain. C_T represents a terminal condensation-like domain that catalyzes the macrocyclization reaction. D. The biosynthetic pathway of FQC, along with how it diverges into the production of compound 1 in the absence of functional FmqC. E. LC-MS analysis of the control fmqC (first row), ΔfmqC (second row), and six point-mutation strains (third to eighth rows). F. Normalized yield of FQC and compound 1 in different strains. For FQC, the yield is normalized by its production in the wild-type strain. For compound 1, the yield is normalized by its production in the fmqC gene deletion strain. Error bars show standard deviations.

Fig 5.

Statistical coupling analysis reveals overlapped sectors across the C+A+T module.

A. The upper panel shows the conservation of residues in a multiple sequence alignment of 1,161 NRPS modules (containing the C, A, and T domains), quantified by the relative entropy in the SCA method. The mean conservation level (0.32) is marked by the blue dashed line. The lower panel shows three groups of positions, termed “sectors”: II(+) in green, II(-) in magenta, and IV(-) in red. Their corresponding conservation values are marked in the same colors in the upper panel. Blue bars mark the C domain motifs C1 to C7. Orange bars mark the A domain motifs A1 to A10. The yellow bar marks the T domain motif T1. Domain boundaries annotated by Pfam are marked by vertical black dashed lines. The black triangle marks the re-engineering point in the C domain reported by Bozhüyük et al. [27], the black circle marks the re-engineering point in the C-A inter-domain region reported by Calcott et al. [28], and the black diamond marks the re-engineering point in the C-A inter-domain region reported by Bozhüyük et al. [26]. B. Mapping of the three groups of correlated positions onto the three-dimensional structure of the NRPS module (PDB 4ZXI, containing the C, A, T, and TE domains; the TE domain is hidden for clarity). The three sectors are marked in the same colors as in (A). The C domain, A core domain, A subdomain, and T domain are circled by blue, orange, yellow-green, and yellow dotted lines, respectively. Gly and AMP are substrates of this A domain. Together with Mg²⁺ (for catalysis), they are colored cyan. C. Heatmap of the SCA matrix after reduction of statistical noise and of global coherent correlations (see Methods for details). Each sector is marked by a bracket of the corresponding color under the heatmap, with the number of residues it contains listed. 68, 54, and 50 positions belong to the II(-), II(+), and IV(-) sectors, respectively. Within each sector, residues are ordered by descending contribution, showing that sector positions comprise a hierarchy of correlation strengths.
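Positional conservation measured as relative entropy can be illustrated with the Kullback-Leibler divergence between the observed amino acid frequencies at a position and a background distribution. The sketch below is not the SCA implementation: it assumes a uniform background over the 20 standard amino acids (SCA implementations typically use empirical background frequencies), ignores gaps and non-standard symbols, and uses natural logarithms. The function name is hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def relative_entropy(column, background=None):
    """Kullback-Leibler divergence (nats) between the residue frequencies
    in one alignment column and a background distribution.
    Assumption: uniform background unless one is supplied."""
    if background is None:
        background = {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}
    residues = [c for c in column if c in background]  # drop gaps and others
    if not residues:
        return 0.0
    n = len(residues)
    counts = Counter(residues)
    return sum((count / n) * math.log((count / n) / background[aa])
               for aa, count in counts.items())


fully_conserved = relative_entropy("GGGG")        # a single residue type
unconserved = relative_entropy(AMINO_ACIDS)       # matches the background
```

Under the uniform-background assumption, a fully conserved position reaches the maximum value ln 20, and a position whose composition matches the background scores zero.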

Fig 6.

The specificity-conferring code of the A domain is correlated with loop length and phylogeny.

A. SCA of 2,636 A domain sequences, with their substrate specificities appended as the last column of the multiple sequence alignment. Six sectors with a high contribution from the substrate column (>0.05; the size of the points on the left scales with the substrate’s contribution to the sector, see Methods for details) are sorted by their eigenvalues. The size of each point scales with its contribution to the sector. Orange bars mark the A domain motifs A1 to A8. The starts and ends of the five loop regions are marked by black and green dotted lines, respectively. S4 and S6 are the 4th and 6th positions of the specificity-conferring code. G is the G-motif. B. Distance matrix of A domains. The upper right of the heatmap shows the Euclidean distance between loop lengths, represented as 5-element vectors. The lower left of the heatmap shows the sequence distance between A domains. The matrix is sorted by substrate specificity and then by loop-length group. The substrates, loop-length groups, and phyla of these A domains are shown by colors in the sidebars. C. Example showing that A domains with identical substrate specificity exhibit distinct specificity-conferring codes when they fall into different loop-length groups. The phylum composition of each group is shown in the pie chart.
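The loop-length distance in panel B treats each A domain as a vector of five loop lengths and compares vectors by Euclidean distance. A minimal sketch, with hypothetical function names and made-up loop lengths:

```python
import math

N_LOOPS = 5  # the caption describes loop lengths as a 5-element vector


def loop_length_distance(a, b):
    """Euclidean distance between two 5-element loop-length vectors."""
    if len(a) != N_LOOPS or len(b) != N_LOOPS:
        raise ValueError("each loop-length vector must have 5 elements")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def distance_matrix(vectors):
    """Symmetric matrix of pairwise loop-length distances."""
    return [[loop_length_distance(u, v) for v in vectors] for u in vectors]


# Hypothetical loop lengths (in residues) for three A domains.
domains = [(10, 12, 8, 15, 20), (10, 12, 8, 12, 24), (10, 12, 8, 15, 20)]
matrix = distance_matrix(domains)
```

Domains with identical loop lengths are at distance zero, so grouping by small distances gives one simple way to form loop-length groups; the grouping procedure actually used for the figure is not described in the caption.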

Fig 7.

Demonstration of the result panel of the NRPS Motif Finder.

The NRPS Motif Finder result panel provides an interactive interface. The full result can be navigated by scrolling the page, and details about each motif and inter-motif region can be viewed by clicking the corresponding component. In particular, the predicted subtype and confidence score are displayed for C domains. General statistics about the NRPS architecture are displayed on the right for comparison. The results can be downloaded in table format.
