In silico analysis of class I adenylate-forming enzymes reveals family and group-specific conservations

Luciferases, aryl- and fatty-acyl CoA synthetases, and non-ribosomal peptide synthetase proteins belong to the class I adenylate-forming enzyme superfamily. The reaction catalyzed by the adenylate-forming enzymes is characterized by a two-step process of adenylation and thioesterification. Although all of these proteins perform a similar two-step process, each family may perform the process to yield completely different results. For example, luciferase proteins perform adenylation and oxidation to produce the green light emitted by fireflies, while fatty-acyl CoA synthetases perform adenylation and thioesterification with coenzyme A to assist in metabolic processes involving fatty acids. This study aligned a total of 374 sequences belonging to the adenylate-forming superfamily. Analysis of the sequences revealed five fully conserved residues throughout all sequences, as well as 78 more residues conserved in at least 60% of sequences aligned. Conserved positions are involved in magnesium and AMP binding and maintaining enzyme structure. Also, ten conserved sequence motifs that included most of the conserved residues were identified. A phylogenetic tree was used to assign sequences into nine different groups. Finally, group entropy analysis identified novel conservations unique to each enzyme group. Common group-specific positions identified in multiple groups include positions critical to coordinating AMP and the CoA-bound product, a position that governs active site shape, and positions that help to maintain enzyme structure through hydrogen bonds and hydrophobic interactions. These positions could serve as excellent targets for future research.


Introduction
Class I adenylate-forming enzymes (also termed the ANL superfamily [1]) include aryl- and acyl-CoA synthetases, fatty acid-AMP ligases, methylmalonyl-CoA synthetases, the adenylation domain of non-ribosomal peptide synthetases, and luciferases. They represent one class in a superfamily of enzymes that carry out adenylation, the activation of a carboxylate substrate [...]
Luciferases (EC 1.13.12.7) in fireflies and luminous beetles also share a common structure with these other adenylate-forming enzymes. In the phenomenon of bioluminescence, luciferases react luciferin with ATP to form an adenylated intermediate. Unlike most of the superfamily, which would then proceed to a thioesterification reaction, the luciferyl-AMP reacts with O2 in an oxidative decarboxylation to form AMP and CO2 and to emit a photon of light, typically in the yellow-green wavelength range [24]. An S286N mutation in Luciola cruciata luciferase shifts the emission wavelength to red [25]. However, under anaerobic conditions the luciferyl-AMP intermediate can react with CoA to form luciferyl-CoA [26]. In fact, luciferases appear to also act as LACSs, preferring substrates such as linolenic and arachidonic acids [27]. In addition, a single mutation of Ser345 in Agrypnus binodulus ACS allowed for luminescent activity [28]. Bioluminescence occurs in several organisms including bacteria, dinoflagellates, jellyfish, crustaceans, insects and fish. It is believed that bioluminescence may have convergently evolved up to thirty times [29,30].
Another family member is the adenylate-forming domain of non-ribosomal peptide synthetases (NRPSs). Bacteria and fungi possess NRPSs to synthesize antibiotic peptides such as cyclosporin A, gramicidin S [7], enterobactin [31], tyrocidine [32] and acinetobactin [33]. NRPSs have multiple modules, each of which adds a single amino acid to the antibiotic peptide. Each module has an adenylation domain that shares homology with class I adenylate-forming enzymes. This domain takes the amino acid and ATP and forms an amino acyl-AMP intermediate. For the thioesterification step, a peptidyl carrier protein (PCP) domain, instead of free CoA, is used to form a thioester to the amino acid and release AMP. This amino acyl moiety is finally added to the peptide using an unrelated condensation domain, without the involvement of ribosomes [1,34]. A study of NRPS mutants in Pseudomonas aeruginosa suggests that the NRPS product cyclodipeptides affect bacterial quorum sensing and root development in plants [35]. Fatty acid-AMP ligases (FAALs) form a fatty acyl-AMP intermediate from a fatty acid and ATP, similar to ACSs. However, in a process analogous to NRPSs, the fatty acyl group is transferred to an acyl carrier protein component of the enzyme polyketide synthase. This pathway helps to generate lipids associated with virulence in organisms like Mycobacterium tuberculosis [36,37].
A large number of sequences and representative tertiary structures are available for each type of class I adenylate-forming enzyme. There has not been an extensive study that has compared these enzymes. The goal of this research was to align a large number of protein sequences for each homologue. We then attempted to identify and confirm the conserved structural and functional roles of residues and sequence motifs in all of these enzymes. Phylogenetic analysis was used to examine family relationships and identify enzyme groups for further analysis. Group entropy analysis and other methods indicated group-specific conservations for each enzyme homologue, identifying key residue positions that may help to determine the unique function of each enzyme.

Materials and methods
The procedure used here was analogous to the procedure we previously published [38,39]. [...] (PDB ID: 3O82) and Mycobacterium tuberculosis FadD10 long chain fatty acyl-CoA ligase (PDB ID: 4IR7) from the RCSB Protein Data Bank. Each sequence was then used to perform a PSI-Blast [40] search of the non-redundant protein database at the National Center for Biotechnology Information (NCBI). A total of 374 amino acid sequences of class I adenylate-forming enzymes were collected with percent identities ranging from 99% to 12%. These sequences were initially aligned using T-Coffee [41]. To improve alignment quality, the alignment was manually adjusted, guided by tertiary structure comparison of all structures using MAPSCI (http://www.geom-comp.umn.edu/mapsci/) [42] and of pairs of structures using the jFATCAT method of the RCSB PDB Protein Comparison Tool [43,44]. The alignment editor used was GENEDOC [45]. Conservations within the alignment were analyzed for structural or functional significance. Molecular visualization and distance calculations were performed using RASMOL [46]. Salt bridges were identified as amino and carboxylate groups that were less than 3.0Å apart. Hydrogen bonds were identified as hydrophilic groups that were less than or equal to 3.3Å apart. Hydrophobic interactions were identified as nonpolar atoms less than or equal to 4.5Å apart. Molecular graphics were generated using Chimera [47]. Torsional angles were determined using MolProbity [48]. Analysis of conserved sequence motifs was facilitated by the MEME program [49], and these motifs were searched against a protein database using MAST [50]. Group entropy analysis (GEnt) [51] was performed to compare SACS, ACL, FAAL, FadD10, LACS, MACS, Luciferase, MMCS and NRPS groups to each other. Evolutionary trace (http://mordred.bioc.cam.ac.uk/~jiye/evoltrace/evoltrace.html) [52,53] was also performed on the entire alignment. Protein residue conservation prediction (http://compbio.cs.princeton.edu/conservation/score.html) [54] was performed on subalignments of each of the nine groups identified. Each algorithm was run using combinations of both possible backgrounds (BLOSUM62 and SwissProt) and seven possible matrices (BLOSUM62, BLOSUM35, BLOSUM40, BLOSUM45, BLOSUM50, BLOSUM80 and BLOSUM100) distributed with the program. Scores presented for Shannon Entropy and Property Entropy represent the top 25 scoring residues. For Relative Entropy and JS Divergence, residue positions reported were predicted by all distributions used. For VN Entropy and Sum of Pairs analyses, residues reported were predicted using all seven scoring matrices (BLOSUM62, BLOSUM35, BLOSUM40, BLOSUM45, BLOSUM50, BLOSUM80 and BLOSUM100) distributed with the program.
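The distance criteria stated above for salt bridges, hydrogen bonds and hydrophobic interactions can be illustrated with a minimal Python sketch. It is not the RASMOL workflow used in this study. The atom-type labels ("amino", "carboxylate", "polar", "nonpolar") and the function names are hypothetical, and treating amino and carboxylate groups as hydrophilic groups for the hydrogen bond test is an assumption.

```python
import math

# Cutoffs in angstroms, taken from the criteria stated in the Methods.
SALT_BRIDGE_MAX = 3.0     # strictly less than
HYDROGEN_BOND_MAX = 3.3   # less than or equal to
HYDROPHOBIC_MAX = 4.5     # less than or equal to

# Assumption: charged groups also count as hydrophilic groups.
HYDROPHILIC = {"amino", "carboxylate", "polar"}


def distance(a, b):
    """Euclidean distance between two (x, y, z) coordinates."""
    return math.dist(a, b)


def classify_contact(kind_a, kind_b, dist):
    """Classify one atom pair from hypothetical atom-type labels and a distance.

    Returns 'salt bridge', 'hydrogen bond', 'hydrophobic', or None.
    """
    kinds = {kind_a, kind_b}
    if kinds == {"amino", "carboxylate"} and dist < SALT_BRIDGE_MAX:
        return "salt bridge"
    if kinds <= HYDROPHILIC and dist <= HYDROGEN_BOND_MAX:
        return "hydrogen bond"
    if kinds == {"nonpolar"} and dist <= HYDROPHOBIC_MAX:
        return "hydrophobic"
    return None
```

For example, `classify_contact("amino", "carboxylate", distance(a, b))` would report a salt bridge only when the two groups lie closer than 3.0Å.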
The PHYLIP suite of programs was used to generate the phylogenetic tree [55]. First, the alignment was trimmed using TrimAl [56]. Four hundred bootstrapped data sets of the trimmed alignment were then generated using the SEQBOOT program. Next, distances for the data sets were determined by the PROTDIST program using the Jones-Taylor-Thornton matrix. Phylogenetic trees for each data set were generated using the NEIGHBOR program. Lastly, the unrooted consensus tree was generated using the CONSENSE program. The tree graphic was generated using FigTree (available at http://tree.bio.ed.ac.uk/software/figtree). A parsimony tree was generated from 75 bootstrapped data sets using the PROTPARS program, followed by CONSENSE [55].
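The bootstrapping step can be illustrated with a short Python sketch of the underlying idea, which is resampling alignment columns with replacement to build each replicate data set. This is only an illustration and not the SEQBOOT code. The function names, the dictionary representation of the alignment and the fixed seed are assumptions made for this example.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Build one replicate by sampling alignment columns with replacement.

    alignment: dict mapping sequence name -> aligned sequence (equal lengths).
    rng: a random.Random instance, so that replicates are reproducible.
    """
    length = len(next(iter(alignment.values())))
    # Draw as many column indices as there are columns, allowing repeats.
    columns = [rng.randrange(length) for _ in range(length)]
    return {
        name: "".join(seq[c] for c in columns)
        for name, seq in alignment.items()
    }


def bootstrap_replicates(alignment, n_replicates, seed=0):
    """Return a list of n_replicates resampled alignments."""
    rng = random.Random(seed)
    return [bootstrap_alignment(alignment, rng) for _ in range(n_replicates)]
```

Each replicate keeps every column intact across all sequences, so a distance matrix and tree can be computed per replicate before a consensus tree is built.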

Structure and residue conservations
A total of 374 amino acid sequences from the class I adenylate-forming superfamily were aligned (Fig 2), guided by tertiary structural alignment. The entire alignment can be found in S1 File. Above each amino acid position column is an index number, which is numbered consecutively from the beginning of the alignment; these index numbers will be used to reference each position throughout this manuscript. The sequences used included 49 aryl-CoA ligases (ACLs), 84 luciferase sequences, 42 LACSs, 66 MACSs, 53 NRPSs, 25 acetyl-CoA synthetases (SACSs), 31 MMCSs, 17 FAALs, and 7 mycobacterial FadD10 fatty-acyl CoA ligase sequences. Five residue positions were invariant among all 374 sequences: Glu328{490}, Gly384{573}, Asp418{624}, Arg433{639} and Lys524{740} (residue positions are in Thermus thermophilus LACS (sequence Thethelon), unless otherwise noted, with alignment index positions in curly brackets). A total of 22 additional residues were conserved in at least 80% of the sequences aligned and 56 more residues were conserved in at least 60%. A summary of the conserved residue interactions is found in Table 1. The locations of these evolutionary conservations were also visualized using the CONSURF program [57] (Fig 3A). Highly conserved residues in the family are clustered around the active site, which is the pocket in the enzyme where the substrates are bound, while the least conserved residues are located on the enzyme surface.
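The conservation tiers and the residue{index} notation used above can be sketched in Python as follows. This is a minimal illustration and not the procedure used in this study. The function names are hypothetical, gaps are assumed to be written as "-", gapped sequences are assumed to count against conservation, each sequence is assumed to begin at residue 1, and one-letter residue codes are used instead of the three-letter codes in the text.

```python
from collections import Counter


def column_conservation(column):
    """Return (most common residue, fraction of all sequences carrying it)."""
    counts = Counter(res for res in column if res != "-")
    if not counts:
        return None, 0.0
    residue, n = counts.most_common(1)[0]
    return residue, n / len(column)


def conservation_tier(fraction):
    """Map a conserved fraction onto the three tiers described in the text."""
    if fraction == 1.0:
        return "invariant"
    if fraction >= 0.8:
        return "80-99%"
    if fraction >= 0.6:
        return "60-79%"
    return None


def conserved_positions(seqs):
    """List (alignment index, residue, tier) for each conserved column."""
    results = []
    for index, column in enumerate(zip(*seqs), start=1):
        residue, fraction = column_conservation(column)
        tier = conservation_tier(fraction)
        if tier is not None:
            results.append((index, residue, tier))
    return results


def residue_number(aligned_seq, index):
    """Residue number in one sequence at a 1-based alignment index."""
    if aligned_seq[index - 1] == "-":
        return None  # this sequence has a gap at that index
    return sum(1 for res in aligned_seq[:index] if res != "-")


def position_label(aligned_seq, index):
    """Format a position as residue, residue number and {alignment index}."""
    number = residue_number(aligned_seq, index)
    if number is None:
        return None
    return f"{aligned_seq[index - 1]}{number}{{{index}}}"
```

With the toy sequence `"MK-E"`, `position_label("MK-E", 4)` gives `E3{4}`, mirroring how Glu328{490} pairs a residue number in one sequence with an index in the full alignment.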
Residue functions were analyzed using the Thermus thermophilus LACS (sequence Thethelon, PDB ID: 1V25) structure, with exceptions using the Luciola cruciata firefly luciferase (sequence Luccruluc, PDB ID: 2D1R) structure. Residues within the active site in both T. thermophilus LACS and L. cruciata luciferase are shown in Fig 3B and 3C. T. thermophilus LACS structure was chosen for analysis as it had ligands, ANP and Mg2+, bound in its active site and also had a substrate modeled to allow for atomic distances to be measured and functions to be interpreted. In addition, the function of several conserved residues had already been proposed [3]. L. cruciata luciferase structure was also chosen as it also had ligands bound in its active site to assist analysis and as it was the initial structure used in beginning the project.
[Figure caption, alignment of representative sequences] Sequences include Luciola cruciata luciferase (Luccruluc), Alcaligenes 4-chlorobenzoyl-CoA ligase (Alcalc4b) as an ACL, Thermus thermophilus LACS (Thethelon), human MACS (Homsapacoa), Brevibacillus brevis gramicidin synthase phenylalanine-activating domain (Brebregram) as an NRPS, Salmonella enterica acetyl-CoA synthetase (Salentaco) as a SACS, Rhodopseudomonas palustris MMCS (Rhopalmco), E. coli FAAL (Ecolifaal) and Mycobacterium tuberculosis FadD10 long chain fatty acyl-CoA ligase (Myctubfd10). The entire alignment, which contains 374 protein sequences, is found in S1 File. Residue positions are colored based upon their conservation in the entire alignment as follows: red = 100% conserved, green = 80-99% conserved, and blue = 60-79% conserved. Indel (gap) positions from the entire alignment (S1 File) are retained to allow correlation with index position numbers (numbers shown above the alignment columns) that are noted within the text.
Several conserved residues interact with the ATP/AMP coenzyme (Table 1). Gly302{457} and Tyr324{486} interact with the adenine moiety [3]. A mutation of Tyr304{486} in CBL to phenylalanine did not alter enzyme function, as phenylalanine could still ring stack with the adenine ring [22] (Table 2). Asp418{624} coordinates both the 2' and 3' ribose hydroxyls, while Arg433{639}, which is found in the linker motif, also interacts with the 2' hydroxyl through a water molecule [3]. Mutations of both residues severely hindered enzymatic activity [22,59,60] (Table 2). In addition, Gly302{457} interacts with the 4' hydroxyl involved in the hemiacetal bond [3]. In CBL the adenine ring of the substrate-AMP adduct is located between the equivalent glycine (Gly281{457}) and Thr283{459}. It has also been suggested that the glycine at index 457 in CBL keeps the phosphopantetheine tunnel open [21]. Thr327{489} forms two hydrogen bonds to the α-phosphate of AMP [3]. Mutagenesis of the equivalent threonine (Thr307{489}) in CBL caused a significant reduction in catalytic efficiency with the 4-chlorobenzoate substrate [22] (Table 2). Thr185{323} interacts through a water molecule with the γ-phosphate of ANP. In CBLs the main chain nitrogen and side chain hydroxyl of Thr165{323} also interact with the γ-phosphate of ATP [22]. Lastly, while Lys524{740} lacked structural coordinates in the T. thermophilus LACS structure, Lys531{740} in the L. cruciata luciferase structure coordinates the α-phosphate of AMP [25]. The equivalent residue in CBL (Lys492{740}) lies close to and may react with the carboxylate group of the substrate in the adenylation conformation, with a significant decrease in rate for this part of the reaction seen in a K492A mutant (Table 2). This lysine rotates into the solvent in the thioesterification conformation [22]. The binding of the lysine at index 740 to ATP was also supported by mutagenesis in Mycobacterium tuberculosis FadD13 ACS [61] (Table 2).
Thus, the majority of the invariantly conserved residues coordinate the AMP moiety and the critical Mg2+ ion, functions shared by all family members.
Four conserved residues (Table 1) line the myristoyl substrate pocket of the T. thermophilus LACS structure: Gly301{456}, Tyr324{486}, Gly325{487} and Thr327{489} [3]. The conserved glycine at index 487 lies at a location that is a tryptophan in SACS (Trp414{487} in S. enterica SACS, sequence Salentaco). This bulkier residue likely results in a shorter fatty acid substrate preference in SACS, while a glycine would allow for longer fatty acids to bind to MACSs and LACSs [5]. The carbonyl oxygen of the equivalent glycine in gramicidin synthase (Gly324{487}, sequence Brebregram) hydrogen bonds to the amino group of the phenylalanine substrate [7].
Several other conserved residues may also help to maintain enzyme structure through hydrogen bond or salt bridge formation (Table 1). The hydroxyl of Tyr183{321} forms a hydrogen bond to His117{206}. A Y213A mutant at index 321 in E. coli ACS resulted in no detectable activity [62] (Table 2). Lys192{330} lies at the end of the P-loop and its side chain amine interacts with the carbonyl oxygen of another conserved residue, Thr188{326}, and also lies close to the hydroxyl of Thr187{325}. Mutagenesis of the lysine at index 330 and the threonine at index 325 both significantly hindered activity (Table 2). The hydroxyl of Tyr397{591} forms a hydrogen bond with the side chain carboxylate of the invariant Glu328{490}. The side chain guanidinium of Arg433{639} forms a hydrogen bond to the carbonyl oxygen of Leu437{643} and a salt bridge to the side chain carboxylate of Glu475{682}. Mutation of the arginine at index 639 (Arg400) in CBL indicates the importance of a salt bridge with Asp402{641} to stabilize the thioesterification conformation [22] (Table 2). Asp449{655} lies at a position that is always an acidic residue, with glutamate being 85% conserved in the entire alignment. The side chain carboxylate of Asp449{655} lies close to the hydroxyl of Ser446{652} in T. thermophilus LACS. Glu416{655} in CBL forms a salt bridge to Lys474{722} and a hydrogen bond to the main chain nitrogen of His413{652}. Lastly, the side chain carboxylate of Glu451{657} forms a hydrogen bond to the main chain nitrogen of Val465{672} and a salt bridge to the side chain amine of Lys527{743}. An E457K mutation in Luciola mingrelica luciferase at index 657 (Table 2) caused a strong red shift in emission color, and suggested that rigidity in the carboxy-terminal domain is important for green emission in luciferases [63,64].
Eleven of the 27 residues conserved in at least 80% of sequences in the entire alignment were glycine residues: Gly68{150}, Gly96{178}, Gly186{324}, Gly189{327}, Gly325{487}, Gly358{538}, Gly384{573}, Gly417{623}, Gly426{632}, Gly442{648}, and Gly523{739}. The overrepresentation of glycines among the highly conserved residues is due to their critical role in protein structure in turns or where the lack of a side chain is necessary. This phenomenon occurs in other enzyme families, such as aldehyde dehydrogenases [65], alcohol dehydrogenases [66,67], arginases [68] and NDP-sugar dehydrogenases [38]. Seven conserved glycines (Gly68{150}, Gly96{178}, Gly186{324}, Gly189{327}, Gly358{538}, Gly426{632}, and Gly442{648}) lie at turns in the enzyme structure, as seen within the 1V25 T. thermophilus LACS structure. Of those seven conserved glycines found in turns, all but Gly186{324} had positive phi angles, which is common in glycines found in turns [69]. In CBL Gly409{648}, which is part of the previously identified motif A8 [70], lines the tunnel for binding the phosphopantetheine portion of CoA. Mutation of this residue to leucine resulted in activity loss only during the thioesterification step [21] (Table 2). Three other glycines (Gly325{487}, Gly384{573}, and Gly417{623}) are found in beta strands. Mutation of the glycine at index 623 in E. coli ACS (Gly437) significantly reduced activity, but did not change substrate preference [60] (Table 2). Next, Gly426{632} is found at the dimer interface of the T. thermophilus LACS structure, making hydrophobic contact with Leu30{103} from the neighboring subunit. Mutation of the glycine at index 632 in E. coli ACS (Gly446) significantly reduced activity for two of the three fatty acid substrates tested [60] (Table 2). Three highly conserved residues, Gly202{324}, Gly205{327}, and Pro207{329}, are found in the P-loop of L. cruciata luciferase, which suggests that these residues may play critical structural roles for the P-loop. In human MACS Gly223{324} lines the pyrophosphate-binding pocket [5]. Mutations in all three of these residues in the P-loop severely inhibited enzymatic activity (Table 2).
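For readers unfamiliar with the phi angle discussed above, the following Python sketch computes a backbone torsion angle from four atomic coordinates. It is a generic dihedral calculation and not the MolProbity implementation used in this study. The function names are hypothetical, and the phi angle of a residue is taken here as the dihedral defined by the carbonyl carbon of the preceding residue followed by the N, Cα and C atoms of the residue itself.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in degrees defined by four points."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    norm = math.sqrt(_dot(b1, b1))
    b1n = tuple(x / norm for x in b1)
    # Project the outer bonds onto the plane perpendicular to the central bond.
    v = _sub(b0, tuple(_dot(b0, b1n) * x for x in b1n))
    w = _sub(b2, tuple(_dot(b2, b1n) * x for x in b1n))
    x = _dot(v, w)
    y = _dot(_cross(b1n, v), w)
    return math.degrees(math.atan2(y, x))


def phi(c_prev, n, ca, c):
    """Phi torsion: C of the previous residue, then N, CA and C of this one."""
    return dihedral(c_prev, n, ca, c)


def has_positive_phi(c_prev, n, ca, c):
    """True when the phi angle of the residue is greater than zero."""
    return phi(c_prev, n, ca, c) > 0.0
```

Applied to coordinates read from a structure file, `has_positive_phi` would flag residues such as the turn glycines noted above.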

Conserved motifs
The ten most conserved sequence motifs were statistically identified using the MEME program [49] (Table 3). Four of the five fully conserved residues cluster into three of the conserved motifs. Several of these motifs correlate to motifs previously identified specifically in the adenylation domain of NRPSs [70] (Table 3). Motif 1, which covers previous NRPS motifs A7 & A8, contains two invariant residues, Asp418{624} and Arg433{639}. Residues in motif 1 line the active site (Fig 4) and include the linker motif. Beta strands 19-22 and helix α-N comprise motif 1 (structural terminology from [3]). Motif 2 contains the fully conserved Lys524{740} and covers previous NRPS motif A10. It contains β-25 and α-P. Two highly conserved residues, Thr188{326} and Lys192{330}, are found in motif 3, which correlates to NRPS motif A3. Motif 3 lines the active site and includes the P-loop. Motif 3 has been well studied through site-directed mutagenesis (summarized in [70]), which suggests that it is critical in the adenylation step [4]. Motif 4 also lines the active site but is not present in NRPSs and FAALs, which both join the substrate to a carrier protein instead of CoA. Motif 5, which covers the previous NRPS motif A6, contains Gly384{573}, which is found in β-18. This motif did not appear in the LACS, SACS, or MMCS groups. Motif 7 lines the active site but is not present in mycobacterial FadD10s.
One of the few motifs identified previously in NRPSs [70] that was not identified in the top ten motifs in this study was motif A5, which has an NxYGPTE sequence, covers the adenine (A) motif [3], and would be found at indices 484-490 in our alignment. Although it is well conserved in our alignment, including the invariant Glu328{490}, it is not surrounded by additional conservations, which might explain why it was not identified here. This stretch of residues has also been suggested to be critical in the adenylation reaction [4].
The motifs identified by MEME were used to search the Uniprot database for other proteins with potential homology to class I adenylate-forming enzymes using MAST [50]. The MAST search returned more than 290,000 sequence hits, ranging from the strongest hit with an E-value of 4.6e-114 to the weakest hit with an E-value of 10, and most of the proteins identified were class I adenylate-forming enzymes. The MAST search also discovered a class I adenylate-forming enzyme that had not been included in this project, D-alanine-poly(phosphoribitol) ligase, which is also called D-alanine-D-alanyl carrier protein ligase (ACPL). An example of an ACPL is DltA D-alanine-D-alanyl carrier protein ligase from Streptococcus pyogenes (sp|P0DA64| DLTA_STRP3, PDB ID: 3LGX) [9], which had an E-value in the MAST search of 1.3e-24. DltA is involved in the process of adding D-alanine to lipoteichoic acids during cell wall formation in Gram-positive bacteria [9]. DltA possesses motifs 3, 7, 8, 5, 1 and 2 (in that order). In addition, structural alignment (not shown) with T. thermophilus LACS (PDB ID: 1V25) showed a close match with an RMSD value of 2.79Å and a percent identity of 14.3%.
Two other proteins that came up multiple times in the MAST search results were cinnamyl alcohol dehydrogenase and phenylalanine racemase. An example of a cinnamyl alcohol dehydrogenase is from Arabidopsis thaliana (tr|B1GV07|B1GV07_ARATH), which had a search E-value of 2.1e-79. It possesses motifs 6, 3, 9, 7, 8, 5, 1, 4 and 2, in that order. Structural alignment of the AtCAD5 cinnamyl alcohol dehydrogenase from Arabidopsis (PDB ID: 2CF5) [71] with T. thermophilus LACS (PDB ID: 1V25) showed some homology with an RMSD value of 3.60Å and a percent identity of 8.6%. However, cinnamyl alcohol dehydrogenases are in a different class of enzymes, oxidoreductases, and convert an alcohol to an aldehyde using NADP+, not ATP [71]. An example of phenylalanine racemase is an ATP-hydrolyzing phenylalanine racemase from Serratia (tr|V3TT50|V3TT50_SERS3), which had a search E-value of 1.5e-51. It possesses motifs 10, 3, 9, 7, 8, 5, 1, 4 and 2 in that order. It is interesting to note that this pattern of motifs is similar to that found in cinnamyl alcohol dehydrogenase. There are no protein structures for phenylalanine racemases in the PDB, but there is an N-amino acid racemase crystallized with N-acetyl-phenylalanine from Amycolatopsis (PDB ID: 5FJT) (to be published). Structural alignment of this Amycolatopsis racemase with T. thermophilus LACS (PDB ID: 1V25) showed some structural homology with an RMSD value of 3.65Å and a percent identity of 6.1%. However, phenylalanine racemase is another enzyme from a different enzyme class, isomerases.
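The two quantities reported for each structural comparison above, percent identity and RMSD, can be illustrated with a short Python sketch. The calculations performed by the structure comparison tools used in this study are not described here, so the following rests on two assumptions: percent identity is computed over columns where both sequences have a residue, and RMSD is computed over paired coordinates that have already been superposed. The function names are hypothetical.

```python
import math


def percent_identity(aligned_a, aligned_b):
    """Percent of identical residues over columns with no gap in either row."""
    pairs = [
        (x, y) for x, y in zip(aligned_a, aligned_b) if x != "-" and y != "-"
    ]
    if not pairs:
        return 0.0
    matches = sum(1 for x, y in pairs if x == y)
    return 100.0 * matches / len(pairs)


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between paired, pre-superposed coordinates."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and equal in length")
    squared = sum(math.dist(p, q) ** 2 for p, q in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))
```

Other definitions of percent identity use a different denominator, such as the length of the shorter sequence, so values from different tools are not always directly comparable.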

Phylogenetic analysis
An unrooted bootstrapped phylogenetic tree of the class I adenylate-forming enzyme superfamily was generated using the neighbor-joining method (Fig 5). This method was chosen as maximum likelihood and parsimony methods are computationally prohibitive for larger datasets, and as other studies have indicated that the neighbor-joining method has yielded quality evolutionary relationships in some families [72]. In fact, a bootstrapped parsimony tree (S1 Fig) using only 75 datasets had similar group arrangements and sequence groupings to the neighbor-joining tree using 400 replicates. The neighbor-joining tree was used to assign each sequence into an appropriate group for group entropy analysis. Nine distinct groups were identified in the phylogenetic tree: Luciferases, NRPS, LACS, MACS, ACL, SACS, MMCS, FAAL and FadD10. Groups were named based upon the representative tertiary structure present in each clade, although some ACS sequence names within the group did not necessarily correlate to the group name. For example, some sequences named medium chain ACSs, when part of this larger dataset, were more homologous to the long chain ACS structure, falling within the LACS clade of the tree. It is possible some of these sequences may have been misidentified due to homology searches at the time of submission. Luciferases were most similar to LACSs. This is not unexpected as luciferases can act as long chain fatty acyl-CoA synthetases [27], and vice versa [28]. It was surprising that long-chain ACSs (LACS) were quite removed in the tree from short-chain (SACS) and medium-chain ACSs (MACS), as these fatty acyl-CoA synthetases differ solely in the length of their fatty acyl substrate. MMCSs were closely related to ACLs, but due to their substrate difference were categorized as different groups. Both groups attach substrates to CoA. Two other closely related groups were FAALs and NRPSs. Both groups attach the reaction intermediate (amino acyl-AMP in NRPSs and fatty acyl-AMP in FAALs) to a carrier protein, rather than CoA. The NRPS group contained a subgroup of fourteen 2,3-dihydroxybenzoate AMP ligase (DHB) sequences.

Determining group-specific residues
The GEnt program [51] detects amino acid residues characteristic of an individual protein family from an alignment with other related proteins. The GEnt program utilizes the Kullback-Leibler method to calculate a divergence measure to identify covariance in protein families. GEnt calculates two entropy values, "Group Entropy" and "Family Entropy." Group Entropy represents the degree of residue conservation at a specific position within the designated group and Family Entropy represents the degree of residue conservation at that same position within the entire alignment. This study was concerned with residues with the highest Group Entropy scores, which indicate that the residues are well conserved in their group, and low Family Entropy scores, which indicate that the residues are not well conserved throughout the entire alignment. These residues would indicate novel positions that contribute to the unique function and structure of each adenylate-forming homologue. The GEnt program has been used to identify critical, group-specific conservations in class 3 ALDHs [51], NDP-sugar dehydrogenases [38] and heme oxygenase homologues [39]. The Evolutionary Trace program was developed to identify critical residues in active sites and clusters of residues at functional interfaces in proteins which are unique to each group in a protein family [52,53]. In addition, six other algorithms were used to identify functional residues in each group of class I adenylate-forming enzymes: Jensen-Shannon Divergence, Property Entropy, VN Entropy, Relative Entropy, Shannon Entropy and Sum of Pairs Analysis [54]. Only residues that were identified for all combinations of backgrounds and matrices used for each algorithm were reported as results.
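The idea of scoring a position by how strongly a group's residue usage diverges from that of the whole family can be sketched with a Kullback-Leibler divergence in Python. This is an illustrative calculation only and is not the GEnt scoring function, whose exact form is not given here. The function names are hypothetical, gaps are treated as an ordinary symbol, and the group sequences are assumed to be a subset of the family sequences so that every residue seen in the group also occurs in the family column.

```python
import math
from collections import Counter


def frequencies(column):
    """Relative frequency of each symbol in one alignment column."""
    counts = Counter(column)
    total = len(column)
    return {symbol: n / total for symbol, n in counts.items()}


def kl_divergence(group_column, family_column):
    """Kullback-Leibler divergence (bits) of group usage from family usage.

    Assumes the group sequences are included in the family column, so each
    residue found in the group has a non-zero family frequency.
    """
    p = frequencies(group_column)
    q = frequencies(family_column)
    return sum(pi * math.log2(pi / q[symbol]) for symbol, pi in p.items())


def rank_positions(group_seqs, family_seqs):
    """Return (alignment index, divergence) pairs, highest divergence first."""
    group_columns = list(zip(*group_seqs))
    family_columns = list(zip(*family_seqs))
    scores = [
        (index, kl_divergence(g, f))
        for index, (g, f) in enumerate(zip(group_columns, family_columns), start=1)
    ]
    return sorted(scores, key=lambda item: -item[1])
```

A position that is invariant across the whole family scores zero in this sketch, whereas a position conserved within the group but occupied by other residues elsewhere scores highly, which is the pattern of high Group Entropy and low Family Entropy described above.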
The GEnt results will be focused on in this manuscript for several reasons. First, GEnt has been used previously to identify group-specific residues in several families, noted above. Second, GEnt allows the user to define their own groups and place specific sequences in each group while analyzing the entire alignment. In contrast, six of the methods used (Shannon Entropy, Property Entropy, Relative Entropy, Jensen-Shannon Divergence, VN Entropy and Sum of Pairs analyses) could not identify groups within the entire alignment, so each method had to be provided subalignments for each individual group. Thus, they tended to identify residues already conserved in the entire alignment. The Evolutionary Trace program in our analysis also tended to identify residues conserved in the entire superfamily. For example, in the Luciferase group nearly half (15 of 32) of the positions identified by Evolutionary Trace were conserved positions in at least 80% of sequences in the entire alignment. Thus, only a fraction of the residues identified by these other methods may actually be group specific. Third, there was a degree of redundancy in the positions identified by these other methods. For example, Evolutionary Trace identified 16 index positions in LACSs, all of which were also identified in Luciferases. Also, Evolutionary Trace identified eight index positions in NRPSs, all of which were also identified in LACSs and Luciferases, and several of which are highly conserved in our alignment. In addition, several of the positions identified by the majority of these other methods were also identified by GEnt. Lastly, GEnt does not analyze positions in the alignment that contain predominantly gaps. For these reasons, the results for all the methods used to identify group-specific residues are summarized in S1 Dataset.

Group-specific residues in luciferases
Eight residues had the highest Group Entropy scores in the Luciferase group (Table 4). Complete GEnt results for Luciferases can be found in S2 Dataset. The combined results for all methods used to identify group-specific functional residues in Luciferases are summarized in S1 Dataset. Examination of residues was done with L. cruciata luciferase (sequence Luccruluc, PDB ID: 2D1R). One residue, Ser200{322}, hydrogen bonds to the α-phosphate of AMP. Nakatsu and colleagues [25] showed that Ser200{322} also binds to the sulfate group of the bound DLSA, which represents a substitute for AMP in the binding pocket. Pro452{652} lies at the beginning of α-18 and may be important for the structure of the loop containing Gln450{650}, also identified by GEnt in Luciferases. Two residues, Lys512{721} and Arg515{724}, form salt bridges in luciferases. The side chain amine of Lys447{647} is near the side chain hydroxyl of Tyr446{646}, but is too far for hydrogen bond formation. The remaining residues identified by GEnt (Gln450{650}, Tyr446{646} and Ala479{680}) are involved in hydrophobic interactions. Two of these residues, Tyr446{646} and Ala479{680}, contact each other. All of the highest scoring GEnt residues in Luciferases, except Ser200{322}, cluster on the surface in the carboxy-terminal domain (Fig 6). This clustering raised the question of whether these residues might be involved in intersubunit contact, as the L. cruciata luciferase structure is a monomer. However, analysis of the Photinus pyralis luciferase dimer (PDB ID: 5KYT) demonstrated that this region is not involved in dimeric contacts in that molecule [73].
Three positions, Arg218{343}, Leu286{421} and Ser347{494}, identified as lining the substrate binding site and affecting substrate specificity in Photinus pyralis luciferase (PDB ID: 4G36) [74], were not identified as group specific locations in luciferases in this study. In addition, none of the mutations, R214K{343}, H241K{373}, S246H{379} and H347A{488}, that caused a shift in emission wavelength of Pyrearinus termitilluminans luciferase [75] were identified as group specific positions in luciferases in this study. However, indices 373 and 488 were identified as group-specific positions in other groups.

Group-specific residues in LACSs
Eight residues had the highest Group Entropy values in the long-chain fatty-acyl CoA synthetase (LACS) group (Table 5). Complete GEnt results for LACSs can be found in S3 Dataset. Group-specific residues identified in LACSs by all methods used are summarized in S1 Dataset. Examination of residues was done with T. thermophilus LACS (sequence Thethelon, PDB ID: 1V25). One residue, Trp444{650}, hydrogen bonds to the α-phosphate of the AMP moiety [3]. Trp234{378} lies within 4.5Å of the myristoyl moiety of the substrate. Hisanaga and colleagues [3] refer to Trp234{378} as the "gate residue" because once ATP binds, T. thermophilus LACS transitions to a closed conformation which leads to the opening of the tryptophan gate to the fatty acid-binding tunnel. His85{167}, His100{182} and Tyr196{334} form hydrogen bonds in LACSs. His85{167} hydrogen bonds to the carbonyl oxygen of Phe80{162}, also identified by GEnt, acting to maintain enzyme folding. The remainder of the residues identified by GEnt (Phe80{162}, Trp505{721} and Ala182{320}) form hydrophobic contacts in the enzyme. Ala182{320} hydrophobically contacts Tyr196{334}, noted above.
A previous study [60] identified a signature sequence for ACSs, which in our alignment (Fig 2) would cover indices 607-641 and would comprise part of motif 1 identified here. This stretch contains several highly conserved residues, including Gly417{623}, Asp418{624}, Gly426{632} and Arg433{639}. However, none of the residues identified here as group-specific for LACSs are found in this region.
An additional note is that a mutagenesis study [76] was performed on E.coli FadD LACS to try and shift substrate preference towards medium chain fatty acids. Seven mutations caused increased growth rates with hexanoate and octanoate, but not oleate. The mutations were of residues Val4{which corresponds to alignment index 69}, Trp5{70}, Tyr9{74}, Gln338{461}, Asp372{501}, His376{533}, Phe447{633} and Val451{637} (Table 2). These residues were not near the fatty acyl-or CoA-binding sites, but near the site of AMP exit. None of these indices

Group-specific residues in NRPSs
Eight residues had the highest Group Entropy scores in the non-ribosomal peptide synthetase (NRPS) group (Table 6). Complete GEnt results for NRPSs can be found in S4 Dataset. Group-specific residues identified in NRPSs by all methods used are summarized in S1 Dataset. Examination of residues was done with Brevibacillus brevis gramicidin synthetase phenylalanine-activating domain (sequence Brebregram, PDB ID: 1AMU). Phe234{373} forms part of the active site pocket near the α-phosphate of AMP and the carbonyl oxygen of the phenylalanine substrate. Gln432{643}, Glu441{652} and Glu443{654} form hydrogen bonds in NRPSs. Glu424{635} was found on a surface loop where it lies close to His344{533}. Tyr358{547}, Leu442{653} and Leu512{735} contribute to hydrophobic packing interactions within the enzyme. Tyr358{547} ring stacks with Phe402{609}. Of note is that none of the positions identified as critical to substrate preference in B. brevis gramicidin synthetase and Paenibacillus fusaricidin synthase were identified with high Group Entropy scores in NRPSs [77,78].

Group-specific residues in MACSs
Ten residues were found to have the highest Group Entropy scores in the medium-chain fatty-acyl CoA synthetase (MACS) group (Table 7). Complete GEnt results for MACSs can be found in S5 Dataset. The group-specific residues identified in MACSs by all methods used are summarized in S1 Dataset. Examination of residues was done with human MACS (sequence Homsapacoa), by examining both the adenylation (PDB ID: 3DAY) and thioesterification (PDB IDs: 2WD9 & 3EQ6) conformations. Phe458{636} in the adenylation conformation makes hydrophobic contact with the adenine ring of the bound APC, an ATP analog. Several residues identified by GEnt interact with butyryl-CoA in the thioesterification conformation in structure 3EQ6. Tyr540{723} hydrogen bonds to the 3' phosphate of the bound butyryl-CoA, while Arg501{680} forms a salt bridge to the β-5' phosphate of the butyryl-CoA. Trp265{373} and Leu267{375} make hydrophobic contact with the bound butyryl-CoA. The bulky side chain of Trp265{373} constricts the active site channel to guide the CoA thiol group toward the fatty acid for thioesterification [5]. Leu267{375} lines the left pocket wall to allow ibuprofen to bind to MACS [5]. Gly226{337} lies next to several residues that contact the bound APC molecule [5]. Ser476{654} hydrogen bonds with the main chain nitrogen of Gly226{337} to maintain enzyme folding. Thr137{185} provides a vital structural function in both conformations: during thioesterification the side chain hydroxyl of Thr137{185} provides an intradomain hydrogen bond with the side chain carboxylate of Asp262{370}, and an interdomain hydrophobic contact with Val554{737} in the adenylation conformation. Trp120{168}, Thr137{185}, Tyr219{320} and Met230{331} form hydrophobic contacts within MACSs.

Group-specific residues in SACSs
Eight residues were found to have the highest Group Entropy scores in the short-chain fatty-acyl CoA synthetase (SACS) group (Table 8). Complete GEnt results for SACSs can be found in S6 Dataset. The group-specific residues identified in SACSs by all methods used are summarized in S1 Dataset. Examination of residues was done with Salmonella enterica acetyl-CoA synthetase (sequence Salentaco; PDB ID: 1PG3). Trp414{487} forms the pocket for the propyl group of the fatty acid substrate [10], which needs to be short for SACSs due to the presence of this large tryptophan residue. The conserved glycine at index 487 in MACSs and LACSs allows for a preference for longer fatty acid substrates [5]. Phe163{185} forms the active site pocket and is 3.3Å from the adenine ring of the bound CoA cofactor [10]. The hydroxyl of Thr438{538}, which has been reported to have abnormal angles, with ϕ = 70˚ and ψ = -118˚ [10], forms a hydrogen bond with the main chain nitrogen of Pro425{499}. Met141{163}, Thr278{336}, Trp395{465}, Leu477{591} and Trp598{729} participate in hydrophobic interactions within SACSs.

Group-specific residues in MMCSs
Eleven residues were found to have the highest Group Entropy scores in the methylmalonyl-CoA synthetase (MMCS) group (Table 9). Complete GEnt results for MMCSs can be found in S7 Dataset. The group-specific residues identified in MMCSs by all methods used are summarized in S1 Dataset. Examination of residues was done using Rhodopseudomonas palustris MMCS (sequence Rhopalmco; PDB IDs: 4FUT & 4FUQ). Several residues identified by GEnt contact substrates in the active site. The carbonyl oxygen of Arg299{485} hydrogen bonds with the adenine ring of ATP. The main chain carbonyl of the corresponding residue, Arg283{485}, of Streptomyces coelicolor MMCS (PDB ID: 3NYQ) also forms a hydrogen bond to the adenine ring of AMP. However, Arg283{485} also demonstrates a role in substrate binding through salt bridges to the bound methylmalonyl-coenzyme A (MCA) [79]. His209{375} in R. palustris MMCS hydrogen bonds to Ser277{457}. The equivalent residue in S. coelicolor MMCS, His189{375}, lines the active site pocket, even though the distance is too far (greater than 3.4Å, but within 3.8Å) to form hydrogen bonds to the methylmalonyl carbonyls of the bound MCA product. Ser277{457}, in addition to forming a hydrogen bond to His209{375}, makes hydrophobic contact with the adenine ring of ATP. The hydroxyl of the corresponding residue in S. coelicolor MMCS, Ser261{457}, also forms a hydrogen bond to the bound MCA [79]. Another residue that contacts MCA in S. coelicolor MMCS is Arg236{429}, which forms a salt bridge to the β-5' phosphate of the bound MCA [79].
Several more group-specific residues form hydrogen bonds and salt bridges. The side chain carboxylate of Glu351{576} forms a salt bridge on the surface of the molecule with Arg373{605}. His285{465} ring stacks with Pro319{534} and forms a hydrogen bond with the carbonyl oxygen of Val296{482}. The side chain of the corresponding residue of S. coelicolor MMCS, His269{465}, forms a salt bridge with the side chain carboxylate of Glu282{484}, which is 3.7Å from the adenosine amino group of the bound AMP [79]. Interestingly, the equivalent glutamate in R. palustris MMCS, Glu298{484}, is too distant from His285{465} to form a salt bridge, but does form a hydrogen bond to the adenine amino group of the bound ATP [20]. Met240{413}, Met247{421}, Phe273{453}, Met364{594}, and Met486{738} all form hydrophobic contacts in MMCSs. Met486{738} functions to form the binding pocket wall, at a distance of 6Å from an oxygen on the α-phosphate of the bound ATP.

Group-specific residues in FAALs
Nine residues were found to have the highest Group Entropy scores in the fatty acid-AMP ligase (FAAL) group (Table 11). Complete GEnt results for FAALs can be found in S9 Dataset. The group-specific residues identified in FAALs by all methods used are summarized in S1 Dataset. Examination of residues was done with E. coli fatty acid-AMP ligase (sequence Ecolifaal; PDB ID: 3PBK). An important note is that each position is three numbers higher in the PDB structure than in our sequence alignment. Position numbers from the PDB structure are used here. None of the residues identified by GEnt in FAALs interact with the substrate. One residue, Pro540{729}, forms a hydrogen bond between its carbonyl oxygen and the hydroxyl of Ser543{732}. Arg469{649} forms a salt bridge with Glu366{516}, which is in the insertion motif in FAALs. This blocks the binding of CoA, allowing for only the adenylation reaction to occur, rather than additional acyl-CoA synthetase activity [37]. The remainder of the residues that scored highly for Group Entropy (Trp224{368}, Leu245{390}, Trp262{408}, Phe279{425}, Cys284{430}, Phe494{675} and Ala557{746}) are involved in hydrophobic packing within FAALs. Three residues, Trp224{368}, Leu245{390} and Phe279{425}, appear to line the active site pocket, but are more than 5Å from the bound dodecanoyl-adenylate molecule. Leu245{390}, which lies at a position that is a 78% conserved glycine within the entire alignment, is 7.5Å from the Cω of the bound dodecanoyl-adenylate molecule. A glycine at this position could allow enzymes in other families to accommodate longer fatty acid chains. An additional note is that the activity of the Fad32 protein from mycobacteria, an FAAL involved in the synthesis of mycolic acids, is decreased by phosphorylation on Thr552, which is on an accessible loop [80]. However, structural alignment (not shown) of E. coli FAAL (PDB ID: 3PBK) with Fad32 from M. tuberculosis (PDB ID: 5HM3) revealed that Thr552 in Fad32 is in an insertion motif which is an extended loop not found in other aligned FAALs, and thus has no equivalent index position in our alignment. This suggests that this phosphorylation might be unique to mycobacteria.
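The numbering note for E. coli FAAL above (PDB positions three higher than positions in the aligned sequence) can be sketched as a small conversion helper. This is only an illustration: the function names and the `PDB_OFFSET` table are hypothetical, and the braces notation (the alignment index, e.g. {729}) is a separate coordinate that this helper does not compute.

```python
# Hypothetical helper: convert between PDB residue numbers and positions in
# the aligned sequence. The offset of 3 for Ecolifaal is taken from the text
# (PDB number = sequence position + 3); nothing else is assumed.
PDB_OFFSET = {"Ecolifaal": 3}


def to_pdb_number(sequence_name: str, sequence_position: int) -> int:
    """Position in the aligned sequence -> residue number in the PDB file."""
    return sequence_position + PDB_OFFSET[sequence_name]


def to_sequence_position(sequence_name: str, pdb_number: int) -> int:
    """Residue number in the PDB file -> position in the aligned sequence."""
    return pdb_number - PDB_OFFSET[sequence_name]


# Pro540 in structure 3PBK sits at position 537 of the Ecolifaal sequence.
print(to_sequence_position("Ecolifaal", 540))
```

Keeping the offset in a per-sequence table makes it explicit that the shift is a property of one structure file and should not be reused for other sequences.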

Group-specific residues in FadD10s
Ten residues were found to have the highest Group Entropy scores in the mycobacterial FadD10 long chain fatty acyl-CoA ligase (FadD10) group (Table 12). Complete GEnt results for FadD10s can be found in S10 Dataset. The group-specific residues identified in FadD10s by all methods used are summarized in S1 Dataset. Examination of residues was done with Mycobacterium tuberculosis FadD10 (sequence Myctubfd10; PDB ID: 4IR7). Similar to the FAALs, the residue position number in Myctubfd10 in our alignment is one higher than that of the position in the structural coordinates, which are the position numbers reported here. Only one residue identified by GEnt interacts with the substrate in FadD10s. Trp231{381} lies 3.7Å from the Cω of the bound dodecanoyl-adenylate substrate [81]. Therefore, Trp231{381} may influence the length of the fatty acid substrate that the enzyme could bind. Although not identified as group specific in Luciferases, a T251S mutation at index 381 improved luminescence with aminoluciferins [82] (Table 2); this change in substrate preference coincides with the residue's important location in the substrate-binding pocket.
Five other residues form hydrogen bonds within FadD10s. The apoenzyme structure (PDB ID: 4ISB) showed a hydrogen bond between the main chain nitrogen of Cys36{118} and the carbonyl oxygen of Gly245{395}. Ser425{641}, which lies in the linker motif connecting the amino-terminal and carboxy-terminal domains [81], forms a hydrogen bond to the side chain

Common group-specific positions
Residue positions with high Group Entropy scores in multiple groups would represent critical sites of evolutionary differences. There were eleven index positions identified by GEnt in multiple groups. Five common group-specific index positions line the active site pocket, including indices 185, 320, 373, 375 and 650. Index 650 had the highest Group Entropy score in three groups: Luciferases, LACSs and ACLs. The residue at this index appears to hydrogen bond to the α-phosphate of the AMP, but in a conformation-dependent manner. The side chain of Trp444{650} in LACSs hydrogen bonds to the α-phosphate of the AMP moiety [3]. In CBL, an ACL, the side chain of Asn411{650} hydrogen bonds to the α-phosphate of the AMP when the enzyme is in the thioesterification conformation only [22]. In the L. cruciata luciferase Gln450{650} was on a surface loop, removed from the active site. It is possible that this structure was in the adenylate-forming conformation, as luciferases do not carry out a thioesterification reaction. The residue at this index position throughout the entire alignment tends to be polar, being asparagine in ACLs, MMCSs, FAALs and FadD10s and arginine in SACSs, MACSs and NRPSs. Although index 650 was the position with the 54th highest Group Entropy score in MACSs, Arg472{650} in human MACS (sequence Homsapacoa) was examined for differences in both adenylation and thioesterification conformations, as structures were available for both. In the thioesterification conformation (PDB ID: 2WD9) the side chain of Arg472{650} was 2.8Å from the bound ibuprofen and formed a hydrogen bond (3.1Å) with the side chain hydroxyl of the conserved Thr221{322}. Also seen in the thioesterification conformation is a conserved interdomain salt bridge between Arg472{650} and Glu365{490}, which serves to block further ATP binding [5].
In the adenylation conformation of human MACS (PDB ID: 3DAY) a new interdomain salt bridge is formed between Arg472{650} and Glu407{572}, which lies right beside the invariant Gly408{573}. Index 373 was identified by GEnt in NRPSs, MACSs and ACLs. Histidine is 65% conserved in the entire alignment at index 373. In NRPSs Phe234{373} forms part of the active site pocket near the α-phosphate. In CBL His207{373} binds to the acid anhydride bond that connects the AMP and 4-chlorobenzoate moieties [21]. As inferred by studying a H207A mutant, the side chain of His207{373} also interacts with the 4-chlorobenzoate during the first part of the reaction [22] (Table 2). In human MACS Trp265{373} acts to narrow the pantetheine channel in the thioesterification conformation, which in turn directs the thiol of the CoA substrate to the correct position for nucleophilic attack on the fatty acyl-adenylate intermediate [5]. Thus, the residue at index 373 lies near the actual site of adenylate bond formation during catalysis.
Index 320 was identified by GEnt in LACSs and MACSs, and was also identified by the majority of other methods used to determine group-specific residues in ACLs and FAALs (S1 Dataset). In the entire alignment the residue at index 320 tends to be aliphatic. In the groups noted the residue at index 320 is involved in hydrophobic packing. In ACLs Phe159{320} contributes to hydrophobic packing. Ala182{320} in T. thermophilus LACS is 5.7Å from the bound ANP and hydrophobically contacts Tyr196{334}, also identified by GEnt in LACSs. Tyr219{320} in human MACS contacts Ile266{374}, which forms the left pocket wall in the active site [5]. In FAALs Gln182{320} is nearly 7Å from the bound dodecanoyl-AMP. Hence, this position contributes to the active site shape.
Index 185, identified by GEnt in MACSs and SACSs, has enzyme-specific functions. In S. enterica SACS Phe163{185} hydrophobically contacts the adenine ring of the bound CoA cofactor in the active site pocket [10]. In human MACS the hydroxyl of Thr137{185} forms an intradomain hydrogen bond with the side chain carboxylate of Asp262{370} during thioesterification, but makes interdomain hydrophobic contact with Val554{737} in the adenylation conformation [5]. Luciferases, ACLs, LACSs and FadD10s tend to have an asparagine at this index position.
Index 375, identified by GEnt in MACSs and MMCSs, lines the hydrophobic pocket wall of the active site where substrates bind in both groups. In human MACS Leu267{375} lines the left pocket wall and also contacts the butyryl-CoA near the sulfur atom in the thioesterification conformation [5]. In S. coelicolor MMCS His189{375} lines the active site pocket and contacts the bound MCA product [79]. His209{375} in R. palustris MMCS hydrogen bonds with Ser277{457} and also makes hydrophobic contact with Met486{738} and Arg299{485}, all of which were also identified by GEnt. These three residues contacted by His209{375} all play important roles in MMCSs (noted above). Although not identified as group specific in Luciferases, a F247S mutant at index 375 in Photinus pyralis luciferase increased light output with aminoluciferin, but with a high Km value [82] (Table 2), indicating that it lies close to the substrate.
The residue at index 654, identified by GEnt in ACLs, MACSs, and NRPSs, forms bonds to maintain the structure of these enzymes. Though MACSs mostly have a phenylalanine at index 654, Ser476{654} in human MACS hydrogen bonds with the main chain nitrogen of Gly226{337}, which lies next to several residues that contact the bound APC molecule [5]. The hydroxyl of Ser415{654} in CBL hydrogen bonds to the main chain carbonyl oxygen of Thr164 {325}. In B. brevis gramicidin synthetase, a NRPS, Glu443{654} hydrogen bonds to the main chain nitrogen of the invariant Arg428{639} and the side chain of Asn431{642}.
Five additional common group-specific index positions, indices 465, 643, 652, 721 and 729, are involved in hydrophobic interactions within most enzymes. Index 721 in the carboxyl-terminal domain had high Group Entropy scores in both Luciferases and LACSs. In most groups, the residue at index 721 tends to be a hydrophobic residue. In Luciferases Lys512{721} forms a salt bridge with the side chain carboxylate of Glu455{655} on the enzyme surface. In LACSs Trp505{721} contributes to hydrophobic packing. As index 721 lies in the carboxy-terminal domain, it is possible that the binding contacts for this residue might also change upon a shift in domain alternation. In human MACS the Cα of Tyr538{721} is 3.6Å from the O4 position of the bound butyryl-CoA in the 3EQ6 structure [5]. Thus, it may also play a role in coenzyme A binding.
Index 652 was identified by GEnt in NRPSs and Luciferases, and was also identified by the majority of other methods used to determine group-specific residues in ACLs, FAALs and MACSs (S1 Dataset). The residue at index 652 appears important to maintain enzyme structure, though through different mechanisms depending upon the enzyme. Index 652 is in a turn in the enzyme structure. Pro452{652} in Luciferases may be important for the structure of the loop containing Gln450{650}, also identified by GEnt in Luciferases. Proline at this position is unique to Luciferases. In MACS Gly474{652} also contributes to the structure of this turn. However, in NRPSs and ACLs the residue at index 652 forms a hydrogen bond. In NRPSs Glu441{652} forms a hydrogen bond to Gln414{625}. In ACLs the carbonyl oxygen of the conserved Thr164{325} forms a hydrogen bond to His413{652} during the thioesterification conformation [21]. In FAALs Trp472{652} is involved in hydrophobic packing.
Index 643 was identified by GEnt in NRPSs and ACLs. In the entire alignment, the residue at index 643 also tends to be aliphatic. In CBL Met404{643} contributes to hydrophobic packing. In gramicidin synthetase NRPS Gln432{643} hydrogen bonds with Gln414{625}, which is also contacted by Glu441{652} noted above. These interactions appear to be unique to NRPSs.
Index 465 was identified by GEnt in SACSs and MMCSs. In S. enterica SACS Trp395{465} contributes to hydrophobic packing. The side chain of His269{465} in S. coelicolor MMCS forms a salt bridge to Glu282{484}, which is close to the amino group of the bound AMP adenosine [79]. In both groups the residue at index 465 makes hydrophobic contact with the residue at index 534 and also contacts the residue at index 482.
Index 729, which is a 65% conserved phenylalanine in the entire alignment, was identified by GEnt in SACSs and FAALs. In E. coli FAAL Pro540{729} contributes to the structure of a surface loop and forms a hydrogen bond to Ser543{732}. In S. enterica SACS Trp598{729} is involved in hydrophobic packing.
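The search for index positions flagged in more than one group can be sketched in a few lines. The sets below are a transcription of the alignment indices named in the per-group paragraphs of this section, not the full GEnt output (which is in the S Datasets); Luciferases, ACLs and FadD10s are left out because their complete lists are not reproduced here, so indices 643, 650, 652 and 721 do not appear as shared in this partial illustration. The function name `shared_indices` is hypothetical.

```python
from collections import defaultdict

# Alignment indices named in the text for each group (partial transcription).
gent_indices = {
    "LACS": {162, 167, 182, 320, 334, 378, 650, 721},
    "NRPS": {373, 547, 635, 643, 652, 653, 654, 735},
    "MACS": {168, 185, 320, 331, 337, 373, 375, 636, 654, 680, 723},
    "SACS": {163, 185, 336, 465, 487, 538, 591, 729},
    "MMCS": {375, 413, 421, 429, 453, 457, 465, 485, 576, 594, 738},
    "FAAL": {368, 390, 408, 425, 430, 649, 675, 729, 746},
}


def shared_indices(groups, min_groups=2):
    """Map each index found in at least `min_groups` groups to those groups."""
    owners = defaultdict(set)
    for group, indices in groups.items():
        for index in indices:
            owners[index].add(group)
    return {
        index: sorted(names)
        for index, names in sorted(owners.items())
        if len(names) >= min_groups
    }


shared = shared_indices(gent_indices)
for index, names in shared.items():
    print(index, ", ".join(names))
```

With these six groups alone the sketch recovers indices 185, 320, 373, 375, 465, 654 and 729, matching the group pairings described in the text for those positions.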

Discussion
This project aligned a total of 374 amino acid sequences of class I adenylate-forming enzymes. Five residue positions were invariant, with 22 additional residues conserved in at least 80% of all of the aligned sequences, and 56 more residues conserved in at least 60%. Many of these residues have been studied by site-directed mutagenesis in several groups (Table 2). A threonine at index 322 and glutamate at index 490 coordinate the Mg2+ ion. Several highly conserved residues coordinate the AMP/ATP molecule, including indices 323, 457, 486, 489, 624, 639 and 740. Thirteen conserved positions, including indices 142, 145, 157, 325, 326, 329, 424, 456, 487, 573, 632, 686 and 734, contribute to hydrophobic packing within the enzyme. Five conserved residues at indices 321, 330, 591, 655 and 657 form hydrogen bonds or salt bridges that maintain enzyme folding. Four conserved residues at indices 465, 486, 487 and 489 line the fatty acid-binding pocket of T. thermophilus LACS. A high proportion of the conserved residues were glycines, a phenomenon seen in several other enzyme families [38,65-68]. These conserved residues are responsible for structural and functional aspects common to all superfamily members, such as magnesium and ATP binding, and hydrophobic packing.
Ten highly conserved sequence motifs were identified, half of which had been previously identified in the adenylation domain of NRPSs [70]. Motifs 1, 2, 3, 4, 7, 9 and 10 line the active site of T. thermophilus LACS. Motif 1 encompasses the linker (L) motif that connects the two domains. Motif 3 includes the P-loop in the phosphate-binding site. The adenine (A) motif that interacts with the adenine of AMP was not found in the ten motifs identified. Most sequence hits from a MAST search of a protein database using the motifs were adenylate-forming enzymes, including D-alanine-D-alanyl carrier protein ligase which was not included in this project. Two enzymes also identified by the MAST search were cinnamyl alcohol dehydrogenase and phenylalanine racemase, but they did not show functional similarities to adenylate-forming enzymes.
Phylogenetic analysis verified nine distinct groups of class I adenylate-forming enzymes, which were then used to identify group-specific residues. Surprisingly, the ACSs (SACSs, MACSs and LACSs) were not all on adjacent clades, with LACSs being more related to Luciferases than to the other ACSs. FAALs and NRPSs are located on neighboring clades. Both groups attach the reaction intermediate to a carrier protein, rather than CoA.
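The conservation tiers quoted above (invariant, at least 80%, at least 60%) can be illustrated with a minimal sketch on a toy alignment. The scoring rule used here, the share of all sequences carrying the most common non-gap residue in a column, is an assumption made for illustration and may differ from the procedure used in this study; the function names and the five toy sequences are hypothetical.

```python
from collections import Counter


def column_conservation(alignment):
    """Per column: fraction of sequences with the most common non-gap residue."""
    n_sequences = len(alignment)
    scores = []
    for column in zip(*alignment):
        counts = Counter(residue for residue in column if residue != "-")
        top = counts.most_common(1)[0][1] if counts else 0
        scores.append(top / n_sequences)
    return scores


def tier(score):
    """Bin a conservation score into the tiers used in the text."""
    if score == 1.0:
        return "invariant"
    if score >= 0.8:
        return ">=80%"
    if score >= 0.6:
        return ">=60%"
    return "variable"


# Toy alignment: five equal-length sequences, '-' marks a gap.
toy_alignment = ["GTKAD", "GTRAE", "GTKS-", "GTKLD", "GSKVD"]
tiers = [tier(score) for score in column_conservation(toy_alignment)]
print(tiers)
```

Applied to the counts reported above, the tiers are cumulative: 5 invariant positions plus 22 at the 80% level plus 56 at the 60% level give 83 positions conserved in at least 60% of the aligned sequences.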
Group entropy analysis, as well as other methods, was employed to determine the residues unique to each group. Unlike the residue positions conserved in the entire alignment, these group-specific positions are responsible for unique structural interactions or functional differences in each group. Eleven index positions identified by GEnt in multiple groups represent important sites of evolutionary differences. These common index positions include indices 185, 320, 373, 375 and 465 from the amino-terminal domain, index 643 from the linker motif, and indices 650, 652, 654, 721 and 729 from the carboxyl-terminal domain. Five common group-specific index positions line the active site pocket, including indices 185, 320, 373, 375 and 650. The residue at index 650 interacts with the α-phosphate of AMP [3,22], while the residue at index 373 lies where the acid anhydride bond between AMP and the substrate occurs [5,21,22]. Index 320 contributes to the shape of the active site pocket [5]. The residue at index 185 interacts with coenzyme A [10], while the residue at index 375 interacts with the CoA-bound product [5,79]. Index 721 also contacts the butyryl-CoA in human MACS [5]. These positions are likely responsible for differences in catalytic function or substrate preference.
Fig 7. Residues conserved throughout the entire superfamily are highlighted red and the eleven common group-specific positions are highlighted green. Also shown is 4-chlorobenzoyl-CoA in orange, AMP in yellow and Mg2+ in blue. While the AMP is surrounded by more overall conserved residues (red), the 4-chlorobenzoyl-CoA molecule is surrounded by more group-specific conservations (green). https://doi.org/10.1371/journal.pone.0203218.g007
The residue at index 654 forms group-specific hydrogen bonds. Six common group-specific index positions, indices 320, 465, 643, 652, 721 and 729, are involved in hydrophobic interactions within most enzymes. In addition, four of these six positions (465, 643, 652 and 721) also participate in unique hydrogen bonds or salt bridges in specific families. These positions are critical for the unique structural differences in each enzyme group. While most of the residues conserved throughout the entire superfamily are found throughout the structure and specifically near the bound AMP, which is utilized by all members of the superfamily, several of the common group-specific residues lie closer to the substrate and coenzyme A molecules (Fig 7).
Additionally, there are three index positions identified by GEnt in specific groups, not common to multiple groups, that might influence the length of the fatty acid substrate. A glycine is conserved at index 487 in all groups aligned except SACSs. In SACSs a large tryptophan at index 487 necessitates a smaller fatty acid chain to bind [10], while in MACSs and LACSs a glycine at index 487 allows for longer chain fatty acids to bind [5]. Second, index 390 is a 78% conserved glycine within the entire alignment. However, in FAALs the residue at index 390 is a leucine that is 7.5Å from the Cω of the bound dodecanoyl-AMP molecule, possibly restricting the length of the fatty acid in this group. Lastly, a tryptophan at index 381 is 3.7Å from the Cω of the bound dodecanoyl-AMP substrate in FadD10s [81]. The amino acid composition at index 381, however, is variable in the different groups aligned. The group-specific conservations identified here, as well as the positions conserved in the entire superfamily, could serve as interesting targets for site-directed mutagenesis by other researchers.