Determining the Metabolic Footprints of Hydrocarbon Degradation Using Multivariate Analysis

The functional dynamics of microbial communities are largely responsible for the clean-up of hydrocarbons in the environment. However, knowledge of the distinguishing functional genes, known as the metabolic footprint, present in hydrocarbon-impacted sites is still scarcely understood. Here, we conducted several multivariate analyses to characterise the metabolic footprints present in a variety of hydrocarbon-impacted and non-impacted sediments. Non-metric multi-dimensional scaling (NMDS) and canonical analysis of principal coordinates (CAP) showed a clear distinction between the two groups. A high relative abundance of genes associated with cofactors, virulence, phages and fatty acids were present in the non-impacted sediments, accounting for 45.7 % of the overall dissimilarity. In the hydrocarbon-impacted sites, a high relative abundance of genes associated with iron acquisition and metabolism, dormancy and sporulation, motility, metabolism of aromatic compounds and cell signalling were observed, accounting for 22.3 % of the overall dissimilarity. These results suggest a major shift in functionality has occurred with pathways essential to the degradation of hydrocarbons becoming overrepresented at the expense of other, less essential metabolisms.


Introduction
Ecosystem functioning is highly dependent on microbial communities [1][2][3]. These communities are largely defined by a set of metabolic pathways, and are generally thought to be habitat specific [4], providing a link between the biology of a given community and the surrounding environment [5]. Environmental change can lead to a major shift in the structure and function of the inhabiting microbial consortia [6][7][8]. Physiological adaptations of microbes have been shown to be highly specific, allowing for the discrimination between chemical stressors [9]. The identification of defining metabolic pathways of a given ecosystem, known as metabolic footprints, allows for a greater understanding on how the microbial consortia are adapting and responding to environmental change [10,11].
Microorganisms are highly responsive to environmental stress, due to a variety of evolutionary adaptations and physiological mechanisms [12]. The innate ability of microbes to respond and adapt to the world around them means they are often used as biological indicators [13], and subsequently for bioremediation [14]. Many studies have investigated the use of specific microbial taxa as biological indicators [15][16][17][18]; however, the key distinguishing metabolisms associated with hydrocarbon contamination are less well characterized than the taxa. Previous reports have shown that metagenomes are highly predictive of metabolic potential within an ecosystem [3]. Furthermore, previous studies have shown that microbial communities often respond at a metabolic level before any disturbance is seen at the taxonomic level [17]. Therefore, to gain comprehensive insight into an ecosystem's functional response to environmental change, the underlying metabolic footprints should be elucidated.
Metabolic footprints is a term used to describe an ensemble of biological pathways that typically occur with a combination of environmental variables [10,19]. Due to the great diversity of metabolic pathways present within microbial communities, the determination of a metabolic footprint requires the use of multivariate analysis. A recent study by Gianoulis et al. [10] used multivariate canonical correlation analysis to describe the metabolic footprints associated with different marine environments. These metabolic footprints were thought to arise from differences in evolutionary strategies required to cope with unique environmental variables [10]. Similarly, Dinsdale et al. [4] used canonical correlations to discriminate between 9 discrete ecosystems.
Typically metabolic footprint studies employ constrained ordination tools, such as canonical discriminant analysis (CDA) and principal component analysis (PCA) [4,20,21], to explore the metabolic footprints of an environment. However, these methods are restricted in that PCA cannot be performed on datasets containing more variables (e.g. taxa/metabolic processes) than observations (samples), and CDA should be performed on a dataset where there are at least three times as many observations than variables [22]. This limitation results in the need to reduce the number of variables prior to analysis [20]. Microbial communities, however, comprise intricate networks whereby a large number of individuals and metabolic processes are important in the overall ecosystems functioning [23]. Thus, the community as a whole should be considered when categorising a given environment.
Canonical analysis of principal coordinates (CAP) is thus a constrained multivariate analysis, that uses both ordination and discrimination function techniques, but, unlike CDA and PCA it better allows for the characterisation of whole communities as it is not limited by observation size due to its testing via permutation [24]. Furthermore, canonical analysis of principal coordinates (CAP) is highly constrained to the hypothesis, allowing for discriminations to be made in strongly correlated variables, such as functional processes [25]. PERMANOVA on the other hand is affected by other variables that may be present within a given dataset, making it less able to detect differences in less abundant functional subsystems [46]. CAP analysis has been used in several studies to determine how microbial communities respond to various environmental conditions [25][26][27][28]; however, to date, it has not been employed to generate and explore metabolic footprints for impacted environments. Thus, we sought to construct a metabolic profile of microbial communities responding to various forms of environmental impacts, in order to generate metabolic footprints using CAP.
The long-lasting toxicity of xenobiotics makes their metabolism by microbial communities widely studied [29]. Petroleum hydrocarbons are a common target for bioremediation because they are widespread and persistent [7,[30][31][32][33]. While the taxa and environmental conditions for optimal degradation of hydrocarbons are well established [34][35][36][37], the effectiveness of a natural community to bioremediate is less well understood [38].
Advances in metagenomic technologies have allowed for the direct sequencing of environmental microbial communities [39], greatly increasing our potential to understand the metabolic processes being undertaken by the indigenous microbial communities. A recent study by Yergeau et al. [40] used metagenomic sequencing technologies to characterise the structure and function of an active soil microbial community in a hydrocarbon contaminated Arctic region. However, this study primarily focused on the taxa present, and not the defining metabolic activities associated with hydrocarbon contamination. Thus, knowledge about the distinguishing functional genes present in hydrocarbon contaminated environments is still lacking.
The aim of the present study was to compare hydrocarbonimpacted sites to non-impacted sites, and provide insight into the key metabolic functions present following hydrocarbon impact, thus elucidating the metabolic footprints for hydrocarbon contamination. The robustness of these metabolic footprints were assessed with the inclusion of metagenomes from a variety of geographical locations and substrate types, experiencing different contamination events.

Data Collection
To determine the functionality of microbial communities inhabiting hydrocarbon-impacted and non-impacted environments, publicly available datasets were chosen from the MetaGenomics Rapid Annotation using Subsystem Technology (MG-RAST) pipeline version 3.0 [41]. Due to constraints in the database, a total of 4 datasets were used to represent hydrocarbon-impacted environments, while 5 datasets were used for non-impacted environments (Table S1 in File S1). BLASTX was performed on all datasets, with a minimum alignments length of 50 bp and an E-value cut-off of E<1×10 -5 [4], to identify hits to the subsystems database.

Data Analysis
To statistically investigate the differences between metagenomes from hydrocarbon-impacted sites to metagenomes from non-impacted sites, heatmaps were generated containing the relative proportion of hits to the subsystem database in MG-RAST. Heatmaps had been standardized and scaled to account for differences in sequencing effort and read lengths. Statistical analysis was conducted on square-root transformed data to reduce the impact of dominant metabolisms using the software package PRIMER 6 for Windows (Version 6.1.13, Primer-E, Plymouth) [42]. To generate a robust set of metabolic footprints, the generalized cellular functions, termed level 1, and the subsystem, termed level 2 hierarchical classifications were used to determine the overall differences in metabolic potential [4,10].
To determine whether there was any loss of information between the levels of resolution for metabolism, the program RELATE in the PRIMER package was used to calculate the Spearman rank correlation between hierarchical levels 1 and 2 [43]. Differences in the overall metabolic potential between hydrocarbon-impacted and non-impacted sediments were analysed using the PERMANOVA+ version 1.0.3 3 add-on to PRIMER [44,45]. Non-metric Multi-Dimensional scaling (NMDS) of Bray-Curtis similarities was performed as an unconstrained ordination method to graphically visualise multivariate patterns in the metabolic processes associated with hydrocarbon-impacted or non-impacted sediment metagenomes. Metagenomes were further analysed using canonical analysis of principal coordinates (CAP) on the sum of squared canonical correlations as a constrained ordination and discrimination method, to determine whether there was any significant difference between metabolic processes according to hydrocarbon impact. The a priori hypothesis that the metabolisms between the two groups were different was tested in CAP [45] by obtaining a P value using 9999 permutations. Based on RELATE results, CAP ordinations were generated using hierarchy level 1 for metabolism. Where significant differences were found using CAP, the percent contribution of each metabolism to the separation between the hydrocarbon-impacted and non-impacted samples were assessed using similarity percentage (SIMPER) analysis [43]. The resulting top 90 percent of all metabolisms were used to determine the shifts in metabolic potential between the groups. To determine those metabolisms that were consistently contributing most to the overall dissimilarity between the hydrocarbon-impacted and non-impacted groups, the ratio of the average dissimilarity to standard deviation was used. A dissimilarity/standard deviation (Diss/SD) ratio of greater than 1.4 was used to indicate key discriminating metabolisms [46].
To assess the robustness of the metabolic footprints generated using this method, three common forms of environmental impact (agricultural, hydrocarbon and wastewater) from a diverse range of geographical locations and substrate types were compared (Table S1 in File S1). Firstly, heatmaps were generated as above and the squareroot transformed data was analysed using Primer 6 for windows. The CAP on the sum of squared canonical correlations [44] was performed to graphically illustrate the multivariate patterns of metabolism associated with these impacted environments. Significant trends in metabolic processes at each site were determined using the sum of squared canonical correlations. The a priori hypothesis that the metabolisms among the four groups were different was tested using 9999 permutations. Where statistically significant differences were shown using CAP analysis, similarity percentage (SIMPER) analysis [43] was conducted as above to determine the main metabolisms driving the dissimilarity between contamination types.

Results
RELATE analysis revealed a Spearman rank coefficient of 0.9 for the comparison between hierarchical levels 1 and 2, indicating similar results were seen irrespective of hierarchical level. Thus, to create a generalised, set of metabolic footprints, all further analyses were conducted on hierarchical level 1.
NMDS analysis revealed a clear separation of data between the hydrocarbon-impacted and non-impacted sediment metagenomes ( Figure 1). CAP analysis confirmed this separation showing significant differences between the two groups (P = 0.008). A strong association between the multivariate data and the hypothesis of metabolic difference was indicated by the large size of their canonical correlations (δ 2 = 0.83). The first canonical axis (m = 1) separated the sample types with no overlap (Figure 2). Cross validation of the CAP model showed all samples were correctly classified to either hydrocarbon-impacted or non-impacted sediments, hence with a zero mis-classification rate.
SIMPER analysis revealed that the main metabolic processes contributing to the dissimilarity in the non-impacted sediments, when compared to the hydrocarbon-impacted sediments, were genes associated with cofactors, virulence, phages and fatty acids, together accounting for 45.7 % of the overall dissimilarity. Genes associated with protein metabolism, carbohydrates, amino acids, clustering-based subsystems, potassium metabolism, respiration, RNA metabolism, nucleosides and cell wall were also higher in the non-impacted site compared to the impacted sites, collectively contributing to 9.9% of the overall dissimilarity (Table 1 & S2 in File S1).
Conversely, the main metabolic processes associated with the hydrocarbon-impacted sediments were iron acquisition and metabolism, dormancy and sporulation, motility, metabolism of aromatic compounds and cell signalling accounting for 22.3 % of the overall dissimilarity between the two groups (Table 1). Genes associated with nitrogen, phosphorus and sulfur . This NMDS ordination is derived from a Bray-Curtis similarity matrix calculated from the square-root transformed abundance of DNA fragments matching the subsystems database, level hierarchical system 1 (BLASTX E-value < E<1×10 -5 ). The light green polygons depict significantly different groupings (P < 0.05) as calculated by similarity profile (SIMPROF) analysis in PRIMER v6. See Table S1 in File S1 for the provenance of samples included in this analysis. doi: 10.1371/journal.pone.0081910.g001 metabolism were also higher in the hydrocarbon impacted site, collectively accounting for 2.5 % of the dissimilarity to the nonimpacted sites. Regardless of percent contribution, however, all metabolic processes, with the exception of secondary metabolism and photosynthesis, are likely good discriminators for hydrocarbon-impacted or non-impacted sediments, as indicated by a dissimilarity/standard deviation ratio (Diss/SD) of greater than 1.4 [46] (Tables 1 & S2 in File S1).
To determine if the metabolic footprints could be distinguished between contaminant types, multiple contamination types from diverse substrate types were compared (Table S1 in File S1). CAP ordination revealed a clear separation of data among the different impacted environments based on metabolic potential ( Figure 3); (P = 0.0005) ( Table 2). A strong association was seen between the multivariate data and the hypothesis of metabolic differences, indicated by the large size of their canonical correlations (hierarchial level 1: δ 2 = 0.88). Cross validation of the CAP model showed 79% of samples overall were correctly classified to their impacted environments. More specifically, 75% and 100% of hydrocarbon and agricultural-impacted samples were correctly allocated, while only 50% and 0% of wastewater and pristine samples, respectively, were correctly classified ( Table  2).
Based on CAP ordinations as well as mis-classification rates, SIMPER analysis was used to determine distinguishing metabolic processes for the hydrocarbon and agriculturalimpacted sites only. SIMPER analysis revealed the main Figure 2. Comparison of hydrocarbon-impacted samples (green) and non-hydrocarbon-impacted samples (blue). CAP analysis (using m = 1 principal coordinate axes) is derived from the sum of squared correlations of DNA fragments matching the subsystems database, level hierarchical system 1 (BLASTX E-value < E<1×10 -5 ). Significance P = 0.008 and the first axis explained δ 2 = 0.83 of the total variation. See Table S1 in File S1 for the provenance of samples included in this analysis. doi: 10.1371/journal.pone.0081910.g002 metabolic processes contributing to the dissimilarity in the agriculturally-impacted environments when compared to the hydrocarbon-impacted environments were genes associated with cofactors, virulence, phages and fatty acids, collectively accounting for 48.4% of the overall dissimilarity between these two types. Genes associated with protein metabolism, carbohydrates, amino acids and clustering-based subsystems were also higher in the agricultural-impacted sites when compared to hydrocarbon-impacted sites, collectively contributing to another 9.06% of the overall dissimilarity (Tables 3 & S3 in File S1).
Alternatively, the main metabolic processes associated with hydrocarbon impact were genes related to iron acquisition and metabolism, dormancy, aromatic compound degradation, and motility, collectively contributing to 17.1% of the overall dissimilarity (Table 3 & S3 in File S1). Genes associated with nitrogen metabolism and regulation were also higher in the hydrocarbon-impacted sites when compared to agricultural impacted sites, collectively accounting for 4.9% (Table 3 & S3 in File S1). Furthermore, all metabolic processes, with the exception of photosynthesis, secondary metabolism and potassium metabolism were consistently distinguishable between agricultural and hydrocarbon-impacted environments, as indicated by a Diss/SD of greater than 1.4 [46].

Discussion
Microbial communities are known to respond to hydrocarbon contamination at the functional level, whereby a shift in metabolic potential can be observed [14,47,48]. Thus, a major goal in the study of bioremediation is to identify the key metabolic processes being undertaken by the inhabiting microbial communities [38,49]. Here, we report the first metagenomic study to identify the overall metabolic footprints associated with discriminating hydrocarbon-impacted versus non-impacted sediment samples.

The metabolic footprints of hydrocarbon degradation
RELATE analysis showed a significant correlation (Rho: 0.773; P < 0.002) between hierarchial level 1 and 2, indicating there is no significant loss of information between the different levels of resolution. This result is consistent with previous studies that have shown changes to environmental conditions caused by anthropogenic disturbances have led to major shifts Table 1. Contribution of metabolic hierarchical system level 1 to the dissimilarity of the hydrocarbon-impacted and nonhydrocarbon-impacted metagenomes.

Metabolic Processes
Hydrocarbon-Impacted Non-Impacted Diss/SD Cum % in microbial community functionality that become evident across multiple levels of resolution [6,8,50]. Unconstrained (NMDS) and constrained (CAP) multivariate analyses, both showed clear separation of data (P-value = 0.008) between the hydrocarbon-impacted and non-impacted sediments (Figures 1 & 2). The similarities between constrained and unconstrained ordinations likely reflect the hydrocarbon impact. This is supported by the CAP analysis,  Table S1 in File S1). CAP analysis (using m = 2 principal coordinate axes) is derived from the sum of squared correlations of DNA fragments matching the subsystems database, level hierarchical system 1 (BLASTX E-value < E<1×10 -5 ). Significance P = 0.0005 and the first axis explained δ 2 = 0.88 of the total variation. doi: 10.1371/journal.pone.0081910.g003 Table 2. Results of CAP analysis (using m = 2 principal coordinate axes, explaining 88 % of total variation) testing the hypothesis that contaminant types differ for Level 1 metabolisms associated with impacted metagenomes. which shows that the majority of the variance is expressed on just the first canonical axis, with a squared canonical correlation (δ 2 ) of 0.83. A recent hydrocarbon-based study used high throughput functional gene array technology to show that all microbial samples with hydrocarbon contamination grouped together indicative of similar functional patterns [31]. Furthermore, it has been shown that differences in metabolic processes could be used to predict the biogeochemical status of the environment [4]. Thus, the clear separation between data points in the NMDS and CAP plots indicates the hydrocarbonimpacted sediment samples can be readily distinguished even at this coarse level of metabolic resolution, despite differences in geographical location. Furthermore, the same separation seen in unconstrained and constrained ordination methods demonstrates that the data points are not simply conforming to the more hypothesis-driven CAP analysis. The majority of the separation between the hydrocarbonimpacted and non-impacted groups was explained by a higher relative abundance of genes associated with cofactors, virulence, phages and fatty acids, collectively accounting for nearly half of the dissimilarity in the non-impacted sediment samples when compared to the impacted sites (Table 1). Those microbes capable of surviving following hydrocarbon impact become dominant, eventually leading to a major shift in the structure of the community [32,51]. This shift in structure is generally coupled with a shift in functionality, whereby previous studies have shown a significant decrease in the overall microbial functional diversity [6,31,52]. Thus, the high degree of dissimilarity driven by the non-impacted sediments, suggests the major factor causing the differences between the two groups can be explained by a shift in functionality, which has led to the reduction in non-essential metabolisms following hydrocarbon impact.
The reduction in non-essential metabolic pathways was coupled with a subsequent increase in pathways associated iron acquisition and metabolism, dormancy and sporulation, motility, metabolism of aromatic compounds and cell signalling (Table 1). These pathways have all previously been linked to stressed environments [6,[53][54][55], suggesting the microbial communities inhabiting the hydrocarbon-impacted environments are expending more energy on pathways essential to the utilization of carbon and survival.
The degradation of hydrocarbons is often hindered by the requirement to come into direct contact with hydrocarbon substrates [56]. Therefore, many microorganisms capable of catabolising hydrocarbons have shown chemotaxis abilities allowing them to move towards, and subsequently degrade the contaminant at a higher rate [57][58][59]. This degradation ability is then often further enhanced by the secretion of biosurfactants, which increase the availability of hydrocarbons in the soil [60]. Thus, the increase in motility and chemotaxis genes suggest that the microbial communities are increasing metabolic pathways that will allow for direct contact with hydrocarbon compounds (Table 1).
Following direct contact, the microbial communities must have genes that allow for the catabolism of hydrocarbons. Petroleum hydrocarbons are comprised of a complex mixture of compounds including cycloalkanes, alkanes, polycyclic aromatic hydrocarbons, aromatics and phenolics [61]. Previous studies have shown an increase in genes associated with the breakdown of these compounds in hydrocarbon-contaminated environments [31,62]. Thus, a higher relative abundance of metabolism of aromatic compound genes in the hydrocarbonimpacted sediments when compared to the non-impacted sediments is consistent with a community optimising its ability to utilise hydrocarbon as an energy source (Table 1). Following hydrocarbon contamination, microbial communities must adapt to survive the sudden increase in carbon availability and subsequent loss of limiting nutrients such as nitrogen and phosphorus and in some cases iron [14,55,63]. As a result, an increase in genes associated with nitrogen, phosphorus and iron metabolism have been shown, allowing for effective scavenging mechanisms (Smith et al., unpublished data). Our results indicate there may have been an increased need for nitrogen, phosphorus and iron metabolites in the hydrocarbonimpacted sediments when compared to non-impacted sediments. Furthermore, genes associated with cofactors, amino acid pathways, carbohydrates and protein metabolisms were all reduced in the hydrocarbon-impacted sites (Tables 1 &  S2 in File S1). Taken together, these results suggest that the microbial communities are expending most of their energy scavenging key nutrients needed for bioremediation of hydrocarbons, leading to the subsequent decrease in pathways associated with more complex carbohydrate and protein metabolisms and growth.

Contaminant types
When the hydrocarbon-impacted environments were compared to metagenomes experiencing different contaminant types from a wide range of habitats, CAP analysis showed a significant difference (P-value = 0.0005; Table 2) between the relative abundances of metabolisms across these impacted environments ( Figure 3). In particular, hydrocarbon and agricultural-impacted environments were found to have the highest allocation success, 75% and 100% respectively, when compared to wastewater and pristine sites, 50% and 0%, respectively ( Table 2). The higher allocation success for hydrocarbon and agricultural impacted sites was likely driven by the larger sample size for these environments. Furthermore, as the metagenomic samples included were from a variety of substrate types and geographical locations (Table S1 in File S1), our results indicate that the metabolic footprints created due to a contamination event, were more significant when compared to differences created based on geographical location and physico-chemical conditions (Table S4 in File S1).
SIMPER analysis revealed the main distinguishing metabolic processes associated with agricultural impacted environments were genes associated with cofactors, virulence, phages, fatty acids, protein metabolism, carbohydrates, amino acids and clustering based subsystems (Tables 3 & S3 in File S1), collectively accounting for 57.4% of the overall dissimilarity from the hydrocarbon-impacted environments. Agricultural practices are known to increase the deposition of nutrients into the surrounding environment [64,65]. Previous studies have shown that an increase of nutrients via agricultural impact can lead to an increase in microbial productivity [8]. As previously discussed, hydrocarbon impact has been shown to lead to a reduction in genotypic diversity, whereby only the essential metabolisms remain [6,31]. This is thought to be due to the toxic effect of hydrocarbon pollution which in turn can lead to a community exerting more energy for survival than on growth and productivity [66]. Thus, an increase in genes associated with protein metabolism in the agricultural impacted environments (Table 3) is consistent with a more active community when compared to the hydrocarbon-impacted environments [67].
SIMPER analysis also revealed the main distinguishing metabolic processes associated with hydrocarbon-impacted environments was a higher relative abundance of genes associated with iron acquisition and metabolism, dormancy, aromatic compound degradation, motility, nitrogen metabolism and regulation, collectively contributing to 22.1% of the overall dissimilarity (Table 3). These results are consistent with SIMPER results when comparing hydrocarbon-impacted and non-impacted environments, indicating the metabolic footprints for contaminant types are consistent even at this coarse level of metabolic resolution. Furthermore, hydrocarbon-impacted and agricultural-impacted metabolic footprints were distinguishable irrespective of differences in substrate type, physico-chemical conditions and geographical location. Thus, CAP analysis suggests these impacted environments have acquired microbial communities with differing metabolic functions, which have allowed for our ability to distinguish between contaminant types.
Although some pathways contributed to the dissimilarity between the two groups more than others, all metabolisms with the exception of photosynthesis and potassium metabolism (at the 90% cut off percentage) were identified as being consistent distinguishing metabolisms (Tables 1, 3, S2 & S3 in File S1). This suggests all are metabolic footprints of their given environment, indicating the overall metabolic signature is different between groups. In nature, microbial communities are typically composed of mixed communities characterised by an intricate network of metabolic processes [23]. Consequently, our results indicate a complete overview of the metabolites present within the inhabiting microbial consortia is needed to effectively characterise an environment.

Conclusion
Our approach indicates the hydrocarbon-impacted sediment samples can be distinguished from non-impacted sediments based on their metabolic signatures despite differences in geographical location. These signatures include metabolisms associated with iron acquisition and metabolism, dormancy and sporulation, motility, metabolism of aromatic compounds, cell signalling and nitrogen, phosphorus and sulfur metabolism. Our analysis also indicated that the majority of the dissimilarity was, however, due to a reduction of functional genes associated with cofactors, virulence, phages and fatty acids. Further to this, our approach illustrates the ability to distinguish between contaminant types from a wide range of habitats, with a clear separation in data points associated with either hydrocarbon or agricultural contamination. Here we provide the first metagenomic study to elucidate the metabolic footprints associated with hydrocarbon impact. Furthermore, the differentiation between hydrocarbon contaminants, for example long chain hydrocarbons compared to aromatics, is needed to fully determine the effects of hydrocarbon impacts on the environment.