Metabolite profiling of the carnivorous pitcher plants Darlingtonia and Sarracenia

Sarraceniaceae is a New World carnivorous plant family comprising three genera: Darlingtonia, Heliamphora, and Sarracenia. The plants occur in nutrient-poor environments and have developed insectivorous capability in order to supplement their nutrient uptake. Sarracenia flava contains the alkaloid coniine, which is otherwise known only from Conium maculatum, in which its biosynthesis has been studied, and from several Aloe species. Its ecological role and biosynthetic origin in S. flava are speculative. The aim of the current research was to investigate the occurrence of coniine in Sarracenia and Darlingtonia and to identify common constituents of both genera, unique compounds for individual variants and floral scent chemicals. In this comprehensive metabolic profiling study, we looked for compound patterns that are associated with the taxonomy of Sarracenia species. In total, 57 different Sarracenia and D. californica accessions were used for metabolite content screening by gas chromatography-mass spectrometry. The resulting high-dimensional data were studied using a data mining approach. The two genera are characterized by a large number of metabolites and huge chemical diversity between different species. By applying feature selection for clustering and by integrating new biochemical data with existing phylogenetic data, we were able to demonstrate that the chemical composition of the species can be explained by their known classification. Although transcriptome analysis did not reveal a candidate gene for coniine biosynthesis, the use of a sensitive selected ion monitoring method enabled the detection of coniine in eight Sarracenia species, showing that it is more widespread in this genus than previously believed.


Introduction
Sarraceniaceae is a New World carnivorous plant family comprising three genera: Darlingtonia Torr. (monotypic), Heliamphora Benth. (ca. 23 species [1]) and Sarracenia L. (ca. 11 species [2]). The distribution of Darlingtonia is limited to a few locations along the western coast of North America, Heliamphora occurs mainly on tepuis of the Guiana Highlands in South America and Sarracenia is the most widespread genus in the family, found in the eastern coastal plains of North America. Darlingtonia californica, Sarracenia, and Heliamphora are
[32] were analyzed for PKSs using Geneious (version 9.0.4) [33]. The tblastn algorithm in Geneious was used to search the sequence database with the Medicago sativa CHS2 amino acid sequence [34] as the template and a stringency setting of 1e-10. The obtained nucleotide sequence hits were translated to amino acid sequences, and the correct reading frames were chosen and aligned using the Geneious alignment option.
The computer-generated identifications were sorted manually, with a cut-off at 70% identification [35], into an Excel spreadsheet (Microsoft, Redmond, WA, USA) according to their chemical structure, elution time and origin. When peaks with the same retention time were identified as different hydrocarbons in multiple samples, they were treated as n-alkanes at the specific retention time. The relative peak abundances were used in the data input.

Data mining
The metabolite data were treated in two formats: (1) a qualitative format representing presence (i.e. concentration level above the detection limit) or absence (concentration level below the detection limit) of a compound in a sample, by coding the presence and absence as 1 and 0, respectively, and (2) a quantitative or continuous format in which the concentration level is given as the percentage of the total peak area. The main aim of our data mining was to visualize any patterns present in the data. Towards this goal, it was first noted that the current data are very high dimensional (i.e. contain a large number of compounds), very sparse (91.35% zeros in the lids dataset and 91.86% in the pitchers dataset), and that the distinct species show huge chemical diversity (i.e. the metabolite composition of different plants is largely distinct). Therefore, it is reasonable to expect that only a small proportion of compounds are likely to be useful for clustering the samples. A feature selection approach for clustering [36] was applied in order to identify the most important features required for deriving hierarchical clusters. This approach computes and reweights the overall dissimilarity matrix while applying a lasso-type penalty, which results in a dissimilarity matrix sparse in features [36]. This sparse clustering was applied using the R package sparcl. In order to compute the hierarchical clustering with the qualitative format of the data, the Hamming distance was used as the dissimilarity measure. For the quantitative format of the data, the Euclidean distance was used. The complete linkage method was used for the clustering. In order to compare the phylogenetic structure with the chemical profiles, the MP-EST accession tree from [2] was downloaded.
Then the accessions in the two studies were mapped based on the location of sample collection, which resulted in a many-to-many mapping (Table 1) with one or more of 42 nodes in the phylogenetic tree matching one or more of 48 species in our study. From this, 36 possible bijective maps were enumerated, and compound-based distances corresponding to each bijective map were calculated as follows. The distance between every pair of accessions was calculated using the Hamming distance for the binary data and the Euclidean distance for the continuous data of the selected metabolite features. These distances are referred to below as species-level distances (SLD). Using the clades resolved in the MP-EST accession tree (i.e. D. californica, S. flava, S. psittacina, S. minor, S. purpurea complex, S. rubra complex, S. alata, S. leucophylla, and S. oreophila), distances within and between the clades were calculated. A within-clade distance (WCD) was calculated as the average of all pairwise SLDs of accessions within the clade. A between-clade distance (BCD) was calculated as the average of all SLDs of accession pairs across the pair of clades. Average species-level and clade-level distance matrices were calculated over all 36 bijective maps to derive the average within-clade (aWCD) and between-clade distances (aBCD), as well as the average species-level distances (aSLD). These averaged distances were used to assess how well the metabolite data support the phylogenetic structure. If the phylogenetic structure explains the compound data, the aWCDs are expected to be lower than the aBCDs. This was assessed by comparing aWCDs against not only aBCDs but also aSLDs as an additional test.
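The distance definitions above can be illustrated with a minimal Python sketch. The accession names, clade names and toy profiles are hypothetical, and returning None for a clade with a single accession is an assumption, since the text does not say how such clades were handled. The original analysis was carried out in R.

```python
from itertools import combinations, product
from math import sqrt
from statistics import mean


def hamming(a, b):
    """Number of positions at which two presence/absence vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def euclidean(a, b):
    """Euclidean distance between two continuous (percent peak area) vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def within_clade_distance(profiles, members, dist):
    """WCD: average of all pairwise species-level distances inside one clade."""
    pairs = list(combinations(members, 2))
    if not pairs:
        return None  # assumption: undefined for a single-accession clade
    return mean(dist(profiles[a], profiles[b]) for a, b in pairs)


def between_clade_distance(profiles, members_1, members_2, dist):
    """BCD: average distance over all accession pairs across two clades."""
    return mean(dist(profiles[a], profiles[b])
                for a, b in product(members_1, members_2))


# Hypothetical toy data: four accessions, four compounds (1 = present).
profiles = {
    "acc1": [1, 0, 1, 0],
    "acc2": [1, 0, 1, 1],
    "acc3": [0, 1, 0, 0],
    "acc4": [0, 1, 0, 1],
}
clades = {"cladeA": ["acc1", "acc2"], "cladeB": ["acc3", "acc4"]}

wcd_a = within_clade_distance(profiles, clades["cladeA"], hamming)
wcd_b = within_clade_distance(profiles, clades["cladeB"], hamming)
bcd_ab = between_clade_distance(profiles, clades["cladeA"], clades["cladeB"], hamming)
print(wcd_a, wcd_b, bcd_ab)
```

In this toy case both within-clade distances are smaller than the between-clade distance, which is the pattern expected when the clade structure explains the compound data. Passing euclidean instead of hamming gives the corresponding calculation for the continuous format.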
More precisely, we (I) visualized aWCDs against the background distance distribution formed by aSLDs (Fig 1B and
In order to visualize the metabolite features selected for clustering alongside the phylogenetic structure presented in [2], the best mapping of samples between the MP-EST accession tree and our compound data was obtained. The best bijective map is expected to result in the maximum BCD and minimum WCD among all possible bijective maps. To achieve this objective, we chose the map that yields the maximum difference between the mean values of BCD and WCD, i.e. mean(BCD) - mean(WCD), for these visualizations (Fig 1A and Fig 2A, S1A and S2A Figs). Thus, the heat maps shown in Fig 1A and Fig 2A, S1A and S2A Figs contain only one sample from our compound dataset for each node in the MP-EST accession tree, chosen to maximize mean(BCD) - mean(WCD). Since only 42 nodes in the accession tree map to our dataset, each heat map omits 6 samples from our study. In particular, the samples numbered 31 and 46 (Table 1)
All the statistical analyses and visualizations were performed using the R statistical software [37] and its packages such as gplots, sparcl, metadar (http://code.google.com/p/metadar), ihm (http://code.google.com/p/ihm), and RColorBrewer.
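The map-selection rule described in this section, choosing the one-to-one map that maximizes mean(BCD) - mean(WCD), can be sketched in Python as follows. The node, clade and sample names and the toy profiles are hypothetical, and enumerating candidate maps by filtering a Cartesian product is only one possible way to obtain the bijective maps. Clades with a single mapped sample are skipped when averaging WCDs, which is an assumption.

```python
from itertools import combinations, product
from statistics import mean


def hamming(a, b):
    """Number of positions at which two presence/absence vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def enumerate_maps(candidates):
    """Yield every one-to-one assignment of tree nodes to candidate samples."""
    nodes = list(candidates)
    for choice in product(*(candidates[n] for n in nodes)):
        if len(set(choice)) == len(choice):  # no sample used twice
            yield dict(zip(nodes, choice))


def map_score(mapping, node_clade, profiles):
    """mean(BCD) - mean(WCD) for one node-to-sample map."""
    by_clade = {}
    for node, sample in mapping.items():
        by_clade.setdefault(node_clade[node], []).append(profiles[sample])
    # Assumption: clades with one mapped sample contribute no WCD.
    wcds = [mean(hamming(a, b) for a, b in combinations(members, 2))
            for members in by_clade.values() if len(members) > 1]
    bcds = [mean(hamming(a, b) for a, b in product(m1, m2))
            for m1, m2 in combinations(by_clade.values(), 2)]
    return mean(bcds) - mean(wcds)


# Hypothetical toy input: node n2 could correspond to sample s2 or s3.
candidates = {"n1": ["s1"], "n2": ["s2", "s3"], "n3": ["s4"]}
node_clade = {"n1": "cladeA", "n2": "cladeA", "n3": "cladeB"}
profiles = {
    "s1": [1, 1, 0, 0],
    "s2": [1, 1, 0, 1],
    "s3": [0, 1, 1, 0],
    "s4": [0, 0, 1, 0],
}

best_map = max(enumerate_maps(candidates),
               key=lambda m: map_score(m, node_clade, profiles))
best_score = map_score(best_map, node_clade, profiles)
print(best_map, best_score)
```

Samples that are not selected by the best map (here s3) are the ones left out of the corresponding heat map.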

Coniine identification and occurrence in Sarracenia
With the GC-MS method used, coniine elutes at a constant retention time (6.33 ± 0.01 min) even in spiked barley material and C. maculatum leaf extract. The samples were analysed on the basis of their SCAN mass spectra and were compared to a database. Pure coniine matched the database with 86% identity, and coniine in plant matrix with 78%-86% identity. The retention time of coniine was very stable, and the ions 80, 84, and 126 exhibited the same relative abundances in the sample matrix and in the coniine reference substance (Fig 3). Therefore, a match lower than 90% can be considered acceptable. Using the SCAN mode, coniine was detected in S. alata, S. flava, S. leucophylla, S. oreophila, S. psittacina and S. purpurea (incl. S. rosea) (Table 1). In D. californica, only the fragment m/z 84 was detected, whereas in S. jonesii (3) none of the ions were detected at 6.33 min.
In order to detect coniine at low concentrations, we operated the GC-MS in SIM mode. Based on the fragmentation pattern of coniine (m/z 43, 56, 70, 80, 84, 97, 110, and 126), the characteristic ions m/z 56, 70, 80, 84 (base peak) and 126 (mass peak) were selected. The fragments m/z 80, 84, and 126 are specific for coniine, in contrast to the ions m/z 56 and 70, which are shared with many other molecules.
[Figure caption, partially recovered] The phylogenetic tree from [2] is displayed as the column dendrogram. Six samples of our dataset (11, 31, 35, 38, 42, and 46) are omitted from this heat map based on the sample selection procedure described in the Methods section. (B) Comparison of average within-clade distances (aWCDs) against the background distribution of average species-level distances (aSLDs) and average between-clade distances (aBCDs).
The limit of detection for coniine in SIM was 1 μg/ml, which corresponds to 1 μg/g dry weight. Using SIM detection, coniine was identified from S. alata, S. flava, S. leucophylla, S. minor, S. oreophila, S. psittacina, S. purpurea (incl. S. rosea) and S. rubra (incl. S. alabamensis) (Table 2). Of these, S. flava and S. alata samples only contained coniine traces close to the detection limit, whereas other samples accumulated clearly higher levels of coniine. No coniine was detected in the pitchers of S. minor var. okefenokeensis or the lids of S. oreophila.
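As an illustration of how the retention time and the coniine-specific ions can be combined into a screening rule, a minimal Python sketch follows. The decision rule itself (all three specific ions present within the retention time window) is an assumption made for illustration; the text reports the retention time and the ions but does not state a formal acceptance criterion, and the function and variable names are hypothetical.

```python
CONIINE_RT_MIN = 6.33          # retention time reported in the text (min)
RT_TOLERANCE_MIN = 0.01        # reported variation of the retention time (min)
SPECIFIC_IONS = (80, 84, 126)  # m/z values described as specific for coniine


def is_coniine_candidate(retention_time, ion_intensities,
                         rt_tol=RT_TOLERANCE_MIN):
    """Flag a peak when it elutes in the coniine window and shows all
    three specific ions. ion_intensities maps m/z to signal intensity;
    a missing or zero entry is treated as not detected (assumption)."""
    # Small epsilon guards against floating-point noise at the window edge.
    in_window = abs(retention_time - CONIINE_RT_MIN) <= rt_tol + 1e-9
    has_specific_ions = all(ion_intensities.get(mz, 0) > 0
                            for mz in SPECIFIC_IONS)
    return in_window and has_specific_ions


# Hypothetical peaks: a full ion pattern, an m/z 84-only signal,
# and a full pattern at the wrong retention time.
full_pattern = {56: 10, 70: 12, 80: 5, 84: 100, 126: 8}
only_84 = {84: 40}

print(is_coniine_candidate(6.33, full_pattern))  # True
print(is_coniine_candidate(6.33, only_84))       # False
print(is_coniine_candidate(6.50, full_pattern))  # False
```

The second case mirrors the D. californica observation above, where only m/z 84 was seen and the signal was not treated as coniine. A stricter rule could also compare relative ion abundances with the reference substance.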

PKSs in Sarracenia transcriptomes
Sarracenia psittacina and S. purpurea transcriptomes were analysed using the tblastn algorithm with the stringency set to 1e-10 and M. sativa CHS2 as a template, resulting in 8 and 12 sequences, respectively. Correct reading frames were selected and aligned with each other after the nucleotide sequences were translated to amino acid sequences. This resulted in three unique contigs per species. Of these, one represents the N-terminus and two the C-terminus when compared to a full-length PKS enzyme. None of the contigs cover the middle part of the PKS enzyme sequence, but they do contain all the conserved amino acids in the active site in the observed area [38] when compared to other full-length PKSs (S3 Fig).

Metabolite profiles
The metabolite profiles of lids and pitchers were analysed separately. In addition to analysing the metabolite profiles using the quantitative (concentration) data, we also investigated the qualitative (presence or absence) data in which compounds with non-zero concentration levels (i.e. with levels above the detection limits) were treated as present and compounds with levels below the detection limits as absent.
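The two data formats and the sparsity figure quoted for the datasets can be sketched in Python as below. The peak areas are hypothetical toy values, and the sketch assumes that compounds below the detection limit are already recorded as zero, as described in the text.

```python
def to_percent_of_total(peak_areas):
    """Quantitative format: each peak as a percentage of total peak area."""
    total = sum(peak_areas)
    if total == 0:
        return [0.0 for _ in peak_areas]
    return [100.0 * area / total for area in peak_areas]


def to_presence_absence(values):
    """Qualitative format: 1 if the compound was detected, 0 otherwise."""
    return [1 if v > 0 else 0 for v in values]


def percent_zeros(matrix):
    """Sparsity: percentage of zero cells in a samples-by-compounds matrix."""
    cells = [v for row in matrix for v in row]
    return 100.0 * sum(v == 0 for v in cells) / len(cells)


# Hypothetical raw peak areas for two samples and four compounds.
raw = [
    [50, 0, 150, 0],
    [0, 0, 80, 0],
]
quantitative = [to_percent_of_total(row) for row in raw]
qualitative = [to_presence_absence(row) for row in quantitative]

print(quantitative)
print(qualitative)
print(percent_zeros(quantitative))
```

The qualitative matrix feeds the Hamming-distance analysis and the quantitative matrix the Euclidean-distance analysis.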
The manually aligned lid dataset consisted of a total of 560 compounds detected in at least one sample. Among these, there were library matches (≥70%) for 69 alcohols, 70 aldehydes and ketones, 53 esters, 58 ethers, 30
[29]. Sarracenia leucophylla (17) displayed the highest number (n = 4) of floral scent compounds (Table 3). The sample S. purpurea subsp. venosa var. burkii (39) is an exception in that it did not accumulate unique compounds, whereas S. flava var. atropurpurea (35) had the largest number (n = 18) of unique compounds. S1A Table shows the compounds unique to each sample along with their concentration levels. Finally, when we compared the lid samples in pairs, we observed that, on average, every lid sample contained 32 unique compounds (S2A Table).
A sarracenin-like compound was found at an elution time of 18.2 min. Its mass peak was m/z 225, major fragments m/z 180 and 138, and further fragments were m/z 162, 120, 93, 67 and 43.

Selection of metabolites
Overall, both the lid and pitcher datasets are very sparse, with 91.35% zeros in the lid dataset and 91.86% in the pitcher dataset. These datasets are also high dimensional, as described above, with 560 and 589 compounds, respectively, in the lid and pitcher datasets. We performed sparse hierarchical clustering of the data in order to reduce the dimensionality of the datasets and identify the compounds important for clustering. The metabolite features selected using the qualitative and quantitative formats of the data are visualized as heat maps (S6-S9 Figs).
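For orientation, the complete-linkage step that underlies the hierarchical clustering can be sketched in Python as follows. This is plain complete-linkage clustering on Hamming distances and does not reproduce the lasso-type feature weighting of the sparse clustering in the R package sparcl. The sample names, toy profiles and the n_clusters stopping parameter are hypothetical.

```python
def hamming(a, b):
    """Number of positions at which two presence/absence vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def complete_linkage(profiles, n_clusters):
    """Merge clusters bottom-up until n_clusters remain. The distance
    between two clusters is the largest distance between their members."""
    clusters = [[name] for name in profiles]

    def linkage(c1, c2):
        return max(hamming(profiles[a], profiles[b]) for a in c1 for b in c2)

    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest complete-linkage distance.
        i, j = min(
            ((i, j) for i in range(len(clusters))
             for j in range(i + 1, len(clusters))),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical presence/absence profiles for four samples.
profiles = {
    "a": [1, 1, 0, 0],
    "b": [1, 1, 0, 1],
    "c": [0, 0, 1, 1],
    "d": [0, 0, 1, 0],
}
clusters = complete_linkage(profiles, n_clusters=2)
print(clusters)
```

In the sparse variant, each compound would additionally receive a weight, and compounds with zero weight would drop out of the distance calculation.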

Integration of phylogenetic clustering
The MP-EST accession tree presented in [2] was integrated with metabolite profiling data. Firstly, the selected metabolite features were visualized as heat maps with the MP-EST accession tree (Fig 1A and Fig 2A, S1A and S2A Figs). Since the best bijective map between the samples of the two studies was selected for these visualizations, six samples from our compound dataset are omitted from each of the heat maps (Fig 1A and Fig 2A, S1A and S2A Figs). Secondly, the MP-EST accession tree was used to assess whether the metabolite profiles support the clade-level classification of the plant family. This was done by comparing the aWCDs against aBCDs as well as the background distance distribution formed by the aSLDs. The aWCDs were lower than aBCDs (Fig 1 and Fig 2, S1 and S2 Figs), indicating that the compound data were consistent with the clade-level classification. From the qualitative data of lids, all aWCDs were less than the mean and median values of the aBCDs. In comparison to the background distribution, eight out of nine aWCDs were less than the mean of the aSLDs and all the aWCDs were less than the median of the aSLDs (Fig 2B). Finally, the aWCDs were significantly lower than the aBCDs (Wilcoxon test P-value = 1.42e-05; Fig 2C). From the qualitative data of pitchers, all aWCDs were less than the mean and median values of the aBCDs as well as the aSLDs (Fig 2B), and the aWCDs were significantly lower than aBCDs (P-value = 5.109e-06; Fig 2C). The quantitative data weakly supported the clade-level classification (S1 and S2 Figs). From the quantitative data of lids, seven out of nine aWCDs were lower than the mean and median values of the aBCDs and aSLDs (S1B Fig), and the difference between aWCDs and aBCDs was marginally significant (P-value = 0.02; S1C Fig).
From the quantitative data of pitchers, all aWCDs were lower than the mean of aBCDs, eight out of nine aWCDs were lower than the mean of aSLDs, seven aWCDs were less than the median of aBCDs, and six aWCDs were less than the median of aSLDs (S2B Fig). The difference between aWCDs and aBCDs was marginally significant (P-value = 0.004; S2C Fig).
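The comparisons reported in this section, counting how many aWCDs fall below the mean and median of a reference distribution and asking whether aWCDs tend to be lower than aBCDs, can be sketched in Python as follows. The distance values are hypothetical. The pairwise count shown is the quantity underlying an unpaired rank-sum test, which is an assumption about the form of the test since the text says only "Wilcoxon test", and no P-value is computed here.

```python
from statistics import mean, median


def count_below(awcds, reference):
    """How many aWCDs lie below the mean and the median of a reference
    distribution (aBCDs or aSLDs)."""
    ref_mean, ref_median = mean(reference), median(reference)
    return {
        "below_mean": sum(w < ref_mean for w in awcds),
        "below_median": sum(w < ref_median for w in awcds),
    }


def pairs_lower(x, y):
    """Number of (x_i, y_j) pairs with x_i < y_j, ties counted as 0.5.
    A value close to len(x) * len(y) means x is almost always lower."""
    total = 0.0
    for xi in x:
        for yj in y:
            if xi < yj:
                total += 1.0
            elif xi == yj:
                total += 0.5
    return total


# Hypothetical averaged distances.
awcds = [1.0, 1.5, 2.0]
abcds = [2.0, 3.0, 3.5, 4.0]

summary = count_below(awcds, abcds)
u_lower = pairs_lower(awcds, abcds)
print(summary, u_lower, len(awcds) * len(abcds))
```

In the study itself there are nine clades, so nine aWCDs are compared against the aBCDs of all clade pairs.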

Discussion
Coniine in Sarracenia sp.
The presence of coniine has been reported from poison hemlock and twelve Aloe species [22,23]. The only report of coniine in Sarraceniaceae is by Mody et al. [21], who isolated 5 mg of coniine from 45 kg fresh pitchers of S. flava via steam distillation. This is in contrast to the results of Romeo et al. [11], who did not detect any alkaloids or volatile amines in Sarracenia.
We have now confirmed the findings of Mody et al. [21] and also found that coniine occurs, often in low amounts, in at least seven other species, e.g. S. purpurea (Table 2). It remains unknown where exactly coniine is biosynthesized in Sarracenia spp., since the compound was detected both in lids and in the actual pitchers. Biosynthesis of coniine has been studied in poison hemlock. In this case the carbon backbone is derived from the iterative coupling of butyryl-CoA and two malonyl-CoAs by a PKS, CPKS5 [24]. According to our analysis, genes encoding such enzymes are present in the transcriptomes [32] of S. psittacina and S. purpurea. Both species harbour three contigs which represent two to three PKSs. The exact number could not be determined because the N-terminal contig cannot be assigned to either of the C-terminal contigs. The contigs do not represent full-length sequences and therefore it is impossible to clearly assign them as PKSs for coniine biosynthesis in Sarracenia spp. Important mutations might be located outside the observed area, preventing distinction from chalcone synthases involved in anthocyanin synthesis [9,10]. An important question is the function of coniine in Sarracenia. Why should plants living in nutrient-poor environments produce a nitrogenous compound if there are no benefits? Butler and Ellison [39] studied nitrogen acquisition of S. purpurea and reported that the pitchers are in fact very efficient in prey capture and could thus greatly enhance the available nitrogen for the following growth season. Mody et al. [21] postulated that coniine could be an insect-stunning agent. Coniine did indeed paralyze fire ants, but probably the tested concentrations were not physiological [21]. Another function for coniine could be insect attraction, as suggested by Harborne [25] and Roberts [40], who identified coniine as a floral scent compound in poison hemlock.
In conclusion, it appears that an investment in coniine biosynthesis could have a double benefit by enhancing both insect attraction and retention.

Metabolite profiles of Sarracenia and Darlingtonia
There are several previous reports on Sarracenia volatiles [7,8]. For example, Miles et al. [7] reported benzothiazole, benzyl alcohol, heptadecane and tridecane from S. flava, which we also found from Sarracenia spp. Nonanal, a floral scent compound widespread in the plant kingdom [28], was found from Sarracenia spp. lids in our study. The compound is known to attract mosquitos [41], and Miles et al. [7] described it as one of S. flava's volatile organic compounds. The Venus flytrap (Dionaea muscipula), another carnivorous plant, emits this volatile organic compound when it is feeding on fruit flies (Drosophila melanogaster) [35]. Sarracenin (Fig 4A) has previously been reported from S. flava [17], S. alata, S. leucophylla, S. minor and S. rubra [18]. Our study confirmed the presence of this compound in all the aforementioned species, except S. minor, and revealed several new species containing sarracenin, namely, S. psittacina, S. purpurea and D. californica. The compound is volatile and attracts insects to Heliamphora sp. [4]. A possible explanation of why S. minor did not accumulate sarracenin in our study could be that our samples were not feeding on insects at the time of collection, and as a result, they did not synthesize the compound [4].
We also found (Z)-13-docosenamide (erucamide) to be a common compound in Sarracenia spp. and D. californica. It has previously been reported from H. tatei and H. heterodoxa [4], where it is a possible lubricating component of the nectar.
Other common compounds from Sarracenia sp. and D. californica are carboxylic acids (fatty acids) such as tetradecanoic, hexadecanoic and (Z)-9-hexadecenoic acids. All three are floral scent compounds and the latter is known from Hydnora africana [42]. Hexadecanoic acid is emitted by the Venus flytrap as a volatile organic compound after feeding [35].
Sarracenia spp. display a huge variety of unique compounds which are found only in their lid and/or pitcher. Actinidine is a floral scent compound known from Sauromatum guttatum [43] and an insect pheromone in Hymenoptera [44]. Trans-Jasmone acts either as an insect attractant or repellent depending on the insect species. Pulegone (Fig 4B) is a floral scent compound of Tilia sp. [45] and Agastache sp. [46], and functions as an insecticide [47]. 14-β-Pregna is a sex pheromone of the insect Eurygaster maura [48]. Lagumicine was found from S. oreophila lid. Previously it had been found from Alstonia angustifolia var. latifolia [49]. Miles et al. [17] suggested, on the basis of the possible cleavage of sarracenin, that terpene indole alkaloids could be synthesized in Sarracenia spp.
The studied accessions of Sarraceniaceae are characterized by a large number of diverse metabolites, with nearly 600 metabolites identified in lids as well as in pitchers. They are also characterized by a huge chemical diversity, as the metabolite compositions of different plants were largely distinct. Unlike mutation data from highly conserved genomic loci, data that mainly display wide heterogeneity among samples are not suitable for constructing taxonomies. Knudsen et al. [29] concluded that the usability of floral scent compounds in chemotaxonomy is limited because chemical composition usually differs even between closely related species. The composition may also vary among genera of a specific family, as it may vary among species of a given genus. Thus, the chemical composition alone is of little use for phylogenetic estimates above the genus level. As expected, clustering derived from our data alone does not agree with the phylogenetic structure of the accessions (see the column dendrograms in S6-S9 Figs).
The available phylogenetic information, on the other hand, may help us to understand the current data. We sought to explain the metabolite composition of plants with the known phylogenetic information from [2]. We successfully demonstrated that the metabolite data conform with the clade-level classification of the plant family and hence that the phylogeny can explain the metabolite composition of the plants to some extent. Notably, whereas the qualitative data could be largely explained by phylogeny (Fig 1 and Fig 2), the concordance of quantitative data with the clade-level classification was relatively weaker (S1 and S2 Figs). Thus, we speculate that evolution may more directly affect the presence or absence of specific chemicals than the exact amount in which the chemicals are present.
We have limited the focus of the current data mining to cataloging and visualizing the data. Given the dominance of zeroes, the current datasets may benefit from computational methods specially designed for zero-inflated or left-censored data. But such a detailed computational analysis is out of the scope of this biochemical profiling study.

Conclusion
Studied accessions of Sarraceniaceae possessed a diverse variety of compounds. Lids and pitchers were studied separately and approximately 600 compounds were detected in both collections. The accessions also showed huge diversity, with every accession containing unique compounds. Coniine was newly detected in seven Sarracenia species in addition to the known source, S. flava. However, we could not identify a specific candidate gene involved in coniine biosynthesis in Sarracenia spp. Among the common constituents of Sarraceniaceae are sarracenin, erucamide, and nonanal. By integrating existing phylogenetic information of Sarraceniaceae, we successfully demonstrated that the phylogeny can explain the metabolite composition of the plants. Phylogeny explained the presence or absence of compounds more strongly than their concentrations.
[Figure caption, partially recovered] Heat map visualization of selected metabolite features from the quantitative data of lids. The phylogenetic tree from [2] is displayed as the column dendrogram. Six samples of our dataset (14, 31, 35, 37, 44, and 46) are omitted from this heat map, based on the sample selection procedure described in the Methods section. (B) Comparison of average within-clade distances (aWCDs) against the background distribution of average species-level distances (aSLDs) and average between-clade distances (aBCDs). Distribution of aSLDs was calculated using qualitative data of the selected metabolite features and displayed in a density plot.