Automated recommendation of metabolite substructures from mass spectra using frequent pattern mining

Despite the increasing importance of metabolomics approaches, the structural elucidation of metabolites from mass spectral data remains a challenge. Although several reliable tools to identify known metabolites exist, identifying compounds that have not been previously seen is a challenging task that still eludes modern bioinformatics tools. Here, we describe an automated method for substructure recommendation from mass spectra using pattern mining techniques. Based on previously seen recurring substructures our approach succeeds in identifying parts of unknown metabolites. An important advantage of this approach is that it does not require any prior information concerning the metabolites to be identified, and therefore it can be used for the (partial) identification of unknown unknowns. Using association rule mining we are able to recommend valid substructures even for those metabolites for which no match can be found in spectral libraries or structural databases. We further demonstrate how this approach is complementary to existing metabolite identification tools, achieving improved identification results. The method is called MESSAR (MEtabolite SubStructure Auto-Recommender) and is implemented as a free online web service available at http://www.biomina.be/apps/MESSAR/.


Introduction
Metabolomics is the discipline that deals with the high-throughput analysis of metabolites, i.e. small biomolecules, with highly relevant applications in drug and biomarker discovery [4,26]. However, the identification of metabolites from a biological sample remains a major bottleneck in metabolomics, with a vast number of potentially interesting metabolites that are still unknown [12,19].
The standard method for metabolite identification is mass spectrometry (MS), preceded by a separation technique, such as gas chromatography (GC) or liquid chromatography (LC). Single-stage MS measures the mass-to-charge ratios (m/z) of intact metabolites. To obtain information beyond the mass of a molecule, tandem mass spectrometry (MS/MS) is used. During MS/MS the m/z of the product ion fragments of isolated metabolites are recorded, yielding additional structural information. The traditional way to identify an observed metabolite through spectral library searching works by comparing the measured MS/MS spectra to (historic) spectra of previously identified compounds stored in a spectral library and selecting the best match. A drawback of spectral library searching is that it is only possible to obtain a valid identification for a given MS/MS spectrum if the spectral library contains a corresponding reference measurement [19]. Unfortunately, the size of spectral libraries is necessarily limited: reference spectra need to be explicitly generated from (often synthetic) compounds, which takes substantial effort and is expensive. Consequently, only a somewhat limited number of known unknowns can be effectively identified in this manner.
Recent approaches have moved beyond the use of spectral libraries in an attempt to identify additional metabolites from a biological sample, for example by using molecular structural databases, which are larger than spectral libraries by several orders of magnitude [12]. Here, the experimental spectra are compared to fragmentation spectra that are predicted from molecular structures [3, 8-10, 14, 20, 28]. Nevertheless, these approaches remain limited: they can only identify molecules that are present in the database used, and they rely crucially on correct fragmentation predictions.
Besides identifying metabolites by searching in structural databases, there are methods that aim to predict certain structural properties from mass spectral data [5,6,10].
Other approaches identify structurally similar molecules [2], group similar spectra [29], mine spectra for specific user-defined properties [15] or help in the manual annotation of mass spectra [16]. Notably, FingerID [10,22] uses supervised machine learning to predict molecular fingerprints, i.e. bit vectors where each bit represents the presence (or absence) of a certain structural property of the molecule. Its successor CSI:FingerID [8] further improves the fingerprint prediction using high-resolution MS/MS data and uses the fingerprints to rank candidates retrieved from a molecular database [10]. MS2LDA, a recent approach by van der Hooft et al. [23,24], employs text mining techniques to discover patterns across fragmentation spectra, which can be used to aid the de novo annotation of unknown unknowns. A drawback of this approach, however, remains that the extracted patterns still need to be structurally annotated based on expert knowledge and matched to the reference spectra, a time-consuming and complex manual process.
In the presented work we introduce a new, automated approach for substructure recommendation from MS/MS spectra based on frequent itemset mining. Frequent itemset mining is a class of data mining techniques that is specifically designed to discover cooccurring items in transactional data sets [17]. Our approach does not rely on the full metabolite being seen before, but looks for commonly observed substructures to identify part of an unknown metabolite. Instead of using predefined molecular fingerprints in a supervised framework, we generate general substructures in an unsupervised fashion by breaking chemical bonds of the molecules. The advantage of this approach is that it allows an automated annotation of metabolite substructures based on fragmentation data to provide a (partial) identification of unknown unknowns. Substructure recommendations are computed from associations between spectral features and structural features derived from a high-quality set of mass spectrum identifications for known metabolites.
Based on these data, pattern mining techniques are used to detect which substructures are associated with certain fragments and fragment mass differences.
The substructure recommender is available as a free online web service which can be accessed at http://www.biomina.be/apps/MESSAR.

Methods
An overview of the presented substructure recommendation workflow is depicted in figure 1. In brief, molecular substructures are first generated for a database of metabolites. These substructures are then combined into a single data set with fragment ions and mass differences between fragment ions extracted from previously identified MS/MS spectra in a spectral library. We then apply frequent pattern mining techniques to this data set to infer which substructures are associated with certain fragments and fragment mass differences. This results in a list of recommendations of the form: peak p is associated with substructure s with frequency f_p and confidence c_p; mass difference md is associated with substructure s with frequency f_md and confidence c_md.

Structural information
We start with a set of metabolites for which both experimental MS/MS data and molecular structures are available. This molecular data is referred to as data set M.

Spectral information
Let P_m be the set of all MS/MS peaks p in the spectrum corresponding to a metabolite m ∈ M. The mass of every p ∈ P_m is discretized by rounding to one decimal place. We will refer to the sets of rounded peak masses of a spectrum of metabolite m ∈ M and of the entire data set M as p_m and p_M, respectively.
For every p, q ∈ P_m with p ≠ q, a mass difference md_pq = |p − q| is calculated. Subsequently, the mass differences are discretized by rounding to the closest integer. We will refer to the sets of discretized absolute mass differences of a spectrum of metabolite m ∈ M and of the entire data set M as mdiff_m and mdiff_M, respectively.
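The discretization described above can be illustrated with a minimal Python sketch. The function names `round_half_up` and `discretize_spectrum` are hypothetical, and rounding ties away from zero is an assumption, since the text does not state how ties are handled. Mass differences are computed from the unrounded peak masses, following the definition of md_pq above.

```python
from decimal import Decimal, ROUND_HALF_UP
from itertools import combinations


def round_half_up(value, decimals):
    """Round a non-negative mass to `decimals` places, ties rounded up (assumption)."""
    quantum = Decimal(10) ** -decimals  # 0.1 for decimals=1, 1 for decimals=0
    return float(Decimal(str(value)).quantize(quantum, rounding=ROUND_HALF_UP))


def discretize_spectrum(peaks):
    """Return (rounded peak masses, integer mass differences) for one spectrum.

    `peaks` is a list of m/z values. Peak masses are rounded to one decimal
    place; absolute pairwise differences of the original masses are rounded
    to the closest integer.
    """
    rounded_peaks = {round_half_up(p, 1) for p in peaks}
    mass_diffs = {
        int(round_half_up(abs(p - q), 0)) for p, q in combinations(peaks, 2)
    }
    return rounded_peaks, mass_diffs


# Example with three made-up fragment masses.
example_peaks, example_diffs = discretize_spectrum([77.0386, 105.0335, 123.0441])
```

Sets are used because each rounded peak mass or mass difference acts as a single item in a transaction, regardless of how many raw peaks map onto it.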

Pattern mining
Specialized pattern mining techniques are used to detect frequent substructures that can be consistently related to the occurrence of certain spectral features. Frequent substructures are identical parts of a molecular structure that frequently occur in a given data set. To perform pattern mining we require a transactional data set, with each transaction a set of items (or a so-called "itemset"). In our approach, each item consists of a molecular substructure or a spectral feature extracted from an MS/MS spectrum. A transaction is then the set of all substructures, peaks and mass differences for a single molecule. The support of an itemset is defined as the number of transactions in a data set that contain that itemset. An itemset is considered frequent if its support exceeds a specified minimal threshold.
Figure 1: Pattern mining workflow to generate metabolite substructure recommendations. The training data set consists of metabolites for which both an MS/MS spectrum and its molecular substructure are known. These data are transformed into a transactional format, i.e. for every metabolite a single transaction is created which combines the spectral information (peaks and mass differences between the peaks in the MS/MS spectrum) with the corresponding structural information (substructures of the metabolite). These transactions are collected into a single transactional data set and mined for association rules. Rules that are extracted have the form peak p (mass difference md) is associated with substructure s.
After the frequent itemsets have been determined we mine for association rules to reveal hidden relationships in the transactional data. An association rule can be expressed as X ⇒ Y, where X and Y are sets of items, and X ∩ Y = ∅. X is called the body or antecedent of the rule, while Y is called the head or consequent of the rule. The rule indicates the existence of an association between X and Y. The support of an association rule X ⇒ Y is equal to the support of X ∪ Y. The confidence of an association rule X ⇒ Y is the conditional probability that Y is present in a transaction given that X is also present in that transaction, and is defined as:
conf(X ⇒ Y) = supp(X ∪ Y) / supp(X)
Association rule mining is the task of finding all the association rules that are frequent and confident, i.e. identifying the association rules for which the support and confidence exceed specified minimal support and confidence thresholds. A high support indicates that the rule applies to a large number of cases, while a high confidence indicates that the rule should often be correct.
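As an illustration of these two measures, the following Python sketch computes the support and confidence of a rule over a toy transactional data set. The item labels (such as "peak:105.0" or "sub:benzene") and the function names are invented for this example and do not come from the MESSAR data.

```python
def support(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    items = frozenset(itemset)
    return sum(1 for transaction in transactions if items <= transaction)


def confidence(antecedent, consequent, transactions):
    """Support of X ∪ Y divided by the support of X, for the rule X ⇒ Y."""
    support_x = support(antecedent, transactions)
    if support_x == 0:
        return 0.0  # the rule never applies, so no confidence can be claimed
    support_xy = support(set(antecedent) | set(consequent), transactions)
    return support_xy / support_x


# Toy data: each transaction mixes spectral and structural items.
toy_transactions = [
    {"peak:105.0", "md:18", "sub:benzene"},
    {"peak:105.0", "sub:benzene", "sub:carboxyl"},
    {"peak:105.0", "md:44", "sub:carboxyl"},
    {"peak:77.0", "md:18", "sub:benzene"},
]

rule_support = support({"peak:105.0", "sub:benzene"}, toy_transactions)
rule_confidence = confidence({"peak:105.0"}, {"sub:benzene"}, toy_transactions)
```

In this toy data the peak at 105.0 occurs in three transactions, two of which also contain the benzene substructure, so the rule has support 2 and confidence 2/3.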
In our approach, for each metabolite m ∈ M the transaction T_m consists of all peaks p_m, all mass differences mdiff_m and all molecular substructures S_m:
T_m = p_m ∪ mdiff_m ∪ S_m
We will refer to the set of transactions T_m of all the metabolites m ∈ M as T_M. We only mine for patterns in the transactional data set T_M that consist of both molecular substructures and peak masses or mass differences, as these combinations indicate spectral-structural associations. Our frequent itemset mining algorithm is optimized so that only those itemsets that contain at least one pair (p, s) or pair (md, s) are retained, with p ∈ p_m, md ∈ mdiff_m and s ∈ S_m. This step reduces the (large) search space by pruning uninteresting itemsets that do not contain a combination of both spectral and structural information. Furthermore, the association rule mining step is optimized so that only those rules that contain one peak p ∈ p_m (mass difference md ∈ mdiff_m) in their antecedent and one substructure s ∈ S_m in their consequent are considered.
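Because the retained rules have a single spectral feature in the antecedent and a single substructure in the consequent, their support and confidence can be obtained by counting co-occurring (feature, substructure) pairs. The Python sketch below takes this shortcut; it is not the optimized frequent itemset mining algorithm used in this work, and the function names, the tagged-tuple item encoding and the use of "greater than or equal" for the thresholds are assumptions.

```python
from collections import Counter
from itertools import product


def build_transaction(peaks, mass_diffs, substructures):
    """Return the spectral and structural parts of one transaction T_m."""
    spectral = {("peak", p) for p in peaks} | {("md", d) for d in mass_diffs}
    structural = {("sub", s) for s in substructures}
    return spectral, structural


def mine_rules(transactions, min_support, min_confidence):
    """Mine rules 'spectral feature => substructure' by direct pair counting.

    Returns tuples (feature, substructure, support, confidence), sorted by
    decreasing confidence and then decreasing support.
    """
    feature_counts = Counter()
    pair_counts = Counter()
    for spectral, structural in transactions:
        feature_counts.update(spectral)
        pair_counts.update(product(spectral, structural))

    rules = []
    for (feature, sub), supp in pair_counts.items():
        conf = supp / feature_counts[feature]
        # Thresholds are treated as inclusive here (assumption).
        if supp >= min_support and conf >= min_confidence:
            rules.append((feature, sub, supp, conf))
    rules.sort(key=lambda r: (-r[3], -r[2], str(r[0]), str(r[1])))
    return rules


# Three made-up metabolites with hypothetical substructure labels.
toy = [
    build_transaction([105.0], [18], ["benzene"]),
    build_transaction([105.0], [44], ["benzene", "carboxyl"]),
    build_transaction([105.0, 77.0], [28], ["carboxyl"]),
]
toy_rules = mine_rules(toy, min_support=2, min_confidence=0.5)
```

Keeping the spectral and structural items in separate sets makes the restriction to mixed pairs explicit: pairs of two peaks or two substructures are never counted.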
Mining for association rules in the transactional data set T_M results in a list of associations of the form:
peak p_i can be associated with substructure s_j with support f_p and confidence c_p;
mass difference md_i can be associated with substructure s_j with support f_md and confidence c_md.
Such associations can be interpreted as recommendations for unexplained spectra in which the given peaks and mass differences are observed. Note that the peak (mass difference) in the antecedent of the rule and the molecular mass of the substructure in the rule's consequent will rarely be exactly equal. Because it is very hard to accurately simulate the fragmentation of a molecule to generate its substructures due to molecular rearrangements [25], the spectral information can typically be linked to a substructure that is a part of the metabolite under consideration, even though this might not necessarily be the exact substructure responsible for that spectral information.
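Using mined associations as recommendations amounts to a lookup: every rule whose antecedent is observed in the query spectrum contributes its substructure. The following Python sketch shows one way this could work. The rule values are invented, the function name `recommend` is hypothetical, and keeping only the most confident rule per substructure is an assumption rather than a documented MESSAR behaviour.

```python
def recommend(rules, query_features):
    """Return substructures recommended for a spectrum, best rule first.

    `rules` holds tuples (feature, substructure, support, confidence) and
    `query_features` is the set of discretized peaks and mass differences
    observed in the unexplained spectrum.
    """
    best = {}
    for feature, sub, supp, conf in rules:
        if feature not in query_features:
            continue
        current = best.get(sub)
        if current is None or conf > current[0]:
            best[sub] = (conf, supp, feature)
    ranked = [(sub, conf, supp, feature) for sub, (conf, supp, feature) in best.items()]
    ranked.sort(key=lambda r: (-r[1], -r[2], r[0]))
    return ranked


# Invented rules and an invented query spectrum.
example_rules = [
    (("peak", 123.0), "catechol", 12, 0.40),
    (("md", 46), "carboxylic acid", 30, 0.55),
    (("md", 18), "catechol", 8, 0.25),
    (("peak", 91.1), "benzyl", 20, 0.60),
]
example_query = {("peak", 123.0), ("md", 46), ("md", 18)}
example_recommendations = recommend(example_rules, example_query)
```

The benzyl rule is not triggered because its peak is absent from the query, and catechol is reported once with the confidence of its strongest supporting rule.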

Substructure recommendations
Small molecule data was retrieved from the Human Metabolome Database (HMDB) [27].
A total of 814 compounds for which both experimental MS/MS data and the molecular structure are available were taken into account, with no further restrictions. As such, this data contains a heterogeneous set of metabolites. Only those spectra labeled as 'Excellent' were used. Mass spectra generated at different collision energy levels belonging to the same metabolite were combined, while peaks from different collision energies with an m/z difference less than or equal to 0.01 Dalton were merged.
We grouped the spectra based on ionization modes and retained only those peaks with relative intensity compared to the most intense peak exceeding 5%.
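The merging and filtering steps can be sketched in Python as follows. Several details are assumptions because the text does not specify them: a merged peak is represented by its most intense member, intensities from different collision energies are taken to be on a comparable scale, and the function name `merge_and_filter` is hypothetical.

```python
def merge_and_filter(spectra, mz_tolerance=0.01, min_rel_intensity=0.05):
    """Combine spectra of one metabolite and drop low-intensity peaks.

    `spectra` is a list of spectra, each a list of (mz, intensity) tuples
    recorded at one collision energy in the same ionization mode.
    """
    all_peaks = sorted(peak for spectrum in spectra for peak in spectrum)

    merged = []
    for mz, intensity in all_peaks:
        if merged and mz - merged[-1][0] <= mz_tolerance:
            # Same fragment seen at another collision energy: keep the
            # more intense observation as the representative (assumption).
            if intensity > merged[-1][1]:
                merged[-1] = (mz, intensity)
        else:
            merged.append((mz, intensity))

    if not merged:
        return []
    base_peak = max(intensity for _, intensity in merged)
    # Retain peaks whose relative intensity exceeds the cut-off (5% here).
    return [(mz, i) for mz, i in merged if i / base_peak > min_rel_intensity]


# Two made-up spectra of the same compound at different collision energies.
low_energy = [(77.039, 10.0), (105.034, 100.0), (123.044, 3.0)]
high_energy = [(77.040, 60.0), (105.033, 40.0), (51.023, 4.0)]
combined = merge_and_filter([low_energy, high_energy])
```

Sorting by m/z before merging means only neighbouring peaks need to be compared, at the cost that a chain of closely spaced peaks may be merged transitively.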
After converting this data into two transactional data sets, for positive ionization mode and negative ionization mode, association rules were mined with support and confidence thresholds of 3 and 1% respectively, resulting in a total of 92,597 unique recommendations for positive ionization mode and 15,278 for negative ionization mode.
As an example, figure 2 shows some of the recommended substructures for the MS/MS spectrum of 3,4-Dihydroxyphenylacetic Acid.

Improving metabolite identifications
The substructure recommendations can be used to improve the accuracy of the existing tools for metabolite identification. First, to generate recommendations, for an unidenti- [...] substructures and the candidate metabolite structure. Full metabolite structures that contain a higher number of more confidently recommended substructures receive a higher rank. In this fashion it is possible, for example, that a structure containing two recommended substructures is ranked higher than a structure containing three recommended substructures if the average confidence of the rules recommending the substructures is higher in the first case.
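The ranking idea can be illustrated with a small Python sketch. The exact scoring function is not given in the text above, so summing the confidences of the matched substructures is purely an assumption, chosen because it rewards both the number and the confidence of matches. Candidates are represented by precomputed sets of substructure labels; in practice a cheminformatics toolkit would be needed to test substructure containment. All names and values are hypothetical.

```python
def rank_candidates(candidates, recommendations):
    """Rank candidate structures by their matched recommended substructures.

    `candidates` maps a candidate name to the set of substructure labels it
    contains; `recommendations` maps a recommended substructure label to the
    confidence of the rule that recommended it.
    """
    scores = {}
    for name, contained in candidates.items():
        # Assumed score: sum of confidences of the recommended
        # substructures present in the candidate.
        scores[name] = sum(
            conf for sub, conf in recommendations.items() if sub in contained
        )
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Invented example: two confident matches outrank three weak ones.
example_recommended = {"A": 0.9, "B": 0.9, "C": 0.5, "D": 0.5, "E": 0.5}
example_candidates = {
    "candidate_1": {"A", "B"},
    "candidate_2": {"C", "D", "E"},
    "candidate_3": set(),
}
example_ranking = rank_candidates(example_candidates, example_recommended)
```

With these invented values the candidate containing two substructures recommended with confidence 0.9 is placed above the candidate containing three substructures recommended with confidence 0.5, which mirrors the behaviour described above.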

MESSAR rules cover the MS2LDA patterns
MS2LDA [24] discovers patterns across fragmentation spectra in an unsupervised fashion using text mining techniques. As it operates within a similar scope as our tool, a detailed comparison is warranted. This method was used to discover patterns consisting of spectral peaks and neutral losses, which are then structurally annotated based on expert knowledge and matched to reference spectra. This resulted in a set of so-called Mass2Motifs that couple patterns in the spectra to molecular structures. These patterns can then be used to partially identify unknown spectra. While this approach does not require prior structural information, the generated patterns do have to be manually annotated mid-process. Our pattern mining approach uses similar input information to address a similar problem, but it provides its recommendations automatically. We compared our set of recommendation rules with the Mass2Motifs derived from four beer extract data sets [24]. We only took into account those Mass2Motifs labeled with the highest level of confidence. Neutral losses were rounded to the nearest integer. In positive ionization mode, there are 31 high-confidence Mass2Motifs. Each Mass2Motif consists of at least one spectral peak and/or neutral loss and a matching substructure. Out of these 31 patterns, 6 have equivalent MESSAR recommendation rules, i.e. the spectral peak (or neutral loss) is identical to the peak (or mass difference) in the antecedent of the [...] manual curation or integration of expert knowledge to generate its rules.

MESSAR rules correspond to true substructures
To evaluate the accuracy of the generated recommendations we used data provided for the previous two Critical Assessment of Small Molecule Identification (CASMI) challenges (CASMI2014, CASMI2016) [21]. CASMI is an open contest in which participants have to identify the molecular formula and chemical structure for molecules of natural as well as synthetic origins based on mass spectrometry data. The winner is determined by the number of correctly predicted structure identifications. This remains a significant challenge: for some metabolites the true structure was missing from the predictions of all competing teams. We used MS/MS spectra from a total of 60 metabolites from CASMI2014 and CASMI2016.
Using MESSAR we generated recommended substructures from the spectra of these 60 compounds. These recommendations were then compared to the true structures. As a measure of the quality of the predictions, a Fisher's exact test was performed to assess whether the recommendations are statistically significant compared to simply randomly assigning substructures to spectra. Figure 4 shows the p-values for this test, which indicate that our method is able to provide relevant recommendations. The p-value exceeded 0.05 for only a single compound, and that compound proved hard to identify for some of the CASMI contestants as well: two out of seven contestants did not list the true structure at the top of their predictions.
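A one-sided Fisher's exact test on a 2x2 contingency table can be computed from the hypergeometric distribution, as in the Python sketch below. How the table was constructed for the analysis above is not stated, so the layout used here (recommended versus not recommended substructures, present versus absent in the true structure) is an assumption, as is the choice of a one-sided test. The function name is hypothetical and `math.comb` requires Python 3.8 or later.

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Assumed layout:
      a = recommended substructures present in the true structure
      b = recommended substructures absent from the true structure
      c = non-recommended substructures present in the true structure
      d = non-recommended substructures absent from the true structure
    Returns the probability of observing at least `a` present substructures
    among the recommended ones if recommendations were assigned at random.
    """
    recommended = a + b
    present = a + c
    total = a + b + c + d
    denominator = comb(total, present)
    upper = min(recommended, present)
    numerator = sum(
        comb(recommended, k) * comb(total - recommended, present - k)
        for k in range(a, upper + 1)
    )
    return numerator / denominator


# Invented counts for one metabolite.
example_p = fisher_exact_greater(3, 1, 1, 3)
```

The same quantity is available in SciPy as `scipy.stats.fisher_exact(table, alternative="greater")`; the standard-library version is shown only to keep the example self-contained.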
Each recommendation was assigned a confidence score based on the associations between substructures and peaks or mass differences found within the data set. Recommendations with a higher confidence can be assumed to have a higher chance of giving a correct recommendation. The recommendations can therefore be ranked by this confidence value for each CASMI metabolite. Figure 5 shows the mean receiver operator characteristic (ROC) curve for the CASMI data recommendations calculated from 60 [...] ROC results show that recommendations with a higher confidence are more likely to be true for each of these metabolites than those with a lower confidence. These findings suggest that the average confidence of the recommendations provides a good measure of the quality for the recommended substructures.
Figure 4: Enrichment of recommended substructures in the CASMI molecular structures. The p-values are given for each of the 60 molecules tested based on a Fisher's exact test. Green dots denote significant enrichment compared to random assignment, red dots denote non-significant enrichment. (Axes: metabolite number, ln p.)
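For a single metabolite, a ROC curve over confidence-ranked recommendations can be derived as in the following Python sketch. It covers one metabolite only; how the per-metabolite curves were averaged into the mean curve is not described here and is not reproduced. The function names and the example values are hypothetical.

```python
def roc_points(scored):
    """ROC points for recommendations given as (confidence, is_correct) pairs.

    Recommendations are visited from highest to lowest confidence, and tied
    confidences are processed together so the curve does not depend on the
    input order.
    """
    positives = sum(1 for _, correct in scored if correct)
    negatives = len(scored) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one correct and one incorrect recommendation")

    ordered = sorted(scored, key=lambda item: -item[0])
    points = [(0.0, 0.0)]
    true_pos = false_pos = 0
    index = 0
    while index < len(ordered):
        threshold = ordered[index][0]
        while index < len(ordered) and ordered[index][0] == threshold:
            if ordered[index][1]:
                true_pos += 1
            else:
                false_pos += 1
            index += 1
        points.append((false_pos / negatives, true_pos / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under a list of (fpr, tpr) points."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Invented recommendations for one metabolite.
example_scored = [(0.9, True), (0.8, True), (0.6, False), (0.5, True), (0.2, False)]
example_auc = area_under_curve(roc_points(example_scored))
```

An area close to 1 means that correct substructures tend to receive higher confidences than incorrect ones, while an area near 0.5 corresponds to a ranking no better than random.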

Substructure recommendations provide additional insights
One of the most common analysis methods for metabolite spectra is to search the full spectrum in a mass spectral database. To examine whether or not the substructure recommendations from MESSAR provide us with relevant additional information compared to full spectral matching, we used the MassBank mass spectral database [11] to identify the 60 metabolites of the CASMI data. MassBank returned matches for only 7 metabolites out of 60. In addition, out of these 7 metabolites only one was identified correctly (ibuprofen) in the MassBank search results. In contrast, as shown above, we are able to generate relevant recommendations even for those spectra for which no similar spectrum can be found in a spectral library.

Comparison to FingerID
Another common analysis approach for metabolite spectra is to use supervised models to predict the presence or absence of fingerprints. We therefore compared our unsupervised method with a supervised fingerprint-based approach, namely FingerID [10,22]. For candidate molecule retrieval we used the PubChem database [13], which contains all of the CASMI compounds. First, for each of the 60 metabolites in the CASMI data set we retrieved all possible candidate matches from PubChem whose precursor mass falls within the given mass tolerance provided by the CASMI challenge. Given the PubChem version we used (2011), this resulted in 21 out of 60 molecules (18/21 in positive mode) for which the candidate list contained the correct identification. Next, we generated fingerprints using FingerID and substructures using MESSAR and evaluated how this information can be used to rank the candidates retrieved from PubChem. For FingerID we computed a fingerprint for each MS/MS spectrum based on the set of 528 unique fingerprints from OpenBabel (FP3, FP4 and MACCS) [18] and eliminated those that appear in all molecules or do not appear at all [10,22]. This resulted in a final number of 299 fingerprints. We followed the original authors' recommendations and only trained a fingerprint prediction model for positive ionization mode [8]. Next, we ranked the candidate molecules retrieved from PubChem based on the similarity with the predicted fingerprints. Similarly, we obtained MESSAR recommendations for all MS/MS spectra and used these to rank each of the candidate molecules. We then compared the FingerID ranks with the MESSAR ranks for each of the 18 spectra. Table 2 shows the ranking results for both FingerID and MESSAR. Although in general the ranking is rather poor for both tools, this is to be expected considering the size of the PubChem database.
Table 2:
Tool        Avg. rank   rank ≤ 20
FingerID    1799        0/18
MESSAR      1456        3/18
Nevertheless, the MESSAR ranks are somewhat better than the FingerID ranks. For MESSAR the average rank is slightly lower and the number of true structures ranked within the top 20 is higher. A direct comparison of the rankings does reveal, however, that the two tools perform roughly the same, as can be seen in figure 6.
This shows that the performance of our method is at least comparable to that of a recent fingerprint-based supervised method given the same low-resolution training data.
Furthermore the results suggest that the two approaches are likely complementary, each providing users with a different set of information that can be used to infer the true molecular structure.

Substructure recommendations improve existing prediction tools
As MESSAR does not provide full metabolite identifications but only recommends substructures, its main applicability lies in assisting a de novo annotation of mass spectra and in improving existing metabolite prediction tools. To evaluate the suitability of substructure recommendations in combination with existing tools we used the MAGMa search engine, a state-of-the-art metabolite structure identification tool based on a structural database [20], which was the winner of CASMI2014. For the MAGMa structural search again the PubChem database [13] was used. For each of the spectra in the CASMI data set we retained the 50 highest ranked MAGMa predictions, which were subsequently reranked based on our recommended substructures. Figure 7 shows the results of the reranking. Out of 20 metabolites for which the true structure was ranked among the top 50 MAGMa identifications, 11 correct identifications were ranked higher by combining the substructure recommendations with the MAGMa identifications, with 10 correct identifications receiving rank 1. Furthermore, four correct identifications which were top ranked retained this rank, while five correct identifications were ranked lower. The rank decrease for these last five correct identifications can be explained by the significant dissimilarity between their structures and the data set used to generate the recommendation rules. Although our approach does not require that a fully matching structure is present in the training data set, it still requires repeated observations of matching substructures. Unfortunately, because only a limited number of high-quality metabolomics MS/MS spectra are publicly available, the data set used within this study consisted of less than one thousand metabolites. As a result, for substructures that are missing from the data set, or that only occur a few times, there is insufficient data to learn the relationship between the substructure and its spectra. By increasing the size of the MS/MS data set to be mined we expect that the performance of our method will increase further. Nevertheless, despite the decrease in identification performance for a few molecules due to a lack of suitable training data, the provided rerankings on average greatly improve the MAGMa predictions.
[Figure: log rank versus proportion of dataset, for FingerID and MESSAR.]
[...] Furthermore, for every CASMI metabolite the average similarity between the training data set and that metabolite is shown, as well as the percentage of training structures for which the similarity with that metabolite is larger than 0.2 (20%). The similarity is calculated by comparing molecular fingerprints [1], which are formed by bit vectors with each bit representing the presence (or absence) of a certain structural property of the molecule.

Conclusions
We have introduced a novel pattern mining-based approach to recommend metabolite substructures from MS/MS spectra. The aim of this method is to provide ranked recommendations to the end user regarding the origins of unexplained mass spectra. We have shown that our method succeeds in recommending substructures even for those spectra for which no match can be found in mass spectral libraries. Therefore, our tool can be used to assist in the de novo annotation of metabolites not present in mass spectral or structural databases. In addition, the recommendations can be combined with existing tools for metabolite structure prediction to improve the accuracy of the compound identifications. An important advantage is that, as opposed to expert-driven substructure recommendations, our method is fully automated. It is freely available at http://www.biomina.be/apps/MESSAR.