Spec2Vec: Improved mass spectral similarity scoring through learning of structural relationships

doi:10.1371/journal.pcbi.1008724

Fig 1.

(A) MS-MS spectra can be considered signatures of molecules: they are known to contain structural information about the original molecule, but there is no straightforward way to translate mass spectral features into structural features of the fragmented molecule. (B) Spectra are commonly compared using similarity measures such as cosine or modified cosine scores. While those measures are very good at revealing (nearly) equal spectra, they often underperform for spectra of complex molecules that have high structural similarity but differ in multiple locations. (C) Spec2Vec is based on algorithms from natural language processing and learns relationships between peaks from how frequently they co-occur. (D) Two spectra from different yet similar molecules will hence be represented by similar spectral vectors even if many of their peak positions differ.
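The limitation described in (B) can be illustrated with a minimal sketch of a cosine score that only pairs peaks whose m/z values agree within a tolerance. The greedy pairing strategy, the function name `cosine_score`, and the toy spectra are assumptions made for illustration; they are not taken from the article.

```python
from typing import List, Tuple

Peak = Tuple[float, float]  # (m/z, intensity)


def cosine_score(a: List[Peak], b: List[Peak], tolerance: float = 0.005) -> float:
    """Greedy cosine score: pair peaks whose m/z values differ by <= tolerance."""
    # Collect all candidate peak pairs, strongest intensity product first.
    candidates = sorted(
        (
            (int_a * int_b, i, j)
            for i, (mz_a, int_a) in enumerate(a)
            for j, (mz_b, int_b) in enumerate(b)
            if abs(mz_a - mz_b) <= tolerance
        ),
        reverse=True,
    )
    used_a, used_b = set(), set()
    dot = 0.0
    for product, i, j in candidates:
        # Each peak may be used in at most one pair.
        if i not in used_a and j not in used_b:
            used_a.add(i)
            used_b.add(j)
            dot += product
    norm_a = sum(intensity ** 2 for _, intensity in a) ** 0.5
    norm_b = sum(intensity ** 2 for _, intensity in b) ** 0.5
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical toy spectra: in the second one, two of three peaks are shifted by 14 Da.
spectrum_a = [(100.0, 1.0), (150.0, 0.5), (200.0, 0.8)]
spectrum_b = [(100.0, 1.0), (164.0, 0.5), (214.0, 0.8)]

score_identical = cosine_score(spectrum_a, spectrum_a)
score_shifted = cosine_score(spectrum_a, spectrum_b)
print(score_identical, score_shifted)
```

In this toy case only the unshifted peak contributes to the score of the second comparison, so the score drops sharply although most of the intensity pattern is preserved.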

Fig 2.

In-depth comparison example of two spectra.

Since the two molecules differ slightly in three locations, both cosine and modified cosine scores fail to recognize the overall structural similarity and return low spectral similarity scores. Spec2Vec, in contrast, recognizes that many of the peaks often co-occur across the training data. This results in a high peak context similarity, which overall leads to a high Spec2Vec similarity score. For illustrative purposes, this figure only displays peaks between 400 and 1000 Da.

Fig 3.

(A) Histogram of the structural similarity scores across all possible spectrum pairs between the 12,797 spectra in the UniqueInchikey dataset (81,875,206 unique pairs, not including pairs of spectra with themselves). The histogram indicates that randomly chosen pairs will most likely show scores between 0 and 0.5. Structural similarity scores > 0.6 are rare and hence unlikely to be obtained by randomly choosing pairs (p = 0.0103 is the probability of randomly picking a pair with a structural similarity score > 0.6, p = 0.0034 for a score > 0.7). (B) Different spectral similarity scores were calculated for the same 81,875,206 spectral pairs. Comparing the highest 0.1% of the resulting scores to the structural similarities reveals that Spec2Vec similarities show a notably higher correlation with actual structural similarities. The curves shown are 1) Spec2Vec, trained on UniqueInchikey for 50 iterations or trained on AllPositive for 15 iterations, 2) modified cosine score with tolerance = 0.005 and min_match = 10, 3) cosine score with tolerance = 0.005 and min_match = 6, and 4) the theoretical maximum that can be achieved by choosing the highest possible Tanimoto scores for every percentile.
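The pair counts in this caption follow from simple combinatorics, which the short sketch below reproduces. The variable names are illustrative, and the expected number of high-similarity pairs is only an estimate derived from the rounded probability given in the caption.

```python
from math import comb

n_spectra = 12_797

# Unique unordered pairs, excluding pairs of a spectrum with itself.
n_pairs = comb(n_spectra, 2)
print(n_pairs)  # 81875206

# Number of pairs that fall into the highest 0.1% of scores (panel B).
n_top_pairs = n_pairs // 1000
print(n_top_pairs)  # 81875

# Rough expected number of pairs with structural similarity > 0.6,
# using the rounded probability p = 0.0103 from panel A.
n_expected_above_06 = round(0.0103 * n_pairs)
print(n_expected_above_06)
```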

Fig 4.

Spec2Vec similarity scores deliver improved true-to-false-positive ratios during library matching.

1000 randomly selected spectra, each with an InChIKey that occurs at least twice in the entire dataset, were removed from the AllPositive dataset and then matched to the remaining spectra. Matching was done by pre-selecting spectra with the same precursor-m/z (tolerance = 1 ppm) and then choosing the candidate with the highest spectral similarity score, provided this score was larger than a set threshold. The left plot shows the true-vs-false positive rate when using Spec2Vec (red) or cosine scores (black). Due to the required precursor-m/z match, the modified cosine scores here are virtually identical to the cosine scores and are hence not shown. Labels near the first and final dots report the similarity score thresholds used. The inset plot on the left displays how many spectra with identical InChIKey are part of the library for the 1000 query spectra. The plot on the right displays the resulting accuracy and retrieval rates for the same parameters. Using Spec2Vec, library matching could be done with notably higher accuracy across all tested retrieval rates. Please note: Retrieval rates for the cosine score do not fully reach the level of the Spec2Vec-based matching because of the set min_match parameter, which in the presented case assigns a score of 0.0 to each pair with fewer than six matching peaks. Lowering the min_match parameter will increase the retrieval but also lower the accuracy (see also Fig A in S3 Text).
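The matching procedure described here (precursor-m/z pre-selection, then best score above a threshold) could be sketched roughly as follows. The dictionary layout of a spectrum, the helper names, and the lookup-table score function are hypothetical; any spectral similarity function with the same call shape could be plugged in.

```python
from typing import Callable, Dict, List, Optional

Spectrum = Dict[str, object]  # hypothetical layout: {"id": str, "precursor_mz": float}


def ppm_difference(mz_query: float, mz_reference: float) -> float:
    """Relative m/z difference in parts per million."""
    return abs(mz_query - mz_reference) / mz_reference * 1e6


def match_to_library(
    query: Spectrum,
    library: List[Spectrum],
    score_fn: Callable[[Spectrum, Spectrum], float],
    threshold: float,
    ppm_tolerance: float = 1.0,
) -> Optional[Spectrum]:
    """Return the best-scoring library spectrum, or None if no match qualifies."""
    # Step 1: pre-select candidates with matching precursor m/z.
    candidates = [
        ref for ref in library
        if ppm_difference(query["precursor_mz"], ref["precursor_mz"]) <= ppm_tolerance
    ]
    if not candidates:
        return None
    # Step 2: keep the highest-scoring candidate only if it exceeds the threshold.
    scored = [(score_fn(query, ref), ref) for ref in candidates]
    best_score, best_ref = max(scored, key=lambda pair: pair[0])
    return best_ref if best_score > threshold else None


# Toy data: scores come from a lookup table standing in for a real similarity function.
toy_scores = {"A": 0.9, "B": 0.4, "C": 0.99}
toy_library = [
    {"id": "A", "precursor_mz": 300.0001},
    {"id": "B", "precursor_mz": 300.0002},
    {"id": "C", "precursor_mz": 300.1},  # outside the 1 ppm window
]
toy_query = {"id": "Q", "precursor_mz": 300.0}


def toy_score_fn(query: Spectrum, ref: Spectrum) -> float:
    return toy_scores[ref["id"]]


match_loose = match_to_library(toy_query, toy_library, toy_score_fn, threshold=0.5)
match_strict = match_to_library(toy_query, toy_library, toy_score_fn, threshold=0.95)
print(match_loose, match_strict)
```

Raising the threshold trades retrieval for accuracy: in the toy example the strict threshold returns no match at all, which corresponds to moving along the curves in the figure.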

Fig 5.

Matching of unknown molecules (not part of library) using Spec2Vec similarities.

All spectra of 200 randomly selected InChIKeys (1030 spectra) were removed from the AllPositive dataset. Using a word2vec model that was trained on the remaining dataset, also excluding non-annotated spectra (76,062 spectra), each removed “query” spectrum was compared to the dataset using only the Spec2Vec similarity score. (A) Histogram of the best structural similarity score among the top-10 Spec2Vec matches found for each query. For nearly 60% of all queries, Spec2Vec finds a match with a structural similarity score > 0.6, reflecting high molecular similarity. (B) The quality of the suggested matches is highly dependent on the mass of the query compound. In particular for larger molecules (> 400 Da), Spec2Vec similarities allow finding highly similar molecules. (C+D) Examples of unknown molecules (not part of library) that are compared to all library spectra to find the most similar matches using Spec2Vec. In both cases the algorithm is able to return molecules highly related to the query molecules, which could be used to help annotate the query spectra or to infer their chemical class.
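The quantity plotted in (A) can be read as follows: rank all library spectra by spectral similarity to the query, keep the ten highest, and report the best structural similarity among them. A minimal sketch of that computation, with a hypothetical function name and made-up toy scores:

```python
from typing import Sequence


def best_structural_similarity_in_top_n(
    spectral_scores: Sequence[float],
    structural_scores: Sequence[float],
    n: int = 10,
) -> float:
    """Best structural similarity among the n library entries with the highest spectral score."""
    # Indices of library entries, ordered by decreasing spectral similarity to the query.
    ranked = sorted(
        range(len(spectral_scores)),
        key=lambda i: spectral_scores[i],
        reverse=True,
    )
    return max(structural_scores[i] for i in ranked[:n])


# Toy example with four library entries (values are invented for illustration).
toy_spectral = [0.9, 0.2, 0.8, 0.5]
toy_structural = [0.3, 0.95, 0.7, 0.1]

best_in_top2 = best_structural_similarity_in_top_n(toy_spectral, toy_structural, n=2)
best_in_top10 = best_structural_similarity_in_top_n(toy_spectral, toy_structural, n=10)
print(best_in_top2, best_in_top10)
```

With n = 2 the structurally closest entry is missed because its spectral score ranks too low, which is the kind of case that lowers the histogram in (A).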

Fig 6.

Comparison of spectra clustering using modified cosine (left) or Spec2Vec (right) across a range of similarity score cutoffs.

The cluster quality is assessed by measuring the average structural similarity across all linked pairs within each cluster. Setting a structural similarity threshold of 0.5 (see Fig 3A) makes it possible to compare the fraction of spectra that end up in chemically homogeneous clusters (red) with the fraction in more heterogeneous clusters (green) and the fraction of spectra that are not clustered at all (those with no links above the set threshold). Clustering is done here by creating edges between spectra (= nodes) for similarities above a certain cutoff (adding max. 10 links per node). To make the resulting clustering more robust and better comparable across different scores, we used the Louvain algorithm to break up the large clusters. Dashed squares mark regions of relatively high retrieval (high fraction of clusters with high structural similarity) and high accuracy (large discrepancy between the fractions of high and low structural similarity clusters). Overall, Spec2Vec clusters higher fractions of spectra into high structural similarity clusters (> 35% of all spectra are in high similarity clusters for a Spec2Vec similarity threshold of 0.7).
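The edge-creation step can be sketched as below. How exactly the limit of 10 links per node is enforced is not stated in the caption, so the rule used here (each node proposes its strongest links above the cutoff, and an edge is kept if either endpoint proposes it) is an assumption. The Louvain step is omitted, and plain connected components stand in for clusters.

```python
from typing import List, Sequence, Set, Tuple


def build_edges(
    similarities: Sequence[Sequence[float]],
    cutoff: float,
    max_links: int = 10,
) -> Set[Tuple[int, int]]:
    """Create undirected edges for similarities above the cutoff, at most max_links proposed per node."""
    n = len(similarities)
    edges: Set[Tuple[int, int]] = set()
    for i in range(n):
        # Neighbours above the cutoff, strongest first (self-links excluded).
        neighbours = sorted(
            (j for j in range(n) if j != i and similarities[i][j] > cutoff),
            key=lambda j: similarities[i][j],
            reverse=True,
        )
        for j in neighbours[:max_links]:
            edges.add((min(i, j), max(i, j)))
    return edges


def connected_components(n: int, edges: Set[Tuple[int, int]]) -> List[Set[int]]:
    """Group nodes into clusters by following edges (depth-first search)."""
    adjacency = {i: set() for i in range(n)}
    for i, j in edges:
        adjacency[i].add(j)
        adjacency[j].add(i)
    seen: Set[int] = set()
    components: List[Set[int]] = []
    for start in range(n):
        if start in seen:
            continue
        component: Set[int] = set()
        stack = [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        components.append(component)
    return components


# Toy similarity matrix for four spectra (invented values).
toy_similarities = [
    [1.0, 0.8, 0.75, 0.1],
    [0.8, 1.0, 0.3, 0.2],
    [0.75, 0.3, 1.0, 0.1],
    [0.1, 0.2, 0.1, 1.0],
]
toy_edges = build_edges(toy_similarities, cutoff=0.7)
toy_clusters = connected_components(len(toy_similarities), toy_edges)
print(toy_edges, toy_clusters)
```

In the toy example the fourth spectrum has no link above the cutoff and ends up as a singleton, corresponding to the "not clustered" fraction in the figure.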
