Spec2Vec: Improved mass spectral similarity scoring through learning of structural relationships
Fig 3
(A) Histogram of the structural similarity scores across all possible spectrum pairs among the 12,797 spectra in the UniqueInchikey dataset (81,875,206 unique pairs, excluding pairs of a spectrum with itself). The histogram indicates that randomly chosen pairs will most likely have scores between 0 and 0.5. Structural similarity scores > 0.6 are rare and hence unlikely to be obtained by choosing pairs at random (the probability of randomly picking a pair with a structural similarity score > 0.6 is p = 0.0103, and p = 0.0034 for a score > 0.7). (B) Different spectral similarity scores were calculated for the same 81,875,206 spectral pairs. Comparing the highest 0.1% of the resulting scores to the structural similarities reveals that Spec2Vec similarities show a notably higher correlation with actual structural similarities. The parameters used were: 1) Spec2Vec, trained on UniqueInchikey for 50 iterations or trained on AllPositive for 15 iterations; 2) modified cosine score with tolerance = 0.005 and min_match = 10; 3) cosine score with tolerance = 0.005 and min_match = 6; and 4) the theoretical maximum that can be achieved by choosing the highest possible Tanimoto scores for every percentile.
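The quantities in this caption can be illustrated with a minimal Python sketch. It is not the authors' code: the function names are hypothetical, the toy score lists are invented for illustration, and it assumes that panel B summarizes each top fraction by the mean structural similarity of the selected pairs, which the caption does not state explicitly.

```python
from math import comb


def count_unique_pairs(n_spectra):
    """Number of unordered pairs of distinct spectra, n * (n - 1) / 2."""
    return comb(n_spectra, 2)


def fraction_above(scores, threshold):
    """Empirical probability that a randomly picked pair scores above threshold."""
    return sum(s > threshold for s in scores) / len(scores)


def mean_structural_in_top(spectral_scores, structural_scores, top_fraction):
    """Mean structural similarity of the pairs with the highest spectral scores.

    Assumption: the top fraction is summarized by its mean structural score.
    """
    n_top = max(1, round(top_fraction * len(spectral_scores)))
    # Rank pair indices by spectral similarity, highest first.
    order = sorted(range(len(spectral_scores)),
                   key=lambda i: spectral_scores[i], reverse=True)
    return sum(structural_scores[i] for i in order[:n_top]) / n_top


def theoretical_max(structural_scores, top_fraction):
    """Best achievable mean: select the highest structural scores directly."""
    n_top = max(1, round(top_fraction * len(structural_scores)))
    best = sorted(structural_scores, reverse=True)[:n_top]
    return sum(best) / n_top


if __name__ == "__main__":
    # Pair count for the 12,797 spectra named in the caption.
    print(count_unique_pairs(12797))  # 81875206

    # Toy, made-up scores for four pairs (not data from the paper).
    spectral = [0.9, 0.1, 0.5, 0.8]
    structural = [0.75, 0.25, 0.5, 0.25]
    print(fraction_above(structural, 0.6))
    print(mean_structural_in_top(spectral, structural, 0.5))
    print(theoretical_max(structural, 0.5))
```

With real data, `structural` would hold the Tanimoto scores for all 81,875,206 pairs and `spectral` the matching Spec2Vec, cosine, or modified cosine scores; at that scale an array library would be a more practical choice than plain lists.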