Spec2Vec: Improved mass spectral similarity scoring through learning of structural relationships

doi:10.1371/journal.pcbi.1008724


Fig 5

Matching of unknown molecules (not part of the library) using Spec2Vec similarities.

All spectra of 200 randomly selected InChIKeys (1030 spectra) were removed from the AllPositive dataset. A Word2Vec model was then trained on the remaining annotated spectra (76,062 spectra; non-annotated spectra were also excluded), and each removed “query” spectrum was compared to this dataset using only the Spec2Vec similarity score. (A) Histogram of the best structural similarity score among the top-10 Spec2Vec matches for each query. For nearly 60% of all queries, Spec2Vec finds a match with a structural similarity score > 0.6, reflecting high molecular similarity. (B) The quality of the suggested matches depends strongly on the mass of the query compound. For larger molecules in particular (> 400 Da), Spec2Vec similarities allow highly similar molecules to be found. (C, D) Examples of unknown molecules (not part of the library) that are compared to all library spectra to find the most similar matches using Spec2Vec. In both cases the algorithm returns molecules that are closely related to the query molecule and that could help to annotate the query spectrum or to infer its chemical class.
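The evaluation summarized in panel (A) can be illustrated with a minimal sketch. It assumes two precomputed matrices with one row per query and one column per library spectrum: one holding spectral similarity scores (here standing in for Spec2Vec scores) and one holding structural similarity scores. The function names, the toy values, and the use of NumPy are assumptions for illustration only and are not taken from the authors' code.

```python
import numpy as np


def best_structural_score_in_top_n(spec_sim, struct_sim, n=10):
    """For each query (row), select the n library entries with the highest
    spectral similarity and return the best structural similarity among them."""
    spec_sim = np.asarray(spec_sim, dtype=float)
    struct_sim = np.asarray(struct_sim, dtype=float)
    n = min(n, spec_sim.shape[1])
    # Column indices of the n highest spectral similarities per row.
    # The order inside the top n does not matter for taking a maximum.
    top_idx = np.argpartition(-spec_sim, n - 1, axis=1)[:, :n]
    top_struct = np.take_along_axis(struct_sim, top_idx, axis=1)
    return top_struct.max(axis=1)


def fraction_above(scores, threshold=0.6):
    """Fraction of queries whose best structural score exceeds the threshold."""
    return float(np.mean(np.asarray(scores) > threshold))


# Hypothetical toy data: 2 queries compared against 4 library spectra.
spec_sim = np.array([[0.9, 0.1, 0.8, 0.2],
                     [0.3, 0.7, 0.2, 0.6]])
struct_sim = np.array([[0.5, 0.95, 0.7, 0.1],
                       [0.9, 0.2, 0.8, 0.4]])

best = best_structural_score_in_top_n(spec_sim, struct_sim, n=2)
print(best)                  # best structural score per query
print(fraction_above(best))  # share of queries above the 0.6 threshold
```

Note that the selection uses only the spectral scores; the structural scores are looked up afterwards, which mirrors the caption's setup in which the query is matched "by only using the Spec2Vec similarity score".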

doi: https://doi.org/10.1371/journal.pcbi.1008724.g005