SpeCollate: Deep cross-modal similarity network for mass spectrometry data based peptide deductions

Historically, the database search algorithms have been the de facto standard for inferring peptides from mass spectrometry (MS) data. Database search algorithms deduce peptides by transforming theoretical peptides into theoretical spectra and matching them to the experimental spectra. Heuristic similarity-scoring functions are used to match an experimental spectrum to a theoretical spectrum. However, the heuristic nature of the scoring functions and the simple transformation of the peptides into theoretical spectra, along with noisy mass spectra for the less abundant peptides, can introduce a cascade of inaccuracies. In this paper, we design and implement a Deep Cross-Modal Similarity Network called SpeCollate, which overcomes these inaccuracies by learning the similarity function between experimental spectra and peptides directly from the labeled MS data. SpeCollate transforms spectra and peptides into a shared Euclidean subspace by learning fixed size embeddings for both. Our proposed deep-learning network trains on sextuplets of positive and negative examples coupled with our custom-designed SNAP-loss function. Online hardest negative mining is used to select the appropriate negative examples for optimal training performance. We use 4.8 million sextuplets obtained from the NIST and MassIVE peptide libraries to train the network and demonstrate that for closed search, SpeCollate is able to perform better than Crux and MSFragger in terms of the number of peptide-spectrum matches (PSMs) and unique peptides identified under 1% FDR for real-world data. SpeCollate also identifies a large number of peptides not reported by either Crux or MSFragger. To the best of our knowledge, our proposed SpeCollate is the first deep-learning network that can determine the cross-modal similarity between peptides and mass-spectra for MS-based proteomics. We believe SpeCollate is significant progress towards developing machine-learning solutions for MS-based omics data analysis. SpeCollate is available at https://deepspecs.github.io/.


Introduction
To date, mass spectrometry (MS) proteomics data is identified using database search algorithms purely based on numerical techniques (Fig 1). These numerical techniques operate by limited to misidentifications or no identifications for peptides, statistical accuracy (FDR), and inconsistencies between different search engines [22]. Comparison across literature indicates decreased average accuracy of de novo algorithms (< 35%) [12] relative to database search algorithms (30-80%) [23]. Lack of quality assessment benchmarks makes the accuracy exhibited from these database search tools highly dependent on the data, indicating that further formal investigation and evaluation is warranted. Two major sources of heuristic errors that are introduced in the numerical database search algorithms are (1) how the peptide deduction takes place, i.e., simulation of the spectra (from peptides), and (2) the peptide spectrum match scoring-function. Previous work suggests that numerical algorithms and the use of traditional ML algorithms may not be able to capture and integrate the multidimensional features of MS data [24]. However, deep learning methods [12,13] may offer an improved approach for identifying peptides in noisy high-dimensional MS data and peptides that are very similar to each other [3]. Preliminary progress assessing deep learning methods in peptide deduction applied to MS data has yielded an average accuracy of 82-95% on selected data sets, but with limited precision (amino-acid level 72%) and recall (peptide level 39.24%) [13]. Prosit [2] incorporates theoretical spectral simulation in the database search pipeline by encoding the peptide sequences into a latent space and then decoding the embeddings to predict fragment ion intensities while Slider [25] uses deep convolutional neural networks to score experimental spectra against theoretical spectra. In [26], spectra for modified peptides are embedded close to the non-modified version in vectors of length 32, which enables them for fast lookup of the nonmodified version for a given spectrum. These studies have shown that deep-learning models can be helpful in modeling MS proteomics data and improved accuracy from these peptide algorithms warrants further investigation into the applicability of sophisticated machine-learning strategies.
In this paper, we make two fundamental contributions in MS-based proteomics search: 1) We present the design and implementation of Deep Similarity Network, called SpeCollate, for peptide-spectrum similarity measure. The goal is to learn a fixed-sized embedding of variable length experimental spectra, and peptide strings so that spectrum and its corresponding peptide are projected close to each other in the shared embedding space. Our proposed network consists of two sub-networks, i.e., spectrum sub-network (SSN) consisting of two fully connected layers and peptide sub-network (PSN) consisting of one bi-directional LSTM followed by two fully connected layers. 2) We then formulate and implement a custom loss function, called SNAP-loss for training the proposed deep-similarity network SpeCollate. This training process takes sextuplets as input, consisting of a combination of positive, negative, and anchor spectra and peptides. We train or network for 200 epochs on a dataset of size * 4.8 million sextuplets containing * 1.8 million unique peptides. As SpeCollate generates charge-independent peptide embeddings, we demonstrate that it performs better than Crux and MSFragger in terms of the number of PSMs and peptides identified under 1% FDR. SpeCollate encodes both peptides, and spectra into a latent space which are then used for direct comparison and peptide deduction. To the best of our knowledge, the proposed SpeCollate is the first deep-learning network that can determine cross-modal similarity between peptides and mass-spectra for MSbased omics.
The rest of the paper is organized as follows: In section 2, we discuss different motivations for our work. In section 3, we discuss the major contributions of this work and discuss the design and implementation of SpeCollate, SNAP-loss, and GPU-based inference. In section 4, we describe experimental setup and present results. In section 5, we discuss the results, limitations of SpeCollate, and future prospects. Section 6 concludes this paper.

Background
In this section, we will discuss the background of proteomics database search, emphasizing the shortcomings of the existing methods and potential solutions for them that inspired the development of the current framework. The shortcomings, including limitations and oversights of the existing numerical techniques, bounded performance of spectral simulators, unoptimized scoring heuristics, and the opportunities made available by huge data repositories with labeled spectra, are discussed.

Space transition
Peptides and their corresponding MS/MS spectra lie in vastly distinct spaces. Peptides consist of a string of (typically) twenty alphabets (each representing an amino acid). In contrast, spectra are a series of floating-point numbers generated by a complex and stochastic fragmentation process. Transitioning in-between spaces can only approximate the counterpart projection as manifested by the existing techniques. De novo projects spectra onto a sub-peptide-space but with underwhelming accuracy as the spectra are mostly noisy and necessary information is missing. Similarly, in peptide-spectrum scoring methods, peptides are projected onto subspectral-space, and the similarity is measured by projecting the experimental spectra onto the same subspace (dot-product) for comparison as shown in Fig 3. Although the database search process is relatively more accurate, the output quality is contingent on the quality of the experimental and projected theoretical spectra. Therefore, we argue that a more flexible technique that can learn intermediate embeddings for both spectra and peptides could improve database search quality. De novo and database search, that try to transform one space to another. This is prone to error and uncertainty as a lot of information can be missed. On the contrary, SpeCollate learns same sized embeddings for both peptides and spectra by projecting them to a shared euclidean space. https://doi.org/10.1371/journal.pone.0259349.g003

Spectral simulation
Most state-of-the-art database search tools provide a simulator to generate spectra containing b and y peaks (sometimes a, c, x, and z) from the theoretical peptides. Other simulators also provide options to generate peaks with different neutral losses (NH 3 , H 2 O), Immonium ions, and isotope ions [27]. These simulators incur numerous deficiencies due to the inherent complexities of the mass-spectrometry process, causing misidentification of multiple features. These include unaccounted peaks, missing peaks, falsely identified true and noisy peaks, peak intensities, neutral losses, isotopic peaks, and noise characteristics. As a result, the simulated spectra only manage to span a sub-space of experimental spectra (Fig 3). Recently, various deep learning-based simulators have mitigated the problems with the traditional simulation approaches to a certain extent. One of the main contributions of this paper is to directly match the experimental spectra and their corresponding theoretical peptide string by learning the embeddings in a shared subspace from huge sets of labeled data available, e.g., NIST and MassIVE.

Scoring heuristics
Although the simulators somewhat help improve the database search process, they only address half of the challenge, i.e., spectra simulation, while the scoring function is still the limiting factor. Presence or absence of specific peaks/ions can significantly impact peptide spectrum matching. In addition, the comparison of complex fragmentation spectra is not straightforward, and the outcome can vary among different scoring functions. SpeCollate can overcome this shortcoming by training on a wide variety of data to learn embeddings optimized for distance-based similarity. By processing data from different modalities using different types of networks, i.e., spectra using SSN and peptides using PSN, it can extract the valuable features needed for proper matching.

Learning from data
When learning a similarity (or scoring) function, ideally, we would like to retain all the features that improve the similarity measure and abolish the useless ones. SpeCollate approaches this solution by projecting both peptides and spectra onto a shared euclidean space. This is accomplished by learning embeddings of equal size for both spectra and peptides so that their similarity is directly proportional to L2 distance in the resultant euclidean space. Hence, addressing both above-mentioned fundamental problems by finding a middle ground between two extremes (De novo and database search) and simplifying the comparison. Using separate branches for spectra and peptides, SpeCollate can learn valuable features to assign spectra to their corresponding peptides confidently using the L2 distance-based similarity measure.

SpeCollate: Deep learning model for spectra-peptide similarity
This paper designs and implements a similarity network (SpeCollate) and the custom loss function (SNAP-loss) to learn a similarity function for peptide-spectrum matches. Similarity networks have been used for cross-modal retrieval of images and captions [28], face re/identification [29], and other problems [30,31]. In proteomics, the use of similarity networks has been limited to spectral library search [26] and spectral clustering [32]. Our goal is to learn a fixed-sized embedding of variable length experimental spectra and peptide strings so that a given spectrum and its corresponding peptide are projected close to each other (in terms of L2 distance) in the shared subspace. We use L2 distance as the similarity matrix since it has been shown in the literature to work well for similarity ranking loss functions, e.g., triplet loss, and perform better than other similarity metrics, e.g., cosine similarity. We took inspiration from FaceNet [29] to select L2 distance as the similarity metric. The network consists of two subnetworks, i.e., SSN consisting of two fully connected layers and PSN consisting of one bi-directional LSTM followed by two fully connected layers. The training process takes two sets of data points as inputs, i.e., the sparse spectrum vectors and encoded peptide strings. The loss value is calculated by generating sextuplets, after each forward pass, consisting of a positive pair (Q, P), a negative pair (Q N , P N ) Q for Q, and a negative pair (Q N , P N ) P for P where Q is the anchor spectrum and P is the positive peptide. The negative pairs are selected via online hardest negative mining to make the training process more efficient and faster. In this process, the negative spectra and peptides closest to Q and P are selected for a given batch after each forward pass.
In this regard, this paper's contributions are two folds: 1) Design and implementation of Deep Similarity Network, SpeCollate, for Proteomics Database Search and 2) Formulation and deployment of custom loss function SNAP-loss for training SpeCollate.

SpeCollate architecture
SpeCollate consists of two sub-networks called spectral sub-network (SSN) and peptide subnetwork (PSN), which process spectra and peptides respectively to generate embeddings of size 256. We empirically selected the embedding size of 256 as the smallest value without incurring any performance degradation. The complete SpeCollate network is shown in Fig 4.

Spectral Sub-Network (SSN).
The SSN branch of the network processes spectra and embeds them onto the surface of a unit hyper-sphere in a euclidean subspace IR 256 . SSN

PLOS ONE
consists of two fully connected hidden layers of size 80, 000 × 1024 and 1024 × 256 respectively and a L2 normalization output layer. The input layer is of size 80, 000, which takes spectra as input in the form of sparse vectors with intensity values normalized to zero mean and unit variance and mass binning of 0.1 Da. Both hidden layers use rectified linear unit or ReLU as the activation function. A dropout mechanism with the probability value of 0.3 is used after the first hidden layer to avoid over-fitting.

Peptide Sub-Network (PSN).
The PSN branch of SpeCollate processes the peptides and embeds them onto the surface of the same hyper-sphere in IR 256 enabling the direct comparison between spectra and peptides. The PSN consists of one bi-directional long short-term memory (Bi-LSTM) layer followed by two fully connected layers. An embedding layer is added before the Bi-LSTM layer to embed each amino acid character into a floating-point vector of size 256. We use a vocabulary of size 30 including 20 amino acids, 9 modifications, and a blank space. Bi-LSTM has the hidden dimension of 1024, and the outputs from both forward and backward passes are concatenated to get an output of a total length of 2048. This output is further fed to two fully connected layers of size 2048 × 1024 and 1024 × 256. ReLU activation function is used for the fully connected layers, and dropout with the probability of 0.3 is used after the Bi-LSTM and the first fully connected layer.

Train, test, and validation datasets
The training/testing dataset is generated from spectral libraries obtained from online repositories, NIST and MassIVE. The spectral libraries are preprocessed to generate two sets of data, i.e., spectra and their corresponding peptides. We obtained * 4.8 million spectra with known peptide sources containing * .5 million spectra from modified peptides. Train/Test split of 0.8/0.2 is applied such that no peptides are overlapping among the two splits. A separate spectral library is held out for validating the trained network, which is never used for training or testing purposes. Details of libraries used for train, test, and validation are provided in S1 File. Nine different types of modifications found in the training dataset are used for training the network. These modifications and their corresponding character values are given in Table 1. Details of the training dataset are given in Table 2.
Spectra are preprocessed into dense vectors (of length 80, 000) containing intensity values normalized to zero mean and unit variance. Each index of spectra vector represents an m/z bin of width 0.1 Da while the values at each index are the corresponding normalized intensities, hence each vector can hold a spectrum of up to 8, 000Da mass. Note that, due to the preprocessing step, the fragment tolerance is fixed to 0.1 Da for SpeCollate when performing database search. Peptide strings are padded with zero characters to the length of 64 before feeding to the

Training
The training process begins by a forward pass of a batch (a subset of 1024 data points) containing experimental spectra and their corresponding (positive) peptides through SSN and PSN, respectively. At this point, the dataset does not consist of sextuplets as the negatives examples have not been selected yet. Once a batch is forward passed through the network, the four negative examples for each positive pair (q i 2 Q, p i 2 P) are mined where Q is the set of embedded spectra, and P is the set of embedded peptides. A negative tuple (q j , p k ) for q i is mined such that q j is the closest negative spectrum to q i and p k is the closest negative peptide to q i . Similarly, a negative tuple (q l , p m ) for p i is mined such that q l is the closest negative spectrum to p i and p m is the closest negative peptide to p i . Hence, a sextuplet S ¼ ððq i ; p i Þ; q j i ; p k i ; q l i ; p m i Þ containing a query (or anchor) spectrum, positive peptide, two negative spectra and two negative peptides is constructed via online sextuplet mining for each positive example in the training dataset. The learning parameters are given in Table 3.

Online hardest sextuplet mining.
Here we will derive the mathematical formulation of online negative mining to generate sextuplets. Given a batch B containing b training samples i.e. two sets Q � and P � . After forwarding Q � through SSN and P � through PSN we get embedded spectra Q ¼ f SSN ðQ � Þ and peptides P ¼ f PSN ðP � Þ where Q; P � IR 256 . To efficiently compute negative examples for each positive pair (q i 2 Q, p i 2 P), three distance matrices, D Q×Q , D Q×P , and D P×P , containing pairwise squared L2 distances (SED) of spectra and peptides and the diagonal of G Q as: Then D Q×Q is given by: where 1 is a vector containing all ones and is the same length as g Q i.e. b. D Q×P can be derived in a similar fashion as follows: Let Once these matrices are calculated, the four negatives can be obtained using the min function over matrices. Let the elements of matrices D Q×Q , D Q×P , and D P×P be represented by qq ir , pp ir , and qp ir respectively where i, r represent the row and the column indexes. Then the subscripts j i , k i , l i , and m i for the negative examples in the sextuplet S can be determined using the following four equations: These subscripts indicate the corresponding indices of the negative spectra and peptides in sets Q and P; they can be directly accessed for loss calculation. Once all the sextuplets are constructed in a given batch, the loss value is computed using the custom-designed SNAP-loss function. The gradient update is back-propagated through both SSN and PSN. The online sextuplet mining is visualized in Fig 5.

SNAP-loss
The training objective is to minimize the SED between a given spectrum and its corresponding positive peptide while maximizing it for negative examples. To achieve this, we adopt a similar approach to the Triplet-Loss function [33] which works on triplets (A, P, N) with A as the anchor, P as the positive example and N as the negative example. In this case, the differences between SEDs among A and P kA − Pk 2 , and A and N kA − Nk 2 is minimized with a constant margin value, added to the positive distance as shown below.
maxðkA À Pk 2 À kA À NÞk 2 þ margin; 0Þ This approach works well where data with a single modality is dealt with, e.g., image verification in the case of FaceNet [29]. We design SNAP-loss which extends Triplet-Loss to multi-modal data, in our case numerical spectra and sequence peptides. For this purpose, we consider all possible negatives (q j , p k , q l , p m ) for a given positive pair (q i , p i ) and average the total loss. The four possible negatives are explained below: • q j : The negative spectrum for q i .
• p k : The negative peptide for q i .
• q l : The negative spectrum for p i .
• p m : The negative peptide for p i .
To calculate the loss value, we first define a few variables that are precomputed in distances matrices above as follows: Let

PLOS ONE
Then the SNAP-loss is calculated for a batch of size b as follows: The training process is visualized in Fig 5.

Similarity inference
Once the training is complete, we perform the similarity inference on the test dataset by simply transforming the peptides and spectra into the embedded subspace and applying the nearest neighbor search. Fig 3 shows the resultant euclidean space is R 256 , where all the peptides and spectra are projected onto.

Indexing.
Since a large number of spectra might need to be searched against peptides, we can index the peptides by precomputing the embedded feature vectors and store them for later use. Similar pre-computation can be performed for the experimental spectra before performing the search to avoid repeated encoding, as each experimental spectrum needs to be searched against multiple peptides.
The L2 distance measure can be efficiently calculated on the GPU by computing the masked distance matrix for the peptides that fall within the precursor m/z range. Furthermore, this process can easily scale to multiple GPUs, making it feasible for large datasets. The inverse of the L2 distance (1/L2) is reported as the match score.
3.5.2 L2 distance measure. To measure the L2 distance between the embedded set of spectra Q and peptides P, we split Q into batches of size 1024. On the other hand, peptides are selected for each batch of spectra based on the precursor tolerance, and their number can vary. We limit the maximum number of peptides in a batch to 16384 due to the memory limit (32 GBs) of the GPU. If more than 16384 peptides fall within the precursor window, they are further split into sub-batches, and the search process is repeated for each sub-batch. As a result, we have two matrices A 1024×256 and B <16384×256 containing a batch of spectra and a sub-batch of peptides, respectively. Parallel distance matrix D A×B calculation is performed using the following equation: where g A is the diagonal vector of the Gramian matrix G A of A and g B is the diagonal vector of the Gramian matrix G B of B. D A×B is a 1024× � 16384 distance matrix and contains the distances of each spectrum in A to each peptide in B, i.e., each row in D A×B represents an experimental spectrum while each column represents a peptide. Since not all peptides lie within the precursor window of every spectrum, we compute a mask matrix M of the same size as D A×B which contains 1 for peptides that fall within the precursor window of each spectrum and 0 for the rest. Matrix M is calculated dynamically based on the user-provided precursor tolerance (Da or ppm). Hadamard product of D A×B and M gives the distance measure of only relevant peptide-spectrum pairs. For each spectrum, 5 top-scoring peptide (minimum distance) are kept, and the rest are discarded, giving a resultant score matrix of size 1024×5, which is stored for posterior analysis later.

Experiments & results
In this section, we will discuss the experimental setup, present results, and provide a discussion.

Experimental setup
We train our network for 200 epochs on a dataset of size * 4.8 million sextuplets. The training is performed on an NVIDIA V100s (32 GB SMX2) GPUs with 32GB of memory provided by Expanse SDSC supercomputer installed in nodes with up to 4 GPU nodes, 40 CPU cores, and 384 GBs of memory (10 CPU cores and 96 GBs per GPU node). We use PyTorch 1.6 to design our network using python 3.7.4 on Linux 16.04. We also perform the database search on the test dataset to measure the quality of results just by comparing the embeddings.

Training and model evaluation
We train the network for 200 epochs to achieve validation accuracy of 95.5%. The accuracy is measured by the ratio of the number of times the correct peptide is the closest one to the anchor spectrum to the total number of spectra in a batch. Let us define the true peptide tp as a Boolean function which outputs 1 if the closest peptide p to the anchor q in the current batch B is the true peptide p q and zero otherwise.
where p q is the true peptide for q, B is the current batch, and P B and Q B represent the peptides and spectra in B respectively. Fig 6 shows the training process in terms of accuracy of training and validation data as well the loss value.
We plot the ROC curves as well as Precision-Recall curves for comparing the performance of the three scoring functions. As shown in Fig 7, SpeCollate performs significantly better than XCorr and Hyperscore in closed search. Note that ROC curves tend to overestimate the model's skill when the classes are not balanced, and there are far more true negatives than false positives. Therefore, for a scenario where positive examples are far more valuable than the negative ones (such as when searching for a peptide-spectrum match), Precision-Recall curves better represent the performance as the true-negatives are not considered for either precision or recall calculation. Fig 8, provides the UMAP [34] visualization of spectra and peptide embeddings. Embeddings are generated for library spectra provided by Proteome Tools by first sorting the spectra according to their precursor mass and then selecting spectra for 40 unique peptides, sequentially, for precursor mass value of � 2000 Da., such that the number of spectra for each peptide is � 15 to understand how SpeCollate maps spectra of different charges (2)(3)(4)(5) close to the corresponding peptide embeddings. Selecting spectra that are next to each other in precursor mass helps us visualize the separation of different clusters within a precursor mass window when performing the database search. Similar visualizations for different precursor mass values (1500-3000) are provided in S3 File.

Database search
The model is also tested in a complete proteomics pipeline using the dataset from NIST as well as three real-world datasets (PXD000612 [35], PXD009861 [36], PXD001468 [37]), obtained from PRIDE, by performing database search against the human proteome and obtaining PSMs and peptides reported under 1% FDR. The performance is compared against Crux [38] and MSFragger [7] by performing separate target decoy database searches and FDR estimation using Percolator [39]. Closed search is performed with precursor mass tolerance of ±5ppm. Note that fragment tolerance is fixed to 0.1 as MS/MS spectra are preprocessed into arrays

PLOS ONE
with each index representing a bin width of 0.1 Da. The protein database is in-silico digested using tryptic digestion, allowing up to two missed cleavages and 7-50 amino acids per peptide. Different peptide databases with zero and up to one and two phosphorylation and oxidation sites per peptide are generated and searched against different datasets. The decoy entries are generated by reversing the target peptides (except the first and the last amino acid) and removing any overlapping peptides from the decoy database. The resulting peptide databases contain * 2.7M, * 11.7M, and * 32.3M peptides for zero, one and two phosphorylation sites respectively while for oxidation, the peptide databases contain * 2.7M, * 3.9M, and * 4.2M peptides. The search is performed by generating theoretical spectra of fixed intensity values and reporting 5 top-scoring peptide spectrum matches per spectrum for each target and decoy database. Percolator/PeptideProphet then reorders these PSMs to estimate false-discovery rate, and a subset of PSMs and peptides estimated to have FDR < 1% is reported. As SpeCollate can only generate charge-independent peptide embeddings, we limit the fragment charge of theoretical spectra generated by Crux and MSFragger to +1 for a fair comparison.

PLOS ONE
identified PSMs and peptides under 1% FDR. We do not perform a search with no modifications as the NIST dataset being searched contains only phosphorylated spectra.
Next we perform the comparison using real-world datasets PXD000612 [35], PXD009861 [36], and PXD001468 [37] in Fig 10 (See S2 File for details about these datasets). For the PXD000612 dataset, the search is performed against different number of phosphorylation sites (up to two) added to the peptide database. In contrast, for dataset PXD009861, the search is performed against databases with different number of oxidized M-residues (up to two per peptide) while for dataset PXD001468, the search is performed against database with one n-term acetylation and up to two oxidation sites (max 2 modifications per peptide). SpeCollate is able to outperform Crux and MSFragger in all experiments except for PXD000612 searched against the single phosphorylated database where the number of peptides identified by SpeCollate is slightly less than Crux. Comparison using default XCorr and MSFragger settings (i.e., allowing theoretical fragments with > +1 charge) is provided in S5 File.
We also investigate the overlap of identified peptides by SpeCollate, Crux, and MSFragger using Venn diagrams as shown in Fig 11 for each experiment performed in Fig 10. As can be seen, most of the peptides for each experiment are common among the three search tools. However, SpeCollate does identify more unique peptides than Crux and MSFragger. Generally, we discovered that compared to Crux and MSFragger, unique peptides identified by Spe-Collate were mostly of smaller length 10. On the other hand, peptides identified by Crux and MSFragger (and missed by SpeCollate) followed similar length distribution to the overall set of identified peptides. On the other hand, the precursor charge distribution of unique peptides for each tool remains similar as the identified peptides are mostly of lower charge. As expected, both XCorr and MSFragger identify greater number of unique peptides with higher precursor charge when their default settings are used (i.e. theoretical fragments with > +1 charge are allowed).

Discussion
SpeCollate provides a stepping stone towards mass-spectrometry-based proteomics database search using deep learning. In contrast to previous approaches, SpeCollate maps both the spectra and peptide into an embedding which can then directly be compared with each other and eliminates the need for scoring functions and spectrum simulators. Our proposed cross-model method is then shown to work with deep learning methods that can outperform the state-theof-art techniques. Through extensive experimentation using spectral libraries and real-world MS/MS data, we demonstrate the efficiency of our cross-modal learning technique, presenting further opportunities to the research community for exploring simulator-less peptide database search implementations. For PXD000612, the search is performed against a peptide database generated from the human proteome with zero and up to one and two phosphorylation (amino acids S, T, and Y) modifications for subplots left, middle, and right, respectively. For dataset PXD009861, the search is performed against databases with a different number of oxidized M-residues (up to two per peptide), while for dataset PXD001468, the search is performed against a database with one n-term acetylation and up to two oxidation sites (max two modifications per peptide). SpeCollate can outperform Crux and MSFragger in terms of PSMs while giving a comparable performance in terms of the number of peptides. https://doi.org/10.1371/journal.pone.0259349.g010

Limitations and future directions
Currently, SpeCollate is trained to match spectra to their exact peptide sequence, i.e., modified spectra are not embedded close to the non-modified peptide. Moreover, peptide embeddings are charge-independent, and the model only generates one embedding per peptide. Although PLOS ONE spectra with a higher precursor charge > 2 are mapped close to the peptide, they tend to be relatively farther than precursor charge 2. One of the future goals of our work is to enable opensearch using deep learning, which in addition to identifying modified spectra, can also be used as a pre-search filter to extract a subset of relevant peptides. Furthermore, our next step is to enable charge-specific peptide embeddings to map spectra with different charges to their corresponding peptides efficiently. To this end, several research problems remain open, including curating improved training datasets, improving the network architecture to better capture the input domain, employing adversarial learning for bridging the modality gap between peptides and spectra and improving the training process by redesigning the loss-function from the ground up.

Conclusion
This paper designs and implements SpeCollate, a deep similarity network for proteomics, to learn a cross-modal similarity function between peptides and spectra to identify peptide-spectrum matches. Proteomics has entered the realm of Big-Data, and the number of available labeled and annotated spectra is increasing rapidly, enabling sophisticated models to be trained. SpeCollate provides a data-oriented algorithm design for peptide database search, eliminating the inherent problems associated with numerical strategies. This is achieved by learning a cross-modal similarity function that embeds spectra and peptides in a shared euclidean subspace for direct comparison. SpeCollate learns the similarity function by optimizing a custom-designed loss function, SNAP-loss, which trains on sextuplets of data points to project positive examples closer to each other while pushing negative examples far from each other. By training on 4.8 million sextuplets, SpeCollate is able to achieve a remarkable test accuracy of 95.5%. Although the test accuracy is relatively high, there is still room for real-world database search accuracy improvement. This can be improved by fine-tuning BiLSTM and generating better training data to enable the network to learn more features in future work.