Dispec: A Novel Peptide Scoring Algorithm Based on Peptide Matching Discriminability

Identifying peptides from fragmentation spectra is a fundamental step in mass spectrometry (MS) data processing. The significance (discriminability) of each peak varies, and this variation provides additional information that can enhance identification sensitivity and the correct match rate. However, this information was not considered in previous algorithms. Here we present a novel method based on Peptide Matching Discriminability (PMD), in which the PMD of each peak reflects how well it discriminates among candidate peptides. We also developed a novel peptide scoring algorithm, Dispec, that takes three aspects of discriminability into account: PMD, intensity discriminability and m/z error discriminability. Compared with Mascot and Sequest, Dispec identified remarkably more peptides from three experimental datasets at the same confidence level (1% PSM-level FDR). Dispec is also robust and versatile across datasets obtained on different instruments. The concept of discriminability enhances peptide identification and may thus contribute substantially to proteome studies. As an open-source program, Dispec is freely available at http://bioinformatics.jnu.edu.cn/software/dispec/.


Introduction
In mass spectrometry (MS) analysis, the mass of each peptide is measured, and the peptide is then selected and fragmented to obtain MS/MS spectra [1]. These second-order spectra are interpreted by algorithms to determine peptide sequences. The large number of spectra obtained from LC-MS/MS experiments poses a challenge to peptide identification [2,3]. A number of peptide identification algorithms for MS data analysis are available, each using a different way to select significant peaks, compare them with the theoretical peaks and score the similarity [4][5][6][7][8][9][10][11]. However, one important type of information in the spectra, the discriminability of each peak, was not considered in any of these algorithms [4][5][6][7][8][9][10][11][12][13].
The discriminability of a peak (an MS/MS fragmentation peak) is a type of score that characterizes the confidence of peptide matching: the distinguishability of the matched peptide from other peptides, and the distinguishability of a real fragment ion from random ones. It comprises three aspects: peptide matching discriminability for candidate peptides, intensity discriminability, and m/z error discriminability between theoretical and experimental spectra (see below for details). The peptide matching discriminability of each peak for candidate peptides can differ widely, providing information of varying quality and confidence. It is a property of the peak itself, not derived from any overall statistics of the spectra, and can thus serve as additional independent information to improve the sensitivity and confidence of identification. We developed a novel model of Peptide Matching Discriminability (PMD) to calculate the discriminability of each peak for candidate peptides from MS/MS spectra. We further developed an open-source program, Dispec, based on the PMD model and compared it with other algorithms using the standard 18 proteins dataset and an E. coli proteome dataset. Dispec demonstrated higher sensitivity and confidence in identifying peptides from different MS datasets at 1% PSM-level false discovery rate (PSM-level FDR), indicating that the PMD concept provides important insight for peptide identification.

Mass Spectrometry Datasets and Data Preprocessing
A dataset of a mixture of 18 standard proteins was used to test the accuracy, robustness and versatility of Dispec. The dataset, measured on four instruments (Thermo Finnigan LTQ-FT, Thermo Finnigan LCQ DECA, Thermo Finnigan LTQ and Micromass/Waters QTOF Ultima, abbreviated below as FT, LCQ, LTQ and QTOF, respectively), was obtained from https://regis-web.systemsbiology.net//PublicDatasets/ [14]. This dataset is widely used to validate peptide scoring algorithms and to test the dynamic range of an algorithm [15]. The LTQ-Orbitrap data obtained from the S. pneumoniae D39 protein identification (http://bioinformatics.jnu.edu.cn/software/dispec/), containing more than 270,000 spectra, served as the training dataset for the parameters of the model [16]. The E. coli proteome dataset was obtained from http://marcottelab.org/MSdata/Data_03/ [17].
For S. pneumoniae D39 and E. coli datasets, the raw format files were converted to dta file format by Bioworks 3.31 (Thermo Finnigan, San Jose, CA). For the 18 proteins dataset, the dta format files were obtained from the website. All the dta format files were merged to Mascot generic format (mgf) by the merge.pl program (http://www.matrixscience.com/downloads/merge.zip). The dta format files were the input files of Dispec and Sequest software.

MS/MS Database Search
For target-decoy based FDR calculation, the D39 database contains 1914 real protein sequences and the constructed forward/reverse database contains 3828 protein sequences; the 18 proteins database contains 1822 real protein sequences and the constructed forward/reverse database contains 3644 protein sequences; the E. coli database contains 4279 real protein sequences and the constructed forward/reverse database contains 8558 protein sequences. The Mascot 2.3 search engine (Matrix Science, London, UK) was used to search the Mascot generic format (mgf) files. The dta format files were searched using the Sequest search engine (Thermo Fisher Scientific, Waltham, MA, version 28.13) and Dispec. For Mascot, Sequest and Dispec, the following search criteria were applied: full tryptic specificity was required; two missed cleavages were allowed; Cys (+57.021464 Da, carbamidomethylation) was set as a fixed modification, whereas Met (+15.994915 Da, oxidation) was considered as a variable modification.
The precursor ion mass tolerances and fragment ion mass tolerances vary according to the instrument type (Table 1). The fragment ion tolerance of Sequest was set to 1.0 Da since it requires integer values for m/z [4].

False Discovery Rate (FDR) at PSM-level
The top-ranked peptide spectrum matches (PSMs) were extracted from the Mascot data file (.dat) with our in-house Matlab program and exported to calculate the FDR threshold at PSM-level. Top-ranked PSMs with ΔCn ≥ 0.1 were extracted from the Sequest output files (.out) and exported in the same way. Dispec results and the extracted results of Mascot and Sequest were written to csv format files. The scores of the best-ranking target and decoy PSMs were sorted in ascending order to calculate the FDR at PSM-level by Käll's method [18][19][20]. FDR at PSM-level was calculated as the ratio between the numbers of decoy and target PSMs above the threshold. The scoring functions differ between search algorithms. For Mascot, the ion scores were sorted to calculate FDR at PSM-level when peptide length ≥ 6; for Sequest, the Xcorr scores were sorted to calculate FDR at PSM-level separately for each precursor charge when peptide length ≥ 6 and ΔCn ≥ 0.1; for Dispec, the Sp scores were sorted to calculate FDR at PSM-level when peptide length ≥ 6.
All score thresholds for 1% FDR at PSM-level were calculated by our Matlab program. The numbers of identified unique peptides were compared at FDR ≤ 0.01.
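The decoy-to-target ratio described in this section can be illustrated with a minimal Python sketch. It is not the authors' Matlab program: the function names `psm_fdr` and `score_threshold` are hypothetical, "above threshold" is assumed to mean a score greater than or equal to the threshold, and the q-value refinements of the cited method are left out.

```python
def psm_fdr(target_scores, decoy_scores, threshold):
    """PSM-level FDR: decoy PSMs at or above threshold / target PSMs at or above it."""
    n_target = sum(1 for s in target_scores if s >= threshold)
    n_decoy = sum(1 for s in decoy_scores if s >= threshold)
    return n_decoy / n_target if n_target else 0.0


def score_threshold(target_scores, decoy_scores, max_fdr=0.01):
    """Lowest observed score whose FDR does not exceed max_fdr (None if there is none)."""
    # Scan candidate thresholds in ascending order, as in the sorted score list.
    for t in sorted(set(target_scores) | set(decoy_scores)):
        has_targets = any(s >= t for s in target_scores)
        if has_targets and psm_fdr(target_scores, decoy_scores, t) <= max_fdr:
            return t
    return None


if __name__ == "__main__":
    targets = [10, 9, 8, 7, 6, 5]   # illustrative scores, not real data
    decoys = [6.5, 4, 3]
    print(score_threshold(targets, decoys, max_fdr=0.01))
```

Peptides would then be counted only among target PSMs scoring at or above the returned threshold.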

Training Dataset of Intensity and m/z Error Discriminability
The identification result of the D39 dataset at PSM-level FDR ≤ 0.01, comprising 97535 spectra and 3570 unique peptides, was considered a high-confidence, correct result. The corresponding reversed sequences of these peptides were considered incorrect peptides. These high-confidence peptides and incorrect peptides serve as the training set for the statistical analysis.

Peak Selection
In the Dispec algorithm, peaks closer than 1 ± 0.25 Da are considered isotope peaks and were filtered out [9,10]. The range between the maximum and minimum m/z values of the experimental spectrum was divided into 10 equal bins. The 20 most intense peaks in every bin were selected, and the intensity of each selected peak was normalized against the highest intensity [4,11].
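As an illustration of the binning and normalization step, the following Python sketch selects the most intense peaks per m/z bin. It is a simplified reading of the text, not the Dispec source: the function name `select_peaks` is hypothetical, isotope filtering is omitted, and intensities are assumed to be positive.

```python
def select_peaks(peaks, n_bins=10, top_n=20):
    """Keep the top_n most intense peaks in each of n_bins equal m/z bins.

    peaks: list of (mz, intensity) tuples.
    Returns the selected peaks sorted by m/z, with intensities divided by
    the highest intensity.
    """
    if not peaks:
        return []
    mzs = [mz for mz, _ in peaks]
    lo, hi = min(mzs), max(mzs)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 if all m/z are equal
    bins = [[] for _ in range(n_bins)]
    for mz, inten in peaks:
        idx = min(int((mz - lo) / width), n_bins - 1)  # max m/z goes to the last bin
        bins[idx].append((mz, inten))
    selected = []
    for b in bins:
        selected.extend(sorted(b, key=lambda p: p[1], reverse=True)[:top_n])
    top = max(inten for _, inten in selected)
    return sorted((mz, inten / top) for mz, inten in selected)
```

Because the most intense peak of the spectrum is always the most intense peak of its own bin, normalizing against the maximum of the selected peaks is the same as normalizing against the maximum of the whole spectrum.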

Theoretical Spectra
The theoretical spectra were generated according to the breakage pattern of peptide bonds. We considered b/y fragment ions, together with a water loss (b-H2O or y-H2O) when the b or y fragment ion contains S, T, E or D, and an ammonia loss (b-NH3 or y-NH3) when the b or y fragment ion contains R, K, Q or N. For parent ions with charge ≥ +1, we considered +1 charge fragment ion peaks. For parent ions with charge ≥ +2 whose fragment ions contain one of the amino acids R, K or H, we also considered +2 charge fragment ion peaks [5,9,10,21].
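The ion-generation rules above can be sketched in Python as follows. This is an illustrative sketch rather than the Dispec implementation: the function name `theoretical_ions` and the ion labels are hypothetical, standard monoisotopic residue masses are used, and fixed or variable modifications are not applied.

```python
# Standard monoisotopic residue masses (Da).
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON, WATER, AMMONIA = 1.007276, 18.010565, 17.026549


def theoretical_ions(peptide, parent_charge=1):
    """Return {label: m/z} for b/y ions and their H2O/NH3 neutral-loss variants."""
    ions = {}
    n = len(peptide)
    for i in range(1, n):
        for series, frag in (("b", peptide[:i]), ("y", peptide[n - i:])):
            neutral = sum(MONO[a] for a in frag) + (WATER if series == "y" else 0.0)
            variants = {"": 0.0}
            if set(frag) & set("STED"):      # water loss rule
                variants["-H2O"] = WATER
            if set(frag) & set("RKQN"):      # ammonia loss rule
                variants["-NH3"] = AMMONIA
            charges = [1]
            if parent_charge >= 2 and set(frag) & set("RKH"):
                charges.append(2)            # doubly charged fragment rule
            for suffix, loss in variants.items():
                for z in charges:
                    label = f"{series}{i}{suffix}" + ("" if z == 1 else f"({z}+)")
                    ions[label] = (neutral - loss + z * PROTON) / z
    return ions
```

For example, `theoretical_ions("PEPTIDE")` contains a `b2` ion but no `b1-H2O` ion, since the first residue alone carries none of S, T, E or D.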

Peptide Matching Discriminability (PMD) for Candidate Peptides
A selected experimental peak in the MS/MS spectrum matches a peptide if it matches at least one theoretical fragment ion peak of that peptide. For each selected peak, the number of candidate peptides it matches (the peptide matching number) was calculated as $M_i$ ($i = 1, 2, \ldots, n$). We then calculated the average peptide matching number of all matched peaks:

$$\bar{M} = \frac{1}{n}\sum_{i=1}^{n} M_i.$$

Compared with the other peaks of the same MS/MS spectrum, it reflects the peptide matching confidence of a peak. The selected peaks and their peptide matching numbers of candidate peptides are shown as the PMD for candidate peptides (Figure 1, Figure S1). Importantly, peaks with higher intensities do not necessarily possess higher discriminability.

Table 2. The intensity discriminability of b-ions ($I(b_j)$), y-ions ($I(y_j)$) and the six types (b, b-H2O, b-NH3, y, y-H2O and y-NH3) of theoretical ions ($I(s_j)$).

Table 3. The m/z error discriminability of b-ions ($T(b_j)$), y-ions ($T(y_j)$) and the six types (b, b-H2O, b-NH3, y, y-H2O and y-NH3) of theoretical ions ($T(s_j)$).
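The peptide matching number described in this section can be illustrated with a short Python sketch. The function names, the dictionary layout of the candidates and the 0.5 tolerance are assumptions made for the example; averaging over peaks that match at least one candidate is one reading of "all matched peaks".

```python
def peptide_matching_numbers(peak_mzs, candidates, tol=0.5):
    """For each peak, count the candidate peptides it matches (M_i).

    candidates: {peptide: [theoretical fragment m/z, ...]}.
    A peak matches a peptide if it lies within tol of at least one
    theoretical fragment ion of that peptide.
    """
    counts = []
    for mz in peak_mzs:
        m = sum(
            any(abs(mz - t) <= tol for t in theo)
            for theo in candidates.values()
        )
        counts.append(m)
    return counts


def average_matching_number(counts):
    """Mean of M_i over peaks that match at least one candidate peptide."""
    matched = [m for m in counts if m > 0]
    return sum(matched) / len(matched) if matched else 0.0
```

A peak that matches only a few candidates separates them better than a peak that matches almost all of them, which is the intuition behind treating the matching number as a per-peak property.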

Statistical Analysis for Intensity Discriminability
In many previous algorithms, intensity information was used to calculate the similarity score between the experimental peaks and the theoretical peaks [4,8,10,11,22]. For consistency with the PMD information, we defined intensity discriminability to make use of the peak intensity information. We divided the normalized peak intensity range [0, 1] into 10 equal intervals, with an additional category for the highest peak for rounding convenience (Table 2).

Statistical Analysis for m/z Error Discriminability
In some algorithms, e.g. pNovo, MassWiz and DeltAMT [7,15,21], the m/z error between the experimental peaks and the theoretical peaks was considered when calculating the similarity score. Some studies [15,23] showed that the m/z error distribution differs remarkably between correctly and wrongly matched peaks. The m/z error of correctly matched peaks was mainly less than 1/5 of the error window, whereas the m/z error of wrongly matched peaks could extend over the rest of the window [23]. This difference provides independent and additional information reflecting the probability of a correct match. Therefore, we introduced m/z error discriminability in our algorithm.
Similar to the intensity discriminability, the m/z error interval [0, 0.5] between experimental and theoretical fragment ions was divided into 10 equal intervals and an additional category of 0. In the j-th ($j = 1, 2, \ldots, 10$) interval, the m/z error discriminability $T_j$ was calculated by the formula $T_j = N \ldots$; the resulting values are listed in Table 3.

Scoring Function
The scoring process of the Dispec algorithm uses the above three types of discriminability information to evaluate identifications and matches. The scoring model based on PMD mainly considers three aspects: fragment ion matches, consecutive fragment ion matches and b/y fragment ion matches [21]. Each candidate peptide is scored, and the scoring function is described as follows.
Fragment ion scoring. When an experimental peak matches a theoretical fragment ion peak of a peptide within the fragment error tolerance, the fragment ion discriminability of the j-th matching peak is defined as $p_j = D(m_j)\,I(s_j)\,T(s_j)$. The total discriminability is $D = \sum_j \log_{10}(p_j)$, which is equivalent to $D = \log_{10}\left(\prod_j p_j\right)$, and the discriminability score of all matching ions is

$$S_0 = D \cdot \frac{k_0}{0.1406\, n_0},$$

where $k_0$ = number of matched fragment peaks in the experimental spectrum; $n_0$ = number of theoretical fragment peaks; and 0.1406 = random matching probability of the theoretical spectrum, which reflects the matching ability between the experimental spectrum and the decoy theoretical spectrum and is calculated from the training dataset using the following formula:

$$\frac{\text{sum of the random peptide matching peaks number}}{\text{sum of the random peptide theoretical peaks number}}$$

Consecutive fragment ion scoring. Multiple consecutive ion matches can be converted into a series of ion pair matches: N consecutive ion matches are converted into N-1 matches of two consecutive ions. For example, if the $b_1$, $b_2$ and $b_3$ ions are consecutively matched, this consecutive ion match is converted into two match pairs: $b_1$-$b_2$ and $b_2$-$b_3$. The total discriminability of consecutive matches is $D_1 = \sum_j \log_{10}(p_l\, p_m)$ and the score of consecutive matches is

$$S_1 = D_1 \cdot \frac{k_1}{0.0279\, n_1},$$

where $p_l$ = the discriminability of the l-th matched peak and $p_m$ = the discriminability of the m-th matched peak (a consecutive ion match comprises the l-th and m-th matches); $k_1$ = number of consecutive matches in the experimental spectrum; $n_1$ = number of theoretical consecutive matches; and 0.0279 = random consecutive matching probability of the theoretical spectrum, which reflects the consecutive matching ability between the experimental spectrum and the decoy theoretical spectrum and is calculated from the training dataset by the following formula:

$$\frac{\text{sum of the random peptide consecutive matching number}}{\text{sum of the random peptide theoretical consecutive matching number}}$$

b/y-fragment ion scoring. The intensity and m/z error discriminability of b/y-ions (especially y-ions) are mostly higher than the discriminability of the six ion types (Table 2 and Table 3). This implies that b/y-ion matches are more efficient in the identification. Hence, the b/y-ion discriminability is considered separately in the scoring function. To score the b/y-ion peaks, the b/y-ion discriminability is first calculated as $p(i_j) = D(m_j)\,I(i_j)\,T(i_j)$, $i = b, y$. With $D_2 = \sum_j \log_{10} p(i_j)$, the score of b/y-ion matches is then calculated as

$$S_2 = D_2 \cdot \frac{k_2}{0.0706\, n_2},$$

where $k_2$ = number of peaks matching b-ions and y-ions; $n_2$ = number of b-ions and y-ions produced by the theoretical spectra; and 0.0706 = random b/y-ion matching probability of the theoretical spectrum, which reflects the b/y-ion matching ability between the experimental spectrum and the decoy theoretical spectrum and is calculated from the training dataset by the following formula:

$$\frac{\text{sum of the random peptide b/y-ions matching number}}{\text{sum of the random peptide b/y-ion theoretical peak number}}$$

The overall score S(p) is the sum of the above three scores: $S(p) = S_0 + S_1 + S_2$.
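The structure of the scoring function can be sketched in Python as below. This is a hedged illustration, not the Dispec code: the function names are hypothetical, the form "total discriminability × k / (random probability × n)" is taken from the consecutive-match score and is assumed here to apply to the fragment ion and b/y-ion components as well, and the per-peak discriminabilities are passed in as plain numbers rather than computed from D, I and T.

```python
import math

# Random-match probabilities quoted in the text (fragment, consecutive, b/y).
P_RANDOM = {"fragment": 0.1406, "consecutive": 0.0279, "by": 0.0706}


def log_sum(p_values):
    """Total discriminability: sum of log10(p_j), i.e. log10 of the product."""
    return sum(math.log10(p) for p in p_values)


def component_score(d_total, k, n, p_random):
    """Assumed common form of each component: d_total * k / (p_random * n)."""
    return d_total * k / (p_random * n) if n else 0.0


def overall_score(p_all, n0, pairs, n1, p_by, n2):
    """Sum of the fragment, consecutive and b/y component scores.

    p_all: discriminabilities of all matched peaks; n0: theoretical peaks.
    pairs: (p_l, p_m) for each consecutive match; n1: theoretical consecutive matches.
    p_by:  discriminabilities of matched b/y peaks; n2: theoretical b/y ions.
    """
    s0 = component_score(log_sum(p_all), len(p_all), n0, P_RANDOM["fragment"])
    d1 = sum(math.log10(pl * pm) for pl, pm in pairs)
    s1 = component_score(d1, len(pairs), n1, P_RANDOM["consecutive"])
    s2 = component_score(log_sum(p_by), len(p_by), n2, P_RANDOM["by"])
    return s0 + s1 + s2
```

Each component scales the summed log-discriminability by the ratio of the observed match rate (k/n) to the match rate expected for a random peptide, so a peptide matching no better than a decoy gains little from that factor.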

Comparison of Dispec with Mascot and Sequest
We compared our algorithm Dispec (Matlab version) with two widely-used MS identification algorithms Mascot and Sequest using three datasets: in-house generated S. pneumoniae D39 dataset, 18 standard proteins mixture and E. coli datasets.
For the S. pneumoniae D39 dataset, all algorithms were able to identify more than 3000 peptides and more than 97500 spectra at PSM-level FDR ≤ 0.01 (Figures 2A and 2B). Most of the peptides (2695) and spectra (81109) were identified by all three algorithms. The overlap ratios of identified peptides and spectra between Mascot and Dispec are as high as 89.9% and 97.2%, respectively, showing good consistency of Dispec with other algorithms. As shown in Figure 3, Dispec identified more peptides and spectra than Mascot and Sequest in the PSM-level FDR range of 0.2% to 4% [18].
For the publicly available standard 18 proteins dataset obtained using four types of MS instruments (FT, LTQ, LCQ and QTOF) and the E. coli dataset (LTQ-Orbitrap), we tested Dispec's adaptability at PSM-level FDR ≤ 0.01 (Figure 4). In the comparison with Mascot and Sequest, Dispec identified more peptides than Mascot in all MS data, demonstrating its robust identification power, stability and broad applicability.

The Number of High-confidence Peptides Identified
Since all algorithms have inherent advantages and disadvantages, and different algorithms give different identification results, no single algorithm can capture all MS information. Using multiple algorithms can enhance the confidence of peptide identification. The high-confidence peptides can be used to estimate the quality of an algorithm's identifications [11,21]. We calculated the overlaps of the identified peptides for each pair of algorithms (Table S1). 'High-confidence' peptides denote peptides found by at least two of the three search algorithms and were calculated using the formula (A∩B)∪(B∩C)∪(A∩C), where A, B and C represent the peptides identified by Dispec, Mascot and Sequest, respectively. The number of high-confidence peptides identified by each of the three algorithms is shown in Figure 5. In all cases, Dispec exceeded Mascot and Sequest in identifying high-confidence peptides, demonstrating the quality of its peptide identification. The detailed data are listed in Table S2.
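The "found by at least two of three algorithms" rule maps directly onto set operations. The sketch below uses hypothetical function names and made-up peptide strings purely for illustration; counting, for each algorithm, how many of its own peptides are high-confidence is one possible reading of the per-algorithm numbers.

```python
def high_confidence(a, b, c):
    """Peptides found by at least two of three algorithms: (A∩B)∪(B∩C)∪(A∩C)."""
    return (a & b) | (b & c) | (a & c)


def high_confidence_counts(results):
    """results: {algorithm name: set of peptides}, with exactly three entries.

    Returns, for each algorithm, how many of its peptides are high-confidence.
    """
    a, b, c = results.values()
    hc = high_confidence(a, b, c)
    return {name: len(peps & hc) for name, peps in results.items()}


if __name__ == "__main__":
    demo = {
        "Dispec": {"PEPA", "PEPB", "PEPC"},   # made-up identifiers
        "Mascot": {"PEPA", "PEPD"},
        "Sequest": {"PEPB", "PEPD", "PEPE"},
    }
    print(high_confidence_counts(demo))
```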

Summary
Here we presented a novel concept based on discriminability that takes three aspects of discriminability into consideration, and an open-source peptide scoring algorithm, Dispec, based on this concept. We validated the accuracy, robustness and compatibility of Dispec by comparing it with two widely used algorithms, Mascot and Sequest. We believe that the peptide matching discriminability information of each peak will be broadly accepted and integrated into new identification algorithms as a new native property of each MS peak, enhancing identification capacity and quality, which are essential for proteome studies.

Figure S1 The selected peaks and the peptide matching number of each peak. (TIF)
Table S1 Number of the same peptides identified between any two algorithms of Mascot, Sequest and Dispec. (XLS)