Analysis by liquid chromatography and tandem mass spectrometry (LC-MS/MS) can identify and quantify thousands of proteins in microgram-level samples, such as those comprised of thousands of cells. This process, however, remains challenging for smaller samples, such as the proteomes of single mammalian cells, because reduced protein levels reduce the number of confidently sequenced peptides. To alleviate this reduction, we developed Data-driven Alignment of Retention Times for IDentification (DART-ID). DART-ID implements principled Bayesian frameworks for global retention time (RT) alignment and for incorporating RT estimates towards improved confidence estimates of peptide-spectrum-matches. When applied to bulk or to single-cell samples, DART-ID increased the number of data points by 30–50% at 1% FDR, and thus decreased missing data. Benchmarks indicate excellent quantification of peptides upgraded by DART-ID and support their utility for quantitative analysis, such as identifying cell types and cell-type specific proteins. The additional datapoints provided by DART-ID boost the statistical power and double the number of proteins identified as differentially abundant in monocytes and T-cells. DART-ID can be applied to diverse experimental designs and is freely available at http://dart-id.slavovlab.net.
Identifying and quantifying proteins in single cells gives researchers the ability to tackle complex biological problems that involve single cell heterogeneity, such as the treatment of solid tumors. Mass spectrometry analysis of peptides can identify their sequence from their masses and the masses of their fragment ions, but these pieces of evidence are often insufficient for a confident peptide identification. This problem is exacerbated when analyzing low-abundance samples such as single cells. To identify even peptides with weak mass spectra, DART-ID incorporates their retention time, i.e., the time when they elute from the liquid chromatography used to physically separate them. We present a novel method for aligning the retention times of peptides across experiments, as well as a rigorous framework for using the estimated retention times to enhance peptide sequence identification. Incorporating the retention time as additional evidence leads to a substantial increase in the number of samples in which proteins are confidently identified and quantified.
Citation: Chen AT, Franks A, Slavov N (2019) DART-ID increases single-cell proteome coverage. PLoS Comput Biol 15(7): e1007082. https://doi.org/10.1371/journal.pcbi.1007082
Editor: Jürgen Cox, Max Planck Institute of Biochemistry, GERMANY
Received: August 29, 2018; Accepted: May 6, 2019; Published: July 1, 2019
Copyright: © 2019 Chen et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: This work was funded by the National Institute of General Medical Sciences of the National Institutes of Health under Award Number DP2GM123497 (https://www.nigms.nih.gov/). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
Advancements in the sensitivity and discriminatory power of protein mass-spectrometry (MS) have enabled the quantitative analysis of increasingly limited amounts of samples. Recently, we have developed Single Cell Proteomics by Mass Spectrometry (SCoPE-MS). SCoPE-MS uses a barcoded carrier to boost the MS signal from single-cells and enhance sequence identification [1, 2]. While this design allows quantifying hundreds of proteins in single mammalian cells, sequence identification remains challenging because many lowly abundant peptides generate only a few fragment ions that are insufficient for confident identification [3, 4]. Such low confidence peptides are generally not used for protein quantification, and thus reduce the data points available for further analyses. We sought to overcome this challenge by using both the retention time (RT) of an ion and its MS/MS spectra to achieve more confident peptide identifications. To this end, we developed a novel data-driven Bayesian framework for aligning RTs and for updating peptide confidence. DART-ID minimizes assumptions, aligns RTs with median residual error below 3 seconds, and increases the fraction of cells in which peptides are confidently identified.
Multiple existing approaches, including Skyline ion matching, moFF match-between-runs, MaxQuant match-between-runs [7, 8], DeMix-Q and Open-MS FFId, allow combining MS1 spectra with other informative features, such as RT and precursor ion intensity, to enhance peptide identification. These methods, in principle, may identify any ion detected in a survey scan (MS1 level) even if it was not sent for fragmentation and a second MS scan (MS2) in every run. Thus, by not using MS2 spectra, these methods may overcome the limiting bottleneck of tandem MS: the need to isolate and fragment ions, and to analyze the fragments, in order to identify and quantify the peptide sequence.
However, not using the MS2 spectra for identification has a downside: the MS2 spectra contain highly informative features even for ions that could not be confidently identified based on spectra alone. This is particularly important when ions subjected to MS/MS are the only ones that can be quantified, as in the case of isobaric mass tags. Thus, the MS1-based methods have a strong advantage when quantification relies only on MS1 ions (e.g., LFQ and SILAC), while methods using all MS2 spectra can more fully utilize all quantifiable data from isobaric tandem-mass-tag experiments.
DART-ID aims to use all MS2 spectra, including those of very low confidence peptide-spectrum-matches (PSMs), and combines them with accurate RT estimates to update PSM confidence within a principled Bayesian framework. Unlike previous MS2-based methods, which incorporate RT estimates into features for FDR recalculation, discriminants, filters [15–17], or scores [18, 19], we update the ID confidence directly with a Bayesian framework [20, 21]. Crucial to this method is the accuracy of the alignment method; the higher the accuracy of RT estimates, the more informative they are for identifying the peptide sequence.
The RT of a peptide is a specific and informative feature of its sequence, and this specificity has motivated approaches aiming to estimate peptide RTs. These approaches either (i) predict RTs from peptide sequences or (ii) align empirically measured RTs. Estimated peptide RTs have a wide range of uses, such as scheduling targeted MS/MS experiments, building efficient inclusion and exclusion lists for LC-MS/MS [23, 24], or augmenting MS2 mass spectra to increase identification rates [14–19].
Peptide RTs can be estimated from physical properties such as sequence length, constituent amino acids, and amino acid positions, as well as chromatography conditions, such as column length, pore size, and gradient shape. These features predict the relative hydrophobicity of peptide sequences and thus RTs for LC used with MS [25–31]. The predicted RTs can be improved by implementing machine learning algorithms that incorporate confident, observed peptides as training data [15, 19, 32–35].
Predicted peptide RTs are mostly used for scheduling targeted MS/MS analyses where acquisition time is limited, e.g., multiple reaction monitoring. They can also be used to aid peptide sequencing, as exemplified by “peptide fingerprinting”, a method that identifies peptides based on an ion’s RT and mass over charge (m/z) [28, 36–38]. While peptide fingerprinting has been successful for low complexity samples, where MS1 m/z and RT space is less dense, it requires carefully controlled conditions and rigorous validation with MS2 spectra [37–41]. Predicted peptide RTs have more limited use with data-dependent acquisition, i.e., shotgun proteomics. They have been used to generate data-dependent exclusion lists that spread MS2 scans over a more diverse subset of the proteome [23, 24], as well as to aid peptide identification from MS2 spectra, either by incorporating the RT error (difference between predicted and observed RTs) into a discriminant score, or by filtering out observations by RT error to minimize the number of false positives selected [15–17]. In addition, RT error has been directly combined with search engine scores [18, 19]. Besides automated methods of boosting identification confidence, proteomics software suites such as Skyline allow the manual comparison of measured and predicted RTs to validate peptide identifications.
The second group of approaches for estimating peptide RTs aligns empirically measured RTs across multiple experiments. Peptide RTs shift due to variation in sample complexity, matrix effects, column age, room temperature and humidity. Thus, estimating peptide RTs from empirical measurements requires alignment that compensates for RT variation across experiments. Usually, RT alignment methods align the RTs of two experiments at a time, and typically utilize either a shared, confidently-identified set of endogenous peptides, or a set of spiked-in calibration peptides [42, 43]. Pairwise alignment approaches must choose a particular set of RTs that all other experiments are aligned to, and the choice of that reference RT set is not obvious. Alignment methods are limited by the availability of RTs measured in relevant experimental conditions, but can result in more accurate RT estimates when such empirical measurements are available [7, 8, 43]. Generally, RT alignment methods provide more accurate estimates than RT prediction methods, discussed earlier, but also generally require more extensive data and cannot estimate RTs of peptides without empirical observations.
Methods for RT alignment vary, ranging from linear shifts to non-linear distortions and time warping. Some have argued for the necessity of non-linear warping functions to correct for RT deviations, while others have posited that most of the variation can be explained by simple linear shifts. More complex methods include multiple generalized additive models and machine-learning based semi-supervised alignments. Once experiments are aligned, peptide RTs can be predicted by applying experiment-specific alignment functions to the RT of a peptide observed in a reference run.
Peptide RTs estimated by alignment can be used to schedule targeted MS/MS experiments, similar to the use of predicted RTs estimated from the physical properties of a peptide. RT alignments are also crucial for MS1 ion/feature-matching algorithms, as discussed earlier [5–10], as well as in targeted analyses of results from data-independent acquisition (DIA) experiments [49–51]. The addition of a more complex, non-linear RT alignment model that incorporates thousands of endogenous peptides instead of a handful of spiked-in peptides increased the number of identifications in DIA experiments by up to 30%.
With DART-ID, we implement a novel global RT alignment method that takes full advantage of SCoPE-MS data, which feature many experiments with analogous samples run on the same nano-LC (nLC) system [1, 2]. These experimental conditions yield many RT estimates per peptide with relatively small variability across experiments. In this context, we used empirical distribution densities that obviated assumptions about the functional dependence between peptide properties, RT, and RT variability and thus maximized the statistical power of highly reproducible RTs. This approach increases the number of experiments in which a peptide is identified with high enough confidence and its quantitative information can be used for analysis. The DART-ID program is freely available and can easily be run over the output of peptide search engines such as MaxQuant [7, 8].
Model for global RT alignment and PSM confidence update
Using RT for identifying peptide sequences starts with estimating the RT for each peptide, and we aimed to maximize the accuracy of RT estimation by optimizing RT alignment. Many existing methods can only align the RTs of two experiments at a time, i.e., pairwise alignment, based on partial least squares minimization, which does not account for the measurement errors in RTs. Furthermore, the selection of a reference experiment is non-trivial, and different choices can give quantitatively different alignment results. In order to address these challenges, we developed a global alignment method, sketched in Fig 1a and 1b. The global alignment infers a reference RT for the ith peptide, $\mu_i$, as a latent variable with value $\mu_{ik}$ in the kth experiment. This can be related to the measured RT for peptide i in experiment k, $\rho_{ik}$:

\[ \rho_{ik} = \mu_{ik} + \epsilon_{ik} \tag{1} \]

where $\mu_{ik} \triangleq g_k(\mu_i)$ and $\epsilon_{ik}$ is an independent mean-zero error term expressing residual (unmodeled) RT variation. As a first approximation, we assume that the observed RTs for any experiment can be well approximated using a two-segment linear regression model:

\[ \mu_{ik} = \begin{cases} \beta_k^{0} + \beta_k^{1}\,\mu_i, & \mu_i < s_k \\ \beta_k^{0} + \beta_k^{1}\,s_k + \beta_k^{2}\,(\mu_i - s_k), & \mu_i \ge s_k \end{cases} \tag{2} \]

where $s_k$ is the split point for the two-segment regression in each experiment, $\beta_k^{0}$ is the intercept, and $\beta_k^{1}$ and $\beta_k^{2}$ are the slopes of the two segments. The parameters are constrained to not produce a negative RT, and the model can be generalized to more complex monotonically-constrained models, such as spline fitting or locally estimated scatterplot smoothing (LOESS). We chose this model since we found that it outperformed a single-slope linear model by capturing more of the inter-experiment variation in RTs (S2 Fig). Based on this model, we can express the marginal likelihood for the RT of the ith peptide in the kth experiment as a mixture model whose weights are set by the confidence of the sequence assignment: $\lambda_{ik}$, the spectral posterior error probability (PEP), weights the null density, and $1 - \lambda_{ik}$ weights the inferred density; see S1 Fig for more details.

\[ P(\rho_{ik} \mid \mu_{ik}, \sigma_{ik}, \lambda_{ik}) = \lambda_{ik}\, f_k^{0}(\rho_{ik}) + (1 - \lambda_{ik})\, f_{ik}(\rho_{ik} \mid \mu_{ik}, \sigma_{ik}) \tag{3} \]

where $f_{ik}$ is the inferred RT density for peptide i in experiment k and $f_k^{0}$ is the null RT density.
In our implementation, we let $f_{ik} \sim \text{Laplace}(\mu_{ik}, \sigma_{ik}^{2})$ and $f_k^{0} \sim \text{Normal}(\mu_k^{0}, (\sigma_k^{0})^{2})$, which we found worked well in practice (see S4 Fig). This framework is modular and can be easily extended to use different distributions. To account for the fact that residual RT variation increases with mean RT and varies between experiments (S3 Fig), we model its standard deviation, $\sigma_{ik}$, as a linearly increasing function of $\mu_i$, Eq 7.
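As a rough illustration of Eqs 2 and 3, the following Python sketch evaluates the two-segment alignment function and the mixture likelihood for a single observation. It is not the DART-ID implementation: the function and parameter names are hypothetical, a Laplace density for the inferred RT and a normal density for the null are assumed here, and treating the standard deviation directly as the Laplace scale is a simplification.

```python
import math


def aligned_rt(mu_i, b0, b1, b2, split):
    """Two-segment linear map from a reference RT to an experiment RT (cf. Eq 2).

    b0 is the intercept, b1 and b2 are the slopes before and after the
    split point; the two segments meet at mu_i == split.
    """
    if mu_i < split:
        return b0 + b1 * mu_i
    return b0 + b1 * split + b2 * (mu_i - split)


def laplace_pdf(x, loc, scale):
    """Density of a Laplace distribution (assumed form of the inferred RT density)."""
    return math.exp(-abs(x - loc) / scale) / (2.0 * scale)


def normal_pdf(x, mean, sd):
    """Density of a normal distribution (assumed form of the null RT density)."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def rt_likelihood(rt_obs, pep, mu_ik, sigma_ik, null_mean, null_sd):
    """Mixture likelihood of an observed RT (cf. Eq 3).

    The spectral PEP weights the null density and (1 - PEP) weights the
    peptide-specific density, so low-confidence PSMs contribute less.
    """
    null_part = pep * normal_pdf(rt_obs, null_mean, null_sd)
    peptide_part = (1.0 - pep) * laplace_pdf(rt_obs, mu_ik, sigma_ik)
    return null_part + peptide_part
```

In this sketch, a PSM with PEP close to 0 is scored almost entirely by the peptide-specific density, while a PSM with PEP close to 1 is scored almost entirely by the experiment-wide null density, which is how low-confidence PSMs can be included in the alignment without dominating it.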
(a) DART-ID defines the global reference RT as a latent variable, Eq 1. (b) The observed RTs are modeled as a function of the reference RT, which allows incorporating experiment specific weights and the uncertainty in measured RTs and peptide identification as shown in Eq 3. Then the global alignment model simultaneously infers the reference RT and aligns all experiments by solving Eq 4. (c) A conceptual diagram for updating the confidence in a peptide-spectrum-match (PSM). The probability to observe each PSM is estimated from the conditional likelihoods for observing the RT if the PSM is assigned correctly (blue density) or incorrectly (red density). For PSM 1, P(δ = 1 | RT) < P(δ = 0 | RT), and thus the confidence decreases. Conversely, for PSM 2, P(δ = 1 | RT) > P(δ = 0 | RT), and thus the confidence increases. (d) The Bayes’ formula used to formalize the model from panel c and to update the error probability of PSMs.
Using the vectorized likelihood function from Eq 3 and the priors described in Methods, we solve Eq 4 to infer the joint posterior distribution of all reference RTs (and associated model parameters) across all experiments:

\[ P(\mu, \theta \mid \rho, \lambda) \propto P(\rho \mid \mu, \theta, \lambda)\, P(\mu, \theta) \tag{4} \]

where $\mu$, $\rho$ and $\lambda$ collect the reference RTs, the observed RTs and the spectral PEPs, and $\theta$ denotes the associated model parameters (the experiment-specific regression parameters and the RT standard deviations). The inference described above infers all reference RTs, μ, from one global solution of Eq 4. It allows the alignment to take advantage of any peptide observed in at least two experiments, regardless of the number of missing observations in other experiments. Furthermore, the mixture model described in Eq 3 allows for the incorporation of low confidence peptides by using appropriate weights and accounting for the presence of false positives. Thus, this method maximizes the data used for alignment and obviates the need for spiked-in standards. In addition, the reference RT provides a principled choice of reference (rather than choosing a particular experiment) that is free of measurement noise. The alignment process accounts for the error in individual observations by inferring a per-peptide RT distribution, as opposed to aligning to a point estimate, and accounts for variable RT deviations in experiments by using experiment-specific weights.
The conceptual idea based on which we incorporate RT information for sequence identification is illustrated in Fig 1c and formalized with Bayes’ theorem in Fig 1d. We start with a peptide-spectrum-match (PSM) from a search engine and its associated probability to be incorrect (PEP; posterior error probability) and correct, 1-PEP. If the RT of a PSM is far from the RT of its corresponding peptide, as PSM1 in Fig 1c, then the spectrum is more likely to be observed if the PSM is incorrect, and thus we can decrease its confidence. Conversely, if the RT of a PSM is very close to the RT of its corresponding peptide, as PSM2 in Fig 1c, then the spectrum is more likely to be observed if the PSM is correct, and thus we can increase its confidence. To estimate whether the RT of a PSM is more likely to be observed if the PSM is correct or incorrect, we use the conditional likelihood probability densities inferred from the alignment procedure in Eq 3 (Fig 1b). Combining these likelihood functions with Bayes’ theorem in Fig 1d allows us to formalize this logic and update the confidence of analyzed PSMs, which we quantify with DART-ID PEPs.
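The confidence update described above can be sketched in a few lines of Python. This is a minimal illustration of Bayes' theorem applied to a single PSM, not the DART-ID code; the function name and the example likelihood values are hypothetical, and the two conditional RT likelihoods are assumed to have been evaluated already.

```python
def update_pep(spectral_pep, lik_correct, lik_incorrect):
    """Update a PSM's error probability with RT evidence via Bayes' theorem.

    spectral_pep  -- prior probability that the PSM is incorrect
    lik_correct   -- likelihood of the observed RT if the PSM is correct
    lik_incorrect -- likelihood of the observed RT if the PSM is incorrect
    Returns the posterior probability that the PSM is incorrect.
    """
    numerator = lik_incorrect * spectral_pep
    denominator = numerator + lik_correct * (1.0 - spectral_pep)
    if denominator == 0.0:
        # No RT evidence either way: keep the spectral PEP unchanged.
        return spectral_pep
    return numerator / denominator


# An RT close to the peptide's expected RT lowers the error probability...
upgraded = update_pep(0.05, lik_correct=0.9, lik_incorrect=0.02)
# ...while an RT far from it raises the error probability.
downgraded = update_pep(0.05, lik_correct=0.001, lik_incorrect=0.02)
```

When the two likelihoods are equal, the RT carries no information and the updated PEP equals the spectral PEP, which matches the intuition in Fig 1c.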
Global alignment process reduces RT deviations
To evaluate the global RT alignment by DART-ID, we used a staggered set of 46 LC-MS/MS runs, each 60 minutes long, performed over a span of three months so that the measured RTs captured the expected variance in the chromatography. Each run was a diluted 1 × M injection of a bulk 100 × M SCoPE-MS sample, as described in Table 1 and by Specht et al. The measured RTs were compared to RTs predicted from peptide sequences [30, 31, 34], and to top-performing alignment methods [7, 8, 43, 52], including the reference RTs from DART-ID; see Fig 2a. All methods estimated RTs that explained the majority of the variance of the measured RTs, Fig 2a. As expected, the alignment methods provided closer estimates, explaining over 99% of the variance.
(a) Scatter plots of observed RTs versus inferred RTs. The comparisons include 33,383 PSMs with PEP < 0.01 from 46 LC-MS/MS runs over the span of three months. The left column displays comparisons for RT prediction methods: SSRCalc, BioLCCC, and ELUDE. The right column displays comparisons for alignment methods: precision iRT, MaxQuant match-between-runs [7, 8], and DART-ID. (b) Distributions of residual RTs: ΔRT = Observed RT − Reference RT. Note the different scales of the x-axes between the prediction and alignment methods. (c) Mean and median of the absolute values of ΔRT from panel (b).
To evaluate the accuracy of RT estimates more rigorously, we compared the distributions of differences between the reference RTs and measured RTs, shown in Fig 2b. This comparison again underscores that the differences are significantly smaller for alignment methods, and smallest for DART-ID. We further quantified these differences by computing the mean and median absolute RT deviations, i.e., |ΔRT|, which is defined as the absolute value of the difference between the observed RT and the reference RT. For the prediction methods (SSRCalc, BioLCCC, and ELUDE), the average deviations exceed 2 min, and ELUDE has the smallest average deviation of 2.5 min. The alignment methods result in smaller average deviations, all below 1 min, and DART-ID shows the smallest average deviation of 0.044 min (2.6 seconds). Detailed alignment statistics can be visualized in both the graphical output of the DART-ID program and in the DO-MS visualization platform.
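The summary statistics used in this comparison are simple to compute. The sketch below, with a hypothetical function name and made-up example values, shows the mean and median of |ΔRT| for paired observed and reference RTs.

```python
from statistics import mean, median


def abs_rt_deviation(observed, reference):
    """Return (mean, median) of |observed RT - reference RT| for paired lists."""
    residuals = [abs(obs - ref) for obs, ref in zip(observed, reference)]
    return mean(residuals), median(residuals)


# Illustrative values in minutes (not data from the study).
observed_rts = [10.0, 20.5, 31.0]
reference_rts = [10.0, 20.0, 30.0]
mean_dev, median_dev = abs_rt_deviation(observed_rts, reference_rts)
```

Reporting both statistics is useful because the mean is sensitive to a few badly aligned PSMs, whereas the median reflects the typical residual.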
DART-ID increases proteome coverage of SCoPE-MS experiments
Search engines such as MaxQuant [7, 8] use the similarity between theoretically predicted and experimentally measured MS2 spectra of ions to match them to peptide sequences, i.e., peptide-spectrum-matches (PSM). The confidence of a PSM is commonly quantified by the probability of an incorrect match: the posterior error probability (PEP) [21, 55, 56]. Since the estimation of PEP does not include RT information, we sought to update the PEP for each PSM by incorporating RT information within the Bayesian framework displayed in Fig 1c and 1d. This approach allowed us to use the estimated RT distributions for each peptide with minimal assumptions.
The Bayesian framework outlined in Fig 1c and 1d can be used with RTs estimated by other methods, and its ability to upgrade PSMs is directly proportional to the accuracy of the estimated RTs. To explore this possibility, we used our Bayesian model with RTs estimated by all methods shown in Fig 2. The updated error probabilities of PSMs indicate that all RT estimates enhance PSM discrimination, S5 Fig. Even lower accuracy RTs predicted from peptide sequence can be productively used to upgrade PSMs. However, the degree to which PSMs are upgraded, i.e. the magnitude of the confidence shift, increases with the accuracy of the RT estimates and is highest with the DART-ID reference RTs.
We refer to the PEP assigned by the search engine (MaxQuant throughout this paper) as “Spectral PEP”, and after it is updated by the Bayesian model from Fig 1d as “DART-ID PEP”. Comparing the Spectral and DART-ID PEPs indicates that the confidence for some PSMs increases while for others decreases; see density plot in Fig 3a. Reassuringly, all PSMs with low Spectral PEPs have even lower DART-ID PEPs, meaning that all confident PSMs become even more confident. On the other extreme, many PSMs with high Spectral PEPs have even higher DART-ID PEPs, meaning that some low-confidence PSMs are further downgraded. Confidence upgrades, where DART-ID PEP < Spectral PEP, range within 1–3 orders of magnitude.
(a) A 2D density distribution of error probabilities derived from spectra alone (Spectral PEP), compared to that after incorporating RT evidence (DART-ID PEP). (b) Map of all peptides observed across all experiments. Black marks indicate peptides with Spectral FDR < 1%, and red marks peptides with DART-ID FDR < 1%. (c) Increase in confident PSMs (top), and in the fraction of all PSMs (bottom) across the confidence range of the x-axis. The curves correspond to PEPs estimated from spectra alone, from spectra and RTs using Percolator, and from spectra and RTs using DART-ID. DART-ID identifications are split into DART-ID1 and DART-ID2 depending on whether the peptides have confident spectral PSMs as marked in panel (b). (d) Distributions of number of unique peptides identified per experiment. (e) The fraction of decoys, i.e. the number of decoy hits divided by the total number of PSMs, as a function of the FDR estimated from spectra alone or from DART-ID. The Spectral FDR is estimated from separate MaxQuant searches, with the FDR applied on the peptide level.
The density plot in Fig 3a displays a subset of peptides with Spectral PEP > 0.01 and DART-ID PEP < 0.01. These peptides have low confidence of identification based on their MS/MS spectra alone, but high confidence when RT evidence is added to the spectral evidence. To visualize how these peptides are distributed across experiments, we marked them with red dashes in Fig 3b. The results indicate that the data sparsity decreases; thus DART-ID helps mitigate the missing data problem of shotgun proteomics. Fig 3b is separated into two subsets, DART-ID1 and DART-ID2, which correspond respectively to peptides that have at least one confident spectral PSM, and peptides whose spectral PSMs are all below the set confidence threshold of 1% FDR. The PSMs of DART-ID2 very likely represent the same peptide sequence, since by definition they share the same RT, MS1 m/z and MS2 fragments consistent with its sequence. However, we cannot be confident in the exact sequence assignment. Thus, they are labeled separately and their sequence assignment is further validated in the next section.
The majority of PSMs whose confidence is increased by DART-ID have multiple confident Spectral PSMs, and thus reliable sequence assignment. Analysis of newly identified peptides in Fig 3c shows that DART-ID helps identify about 50% more PSMs compared to spectra alone at an FDR threshold of 1%. This corresponds to an increase of ∼30–50% in the fraction of PSMs passing an FDR threshold of 1%, as shown in the bottom panel of Fig 3c. Furthermore, the number of distinct peptides identified per experiment increases from an average of ∼1000 to an average of ∼1600, Fig 3d. Percolator, a widely used FDR recalculation method that also incorporates peptide RTs, increases identification rates as well, albeit to a lesser degree than DART-ID, Fig 3c and 3d. The visualizations in Fig 3a, 3c and 3d can be generated for user-supplied data by the DO-MS visualization platform.
We observe that DART-ID PEPs are bimodally distributed (S6 Fig), suggesting that DART-ID acts as an efficient binary classifier. Modifying error probabilities, however, does risk changing the overall false discovery rate (FDR) of the PSM set. To evaluate the effect of DART-ID on the overall FDR, we allowed the inclusion of decoy hits in both the alignment and confidence update process. The results from this analysis in Fig 3e indicate that, as expected, the fraction of PSMs matched to decoys is proportional to the FDR estimated both from the Spectral PEP and from the updated DART-ID PEP. We encourage users of DART-ID to evaluate the results from applying DART-ID and other related methods on their datasets using this benchmark as well as the numerous quantitative benchmarks described in the subsequent sections.
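A decoy-based check of this kind can be sketched as follows. The example assumes, as a common convention rather than something stated in this section, that the FDR at a given PSM is estimated as the running mean of the PEPs of all PSMs with equal or lower PEP; the function names and input layout are hypothetical.

```python
def fdr_from_peps(peps):
    """Estimate an FDR for each PSM as the running mean of the sorted PEPs.

    Assumption: FDR at rank r is the mean PEP of the r most confident PSMs.
    The result is returned in the original PSM order.
    """
    order = sorted(range(len(peps)), key=lambda i: peps[i])
    fdr = [0.0] * len(peps)
    running_sum = 0.0
    for rank, idx in enumerate(order, start=1):
        running_sum += peps[idx]
        fdr[idx] = running_sum / rank
    return fdr


def decoy_fraction(fdr, is_decoy, threshold):
    """Fraction of PSMs accepted at the FDR threshold that are decoy hits."""
    accepted = [decoy for f, decoy in zip(fdr, is_decoy) if f <= threshold]
    if not accepted:
        return 0.0
    return sum(accepted) / len(accepted)
```

Computing `decoy_fraction` over a range of thresholds, once with spectral PEPs and once with updated PEPs, gives the kind of curve shown in Fig 3e: if the updated PEPs are well calibrated, the decoy fraction should track the estimated FDR.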
DART-ID increases proteome coverage of bulk LC-MS/MS experiments
While we were motivated to develop DART-ID within the context of the SCoPE-MS method, we show in Fig 4 that DART-ID is similarly able to increase quantitative coverage in a label-free and a TMT-labelled bulk LC-MS/MS experiment. The DART-ID alignment performed differently between the label-free set (120 min gradients) and the TMT-labelled set (180 min gradients) (Fig 4a), with slightly higher residuals for the longer gradient. The percent increase in confident PSMs when using DART-ID PEPs instead of spectral PEPs (Fig 4b) also fell into the expected range of 30–50% at 1% FDR. The increase in confident PSMs is shown in discrete terms in Fig 4c, where experiments in both the label-free and TMT-labelled sets gain thousands more confident PSMs that can then be used for further quantitative analysis.
(a) Residual RTs after DART-ID alignment for the label-free dataset and the TMT-labelled dataset. (b) DART-ID doubles the PSMs at 0.01% FDR and increases them by about 40% at 1% FDR. Each circle corresponds to the number of PSMs in an LC-MS/MS run. (c) Number of PSMs per run at 1% FDR, after applying DART-ID versus before its application. The x-coordinate represents the Spectral PSMs and the y-coordinate represents the DART-ID PSMs at 1% FDR.
DART-ID decreases missing datapoints
These increases in confident PSMs, in both the SCoPE-MS and bulk LC-MS/MS sets, decrease the amount of missing data per run. In Fig 5a we show qualitatively that DART-ID can fill in many of these missing values on the protein level. On the level of experimental runs, as shown quantitatively in Fig 5b, DART-ID significantly reduces the amount of missing data and mitigates the stochasticity that is inherent to data-dependent MS methods.
(a) Map of quantified proteins across 209 SCoPE-MS runs, before and after applying DART-ID. A red mark denotes a protein quantified in a run at 1% FDR. Only peptides seen in >50% of experiments are included. (b) Decrease in missing data across all runs after applying DART-ID, for SCoPE-MS and the two bulk sets from Fig 4 at 1% FDR. All corresponding Spectra and DART-ID distributions differ significantly; the probability that they are sampled from the same distribution is ≪ 1 × 10^-10.
Validation of PSMs upgraded by DART-ID
We next sought to evaluate whether the confident DART-ID PSMs without confident Spectral PSMs, i.e. DART-ID2 from Fig 3b, are matched to the correct peptide sequences. To this end, we sought to evaluate whether the RTs of such PSMs match the RTs for the corresponding peptides identified from high-quality, confident spectra. For this analysis, we split a set of experiments into two subsets, A and B, Fig 6a. The application of DART-ID to A resulted in two disjoint subsets of PSMs: A1, corresponding to PSMs with confident spectra (Spectral PEP < 0.01), and A2, corresponding to “upgraded” PSMs (Spectral PEP > 0.01 and DART-ID PEP < 0.01). We overlapped these subsets with PSMs from B having Spectral PEP < 0.01, so that the RTs of PSMs from B can be compared to the RTs of PSMs from subsets A1 and A2, Fig 6a. This comparison shows excellent agreement of the RTs for both subsets A1 and A2 with the RTs for high quality spectral PSMs from B, Fig 6b and 6c. This result suggests that even peptides upgraded without confident spectral PSMs are matched to the correct peptide sequences.
(a) Schematic design of this validation experiment. It used 11 technical replicate LC-MS/MS experiments that were run on the same day. (b) Comparison of the RTs of subsets A1 and A2 to the RTs of corresponding peptides from B. Decoy PSMs have randomly sampled RTs and are included here as a null model. (c) Residual RT distributions for the two subsets of data A1 and A2 as defined in panel a and for a decoy subset.
Validation by internal consistency
We ran DART-ID on SCoPE-MS method development experiments, all of which contain quantification data in the form of 11-plex tandem-mass-tag (TMT) reporter ion (RI) intensities. Out of the 10 TMT “channels”, six represent the relative levels of a peptide in simulated single cells, i.e., small bulk cell lysate diluted to a single-cell level. These six single cell channels are made of T-cells (Jurkat cell line) and monocytes (U-937 cell line). We then used the normalized TMT RI intensities to validate upgraded PSMs by analyzing the consistency of protein quantification from distinct peptides.
Internal consistency is defined by the expectation that the relative intensities of PSMs reflect the relative levels of their corresponding proteins. If upgraded PSMs are consistent with Spectral PSMs for the same protein, then their relative RI intensities will have lower coefficients of variation (CV) within a protein than across different proteins. CV is defined as σ/μ, where σ is the standard deviation and μ is the mean of the normalized RI intensities of PSMs belonging to the same protein. A negative control is constructed by creating a decoy dataset where PSM protein assignments are randomized.
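The CV calculation and the randomized negative control can be sketched as below. This is a simplified illustration with hypothetical function names: each PSM is reduced to a single normalized intensity, whereas the actual analysis works with intensities across several TMT channels, and the use of the sample standard deviation is an assumption.

```python
import random
from collections import defaultdict
from statistics import mean, stdev


def protein_cvs(psms):
    """Compute CV = sigma / mu per protein from (protein_id, intensity) pairs.

    Proteins with fewer than two PSMs, or with zero mean intensity, are
    skipped because the CV is undefined for them.
    """
    groups = defaultdict(list)
    for protein, intensity in psms:
        groups[protein].append(intensity)
    cvs = {}
    for protein, values in groups.items():
        if len(values) < 2:
            continue
        mu = mean(values)
        if mu != 0:
            cvs[protein] = stdev(values) / mu
    return cvs


def randomize_assignments(psms, seed=0):
    """Build a decoy dataset by shuffling protein labels across PSMs."""
    rng = random.Random(seed)
    proteins = [protein for protein, _ in psms]
    rng.shuffle(proteins)
    return [(protein, intensity) for protein, (_, intensity) in zip(proteins, psms)]
```

Comparing the distribution of `protein_cvs(psms).values()` with that of `protein_cvs(randomize_assignments(psms)).values()` gives the contrast between real and decoy CVs: if PSMs are assigned to the right proteins, the real CVs should be systematically smaller.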
For this and later analyses, we filter PSMs from our data into the following disjoint sets:
- Spectra—Spectral PEP < 0.01
- DART-ID—(Spectral PEP > 0.01) ∩ (DART-ID PEP < 0.01)
- Percolator —(Spectral PEP > 0.01) ∩ (Percolator PEP < 0.01)
where Spectra is disjoint from the other two sets, i.e., Spectra ∩ DART-ID = ∅ and Spectra ∩ Percolator = ∅. These sets of PSMs, as depicted in Fig 7a, are intersected with each other through a set of shared proteins between the three sets of PSMs.
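The filtering rules above translate directly into code. In the sketch below, the dictionary keys (`id`, `spectral_pep`, `dart_pep`, `percolator_pep`) and the function name are hypothetical; the strict inequalities follow the set definitions, so a PSM with a Spectral PEP of exactly 0.01 falls into none of the sets.

```python
def partition_psms(psms, threshold=0.01):
    """Split PSMs into the Spectra, DART-ID and Percolator sets.

    psms is a list of dicts with hypothetical keys 'id', 'spectral_pep',
    'dart_pep' and 'percolator_pep'. Returns three sets of PSM ids.
    """
    spectra, dart_id, percolator = set(), set(), set()
    for psm in psms:
        if psm["spectral_pep"] < threshold:
            # Confident from the spectrum alone.
            spectra.add(psm["id"])
        elif psm["spectral_pep"] > threshold:
            # Not confident from the spectrum; check each RT-aware method.
            if psm["dart_pep"] < threshold:
                dart_id.add(psm["id"])
            if psm["percolator_pep"] < threshold:
                percolator.add(psm["id"])
    return spectra, dart_id, percolator


example = [
    {"id": "a", "spectral_pep": 0.001, "dart_pep": 0.0001, "percolator_pep": 0.0001},
    {"id": "b", "spectral_pep": 0.2, "dart_pep": 0.005, "percolator_pep": 0.3},
    {"id": "c", "spectral_pep": 0.2, "dart_pep": 0.5, "percolator_pep": 0.002},
    {"id": "d", "spectral_pep": 0.5, "dart_pep": 0.5, "percolator_pep": 0.5},
    {"id": "e", "spectral_pep": 0.3, "dart_pep": 0.001, "percolator_pep": 0.001},
]
spectra, dart_id, percolator = partition_psms(example)
```

Note that the DART-ID and Percolator sets may overlap with each other (PSM `e` in the example), while both remain disjoint from the Spectra set by construction.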
(a) Schematic for separating PSM subsets, where Spectra and DART-ID subsets of PSMs are disjoint. (b) Distributions of coefficients of variation (CVs) for each protein in each subset. Decoy is a subset of PSMs with their protein assignments randomized. (c) Comparing protein CVs of n = 275 proteins between the Spectra and DART-ID PSM subsets, and between the Spectra and Decoy subsets.
The protein CVs of the Spectra, DART-ID, and Percolator PSM sets, depicted in Fig 7b, show similar distributions and smaller CVs than those from the decoy set. In addition, Fig 7c shows agreement between the protein CVs of the Spectra and DART-ID PSM sets, as opposed to the CVs of the Spectra set and Decoy set. This demonstrates that the protein-specific variance in the relative quantification, due to either technical or biological noise, is preserved in these upgraded PSMs.
Proteins identified by DART-ID separate cell types
The upgraded PSMs from the DART-ID set are not just representative of proteins already quantified from confident spectral PSMs, but when filtering at a given confidence threshold (e.g., 1% FDR), they allow for the inclusion of new proteins for analysis. As the quantification of these new proteins from the DART-ID PSMs cannot be directly compared to that of the proteins from the Spectra PSMs, we instead compare how the new proteins from DART-ID can explain the biological differences between two cell types—T-cells (Jurkat cell line) and monocytes (U-937 cell line)—present in each sample and experiment. The data was split into sets in the same manner as the previous section, as shown in Fig 7a, where the Spectra and DART-ID sets of PSMs are disjoint. We then filtered out all PSMs from DART-ID that belonged to any protein represented in Spectra, so that the sets of proteins between the two sets of PSMs were disjoint as well.
To test whether or not DART-ID identified peptides consistently across experiments, we used principal component analysis (PCA) to separate the T-cells and monocytes quantified in our experiments. The PCA in Fig 8a shows clear separation of T-cells and monocytes from both the Spectra and DART-ID PSM sets. If boosted peptide identifications were spurious and inconsistent, then PCA would neither separate the two cell types nor cluster cells of the same type together. In addition, relative protein ratios (T-cells/monocytes) estimated from the two disjoint PSM sets are in good agreement (ρ = 0.84); see S7 Fig.
(a) Principal component analysis of the proteomes of 375 samples corresponding to either T-cells (Jurkat cell line) or to monocytes (U-937 cell line). The Spectra set contains proteins with Spectral PSMs filtered at 1% FDR, and the DART-ID set contains a disjoint set of proteins quantified from PSMs with high Spectral PEP but low DART-ID PEP. Only peptides with less than 5% missing data were used for this analysis, and the missing data were imputed. (b) The distributions of some features of the Spectra and DART-ID PSMs differ slightly. These features include: precursor ion area, the area under the MS1 elution peak, which reflects peptide abundance; precursor ion fraction, which reflects MS2 spectral purity; missed cleavages, the average number of internal lysine and arginine residues; and % missing data, the average fraction of missing TMT reporter ion quantitation per PSM. All distributions are significantly different, with p < 10^−4.
While DART-ID PSMs are able to uncover entirely new proteins carrying consistent biological signal, on average these PSMs differ slightly from Spectral PSMs in purity, missed-cleavages, and missing data; see Fig 8b. However, the distributions of these features are largely overlapping, and the magnitude of these differences is relatively small; most spectra of DART-ID PSMs are still >90% pure, and have less than 16% missing data and missed cleavages. Of course the intended usage of DART-ID is not to separate these two groups of PSMs and analyze them separately, but instead to combine them and increase the number of data points available for analysis. Indeed, adding DART-ID PSMs to the Spectra PSMs doubles the number of differentially abundant proteins between T-cells and monocytes, Fig 9a, 9b and 9c.
The difference in protein abundance between T-cells and monocytes was visualized in the space of fold-change and its significance, i.e., volcano plots. The volcano plot using only proteins quantified from Spectra PSMs (a) identifies fewer proteins than the volcano plot using proteins from Spectra + DART-ID PSMs (b). Fold changes are averaged normalized RI intensities of T-cells (Jurkat cell line) / monocytes (U-937 cell line). q-values are computed from two-tailed t-test p-values and corrected for multiple hypotheses testing. (c) Number of differentially abundant proteins as a function of the significance FDR from panels a and b.
Here we present DART-ID, a new Bayesian approach that infers RTs with high accuracy and uses these accurate RT estimates to improve peptide sequence identification. We demonstrate that DART-ID can estimate and align RTs with accuracy of a few seconds for 60 minute LC-MS/MS runs and can leverage this high accuracy towards increasing the confidence in correct PSMs and decreasing the confidence in incorrect PSMs. This principled and rigorous estimation of the confidence of PSMs increases quantification coverage by 30–50%, primarily by increasing the number of experiments in which a peptide is quantified.
We validated the upgraded PSMs using methods for FDR estimation (Fig 3e), cross-validation (Fig 6), intra-protein CV validation (Fig 7), and biological signal validation (Fig 8). All of these methods strongly support the reliability of DART-ID inferences. We encourage the use of these methods for benchmarking the application of DART-ID (and any other related method) on other datasets.
DART-ID is applicable to any large set of LC-MS/MS analyses with a consistent LC setup. The more consistent the LC, the more powerful DART-ID is since its statistical power is proportional to the accuracy of RT estimates. Our SCoPE-MS and SCoPE2 runs have highly consistent RTs [1, 4, 60] and motivated us to develop DART-ID. However, we found (shown in Fig 4) that DART-ID performs similarly well with bulk LC-MS/MS runs of TMT-labeled and label-free samples.
A principal advantage of DART-ID is that its probabilistic model naturally adapts to the RT reproducibility and obviates thresholds, e.g., a threshold on RT errors. Rather, DART-ID updates the confidence of each PSM using a rigorous quantitative model based on empirically derived distributions of RT reproducibility. Thus, it adapts and controls for the reproducibility of the LC and the accuracy of the RT estimates as shown in S5 Fig.
Another principal advantage of DART-ID is its ability to use all PSMs (including those with sparse observations and low confidence) to create a global RT alignment. This is possible because DART-ID alignment takes into account the confidence of PSMs as part of the mixture model in Eq 3. This results in accurate RT estimates (Fig 2) that are robust to missing data and benefit from all PSMs regardless of their identification confidence.
If the LC and RTs of a dataset are very variable, one may extend the alignment model beyond Eq 2 to capture the increased variability. The two-segment linear regression from Eq 2 demonstrated here captures more variation than a single-slope linear regression. DART-ID, however, is not constrained to these two functions and can implement any monotone function. Non-linear functions that are monotonically constrained, such as the logit function, have been implemented in our model during development. More complex models, for example monotonically-constrained generalized additive models, could increase alignment accuracy further, provided that the input data motivate the added complexity.
While DART-ID is focused on aligning and utilizing RTs from LC-MS/MS experiments, the alignment method could potentially be applied to other separation methods, including ion mobility, gas chromatography, supercritical fluid chromatography, and capillary electrophoresis. The ion drift time obtained from instruments with an ion mobility cell is particularly straightforward to align and incorporate by DART-ID’s Bayesian framework. Another potential extension of DART-ID is to offline separations prior to analysis, i.e., fractionation. RT alignment would only be applicable between replicates of analogous fractions, but a more complex model could also take into account membership of a peptide to a fraction as an additional piece of evidence.
DART-ID is modular, and the RT alignment module and PEP update modules may be used separately. For example, the RT estimates may be applied to increase the performance of other peptide identification methods incorporating RT evidence [14–17]. One application is integrating the inferred RT from DART-ID into the search engine score, as done by previous methods [18, 19], to change the best hit for a spectrum, save a spectrum from filtering due to high score similarities (i.e., low delta score), or provide evidence for hybrid spectra. Although DART-ID’s alignment is based on point estimates of RT, the global alignment methodology could also be applied to feature-based alignments [6, 8–10] to obviate the limitations inherent in pairwise alignments.
Data sources and experimental design
The data used for the development and validation of the DART-ID method were 263 method-development experiments for SCoPE-MS and its related projects. All samples were lysates of the Jurkat (T-cell), U-937 (monocyte), or HEK-293 (human embryonic kidney) cell lines. Samples were prepared with the mPOP sample preparation protocol, and then digested with trypsin. All experiments used either 10 or 11-plex TMT for quantification. Most but not all sets followed the experimental design as described by Table 1. All experiments were run on a Thermo Fisher (Waltham, MA) Easy-nLC system with a Waters (Milford, MA) 25cm x 75μm, 1.7μm BEH column with 130Å pore diameter, and analyzed on a Q-Exactive (Thermo Fisher) mass spectrometer. Gradients were run at 100 nL/min from 5-35%B in 48 minutes with a 12 minute wash step to 100%B. Solvent composition was 0% acetonitrile for A and 80% acetonitrile for B, with 0.1% formic acid in both. A subset of later experiments included the use of a trapping column, which extended the total run-time to 70 minutes. Detailed experimental designs and mass spectrometer parameters of each run can be found in S1 Table. All Thermo .RAW files are publicly available online. More details on sample preparation and analysis methods can be found in the mPOP protocol.
Searching raw MS data
Searching was done with MaxQuant v184.108.40.206 against a UniProt protein sequence database with 443722 entries. The database contained only SwissProt entries and was downloaded on 5/1/2018. Searching was also done on a contaminant database provided by MaxQuant, which contained common laboratory contaminants and keratins. MaxQuant was run with Trypsin specificity which allowed for two missed cleavages, and methionine oxidation (+15.99492 Da) and protein N-terminal acetylation (+42.01056 Da) as variable modifications. No fixed modifications apart from TMT were specified. TMT was searched using the “Reporter ion MS2” quantification setting on MaxQuant, which searches for the TMT addition on lysine and the N-terminus with a 0.003 Da tolerance. Observations were selected at a false discovery rate (FDR) of 100% at both the protein and PSM level to obtain as many spectrum matches as possible, regardless of their match confidence. All raw MS files, MaxQuant search parameters, the sequence database, and search outputs are publicly available online.
Only a subset of the input data is used for the alignment of experiments and the inference of RT distributions for peptides. First, decoys and contaminants are filtered out of the set. Contaminants may be problematic for RT alignment since their retention may be poorly defined, e.g., they may be poorly chromatographically resolved. Then, observations are selected at a threshold of PEP < 0.5. Observations are additionally filtered through a threshold of retention length, which is defined by MaxQuant as the range of time between the first matched scan of the peptide and the last matched scan. Any peptide with retention length > 1 min for a 60 min run is deemed to have too wide of an elution peak, or chromatography behavior more consistent with contaminants than retention on column. In our implementation, this retention length threshold can be set as a static number or as a fraction of the total run-time, i.e., (1/60) of the gradient length.
For our data, only peptide sequences present in 3 or more experiments were allowed to participate in the alignment process. The model can allow peptides only present in one experiment to be included in the alignment, but the inclusion of this data adds no additional information to the alignment and only serves to slow it down computationally. The definition of a peptide sequence in these cases is dynamic, and can include modifications, charge states, or any other feature that would affect the retention of an isoform of that peptide. For our data, we used the peptide sequence with modifications but did not append the charge state.
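The filtering rules above can be summarized in code. The following is a hedged sketch rather than the DART-ID implementation: the dictionary keys (`peptide`, `experiment`, `pep`, `retention_length`, `is_decoy`, `is_contaminant`) and both function names are hypothetical, while the thresholds (PEP < 0.5, retention length of at most 1/60 of the run, presence in 3 or more experiments) are taken from the text.

```python
from collections import defaultdict


def keep_for_alignment(psm, run_minutes=60.0, max_pep=0.5,
                       max_length_fraction=1.0 / 60.0):
    """Return True if a PSM passes the alignment filters described above."""
    if psm["is_decoy"] or psm["is_contaminant"]:
        return False
    if psm["pep"] >= max_pep:
        return False
    # Retention length limit expressed as a fraction of the run time.
    return psm["retention_length"] <= max_length_fraction * run_minutes


def peptides_for_alignment(psms, min_experiments=3):
    """Return peptides seen (after filtering) in at least `min_experiments` runs."""
    experiments = defaultdict(set)
    for psm in psms:
        if keep_for_alignment(psm):
            experiments[psm["peptide"]].add(psm["experiment"])
    return {sequence for sequence, runs in experiments.items()
            if len(runs) >= min_experiments}
```

Here the `peptide` key would hold the modified sequence without the charge state, matching the choice described above for our data.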
Preliminary alignments revealed certain experiments where chromatography was extremely abnormal, or where peptide identifications were too sparse to enable an effective alignment. These experiments were manually removed from the alignment procedure after a preliminary run of DART-ID. From the original 263 experiments, 37 had all of their PSMs pruned, leaving only 226 experiments containing PSMs with updated confidences. The 37 pruned experiments are included in the DART-ID output but do not receive any updated error probabilities as they did not participate in the RT alignment. All filtering parameters are publicly available as part of the configuration file that was used to generate the data used in this paper.
Global alignment model
Let ρik be the RT assigned to peptide i in experiment k. In order to infer peptide and experiment-specific RT distributions, we assume that there exists a set of reference retention times, μi, for all peptides i. Each peptide has a unique reference RT, independent of experiment. We posit that for each experiment, there is a simple monotone increasing function, gk, that maps the reference RT to the predicted RT for peptide i in experiment k. An observed RT can then be expressed as in Eq 1. As a first approximation, we assume that the observed RTs for any experiment can be well approximated using a two-segment linear regression model as described by Eq 2. This model can be extended to more complex monotonic models, such as spline fitting, or non-linear monotonic models, such as a logit function or LOESS.
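A two-segment, monotone increasing alignment function can be sketched as follows. The exact parametrization of Eq 2 is not reproduced in this section, so the form below is an assumption: a continuous piecewise-linear function with intercept `beta0`, slopes `beta1` and `beta2`, and split point `split`, all of which are hypothetical names.

```python
def two_segment(mu, beta0, beta1, beta2, split):
    """Map a reference RT `mu` to a predicted RT in one experiment.

    Continuous piecewise-linear function; positive slopes on both segments
    keep the mapping monotone increasing.
    """
    if beta1 <= 0 or beta2 <= 0:
        raise ValueError("slopes must be positive for a monotone mapping")
    if mu < split:
        return beta0 + beta1 * mu
    # Second segment starts where the first one ends, so there is no jump.
    return beta0 + beta1 * split + beta2 * (mu - split)
```

Setting `beta2 == beta1` recovers a single-slope linear alignment, which makes the two models easy to compare on the same data.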
To factor in the spectral PEP given by the search engine, and to allow for the inclusion of low probability PSMs, the marginal likelihood of an RT in the alignment process can be described using a mixture model as described in S1 Fig. For a PSM assigned to peptide i in experiment k the RT density is $P(\rho_{ik} \mid \mu_i, \sigma_{ik}, \lambda_{ik}) = (1 - \lambda_{ik})\, f_{ik}(\rho_{ik} \mid \mu_i, \sigma_{ik}) + \lambda_{ik}\, f^{0}_{ik}(\rho_{ik})$ (5) where λik is the error probability (PEP) for the PSM returned by MaxQuant, fik is the inferred RT density for peptide i in experiment k and $f^{0}_{ik}$ is the null RT density. In our implementation, we let: (6) which we found worked well in practice (see S4 Fig). However, our framework is modular and it is straightforward to utilize different residual RT and null distributions if appropriate. For example, with non-linear gradients that generate a more uniform distribution of peptides across the LC run, it may be sensible for the null distribution to be defined as uniformly distributed.
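The mixture density can be illustrated numerically. This sketch makes two assumptions that are not fixed by the text above: the inferred RT density is taken to be a Laplace density (consistent with the Laplace location estimate mentioned in the bootstrap procedure), and the null density is taken to be a normal density purely as an example. All function and parameter names are hypothetical.

```python
import math


def laplace_pdf(x, loc, scale):
    """Density of a Laplace distribution with location `loc` and scale `scale`."""
    return math.exp(-abs(x - loc) / scale) / (2.0 * scale)


def normal_pdf(x, mean, sd):
    """Density of a normal distribution (illustrative null density)."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def rt_mixture_density(rt, pep, loc, scale, null_mean, null_sd):
    """Weight the correct-match and null densities by (1 - PEP) and PEP."""
    correct = laplace_pdf(rt, loc, scale)
    null = normal_pdf(rt, null_mean, null_sd)
    return (1.0 - pep) * correct + pep * null
```

A confident PSM (PEP near 0) contributes almost entirely through the peptide-specific density, whereas a low-confidence PSM is largely explained by the null density and therefore has little influence on the alignment.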
Finally, to reflect the fact that residual RT variation increases with mean RT and varies between experiments (S3 Fig), we model the standard deviation of a peptide RT distribution, σik, as a linear function of the reference RT: $\sigma_{ik} = a_k + b_k \mu_i$ (7) where μi is the reference RT of the peptide sequence, and ak and bk are the intercept and slope which we infer for each experiment. ak, bk and μi are constrained to be positive, and hence σik > 0 as well.
To estimate all unknown parameters, we consider the joint posterior distribution of the experiment specific alignment parameters and the reference RTs given the observed retention times, $P(a, b, \beta_0, \beta_1, s, \mu \mid \rho) \propto P(\rho \mid a, b, \beta_0, \beta_1, s, \mu)\, P(a, b, \beta_0, \beta_1, s, \mu)$ (8) where P(a, b, β0, β1, s, μ) are the prior distributions for all unknown alignment parameters and reference RTs and P(ρ | a, b, β0, β1, s, μ) is the likelihood, as determined by the mixture density in Eq 5. a, b, β0, β1, s are all K-vectors of alignment parameters for each experiment. μ consists of the reference RTs for every peptide.
The priors for the Bayesian inference can be found in the .stan model files, and for the analyses in this paper, are as follows:
where RTmean and RTsd are the mean and standard deviation of all RTs across all experiments, respectively. max(RT) is the maximum observed RT of all RTs across all experiments. These priors were chosen for groups of 60 min LC-MS/MS runs, and can be adjusted accordingly for different run lengths, gradient shapes, and groupings of runs with different run times.
We compared the DART-ID alignment accuracy against five other RT prediction or alignment algorithms. As some methods returned absolute predicted RTs (such as BioLCCC) and others returned relative hydrophobicity indices (such as SSRCalc), a linear regression was built for each prediction method. Alignment accuracy was evaluated using three metrics: R², the squared Pearson correlation, and the mean and median of |ΔRT|, the absolute value of the residual RT, defined as |Observed RT − Predicted RT|. We selected only confident PSMs (PEP < 0.01) for this analysis, and used data that consisted of 33383 PSMs from 46 LC-MS/MS experiments run over the course of 90 days in order to produce more chromatographic variation. A list of these experiments is found in S1 Table.
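The three accuracy metrics are simple to compute once observed and predicted RTs are paired. The sketch below is illustrative; the function name `alignment_metrics` and the returned dictionary keys are hypothetical.

```python
import statistics


def alignment_metrics(observed, predicted):
    """Return R^2 (squared Pearson correlation) and mean/median |delta RT|."""
    mean_obs = statistics.fmean(observed)
    mean_pred = statistics.fmean(predicted)
    cov = sum((o - mean_obs) * (p - mean_pred)
              for o, p in zip(observed, predicted))
    var_obs = sum((o - mean_obs) ** 2 for o in observed)
    var_pred = sum((p - mean_pred) ** 2 for p in predicted)
    abs_residuals = [abs(o - p) for o, p in zip(observed, predicted)]
    return {
        "r2": cov * cov / (var_obs * var_pred),
        "mean_abs_residual": statistics.fmean(abs_residuals),
        "median_abs_residual": statistics.median(abs_residuals),
    }
```

Note that R² is insensitive to a constant offset between predicted and observed RTs, whereas the |ΔRT| summaries are not, which is why the metrics are reported together.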
SSRCalc was run from SSRCalc Online (http://hs2.proteome.ca/SSRCalc/SSRCalcQ.html), with the “100Å C18 column, 0.1% Formic Acid 2015” model, “TMT” modification, and “Free Cysteine” selected. No observed RTs were inputted along with the sequences.
BioLCCC was run online from http://www.theorchromo.ru/ with the parameters of 250mm column length, 0.075mm column inner diameter, 130Å packing material pore size, 5% initial concentration of component B, 35% final concentration of component B, 48 min gradient time, 0 min delay time, 0.0001 ml/min flow rate, 0% acetonitrile concentration in component A, 80% acetonitrile concentration in component B, “RP/ACN+FA” solid/mobile phase combination, and no cysteine carboxyaminomethylation. As BioLCCC could only take in one gradient slope as the input, all peptides with observed RT > 48 min were not inputted into the prediction method.
ELUDE was downloaded from the percolator releases page https://github.com/percolator/percolator/releases, version 3.02.0, Build Date 2018-02-02. The data were split into two equal sets with distinct peptide sequences to form the training and test sets. The elude program was run with the --no-in-source and --test-rt flags. Predicted RTs from ELUDE were obtained from the testing set only, and training set RTs were not used in further analyses.
For iRT, the same raw files used for the previous sets were searched with the Pulsar search engine, with iRT alignment turned on and filtering at 1% FDR. From the Pulsar search results, only peptide sequences in common with the previous set searched in MaxQuant were selected. Predicted RT was taken from the “PP.RTPredicted” column and plotted against the empirical RT column “PP.EmpiricalRT”. Empirical RTs were not compared between those derived from MaxQuant and those derived from Pulsar.
MaxQuant match-between-runs [7, 8] was run by turning the respective option on when searching over the set of 46 experiments, and given the options of 0.7 min match time tolerance and a 20 min match time window. The “Calibrated retention time” column was used as the predicted RT, and these predicted RTs were related to observed RTs with a linear model for each experiment run.
For DART-ID, predicted RTs are the same as the mean of the inferred RT distribution, and no linear model was constructed to relate the predicted RTs to the observed RTs.
Comparison to linear alignment model
To compare the performance of the two-piece linear model for RT alignment against a simple linear model, we ran both alignments separately on the same dataset as described in the RT alignment comparison section. For S2 Fig, we used one experiment—180324S_QC_SQC69A—as an example to illustrate the qualitative differences between the two models. Panels b and c used all experiments from the set to give a more quantitative comparison.
We update the confidence for PSM i in experiment k according to Bayes’ theorem. Let δik = 1 denote that PSM i in experiment k is assigned to the correct sequence (true positive), δik = 0 denote that the PSM is assigned to the incorrect sequence (a false positive), and as above, ρik is an observed RT assigned to peptide i. At a high level, the probability that the peptide assignment is a true positive is $P(\delta_{ik} = 1 \mid \rho_{ik}) = \frac{P(\rho_{ik} \mid \delta_{ik} = 1)\, P(\delta_{ik} = 1)}{P(\rho_{ik} \mid \delta_{ik} = 1)\, P(\delta_{ik} = 1) + P(\rho_{ik} \mid \delta_{ik} = 0)\, P(\delta_{ik} = 0)}$ (9) Each term is described in more detail below:
The confidence update depends on the global alignment parameters. Let θ consist of the global alignment parameters and reference RTs, i.e. β0k, β1k, σik and μi. If θ were known, then the Bayesian update could be computed in a straightforward manner as described above. In practice the alignment parameters are not known and thus must be estimated using the full set of observed RTs across all experiments, ρ. The PSM confidence update can be expressed unconditional on θ, by integrating over the uncertainty in the estimates of the alignment parameters: $P(\delta_{ik} = 1 \mid \rho) = \int P(\delta_{ik} = 1 \mid \rho_{ik}, \theta)\, p(\theta \mid \rho)\, d\theta$ (10) Although we can estimate this posterior distribution using Markov Chain Monte Carlo (MCMC), it is prohibitively slow given the large number of peptides and experiments that we analyze. As such, we estimate maximum a posteriori (MAP) estimates for the reference RTs μi, alignment parameters β0k, β1k, and RT standard deviation σik using an optimization routine implemented in STAN.
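Conditional on fixed alignment parameters, the update in Eq 9 reduces to a few arithmetic operations per PSM. The sketch below assumes that the prior probability of an incorrect match is the spectral PEP, that the likelihood under a correct match is the peptide-specific RT density evaluated at the observed RT, and that the likelihood under an incorrect match is the null RT density. The function name `updated_pep` is hypothetical.

```python
def updated_pep(spectral_pep, density_correct, density_null):
    """Posterior error probability of a PSM after observing its RT.

    spectral_pep    -- prior P(delta = 0), the PEP from the search engine
    density_correct -- P(rt | delta = 1), RT density if the match is correct
    density_null    -- P(rt | delta = 0), RT density if the match is incorrect
    """
    incorrect = density_null * spectral_pep
    correct = density_correct * (1.0 - spectral_pep)
    total = incorrect + correct
    if total == 0.0:
        # Both densities vanish: the RT carries no usable evidence.
        return spectral_pep
    return incorrect / total
```

When the observed RT is equally likely under both hypotheses the PEP is unchanged; an RT close to the expected value lowers the PEP, and an RT far from it raises the PEP.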
If computation time is not a concern, it is straightforward to generate posterior samples in our model by running MCMC sampling in STAN instead of MAP optimization. MAP optimization, by contrast, is computationally efficient but is limited in that parameter uncertainty quantification is not automatic.
To address this challenge, we incorporate estimation uncertainty using a computationally efficient procedure based on the parametric bootstrap. Note that uncertainty about the alignment parameters β0k and β1k is small since they are inferred using thousands of RT observations per experiment. By contrast, the reference RTs, μi, have much higher uncertainty since we observe at most one RT associated with peptide i in each experiment (usually far fewer). As such, we choose to ignore uncertainty in the alignment parameters and focus on incorporating uncertainty in estimates of μi.
Let $\hat{\mu}_{ik}$ and $\hat{\sigma}_{ik}$ denote the MAP estimates of the location and scale parameters for the RT densities. To approximate the posterior uncertainty in the estimates of μi, we use the parametric bootstrap. First, we sample $\rho^{*}_{ik}$ from $f_{ik}(\cdot \mid \hat{\mu}_{ik}, \hat{\sigma}_{ik})$ with probability 1 − λik and from $f^{0}_{ik}$ with probability λik. We then map $\rho^{*}_{ik}$ back to the reference space using the inferred alignment parameters as $\mu^{*}_{ik} = g_k^{-1}(\rho^{*}_{ik})$ and compute a bootstrap replicate of the reference RT associated with peptide i as the median (across experiments) of the resampled RTs: $\mu^{*}_{i} = \operatorname{median}_k\, \mu^{*}_{ik}$, as the maximum likelihood estimate of the location parameter of a Laplace distribution is the median of independent observations. For each peptide we repeat this process B times to get several bootstrap replicates of the reference RT for each peptide. We use the bootstrap replicates to incorporate the uncertainty of the reference RTs into the Bayesian update of the PSM confidence. Specifically, we approximate the confidence update in Eq 10 as $P(\delta_{ik} = 1 \mid \rho) \approx \frac{1}{B} \sum_{b=1}^{B} P(\delta_{ik} = 1 \mid \rho_{ik}, \mu^{*(b)}_{i})$, with the alignment parameters held at their MAP estimates (11) This process is depicted in S8 Fig.
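The resampling step can be sketched for a single peptide as follows. This is a simplified illustration, not the DART-ID code: it assumes a single-slope linear alignment per experiment (so the inverse mapping is available in closed form), a Laplace RT density, and a normal null density chosen only for the example. The dictionary keys and function names are hypothetical.

```python
import random
import statistics


def sample_laplace(rng, loc, scale):
    """Draw from Laplace(loc, scale) as a difference of two exponentials."""
    rate = 1.0 / scale
    return loc + rng.expovariate(rate) - rng.expovariate(rate)


def bootstrap_reference_rts(observations, n_boot=100, seed=0):
    """Return `n_boot` bootstrap replicates of one peptide's reference RT.

    Each observation describes the peptide in one experiment:
      loc, scale         -- MAP location and scale of its RT density
      pep                -- spectral error probability of the PSM
      beta0, beta1       -- linear alignment (assumed form) for the experiment
      null_mean, null_sd -- parameters of the illustrative normal null density
    """
    rng = random.Random(seed)
    replicates = []
    for _ in range(n_boot):
        mapped = []
        for obs in observations:
            if rng.random() < obs["pep"]:
                rt = rng.gauss(obs["null_mean"], obs["null_sd"])
            else:
                rt = sample_laplace(rng, obs["loc"], obs["scale"])
            # Invert the linear alignment to return to the reference space.
            mapped.append((rt - obs["beta0"]) / obs["beta1"])
        replicates.append(statistics.median(mapped))
    return replicates
```

Peptides observed in few experiments, or mostly through low-confidence PSMs, produce widely scattered replicates, which in turn tempers the confidence update for their PSMs.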
In addition to updating the PEPs for each PSM, DART-ID also recalculates the set-wide false discovery rate (FDR, q-value). This is done by first sorting the PEPs and then assigning the q-value to be the cumulative sum of PEPs at that index, divided by the index itself, to give the fractional expected number of false positives at that index (i.e., the mean PEP).
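The q-value calculation described above amounts to a running mean over the sorted PEPs. A minimal sketch, with the hypothetical function name `qvalues_from_peps`:

```python
def qvalues_from_peps(peps):
    """Return q-values in the original PSM order.

    PSMs are ranked by increasing PEP; the q-value at rank r is the sum of
    the r smallest PEPs divided by r, i.e. the mean PEP of the PSMs accepted
    at that threshold.
    """
    order = sorted(range(len(peps)), key=lambda i: peps[i])
    qvalues = [0.0] * len(peps)
    running_sum = 0.0
    for rank, index in enumerate(order, start=1):
        running_sum += peps[index]
        qvalues[index] = running_sum / rank
    return qvalues
```

Because the PEPs are sorted in increasing order, the running mean is non-decreasing, so the resulting q-values are monotone in the PEP ranking.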
TMT reporter ion intensity normalization
Reporter ion (RI) intensities were obtained by selecting the tandem-mass-tag (TMT) 11-plex labels in MaxQuant, for both attachment possibilities of lysine and the peptide N-terminus, and with a mass tolerance of 0.003 Da. Data from different experiments and searches are all combined into one matrix, where the rows are observations (PSMs) and the 10 columns are the 10 TMT channels. Observations are filtered at a confidence threshold, normally 1% FDR, and observations with missing data are thrown out.
Before normalization, empty channels 127N, 128C, and 131C are removed from the matrix. Each column of the matrix is divided by the median of that column, to correct for the total amount of protein in each channel, pipetting error, and any biases between the respective TMT tags. Then, each row of the matrix is divided by the median of that row, to obtain the relative enrichment between the samples in the different TMT channels. In our data the relative enrichment was between the two cell types present in our SCoPE-MS sets, T-cells (Jurkat cell line) and monocytes (U-937 cell line).
Assuming that the relative RI intensities of PSMs are representative of their parent peptide, the peptide intensity can be estimated as the median of the RI intensities of its constituent PSMs. Similarly, if protein levels are assumed to correspond to the levels of its constituent peptides, then protein intensity can be estimated as the median of the intensities of its constituent peptides. The previous steps of RI normalization make all peptide and protein-level quantitation relative between the conditions in each channel.
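The normalization and median collapsing steps can be sketched as follows. The example assumes a complete matrix (PSMs with missing values and the empty channels already removed, as described above) stored as a list of rows; the function names `normalize_ri` and `collapse_by_median` are hypothetical.

```python
import statistics
from collections import defaultdict


def normalize_ri(matrix):
    """Divide each column by its median, then each row by its median."""
    n_cols = len(matrix[0])
    col_medians = [statistics.median(row[j] for row in matrix)
                   for j in range(n_cols)]
    by_column = [[row[j] / col_medians[j] for j in range(n_cols)]
                 for row in matrix]
    return [[value / statistics.median(row) for value in row]
            for row in by_column]


def collapse_by_median(rows, keys):
    """Collapse rows sharing a key (e.g. peptide or protein) by channel median."""
    groups = defaultdict(list)
    for key, row in zip(keys, rows):
        groups[key].append(row)
    return {key: [statistics.median(column) for column in zip(*members)]
            for key, members in groups.items()}
```

Applying `collapse_by_median` once with peptide keys and again with protein keys yields the peptide-level and protein-level estimates described above.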
Principal component analysis
For the principal component analysis as shown in Fig 8a, data was filtered and normalized in the same manner as discussed previously. Additional experiments were manually removed from the set due to different experimental designs or poorer overall coverage that would have required additional imputation on that experiment’s inclusion.
PSMs were separated into two sets, as described in Fig 7a: Spectra and DART-ID. PSMs in the DART-ID set belonging to any parent protein in the Spectra set were filtered out, so that the two PSM sets contained no shared proteins. Additionally, proteins that were not observed in at least 95% of the selected experiments were removed in order to reduce the amount of imputation required.
Normalized TMT quantification data was first collapsed from PSM-level to peptide-level by averaging (mean) PSM measurements for the same peptide. This process was repeated to estimate protein-level quantitation from peptide-level quantitation. This data, from both sets, was then reshaped into an expression matrix, with proteins on the rows and “single cells” (TMT channel-experiment pairs) on the columns. As described earlier in the Results section, these samples are not actual single cells but are instead comprised of cell lysate at the expected abundance of a single cell; see Table 1.
Missing values in this expression matrix were imputed with the k-nearest-neighbors (kNN) algorithm, with Euclidean distance as the similarity measure and k set to 5. A similarity matrix was then derived from this expression matrix by correlating (Pearson correlation) the matrix with itself. Singular value decomposition (SVD) was then performed on the similarity matrix to obtain the principal component loadings. These loadings are the left singular vectors (the columns of U of SVD: UDU^T). Each circle was then colored based on the type of the corresponding cell from annotations of the experimental designs.
Our raw data was searched with both the PSM and protein FDR threshold set, in the search engine, to 100% to include as many PSMs as possible. Therefore, once PSM confidences were updated with RT evidence, we needed to propagate those new confidences to the protein level in order to avoid any spurious protein identifications from degenerate peptide sequences. This is especially pertinent as many of the new DART-ID PSMs support proteins with no other confidently identified peptides, S9 Fig. Ideally we would run our updated PSMs back through our original search engine pipeline (MaxQuant/Andromeda) [7, 21], but that is currently not possible due to technical restrictions.
Any interpretation of the DART-ID data on the protein-level was first run through the Fido protein inference algorithm, which gives the probability of the presence of a protein in a sample given the pool of observed peptides and the probabilities of their constituent PSMs. The Python port of Fido was downloaded from https://noble.gs.washington.edu/proj/fido and modified to be compatible with Python 3. The code was directly interfaced into DART-ID and is available to run as a user option.
For the data in this paper, protein-level analyses first had their proteins filtered at 1% FDR, where the FDR was derived from the probabilities given to each protein by the Fido algorithm. We ran Fido with the default parameters gamma: 0.5, alpha: 0.1, beta: 0.01, connected protein threshold: 14, protein grouping and using all PSMs set to false, and pruning low scores set to true.
Application to other datasets
In Fig 4 we evaluated DART-ID on two other third-party, publicly available datasets: iPRG 2015 (MassIVE ID: MSV000079843), 12 label-free runs of yeast lysate, and TKO 2018 (ProteomeXchange ID: PXD011654), 40 TMT-labelled runs of yeast lysate. Raw files were searched in MaxQuant 220.127.116.11, against a UniProt yeast database (6721 entries, 2018/05/01). The iPRG 2015 dataset was searched with cysteine carbamidomethylation (+57.02146 Da) as a fixed modification and methionine oxidation (+15.99492 Da), protein N-terminal acetylation (+42.01056 Da), and asparagine/aspartate deamidation (+0.98401 Da) as variable modifications. The TKO 2018 dataset was searched with TMT11-plex on lysine/N-terminus, cysteine carbamidomethylation (+57.02146 Da) as a fixed modification and methionine oxidation (+15.99492 Da) as a variable modification. Both searches were done with Trypsin specificity, and PSM/protein confidence thresholds were set at 1 (100%) to obtain as many PSMs as possible. Searched data, configuration files, and DART-ID analysis results are available online.
The DART-ID pipeline is roughly divided into three parts. First, input data from search engine output files are converted to a common format, and PSMs unsuitable for alignment are marked for removal. Second, we estimate the alignment parameters and reference RTs by finding the maximum of the posterior distribution (Eq 4). Initial values for the algorithm are generated by running a simple estimation of reference RTs and linear regression parameters for fik for each experiment. Third, inferred alignment parameters and reference RTs are used to update the PEP of each PSM.
The model was implemented using the STAN modeling language. All densities were represented on the log scale. STAN was interfaced into an R script with rstan. STAN was used with its optimizing function, which gave maximum a posteriori (MAP) estimates of the parameters, as opposed to sampling from the full posterior. R was further used for data filtering, PEP updating, model adjustment, and figure creation. The code is also ported to Python3 and pystan, and is available as a pip package dart_id that can be run from the command-line. DART-ID is run with a configuration file that specifies inputs and options. All model definitions and related parameters such as distributions are defined in a modular fashion, which supports the addition of other models or fits. Full instructions for using the Python program are available at https://dart-id.slavovlab.net.
Code for analysis and figure generation is available at: github.com/SlavovLab/DART-ID_2018. The Python program for DART-ID, as well as instructions for usage and examples, are available on GitHub as a separate repository: https://github.com/SlavovLab/DART-ID. All raw files, searched data, configuration files, and analyzed data are publicly available and deposited on MassIVE (ID: MSV000083149) and ProteomeXchange (ID: PXD011748).
S1 File. DART-ID Post-run Report.
A optional HTML report generated by the dart_id Python script. The report gives a summary of the alignment for each experiment, as well as a broad overview of the performance of the run as a whole, by showing aggregate increases in PSMs at a chosen confidence threshold.
S1 Table. SCoPE-MS and mPOP Experimental Designs.
An Excel spreadsheet of the experimental designs of all raw files. Included are parameters for the liquid chromatography and parameters for the mass spectrometer. Also specified is the TMT channel layout for each experiment, with labels for J (T-cells, Jurkat cell line), U (monocytes, U-937 cell line), and H (human embryonic kidney cells, HEK-293 cell line).
S2 Table. Mappings of raw files to figures.
An Excel spreadsheet providing a map that relates figures/analyses to raw files listed in S1 Table. TRUE denotes that the figure/analysis used that raw file, whereas FALSE denotes that it did not.
S1 Fig. Mixture model incorporates spectral confidence to estimate likelihood of observing RTs.
In the global alignment process, the likelihood of the alignment function and the reference RT is estimated from a mixture model, which combines the two possibilities of whether the peptide is assigned the correct or incorrect peptide sequence. These two distributions are then weighted by the error probability (PEP). This is similar to the update process, which updates the error probability and incorporates the previous error probability, as well as the two conditional probability distributions.
S2 Fig. Comparison of linear and segmented fits for reference RTs in experiments.
(a) The reference RT of PSMs compared with their observed RTs, with the model plotted as a green line (linear fit) or as green and red lines (segmented fit, representing the two segments). For the segmented fit, the inflection point is marked with the dotted blue line. Both fits were specified separately and run separately with the same input data. (b) Empirical cumulative distribution function (ECDF) of the residual RTs for both fits. The residual RT is defined as Observed RT − Inferred RT, where the inferred RT is the reference RT aligned to that particular experiment via the model function (linear or segmented). (c) Model-fitted standard deviations, σik, for each PSM as estimated by both linear and segmented fits. Points below the 45° line indicate a lower modeled RT standard deviation for the segmented fit, and vice versa. Clusters of points correspond to PSMs belonging to a particular experiment, as the PSM-specific σik depends mostly on the experiment in which the PSM is observed.
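The difference between the two fits compared in S2 Fig can be sketched in Python as follows. The parameter names (beta0, beta1, beta2, split) and the continuity constraint at the inflection point are assumptions made for illustration. They are not taken from the DART-ID model definition.

```python
def linear_fit(reference_rt, beta0, beta1):
    """Single-line alignment: one intercept and one slope per experiment."""
    return beta0 + beta1 * reference_rt


def segmented_fit(reference_rt, beta0, beta1, beta2, split):
    """Two-segment alignment, continuous at the inflection point `split`.

    Below `split` the slope is beta1; at or above it the slope is beta2.
    """
    if reference_rt < split:
        return beta0 + beta1 * reference_rt
    return beta0 + beta1 * split + beta2 * (reference_rt - split)


def residual_rt(observed_rt, inferred_rt):
    """Residual RT as defined in panel (b): observed minus inferred."""
    return observed_rt - inferred_rt


# Hypothetical experiment whose late-eluting peptides follow a shallower slope.
early = segmented_fit(10.0, beta0=1.0, beta1=2.0, beta2=0.5, split=20.0)
late = segmented_fit(30.0, beta0=1.0, beta1=2.0, beta2=0.5, split=20.0)
```

When beta1 equals beta2, the segmented function reduces to the linear one. In that case any improvement in the residuals must come from a genuine change in slope along the gradient.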
S3 Fig. Accuracy of RT inferences varies with time and between experiments.
(a) Residual RT (observed RT − aligned RT) binned by RT for 60 min LC-MS runs. The gradient run is 5–35%B from 0–48 min, with a wash step of 35–100%B from 48–60 min. (b) Residual RT varying between different experiments, all 60 min LC-MS/MS runs.
S4 Fig. Distribution choice for inferred RT distribution and null RT distribution.
(a) Empirical distribution of all residual RTs, i.e., Observed RT − Predicted RT, and (b) all RTs. Red lines denote the distributions parametrized from the data.
S5 Fig. Bayesian updates of PSM confidence using RTs estimated by different methods.
(a) 2D density distributions of posterior error probabilities (PEP) derived from spectra alone (Spectral PEP) compared to the PEP after incorporating RT evidence. The RT estimates are the same as the ones shown in Fig 2. (b) Comparison of updated PEP derived from DART-ID and MaxQuant RT estimates. (c) Increase in confident PSMs at set confidence threshold using updated PEPs. (d) Validation of upgraded PSMs with quantification variance within proteins.
S6 Fig. Distributions of Spectral PEPs and DART-ID PEPs.
The bimodality of the DART-ID distribution suggests that DART-ID’s use of RTs helps cleanly separate correct from incorrect PSMs.
S7 Fig. Consistency of quantification between Spectra and DART-ID PSM sets.
The fold change in normalized RI intensity (T-cell/monocyte), from common proteins between the Spectra and DART-ID PSM sets. We included all proteins—not just those that are significantly (< 1% FDR) differentially abundant.
S8 Fig. Deriving conditional probability of RT given a correct match.
(a) The conditional probability distribution of RT given a correct peptide sequence assignment incorporates evidence about that peptide sequence across many different experiments. “Aligned RT” is the RT after applying the alignment function, and “Std” is the inferred RT standard deviation for the peptide in the given experiment. (b) For each RT observation for a sequence in an experiment, we infer two distributions: one corresponding to the RT density given a correct PSM and the other to the RT density given an incorrect PSM. These densities are weighted by 1 − PEP and PEP, respectively, and summed to produce the marginal RT distribution. (c) The marginal RT distribution is then used to sample B bootstrap replicates of the observed RTs. Each bootstrapped RT is then used to construct a bootstrapped reference RT for a given sequence. The reference RT is the median of the resampled RTs (in the aligned space). (d) The B bootstrap samples of μi are used to build distributions where the variance is determined by the model-derived variance of the peptide in an experiment. (e) The combination of the distributions in panel (d) forms a posterior predictive distribution for the observed RT, given that the peptide sequence assignment is correct.
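Panels (c) to (e) can be approximated with a simplified Python sketch. Instead of sampling from the full marginal RT distribution, it resamples the aligned RTs with weights of 1 − PEP, which is a deliberate simplification. The function names, the Laplace density, the scale value, and the example data are assumptions for illustration and do not reproduce the DART-ID code.

```python
import math
import random
import statistics


def bootstrap_reference_rts(aligned_rts, peps, n_boot=100, seed=0):
    """Return n_boot bootstrapped reference RTs (medians of resampled RTs).

    Simplification: observations are resampled with weight 1 - PEP rather
    than drawn from the full marginal RT distribution described in panel (b).
    Assumes at least one PEP is below 1 so the weights are not all zero.
    """
    rng = random.Random(seed)
    weights = [1.0 - pep for pep in peps]
    references = []
    for _ in range(n_boot):
        sample = rng.choices(aligned_rts, weights=weights, k=len(aligned_rts))
        references.append(statistics.median(sample))
    return references


def posterior_predictive_density(rt, reference_rts, scale):
    """Average of Laplace densities centred on each bootstrapped reference RT.

    The Laplace family and a single shared `scale` are assumptions.
    """
    total = sum(
        math.exp(-abs(rt - mu) / scale) / (2.0 * scale) for mu in reference_rts
    )
    return total / len(reference_rts)


# Hypothetical peptide seen in five experiments; the last PSM is likely wrong.
aligned_rts = [30.1, 29.8, 30.4, 30.0, 41.7]
peps = [0.01, 0.02, 0.05, 0.01, 0.90]
references = bootstrap_reference_rts(aligned_rts, peps, n_boot=100, seed=0)
density_near = posterior_predictive_density(30.0, references, scale=0.5)
density_far = posterior_predictive_density(41.7, references, scale=0.5)
```

Taking the median within each replicate keeps a single outlying RT from shifting the reference RT. Averaging over replicates carries the uncertainty in the reference RT into the density used for the update.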
S9 Fig. Distribution of peptides quantified per protein.
(a) Quantified PSMs per protein, including peptide sequences quantified across multiple experiments, and (b) peptide sequences quantified per protein. “Spectra” indicates proteins from PSMs identified below 1% FDR. “DART-ID new proteins” indicates PSMs boosted to below 1% FDR that have different protein assignments from “Spectra”, i.e., this set of proteins and the “Spectra” set of proteins are disjoint. “DART-ID all proteins” contains all PSMs with updated DART-ID FDR < 1%, regardless of protein assignment. All PSMs are filtered at < 1% FDR at the protein level.
We thank H. Specht, T. Chen, and members of the Slavov laboratory for discussions and constructive comments. We also thank L. Reiter and L. Verbeke for access to the iRT method within the Biognosys Pulsar software.
- 1. Budnik B, Levy E, Harmange G, Slavov N. SCoPE-MS: mass-spectrometry of single mammalian cells quantifies proteome heterogeneity during cell differentiation. Genome Biology. 2018;19:161. pmid:30343672
- 2. Specht H, Harmange G, Perlman DH, Emmott E, Niziolek Z, Budnik B, et al. Automated sample preparation for high-throughput single-cell proteomics. bioRxiv. 2018.
- 3. Levy E, Slavov N. Single cell protein analysis for systems biology. Essays In Biochemistry. 2018;62. pmid:30072488
- 4. Specht H, Slavov N. Transformative opportunities for single-cell proteomics. Journal of Proteome Research. 2018;17:2563–2916.
- 5. MacLean B, Tomazela DM, Shulman N, Chambers M, Finley GL, Frewen B, et al. Skyline: an open source document editor for creating and analyzing targeted proteomics experiments. Bioinformatics. 2010;26(7):966–968. pmid:20147306
- 6. Argentini A, Goeminne LJE, Verheggen K, Hulstaert N, Staes A, Clement L, et al. moFF: a robust and automated approach to extract peptide ion intensities. Nature Methods. 2016;13(12):964–966. pmid:27898063
- 7. Cox J, Mann M. MaxQuant enables high peptide identification rates, individualized p.p.b.-range mass accuracies and proteome-wide protein quantification. Nature Biotechnology. 2008;26:1367–1372. pmid:19029910
- 8. Tyanova S, Temu T, Cox J. The MaxQuant computational platform for mass spectrometry-based shotgun proteomics. Nature Protocols. 2016;11(12):2301–2319. pmid:27809316
- 9. Zhang B, Käll L, Zubarev RA. DeMix-Q: Quantification-Centered Data Processing Workflow. Molecular & Cellular Proteomics. 2016;15(4):1467–1478.
- 10. Weisser H, Choudhary JS. Targeted Feature Detection for Data-Dependent Shotgun Proteomics. Journal of Proteome Research. 2017;16(8):2964–2974. pmid:28673088
- 11. Cox J, Hein MY, Luber CA, Paron I, Nagaraj N, Mann M. Accurate Proteome-wide Label-free Quantification by Delayed Normalization and Maximal Peptide Ratio Extraction, Termed MaxLFQ. Molecular & Cellular Proteomics. 2014;13(9):2513–2526.
- 12. Ong SE, Mann M. A practical recipe for stable isotope labeling by amino acids in cell culture (SILAC). Nature Protocols. 2007;1(6):2650–2660.
- 13. Käll L, Canterbury JD, Weston J, Noble WS, MacCoss MJ. Semi-supervised learning for peptide identification from shotgun proteomics datasets. Nature Methods. 2007;4(11):923–925. pmid:17952086
- 14. Strittmatter EF, Kangas LJ, Petritis K, Mottaz HM, Anderson GA, Shen Y, et al. Application of Peptide LC Retention Time Information in a Discriminant Function for Peptide Identification by Tandem Mass Spectrometry. Journal of Proteome Research. 2004;3:760–769. pmid:15359729
- 15. Klammer AA, Yi X, MacCoss MJ, Noble WS. Improving Tandem Mass Spectrum Identification Using Peptide Retention Time Prediction across Diverse Chromatography Conditions. Analytical Chemistry. 2007;79(16):6111–6118. pmid:17622186
- 16. Pfeifer N, Leinenbach A, Huber CG, Kohlbacher O. Statistical learning of peptide retention behavior in chromatographic separations: a new kernel-based approach for computational proteomics. BMC Bioinformatics. 2007;8(1):468. pmid:18053132
- 17. Pfeifer N, Leinenbach A, Huber CG, Kohlbacher O. Improving Peptide Identification in Proteome Analysis by a Two-Dimensional Retention Time Filtering Approach. Journal of Proteome Research. 2009;8(8):4109–4115. pmid:19492844
- 18. Li GZ, Vissers JPC, Silva JC, Golick D, Gorenstein MV, Geromanos SJ. Database searching and accounting of multiplexed precursor and product ion spectra from the data independent analysis of simple and complex peptide mixtures. Proteomics. 2009;9(6):1696–1719. pmid:19294629
- 19. Dorfer V, Maltsev S, Winkler S, Mechtler K. CharmeRT: Boosting peptide identifications by chimeric spectra identification and retention time prediction. Journal of Proteome Research. 2018.
- 20. Keller A, Nesvizhskii AI, Kolker E, Aebersold R. Empirical Statistical Model To Estimate the Accuracy of Peptide Identifications Made by MS/MS and Database Search. Analytical Chemistry. 2002;74(20):5383–5392. pmid:12403597
- 21. Cox J, Neuhauser N, Michalski A, Scheltema RA, Olsen JV, Mann M. Andromeda: A Peptide Search Engine Integrated into the MaxQuant Environment. Journal of Proteome Research. 2011;10(4):1794–1805. pmid:21254760
- 22. Moruz L, Käll L. Peptide retention time prediction. Mass Spectrometry Reviews. 2017;36(5):615–623. pmid:26799864
- 23. Krokhin OV, Ying S, Cortens JP, Ghosh D, Spicer V, Ens W, et al. Use of Peptide Retention Time Prediction for Protein Identification by off-line Reversed-Phase HPLC-MALDI MS/MS. Analytical Chemistry. 2006;78(17):6265–6269. pmid:16944911
- 24. McQueen P, Spicer V, Rydzak T, Sparling R, Levin D, Wilkins JA, et al. Information-dependent LC-MS/MS acquisition with exclusion lists potentially generated on-the-fly: Case study using a whole cell digest of Clostridium thermocellum. PROTEOMICS. 2012;12(8):1160–1169. pmid:22577018
- 25. Meek JL. Prediction of peptide retention times in high-pressure liquid chromatography on the basis of amino acid composition. Proceedings of the National Academy of Sciences. 1980;77(3):1632–1636.
- 26. Guo D, Mant CT, Taneja AK, Parker JMR, Hodges RS. Prediction of peptide retention times in reversed-phase high-performance liquid chromatography I. Determination of retention coefficients of amino acid residues of model synthetic peptides. Journal of Chromatography A. 1986;359:499–518. https://doi.org/10.1016/0021-9673(86)80102-9.
- 27. Sakamoto Y, Kawakami N, Sasagawa T. Prediction of peptide retention times. Journal of Chromatography A. 1988;442:69–79.
- 28. Krokhin OV, Craig R, Spicer V, Ens W, Standing KG, Beavis RC, et al. An Improved Model for Prediction of Retention Times of Tryptic Peptides in Ion Pair Reversed-phase HPLC: Its Application to Protein Peptide Mapping by Off-Line HPLC-MALDI MS. Molecular & Cellular Proteomics. 2004;3(9):908–919.
- 29. Baczek T, Wiczling P, Marszall M, Heyden YV, Kaliszan R. Prediction of Peptide Retention at Different HPLC Conditions from Multiple Linear Regression Models. Journal of Proteome Research. 2005;4(2):555–563. pmid:15822934
- 30. Krokhin OV. Sequence-Specific Retention Calculator. Algorithm for Peptide Retention Prediction in Ion-Pair RP-HPLC: Application to 300- and 100-Å Pore Size C18 Sorbents. Analytical Chemistry. 2006;78(22):7785–7795. pmid:17105172
- 31. Gorshkov AV, Tarasova IA, Evreinov VV, Savitski MM, Nielsen ML, Zubarev RA, et al. Liquid Chromatography at Critical Conditions: Comprehensive Approach to Sequence-Dependent Retention Time Prediction. Analytical Chemistry. 2006;78(22):7770–7777. pmid:17105170
- 32. Petritis K, Kangas LJ, Ferguson PL, Anderson GA, Paša-Tolić L, Lipton MS, et al. Use of Artificial Neural Networks for the Accurate Prediction of Peptide Liquid Chromatography Elution Times in Proteome Analyses. Analytical Chemistry. 2003;75(5):1039–1048. pmid:12641221
- 33. Petritis K, Kangas LJ, Yan B, Monroe ME, Strittmatter EF, Qian WJ, et al. Improved Peptide Elution Time Prediction for Reversed-Phase Liquid Chromatography-MS by Incorporating Peptide Sequence Information. Analytical Chemistry. 2006;78(14):5026–5039. pmid:16841926
- 34. Moruz L, Tomazela D, Käll L. Training, Selection, and Robust Calibration of Retention Time Models for Targeted Proteomics. Journal of Proteome Research. 2010;9(10):5209–5216. pmid:20735070
- 35. Lu W, Liu X, Liu S, Cao W, Zhang Y, Yang P. Locus-specific Retention Predictor (LsRP): A Peptide Retention Time Predictor Developed for Precision Proteomics. Scientific Reports. 2017;7:43959. pmid:28303880
- 36. Palmblad M, Ramström M, Markides KE, Håkansson P, Bergquist J. Prediction of Chromatographic Retention and Protein Identification in Liquid Chromatography/Mass Spectrometry. Analytical Chemistry. 2002;74(22):5826–5830. pmid:12463368
- 37. Palmblad M, Ramstrom M, Bailey C, McCutchen-Maloney S, Bergquist J, Zeller L. Protein identification by liquid chromatography-mass spectrometry using retention time prediction. Journal of Chromatography B. 2004;803(1):131–135.
- 38. Silva JC, Denny R, Dorschel CA, Gorenstein M, Kass IJ, Li GZ, et al. Quantitative Proteomic Analysis by Accurate Mass Retention Time Pairs. Analytical Chemistry. 2005;77(7):2187–2200. pmid:15801753
- 39. Conrads TP, Anderson GA, Veenstra TD, Paša-Tolić L, Smith RD. Utility of Accurate Mass Tags for Proteome-Wide Protein Identification. Analytical Chemistry. 2000;72(14):3349–3354. pmid:10939410
- 40. Norbeck AD, Monroe ME, Adkins JN, Anderson KK, Daly DS, Smith RD. The Utility of Accurate Mass and LC Elution Time Information in the Analysis of Complex Proteomes. Journal of the American Society for Mass Spectrometry. 2005;16(8):1239–1249. pmid:15979333
- 41. Bochet P, Rügheimer F, Guina T, Brooks P, Goodlett D, Clote P, et al. Fragmentation-free LC-MS can identify hundreds of proteins. Proteomics. 2011;11(1):22–32. pmid:21182191
- 42. Krokhin OV, Spicer V. Peptide Retention Standards and Hydrophobicity Indexes in Reversed-Phase High-Performance Liquid Chromatography of Peptides. Analytical Chemistry. 2009;81(22):9522–9530. pmid:19848410
- 43. Escher C, Reiter L, MacLean B, Ossola R, Herzog F, Chilton J, et al. Using iRT, a normalized retention time for more targeted measurement of peptides. Proteomics. 2012;12(8):1111–1121. pmid:22577012
- 44. van Nederkassel AM, Daszykowski M, Eilers PHC, Heyden YV. A comparison of three algorithms for chromatograms alignment. Journal of Chromatography A. 2006;1118(2):199–210. pmid:16643929
- 45. Podwojski K, Fritsch A, Chamrad DC, Paul W, Sitek B, Stühler K, et al. Retention time alignment algorithms for LC/MS data must consider non-linear shifts. Bioinformatics. 2009;25(6):758–764. pmid:19176558
- 46. Lange E, Tautenhahn R, Neumann S, Gröpl C. Critical assessment of alignment procedures for LC-MS proteomics and metabolomics measurements. BMC Bioinformatics. 2008;9(1):375. pmid:18793413
- 47. Stanstrup J, Neumann S, Vrhovšek U. PredRet: Prediction of Retention Time by Direct Mapping between Multiple Chromatographic Systems. Analytical Chemistry. 2015;87(18):9421–9428. pmid:26289378
- 48. Fischer B, Grossmann J, Roth V, Gruissem W, Baginsky S, Buhmann JM. Semi-supervised LC/MS alignment for differential proteomics. Bioinformatics. 2006;22(14):e132–e140. pmid:16873463
- 49. Bernhardt OM, Selevsek N, Gillet LC, Rinner O, Picotti P, Aebersold R, et al. Spectronaut: A fast and efficient algorithm for MRM-like processing of data independent acquisition (SWATH-MS) data. poster. 2012; p. 1.
- 50. Gillet LC, Navarro P, Tate S, Röst HL, Selevsek N, Reiter L, et al. Targeted Data Extraction of the MS/MS Spectra Generated by Data-independent Acquisition: A New Concept for Consistent and Accurate Proteome Analysis. Molecular & Cellular Proteomics. 2012;11(6):17.
- 51. Röst HL, Rosenberger G, Navarro P, Gillet L, Miladinović SM, Schubert OT, et al. OpenSWATH enables automated, targeted analysis of data-independent acquisition MS data. Nature Biotechnology. 2014;32(3):219–223. pmid:24727770
- 52. Bruderer R, Bernhardt OM, Gandhi T, Reiter L. High-precision iRT prediction in the targeted analysis of data-independent acquisition and its impact on identification and quantitation. PROTEOMICS. 2016;16(15-16):2246–2256. pmid:27213465
- 53. Malioutov D, Slavov N. Convex Total Least Squares. Journal of Machine Learning Research. 2014;32:109–117.
- 54. Huffman G, Specht H, Chen AT, Slavov N. DO-MS: Data-Driven Optimization of Mass Spectrometry Methods. Journal of Proteome Research. 2019. pmid:31081635
- 55. Elias JE, Gygi SP. Target-decoy search strategy for increased confidence in large-scale protein identifications by mass spectrometry. Nature Methods. 2007;4(3):207–214. pmid:17327847
- 56. Käll L, Storey JD, MacCoss MJ, Noble WS. Posterior Error Probabilities and False Discovery Rates: Two Sides of the Same Coin. Journal of Proteome Research. 2008;7(1):40–44. pmid:18052118
- 57. Choi M, Eren-Dogu ZF, Colangelo C, Cottrell J, Hoopmann MR, Kapp EA, et al. ABRF Proteome Informatics Research Group (iPRG) 2015 Study: Detection of Differentially Abundant Proteins in Label-Free Quantitative LC–MS/MS Experiments. Journal of Proteome Research. 2017;16(2):945–957. pmid:27990823
- 58. Gygi JP, Yu Q, Navarrete-Perea J, Rad R, Gygi SP, Paulo JA. Web-Based Search Tool for Visualizing Instrument Performance Using the Triple Knockout (TKO) Proteome Standard. Journal of Proteome Research. 2018.
- 59. Franks A, Airoldi E, Slavov N. Post-transcriptional regulation across human tissues. PLOS Computational Biology. 2017;13(5):e1005535. pmid:28481885
- 60. Specht H, Emmott E, Koller T, Slavov N. High-throughput single-cell proteomics quantifies the emergence of macrophage heterogeneity. bioRxiv. 2019.
- 61. Verbeke L, Bernhardt OM, Gandhi T, Bruderer R, Reiter L. Pulsar: A Search Engine Integrated into Spectronaut using Dynamic PSM Stratification. 2017; p. 1.
- 62. Carpenter B, Lee D, Brubaker MA, Riddell A, Gelman A, Goodrich B, et al. Stan: A Probabilistic Programming Language; 2017.
- 63. Nesvizhskii AI, Keller A, Kolker E, Aebersold R. A Statistical Model for Identifying Proteins by Tandem Mass Spectrometry. Analytical Chemistry. 2003;75(17):4646–4658. pmid:14632076
- 64. Serang O, MacCoss MJ, Noble WS. Efficient Marginalization to Compute Protein Posterior Probabilities from Shotgun Mass Spectrometry Data. Journal of Proteome Research. 2010;9(10):5346–5357. pmid:20712337