Amino acid sequence assignment from single molecule peptide sequencing data using a two-stage classifier

doi:10.1371/journal.pcbi.1011157

Fig 1.

Overview of protein fluorosequencing.

(A) Illustration of the sample preparation process. Each grey circle represents an amino acid, and the letter in the circle is its standard single-letter amino acid code. Proteins are denatured, cleaved with a protease, and labeled with fluorescent dyes; the labeled peptides are then immobilized by their C-termini on the surface of a flow cell. (B) The Edman degradation chemical reaction cycle, used to predictably remove one amino acid per cycle from each peptide. (C) For a given peptide, the sequencing process removes amino acids one at a time from the N-terminus, taking any attached fluorescent dyes with them. (D) Major steps in computational data analysis: (1) For each field of view, image analysis extracts the fluorescence intensity of each spot (peptide) in each fluorescent channel at each time step (cycle) and collates these values per spot. The result is a vector of fluorescence intensities with one floating-point value for every combination of time step and fluorophore color. (2) These raw sequencing intensity vectors (raw reads) must then be classified as particular peptides from a reference database. This step is the primary concern of this paper. (3) Identified peptides can then be used to identify and quantify the proteins in the original biological sample.
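
As a rough illustration of panels (C) and (D), the sketch below computes the idealized number of dyes remaining in each color channel after every Edman cycle, using the peptide RKKAY with K and Y labeled in two colors as an example. It assumes perfect labeling and no dye loss, failed cleavage, or peptide detachment. The function name `ideal_dye_counts` and the channel indices are hypothetical and are not taken from the paper. A real raw read would hold noisy floating-point intensities rather than integer counts.

```python
def ideal_dye_counts(peptide, labels, num_cycles):
    """Return one row per time step; each row lists the dye count per channel.

    peptide    -- amino acid sequence, N-terminus first
    labels     -- maps a labeled residue letter to its channel index (assumed)
    num_cycles -- number of Edman cycles performed
    """
    channels = sorted(set(labels.values()))
    trace = []
    for cycle in range(num_cycles + 1):
        # After `cycle` successful cycles, the first `cycle` residues are gone.
        remaining = peptide[cycle:]
        trace.append(
            [sum(1 for aa in remaining if labels.get(aa) == ch) for ch in channels]
        )
    return trace


# Hypothetical labeling: K in channel 0, Y in channel 1.
for step, counts in enumerate(ideal_dye_counts("RKKAY", {"K": 0, "Y": 1}, 4)):
    print(step, counts)
```

Flattening these rows gives a vector with one entry per time step and color, the same shape as the intensity vector described in step (1).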

Fig 2.

Nomenclature for different stages of fluorosequencing data analysis.

The whatprot algorithm maps raw single-molecule protein sequencing reads to peptides and their parent proteins in the reference proteome (black arrows) by comparing experimental data (at bottom) to synthetic data generated using a Monte Carlo simulation (gray arrows).

Fig 3.

Illustration of the states and transitions of the HMM for an example peptide.

For the amino acid sequence RKKAY, we illustrate the case where the lysine (K) residues are labeled with fluorescent dyes of one color (blue stars) and the tyrosine (Y) residue is labeled by a second color (green star).

Fig 4.

Illustration of HMM state space reduction for the peptide of Fig 3.

States are combined that have both the same number of amino acids remaining and the same fluorophore counts for each color of fluorophore.
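
The merging rule can be sketched by enumerating states for a small peptide. The sketch assumes, for illustration only, that an unreduced state records how many amino acids have been removed and exactly which of the still-attached dyes are active; the caption does not spell out the unreduced state definition, so this is an assumption, as are the function name `enumerate_states` and the color indices. The reduced state keeps only the removal count and the number of active dyes per color.

```python
from itertools import combinations


def enumerate_states(peptide, labels):
    """Return (full_states, reduced_states) as sets of hashable tuples.

    labels maps a labeled residue letter to a color index (assumed scheme).
    """
    colors = sorted(set(labels.values()))
    dye_positions = [i for i, aa in enumerate(peptide) if aa in labels]
    full, reduced = set(), set()
    for removed in range(len(peptide) + 1):
        # Dyes on residues that have not yet been cleaved off.
        attached = [i for i in dye_positions if i >= removed]
        for k in range(len(attached) + 1):
            for active in combinations(attached, k):
                full.add((removed, active))
                # Merge states that agree on per-color active dye counts.
                counts = tuple(
                    sum(1 for i in active if labels[peptide[i]] == c)
                    for c in colors
                )
                reduced.add((removed, counts))
    return full, reduced


full, reduced = enumerate_states("RKKAY", {"K": 0, "Y": 1})
print(len(full), len(reduced))
```

The saving is modest for a peptide with three dyes, but it grows quickly with the number of dyes of the same color, since the number of subsets grows exponentially while the number of counts grows linearly.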

Fig 5.

An illustration of the factoring of the transition matrix for the peptide from Fig 4.

Note especially the reduction in the total number of transitions (arrows) when the transition matrix is factored. At left, black arrows represent non-zero entries in the unfactored transition matrix. At right, colored arrows (see key) represent non-zero entries in each of the matrices in the factored product. In both diagrams, arrows from a state to itself are omitted for visual clarity.
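
The general idea of factoring can be sketched with a toy model. The example assumes a state consisting only of active dye counts for two colors, with independent per-dye loss as the only kind of transition; the actual factors used in the paper are not given in this caption, and the names `dye_loss_matrix` and `factored_and_full` and the loss probability are hypothetical. Each factor changes one color only, and their product equals the joint transition matrix while the factors together hold fewer non-zero entries.

```python
from math import comb

import numpy as np


def dye_loss_matrix(max_dyes, p_loss):
    """Entry [i, j] is the probability of going from i to j active dyes."""
    m = np.zeros((max_dyes + 1, max_dyes + 1))
    for i in range(max_dyes + 1):
        for j in range(i + 1):
            # Binomial: j of the i dyes survive, i - j are lost.
            m[i, j] = comb(i, j) * (1 - p_loss) ** j * p_loss ** (i - j)
    return m


def factored_and_full(max_a, max_b, p_loss):
    """Build one factor per color and their product (the unfactored matrix)."""
    loss_a = dye_loss_matrix(max_a, p_loss)
    loss_b = dye_loss_matrix(max_b, p_loss)
    factor_a = np.kron(loss_a, np.eye(max_b + 1))  # changes color a only
    factor_b = np.kron(np.eye(max_a + 1), loss_b)  # changes color b only
    return factor_a, factor_b, factor_a @ factor_b


factor_a, factor_b, full = factored_and_full(4, 4, 0.05)
print("non-zeros, unfactored:", np.count_nonzero(full))
print("non-zeros, factored:  ",
      np.count_nonzero(factor_a) + np.count_nonzero(factor_b))
```

Applying the factors one after another to a probability vector costs work proportional to their combined non-zero count, which is why a sparser factorization can speed up the HMM forward pass. For very small state spaces the factors can hold more entries than the product, so the benefit appears only as the state space grows.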

Fig 6.

Illustration of the effects of HMM pruning for the peptide of Fig 5.

Fig 7.

Illustration of HMM pruning combined with transition matrix factoring for the peptide of Fig 5.

We emphasize that this is an anecdotal example; while there are more arrows here than in Fig 6, this strategy provides an improvement in asymptotic complexity, as described in S4 Appendix and shown in experiments with simulated data.

Fig 8.

Comparison of the HMM (Bayesian), hybrid, and NN classifiers.

The classifiers were run on a dataset of 10K reads from peptides chosen randomly from all 20,659 human proteins, trypsinized and labeled on D/E, C, and Y. (A) Precision-recall curves. (B) Runtimes.

Fig 9.

Comparison of different labeling strategies.

The hybrid classifier was run on a dataset of 10K reads from peptides chosen randomly from all 20,659 human proteins cleaved with EndoPRO. (A) Precision-recall curves. (B) Runtimes.

Fig 10.

Comparison of size of reference database.

The hybrid classifier was run on datasets of 10K reads each, from peptides chosen randomly from different numbers of random human proteins treated with the same protease and labeling scheme (trypsin, labeled on D/E, C, and Y). (A) Precision-recall curves. (B) Runtimes.

Fig 11.

Analysis of the accuracy of probability estimates given as scores by the classifier.

The analysis is based on 10K reads from peptides chosen randomly from all 20,659 human proteins cleaved with EndoPRO and labeled on D/E, C, Y, K, and H. (A) Classification results were sorted by their predicted accuracy scores and then distributed equally among 100 buckets. The average predicted and true accuracy scores were then computed for each bucket and plotted. (B) The true precision-recall curve was computed as normal, while the predicted precision-recall curve was plotted assuming each classification was fractionally correct according to its predicted accuracy score.
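
Both panels can be sketched in a few lines of NumPy. The function names and the synthetic scores below are hypothetical and serve only to exercise the code; they are not results from the paper. The sketch also assumes that recall is measured against the total number of reads, which this caption does not state.

```python
import numpy as np


def calibration_buckets(scores, correct, n_buckets=100):
    """Panel (A): mean predicted and mean true accuracy per score-sorted bucket."""
    scores = np.asarray(scores, dtype=float)
    correct = np.asarray(correct, dtype=float)
    order = np.argsort(scores, kind="stable")
    predicted = [b.mean() for b in np.array_split(scores[order], n_buckets)]
    observed = [b.mean() for b in np.array_split(correct[order], n_buckets)]
    return np.array(predicted), np.array(observed)


def precision_recall(scores, credit):
    """Panel (B): curve over reads ranked by descending score.

    credit is 0/1 correctness for the true curve, or the scores themselves
    for the predicted curve (each call counted as fractionally correct).
    """
    scores = np.asarray(scores, dtype=float)
    credit = np.asarray(credit, dtype=float)
    order = np.argsort(-scores, kind="stable")
    cumulative = np.cumsum(credit[order])
    rank = np.arange(1, len(scores) + 1)
    precision = cumulative / rank
    recall = cumulative / len(scores)  # assumption: denominator is all reads
    return precision, recall


# Synthetic, perfectly calibrated toy data (illustration only).
rng = np.random.default_rng(0)
scores = rng.uniform(size=10_000)
correct = rng.uniform(size=10_000) < scores

predicted, observed = calibration_buckets(scores, correct)
true_curve = precision_recall(scores, correct)
predicted_curve = precision_recall(scores, scores)
```

For a well-calibrated classifier, the bucket means lie close to the diagonal and the two precision-recall curves nearly coincide.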

Fig 12.

Precision and recall are improved for proteins by integrating identifications across peptides.

The example shows 10K simulated reads from peptides derived from 100 proteins randomly selected from the human proteome, considering trypsin digestion and labels on D/E, C, and Y. We note the “HMM (no pruning)” curve is hidden under the “HMM (pruned)” curve.

Fig 13.

Performance on C. elegans.

The example shows 10K simulated reads from peptides derived from 100 proteins randomly selected from the C. elegans proteome, considering trypsin digestion and labels on D/E, C, and Y. We note the “HMM (no pruning)” curve is hidden under the “HMM (pruned)” curve.

Fig 14.

Performance on yeast.

The example shows 10K simulated reads from peptides derived from 100 proteins randomly selected from a yeast proteome, considering trypsin digestion and labels on D/E, C, and Y. We note the “HMM (no pruning)” curve is hidden under the “HMM (pruned)” curve.
