Estimating error rates for single molecule protein sequencing experiments

doi:10.1371/journal.pcbi.1012258

Fig 1.

Overview of single molecule protein fluorosequencing with various potential sources of errors highlighted in red.


Fig 2.

Demonstration of the relationship between fluorosequencing data and the theoretical model.

(A) Illustration of a potential path through the states in the model; in this case, the path represents perfect sequencing with no errors. (B) Illustration of the same peptide being sequenced with successive Edman cycles, aligned with the images in (C). (C) The raw data we wish to analyze, i.e., the light emitted from a single molecule, measured across two fluorescent channels (rows) for 5 Edman cycles (columns).


Fig 3.

Diagram of factored hidden Markov model for fluorosequencing of an example peptide with three labeled amino acids.


Fig 4.

Factored transition matrix diagram including N-terminal blocking.


Fig 5.

Illustration of HMM factorization.

(A) Diagram of a non-factored HMM. Arrows represent conditional probability relationships. The transitions between states determine how the state at one time step is probabilistically related to the state at the preceding time step. Emissions represent how the observable data are probabilistically determined by the associated state. (B) Diagram including a factored transition. Breaking a transition into a factored product of sub-transitions introduces “sub-steps”; though not accurate models of any physical states of an actual peptide, these sub-steps prove useful for algorithmic purposes.
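
The idea in (B) can be illustrated with a minimal sketch: applying two sub-transitions one after the other gives the same result as applying their product in a single step. The two-state toy model, the names `loss_step` and `remove_step`, and the rates used here are hypothetical and chosen only for illustration; they are not the model or the parameter values from the paper.

```python
def vec_mat(v, m):
    """Row vector times matrix (v @ m) for nested lists."""
    return [sum(v[i] * m[i][j] for i in range(len(v))) for j in range(len(m[0]))]


def mat_mul(a, b):
    """Matrix product a @ b for nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


# Hypothetical two-state toy: index 0 = dye still attached, 1 = dye gone.
P_LOSS = 0.05    # assumed per-cycle dye loss probability (illustrative only)
P_REMOVE = 0.10  # assumed per-cycle removal probability (illustrative only)

# Each sub-transition is a row-stochastic matrix.
loss_step = [[1 - P_LOSS, P_LOSS], [0.0, 1.0]]
remove_step = [[1 - P_REMOVE, P_REMOVE], [0.0, 1.0]]

# Non-factored view: one transition matrix per time step.
full_transition = mat_mul(loss_step, remove_step)

# Factored view: pass the state distribution through each sub-step in turn.
start = [1.0, 0.0]
via_substeps = vec_mat(vec_mat(start, loss_step), remove_step)
via_full = vec_mat(start, full_transition)

print(via_substeps, via_full)
```

In a model with many fluorophores, the practical appeal of the factored form is that each sub-step can be a small or sparse operator, so the full transition matrix never has to be built explicitly.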


Fig 6.

Data is lost due to missing fluorophores.


Fig 7.

Illustration of DIRECT and Powell’s method.


Fig 8.

Determination of fluorescence intensity distribution parameters and filtering of likely contaminants.

(A) Unaltered histogram of intensity values for a NH2-G{azK}*AG{azK}*| peptide sequencing experiment with superimposed normal distributions (red: fit to the peak maximum (μ) and half-width; yellow: expectation from 2μ). We typically observe some deviation from a normal distribution, which can make fitting the distribution difficult; in practice this is often solved either by fitting the peak maximum and half-width (as in the red curve) or by trial and error using expert judgment. (B) Clipped data for the same experiment, with ranges of intensity values likely to be caused by contaminants or signal bleed-over from adjacent peaks removed. Typically, reads are removed from subsequent analyses if any of their intensity values fall in those ranges.
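
As a rough sketch of the "peak maximum and half-width" fit described in (A), the code below estimates μ from the tallest histogram bin and σ from the half-width. It assumes that "half-width" means the half-width at half-maximum (HWHM) measured on the right side of the peak, for which a normal distribution gives σ = HWHM / sqrt(2 ln 2). The function name `fit_peak` and the synthetic histogram are hypothetical; this is not the authors' fitting code.

```python
import math


def fit_peak(centers, counts):
    """Estimate (mu, sigma) of a normal peak from its maximum and right-side HWHM.

    Assumes a single dominant peak and evenly ordered bin centers.
    """
    peak = max(range(len(counts)), key=lambda i: counts[i])
    mu = centers[peak]
    half = counts[peak] / 2.0
    for i in range(peak + 1, len(counts)):
        if counts[i] < half:
            # Linear interpolation between the two bins straddling half-maximum.
            frac = (counts[i - 1] - half) / (counts[i - 1] - counts[i])
            crossing = centers[i - 1] + frac * (centers[i] - centers[i - 1])
            hwhm = crossing - mu
            return mu, hwhm / math.sqrt(2.0 * math.log(2.0))
    raise ValueError("histogram never drops below half-maximum right of the peak")


# Synthetic, noise-free histogram (illustrative values only).
centers = [float(x) for x in range(201)]
counts = [1000.0 * math.exp(-((x - 100.0) ** 2) / (2.0 * 15.0 ** 2)) for x in centers]

mu, sigma = fit_peak(centers, counts)
print(mu, sigma)
```

Using only the peak position and one flank keeps the estimate insensitive to tails contaminated by adjacent peaks, which is presumably why this kind of fit is preferred when the histogram deviates from a normal shape.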


Fig 9.

The Baum-Welch and DIRECT + Powell’s methods agree within 0.5% error on simulated data.


Fig 10.

Simulated datasets with more reads exhibit tighter distributions of parameter estimates.


Fig 11.

A comparison of Baum-Welch and DIRECT + Powell’s method on experimental sequencing data for a two-fluorophore peptide shows general agreement between the methods.


Fig 12.

A higher Edman failure rate is observed for a proline-containing peptide, with general agreement between the estimation methods.


Fig 13.

Analysis of experimental fluorosequencing data from a peptide similar to that in Fig 12 but containing no proline residues exhibits lower Edman failure rates and shows agreement between the methods.


Fig 14.

Both parameter estimators correctly recognize high initial block rates (>91%) in an intentionally N-terminally blocked peptide.


Fig 15.

Longer TFA incubation times reduce the Edman failure rate but increase the dye loss rate.
