Fig 1.
A strategy for single-molecule peptide sequencing.
Proteins are extracted and digested into peptides by a sequence-specific endopeptidase. All occurrences of particular amino acids are selectively labeled with fluorescent dyes (e.g., yellow for tyrosine, green for tryptophan, and blue for lysine residues), and the peptides are immobilized on a surface for single-molecule imaging (e.g., by anchoring via cysteine). The peptides are subjected to cycles of Edman degradation; in each cycle, a fluorescent Edman reagent (pink trace) couples to and removes the most N-terminal amino acid. A step drop in fluorescence intensity indicates when a labeled amino acid is removed, which, in combination with the Edman cycle completion signal, gives the resulting fluorosequence (e.g., “WKKxY…”). Matching this partial sequence to a reference protein database identifies the peptide.
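As a minimal sketch of how a peptide maps to a fluorosequence string such as “WKKxY…”, the snippet below keeps labeled residues and writes every unlabeled residue as "x". The function name `to_fluorosequence` and the example peptide are hypothetical, chosen only for illustration.

```python
def to_fluorosequence(peptide, labeled):
    """Mask a peptide: labeled residues stay visible, all others become 'x'.

    peptide -- amino-acid string read from the N-terminus
    labeled -- collection of one-letter codes carrying a fluorophore
    """
    return "".join(aa if aa in labeled else "x" for aa in peptide)


# Hypothetical peptide with Trp, Lys and Tyr labeled.
example = to_fluorosequence("WKKAYGC", {"W", "K", "Y"})
print(example)
```

Each character position corresponds to one Edman cycle, so the string records at which cycle each labeled residue would be expected to leave.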
Fig 2.
Simulations of ideal experimental conditions suggest relatively simple labeling schemes are sufficient to identify most proteins in the human proteome.
Each curve summarizes the fraction of human proteins uniquely identified by at least one peptide as a function of the number of sequential experimental cycles (each cycle being a paired Edman degradation reaction and TIRF observation). Here, we consider peptides generated by different proteases (e.g., Glu represents cleavage C-terminal to glutamic acid residues by GluC; Met represents cleavage after methionine residues by cyanogen bromide) and under different labeling schemes (e.g., Lys + Tyr indicates Lys and Tyr selectively labeled with two distinguishable fluorophores; Asp/Glu indicates that both residues are labeled with identical fluorophores). Peptides are immobilized as indicated, with Cys representing anchoring by cysteines (thus, only cysteine-containing peptides are sequenced) and C-term representing anchoring by C-terminal amino acids. Increasing the number of distinct label types improves identification up to 80% within only 20 experimental cycles, even when only Cys-containing peptides are sequenced; near-total proteome coverage is theoretically achievable when cyanogen bromide-generated peptides are anchored by their C-termini and labeled by a combination of four different fluorophores. Cycle numbers denote upper bounds, since no fluorosequence is allowed to proceed past the anchoring residue (cysteine or C-terminus). Note also that the peptide length distribution depends on the enzyme used for cleavage, with median lengths of 26 amino acids for cyanogen bromide, 8 for GluC, and 10 for trypsin digests.
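The in silico digestion and anchoring filter described here can be sketched roughly as follows. This is an illustrative simplification (cleavage after every occurrence of the given residue, with no missed cleavages); `digest`, `keep_anchorable`, and the toy protein string are hypothetical names and data, not taken from the study.

```python
def digest(protein, cleave_after):
    """Cut a protein C-terminal to every residue listed in cleave_after."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        if aa in cleave_after:
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):          # trailing peptide without a cleavage site
        peptides.append(protein[start:])
    return peptides


def keep_anchorable(peptides, anchor="C"):
    """Keep only peptides containing the anchoring residue (e.g. cysteine)."""
    return [p for p in peptides if anchor in p]


toy_protein = "GKEGCAEMKYC"                   # hypothetical sequence
glu_peptides = digest(toy_protein, "E")       # GluC-like: cut after Glu
print(glu_peptides, keep_anchorable(glu_peptides))
```

Passing `"M"` instead of `"E"` would mimic cyanogen bromide cleavage after methionine under the same simplifying assumptions.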
Fig 3.
Typical proteolytic peptides have counts of labelable amino acids sufficiently low to sequence.
Frequency histograms of the number of lysine (A), tyrosine (B), tryptophan (C), and glutamic acid/aspartic acid (D) residues per in silico proteolytic peptide indicate low median values. Peptide sequences in A-C were generated in silico from the human proteome by GluC digestion, and those in D by cyanogen bromide digestion. Low counts of labelable amino acids per peptide are expected to increase the ability to discriminate the removal of one fluorophore amongst many on a peptide.
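A histogram of this kind can be tallied with a few lines of code. The sketch below counts labelable residues per peptide and reports the median; the function name and the four toy peptides are hypothetical and do not reproduce the proteome-wide data.

```python
from collections import Counter
from statistics import median


def label_count_histogram(peptides, residues):
    """Return (histogram, median) of labelable-residue counts per peptide.

    residues -- one-letter codes sharing a label, e.g. "K" or "DE"
    """
    counts = [sum(aa in residues for aa in p) for p in peptides]
    return Counter(counts), median(counts)


toy_peptides = ["GKE", "GCAE", "MKYKC", "KKKE"]   # hypothetical peptides
hist, med = label_count_histogram(toy_peptides, "K")
print(dict(hist), med)
```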
Fig 4.
Overview of a Monte Carlo simulation of fluorosequencing with errors.
In detail, protein sequences are read as amino-acid character strings from the UniProt database. For each protein sequence, the following steps are repeated: proteolysis is simulated, and peptides lacking the residue for surface attachment (e.g., cysteine) are discarded. All remaining peptides are encoded as fluorosequences, and the subsequent steps are repeated in accordance with the desired sampling depth. The fluorosequences are altered via random functions modeling experimental errors: (1) labels are removed, modeling failed fluorophores or failed fluorophore attachment; (2) positions of the remaining labels are randomly dilated, modeling Edman reaction failures; and (3) fluorophores are shifted upstream from their positions, modeling photobleaching. Each resulting fluorosequence is sorted by position and label type and merged into a prefix trie to tally the frequencies of observing each fluorosequence from a given source protein.
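The three error functions can be illustrated with a small sketch. It is not the authors' implementation: it assumes independent per-label dye failure, an independent failure probability per Edman cycle, and an independent per-cycle bleaching probability per fluorophore. The function name `simulate_read` and all parameter names are hypothetical.

```python
import random


def simulate_read(labels, n_cycles, p_dye_fail, p_edman_fail, p_bleach, rng):
    """Return one noisy fluorosequence as a sorted tuple of (label, cycle).

    labels -- true fluorosequence, e.g. [("K", 2), ("K", 4)] (1-based positions)
    """
    # (1) Failed fluorophore or failed attachment: the label is never seen.
    kept = [(aa, pos) for aa, pos in labels if rng.random() >= p_dye_fail]

    # (2) Edman failures: a failed cycle removes nothing, so every later
    #     residue leaves at a later cycle (positions are dilated).
    removal_cycle, removed = {}, 0
    for cycle in range(1, n_cycles + 1):
        if rng.random() >= p_edman_fail:
            removed += 1
            removal_cycle[removed] = cycle

    observed = []
    for aa, pos in kept:
        cycle = removal_cycle.get(pos)        # None: not reached in n_cycles
        last = cycle if cycle is not None else n_cycles
        # (3) Photobleaching: the dye may go dark before its residue is
        #     cleaved, so the intensity drop appears at an earlier cycle.
        for c in range(1, last + 1):
            if rng.random() < p_bleach:
                cycle = c
                break
        if cycle is not None:
            observed.append((aa, cycle))

    # Sort by position, then label type, before merging into a trie.
    return tuple(sorted(observed, key=lambda t: (t[1], t[0])))


rng = random.Random(0)
print(simulate_read([("K", 2), ("K", 4)], 30, 0.15, 0.06, 0.05, rng))
```

Calling the function repeatedly for each peptide, up to the desired sampling depth, yields the population of observed fluorosequences to be tallied per source protein.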
Fig 5.
A simple example of the trie structure for storing and attributing fluorosequences to peptides or proteins.
Consider a toy peptide mixture with peptide X (sequence GK*EGC, where K* represents fluorescently labeled lysine; the sequence can be simplified to (K,2)) and peptide Y (GK*GK*EC; represented as (K,2),(K,4)). Panels (A) and (B) summarize populating the trie with fluorosequences from 500 copies each of peptides X and Y, respectively. (A) For example, peptide X might generate fluorosequence xK*, incorporated into the trie as a new node (K,2), indicated by the dashed blue lines and arrows. (B) Simulations on peptide Y add further nodes to the trie. For example, the fluorosequence xK*xK* yields an additional node (K,2),(K,4) after traversing node (K,2). Additional fluorosequences are incorporated into the trie in a similar fashion, along with a tally of the number of observations of each fluorosequence, stored at each trie node together with the source peptide identities. Following the Monte Carlo simulation, the frequency of each source protein or peptide can be calculated for each trie node. To simplify data analysis and visualization, thresholds can be applied (see Methods) to identify and count those source proteins most confidently identified by the observed fluorosequences. Here, fluorosequences ((K,2),(K,5)) and ((K,2),(K,4)) confidently identify peptide Y, while peptide X is less confidently identified by fluorosequences (K,2) or (K,3).
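A prefix trie of this kind can be sketched in a few lines. The class and method names below are hypothetical, and the tallies in the usage example are invented purely to show the mechanics; they are not the counts from the figure.

```python
from collections import Counter


class TrieNode:
    def __init__(self):
        self.children = {}          # (label, position) -> TrieNode
        self.sources = Counter()    # source peptide/protein -> observations


class FluorosequenceTrie:
    def __init__(self):
        self.root = TrieNode()

    def add(self, fluoroseq, source, count=1):
        """Walk or create one node per (label, position) step, then tally."""
        node = self.root
        for step in fluoroseq:
            node = node.children.setdefault(step, TrieNode())
        node.sources[source] += count

    def source_frequencies(self, fluoroseq):
        """Fraction of observations of fluoroseq attributed to each source."""
        node = self.root
        for step in fluoroseq:
            node = node.children.get(step)
            if node is None:
                return {}
        total = sum(node.sources.values())
        return {s: n / total for s, n in node.sources.items()} if total else {}


trie = FluorosequenceTrie()
trie.add((("K", 2),), "X", count=400)             # invented tallies
trie.add((("K", 2),), "Y", count=100)             # Y after losing one label
trie.add((("K", 2), ("K", 4)), "Y", count=350)
print(trie.source_frequencies((("K", 2),)))
print(trie.source_frequencies((("K", 2), ("K", 4))))
```

Because fluorosequences sharing a prefix share nodes, the (K,2),(K,4) entry reuses the (K,2) node created earlier, and a threshold on the per-node frequencies could then be used to decide which attributions count as confident.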
Fig 6.
Monte Carlo sampling reveals the confidence with which fluorosequences can be attributed to specific source proteins.
(A) and (B) represent two example fluorosequences, illustrating opposite extremes in terms of the number of proteins capable of yielding each sequence. In (A), the frequencies with which rival source proteins yield fluorosequence “xxxxExxKxK” in the Monte Carlo simulations indicate low confidence in attributing that fluorosequence to any one protein. In (B), a single protein is by far the most likely source of fluorosequence “EEEEExxKxK”. (X-axes represent incomplete lists of proteins, ordered by the frequencies with which they are observed to generate the given fluorosequence in the simulations.)
Fig 7.
Surface plots illustrate the consequences of differing Edman efficiencies, photobleaching rates, and fluorophore failure rates.
Each panel summarizes the consequences of varying rates of photobleaching and Edman failures for a different fixed fluorophore failure rate, ranging from 0% to 25%, as calculated after simulating 30 experimental cycles on the complete human proteome at a simulation depth of 10,000 copies per protein. Photobleaching has the strongest negative impact on proteome coverage of the errors considered; increasing the number of distinguishable labels strongly increases proteome coverage. Labeling and immobilization schemes are denoted as in Fig. 2. For comparison, literature evidence suggests that common failure rates of fluorophores may be about 15–20% [18,32], Edman degradation proceeds with about 94% efficiency [33], and the mean photobleaching lifetime of a typical Atto680 dye is about 30 minutes [23], corresponding to 1800 Edman cycles, assuming 1 s of exposure per Edman cycle. Thus, we expect error rates to be sufficiently low for effective fluorosequencing.
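The conversion from photobleaching lifetime to Edman cycles, and a rough feel for the quoted rates over a 30-cycle run, can be checked with a short calculation. The exponential dye-survival model and the independence of Edman cycles are assumptions made here for illustration; the variable names are hypothetical.

```python
import math

lifetime_s = 30 * 60            # mean photobleaching lifetime: 30 minutes
exposure_s = 1                  # assumed exposure per Edman cycle
lifetime_cycles = lifetime_s / exposure_s

n_cycles = 30
# Assumption: exponential decay, so the per-cycle bleach probability is
# 1 - exp(-1 / lifetime_cycles) and survival over n cycles is exp(-n / lifetime).
p_bleach_per_cycle = 1 - math.exp(-1 / lifetime_cycles)
dye_survives_run = math.exp(-n_cycles / lifetime_cycles)

# Assumption: each Edman cycle succeeds independently with 94% efficiency.
edman_efficiency = 0.94
all_cycles_succeed = edman_efficiency ** n_cycles

print(lifetime_cycles, p_bleach_per_cycle, dye_survives_run, all_cycles_succeed)
```

Under these assumptions a dye is very likely to remain fluorescent throughout a 30-cycle run, whereas at least one Edman failure somewhere in the run is likely, which is why the simulation models dilation of label positions explicitly.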