A Theoretical Justification for Single Molecule Peptide Sequencing
Fig 5
A simple example of the trie structure for storing and attributing fluorosequences to peptides or proteins.
Consider a toy peptide mixture with peptide X (sequence GK*EGC, where K* represents fluorescently-labeled lysine; the sequence can be simplified to (K,2)) and peptide Y (GK*GK*EC; represented as (K,2),(K,4)). Panels (A) and (B) summarize populating the trie with fluorosequences from 500 copies each of peptides X and Y, respectively. For example, peptide X might generate fluorosequence xK*, incorporated into the trie as a new node (K,2), indicated by the dashed blue lines and arrows in panel (A). In panel (B), simulations on peptide Y add further nodes to the trie. For example, the fluorosequence xK*xK* yields an additional node (K,2),(K,4) after traversing node (K,2). Additional fluorosequences are incorporated into the trie in a similar fashion, and each trie node stores a tally of the number of observations of its fluorosequence along with the identities of the source peptides. Following the Monte Carlo simulation, the frequency of each source protein or peptide can be calculated for each trie node. To simplify data analysis and visualization, thresholds can be applied (see Methods) to identify and count those source proteins most confidently identified by the observed fluorosequences. Here, fluorosequences ((K,2),(K,5)) and ((K,2),(K,4)) confidently identify peptide Y, while peptide X is less confidently identified by fluorosequences (K,2) or (K,3).
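The trie bookkeeping described in this caption can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class and method names (`TrieNode`, `FluorosequenceTrie`, `insert`, `source_frequencies`, `confident_source`), the observation counts, and the 0.9 cutoff are all assumptions chosen for illustration. The counts are invented toy values, not simulation output, and the thresholds actually used are those given in Methods.

```python
from collections import Counter


class TrieNode:
    def __init__(self):
        self.children = {}        # (residue, position) label -> child TrieNode
        self.sources = Counter()  # source peptide -> number of observations


class FluorosequenceTrie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, fluorosequence, source, count=1):
        """Walk or create one node per (residue, position) label, then tally the source."""
        node = self.root
        for label in fluorosequence:
            node = node.children.setdefault(label, TrieNode())
        node.sources[source] += count

    def find(self, fluorosequence):
        """Return the node for a fluorosequence, or None if it was never inserted."""
        node = self.root
        for label in fluorosequence:
            node = node.children.get(label)
            if node is None:
                return None
        return node

    def source_frequencies(self, fluorosequence):
        """Fraction of observations at this node contributed by each source peptide."""
        node = self.find(fluorosequence)
        if node is None or not node.sources:
            return {}
        total = sum(node.sources.values())
        return {src: n / total for src, n in node.sources.items()}

    def confident_source(self, fluorosequence, min_frequency=0.9):
        """Return the dominant source if it meets the (hypothetical) cutoff, else None."""
        freqs = self.source_frequencies(fluorosequence)
        if not freqs:
            return None
        best = max(freqs, key=freqs.get)
        return best if freqs[best] >= min_frequency else None


# Invented toy tallies (500 observations per peptide), NOT results from the paper.
toy_observations = [
    ((("K", 2),), "X", 400),
    ((("K", 3),), "X", 100),
    ((("K", 2),), "Y", 120),
    ((("K", 3),), "Y", 30),
    ((("K", 2), ("K", 4)), "Y", 300),
    ((("K", 2), ("K", 5)), "Y", 50),
]

trie = FluorosequenceTrie()
for fluorosequence, source, count in toy_observations:
    trie.insert(fluorosequence, source, count)

for fs in [(("K", 2),), (("K", 2), ("K", 4))]:
    print(fs, trie.source_frequencies(fs), trie.confident_source(fs))
```

With these invented tallies, the node (K,2),(K,4) is reached only from peptide Y and so passes the cutoff, whereas node (K,2) is shared between X and Y and does not. Storing the per-source tally on the node itself means the attribution step needs only a single traversal per observed fluorosequence.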