A Theoretical Justification for Single Molecule Peptide Sequencing
Fig 4
Overview of a Monte Carlo simulation of fluorosequencing with errors.
In detail, protein sequences were read as amino-acid character strings from the UniProt database. For each protein sequence, the following steps were repeated: proteolysis was simulated, and peptides lacking the residue required for surface attachment (e.g., cysteine) were discarded. All remaining peptides were encoded as fluorosequences, and the following steps were repeated according to the desired sampling depth. The fluorosequences were altered via random functions modeling experimental errors: (1) labels were removed, modeling failed fluorophores or failed fluorophore attachment; (2) positions of the remaining labels were randomly dilated, modeling Edman reaction failures; and (3) fluorophores were shifted upstream from their positions, modeling photobleaching. Each resulting fluorosequence was sorted by position and label type and merged into a prefix trie to tally the frequency with which each fluorosequence was observed from a given source protein.
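The pipeline described in this caption can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the cleavage rule (cut after glutamate), the labeled residue (lysine), the per-cycle geometric error models, and the dictionary-based trie are all assumptions chosen for illustration. Only the order of the steps follows the caption.

```python
import random

def digest(protein, cleave_after="E"):
    """Hypothetical proteolysis: cut after every residue in `cleave_after`."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        if aa in cleave_after:
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides

def encode(peptide, labels=("K",)):
    """Fluorosequence as (1-based position, label type) for labeled residues."""
    return [(i, aa) for i, aa in enumerate(peptide, start=1) if aa in labels]

def perturb(fluoroseq, rng, p_label_fail=0.0, p_edman_fail=0.0, p_bleach=0.0):
    """Apply the three error models; all rates are assumed to be < 1."""
    # (1) Failed fluorophore or failed attachment: drop the label.
    kept = [(pos, lab) for pos, lab in fluoroseq if rng.random() >= p_label_fail]
    if not kept:
        return ()
    # (2) Edman failures: every failed cycle delays all later positions.
    delay, observed_cycle = 0, {}
    for pos in range(1, max(p for p, _ in kept) + 1):
        while rng.random() < p_edman_fail:
            delay += 1
        observed_cycle[pos] = pos + delay
    result = []
    for pos, lab in kept:
        cycle = observed_cycle[pos]
        # (3) Photobleaching: the signal may vanish at an earlier cycle.
        for earlier in range(1, cycle):
            if rng.random() < p_bleach:
                cycle = earlier
                break
        result.append((cycle, lab))
    # Sort by position, then label type, before merging into the trie.
    return tuple(sorted(result))

def trie_insert(trie, fluoroseq, source):
    """Walk or create one trie node per (position, label) and tally the source."""
    node = trie
    for step in fluoroseq:
        node = node.setdefault("children", {}).setdefault(step, {})
    counts = node.setdefault("counts", {})
    counts[source] = counts.get(source, 0) + 1

def simulate(proteins, depth, seed=0, anchor="C", **error_rates):
    """Tally observed fluorosequences per source protein at a sampling depth."""
    rng = random.Random(seed)
    trie = {}
    for name, sequence in proteins.items():
        for peptide in digest(sequence):
            if anchor not in peptide:  # cannot attach to the surface
                continue
            fluoroseq = encode(peptide)
            for _ in range(depth):
                trie_insert(trie, perturb(fluoroseq, rng, **error_rates), name)
    return trie
```

For example, `simulate({"P1": "AKECKKEGK"}, depth=1000, p_label_fail=0.1, p_edman_fail=0.05, p_bleach=0.05)` would draw 1000 error-perturbed reads of the single cysteine-containing peptide `CKKE` and store their frequencies at the matching trie nodes.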