Current address: Department of Molecular and Cellular Physiology, University of Cincinnati, Cincinnati, Ohio, United States of America

Conceived and designed the experiments: AB JLM HRE. Performed the experiments: AB. Wrote the paper: AB JLM HRE. Conceived the mathematical approaches used: AHA HRE. Conceived the PINE approach: AB JLM HRE. Developed, tested, and evaluated the software and PINE-NMR website: AB.

The authors have declared that no competing interests exist.

The process of assigning a finite set of tags or labels to a collection of observations, subject to side conditions, is notable for its computational complexity. This labeling paradigm is of theoretical and practical relevance to a wide range of biological applications, including the analysis of data from DNA microarrays, metabolomics experiments, and biomolecular nuclear magnetic resonance (NMR) spectroscopy. We present a novel algorithm, called Probabilistic Interaction Network of Evidence (PINE), that achieves robust, unsupervised probabilistic labeling of data. The computational core of PINE uses estimates of evidence derived from empirical distributions of previously observed data, along with consistency measures, to drive a fictitious system

What mathematicians call the “labeling problem” underlies difficulties in interpreting many classes of complex biological data. To derive valid inferences from multiple, noisy datasets, one must consider all possible combinations of the data to find the solution that best matches the experimental evidence. Exhaustive searches far exceed current computing resources. As a result, it has been necessary to resort to approximations such as branch and bound or Monte Carlo simulations. These approximations have two disadvantages: they are limited to use in separate steps of the analysis, and they do not provide the final results in a probabilistic form that allows the quality of the answers to be evaluated. The Probabilistic Interaction Network of Evidence (PINE) algorithm that we present here offers a general solution to this problem. We have demonstrated the usefulness of the PINE approach by applying it to one of the major bottlenecks in NMR spectroscopy. The PINE-NMR server takes as input the sequence of a protein and the peak lists from one or more multidimensional NMR experiments and provides as output a probabilistic assignment of the NMR signals to specific atoms in the protein's covalent structure and a self-consistent probabilistic analysis of the protein's secondary structure.

Labeling a set of fixed data with another representative set is the generic description for a large family of problems. This family includes clustering and dimensionality reduction, an approach in which the original dataset is represented by a set of typically far lower dimension (the representative set). The representative set, often the parameter vector that signifies a set of data points, can be simply the cluster mean (center) or may include additional parameters, such as the cluster diameter. The labeling problem is important, because it is encountered in many applications involving data analysis, particularly where prior knowledge of the probability distributions is incomplete or lacking.

A challenging instance of the labeling problem arises naturally in nuclear magnetic resonance (NMR) spectroscopy, which along with X-ray crystallography is one of the two major methods for determining protein structures. Although NMR spectroscopy is not as highly automated as the more mature X-ray field, it has advantages over X-ray crystallography for structural studies of small proteins that are partially disordered, exist in multiple stable conformations in solution, exhibit weak interactions with ligands, or fail to crystallize readily

Protein NMR structure determination generally proceeds through a series of steps (

After the data have been collected, the challenging “front-end” process leads to sequence-specific amino acid labeling. The “back-end” process then leads to the three-dimensional structure.

The front-end “labeling process” associates one or more NMR parameters with a physical entity (e.g., nucleus, residue, tripeptide, helix, chain); the back-end “labeling process” associates NMR parameters with constraints that define or refine conformational states. In reality, the distinction between front-end and back-end is artificial. Strategies have been developed that use NOESY data for assignments

The usual approach to the solution of the problem of assigning labels to subsets of peaks (spin subsystems) assembled from multiple sets of noisy spectra is to collect a number of multidimensional, multinuclear datasets. After converting the time domain data to frequency domain spectra by Fourier transformation, peaks are picked from each spectrum for analysis. Methods have been developed for automated peak picking or global analysis of spectra to yield models consisting of peaks with known intensity, frequency, phase, and decay rate or linewidth

Peaks observed in multidimensional spectra are matched to search for common frequencies. Some common frequencies identify atoms within a residue; others identify atoms in neighboring residues. The common visual aid in this process is a series of paired strip plots from complementary NMR experiments, for example strips from CBCA(CO)NH in which the C^{α} (CA) and C^{β} (CB) frequencies assumed to belong to Thr^{66} are compared with the C^{α} and C^{β} frequencies of Tyr^{67}. Strips often contain additional peaks. These additional peaks may be artifacts (false peaks) or peaks from other nuclei with similar frequencies. Depending on the starting point of the assignment process, the choice of experiments, the amount of conflicting information, or other factors, an exponentially expanding number of alternative assignments can arise, rendering a computational solution intractable. This difficulty has proved to be a major drawback for NMR structure determination, particularly for larger proteins.
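The matching of common frequencies described above can be sketched in a few lines of Python. This is a minimal illustration, not the PINE-NMR implementation: the function name `match_on_amide`, the peak values, and the 15N tolerance are assumptions made for the example (only the 0.02 ppm 1H tolerance is quoted later in the text).

```python
from typing import List, Tuple

# Tolerances in ppm. 0.02 ppm for 1H follows the default quoted in the text;
# the 15N tolerance is an assumption made for this sketch.
TOL_H = 0.02
TOL_N = 0.2

Peak = Tuple[float, ...]  # (1H, 15N[, 13C]) chemical shifts in ppm


def match_on_amide(ref_peaks: List[Peak], peaks: List[Peak]) -> List[Tuple[int, int]]:
    """Return (ref_index, peak_index) pairs whose 1H and 15N shifts agree."""
    pairs = []
    for i, ref in enumerate(ref_peaks):
        for j, pk in enumerate(peaks):
            if abs(ref[0] - pk[0]) <= TOL_H and abs(ref[1] - pk[1]) <= TOL_N:
                pairs.append((i, j))
    return pairs


# Hypothetical peak lists: a 2D reference list and a 3D list with a 13C dimension.
hsqc = [(8.31, 120.4), (7.95, 115.2)]
cbcaconh = [
    (8.30, 120.5, 58.1),
    (8.32, 120.3, 30.2),
    (7.95, 115.1, 62.0),
    (9.10, 128.0, 55.0),  # matches no reference peak: possible noise
]
matches = match_on_amide(hsqc, cbcaconh)
print(matches)
```

Each reference peak collects every 3D peak that shares its amide frequencies, so overlapped or noisy regions naturally produce several candidates per reference peak. That ambiguity is what makes the downstream assignment combinatorial.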

Protein Designator | Number of Residues | Backbone Data Completeness | Correct Backbone Assignment Coverage | Backbone Assignment Accuracy | Side Chain Assignment Accuracy | Secondary Structure Accuracy | Outlier Count | Data Quality | Running Time on 1.8 GHz Intel CPU (h) | PISTACHIO Backbone Assignment Accuracy | Level of Missing Peaks in HNCACB Dataset | Noise Level in HNCACB Dataset | Experiments Represented in the Input Peak Lists
Proteins with both backbone and side chain experiments available:
Ubiquitin | 76 | 99% | 96% | 97% | 94% | 97% | 1 | 0.91 | 0.2 | 91% | 12% | 20% | X X X X X X
Mm202773 | 101 | 98% | 96% | 98% | 92% | 97% | 3 | 0.92 | 0.2 | 97% | 24% | 42% | X X X X X
At1g77540 | 103 | 97% | 96% | 99% | 93% | 94% | 2 | 0.93 | 0.2 | 99% | 37% | 48% | X X X
At2g24940 | 109 | 99% | 99% | 100% | 95% | 95% | 1 | 0.91 | 0.2 | 100% | 29% | 10% | X X X X
At5g22580 | 111 | 98% | 94% | 96% | 93% | 90% | 1 | 0.89 | 0.2 | 92% | 21% | 17% | X X X X X
At3g17210 | 112 | 97% | 92% | 95% | 92% | 90% | 1 | 0.84 | 1 | 90% | 9% | 14% | X X X X X X
At3g51030 | 124 | 96% | 92% | 96% | 91% | 88% | 2 | 0.88 | 1 | 90% | 38% | 68% | X X X X X X
At2g46140 | 174 | 98% | 92% | 94% | 98% | 90% | 1 | 0.86 | 1 | 91% | 20% | 35% | X X X X X
At3g16450 | 299 | 95% | 82% | 86% | 77% | NA | 1 | 0.89 | 2 | 80% | 23% | 135% | X X X X X X X
Proteins with only backbone experiments available:
BMRB5106 | 70 | 96% | 91% | 95% | NA | 90% | 1 | 0.88 | 0.2 | 90% | 10% | 25% | X X
At2g23090 | 86 | 87% | 87% | 100% | NA | 92% | 3 | 0.89 | 0.2 | 97% | 30% | 44% | X X
At5g01610 | 170 | 96% | 81% | 84% | NA | 83% | 3 | 0.76 | 1 | 80% | 24% | 117% | X X X

The maximum number of backbone assignments achievable theoretically on the basis of the peak lists provided as input to PINE, divided by the total number of backbone assignments deposited in BMRB, multiplied by 100.

Number of correct PINE-NMR backbone assignments, divided by the total number of backbone assignments deposited in BMRB, multiplied by 100.

Number of correct PINE-NMR (backbone/side chain) assignments (i.e., in agreement with those in BMRB), divided by the maximum number of (backbone/side chain) assignments achievable theoretically on the basis of the peak lists provided as input to PINE, multiplied by 100.

Percentage of residues correctly assigned to helix, strand, or “other” by PINE-NMR on the basis of agreement with DSSP

Total number of C′, C^{α}, and C^{β} atoms detected as possible outliers by the LACS method

Defined as

Defined as number of noise peaks divided by number of real peaks in HNCACB.

All input included data from an HSQC or HNCO experiment; data from additional experiments were as indicated by shaded boxes:

Stereo array isotope labeled (SAIL) protein; data were analyzed without corrections for isotope shifts due to deuterium labeling.
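The three backbone percentages defined in the footnotes above are linked by simple arithmetic: coverage equals accuracy multiplied by completeness. The following sketch makes that relationship explicit. The function name and the counts are hypothetical, chosen only so that the resulting percentages resemble a typical row of the table; they are not data from the study.

```python
def table_metrics(n_deposited: int, n_achievable: int, n_correct: int) -> dict:
    """Compute the backbone percentages as defined in the table footnotes.

    n_deposited  : backbone assignments deposited in BMRB
    n_achievable : assignments theoretically achievable from the input peak lists
    n_correct    : PINE-NMR assignments that agree with BMRB
    """
    return {
        "completeness": 100.0 * n_achievable / n_deposited,
        "coverage": 100.0 * n_correct / n_deposited,
        "accuracy": 100.0 * n_correct / n_achievable,
    }


# Hypothetical counts, for illustration only.
m = table_metrics(n_deposited=400, n_achievable=396, n_correct=384)
print({name: round(value) for name, value in m.items()})

# Coverage is the product of accuracy and completeness (as fractions).
product = m["accuracy"] * m["completeness"] / 100.0
print(abs(m["coverage"] - product) < 1e-9)
```

Reading the table with this identity in mind separates two sources of lost assignments: peaks that were never in the input (completeness) and peaks that were present but labeled incorrectly (accuracy).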

A common feature among prior approaches has been to divide the assignment of labels into a sequence of discrete steps and to apply varying methods at each step. These steps typically include an “assignment step”

The NMR assignment problem has been highly researched, and is most naturally formulated as a combinatorial optimization problem, which can be subsequently solved using a variety of algorithms. A 2004 review listed on the order of 100 algorithms and software packages

Similarly, a wide range of methods have been used to predict the protein secondary structural elements that play an important role in classifying proteins

Our goal is to implement a comprehensive approach that utilizes a network model rather than a pipeline model and relies on a probabilistic analysis for the results. We reformulate the combinatorial optimization problem so that each labeling configuration in the ensemble has an associated but unknown non-vanishing probability. The PINE algorithm enables full integration of information from disparate steps to achieve a probabilistic analysis. The use of probabilities provides the means for sharing and refining incomplete information among the current standard steps, or steps introduced by future developments. In addition, probabilistic analysis deals directly with the multiple minima problem that arises in cases where the data do not support a single optimal and self-consistent state. A common example is a protein that populates two stable conformational states.

The PINE-NMR package described here represents a first step in approaching the goal of a full probabilistic approach to protein NMR spectroscopy. PINE-NMR accepts as input the sequence of the protein plus peak lists derived from one or more NMR experiments chosen by the user from an extensive list of possibilities. PINE-NMR provides as output a probabilistic assignment of backbone and aliphatic side chain chemical shifts and the secondary structure of the protein. At the same time, it identifies, verifies, and, if needed, rectifies, problems related to chemical shift referencing or the consistency of assignments with determined secondary structure. PINE-NMR can make use of prior information derived independently by other means, such as selective labeling patterns or spin system assignments. In principle, the networked model of PINE-NMR is extensible in both directions within the pipeline for protein structure determination (

In addition to its application to NMR spectroscopy, the PINE approach should be applicable to the unbiased classification of biological data in other domains of interest, such as systems biology, in which data of various types need to be integrated: genomics (DNA chips), proteomics (MS analysis of proteins), and metabolomics (GC-MS, LC-MS, and NMR) data collected as a function of time and environmental variables.

The fundamental idea of PINE is to embed the original assignment problem into a higher dimensional setting and to use empirically estimated compatibility (or similarity) conditions to iteratively arrive at an internally coherent labeling state. These conditions are embodied in the form of a parameterized Hamiltonian (energy function) that evolves at each iteration step. In the quasi-stationary regime, this construction yields clusters, defined as subsets of chemical shift data with assigned labels. The clusters have strong intra-cluster links and highly localized inter-cluster couplings. We view each possible cluster of related experimental data in the domain as a “site” that is to be potentially labeled. More specifically, our goal is to discover (learn) the map λ: S → L, where S = [s_{1}, s_{2}, …, s_{m}] is the set of data values available from all experiments, and L = [l_{1}, l_{2}, …, l_{n}] is the set of labels associated with the chemical shifts. At first it may appear that this map is trivial, because one protein has precisely one set of correct chemical shifts. However, breaks in the backbone sequential data, incompleteness of

In order to formulate the computational problem, we require that the labels for data values satisfy constraints that arise from the system of neighborhoods built around each data value. The system of neighborhoods is a dynamic state variable that co-evolves with the probability values. We assign an initial set of labels, L, with associated weights to each input data point, S (e.g., chemical shift) and introduce a measure of similarity based on distances between “neighboring points” (

Each input data point (S) is linked to a set of labels (L) with associated weights. Similarity measures and constraints are utilized to construct each neighborhood system or topology (as denoted by the arrows).

The approach used to measure the global compatibility or support for the specific labeling of site

To implement the intuitively appealing ideas presented above that are designed to find the optimal state in the form of marginal probabilities, we have devised an iterative approach that utilizes topology selection followed by a variation of belief propagation algorithm

We proceed by analogy to FK

In this formula, the outside sum is performed over the configuration states of the system represented by the map λ, and the inside product measures the compatibility of the vertex labels joined by the edge

In the setting of statistical physics, the Boltzmann weight of a configuration is

In the standard random-cluster model, the neighborhood structure, or topology, of the graph is prescribed, and the objective is to find the ground state for a given set of weights by varying the “spin”, or labeling, configurations. In our case, we are determining the ground state ensemble and the topology of the model at the same time. At each iteration step _{i}

The algorithm must address two critical challenges. The data that describe edge weights and states in Eq 2 are derived from empirical relationships that involve noisy data, and, therefore, a straightforward deterministic search of the resulting combinatorial space would be infeasible. In addition, the computational complexity of the resulting problem grows rapidly with the number of labels and the topology of the graph; thus, a suitable starting and evolving representation of the topology, and a corresponding approximation algorithm is the key to obtaining a robust solution to this problem.
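For orientation, the quantities discussed above can be written in their generic textbook form. This is a sketch under stated assumptions, not a reconstruction of the article's own numbered equations: the symbols ψ_{ij} (edge compatibility factor), E (edge set), H (energy), k_B, and T are introduced here for illustration.

```latex
% Generic pairwise partition sum: the outer sum runs over labelings \lambda,
% the inner product over edges (i,j) of a compatibility factor.
Z = \sum_{\lambda} \prod_{(i,j) \in E} \psi_{ij}\bigl(\lambda(i), \lambda(j)\bigr)

% Boltzmann weight of a configuration with energy H(\lambda) at temperature T,
% and the probability it induces.
w(\lambda) = \exp\!\left(-\frac{H(\lambda)}{k_B T}\right),
\qquad
P(\lambda) = \frac{w(\lambda)}{Z}

% The two forms agree when the energy is a sum of edge terms:
H(\lambda) = \sum_{(i,j) \in E} H_{ij}\bigl(\lambda(i), \lambda(j)\bigr),
\qquad
\psi_{ij} = \exp\!\left(-\frac{H_{ij}}{k_B T}\right)
\;\Longrightarrow\;
Z = \sum_{\lambda} w(\lambda)
```

In this form, removing an edge from E is equivalent to setting its factor ψ_{ij} to 1 for every pair of labels, which is one way to read the statement that the topology and the ground state ensemble are determined together.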

The probabilistic construction used in PINE-NMR belongs to the general class of graphical models in which dependencies among random variables are constructed ahead of the inference task. In cases where the graph of dependencies is acyclic, there are powerful and efficient algorithms that correctly maximize the marginal probabilities through collecting messages from all leaf nodes at a root node

A set of probabilistic influence sub-networks are combined into a larger influence network. The iterative probabilistic inference on the complex network ensures globally consistent labeling.
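To illustrate why acyclic graphs are the easy case, the sketch below runs sum-product message passing on a short chain and checks the resulting marginals against brute-force enumeration. It is a generic textbook example, not the variant of belief propagation used in PINE-NMR; the potentials and function names are assumptions made for the illustration.

```python
from itertools import product


def chain_marginals(unary, pairwise):
    """Sum-product on a chain.

    unary[i][x]       : local evidence for label x at node i
    pairwise[i][x][y] : compatibility of label x at node i with label y at node i+1
    Returns one normalized marginal distribution per node.
    """
    n, k = len(unary), len(unary[0])
    fwd = [[1.0] * k for _ in range(n)]  # messages arriving from the left
    bwd = [[1.0] * k for _ in range(n)]  # messages arriving from the right
    for i in range(1, n):
        for x in range(k):
            fwd[i][x] = sum(
                pairwise[i - 1][y][x] * unary[i - 1][y] * fwd[i - 1][y] for y in range(k)
            )
    for i in range(n - 2, -1, -1):
        for x in range(k):
            bwd[i][x] = sum(
                pairwise[i][x][y] * unary[i + 1][y] * bwd[i + 1][y] for y in range(k)
            )
    marginals = []
    for i in range(n):
        belief = [unary[i][x] * fwd[i][x] * bwd[i][x] for x in range(k)]
        total = sum(belief)
        marginals.append([b / total for b in belief])
    return marginals


def brute_force_marginals(unary, pairwise):
    """Enumerate every labeling; feasible only for tiny problems."""
    n, k = len(unary), len(unary[0])
    totals = [[0.0] * k for _ in range(n)]
    for labels in product(range(k), repeat=n):
        w = 1.0
        for i, x in enumerate(labels):
            w *= unary[i][x]
        for i in range(n - 1):
            w *= pairwise[i][labels[i]][labels[i + 1]]
        for i, x in enumerate(labels):
            totals[i][x] += w
    return [[v / sum(row) for v in row] for row in totals]


# Hypothetical potentials: three nodes, two labels, neighbors prefer to agree.
unary = [[0.7, 0.3], [0.5, 0.5], [0.2, 0.8]]
pairwise = [[[0.9, 0.1], [0.1, 0.9]], [[0.9, 0.1], [0.1, 0.9]]]
bp = chain_marginals(unary, pairwise)
exact = brute_force_marginals(unary, pairwise)
print(bp)
```

Message passing costs time linear in the number of nodes, whereas enumeration grows exponentially. On graphs with cycles the same updates are no longer exact, which is why the choice of topology matters.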

1. Read input data and check for errors. If errors are found, report errors and abort.

2. Align the ^{1}H, ^{15}N, and ^{13}C dimensions of all spectra independently.

3. Generate spin systems (

Derive the similarity scores of peaks ^{i}_{j}^{15}N and ^{13}C, and 0.02 ppm for ^{1}H. The existence of peaks closer than default values in any spectra adjusts

Begin with sensitive spectra; build probabilistic spin systems for backbone atoms:

The resulting spin systems have the following fields (some fields might be empty or have several choices with different probabilities):

Derive connectivity scores for spin systems by a formula analogous to 3.a. The score _{i},SS_{j})_{i}_{j}

Utilize the scores to assemble the spin systems to triplet spin systems.

The peaks in the most sensitive experiments in the data are used initially as reference peaks. Aligning the peaks along the common dimensions and registering them with respect to reference peaks enables us to define a common putative object called the spin system. Spin systems are then assembled to derive triplet spin systems.

4. Estimate the

5. If

6. Otherwise, set

7.

8. Triplet amino acid typing:

Score each atom based on the probability distribution of chemical shifts derived from BMRB and on the latest secondary structure prediction for residue i, expressed as P_{i}(helix), P_{i}(strand), and P_{i}(coil).

Adjust the scoring if any pre-assignment exists:

Adjust the scoring if any selective labeling experiment exists while taking into account the possibility of overlap:

The total score S

9. Derive the backbone assignment network weights based on amino acid typing scoring, connectivity experiments, latest backbone assignment, and possible outlier detections from the last iteration (_{k-1}(x_{n}(i))_{n}(i)

Overlapping tripeptides (triplet residue) are evaluated. The weights on the edges are derived from amino acid typing, secondary structures, connectivity experiments, and possible outlier assignments. According to the statistical physics model described in the text, application of the belief propagation algorithm yields the marginal probabilities for backbone assignments.

10. Select the network topology; calculate the threshold for removing low-weight edges from the network based on the quality of the data, use:

11. Apply the belief propagation algorithm to obtain the marginal probabilities P_{k}^{n}(x_{n}(j)) of the assignments x_{n}(j) at iteration k.

12. Given the marginal probabilities of the triplet residue assignments, derive the probabilistic assignment of the individual backbone atoms.

13. Detect and remove the outliers in the backbone assignments

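14. Derive the secondary structure of each amino acid as a weighted sum over its candidate assignments: P_{n}(s) = Σ_{j} P_{n}(s|x_{n}(j)) P_{k}^{n}(x_{n}(j)), where x_{n}(j) is the j^{th} candidate assignment for residue n and P_{k}^{n}(x_{n}(j)) is its marginal probability at iteration k.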

15. If the iteration does not converge, report the average of the probabilities from the last three iterations. “No convergence” indicates the presence of “nearby” local minima.

16. For every amino acid, generate an energetic model network and apply the Belief Propagation

17. Report the final probabilistic assignments: backbone, side chain, secondary structure prediction, and possible outliers. The output can be specified to conform to a variety of formats, including Xeasy, SPARKY, and NMR-STAR (BMRB).
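Step 10 in the list above removes low-weight edges before belief propagation is applied. A minimal sketch of that pruning follows. The source does not give the rule that derives the threshold from data quality, so the cutoff is simply passed in here. The edge names and weights are hypothetical, and the natural logarithm is an assumption, because the text defines network complexity as −log(cutoff threshold) without stating the base.

```python
import math


def prune_edges(weights: dict, cutoff: float) -> dict:
    """Keep only edges whose weight is at least the cutoff."""
    return {edge: w for edge, w in weights.items() if w >= cutoff}


def network_complexity(cutoff: float) -> float:
    """Network complexity as -log(cutoff); natural log assumed."""
    return -math.log(cutoff)


# Hypothetical edge weights linking spin systems to sequence positions.
weights = {
    ("ss1", "T66"): 0.62,
    ("ss1", "T12"): 0.30,
    ("ss1", "S40"): 0.004,
    ("ss2", "Y67"): 0.91,
    ("ss2", "F45"): 0.02,
}

for cutoff in (0.1, 0.01, 0.001):
    kept = prune_edges(weights, cutoff)
    print(cutoff, round(network_complexity(cutoff), 2), len(kept))
```

Lowering the cutoff keeps more edges and therefore more alternative assignments. That raises the network complexity and the cost of inference, and it may also admit edges supported only by noise.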

The input to PINE-NMR consists of the amino acid sequence and multiple datasets known as peak lists (chemical shifts) obtained separately from selected, defined NMR experiments. The peak lists consist of sets of real-valued two-dimensional, three-dimensional, or four-dimensional vectors, denoted by ^{l}X^{i}_{j} ∈ ℝ^{l}, l = 2, 3, 4, where the index i denotes the i^{th} dataset and the index j denotes the j^{th} observation within the dataset. To compare data from different experimental sets (different i), peaks X^{i}_{j} and X^{m}_{n} are compared along their common dimensions, and the peaks from the most sensitive experiments (^{15}N-HSQC or HNCO) are used as the initial reference set. We define a common putative object, called the spin system (

The spin-system scoring step is used to integrate the spin system sub networks by assigning a score to each possible label that can be associated to a spin system. This process makes use of empirical chemical shift probability density functions, calculated from combined BMRB (chemical shift) and PDB (coordinate) data from proteins of known structure, for each atom of every amino acid type in three label states: α-helix, β-strand, and neither helix nor strand (other)
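The idea of scoring a spin system against per-residue chemical shift distributions can be sketched as follows. This is a simplified illustration under several assumptions: the distributions are taken to be independent Gaussians, the secondary structure states are ignored, and the (mean, standard deviation) values are rough placeholders rather than the BMRB/PDB-derived densities used by PINE-NMR.

```python
import math

# Illustrative (mean, standard deviation) values in ppm. These are rough
# placeholders for the sketch, NOT the empirical distributions used by PINE-NMR.
SHIFT_STATS = {
    "ALA": {"CA": (53.1, 2.0), "CB": (19.0, 1.8)},
    "SER": {"CA": (58.7, 2.1), "CB": (63.8, 1.5)},
    "THR": {"CA": (62.2, 2.6), "CB": (69.7, 1.7)},
}


def gaussian_pdf(x: float, mean: float, sd: float) -> float:
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def typing_scores(observed: dict) -> dict:
    """Return normalized scores over residue types for the observed shifts."""
    raw = {}
    for residue, stats in SHIFT_STATS.items():
        p = 1.0
        for atom, shift in observed.items():
            mean, sd = stats[atom]
            p *= gaussian_pdf(shift, mean, sd)  # independence assumed
        raw[residue] = p
    total = sum(raw.values())
    return {residue: p / total for residue, p in raw.items()}


# Hypothetical spin system with a CA and a CB shift.
scores = typing_scores({"CA": 61.8, "CB": 69.9})
best = max(scores, key=scores.get)
print(best, scores)
```

Normalizing the scores keeps every candidate label alive with some weight, so later steps can still overturn an initially unlikely type when connectivity evidence points the other way.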

The challenge is to address the computationally demanding problem of deriving the backbone and side chain assignments from amino acid typing and other experimental data (connectivity experiments) according to the model described above. Rather than modeling the assignment of labels to individual peaks, or assigning spin systems to a single amino acid, we generate triplet spin systems and label them to overlapping triplets of amino acids in the protein sequence (
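The overlapping triplets of amino acids mentioned above are easy to enumerate. The helper below is a hypothetical illustration (the function name and the example sequence are not from the source); it returns each tripeptide together with the 1-based position of its first residue.

```python
def overlapping_triplets(sequence: str) -> list:
    """Return (start_position, tripeptide) for every overlapping triplet."""
    return [(i + 1, sequence[i:i + 3]) for i in range(len(sequence) - 2)]


triplets = overlapping_triplets("MQIFVK")
print(triplets)
```

Because consecutive triplets share two residues, an interior residue appears in up to three triplets. This overlap is what allows neighboring labels to constrain one another.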

Secondary structure labels are dependent variables derived from prior chemical shift assignments. Each chemical shift assignment has an associated probability, and we derive the probabilities for the assignment of secondary structure labels from a normalized and weighted sum of associated probabilities. After computing the probability of each residue
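A minimal sketch of the normalized, weighted sum described above: each candidate chemical shift assignment contributes its conditional secondary structure probabilities, weighted by the probability of the assignment itself. The function name and all numbers are hypothetical.

```python
STATES = ("helix", "strand", "other")


def secondary_structure_probs(candidates: list) -> dict:
    """candidates: list of (assignment_probability, {state: P(state | assignment)})."""
    totals = {s: 0.0 for s in STATES}
    for p_assign, conditional in candidates:
        for s in STATES:
            totals[s] += p_assign * conditional[s]
    norm = sum(totals.values())  # renormalize in case weights do not sum to 1
    return {s: totals[s] / norm for s in STATES}


# Two hypothetical candidate assignments for one residue.
probs = secondary_structure_probs([
    (0.7, {"helix": 0.8, "strand": 0.1, "other": 0.1}),
    (0.3, {"helix": 0.2, "strand": 0.5, "other": 0.3}),
])
print(probs)
```

The result stays probabilistic. A residue whose assignment is ambiguous inherits a correspondingly uncertain secondary structure label rather than a forced choice.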

Posterior probabilities derived in each iteration of the assignment process are used as local prior probabilities in the next round of assignment, provided that (1) the assignment has not been detected as an outlier, (2) the assignment of chemical shift is correlated with the assignment of secondary structure consistent with known empirical distributions, and (3) the assignment is consistent with established connectivity constraints.

If one or more of the above conditions are not met, the results are deemed inconsistent because the resulting probabilities appear as outliers of the marginals supported by the current graph topology. This view is driven by the notion that the equilibrium of our fictitious system is the fixed point of the energy functional, with the factorization induced by our graph. In order to reach the consistent state, scores are re-evaluated and a new local score is computed for the next iteration; a new topology is generated, and the computational steps are repeated. The iteration process continues until a stationary or quasi-stationary state is reached, i.e., when the topology of the network and the labeling probabilities do not vary significantly. The iteration process leads to “self-correction” through appropriate adjustments to the topology of the underlying network in order to preserve maximum information.
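The stopping rule can be sketched as a generic fixed-point loop: iterate until the probabilities stop changing, and otherwise fall back to the average of the last three iterations, as in step 15 of the algorithm. The update functions, tolerance, and iteration limit below are assumptions chosen for illustration; in PINE-NMR each update would involve rescoring, topology selection, and belief propagation.

```python
def iterate_until_stationary(update, initial: dict, tol: float = 1e-3, max_iter: int = 50):
    """Return (probabilities, converged)."""
    history = [initial]
    for _ in range(max_iter):
        current = update(history[-1])
        history.append(current)
        change = max(abs(current[k] - history[-2][k]) for k in current)
        if change < tol:
            return current, True
    # No convergence: average the probabilities of the last three iterations.
    last = history[-3:]
    averaged = {k: sum(h[k] for h in last) / len(last) for k in last[0]}
    return averaged, False


# A hypothetical update that relaxes toward a fixed point.
target = {"A": 0.9, "B": 0.1}
relax = lambda p: {k: 0.5 * p[k] + 0.5 * target[k] for k in p}
stable, ok = iterate_until_stationary(relax, {"A": 0.5, "B": 0.5})

# A hypothetical update that oscillates between two states.
flip = lambda p: {k: 1.0 - p[k] for k in p}
mixed, ok2 = iterate_until_stationary(flip, {"A": 0.8, "B": 0.2}, max_iter=4)
print(ok, stable, ok2, mixed)
```

In the oscillating case the averaged output sits between the two competing states, which matches the interpretation of non-convergence as a sign of nearby local minima.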

PINE-NMR is designed to analyze peak lists derived from one or more of a large set of NMR experiments commonly used by protein NMR spectroscopists. This set (listed on the PINE-NMR website) currently includes data types used for backbone and aliphatic side chain assignments. (PINE-NMR will be expanded in the future to handle aromatic side chain assignment.) To test the software, we asked colleagues at the Center for Eukaryotic Structural Genomics (CESG) and the National Magnetic Resonance Facility at Madison (NMRFAM) to provide subsets of data from projects that had led to structure determinations with assigned chemical shifts deposited in the BMRB

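The secondary structure accuracy reported in the table is the percentage of residues that PINE-NMR assigned to helix, strand, or “other” in agreement with DSSP. The outlier count is the total number of C′, C^{α}, or C^{β} atoms detected as possible outliers in the final assignment by the LACS method.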

In the majority of cases, the assignment accuracy was above 90% for backbone resonances and above 80% for aliphatic side chain resonances. Two cases in

An illustration of the improvement achieved by combining information comes from comparing the assignment accuracy results from PINE with those from PISTACHIO

The results of website users provide a separate measure of the performance of PINE-NMR. Since July, 2006, users have analyzed more than 1,300 sets of chemical shift data. Without access to the final structures and chemical shift assignments for these proteins, these results could not be analyzed, as in _{label} = x)

The level of accuracy and completeness achieved in favorable cases by a single automatic PINE-NMR computation was sufficient for the initial downstream steps of structure determination. For example, the PINE assignment output for ubiquitin, which was obtained from automatically picked peak lists from HSQC, HNCO, CBCA(CO)NH, HNCACB, C(CO)NH, H(CCO)NH, HCCH-TOCSY, HBHA(CO)NH, and C13-HSQC spectra, was provided, along with ^{15}N-NOESY and ^{13}C-NOESY spectra for this protein, as input to the Atnos

The level of assignments achieved by PINE-NMR for small proteins meets or exceeds the assignment levels that led to successful structure determination of small (under 100 residue) proteins from chemical shift data alone

PINE-NMR can also be useful for semi-automated analysis of larger proteins whose structure determination requires the collection of additional data such as dipolar couplings, manual NOESY assignments, or aromatic side chain assignments. We have developed PINE-NMR in ways that enable expert input, for example, by specifying a selective labeling scheme, pre-assigned spin systems, or pre-assigned cluster labels (for all or a subset of the data). For pre-assigned cluster labels, PINE-NMR can act as a verification tool, for example, by checking their internal consistency with peak lists or by detecting chemical shift referencing problems or outliers (the LACS report). The software performs spectral alignment, detects excessive noise peaks, uncovers experimental inconsistencies, recognizes the insufficiency of input data, and identifies nomenclature conflicts.

The latest version of PINE-NMR is available for public use through a webserver at

Application of the PINE algorithm to the NMR assignment problem has led to a tool that is capable of analyzing data in a self-correcting manner without the need for the user to manipulate any parameters in the software. The public availability of PINE-NMR through an online server has made it possible for a variety of users to test its accuracy and robustness. The PINE algorithm reformulates an otherwise intractable network of interactions within the context of an energy minimization problem. To address the high computational complexity of the minimization problem, we have devised a local approximation algorithm with reliable global properties. To address the non-convexity of the energy functional and the potential of “getting stuck” in local minima, we perform successive approximations with increasingly more complex energy functionals and with the reweighting of solutions.

Our evolution and selection of the initial network topology of PINE-NMR emerged through the examination of two quantities: (1) the estimated conformity across all datasets with respect to a single reference dataset (

The impact of topology selection can be investigated computationally by running simulations that test the computational complexity (running time) and accuracy of the results as a function of increasing network complexity. For small proteins, where the number of false positives and negatives is small, increasing network complexity leads asymptotically to higher accuracy (

In practical terms, additional knowledge about the structure of a protein can improve the data interpretation. For example, NMR experts often use their experience and knowledge of similar structures or structural folds to make decisions; this knowledge is often hard to codify in an algorithm. In some instances, the bias is subtle. For example, the use of data from BMRB to generate simulated peak lists that are subsequently assigned is afflicted with bias, because the data in BMRB are highly likely to be associated with a known structure and, therefore, to carry higher information content (sharper localization of parameters according to Bayes' formula).

One of the challenges in protein NMR spectroscopy is to minimize the time required for multidimensional data collection and analysis without sacrificing the quality of the resulting protein structure. We are in the process of coupling PINE-NMR to (HIFI-NMR)

The core computational model of PINE should be applicable to other problems where automated clustering is needed. For example, when DNA microarray data are used to explore all genes of an organism in order to detail their biochemical networks, automated clustering of gene networks can provide unbiased information about the underlying biology.

Running time and assignment accuracy of the results as a function of increasing network complexity. Network complexity is defined as: network complexity = −log(cutoff threshold). The results for smaller proteins or proteins with higher quality data (A) differ from those for larger proteins with low quality data (B). The results underscore the importance of properly setting the cutoff threshold in selecting the edge set when constructing the topology of the graph.


Side chain chemical shift assignment algorithm.


Examples of ten PINE-NMR runs with experimental NMR data showing how the data quality measure


We thank the members of NMRFAM and CESG who provided NMR data for this study and are pleased to acknowledge the many users of the PINE-NMR server, who provided useful feedback on its operation. William M. Westler carried out the combined PINE-NMR / Cyana structural analysis as part of an NMRFAM workshop.
