On the effect of phylogenetic correlations in coevolution-based contact prediction in proteins

doi:10.1371/journal.pcbi.1008957

Fig 1.

Schematic representation of the information used by DCA and the null models.

A multiple-sequence alignment (MSA) contains several types of information about sequence variability. The sequence profile and the residue covariation describe the statistics of individual MSA columns and of column pairs, respectively; both are used in DCA. However, the MSA also contains phylogenetic information, here represented by the matrix of Hamming distances between sequences or by the (inferred) phylogenetic tree. The different null models use the profile and phylogenetic information, but no residue covariation.
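
The Hamming-distance matrix mentioned in the caption can be illustrated with a minimal sketch. The function name `hamming_matrix`, the toy alignment, the normalization by alignment length, and the treatment of the gap symbol as an ordinary character are assumptions made for illustration; the article may use a different convention.

```python
def hamming_matrix(msa):
    """Pairwise Hamming distances between aligned sequences.

    Assumption: distances are normalized by the alignment length, and the
    gap symbol '-' counts as an ordinary character.
    """
    if not msa:
        return []
    length = len(msa[0])
    if length == 0 or any(len(seq) != length for seq in msa):
        raise ValueError("all sequences must share the same non-zero length")
    n = len(msa)
    dist = [[0.0] * n for _ in range(n)]
    for a in range(n):
        for b in range(a + 1, n):
            # Count the aligned positions at which the two sequences differ.
            mismatches = sum(x != y for x, y in zip(msa[a], msa[b]))
            dist[a][b] = dist[b][a] = mismatches / length
    return dist


# Hypothetical toy alignment: three sequences, five columns.
toy_msa = ["ACD-F", "ACDEF", "GCDEF"]
distances = hamming_matrix(toy_msa)
```

For the toy alignment, the first and third sequences differ in two of five columns, so their distance is 0.4.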

Fig 2.

Eigenvalue spectra of the covariance matrix for the natural MSA and for Null models I and II.

We show the cumulative distribution of the unified eigenvalue spectra for the 60-protein dataset DS2, i.e. the fraction of eigenvalues larger than λ is shown as a function of λ. We observe that the phylogeny-aware Null model II shows the same fat tail for large eigenvalues, which is also present in the natural data, while the non-phylogenetic Null model I has a more compact support. The cutoff of the tail for large λ is an effect of the inter-family variability of the largest eigenvalues among the 60 spectra, cf. Fig D in S1 Text for the 9 individual proteins in dataset DS1.
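
The quantity plotted here, the fraction of eigenvalues larger than λ, can be sketched as follows. This assumes that "unified" means pooling the eigenvalues of all families into a single list; the function name `fraction_larger_than` and the toy numbers are hypothetical, and the eigenvalues themselves are taken as given rather than computed from an MSA.

```python
from bisect import bisect_right


def fraction_larger_than(spectra, thresholds):
    """Fraction of pooled eigenvalues strictly larger than each threshold.

    Assumption: `spectra` holds one list of eigenvalues per protein family,
    and the spectra are simply pooled before counting.
    """
    pooled = sorted(ev for spectrum in spectra for ev in spectrum)
    total = len(pooled)
    if total == 0:
        raise ValueError("no eigenvalues given")
    # bisect_right returns how many pooled values are <= lam.
    return [(total - bisect_right(pooled, lam)) / total for lam in thresholds]


# Hypothetical eigenvalues for two families.
toy_spectra = [[0.5, 1.0, 4.0], [0.2, 2.0, 9.0]]
tail = fraction_larger_than(toy_spectra, [0.0, 1.0, 3.0, 10.0])
```

A fat tail appears in such a curve as a slow decay of the fraction at large λ, whereas a compact support makes it drop to zero at a finite λ.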

Fig 3.

DCA scores derived from natural sequence data and from MSA generated by Null models I and II, for datasets DS2 (panel A) of large MSA, and DS3 (panel B) of smaller MSA.

For the protein families under study, we show the histograms of DCA coupling scores F^APC (APC-corrected Frobenius norm of couplings, the standard output of plmDCA) for the natural MSA and for samples of Null models I and II. Here and in the following, histograms are normalized as probability distributions, i.e. to unit area under the curve. Phylogenetic effects evidently create, at least for sufficiently deep MSA, larger couplings than expected from finite sample size alone. However, couplings derived from the natural MSA reach substantially larger values. The figures also include the positive predictive value (PPV, scale on the right of each panel), i.e. the fraction of true contacts among all couplings with F^APC above some threshold θ, as a function of θ, for plmDCA run on the natural MSA. Almost all large couplings correctly predict contacts, while the PPV starts to drop once F^APC falls into the range also reached by phylogenetic effects in Null model II. We find this to be true for all non-trivial contacts (sequence separation |i − j| > 4) as well as for long-distance contacts (|i − j| > 24).
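
As a rough illustration of the score F^APC, the sketch below applies the commonly used average product correction, F^APC_ij = F_ij − F_i· F_·j / F_··, to a symmetric matrix of raw Frobenius norms. The function name `apc_correct` and the toy matrix are hypothetical; the matrix of raw norms is taken as given, and any gauge choice made by plmDCA before computing the norms is outside this sketch.

```python
def apc_correct(F):
    """Average product correction of a symmetric score matrix.

    Assumption: row and overall means are taken over off-diagonal entries
    only, and the diagonal of the result is set to zero.
    """
    n = len(F)
    if n < 2:
        raise ValueError("need at least two positions")
    # Mean score of each position i over all partners j != i.
    row_mean = [sum(F[i][j] for j in range(n) if j != i) / (n - 1)
                for i in range(n)]
    # Mean over all off-diagonal entries equals the mean of the row means.
    overall = sum(row_mean) / n
    if overall == 0:
        raise ValueError("overall mean score is zero; APC is undefined")
    return [[0.0 if i == j else F[i][j] - row_mean[i] * row_mean[j] / overall
             for j in range(n)]
            for i in range(n)]


# Hypothetical raw Frobenius norms for three alignment positions.
raw = [[0.0, 1.0, 2.0],
       [1.0, 0.0, 3.0],
       [2.0, 3.0, 0.0]]
corrected = apc_correct(raw)
```

In this toy case the pair (1, 2) keeps the largest corrected score, 0.5, while the pair (0, 1) drops to −0.5 because both positions have above-background average scores relative to their raw coupling.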

Fig 4.

Histogram of DCA scores derived from natural sequence data (Panel A) and Null model II (Panel B) for residue-residue contacts and non-contacts.

For the protein families in DS2, we show the histograms of DCA coupling scores (APC-corrected Frobenius norm of couplings), separated into contacts and non-contacts (defined using the representative protein structures in Table B in S1 Text). Only pairs with linear separation |i − j| > 4 along the chain are taken into account.

Fig 5.

PPV for residue-residue contact prediction from natural data and Null model II.

The positive predictive values for residue contact prediction are shown for datasets DS2 (Panels A and C) and DS3 (Panels B and D), using the natural data (red, blue) and randomized data from Null model II (green). The upper panels (A, B) show joint contact prediction for all proteins, the lower panels (C, D) the averages over the individual PPV curves of all single families. All panels also show hypothetical PPV curves (purple), which might be reached by a method removing phylogenetic biases; they artificially combine DCA scores obtained from natural MSA on contacts with scores from Null model I on non-contacts.
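
The difference between a joint PPV curve and an average over per-family curves can be sketched as below. The sketch assumes that the PPV is evaluated as a function of the number of top-ranked predictions, which is a common convention but is not stated in the caption; the function names and the toy scores are hypothetical.

```python
def ppv_curve(scored_pairs):
    """PPV after the top-k predictions, for k = 1..N.

    `scored_pairs` is a list of (score, is_contact) tuples.
    """
    ranked = sorted(scored_pairs, key=lambda pair: pair[0], reverse=True)
    curve, hits = [], 0
    for k, (_, is_contact) in enumerate(ranked, start=1):
        hits += bool(is_contact)
        curve.append(hits / k)
    return curve


def joint_ppv(families):
    """Pool the pairs of all families and rank them together."""
    return ppv_curve([pair for family in families for pair in family])


def mean_ppv(families):
    """Average the per-family curves, truncated to the shortest curve."""
    curves = [ppv_curve(family) for family in families]
    n = min(len(curve) for curve in curves)
    return [sum(curve[k] for curve in curves) / len(curves) for k in range(n)]


# Hypothetical scores and contact labels for two small families.
family_a = [(0.9, True), (0.5, False), (0.4, True)]
family_b = [(0.8, False), (0.7, True), (0.1, False)]
joint = joint_ppv([family_a, family_b])
averaged = mean_ppv([family_a, family_b])
```

The joint curve is dominated by families with many high-scoring pairs, while the averaged curve weights every family equally, which is why the two views can differ.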

Fig 6.

z-scores of couplings derived from the natural MSA, as compared to the distribution of couplings derived from Null model II.

For each residue pair (i, j), we calculate the z-score of the DCA score derived from natural data relative to 50 realizations of Null model II. Panel A shows the data for the dataset DS2 of large MSA, Panel B for DS3 of small to intermediate MSA.
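
A per-pair z-score of this kind can be sketched as follows: the natural score is compared with the mean and standard deviation of the scores obtained for the same pair across null-model realizations. The function name `z_score`, the toy values, and the use of the sample standard deviation are assumptions; the caption does not specify the estimator.

```python
from statistics import mean, stdev


def z_score(natural_score, null_scores):
    """z-score of one natural DCA score against null-model scores.

    Assumption: `null_scores` holds the score of the same residue pair in
    each null-model realization (50 in the caption), and the sample
    standard deviation is used.
    """
    if len(null_scores) < 2:
        raise ValueError("need at least two null realizations")
    sigma = stdev(null_scores)
    if sigma == 0:
        raise ValueError("null scores have zero spread; z-score undefined")
    return (natural_score - mean(null_scores)) / sigma


# Hypothetical scores for one residue pair (three realizations for brevity).
z = z_score(0.5, [0.1, 0.2, 0.3])
```

With these toy values the null mean is 0.2 and the sample standard deviation is 0.1, so the natural score lies about three standard deviations above the null distribution.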
