The complexity of protein interactions unravelled from structural disorder

The importance of unstructured biology has grown quickly over the last decades, accompanying the explosion in the number of experimentally resolved protein structures. The idea that structural disorder might be a novel mechanism of protein interaction is widespread in the literature, although the number of statistically significant structural studies supporting this idea is surprisingly low. At variance with previous works, our conclusions rely exclusively on a large-scale analysis of all the 134337 X-ray crystallographic structures of the Protein Data Bank, averaged over clusters of almost identical protein sequences. In this work, we explore the complexity of the organisation of all the interaction interfaces observed when a protein lies in alternative complexes, showing that interfaces progressively add up in a hierarchical way, which is reflected in a logarithmic law relating the size of the union of the interface regions to the number of distinct interfaces. We further investigate the connection of this complexity with different measures of structural disorder: the standard missing residues and a new definition, called “soft disorder”, that covers all the flexible and structurally amorphous residues of a protein. We show evidence that both the interaction interfaces and the soft disordered regions tend to involve roughly the same amino acids of the protein, and preliminary results suggesting that soft disorder spots those surface regions where new interfaces are progressively accommodated by complex formation. In fact, our results suggest that structurally disordered regions carry crucial information not only about the location of alternative interfaces within complexes, but also about the order of the assembly. We verify these hypotheses in several examples, such as the DNA binding domains of P53 and P73, the C3 exoenzyme, and two known biological orders of assembly.
We finally compare our measures of structural disorder with several bioinformatics disorder predictors, showing that the latter are optimised to predict the residues that are missing in all the alternative structures of a protein and are not able to capture the progressive evolution of the disordered regions upon complex formation. Yet the predicted residues, when not missing, tend to be characterised as soft disordered regions.


B On the normalisation of the B-factor
The different definitions of soft disorder given in the main text are formulated in terms of a normalised b-factor. For each residue, this b-factor is obtained as

$$b = \frac{B - \mu(B)}{\sigma(B)},$$

where B is the experimental B-factor of the Cα of each residue of the chain, and µ(B) and σ(B) are the average and standard deviation, respectively, of the distribution of B values over all residues in the chain.
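As an illustration of this normalisation, the following minimal sketch standardises the Cα B-factors of a single chain. The function name `normalise_b_factors` is hypothetical, and the use of the population standard deviation is an assumption, since the text does not state which estimator is used.

```python
from statistics import mean, pstdev


def normalise_b_factors(b_values):
    """Return the normalised b-factor (B - mu) / sigma for each residue of one chain.

    b_values: experimental C-alpha B-factors, one per residue of the chain.
    Assumption: sigma is the population standard deviation over the chain.
    """
    mu = mean(b_values)
    sigma = pstdev(b_values)
    if sigma == 0:
        # All residues share the same B-factor, so the normalisation is undefined.
        raise ValueError("B-factors are constant along the chain")
    return [(b - mu) / sigma for b in b_values]


# Toy chain with three residues (values are illustrative, not taken from the PDB).
example = normalise_b_factors([10.0, 20.0, 30.0])
```

By construction the normalised values have zero mean and unit standard deviation within each chain, so chains with different overall B-factor levels can be placed on a common scale.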
In the main text we argue that we use the normalised version, instead of the original one, because it avoids the dependency on the resolution of the experiments, but also because it better captures the structurally disordered or structured nature of these residues. We show several examples of this in S3 Fig. Our current definition better distinguishes the DtO residues from the rest (see S3A and S3B Figs), the residues adjacent to missing residues (see S3C and S3D Figs) and the residues predicted as disordered by several disorder predictors (see S3E and S3F Figs).

C About the clusters composition
The list of clusters used, together with the PDB index of the structures of the chains included, the list of structures with the different interfaces in each cluster and the list of clusters with unbound forms can be downloaded from www.lcqb.upmc.fr/disorder-interfaces/.
A breakdown of the number of clusters considered in each group of clusters discussed in the paper is shown in Table A.

D List of structures of cluster 6c9c_A (Fig 4 main text)
We show in Fig 3 of the main text 13 structures for the protein chains in the cluster 6c9c_A. One of them is an unbound structure (chain A of PDB index 6c9c), and the other 12 cover the different interfaces measured in the cluster. Their

S1 Figure: Hierarchical organisation of interfaces in the core domain of protein p53 (cluster 4ibs_A). We reproduce in a larger format Fig 1 in

Table A: Breakdown of the number of clusters in each group of clusters discussed in the main text. We include the number of clusters that contain at least one interface, together with the total number of PDB complexes and individual chains.

Most of the figures in the main text are shown as a function of the NDI. The NDI counts the number of different interfaces in the cluster with an overlap higher than 5%. This choice was somewhat arbitrary, and other choices could have been made, such as reducing the threshold to 1% or plotting the curves against the cluster size. Both options lead to noisier curves, but as we show in S6 Fig

F Total randomisation test of the disordered regions
In this section, we compare the statistics and prediction power of the disordered regions (DRs) with those of a random guess. With this aim we proceed as follows. We read the disordered amino acids (missing residues or residues with a high b-factor) from each chain structure in the cluster and randomise their location in the sequence. Among the possible randomisations, we first consider a purely random one, not very meaningful in a biological sense (test 1): a random reshuffling of the DRs by means of a random permutation of all the sites in the sequence. After this randomisation, we compute the DR as the union of all these random disordered sites in the cluster, as done in the main text, and compare its properties with those of the experimental IRs, as well as its power to predict IRs. In S7A Fig, we show the mean relative size of these DRs for the two sets. Clearly, the behaviour of these fake disordered regions with the number of interfaces in the cluster, whose sizes cover essentially the whole chain above 10 interfaces, differs strongly from the data shown for the union of the interface regions (or the real disordered regions).
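The randomisation can be sketched as follows. This is a minimal illustration, not the code used for the analysis: it assumes that each chain of a cluster is represented by a boolean mask of equal length (True for a disordered site), and all function names are hypothetical. The block-preserving variant, which this section also considers for the connected regions, is implemented here by shuffling whole disordered runs among the ordered sites, which is only one possible reading of "keeping consecutive disordered sites together".

```python
import random


def permute_sites(mask, rng):
    """Random permutation of all sites: disordered flags are scattered over the sequence."""
    shuffled = list(mask)
    rng.shuffle(shuffled)
    return shuffled


def shuffle_blocks(mask, rng):
    """Reshuffle the sequence while keeping each run of consecutive disordered sites intact.

    Assumption: each disordered run is one item, each ordered site is one item,
    and the items are shuffled and concatenated again.
    """
    items, i, n = [], 0, len(mask)
    while i < n:
        if mask[i]:
            j = i
            while j < n and mask[j]:
                j += 1
            items.append([True] * (j - i))  # whole disordered run
            i = j
        else:
            items.append([False])  # single ordered site
            i += 1
    rng.shuffle(items)
    return [flag for item in items for flag in item]


def union_fraction(masks):
    """Relative size of the union of disordered sites over all chains of a cluster."""
    length = len(masks[0])
    union = [any(m[k] for m in masks) for k in range(length)]
    return sum(union) / length


def count_regions(mask):
    """Number of connected regions, i.e. runs of consecutive True sites."""
    return sum(1 for k, flag in enumerate(mask) if flag and (k == 0 or not mask[k - 1]))


rng = random.Random(0)  # fixed seed so that the sketch is reproducible
chain = [True, True, False, False, False, True, False, False]
fake_permuted = permute_sites(chain, rng)
fake_blocks = shuffle_blocks(chain, rng)
```

Both randomisations keep the total number of disordered sites fixed. In the block-preserving variant two runs may end up adjacent and merge, so the number of connected regions can only stay the same or decrease.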
We compare the data shown in Fig 7 in the main text, concerning the number of connected interface or disordered regions of our clusters (normalised by the sequence size), with the number that one would measure if the DRs were random. We consider two distinct tests: a random permutation of the DR sites in the sequence (test 2), and a reshuffling of the DRs that keeps together all sequential disordered sites (test 3). In both cases the total number of disordered sites is kept fixed. We show in S7B Fig the number of connected regions averaged over all the clusters with the same number of different interfaces. Again, both tests give curves that differ significantly from the real ones, as long as the NDI is not too high (where the disordered regions superimpose, forming one or two very large clusters).

S3 Figure: On the normalisation of the B-factor. We show two columns of figures: the left column is computed using the normalised b-factor, and the right column using the experimental B-factor. In A and B, we show a histogram of the mean value (in each of the sets of clusters) of the b-factor of the residues that either belong to the DtO or not, grouped separately according to whether or not each residue was part of the interface in each of the structures of the cluster. Panel A shows the same data as Fig 6 in the main text, but on a logarithmic scale. In C and D, we show a histogram of the mean value (for each cluster) of the b-factor of the residues that were (or were not) just next to a missing residue in the sequence. In E and F, we show the histogram of the b-factor of each of the residues of the cluster representative structure predicted (or not predicted) as disordered by the three predictors considered in the main text.
S7 Figure: In A, we compare the averaged relative size of the union of disordered (orange) and interface (blue) regions (shown in Fig 7A in the main text) with respect to the sequence length, as a function of the cluster size, with the averaged relative size of the union of fake disordered regions obtained after reshuffling the experimental disordered regions (test 1, red). The randomised disordered regions behave rather differently with the number of interfaces than the union of experimental interface regions. In B, we compare the averaged number of connected disordered regions (DR, orange) and interface regions (IR, blue), normalised by the sequence length, as a function of the number of different interfaces in the cluster, with the numbers we would obtain if the same number of disordered sites were randomly distributed. We have considered two distinct randomisation tests: a random permutation of the disordered sites in the sequence (test 2) and a reshuffling of the disordered regions that keeps consecutive disordered sites together (test 3). Both tests lead to curves different from the real ones, with the exception of the very big clusters, where the regions superimpose, forming a very large cluster.

In dashed lines, we show the median expected PPV for a trivial correlation using all structures (black) and structures with high resolution (grey), displaying essentially the same curves. As shown, we observe no significant change with the resolution in the correlation between soft disorder and interfaces, although the curves are noisier in the high-resolution case because there are fewer clusters with a high number of structures than in the case studied in the paper.
S9 Figure: Metrics excluding the once missing residues. We repeat Fig 8A, but this time excluding from the analysis the residues that are reported as missing in at least one structure of the cluster. The results are indistinguishable from the ones containing DtO residues, thus excluding the possibility that the signal reported is trivially introduced by DtO residues forming interfaces.

S10 Figure: Metrics of the goodness of the match between the USDR and the UIR. We reproduce the analogous curve to Fig 8A for other possible metrics, including the Sensitivity (Sen), the Specificity (Spe), the Accuracy (Acc), the positive predictive value (PPV) and the F1 metric, which is given by the harmonic mean of the Sensitivity and the PPV. The coloured dots correspond to the structures shown in Fig 8C.

We compare the predictions from the disorder predictors discussed in the main text, once the forever missing residues of the cluster are removed, with our measures of the USDR. In A, we show the PPV of the disorder predictions with respect to the different definitions of soft disorder, as a function of the NDI. In B, we show instead the Sensitivity. In C, we show the Sensitivity versus the Specificity for the different NDI bins.
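For reference, the metrics named in the S10 caption can be computed from a per-residue comparison of two boolean masks, as in the minimal sketch below. The function names and the mask representation are assumptions made for illustration: `predicted` would mark the residues of the USDR and `actual` the residues of the UIR, both aligned to the same sequence.

```python
def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or NaN when the ratio is undefined."""
    return numerator / denominator if denominator else float("nan")


def match_metrics(predicted, actual):
    """Sen, Spe, Acc, PPV and F1 for two aligned per-residue boolean masks."""
    if len(predicted) != len(actual):
        raise ValueError("masks must have the same length")
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)          # predicted and in the interface
    fp = sum(1 for p, a in pairs if p and not a)      # predicted but not in the interface
    fn = sum(1 for p, a in pairs if not p and a)      # interface residue that was missed
    tn = sum(1 for p, a in pairs if not p and not a)  # correctly left out
    sen = safe_ratio(tp, tp + fn)
    ppv = safe_ratio(tp, tp + fp)
    return {
        "Sen": sen,
        "Spe": safe_ratio(tn, tn + fp),
        "Acc": safe_ratio(tp + tn, len(pairs)),
        "PPV": ppv,
        # Harmonic mean of the Sensitivity and the PPV.
        "F1": safe_ratio(2 * sen * ppv, sen + ppv),
    }


# Toy masks over eight residues (illustrative values only).
usdr = [True, True, True, False, False, False, False, False]
uir = [True, True, False, True, False, False, False, False]
metrics = match_metrics(usdr, uir)
```

In this toy case two residues are shared, one is predicted but not in the interface, and one interface residue is missed, so Sen, PPV and F1 all equal 2/3 while Spe is 4/5 and Acc is 3/4.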