Distinct Types of Disorder in the Human Proteome: Functional Implications for Alternative Splicing

Intrinsically disordered regions have been associated with various cellular processes and are implicated in several human diseases, but their exact roles remain unclear. We previously defined two classes of conserved disordered regions in budding yeast, referred to as “flexible” and “constrained” conserved disorder. In flexible disorder, the property of disorder has been positionally conserved during evolution, whereas in constrained disorder, both the amino acid sequence and the property of disorder have been conserved. Here, we show that flexible and constrained disorder are widespread in the human proteome, and are particularly common in proteins with regulatory functions. Both classes of disordered sequences are highly enriched in regions of proteins that undergo tissue-specific (TS) alternative splicing (AS), but not in regions of proteins that undergo general (i.e., not tissue-regulated) AS. Flexible disorder is more highly enriched in TS alternative exons, whereas constrained disorder is more highly enriched in exons that flank TS alternative exons. These latter regions are also significantly more enriched in potential phosphosites and other short linear motifs associated with cell signaling. We further show that cancer driver mutations are significantly enriched in regions of proteins associated with TS and general AS. Collectively, our results point to distinct roles for TS alternative exons and flanking exons in the dynamic regulation of protein interaction networks in response to signaling activity, and they further suggest that alternatively spliced regions of proteins are often functionally altered by mutations responsible for cancer.


Introduction
While it is well established that a protein's three-dimensional structure determines its function, a large fraction of proteins and protein regions lack stable structure. Such intrinsically disordered proteins contain extended regions that do not fold into a native fixed conformation [1]. These disordered regions are widespread across the tree of life, particularly in eukaryotes [2]. For example, amino acids comprising approximately 30-40% of the human proteome are predicted to reside within disordered regions [3]. Many different functions have been ascribed to disordered proteins. For instance, they have been shown to carry out regulatory functions associated with signal transduction and molecular recognition, including transcription, protein phosphorylation, mRNA metabolism, RNA processing, translation, chaperone activity and regulation of the cell cycle [1,4,5].
Alternative splicing (AS) and post-translational modifications such as phosphorylation are known to regulate and diversify the functions of proteins and are thought to partly account for the increased complexity of metazoan species. Human alternatively spliced exons are enriched in regions of intrinsic disorder, presumably to provide functional and regulatory diversity while avoiding disruption to core protein structure [3,6,7]. Moreover, we and others have recently shown that tissue-regulated alternative exons are enriched in highly disordered regions of proteins, where they frequently modulate interactions in protein-protein interaction networks [8][9][10]. In addition, disordered regions often harbor linear motifs that mediate recognition functions and can therefore be considered a class of functional domain [11,12].
Finally, intrinsic disorder is abundant among proteins associated with various human diseases such as cancer, cardiovascular disease, amyloidoses, diabetes, neurodegenerative diseases and others [13]. Furthermore, highly connected proteins in “diseasome” networks are enriched in disorder [14]. However, due to the wide range of roles of disordered proteins, it has been difficult to ascribe specific functions to disordered regions.
In order to better understand the roles of intrinsic disorder, we previously developed a method to analyze the conservation of intrinsic disorder across the yeast clade [15]. Over large regions of proteins, the property of disorder is highly conserved, i.e., the same residues are disordered in most orthologous proteins. Additionally, the underlying amino acid sequence of the disordered regions may either be conserved or significantly diverged. Based on this observation, we defined two types of conserved disorder: 1) “constrained disorder”, regions where the amino acid sequence is well conserved, and 2) “flexible disorder”, regions where the amino acid sequence has diverged. Our analyses revealed that these two types of conserved disorder have different biophysical and biological properties. Flexible disorder is predominantly associated with signaling and regulation, whereas constrained disorder is associated with chaperones and ribosomal proteins.
Here, we investigate the roles of these different forms of disorder in metazoans, with a focus on the human proteome. We provide evidence for distinct roles for disorder in tissue-specific regulation. In particular, we find different roles for constrained and flexible disorder in relation to alternatively spliced regions of proteins, phosphorylation sites and short linear motifs. While flexible disorder may predominantly function by providing structural flexibility that enables the expression and folding of splice isoforms, constrained disorder appears to provide structural scaffolding for presentation of linear motifs and phosphorylation sites, enabling tissue-regulated alternative splicing to rewire signaling pathways and protein interaction networks.

Results
A new role for disorder in tissue-specific protein regulation
Using our previously described methodology [15], we analyzed the distribution of conserved flexible and constrained disorder in human proteins. To ensure reliable disorder prediction and sequence alignment, we used two different and independent strategies, which yielded qualitatively similar results (see Methods and Text S1). As the assignment of the two types of conserved disorder depends on the cut-off values used to classify residues as disordered and conserved, we took steps to ensure consistent criteria in our analyses (see Methods). Specifically, we sought to maximize consistency in assignments of disorder category between the current work and our previous study in yeast [15], i.e., residues in human proteins should be assigned the same category as the corresponding residues in their yeast orthologs (where these exist). Among all orthologous proteins, we observe 61% overlap between assigned disordered residues in both species. Interestingly, there is a significantly higher overall level of conserved flexible disorder in human compared to yeast proteins (79% vs. 38%; P = 0, Chi-squared Test). In contrast, when comparing human proteins that have yeast orthologs, which are of older evolutionary origin, with human proteins that lack yeast orthologs, there is significantly more constrained disorder in the latter set (5% and 8%, respectively; P = 0, Chi-squared Test). Similarly, yeast proteins that lack human orthologs have, on average, a slightly higher level of constrained disorder (see Figure 1). It is interesting to consider that the significant increase in constrained disorder in more recently evolved human proteins may be associated with an increase in organismal complexity. Likewise, the increase of flexible disorder in such human proteins may be associated with a higher rate of neutral change, which may provide a basis for the evolution of new functions.
To further examine the possible role of conserved constrained and flexible disorder, we performed a functional enrichment analysis of proteins containing relatively high proportions of flexible or constrained disordered residues (See Methods). We find that both flexible and constrained disorder are enriched in proteins with functions related to cell differentiation and development (See Table S1). For example, proteins enriched in flexible disorder are significantly associated with categories such as erythrocyte differentiation and osteoblast development. Likewise, proteins with constrained disorder are enriched in functions associated with fibroblast migration and smooth muscle development. This is consistent with our earlier findings focusing on the yeast clade, in which we found that disorder is closely related to regulatory functions, rather than structural or enzymatic activities. Regulatory function in human proteins is often related to cell differentiation and development and, evidently, disordered regions play an important role in these processes [15].

Relationships between disorder and alternative splicing
Regulation of tissue-specificity can be achieved through multiple processes including differential gene expression [16], post-translational modification [17] and alternative splicing [18][19][20][21][22]. To better understand the role of conserved disorder in determining tissue-specificity, we explored its relationship with tissue-specific regulation at the levels of mRNA expression, alternative splicing and phosphorylation. We observe that constrained disorder is weakly although significantly correlated with tissue-specificity in mRNA expression (r = -0.13, P < 2.2e-16; see Methods and Figure S2) [23,24]. However, we observe a stronger association between constrained disorder and tissue-regulated AS (see below).
We have recently shown that tissue-specific exons are enriched in regions of highly disordered amino acid sequences, and that these exons often function in controlling protein-protein interactions (PPIs) in networks [8]. In contrast to a previous report [6], we found that alternatively spliced exons that are not alternatively spliced in a tissue-specific manner, termed here general AS events, are not significantly enriched in disordered regions (see also Figure 2A). Here, we resolve this apparent discrepancy. The Romero et al. study mostly analyzed UniProt-annotated alternatively spliced exons, which are enriched in tissue-specific AS exons (P < 0.004, Chi-squared test; see Text S1). In fact, by pre-defining a bona-fide set of proteins with tissue-specific AS exons, we find that the UniProt set of proteins contains approximately the same level of disorder as our set, whereas exons that are not pre-selected as tissue-specifically regulated in the UniProt set have a markedly lower level of disorder and are very close to the genomic average (see Figure 2B). Our findings underline the importance of distinguishing between tissue-specific and general AS exons when establishing relationships between disorder and AS. Importantly, when extending the above analysis by further categorizing conserved protein disorder into subgroups associated with AS regions of proteins, we observe several interesting relationships. While tissue-specific alternative exons have a significantly higher rate of flexible disorder relative to general alternative exons (i.e., those exons that are generally not subject to tissue regulation), conserved constrained disorder is not enriched in these exons (P < 3.36e-5 for flexible disorder, Mann-Whitney test; see Figure 3A and Figure 3B). In contrast, the constitutive exons immediately flanking the tissue-specific alternative exons are significantly enriched in both flexible and constrained disorder when compared to general alternatively spliced exons. Similar results are observed when controlling for potential biases stemming from alignment methodology, alignment quality, or disorder prediction methodology, as well as when controlling for possible biases due to alternative exons missing in some orthologs (see Text S1 and Figures S5, S6, S7, S8).

Author Summary
A protein's cellular and molecular function is typically determined by its folded structure. However, a large fraction of proteomes lack stably folded structure. These regions are referred to as intrinsically disordered. Protein disorder has largely been understudied, although it is emerging as having numerous important functions in a cell. Similarly, although alternative splicing (AS) is well established as an important regulatory layer of metazoan gene expression, its specific roles at the protein level are not well understood. We and others have recently provided evidence that tissue-regulated AS likely plays a widespread role in the control of protein-protein interactions. In the present study, we investigate how two different classes of conserved protein disorder may contribute distinct functions in relation to roles of regulated alternative exons in the dynamic remodeling of interaction networks. We also investigate the distribution of cancer-causing mutations in regulated and other alternatively spliced regions of proteins.

The enrichment in flexible disordered amino acids in tissue-specific alternative exons is consistent with the hypothesis that disordered regions afford structural flexibility such that exons can be alternatively spliced in or out without jeopardizing protein stability [6]. This view is consistent with previous observations that regulated AS events are under-represented in folded domains of proteins [8,9,20,25,26], while transcripts harboring such AS events appear to be generally translated [27], although in some cases it has been reported that alternatively spliced exons lead to misfolded or unstable proteins, which are degraded [28,29]. This latter situation may in some cases provide a form of post-translational regulation [29]. Furthermore, a subset of AS events will lead to low-abundance isoforms, including those containing premature termination codons, which are often targeted by nonsense-mediated mRNA decay (NMD) and are less likely to be translated [30,31].
Given these possible scenarios, we determined whether our set of proteins containing tissue-specific alternative exons is enriched in bona-fide proteins listed in Hegyi et al. [32] (i.e., proteins for which there is evidence from mass spectrometry studies), relative to the set of proteins that contain general alternative exons. Indeed, we find that proteins harboring tissue-regulated alternative exons are significantly more likely to be functional (see Methods), consistent with the idea that tissue-specific AS events affect tissue development and identity through the regulation of protein function (P < 0.03, Chi-squared Test; see Figure 3C). Further supporting this conclusion, as found for tissue-regulated alternative exons, we find that alternative exons overlapping bona-fide proteins are also significantly enriched in flexible disorder compared to the general alternative exons (P < 0.05, Mann-Whitney Test; see Figure 3D). These results suggest that the enrichment of tissue-regulated alternative exons in flexible disorder is largely due to structural reasons, i.e., to aid the folding and stability of both alternative isoforms.
We also observe a second, distinct relationship between conserved disorder and tissue-regulated AS events, namely, that both flexible and constrained disorder are significantly enriched in the constitutive exons immediately flanking the alternatively spliced exons (see Figure 3A and 3B). The majority of interactions in signaling pathways are mediated by short, flexible interfaces that can be detected at the sequence level as linear motifs. These motifs mostly occur in disordered regions due to the conformational flexibility afforded by these regions, which is important for their recognition. Some are bound by peptide binding domains such as SH3 domains, while others are sites of post-translational modification, e.g., by protein kinases. Taken together with our recent results revealing a widespread role for tissue-specific alternative exons in controlling PPIs [8], we considered that the enrichment of the flanking constitutive exons in flexible disorder may be important for controlling interactions mediated by the adjacent alternative exons. Accordingly, we sought to better define the linear motifs and phosphosites associated with alternatively spliced exons.
Linear motifs and phosphosites are enriched in flanking constitutive exons, but not in alternatively spliced exons
First, we analyzed the role of flexible and constrained disorder with respect to phosphosites and linear motifs. Consistent with earlier results, we find that both kinds of disorder are enriched in these protein features [15]. Extending this, we find that while actual phosphosites and linear motifs are associated with a peak in constrained disorder, the immediate flanking regions have comparatively higher rates of flexible disorder (see Figure 4A). This finding suggests an appealing picture: regions around phosphosites are enriched in flexible disorder, thereby providing the flexibility needed for phosphorylation. Conversely, the phosphosite itself tends to be conserved, and is therefore more enriched in constrained disorder.
Next, we investigated the extent of enrichment of phosphosites and linear motifs in regions surrounding alternatively spliced exons. Zhang et al. previously observed an enrichment of phosphosites in proteins regulated by the Nova splicing factor [33]. While previous studies found enrichment for linear motifs in alternatively spliced exons [7,9], we find strong enrichment for both features in exons flanking the alternative exon, but no measurable enrichment in the alternative exon itself (see Figure 4B, 4C and also Text S2 for comparison against recent findings of Buljan et al. [9]). This suggests that the role of disorder in alternative exons likely differs from that in flanking exons. In particular, constitutive exon flanks may provide scaffolding for regulatory roles of linear motifs and phosphosites, while flexible disorder in alternatively spliced exons may largely have a structural role (see above). We compared the rates of constrained disorder of residues within and outside of phosphosites and linear motifs, respectively, in constitutive exon flanks and in randomly selected distal exons. In other words, in this analysis we compared the increase in constrained disorder due to the presence of a phosphosite or linear motif to the increase due to tissue-specific alternative splicing. We find that the enrichment for constrained disorder in exons flanking tissue-specific AS exons is to a large extent driven by the presence of phosphosites and linear motifs (Figure 5). In particular, compared to the proteome-wide disorder rate average of 36%, we find that tissue-specific exons outside of phosphosites are slightly enriched in disorder (45%), while a larger increase in enrichment of both constrained and flexible disorder is observed for residues located around phosphosites and ELMs (81%).
Interestingly, when performing the same analysis for alternative exons and flexible disorder, we observe a relatively large enrichment for flexible disorder (>52%; see Figure S3) that is independent of phosphosites or ELMs, compared to the proteome-wide average of 20%. This observation is consistent with our earlier result that the enrichment of flexible disorder in tissue-specific alternative exons is due to structural flexibility.

Alternatively spliced exons and their flanking exons are enriched in cancer driver mutations
Both disordered regions and linear motifs are known to have important roles in regulation of many cellular processes and have been implicated in numerous diseases. As we observed significant enrichment of flexible and constrained disorder in tissue-regulated exons and flanking exons, respectively, we next asked whether such regions are associated with disease mutations. More specifically, we asked whether mutations implicated in driving cancer growth are enriched in these regulation “hot spots”. For control and comparison purposes, we investigated enrichment of cancer mutations in general alternative exons and flanking exons. Abnormal perturbations in cell regulation due to genetic mutations can result in uncontrolled cell proliferation and tumor formation [34]. Such changes are caused by “driver” mutations, i.e., mutations that provide a growth advantage. By contrast, the majority of somatic mutations in cancer are “passenger” mutations that accumulate in the cancer genome as a result of a breakdown of DNA repair processes [35]. To define driver and passenger mutations, we used cancer mutation frequency information from the Catalogue of Somatic Mutations in Cancer (COSMIC) [36,37]. For our analysis, we classified driver mutations based on their occurrence in multiple independent tumor samples, whereas passenger mutations were present in single tumor samples (see Methods for details).
Although we did not observe significant enrichment of driver mutations in regions containing tissue-specific AS events compared to regions containing general AS events, we did observe an overall significant enrichment of driver mutations in AS neighborhoods (Figure 6A) compared to randomly selected exons. Remarkably, 690 of 1502 (46%) driver mutations were detected in alternative splicing regions encompassing alternative (A) exons and flanking constitutive exons (C1 and C2). Specifically, there is a density of 0.43, 0.93 and 0.49 driver mutations per 10 kb in C1, A and C2, respectively, whereas the density in the overall exome is 0.24 driver mutations per 10 kb. Since the A and flanking C1 and C2 exons constitute only a small portion of the coding genome (~10 million nucleotides in our dataset), this enrichment is highly significant, as revealed by a Chi-square test (P < 1.99e-108) comparing the ratios of driver vs. passenger mutations in alternative splicing neighborhoods with those in the rest of the exome. Our results remain qualitatively unchanged when we use other frequency thresholds for calling driver and passenger mutations, indicating robustness of our observations (see Methods). Moreover, a missense mutation occurring in an alternatively spliced neighborhood is ~5 times more likely to be a driver than a passenger mutation when compared to constitutive distal exons in the same proteins (see Figure 6B; P < 2.59e-63, Chi-square Test). Likewise, it is more than 4.5 times more likely to be a driver than a passenger mutation compared to mutations occurring in the rest of the exome (P < 5.9e-202, Chi-square Test).
These results provide evidence that alternatively spliced exons and their flanking exons are hot spots for cancer driver mutations. Although we did not observe significant enrichment of driver or passenger mutations in tissue-regulated exons or their flanking constitutive exons, driver mutations were nevertheless detected in these regions. Given the importance of these regions in the regulation of protein-protein interactions and in signaling, it is important to consider that disease mutations in these regions may result in the rewiring of signaling and protein-protein interaction networks in cancer cells. Conversely, the enrichment of driver mutations in regions that are alternatively spliced but not annotated as undergoing tissue regulation could reflect selection acting to avoid disruption of regions of proteins that are more often associated with formation of interaction hubs in protein interaction networks. Alternatively, it is possible that many regions annotated as “general” AS are in fact regulated in a tissue-specific or condition-specific manner but were not detected as such using the limited panel of RNA-Seq data employed in this study. Regardless, these results provide a basis for future investigations addressing the mechanisms by which cancer driver mutations contribute to the onset and progression of tumors.

Discussion
In this work we used a comparative proteomics approach to investigate fundamental properties of conserved disorder in higher eukaryotes. Our results suggest that conserved flexible disorder may largely have a structural role associated with tissue-specific alternative splicing, whereas conserved constrained disorder has a regulatory role by providing scaffolding for linear motifs. As it becomes increasingly evident that alternative splicing affects a substantial fraction of the proteome and is an important determinant in controlling protein interactions, future studies will be facilitated by taking these different possible roles of disorder into account. It will be of considerable interest to determine the different functional relationships between AS and the various protein motifs and features that we find are enriched in and proximal to tissue-regulated alternative exons in this study. In particular, it will be important to address the role of specific arrangements of linear motifs in the regulation of protein-protein interactions [8][9][10]. The lack of enrichment of interaction motifs in regulated alternative exons may imply that these exons attenuate interactions that are mediated by linear motifs or phosphosites in flanking constitutive exons (where they are enriched). On the other hand, the alternatively spliced exon may represent the main site of the protein-protein interaction and its affinity may be modulated by the modification status of sites within the flanking exon regions, with the interaction dependent on both splicing and phosphosite or the status of other PTMs. Our results thus provide interesting testable hypotheses that can be addressed in future experiments. Finally, we provide new insight into relationships between cancer driver mutations, AS, and protein composition and function, which will facilitate future studies directed at determining mechanisms underlying the growth and spread of cancer cells.

Figure 5. The enrichment of disorder around alternatively spliced exons is driven by phosphosites and ELMs. Disorder rates of residues in different alternatively spliced exons. Left: Disorder rates of residues with and without phosphosites in general alternatively spliced exons and of residues with and without phosphosites in tissue-specific alternatively spliced exons. While the increase in disorder rate is modest between residues in general to tissue-specific exons, a much stronger increase is observed when comparing between residues with and without phosphosites. (All differences are significant with P < 1e-16, Wilcoxon rank-sum test). Right: Disorder rates of residues with and without linear motifs in general alternatively spliced exons and of residues with and without linear motifs in tissue-specific alternatively spliced exons. While the increase in disorder rate is modest between residues in general to tissue-specific exons, a much stronger increase is observed when comparing between residues with and without linear motifs. (All differences are significant with P < 1e-16, Wilcoxon rank-sum test). doi:10.1371/journal.pcbi.1003030.g005

Orthologue selection and alignment
Human proteins were selected from the 81968 human proteins in Ensembl (v57.0) [38] using two rules: 1. The protein identifier mapped to CCDS [39]. 2. The protein had more than 15 orthologues within the Eukaryotes [40].
In the event of one-to-many and many-to-many orthologous relationships for a given human protein, blastp was used to select the closest orthologue, i.e., the one with the lowest E-value. The resulting 28781 orthologue groups spanning 51 eukaryote species were aligned using the multiple sequence alignment tool MAFFT with default options [41,42]. 22 of 55 species were selected to be sufficiently diverse in order to prevent the overestimation of sequence conservation [43,44] (see Figure S4). To avoid biases due to the alignment tool, we also used an alternate alignment strategy (see Text S1).

Protein disorder
Protein disorder was derived using the software Disopred2 with default settings [45]. To avoid biases due to the disorder prediction algorithm, we also used an alternate prediction tool (See Text S1).

Calculation of residue and disorder conservation score
Amino acid conservation and disorder conservation scores were calculated in the same manner as in Bellay et al. [46]. The amino acid conservation score (A_n) of position n in an alignment with K sequences is calculated as
$$A_n = \frac{\max_{1 \le i \le 20} a(i,n)}{K}$$
where a(i,n) is the number of sequences that have amino acid of type i at position n. Each position was then assigned a binned score. The disorder conservation score (D_n) is the binned score (using the same binned scoring scheme) of the percentage of species in a multiple sequence alignment retaining the same disorder classification. This is achieved by superimposing the disorder classification for each amino acid by Disopred2 [45] on the previously described multiple sequence alignment.
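As an illustration of the unbinned scores described above, the following is a minimal Python sketch, not the authors' implementation. The function names are hypothetical, and two choices are assumptions made here: gap characters count toward K, and the disorder score is taken as the fraction of aligned residues predicted as disordered. The binning scheme itself is not reproduced.

```python
from collections import Counter

# The 20 standard amino acids (one-letter codes).
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def conservation_fraction(column):
    """Fraction of the K aligned sequences carrying the most common
    amino acid in one alignment column: max_i a(i, n) / K.

    Assumption: gaps and unknown symbols count toward K but not toward
    any amino acid count a(i, n).
    """
    k = len(column)
    counts = Counter(res for res in column.upper() if res in AMINO_ACIDS)
    if k == 0 or not counts:
        return 0.0
    return max(counts.values()) / k


def disorder_fraction(disorder_flags):
    """Fraction of aligned residues predicted as disordered in one column.

    `disorder_flags` holds one boolean per aligned sequence (True means
    the residue was predicted as disordered).
    """
    flags = list(disorder_flags)
    if not flags:
        return 0.0
    return sum(1 for flag in flags if flag) / len(flags)
```

For example, `conservation_fraction("AAAG-")` gives 0.6, because three of the five aligned sequences carry alanine at that position.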

A systematic classification of disorder
Conserved disorder refers to aligned positions that have D ≥ 3, indicating that ≥ 30% of aligned residues are disordered. This category contains two classes: 1. Constrained disorder: aligned positions where D ≥ 3 and A ≥ 9, indicating that the selected sequences are disordered in 30% or more of aligned residues and conserved in 80% or more of aligned residues. 2. Flexible disorder: aligned positions where D ≥ 3 and A < 9, indicating that the selected sequences are disordered in 30% or more of aligned residues and conserved in less than 80% of the aligned residues.
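The classification rule above can be sketched as a small Python function. This is an illustrative sketch only: the function name and return labels are hypothetical, and the inputs are assumed to be the already binned scores D and A for one aligned position.

```python
def classify_position(d_bin, a_bin, d_min=3, a_min=9):
    """Classify one aligned position from its binned scores.

    d_bin: binned disorder conservation score (D)
    a_bin: binned amino acid conservation score (A)
    The default thresholds follow the text: D >= 3 marks conserved
    disorder, and A >= 9 separates constrained from flexible disorder.
    """
    if d_bin < d_min:
        return "not conserved disorder"
    if a_bin >= a_min:
        return "constrained"
    return "flexible"
```

Under these thresholds, a position with D = 3 and A = 9 is constrained, while a position with D = 3 and A = 8 is flexible.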

GO enrichments
GO term enrichment for each class (constrained and flexible disorder) was performed by binning each protein into one of the classes based on its maximum proportion of residues in that class. The distribution of disorder for each GO term was tested against the background distribution of that disorder type using the Wilcoxon Rank Sum test at p-value < 0.05, where the p-value was adjusted for multiple hypothesis testing using the false discovery rate.
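The text does not state which false discovery rate procedure was used. As an assumption, the sketch below applies the widely used Benjamini-Hochberg adjustment to a list of per-GO-term p-values; the function name is hypothetical and the p-values in the usage note are made up for illustration.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order.

    For the p-value with ascending rank r out of m tests, the adjusted
    value is the minimum of p * m / r over all ranks >= r, capped at 1.
    """
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the running minimum
    # enforces monotonicity of the adjusted values.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted
```

With illustrative p-values `[0.01, 0.04, 0.03, 0.20]`, only the first term remains below 0.05 after adjustment.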

Tissue-specificity and gene expression
We used the RNA-Seq data from Illumina's Human BodyMap 2.0 project, which was kindly provided by Dr. Gary Schroth (Illumina) and recently documented by Rinn and colleagues [47]. The data consist of 16 human tissue types, including adrenal, adipose, brain, breast, colon, heart, kidney, liver, lung, lymph, ovary, prostate, skeletal muscle, testes, thyroid, and white blood cells. We trimmed all reads to 50 nucleotides, and used only the forward end. We then mapped the reads to the transcriptome using bowtie [48] with -m 1 -v 2 parameters (requiring unique mapping and two or fewer mismatches across the full alignment). We corrected for multiple mapping as follows: every 50-nt window at each position in each transcript was mapped back against the whole transcriptome. If the window mapped somewhere else in addition to itself, we discarded it and subtracted it from the transcript's effective length (length - 49). We then used this “effective length” to divide the raw read counts per million mapped reads for each gene to obtain corrected RPKM values (cRPKM). We then used a conservative cRPKM cutoff of 10, and called a gene expressed in a given tissue if cRPKM ≥ 10. Finally, we derived a tissue-specificity score for each of the 17039 genes from t and T, where t is the number of tissues the gene is expressed in and T = 16 is the total number of tissues considered.
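The expression-calling steps above can be sketched in Python as follows. This is a hedged illustration rather than the pipeline actually used: the function names are hypothetical, and the per-kilobase scaling is an assumption carried over from the standard RPKM definition. The sketch stops at the tissue count t and does not compute the tissue-specificity score.

```python
def effective_length(transcript_length, n_nonunique_windows, read_len=50):
    """Number of uniquely mappable read start positions in a transcript.

    A transcript of length L has L - (read_len - 1) windows, i.e.
    L - 49 for 50-nt reads; windows that also map elsewhere in the
    transcriptome are subtracted.
    """
    return (transcript_length - (read_len - 1)) - n_nonunique_windows


def crpkm(raw_count, eff_length, total_mapped_reads):
    """Corrected RPKM: reads per kilobase of effective length per
    million mapped reads (per-kilobase scaling is an assumption)."""
    return raw_count / (eff_length / 1_000) / (total_mapped_reads / 1_000_000)


def tissues_expressed(crpkm_by_tissue, cutoff=10.0):
    """Count tissues (t) in which a gene passes the cRPKM >= 10 cutoff."""
    return sum(1 for value in crpkm_by_tissue.values() if value >= cutoff)
```

For instance, a gene with 500 reads, an effective length of 1000 nt and 10 million mapped reads has a cRPKM of 50 and would be called expressed in that tissue.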

Alternative splicing
We used the same RNA-Seq dataset described above, together with the alternative splicing events previously mined from the BodyMap dataset (see [8] for details). Of the 27,240 distinct human cassette exon alternative splicing (AS) events from RNA-Seq data, 16050 were mapped with high confidence to the subset of Ensembl protein isoforms (explained above). Of these, we used only the 4328 AS events that had both the inclusion and exclusion isoforms mapped. We refer to this dataset as the 'general AS' event set. From this set, we further derived a set of 268 tissue-specific events that we previously called as specific to one or more of the tissues listed above. See the Supplementary material in [8] for a detailed description of the categorization of alternative splicing events into constitutive, general and tissue-specific events.

Phosphorylation sites and Eukaryotic Linear motif sites
Human phosphorylation sites were obtained from PhosphoSitePlus [49] and Phospho.ELM [50]. We used 77615 phosphorylation sites from 13010 proteins. ELM sites were kindly provided by Dr. Norman Davey (EMBL, Heidelberg), who used the SLiMSearch 2.0 [51] tool to generate the high-quality ELM dataset.

Enrichment map
We used Cytoscape [52] and the Enrichment Map plugin [53] to create the Enrichment Maps. The edges represent the value of the overlap coefficient (size of the intersection of both GO terms / size of the smaller GO term) with a cutoff at 0.4.
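The overlap coefficient used for the edges can be written out as a short Python sketch. The function names and gene sets are hypothetical, and whether the cutoff is inclusive (≥ 0.4) is an assumption, since the text only states a cutoff at 0.4.

```python
def overlap_coefficient(genes_a, genes_b):
    """Size of the intersection divided by the size of the smaller set."""
    a, b = set(genes_a), set(genes_b)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def has_edge(genes_a, genes_b, cutoff=0.4):
    """Draw an edge between two GO terms when their overlap coefficient
    reaches the cutoff (inclusive comparison is an assumption)."""
    return overlap_coefficient(genes_a, genes_b) >= cutoff
```

Two GO terms annotated with 3 and 5 genes that share 2 genes have an overlap coefficient of 2/3 and would therefore be connected.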

Cancer mutations
The mutation data were obtained from the Sanger Institute Catalogue Of Somatic Mutations In Cancer web site, http://www.sanger.ac.uk/cosmic [36].
Somatic missense mutations from 98463 amino acid sites were downloaded (version 59). Driver and passenger mutation sites were classified by their mutation frequency. A missense mutation was defined as a driver mutation if it was observed in at least 5 distinct COSMIC samples from at least 3 distinct studies. To prevent bias from low-throughput, targeted gene analyses, we also called mutations observed in at least 3 distinct samples from whole-genome screening studies as driver mutations. We obtained 1502 driver and 97961 passenger mutations. While the frequency thresholds were set arbitrarily owing to the lack of a gold-standard set, we observed that our results remain qualitatively unchanged even when using a range of thresholds for calling driver and passenger mutations, implying robustness of our observations.

Figure S1 Each network is a representation of the GO terms over-represented in the sets of proteins enriched in (A) Constrained disorder, (B) Flexible disorder. Each node represents a GO term, its size indicating the significance of the enrichment (the bigger the node, the more significant the enrichment). Edges represent overlap between two GO terms (Overlap coefficient). (TIF)
Figure S2 The boxplots show the correlation between the tissue specificity of the gene and the portion of (A) flexible disorder and (B) constrained disorder. All genes are binned into 5 different bins depending on the tissue specificity score. (TIF)
Figure S3 The enrichment of disorder, constrained disorder, and flexible disorder in different types of exons is largely driven by phosphosites and ELMs.
Text S1 Alternative alignments and disorder prediction methodology. Results obtained from re-implementing our pipeline with the MUSCLE [42] and IUPred [54] tool combination. (DOCX)
Text S2 A note on results of Buljan et al. [9]. Comparison of our ELM enrichment against the results reported in Buljan et al. [9]. (DOCX)