Sequence Complexity of Amyloidogenic Regions in Intrinsically Disordered Human Proteins

An amyloidogenic region (AR) in a protein sequence plays a significant role in protein aggregation and amyloid formation. We investigated the sequence complexity of the ARs present in intrinsically disordered human proteins. More than 80% of the human proteins in the disordered protein databases (DisProt and IDEAL) contained one or more ARs. The AR content of a protein sequence decreased with increasing protein disorder. A probability density distribution analysis and a discrete analysis of the AR sequences showed that ∼8% of the residues in a protein sequence were in ARs and that an AR was on average 8 residues long. The residues in the ARs were high in sequence complexity and seldom overlapped with low complexity regions (LCRs), which were abundant in disordered proteins. The sequences in the ARs showed mixed conformational adaptability towards α-helix, β-sheet/strand and coil conformations.


Introduction
The available genome sequences and several computational methods have revealed the existence of proteins that remain disordered under physiological conditions and are functional in that state [1][2][3][4][5][6][7][8][9]. These proteins are known by different names, such as intrinsically disordered [10], natively denatured [11], natively unfolded and intrinsically unstructured proteins [3], [10]. The accepted convention, however, is intrinsically disordered protein (IDP). IDPs comprise 25-30% of the eukaryotic proteome, and ~50% of eukaryotic proteins contain long disordered regions [12]. IDPs lack a well-defined three-dimensional folded structure in solution and exist as an ensemble of interconverting conformations under physiological conditions [13][14][15]. The lack of a rigid, stably folded structure may give IDPs the plasticity to interact efficiently with different targets, in contrast to a globular protein with limited conformational flexibility [16], [17]. These characteristics possibly enable IDPs to take part in diverse pathological and biochemical functions [5], [6], [13], [16], [18][19][20]. Their functions range from DNA binding to cell cycle regulation, membrane transport, different molecular recognition processes, and other important cellular functions [19], [21][22][23].
In addition to the important role of IDPs in cellular activity, their inherent structural disorder plays an important role in the formation of protein assemblies [24]. The structural disorder and flexibility of IDPs are also linked to the formation of amyloid aggregates, which are implicated in several human disorders such as Parkinson's disease, Alzheimer's disease, type II diabetes and others [25][26][27][28][29][30]. The major protein component of the fibrillar deposits found in Parkinson's disease is a disordered protein, α-synuclein [25][26][27][28][29][30]. Alzheimer's disease is directly linked with the production of ordered fibrillar structures of the peptide Aβ42. Thus several neurological disorders are linked to the formation of amyloid fibrils and their deposition in various cellular organs.
However, it is not clear how normally soluble disordered proteins and peptides are converted into amyloid fibres that possess a compact β-sheet structure. Many in vitro experiments have also shown that some structured proteins convert to amyloid fibrils under solution conditions in which the proteins attain a partially disordered structure [31], [32]. Experimental studies and many computational analyses showed that short sequence stretches in proteins may be responsible and act as nucleating centres for amyloid fibril formation [33][34][35][36]. These regions are often known as amyloidogenic regions (ARs). When amyloidogenic sequences of six to eight residues are inserted in the C-terminal hinge loop of RNase A, the enzyme becomes amyloidogenic and forms amyloid fibres [34][35][36]. The presence of such regions in many water-soluble proteins has been suggested by Dobson [36], [37] and others [38]. According to the 'amyloid stretch hypothesis' [35], a short amyloid stretch (equivalent to an AR) triggers the aggregation process under certain solution conditions. Mutation or reshuffling in these regions leads to a decrease or total absence of such aggregation [33], [39]. Thus an AR often acts as a nucleation center and governs the protein aggregation that eventually leads to the formation of β-sheet-rich amyloid fiber.
IDPs are also rich in sequence stretches with a biased amino acid composition, often known as low complexity regions (LCRs). These regions may also play a critical role in protein stability and the energetics of fibril formation [1], [40][41][42][43][44][45][46][47]. LCRs are usually of two types. The majority of LCRs are composed of mixed polar and charged amino acid (aa) residues, and the presence of such regions enhances protein solubility and mobility in solution. The second type of LCR is a repeat of one or two residues, which is prone to form amyloid fiber. A good example of such a region is a stretch of Gln (polyGln) [48]. Thus the presence of LCRs modulates the solubility and amyloidogenicity of disordered proteins [45], [49], [50].
The composition, content and distribution of ARs and LCRs in a protein sequence may therefore have a role in protein aggregation and amyloidogenicity. However, no major investigation has been carried out on the sequence complexity of ARs and their spacing among the LCRs that are commonly found in IDP sequences. In the present investigation, we computationally detected ARs and LCRs in the proteins deposited in the DisProt and IDEAL databases [4], [50], [51] and analyzed their sequence composition and complexity, distribution pattern and structural aspects. About 8% of the residues were found to be in ARs, and the average length of an AR was 8 residues. Further, we found that the sequences in ARs are highly complex and rarely overlap with LCRs.
Among the many recently developed computational approaches and algorithms, we used the Waltz method developed by Maurer-Stroh et al. [52][53][54][55][56] to predict the ARs. The Waltz algorithm combines a position-specific scoring matrix (PSSM) with physical properties and structural aspects of protein residues to identify ARs [40], [41], [57], [58]. The computational tool SMART was used to predict the sequence complexity parameters. We measured the structural propensity of the residues in ARs with the APSSP2 algorithm, which is freely available on the World Wide Web [59], [60].

Selection of Intrinsically Disordered Proteins
DisProt database release 5.6 (http://www.disprot.org/) provides a set of proteins with different degrees of disorder [4]. It gives the name of the protein, accession codes, aa sequence, location of the disordered region(s), and the methods used for structural (disorder) characterization. DisProt also reports the biological function(s) of each disordered region. The sequence of each protein was retrieved in FASTA format. Length, aa composition, residue characteristics such as the total number of positive and negative residues, and the theoretical isoelectric point (pI) were computed using the ProtParam tool of the ExPASy Proteomics server (http://us.expasy.org/tools/protparam.html). The total charge of the proteins was calculated with the 'protein calculator' server (http://www.scripps.edu/~cdputnam/protcalc.html).
Additional disordered proteins were selected from the IDEAL data set, which contains experimentally verified IDPs [51]. The structural disorder of the proteins varied from 0 to 100%. The proteins with (21)% disorder were excluded. Structural disorder was further calculated using the IUPred algorithm, which is available at http://iupred.enzim.hu [61]. Protein disorder (%) was estimated by counting the number of residues in the disordered regions predicted by IUPred, dividing it by the length of the protein sequence and multiplying by 100.
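The disorder percentage described above reduces to a simple ratio. The following Python sketch illustrates it, assuming that the predictor returns one score per residue and that a residue counts as disordered when its score exceeds 0.5. The function name, the threshold and the example scores are illustrative assumptions, not values taken from this study.

```python
from typing import Sequence


def percent_disorder(scores: Sequence[float], threshold: float = 0.5) -> float:
    """Percentage of residues whose disorder score exceeds the threshold.

    The 0.5 cut-off is an assumption made for this sketch.
    """
    if not scores:
        raise ValueError("at least one per-residue score is required")
    # Count residues predicted to be disordered.
    disordered = sum(1 for s in scores if s > threshold)
    # Divide by the sequence length and multiply by 100.
    return 100.0 * disordered / len(scores)


# Hypothetical per-residue scores for a 10-residue peptide.
example_scores = [0.9, 0.8, 0.7, 0.2, 0.1, 0.6, 0.4, 0.3, 0.95, 0.05]
print(percent_disorder(example_scores))
```

Five of the ten example scores exceed the threshold, so the peptide would be reported as 50% disordered.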

Calculating LCR and AR
Protein sequences obtained from DisProt and IDEAL were used to calculate both the LCR and AR content. The LCR content of an individual protein was predicted by the SEG method as implemented in SMART (simple modular architecture research tool) [40], [62], a web-based server available at http://www.bork.embl-heidelberg.de/Modules/sinput.shtml. Default SEG parameters were used for finding the LCRs. The SEG method detects LCRs based on the measurement of the information content present in the complexity state vector [40]. The LCR content of a protein was calculated as the ratio of the total number of aa residues in all its LCRs to the protein sequence length. The amyloidogenic regions of the proteins were identified with the web-based computational tool Waltz [56] (http://waltz.switchlab.org). The AR content (%) of a protein was calculated as the ratio of the number of residues in all its ARs to the sequence length of the protein, expressed as a percentage.
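The AR and LCR content of a protein, as defined above, can be computed directly from the predicted region boundaries. The sketch below is a minimal illustration and not the Waltz or SEG implementation. It assumes that regions are given as 1-based inclusive (start, end) pairs and that overlapping or adjacent regions of the same type are merged before counting; the function names and the example protein are hypothetical.

```python
from typing import Iterable, List, Tuple

Region = Tuple[int, int]  # 1-based, inclusive (start, end)


def merge_regions(regions: Iterable[Region]) -> List[Region]:
    """Merge overlapping or adjacent regions so no residue is counted twice."""
    merged: List[Region] = []
    for start, end in sorted(regions):
        if merged and start <= merged[-1][1] + 1:
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


def region_content(regions: Iterable[Region], seq_len: int) -> float:
    """Percentage of the sequence covered by the given regions."""
    if seq_len <= 0:
        raise ValueError("sequence length must be positive")
    covered = sum(end - start + 1 for start, end in merge_regions(regions))
    return 100.0 * covered / seq_len


# Hypothetical 200-residue protein with two 8-residue ARs.
print(region_content([(10, 17), (101, 108)], 200))
```

In this hypothetical case 16 of 200 residues lie in ARs, giving an AR content of 8%.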

Prediction of Secondary Structure
APSSP2 was used to predict the secondary structure of each protein from its aa sequence [59]. The algorithm takes an amino acid sequence as the query input and predicts the corresponding secondary structure with a confidence level. The percentages of residues that prefer the α-helix, β-strand and coil conformations were calculated as the ratio of the total number of residues in a particular conformation to the sequence length of the protein. The structural preferences of the residues in ARs and LCRs were obtained by selecting the respective sequence regions in the predicted structure of the protein. The percentage of AR/LCR residues with a preference for a particular conformation was measured against the total number of AR/LCR residues in the protein.
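The conformational percentages described above can be sketched as follows. The example assumes that the prediction is available as a string with one letter per residue (H for helix, E for strand, C for coil); this encoding, the function names and the example prediction are assumptions made for illustration and are not taken from the APSSP2 output format.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def conformation_percentages(ss: str) -> Dict[str, float]:
    """Percentage of residues predicted as helix (H), strand (E) and coil (C)."""
    if not ss:
        raise ValueError("empty secondary structure string")
    counts = Counter(ss)
    return {state: 100.0 * counts.get(state, 0) / len(ss) for state in "HEC"}


def region_conformation_percentages(
    ss: str, regions: Iterable[Tuple[int, int]]
) -> Dict[str, float]:
    """Same percentages, restricted to 1-based inclusive (start, end) regions."""
    selected = "".join(ss[start - 1:end] for start, end in regions)
    return conformation_percentages(selected)


# Hypothetical prediction for a 20-residue protein with one AR at residues 3-10.
predicted = "CCHHHHHHEEEECCCCCCCC"
print(conformation_percentages(predicted))
print(region_conformation_percentages(predicted, [(3, 10)]))
```

The first call gives the percentages for the whole protein and the second gives them for the AR residues only, which is the comparison made between total protein and AR sequences.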

Statistical Analysis
All statistical analyses were performed in Wolfram Mathematica 8. The mean, standard error of the mean (SEM) and standard deviation (SD) were calculated for AR/LCR length and content. A stable distribution function (Text S1) with index of stability α, skewness parameter β, location parameter μ and scale parameter σ was fitted to the data to show the distribution pattern of the AR/LCR length and the AR/LCR content in a protein. A bivariate probability distribution, the smoothed kernel density distribution, was used to show the distribution of AR/LCR content with protein length. To find the correlation between the AR/LCR content and the protein sequence length, negative hyperbolic equations were fitted to the data.
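The descriptive statistics named above can be reproduced with a few lines of code. The sketch below uses the Python standard library rather than Mathematica and assumes the sample standard deviation (n - 1 denominator); the source does not state which estimator was used. The function name and the example lengths are hypothetical.

```python
import math
import statistics
from typing import Sequence, Tuple


def summarize(values: Sequence[float]) -> Tuple[float, float, float]:
    """Return (mean, SD, SEM) for a sample of AR/LCR lengths or contents."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two values are required")
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample SD, n - 1 denominator (assumed)
    sem = sd / math.sqrt(n)        # standard error of the mean
    return mean, sd, sem


# Hypothetical AR lengths (in residues) from a small set of proteins.
ar_lengths = [6, 7, 8, 8, 9, 10]
mean, sd, sem = summarize(ar_lengths)
print(f"mean={mean:.2f} SD={sd:.2f} SEM={sem:.2f}")
```

Fitting the stable distribution itself is not shown here, since it depends on the parameterization given in Text S1.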

Content of AR and LCR in Different Classes of IDPs
The DisProt database analysis revealed 221 human proteins and 432 nonhuman proteins with different degrees of disorder. Table 1 and Tables S1 and S2 list some of these proteins with their physicochemical properties. An additional 186 unstructured human proteins and 25 nonhuman proteins were obtained from the IDEAL database (Tables S3 and S4). Tables S1, S2, S3, and S4 show the protein name, database ID and the % of protein disorder measured by IUPred. The tables also show the content (%) of AR and LCR in a particular group of proteins. The last two columns in the tables display the number of ARs found within 15 residues of the C- and N-termini of the protein sequence; these are marked as the 'C' and 'N' columns, respectively. Although the DisProt database provides the content of structural disorder, the disorder of all the proteins in the IDEAL and DisProt databases was calculated using the IUPred server. The proteins from both databases were arranged in descending order of disorder. The content (%) of AR sequences decreased with increasing structural disorder. However, fewer LCR sequences were present in proteins with a high content of structural elements.
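The 'N' and 'C' columns described above count ARs near the sequence termini. A minimal sketch of such a count is given below. It assumes that an AR is counted when at least one of its residues lies within the first or last 15 residues; the exact criterion is not stated in the text, and the function name is hypothetical. The C-terminal AR of DP00069 (residues 101-116 in a 116-residue sequence) is taken from this study, whereas the other two regions are invented for illustration.

```python
from typing import Iterable, Tuple


def count_terminal_regions(
    regions: Iterable[Tuple[int, int]], seq_len: int, window: int = 15
) -> Tuple[int, int]:
    """Return (N-terminal count, C-terminal count) for 1-based inclusive regions.

    A region is counted if any of its residues falls inside the terminal
    window (an assumption made for this sketch).
    """
    regions = list(regions)
    n_term = sum(1 for start, _ in regions if start <= window)
    c_term = sum(1 for _, end in regions if end > seq_len - window)
    return n_term, c_term


# DP00069: 116 residues, C-terminal AR at 101-116 (from the text).
print(count_terminal_regions([(101, 116)], 116))
# Same protein with two hypothetical extra regions added for illustration.
print(count_terminal_regions([(3, 9), (40, 47), (101, 116)], 116))
```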
Based on the calculated disorder, the proteins of each type (human/nonhuman) were grouped into three categories, as suggested in a previous report [63]. Proteins with 71-100% structural disorder were grouped as largely disordered proteins (LDPs). Moderately disordered proteins (MDPs) had 31-70% of their sequence in disordered region(s), and the remaining proteins, with less than 30% of their sequence in disordered segments, were grouped as partially disordered proteins (PDPs). Sequence details of the ARs and LCRs in these groups of proteins are shown in Table 2. Figure 1 displays a graphical view of the analysis. There were fewer LDPs than MDPs and PDPs. The percentage of amyloidogenic proteins (proteins that contained at least one AR) was also lower in the LDP group. To gain confidence in this analysis, a t-test was performed on the sequence content (%) of the individual proteins of each group (LDP, MDP and PDP). The confidence level was obtained from the respective p-values given in Table S5. Table 2 and Tables S1, S2, S3, and S4 show that some of the proteins in each group contained no AR. For instance, among the 221 human proteins in the DisProt database, 191 (~86%) were amyloidogenic, each containing at least one AR, and 30 contained no ARs. The proportion of amyloidogenic proteins was highest (93%) for the PDPs and decreased to 70% for the LDPs. A similar trend was observed with nonhuman proteins, as presented in Table 2 and Table S2. Analysis of the protein sequences from the IDEAL database also revealed a similar trend in the content of amyloidogenic proteins in the different groups (Table 2 and Table S3). The percentage of the sequence in low complexity regions (LCRs) for each individual protein in the DisProt and IDEAL databases is also given in Tables S1, S2, S3, and S4. A group-wise distribution of the LCRs is presented in Figure 1 and Table 2.
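The grouping and the amyloidogenic fraction described above can be expressed compactly. The sketch below assumes that values falling between the stated ranges (for example 30.5% or 70.5%) are assigned to the higher group, since the text does not specify how such cases were handled; the function names are hypothetical. The 191 of 221 example uses the counts reported for the DisProt human proteins.

```python
from typing import Sequence


def classify_by_disorder(percent: float) -> str:
    """Assign LDP (71-100%), MDP (31-70%) or PDP (30% or less) from % disorder."""
    if not 0 <= percent <= 100:
        raise ValueError("percent disorder must lie between 0 and 100")
    if percent > 70:  # boundary handling is an assumption
        return "LDP"
    if percent > 30:  # boundary handling is an assumption
        return "MDP"
    return "PDP"


def percent_amyloidogenic(ar_counts: Sequence[int]) -> float:
    """Percentage of proteins that contain at least one AR."""
    if not ar_counts:
        raise ValueError("no proteins given")
    return 100.0 * sum(1 for n in ar_counts if n > 0) / len(ar_counts)


# DisProt human set: 191 proteins with at least one AR and 30 with none.
# Only presence or absence matters here, so each count is set to 1 or 0.
disprot_human = [1] * 191 + [0] * 30
print(classify_by_disorder(85.0), classify_by_disorder(50.0), classify_by_disorder(12.0))
print(round(percent_amyloidogenic(disprot_human)))
```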
The content of LCR sequence (%) was highest in the LDPs, and a little more than 20% of the sequence was found in LCRs in the human proteins from DisProt. The LCR content decreased with decreasing structural disorder. Nonhuman DisProt proteins contained a slightly higher percentage (16%) of LCR sequences than the proteins in the human category. The LCR sequence content of the proteins in the IDEAL database was lower than that of the DisProt proteins. The content of LCR was lowest in the PDPs. P-values from the t-tests of some of the above comparisons are given in Table S5.
The sequence length of the ARs/LCRs and their content varied from protein to protein. Table 3 and Table S6 provide the sequence details of the ARs, the LCRs and the regions of overlap between the two (AR/LCR). The table provides information regarding AR/LCR length and the sequence position of the regions (see also Tables S1, S2, S3, and S4). For example, the shortest protein, the 37-residue antibacterial LL-37 (DP0004_C002), contained no AR, and tau, with 441 amino acids, contained only 1.3% AR residues. DP00069, with a sequence length of 116, was very rich in AR sequences (14%). In contrast to the ARs, most of the LCRs were 8-40 residues long. The shortest LCR was 8 residues long; one such region was detected in DP00040. The longest LCR, 84 residues long, was detected in DP00017. The LCRs in tau (DP00126), for instance, occupied 17% of its total sequence. More than 35% of the residues in β-casein (DP00199) and regulatory subunit 1 (DP00219) were in LCRs.
Figure 1 (see also Table S5). doi:10.1371/journal.pone.0089781.g001
Table 3. LCRs, ARs (*) and overlap regions (†) in some of the human disordered proteins from DisProt data.

Statistical Analysis
Statistical analysis was carried out to determine the average AR/LCR content (%) and the average length of the two regions (AR/LCR) in human proteins. To obtain the statistical parameters, the AR/LCR content of all the human proteins from the DisProt and IDEAL databases (Tables S1 and S3) was combined. The total number of proteins examined was 407, and the combined numbers of ARs and LCRs were 1765 and 1348, respectively (Table 2).
A stable distribution function (see Materials and Methods and Text S1) was fitted to the experimental data (detected ARs and LCRs). Figure 2 shows the frequency histogram and the fitted distribution function for both the LCR and AR. Table 4 reports the statistical parameter values estimated from the fit to the ARs/LCRs. The statistical population (% of AR/LCR sequences) was characterized by a positive skewness coefficient, much larger than zero. The mean value was ~8% of the sequence for the AR. A similar distribution fit was made to the lengths of the ARs/LCRs, as shown in Figure 3; the mean value was about 8 residues for the AR and 34 residues for the LCR. Figure 3 shows the smoothed kernel density estimation for the LCR/AR content in a protein (left and right panel, respectively). The plots are shown in two different clipping planes. The bottom figure shows the smoothed 3D histogram. The smoothed kernel density estimation plot shows a distinct peak suggesting ~8% AR content in a ~400 aa long protein, and indicated that the proteins in the two databases clustered around a length of ~400 aa and largely determined the estimated average content of the AR and LCR. No correlation could be observed between the AR/LCR content and the protein length (Figure 4). At a deeper clipping plane the data suggested a negative hyperbolic relationship, i.e. a decrease in the AR/LCR content with increasing protein length. However, no significant fit could be obtained to validate this assumption.

Sequence Aspects of AR and LCR
One interesting observation was that a large number of proteins contained both ARs and LCRs, yet the two regions rarely overlapped (Figure 1, Tables S1, S2, S3, and S4, Table 3 and Table 5). For instance, the DisProt human proteins contained 894 ARs and 638 LCRs, but only 53 occurrences of sequence overlap between the two regions were observed, and in most cases the overlap was partial (Table 5). An LCR with residues 97-112 in DP00069 overlapped with the C-terminal AR of residues 101-116, and the overlapping region contained 12 residues. In DP00332, an LCR with residues 302-314 overlapped with an AR (310-317); only four residues were found in the overlapping region. Similarly, four ARs from DP00119, DP00551, DP00643_A002 and DP00683 partially overlapped with LCRs. A similar result was obtained in the other groups of proteins. Among the 1889 ARs in the DisProt nonhuman proteins, only 74 overlapped with LCRs. On average, ~3% of the AR sequences overlapped with the LCR sequences. These observations clearly indicated that the residues in the ARs were very complex and rarely overlapped with the LCRs. We also calculated the average content of the different types of amino acid residues in both the ARs and the LCRs. Figure 5 displays the average content of the different types of residues present in the ARs, the LCRs and the total proteins. A major fraction of the AR residues was hydrophobic, and Leu was the most abundant residue (12.6%). Other major residues in the region were Ile (11.2%), Phe (8.8%), Tyr (8.6%), Val (8.1%) and Ala (7.3%). The ARs were depleted in Pro, Lys, His and others. The majority of the residues in the LCRs were hydrophilic, and the regions were enriched with Ser (13.1%), Pro (12.1%), Gly (9.8%) and Ala (9.2%).
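The overlap between an AR and an LCR can be determined from their boundaries alone. The following sketch assumes 1-based inclusive residue numbering and uses hypothetical function names; it reproduces the 12-residue overlap reported above for DP00069 (LCR 97-112, AR 101-116). The second AR in the usage example is invented for illustration.

```python
from typing import List, Sequence, Tuple

Region = Tuple[int, int]  # 1-based, inclusive (start, end)


def overlap_length(a: Region, b: Region) -> int:
    """Number of residues shared by two regions (0 if they do not overlap)."""
    start = max(a[0], b[0])
    end = min(a[1], b[1])
    return max(0, end - start + 1)


def overlapping_pairs(
    ars: Sequence[Region], lcrs: Sequence[Region]
) -> List[Tuple[Region, Region, int]]:
    """List every (AR, LCR, shared residues) combination that overlaps."""
    pairs = []
    for ar in ars:
        for lcr in lcrs:
            shared = overlap_length(ar, lcr)
            if shared > 0:
                pairs.append((ar, lcr, shared))
    return pairs


# DP00069 values from the text, plus one hypothetical non-overlapping AR.
print(overlapping_pairs([(20, 27), (101, 116)], [(97, 112)]))
```

Dividing the summed overlap by the total number of AR residues would give the fraction of AR sequence that lies within LCRs.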
The structural propensities of the residues in the ARs were measured using the APSSP2 algorithm (see Materials and Methods). The analysis showed that the conformational preference of the AR residues was not confined to any particular structure; rather, on average, a mixed structural preference of the AR residues was observed in all three groups of proteins. Figure 6 displays the overall structural heterogeneity of the AR sequences present in human (DisProt) proteins. The average fraction of AR residues that preferred the α-helical conformation was ~38%. Preferences for the β-sheet/strand and coil conformations were ~31% and ~32%, respectively. This result indicated that not all of the sequences in the ARs favoured the β-conformation. In the total protein sequence of the same group of proteins, by comparison, about 56% of the residues preferred the coil conformation and ~30% showed a structural propensity towards the α-helical conformation. The remaining 14% favoured β-sheet/strand conformations. The number of residues that preferred the β-sheet component thus increased substantially in the ARs; however, a large fraction of the AR residues (38%) favoured the α-helical conformation.

Discussion
Previous investigations have established that ARs play a key role in protein aggregation and amyloid fibril formation. In this report we detected ARs using the Waltz algorithm and computationally analyzed the sequence complexity, conformational preference and distribution of ARs in the disordered human proteins present in the DisProt and IDEAL databases. There are several methods to detect ARs [56], [64][65][66]. Some important algorithms and software for predicting the aggregation behaviour of proteins are Tango [55], Waltz [56], PASTA [67][68][69][70], Aggrescan [71], SALSA [72], Zyggregator [73], AmylPred [64] and FoldAmyloid [74]. The ability of protein sequences to form β-strands/sheets is a predominant feature in most of these algorithms. PASTA was developed based on the hidden β-propensity of protein sequences [67][68][69][70]. Aggrescan is based on an aggregation propensity scale for the 20 natural amino acids [71]; this method stresses that short and specific sequence stretches are responsible for protein aggregation. Based on the average packing density of the aa residues, FoldAmyloid identifies a sequence pattern that could promote amyloid fibril formation [34]. The Waltz methodology was used in this investigation because many of its selected regions were experimentally verified and the method was better able to differentiate between amyloid fiber formation and amorphous aggregates [56].
The investigation revealed that more than 80% of the disordered human proteins (DisProt and IDEAL databases) possessed at least one AR, indicating that a significant number of disordered proteins were amyloidogenic. Waltz detected ARs in a large number of proteins in the DisProt and IDEAL databases. The large data set helped to derive, along with the discrete analysis (Table 6), the statistical averages of the AR and LCR sequence percentage and of the AR and LCR sequence length. The discrete analysis results for all groups of proteins are given in Table 2 and Table 6. The average values did not differ much from the statistical analysis results (Table 4). However, the statistical values may be more acceptable to represent the average properties and composition of the LCRs and ARs.
The percentage of amyloidogenic proteins was higher in the PDP groups. Thus the content of AR sequences was higher in proteins with less structural disorder or in structured proteins. A similar observation was also made by Linding et al. [75]. These proteins contained fewer LCRs, which were composed of fewer hydrophobic amino acids. LCRs thus may have a significant role in the protein aggregation process and amyloid formation. The AR may be exposed to start the aggregation process, and the LCRs could have a certain role in the process. However, in largely disordered proteins, a large number of LCRs along with a high content of polar amino acids and attenuated hydrophobicity may not allow the protein to fold or misfold further into a β-sheet-rich amyloid aggregate [3]. Therefore, the content of AR and LCR and the balance between the two regions are crucial for protein stability (for disordered proteins) and amyloid formation. Depending on the content of AR/LCR, a suitable solution condition may be needed to unfold the region of a structured protein partially or fully and trigger amyloid fiber formation [76]. Nature may have designed the disordered proteins with a unique balance of AR and LCR sequences to provide stability and the ability to perform multiple functions. However, an external disturbance or a change in internal cellular conditions may break this balance and could enhance protein aggregation and amyloid formation.
Most of the ARs detected in amyloidogenic proteins were six to eight residues long. We detected a six-residue AR (residues 35-40) in α-synuclein. It was significantly shorter than the aggregation-prone segment obtained by Der-Sarkissian et al. Zhang et al. showed four additional segments that might be involved in α-synuclein aggregation [72]. However, the methods used did not adequately define the characteristics of the nucleation site of amyloid formation. Waltz allowed the identification of amyloid sequences and a better distinction between them and the protein segments that promote β-sheet-rich amorphous aggregates, which could be a possible reason for the smaller number of ARs found in this investigation.
The statistical analysis and the discrete analysis (Tables S1, S2, S3, and S4, Table 6) established that the content of AR sequences was not always proportional to the protein sequence length. The data suggested a negative hyperbolic relationship between the protein sequence length and the percentage of AR/LCR sequence (Figure 4), although no significant fit could be obtained. The reason for this trend is not known. Chiti et al. observed a lower aggregation propensity for longer proteins than for short proteins [77]. The longer proteins thus may have evolved with attenuation (low content) of ARs to reduce unwanted aggregation and fibril formation. It would be interesting, however, to test whether increasing the number of ARs could enhance the aggregation kinetics or the quality of fibril formation in longer proteins.
In this regard, it was also important to know the conformational preferences of the AR residues. We observed that the aa residues in the ARs showed propensity towards α-helix, β-sheet/strand and coil conformations and that not all the residues were very hydrophobic. Waltz, used in this investigation, does not rely fully on the β-sheet structural propensity of the residues but is built on a PSSM and on consideration of other physicochemical properties of the protein sequences. It allows some tolerance towards charged and polar residues with different hidden structural propensities. Proteins with diverse structural domains (β-sheet, α-helix or random coil), including globular proteins, were found to produce aggregates with fibrillar structure under certain solution conditions [23]; however, a crucial structural rearrangement often occurred during the conversion of these proteins into amyloid fiber [78]. Thus slightly polar amino acids or the presence of LCRs may play an important role in structural reorganization.
Aggregation propensity and overall protein aggregation may also depend on the location of the AR in the protein sequence and on how the ARs are surrounded by a local excess of polar/charged amino acids or LCRs. Kar et al. recently showed that addition of a polyproline sequence to the C-terminal side of polyGln slowed aggregation of the peptide [48]. However, insertion of the same residues at the N-terminal side of polyGln had very little effect on the overall aggregation of the peptide. N-terminal residues in the Huntingtin protein situated adjacent to the polyGln sequence dramatically altered the aggregation property of the peptide. However, the position-dependent role of LCRs, rich in polar and charged residues, in aggregation propelled by ARs is not known with certainty. According to the amyloid stretch hypothesis, AR-containing proteins need to be locally or partially unfolded to initiate and promote the process of amyloid fiber formation [35]. Thus the presence of an LCR in a protein with less disorder may significantly alter the amyloid formation kinetics. IDPs play a vital role in molecular recognition, and the interaction has been found to lead to the formation of structured protein complexes. A model of molecular recognition features or elements (MoRFs) has been proposed to define this interaction and the reorganization processes [79][80][81][82]. The MoRF model recognizes, in a disordered protein sequence, a linear region that undergoes a disorder-to-order transition upon binding to its partner. These regions are often referred to as MoRFs. Upon binding to a partner, the regions can form α-helices, β-strands (β-MoRFs), irregular structures (i-MoRFs) or a combination of these structural elements. However, our analysis was largely directed at finding the amyloid-forming regions and the regions of protein sequences that are sequentially less complex. Both the AR and LCR could be part of MoRFs and may be involved in the molecular reorganization process. However, further analysis may be needed to address this issue.
One of the significant observations was that the AR sequences were highly complex. Our analysis of IDPs showed that ~20% of the sequence was in LCRs, a value close to the overall predicted value for the SWISS-PROT database [41]. However, most (greater than 97%, Table 2) of the AR sequences were not within the LCRs. This indicated the complexity of the AR sequences and confirmed that the ARs contain few biased aa residues. Some LCRs composed of one or a few types of aa residues form stretches of a single amino acid, produce homopolymeric structures [41], [49], [40], [83] and become amyloidogenic [84]. However, we could not detect in the IDPs any LCR that was both homopolymeric and amyloidogenic. Many prion proteins, e.g. mammalian PrP and the yeast prions Ure2p and Sup35, contain disordered stretches that also form β-sheet-rich aggregates. These aggregation-prone domains are also found to contain segments with low sequence complexity and are often enriched with Gln/Asn [85][86][87][88]. Thus prion proteins also contain both ARs and LCRs. A test was performed with the prion protein (P04156) and Huntingtin (P42858); however, the Waltz method could detect the palindromic region (residues 112-119) in P04156 and the polyQ region in Huntingtin (P42858) only when 'custom' was used as the threshold in the analysis [56]. In our analysis, 'best overall performance' was used as the threshold, and it missed these two amyloidogenic regions. We also analysed the content of ARs and LCRs in a group of proteins whose amyloidogenicity has been experimentally proven [56]. The list of the proteins and the analysis results are shown in Table 7. It includes proteins such as insulin, the prion protein (P04156) and the yeast protein Sup35 (P05453). The sequence overlap between the ARs and LCRs was again very small (Table 7). This indicated that the ARs are compositionally highly complex.
As such, the sequence complexity and structural heterogeneity of the AR sequences was a vital observation. The few percent of AR residues that overlapped with LCRs also showed mixed structural propensity. The C-terminal LCR in DP00069 that overlapped with the AR contained seven Ile (not in a continuous stretch), and these residues showed a preference for the α-helical conformation. The overlapping sequences of AR and LCR in DP00332, however, showed a propensity towards random coil structure. Being part of an AR, both overlapping regions were expected to induce aggregation under certain solution conditions. However, the LCR component may modulate the aggregation process in a different way, and the content may change depending on the solution condition [89]. Future experiments, starting with these overlapping ARs and LCRs, would enhance our understanding of how a sequence region composed of an AR with low complexity sequences modulates the protein aggregation process that leads to the eventual formation of amyloid fiber.
Figure 6. Comparison of the conformational preferences of residues in the ARs with that of total protein. A 3D plot shows the percentage of residues with conformational preference for α-helix (green), β-strand/sheet (red) and coil (blue) for total proteins and their ARs as represented on the X-axis. The lower panel shows the 2D plot of the above data along with the error limits. doi:10.1371/journal.pone.0089781.g006

Conclusion
The current investigation focused on the sequence complexity and content of ARs present in proteins that are partially or fully disordered. The study found a very high sequence complexity in the ARs, which did not commonly overlap with the LCRs that were abundant in the protein sequences. Future investigations may examine experimentally whether a unique balance between the content of AR and LCR can provide sufficient stability for a monomeric disordered protein to remain in solution. It would be interesting to examine how the spacing of LCRs and ARs and the swapping of AR positions influence the energetics of amyloid fiber formation. This would enhance our understanding of why some proteins favor aggregation in a certain environment and may add more information about the mechanism of amyloid formation, which is linked to several pathological human disorders.

Supporting Information
Text S1 Stable distribution function. Details of the statistical distribution function applied to AR/LCR length/content distribution. (DOCX)
Table S1 DisProt human proteins. Protein name, database IDs and AR/LCR content measured by IUPred are listed. Last two columns in the tables display the number of ARs found within 15 residues from the C- and N-terminal of the protein sequence and these are marked as 'C' and 'N' column, respectively. (XLSX)
Table S2 DisProt nonhuman proteins. Protein name, database IDs and AR/LCR content measured by IUPred are listed. Last two columns in the tables display the number of ARs found within 15 residues from the C- and N-terminal of the protein sequence and these are marked as 'C' and 'N' column, respectively. (XLSX)
Table S3 IDEAL human proteins. Protein name, database IDs and AR/LCR content measured by IUPred are listed. Last two columns in the tables display the number of ARs found within 15 residues from the C- and N-terminal of the protein sequence and these are marked as 'C' and 'N' column, respectively. (XLSX)
Table S4 IDEAL nonhuman proteins. Protein name, database IDs and AR/LCR content measured by IUPred are listed. Last two columns in the tables display the number of ARs found within 15 residues from the C- and N-terminal of the protein sequence and these are marked as 'C' and 'N' column, respectively. (XLSX)