Polyglutamine Induced Misfolding of Huntingtin Exon1 is Modulated by the Flanking Sequences

Polyglutamine (polyQ) expansion in exon1 (XN1) of the huntingtin protein is linked to Huntington's disease. When the number of glutamines exceeds a threshold of approximately 36–40 repeats, XN1 can readily form amyloid aggregates similar to those associated with disease. Many experiments suggest that misfolding of monomeric XN1 plays an important role in the length-dependent aggregation. Elucidating the misfolding of a XN1 monomer can help determine the molecular mechanism of XN1 aggregation and potentially help develop strategies to inhibit XN1 aggregation. The flanking sequences surrounding the polyQ region can play a critical role in determining the structural rearrangement and aggregation mechanism of XN1. Few experiments have studied XN1 in its entirety, with all flanking regions. To obtain structural insights into the misfolding of XN1 toward amyloid aggregation, we perform molecular dynamics simulations on monomeric XN1 with full flanking regions, a variant missing the polyproline regions, which are hypothesized to prevent aggregation, and an isolated polyQ peptide (Qn). For each of these three constructs, we study glutamine repeat lengths of 23, 36, 40 and 47. We find that polyQ peptides have a positive correlation between their probability to form a β-rich misfolded state and their expansion length. We also find that the flanking regions of XN1 affect its probability to^x_page_count=28 form a β-rich state compared to the isolated polyQ. Particularly, the polyproline regions form polyproline type II helices and decrease the probability of the polyQ region to form a β-rich state. Additionally, by lengthening polyQ, the first N-terminal 17 residues are more likely to adopt a β-sheet conformation rather than an α-helix conformation. Therefore, our molecular dynamics study provides a structural insight of XN1 misfolding and elucidates the possible role of the flanking sequences in XN1 aggregation.


Introduction
Similar to eight other neurodegenerative polyglutamine diseases [1][2][3][4][5], Huntington's disease (HD) is associated with a polyglutamine (polyQ) expansion in the huntingtin protein. In HD, the first exon (XN1) of the huntingtin protein, which contains the polyQ region, has been implicated as the pathogenic polypeptide [6][7][8][9]. It is hypothesized that after it is cleaved from huntingtin, the XN1 polypeptides misfold and form lethal cellular aggregates or inclusions in neuronal cells [1][2][3][4][5]. Evidence from in vivo studies [6,7] indicates that XN1 polypeptides can form aggregates similar to those observed in the neurons of afflicted patients. Despite limited knowledge on the structure and function of XN1, some information about its aggregation is known. The length of the polyQ region in XN1 is inversely proportional to the age of onset of symptoms [2,9]. Additionally, there is a threshold of approximately 36-40 repeats, within which symptoms may or may not develop [10]. However, beyond this threshold, lethal symptoms eventually develop in the lifetime of the patient. These symptoms develop earlier in the life of patients with longer repeat lengths. In vitro kinetic studies of polyQ peptides have emulated this clinical correlation between age and length [11,12]. Kinetic studies of polyQ [13] aggregation suggested that the misfolding of a polyQ monomer initiates the aggregation. Although other studies [14,15] proposed more complex aggregation scenario, we hypothesize that the misfolding of a polyQ monomer plays a critical role in the formation of ordered amyloid aggregations. Similarly, the aggregation of XN1 displays more complicated aggregation behavior [16][17][18]19,20], it has been shown that the lengthdependent misfolding of polyQ sequence in the XN1 monomer plays a critical role in the initial oligomerization and the later formation of b-rich amyloid aggregates [20]. Here, we propose to study the structure and dynamics of monomeric XN1, in hopes to help illuminate the aggregation mechanism and help determine the role of these aggregates.
Despite the structural information about each region, there still lacks a complete picture of XN1 due to the non-additive effect of interactions between different structural elements. Here, we perform all-atom discrete molecular dynamics (DMD) [49][50][51][52] simulations to systematically study structural dynamics of XN1 and its variants. DMD has been shown to have a higher sampling efficiency than traditional molecular dynamics and has been used to study protein folding thermodynamics and protein aggregation [53]. All-atom DMD features a transferable force field and has been successfully used to fold several small proteins ab initio [52] and to study the folding and misfolding dynamics of Cu, Zn superoxide dismutase [54]. We construct 12 polypeptides (Fig. 1) categorized into three sets. In the first set, we study XN1 in its entirety (Fig. 1a) in order to capture the interactions between the Nt17, polyQ and polyP flanking regions simultaneously. In the second set, we study XN1 without the polyP flanking regions, titled XN1-P 11 -P 10 , (Fig. 1b) in order to determine the effect of the polyP regions on XN1 in the context of naturally occurring flanking regions. In the third set, we study polyQ homopolymers in the absence of all flanking regions (Q n ) as controls. In each set, four different numbers of glutamine repeats are included: 23 (nonpathogenic), 36 (threshold), 40 and 47 (pathogenic). As originally suggested by Perutz [36] and expanded upon by others [1,3,20,35,36,38,55], the sequence context for a polyQ region can alter its aggregation mechanism; here, we find a more complete view of how the context plays a significant role in the secondary structure. In the context of XN1, the residues in the polyQ region have a lower probability of adopting b-sheet conformations, due to inhibition by the polyP regions. Surprisingly, by increasing the number of glutamine repeats in the polyQ region, the Nt17 region can be induced to fold into a b-strand. Thus, we suggest that the polyQ and flanking regions in XN1 are strongly coupled in XN1 folding and misfolding.

Results
For each of the 12 models (Fig. 1), we perform replica exchange DMD simulations to efficiently sample the folding landscape of the polypeptides [56][57][58][59][60]. In each simulation, we start from a completely stretched conformation. We discard the first 0.5% of the trajectories, which have drastic energy and structural changes, to disregard the initial equilibration; we only use the equilibrated parts of the trajectories for analysis. First, we model XN1 (Fig. 1a) in its entirety to simultaneously study the interactions among all flanking regions, such as the Nt17, polyQ and polyP regions. Second, we model a mutant of XN1 that is missing the polyP regions: XN1-P 11 -P 10 ( Fig. 1b) to study the structural effects of the polyP regions, which have been shown to protect against XN1 Figure 1. Constructs studied. Diagrams of the organization of the sequence regions in the (a) XN1 and (b) XN1-P 11 -P 10 constructs are shown. The number of residues in each region is indicated below the region. For all three constructs (XN1, XN1-P 11 -P 10 and the homopolymer Q n , not shown) we vary the number of repeats modeled: n = 23, 36, 40, 47. In (a) XN1, the first N-terminal 17 residues are collectively referred to as the Nt17 region. Following Nt17 is the polyQ region, which contains a variable n number of glutamine repeats. P 11 and P 10 are the regions of 11 and 10 proline repeats respectively; they are referred to as the polyP regions. A region of 17 residues tethers the polyP regions together. Finally, there are 12 residues in the C-terminus of XN1. The (b) XN1-P 11 -P 10 construct is identical to XN1 with the exception that it does not contain the polyP regions. The XN1 sequence is explicitly written in Fig. 4d. We use the title of ''construct'' to refer to either XN1, XN1-P 11 -P 10 or Q n . A ''model'' is a specific polypeptide, such as XN1Q 23 , which is a polypeptide of XN1 with 23 glutamine repeats. Thus a total of 12 models were studied, divided into 3 constructs with 4 different glutamine lengths. doi:10.1371/journal.pcbi.1000772.g001

Author Summary
Huntington's Disease is a neurodegenerative disorder associated with protein aggregation in neurons. The aggregates formed are thought to lead to neurotoxicity and cell death. Understanding the molecular structure of these aggregates may lead to strategies to inhibit aggregation. Exon 1 (XN1) of the huntingtin protein is critical for aggregate formation. This polypeptide has a naturally occurring polyglutamine sequence (polyQ), which is elongated in patients afflicted with the disease. The polyQ region in XN1 has several flanking sequences with distinct physicochemical properties, including the Nterminal 17 residues, two polyproline regions, and Cterminal sequences, that may affect its overall structure and aggregation. What is the overall structure of XN1, and what structural effects do the neighboring sequences have on each other and polyQ? We address these questions by studying computational models of various polypeptides, including XN1 and three mutant forms associated with Huntington's Disease. Certain neighboring sequences are found to inhibit aggregation, while others may be recruited by polyQ to form aggregates. Our results suggest the role that the flanking sequences may play in XN1 aggregation and may subsequently guide future structural models of XN1 aggregation.
aggregation. By comparing the results from the XN1 models to the XN1-P 11 -P 10 models, we can determine the role of the polyP regions. Lastly, as controls, we model Q n homopolymers in the absence of all flanking regions. XN1Q n are less stable than Q n and XN1Q n -P 11 -P 10 We apply the weighted histogram analysis method (WHAM) to analyze the folding thermodynamics of all simulated peptide systems [61]. For each peptide model, we calculate the heat capacity (C V ) at different temperatures (Fig. 2). We find Q 23 undergoes a non-cooperative folding transition from an extended and unfolded state to a collapsed globule state, characterized by the broad and shallow C V peak (Fig. 2a). Similar peaks, indicating a coiled-globule transition of a polyQ system, have been documented elsewhere [14]. As the length of Q n increases, the heat capacity peak gets taller and narrower, suggesting an increased folding cooperativity. In all three constructs (XN1, XN1-P 11 -P 10 and Q n ) we find the transition temperature corresponding to the C V peak ( Fig. 2 and Table S1) is almost unaffected by the length of the polyQ region. Since the transition temperature is indicative of the polypeptide stability, our simulation suggests that the length of the polyQ region does not affect the stability of the polypeptides. Interestingly, we find that the average transition temperature for the XN1 models (311 K) is smaller than that of the Q n (322 K) and the XN1-P 11 -P 10 (340 K) models. The differences in transition temperatures indicate the XN1-P 11 -P 10 models are the most stable, followed by the Q n models and lastly, the XN1 models are the least stable.

Q n transitions from random coil to b-sheet
As suggested by the Wetzel group and others [11,13,19], misfolding of polyQ monomers might initialize the aggregation and play a crucial role in the formation of amyloid fibrils. The peptide of polyQ is naturally unstructured, and thus, the lengthdependent aggregation behavior can only be explained by the rare and spontaneous misfolding of the peptide. Therefore, we focus on the compact low-energy state from simulations (see Methods). The compact low-energy state usually constitutes only 13-27% of the total populations (Table S2). Among this subset of compact structures, we find a representative structure using a clustering algorithm (see Methods). These representative structures are not definitive misfolded states, but rather represent common, accessible compact states. We find that Q n is able to form a b-sheet for all lengths modeled (Fig. 3d). As the length of Q n increases, the bsheet expands and contains more b-strands. Even the shortest polypeptide, Q 23 , is capable of forming a small b-sheet (a bhairpin); although, this structure rarely forms during the simulation (Fig. 3d). Additionally, we calculate the probability of observing a given secondary structure for each residue (Fig. 3a,b,c) in the compact state (see Methods). For example, we find that in Q 47 there are many segments of the polypeptide featuring high bconformation (Fig. 3c). Although the probability is computed for each residue, a high probability of b-conformation for consecutive residues indicates the formation of a b-strand (Fig. 3d). Furthermore, an overall perspective is gained when this probability is averaged over multiple residues (Figs. 3a and 3b). We find the compact states of Q 23 to be primarily unstructured; a typical residue in Q 23 is a random coil 67% of the time. In contrast, the residues in the long Q n models of Q [36][37][38][39][40][41][42][43][44][45][46][47] , have b-sheet dihedral angles 40% of the time. That is, residues in the long Q n models adopt b-sheet conformations almost 2.5 times more often than Q 23 (Fig. 3a). The relatively high b-sheet probabilities do not contradict the experimental observations of little to no monomeric b-sheet structure [19,15,25], since the secondary structure probability calculation is computed only for the small subset of compact and low-energy states. Additionally, the structural ensembles are constructed from the replica exchange simulations, which cover a wide-range of temperatures (see Methods). As a result, the distribution of conformations does not correspond to a single experimental condition, and thus, cannot be compared with experimental measurements. However, these ensembles can be used to evaluate the structural propensities of various polyQ constructs. From our calculations of the compact polypeptide, we find that Q n residues tend to transition from random coil conformations at short lengths, to b-sheet conformations at long lengths.

PolyP hinders b-sheet formation of polyQ in XN1
In the XN1 model, the polyQ region is surrounded by multiple flanking regions (Fig. 1), including the Nt17 and polyP regions. We find that for n = 36, 40 and 47, the residues in the polyQ region adopt b-sheet conformations approximately 10% less frequently than residues in isolated Q n (Fig. 3a). Thus, the flanking regions can lower the probability of the residues in the polyQ region to adopt b-sheet conformations. Based on secondary structure probabilities (data not shown) and the corresponding representa-  36 and XN1Q 47 models, the transition temperatures are nearly identical (308K, 308K and 304K respectively). However, the XN1Q 40 model has a larger transition temperature around 325K. (c) The XN1Q 23 -P 11 -P 10 , XN1Q 36 -P 11 -P 10 , XN1Q 40 -P 11 -P 10 and XN1Q 47 -P 11 -P 10 models have varied transition peaks. The transition temperatures are 365K, 335K, 317K and 343K respectively. Detailed data on the peak positions are found in Table S1. doi:10.1371/journal.pcbi.1000772.g002 tive structures (Figs. 4a, 4b), we find that the two polyP stretches in XN1 consistently form PPII helices (Fig. 4c). It is unlikely that the polyP regions fold into PPII helices due to interactions from neighboring regions. This is because the PPII helices are consistently found in every one of the compact structures; whereas, the other regions have more variable secondary structures. Thus, the polyP regions are likely forming PPII helices independently. We hypothesize that the P 11 and P 10 regions dominate the fold of XN1, by forming these PPII helices, and subsequently, the remaining regions, including the polyQ region, are affected by these two PPII helices.
In order to study the effect of the PPII helices on the polyQ region, we perform simulations of XN1-P 11 -P 10 polypeptides that are sequentially identical to XN1 but lack the polyP regions (Fig. 1b). We find that residues in the polyQ region of XN1-P 11 -P 10 models adopt b-sheet conformations 10% more often than In the XN1 and XN1-P 11 -P 10 models, polyQ residues have an almost constant b-strand probability for varying number of glutamine repeats. For all lengths of polyQ in XN1, the polyQ residues have a 31%64% b-strand probability. For all lengths of polyQ in XN1-P 11 -P 10 , the polyQ residues have a 42%61% b-strand probability. In the context of Q n , however, the glutamine residues for long Q n lengths have an increase in b-strand probability and a decrease in the random coil probability (data not shown). On average, for n = 36, 40 and 47, residues in the Q n polypeptides are 9% more likely to adopt b-strand conformations than the polyQ residues in the XN1 polypeptides. For the same polyQ lengths, polyQ residues in XN1-P 11 -P 10 and all the residues in Q n have an average difference of less than 1% in bstrand probability. (b) We show the a-helix and b-strand probabilities of the Nt17 residues as the length of the neighboring polyQ region increases. For XN1Q 23 , we find the probabilities of forming an a-helix or b-strand are similar; the difference is less than 1%. However, when the polyQ length increases (XN1Q [36][37][38][39][40][41][42][43][44][45][46][47], the difference becomes more than 20% in favor of a b-strand. Contrarily, we find that the Nt17 residues in XN1-P 11 -P 10 models consistently prefer b-strand dihedral angles over a-helix dihedral angles; the difference is over 20% for each length of polyQ. (c) As an example, we present the probability of each residue in Q 47 to have a b-sheet conformation. There are continuous stretches of high probabilities: residues 2-8, 11-17, 21-29 and 34-45. These stretches are likely the locations of continuous b-strands. The residues with surprisingly low b-sheet conformation are the location of turns between b-strands. The periodic shape of the graph indicates a b-sheet similar to the one in panel d). The average of all these probabilities is 42%; it is one data point in panel a). (d) These are representative structures of the Q n polypeptides determined through clustering. Q 23 shows a b-hairpin, and although this structure is from the largest cluster, it represents only 6% of all the structures used for clustering (see methods). Therefore, it is rare, but possible, for Q 23 to adopt a b-hairpin conformation. The longer homopolymers are more likely to adopt a conformation similar to the ones depicted here; Q 36 , Q 40 and Q 47 represent 50%, 23% and 40% of their respective clustered structures (Table S3). Most strands are between 6 and 9 residues long. Intra-backbone hydrogen bonds are shown only for those residues forming b-strands. Secondary structures are automatically calculated by PyMOL and are not used to calculate probabilities. doi:10.1371/journal.pcbi.1000772.g003 similar polyQ residues in the XN1 models (Fig. 3a). In fact, for n = 36, 40 and 47, the polyQ residues in the XN1-P 11 -P 10 and Q n models have nearly equal b-sheet probabilities (Fig. 3a). Thus, the decrease in the probability of the polyQ region to form b-sheets in XN1 may be attributed to the polyP regions that form PPII helices. This relationship is consistent with experimental observations which show the polyP regions inhibit polyQ aggregation and toxicity [1,3,5,20,39,40,[42][43][44].

Nt17 can be induced into a b-strand
It has been experimentally shown that the Nt17 region can be a tight random coil [20,46] or an a-helix [20,45,46,48]. However, once fused to the polyQ region, the Nt17 region promotes the aggregation of polyQ [20,47]. To investigate the impact of the Nt17 region in XN1 misfolding, we study the Nt17 region in the context of XN1. From secondary structure probabilities of the XN1 and XN1-P 11 -P 10 models, we find the Nt17 residues adopt random coil conformations over 50% of the time, regardless of the polyQ length. This observation is consistent with the previous [20] study in that the Nt17 region is most likely to be a random coil in the context of XN1. We also characterize the a-helix and b-sheet conformation probabilities for the Nt17 residues. For the XN1 models, we find that the likelihood of Nt17 residues forming an ahelix or b-strand conformation is correlated with the length of the neighboring polyQ region (Fig. 3b). In the shortest model, XN1Q 23 , the residues in the Nt17 region are equally likely to adopt an a-helix or b-sheet conformation. However for longer models, XN1Q 36-47 , the Nt17 residues adopt a b-sheet conformation at least six times more often than an a-helix conformation. In fact, for the longest model, XN1Q 47 , the Nt17 residues sample b-sheet dihedral angles 30 times more frequently than a-helix dihedral angles (Fig. 3b). The representative structures of the XN1Q 23 and XN1Q 47 models depict an example of this transition from an a-helix to a b-strand (Figs. 4a and 4b). Thus, in the context of XN1, we find that as the polyQ length increases, the Nt17 residues significantly prefer a b-sheet conformation over an a-helix conformation. That is, by elongating the polyQ region, the Nt17 region can be induced into a b-strand.
In the absence of the polyP regions, the Nt17 residues have different secondary structure probabilities. First, for all lengths of polyQ, the Nt17 residues in the XN1-P 11 -P 10 models show a preference for adopting b-sheet rather than a-helix conformations. Even for the shortest polyQ length, the Nt17 residues prefer bsheet dihedral angles (Fig. 3b). Thus, at least in the case of 23 glutamine repeats, the Nt17 region must be either directly or indirectly affected by the polyP regions; such that, by removing the polyP regions, the Nt17 region is more likely to form a b-strand. Second, for n = 36, 40 and 47, the Nt17 residues are nearly equally likely to adopt b-sheet conformations in either the XN1 or XN1-P 11 -P 10 constructs (Fig. 3b). Therefore, for long polyQ lengths, the polyP regions in XN1 have little effect on the b-sheet probability of the Nt17 region.  Table S3). From these example conformations, we see some structural correlations that complement the secondary structure probability calculations (Fig. 3). First, no drastic change in the b-strand structure of the polyQ region is seen between the XN1Q 23 and XN1Q 47 structures. Second, the a-helix in the Nt17 region of (a) XN1Q 23 has transformed into a b-strand in (b) XN1Q 47 . Third, the two polyP regions form (c) PPII helices. The probability data (not shown) indicates that these PPII helices exist in 100% of the partially folded structures. The intra-main chain hydrogen bonds are shown for those residues forming b-strands. As in Fig. 3d, the secondary structures are assigned by PyMOL. (d) The sequence is divided into the same regions as outlined in Fig. 1a. doi:10.1371/journal.pcbi.1000772.g004

Discussion
The length dependence of Q n aggregation One intriguing phenomena of glutamine expansion diseases is the length dependence of disease onset [2,9]. It has been suggested that both short and long Q n polypeptides can access similar misfolded structures that lead to aggregation, but the frequency at which this misfolded structure is visited depends on the length of Q n [62]. That is, long Q n aggregate fast, because they misfold frequently; contrarily, short Q n rarely misfold and thus aggregate slow. Mounting experimental evidences support this model by showing that, regardless of length, Q n polypeptides form aggregates of similar structure [25,27,28,39,62], which suggests the misfolded structures for all Q n are also similar. Additionally, kinetic studies [16,[21][22][23][24][25][26][27] verify the correlation between the length of the Q n polypeptide and the rate at which it aggregates in solution. The remaining question is to determine the common misfolded structure that leads to aggregation. To this end, some investigators [8,11,[16][17][18] have suggested that early forms of Q n aggregates have high amounts of b-sheets. In the previous investigation by Wetzel's group, [11] glutamine homopolymers were capped by flanking lysine residues (K 2 Q n K 2 ) to increase the peptide solubility [63]. It has been argued that the electrostatic repulsion between flanking lysines might prevent the formation of compact structures and alter the aggregation kinetics of the peptide system [19]. As the length increases, the screening effect reduces. Hence, despite the screening effect, the observation of intrinsic b-sheets formation of polyQ peptides is still valid. Our resulting extended b-sheet structures are prone to aggregation with exposed hydrogen bond donors and acceptors found in the polypeptide backbone [64]. By seeking to satisfy these bonds, the polypeptides can form bonds with other polypeptides that similarly have an exposed backbone, leading to the formation of large aggregates.
Our simulations of Q n support a model, similar to one outlined by [27], wherein a common, compact misfolded state is accessible to most Q n monomers and rates of misfolding are lengthdependent. Accordingly, we find that partially folded monomers of both short and long Q n polypeptides can have high amounts of b-character (Fig. 3d). Additionally, we find that the residues in long Q n models, with 36 or more repeats, are more likely to have b-sheet conformations than short models, Q 23 (Fig. 3a). The secondary structure probability of each individual residue is proportional to the probability of the overall polypeptide adopting that secondary structure. Thus we find a positive correlation between the length of the Q n polypeptide and its probability of forming a b-sheet. Because previous studies [8,11,[16][17][18] have linked b-sheet formation to aggregation, we suggest that long Q n polypeptides are therefore more likely to form aggregates.

The polyP regions protect XN1 from aggregation
Recent studies [1,3,5,20,39,40,[42][43][44] indicate that the addition of a polyP region can inhibit aggregation of the polyQ region; this inhibition has been associated [40] with a PPII helix structure in the polyP region. We are able to find a structural effect on the polyQ region from the formation of PPII helices in the polyP regions. We find that the polyP regions form PPII helices and suppress the probability of polyQ residues in XN1 to adopt b-sheet dihedral angles. This suppression is still present for long polyQ lengths that are associated with disease. Upon removing these PPII helices, we find the polyQ residues are more likely to adopt b-sheet conformations; the likelihood is similar to that of the isolated glutamine homopolymers: Q n (Fig. 3a). This similarity indicates that the other flanking regions of XN1 have little effect on the probability of the polyQ region to form a b-strand. Furthermore, by considering that b-sheet formation has been linked to aggregation, we find that our results reflect experimental results. That is, because the polyP regions decrease the probability of polyQ residues adopting b-strand conformations, these polypeptides have a slower rate of aggregation, which is seen in other experiments [1,3,5,20,39,40,[42][43][44]. Additionally, the polyP regions greatly destabilize XN1 polypeptides (Fig. 2b and 2c). Since unfolded polypeptides in the random coil state are unlikely to organize as an aggregate, the presence of the polyP in the XN1 sequence prevent it from folding into the aggregation-prone state. Therefore, we hypothesize that polyP regions protect XN1 from aggregation in two ways: 1) destabilize the polypeptide, and 2) the PPII helices formed by polyP inhibit formation of b-sheets.

Nt17 misfolding in monomeric XN1
Currently, the role of the flanking regions in XN1 and other polyglutamine diseases is under debate [1,3,20,[45][46][47][48]. One model [3,47] suggests that aggregation is initiated by the flanking Nt17 region. That is, initially, these flanking regions misfold and form oligomers; subsequently, there is an increase in the local concentration of the polyQ region, which causes the polyQ regions to misfold into protofibrils and ultimately mature, fatal fibrils. Others, however, contend that the native structure of the flanking regions is one that resists aggregation [1,46,48]. In a recent study [20], the expansion of polyQ repeats in XN1 is found to promote the misfolding of Nt17, which leads to rapid formation of oligomers. In particular, the structure of the Nt17 flanking region in XN1 is one part of this debate, which we discuss here.
We find a sharp decrease in the probability of the Nt17 residues to adopt a-helix conformations as the length of the polyQ region increased. Concurrently, however, these residues are more likely to have b-strand conformations for longer polyQ repeat lengths (Fig. 3b). Hence, our results indicate misfolding of the Nt17 region; that is, the a-helical native structure misfolds into a b-strand in the pathogenic associated XN1 models. Our simulation of the XN1Q 47 suggests that both the Nt17 and polyQ regions simultaneously form b-strands (Fig. 4b). A possible future direction would be to identify the temperature at which the Nt17 residues transition from a-helical to b-sheet conformation; a similar calculation has been done elsewhere [65]. In terms of the problem of the aggregation mechanism of XN1, the next step is to determine the role these two regions play in oligomerization of XN1. Here, our computational study suggests that the polyQ region also plays a critical role in the early stages of aggregation [20].

Discrete molecular dynamics
Unlike traditional molecular dynamic simulations, we discretize the spherically symmetric, pair-wise interaction potential in a DMD simulation [49][50][51][52], where, the continuous potential between any two atoms is reduced to a series of square well potentials. Such a simplification considerably accelerates the computation time because in square well potentials, the particles do not experience any force except at the boundaries of the square wells. Thus, the particles travel with constant momenta until a boundary is reached; at such a boundary, the two particles experience a force and are considered colliding. At a collision, the momenta, angular momenta and energies of only the two colliding particles are updated according to conservation laws. Thus, the most computationally intense process is to sort the event list to determine the next collision. Further details on the particular version of DMD used here, such as interaction strength between atoms, are provided in reference [52]. Each polypeptide begins in an extended conformation; to allow for equilibration, the first 500 time units wherein the polypeptide has drastic energy and structural changes are discarded. We simulate each polypeptide in a cube with periodic boundaries. The dimension of the box is chosen to be large enough to fit the extended polypeptide.
All-atom models of Q n , XN1 and XN1-P 11 -P 10 We model each polypeptide using an united all-atom approach, which is explained in detail elsewhere [52]. Briefly, this approach models all heavy atoms and polar hydrogen atoms in a polypeptide; interactions between atoms are governed by the Medusa Force Field [66]. The interactions include van der Waals (VDW) based on CHARMM19, orientation-dependent hydrogen bonding and implicit solvation EEF1 [67]. Because electrostatic interactions at long distances are weakened due to solvent screening, we currently do not model these effects in the all-atom DMD. Salt bridges between side chains are captured partially through the hydrogen bonding potential [52]. Despite the approximation of electrostatic interactions, we were able to fold six small proteins to their native state ab initio [52]. The sequence used to model XN1 is taken as the first 90 residues in Human Huntingtin Protein from NCBI [68] (Fig. 4d). There are charged, polar and non-polar groups scattered throughout the sequence. Additionally, the particular sequence, taken from NCBI, contains 23 glutamine repeats, which is a non-pathogenic length. The sequence is not modified except to add glutamines in the polyQ region as indicated or to remove the P 11 and P 10 stretches in the XN1-P 11 -P 10 models.

Replica exchange DMD simulations
To efficiently sample protein conformations, we use the replica exchange simulation technique [56][57][58][59][60]. With this technique, we are able to utilize multiple, parallel simulations of identical systems called replicas. For each of the 12 polypeptides, we perform simulations on eight replicas with the following set of temperatures: {0.85, 0.75, 0.68, 0.64, 0.6, 0.57, 0.53, 0.5}. The temperature units are in kcal/mol/k B , or about 500K. At a regular time interval of 500 time units (approximately 25 ps), we consider exchanging the temperature of two replicas. We only allow an exchange for two replicas with neighboring temperatures, for example 0.6 and 0.64. We use a Monte-Carlo based approach to accept or reject an exchange. The simulation length of each replica is 1610 6 time units (,50 ns).

Structure screening
Replica exchange simulations allow us to efficiently sample the conformational space of XN1 and its variants by simulating a wide range of temperatures. However, we focus only on the compact structures of the polypeptides. To screen for these compact states, we eliminate highly extended structures, which are those with a large radius of gyration (R g ), and we include only those with low energy. The former criterion eliminates the transient structures explored during the early stages of folding when the polypeptide is far from its favored structure. The latter criterion selects for structures further along the folding pathway, because polypeptides lose energy during folding. We determine both the R g cutoff (Fig.  S1) and energy cutoff (Fig. S2) from a histogram of conformations sampled during the simulation. Each simulation produces 800,000 structures (1 conformation per 10 time units per replica). From this entire set, a subset of roughly 100,000-200,000 structures is selected through this screening process (Table S2). The energy cutoff is chosen to select for the lowest energy Gaussians.

Secondary structure probabilities
The probabilities calculated here are averages over the compact ensemble, which is a small subset of the entire population (Table  S2). Thus, the calculations do not describe the polypeptides in general. Instead, the secondary structure likelihoods describe the polypeptides in a partially folded state, which estimates the misfolded structure. For each of the compact structures, we can calculate the backbone dihedral angles (Q & y) and the corresponding secondary structure for each residue. Then from the ensemble of compact structures for each polypeptide model, we compute the probability of a given residue to adopt a-helix, bsheet, turn or random coil dihedral angles (Fig. 3c). Furthermore, we also determine the secondary structure probability of a set of residues, or a region, by averaging the secondary structure probability over those residues (Figs. 3a, 3b). For example, a bstrand probability of 0.3 in the polyQ region means that on average, a given residue in the polyQ region has a 30% chance of adopting b-strand dihedral angles. These probabilities are calculated for individual amino acids regardless of neighboring residues. However, consecutive residues with high amount of calculated secondary structure probability will suggest the probability of forming specific secondary structures. Thus, this analysis method allows for comparison of secondary structure tendencies for polypeptides based on the compact ensemble. To calculate the error, we compute the standard error for each value; where the number of events is the number of times the energy of the trajectory crossed the energy cutoff (Fig. S3). In effect, this method counts the number of times during the folding trajectory that the polypeptide enters or exits the compact domain of its folding landscape.

Clustering
For visualization purposes, we identify representative structures for each polypeptide using an hierarchical clustering algorithm [69]. The structures used for clustering are taken from the simulation trajectories and are separated by at least 50ps. For most of the models studied, roughly 1500-2000 structures are used for clustering. As exceptions, the XN1Q 23 and Q 23 models include only about 1030 structures. The number of clusters is different for each polypeptide model and varied from 101-843. Similarly, the population of the largest cluster also varied. A detailed summary of these values for each polypeptide model is presented in the supplemental materials (Table S3). In this clustering scheme, the nodes are structures and the distance between a pair of nodes is the root mean square distance (RMSD) between the two structures. We use the single-linkage or minimum distance criterion for clustering. That is, the distance between a node and a cluster is equal to the distance between that node and the closest node in the cluster. The cutoff distance determines the size of clusters. If the distance between a node and a cluster is less than the cutoff, we include the node in that cluster; otherwise, we exclude the node from that cluster. We use a RMSD cutoff of 2.5Å for clustering the XN1Q 47 -P 11 -P 10 polypeptide and a 2Å RMSD cutoff for all other polypeptides. These cutoffs are chosen to maintain high structural similarity among the nodes in a cluster (Fig. S4). We study the centroid of each cluster, which is the most representative node or the centermost node of a cluster. Furthermore, we select the centroids from the largest clusters to be overall representatives of their respective polypeptides. To gauge the significance or reliability of the centroid, we calculate the ratio of the number of structures present in its cluster and the total number of structures considered for clustering. Example calculations are reported in the captions of Figs. 3d and 4. Finally, necessary raw data is presented in Table S3.   Figure S1 Example Rg Histogram. Distribution of the Radius of Gyration (Rg). This histogram is an example from one polypeptide (XN1Q23). The bin size is 1Å . Exactly 800,000 Rg values, one per 10 time units per replica, from the trajectories of all eight replicas are used to generate the distribution. The dashed line indicates the Rg cutoff value (20Å in this case), which is used to determine compact structures. Structures with a larger Rg than the cutoff, are considered extended or insufficiently compact and are not included in the compact ensemble used for further analysis. Note the sharp peak around 16Å ; most polypeptides produced a similar sharp peak. This peak indicates that most conformations explored during the simulation are compact. Found at: doi:10.1371/journal.pcbi.1000772.s004 (0.07 MB DOC) Figure S2 Example Energy Histogram. Example histogram of two replicas from one polypeptide (XN1Q23). The bin size is 1 cal/mol. The distribution for each replica is based on 1,000,000 data points: one value per time unit. Here, the data from two replicas are pooled; therefore this distribution is based on 2,000,000 data points. The dashed line indicates the energy cutoff value (2285 cal/mol); this cutoff applies to all replicas of this polypeptide. The cutoff is chosen such that only the lowest energy states are included as compact structures. Found at: doi:10.1371/journal.pcbi.1000772.s005 (0.08 MB DOC) Figure S3 Trajectories Sampling the Compact Domain. Two example trajectories that depict how the simulations explored states that are compact and partially folded. Both trajectories are from the XN1Q23 simulation, and each trajectory corresponds to the particular replica indicated in the legend. The energy cutoff value is indicated as the red line. This value is determined by a histogram (Fig. S2). All explored states that occur below the cutoff are considered compact and are included in the compact ensemble. Found at: doi:10.1371/journal.pcbi.1000772.s006 (0.13 MB DOC) Figure S4 Hierarchical Clustering. One example (from XN1Q23) of how the cutoff distance for clustering is determined. With hierarchical clustering, we are able to determine the number of clusters remaining for a given Root Mean Square Deviation (RMSD) cutoff. The cutoff value, indicated by the dashed line (2Å ), is chosen to maximize the clustering of similar structures while avoiding clustering of unlike structures. Clustering of similar structures occurs in the steeply descending portion of the curve (between 1Å and 2Å ). In this region clusters are close to each other or similar, because a very small increase in the cutoff distance joins two clusters. Clustering of structures that are less similar occurs in the tail end of the curve (roughly 3Å or longer). Here, the clusters are distant or dissimilar and a large increase in the cutoff distance is required to join two clusters. Thus, a suitable cutoff is often found after the most similar clusters have been joined (here, between 1Å -1.5Å ) and before the distant clusters are joined (2.5Å and longer). Found at: doi:10.1371/journal.pcbi.1000772.s007 (0.08 MB DOC)