Insight into the Structure of Amyloid Fibrils from the Analysis of Globular Proteins

The conversion from soluble states into cross-β fibrillar aggregates is a property shared by many different proteins and peptides and was hence conjectured to be a generic feature of polypeptide chains. Increasing evidence is now accumulating that such fibrillar assemblies are generally characterized by a parallel in-register alignment of β-strands contributed by distinct protein molecules. Here we assume a universal mechanism is responsible for β-structure formation and deduce sequence-specific interaction energies between pairs of protein fragments from a statistical analysis of the native folds of globular proteins. The derived fragment–fragment interaction was implemented within a novel algorithm, prediction of amyloid structure aggregation (PASTA), to investigate the role of sequence heterogeneity in driving specific aggregation into ordered self-propagating cross-β structures. The algorithm predicts that the parallel in-register arrangement of sequence portions that participate in the fibril cross-β core is favoured in most cases. However, the antiparallel arrangement is correctly discriminated when present in fibrils formed by short peptides. The predictions of the most aggregation-prone portions of initially unfolded polypeptide chains are also in excellent agreement with available experimental observations. These results corroborate the recent hypothesis that the amyloid structure is stabilised by the same physicochemical determinants as those operating in folded proteins. They also suggest that side chain–side chain interaction across neighbouring β-strands is a key determinant of amyloid fibril formation and of their self-propagating ability.


Introduction
An increasing number of human pathologies are associated with the conversion of peptides and proteins from their soluble functional forms into well-defined fibrillar aggregates [1,2]. The diseases can be broadly grouped into neurodegenerative conditions, in which fibrillar aggregation occurs in the brain, nonneuropathic localised amyloidoses, in which aggregation occurs in a single type of tissue other than the brain, and nonneuropathic systemic amyloidoses, in which aggregation occurs in multiple tissues [1,2]. The fibrillar deposits associated with human pathologies are generally described as amyloid fibrils when they accumulate extracellularly, whereas the term "intracellular inclusions" has been suggested to be more appropriate when fibrils morphologically and structurally related to extracellular amyloid form inside the cell [3].
Amyloid formation is not restricted, however, to those polypeptide chains that have recognised links to protein deposition diseases. Several other proteins that have no such link have been found to form fibrillar aggregates in vitro with morphological, structural, and tinctorial properties that allow them to be classified as amyloid-like fibrils [4,5]. This finding has led to the idea that the ability to form the amyloid structure is an inherent property of polypeptide chains, encoded in main-chain backbone interactions. From a theoretical perspective it was also recently shown that simple considerations of geometry and symmetry are sufficient to explain, within the same sequence-independent framework, the emergence of a limited menu of native-like conformations for a single chain and of β-aggregate structures for multiple chains [6].
The generic ability to form the amyloid structure has apparently been exploited by living systems for specific purposes, as some organisms have been found to convert, during their normal physiological life cycle, one or more of their endogenous proteins into amyloid-like fibrils that have functional properties rather than deleterious effects [7][8][9]. Perhaps the most surprising of these functions is the ability of amyloid-like fibrillar aggregates to serve as a nonchromosomal genetic element. Proteins such as Ure2p and Sup35p (Saccharomyces cerevisiae) or HET-s (P. anserina) can adopt a fibrillar conformation that, in addition to giving rise to specific phenotypes, appears to be self-propagating, transmissible, and infectious [10].
In their soluble states, the proteins able to form fibrillar aggregates do not share any obvious sequence identity or structural homology to each other. In spite of these differences in the precursor proteins, morphological inspection reveals common properties in the resulting fibrils [11]. Images obtained with transmission electron microscopy or atomic force microscopy reveal that the fibrils usually consist of 2-6 protofilaments, each about 2-5 nm in diameter [12]. These protofilaments generally twist together to form fibrils that are typically 7-13 nm wide [11,12], or associate laterally to form long ribbons that are 2-5 nm high and up to 30 nm wide [13][14][15]. X-ray fibre diffraction data have shown that the protein or peptide molecules are arranged so that the polypeptide chain forms β-strands that run perpendicular to the long axis of the fibril [11].
Solid-state nuclear magnetic resonance (ss-NMR), X-ray micro- or nano-crystallography, and other techniques such as systematic protein engineering coupled with site-directed spin-labelling or fluorescence-labelling have transformed our ability to gain insight into the structures of fibrillar aggregates with residue-specific detail [16][17][18][19][20][21][22][23][24][25][26][27][28][29]. These advances have allowed us to go beyond the generic notions of the fibrillar appearance and presence of a cross-β structure. These studies have indeed allowed the identification of regions of the sequence that form and stabilise the cross-β core of the fibrils, as opposed to those stretches that are flexible and exposed to the solvent. In many cases, the arrangement of the various molecules in the fibrils has also been determined, clarifying the nature of the intermolecular contacts and the structural stacking of the molecules along the fibril axis. One frequent characteristic emerging from these studies, particularly for fibrils formed by long sequences, is the parallel in-register arrangement (PIRA) of β-strands in the fibril core [17][18][19][20][21][23][24][25][26][28], but antiparallel arrangements are also possible, especially for shorter strands [27,30].
At the same time, mutational studies of the amyloid aggregation kinetics revealed simple correlations between physico-chemical properties (charge, hydrophobicity, and β-sheet propensity) and aggregation propensities [31]. This allowed the development of different methods, which successfully predict aggregation-prone regions in the amino-acid sequence of a full-length protein [32][33][34][35][36][37]. All such approaches focus on predicting the intrinsic β-aggregation propensity of a sequence stretch using only the amino-acid sequence as an input. In [35] the possible parallel/antiparallel arrangement of the sequence stretch with itself was also taken into account. Molecular dynamics simulations of sequence fragments mounted on idealized β-strand templates, either parallel or antiparallel, were used to identify the most amyloidogenic fragments in a specific case [38]. A template amyloid structure based on PIRA is also employed in a very recent method for identifying fibril-forming segments [39]. A yet-unanswered question is why PIRA is found to be the most frequent arrangement of β-strands in the fibril core.
Here we introduce a computational approach built on a pairwise energy function based on the propensities of two residues to be found within a β-sheet facing one another on neighbouring strands, as determined from a dataset of globular proteins of known native structures. We extract two different propensity sets depending on the orientation (parallel or antiparallel) of the neighbouring strands. Our method associates energy scores to specific β-pairings of two sequence stretches of the same length, and further assumes that distinct protein molecules involved in fibril formation will adopt the minimum-energy β-pairings in order to better stabilise the cross-β core.
A novel feature of our method is the ability to predict the registry of the intermolecular hydrogen bonds formed between amyloidogenic sequence stretches. In this way we can rationalise the observed tendency of proteins to assemble into parallel β-sheets in which the individual strands are in-register, so that residues of the same type stack along the fibril axis. Our algorithm is also able to correctly discriminate the orientation between intermolecular β-strands, either parallel or antiparallel. As a further demonstration of the robustness of the approach we will illustrate the ability of our algorithm to predict the portions of the sequence forming the cross-β core of the fibrils for a set of proteins, in excellent agreement with the experimentally determined amyloid structures, similar to previously proposed methods [32][33][34][35][36][37].
Our approach is based on the key assumption that a universal mechanism is responsible for β-sheet formation both in globular proteins and in fibrillar aggregates. The successful predictions obtained in this work suggest the validity of the above hypothesis in agreement with the unified framework presented previously [6].

Results
Synopsis
In many fatal neurodegenerative diseases, including Alzheimer, Parkinson, and spongiform encephalopathies, proteins aggregate into specific fibrous structures to form insoluble plaques known as amyloid. The amyloid structure may also play a nonaberrant role in different organisms. Many globular proteins, folding to their biologically functional native structures in vivo, can be induced to aggregate into amyloid-like fibrils under suitable conditions in vitro. One hallmark of amyloid structure is a specific supramolecular architecture called cross-beta structure, held together by hydrogen bonds extending repeatedly along the fibril axis, but intermolecular interactions are yet unknown at the amino-acid level except for very few cases. In this study, the authors present an algorithm, called prediction of amyloid structure aggregation (PASTA), to computationally predict which portions of a given protein or peptide sequence forming amyloid fibrils are stabilizing the corresponding cross-beta structure and the specific intermolecular pattern of hydrogen-bonded amino acids. PASTA is based on the assumption that the same amino acid-specific interactions stabilizing hydrogen bond patterns in native structures of globular proteins are also employed by nature in amyloid structure. The successful comparison of the authors' prediction with available experimental data supports the existence of a unique framework to describe protein folding and aggregation.

The Parallel In-Register Arrangement of β-Strands in the Amyloid-Like Fibrils
Based on the procedure described in detail in Materials and Methods and sketched in Figure 1, we can associate an energy score $e^{p(a)}_{i,j}(L)$, from Equations 2 and 3, to the β-pairing of two sequence stretches chosen from distinct protein chains sharing an identical sequence. The pairing is specific since only pairs of residues facing each other in the corresponding register contribute to the energy score. All possible aggregation patterns are then defined in terms of the positions along the sequence $i$, $j$, the length $L$, and the relative orientation (either parallel or antiparallel) of the two sequence stretches participating in the pairing. We assume that the faithful repetition of this aggregating unit is at the basis of the assembly of polypeptide chains into amyloid fibrils, determining the highly regular cross-β core of the fibril.
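The register-specific scoring described above can be illustrated with a minimal Python sketch. It assumes, for illustration only, that the score of a pairing is the plain sum of pair energies over facing residues; Equations 2 and 3 are not reproduced here and may contain additional terms. The table `TOY_ENERGY`, its values, and all function names are hypothetical and are not taken from Table 1.

```python
from typing import Dict, List, Tuple

# Hypothetical pair energies (arbitrary units); NOT the values of Table 1.
TOY_ENERGY: Dict[str, Dict[Tuple[str, str], float]] = {
    "parallel": {("V", "V"): -2.0, ("A", "V"): -0.5, ("A", "A"): -0.3},
    "antiparallel": {("V", "V"): -1.0, ("A", "V"): -0.8, ("A", "A"): -0.4},
}


def pair_energy(a: str, b: str, orientation: str) -> float:
    """Symmetric lookup of the toy energy for residues a and b (0.0 if absent)."""
    key = tuple(sorted((a, b)))
    return TOY_ENERGY[orientation].get(key, 0.0)


def facing_pairs(i: int, j: int, length: int, orientation: str) -> List[Tuple[int, int]]:
    """Sequence positions that face each other for segments starting at i and j."""
    if orientation == "parallel":
        return [(i + k, j + k) for k in range(length)]
    # Antiparallel: the second segment is read in the reverse direction.
    return [(i + k, j + length - 1 - k) for k in range(length)]


def pairing_energy(seq: str, i: int, j: int, length: int, orientation: str) -> float:
    """Sum of pair energies over facing residues (assumed form of the score)."""
    return sum(
        pair_energy(seq[a], seq[b], orientation)
        for a, b in facing_pairs(i, j, length, orientation)
    )


if __name__ == "__main__":
    seq = "VAVA"
    print(pairing_energy(seq, 0, 0, 4, "parallel"))
    print(pairing_energy(seq, 0, 0, 4, "antiparallel"))
```

Only residues that face each other in the chosen register enter the sum, which is what makes the score specific to a given choice of $i$, $j$, $L$, and orientation.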
We first analyse the properties of our energy function at the level of single pair energies $E^{p(a)}_{ab}$ (see Equation 1). Residue pairs that appear from the analysis to possess low values of $E^{p}_{ab}$ or $E^{a}_{ab}$ should then have a higher propensity to aggregate in the context of amyloid fibrils than other pairs. Figure 2 shows the distribution of the 210 entries for $E^{p}_{ab}$, $E^{a}_{ab}$, and for the 20 in-register entries $E^{p}_{aa}$. All entries for both parallel and antiparallel pairing are shown in Table 1. Antiparallel pairing is favoured, on average, but the most favourable entries are found in the left tail of the parallel pairing distribution (with the only exception of the CYS-CYS antiparallel entry). Moreover, many of those are achieved for in-register pairings, notably for the hydrophobic residues VAL, ILE, and PHE. On the contrary, $E^{p}_{aa}$ energies for charged and for some of the polar residues can assume significantly higher values. The highest $E^{p}_{aa}$ energy is obtained for PRO, as expected, since it breaks the regular pattern of main-chain backbone hydrogen bonding.
[Figure 2, caption fragment: ... [57,58]. Note that the energy for parallel arrangement of CYS-CYS is repulsive. doi:10.1371/journal.pcbi.0020170.g002]
To verify whether the energies obtained with Equation 1 promote a general pattern in the aggregation, we use the sequence of the human amyloid β-peptide (Aβ1-40), a peptide known to be involved in Alzheimer disease and other pathological conditions such as hereditary cerebral haemorrhage with amyloidosis and inclusion-body myositis [2]. We are interested in rationalising on general grounds the competition between different registers in achieving the most favourable pairing. To average out as much as possible the influence of sequence specificity, we need to find a set of different minimum energy pairings. For fixed $L$ and $|i-j|$, we slide the β-pairing segments along the sequence looking for the minimum energy pairing in both the parallel and antiparallel orientations (for the analysis shown in Figure 3 we consider the length-independent energy term $e^{p(a)}_{i,j}(L) + L\Delta s$). The minimum energies collected in this way are then averaged over different segment lengths ($4 \le L \le 23$) for a fixed value of $|i-j|$, yielding a mean value that is plotted as a function of $|i-j|$ in Figure 3. We find that the in-register parallel alignment ($|i-j| = 0$) is considerably more favourable than any other out-of-register parallel alignment ($|i-j| \ne 0$). We interpret oscillations in the curve for parallel pairings as a signature of some degree of pattern repetition in the sequence. On the other hand, $|i-j| = 0$ is the preferred pairing also for antiparallel orientation, but in this case the average minimum energy exhibits a linear increase with $|i-j|$. All these features are consistently retrieved in all sequences analysed in this work (unpublished data), whereas the existence and the size of the "gap" between the $|i-j| = 0$ parallel and antiparallel pairings depend crucially on the specific sequence (see Table 2).
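The sliding-and-averaging procedure just described can be sketched in Python as follows. This is a minimal illustration under stated assumptions: `toy_pair_energy` is a hypothetical stand-in for the pair energies of Equation 1 (it simply rewards identical residues, so it cannot be used to reproduce Figure 3), and the pairing energy is taken to be the plain sum over facing residues.

```python
from statistics import mean
from typing import Iterable, Optional


def toy_pair_energy(a: str, b: str, orientation: str) -> float:
    """Hypothetical energies: identical residues are favoured; NOT Table 1."""
    if a != b:
        return 0.0
    return -1.0 if orientation == "parallel" else -0.5


def pairing_energy(seq: str, i: int, j: int, length: int, orientation: str) -> float:
    """Sum of toy pair energies over the residues facing each other."""
    total = 0.0
    for k in range(length):
        a = seq[i + k]
        b = seq[j + k] if orientation == "parallel" else seq[j + length - 1 - k]
        total += toy_pair_energy(a, b, orientation)
    return total


def min_energy(seq: str, offset: int, length: int, orientation: str) -> Optional[float]:
    """Slide both segments along seq at fixed |i - j| = offset; keep the minimum."""
    energies = [
        pairing_energy(seq, i, i + offset, length, orientation)
        for i in range(len(seq) - length - offset + 1)
    ]
    return min(energies) if energies else None


def mean_min_energy(
    seq: str, offset: int, orientation: str, lengths: Iterable[int] = range(4, 24)
) -> Optional[float]:
    """Average the minimum energies over segment lengths 4 <= L <= 23."""
    minima = []
    for length in lengths:
        m = min_energy(seq, offset, length, orientation)
        if m is not None:  # lengths that do not fit in the sequence are skipped
            minima.append(m)
    return mean(minima) if minima else None


if __name__ == "__main__":
    seq = "ACDEFGHIKL"
    for offset in range(3):
        print(offset, mean_min_energy(seq, offset, "parallel"))
```

Plotting `mean_min_energy` against the offset for both orientations would give curves of the kind shown in Figure 3, provided the toy energies are replaced by the actual entries of Table 1.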
Our results show that on average the assembly of Aβ1-40 molecules with PIRA of sequence segments is favoured over both antiparallel and parallel out-of-register arrangements. ss-NMR and site-directed spin labelling experiments indeed show that amyloid fibrils from Aβ contain such a parallel in-register stacking of β-strands contributed by distinct molecules [17,18]. Similar results are obtained when analysing the sequences of amylin, α-synuclein, and the PHF43 segment of tau protein (unpublished data), again in agreement with the experimental results [19][20][21][23]. For the Aβ1-40 peptide and for the islet amyloid polypeptide, PIRA is clearly preferred over the antiparallel arrangement within this analysis (Table 2). On the other hand, the preference is milder for the PHF43 fragment of the tau protein, and for human α-synuclein, being within the standard deviation of the energies employed for the average, as shown in Table 2.
The behaviour of the two curves shown in Figure 3 can be understood on the basis of simple statistical considerations. The problem consists in finding several low-energy pairings in a row. For a generic out-of-register parallel arrangement, the lowest $E^{p}_{ab}$ values need to be found within all 210 possible entries. Therefore, the probability of finding several consecutive low-energy pairings is indeed quite low, independently of the sequence distance $|i-j|$ between the segments (as long as $|i-j| \ne 0$). On the other hand, the search problem is much easier in the case of in-register parallel pairing ($|i-j| = 0$), since the lowest pairing energies need to be found only within the 20 $E^{p}_{aa}$ entries (see Figure 2). Therefore PIRA is favoured, with respect to other parallel alignments, because many of the most favourable entries can be found more easily.
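The two counts used in this argument, 210 and 20, follow directly from the 20 standard amino acids: there are 210 unordered residue pairs (including identical ones), of which 20 are the in-register pairs of identical residues. A short Python check, using the one-letter codes as the only input:

```python
from itertools import combinations_with_replacement

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes

# Unordered pairs (a, b) with a <= b: the entries of a symmetric 20 x 20 table.
unordered_pairs = list(combinations_with_replacement(AMINO_ACIDS, 2))

# In-register parallel pairing always puts a residue against its own type.
in_register_pairs = [(a, b) for a, b in unordered_pairs if a == b]

print(len(unordered_pairs), len(in_register_pairs))
```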
In the case of antiparallel arrangement, the search always has to be performed among 210 entries, but a symmetry effect favours the $|i-j| = 0$ register. Indeed, when two overlapping sequence segments are aligned in antiparallel manner, some pairings are repeated twice (see the antiparallel case in Figure 1 with $j = i$). The number of low-energy pairings to be found is thus effectively reduced. The extent of this reduction is proportional to the length of the overlapping portion, thus explaining the linear increase with $|i-j|$ of the antiparallel curve in Figure 3. (Further details can be found in the Figure 3 legend.) We remark that the above general arguments rely on the fact that the most favourable entries do indeed correspond to PIRAs, due to the stacking of hydrophobic and hydrophilic residues. In other words, PIRA provides a natural way of maximizing the number of favourable stacking interactions, lining up hydrophobic and hydrophilic residues in long rows along the fibril axis. Any other out-of-register parallel arrangement will most likely disrupt such an ordered pattern of stabilizing interactions.
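The symmetry effect for antiparallel pairing can be made concrete by counting how many distinct pairings of sequence positions occur for a given offset. The sketch below assumes the facing rule in which position $i+k$ of one segment faces position $j+L-1-k$ of the other; the function names are illustrative only.

```python
from typing import List, Tuple


def antiparallel_facing_pairs(i: int, j: int, length: int) -> List[Tuple[int, int]]:
    """Positions facing each other when segment [i, i+length) pairs
    antiparallel with segment [j, j+length) of an identical chain."""
    return [(i + k, j + length - 1 - k) for k in range(length)]


def count_distinct_pairings(i: int, j: int, length: int) -> int:
    """Number of distinct unordered position pairs; (a, b) and (b, a) involve
    the same two residue types, so they count as one pairing."""
    pairs = antiparallel_facing_pairs(i, j, length)
    return len({tuple(sorted(p)) for p in pairs})


if __name__ == "__main__":
    for offset in range(0, 7, 2):
        print(offset, count_distinct_pairings(0, offset, 6))
```

For segments of length 6 the number of distinct pairings is 3, 4, 5, and 6 at offsets 0, 2, 4, and 6: it grows linearly as the overlap shrinks, so fewer favourable pairs need to be found when the two segments coincide.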

Prediction of Alignment Orientation for Fibril-Forming Peptides
We employ prediction of amyloid structure aggregation (PASTA) to predict the orientation between β-strands in fibrillar structures formed by short, previously investigated peptides. In all cases we assume the full peptide length is involved in the β-core of the fibril, so that we simply compare the energy score of the parallel and antiparallel β-pairings of the full segment with itself. The results in Table 3 show that, in all three cases considered, PASTA correctly identifies the experimentally determined orientation as the minimum energy pairing. To our knowledge, the first two peptides are the only cases in which a fibrillar structure has been determined at atomic resolution by X-ray diffraction from microcrystals. GNNQQNY is a fragment from the yeast prion protein Sup35 displaying a parallel orientation between β-strands within the same β-sheet [28]. KFFEAAAKKFFE is a peptide explicitly designed to form amyloid-like fibrils and was shown to be composed of antiparallel β-sheets [27]. KLVFFAE is the (16-22) fragment of the human Aβ1-40 amyloid peptide, whose β-sheet structure was indicated to be antiparallel by ss-NMR data [40]. In the latter case it is remarkable that PASTA recognises

Prediction of Specific Pairings and Sequence-Aggregation Propensities
We employ PASTA to identify the regions of the sequence promoting aggregation for five natively unfolded systems. These include human Aβ1-40, human α-synuclein, the human islet amyloid polypeptide, the PHF43 fragment from human tau, and the HET-s prion domain protein from P. anserina. We decided to perform the analysis on such systems rather than on globular proteins because our analysis uses the intrinsic aggregation propensities of residue pairs and does not take into account the presence and type of secondary and tertiary structure in the analysed polypeptide chain. Indeed, it is well-known that the presence of structure in the initial nonaggregated state of the protein is an important determinant of aggregation and reduces dramatically the aggregation propensity of the structured regions [41]. In addition, the five natively unfolded systems analysed here were chosen because their aggregation-promoting regions were also determined experimentally, allowing our predictions to be directly tested.
The energy functions introduced in Equations 2 and 3 can be used to compare different segment lengths, and we will first list the three pairings yielding the minimum energy when looking among all possible segment lengths. (By definition the energy of a nonaggregating system is zero.) The results are summarized in Table 4. We then use the single-residue propensity h(k) defined in Equation 5 to take into account other low-energy pairings that could be close competitors of the lowest-energy pairing.
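The search for the lowest-energy pairings over all segment lengths can be sketched as an exhaustive enumeration. The Python example below is only an illustration: `toy_pair_energy` and its values are hypothetical, and `length_penalty` is an assumed per-residue cost standing in for whatever term in Equations 2 and 3 makes pairings of different lengths comparable with the zero energy of the nonaggregated state. Neither is taken from the paper.

```python
import heapq
from typing import List, Tuple

Pairing = Tuple[float, int, int, int, str]  # (energy, i, j, L, orientation)


def toy_pair_energy(a: str, b: str, orientation: str) -> float:
    """Hypothetical energies: V-V contacts are favoured; NOT Table 1."""
    if a == "V" and b == "V":
        return -1.0 if orientation == "parallel" else -0.5
    return 0.0


def pairing_energy(seq: str, i: int, j: int, length: int, orientation: str,
                   length_penalty: float) -> float:
    """Sum over facing residues plus an ASSUMED per-residue penalty."""
    total = 0.0
    for k in range(length):
        a = seq[i + k]
        b = seq[j + k] if orientation == "parallel" else seq[j + length - 1 - k]
        total += toy_pair_energy(a, b, orientation)
    return total + length_penalty * length


def lowest_pairings(seq: str, n_best: int = 3, min_length: int = 4,
                    length_penalty: float = 0.2) -> List[Pairing]:
    """Enumerate every (i, j, L, orientation) with i <= j; keep the n_best lowest."""
    candidates = []
    n = len(seq)
    for orientation in ("parallel", "antiparallel"):
        for length in range(min_length, n + 1):
            for i in range(n - length + 1):
                for j in range(i, n - length + 1):
                    e = pairing_energy(seq, i, j, length, orientation, length_penalty)
                    candidates.append((e, i, j, length, orientation))
    return heapq.nsmallest(n_best, candidates)


if __name__ == "__main__":
    for pairing in lowest_pairings("KKVVVVKK"):
        print(pairing)
```

With these toy settings the best pairing for `KKVVVVKK` is the parallel in-register pairing of the four valines; restricting the enumeration to `i <= j` is sufficient because exchanging the two identical chains gives the same set of facing residues.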
Human amyloid β-peptide. We first apply PASTA to study Aβ1-40. It is known by proline-scanning mutagenesis and quantitation of fibrils by Congo red binding [42], ThT binding, electron microscopy, and SDS-PAGE [43], ss-NMR [17] and site-directed spin labelling [18] that the regions of the sequence involved in β-aggregation are approximately the segments 12-24 and 30-40 (the boundaries of the two regions vary somewhat in the various reports). Both segments are almost exactly predicted and are found as minima closely competing with each other. In Figure 4A we plot h(k) for Aβ1-40. We see that in the regions 12-20 and 31-40 the propensity is very strong, in almost perfect agreement with the experimental data, whereas it is negligible in the other parts of the peptide. In both cases PIRA is predicted, in perfect agreement with experimental data [17].
Human α-synuclein. This protein is involved in Parkinson disease and in dementia with Lewy Bodies [2]. By synthesising peptides of various lengths and quantifying their aggregation using HPLC and circular dichroism, the region 63-78 has been proposed to be involved in aggregation [44,45]. More recent experimental studies employing ss-NMR have allowed the identification of several sequence portions involved in β-strand formation within the fibrils [23]. These are shown as thick red bars in Figure 4B, together with the aggregation profile predicted by our algorithm. Four out of five of the experimentally determined sequence stretches are correctly identified by PASTA. The overall arrangement is parallel in-register, as determined by site-directed spin-labelling studies [19]. PASTA correctly finds the best minimum for a parallel in-register pairing, but the second-best pairing is a parallel out-of-register one. Looking at the segments involved, which are VVHGVATV (48-55) and VVTGVTAV (70-77), we realize that this is due to a strong pattern repetition. Five out of eight residues are matched for an in-register alignment, including the four valines that are most responsible for the low pairing energy. In Figure 5 we show the β-pairing contact map h2(k,m), where a compendium of the general features predicted by PASTA can be found. The strongest signal is for PIRA, but parallel out-of-register arrangement is also selected in the presence of repetition of sequence patterns along the chain. Weak signals are also present for antiparallel arrangement, which would take place between identical sequence stretches, as predicted on general grounds.

Figure 4 (caption, panels B-D as recovered).
(B) Same as in (A) but for the protein human α-synuclein. Thick red bars mark sequence stretches involved in β-strands according to ss-NMR experiments [23]. The thin red bars show the whole sequence portion found to be in PIRA, according to site-directed spin-labelling, solid line [19], and found to participate in main-chain backbone hydrogen bonding according to hydrogen-deuterium exchange, dashed line [22]. The two experimentally determined portions differ only in the location of the initial boundary.
(C) Same as in (A) but for the islet amyloid polypeptide. The thin red line shows the whole sequence portion found to be in PIRA according to site-directed spin-labelling experiments, with the dashed portions representing the uncertainty on boundary location [20]. Thick red bars show the sequence portions proposed to participate in β-strands according to a structural model based on a serpentine PIRA [24].
(D) Same as in (A) but for the PHF43 fragment from the fetal form of human tau. The thick red line shows a local sequence motif identified to be crucial for β-aggregation [46].

Islet amyloid polypeptide. The 37-residue islet amyloid polypeptide is the major component of pancreatic amyloid deposits, which are the hallmark of noninsulin-dependent (type II) diabetes mellitus. We plot h(k) in Figure 4C. Again there is quite a good agreement with site-directed spin-label experiments [20], which show parallel in-register aggregation in the region 12-29. It should be remarked that in this case, unlike for Aβ1-40, PASTA clearly signals the existence of a single continuous pairing. In a recently proposed model, resulting from a number of experimental constraints, residues 12-17, 22-27, and 31-37 are proposed to form β-strands in a serpentine arrangement in each molecule, with very short loops connecting them [24]. This structural arrangement is repeated for each peptide molecule along the fibril axis so that the parallel in-register orientation is maintained [24]. The short length of the loop may make it difficult to distinguish between a single continuous pairing and three very-nearby short pairings.
PHF43 fragment from the fetal form of human tau. Filamentous inclusions from tau proteins are present in numerous neurodegenerative diseases, including Alzheimer disease and frontotemporal dementia with Parkinsonism linked to Chromosome 17 [2]. The region found experimentally to be involved in aggregation within the tau fragment PHF43 is the segment 11-16, as identified by means of spot membrane-binding assay [46]. Good agreement is again found between these experimental data and our prediction, as shown by both the minimum energy pairings listed in Table 4 and the plot of h(k) in Figure 4D. The arrangement is also correctly predicted to be parallel in-register, as determined by site-directed spin-labelling coupled with EPR methods [21].
HET-s prion domain fragment from P. anserina. The prion form of the protein HET-s is involved in a programmed cell death mechanism called heterokaryon incompatibility [47,48]. The recombinant HET-s prion domain (fragment 218-289) can form amyloid-like fibrils in vitro and induce prion phenotypes in a host cell [49]. Recent experiments employing fluorescence studies, quenched hydrogen exchange NMR, and ss-NMR [29] determined four sequence portions involved in β-strand structure within the fibrils, shown as red bars in Figure 4E, together with the aggregation profile predicted by our algorithm. PASTA correctly predicts four sequence stretches to be involved in β-aggregation, placing three of them in good agreement with experiments. The peculiar arrangement suggested by Ritter et al. on the basis of their experimental data is parallel but not in-register, pairing different portions of the same chain [29]. The method described in this work is based on the assumption of interchain pairing. Further studies are being carried out to extend our algorithm to intrachain pairing as well.

Discussion
We introduced a pairwise energy function based on the propensities of two residues to be found within a β-sheet facing one another on neighbouring strands, as determined from a dataset of globular proteins of known native structures. This energy function was incorporated within an algorithm able to predict amyloidogenic sequence stretches, as well as the registry of the intermolecular hydrogen bonds formed between them. The latter type of prediction is a novel feature of our approach.
For a set of natively unfolded proteins involved in the formation of amyloid fibrils, we correctly predict their observed tendency to assemble into parallel β-sheets in which the individual strands are in-register. Our algorithm is also able to correctly determine the orientation between β-strands in the fibrils, either parallel or antiparallel, as shown by a comparison with fibrillar structures formed by short peptides determined experimentally at the atomic level.
Our energy function predicts that PIRA is favoured on general grounds, with respect to other parallel out-of-register alignments, because the most favourable β-pairing found in globular proteins is indeed parallel and obtained for hydrophobic pairs sharing the same residue kind. Even though such parallel in-register pairing can be unfavourable for other residues (especially charged ones), PIRA by itself constrains the search for good pairs in a much smaller set than for out-of-register arrangement. A similar, yet milder, effect induced by pairing statistics is detected for antiparallel arrangement, favouring the case in which the latter is achieved between identical sequence stretches. Parallel arrangement is generally favoured over antiparallel, but in some cases sequence specificity can override this tendency, as in the case of short peptides. Out-of-register parallel arrangement is also predicted as a good competitor in the presence of repeated (periodic) patterns in the sequence, which actually occur in several prion proteins, both in mammals and in fungi.
Our algorithm was also used to predict the portions of the sequence, for an initially unstructured polypeptide chain, that form the cross-β core of the fibrils. A good agreement with the experimental information available on amyloid structures, similar to other proposed methods [32][33][34][35][36][37], was found for human Aβ1-40, α-synuclein, islet amyloid polypeptide, a fragment from human tau, and the prion domain of HET-s from P. anserina.
The results obtained in this work, besides rationalising on general grounds the common occurrence of PIRA in amyloid fibrillar structures, suggest two important conclusions. First, the existence of a preferred β-pairing is an important determinant of the self-propagating nature of amyloid fibrils and of their difficulty in seeding the fibrillar state in proteins that have even subtle differences in sequence, a phenomenon associated with the species barrier in prion transmissibility. Moreover, the polymorphism often observed for amyloid fibrils [15,50], leading to the existence of different prion strains [10], might be explained by the competition between different low-energy β-pairings that are realizable for the same sequence.
The notion of a preferred β-pairing is the simplest one that can be put forward to account for the self-complementation of protein molecules on a structural basis [51]. It can be seen as a way of reconciling the roles of side chains in driving specific aggregation and of main-chain backbone interactions in determining the general tendency of polypeptide chains for fibril formation. The knowledge-based energy function introduced in this work describes how side chain-side chain interactions between residues facing each other modulate the main chain hydrogen bond energy common to all residues. Stacking of hydrophobic residues [27] or hydrogen bonding between side chain groups [28] will favour PIRA, whereas electrostatic repulsion between charges of the same type disfavours it. All such interactions are captured within our knowledge-based approach. A determinant of self-complementation that we neglect in our simple scheme is the steric interdigitation between different sheets forming the fibril core [39]. However, the good performance of our algorithm shows that sequence information is already relevant at the level of β-strand pairing within the same sheet.
As a second important conclusion, the fact that the whole computational approach is derived from the knowledge of globular proteins underscores the universality of the physicochemical mechanisms underlying amyloid fibril formation. Moreover, it indicates that the structure and stabilising interactions existing in the apparently monotonous amyloid or amyloid-like fibrils are of the same essential nature as those determining structural and functional diversity in globular proteins.

Materials and Methods
Knowledge-based pair potential. We derive an energy function for specific β-aggregation using the top500H database [52], a nonredundant, specially refined set of 500 high-resolution X-ray crystallographic structures of globular proteins in which hydrogen atoms were also reconstructed. These proteins include all-α, all-β, α/β, and α + β proteins, and their structures are deposited in the Protein Data Bank. All occurring instances, $n_{ab}$, of a given ab residue pair are partitioned ($n_{ab} = n^{c}_{ab} + n^{p}_{ab} + n^{a}_{ab} + n^{d}_{ab}$) into four different classes according to whether the two residues are facing each other on neighbouring parallel β-strands ($n^{p}_{ab}$) or on neighbouring antiparallel β-strands ($n^{a}_{ab}$), and whether the distance between their Cα atoms is less than 6.5 Å without participating in an ordered β-geometry (generic bulk contacts, $n^{c}_{ab}$) or more than 6.5 Å (noncontacting disordered pairs, $n^{d}_{ab}$). All pairs are included in the count, except those formed by consecutive residues along the protein chain. Participation in either parallel or antiparallel β-bridges is assessed by using the DSSP algorithm [53], but with a slightly stricter electrostatic energy threshold of −1 kcal/mol to assign hydrogen bonds. (The distribution of such energies obtained from the Richardson set peaks around −2.4 kcal/mol but increases again for values higher than −1 kcal/mol; unpublished data.)
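The four-way partition described above can be sketched as a small classification routine. This is only an illustration under stated assumptions: the function name `classify_pair`, the `bridge` argument (a label that would in practice be derived from a DSSP-style bridge assignment), and the treatment of a distance of exactly 6.5 Å as noncontacting are all choices made here, not details taken from the text.

```python
from enum import Enum

CONTACT_CUTOFF = 6.5  # C-alpha distance threshold in angstrom, as quoted in the text


class PairClass(Enum):
    PARALLEL = "p"       # facing each other on neighbouring parallel beta-strands
    ANTIPARALLEL = "a"   # facing each other on neighbouring antiparallel beta-strands
    CONTACT = "c"        # generic bulk contact, no ordered beta-geometry
    DISORDERED = "d"     # noncontacting pair


def classify_pair(seq_separation, bridge, ca_distance):
    """Assign a residue pair to one of the four classes, or None if excluded.

    seq_separation: |i - j| along the chain; consecutive residues are excluded.
    bridge: "parallel", "antiparallel" or None (hypothetical input, assumed to
            come from a DSSP-like beta-bridge assignment).
    ca_distance: C-alpha to C-alpha distance in angstrom.
    """
    if seq_separation <= 1:
        return None  # consecutive residues (and self-pairs) are not counted
    if bridge == "parallel":
        return PairClass.PARALLEL
    if bridge == "antiparallel":
        return PairClass.ANTIPARALLEL
    if ca_distance < CONTACT_CUTOFF:
        return PairClass.CONTACT
    # Assumption: a distance of exactly 6.5 angstrom is counted as noncontacting.
    return PairClass.DISORDERED
```

Counting the returned labels per residue-type pair over a structure database would give the four numbers entering the partition of $n_{ab}$.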
Energies can be assigned to the occurrence of parallel β-pairing and antiparallel β-pairing for two amino acids of type a and type b by assuming that the database of protein native structures is a system in thermodynamic equilibrium at a single temperature, roughly constant for all the proteins in the database [54]. Upon the further assumption that correlations between different pairings can be neglected within single proteins in the database [55], the propensity, $p_{ab}(x)$, of the ab pair to be found in one of the four pairing types, x, is given by the Boltzmann factor, $p_{ab}(x) = \exp(-E^{x}_{ab})$. The E's are energy differences, measured in units of thermal energy, between the native and the reference state with respect to which propensities are computed [54].
Propensities are defined as the ratio of the observed frequency over the expected probability in the reference state, which is in turn estimated as the frequency observed over all pairs.
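Written out for the parallel case, the Boltzmann relation and the definition of propensity given above lead to an expression of the following form. This is a reconstruction from the stated definitions rather than a quotation; the primed summation indices $a'b'$, running over all residue-type pairs, and the use of the natural logarithm are notational assumptions made here.

```latex
E^{p}_{ab} \;=\; -\ln p_{ab}(p)
           \;=\; -\ln\!\left[
             \frac{\, n^{p}_{ab} \,/\, n_{ab} \,}
                  {\, \sum_{a'b'} n^{p}_{a'b'} \,\big/\, \sum_{a'b'} n_{a'b'} \,}
           \right]
```

The numerator is the observed frequency of parallel pairing for the specific pair ab, and the denominator is the same frequency computed over all pairs, which plays the role of the reference state. Replacing the superscript p by a or c gives the corresponding antiparallel and bulk-contact energies.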
A similar expression yields the energy $E^{d}_{ab}$, which should be assigned to a noncontacting pair ab. Since the numbers $n^{p}_{ab}$, $n^{a}_{ab}$, and $n^{c}_{ab}$ can be very small (or even zero in some special cases involving PRO and CYS), we used an averaging procedure to decrease statistical error [33], based on adding ($n^{p}_{ab} \to n^{p}_{ab} + 1$, $n_{ab} \to n_{ab} + 1$) or subtracting ($n^{p}_{ab} \to n^{p}_{ab} - 1$, $n_{ab} \to n_{ab} - 1$) a single event to or from the observed number of cases (whenever $n^{p}_{ab} < 2$, 0.5 is used in place of $n^{p}_{ab} - 1$). Statistical potentials describing residue pair correlations within β-sheets were developed in the context of structure prediction, limiting the total ensemble of residue pairs to those in which both residues participate in a β-structure [56–59]. Our derivation instead places all residue pairs in the total ensemble.
β-pairing energy function. Our aim is to predict the specific aggregation pattern of a pair of identical proteins of N amino acids, $\{a_k\}_{1 \le k \le N}$, as determined by the specific β-pairing (either parallel or antiparallel) of the sequence stretch of length L, beginning at position i on the first chain, with the sequence stretch of the same length, beginning at position j on the second chain. We assume throughout the rest of this work that only a single stretch per sequence participates in the β-pairing and that all other residues (from 1 to i − 1 and from i + L to N for the first chain, and from 1 to j − 1 and from j + L to N for the second chain) are not involved in aggregation and are found in a disordered noncompact conformation. We assume further that the energies $E^{d}_{ab}$ of all pairs involving these latter residues can be neglected, since $n^{d}_{ab} \approx n_{ab}$ and $E^{d}_{ab} \approx 0$. Remaining pairs whose residues are both present in the β-aggregating stretches but not specifically paired with each other are assumed to be noncontacting as well. We verified that the results we present in this work do not change upon inclusion of noncontacting pair terms.
The overall pairing aggregation energy for a given parallel/antiparallel pattern is then determined only by residue pairs mutually involved in the ordered β-pairing. Assuming that these pairs contribute independently of one another, it can be written as the sum of the corresponding pair energies over the L paired positions, combined with an entropic term; here the superscripts 1 and 2 on the residue types correspond to the first and second chain, respectively, and $\Delta S = L\,\Delta s$ is the entropy loss due to the β-ordering of the L residue pairs, with $\Delta s$ corresponding to the average entropy loss per residue pair. Due to the many approximations involved in the standard derivation of statistical potentials, the latter extensive term might actually compensate for any bias introduced with the choice of the reference state, making its a priori evaluation too difficult. Therefore we set $\Delta s = -0.2$ throughout all our work on a purely empirical basis. The proper introduction of a sequence-specific $\Delta s_{a_i}$ might certainly improve the quantitative agreement with experimental observations, but we chose to keep our energy-scoring function as simple as possible to directly test the relevance of β-pairing specificity in dictating aggregation patterns. Since the computation of the energy scores $e^{p}_{i,j}(L)$ and $e^{a}_{i,j}(L)$ involves a summation over only L terms, it can be easily performed on a genome-wide scale.
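The scoring just described can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the tables `E_PAR` and `E_ANTI` hold made-up energies for a two-letter alphabet, the antiparallel register (residue i + k of the first stretch facing residue j + L − 1 − k of the second) is assumed, and the entropic term is entered as a cost of 0.2 per paired position, which is only one possible reading of the sign convention for $\Delta s = -0.2$.

```python
# Toy pair energies in units of thermal energy (made-up numbers, NOT fitted values).
E_PAR = {("V", "V"): -1.0, ("V", "K"): 0.3, ("K", "V"): 0.3, ("K", "K"): 0.8}
E_ANTI = {("V", "V"): -0.7, ("V", "K"): 0.1, ("K", "V"): 0.1, ("K", "K"): 0.2}

ENTROPY_COST = 0.2  # assumed cost per paired position (sign convention is an assumption)


def pairing_energy(seq1, seq2, i, j, L, parallel=True, entropy_cost=ENTROPY_COST):
    """Score the pairing of seq1[i:i+L] with seq2[j:j+L] (0-based indices)."""
    if L < 1 or i < 0 or j < 0 or i + L > len(seq1) or j + L > len(seq2):
        raise ValueError("stretch falls outside the sequence")
    table = E_PAR if parallel else E_ANTI
    total = 0.0
    for k in range(L):
        a = seq1[i + k]
        # Parallel: both stretches run in the same direction.
        # Antiparallel: the second stretch is read backwards.
        b = seq2[j + k] if parallel else seq2[j + L - 1 - k]
        total += table[(a, b)]
    return total + L * entropy_cost


def best_pairing(seq, min_len=1):
    """Scan all stretches and both orientations for two copies of `seq`.

    Returns (energy, orientation, i, j, L) for the lowest-energy pairing.
    """
    n = len(seq)
    candidates = []
    for L in range(min_len, n + 1):
        for i in range(n - L + 1):
            for j in range(n - L + 1):
                for parallel in (True, False):
                    e = pairing_energy(seq, seq, i, j, L, parallel)
                    label = "parallel" if parallel else "antiparallel"
                    candidates.append((e, label, i, j, L))
    return min(candidates)
```

Each score is a sum over only L terms, so the exhaustive scan in `best_pairing` stays cheap for a single sequence; calling `best_pairing("KVVVK")` with these toy tables selects the in-register parallel pairing of the central stretch.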
Sequence-dependent aggregation propensities and contact maps. To take into account in a more complete manner all possible pairing energies close to the minimum, we introduce an ''ordered β-pairing partition function'', from which a per-residue aggregation propensity h(k) is obtained. In its definition, $\delta^{k}_{i,i+L} = 1$ if residue k belongs to the L-stretch going from i to i + L − 1 and $\delta^{k}_{i,i+L} = 0$ otherwise. Note that h(k) is a probability since $\sum_{k} h(k) = 1$. It tells how much more likely a given residue is to aggregate in an ordered β-structure with respect to others.
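A per-residue profile of this kind can be sketched as below. The sketch assumes that the partition function is the sum of Boltzmann weights exp(−e) over all scored pairings and that each pairing spreads its weight evenly over its L residues; the 1/L factor is an assumption introduced here so that the profile sums to one, and the function name and input layout are hypothetical.

```python
import math


def aggregation_profile(pairings, n_residues):
    """Compute a partition function Z and a normalised per-residue profile h.

    pairings: list of (energy, i, L) tuples, where `energy` is the score of a
              pairing (units of thermal energy) whose stretch on the first
              chain covers residues i .. i+L-1 (0-based).
    """
    pairings = list(pairings)
    Z = sum(math.exp(-e) for e, _, _ in pairings)
    h = [0.0] * n_residues
    for e, i, L in pairings:
        # Assumed normalisation: the Boltzmann weight is shared equally
        # among the L residues of the stretch, so that sum(h) == 1.
        w = math.exp(-e) / (Z * L)
        for k in range(i, i + L):
            h[k] += w
    return Z, h
```

Residues covered by many low-energy pairings accumulate the largest values of h, so peaks in the profile mark the most aggregation-prone stretches.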
A more complete piece of information that can be extracted from the method is the normalized two-dimensional probability $h_2(k,m)$ of two given residues being found paired to each other within an ordered β-structure. In its definition, k and m label residues in two different chains and $\delta_{k-m+j-i} = 1$ if $k - m + j - i = 0$, and 0 otherwise. Based on $h_2(k,m)$, a β-pairing contact map can be produced in which the orientation (parallel or antiparallel to the diagonal) and the register of the best pairings are easily traced out (see Figure 5).
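The contact map can be illustrated with a short sketch along the same lines. It assumes Boltzmann weighting of each scored pairing, a 1/L factor so that the whole map sums to one, and an antiparallel register in which residue i + t of the first chain faces residue j + L − 1 − t of the second; these conventions and the function name are assumptions made for illustration.

```python
import math


def pairing_contact_map(pairings, n_residues):
    """Build a normalised residue-residue pairing map h2[k][m].

    pairings: list of (energy, i, j, L, parallel) tuples; the stretch starting
              at i on chain 1 pairs with the stretch starting at j on chain 2
              (0-based indices, energies in units of thermal energy).
    """
    pairings = list(pairings)
    Z = sum(math.exp(-p[0]) for p in pairings)
    h2 = [[0.0] * n_residues for _ in range(n_residues)]
    for e, i, j, L, parallel in pairings:
        w = math.exp(-e) / (Z * L)  # assumed normalisation, see lead-in
        for t in range(L):
            k = i + t
            # Parallel pairings run along the diagonal direction of the map,
            # antiparallel pairings run across it.
            m = j + t if parallel else j + L - 1 - t
            h2[k][m] += w
    return h2
```

In the resulting matrix, parallel pairings appear as segments parallel to the main diagonal (on the diagonal itself when in register), while antiparallel pairings appear as segments perpendicular to it.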
We name the full procedure described in this section PASTA.