Switch Region for Pathogenic Structural Change in Conformational Disease and Its Prediction

Many diseases are believed to be related to abnormal protein folding. In the first step of such pathogenic structural changes, misfolding occurs in regions important for the stability of the native structure. This destabilizes the normal protein conformation, while exposing the previously hidden aggregation-prone regions, leading to subsequent errors in the folding pathway. Sites involved in this first stage can be deemed switch regions of the protein, and can represent perfect binding targets for drugs to block the abnormal folding pathway and prevent pathogenic conformational changes. In this study, a prediction algorithm for the switch regions responsible for the start of pathogenic structural changes is introduced. With an accuracy of 94%, this algorithm can successfully find short segments covering sites significant in triggering conformational diseases (CDs) and is the first that can predict switch regions for various CDs. To illustrate its effectiveness in dealing with urgent public health problems, the reason for the increased pathogenicity of the H5N1 influenza virus is analyzed; the mechanisms by which the pandemic swine-origin 2009 A(H1N1) influenza virus overcomes species barriers and infects large numbers of potential patients are also suggested. It is shown that the algorithm is a potentially useful tool in the study of the pathology of CDs because: (1) it can identify the origin of pathogenic structural conversion with high sensitivity and specificity, and (2) it provides an ideal target for clinical treatment.


About testing set
Clinical reports on actual cases with definite and confirmed nosogenesis are very scarce. References [2,3,4,5,6] describe a total of 31 proteins responsible for various conformational diseases. Twenty-two of them have usable structural information. Since our method is based on knowledge of non-membrane proteins, this restricts the scope of the application, and four membrane or membrane-associated proteins are unsuitable for our method (amyloid-β precursor protein, α-ketoacid dehydrogenase complex, β-hexosaminidase, α-synuclein). Here we analyzed all other classical proteins that are believed to be responsible for different conformational diseases, and identified regions that cover significant sites for pathogenic structural changes. The conformational diseases can be classified as blood disorders, cardiovascular disease, cerebrovascular disease, neuropathy, encephalopathy, cancer, blindness, and kidney disease, among others.
We treat each protein as a series of successive, overlapping short residue segments. For each segment, we evaluate its ability to give rise to pathogenic structural change by its probability of jumping from the native helix-donut/strand-arc state to the other state. Regions with high interchange probabilities are listed in Table 1, with brief descriptions of their vital roles identified in clinical research. For fibrinogen, we could not find any clinical reports regarding the residues covered by the structure in the PDB database. It was therefore difficult to evaluate the prediction for this protein, and the result is not listed.
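The segment-by-segment scan described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the 15-residue window length is taken from the figure captions, while the function names, the toy sequence, and `toy_score` (a stand-in for the interchange probability, which is not defined in this section) are hypothetical.

```python
from typing import Callable, List, Tuple

WINDOW = 15  # assumed window length (the figure captions mention 15-residue segments)


def windows(sequence: str, size: int = WINDOW) -> List[Tuple[int, str]]:
    """Return (1-based central residue index, segment) for every overlapping window."""
    half = size // 2
    return [
        (start + half + 1, sequence[start:start + size])
        for start in range(len(sequence) - size + 1)
    ]


def best_window(sequence: str, score: Callable[[str], float],
                size: int = WINDOW) -> Tuple[int, int, int]:
    """Return (center, first residue, last residue) of the highest-scoring window.

    Assumes an odd window size and a sequence at least `size` residues long.
    """
    center, _ = max(windows(sequence, size), key=lambda w: score(w[1]))
    half = size // 2
    return center, center - half, center + half


def toy_score(segment: str) -> float:
    """Hypothetical stand-in for the interchange probability: fraction of 'C' residues."""
    return segment.count("C") / len(segment)


# Made-up 21-residue sequence with 'C' at positions 6, 7, 11 and 20 (1-based).
toy_sequence = "".join("C" if i in {6, 7, 11, 20} else "A" for i in range(1, 22))
print(len(windows(toy_sequence)), best_window(toy_sequence, toy_score))
```

Indexing each window by its central residue keeps the scan output directly comparable with per-window plots; any real scoring function can be passed in place of `toy_score`.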

Difference between hot spots of fibril formation and switch region
As shown in the example of transthyretin, amyloid-related mutations are not necessarily located in aggregation-prone regions. It is still not clear whether initial/switch sites occur in hot spots of aggregation or not. Compared with the small number of switch sites, aggregation-prone sites are abundant in disease-related proteins: according to de Groot et al. [7], at least one-third of residues are involved in predicted hot spots.

Insulin
Insulin is a peptide hormone with extensive effects on metabolism and many body systems. Insulin injection is used medically to treat some forms of diabetes mellitus. Under solution conditions where the native state is destabilized, this largely helical polypeptide hormone can readily aggregate to form amyloid fibrils with a characteristic cross-β structure. Consequently, it is associated with a clinical syndrome, injection-localized amyloidosis [11].
Mass spectrometry analysis revealed a characteristic feature of insulin amyloid fibril formation: the disulfide bonds of the native hormone are retained in the amyloid form, providing substantial constraints on refolding. Moreover, according to the work of Jiménez et al., a segment donating such disulfide bond constraints plays a significant role in forming the initial aggregates of insulin amyloid fibrils, i.e. it is a switch region [12]. As shown in figure S1, this coincides with our prediction that segment 6-20 is the switch region of insulin. In fact, this segment is the minimum window that contains every cysteine responsible for the aforementioned disulfide bond constraints, so the prediction matches these sites closely.
Figure 1: Results for insulin (PDB ID: 1ai0 A, 21 residues in length). (A) Structure of insulin. Cysteines responsible for disulfide bonds conserved in refolding are shown as bonds (6 yellow, 7 green, 11 blue, and 20 magenta). (B) The three native disulfide bonds (gold) shown in the insulin structure and topology diagram [12]. (C) Interchange probability for each 15-residue segment, indexed by its central residue. (D) Interchange probability for each residue site. In A and D, predicted switch regions are shown in red. This region contains every donor of the disulfide bonds that act as constraints on refolding and contribute to the initial aggregation of insulin.

Apolipoprotein AI
Apolipoprotein AI (Apo-AI) is the major protein component of high-density lipoprotein (HDL) in plasma. The protein promotes cholesterol efflux from tissues to the liver for excretion. As an acceptor for sequential transfers of phospholipids, Apo-AI has two states in vivo (the lipid-free state and the lipid-bound state), with quite different conformations. Lipid-free Apo-AI comprises an N-terminal four-helix bundle and two C-terminal helices. Some mutations of Apo-AI can result in hereditary amyloidosis, such as familial amyloid polyneuropathy and familial visceral amyloid [2]. Since amyloidosis is a consequence of protein aggregation, lipid-free Apo-AI should be responsible for these diseases. It therefore falls within the scope of our method.
The stability of lipid-free Apo-AI should be vital for preventing amyloidogenesis of Apo-AI. It is reported that mutations found in human amyloid deposits occur more frequently at the amino terminus of Apo-AI. This is because the N-terminal four-helix bundle, especially helices A (residues 1-43), is essential for the structural stability of lipid-free Apo-AI [14]. The structure of the truncated protein 1-43 is quite different from the wild-type fold. Consequently, there should be some residues in helices A that govern the stability of the native fold of lipid-free Apo-AI. In helices A, point mutations at sites 3, 10, 13, and 26 result in various clinical consequences [15]. These sites should act as switches in destabilizing the native fold of lipid-free Apo-AI. As shown in figure S2, the predicted switch region, residues 1-15, lies within helices A. Three of the four aforementioned disease-related sites are in the predicted region, which supports our result.
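The "sites in the predicted region" count used here, and for several other proteins in this section, is simple interval membership. A minimal sketch using the Apo-AI numbers quoted above; the function name and the optional `margin` parameter (for sites that are merely close to the region) are hypothetical, not part of the published method.

```python
from typing import Iterable, List, Tuple


def sites_in_region(sites: Iterable[int], region: Tuple[int, int],
                    margin: int = 0) -> List[int]:
    """Return the sites that fall inside region = (first, last), inclusive.

    `margin` (a hypothetical parameter) widens the region on both sides to
    capture sites that are close to, but not inside, the predicted segment.
    """
    first, last = region
    return [s for s in sites if first - margin <= s <= last + margin]


# Apo-AI: disease-related sites in helices A and the predicted switch region 1-15.
apo_sites = [3, 10, 13, 26]
apo_hits = sites_in_region(apo_sites, (1, 15))
print(f"{len(apo_hits)} of {len(apo_sites)} sites lie in the predicted region: {apo_hits}")
```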

Calcitonin
Calcitonin is a 32-residue peptide hormone that is produced by the C-cells of the thyroid and is mainly known for its hypocalcemic effects and the inhibition of bone resorption. Amyloid fibrils of human calcitonin have been found to be associated with medullary carcinoma of the thyroid. Calcitonin has little secondary structure at room temperature. However, after a conformational conversion, calcitonin fibrils are highly ordered, consisting of both helix and strand elements.
Recent work indicates a critical role of residues 15-21 in fibril formation and bioactivity of calcitonin [16,17]. In particular, the conformation and the topological features of the side chains of residues 18 and 19 are strongly associated with the self-assembly state, binding affinity, and in vivo hypocalcemic potency of human calcitonin. A related report shows that the joint mutations Y12L, N17H, A26N, I27T, and A31T hamper the pathogenic refolding and result in a non-amyloidogenic analogue of human calcitonin [18]. All the aforementioned sites are important for the disease-related stability of calcitonin. As shown in figure S3, segment 16-31 is identified as the switch region in our prediction. This agrees well with the aforementioned studies.
Figure caption (partial): The interchange probability for each residue site. Sites related to the non-amyloidogenic analogue of human calcitonin, namely sites vital for inhibiting the pathogenic refolding, are marked in yellow. In A and C, predicted switch regions are shown in red.

Cystatin C
Wild-type human cystatin C is a high-affinity inhibitor of some human cysteine proteases that belong to the papain family, such as cathepsins B, H, K, L, and S. In pathological processes, it forms part of the amyloid deposits in brain arteries that lead to cerebral angiopathy. Patients usually die in their teens from cerebral hemorrhage. The formation of amyloid cystatin C is claimed to be due to conformational changes in the monomer and subsequent domain swapping in the β-fibril structure [19].
According to clinical reports on cystatin C, the most important mutation is at 68 Leu→Gln, which is associated with a severe conformational disease and causes massive amyloidosis, cerebral hemorrhage and death in young adults [20,21]. Our analysis showed that a peak in interchange probability occurs at window 67 (Figure S4). This means that residues 61-74 around site 68 are critical in this conformational disease, and are related to the initiation of structural changes in view of the double-zone feature of the polypeptide phase space.

Hemoglobin
Hemolytic anemia is a disorder in which destruction of red blood cells is faster than their production by bone marrow. Many cases of this disease are believed to be due to the presence of unstable hemoglobin that can change its structure and result in disorder. There are various causes of hemoglobin instability, such as single-point mutations, insertion or deletion of amino acids, frame shifts, etc. Unstable hemoglobin molecules lead to varying degrees of hemolysis [22].
Here we analysed the β-chain of human hemoglobin, which is believed to be responsible for several hemolytic anemias. The detailed results are shown in Figure S5. Two successive peaks in the interchange probability occur at windows 107 and 118. The two polypeptides overlap and have much higher interchange probabilities (at least two-fold) than the other sites. This means that the switch region for hemoglobin is longer than 15 residues and should be extended. Thus, residues 99-125 were predicted to be prone to conformational changes leading to disorder. This result agrees with clinical reports. According to the database in reference [23], there are 43 residue sites in total for which point mutations related to hemolytic anemia have been observed. These sites are interspersed along the 146-residue sequence. Among the corresponding variants, there are four highly unstable mutants (V60E, L110P, A115D, Q127R). Three of the four are in or close to the region we predicted, which suggests that the predicted switch region is significant for the structural stability of the hemoglobin β-chain.
Figure caption (partial): Sites related to the highly unstable mutants are shown in various colors (60 blue, 110 yellow, 115 cyan, 127 magenta). In A and C, predicted switch regions are shown in red. The highly unstable mutations tend to occur in or close to the region we predicted. Therefore, the disease-related stability of the hemoglobin β-chain should be highly sensitive to this region.

Gelsolin
Gelsolin is an actin-binding protein that is a key regulator of actin filament assembly and disassembly. Wild-type gelsolin is not associated with any amyloid pathology; however, inheritance of some types of mutations, e.g. D187N or D187Y, confers 100% penetrance of Finnish hereditary systemic amyloidosis, which is characterized by extensive skin, arterial, neurologic, and ophthalmologic amyloid deposition. Unlike some disease-related proteins that undergo amyloidogenic conformational changes in the native fold, gelsolin is associated with amyloidosis due to aberrant cleavage of its precursor protein, i.e. the formation of the 70- and 71-residue amyloidogenic gelsolin fragments found in patients. Abnormal cleavage is a direct cause of the disease.
As shown in figure S6, segment 243-264 is predicted to be the switch region of gelsolin. This suggests that a disease-related mutation induces a notable conformational change in region 243-264, which produces a sufficient condition for aberrant cleavage, e.g. a peptide specifically susceptible to hydrolase digestion. This coincides with the discovery of Page et al. that gelsolin amyloidogenesis is triggered by metalloendoprotease cleavage [24]. Fittingly, the cleavage site is either A242-M243 or M243-L244.

Lysozyme
Lysozyme is an antibacterial protein for which mutations are associated with familial visceral amyloidosis in the liver, spleen, kidneys, and other internal organs. There are five known mutations in the human lysozyme gene that give rise to six variant proteins, 56 Ile→Thr, 57 Phe→Ile, 64 Trp→Arg, 67 Asp→His, 70 Thr→Asn, and the double mutation F57I&T70N. All the variants apart from T70N have been detected in association with amyloid deposits in various human patients.
There are two candidate switch regions, around sites 57 and 67. As shown in figure S7B, compared with the structure of the wild-type protein, mutation T70N can result in considerable structural rearrangement at sites 68-75 [25] without causing conformational disease. Therefore, the second site is likely not important for the initiation of pathogenic structural changes and can be excluded. In fact, according to figure S7B, there is some structural rearrangement at sites 45-51 in the T70N variant, but this is not large enough to cause conformational disease. However, the structural rearrangement in this region is very strong for the disease-related, amyloid-forming mutant D67H. This indicates that sites 45-51 correspond to a switch region for lysozyme. As shown in Figure S7, successive peaks in the interchange probability occur at windows 48-50, so that polypeptide 41-57 should be significant in the initiation of structural changes, in agreement with previous research.

Fibrillin-1
Fibrillin-1 is the basic structural element of the microfibrils that form a sheath surrounding the amorphous elastin. Mutations in fibrillin-1 are associated with Marfan syndrome, an autosomal dominant disorder affecting mainly the cardiovascular, skeletal and ocular systems. The range of clinical severity of Marfan syndrome is strikingly wide. In the most severe form, children with neonatal Marfan syndrome have severe cardiac valve regurgitation and dilation of the proximal aorta, which usually lead to heart failure and death in the first year of life [26].
The structure of the calcium-binding epidermal growth factor-like domains from human fibrillin-1 was reported in 1996 [27]. According to the Marfan database [28], there are three disease-related mutations (D2127E, N2144S, and C2151W) in this domain. As shown in figure S8, the second and third mutations occur in or near the switch region we predicted.

β2-microglobulin
β2-microglobulin is a non-polymorphic light chain of the class I major histocompatibility complex (MHC-I) that plays an important role in the immune system, autoimmunity, and reproductive success [29]. As part of its normal catabolic cycle, β2-microglobulin dissociates from MHC-I and is transported in serum to the kidneys, where the majority of the protein is degraded. In renal failure, β2-microglobulin does not pass through the dialysis membrane, so its clearance from serum is disrupted. This results in an increase in serum β2-microglobulin. When a high blood level is maintained for more than 10 years, the protein self-associates to form amyloid fibrils, causing dialysis-related amyloidosis [30,31].
It is believed that the formation of amyloid fibrils of β2-microglobulin is accompanied by a significant conformational change. There are two successive peaks in the interchange probability at central sites 14 and 16 in Figure S9, indicating that sites 7-23 are likely involved in the initiation of structural changes in β2-microglobulin. This should correlate with the observation that fragment 21-31 is the well-known amyloidogenic core fragment of β2-microglobulin [32,33,34]. Unlike some other proteins, wild-type β2-microglobulin can aggregate under the influence of ageing. In this process, acidification, e.g. at site 17 (N→D), is necessary to form amyloid fibrils from both wild-type β2-microglobulin and its variants [35]. This means that the residues around site 17 are significant in triggering pathogenic refolding for both wild-type β2-microglobulin and its variants.
Figure caption (partial): In A and C, predicted switch regions are shown in red. There are three overlapping residues between the predicted switch region and the amyloid core. Residue N17 is important for the pathological mechanism of amyloid formation of wild-type β2-microglobulin.
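The three-residue overlap between the predicted switch region (7-23) and the amyloidogenic core fragment (21-31) can be checked with a short interval intersection. This is only a worked arithmetic sketch; the function name is hypothetical.

```python
from typing import List, Tuple


def overlapping_residues(a: Tuple[int, int], b: Tuple[int, int]) -> List[int]:
    """Return the residue indices shared by two inclusive segments (empty if none)."""
    first = max(a[0], b[0])
    last = min(a[1], b[1])
    return list(range(first, last + 1))


switch_region = (7, 23)   # predicted switch region quoted in the text
amyloid_core = (21, 31)   # amyloidogenic core fragment quoted in the text
shared = overlapping_residues(switch_region, amyloid_core)
print(len(shared), shared)
```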

Superoxide dismutase, SOD
The enzyme superoxide dismutase is a metalloprotein that catalyzes the dismutation of superoxide into oxygen and hydrogen peroxide. It is therefore an important antioxidant defense in nearly all cells exposed to oxygen. There are three major families of superoxide dismutase, depending on the metal cofactor: Cu/Zn type, Fe/Mn type, and Ni type. Some mutations in the Cu/Zn SOD enzyme can cause familial amyotrophic lateral sclerosis.
According to reference [36], there are 26 residue sites in total for which disease-related point mutations have been reported. As shown in figure S10, segment 37-48 is the region with the highest density of these 26 sites. This agrees with our prediction that segment 30-46 is the switch region for pathogenic structural changes in the Cu/Zn SOD enzyme.

Transthyretin
Transthyretin (TTR) is a serum and cerebrospinal fluid carrier of the thyroid hormone thyroxine. It also acts as a carrier of retinol (vitamin A) through an association with retinol binding protein. Amyloid deposition of TTR is associated with several diseases, such as senile systemic amyloidosis, familial amyloid neuropathy, and familial cardiac amyloid [2]. The fibrillar structure resulting from self-association of an abnormal conformation of TTR is thought to be the causative agent in these disorders. The majority of TTR-associated amyloidoses are due to single amino-acid substitutions. In senile systemic amyloidosis, the non-mutated protein is present in amyloid fibrils [37]. However, the mechanism by which normally soluble TTR tetramers are converted into insoluble amyloid fibrils remains largely unknown.
Here we analyzed normal human TTR to identify sites involved in the initial structural changes in this protein. As shown in Figure S11, polypeptide 46-69 (covering the windows centered on the two successive peaks at sites 53 and 62) was identified as the switch region for conformational changes in TTR. Clinical data demonstrated that the mutation 55 Leu→Pro can cause early-onset familial amyloidotic polyneuropathy [38]. According to clinical reports, L55P is the most notorious mutant, with onset of clinical disease at approximately 20 years of age. In comparison, the age of onset is approximately 30 years for V30M carriers and 80 years for wild-type subjects. Analysis based on the crystal structure of the L55P mutant showed that site 55 is an important site in the pathway for TTR polymerization to amyloid fibrils [39]. The amyloidogenic regions experimentally determined to date are 10-19 [40] and 105-115 [41], but no segments covering site 55 have been identified so far.
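The extended region quoted for TTR is consistent with taking the union of the overlapping windows centered on successive peaks. A minimal sketch under that reading, assuming 15-residue windows (half-width 7, from the figure captions); the function name and the merging rule are illustrative assumptions rather than the authors' stated procedure.

```python
from typing import Iterable, List, Tuple

HALF_WIDTH = 7  # assumed: half of a 15-residue window, excluding the central residue


def merge_peak_windows(centers: Iterable[int],
                       half_width: int = HALF_WIDTH) -> List[Tuple[int, int]]:
    """Merge the windows centered on each peak into non-overlapping segments."""
    spans = sorted((c - half_width, c + half_width) for c in centers)
    if not spans:
        return []
    merged = [spans[0]]
    for first, last in spans[1:]:
        prev_first, prev_last = merged[-1]
        if first <= prev_last:
            # Windows overlap: extend the current segment.
            merged[-1] = (prev_first, max(prev_last, last))
        else:
            # No overlap: start a new segment.
            merged.append((first, last))
    return merged


# Peak positions quoted in the text for TTR (53 and 62).
ttr_region = merge_peak_windows([53, 62])
print(ttr_region)
```

Under the same assumption, the peaks quoted for β2-microglobulin (14 and 16) and lysozyme (48-50) give 7-23 and 41-57, matching the segments reported in those sections.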

Tumor suppressor protein p53
p53 is a transcription factor in multicellular organisms, where it regulates the cell cycle and thus functions as a tumor suppressor in preventing cancer. As such, p53 has been described as "the guardian of the genome", referring to its role in conserving stability by preventing genome mutation. In about 50% of human cancers, p53 is inactivated as a result of missense mutation in the p53 gene.
p53 is the most complicated molecule we have dealt with. As shown in figure S12, there are twenty sites in total for which the activity of mutants decreases by more than 50% [42]. These sites are interspersed along the 289-residue fold. All mutants without biological activity are located in the C-terminal half (site index > 195) of the p53 sequence. Moreover, five of the top six amino acid residues most frequently mutated in human cancer (Arg-175, Gly-245, Arg-248, Arg-249, Arg-273, and Arg-282) are in this half. This means that the C-terminal half is significant for conserving the activity of p53 (largely due to the conservation of the properties of the DNA-binding core domain [43]). We predicted region 191-205 to be the switch region of p53. The peaks in the C-terminal half are higher than those in the other half, which fits the aforementioned knowledge qualitatively. However, the stability features of p53 are extremely complicated. In figure S12, there are several peaks with similar levels of interchange probability. Each of them involves several sites for which highly destabilizing mutants have been reported [44]. As such, there are probably several switch regions in p53, which reduces the accuracy of our method because its basic hypothesis (that a protein has only one switch region) does not hold here.
Figure caption (partial): The interchange probability for each residue site. We found that p53 is extremely complicated. (i) Many sites are sensitive with respect to p53's biological activity: the twenty sites for which mutant activity decreases by more than half relative to the wild-type protein are marked in yellow, and sites at which point mutations produce variants with zero biological activity are marked in green. (ii) Nearly every site corresponds to some disease-related point mutation: magenta points show the sites of the top six most frequently mutated residues in human cancer [42]. (iii) Various regions are significant for the disease-related stability of p53: blue points are sites corresponding to the highly destabilizing mutants reported in [44], and positions related to the top five most highly destabilizing variants are marked in cyan. In A and C, predicted switch regions are shown in red.

Serpins
Serpins are a family of proteins that inhibit proteases via a profound conformational change that irreversibly locks the protease and serpin together. The normal physiological functions of serpins are based on such highly specific transitions in the natural conformation. Premature conversion of the protein structure will result in a deficiency or dysfunction of the inhibition of proteases, then fibril formation and quite different disease consequences such as emphysema, cirrhosis, and thromboembolic disease [2].
The results of our analysis are shown in Figure S13. Initiation sites for serpins are predicted to be around site 379 (polypeptide 372-386), in agreement with the results of Johnson et al., who reported that the wild-type residue Glu-381 plays an important role in stabilising the native, inserted, and activated states of serpin proteins [45]. In fact, the predicted region lies at the N-terminus of the reactive loop of serpins, which is vital for the biological properties of the protein.

Crystallin
Crystallin is a water-soluble structural protein found in the lens of the eye. It is the major protein of the eye lens, accounting for the transparency of the lens. Mutations and aging of crystallins cause cataracts, the predominant cause of blindness in the world. In human gamma-D crystallin, there are five residue sites in total for which mutants associated with congenital cataracts have been reported, i.e. R14, P23, R36, R58, and E106. Three of the five are in segment 14-36, which we predicted as the switch region of gamma-D crystallin (figure S14). Moreover, the threonine substitution at site 23 is reported to cause pivotal local conformational and dynamic differences in human gamma-D crystallin [46]. All this evidence supports our result.

Low-Density Lipoprotein(LDL) receptor
The LDL receptor is a mosaic protein that mediates endocytosis of cholesterol-rich lipoprotein particles. The amino-terminal region of the LDL receptor, which consists of seven tandemly repeated cysteine-rich modules (LDL-A modules), mediates binding to lipoproteins. Normally these LDL-A modules extend out into the extracellular fluid and capture the lipoproteins there. The captured lipoprotein is then imported into the cell by receptor-mediated endocytosis. Mutations of the LDL receptor that affect this process result in failure to clear lipoprotein from the circulation, pathologically elevated blood cholesterol, and premature heart disease.
Many point mutations that cause familial hypercholesterolaemia map to the fifth LDL-A module of the LDL receptor (LR5). As this module works far from the membrane, largely not in a membrane-like environment, we can predict the switch region of LR5 with our method. As shown in figure S15, segment 25-39 is predicted as the switch region of LR5. This coincides with the observation that disease-related point mutations mainly map to a cluster of acidic residues near the carboxy-terminal end of LR5. There are eight residue sites in total for which disease-related point mutations have been observed [13]. Six of them are in the region we predicted.
Figure caption (partial): The interchange probability for each residue site. In A and C, predicted switch regions are shown in red. The region we predicted is the segment with the highest density of disease-related sites, and should be the switch region of the low-density lipoprotein receptor.

Cystic fibrosis transmembrane conductance regulator(CFTR)
Cystic fibrosis transmembrane conductance regulator is an ion channel that transports chloride ions across epithelial cell membranes. This transporter-class protein is a complex polytopic membrane protein with several large cytosolic domains, including the regulatory domain R and two cytoplasmically oriented tails, nucleotide binding domains (NBD) 1 and 2. As shown in figure S16B, these domains extend out into the body fluid and lie largely in a non-membrane-like environment. Mutations in these domains can disrupt the biosynthetic process and result in cystic fibrosis. We analyzed the NBD1 domain of CFTR. Sites 546-561 were predicted as the switch region of NBD1 of CFTR. As shown in figure S16D, this is the region with the highest density of missense mutations responsible for cystic fibrosis [47].