Large Ankyrin repeat proteins are formed with similar and energetically favorable units

Ankyrin-containing proteins are one of the most abundant repeat protein families present in all extant organisms. They are made of tandem copies of similar amino acid stretches that fold into elongated architectures. Here, we built and curated a dataset of 200 thousand proteins that contain 1.2 million Ankyrin regions, and we characterize the abundance, structure and energetics of the repetitive regions in natural proteins. We found a continuous, roughly exponential distribution of array lengths with an exceptional frequency at 24 repeats. Individual repeats are seldom interrupted by long insertions and accept few deletions, in line with the known tertiary structures. Longer arrays are made up of repeats that are more similar to each other than those of shorter arrays, and display more favourable folding energy, hinting at their evolutionary origin. The array distributions show that there is a physical upper limit to the size of an array of repeats of about 120 copies, consistent with the limit found in nature. The identity patterns within the arrays suggest that they may have originated by sequential copies of more than one Ankyrin unit.

Natural proteins that are formed with repetitions of stretches of amino acids are abundant in extant organisms [1]. Some proteins contain repetitions of short stretches, forming fibrillar structures like collagen, and some contain longer repetitions of globular domains like beads on a string. In between, there is a class of proteins that is formed with tandem repetitions of similar stretches of about 30 to 40 residues. This kind of protein (from now on, repeat proteins) is present in all organisms and is believed to be an ancient system [2]. Typically these polypeptides form elongated structures where each repeat motif packs against its nearest neighbors, stabilizing an overall super-helical fold [3]. Since most of the structural characterization of these proteins was performed on model systems of short arrays that are experimentally amenable, we aim at characterizing the overall structures of an abundant family of proteins.

Ankyrin repeat proteins (ANKs) are usually described as linear arrays of tandem copies of a 33-residue motif that folds into an α-loop-α-β-hairpin/loop. Being one of the most common repeat proteins in nature, these molecules are believed to mediate specific protein-protein interactions [4]. Most of the structural knowledge about ANKs is derived from the study of systems of biomedical relevance (the protein Ankyrin that gives its name to the family, but also p16, Notch, IκB, etc. [5], [6], [7], [8]) and from designed ANK proteins [9]. In these cases, the proteins are formed with relatively few repeats, between 3 and 7, with a 12-repeat protein being the largest one for which folding was studied [10]. The folding of these repeat arrays can usually be described with a simple 1-D Ising model in which the most favourable repeats form a nucleus and structure propagates to near neighbors [6], [10], [11], [12], [13]. Small energetic inhomogeneities along the structure can break the folding cooperativity of multiple repeats and give rise to folding intermediates [14], [15], [16]. Thus longer arrays are expected to break into folding subdomains of different stability [17], [18]. Moreover, good approximations to the folding energy can be constructed from statistical analysis of the extant sequences [19], [20]. We studied here the abundance, length distribution and energetics of ANK arrays in natural polypeptides.

In contrast to most globular domains, repeat proteins are believed to evolve distinctively by duplication and deletion of internal repetitions [2], [21], [22], [23]. It was recently suggested that this horizontal evolution is accelerated compared to their vertical divergence in related species [24]. The internal sequence similarity in each protein suggested that the domain repeats are often expanded through duplications of several domains at a time, while the duplication of one domain is less common, although no common mechanism for the expansion of repeats was found [23]. Here we re-examine the correlations of sequence similarity in ANKs and describe the occurrence of multiple types of duplication mechanisms within this family.

Repeats detection and array construction
In order to detect a majority of the possible Ankyrin repeats, we searched the full UniProtKB database [25], including the manually reviewed Swiss-Prot (February 2019) and all the unreviewed TrEMBL (December 2017) sequences. We used the structurally derived hidden Markov models (HMMs) developed by Parra et al. [26] for ANK repeats: one for internal repeats, one for C-terminal repeats and another one for N-terminal repeats. These models fix a consistent phase for repeat detection. We scanned the whole database, split into single-sequence FASTA files, with the hmmsearch tool with default parameters [27] using the internal repeat HMM, detecting 194938 sequences with at least one hit. Subsequently, we ran hmmsearch with the other two HMMs in order to detect terminal repeats, and we eliminated the redundant hits.

To build aligned repeats from the HMMER hits, we identify every amino acid (AA) matched by the model in the corresponding full sequence, and we copy the AA before and after those detected that are needed to complete a 33 AA repeat. We take into account three particular cases: insertions inside the repeat, deletions and truncations. To resolve deletions and truncations, we simply admit the gap character '-' in our AA alphabet. In the case of insertions inside the repeat, we eliminate the corresponding positions, whatever the insertion length. A double detection is possible when HMMER independently identifies two hits that belong to the same repeat. After completing the repeats, we eliminated the double detections. We obtained a Multiple Repeat Alignment (MRA) of more than 1.2 million repeat sequences with exactly 33 positions.

In previous works, it has been reported that insertions between ANK repeats have a characteristic length of 17 AA [26]. However, when analyzed over the full primary structure, we find a length distribution that extends beyond this (not shown). The distribution of insertions between repeats displays a visible peak corresponding to an entire repeat length of 33 AA. In these cases, we interpret that the HMMs failed to detect a repeat between two consecutive ones. Taking this observation into account, we define an array as the concatenation of consecutive repeats that are less than 67 AA away from each other. With this definition, we allow for the loss of one repeat in detection plus an insertion of 17 AA on each side of the lost repeat. We note that we allow more than one array for each full sequence, all of which we keep for analysis. Also, we note that the sequence database thus constructed does not necessarily represent the total universe of sequences, but is biased by human sequencing choices and by the phylogenetic relationships between the sequences. To minimize these biases in the analysis, we clustered the data by similarity using CD-hit [28] with a cutoff of 90%, and we assigned to each sequence a weight defined as 1/n_i, where n_i is the number of sequences in the i-th cluster. This way, we end up with 153209 effective arrays of ANK repeats. We took these weights into account in all the statistical calculations in this work.
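The array definition and the cluster weighting described above can be illustrated with a minimal Python sketch. It is not the pipeline used in this work: the function names, the (start, end) coordinate convention and the reading of "less than 67 AA away" as the number of residues strictly between two repeats are assumptions made for illustration, and the cluster assignment is taken as given (for example, parsed from a clustering tool's output).

```python
from collections import Counter

MAX_GAP = 67  # residues; repeats separated by fewer than this join the same array


def group_into_arrays(repeats, max_gap=MAX_GAP):
    """Group (start, end) repeat coordinates (1-based, inclusive) into arrays.

    Assumption: the distance between two repeats is the number of residues
    strictly between the end of one and the start of the next.
    """
    arrays = []
    for start, end in sorted(repeats):
        if arrays and start - arrays[-1][-1][1] - 1 < max_gap:
            arrays[-1].append((start, end))  # close enough: extend current array
        else:
            arrays.append([(start, end)])    # too far (or first repeat): new array
    return arrays


def cluster_weights(cluster_of):
    """Map each sequence id to 1/n_i, where n_i is the size of its cluster."""
    sizes = Counter(cluster_of.values())
    return {seq: 1.0 / sizes[cluster] for seq, cluster in cluster_of.items()}


# Hypothetical example: three repeats, the last one far from the first two.
arrays = group_into_arrays([(1, 33), (34, 66), (150, 182)])
weights = cluster_weights({"seqA": 0, "seqB": 0, "seqC": 1})
```

With these weights, the effective number of arrays is the sum of the weights of the sequences that carry them, so each cluster contributes as a single sequence regardless of its size.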

Sequence identity calculations
We define the pairwise identity (pID) between two repeat sequences as the normalized number of identical AA at identical positions, excluding gap coincidences. We compute the pID between every pair of internal repeats in each array, distinguishing whether they are first, second or i-th neighbors. We treat terminal repeats as different natural objects, so we do not compare them to internal repeats in the identity analysis. Consistently, we consider arrays from four repeats onward, so that each has at least two internal repeats to compare.

Autocorrelation analysis
We compute an autocorrelation vector (ACV) between the repeats r in an array as proposed by Björklund et al. [23]. The n-th component of the vector is the mean value of the pID over all pairs of repeats at neighborhood n, normalized by the mean pID at first neighbors for the array:

ACV_n = \frac{\langle pID(r_i, r_j) \rangle_{|i-j|=n}}{\langle pID(r_i, r_j) \rangle_{|i-j|=1}}    (1)

Energetic modeling
We consider an Ankyrin repeat sequence as a state \sigma = (a_1, a_2, \ldots, a_L), with L = 33, as previously done [20]. Each position is occupied by one of the 20 amino acids or the gap character, so it has 21 possibilities. We assume that the system is in the state \sigma with a probability distribution that is mathematically equivalent to the Boltzmann distribution [29], [20],

P(\sigma) = \frac{1}{Z} e^{-E(\sigma)},

taking the temperature such that k_B T = 1. Here E(\sigma) is the energy of the state \sigma and Z is the partition function. If we assume that positions are independent, discarding any interaction between different sites along the sequence, the energy can be written as

E(\sigma) = -\sum_{i=1}^{L} h_i(a_i),

where h_i(a_i) is a local energy field that indicates the propensity to find an amino acid a_i at position i. It can be calculated from the frequency f_i(a_i) of finding residue a_i in column i of the MRA,

h_i(a_i) = \ln f_i(a_i) + C.

We choose the constant C by imposing the condition \sum_{a_i} h_i(a_i) = 0.
The natural frequency f_i(a_i) was measured taking into account the weights determined by the full sequence similarity clustering.

[...]

However, repeats usually do not come alone in natural sequences, but next to each other in long tandems that form arrays. Given these definitions, we can find one or more arrays of repeats in each natural protein (Fig 1).

We collected and curated a database of 1.2 million repeats, constructed as defined in Methods and organized in 257703 arrays, which we weight by phylogeny to obtain 153209 effective arrays. In 74% of cases all the repeats in a protein cluster together in a single array, while 19% of the proteins have two arrays, only 3% have three and only 4% have four or more arrays. Notably, there are example proteins that have up to 10 arrays. The effective arrays belong to Eukaryota in 85.5% of cases, Bacteria in 13.0%, Viruses in 1.4% and Archaea in 0.1%, in line with a previous census [1].

We classify the data according to array length, that is, the number of repeated units in each array. The distribution is presented in Fig 2A as a histogram. There is a large number of arrays of just one repeat unit, representing 19% of arrays, of which 50% were detected as single repeats in the natural sequence and the remainder are at least 67 residues apart from their nearest neighbour. Since it is known that ANK proteins require multiple repeats to acquire a stable fold [30], [31], [13], these may represent misdetections of ANK patterns in unrelated sequences, as suggested by their energetic distribution (see below). The abundance of arrays decreases roughly exponentially with array length, with an anomalous peak around 23 repeats. The length distribution is not homogeneous across the domains of life, with the longest arrays being found exclusively in Eukaryota (Fig S1).

To analyze the distribution of arrays taking into account the total protein length, we combine the information in a heat map, presented in Fig 2B. There is a forbidden region in the plot, as sequences must have a minimum length of 33 × N residues to contain N units of 33 residues. The proteins in which the whole polypeptide is formed by a single ANK array fall on the diagonal, notably up to one hundred repeats. On the upper left side there is a heterogeneity in the population distribution, with most proteins being [...] of proteins over 10000 residues long that contain short arrays. Notably, arrays of 22 to 23 repeats stand out in sequences from 3000 to 8000 residues long. It is interesting to note that there is one protein with an array of 23 repeats for which the crystallographic structure has been solved [32]. Analysis of this structure shows that our automatic repeat annotation missed one terminal repeat, and that the exact number of 24 ANK repeats corresponds to a complete turn of an ANK super-helix of ∼60 Å diameter and ∼150 Å height [32]. Thus, the anomalous peak we detect in the length distribution at 22 to 24 repeats may correspond to compact arrays of ANK repeats that make one complete turn when folded.

[Figure caption] We searched the whole UniProt database and detected repeats with a structurally based HMM sequence model. If the detected repeats are separated by less than 67 residues, we define them as belonging to the same array. In the above example, sequence A codes for a single array, and sequence B codes for two arrays. Finally, we get a Multiple Repeat sequence Alignment (MRA) of more than 1.2 million repeat sequences with exactly 33 positions belonging to specific arrays.

Natural ANK repeats do not always have exactly 33 residues [26]. Usually the structure can tolerate insertions, which we detect in the primary structure with the protocol described in Methods. We found that insertions occur in only 9% of the repeats of natural proteins. The distribution of insertion lengths shows that the majority are of just one amino acid, and insertions longer than 5 residues are rare (Figure 3A). The sites where insertions occur along the ANK repeat are clearly not random (Figure 3B). Tertiary structure studies have previously characterized the insertion tolerance of ANK arrays [26], in excellent agreement with what we detect here in the primary structure: insertions are more likely in two regions of the repeat, positions 6-7 and 17-20, which correspond to the linker regions between the helices that form the repeat units. Interestingly, we found repeats with long insertions of more than 60 AA in sequences of arrays between 3 and 10 repeats, reaching 1.2% of the repeats (Figure S2). In some instances we found that a segment interpreted by us as an ANK repeat with a long insertion is annotated in Pfam as an ArfGAP domain, next to an ANK (e.g. [...]

The longer the arrays, the more similar the repeats are
Are the ANK arrays constructed from a random sample of repeats, or is there a correlation between the repeats that form the arrays present in natural proteins? As a first step towards this analysis, we measured the pairwise identity at the sequence level (pID) between repeats, as described in Methods. We exclude from this analysis the [...] (124 ± 4) repeats (Fig 4C). Surprisingly, this array length coincides with the longest arrays found in the natural data set, and may constitute a physical upper limit for the length of an Ankyrin repeat array.

[...]
We performed the ACV calculation for all the 257703 arrays, and observed that many proteins have very different identifiable periods. There are proteins that present signals at lengths of 3, 5, 6 and 7 (Fig S4, Fig S5, Fig S6 and Fig S7), while other proteins present an ACV with no appreciable signal. Also, we found examples of proteins that display two different periods along one single array (Fig S5). We found that the distribution of patterns is not characteristic of single domains of life: both Eukaryota and Bacteria encode proteins with various ACV distributions (Fig S5).

Another notable characteristic is the qualitative difference between the terminal repeats and the internal ones along the arrays, and in some cases between more than one terminal repeat and the rest of the array (Fig 5A for repeats 17 and 18).

In order to find out whether there is any general pattern for long proteins, we consider the arrays with 12 or more repeats and calculate the ACV for each one up to neighborhood 7, only for internal repeats. We then take the mean of all of them, accounting for the phylogenetic biases as described in Methods. Using this subset of more than 11.4 thousand effective arrays allows us to avoid the noisier components of each ACV. The overall signal is presented in Fig 6A and collects together a relative measurement of autocorrelation per array. The curve presents maxima at neighborhoods 2 and 4, where the relative identity is greater than that of the nearest neighbours. Also, the mean overall ACV decreases with the distance between repeats. For the same subset of arrays, we calculated the maximum of the ACV of each array and plotted a histogram of the distribution (Fig 6B). The nearest-neighbor repeats have the greatest score in most cases. The distribution has only a weak decreasing trend, so the ACV maxima are spread over all the neighborhoods considered. The distribution of maximum ACV is roughly the same for Eukaryota and Bacteria (Fig S8). Also, arrays with each maximum seem to be distributed without an evident trend along array length (Fig S9). Finally, we calculated the mean pID per neighborhood for each array length (Fig S10). On average, larger arrays present stronger periodicities than shorter ones, so the ACV signal that we obtain for every array is not a consequence of their overall similarity. In summary, the autocorrelation analysis of all the ANK repeat proteins indicates that the arrays are constructed with internal copies of various repeats, where the duplicated unit sometimes appears to be two repeats, sometimes three, five and up to seven consecutive units.
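To make the autocorrelation measure concrete, the following Python sketch computes the pID between two aligned repeats and the ACV of one array, following the definitions given in Methods. It is an illustration under stated assumptions rather than the code used in this work: the function names are hypothetical, the pID is normalized by the full repeat length (the text says only that it is normalized), and terminal repeats are assumed to have been removed from the input beforehand.

```python
GAP = "-"


def pid(r1, r2):
    """Fraction of positions holding the same residue in both repeats.

    Positions where both repeats carry a gap are not counted as identities.
    Assumption: the count is normalized by the full repeat length.
    """
    if len(r1) != len(r2):
        raise ValueError("repeats must be aligned to the same length")
    same = sum(1 for a, b in zip(r1, r2) if a == b and a != GAP)
    return same / len(r1)


def acv(repeats, max_n):
    """Return [ACV_1, ..., ACV_max_n] for one array of internal repeats.

    ACV_n is the mean pID over all pairs of repeats n units apart,
    divided by the mean pID between first neighbors.
    """
    if max_n >= len(repeats):
        raise ValueError("array too short for the requested neighborhood")

    def mean_pid(n):
        values = [pid(repeats[i], repeats[i + n]) for i in range(len(repeats) - n)]
        return sum(values) / len(values)

    base = mean_pid(1)
    if base == 0:
        raise ValueError("first-neighbor identity is zero; ACV is undefined")
    return [mean_pid(n) / base for n in range(1, max_n + 1)]


# Toy array built by copying a two-repeat unit twice (4-residue "repeats").
toy = ["AAAA", "AABB", "AAAA", "AABB"]
print(acv(toy, 3))
```

In this toy array the second component of the ACV exceeds the first, which is the kind of signal expected when the duplicated unit spans two repeats.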

Energetic characterization of the arrays
In order to analyze the folding energy distribution of the natural arrays found in protein sequences, we define a simple energetic model based on the per-site occurrence of amino acids (see Methods). This model is a simplification of a previously reported one [19] that captures the most salient energetic features. We then split the Ankyrin repeat [...] residues is centered near zero, as defined by the model. Second, the distribution for an alignment of consensus-like Ankyrin repeats [33] is clearly shifted to the lowest values, in correspondence with their measured extreme thermodynamic stability [34]. Finally, we plot the energy distribution for the complete natural alignment with its columns permuted, thus keeping the natural amino acid distribution. In Fig 7B we plot the energy mean and variance for every array, averaged according to array length.

The distribution that corresponds to repeats that come alone in the arrays is clearly distinguishable from the rest. The single repeats seem to be the least favorable on this energetic scale, and the difference is larger in the mean value. This indicates that single repeats detected in the database are different objects from the ones that come in pairs or longer tandems, and may even be considered as non-true Ankyrin repeats. Furthermore, single repeats that are alone in the full sequence and those that share the protein with another array have distributions with negligible differences between them, indicating that the actual natural arrays are continuous tandem objects.

For repeats that come in pairs or longer tandems, the distributions clearly shift to more favourable regions as the arrays get longer, and the variance gets smaller. This observation is still evident when we eliminate the terminal repeats from the analysis, so it cannot be attributed to a border effect (not shown).

If we consider the energy of the consensus-designed proteins, the distribution is centered at -70 units, which appears to be the lower limit of the energy scale. In conclusion, longer arrays are formed with repeats that are more energetically favorable than the repeats that form shorter arrays. Interestingly, longer arrays are not only closer to the energy minimum, but are also more homogeneous overall in their energy distribution (Fig 7B), indicating that they are formed with sequences that display similar local stabilization energy.
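The site-independent energy used in this section can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in this work. It assumes that the local field is the logarithm of the weighted column frequency plus an offset chosen separately for each column so that the fields sum to zero over the 21 symbols, and that the energy is minus the sum of the fields. The function names and the small pseudocount (added to avoid taking the logarithm of zero) are additions made for this example.

```python
import math

ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"  # 20 amino acids plus the gap character


def local_fields(mra, weights=None, pseudocount=1e-3):
    """Return one dict per column with h_i(a) = ln f_i(a) + C_i.

    C_i is chosen so that the fields of each column sum to zero.
    `weights` are the per-sequence phylogenetic weights (default: all 1).
    The pseudocount is an assumption of this sketch, not part of the model.
    """
    if weights is None:
        weights = [1.0] * len(mra)
    fields = []
    for i in range(len(mra[0])):
        counts = {a: pseudocount for a in ALPHABET}
        for seq, w in zip(mra, weights):
            counts[seq[i]] += w
        total = sum(counts.values())
        log_f = {a: math.log(counts[a] / total) for a in ALPHABET}
        offset = -sum(log_f.values()) / len(ALPHABET)  # enforces sum_a h_i(a) = 0
        fields.append({a: log_f[a] + offset for a in ALPHABET})
    return fields


def energy(seq, fields):
    """E(sigma) = -sum_i h_i(a_i), in units where k_B T = 1."""
    return -sum(fields[i][a] for i, a in enumerate(seq))


# Toy alignment with two columns; real repeats have 33 positions.
toy_fields = local_fields(["AC", "AC", "AD"])
print(energy("AC", toy_fields), energy("WW", toy_fields))
```

Under this convention a sequence drawn uniformly at random from the alphabet has zero mean energy, and sequences built from the most frequent residue of each column have the lowest energies.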

We constructed a large dataset of Ankyrin-repeat arrays by collecting and curating sequences from all the known proteomes of a large variety of organisms. We analyzed about 150 thousand non-redundant arrays containing more than 1.2 million aligned repeats. Around 75 percent of the proteins present a single array of multiple repeats. We found that 80 percent of the arrays are made of fewer than 7 repeats, yet the arrays span a large variety of sizes with a roughly exponential distribution (Fig 2). We found that insertions in the ANK repeats are rare, with both the length and the most common relative position of the insertions compatible with a previous 3D structural analysis [26] of the Ankyrin family. Curiously, we found a particularly abundant array length of 22-24 repeat units. Structurally, this is the size needed for ANK arrays to make a complete turn of the superhelical fold [32], and thus it may be exceptionally abundant for functional reasons, such as bringing into spatial proximity binding partners that are held at each end of the repetitive array.

The analysis of the pairwise identity (pID) between repeats that belong to arrays of different lengths shows that shorter arrays are less homogeneous, while longer ones gradually impose a higher pID between first neighbors. If extrapolated to an array of identical repeats, this trend implies an upper limit for the array length that we estimate to be (124 ± 4) repeats, which is compatible with the longest arrays found in natural proteins. Considering a simple site-independent model to approximate the folding energy [35], we calculated the energy distributions of the arrays and found that longer arrays are made with more favorable repeats than shorter arrays (Fig 7A). At the same time, longer arrays are found to be more energetically homogeneous than shorter ones (Fig 7B). Energy landscape theory arguments [36] predict that non-native traps would raise bigger free energy barriers in the folding of large proteins, so selection against misfolding should be stronger for longer proteins than for shorter ones. To avoid misfolded traps, repeat proteins may have to be more homogeneous and favorable as they get longer, nucleating folding and propagating it to near neighbours [6], [10], [11], [12], [13], which is in line with our findings in the natural proteins. We propose that long, heterogeneous and less favorable repeat arrays may not fold robustly in vivo and may be detrimental to fitness, so we will not find them in nature. Recently, Persi et al. [24] proposed that there is a universal accelerated horizontal evolution of repeats that drives them to homogeneity, finding strong signatures of purifying selection, which is compatible with the scenario we propose.

Comparing the pID between the repeats of the same array at fixed neighborhood using an autocorrelation vector (ACV) analysis [23] reveals that there are, in many cases, clear periodicities along the tandem copies of the arrays. In some proteins, the array appears to have originated from copies of two consecutive Ankyrin repeats (Fig 5), while in other instances the pattern has periods from 3 to at least 7 repeats (Fig S4, Fig S5, Fig S6 and Fig S7), consistent with previous findings [23]. The size of our data set allows us to get clear ACV signals: averaged over the set, the ACV peaks at 2, 4 and 6 repeats with a decreasing trend (Fig 6A). The distribution of absolute maxima for each protein is roughly uniform at least up to neighborhood 7 (Fig 6B). Björklund et al. [23], [37] postulated that there may be a biological mechanism that can copy and insert more than one repeat at once, giving rise to super-repeats (SR) in the structure of repeat proteins. This could explain the uneven distribution of the ACVs, which is clear in the Nebulin family [37]. For ANKs, our results are compatible with the existence of SR of different lengths in particular cases. Given the roughly uniform distribution of maximum ACV (Fig 6B), we cannot point to a characteristic duplication size of the SR unit. This kind of expansion of internal repeats does not seem to have a characteristic length for the SR, but rather a weakly decreasing probability as the number of repeat units per SR increases. However, if we look at particular instances, proteins such as W4XDH7 (Fig 5) show a regular periodicity in the ACV, indicating that the SR has been copied several times in the same sequence and, notably, conserving the phase of the repeat unit. The same behaviour at other repeat frequencies is observed for W4ZBY3 (Fig S4), for A0A0L8GA82 (Fig S6) and for A0A1X7UVJ5 (Fig S7).

Taken together, these results suggest that the generative mechanism for duplicating units depends on the identity of the existing repeats. Once an SR is copied, the next duplication event is biased in favor of the same SR length. In other words, the duplication mechanism should somehow recognize the previous SR copy as a seed to make a new copy. This "memory effect" of the last step could be explained with an identity-dependent mechanism. We propose a molecular mechanism that at first copies any number of repeats at the same time and pastes them in tandem with the preexisting ones. When this happens once, the probability of it happening again increases, preserving the phase and the number of copied units. However, we note that there are also examples with two different periods along the same array, like the bacterial protein R5A1C8 (Fig S5), which in this framework could indicate the generation of two independent "seeds" in the same sequence. The existence of harmonics in the copies explains why the average ACV is higher for second neighbors than for first ones, even though there are more similar first neighbors than second ones.

Repeat duplication could be explained by various molecular mechanisms such as illegitimate recombination, exon shuffling, DNA slippage, etc., but no common mechanism for the expansion of all repeats could be detected [23]. We found that the distribution of maximum ACV is roughly the same in Eukaryota and Bacteria in the ANK family (Fig S8). This observation leaves three possible explanations: (1) there is a common mechanism, such as non-homologous recombination, that governs ANK repeat expansion in all organisms, so we can discard exon shuffling or chromatin-geometry-dependent mechanisms that are exclusive to Eukaryotes; (2) the mechanism that allows the expansion by SR of any length is only possible in Eukaryota, and massive horizontal gene transfer delivered the repeated sequences to Bacteria; or (3) different mechanisms are operating in Eukaryota and Bacteria, yet converging on a similar outcome. A deeper evolutionary study is needed to contrast these hypotheses.

It should be noted that even if a length-independent SR copying mechanism may be acting, physical folding limits prevent the existence of arbitrarily long tandem ANK repeat proteins, as sequences cannot be arbitrarily energetically favorable locally in each part of the array, nor can repeats be more homogeneous than 100 percent identical. The contemplation of symmetries and regularities hints at the existence of non-random structure in the biological realm. Repeat proteins constitute excellent systems in which to study the interplay between order and disorder, as traces of their origin, evolution and function are coded in their sequences. Here we shed light on some of these aspects for the most abundant repeat-protein class, the Ankyrin family.

[Figure omitted; axis label: Repeat number]