Coherent Somatic Mutation in Autoimmune Disease

Background: Many aspects of autoimmune disease are not well understood, including the specificities of autoimmune targets, and patterns of co-morbidity and cross-heritability across diseases. Prior work has provided evidence that somatic mutation caused by gene conversion and deletion at segmentally duplicated loci is relevant to several diseases. Simple tandem repeat (STR) sequence is highly mutable, both somatically and in the germ-line, and somatic STR mutations are observed under inflammation.
Results: Protein-coding genes spanning STRs having markers of mutability, including germ-line variability, high total length, repeat count and/or repeat similarity, are evaluated in the context of autoimmunity. For the initiation of autoimmune disease, antigens whose autoantibodies are the first observed in a disease, termed primary autoantigens, are informative. Three primary autoantigens, thyroid peroxidase (TPO), phogrin (PTPRN2) and filaggrin (FLG), include STRs that are among the eleven longest STRs spanned by protein-coding genes. This association of primary autoantigens with long STR sequence is highly significant (). Long STRs occur within twenty genes that are associated with sixteen common autoimmune diseases and atherosclerosis. The repeat within the TTC34 gene is an outlier in terms of length and a link with systemic lupus erythematosus is proposed.
Conclusions: The results support the hypothesis that many autoimmune diseases are triggered by immune responses to proteins whose DNA sequence mutates somatically in a coherent, consistent fashion. Other autoimmune diseases may be caused by coherent somatic mutations in immune cells. The coherent somatic mutation hypothesis has the potential to be a comprehensive explanation for the initiation of many autoimmune diseases.


Introduction
I have previously provided evidence that somatic gene conversion and/or deletion in sequence harboring long segmental duplications is correlated with disease [1]. According to this hypothesis, autoimmunity is a response to novel (somatically mutated) antigens. Others have proposed a role for somatic mutation in autoimmunity [2,3]. The remarkable extent of somatic mutation, including copy number variation and somatic mosaicism, has recently been elucidated, with several proposed links to neurological disease [4][5][6][7][8][9]. The connection between somatic mutation and autoimmunity requires that somatic mutations be coherent [1], i.e., that the same type of mutation occur in many cells, to the point that the somatically mutated protein either disrupts normal function or is noticed by the immune system as non-self. A coherent mutation may be recurrent (occurring independently in many cells) [10] or clonal (occurring once and replicating many times).

Somatic Mutation of Tandem Repeat Sequence
Coherent somatic mutation of the haptoglobin gene (HP) has been observed in vivo in humans [11]. Carriers of the HP2 allele have a segmentally duplicated 1.7 kb sequence fragment within the gene that includes two additional exons beyond the shorter HP1 allele. In an HP2 homozygote, Asakawa et al. [11] found a shorter DNA sequence corresponding to an exact excision of one copy of the tandem repeat. In each of several HP2 homozygotes subsequently tested, a small but measurable concentration of the shorter sequence was identified. Asakawa et al. argued that rare but regular somatic deletion events occur in vivo. In the mouse, a similar kind of somatic mutation has been observed in vivo at a longer 70 kb segmental duplication [12,13]. The mutation frequency was much higher than for HP in humans, presumably due to both the longer duplicon and the fact that phenotypic measurement was performed in gene-expressing tissues where mutations would be more common, rather than in blood cells [11,14]. Somatic mutation at additional loci, mediated by inverted repeats [15] or tandem repeats [16], has been observed in vivo in humans.
Long segmental duplications are not the only repetitive sequence subject to high mutation frequencies. Simple tandem repeats (STRs), including microsatellites and minisatellites that are highly mutable in germ-line cells, are also mutable in somatic cells [17,18]. Some STRs encode proteins, and somatic mutations would generate novel, potentially immunogenic proteins. Such an effect has been observed, at a sequence that is not strictly an STR, in the La antigen associated with Systemic Lupus Erythematosus (SLE) and Sjogren's Syndrome (SJ), where somatic mutations of an 8 bp poly-A sequence into a 7 bp mutant have been observed [19]. These mutations correlate with autoimmunity, in that about 30% of La-reactive SLE/SJ patients respond specifically to the mutant protein [19] and somatic mutant DNA can be detected in such individuals [20].
Other STRs occur within introns, where changes in repeat counts can change splicing behavior [21]. Altered splicing of autoantigens has been proposed as a mechanism for generating immunogenic protein variants [22]. In particular, inflammation can lead to reduced levels of the splicing factor ASF/SF2 [22]. Low levels of ASF/SF2 are associated with DNA double strand breaks and DNA rearrangements triggered by R loops between DNA and transcribed RNA [23]. R loops promote instability in GC-rich trinucleotide repeats [24], suggesting that transcribed repetitive sequence may be particularly vulnerable to somatic mutation induced by ASF/SF2 depletion.
Additionally, repeat mutations are often accompanied by significant changes in methylation [25]. Demethylation can potentially lead to aberrant transcription initiation in the middle of the gene sequence [26]. Repetitive sequence is also an essential factor in cellular mechanisms for methylating nearby sequence [27,28]. Changes to the methylation pattern can also affect splicing [29]. Altered methylation patterns have been observed in several autoimmune diseases [30].
Yet another reason to focus on somatic repeat mutations in autoimmune disease is the observation that somatic tandem repeat mutations can be induced by inflammation typical of an immune or autoimmune response [31,32]. This observation provides the basis for a feedback loop. An initial immune response against a pathogen could, as a side-effect of inflammation, trigger the initial production of aberrant protein. The aberrant protein induces a second immune response, with further inflammation and coherent somatic mutation in nearby cells (or remote cells opsonized by autoantibodies [33,34]) creating a cycle of autoimmunity. Anti-inflammatory medications reduce rates of somatic mutation in some cancers [35], further supporting a link between inflammation and somatic mutation.
Human STR sequence is overabundant near telomeres [18,36]. Nevertheless, the germ-line variability of a minisatellite repeat in a population does not depend on its chromosomal location [37]. Instead, the primary determinants of minisatellite variability are (a) the number of repeat units it contains, and (b) the degree of identity between different repeat units within the sequence [37]. Variability is a nonlinear function of these measures: Doubling the copy number increases the probability of being variable about 15-fold, and adding 10% to the repeat unit similarity increases the probability of being variable about 18-fold [37]. A more recent model also takes into account the size of the repeat unit [38]. The total repeat length (i.e., the product of the repeat unit size and the repeat count) is strongly correlated with variability [38]. For segmental duplications, high sequence identity is most important for structural variability, with high duplicon length and low duplicon separation also playing a role [39].
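The scaling reported for minisatellite variability [37] and the definition of total repeat length [38] can be made concrete with a little arithmetic. The sketch below is illustrative only: it assumes that the two fold-changes combine multiplicatively and extrapolate log-linearly beyond a single doubling or a single 10% step, which the cited model may not do, and the function names are hypothetical.

```python
import math

def total_repeat_length(unit_size_bp: int, repeat_count: int) -> int:
    """Total repeat length: repeat unit size times repeat count."""
    return unit_size_bp * repeat_count

def relative_variability(copy_ratio: float, similarity_gain_pct: float) -> float:
    """Fold-change in the chance of a minisatellite being variable.

    ASSUMPTION: each doubling of copy number multiplies the chance by 15
    and each additional 10 percentage points of unit similarity multiplies
    it by 18, with the two effects independent. Only the single-step
    factors (15 and 18) come from the text.
    """
    doublings = math.log2(copy_ratio)
    similarity_steps = similarity_gain_pct / 10.0
    return (15.0 ** doublings) * (18.0 ** similarity_steps)

if __name__ == "__main__":
    print(total_repeat_length(40, 300))     # unit of 40 bp repeated 300 times
    print(relative_variability(2.0, 0.0))   # one doubling of copy number
    print(relative_variability(1.0, 10.0))  # +10% unit similarity
```

Under these assumptions a repeat with twice the copy number and 10% higher unit similarity would be a few hundred times more likely to be variable, which is why long, high-identity repeats are the focus of the analysis.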
While somatic and germ-line microsatellite mutation patterns appear similar [18], somatic and germ-line mutation patterns differ for minisatellites [40]. Germ-line minisatellite mutations involve recombination-based repair of double strand breaks (DSBs), while spontaneous somatic minisatellite mutations arise by replication slippage or mitotic recombination [40]. For somatic mutations induced by inflammation [31,32], DNA damage appears to be critical, including DNA strand breaks [41]. The resulting mutation patterns in STRs may therefore more closely resemble germ-line mutations or somatic mutations in cancer [42] than spontaneous somatic mutations. Structural mutations in repetitive sequence are orders of magnitude more frequent than point mutations [43]. Mitotic mutation rates of up to 2% have been observed in the longest human tandem repeat sequences [44].

Autoimmunity
Autoimmune diseases have overlapping features, including shared susceptibility loci [45][46][47][48] and cross-heritability [49]. Nevertheless, each autoimmune disease has specific manifestations, causing damage to particular organs or systems. The central enigma of autoimmune disease is why a relatively small set of specific proteins are immunologically targeted [50]. Many, but not all autoantigens in systemic autoimmune diseases are proteins that are cleaved during apoptosis [51,52], but the reason for this association is unclear given that T cell tolerization to such cleaved proteins is expected [52,53]. Autoantigens appear to have longer exons and harbor more SNPs than other genes [3,54], and they are enriched in several biologically relevant categories [3].
The most prominent phenotype of autoimmune disease is the presence of specific antibodies (Tables 1 and 2). While T-cell epitopes are also implicated in autoimmunity, they are more difficult to measure [55]. Mutant protein can induce antibodies to wild-type protein, even when T-cell tolerance to wild-type protein is maintained [56]. Thus, antibodies are likely to provide the most robust signal about autoimmune targets.
A B cell epitope does not have to be from the same protein molecule as the T cell epitope in order for the B cell to be activated by a CD4+ (helper) T cell. A B cell that endocytoses a protein complex by binding to one of its proteins can be activated by a CD4+ T cell specific to another protein in the complex. Such a mechanism has been used to explain anti-TG2 antibodies in celiac disease, where a TG2-specific B cell is activated by a CD4+ T cell specific to gliadin after endocytosis of a TG2-gliadin complex [57].
Thus, a protein is a candidate CD4+ T cell target either if it elicits antibodies itself, or if an in-vivo binding partner of the protein elicits antibodies. B cell specificities (and thus antibodies) to multiple proteins can be supported by a single CD4+ T cell epitope. I use the term peri-antigen to mean an in-vivo binding partner of an autoantigen. A peri-antigen can potentially function as a CD4+ T-cell target supporting B cell specificity to the autoantigen.

Testing the Coherent Somatic Mutation Hypothesis
I sought data to test the hypothesis that autoimmune disease is associated with mutable repetitive sequence. Because of its construction from long contigs [58], the reference human genome has reliable sequence for most repetitive regions, although gaps still remain. Because shorter reads were used, the Celera sequence is missing the interiors of many repetitive elements [59]. Most current sequencing technologies use short reads that must be assembled into whole genomes. Both de-novo assembly and alignment-based assembly are unreliable in highly repetitive regions [60][61][62]. The reference human genome is therefore the primary currently available source of robust repetitive sequence throughout the genome.
Antibodies that develop early in disease progression provide the strongest evidence for a causative role for the corresponding antigen. A primary autoantigen is one whose antibodies have been shown, in at least a subset of cases, to be the first disease-associated antibodies to appear. A test of the coherent somatic mutation hypothesis can be formulated as follows: Is there a statistical link between primary autoantigens (and/or their peri-antigens) and genes containing highly mutable sequence?
Once such a statistical link is established, a subsequent test of the comprehensiveness of the coherent somatic mutation hypothesis would consider other mutable (e.g., long STR) sequence. To
Table 1. Twenty-one of the most prevalent human autoimmune diseases, in approximately decreasing order of prevalence [49,275]. Autoantibodies to antigens in bold are known to be primary antibodies that occur early in disease progression, often prior to the appearance of symptoms. The tryptophan allele of the Arg620Trp polymorphism at rs2476601 in the PTPN22 gene is associated with many autoimmune diseases, as indicated in the ''PTPN22 Assn.'' column. Atherosclerosis (CAD) is not universally considered an autoimmune disease, and is therefore not listed. Nevertheless, CAD does have autoimmune features [376] and an association with PTPN22 [377][378][379]. * The initial pathology in some MS lesions is associated with MAG loss [329,380,381]. # The tryptophan PTPN22 allele is protective from CD [333]. ‡ In MG, two studies conflict about whether PTPN22 is specifically associated with the subset of cases having anti-TTN antibodies. doi:10.1371/journal.pone.0101093.t001
Motivated by known somatic mutation of HP, I looked for examples of long tandem duplications with at least 96% identity occurring within protein-coding genes. Since the tandem repeat finder algorithm limits repeat units to 2000 bp, and its coverage of some longer units appears to be incomplete (e.g., it misses the 1.7 kb repeat in HP), such repeats may have been overlooked in the earlier analysis. The segmental duplications track of the UCSC database [64] was used as described in the Methods. Table 3 shows all tandem duplications of total length at least 3400 bp where at least one duplicon occurs entirely within a protein-coding gene locus, and the tandem duplicons have the same orientation and are separated by at most 100 bp.
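The selection criteria for long tandem duplications stated above can be sketched as a simple filter. This is a minimal illustration, not the procedure from the Methods: the record type, field names and example records are hypothetical placeholders, and only the four thresholds (96% identity, 3400 bp total length, same orientation, at most 100 bp separation) come from the text.

```python
from dataclasses import dataclass
from typing import List

# Thresholds taken from the text.
MIN_IDENTITY = 0.96
MIN_TOTAL_LENGTH_BP = 3400
MAX_GAP_BP = 100

@dataclass
class TandemDuplication:
    """Hypothetical record for one pair of duplicons."""
    gene: str
    identity: float             # fraction of matching bases between duplicons
    total_length_bp: int        # combined length of the tandem duplicons
    gap_bp: int                 # separation between the duplicons
    same_orientation: bool
    duplicon_within_gene: bool  # at least one duplicon lies entirely in the gene locus

def passes_filter(d: TandemDuplication) -> bool:
    """Apply the four criteria plus the gene-locus requirement."""
    return (
        d.identity >= MIN_IDENTITY
        and d.total_length_bp >= MIN_TOTAL_LENGTH_BP
        and d.gap_bp <= MAX_GAP_BP
        and d.same_orientation
        and d.duplicon_within_gene
    )

def select_long_tandem_duplications(dups: List[TandemDuplication]) -> List[TandemDuplication]:
    """Keep qualifying duplications, longest first."""
    kept = [d for d in dups if passes_filter(d)]
    return sorted(kept, key=lambda d: d.total_length_bp, reverse=True)

if __name__ == "__main__":
    # Placeholder records, not real annotation data.
    candidates = [
        TandemDuplication("GENE_A", 0.98, 3400, 0, True, True),
        TandemDuplication("GENE_B", 0.95, 9000, 10, True, True),     # identity too low
        TandemDuplication("GENE_C", 0.99, 54000, 250, True, True),   # gap too large
        TandemDuplication("GENE_D", 0.97, 12000, 50, True, True),
    ]
    for d in select_long_tandem_duplications(candidates):
        print(d.gene, d.total_length_bp)
```

Note that the 3400 bp threshold equals two copies of a 1.7 kb unit, so a duplication like the one described for HP sits exactly at the boundary of the filter.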
Several genes appearing in Figure 1 also appear in Table 3, having long segments that are high identity tandem repeats. Of the remaining genes, five are autoantigens: complement component receptor 1 (CR1) in SLE and multiple sclerosis (MS); pepsinogen 4, group I (PGA4) in pernicious anemia (PA); titin (TTN) in myasthenia gravis (MG); interferon-gamma-inducible protein 16 (IFI16) in RA, SLE, SSc and SJ; and HP in celiac disease (CEL) (Table 1). The presence of five autoantigens among the top 33 genes is statistically significant (p < 0.0004, see Methods).
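One common way to assess an enrichment of this kind is a hypergeometric tail probability: the chance of drawing at least the observed number of autoantigen genes when sampling genes at random. The sketch below shows that calculation only as an illustration. The actual test is the one described in the Methods, which is not reproduced here, and the population size and the number of autoantigen genes used in the example call are hypothetical placeholders, not values from this study.

```python
from math import comb

def hypergeom_tail(population: int, successes: int, draws: int, observed: int) -> float:
    """P(X >= observed) when drawing `draws` items without replacement
    from `population` items, of which `successes` are of interest."""
    upper = min(successes, draws)
    total = comb(population, draws)
    hits = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(observed, upper + 1)
    )
    return hits / total

if __name__ == "__main__":
    # HYPOTHETICAL inputs: 20000 protein-coding genes, 100 of them autoantigens.
    # Only "5 autoantigens among the top 33 genes" comes from the text.
    p = hypergeom_tail(population=20000, successes=100, draws=33, observed=5)
    print(f"tail probability under these placeholder assumptions: {p:.3g}")
```

The result depends strongly on how many genes are counted as autoantigens in the background set, so the printed value should not be compared directly with the p value reported above.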
Copy number variations in the 54.7 kb STR of CR1 (Table 3) have been associated with SLE [66] and Alzheimer's disease (ALZ) [67]. The CR1-S allele has three repeats (as in the human reference genome) and has a population frequency of about 15%, while the shorter CR1-F allele has two repeats and a frequency of 83% [67]. The repeat length is functionally important, since the repeat includes sequence that codes for complement binding sites [67]. In both SLE and ALZ, the longer CR1-S allele is the high-risk variant [66,67]. CR1 plays an important immunological role in various cell types [68].
PGA4 is one of three genes in the human reference genome coding for highly similar (but not identical) versions of pepsinogen A, an autoantigen in PA. Low levels of pepsinogen A are specific in diagnosing PA [69]. Variant alleles observed in the population contain three, two or one pepsinogen A gene [70]. The other major autoantigen in PA is ATP4A/ATP4B (Table 1), which both interacts with and colocalizes with pepsinogen A on the parietal cell surface [71].

Additional Long Repeats Obtained from Self-Chain Alignments
To ensure completeness of the long repeat dataset, I queried the self-chain track of the UCSC database as described in the Methods. These alignments capture tandem repeats that may be slightly imperfect, i.e., there may be gaps between segments in the alignments, as well as repeats whose unit length is above the 2 kb threshold for TMRF. The results, shown in Table 4, are largely in agreement with Figure 1 and Table 3. Table 4 includes the following additional genes with alignments over 13 kb and exhibiting germ-line structural variation (File S1): LPA, DMBT1, MGAM, KIR3DL1, KATNAL2.

TTC34 is a Candidate CD4+ T Cell Antigen for Systemic Lupus Erythematosus
The gene TTC34 is an outlier in Figure 1, both in terms of the length of the repetitive segment (an underestimate because the repeat is terminated by a gap in the human reference assembly) as well as the number of repeat units. TTC34 encodes an uncharacterized protein that binds to PPP4C [78]. In support of a functional role for TTC34/PPP4C binding, RNAi depletion of either protein induces a common elongated cell phenotype [79].
If somatic mutation of TTC34 induces autoimmunity, then antibodies to binding partners of TTC34/PPP4C would be expected. PPP4C is a ubiquitous serine/threonine phosphatase that regulates a variety of cellular functions [80]. Based on the localization of those cellular functions, I hypothesize that TTC34 mutation underlies the initial pathogenesis of SLE. Table 5 shows that many autoantigens in SLE, including known primary SLE autoantigens, associate with PPP4C. Under this hypothesis, the broad array of autoantigens in SLE is a consequence of the many functions of PPP4C, together with secondary immunogenicity caused by the aberrant clearance of apoptotic cells [52,81].
The long TTC34 STR appears (with shorter length) in several primate species, but not in more distantly related species whose genomes have been sequenced [64]. Surprisingly, a 12 kb long STR has independently evolved in the mouse (GRCm38) genome, 3.2 kb upstream of the mouse Ttc34 start site [82]. The mouse repeat unit length is 37, similar to the unit length of 40 in the human repeat. As for humans, the 12 kb mouse repeat is an outlier within the mouse genome: among all STRs that overlap a protein-coding gene locus, including a 5 kb segment upstream of the gene, the Ttc34 repeat is the fifth longest (Table 6). The independent evolution of such a similar long repeat argues strongly for a functional role. If the TTC34 repeat mutates under inflammation [31], then the desired functional role would be one where changes in TTC34 expression and/or PPP4C activity would be adaptive under inflammation. PPP4C depletion makes T cells resistant to apoptosis [83]. The association of apoptosis reduction with inflammation is biologically plausible, since T cells in inflammatory environments would be expected to receive survival signals during normal immune responses.

LPA in Atherogenesis
LPA encodes a protein that binds to ApoB-100 in LDL particles to form Lp(a) lipoprotein particles containing lipids, phospholipids and cholesterol [84]. In coronary artery disease (CAD), ApoB-100 and LDL are immune targets of T cells and antibodies [85], meaning that LPA encodes a peri-antigen for CAD. Under the coherent somatic mutation hypothesis, rare but regular somatic mutation to LPA would occur, analogously to that observed for HP [11]. Epitopes of the mutant protein would be presented by immune cells in blood vessels, leading to activation of immune cells in atherosclerotic lesions [85] and autoimmune responses against other components of Lp(a) lipoprotein particles. LPA is central to CAD pathogenesis, since an elevated plasma Lp(a) lipoprotein level predicts stroke and vascular disease, particularly in men [86,87]. SNPs in LPA have the largest known effect on CAD risk [88], including an odds ratio of 1.74 for the minor allele of rs3798220.

ABCG8 in Hypercholesterolemia
ABCG8 contains a long (10.8 kb) intronic repeat, part of a larger compound repeat separated by a LINE insertion (Figure 2). ABCG8 encodes a cholesterol transporter that has been implicated in CAD [88,89] and in gallstone formation [90]. SNPs rs41360247 and rs4245791 in ABCG8 are associated with both CAD risk and LDL cholesterol levels [89]. Additionally, the SNP rs4952688 was shown to influence the mRNA expression of both ABCG8 and its co-transporter ABCG5 in liver cells [91]. rs4952688 is located within the compound repeat sequence (Figure 2), implicating this repeat sequence (or nearby linked sequence) in the expression levels of these two cholesterol transporters.
The normal function of ABCG8 and ABCG5 in liver cells is to excrete cholesterol into the bile [92]. Disruption of this process could lead to hypercholesterolemia, the initial manifestation of atherosclerosis. ABCG8 variants can also influence cholesterol levels by modulating cholesterol absorption [93]. Somatic repeat mutations accumulating over time could change expression levels of these proteins, thereby altering the rate of cholesterol excretion/absorption. Germ-line mutations in these genes are associated with premature atherosclerosis [91,94], as are mutations in other cholesterol transporters such as APOE [95,96].
In principle, somatic repeat mutations could induce the production of aberrant ABCG8 protein variants that would be immunogenic, as previously argued for autoimmune disease. Antibodies to such variants could interfere with cholesterol excretion, but ABCG8-specific antibodies have not been documented in CAD. The molecular mechanisms by which the proteins encoded by ABCG8 and ABCG5 transport cholesterol are not fully understood [97]. If the ABCG5/ABCG8 complex binds to LDL, then ABCG8 would encode a peri-antigen for CAD since oxidized LDL is an autoantigen [85].

DMBT1, FCGBP, and the Mucins MUC4, MUC5B, MUC12 and MUC17
Mucins including MUC4, MUC12 and MUC17 are important for intestinal integrity and have previously been associated with both ulcerative colitis (UC) and Crohn's disease (CD) [98][99][100]. MUC17 depletion increases epithelial permeability in the face of E. coli exposure [101]. FCGBP is a component of the mucus layer coating of the intestinal tract [102], and expression is higher in several autoimmune diseases [103]. The DMBT1 protein also provides mucosal protection of the intestine, and expression levels correlate with disease activity in CD and UC [104]. Host-microbe interactions appear to be central to the pathogenesis of UC and CD [105]. CD, UC, psoriasis (PSO) and ankylosing spondylitis (AS) have common features [105,106] that suggest a cluster of diseases with related etiology. AS has been associated with the gut microbiome [107], and PSO has been associated with intestinal yeast infections [108].
A critical clue is provided by the PTPN22 rs6679677 C/A polymorphism that is in high linkage disequilibrium with the rs2476601 C/T polymorphism associated with many autoimmune diseases [109]. At rs6679677, the A allele appears to be a risk allele for UC (as for most other autoimmune diseases) but protective for CD [105]. In the context of the coherent somatic mutation hypothesis, one could interpret this opposite PTPN22 association in terms of alternative responses to somatic mutation. UC would be caused by an autoimmune response against the mutant protein, while CD would be caused by the failure of the mutant protein's function, in the absence of a direct immune response against that protein. This interpretation is consistent with a clear role for MHC alleles in UC but not CD [105,110], and with a reduction in mucus quantity and/or goblet cell density specifically in UC [111,112].
CD and UC have opposite risk alleles for NOD2 polymorphisms [105]; NOD2 variation modulates adaptive immune responses to microbial antigens [113], and regulates DMBT1 expression in CD [114]. Significantly, short alleles of the DMBT1 tandem repeat that encode fewer bacterial recognition sites are overrepresented in CD but not UC [104]. DMBT1 has high protein homology with the CD autoantigen CUZD1 [115], potentially leading to cross-reactive antibodies. Further, DMBT1-coded protein binds to pancreatic amylase [116,117] that in turn binds to the CD autoantigen GP2 [118], meaning that DMBT1 encodes a peri-antigen for CD.
In Sjogren's syndrome, a primary initiating change is the dysregulation of mucins [119], including the aberrant exocytosis of MUC5B [120]. MUC4 is an interesting somatic mutation candidate because its expression pattern in the eye, vagina, ectocervix, trachea, and salivary gland [121] closely aligns with locations where symptoms occur [122]. MUC5B is expressed in many of these tissues [123], but not in the tear fluid [124]. Somatically mutated mucins could induce an immune response against the mutant protein. Alternatively, aberrant mucin protein may offer reduced protection of epithelial cells, making them vulnerable to infection. Apoptosis of the epithelial cell could trigger the induction of antibodies to apoptotically generated proteins in Sjogren's syndrome.
Table 3. Long tandem duplications with at least 96% identity that occur within a gene locus. Columns: Gene, Length, Copies, Gap, Coding. Duplications were identified as described in the Methods. The length indicates the total length of the high-identity tandem duplicons. The gap is the separation between the two highest-identity (long) duplicons, which was required to be less than 100 bp. The duplication is ''coding'' if a duplicon overlaps at least one exon. * FCGBP has a third duplicon, but with less than 96% identity. # These genes have duplicons that are themselves STRs of lower fidelity; only the copy number for the high-identity long tandem duplication is reported in this table.
KIR3DL1 encodes an inhibitory receptor expressed on natural killer (NK) cells and T cells [125]. There is a high degree of copy number variation of the KIR genes around this locus, and some haplotypes do not possess KIR3DL1 [126]. HLA-Bw4 is the ligand for KIR3DL1, and is protective in MS [125] and primary sclerosing cholangitis [127]. The presence of KIR3DL1 is protective for AS [128], particularly AS with uveitis (UV) [129]. Somatic mutations to KIR3DL1 could reduce inhibition of NK cells and/or T cells, leading to selective activation and clonal expansion.
The segmental duplication at the NKG2-E locus overlaps the genes KLRC1, KLRC2 and KLRC3. Copy number variation at NKG2-E (manifested as a deletion of KLRC2) is associated with psoriasis susceptibility [130]. Reduced KLRC2 expression in T cells is observed in PSO [131], and enhanced expression of KLRC2 on CD4+ T cells is observed in MS [132]. KLRC1 encodes a critical receptor on NK cells, regulating the elimination of autoreactive CD4+ T cells in animal models of MS [133]. KLRC1 plays a critical role in tolerization by regulatory T cells [134], and is downregulated in PSO [135].
KIR3DL1 and KLRC1 encode NK cell receptors. NK cells and their receptors regulate autoimmunity in MS [136], and NK cell populations rise and fall in ways that correlate with the development of lesions in relapsing-remitting MS [137,138]. NK cells are found in psoriatic plaques, and circulating NK cells are reduced in PSO, MS, SLE and T1D [139,140].
The segmental duplication within the long HCAR1 repeat identified in Tables 3 and 4 covers the two genes HCAR2 and HCAR3. HCAR2 codes for a niacin receptor that is expressed on antigen presenting cells and functions in a tolerization pathway for T cells [141]. Niacin administration ameliorates an animal model of MS through this pathway [141].
Table 7 summarizes the autoimmune associations of genes with long STRs. This key table shows that long STRs within twenty genes are associated with sixteen common autoimmune diseases and atherosclerosis. Each of these putatively mutable STRs exhibits germ-line structural variation (File S1), consistent with a somatically mutable locus. The coherent somatic mutation hypothesis thus has the potential to be a comprehensive explanation for many autoimmune diseases.

Summary: Long Simple Tandem Repeats in Autoimmunity
With the exception of MS and possibly PA and SJ, each of the diseases associated with an autoantigen or peri-antigen in Table 7 is influenced by the functional rs2476601 single-nucleotide polymorphism in the PTPN22 gene (Table 1). This polymorphism specifically influences T cell signaling [142,143], B cell signaling [144,145], autoreactive B cell generation [144], and T cell and dendritic cell hyper-responsiveness [146]. The role of PTPN22 in some but not all autoimmune diseases suggests a common underlying pathway for this subset of diseases [45,143] that may be related to STR length and/or mutability. Table 8 shows that the conditions associated with autoantigens/peri-antigens above have a high degree of co-morbidity and/or familial association. Taken together, the data support the following model for this subset of diseases:
A disjoint subset of diseases, including MS, PSO, UV, and AS, have no association with the PTPN22 gene polymorphism (Table 1). All four of these conditions are associated with immune-cell expressed genes spanning long repeats. Somatic mutation in those genes, rather than in antigenic genes, may be the critical step for such diseases.
Table 4 legend: Sequences with a self-similarity score of 60 or above having both query and target mapped within a protein-coding gene locus were obtained from the self-alignment track [267] of the UCSC database as described in the Methods, and ranked by match-length. In this table, the match length corresponds to the length of identity between the two duplicons. Note that self-aligned duplicons may overlap. doi:10.1371/journal.pone.0101093.t004

Table 5. Correspondence of PPP4C localization with many known SLE autoantigens.
Ro60 (TROVE2): See La; Ro60 and La are components of a common protein/RNA complex [404].
Primary autoantigens are in bold. doi:10.1371/journal.pone.0101093.t005

A Repeat Constituting 97% of the Intron Sequence within an Autoantigen for Pemphigus Vulgaris
Somatic repeat mutations in introns could be particularly disruptive when the intron is almost exclusively tandem repeat sequence. I therefore queried the reference genome for genes containing introns where a single tandem repeat occupies a large fraction of the intron (Table 9). The top-ranked gene in this analysis is PKP3, containing a 2310 bp repeat occupying over 97% of the eighth intron. There is germ-line structural variation at this locus in the HapMap population, with deletion variants encompassing almost the entire STR sequence [147].
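The intron query described above amounts to ranking introns by the fraction of their length covered by a single tandem repeat. A minimal sketch of that calculation follows. The coordinates, gene labels and function names are hypothetical placeholders chosen only so that one example has a 2310 bp repeat covering more than 97% of its intron, and half-open intervals are an assumption of the sketch.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open (start, end), an assumption of this sketch

def repeat_fraction(intron: Interval, repeat: Interval) -> float:
    """Fraction of the intron covered by the repeat."""
    intron_start, intron_end = intron
    repeat_start, repeat_end = repeat
    overlap = max(0, min(intron_end, repeat_end) - max(intron_start, repeat_start))
    return overlap / (intron_end - intron_start)

def rank_by_repeat_fraction(
    records: List[Tuple[str, Interval, Interval]]
) -> List[Tuple[str, float]]:
    """Rank (label, intron, repeat) records by covered fraction, highest first."""
    scored = [(label, repeat_fraction(intron, repeat)) for label, intron, repeat in records]
    return sorted(scored, key=lambda item: item[1], reverse=True)

if __name__ == "__main__":
    # Placeholder coordinates, not taken from any genome annotation.
    records = [
        ("GENE_X intron 8", (1000, 3370), (1030, 3340)),  # 2310 bp repeat in a 2370 bp intron
        ("GENE_Y intron 2", (5000, 9000), (6000, 7000)),
    ]
    for label, fraction in rank_by_repeat_fraction(records):
        print(f"{label}: {fraction:.1%}")
```

For a 2310 bp repeat to cover more than 97% of an intron, the intron can be at most about 2381 bp long, which illustrates how little non-repeat sequence remains around such a repeat.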
PKP3 encodes an autoantigen in pemphigus vulgaris (Table 2). Furthermore, PKP3 binds in vivo to several other primary pemphigus vulgaris autoantigens including DSG3, DSG1, DSC1, and DSC3 [148]. Aberrant PKP3 could therefore serve as a CD4+ T cell antigen in the induction of antibodies to these other proteins. The p value for the top gene being an autoantigen is 0.02 (see Methods). (Table 1). MAG encodes a multiple sclerosis autoantigen that binds in vivo to MBP and PLP [149], two other MS autoantigens (Table 1). Anti-MAG antibodies have also been observed in various polyneuropathies [150][151][152]. The presence of two autoantigens among the top twelve is statistically significant (p < 0.022, see Methods). On the other hand, the STRs in MAG and MUSK do not exhibit germ-line structural variation at 50 bp resolution (File S1); germ-line variation would be expected for a somatically mutable locus.

Discussion
Somatic mutation has been overlooked or discounted as a cause of autoimmunity, primarily because ''random'' mutation would not lead to consistent and specific disease characteristics [153]. However, many kinds of somatic mutation are nonrandom, caused by mechanisms that yield coherent mutation patterns both within and across individuals. Coherent somatic mutation is a unifying and biologically plausible hypothesis to explain the specific targets of autoimmune disease.

Longer-Range Segmental Duplications
Long high-identity segmental duplications that are not strict tandem repeats may still lead to somatic protein changes via deletion or duplication if they partially overlap genes. Examples of this pattern include: RHD and GYPA, autoantigens in autoimmune hemolytic anemia; AMY2A, an autoantigen in autoimmune pancreatitis and fulminant T1D, and a binding partner of the CD autoantigen GP2 [118]; CES1 and PDIA3, autoantigens in type-2 autoimmune hepatitis; TYR, an autoantigen in vitiligo; and CHRNA7, an autoantigen observed in schizophrenia (Tables 1 and 2, [154]). The genomic structure of TYR makes it particularly susceptible to gene conversion and deletion (Figure 4).
The human genome contains segmental duplications that span whole genes, and copy number variation in these tandem repeats is likely to affect gene dosage [155]. These duplications are not considered in the primary analysis since repeat-dependent somatic mutation via deletion and/or duplication is less likely to induce altered protein. Nevertheless, the potential for altered protein exists through gene conversion or other processes that combine sequence from multiple instances of the gene. The primary autoantigen in Addison's disease is encoded by CYP21A2 (Table 1), which resides within a segmentally duplicated region and is a known locus of germ-line gene conversion [156].
A five-gene cluster (GH1, GH2, CSH1, CSHL1, CSH2) on chromosome 17 resides in a region characterized by complex segmental duplications with identity ranging from 92% to 96%. This cluster is a hotspot for germ-line gene conversion [157]. Variations in these genes are associated with metabolic syndrome later in life [158]. Anti-pituitary antibodies are observed in conjunction with type-2 diabetes [159,160] and GH1 is one of the autoantigens [161]. GH1 codes for human growth hormone, and growth impairment is observed in celiac disease in conjunction with anti-pituitary antibodies [162].

Mechanisms of Coherent Somatic Mutation
PTPRN2 is an outlier not just in the length of its repetitive sequence; it has the most predicted sites of R loop formation in the whole genome [163]. The R loop sites do not overlap the 12 kb repeat in PTPRN2, but several long R loop sites occur about 20 kb upstream of this repeat. These R loops may contribute to the instability of the repeat region, and implicate mis-splicing [22] of PTPRN2 in T1D.
Coherent somatic mutation can occur through a variety of mechanisms besides repeat instability and gene conversion, discussed below and summarized in Table 10.

RAG-mediated Somatic Recombination and Rheumatoid Factor
Cancer studies provide valuable information about coherent somatic mutation in vivo. Many cancers elicit antibodies that are also found in autoimmune disease [164], further supporting a role for somatic mutation in autoimmunity. A striking example of coherent somatic mutation in cancer is the gene IKZF1. Internal IKZF1 deletions occur in over 80% of cases of BCR-ABL1 acute lymphoblastic leukemia (ALL) [165]. Consistent breakpoints suggest aberrant RAG-mediated recombination [165]. The mutations coincide with a transition in the cancer from Chronic Lymphocytic Leukemia (CLL) to ALL.
CD5 expression on B cells is a common feature of both RA and CLL [166], CD5 expression correlates with RAG activity in B cells of people with autoimmune disease [167], and RAG is expressed in B cells in the RA synovium [168]. In RA, the appearance of rheumatoid factor (RF, an antibody to Fc-IgG) correlates with the hypogalactosylation of IgG, occurring roughly two years after the appearance of antibodies to citrullinated proteins, but two years before RA diagnosis [169]. RF is detected in several other autoimmune and infectious diseases [170].
If the RAG-dependent IKZF1 mutations that consistently occur in ALL also occur in RA B cells, possibly followed by clonal expansion, then aberrant glycosylation would be explained because IKZF1 appears to be critical for proper IgG glycosylation [171]. The improperly glycosylated IgG would be immunogenic. In the context of a normal immune response to a pathogen, a somatic mutation to IKZF1 could be adaptive, because it would lead to RF production and potentially enhanced clearance of immune complexes [172]. However, in the context of an autoimmune response, RF production could increase the severity of disease [172]. RF is also found in SLE [173], and reduced IKZF1 expression has been associated with SLE [174,175].

Table footnotes: Genes with long STRs come from Figure 1, Table 3 and Table 4. A bold autoantigen label corresponds to a known primary autoantigen. The CNV column indicates whether a germ-line STR length variant is associated with the disease. Gene expression changes during disease are also shown. * While many genes qualify as encoding peri-antigens in SLE, TTC34 encodes a peri-antigen for many autoantigens.

Table footnotes: Comorbidity may reflect common susceptibility factors or secondary disease effects, such as inflammation in RA contributing to CAD risk [427]. Comorbidities with some of these diseases exist for alopecia areata [428,429], vitiligo [430,431], juvenile idiopathic arthritis [432], myasthenia gravis [433,434], and Addison's disease [435], five additional PTPN22-associated diseases, as well as celiac disease [436,437] and pernicious anemia [438,439]. * A link between GD and CAD is potentially confounded by the anti-atherogenic properties of thyroid hormones [440]. doi:10.1371/journal.pone.0101093.t008

Mutagens and Oxidative Stress
Cigarette smoking is mutagenic, and appears to be selectively associated with antibodies to the primary autoantigens encoded by ENO1 [176], VIM [177], and FGB [177] in RA. VIM mutations induced by oxidative stress influence antigenicity [178]. The association of RA with smoking is strong only among individuals with particular HLA alleles. A similar phenomenon occurs in MS [179]. This interaction of mutagen, autoantigen and HLA suggests that mutation is pathogenic primarily when the mutant epitope is well-presented by the corresponding antigen presenting molecules.

Clonal Expansion Following Somatic Mutation
Somatic mutations in the TSHR gene are relatively common [180] and can induce activation and clonal expansion in thyroid tissue [181,182], potentially explaining TSHR-antigenicity in GD. Paraneoplastic autoimmunity [164,183,184] is a related phenomenon in which an immune response to a tumor expressing mutant antigens also affects normal tissues expressing wild-type proteins.

Pathogen-Induced Protein Binding and Modification
A pathogen-expressed protein that binds with an endogenous protein complex could serve as a CD4+ T cell target, providing help to B cells generating antibodies to proteins in the protein complex. A pathogen-modified endogenous protein could behave in a similar fashion. Rheumatic fever (RHF) is a condition characterized by autoimmune attack against cardiac muscle, usually associated with group A streptococcal infections [185]. There is some in vitro evidence of cross-reactivity of antibodies to streptococcal proteins and autoantigens in RHF [186]. Nevertheless, there is also evidence that mimicry may not be an important feature of RHF [187]. Autoreactivity to collagen in RHF has been proposed to result from collagen binding to streptococcal proteins [187].
The RHF autoantigens vimentin, myosin, and tropomyosin (Table 2) form part of the calcium-bound sarcomere protein complex [188]. Two lines of evidence implicate vimentin as an initiating autoimmune target (and peri-antigen) in RHF. First, vimentin is modified (ADP-ribosylated) by the group A streptococcal protein SpyA in a way that alters both its sequence and its organization [189]. Second, group A streptococci are known to bind to vimentin, particularly at sites of muscle injury [190].

Apoptotic Cleavage
Adaptive immune responses require the joint participation and mutual activation of CD4+ T cells and antigen-presenting cells such as B cells. B cells become anergic under chronic low-level exposure to antigen with limited costimulation [191]. Nevertheless, even anergic B cells can be activated with sufficient stimulation [191]. Protein that is post-translationally modified only upon apoptosis would presumably generate only low-level exposure to B cells. A post-translationally modified protein that forms part of a protein complex containing a somatic mutant is liable to trigger B cell/T cell co-activation. In such a case, a CD4+ T cell specific to the mutant peri-antigen could activate a previously anergic B cell clone. Such a mechanism could explain why post-translationally modified proteins, particularly those generated during apoptosis, would be over-represented among B cell autoantigens [51,52,192].

Table 9. Genes with intronic tandem repeats occupying more than 90% of an intron.

Retrotransposition
An additional potential mechanism of coherent somatic mutation is retrotransposition. Retrovirus [193,194] and retrotransposon [195] integration hotspots exist, independent of selective pressure for cell growth/survival. This form of mutation could be relevant to Bout Onset Multiple Sclerosis (BOMS) in which an endogenous retrovirus has been implicated [196,197], as well as schizophrenia [198] and amyotrophic lateral sclerosis [199]. Alternatively, retroviral expression could be a driver of neuroinflammation [200], leading to somatic mutation at other mutable repeat sequence.

Dysregulation of Protein Modification Pathways
In SSc, the presence of one antibody type is generally exclusive of the others [201,202], suggesting several subtypes of SSc with different mechanisms of induction. Chromosomal abnormalities are found at high frequency in the lymphocytes of patients with anti-centromere or anti-TOP1 antibodies, but at normal frequency in patients with anti-RNAPIII antibodies [203]. In SSc fibroblasts, increased sumoylation of TOP1 induces deficits in TOP1-mediated supercoiled-DNA relaxation [204] and disruption of TOP1 is known to cause chromosomal aberrations [203]. Inhibition of sumoylation improves TOP1 function in fibroblasts [203] and reduces fibrosis [205]. One interpretation of these data is that anti-TOP1 SSc is a sumoylation disorder. Hyper-sumoylated TOP1 could induce cell death via chromosomal aberrations, and at the same time trigger an immune response. Because the post-translationally modified protein would not be normally presented to immature B or T cells, tolerization to modified TOP1 would not occur. The centromere protein and SSc autoantigen CENPB is also a sumoylation target [206][207][208][209].
A similar neoantigen-creating role for sumoylation in a subset of patients with primary biliary cirrhosis (PBC) has previously been proposed [210]. In patients with antibodies to PML or SP100, two sumoylation target proteins [206][207][208][209], antibodies to SUMO2 and SUMO1 have been observed [210]. CENPB is also an autoantigen in PBC (Table 2). SSc and PBC are comorbid, with anti-CENPB as a common risk factor [211,212], suggesting a shared etiology.

Schizophrenia and Autism
Schizophrenia and autism have prominent immunological features, including HLA associations, comorbidity with autoimmune diseases, and associations with viral triggers and maternal infections during pregnancy (Table 11). Immunological theories of schizophrenia have been proposed [213].
A clue that somatic repeat mutation may contribute to schizophrenia comes from a twin study in which a genomewide measure of somatic trinucleotide repeat mutation was obtained [214]. A high somatic trinucleotide mutation rate was associated selectively with the schizophrenic proband in monozygotic twins discordant for disease [214].
Four NBPF family genes are among the top twelve in Figure 1, including the two longest STR sequences. The four NBPF genes in Figure 1 are located between positions 145.2 M and 148.3 M on chromosome 1, overlapping the 1q21.1 region. NBPF genes contain many copies of the DUF1220 element; DUF1220 copy number is closely related to brain size, and humans have many more copies than other primate species [215,216]. In humans, high DUF1220 copy number correlates with macrocephaly, and low copy number correlates with microcephaly [217,218]. Germline deletions within the 1q21.1 region are associated with schizophrenia [219,220], while duplications are associated with autism [217]. Somatic genomic instability is likely in such highly repetitive regions [217]. Somatic mutations early in embryonic development [221], suggested by the link to maternal infections during pregnancy, could lead to effects that mirror those of germline mutations. Early somatic mutation also creates the possibility that the thymus and brain express different haplotypes, preventing thymic deletion of T cells reactive to proteins coded by a brain-specific haplotype.
Several autism-related genes appear in Figure 1 and Table 4. SNTG2 binds to neuroligins 3 and 4, genes that have been associated with autism, and known autism-related mutations in those neuroligins weaken the binding with SNTG2 [226]. ROBO2 is an axon-guidance protein with significantly reduced expression in autistic brains [227]. ASMT encodes the last enzyme in the melatonin biosynthesis pathway, low melatonin expression is observed in autism spectrum disorders, and rare ASMT mutations are associated with autism [228][229][230]. MGAM is a gene involved in starch metabolism, with dysregulated mRNA expression in autism [231]. Germ-line loss-of-function mutations in KATNAL2 have been associated with autism [232].
Additional autism-related genes appear in Figure 3 and exhibit structural variation in their STR sequence (File S1). Like ROBO2, PLXNA4 is an axon-guidance protein with significantly reduced expression in autistic brains [227]. ASMTL binds with TDO2 [233]; TDO2 is the rate-limiting enzyme in the catabolism of tryptophan, the precursor of serotonin, which is known to be elevated in 30% of autism cases [234].
There is a high concentration of autism-related genes among a relatively small set of putatively mutable genes. In light of the autoimmune features of autism (Table 11), this concentration suggests that somatic repeat mutation may contribute to the etiology of autism.

Explaining Autoimmunity
A satisfying feature of the coherent somatic mutation hypothesis is that it provides a parsimonious yet comprehensive account of autoimmunity. The initiation of most diseases is attributed to a single mutable locus. A handful of diseases having several known subtypes include more than one corresponding mutable locus. Only four of the top sixteen genes in Figure 1 (ANKRD36C, ANKRD36, AHNAK2, NSUN6) do not have a link with an autoimmune disease, an autoimmune-associated mental disorder, or atherosclerosis. These relatively uncharacterized genes are promising candidates for future study.
The most prominent prior theory of autoimmunity is molecular mimicry, the hypothesis that peptides similar to host proteins are expressed by host-resident microbes, sometimes inducing an autoimmune reaction against the host proteins. The attractive feature of molecular mimicry has been that it provides a plausible explanation for the known link between infection and autoimmunity [235,236]. However, despite decades of research, no human autoimmune diseases have been clearly attributed to molecular mimicry [235,237,238].
Autoimmune diseases have historically been categorized as organ-specific or systemic, with some diseases hard to categorize [239]. Under the coherent somatic mutation hypothesis, both kinds of disease have a common etiology, with the phenotype dependent on the expression patterns of the autoantigen. A narrow expression pattern (such as PTPRN2) leads to an organ-specific disease (T1D), while a widely expressed protein complex (TTC34/PPP4C as proposed in this report) leads to a systemic disease (SLE).
The incidence of each of several autoimmune diseases has been rising in recent years [240], as has the apparent incidence of autism [241]. The "hygiene hypothesis" states that autoimmune disease is linked to the absence of infections, through one of several possible immunoregulatory mechanisms [240]. Some infections that are protective if they occur early in development are possible triggers of autoimmunity if they occur later [240]. The present theory is consistent with a variant of the hygiene hypothesis in which tolerance to coherently mutated antigens is dependent on the early generation of such mutants. Infections or other inflammatory stimuli would increase the rate of somatic mutation, allowing for more efficient induction of peripheral tolerance. In the absence of peripheral tolerance, late generation of somatic mutants could induce autoimmunity. Alternative hypotheses based on increasing exposure to environmental mutagens [242,243] are also consistent with an etiology dependent on somatic mutation.

Autoinflammatory Disease
Several non-autoimmune diseases may also be caused by somatic mutation of highly mutable repeat sequence in the context of inflammation. Atopic dermatitis and ichthyosis vulgaris are inflammatory skin conditions caused by inactivating germ-line mutations of the FLG gene in some cases [244,245]. Somatic inactivating mutations of the 10.8 kb coding tandem repeat in FLG, reinforced by local inflammation, could contribute to the pathogenesis of these conditions. An accumulation of somatic mutations in PTPRN2 (without autoimmunity) could lead to glucose intolerance [246]. Similar mechanisms could underlie various autoinflammatory conditions [247].

Genetics
Our study is limited by its reliance on a single human genome for long repetitive sequence. Some reference alleles are much shorter than those typically observed in the population (e.g., MUC1 [248,249]). It is likely that long repetitive sequence is highly variable in the population [37,38,250], and that variations in germ-line sequence would modulate disease risk as seen for CR1, LPA, HP and DMBT1. Nevertheless, primary autoantigens whose genes contain long repeats were identified in a presumably healthy random individual, suggesting that, at least for those genes, all humans have some degree of somatic mutation and risk for disease.
Linkage-based analysis of sequence variation in a population would not identify mutable repetitive regions because the high germ-line mutation rate would rapidly eliminate any linkage disequilibrium with adjacent sequence [157]. In contrast, there are likely to be few germ-line mutations within a pedigree, meaning that estimates of heritability [251] will include any effects of commonly inherited mutable sequence. Together, these effects could explain at least some of the missing heritability observed in many genomewide association studies [252][253][254].

Immunological Aspects
Not all somatic mutation is likely to be immunogenic, even in protein-coding sequence. Somatic mosaicism observed in triplet repeat expansion diseases [255] would not generate immunogenic protein if the repeat length is longer than the fragment expressible in MHC molecules (8-10 amino acids for MHC-I, 15-24 amino acids for MHC-II). On the other hand, a long triplet repeat could be vulnerable to somatic deletions, yielding a short, potentially immunogenic peptide repeat.
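The length argument can be illustrated with a small sketch that enumerates every peptide window of a fixed length (9 residues, within the MHC-I range quoted above) from a toy protein containing a homopolymeric glutamine tract. The flanking sequences, the tract lengths and the function names are hypothetical and chosen only for illustration; window enumeration is a crude stand-in for antigen processing and presentation.

```python
def peptide_windows(protein: str, k: int) -> set[str]:
    """Return the set of all length-k peptide windows in a protein sequence."""
    return {protein[i:i + k] for i in range(len(protein) - k + 1)}


def with_tract(n_repeats: int) -> str:
    """Build a toy protein: hypothetical flanks around a poly-Q tract."""
    left_flank = "MKTAYIAKRR"    # hypothetical N-terminal flank
    right_flank = "GSPEDLNHVW"   # hypothetical C-terminal flank
    return left_flank + "Q" * n_repeats + right_flank


K = 9  # window length within the 8-10 residue MHC-I range

germline = peptide_windows(with_tract(20), K)
expanded = peptide_windows(with_tract(40), K)
deleted = peptide_windows(with_tract(5), K)

# Expanding a tract that is already longer than the window adds no new windows.
novel_after_expansion = expanded - germline
# Shrinking the tract below the window length creates windows that span both
# flanks; these are absent from the germ-line protein.
novel_after_deletion = deleted - germline

print(len(novel_after_expansion), sorted(novel_after_deletion))
```

Under these assumptions, expansion yields no windows absent from the germ-line protein, whereas deletion to a tract shorter than the window yields a few windows that bridge the two flanks.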
Keratinocytes express FLG [256] and are non-professional antigen presenting cells (APCs) [257]. Pancreatic beta cells express PTPRN2, and thyroid epithelial cells express TPO; both of these cell types are also non-professional APCs. The purpose of antigen presentation by such cells is assumed to be tolerization in the absence of costimulatory molecules [258], which seems appropriate in the case of three primary (and putatively mutable) autoantigens. The presence of antigen presentation on these cell types may have allowed the evolution of mutable genes without significant risk of abrogating tolerance. Alternatively, antigen presentation within these cell types may have evolved as a response to selective pressure for longer repeat sequences in these genes.
While T cell tolerance can be induced by the administration of peptides [259,260], attempts to induce tolerance in humans suffering from autoimmune disease have been largely unsuccessful [261]. Nevertheless, the success of these attempts is critically dependent on the peptide sequence used. The coherent somatic mutation hypothesis suggests that for intronic repeats, the initial immunogenic proteins may be mis-spliced or truncated forms of a native protein. Peptides covering the splice or truncation boundaries of putative mutant protein would be natural candidates for tolerance induction.

Validation
Many of the high prevalence diseases in Table 1 have been specifically associated with mutable antigens or peri-antigens in the present report. Some more speculative hypotheses for the involvement of somatic mutation in other diseases are presented in File S1. The proposed associations should be considered tentative, and subject to experimental validation. For reasons described previously and below, experimental validation may be technically difficult.
Recent sequencing advances have the potential to accurately sequence long repetitive regions [250]. Accurately sequencing many cells in search of rare somatic mutants will require significantly more effort, although new technologies will help [6]. Obtaining putatively mutated cells from sites of autoimmune damage is challenging, since such cells would be subject to immunological destruction as soon as the mutation occurs.

Conclusions
The coherent somatic mutation hypothesis states that recurrent or clonal somatic mutation underlies the initiation of autoimmune disease. Long STR sequence is likely to be somatically mutable in vivo, motivating the present study. A highly significant association between three primary autoantigens (covering four autoimmune diseases) and long STR sequence was established. Additional autoantigens and peri-antigens were identified among genes spanning long STR sequence, and among genes with other known markers of somatic mutation. The work presented here could lead to a partial resolution of the mystery of why particular proteins are targets of autoimmune destruction [50]. Experimental validation of the specific predictions made here is the next step.

Materials and Methods
Genome coordinates use the GRCh37 (hg19) sequence. Gene names use HGNC approved nomenclature. Queries were submitted to the UCSC MySQL database server [64] and processed as described below. The SQL queries can be found in File S1. Gene transcripts were required to be protein-coding according to GENCODE version 17 [262] or (for Queries 2 and 6) RefSeq [263].

Identifying Genes with Intragenic Repeats
Query 1 was submitted to obtain genes containing long or frequent repeats. The output from this query was edited as follows:
• Genes not on the reference chromosomes were removed. Only one such gene (MGC39584/AC018692.2 on chr4_gl000193_random) had length over 5 kb and none had a repeat count over 100.
• For genes occurring on both the X and Y chromosomes, only the X chromosome instance was retained.
• TMRF often generated multiple repeat candidates for a region with the periods of the candidates being multiples of the shortest period. In such cases, only the shortest-period candidate with the highest repeat-unit count was kept, even if it spanned a slightly smaller region.
• When TMRF generated a consensus repeat unit that was itself repetitive (e.g., AGTTAGTTAGTT) the TMRF entry was replaced by one with a shorter repeat unit (e.g., AGTT) and a higher repeat-unit count, retaining the degree of identity from the longer sequence. Examples include VPS53 (in which a 96 bp repeat is itself made of 3 instances of a 32 bp repeat), MUC4 (in which a 96 bp repeat consists of two consecutive instances of a 48 bp repeat), and MAL (with an 8 bp AGTGAGTG repeat).
• In a small number of cases, TMRF generated multiple essentially contiguous repeats with the same period and consensus sequence. The only such case where the repeat was either more than 5 kb long or contained more than 600 repeat units was PTPRN2 (chr7:158122660-158135328) where the contiguous repeat records were combined into a single longer 12.6 kb repeat.
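The replacement of a repetitive consensus unit by its shorter unit can be sketched as follows. This is a minimal illustration rather than the procedure actually used for the edits above: the function name is hypothetical, and it only detects exact periodicity, whereas a real consensus (such as the 96 bp units cited) may be only approximately periodic and would need manual or alignment-based inspection. The degree of identity is assumed to be carried over unchanged and is not modeled.

```python
def collapse_repeat_unit(consensus: str, copy_count: float) -> tuple[str, float]:
    """Reduce a consensus unit to its shortest exact period.

    Returns the shorter unit and the copy count rescaled so that
    unit length x copy count (the total repeat length) is unchanged.
    """
    n = len(consensus)
    for period in range(1, n + 1):
        # A valid period divides the length and tiles the consensus exactly.
        if n % period == 0 and consensus[:period] * (n // period) == consensus:
            return consensus[:period], copy_count * (n // period)
    return consensus, copy_count  # not reached: period == n always matches


# The copy counts below are hypothetical.
print(collapse_repeat_unit("AGTTAGTTAGTT", 10.0))
print(collapse_repeat_unit("AGTGAGTG", 25.5))
```

A unit that is not internally repetitive is returned unchanged, so the function can be applied to every record.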
To see whether the output was dependent on the source of the gene annotations, I reformulated the query as Query 2 using RefSeq [263]. The following differences were noted for repeats longer than 5 kb:
• There was some discrepant labeling of the NBPF genes. The NBPF repeat sequences were the same, with the exception of one NBPF10 repeat (see below).
• A large majority of repeats were common to the two annotations, with the differences mentioned above largely due to differences in the labeling of a gene transcript as protein coding.
The differences between the two annotations appear to be small. The MUC19 transcript identified by RefSeq may have immunological significance given the association of MUC19 with Crohn's disease and ulcerative colitis [105,264,265].
Genes that span gaps in the human assembly where the gaps are presumed to include repetitive sequence (e.g., MUC5AC [250]) are absent from the query result. Applying the tandem repeat finder algorithm [63] to the MUC5AC exon 31 sequence reported by Guo et al. [250] revealed a longest tandem repeat of 1.6 kb.

Table 11. Immunological features of autism and schizophrenia.
Feature | Autism | Schizophrenia
 | Yes [441] | Yes [452]
Autoantibodies | Brain-specific antibodies in mothers and probands [441,446,453,454]; anti-nuclear antibodies [446,455] | Yes [456]
Other | Gene expression changes reminiscent of autoimmunity [457]; NK cell dysregulation [458]; amelioration of aberrant behaviors during fever [459] | Various immunological abnormalities [460][461][462]; differentially expressed genes involve immune pathways [463]
doi:10.1371/journal.pone.0101093.t011

Identifying Genes Spanning Long Segmental Duplications
Query 3 was used to identify a preliminary set of segmental duplications occurring within protein-coding genes, using the segmental duplication track [266] of the UCSC MySQL database server [64]. At least one duplicon was required to occur entirely within the gene sequence. The structure of the identified segmental duplications was examined using the UCSC genome browser. Where more than two contiguous tandem duplications exist (CR1, NEB, SPDYE3), the records for the gene were combined into a single record for the longer compound tandem repeat. When multiple segmental duplications overlapped (TTC34) only the longer duplication was retained.
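Combining contiguous duplication records into one compound record amounts to an interval merge. The sketch below is an assumption-laden illustration, not the curation actually performed (which relied on inspection in the genome browser): the `Interval` type, the `max_gap` tolerance and the coordinates are all hypothetical.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # exclusive


def merge_contiguous(records: list[Interval], max_gap: int = 0) -> list[Interval]:
    """Merge records on the same chromosome separated by at most max_gap bases."""
    merged: list[Interval] = []
    for rec in sorted(records):  # sorts by chrom, then start, then end
        last = merged[-1] if merged else None
        if last and last.chrom == rec.chrom and rec.start - last.end <= max_gap:
            # Extend the previous record; overlapping records merge as well.
            merged[-1] = Interval(last.chrom, last.start, max(last.end, rec.end))
        else:
            merged.append(rec)
    return merged


# Hypothetical coordinates for illustration only.
records = [
    Interval("chr1", 100, 200),
    Interval("chr1", 200, 300),
    Interval("chr1", 305, 400),
    Interval("chr2", 50, 80),
]
print(merge_contiguous(records))              # strict adjacency
print(merge_contiguous(records, max_gap=10))  # tolerate small gaps
```

The `max_gap` parameter stands in for the judgment of what counts as "essentially contiguous"; any particular value would need to be justified against the data.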

Additional Queries
Query 4 was used to identify long self-alignments (score at least 60) within protein-coding genes, using the self-alignment track [267] of the UCSC MySQL database server [64]. Query 5 was used to identify repeats constituting almost an entire intron within a gene. Query 6 was used to identify long repeats in the mouse genome; repeats are required to overlap a protein-coding RefSeq gene, including 5 kb of sequence upstream of the gene start site.
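The criterion behind Query 5 can be illustrated by computing the fraction of an intron covered by repeat intervals and comparing it with the 90% threshold given in the caption of Table 9. The actual SQL is in File S1; the function name, the half-open coordinate convention and the coordinates below are assumptions made for this sketch.

```python
def covered_fraction(intron: tuple[int, int], repeats: list[tuple[int, int]]) -> float:
    """Fraction of an intron's bases covered by the union of repeat intervals.

    Coordinates are half-open (start inclusive, end exclusive).
    """
    intron_start, intron_end = intron
    covered = 0
    cursor = intron_start  # first intron base not yet counted
    for start, end in sorted(repeats):
        start = max(start, cursor)   # skip bases already counted or upstream
        end = min(end, intron_end)   # clip to the intron
        if end > start:
            covered += end - start
            cursor = end
    return covered / (intron_end - intron_start)


# Hypothetical intron and repeat coordinates.
intron = (1000, 2000)
repeats = [(1020, 1600), (1500, 1990)]
fraction = covered_fraction(intron, repeats)
print(fraction, fraction > 0.9)
```

Taking the union of the intervals avoids double counting when repeat annotations overlap.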
Query 7 was used to identify pairs of long repeats where the second repeat unit is the reverse complement of the first. The purpose of this analysis is to understand the genomewide significance of this feature of the NSUN6 repeats (File S1). The output of this query was filtered to remove sequences on unplaced chromosomes and rows in which the two repeat sequences are not reverse complements. Queries 8 through 12 identify structural variation at STR loci utilizing information from the DGV database [268][269][270] (File S1).
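The reverse-complement filter applied to the output of Query 7 can be sketched as below. The function names are hypothetical, and the optional `allow_rotation` flag is an added assumption: tandem repeat units can be reported in any phase, so a rotation-tolerant comparison may be useful, but the text above does not state whether rotations were considered.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def is_reverse_complement_pair(unit_a: str, unit_b: str,
                               allow_rotation: bool = False) -> bool:
    """True if unit_b is the reverse complement of unit_a.

    With allow_rotation=True, unit_b may also be any cyclic rotation of
    the reverse complement (an assumption, see the note above).
    """
    rc = reverse_complement(unit_a).upper()
    other = unit_b.upper()
    if rc == other:
        return True
    # Every rotation of `other` is a substring of `other + other`.
    return allow_rotation and len(rc) == len(other) and rc in other + other


print(reverse_complement("AGTT"))
print(is_reverse_complement_pair("AGTT", "AACT"))
print(is_reverse_complement_pair("AGTT", "ACTA", allow_rotation=True))
```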

Significance of Autoantigen Over-Representation in Gene Lists
Primary Autoantigens. To determine the statistical significance of a set of primary autoantigens within a gene list, an estimate of the number of known primary autoantigens for common autoimmune diseases is required. Based on Table 1, there are nineteen known primary autoantigens for those diseases. This number includes pANCA, a category covering five proteins in UC [271], and ribosomal P (3 proteins), so a more precise estimate of the number of genes is 25. The null hypothesis $H_0$ states that each gene associated with a primary autoantigen is equally likely to appear anywhere in the ranked list of genes. There are 20,330 protein-coding genes in GENCODE V17 [272]. Choosing the top eleven genes is therefore well approximated by a binomial process, where a selected gene has a $25/20330 \approx 0.0012$ probability of being a primary autoantigen under the null hypothesis.
I apply an exact one-sided binomial test of goodness of fit. The p-value for 3 or more of the top 11 genes being primary autoantigens under the null hypothesis is $3.0 \times 10^{-7}$. The significance is robust to the size of the prefix of the gene list. For example, taking the top 35 genes rather than the top 11 yields $p < 1.2 \times 10^{-5}$. One can therefore reject the null hypothesis and conclude that the overrepresentation of primary autoantigens near the top of the list is highly significant.
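The one-sided binomial tail described above can be recomputed with a few lines of standard-library Python. This is a sketch for checking the arithmetic, not the original analysis code; the function name is hypothetical, and the top-35 case assumes that the number of primary autoantigens in the prefix remains three.

```python
from math import comb


def binomial_upper_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p), summed directly over the upper tail."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


N_PRIMARY = 25    # genes encoding primary autoantigens (from the text)
N_GENES = 20330   # protein-coding genes in GENCODE V17 (from the text)
p_null = N_PRIMARY / N_GENES

p_top11 = binomial_upper_tail(3, 11, p_null)  # 3 or more among the top 11
p_top35 = binomial_upper_tail(3, 35, p_null)  # assumed: still 3 among the top 35

print(f"p_null={p_null:.4f}  top11={p_top11:.2e}  top35={p_top35:.2e}")
```

Summing the upper tail directly, rather than subtracting the lower tail from one, avoids loss of precision when the tail probability is very small.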
Autoantigens. Determining the significance of a set of autoantigens within a gene list requires an estimate of the total number of autoantigens. Stadler et al. [54] tabulate 348 known autoantigens, but this list is incomplete (e.g., it does not include FLG or PKP3). For the purposes of determining a p value, 400 autoantigens and 20,330 protein-coding genes [272] are assumed for a one-sided binomial goodness of fit test. All p values calculated above remain significant at $p < 0.05$ even if an estimate of 600 autoantigens were used.
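The sensitivity of the autoantigen p values to the assumed total can be explored with the same binomial tail. The sketch below is illustrative only: the function and variable names are hypothetical, and the two cases shown (the top gene, and two or more among the top twelve) are taken from the tests mentioned in the Results.

```python
from math import comb


def binomial_upper_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


N_GENES = 20330  # protein-coding genes in GENCODE V17 (from the text)

results = {}
for n_autoantigens in (400, 600):  # assumed totals discussed in the text
    p_null = n_autoantigens / N_GENES
    results[n_autoantigens] = {
        "top_gene": binomial_upper_tail(1, 1, p_null),
        "two_of_top_12": binomial_upper_tail(2, 12, p_null),
    }

for n_autoantigens, tails in results.items():
    print(n_autoantigens, {name: round(value, 4) for name, value in tails.items()})
```

Under these assumptions, both tail probabilities stay below 0.05 when 600 autoantigens are assumed.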

Supporting Information
File S1