Population Genetics and New Insight into Range of CAG Repeats of Spinocerebellar Ataxia Type 3 in the Han Chinese Population

Spinocerebellar ataxia type 3 (SCA3), also called Machado-Joseph disease (MJD), is one of the most common SCAs worldwide and is caused by a CAG repeat expansion in the ATXN3 gene. Based on the number of CAG repeats, alleles of ATXN3 can be divided into normal alleles (ANs), intermediate alleles (AIs) and expanded alleles (AEs). Whether the frequency of large normal alleles (large ANs) is related to the prevalence of SCA3 remains controversial. There is also considerable confusion about the specific ranges of CAG repeats, which are fundamental to the genetic analysis of SCA3. To address these issues, we made a novel CAG repeat ladder to detect the CAG repeats of ATXN3 in 1003 unrelated normal Chinese individuals and studied haplotypes defined by three single nucleotide polymorphisms (SNPs) close to ATXN3. We found that the number of CAG repeats ranged from 13 to 49, with 14 being the most common. We observed a positive skew, the highest frequency of large ANs reported so far, and 4 AIs that had never been reported before. In addition, AEs and large ANs shared the same haplotypes defined by the SNPs. Based on these data and other related studies, we presume that de novo mutations of ATXN3 emerging from large ANs are at least one of the survival mechanisms of mutant ATXN3, and we propose redefining the range of CAG repeats as: ANs≤44, 45≤AIs≤49 and AEs≥50.


Introduction
Spinocerebellar ataxias (SCAs) are a group of autosomal dominant ataxic disorders, and spinocerebellar ataxia type 3 (SCA3) is regarded as one of the most common SCAs in the world [1,2]. SCA3 is caused by a CAG repeat expansion located in exon 10 of the ATXN3 gene on chromosome 14q32.1 [3] and is associated with a variety of clinical manifestations, including progressive ataxia, ophthalmoplegia, pyramidal signs, extrapyramidal signs and facial myokymia [4].
The relative frequency of SCA3 among SCAs varies between populations. Some studies have suggested that the frequency of large normal alleles (large ANs) is related to the prevalence of SCA3 [5][6][7][8][9][10][11] and that large ANs and expanded alleles (AEs) share the same haplotypes [8,10,12]. These data support the hypothesis that large ANs may constitute a reservoir from which AEs may emerge. However, a study of 1000 normal individuals from the Portuguese population, which has the highest prevalence of MJD globally, reported the opposite result [13]. Given the large sample size of that study, it seemed reasonable to conclude that there is no direct relationship between large ANs and the prevalence of SCA3 [13].
Because of the complexity and heterogeneity of its clinical manifestations, molecular genetic testing is the only way to make a definite diagnosis of SCA3. In addition, the definition of the range of CAG repeats is a key point in interpreting the results of genetic testing, as noted in several guidelines for genetic testing in SCAs [14][15][16]. The numbers of CAG repeats of ATXN3 were first described as 13-36 in ANs and 68-79 in AEs [3]. However, in recent publications [17][18][19][20][21][22], and even in guidelines for genetic testing of SCAs [14,15], the range of CAG repeats has been broadened to differing numbers, which might lead to confusion in the molecular diagnosis of SCA3.
To address these issues, we prepared a novel CAG repeat ladder, used it as a size marker to analyze the distribution of CAG repeats in 1003 unrelated normal Chinese individuals, and analyzed haplotypes defined by three SNPs close to ATXN3. Based on the data of the current study and other related studies, we further redefined the range of CAG repeats in ATXN3 to guide genetic testing in Chinese patients with SCA3.

Materials and Methods

Subjects
One thousand and three unrelated normal individuals with no known history of hereditary disorders were recruited. Most of them (997/1003) were from eastern and southeastern China, including Fujian (595), Shanghai (330), Zhejiang (30), Jiangsu (28), Anhui (6), Jiangxi (4), Hunan (2) and Shandong (2). The remaining 6 were from Henan (3), Shanxi (1), Beijing (1) and Liaoning (1). In addition, 30 unrelated SCA3 patients confirmed by molecular analysis of ATXN3 were recruited for haplotype analysis. Written informed consent was obtained from each subject (for subjects <18 years of age, written informed consent was obtained from their legal guardians), and the study was approved by the Ethics Committee of Huashan Hospital and First Affiliated Hospital. Genomic DNA was extracted from peripheral EDTA blood with a QIAamp DNA Blood Minikit (QIAGEN, Hilden, Germany).

Identification of CAG repeats via DNA sequencing
The number of CAG repeats was identified via DNA sequencing in 200 of the 1003 normal individuals. The CAG repeat region was amplified using MJD52/MJD25 primers as previously reported [23]. The amplified products were purified and subjected to direct sequencing as previously reported [24].

Preparation of a novel CAG repeat ladder
PCR products with different numbers of CAG repeats, as identified by DNA sequencing, were cloned into a pMD18-T vector according to the manufacturer's recommendations (TaKaRa, Chiba, Japan). The positive colonies were verified via PCR using MJD52/MJD25 primers [3], and the numbers of CAG repeats were further verified by DNA sequencing. The plasmids with different numbers of CAG repeats were mixed proportionally and used as a template to amplify the different CAG repeat regions. The resulting PCR product is a novel CAG repeat ladder containing many bands and can be used as a DNA size marker to identify the number of CAG repeats by 8% polyacrylamide gel electrophoresis (PAGE).

Identification of CAG repeats via PAGE
The numbers of CAG repeats were identified via PAGE in the remaining 803 normal individuals. The prepared CAG repeat ladder was used as a DNA size marker, and PAGE was performed as previously reported [23]. When the migration of a PCR product differed from that of every band of the CAG repeat ladder, the PCR product was cloned into the pMD18-T vector and the procedure was repeated in order to broaden the range of the ladder. To assess the validity of the CAG repeat ladder, 30 samples were selected randomly from the 803 normal individuals and their CAG repeat numbers were confirmed by both PAGE and DNA sequencing.

Analyses of SNPs genotype and haplotype
Three SNPs closely linked to the CAG repeats were studied. The genotype of A669TG/G669TG was detected by single-strand conformation polymorphism (SSCP) analysis using MJD1VSR/MJD734R primers [25]. SSCP was performed in a polyacrylamide gel containing 5% glycerol. The gel was run at 27 V/cm at 20°C for 4.5 hours and silver stained to visualize the bands. The genotypes of the other two SNPs, C987GG/G987GG and TAA1118/TAC1118, were identified via allele-specific PCR using MJD-GGG or MJD-CGG and MJD-TAA or MJD-TAC, respectively, in combination with MJD-52 as a primer [12]. The PCR products were first separated by 2.5% agarose gel electrophoresis (AGE); if the bands were blurred, they were further separated by PAGE. Because the PCR products contained the CAG repeat region, the haplotype of these two SNPs (partial haplotype) could be identified via AGE and PAGE. However, the haplotype defined by all three SNPs (complete haplotype) could only be determined when the genotype of A669TG/G669TG was homozygous.
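The phasing rule described above can be illustrated with a minimal sketch. It assumes that the A669TG/G669TG genotype is available only as an unphased pair of bases, and that the two partial haplotypes (C987GG/G987GG plus TAA1118/TAC1118) are already phased by allele-specific PCR. The function name and the data layout are hypothetical and are not part of the original protocol.

```python
def complete_haplotypes(genotype_669, partial_haplotypes):
    """Combine an unphased SNP genotype with phased partial haplotypes.

    genotype_669: tuple of two bases at A669TG/G669TG, e.g. ("A", "A").
    partial_haplotypes: the two phased partial haplotypes, e.g. ["CA", "GC"].
    Returns the two complete haplotypes, or None when the first SNP is
    heterozygous and its phase therefore cannot be assigned.
    """
    first, second = genotype_669
    if first != second:
        # Heterozygous at position 669: phase is ambiguous, so no call is made.
        return None
    # Homozygous: both chromosomes carry the same base at position 669.
    return [first + partial for partial in partial_haplotypes]


# Hypothetical example subjects (not data from the study).
print(complete_haplotypes(("A", "A"), ["CA", "GC"]))
print(complete_haplotypes(("A", "G"), ["CA", "GC"]))
```

Returning None for heterozygotes mirrors the restriction in the text: those subjects contribute partial haplotypes only.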

Statistical analysis
All statistical analyses were performed using SPSS software version 11.0 (SPSS, Chicago, IL). The mean, median, variance, skewness and heterozygosity were determined for the distribution of ANs. In accordance with a previous report [5], alleles carrying more than 27 CAG repeats were defined as large ANs. Chi-square tests were used to compare the frequency of large ANs between the present study and other studies, and to assess the relationship between CAG repeat number and the genotypes and haplotypes of the SNPs. Results were considered statistically significant at p < 0.05.
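The descriptive statistics listed above can be sketched as follows. This is an illustration, not the SPSS procedure used in the study. It assumes the sample variance with an n-1 denominator and the bias-adjusted sample skewness, and it assumes that heterozygosity means the expected heterozygosity 1 - Σp², which the text does not specify. The function name and the example data are hypothetical.

```python
from collections import Counter
from statistics import mean, median, variance

LARGE_AN_THRESHOLD = 27  # alleles with more than 27 CAG repeats are large ANs [5]


def summarize_alleles(repeats):
    """Summarize a list of CAG repeat numbers, one entry per allele."""
    n = len(repeats)
    if n < 3:
        raise ValueError("at least three alleles are needed for skewness")
    m = mean(repeats)
    s2 = variance(repeats)  # sample variance, n - 1 denominator
    if s2 == 0:
        raise ValueError("skewness is undefined when all alleles are equal")
    # Bias-adjusted sample skewness (assumed formula, see lead-in).
    third_moment_sum = sum((x - m) ** 3 for x in repeats)
    skewness = n / ((n - 1) * (n - 2)) * third_moment_sum / s2 ** 1.5
    # Expected heterozygosity from allele frequencies (assumed definition).
    counts = Counter(repeats)
    heterozygosity = 1 - sum((c / n) ** 2 for c in counts.values())
    large_an_frequency = sum(1 for x in repeats if x > LARGE_AN_THRESHOLD) / n
    return {
        "mean": m,
        "median": median(repeats),
        "variance": s2,
        "skewness": skewness,
        "heterozygosity": heterozygosity,
        "large_an_frequency": large_an_frequency,
    }


# Hypothetical alleles, not data from the study.
summary = summarize_alleles([14, 14, 14, 23, 28, 30])
print(summary)
```

A positive skewness value indicates a tail towards larger repeat numbers, which is the direction discussed in the text as a possible sign of mutational bias.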

Analysis of CAG repeats ladder
As shown in Fig 1, Table 1.

Association of haplotypes with CAG repeats expansion
A669TG/G669TG, C987GG/G987GG and TAA1118/TAC1118 were studied in 94 normal individuals and 30 SCA3 patients. Three partial haplotypes defined by C987GG/G987GG and TAA1118/TAC1118, namely CA (72.2%), GC (26.2%) and GA (1.6%), were found among the 124 subjects. The CA haplotype was significantly associated with large ANs compared to non-large ANs (χ² = 69.325, P<0.001). Of the 96 partial haplotypes associated with large ANs, 94 were CA; the other two were GC and GA, both of which were associated with alleles containing 28 CAG repeats. The haplotypes associated with AEs were exclusively CA (Fig 3 and Table 2).
Sixty-nine subjects were homozygous at the A669TG/G669TG SNP: 68 were AA homozygotes and one was a GG homozygote. Four complete haplotypes defined by all three SNPs were found in these 69 homozygotes: ACA (87.7%), AGC (8.7%), AGA (2.2%) and GGC (1.4%). The ACA haplotype was significantly associated with large ANs compared to non-large ANs (χ² = 20.908, P<0.001). Of the 76 complete haplotypes associated with large ANs, 74 were ACA; the other two were AGC and AGA, both of which were associated with alleles containing 28 CAG repeats. The haplotypes associated with AEs were exclusively ACA (Table 2). The agarose gel and polyacrylamide gel electrophoresis analyses of the PCR products of C987GG/G987GG and TAA1118/TAC1118 are shown in Fig 4. The SSCP analysis of the PCR products of A669TG/G669TG is shown in Fig 5.

Discussion
DNA sequencing, capillary electrophoresis (CE) and PAGE are the most common methods used to determine the CAG repeat number. DNA sequencing includes direct sequencing and sequencing after DNA recombination. Direct sequencing can determine the CAG repeat number accurately, but heterozygosity and somatic mosaicism cause difficulties that affect its performance [26]. These difficulties can be avoided by DNA recombination, but the CAG repeat number may be altered during the recombination procedure [27,28]. The benefit of CE is that it is highly efficient; the drawback is that it cannot give the precise CAG repeat number [29]. Detecting CAG repeats via PAGE requires DNA size markers. The M13 DNA sequencing ladder is the most widely used because of its precision; however, it requires rigorous electrophoresis conditions and autoradiography to visualize the bands.
To avoid these challenges, we prepared a novel and precise CAG repeat ladder by mixing PCR products with different CAG repeat numbers. The size difference between any two consecutive bands was 3 bp; therefore, the electrophoresis conditions did not need to be as rigorous as for the M13 ladder. The different bands can be distinguished clearly in an 8% polyacrylamide gel and visualized by silver staining rather than autoradiography. Since the CAG repeat numbers obtained by DNA sequencing and by PAGE with the CAG repeat ladder were the same, the accuracy of the ladder could be assured. Thus, this novel ladder is an accurate and easy-to-use size marker for determining the number of CAG repeats.
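The 3 bp spacing of the ladder follows directly from the repeat unit length, as the following sketch shows. The length of the non-repeat flanking sequence amplified by the primers is not given in the text, so `flank_bp` is a hypothetical parameter and the value used in the example is arbitrary. The function names are illustrative.

```python
CAG_UNIT_BP = 3  # each CAG repeat adds three base pairs to the PCR product


def band_size(repeats, flank_bp):
    """Expected PCR product size for an allele with the given repeat number."""
    return flank_bp + CAG_UNIT_BP * repeats


def build_ladder(repeat_numbers, flank_bp):
    """Map each repeat number in the ladder to its expected band size in bp."""
    return {n: band_size(n, flank_bp) for n in sorted(repeat_numbers)}


def repeats_from_band(size_bp, flank_bp):
    """Recover the repeat number from a band size that matches a ladder rung."""
    extra = size_bp - flank_bp
    if extra < 0 or extra % CAG_UNIT_BP != 0:
        raise ValueError("band size does not correspond to a whole repeat number")
    return extra // CAG_UNIT_BP


# Hypothetical flank length of 100 bp; the ladder spans 13 to 49 repeats.
ladder = build_ladder(range(13, 50), flank_bp=100)
print(ladder[14] - ladder[13])
print(repeats_from_band(ladder[28], flank_bp=100))
```

In practice a sample band is matched to a ladder rung by eye on the gel. The arithmetic only illustrates why adjacent rungs differ by a constant 3 bp regardless of the flank length.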
In the present study, we analyzed the characteristics of the CAG repeats of ATXN3 in 1003 normal individuals and found that the CAG repeat number ranged from 13 to 49, with 14 being the most common. The frequency of large ANs (>27 repeats) was 0.28, which, as far as we know, is the highest among all related reports and significantly higher than the frequency in the Japanese [5], Indian [8] and Czech [11] populations and in a combined population of Acadian, Black, Caucasian, Inuit and Thai individuals [30]. In our previous study [23], the relative frequency of SCA3 in the mainland Chinese population was also significantly higher than that in these populations. Therefore, we presume that the frequency of large ANs is related to the prevalence of SCA3. In addition, the CA and ACA haplotypes were both significantly more frequent in large ANs than in non-large ANs, and the haplotypes of all AEs of the SCA3 patients in the present study were CA or ACA. Based on these data, we suggest that large ANs may constitute a reservoir from which AEs may emerge in the Chinese population. Studies of SCA3 in different populations, including the Chinese Taiwanese, Japanese, Australian, Indian, Czech and French populations [5][6][7][8][9][10][11][12], and studies of other diseases caused by dynamic mutations [31,32] also support this hypothesis. However, the hypothesis was not supported by two other studies of the Portuguese population, in which the prevalence of SCA3 is the highest globally [13,25]. Therefore, it seems that the relationship between large ANs and AEs differs between populations. Intriguingly, our results differ from those of Lima et al [13], although both studies examined the CAG repeat distribution in large samples of normal individuals. We showed that large ANs and AEs were closely related, that the skewness of the distribution, which could reflect mutational bias [5,31], was positive, and that there were four AIs. In contrast, Lima et al [13] found that large ANs were not related to the prevalence of SCA3, that the skewness was negative, and that there were no AIs.
Regarding the survival mechanism of mutant ATXN3, Lima et al [13] suggested that although larger AEs would be selected against because of early onset and severe symptoms, smaller AEs would survive and maintain AEs in the population, since their age of onset falls after the reproductive period. However, this hypothesis does not seem to consider that even smaller AEs might not survive, since their allele size may increase successively over generations and lead to early onset [33]. The hypothesis also cannot explain where the AEs of sporadic SCA3 patients in different populations originate [34][35][36][37]. Therefore, we suggest that de novo mutations of ATXN3 emerging from large ANs are at least one of the survival mechanisms. Mittal et al [10] reported that one AE with 45 CAG repeats may have arisen through gene conversion between one large AN and one smaller AN. Although de novo mutations of ATXN3, which would be direct evidence for our hypothesis, have not yet been reported, the study of Mittal et al [10] supports our hypothesis indirectly. It is plausible that de novo mutations of ATXN3 exist, since they have been found in other dynamic mutation diseases such as Huntington's disease (HD; MIM# 143100) [38], SCA2 (MIM# 183090) [39], SCA7 (MIM# 164500) [40] and SCA17 (MIM# 607136) [41]. Future studies should look for de novo mutations of ATXN3 in the next generations of carriers of large ANs and AIs.
The definition of the range of CAG repeats is fundamental to the genetic diagnosis of SCA3. The alleles of ATXN3 can be divided into ANs, which have never been associated with SCA3, AIs, which are partially associated with SCA3, and AEs, which are always associated with SCA3. When ATXN3 was first cloned in 1994, the ranges of ANs and AEs were 13-36 and 68-79, respectively, and no AIs were found [3]. After Tuite et al found one AE with 61 CAG repeats in 1995 [42] and Hsieh et al found one AN with 44 CAG repeats in 1997 [43], the ranges of ANs and AEs became ≤44 and ≥61, respectively, and these ranges were widely cited [26,44]. Afterwards, many studies found AEs with fewer than 61 CAG repeats [45][46][47][48][49][50], with 45 CAG repeats being the minimum for an AE [50]. In addition, Maciel et al reported one "AN" with 51 CAG repeats [26]. The range of CAG repeats was therefore divided into ANs≤44, 45≤AIs≤51 and AEs≥52, as reported by Paulson [51]. However, this range has not been cited by recent studies [14,15,[17][18][19][20][21][22], except for our previous study [23]. Moreover, the ranges cited by the recent studies differ from one another. Therefore, there is considerable confusion surrounding the range of CAG repeats.
In the current study, we found 4 individuals carrying 46, 48, 48 and 49 CAG repeats, respectively. Since neither they nor their family members presented with cerebellar ataxia or other symptoms related to SCA3, the likelihood that these 4 individuals are presymptomatic SCA3 patients is very low. Nevertheless, we will follow them up to exclude a presymptomatic state. We therefore suggest that the range of CAG repeats in ATXN3 should be: ANs≤44, 45≤AIs≤49 and AEs≥50. Intriguingly, in the HTT gene responsible for HD (MIM# 143100), mutable normal alleles (mutable ANs) are defined in addition to ANs, AIs and AEs [52]. Mutable ANs have not been associated with HD, but they can expand into AIs or AEs and cause HD in the next generation. However, mutable ANs have not been defined for ATXN3 because no de novo mutation has been reported.
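The range proposed above can be written as a simple classification rule. The following is a minimal sketch that uses the thresholds stated in the abstract (ANs≤44, 45≤AIs≤49, AEs≥50) and the large AN definition of more than 27 repeats [5]. The function names are hypothetical, and the sketch is not a diagnostic tool.

```python
AN_MAX = 44              # proposed upper bound for normal alleles
AI_MAX = 49              # proposed upper bound for intermediate alleles
LARGE_AN_THRESHOLD = 27  # large ANs carry more than 27 repeats [5]


def classify_allele(cag_repeats):
    """Return 'AN', 'AI' or 'AE' under the range proposed in this study."""
    if cag_repeats < 1:
        raise ValueError("CAG repeat number must be a positive integer")
    if cag_repeats <= AN_MAX:
        return "AN"
    if cag_repeats <= AI_MAX:
        return "AI"
    return "AE"


def is_large_an(cag_repeats):
    """True for normal alleles with more than 27 CAG repeats."""
    return classify_allele(cag_repeats) == "AN" and cag_repeats > LARGE_AN_THRESHOLD


# The four alleles with 46, 48, 48 and 49 repeats found in this study.
print([classify_allele(n) for n in (46, 48, 48, 49)])
print(classify_allele(14), is_large_an(28), classify_allele(50))
```

Under the earlier range attributed to Paulson [51], only the two upper thresholds would change, with AIs extending to 51 and AEs starting at 52.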
In summary, using the novel CAG repeat ladder, we determined the CAG repeats of ATXN3 in a large Chinese population and found 4 AIs that had never been reported, as well as the highest frequency (0.28) of large ANs. We presume that de novo mutations of ATXN3 are at least one of the survival mechanisms of mutant ATXN3, since large ANs were so closely linked to AEs in the Chinese population. We therefore redefine the range of CAG repeats in ATXN3 as: ANs≤44, 45≤AIs≤49 and AEs≥50.