Genome-Wide Association Study Identified Copy Number Variants Important for Appendicular Lean Mass

Skeletal muscle is a major component of the human body. Age-related loss of muscle mass and function contributes to public health problems such as sarcopenia and osteoporosis. Skeletal muscle mass, of which appendicular lean mass (ALM) is the main component, is a heritable trait. Copy number variation (CNV) is a common type of human genome variant that may play an important role in the etiology of many human diseases. In this study, we performed genome-wide association analyses of CNVs for ALM in 2,286 Caucasian subjects. We then sought to replicate the major findings in 1,627 Chinese subjects. Two CNVs, CNV1191 and CNV2580, were associated with ALM in the discovery sample (p = 2.26×10−2 and 3.34×10−3, respectively). In the Chinese replication sample, the two CNVs achieved p-values of 3.26×10−2 and 0.107, respectively. CNV1191 covers a gene, GTPase of the immunity-associated protein family (GIMAP1), which is important for skeletal muscle cell survival/death in humans. CNV2580 is located in the serine hydrolase-like protein (SERHL) gene, which plays an important role in normal peroxisome function and in skeletal muscle growth in response to mechanical stimuli. In summary, our study suggests two novel CNVs and related genes that may contribute to variation in ALM.


Introduction
Loss and functional impairment of skeletal muscle, especially in the elderly, are related to a number of public health problems (such as sarcopenia and osteoporosis) and to increased mortality [1,2]. Whole lean body mass (LBM) is composed of skeletal muscle (~60%), viscera, and some other connective tissues. Appendicular lean mass (ALM) is the sum of skeletal muscle mass in the arms and legs, which is the primary portion of skeletal muscle involved in ambulation and physical activities. ALM is considered to be an ideal measure of skeletal muscle mass [3,4,5,6] and can be measured accurately by dual-energy X-ray absorptiometry (DXA).
Skeletal muscle is under strong genetic control, with heritability estimates of 30-85% for muscle strength and 50-80% for muscle mass [7,8]. Genome-wide association studies (GWAS) have identified a number of variants that may account for variation in ALM [9,10]. Collectively, however, the identified loci, genes and variants explain only a small fraction of the genetic variation in ALM, and most of the genetic determination remains to be revealed. Traditional association studies have focused on single nucleotide polymorphisms (SNPs). Studies of other types of genetic variants, which may account for the "missing" heritability, have been relatively rare.
Recent studies have shown that copy number variation (CNV) plays an important role in human diseases such as schizophrenia [11,12], Parkinson's disease [13], and autism [14]. CNV is a common type of genomic variability in which DNA fragments ranging in size from one kilobase to several megabases are present at variable copy number in comparison with a reference genome [15]. CNV may influence gene expression, phenotypic variation and adaptation by disrupting coding sequences or altering gene dosage [16,17,18,19]. Furthermore, it may affect gene expression indirectly through position effects, predispose to deleterious genetic changes, or provide substrates for chromosomal change in evolution [15,20,21,22]. A recent GWAS of CNVs in Chinese subjects identified the gremlin1 gene as associated with LBM variation [23]. However, to date, no study has investigated whether CNVs contribute to ALM in other ethnic groups such as Caucasians.
In this study, we performed a CNV-based GWAS to identify genetic loci influencing variation in ALM in 2,286 Caucasian subjects. Follow-up replication analyses were performed in a Chinese sample consisting of 1,627 subjects.

Ethics Statement
The study was approved by the Institutional Review Boards of Creighton University, University of Missouri-Kansas City, Hunan Normal University of China and Xi'an Jiaotong University of China. Signed informed-consent documents were obtained from all study participants before they entered the study.

Subjects
The discovery sample consisted of 2,286 unrelated Caucasian subjects of European origin recruited in the Midwestern US (Kansas City, Missouri and Omaha, Nebraska). The inclusion and exclusion criteria were described in our previous publications [24].
The replication sample was an independent Chinese sample of 1,627 unrelated subjects. All subjects were recruited from the cities of Xi'an and Changsha and their neighboring areas in China.

Phenotyping
Anthropometric measures and a structured questionnaire covering lifestyle, diet, family information, medical history, etc. were obtained for all study subjects. ALM and fat body mass (FBM) were measured in all study samples using a Hologic QDR 4500 W dual-energy X-ray absorptiometry scanner (Hologic Inc., Bedford, MA, USA). ALM (kg) was calculated as the sum of lean soft tissue (nonfat, non-bone) mass in the arms and legs. Weight was measured in light indoor clothing using a calibrated balance beam scale, and height was measured without shoes using a calibrated stadiometer.

Genotyping
Genomic DNA was extracted from peripheral blood leukocytes using standard protocols. The Genome-Wide Human SNP Array 6.0 (Affymetrix, Santa Clara, CA, USA), which includes 906,600 SNPs and 940,000 copy number probes, was used to genotype each subject from the discovery sample according to the Affymetrix protocol. Briefly, approximately 250 ng of genomic DNA was digested with the restriction enzyme NspI or StyI. Digested DNA was adaptor-ligated and PCR-amplified for each sample. Fragmented PCR products were then labeled with biotin, denatured, and hybridized to the arrays. Arrays were then washed and stained using phycoerythrin on an Affymetrix Fluidics Station and scanned using the GeneChip Scanner 3000 7G to quantitate fluorescence intensities. Data management and analyses were conducted using the Genotyping Command Console Software. For sample quality control (QC), the contrast QC threshold was set at the default value of greater than 0.4. The final average contrast QC across the entire sample reached a high level of 2.76 for our Caucasian cohort and 2.62 for our Chinese cohort.

Copy Number Analysis
Common CNVs were identified using the CANARY algorithm implemented in the Birdsuite software [25], which uses a previously defined map of copy number polymorphisms (CNPs, namely CNVs with a frequency greater than 1%) based on HapMap samples [26]. In total, 1,216 CNPs were genotyped in the discovery sample and 1,280 CNPs in the replication sample.

QC
We conducted QC filtering both at the sample level and the CNV level, according to the previously reported methods [27].
First, for sample-level QC, we used three quality metrics reported by the Birdseye method to evaluate the copy number genotyping quality of the initial 2,286 subjects. The following procedures were adopted: 1) we removed any sample whose estimated copy number deviated by more than three standard deviations (SD) from the average estimate, which was approximately two copies at the genome-wide level; 2) we calculated the variability in copy number and in SNP probe intensities, each standardized per chromosome, and removed any sample more than three SD from the genome-wide average of these estimates; 3) we removed any sample in which more than two chromosomes failed any of these three metrics, i.e. more than three SD in estimated copy number or excessive CNV or SNP variability for a chromosome. According to these criteria, 71 subjects were discarded. The copy numbers of the remaining 2,215 subjects were successfully genotyped using the CANARY software. Second, we conducted QC filtering at the CNV level. Of the initially called CNVs, we excluded those with uncertain or missing copy number calls in >5% of samples or with a minor variant frequency of <1%. CNVs with an allele frequency of <1% were also discarded. After applying these QC criteria, a total of 410 CNVs remained for subsequent analyses in the Caucasian sample.
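The two kinds of filter described above can be illustrated with a minimal Python sketch. It is not the Birdsuite implementation: the function names, the input layout (one summary value per sample, and one list of copy number calls per CNV with None marking an uncertain or missing call) and the reading of "minor variant frequency" as the fraction of called samples that differ from the most common copy number state are all assumptions made for illustration.

```python
from collections import Counter
from statistics import mean, stdev


def flag_sample_outliers(values, n_sd=3.0):
    """Return indices of samples lying more than n_sd SDs from the sample mean.

    `values` holds one summary metric per sample (hypothetical input), for
    example the genome-wide average estimated copy number.
    """
    centre = mean(values)
    spread = stdev(values)
    return [i for i, v in enumerate(values) if abs(v - centre) > n_sd * spread]


def cnv_passes_qc(calls, max_missing=0.05, min_variant_freq=0.01):
    """Apply the CNV-level thresholds to one CNV.

    `calls` is a list of integer copy numbers, with None for an uncertain
    or missing call (assumed encoding).
    """
    called = [c for c in calls if c is not None]
    if not called:
        return False
    missing_rate = 1.0 - len(called) / len(calls)
    # Assumed definition: share of called samples not in the modal state.
    modal_count = Counter(called).most_common(1)[0][1]
    variant_freq = 1.0 - modal_count / len(called)
    return missing_rate <= max_missing and variant_freq >= min_variant_freq


# Example: one aberrant sample among 21, and three CNVs with different problems.
outliers = flag_sample_outliers([2.0] * 20 + [3.0])
keep_a = cnv_passes_qc([2] * 97 + [1] * 2 + [None])   # passes both thresholds
keep_b = cnv_passes_qc([2] * 90 + [None] * 10)        # too many missing calls
keep_c = cnv_passes_qc([2] * 199 + [1])               # variant state too rare
```

With a three-SD rule a single extreme sample can only be flagged when the sample is reasonably large, which is not a concern with thousands of subjects.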

Statistical Analyses
Association analyses of CNVs were performed using a linear regression model fitted with the glm function in R [28]. For both the initial GWAS and the subsequent replication study, stepwise regression was used to screen covariates for effects on ALM variation. Age, sex, height, and FBM were significant covariates (p < 0.05), and raw ALM values were adjusted for these factors. We adjusted for covariates using a 2-stage procedure in which ALM was first regressed on the covariates only, and the resulting residuals were then regressed on CNVs. To correct for the effect of potential population stratification, we conducted a principal component analysis on genome-wide SNP data with EIGENSTRAT [29] and included the top ten principal components as covariates. Fisher's method [30] was used to combine the p-values from the discovery and replication samples.
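The 2-stage adjustment can be sketched as follows. This is a dependency-free Python illustration rather than the R code used in the study: the helper names (ols, residuals, two_stage_effect) and the toy data are hypothetical, and the least-squares solver uses the normal equations with Gauss-Jordan elimination purely to keep the example self-contained.

```python
def ols(X, y):
    """Least-squares coefficients for design matrix X (list of rows) and outcome y."""
    p = len(X[0])
    # Augmented normal equations [X'X | X'y].
    A = [[sum(row[i] * row[j] for row in X) for j in range(p)]
         + [sum(row[i] * yi for row, yi in zip(X, y))]
         for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(p):
        pivot = max(range(c, p), key=lambda r: abs(A[r][c]))
        A[c], A[pivot] = A[pivot], A[c]
        for r in range(p):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][p] / A[i][i] for i in range(p)]


def residuals(X, y, beta):
    """Observed minus fitted values."""
    return [yi - sum(b * x for b, x in zip(beta, row)) for row, yi in zip(X, y)]


def two_stage_effect(covariates, cnv, alm):
    """Stage 1: regress ALM on covariates. Stage 2: regress residuals on copy number."""
    X1 = [[1.0] + [float(v) for v in row] for row in covariates]
    adjusted = residuals(X1, alm, ols(X1, alm))
    X2 = [[1.0, float(c)] for c in cnv]
    return ols(X2, adjusted)[1]  # slope per additional copy


# Toy data (invented): one covariate, copy number 2 or 3, true CNV effect 0.5.
covariates = [[1], [2], [3], [4], [1], [2], [3], [4]]
cnv = [2, 2, 2, 2, 3, 3, 3, 3]
alm = [2.0 + 3.0 * c[0] + 0.5 * k for c, k in zip(covariates, cnv)]
effect = two_stage_effect(covariates, cnv, alm)
```

In this toy example the copy number is uncorrelated with the covariate, so the second-stage slope recovers the simulated effect. When a CNV is correlated with the covariates, the 2-stage estimate can differ from the estimate obtained by fitting everything in a single model.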

Results
The basic characteristics of the subjects used in both discovery and replication samples are summarized in Table 1.
In the discovery sample, 20 CNVs showed evidence of association with ALM at the p = 0.05 level (Table 2). CNV1191 and CNV2580 were replicated in the Chinese sample. The p values of CNV1191 in the discovery and replication samples were 2.26×10−2 and 3.26×10−2, respectively, and the p values of CNV2580 in the discovery and replication samples were 3.34×10−3 and 0.107, respectively (Table 2). The combined p values of the two CNVs were 6.05×10−3 and 3.27×10−3, respectively.
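For reference, Fisher's method combines k independent p-values through the statistic X = −2 Σ ln(p_i), which follows a chi-square distribution with 2k degrees of freedom under the null hypothesis. The short Python sketch below (the function name is ours, not from the study) uses the closed-form chi-square tail for even degrees of freedom, so it needs only the standard library.

```python
import math


def fisher_combined_p(pvalues):
    """Combine independent p-values with Fisher's method.

    The statistic -2 * sum(ln p) is chi-square with 2k degrees of freedom;
    for even degrees of freedom the upper tail has the closed form
    exp(-x/2) * sum_{i<k} (x/2)^i / i!.
    """
    k = len(pvalues)
    half = -sum(math.log(p) for p in pvalues)  # equals x / 2
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return math.exp(-half) * total


# Rounded discovery and replication p-values for CNV1191 as quoted in the text.
combined_cnv1191 = fisher_combined_p([2.26e-2, 3.26e-2])  # about 6.05e-3
```

Applied to the rounded CNV1191 p-values, the function returns approximately 6.05×10−3, in line with the combined value reported. Because the inputs are the rounded figures printed in the text, small differences from the reported combined values can arise.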
We further tested for association between the normal (CN = 2) and deletion (CN = 0, 1) groups, and between the normal and duplication (CN = 3, 4) groups, separately. The results showed that the direction of effect of CNV2580 was consistent between the discovery and replication samples, whereas that of CNV1191 was not (Table 3). However, both CNVs remained significant in the combined analyses.
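The grouping used in this comparison can be sketched in a few lines of Python. The text does not state which test statistic was used for the group contrasts, so the example below only partitions subjects by copy number and reports the difference in mean adjusted ALM relative to the CN = 2 group, whose sign gives the direction of effect. The function name and the toy values are hypothetical.

```python
from statistics import mean


def group_effect(copy_numbers, adjusted_alm, group):
    """Mean adjusted ALM in `group` minus the mean in the normal (CN = 2) group.

    `group` is "deletion" (CN = 0 or 1) or "duplication" (CN = 3 or 4).
    Returns None when either group is empty.
    """
    in_group = {
        "deletion": lambda cn: cn < 2,
        "duplication": lambda cn: cn > 2,
    }[group]
    normal = [y for cn, y in zip(copy_numbers, adjusted_alm) if cn == 2]
    variant = [y for cn, y in zip(copy_numbers, adjusted_alm) if in_group(cn)]
    if not normal or not variant:
        return None
    return mean(variant) - mean(normal)


# Invented values, for illustration only.
cn = [2, 2, 1, 0, 3]
alm_resid = [1.0, 3.0, 0.0, 1.0, 5.0]
deletion_shift = group_effect(cn, alm_resid, "deletion")        # negative: lower ALM
duplication_shift = group_effect(cn, alm_resid, "duplication")  # positive: higher ALM
```

Comparing the sign of these differences between two cohorts is one simple way to check whether the direction of effect is consistent.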
In addition to the 2-step covariate adjustment procedure described above, we performed association analyses in which CNVs and covariates were included in a single model. The results were very similar to those of the 2-step procedure (Table S1).
According to the UCSC Genome Browser on Human February 2009 (GRCh37/hg19) Assembly, CNV1191 is located in chromosome region 7q36.1, at physical positions 149,916,734 bp to 149,932,502 bp, within the gene GTPase, IMAP family member 1 (GIMAP1).
Table 5. SNPs located in the two CNV regions or outside the two CNV boundaries and their association signals with ALM.
Table 4 lists the proportion of subjects with each copy number of CNV2580. The table also includes the theoretical proportions calculated from the empirical CN frequencies under a random mating assumption. A goodness-of-fit (GOF) test showed that the empirical distribution did not deviate from the theoretical distribution (p = 0.22 for both populations).
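As an illustration of this kind of check, the sketch below works through the simplest case, a biallelic deletion in which each chromosome carries 0 or 1 copies. This is an assumption for the example; CNV2580 may have more copy number states, and the exact calculation used for Table 4 is not given in the text. The deletion allele frequency q is estimated from the observed diploid counts, the expected proportions under random mating are q², 2q(1 − q) and (1 − q)², and a chi-square GOF statistic with one degree of freedom is computed. The function name and the counts are hypothetical.

```python
import math


def random_mating_gof(n0, n1, n2):
    """Chi-square GOF test of diploid copy number counts against random mating.

    n0, n1, n2 are the numbers of subjects with copy number 0, 1 and 2.
    Assumes a biallelic deletion (0 or 1 copies per chromosome).
    Returns (chi-square statistic, p-value) with 1 degree of freedom.
    """
    n = n0 + n1 + n2
    q = (2 * n0 + n1) / (2 * n)  # estimated deletion allele frequency
    expected = [n * q * q, n * 2 * q * (1 - q), n * (1 - q) ** 2]
    observed = [n0, n1, n2]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Upper tail of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Invented counts that match random mating exactly (q = 0.1, n = 1000).
chi2_fit, p_fit = random_mating_gof(10, 180, 810)
# Invented counts with an excess of homozygous deletions at the same q.
chi2_dev, p_dev = random_mating_gof(30, 140, 830)
```

One degree of freedom is used because there are three categories, one constraint on the total, and one parameter (q) estimated from the data. A large p-value, like the one reported for Table 4, indicates no detectable departure from the random mating expectation.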
Two SNPs are located within the CNV1191 region, and eight SNPs lie outside the CNV1191 boundaries but within the GIMAP1 gene. None of these ten SNPs was significantly associated with ALM in the discovery sample, but rs11769150 was associated with ALM in the replication sample, with a p-value of 0.02 (Table 5).
Four SNPs are located within the CNV2580 region, and fifteen SNPs lie outside the CNV2580 boundaries but within the SERHL gene. None of these nineteen SNPs was significantly associated with ALM in the discovery sample, but two SNPs, rs139116 and rs139120, were associated with ALM in the replication sample, with p-values of 0.02 (Table 5).

Discussion
This is the first CNV-based GWAS for ALM in Caucasians. Two CNVs, CNV1191 and CNV2580, were found to be associated with ALM. CNV1191 is located in the gene GIMAP1, which encodes GTPase, IMAP family member 1. GIMAP (GTPase of the immunity-associated protein family) proteins are a family of putative GTPases believed to be regulators of cell death in lymphomyeloid cells. GIMAP1 was the first reported member of this gene family [31]. The gene is involved in the differentiation of T helper (Th) cells of the Th1 lineage, and the related mouse gene has been shown to be critical for the development of mature B and T lymphocytes [32].
A study of myotubes cultured from skeletal muscle biopsies found coordinated reduced expression of five members of the GIMAP family, GIMAP1, GIMAP4, GIMAP5, GIMAP6 and GIMAP7, which form a cluster on chromosome 7 and participate in skeletal muscle cell survival/death [33]. A study in pig skeletal muscle indicated that GIMAP1 was correlated with meat quality and with the regulation of biological processes involved in the induction of apoptosis [34]. The gene is also involved in the regulation of lipid catabolic processes, the defense response and the positive regulation of calcium ion transport [35]. Our findings, combined with the above evidence, support a potential contribution of GIMAP1 to variation in skeletal muscle.
SERHL is a gene coding for a new member of the family of serine hydrolases that is located within peroxisomes [36]. In vivo studies showed that mRNA expression of SERHL increased in response to passive stretch imposed upon skeletal muscle [36].
The directions of association of CNV1191 in the discovery and replication samples were different. This inconsistency may have the following explanations. First, genetic variants may have different effects in different populations. A genetic variant may have different allele frequencies among diverse populations because of different evolutionary histories, which can result in different modes of genotype-phenotype association [37]. Second, significant associations are usually found at molecular markers that are in linkage disequilibrium (LD) with the causal variant, rather than at the causal variant itself. Therefore, the inconsistency in direction could result from opposite patterns of LD in the two populations.
Within the two CNV regions, we did not identify any SNPs that were significantly associated with ALM in the discovery sample. A possible explanation is that, unlike a SNP, a CNV is a structural genetic variant that generally covers a larger genomic region, and thus a CNV may influence phenotypic variation through mechanisms that differ from those of SNPs.
In summary, we identified CNV1191 and CNV2580 that were associated with ALM. The relevant genes, GIMAP1 and SERHL, may play roles in skeletal muscle metabolism. Our findings may provide useful information for molecular functional studies of candidate genes for ALM.