Genome-wide association study identifies a locus associated with rotator cuff injury

Rotator cuff tears are common, especially in the fifth and sixth decades of life, but can also occur in the competitive athlete. Genetic differences may contribute to overall injury risk. Identifying genetic loci associated with rotator cuff injury could shed light on the etiology of this injury. We performed a genome-wide association screen using publically available data from the Research Program in Genes, Environment and Health including 8,357 cases of rotator cuff injury and 94,622 controls. We found rs71404070 to show a genome-wide significant association with rotator cuff injury with p = 2.31x10-8 and an odds ratio of 1.25 per allele. This SNP is located next to cadherin8, which encodes a protein involved in cell adhesion. We also attempted to validate previous gene association studies that had reported a total of 18 SNPs showing a significant association with rotator cuff injury. However, none of the 18 SNPs were validated in our dataset. rs71404070 may be informative in explaining why some individuals are more susceptible to rotator cuff injury than others.


Introduction
Rotator cuff injuries are a common cause of shoulder pain throughout life. The prevalence of rotator cuff tears in the general population is approximately 20%; increasing from 10% in the sixth decade of life to 50% in the ninth decade of life [1]. While the aetiology of rotator cuff tears is poorly understood, the literature supports an age-related progression, primarily affecting middle-aged and older patients. In addition to aging, other risk factors for degenerative tears include smoking, hypercholesterolemia, genetic predisposition and shoulder use [2].
40% of overhead throwing athletes were found to have rotator cuff tears in their dominant arm [3]. Athletes participating in overhead sports place substantial demands on the shoulderfrom elite swimmers ranging their shoulders through approximately 2 million strokes per year to professional baseball pitchers generating ball speeds of up to 165 km/hr with associated peak internal rotation velocities of up to 6940 degrees per second [4]. Given these high demands, rotator cuff injury can have devastating consequences on an athlete's performance, require substantial recovery and rehabilitation time, or even prematurely end an athlete's career [5,6].
Multiple lines of evidence suggest that genetic differences explain part of the predisposition for rotator cuff injury. First, siblings and other close relatives have an elevated risk of rotator cuff tears and show a faster rate of progression of tear size compared to non-related controls [7,8]. Second, genetic association studies have previously identified 18 single nucleotide polymorphisms (SNPs) associated with rotator cuff tears [9][10][11][12][13][14]. From a cohort with 335 cases, Tashjian et al. identified two genome-wide significant SNPs that showed an association with full thickness rotator cuff tears [12]. Moreover, several candidate gene studies have been performed to identify 16 additional SNPs associated with rotator cuff tears [9][10][11]13,14]. For rotator cuff injury, none of these 18 SNPs have yet been validated in an independent study. One SNP in the ESRRB gene remained associated with rotator cuff injury when the cohort was expanded from 175 to 335 cases [12,13].
A better understanding of the genetic loci involved in rotator cuff injury could shed light on the poorly understood molecular basis of tendon injury and repair. Although an individual's genotype is fixed, knowledge of this risk could prompt dedicated preventative measures to potentially decrease the risk of injury.
The purpose of this genome-wide association study (GWAS) was to identify SNPs associated with rotator cuff injury using available data from a cohort of 102,979 patients that included 8,357 cases with rotator cuff injury.

Materials and methods
A genome-wide association screen was performed for rotator cuff injury using data from the Genetic Epidemiology Research on Adult Health and Aging (GERA) cohort. The data generation and data analysis pipeline have been previously described in Jorgenson et al 2015 [15]. Supplemental Methods contains full information about the methods used.

Phenotype definition
Rotator cuff injury cases were identified in the GERA cohort based on clinical diagnoses and surgical procedures captured in the Kaiser Permanente Northern California (KPNC) Electronic Health Record system. The Electronic Health Record includes reported injuries over the entire lifetime of the patients, including those that occurred prior to enrollment in KPNC. It also includes injuries that occurred after the genotyping analysis was performed if reported by the patient and recorded by the physician, until the data were accessed on July 22, 2015. Nine total codes, including International Classification of Disease, Ninth Revision (ICD-9) and Common Procedure Terminology, Fourth Edition (CPT-4) codes, were included (Table 1). Cases were defined as individuals with at least one ICD-9 code (727.61, 840.3, 840.4, 840.5, 840.6) or CPT4 code (23410, 23412, 23420, 29827). The definitions of each injury code are listed in Table 1. Phenotypes were defined as: full rupture (individuals with ICD727_61), partial/full rupture (individuals with ICD727_61, CPT23410, CPT23412 or CPT29827), and rotator cuff injury (any code). There were 904 cases of full rupture, 2241 cases of partial/full rupture and 8357 cases of rotator cuff injury from the cohort of 102,979 individuals. Participants were categorized as cases if they contained any of the injury codes listed in Table 1; otherwise, they were categorized as controls.

Genome-wide association and meta-analysis
Genome-wide association analyses of the GERA cohort genomic data were conducted using PLINK v1.90(b3.34) (https://www.cog-genomics.org/plink2) [16,17]. SNP associations were tested with rotator cuff injury codes with a logistic regression model using allele counts for typed and imputed SNPs in an additive genetic model for each of the five race/ethnic populations. The model was adjusted for genetic sex, age at enrollment into the RPGEH cohort, race/ ethnicity using principal components, and variations in genotyping protocol. The final number of SNPs that were analyzed was 9,870,147 for European (EUR); 11,158,335 for Latin American (LAT); 8,951,026 for East Asian (EAS); 17,224,907 for African American (AFR); and 25,874 for Southeast Asian (SAS) populations. Results from each population were combined by inverse-variance, fixed-effects meta-analysis. The final number of SNPs that was analyzed in the fixed-effects meta-analysis was 10,582,947.
We examined the level of heterogeneity using two measures: 1) the I 2 statistic, which measures the percentage of variability across studies that is due to heterogeneity, where a lower value indicates more consistent results across studies, and 2) Cochran's Q statistic, which measures whether observed differences in study results are due to chance alone, where a low associated p-value indicates heterogeneity [18,19]. Locus plots showing regional association signals were generated in LocusZoom (http://locuszoom.sph.umich.edu/locuszoom) [20].

Replication analysis of previously reported SNPs
A literature search was conducted to compile a list of candidate genes previously tested for association with rotator cuff injuries. We searched the public databases PubMed/MEDLINE (http://www.ncbi.nlm.nih.gov/pubmed) and Scopus (http://www.scopus.com/) for previously published studies on the genetics of rotator cuff injury. Inclusion criteria for filtering results were: research papers with original data, English language, full-text article available (or abstract with enough data), genetic association studies with human subjects, cases with rotator cuff phenotypes, and gene or SNP associations reported. Articles were included regardless of whether or not the associations were significant and regardless of the size of the study population. Furthermore, the references from each publication were used to find other articles not captured in the initial search. The final searches were conducted on June 1, 2016. Overall, six total publications were identified reporting significant associations for 18 SNPs in 10 genes. Additional information regarding the literature search can be found in the Supplemental Methods. In the replication analysis, we searched our GWAS and meta-analysis results for these SNPs and used the Benjamini-Hochberg method to set the false discovery rate to 5% to account for multiple testing [21]. Summary statistics for all SNPs from the fixed-effects meta-analysis are available at NIH GRASP: https://grasp.nhlbi.nih.gov/FullResults.aspx.

Ethical considerations
This study analyzed stored data from RPGEH subjects who consented to genomic testing and use of their genomic data, as well as health data from the KPNC Electronic Health Record, for future research studies. The health and genotype data for the subjects were de-identified. All study procedures were approved by the Kaiser Permanente Research Institute Institutional Review Board.

Study population and genotype information
The Genes, Environment and Health cohort includes genotype and medical data from 102,979 individuals in the Kaiser-Permanente Northern California system. Demographic data for the cohort are presented in Table 2. The electronic health records were interrogated for individuals that had incurred a rotator cuff injury (Table 1)(Methods). Genome-wide study for association with rotator cuff injury The RPGEH cohort, genotyping data, methodological approach and logical flow presented here overlap those used in previous work by the same authors on MCL injury, shoulder dislocation, plantar fasciitis, ACL injury, Achilles tendon injury and ankle injury [22][23][24][25][26]. However, the analyses presented here present new results on the genetic basis for rotator cuff injury. A logistic regression was performed to search for SNPs associated with rotator cuff injury (Methods). We compared the observed p-values from the GERA cohort meta-analysis to the distribution of p-values that would be expected by chance in a Q-Q plot (Fig 1). We observed slight deviation from the null hypothesis for the lowest observed p-value, illustrated by the farthest upper-right points tending to be above the diagonal red line. This deviation indicates that one or more SNPs from our analysis show an association with rotator cuff injury.
We plotted the p-value for every SNP from the meta-analysis on a Manhattan plot (Fig 2). The tested SNPs are arranged linearly by genomic position along the x-axis and the p-value for each SNP is indicated along the y-axis. To correct for multiple hypothesis testing, we used the commonly-accepted genome-wide threshold for statistical significance of p<5x10 -8 (indicated by the red line) [27][28][29]. One SNP (rs71404070) exceeds the genome-wide statistical threshold for association with rotator cuff injury (Table 3). rs71404070 yielded data for three races (EUR, LAT, AFR) but not the two Asian races. rs71404070 was not directly genotyped in the dataset, but rather the genotypes were imputed with fairly high accuracy (R 2 = 0.93)  (Table 3). Summary statistics for all SNPs from the fixed-effects meta-analysis are available at NIH GRASP: https://grasp.nhlbi.nih.gov/FullResults.aspx.
The GWAS results were analyzed to determine whether the association with rotator cuff injury for rs71404070 was stronger in some races than in others, a phenomenon known as heterogeneity [30]. A forest plot shows the odds ratio and 95% confidence interval for the three different races with data, as well as the overall result from all three races combined (Fig 3). The odds ratios for each race were in the same direction and of similar magnitude. Using I 2 and Cochran's Q to assess heterogeneity, we saw no evidence of heterogeneity; specifically, I 2 = 0% (95% confidence interval: 0-79%) and Cochran's Q = 1.614 with p = 0.446.
The allelic odds ratio for rs71404070 is 1.25. Individuals that carried one risk allele at rs71404070 (A/T) had a 29% increased chance of rotator cuff injury compared to individuals that have no risk alleles (Table 4).
Our cohort of rotator cuff injuries includes 904 cases of full rupture and 2241 cases of either full or partial rupture of the rotator cuff (Table 5). We asked whether increasing severity of rotator cuff injury would also increase the strength of the association of rs71404070 with that injury. We repeated the analysis for rs71404070 using either full or partial/full rupture as cases  versus uninjured patients as controls (Table 5). We note that the odds ratio increases for the partial/full ruptures (1.38) but not for full ruptures (1.22) compared to the overall cohort of all injured patients (1.25). As expected, the p-values for the association of rs71404070 with each rotator cuff injury phenotype became less significant as the number of cases decreased. Rs71404070 is located in the 1.5 Mb intergenic region between LOC729159 and cadherin8 (CDH8) on chromosome 16 (Fig 4). There is one SNP that is tightly linked (R 2 = 0.99) and 8 additional SNPs that are weakly linked (R 2 >0.6) to the sentinel SNP rs71404070 (S1 Table). Since all of these SNPs are linked and tend to be inherited as one haplotype, it is unclear which of these, if any, is the SNP that directly affects risk for rotator cuff injury and which show an association simply because they are linked.
We investigated all of the SNPs linked to rs71404070 for evidence that they may affect the expression or coding capacity of neighboring genes, consistent with a variant that is directly responsible for affecting rotator cuff injury risk. The entire linkage disequilibrium block is intergenic, so none of the SNPs directly affect protein coding regions.
Next, we searched for evidence that these SNPs affect gene expression levels of either LOC729159 or cadherin8 by acting as expression quantitative trait loci (eQTLs)(Methods). The Genotype-Tissue Expression database has documented SNPs associated with changes in expression of nearby genes [31]. However, neither rs71404070 nor any of the nine linked SNPs showed an association with variation in expression of either LOC729159 or cadherin8.
Finally, we interrogated data from the ENCODE project to determine whether rs71404070 or any of the nine linked SNPs might affect binding of transcription factors [32]. In general, when transcription factors bind to DNA, they are associated with a factor known as CTCF and the bound region becomes DNAse I hypersensitive [32]. SNPs located in these regions may affect transcription factor binding and consequently gene expression. The lead SNP in our  study, rs71404070, was not associated with either CTCF binding or DNAse I hypersensitivity.
One weakly-linked SNP, rs35448966 (R 2 = 0.67), is located in a CTCF binding site [32]. A moderately-linked SNP, rs76226460 (R 2 = 0.81), is located in a DNase I hypersensitivity peak [32]. These data suggest that these two SNPs might have an effect on expression of nearby genes. However, their association with rotator cuff injury is weak (p>10 −7 ) and they are not highly correlated with the sentinel SNP rs71404070 (R 2 <0.81), indicating that these SNPs are not likely to be responsible for the highly-significant association of rs71404070 with rotator cuff injury (S1 Table). In summary, none of the nine SNPs linked to rs71404070 has strong evidence for being directly responsible for affecting the activity of any of the nearby genes. Since rs71404070 has by far the most significant association with rotator cuff injury, it has the best evidence for being the SNP that is responsible for the association. We attempted to validate the association of rs71404070 with rotator cuff injury using data from a previous GWA study [12]. However, neither rs71404070 itself nor any closely linked SNPs were included in the previous set of genotyped markers (C. Teerlink and R. Tashjian, Personal Communication).
We tested whether any of these 18 SNPs showed an association with rotator cuff injury in our dataset (Table 6). To compensate for multiple testing, we used the Benjamini-Hochberg method, with a false discovery rate of q = 0.05 [33,34]. None of the 18 SNPs showed a significant association with rotator cuff injury in our dataset. In previous studies, the cases had full ruptures of the rotator cuff whereas in this work, the cases included partial ruptures, full ruptures, sprains and avulsions. One possibility is that the 18 SNPs might show an association with full ruptures but not with other injuries of the rotator cuff. To address this possibility, we repeated the analysis using only full ruptures (904 cases) or partial/full ruptures (2241 cases) of the rotator cuff (Table 6). Even restricting the analysis to full and/or partial ruptures, none of the 18 previous SNPs showed a significant association with rotator cuff injury in our dataset.

Discussion
With the advent of large-scale genotyping programs, it is now possible to screen the entire genome for polymorphisms associated with musculoskeletal injury risks, such as rotator cuff b P-value from fixed-effects meta-analysis. c Allelic odds ratio with 95% confidence interval, adjusted for covariates in the logistic regression from fixed-effects meta-analysis. https://doi.org/10.1371/journal.pone.0189317.t005 tears. A genome-wide screen for rotator cuff injury could provide new insights regarding the differences between individuals in their inherent propensity for injury and reveal insights into  underlying mechanisms for shoulder tendinopathy. Furthermore, because the genotype data includes most of the common polymorphisms that are known, a genome-wide screen reports the strongest associations in the genome in an unbiased manner. Here, we have performed a study to find DNA polymorphisms associated with rotator cuff injury by obtaining access to large-scale genotype and phenotype data from the Research Program on Genes, Environment and Health. The data contained information from 102,979 individuals of whom 8,357 had rotator cuff injury. We found rs71404070 to be associated with rotator cuff injury with a p-value that has genome-wide significance. It will be important to validate this finding in independent studies [35,36]. rs71404070 lies in a linkage disequilibrium block that contains nine other SNPs that also show an association with rotator cuff injury, although at a lower level than rs71404070. Within this block, it is unclear which variant(s) causes increased risk of rotator cuff injury and which are neutral polymorphisms linked to this variant(s). Neither rs71404070 nor any of the linked SNPs affect protein-coding regions. Additionally, none of these SNPs are known to affect expression of the two nearest genes (LOC729159 or cadherin8). The cadherin8 gene encodes a type II classical cadherin, which is an integral membrane protein that mediates calcium-dependent cell-cell adhesion. The function of LOC729159 is currently unknown.
Individuals carrying one risk allele at rs71404070 (A/T) had a 29% increased chance of injury compared to individuals with no risk allele (T/T). The size of the effect of rs71404070 is typical for many genome-wide association studies but far smaller than those for simple Mendelian traits. The rs71404070 genotype explains part of the heritable risk for rotator cuff injury, but a large part of the heritability remains unanswered. For some traits such as height or bone mineral density, heritability is largely explained by the cumulative effect from hundreds or thousands of loci [37][38][39]. One possibility is that rotator cuff injury is also polygenic, in which case identification of more loci in future studies might explain the heritability of this injury more fully. Another possibility is that rotator cuff injury may be heterogeneous; i.e. there may be distinct types of injury, and the rs71404070 SNP might explain risk for some types but not others [40]. In this case, methods could be developed to classify rotator cuff injury into sub-types, in which case the effect size of rs71404070 might increase for a specific sub-type of injury.
We re-tested 18 SNPs in 10 genes that were previously reported to show an association with rotator cuff injury, but we were unable to validate any associations. Evidence from many other genetic association studies suggests that candidate gene associations need to be independently replicated, otherwise their credibility is low [41][42][43][44]. Our study included 8,357 cases of rotator cuff injury compared to between 175 and 331 cases of rotator cuff injury in previous studies [12,13]. Compared to previous studies, our study had a larger number of cases, indicating that we had good statistical power to replicate the previously reported associations [45]. For example, for a SNP with a minor allele frequency of 5% and a genotype relative risk of 1.2 for 8,357 cases, power calculations indicate that our study had a 99% chance of replication. However, if we restrict the re-testing to just 904 cases of full rotator cuff rupture, our study had only a 15% chance of replicating such a SNP.
One possible reason explaining why our study was not able to replicate previous results is that rotator cuff injury was classified according to data from electronic health records. Some of the previous studies classified rotator cuff injuries using magnetic resonance imaging to confirm full-thickness rotator cuff tears. For large studies like ours, it is typical to perform bioinformatics searches of electronic health records, as it is not possible to manually evaluate each case. Nevertheless, it could be that some of the rotator cuff injuries in our study may be misclassified due to inaccuracy of the electronic health records.
The summary statistics for 10,582,947 SNPs from the GWAS for rotator cuff injury may be used in future genetic studies of rotator cuff injury. For instance, one could make a list of additional genes and genetic pathways involved in the development, formation or structure of the rotator cuff, and then use the data presented here to test those genes for an association. In addition, GWA studies for rotator cuff injury may be conducted in the future, in which case the data from this paper could be used to validate any new SNPs that are found. Finally, the data on rotator cuff injury could be combined with data from other musculoskeletal injuries (e.g. Achilles tendon injury) in a cross-phenotype meta-analysis in order to find SNPs associated with tendon injuries in general.
An attractive possibility is that rs71404070 could be used as a diagnostic marker to identify individuals with increased risk for injury, and then to take preventative measures to alleviate some of that risk. In our data from the general population, having one copy of the risk A allele of rs71404070 increased the risk for rotator cuff injury by 29% compared to having no copies. The underlying biological mechanism responsible for the association of rs71404070 with rotator cuff injury is currently unknown.
There are several limitations to this study. First, the results should be replicated in an independent cohort. Second, the phenotypes were defined from codes contained in patient electronic health records, which may be inaccurate. Third, the number of individuals of Latin-American, African-American and Asian ethnicity was relatively small, and hence the association results for these results are weaker than those from the European group. Fourth, these results should be repeated in a cohort of athletes to determine whether the effect size is similar in athletes. Fifth, additional studies are warranted to begin to illuminate the underlying biological mechanism for the association of variation near cadherin8 and rotator cuff injury.
Supporting information S1