Family-based exome-wide association study of childhood acute lymphoblastic leukemia among Hispanics confirms role of ARID5B in susceptibility

We conducted an exome-wide association study of childhood acute lymphoblastic leukemia (ALL) among Hispanics to confirm and identify novel variants associated with disease risk in this population. We used a case-parent trio study design; unlike more commonly used case-control studies, this study design is ideal for avoiding issues with population stratification bias among this at-risk ethnic group. Using 710 individuals from 323 Guatemalan and US Hispanic families, two inherited SNPs in ARID5B reached genome-wide level significance: rs10821936, RR = 2.31, 95% CI = 1.70–3.14, p = 1.7 × 10⁻⁸ and rs7089424, RR = 2.22, 95% CI = 1.64–3.01, p = 5.2 × 10⁻⁸. Similar results were observed when restricting our analyses to those with the B-ALL subtype: ARID5B rs10821936 RR = 2.22, 95% CI = 1.63–3.02, p = 9.63 × 10⁻⁸ and ARID5B rs7089424 RR = 2.13, 95% CI = 1.57–2.88, p = 2.81 × 10⁻⁷. Notably, effect sizes observed for rs7089424 and rs10821936 in our study were >20% higher than those reported among non-Hispanic white populations in previous genetic association studies. Our results confirmed the role of ARID5B in childhood ALL susceptibility among Hispanics; however, our assessment did not reveal any strong novel inherited genetic risks for acute lymphoblastic leukemia among this ethnic group.

Introduction
Acute lymphoblastic leukemia (ALL) is the most common malignancy among children, with B-cell ALL (B-ALL) accounting for the majority (80% to 85%) of cases [1,2]. Genome-wide association studies (GWAS) have identified several inherited genetic variants associated with childhood or adolescent ALL risk, including but not limited to single nucleotide polymorphisms (SNPs) in ARID5B, IKZF1, CEBPE, CDKN2A, PIP4K2A, GATA3, LHPP, and ELK3 [3–11]. Another important risk factor for childhood ALL is Hispanic ethnicity. Children of Hispanic ethnic background have a 10% to 30% higher incidence of ALL than do non-Hispanic whites, and a rate almost two times higher than among non-Hispanic blacks [12,13]. Hispanic children with ALL also have a lower 5-year survival rate and a higher incidence of relapse than do non-Hispanic whites [14,15].
These differences in ALL incidence and outcomes among Hispanics could be due to differences in the frequency of known or novel genetic risk factors that are unique to this population. Genetic risk factors associated with Amerindian ancestry could account for increased ALL incidence and decreased ALL survival among Hispanics [15]. Furthermore, risk alleles in ARID5B and GATA3 were found to be more frequent among Hispanic Americans than in European Americans [8,9], consistent with ancestry-related disparities in ALL susceptibility and treatment outcomes. However, there is much work remaining to identify the missing heritability of ALL among Hispanics.
While a majority of GWAS rely on the case-control study design, this approach is subject to population stratification bias. This bias may result in spurious associations or the masking of associations due to subgroups within a population that have different genetic profiles and/or frequencies of disease [16,17]. Population stratification bias can be particularly important among Hispanic populations, where a three-way admixture among Native American, European, and West African ancestry populations is common [18,19]. However, the family-based case-parent trio design is immune to population stratification bias, as this approach relies on evaluating disequilibrium in the transmission of alleles among affected cases and their parents, which does not vary by race and ethnicity [16,20]. Thus, the case-parent trio study design has an advantage over the case-control study when analyzing data from highly admixed populations such as Hispanics.
To determine if previous ALL GWAS findings in populations primarily of European descent are transferable to Hispanic populations, and to identify novel genetic risks for ALL for this ethnic group, we conducted an exome-wide association study (EXWAS) among Hispanics recruited in Guatemala and the Southwestern United States. We used a case-parent trio approach to attenuate the potential for population stratification bias due to admixture among this population.

Study subjects and samples
The study population has been described previously [21]. Briefly, the population under study included Guatemalan ALL case-parent trios recruited during the period 2012-2015 from Unidad Nacional de Oncología Pediátrica (UNOP), a pediatric cancer treatment center in Guatemala City, as well as Hispanic trios recruited during the period 2003-2011 from the Texas Children's Cancer Center (part of Texas Children's Hospital (TCH) in Houston, Texas). Children and adolescents (ages 1 to 19 years) diagnosed with ALL and treated at either UNOP or TCH during their respective recruitment timeframes were eligible for the study. All Guatemalan ALL cases were either of indigenous Native American ancestry (Mayan; primarily K'iche', Mam, and Q'anjob'al) or Ladino (Mestizo, with a mixed Native American and European background) ancestry. Written informed consent was obtained from parents, guardians, or patients, as appropriate. For all ALL cases, germline genomic DNA was extracted from peripheral blood samples obtained during clinical remission. Peripheral blood samples were also obtained from participating Guatemalan parents, while saliva samples were collected from parents of TCH cases. DNA extraction was performed using a Puregene DNA isolation kit by Gentra Systems.
This study was approved by the Institutional Review Boards at University of Texas Health Science Center (UTHSC), Baylor College of Medicine, and St. Jude Children's Research Hospital, as well as by the Bioethics Committee at Facultad de Medicina, Universidad Francisco Marroquín in Guatemala.

Genotyping and quality control (QC)
Germline genomic DNA for all samples was subjected to exome-wide genotyping using the Illumina Human Exome BeadChip (Illumina, San Diego, CA). SNPs genotyped (N = 237,436) were mostly from exonic regions. Genotype calls were made using Illumina GenomeStudio Software; genotype was coded as 0, 1, or 2, indicating the number of minor alleles (an additive genetic model was assumed). Individuals with a genotype call rate <95% and case-parent trios or duos with a Mendelian error rate (i.e., Mendelian inconsistencies) >0.5% were excluded from analyses. In addition, non-autosomal SNPs, SNPs with a minor allele frequency (MAF) <1%, and SNPs with poor genotyping quality (<95% call rate) were excluded. For SNPs with a MAF of 1%–5%, a more stringent set of QC criteria determined by call rate was used (S1 Fig), and SNPs that did not meet these criteria were also excluded from analyses.
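The basic SNP-level filters described above can be illustrated with a minimal Python sketch. The function names and the use of None for a missing call are hypothetical, the MAF is computed over whatever individuals are passed in (in practice it is often estimated in parents only), and the stricter call-rate criteria for SNPs with a MAF of 1%–5% (S1 Fig) are not reproduced here.

```python
from typing import Optional, Sequence

# Genotype coding as in the text: 0, 1, or 2 minor alleles; None marks a missing call
# (the None convention is an assumption made for this sketch).
Genotype = Optional[int]


def call_rate(genotypes: Sequence[Genotype]) -> float:
    """Fraction of individuals with a non-missing genotype call."""
    if not genotypes:
        return 0.0
    return sum(g is not None for g in genotypes) / len(genotypes)


def minor_allele_frequency(genotypes: Sequence[Genotype]) -> float:
    """Allele frequency among called genotypes, folded so it never exceeds 0.5."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return 0.0
    freq = sum(called) / (2 * len(called))  # two alleles per individual
    return min(freq, 1.0 - freq)


def snp_passes_qc(genotypes: Sequence[Genotype],
                  min_call_rate: float = 0.95,
                  min_maf: float = 0.01) -> bool:
    """Keep a SNP only if call rate >= 95% and MAF >= 1%."""
    return (call_rate(genotypes) >= min_call_rate
            and minor_allele_frequency(genotypes) >= min_maf)
```

For example, a SNP with ten heterozygotes among 100 fully called individuals has a MAF of 0.05 and passes both filters, whereas a SNP with a single heterozygote among 100 individuals (MAF 0.005) is excluded.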

Statistical analysis
For each SNP, associations between the inherited genotype and risk of childhood ALL were assessed with multinomial modeling, using the EMIM program [22]. This analysis compares observed case genotype distributions to expected distributions, assuming Mendelian transmission of the minor allele [20,22,23]. Multinomial modeling is mathematically equivalent to a log-linear approach [24], which has been used in several case-parent trio GWAS [25–27]. The multinomial modeling approach allows for the direct estimation of inherited effects even when one or more individuals are missing from a case-parent trio [24]. Case-parent trios, case-mother or case-father duos, parents alone, or cases alone can all be analyzed simultaneously in EMIM to estimate inherited genetic effects.
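The comparison of observed and expected case genotype distributions can be made concrete with a small Python sketch. It is not EMIM's implementation: it only shows, for complete trios and a log-additive relative risk, how the expected genotype distribution of an affected child shifts away from Mendelian proportions once parental genotypes are conditioned on. Function names are hypothetical.

```python
from typing import List


def mendelian_probs(mother: int, father: int) -> List[float]:
    """P(child carries 0, 1, 2 minor alleles | parental genotypes) under Mendelian transmission.

    Parental genotypes are coded 0, 1, or 2 (number of minor alleles).
    """
    tm = mother / 2.0  # probability the mother transmits the minor allele
    tf = father / 2.0  # probability the father transmits the minor allele
    return [(1 - tm) * (1 - tf),
            tm * (1 - tf) + (1 - tm) * tf,
            tm * tf]


def case_genotype_probs(mother: int, father: int, rr: float) -> List[float]:
    """Expected genotype distribution of an affected child, given parental genotypes.

    Assumes a log-additive model: each minor allele multiplies disease risk by rr.
    With rr = 1 this reduces to the Mendelian proportions.
    """
    base = mendelian_probs(mother, father)
    weights = [p * rr ** k for k, p in enumerate(base)]
    total = sum(weights)
    return [w / total for w in weights]
```

For two heterozygous parents, the Mendelian expectation is 0.25, 0.5, 0.25; with an illustrative per-allele relative risk of 2, the expected distribution among cases becomes roughly 0.11, 0.44, 0.44. Because the comparison is made within each parental mating type, differences in allele frequency between subpopulations do not enter the calculation.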
We used EMIM to estimate relative risks (RR) for each SNP, as well as corresponding 95% confidence intervals (CIs) and chi-squared values, assuming a log-additive model of inheritance. EMIM analyses stratified the data by parental mating type, inherently adjusting RR estimates for effects of population stratification [23,24]. We calculated two-tailed p-values from chi-squared values for each SNP using R, version 3.2.2. We also created Manhattan plots and Q-Q plots using R, and regional association plots using LocusZoom [28]. We used a threshold of P < 1.0 × 10⁻⁶ to denote statistical significance in our analyses. The majority of ALL cases in the study cohort were of the B-ALL subtype; therefore, in addition to examining inherited variants associated with ALL in the entire study cohort, we also conducted a subgroup analysis with only B-ALL cases and families, to assess inherited genotype effects for B-ALL in particular.
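As an illustration of the p-value step, the following Python sketch converts a chi-squared statistic into a p-value and applies the significance threshold stated above. The study used R for this step; the Python version, the function names, and the assumption of one degree of freedom per SNP test are all choices made for this sketch and are not taken from the source.

```python
import math


def chi2_pvalue_1df(chi2: float) -> float:
    """Upper-tail p-value for a chi-squared statistic with 1 degree of freedom.

    For 1 df, the statistic is the square of a standard normal variable, so
    P(X > chi2) = P(|Z| > sqrt(chi2)) = erfc(sqrt(chi2 / 2)).
    """
    if chi2 < 0:
        raise ValueError("chi-squared statistic must be non-negative")
    return math.erfc(math.sqrt(chi2 / 2.0))


def is_significant(p_value: float, threshold: float = 1.0e-6) -> bool:
    """Apply the study's stated significance threshold (P < 1.0e-6)."""
    return p_value < threshold
```

Using the complementary error function keeps the sketch free of third-party dependencies and avoids the loss of precision that subtracting a cumulative probability from 1 would cause for very small p-values.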
In order to select cases for the TCH population, we determined genetic ancestry using STRUCTURE [15,29] on the basis of genotypes at 30,000 randomly selected SNPs, using HapMap samples (CEU, YRI, CHB/JPT) and indigenous Amerindian references [30] as ancestry populations. Hispanic Americans were defined as individuals for whom the proportion of Native American genetic ancestry was ≥10% and was also greater than the proportion of African ancestry [8].
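The ancestry-based definition can be written as a simple rule. This Python sketch assumes the threshold reads as "at least 10%" Native American ancestry and that ancestry proportions (for example, from STRUCTURE output) are supplied as fractions between 0 and 1; the function name and argument names are hypothetical.

```python
def is_hispanic_american(native_american: float,
                         african: float,
                         min_native_american: float = 0.10) -> bool:
    """Classify an individual from estimated ancestry proportions (fractions in [0, 1]).

    Rule sketched from the text: Native American ancestry is at least 10%
    and also exceeds the African ancestry proportion.
    """
    return native_american >= min_native_american and native_american > african
```

Under this rule, an individual with 35% Native American and 5% African ancestry qualifies, while one with 12% Native American and 20% African ancestry does not.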

Results
The study cohort consists of a total of 733 individuals from 332 families: 287 families (628 individuals) from Unidad Nacional de Oncología Pediátrica (UNOP) in Guatemala, and 45 families (105 individuals) from Texas Children's Hospital (TCH) in the United States. We excluded nine families from analysis because they did not meet Mendelian error rate criteria. The full cohort used in analyses included 323 families with 710 individuals. For 24 of these families, parental genetic data were used in the analysis, but genetic information was missing for the ALL cases themselves. Demographic and clinical characteristics of the 299 ALL cases included in this study are shown in Table 1. Of the 237,436 exonic SNPs available for analysis from the Illumina chip, 32,175 SNPs met the MAF and call rate criteria and were subsequently used in analyses (S1 File). The vast majority (99.6%) of excluded SNPs were omitted from analyses because they had a MAF of <1%.

Discussion
In this family-based EXWAS of childhood ALL among those of Hispanic ancestry, we observed two inherited SNPs in ARID5B, rs10821936 and rs7089424, that were associated with childhood ALL. These two SNPs are in strong linkage disequilibrium with one another (r² = 0.84, D' = 0.92), most likely representing a single susceptibility locus (Fig 2). Both ARID5B rs7089424 and rs10821936 have been identified in previous ALL GWAS [3–5,7,8]. These two SNPs also have had some of the strongest and most statistically significant associations with ALL in GWAS analyses [4,7,8]. We reviewed whole-exome sequence data from a subset of the cases (n = 41 from TCH) described here and did not identify any recurring ARID5B coding variants in patients with the risk allele.
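For readers less familiar with the two linkage disequilibrium measures quoted above, the following Python sketch shows the standard definitions of r² and D' computed from allele and haplotype frequencies. The function name and inputs are hypothetical, the sketch assumes phased haplotype frequencies are available, and it does not reproduce how the values reported here were estimated.

```python
from typing import Tuple


def ld_measures(p_a: float, p_b: float, p_ab: float) -> Tuple[float, float]:
    """Return (r_squared, d_prime) for two biallelic loci.

    p_a  : frequency of allele A at the first locus
    p_b  : frequency of allele B at the second locus
    p_ab : frequency of the haplotype carrying both A and B
    """
    if not (0.0 < p_a < 1.0 and 0.0 < p_b < 1.0):
        raise ValueError("both loci must be polymorphic")
    d = p_ab - p_a * p_b  # raw disequilibrium coefficient
    r_squared = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    # D' rescales |D| by its maximum possible value given the allele frequencies.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    return r_squared, d_prime
```

A high D' with a somewhat lower r² is the pattern expected when two alleles almost always travel together but differ in frequency; both measures equal 1 only when the two loci carry the same information.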
We also compared findings from our analysis to susceptibility SNPs identified in previous GWAS/EXWAS of childhood ALL (Table 3). Notably, the effect sizes for rs7089424 and rs10821936 in our study were over 20% higher than estimates calculated in previous genetic association studies among primarily European (non-Hispanic white) populations.
ARID5B SNPs have also been evaluated among Hispanics in particular in a small subset of studies. Our results are consistent with one other study, which observed a similar effect size for rs7089424 among Hispanics with childhood B-ALL (OR = 2.33, 95%CI: 1.85-2.92) [13].
It has been observed that these ARID5B variants appear to be more common among Hispanics [8,13]. As would be expected, the risk allele frequencies (RAFs) observed for both rs10821936 and rs7089424 among both cases and parents in our study were higher than the risk allele frequencies reported among Hispanic and Native American reference populations (S1 Table). However, a GWAS by Xu et al. (2013) observed a RAF of 0.63 for rs10821936 among Hispanic ALL cases [8], which was closer to the RAF observed among Hispanic ALL cases for this SNP in our study (0.73). It is possible that the increased frequency of the risk allele for both rs10821936 and rs7089424 among Hispanics and Native Americans could be a factor in the increased risk of childhood ALL among these populations.
The higher magnitude of effect observed among Hispanics in this study (as compared to studies among predominantly non-Hispanic whites) could be due to a stronger effect within this population for biological reasons, which are not apparent through this study. It is also possible that the larger effect size could be due to the study design used. The case-parent trio design may be a more unbiased approach in evaluating these effects, because it produces estimates that are inherently adjusted for effects of population stratification.
ARID5B regulates the transcription of certain genes during embryonic development, and plays a role in the differentiation of B-lymphocyte progenitor cells [7,32]. Therefore, germline variation at this locus could alter B-lymphocyte development, thus playing a part in susceptibility to B-ALL [7,33]. Furthermore, both rs7089424 and rs10821936 may have a role in transcriptional regulation. The risk allele for rs10821936, in particular, is thought to eliminate the binding site of transcriptional factor NIT2, which could alter gene expression in cis [34,35].
While SNPs in ARID5B were the only variants to reach genome-wide level significance in our study, previously reported SNPs IKZF1 rs4132601 and rs11978267 were nominally significant (P < 0.05), and their effect sizes were similar to previous assessments [3–5,7,8]. The only other SNPs previously found to be associated with ALL that were directly genotyped on the platform were PIP4K2A rs10828317 and rs7088318. Although PIP4K2A rs10828317 was not statistically significant, the effect size was comparable to that reported in Migliorini et al. (RR = 1.16 vs. 1.23) [11]. However, this was not the case for PIP4K2A rs7088318 (RR = 1.04 vs. 1.40) [8]. It is possible that variations in risk allele frequency among Hispanics could contribute to these differences, in part because prior assessments have largely focused on those of European ancestry. Because of the nature of the variants contained on the SNP array used in this study (i.e., the majority were less common exonic SNPs, with 86.1% of the SNPs on the chip having a MAF <1%), most of the SNPs associated with ALL in the literature were not available and therefore were not evaluated in this analysis.
This study should be considered in the light of certain limitations. As noted, because of the rare exonic content of the SNP chip, as well as the admixed study population, we did not perform SNP imputation. Additionally, because of the sample size, imputation of additional rare variants (and subsequent association testing) would have provided limited information. More specifically, our sample size did not allow for the detection of novel variants that are  [19,36]. Because Hispanic children carry the burden of both a higher incidence of childhood ALL and worse outcomes of the disease than non-Hispanic whites or blacks [37,38], it is especially important to conduct exome- and genome-wide assessments for this disease among this particular ethnic group. Furthermore, a case-parent trio study design was used; unlike the case-control design employed by many GWAS, this approach is not subject to population stratification bias and is thus more likely to yield valid conclusions when assessing inherited genetic effects among Hispanic or other genetically diverse populations [16,17].
In summary, this is the first family-based study to evaluate inherited genetic variations associated with childhood ALL among those with Hispanic ancestry. These results confirm that ARID5B plays an important role in childhood ALL susceptibility in this high-risk population, and that the risk of childhood ALL associated with ARID5B rs10821936 and rs7089424 variants may be greater for Hispanics than among European populations. The study's focus on Hispanic populations addresses an important gap in information about childhood ALL genetic risk factors for Hispanics.
Supporting information
S1 Table. Risk allele frequencies (RAFs) of ARID5B rs10821936 and rs7089424 for each of the groups used in the analysis, as well as for selected reference populations. (DOCX)
S2 Table. Results for the 100 most highly significant SNPs in the inherited effects analysis, entire ALL study cohort.