Eight quick tips for including chromosome X in genome-wide association studies

  • Justin Bellavance ,

    Contributed equally to this work with: Justin Bellavance, Linda Wang

    Affiliations Faculty of Medicine, Université de Montréal, Montréal, Québec, Canada, Research Centre, Montréal Heart Institute, Montréal, Québec, Canada

  • Linda Wang ,

    Contributed equally to this work with: Justin Bellavance, Linda Wang

    Affiliations Faculty of Medicine, Université de Montréal, Montréal, Québec, Canada, Research Centre, Montréal Heart Institute, Montréal, Québec, Canada

  • Sarah A. Gagliano Taliun

    sarah.gagliano-taliun@umontreal.ca

    Affiliations Research Centre, Montréal Heart Institute, Montréal, Québec, Canada, Department of Medicine, Faculty of Medicine, Université de Montréal, Montréal, Québec, Canada, Department of Neurosciences, Faculty of Medicine, Université de Montréal, Montréal, Québec, Canada

Introduction

All individuals carry a minimum of 1 copy of chromosome X. Although chromosome X is relatively long, with more than 150 million base pairs [1] (similar in length to chromosome 8), association testing of genetic variants on this chromosome is still not routinely conducted. Genome-wide association studies (GWAS) have been used to identify a vast range of genomic loci of interest for a variety of complex human diseases and traits by quantifying genetic variants that are statistically associated with a given disease/trait [2,3]. However, a lack of testing for variants on the X chromosome limits our ability to identify vital loci and subsequently understand potential mechanisms linked to this chromosome. A call for the inclusion of chromosome X in genome-wide association analyses was published in 2013. At that time, a scan of published GWAS from 2010 and 2011 showed that only 33% of the studies had tested variants on the X chromosome in their analyses [4]. Despite this call for inclusion, the representation of this chromosome has not improved according to a 2023 study. Of the 136 publications that submitted at least 1 summary statistics file to the NHGRI-EBI GWAS Catalog in 2021, only 25% reported chromosome X results [5].

Indeed, there are several characteristics of this chromosome that make it unique compared to the autosomes, which can pose analytical challenges in association testing. Such challenges include how to account for X inactivation in individuals with an XX karyotype, how to model the hemizygous state of genotypes in individuals with an XY karyotype, or how to best code genotypes at the 2 pseudo-autosomal regions, short stretches at either end of the X with high homology with the Y chromosome, known as PAR1 and PAR2. The non-pseudo-autosomal region (nonPAR) denotes the middle sequence of the X chromosome.

Furthermore, many widely used software tools that take GWAS summary statistics as input ignore chromosome X information [6,7]. This practice can make it difficult and unintuitive for researchers to carry chromosome X results through to downstream analyses, which in turn discourages association testing on the X chromosome.

Routine inclusion of chromosome X in GWAS and downstream analyses will serve to enhance our understanding of the genetic contributors to complex diseases and traits. Here, we propose 8 tips to help move towards the inclusion of X in GWAS. Together, they provide a suggested set of concrete actions that can be taken to overcome the challenges or obstacles preventing routine analysis of this chromosome.

Tip 1: Retain chromosome X variants in your quality control pipeline

Chromosome X variants tend to have lower quality than autosomal variants, whether genotyped, sequenced, or imputed [8,9]. Perform careful quality control checks genome-wide, including variants on the X chromosome, instead of discarding genotyping or sequencing calls for variants on the X chromosome. For sequencing data, depth on the X can be used to get a sense of each study participant’s sex chromosome karyotype. For genotyping array data, F statistics of X chromosome heterozygosity, as implemented in PLINK, for example, can be estimated to accomplish this task. Chromosome X-specific variant filters as reviewed by Keur and colleagues [10] should be incorporated, including testing for missingness by sex, assessing call rate by sex, and assessing the proportion of heterozygotes in individuals with XY karyotypes. Additionally, Hardy–Weinberg Equilibrium models accounting for different ploidy on chromosome X (for example, diploid for females and haploid for males in the nonPAR region) with different assumptions with regard to X inactivation (see Tip 6) have been proposed [11].
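As an illustration of the heterozygosity-based check described above, the following minimal Python sketch computes a simplified F statistic, F = (observed homozygotes − expected homozygotes) / (called variants − expected homozygotes), for one individual across nonPAR chromosome X variants. It is not PLINK's exact computation. The function names are hypothetical, genotypes are assumed to be coded as 0/1/2 alternate-allele counts with `None` for missing calls, and the 0.2 and 0.8 cut-offs are assumptions modelled on commonly cited PLINK defaults that should be checked against the PLINK documentation and the empirical F distribution of the dataset.

```python
from typing import Optional, Sequence


def x_heterozygosity_f(genotypes: Sequence[Optional[int]],
                       alt_freqs: Sequence[float]) -> Optional[float]:
    """Simplified F statistic over nonPAR chromosome X variants for one person."""
    observed_hom = 0
    expected_hom = 0.0
    n_called = 0
    for genotype, freq in zip(genotypes, alt_freqs):
        # Skip missing calls and monomorphic variants (uninformative).
        if genotype is None or not 0.0 < freq < 1.0:
            continue
        n_called += 1
        if genotype != 1:
            observed_hom += 1
        # Expected homozygosity under Hardy-Weinberg proportions.
        expected_hom += 1.0 - 2.0 * freq * (1.0 - freq)
    if n_called == 0:
        return None
    return (observed_hom - expected_hom) / (n_called - expected_hom)


def karyotype_hint(f_stat: Optional[float],
                   xx_max: float = 0.2, xy_min: float = 0.8) -> str:
    """Map F to a coarse label; thresholds are assumptions to be tuned."""
    if f_stat is None:
        return "unknown"
    if f_stat > xy_min:
        return "XY-like"
    if f_stat < xx_max:
        return "XX-like"
    return "ambiguous"


def het_call_rate(genotypes: Sequence[Optional[int]]) -> Optional[float]:
    """Proportion of heterozygous calls among non-missing genotypes."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return None
    return sum(1 for g in called if g == 1) / len(called)
```

For individuals with an XY karyotype, `het_call_rate` on nonPAR variants is expected to be close to zero, so a clearly non-zero value can flag genotyping errors or an unexpected karyotype.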

Tip 2: Impute chromosome X variants

For genotyping array data, maximize the number of genetic variants available for association testing by imputing genotypes of variants that are not present on the array, including variants on the X chromosome. Various free online resources can be used to carry out the computationally intensive task of genotype imputation using available whole-genome sequencing imputation panels (Table 1), including the Trans-Omics for Precision Medicine (TOPMed) Imputation Server [12], Sanger Imputation Service [13], and Michigan Imputation Server [14]. The TOPMed and Michigan Imputation Servers will perform some basic chromosome X-specific quality control on the uploaded genotypes, including verifying that all variants in the nonPAR region are either haploid or diploid. For these 2 servers, chromosome X is split into 3 parts (PAR1, nonPAR, and PAR2) for statistical phasing and genotype imputation, and then subsequently merged. In contrast, the Sanger Imputation Service does not perform quality control steps on the genotypes. A thorough assessment of the various imputation panels and tools available to handle chromosome X imputation is pertinent but is beyond the scope of this short educational article.
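The three-way split of chromosome X used for phasing and imputation can be sketched as follows. This is an illustrative Python example, not the servers' implementation. The boundary coordinates are commonly cited GRCh38 values and are an assumption here; they differ between genome builds and should be verified against the reference genome used. The function names are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

# Assumed GRCh38 pseudo-autosomal boundaries on chromosome X (1-based, inclusive).
PAR_GRCH38: Dict[str, Tuple[int, int]] = {
    "PAR1": (10_001, 2_781_479),
    "PAR2": (155_701_383, 156_030_895),
}


def x_region(position: int,
             par: Dict[str, Tuple[int, int]] = PAR_GRCH38) -> str:
    """Return 'PAR1', 'PAR2', or 'nonPAR' for a chromosome X position."""
    for name, (start, end) in par.items():
        if start <= position <= end:
            return name
    return "nonPAR"


def split_x_variants(positions: Iterable[int]) -> Dict[str, List[int]]:
    """Group chromosome X positions into the three regions."""
    groups: Dict[str, List[int]] = {"PAR1": [], "nonPAR": [], "PAR2": []}
    for position in positions:
        groups[x_region(position)].append(position)
    return groups
```

Variants in PAR1 and PAR2 are diploid in all individuals, whereas nonPAR variants are haploid in individuals with an XY karyotype, which is why the regions are handled separately.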

Table 1. Descriptive summary of some available genotype imputation panels.

https://doi.org/10.1371/journal.pcbi.1012160.t001

Tip 3: Use genetic association software that supports X chromosome testing

It is necessary to understand how conventional genetic association software handles the X chromosome because different tools handle association testing on this chromosome differently (Table 2). These differences can cause missteps for researchers trying to include the X chromosome in their analyses for the first time. While there is no single recommended tool or approach for X chromosome association analysis, one study observed that in samples with a skewed male:female ratio, coding the male chromosome X genotypes as 0/2 (rather than 0/1) alleviated type 1 error inflation [15]. Therefore, when male:female imbalance is present in an analysis dataset, the choice of coding for male genotypes on the X should be considered carefully. We recommend implementing accepted genome-wide significance thresholds that account for the multiple testing inherent to genome-wide scans (for example, 5 × 10⁻⁸, which corresponds to a Bonferroni-type adjustment of a 0.05 significance level for 1 million independent tests) [16], rather than deriving X-specific significance thresholds.
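The arithmetic behind the threshold recommendation can be made concrete with a short Python sketch. It assumes association results are available as (chromosome, variant identifier, p-value) tuples; this layout and the function names are hypothetical. The point illustrated is that chromosome X variants are filtered with the same genome-wide threshold as autosomal variants.

```python
from typing import Iterable, List, Tuple

Result = Tuple[str, str, float]  # (chromosome, variant ID, p-value)


def bonferroni_threshold(alpha: float = 0.05,
                         n_tests: int = 1_000_000) -> float:
    """Significance level divided by the assumed number of independent tests."""
    return alpha / n_tests


def genome_wide_hits(results: Iterable[Result],
                     threshold: float = bonferroni_threshold()) -> List[Result]:
    """Keep results below the threshold, treating X like any other chromosome."""
    return [row for row in results if row[2] < threshold]
```

With the default arguments, `bonferroni_threshold()` evaluates to approximately 5 × 10⁻⁸.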

Table 2. Non-comprehensive list of software for association testing, including chromosome X variants.

Flags for single variant association testing are described.

https://doi.org/10.1371/journal.pcbi.1012160.t002

Tip 4: Perform stratified association analyses for X chromosome variants

Stratified analyses, in which association tests are carried out in individuals with XX karyotypes separately from those with XY karyotypes, can be conducted to identify associations on the X (as well as for autosomal variants) that differ in magnitude, direction, or significance depending on sex chromosome karyotype. Tests of heterogeneity to identify genetic variants that exhibit significant association differences in the stratified association tests should also be carried out to quantify possible association differences. As sample sizes increase, individuals with sex-chromosome aneuploidy should be included in analyses, possibly stratified according to sex chromosome karyotype. However, stratifying study individuals into subgroups will reduce the total sample size for the association test. In datasets where statistical power is a concern, there are alternative methods to test for effect of sex, such as testing for genetic variant-by-sex interactions.
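One simple way to test for heterogeneity between the two strata is a z-test on the difference in effect estimates. The Python sketch below is a minimal illustration of that idea, not a method prescribed by any particular tool. It assumes the two strata are independent (no related individuals shared across strata) and that the effect estimates are on a comparable scale, which depends on how genotypes in XY individuals were coded. The function name is hypothetical.

```python
import math
from typing import Tuple


def sex_heterogeneity_test(beta_xx: float, se_xx: float,
                           beta_xy: float, se_xy: float) -> Tuple[float, float]:
    """Return (z, two-sided p) for the difference between stratum effects."""
    z = (beta_xx - beta_xy) / math.sqrt(se_xx ** 2 + se_xy ** 2)
    # Two-sided p-value under a standard normal null distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


# Usage with made-up estimates for a single variant.
z_stat, p_het = sex_heterogeneity_test(0.30, 0.10, 0.00, 0.10)
```

Because such a test is run for every variant, its p-values are subject to the same multiple-testing considerations as the association tests themselves.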

Tip 5: At the minimum, assess biallelic single-nucleotide polymorphisms and variants as well as small insertions and deletions (indels) on chromosome X

Limited work has been done on investigating genetic variant–trait associations for more complex types of genetic variation on the X chromosome. Nevertheless, there is evidence of many types of genetic variation on the X chromosome contributing to complex disease risk, thus warranting additional investigations across traits and across types of variation (Table 3). When possible, expand to other variation types such as copy number and structural variants and include variants with more than 2 alleles (i.e., multi-allelic variants).

Table 3. Summary of selected types of genetic variation with known associations for complex human diseases or traits on chromosome X.

https://doi.org/10.1371/journal.pcbi.1012160.t003

Tip 6: Consider implementing various statistical models to account for X inactivation

In XX individuals, there can be compensatory mechanisms to reduce the dosage of gene expression in which one copy is subject to inactivation, resulting in only one copy of the gene being expressed (Fig 1). This biological factor must be considered while performing genetic association analyses on chromosome X. Statistical methods have been proposed to account for different inactivation models: random, skewed, or escape from inactivation [23,24,31,32]. PLINK [17,18] and XWAS [23] can model escape from inactivation or random inactivation, and there are models for skewed inactivation [31,33]. Given the lack of widely used standards and guidelines for association testing on the X chromosome, it can be useful to report association results from more than one applied method to further research in this direction using the most recent map of X inactivation across human tissues [34]. Applying and comparing association models accounting for different models of X inactivation for genetic variants on the X chromosome in XX individuals will be an important step forward to foster chromosome X association testing.

Fig 1. Representation of proposed X inactivation models in XX individuals.

In random X inactivation, 50% of cells have 1 allele inactive, whereas the other 50% have the other allele inactive (a). In skewed (also known as nonrandom) inactivation, more than 50% of cells (for example, more than 75%) have the same allele inactive (b). Finally, escape from X inactivation describes the scenario where both alleles are active in all cells (c).

https://doi.org/10.1371/journal.pcbi.1012160.g001
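The inactivation models above translate into different genotype codings for nonPAR variants. The Python sketch below shows one common convention, stated here as an assumption rather than as the coding used by any specific software: under random inactivation, XY hemizygotes are coded 0/2 so that one active copy matches an XX homozygote; under escape, they are coded 0/1; under skewed inactivation, XX heterozygotes take a value between 0 and 2. The function name and the `het_value` parameter are hypothetical.

```python
def x_dosage(alt_count: int, karyotype: str,
             model: str = "random", het_value: float = 1.0) -> float:
    """Code a nonPAR chromosome X genotype under an assumed inactivation model.

    alt_count: alternate alleles carried (0/1 for XY, 0/1/2 for XX).
    het_value: coding for XX heterozygotes under the 'skewed' model (0 to 2).
    """
    if model not in ("random", "skewed", "escape"):
        raise ValueError(f"unknown model: {model}")
    if karyotype == "XY":
        if alt_count not in (0, 1):
            raise ValueError("XY individuals carry 0 or 1 copies in nonPAR")
        # With inactivation, one active copy in XY equals two copies in XX.
        return float(alt_count) if model == "escape" else 2.0 * alt_count
    if karyotype == "XX":
        if alt_count not in (0, 1, 2):
            raise ValueError("XX individuals carry 0, 1, or 2 copies")
        if model == "skewed" and alt_count == 1:
            if not 0.0 <= het_value <= 2.0:
                raise ValueError("het_value must lie between 0 and 2")
            return het_value
        return float(alt_count)
    raise ValueError(f"unsupported karyotype: {karyotype}")
```

Running the same association model with dosages produced under each setting is one way to compare results across inactivation assumptions, as suggested in this tip.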

Tip 7: Make your code and association summary statistics, including chromosome X results, publicly available

Data availability increases transparency and reproducibility and facilitates future research within the broader research community. Share your results with the broader community in a peer-reviewed publication describing your work. Sharing of summary statistics is commonplace and now required by many journals for publication of primary research articles. These summary statistics are used by the broader research community for a variety of tasks including replication of association signals, meta-analysis of cohorts in which the same or similar trait is measured, carrying out two-sample mendelian randomization, identification of variants to include in the development of a polygenic score and their corresponding weights, and many more downstream analyses. Not to mention, including chromosome X in summary statistics facilitates the inclusion of variants on this chromosome in these downstream analyses, furthering knowledge on the contribution of chromosome X variants to complex traits and diseases.

Code availability goes hand in hand with data availability in the call for open science. Making code available can provide other researchers with a starting point to facilitate the inclusion of X variants in their analyses. To this end, it is important to provide annotated code where the implemented steps are well documented. Guidelines described elsewhere should be followed to ensure code reproducibility that supports open science [35,36].
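Before sharing summary statistics, a quick check that chromosome X rows are actually present can prevent accidental omission. The Python sketch below assumes a tab-delimited file with a header row and a chromosome column named `CHR`; both the column name and the function name are hypothetical and should be adapted to the file format in use. The accepted labels cover the common `X` and `chrX` conventions and the numeric codes 23 (X) and 25 (pseudo-autosomal region) used by PLINK.

```python
import csv
from typing import Iterable

# Labels treated as chromosome X (case-insensitive); 23 and 25 follow
# PLINK's numeric coding for X and the pseudo-autosomal region.
X_LABELS = {"x", "chrx", "23", "xy", "25"}


def count_x_variants(lines: Iterable[str], chrom_col: str = "CHR",
                     delimiter: str = "\t") -> int:
    """Count rows whose chromosome label denotes chromosome X."""
    reader = csv.DictReader(lines, delimiter=delimiter)
    return sum(1 for row in reader
               if row[chrom_col].strip().lower() in X_LABELS)


# Usage on an in-memory example; with a real file, pass an open file handle.
example = ["CHR\tPOS\tP", "1\t12345\t0.5", "X\t2500000\t0.01", "23\t900\t0.2"]
n_x = count_x_variants(example)
```

A count of zero indicates that chromosome X results were dropped somewhere in the pipeline and should be restored before the file is shared.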

Tip 8: Embrace the biological complexity and value of X chromosome analyses

By following these tips, we can all do our part in increasing the inclusion of chromosome X in association analyses and embracing our biological complexity through association analyses.

Conclusions

Testing variants on the X chromosome is important for increasing understanding of this chromosome’s role in complex traits and diseases. Not to mention, making these results available to the broader community will facilitate downstream analyses including meta-analyses, the creation of polygenic risk scores and causal inference using mendelian randomization that incorporate chromosome X to advance the field. That being said, the inclusion of X chromosome genetic variants in association analyses is not a “one size fits all” approach. There is no single software or X inactivation model that will always give optimal results for every trait and in every study design. We encourage researchers to conduct further investigations to choose the best tool, resource, or method for their specific task using the information provided here as a starting point. All in all, addressing the challenges associated with chromosome X analyses will be crucial to foster future opportunities for scientific discovery related to topics that are not yet well understood, including, but not limited to, understanding haploid versus diploid states, sex chromosome aneuploidy states, and the biology of nonrandom X inactivation.

References

  1. Ross MT, Grafham DV, Coffey AJ, Scherer S, McLay K, Muzny D, et al. The DNA sequence of the human X chromosome. Nature. 2005;434(7031):325–337. pmid:15772651; PubMed Central PMCID: PMC2665286.
  2. Watanabe K, Stringer S, Frei O, Umicevic Mirkov M, de Leeuw C, Polderman TJC, et al. A global overview of pleiotropy and genetic architecture in complex traits. Nat Genet. 2019;51(9):1339–48. Epub 20190819. pmid:31427789.
  3. MacArthur J, Bowler E, Cerezo M, Gil L, Hall P, Hastings E, et al. The new NHGRI-EBI Catalog of published genome-wide association studies (GWAS Catalog). Nucleic Acids Res. 2017;45(D1):D896–D901. Epub 20161129. pmid:27899670; PubMed Central PMCID: PMC5210590.
  4. Wise AL, Gyi L, Manolio TA. eXclusion: toward integrating the X chromosome in genome-wide association analyses. Am J Hum Genet. 2013;92(5):643–647. pmid:23643377; PubMed Central PMCID: PMC3644627.
  5. Sun L, Wang Z, Lu T, Manolio TA, Paterson AD. eXclusionarY: 10 years later, where are the sex chromosomes in GWASs? Am J Hum Genet. 2023;110(6):903–912. pmid:37267899; PubMed Central PMCID: PMC10257007.
  6. Bulik-Sullivan BK, Loh PR, Finucane HK, Ripke S, Yang J, Schizophrenia Working Group of the Psychiatric Genomics C, et al. LD Score regression distinguishes confounding from polygenicity in genome-wide association studies. Nat Genet. 2015;47(3):291–5. Epub 20150202. pmid:25642630; PubMed Central PMCID: PMC4495769.
  7. Brown BC, Asian Genetic Epidemiology Network Type 2 Diabetes C, Ye CJ, Price AL, Zaitlen N. Transethnic Genetic-Correlation Estimates from Summary Statistics. Am J Hum Genet. 2016;99(1):76–88. Epub 20160616. pmid:27321947; PubMed Central PMCID: PMC5005434.
  8. Chen DZ, Roshandel D, Wang Z, Sun L, Paterson AD. Comprehensive whole-genome analyses of the UK Biobank reveal significant sex differences in both genotype missingness and allele frequency on the X chromosome. Hum Mol Genet. 2024;33(6):543–551. pmid:38073250.
  9. Schurz H, Muller SJ, van Helden PD, Tromp G, Hoal EG, Kinnear CJ, et al. Evaluating the Accuracy of Imputation Methods in a Five-Way Admixed Population. Front Genet. 2019;10:34. Epub 20190205. pmid:30804980; PubMed Central PMCID: PMC6370942.
  10. Keur N, Ricano-Ponce I, Kumar V, Matzaraki V. A systematic review of analytical methods used in genetic association analysis of the X-chromosome. Brief Bioinform. 2022;23(5). pmid:35901513; PubMed Central PMCID: PMC9764208.
  11. Graffelman J, Weir BS. Testing for Hardy-Weinberg equilibrium at biallelic genetic markers on the X chromosome. Heredity (Edinb). 2016;116(6):558–68. Epub 20160413. pmid:27071844; PubMed Central PMCID: PMC4868269.
  12. Taliun D, Harris DN, Kessler MD, Carlson J, Szpiech ZA, Torres R, et al. Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program. Nature. 2021;590(7845):290–9. Epub 20210210. pmid:33568819; PubMed Central PMCID: PMC7875770.
  13. McCarthy S, Das S, Kretzschmar W, Delaneau O, Wood AR, Teumer A, et al. A reference panel of 64,976 haplotypes for genotype imputation. Nat Genet. 2016;48(10):1279–83. Epub 20160822. pmid:27548312; PubMed Central PMCID: PMC5388176.
  14. Das S, Forer L, Schonherr S, Sidore C, Locke AE, Kwong A, et al. Next-generation genotype imputation service and methods. Nat Genet. 2016;48(10):1284–7. Epub 20160829. pmid:27571263; PubMed Central PMCID: PMC5157836.
  15. Ozbek U, Lin HM, Lin Y, Weeks DE, Chen W, Shaffer JR, et al. Statistics for X-chromosome associations. Genet Epidemiol. 2018;42(6):539–50. Epub 20180613. pmid:29900581; PubMed Central PMCID: PMC6394852.
  16. Uffelmann E, Huang QQ, Munung NS, de Vries J, Okada Y, Martin AR, et al. Genome-wide association studies. Nat Rev Methods Primers. 2021;1(1):59.
  17. Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MA, Bender D, et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet. 2007;81(3):559–75. Epub 20070725. pmid:17701901; PubMed Central PMCID: PMC1950838.
  18. Chang CC, Chow CC, Tellier LC, Vattikuti S, Purcell SM, Lee JJ. Second-generation PLINK: rising to the challenge of larger and richer datasets. Gigascience. 2015;4:7. Epub 20150225. pmid:25722852; PubMed Central PMCID: PMC4342193.
  19. Zhou W, Nielsen JB, Fritsche LG, Dey R, Gabrielsen ME, Wolford BN, et al. Efficiently controlling for case-control imbalance and sample relatedness in large-scale genetic association studies. Nat Genet. 2018;50(9):1335–41. Epub 20180813. pmid:30104761; PubMed Central PMCID: PMC6119127.
  20. Mbatchou J, Barnard L, Backman J, Marcketta A, Kosmicki JA, Ziyatdinov A, et al. Computationally efficient whole-genome regression for quantitative and binary traits. Nat Genet. 2021;53(7):1097–103. Epub 20210520. pmid:34017140.
  21. Loh PR, Kichaev G, Gazal S, Schoech AP, Price AL. Mixed-model association for biobank-scale datasets. Nat Genet. 2018;50(7):906–908. pmid:29892013; PubMed Central PMCID: PMC6309610.
  22. Yang J, Lee SH, Goddard ME, Visscher PM. GCTA: a tool for genome-wide complex trait analysis. Am J Hum Genet. 2011;88(1):76–82. Epub 20101217. pmid:21167468; PubMed Central PMCID: PMC3014363.
  23. Gao F, Chang D, Biddanda A, Ma L, Guo Y, Zhou Z, et al. XWAS: A Software Toolset for Genetic Data Analysis and Association Studies of the X Chromosome. J Hered. 2015;106(5):666–71. Epub 20150812. pmid:26268243; PubMed Central PMCID: PMC4567842.
  24. Clayton D. Testing for association on the X chromosome. Biostatistics. 2008;9(4):593–600. Epub 20080425. pmid:18441336; PubMed Central PMCID: PMC2536723.
  25. Vuckovic D, Bao EL, Akbari P, Lareau CA, Mousas A, Jiang T, et al. The Polygenic and Monogenic Basis of Blood Traits and Diseases. Cell. 2020;182(5):1214–31 e11. pmid:32888494; PubMed Central PMCID: PMC7482360.
  26. Conti DV, Darst BF, Moss LC, Saunders EJ, Sheng X, Chou A, et al. Trans-ancestry genome-wide association meta-analysis of prostate cancer identifies new susceptibility loci and informs genetic risk prediction. Nat Genet. 2021;53(1):65–75. Epub 20210104. pmid:33398198; PubMed Central PMCID: PMC8148035.
  27. Eichler EE. Copy Number Variation and Human Disease. Nat Educ. 2008;1(3):1.
  28. Auwerx C, Lepamets M, Sadler MC, Patxot M, Stojanov M, Baud D, et al. The individual and global impact of copy-number variants on complex human traits. Am J Hum Genet. 2022;109(4):647–68. Epub 20220302. pmid:35240056; PubMed Central PMCID: PMC9069145.
  29. Sudmant PH, Rausch T, Gardner EJ, Handsaker RE, Abyzov A, Huddleston J, et al. An integrated map of structural variation in 2,504 human genomes. Nature. 2015;526(7571):75–81. pmid:26432246; PubMed Central PMCID: PMC4617611.
  30. Collins RL, Brand H, Karczewski KJ, Zhao X, Alfoldi J, Francioli LC, et al. A structural variation reference for medical and population genetics. Nature. 2020;581(7809):444–51. Epub 20200527. pmid:32461652; PubMed Central PMCID: PMC7334194.
  31. Wang J, Yu R, Shete S. X-chromosome genetic association test accounting for X-inactivation, skewed X-inactivation, and escape from X-inactivation. Genet Epidemiol. 2014;38(6):483–93. Epub 20140708. pmid:25043884; PubMed Central PMCID: PMC4127090.
  32. Chen B, Craiu RV, Sun L. Bayesian model averaging for the X-chromosome inactivation dilemma in genetic association study. Biostatistics. 2020;21(2):319–335. pmid:30247537.
  33. Su Y, Hu J, Yin P, Jiang H, Chen S, Dai M, et al. XCMAX4: A Robust X Chromosomal Genetic Association Test Accounting for Covariates. Genes (Basel). 2022;13(5). Epub 20220509. pmid:35627231; PubMed Central PMCID: PMC9141238.
  34. Tukiainen T, Villani AC, Yen A, Rivas MA, Marshall JL, Satija R, et al. Landscape of X chromosome inactivation across human tissues. Nature. 2017;550(7675):244–248. pmid:29022598; PubMed Central PMCID: PMC5685192.
  35. Sandve GK, Nekrutenko A, Taylor J, Hovig E. Ten simple rules for reproducible computational research. PLoS Comput Biol. 2013;9(10):e1003285. Epub 20131024. pmid:24204232; PubMed Central PMCID: PMC3812051.
  36. Tonzani S, Fiorani S. The STAR Methods way towards reproducibility and open science. iScience. 2021;24(4):102137. Epub 20210401. pmid:33997663; PubMed Central PMCID: PMC8100894.