Tuberculosis severity associates with variants and eQTLs related to vascular biology and infection-induced inflammation

Background Tuberculosis (TB) remains a major public health problem globally, even compared to COVID-19. Genome-wide studies have failed to discover genes that explain a large proportion of genetic risk for adult pulmonary TB, and even fewer have examined genetic factors underlying TB severity, an intermediate trait impacting disease experience, quality of life, and risk of mortality. No prior severity analyses used a genome-wide approach. Methods and findings As part of our ongoing household contact study in Kampala, Uganda, we conducted a genome-wide association study (GWAS) of TB severity measured by TBScore, in two independent cohorts of culture-confirmed adult TB cases (n = 149 and n = 179). We identified 3 SNPs (P<1.0 x 10–7) including one on chromosome 5, rs1848553, that was GWAS significant (meta-analysis p = 2.97x10-8). All three SNPs are in introns of RGS7BP and have effect sizes corresponding to clinically meaningful reductions in disease severity. RGS7BP is highly expressed in blood vessels and plays a role in infectious disease pathogenesis. Other genes with suggestive associations defined gene sets involved in platelet homeostasis and transport of organic anions. To explore functional implications of the TB severity-associated variants, we conducted eQTL analyses using expression data from Mtb-stimulated monocyte-derived macrophages. A single variant (rs2976562) associated with monocyte SLA expression (p = 0.03) and subsequent analyses indicated that SLA downregulation following MTB stimulation associated with increased TB severity. Src Like Adaptor (SLAP-1), encoded by SLA, is highly expressed in immune cells and negatively regulates T cell receptor signaling, providing a potential mechanistic link to TB severity. Conclusions These analyses reveal new insights into the genetics of TB severity with regulation of platelet homeostasis and vascular biology being central to consequences for active TB patients. This analysis also reveals genes that regulate inflammation can lead to differences in severity. Our findings provide an important step in improving TB patient outcomes.

We computed principal components (PC's) but did not include these PC's in the regression analyses as the sample populations did not appear to show substantial substructure. This is consistent with previous studies that have shown there is little to no discernable genetic sub-structure within our study population based on microsatellite analyses and previous regression and PC analyses (2). Our decision was based on 3 pieces of evidence in our data that we used to decide against their inclusion a priori: 1.) There did not appear to be clustering in the plot of PC1 vs. PC2 (S2 and S3 Figs); 2.) Based on the eigenvalues, the PC's did not explain a substantial amount of variation in the genotypes of the subjects included in the cohort (2.8% for the top PC in Cohort 1 and 4.3% for the top PC in Cohort 2); and 3.) the top 3 PC's did not associate with the outcome (i.e. TBscore). The P-value for the association between PC1 and TBscore was 0.2 in Cohort 1 and 0.6 in Cohort 2, using a linear regression model (and these were the lowest P-values for any of the top 10 PC's in either cohort). If the PC's neither correlate with the outcome nor explain variation in the genotype data, then it is unlikely that unmeasured population substructure will bias the association between genotype and TBscore. However, we also performed a sensitivity analysis to determine how the inclusion of PC's in the regression equations for each cohort would affect the P-values and Beta coefficients for the top SNP.

Enrichment and Annotation of Severity Associated SNPs
In order to provide greater depth of meaning and interpretability to our GWAS summary statistic results, we utilized FUMA GWAS to provide annotation of individual SNPs, such as regulatory information, tissue specificity, and gene mapping (3). We utilized p<1e-05 as the threshold for enrichment and annotation and a distance of +/-10 kb for gene mapping to assign a gene to SNPs that were not located within a gene. We also utilized FUMA GWAS' GENE2FUNC feature for the genes represented in the GWAS summary statistics. This feature maps GWAS summary statistics to genes, and then provides gene set or pathway enrichments based on gene sets from MsigDB, KEGG, WikiPathways, and the GWAS Catalog. In order to determine significance, FUMA uses hypergeometric mean pathway analysis and adjusts the pvalues for significance based on the Benjamini-Hochberg method (i.e. FDR) using the number of data sources of tested gene sets as the number of independent tests performed in the multiple testing correction. FUMA reports gene sets with adjusted P-value ≤ 0.05 and the number of genes that overlap with the gene set > 1 (3). In addition to FUMA, we utilized GeneCards, Ensembl, DICE, and STRING DB to help annotate and enrich our results with respect to function, expression, and downstream protein interactions (4-7).
Another feature that FUMA provides is a MAGMA gene-based analysis using the full range of GWAS summary statistics. MAGMA is a computational biology tool that utilizes the significance and direction of effect of multiple SNPs aggregated together in the same genes (8).
MAGMA is "based on a multiple linear principal components regression model, using an F-test to compute the gene p-value" (8). In addition to looking for gene-level associations, FUMA also used MAGMA's gene set enrichment and tissue specificity analyses, the latter of which uses tissue-specific data on gene expression from GTEx v8. When uploading summary statistics to FUMA, it automatically runs the MAGMA analyses on the data, provided that summary statistics for all SNPs tested are uploaded. Thus, it helps inform the tissue or pathway specific functions of the gene-level associations, rather than mapping individual SNPs to genes and using those. We also decided to look up SNPs in the region of genes that have previously been associated with TB susceptibility. This allows for a comparison of the determinants of each phenotype and can demonstrate whether or not there are different genes associated with each phenotype. To accomplish this, we pulled the genes from a previous review by Stein et al. and examined SNPs within a +/-5kb window of the gene (9).

RNA-Seq Methods
For the purpose of measuring gene expression in the 72 RSTR and LTBI subjects for our eQTL analyses, RNA sequencing was performed on the Illumina HiSeq 2500 in high output mode, which achieved a read depth of 30 million paired-end, 50 base pair reads. Adapters were removed prior to sequence alignment using STAR 2.6.0 (10). Counts for each gene were assigned using RSEM 1.3.0 (11). Using multidimensional scaling (MDS), we found no evidence of batch effects nor any global influence of age or the other available metadata from each cohort (12).
After transcript reads were mapped to genes and pooled gene counts were generated, we filtered the mapped genes to exclude genes not contained on chromosome 1-22 and non-protein coding genes. We limited the data to non-protein coding genes so we could observe changes in expression of genes that had functional downstream consequences. DNA variants in non-coding regions were still examined for regulatory effects but we did not examine changes in expression for non-coding genes. Then, the pooled counts were normalized using the trimmed mean of Mvalues method, which reduces the possibility of bias due to read depth and variations in library size (13). This step is necessary to allow for direct and unbiased comparison of expression across different genes. After normalization, genes were further filtered out if they had low or no expression across all samples. After QC and filtering, there were 13,547 genes remaining in this analysis.

eQTL analysis
Matrix eQTL tests cis and trans eQTL's separately and the subsequent FDR correction used to determine significance is performed separately for each category. In this analysis, we defined a cis eQTL as being within 500Kb of the gene location downloaded from Biomart in Ensembl Version 104 using the GRCh37 (Hg19) human genome build (14). We generated the list of differentially expressed genes using the tag-wise dispersion method (15). A log2 fold change greater than 1 or less than -1 (i.e. genes with expression levels that either double or halved after stimulation with MTB) and an FDR of 0.01 was used as the definition of differential expression. This list of differentially expressed genes was then used to test for eQTL's. All QC, filtering, and differential analyses were performed using Limma, Voom, and EdgeR packages in the Bioconductor suite for R v3.6.3 (16)(17)(18)(19).
The final study population for our follow-up eQTL analysis included 72 subjects without active TB that had data for RNA expression and genotype (S5 Table). The differential expression analysis yielded 1,986 genes that met the definition for differential expression (S6 Table). 161 of our severity-associated SNPs were also available for analysis in the genotype chip. In total, there 78 cis SNP-gene pairings and 332,065 trans SNP-gene pairings, where a cis pairing represented any combination of a SNP and a gene located within 500 kb and a trans pairing represented any combination of a SNP and a gene greater than 500 kb from the target gene.
One of these was rs7707024, a G to C change at position 159320915 on chromosome 5 that associated with a 1-point reduction in TBscore (P=0.02, β= -0.93). rs7707024 is a noncoding transcript variant in a long non-coding RNA (lncRNA) gene, LINC01847. rs7707024 is in a known regulatory region according to Ensembl and is suggested to be an enhancer, but this has not been experimentally verified. We also identified a second SNP, rs13358509, which is an intron variant for LOC285626, an uncharacterized novel transcript. rs13358509 associated with a 0.71 reduction in TBscore (P=0.039).