Risk prediction of pulmonary tuberculosis using genetic and conventional risk factors in adult Korean population

A complex interplay among host, pathogen, and environmental factors is believed to contribute to the risk of developing pulmonary tuberculosis (PTB). The lack of replication of published genome-wide association study (GWAS) findings limits the clinical utility of reported single nucleotide polymorphisms (SNPs). We conducted a GWAS using 467 PTB cases and 1,313 healthy controls obtained from two community-based cohorts in Korea. We evaluated the performance of PTB risk models based on different combinations of genetic and nongenetic factors and validated the results in an independent Korean population comprised of 179 PTB cases and 500 healthy controls. We demonstrated the polygenic nature of PTB and nongenetic factors such as age, sex, and body mass index (BMI) were strongly associated with PTB risk. None of the SNPs achieved genome-wide significance; instead, we were able to replicate the associations between PTB and ten SNPs near or in the genes, CDCA7, GBE1, GADL1, SPATA16, C6orf118, KIAA1432, DMRT2, CTR9, CCDC67, and CDH13, which may play roles in the immune and inflammatory pathways. Among the replicated SNPs, an intergenic SNP, rs9365798, located downstream of the C6orf118 gene showed the most significant association under the dominant model (OR = 1.59, 95% CI 1.32–1.92, P = 2.1×10−6). The performance of a risk model combining the effects of ten replicated SNPs and six nongenetic factors (i.e., age, sex, BMI, cigarette smoking, systolic blood pressure, and hemoglobin) were validated in the replication set (AUC = 0.80, 95% CI 0.76–0.84). The strategy of combining genetic and nongenetic risk factors ultimately resulted in better risk prediction for PTB in the adult Korean population.


Introduction
Pulmonary tuberculosis (PTB) is a prevalent infectious disease that is caused by Mycobacterium tuberculosis (M. tuberculosis). According to the Global Tuberculosis Report published by the World Health Organization in 2015, the estimated numbers of new tuberculosis cases and tuberculosis deaths worldwide reached almost 9.6 million and 1.1 million in 2014, respectively.

Statistical analysis
We evaluated the associations between PTB and 17 conventional risk factors, including household income, cigarette smoking, alcohol consumption, and 12 clinical variables using generalized linear regression models constructed using STATA v.11.2 (Stata Corp., TX, USA). Continuous variables, such as BMI, SBP, DBP, Hb, and BUN, were converted into categorical variables and the new categorical variables were analyzed as well.
We conducted a GWAS for PTB under three genetic models (i.e., additive, dominant, and recessive) after adjusting for age, sex, and BMI using PLINK v.1.07 (http://pngu.mgh.harvard. edu/~purcell/plink/) [20]. The 374 SNPs with P values less than 0.001 identified in the KARE GWAS were validated among subjects of the HEXA Study. In addition, we compared our results with findings cataloged in existing genetic association studies databases, such as the HuGE Navigator Phenopedia (https://phgkb.cdc.gov/HuGENavigator) and NHGRI-EBI GWAS Catalog (https://www.ebi.ac.uk/gwas). We evaluated the statistical power and the sample size to achieve sufficient statistical power (80%) for each SNP using the Genetic Power Calculator, under the assumptions; PTB prevalence of 15% in Koreans, the genetic model showing the lowest P-value for each SNP, complete LD, 1:2.81 case-control ratio used in the current study, and type 1 error rate at genome-wide significance level of 5×10 −8 (http://pngu. mgh.harvard.edu/~purcell/gpc/).

In silico functional analysis
After removing redundant SNPs (r 2 > 0.8), we conducted in silico functional analyses of the ten replicated SNPs to evaluate if they are in high linkage disequilibrium with functional SNPs such as nonsynonymous SNPs, nonsense SNPs, and SNPs creating transcription factor binding sites, splicing regulator, or predicted miRNA binding sites using the SNP Function Prediction program of the SNPinfo Web Server (http://snpinfo.niehs.nih.gov/snpinfo/snpfunc.htm).

Risk prediction for pulmonary tuberculosis
We generated a genetic risk score model comprising ten SNPs replicated in the HEXA study population. We developed weighted genetic, nongenetic, and combined risk score models to predict the risk of developing PTB (i.e., wGRS, wnGRS, and wGRS+wnGRS, respectively). For each individual, a wGRS was calculated by summing the values obtained from multiplying model-specific genotype scores (additive, 0, 1, 2; dominant, 0, 1, 1; recessive, 0, 0, 1) by the natural log transformed odds ratio (ln OR) for each SNP in the model. Likewise, four different wnGRS models composed of different combinations of the nine covariates (age, sex, BMI, systolic and diastolic blood pressures (SBP, DBP), hemoglobin (Hb), and blood urea nitrogen (BUN), cigarette smoking, and alcohol consumption) that demonstrated significant associations with PTB in the univariate models were calculated by categorizing (i.e., 0, 1 or 0, 1, 2) and summing the ln ORs across variables. The predictive abilities of the wGRS and wnGRS models were evaluated by comparing the area under the receiver operating characteristic curves (AUC) using the roctab and roccomp commands in STATA. After stratification into quartiles based on the risk scores of each model, we also examined the associations between each risk score quartile and PTB risk. Finally, we compared the performance of the PTB risk models in HEXA dataset.

Association of nongenetic factors
Ten of the 17 conventional risk factors were identified as significantly associated with increased risk of PTB in the KARE study dataset. PTB risk was higher in men than in women (OR = 2.03, 95% CI 1.64-2.52, P = 1.1×10 −10 ) and in people aged 50 years and older than their younger counterparts (OR = 1.81, 95% CI 1.46-2.24, P = 5.3×10 −8 ). Environmental factors, such as low household income, cigarette smoking, and alcohol consumption, were also identified as potent risk factors. While high BMI was observed to be a strong protective factor against PTB, elevated SBP, DBP, Hb, and BUN levels were associated with increased risk of PTB. The results obtained using the HEXA dataset were similar to those obtained using KARE; however, the associations between PTB and alcohol consumption and BUN were marginally significant, and household income was not significantly associated with PTB in the HEXA dataset ( Table 1). The application of the strict selection criteria excluded a large proportion of controls with hypertension and high fasting glucose levels and 500 healthy normotensive nondiabetic controls were analyzed in the HEXA Study. As a result, the replication study yielded stronger effects for the covariates than the effects reported in the initial study. For instance, the OR of TB for current smokers was much higher in the HEXA data than the KARE data (OR = 2.72, 95% CI 1.67-4.46 vs. 1.52, 95% CI 1.38-2.07). Categorization of continuous covariates resulted in loss of power; especially the statistical significance disappeared after dichotomization of BUN (Table 1). Among these risk factors, gender, age, and BMI remained in the optimal model after stepwise selection was performed in the KARE Study dataset. Of the ten variables that were available and significant in the univariate models, four variables: gender, BMI, SBP, and DBP, remained in the optimal model after stepwise selection (Table A in S1 File).

Genome-wide association and replication studies
After elimination of the SNPs in linkage disequilibrium (LD, r 2 > 0.8), 374 unique SNPs achieved a significance level of P < 0.001 in the KARE GWAS with adjustment for age, sex, and BMI (data not shown). However, none of these SNPs achieved conventional genome-wide significance (P < 5×10 −8 ). Of the six SNPs with P < 1.0×10 −5 , the rs3825435 SNP located on intron of the FARP1 gene (13q32.2) showed the strongest association with PTB under an additive model (OR = 1.69, 95% CI 1.38-2.07, P = 5.3×10 −7 ) ( Table 2). A total of ten out of the 374 SNPs were associated with PTB at P < 0.05, and two SNPs, rs9682385 near the GADL1 gene (3p23) and rs9787961 near the 5'-UTR of the CTR9 gene (11p15.3), yielded the strongest associations with PTB in the HEXA dataset (P = 0.007 and 0.008, respectively) ( Table 3). The joint analysis of data from the KARE and HEXA cohorts did not improve the observed statistical significance but suggested that an intergenic SNP, rs9365798, located 400 kb downstream of the C6orf118 gene (6q27) was most significantly associated with PTB (OR = 1.59, 95% CI 1.32-1.92, P = 2.1×10 −6 ). Among the SNPs reported in previous studies, eight SNPs located near or in seven genes were evaluated; however, none of these SNPs showed evidence of association with PTB in either the KARE or HEXA Study datasets (Table B in S1 File). Furthermore, none of the SNPs in the three genes reported to be associated with PTB in recent GWASs, HLA class II, ASAP1, and WNT1, yielded a significant allelic association with PTB in the KARE GWAS (Table C in S1 File). One SNP, rs571110, of the six SNPs with P < 1.0×10 −5 identified in the KARE GWAS (Table 2) showed a statistical power of 86% under a recessive model; however, the current study was underpowered for detection of other SNPs at genome-wide significance level, ranging from 24% (rs9381416) to 68% (rs17394081). Among the ten replicated SNPs that are shown in Table 3, the rs9840514 showed a sufficient statistical power under a recessive model in the joint analysis (99%); other SNPs were underpowered, ranging from 12% (rs9682385) to 72% (rs4348560).

In silico functional analysis
Among the ten replicated SNPs, eight SNPs, including six SNPs in strong LD with four genotyped SNPs, located in four genes (r 2 > 0.8) fell into different functional categories: transcription factor binding sites (TFBSs), exonic splicing sites (ESSs), miRNA, and nonsynonymous SNPs (nsSNPs) (Table D in S1 File). Specifically, rs2259633 (CCDC67, 11q21) have been predicted to be an amino acid substitution (Gln > Lys). This SNP was in strong LD with the SNP rs3019221 identified in the current GWAS (r 2 = 1.00).

Risk prediction models for pulmonary tuberculosis
The estimated AUC values of the wGRS model composed of the ten SNPs that were validated in the HEXA dataset were approximately 0.64 within both study populations. Data for the missing rate of household income was as high as 17% in the HEXA dataset; thus, this variable was excluded from subsequent analyses. Inclusion of cigarette smoking and alcohol consumption did not improve the predictive ability of the model (AUC, 0.630 to 0.627 in KARE and 0.693 to 0.689 in HEXA); however, the combined model including age, sex, BMI, SBP, Hb, and smoking leads to a much greater AUC increase in the HEXA study (AUC, 0.80; sensitivity 0.83, specificity 0.63, Fig 1B) than in KARE Study (AUC, 0.69; sensitivity 0.70, specificity 0.59, Fig 1A) ( Table 4). The wGRS model replicated in HEXA Study captured significantly more risk than did any individual SNP, with 3.7-fold increased risk of PTB identified in the highest risk quartile compared with the lowest risk quartile (P = 1.6×10 −5 , Fig 2A). Persons in the highest risk quartile in the combined model were identified to have a higher risk of PTB (OR = 16.98, 95% CI 8.67-33.24, P = 1.4×10 −16 , Fig 2C) than the risk that was identified the model in which only conventional risk factors were taken into account (OR = 8.78, 95% CI 4.77-16.14, P = 2.8×10 −12 , Fig 2B) ( Table E in S1 File).

Discussion
Traditional risk factors such as gender (male), aging, low household income, cigarette smoking, alcohol consumption, and high blood pressure, which is one of the risk factors for DM, were consistently found to be associated with PTB risk [15,21]. Specifically, male, increased age and low BMI were identified as strong predictors of PTB in Korean adult populations. Consistent with previous studies, blood pressure was positively correlated with BMI (β = 0.02, 95% CI 0.02-0.03, P = 6.4×10 −9 and β = 0.04, 95% CI 0.03-0.05, P = 1.5×10 −11 for SBP and DBP, respectively), while BMI was negatively associated and blood pressure was positively associated with PTB in the current study [2,21]. Categorization of two continuous predictors, Hb and BUN, made them insignificant. These results support previous findings that categorization of continuous   risk factors, especially dichotomization, reduces statistical power and leads to incomplete correction for confounding factors [22]. Thus, baseline BMI, SBP, DBP, HB, and BUN were investigated as continuous variables in multivariate logistic analysis. We cannot, however, rule out potential temporal bias in the estimation of causal effects of the nongenetic factors as prevalent cases were analyzed in the current study. Recent GWASs have proposed a few candidate genes for PTB; however, most of these SNPs are located in intronic regions, and subsequent studies have not replicated previous findings, except for the WT1 gene (rs2057178) [5][6][7][8][9][10][11]. None of the associations previously identified between PTB and these gene variants were replicated in the evaluated Korean populations; instead, SNPs located in introns of the SRBD1, CLIC5, OXR1, and FARP1 genes or nearby the genes, IGSF11 and KLHL36, were identified to be associated with PTB at P values of less than 1×10 −5 . Three genes, SRBD1, CLIC5, and KLHL36, have been found to be involved in immune responses and reported to be associated with various phenotypes, such as glaucoma, neurological disorders, and carcinoma [23][24][25]. Dysfunction of the OXR1 gene resulted in increased oxidative stress associated with neurodegenerative disorders [26]. IGSF11-specific activation has been reported to be correlated with HLA-A Ã 0201-restricted cytotoxicy [27]. The role of KLHL36 gene in human complex disease is unknown. Of ten genes replicated in the HEXA Study dataset, three tumor suppressor genes, CDCA7, KIAA1432, and CCDC67, which were functionally annotated with Polyphen-2 may directly or indirectly play key roles in the development of PTB through immune system and inflammatory responses. The CDCA7 and KIAA1432 genes have been reported to be highly overexpressed in lung cancer cells [28,29]. CCDC67 mRNA, which encodes a 604 amino acid protein containing a coiled-coil domain, was associated with down-regulation in various tumor tissues such as lung, cervix, and gastric cancers [30]. The CDH13 polymorphisms play critical roles in metabolic processes and in the development of metabolic-related diseases and lung diseases such as chronic obstructive pulmonary disease [31,32]. In most previous PTB GWASs, nongenetic covariates were not considered; however, we could not replicate the associations reported in these studies, even in our unadjusted model. This lack of replication may be due to considerable genetic, phenotypical and environmental diversity among study populations; more fundamentally, however, PTB may occur as a result of a complex host-pathogen-environment interplay and polygenic pathways including many genetic variants, each with a relatively small effect [33].
Previous studies have recognized that genetic risk models for complex diseases developed based on specific populations are unlikely to be applicable to other populations because the small effect sizes of common alleles tend to vary based on population composition and may even be identified as having opposing effects in different populations, and risk estimates for some variants could be imprecise, as various characteristics of study subjects potentially confound genetic associations [34]. This study provided important insights into the identification of susceptibility genes for PTB. Overall, we improved the discriminatory ability of PTB risk models by including both genetic and conventional risk factors. The risk model that comprised ten replicated SNPs was found to have similar AUCs in both study datasets (0.636 and 0.639, respectively). Inclusion of six conventional risk factors (i.e., age, sex, BMI, SBP, Hb, and cigarette smoking) resulted in a greater improvement in PTB risk prediction in the HEXA Study than in the KARE Study (AUC, 0.80 vs. 0.69), which was likely due to the stronger effects of the nongenetic factors observed in the HEXA dataset (e.g., OR for men, 4.2 vs. 2). However, addition of three risk factors, alcohol consumption, DBP, and BUN did not significantly improve the AUC of the model, wGRS+wnGRS3 (Table 4). The best prediction of individual disease risk is achieved using a large number of predictors; however, given several models with similar predictive accuracy, the simplest model with fewer predictors is considered to be more efficient and cost-effective by collecting relevant minimum information on patients [35].
Although their signature was not found to be sufficiently predictive, a recent African study prospectively identified RNA markers to predict progression to active PTB among latent PTB patients (Sensitivity 53.7%, Specificity 82.8%) [36]. Future studies are warranted to perform validations of the risk models in independent populations with substantially larger samples, and further prospective investigations are needed. Clearly, future risk models with clinical utility in preventing active PTB will need to include susceptibility variants with potent discriminatory ability based on whole genome sequencing and omics data integration; however, these models will likely have limited applicability in the near future due to cost constraints and technical limitations [34,37]. Gene-gene interactions or epistasis are a potential source of missing heritability and new strategies for modeling epistasis to improve the predictability of PTB risk may warrant further examination in future studies.
The current study consisted of 646 PTB patients and 1,813 health Koreans, aged 40 years and above, selected from two population based cohorts. The present study may have been underpowered to detect genetic variants for PTB susceptibility with weak or modest effect sizes at a genome-wide level of significance; hence further studies with large PTB cases are warranted to validate our findings. Given the importance of age at PTB onset, future studies in patients aged younger than 40 years will be important to decipher the genetic basis for PTB arising at young age. HIV infection is the most potent risk factor for developing active PTB [3]. In the current study, however, none of the participant reported being HIV infected and the prevalence of HIV was very low in South Korea, about 4.2 per 100,000 people in 2009 [38]. The genetic risk prediction models developed in this study seemed to have limited predictive power for clinical use; however, the performance of risk prediction based on the results of numerous models suggested important insights regarding ongoing attempts to identify individuals susceptible to PTB in the general population. Firstly, models considering polygenic inheritance and evaluating the role of and interactions between replicated variants influencing the metabolic, inflammatory, and immune pathways in the pathogenesis of infectious disease may better explain an individual's risk of developing PTB. Secondly, we found that gender and BMI were the strongest predictors of PTB and validated the effects of a number of nongenetic risk factors in an independent Korean population. Finally, our strategy of constructing risk prediction models for PTB by combining genetic and nongenetic risk factors may ultimately result in better risk prediction. These findings pave the way for the development of models to identify susceptible individuals and prevent their progression to active PTB.
Supporting information S1 File. Supplementary Tables. Table A, Multivariate logistic regression analysis and associations between baseline characteristics and pulmonary tuberculosis in the KARE and HEXA Studies. Table B, Comparison of association results of previous reported SNPs with the current study. Table C, Association results of the SNPs near or in three reported genes, HLA class II, ASAP1, and WNT1, in the KARE GWAS. Table D, Functional annotation of eight SNPs, including six SNPs in strong linkage disequilibrium (r 2 > 0.8 and average r 2 = 0.95) with four genotyped SNPs identified in the KARE Study (P < 0.001). Table E, Associations between risk score quartiles and tuberculosis in KARE and HEXA Studies. (DOCX) Project (4851-307, KBP-2014-13) from the Korea Center for Disease Control and Prevention, Republic of Korea. We greatly thank the investigators who contributed to generation of KARE and HEXA study data: Bok-Ghee Han, PhD, of the National Institute of Health at Osong, directly contributed to the development and maintenance of the Korean Genome Epidemiology Study (KoGES). Nam Han Cho, PhD, of the Ajou University School of Medicine at Suwon, and Chol Shin, MD, PhD, of the Korea University Ansan Hospital at Ansan, created and developed the Ansung and Ansan cohorts, respectively. Daehee Kang, MD, PhD, of the Seoul National University College of Medicine at Seoul, is the Principal Investigator of the HEXA study. Phenotype information was collected by Haesook Min, MS and Sung Soo Kim, PhD, both of the National Institute of Health at Osong. Young Jin Kim, MS, of the National Institute of Health at Osong, and Yoon Shin Cho, PhD, of Hallym University at Chuncheon, performed genotyping experiments and assured the quality of the genotype data. We thank all the staff who have been involved in the generation of data at each local institute.