Whole genome sequencing to identify predictive markers for the risk of drug-induced interstitial lung disease

Drug-induced interstitial lung disease (DIILD) is a serious side effect of chemotherapy in cancer patients with an extremely high mortality rate. In this study, to identify genetic variants with greater risk of DIILD, we carried out whole genome sequencing (WGS) of germline DNA samples from 26 patients who developed DIILD, and conducted a case-control association study between these 26 cases and general Japanese population controls registered in the integrative Japanese Genome Variation Database (iJGVD) as a screening study. The associations of 42 single nucleotide variants (SNVs) showing P < 0.0001 were further validated using an independent cohort of 18 DIILD cases as a replication study. A further combined analysis of the screening and replication studies showed a possible association of two SNVs, rs35198919 in intron 1 of the chromosome 22 open reading frame 34 (C22orf34) and rs12625311 in intron 1 of the teashirt zinc finger homeobox 2 (TSHZ2), with DIILD (Pcombined = 1.87 × 10−5 and 5.16 × 10−5, respectively). Furthermore, in a subgroup analysis of epidermal growth factor receptor (EGFR)–tyrosine kinase inhibitor (TKI)-induced interstitial lung disease (ILD), we observed seven candidate SNVs that were possibly associated with ILD (P < 0.00001). This is the first study to identify genetic markers for the risk of DIILD using WGS. Collectively, our novel findings indicate that these SNVs may be applicable for predicting the risk of DIILD in patients receiving chemotherapy.


Introduction
Drug-induced interstitial lung disease (DIILD) is a serious side effect of a wide range of drugs, including anticancer [1][2][3], antirheumatic [4][5][6], and cardiology drugs [7][8][9]. While the incidence rate of DIILD is approximately 0.1%-10% for anticancer drugs, DIILD is potentially life-threatening and represents a serious clinical problem [10]. Therefore, tools for predicting PLOS  the risk of DIILD are important in order to avoid life-threatening drug toxicity. Although the precise mechanism underlying DIILD is not well understood, cumulative evidence suggests that direct cytotoxic effects related to the parent drug and/or its metabolites, along with certain immune factors, may be involved [11][12][13]. Serum levels of Krebs von der Lungen-6 (KL-6), surfactant protein (SP) -A and SP-D have commonly been used as biomarkers for ILD in clinical settings [14][15][16]. KL-6 is a high molecular weight glycoprotein secreted by proliferating or damaged type II alveolar pneumocytes and bronchial epithelial cells and has high diagnostic sensitivity (93.9%) and specificity (96.3%) for ILD [17]. Clinical risk factors for developing DIILD vary according to the disease or drug involved; however, certain risk factors such as increased age [18][19][20], preexisting ILD or idiopathic pulmonary fibrosis (IPF) [21,22], smoking [19,22], and poor performance status [19,20] are associated with the development of the toxicity. Although human leukocyte antigen (HLA) alleles have been reported to be associated with methotrexate-induced ILD and gemcitabine plus erlotinib-induced ILD [23,24], a clinically useful biomarker for predicting the risk of DIILD before treatment has yet to be developed.
In this study, we carried out whole genome sequencing (WGS) on genomic DNA from 26 patients who developed DIILD and identified two genomic regions, the chromosome 22 open reading frame 34 (C22orf34) and the tea shirt zinc finger homeobox 2 (TSHZ2) gene on chromosome 20, which may be associated with DIILD. Our results may provide insight into the underlying mechanism of DIILD and aid in the development of a diagnostic system for predicting the risk of this life-threatening adverse event.

Patients
To identify genetic susceptibility loci for DIILD, we used 26 DIILD patients treated with epidermal growth factor receptor (EGFR)-tyrosine kinase inhibitor (TKI), nivolumab, gemcitabine, and trastuzumab for the screening study. Of the 26 patients, 12 who had EGFR-TKIinduced ILD between June 2002 and July 2008 at the National Cancer Center Hospital, and 1 patient who was registered in the BioBank Japan (http://biobankjp.org/) and satisfied the following criteria; a. EGFR-TKI was used, b. patient with DIILD during the therapy, were used for the screening study. In addition, from the registered samples in the National Cancer Center (NCC) biobank, 13 patients who were treated with nivolumab, gemcitabine, or trastuzumab, and were confirmed to have DIILD in clinical data as of May 2016, were used for the screening study (Table 1). In a replication study, 18 patients who were treated with nivolumab or trastuzumab and were newly confirmed to have DIILD in updated clinical data as of November 2018, were used from the registered samples in the NCC biobank (Table 1). DIILD was diagnosed by clinical course, objective findings, and independent computed tomography (CT) findings by radiologists or respiratory oncologists. We used 3,554 general Japanese population registered in the integrative Japanese Genome Variation Database (iJGVD) as study controls. Because DIILD is considered a rare adverse event in the population, using a large population control set (healthy individuals) increases the statistical power even while recognizing the (small) potential for misclassification. Population controls have been successfully used in this fashion in previous studies of rare adverse drug reactions and have yielded significant associations [25][26][27][28]. In addition, 1,138 patients undergoing chemotherapy but never developed DIILD were recruited as tolerant controls from the NCC biobank in a replication analysis for the two possible SNVs (rs35198919 and rs12625311). This study was approved by the Institutional Review Board of the National Cancer Center (Tokyo, Japan) and the Japanese Foundation for Cancer Research (Tokyo, Japan).

WGS
WGS was conducted for 26 DNA samples from patients with DIILD. Genomic DNA was extracted from the whole blood or normal tissues, sheared into approximately 350 bp fragments, and used to make a library with the TruSeq Nano DNA Sample Prep Kit (Illumina, San Diego, CA, USA). Sequencing was performed on an Illumina HiSeq X platform using a paired-end 150 bp configuration.

Statistical analysis
In the screening (WGS) and replication study, a case (patients with DIILD)-control (3,554 general Japanese individuals in iJGVD) [33] association study was performed using the Fisher's exact test in an allele frequency model. The significance levels after Bonferroni correction for multiple testing were P = 7.60 × 10 −8 (0.05/658,234) in the screening study and P = 1.47 × 10 −3 (0.05/34) in the replication study because eight SNVs were highly linked (r 2 = 1) in the 42 candidate SNVs. The odds ratios (ORs) and confidence intervals (CIs) were calculated using the non-risk allele as a reference. All statistical analyses were carried out using R statistical environment version 3.3.1 (http://www.r-project.org/) or the BellCurve for Excel (Social Survey Research Information Co., Ltd., Tokyo, Japan).

WGS and identification of candidate genetic biomarkers for the risk of DIILD
To identify genetic susceptibility loci for DIILD, we performed WGS using the germline DNA of 26 patients who developed DIILD during anticancer treatment ( Table 1). The mean number of SNVs per patient was 4,042,041 with a 35× average depth. Variants were filtered using an in-house program as described in the Materials and methods; a total of 658,234 SNVs were analyzed in the 26 patients. We next conducted a case-control association study between the 26 cases and the 3,554 (maximum) general Japanese population controls registered in the iJGVD to identify SNVs associated with DIILD as a screening study. Although no SNVs reached a significance level of P < 7.60 × 10 −8 (see Materials and methods), we observed 42 SNVs showing P < 0.0001 (3.31 × 10 −6 < P < 9.94 × 10 −5 ) as shown in S1 Table.

Replication and combined study using additional patients
The associations of 42 SNVs showing P < 0.0001 in the screening study were further investigated using an independent cohort consisted of 18 DIILD cases (Table 1, S1 Table). We performed genotyping of the 42 SNVs by the Sanger sequencing method using the above 18 cases; however, we could not observe the SNV that reached the significance level of P = 1.47 × 10 −3 (see Materials and methods) as shown in S1 Table. However, in the combined analysis of the screening and replication studies for the 42 SNVs, we found that the associations of two SNVs, rs35198919 in intron 1 of C22orf34 and rs12625311 in intron 1 of TSHZ2, with DIILD were stronger than those in the screening study (P combined = 1.87 × 10 −5 , odds ratio (OR) = 4.15, 95% CI; 2.36-7.31, and P combined = 5.16 × 10 −5 , OR = 2.43, 95% CI; 1.59-3.71, respectively, Table 2, Fig 1).

Subgroup analysis for the identification of genetic markers for EGFR-TKIinduced ILD
To identify the genetic marker(s) for EGFR-TKI-induced ILD, we next performed an association study using the 13 patients with EGFR-TKI-induced ILD (cases) and the abovementioned controls from the iJGVD as a subgroup analysis. We observed that seven SNVs were possibly associated with EGFR-TKI-induced ILD, with P < 0.00001 as shown in Table 3, however, they showed no or weak associations in the 13 patients with nonEGFR-TKI-induced ILD (P = 0.25-1).

Discussion
The development of predictive marker(s) for the risk of life-threatening toxicities related to cancer treatment, including ILD, based on germline mutation or variation is important in order to provide safe treatment regimens for patients with cancer. This study is the first WGS that attempts to identify common or rare genetic variants associated with DIILD in the Japanese population. The SNV strongly associated with DIILD was not identified in this study; however, we identified two independent loci showing possible associations with DIILD ( Table 2, Fig 1). Moreover, subgroup analysis using patients with EGFR-TKI-induced ILD suggested a possible association of seven SNVs with EGFR-TKI-induced ILD (Table 3). To enlarge the sample size, we further added patients undergoing chemotherapy after cancer diagnosis but never developed DIILD as a control for a replication cohort and analyzed for rs35198919 and rs12625311 (S2 Table). The association results of the replication and combined studies were comparable with those obtained from the original control set (general Japanese population) as shown in Table 2 and S2 Table. The most strongly associated SNV from the combined results of the screening and replication studies was rs35198919 (P combined = 1.87 × 10 −5 , OR = 4.15), located in C22orf34 on chromosome 22q13.33. Although the function of C22orf34 remains to be clarified, the gene expression of C22orf34 in human lung is higher compared with other human tissues according to the Genotype-Tissue Expression (GTEx) Analysis Release V7 [34]. To effectively clarify the mechanism of DIILD, we performed expression quantitative trait locus (cis-eQTL) analysis for rs35198919 and observed a cis-regulatory effect on C22orf34 expression in whole blood and  rs12625311 demonstrated a stronger association in the combined study and is located in intron 1 of TSHZ2, which encodes a transcriptional repressor that suppresses the transcriptional activity of genes involved in tumor development and growth [35,36]. Single nucleotide polymorphisms (SNPs) in TSHZ2 are known to be involved in Stevens-Johnson syndrome/ toxic epidermal necrolysis with severe ocular complications (SJS/TEN with SOCs) in the Japanese population [37]. ILD has been reported to be one of the complications in patients with SJS [38,39], suggesting a common mechanism between SJS and DIILD involving a pathway regulated by TSHZ2. However, further validation studies and functional analyses are needed to evaluate the true association. DIILD, including EGFR-TKI-induced ILD, is more common in the Japanese population [40,41]. The frequency of the risk allele (C) in rs12625311 is higher in the Japanese population (36%) compared with other ethnic populations (South Asian: 19%, African: 18%, American: 3% and European: 2% in the 1000 Genomes Project), suggesting that the difference in the above allele frequency may cause interethnic differences in the frequency of DIILD.  We next examined the association of the two SNVs (rs35198919 and rs12625311) in patients with a history of ILD and patients that had never had ILD before. The above two SNVs showed larger odds ratios in patients with a history of ILD (rs35198919: OR = 5.51, rs12625311; OR = 7.42) compared with those that had never had ILD (rs35198919: OR = 3.91, rs12625311; OR = 2.02). These findings suggested that these two SNVs may be strong predictors for DIILD in patients with a history of ILD (S3 Table). We further carried out a subgroup analysis of the tumor type for the two possible SNVs (rs35198919 and rs12625311). Although none of the tumor types showed smaller P values compared with those in the original analysis, lung and pancreatic cancers showed larger ORs in rs35198919 and rs12625311, and gastric cancer also showed larger ORs in rs12625311 (S4 Table), suggesting that these SNVs may be specific predictors for the risk of DIILD in patients with lung, pancreatic, and gastric cancers.
In the subgroup analysis of EGFR-TKI-induced ILD, we identified seven candidate SNVs that could be predictive markers for the risk of EGFR-TKI-induced ILD (Table 3). Although none of these SNVs showed a significant association with nonEGFR-TKI-induced ILD, four (rs417168, rs442281, rs75399069, and rs10165147) were included in the 42 candidate SNVs identified in the screening study using the 26 patients with DIILD (S1 Table). Moreover, the associations of these four SNVs in the 13 patients with EGFR-TKI-induced ILD were stronger compared with those of the 26 patients with ILD induced by either EGFR-TKI or nonE-GFR-TKI. Hence, these four SNVs may be candidate predictive markers for the risk of EGFR-TKI-induced ILD rather than common predictive markers for the risk of DIILD. Of these four SNVs, rs10165147 is located 31 kb upstream of the polypeptide N-acetylgalactosaminyltransferase 13 (GALNT13) (S2 Fig). GALNT13 is a member of the UDP-N-acetyl-alpha-D-galactosamine:polypeptide N-acetylgalactosaminyltransferase family, which initiate the Olinked glycosylation of mucins [42]. Mucin 5B, oligomeric mucus/gel-forming (MUC5B) has been reported to be one of the candidate genetic markers for ILD [43,44]. Although further analysis is required to clarify the functional importance of GALNT13 in DIILD, the difference in efficiency of the O-linked glycosylation of mucins by GALNT13 may cause interindividual differences in the susceptibility to DIILD. Furthermore, we investigated the association between MUC5B and DIILD and found that rs565375327 showed the lowest P value [screening cohort (26 patients); P = 3.58 × 10 −3 , EGFR-TKI-induced ILD subgroup (13 patients); P = 4.74 × 10 −4 ] of the SNVs in MUC5B, suggesting that MUC5B may be a promising predictive marker for the risk of DIILD, particularly for EGFR-TKI-induced ILD as well as GALNT13.
Surfactant protein A1 (SFTPA1) and surfactant protein A2 (SFTPA2) have been suggested as susceptibility genes for IPF [45,46]. In the current study, SNVs in these candidate genes showed no or weak associations with DIILD (P = 2.29 × 10 −1 -9.99 × 10 −1 ). It is reported that gefitinib inhibits cyclin G-associated kinase (GAK), as well as EGFR signaling [47], and kinase deficiency of GAK causes the neonatal death of mice due to respiratory dysfunction [48]. However, no SNVs in GAK showed a significant association with EGFR-TKI-induced ILD (EGFR-TKI-induced ILD subgroup; P = 5.76 × 10 −2 -9.99 × 10 −1 ) in our study. In addition, although the expression level of Mucin 4, cell surface associated (MUC4) has been reported to be increased in IPF [49], we observed no association between MUC4 and EGFR-TKI-induced ILD (P = 1.74 × 10 −1 -9.99 × 10 −1 ). Further validation analysis using a larger number of patients is required to clarify the effects of these variants on the risk of DIILD.
Previous candidate gene approaches have reported the positive association of HLA alleles with DIILD [23,24]. However, these results are unlikely to be fully replicated, and some of the studies failed to confirm the positive association of DIILD with these HLA alleles. Further validation studies using independent cohorts are needed to confirm the true association of the HLA alleles with DIILD. On the other hand, in our study, we carried out a WGS approach to identify novel candidate variants that might be involved in the unknown mechanism underlying DIILD. Because the sample size used in our study was not large enough, we could not identify the variants that reached the significance level after Bonferroni correction. Similar to the above candidate genes, further validation studies using larger sample size cohorts are needed to confirm the true association of the variants identified in this study with DIILD. To establish a clinically useful prediction system for the risk of DIILD, the integration of the results from the candidate gene and WGS approaches is important. Furthermore, combination analysis of variants identified using the WGS approach with previously identified variants using candidate gene approach, including HLA alleles, may prove to have cumulative effects on the risk of DIILD.
In conclusion, using a WGS approach, we identified two SNVs as possible genetic susceptibility factors for the risk of DIILD. Although the mechanism underlying DIILD should be further investigated by using a larger number of samples and molecular analysis, our results provided a possibility of prediction of the risk of DIILD that could lead to better prognosis and quality of life for patients with cancer.