Molecular characterization of lung adenocarcinoma from Korean patients using next generation sequencing

The treatment of lung adenocarcinoma (LUAD) could benefit from the incorporation of precision medicine. This study aimed to identify cancer-related genetic alterations by next generation sequencing (NGS) in resected LUAD samples from Korean patients and to determine their associations with clinical features. A total of 201 tumors and their matched peripheral blood samples were analyzed by targeted sequencing of 242 genes on the Illumina HiSeq 2500 platform, with a median depth of coverage greater than 500X. One hundred ninety-two tumors were amenable to data analysis. EGFR was the most frequently mutated gene, with mutations in 106 (55%) patients, followed by TP53 (n = 67, 35%) and KRAS (n = 11, 6%). EGFR mutations were markedly more frequent in female patients and in never-smokers. Smokers had a significantly higher tumor mutational burden (TMB) than never-smokers (average 4.84 non-synonymous mutations/megabase [mt/Mb] vs. 2.84 mt/Mb, p = 0.019). Somatic mutations of APC, CTNNB1, and AMER1 in the WNT signaling pathway were associated with shorter disease-free survival (DFS) (median DFS of 89 months in patients without these mutations vs. 27 months in patients with them, p = 0.018). Patients with low TMB, defined as less than 2 mt/Mb, had longer DFS than those with high TMB (p = 0.041). A higher frequency of EGFR mutations and a lower frequency of KRAS mutations were observed in Korean LUAD patients. Profiles of the 242 genes mapped in this study were compared with the whole exome sequencing profiles generated in The Cancer Genome Atlas Lung Adenocarcinoma study. NGS-based diagnostics can provide clinically relevant information, such as mutations or TMB, from readily available formalin-fixed paraffin-embedded tissue.


Introduction
Lung adenocarcinoma (LUAD) is the leading cause of cancer death worldwide. In particular, the incidence of LUAD is increasing in both never-smokers and females [1]. The prognosis and treatment of each patient can differ widely depending on molecular features such as gene expression patterns, copy number alterations, and mutations. Previous genomic studies of LUAD have shown that patients with driver gene alterations, such as those in epidermal growth factor receptor (EGFR) and anaplastic lymphoma kinase (ALK), receive a significant survival benefit from personalized therapy [2,3]. The recent discoveries of C-Ros oncogene 1, receptor tyrosine kinase (ROS1) and Ret proto-oncogene (RET) fusions have raised expectations for the development of new targeted agents in LUAD. In molecularly selected patients, response rates to the appropriate targeted treatment can reach 60-70% or more, compared to the 20-30% response rate in an unselected population treated with conventional chemotherapy [4].
Ethnicity plays a distinct role in the prevalence of some genetic markers [5]. Asian patients with LUAD have longer survival (11.0 vs. 8.9 months, p < 0.001), higher response rates (32.7 vs. 29.8%, p = 0.027), and greater toxicity in response to targeted therapy than Caucasian patients [6]. However, understanding of the genetic features of LUAD in Asian patients is still limited, owing to their underrepresentation in existing public databases. It is therefore worthwhile to investigate whether these ethnic differences are due to genetic variation among ethnic groups. In this study, we investigated these variations in a Korean LUAD cohort. We examined these markers using next generation sequencing (NGS) technology, which can profile genetic changes in tumors, including single-nucleotide variations (SNVs), copy number variations (CNVs), and complex chromosomal rearrangements. NGS technology can provide a fast turnaround time and cost-effective sequencing for large numbers of targets. Given this, we sought to comprehensively characterize the genomic landscape in Korean patients with LUAD using formalin-fixed paraffin-embedded (FFPE) surgical tissues and NGS technology. We aimed to deliver NGS results within a clinically relevant time frame by applying targeted sequencing to routine FFPE samples rather than fresh tissue, an approach that is feasible in clinical practice. Our data may serve as a reference in the development of precision medicine for Korean LUAD patients.

Patients and data collection
A total of 201 LUAD patients with surgically resected primary lung cancer were prospectively enrolled from the Yonsei Cancer Center and Ulsan University Hospital between 2014 and 2016. All patients provided prior written informed consent, and this study was conducted with the approval of the Institutional Review Board of Yonsei University Health System, Severance Hospital. A predesigned data collection form was used to review the patients' electronic medical records for evaluation of clinicopathological characteristics and survival outcomes. Never-smokers were defined as those with a lifetime smoking dose of < 100 cigarettes. Ten tumor tissue sections (at least 10 μm thick) and patient blood samples (5 ml) were collected from prospectively recruited patients to differentiate between germline and somatic genetic aberrations. Genetic analyses performed in routine practice included testing for EGFR mutation and ALK/ROS1 rearrangement. We uploaded the raw NGS data to the National Center for Biotechnology Information Sequence Read Archive (NCBI SRA) for public access (https://trace.ncbi.nlm.nih.gov/Traces/sra/; SRA accession ID SRP200786).

Targeted sequencing of tumors
Genomic DNA was isolated from FFPE samples using the QIAamp DNA FFPE Tissue Kit (Qiagen, Hilden, Germany) for the targeted sequencing of 242 lung cancer-related genes selected based on a literature search (S1 Table) [2,7,8]. The genomic regions of the 242 genes were captured by the customized SureSelectXT Target Enrichment library generation kit (Agilent, Santa Clara, CA, USA) and sequenced on the Illumina HiSeq 2500 platform with a depth of coverage > 500X and a read length of 100 bp.
For quality control of the FFPE-derived data, two cross-validations were performed. First, we confirmed that sequencing of FFPE samples accurately detected EGFR hotspot mutations, which are the main targets of LUAD therapy, by comparing the NGS results with PCR results for the EGFR hotspot mutations in the same samples. Second, we compared our results with publicly available data generated from fresh frozen (FF) tissue. Specifically, we evaluated how similar the overall pattern of alterations in the current study was to that of The Cancer Genome Atlas (TCGA) dataset, which was generated from FF samples [2]. The TCGA dataset comprises a total of 230 patients, the majority of whom (173 patients) were Caucasian [2].

Variant calling and functional annotation
Base quality trimming of short reads from the targeted sequences was performed using Sickle with default settings [9]. Filtered reads were mapped to the human reference genome (GRCh37/hg19) using BWA [10]. All reads with a mapping quality score < 20 were discarded. The aligned reads (BAM file) were further processed with the Genome Analysis Toolkit v3.5 [11], including duplicate marking, local realignment, and base quality score recalibration. Candidate somatic mutations were called by MuTect ver. 1.17 [12] with default parameters. Somatic insertions/deletions were called by Scalpel [13] with default parameters. During somatic mutation calling, FoxoG sequencing artifacts [14] were removed using the Oxidative Damage Detection and Removal Tools (https://github.com/migbro/IGSB_oxoG_tools) to discard variants with skewed read orientation, using a FoxoG parameter of 0.625. Even after FoxoG filtration, nine samples had unexpectedly large numbers of mutations (Z-score of tumor mutational burden (TMB) > 1) and were therefore excluded from further analysis under suspicion of DNA damage. Somatic variants that passed all filters were considered high-confidence variants. CNVs were called using CNVkit [15]. CNVs in genes were defined as follows: deletion, 0 copies; loss, 1 copy; gain, 3 copies; and amplification, ≥ 4 copies. The functional impacts of high-confidence variants were annotated with ANNOVAR software [16], based on their consequences, predicted impacts, and reported allele frequencies in the population. In particular, non-rare variants (minor allele frequency > 0.05 in the gnomAD database [17]) were discarded to remove nonpathogenic variants. Finally, the CIViC and DoCM databases were used for clinical interpretation of variants in cancer. TMB was measured as the number of non-synonymous missense mutations per megabase (Mb) within the targeted capture region. An 'Oncoprint' is a way to visualize overall genomic alteration events using a heatmap.
Samples on the Oncoprint are ordered so that mutually exclusive alterations become apparent. For example, samples carrying alterations in the most frequently altered gene in the cohort are placed at the top left, and samples carrying alterations in the next most frequently altered gene are placed after them. This is a kind of clustering that makes it easy to distinguish co-occurring from mutually exclusive patterns among the major genes. The Oncoprint was drawn using the Bioconductor package 'ComplexHeatmap' [18] in R ver. 3.4. Using the package 'maftools' [19], lollipop plots were drawn for frequently mutated genes to check the recurrence of variants at genomic loci, and somatic interactions between mutually exclusive or co-occurring sets of genes were investigated. Mutations and putative CNVs stored in cBioPortal [20,21] were used for the above genomic analysis. Pathway diagrams were drawn using PathwayMapper [22]. To identify the clinical importance of mutations, we created a mutation classification system based on knowledgebase databases and a computational prediction algorithm. Clinical importance was ranked sequentially using CIViC, CGI (prescription & responsive & alteration match-complete categories) [23], and CRAVAT (criteria: CHASM FDR ≤ 0.1 & TARGET DB only categories). To determine how many ranked mutations were included by each resource, Venn diagrams were drawn using Venny [24].
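The copy-number categories, the population-frequency filter, and the TMB definition given in this section can be illustrated with a short Python sketch. This is not the authors' pipeline code: the function names, the dictionary layout of a variant, the "neutral" label for two copies, and the 1.5 Mb panel size in the example are assumptions made only for illustration (the actual size of the capture region is not stated in the text).

```python
from typing import Iterable, List


def classify_copy_number(copies: int) -> str:
    """Map an integer gene copy number to the categories used in the text."""
    if copies < 0:
        raise ValueError("copy number cannot be negative")
    if copies == 0:
        return "deletion"
    if copies == 1:
        return "loss"
    if copies == 2:
        return "neutral"  # assumption: diploid baseline, not named in the text
    if copies == 3:
        return "gain"
    return "amplification"  # 4 or more copies


def keep_rare(variants: Iterable[dict], max_maf: float = 0.05) -> List[dict]:
    """Discard non-rare variants (population minor allele frequency > max_maf)."""
    # Assumption: variants absent from the population database have maf=None
    # and are kept.
    return [v for v in variants if v.get("maf") is None or v["maf"] <= max_maf]


def tumor_mutational_burden(n_mutations: int, target_size_bp: int) -> float:
    """Counted mutations per megabase of the captured target region."""
    if target_size_bp <= 0:
        raise ValueError("target size must be positive")
    return n_mutations / (target_size_bp / 1_000_000)


if __name__ == "__main__":
    calls = [
        {"gene": "EGFR", "maf": None},
        {"gene": "TP53", "maf": 0.0001},
        {"gene": "GENE_X", "maf": 0.12},  # hypothetical common polymorphism, dropped
    ]
    rare = keep_rare(calls)
    # Hypothetical 1.5 Mb panel size, used only to show the unit conversion.
    print(len(rare), tumor_mutational_burden(len(rare), 1_500_000))
    print([classify_copy_number(c) for c in range(6)])
```

Because TMB is normalized by the size of the capture region, values from a targeted panel are only comparable with whole exome values when the same mutation classes are counted in both.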

Statistical methods
All statistical analyses were performed using R and Python (SciPy and Seaborn packages). Student's t-test or Fisher's exact test was used for group comparisons. Disease-free survival (DFS) was measured from the date of diagnosis to tumor recurrence or death, while overall survival (OS) was measured from the date of diagnosis until the date of death. Patients who were alive and recurrence-free were censored in October 2017. Patients without a known date of death were censored at the time of last follow-up. A log-rank test was used to compare DFS between groups defined by mutations of each gene, signaling pathways, and TMB. Two-sided p-values < 0.05 were considered significant.
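The two tests named above that are used most often in this study, Fisher's exact test for 2x2 group comparisons and the log-rank test for DFS, can be sketched with the Python standard library alone. This is an illustrative re-implementation under stated assumptions, not the authors' code (they report using R and SciPy, where `scipy.stats.fisher_exact` provides the first test). The function names and all example numbers are hypothetical.

```python
import math
from typing import Sequence


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x: int) -> float:
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return math.comb(row1, x) * math.comb(row2, col1 - x) / math.comb(n, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    probs = [prob(x) for x in range(lo, hi + 1)]
    # Sum every table that is at most as probable as the observed one.
    return min(1.0, sum(p for p in probs if p <= p_obs * (1 + 1e-9)))


def logrank_p(times_a: Sequence[float], events_a: Sequence[int],
              times_b: Sequence[float], events_b: Sequence[int]) -> float:
    """Two-group log-rank test p-value (chi-square with 1 degree of freedom).

    events are 1 for recurrence/death and 0 for censored follow-up.
    """
    data = [(t, e, 0) for t, e in zip(times_a, events_a)]
    data += [(t, e, 1) for t, e in zip(times_b, events_b)]
    observed_a = expected_a = variance = 0.0
    for t in sorted({u for u, e, _ in data if e}):  # distinct event times
        at_risk = [g for u, _, g in data if u >= t]
        n, n_a = len(at_risk), at_risk.count(0)
        d = sum(1 for u, e, _ in data if u == t and e)
        d_a = sum(1 for u, e, g in data if u == t and e and g == 0)
        observed_a += d_a
        expected_a += d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (1 - n_a / n) * (n - d) / (n - 1)
    if variance == 0:
        return 1.0
    chi2 = (observed_a - expected_a) ** 2 / variance
    # Chi-square (1 df) survival function: P(X > chi2) = erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


if __name__ == "__main__":
    # Hypothetical counts: mutated / wild-type in two clinical groups.
    print(fisher_exact_two_sided(12, 5, 4, 14))
    # Hypothetical DFS in months for mutated vs. wild-type patients.
    print(logrank_p([6, 9, 14, 20, 27], [1, 1, 1, 1, 0],
                    [30, 45, 60, 72, 89], [1, 0, 1, 0, 0]))
```

The log-rank statistic compares the observed number of events in one group with the number expected if both groups shared the same hazard at every event time, so censored patients contribute to the risk set only up to their last follow-up.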

Clinical characteristics
We enrolled 201 patients with LUAD, and their characteristics are summarized in Tables

Clinical implication with somatic mutation classification system for LUAD
We attempted to implement a precision medicine approach for application in the clinical field. The purpose of precision medicine through NGS is to determine the link between each mutation, an associated targeted therapy, and the clinical outcome in cancer patients. Although there are many clinical annotation databases for somatic mutations, the determination of which mutations have clinical implications differs slightly in each. Hence, a harmonized system for a meta-knowledgebase of clinical interpretations of cancer genomic variants is required to reliably determine clinical implications for as many patients as possible [26]. Of the 192 LUAD patient samples, 121 (63%) were clinically annotated in CIViC [27], which is validated through various publications and clinical trials. Adding annotations from CGI [23] brought the total to 151 samples (79%). The potential targets that still remained were annotated with CRAVAT [28], which involves computational prediction, bringing the total to 155 samples (81%) (Fig 3A). A total of 86 samples were annotated in all three databases, while 3 were annotated only in CIViC, 11 only in CGI, and 4 only in CRAVAT (Fig 3B). The somatic mutations reported in CIViC were the well-known EGFR L858R, exon 19 deletion, and T790M mutations; the G12V/D/C/S/A mutations in KRAS; E542K in PIK3CA; and Y220C and R175H in TP53. Genes with CNVs included CDKN2A, EGFR, and PTEN, among others. Somatic mutations annotated only in CGI were in ARID1A, BRAF, BRCA, STK11, and BAP1, while SETD2 and STK11 were the annotated somatic CNVs. The somatic mutations independently estimated by CRAVAT were H179R and G245C for TP53 and P750R for DNMT3A. Prospective application of this approach should be assessed in a future umbrella trial of lung cancer patients.
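The sequential classification described above, in which each sample is credited to the first resource that annotates it and the Venn diagram reports the samples unique to each resource, can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names and the toy sample identifiers are hypothetical, and the real per-sample annotations are not reproduced here.

```python
from typing import Dict, Iterable, List, Set, Tuple

Tier = Tuple[str, Set[str]]  # (resource name, IDs of samples it annotates)


def sequential_annotation(samples: Iterable[str], tiers: List[Tier]):
    """Assign each sample to the first tier that annotates it.

    Returns the assignment and the cumulative number of annotated samples
    after each tier has been applied.
    """
    samples = list(samples)
    assigned: Dict[str, str] = {}
    cumulative: List[Tuple[str, int]] = []
    for name, annotated in tiers:
        for s in samples:
            if s not in assigned and s in annotated:
                assigned[s] = name
        cumulative.append((name, len(assigned)))
    return assigned, cumulative


def exclusive_to(name: str, tiers: List[Tier]) -> Set[str]:
    """Samples annotated by one resource and by no other (an outer Venn region)."""
    own = dict(tiers)[name]
    others = set().union(*(ids for n, ids in tiers if n != name))
    return own - others


if __name__ == "__main__":
    # Toy data only; order of tiers follows the text: knowledgebases first,
    # computational prediction last.
    tiers = [
        ("CIViC", {"S1", "S2", "S3"}),
        ("CGI", {"S2", "S3", "S4"}),
        ("CRAVAT", {"S3", "S5"}),
    ]
    assigned, cumulative = sequential_annotation(
        ["S1", "S2", "S3", "S4", "S5", "S6"], tiers)
    print(cumulative)
    print({name: sorted(exclusive_to(name, tiers)) for name, _ in tiers})
```

Applying curated knowledgebases before the prediction-based resource means that a sample is attributed to computational prediction only when no curated evidence exists for it.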

Discussion
Our study shows that it is feasible to incorporate NGS into the clinical care of lung cancer patients. In our NGS analysis, the most commonly altered genes (EGFR, TP53, ADGRV1, and SMARCA2) differed slightly from those observed in existing investigations of LUAD in Caucasian patients [2]. In The Cancer Genome Atlas (TCGA), the rate of KRAS mutation in LUAD is 33%, while that of EGFR mutation is only 14% [2]. It should be noted that our cohort has a higher proportion of female patients (56.2%) and non-smokers (62.2%) than TCGA. However, the most prominent difference is the ethnicity of the patients: only eight Asian patients are included in TCGA [2]. We analyzed only ethnic Korean patient samples and can conclude that EGFR mutation is the most common alteration in Koreans (55%), based on the current study and on a rate of 59% among Asian patients with LUAD in previous reports [29]. Because KRAS mutations occur mutually exclusively with EGFR mutations, KRAS mutations are less frequent in ethnic Koreans than in Caucasian patients. Luo et al. published the results of whole genome sequencing of young Asian never-smokers with lung adenocarcinoma [30]. Compared with that study, ours took a more practical approach, using targeted sequencing of FFPE samples. In the study by Luo et al., EGFR was altered by somatic SNVs (sSNVs) in 25% and by somatic CNVs (sCNVs) in 19% of patients, compared with 64% and 15%, respectively, in ours. KRAS sSNVs were found in 11% of patients in the study by Luo et al. but in 2% in ours. Interestingly, several genes showed almost the same frequencies (TP53 sSNV, Luo et al. 28% vs. ours 31%; MYC sCNV, 14% vs. 10%; TERT sCNV, 17% vs. 17%).
Following the discovery of EGFR, ALK, and ROS1 alterations, various studies to characterize LUAD at the molecular level are under way, and our study is part of this effort. Mutations in specific genes affect not only the carcinogenic process itself but also the function of signaling pathways, which can be important mediators of tumorigenesis [31]. The WNT pathway is involved in lung homeostasis and tumor angiogenesis [32]. WNT pathway aberrations are potential therapeutic targets in lung cancer patients [33]. The most studied WNT pathway mutations in cancers include sporadic mutations in the APC and β-catenin genes.
Since APC is part of the degradation scaffold for β-catenin, mutations of APC result in reduced degradation and increased nuclear accumulation of β-catenin, leading to activation of target oncogenes including cyclin D1 and c-Myc [33]. Clinical trials of WNT signaling pathway inhibitors have been conducted in advanced solid tumors (NCT03355066). Our analysis also shows that patients with APC, CTNNB1, and AMER1 mutations in the WNT pathway have shorter DFS than wild-type patients (Fig 4). In addition, we investigated the clinical significance of TMB in patients with LUAD and examined the relationship between TMB and prognosis. TMB is thought to be associated with the amount of tumor neoantigen and to have an important role in predicting the effect of immune checkpoint inhibitors [34]. We found that smokers had a significantly higher TMB than never-smokers (average 4.84 vs. 2.84 mt/Mb, respectively, p = 0.019). Devarakonda et al. annotated a TMB greater than 8 mt/Mb as high and reported a better prognosis in this group [35]. On the other hand, Owada-Ozaki et al. reported that high TMB was associated with shorter OS and DFS in stage I NSCLC [34]. In our data, patients with a TMB < 2 mt/Mb showed longer DFS than patients with a TMB ≥ 7 mt/Mb (p = 0.041) (Fig 5A). Since there are still many conflicting results, further studies are needed to validate TMB as a prognostic marker. Notably, exon 19 deletion was the most common mutation in the low TMB group, which exhibited a good prognosis. It is already known that exon 19 deletion results in a better prognosis than other EGFR mutations [36] (Fig 5B).
For the above analyses to be applied in clinical practice, appropriate use of a meta-knowledgebase of the clinical implications of cancer genomic variants is necessary [37]. A meta-knowledgebase framework for holistic interpretation comprehensively covers hundreds of genes, diseases, and drugs. Hence, we included predicted target mutations from CRAVAT in addition to annotations from CIViC and CGI. Overall, this methodology may expedite the widespread implementation of umbrella trials in lung cancer patients.
Several technical limitations were identified in this study. First, low tumor cellularity in samples, owing to contaminating normal cells, and high levels of intra-tumor heterogeneity make it difficult to call SNVs and CNVs accurately. For this reason, the variant allele frequency was lower than the theoretical value of 0.5 (S6 Fig). Second, targeted sequencing remains a secondary option for the identification of CNVs, to be used when more sensitive methods, such as whole-genome sequencing or specialized array-based methods, are unavailable. As targeted sequencing-based CNV analysis generally performs better in a larger cohort, the size and sustainability of clinical trials should be considered when they are designed. Third, the NGS platform used in this study detected only SNVs and CNVs, although diverse structural variations and epigenetic events exist outside of the captured exons. Active participation of genome analysis experts is strongly recommended to manage these technical issues. Finally, since we analyzed only 242 genes in this study, other factors, including genetic alterations in other genes, epigenetic alterations, and gene and protein expression, may be related to LUAD risk. Recent reports indicate that exposure to outdoor particulate matter (PM10) [38,39] or to indoor secondhand smoke and high-temperature cooking oil fumes [40] is associated with lung cancer. Because information on patients' dwellings and occupations was inadequate, we could not analyze environmental factors.
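The first limitation above, that normal cell contamination pulls the variant allele frequency (VAF) below the theoretical 0.5, follows from simple allele counting. The Python sketch below uses the standard relation for a clonal somatic mutation: the mutant allele fraction equals purity times mutant copies, divided by the purity-weighted total copy number of tumor and normal cells. The function name, parameter names, and example purities are hypothetical, and the sketch ignores subclonality and sequencing noise.

```python
def expected_vaf(purity: float, tumor_copies: int = 2, mutant_copies: int = 1,
                 normal_copies: int = 2) -> float:
    """Expected VAF of a clonal somatic mutation in an impure tumor sample.

    purity        fraction of cells in the sample that are tumor cells
    tumor_copies  total copies of the locus in each tumor cell
    mutant_copies copies carrying the mutation in each tumor cell
    normal_copies copies of the locus in each contaminating normal cell
    """
    if not 0.0 <= purity <= 1.0:
        raise ValueError("purity must be between 0 and 1")
    if tumor_copies < 1 or not 0 <= mutant_copies <= tumor_copies:
        raise ValueError("need tumor_copies >= 1 and 0 <= mutant_copies <= tumor_copies")
    mutant_alleles = purity * mutant_copies
    total_alleles = purity * tumor_copies + (1.0 - purity) * normal_copies
    return mutant_alleles / total_alleles


if __name__ == "__main__":
    # Heterozygous mutation in a diploid region at several hypothetical purities.
    for purity in (1.0, 0.6, 0.4, 0.2):
        print(purity, round(expected_vaf(purity), 3))
```

For a heterozygous mutation in a diploid region the relation reduces to purity divided by two, so the theoretical value of 0.5 is reached only in a sample made up entirely of tumor cells.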

Conclusions
In conclusion, targeted sequencing using NGS can provide clinically relevant mutation profiling information from readily available FFPE tissues. EGFR was the most frequently mutated gene (55%), followed by TP53 (35%) and KRAS (6%). These findings may assist decisions on the use of innovative clinical trials of genotype-matched drugs and may benefit many cancer patients.