Signatures of somatic mutations and gene expression from p16INK4A positive head and neck squamous cell carcinomas (HNSCC)

Human papilloma virus (HPV) causes a subset of head and neck squamous cell carcinomas (HNSCC) of the oropharynx. We combined targeted DNA- and genome-wide RNA-sequencing to identify genetic variants and gene expression signatures respectively from patients with HNSCC including oropharyngeal squamous cell carcinomas (OPSCC). DNA and RNA were purified from 35- formalin fixed and paraffin embedded (FFPE) HNSCC tumor samples. Immuno-histochemical evaluation of tumors was performed to determine the expression levels of p16INK4A and classified tumor samples either p16+ or p16-. Using ClearSeq Comprehensive Cancer panel, we examined the distribution of somatic mutations. Somatic single-nucleotide variants (SNV) were called using GATK-Mutect2 (“tumor-only” mode) approach. Using RNA-seq, we identified a catalog of 1,044 and 8 genes as significantly expressed between p16+ and p16-, respectively at FDR 0.05 (5%) and 0.1 (10%). The clinicopathological characteristics of the patients including anatomical site, smoking and survival were analyzed when comparing p16+ and p16- tumors. The majority of tumors (65%) were p16+. Population sequence variant databases, including gnomAD, ExAC, COSMIC and dbSNP, were used to identify the mutational landscape of somatic sequence variants within sequenced genes. Hierarchical clustering of The Cancer Genome Atlas (TCGA) samples based on HPV-status was observed using differentially expressed genes. Using RNA-seq in parallel with targeted DNA-seq, we identified mutational and gene expression signatures characteristic of p16+ and p16- HNSCC. Our gene signatures are consistent with previously published data including TCGA and support the need to further explore the biologic relevance of these alterations in HNSCC.


Introduction
Recent reports have noted an increase in the incidence of head and neck squamous cell carcinomas (HNSCC), which account for approximately 3% of all cancers in the US (https://seer. cancer.gov/). These cancers are more than twice as common among men as among women. Tobacco and alcohol use are the major risk factors for HNSCC (~75%), while infection with high-risk types of human papilloma virus (HPV) accounts for the remainder of HNSCC [1][2][3]. HNSCC, particularly oropharyngeal squamous cell carcinomas (OPSCC), that are caused by infection with cancer causing types of HPV, especially HPV type 16, show distinct clinical, pathological, and molecular features compared with non-HPV related cancer [1,4]. OPSCC and non-OPSCC are two subsites with in HNSCC. Increased expression of p16 INK4A , referred to hereafter as p16+, has been reported to strongly correlate with HPV infection in HNSCC; however, p16-positivity is not limited to HPV-positive tumors and therefore, is not a perfect surrogate for HPV positivity [5,6]. Previous studies have indicated that in HNSCC, HPV-positive patients have better overall survival cure rates than their HPV-negative counterparts [7].
Multiple studies have identified significant enrichment of somatic mutations in genes such as PIK3CA, TP53, CDKN2A, FGFR3, PTEN and RB1 in HPV-associated HNSCC [4,[8][9][10]. Other distinctions between HPV-positive and HPV-negative cancers have been identified by combined analysis of somatic mutations and gene expression profiles [11,12]. Although the ClearSeq Comprehensive Cancer panel can identify driver mutations in the coding region of 151 cancer genes, it may fail to detect structural changes, such as gene fusions, that may be therapeutically relevant. The possibility of therapeutically actionable gene fusions driving HNSCC has not been fully explored. We therefore proposed that a combination of RNA-Seq and targeted DNA-seq to identify a high yield diagnostic tailored for HNSCC. In addition, we aimed to correlate our methods with those used in other published studies in HNSCC. A prior proof-of-principle study using a small sample size confirmed the feasibility of using FFPE samples in HNSCC to that end [13]. Here, we report an analysis of genetic alterations in HNSCC arising in diverse anatomical sites by use of targeted DNA-seq and RNA-seq of FFPE tumor samples mainly from patients with OPSCC.

Somatic mutation analysis and confirmation of recurrent gene mutations among HNSCC individuals
We conducted targeted sequencing of 151 disease-associated genes that have been implicated in studies of a wide range of cancers in 22 p16+ (HPV+), 4 p16-(HPV-), and one p16-unknown (HPV unknown status) tumor samples using Agilent's ClearSeq Comprehensive Cancer panel. Clinical and demographic information are provided in Table 1 and S1 Table. A summary of somatic mutations in p16+ and p16-tumor samples is reported in Figs 1 and 2, and S2 Table. We restricted our analysis to predicted protein coding gene regions and excluded synonymous and noncoding variants from our analysis (S3 Table). We observed C>T substitutions were increased in p16+ as compared to p16-cancers (Fig 2; [4]). The most common mutations were in PIK3CA, KMT2A, and PTEN. We also identified that PIK3CA mutations were localized to E542K, E545K, and H1047L hotspots known to promote activation, with the remaining mutations of uncertain function (Fig 1C and 1D; [14]). Many canonical pathways known to be involved in oncogenic signaling were mutated, including RTK-RAS and PI3K ( [15]; Fig 3). Further, the "druggability" of the 25 genes specifically mutated in the RTK-RAS and PI3K pathways was searched using the Drug Gene Interaction database (DGIdb) [16] and the majority of the genes were found to be in druggable categories (S1 Fig). Among these were 5 genes for which "clinically actionable" compounds are available including the EGFR and PIK3CA related genes.

Gene fusion detection
In HNSCC, TCGA provided the first report on gene fusions [14]. Among these fusion events, a known gene fusion, FGFR3-TACC3, was identified in two p16+ tumors i.e GHN-66 and GHN-80 (S4 Table). Other studies also showed that gene fusions are associated with significant upregulation of genes including EGFR [18]. However, we did not see the upregulation of EGFR gene expression.

Differential expression of genes among HNSCC individuals
We conducted differential expression analysis between p16+ (N = 12) and p16-(N = 5), or p16 + (N = 12) and p16-unknown (N = 7) groups and created a catalog of gene expression alterations (Fig 4A-4C). The number of differentially expressed (DE) genes in p16+ vs p16-comparison was 1,044 and 8, respectively at FDR < 0.05 and 0.01 (total of 16,178 genes assayed). Among a total of 16,253 tested genes, the number of differentially expressed (DE) genes in p16 + vs p16-unknown group comparison was 574 and 58, respectively at FDR < 0.05 and 0.01. To confirm the findings that we obtained using our dataset, we used the TCGA dataset for validation. Specifically, we used comparable tissues of origin including the base of tongue (BOT), tonsil etc (S5 Table). The DEGs identified in our analyses were used to create a heatmap with TCGA expression data (Fig 4D). With this abundance of gene expression data, we have identified a misclassified TCGA sample. The sample TCGA-CV-5971 was identified by TCGA as HPV+ but our gene expression data show that its expression is similar to that of the HPVgroup (Fig 4D & S6 Table). Canonical pathway analysis on the differentially expressed genes (p16+ vs p16-) using ingenuity pathway analysis (IPA) revealed most significant enrichment in pathways related to cell cycle (Fig 5).

p16 expression and survival
Survival analyses were performed using the Kaplan-Meier method with the log-rank test for statistical significance. From the curves, it is evident that the patients, who have negative (or unknown) status for the p16, have more death rate as compared to the patients, who are positive for p16. For all 3 groups, the rate of decrease in survival rate is fairly constant (Fig 6). Moreover, a clear association of Kaplan-Meier survival curve of p16-based molecular subtype p16-negative with p16-unknown group supports clustering of RNA-seq based gene expression As the p16 unknown group consisted of non-oropharyngeal tumors, these results were not surprising as the significance of p16 status in non-oropharyngeal primaries is unclear.

Discussion
Using RNA-seq data in parallel with targeted sequencing of a panel of 151 genes, we demonstrate that gene expression data from FFPE samples can identify gene signatures characteristic of p16 + versus p16-OPSCC which is a subsite of HNSCC. These signatures need to be further explored for their biologic relevance in OPSCC. One caveat is the relatively limited size of the p16-negative group, which poses challenges in achieving accurate estimates of biological variability and statistical robustness in data analysis. Although matched tumor-normal analysis is preferred due to higher precision, we demonstrate that mutation detection without matched normal samples is possible for certain applications. We were also able to confirm the reliability of using routine FFPE specimens to accurately identify possible targeted pathways that were confirmed relevant in p16-positive versus p16-negative tumors in prior reports, and that could have wider future applicability and inform clinical applications in HNSCC.

Research subjects
The study was approved by Emory University Institutional Review Board (IRB). Study participants included individuals with a pathologically documented diagnosis of HNSCC/OPSCC and who had undergone surgical resection of their primary disease or had adequate core biopsies of their primary tumors. A summary of clinical data of the study participants is shown in Table 1 and S1   OPSCC tumor specimens, 10 of which were oral cavity specimens were included in the study. As part of the routine pathology diagnosis of OPSCC, the tumors were evaluated for p16 immunoreactivity. Some non-oropharyngeal tumors had exhausted tissue and could not be checked for p16 (p16 unknown). The FFPE tumor blocks were de-identified according to the IRB-approved protocol. 5μM serial sections of FFPE samples were obtained from each block for DNA and RNA isolation.

DNA and RNA extraction
Genomic DNA and total RNA were isolated using Omega BioTek chemistries according to the manufacturer's protocol from 5 μM sections for each isolate. DNA was quantitated using NanoDrop and Qubit, and RNA was quantitated using NanoDrop and Agilent BioAnalyzer.

RNA-seq library preparation
The RNA from FFPE tumor samples was used to prepare libraries for RNA sequencing. 200 ng of genomic DNA from each specimen was used to generate uniquely barcoded sequencing libraries. Briefly, libraries were prepared using the Ion DNA Barcoding and Ion Xpress Template kits (Life Technologies, Grand Island, NY, USA) according to the manufacturer's protocols. Fragment sizes were assessed on an Agilent Bioanalyzer 2100 using the High Sensitivity

Next-generation comprehensive cancer profiling
Agilent's targeted sequencing ClearSeq Comprehensive Cancer Panel was used. We extracted DNA from selected samples (N = 27) and sequenced each sample for 151 disease-associated genes that have been implicated in studies of a wide range of cancers. All coding exons, exonintron boundaries and selected introns of these genes are targeted. Libraries were paired-end sequenced using the Illumina MiSeq platform with the 75bp paired-end read mode. On average 10 million reads per sample were generated.

Short-read mapping, and somatic variant calling
Reads were trimmed for adaptors and paired-end mapped to the reference human genome (hg19) using BWA-MEM algorithm (version 0.7.12) from GenAligners v3.0 (SureCall 4.0, analysis software from Agilent Technologies). For the detection of somatic single-nucleotide polymorphism (SNP) and insertion and deletion (indel), we used Mutect2 (GATK v4) on tumor samples ("tumor-only" mode) on each sample. Resulting BAM files were sorted using Samtools v1.3 and PCR duplicates were marked using Picard v2.6.0. Realignment was performed following the Genome Analysis Toolkit (GATK) best practices. A base quality recalibration table was generated using GATK BaseRecalibrator with one or more databases of known polymorphic sites (dbSNP138 (hg19) and HAPMAP 3.3 (hg19) from the GATK resource bundle). The appropriate liftover chain file from GATK resource bundle also downloaded (b37tohg19.chain) to convert the genomic coordinates from b17 to hg19 build. To restrict a subset of genomic regions in variant calling while using GATK tools, we have provided a.bed file of exome intervals obtained from Agilent website. VCF containing population allele frequencies (AF > 0.05) of common germline variants from ExAC and an exome intervals (.bed) file were provided to GATK GetPileupSummaries to get a summary of read counts from recalibrated tumor BAM that support a set number of known variant sites. This summary of read counts was used to calculate cross-sample contamination. To call somatic variants from tumor (recalibrated) BAM via local assembly of haplotypes (using Mutect2), a VCF containing population allele frequencies of common and rare alleles from gnomAD to avoid calling any germline variants and a.bed file of exome intervals were used. The Mutect2 called variants were then filtered based on the contamination estimate using GATK FilterMutectCalls and only passed somatic variants were included in the further analysis. Calls that are likely true positives get the PASS label in the FILTER field. This step seemingly applies 14 filters, including contamination.

Comparison of mutations with COSMIC database
To determine if the called variants have been previously detected, we have annotated the derived list of somatic variants using Catalog of Somatic Mutations in Cancer (COSMIC) version 87 [19] (hg19, Coding and Non Coding vcf files combined) with ANNOVAR tool [20]. Mutations not detected by the above methods were considered novel (S3 Table).

Analysis of significantly mutated pathways
We examined the distribution of mutations in known oncogenic signaling pathways derived from TCGA cohorts] 17]. The mutation profiles in p16+ and p16-were shown with the R package "MAFtools" [21]. We also used MAFtools to calculate the mutation rate of each gene.

Data pre-processing and alignment of sequenced reads
Quality assessment of raw FASTQ reads was performed using the FASTQC program. Paired end RNA-Seq samples were mapped to the human genome reference assembly (hg38) with STAR 2.4.2a. Transcript expressions as counts were estimated with HTSeq. Count data were subsequently normalized using TMM (weighted trimmed mean of M-values) with the EdgeR package [22], and converted to counts per million (CPM) and log2-transformed. A filtering process was also performed to exclude the genes without at least 10 counts in 33% of the samples. For gene-fusion detection, we use STAR-Fusion (https://github.com/STAR-Fusion/ STAR-Fusion66). It is a method that accurately identifies fusion transcripts from RNA-seq data and outputs all supporting data discovered during alignment.

Data visualization and clustering analysis
The similarity of the relative gene transcript abundances (using log2-transformed values of CPM) for each of the samples was compared using an unsupervised hierarchical clustering and heatmap analysis in R. Unsupervised hierarchical clustering of the differentially expressed genes was performed using Pearson correlation distance and average clustering.

Differential gene expression analysis
To assess the significance in the difference between p16-positive OPSCC samples and p16-negative OPSCC samples in terms of gene expression, we used the two-sample t-test. Transcripts were considered to be differentially expressed if their FDR < 0.05. Volcano plots of -log10(pvalue) vs. log2 (CPM) fold-change were made to examine these associations in each tissue pair within each individual.

Analysis of significantly enriched pathways
WEB-based Ingenuity Pathway Analysis (IPA) (QIAGEN) was used for pathway analysis to identify pathways that were enriched in all significant gene lists by each of the HPV pairs (p16-positive vs. p16-negative). Only genes that were differentially expressed in sample comparisons (significance at p-value � 0.05) were included in the analysis. Pathways were considered to be significant if the pathway's p-value of enrichment was �0.01.

The Cancer Genome Atlas (TCGA)
TCGA RNA-seq data including 49 HPV-negative and 18 HPV-positive tissue samples in the form of raw gene count (disease ="HNSC" and data.type ="RNASeq2") was downloaded using TCGA2STAT package for R [23] and used to find overlap between TCGA gene expression and our HNSCC data. Both expression data and clinical data were available for 67 HNSCC tumors (from different locations: oral tongue, BOT and tonsil). We used the dataset abbreviations as defined by the TCGA consortium, a public resource that catalogues clinical data and molecular characterizations of many cancer types (as defined in https://tcga-data.nci.nih.gov/docs/ publications/tcga/). In order to validate the differentially expressed gene signature identified using our OPSCC patient samples, we queried transcriptome data for HNSCC downloaded from the TCGA cancer program. Only the patient tumor samples with available clinical and RNA-seq gene expression data were obtained. As our study focuses only on oropharyngeal samples, patient samples obtained from anatomic sites-"Base of Tongue" and "Tonsil" with p16 status annotation were only used in this analysis, resulting in a total of 67 samples (p16-positive OPSCC samples (n = 18) and p16-negative OPSCC samples (n = 49)). The remaining samples were excluded as they were from a different anatomic site or p16 status annotation was not available. Unsupervised hierarchical clustering of the OPSCC TCGA patient samples was performed based on the expression of our differentially expressed gene signature using Pearson correlation distance and average clustering.

Gene fusion analysis
A fusion gene refers to two genes (either in whole or in part) that undergo fusion resulting in a chimeric gene, which is usually caused by reasons such as chromosome translocation and associated problems. STAR-fusion software analysis and the detection of fusion genes were used to identify fusion genes. Fusion gene lists were filtered with STAR-fusion with the default filter method and other parameters.
Supporting information S1 Table. Tumor