A computational approach to distinguish somatic vs. germline origin of genomic alterations from deep sequencing of cancer specimens without a matched normal

A key constraint in genomic testing in oncology is that matched normal specimens are not commonly obtained in clinical practice. Thus, while well-characterized genomic alterations do not require normal tissue for interpretation, a significant number of alterations will be unknown in whether they are germline or somatic, in the absence of a matched normal control. We introduce SGZ (somatic-germline-zygosity), a computational method for predicting somatic vs. germline origin and homozygous vs. heterozygous or sub-clonal state of variants identified from deep massively parallel sequencing (MPS) of cancer specimens. The method does not require a patient matched normal control, enabling broad application in clinical research. SGZ predicts the somatic vs. germline status of each alteration identified by modeling the alteration’s allele frequency (AF), taking into account the tumor content, tumor ploidy, and the local copy number. Accuracy of the prediction depends on the depth of sequencing and copy number model fit, which are achieved in our clinical assay by sequencing to high depth (>500x) using MPS, covering 394 cancer-related genes and over 3,500 genome-wide single nucleotide polymorphisms (SNPs). Calls are made using a statistic based on read depth and local variability of SNP AF. To validate the method, we first evaluated performance on samples from 30 lung and colon cancer patients, where we sequenced tumors and matched normal tissue. We examined predictions for 17 somatic hotspot mutations and 20 common germline SNPs in 20,182 clinical cancer specimens. To assess the impact of stromal admixture, we examined three cell lines, which were titrated with their matched normal to six levels (10–75%). Overall, predictions were made in 85% of cases, with 95–99% of variants predicted correctly, a significantly superior performance compared to a basic approach based on AF alone. We then applied the SGZ method to the COSMIC database of known somatic variants in cancer and found >50 that are in fact more likely to be germline.


Introduction
Characterization of clinical cancer specimens using MPS for targeted treatment selection is becoming increasingly common [1][2][3][4][5]. These procedures generate large numbers of alterations per patient, only a minority of which are potential oncogenic drivers or therapeutically relevant, while the rest are either passenger mutations or germline polymorphisms that are typically functionally benign [6]. Although most therapeutic strategies will focus on variants that have already been well-characterized in the literature, an important opportunity to discover novel oncogenic targets will arise as hundreds of thousands of clinical cancer cases are sequenced. An essential component in this on-going analysis will be prioritizing uncharacterized variants for further follow-up, with somatic versus germline origin determination being a critical step.
The definitive approach to distinguishing somatic mutations from germline variants requires sequencing the tumor alongside a patient matched normal, and subsequently performing a comparison: variants detected in tumor tissue but not present in the normal control are advanced as mutation candidates [7][8][9][10]. However, while it is possible to establish protocols for paired collection in the academic cancer center setting, sequencing a patient matched normal specimen is not part of broad oncology practice, and known cancer drivers targetable by approved or investigational therapies can usually be discerned from tumor sequencing alone from well-established databases such as COSMIC [11]. It is therefore likely that as clinical cancer sequencing becomes routine and wide-spread, matched normal data will not be available for the majority of cases, foreclosing a significant opportunity for novel discovery and potential future therapeutic benefit unless this limitation is overcome. Although methods have been developed to determine germline status by matching to public germline databases like dbSNP or sequence a large number of normal individuals to be surrogates for matched normal [12], such methods cannot adequately account for rare germline variants that are private to a family or small population.
We present SGZ, a novel computational method for predicting the somatic vs. germline origin of variants discovered in cancer specimens (Fig 1) without the need for a matched normal sample. In this method, the cancer specimen is sequenced to high depth (>500x) using MPS, in our implementation by a targeted clinical assay of 394 cancer-related genes and over 3,500 genome-wide SNPs [1]. SGZ leverages the precise measurement of the allele frequencies of variants of interest offered by deep sequencing and a statistical model of genome-wide copy number and tumor/normal admixture to characterize the mutational state of the variants. The method is generally applicable to any MPS sequencing platform where the sequencing depth is sufficient, an accurate model of copy number can be created, and the tumor specimen is sufficiently admixed with the surrounding normal tissue.

Methods
The SGZ method works as follows (Fig 1, S1 Fig): For each sample, we first execute a standard MPS variant analysis pipeline, which aligns unique sequence reads and obtains candidate mutations with associated mutant allele frequencies [1]. The pipeline also creates a genomewide copy number profile based on coverage and allele frequencies at over 3,500 SNPs, which The SGZ pipeline is overviewed in panel A. Key components include fitting an optimal copy number model to the genome-wide log-ratio and minor allele frequency profiles (B), and modeling the expected allele frequencies of germline, somatic, and subclonal somatic mutations (C). In panel B, the dots in the top panel correspond to log ratios at each exon sequenced, segmented and fitted to discrete copy number levels, while the dots on the bottom panel are germline SNP minor allele frequencies. In panel C, examples of expected variant allele frequencies are shown for various scenarios of copy number and tumor purity. The expected allele frequencies are shown for germline (blue), somatic (red), and subclonal somatic (yellow).
https://doi.org/10.1371/journal.pcbi.1005965.g001 is segmented and modeled to estimate the overall tumor purity (p) and ploidy (C), as well as the per segment copy number (C) and minor allele count (M). An overview of our copy number detection approach is shown in Fig 2. To obtain a log-ratio profile of signal intensity, aligned tumor sequence reads are normalized by dividing read depth by that of a processmatched normal control, followed by a GC-content bias correction using Lowess regression. The minor allele frequency (MAF) profile is obtained from the heterozygous genome-wide SNPs. These constitute the observed data for the statistical model.
We fit the log-ratio and MAF data by a statistical model which predicts genome-wide copy number profile. This is done in two steps: First, we use the circular binary segmentation (CBS) algorithm to divide the genome into segments of equal copy number [13]. CBS recursively divides the log-ratio data into individual segments until each segment is homogenous such that no further divisions lead to statistically significant differences in signal level. Depending on the aneuploidy and data quality of one sample, the number of segments can range from 22 to a few hundred. Second, we use the segment-based log-ratio and MAFs to fit the statistical copy number model. Briefly, if S i is a genomic segment, let l i be its length and C i be its copy number. The tumor ploidy C of the sample is C ¼ If r i is the random variable representing the median-normalized log-ratio coverage of all exons within S i , and p is the tumor purity, we model r i as a Normal distribution as: where σ ri is the SD of the log-ratio data in segment S i , reflecting the noise observed. Similarly, if f i random variable represents the MAF of SNPs within segment S i , M i the copy number of minor alleles in S i , distributed as integer 0 M i C i/2 , and σ fi the SD of the SNP data at segment S i , we model f i as: Given this model of the log-ratio and MAF, a two-step approach is used to find the optimal fit of model parameters C i and M i at each segment, as well as the genome-wide model parameters tumor purity (p) and ploidy (C). First, an initial fit is assessed using the JAGS software package [14], a Gibbs sampling based Markov Chain Monte Carlo algorithm. Assuming a sample has 200 segments after segmentation, the total number of parameters is more than 400. Based on our pipeline design, there are around 10,000 observed SNPs and 50,000 observed median-normalized log-ratios. After checking the convergence of all parameters, the following key MCMC parameters are employed: sampling size at 500, burn-in size at 500, thinning Aligned DNA sequences of the tumor specimen are normalized against a process-matched normal, producing log-ratio and minor allele frequency (MAF) data. Next, whole-genome segmentation is performed using a circular binary segmentation (CBS) algorithm on the log-ratio data. Then, a Gibbs sampler fitted copy number model and a grid-based model are fit to the segmented log-ratio and MAF data, producing genome-wide copy number estimates. Finally, the degree of fit of candidate models returned by Gibbs sampling and grid sampling are compared and the optimal model is selected by an automated heuristic.
https://doi.org/10.1371/journal.pcbi.1005965.g002 interval at 1 and 9 chains. Second, a grid-based method is used to find alternative solutions that can also fit the model [15]. The grid-based method evaluates the mean-squared-error between the measured and the expected copy numbers, over a grid of different tumor purity and ploidy. All local minima in the grid are considered as model candidates.
The goodness of fit of all copy number models returned by Gibbs sampling and grid sampling are assessed by the mean squared error (MSE) of log-ratios of all segments and the MSE of MAF of all segments. Gibbs model is the default optimal model and is compared to gridbased copy number models at the first three local minima. A grid-based copy number model is selected as the final optimal model if it is proven to meet all of the following five requirements: 1) the MSE of log-ratios and the MSE of MAFs are reduced; 2) the ploidy is higher than 1.2; 3) the model does not have excessive copy number loss events (CN = 0); 4) it is not a more complex model, which is defined by a higher ploidy delta of at least 1.1 and a lower purity delta of at least 0.1; 5) it is not a high purity sample (predicted purity > 0.99) unless an independent high purity estimation prediction algorithm agrees.
Given the output of the copy number model, each variant's measured AF is compared to expectation at its local segment i: To determine whether a variant is predicted somatic, germline, or ambiguous, we used the following statistical model: Define y: = (n,f), where y is the variant data comprising read depth n and allele frequency f; G: = germline hypothesis; and S: = somatic hypothesis. Given the germline hypothesis G, the probability of y is obtained using the 2-tailed binomial test P(y|G; AF germline ) = Bin (nf, n, AF germline ). Given the somatic hypothesis S, the probability of y is obtained using the 2-tailed binomial test P(y|S;  [15][16][17][18], the SGZ method classifies the tumor zygosity of the mutation (homozygous vs. heterozygous) or predicts that the mutation resides in a minor subclone. A variant in the tumor is classified as homozygous if all copies in the tumor carries the mutant allele (V = C and V6 ¼0), heterozygous if both the reference and the mutant are present (V6 ¼C and V6 ¼0), and not in tumor if the tumor only carries the reference (V = 0, applicable only to germline variants). A somatic mutation is further classified as subclonal if the allele frequency is significantly less than the lowest expected allele frequency.

Method validation datasets
We validated SGZ in three different ways, including (1) specimens with matched normal where the true origin of all alterations was known, (2) cell-line admixtures that modeled the impact of varying tumor purity on the inference, and (3) a large set of clinical FFPE specimens with known somatic drivers where real-world somatic variant recovery was assessed.
The first dataset consisted of 87 specimens from 30 non-small cell lung and colon cancer patients, wherein each patient we studied three samples: the primary tumor, a metastatic site, and adjacent tissue matched normal (S3 Table). All DNA were extracted from fresh-frozen clinical specimen. The primary and metastatic tumors uniformly contained a mixture of malignant and benign epithelial, stromal and inflammatory cells. The gold standard origin of a mutation is established by following the rules: whenever a variant appeared in the matched normal with significant allele frequency, it was considered germline, and tumor-only variants were called somatic. In several samples, low levels of tumor infiltrated into the matched normal sample, hence the sample was found to carry low levels of mutation with allele frequency <10%. These were regarded as somatic mutations. A total of 330 unique variants were detected and evaluated, including 70% (N = 231) germline and 30% (N = 99) somatic according to gold standard. SGZ was applied to the primary and metastatic tumor samples to make somatic/ germline predictions. DNA from the 30 non-small cell lung and colon cancer patients was obtained from Institute Gustave Roussy [19,20].
To assess the robustness of the method to different levels of tumor purity, we examined three cancer cell lines (HCC-1937, HCC-1954, & NCI-H1395), which were titrated with their matched lymphoblastoid normal to six levels of tumor purity (10%, 20%, 30%, 40%, 50%, 75%). A total of 42 unique variants were detected by our pipeline and used for validation (S4 Table).
The third dataset is data from 20,182 clinical FFPE tissue samples sent to Foundation Medicine for FoundationOne testing. The samples were of a variety of tumor types, originating from a wide diversity of cancer centers and community oncology practices. To evaluate SGZ predictions of germline/somatic origin, we examined predictions at 17 known somatic hotspot mutations (e.g. BRAF V600, KRAS G12) and 20 common germline SNPs. To assess SGZ predictions of tumor zygosity, we selected the most frequently mutated somatic variants at oncogenes (BRAF, EGFR, IDH1, KRAS, NRAS, PIK3CA) and tumor suppressor genes (TP53, RB1, PTEN) for analysis. To assess the ability of SGZ to detect subclonal mutations, we examined EGFR T790M, a common subclonal tyrosine kinase inhibitor resistance mutation, in all the non-small cell lung samples in this dataset (N = 69). The FoundationOne assay platform, its clinical application, and an early description of the cohort genomics is described in Frampton et al. 2013.

Method validation results
To demonstrate the importance of taking into account the genome-wide copy number profile for somatic/germline prediction, we applied SGZ to the three validation datasets and compared SGZ to a method that does not take tumor aneuploidy into account (referred to as "basic method"), in which a variant is classified as germline if its mutation frequency is near 50% or 100%, or otherwise is classified as somatic [21] (S1 Method).
SGZ yielded somatic vs. germline calls for 85% of variants in the lung and colon samples, 83% of variants in the three cell lines admixtures, and 84% in the 17 somatic hotspot mutations and 20 common germline variants in the 20,182 Foundation Medicine clinical samples. Among these calls, 95%, 97% and 96% of the somatic mutations were predicted correctly, respectively; 99%, 97%, and 97% of the germline mutations were predicted correctly, respectively. On the contrary, the basic method was able to make predictions for 100% of the variants in the three datasets, but only predicted somatic variants correctly 67%, 92% and 95% of time, and germline variants correctly 87%, 41% and 51% of the time, which are significantly lower than the accuracy of SGZ. Importantly, in none of the three datasets did the basic method achieve satisfactory performance in both germline mutations and somatic mutations simultaneously (Table 1, S1 Table). In the cell line dataset, out of a total number of 184 short variants that are correctly classified by SGZ, 63 short variants are incorrectly classified by the basic method due to local copy number deviation from 2 and/or zygosity deviation from the heterozygous state, strongly suggesting the necessity to take copy number variation into account in order to make accurate predictions (S5 Table,  SGZ had a no-call rate in around 15% of mutations in the lung and colon samples and the Foundation Medicine clinical dataset due to multiple factors (Fig 3), including excessively high tumor purity (>95%), gross deviations of the copy number model at the variant site, observed mutation AF compatible with both somatic and germline AF expectations, and observed AF outside of both somatic and germline expectations.
To characterize the performance of SGZ as a function of tumor purity, we captured the call rate and prediction accuracy of SGZ in each tumor purity level in the cell-line dataset (Table 2). Overall, the call rate is between 75% to 94%, and the prediction accuracy ranges from 88% to 100%. As expected, the call rate at 10% tumor purity is the highest among all dilution levels, due to the large difference between expected germline and somatic AF (S1 Fig). Though not available in this dataset, it is expected that call rate would rapidly drop to 0% as the tumor purity exceeds 90% due to much smaller differences between somatic and germline variant AF expectations. For germline and somatic prediction accuracy, a high level of accuracy is maintained from 10% through 75% tumor purity. It is also expected that the prediction accuracy would drop as tumor content exceeds 90%.  To assess SGZ predictions of tumor zygosity, we examined data from the most frequently mutated somatic variants at oncogenes (BRAF, EGFR, IDH1, KRAS, NRAS, PIK3CA) and tumor suppressor genes (TP53, RB1, PTEN) in the Foundation Medicine clinical sample set. Alterations in oncogenes are expected to be mostly heterozygous, as a single mutation is required for activation, whereas the tumor suppressor genes are expected to have the first functional copy inactivated via mutation, and the second inactivated through loss-of-heterozygosity (LOH) [22]. Our predictions of tumor zygosity are concordant with the roles these genes play: TP53 and RB1 were to determined have >90% of mutations under LOH, while BRAF V600 and KRAS G12 mutations showed no significant enrichment for LOH (Table 3).
To assess our ability to detect subclonal mutations, we examined EGFR T790M in the nonsmall cell lung carcinoma subset of our dataset, where the mutation would be expected to occur in tyrosine kinase inhibitor resistant subclones. Indeed, we discovered a significant enrichment of subclonal somatic vs. somatic heterozygous/homozygous calls for T790M -a ratio of 1.5 (41/28)-compared to a ratio of only 0.24 (1043/4282) for the 17 somatic hotspot mutation sites. The SGZ method was also applied to predict the clonality and zygosity of 12 ESR1 mutations in estrogen receptor positive breast cancer biopsies and determined these ESR1 mutations to be somatically acquired, clonal biomarkers of endocrine resistance [23].

Cancer database application
Despite best efforts of cancer investigators leveraging matched normal controls, germline variants may erroneously get nominated and recorded as somatic mutations in the literature and public catalogues of somatic variation, due to the challenges inherent in large scale sequencing studies and MPS data analysis. These variants may divert scarce resources needed for functional follow-up or potentially mislead therapeutic choice if pursued clinically. It would thus be beneficial if these false somatic variants could be collectively flagged and potential interpretation and application corrected.
To discover mutations that may be misclassified as somatic in a public database, we applied the SGZ method to the 20,182 clinical specimens to identify variants predicted to be germline but annotated in COSMIC (v62) as somatic. To confidently call a variant as germline, we required germline predictions in multiple specimens and obtained p-values using a binomial model of SGZ error rate by tabulating the number of somatic, germline, and ambiguous predictions for each variant and obtaining P(S|n G , n S ), the probability of a variant being somatic, given the n G germline calls and n S somatic calls: Using Bayes rule and a flat prior, i.e P(G) = P(S) = 0.5, PðSjn G ; n S Þ ¼ Pðn G ;n S jSÞ Pðn G ;n S jGÞþPðn G ;n S jSÞ . Multiple observations were modeled as binomial distributions: À e G Þ n G and Pðn G ; n S jSÞ ¼ n G þ n S n S ! e n G S ð1 À e S Þ n S with e G as the single sample germline error rate, i.e. the probability of SGZ making an error given a germline prediction is made and e S as the single sample somatic error rate. We used conservative parameters e G = 0.05 and e S = 0.10, which are higher than the error rates from Fig 2A. P(G|n G , n S ) can be readily obtained as 1 -P(S|n G , n S ). Table 4 shows the top 10 variants present in COSMIC, but strongly predicted by our method as germline. Each variant was predicted germline in at least 45 samples. Although 9 of 10 variants were annotated as confirmed somatic, the number of entries in the database were all low ( 4), reinforcing that the somatic annotation is likely inaccurate. Further evidence of germline origin was that most variants had an entry in dbSNP, though few were classified as common SNPs. The full list of seventy COSMIC variants predicted to be germline is given in S2 Table.

Discussion
The SGZ method leverages deep MPS to predict variant somatic and germline origin without a matched normal control. While the definitive approach for discovery of novel somatic mutations includes sequencing a patient matched normal, SGZ supports functional prioritization and interpretation of alterations discovered on routine testing performed with tumor alone and can enable assay development and clinical research. There are several limitations of the SGZ method. Samples must have adequate admixture of the surrounding normal tissue. The exact mixture requirement depends on sequencing depth, which is considered in our statistical model on a per mutation basis, but given our coverage depth of >500X, somatic versus germline calling generally requires at least 10% normal tissue, i.e. tumor content under 90%. This held for 97% of solid tumor clinical cancer specimens that we sequenced. Zygosity calling required estimated tumor purity to be at least 20%, which held for 76% of our sample set.
Accuracy of the copy number model is likewise important. Minor misfit of the model can lead to an elevated rate of no calls, and major misfit of the model can lead to misclassification of somatic versus germline status, especially when tumor content is high, where the expected difference between germline and somatic allele frequency is reduced (S1 Fig). However, in copy number modeling, a key subset of copy number models is mathematically equivalent in terms of SGZ predictions, which improves robustness (S1 Note). Additionally, in low tumor content samples, the differences in expected allele frequencies between germline and somatic mutations are large, hence more robust to deviations in copy number model.
As shown in S1 Fig, there are also limited scenarios where the differences in expected allele frequencies between germline and somatic mutations are small, hence a prediction cannot be made. For example, a mutation with measured allele frequency of 33% in a genomic region with copy number 3 and LOH is equally likely to be either "germline and not in tumor" or "somatic and homozygous". Finally, there is a scenario in which a subclonal somatic mutation produces an allele frequency equivalent to the expected germline frequency, misclassifying the mutation as germline. In practice, this is rare.
Despite these limitations, SGZ achieved impressive accuracy in validation studies, reaching call rates of 85% and accuracy of 95-99% when applied to individual samples. Importantly, for recurrent mutations (typical focus of cancer studies and clinical research), a key way to improve performance is to apply SGZ to a large cohort of samples, where recurrent mutations can be tabulated in the number of times a germline or somatic prediction is made. This information can be used to annotate variants for which somatic/germline status is unknown or in doubt. When applied over a large cohort of samples, SGZ can aid in the discovery of novel recurrent somatic mutations, along with their clonality and zygosity status [23,24]. Conversely, SGZ can also identify germline variants not yet catalogued in public databases such as dbSNP and flag them from further consideration as cancer drivers. In this report, we describe the computational approach, which may be implemented on any cancer deep sequencing platform with copy number modeling support and provide both the methodology and a detailed worksheet to ease implementation (S1 Fig). We also apply the method to generate a proposed re-annotation of a large number of variants currently believed to be somatic, in the hope of improving the reliability of publicly availably cancer information. Ultimately, the application of SGZ may inform clinical decision making and expand treatment choices for cancer patients.