Systems-level network modeling of Small Cell Lung Cancer subtypes identifies master regulators and destabilizers

Adopting a systems approach, we devise a general workflow to define actionable subtypes in human cancers. Applied to small cell lung cancer (SCLC), the workflow identifies four subtypes based on global gene expression patterns and ontologies. Three correspond to known subtypes (SCLC-A, SCLC-N, and SCLC-Y), while the fourth is a previously undescribed ASCL1+ neuroendocrine variant (NEv2, or SCLC-A2). Tumor deconvolution with subtype gene signatures shows that all of the subtypes are detectable in varying proportions in human and mouse tumors. To understand how multiple stable subtypes can arise within a tumor, we infer a network of transcription factors and develop BooleaBayes, a minimally-constrained Boolean rule-fitting approach. In silico perturbations of the network identify master regulators and destabilizers of its attractors. Specific to NEv2, BooleaBayes predicts ELF3 and NR0B1 as master regulators of the subtype, and TCF3 as a master destabilizer. Since the four subtypes exhibit differential drug sensitivity, with NEv2 consistently least sensitive, these findings may lead to actionable therapeutic strategies that consider SCLC intratumoral heterogeneity. Our systems-level approach should generalize to other cancer types.


Introduction
A major barrier to effective cancer treatment is the occurrence of heterogeneous cell subpopulations that arise within a tumor via genetic or non-genetic mechanisms. Clonal evolution of these subpopulations via plasticity, drug-induced selection, or transdifferentiation allows tumors to evade treatment and relapse in a therapy-resistant manner. Characterizing cancer subpopulations, or subtypes, has led to breakthrough targeted treatments that significantly improve patient outcomes, as in the case of melanoma [1], breast [2], and lung cancer [3]. However, approaches to subtype identification suffer from several limitations, including: i) focus on biomarkers, which frequently possess insufficient resolving power; ii) lack of consideration for the system dynamics of the tumor as a whole; and iii) often phenomenological, rather than mechanistic, explanations for subtype sources.
To accelerate progress in cancer subtype identification, we set out to develop a general systems-level approach that considers underlying molecular mechanisms to generate multiple stable subtypes within a histological cancer type. We focused on gene regulatory networks (GRNs) comprised of key transcription factors (TFs) that could explain the rise, coexistence and possibly transdifferentiation of subtypes. To enumerate subtypes, identify key regulating TFs, and predict reprogramming strategies for these subtypes, we established the workflow shown in Fig 1. Briefly, we use consensus clustering and weighted gene co-expression network analysis on transcriptomics data to identify cancer subtypes distinguished by gene expression signatures, biological ontologies, and drug response. We validate the existence of the subtypes in both human and mouse tumors using CIBERSORT [4] and nearest neighbor analyses, and develop a GRN that can explain the existence of multiple stable subtypes within a tumor. We then introduce BooleaBayes, a Python-based algorithm to infer partially constrained regulatory interactions from steady state gene expression data. Applied to this GRN, BooleaBayes identifies and ranks master regulators and master destabilizers of each subtype. In a nutshell, starting from transcriptomics data, the workflow can predict reprogramming strategies to improve efficacy of treatment.
We applied this workflow to Small Cell Lung Cancer (SCLC), in which genetic aberrations cannot fully distinguish subtypes [5], or point toward a targeted therapy. SCLC treatment has instead remained cytotoxic chemotherapy (a regimen of etoposide and a platinum-based agent such as cisplatin) and radiation for over half a century, despite the fact that virtually all patients relapse after therapy. This has caused SCLC to be designated as a recalcitrant cancer by the Recalcitrant Cancer Research Act of 2012, with five year survival rates less than 5%.
Recently, efforts to stratify patients have led to the recognition of phenotypic heterogeneity within and between SCLC tumors, raising hopes for more efficient subtype-based treatment strategies. As first described over 30 years ago, human SCLC cell lines can be categorized into two broad subtypes: a neuroendocrine (NE) stem-cell-like "classic" subtype and a distinct non-NE "variant" subtype [6][7][8]. In both human and mouse tumors, most cells appear to belong to the NE subtype, corresponding to a pulmonary neuroendocrine cell (PNEC) of origin [9], with high expression of neuroendocrine genes such as ASCL1. However, several groups have found evidence for non-NE variants within SCLC tumors [10][11][12], as well as an NE variant driven by MYC overexpression and NEUROD1 overexpression, instead of ASCL1 [13][14][15]. We previously described SCLC cell lines with hybrid expression of both NE and non-NE markers [16], and proposed they could serve as a resistant niche since drug perturbations shifted most cell lines towards hybrid phenotype(s). Taken together, these observations indicate the existence of a complex landscape of SCLC phenotypes that may form a tumor microenvironment robust to perturbations and treatment [10,17]. However, previous SCLC subtype reports were limited in their ability to systematically identify subtypes and understand plasticity across them. We hypothesized that our workflow, by taking into account the dynamics of underlying GRNs, could make systems-level predictions that more accurately reflect the occurrence and transdifferentiation of coexisting subtypes within SCLC tumors.
Starting from transcriptomics data from SCLC cell lines, our pipeline identifies four transcriptional subtypes, and a GRN that describes their dynamics. Three of these correspond to known ones, the fourth is a previously unreported NE variant (termed NEv2) with reduced sensitivity to drugs. Both CIBERSORT and single-cell validation reveal that in virtually every human and mouse tumor heterogeneity encompasses NEv2, and that all other previously reported subtypes are represented across tumors. BooleaBayes identifies both master regulators and master destabilizers for each subtype, opening the way for treatment strategies that may take SCLC subtypes into account. For instance, we hypothesize that by targeting these master TFs, the NEv2 phenotype may be destabilized leading to increased treatment sensitivity of SCLC tumors.

Data
Human SCLC cell line data was taken from the Broad Institute's CCLE RNA-seq expression data (version from February 14, 2018) at https://portals.broadinstitute.org/ccle/data. 81 human tumors were obtained from George et al. dataset, courtesy of R.K. Thomas [5]. The Myc-high mouse data set [15] was obtained from the NCBI GEO deposited at GSE89660. PDX/CDX mouse data [18] was obtained from the NCBI GEO deposited at GSE110853. Data from the CCLE was subsetted to only include SCLC cell lines (50). Features with consistently low read counts (< 10 in all samples) and non-protein-coding genes were removed. All expression data was then converted to TPM units and log1p normalized by dataset.

Clustering and WGCNA
We applied Consensus Clustering to RNA-seq gene expression data from the 50 SCLC cell lines in the Cancer Cell Line Encyclopedia (CCLE) using the ConsensusClusterPlus R package [19]. Gene expression (TPM) was median-centered prior to clustering, and we clustered the cell lines using a k-means method with a Pearson distance metric for k 2 {2, 12}. Other parameters were set as follows: reps = 1000, pItem = 0.8, pFeature = 0.8, seed = 1. Best k value was chosen heuristically based on the cumulative distributive function plot, tracking plot, delta area plot, and consensus scores.
To identify gene programs driving the distinction between the four SCLC phenotypic clusters, we performed weighted gene co-expression network analysis (WGCNA) on the same RNA-seq data. The softPower threshold was chosen as 12 to generate a signed adjacency matrix from gene expression. A topological overlap matrix (TOM) was created using this adjacency matrix as input. Hierarchical clustering on 1-TOM using method = 'average,' and the function cutTreeDynamic was used to find modules with parameters: deepSplit = 2, pamRespectsDendro = TRUE, minClusterSize = 100. These settings were chosen based on an analysis of module stability and robustness. We then computed an ANOVA comparing the four subtypes for each module. 11 out of 18 modules were able to statistically distinguish between the four clusters with an FDR-adjusted p-value < 0.05.

Gene ontology enrichment analysis
We ran a gene ontology (GO) enrichment analysis on each module that was significantly able to distinguish the phenotypes (11 total). The terms that were significantly enriched in at least one module were culminated into a general list of terms enriched in SCLC, which had 1763 terms. To visualize these terms, we computed a distance matrix between pairs of GO terms using GoSemSim [20], and used this matrix to project the terms into a low dimensional space using t-SNE. t-SNE is a popular method that computes a low-dimensional embedding of data points and seeks to preserve the high-dimensional distance between points in the low-dimensional space.

Drug sensitivity analysis
Our drug sensitivity analysis used the freely available drug screen data from Polley, et al [21]. This screen included 103 Food and Drug Administration-approved oncology agents and 423 investigational agents on 63 human SCLC cell lines and 3 NSCLC lines. We subsetted the data to the 50 CCLE cell lines used for our previous analyses that had defined phenotypes according to Consensus Clustering (above). As described in [21], "the compounds were screened in triplicate at nine concentrations with a 96-hour exposure time using an ATP Lite endpoint." Curve fitting, statistical analysis, and plotting was done by Thunor Web, a web application for managing, visualizing and analyzing high throughput screen (HTS) data developed by our lab at Vanderbilt University [22]. To fit a dose response curve for each drug and cell line pair, we fit percent viability data from the screen to a three parameter log-logistic model. The three parameters are E max , EC 50 , and the Hill coefficient, where each coefficient is constrained to reasonable ranges (E max is constrained to be between 0 and 1, and the Hill coefficient (slope) is constrained to be non-negative.) Activity area (AA) was calculated as described in [23]. Briefly, AA is the area (on a log-transformed x-axis) between y = 1 (no response) and linear extrapolations connecting the average measured response at each concentration. A larger activity area indicates greater drug sensitivity, characterized either by greater potency or greater efficacy, or both. By segregating the cell lines by subtype, we were able to evaluate the relationship between drug response and subtype.

CIBERSORT
CIBERSORT is a computational inference tool developed by Newman et al. at Stanford University [4]. We utilized the interactive user interface of CIBERSORT Jar Version 1.06 at https://cibersort.stanford.edu/runcibersort.php. Gene signatures were automatically determined by the software from a provided sample file with a matching phenotype class file. For this sample file and class file, the RNA-seq data from 50 human SCLC cell lines were inputted with their consensus clustering class labels. For each run, 500 permutations were performed. Relative and absolute modes were run together, with quantile normalization disabled for RNA-seq data, kappa = 999, q-value cut-off = 0.3, and 50-150 barcode genes considered when building the signature matrix.

Single cell RNA sequencing of TKO SCLC tumors
The Tp53, Rb1 and p130 triple-knockout (TKO) SCLC mouse model with the Rosa26membrane-Tomato/membrane-GFP (Rosa26mT/mG) reporter allele has been described (Denny and Yang et al., 2016). Tumors were induced in 8-weeks old TKO; Rosa26mT/mG mice by intratracheal administration of 4x107 PFU of Adeno-CMV-Cre (Baylor College of Medicine, Houston, TX). 7 months after tumor induction, single tumors (one tumor each from two mice) were dissected from the lungs and digested to obtain single cells for FACS as previously described [10,24]. DAPI-negative live cells were sorted using a 100 μm nozzle on a BD FAC-SAria II, spundown and resuspended in PBS with 10% bovine growth serum (Fisher Scientific) at a concentration of 1000 cells/μl. Single cell capture and library generation was performed using the Chromium Single Cell Controller (10x Genomics) and sequencing was performed using the NextSeq High-output kit (Illumina).

Single cell analysis
Cells with � 500 detected genes per cell or with � 10% of transcripts corresponding to mitochondria-encoded genes were removed. Low abundance genes that were detected in less than 10 cells were excluded. Each cell was normalized to a total of 10,000 UMI counts, and log2transformed after the addition of 1. Top 1000 highly variable genes were selected and clusters of cells were identified by the shared nearest neighbor modularity optimization based on the top 10 PCs using the highly variable genes and visualized by t-SNE in R package Seurat [25]. The k-nearest neighbors (kNN) with k = 10 of human cell lines was detected for each mouse cell to predict subtypes of the individual cell based on signature genes of each subtype. If at least 80% nearest human cell line neighbors for a mouse cell belong to one subtype, the mouse cell was assigned to that subtype. Otherwise, the subtype was undetermined (not assigned).

Genomic analysis
Mutational Analysis was performed by MutSigCV V1.2 from the Broad Institute [26]. First, a dataset of merged mutation calls (including coding region, germ-line filtered) from the Broad Cancer Dependency Map [27] was subsetted to only include SCLC cell lines. Background mutation rates were estimated for each gene-category combination based on the observed silent mutations for the gene and non-coding mutations in the surrounding regions. Using a model based on these background mutation rates, significance levels of mutation were determined by comparing the observed mutations in a gene to the expected counts based on the model. MutSigCV was run on the GenePattern server using this mutation table, the territory file for the reference human exome provided for the coverage table file, the default covariate table file (gene.covariates.txt), and the sample dictionary (mutation_type_dictionary_file.txt). Only genes with an FDR-corrected q-value < 0.25 were considered significant.

Gene regulatory network construction
Transcription factors from significantly differentiating gene modules were used as input to network structure construction. A list of connections between these TFs was curated from the literature and added as edges between the TF nodes. The ChEA database of ChIP-seq-derived interactions [28] was queried to add additional connections between TFs that may not have been found in the literature. Our edge list thus comprises the literature-based connections that are verified from ChEA, and additional connections from the ChEA database directly. The network was built using NetworkX software [29].

BooleaBayes inference of logical relationships in the TF network
A Boolean function of N input variables is a function F: {0, 1} N 7 ! {0, 1}. The domain of F is a finite set with 2 N elements, and therefore F is completely specified by a 2 N dimensional vector in the space f0; 1g 2 N in which each component of the vector corresponds to the output of F for one possible input. In general, knowledge of the steady states of F is unlikely to be sufficient to fully constrain all 2 N components of the vector describing F. BooleaBayes is a practical approach that constrains F in the neighborhood of stable fixed points based on steady-state gene expression data. In practice, we let each component of the vector be a continuous realvalue v i 2 [0, 1] reflecting our confidence in the output of F, based on available constraints. Components of F that are near 0.5 will indicate uncertainty about whether the output should be 0 or 1, given the available constraining data.
Given M observations (in our case, each observation is a measurement of gene expression of the N regulator TFs and the target TF in M = 50 cell lines), we want to compute this vector (Ṽ ) describing a probabilistic Boolean function F of N variables. First, we organize the inputoutput relationship as a binary tree with N layers leading to the 2 N leaves, each of which corresponds to a component of vectorṼ . For instance, given two regulators A and B (N = 2), the leaves of the binary tree correspond to the probabilities that ( � A^� B), ( � A^B), (A^� B), and (A^B). Collectively, the observations define an M × N matrix R ¼ ½R 1 ;R 2 ; . . . ;R N � quantifying the input regulator variables (columns) for each observation (rows), as well as an M dimen- . . . ; t M � quantifying the output variable. A Gaussian mixed model is then used to transform the columns of R (regulator variables) and the vectorT into probabilities R 0 andT0 of the variables being OFF or ON in each observation (row).
Let P j ¼ ðR 0 i Þ be a function that quantifies the probability that the input variables of the i th observation belong to the j th leaf of the binary tree. For instance using the example above, the second leaf of the binary tree is ( � A^B). Therefore, P j=2 (A, B) = (1 − A) � B. Note that by this definition, that describes how much the i th observation constrains the j th component ofṼ . Additionally, to avoid overfitting under-determined leaves, we define the uncertaintyŨ ¼ ½u 1 ; u 2 ; . . . ; u 2 N � of each leaf From these, we then define the vectorṼ describing function F as Thus, each component ofṼ is the average of the output target variableT weighted by W, with an additional uncertainty termŨ to avoid overfitting. For leaves j of the binary tree that are poorly constrained by any of the observables, v j � 0.5, indicating maximal uncertainty in the output of F at those leaves. Uncertainty of a leaf j also arises when observations i with large weight w i,j have inconsistent values for t 0 i , such as if t 0 1 ¼ 0 and t 0 2 ¼ 1.

BooleaBayes network simulations
As input to the BooleaBayes simulations, we know the network structure defining regulatory relationships as described above, and regulatory rules (from BooleaBayes algorithm for rule fitting, see above). We first pick a random initial state by choosing a vector where g is the number of genes in the network. We initialize each v i in this vector to be 0 (OFF) or 1 (ON). For in silico perturbation experiments, this initial state is be chosen as one of the pseudo-attractors corresponding to a specific subtype. We then randomly pick one transcription factor x (where each gene has probability 1 g of getting picked) to update. Using the rule for x given by the rule fitting method above, find the column that corresponds to the current state (V) of the parent genes (pa(x)) of x (in other words, find the column corresponding to (V[pa(x)]). This column is defined by the state of the parent nodes of x, and it has some probability associated with it for how likely it is to turn on x when in the state V[pa(x)]. In Results section Transcription factor network defines SCLC phenotypic heterogeneity and reveals master regulators, this probability is visualized as a color (blue to red) at the bottom of the figures. We then flip a weighted coin with this probability, and turn x ON or turn x OFF based on the outcome. This will result in moving to a state 1 step away (if we do indeed flip the expression of x from 0 to 1 or 1 to 0), or in staying in the same state (if we "flip" from 0 to 0 or 1 to 1). The state has now moved to a new state in the state transition graph. If all transition probabilities to neighboring states are less than 0.5, this state is considered a pseudo-attractor. For the in silico perturbation experiments, the number of steps in the shortest path from the current state to the starting state is recorded instead.
See Algorithm 1 for pseudo-code describing the pseudo-attractor finding algorithm, and Algorithm 2 for pseudo-code describing the random-walk stability scores.

Ethics statement
Mice were maintained according to practices prescribed by the NIH (Bethesda, MD) at Stanford's Research Animal Facility, accredited by the Association for the Assessment and Accreditation of Laboratory Animal Care (AAALAC). All animal studies were conducted following approval from the Stanford Animal Care and Use Committee (protocol 13565).

Consensus clustering uncovers new SCLC variant phenotype
Recently, the occurrence of variant SCLC subtypes has been reported [13,15,16]. Given the translational value of defining subtypes, a more global approach to comprehensively define SCLC subytpes would be desirable. To this end, we devised the workflow described in Fig 1. First, we applied Consensus Clustering [30] to RNA-seq gene expression data from the 50 SCLC cell lines in the CCLE [31]. Here, the underlying assumption of bulk RNA-seq data is that single-cells from each cell line belong to one cellular state. While this is consistent with our previous findings that SCLC cell lines resolve into discrete clusters by flow cytometry [16], future cell-line analysis at single-cell resolution may refine our results, and it will be interesting to see to what extent subtype heterogeneity may be reflected within one cell line. We clustered the cell lines using a k-means method with a Pearson distance metric for k 2 {2, 20} (Fig 2A, S1 Table). Consensus Clustering is a method in which multiple k-means clustering partitions have been obtained for each k. Consensus Clustering is then used to determine the consensus (or best) clustering across these multiple runs of the k-means algorithm, in order to determine the number and stability of clusters in the data. Using criterion such as the tracking plot and delta area plot (S1 Fig), both k = 2 and k = 4 gave well defined clusters. Since recent literature suggests that more than two subtypes are necessary to adequately describe SCLC phenotypic heterogeneity, we selected k = 4 for further analyses.
To align the 4-cluster classification ( Fig 2B) with existing literature, we considered wellstudied biomarkers of SCLC heterogeneity across the clusters. Three of the four consensus clusters could be readily matched to subtypes previously identified with 2 to 5 biomarkers: the canonical NE subtype (SCLC-A [14,32]), an NE variant subtype (referred to here as NEv1, corresponding to SCLC-N in [15,32]), and a non-NE variant subtype (SCLC-Y) [10,32,33]. The fourth cluster (referred to here as NEv2) could not be easily resolved using few markers. For example, NEv2 may be considered a tumor propagating cell (TPC, which encompasses the NE, or SCLC-A, subtype) by biomarkers in Jachan et al [24], yet expression of a single biomarker, HES1, would suggest this subtype falls outside of the NE subtype according to Lim et al [10]. Discrepancies like this drove us to consider broader patterns of gene expression, rather than a limited number of biomarkers, to characterize each subtype.

SCLC phenotypes are differentially enriched in diverse biological processes, including drug catabolism and immuno-modulation
To capture global gene expression patterns, we applied Weighted Gene Co-expression Network Analysis (WGCNA) [34] to RNA-seq data from CCLE for multiple SCLC cell lines (See Methods). This analysis revealed 17 groups, or modules, of co-expressed genes. Module eigengenes could be used to describe trends of gene expression levels. 11 of these 17 groups of coexpressed genes could statistically distinguish between the four consensus clusters (Fig 3A, S2  Table, Kruskal-Wallis, FDR-adjusted p < 0.05). To specify the biological processes enriched in each of these 11 gene modules, we performed gene ontology (GO) enrichment analysis using the Consensus Path Database [35], which resulted in a combined total of 1,763 statistically enriched biological processes (Fig 3B, S3 Table). In particular, the turquoise, yellow, salmon, and pink modules are enriched for neuroendocrine differentiation and neurotransmitter secretion and are upregulated in the canonical NE and NEv1 phenotypes, as quantified by Gene Set Enrichment Analysis [36] (Fig 3C, S2 and S3 Figs, S4 Table). PNECs, the presumed cell of origin for SCLC, group into neuroendocrine bodies (NEBs) that are innervated by sensory nerve fibers and secrete neuropeptides that affect responses in the autonomic and/or central nervous system. This is consistent with the NE-and NEv1-enriched GO terms "learning or memory" and "chemical synaptic transmission" (Fig 3C). Evidently, such functions may be maintained in NE and NE-v1 subtypes, as reflected by the frequent occurrence of paraneoplastic syndromes in SCLC patients [37]. In contrast, the blue, black, and purple modules, enriched for cell adhesion and migration processes, are upregulated in the non-NE variant phenotype, in agreement with the observed adherent culture characteristics of these cell lines (S4 Fig,  S1 File).
Genes within the brown, midnight blue, and green modules are upregulated in the NEv2 phenotype ( Fig 3A, S3 Fig). The brown module is enriched for canonical phenotypic features of SCLC, particularly cellular secretion and epithelial differentiation, and accordingly is also upregulated in the canonical NE subtype. The midnight blue module, enriched in nervous system processes and lipid metabolism, is highly expressed in the NEv2 cell lines. The green module is enriched for immune/inflammatory response, wound healing, homeostasis, drug/ xenobiotic metabolism, and cellular response to environmental signals ( Fig 3C). Enrichment of these GO terms suggest that NEv2 cells may more easily adapt to external perturbations such as therapeutic agents, and potentially show higher drug resistance.
To visualize these enriched GO terms in an organized way (Fig 3B), we used the GOSem-Sim package [20] in R to compute a pairwise dissimilarity score, or distance, between all enriched GO terms (FDR-adjusted p < 0.05 in at least one of the 11 significant modules). We then projected all significant GO terms into a 2D space by t-distributed stochastic neighbor embedding (t-SNE) [38]. In this t-SNE projected phenospace, GO terms that describe semantically similar biological processes are placed close to one another and grouped into a general Transcriptional patterns that distinguish the four subtypes are captured in WGCNA analysis. Gene modules by color show patterns of expression that are consistent across the subtypes. Only modules that significantly distinguish between the subtypes are shown (ANOVA, FDR-corrected p-value < 0.05). B. SCLC heterogeneity biological process phenospace. A dissimilarity score between pairs of SCLC-enriched GO terms was calculated using GoSemSim, and used to create a t-SNE projection grouping similar biological processes together. Several distinct clusters of related processes can be seen. C. Module-specific phenospace. A breakdown of where some of the 11 statistically significant WGCNA modules fall in the GO space from A. Of particular interest, the green module, which is highly upregulated in the NEv2 phenotype, is enriched in metabolic ontologies, including drug catabolism and metabolism and xenobotic metabolism. The yellow module is enriched in canonical neuronal features.
https://doi.org/10.1371/journal.pcbi.1007343.g003 biological process. This map allows exploration of biological processes enriched in individual gene modules or subtypes, and it shows that SCLC heterogeneity spans biological processes that can largely be grouped as 1) related to neuronal, endocrine, or epithelial differentiation; 2) metabolism and catabolism; 3) cell-cell adhesion and mobility; and 4) response to environmental stimuli, including immune and inflammatory responses. In summary, the phenospace constructed from global gene expression patterns captures the unique characteristics of each SCLC subtype.

Drug resistance is a feature of the NEv2 subtype
As mentioned previously, the enriched GO terms for drug catabolism and xenobiotic metabolism in the green module suggest that the NEv2 phenotype may have a higher ability to metabolize drugs and therefore exhibit decreased sensitivity. To test this possibility, we reanalyzed drug responses of SCLC cell lines to a panel of 103 FDA-approved oncology agents and 423 investigational agents in the context of our four subtype classification [21]. We used the Activity Area (AA) metric as a measure of the resultant dose-response curves. The drugs were analyzed individually and clustered by common mechanism of action and target type, and the cell lines were grouped by the four subtypes (S5 Table). Across all evaluated drugs, the NEv2 subtype exhibited the most resistance (54% of drugs showed NEv2 as most resistant). In contrast, both NE and NEv1 exhibited less resistance (20%), with non-NE exhibiting the least resistance (6%) (Fig 4A). Taken together, these results confirm that based on the prediction from the gene-regulation based classification, the subtypes exhibit different levels of resistance and that high resistance is a feature of the NEv2 subtype (Fig 3C), even though the subtypes do not show differential response to the standard of care (etoposide and platinum based agents, Fig  4B). In particular, mTOR inhibitors are a class of compounds to which NEv2 was significantly more resistant (Fig 4C). PI3K pathway mutations have previously been implicated as oncogenic targets for SCLC, as about a third of patients show genetic alterations in this pathway [39]. Among the four subtypes, NEv2 is also the least sensitive to AURKA, B, and C inhibitors (AURKA shown); TOPO2 inhibitors; and HSP90 inhibitors (Fig 4C-4F). These results have implications for interpreting expected or observed treatment response with respect to tumor heterogeneity in individual patients.

Neuroendocrine variants are represented in mouse and human SCLC tumors
Next, we investigated whether the four subtypes we detected in human SCLC cell lines are also present in tumors. We used CIBERSORT [4] to generate gene signatures for each of the 4 subtypes. These gene signatures could then deconvolve RNA-seq measurements on 81 SCLC tumors from George et al [5] to specify relative prevalence of each subtype within a single tumor. Consistent with studies of intra-tumoral heterogeneity in other types of cancer, such as breast cancer [40], CIBERSORT predicted that a majority of tumors were comprised of all four subtype signatures, in varying proportions across tumor samples (Fig 5A). We then analyzed the patient/cell-derived xenograft models (PDXs/CDXs) developed by Drapkin et al [18], and the tumors also showed vast differences across samples (Fig 5B). Some of these samples were taken across multiple time points from the same patient, thus enabling us to test both tumor composition and dynamic changes in tumor subpopulations. Three samples taken from patient MGH1514, before and after treatment, indicated a change in tumor composition in favor of the NE phenotype. In contrast, patient MGH1518 showed a reduction of NEv1 and increase in NEv2 after treatment. Similar observations of phenotypic changes over treatment time courses, made in breast cancer patients [40] have recently been explained in the context of a mathematical model of epithelial to mesenchymal transition (EMT) [41]. It is possible that the tumor composition changes we observe may also be explained by molecular level and/or cell population level models [42]. Overall, the high variance in proportions of each subtype suggest a high degree of intertumoral, as well as intratumoral, dynamic heterogeneity and plasticity. We also investigated phenotypic patterns in mouse tumors from two different sources to determine whether human SCLC subtype signatures are conserved across species [15] (Methods). The first mouse model is a triple knockout (Rb1, Tp53, and P130, conditionally deleted in lung cells via a Cre-Lox system, TKO), and these tumors were primarily composed of the NE and NEv2 subtypes (Fig 6Ai). Of note is the lower percentage of non-NE cells found in each tumor in Fig 6Ai; we suspect this is due to a filtering step before sequencing (Methods), as the non-NE subtype signature is more similar to tumor-associated immune cells in an unfiltered tumor population. The second mouse model was generated with Myc overexpression (double knockout of Rb1 and Tp53, and overexpression of Myc) (Fig 6Aii) as reported previously [15]. Using the subtype gene-signatures developed in the previous sections, the Mychigh tumors showed a clear increase in the percentage of NEv1 detected compared to the triple Lastly, we analyzed two primary TKO mouse tumors by single cell RNA-seq (scRNA-seq). For each mouse single cell transcriptome, we computed the k = 10 nearest human cell line neighbors (kNN with k = 10), and assigned each mouse cell to a subtype based on its neighbors (Methods). As shown in Fig 6B, a large portion of the cells from each tumor correspond to one of the four human subtypes. A small non-NE population can be seen in both tumors, and about a third of the assigned cells correspond to the NE subtype (Fig 6B). Tumor 1 has a large proportion of the NEv2 subtype, corresponding to the tumors in Fig 6Ai. In contrast, tumor 2 has a large NEv1 subpopulation, similar to the tumors in Fig 6Aii. Taken together, these results indicate that subtypes in SCLC tumors are conserved across species, and can be categorized either by CIBERSORT analysis of bulk transcriptomics data, or by kNN analysis of scRNA-seq data.

Genetic mutations alone cannot account for four SCLC phenotypes
The evidence above for intratumoral and intertumoral heterogeneity led us to investigate how the subtypes arise and coexist in both human and mouse SCLC tumors. To determine whether mutations could be responsible for defining the four SCLC subtypes, we analyzed genomic data in the Broad Cancer Dependency Map [27]. We subsetted these data to the 50 SCLC cell lines with matching CCLE RNA-seq data, and using MutSigCV [26], we found 29 genes (S5 Fig) mutated more often than expected by chance (using a significance cutoff of q-value � 0.5 to be as inclusive as possible). However, none of these genes were able to separate the four subtypes by mutational status alone (S5 Fig), suggesting alternative sources of heterogeneity.

Transcription factor network defines SCLC phenotypic heterogeneity and reveals master regulators
To investigate these alternative sources of heterogeneity, we hypothesized that different SCLC subtypes emerge from the dynamics of an underlying TF network. We previously identified a TF network that explained NE and non-NE SCLC subtype heterogeneity [16]. That analysis suggested the existence of additional SCLC subtypes but did not specify corresponding attractors [16]. Here, we performed an expanded TF network analysis to find stable attractors for all four SCLC subtypes. As an initial step, we identified putative master TF regulators within each of the 11 WGCNA modules (Fig 3B) based on differential expression. Regulatory interactions between these TFs were extracted from public databases, including ChEA, TRANSFAC, JAS-PAR, and ENCODE, based on evidence of TF-DNA binding sites in the promoter region of a target TF, as well as several sources from the literature. This updated network largely overlaps with, but contains several refinements compared to our previous report [16], as detailed in Fig 7A. Following the procedure we previously used [16], we simulated the network as a dynamic Boolean model. In a Boolean model, the state of the network at a given time, t, is defined by the value of all TFs, each of which can be either ON or OFF. Each TF can be updated to determine its value at time t + 1 based on a Boolean rule, or logical statement, that represents how that TF is regulated by its regulators. For example, if A t+1 = B t or C t , and if A(t) = OFF, B(t) = ON, and C(t) = OFF, then updating A will give A(t + 1) = ON or OFF = ON, so A turns ON. Boolean models are powerful tools to investigate the regulation of attractors corresponding to stable subtypes or oscillators of biological systems. Because precise update rules are often not known, one of two approximations are commonly applied: inhibitory dominant [43], or majority rules [43,44]. Inhibitory dominant rules assert that the target node turns ON only when at least one activator is ON and all inhibitors are OFF, otherwise the target turns OFF (S6C Fig). Majority rules, conversely, assert that the target node turns ON as long as it has more activators ON than inhibitors, otherwise the target turns OFF (S6C Fig). Using the network in Fig 7A, neither of these approximations stabilized attractors corresponding to either the NEv1 or NEv2 phenotypes (S6 Fig), suggesting that the regulatory rules governing stability of these phenotypes are more complex.
To address this complexity, we developed BooleaBayes, a method to infer logical relationships in gene regulatory networks (Fig 7B) using gene expression data, by enhancing confidence in Boolean rules via a Bayes-like adjustment approach (see Methods). BooleaBayes leverages sparsity (the in-degree of any node is much less than the total number of nodes) in the underlying regulatory network structure, allowing it to make partially constrained predictions about regulatory dynamics, even in regions of state space that are not represented in the data. An advantage of this method is that its predictions are intrinsic to the parts of the  Fig 2B. Colors indicate which phenotype each TF is upregulated in. Red edges indicate inhibition (on average), and green activation (on average). B. Probabilistic Boolean rule fits for ASCL1. The target gene is a function of all the genes along the binary tree at the top, while expression of the target is shown on the left. Each row represents one cell line, each column represents one possible input state, and the bottom shows the inferred function F for every possible input state. Color ranges from 0 = blue (highly confident the TF is off), to 0.5 = white, to 1 = red (highly confident the TF is on). Rows are organized by subtype (top to bottom: NE, NEv1, NEv2, non-NE). C. Attractors found with asynchronous updates of Boolean network. 10 attractors were found, and each correlates highly with one of the four defined network in which we are most confident, based only on relationships between each TF and its parent nodes. See Methods for more details about the BooleaBayes algorithm.
BooleaBayes rules, like the Boolean example above, describe when a target node will be ON or OFF, given that state of all its regulators. Unlike the Boolean example, BooleaBayes rules are probabilistic, accounting for the (un)certainty with which we can state a target node will turn ON or OFF. For instance, values of 0 means it is certain the target node will turn OFF, 1 means it is certain the target node will turn ON, 0.5 means it is equally likely the target node will turn ON or OFF. BooleaBayes rules were derived for each node of the SCLC TF network in Fig 7A. As an example, Fig 7B shows  We simulated the dynamics of the Boolean network using a general-asynchronous update scheme [43]. This formed a state transition graph (STG), in which each state is defined by a vector of TF ON/OFF expression values.
Initial states for simulation were chosen near where we expected the four subtypes would be, by discretizing the average TF expression for each of the four SCLC subtypes. We exhaustively searched the neighborhood of each of these starting states out to a distance of 6 TF changes in the STG (Algorithm 1). Within these neighborhoods, we found 10 states for which all 27 TFs had at least a 50% chance of remaining unchanged. Transitions into these states are therefore more likely, and escapes less likely. Thus, these 10 states represent semi-stable states of the network dynamics (Fig 7C), that we refer to as pseudo-attractors. We also searched within neighborhoods of over 200 random initial states (allowing us to search over 200,000 total additional states), and found no additional pseudo-attractors (S6 Table).
These 10 pseudo-attractor states each correlated with, and could be assigned to, one of the 4 SCLC subtypes (stars in Fig 7C); this indicates that the updated network structure and Boolea-Bayes rules are sufficient to capture stability of the four SCLC phenotypes. Having identified network dynamics that closely match experimental observations, we are now in a position to perform in silico (de)stabilizing perturbations and predict the resulting trajectory through the STG for each subtype. We do so in the next section.

In silico SCLC network perturbations identify master regulators and master destabilizers of SCLC phenotypes
To quantify the baseline stability of the steady states in Fig 7C, we performed random walks (algorithm described in Methods) starting from each of the 10 pseudo-attractors. We counted how many steps were required to reach a state more than 4 TFs away (Hamming distance greater than 4) from the starting state (Fig 8, Algorithm 2). We chose a 4-TF neighborhood to account for the models' greatest intra-subtype attractor variability (Fig 7C, Hamming Distance), and therefore movement within the 4-TF neighborhood of a starting state is still considered reflective of that subtype. For each simulation, one TF in the network was either activated (held constant at TF = 1) or silenced (TF = 0) in each of the stable states (Fig 7C). 1000 random walks were executed for each condition. The number of steps in each random walk required to leave the 4-TF neighborhood was recorded in a histogram (Fig 8A). We defined (de)stabilization as the percent decrease or increase of the average number of steps subtypes (represented by stars). Hamming distance between intra-subtype attractors and inter-subtype attractors are shown. The average distance between intra-subtype attractors was around 2.5, while the average distance between subtype attractors was around 16, signifying that the variation between subtypes is much greater that that within a single subtype. Specifics of the probabilistic simulation are described in Results.
https://doi.org/10.1371/journal.pcbi.1007343.g007 Fig 7C will eventually leave the start state due to uncertainty in the Boolean rules. Control histogram shows how many random steps are required to reach a state with a Hamming distance � 4 under the network's natural dynamics. The knockdowns and activations shown here hold expression of the perturbed gene OFF or ON in an attempt to destabilize the start state, such that the random walk leaves the neighborhood sooner. A shift to the left in the perturbed distribution signifies that the perturbation "pushed" the simulated cell out of the 4-TF neighborhood more quickly, and the perturbation thus "destabilized" the subtype represented by the start state. This indeed occurs for several perturbations, shown for NE, NEv1, NEv2, and non-NE starting states. Dotted line shows mean for each histogram, which is under perturbation relative to the unperturbed reference (Fig 8B, S8 Fig). For example, either activation of GATA4 or silencing FOXA1 are predicted to destabilize both the NE and NEv2 subtypes (Fig 8C).

Fig 8. Destabilization of subtypes by perturbation to network. A. Random walks starting from the attractors in
TFs that, when silenced, cause destabilization greater than 20% (score � -0.2) of a specific subtype were considered master regulators of that subtype. They include REST (non-NE) (in agreement with [10]), TEAD4 (non-NE), ISL1 (NE), and TCF4 (NEv1). TEAD4 is downstream mediator of YAP1 action, which has been previously identified as a possible phenotypic modulator in a subset of SCLC cell lines [45]; our analyses suggest that expression of TEAD4 may be able to stabilize this phenotype. Simulations of the network also identified the novel NEv2 master regulators, ELF3 and NR0B1.
Our network simulations further identified TFs that can be considered master "destabilizers", i.e., activation of these TFs destabilizes a specific phenotype by at least 20%. For instance, activation of ELF3 is predicted to destabilize non-NE, while activation of NR0B1 would destabilize both non-NE and NE subtypes. Simulations identified a single master destabilizer for NEv2, the TF TCF3 (Fig 8C). Taken together, our pipeline, which includes subtype identification, drug response analysis, and network simulations, suggests possible therapeutic perturbations that could shift the phenotypic landscape of SCLC into a more sensitive state for treatment.

Discussion
We report a systems approach to understanding SCLC heterogeneity that integrates transcriptional, mutational, and drug-response data. Our findings culminate in discrimination and mechanistic insight into the four SCLC subtypes shown in Table 1: NE, non-NE, NEv1, and NEv2. Within the context of the broader literature on SCLC heterogeneity, we showed that  NE, non-NE, and NEv1 correspond to several subtypes that have been previously reported based on a few markers-more specifically, SCLC-A, SCLC-Y, and SCLC-N, respectively [32]. Significantly, we find that one (NEv2) has not been described previously, and which is nearly indistinguishable from NE based on currently used markers of SCLC heterogeneity. Because this subtype has high expression of ASCL1, it would be SCLC-A2 in the nomenclature used in a recent review [32]. Tumor deconvolution by CIBERSORT and scRNA-seq data indicate that a large proportion of human and mouse tumors comprise more than one subtype (Fig 6). While MutSigCV mutational analysis did not find any significant differences in mutated genes between subtypes (S5 Fig), we cannot rule them out, and future studies may uncover genomic mechanisms interfacing with the epigenetic heterogeneity reported here. Existing examples of epigenetic intratumoral heterogeneity are often framed in the context of transitions between epithelial and mesenchymal differentiation states [41]. Mechanisms underlying SCLC differentiation heterogeneity remain to be defined, and they may include functional states of PNECs, distinct cells of origin, or response to microenvironmental factors. It remains to be seen whether changes in tumor composition after treatment (Fig 5B) are due to phenotypic transitions, selection, or both.
A drug screen across a broad range of compounds indicated that the NEv2 subtype is more resistant than the others, especially in response to AURK and mTOR inhibitors. This is reminiscent of a new hybrid EMT phenotype recently identified as more aggressive and drug resistant than other phenotypes [46][47][48]. More broadly, recent reviews have suggested that both genetic mutations and epigenetic regulators such as histone demethylases may affect intratumoral heterogeneity and modulate therapeutic response [49]. Additionally, non-genetic processes such as phenotypic plasticity and stochastic cell-to-cell variability may enable tumor cells to evade therapy and give rise to drug-tolerant persisters [47,50]. Our findings of differential drug response across subtypes corroborate the significance of these reports. In vivo verification of NEv2's drug resistant properties in mouse and human tumors will be an important next step. Along these lines, it is tempting to speculate that the increase of the NEv2 signature in patient MGH1518 after drug treatment (Fig 5B) may be responsible for acquired drug-resistance in this patient. However, this study was under-powered for our analyses, and more experimental data will be necessary to strengthen this conclusion.
A significant advance of our work is the introduction of BooleaBayes, which we developed to infer mechanistic insights into regulation of the heterogeneous SCLC subtypes. By considering the distinct subtype clusters as attractors of a gene regulatory network, BooleaBayes infers partially-constrained mechanistic models. A key benefit of this method is that it does not overfit data: predictions are based only on parts of the network for which available data can constrain the dynamics, while states that lack constraining data diffuse randomly. With this method we were able to recapitulate known master regulators of SCLC heterogeneity, as well as identify novel ones such as ISL1 (NE) and TEAD4 (non-NE). Additionally, we predict ELF3 and NR0B1 to be master regulators of the NEv2 phenotype. Furthermore, we introduce the label of "master destabilizers" to describe TFs whose activation will destabilize a phenotype. Our method gives a systematic way to rank perturbations that may destabilize a resistant phenotype. We emphasize that BooleaBayes provides an adaptive roadmap to systematically walk the circle from prediction to experimental validation and back. Thus, a prediction from Boo-leaBayes about stabilizers can be experimentally tested, and the outcome will inform a new datapoint to further constrain the BooleaBayes model to refine predictions. For instance, if cells become stuck in a previously unknown partially reprogrammed attractor [51], expression data from these cells may be added to constrain BooleaBayes in a region where no data previously existed. In ongoing work, we are validating these predictions experimentally. We propose that with BooleaBayes, our approach for identifying master TFs could be applicable to other systems, including other cancer types or transcriptionally-regulated diseases. This approach parallels other modeling techniques to identify phenotypic stability factors, such as recent bifurcation analysis on an EMT network [52,53].
While many of the previously reported subtypes of SCLC fit into our framework, a few are noticeably absent, and will require further study. The vasculogenic subtype of SCLC described by Williamson et al. [54] did not emerge from our analysis. We speculate that this may be due to the rarity and/or instability of this CTC-derived phenotype among the available SCLC cell lines. Denny and Yang et al. have previously reported that Nfib amplification promotes metastasis [55]; however, our clusters do not correlate with location of the tumor sample from which each cell line was derived (e.g., primary vs metastatic, S1 Table). Poirier et al., using a similar clustering approach to ours, identified highly methylated SCLC subtypes (M1 and M2) [33], and the correspondence of these subtypes with the ones described here is intriguing and remains to be defined. Finally, Huang et al. recently reported an SCLC subtype defined by expression of POU2F3 [12]. In our data, POU2F3 was highly expressed in only four cell lines and was placed into a small (328 genes, green-yellow) module, and therefore represented only a small signal in our data. Overall, future studies with additional cell line and/or mouse data may be used to further investigate these different subtypes, underscoring that the delineation of four subtypes here does not preclude the existence of others.
To identify subtype clusters and BooleaBayes rules, we rely on the underlying assumption of bulk RNA-seq data that single-cells from each cell line belong to one cellular state. While this is consistent with our previous findings that SCLC cell lines resolve into discrete clusters by flow cytometry [16], future cell-line analysis at single-cell resolution may refine our results, and it will be interesting to see to what extent subtype heterogeneity may be reflected within one cell line.
An advantage of our analyses is that each subtype is defined by distinct co-expressed gene programs, rather than by expression of one or few markers, which has been customary in the field but has limited ability to discriminate between phenotypes (Fig 2B). In addition, these modules participate in unique biological processes (e.g., as identified by GO), such that the systems-level approach presented here may provide a comprehensive framework to understand the regulation and functional consequences of SCLC heterogeneity in a tumor. This understanding can be actionable since SCLC subtypes show differential drug sensitivity; for example, our analyses in this paper support the hypothesis that NEv2 may be a drug-resistant phenotype of SCLC. We propose that identification of drugs targeting the NEv2 subtype, or perturbagens that reprogram it toward less recalcitrant states, may lead to improved treatment outcomes for SCLC patients. R Maximum radius to search from state_init P T Threshold probability used to define pseudo-attractors (we used P T = 0.5 so that pseudoattractors are defined to have out-transitions with probability less than 50%.)

Inputs: Output:
PseudoAttractors-A set of strongly connected components of the state transition graph for which transitions in have probability greater than P T pending {state_init} (A set containing the initial state) STG empty directed graph oob dummy vertex (this will be the "out-of-bounds" vertex-all states in the STG with distance greater than R from the initial state will point to this vertex, preventing them from being detected as attractors) Add Supporting information S1 Fig. Tracking plot, delta area plot, and CDF for consensus clustering and ProgenyClust scores with different values of k. A. Tracking plot shows slight inconsistency for cell lines with k = 3. One of these is assigned to the "light green" cluster in the k = 3 clustering scheme, whereas when k = 4, it returns to the "light blue" cluster. The others are in the "dark blue" cluster when k = 2 and "light blue" cluster when k = 3. Thus k = 3 is not a good fit to the data B. The delta area plot shows the relative change in the area under the CDF curve (Fig 2A). The largest changes in area occur between k = 2 and k = 4, at which point the relative increase in area becomes noticeably smaller (from an increase of 0.5 and 0.4 to 0.15). This suggests that k = 4,5, or 6 are the best clustering that maximizes detail (more, smaller clusters present a more detailed picture than a few large clusters) and minimizes noise (by minimizing average pairwise consensus values and maximizing extreme pairwise consensus values). Average cluster consensus scores (CCS) across clusters show that k = 4 may be the best choice because it has the highest average (k = 4 average CCS: 0.848, k = 5 average CCS: 0.814, k = 6 average CCS: 0.762). C. Consensus Cumulative Distribution Function. This CDF show that k = 4 has more black cells and white cells than gray, suggesting the consensus clusters are more robust. D. ProgenyClust suggests that the optimal number of clusters is 3 or 4, as determined by the maximum of the Gap and Score criteria values. As k = 3 was already ruled out above, we chose k = 4 to describe the heterogeneity between cell lines. Significantly mutated genes across 50 SCLC cell lines, as determined by MutSigCV, ordered by significance. As expected, significant mutations were found in both the Rb1 and Tp53 genes. Inspection by eye shows that no significant mutations can distinguish completely between two or more phenotypes. This suggests an alternate source of heterogeneity, such as transcriptional regulation. Significance cut-off: q (p-value corrected for multiple comparisons) � 0.25. q � 0.5 shown. (EPS)

S6 Fig. Cross-validation of network inference and comparison to other Boolean rule schemes. A.
To test for overfitting, we partitioned the samples into training sets (80% of samples) used to fit BooleaBayes rules, and testing sets (20%), over 70 iterations. We calculated the squared error between BooleaBayes predictions and the true values in both the training and testing sets. The squared error was similar for the training and testing data, suggesting the results are not due to overfitting. B. Correlation between prediction and measured expression is a function of predicted expression from the BooleaBayes rule. For values x < 0.5 along the xaxis, correlation is calculated on a subset of the data for which prediction < x or prediction > 1 − x (e.g., at x = 0.1, correlation is calculated on subset with prediction 0 to 0.1 or 0.9 to 1. When x = 0.5, all data are included. When x = 0.9, data subset is identical to when x = 0.1). In cases where the BooleaBayes predictions are confident whether a gene was ON (near 1) or OFF (near 0), the correlation between the prediction and the actual expression approaches 1. When BooleaBayes is not confident (near 0.5), the correlation decreases. Both (A) and (B) are consistent with our interpretation that BooleaBayes is fitting rules only where the data are well constrained. BooleaBayes on both the training and testing data sets predicts expression with high confidence, especially when the actual expression is near 0 or 1. C. Inhibitory dominant and majority rule update schemes both converge onto the same, single attractor (left-most state). This attractor has a Hamming distance of 4-6 away from both the NE and NEv2 attractors found by BooleaBayes (right 4 states), suggesting that it may represent an "average'' NE attractor. We speculate it is an artifact of the coarse-graining imposed by those rules, as they both assume simplistic TF interactions.   Table. Network simulations starting at random states in the state transition graph. We ran multiple simulations searching various regions of the state transition graph, and did not find any additional attractors. This suggests that BooleaBayes was capable of identifying all of the biologically relevant attractors, but does not preclude the possibility of additional unseen attractors using our network structure and BooleaBayes rules. (CSV)