MFmap: A semi-supervised generative model matching cell lines to tumours and cancer subtypes

Translating in vitro results from experiments with cancer cell lines to clinical applications requires the selection of appropriate cell line models. Here we present MFmap (model fidelity map), a machine learning model to simultaneously predict the cancer subtype of a cell line and its similarity to an individual tumour sample. The MFmap is a semi-supervised generative model, which compresses high dimensional gene expression, copy number variation and mutation data into cancer subtype informed low dimensional latent representations. The accuracy (test set F1 score >90%) of the MFmap subtype prediction is validated in ten different cancer datasets. We use breast cancer and glioblastoma cohorts as examples to show how subtype specific drug sensitivity can be translated to individual tumour samples. The low dimensional latent representations extracted by MFmap explain known and novel subtype specific features and enable the analysis of cell-state transformations between different subtypes. From a methodological perspective, we report that MFmap is a semi-supervised method which simultaneously achieves good generative and predictive performance and thus opens opportunities in other areas of computational biology.


TCGA bulk tumour subtype annotations
The TCGA bulk tumour subtype annotations were collected from the literature listed in Table 1 (e.g. glioblastoma multiforme and lower grade glioma [10]).

Bulk tumour molecular data processing
Copy number, gene expression and mutation data from bulk tumours for the 10 cancer types listed in Table 1 in the main text were obtained from the Firehose Broad GDAC portal (http://gdac.broadinstitute.org/). Copy number log2 ratio segmented data were input into GISTIC2.0 (version 2.0.23) [11] to obtain thresholded CNV values. These data were dichotomised and stored in a sample-by-gene binary matrix, where 0 indicates a diploid copy number level and 1 indicates one of the following CNV levels: homozygous deletion, heterozygous deletion, low-level amplification or high-level amplification. RSEM-normalised gene expression data were preprocessed as follows: genes with missing values in more than 20% of the samples were excluded. For the remaining genes, missing values were imputed using the impute.knn function from the impute package [12]. The resulting gene expression values were transformed by adding 1.0 to each value and taking the logarithm (pseudo-log transformation).
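The dichotomisation and pseudo-log steps can be sketched in a few lines of numpy (the values and variable names below are illustrative, not from the MFmap code):

```python
import numpy as np

# GISTIC2.0 thresholded CNV values: -2 (homozygous deletion), -1 (heterozygous
# deletion), 0 (diploid), 1 (low-level amplification), 2 (high-level amplification)
gistic = np.array([[0, -2, 1],
                   [2, 0, -1]])

# Dichotomise: 0 stays 0 (diploid), every other level becomes 1
cnv_binary = (gistic != 0).astype(int)

# RSEM-normalised expression values: add 1.0 to each value, then take the logarithm
rsem = np.array([[0.0, 7.0],
                 [99.0, 3.0]])
expr = np.log(rsem + 1.0)  # equivalently np.log1p(rsem)
```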
Mutation data were downloaded as MAF files, and synonymous mutations were excluded. The single-sample MAF files were then aggregated into a binary matrix M, with entry M_{i,j} = 1 indicating a nonsynonymous mutation in gene j of patient i.

Cell line molecular data processing
Cell line data for the same cancer types as in Table 1 in the main text were downloaded from the CCLE portal (https://portals.broadinstitute.org/ccle) and processed in the same way as the bulk tumours, except that Ensembl IDs were mapped to gene symbols using the GENCODE V19 human genome annotation.

Molecular data normalisation
All molecular data input into the MFmap neural network are normalised to the [0, 1] range.
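A minimal numpy sketch of such a per-feature min-max normalisation (the function name is ours, not from the MFmap code):

```python
import numpy as np

def minmax_scale(X):
    """Scale each column of X to the [0, 1] range (constant columns map to 0)."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    rng = np.where(hi > lo, hi - lo, 1.0)  # avoid division by zero
    return (X - lo) / rng

X = np.array([[1.0, 10.0],
              [3.0, 10.0],
              [5.0, 10.0]])
Xs = minmax_scale(X)
```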

Bulk tumor clinical data
Multiple subtype classification schemes for TCGA BRCA, COADREAD and GBMLGG exist, and we decided to follow the most frequently used schemes. COADREAD has four consensus molecular subtypes, CMS1-4, obtained from gene expression data [13]. GBMLGG [10] has seven subtypes (IDHmut-codel or Codel, G-CIMP-low, G-CIMP-high, Classic-like, Mesenchymal-like, LGm6-GBM, PA-like), which were derived from multi-modal data sets including mutation, methylation and gene expression patterns. BRCA subtypes (Basal, Her2, Luminal A, Luminal B and Normal) are based on PAM50 (Prediction Analysis of Microarray using a 50-gene set) [14] features derived from gene expression data and were obtained from [1]. In our study, only Basal, Her2, Luminal A and Luminal B samples were kept. The subtype annotations of all studied cancer types listed in Table 1 in the main text can be found in Table 1.
Cell line drug sensitivity data
CCLE cell line sample annotation data were obtained from the CCLE portal, and drug sensitivity data were downloaded from the Cancer Therapeutics Response Portal (CTRP [15], www.broadinstitute.org/ctrp).

Correcting for batch effects between bulk tumour and cell line gene expression
Batch effects between bulk tumours and cell lines were corrected using the ComBat function from the R package sva [16]. The sample type (bulk tumour or cell line) was used as the batch variable.

Handling class imbalance
Cancer subtype datasets, especially GBMLGG, are imbalanced, which is a major cause of overfitting in machine learning models. To overcome this issue, we applied the Synthetic Minority Over-sampling Technique (SMOTE) [17], implemented in the SmoteClassif function of the UBL [18] R package, to oversample the minority subtypes until they are balanced with the majority subtypes.
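The core SMOTE idea, synthesising minority samples by interpolating between a minority sample and one of its nearest minority neighbours, can be sketched in numpy (a minimal illustration, not the SmoteClassif implementation):

```python
import numpy as np

def smote_oversample(X_min, n_new, k=3, seed=None):
    """Minimal SMOTE sketch: create n_new synthetic minority samples by linear
    interpolation between a minority sample and a random one of its k nearest
    minority neighbours."""
    rng = np.random.default_rng(seed)
    new = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        # distances from sample i to all minority samples
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        nn = np.argsort(d)[1:k + 1]          # k nearest neighbours (excluding self)
        j = rng.choice(nn)
        lam = rng.random()                    # interpolation weight in [0, 1)
        new.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.vstack(new)

# Toy minority class: four points in the unit square
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X_syn = smote_oversample(X_min, n_new=4, seed=0)
```

Because each synthetic point is a convex combination of two minority samples, it stays inside the convex hull of the minority class.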

Propagating copy number and mutation profiles on the protein-protein interaction network
Mutation profiles were mapped to the human cancer network curated by pyNBS [19], which aggregates different interaction types, as the protein-protein interaction (PPI) network source.
The propagation function in pyNBS, with the optimal signal diffusion distance parameter setting α = 0.7, was applied to perform network propagation, described by the iterative process F_{t+1} = α F_t A + (1 − α) F_0, where F_0 is the binary mutation matrix, A is the (degree-normalised) adjacency matrix of the PPI network, and α is the parameter controlling the diffusion distance. The same approach was used for the dichotomised copy number profiles. The smoothed copy number and mutation profiles (sample-by-gene matrices) were combined into a single DNA-view matrix.
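The iteration can be sketched in numpy on a toy network (column-normalisation of the adjacency matrix is one common choice; pyNBS applies the same scheme to the full PPI network):

```python
import numpy as np

def propagate(F0, A, alpha=0.7, tol=1e-9, max_iter=1000):
    """Network propagation sketch: iterate F_{t+1} = alpha * F_t @ W + (1 - alpha) * F0
    with W the degree-normalised adjacency matrix, until convergence."""
    W = A / A.sum(axis=0, keepdims=True)      # column-normalise the adjacency matrix
    F = F0.astype(float).copy()
    for _ in range(max_iter):
        F_next = alpha * F @ W + (1 - alpha) * F0
        if np.abs(F_next - F).max() < tol:
            return F_next
        F = F_next
    return F

# Toy 3-gene path network (gene0 - gene1 - gene2) and one sample mutated in gene 0
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
F0 = np.array([[1.0, 0.0, 0.0]])
F = propagate(F0, A, alpha=0.7)
```

The mutation signal remains strongest at the mutated gene and decays with network distance.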

Pathway activity scores
Gene expression data for MSigDB [20] gene sets were input to ssGSEA [21], which outputs sample-wise pathway activity scores.

Biological annotation of latent representations
For a given component z_k of the latent representation, we selected all pathways whose activities are significantly associated with z_k (association was measured using the information coefficient, at an FDR threshold of 5%). Some pathways are associated with more than one latent component. To resolve these ambiguities, Pearson correlation coefficients were estimated as well. The selection of pathways was further refined using the fold change of z_k intensities between the subtypes with the two highest and the two lowest latent representation values. Fold change and significance were estimated with a linear model implemented in the limma package [22]; only pathways with an FDR-adjusted p-value below 0.05 were kept for human review.

Generating MFmap visualisations
Let Z = (z_{ij}) be the n × h matrix of latent representations, where z_{ij} is the value of latent component j ∈ {1, . . . , h} for patient i ∈ {1, . . . , n}. The MFmap visualisation proceeds in three steps. Step 1: Generate the coordinates (C_{1j}, C_{2j}) of the MFmap prominent component nodes j by projecting the columns of matrix Z to 2-D space. The projection is chosen to preserve the distances between samples (Sammon projection). A Delaunay triangulation on the 2-D coordinates is used to connect neighbouring nodes.
Step 2: Project the samples onto the MFmap layout. The coordinates (S_{1i}, S_{2i}) of sample i ∈ {1, . . . , n} are given as a weighted combination of the node coordinates, where α is a tuneable hyper-parameter controlling the distance between nodes.
Step 3: Generate MFmap contour lines and background colours from the estimated sample density of each subtype. The density is obtained from a Gaussian kernel density estimate on the coordinate lattice corresponding to the projected MFmap latent representations. The probability estimate of the subtype with the highest probability is then used for the MFmap visualisation.
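The projection and triangulation of Step 1 can be sketched as follows; classical MDS is used here as a simple distance-preserving stand-in for the Sammon projection, and all dimensions and data are illustrative:

```python
import numpy as np
from scipy.spatial import Delaunay

def classical_mds(X, dim=2):
    """Project the rows of X to `dim`-D while approximately preserving pairwise
    distances (classical MDS, a stand-in for the Sammon projection)."""
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # squared distance matrix
    n = len(X)
    J = np.eye(n) - np.ones((n, n)) / n                   # centring matrix
    B = -0.5 * J @ D2 @ J                                 # double-centred Gram matrix
    vals, vecs = np.linalg.eigh(B)
    idx = np.argsort(vals)[::-1][:dim]                    # top eigenpairs
    return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0.0))

# Step 1: 2-D node coordinates from the columns of the latent matrix Z
rng = np.random.default_rng(0)
Z = rng.random((20, 5))                 # n = 20 samples, h = 5 latent components
nodes = classical_mds(Z.T, dim=2)       # one node per latent component

# Connect neighbouring nodes with a Delaunay triangulation
tri = Delaunay(nodes)
```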

In-silico perturbation analysis
The MFmap encoder network
The encoder receives a gene expression profile x_rna and a propagated DNA alteration profile x_dna as input. Two hidden layers first encode the two inputs into two 1024-dimensional latent vectors. The second hidden layer concatenates these two 1024-dimensional vectors, which are then further encoded into a 512-dimensional vector by the third hidden layer. The third hidden layer is fully connected to two output layers representing the mean µ and the log-transformed variance log σ² of q_φ(z|x). The dimension of the latent representation z is set to the number of subtypes of the cancer. The encoder is given by layer equations of the form h = ReLU(Wx), where the W are trainable parameters of the encoder network. The rectified linear unit (ReLU) activation is defined as ReLU(x) = max(x, 0), and ⊕ denotes the concatenation of two matrices.
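A toy numpy sketch of this forward pass (illustrative layer sizes, random weights, no bias terms or batch normalisation; the actual MFmap implementation uses PyTorch):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def encode(x_rna, x_dna, params):
    """Sketch of the MFmap encoder: each view is encoded separately, the two
    codes are concatenated, compressed once more, and mapped to the mean and
    log-variance of q(z|x)."""
    h_rna = relu(x_rna @ params["W_rna"])     # first hidden layer, RNA view
    h_dna = relu(x_dna @ params["W_dna"])     # first hidden layer, DNA view
    h_cat = relu(np.concatenate([h_rna, h_dna], axis=1) @ params["W_cat"])
    mu = h_cat @ params["W_mu"]               # mean of q(z|x)
    logvar = h_cat @ params["W_var"]          # log sigma^2 of q(z|x)
    return mu, logvar

rng = np.random.default_rng(0)
params = {"W_rna": rng.normal(size=(100, 16)),
          "W_dna": rng.normal(size=(100, 16)),
          "W_cat": rng.normal(size=(32, 8)),
          "W_mu": rng.normal(size=(8, 4)),    # latent dim 4 = number of subtypes
          "W_var": rng.normal(size=(8, 4))}
mu, logvar = encode(rng.random((3, 100)), rng.random((3, 100)), params)
```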

The MFmap decoder network
The MFmap decoder layer structure is symmetric to that of the encoder, with the latent representation z as input and the reconstructed data x̂_rna and x̂_dna as outputs; the W are trainable parameters of the decoder network. The sigmoid activation is defined as sigmoid(x) = 1/(1 + e^{−x}). The subsetting operations selecting the top and bottom j elements of x are denoted by ⊤_j(x) and ⊥_j(x), respectively.
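A toy numpy sketch of the decoder's forward pass (illustrative sizes and random weights; a single shared hidden layer stands in for the full symmetric stack):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode(z, params):
    """Sketch of the MFmap decoder: expand z through a hidden layer and emit
    sigmoid-activated reconstructions of the two input views."""
    h = np.maximum(z @ params["W_h"], 0.0)            # shared hidden layer (ReLU)
    x_rna_hat = sigmoid(h @ params["W_out_rna"])      # reconstructed expression view
    x_dna_hat = sigmoid(h @ params["W_out_dna"])      # reconstructed DNA view
    return x_rna_hat, x_dna_hat

rng = np.random.default_rng(1)
params = {"W_h": rng.normal(size=(4, 8)),
          "W_out_rna": rng.normal(size=(8, 100)),
          "W_out_dna": rng.normal(size=(8, 100))}
x_rna_hat, x_dna_hat = decode(rng.normal(size=(3, 4)), params)
```

The sigmoid output keeps each reconstructed value in (0, 1), matching the [0, 1]-normalised inputs.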

The MFmap classifier network
The classifier serves as a regulariser controlling the ability to learn a latent representation that is cancer subtype relevant. Since cancer subtypes are clinically and biologically meaningful, a higher classification accuracy encourages the neural network to extract features that are essential for patient stratification and more interpretable. The classifier is a neural network with three fully connected layers; it takes the expectation µ (see Eq (6)) of the distribution p(z|x), with x = (x_rna, x_dna), as input and outputs a probability for each of the h subtypes; the W are trainable parameters of the classifier. The softmax activation is defined as softMax(x, y = c) = e^{x_c} / Σ_{j=1}^{C} e^{x_j}, given the input and label pair (x, y = c).
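A numpy sketch of such a classifier head (illustrative sizes and random weights; in MFmap the classifier is trained jointly with the encoder and decoder):

```python
import numpy as np

def softmax(x):
    """Row-wise softmax: softmax(x)_c = exp(x_c) / sum_j exp(x_j)."""
    e = np.exp(x - x.max(axis=1, keepdims=True))   # shift for numerical stability
    return e / e.sum(axis=1, keepdims=True)

def classify(mu, params):
    """Sketch of the MFmap classifier: three fully connected layers mapping the
    encoder mean mu to a probability for each of the h subtypes."""
    h1 = np.maximum(mu @ params["W1"], 0.0)
    h2 = np.maximum(h1 @ params["W2"], 0.0)
    return softmax(h2 @ params["W3"])

rng = np.random.default_rng(2)
params = {"W1": rng.normal(size=(4, 16)),
          "W2": rng.normal(size=(16, 8)),
          "W3": rng.normal(size=(8, 4))}   # h = 4 subtypes
probs = classify(rng.normal(size=(3, 4)), params)
```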

Details of the MFmap hidden layers
Each hidden layer is a block containing a fully connected layer, a batch normalisation layer and an activation.

Details of the MFmap loss function
The MFmap loss function in Eq (10) in the main text can, under the specific distributional assumptions, be rewritten as

S(x, y) = Σ_{v ∈ {rna, dna}} L_recon(x_v, x̂_v) + D_KL(N(µ, σ) ∥ N(0, I)) + L_CE(y, ŷ) + H(ŷ), (x, y) ∈ D_tu. (17)

In Eq (17), µ and σ are the estimated parameters of the MFmap encoder, ŷ denotes the subtype label probability predicted by the MFmap classifier, and L_recon, D_KL, L_CE and H denote the reconstruction loss, KL divergence, cross entropy loss and entropy, respectively. We next detail each term. The reconstruction loss L_recon is quantified by the binary cross entropy between the input data x and the reconstructed data x̂. The classification loss for sample i is implemented as a cross entropy loss, and the entropy of sample i is the entropy of its predicted subtype distribution. In Eq (19) and Eq (20), ŷ = c(µ(x); θ), where c(·) = (c_1(·), . . . , c_h(·)) is the classification model parametrised by θ, and softMax_k is the softmax function for subtype label k.
In fact, for bulk tumours the MFmap loss function can be viewed as a basic VAE loss plus an entropy-regularised classification loss, which is equivalent to the modified soft bootstrapping loss proposed by [23] for positive-unlabelled learning, extended to the multi-class case: for each pair of feature and subtype label (x^(i), y^(i)), it updates the prediction objective based on the currently predicted subtype probability. Gradients of the MFmap loss function can be computed with the reparametrisation trick [24], which involves sampling a random vector ε ∼ N(0, I) and transforming it to z = µ + σ ⊙ ε. This ensures that z ∼ q(z|x) = N(z | µ(x), diag(σ)²).
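The individual loss terms and the reparametrisation step can be sketched in numpy (unbatched toy versions; the function names are ours, not from the MFmap code):

```python
import numpy as np

def bce(x, x_hat, eps=1e-7):
    """Binary cross-entropy reconstruction loss, summed over features."""
    x_hat = np.clip(x_hat, eps, 1.0 - eps)
    return -(x * np.log(x_hat) + (1.0 - x) * np.log(1.0 - x_hat)).sum()

def kl_standard_normal(mu, logvar):
    """KL divergence between N(mu, diag(sigma^2)) and N(0, I)."""
    return 0.5 * (np.exp(logvar) + mu ** 2 - 1.0 - logvar).sum()

def cross_entropy(y_onehot, y_hat, eps=1e-7):
    """Classification loss: cross entropy between label and prediction."""
    return -(y_onehot * np.log(np.clip(y_hat, eps, 1.0))).sum()

def entropy(y_hat, eps=1e-7):
    """Entropy of the predicted subtype distribution."""
    return -(y_hat * np.log(np.clip(y_hat, eps, 1.0))).sum()

def reparametrise(mu, logvar, rng):
    """Reparametrisation trick: z = mu + sigma * eps with eps ~ N(0, I)."""
    eps = rng.normal(size=mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

mu, logvar = np.zeros(4), np.zeros(4)
z = reparametrise(mu, logvar, np.random.default_rng(0))
```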

Details of the MFmap implementation and training process
The MFmap neural network was implemented in PyTorch (version 1.5.0) and trained on two NVIDIA Tesla V100 SXM2 GPUs (32 gigabytes of memory each) using the Adam optimiser [25]. The preprocessed data were randomly split into training (90%) and test (10%) sets. Hyperopt [26] was used to search for the best model while avoiding overfitting, selecting the hyper-parameters yielding the minimum divergence of the total loss between the training and validation data (training/validation ratio 9:1). The search space for hyper-parameter selection was:
• dimension of the first DNA or RNA encoder hidden layer: {4096, 2048, 1024}
• dimension of the second DNA or RNA encoder hidden layer: {1024, 512, 256}
• dimension of the third DNA or RNA encoder hidden layer: {256, 128, 64}
• dimension of the first classifier hidden layer: {512, 256, 128, 64}
• dimension of the second classifier hidden layer: {128, 64, 32}
• batch size: {64, 32}
• learning rate: {10⁻⁴, 10⁻³, 10⁻²}
The optimal setting found from the minimal validation loss was: learning rate 10⁻³; batch size 32; first, second and third encoder hidden layer dimensions 1024, 512 and 256, respectively; first and second classifier hidden layer dimensions 128 and 64, respectively. These hyper-parameters were used for all cancer types in Table 1 in the main text.
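The grid above can be expressed as a simple search-space dictionary; the sketch below draws configurations uniformly at random as a stand-in for Hyperopt's TPE sampler (variable names are illustrative):

```python
import random

# The hyper-parameter grid described above
space = {
    "enc_h1": [4096, 2048, 1024],
    "enc_h2": [1024, 512, 256],
    "enc_h3": [256, 128, 64],
    "clf_h1": [512, 256, 128, 64],
    "clf_h2": [128, 64, 32],
    "batch_size": [64, 32],
    "lr": [1e-4, 1e-3, 1e-2],
}

def sample_config(rng):
    """Draw one configuration uniformly from the grid (random-search stand-in
    for Hyperopt's guided TPE sampling)."""
    return {k: rng.choice(v) for k, v in space.items()}

rng = random.Random(0)
configs = [sample_config(rng) for _ in range(5)]

# Total number of configurations in the grid
n_total = 1
for v in space.values():
    n_total *= len(v)
```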

Derivation of the unsupervised and supervised ELBO
We first derive the ELBO for cell line data using Jensen's inequality:

log p(x) = log Σ_y ∫ dz q(z|x) p(y|z) [ p(x|z) p(z) / q(z|x) ]
= log E_{q(z|x) p(y|z)} [ p(x|z) p(z) / q(z|x) ]
≥ E_{q(z|x) p(y|z)} [ log ( p(x|z) p(z) / q(z|x) ) ]
= E_{q(z|x)} [log p(x|z)] − D_KL(q(z|x) ∥ p(z)). (23)

Here, we have used the conditional independence assumption Eq (3c) in the main text and p(y|z) = q(y|z). Similarly, for the labelled examples (bulk tumour samples), we derive the ELBO for the log-likelihood as

log p(x, y) = log ∫ dz q(z|x) [ p(x, y, z) / q(z|x) ]
≥ E_{q(z|x)} [ log ( p(x|z) p(y|z) p(z) / q(z|x) ) ]
= E_{q(z|x)} [log p(x|z)] − D_KL(q(z|x) ∥ p(z)) + E_{q(z|x)} [log p(y|z)],

where we have used the conditional independence of x and y given z in both the generative and the inference model.