A novel single-cell based method for breast cancer prognosis

Breast cancer prognosis is challenging due to the heterogeneity of the disease. Various computational methods using bulk RNA-seq data have been proposed for breast cancer prognosis. However, these methods suffer from limited performances or ambiguous biological relevance, as a result of the neglect of intra-tumor heterogeneity. Recently, single cell RNA-sequencing (scRNA-seq) has emerged for studying tumor heterogeneity at cellular levels. In this paper, we propose a novel method, scPrognosis, to improve breast cancer prognosis with scRNA-seq data. scPrognosis uses the scRNA-seq data of the biological process Epithelial-to-Mesenchymal Transition (EMT). It firstly infers the EMT pseudotime and a dynamic gene co-expression network, then uses an integrative model to select genes important in EMT based on their expression variation and differentiation in different stages of EMT, and their roles in the dynamic gene co-expression network. To validate and apply the selected signatures to breast cancer prognosis, we use them as the features to build a prediction model with bulk RNA-seq data. The experimental results show that scPrognosis outperforms other benchmark breast cancer prognosis methods that use bulk RNA-seq data. Moreover, the dynamic changes in the expression of the selected signature genes in EMT may provide clues to the link between EMT and clinical outcomes of breast cancer. scPrognosis will also be useful when applied to scRNA-seq datasets of different biological processes other than EMT.


Introduction
Cancer prognosis plays an important role in clinical decision making. Traditionally, cancer prognosis is based on several clinical and pathological variables such as tumor size, lymph node status, histological grades [1]. However, these clinicopathological factors are insufficient for cancer prognosis because cancer is heterogeneous at the molecular (e.g., genes) level. Hence, recent clinical guidelines have highlighted the importance of using multi-gene tests to select patients who should receive adjuvant therapies [2]. The multiple genes in the tests are known as cancer signatures, which are crucial to cancer prognosis. Cancer signatures can be identified by in vivo biological experiments. For example, the LM method [3] analyzed transcriptomics in the cell lines and chose 54 genes associated with lung metastagenicity and virulence. However, these experiments cannot be done on human beings. Meanwhile, experiments on animals would not guarantee that the same conclusion can be drawn for humans. Therefore, computational methods are needed to identify cancer signatures from existing data, including gene expression data and clinical data.
Computational methods for breast cancer prognosis have shown some successes. Generally, these methods select the prognostic genes from a large number of human genes and then train survival models based on the selected genes. For instance, PAM50 starts with an extended intrinsic gene set from previous studies, then selects genes based on their contributions in terms of distinguishing the five intrinsic breast cancer subtypes [4]. The RS method selects 16 cancer signatures from 250 published candidate genes [5]. Mamma [6] and GGI97 [7] use a statistical test to choose the genes which differentially express between two distinct groups of tumors. Most of these methods use supervised algorithms to select the candidate genes and only GGI97 ranks genes based on the similarities between gene expression profiles and tumor histologic grades. Based on the selected genes, most methods train linear regression models to predict the outcomes of the new coming patients. The clinical benefits of these prognostic genes for breast cancer are well studied on the traditional transcriptomics data, and some of the methods have approved by the Food and Drug Administration for commercial use [2].
The common feature of existing computational methods for breast cancer prognosis is that they are based on bulk RNA-seq data, which can lead to the following problems. Firstly, different tumor samples in bulk RNA-seq data have different proportions of cancer cells (named tumor purities) that can bias the results of these methods [8]. The traditional RNA sequencing technology measures the average expression levels of genes for an ensemble of cells from a tumor sample to obtain the so called bulk RNA-seq data. As a solid tumor tissue is a mixture of normal and cancer cells, the bulk RNA-seq data hence contain mixed signals and the noncancerous components may have influences on genomic analysis of the bulk RNA-seq data or even bias the results [8]. There are works to uncover tumor purity and correct the bias in the detecting of differential genes [9] and identification of cancer subtypes [10]. It has been shown that differentially expressed genes and cancer subtypes are crucial to the selection of cancer signatures. Secondly, with bulk RNA-seq data, we may not able to determine how gene signatures are related to cell level perturbation during cancer progression. Increasing evidence shows that the expression patterns of genes are heterogeneous from cell to cell [11]. These stochastic expression patterns trigger cell fate decisions and can affect cancer initiation and progress. However, based on the bulk RNA-seq data, the existing cancer prognosis methods cannot determine the correlation between clinical outcomes and dynamic gene behaviors along cellular trajectory.
Single cell RNA sequencing (scRNA-seq) has emerged recently and has many advantages over bulk RNA sequencing. Firstly, scRNA-seq does not have the tumor purity problem because it is possible to discover the existence of the micro-environment cell populations from scRNA-Seq data (See a review on this in [12]). Secondly, scRNA-seq is a powerful method to comprehensively characterize the cellular perturbation or stages within tissues [13] as it measures the expression of genes in individual cells. Additionally, scRNA-seq trajectory methods can provide a precise understanding of dynamic cell fate differentiation (See a systematic comparison in [14]). Through continuous cell stages along the pseudo-trajectory, we can observe the stochastic nature of gene expression [15]. Currently, scRNA-seq data are mostly used to detect cell types or to find novel biomarkers. As far as we know, there has been no work conducted on using scRNA-seq data to improve breast cancer prognosis.
In this work, we use an scRNA-seq dataset of Epithelial to Mesenchymal Transition (EMT) for identifying the breast cancer signatures for prognosis. There has been evidence showing that EMT is associated with carcinogenesis, invasion, metastasis, and resistance to therapy in cancer [16]. It has been found that EMT phenotype is associated with pancreatic cancer invasion and metastasis and is an important prognostic factor for patients [17]. There is more evidence indicating EMT phenotypes are associated with cancer prognosis. For instance, Tan TZ et al. [18] used an EMT scoring metric to quantify the EMT phenotypes of tumors and uncovered the correlations between high EMT score and poor disease-free survival in ovarian cancer and colorectal cancer. Similarly, George JT et al. [19] developed a scoring metric based on EMT gene expression to classify tumors into epithelial phenotype, hybrid E/M phenotype, and mesenchymal phenotype. They found that lung cancer patients in the hybrid E/M phenotype had significantly lower relapse-free and overall survival probability than those categorized as the epithelial phenotype. There has also been evidence that suggesting that genes playing an essential role in EMT (called EMT related genes hereafter) are associated with cancer prognosis [20,21]. Liang JY et al. [20] screened EMT-associated genes and observed that 35.29% of them were significantly associated with better survival outcomes and 12.04% of them were correlated with decreased survival in clear-cell renal cell carcinoma patients. In head and neck squamous cell carcinoma, the partial EMT signatures identified from single-cell level were found to be associated with metastasis and poor prognosis [21]. Furthermore, EMT related genes have been used for predicting clinical outcomes of cancer patients. For example, mesenchymal mRNA signatures were used for predicting overall survival of lung adenocarcinoma patients [22]. For the same cancer type, Shao BR et al. [23] used the EMT network-based signatures to improve the accuracy in classifying patients into good prognosis group and poor prognosis group. For one more example, Tao CM et al. [24] identified a group of seven EMT-related signature genes that successfully inferred the survival rates of glioma patients.
In view of the above described research and findings, in this paper, we hypothesize that genes that play an important role in EMT can be associated with breast cancer prognosis, and hence in this paper we develop a novel method, called scPrognosis to utilise scRNA-seq data of EMT for discovering breast cancer prognosis signatures. Moreover and importantly, recent studies have established the belief that EMT is a dynamic course instead of a binary process [16]. Therefore, scPrognosis leverages the dynamic information from the EMT scRNA-seq data to identify breast cancer signatures as well as the clinical significance of these signatures.
To fully exploit the scRNA-seq data towards optimal identification of breast cancer signatures, we propose to assess the importance of genes in the EMT process by integrating the following three measures: (1) their median absolute deviation in expression level; (2) their differentiation in different stages of EMT; (3) their roles in the dynamic gene co-expression network in EMT. scPrognosis uses a linear model to integrate the three measures for inferring breast cancer signatures. The significant difference between our method and the bulk RNAseq data based methods is that we reconstruct the pseudotemporal trajectory known as pseudotime of cells [25] in EMT and incorporate this information into differential gene expression analysis and dynamic gene co-expression network construction.
To validate the prognostic ability of these discovered gene signatures and apply them to breast cancer prognosis, we use them to build prediction models using bulk RNA-seq data as the data contains matched clinical information (and there are no single-cell data with matched clinical information available). We apply scPrognosis to four independent bulk breast cancer datasets, ranging from about 200 to 1200 patients. The experimental results show that scPrognosis improves cancer prognosis compared with other benchmark breast cancer prognosis methods based on bulk RNA-seq data. A significant portion of the discovered prognostic genes is proved to be associated with breast cancer prognosis. Moreover, the dynamic changes in the expression trends of the genes provide clues to the link between EMT transition and clinical outcomes of breast cancer.

Materials and methods
Overview of scPrognosis scPrognosis contains five steps as depicted in Fig 1. In step 1, MAGIC [26] and a gene filter are used to pre-process the noisy and high-dimensional scRNA-seq data. In step 2, EMT pseudotime, pseudotime series gene expression data, and dynamic gene co-expression network are inferred from the scRNA-seq data. In this step, firstly VIM gene expression level and pseudotemporal trajectory estimated by the Wanderlust algorithm [27] are used to identify EMT pseudotime for all cells in the scRNA-seq dataset. The EMT pseudotime describes the gradual transition of the single-cell transcriptome during the EMT transition process and helps to study gene expression dynamics in different EMT transition stages. Secondly, pseudotime series gene expression data is obtained by ordering cells in the scRNA-seq dataset from epithelial stage to mesenchymal stage according to the EMT pseudotime. Thirdly, from the ordered scRNA-seq data, a dynamic gene co-expression network is constructed by using the LEAP R package [28]. In step 3, based on the ordered scRNA-seq data, three methods are adopted to obtain the different gene ranking measures, including Median Absolute Deviation (MAD), switchde [15] and Google PageRank. MAD and switchde are used to compute gene importance based on their expression level. Google PageRank ranks genes based on their roles in the dynamic gene co-expression network. In step 4, we integrate the three different rankings obtained in step 3 to prioritize genes. In step 5, the top N ranked genes are selected as signatures to predict the survival outcomes of breast cancer patients in bulk RNA-seq data. Details of each step are described in the following sub-sections.

Pre-processing scRNA-seq data
In the first step, scPrognosis pre-processes the input scRNA-seq dataset. The scRNA-seq dataset is a data matrix with G rows and C columns, where each column stores the expression levels of G genes in a single cell. Due to the low amounts of transcripts in a cell, an expressed gene may not be detected during sequencing with current scRNA-seq technology. This can lead to missing values of expressed genes, which is called the "dropout" phenomenon. For example, There are five main steps in scPrognosis, including: ① Pre-processing scRNA-seq data; ② Inferring EMT pseudotime, pseudotime series gene expression data, and dynamic gene co-expression network from the filtered scRNA-seq data; ③ Ranking genes by three measurements; ④ Prioritizing genes via an integrative model; ⑤ Cancer prognosis using the top N ranked genes. The first four steps are based on scRNA-seq data while the last step uses bulk RNA-seq data to select parameters.
https://doi.org/10.1371/journal.pcbi.1008133.g001 scRNA-seq data by the inDrops platform only have about 30% effective reads for each cell. The dropout events can lead to significant bias in gene-gene relationships and other downstream analyses [26].
MAGIC [26] is a method to denoise scRNA-seq data and impute the missing gene expression profiles. To overcome the sparsity and noise of the raw count matrix, MAGIC uses PCA (principal component analysis) components to calculate cell-cell distance matrix. The distance matrix is converted to a cell-cell affinity (similarity) matrix by an adaptive Gaussian kernel method. The affinity matrix is symmetrized and Markov-normalized to construct a Markov transition matrix. The final denoised and imputed data matrix is obtained by multiplying the exponentiated Markov transition matrix by the raw count matrix. Based on the information sharing across similar cells, MAGIC recovers gene expression from the dropout and other sources of noise.
After the imputation, we filter out genes with low coverage rates and low expression levels because these genes are most likely not expressed. It is suggested that these genes should be removed when searching for discriminative genes in microarray data [29] and implementing the switchde method. More experimental details of MAGIC and the gene filter method are provided in Section 2 in S1 File.

Inferring EMT pseudotime, pseudotime series gene expression data, and dynamic gene co-expression network
Recently, it has been proposed that EMT transition occurs through continuum stages and there are several intermediate stages known as hybrid (partial) epithelial/mesenchymal (E/M) stages. Interestingly, these hybrid E/M stages are stable and can be the endpoint of a transition [16]. This means that cells may not go through the whole EMT transition and stop at a hybrid E/M stage. Switch-like genes that are up-or down-regulation along the EMT trajectory may induce cells to undergo a transition from one hybrid E/M stage to another hybrid E/M stage. Applying the proposition to the continuum stages of EMT transition, we could characterize the nature of switch-like genes and dynamic gene-gene relationships along the EMT trajectory. The strength of switch-like changes and the importance of genes in the dynamic gene coexpression network will be used to rank genes in our methods.
In this step, we will firstly infer the EMT pseudotime, and then based on the obtained pseudotime, we construct the pseudotime series gene expression dataset from the scRNA-seq dataset, which will be used in Step 3 to capture the switch-like changes along the pseudotime. At the same time, we also construct the dynamic gene co-expression network based on the pseudotime series gene expression dataset.
Even we do not have the true time-series data of individual cells undergoing EMT transition, we still can use scRNA-seq trajectory method to infer pseudotime from static scRNA-seq data. We assume that the EMT trajectory is a linear topology of ordered single cells, and cells represent the entire developmental process from E to M, i.e. each cell in the ordered sequence represents a different stage of the E to M transition. The trajectory then provides an indication of the timeline of the EMT transition, known as the EMT pseudotime. The pseudotime can be obtained using different approaches. One simple way to approximate the EMT pseudotime from a static scRNA-seq dataset is to order cells by their expression values of VIM [26], and we denote this pseudotime as VIM-time. Another way to infer the EMT pseudotime from a scRNA-seq dataset is by using the Wanderlust algorithm [27]. Wanderlust is a graph-based method to infer a linear tread to recapitulate cell trajectory. Wanderlust converts scRNA-seq data into a k-nearest neighbor graph (k-NNG). In k-NNG, each node is a cell, and each cell is connected to k cells that have similar expression profiles. Then Wanderlust generates several l-out-of-k-nearest neighbor graphs (l-k-NNGs) by randomly keeping l of k-nearest neighbors for each node in the k-NNG. For each l-k-NNG, Wanderlust identifies a trajectory score for each cell using a repetitive randomized shortest path algorithm. The final trajectory is computed by the average over all graph trajectories. We use the final trajectory as the EMT pseudotime named W-time. All the parameter assignments of Wanderlust can be found in Section 2 in S1 File.
After obtaining the EMT pseudotime, we have a trajectory score ranging from 0 to 1 for each cell which indicates its developmental stage of the E to M transition. Therefore, the scRNA-seq dataset (a data matrix) can be converted to a pseudotime series gene expression dataset by sorting cells (columns) based on the EMT pseudotime.
Then we construct a dynamic gene co-expression network from the above obtained pseudotime series expression dataset. Each node of the network represents a gene in the dataset. To capture the dynamic regulatory relationship between two genes, we use LEAP (lag-based Expression Association Pseudotime-series) [28] package to determine if there is an edge between two nodes. Given C cells ordered by the EMT pseudotime, the MAC_counter() function in LEAP calculates the maximum absolute correlation (MAC) between the two nodes across all the time lags l 2 {0, 1, . . ., C/3} using the pseudotime series expression data. If the MAC between two nodes g and tg is tested to be statistically significant, an edge is added from g ! tg.
The three measures for ranking genes scPrognosis combines three measures to rank genes, including Median Absolute Deviation (MAD) of gene expression profiles, the Switch-like Differentiation of genes in different stages of EMT (SDE), and the roles played by genes in the gene co-expression NETwork in EMT (NET). In this step, scPrognosis calculates the three measures individually before they are integrated into the next step. In the following, we describe the details of calculating each of the measures.
Let (e 1 , e 2 , . . ., e C ) represents the expression profile of a gene g 2 {1, . . ., G}, where C is the number of cells. The MAD of the gene can be computed as: where median() is the function returning the median value of a given variable.
To calculate SDE, we use the software tool switchde [15] which can estimate the differentiation of switch-like genes in different stages of EMT. switchde defines a sigmoid function as shown in Eq 2 to fit the profile of a gene g with regard to a pseudotime t c (c is the index of a cell and c 2 {1, . . ., C}). In Eq 2, m 0 g , k g and t 0 g are the average peak expression value, the active strength and the active time of g. k g presents how quickly the gene g is up or down regulated along the pseudotime. We define SDE(g) as the switch-like differential expression level of the gene g, and SDE(g) = k g .
switchde adopts the gradient-based L-BFGS-B optimization algorithm [30] to obtain the maximum likelihood estimates of the parameters m 0 g , k g , and t 0 g . switchde also do the hypothesis testing associated with gene differential expression and adjust p-value by the Benjamini-Hochberg method.
To calculate NET, we follow the modified Google PageRank algorithm presented in [31]. The modified Google PageRank algorithm is used to calculate the regulatory importance of a gene in the dynamic gene co-expression network. Suppose there are G genes, the ranking of a gene g is defined as the following: where d is the damping factor in PageRank and is set to 0.85 by default. tg is a target of g and we use T(g) to denote the set of all targets of g. L(tg) is the number of genes which regulate tg. From Eq 3, we can see that the rank of a gene depends on the rank of all its target genes. NET(g) is initialized to the same value for all g, and can be calculated using a iterative algorithm until it converges.

Prioritizing genes via an integrative model
Although all the three measures are all associated with the clinical outcomes of cancer, none of the individual measure suffices to cancer prognosis. The expression variation (MAD) helps with distinguishing different cell populations. Genes with high expression variations are also of great clinical interest. The differentiation in different stages of EMT is corresponding to the gene behavior along the trajectory of EMT. SDE helps identify the genes that switch on and off alternatively during the trajectory to trigger EMT. The gene co-expression network is important for us to better understand the mechanisms of cell differentiation and carcinogenesis at a systems level. NET helps us discover hub regulatory genes that target the highest degree of a series of genes (called targets) in the network. It is believed that the hub regulatory genes are more closely related to cancer and have more biological significance compared with their targets [32]. Because each of them only reflects one aspect of the importance of a gene, and a combination of the three would be a more comprehensive measure. Therefore, we propose a linear model to integrate the three measures to obtain the final score for each gene. Before integrating the three measures, we normalize them as follows: Then we integrate the normalised individual measures as follows.
where α, β, and γ are the weights of MAD, SDE and NET respectively, α + β + γ = 1, and 0 � α, β, γ � 1. Then we rank the genes in descending order of the integrated measure. We use the grid search and cross-validation methods to tune the weights of the linear model in the experiments. The optimal weights can lead to the best predictor of cancer prognosis on the bulk RNA-seq data.

Cancer prognosis using the top N ranked genes
From the list of ranked genes obtained in Step 4, we select the top N ranked genes as cancer signatures. Then the Cox proportional hazards (PH) model [33] is trained based on these cancer signatures and bulk RNA-seq data. The PH model assumes that the effect of covariances on the survival outcomes is time-independent. Given survival time t, the general function of the PH model is defined as the following: where β 0 is a N × 1 vector that holds estimated regression coefficients, X is the expression data of the top N genes, and h 0 (t) is the baseline hazard function. The risk score of a new patient is calculated by where X i is the expression data of the top N genes of the new patient i, and mean() is the function returning the average values of given data.

Performance evaluation
Concordance index. (C-index) [34] is commonly used to validate the predictive ability of cancer prognostic models. Let z i and r i be the potential survival time and the risk score predicted using Eq 7 for patient i, respectively. C-index is equal to the concordance probability P(r i > r j |z i < z j ) for a randomly selected pair of patients i and j. However, we cannot observe potential survival time for some patients who are lost to follow-up or event free at the end of a study (right censored). Hence the actually observed survival time t i = min(z i , c i ), where c i is the potential right censoring time. Let δ i be the censoring status. An event (e.g. death or relapse) is developed within the study period when δ i = 1. For the right censoring data, C-index can be defined as the following: where I() is an indication function. C-index ranges from 0 to 1. The bigger the C-index is, the more accurate of a model will be. Hazard ratio. To assist clinicians in tailoring treatment strategy, we often need to stratify patients into the high-risk group and the low-risk group via dichotomizing the predicted risk scores around their median value. Therefore, we need an accuracy measure to compare different methods. We use the hazard ratio (HR) as an accuracy measure, similar to other work [35]. We binarize the predicted risk scores to obtain the predicted groups R for patients. Then we estimate the risk difference between the two survival groups by Cox's proportional hazards model as: where h 0 (t) is the same as that in Eq 9. The quantity exp(β) is defined as HR, which indicates the risk difference between the two groups of patients. The larger the HR is, the larger discrimination between the low-and high-risk group becomes, and therefore the better the prediction method will be. Kaplan-Meier survival curve. The Kaplan-Meier (KM) survival curve [36] combined with the Log-rank [37] test can identify whether the two risk groups show significantly different survival patterns. In the KM curve plot, the Y-axis is the probability of surviving in a given length of time, and the X-axis is survival time. The KM curves should have different characteristics and should not overlap for different groups predicted by a good method. The Log-rank test determines whether the survival curve estimated for each group is identical or not. If the p-value of the Log-rank rest is less than 0.05, the survival curves are statistically significantly different. Implementation scPrognosis has been implemented using MATLAB and R packages. All the datasets and the R scripts to reproduce the results in this paper are available online at https://github.com/ XiaomeiLi1/scPrognosis.

Data sources and preparation
scRNA-seq data. In this paper, we use the scRNA-seq data of HMLE breast cancer cell lines from [26] to identify the EMT pseudotime for each cell and then select cancer signatures. The cells were stimulated with TGF-beta to induce EMT transition and the single-cell sequencing was performed using the inDrops platform. There are 28910 transcripts effectively measured in 7523 single cells. The scRNA-seq data can be download from the Gene Expression Omnibus (GEO) database (https://www.ncbi.nlm.nih.gov/geo/) under accession number GSE114397.
Bulk RNA-seq data. For training and validating the cancer prognosis model based on the selected signatures, we use bulk RNA-seq data of 2979 breast cancer patients from four different repositories, including TCGA (753 samples), METABRIC (1283 samples), GEO (736 samples) and UK (207 samples). Most of the breast cancer samples possess detailed clinical data, such as age, nodal, stage, grade, survival time, and event status. The TCGA and METABRIC datasets contain both overall survival time (OS) and relapse-free survival (RF) endpoints. The GEO and UK datasets only have the endpoints of relapse-free survival. The TCGA dataset was downloaded from the TCGA data portal (http://firebrowse.org/) and the dataset consists of level 3 mRNA expression data of primary breast cancer. The METABRIC dataset [38] was downloaded from the European Genome-phenome Archive (https://www.ebi.ac.uk/ega/ accession number EGAS00000000083, approval needed). The GEO dataset consists of 5 datasets: GSE12276 (204 samples), GSE19615 (115 samples), GSE20711 (88 samples), GSE21653 (252 samples) and GSE9195 (77 samples). We merge the five GEO datasets into a bigger dataset and adjusted the batch effects by the ComBat algorithm from the sva library [39]. The UK (known as GSE22219) dataset contains 207 early-invasive breast cancer cases with complete follow-up clinical data in 10 years. Both the GEO and UK datasets were downloaded from the Gene Expression Omnibus repository (https://www.ncbi.nlm.nih.gov/geo/). We summarize the details of these bulk RNA-seq datasets in Table 1.
As shown in Table 1, three of the datasets were obtained from the same platform Illumina RNA-seq, and only GEO was obtained using a different platform Affymetrix microarray. Microarray-based gene expression profiling has been widely applied in breast cancer research [40]. However, the use of gene expression data measured by RNA-seq platform is increasing since RNA-seq platform has several advantages over traditional microarrays for conducting transcriptional profiling [41]. While we acknowledge the heterogeneity in the datasets obtained using different platforms (and bias correction of data from different sources is an active research topic [42,43]), in this work, we only used these datasets as testing data to evaluate the performance of the proposed method and the benchmark methods. Therefore, we did not correct cross-platform bias from the datasets since the bias would not affect the fairness of the comparison.

scPrognosis is better than benchmark methods for risk score prediction
As discussed in the previous section, the SDE and NET measures have a considerable dependency on the pseudotime. We investigate the performance of two versions of scPrognosis based on different pseudotime, including VIM-time and W-time which are based on the expression profile of gene VIM and the Wanderlust algorithm, respectively. Section 2 in S1 File has more experiment details of calculating VIM-time and W-time. We denote the two versions of the implementations of scPrognosis as scP.V and scP.W, corresponding to the use of VIM-time and W-time, respectively. To illustrate that scRNA-seq data can help to select prognostic signatures of breast cancer, we choose six widely used breast cancer prognosis benchmark methods that are based on the signatures selected from bulk RNA-seq data. More information about the benchmark methods can be found in Section 1 and Table A in S1 File. We compare the performance of the two versions of scPrognosis (scP.V and scP.W) with the benchmark methods on the datasets listed in Table 1. We report the results on TCGA and METABRIC according to the overall survival (OS) and relapse-free (RF) time. For the GEO and UK datasets, we report the results on the relapse-free time. Table 2 shows the C-indices and the mean ranking scores of all the methods compared. The reported C-index is the average C-index of 100 runs of 10-fold cross-validation on a dataset. Based on the C-indices, mean ranking scores are calculated by Friedman's test, which is a two-way analysis of variance by ranks for related samples. scP.W is better than other methods since it wins three times. Compared to the benchmark methods, scP.W outperforms all the methods for the prediction of the risk of RF time on the TCGA and UK datasets. Moreover, from the mean ranking results, we can see that scPrognosis overall outperforms the benchmark methods.
To test whether a method performs significantly better than the other, we conduct the Wilcoxon signed-rank test based on the C-indices of scPrognosis and the benchmark methods. The result shows that both scP.V and scP.W perform significantly better than Mamma (the pvalues are 0.017 and 0.018, respectively) and GGI97 (the p-values are 0.016 and 0.017, respectively). scP.V significantly outperforms LM (p-value = 0.03) while scP.W is superior to RS significantly (p-value = 0.046). Moreover, as previously shown (Table 2), according to the mean ranking scores, our methods still marginally improve the other two methods, PAM50 and Endo.
In summary, we only use scRNA-seq data to measure the importance of genes, whereas the benchmark methods use signatures directly obtained from breast cancer clinical data and prior knowledge. Even so, the results have shown that both scP.V and scP.W achieve better or competitive performance compared with the benchmark methods. This indicates scRNAseq data can improve the performance of breast cancer prognosis, and the signatures of EMT potentially are high quality predictors for breast cancer prognosis.

scPrognosis is better than benchmark methods for risk group prediction
In this section, we evaluate scPrognosis using the Hazard Ratio (HR) criterion, in comparison with the six benchmark methods. For each method, we stratify patients into two groups using the risk scores calculated by the method. If a patient's risk score bigger than the median value the patient is put into the high-risk group, otherwise the patient is put into the low-risk group. The HRs for all the methods are reported in Table 3. We observe that the two versions of scPrognosis (scP.V and scP.W) win once and twice, respectively, but PAM50, RS, and Endo each wines once this time. Based on the mean ranking results, we can conclude that overall scPrognosis outperforms the benchmark methods in stratifying patients into two risk groups.
Then we use the Wilcoxon signed-rank test to test the significance of the results on the HR criterion. Again, both scP.V and scP.W have perform significantly better than Mamma (the pvalues are 0.047 and 0.031, respectively), GGI97 (the p-values are 0.016 and 0.016, respectively) and LM (the p-values are 0.016 and 0.030, respectively).

Evaluation using independent test
According to the results in Tables 2 and 3, among the two different implementations of scPrognosis, scP.W outperforms scP.V. So we choose scP.W as our final method to identify breast cancer signatures. For further evaluating the robust of scP.W in breast cancer prognosis, we conduct independent tests on three bulk RNA-seq datasets. Due to the small sizes of the GEO, and UK datasets, we don't train scP.W based on these datasets. Fig 2 shows the independent test results on TCGA when training on METABRIC. Fig 2A  and 2C show the comparison of scP.W and the benchmark methods. In these two figures, the The top-performing result is highlighted for each dataset. The reported C-index is the average C-index of 100 runs of 10-fold cross-validation on each dataset.
https://doi.org/10.1371/journal.pcbi.1008133.t002 The top-performing result is highlighted for each dataset. The reported hazard ratio is the average hazard ratio of 100 runs of 10-fold cross-validation on each dataset.
https://doi.org/10.1371/journal.pcbi.1008133.t003 Y-axis is the C-index, and the X-axis is the category of methods. Based on C-index, scP.W achieves the best results in predicting overall survival and relapse-free survival time. Fig 2B  and 2D are the KM curves and the Log-rank test of risk group prediction using scP.W on the TCGA dataset. The results show that scP.W successfully stratifies patients into two risk groups of relapse and overall survival. The p-values by the Log-rank test are less than 0.05, which indicates that two risk groups have significantly different survival patterns, and the high-risk group has lower survival probability than that of the low-risk group. The TCGA dataset is the second-largest dataset in breast cancer and widely used in breast cancer research. We report the comparison results of our method based on the TCGA dataset and the current breast cancer prognostic methods in Table C in S1 File. The results also show that scP.W achieves the best results in cancer prognosis.

Breast cancer signatures identified by scPrognosis
From the previous sections, we see the EMT signatures discovered by our methods are good breast cancer signatures too. To further validate these signatures, we compare the signatures discovered by our method with those discovered by benchmark methods. The EMT signatures are the top N ranked genes based on the scores calculated by Eq 5. The parameters N, α, β, and γ are determined by the 10-fold cross-validation results on METABRIC since this dataset has more samples and obtains better performance than TCGA (Tables C and D in S1 File). scP.W selects 10 genes as breast cancer signatures, KRT15, UBE2C, TOP2A, KRT6B, MKI67, HMGB2, ASPM, CDC20, KIF20A and CDK, when trained on METABRIC. Comparing the 10 genes with the signatures used by the benchmark methods, we find 5 genes (UBE2C, MKI67, ASPM, CDC20, and KIF20A) showed up in one or more benchmark methods. ASPM is the common signature when scP.W is trained on TCGA and METABRIC. In our model, high ASPM levels are associated with adverse prognostic factors and shorter survival and relapse-free time. Recent evidence suggests that ASPM promotes prostate cancer stemness and progression and has important clinical and therapeutic significance [44]. Besides ASPM, other common signatures also have been proved to relate to breast cancer prognosis. For instance, high UBE2C expression is associated with poor prognosis in breast cancer, especially basal-like breast cancer [45]. CDC20 over-expression means short-term breast cancer survival [46]. Fig 3 shows the diagram of overlapping genes among different methods. The diagram shows that a significant portion of the prognostic genes discovered by our method is overlapped with the current signatures of breast cancer prognosis. Though the clinical significance of the other five signature genes discovered by our method (KRT15, TOP2A, KRT6B, HMGB2, and CDK1) is not clear at present, they can be novel signatures for human breast cancer. There have been researches investigating the relationship between these genes and breast cancer. For example, KRT6B and KRT15 were found to be the makers of basal-like breast cancers [47], and TOP2A expression levels were reported to have a significant association with metastasis-free survival in node-negative breast cancer [48].

Correlation between EMT markers, breast cancer signatures and the EMT pseudotime
This paper is the first work to identify switch-like differential expression genes along the EMT pseudotime to understand their efficacy in deciphering the survival of breast cancer patients. No matter we use VIM-time or W-time, the models built have a good agreement on the performance. This is because W-time is highly related to the expression profile of VIM (the Pearson correlation is 0.46). For visualizing the dynamic behavior of genes in different stages of EMT, we divide W-time into 10 equal sized bins that present pseudo-stages of EMT. The expression level of a gene in a bin is calculated by the average of the profile during the time interval. shows the tendencies of the 10 cancer signatures along the pseudo-stages. We also plot the expression profiles of genes along W-time in Fig C in S1 File. Only KRT15 and KRT6B are down-regulated by the EMT transition while other signatures do not vary at the E and M stages, but peak at the hybrid E/M stages. Recent experimental and theoretical evidence suggests that the hybrid E/M stages are stable phenotypes and is associated with aggressive tumor progression [49]. Our method demonstrates the relevance of the hybrid E/M phenotypes to patient survival in breast cancer.
We also visualize the EMT markers' dynamic behavior to determine whether the W-time could successfully model the cell evolved from the E stage to the M stage. The results from

Enrichment analysis of the signatures discovered by scPrognosis
We validate discovered breast cancer signature genes against the literature knowledge of pathways using the WikiPathways (http://www.wikipathways.org) platform [50]. The results in Table 4 show that the 10 signatures are highly relevant to the regulation of cancer. For instance, pathways 1, 2, 3, and 8 are direct pathways of cancer, and others are important pathways involved in the process of tumorigeneses.
We also conduct gene ontology enrichment analysis for the 10 breast cancer signatures. From Table E in S1 File, we can see that they are regulators of cell cycle progress and ubiquitin-protein ligase activities. Table A in S1 File shows that current signatures based on bulk RNA-seq data are also enriched in cell cycle regulation. Recent studies reveal the important roles of ubiquitin-protein ligase activity played in breast cancer [51,52].

Discussion and conclusion
Breast cancer is a complex disease caused by intricate genetic and molecular alterations. Thus traditional clinicopathological factors are not sufficient for the accurate prognosis of breast cancer. Recently, a wide range of computational methods have been proposed to identify multi-genes for breast cancer prognosis, and some of the methods have been approved for commercial use, including PAM50, Mamma, and RS test. These methods lead to a revolution in the breast cancer treatment paradigm. However, all of the progress in cancer prognosis has not been enough to overcome therapy resistance in breast cancer under current cancer therapeutics. Some tumor cells acquire resistance to targeted cancer therapy, which leads to worse survival of cancer patients. scRNA-seq can reveal genes that affect cell fate decision by monitoring the expression of genes in different cell states and sub-populations. In this paper, we use scRNA-seq data to detect signatures related to EMT that affect the clinical outcomes of breast cancer patients. For almost two decades, the prospect that EMT may play an important role in tumor stemness, metastasis, and drug resistance has been vigorously debated. However, evidence demonstrating the prognosis power of EMT markers in breast cancer clinical studies has not been identified. Recently scRNA-seq is used to identify the continuum of EMT transition. We try to use the EMT scRNA-seq data to link the EMT related genes to breast cancer survival. To investigate how genes are related to cell level perturbation during EMT, we use the computational method Wanderlust to infer the EMT pseudotime. We integrate multiple measurements, MAD, SDE, and NET to measure the importance of a gene based on its expression variance, its dynamic differentiation, and its role in the dynamic gene co-expression network. We apply our method to four breast cancer cohorts. The experimental results illustrate that scPrognosis is more efficient than the benchmark methods based on bulk RNA-seq data and single-cell based methods only using individual measurements ( Table B in S1 File). Our work also emphasizes the benefit of EMT mechanisms that incorporate background knowledge for identifying biologically relevant signatures of cancer prognosis. And the results show the good performance of the signatures in breast cancer prognosis.
Moreover, the results of scPrognosis may give us some clues for interpreting the EMT process. We look at the dynamic change of the gene expression along the EMT pseudotime. Interestingly, only two identified breast cancer signature genes are down-regulated along the EMT pseudotime, while the remaining genes peak at the intermediate of the E to M transition. These genes could be novel biomarkers for the hybrid E/M stages. We assume that the hybrid E/M stage is more relevant to patient survival as supported by the recent study in [19].
To identify the activity of EMT-related breast cancer signatures, we conduct a pathway analysis of the discovered breast cancer signatures. The results show that a significant number of the identified signatures are enriched in the pathways associated with cancer. Through the GO enrichment analysis, the signatures found by our method are closely related to the biological functions of cell cycle activity and ubiquitin-protein ligase activity, and the latter activity is not showing up in most of the current signatures.
However, there is no universal method that outperforms all the other methods. We still need to discover novel mechanisms involved in breast cancer progress, metastasis or resistance. In the future, our method can be extended to improve breast cancer prognosis by immune cell trajectories. Understanding immune cell development and response to disease is a crucial step for conquering cancer metastasis by immunotherapy. Recently there are some single-cell experiments for investigating cellular dynamics in the context of immunology [53].
In conclusion, we have proposed a novel method scPrognosis for breast cancer prognosis based on scRNA-seq data. scPrognosis uses an integrative model to infer breast cancer signatures based on MAD, SDE, and NET measurements. We empirically compared our method with the existing methods on four breast cancer datasets. The results show that the scRNAseq based method is a good and useful method for breast cancer prognosis. The signatures detected by our method show the link between EMT and the clinical outcomes of breast cancer, which may give some clues for current cancer therapeutics.