Inferring latent temporal progression and regulatory networks from cross-sectional transcriptomic data of cancer samples

Unraveling molecular regulatory networks underlying disease progression is critically important for understanding disease mechanisms and identifying drug targets. The existing methods for inferring gene regulatory networks (GRNs) rely mainly on time-course gene expression data. However, most available omics data from cross-sectional studies of cancer patients often lack sufficient temporal information, leading to a key challenge for GRN inference. Through quantifying the latent progression using random walks-based manifold distance, we propose a latent-temporal progression-based Bayesian method, PROB, for inferring GRNs from the cross-sectional transcriptomic data of tumor samples. The robustness of PROB to the measurement variabilities in the data is mathematically proved and numerically verified. Performance evaluation on real data indicates that PROB outperforms other methods in both pseudotime inference and GRN inference. Applications to bladder cancer and breast cancer demonstrate that our method is effective to identify key regulators of cancer progression or drug targets. The identified ACSS1 is experimentally validated to promote epithelial-to-mesenchymal transition of bladder cancer cells, and the predicted FOXM1-targets interactions are verified and are predictive of relapse in breast cancer. Our study suggests new effective ways to clinical transcriptomic data modeling for characterizing cancer progression and facilitates the translation of regulatory network-based approaches into precision medicine.

I only have a few minor comments: (1) In equation (5), the authors described that the root is determined as the patient (x) with the largest distance to the patients the maximal grade score (e.g., stage 4). I would suggest limiting the candidate x \ patients with the smallest grade (e.g., grade 0). If not, theoretically, it is likely that x0 will come from a later stage (e.g., stage 2 or even stage 4 (e.g., because of a few outliers that produce a considerable distance). In that case, the root x0 will come from stage 2 or stage 4, which would not make much sense to me. As the stage information is available, I would suggest to narrow down the selection of x0 among patients in stage 0 R: Thank you for your good suggestion. We agree with your comment on the potential influence of a few outliers. According to your advice, we have revised it as follows (now Equation (9) in the revised manuscript): where is a randomly selected patient from the maximal grade subpopulation. The selection of 0 was limited among patients with the smallest grade (i.e., { }) to eliminate potential influence of a few outliers in the data." We also modified the code correspondingly and tested the influence of this modification on the two real datasets used in this study and detected no difference at least for the studied two cases.
(2) In this work, the authors simulate Gaussian noise to examine the robustness of the method. I am amazed by the robustness of the method to Gaussian noise. However, the Gaussian noise is usually well dealt with. Did the authors also simulate other types of noise (e.g., drop-out noise is also a very common noise type)? It would be great if the methods can also be tested against other noises.

R:
Thank you for your insightful comments and good suggestion.
In addition to the above Gaussian noises, we also tested the robustness of our method against exponential noises ( Figure S5). The gene expressions were randomly perturbed by using multiplicative exponential noises generated from the exponential distribution with mean ranging from 0 to 0.3. We agree that the drop-out noise is also a common noise type, particularly for single-cell RNA-sequencing data, but may not be the major source of noise in RNA-seq data of cancer samples. Since our study is intended primarily for the transcriptomic data of cancer samples (e.g., microarray and RNA-seq data) whose typical noise may come from sample purity, so we used Gaussian noise and exponential noises (always positive) to imitate this situation. We acknowledge that the drop-out noise is indeed more difficult to dealt with. Nevertheless, we have figured out a way to deal with the drop-out noise under a unified framework of our method, which will be investigated in our future study oriented specifically to scRNA-seq data.

a b
General Comments Authors developed a latent-temporal progression-based Bayesian method, PROB, for inferring gene regulatory networks (GRNs) from the cross-sectional transcriptomic data of tumor samples. Mathematical proofs and numerical verification are provided to support the robustness of PROB to the measurement variabilities in the data. Benchmarking PROB with alternative methods of GRN inference shows advantages of PROB in both pseudotime inference and GRN inference. Authors also demonstrated the applications of PROB for identification of key regulators of cancer progression or drug targets as well as performed validations experiments. Overall, the manuscript is clear and accessible.
R: Thank you for your positive feedback. We now have revised the manuscript according to your comments and suggestions.
Specific comments Pages 9-12: Section "Latent-temporal progression-based Bayesian (PROB) method to infer GRN" This section can be rewritten more compactly to improve reading experience. Either put all details in the main text or keep only the main idea in the main text and leave the details in the supplementary text. Current main text is an abridged version of Text S1 and does not read consistent.

R:
Thanks for your good suggestion.
To keep text consistent and symbols in Algorithm 1 clearly defined, we decided to put all details of the method into the main text, except for the details of assumption and derivation of the dynamical modeling, which are now provided in Text S1 ("Progressiondependent dynamic modeling of the GRN").
Page 9: "… the root was automatically identified with the aid of staging information." What is "root" meant here and in the subsequent context? Does it adapt a generally used and accepted mathematical meaning? Should it be called "optimizer"?

R:
The "root" here means the starting point in the latent temporal progression trajectory. It is widely used and accepted for pseudo-temporal trajectory inference in the previous papers of single-cell RNA-seq data analysis, for instance: To make it clearer, we described it as "root of the progression trajectory". The above sentence has been revised as follows: "… the root of the progression trajectory was automatically identified with the aid of staging information" Page 11: "The above model is then transformed into a linear regression model …" I think it's better phrased as "a linear system". Rigorously speaking Y is not linear in X.
R: Sorry for confusion due to missing information in the original main text. Now we have included details into the main text and the final model reads (Equation (15)): We can see that Yi is linear in X (i) . So we decided to keep "linear regression model" in the revised manuscript, provided that no confusion would occur now.
Page 12: "Therefore, the above theorem theoretically guarantees the consistency and robustness of the estimates of the regulatory coefficients." It can be elaborated more the role of Theorem 1; for example, add 2-3 lines of equations.

R:
Thanks for your suggestion. We have added the following corollary and implications of Theorem 1. "A corollary of the above theorem is that the mapping ↦ ൣ ሺ ሻ൧ × defined by Equation (10) is continuous under certain appropriate metric. More specifically, for two trajectories and ǁ , if the difference between the two inferred regulatory coefficients ൣ ሺ ሻ൧ × and ൣ ሺ ǁ ሻ൧ × is significantly larger than 0, then the difference between and ǁ should not be arbitrarily small. This implies that if the inferred regulatory networks for two progressions are largely different, then the two progressions should have different trajectories and thus distinct clinical outcomes. Hence, Theorem 1 also suggests that GRN-based signatures may be used for predicting or controlling cancer progression." Page 12: pseudo-code of PROB Variable E is not defined in the main text. Definition of phi should be readdressed in the algorithm. Variable/function PPD is not defined.

R:
Thanks for pointing them out.
Variable E has now been defined in the Methods section of the main text.
Definition of 0 has been included in the algorithm.
Page 13: "To this end, we defined an outgoing causality score (OCS) …" Does OCS have any mathematical/statistical meaning?

R:
The OCS represents the sum of regulatory strengths from gene i to its target genes (∑ =1 ). It was defined in view of the Equation (10) (dynamic model of gene regulation) based on the mass action kinetics. The OCS quantifies the outgoing regulatory potential of a give gene and takes large value for regulators but small values for targets. Intuitively, for a set of genes with known regulatory relationship (regulators and targets), if gene i is a regulator, then the OCS value should be large since many values of (absolute value of mean of ) should be greater than 0; If gene i is a target, then the OCS value should be small since most values of should be zero or small enough.
We have added the following text into the main text: "The OCS is defined in accordance of the Equation (10) based on the mass action kinetics and quantifies the outgoing regulatory potential of a give gene." Page 21: "The code for PROB is available at https://github.com/SunXQlab/PROB." This page does not exist.

R:
We are sure that the code has been uploaded into Github with URL: https://github.com/SunXQlab/PROB We have confirmed that the above URL is valid, which has also been tested by other colleagues. If the problem still exists, please try another browser. Thanks.
Page 28: Figure 2d Why does not AUC decrease monotonically with CV? R: Figure 2 illustrates the result of one realization of the simulation under specific noise level. Due to random variations, the AUC in Figure 2d does not monotonically decrease. The multiple simulation results demonstrated that the mean value of AUC decreases monotonically with CV, which were provided in supplementary Figure S4C.
( Figure S4C) Page 32: Figure 6: Details related to survival analysis are missing (e.g. how authors define high and low risk groups. What is their cut-off?) Also role of FOXM1 on breast cancer progression is known and citations are missing.

R:
Thanks for you pointing this out. We now have added the details of survival analysis in Figure 6 into the Text S6.
"A risk score was defined for each patient based on the expression level of FOXM1 according to COX PH model. High expression level of FOXM1 corresponds to high risk, and low expression level of FOXM1 corresponds to low risk. In this way, patients were divided into two groups: high-risk group and low-risk group. The optimal cutoff value of the risk score was determined using the ROC method [2]." The citations for the role of FOXM1 in breast cancer have been added (Ref. [46]).