Development and validation of a set of novel and robust 4-lncRNA-based nomogram predicting prostate cancer survival by bioinformatics analysis

Background and objective Accumulating evidence shows that long noncoding RNAs (lncRNAs) possess great potential in the diagnosis and prognosis of prostate cancer (PCa). Therefore, this study aimed to construct an lncRNA-based signature to more accurately predict the prognosis of different PCa patients, so as to improve patient management and prognosis. Methods Through univariate and multivariate Cox regression analysis, this study constructed a 4 lncRNAs-based prognosis nomogram for the classification and prediction of survival risk in patients with PCa based on TCGA data. Then we used the data of TCGA and ICGC to verify the performance of our prediction model. The receiver operating characteristic curve was plotted for detecting and validating our prediction model sensitivity and specificity. In addition, Cox regression analysis was conducted to examine whether the signature’s prediction ability was independent of additional clinicopathological variables. Possible biological functions for those prognostic lncRNAs were predicted on those 4 protein-coding genes (PCGs) related to lncRNAs. Results Four lncRNAs (HOXB-AS3, YEATS2-AS1, LINC01679, PRRT3-AS1) were extracted after COX regression analysis for classifying patients into high and low-risk groups by different OS rates. As suggested by ROC analysis, our proposed model showed high sensitivity and specificity. Independent prognostic capability of the model from other clinicopathological factors was indicated through further analysis. Based on functional enrichment, those action sites for prognostic lncRNAs were mostly located in the extracellular matrix and cell membrane, and their functions are mainly associated with the adhesion, activation and transport of the components across the extracellular matrix or cell membrane. Conclusion Our current study successfully identifies a novel candidate, which can provide more convincing evidence for prognosis in addition to the traditional clinicopathological indicators to predict the PCa survival, and laying the foundation for offering potentially novel therapeutic treatment. Additionally, this study sheds more lights on the PCa-related molecular mechanisms.


Introduction
Prostate cancer (PCa) represents a frequently occurring cancer in men globally [1]. A number of clinicopathological factors, such as the Gleason score, margin status, or the TNM stage, are incorporated into prior models to diagnose and detect PCa after treatment [2]. Of such factors, Gleason score exhibits the highest sensitivity and effectiveness. Nonetheless, the subjectivity and sampling error related to the assessment of Gleason score are the prominent confounders. Recently, increasing research attempts to establish the gene molecular signature for enhancing the prediction ability for PCa [3]. Nonetheless, little existing research has examined the possible roles of lncRNAs in the prediction of PCa survival risk as the new models. lncRNAs have been usually referred to as the RNA transcripts with the length of over 200 nucleotides (nt) that cannot encode proteins [4]. It is increasingly suggested that, lncRNAs have exerted vital parts in numerous biological processes through regulating gene expression, or proliferation and apoptosis of cells [5]. The aberrant lncRNAs are related to tumorigenesis, which may serve as the oncogenes or tumor suppressor genes. In addition to their parts in tumor genesis and development, lncRNAs may serve as the candidate biomarkers [6]. So far, bioinformatic analysis is substantially used for molecular biological experiments or in clinic. Therefore, this work aimed to employ bioinformatic analysis to discover those differently expressed lncRNAs (DElncRNAs) between PCa and non-carcinoma tissue samples from the sequencing data obtained through The Cancer Genome Atlas (TCGA). Then, Cox regression analysis together with the survival associated risk score formula was used to develop a nomogram for prognosis based on four lncRNAs, so as to differentiate cases with favorable or dismal prognostic outcome. Then we used TCGA and International Cancer Genome Consortium (ICGC) datasets to verify the performance of our prediction model. The receiver operating characteristic (ROC) curve was also plotted, with a higher area under curve (AUC) indicated favorable model sensitivity and specificity. In addition, univariate as well as multivariate Cox regression suggested that this4 lncRNAs-based prognosis nomogram showed higher independence and robustness in predicting prognosis compared with additional clinicopathological variables. Besides, analyzing the functions and involved pathways for these 4 lncRNAs shed more lights on the lncRNAs prediction ability and the possible molecular mechanism.

The overall analysis route in this work
The general research analysis route can be observed from Fig 1. First, those transcriptome RNA-seq expression profiles from 551 cases were obtained based on the TCGA database. Then lncRNA was analyzed for differential expression and batch survival. Univariate and multivariate COX regression analysis was performed for differentially expressed lncRNAs to establish a gene model of 4-lncRNA. Then we combined ICGC database data to verify the prediction performance of the prediction model. Survival analysis, ROC curve analysis, and patient risk heat plot, risk curve and survival status plot were performed respectively for our model. Target genes of lncRNAs were predicted by co-expression, and then the target genes were analyzed by functional clustering analysis. Finally, independent prognostic analysis and correlation analysis were conducted between our 4-lncrNA nomogram and other common clinical characteristics, as well as further stratified analysis and combined analysis of Gleason Score.

Datasets
LncRNA RNA-seq data (HTSeq-Counts) in the United States and China including PCa and control samples and corresponding clinical data of PCa patients were extracted based on the publicly available Genomic Data Commons (GDC) data portal (https://portal.gdc.cancer.gov/) and TCGA data portal (https://dcc.icgc.org/releases/ current/Projects/PRAD-CA) respectively. Altogether 551 gene expression profiling samples were collected from TCGA, comprising 499 PCa and 52 non-carcinoma tissues. 435 gene expression profiling samples were took from

PLOS ONE
ICGC, comprising 393 PCa and 42 non-carcinoma tissues. Samples with insufficient clinical data or those with OS < 3 months were eliminated from the present research.

Identification of DElncRNAs between PCa and non-carcinoma tissue samples
In line with our inclusion criteria, clinicopathological data from 397 PCa cases were obtained from TCGA database ( Table 1), and clinical data from 378 PCa cases were obtained from ICGC database ( Table 2). Clinical covariates for both cases and cancers can be observed from Tables 1 and 2. The R4.0.1 software was employed for analysis. For identifying the candidate lncRNAs in later survival analysis, the trimmed mean of M values was adopted to normalize and differentially analyze the expression profiles by edgeR package from Bioconductor [7]. The heterogeneities of lncRNA expression in PCa compared with adjacent normal tissues were presented in the form of log2FC together with related P-value. In this study, |log2FC| > 1 and false discovery rate (FDR) q < 0.05 were selected as the thresholds. Thereafter, the R package pheatmap function (version 1.0.12) was used to perform unsupervised hierarchical clustering according to DElncRNA expression levels.

Survival analysis and development of the lncRNAs expression model
Firstly, RNA-seq expression data were normalized according to log2. Later, the Survival R package from CRAN (https://rweb.stat.umn.edu/R/site-library/survival/html/00Index.html) was used to assess the relationships of DElncRNAs with patient OS through univariate Cox proportional hazards regression. lncRNAs with P-value < 0.05 were identified to be the potential variables, which were incorporated into the stepwise multivariate Cox regression examined through Akaike Information Criterion (AIC), which help to determine the optimal exchange between model complexity and the optimal informative and explanatory effectiveness.

Risk stratification, survival curve and concordance index (C-index)
The riskscore values for all patients were determined by the following formula (Risk score = 0.524052278 × HOXB-AS3-0.663887578 × LINC01679-0.504710478 × PRRT3-AS1 + 1.092940489 × YEATS2-AS1) on the basis of multivariate Cox regression. Using the as-prepared risk scoring system, all cases were classified as high-or low-risk group by median riskscore value. Thereafter, heterogeneities of OS time between these two groups were examined through two-sided log-rank test. The Kaplan-Meier (K-M) method was employed to plot the OS curves for these two groups. Besides, the AUC values were determined for comparing the model sensitivity and specificity in predicting OS by the "survival ROC" of R package. The model discrimination ability was evaluated by calculating the C-index.

Predictive independence of our 4 lncRNAs-based nomogram for survival rate from additional clinicopathological factors
The independence of our 4 lncRNAs-based nomogram from additional clinicopathological factors (such as age, Gleason score, and TNM classification) in predicting OS was examined by univariate as well as multivariate Cox regression. OS was used to be a dependent variable, whereas the constructed lncRNA nomogram together with additional common clinical factors were used to be the independent variables.

Joint analysis of the four-lncRNA signature with Gleason score
In order to test whether our model could accurately predict the prognosis of patients with Gleason score � 8, a stratified analysis was carried out. To verify whether our lncRNA model could enhance Gleason score's accuracy in PCa survival risk prediction, the ROC curves of the three comparisons was drawn and the AUCs were calculated.

Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analyses
Using the z-test and two-sided Pearson correlation coefficients, Pearson correlation coefficients were examined between those 4 lncRNAs expression and PCGs, so as to discover the possible biological processes together with related pathways involving those predictive

Identification of DElncRNAs between PCa and non-carcinoma tissue samples
Based on thresholds of |log2FC| > 1.0 and false discovery rate (FDR) q < 0.05, altogether 451 lncRNAs, among which, 307 were up-regulated whereas 144 were down-regulated, were discovered to show different expression between PCa and the non-carcinoma tissue samples, and they were applied in later stepwise survival analysis (S1 Table). The expression profiles were intuitively reflected by volcano plots (Fig 2). According to the DElncRNAs profiles, the unsupervised hierarchical cluster analysis was carried out, which suggested the possibility to distinguish between PCa and non-carcinoma samples (S1 Fig). To identify prognosis-related lncRNAs which are associated with patients' OS in PCa, the lncRNA expression profiles were evaluated by univariate Cox regression analysis data. 36 lncRNAs related to OS were selected at the threshold of 0.05 (S2 Table), which were then applied for stepwise multivariate Cox regression analysis, and four lncRNAs therein (HOXB-AS3, LINC01679, PRRT3-AS1,  Table 3) were ultimately screened out from the 451 lncRNAs identified before to establish a predictive model. Expression profiles of those 4 lncRNAs were integrated with the related regression coefficients to construct a prognosis nomogram. Based on the above-mentioned analysis, the riskscore formula was the sum of 4 lncRNAs expression levels weighted by the corresponding relative regression coefficients obtained upon multivariate Cox regression, as shown below: survival risk score = (0.524052278 × HOXB-AS3 expression) + (1.092940489 × YEATS2-AS1 expression) + (-0.663887578 × LINC01679 expression) + (-0.504710478 × PRRT3-AS1 expression). Of these four lncRNAs, two had positive coefficients upon multivariate Cox regression analysis, associated with high risk because the up-regulated level indicated the reduced patient

PLOS ONE
OS (HOXB-AS3 and YEATS2-AS1, Coef > 0) and the remaining two lncRNAs (LINC01679 and PRRT3-AS1, Coef < 0) were shown negative coefficients upon Cox regression analysis, which indicated that such lncRNAs were protective, because cases having up-regulate lncRNAs levels tended to show extended OS relative to patients having decreased expression (Fig 3).

Favorable performance for the 4-lncRNA prognostic model in the prediction of OS for PCa cases
The 493 cases were classified as high-(n = 246) or low-risk (n = 247) group in line with the median riskscore (0.943) obtained based on those 4 lncRNAs expression levels (also defined as the survival risk score, SRS) in the TCGA database (Fig 4 and S3 Table). According to the expression levels of these 4 lncRNAs in the ICGC database, the median SRS was 1.561, and 392 patients were divided into the high (n = 196) group and the low (n = 196) group (Fig 5 and S4  Table). Difference in survival was determined by log-rank test. The K-M method was used for survival analysis. It was illustrated from  (Fig 8), and the 1-year, 3-year and 5-year AUC of ICGC database were 0.946, 0.928 and 0.905 respectively (Fig 9),

Contrast of the four-lncRNA signature with Gleason score
Hierarchical analysis showed that PCa with Gleason score > 7 could be further stratified as 2 groups that had different survival by the nomogram (TCGA log-rank test, P = 2.359e-02, Fig  12; ICGC log-rank test, P = 3.847e-06, Fig 13).  Fig 15).

Identification of the prognostic lncRNAs signature associated biological functional characteristic
After determining the associations between those 4 lncRNAs and the PCGs, this study selected co-expression between 1 of those 4 lncRNAs and 4080 PCGs (|Pearson correlation coefficient| > 0.4, P-value < 0.05, and q-value < 0.05). Afterwards, GO and KEGG analyses were conducted for the lncRNAs-related PCGs to display the possible functions for those 4 prognostic lncRNAs. The 4080 PCGs were most significantly enriched into 28 GO terms and they were  (Fig 16A and  S2 Fig and S5 Table). Three KEGG pathways were enriched which are mainly concerned with the interaction and binding between cells or between cells and ECM, and cell proliferation

Discussion
Cancer risk assessment tools are essential for individualized clinical diagnosis and personalized treatment. However, the traditional clinicopathological factors, risk stratification among PCa cases represented by Gleason Score face great challenges [8]. Therefore, it needs to be addressed to establish more sensitive and effective prediction model for PCa as soon as possible. Accumulating lncRNA research publications has provided us with a new perspective and avenue for diagnosing and treating diseases like cancers [4]. Evidence is mounting that a variety of spectrum of disease progression, including tumors, is often accompanied by aberrant expression of lncRNAs, which means it may be used to be the factor to independently predict prognosis for these diseases [9]. Hence, the present work thoroughly analyzed lncRNA profiles in PCa tissues together with the matched non-carcinoma tissues against TCGA and ICGC database. Our use of TCGA and ICGC RNA sequencing data ensured the maximum detection of lncRNAs. The 4 lncRNAs-based nomogram exhibits good discriminating ability, and the

PLOS ONE
or adjuvant therapy, and the low-risk ones may receive active surveillance. As a result, treatments can be evaluated more effectively. The four-lncRNA nomogram (HOXB-AS3, LINC01679, PRRT3-AS1 and YEATS2-AS1) was shown to be associated with patient prognosis and risk stratification upon univariate as well as multivariate COX analysis that incorporated the common clinicopathological risk factors such as age, TNM classification and Gleason score for PCa. Gleason score represents a potent marker for evaluating PCa patient prognosis. Our multivariate COX analysis also indicated that only Gleason score and our model were good enough to predict the Overall survival risk of PCa. For PCa, Gleason score has always been adopted as an important criterion for adjuvant chemoradiotherapy or other treatments. Our study demonstrates that the four lncRNAs signature is more capable of being considered as reference factor for adjuvant therapy. Patients who had a high Gleason score may have the aggressive PCas with dismal prognosis [10]. However, Gleason score could not sufficiently accurately divide the survival risk and stage and predict prognosis of patients as a separate diagnostic indicator. Moreover, not all high-grade PCas (Gleason score � 8) are at high risk and require further adjuvant therapy [11]. As revealed by the stratification analysis, the as-constructed model assisted in classifying PCa cases with Gleason score � 8 to high-or low-risk group. Besides,

PLOS ONE
our 4 lncRNAs-based model better increased the distinguishing power of Gleason score, which indicated that our constructed model helped to enhance the prediction accuracy for PCa survival risk.
Several previous studies identify lncRNAs as the excellent predictors for PCa survival. However, they were all confined to PCa biochemical recurrence patients [12][13][14]. In addition, in agreement with our data, Li Fan et al. found that silence of PRRT3-AS1 suppresses the proliferation of PCa cells and boost their autophagy and apoptosis [15]. It is also suggested previously that, HOXB-AS3 shows tight association with the dismal prognosis for numerous cancer types [16][17][18]. The above findings suggest that, HOXB-AS3 and PRRT3-AS1 can be used to be the prognostic biomarkers for numerous cancer types. Therefore, the above studies have revealed the nomogram reasonability and reliability. Moreover, such lncRNAs show possible significance in molecular targeted treatments. However, so far, little research is carried out on LINC01679 and YEATS2-AS1. Future studies should concentrate on such lncRNAs and examine the functions within PCa. Moreover, such lncRNAs possibly display possible significance for molecular targeted therapy. A majority of lncRNAs have not been functionally annotated within PCa so far, we performed the biological function clustering and associated biological signaling pathway analysis on all four lncRNAs.

PLOS ONE
Accumulating evidence shows that lncRNAs participate in a wide range of biological processes through modulating mRNA levels epigenetically, transcriptionally, and post-transcriptionally. Nonetheless, there are still many functional annotation of lncRNAs in PCa to be further explored. This work suggested that the related biological pathway and function enrichment of those four lncRNAs by GO and KEGG. The main functional sites of our lncRNA signature are extracellular matrix and membrane, which are associated with the adhesion, activation, and transport of cellular active substances across the extracellular matrix or cell membrane. The possible molecular function analysis can shed light on future study to examine PCa incidence and development mechanisms.
Our study used genetic and clinical data from two databases, TCGA and ICGC, which covered a wide range of fields and provided a solid database for our research. At the same time, several limitations need to be acknowledged in the current study. First, the nature of

PLOS ONE
retrospective study, such as inadequate data, is inevitable. Secondly, our analysis was based on public data without experimental verification.

Conclusion
To sum up, some DElncRNAs between PCa and non-carcinoma tissue samples are identified in this study. Then, the 4 lncRNAs-based signature is constructed to predict the OS for PCa PLOS ONE through bioinformatic analysis. As discovered by our results, our as-constructed prediction signature contributes to the effective classification of PCa as high-and low-risk groups. This prediction signature outstandingly improves the Gleason score performance in distinguishing PCa, and it can also be used to discriminate PCa (Gleason > 7). Our constructed prediction signature will assist the clinicians in taking individualized clinical treatment. However, more studies are needed to verify our results and to examine the predicting ability of our signature for adjuvant therapy safety and efficacy. Besides, functional study is also needed to further understand molecular mechanisms underlying PCa.