PenDA, a rank-based method for personalized differential analysis: Application to lung cancer

doi:10.1371/journal.pcbi.1007869

Fig 1.

The PenDA method.

(a) Violin plots of the distributions of Spearman correlation between pairs of samples taken from the TCGA database on lung adenocarcinoma: between two non-tumorous samples (ctrls vs ctrls, n = 4,656 pairs), between two tumorous samples (ADC vs ADC, n = 103,285), between paired normal and tumorous samples (paired ctrls-ADC, n = 48), and between unpaired controls and tumors (ctrls vs ADC, n = 44,135). The p-values shown correspond to Wilcoxon tests. (b) Basic scheme depicting the PenDA method. (Top) For each gene g, the algorithm infers the sets of genes whose expression is always lower (L(g)) or higher (H(g)) than that of g in a pool of control reference samples. (Bottom) In a given individual (tumor) sample, g is viewed as deregulated if its relative ordering with respect to the genes in the L(g) and H(g) lists is modified. (c) Examples of genes in the L (g′₁, top) or H (g′₃, bottom) lists of a gene g. While the individual distributions of gene expression in the control samples may overlap (left), the distribution of the difference in gene expression in controls (right) is always positive or negative for genes in the L and H lists, respectively.
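
The scheme in panel (b) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `infer_lists` and `call_gene`, the dictionary-based data layout, and the way the proportion threshold `h` is applied to the fraction of modified orderings are assumptions made for illustration only.

```python
def infer_lists(controls):
    """Infer L(g) and H(g) for every gene from a pool of control samples.

    controls: list of dicts mapping gene name -> expression value
    (hypothetical layout, one dict per control sample).
    """
    genes = list(controls[0])
    lower = {g: [] for g in genes}   # L(g): genes always below g in controls
    higher = {g: [] for g in genes}  # H(g): genes always above g in controls
    for g in genes:
        for other in genes:
            if other == g:
                continue
            if all(s[other] < s[g] for s in controls):
                lower[g].append(other)
            elif all(s[other] > s[g] for s in controls):
                higher[g].append(other)
    return lower, higher


def call_gene(sample, g, lower, higher, h=0.5):
    """Call g 'up', 'down' or 'unchanged' in one individual sample.

    Assumption: g is called deregulated when the fraction of genes in
    H(g) (resp. L(g)) whose ordering with g is reversed reaches h.
    """
    low, high = lower[g], higher[g]
    # Genes of L(g) now above g are evidence that g went down.
    frac_down = sum(sample[x] > sample[g] for x in low) / len(low) if low else 0.0
    # Genes of H(g) now below g are evidence that g went up.
    frac_up = sum(sample[x] < sample[g] for x in high) / len(high) if high else 0.0
    if frac_up >= h and frac_up >= frac_down:
        return "up"
    if frac_down >= h:
        return "down"
    return "unchanged"
```

Because only the relative ordering of genes within one sample is used, a sketch of this kind does not depend on the absolute scale of the expression values.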

Fig 2.

Parameter analysis and predictive power.

ROC curves (true positive rate TPR vs false positive rate FPR) of the PenDA method on simulated datasets. The curves were obtained by varying the proportion threshold h for various values of the other method parameters or of properties of the investigated dataset. Insets show the maximal informedness, that is, the maximal value of the difference TPR − FPR computed for each ROC curve. (a) Effect of the maximal size l of the L and H lists. (b) Impact of the number of control samples used to infer the L and H lists. (c) Effect of the total number of genes in the dataset. (d) Impact of the proportion of deregulated genes in the tumorous samples.
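
As an illustration of the quantity shown in the insets, the following Python sketch computes ROC points by sweeping a threshold and then takes the maximal value of TPR − FPR. The function names and the input layout (a score per gene and a 0/1 ground-truth label) are hypothetical and only serve the example.

```python
def roc_points(scores, labels, thresholds):
    """Return (FPR, TPR) pairs, one per threshold.

    scores: per-item scores; an item is called positive when score >= threshold.
    labels: ground truth, 1 for truly deregulated and 0 otherwise.
    Assumes both classes are present in labels.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = []
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def max_informedness(points):
    """Maximal value of TPR - FPR over all points of a ROC curve."""
    return max(tpr - fpr for fpr, tpr in points)
```

A value of 1 corresponds to a threshold that separates the two classes perfectly, while a value of 0 corresponds to a curve lying on the diagonal.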

Fig 3.

Comparison with other methods.

(a) ROC curves on the same simulated dataset (normalized data, 97 control samples) as used in Fig 2 for PenDA, a simple percentile-based method, two versions of RankComp, and DESeq2. (b) As in (a), but the reference pool was composed of only 10 control samples. (c) As in (a), but the data were not normalized.

Fig 4.

Overview of genetic deregulation in adenocarcinoma and squamous cell carcinoma.

(a) The percentage of deregulated genes in ADC (left panel) and SQCC (right panel) patients. The percentage of up-regulated genes is indicated in red, the percentage of down-regulated genes in blue, and the total percentage of deregulated genes (up + down) in black. Patients are ordered by increasing total number of deregulated genes. (b, c) Scatterplots of the percentage of deregulated patients for each gene in the ADC cohort (x-axis) versus the percentage of deregulated patients in the SQCC cohort (y-axis). The left panel (b) represents down-regulation events and the right panel (c) represents up-regulation events. Colored points represent significant differences between the ADC and SQCC cohorts (two-sided two-proportion z-test, p-value < 0.05 after Bonferroni correction for 18,143 multiple tests).
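
The test named in panels (b, c) can be sketched in Python as follows. This is a generic two-sided two-proportion z-test with a pooled standard error followed by a Bonferroni adjustment; the pooled form, the function names and the example counts are assumptions, since the caption does not give these details.

```python
import math


def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided z-test for the equality of two proportions x1/n1 and x2/n2.

    Uses the pooled standard error (an assumption; the caption does not
    say which variant was used). Returns (z, p_value).
    """
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2))
    if se == 0.0:
        return 0.0, 1.0  # both proportions are 0 or 1: no evidence of a difference
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


def bonferroni_significant(p_value, n_tests, alpha=0.05):
    """True if the p-value stays below alpha after Bonferroni correction."""
    return min(1.0, p_value * n_tests) < alpha
```

For example, `bonferroni_significant(p, 18143)` would apply the correction for the number of tests quoted in the caption to a single gene-level p-value `p`.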

Fig 5.

The gene deregulation pattern.

(a-b) Scatterplots of the percentage of up-regulated versus down-regulated patients in the ADC (left panel) and SQCC (right panel) cohorts. Each dot corresponds to one gene. The x-axis indicates the percentage of up-regulation within the cohort, the y-axis indicates the percentage of down-regulation within the cohort. The contour lines correspond to the density of genes. Genes that are significantly differentially expressed at the individual level (t-statistic, q-value < 0.05) are represented using the following color code: green genes are super-conserved (SC), blue genes are super-down-regulated (SD), red genes are super-up-regulated (SU), and other genes are depicted in gray. (c) Venn diagrams indicating the total number of SC, SU and SD genes in the ADC and SQCC cohorts. (d-f) (Top panels) Distributions of gene expression levels (normalized counts) for three representative genes (the SC gene CAPS in (d), the SU gene ESPRP1 in (e), the SD gene RILPL2 in (f)) in the ADC cohort (yellow), in the SQCC cohort (purple), and for the control patients (gray). The dashed lines represent the mean expression levels. (Bottom panels) The corresponding percentages of patients deregulated for each gene shown in the ADC and SQCC cohorts are represented by bar plots: gray for non-deregulated patients, blue for down-regulated patients and red for up-regulated patients.

Fig 6.

Genetic deregulations efficiently classify cancer histologies.

(a, b) Principal Component Analysis on TCGA non-small-cell lung cancers (ADC and SQCC cohorts) using the normalized count matrix (a) or the PenDA differential expression matrix (b) as input. Solid lines represent the decision boundary between ADC and SQCC histologies (using a linear SVM classifier on the first two principal components). Dashed lines represent the upper and lower margins of the decision boundary. Each symbol represents an individual sample (orange crosses for ADC, purple triangles for SQCC). (c) At the bottom, the bar plot represents the histology predictions based on the SVM classifier. The SVM on PenDA correctly predicts 95% of ADCs and 93% of SQCCs. The SVM on counts correctly predicts 92% of ADCs and 92% of SQCCs. (d) Heatmap of the PenDA differential expression matrix restricted to a specific set of classifier genes (n = 875) in TCGA non-small-cell lung cancers: ADC (orange) and SQCC (purple). Two hierarchical clustering analyses were performed: one using Euclidean distance to sort genes and one using Pearson correlation-based distance to classify patients, with a complete linkage function in both cases. ADC subclasses (color-coded, class I to III) are defined according to the dendrogram cutoff n = 3 groups (cutting section = green dashed line). (e) Graphical representation of the contingency table between ADC subtypes (Chen et al.) and ADC subclasses (PenDA analysis). Each bar plot represents the total number of patients in each cell of the table.

Fig 7.

Upregulation of 37 genes in adenocarcinoma is a strong predictor of poor prognosis.

(a) Principal Component Analysis on the ADC cohort. Each cross represents an individual sample. The color of each cross indicates one of the three subclasses defined in Fig 6. (b) Survival of ADC patients classified according to the 2 main subtypes (classes II and III). (c) The percentage of deregulated patients within the ADC class II (y-axis) or the ADC class III (x-axis). Each dot corresponds to one gene. The contour lines correspond to the density of genes. Pink dots indicate genes with a significantly higher proportion of deregulation in class II (proportion test, p-value < 0.05 after Bonferroni correction for multiple testing). Red dots define 37 genes highly deregulated (>75%) in the class II group and lowly deregulated (<25%) in the class III group. (d) (Top) Classification of ADC TCGA-LUAD patients built on the total number of up-regulated genes among the subset of 37 classifiers defined in (c). Patients are separated into 3 discrete groups: a group with low up-regulation (black, score < 4), a group with intermediate deregulation (gray, 4 ≤ score < 34) and a group with most genes up-regulated (red, 34 ≤ score). (Bottom) Survival of patients according to these 3 groups. (e) As in (d) but for ADC Grenoble Hospital patients. Patients are separated into 3 discrete groups: a group with low up-regulation (black, score ≤ 0), a group with intermediate deregulation (gray, 0 < score < 15) and a group with most genes up-regulated (red, 15 ≤ score).
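
The scoring rule of panels (d) and (e) can be sketched in Python as below. The cutoffs are those quoted in the caption; the function names, the `"up"` call encoding and the group labels are hypothetical choices for the example, and the sketch assumes integer scores.

```python
def upregulation_score(calls, signature):
    """Count how many signature genes are called up-regulated in one patient.

    calls: dict mapping gene name -> call ("up", "down" or "unchanged"),
    a hypothetical encoding of one patient's deregulation profile.
    signature: iterable of gene names (37 genes in the caption).
    """
    return sum(1 for gene in signature if calls.get(gene) == "up")


def score_group(score, low_cut, high_cut):
    """Assign a patient to one of three groups from an integer score.

    low: score < low_cut; intermediate: low_cut <= score < high_cut;
    high: score >= high_cut.
    """
    if score < low_cut:
        return "low"
    if score < high_cut:
        return "intermediate"
    return "high"


# Cutoffs as quoted in the caption (integer scores assumed):
# TCGA-LUAD: low if score < 4, high if score >= 34.
TCGA_CUTS = (4, 34)
# Grenoble: low if score <= 0 (i.e. score < 1), high if score >= 15.
GRENOBLE_CUTS = (1, 15)
```

With these definitions, `score_group(score, *TCGA_CUTS)` reproduces the three groups of panel (d), and `score_group(score, *GRENOBLE_CUTS)` those of panel (e).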

Fig 8.

Synergic effects of gene deregulation within a protein complex.

(a) Heatmap showing the distribution of deregulations of the genes coding for the GINS complex in the ADC cohort. Patients are ordered from left to right according to an increasing number of gene deregulations within the GINS complex. The patients were separated into discrete deregulation groups: 0 up-regulations (-), 1–3 up-regulations (+) and 4 up-regulations (+++). (b) Survival of ADC patients according to the deregulation groups defined in (a). (c) Cox regression p-values associated with different models (multivariate and univariate). Cox regression is applied to the PenDA deregulation matrix (triangles) or to the expression matrix (ticked boxes, normalized count values). ALL corresponds to a multivariate Cox model including the four genes of the GINS complex. The red line corresponds to the significance level of 0.01.
