CAncer bioMarker Prediction Pipeline (CAMPP)—A standardized framework for the analysis of quantitative biological data

doi:10.1371/journal.pcbi.1007665

Fig 1.

A. Diagram depicting the different types of methodologies and analyses employed by the CAncer bioMarker Prediction Pipeline (CAMPP). B. Diagram depicting the structure of CAMPP output, with folders and subfolders organized by analyses.


Table 1.

Table summarizing preliminary data management and analyses implemented in CAMPP, along with specific methods and underlying R-packages.


Fig 2.

The output of a CAMPP data check.

The gene used in this example is FAM27E2, randomly selected from the ten variable check plots. Top panel, from left: Cullen and Frey graph showing skewness and kurtosis of the normalized and transformed expression data, and histogram with different distribution models overlaid. Lower panel, from left: quantile-quantile plot and probability-probability plot.
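The two summary statistics behind a Cullen and Frey graph can be illustrated with a minimal sketch. The graph places an observed sample by its squared skewness (x-axis) and kurtosis (y-axis) relative to reference distributions. The function name and expression values below are hypothetical, and the simple moment-based estimators used here may differ from the estimators CAMPP's underlying R packages apply.

```python
def skewness_kurtosis(values):
    """Return (skewness, kurtosis) from simple moment-based estimators.

    Kurtosis is reported on the non-excess scale, where a normal
    distribution has kurtosis 3.
    """
    n = len(values)
    mean = sum(values) / n
    # Central moments of order 2, 3, and 4
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2


# Hypothetical transformed expression values for one gene
expr = [1.0, 2.0, 3.0, 4.0, 5.0]
skew, kurt = skewness_kurtosis(expr)

# Coordinates of the observation on a Cullen and Frey graph
cullen_frey_point = (skew ** 2, kurt)
```

A symmetric sample such as the one above has zero skewness, so its point sits on the left edge of the graph.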


Fig 3.

Results of gene selection using DEA and elastic-net regression.

The dataset contained ~15,000 genes and 80 samples; the groups used for the contrast were estrogen receptor-positive (n = 61) vs estrogen receptor-negative (n = 19) samples. Fig 3A is a multidimensional scaling plot showing the partitioning of samples (based on all genes), colored by estrogen receptor status. Fig 3B shows the overlap of results from elastic-net regression (alpha = 0.5) and differential expression analysis (DEA) with significance cutoffs of logFC > 1 or < -1 and FDR < 0.05. Fig 3C depicts the performance statistics for elastic-net regression, e.g., 10-fold cross-validation errors and area under the curve (AUC) scores for the test set. Elastic-net regression was run 10 times with different random seeds.
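The DEA cutoffs and the overlap shown in Fig 3B can be sketched as a filter followed by a set intersection. This is only an illustration of the stated thresholds (logFC > 1 or < -1 and FDR < 0.05); the gene names, values, and function name are hypothetical, and the elastic-net selection is taken as given rather than computed.

```python
def dea_hits(results, lfc_cutoff=1.0, fdr_cutoff=0.05):
    """Return genes passing both the absolute logFC and FDR cutoffs.

    `results` maps gene name -> (logFC, FDR).
    """
    return {
        gene
        for gene, (logfc, fdr) in results.items()
        if abs(logfc) > lfc_cutoff and fdr < fdr_cutoff
    }


# Hypothetical DEA output: gene -> (logFC, FDR)
dea_results = {
    "GENE_A": (2.3, 0.001),   # passes both cutoffs
    "GENE_B": (-1.4, 0.03),   # passes both cutoffs (down-regulated)
    "GENE_C": (0.4, 0.0001),  # fails the logFC cutoff
    "GENE_D": (1.8, 0.2),     # fails the FDR cutoff
    "GENE_E": (-2.0, 0.01),   # passes both cutoffs
}

# Hypothetical set of genes with non-zero elastic-net coefficients
elastic_net_selected = {"GENE_A", "GENE_C", "GENE_E"}

dea_selected = dea_hits(dea_results)
# Consensus set: genes selected by both approaches
consensus = dea_selected & elastic_net_selected
```

In this toy example, only GENE_A and GENE_E are selected by both methods.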


Fig 4.

The heatmap in Fig 4 shows the partitioning of breast cancer tissues into estrogen receptor-positive (ER+) samples and estrogen receptor-negative (ER-) samples, based on the consensus set of variables from differential expression analysis and elastic-net regression.

Green = ER+ samples and Purple = ER- samples. Color scale of heatmap (blue to yellow) denotes log2 fold change.


Fig 5.

Results of Weighted Gene Co-expression Network Analysis (WGCNA) on a dataset of ~15,000 genes and 80 samples.

As the dataset contained more than 5000 variables, WGCNA was performed in a block-wise manner to save computational time, in accordance with the WGCNA reference manual [44]. Fig 5A shows the module clustering tree for the first block as an example. Fig 5B depicts the co-expression heatmap for the small module 2, in which a set of six genes display highly correlated expression patterns. Fig 5C contains the top 25% (in this case five) most interconnected genes from the small module 2, with module interconnectivity scores.


Fig 6.

Plot showing the top 100 protein-protein (gene-gene) interaction pairs from the analysis of HER2-enriched vs Luminal A samples.

Colors denote the log fold change of a gene; yellow = up-regulated and blue = down-regulated. The size of a node shows the absolute log fold change, while the ordering from left to right denotes the degree of node interconnectivity. The width of an arc represents the interaction score from the STRING database.


Fig 7.

Results of correlation analysis with N-glycan abundances in interstitial fluids and paired serum samples.

The dataset contained a total of 103 samples (51 normal interstitial fluids and 52 tumor interstitial fluids) with ~70 N-glycan groups (165 N-glycans). Fig 7A shows the correlation scores for differentially abundant N-glycan groups (y-axis = Spearman correlation coefficient); three of these, GP1, GP37, and GP38, met the requirement for significance (correlation > 0.5 and FDR < 0.05). Fig 7B shows the individual correlation plots for the three significant N-glycan groups, with x-axis = tumor interstitial fluid abundance and y-axis = serum abundance.
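For readers unfamiliar with the statistic on the y-axis of Fig 7A, the following minimal sketch computes a Spearman correlation coefficient as the Pearson correlation of ranks, with ties given their average rank. The abundance values and function names are hypothetical, and the sketch checks only the correlation cutoff; the FDR requirement would additionally need p-values corrected for multiple testing, which are not computed here.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole group of tied values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical paired abundances for one N-glycan group
tif_abundance = [1.0, 2.0, 3.0, 4.0, 5.0]
serum_abundance = [2.1, 2.0, 3.5, 4.2, 6.0]

rho = spearman(tif_abundance, serum_abundance)
passes_correlation_cutoff = rho > 0.5
```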


Fig 8.

Results of survival analysis (Cox proportional hazards regression) with correction for patient age at diagnosis and tumor-infiltrating lymphocyte (TIL) status.

Survival analysis was run on the set of differentially abundant N-glycan groups. Only one N-glycan group, GP38, was significant after correction for multiple testing. Hazard ratios are displayed on a log2 scale with confidence intervals; x-axis = N-glycan groups and y-axis = log2 hazard ratio.
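The log2 scale used on the y-axis can be related to a Cox model's output with a short sketch. A Cox regression coefficient is the natural logarithm of the hazard ratio, so dividing by ln 2 expresses it in log2 units. The function name, the example coefficient, and the use of a Wald-type 95% interval are assumptions for illustration; the caption does not state how CAMPP computes its confidence intervals.

```python
import math


def log2_hazard_ratio(beta, se, z=1.96):
    """Convert a Cox coefficient and its standard error to the log2 scale.

    `beta` is the natural-log hazard ratio. Returns a tuple of
    (estimate, lower, upper) with an approximate Wald interval,
    all in log2 units.
    """
    scale = 1.0 / math.log(2)  # converts natural-log units to log2 units
    return (
        beta * scale,
        (beta - z * se) * scale,
        (beta + z * se) * scale,
    )


# Hypothetical coefficient: a hazard ratio of exactly 2 (beta = ln 2)
estimate, lower, upper = log2_hazard_ratio(beta=math.log(2), se=0.1)

# An interval that excludes 0 on the log2 scale corresponds to a
# hazard ratio interval that excludes 1 on the original scale.
interval_excludes_zero = lower > 0 or upper < 0
```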


Table 2.

Table showing run times and memory usage for CAMPP applied to datasets of different sizes.

As the weighted gene co-expression network analysis (WGCNA) and the estimation of the optimal number of clusters for k-means are by far the slowest and most memory-consuming processes, we have provided estimates with and without these two analyses. The [.] denotes that a given analysis was not performed on a dataset.
