Fig 1.
Step one: normalize the gene expression values and calculate an expression score for each gene set in each sample. Step two: identify significant gene sets (e.g., pathways) with Fisher’s exact test; for each gene set, count the scores above and below the average score in controls and in cases, and construct a 2x2 contingency table from these counts. Step three: extract the expression scores of the significant gene sets and subject them to IGSA clustering (the similarity measure is SMIC).
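Step two can be illustrated with a minimal sketch for a single gene set. It assumes that "average score" means the mean over controls and cases pooled together, that scores equal to the mean are counted as "below", and that the test is two-sided; the caption states none of these. The function names and the example scores are hypothetical.

```python
from math import comb

def contingency_table(control_scores, case_scores):
    """Return [[controls above, controls below], [cases above, cases below]]."""
    pooled = list(control_scores) + list(case_scores)
    mean = sum(pooled) / len(pooled)  # assumption: average over all samples

    def split(scores):
        above = sum(1 for s in scores if s > mean)
        return [above, len(scores) - above]  # ties with the mean count as "below"

    return [split(control_scores), split(case_scores)]

def fisher_exact_two_sided(table):
    """Two-sided Fisher's exact test p value for a 2x2 table of counts."""
    (a, b), (c, d) = table
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum over all tables that are at most as probable as the observed one.
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))

# Hypothetical scores of one gene set in four controls and four cases.
table = contingency_table([1, 2, 3, 2], [8, 9, 7, 10])
print(table, fisher_exact_two_sided(table))
```

In practice a library routine such as `scipy.stats.fisher_exact` would normally replace the hand-written test; the pure-Python version is used here only to keep the sketch dependency-free.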
Fig 2.
The workflow of IGSA clustering.
Step one: create an empty seed set and an empty candidate set. Step two: construct a start seed by calculating the average expression value of each significant gene set in the normal samples, add the start seed to the seed set, and add all of the disease samples to the candidate set. Step three: calculate the average similarity of each sample in the candidate set to all samples in the seed set, and move the sample with the highest similarity score from the candidate set to the seed set. Step four: repeat step three until the candidate set is empty.
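The four steps above can be sketched as a greedy ordering procedure. SMIC is not defined in this caption, so the sketch substitutes a placeholder similarity (negative Euclidean distance); the function names, the tie-breaking rule (first candidate wins) and the example data are likewise assumptions.

```python
from math import dist

def similarity(u, v):
    # Placeholder, NOT SMIC: negative Euclidean distance (larger = more similar).
    return -dist(u, v)

def igsa_order(normal_samples, disease_samples, sim=similarity):
    """Return (order, scores): disease sample indices in the order in which
    they join the seed set, with the average similarity each had on joining."""
    n_sets = len(normal_samples[0])
    # Step two: the start seed is the per-gene-set mean over the normal samples.
    start_seed = [sum(s[j] for s in normal_samples) / len(normal_samples)
                  for j in range(n_sets)]
    seeds = [start_seed]                             # seed set
    candidates = dict(enumerate(disease_samples))    # candidate set
    order, scores = [], []

    def avg_sim(idx):
        # Average similarity of one candidate to every sample in the seed set.
        return sum(sim(candidates[idx], s) for s in seeds) / len(seeds)

    # Steps three and four: move the most similar candidate until none remain.
    while candidates:
        best = max(candidates, key=avg_sim)
        scores.append(avg_sim(best))
        order.append(best)
        seeds.append(candidates.pop(best))
    return order, scores

# Hypothetical data: two gene-set scores per sample.
normal = [[0.0, 0.0], [2.0, 2.0]]
disease = [[10.0, 10.0], [2.0, 2.0], [5.0, 5.0]]
print(igsa_order(normal, disease))
```

The returned order places the samples most similar to the normal profile first; the returned scores are the kind of per-sample average similarity values that could then be plotted against that order.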
Fig 3.
The comparison of the seven methods by accuracy (average accuracy in three cancer-related datasets) and the proportion of significant pathways supported by papers found in three cancer-related datasets.
IGSA was more robust and sensitive in finding significant pathways than the other methods. Although the accuracy of DAVID and SPIA was slightly higher than that of IGSA, both DAVID and SPIA found only a subset of the significant pathways.
Fig 4.
The clustering of samples in hepatitis datasets.
The blue curves show the average similarity scores of the clustered samples. The red curves are loess curves fitted to the similarity scores. The gray vertical lines divide the samples at the inflection points of the red curve. Samples of the most serious disease (NASH) tended to cluster on the right (class 3). Samples of less severe disease (steatosis) tended to cluster in the middle (class 2). Most healthy obese samples tended to cluster on the left (class 1). To some extent, the clustering may reveal the disease severity of the samples in the hepatitis datasets.
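One way to place such dividing lines can be sketched as follows. The caption does not say how the flex (inflection) points were located, so the sketch assumes they are approximated by sign changes in the second difference of the smoothed curve, and that the loess values are already available as a list in clustering order. The function names, the `tol` parameter and the example curve are hypothetical.

```python
def flex_points(smoothed, tol=0.0):
    """Indices at which the second difference of the curve changes sign."""
    cuts = []
    last_sign = 0
    for i in range(1, len(smoothed) - 1):
        d2 = smoothed[i + 1] - 2 * smoothed[i] + smoothed[i - 1]
        if abs(d2) <= tol:
            continue  # curvature too small to call: wait for a clear sign
        sign = 1 if d2 > 0 else -1
        if last_sign != 0 and sign != last_sign:
            cuts.append(i)  # first index where the new curvature sign appears
        last_sign = sign
    return cuts

def split_classes(ordered_samples, cuts):
    """Divide the ordered samples into consecutive classes at the cut indices."""
    bounds = [0] + list(cuts) + [len(ordered_samples)]
    return [ordered_samples[a:b] for a, b in zip(bounds, bounds[1:])]

# Hypothetical smoothed curve with a single inflection between indices 4 and 5.
curve = [(x - 4.5) ** 3 for x in range(10)]
cuts = flex_points(curve)
print(cuts, split_classes(list("abcdefghij"), cuts))
```

On real, noisy fits a positive `tol` would likely be needed to avoid spurious cuts where the curvature is close to zero.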
Fig 5.
The comparison of IGSA clustering with different similarity measures.
The blue points represent the survival time of the samples, and the blue lines are linear fits to those points. The green curve shows the average similarity scores of the clustered samples. The red curves are loess curves fitted to the similarity scores. The gray vertical lines divide the samples at the inflection points of the red curve. (A) IGSA clustering based on SMIC, applied to the ovarian cancer data set (batch 9) based on pathways. (B) IGSA clustering based on Euclidean distance, applied to the ovarian cancer data set (batch 9) based on pathways. Survival time tended to decrease in both A and B. However, the red curve in B was too smooth to divide the samples into different disease classes.
Fig 6.
The classification comparison of IGSA, HCBP (hierarchical clustering based on pathways) and HCBG (hierarchical clustering based on genes) in ovarian cancer datasets (TCGA batch 9).
A shows the survival time curves of the three classes obtained by IGSA (p value of 0.0362). B shows the survival time curves of the three classes obtained by HCBP (p value of 0.187). C shows the survival time curves of the three classes obtained by HCBG (p value of 0.240). D shows the survival time curves of two classes (class 1 versus classes 2 and 3 combined) obtained by IGSA (p value of 0.0362). Unlike the p values for HCBP and HCBG, the p values in both A and D are significant.
Fig 7.
The survival analysis of ovarian cancer datasets (TCGA batch 9 and batch 40).
Part A shows the survival time curves of two classes obtained by IGSA (p value of 0.0778). Part B shows the survival time curves of two classes obtained by IGSA based on the same significant pathways (13 paper-supported SUPs and 16 paper-supported SDPs; p value of 0.0364).
Fig 8.
The clustering of samples and significant pathways in gastric cancer datasets.
The x-axis shows the significant pathways in the gastric cancer data as clustered by IGSA. The y-axis shows the cases in the gastric cancer data as clustered by IGSA (samples whose pathway expression values were more similar to the average expression values of normal samples are closer to the origin). The dots mark pathways whose expression values in cancer samples were higher than the average level. The color of the dots, from blue to green, represents the potential progression of the cancer from mild to severe.