IGSA: Individual Gene Sets Analysis, including Enrichment and Clustering

Gene set analysis has been widely applied in high-throughput biological studies. One weakness of traditional methods is that they neglect the heterogeneity of gene expression across samples, which may lead to the omission of specific and important gene sets. It is also difficult for them to reflect disease severity or to provide gene set expression profiles for individuals. We developed a software application called IGSA that combines gene set enrichment with sample clustering. IGSA calculates gene set expression scores for each sample and uses an accumulating clustering strategy in which samples are gathered into the set in order of disease progression, from mild to severe. We evaluated the performance of IGSA on gastric, pancreatic and ovarian cancer data sets. We also compared the KEGG pathway enrichment results of IGSA with those of DAVID, GSEA, SPIA and ssGSEA, and analyzed the results of IGSA clustering under different similarity measures. Notably, IGSA proved to be more sensitive and specific in finding significant pathways, and it can indicate how pathways change with the severity of disease. In addition, IGSA provides a profile of significant gene sets for each sample.
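The two steps named above, per-sample gene set scores and accumulating clustering, can be illustrated with a minimal sketch. The abstract does not give the scoring formula or the similarity measure, so everything here is an assumption made for illustration: the score is taken to be the mean z-score of a set's genes relative to normal samples, Euclidean distance stands in for the similarity measure, and the function names `gene_set_scores` and `accumulate_order` are hypothetical rather than part of IGSA.

```python
import math


def gene_set_scores(expr, gene_sets, normal_ids):
    """Score each gene set in each sample.

    expr: {sample: {gene: value}}, gene_sets: {set_name: [genes]}.
    Assumption: score = mean z-score of the set's genes, with the mean and
    standard deviation taken from the normal samples.
    """
    genes = {g for members in gene_sets.values() for g in members}
    stats = {}
    for g in genes:
        vals = [expr[s][g] for s in normal_ids]
        mean = sum(vals) / len(vals)
        var = sum((v - mean) ** 2 for v in vals) / len(vals)
        # Fall back to 1.0 when a gene is constant across normal samples.
        stats[g] = (mean, math.sqrt(var) or 1.0)
    return {
        s: {
            name: sum((profile[g] - stats[g][0]) / stats[g][1] for g in members)
            / len(members)
            for name, members in gene_sets.items()
        }
        for s, profile in expr.items()
    }


def accumulate_order(scores, normal_ids):
    """Greedy accumulation: start from the normal samples and repeatedly add
    the disease sample closest to the centroid of the set gathered so far."""
    set_names = sorted(next(iter(scores.values())))

    def vec(sample):
        return [scores[sample][name] for name in set_names]

    members = [vec(s) for s in normal_ids]
    remaining = [s for s in scores if s not in normal_ids]
    order = []
    while remaining:
        centroid = [sum(col) / len(members) for col in zip(*members)]
        nearest = min(remaining, key=lambda s: math.dist(vec(s), centroid))
        order.append(nearest)
        members.append(vec(nearest))
        remaining.remove(nearest)
    return order
```

Under these assumptions, a sample whose gene set scores lie close to the normal reference is gathered early and a strongly deviating sample is gathered late, which is the mild-to-severe ordering the abstract describes.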

The gray vertical lines separate the samples at the flex points of the red curve. The first four panels (A, B, C, D) show the IGSA clustering based on SMIC, Euclidean distance, Pearson's correlation and Spearman's rank correlation, respectively, applied to the ovarian cancer data set (batch 9) based on pathways. The last four panels (E, F, G, H) show the survival analysis of the classes obtained by the IGSA clustering based on SMIC, Euclidean distance, Pearson's correlation and Spearman's rank correlation, respectively. The survival time tended to decrease with all the methods (A, B, C, D). However, compared with Euclidean distance (F) (p value of 0.061), Pearson's correlation (G) (p value of 0.049) and Spearman's rank correlation (H) (p value of 0.049), the result for SMIC (E) (p value of 0.018) was more pronounced, although the clustering based on Pearson's correlation and Spearman's rank correlation also separated the disease samples significantly.

The red curves were loess curves obtained by fitting the similarity scores. The gray vertical lines divide the samples at the flex points of the red curve. A represents the IGSA clustering applied to the ovarian cancer data (batch 9). B represents the IGSA clustering applied to the ovarian cancer data (batch 40). The disease samples in both batches were divided into two classes, and according to the survival analysis, the difference between the two classes was significant (Figure 6 in article). The p values were 0.0778 and 0.0364 in batch 9 and batch 40, respectively.
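The captions report p values from survival analysis without naming the test. As an illustration only, the sketch below assumes a two-group log-rank test, a common choice for comparing survival between two classes; the function name `logrank_p` and its input layout are hypothetical.

```python
import math


def logrank_p(times_a, events_a, times_b, events_b):
    """Two-group log-rank test (assumed here, not confirmed by the source).

    times_*: follow-up times; events_*: 1 if death was observed, 0 if censored.
    Returns the p value from a chi-square distribution with 1 degree of freedom.
    """
    data = [(t, e, 0) for t, e in zip(times_a, events_a)]
    data += [(t, e, 1) for t, e in zip(times_b, events_b)]
    event_times = sorted({t for t, e, _ in data if e})

    obs_minus_exp = 0.0
    variance = 0.0
    for t in event_times:
        # Subjects still at risk in each group just before time t.
        n_a = sum(1 for u, _, g in data if u >= t and g == 0)
        n_b = sum(1 for u, _, g in data if u >= t and g == 1)
        # Observed deaths at time t.
        d_a = sum(1 for u, e, g in data if u == t and e and g == 0)
        d_b = sum(1 for u, e, g in data if u == t and e and g == 1)
        n, d = n_a + n_b, d_a + d_b
        obs_minus_exp += d_a - d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)

    if variance == 0:
        return 1.0
    chi2 = obs_minus_exp ** 2 / variance
    # Survival function of chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2))
```

Calling `logrank_p` on the survival times of two classes produced by a clustering yields a p value comparable in kind to those quoted in the captions, provided the original analysis used the same test.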

Part II The results of IGSA based on GO (Gene Ontology) gene sets
Figure D | Comparison of the accuracy of six enrichment analysis methods based on GO gene sets (SPIA cannot be used for GO enrichment analysis). The green columns represent the average accuracy of each of the six methods across three cancer-related datasets. The blue columns represent the proportion of significant GO gene sets found in the three cancer-related datasets that are supported by published papers. Compared with the other methods, IGSA identifies significant GO gene sets robustly and sensitively across different cancer types. Although the average accuracy of DAVID was slightly higher than that of IGSA, its proportion of paper-supported significant GO gene sets was very low, which means that DAVID found only a subset of the significant GO gene sets.

Figure E | The IGSA clustering of disease samples in hepatitis datasets based on GO gene sets (including BP gene sets and MF gene sets).
The blue curves show the average similarity scores of the clustered samples. The red curves are loess curves fitted to the similarity scores. The gray vertical lines divide the samples at the flex points of the red curve. Most samples of the most serious disease (NASH) tended to cluster on the right (class 3). Samples of less severe disease (steatosis) tended to cluster in the middle (class 2). Most healthy obese samples tended to cluster on the left (class 1). To some extent, the clustering may reveal the severity of disease of the samples in the hepatitis datasets.
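The way the classes are cut from the curves can be sketched as follows: smooth the similarity scores, locate the points where the curvature of the smoothed curve changes sign, and split the ordered samples there. This is a minimal sketch under stated assumptions: the smoother is a plain local linear regression with tricube weights (a simplified loess with no robustness iterations), flex points are detected from second differences, and the names `loess`, `flex_points`, `split_classes` and the parameters `frac` and `tol` are hypothetical, not IGSA's own.

```python
import math


def loess(y, frac=0.5):
    """Local linear regression with tricube weights over evenly spaced points."""
    n = len(y)
    if n < 3:
        return list(y)
    k = min(n, max(3, math.ceil(frac * n)))  # neighbours used per local fit
    fitted = []
    for i in range(n):
        # Bandwidth just beyond the k-th nearest index, so at least k points
        # receive a positive weight.
        h = sorted(abs(i - j) for j in range(n))[k - 1] + 1
        sw = swx = swy = swxx = swxy = 0.0
        for j in range(n):
            u = abs(i - j) / h
            if u >= 1:
                continue
            w = (1 - u ** 3) ** 3
            sw += w
            swx += w * j
            swy += w * y[j]
            swxx += w * j * j
            swxy += w * j * y[j]
        # Weighted least-squares line through the neighbourhood, evaluated at i.
        slope = (sw * swxy - swx * swy) / (sw * swxx - swx * swx)
        fitted.append((swy - slope * swx) / sw + slope * i)
    return fitted


def flex_points(curve, tol=1e-9):
    """Indices at which the second difference first takes the opposite sign."""
    points, prev_sign = [], 0
    for i in range(1, len(curve) - 1):
        d2 = curve[i + 1] - 2 * curve[i] + curve[i - 1]
        if abs(d2) <= tol:
            continue  # treat near-zero curvature as no information
        sign = 1 if d2 > 0 else -1
        if prev_sign and sign != prev_sign:
            points.append(i)
        prev_sign = sign
    return points


def split_classes(samples, cuts):
    """Split the ordered sample list into classes at the given cut indices."""
    bounds = [0, *cuts, len(samples)]
    return [samples[a:b] for a, b in zip(bounds, bounds[1:])]
```

With an ordered list of samples and their average similarity scores, `split_classes(samples, flex_points(loess(scores)))` returns one list per class, playing the role of the gray vertical lines in the figure.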

Figure I | The double clustering of samples and significant BP gene sets in gastric cancer datasets.
The x-axis was generated from the list of significant BP gene sets in the gastric cancer data, as ordered by IGSA clustering (samples whose BP gene set expression values were more similar to the average expression values of normal samples were closer to the origin). The y-axis was generated from the list of cases in the gastric cancer data, as ordered by IGSA clustering. The dots mark BP gene sets whose expression values in cancer samples were higher than the average level. The color of the dots, from blue to green, represents the potential progression (mild to severe) of the cancer.

Figure J | The double clustering of samples and significant MF gene sets in gastric cancer datasets.
The x-axis was generated from the list of significant MF gene sets in the gastric cancer data, as ordered by IGSA clustering (samples whose MF gene set expression values were more similar to the average expression values of normal samples were closer to the origin). The y-axis was generated from the list of cases in the gastric cancer data, as ordered by IGSA clustering. The dots mark MF gene sets whose expression values in cancer samples were higher than the average level. The color of the dots from blue to