Ranking of cell clusters in a single-cell RNA-sequencing analysis framework using prior knowledge

doi:10.1371/journal.pcbi.1011550

Fig 1.

Flowchart of overall adopted methodology.

Step 1 (Basic Analysis): Basic scRNA-seq analysis using Seurat, resulting in enriched pathways and repurposed drugs. Step 2 (Prior Knowledge Acquisition): Defining prior knowledge. Two options are available: (i) proceed with all terms from a checklist of predefined terms obtained by querying MalaCards with a disease of interest, or (ii) take a hypothesis-driven approach, in which the user provides specific keywords/terms associated with the hypothesis under investigation for a de novo search across the supported databases. Step 3 (Mapping Basic Analysis to Prior Knowledge): Mapping step. This step merges the output of Steps 1 and 2 and assesses how well prior knowledge “maps” to the results obtained from the scRNA-seq analysis. First, prior knowledge pathways are mapped against the pathway enrichment results obtained from the scRNA-seq data. Second, prior knowledge drug names and/or drug modes of action (MOAs) are mapped against the drug repurposing results obtained from the scRNA-seq data using the CMAP database. Step 3 is performed for all cell types in the analysis. Step 4 (Cell Ranking): Scoring and ranking the cell types according to how well the data-driven output of pathway enrichment analysis and drug repurposing for each cell type “maps” to the predefined information provided by the expert user. Step 4 is split into two sub-steps, 4.1 (Cell Ranking using Pathways and Drugs) and 4.2 (Cell Ranking using CellChat). Step 4.1: Matching the position of the prior knowledge in the output (enriched pathways and repurposed drugs) of the scRNA-seq analysis and then taking the Euclidean distance between the matched positions. Step 4.2: Ranking the cells using cell-cell communication networks generated with CellChat. The number of interactions (edges) between cell types (nodes) is compared across the two networks (control vs. disease), and the cells are ranked by the log fold difference in interactions (LogFDI), taking into consideration both positive and negative fold changes. Finally, the union of the results is obtained (denoted by U above), taking into consideration the top 3 ranked cell types from Steps 4.1 and 4.2.
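
The final step of the flowchart, the union of the top 3 cell types from Steps 4.1 and 4.2, can be illustrated with a minimal Python sketch. The function name `top_k_union` and the cell-type labels are hypothetical, and the sketch assumes each ranking is a list ordered from best to worst; it is not the authors' implementation.

```python
def top_k_union(ranking_a, ranking_b, k=3):
    """Union of the top-k entries of two rankings, first-seen order preserved.

    Assumption: both rankings are lists ordered from best to worst.
    """
    union = list(ranking_a[:k])
    for cell in ranking_b[:k]:
        if cell not in union:
            union.append(cell)
    return union


# Hypothetical cell-type labels, for illustration only.
pathway_drug_ranking = ["Fibroblast", "Endothelial", "Macrophage", "T cell"]
cellchat_ranking = ["Macrophage", "Epithelial", "Fibroblast", "B cell"]

selected = top_k_union(pathway_drug_ranking, cellchat_ranking)
print(selected)
```

Cell types that appear in both top-3 lists are reported once, so the union holds between 3 and 6 cell types.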

Fig 2.

Riverplot showing an example mapping of prior-knowledge WikiPathways terms sourced from MalaCards to the WikiPathways terms obtained from enrichment analysis of the scRNA-seq data for a specific cell type.

Mapping is performed by matching the position of the prior knowledge (left) to the output (enriched pathways) of the scRNA-seq analysis (right) and then taking the Euclidean distance between the final vectors generated from the matched positions. Positions that are not matched/mapped at all receive an NA value. To avoid biases from cases with multiple NA terms, we used the Jaccard similarity (J) to penalize sparsely mapped vectors. This was done by dividing the Euclidean distance (E) by the Jaccard similarity (J) calculated from binary asymmetric variable vectors.
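
A minimal Python sketch of the penalized score E / J follows. The caption does not spell out how the position vectors or the binary vectors are built, so several conventions here are assumptions: positions are 1-based ranks, NA positions are dropped before computing E, and J is the fraction of prior-knowledge terms that were matched (the asymmetric binary Jaccard between an all-ones prior vector and the matched/unmatched indicator vector). The function name and the term labels are hypothetical.

```python
import math


def penalized_distance(prior_terms, enriched_terms):
    """Return (E, J, E / J) for one cell type under the assumptions stated above."""
    # 1-based position of each term in the scRNA-seq enrichment output.
    enriched_pos = {term: i for i, term in enumerate(enriched_terms, start=1)}
    # Matched position for each prior term; None plays the role of NA.
    matched = [enriched_pos.get(term) for term in prior_terms]
    pairs = [(i, pos) for i, pos in enumerate(matched, start=1) if pos is not None]
    if not pairs:
        # Nothing mapped: J is 0, so the penalized score is unbounded.
        return math.inf, 0.0, math.inf
    # Euclidean distance between prior positions and matched positions.
    e = math.sqrt(sum((i - pos) ** 2 for i, pos in pairs))
    # Assumed Jaccard similarity: matched terms / all prior terms.
    j = len(pairs) / len(prior_terms)
    return e, j, e / j


# Hypothetical pathway labels, for illustration only.
prior = ["p1", "p2", "p3", "p4"]
enriched = ["p2", "p1", "other", "p4"]
e, j, score = penalized_distance(prior, enriched)
print(e, j, score)
```

Under these assumptions a lower score means a closer match, and a cell type with few mapped terms (small J) has its distance inflated, which is the penalty the caption describes.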

Fig 3.

Example output from CellChat for a disease-control dataset.

A. Data is split into two networks: control and disease. B. Cell types are ranked by the log of the fold difference in their number of interactions (edges) between the disease and control networks (LogFDI), taking into consideration both positive and negative log fold changes. Node sizes are proportional to the number of cells in each cell group, and edge widths are proportional to the number of interactions between nodes.
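
The LogFDI ranking can be sketched in a few lines of Python. The caption does not give the log base or the handling of zero counts, so this sketch assumes log2 of the disease/control ratio, strictly positive interaction counts, and ranking by absolute value so that both gains and losses of interactions count. The function name, cell-type labels, and counts are hypothetical.

```python
import math


def rank_by_logfdi(control_edges, disease_edges):
    """Rank cell types by |log2(disease interactions / control interactions)|.

    Assumptions: log base 2, strictly positive counts, and the same cell
    types present in both networks.
    """
    scores = {}
    for cell, n_control in control_edges.items():
        n_disease = disease_edges[cell]
        if n_control <= 0 or n_disease <= 0:
            raise ValueError(f"non-positive interaction count for {cell}")
        scores[cell] = math.log2(n_disease / n_control)
    # Largest absolute change first, whether positive or negative.
    return sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)


# Hypothetical interaction counts per cell type.
control = {"T cell": 10, "B cell": 20, "NK cell": 8}
disease = {"T cell": 40, "B cell": 20, "NK cell": 1}

logfdi_ranking = rank_by_logfdi(control, disease)
print(logfdi_ranking)
```

Here the cell type that lost most of its interactions ranks above the one that gained interactions, because the ranking uses the magnitude of the log fold difference rather than its sign.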

Fig 4.

Discovery mode approach rankings across all 3 datasets.

A. LAM cell-rankings based on drugs (DRUG) and pathway databases (KEGG, GOBP, MSIG, WIKI, REACT). The boxplot shows the individual cell rankings obtained by our methodology. The order of the cells is defined by the average rankings across all six parameters used to generate the final ranking for the annotated cell types in the analysis. B. CellChat rankings of LAM dataset cell types based on LogFDI. C. ASD cell-rankings based on drugs and pathway databases (WIKI and GOBP; KEGG, MSIG and REACT were not successfully mapped to prior knowledge for this case study). The boxplot shows the individual rankings obtained by our methodology. The order of the cell types is defined by the average rankings across the three parameters that attained mapping information in the analysis. D. CellChat rankings of ASD dataset cell types based on LogFDI. E. COVID-19 cell-rankings based on drugs (DRUG) and pathway databases (KEGG, GOBP, MSIG, WIKI, REACT). The boxplot shows the individual rankings obtained by our methodology. The order of the cells is defined by the average rankings across all six parameters used to generate the final ranking for the annotated cell types in the analysis. F. CellChat rankings of COVID-19 dataset cell types based on LogFDI. Results from the bulk RNA-seq simulation are also shown (boxplots highlighted in red).
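
The ordering by average ranking can be illustrated with a short Python sketch. It assumes that rank 1 is the best rank and that a parameter with no mapping information (as for KEGG, MSIG and REACT in the ASD case) is simply left out of the average; the function name and the rank values are hypothetical.

```python
from statistics import mean


def order_by_mean_rank(ranks_by_parameter):
    """Order cell types by their mean rank across parameters (lowest mean first).

    Assumption: parameters with an empty rank table carried no mapping
    information and are skipped.
    """
    collected = {}
    for parameter, ranks in ranks_by_parameter.items():
        if not ranks:
            continue
        for cell, rank in ranks.items():
            collected.setdefault(cell, []).append(rank)
    averaged = [(cell, mean(values)) for cell, values in collected.items()]
    return sorted(averaged, key=lambda kv: kv[1])


# Hypothetical per-parameter ranks for three cell types.
ranks = {
    "DRUG": {"A": 1, "B": 2, "C": 3},
    "WIKI": {"A": 2, "B": 1, "C": 3},
    "GOBP": {"A": 1, "B": 3, "C": 2},
    "KEGG": {},  # no terms mapped to prior knowledge
}

final_order = order_by_mean_rank(ranks)
print(final_order)
```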

Fig 5.

Hypothesis-driven rankings across all 3 datasets.

A. LAM cell-rankings based on drugs (DRUG) and pathway databases (KEGG, GOBP, MSIG, WIKI, REACT). The boxplot shows the individual cell rankings obtained by our methodology. The order of the cells is defined by the average rankings across all six parameters used to generate the final ranking for the annotated cell types in the analysis. B. ASD cell-rankings based on drugs and pathway databases (KEGG, GOBP, MSIG). The boxplot shows the individual rankings obtained by our methodology. The order of the cells is defined by the average rankings across the three parameters that attained mapping information and were used to generate the final ranking for the annotated cell types in the analysis. C. COVID-19 cell-rankings based on drugs (DRUG) and pathway databases (KEGG, GOBP, MSIG, WIKI, REACT). The boxplot shows the individual rankings obtained by our methodology. The order of the cells is defined by the average rankings across all six parameters used to generate the final ranking for the annotated cell types in the analysis. Results from the bulk RNA-seq simulation are also shown (boxplots highlighted in red).

Fig 6.

Proportion of cell types across different conditions for the 3 datasets used to validate our methodology.

A. LAM dataset showing the proportion of cell types across conditions. B. ASD dataset showing the proportion of cell types across conditions. C. Brain COVID-19 dataset showing the proportion of cell types across conditions. Cells are ranked in descending order of the difference between the proportion of cells in disease and control samples.
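
A minimal Python sketch of this proportion-based ranking is given below. The caption does not say whether the difference is signed or absolute, so the sketch assumes the signed difference (disease proportion minus control proportion); the function name and the cell counts are hypothetical.

```python
def rank_by_proportion_shift(control_counts, disease_counts):
    """Rank cell types by (disease proportion - control proportion), descending.

    Assumption: the signed difference is used; swap in abs() for magnitude.
    """
    n_control = sum(control_counts.values())
    n_disease = sum(disease_counts.values())
    cells = sorted(set(control_counts) | set(disease_counts))
    shifts = {
        cell: disease_counts.get(cell, 0) / n_disease
        - control_counts.get(cell, 0) / n_control
        for cell in cells
    }
    return sorted(shifts.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical cell counts per condition.
control_cells = {"A": 50, "B": 50}
disease_cells = {"A": 80, "B": 20}

proportion_ranking = rank_by_proportion_shift(control_cells, disease_cells)
print(proportion_ranking)
```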

Fig 7.

Number of differentially expressed genes (DEGs) for each cell type across different conditions for the 3 datasets used to validate our methodology.

A. LAM dataset showing the number of DEGs for individual cell types across conditions. B. ASD dataset showing the number of DEGs for individual cell types across conditions. C. Brain COVID-19 dataset showing the number of DEGs for individual cell types across conditions. Cells are ranked in descending order of the total number of DEGs in control vs. disease samples.
