Approximate distance correlation for selecting highly interrelated genes across datasets

doi:10.1371/journal.pcbi.1009548

Fig 1.

Schematic diagram of ADC.

Single-cell gene expression data are used as an example to illustrate ADC. The inputs X and Y are data matrices with matched genes. For each target gene, ADC selects the k genes with the highest Pearson correlation coefficients with the target gene and uses them to calculate the p value of the distance correlation (DC). ADC then applies the Benjamini-Hochberg (BH) adjustment to control the false discovery rate (FDR) and outputs the most highly interrelated genes.
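The two building blocks named in this caption, top-k Pearson neighbour selection and a distance correlation p value, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the genes-by-cells matrix layout, and the use of a permutation test for the p value are all assumptions, and the caption does not specify how ADC combines these pieces across X and Y.

```python
import numpy as np


def top_k_correlated(X, target, k):
    """Indices of the k rows (genes) of X most Pearson-correlated with row `target`."""
    corr = np.corrcoef(X)[target].copy()
    corr[np.isnan(corr)] = -np.inf   # constant genes have undefined correlation
    corr[target] = -np.inf           # never select the target gene itself
    return np.argsort(corr)[::-1][:k]


def _double_centered_distances(v):
    """Pairwise Euclidean distance matrix of the samples in v, double-centered."""
    v = np.asarray(v, dtype=float)
    if v.ndim == 1:
        v = v[:, None]
    d = np.linalg.norm(v[:, None, :] - v[None, :, :], axis=-1)
    return d - d.mean(axis=0, keepdims=True) - d.mean(axis=1, keepdims=True) + d.mean()


def distance_correlation(a, b):
    """Sample distance correlation between paired samples a and b."""
    A, B = _double_centered_distances(a), _double_centered_distances(b)
    dcov2 = (A * B).mean()
    dvar2 = (A * A).mean() * (B * B).mean()
    if dvar2 <= 0:
        return 0.0
    return float(np.sqrt(max(dcov2, 0.0) / np.sqrt(dvar2)))


def permutation_pvalue(a, b, n_perm=199, seed=0):
    """Permutation p value for the null of independence (an assumed testing scheme)."""
    rng = np.random.default_rng(seed)
    b = np.asarray(b)
    observed = distance_correlation(a, b)
    hits = sum(
        distance_correlation(a, b[rng.permutation(len(b))]) >= observed
        for _ in range(n_perm)
    )
    return (hits + 1) / (n_perm + 1)


# Toy data (invented): 6 genes x 60 cells, with gene 1 nearly proportional to gene 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 60))
X[1] = 3.0 * X[0] + 0.01 * rng.normal(size=60)
neighbours = top_k_correlated(X, target=0, k=2)
p = permutation_pvalue(X[0], X[0] ** 2)   # non-linear dependence
```

Distance correlation is used here because, unlike Pearson correlation, it can detect non-linear dependence such as the squared relationship in the toy example.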

Fig 2.

Simulation experiments on DC combined with the BH method in terms of Power and FDR (the target level is 20%).

(A) Each pair of vectors is dense, and k dimensions are shared through a linear transform. (B) Each pair of vectors is sparse with 90% zero entries, and k dimensions are shared through a linear transform. (C) Each pair of vectors is dense, and k dimensions are shared through a non-linear transform. (D) Each pair of vectors is sparse with 90% zero entries, and k dimensions are shared through a non-linear transform.
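For readers who want to reproduce the kind of evaluation summarized here, the BH step-up procedure and the two reported metrics can be sketched as below. The function names and the example p values are illustrative assumptions. Note that a single run yields a false discovery proportion; the FDR is its average over repeated simulations.

```python
def benjamini_hochberg(pvalues, alpha=0.20):
    """Return a list of booleans: True where the hypothesis is rejected at level alpha."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Find the largest rank r (1-based) such that p_(r) <= r * alpha / m.
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank * alpha / m:
            cutoff = rank
    rejected = [False] * m
    for i in order[:cutoff]:
        rejected[i] = True
    return rejected


def power_and_fdp(rejected, is_true_signal):
    """Power and false discovery proportion of one run, given the ground truth."""
    tp = sum(1 for r, t in zip(rejected, is_true_signal) if r and t)
    fp = sum(1 for r, t in zip(rejected, is_true_signal) if r and not t)
    n_signal = sum(1 for t in is_true_signal if t)
    power = tp / n_signal if n_signal else 0.0
    fdp = fp / (tp + fp) if (tp + fp) else 0.0
    return power, fdp


# Illustrative p values (invented); the first five are treated as true signals.
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]
truth = [True] * 5 + [False] * 5
rejected = benjamini_hochberg(pvals, alpha=0.20)
power, fdp = power_and_fdp(rejected, truth)
```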

Fig 3.

Performance of ADC on three simulated datasets.

(A) t-SNE plot of the three datasets, which contain three cell types in total. 80% of cells in Data 1 and Data 3 are of the same type, 60% of cells in Data 2 and Data 3 are of the same type, and 40% of cells in Data 1 and Data 2 are of the same type. Each dataset contains 1000 cells and 5000 genes. (B) Heat map of the highly interrelated genes selected by ADC across Data 1, Data 2 and Data 3 (FDR = 0.05). (C and D) Running time and peak memory cost of ADC. For two datasets with 1 million cells and 10 thousand genes each, ADC ran in no more than 135 min and used under 225 GB of RAM. Each entry of these datasets was drawn from a uniform distribution. GB, gigabyte.

Fig 4.

Highly interrelated genes and their biological functions among 21 cancers.

(A) Heat map of the number of highly interrelated genes selected by ADC between each pair of cancers (FDR = 0.05). Hierarchical clustering was performed with the reciprocal of this number. (B and D) The top ten enriched functional terms of the genes selected between HNSC and CESC (B) and between BRCA and OV (D). (C and E) The gene networks constructed with GeneMANIA using the genes selected between HNSC and CESC (C) and between BRCA and OV (E).
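Clustering on the reciprocal of a pairwise count can be sketched with SciPy as below. This is only one plausible reading of the caption: treating the reciprocal as a dissimilarity, using average linkage, and flooring counts at 1 to avoid division by zero are all assumptions, and the counts are invented toy numbers rather than results from the paper.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

labels = ["HNSC", "CESC", "BRCA", "OV"]

# Toy symmetric matrix of shared-gene counts (invented for illustration only).
counts = np.array(
    [
        [0, 900, 100, 120],
        [900, 0, 110, 130],
        [100, 110, 0, 800],
        [120, 130, 800, 0],
    ],
    dtype=float,
)


def reciprocal_dissimilarity(counts):
    """More shared genes means a smaller dissimilarity; the diagonal stays 0."""
    dissim = np.zeros_like(counts)
    off_diagonal = ~np.eye(len(counts), dtype=bool)
    # Flooring at 1 guards against division by zero (an assumption).
    dissim[off_diagonal] = 1.0 / np.maximum(counts[off_diagonal], 1.0)
    return dissim


dissim = reciprocal_dissimilarity(counts)
Z = linkage(squareform(dissim), method="average")      # condensed distances -> dendrogram
clusters = fcluster(Z, t=2, criterion="maxclust")      # cut the tree into two groups
grouping = dict(zip(labels, clusters))
```

With these toy counts, pairs that share many genes end up in the same cluster because their reciprocal dissimilarity is small.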

Fig 5.

Highly interrelated genes selected by ADC between different cell types along the hematopoietic cell lineage.

(A) A schematic of mouse hematopoietic differentiation. Cell types shown in gray were not present in our datasets. (B) Heat map of the number of highly interrelated genes selected by ADC across six cell types (FDR = 0.10). The upper triangle shows the result for mouse hematopoietic cells, while the lower triangle shows the result for human hematopoietic cell data downloaded from GEO with accession code GSE117498. Unsupervised hierarchical clustering analysis was performed with the reciprocal of this number. (C) Confusion matrix of data-driven clusters representing the percent frequency distribution of immunophenotypically defined cell types. (D) Heat map of the dissimilarity between the distributions of the three cell types, measured with the Jensen-Shannon (JS) divergence.
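The JS divergence used in panel D can be computed for two discrete distributions as in the sketch below. It assumes each cell type is summarized as a vector of non-negative frequencies, for example over data-driven clusters; that representation and the base-2 logarithm are assumptions, not details given in the caption.

```python
import math


def js_divergence(p, q, base=2.0):
    """Jensen-Shannon divergence between two discrete distributions.

    p and q are sequences of non-negative weights over the same categories;
    they are normalized to sum to 1 before comparison.
    """
    total_p, total_q = sum(p), sum(q)
    p = [x / total_p for x in p]
    q = [x / total_q for x in q]
    mixture = [(a + b) / 2.0 for a, b in zip(p, q)]

    def kl(a, b):
        # Terms with a == 0 contribute nothing; b > 0 whenever a > 0.
        return sum(x * math.log(x / y, base) for x, y in zip(a, b) if x > 0)

    return 0.5 * kl(p, mixture) + 0.5 * kl(q, mixture)


# Toy frequency vectors (invented) for two cell types over three clusters.
type_a = [70, 20, 10]
type_b = [10, 30, 60]
d_ab = js_divergence(type_a, type_b)
```

With base 2 the divergence is symmetric and bounded between 0 (identical distributions) and 1 (disjoint support), which makes it convenient for a heat map.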

Fig 6.

(A) Number of overlapping genes among the top 1000 HVGs of the data from five different technologies, and (B) highly interrelated genes selected by ADC between each pair of these datasets (FDR = 0.01).

Unsupervised hierarchical clustering was performed with the reciprocal of the number in both cases.

Fig 7.

Biological functions and network analysis of highly interrelated genes selected by ADC across modalities.

(A) Schematic diagram of the workflow for applying ADC to the scRNA-seq and scATAC-seq data (FDR = 0.20). We first converted the scATAC-seq data to a predicted gene expression matrix. Specifically, we constructed a “gene activity matrix” from the scATAC-seq dataset using the reads in the gene body and the 2 kb upstream region. ADC was then applied to this pseudo gene expression matrix and the real one. (B) The top enriched functional terms of the genes selected by ADC. (C) The gene network constructed with GeneMANIA using the top genes selected by ADC.
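The idea behind a gene activity matrix, counting reads that fall in the gene body plus 2 kb upstream for each cell, can be sketched as follows. This is a simplified illustration and not the tool used by the authors: the input format, the strand-aware definition of "upstream", the half-open overlap rule, and all coordinates are assumptions.

```python
from collections import defaultdict


def gene_window(start, end, strand, upstream=2000):
    """Gene body extended by `upstream` bases on the 5' side (strand-aware)."""
    if strand == "+":
        return max(0, start - upstream), end
    return start, end + upstream


def gene_activity(fragments, genes, upstream=2000):
    """Count, per cell and gene, the fragments overlapping the extended gene window.

    fragments: iterable of (cell, chrom, start, end) with half-open coordinates.
    genes: dict mapping gene name -> (chrom, start, end, strand).
    Returns a nested dict: activity[cell][gene] -> count.
    """
    windows = {
        name: (chrom, *gene_window(start, end, strand, upstream))
        for name, (chrom, start, end, strand) in genes.items()
    }
    activity = defaultdict(lambda: defaultdict(int))
    for cell, chrom, frag_start, frag_end in fragments:
        for name, (gene_chrom, win_start, win_end) in windows.items():
            if chrom == gene_chrom and frag_start < win_end and frag_end > win_start:
                activity[cell][name] += 1
    return activity


# Hypothetical genes and fragments, invented for illustration.
genes = {
    "G1": ("chr1", 5000, 6000, "+"),    # window becomes 3000-6000
    "G2": ("chr1", 10000, 11000, "-"),  # window becomes 10000-13000
}
fragments = [
    ("cell1", "chr1", 3500, 3600),    # in the upstream region of G1
    ("cell1", "chr1", 2000, 2900),    # outside every window
    ("cell2", "chr1", 12500, 12600),  # in the upstream region of G2 (minus strand)
    ("cell2", "chr1", 8500, 8600),    # downstream of G2, not counted
    ("cell2", "chr2", 5500, 5600),    # different chromosome
]
activity = gene_activity(fragments, genes)
```

The nested loop is quadratic in fragments and genes and is meant only to show the counting rule; a real dataset would need an interval index.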
