Predicting lung aging using scRNA-Seq data

doi:10.1371/journal.pcbi.1012632

Fig 1.

The flowchart of the analysis.

A. Polynomial features were extracted for each cell type and aggregated at the donor level. Regression models were trained on the extracted polynomial features, either in the original gene space or in the PCA-transformed space. B. Training and testing strategy. We adopted the LOO test and the CD test to evaluate model performance (see Materials and Methods). C. Cell type mapping and dataset integration. D. SHAP-based ranking and empirical p-values for the genes selected in each cell type.
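
As a rough illustration of panels A and B, the sketch below expands each cell's expression values into polynomial terms, averages them per donor, and then holds out one donor at a time for prediction. The degree of 2, the mean aggregation, the toy data, the `(donor_id, expression_vector)` layout, and the one-feature least-squares fit are all assumptions made for illustration; the models and settings actually used are described in Materials and Methods.

```python
from collections import defaultdict

def polynomial_features(expr, degree=2):
    """Return [x, x**2, ..., x**degree] for every gene value in one cell."""
    return [x ** d for x in expr for d in range(1, degree + 1)]

def aggregate_by_donor(cells, degree=2):
    """Average per-cell polynomial features within each donor.

    `cells` is a list of (donor_id, expression_vector) pairs (hypothetical layout).
    """
    sums, counts = {}, defaultdict(int)
    for donor, expr in cells:
        feats = polynomial_features(expr, degree)
        if donor not in sums:
            sums[donor] = [0.0] * len(feats)
        sums[donor] = [s + f for s, f in zip(sums[donor], feats)]
        counts[donor] += 1
    return {d: [s / counts[d] for s in sums[d]] for d in sums}

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x (a stand-in for the real regressor)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    return my - b * mx, b

def leave_one_out_predictions(features, ages, column=0):
    """Hold out each donor in turn, fit on the rest, predict the held-out age."""
    donors = sorted(features)
    preds = {}
    for held_out in donors:
        train = [d for d in donors if d != held_out]
        a, b = fit_line([features[d][column] for d in train],
                        [ages[d] for d in train])
        preds[held_out] = a + b * features[held_out][column]
    return preds

# Toy data: one gene, two cells per donor (values are made up).
cells = [("d1", [1.0]), ("d1", [3.0]), ("d2", [2.0]), ("d2", [4.0]),
         ("d3", [3.0]), ("d3", [5.0]), ("d4", [4.0]), ("d4", [6.0])]
ages = {"d1": 30.0, "d2": 40.0, "d3": 50.0, "d4": 60.0}
features = aggregate_by_donor(cells)
predictions = leave_one_out_predictions(features, ages)
```

Each donor ends up with one feature vector regardless of how many cells were sequenced, which is what allows a donor-level age label to be used as the regression target.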

Table 1.

Number of cells and genes in each dataset after preprocessing.

Fig 2.

Cell type mapping, dataset integration, comparison of gene types and cell types.

A. Joint cell embeddings of the query datasets (IPF, Carraro, Nuclear-Seq) and the reference dataset (HLCA) after cell type transfer performed by scArches. Plots were generated from the 30-dimensional latent representations transformed from the original gene space. B. Joint cell embeddings of the query datasets (IPF, Carraro) and the reference dataset (HLCA) after dataset integration performed by mnnpy. Plots were generated based on the intersection of HVGs between the reference and the query datasets. C. Mean R² scores for all tested methods. The R² scores shown in the plot are from the 10 cell types with the highest R² scores for each method. P-value annotation legend: ns (not significant): p-value > 0.05; *: 0.01 < p-value ≤ 0.05; **: 0.001 < p-value ≤ 0.01; ***: 0.0001 < p-value ≤ 0.001; ****: p-value ≤ 0.0001. D. Rankings of the different types of gene markers for the top cell types. The rankings were computed for each cell type separately. For each cell type, we selected each gene type's best PCA setting as determined by the highest R² score, and used this R² score as the representative R² score of that gene type for the given cell type. We then ranked the gene types by these representative R² scores. The resulting rankings (from 1 to 6) for the top 10 cell types are presented in the plots. These cell types are drawn from the top 10 cell types shown in Figs 3 and 4. E, F. Predicted vs. true donor ages for the top cell types. For each cell type, we selected its best gene type and PCA setting as determined by the highest R² score. The corresponding best gene type and PCA setting are labeled in each plot.
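
The ranking procedure in panel D can be sketched as follows for a single cell type. The gene type names, PCA setting labels, and R² values below are made up, and the nested-dictionary layout is an assumption; only the two steps (take the best R² per gene type, then rank gene types by it) come from the caption.

```python
def rank_gene_types(r2_scores):
    """Rank gene types for one cell type.

    `r2_scores` maps gene_type -> {pca_setting: R2}. Each gene type is
    represented by its best R2 across PCA settings; rank 1 is the highest.
    """
    # Step 1: keep the (setting, R2) pair with the highest R2 per gene type.
    best = {
        gene_type: max(by_setting.items(), key=lambda kv: kv[1])
        for gene_type, by_setting in r2_scores.items()
    }
    # Step 2: order gene types by that representative R2, highest first.
    ordered = sorted(best, key=lambda g: best[g][1], reverse=True)
    return {
        g: {"rank": i, "best_setting": best[g][0], "r2": best[g][1]}
        for i, g in enumerate(ordered, start=1)
    }

# Hypothetical scores for a single cell type.
scores = {
    "all_genes": {"no_pca": 0.41, "pca": 0.55},
    "hvg": {"no_pca": 0.62, "pca": 0.58},
    "senescence_markers": {"no_pca": 0.30, "pca": 0.35},
}
ranking = rank_gene_types(scores)
```

Ties are broken here by input order, which is an arbitrary choice of this sketch rather than something stated in the caption.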

Table 2.

Percentage of unannotated cells for each HLCA-defined cell type level.

Fig 3.

R² scores from LOO test and CD test for non-smoker donors and comparison between transcriptome predicted ages and methylation predicted ages.

A. R² scores for non-smoker donors from the LOO test and the CD test. Each row represents the best type of gene marker for a cell type. The bar in each row represents the mean and standard deviation of the R² score across five runs. Left: R² scores from the LOO test in the HLCA dataset; middle: R² scores from the CD test, which used a subset of HLCA for training and a subset of HLCA and other datasets for testing (see Materials and Methods); right: LOO test in the Nuclear-Seq dataset. Rows with no bars indicate R² scores equal to or smaller than zero. See S2 File for more information on the corresponding senescence markers and the number of donors used in each row. B. Comparison between transcriptomic ages and methylation ages. Transcriptomic ages were predicted by polyEN and methylation ages were predicted as described in Materials and Methods. “true” denotes the true chronological donor age and “pred” denotes the transcriptomic age predicted by polyEN. “hor1” denotes methylation ages predicted by the Horvath1 method, “hor2” denotes methylation ages predicted by the Horvath2 method, and “han” denotes methylation ages predicted by the Hannum method. polyEN was applied to the cell types and marker gene lists shown in A for Nuclear-Seq.
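
For readers unfamiliar with why an R² score can be zero or negative, the sketch below computes R² from its usual definition and summarizes repeated runs by mean and standard deviation. The ages and run scores are made up, and the use of the sample standard deviation is an assumption; the caption does not say which form was used.

```python
from statistics import mean, stdev

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def summarize_runs(scores):
    """Mean and sample standard deviation of R2 across repeated runs."""
    return mean(scores), stdev(scores)

ages = [30.0, 40.0, 50.0, 60.0]
perfect = r2_score(ages, ages)                     # predictions match exactly
mean_only = r2_score(ages, [45.0] * 4)             # always predict the mean age
worse = r2_score(ages, [60.0, 50.0, 40.0, 30.0])   # reversed ordering
run_mean, run_sd = summarize_runs([0.50, 0.55, 0.60, 0.45, 0.50])
```

A model that always predicts the mean age scores exactly zero, and anything less accurate than that scores below zero, which is the case the empty rows in the figure represent.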

Fig 4.

R² scores from LOO test and CD test for smoker donors.

Each row represents the best type of gene marker for a cell type. The bar in each row represents the mean and standard deviation of the R² score across five runs. Left: R² scores from the LOO test in the HLCA dataset; middle: R² scores from the CD test, which used a subset of HLCA for training and a subset of HLCA and other datasets for testing (see Materials and Methods). Rows with no bars indicate R² scores equal to or smaller than zero. See S2 File for more information on the corresponding senescence markers and the number of donors used in each row.

Fig 5.

Top significant GO terms from GSEA and polynomial features for basal-related cell types.

A. Visualization of the polynomial features for RHOB and PMAIP1/Noxa in basal, basal resting and suprabasal cells of the non-smoker group. The values shown in the plots are polynomial features computed from the log-normalized gene expression. B. The top five significant GO terms from GSEA for basal and basal resting cells of the non-smoker group. C. The common genes with significant SHAP scores among the three basal-related cell types; top table: genes identified from all expressed genes; bottom table: genes identified from the union of senescence marker lists. D. The distribution of predicted vs. real ages for IPF disease donors. Models were trained using the non-smoker donors of IPF disease and tested using the smoker donors of IPF disease. The x-axis denotes age and the y-axis denotes density.

Fig 6.

Polynomial features for genes with significant SHAP scores identified in basal, basal resting and suprabasal cells of non-smokers.

A. Visualization of the polynomial features for all expressed genes. We selected only the genes assigned significant empirical p-values for each cell type (see Materials and Methods). B. Visualization of the polynomial features for the union of senescence markers. We selected only the genes assigned significant empirical p-values (see Materials and Methods). Each row represents one gene, and genes were sorted by row-wise sum.
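
The two operations mentioned here, filtering by an empirical p-value and ordering rows by their sum, can be sketched generically as below. This is not the procedure from Materials and Methods: the null scores are random placeholders, the `(exceed + 1) / (n + 1)` estimator is one common convention assumed for illustration, and the descending sort direction and gene names are likewise assumptions.

```python
import random

def empirical_p_value(observed, null_scores):
    """Share of null scores at least as large as the observed one.

    The +1 terms keep the estimate above zero with a finite null sample.
    """
    exceed = sum(1 for s in null_scores if s >= observed)
    return (exceed + 1) / (len(null_scores) + 1)

def sort_rows_by_sum(matrix):
    """Order gene rows by the sum of their values, largest first."""
    return sorted(matrix.items(), key=lambda item: sum(item[1]), reverse=True)

rng = random.Random(0)
null = [rng.random() for _ in range(999)]   # made-up null scores in [0, 1)
p_high = empirical_p_value(2.0, null)       # observed score above every null value
p_low = empirical_p_value(-1.0, null)       # observed score below every null value

# Hypothetical per-gene feature rows (placeholder gene names).
rows = {"GENE_A": [0.1, 0.2], "GENE_B": [0.9, 0.8], "GENE_C": [0.4, 0.5]}
ordered = [gene for gene, _ in sort_rows_by_sum(rows)]
```

With 999 null scores the smallest attainable p-value is 0.001, so the size of the null sample sets the resolution of any significance cutoff applied afterwards.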
