A joint complex network and machine learning approach for the identification of discriminative gene communities in autistic brain

doi:10.1371/journal.pone.0334181

Fig 1.

Flowchart of the proposed pipeline.

Initially, a gene co-expression network is constructed based on significant Pearson’s correlations between gene expression profiles. In Step 1, hierarchical community detection using the Leiden algorithm is applied to identify stable and biologically relevant communities within the network. These communities serve as the basis for the independent machine learning analysis performed in Step 2, which consists of a 5-fold cross-validation procedure that includes Boruta feature selection and a Random Forest classifier to discriminate between ASD and control subjects. Finally, in Step 3, XAI analysis based on Shapley values is conducted to interpret the classifier results by quantifying the contribution of each gene within the identified communities, enhancing biological interpretability of the predictive model.
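The network-construction step can be illustrated with a minimal sketch. It assumes expression data arranged as a samples × genes matrix and, as a simplification, keeps an edge when the absolute Pearson correlation exceeds a fixed cutoff; the caption specifies statistically significant correlations, so the cutoff `r_threshold`, the function name `coexpression_edges`, and the synthetic data are all hypothetical stand-ins rather than the authors' settings.

```python
import numpy as np

def coexpression_edges(expr, r_threshold=0.8):
    """Return (gene_i, gene_j, r) for gene pairs with |Pearson r| >= r_threshold.

    expr: array of shape (n_samples, n_genes).
    NOTE: a fixed |r| cutoff is an assumption standing in for a significance test.
    """
    corr = np.corrcoef(expr, rowvar=False)        # genes x genes correlation matrix
    i_idx, j_idx = np.triu_indices(corr.shape[0], k=1)  # each unordered pair once
    keep = np.abs(corr[i_idx, j_idx]) >= r_threshold
    return [(int(i), int(j), float(corr[i, j]))
            for i, j in zip(i_idx[keep], j_idx[keep])]

# Synthetic example: genes 0 and 1 share a common signal, genes 2 and 3 are noise.
rng = np.random.default_rng(0)
shared = rng.normal(size=(50, 1))
expr = np.hstack([shared + 0.05 * rng.normal(size=(50, 2)),
                  rng.normal(size=(50, 2))])
edges = coexpression_edges(expr)
```

The resulting edge list could then be passed to a community detection routine such as a Leiden implementation; that step is omitted here.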

Fig 2.

Box-plots of the classification accuracies.

The accuracies were obtained for the six gene communities significantly enriched for SFARI database genes. Each box represents the accuracy distribution obtained by training a Random Forest classifier within a 5-fold cross-validation framework (with Boruta feature selection), repeated over 100 rounds. The number below each box identifies the corresponding community.
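A repeated 5-fold cross-validation of this kind can be sketched with scikit-learn, under several assumptions: the data are synthetic (`make_classification`), only 3 repetitions are run instead of 100 to keep the example fast, and the Boruta feature-selection step is omitted because it is not part of scikit-learn. In a faithful version, feature selection would be fitted inside each training fold.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Synthetic stand-in for one community: 80 subjects x 20 genes, binary labels.
X, y = make_classification(n_samples=80, n_features=20, n_informative=5,
                           random_state=0)

n_splits, n_repeats = 5, 3   # the caption reports 100 repetitions; 3 is for speed
cv = RepeatedStratifiedKFold(n_splits=n_splits, n_repeats=n_repeats,
                             random_state=0)
clf = RandomForestClassifier(n_estimators=50, random_state=0)

# One accuracy per fold, ordered repetition by repetition.
fold_acc = cross_val_score(clf, X, y, cv=cv, scoring="accuracy")

# Average the 5 folds of each repetition: one value per round, as in a box plot.
round_acc = fold_acc.reshape(n_repeats, n_splits).mean(axis=1)
```

Each entry of `round_acc` corresponds to one point in the distribution that a box would summarize.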

Table 1.

Classification performances.

Performance is reported on the training set for the six gene communities significantly enriched for SFARI database genes. The results include the mean AUC, the F1 score, and the average number of genes selected by Boruta feature selection over 100 repetitions of 5-fold cross-validation.

Table 2.

Classification performances of the two identified gene communities on the independent test dataset.

The table reports the mean classification accuracy, the number of overlapping genes, the AUC, and the F1 score for the independent dataset. Results were obtained by averaging over 100 repetitions of the 5-fold cross-validation procedure, with estimated errors indicated. Detailed results for the complete set of 41 communities are provided in Supplementary S2 Table.

Fig 3.

SHAP summary plots of the two identified communities on the independent test dataset.

The plots illustrate how individual gene expression values influence the classifier’s prediction of ASD. Each row corresponds to a gene, ordered vertically by importance (the top 20 most influential genes are shown). The horizontal axis represents the SHAP values, indicating the magnitude and direction of each gene’s impact on the prediction: positive SHAP values correspond to contributions towards an ASD-positive diagnosis, while negative values contribute to a control diagnosis. Each point represents a sample, colored according to the expression level of the corresponding gene (low expression in blue, high expression in red).
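The sign convention described above can be made concrete with a small exact Shapley computation. This is a toy sketch, not the SHAP library's tree-based estimator: the scoring function `toy_score`, its weights, and the all-zero baseline are hypothetical, and replacing absent features with baseline values is a simplification of averaging over a background dataset.

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, x, baseline):
    """Exact Shapley values of predict at x, relative to a baseline input.

    Features outside a coalition are set to their baseline value.
    Cost grows as 2**n, so this is only practical for a handful of features.
    """
    n = len(x)

    def value(subset):
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return predict(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for k in range(n):
            # Weight of a coalition of size k that does not contain feature i.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for coalition in combinations(others, k):
                s = set(coalition)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi

def toy_score(z):
    # Hypothetical ASD score: gene 0 raises it, gene 1 lowers it, gene 2 is unused.
    return 0.5 + 0.3 * z[0] - 0.2 * z[1]

x = [1.0, 1.0, 1.0]          # one sample's (hypothetical) expression values
baseline = [0.0, 0.0, 0.0]   # reference expression profile
phi = shapley_values(toy_score, x, baseline)
```

Here gene 0 receives a positive value (pushing toward the ASD class), gene 1 a negative value (pushing toward control), and gene 2 a value of zero; the three values sum to the difference between the score at `x` and the score at `baseline`.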
