Chapter 15: Disease Gene Prioritization

doi:10.1371/journal.pcbi.1002902

Figure 5

Predicting gene-disease involvement using artificial neural networks (ANNs).

In a supervised learning paradigm, neural networks are trained on experimental data that relate inputs (descriptive features linking genes to diseases) to outputs (likelihood of gene-disease involvement). The training and testing procedures for the generalized network (Panel A) are described in the text. In our example, the WEKA [129], [130], [131], [139] ANN (Panel B; a = 0.5, λ = 0.2) is trained on the training set (Panel C), presented 500 times (epochs). The network “memorizes” the patterns in the training set (Predictions in Panel C) and makes accurate predictions for four of the seven instances it has not seen before (test set, Panel D). Note that the erroneously assigned instances in the test set (yellow highlight) are, for the most part, unlike the training instances. The first has very little literature correlation (0.01), while its sequence similarity to another disease-involved gene is fairly high (0.55). The second maps an unlikely candidate gene (very low literature correlation, no homology) to disease, and the third has barely enough literature mapping and borderline homology. None of these three instances was consistently represented in the training set. This example highlights the importance of training on a representative training set while testing on a set that is not equivalent to the training set.

doi: https://doi.org/10.1371/journal.pcbi.1002902.g005
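
The training procedure in the caption can be sketched in a few lines of Python. This is a minimal illustration, not the WEKA implementation used for the figure: the class `TinyANN`, the three-unit hidden layer, and the toy instances in `TRAIN` are all hypothetical, and the values are invented rather than taken from Panels C and D. The two features stand in for literature correlation and sequence similarity. The learning rate (0.2) and momentum (0.5) reuse the two numbers given in the caption, but which of a and λ denotes which is an assumption.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class TinyANN:
    """One-hidden-layer network trained by backpropagation with momentum."""

    def __init__(self, n_in=2, n_hidden=3, seed=0):
        rng = random.Random(seed)
        # Each unit holds one weight per input plus a trailing bias weight.
        self.w_h = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
                    for _ in range(n_hidden)]
        self.w_o = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
        # Previous weight changes, needed for the momentum term.
        self.v_h = [[0.0] * (n_in + 1) for _ in range(n_hidden)]
        self.v_o = [0.0] * (n_hidden + 1)

    def forward(self, x):
        xb = list(x) + [1.0]  # append constant bias input
        h = [sigmoid(sum(w * xi for w, xi in zip(ws, xb))) for ws in self.w_h]
        hb = h + [1.0]
        y = sigmoid(sum(w * hi for w, hi in zip(self.w_o, hb)))
        return xb, hb, y

    def predict(self, x):
        return self.forward(x)[2]

    def train(self, data, epochs=500, learning_rate=0.2, momentum=0.5):
        for _ in range(epochs):  # one epoch = one pass over the training set
            for x, target in data:
                xb, hb, y = self.forward(x)
                # Error signals for squared error with sigmoid units.
                delta_o = (target - y) * y * (1.0 - y)
                delta_h = [delta_o * self.w_o[j] * hb[j] * (1.0 - hb[j])
                           for j in range(len(self.w_h))]
                for j in range(len(self.w_o)):
                    self.v_o[j] = (learning_rate * delta_o * hb[j]
                                   + momentum * self.v_o[j])
                    self.w_o[j] += self.v_o[j]
                for j, ws in enumerate(self.w_h):
                    for i in range(len(ws)):
                        self.v_h[j][i] = (learning_rate * delta_h[j] * xb[i]
                                          + momentum * self.v_h[j][i])
                        ws[i] += self.v_h[j][i]


def mse(net, data):
    return sum((t - net.predict(x)) ** 2 for x, t in data) / len(data)


# Hypothetical toy data: (literature correlation, sequence similarity) -> label.
TRAIN = [((0.90, 0.80), 1), ((0.80, 0.60), 1), ((0.70, 0.90), 1),
         ((0.10, 0.20), 0), ((0.20, 0.10), 0), ((0.05, 0.30), 0)]
# An instance unlike anything in TRAIN: little literature, fairly high homology.
UNSEEN = (0.01, 0.55)

net = TinyANN(seed=0)
loss_before = mse(net, TRAIN)
net.train(TRAIN, epochs=500)
loss_after = mse(net, TRAIN)
unseen_score = net.predict(UNSEEN)

print(f"training MSE before: {loss_before:.3f}, after: {loss_after:.3f}")
print(f"score for unseen instance {UNSEEN}: {unseen_score:.3f}")
```

The score for `UNSEEN` depends on how the network happens to interpolate between training instances. That is the caption's point: predictions for instances poorly represented in the training set are unreliable.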