Using machine learning to predict and analyze complex trait diseases: Lessons from a simple abstract model

doi:10.1371/journal.pone.0342490

Fig 1.

Graphic representation of the Simple disease model of R risk alleles and P pathways.

Fig 2.

Graphic representation of a Weighted disease model of R risk alleles and P pathways.

The second pathway (red ‘2’) represents a central pathway, and the orange-colored alleles represent the high-risk alleles.

Fig 3.

Graphic representation of an Overlap disease model of R risk alleles and P pathways.

Pathways are divided into pairs, and each pair shares two SNPs. The value of a shared risk allele is included in the calculations of both pathways.

Fig 4.

Graphic representation of a Subtype disease model, composed of two alternative Simple models.

Fig 5.

Risk allele histograms for the core Crohn’s disease (CD, 3004 individuals) and Control populations (1949 individuals) in the 48 loci (left), and in a simulated data set (right).

The top, middle, and lower panels show the distribution of individuals carrying 0, 1, and 2 risk alleles, respectively. The CD population had a slightly higher fraction of risk alleles than the Control one. These trends are reproduced in the simulated data.

Fig 6.

Schematic example of a single swap.

Each row lists the number of alleles (0, 1, or 2) at each of four SNP positions in one individual, and each column lists the number of alleles for a specific SNP across four individuals. The swap is allowed because the first and third rows have the values ‘1’ and ‘2’ in opposite positions. The result of that single swap is shown on the right: the sum of the values in each row and column is maintained, while the order of the values within the affected rows and columns changes.
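
The swap rule described in this caption can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names (`try_swap`, `shuffle_preserving_sums`, `margins`), the random choice of rows and columns, and the example matrix are all assumptions introduced here, and the matrix values are not those shown in the figure.

```python
import random

def try_swap(genotypes, i, j, a, b):
    """Swap in place if rows i and j hold two different values in
    opposite positions at columns a and b; return True if applied."""
    x, y = genotypes[i][a], genotypes[i][b]
    if x != y and genotypes[j][a] == y and genotypes[j][b] == x:
        genotypes[i][a], genotypes[i][b] = y, x
        genotypes[j][a], genotypes[j][b] = x, y
        return True
    return False

def shuffle_preserving_sums(genotypes, n_attempts, seed=0):
    """Return a copy after n_attempts randomly chosen swap attempts.
    Attempts that do not meet the swap condition are skipped."""
    rng = random.Random(seed)
    out = [row[:] for row in genotypes]
    n_rows, n_cols = len(out), len(out[0])
    for _ in range(n_attempts):
        i, j = rng.sample(range(n_rows), 2)
        a, b = rng.sample(range(n_cols), 2)
        try_swap(out, i, j, a, b)
    return out

def margins(matrix):
    """Row sums (per individual) and column sums (per SNP)."""
    return [sum(r) for r in matrix], [sum(c) for c in zip(*matrix)]

# Hypothetical 4 individuals x 4 SNPs matrix (not the values in the figure).
example = [
    [1, 2, 0, 1],
    [0, 1, 1, 2],
    [2, 1, 0, 0],
    [1, 0, 2, 1],
]
swapped = [row[:] for row in example]
applied = try_swap(swapped, 0, 2, 0, 1)  # rows 1 and 3, first two SNPs
shuffled = shuffle_preserving_sums(example, n_attempts=200)
```

Because each accepted swap exchanges the same two values within two rows and two columns, every row sum and column sum is left unchanged no matter how many swaps are applied.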

Fig 7.

AUC of the different algorithms for the Simple model as a function of data set size.

The sizes are shown on the x-axis as a percentage of the original size of 3004 Case individuals and 1949 Control individuals. All methods improve with increased data size, but the Neural Network (NN) analysis has by far the best AUC.

Fig 8.

Average AUC obtained with the Weighted and Overlap models, compared with the Simple model.

A more complex disease structure produces improved performance. Twenty replicates were performed for each model, all at a data set size of 100%. P-values from two-sample one-tailed t-tests on the 20 replicates of each model are shown, together with standard error bars.

Fig 9.

The effect of missing and surplus alleles on prediction performance.

Omission of relevant alleles has a larger effect than the inclusion of irrelevant alleles. Values are average AUC for the Simple model with a data set size of 200%.

Fig 10.

Average AUC obtained with various percentages of mislabeled individuals in the training data, using the NN algorithm.

The percentage on the x-axis represents the percentage of Case individuals mislabeled as Control out of all Case individuals in the training set.
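
The mislabeling described here can be illustrated with a short Python sketch. It is a sketch under stated assumptions, not the authors' procedure: the function name `mislabel_cases`, the encoding of Case as 1 and Control as 0, and the uniform random choice of which Case individuals to relabel are all introduced here for illustration.

```python
import random

def mislabel_cases(labels, fraction, seed=0):
    """Return a copy of labels in which the given fraction of Case
    individuals (assumed to be encoded as 1) is relabeled as Control (0).
    The fraction is taken out of all Case individuals, as in the caption."""
    rng = random.Random(seed)
    case_idx = [k for k, v in enumerate(labels) if v == 1]
    n_flip = round(fraction * len(case_idx))
    noisy = list(labels)
    for k in rng.sample(case_idx, n_flip):
        noisy[k] = 0
    return noisy

# Hypothetical training labels: 10 Case followed by 10 Control individuals.
train_labels = [1] * 10 + [0] * 10
noisy_labels = mislabel_cases(train_labels, fraction=0.2)
```

Only the training labels are altered in this sketch; evaluation labels would stay untouched so that AUC is still measured against the true status.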

Fig 11.

Comparison of models with different prevalence.

Left: The performance on a Simple disease model with a disease prevalence of 35% is compared with the performance on a Simple model in which the parameters were tuned to decrease the prevalence to 19%. Right: For the Overlap model, a disease with a prevalence of 39% is compared with a model with a lower prevalence of 29%. In both cases, the performance was significantly better for the less common disease. A population size of 100% was used, and the results are based on 20 replicates. P-values from two-sample one-tailed t-tests are shown, together with standard error bars.

Fig 12.

The results of running a t-SNE algorithm with NN weights as input, for models with different AUC values.

Each point represents a SNP, and points are colored according to the pathway to which the SNP belongs. (A) t-SNE of the Weighted model (20% of the original size, with AUC of 0.708). (B) t-SNE of the Weighted model (100% of the original size, with AUC of 0.95). (C) t-SNE of the Weighted model (200% of the original size, with AUC of 1). For higher AUCs, the ability of the t-SNE algorithm to extract the structure of the underlying disease model improves significantly.

Fig 13.

The results of running a t-SNE algorithm with NN weights of a “mislabeled” model as input (100% of the original size, 10% mislabeled samples in the training set, with AUC of 0.82).

Even with an AUC that is not very high, the t-SNE algorithm extracts the structure of the underlying disease model well.

Fig 14.

The results of running a t-SNE algorithm with NN weights of a “Subtype” model as input (500% of the original size, two-subtype model, AUC of 0.91).

Predicting disease status in a Subtype model was shown to be a difficult task. Still, the plot shows a reasonable separation between pathways that belong to subtype 1 (upper part) and pathways that belong to subtype 2 (mostly lower part).
