Enabling interpretable machine learning for biological data with reliability scores

doi:10.1371/journal.pcbi.1011175

Fig 1.

The SRS reveals the underlying structure of the data in example training scenarios.

A) For a classification problem with a single attribute, we illustrate how the SRS rises and falls over a range of classification probabilities. Data was generated for two classes (class 1: blue, dashed line, and class 2: red, solid line) with a single attribute. SWIF(r) was trained using 1000 instances of each class drawn from the attribute distribution shown on the left. Following training, SWIF(r) was tested on values across the range (-20, 20), resulting in the SWIF(r) probabilities and SRS values shown on the right. In the right-hand graph, the SWIF(r) probability of class 2 is shown in orange, and the SRS is shown in dark gray. Here we see that the SRS drops in regions that are not represented by either class in training, across a wide range of classification probabilities. B) The SRS detects differences in the correlation structure of testing data compared to training data in a two-attribute model. Data was generated for two classes (class 1 shown in blue, and class 2 shown in red) with two attributes. Different distributions and correlations between attributes were chosen to create three SWIF(r) models labeled “X” (left), “Y” (middle) and “Z” (right). The top row for each model shows the data used to train the model, with one attribute along each axis. The middle and bottom rows show testing data drawn uniformly across the same area shown. In the middle row, testing data is colored by the SWIF(r) probability, ranging from 0 to 1 for each class (note that for a binary classifier, P(class 1) = 1 - P(class 2)). In the bottom row, testing data is colored by the relative value of the SRS for the testing data shown, with yellow corresponding to the highest SRS and purple corresponding to the lowest SRS. We note that all combinations of high and low SRS and high and low SWIF(r) probabilities are possible, depending on the structure of the dataset and the particular instance being classified.
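The behavior described in panel A can be illustrated with a small toy sketch. This is not SWIF(r) and does not compute the SRS as defined in the paper: it assumes Gaussian class-conditional densities, equal priors, hypothetical class means of -5 and 5, and uses the log of the average class density as a stand-in reliability score. It only shows how a classification probability can stay near 0 or 1 in regions where a likelihood-based score has already dropped.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical training data: 1000 instances per class, one attribute.
class1 = rng.normal(loc=-5.0, scale=2.0, size=1000)
class2 = rng.normal(loc=5.0, scale=2.0, size=1000)


def fit_gaussian(x):
    """Estimate the mean and standard deviation of one class."""
    return x.mean(), x.std(ddof=1)


def log_density(x, mean, sd):
    """Log of the Gaussian density with the given mean and sd."""
    return -0.5 * np.log(2 * np.pi * sd**2) - 0.5 * ((x - mean) / sd) ** 2


params1 = fit_gaussian(class1)
params2 = fit_gaussian(class2)


def classify(x):
    """Return (P(class 2 | x), stand-in reliability), assuming equal priors."""
    x = np.asarray(x, dtype=float)
    ll1 = log_density(x, *params1)
    ll2 = log_density(x, *params2)
    # Posterior of class 2 as a logistic function of the log-likelihood gap.
    prob2 = 1.0 / (1.0 + np.exp(ll1 - ll2))
    # Stand-in reliability (an assumption, not the SRS):
    # log of the average of the two class-conditional densities.
    reliability = np.logaddexp(ll1, ll2) - np.log(2.0)
    return prob2, reliability


# Test across the same range as the caption, (-20, 20).
grid = np.linspace(-20, 20, 9)
prob2, reliability = classify(grid)
for x, p, r in zip(grid, prob2, reliability):
    print(f"x={x:6.1f}  P(class 2)={p:.3f}  reliability={r:8.2f}")
```

In this sketch the probability of class 2 is close to 1 at x = 20 and close to 0 at x = -20, yet the stand-in score is far lower there than near the training data at x = 5 or x = -5.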


Fig 2.

The SRS is lower when an instance’s class is excluded from training.

SWIF(r) was trained using each combination of two classes in the wheat morphology dataset [50] (see Methods). The trained model was then tested on instances from all three classes. A) Distribution of the training and testing data across all seven attributes: Perimeter, Area, Kernel Length, Compactness, Kernel Width, Asymmetry and Length of Kernel Groove. Note that wheat 1 (blue circles) generally has trait values that are intermediate to the other two classes. See S1 Fig for additional views into the raw dataset. B) Histograms show the distribution of the SRS for instances of each class. In each case, the unknown class has a more negative distribution of SRS compared to the two known classes. This is true even for wheat 1 (blue), which has attributes that are intermediate in value as compared to the other classes (see S1 Table for p-values). C) Data graphed by two attributes (Kernel Length and Compactness). Left: full data set colored by true labels. Right: colored by SWIF(r) probability (top) and SRS (bottom). SWIF(r) scores were generally greater than 90% for one of the two trained classes, even for instances from the unknown class, with just a handful of points receiving intermediate values from SWIF(r) (black diamonds). In contrast, coloring by SRS shows that points associated with the unknown class (larger dots) tend to have lower SRS, while points associated with known classes (smaller dots) received higher SRS.


Fig 3.

Low average SRS can indicate systemic mismatch between training and testing data.

A) Histograms and boxplots showing the distributions of “match” and “mismatch” cohorts for each experiment. Histograms and boxplots show the same information, with boxplots zoomed in to allow for easier comparison of the distributions. The x-axis range in the histograms represents the complete range of observed SRS values in each case. In the sex-mismatch model (left), SWIF(r) was trained using a dataset consisting of male individuals of European ancestry divided into two categories based on HBA1C readings, used to estimate blood sugar: Elevated or Normal. The data provided for training consisted of 22 health-related attributes (see Methods). The trained model was then tested on two cohorts: females with European ancestry and an independent cohort of males with European ancestry. These cohorts were labeled with their Elevated or Normal status, allowing for identification of correct or incorrect classification by SWIF(r). In the top left, we see the distribution of SRS for Elevated and Normal individuals from either the matching cohort (male) or non-matching cohort (female). The female cohort has lower average SRS, visible as a leftward shift in both the Elevated (t-test p-value = 1.53e-4) and Normal (t-test p-value = 5.11e-34) distributions (*** represents p<0.001). Likewise, in the ancestry-mismatch model (right), SWIF(r) was trained using a dataset of male and female individuals of European ancestry divided into two categories: Elevated and Normal. The trained model was then tested on two cohorts: males and females with African ancestry and an independent cohort of males and females with European ancestry. As above, we see the distribution of SRS for each cohort. The non-matching African ancestry cohort has lower average SRS, visible as a leftward shift in both the Elevated (t-test p-value = 6.04e-06) and Normal (t-test p-value = 1.50e-28) distributions.
B) Confusion matrices show differences in SWIF(r) classification accuracy between matching and non-matching cohorts. The non-matching sex cohort experienced a small shift towards the Elevated classification when compared to the matching cohort. The non-matching ancestry cohort experienced a larger shift towards the Normal classification when compared to the matching cohort. C) SRS and SWIF(r) probability are calculated over a plane defined by two of the twenty-two model attributes (Hemoglobin vs. Cholesterol for the sex-mismatch analysis, and BMI vs. LDL for the ancestry-mismatch analysis), providing a background of points for each graph. For other views into the data, see S6 and S7 Figs. Overlaid on each graph is a contour plot of the distribution of Elevated (red-to-white, solid line) or Normal (black-to-white, dashed line) data for each cohort. On the left, comparing the Male and Female cohorts, we can observe a shift in the overall distribution of the data, pushing the Female cohort into an area with lower SRS values (top) and greater Elevated SWIF(r) probability (bottom). On the right, comparing the European and African cohorts, we observe that the mean difference between the Elevated and Normal cohorts is larger for individuals with European ancestry and smaller for individuals with African ancestry. This results in greater overlap between the Elevated and Normal distributions in the African cohort, as well as an overall shift towards the Normal classification.
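The cohort comparison in panel A (a leftward shift in the score distribution, assessed with a t-test) can be sketched as follows. The caption does not say which t-test variant was used, so Welch's unequal-variance form is an assumption here, and the p-value uses a normal approximation that is only reasonable for large cohorts. The scores are synthetic draws, not data from the study, and the names `match_scores` and `mismatch_scores` are hypothetical.

```python
import math

import numpy as np

rng = np.random.default_rng(1)

# Synthetic reliability scores (NOT data from the study).
match_scores = rng.normal(loc=0.0, scale=1.0, size=500)
mismatch_scores = rng.normal(loc=-0.5, scale=1.0, size=500)


def welch_t(a, b):
    """Welch's t statistic with a two-sided, normal-approximation p-value."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Standard error of the difference in means, without pooling variances.
    se = math.sqrt(a.var(ddof=1) / a.size + b.var(ddof=1) / b.size)
    t = (a.mean() - b.mean()) / se
    # Normal approximation to the t distribution (assumes large samples).
    p = math.erfc(abs(t) / math.sqrt(2.0))
    return t, p


t_stat, p_value = welch_t(mismatch_scores, match_scores)
shift = mismatch_scores.mean() - match_scores.mean()
# The caption marks p < 0.001 with three asterisks.
stars = "***" if p_value < 0.001 else ""
print(f"mean shift = {shift:.2f}, t = {t_stat:.2f}, p = {p_value:.2e} {stars}")
```

A negative shift together with a small p-value corresponds to the leftward shift described for the non-matching cohorts. For an exact Welch p-value, `scipy.stats.ttest_ind(a, b, equal_var=False)` can be used in place of the approximation.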


Fig 4.

The SRS can identify valuable attributes to create smarter filters on missingness.

A) Application data exhibits extremely low values of the SRS as compared to simulated data. B) Missing Informative attributes lead to lower SRS compared to missing Noise attributes. SWIF(r) was trained with a combination of five potentially informative attributes (“Informative”), and three “Noise” attributes drawn randomly from Gaussian distributions with no relation to the class of the instance. In the testing set, three attributes were removed from each instance: either all three Noise attributes (“Noise Removed”), or three randomly selected Informative attributes (“Informative Removed”). The SRS was then calculated for all instances. Instances where Informative attributes were removed have lower average SRS values compared to instances where the same number of Noise attributes were removed, visible in the histograms shown.
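The testing-set construction in panel B can be sketched as below. This covers only the data setup, not the SWIF(r) model or the SRS calculation. The sample size, the class-dependent mean shift of 2.0, and the choice to mark removed attributes with NaN are all assumptions made for illustration, as is the reading that the three Informative attributes are chosen independently for each instance.

```python
import numpy as np

rng = np.random.default_rng(2)
n_per_class, n_informative, n_noise = 200, 5, 3

labels = np.repeat([0, 1], n_per_class)

# Informative attributes: Gaussian with a class-dependent mean
# (hypothetical shift of 2.0 between the two classes).
informative = rng.normal(
    loc=2.0 * labels[:, None], scale=1.0, size=(labels.size, n_informative)
)
# Noise attributes: Gaussian draws with no relation to the class.
noise = rng.normal(size=(labels.size, n_noise))
X = np.hstack([informative, noise])

# "Noise Removed": all three Noise attributes are missing for every instance.
noise_removed = X.copy()
noise_removed[:, n_informative:] = np.nan

# "Informative Removed": three randomly selected Informative attributes
# are missing, chosen separately for each instance.
informative_removed = X.copy()
for row in informative_removed:
    drop = rng.choice(n_informative, size=3, replace=False)
    row[drop] = np.nan

print(np.isnan(noise_removed).sum(axis=1)[:5])
print(np.isnan(informative_removed).sum(axis=1)[:5])
```

Both testing sets have exactly three missing attributes per instance, so any difference in a downstream score reflects which attributes are missing rather than how many.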
