
Fig 1.

Orange data science toolbox.

Orange provides data analysis components, called widgets, which are assembled into data analysis workflows through visual programming. Each component typically encapsulates a data processing or modeling method; it receives data on its input and submits results to its output. Widgets in Orange are represented by icons with input slots on the left and output slots on the right. Users place widgets on the canvas and connect their inputs and outputs, thereby defining a data and information processing pipeline. The system processes the workflow on the fly: as soon as a widget receives information, it processes it and sends the results onward. In the workflow shown in the figure, the pipeline starts by reading the data (File widget) and passing it to cross-validation (Test and Score), which also receives a learning algorithm on its input. Double-clicking a widget exposes its content. For instance, we pass the cross-validation results to the Confusion Matrix, which shows that logistic regression misclassified only two data instances. We use the Scatter Plot to show the entire data set and also to display the data selected in the Confusion Matrix. Any change in the selection in the Confusion Matrix changes its output, which in turn triggers a change in the Scatter Plot. With this composition of components, the workflow becomes a visual exploratory environment for examining cross-validation results.
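The same pipeline can be sketched outside Orange. Below is a rough scikit-learn equivalent, not the workflow's own code: the data set (iris) and every function choice are our stand-ins for the File, Test and Score, and Confusion Matrix widgets.

```python
# Hedged scikit-learn stand-in for the workflow in Fig 1: cross-validate
# logistic regression and inspect where predictions disagree with classes.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict

X, y = load_iris(return_X_y=True)  # stand-in data set, not the article's
pred = cross_val_predict(LogisticRegression(max_iter=1000), X, y, cv=10)
cm = confusion_matrix(y, pred)     # off-diagonal entries are misclassifications
print(cm)
```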


Fig 2.

Incorrect evaluation of models.

The tree is tested on the data from which it was induced. The Distribution widget shows perfect correspondence between the predicted and actual gene functions.
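In code, the mistake looks as follows; this is a minimal scikit-learn sketch on a stand-in data set, not the gene-function data from the figure.

```python
# Evaluating a decision tree on the very data it was induced from.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)  # stand-in for the gene-function data
tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# The unpruned tree reproduces its training data perfectly -- an accuracy
# that says nothing about performance on new data.
train_accuracy = tree.score(X, y)
print(train_accuracy)
```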


Fig 3.

Modeling from permuted data.

Permutation of class labels should prevent successful modeling, yet the Distribution widget and the scores at the bottom of the Prediction widget show that the tree almost perfectly fits the data.
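A sketch of the same experiment, again with a stand-in data set: shuffling the class labels removes any real signal, yet an unpruned tree still fits the now-meaningless training data almost perfectly.

```python
# Fit a tree to permuted (information-free) labels and score it on them.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)   # stand-in data set
rng = np.random.default_rng(0)
y_shuffled = rng.permutation(y)     # labels now carry no information

tree = DecisionTreeClassifier(random_state=0).fit(X, y_shuffled)
shuffled_accuracy = tree.score(X, y_shuffled)
print(shuffled_accuracy)            # near-perfect despite random labels
```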


Fig 4.

A tree induced from random data.

Observing the tree reveals that it is too large for the given data set, and hence does not generalize well.


Fig 5.

Testing a model on a separate data set.

The Random Sample widget splits the data into two subsets, one for fitting and one for testing. The distribution of the model's predictions (right-hand histogram) roughly matches the distribution of actual classes (left-hand histogram), but for individual instances the actual class no longer matches the predicted one.
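A held-out test set exposes what training-set evaluation hides. The sketch below uses synthetic random data (our assumption, so that there is genuinely nothing to learn) and scikit-learn's splitter as a stand-in for the Random Sample widget.

```python
# Split random data, fit a tree on one part, score it on both parts.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))       # features carry no signal
y = rng.integers(0, 2, size=200)     # random binary class

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

train_acc = tree.score(X_tr, y_tr)   # the tree memorizes its sample
test_acc = tree.score(X_te, y_te)    # around chance level on held-out rows
print(train_acc, test_acc)
```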


Fig 6.

Testing with cross-validation.

The Tree widget receives no data and thus outputs not a tree but only an algorithm (a "recipe") for building one. The Randomize widget, which shuffles the data, is included here only to demonstrate that cross-validation detects overfitting by reporting low accuracy. In practice, we would use the actual, non-randomized data.
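The corresponding scikit-learn sketch, on synthetic stand-in data: the untrained estimator passed to `cross_val_score` plays the role of the "recipe", and the chance-level mean score correctly signals that the tree does not generalize.

```python
# Cross-validate a tree "recipe" on class-randomized (signal-free) data.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = rng.integers(0, 2, size=200)     # randomized labels: no signal

scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
mean_cv = scores.mean()              # hovers around chance (0.5)
print(mean_cv)
```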


Fig 7.

Improper way to select features.

The Dataset widget loads the data from our curated repository of data sets. The Preprocess widget selects the ten most informative features. This data is then used to cross-validate logistic regression, which achieves 96% classification accuracy.
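The improper procedure translates to code as follows. This is a hedged sketch: the data here is pure noise of sizes we chose, not the repository data set, yet the leak from selecting features on the full data before cross-validation still inflates the score well above chance.

```python
# WRONG procedure: feature selection sees the FULL data set, so information
# from future test folds leaks into the cross-validated model.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1000))     # many purely random features
y = rng.integers(0, 2, size=100)

# The selector is fitted on every row, including rows later used for testing.
X_sel = SelectKBest(f_classif, k=10).fit_transform(X, y)
leaky_mean = cross_val_score(LogisticRegression(), X_sel, y, cv=10).mean()
print(leaky_mean)                    # optimistically high despite pure noise
```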


Fig 8.

Selection of features on randomized data.

The performance of logistic regression remains excellent even on randomized data: the classification accuracy is 80%, compared with the 60% baseline of the majority class.


Fig 9.

Comparison of models on all and on selected features from random data.

The classification accuracy of logistic regression on all features is 62%, about the same as the proportion of the majority class. The model is thus no better than random guessing, as expected. This confirms that feature selection is responsible for the overly optimistic result.


Fig 10.

The proper workflow for cross-validation that includes data preprocessing.

In this workflow, preprocessing is not done prior to splitting the data. The preprocessing recipe, provided by the Preprocess widget, enters the cross-validation procedure and is applied to each training subset separately, without being informed by the data used for testing.
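In scikit-learn, the same idea is expressed with a Pipeline (our sketch of the Fig 10 workflow, again on synthetic noise of our choosing): the selector is re-fitted on each training fold and never sees the fold held out for testing.

```python
# Proper procedure: feature selection lives inside the cross-validated
# pipeline, so each fold's selection is blind to its test data.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1000))     # purely random features
y = rng.integers(0, 2, size=100)

pipe = make_pipeline(SelectKBest(f_classif, k=10), LogisticRegression())
honest_mean = cross_val_score(pipe, X, y, cv=10).mean()
print(honest_mean)                   # back to chance level, as it should be
```

The design point is the same as the Orange workflow's: the pipeline, like the Preprocess widget's recipe, is a description of steps, and cross-validation re-runs those steps from scratch inside every fold.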


Fig 11.

t-SNE visualization of random data.

We generate 10,000 normally distributed random variables and one Bernoulli variable, which we designate as the target. In Preprocess, we choose the ten variables most correlated with the target. This data is then shown in a t-SNE visualization. Here the value of the target variable is not shown; the dots representing the data items seem to be placed randomly, as expected, and do not expose any clustering structure.
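A sketch of this experiment in scikit-learn; the caption fixes only the 10,000 variables, so the row count (100) and all API choices are our assumptions.

```python
# Generate pure noise, keep the ten features most correlated with a random
# Bernoulli target, and embed the result with t-SNE.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10_000))   # normally distributed noise
y = rng.integers(0, 2, size=100)     # Bernoulli target

X_sel = SelectKBest(f_classif, k=10).fit_transform(X, y)  # leaks y into X_sel
embedding = TSNE(n_components=2, random_state=0).fit_transform(X_sel)
print(embedding.shape)               # one 2-D point per data item
```

Plotting `embedding` without class colors looks unstructured; coloring the points by `y` is what reveals the spurious separation.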


Fig 12.

Colored t-SNE on random data.

Data preprocessed by feature selection is visualized in a t-SNE plot that separates the data instances of different classes, denoted by blue and red. The density of blue points is higher in the top part of the visualization, and red points are denser in the lower half. This separation of instances of different classes is seemingly surprising, since the class assignment is random; it is a by-product of the preprocessing, which chose the ten features that happen, albeit arbitrarily, to be most correlated with the class variable.


Fig 13.

t-SNE visualization of a separate test data.

This workflow uses a random sample to discover the most informative variables, that is, the variables most correlated with the class variable. Apply Domain then takes the out-of-sample data and applies the transformation from the Preprocess widget, which in this case removes all but the ten variables chosen on the sample data. In this way, the procedure that selects the variables is not informed by the data shown in the visualization. This time, the plot does not expose any class structure; red and blue points are intermixed. Compare this outcome with the overfitted visualization in Fig 12.
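A hedged scikit-learn analogue of Apply Domain (data sizes are our assumptions): fit the selector on the in-sample data only, then apply the same, frozen transformation to the out-of-sample data.

```python
# Select features on one sample; apply the fitted selection, unchanged,
# to held-out data that never influenced the choice of columns.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10_000))   # pure-noise features
y = rng.integers(0, 2, size=200)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

selector = SelectKBest(f_classif, k=10).fit(X_tr, y_tr)  # in-sample only
X_te_sel = selector.transform(X_te)  # same ten columns; y_te never consulted
print(X_te_sel.shape)
```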


Fig 14.

Exploring class-randomized data.

The Scatter Plot widget in Orange can search for the feature combinations that best separate the classes. For the yeast expression data, diauxic shift (diau f) and sporulation at a five-hour timepoint (spo-mid) provide the best combination. When the data is class-randomized, the class labels change and the pattern of class separation disappears, but the data points keep their positions. The effect of randomization is also visible by comparing the two Data Tables.
