Hands-on training about data clustering with Orange data mining toolbox

doi:10.1371/journal.pcbi.1012574

Fig 1.

An Orange workflow we use in the early stages of training: the student loads the data, selects two features, computes the distance matrix, performs the clustering (Hierarchical Clustering), and displays the results in a scatter plot.

The dendrogram showing the result of the hierarchical clustering lets the user interactively set the cut-off point, indicated by the vertical line; the cut-off point in the figure yields 3 clusters, which are also shown in the scatter plot.
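The steps in this workflow (distance matrix, bottom-up merging, cut-off) can be sketched in plain Python. This is only an illustration, not Orange's implementation: the six two-feature points, the `cutoff` value, and the function names are made up for the example, and single linkage is assumed for simplicity, which need not match the linkage used in the figure.

```python
import math
from itertools import combinations

def distance_matrix(points):
    """Pairwise Euclidean distances between all points."""
    n = len(points)
    return [[math.dist(points[i], points[j]) for j in range(n)] for i in range(n)]

def linkage_gap(dist, a, b):
    """Single linkage (assumed here): smallest distance between members of a and b."""
    return min(dist[i][j] for i in a for j in b)

def cut_hierarchy(dist, cutoff):
    """Merge the closest clusters until the closest pair is farther apart than cutoff."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > 1:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: linkage_gap(dist, clusters[pair[0]], clusters[pair[1]]),
        )
        if linkage_gap(dist, clusters[a], clusters[b]) > cutoff:
            break  # everything left is separated by more than the cut-off
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # b > a, so index a is unaffected
    return clusters

# Hypothetical data: two selected features per instance.
points = [(1.0, 1.0), (1.2, 0.9), (5.0, 5.1), (5.2, 4.9), (9.0, 1.0), (9.1, 1.3)]
clusters = cut_hierarchy(distance_matrix(points), cutoff=1.0)
print(len(clusters), clusters)
```

Raising or lowering `cutoff` plays the role of dragging the cut-off line in the dendrogram: a larger value merges more clusters, a smaller one leaves more of them separate.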

Fig 2.

Interactive k-means in Orange.

Several components of Orange were designed specifically to support training in data science. In the workflow shown, trainees paint the data and then, in an interactive k-means clustering widget, set the initial positions of the centroids and step through the algorithm by alternately recomputing the centroid positions and reassigning data instances to centroids.
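The two alternating steps can be written out as a minimal sketch in plain Python. It is not the widget's code: the painted points, the hand-placed initial centroids, and the function names `assign` and `update` are assumptions made for illustration.

```python
import math

def assign(points, centroids):
    """Assignment step: index of the nearest centroid for each point."""
    return [
        min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
        for p in points
    ]

def update(points, labels, centroids):
    """Update step: move each centroid to the mean of its current members."""
    moved = []
    for k, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == k]
        if not members:
            moved.append(old)  # an empty cluster keeps its centroid in place
            continue
        moved.append(tuple(sum(coord) / len(members) for coord in zip(*members)))
    return moved

# Hypothetical "painted" data and hand-placed initial centroids.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.0), (8.0, 8.0), (9.0, 9.0), (8.5, 7.5)]
centroids = [(0.0, 0.0), (10.0, 10.0)]

labels = assign(points, centroids)
for _ in range(100):  # cap on the number of rounds
    centroids = update(points, labels, centroids)
    new_labels = assign(points, centroids)
    if new_labels == labels:
        break  # no instance changed cluster, so the procedure has converged
    labels = new_labels

print(labels, centroids)
```

Calling `update` and `assign` one at a time, rather than in a loop, mirrors stepping through the algorithm by hand; changing the initial `centroids` shows how the starting positions can affect the result.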

Fig 3.

Experimenting with DBSCAN with a workflow where students draw the data and then interactively adjust the neighborhood distance parameter in DBSCAN’s scree plot, with the effects of this choice immediately visible in the scatter plot showing the clusters (in color) and outliers (in gray).

DBSCAN is a clustering technique that is conceptually very different from hierarchical clustering and k-means, which makes it a good candidate for an additional pedagogical unit, time permitting.
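To make the contrast concrete, here is a compact sketch of DBSCAN in plain Python, together with the sorted neighbor-distance values that a plot for choosing the neighborhood distance would display. The data, the parameter values (`eps`, `min_pts`), and the function names are assumptions for illustration, and the code is not taken from Orange; here `min_pts` counts the point itself.

```python
import math

def region_query(points, i, eps):
    """Indices of all points within eps of point i (including i itself)."""
    return [j for j in range(len(points)) if math.dist(points[i], points[j]) <= eps]

def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters and -1 for outliers."""
    labels = [None] * len(points)  # None means not yet visited
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = -1  # provisionally an outlier
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = region_query(points, j, eps)
            if len(reachable) >= min_pts:
                queue.extend(reachable)  # j is a core point, so keep expanding
    return labels

def k_distances(points, k):
    """Sorted distance from each point to its k-th nearest other point."""
    return sorted(sorted(math.dist(p, q) for q in points)[k] for p in points)

# Hypothetical drawn data: two dense groups and one isolated point.
points = [(1.0, 1.0), (1.2, 1.1), (0.9, 1.3), (1.1, 0.8),
          (5.0, 5.0), (5.2, 5.1), (4.9, 5.3), (5.1, 4.8),
          (9.0, 1.0)]
labels = dbscan(points, eps=0.6, min_pts=3)
print(labels)
print(k_distances(points, 2))
```

A sharp jump at the end of the sorted distances marks instances with no close neighbors; setting `eps` below that jump leaves them labeled as outliers (`-1`).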
