Fig 1.
Representation of the workflow.
On the left, dimension reduction of the high-dimensional space defined by the fingerprints, followed by clustering; on the right, calculation of the molecular substructures.
Fig 2.
Distribution of the odor notes and the number of their occurrences.
A: Histogram of the number of odorants according to the number of odor notes. B: Histogram of the counts according to the number of occurrences of the odorants.
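The tallies behind histograms of this kind can be sketched as follows. This is only an illustration: the toy `odor_notes` mapping and all variable names are hypothetical and are not taken from the study's data.

```python
from collections import Counter

# Hypothetical toy data: odorant name -> list of odor notes (not from the study).
odor_notes = {
    "odorant_1": ["fruity", "sweet"],
    "odorant_2": ["fruity"],
    "odorant_3": ["green", "fruity", "floral"],
    "odorant_4": ["sweet"],
}

# Number of odorants for each number of odor notes (the kind of tally shown in panel A).
odorants_per_note_count = Counter(len(notes) for notes in odor_notes.values())

# Number of occurrences of each odor note across all odorants.
note_occurrences = Counter(note for notes in odor_notes.values() for note in notes)

# Number of odor notes for each occurrence count.
notes_per_occurrence = Counter(note_occurrences.values())
```

Each `Counter` maps a bin (a number of notes or a number of occurrences) to a bar height, so it can be passed directly to any plotting routine.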
Fig 3.
Visualization of the compound-odor dataset in the two-dimensional spaces obtained after dimension reduction using PCA, MDS, t-SNE and UMAP.
The data are colored according to the clusters produced by k-means clustering and AHC, both carried out on the coordinates in the 2D spaces. The colors serve only to make the clusters easy to distinguish and are specific to each method; a given color does not denote the same cluster from one method to another. The data are reported in S1 Table. (a) PCA k-means approach: clusters C1a, C2a, C3a and C4a contain 1523, 1466, 1622 and 1427 odorants, respectively; (b) PCA AHC approach: clusters C1b, C2b, C3b and C4b contain 1461, 1756, 1997 and 824 odorants, respectively; (c) MDS k-means approach: clusters C1c, C2c, C3c and C4c contain 1312, 1774, 1468 and 1484 odorants, respectively; (d) MDS AHC approach: clusters C1d, C2d, C3d and C4d contain 854, 1551, 1970 and 1663 odorants, respectively; (e) t-SNE k-means approach: clusters C1e, C2e, C3e, C4e and C5e contain 1008, 1375, 1225, 1122 and 1308 odorants, respectively; (f) t-SNE AHC approach: clusters C1f, C2f, C3f, C4f and C5f contain 1480, 636, 1633, 1524 and 765 odorants, respectively; (g) UMAP k-means approach: clusters C1g, C2g, C3g and C4g contain 1597, 1344, 1454 and 1643 odorants, respectively; (h) UMAP AHC approach: clusters C1h, C2h, C3h and C4h contain 1640, 1584, 1332 and 1482 odorants, respectively. In each chart, clusters C1, C2, C3, C4 and C5 are depicted in blue, orange, grey, yellow and light blue, respectively.
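As a quick consistency check on the cluster sizes quoted in this caption, the sizes in each panel can be summed. The sketch below copies the counts from the caption; the variable names are illustrative only.

```python
# Cluster sizes copied from the caption, keyed by panel letter.
cluster_sizes = {
    "a": [1523, 1466, 1622, 1427],        # PCA k-means
    "b": [1461, 1756, 1997, 824],         # PCA AHC
    "c": [1312, 1774, 1468, 1484],        # MDS k-means
    "d": [854, 1551, 1970, 1663],         # MDS AHC
    "e": [1008, 1375, 1225, 1122, 1308],  # t-SNE k-means
    "f": [1480, 636, 1633, 1524, 765],    # t-SNE AHC
    "g": [1597, 1344, 1454, 1643],        # UMAP k-means
    "h": [1640, 1584, 1332, 1482],        # UMAP AHC
}

# Total number of compounds assigned to a cluster in each panel.
totals = {panel: sum(sizes) for panel, sizes in cluster_sizes.items()}
print(totals)  # every panel sums to 6038
```

Every panel gives the same total of 6038, as expected if each approach partitions the same set of compounds.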
Fig 4.
Radar charts of the distribution of the %ON values obtained for the 17 most frequent odor notes across the clusters.
(a) Clusters obtained by the PCA k-means method; (b) Clusters obtained by the PCA AHC method; (c) Clusters obtained by the MDS k-means method; (d) Clusters obtained by the MDS AHC method; (e) Clusters obtained by the t-SNE k-means method; (f) Clusters obtained by the t-SNE AHC method; (g) Clusters obtained by the UMAP k-means method; (h) Clusters obtained by the UMAP AHC method. In each chart, clusters C1, C2, C3, C4 and C5 are depicted in blue, orange, grey, yellow and light blue, respectively.
Fig 5.
Histogram of the number of odor notes whose %ON is greater than 50 for each technique.
Fig 6.
Histogram of the distribution of the chemical functional groups across the clusters.
Only the structures present in at least 5% of the molecules of at least one of the four clusters C1, C2, C3 and C4 are shown: C1 in light blue, C2 in dark blue, C3 in dark red and C4 in yellow.
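The 5% selection rule can be sketched as follows. This is a minimal illustration, assuming each molecule is represented by the set of functional groups it contains; the function name `groups_to_plot` and the toy clusters are hypothetical, not the study's data.

```python
from collections import Counter

def groups_to_plot(cluster_groups, threshold=0.05):
    """Return the functional groups found in at least `threshold` of the
    molecules of at least one cluster.

    cluster_groups: cluster name -> list of sets, one set of groups per molecule.
    """
    kept = set()
    for molecules in cluster_groups.values():
        n = len(molecules)
        if n == 0:
            continue  # an empty cluster contributes nothing
        # Count, for each group, the number of molecules that contain it.
        counts = Counter(group for mol in molecules for group in mol)
        kept.update(g for g, c in counts.items() if c / n >= threshold)
    return kept

# Hypothetical toy clusters (not from the study).
toy = {
    "C1": [{"ester"}] * 10 + [set()] * 10,                      # ester: 50%
    "C2": [{"ketone"}] * 4 + [{"thiol"}] * 1 + [set()] * 35,    # ketone: 10%, thiol: 2.5%
}
selected = groups_to_plot(toy)
```

Here `selected` contains the ester and ketone groups, while the thiol group is dropped because it stays below 5% in every cluster.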
Fig 7.
Network representation of the links between odor notes (red ellipses) and chemical functional groups (blue diamonds).
Line thickness varies with the relative frequency of occurrences: the thicker the line, the higher the number of occurrences of an odor note or a chemical functional group within the cluster to which it is linked. Edges are not drawn when the relative frequency of occurrences is less than 0.1. The blue, orange, grey and yellow rectangles correspond to clusters 1, 2, 3 and 4, respectively. Lines take the color of their cluster: blue, orange, grey and yellow lines show the associations of clusters 1, 2, 3 and 4, respectively, with the odor notes and the chemical functional groups.
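The edge rule described in this caption can be sketched as a small filter. The 0.1 cut-off comes from the caption; the linear width scaling, the `max_width` parameter, the function name and the toy frequencies are assumptions made for illustration only.

```python
def visible_edges(rel_freq, min_freq=0.1, max_width=6.0):
    """Map each drawn edge to a line width.

    rel_freq: (cluster, node) -> relative frequency of occurrences in [0, 1].
    Edges with a relative frequency below `min_freq` are not drawn.
    Width is assumed to grow linearly with the relative frequency.
    """
    return {
        edge: max_width * freq
        for edge, freq in rel_freq.items()
        if freq >= min_freq
    }

# Hypothetical toy frequencies (not from the study).
toy_freq = {
    ("C1", "fruity"): 0.5,
    ("C1", "ester"): 0.05,   # below the cut-off, so not drawn
    ("C2", "green"): 0.25,
}
edges = visible_edges(toy_freq)
```

The returned widths can then be handed to a network drawing tool, with each edge colored according to its cluster.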