Fig 1.
Group separability measures in data science: an overview.
(a) Cluster validity indices are a subclass of separability measures that reward compactness. (b) Group separability measures such as the PSI-type measures aim to preserve both inter- and intra-group diversity and therefore rate the left and right data representations equally, whereas a cluster validity measure such as CH assigns a higher rating to a compact representation that neglects the intra-group diversity. This example offers evidence that cluster validity indices are not an appropriate solution for evaluating group separability. (c) Rows indicate different group separability measures: yellow background for cluster validity measures and green for geometric separability measures. The columns indicate different features of each measure, its applicability to a spectrum of problems that arise in data science, and its robustness under different types of noise.
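The contrast in panel (b) can be illustrated with a minimal sketch, assuming the standard Calinski-Harabasz (CH) definition: between-group dispersion over within-group dispersion, each divided by its degrees of freedom. The function name and the two toy one-dimensional representations are hypothetical and are not taken from the figure. Both representations are perfectly separable by a single threshold, yet CH rates the compact one far higher.

```python
def calinski_harabasz(points, labels):
    """CH index: (between / (k - 1)) / (within / (n - k))."""
    n = len(points)
    groups = sorted(set(labels))
    k = len(groups)
    dim = len(points[0])
    overall = [sum(p[d] for p in points) / n for d in range(dim)]
    between = 0.0
    within = 0.0
    for g in groups:
        members = [p for p, l in zip(points, labels) if l == g]
        center = [sum(p[d] for p in members) / len(members) for d in range(dim)]
        # Between-group dispersion: group size times squared distance to the overall mean.
        between += len(members) * sum((c - o) ** 2 for c, o in zip(center, overall))
        # Within-group dispersion: squared distances of members to their group mean.
        within += sum(sum((x - c) ** 2 for x, c in zip(p, center)) for p in members)
    return (between / (k - 1)) / (within / (n - k))

labels = ["a"] * 4 + ["b"] * 4
# Hypothetical toy data: same group centers, different intra-group spread.
spread = [(0.0,), (1.0,), (2.0,), (3.0,), (10.0,), (11.0,), (12.0,), (13.0,)]
compact = [(1.4,), (1.5,), (1.5,), (1.6,), (11.4,), (11.5,), (11.5,), (11.6,)]

ch_spread = calinski_harabasz(spread, labels)
ch_compact = calinski_harabasz(compact, labels)
print(ch_spread, ch_compact)
```

Because the between-group term is identical in the two cases, the difference in CH comes entirely from the within-group term, which is the intra-group diversity the caption refers to.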
Fig 2.
Geometric separability based on projection or first-neighbor strategies.
(a) Example of how the centroid projection line (CPS) or the linear discriminant line (LDPS) would be drawn considering the linear geometric separability of the two community-based groups of network nodes in Fig 3C. The two black dots at the center of the plot are the centroids of the respective groups of nodes. (b) The nodes are projected onto the projection separability line and the AUC-PR is computed to evaluate the extent of linear separability between the two groups. The AUC-PR can be substituted by any other binary classification measure suitable for unbalanced data. (c) Travelling salesman tour (with the dashed line) and path (without the dashed line) across the points that are the nodes of the nPSO network (Fig 3) embedded in the two-dimensional space (Fig 3C). The travelling salesman (TS) path approximates the projection separability curve that accounts for the intrinsic nonlinear geometry of the data points. (d) The nodes are aligned on the rectified TS path and the AUC-PR is computed to evaluate the extent of separability between the two groups. (e) and (f) are respectively equivalent to (c) and (d) when the labels of the two communities are reshuffled uniformly at random to generate one instance of the null model. (g) The geometric separability index (GSI) is defined as the proportion of data points whose classification label is the same as that of their first-nearest neighbor.
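Panels (a), (b) and (g) can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: centroids are taken as per-coordinate medians (as stated in the Fig 5 caption), the AUC-PR is estimated step-wise as average precision, tied scores are not given special handling, and the positive class is chosen by the caller. All function names and the toy points are hypothetical. TSPS is omitted because it requires a travelling salesman solver.

```python
from statistics import median

def centroid(points):
    # Per-coordinate median of a group of points.
    return [median(coord) for coord in zip(*points)]

def centroid_projection(points, labels, pos, neg):
    # Scalar position of every point along the line joining the two centroids.
    c_neg = centroid([p for p, l in zip(points, labels) if l == neg])
    c_pos = centroid([p for p, l in zip(points, labels) if l == pos])
    direction = [b - a for a, b in zip(c_neg, c_pos)]
    return [sum((x - a) * d for x, a, d in zip(p, c_neg, direction))
            for p in points]

def average_precision(scores, labels, pos):
    # Step-wise estimate of the area under the precision-recall curve.
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    n_pos = sum(1 for l in labels if l == pos)
    hits, total = 0, 0.0
    for rank, (_, l) in enumerate(ranked, start=1):
        if l == pos:
            hits += 1
            total += hits / rank
    return total / n_pos

def first_neighbor_index(points, labels):
    # Proportion of points whose nearest neighbor carries the same label.
    same = 0
    for i, p in enumerate(points):
        j = min((k for k in range(len(points)) if k != i),
                key=lambda k: sum((a - b) ** 2 for a, b in zip(p, points[k])))
        same += labels[i] == labels[j]
    return same / len(points)

# Hypothetical toy data: two well-separated groups of three points.
points = [(0, 0), (1, 0), (0, 1), (5, 5), (6, 5), (5, 6)]
labels = ["red", "red", "red", "green", "green", "green"]

scores = centroid_projection(points, labels, pos="green", neg="red")
cps_value = average_precision(scores, labels, pos="green")
gsi_value = first_neighbor_index(points, labels)
print(cps_value, gsi_value)
```

Replacing the centroid direction with the first discriminant vector of an LDA would give the LDPS variant; the projection and scoring steps stay the same.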
Fig 3.
Geometric separability of community-based mesoscale patterns in complex networks.
(a) The adjacency matrix of an artificial network with two communities generated with the nonuniform popularity-similarity optimization (nPSO) model. From the adjacency matrix, the presence of any mesoscale structure associated with community organization is not visible. (b) Embedding of the nPSO network in a two-dimensional geometric space by the HOPE algorithm reveals a geometric representation composed of two groups of nodes (one up and one down), providing evidence of the efficacy of network embedding in visualizing the latent mesoscale structure of complex networks. (c) Attributing to each node a color related to its community (red or green) in the network, we note that nodes in the same community are located closer to each other, forming two groups in the geometric space. Evaluating the representation of a network in relation to the geometric separability of the groups of nodes formed by their communities is an innovation that we introduce in this article.
Fig 4.
Statistical significance of the travelling salesman projection separability (TSPS) measure.
(a) The travelling salesman (TS) path that connects the points is (b) rectified, and the AUC-PR is computed to assess the separability of the observed group organization embedded in the geometric space. (c) The labels of the groups are reshuffled uniformly at random and the AUC-PR is computed to generate a random instance. (d) The generation of random replicates is repeated a fixed number of times (rows, n = 1000 in our study) and for all group pairs (columns). (e) The mean value across the group pairs (columns) is then computed for each replicate, and these mean values form the null model distribution. The p-value of the observed TSPS measure is computed as the fraction of random values that are larger than the observed value.
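The label-reshuffling procedure can be sketched as follows. This is a hedged illustration rather than the authors' code: the helper names are hypothetical, and a simple first-neighbor agreement score stands in for TSPS so that the example stays self-contained. With more than two groups, the `measure` callable would be expected to return the mean across group pairs.

```python
import random

def first_neighbor_agreement(points, labels):
    # Stand-in separability measure (NOT TSPS): share of points whose
    # nearest neighbor has the same label.
    same = 0
    for i, p in enumerate(points):
        j = min((k for k in range(len(points)) if k != i),
                key=lambda k: sum((a - b) ** 2 for a, b in zip(p, points[k])))
        same += labels[i] == labels[j]
    return same / len(points)

def null_model_p_value(points, labels, measure, n_random=1000, seed=0):
    rng = random.Random(seed)
    observed = measure(points, labels)
    shuffled = list(labels)
    null_values = []
    for _ in range(n_random):
        # Reshuffle the group labels uniformly at random, keep the points fixed.
        rng.shuffle(shuffled)
        null_values.append(measure(points, shuffled))
    # Fraction of random values strictly larger than the observed one.
    larger = sum(1 for v in null_values if v > observed)
    return observed, larger / n_random

# Hypothetical toy data: two well-separated groups of three points.
points = [(0, 0), (1, 0), (0, 1), (5, 5), (6, 5), (5, 6)]
labels = ["red", "red", "red", "green", "green", "green"]

observed, p_value = null_model_p_value(points, labels, first_neighbor_agreement)
print(observed, p_value)
```

Fixing the seed makes the null distribution reproducible; the number of replicates (1000 here, as in the caption) bounds the smallest nonzero p-value that can be resolved.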
Fig 5.
Linear and nonlinear separability in complex data science.
Red and green dots indicate the samples of two different groups. (a-e) refer to an example of a linearly separable dataset called Rhombus. (f-j) refer to an example of a nonlinearly separable dataset called Halfkernel. (a, f) Centroid projection separability (CPS): the two black dots indicate the centroids (median estimator) of the two groups of samples, the black line indicates the projection line, the vertical blue dashed lines indicate the projections of the samples. (b, g) Linear discriminant projection separability (LDPS): the black line indicates the first component projection vector of the linear discriminant analysis (LDA), the other graphics are as for (a). (c, h) Travelling salesman projection separability (TSPS): the travelling salesman path across the samples is indicated by the black solid lines. (d, i) Geometrical separability index (GSI): the black solid lines indicate the first neighbor sample matching. (e, j) Separability of each measure in the respective dataset: (e) Rhombus and (j) Halfkernel. (k, l) Mean and minimum separability of each measure across the two datasets. In (e, j) the values of the indices with a significant (p-value < 0.01) geometric separability are marked with a star, which means that these values are very unlikely to be obtained by chance.
Fig 6.
Hard separability problems in complex data science.
Examples of hard separability problems; the term ‘hard’ indicates that the presence of separability is difficult to detect. (a-e) refer to an example of a linearly separable dataset called Parallel lines. (f-j) refer to an example of a nonlinearly separable dataset called Circles. (k-o) refer to an example of a nonlinearly separable dataset called Spirals. (a, f, k) Centroid projection separability (CPS): the two black dots indicate the centroids (median estimator) of the two groups of samples, the black line indicates the projection line, the vertical blue dashed lines indicate the projections of the samples. (b, g, l) Linear discriminant projection separability (LDPS): the black line indicates the first component projection vector of the linear discriminant analysis (LDA), the other graphics are as for (a). (c, h, m) Travelling salesman projection separability (TSPS): the travelling salesman path across the samples is indicated by the black solid lines. (d, i, n) Geometrical separability index (GSI): the black solid lines indicate the first neighbor sample matching. (e, j, o) Separability of each measure in the respective dataset: (e) Parallel lines, (j) Circles and (o) Spirals. (p, q) Mean and minimum separability of each measure across the three datasets. In (e, j, o) the values of the indices with a significant (p-value < 0.01) geometric separability are marked with a star, which means that these values are very unlikely to be obtained by chance.
Fig 7.
Empirical evidence on real complex multidimensional data.
The radar signal dataset is composed of 350 valid samples, 34 features, and three groups: good radar signal, bad radar signal type1 and bad radar signal type2. The complexity of this dataset is associated with the mix of hierarchical and similarity relations between the data samples. (a) Embedding of the radar signal dataset in the two-dimensional space by the MCE algorithm. (b) The geometric separability of the MCE representation is rated as high quality (performance larger than 0.8) by every measure, because the MCE algorithm is able to produce a representation that accounts for hierarchical structure in the data. (c) The t-SNE representation (p = 31 is the best perplexity setting, see text for details) suffers from the crowding problem because the t-SNE algorithm does not preserve hierarchical organization well. (d) The geometric separability measures confirm that the t-SNE representation is of lower quality than the MCE representation with respect to the separability of the data group structure in the representation space. In (b, d) the values of the indices with a significant (p-value < 0.01) geometric separability are marked with a star, which means that these values are very unlikely to be obtained by chance.
Fig 8.
The multivariate community separability analysis of the methods performances (MCSAmp).
The x-axis shows PC1 and the y-axis shows PC2. (a) Spots of the same shape represent the same separability measure, and spots of the same color represent the same embedding method; the dashed lines of the same color connect the same embedding method evaluated using different separability measures across the networks. Node2vec (green polygon) produces network representations that do not overlap with those of the other methods. (b) Spots of the same shape represent the same embedding method, and spots of the same color represent the same separability measure; the dashed lines of the same color connect the same separability measure evaluated on the representations of different embedding methods across the networks. GSI (green triangle) produces separability evaluations that do not overlap with those of the other measures.
Fig 9.
The multivariate community separability analysis of the network representations (MCSAnr).
The x-axis shows PC1 and the y-axis shows PC2. (a) Spots of the same shape represent the same embedding method, and spots of the same color represent the same network representation; the dashed lines of the same color connect the representations of the same network evaluated using different separability measures across the networks. (b) Adaptive geometric separability (AGS) values evaluated in each network for the three different embedding methods. (c) Mean separability of each embedding method across the networks. (d) Minimum separability of each embedding method across the networks.
Fig 10.
Two-dimensional representations of Football, Polblogs and Karate according to the different embedding methods.
The x-axis shows the first dimension of the embedding and the y-axis shows the second. The name of the represented network is reported in bold at the top of each panel. The name and hyperparameter settings (when available) of the network embedding method are reported under the network name. The adaptive geometric separability (AGS) value, together with the name of the separability measure(s) that achieved the maximum, is reported under the embedding method name.
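The way an AGS value and its associated measure name(s) could be reported is sketched below. This assumes, based only on the wording of the caption, that AGS is the maximum value across the individual separability measures; the function name and the numeric values are hypothetical placeholders, not results from the study.

```python
def adaptive_geometric_separability(values):
    """Return the maximum separability and the measure(s) that achieve it."""
    best = max(values.values())
    winners = sorted(name for name, v in values.items() if v == best)
    return best, winners

# Placeholder numbers for one network and one embedding method (not real results).
example_values = {"CPS": 0.71, "LDPS": 0.71, "TSPS": 0.64, "GSI": 0.58}

ags, achieved_by = adaptive_geometric_separability(example_values)
print(ags, achieved_by)
```

Returning a list of names rather than a single name covers the case of ties, which is why the caption allows for more than one measure.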