Network-Based and Binless Frequency Analyses

We introduce and develop a new network-based and binless methodology to perform frequency analyses and produce histograms. In contrast with traditional frequency analysis techniques that use fixed intervals to bin values, we place a range ±ζ around each individual value in a data set and count the number of values within that range, which allows us to compare every value of a data set with every other. In essence, the methodology is identical to the construction of a network, where two values are connected if they lie within a given range (±ζ). The value with the highest degree (i.e., most connections) is therefore taken as the mode of the distribution. To select an optimal range, we look at the stability of the proportion of nodes in the largest cluster. The methodology is validated by sampling 12 typical distributions, and it is applied to a number of real-world data sets with both spatial and temporal components. The methodology can be applied to any data set and provides a robust means to uncover meaningful patterns and trends. A free Python script and a tutorial are also made available to facilitate the application of the method.


A. Mean vs. Mode to Represent a Distribution
Most distributions in real life are asymmetric, and the arithmetic mean of a distribution therefore rarely corresponds to the mode of the distribution. Despite this problem, the mean of a distribution is often used to describe overall trends and patterns in a system. Arguably, however, the mode of a distribution is much more representative of overall trends and should therefore be used to describe these trends.
For unimodal distributions (Fig. A1.a), this asymmetry skews the mean towards the side that has the larger tail, so the mean misses the mode. For multimodal distributions (Fig. A1.b), the mean is located in between the modes and therefore gives false information about relevant trends.

B. Kernel Density Estimation
Kernel density estimation (KDE) is a conventional machine learning technique to determine the probability density function of a dataset. Akin to the proposed methodology, it is non-parametric and it compares every single value of a dataset with one another within a certain threshold to construct a probability density function. KDE does not go beyond calculating a probability density function, however, unlike the proposed methodology.
The premise of KDE is to calculate a normal distribution for every data point, taking its value as the mean. Various methods exist to assign a value for the standard deviation; they are discussed briefly below. Mathematically, for a dataset of N points, the probability p(x) given any data point x_i and a standard deviation σ is:

$$p(x) = \frac{1}{N} \sum_{i=1}^{N} \mathcal{N}(x \mid x_i, \sigma^2)$$

where $\mathcal{N}$ is the normal distribution. From [1], the approach can be generalized by:

$$p(x) = \frac{1}{N} \sum_{i=1}^{N} \kappa_h(x - x_i)$$

where the kernel κ_h sets the range (bandwidth h) around each and every value, akin to the proposed methodology. This equation is also called a Parzen window density estimator.
One of the most common methods to determine κ_h is to minimize the mean integrated squared error (MISE), that is, the squared error between the resulting distribution and the original data. Since there are no analytical solutions, this process can be relatively slow, but it is comparable in speed to the proposed methodology.
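As an illustration of the Gaussian case described above (one normal density centered on each data point, averaged over the N points), a minimal NumPy sketch could look as follows. The function name `gaussian_kde_1d` and the choice of a single fixed σ for all points are assumptions made for this example; they are not taken from the script that accompanies the paper.

```python
import numpy as np

def gaussian_kde_1d(x_eval, data, sigma):
    """Average of normal densities centered on each data point (hypothetical helper)."""
    x_eval = np.asarray(x_eval, dtype=float)
    data = np.asarray(data, dtype=float)
    # Standardized distance between every evaluation point and every data point.
    z = (x_eval[:, None] - data[None, :]) / sigma
    # Normal density N(x | x_i, sigma^2) for each (x, x_i) pair.
    kernels = np.exp(-0.5 * z**2) / (sigma * np.sqrt(2.0 * np.pi))
    # p(x) is the mean of the N kernels at each evaluation point.
    return kernels.mean(axis=1)

# Usage: evaluate the estimate on a grid for a small illustrative data set.
grid = np.linspace(-10.0, 20.0, 3001)
density = gaussian_kde_1d(grid, [4.0, 5.0, 6.5], sigma=0.5)
```

A smaller σ gives a spikier estimate and a larger σ a smoother one, which is why the choice of range matters for KDE just as the choice of ζ matters for the proposed method.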

C. Simple and Illustrated Example for the Methodology
To illustrate the method, Fig. C1 shows a random sampling of 10 values from the normal distribution N(5,2). The left-hand side applies a traditional binning process with a bin width of 1, and the right-hand side applies the proposed method with ζ = 0.5, which also represents a range of 1. In this particular case, the NB histogram captures the properties of the simulated distribution better than the traditional histogram. Moreover, the traditional histogram places the mode of the distribution in a range [5,6), whereas the NB methodology gives a more desirable discrete value, 5.33, as the mode.
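The counting step of this example can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions: the helper name `nb_degrees` is hypothetical, the boundary of the range is treated as inclusive, the 2 in N(5,2) is read as the standard deviation, and the seed is arbitrary, so the sample (and the resulting mode) will differ from the one shown in Fig. C1.

```python
import numpy as np

def nb_degrees(values, zeta):
    """Degree of each value: how many other values lie within +/- zeta."""
    x = np.asarray(values, dtype=float)
    # Pairwise absolute differences; entry (i, j) is |x_i - x_j|.
    diff = np.abs(x[:, None] - x[None, :])
    # Count neighbours within the range, excluding the value itself.
    return (diff <= zeta).sum(axis=1) - 1

rng = np.random.default_rng(0)                  # arbitrary seed (assumption)
sample = rng.normal(loc=5, scale=2, size=10)    # 10 draws, scale read as std
degrees = nb_degrees(sample, zeta=0.5)
mode_estimate = sample[np.argmax(degrees)]      # value with the highest degree
```

Plotting `degrees` against `sample` gives the binless histogram; if several values tie for the highest degree, `np.argmax` simply returns the first one.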

E. Method Validation And Distribution Properties
The figures and information below show the twelve distributions that were selected for Fig. 2. We used the Python library NumPy to randomly draw 100 points from each of these distributions. We then used the Python library igraph to form the networks.
First, we list the equations for each distribution, the parameters selected, and the optimal cutoff value ζ_s. Below the equations, we show the simulated network-based histogram as a solid green line and the theoretical distribution in shaded gray. The figures were standardized between 0 and 1 so they could be superimposed.
To the right of these figures, we overlay the theoretical distribution in shaded gray with a histogram in shaded blue that was produced using Scott's rule for bin sizing, where the bin size b is defined as:

$$b = \frac{3.49\,\sigma}{n^{1/3}}$$

where σ is the standard deviation and n is the size of the population.
Below the main figures for each distribution, we show how the total number of edges/links E, the proportion of nodes in the largest cluster p_g, and the diameter D and average path length L_avg, respectively, evolve as a function of ζ.
To determine ζ_s, we first picked the median of the distribution and assigned ζ as 1% of the median, followed by 2% of the median, and so on until 50% of the median. For each resulting network, we measured p_g = V_g / V, where V_g is the number of nodes in the largest cluster and V is the total number of nodes. The cutoff ζ_s was selected when p_g remained identical for eight consecutive increases in the cutoff ζ. Eight was selected as a fairly standard statistical value, but it can be increased or decreased depending on the selection of intervals between each ζ.
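For reference, Scott's rule is simple to compute directly. The sketch below assumes the commonly quoted constant 3.49 and the sample standard deviation (`ddof=1`); both are assumptions of this example, and the function name `scott_bin_width` is hypothetical.

```python
import numpy as np

def scott_bin_width(values):
    """Bin width from Scott's rule: 3.49 * sigma / n^(1/3) (assumed constant)."""
    x = np.asarray(values, dtype=float)
    sigma = x.std(ddof=1)          # sample standard deviation (assumption)
    return 3.49 * sigma * x.size ** (-1.0 / 3.0)

# Usage: bin width for 100 draws from a normal distribution.
rng = np.random.default_rng(1)     # arbitrary seed
width = scott_bin_width(rng.normal(loc=5, scale=2, size=100))
```

Because the width shrinks only with the cube root of n, a sample of 100 points still yields fairly coarse bins, which is one reason the comparison with the binless histogram is informative.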
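The selection procedure above can be sketched as follows. The paper forms the networks with igraph; this sketch instead uses a shortcut that is valid for one-dimensional data (after sorting, a cluster is a run of values whose consecutive gaps do not exceed ζ), so it is an illustration and not the authors' implementation. The function names, the inclusive boundary, the reading of "eight consecutive" as eight identical p_g values in a row, and the choice to return the first ζ of the stable run are all assumptions.

```python
import numpy as np

def largest_cluster_fraction(values, zeta):
    """p_g = V_g / V for the network linking values within +/- zeta."""
    x = np.sort(np.asarray(values, dtype=float))
    # A gap larger than zeta between sorted neighbours starts a new cluster.
    breaks = np.flatnonzero(np.diff(x) > zeta)
    edges = np.concatenate(([0], breaks + 1, [x.size]))
    return np.diff(edges).max() / x.size

def select_zeta(values, n_stable=8, max_percent=50):
    """Scan zeta from 1% to max_percent% of the median; stop on a stable p_g."""
    median = np.median(values)
    previous, run, run_start = None, 0, None
    for percent in range(1, max_percent + 1):
        zeta = median * percent / 100.0
        p_g = largest_cluster_fraction(values, zeta)
        if p_g == previous:
            run += 1
        else:
            previous, run, run_start = p_g, 1, zeta
        if run >= n_stable:
            # Assumption: report the first zeta of the stable run.
            return run_start, p_g
    return None  # no stable plateau found within the scanned range

# Usage on a small illustrative data set (not from the paper).
result = select_zeta([10.0, 10.5, 11.0, 20.0, 20.2])
```

With a finer increment (for instance 0.1% of the median), `n_stable` would likely need to be larger to represent the same span of ζ, which matches the remark that the value eight depends on the chosen intervals.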

Geometric Distribution
Comment: all numbers generated are necessarily integers, which is why the network properties do not evolve by increasing ζ.

F. Network Properties of Three Real-World Applications
A procedure similar to the one described in S1 E was employed for the three real-world applications. For life expectancy, however, the increment was modified from 1% of the median to 0.1% of the median considering the large number of points (i.e., 199 countries) and the relatively small scale of possible values (from 45.32 to 83.48 years old).
The figures below show first the histogram that was produced using Scott's rule for bin sizing, and then the graph properties of the selected networks in the following order: number of links, evolution of p_g, and evolution of D and L_avg.

Figure G1 shows a map of the 2,207 census tracts in the Chicago MSA. The categories of colors were carefully selected to highlight the fact that the traditionally reported population density of 496 pers/km² does not represent the region well. We can clearly see from this map that peripheral census tracts with low population densities have large areas that heavily impact the measurement of population density. In contrast, Figure G2 shows a cartogram of the Chicago MSA where the polygons of the census tracts have been redrawn and weighted by population. We can now clearly see that the large peripheral census tracts do not house a large proportion of the population. In contrast, the number of census tracts around the calculated mode is significant. By definition, half the census tracts have a population density higher than the median (i.e., the last two categories), attesting to the heavily skewed nature of population density (i.e., large right tail), hence the use of a log scale in Fig. 3.

H. Longitudinal Study of Life Expectancy
Unlike the other applications, we started with ζ = 0.1, since life expectancy ranged from 28.21 years old (Mali in 1960) to 83.16 years old (San Marino in 2010). We then increased ζ by 0.1 years until 3.5. Moreover, a threshold of 0.68 was set for p_g, and we reduced the number of consecutive identical p_g values to 6 because of this threshold. The eventual selection of ζ tends to be around the last peak in diameter.
The figures below show first the histogram that was produced using Scott's rule for bin sizing, and then the graph properties of the selected networks in the following order: number of links, evolution of p_g, and evolution of D and L_avg. Here again we can see that the traditional