Theoretical properties of distance distributions and novel metrics for nearest-neighbor feature selection

doi:10.1371/journal.pone.0246761

Fig 1.

Convergence to Gaussian for Manhattan and Euclidean distances for simulated standard uniform data with m = 100 instances and p = 10, 100, and 10000 attributes.

Convergence to Gaussian occurs rapidly with increasing p, and the Gaussian is a good approximation for p as low as 10 attributes. The number of attributes in bioinformatics data is typically much larger, at least on the order of 10³. The Euclidean metric converges to normality more strongly than the Manhattan metric. P values are from the Shapiro-Wilk test, whose null hypothesis is that the distances follow a Gaussian distribution.
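
The simulation described in this caption can be sketched roughly as follows. This is a minimal illustration, not the authors' code: it uses only the Python standard library, a smaller m than the figure, and sample skewness as a crude stand-in for the Shapiro-Wilk test (which the standard library does not provide). The function names `pairwise_distances`, `skewness`, and `simulate` are hypothetical.

```python
import math
import random
from itertools import combinations
from statistics import mean, pstdev


def pairwise_distances(data, q):
    """Return all pairwise L_q distances between the rows of `data`."""
    dists = []
    for x, y in combinations(data, 2):
        total = sum(abs(a - b) ** q for a, b in zip(x, y))
        dists.append(total ** (1.0 / q))
    return dists


def skewness(values):
    """Sample skewness; values near 0 are consistent with a symmetric shape."""
    mu = mean(values)
    sd = pstdev(values)
    return sum(((v - mu) / sd) ** 3 for v in values) / len(values)


def simulate(m, p, seed=0):
    """Draw m instances with p standard uniform attributes and summarize both metrics."""
    rng = random.Random(seed)
    data = [[rng.random() for _ in range(p)] for _ in range(m)]
    summary = {}
    for name, q in (("manhattan", 1), ("euclidean", 2)):
        d = pairwise_distances(data, q)
        summary[name] = {"n": len(d), "mean": mean(d), "skew": skewness(d)}
    return summary


if __name__ == "__main__":
    # Assumption: m = 40 keeps the pure-Python run fast; the figure uses m = 100.
    for p in (10, 100):
        print(p, simulate(m=40, p=p))
```

With SciPy available, the skewness check could be replaced by a proper Shapiro-Wilk test on the same distance lists.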

Fig 2.

Summary of distance distribution derivations for standard normal and standard uniform data.

Asymptotic estimates are given for both standard (Eq 1) and max-min normalized (Eq 58) q-metrics. These estimates are relevant for all q and p ≫ 1 for which the normality assumption on the distances holds.

Fig 3.

Asymptotic estimates of means and variances for the standard L₁ and L₂ (q = 1 and q = 2 in Fig 2) distance distributions.

Estimates for both standard normal and standard uniform data are given.

Fig 4.

Asymptotic estimates of means and variances for the max-min normalized L₁ and L₂ distance distributions commonly used in Relief-based algorithms.

Estimates for both standard normal and standard uniform data are given. The cumulative distribution function of the standard normal distribution is represented by Φ. Furthermore, the quantity given by Eq 86 is the asymptotic median of the sample maximum of m standard normal random samples.

Fig 5.

Purines (A and G) and pyrimidines (C and T) are shown.

Transitions occur when a mutation involves a purine-to-purine or pyrimidine-to-pyrimidine substitution. Transversions occur when a purine-to-pyrimidine or pyrimidine-to-purine substitution happens, which is a more extreme case. There are visibly more possible transversions than transitions, yet transitions are about twice as frequent in real data.
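
The counting claim in this caption can be checked with a short sketch. It assumes single-base changes among the four bases A, C, G, and T; the helper name `substitution_type` is hypothetical.

```python
from itertools import permutations

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_type(ref, alt):
    """Classify a single-base change as a transition (Ti) or transversion (Tv)."""
    if ref == alt:
        raise ValueError("ref and alt must differ")
    # A change that stays within one chemical class is a transition.
    same_class = ({ref, alt} <= PURINES) or ({ref, alt} <= PYRIMIDINES)
    return "Ti" if same_class else "Tv"


# Count the ordered base changes of each type among the four bases.
counts = {"Ti": 0, "Tv": 0}
for ref, alt in permutations("ACGT", 2):
    counts[substitution_type(ref, alt)] += 1

print(counts)  # {'Ti': 4, 'Tv': 8}
```

Of the 12 possible ordered changes, 8 are transversions and 4 are transitions, which is the imbalance visible in the diagram.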

Fig 6.

Predicted average TiTv distance as a function of average minor allele frequency (see Eq 142).

Success probabilities f_a are drawn from a sliding-window interval that moves from 0.01 to 0.9 in increments of about 0.009, with m = p = 100. For η = 0.1, where η is the Ti/Tv ratio given by Eq 126, Tv is ten times more likely than Ti, which results in larger distances. At η = 1, Tv and Ti are equally likely and the distance is lower. For η = 2, which is in line with real data, Tv is half as likely as Ti, so the distances are relatively small.

Fig 7.

Density curves and moments of TiTv distance as a function of average MAF, given by Eq 142, and Ti/Tv ratio η, given by Eq 127.

We fix m = p = 100 for all simulated TiTv distances. (A) For fixed average MAF, TiTv distance density is plotted as a function of increasing η. TiTv distance decreases as η increases. For η = Ti/Tv = 0.5, there are twice as many transversions as transitions, whereas η = Ti/Tv = 2 indicates half as many transversions as transitions. Because transversions encode a larger distance than transitions, this behavior is expected. (B) Simulated and predicted mean ± SD are shown as a function of increasing Ti/Tv ratio η. Distance decreases as Ti/Tv increases. Theoretical and simulated moments are approximately the same. (C) For fixed η = 2, TiTv distance density is plotted as a function of increasing average MAF. TiTv distance increases as the average MAF approaches its maximum of 0.5, at which minor and major alleles are about equally frequent. (D) Simulated and predicted mean ± SD are shown as a function of increasing average MAF. Distance increases as the number of minor alleles increases. Theoretical and simulated moments are approximately the same.

Fig 8.

Asymptotic estimates of means and variances of genotype mismatch (GM) (Eq 113), allele mismatch (AM) (Eq 114), and transition-transversion (TiTv) (Eq 115) distance metrics in GWAS data (p ≫ 1).

For GWAS data, f_a is the probability of a minor allele occurring at locus a, for each locus a. For the TiTv distance metric, we have the additional encoding that uses γ₀ = P(PuPu), γ₁ = P(PuPy), and γ₂ = P(PyPy).

Fig 9.

Organization, by brain region of interest (ROI), of a resting-state fMRI correlation dataset consisting of transformed correlation matrices for m subjects.

Each column corresponds to an instance (or subject) I_j and each subset of rows corresponds to the correlations for an ROI attribute (p sets). Each entry is the r-to-z transformed correlation between attributes (ROIs) a and k ≠ a for instance j.
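
One subject's column in this layout could be assembled as in the sketch below. It assumes the r-to-z transform is the Fisher transform (the inverse hyperbolic tangent) and that each ROI contributes its p − 1 off-diagonal correlations in order. The function name `subject_column` and the example matrix are hypothetical.

```python
import math


def subject_column(corr):
    """Stack one subject's r-to-z transformed off-diagonal correlations, grouped by ROI."""
    p = len(corr)
    column = []
    for a in range(p):  # one block of p - 1 rows per ROI attribute a
        for k in range(p):
            if k != a:
                column.append(math.atanh(corr[a][k]))  # Fisher r-to-z (assumed)
    return column


# Hypothetical 3-ROI correlation matrix for a single subject.
example_corr = [
    [1.0, 0.5, -0.2],
    [0.5, 1.0, 0.3],
    [-0.2, 0.3, 1.0],
]
column = subject_column(example_corr)
print(len(column))  # p * (p - 1) = 6 rows for this subject
```

Repeating this for each of the m subjects and placing the results side by side gives a matrix with p(p − 1) rows and m columns.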

Fig 10.

Asymptotic means and variances for the new standard (Eq 158) and max-min normalized (Eq 166) rs-fMRI distance metrics.

Fig 11.

Comparison of theoretical and sample moments of Manhattan (Eq 1) distances in standard normal data.

(A) Scatter plot of theoretical vs simulated mean Manhattan distance (Eq 41). Each point represents a different number of attributes p. For each value of p we fixed m = 100 and generated 20 distance matrices from standard normal data and computed the average simulated pairwise distance from the 20 iterations. The corresponding theoretical mean was then computed for each value of p for comparison. The dashed line represents the identity (or y = x) line for reference. (B) Scatter plot of theoretical vs simulated standard deviation of Manhattan (Eq 1) distance (Eq 42). These standard deviations come from the same random distance matrices for which mean distance was computed for A. Both theoretical mean and standard deviation approximate the simulated moments quite well.
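
A comparison of this kind can be sketched as below. The theoretical moments here are derived directly from the fact that the difference of two independent standard normal attributes is N(0, 2), so each attribute contributes mean 2/√π and variance 2 − 4/π to the Manhattan distance. They are expected to agree with Eq 41 and Eq 42 but may be written in a different form there. The function names, the single replicate, and m = 40 are assumptions made to keep the example fast.

```python
import math
import random
from itertools import combinations
from statistics import mean, pstdev


def theoretical_moments(p):
    """Mean and SD of the Manhattan distance between two standard normal instances."""
    mu = 2.0 * p / math.sqrt(math.pi)  # p * E|X - Y|, with X - Y ~ N(0, 2)
    sd = math.sqrt(2.0 * p * (math.pi - 2.0) / math.pi)  # sqrt(p * (2 - 4/pi))
    return mu, sd


def simulated_moments(m, p, rng):
    """Sample mean and SD of all pairwise Manhattan distances for m x p normal data."""
    data = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(m)]
    d = [sum(abs(a - b) for a, b in zip(x, y)) for x, y in combinations(data, 2)]
    return mean(d), pstdev(d)


if __name__ == "__main__":
    rng = random.Random(1)
    for p in (10, 50):
        print(p, theoretical_moments(p), simulated_moments(40, p, rng))
```

Averaging the simulated moments over several replicates, as the caption describes, would tighten the agreement further.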

Fig 12.

Distance densities from uncorrelated vs correlated bioinformatics data.

(A) Euclidean distance densities for random normal data with and without correlation. Correlated data were created by multiplying random normal data by the upper-triangular Cholesky factor of a randomly generated correlation matrix. We created correlated data at a specified average absolute pairwise correlation (Eq 176). (B) TiTv distance densities for random binomial data with and without correlation. Correlated data were created by first generating correlated standard normal data using the Cholesky method from (A). We then applied the standard normal CDF to create correlated uniformly distributed data, which were transformed by the inverse binomial CDF with n = 2 trials and success probabilities f_a. (C) Time series correlation-based distance densities for random rs-fMRI data (Fig 9) with and without additional pairwise feature correlation. Correlation was added to the transformed rs-fMRI data matrix (Fig 9) using the Cholesky algorithm from (A).
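
The data-generation steps in (A) and (B) can be sketched as follows, using only the Python standard library. This is an illustration under stated assumptions, not the authors' implementation: the correlation matrix is supplied by hand rather than randomly generated, and the names `cholesky_lower`, `correlated_normals`, `binomial2_quantile`, and `correlated_genotypes` are hypothetical. Applying the lower factor L to each row is equivalent to right-multiplying the data matrix by the upper factor R = Lᵀ.

```python
import math
import random
from statistics import NormalDist


def cholesky_lower(c):
    """Lower-triangular L with L L^T = c, for a symmetric positive-definite c."""
    n = len(c)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(c[i][i] - s)
            else:
                low[i][j] = (c[i][j] - s) / low[j][j]
    return low


def correlated_normals(corr, m, rng):
    """Generate m rows of standard normal data with correlation matrix `corr`."""
    low = cholesky_lower(corr)
    p = len(corr)
    rows = []
    for _ in range(m):
        z = [rng.gauss(0.0, 1.0) for _ in range(p)]
        rows.append([sum(low[i][k] * z[k] for k in range(i + 1)) for i in range(p)])
    return rows


def binomial2_quantile(u, f):
    """Inverse CDF of Binomial(n=2, f): minor-allele count for a uniform value u."""
    p0 = (1.0 - f) ** 2
    p1 = p0 + 2.0 * f * (1.0 - f)
    if u <= p0:
        return 0
    return 1 if u <= p1 else 2


def correlated_genotypes(corr, freqs, m, seed=0):
    """Correlated normals -> uniforms via the normal CDF -> binomial genotypes."""
    rng = random.Random(seed)
    phi = NormalDist().cdf
    return [[binomial2_quantile(phi(x), f) for x, f in zip(row, freqs)]
            for row in correlated_normals(corr, m, rng)]


if __name__ == "__main__":
    corr = [[1.0, 0.8], [0.8, 1.0]]  # hypothetical 2-attribute correlation matrix
    print(correlated_genotypes(corr, freqs=[0.3, 0.3], m=5))
```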

Fig 13.

Simulation comparison between rule-of-thumb naive k = 10 and distance-distribution informed k_{α = 1/2}.

Precision and recall for the functional features are significantly improved with informed k compared with naive k = 10. Training and validation classification accuracy are similar for the two values of k, with slightly less overfitting for informed k.
