Fig 1.
Convergence to Gaussian for Manhattan and Euclidean distances for simulated standard uniform data with m = 100 instances and p = 10, 100, and 10000 attributes.
Convergence to Gaussian occurs rapidly with increasing p, and Gaussian is a good approximation for p as low as 10 attributes. The number of attributes in bioinformatics data is typically much larger, at least on the order of 10³. The Euclidean metric converges to normality more strongly than the Manhattan metric. P values are from the Shapiro-Wilk test, whose null hypothesis is that the distances follow a Gaussian distribution.
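The simulation behind this figure can be sketched as follows. This is a minimal standard-library illustration, not the authors' code: the function names are hypothetical, and sample skewness and excess kurtosis (both 0 for a Gaussian) stand in for the Shapiro-Wilk test, which the standard library does not provide.

```python
import random
from itertools import combinations


def pairwise_lq(data, q):
    """Return all pairwise L_q distances between the rows of `data`."""
    return [
        sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)
        for x, y in combinations(data, 2)
    ]


def skew_and_excess_kurtosis(values):
    """Sample skewness and excess kurtosis; both are 0 for a Gaussian."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0


def simulate_uniform_distances(m, p, q, seed=0):
    """Pairwise L_q distances for m instances with p standard uniform attributes."""
    rng = random.Random(seed)
    data = [[rng.random() for _ in range(p)] for _ in range(m)]
    return pairwise_lq(data, q)


# p = 10000 also works but is slow in pure Python, so it is omitted here.
for p in (10, 100):
    for q, name in ((1, "Manhattan"), (2, "Euclidean")):
        d = simulate_uniform_distances(m=100, p=p, q=q)
        skew, kurt = skew_and_excess_kurtosis(d)
        print(f"p={p:4d} {name:9s} skew={skew:+.3f} excess kurtosis={kurt:+.3f}")
```

Both summary statistics should move toward 0 as p grows if the distances are converging to a Gaussian.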
Fig 2.
Summary of distance distribution derivations for standard normal (N(0, 1)) and standard uniform (U(0, 1)) data.
Asymptotic estimates are given for both standard (Eq 1) and max-min normalized (Eq 58) q-metrics. These estimates are relevant for all and p ≫ 1 for which the normality assumption of distances holds.
Fig 3.
Asymptotic estimates of means and variances for the standard L1 and L2 (q = 1 and q = 2 in Fig 2) distance distributions.
Estimates for both standard normal (N(0, 1)) and standard uniform (U(0, 1)) data are given.
Fig 4.
Asymptotic estimates of means and variances for the max-min normalized L1 and L2 distance distributions commonly used in Relief-based algorithms.
Estimates for both standard normal (N(0, 1)) and standard uniform (U(0, 1)) data are given. The cumulative distribution function of the standard normal distribution is represented by Φ. Furthermore,
(Eq 86) is the asymptotic median of the sample maximum from m standard normal random samples.
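A max-min normalized distance can be sketched as below. Eq 58 is not reproduced in this caption, so the form used here is an assumption: each attribute difference is divided by that attribute's observed range (maximum minus minimum) before the L_q sum, as in the usual Relief difference function. The function name and the handling of constant attributes are illustrative choices.

```python
def max_min_normalized_lq(data, i, j, q=1):
    """L_q distance between rows i and j with per-attribute range scaling.

    Assumed form: (sum_a (|x_ia - x_ja| / (max_a - min_a))^q)^(1/q).
    """
    ranges = [max(col) - min(col) for col in zip(*data)]
    total = 0.0
    for a, b, r in zip(data[i], data[j], ranges):
        if r > 0:  # a constant attribute cannot separate instances; skip it
            total += (abs(a - b) / r) ** q
    return total ** (1.0 / q)
```

With this scaling every attribute contributes at most 1 to the sum, so attributes measured on large scales do not dominate the distance.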
Fig 5.
Purines (A and G) and pyrimidines (C and T) are shown.
Transitions occur when a mutation substitutes a purine for a purine or a pyrimidine for a pyrimidine. Transversions occur when a purine is substituted for a pyrimidine or vice versa, which is a more extreme change. There are visibly more possible transversions than transitions, but transitions are about twice as frequent in real data.
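The classification in this figure can be written as a short function. This is an illustrative sketch with a hypothetical function name. It treats a mutation as a change from one base to another, and it also counts the possible changes of each kind.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(ref, alt):
    """Label a base change as 'transition', 'transversion', or 'none'."""
    ref, alt = ref.upper(), alt.upper()
    for base in (ref, alt):
        if base not in PURINES | PYRIMIDINES:
            raise ValueError(f"unknown base: {base!r}")
    if ref == alt:
        return "none"
    # Same chemical class on both sides means a transition.
    same_class = (ref in PURINES) == (alt in PURINES)
    return "transition" if same_class else "transversion"


# Count the possible ordered base changes of each kind.
counts = {"transition": 0, "transversion": 0}
for ref in "ACGT":
    for alt in "ACGT":
        kind = classify_substitution(ref, alt)
        if kind != "none":
            counts[kind] += 1
print(counts)
```

Of the 12 possible ordered base changes, 4 are transitions and 8 are transversions, which is the imbalance visible in the figure.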
Fig 6.
Predicted average TiTv distance as a function of average minor allele frequency (see Eq 142).
Success probabilities f_a are drawn from a sliding window interval from 0.01 to 0.9 in increments of about 0.009, with m = p = 100. For η = 0.1, where η is the Ti/Tv ratio given by Eq 126, Tv is ten times more likely than Ti, which results in larger distances. At η = 1, Tv and Ti are equally likely and the distance is lower. For η = 2, which is in line with real data, Tv is half as likely as Ti, so the distances are relatively small.
Fig 7.
Density curves and moments of TiTv distance as a function of average MAF, given by Eq 142, and Ti/Tv ratio η, given by Eq 127.
We fix m = p = 100 for all simulated TiTv distances. (A) For fixed average MAF, TiTv distance density is plotted as a function of increasing η. TiTv distance decreases as η increases. For η = Ti/Tv = 0.5, there are twice as many transversions as transitions. On the other hand, η = Ti/Tv = 2 indicates that there are half as many transversions as transitions. Since transversions encode a larger magnitude distance than transitions, this behavior is expected. (B) Simulated and predicted mean ± SD are shown as a function of increasing Ti/Tv ratio η. Distance decreases as Ti/Tv increases. Theoretical and simulated moments are approximately the same. (C) For fixed η = 2, TiTv distance density is plotted as a function of increasing average MAF. TiTv distance increases as average MAF approaches its maximum of 0.5, at which point minor alleles are about as frequent as major alleles. (D) Simulated and predicted mean ± SD are shown as a function of increasing average MAF. Distance increases as the number of minor alleles increases. Theoretical and simulated moments are approximately the same.
Fig 8.
Asymptotic estimates of means and variances of genotype mismatch (GM) (Eq 113), allele mismatch (AM) (Eq 114), and transition-transversion (TiTv) (Eq 115) distance metrics in GWAS data (p ≫ 1).
For GWAS data, f_a is the probability of a minor allele occurring at locus a. For the TiTv distance metric, we have the additional encoding that uses γ_0 = P(PuPu), γ_1 = P(PuPy), and γ_2 = P(PyPy).
Fig 9.
Organization based on brain regions of interest (ROIs) of resting-state fMRI correlation dataset consisting of transformed correlation matrices for m subjects.
Each column corresponds to an instance (or subject) Ij and each subset of rows corresponds to the correlations for an ROI attribute (p sets). The notation represents the r-to-z transformed correlation between attributes (ROIs) a and k ≠ a for instance j.
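The r-to-z transformation mentioned here is assumed to be the usual Fisher transformation, z = arctanh(r). The caption does not spell this out, so treat it as an assumption. A minimal sketch with a hypothetical function name:

```python
import math


def fisher_r_to_z(r):
    """Fisher r-to-z transform of a correlation coefficient r in (-1, 1)."""
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    return math.atanh(r)  # equals 0.5 * log((1 + r) / (1 - r))
```

The transform is nearly the identity for small |r| and stretches values near ±1, which makes the sampling distribution of the transformed correlations closer to normal.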
Fig 10.
Asymptotic means and variances for the new standard (Eq 158) and max-min normalized (Eq 166) rs-fMRI distance metrics.
Fig 11.
Comparison of theoretical and sample moments of Manhattan (Eq 1) distances in standard normal data.
(A) Scatter plot of theoretical vs simulated mean Manhattan distance (Eq 41). Each point represents a different number of attributes p. For each value of p we fixed m = 100 and generated 20 distance matrices from standard normal data and computed the average simulated pairwise distance from the 20 iterations. The corresponding theoretical mean was then computed for each value of p for comparison. The dashed line represents the identity (or y = x) line for reference. (B) Scatter plot of theoretical vs simulated standard deviation of Manhattan (Eq 1) distance (Eq 42). These standard deviations come from the same random distance matrices for which mean distance was computed for A. Both theoretical mean and standard deviation approximate the simulated moments quite well.
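The comparison in this figure can be sketched as follows. Eq 41 and Eq 42 are not reproduced in this caption, so the theoretical moments below are derived from a standard fact and may differ in form from the paper's expressions. For independent standard normal X and Y, X − Y is N(0, 2), so |X − Y| has mean 2/√π and variance 2 − 4/π. Summing over p independent attributes gives mean 2p/√π and variance p(2 − 4/π). Function names and the reduced simulation sizes are illustrative.

```python
import math
import random
from itertools import combinations


def theoretical_manhattan_moments(p):
    """Mean and SD of the Manhattan distance for p independent N(0, 1) attributes."""
    mean = 2.0 * p / math.sqrt(math.pi)
    sd = math.sqrt(p * (2.0 - 4.0 / math.pi))
    return mean, sd


def simulated_manhattan_moments(m, p, reps, seed=0):
    """Average the sample mean and SD of pairwise distances over `reps` data sets."""
    rng = random.Random(seed)
    means, sds = [], []
    for _ in range(reps):
        data = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(m)]
        d = [sum(abs(a - b) for a, b in zip(x, y)) for x, y in combinations(data, 2)]
        mu = sum(d) / len(d)
        var = sum((v - mu) ** 2 for v in d) / (len(d) - 1)
        means.append(mu)
        sds.append(math.sqrt(var))
    return sum(means) / reps, sum(sds) / reps


# Smaller m and fewer repetitions than in the figure, to keep pure Python fast.
for p in (10, 50, 100):
    theo_mean, theo_sd = theoretical_manhattan_moments(p)
    sim_mean, sim_sd = simulated_manhattan_moments(m=30, p=p, reps=5)
    print(f"p={p:3d} mean: {theo_mean:.2f} vs {sim_mean:.2f}  SD: {theo_sd:.2f} vs {sim_sd:.2f}")
```

Plotting the theoretical values against the simulated ones for a range of p should place the points near the identity line.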
Fig 12.
Distance densities from uncorrelated vs correlated bioinformatics data.
(A) Euclidean distance densities for random normal data with and without correlation. Correlated data were created by multiplying random normal data by the upper-triangular Cholesky factor of a randomly generated correlation matrix. We created correlated data with a specified average absolute pairwise correlation (Eq 176). (B) TiTv distance densities for random binomial data with and without correlation. Correlated data were created by first generating correlated standard normal data using the Cholesky method from (A). We then applied the standard normal CDF to create correlated uniformly distributed data, which were transformed by the inverse binomial CDF with n = 2 trials and the given success probabilities. (C) Time series correlation-based distance densities for random rs-fMRI data (Fig 9) with and without additional pairwise feature correlation. Correlation was added to the transformed rs-fMRI data matrix (Fig 9) using the Cholesky algorithm from (A).
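The pipeline in panels (A) and (B) can be sketched with the standard library alone. Two simplifications are assumptions made for illustration: the correlation matrix is exchangeable (a constant off-diagonal value `rho`) rather than randomly generated, and one success probability `f` is shared by all attributes. All function names are hypothetical.

```python
import math
import random
from statistics import NormalDist


def cholesky_lower(a):
    """Lower-triangular L with L L^T = a, for a symmetric positive-definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def correlated_normals(m, p, rho, rng):
    """m rows of p standard normals with pairwise correlation rho (assumed structure)."""
    corr = [[1.0 if a == b else rho for b in range(p)] for a in range(p)]
    low = cholesky_lower(corr)
    rows = []
    for _ in range(m):
        z = [rng.gauss(0.0, 1.0) for _ in range(p)]
        # Row vector z times the upper factor U = L^T: x_a = sum_k z_k * L[a][k].
        rows.append([sum(low[a][k] * z[k] for k in range(a + 1)) for a in range(p)])
    return rows


def inverse_binomial2_cdf(u, f):
    """Inverse CDF of Binomial(n=2, f): smallest x with P(X <= x) >= u."""
    if u <= (1.0 - f) ** 2:
        return 0
    if u <= 1.0 - f ** 2:
        return 1
    return 2


def correlated_genotypes(normals, f):
    """Map correlated normals to uniforms via the normal CDF, then to genotypes."""
    cdf = NormalDist().cdf
    return [[inverse_binomial2_cdf(cdf(x), f) for x in row] for row in normals]


rng = random.Random(0)
normals = correlated_normals(m=100, p=10, rho=0.5, rng=rng)
genotypes = correlated_genotypes(normals, f=0.3)
```

The resulting genotype counts take values in {0, 1, 2} and inherit dependence from the normal data, although their correlation is not exactly `rho` after the discretization.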
Fig 13.
Simulation comparison between rule-of-thumb naive k = 10 and distance-distribution informed k_α (α = 1/2).
Precision and recall for the functional features are significantly improved using informed k versus naive k = 10. The training and validation classification accuracy are similar for the two values of k with slightly less overfitting for informed-k.