A novel scale-space approach for multinormality testing and the k-sample problem in the high dimension low sample size scenario

Two classical multivariate statistical problems, testing of multivariate normality and the k-sample problem, are explored by a novel analysis on several resolutions simultaneously. The presented methods do not invert any estimated covariance matrix. Thereby, the methods work in the High Dimension Low Sample Size situation, i.e. when n ≤ p. The output, a significance map, is produced by doing a one-dimensional test for all possible resolution/position pairs. The significance map shows for which resolution/position pairs the null hypothesis is rejected. For the testing of multinormality, the Anderson-Darling test is utilized to detect potential departures from multinormality at different combinations of resolutions and positions. In the k-sample case, it is tested whether k data sets can be said to originate from the same unspecified discrete or continuous multivariate distribution. This is done by testing the k vectors corresponding to the same resolution/position pair of the k different data sets through the k-sample Anderson-Darling test. Successful demonstrations of the new methodology on artificial and real data sets are presented, and a feature selection scheme is demonstrated.


Introduction
In practice, it is frequently assumed that a data set can be described by a multivariate normal distribution. Many common statistical procedures rely on the data being multinormal, something which is often not adequately checked before using the procedures [1][2][3]. Often, this assumption is false for either the whole data set or parts of it. Another classical problem is the testing of whether k multivariate data sets originate from the same distribution. For each of the two problems, a scale-space inspired algorithm that tests all resolutions and positions simultaneously, is presented. See the "Materials and methods" section for definitions of "resolution" and "position". The two presented algorithms are very similar apart from the type of onedimensional tests used. A weighted summation is performed across the dimensions/positions in both algorithms. The notion of resolution is connected to the number of dimensions being summed across, while the different dimensions/positions typically are temporal or spatial samples. Scale-space theory is a framework for representing signals on multiple scales/resolutions, developed by the computer vision, image processing and signal processing communities. The development of scale-space methodology is typically regarded to start with two papers by Witkin [4,5]. A recent review by Holmström and Pasanen [6] shows how scale-space methodology has been extended to a large number of areas. The goal of statistical scale-space methodology is to extract features from noisy data at several levels of resolution. Typically, the data is an observed time series or a digital image where features at different temporal or spatial scales/ resolutions might be of interest. Since the scale-space idea is important in the present paper, we introduce the scale-space idea through the SiZer methodology developed by Chaudhuri and Marron [7]. To this end, we produce the output from SiZer in Fig 1 when applied to an artificial data set. SiZer is based on nonparametric smoothing and the upper panel shows the artificial data points as dots and a large number of curves obtained for different values of the smoothing parameter. In this setting, the scale/resolution corresponds to the value of the applied smoothing parameter. A rough curve in the upper panel corresponds to a small smoothing parameter and hence to a short scale. Long scales correspond to smooth curves obtained by large values of the smoothing parameter. The SiZer map in the lower panel reveals what features the observed data contain at different scales. In this context, a black pixel means that the curve is significantly increasing, a white pixel corresponds to a significantly decreasing feature, and a gray pixel corresponds to a situation where the curve is considered to be flat.
From Fig 1, it can be seen that SiZer flags regions as significantly decreasing and/or increasing for different positions and smoothing parameters.
In the present paper, we will adapt the SiZer methodology to our situation and develop a scale-space methodology that can be useful for the k-sample problem and for testing of multinormality.
The presented algorithms have two aspects that make them useful in many situations. As will be shown, the algorithms avoid the need to estimate the covariance matrix, leading to algorithms that can handle the High Dimension Low Sample Size (HDLSS) situation. Furthermore, the algorithms allow an evaluation of the data set for all resolutions and all positions simultaneously. By this approach, it may, for the multinormality testing, be detected if only some parts of the data set originate from a multinormal distribution. For the k-sample case, the scale-space approach can detect if one or more of the k samples differ on different resolutions and/or positions. By not estimating the covariance matrix, the presented scalespace tests potentially loose some power compared to tests that incorporate the information from the estimated covariance matrix. This loss of power is acceptable on the grounds of being able to handle the HDLSS situation. As a result of the summation, the algorithms will include a large number of one-dimensional tests. Two versions of the Anderson-Darling (AD) test (see the "Anderson-Darling testing" section) are applied as one-dimensional tests for the multinormality testing problem and the k-sample problem. The choice of using the AD test is a result of its excellent power against all alternatives and existence of very good approximations for the asymptotic distribution and formulas adjusting for the finite sample sizes [8][9][10].
For the results presented, the Anderson-Darling (AD) test (see the "Anderson-Darling testing" section) is used for both the multinormality testing and the k-sample problem as the onedimensional test used on the summations. The choice of using the AD test is a result of its excellent power against all alternatives and existence of very good approximations for the asymptotic distribution and formulas adjusting for the finite sample sizes [8][9][10].
A simple artificial example is presented to illustrate the main ideas of the paper. The data set is generated to have a distribution that is multivariate normal for some of the dimensions and a mixture of two different normal distributions for the rest of the dimensions. In particular, the population is a mixture of two different underlying true signals. In the first population, 20 signals are sampled from a zero mean Gaussian autoregressive process of order 1, more specifically cov(X i , X j ) = 0.5 1+|i−j| . The remaining 20 signals have the same covariance structure, but a different mean. In particular, the mean of the second population is equal to −2.15 for position 6 to 12 and −3.5 for position 20. For indices 26, . . ., 40, the expected value increases linearly from 0.1 to 2.5, while the rest of the dimensions have expectation equal to zero. The horizontal axis is the same as in Fig 2 and shows the position, while different window widths are given on the vertical axis. These different window widths represent the resolution part of the presented algorithms. Resolution 1 corresponds to testing the marginal distribution of each dimension. Higher resolutions are results of normality tests of local averages at a corresponding position and corresponding window width. For a distribution to be multinormal, all marginals and all local averages must be normally distributed. By going through the test results for all resolution/position pairs, the significance map is produced. Red pixels mark Bonferroni [11] adjusted rejections of the null hypothesis of normality, i.e. indicating that the part of the data matrix that is summed across cannot be considered as a sample from a multinormal distribution. Note that the abrupt deviation from normality at dimension 20 is found on low resolution values, while the more gradual departure from multinormality at dimension 6 to 12 and dimensions 25 to 40 are found on larger resolution values. This example shows that both low and high resolutions may be of importance in the same data set.
The "Materials and methods" section presents the concept of scale/resolution and space/ position as used in this paper, the statistical problems being investigated and the details of the two presented algorithms. Some investigations into the power of the tests are also presented. In the "Results" section, the algorithms are applied to some real data sets, comparisons with other algorithms are done, and a feature selection scheme is presented and tested on real data. Finally, the "Conclusions" section sums up the presented methods.

Materials and methods
Recall that an important motivation for applying a scale-space approach is the fact that different phenomena can be visible/detectable on different resolutions and/or positions of the data set. In classical nonparametric smoothing schemes, some sort of bandwidth parameter has to  be chosen [12]. By selecting one bandwidth only, features detectable on other bandwidths will not be found. However, using a scale-space approach, one can look at all bandwidths simultaneously. Scale-space ideas have proven useful in many areas and have been applied to feature detection in curves and images [7,13], density estimation [14], curve fitting [15], Bayesian time series analysis [16] and spectral feature detection [17].

Assumptions
For the multinormality testing case, let X 1 , X 2 , . . ., X n be a set of p-dimensional vectors. The null hypothesis assumes that these vectors originate from a p-dimensional multinormal distribution N ðμ; ΣÞ, i.e.
where the mean vector μ and the covariance matrix Σ are unknown. For the presented algorithm, the parameters of this assumed multinormal distribution do not have to be estimated. Note that by avoiding the need for an estimate of the covariance matrix, the algorithm can be applied to data sets with any combination of sample size and sample dimension, as long as the sample size is high enough for the one-dimensional normality test to be applicable.
The algorithm works with any covariance structure and there are no requirements for smoothness of expected values of neighboring dimensions. As will be presented later, the algorithm performs a weighted summation across neighboring dimensions. A motivation behind this summation is that neighboring dimensions frequently have some sort of logical connection to each other, as for example in a time series. When the data set consists of a time series, the different dimensions are equivalent to the different sampling times. If the dimensions are shifted around, the algorithm could produce different results. Therefore, interpretations of the results are easier when the different dimensions have a natural ordering, as for example with spatial or temporal data.
For the k-sample case, each of the k samples consist of a given number (which can be different for each k) of p-dimensional vectors with unknown cumulative distribution functions (CDF), given by F 1 , F 2 , . . ., F k , respectively. The null hypothesis is then stated as Since this methodology only tests whether or not the CDFs all are the same, the CDFs can take any form or belong to any class of distributions. Again, the interpretations of the results are easiest when working with data having a natural ordering. For each of the different resolutions s, a weighted summation across different dimensions/ positions is performed, producing L s,d , where d is the position index ranging from 1 to p and L s,d is a vector of length n. The resulting L s,d 's form a matrix L with size [n s , p, n], where n s is the number of resolutions being used. A discrete Epanechnikov [12] window function is used as summation weights. For a given pair of s and d, the Epanechnikov summation window is a column vector given by

Concept of resolution and summation across dimensions
where K is some normalizing constant, d�e is the ceiling function, and the plus function is where the data matrix X has size [n, p], with the n samples of length p along each row, and � indicates normal matrix multiplication. The resulting vector L s,d is thereby a weighted summation across the s dimensions centered on the d-th dimension.

Normality testing
From the matrix L, the actual one-dimensional normality test statistics are calculated. For each of the (s, d) pairs, the p-value of the AD test statistic of the vector L s,d is stored. To address the problem of multiple testing, the algorithm outputs two significance maps, one based on the Bonferroni approach and one based on False Discovery Rate (FDR) [18]. The p-dimensional vector of p-values of each resolution is fed into FDR, generating the FDR-based significance map resolution by resolution. For the Bonferroni approach, the critical value is obtained from the nominal significance level α divided by the number of dimensions p, producing on average one false alarm every 1/α resolution. This follows the usual SiZer recommendation of adjusting the significance for each resolution separately. The alternative, adjusting the output map for all the resolution/position pairs simultaneously, is known from the SiZer literature to be overly conservative [7]. The nominal significance level is by default equal to α = 0.05.

The k-sample problem
For the k-sample problem, the k data matrices X For the green and blue boxes, summation is performed across dimensions 1-3 and 2-5, respectively. The blue box is adjusted to not extend outside the data matrix. Note that two significance maps are produced, one each for the Bonferroni/FDR approaches, with ones in R marking rejections of the null hypothesis for the corresponding resolutions and positions. When plotting the significance maps, the vertical axis is inverted. https://doi.org/10.1371/journal.pone.0211044.g004 Scale-space approach for multinormality testing and the k-sample problem in the HDLSS scenario the k corresponding vectors (of size n i , i = 1, . . ., k) from the L i matrices are fed into the ksample AD test [9,19]. The distributions of the sums along the dimensions will in general be different from the marginal distributions. Nevertheless, if the k data sets do have the same multivariate distribution, for a given resolution/position pair (s, d), the distributions of the k different summation vectors will be the same. The p-values of the tests are stored and used in the generation of the FDR-based significance map, while the Bonferroni approach finds the critical value as for the multinormality testing. If the null hypothesis is rejected, the (s, d)-element of the output matrix is marked as a significant element, indicating that at least one of the empirical distributions are significantly different from the others for this resolution/position pair.

Anderson-Darling testing
The two algorithms presented use three different AD tests. The AD goodness-of-fit test is used in the case of checking for multinormality [20][21][22]. For the two-sample/k-sample case, the versions of the AD test suggested by [9] and [19], respectively, are used.
The AD goodness-of-fit test checks the simple null hypothesis that a sample is from a distribution with a known continuous CDF, F(x). Let x 1 � x 2 � � � � � x n be the ordered sample of size n, and let u i = F(x i ), i = 1, . . ., n. The AD test statistic is defined as This clearly shows that the AD test is distribution free, as long as the null distribution is fully known. Approximate expressions for the asymptotic distribution of the AD test are given by [8,10].
When testing for multinormality with unknown distributional parameters, i.e. testing a composite hypothesis, F(x) is some unknown normal CDF, something which changes the distribution of the AD test statistic. In this case, the sorted data are normalized, producing z i , i = 1, . . ., n. Then, u 0 i ¼ F 0 ðz i Þ is produced, where F 0 (�) is the standard normal CDF. These u 0 i values are fed into Eq (3), and the final test statistic is obtained by applying the correction factor for finite sample sizes given on page 123 of [23]. The p-values and critical values are calculated from the approximations given on page 127 of [23]. Following page 373 of [23], the presented algorithm requires n � 8. The presence of ties in the data is a good indicator of nonnormality, something which the AD test will reflect too. For instance, if normally distributed data is in some way rounded off, the rejection rate will be higher than the rate expected from the prescribed significance level.
For the k-sample case, there is no need to estimate any parameters, and the test statistic reduces to a rank statistic. Hence, under the null hypothesis, the distribution of the test statistic is independent of the distribution of the k samples. The two-sample case and the k-sample case are treated separately, even though the k-sample reduces to the two-sample case in [9] when k = 2. The correction factor in [9] is used to produce the final two-sample test statistic. [9] shows that the distribution of the sample-size adjusted two-sample AD test statistic can be approximated well by the asymptotic distribution of the AD goodness-of-fit test for a fully known null distribution. The presented algorithm uses Equation (3.6) in [10] to produce the approximate p-value of the test statistic when k = 2.
The general k-sample AD test statistic in [19] is given as where N = n 1 + n 2 + � � � + n k , and M ij is the number of observations in the i-th sample that are not greater than the j-th observation of the pooled sample of all k samples. Equation (6) in [19] modifies the expression for A kN , to be able to handle ties in the data. The presented algorithm uses the expression adjusted for ties, both for the two-sample and k-sample cases. Thereby, F i (x) in Eq (1) can be connected to a continuous or discrete random vector. The interpolation scheme of [19] is used to determine the p-value of A kN when k > 2. Inspired by [9], it is required that all n i � 8, i = 1, . . ., k.

Cramér-Wold
The Cramér-Wold theorem states that two random column vectors X and Y have the same distribution if and only if for all row vectors a, the random variables a � X and a � Y have the same distribution [40]. In the presented algorithms, the different summation weights of the Epanechnikov window take the role of a. Thereby, when doing the summation and testing for normality/difference between samples for many resolutions, a set of a vectors are applied to the single or many data sets. The Cramér-Wold theorem requires that the distribution of a � X and a � Y are equal for all possible a vectors. In the presented setting, only a finite number of vectors are tested. Since the presented algorithms are most suitable for data with some sort of neighboring structure (e.g. time series or spatial data), the important a vectors should be those that look at dimensions close to each other to a varying degree. Hence, following the Cramér-Wold theorem, a lack of rejection for (almost) all resolutions/positions should be seen as a good indication of the null hypothesis actually being true for the whole data set.

Significance of rejections
The p-value is available for all the resolution/position pairs. The lower the p-value of a "rejection pair", the more significant the rejection of the null hypothesis is on that resolution/position. By changing the significance level, one can determine on which resolution/position the null hypothesis is most significantly rejected. In

Power of the scale-space tests
There are no clear templates for power studies of the proposed scale-space tests. After the summations are done, the tests use the well-documented AD tests. Thereby, the power of the scalespace tests is connected to the power of the AD tests. Instead, it can be informative to illustrate how the power varies over the different resolution/position pairs of the output matrix for a given example. Assume that the data set has the same structure as in the motivational example of the Introduction. Fig 6 shows the rejection ratio (from 1000 data sets) of the scale-space test for multinormality. As can be seen, one finds the highest powers for the resolution/position pairs that best fit the non-normal dimensions. Similar results would be obtained for the test for comparing k data sets.
To investigate the effect of increased number of dimensions, a number of normally distributed dimensions are added to the right side of the signal of the Introduction. Table 1 shows the

Results
The presented algorithms are tested on a number of different data sets. A five percent significance level is used for all the figures, unless otherwise stated. First, the initial example of the Introduction is investigated in more detail.

Introductory example revisited
For larger resolutions, the scale-space test for multinormality can be shown to increase the mode separation if the distribution has more than one mode. This is demonstrated through some simple examples. Assume that all the dimensions of some data set are unimodal normal with different means and/or variances for different dimensions. The result of the summation will then be some other normal distribution.
A short example of this is given. Assume that the data matrix X has dimensions [10,3] and that column 1, 2, and 3 contain N ð0; 1Þ, N ð4; 1Þ, and N ð8; 1Þ, distributed variables, respectively. The summation (for simplicity, assuming even weights of 1/3) over these three columns would produce a 10-element long vector with distribution N ð4; 1=3Þ, which the AD test would detect as normal, i.e. the test would not reject it. Now assume that the ten samples of a given dimension do not have the same distribution. Assume that the five first samples of the three columns are distributed as N ð1; 1Þ, while the last five are distributed as N ð0; 1Þ. When checking the columns separately, the 10-element vector might not "look" enough different from a unimodal normal distribution to be rejected by the AD test. When summing (again, assuming even weights of 1/3) over the three columns, the distribution of the sum of the first five samples is given by N ð1; 1=3Þ, while the last five have a N ð0; 1=3Þ distribution. This shows that the peaks have larger separation (both variances have decreased) as a result of the summation.

Comparison of temperature records
Temperature data sets from two different meteorological stations in the Oslo area are compared. One is located at Ferder lighthouse at the start of the 100 km long Oslo fjord, while the other is located at Fornebu, which is at the very inner part of the Oslo fjord. The two data sets consist of more or less overlapping yearly records, with 64 and 45 complete years, respectively. Years with missing data in the months of interest have been removed. Fig 10 shows the two data sets, and Fig 11 shows the resulting significance maps. It is clear that the temperature  Scale-space approach for multinormality testing and the k-sample problem in the HDLSS scenario distribution at the two stations differ early and late in the summer. From a closer inspection, it is clear that Fornebu is warmer in early summer, while the opposite effect takes place a few months later.

Comparison to other methods
Just about all methods for testing for multinormality rely in some way on inverting the estimated covariance matrix. When the number of samples is less or equal to the number of dimensions (HDLSS setting), i.e. when n � p, the estimated covariance matrix is non-invertible. The projection methods of [41] and the method based on Srivastava's graphical method in [42] are applicable in this HDLSS setting, but no open implementations of the methods Scale-space approach for multinormality testing and the k-sample problem in the HDLSS scenario exist for power evaluation. The methods of Liang in [43,44] are also applicable in the HDLSS setting and open implementations exist. The preferred method of Liang [43] first transforms the data matrix, and then projects it onto some lower-dimensional space of dimension d � min(n − 2, p). The transformed data will under the null hypothesis be distributed as a ddimensional standard multinormal vector, something which is checked using the skewness and kurtosis test of [45]. Asymptotic distributions are given, but in the setting of interest (n is not large compared to p), the use of the Liang test [43] relies on a permutation procedure for generating p-values.
It is not straightforward to compare the presented scale-space method to the Liang procedure since the presented scale-space method does not produce one single answer to the hypothesis testing problem. A simple example is analyzed to illustrate that the presented method outperforms the Liang test in some settings. Assume the same data set structure as in the example of the Introduction, except that the only non-normal part is the mixture of dimensions 6 to 12, the other dimensions are zero mean normally distributed. This setup results in the optimal resolution/position pair being (4,9), i.e. summing over dimensions 6 to 12. When the non-zero mean value in this area is 2.35, the presented scale-space method has a detection ratio of 0.884/0.918 (Bonferroni/FDR) for the pair (4, 9) (based on 1000 Monte Carlo repetitions). The Liang test has for the same data sets a rejection ratio of 0.659. For the Liang test only the kurtosis test and only the optimal projection dimension (d = 1) are used. In a real setting, the optimal projection dimension would not be known and both the skewness and kurtosis test would be used, leading to a significantly lower power when the correction for multiple testing is done. In the same way, when the non-zero mean value is 2.05, the presented scale- space method has a rejection ratio of 0.569/0.628 for the pair (4,9), while the Liang test has for the same data sets a rejection ratio of 0.480.
For the comparison of two or more data sets, there are several methods that handle the n � p situation. Many of these methods use some kind of distance measure between the data vectors [46][47][48][49]. From these distances, the test statistics are generated, without estimating any covariance matrices. The test by Székely and Rizzo [49] is a k-sample extension of the twosample test suggested by Baringhaus [50]. A similar two-sample test was suggested by Aslan [51]. The Aslan test performed very similar to, but not better than, the Székely-Rizzo/Baringhaus test in the two-sample test case of Table 2. Different projection methods that handle the n � p situation also exist, e.g. Random Projection (RP) [52] and DiProPerm [53] (see paper for more methods). For the case of interest (n � p), the tests all rely on permutation procedures to determine the p-value of the test statistic.
The case of two data sets X and Y is first investigated. The expected value of X is zero for all dimensions, while Y has one region of a number of neighboring dimensions with a non-zero expected value. Both X and Y have the same covariance structure as the example of the Introduction. The number of dimensions of Y that have a non-zero mean value is varied, along with this non-zero value. The upper part of Table 2 shows the results. The result of the scale-space algorithm refers to the resolution/position pair with the highest rejection ratio.
Of the alternative tests, the method of Székely and Rizzo [49] consistently shows the greatest power in the tested settings. When the difference between X and Y is across many dimensions, the power of the Székely and Rizzo test is higher than the power of the scale-space approach. If there instead is only one dimension with a different distribution of X and Y, the power of the scale-space approach is greater than for the Székely test. This means that the Székely is a good alternative approach, but by using the scale-space approach one can determine where in the data set the difference is located.
For the case of k = 3, the presented scale-space method is only compared to the method of Székely and Rizzo (the Hall-Tajvidi, RP and DiProPerm tests cannot be extended to k > 2). The same covariance structure as for the two-sample case is used for the three data sets X, Y and Z. X Table 2. Power of comparing a number of different data sets with a varying number of dimensions ("Dim") for which there is an expected value difference δ in the tested data sets. For the Hall test, the T and S tests gave very similar results. Three nearest neighbors were used in the Nearest Neighbor test. The results of the Friedman-Rafsky test are for three trees, which consistently performed better than one and two trees in this setting. The scale-space results are for the Bonferroni/FDR correction, respectively. A 0.10 significance level is used and 2000 Monte Carlo samples are used. is zero mean, while Y has for some neighboring dimensions a non-zero expected value of δ, and Z has for the same dimensions a non-zero expected value of −δ. See the middle part of Table 2 for the results. The case of k = 7 is finally investigated in the lower part of Table 2. Here, the different data sets have the same structure as for the case of k = 3, but the different data sets X i , i = 1, 2, . . ., 7 have mean values equal to i � δ for the non-zero dimensions. From these results, the scale-space method seems to improve compared to the Székely-Rizzo method when the number of data sets increase, and the methods are giving comparable results in the tested settings.

Feature selection
In a classification setting, the p-values of the different resolution/position pairs can be used to find useful scale-space features. The pairs with the smallest p-values should be good candidate features for classification algorithms. The p-values of neighboring pairs will be correlated (for all resolution values larger than 1). An ad hoc strategy to avoid the selection of neighboring pairs is used. That is, say that the most significant pair is at window width 7 (i.e. resolution number 4) and position 5. Then, all pairs for two resolutions down (resolution number 2 and 3) and two resolutions up (resolution number 5 and 6) that sum over the data of position 5, are excluded from being selected as a feature as a result of pair (4, 5) being selected as a feature. The next feature to be selected corresponds to the resolution/position pair, which has not been excluded in the steps before, with the lowest p-value of the pairs not already selected. This is repeated until a wanted number of features are found or there are no good features left to pick from, where a potential feature's "goodness" is connected to its p-value. The suggested feature selection algorithm is tested on a setting similar to the example of the Introduction. Here, instead of having one data set with two parts, there are two data sets X and Y. X is distributed as the 20 first samples of the motivational example, while Y is distributed as the 20 remaining samples, except that the expected value equals −0.65 for position index 6 to 12 and −1 for position index 20. For indices 26, . . ., 40, the expected value increases linearly from 0.05 to 0.75.
The suggested feature selection scheme is compared to using all dimensions as inputs to classification algorithms. This is meant as a proof of concept, not a thorough comparison to other methods. The tested sample sizes of both X and Y were 20, 30 and 60. For the classification, k Nearest Neighbor classification (with k = 1 and k = 3), Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) were used, when applicable [54]. For the scale-space feature selection, the number of features selected ranged from 1 to 15. One pair of X and Y data sets was used to find the training features. These features were then used to classify 500 X and 500 Y data sets. This was repeated 100 times, making up in total 100000 tests, and the ratio of correct classification was averaged across these 100000 tests, as shown in Fig  12. The splitting up was done to average out the fact that different features will be selected depending on the training data set. With three well-selected features, one can capture the main differences in the two data sets, but as the figure shows, one needs on average more than three features to have the maximum ratio of correct classification. The figure shows that using the suggested scale-space features is better than using the raw data in this example.

Conclusions
The scale-space methodology is applied to the testing for multivariate normality and the ksample problem. The summation across dimensions/positions reduces the multivariate problem to a large number of one-dimensional tests. A significance map, showing where and for which resolutions the null hypothesis is rejected, is generated by going through all combinations of the position and resolution parameters. The summation throws away all information of the dependency structure of the data. When there are more multivariate observations than dimensions, i.e. n > p, the discharging of covariance information will lower the power of the scale-space tests compared to tests that use this information gained through estimation of the covariance matrix. What is gained on the other hand, is the ability to check for multinormality and compare data sets in the High Dimension Low Sample Size setting, something which almost all other methods fail to handle.
The presented algorithms are tested on artificial data and real temperature data sets, showing how both the check for multinormality and how the comparison of data sets can be done through a scale-space approach.
Within the scale-space framework, to the authors' best knowledge, there is no other algorithm to compare the presented work with, even though a large number of tests for assessing the multinormality of a given data set exist [26,[55][56][57]. To the knowledge of the authors, the only multivariate methods for testing multinormality that handle the case when n � p, are the methods [41][42][43][44]. The preferred Liang method [43] is inferior to the presented method in some relevant aspects and cases.
In the case of comparing k data sets, there exist some methods that handle the case where at least one of the sample sizes are less than the number of dimensions. In general, these methods are based on some distance measure between the data vectors, and do not estimate the covariance matrix, or projection onto lower-dimensional spaces. The suggested scale-space method is compared to these methods. In the tested settings, the power of the method of Székely and Hall [49] is comparable to the power of the scale-space approach. The Székely test does not on the other hand provide any info about where the data sets differ, information that is essential for doing feature selection. Selection of relevant features based on the presented scale-space ksample problem algorithm is demonstrated in the "Results" section.