Abstract
In this study, we introduce an innovative methodology for anomaly detection of curves, applicable to both multivariate and multi-argument functions. This approach distinguishes itself from prior methods by its capability to identify outliers within clustered functional data sets. We achieve this by extending the recent AA + kNN technique, originally designed for multivariate analysis, to functional data contexts. Our method demonstrates superior performance through a comprehensive comparative analysis against twelve state-of-the-art techniques, encompassing simulated scenarios with either a single functional cluster or multiple clusters. Additionally, we substantiate the effectiveness of our approach through its application in three distinct computer vision tasks and a signal processing problem. To facilitate transparency and replication of our results, we provide access to both the code and the datasets used in this research.
Citation: Alcacer A, Epifanio I (2024) Outlier detection of clustered functional data with image and signal processing applications by archetype analysis. PLoS ONE 19(11): e0311418. https://doi.org/10.1371/journal.pone.0311418
Editor: Dariusz Siudak, Lodz University of Technology: Politechnika Lodzka, POLAND
Received: April 2, 2024; Accepted: September 18, 2024; Published: November 25, 2024
Copyright: © 2024 Alcacer, Epifanio. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: Code and data to reproduce the results are available at https://github.com/aleixalcacer/JA-ODCFDAA.
Funding: This work was partially supported by the Spanish Ministry of Universities (FPU grant FPU20/0182 to A.A.), Spanish Ministry of Science and Innovation (PID2022-141699NB-I00 and PID2020-118763GA-I00 to A.A. and I.E.), Generalitat Valenciana (CIPROM/2023/66 to I.E.) and Universitat Jaume I (UJI-B2020-22 and TRANSUJI/2023/6 to A.A. and I.E). There was no additional external funding received for this study. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
1 Introduction
In functional data analysis (FDA), each observation is a function. These functions can be univariate or multivariate, with one or more arguments. A classic reference in this field is [1]. In FDA, as in multivariate analysis (MVA), anomaly detection is an essential step [2]. However, handling outliers in functional data is challenging due to its infinite-dimensional nature [3]. [4] proposed a taxonomy of functional outliers, which was improved by [5]. According to this taxonomy, outliers can be isolated or persistent depending on whether the anomalous behavior occurs during a very short or a long part of the function domain, respectively. Furthermore, persistent outliers could be categorized into shift, amplitude or shape outliers depending on whether they are identical to the other observations after a baseline correction, a rescaling or a warping transformation, respectively.
Nevertheless, this taxonomy does not consider the case of clustered functional data. Clustered data are characterized as data that can be classified into a number of distinct groups or clusters. When we work with clustered data, an upper level in the hierarchy of that taxonomy is the “degree of zoom with which we view the data”. In that case, functional anomalies can be classified as global or local anomalies. A global functional outlier is an anomalous function with respect to all other observations in the data set as a whole, although not necessarily with respect to its neighboring observations. Conversely, a local functional outlier is an anomalous function with respect to other neighboring observations, although it may not be an anomaly in a global view of the data set. By neighboring observations to a certain function f, we mean other functions in the data set within a certain distance from f. This distance can be the L2 distance between two functions. Therefore, it is always assumed that our functions belong to a Hilbert space, satisfy reasonable smoothness conditions and are square-integrable on their domain, a certain interval [a, b].
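As an illustration, the L2 distance between two functions observed on a common grid can be approximated by numerical integration; the sketch below uses the trapezoidal rule on an equally spaced grid (the grid and the example functions are our own choices, not from the paper):

```python
import numpy as np

def l2_distance(f, g, t):
    """L2 distance between two functions observed on a common grid t,
    approximated with the trapezoidal rule."""
    diff2 = (f - g) ** 2
    integral = np.sum((diff2[1:] + diff2[:-1]) / 2 * np.diff(t))
    return np.sqrt(integral)

t = np.linspace(0.0, 1.0, 201)
f = np.sin(2 * np.pi * t)
g = np.zeros_like(t)
# The exact value of ||sin(2*pi*t)|| on [0, 1] is sqrt(1/2)
print(round(l2_distance(f, g, t), 3))  # 0.707
```

With a fine enough grid, this discretized norm is the practical stand-in for the functional L2 distance used to define neighboring observations.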
Fig 1 illustrates the ideas presented in the previous paragraph, which is a generalization of the concept of global and local anomalies in MVA [6]. There are two clear clusters (c1 and c2 in light blue and purple, respectively) and outliers (f1 in yellow and f2 in green). Clusters appear with transparent colors, except one random function of each cluster that is opaque. f1 and f2 are global anomalies. f1 is very different from the other data (its distance to the other functions is very large), as is f2, which can be seen as a shape outlier for c1 or a vertical shift outlier for c2. However, f3 (shown in red) is a local anomaly. When focusing on the data set globally, f3 can be considered a normal case because it is not too far from cluster c2. But when we compare f3 with its neighbors, i.e., the functions in cluster c2 while neglecting all the other observations, f3 can be seen as an anomalous record. Depending on the applications, local anomalies may or may not be of interest. Another situation appears in Fig 1. The functions in c3 in dark blue generate an open question: Are the observations in c3 anomalies or a (small) regular cluster? c3 constitutes what is called a micro-cluster. For these cases, it can be very useful that our anomaly detection algorithm returns outlierness scores, so that the degree of outlierness of the functions in c3 should be larger than for normal records, but smaller than for clear outliers.
As regards a taxonomy about unsupervised anomaly detection algorithms for MVA, [6] divided them into four categories: (1) Nearest-neighbor (NN) based techniques, (2) Clustering-based methods, (3) Statistical algorithms and (4) Subspace techniques. Despite the great variety of methods in MVA for outlier detection, in the more recent field of FDA, many functional outlier detection methods are global depth-based [7]. Therefore, there are many possibilities to extend methods that perform well in MVA to FDA for outlier detection.
In the sphere of multivariate analysis (MVA), while depth-based methods offer valuable insights, they may exhibit limitations, particularly in detecting anomalies within the sparse interior regions of a data set, as noted by [2]. These methods, in their conventional form, often face challenges when tasked with identifying outliers within clustered data. This characteristic suggests that their efficacy may be compromised when applied to functional data exhibiting similar clustering traits. To our knowledge, the specific issue of outlier detection in clustered data, a notable concern in MVA, has not been extensively examined within the functional data analysis (FDA) framework. Addressing this gap forms one of the primary objectives of our present study. It is well known that unsatisfactory results can be obtained when clustering methods that are not robust are applied in the presence of outliers [8]. For example, one group could be constituted by outlying observations, or heterogeneous groups could be joined together [9]. Therefore, the naive approach of using non-robust clustering methods and looking for outliers inside the groups is discarded.
To achieve this primary objective, another aim is to extend a recent technique that performs well for detecting outliers in MVA to FDA for univariate and multivariate functions, with one or more arguments, and compare it with previous alternatives. This MVA technique was proposed by [10] and performed among the best in an extensive comparison with 23 state-of-the-art outlier detection algorithms in MVA with several benchmark data sets. The technique combines categories (1) and (4), since it projects data into relevant subspaces by archetype analysis (AA) [11] and then uses an NN-technique through an appropriate ensemble of the results. Despite NN-based techniques being very popular in MVA for outlier detection, they have not been very widely used in FDA [7]. In this work, we propose to use functional archetypal analysis (FAA) [12], since projecting in appropriate subspaces can improve proximity-based methods, and then to use NN-techniques to detect outliers in those subspaces. AA represents the instances by means of a mixture of archetypes, which are a mixture of instances. The objective of FAA is analogous, but with functions. AA is sensitive to outliers since archetypes lie on the boundary of the convex hull of the data set. The idea of the projections into AA (or FAA) is to exploit this sensitivity to outliers for detecting them. Although it is not the first time that AA or archetypoid analysis (ADA), which is a variant of AA [13], have been used for outlier detection of functions [7, 14], in previous cases, they assumed that the sample is constituted by homogeneous functions generated from a unimodal distribution, i.e., these techniques will fail for clustered data.
To highlight the challenge of finding outliers in functional data (FD), we must bear in mind that FD are not only high-dimensional but intrinsically infinite-dimensional [15]. Many popular anomaly detectors do not work properly with increasing dimensionality. This is an artifact of the well-known curse of dimensionality, whose impact on the outlier detection problem was first noted in [16]. For example, the effectiveness of proximity-based detection is compromised in high dimensions because distances in the original space lose descriptive power. Therefore, a mapping to some more suitable space is necessary, which is called projected outlier detection or, alternatively, subspace outlier detection [2]. There are different classes of projections, such as rarity-based, unbiased and aggregation-based methods [2, Ch. 5]. The main novelty of this work lies in the use, for the first time, of FAA as a rarity-based subspace method for functional outlier detection. Rarity-based methods seek subspaces based on the rarity of the underlying distribution. The anomaly detection process can be enhanced by identifying subspaces in which the observations deviate remarkably from normal behavior. Note that the objective of FAA is to discover p extreme functions, and multiple diverse subspaces can be obtained by changing the FAA parameter p, i.e., by a parametric ensemble. Therefore, FAA subspace outlier detection is inherently posed as an ensemble-centric problem. After each different FAA projection, we consider a proximity-based methodology and a model-centric ensemble, but these base detectors could be replaced by others. Previous methods [7, 14] did not use FAA as a parametric ensemble: they only considered FAA with a unique parameter, p = 3, and did not exploit the information in this projection as in our proposal.
This study makes several significant contributions to the field of functional data analysis. We present an innovative method for outlier detection that is applicable to both univariate and multivariate functions with varying numbers of arguments, a notable advancement over many existing methods that are limited to univariate functions. Unlike other methods, our proposal does not require the proportion of outliers to be fixed in advance. Our method is thoroughly evaluated through a simulation study, where it is compared against a comprehensive suite of the most advanced techniques in functional outlier detection. This rigorous comparison reveals the enhanced performance of our approach, particularly in scenarios involving clustered functional data, a really challenging situation in which previous methods have shown limitations [17].
Furthermore, we extend the practical application of our proposed method to real-world data, exemplifying its efficacy in resolving complex problems within the realm of computer vision and signal processing. The application to these real-world problems not only demonstrates the versatility of our method but also its superiority in practical settings.
In line with our commitment to the scientific community’s principles of transparency and reproducibility, we are releasing the code that underpins our methodology. This will allow peers and future researchers to replicate our findings and extend upon our work, reinforcing the integrity and applicability of our contributions.
The paper is organized as follows. Section 2 reviews AA, FAA and previous functional outlier detection methods together with robust curve clustering methods. Section 3 introduces our proposal. The simulation study and results using real data are presented and discussed in Section 4. Finally, we provide some conclusions in Section 5.
2 Background
2.1 AA and FAA
In MVA, let X be an n × m data matrix with n observations (x_i) and m variables. The rows of a p × m matrix Z contain the p archetypes (z_j), which are a mixture of data points, while data points are in turn approximated by a mixture of archetypes. Therefore, we have to minimize the following residual sum of squares (RSS), where the elements of the n × p matrix α and the p × n matrix β have to be determined:

RSS = Σ_{i=1}^n ‖x_i − Σ_{j=1}^p α_ij z_j‖² = Σ_{i=1}^n ‖x_i − Σ_{j=1}^p α_ij Σ_{l=1}^n β_jl x_l‖²,  (1)

with two constraints: 1) Σ_{j=1}^p α_ij = 1 with α_ij ≥ 0 and i = 1, …, n, and 2) Σ_{l=1}^n β_jl = 1 with β_jl ≥ 0 and j = 1, …, p. In summary, α_ij is the weight of the archetype z_j in the approximation of x_i, x̂_i = Σ_{j=1}^p α_ij z_j, while β_jl is the weight of the data point x_l in the definition of the archetype z_j = Σ_{l=1}^n β_jl x_l. AA can be computed by the alternating minimization algorithm described by [11]. Since this algorithm is nondeterministic, it is run from 10 random initializations, and the best model is chosen for each p.
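A minimal NumPy sketch of the quantities in Eq (1) follows; random mixture weights stand in for the ones the alternating algorithm would estimate, so only the objective is shown, not the optimization:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, p = 100, 5, 3
X = rng.normal(size=(n, m))                 # n observations, m variables

# Random mixture weights satisfying the sum-to-one, nonnegativity constraints
alpha = rng.dirichlet(np.ones(p), size=n)   # n x p
beta = rng.dirichlet(np.ones(n), size=p)    # p x n

Z = beta @ X          # p archetypes as mixtures of data points
X_hat = alpha @ Z     # data points approximated as mixtures of archetypes
rss = np.sum((X - X_hat) ** 2)              # the objective of Eq (1)

assert np.allclose(alpha.sum(axis=1), 1) and np.allclose(beta.sum(axis=1), 1)
print(rss >= 0)  # True
```

The alternating algorithm of [11] would update α and β in turn to drive this RSS down, keeping the simplex constraints on both matrices.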
In FDA, let {x_1(t), …, x_n(t)} with t ∈ [a, b] be the FD set. In FAA, the RSS is computed with a functional norm (the L2-norm, ‖f‖² = ∫_a^b f(t)² dt), since the observations (x_i(t)) and archetypes (z_j(t)) are now functions. However, the matrices α and β are interpreted as in AA. As functions are recorded at discrete sites in practice, we can apply AA to the function values at m equally spaced points from a to b to calculate FAA. For a more computationally efficient approach, FD can be approximated by means of basis functions: x_i(t) = Σ_{h=1}^m b_ih B_h(t), where B_h (h = 1, …, m) are the basis functions and b_i is the vector of length m with the coefficients. So, the RSS in FAA is [12]:

RSS = Σ_{i=1}^n ‖x_i − Σ_{j=1}^p α_ij z_j‖² = Σ_{i=1}^n a_i′ W a_i,  (2)

where a_i = b_i − Σ_{j=1}^p α_ij Σ_{l=1}^n β_jl b_l, and W is the order-m symmetric matrix with the inner products of the pairs of basis functions, w_hg = ∫_a^b B_h(t) B_g(t) dt. W is the order-m identity matrix when the basis is orthonormal, such as the Fourier basis. In that case, we can obtain FAA by applying AA to the basis coefficients. Otherwise, we have to compute W only once, by numerical integration.
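As a sketch of how W can be computed once by numerical integration, the following uses the non-orthonormal monomial basis 1, t, t² on [0, 1] (our own choice for illustration), for which the exact W is the 3 × 3 Hilbert matrix, and then evaluates the quadratic form b′Wb that plays the role of a functional squared norm in Eq (2):

```python
import numpy as np

m = 3
t = np.linspace(0.0, 1.0, 2001)
B = np.vstack([t ** h for h in range(m)])   # monomial basis 1, t, t^2 (not orthonormal)

def trap(y, t):
    """Composite trapezoidal rule for the integral of y over the grid t."""
    return np.sum((y[1:] + y[:-1]) / 2 * np.diff(t))

# W[h, g] = integral of B_h(t) * B_g(t) dt, computed once by numerical integration
W = np.array([[trap(B[h] * B[g], t) for g in range(m)] for h in range(m)])

# For monomials on [0, 1] the exact value is the Hilbert matrix 1 / (h + g + 1)
W_exact = np.array([[1.0 / (h + g + 1) for g in range(m)] for h in range(m)])
print(np.allclose(W, W_exact, atol=1e-6))  # True

# Functional squared norm ||x_i||^2 = b_i' W b_i for a coefficient vector b_i
b = np.array([1.0, -2.0, 0.5])
print(b @ W @ b > 0)  # True: W is positive definite
```

For an orthonormal basis this W would reduce to the identity, and the quadratic form would collapse to the ordinary Euclidean norm of the coefficients, which is why AA can then be applied to them directly.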
If our data set is composed of multivariate functions, in particular (to simplify the explanation) bivariate functions f_i(t) = (x_i(t), y_i(t)), with coefficient vectors b_i^x and b_i^y for the basis functions B_h, respectively, then the RSS is as follows:

RSS = Σ_{i=1}^n ‖f_i − Σ_{j=1}^p α_ij z_j‖² = Σ_{i=1}^n ((a_i^x)′ W a_i^x + (a_i^y)′ W a_i^y),  (3)

where a_i^x = b_i^x − Σ_{j=1}^p α_ij Σ_{l=1}^n β_jl b_l^x and a_i^y = b_i^y − Σ_{j=1}^p α_ij Σ_{l=1}^n β_jl b_l^y. As before, for orthonormal basis functions, FAA can be obtained by applying AA to the n × 2m coefficient matrix composed by joining the coefficient matrices of the x and y components.
2.2 Functional outlier detection methods
We review some of the main methods devoted to functional outlier detection along with their implementation in R [18]. These methods will be compared with our proposal. [19] used a likelihood ratio test (LRT). [20] identified outliers as functions whose depth levels are below a certain threshold. This threshold is established by a bootstrap procedure based on either trimming (TRIM) or weighting of the sample (POND). LRT, TRIM and POND are found in the R packages fda.usc [21] and rainbow [22]. In the rainbow package, we also find the procedure by [23] (ISFE), where integrated square forecast errors are used; the procedure by [24] (RMAH), where the robust Mahalanobis distance is applied but regarding the functions as multivariate observations; and the functional highest density region (HDR) boxplot introduced by [25]. The classical boxplot was extended to FD by [26]. This procedure (FB) is available in the R package fda [27]. The outliergram (OUG) was proposed by [28] and can be found in the R package roahd [29]. [30] proposed to use a kernelized functional spatial local depth for identifying outliers. The Functional Outlier Map (FOM) was proposed by [4] and improved by [31]. It is based on functional outlyingness measures and is available in the R package mrfDepth [32].
More recent literature comprises methods by [33] based on functional directional outlyingness; [34] based on functional principal component analysis and an adaptive, data-driven approach for dimension selection; [35], where a sequence of transformations was considered; and [36], where the total variation depth was proposed. [37] considered features extracted from differential geometry. Outlier detection for multivariate time series is carried out by [38] using a functional approach, through quantile cross-spectral densities and functional depths. Other methods are based on elastic depths [39] and a new elastic metric [40]. Methodologies for detecting outliers in big functional data have recently been proposed by [7, 41–43]. This last method, CRO-FADALARA, is explained below. Depthgram was introduced by [44] for visualizing outliers in high-dimensional functional data. Depending on the problem, other works aim to develop robust methods against outliers, such as [3].
Two previous methods, FOADA and CRO-FADALARA, used ADA for functional outlier detection, although in different ways to our proposal. On the one hand, FOADA was introduced by [14]. In FOADA, outliers were iteratively detected by computing ADA with three archetypoids repeatedly while sieving the sample. The outlier detection is based on the alpha coefficients either using the robust Mahalanobis distance (FOADARMAH) or using a threshold for the alpha values (FOADATH). FOADA iterates until no more outliers are detected. On the other hand, CRO-FADALARA has two phases: in the first phase, the cleaning phase, the most obvious outliers (amplitude and isolated) are detected with the classical boxplot, while in the second phase, robust ADA is computed and outliers are detected by applying the adjusted boxplot [45] to the norm of the residuals. Note the differences with our proposal, which is not an iterative procedure like FOADA nor does it use the residuals like CRO-FADALARA. As explained in Sec. 1, our proposal is a rarity-based subspace outlier detection method, a parametric-ensemble method.
As we deal with clustered functional data sets, we also consider in the comparison a classical reference for robust clustering, [46], TRIMKMEANS, which can be found in the R package trimcluster [47]. Trimming techniques have been used in functional data contexts, such as the works by [8, 48–50]. The proportion of observations to be trimmed has to be specified, and the trimmed observations will be considered outliers. [51] performed hierarchical functional clustering with special consideration of outliers, and [17] considered a contaminated mixture model.
Some interesting applications of functional outlier detection are flood analysis [52], monitoring of helicopters in flight and the analysis of the spectrometry of construction materials [53], traffic flow data [54], bike sharing system data [55] and profile monitoring in industrial manufacturing [56].
3 The FAA + k-NN method for detecting functional outliers
As mentioned in Section 1, the objective is to project FD into relevant subspaces, where outliers can be better detected, and to reduce the dimensionality. Then distance-based methods are applied to detect anomalies in those subspaces. FAA is considered for the projection, with different values of p (the number of archetypes), since archetypes are not necessarily nested nor orthogonal to one another [11]. Therefore, for each p, the returned archetypes can change to better capture the configuration of the data set. This property, together with the fact that the archetypes lie on the boundary of the convex hull of the data for more than one archetype [11], favors our proposal. On the one hand, as we can obtain different explanations of the data with different numbers of archetypes, we can combine the different results by ensembles. This is advantageous since accuracy and diversity are desirable properties for the good performance of ensembles [57]. On the other hand, as the number of archetypes increases, some returned archetypes could be outliers, but this is not a problem for us; quite the opposite, in fact. Once the FD set is FAA-projected for multiple numbers of archetypes, the k-NN method is applied to the α values for each projection (remember that the α values always sum to 1, regardless of the number of archetypes).
In conclusion, our proposed methodology is encapsulated within two primary steps:
- Step 1 Project FD in relevant subspaces using FAA.
- Project the functional data (FD) into the relevant subspaces by applying FAA. Begin with a single archetype (p = 1) and incrementally increase the number of archetypes up to the maximum number predetermined for your study (p = p2).
- Calculate the RSS for each value of p as you progress from p = 1 to p = p2.
- Construct a plot of RSS against the corresponding values of p.
- Examine the plot to identify the ‘elbow’, which is the point where the RSS begins to decrease at a diminishing rate as p increases. This is achieved using the elbow criterion, an intuitive heuristic approach used elsewhere [11].
- Select the number of archetypes corresponding to the ‘elbow’ as the minimal number of archetypes (p1) that provides a satisfactory description of the data. This represents the point at which additional archetypes yield only marginal improvements in data representation.
- Step 2 Apply k-NN (sum of Euclidean distance to k nearest neighbors) to detect anomalies.
- Apply the k-NN algorithm to each of the α matrices for the subspaces defined by p1 up to p2. Use the sum of Euclidean distances to the k nearest neighbors to calculate the outlier scores for each data point.
- Normalize the outlier scores of each data point by implementing a cumulative-sum approach, which is equivalent to averaging the outlier scores across the different projections. This results in a single outlier score for each data point that synthesizes the information from all the projections.
- As the same k is used each time, the scale of the outlier scores is the same, so the aggregation ensemble is well founded.
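The two steps above can be sketched as follows. The RSS curve and the α matrices are hand-made placeholders, since a real FAA fit is beyond the scope of this sketch, and the elbow is picked with a simple second-difference heuristic rather than the visual inspection used in the paper:

```python
import numpy as np

# --- Step 1: choose p1 at the elbow of the RSS curve (a simple heuristic) ---
def elbow(rss):
    """Elbow as the interior point with the largest second difference of RSS."""
    rss = np.asarray(rss, dtype=float)
    d2 = rss[:-2] - 2 * rss[1:-1] + rss[2:]   # discrete curvature
    return int(np.argmax(d2)) + 1             # index of that interior point

p_values = [1, 2, 3, 4, 5, 6, 7, 8]
rss = [100.0, 60.0, 25.0, 8.0, 7.0, 6.5, 6.2, 6.0]   # hypothetical FAA fits
p1 = p_values[elbow(rss)]                            # elbow at p = 3 here

# --- Step 2: k-NN outlier scores on the alpha matrices, averaged ---
def knn_scores(A, k):
    """Sum of Euclidean distances from each row of A to its k nearest neighbors."""
    D = np.linalg.norm(A[:, None, :] - A[None, :, :], axis=-1)
    return np.sort(D, axis=1)[:, 1:k + 1].sum(axis=1)  # drop the self-distance

rng = np.random.default_rng(0)
n, p2, k = 60, 5, 5
# Placeholder alpha matrices (rows sum to 1); in the real method these are
# the mixture weights returned by FAA for each p in p1..p2.
alphas = [rng.dirichlet(np.ones(p), size=n) for p in range(p1, p2 + 1)]
scores = np.mean([knn_scores(A, k) for A in alphas], axis=0)
print(p1, scores.shape)  # 3 (60,)
```

Because the same k is used in every projection, the per-projection scores share a scale and plain averaging is a sound aggregation.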
Note that if dealing with multivariate FD with different ranges in each functional variable, a previous standardization of the data may be necessary.
As the bias-variance trade-off in outlier detection is almost identical to that in classification [2], averaging also reduces variance in outlier identification. If we want to make the algorithm more robust, we can apply Step 2 for a range of k values and then average the results.
Bias-variance theory developed in the supervised learning field can also be adapted to anomaly detection with some changes [58]. [2, Ch. 6] details the similarities between outlier ensembles and classification ensembles. Our proposal is a rarity-based subspace outlier detection method. It is a parametric ensemble, where a range of different values of p is used with FAA in Step 1, and then the scores obtained in Step 2 are averaged. This is a typical outlier ensemble approach [59]. Outlier ensembles are especially convenient in settings where they can reduce the uncertainty associated with difficult algorithmic selections [2, Ch. 6], such as discovering subspaces relevant to different observations. This makes FAA very suitable for this purpose, since FAA results for different p values are not necessarily nested and they change to better explore the shape of the data set. Furthermore, averaging the scores in Step 2 results in variance reduction, which improves performance according to the theoretical foundations explained by [58] for outlier ensembles. For the same reason, different k values can be considered in Step 2 and their results averaged, as explained before.
By default, we obtain outlier scores, where the highest scores indicate the highest level of outlierness. If we want to return a binary label (outlier or not) instead, a boxplot of the outlier scores can be used, labeling the points it identifies as outliers. Despite being a simple hardening strategy, it has proven effective in our experiments.
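A sketch of this hardening step, assuming the standard upper boxplot fence Q3 + 1.5·IQR (the paper does not specify the boxplot variant):

```python
import numpy as np

def harden(scores):
    """Label as outliers the scores above the upper boxplot fence Q3 + 1.5 * IQR."""
    q1, q3 = np.percentile(scores, [25, 75])
    fence = q3 + 1.5 * (q3 - q1)
    return scores > fence

# 30 ordinary scores plus one clearly anomalous score (illustrative values)
scores = np.append(np.linspace(0.2, 0.4, 30), 2.5)
labels = harden(scores)
print(labels.sum(), labels[-1])  # 1 True
```

Only the upper fence is needed here, since low scores correspond to normal observations by construction.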
This is an extension of the proposal by [10] from the multivariate case to the functional case, i.e., in Step 1 we use FAA instead of AA. As aforementioned, [10] performed among the best in an extensive comparison with 23 state-of-the-art outlier detection algorithms in MVA on several benchmark data sets. One of those algorithms used only the k-NN method, i.e., skipping Step 1. Another substituted Step 1 with robust principal component analysis (RPCA) and used the PC scores with the k-NN method. Using AA in Step 1 clearly improved anomaly detection compared with skipping Step 1 or using RPCA.
3.1 Interpretation of parameters
The parameters within our proposed framework are pivotal to the analysis and are interpreted as follows:
- p1: This parameter signifies the ideal number of archetypes that adequately describe the dataset, i.e., the point where the elbow is found.
- p2: Set to be larger than p1, p2 defines the upper limit of the number of archetypes to consider. The choice of p2 should be reflective of meaningful improvements in model accuracy. Increasing p2 without concurrent improvements in loss reduction does not add value to the analysis and may unnecessarily complicate the model.
- k: Determines the sensitivity to the size of clusters identified as outliers. A cluster exceeding the size of k will not be flagged by the algorithm using the k-NN method, since k points will be inherently proximal. It is recommended to employ a range of k values, as suggested in previous works [6].
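The role of k can be reproduced with a small multivariate toy of our own construction: a micro-cluster of 5 points is invisible to the k-NN score when k is smaller than its size, because its members find neighbors among themselves, but it stands out once k exceeds the micro-cluster size:

```python
import numpy as np

def knn_scores(A, k):
    """Sum of Euclidean distances from each row of A to its k nearest neighbors."""
    D = np.linalg.norm(A[:, None, :] - A[None, :, :], axis=-1)
    return np.sort(D, axis=1)[:, 1:k + 1].sum(axis=1)

rng = np.random.default_rng(42)
big = rng.normal(loc=0.0, scale=0.1, size=(50, 2))    # a regular cluster
micro = rng.normal(loc=5.0, scale=0.1, size=(5, 2))   # a distant micro-cluster
X = np.vstack([big, micro])

low = knn_scores(X, k=3)     # k smaller than the micro-cluster size
high = knn_scores(X, k=10)   # k larger than the micro-cluster size

# With k = 3 the micro-cluster points find neighbors among themselves, so their
# scores stay moderate; with k = 10 they must reach the distant big cluster and
# their scores dominate every regular point.
print(high[-5:].min() > high[:50].max())  # True
```

This mirrors the behavior of the toy example in Fig 4: small k isolates lonely points, while larger k also flags small clusters.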
3.1.1 Toy example.
To elucidate the parameter interpretation within our framework, we introduce an illustrative example featuring five distinct distribution types. Fig 2 illustrates that the dataset comprises three larger clusters (with 20 elements each), a smaller cluster (containing 5 elements), and a lonely outlier.
In alignment with our methodology, the initial task is to compute data projections through FAA. In this example, we have chosen pmax = 8, a decision justified in Fig 3, which validates its sufficiency for the determination of p1 and p2.
Upon calculating the projections, Fig 3 shows the RSS plotted against the number of archetypes. The ‘elbow’, indicative of the optimal archetype count, is apparent at two archetypes, thereby setting p1 = 2. Additionally, the plot indicates a minimal RSS variance beyond four archetypes, leading to the selection of p2 = 4.
With the projection range (p1 ≤ p ≤ p2) determined, we advance to identifying an appropriate k value that will facilitate the detection of desired outliers. In Fig 4, we illustrate three configurations of k, each yielding distinct outcomes.
Outliers appear in blue. A: k = 1. B: 1 ≤ k ≤ 5. C: 5 < k ≤ 10.
In the initial scenario presented in Fig 4A, the parameter k is set to 1. This choice is based on the intention to identify outliers that are singular, i.e., data points that stand alone. Contrary to the expectation of detecting only the solitary purple data point as an outlier, the analysis reveals a total of three outliers. This outcome suggests the presence of additional lonely outliers within the data set.
Moving to the second scenario, depicted in Fig 4B, the k parameter is varied from 1 to 5. This range is strategically chosen to enhance the robustness of the outlier detection process, aiming to mitigate the impact of local anomalies that might otherwise be mistaken for significant outliers. In this case, only the purple data point is considered an outlier, indicating that the other two could be local anomalies.
In the concluding scenario, shown in Fig 4C, the strategy diverges from isolating individual outliers to recognizing clusters of outliers consisting of up to five data points by opting for a k value between 5 and 10. Therefore, the methodology also categorizes the cluster of five data points as an outlier.
In conclusion, the selection of the parameters p1 and p2 is informed by the data illustrated in Fig 3. In addition, the choice of the k value hinges on the specific outlier definition one wishes to apply and the particular characteristics of outliers that need to be identified. This exposition has clarified the implications of parameter choices and illustrated their utility in configuring outlier detection for diverse situational requirements.
4 Results and discussion
4.1 Simulation study
Our method is assessed in two different scenarios, considering non-clustered or clustered data. In each scenario, five kinds of outliers contaminate the models: amplitude, vertical shift, horizontal shift, shape and isolated outliers. In all of the following models, ϵ(t) will be a Gaussian process with zero mean and covariance function γ(s, t) = 0.3 exp{−|s − t|/0.3}, and functions are observed at 25 equidistant points between 0 and 1. In the first scenario, there is a single cluster and the simulation design was similar to that used by [7, 14, 20, 28, 60]. Cluster 1 is defined by the equation X1(t) = 30t(1 − t)^{3/2} + ϵ(t). In the second scenario, a second cluster is added. We generate samples from all models with a sample size equal to 100. In both scenarios, the number of outliers is 5; the first cluster contains 95 functions in the first scenario and 70 in the second, where the remaining 25 functions make up the second cluster. The simulation design of the outliers and cluster 2 depends on the kind of outliers assessed and appears in Table 1. Fig 5 displays the simulated functions for scenario 2.
A: Amplitude. B: Vertical shift. C: Horizontal shift. D: Shape. E: Isolated.
Z1 denotes a truncated standard normal density, centered in the 6th observation, while Z2 denotes a truncated standard normal density, centered in the 20th observation, i.e., they are defined for the first 11 observed points and the last 11 observed points, respectively.
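The base model of the simulation can be sketched directly from the stated design (the seed and sample arrangement below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 25)   # 25 equidistant observation points on [0, 1]

# Covariance gamma(s, t) = 0.3 * exp(-|s - t| / 0.3) of the Gaussian process eps(t)
Gamma = 0.3 * np.exp(-np.abs(t[:, None] - t[None, :]) / 0.3)

n = 100
eps = rng.multivariate_normal(np.zeros(25), Gamma, size=n)
X1 = 30 * t * (1 - t) ** 1.5 + eps   # cluster 1: X1(t) = 30t(1-t)^{3/2} + eps(t)

print(X1.shape)  # (100, 25)
```

The outlier models and cluster 2 would then be built by perturbing this base curve according to Table 1 (amplitude, shift, shape or isolated contamination).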
We have generated 50 synthetic data sets and applied FAA + k − NN together with the methods explained in Section 2.2. Default parameters (they can be seen in the code file in the Supplementary material) have been considered, except for the procedures that require the outlier rate to be specified. For those procedures (LRT, TRIM, POND and HDR), the true values have been used, which could give them an advantage. For TRIMKMEANS one and two clusters are considered for the first and second scenario, respectively. In accordance with our methodological procedure, we have established for our method that p1 = 2, p2 = 5, and k ranges from 5 to 15. This last parameter is based on our knowledge that the outliers will consist of a group of 5 elements.
The results are summarized in Fig 6. In scenario 1, our proposal obtains 100% correct detection of outliers for all the types, with 4% false positive detection. These results are only improved by FOADATH, which gives no false positives. FOADARMAH also detects all the outliers of all types, but with a slight increase in false positives for some types. For the other methods, perfect detection is obtained for some types but not for others (CRO-FADALARA, TRIMKMEANS, RMAH, ISFE, TRIM, POND, FOM and FB), while for OUG and HDR the percentage of correct detection is lower in all the classes. As regards the results in scenario 2, our proposal is clearly the best. FAA+kNN detects all the outliers for all outlier types, except for H-shift, with 96% success and a very low false positive rate. The accuracy of TRIMKMEANS is very high, but it is smaller than that of our method for H-shift, shape and isolated outliers, where TRIMKMEANS reaches 84%, 96% and 94%, respectively. The other methods cannot detect any outlier in many cases. The third-best method for amplitude outliers is ISFE, with 70% correct detection but 19% false positives. The third-best methods for V-shift outliers are TRIM and HDR, with 50% correct detection. The third-best methods for H-shift are ISFE, TRIM and HDR, with 42% correct detection. For shape outliers, FOADATH and TRIM also obtain 100% correct detection. For isolated outliers, the second-best method is CRO-FADALARA, with 98% correct detection.
4.2 Real data
In addition to the simulations, we tested our proposed method on three datasets derived from image data and one from signal data. The first two datasets contain univariate curves, the third involves three-variate functions, and the fourth contains three-variate functions with two arguments.
Since this study analyzes existing, publicly available data coming from peer-reviewed and formally published sources, which are detailed in the following sections, consent was waived for the data involving human participants. Furthermore, the statistical analyses carried out in this study differ completely from those in the original sources (or any other), since they are based on the original methodology proposed in Section 3. Therefore, this study does not constitute dual publication.
4.2.1 Planes dataset.
The shape database for aircraft silhouette recognition curated by [61] encompasses a diverse array of fighter aircraft, including the Mirage, Eurofighter, F-14, Harrier, F-22, and F-15. Notably, the F-14 Tomcat is represented by two distinct configurations due to its variable-sweep wing design, which can be either extended or retracted, thereby contributing to the overall count of seven shape classes within the database.
The database was meticulously compiled by photographing scale models of these aircraft from a top-down perspective, resulting in images with a resolution of 640x480 pixels. These images were subsequently processed using the Spedge and Medge techniques for color image segmentation, facilitating the isolation of the aircraft shapes from the background.
Despite the precision in capturing and processing the images, the resultant shapes were not immune to imperfections. Artifacts arising from the segmentation process introduced noise, while variations in the viewing angles during photography led to distortions in the perceived shape of the aircraft. To refine the database, the shapes underwent additional processing with a Gaussian filter, characterized by a standard deviation of 10, to smooth out irregularities. Furthermore, a standardization step was employed, normalizing the length of the shape contours to 144 points to maintain consistency across the dataset.
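The smoothing and normalization steps just described can be sketched as follows. The periodic ("wrap") boundary handling for closed contours and the arc-length resampling scheme are our assumptions about the preprocessing, and `preprocess_contour` is a name of ours, not from the original database.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def preprocess_contour(xy, sigma=10.0, n_points=144):
    """Smooth a closed contour of shape (n, 2) with a periodic Gaussian
    filter, then resample it to `n_points` points equally spaced by arc length."""
    smooth = gaussian_filter1d(xy, sigma=sigma, axis=0, mode="wrap")
    closed = np.vstack([smooth, smooth[:1]])          # close the curve
    seg = np.linalg.norm(np.diff(closed, axis=0), axis=1)
    s = np.concatenate([[0.0], np.cumsum(seg)])       # cumulative arc length
    t = np.linspace(0.0, s[-1], n_points, endpoint=False)
    return np.column_stack([np.interp(t, s, closed[:, i]) for i in range(2)])

# toy closed contour: a noisy unit circle sampled at 500 points
theta = np.linspace(0, 2 * np.pi, 500, endpoint=False)
noisy = np.column_stack([np.cos(theta), np.sin(theta)]) \
        + np.random.default_rng(1).normal(0, 0.02, (500, 2))
contour = preprocess_contour(noisy)
```

After preprocessing, every shape in the dataset has the same length (144 points), which is what allows the contours to be treated as functional observations on a common grid.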
The dataset encapsulates a total of 105 samples, with the distribution across the seven classes being 15, 14, 9, 16, 13, 18, and 20, respectively.
Fig 7 provides a visual depiction of the shapes of the various fighter aircraft in the database. At first glance, we can see that this dataset contains some anomalies; for instance, in class 5, certain shapes are shift outliers.
Planes belonging to each of the seven classes are represented in a different color according to the legend.
We sought to pinpoint outliers by applying the previously outlined methods, using an outlier rate of 5% for those that require it and their default settings otherwise. For TRIMKMEANS, 7 groups are considered. In our approach, we refer to the RSS plot to select values for p1 and p2; accordingly, p1 is set to 7 and p2 to 11. This choice aligns with the fact that the dataset comprises 7 distinct groups. Additionally, we let k range from 3 to 10. The outcomes of these applications are illustrated in Fig 8. The methods CRO-FADALARA, FB, FOM, ISFE, FOADARMAH, and FOADATH tend to classify entire classes as outliers, which may not be desirable in a practical scenario. On the contrary, OUG identifies only a few outliers within class 5, and RMAH fails to recognize any outliers.
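One way to automate the reading of an RSS plot is the common "maximum distance to the chord" elbow heuristic, sketched below. The RSS values in the example are purely illustrative and are not the RSS curve of the planes dataset.

```python
import numpy as np

def elbow(p_values, rss):
    """Pick the elbow of an RSS curve as the point farthest from the
    straight line joining its endpoints (a common knee heuristic)."""
    p = np.asarray(p_values, dtype=float)
    r = np.asarray(rss, dtype=float)
    # line through first and last point, written as a*x + b*y + c = 0
    a = r[-1] - r[0]
    b = p[0] - p[-1]
    c = p[-1] * r[0] - p[0] * r[-1]
    dist = np.abs(a * p + b * r + c) / np.hypot(a, b)
    return p_values[int(np.argmax(dist))]

# illustrative RSS curve with an obvious elbow at p = 4
ps = list(range(1, 8))
rss = [40, 25, 12, 5, 4.2, 3.8, 3.5]
best_p = elbow(ps, rss)
```

In practice the heuristic only supports the visual inspection of the RSS plot; the final choice of p1 and p2 still benefits from domain knowledge, such as the known number of groups in the data.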
In relation to the methods requiring an outlier rate, TRIMKMEANS does not identify the conspicuous outliers in class 5, whereas POND only detects these particular ones. The TRIM method categorizes an entire class as outliers, and HDR is the sole method that appears to detect not just the outliers in class 5, but also several more in other classes.
Notably, like the HDR method, the FAA+kNN approach stands out as it not only successfully detects the outliers in class 5 but also identifies additional outliers across other classes. This demonstrates its effectiveness in discerning subtle deviations within the data, thereby enhancing the robustness of outlier detection in this context.
4.2.2 ECG200 data.
The ECG200 dataset, introduced by [62], comprises 200 electrocardiograms and is accessible on the UCR Time Series Classification and Clustering website [63] and PhysioNet [64]; the data were collected at the Beth Israel Hospital Arrhythmia Laboratory in 1990 [65]. The dataset is divided into two categories, one containing 133 electrocardiograms and the other 67. Each electrocardiogram is recorded at 96 equally spaced intervals. Fig 9 displays the electrocardiograms.
Each line in this figure represents an individual electrocardiogram. The color coding of each line denotes the specific cluster to which the observation belongs.
In prior research, specifically [51, 66], four outliers were detected in this dataset. Consequently, we set an outlier rate of 0.02 (4/200 = 0.02). For TRIMKMEANS, 2 groups are considered. In our method, we set the parameters following the previously defined procedure; the RSS plot was a critical tool in this process, guiding our decision to select p1 = 3 and p2 = 8. Furthermore, with the objective of identifying a small cluster of four outliers, we let k range between 5 and 50.
Fig 10 provides a comprehensive visual representation of all outliers identified by various methods. A key observation from this figure is that only our proposed method successfully detects the outliers that are aligned with the findings presented by [51, 66].
4.2.3 Anomaly detection in image textures.
Detecting outliers in textured images poses a significant challenge due to the inherent complexity and variability of textures, especially in natural settings such as forests. An illustrative example of this challenge is presented in Fig 11, which displays an image of forest stands characterized by highly variable textures, complicating the processing task. This specific image has been previously analyzed by [67].
The image underwent three mathematical morphology transformations as outlined by [68], with the transformed states depicted in Fig 12. Subsequently, following the approach of [67], local granulometries were computed within systematically selected 100 × 100 windows, the regions of which are exhibited in Fig 13. This methodology for texture description has been validated as effective by both [67, 69], for example.
A: the extended-minima transform. B: a thresholded gradient. C: a threshold of a hole-filling operation based on morphological reconstruction.
A: the image divided into tiles of 100 × 100. B: the granulometry curves associated with each tile.
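A minimal sketch of a granulometry in the spirit of the texture description above: the curve records how much image intensity survives morphological openings with square structuring elements of increasing size, so fine textures decay faster than coarse ones. The structuring-element sizes, the normalisation, and the toy textures are our illustrative choices, not those of [67].

```python
import numpy as np
from scipy.ndimage import grey_opening

def granulometry(tile, sizes=range(1, 8)):
    """Granulometry curve of a grey-level tile: total intensity remaining
    after openings with (2s+1) x (2s+1) squares, normalised by the
    original total intensity."""
    total = tile.sum()
    return np.array([grey_opening(tile, size=(2 * s + 1, 2 * s + 1)).sum() / total
                     for s in sizes])

rng = np.random.default_rng(2)
fine = rng.random((100, 100))                             # fine-grained texture
coarse = np.kron(rng.random((20, 20)), np.ones((5, 5)))   # coarser 5x5 blobs
g_fine, g_coarse = granulometry(fine), granulometry(coarse)
```

Because the fine texture is dismantled by small openings while the coarse one resists them, the two curves separate quickly, which is what makes granulometries effective functional descriptors of texture.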
For the analysis of these tri-variate functional data (the granulometries of the three transformations), the methods described previously were employed. As we were looking for only one outlier, we set an outlier rate of 0.015 (1/64 ≃ 0.015). For TRIMKMEANS, 3 groups are considered. Given the modest sample size of 64, the parameters were set to p1 = 3 (there are three different sets of textures), p2 = 4 and k ranging from 1 to 3 in order to detect isolated outliers. The anomalies identified through this process are detailed in Fig 14.
Our method distinguishes itself by being the sole technique that accurately detects the anomaly within the dataset. In contrast, the alternate methods either fail to identify the anomaly, leading to a false negative, or incorrectly flag non-anomalous data as outliers, resulting in false positives. This comparison underlines the robustness and reliability of our proposed method in isolating true anomalies in challenging datasets where texture plays a predominant role.
A reviewer raised the question of the performance obtained with state-of-the-art deep-learning-based methods for image anomaly detection. Note that unsupervised deep-learning-based image anomaly detection methods are not truly unsupervised, because the algorithms are fed with normal samples, i.e. normal samples have to be included in the training set [70]. The information about what constitutes a normal sample is therefore provided to the algorithm, an important difference with our proposal, which is fully unsupervised. To make the comparison, we considered as training set the original tiles in rows 1 to 4 and 7 to 8 of Fig 13A for Fig 11, i.e. a total of 48 tiles. The test set, which contains the anomalous tile, is composed of the tiles in rows 5 and 6, i.e. a total of 16 samples. We applied two state-of-the-art methods, FastFlow [71] and PatchCore [72], through the Matlab functions fastFlowAnomalyDetector and patchCoreAnomalyDetector with default parameters, respectively. According to the outlier scores returned, the anomalous tile is the third most normal of the 16 test tiles with FastFlow, and the most normal with PatchCore. The performance of the deep-learning-based methods is therefore very low in this case. Deep learning performs well with large data sets, but accuracy decreases in the small-sample setting due to difficulties in training and over-fitting [73]. Furthermore, processing natural landscapes is highly challenging: mixtures of natural (real) textures, with their high natural variation, are more difficult to process than the industrial images on which deep-learning-based methods perform well.
4.2.4 3D shape of the left and right hippocampi.
We consider the 3D shape of the left and right hippocampi of 28 individuals from Structural Magnetic Resonance Imaging (sMRI) brain scans. Anonymized data are publicly available at [74]. These data were introduced by [75]. As detailed in [75], all experimental procedures complied with the guidelines that were approved by the ethical research committee at the Universitat Jaume I. Written informed consent was obtained from every individual or their appropriate proxy prior to participation, complying with existing Spanish legislation (Ley Orgánica 15/1999, de 13 de diciembre, de Protección de Datos de Carácter Personal, LOPD) granting the use of the data for research purposes. Data were collected from June 21st 2004 to October 19th 2005 by the Neurology Service at La Magdalena Hospital (Castelló, Spain) and the Neuropsychology Service at the Universitat Jaume I.
They are divided into three groups: 12 cognitively normal (CN) subjects, 6 individuals with mild cognitive impairment (MCI) and 10 subjects with early Alzheimer’s Disease (AD). Each hippocampus is expressed as a three-vector-valued function on the unit sphere with spherical angles (θ, φ) via its spherical harmonic representation. Therefore, we work with multivariate and multi-argument functional data. As spherical harmonics are orthonormal, we can stack the vectors of SPHARM basis coefficients and work as in the multivariate case [12]. The data and their representation are described by [74, 76]. Fig 15 displays an example of one left hippocampus for each of the three groups.
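The stacking argument rests on Parseval's identity: for an orthonormal basis, the L2 distance between two functions equals the Euclidean distance between their coefficient vectors, so the stacked coefficient vectors can be treated as ordinary multivariate data. The sketch below checks this numerically with a one-dimensional orthonormal cosine basis as a stand-in for the spherical harmonics; all variable names are ours.

```python
import numpy as np

# Orthonormal cosine basis on [0, 1]: phi_0 = 1, phi_k = sqrt(2) cos(k*pi*t)
t = np.linspace(0.0, 1.0, 20001)
K = 8
basis = np.vstack([np.ones_like(t)] +
                  [np.sqrt(2.0) * np.cos(k * np.pi * t) for k in range(1, K)])

rng = np.random.default_rng(3)
c1, c2 = rng.normal(size=K), rng.normal(size=K)  # coefficient vectors
f1, f2 = c1 @ basis, c2 @ basis                  # the corresponding functions

# L2 distance between the two functions (trapezoidal rule) ...
y = (f1 - f2) ** 2
dt = t[1] - t[0]
l2 = np.sqrt(dt * (y.sum() - 0.5 * (y[0] + y[-1])))
# ... matches the Euclidean distance between the coefficient vectors
euc = np.linalg.norm(c1 - c2)
```

The two distances agree up to quadrature error, which justifies running multivariate outlier detection directly on the stacked SPHARM coefficients.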
We implement our methodology with p1 = 2 and p2 = 4, as indicated by the RSS plot, and set k to 2 given the small sample size. In this case, we analyze the outlier scores and their relationship with the degree of severity of the disease: outlierness increases with the severity of the disease. An ordered logistic regression shows that, for every one-unit increase in outlier score, the odds of a more severe degree of AD increase 12.41-fold and 1.63-fold for the left and right hippocampi, respectively. This difference between the two hippocampi is not surprising, since the shape of the left hippocampus is more discriminant than that of the right, i.e., a better correct classification percentage is obtained using the information in the left hippocampus [76].
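The reported fold-changes are the exponentiated slope of a proportional-odds (ordered logit) model: each one-unit increase in the outlier score multiplies the odds of a more severe category by e^β. The sketch below fits such a model by maximum likelihood on toy data with the same group sizes (12 CN, 6 MCI, 10 AD); the scores and the helper `fit_ordered_logit` are illustrative, not the study's data or code.

```python
import numpy as np
from scipy.optimize import minimize

def fit_ordered_logit(x, y):
    """Proportional-odds model with 3 ordered classes (0 < 1 < 2):
    P(y <= j | x) = sigmoid(c_j - b * x).  Returns (b, odds ratio e^b)."""
    sig = lambda z: 1.0 / (1.0 + np.exp(-z))

    def nll(params):
        c1, log_gap, b = params
        c2 = c1 + np.exp(log_gap)            # enforce c1 < c2
        cum1, cum2 = sig(c1 - b * x), sig(c2 - b * x)
        probs = np.vstack([cum1, cum2 - cum1, 1.0 - cum2])
        obs = probs[y, np.arange(len(y))]    # probability of the observed class
        return -np.sum(np.log(np.clip(obs, 1e-12, None)))

    res = minimize(nll, x0=np.zeros(3), method="Nelder-Mead")
    b = res.x[2]
    return b, np.exp(b)

# toy data: higher outlier scores tend to mean a more severe group
rng = np.random.default_rng(4)
scores = np.concatenate([rng.normal(0.2, 0.1, 12),   # CN
                         rng.normal(0.5, 0.1, 6),    # MCI
                         rng.normal(0.9, 0.1, 10)])  # AD
groups = np.repeat([0, 1, 2], [12, 6, 10])
beta, odds_ratio = fit_ordered_logit(scores, groups)
```

On data where outlierness grows with severity, the fitted slope is positive, and `odds_ratio` is the per-unit fold-change in the odds of a more severe group, interpreted as in the text.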
5 Conclusion
This study has introduced an innovative approach for the identification of outliers in functional data, which exhibits versatility across univariate, multivariate, and multi-dimensional functions. A significant advantage of our method over existing techniques is its ability to efficaciously detect outliers within clustered functional data. Through extensive testing on simulated and real datasets, we have demonstrated that our method not only excels when faced with clustered data scenarios but also maintains comparable efficacy to current methods in single-group contexts.
Looking ahead, there is potential for employing our outlier detection method as a preliminary step in clustering analyses to isolate and remove outliers, thereby refining the clustering process. Additionally, our approach holds promise for defect detection in textured images, which has implications for quality control across various industries, such as textiles and ceramics.
Future research directions could also include adapting our method for application to datasets comprising mixed types of data, encompassing functional, numerical, and categorical variables, as well as accommodating datasets with missing values by adjusting the methodologies of [77], sparse functional data [78–80], directional data [81], shape datasets [82–84], ordinal [85] and binary [86] data sets, interval data [87], and data cells for finding cellwise outliers [88], among others. These enhancements would extend the applicability of our approach to a wider array of analytical scenarios, thereby broadening the impact and utility of our work in the field of data analysis. Another future research direction is to apply the proposal to anomaly detection in (multivariate) time series [89, 90] by transforming a long (multivariate) time series into a (multivariate) functional data set, dividing the time axis into intervals depending on the period, as done by [14], for instance. Finally, other implementations [91] for obtaining archetypes could be considered.
References
- 1.
Ramsay JO, Silverman BW. Functional Data Analysis. 2nd ed. Springer; 2005.
- 2.
Aggarwal CC. Outlier analysis. 2nd ed. Springer; 2017.
- 3. Cao C, Liu X, Cao S, Shi JQ. Joint classification and prediction of random curves using heavy‐tailed process functional regression. Pattern Recognition. 2023;136:109213.
- 4. Hubert M, Rousseeuw PJ, Segaert P. Multivariate functional outlier detection. Statistical Methods & Applications. 2015;24(2):177–202.
- 5. Arribas-Gil A, Romo J. Discussion of “Multivariate functional outlier detection”. Statistical Methods & Applications. 2015;24:263–267.
- 6. Goldstein M, Uchida S. A comparative evaluation of unsupervised anomaly detection algorithms for multivariate data. PLOS ONE. 2016;11(4):e0152173. pmid:27093601
- 7. Vinué G, Epifanio I. Robust archetypoids for anomaly detection in big functional data. Advances in Data Analysis and Classification. 2021;15(2):437–462.
- 8. Garcia-Escudero LA, Gordaliza A. A proposal for robust curve clustering. Journal of Classification. 2005;22(2):185–201.
- 9. García-Escudero LA, Gordaliza A, Matrán C, Mayo-Iscar A. A review of robust clustering methods. Advances in Data Analysis and Classification. 2010;4:89–109.
- 10. Cabero I, Epifanio I, Piérola A, Ballester A. Archetype analysis: A new subspace outlier detection approach. Knowledge-Based Systems. 2021;217:106830.
- 11. Cutler A, Breiman L. Archetypal Analysis. Technometrics. 1994;36(4):338–347.
- 12. Epifanio I. Functional archetype and archetypoid analysis. Computational Statistics & Data Analysis. 2016;104:24–34.
- 13. Vinué G, Epifanio I, Alemany S. Archetypoids: A new approach to define representative archetypal data. Computational Statistics & Data Analysis. 2015;87:102–115.
- 14. Millán-Roures L, Epifanio I, Martínez V. Detection of anomalies in water networks by functional data analysis. Mathematical Problems in Engineering. 2018;2018(Article ID 5129735):13.
- 15. Wang JL, Chiou JM, Müller HG. Functional data analysis. Annual Review of Statistics and its application. 2016;3(1):257–295.
- 16.
Aggarwal CC, Yu PS. Outlier detection for high dimensional data. In: Proceedings of the 2001 ACM SIGMOD International conference on Management of data; 2001. p. 37–46.
- 17. Amovin-Assagba M, Gannaz I, Jacques J. Outlier detection in multivariate functional data through a contaminated mixture model. Computational Statistics & Data Analysis. 2022;174:107496.
- 18.
R Core Team. R: A Language and Environment for Statistical Computing; 2023.
- 19. Febrero M, Galeano P, González-Manteiga W. A functional analysis of NOx levels: location and scale estimation and outlier detection. Computational Statistics. 2007;22(3):411–427.
- 20. Febrero M, Galeano P, González-Manteiga W. Outlier detection in functional data by depth measures, with application to identify abnormal NOx levels. Environmetrics. 2008;19(4):331–345.
- 21. Febrero-Bande M, de la Fuente M. Statistical Computing in Functional Data Analysis: The R Package fda.usc. Journal of Statistical Software. 2012;51(4):1–28.
- 22.
Shang HL, Hyndman RJ. rainbow: Rainbow Plots, Bagplots and Boxplots for Functional Data; 2016. Available from: https://CRAN.R-project.org/package=rainbow.
- 23. Hyndman RJ, Ullah MS. Robust forecasting of mortality and fertility rates: A functional data approach. Computational Statistics & Data Analysis. 2007;51(10):4942–4956.
- 24.
Rousseeuw PJ, Leroy AM. Robust Regression & Outlier Detection. Wiley; 1987.
- 25. Hyndman RJ, Shang HL. Rainbow Plots, Bagplots, and Boxplots for Functional Data. Journal of Computational and Graphical Statistics. 2010;19(1):29–45.
- 26. Sun Y, Genton MG. Functional Boxplots. Journal of Computational and Graphical Statistics. 2011;20(2):316–334.
- 27.
Ramsay JO, Wickham H, Graves S, Hooker G. fda: Functional Data Analysis; 2014. Available from: http://CRAN.R-project.org/package=fda.
- 28. Arribas-Gil A, Romo J. Shape outlier detection and visualization for functional data: the outliergram. Biostatistics. 2014;15(4):603–619. pmid:24622037
- 29.
Tarabelloni N, Arribas-Gil A, Ieva F, Paganoni AM, Romo J. roahd: Robust Analysis of High Dimensional Data; 2017. Available from: https://CRAN.R-project.org/package=roahd.
- 30. Sguera C, Galeano P, Lillo RE. Functional outlier detection by a local depth with application to no x levels. Stochastic Environmental Research and Risk Assessment. 2016;30(4):1115–1130.
- 31. Rousseeuw PJ, Raymaekers J, Hubert M. A Measure of Directional Outlyingness with Applications to Image Data and Video. Journal of Computational and Graphical Statistics. 2018;27(2):345–359.
- 32.
Segaert P, Hubert M, Rousseeuw P, Raymaekers J. mrfDepth: Depth Measures in Multivariate, Regression and Functional Settings; 2017. Available from: https://CRAN.R-project.org/package=mrfDepth.
- 33. Dai W, Genton MG. Directional outlyingness for multivariate functional data. Computational Statistics & Data Analysis. 2019;131:50–65.
- 34. Lakra A, Banerjee B, Laha AK. A data-adaptive method for outlier detection from functional data. Statistics and Computing. 2024;34:7.
- 35. Dai W, Mrkvička T, Sun Y, Genton MG. Functional outlier detection and taxonomy by sequential transformations. Computational Statistics & Data Analysis. 2020;149:106960.
- 36. Huang H, Sun Y. A Decomposition of Total Variation Depth for Understanding Functional Outliers. Technometrics. 2019;61(4):445–458.
- 37. Lejeune C, Mothe J, Soubki A, Teste O. Shape-based outlier detection in multivariate functional data. Knowledge-Based Systems. 2020;198:105960.
- 38. López-Oriona A, Vilar JA. Outlier detection for multivariate time series: A functional data approach. Knowledge-Based Systems. 2021;233:107527.
- 39. Harris T, Tucker JD, Li B, Shand L. Elastic depths for detecting shape anomalies in functional data. Technometrics. 2021;63(4):466–476.
- 40. Epifanio I, Gimeno V, Gual-Arnau X, Ibáñez-Gual MV. A New Geometric Metric in the Shape and Size Space of Curves in Rn. Mathematics. 2020;8(10):1691.
- 41. Azcorra A, Chiroque LF, Cuevas R, Fernández Anta A, Laniado H, Lillo RE, et al. Unsupervised scalable statistical method for identifying influential users in online social networks. Scientific Reports. 2018;8(1):6955. pmid:29725046
- 42. Ojo OT, Fernández Anta A, Lillo RE, Sguera C. Detecting and classifying outliers in big functional data. Advances in Data Analysis and Classification. 2022;16(3):725–760.
- 43. Ojo OT, Fernández Anta A, Genton MG, Lillo RE. Multivariate functional outlier detection using the fast massive unsupervised outlier detection indices. Stat. 2023;12(1):e567.
- 44. Alemán-Gómez Y, Arribas-Gil A, Desco M, Elías A, Romo J. Depthgram: Visualizing outliers in high-dimensional functional data with application to fMRI data exploration. Statistics in Medicine. 2022;41(11):2005–2024. pmid:35118686
- 45. Hubert M, Vandervieren E. An adjusted boxplot for skewed distributions. Computational statistics & data analysis. 2008;52(12):5186–5201.
- 46. Cuesta-Albertos JA, Gordaliza A, Matrán C. Trimmed k-means: an attempt to robustify quantizers. The Annals of Statistics. 1997;25(2):553–576.
- 47.
Hennig C. trimcluster: Cluster Analysis with Trimming; 2020. Available from: https://CRAN.R-project.org/package=trimcluster.
- 48. Cuesta-Albertos JA, Fraiman R. Impartial trimmed k-means for functional data. Computational Statistics & Data Analysis. 2007;51(10):4864–4877.
- 49. Rivera-García D, García-Escudero LA, Mayo-Iscar A, Ortega J. Robust clustering for functional data based on trimming and constraints. Advances in Data Analysis and Classification. 2019;13(1):201–225.
- 50. D’Urso P, De Giovanni L, Massari R. Trimmed fuzzy clustering of financial time series based on dynamic time warping. Annals of operations research. 2021;299:1379–1395.
- 51. Justel A, Svarc M. A divisive clustering method for functional data with special consideration of outliers. Advances in Data Analysis and Classification. 2018;12:637–656.
- 52. Chebana F, Dabo-Niang S, Ouarda TB. Exploratory functional flood frequency analysis and outlier detection. Water Resources Research. 2012;48(4):W04514.
- 53. Staerman G, Adjakossa E, Mozharovskyi P, Hofer V, Sen Gupta J, Clémençon S. Functional anomaly detection: a benchmark study. International Journal of Data Science and Analytics. 2023;16(1):101–117.
- 54. Chiou JM, Zhang YC, Chen WH, Chang CW. A functional data approach to missing value imputation and outlier detection for traffic flow data. Transportmetrica B: Transport Dynamics. 2014;2(2):106–129.
- 55. Liu C, Gao X, Wang X. Data adaptive functional outlier detection: Analysis of the Paris bike sharing system data. Information Sciences. 2022;602:13–42.
- 56. Yu G, Zou C, Wang Z. Outlier detection in functional observations with applications to profile monitoring. Technometrics. 2012;54(3):308–318.
- 57.
Dietterich TG. Ensemble methods in machine learning. In: International workshop on multiple classifier systems. Springer; 2000. p. 1–15.
- 58. Aggarwal CC, Sathe S. Theoretical foundations and algorithms for outlier ensembles. ACM SIGKDD Explorations. 2015;17(1):24–47.
- 59. Aggarwal CC. Outlier ensembles: position paper. ACM SIGKDD Explorations. 2013;14(2):49–58.
- 60. Fraiman R, Svarc M. Resistant estimates for high dimensional and functional data based on random projections. Computational Statistics & Data Analysis. 2013;58:326–338.
- 61.
Thakoor N, Gao J. Shape classifier based on generalized probabilistic descent method with hidden Markov descriptor. In: Tenth IEEE International Conference on Computer Vision (ICCV’05). vol. 1; 2005. p. 495–502.
- 62.
Olszewski RT. Generalized feature extraction for structural pattern recognition in time-series data [PhD thesis]. Carnegie Mellon University. Pittsburgh; 2001. Available from: https://www.cs.cmu.edu/~bobski/pubs/tr01108-twosided.pdf.
- 63. Dau HA, Bagnall A, Kamgar K, Yeh CCM, Zhu Y, Gharghabi S, et al. The UCR time series archive. IEEE/CAA Journal of Automatica Sinica. 2019;6(6):1293–1305.
- 64. Goldberger AL, Amaral LA, Glass L, Hausdorff JM, Ivanov PC, Mark RG, et al. PhysioBank, PhysioToolkit, and PhysioNet: components of a new research resource for complex physiologic signals. Circulation. 2000;101(23):e215–e220. pmid:10851218
- 65.
Greenwald SD. Improved detection and classification of arrhythmias in noise-corrupted electrocardiograms using contextual information [PhD thesis]. Harvard University–Massachusetts Institute of Technology. Cambridge; 1990. Available from: http://hdl.handle.net/1721.1/29206.
- 66. Jacques J, Preda C. Model-based clustering for multivariate functional data. Computational Statistics & Data Analysis. 2014;71:92–106.
- 67. Cabero I, Epifanio I. Archetypal analysis: an alternative to clustering for unsupervised texture segmentation. Image Analysis & Stereology. 2019;38:151–160.
- 68.
Soille P. Morphological Image Analysis: Principles and Applications. 2nd ed. Berlin, Heidelberg: Springer; 2003.
- 69. Epifanio I, Soille P. Morphological Texture Features for Unsupervised and Supervised Segmentations of Natural Landscapes. IEEE Transactions on Geoscience and Remote Sensing. 2007;45(4):1074–1083.
- 70. Liu J, Xie G, Wang J, Li S, Wang C, Zheng F, et al. Deep industrial image anomaly detection: A survey. Machine Intelligence Research. 2024;21(1):104–135.
- 71.
Yu J, Zheng Y, Wang X, Li W, Wu Y, Zhao R, et al. FastFlow: Unsupervised Anomaly Detection and Localization via 2D Normalizing Flows. CoRR. 2021;abs/2111.07677.
- 72.
Roth K, Pemula L, Zepeda J, Schölkopf B, Brox T, Gehler P. Towards Total Recall in Industrial Anomaly Detection. In: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2022. p. 14298–14308.
- 73. Zhao W. Research on the deep learning of the small sample data based on transfer learning. AIP Conference Proceedings. 2017;1864(1):020018.
- 74. Epifanio I, Ventura-Campos N. Hippocampal shape analysis in Alzheimer’s disease using functional data analysis. Statistics in Medicine. 2014;33(5):867–880. pmid:24105806
- 75.
Barros-Loscertales A, Belloch-Ugarte V, Rosella M, Martínez-Lozano MD, Forn C, Parcet MA, et al. Volumetric Hippocampal Differences Affect fMRI Differences in the Brain Functional Activation for Alzheimer’s Disease Patients During Memory Encoding. In: Alzheimer’s Disease Research Trends. Nova Publishers; 2008. p. 147–173.
- 76. Ferrando L, Epifanio I, Ventura-Campos N. Ordinal classification of 3D brain structures by functional data analysis. Statistics & Probability Letters. 2021;179:109227.
- 77. Epifanio I, Ibáñez MV, Simó A. Archetypal analysis with missing data: see all samples by looking at a few based on extreme profiles. The American Statistician. 2020;74(2):169–183.
- 78. Vinué G, Epifanio I. Archetypoid Analysis for Sports Analytics. Data Mining and Knowledge Discovery. 2017;31(6):1643–1677.
- 79. Vinué G, Epifanio I. Forecasting basketball players’ performance using sparse functional data. Statistical Analysis and Data Mining: The ASA Data Science Journal. 2019;12:534–547.
- 80. Wang X, Li C, Shi H, Wu C, Liu C. Detection of outlying patterns from sparse and irregularly sampled electronic health records data. Engineering Applications of Artificial Intelligence. 2023;126:106788.
- 81. Olsen AS, Høegh RM, Hinrich JL, Madsen KH, Mørup M. Combining electro-and magnetoencephalography data using directional archetypal analysis. Frontiers in Neuroscience. 2022;16:911034. pmid:35968377
- 82. Epifanio I, Ibáñez MV, Simó A. Archetypal shapes based on landmarks and extension to handle missing data. Advances in Data Analysis and Classification. 2018;12(3):705–735.
- 83. Alcacer A, Epifanio I, Ibáñez MV, Simó A, Ballester A. A data-driven classification of 3D foot types by archetypal shapes based on landmarks. PLOS ONE. 2020;15(1):e0228016. pmid:31999749
- 84. Epifanio I, Gimeno V, Gual-Arnau X, Ibáñez-Gual MV. Archetypal Curves in the Shape and Size Space: Discovering the Salient Features of Curved Big Data by Representative Extremes. La Matematica. 2023;2(3):635–658.
- 85. Fernández D, Epifanio I, McMillan LF. Archetypal analysis for ordinal data. Information Sciences. 2021;579:281–292.
- 86. Cabero I, Epifanio I. Finding archetypal patterns for binary questionnaires. SORT. 2020;44(1):39–66.
- 87. D’Esposito MR, Palumbo F, Ragozini G. Interval Archetypes: A New Tool for Interval Data Analysis. Statistical Analysis and Data Mining. 2012;5(4):322–335.
- 88. Alcacer A, Epifanio I, Gual-Arnau X. Biarchetype Analysis: Simultaneous Learning of Observations and Features Based on Extremes. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2024; p. 1–12. pmid:38739514
- 89.
Audibert J, Michiardi P, Guyard F, Marti S, Zuluaga MA. USAD: UnSupervised Anomaly Detection on Multivariate Time Series. In: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. KDD’20; 2020. p. 3395–3404.
- 90.
Su Y, Zhao Y, Niu C, Liu R, Sun W, Pei D. Robust Anomaly Detection for Multivariate Time Series through Stochastic Recurrent Neural Network. In: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. KDD’19; 2019. p. 2828–2837.
- 91.
Mair S, Sjölund J. Archetypal Analysis++: Rethinking the Initialization Strategy; 2024. Transactions on Machine Learning Research. Available from: https://openreview.net/forum?id=KVUtlM60HM.