Abstract
Pattern discovery and subspace clustering play a central role in the biological domain, supporting for instance putative regulatory module discovery from omics data for both descriptive and predictive ends. In the presence of target variables (e.g. phenotypes), regulatory patterns should further exhibit discriminative power properties, well established in the presence of categorical outcomes, yet largely disregarded for numerical outcomes, such as risk profiles and quantitative phenotypes. DISA (Discriminative and Informative Subspace Assessment), a Python software package, is proposed to evaluate patterns in the presence of numerical outcomes using well-established measures together with a novel principle able to statistically assess the correlation gain of the subspace against the overall space. Results confirm the possibility to soundly extend discriminative criteria towards numerical outcomes without the drawbacks commonly associated with discretization procedures. Results from four case studies confirm the validity and relevance of the proposed methods, further unveiling critical directions for research on biotechnology and biomedicine. Availability: DISA is freely available at https://github.com/JupitersMight/DISA under the MIT license.
Citation: Alexandre L, Costa RS, Henriques R (2022) DISA tool: Discriminative and informative subspace assessment with categorical and numerical outcomes. PLoS ONE 17(10): e0276253. https://doi.org/10.1371/journal.pone.0276253
Editor: Sathishkumar V E, Hanyang University, KOREA, REPUBLIC OF
Received: March 16, 2022; Accepted: October 3, 2022; Published: October 19, 2022
Copyright: © 2022 Alexandre et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The first three datasets are available at the UCI Machine Learning Repository (https://archive-beta.ics.uci.edu/) with access links:
- https://archive-beta.ics.uci.edu/ml/datasets/echocardiogram
- https://archive-beta.ics.uci.edu/ml/datasets/liver+disorders
- https://archive-beta.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+diagnostic
The last dataset is made available at https://www.nature.com/articles/s41467-020-18008-4. The aforementioned sources are the original repositories; a secondary source is found in the following GitHub repository: https://github.com/JupitersMight/DISA/tree/main/Example.
Funding: This work was supported by the Associate Laboratory for Green Chemistry (LAQV), financed by national funds from FCT/MCTES (UIDB/50006/2020 and UIDP/50006/2020), INESC-ID plurianual (UIDB/50021/2020), the contract CEECIND/01399/2017 to RSC and the FCT individual PhD grant to LA (2021.07759.BD). This work was further supported by IPOscore with reference (DSAIPA/DS/0042/2018) and ILU (DSAIPA/DS/0111/2018).
Competing interests: The authors have declared that no competing interests exist.
Introduction
The discovery of discriminative patterns has proven essential to support predictive and descriptive tasks [1–6]. More specifically in gene expression data, discriminative patterns play an essential role in discovering outcome-specific regulatory modules for knowledge acquisition, biomarking phenotypes of interest [7], or serving as the basis for drug targeting (e.g. cancer) after rigorous validation [8]. Discriminative pattern mining also plays a role in unraveling complex interactions in biological processes, such as the condition-specific interplay among transcription factors in organisms [9]. In this context, patterns help map regulatory interactions, forming regulatory networks that provide vital information to better understand the evolution of genes, as well as unique regulatory cascades elicited in response to stimuli, disease progression or drug action [10]. These discriminative properties towards an outcome of interest can either be incorporated in the pattern discovery process [11, 12], or assessed after extracting classic informative patterns. In both cases, one or multiple interestingness measures, such as confidence [13], statistical significance [14] (probability of pattern occurrence against expectations) and/or discriminative power views [15, 16], are combined into pattern-centric models to aid medical decisions and study regulatory responses to events of interest [12, 17].
Although it is crucial to incorporate these discriminative criteria in the discovery task, existing contributions are generally focused on nominal outcomes [18, 19]. Nonetheless, many phenotypes of interest, such as molecular and physiological features, as well as risk scales or drug dosages, are quantitative variables in nature. In metabolic engineering, the levels of production and/or degradation of certain organic compounds are continuous outcomes of interest [20, 21]. In such cases, to assess the ability of the underlying patterns to discriminate specific outcomes of interest, related work usually resorts to one of the three following approaches: 1) distribution-based methods [22, 23], which explore properties of the distribution of continuous data, providing standard statistical measures on the distribution of the pattern-associated outcomes. In this context, Aumann and Lindell [23] consider measures such as the mean, with the possible alternatives of variance or median, to describe numerical distributions. An example of the aforementioned is an association rule like “sex = female → mean wage = $7.90 p/hr”, where the rule’s discriminative properties are guaranteed using classical measures, lift and confidence, and the validity of the outcome of interest is further ensured by applying a Z-test; 2) discretization-based methods [24], which categorise the outcome variable in order to apply classic discriminative criteria in the discovery task. Well-known discretization methods categorise data based on frequency, user-inputted ranges, or more complex approaches such as the one proposed by Alexandre et al. [25], where numerical variables are fitted and categorised according to a continuous distribution. While not the same as discretization, fuzzy-logic-based approaches can also be used in the presence of quantitative [26–28] and continuous variables [29] to extract informative patterns; and 3) optimization-based methods [30], which consider stochastic searches inspired by natural processes, such as evolutionary and swarm-based algorithms (e.g., particle swarm optimization). Particles produced and modified along the evolution process are the targeted discriminative patterns, where both the pattern and the bounded range of relevant outcomes are optimized during the search [30]. While classic discriminative views only handle nominal outcomes, these three classes of approaches are unable to objectively assess whether a given pattern significantly discriminates a specific range of numerical outcomes.
To address these limitations, this work proposes a methodology to rigorously assess association rules with expressive patterns in the antecedent and numerical outcomes in the consequent, thus avoiding the discovery of spurious association rules (false positives). To this end, we introduce a novel distribution-based approach that inspects the differences between the distribution of a numerical outcome for all observations and for the observations supporting a given pattern. To the best of our knowledge, there are no software packages able to robustly assess association rules in the presence of numerical outcomes [31, 32]. Hence, we propose DISA (Discriminative and Informative Subspace Assessment), a software package in Python to assess patterns with numerical outcomes by statistically testing the correlation gain of the pattern against the overall data, identifying discriminative ranges of numerical outcomes tailored to each pattern.
Background
Multivariate data can be structured in the form of a matrix A = (X, Y), with a set of observations X = {x1, …, xN}, variables Y = {y1, …, yM}, and elements aij observed for observation xi and variable yj. One way to extract patterns from this data structure is through the use of biclustering algorithms [33, 34]. The biclustering task aims to identify a set of biclusters, where each bicluster B = (I, J) is an n × m subspace (a subset of observations I = {i1, …, in} ⊆ X and a subset of variables J = {j1, …, jm} ⊆ Y) that satisfies specific criteria:
- homogeneity—commonly guaranteed through the use of a merit function, such as the variance of the values in a bicluster [33], guiding the formation of biclusters in greedy, exhaustive, and stochastic/parametric searches determining their coherence, quality and structure;
- statistical significance—in addition to homogeneity criteria, guarantees that the probability of a bicluster’s occurrence (against a null model) deviates from expectations [14];
- dissimilarity—criteria further placed to guarantee the absence of redundant biclusters (number, shape, and positioning) [35].
The bicluster pattern φJ is the set of expected values in the absence of adjustments and noise. A bicluster pattern is:
- constant overall if for all i ∈ I and j ∈ J, aij = μ + ηij, where μ is the typical value and ηij is the observed noise;
- constant on columns, i.e. pattern on rows, if aij = μj + ηij, where μj represents the expected value in column yj;
- additive if for all i ∈ I and j ∈ J, aij = μj+ γi + ηij where μj represents the expected value in column yj and γi the adjustment for observation xi;
- multiplicative if for all i ∈ I and j ∈ J, aij = μj × γi + ηij where μj represents the expected value in column yj and γi the adjustment for observation xi;
- order-preserving on variables if there is a permutation of J under which the sequence of values in every row is strictly increasing. Likewise, order-preserving on observations if there is a permutation of I under which the sequence of values in every column is strictly increasing.
Fig 1 illustrates the aforementioned concepts.
The constant (on columns) subspace has pattern (value expectations) = {μ1 = 1.1, μ2 = 0.45, μ3=0.9}, the additive pattern = {μ1 = 1.1, μ2 = 0.45} and {γ1 = 0.6, γ2 = 0, γ3 = 0}, and the order-preserving subspace satisfies the y2≤y3≤y1 permutation on 3 observations.
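To make these pattern types concrete, the following Python (NumPy) sketch builds small matrices following the constant, additive, and order-preserving patterns; the constant and additive values mirror the Fig 1 illustration, while the order-preserving rows are hypothetical:
import numpy as np

# Constant-on-columns bicluster: every row repeats the column expectations μj
# (values from the Fig 1 illustration; noise terms ηij omitted for clarity).
mu = np.array([1.1, 0.45, 0.9])            # μ1, μ2, μ3
constant_bicluster = np.tile(mu, (3, 1))   # 3 observations x 3 variables

# Additive bicluster: aij = μj + γi, with per-observation adjustments γi.
mu_add = np.array([1.1, 0.45])             # μ1, μ2
gamma = np.array([0.6, 0.0, 0.0])          # γ1, γ2, γ3
additive_bicluster = mu_add[None, :] + gamma[:, None]

# Order-preserving on variables: every row is strictly increasing under the
# permutation y2 <= y3 <= y1 (hypothetical values).
rows = np.array([[0.9, 0.2, 0.5],
                 [1.4, 0.3, 0.8],
                 [2.0, 0.7, 1.1]])
perm = [1, 2, 0]                           # reorder columns as y2, y3, y1
assert np.all(np.diff(rows[:, perm], axis=1) > 0)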
The coverage Φ of the bicluster pattern φJ, defined as Φ(φJ), is the number of observations containing the bicluster pattern φJ. The same logic can be applied to a nominal outcome of interest c, where c can take any value in the class variable (e.g., yout in Fig 1). The coverage of the outcome, defined as Φ(c), is the number of observations with the outcome of interest.
Association rules describe a link between two events. An association rule is formed by two sides, the left-hand side (antecedent) and the right-hand side (consequent). In this case, an association rule can take the form φJ → c, where a pattern in the antecedent discriminates an outcome of interest in the consequent. The coverage of the association rule, Φ(φJ → c), is given by the number of observations where both the pattern φJ and the outcome c co-occur.
Through the use of interestingness measures, an association rule can be assessed with respect to its interestingness, statistical significance, usefulness, information gain, and discriminative power, amongst others [15]. Two well-established interestingness measures are the confidence, Φ(φJ → c)/Φ(φJ), measuring the probability of c occurring when φJ occurs, and the lift, (Φ(φJ → c)/(Φ(φJ) × Φ(c))) × N, which further considers the probability of the consequent to assess the dependence between the consequent and the antecedent.
A simple extension of the interestingness measures to accommodate continuous output variables is through the use of a numerical interval of interest. In the context of this extension, the coverage of the outcome Φ(c) can now be rewritten as Φ([v1, v2]), where v1 represents the lower bound of the interval and v2 the upper bound, and the coverage of the association rule as Φ(φJ → [v1, v2]). Understandably, the outcomes conditioned to a pattern of interest can be described by a probability density function (pdf). In this context, mapping the outcomes into a simple numerical range is generally inadequate, as the pdf of pattern-conditional outcomes is often non-uniform and its discriminative properties can only be determined against the remaining observations.
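For concreteness, the following Python sketch computes coverage, confidence, and lift on a hypothetical toy dataset, first for a nominal outcome and then for a numerical interval of interest; the membership flags, outcomes, and interval bounds below are illustrative assumptions, not data from this study:
import numpy as np

# in_pattern[i] flags whether observation xi contains the pattern φJ;
# outcome[i] holds its outcome. Hypothetical values for illustration only.
in_pattern = np.array([True, True, False, True, False, False, True, False])
outcome = np.array(["c1", "c1", "c2", "c1", "c2", "c1", "c2", "c2"])
N = len(outcome)

def coverage(mask):
    # Φ(.): number of observations satisfying the condition
    return int(mask.sum())

# Nominal outcome c = "c1"
phi_pattern = coverage(in_pattern)                    # Φ(φJ)
phi_outcome = coverage(outcome == "c1")               # Φ(c)
phi_rule = coverage(in_pattern & (outcome == "c1"))   # Φ(φJ → c)
confidence = phi_rule / phi_pattern
lift = (phi_rule / (phi_pattern * phi_outcome)) * N

# Numerical outcome: Φ(c) is replaced by Φ([v1, v2]) for an interval of interest
num_outcome = np.array([1.2, 1.4, 3.0, 1.1, 2.8, 1.3, 2.5, 3.1])
v1, v2 = 1.0, 1.5
in_interval = (num_outcome >= v1) & (num_outcome <= v2)
lift_interval = (coverage(in_pattern & in_interval)
                 / (phi_pattern * coverage(in_interval))) * N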
Methods
Proposed approach
The proposed methodology allows for a robust analysis of the discriminative properties of a pattern in the presence of numerical outcome variables without imposing predefined rigid boundaries. Given a pattern φJ, we first compare the underlying distributions of the outcome variable of interest, z, for the overall observations, p(z∣X), and for the pattern coverage, p(z∣Φ(φJ)), in order to extract the numeric ranges that compose the consequent. Observations with the targeted pattern have a higher likelihood of having numerical outcomes in the extracted ranges. Both empirical and theoretical distributions are allowed for this calculus. If, instead of considering the underlying distributions to extract a range of values, we considered just the minimum and maximum values within the pattern, the likelihood of the target pattern having high discriminative power would be lessened due to: 1) the presence of outliers, which make the interval more relaxed, and 2) the possibility of the interval being too rigid, excluding nearby values just before the minimum and just after the maximum.
To illustrate these concepts, Fig 2 provides an example with two theoretical distributions approximated from the overall and pattern-conditional targets, respectively. By estimating the relative frequency of each of the distributions, two points of intersection v1 and v2 can be calculated, composing an interval that can be potentially discriminated by observations with the given pattern.
The yellow line represents the outcome variable, which follows a gamma distribution; The blue line represents the pattern-conditional outcome variable, which follows a χ2 distribution. In this example, the two points of intersection between the distributions form an interval, [v1, v2]. Observations with the targeted pattern have higher likelihood to have numerical outcomes in the extracted range.
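As an illustration of this calculus, the Python sketch below locates the intersection points of two theoretical densities on a fine grid, mirroring the gamma versus χ2 setup of Fig 2; the distribution parameters are assumptions chosen so that two intersection points exist, since the parameters behind the figure are not reported:
import numpy as np
from scipy import stats

overall = stats.gamma(a=2.0, scale=3.0)   # outcome variable (yellow line in Fig 2)
conditional = stats.chi2(df=6)            # pattern-conditional outcome (blue line)

# Evaluate both pdfs on a fine grid and locate sign changes of their difference.
z = np.linspace(0.01, 30, 10000)
diff = conditional.pdf(z) - overall.pdf(z)
sign_changes = np.where(np.diff(np.sign(diff)) != 0)[0]
intersections = z[sign_changes]           # approximately [v1, v2]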
Once ranges of outcomes of interest are identified, classic interestingness measures for association rules can be extended to handle these consequents. Considering the previously introduced lift function, a paradigmatic function to assess the discriminative power of an association rule, it can now be rewritten as
lift(φJ → [v1, v2]) = (Φ(φJ → [v1, v2])/(Φ(φJ) × Φ([v1, v2]))) × N. (1)
Note that the coverage of the outcome of interest is now defined as the interval created by the intersection points of the distributions. Instead of a predefined restrictive category range, intervals disclose outcomes of interest that are dynamically inferred for a given pattern in order to better assess its discriminative profile.
Consider now an example with two random empirical distributions originating more than two points of intersection, illustrated in Fig 3.
The blue line represents the outcome variable, and the yellow line the pattern-conditional outcome variable. In this example, four points of intersection between the distributions form two intervals with ranges [v1, v2] and [v3, v4].
In this example, observations with the selected pattern have a higher likelihood of showing outcomes in the two inferred ranges. With two intervals, [v1, v2] and [v3, v4], we can further assess the discriminative power of the pattern with regard to each interval, as well as to both intervals,
lift(φJ → [v1, v2]) = (Φ(φJ → [v1, v2])/(Φ(φJ) × Φ([v1, v2]))) × N, (2)
lift(φJ → [v3, v4]) = (Φ(φJ → [v3, v4])/(Φ(φJ) × Φ([v3, v4]))) × N, (3)
lift(φJ → [v1, v2] ∪ [v3, v4]) = (Φ(φJ → [v1, v2] ∪ [v3, v4])/(Φ(φJ) × Φ([v1, v2] ∪ [v3, v4]))) × N. (4)
Different discriminative criteria can be considered in the presence of consequents given by multiple ranges. Considering the lift as the illustrative case, the discriminated outcomes can be given by the numerical interval that maximises the lift function,
c* = arg max_ci lift(φJ → ci), (5)
where, for the given example, ci ∈ {([v1, v2] ∪ [v3, v4]), [v1, v2], [v3, v4]}. Alternatively, all numerical intervals where the lift satisfies a minimum threshold θ can be retained,
{ci : lift(φJ → ci) ≥ θ}. (6)
Both are valid options and allow for a robust analysis of the numerical outcome. The first approach retrieves the numerical interval, or combination of numerical intervals, with highest discriminative power. The second filters out uninformative/non-discriminative numerical intervals, allowing for a more comprehensive analysis of each pattern.
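The two criteria can be sketched as follows in Python, assuming the lift of each candidate consequent (the individual intervals and their union) has already been computed; the values are hypothetical:
candidate_lifts = {
    "[v1, v2] U [v3, v4]": 1.8,   # hypothetical lift values
    "[v1, v2]": 2.4,
    "[v3, v4]": 1.1,
}

# Eq (5): keep the consequent that maximises the lift.
best_consequent = max(candidate_lifts, key=candidate_lifts.get)

# Eq (6): keep every consequent whose lift satisfies a minimum threshold θ.
theta = 1.3
retained = {c: l for c, l in candidate_lifts.items() if l >= theta}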
DISA implements the presented methodology and is given in Algorithm 1.
Algorithm 1: DISA tool
Input: data_matrix, class_vector, pattern_list, distribution
Output: list of statistics per pattern
statistics = [];
for p in pattern_list do
if class_vector is continuous then
if distribution == “empirical” then
intervals = intersection(empirical_pdf(class_vector), empirical_pdf(p));
end
if distribution == “gaussian” then
intervals = intersection(gaussian_pdf(class_vector), gaussian_pdf(p));
end
if distribution == “min_max” then
intervals = [p.min(), p.max()];
end
if distribution == “average” then
m = p.mean();
std = p.std();
intervals = [m-std, m+std];
end
temp_class_vector = discretize(intervals);
pattern_properties = properties(data_matrix, p, temp_class_vector);
else
pattern_properties = properties(data_matrix, p, class_vector);
end
statistics.append(objective_functions(pattern_properties));
end
return statistics;
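For readability, Algorithm 1 can be mirrored by the Python sketch below; the helper functions reuse the names from the pseudocode but are placeholders to be supplied by the caller, and the sketch is not DISA's actual public API:
def disa_sketch(data_matrix, class_vector, pattern_list, distribution,
                is_continuous, empirical_pdf, gaussian_pdf, intersection,
                discretize, properties, objective_functions):
    # Hypothetical rendering of Algorithm 1; every helper is caller-supplied.
    statistics = []
    for p in pattern_list:
        if is_continuous(class_vector):
            if distribution == "empirical":
                intervals = intersection(empirical_pdf(class_vector), empirical_pdf(p))
            elif distribution == "gaussian":
                intervals = intersection(gaussian_pdf(class_vector), gaussian_pdf(p))
            elif distribution == "min_max":
                intervals = [p.min(), p.max()]
            elif distribution == "average":
                m, std = p.mean(), p.std()
                intervals = [m - std, m + std]
            temp_class_vector = discretize(intervals)
            pattern_properties = properties(data_matrix, p, temp_class_vector)
        else:
            pattern_properties = properties(data_matrix, p, class_vector)
        statistics.append(objective_functions(pattern_properties))
    return statistics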
When analysing a subspace in the presence of a continuous output variable, DISA implements four different setups: 1) MinMax, where the cut-off points, [v1, v2], correspond to the minimum and maximum pattern-conditional outcomes, respectively; 2) Average, where v1 and v2 are the bounds formed by considering the standard deviation from the average, μ − σ and μ + σ, respectively, where μ (σ) is the average (standard deviation) of the pattern-conditional outcomes; 3) Gaussian, where we assume that both the output variable and the pattern-conditional outcomes follow a normal distribution. In this case, v1 and v2 correspond to the intersection points between the Gaussians, and the range [v1, v2] represents the most probable values of interest that the pattern discriminates. Fig 4 provides an in-depth example with three distinct patterns; 4) Empirical, where the outcome variable and the pattern-conditioned outcomes are assumed to follow their own empirical distributions, instead of a well-known theoretical continuous distribution. In this case, v1 and v2 might not be the only points of intersection: we assume there can be any number between one and n points of intersection. Fig 3 provides an example with four points of intersection, creating two intervals of interest. However, it is important to note that the number of intervals created is not directly determined by the number of points of intersection: when the relative frequency of the pattern-conditioned outcome starts, or finishes, above the relative frequency of the output variable, the number of intervals changes. Fig 5 presents three cases where this happens.
Consider a dataset with observations X = {x1, …, x9}, variables Y = {y1, …, y4, class}, and a set of three association rules. By intersecting the output variable pdf with the pdf of each rule’s consequent, we obtain the following intervals: a) [0.80, 1.70], b) [1.18, 1.61], and c) [1.02, 2.59]. With this, discriminative power statistics, such as lift, can be computed. In this case, the lift is equal to: a) 2.27, b) 3.03, and c) 2.27.
The blue line represents the outcome variable, and the yellow line the pattern-conditional outcome variable. (a) Intersection forms the interval [−inf, v1]. (b) Intersection forms the intervals [−inf, v1] and [v2, v3]. (c) Intersection forms the intervals [−inf, v1], [v2, v3] and [v4, inf].
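For the Gaussian setup described above, the intersection points can be obtained in closed form by equating the two normal densities and solving the resulting quadratic; the Python sketch below illustrates this with hypothetical means and standard deviations and is not DISA's implementation:
import numpy as np

def gaussian_intersections(mu1, sd1, mu2, sd2):
    # Equating the N(mu1, sd1) and N(mu2, sd2) pdfs and taking logarithms
    # yields a quadratic a*x^2 + b*x + c = 0 whose real roots are the
    # intersection points.
    a = 0.5 / sd2**2 - 0.5 / sd1**2
    b = mu1 / sd1**2 - mu2 / sd2**2
    c = 0.5 * (mu2**2 / sd2**2 - mu1**2 / sd1**2) - np.log(sd1 / sd2)
    roots = np.roots([a, b, c])
    return np.sort(roots[np.isreal(roots)].real)

# Hypothetical example: overall outcomes vs pattern-conditional outcomes.
v = gaussian_intersections(10.0, 6.0, 6.0, 3.0)   # approximately [-0.21, 9.54]
v1, v2 = v[0], v[-1]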
To calculate the intersection points in linear time, O(N), where N represents the number of observations in the output variable, DISA executes the following steps: i) calculate the relative frequency of each unique value for both the overall and pattern-conditioned outcomes, ii) element-wise subtraction between the arrays, iii) extract the element-wise indication of the sign of each number on the resulting array, iv) calculate the discrete difference along the sign vector (value at position i+1 minus value at position i), and finally v) find the indices of elements that are non-zero, grouped by element.
Consider a practical example where outputs = [1, 3, 4, 5, 7] and pattern-conditioned outputs = [3, 4, 5]. Accordingly, i) relative frequency conversion yields outputs = [0.2, 0.2, 0.2, 0.2, 0.2] and pattern-conditioned outputs = [0.0, 0.3(3), 0.3(3), 0.3(3), 0.0], ii) element-wise subtraction returns [0.2, −0.1(3), −0.1(3), −0.1(3), 0.2], iii) sign extraction returns [1, −1, −1, −1, 1], iv) differencing operation leads to [−2, 0, 0, 2], finally, v) the indices that are non-zero will produce the intersection points [v1 = 0, v2 = 3] that map to the original values of 1 and 5.
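The same steps can be reproduced with the NumPy sketch below over the worked example above (the per-value counting is written for readability rather than strict linear time):
import numpy as np

outputs = np.array([1, 3, 4, 5, 7])
pattern_outputs = np.array([3, 4, 5])

# i) relative frequency of each unique output value, for both distributions
values, counts = np.unique(outputs, return_counts=True)
overall_freq = counts / outputs.size
pattern_freq = np.array([(pattern_outputs == v).mean() for v in values])

# ii) element-wise subtraction between the two frequency arrays
diff = overall_freq - pattern_freq        # [0.2, -0.133, -0.133, -0.133, 0.2]

# iii) sign extraction, iv) discrete difference, v) non-zero indices
signs = np.sign(diff)                     # [ 1, -1, -1, -1,  1]
changes = np.diff(signs)                  # [-2,  0,  0,  2]
idx = np.nonzero(changes)[0]              # [0, 3]
intersections = values[idx]               # maps back to the values 1 and 5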
As previously mentioned, the intersection of empirical distributions can generate more than one interval of interest. By default, DISA considers all of the pattern-conditioned outcome intervals to compute the discriminative and informative properties of the pattern. However, if the discriminative power of the pattern is still below a minimum threshold (e.g., lift < 1.3), then DISA will start to disregard uninformative intervals. Starting from the lowest individually ranked (e.g., by lift), the intervals are disregarded one by one, until either all of them are removed (resetting to the default behavior) or the minimum discriminative power is satisfied.
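This fallback behaviour can be sketched as follows, where combined_lift is a placeholder callback that recomputes the pattern's lift over the union of the given intervals, and the per-interval lifts are assumed to be precomputed:
def prune_intervals(intervals_with_lift, combined_lift, min_lift=1.3):
    # intervals_with_lift: list of (interval, individual_lift) pairs.
    # Keep dropping the lowest individually ranked interval until the combined
    # lift reaches min_lift; if every interval is dropped, reset to the default
    # behaviour (all intervals are kept).
    kept = sorted(intervals_with_lift, key=lambda pair: pair[1], reverse=True)
    while kept and combined_lift([interval for interval, _ in kept]) < min_lift:
        kept.pop()
    if not kept:
        kept = list(intervals_with_lift)
    return [interval for interval, _ in kept]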
Software
The previously introduced methodology is made available as an open-source software package, DISA, developed in Python (v3.7). DISA is able to assess the discriminative properties of the inputted patterns in the presence of numeric or categorical outcome variables. A pipeline of the DISA package is illustrated in Fig 6. If DISA receives a numerical outcome, the outcome ranges that are likely to be discriminated by the observations supporting a given pattern are first determined. DISA accomplishes this by approximating two probability density functions (e.g. Gaussians), one for all the observed targets and the other for the targets of the pattern coverage. The intersection points between the two probability density functions are then computed to identify the range of values discriminated by the pattern. Second, DISA extends state-of-the-art statistics for assessing the informative and discriminative power of classic association rules. Currently, DISA supports 53 evaluation metrics in total. An illustrative subset of metrics is provided in Table 1 (the complete list is in DISA’s GitHub repository).
Input: multivariate data (optional); list of patterns; and outcome variable. Statistical calculus: a) discriminated ranges from pdf (probability density function) intersection points (numerical outcomes only) and b) pattern properties, and metrics (e.g. statistical significance, gini index, information gain). Output: list of metrics per pattern.
Three types are presented: support-based metrics, confidence-based, and lift-based.
Results
In order to illustrate DISA’s properties, we considered four public datasets taken from the literature: 1) Echocardiogram [41], monitoring physiological features of patients that suffered heart attacks at some point in time, where the task consists of extracting discriminative patterns of survivability after a heart attack; 2) Liver Disorders [42], a dataset of molecular features from blood tests which are thought to mark liver disorders that might arise from excessive alcohol consumption, where the task consists of extracting patterns that discriminate the number of drinks taken per day; 3) Breast Cancer Wisconsin (Diagnostic) [43], where each observation corresponds to the follow-ups of a breast cancer patient, variables concern cancer cell nuclei features from a digitized image of a fine needle aspirate, and the outcome is the number of months until cancer relapse; and 4) Dodecanol production [20, 21], a dataset that monitors the concentration of key enzymes observed in the two Design-Build-Test-Learn cycles of 1-dodecanol production (a medium-chain fatty alcohol used in detergents, pharmaceuticals and cosmetics) in Escherichia coli, with the outcome being the concentration of the targeted organic compound. The list of variables per dataset, as well as their meaning, is presented in Table 2.
Variables not presented in this table were removed due to redundancy/irrelevance.
We used the BicPAMS software [35] to extract patterns, with a particular focus on constant coherence on columns (pattern on rows). Regarding statistical significance, we did not filter out patterns exhibiting a p-value above 0.05 (patterns that might have occurred by chance). To allow the creation of larger patterns during the merging step, we allowed up to 30% noise within each pattern, which reduces the number of redundant patterns. Numeric input variables were categorised with the DI2 discretizer [25], using |L| = 3, |L| = 5, and |L| = 7 categories.
Broad characteristics of the extracted patterns are presented in Tables 3 and 4. A set of illustrative patterns for each dataset are displayed in Fig 7, with the respective properties in Table 5.
Each chart displays the Gaussian intersections between the outcome variable and the distribution of pattern-conditional outcomes. The blue line represents the Gaussian of the pattern outcome space, and the orange line represents the Gaussian of the original outcome space. All patterns displayed above have |L| = 7 categories (very low, low, medium-low, medium, medium-high, high, very high). Patterns: (a) = {medium contractility (epss), medium size of the heart at end-diastole (lvdd)}; (b) = {medium-high values of alkaline phosphatase (alkphos), very high values of alanine aminotransferase (sgpt), very high values of gamma-glutamyl transpeptidase (gammagt)}; (c) = {high standard deviation among the cells’ compactness, very high compactness, very high severity of concave portions of the contour, very high symmetry}; (d) = {very high values of A1U2T0, low values of A1U3L3}. (a) Echocardiogram pattern with intersections at −11.04 and 12.33. (b) Liver Disorders pattern with intersections at −0.94 and 2.04. (c) Breast Cancer Wisconsin (Diagnostic) pattern with intersections at 1.43 and 9.05. (d) Dodecanol pattern with intersections at −0.07 and 0.14.
Each row, from left to right, indicates the percentage of noise allowed, the number of categories for the continuous variables (coherence strength), the number of extracted patterns, the average number of columns in each pattern (and standard deviation), and the average number of rows in each pattern (and standard deviation).
A selective list of statistical measures is provided. For each measure, the average value obtained across the patterns per parameterization, as well as the standard deviation, are presented.
In this analysis, we compare the discriminative assessment produced by DISA over the range of discriminated outcomes against classic alternatives produced by discretizing the numerical outcome using DI2 [25], and against the standard MinMax and Average approaches (see Methods section), for the four datasets. The reference DISA values per function are presented in the last two rows.
Discussion
In this work, we proposed an approach for pattern evaluation in the presence of numerical outcomes. Below, we experimentally assess the results of the proposed methodology on four publicly available datasets.
Case study: Echocardiogram
A few discriminative patterns, yielding Support < 10%, were extracted from the Echocardiogram data, as shown in Table 3. In Table 4, we can see that the most prominent discriminative criteria for the found patterns were an average Lift ≥ 1.6 and StandardisedLift ≥ 0.75 for configuration |L| = 3. A pattern with a lift above 1 and a Standardised Lift in [0.7, 1] generally discriminates the subspace of values it forms. The analysis of the found patterns using DISA reveals that the majority of discoveries discriminate a low survivability range (see the GitHub repository https://github.com/JupitersMight/DISA/tree/main/Example for a detailed description of all patterns), including the pattern shown in Fig 7a. The patients supporting the pattern in Fig 7a exhibited moderate values of contractility and a moderate size of the heart at end-diastole. In this case, when the local survivability within the pattern intersects the overall survivability, it forms a span of time that is discriminative of patients who survive a maximum of 12 months. This pattern possesses a high discriminative power, the maximum achievable, with a Lift = 3.09 and StandardisedLift = 1, and it also yields a high χ2 statistic, meaning that the null hypothesis of independence between the pattern and the outcome should be rejected; larger χ2 values indicate stronger evidence of a relationship between the pattern and the outcome.
Case study: Liver disorders
The extracted patterns from this data source display a Support ≥ 10% in configuration |L| = 3, as shown in Table 3. As the cardinality of the input variables increases (higher |L|), the most salient discriminative criteria are a high χ2, Lift ≥ 1.7, StandardisedLift ≥ 0.80, and Stat.Significance ≤ 0.03 for configuration |L| = 7. In this context, a statistical significance lower than 0.05 means that the pattern’s probability of occurrence deviates from expectations. The careful analysis of the patterns using DISA revealed that a good portion of the discovered patterns discriminate a low drink intake per day (see the GitHub repository for a trace of all patterns). An example is shown in Fig 7b, where individuals with very high values of alanine aminotransferase and gamma-glutamyl transpeptidase generally drank up to 2 drinks per day. This pattern possesses a high discriminative power, the maximum achievable, with a Lift = 2.04 and StandardisedLift = 1. It also shows a strong dependence with the outcome, as given by its χ2 statistic, and statistical significance (p-value = 0.008).
Case study: Breast cancer wisconsin (Diagnostic)
A high number of patterns were extracted from this data source, as shown in Table 3. The patterns were in general borderline discriminative, with the most notable discriminative criteria being Lift ≥ 1.3 and Stat.Significance ≤ 0.04 for configuration |L| = 7. The careful analysis of the patterns using DISA revealed that a significant number of the found patterns discriminate a low time until cancer recurrence (images in GitHub). An example is shown in Fig 7c, where patients with short periods until cancer relapse show highly heterogeneous cell characteristics, a very high number of compact cells, and a high severity of concave portions of the contour. In this case, the dynamically inferred discriminative span of time until relapse is between 1.4 and 9 months. The pattern possesses a high discriminative power, the maximum achievable, with a Lift = 4.18 and StandardisedLift = 1. It also yields a high χ2 statistic, meaning that the pattern is strongly dependent on the given span of time for cancer reoccurrence, and statistical significance (p-value = 8.63 × 10−6).
Case study: Dodecanol
Finally, in the Dodecanol dataset, a moderate number of patterns were extracted for configuration |L| = 7, as shown in Table 3. The discriminative criteria are optimal across the found patterns from all configurations, e.g. a high χ2, Lift ≥ 1.5, StandardisedLift > 0.9, and Stat.Significance ≤ 0.04. The careful analysis of the patterns using DISA revealed that some of the discovered patterns discriminate a low production of dodecanol (images in GitHub). An example is shown in Fig 7d, where a reduced dodecanol production (maximum of 0.14 units) is discriminated by the presence of samples with a very high concentration of enzymes responsible for the catalysis of long-chain fatty acyl-CoA into a long-chain primary alcohol and a low concentration of enzymes responsible for oxidation-reduction (redox) reactions. The pattern possesses a moderate discriminative power with a Lift = 1.49 and StandardisedLift = 0.87. It further shows a strong relation with the outcome, as given by its χ2 statistic, and possesses high statistical significance (p-value = 7.23 × 10−6).
State-of-the-art comparison
To test DISA’s assessment of the patterns’ discriminative power, we considered an additional set of approaches: 1) the Classic approach, where the numerical outcome variable is discretized by applying DI2 [25] with |L| = 7 and the outcome is then interpreted as a class. In this case, DISA selects for each pattern the best fitting class ordered by lift. It is important to note that the outcome class is defined prior to the discovery of the pattern, without assumptions related to the subsequently mined pattern-conditioned outcomes; 2) the MinMax approach, which uses the minimum and maximum of the pattern-conditional outcomes; 3) the Average approach, which uses the average of the pattern-conditioned outcomes with bounds inferred using the observed standard deviation; and 4) the Empirical approach, where DISA considers the empirical distributions of both the continuous outcome variable and the pattern-conditioned outcome variable. These approaches are applied to each pattern illustrated in Fig 7. Table 5 contains the results of this analysis, and we discuss: i) the proposed Gaussian approach versus the Classic and standard approaches; and ii) the proposed Empirical approach versus all others.
When comparing the results of the Classic and Gaussian approaches in the Liver Disorders, Breast Cancer, and Dodecanol datasets, we observe that the Gaussian approach exhibits higher values of the χ2 function, whilst the Classic approach displays slight improvements in the lift function for most patterns. In spite of these results, the Classic approach fails to maximize the patterns’ potential to discriminate specific ranges of outcomes. This can be concluded by observing a considerable decrease in the values of Standardised Lift. In the case of Breast Cancer and Dodecanol, the Standardised Lift plummeted below 0.5 using the Classic approach. The Average approach also exhibits this failure to find intervals that maximize the patterns’ discriminative potential. These values confirm our initial hypothesis that the Classic and Average approaches form intervals that might not fully explore the discriminative profile of each pattern. The MinMax approach is able to fully accommodate noise and outlier pattern-conditioned outcomes, yet it creates intervals that can be too large or permissive, i.e. intervals that accommodate outcome ranges that are not discriminated by the pattern. The Gaussian approach is in theory more robust to this problem, an observation that is corroborated by the collected results.
Considering the selected Echocardiogram pattern, two intervals are formed by the Empirical approach. The first interval captures a low survivability range, [0.25, 0.75], whilst the second captures a higher survivability range, [11.0, 12.0]. If we observe the statistics of the other approaches that enclose either one or both of these intervals, we can conclude: 1) that the interval of low survivability provides discriminative properties, i.e. in the Classic approach the range [0.03, 0.59] partially encloses [0.25, 0.75]; and 2) that the approaches which consider the inclusion of higher values of survivability also display discriminative properties, i.e. a compact range that encloses both of the aforementioned intervals is observed in the MinMax and Gaussian approaches, [0.5, 12.0] and [−11.04, 12.33], respectively. Note, nevertheless, that the Empirical approach disregards the range of values between [0.75, 11.0]. Results from all datasets confirm an increase in the discriminative potential in most statistics for all patterns (e.g., Standardised Lift of 1). However, the Empirical approach is generally restrictive in the formation of pattern-conditioned outcome intervals and should be applied with care, e.g., complemented with the Gaussian approach to guarantee that the consequent of the target association rules includes all numerical ranges discriminated by a given pattern.
Conclusion
This work proposed a novel distribution-based method to rigorously assess association rules in the presence of numerical outcomes in the consequent, by inspecting the differences between the distribution of the numerical outcomes for all observations and those supporting a given pattern. This methodology allows for a dynamic and pattern-tailored approach to numerical outcomes as the patterns dictate how the discriminated ranges of values are statistically produced. The results further confirm the utility of the proposed methodology in dynamically producing pattern-tailored intervals, where discovered patterns from multiple domains exhibited maximum achievable discriminative power properties.
The methodology is implemented in DISA, an open-source Python package capable of robustly assessing the statistical significance and discriminative power of association rules in the presence of numerical and categorical outcomes. DISA implements over 50 metrics, which can be used as heuristics to guide the discovery process of discriminative patterns and subspace clusters in various data domains. We believe that DISA can be easily embedded and further extended for more complex patterns, such as patterns with multiple points of intersection, therefore aiding the scientific community in pattern-centric descriptive and predictive tasks, for instance the extraction of patterns in omic data able to discriminate numerical phenotypes, or of patterns in clinical data able to discriminate risk scales.
References
- 1. Liu X., Wu J., Gu F., Wang J. & He Z. Discriminative pattern mining and its applications in bioinformatics. Briefings In Bioinformatics. 16, 884–900 (2015). pmid:25433466
- 2. Busygin S., Prokopyev O. & Pardalos P. Biclustering in data mining. Computers & Operations Research. 35, 2964–2987 (2008).
- 3. Aggarwal C. Applications of frequent pattern mining. Frequent Pattern Mining. pp. 443–467 (2014).
- 4. Xie J., Ma A., Fennell A., Ma Q. & Zhao J. It is time to apply biclustering: a comprehensive review of biclustering applications in biological and biomedical data. Briefings In Bioinformatics. 20, 1450–1465 (2019). pmid:29490019
- 5. Saranya A., Kottursamy K., AlZubi A. & Bashir A. Analyzing fibrous tissue pattern in fibrous dysplasia bone images using deep R-CNN networks for segmentation. Soft Computing. pp. 1–15 (2021). pmid:34867079
- 6. Cheng X., Guo Z., Shen Y., Yu K. & Gao X. Knowledge and data-driven hybrid system for modeling fuzzy wastewater treatment process. Neural Computing And Applications. pp. 1–22 (2021).
- 7. Ben-Dor A., Chor B., Karp R. & Yakhini Z. Discovering local structure in gene expression data: the order-preserving submatrix problem. Proceedings Of The Sixth Annual International Conference On Computational Biology. pp. 49–57 (2002).
- 8. Maind A. & Raut S. Identifying condition specific key genes from basal-like breast cancer gene expression data. Computational Biology And Chemistry. 78 pp. 367–374 (2019). pmid:30655072
- 9. Babu M., Luscombe N., Aravind L., Gerstein M. & Teichmann S. Structure and evolution of transcriptional regulatory networks. Current Opinion In Structural Biology. 14, 283–291 (2004). pmid:15193307
- 10. Iskar M., Zeller G., Blattmann P., Campillos M., Kuhn M., Kaminska, et al. Characterization of drug-induced transcriptional modules: towards drug repositioning and functional understanding. Molecular Systems Biology. 9, 662 (2013). pmid:23632384
- 11. Fang G., Pandey G., Wang W., Gupta M., Steinbach M. & Kumar V. Mining low-support discriminative patterns from dense and high-dimensional data. IEEE Transactions On Knowledge And Data Engineering. 24, 279–294 (2010).
- 12. Alexandre L., Costa R., Santos L. & Henriques R. Mining pre-surgical patterns able to discriminate post-surgical outcomes in the oncological domain. IEEE Journal Of Biomedical And Health Informatics. (2021). pmid:33687853
- 13. Brin S., Motwani R. & Silverstein C. Beyond market baskets: Generalizing association rules to correlations. Proceedings Of The 1997 ACM SIGMOD International Conference On Management Of Data. pp. 265–276 (1997).
- 14. Henriques R. & Madeira S. BSig: evaluating the statistical significance of biclustering solutions. Data Mining And Knowledge Discovery. 32, 124–161 (2018).
- 15. Tan P., Kumar V. & Srivastava J. Selecting the right interestingness measure for association patterns. Proceedings Of The Eighth ACM SIGKDD International Conference On Knowledge Discovery And Data Mining. pp. 32–41 (2002).
- 16. McNicholas P., Murphy T. & O’Regan M. Standardising the lift of an association rule. Computational Statistics & Data Analysis. 52, 4712–4721 (2008).
- 17. Henriques R. & Madeira S. FleBiC: Learning classifiers from high-dimensional biomedical data using discriminative biclusters with non-constant patterns. Pattern Recognition. 115 pp. 107900 (2021).
- 18. Kianmehr K., Alshalalfa M. & Alhajj R. Fuzzy clustering-based discretization for gene expression classification. Knowledge And Information Systems. 24, 441–465 (2010).
- 19. Shih M., Jheng J., Lai L., et al. A two-step method for clustering mixed categorical and numeric data. Journal Of Applied Science And Engineering. 13, 11–19 (2010).
- 20. Radivojević T., Costello Z., Workman K. & Martin H. A machine learning Automated Recommendation Tool for synthetic biology. Nature Communications. 11, 1–14 (2020). pmid:32978379
- 21. Opgenorth P., Costello Z., Okada T., Goyal G., Chen Y., Gin J., et al. Lessons from two design–build–test–learn cycles of dodecanol production in Escherichia coli aided by machine learning. ACS Synthetic Biology. 8, 1337–1351 (2019). pmid:31072100
- 22. Webb G. Discovering associations with numeric variables. Proceedings Of The Seventh ACM SIGKDD International Conference On Knowledge Discovery And Data Mining. pp. 383–388 (2001).
- 23. Aumann Y. & Lindell Y. A statistical theory for quantitative association rules. Journal Of Intelligent Information Systems. 20, 255–283 (2003).
- 24. Garcia S., Luengo J., Sáez J., Lopez V. & Herrera F. A survey of discretization techniques: Taxonomy and empirical analysis in supervised learning. IEEE Transactions On Knowledge And Data Engineering. 25, 734–750 (2012).
- 25. Alexandre L., Costa R. & Henriques R. DI2: prior-free and multi-item discretization of biological data and its applications. BMC Bioinformatics. 22, 1–19 (2021). pmid:34496758
- 26. Guo Z., Yu K., Jolfaei A., Ding F. & Zhang N. Fuz-spam: label smoothing-based fuzzy detection of spammers in internet of things. IEEE Transactions On Fuzzy Systems. (2021).
- 27. Olson D. & Li Y. Mining Fuzzy Weighted Association Rules. 2007 40th Annual Hawaii International Conference On System Sciences (HICSS’07). pp. 53–53 (2007).
- 28. Hong T., Kuo C. & Chi S. A fuzzy data mining algorithm for quantitative values. 1999 Third International Conference On Knowledge-Based Intelligent Information Engineering Systems. Proceedings (Cat. No.99TH8410). pp. 480–483 (1999).
- 29. Ishibuchi H., Nakashima T. & Yamamoto T. Fuzzy association rules for handling continuous attributes. ISIE 2001. 2001 IEEE International Symposium On Industrial Electronics Proceedings (Cat. No.01TH8570). 1 pp. 118–121 vol.1 (2001).
- 30. Alatas B. & Akin E. Rough particle swarm optimization and its applications in data mining. Soft Computing. 12, 1205–1218 (2008).
- 31. Hahsler M., Chelluboina S., Hornik K. & Buchta C. The arules R-package ecosystem: analyzing interesting patterns from large transaction data sets. The Journal Of Machine Learning Research. 12 pp. 2021–2025 (2011).
- 32. Kaiser S., Santamaria R., Khamiakova T., Sill M., Theron R., Quintales L., et al. Package ‘biclust’. The Comprehensive R Archive Network. (2015).
- 33. Madeira S. & Oliveira A. Biclustering Algorithms for Biological Data Analysis: A Survey. IEEE/ACM Transactions On Computational Biology And Bioinformatics. 1, 24–45 (2004). pmid:17048406
- 34. Henriques R. & Madeira S. BicPAM: Pattern-based biclustering for biomedical data analysis. Algorithms For Molecular Biology. 9, 1–30 (2014). pmid:25649207
- 35. Henriques R., Ferreira F. & Madeira S. BicPAMS: software for biological data analysis with pattern-based biclustering. BMC Bioinformatics. 18, 82 (2017).
- 36. Agrawal R., Imielinski T., Swami A., et al. Association rules between sets of items in large databases. Proc. Of ACM SIGMOD Int. Conf. On Management Of Data, Washington. pp. 207–216 (1993).
- 37. Omiecinski E. Alternative interest measures for mining associations. IEEE Trans. Knowledge And Data Engineering. 15 pp. 57–69 (2003).
- 38. Kodratoff Y. Comparing machine learning and knowledge discovery in databases: An application to knowledge discovery in texts. Advanced Course On Artificial Intelligence. pp. 1–21 (1999).
- 39. Hahsler M. & Hornik K. New probabilistic interest measures for association rules. Intelligent Data Analysis. 11, 437–455 (2007).
- 40. Tan P., Kumar V. & Srivastava J. Selecting the right objective measure for association analysis. Information Systems. 29, 293–313 (2004).
- 41. UCI Machine Learning Repository. Echocardiogram (1989).
- 42. UCI Machine Learning Repository. Liver Disorders (1990).
- 43. Street W., Wolberg W. & Mangasarian O. Nuclear feature extraction for breast tumor diagnosis. Biomedical Image Processing And Biomedical Visualization. 1905 pp. 861–870 (1993).