DISA tool: Discriminative and informative subspace assessment with categorical and numerical outcomes

Pattern discovery and subspace clustering play a central role in the biological domain, supporting for instance putative regulatory module discovery from omics data for both descriptive and predictive ends. In the presence of target variables (e.g. phenotypes), regulatory patterns should further satisfy discriminative power properties, well established in the presence of categorical outcomes yet largely disregarded for numerical outcomes, such as risk profiles and quantitative phenotypes. DISA (Discriminative and Informative Subspace Assessment), a Python software package, is proposed to evaluate patterns in the presence of numerical outcomes using well-established measures together with a novel principle able to statistically assess the correlation gain of the subspace against the overall space. Results confirm that discriminative criteria can be soundly extended towards numerical outcomes without the drawbacks commonly associated with discretization procedures. Results from four case studies confirm the validity and relevance of the proposed methods, further unveiling critical directions for research in biotechnology and biomedicine. Availability: DISA is freely available at https://github.com/JupitersMight/DISA under the MIT license.


Introduction
The discovery of discriminative patterns has proven essential to support predictive and descriptive tasks [1][2][3][4][5][6]. More specifically in gene expression data, discriminative patterns play an essential role in discovering outcome-specific regulatory modules for knowledge acquisition, biomarking phenotypes of interest [7], or serving as the basis for drug targeting (e.g. cancer) after rigorous validation [8]. Discriminative pattern mining also plays a role in unraveling complex interactions in biological processes such as the condition-specific interplay among transcription factors in organisms [9]. In this context, patterns help map regulatory interactions, forming regulatory networks that provide vital information to better understand the evolution of genes, as well as unique regulatory cascades elicited in response to stimuli, disease progression or drug action [10]. These discriminative properties towards an outcome of interest can either be incorporated in the pattern discovery process [11,12], or assessed after extracting classic informative patterns. In both cases, one or multiple interestingness measures, such as confidence [13], statistical significance [14] (probability of pattern occurrence against expectations) and/or discriminative power views [15,16], are combined into pattern-centric models to aid medical decisions and study regulatory responses to events of interest [12,17].
Although it is crucial to incorporate these discriminative criteria in the discovery task, existing contributions are generally focused on nominal outcomes [18,19]. Nonetheless, many phenotypes of interest, such as molecular and physiological features, as well as risk scales or drug dosages, are quantitative variables in nature. In metabolic engineering, the levels of production and/or degradation of certain organic compounds are continuous outcomes of interest [20,21]. In such cases, to assess the ability of the underlying patterns to discriminate specific outcomes of interest, related work usually resorts to one of the three following approaches: 1) distribution-based methods [22,23], which explore properties of the distribution of continuous data, providing standard statistical measures on the distribution of the pattern-associated outcomes. In this context, Aumann and Lindell [23] consider measures such as the mean, with the possible alternatives of variance or median, to describe numerical distributions. An example of the aforementioned is an association rule like "sex = female → mean wage = $7.90 p/hr", where they guarantee the rule's discriminative properties by using classical measures, lift and confidence, and further ensure the validity of the outcome of interest by applying a Z-test; 2) discretization-based methods [24], which categorise the outcome variable in order to apply classic discriminative criteria in the discovery task. Well-known discretization methods categorise data based on frequency, user-inputted ranges, or more complex approaches such as the one proposed by Alexandre et al. [25], where numerical variables are fitted and categorised according to a continuous distribution.
While not the same as discretization, fuzzy-logic-based approaches can also be used in the presence of quantitative [26][27][28] and continuous variables [29] to extract informative patterns; and 3) optimization-based methods [30], which consider stochastic searches that follow the idea of natural selection and genetics (e.g., particle swarm optimization methods). Particles produced and modified along the evolution process are the targeted discriminative patterns, where both the pattern and the bounded range of relevant outcomes are optimized during the search [30]. While classic discriminative views are only prepared for nominal outcomes, these three classes of approaches are unable to establish an objective assessment of whether or not a given pattern can significantly discriminate a specific range of numerical outcomes.
To address these limitations, this work proposes a methodology to rigorously assess association rules with expressive patterns in the antecedent and numerical outcomes in the consequent, thus avoiding the discovery of spurious association rules (false positives). To this end, we introduce a novel distribution-based approach that inspects the differences between the distribution of a numerical outcome for all observations and for a given pattern. To the best of our knowledge, there are no software packages able to assess association rules with robustness in the presence of numerical outcomes [31,32]. Hence, we propose DISA (Discriminative and Informative Subspace Assessment), a software package in Python to assess patterns with numerical outputs by statistically testing the correlation gain of the pattern against the overall data, identifying discriminative ranges of numerical outcomes tailored to each pattern.

Background
Multivariate data can be structured in the form of a matrix A = (X, Y), with a set of observations X = {x_1, …, x_N}, variables Y = {y_1, …, y_M}, and elements a_ij ∈ ℝ observed for observation x_i and variable y_j. One way to extract patterns from this data structure is through the use of biclustering algorithms [33,34]. The biclustering task aims to identify a set of biclusters B = {B_1, …, B_k}, where each bicluster B = (I, J) is an n × m subspace (subset of observations I = {i_1, …, i_n} ⊆ X and subset of variables J = {j_1, …, j_m} ⊆ Y) that satisfies specific criteria:
• homogeneity: commonly guaranteed through the use of a merit function, such as the variance of the values in a bicluster [33], guiding the formation of biclusters in greedy, exhaustive, and stochastic/parametric searches and determining their coherence, quality and structure;
• statistical significance: in addition to homogeneity criteria, guarantees that the probability of a bicluster's occurrence (against a null model) deviates from expectations [14];
• dissimilarity: criteria further placed to guarantee the absence of redundant biclusters (number, shape, and positioning) [35].
The bicluster pattern φ_J is the set of expected values in the absence of adjustments and noise. A bicluster pattern is:
• constant overall if, for all i ∈ I and j ∈ J, a_ij = μ + η_ij, where μ is the typical value and η_ij is the observed noise;
• constant on columns, i.e. a pattern on rows, if a_ij = μ_j + η_ij, where μ_j represents the expected value in column y_j;
• additive if, for all i ∈ I and j ∈ J, a_ij = μ_j + γ_i + η_ij, where μ_j represents the expected value in column y_j and γ_i the adjustment for observation x_i;
• multiplicative if, for all i ∈ I and j ∈ J, a_ij = μ_j × γ_i + η_ij, where μ_j represents the expected value in column y_j and γ_i the adjustment for observation x_i;
• order-preserving on variables if there is a permutation of J under which the sequence of values in every row is strictly increasing; likewise, order-preserving on observations if there is a permutation of I under which the sequence of values in every column is strictly increasing.
The coverage of the bicluster pattern φ_J, defined as F(φ_J), is the number of observations containing the bicluster pattern φ_J. The same logic can be applied to a nominal outcome of interest c, where c can take any value in the class variable (e.g., y_out in Fig 1). The coverage of the outcome, defined as F(c), is the number of observations with the outcome of interest.
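To make the pattern types concrete, the sketch below generates toy biclusters for the constant, additive and multiplicative models; all values (the μ's, γ's and noise scale) are hypothetical and purely illustrative:

```python
import numpy as np

# Hypothetical 4x3 bicluster: I = 4 observations, J = 3 variables.
rng = np.random.default_rng(0)
mu = 3.0                                   # typical value (constant overall)
mu_j = np.array([2.0, 5.0, 8.0])           # expected value per column y_j
gamma_i = np.array([0.0, 1.0, 2.0, 3.0])   # per-observation adjustment
eta = rng.normal(scale=0.05, size=(4, 3))  # observed noise eta_ij

constant_overall = mu + eta                           # a_ij = mu + eta_ij
constant_on_columns = mu_j + eta                      # a_ij = mu_j + eta_ij
additive = mu_j + gamma_i[:, None] + eta              # a_ij = mu_j + gamma_i + eta_ij
multiplicative = mu_j * (1 + gamma_i)[:, None] + eta  # a_ij = mu_j * gamma_i + eta_ij
                                                      # (gamma shifted by 1 to avoid a zero row)
```

In the additive bicluster, consecutive rows differ by a near-constant offset; in the multiplicative one, by a near-constant ratio.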
Association rules describe a link between two events. An association rule is formed by two sides, the left-hand side (antecedent) and the right-hand side (consequent). In this case, an association rule can take the form φ_J → c, where a pattern in the antecedent discriminates an outcome of interest in the consequent. The coverage of the association rule, F(φ_J → c), is given by the number of observations where both the pattern φ_J and the outcome c co-occur.
Through the use of interestingness measures, an association rule can be assessed with respect to its interestingness, statistical significance, usefulness, information gain, discriminative power, amongst others [15]. Two well-established interestingness measures are the confidence, F(φ_J → c)/F(φ_J), measuring the probability of c occurring when φ_J occurs, and the lift, F(φ_J → c)/(F(φ_J) × F(c)) × N, which further considers the probability of the consequent to assess the dependence between the consequent and the antecedent.
A simple extension of the interestingness measures to accommodate continuous output variables is through the use of a numerical interval of interest. In the context of this extension, the coverage of the outcome F(c) can now be rewritten as F([v_1, v_2]), where v_1 represents the lower bound of the interval and v_2 the upper bound, and the coverage of the association rule as F(φ_J → [v_1, v_2]). Understandably, the outcomes conditioned to a pattern of interest can be described by a probability density function (pdf). In this context, mapping the outcomes into a simple numerical range is generally inadequate as the pdf of pattern-conditional outcomes is often non-uniform and its discriminative properties can only be determined against the remaining observations.
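These coverage-based measures with an interval consequent can be sketched as follows; the function name, the toy coverage mask, the outcome values and the bounds are all hypothetical:

```python
import numpy as np

def interval_rule_stats(pattern_rows, outcomes, v1, v2):
    """Confidence and lift of the rule phi_J -> [v1, v2].

    pattern_rows: boolean mask marking the observations covered by the pattern.
    outcomes:     numerical outcome value of each of the N observations.
    """
    pattern_rows = np.asarray(pattern_rows, dtype=bool)
    outcomes = np.asarray(outcomes, dtype=float)
    N = len(outcomes)

    in_interval = (outcomes >= v1) & (outcomes <= v2)
    F_pattern = pattern_rows.sum()                # F(phi_J)
    F_outcome = in_interval.sum()                 # F([v1, v2])
    F_rule = (pattern_rows & in_interval).sum()   # F(phi_J -> [v1, v2])

    confidence = F_rule / F_pattern
    lift = (F_rule * N) / (F_pattern * F_outcome)
    return confidence, lift

# Toy example: 6 of 10 observations support the pattern, and all of their
# outcomes fall within [1, 3] while no other observation's outcome does.
mask = [True] * 6 + [False] * 4
vals = [1, 2, 2, 3, 3, 3, 8, 9, 9, 10]
conf, lift = interval_rule_stats(mask, vals, 1, 3)
# confidence = 6/6 = 1.0; lift = (6 * 10) / (6 * 6) ≈ 1.67
```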

Proposed approach
The proposed methodology allows for a robust analysis of the discriminative properties of a pattern in the presence of numerical outcome variables without imposing predefined rigid boundaries. Given a pattern φ_J, we first compare the underlying distributions of the outcome variable of interest, z, for the overall observations, p(z|X), and for the pattern coverage, p(z|F(φ_J)), in order to extract the numeric ranges that compose the consequent, φ_J → ∪_i [v_1^(i), v_2^(i)]. Observations with the targeted pattern have a higher likelihood of having numerical outcomes in the extracted ranges. Both empirical and theoretical distributions are allowed for this calculus. If, instead of considering the underlying distributions to extract a range of values, we considered just the minimum and maximum values within the pattern, the likelihood of the target pattern having highly discriminative properties would be lessened due to: 1) the presence of outliers that make the interval more relaxed, and 2) the possibility of the interval being too rigid, excluding nearby values just outside the observed minimum and maximum.
To illustrate these concepts, Fig 2 provides an example with two theoretical distributions approximated from the overall and pattern-conditional targets, respectively. By estimating the relative frequency of each of the distributions, two points of intersection v 1 and v 2 can be calculated, composing an interval that can be potentially discriminated by observations with the given pattern.
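Under a Gaussian assumption for both distributions, the intersection points v_1 and v_2 can be obtained in closed form by equating the two log-densities, which yields a quadratic equation. A minimal sketch (the example means and standard deviations are hypothetical):

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def gaussian_intersections(mu1, s1, mu2, s2):
    """Points where the two normal pdfs take the same value.

    Equating the log-densities gives a*x^2 + b*x + c = 0.
    """
    a = 0.5 / s2 ** 2 - 0.5 / s1 ** 2
    b = mu1 / s1 ** 2 - mu2 / s2 ** 2
    c = 0.5 * mu2 ** 2 / s2 ** 2 - 0.5 * mu1 ** 2 / s1 ** 2 + np.log(s2 / s1)
    if abs(a) < 1e-12:              # equal variances: a single crossing point
        return [-c / b]
    return sorted(np.roots([a, b, c]).real)

# Hypothetical example: overall outcomes ~ N(5, 2),
# pattern-conditional outcomes ~ N(7, 1).
v1, v2 = gaussian_intersections(5, 2, 7, 1)
```

When the variances differ, two intersection points exist and the inner range [v_1, v_2] is where the pattern-conditional density dominates.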
Once ranges of outcomes of interest are identified, classic interestingness measures for association rules can be extended to handle these consequents. Considering the previously introduced lift function, a paradigmatic function to assess the discriminative power of an association rule, it can now be rewritten as lift(φ_J → [v_1, v_2]) = F(φ_J → [v_1, v_2])/(F(φ_J) × F([v_1, v_2])) × N. Note that the coverage of the outcome of interest is now defined by the interval created by the intersection points of the distributions. Instead of a predefined restrictive category range, intervals disclose outcomes of interest that are dynamically inferred for a given pattern in order to better assess its discriminative profile.
Consider now an example with two random empirical distributions originating more than two points of intersection, illustrated in Fig 3. In this example, observations with the selected pattern have a higher likelihood of showing outcomes in the two inferred ranges. With two intervals, [v_1, v_2] and [v_3, v_4], we can further assess the discriminative power of the pattern with regard to each interval, as well as to both intervals combined, φ_J → [v_1, v_2] ∪ [v_3, v_4].

PLOS ONE
Different discriminative criteria can be considered in the presence of consequents given by multiple ranges. Considering the lift as the illustrative case, the discriminated outcomes can be given either by the numerical interval, or combination of intervals, that maximises the lift function, which for the given example is argmax_{S ∈ {[v_1, v_2], [v_3, v_4], [v_1, v_2] ∪ [v_3, v_4]}} lift(φ_J → S), or by all numerical intervals whose lift satisfies a minimum threshold θ, {S : lift(φ_J → S) ≥ θ}. Both are valid options and allow for a robust analysis of the numerical outcome. The first approach retrieves the numerical interval, or combination of numerical intervals, with the highest discriminative power. The second filters out uninformative/non-discriminative numerical intervals, allowing for a more comprehensive analysis of each pattern.
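Both selection strategies can be sketched as follows; this is a toy illustration, not DISA's implementation, and `lift_of` stands in for any routine that scores a candidate consequent:

```python
from itertools import combinations

def best_consequent(intervals, lift_of, theta=None):
    """Pick the discriminated outcomes among candidate intervals.

    intervals: list of candidate (v_lo, v_hi) ranges inferred for the pattern.
    lift_of:   function mapping a tuple of intervals to its lift value.
    If theta is None, return the combination of intervals that maximises the
    lift; otherwise return every single interval whose lift reaches theta.
    """
    if theta is not None:
        # threshold criterion: keep all individually discriminative intervals
        return [iv for iv in intervals if lift_of((iv,)) >= theta]
    # maximisation criterion: enumerate all non-empty combinations
    candidates = [combo for r in range(1, len(intervals) + 1)
                  for combo in combinations(intervals, r)]
    return max(candidates, key=lift_of)
```

For the two-interval example above, the candidates are [v_1, v_2], [v_3, v_4] and their union, mirroring the argmax formulation.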
DISA implements the presented methodology, given in Algorithm 1. When analysing the subspace in the presence of a continuous output variable, DISA implements four different setups: 1) MinMax, where the cut-off points, [v_1, v_2], correspond to the minimum and maximum pattern-conditional outcomes, respectively; 2) Average, where v_1 and v_2 are the bounds formed by considering the standard deviation around the average, μ − σ and μ + σ, respectively, where μ (σ) is the average (standard deviation) of the pattern-conditional outcomes; 3) Gaussian, where we assume that both the output variable and the pattern-conditional outcomes follow a normal distribution. In this case, v_1 and v_2 correspond to the intersection points between the Gaussians, where the range [v_1, v_2] represents the most probable values of interest that the pattern discriminates. Fig 4 provides an in-depth example with three distinct patterns; 4) Empirical, where, instead of assuming that both the outcome variable and the pattern-conditioned outcomes follow a well-known theoretical continuous distribution, we assume they follow their own unique empirical distributions. In this case, v_1 and v_2 might not be the only points of intersection; there can be anywhere from one to n points of intersection. Fig 3 provides an example with four points of intersection, creating two intervals of interest. However, it is important to note that the number of intervals created is not necessarily correlated with the number of points of intersection. When the relative frequency of the pattern-conditioned outcome starts, or finishes, above the relative frequency of the output variable, the number of intervals changes.
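The first two setups reduce to simple statistics of the pattern-conditional outcomes; a minimal sketch (function names are illustrative, not DISA's API):

```python
import numpy as np

def minmax_bounds(pattern_outcomes):
    """MinMax setup: bounds are the extreme pattern-conditional outcomes."""
    return float(np.min(pattern_outcomes)), float(np.max(pattern_outcomes))

def average_bounds(pattern_outcomes):
    """Average setup: one standard deviation around the average, [mu-sigma, mu+sigma]."""
    mu = float(np.mean(pattern_outcomes))
    sigma = float(np.std(pattern_outcomes))
    return mu - sigma, mu + sigma
```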
In the Empirical setup, the points of intersection are computed as follows: i) compute the relative frequency of each unique value for both the overall and the pattern-conditioned outcomes, ii) perform an element-wise subtraction between the two arrays, iii) extract the element-wise indication of the sign of each number in the resulting array, iv) calculate the discrete difference along the sign vector (value at position i+1 minus value at position i), and finally v) find the indices of the elements that are non-zero, grouped by element.
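The five steps above can be sketched with NumPy as follows; the toy samples are hypothetical and this is an illustration, not DISA's exact code:

```python
import numpy as np

def empirical_crossings(overall, conditioned):
    """Indices (on the shared value grid) where the two relative-frequency
    curves cross, following the five steps described above."""
    grid = np.unique(np.concatenate([overall, conditioned]))
    # i) relative frequency of each unique value in both samples
    f_all = np.array([(overall == v).mean() for v in grid])
    f_pat = np.array([(conditioned == v).mean() for v in grid])
    diff = f_pat - f_all             # ii) element-wise subtraction
    signs = np.sign(diff)            # iii) sign of each difference
    steps = np.diff(signs)           # iv) discrete difference along the signs
    return np.nonzero(steps)[0]      # v) indices where the sign changes

# Toy samples: the pattern-conditioned curve rises above the overall curve
# at value 2 and falls back below it at value 3.
overall = np.array([1, 1, 2, 3, 3, 3])
conditioned = np.array([2, 2, 3])
crossings = empirical_crossings(overall, conditioned)
```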
As previously mentioned, the intersection of empirical distributions can generate more than one interval of interest. By default, DISA considers all of the pattern-conditioned outcome intervals to compute the discriminative and informative properties of the pattern. However, if the discriminative power of the pattern is still below a minimum threshold (e.g., lift < 1.3), then DISA starts to disregard uninformative intervals. Starting from the lowest individually ranked (e.g., by lift), the intervals are disregarded one by one, until either all of them are removed (resetting to the default behaviour) or the minimum discriminative power is satisfied.
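The pruning loop described above can be sketched as follows (a toy illustration with hypothetical lift scores, not DISA's implementation):

```python
def prune_intervals(intervals, lift_of, lift_of_set, min_lift=1.3):
    """Discard the weakest intervals until the remaining set is discriminative.

    lift_of:     lift of a single interval (used only for the ranking);
    lift_of_set: lift of the rule whose consequent is a set of intervals.
    """
    kept = sorted(intervals, key=lift_of)   # ascending individual lift
    while kept:
        if lift_of_set(kept) >= min_lift:
            return kept
        kept = kept[1:]                     # drop the lowest-ranked interval
    return list(intervals)                  # nothing satisfied: reset to default

# Toy usage with hypothetical scores: intervals 'a', 'b', 'c' stand in for
# numeric ranges; only the set {'c'} reaches the minimum lift of 1.3.
individual = {'a': 0.5, 'b': 1.0, 'c': 1.5}
combined = {frozenset('abc'): 1.0, frozenset('bc'): 1.2, frozenset('c'): 1.5}
result = prune_intervals(['a', 'b', 'c'], individual.__getitem__,
                         lambda s: combined[frozenset(s)])
# keeps only the strongest interval: ['c']
```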

Software
The previously introduced methodology is made available as an open-source software package, DISA, developed in Python (v3.7). DISA is able to assess the discriminative properties of the inputted patterns in the presence of numeric or categorical outcome variables. A pipeline of the DISA package is illustrated in Fig 6. If DISA receives a numerical outcome, the outcome ranges that are likely to be discriminated by the observations supporting a given pattern are first determined. DISA accomplishes this by approximating two probability density functions (e.g. Gaussians), one for all the observed targets and the other for the targets of the pattern coverage. The intersecting points between the two probability density functions are computed to identify the range of values discriminated by the pattern. Second, DISA extends state-of-the-art statistics for assessing the informative and discriminative power of classic association rules. Currently, DISA supports 53 evaluation metrics in total. An illustrative subset of metrics is provided in Table 1 (complete list in DISA's GitHub repository).

Results
In order to illustrate DISA's properties, we considered four public datasets taken from the literature: 1) Echocardiogram [41], monitoring physiological features of patients who suffered heart attacks at some point in time; the task consists of extracting discriminative patterns of survivability after a heart attack; 2) Liver Disorders [42], a dataset of molecular features from blood tests which are thought to mark liver disorders that might arise from excessive alcohol consumption; in this case the task consists of extracting patterns that discriminate the number of drinks ingested per day; 3) Breast Cancer Wisconsin (Diagnostic) [43], where each observation corresponds to the follow-up of a breast cancer patient, variables concern cancer cell nuclei features from a digitized image of a fine needle aspirate, and the outcome is the number of months until cancer relapse; and 4) Dodecanol production [20,21], a dataset that monitors the concentration of key enzymes observed in the two Design-Build-Test-Learn cycles of 1-dodecanol production (a medium-chain fatty alcohol used in detergents, pharmaceuticals and cosmetics) in Escherichia coli, with the outcome being the concentration of the targeted organic compound. The list of variables per dataset, as well as their meaning, is presented in Table 2.
We used the BicPAMS software [35] to extract patterns, with a particular focus on constant coherence on columns (pattern on rows). Regarding statistical significance, we did not filter patterns exhibiting a p-value above 0.05 (patterns that might have occurred by chance). To allow the creation of larger patterns during the merging step, we allow up to 30% noise within the pattern, which also reduces the number of redundant patterns. Numeric input variables are categorised with the DI2 discretizer [25], with |L| = 3, |L| = 5, and |L| = 7 categories.
Broad characteristics of the extracted patterns are presented in Tables 3 and 4. A set of illustrative patterns for each dataset is displayed in Fig 7, with the respective properties in Table 5.

Discussion
In this work, we proposed an approach for pattern evaluation in the presence of numerical outcomes. Below, we experimentally assess the results of the proposed methodology on four publicly available datasets.

Case study: Echocardiogram
A few discriminative patterns, yielding Support < 10%, were extracted from the Echocardiogram data, as shown in Table 3. In Table 4, we can see that the most prominent discriminative criteria for the found patterns were an average Lift ≈ 1.6 and StandardisedLift ≈ 0.75 for configuration |L| = 3. A pattern with a Lift above 1 and a StandardisedLift in [0.7, 1] generally discriminates the subspace of values it forms. The analysis of the found patterns using DISA reveals that the majority of discoveries discriminate a low survivability range (see the GitHub repository https://github.com/JupitersMight/DISA/tree/main/Example for a detailed description of all patterns), including the pattern shown in Fig 7a. The patients suffering a heart attack in the Fig 7a pattern exhibited moderate values of contractility and a moderate size of the heart at end-diastole. In this case, when the local survivability within the pattern intersects the overall survivability, it forms a span of time that is discriminative of patients who survive a maximum of 12 months. This pattern possesses a high discriminative power, the maximum achievable, with a Lift = 3.09 and StandardisedLift = 1, and it also yields χ² = 8.64. A χ² > 3.84 means that the null hypothesis of independence between the pattern and the outcome should be rejected.

Table 1. A small sample of metrics implemented in DISA. Three types are presented: support-based, confidence-based, and lift-based metrics.
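The χ² criterion referenced here is the standard Pearson test of independence on the 2×2 pattern-versus-outcome contingency table; a minimal sketch (not DISA's exact implementation, and the toy counts are hypothetical):

```python
def chi_square_2x2(f_rule, f_pattern, f_outcome, n):
    """Pearson chi-square for a 2x2 pattern-vs-outcome contingency table.

    f_rule:    observations with both the pattern and the outcome range;
    f_pattern: observations with the pattern; f_outcome: with the outcome;
    n:         total number of observations.
    """
    observed = [
        [f_rule, f_pattern - f_rule],
        [f_outcome - f_rule, n - f_pattern - f_outcome + f_rule],
    ]
    row = [sum(r) for r in observed]
    col = [sum(c) for c in zip(*observed)]
    # sum of (O - E)^2 / E over the four cells, with E = row * col / n
    return sum((observed[i][j] - row[i] * col[j] / n) ** 2
               / (row[i] * col[j] / n)
               for i in range(2) for j in range(2))
```

With one degree of freedom, a value above 3.84 rejects independence at the 0.05 significance level.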

Case study: Liver disorders
The extracted patterns from this data source display a Support ≈ 10% in configuration |L| = 3, as shown in Table 3. As the cardinality of the input variables increases (higher |L|), the most salient discriminative criteria are χ² ≈ 3.31, Lift ≈ 1.7, StandardisedLift ≈ 0.80, and Stat. Significance ≈ 0.03 for configuration |L| = 7. In this context, a statistical significance lower than 0.05 means that the pattern's probability of occurrence deviates from expectations. A careful analysis of the patterns using DISA revealed that a good portion of the discovered patterns discriminate a low drink intake per day (see the GitHub repository for a trace of all patterns). An example is shown in Fig 7b, where individuals with very high values of alanine aminotransferase and gamma-glutamyl transpeptidase generally drank up to 2 drinks per day. This pattern possesses a high discriminative power, the maximum achievable, with a Lift = 2.04 and StandardisedLift = 1. It also has a strong dependence with the outcome, with χ² = 5.28, and statistical significance (p-value = 0.008).

Case study: Breast cancer wisconsin (Diagnostic)
A high number of patterns were extracted from this data source, as shown in Table 3. The patterns in general were borderline discriminative, with the most notable discriminative criteria being Lift ≈ 1.3 and Stat. Significance ≈ 0.04 for configuration |L| = 7. A careful analysis of the patterns using DISA revealed that a significant number of the found patterns discriminate a low time until cancer recurrence (images in GitHub). An example is shown in Fig 7c, where patients with short periods until cancer relapse show highly heterogeneous cell characteristics, a very high number of compact cells and a high severity of concave portions of the contour. In this case, the dynamically inferred discriminative span of time until relapse is between 1.4 and 9 months. The pattern possesses a high discriminative power, the maximum achievable, with a Lift = 4.18 and StandardisedLift = 1. It also has a χ² = 10.21, meaning that the pattern is strongly dependent on the given span of time for cancer recurrence, and statistical significance (p-value = 8.63 × 10⁻⁶).

Case study: Dodecanol
Finally, in the Dodecanol dataset, for configuration |L| = 7, a moderate number of patterns were extracted, as shown in Table 3. The discriminative criteria are optimal across the found patterns from all configurations, e.g. χ² ≈ 16, Lift ≈ 1.5, StandardisedLift > 0.9, and Stat. Significance ≈ 0.04. A careful analysis of the patterns using DISA revealed that some of the discovered patterns discriminate a low production of dodecanol (images in GitHub). An example is shown in Fig 7d, where a reduced dodecanol production (maximum of 0.14 units) is discriminated by the presence of samples with a very high concentration of enzymes responsible for the catalysis of long-chain fatty acyl-CoA into a long-chain primary alcohol and a low concentration of enzymes responsible for oxidation-reduction (redox) reactions. The pattern possesses moderate discriminative power, with a Lift = 1.49 and StandardisedLift = 0.87. It further shows a strong relation with the outcome, χ² = 12.8, and possesses high statistical significance (p-value = 7.23 × 10⁻⁶).

Table 3. Configurations used in BicPAMS and best results from the four case studies. Each row, from left to right, indicates the percentage of noise allowed, the number of categories for the continuous variables (coherence strength), the number of extracted patterns, the average number of columns in each pattern (and standard deviation), and the average number of rows in each pattern (and standard deviation).

State-of-the-art comparison
To test DISA's assessment of the patterns' discriminative power, we considered an additional set of approaches: 1) Classic approach, where the numerical outcome variable is discretized by applying DI2 [25] with |L| = 7, and the outcome is then interpreted as a class. In this case, DISA selects for each pattern the best-fitting class ordered by lift. It is important to note that the outcome class is defined prior to the discovery of the pattern, without assumptions related to the subsequently mined pattern-conditioned outcomes; 2) MinMax approach, which uses the minimum and maximum of the pattern-conditional outcomes; 3) Average approach, which uses the average of the pattern-conditioned outcomes with bounds inferred using the observed standard deviation; and 4) Empirical approach, where DISA considers the empirical distributions of both the continuous outcome variable and the pattern-conditioned outcome variable. These approaches are applied to each pattern illustrated in Fig 7. Table 5 contains the results of this analysis, and we discuss: i) the proposed Gaussian approach versus the Classic and standard approaches; and ii) the proposed Empirical approach versus all others. When comparing the results of the Classic and Gaussian approaches in the Liver Disorders, Breast Cancer, and Dodecanol datasets, we observe that the Gaussian approach exhibits higher values in the χ² function, whilst the Classic displays slight improvements in the lift function for most patterns. In spite of these results, the Classic approach fails to maximize the patterns' potential to discriminate specific ranges of outcomes. This can be concluded by observing a considerable decrease in the values of StandardisedLift. In the case of Breast Cancer and Dodecanol, StandardisedLift plummeted below 0.5 using the Classic approach. The Average approach also exhibits this failure to find intervals that maximize the patterns' discriminative potential. These values confirm our initial hypothesis that the Classic and Average approaches form intervals that might not fully explore the discriminative profile of each pattern.
The MinMax approach is able to fully accommodate noise and outlier pattern-conditioned outcomes, yet it creates intervals that can be too large or permissive, i.e. intervals that accommodate outcome ranges that are not discriminated by the pattern. The Gaussian approach is in theory more robust to this problem, an observation that is corroborated by the collected results. Considering the selected Echocardiogram pattern, two intervals are formed by the Empirical approach. The first interval captures a low survivability range, [0.25, 0.75], whilst the second captures a higher survivability range, [11.0, 12.0]. If we observe the statistics of the other approaches that enclose either one or both of these intervals, we can conclude: 1) that the interval of low survivability provides discriminative properties, i.e. in the Classic approach the range [0.03, 0.59] partially encloses [0.25, 0.75]; and 2) that the approaches which consider the inclusion of higher values of survivability also display discriminative properties, i.e. a compact range that encloses both of the aforementioned intervals is observed in the MinMax and Gaussian approaches, [0.5, 12.0] and [−11.04, 12.33], respectively. Note, nevertheless, that the Empirical approach disregards the range of values between [0.75, 11.0]. Results from all datasets confirm an increase in the discriminative potential in most statistics for all patterns (e.g., StandardisedLift of 1).

Table 5. Values of objective interestingness functions for each pattern in Fig 7. In this analysis, we compare the discriminative assessment produced over the range of discriminated outcomes with DISA against classic alternatives produced by discretizing the numerical outcome using DI2 [25].
However, the Empirical approach is generally restrictive in the formation of pattern-conditioned outcome intervals, and should be applied with care, e.g., complemented with the Gaussian approach to guarantee that the consequent of the target association rules include all numerical ranges discriminated by a given pattern.

Conclusion
This work proposed a novel distribution-based method to rigorously assess association rules in the presence of numerical outcomes in the consequent, by inspecting the differences between the distribution of the numerical outcomes for all observations and those supporting a given pattern. This methodology allows for a dynamic and pattern-tailored approach to numerical outcomes as the patterns dictate how the discriminated ranges of values are statistically produced. The results further confirm the utility of the proposed methodology in dynamically producing pattern-tailored intervals, where discovered patterns from multiple domains exhibited maximum achievable discriminative power properties.
The methodology is implemented in DISA, an open-source Python package capable of robustly assessing the statistical significance and discriminative power of association rules in the presence of numerical and categorical outcomes. DISA implements over 50 metrics, heuristics that can be used to guide the discovery of discriminative patterns and subspace clusters in various data domains. We believe that DISA can be easily embedded and further extended for more complex patterns, such as those with multiple points of intersection, therefore aiding the scientific community in pattern-centric descriptive and predictive tasks, for instance, the extraction of patterns in omics data able to discriminate numerical phenotypes, or the extraction of patterns in clinical data able to discriminate risk scales.