Learning high-order interactions for polygenic risk prediction

Within the framework of precision medicine, the stratification of individual genetic susceptibility based on inherited DNA variation has paramount relevance. However, one of the most relevant pitfalls of traditional Polygenic Risk Score (PRS) approaches is their inability to model complex high-order non-linear SNP-SNP interactions and their effect on the phenotype (e.g. epistasis). Indeed, they face a computational challenge, as the number of possible interactions grows exponentially with the number of SNPs considered, which affects the statistical reliability of the model parameters as well. In this work, we address this issue by proposing a novel PRS approach, called High-order Interactions-aware Polygenic Risk Score (hiPRS), that incorporates high-order interactions in modeling polygenic risk. hiPRS combines an interaction search routine based on frequent itemset mining and a novel interaction selection algorithm based on Mutual Information, to construct a simple and interpretable weighted model of user-specified dimensionality that can predict a given binary phenotype. Compared to traditional PRS methods, hiPRS relies neither on GWAS summary statistics nor on any external information. Moreover, hiPRS differs from Machine Learning-based approaches that can include complex interactions in that it provides a readable and interpretable model and is able to control overfitting, even on small samples. In the present work we demonstrate, through a comprehensive simulation study, the superior performance of hiPRS w.r.t. state-of-the-art methods, both in terms of scoring performance and interpretability of the resulting model. We also test hiPRS against small sample sizes, class imbalance and the presence of noise, showcasing its robustness to extreme experimental settings.
Finally, we apply hiPRS to a case study on real data from the DACHS cohort, defining an interaction-aware scoring model to predict mortality of stage II-III colorectal cancer patients treated with oxaliplatin.


Introduction
Despite the evolution of statistical learning theory, Logistic Regression (LR) is still a commonly used statistical method in empirical studies in many research fields (Dreiseitl and Ohno-Machado, 2002). It is widely regarded as the model of choice for situations where the occurrence of a dichotomous outcome is to be predicted from one or more covariates (Boateng and Abaye, 2019). Examples can be found in medical research (Boateng and Abaye, 2019), educational research (Niu, 2020), public health research (Lemon et al., 2003), political sciences (Nicolau, 2007), economics (Sloane and Theodossiou, 1994; Zaghdoudi, 2013) and many others.

†Corresponding author: michelacarlotta.massi@polimi.it, Via Edoardo Bonardi, 9, 20133 Milan, Italy
Regardless of the context of application, it is often the case that the logit of the expected value of the dichotomous response variable cannot be explained solely by additive functions of the predictors. In other terms, when the function $f(x_1, x_2)$ cannot be expressed as $g(x_1) + h(x_2)$ for some functions $g$ and $h$, we say that there is an interaction in $f$ between $x_1$ and $x_2$.
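As a minimal illustration of this definition (an example of ours, not taken from the paper), consider the XOR function on binary inputs: no additive decomposition $g(x_1) + h(x_2)$ can reproduce it, which can be checked with a second-order finite difference.

```python
# The XOR function exhibits an interaction between x1 and x2:
# no decomposition g(x1) + h(x2) exists.
def xor(x1, x2):
    return x1 ^ x2

# For binary inputs, f is additive if and only if its second-order
# finite difference f(1,1) - f(1,0) - f(0,1) + f(0,0) vanishes.
interaction_effect = xor(1, 1) - xor(1, 0) - xor(0, 1) + xor(0, 0)
print(interaction_effect)  # -2: nonzero, so an interaction is present
```

The same check applied to any genuinely additive function returns exactly zero.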
Many of the aforementioned fields of research share the need for introducing these interaction effects between their predictors to better infer on, and characterize, the outcome. For instance, in Genome-Wide Association Studies (GWAS) there is increasing awareness that epistasis, or gene-gene interaction, plays a role in susceptibility to common human diseases (Moore, 2003; Onay et al., 2006). It has been argued (Moore, 2003) that epistasis is a ubiquitous component of the genetic architecture of common human diseases and that complex interactions are more important than the independent main effects of any single susceptibility gene.
This calls for novel statistical approaches to identify the complex non-additive relationships between multiple variables to be included in a predictive model, so as to fully capture the underlying relationship with the target. However, when dealing with real-world applications of LR and interaction search, such as the one described above, several additional considerations must be taken into account.
First of all, nowadays datasets are growing wider, with more and more variables, but the sample size may vary from extremely large to very limited. Linear models scale relatively well when handled via standard software, up to thousands of features together (Lim and Hastie, 2015). However, scalability becomes an issue when dealing with interactions. Indeed, in a model with intercept and $p$ numerical predictors, once interactions of any order are allowed, the total number of terms is given by $\sum_{k=0}^{p} \binom{p}{k} = 2^p$. This exponential growth gets even steeper in the presence of categorical covariates, as each one of the features' levels has to be considered. Nevertheless, dealing with a huge multitude of categorical covariates is getting more frequent, for instance in the aforementioned GWAS studies, where interactions between millions of Single Nucleotide Polymorphisms (SNPs), each with three possible levels, may affect even extremely rare traits in the patients' phenotype (Llinares-López et al., 2019; Ceddia et al., 2020). This makes the inclusion of higher order interactions technically intractable, and causes the problem of finding interactions to fall in the framework of "large p small n" problems, where the number of features significantly outweighs the number of observations and the curse of dimensionality affects the reliability of statistical estimates.
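The combinatorial explosion described above can be made concrete with a few lines of Python (a sketch of ours; the categorical count anticipates the $(m_k + 1)$-choices argument formalized in Section 2):

```python
from math import comb, prod

def n_terms_numerical(p):
    # intercept + main effects + all interactions of any order = 2**p
    return sum(comb(p, k) for k in range(p + 1))

def n_terms_categorical(levels):
    # with categorical covariates, each one contributes (m_k + 1) choices:
    # either skip it, or fix one of its m_k levels
    return prod(m + 1 for m in levels)

print(n_terms_numerical(20))        # 2**20 = 1_048_576 terms
print(n_terms_categorical([3]*20))  # 20 SNPs with 3 levels: 4**20, ~1.1e12
```

Already at a few dozen covariates, exhaustive enumeration is hopeless, which motivates the targeted search developed later in the paper.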
In particular, when fitting LR models with no prior knowledge regarding the parameters, the traditional choice is Maximum Likelihood Estimation (MLE). However, a recent contribution from Sur and Candès (2019) shows how, as p grows w.r.t. n, estimates become systematically biased, in the sense that effect magnitudes are overestimated, they are far more variable than predicted by classical theory (such as the inverse Fisher information), and inference measures, e.g. p-values, are unreliable especially at small values. This makes both prediction and interpretation of results extremely questionable and hinders the validity of the analysis.
When data exhibit complex dependencies, as well as high dimensionality and small sample sizes, researchers are often also forced to tackle the issue of classifying observations in the presence of class imbalance. This latter issue requires a further tailoring of the classification models, which would otherwise incur the risk of losing predictive power on the underrepresented class. This fact becomes a remarkable concern in application areas, such as healthcare or insurance fraud, where identifying the underrepresented class correctly is the most - if not the only - relevant matter.
For all these reasons, a new class of statistical methods needs to be designed in order to address the aforementioned complexities and provide LR models that can be effectively applied in most research domains. Our present work tackles these needs and focuses on developing a novel approach to Interaction Learning specifically designed to: (i) identify a restricted number of the most relevant interactions of arbitrarily high order among categorical features; (ii) optimize the selection of interactions for separating the classes in cases of strong imbalance; (iii) allow the user to keep the number of interactions to be included in the model under control, in order to foster significance and interpretability of the resulting LR. All these aspects were considered in the design of the proposed algorithm, while guaranteeing (iv) scalability, tractable computational times, and robustness to varying dimensionality, sample size and imbalance ratio.

Interaction Learning Approaches and Our Contribution
Discovering interactions is an active area of research (Lim and Hastie, 2015). However, when discussing related approaches in the multifaceted scenario our proposal deals with, it is important to make a distinction between two fairly independent lines of research that address different aspects of the subject matter. The first one deals specifically with categorical features (even though some studies try to expand their applicability to continuous features) and is devoted to finding lists of high-order feature interactions associated with the outcome. One example can be found in Shah and Meinshausen (2014). Research groups devoted to significant (discriminative) pattern mining expand this concept by taking into account the statistical significance of the association, controlling the Family-Wise Error Rate (FWER), i.e. the probability of detecting false positive patterns (Llinares-López et al., 2015; Papaxanthos et al., 2016; Pellegrina and Vandin, 2018; Sugiyama and Borgwardt, 2019). These works have found a broad range of applications in some of the domains we mentioned beforehand, such as statistical genetics or healthcare (Llinares-López et al., 2019; Ceddia et al., 2020). However, their final objective is that of identifying all significantly associated interactions, without specific requirements on the number of selected patterns, nor on their predictive power and tractability once fed to a classification model such as LR.
Conversely, the other related research line works with both categorical and continuous features, and focuses more on identifying the interactions to be used within non-linear models and Generalized Linear Models (GLM). Most of these methods rely on different regularization strategies to shrink models to the most useful primary effects and interaction terms (Radchenko and James, 2010; Rosasco et al., 2010; Bien et al., 2013; Lim and Hastie, 2015); a possible alternative falling into this class is to use a stochastic search algorithm for variable selection in Bayesian LR, as proposed by Chen et al. (2011). When selecting among very large numbers of features, one drawback of penalized models is the difficulty of directly controlling the number of terms introduced in the final classification model, despite their undeniable potential in producing powerful classifiers. Especially when dealing with high-order categorical interaction terms, even in the presence of strong penalizations, the actual number of terms in a LR might easily explode. As previously mentioned, the growing number of predictors may lead to an inflation of the estimates, hindering the attempt to draw significant inference from the resulting model, besides reducing the interpretability of the results.
Our contribution lies at the intersection of these two research lines. Indeed, we first develop a targeted high-order interaction search algorithm which, from a set of categorical features, produces a list of useful interactions associated with the outcome and ranks them on the basis of their Odds Ratio (OR). Then, we introduce a novel feature selection method designed to pick from that list only a predefined number of interaction terms to be included in a LR model that is finally capable of discriminating well between the two classes (even in cases of small sample size and imbalanced settings), as well as yielding both interpretable results and reliable statistical inference. The algorithms have been developed in Python 3 (Van Rossum and Drake, 2009); the related package is available upon request. This paper is structured as follows. In Section 2 we detail our methodology, starting from the main algorithm, LIPS, and then discussing a few possible alternatives. In Section 3 we validate our design choices for the proposed algorithms and provide empirical evidence that both LIPS and its variants are able to perform extremely well in binary classification tasks with categorical covariates, even under conditions of strong imbalance and low sample size. In Section 4, by analyzing both simulated and real data, we compare our approaches with glinternet (Lim and Hastie, 2015), a state-of-the-art algorithm for interaction learning in contexts of LR. There we show that LIPS and its variants can perform comparably well, with the additional advantage of returning models that are much more interpretable. In Section 5 we draw some conclusions and discuss possible future developments. Further algorithmic details and mathematical insights can be found in the Appendix.

High-order interaction learning via targeted pattern search
In this section we formally introduce LIPS, the proposed Learning high-order Interactions via targeted Pattern Search algorithm.
Note that the method has been developed to handle interactions among categorical covariates. Therefore, we will first provide some context and notation to describe the specific setting we are dealing with and its intrinsic peculiarities. Then, we will detail the two main steps of the algorithm, i.e. the preliminary targeted pattern search and the subsequent novel dissimilarity-based interaction selection method, which together define the Interaction Learning core of LIPS.
Because of our interest in developing a methodology broadly applicable to most real-life research domains, we include in this work two variants of the main algorithm, namely Scores LIPS and Clusters LIPS. As further detailed in the last part of this section, these two alternative strategies are meant to satisfy specific user and data requirements, while also improving the interpretability of the model.

Notation
Let us first provide some background and introduce the notation needed to follow the details of our methodology. We are given $p$ categorical random variables $X_1, \ldots, X_p$, where each $X_j$ takes values in some set of labels $\{l^{(j)}_1, \ldots, l^{(j)}_{m_j}\}$; note that the number of levels $m_j$ depends on $j$, i.e. the categorical variables may have different supports.
First of all, we define the so-called dummy variables $X^i_j := \mathbb{1}\{X_j = l^{(j)}_i\}$: each $X^i_j$ is $\{0,1\}$-valued and equals 1 if and only if $X_j$ attains the $i$th level. Next, we introduce the collection of all possible interaction terms among the covariates $X_1, \ldots, X_p$ as $$\mathcal{I} := \Big\{ \prod_{j \in J} X^{i_j}_j \;:\; J \subseteq \{1, \ldots, p\},\; i_j \in \{1, \ldots, m_j\} \Big\},$$ so that each interaction is a binary r.v. that encodes a certain combination of levels. We adopt the standard convention by which the order of an interaction corresponds to the number of terms therein minus one; for instance, $X^0_1 X^1_3$ defines a first order interaction. With little abuse of notation we let $1 \in \mathcal{I}$ be the empty interaction ($J = \emptyset$) and we allow for 0-order interactions, meaning that all dummies also fall in $\mathcal{I}$.
It is straightforward to see that $|\mathcal{I}| = \prod_{k=1}^{p} (m_k + 1)$: indeed, when defining an interaction each $X_k$ leaves us with exactly $(m_k + 1)$ choices, as we may either discard that variable or include one of its corresponding $m_k$ dummies.
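The counting argument above can be verified by direct enumeration on a toy example (an illustration of ours; the variable names and level sets are hypothetical):

```python
from itertools import product

# Three toy categorical covariates with m = (2, 3, 2) levels each.
levels = {"X1": [0, 1], "X2": [0, 1, 2], "X3": [0, 1]}

# An interaction is built by choosing, for each variable, either to skip it
# (None) or to fix one of its levels; we encode it as a {variable: level} dict.
interactions = [
    {v: l for v, l in zip(levels, choice) if l is not None}
    for choice in product(*[[None] + ls for ls in levels.values()])
]

print(len(interactions))  # prod(m_k + 1) = 3 * 4 * 3 = 36, incl. the empty one
```

The empty dict plays the role of the empty interaction $1 \in \mathcal{I}$, and singletons correspond to the 0-order interactions (the dummies).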
For better readability, let us introduce a few more preliminary definitions: (a) for an interaction $T \in \mathcal{I}$ we write $|T|$ to denote the number of dummy variables involved in the product representation of $T$; (b) given two interactions $T, S \in \mathcal{I}$, we say that $T$ is a subinteraction of $S$ if all its dummies appear in the product expansion of $S$; equivalently, $S = T \cdot Z$ for some $Z \in \mathcal{I}$; (c) for $T, S \in \mathcal{I}$ we define their maximum common divisor, $MCD(T, S)$, as the highest order interaction $Z \in \mathcal{I}$ that is a subinteraction of both $T$ and $S$; (d) we say that two interactions $T$ and $S$ are incompatible, and we write $T \perp S$, if they include two different levels of the same random variable, so that $T \cdot S$ is identically zero.
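Definitions (a)-(d) translate directly into code when interactions are represented as {variable: level} maps (a sketch of ours, with hypothetical example patterns):

```python
def size(T):
    # |T|: number of dummies in the product representation of T
    return len(T)

def is_sub(T, S):
    # T is a subinteraction of S if all its dummies appear in S
    return all(S.get(v) == l for v, l in T.items())

def mcd(T, S):
    # maximum common divisor: the dummies shared by T and S
    return {v: l for v, l in T.items() if S.get(v) == l}

def incompatible(T, S):
    # T ⊥ S: the two patterns fix different levels of the same variable
    return any(v in S and S[v] != l for v, l in T.items())

T = {"X1": 0, "X3": 1}
S = {"X1": 0, "X2": 2, "X3": 1}
print(is_sub(T, S), mcd(T, S), incompatible(T, {"X1": 1}))
```

Note that `mcd` of two patterns sharing no variable is the empty dict, i.e. the empty interaction, consistently with the conventions above.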

Interaction Learning in the LIPS algorithm
We are given a dataset $D = \{(x_{1,i}, \ldots, x_{p,i}, y_i)\}_{i=1}^{n}$ consisting of $n$ i.i.d. observations of $p$ categorical covariates $X_1, \ldots, X_p$ and a binary outcome $Y$. We assume to be in an imbalanced setting with respect to $Y$, that is, we suppose the classes $O = \{i \mid y_i = 1\}$ and $Z = \{i \mid y_i = 0\}$ to come in remarkably different proportions. Without loss of generality, we consider the case of $O$ being the minority class, which reflects the situation $P(Y = 1) \ll P(Y = 0)$. Our first purpose is to deduce from the data a suitable subcollection of interactions that is informative w.r.t. $Y$ but small in size. To achieve this goal, we perform two steps: identification of all the candidate interactions through a targeted pattern search; selection of the top $K$ most relevant interactions via our novel dissimilarity-based interaction selection method, where $K$ is user-specified and typically satisfies $K \ll |\mathcal{I}|$.

Targeted pattern search
In principle, starting from the data in $D$, it is very easy to compute the sample values of any given $T \in \mathcal{I}$. However, as in general the cardinality of $\mathcal{I}$ is huge, it is practically impossible to study each conceivable interaction term alone; it is this very drawback whence the need of finding a preliminary subset of candidates $\hat{\mathcal{I}} \subset \mathcal{I}$ arises.
To address this task, we propose to focus only on those interactions that are different from zero within the minority class $O$ with an empirical frequency higher than a fixed threshold $supp_{min} > 0$.
This choice, which we further discuss in Section 3.3 and to which we will also refer as One-Class Learning (OCL), is mainly driven by the need of counteracting the imbalance between the two classes, but it also has a few advantages in terms of computational cost.
In what follows, for an interaction $T \in \mathcal{I}$ we denote by $t_i$ its value corresponding to the $i$th statistical unit in the dataset. As mentioned above, we define $$\hat{\mathcal{I}} := \Big\{ T \in \mathcal{I} \;:\; \frac{1}{|O|} \sum_{i \in O} t_i \geq supp_{min} \Big\}.$$ Since it is not feasible to compute such frequencies for all $T \in \mathcal{I}$, to determine $\hat{\mathcal{I}}$ we can rely on the duality between the concepts of interactions and itemsets. The latter, which are also referred to as patterns, are typically found in problems of association rule learning such as market basket analysis (Borgelt, 2012), intrusion detection (Bekti et al., 2017) and others. There, one is given a set of possible items and a list of transactions, which are nothing but sets of items. The goal is then to mine the list of such transactions in order to find which subsets of items (itemsets) occur with a sufficient frequency (also called support).
In our framework, we observe that there is a one-to-one correspondence between itemsets and interactions, as the latter can be thought of as itemsets of dummy variables. In fact, if we interpret dummy variables as items, then the observations $(x_{1,i}, \ldots, x_{p,i})$ are nothing but transactions (where an item is picked if and only if its corresponding dummy equals 1); similarly, a certain pattern of dummies is present in a transaction if and only if the corresponding interaction term is nonzero.
This property is very useful as it allows us to take advantage of the wide variety of frequent itemset mining algorithms, which are becoming more and more efficient (Borgelt, 2012; Shah and Meinshausen, 2014). In particular, we may reformulate the problem of computing $\hat{\mathcal{I}}$ as that of finding a list of patterns that appear with a frequency greater than $supp_{min}$ (minimum support) within the minority class. Many possible implementations, such as the Apriori algorithm (Agrawal and Srikant, 1994), which we will later make use of, are available for our purpose, and it can be convenient to test different ones depending on the data.
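To make the interaction-itemset duality concrete, the following is a toy Apriori-style miner in pure Python (a didactic sketch of ours, not the implementation used in the paper; the example transactions, encoding dummies as strings like "X1=0", are hypothetical):

```python
from itertools import combinations

def apriori(transactions, supp_min):
    """Toy Apriori sketch: return frequent itemsets (as frozensets of items)
    with support >= supp_min among the given transactions (sets of items)."""
    n = len(transactions)
    freq = {}
    current = {frozenset([i]) for t in transactions for i in t}
    while current:
        counts = {c: sum(c <= t for t in transactions) for c in current}
        kept = {c: cnt / n for c, cnt in counts.items() if cnt / n >= supp_min}
        freq.update(kept)
        # candidate generation: join frequent k-itemsets into (k+1)-itemsets
        current = {a | b for a, b in combinations(kept, 2)
                   if len(a | b) == len(a) + 1}
    return freq

# Minority-class observations viewed as transactions of active dummies.
minority = [{"X1=0", "X3=1"}, {"X1=0", "X3=1", "X2=2"}, {"X1=1"}]
print(apriori(minority, supp_min=0.5))
```

Each frequent itemset returned here corresponds to a candidate interaction in $\hat{\mathcal{I}}$, with its empirical support within the minority class.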
After having constructed the candidate set $\hat{\mathcal{I}}$, and preliminary to the next phase, we build for each $T \in \hat{\mathcal{I}}$ the contingency table of $(T, Y)$ along the whole dataset $D$. This task requires, at least, computing the corresponding frequencies in the majority class, $\frac{1}{|Z|} \sum_{i \in Z} t_i$. We point out that, despite the possibly high cardinality of $\hat{\mathcal{I}}$, this step can be performed very efficiently: for more technical details, refer to the Appendix, Section A.1.

A dissimilarity measure based feature selection algorithm
Knowing the contingency table of $(T, Y)$ for every $T \in \hat{\mathcal{I}}$ allows us to associate to each candidate interaction an odds-ratio: $$OR_T := \frac{n_{11}\, n_{00}}{n_{01}\, n_{10}},$$ where $n_{ty}$ denotes the number of observations in $D$ for which $T = t$ and $Y = y$.
We use this statistic to rank the candidate interactions in $\hat{\mathcal{I}}$: more precisely, we sort the patterns in descending order according to the quantity $|\log OR_T|$, so that reciprocal odds-ratios are treated equally.
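The ranking step can be sketched as follows (an illustration of ours; the contingency tables, given as tuples $(n_{11}, n_{01}, n_{10}, n_{00})$, are made-up numbers):

```python
from math import log

def odds_ratio(n11, n01, n10, n00):
    # n_ty = number of observations with T = t and Y = y
    return (n11 * n00) / (n01 * n10)

def rank_candidates(tables):
    # sort in descending |log OR|, so that reciprocal odds-ratios rank equally
    return sorted(tables,
                  key=lambda T: abs(log(odds_ratio(*tables[T]))),
                  reverse=True)

tables = {"T1": (30, 20, 10, 40),   # OR = 6
          "T2": (5, 45, 40, 10),    # OR ≈ 0.028 (strongly protective)
          "T3": (25, 25, 25, 25)}   # OR = 1 (uninformative)
print(rank_candidates(tables))      # T2 first: |log 0.028| > |log 6| > 0
```

Note how T2, despite its odds-ratio being below 1, outranks T1: the absolute log odds-ratio treats risk and protection patterns symmetrically.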
Next, given a fixed $K \in \mathbb{N}$, we propose a way to extract from the sorted list of candidates $\hat{\mathcal{I}}$ a subset of (at most) $K$ interactions that are suitable to perform inference on $Y$. Naively, one may think of selecting the first $K$ patterns in the list. However, this approach may present several drawbacks (cfr. Section 3.3): due to the combinatorial nature of high-order interactions, such a strategy could easily result in a list of nested patterns (subinteractions), thus carrying redundant information; if $K$ is small, this may lead to overfitting-like phenomena, as the feature space is not explored sufficiently.
To overcome this drawback, we introduce a dissimilarity measure $d : \mathcal{I} \times \mathcal{I} \to [0, +\infty)$, $$d(T, S) := 1 - \frac{|MCD(T, S)| - \mathbb{1}\{T \perp S\}}{|T| \vee |S|},$$ where $x \vee y := \max\{x, y\}$. This allows us to compare two different patterns.
By definition, $d$ returns higher dissimilarities for patterns that involve completely different variables and for those that are incompatible. Because of these two properties, we may consider $d$ a suitable measure for exploring different regions of the feature space; see Section B in the Appendix for further insights on the definition of $d$.
By making use of the dissimilarity measure $d$, we extract $K$ patterns from the sorted list $\hat{\mathcal{I}}$ according to the following iterative procedure:
1. We remove from $\hat{\mathcal{I}}$ the first interaction $T$, i.e. the one that maximizes $|\log OR_T|$, and place it in a new list $\mathcal{I}_K$;
2. We search for those interactions in $\hat{\mathcal{I}}$ that are the most dissimilar from the ones in $\mathcal{I}_K$, that is, we determine $\mathrm{argmax}_{T \in \hat{\mathcal{I}}} \min_{S \in \mathcal{I}_K} d(T, S)$. Among these, we select the one with highest rank and move it from $\hat{\mathcal{I}}$ to $\mathcal{I}_K$;
3. We repeat step (2) until $\mathcal{I}_K$ contains $K$ interactions.
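The greedy maximin selection can be sketched in a few lines (a sketch of ours: the dissimilarity below is a plausible form consistent with the stated properties of $d$, whose exact definition is in the paper's Appendix B; the example patterns are hypothetical):

```python
def dissim(T, S):
    """Assumed form of the dissimilarity d: shared dummies (the MCD) lower it,
    incompatibility raises it above 1; see Appendix B for the exact measure."""
    mcd = sum(1 for v, l in T.items() if S.get(v) == l)
    incomp = any(v in S and S[v] != l for v, l in T.items())
    return 1.0 - (mcd - incomp) / max(len(T), len(S))

def select_top_k(ranked, K):
    """Greedy maximin selection from a rank-sorted list of interactions."""
    chosen, pool = [ranked[0]], ranked[1:]
    while pool and len(chosen) < K:
        score = lambda T: min(dissim(T, S) for S in chosen)
        best = max(score(T) for T in pool)
        # ties are broken by rank: pool is sorted, take the first maximizer
        pick = next(T for T in pool if score(T) == best)
        chosen.append(pick)
        pool.remove(pick)
    return chosen

ranked = [{"X1": 0, "X3": 1}, {"X1": 0}, {"X2": 2}, {"X1": 1, "X2": 0}]
print(select_top_k(ranked, K=2))
```

On this toy list, the second pick is the pattern incompatible with the top-ranked one, rather than its nested subinteraction {"X1": 0}, illustrating how the procedure avoids redundancy.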
In the end, we are left with a collection $\mathcal{I}_K = \{T_k\}_{k=1}^{K}$ of $K$ interactions which can now be used as predictors in a logistic model with response $Y$. In other words, LIPS's resulting model reads $$\mathrm{logit}\, P(Y = 1 \mid T_1, \ldots, T_K) = \mu_0 + \sum_{k=1}^{K} \mu_k T_k. \quad (2)$$ Especially when the sample size is not large enough, it can be convenient to preliminarily filter the candidate list $\hat{\mathcal{I}}$ by removing those interactions that have an odds-ratio suspiciously close to 1. For example, as we did in our case study (cfr. Section 4.2), one may choose $\gamma \in (0, 1)$ and discard all those patterns whose $\gamma$-level confidence interval for the odds-ratio contains the value 1. In general, this will not affect the ranking and the final steps described above, but may result in more reliable models.
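The confidence-interval filter can be implemented, for instance, with Woolf's log odds-ratio interval (a standard choice we assume here for illustration; the paper does not prescribe a specific interval, and the contingency tables below are made-up numbers):

```python
from math import exp, log, sqrt
from statistics import NormalDist

def or_confint(n11, n01, n10, n00, gamma=0.95):
    """Woolf's approximate gamma-level confidence interval for the OR."""
    z = NormalDist().inv_cdf(0.5 + gamma / 2)
    se = sqrt(1/n11 + 1/n01 + 1/n10 + 1/n00)   # s.e. of log OR
    mid = log((n11 * n00) / (n01 * n10))
    return exp(mid - z * se), exp(mid + z * se)

def keep(table, gamma=0.95):
    lo, hi = or_confint(*table, gamma=gamma)
    return not (lo <= 1.0 <= hi)   # discard patterns whose CI contains OR = 1

print(keep((30, 20, 10, 40)), keep((26, 24, 25, 25)))
```

The first table (OR = 6) survives the filter, while the second (OR barely above 1) is discarded, exactly the "suspiciously close to 1" candidates the filter targets.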

Variants
In this paragraph we explore two variants of our algorithm which aim at increasing the interpretability and tractability of the final model. Although in our context of categorical predictors even high-order interactions remain highly interpretable (as aforementioned, they encode the co-presence of multiple levels of different variables), further reductions in the model structure can be of great interest in certain applications (cfr. Section 5).
Below we discuss two possible approaches: Scores LIPS, which condenses all the information into just a pair of variables; and Clusters LIPS, which is instead a compromise between the base model and Scores LIPS, where the $K$ identified interactions are grouped into multiple Compatibility Clusters. Note that, in both cases, the number of terms in the model is smaller than or at most equal to $K$. This makes the two alternatives a go-to solution in case we need to include the information carried by several interactions (for instance, in a $p \gg n$ setting), without incurring the risk of ending up with an LR model that has too many terms w.r.t. $n$.
As both variants require a further reshaping of the information carried by the selected interaction terms, we can expect their performances to be slightly worse w.r.t. the base LIPS, where all terms are included independently. However, as mentioned beforehand, and further expanded in the Discussion (Section 5), these variants present several advantages in terms of both interpretability and dimensionality reduction, two aspects that play a key role in certain research scenarios and make these algorithms widely applicable.
Risk and Protection Scores: Scores LIPS

In Section 2.2, we ranked the candidate interactions in a unique list using the log odds-ratio as criterion. However, we may avoid mixing together terms that exert opposite influences on the target and instead split the collection $\hat{\mathcal{I}}$ into two sublists, $\mathcal{R}$ and $\mathcal{P}$, which we may now sort separately, by descending and ascending odds-ratios respectively. We refer to the two collections as risk and protection patterns, a terminology that is mostly motivated by the majority of clinical applications.
Given an even integer $K$, we then apply our dissimilarity selection algorithm in order to extract two groups, $\mathcal{R}_{K/2} \subset \mathcal{R}$ and $\mathcal{P}_{K/2} \subset \mathcal{P}$, each of dimension $K/2$. Then, we construct the Risk Score $R$ and the Protection Score $P$ as $$R := \sum_{T \in \mathcal{R}_{K/2}} T, \qquad P := \sum_{T \in \mathcal{P}_{K/2}} T.$$ The intuition behind the two scores is that $R$ (resp. $P$) counts the number of selected risk (resp. protection) patterns that are present within a certain observation. In this way, the information relevant for inferring on $Y$ is condensed into a single pair of highly interpretable variables. The final model is then $$\mathrm{logit}\, P(Y = 1 \mid R, P) = \mu_0 + \alpha R + \beta P. \quad (3)$$
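The score computation amounts to counting active patterns per observation (a sketch of ours; the pattern lists and the observation are hypothetical, with patterns encoded as {variable: level} dicts):

```python
def pattern_active(T, x):
    # T is a {variable: level} pattern; x maps each variable to its observed level
    return all(x.get(v) == l for v, l in T.items())

def scores(risk, protection, x):
    R = sum(pattern_active(T, x) for T in risk)        # risk score
    P = sum(pattern_active(T, x) for T in protection)  # protection score
    return R, P

risk = [{"X1": 0, "X3": 1}, {"X2": 2}]
protection = [{"X1": 1}]
print(scores(risk, protection, {"X1": 0, "X2": 2, "X3": 1}))  # (2, 0)
```

The resulting pair (R, P) then enters model (3) as the only two predictors, whatever the value of K.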

Compatibility Clusters: Clusters LIPS
The approach described in the previous section has the advantage of substituting the $p$ original covariates $X_i$ with two scores that carry information about possible interactions. However, compressing the two lists, $\mathcal{R}_{K/2}$ and $\mathcal{P}_{K/2}$, directly into the scores $R$ and $P$ could be too rough.
Here we propose a milder aggregation criterion, the compatibility clusters, based on the idea that we should not bind together incompatible patterns. In principle, in fact, no observation can ever present two incompatible patterns simultaneously; therefore, we may want to weight them differently in our model. Concretely, after having defined $\mathcal{R}_{K/2}$ and $\mathcal{P}_{K/2}$ as in the previous section, we focus on each of them separately and propose a way to aggregate their terms. Starting from $\mathcal{R}_{K/2}$, we wish to find a partition $\mathcal{R}^1_{K/2} \cup \ldots \cup \mathcal{R}^a_{K/2} = \mathcal{R}_{K/2}$ for which: 1) all interactions within the same subgroup $\mathcal{R}^j_{K/2}$ are compatible with one another; 2) given two subgroups, there always exists at least one pairing of incompatible patterns, that is: no two subgroups can be fused together without violating the first rule. Stated as above, there is no unique way of doing this. In our experiments, we decided to construct the groups so that their total number was as small as possible, thus maximizing the information compression. Further details on the implementation can be found in the Appendix, Section A.2.
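A simple first-fit heuristic illustrates the clustering idea (a sketch of ours: the paper's implementation, detailed in Appendix A.2, minimizes the number of clusters, which plain first-fit does not guarantee; the example patterns are hypothetical):

```python
def incompatible(T, S):
    # T ⊥ S: same variable fixed at different levels
    return any(v in S and S[v] != l for v, l in T.items())

def compatibility_clusters(patterns):
    """First-fit heuristic: place each pattern in the first cluster whose
    members are all compatible with it, otherwise open a new cluster."""
    clusters = []
    for T in patterns:
        for c in clusters:
            if not any(incompatible(T, S) for S in c):
                c.append(T)
                break
        else:
            clusters.append([T])
    return clusters

risk = [{"X1": 0}, {"X1": 1, "X2": 0}, {"X2": 0, "X3": 1}, {"X1": 1}]
print(compatibility_clusters(risk))  # two clusters, split on the level of X1
```

Each resulting cluster is then collapsed into its own sub-score, yielding the cluster scores entering model (4).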
Performing the described steps also for the protection patterns allows us to define the cluster scores $$R_j := \sum_{T \in \mathcal{R}^j_{K/2}} T, \qquad P_i := \sum_{T \in \mathcal{P}^i_{K/2}} T,$$ for $j = 1, \ldots, a$ and $i = 1, \ldots, b$. The final model reads: $$\mathrm{logit}\, P(Y = 1 \mid R_1, \ldots, R_a, P_1, \ldots, P_b) = \mu_0 + \sum_{j=1}^{a} \alpha_j R_j + \sum_{i=1}^{b} \beta_i P_i. \quad (4)$$ Comparing (4) with (3), we see that now we are splitting the risk and protection scores into more sub-scores, arguing that each one describes a different, possibly incompatible, situation.

Simulation Study
This section is devoted to verifying some of the statements made throughout the paper on the most relevant aspects of the proposed algorithm. In order to provide this empirical evidence in a controlled setting, we had to build a suitable simulation protocol. Indeed, to be meaningful, we had to construct a dataset where the binary outcome showed complex dependencies with multiple categorical covariates simultaneously, so that learning the right interactions would be essential to separate the two classes. In Section 3.1 we describe how the needed data was defined and simulated; then, on such data, we compare the performances of LIPS and its variants with those achieved by other logistic models that do not account for interaction effects.
The subsequent experiments are instead devoted to verifying whether the "targeted" search, or One-Class Learning (OCL), focused on identifying patterns in the positive class only, can actually compete against the same algorithm fed with lists of interactions extracted from both classes. Then, we validate the value added by the dissimilarity-based feature selection in building the final set of interactions to include in the LR, w.r.t. picking the top $K$ interactions ranked on $|\log OR_T|$. Finally, we verify the robustness of LIPS to variations in sample size and class imbalance rate.

Simulated Data
For the definition of the datasets adopted in all the simulation experiments described in the following, we exploited a geometric model. This allowed us to synthetically introduce the aforementioned outcome-covariates dependency based on multiple interactions.
The underlying geometric model is defined as a double squared tiling structure (Fig. 1). Each instance in the dataset is represented by a couple of triangular tiles, one picked from the left tiling and one from the right with uniform probability. Two example instances are reported in Fig. 1.a. All the covariates in the dataset are binary representations of the position of one of the two tiles (from the left or right tiling). Indeed, the position of the first tile, on the left, is described by 5 binary features defined as follows: (a) $R_1$ (right) takes value 1 if the tile is in the right half, 0 otherwise; (b) $U_1$ (up) takes value 1 if the tile is located in the upper half of the tiling, 0 otherwise; (c) $D_1$ (diagonal) takes value 1 if the tile is located below the main diagonal of the tiling, 0 otherwise; (d) $A_1$ (anti-diagonal) takes value 1 if the tile is located below the anti-diagonal, 0 otherwise; (e) $O_1$ (outside) takes value 1 if the tile is outside the inner square (the 45-degree rotated square, composed of 8 triangular tiles, that has its vertices on the mid-points of the tiling's edges), 0 otherwise.
Similarly, $R_2$, $U_2$, $D_2$, $A_2$ and $O_2$ describe the position of the second tile in the right squared tiling. Notice that, to recover the notation adopted in Section 2, one may simply let $X_1, \ldots, X_{10}$ denote the ten binary covariates above. The dependent variable $Y$, which defines the two classes ($y \in \{0, 1\}$), signals whether at least one of the two tiles of the observation falls within one of the red areas highlighted in Fig. 1.b. As such, $Y$ is univocally determined by the 10 covariates described above. However, in order to work in a non-deterministic setting, we artificially added noise to the data by mislabeling some observations. In particular: (a) if both tiles of instance $i$ (left and right) fell in a red area (hence, $y_i = 1$), the label was changed to $y_i = 0$ with probability $p = 0.005$; (b) in all other cases, the value of $y_i$ was changed to the opposite class with probability $q = 0.05$.
There are several reasons motivating this construction of the dataset. First of all, note that the two classes are unbalanced by construction. Indeed, before introducing the noise, one has $P(Y = 1) = 1 - P(Y = 0) = 1 - (14/16)^2 \approx 23.4\%$, as the two tiles are extracted independently and, due to the uniform distribution, each one of them has a probability of $14/16$ of not falling within a red zone. Secondly, strong dependencies exist within groups of covariates: for instance, if $R_{1,i} = U_{1,i} = 1$, then necessarily $A_{1,i} = 1$. Lastly, the minority class ($y_i = 1$) is built in such a way that attaining a good classification performance would be impossible without introducing interaction terms, as the red tiles need to be described by several variables simultaneously to be properly identified in the tilings.
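The label-generating mechanism can be sketched as follows (a simplified sketch of ours that abstracts away the tile geometry: tiles are represented by indices 0-15, with the two red tiles arbitrarily placed at indices 0 and 1, which is an assumption for illustration only):

```python
import random

def simulate_labels(n, n_red=2, n_tiles=16, p=0.005, q=0.05, seed=0):
    """Draw two tiles uniformly and independently, set Y = 1 if at least one
    is red (2 of the 16 tiles per tiling), then apply the label noise."""
    rng = random.Random(seed)
    ys = []
    for _ in range(n):
        left, right = rng.randrange(n_tiles), rng.randrange(n_tiles)
        in_red = (left < n_red, right < n_red)  # assumed red tiles: indices 0, 1
        y = int(any(in_red))
        if all(in_red):
            if rng.random() < p:    # rule (a): flip 1 -> 0 with probability p
                y = 0
        elif rng.random() < q:      # rule (b): flip with probability q otherwise
            y = 1 - y
        ys.append(y)
    return ys

ys = simulate_labels(100_000)
print(sum(ys) / len(ys))  # near the noiseless 23.4%, shifted upward by the noise
```

Running this confirms the intended class imbalance: the empirical frequency of $Y = 1$ lands roughly in the mid-20% range, consistent with the computation above once the asymmetric noise is accounted for.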

Learning with or without interactions
As mentioned, the simulated data were built with the aim of challenging LR models that try to classify observations exploiting the additive effect of the predictors alone. As a starting point, we hence compare the performance of a LR and a Lasso Regression without interaction terms against our proposed algorithm in its three variants. For this experiment, we adopted the previously described procedure to generate 10 datasets, each of n = 10,000 observations with $P(Y = 1) \approx 0.234$. We then split each dataset into training and test sets according to a 70/30 ratio. In Table 1 we report the following metrics: Area Under the ROC Curve (AUC), Sensitivity, Specificity, Negative Predictive Value (NPV) and Positive Predictive Value (PPV). It is clear from Table 1 that considering only main effects and their additive contribution makes the model practically useless in discriminating between the two classes ($AUC = 0.49 \pm 0.015$ for both LR and Lasso). The Lasso Regression was performed imposing $\lambda = 10^{-5}$, because larger values of $\lambda$ shrank the model to no terms at all.
Note that our algorithm (here with K = 10 and supp_min = 10%) yields a very high score in all its variants. Nonetheless, it is not surprising that the version of LIPS using the interactions as-is obtains the highest AUC among the three.
A remarkably positive aspect of building statistical models with high-order interactions among categorical features is the undeniable interpretability of the resulting model itself. This is especially true when applying LIPS, which returns an arbitrarily long list of high-order interaction terms selected so as to (i) be highly descriptive of one class in particular and (ii) cover as much diverse information as possible regarding such class, thanks to the dissimilarity-based selection. In Figure 2 we provide a concrete demonstration of this claim, representing graphically the 10 interaction terms included in the model generated by LIPS in one of the trials of this experiment. The tiles in each pair of squared tilings are colored highlighting the areas defined by the dummies included in each of the K = 10 selected interactions. Red and blue coloring depend on the OR. The order in which the terms are drawn (to be read by row) is the same order in which the algorithm picked each of them. This means, for instance, that the only blue pattern was selected as the 6th interaction to be included. It can be easily noticed how each of the selected interactions describes precisely certain aspects of the two classes.
For instance, the first pattern spots one triangle in the right tiling that appears to be associated with the minority class, which is coherent with the original construction of the data (cf. Figure 1.b). In contrast, the second selected interaction highlights a completely different (but still consistent) area. It is interesting to see how the only protection pattern, despite not telling much about the right tiling, precisely defines an area in the left one where no minority class observation should fall.
3.3. One-Class Targeted Search and the relevance of Dissimilarity-based Interaction Selection
Because of its combinatorial character, the search for high-order interactions remains a computational challenge, irrespective of whether the objective is to fit a classification model or to identify all relevant patterns. As extensively discussed above, the LIPS algorithm exploits the interaction-itemset duality to extract relevant high-order interactions from the training data. To do so, it employs the widely recognized Apriori algorithm for frequent itemset mining, focusing the search on the positive class observations only. Among other things, the experiment conducted in our simulated setting is meant to support this approach by empirically demonstrating that it does not hinder the classification of either class, and may even foster performance while saving time w.r.t. extracting itemsets from both.
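The one-class targeted search can be sketched as follows (a self-contained toy in which a naive enumeration stands in for Apriori; the paper uses a proper Apriori implementation, and the dummy names and rows below are hypothetical):

```python
from itertools import combinations

def frequent_itemsets(rows, supp_min):
    """Naive frequent-itemset enumeration (stand-in for Apriori):
    return every itemset whose support among `rows` is >= supp_min.
    Feasible here only because the toy data has very few items."""
    items = sorted({i for r in rows for i in r})
    out = {}
    for k in range(1, len(items) + 1):
        for cand in combinations(items, k):
            supp = sum(set(cand) <= r for r in rows) / len(rows)
            if supp >= supp_min:
                out[frozenset(cand)] = supp
    return out

# One-class targeted search: mine minority-class observations only.
# Each row lists the dummies active in one positive-class observation.
positives = [{"A1", "R1"}, {"A1", "R1", "U1"},
             {"R1", "U1"}, {"A1", "R1", "U1"}]
candidates = frequent_itemsets(positives, supp_min=0.5)
```

On these four positive rows with supp_min = 0.5, the search returns 7 candidate itemsets; mining the whole dataset with the same threshold would dilute the support of minority-specific patterns.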
Having established the validity of this first relevant aspect of the proposed algorithm, the second fundamental design choice that deserves an empirical evaluation is the dissimilarity-based feature selection method. Indeed, this peculiar step builds arbitrarily long lists of features (interactions) by selecting the most diverse among those with the highest OR. This method was designed under the assumption that a more diverse set of interactions might reduce noise, generalize better on unseen minority class observations by capturing the whole underlying class distribution, and collect the most useful interactions to interpret and characterize this group. We tested these ideas within the same experiment as above.
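A minimal sketch of the dissimilarity-based selection (the max-min greedy rule and the Jaccard distance used here are illustrative stand-ins; the paper's measure d is based on the maximal common subinteraction and is defined in Section 2.2.2):

```python
def select_diverse(candidates, K, dissimilarity):
    """Greedy dissimilarity-based selection: start from the top-ranked
    interaction, then repeatedly add the candidate that is farthest
    (in the min-distance sense) from those already selected."""
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)  # by rank
    selected = [ranked[0][0]]
    pool = [p for p, _ in ranked[1:]]
    while len(selected) < K and pool:
        best = max(pool, key=lambda p: min(dissimilarity(p, s) for s in selected))
        selected.append(best)
        pool.remove(best)
    return selected

def jaccard(a, b):
    # Illustrative dissimilarity between itemsets
    return 1 - len(a & b) / len(a | b)

# Hypothetical candidates as (itemset, |log OR| rank) pairs
cands = [(frozenset("AB"), 3.0), (frozenset("ABC"), 2.5),
         (frozenset("DE"), 2.0), (frozenset("AC"), 1.5)]
picked = select_diverse(cands, K=3, dissimilarity=jaccard)
# Second pick is {D, E}: maximally dissimilar from {A, B},
# even though {A, B, C} has a higher rank
```

This is exactly the behavior the text describes: high-rank but redundant patterns are deferred in favor of candidates that cover different regions of the minority class.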
We exploited the same 10 datasets described in Section 3.2 but trained four different algorithms, detailed in the caption of Figure 3, which also reports the results of this experiment. The most traditional implementation of LIPS outperforms all other versions on essentially all the considered metrics. First of all, note that DS-LIPS with the same supp_min performs almost comparably in terms of AUC, but its Sensitivity is strongly affected by searching for patterns within both classes. This result supports our OCL approach in fostering classification accuracy specifically on the underrepresented class. Moreover, DS-LIPS with the highest support performs worst on all dimensions: at first sight this is surprising, as one would expect highly frequent patterns in both classes to separate the two at best. However, raising the supp_min value, especially for what concerns the minority class, can result in a poor performance whenever the separating patterns are very different and scattered along the data.
As for Top LIPS, its performance is particularly low in terms of Sensitivity. Comparing this metric with the others, it is straightforward to deduce that the algorithm's performance is hindered by a very high number of False Negatives (Sensitivity ≈ 0.2).
Interestingly, this latter result highlights a possible reason why the dissimilarity-based interaction selection step may enhance the identification of the real underlying distribution of minority class observations in a generalized and robust manner. Indeed, the K interactions selected by Top LIPS with their high OR are very specific to certain minority class observations (PPV ≈ 1), thus guaranteeing a precise majority class classification as well (Specificity ≈ 1, as the identified patterns are extremely rare in the overrepresented class), but the chance to generalize them to the whole minority class is quite low. In other words, the algorithm is including a set of interactions that exist within the positive class only; however, they all similarly describe one subgroup of this class, which appears to be quite distant from all other examples in the dataset. This situation may arise in the presence of outliers within the underrepresented class, or in case of an irregularly distributed minority class with two or more subgroups of observations. As a matter of fact, note that the simulated data has evident subgroups, as positive observations are defined by several distinct areas in the geometric space. This testifies in favour of the robustness of the dissimilarity-based feature selection method, which would, by design, include at least one of the high-OR patterns (thus describing the subgroup), but would then be forced to add dissimilar interaction terms, lowering the risk of overfitting the group of outliers and providing a better generalization on the whole minority class population.

On the robustness to Sample Size
Nowadays, real-life research settings present an extremely wide range of different scenarios when it comes to sample size. Huge datasets in some domains are opposed to extremely limited samples in others, above all in healthcare and medical research. Therefore, any novel statistical approach that aims at finding broad application needs to be flexible to the different situations it might be applied to.
This experiment is meant to evaluate and discuss the performance of LIPS for extremely limited sample sizes. Indeed, as n becomes larger, the only expected drawback of the proposed methodology regards computational time. The safety measures taken to limit the impact of extremely large datasets have already been discussed, so the focus here will be on the potential complexities of implementing LIPS on very few observations. Indeed, mining itemsets on one class only may easily induce low generalization capability of the identified interactions if the sample is too small or poorly representative of the real class distribution.
Moreover, decreasing the minority class sample size may bias the structure (and reduce the number) of patterns to evaluate for inclusion as interaction terms. Indeed, as the sample size gets smaller, fewer levels of each covariate may be represented in minority class examples.
With this experiment we wish to empirically demonstrate that, thanks to the combined effect of the supp_min parameter, the OCL search, and the dissimilarity-based feature selection, LIPS can handle extremely limited sample sizes while maintaining a good performance on the evaluation metrics. Indeed, searching on minority class examples with a low threshold on supp_min allows the algorithm to identify a sufficiently large set of interactions despite working on very few observations. Moreover, the dissimilarity-based selection lowers the risk of overfitting the small pool of observed data points, and helps include in the final model interactions that might generalize better on the overall population.
For the experiment we trained LIPS, Scores LIPS and Clusters LIPS with K = 10 on datasets of varying sample size. For each sample size (n ranging from 100 to 1,000 with a step of 50 observations) we generated 50 training and test sets. In Figure 4 we provide the obtained results in terms of mean and standard deviation of AUC, Sensitivity and Specificity. As the reader may notice, even though the sample size gets significantly small and the number of terms included in the LR is rather limited (10 terms at most), the performance does not degrade significantly (lowest AUC score around 0.7 with the smallest sample of only 100 observations). Among the three, LIPS with ungrouped interactions performs significantly better than the other two versions. Again, this is not surprising, as we are adding more terms to the LR without any information preprocessing. Nonetheless, the two other versions provide a solid performance as well, and might be useful should the number K of interactions grow larger, or to satisfy specific interpretability requirements.

On the robustness to Class Imbalance
The last relevant aspect to be discussed regarding the broad applicability of our proposed approach is its capability to handle situations of strong class imbalance. As a matter of fact, the nature of the algorithm itself makes it almost unaffected by class imbalance ratios, provided the number of observations within the minority class is sufficiently large. Indeed, the one-class-learning procedure reduces the effect of class imbalance, which would instead make the problem rather intractable if the algorithm searched the whole dataset at once.
Let us consider an extreme case in which the itemset i has empirical frequency f_{i,1} ≈ 1 among minority class observations and f_{i,0} ≈ 0 within majority observations. In what follows we write n_1 for the number of minority class observations and we let p_1 := n_1/n. Then, in order for the algorithm to include i (which is extremely relevant in discriminating between the two classes) in the list of interactions, we should impose supp_min < p_1. In strongly imbalanced settings, p_1 can easily take values below 0.1. This would mean that including patterns specific to the positive class would force the imposition of extremely low supp_min values, causing the list of potential interactions to explode. This would increase computational time and noise, and lower the probability of including strongly discriminative interactions in the model despite the high associated ORs.
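A quick numerical illustration of this argument (the frequencies below are hypothetical):

```python
# Pattern i is near-ubiquitous in the minority class (f1) and almost
# absent from the majority class (f0), in a 5%-minority dataset.
f1, f0 = 0.95, 0.01
p1 = 0.05

# Support of i over the whole dataset vs. over the minority class only
supp_whole = p1 * f1 + (1 - p1) * f0   # about 0.057
supp_minority = f1                     # 0.95

# Whole-data mining keeps i only if supp_min < 0.057, dragging in a
# flood of spurious itemsets at that threshold; one-class mining keeps
# i with any supp_min up to 0.95.
```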
Instead, the OCL approach of LIPS overcomes this limitation and identifies relevant interactions irrespective of the imbalance ratio. For this experiment we generated 50 training and test sets for each percentage of minority class observations, from p_1 := n_1/n = 1% to p_1 = 40%, with n = 5,000. Minimum support was set to 0.1 and K = 10. In Figure 5 it can be noticed how none of the presented metrics is strongly affected by the increasingly imbalanced setting of the classification problem. Of course, in this case the algorithm was dealing with a rather large n. However, as we already discussed the performance for small sample sizes, we were more interested in observing the effect of the imbalance ratio only. The three variants demonstrate a solid performance for extreme imbalance as well, and (especially for traditional LIPS) basically reach and robustly keep their best performance from 5% onwards.

Benchmark Experiments
As discussed in Section 1.1, our proposal lies at the intersection of two independent lines of research, building on concepts borrowed from the best of both.
In particular, the group of works dealing with significant pattern mining (such as Pellegrina and Vandin (2018); Sugiyama and Borgwardt (2019)) provides a set of potential alternatives for the identification of the list of interactions to be filtered on a dissimilarity basis. As a first development of our methodology we decided to apply the most widely recognized itemset mining algorithm, the Apriori algorithm. Nonetheless, future developments might include alternative pattern mining methods that might scale better, improve the efficiency, or handle some data- and problem-specific requirements more effectively. As a matter of fact, we see these methodologies more as potential additions to our algorithm than as actual competitors. Therefore, to compare the performance of our algorithm with a competing benchmark, we focused on the second line of research mentioned among related works. In particular, we chose the recent work of Lim and Hastie (2015), where the authors present the Group-Lasso INTERaction-NET (glinternet), a state-of-the-art approach when it comes to selecting interactions for LR models. For readability and comparability of results, there are a few considerations to be made regarding glinternet. The algorithm is indeed a regularization-based interaction learning method and, as such, it does not allow for a precise control over the number of terms included in the model, especially in case of categorical covariates. In fact, the open source software ‡ allows the user to define the number of interactions (n_inter) to include in the model, and then the algorithm iteratively relaxes the regularization to get as close as possible to n_inter interactions. The procedure stops when the algorithm reaches at least n_inter interactions, meaning that their number may be larger than required. Note that glinternet includes main effects and interactions by enclosing the whole variables, with all their categorical levels.
To compare LIPS with glinternet we applied both algorithms to simulated data and to a dataset from the UCI Machine Learning Repository (Dua and Graff, 2017), namely the Breast Cancer§ dataset.

Benchmark comparison on simulated data
To test LIPS's performance against the benchmark, we ran glinternet on 10 simulated training and test sets, generated as described in Section 3.1. To compare results on the basis of how many terms were introduced in the LR as well, we imposed several different n_inter values (n_inter ∈ {3, 4, 5, 6, 7, 8, 10}). Then, we collected the number of β parameters associated with each level of the selected covariates to count precisely the number of terms in the model. Table 2 reports the results of the experiment. Again, LIPS was trained with K = 10 and supp_min = 0.1.
Our proposed algorithm performs notably better than the benchmark competitor on average, up until glinternet includes 10 interaction terms in the model. However, using 10 interactions translates into including on average 50 or 60 terms in the LR, as opposed to LIPS, which uses 10 pattern-terms only. This makes the resulting model far more interpretable, and the estimated β parameters more reliable.
‡ cran.r-project.org/web/packages/glinternet/index.html
§ https://archive.ics.uci.edu/ml/datasets/breast+cancer

Breast Cancer Case Study
For this last benchmarking experiment we chose a freely available dataset from the UCI Machine Learning Repository, namely the Breast Cancer dataset. This dataset describes oncological patients in terms of their age, menopause state and further features regarding both the tumor itself (size, malignancy, breast location, etc.) and its treatment. Instances are thus described by 9 categorical attributes in total, some of which are continuous variables binned into categories, while others are nominal. Patients are split into two classes: one grouping the 201 instances with no recurrence events, and the other including the 85 women with tumor recurrence. What makes this dataset interesting for this application is the rather small sample size (typical of a real-life medical experimental setting), and the varying number of levels per feature, ranging from 2 to 11. Table 3 reports the results of applying the three versions of LIPS, against glinternet, a LR model and a Lasso Regression, the latter two without interactions. Because of the small sample size, the cross-validation was performed by bootstrapping training and test sets 10 times with a 70/30 splitting ratio.
Our algorithms were all trained with supp_min = 0.3 and the additional use of 90% confidence intervals for the odds ratios; we used different values for K, as reported in Table 3. LIPS and its variants consistently attain the best performance in terms of AUC and Sensitivity, using the smallest number of terms in the model. Lasso Regression and glinternet perform comparably; however, the latter includes in the model an excessive and disproportionate number of terms. Indeed, as the number of terms grows, the performance of glinternet worsens.
These results suggest that the interaction effect in describing the outcome of this particular dataset is neither extremely evident nor necessary. Indeed, the absence of interaction terms is not enough to hinder the classification capability of the Lasso. However, the co-occurrence of some specific dummy variables in the minority class seems to capture a relevant part of the variability in the data, allowing LIPS and its variants to provide the best performance with a very limited pool of patterns.

Discussion and Conclusions
In this paper we presented LIPS, a high-order interaction learning algorithm via targeted pattern search, which selects high-order interactions among categorical covariates and builds a LR model. Together with LIPS, we proposed two alternative strategies to represent and include the selected interactions in the model, namely Scores LIPS and Clusters LIPS, to address the problem-specific needs of the research community that uses LR for inference and modeling. The proposed strategies have been assessed through a wide set of experiments on both simulated and real data. In particular, as LIPS was designed specifically to address a broad range of real-life data analysis issues, we wish to conclude the paper by recalling and discussing the most notable strengths of the algorithm.
First and foremost, LIPS was designed to make LR models more powerful by introducing interaction terms between categorical covariates. However, to make the resulting model reliable for inference, a strict control over model dimensionality is needed. To this specific aim, the option to choose the precise number of interaction terms to include allows the user to tailor the LR fitting to the problem at hand, managing the risk of biased estimates and unreliable inference (Sur and Candès, 2019) induced by p ≫ n settings. Controlling the dimensionality of the model surely makes LIPS extremely effective in small sample size settings. Nonetheless, the algorithm is fast and scalable enough to be applied to research scenarios where both p and n are large. Indeed, the proposed method provides a useful feature selection technique that reduces the computational cost by focusing its pattern search on one class only (targeted search) without losing generalizability, and that returns lean and interpretable models in manageable running times.
Besides inference robustness, model dimensionality surely impacts interpretability as well. Most real-life application domains are willing to sacrifice a little on the performance side to foster readability and explainability of results. LIPS addresses this aspect by providing an arbitrarily small and extremely predictive set of discriminative interaction terms that are easy to read and interpret. We provided a clear example via simulation, where the selected patterns described precisely the areas where the two classes were to be sought. Moreover, we introduced two alternatives to the principal algorithm (Scores LIPS and Clusters LIPS), whose aim is to serve this need. Indeed, some potential applications of the two variants can be easily identified in the healthcare and life-science domains, where n is traditionally very small, p may grow extremely large, and interpretability is key. For instance, genome-wide association studies may require long lists of interactions to properly characterize a patient, and introducing K terms in the model may reduce both interpretability and reliability of results (as even K may be larger than n). Furthermore, the Risk and Protection Scores of Scores LIPS provide a clear representation of the patterns and their role in profiling the population, while being an agile scoring system that can be easily accompanied by other explicative covariates.
Another point of paramount importance is the ability of LIPS and its variants to manage even extremely imbalanced settings. We already highlighted how oftentimes the minority class represents the most interesting class to study, while researchers need to make reliable inference on a very small sample of this critical population. LIPS, combining targeted search with dissimilarity-based interaction selection, proved to be a powerful tool to capture the underlying minority class distribution in a robust and generalizable manner, even in the presence of very few observations. Lastly, it is relevant to address the fact that LIPS and its variants focus their attention on the search for interactions among categorical covariates. This apparent limitation is actually grounded in a solid and multifaceted rationale. First of all, as previously mentioned, fully categorical data are getting more and more common in several scenarios, such as the life sciences. The tractability of these types of data within LR models with interactions is strongly affected by the exponential growth of the number of covariates to be considered as the number of levels grows. This impact is less dramatic when dealing with continuous features. Nonetheless, LIPS was designed to solve this specific issue at best, while being a flexible and effective addition to a broader study framework with various data types as well. Indeed, once a restricted and powerful subset of K interaction terms has been selected from the categorical variables (absorbing a great portion of the computational complexity), the user can easily accompany them with additional numerical covariates and their interactions. These further terms might indeed be selected, separately or jointly with the K LIPS terms, via other traditional techniques, which could work on a reduced number of features w.r.t. using the whole original data.
We also compared the performance of the proposed algorithm with that of a state-of-the-art interaction selection method, glinternet, demonstrating our method's superior performance in terms of accuracy, interpretability, and significance of the resulting model, as LIPS and its variants obtained better results with notably fewer terms in the model.
In conclusion, with its targeted approach, the introduction of a novel dissimilarity-based interaction selection, and its flexibility to be tailored around specific user needs (i.e. number and order of interaction terms, interaction representation via Scores or Compatibility Clusters, confidence adjustment), LIPS proves to be a novel and powerful approach to face the complexities of applying LR to many real-life research domains. Future works may be devoted to improving the algorithm's efficiency by including more refined itemset mining methods at its core and to expanding the methodology to include numerical covariates in a noise-adverse manner.

A.2. Compatibility clusters computation
In Section 2.3.2, we described the algorithm Clusters LIPS, which condenses the information coming from the interaction terms into fewer variables associated with certain compatibility clusters. For readability, let us recall the notation used in that context. From the list of candidates Î, we exploited the dissimilarity measure to extract K patterns (with K even). Half of those were selected from the risk interactions, resulting in the sublist R_{K/2}; similarly, the other half consisted of protective patterns, P_{K/2}. Next, our algorithm requires the partitioning of these two lists (separately) into smaller groups of patterns so that: each group is formed by pairwise compatible patterns; any two different groups contain at least one pair of incompatible patterns. For the sake of simplicity, we address this problem only for the risk patterns.
To determine a suitable partition of R_{K/2} into compatibility clusters R^1_{K/2}, ..., R^a_{K/2}, it is convenient to regard the interactions as nodes of an undirected compatibility graph G(R_{K/2}), where two interactions are linked together if and only if they are compatible. An example can be found in Figure 6. In this way, our original problem becomes equivalent to that of partitioning a graph into cliques, with the additional constraint that such cliques should not be further joinable into larger ones. As anticipated, in general the solution to such a puzzle is not unique and many partitions are feasible; in view of dimensionality reduction, one may thus want to strengthen the constraints and search for a minimal cover. However, the problem of finding a minimal clique cover for a graph is known to be NP-hard (Karp, 1972), so we partially avert this issue by opting for a greedy procedure. In short: iteratively, we find a largest maximal clique in G(R_{K/2}), store its nodes in a new list R^j_{K/2}, and remove the clique from the graph; we keep this up until the graph becomes empty.
The determination of the maximal cliques can be done in several ways. For the implementation of Clusters LIPS, we relied on the more general Bron-Kerbosch algorithm (Bron and Kerbosch, 1973), which in principle allows one to list all the maximal cliques within a graph. This algorithm is known for being computationally expensive on large graphs, but that was not our case, as we typically picked small values for K. Whenever large values of K are needed, it may be convenient to rely on other maximal clique search algorithms (Tarjan and Trojanowski, 1977; Robson, 1986; Tomita and Kameda, 2007; Konc and Janežič, 2007).
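The greedy clique-cover routine can be sketched as follows (a self-contained toy using a basic, unpivoted Bron-Kerbosch enumeration; the five-node compatibility graph is hypothetical):

```python
def bron_kerbosch(R, P, X, adj, out):
    """Basic Bron-Kerbosch recursion: collect all maximal cliques."""
    if not P and not X:
        out.append(R)
        return
    for v in list(P):
        bron_kerbosch(R | {v}, P & adj[v], X & adj[v], adj, out)
        P.remove(v)
        X.add(v)

def greedy_clique_cover(nodes, adj):
    """Greedily take a largest maximal clique of the remaining
    compatibility graph and remove its nodes, until none are left."""
    remaining = set(nodes)
    clusters = []
    while remaining:
        sub_adj = {v: adj[v] & remaining for v in remaining}
        cliques = []
        bron_kerbosch(set(), set(remaining), set(), sub_adj, cliques)
        best = max(cliques, key=len)
        clusters.append(best)
        remaining -= best
    return clusters

# Toy compatibility graph: nodes are interactions, edges mean "compatible"
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"), ("d", "e")]
nodes = {"a", "b", "c", "d", "e"}
adj = {v: {u for e in edges for u in e if v in e and u != v} for v in nodes}
clusters = greedy_clique_cover(nodes, adj)
```

On this toy graph the procedure first extracts the triangle {a, b, c} and then the remaining pair {d, e}, mirroring the two-cluster decomposition illustrated in Figure 6.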

A.3. Further technicalities
Here we address a few final technical details that can be considered when implementing LIPS or one of its variants.
First of all, even by exploiting the most sophisticated routines, certain datasets may still yield prohibitive computational times in the first training phase. Aside from replacing Apriori with other frequent itemset mining algorithms, another possibility is that of restricting the interaction search to those patterns that have a length smaller than a fixed L > 0. This further constraint can be easily added, as it does not impact the subsequent steps of the algorithm. Fixing a maximal length L can also be used to intentionally bound the order of the interactions, which may be of interest depending on the context. Finally, another possible issue is that of finding interactions T that have an uncomputable rank, log |OR_T|, along the dataset. This happens every time the contingency table of (T, Y) has at least one empty cell. To deal with such situations, several approaches can be adopted. For example, one may either: 1) further investigate the anomaly and decide whether to discard the interaction or force its presence in the final list of patterns; or 2) adopt an adjusted version of the OR, for instance by employing the Haldane-Anscombe correction (Ruxton and Neuhäuser, 2013), which consists in preliminarily increasing each value in the contingency table by 0.5.
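The Haldane-Anscombe adjustment can be sketched as follows (a minimal sketch; the table values are hypothetical):

```python
import math

def log_odds_ratio(table, correction=True):
    """Log odds ratio from a 2x2 contingency table [[a, b], [c, d]].
    With an empty cell the raw OR is zero or undefined; the
    Haldane-Anscombe correction adds 0.5 to every cell first."""
    (a, b), (c, d) = table
    if correction:
        a, b, c, d = a + 0.5, b + 0.5, c + 0.5, d + 0.5
    return math.log((a * d) / (b * c))

# An interaction never observed in the negative class: one empty cell
table = [[20, 30], [0, 50]]
rank = abs(log_odds_ratio(table))   # finite thanks to the correction
# Without the correction the same call raises ZeroDivisionError
```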

B. On the dissimilarity measure
Here, we wish to point out a few theoretical facts about the dissimilarity measure d (1), introduced in Section 2.2.2. First of all, the proposed dissimilarity measure defines a so-called semi-metric over I (Wilson, 1931). In fact, as the MCD of two interactions is always a subinteraction of them both, it is straightforward to see that d satisfies the defining properties of a semi-metric (non-negativity, symmetry, and identity of indiscernibles). In general, though, d does not define a metric over I, since the triangle inequality does not hold. As a counterexample, consider the case of three binary variables A, B and C. Then, for T := A(0)B(0)C(0), S := A(0)B(0)C(1) and Z := A(0)B(0), because T ⊥ S, one has d(T, S) > d(T, Z) + d(Z, S).

Fig. 1 .
Fig. 1. Geometric model behind the simulated data. The left panel (a) shows two illustrative observations from the dataset, each represented by one tile from each squared tiling. The right panel (b) shows the tiles that determine the positive class.

Fig. 2 .
Fig. 2. The K patterns identified by LIPS in one trial of the first simulation experiment (where K = 10). Tiles are colored according to the areas defined by the categorical terms involved in each of the selected interactions. Red patterns are risk patterns (OR > 1), while protection patterns (OR < 1) are colored in blue.

Fig. 3 .
Fig. 3. In order: performance of LIPS in its three variants (red), against DS-LIPS with supp_min = 0.1, DS-LIPS with supp_min = 0.5, and Top LIPS (blue) on simulated data.
(a) LIPS in its characteristic One-Class-Learning (OCL) version, searching on minority class examples only with supp_min = 0.1. We tested the algorithm in all three of its variants (LIPS, Scores LIPS and Clusters LIPS).
(b) Two-Class-Learning LIPS (referred to as Double Search LIPS, DS-LIPS), building the list of interactions by searching separately in both classes. As for OCL LIPS, we imposed supp_min = 0.1.
(c) Double Search LIPS with supp_min = 0.5.
(d) To evaluate the substantiation of the hypotheses on the value added by the dissimilarity-based feature selection, a version of OCL LIPS that only picks the top K interaction terms from the ranked list (Top LIPS).

Fig. 4 .
Fig. 4. Performance of LIPS, Scores LIPS and Clusters LIPS for varying sample sizes.

Fig. 5 .
Fig. 5. Performance of LIPS, Scores LIPS and Clusters LIPS for varying imbalance ratios. The percentages on the x-axis (reported in log-scale) represent the portion of minority class observations in the whole dataset.

Fig. 6 .
Fig. 6. Example of a compatibility graph. Several interaction terms (two of which are degenerate) involving the dummies of four {0, 1}-valued random variables, A, B, C and D, define the nodes of the graph. Two nodes share a link if and only if the corresponding interactions are compatible. Our greedy decomposition breaks the graph into two clusters: the maximal clique {B(0), C(1), A(1)C(1), A(1)B(0)D(0)} and the remaining pair {A(0)B(0)C(0), C(0)D(1)}.

Table 1 .
Results on the simulated dataset for LR, Lasso Regression and LIPS in its three variants.The columns report model performances in terms of Area Under the ROC Curve (AUC), Sensitivity, Specificity, Negative Predictive Value (NPV) and Positive Predictive Value (PPV).

Table 2 .
Results on the simulated dataset for glinternet (with varying number of interactions and consequent number of terms) and LIPS (last row).

Table 3 .
Results on the Breast Cancer dataset for LIPS -(i) Individual interactions, (ii) Scores and (iii) Clusters, (iv) LR without interactions, (v) Lasso Regression and (vi) glinternet (with varying number of interactions and consequent number of terms).