Using redescription mining to relate clinical and biological characteristics of cognitively impaired and Alzheimer’s disease patients

  • Matej Mihelčić,

    Roles Conceptualization, Formal analysis, Investigation, Methodology, Software, Visualization, Writing – original draft, Writing – review & editing

    Affiliations Division of Electronics, Ruđer Bošković Institute, Zagreb, Croatia, Jožef Stefan International Postgraduate School, Ljubljana, Slovenia

  • Goran Šimić,

    Roles Funding acquisition, Investigation, Validation, Writing – original draft, Writing – review & editing

    Affiliation Department for Neuroscience, Croatian Institute for Brain Research, University of Zagreb Medical School, Zagreb, Croatia

  • Mirjana Babić Leko,

    Roles Writing – original draft, Writing – review & editing

    Affiliation Department for Neuroscience, Croatian Institute for Brain Research, University of Zagreb Medical School, Zagreb, Croatia

  • Nada Lavrač,

    Roles Conceptualization, Funding acquisition, Supervision, Writing – original draft, Writing – review & editing

    Affiliations Jožef Stefan International Postgraduate School, Ljubljana, Slovenia, Department of Knowledge Technologies, Jožef Stefan Institute, Ljubljana, Slovenia

  • Sašo Džeroski,

    Roles Conceptualization, Funding acquisition, Writing – original draft, Writing – review & editing

    Affiliations Jožef Stefan International Postgraduate School, Ljubljana, Slovenia, Department of Knowledge Technologies, Jožef Stefan Institute, Ljubljana, Slovenia

  • Tomislav Šmuc,

    Roles Conceptualization, Funding acquisition, Investigation, Methodology, Supervision, Writing – original draft, Writing – review & editing

    Affiliation Division of Electronics, Ruđer Bošković Institute, Zagreb, Croatia

  • for the Alzheimer’s Disease Neuroimaging Initiative

    Membership of the Alzheimer’s Disease Neuroimaging Initiative is provided in the Acknowledgments.


Based on a set of subjects and a collection of attributes obtained from the Alzheimer’s Disease Neuroimaging Initiative database, we used redescription mining to find interpretable rules revealing associations between those determinants that provide insights into Alzheimer’s disease (AD). We extended the CLUS-RM redescription mining algorithm to a constraint-based redescription mining (CBRM) setting, which enables several modes of targeted exploration of specific, user-constrained associations. Redescription mining enabled finding specific constructs of clinical and biological attributes that describe many groups of subjects of different size, homogeneity and levels of cognitive impairment. We confirmed some previously known findings. However, in some instances, as with the attributes testosterone, ciliary neurotrophic factor, brain natriuretic peptide, Fas ligand, the imaging attribute Spatial Pattern of Abnormalities for Recognition of Early AD, as well as the levels of leptin and angiopoietin-2 in plasma, we corroborated previously debatable findings or provided additional information about these variables and their association with AD pathogenesis. Moreover, applying redescription mining to ADNI data resulted in the discovery of one largely unknown attribute: the Pregnancy-Associated Plasma Protein-A (PAPP-A), which we found highly associated with cognitive impairment in AD. Statistically significant correlations (p ≤ 0.01) were found between PAPP-A and clinical tests: Alzheimer’s Disease Assessment Scale, Clinical Dementia Rating Sum of Boxes, Mini Mental State Examination, etc. The high importance of this finding lies in the fact that PAPP-A is a metalloproteinase, known to cleave insulin-like growth factor binding proteins. Since it also shares similar substrates with the A Disintegrin and Metalloproteinase (ADAM) family of enzymes that act as α-secretase to physiologically cleave amyloid precursor protein (APP) in the non-amyloidogenic pathway, it could be directly involved in the metabolism of APP very early during the disease course. Therefore, further studies should investigate the role of PAPP-A in the development of AD more thoroughly.


Alzheimer’s Disease (AD) is an irreversible neurodegenerative disease that results in progressive deterioration of cognitive abilities and behavioural control due to synapse and neuron loss. It is the most common cause of dementia among older adults. Although available medications for treatment of mild to moderate AD (donepezil, galantamine, and rivastigmine) and severe AD (memantine) help to some level, these drugs do not modify the underlying disease process.

The Alzheimer’s Disease Neuroimaging Initiative (ADNI) [1] aims to collect various imaging and biomarker data that could be useful in the diagnosis and treatment of AD. The analysis of these data provides a means to extend our understanding of the disease and of its impact on various aspects of human comportment and cognition, and to track its progression.

In this work, we analysed the data obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database [1], containing clinical and biological measurements (listed in S1–S3 Files). These measurements are taken for a set of subjects in order to test for the presence of AD and the level of the subjects’ cognitive impairment. We divided the attributes into two main groups: clinical (clin) and biological (bio).

Clinical attributes were obtained from numerous questionnaires and neuropsychological instruments designed to test cognition and memory with the hope of early detection of AD. These tests have been carefully designed, studied and regularly updated to increase the detection of various forms of cognitive impairment. Many such tests exist [2], but no single measure can be used to reliably make the diagnosis [3]. Combining different tests has therefore been shown to provide more reliable results. Biological attributes contained neuroimaging data from a number of methods used to visualize the brain, such as MRI and PET scans, along with some related and derived scores. They also contained biospecimen data: a number of blood tests and measurements, and information about the subjects’ genetic markers (genetic data). These attributes are generally considered less reliable, but they are still actively investigated with the aim of aiding the early detection of AD and helping to understand its complex genetic, epigenetic, and environmental landscapes.

Manual investigation of associations between attributes and analysis of their effects would require insurmountable efforts, which prompted us to use a data mining technique called redescription mining.

Work related to understanding cognitive impairment

Considerable work has been oriented towards understanding the role of biological or clinical attributes, determining correlations between different attributes and assessing their predictive power for determining the level of cognitive impairment.

Researchers have used neural imaging (MRI, PET, etc.) [4–6] to predict levels of cognitive impairment. For example, Doraiswamy et al. [7] studied PET images of subjects with cognitive decline. Donovan et al. [8] studied correlations between regional cortical thinning and worsening of apathy and hallucinations. Guo et al. [9] studied the effects of intracranial volume on the association between clinical disease progression and brain atrophy or apolipoprotein E genotype. Hostage et al. [10] studied the effects of the apolipoprotein E (APOE) alleles ε4 and ε2 on hippocampal volume. Other investigators have also studied the role of apolipoprotein E [11] in early mild cognitive impairment. These are just a few samples from the huge set of studies of correlations between biological and clinical attributes and the level of cognitive impairment. A more extensive list can be found at

Recently, Gamberger et al. used a multi-layer clustering method [12] to identify clusters of AD patients with respect to several clinical and biological attributes [3]. The same method was applied [13] to detect differences between clusters containing male and female patients. Breskvar et al. used Predictive Clustering Trees (PCTs) [14] to discover and analyse patient clusters. They focused on relations between biological features and the progression of AD by observing behavioural response of patients and their study partners (persons who are in frequent contact with the patient, study with the patient, and are able to assess the patient’s functioning in daily life).

Redescription mining and related fields

In this section, we provide background information related to redescription mining and motivate its choice as a data mining technique used in our work.

The most open-ended, unsupervised data-mining technique, clustering [15–19], finds and groups similar instances based on a predefined similarity measure. It is used when an underlying and possibly interesting natural grouping is unavailable, but also to reveal new groups that were previously unknown. Clustering techniques typically do not create interpretable models of data, so one has to apply another technique in order to get interpretable descriptions of the induced clustering. One such approach, limited to using a single attribute set, is conceptual clustering [20, 21], which aims at finding clusters that can be described with concepts derived by using some description language.

There exists a broad group of descriptive pattern mining techniques that find and describe subsets of examples using a single attribute set (view).

For example, association rule mining [22] finds associations between items (in transaction databases) or different attributes in the form of unidirectional rules. Interesting associations are typically selected based on support and confidence scores of association rules and possibly some other interestingness measures.

Subgroup discovery [23, 24] is a technique that finds queries describing groups of instances having unusual and interesting statistical properties with respect to the target variable. Contrast Set Mining [25] identifies monotone conjunctive queries that best discriminate between instances containing one target class from all other instances (e.g. subjects with diagnosis Alzheimer’s Disease (AD) vs Control (CN) subjects).

In contrast to techniques operating on a single set of attributes, multi-view techniques offer advantages when the available data contains information from various sources or descriptions of different properties of instances (as is the case in this study).

Two-view data association discovery [26] aims at finding a small, non-redundant set of associations that provide insight into how two views are related. The approach can create both bidirectional and unidirectional rules as translation patterns.

Redescription mining, introduced by Ramakrishnan et al. [27], is capable of mining descriptions of subsets of data described by multiple sets of attributes. The building blocks of redescriptions are called queries (logical formulas describing a set of instances by using attributes from some particular view). A redescription describes the same or a very similar subset of instances with different queries, which is an important capability in the context of knowledge discovery.

Rationale for using redescription mining

Redescription mining offers advantages over related techniques and provides specific results required for our analysis. The multi-view and descriptive capabilities of redescription mining make it suitable for relating different biological attributes, many with unknown or scarcely explored role and effects on cognitive impairment, to clinical attributes designed to detect cognitive impairment and make the preliminary diagnosis.

Although a two-view data association discovery approach can be applied to this data, we aimed at discovering interesting equivalence-like associations between biological and clinical attributes on different support levels and validating them with the subjects’ diagnosis, which is possible with redescription mining. Two-view association discovery is also somewhat limited, as it is designed to mine Boolean data and to provide small and non-redundant sets of associations (translations) between different attribute sets. In our discovery study, we aim to create a potentially larger number of understandable redescriptions that can be used as a basis for thorough statistical analyses and for the analysis performed by the domain expert.

Similar data and attributes, related to AD, have been studied before [3, 13, 14, 28]. However, this study is focussed on the analysis of the ADNI data using redescription mining, which enables using its specific advantages over other approaches to find potentially new insights and improve our understanding of the genesis of AD.

Materials and methods

This section describes the data, the notation and related redescription mining approaches, as well as the CLUS-RM algorithm [29, 30] and the motivation for its use in this work. It also describes the algorithmic extensions incorporated into CLUS-RM that enable fully automated constraint-based redescription mining, where we generalize the attribute and instance level constraints introduced by Zaki and Ramakrishnan [31].

Data description

For this study, we extracted data from the ADNI database [1]. To obtain the data, we used the Merged ADNI 1/GO/2 Packages for R [32], located in the study info section of the download data page in the database. This package contains the majority of available datasets in the format of R data frames. The basis of our datasets was the adnimerge data table, which contains measurements of several clinical attributes (derived from questionnaires, observations by doctors and other tests measuring the level of cognition) and biological attributes (different blood tests, genetic markers, attributes derived from brain images, volumes of different parts of the brain, etc.) for 1,737 subjects. There was also a target variable, diagnosis (not used for redescription construction), containing categorical values: control normal (CN), significant memory concern (SMC), early mild cognitive impairment (EMCI), late mild cognitive impairment (LMCI) and probable AD. The values of the target variable can be considered ordered (levels of cognitive impairment). Each subject was assigned to exactly one category and there were no missing values for this variable. By examining the subjects contained in the adnimerge data table, we noticed two distinct groups of subjects for whom some additional distinct attributes were measured. Therefore, we created and studied three related datasets.

The distributions of patients, divided by the level of cognitive impairment, for all three datasets are provided in Table 1.

Table 1. The number of subjects contained in datasets D1, D2 and D3 divided by the level of cognitive impairment.

The division of attributes into clinical (clin) and biological (bio) forms two disjoint sets of attributes used as views in redescription mining. In all datasets, subjects or patients constitute the instances for the redescription mining process.

Table 2 contains full names and abbreviations for all attributes required to present our work, while Tables 3 and 4 contain corresponding basic statistical information for these attributes. Due to data normalization (especially of biological attributes), the original measuring units do not correspond to the attribute values and are not specified in the tables.

Table 2. A list of clinical and biological attributes discussed in the text.

Table 3. Information about value range and percentage of missing values for biological attributes discussed in the text.

Absence of an attribute from a dataset is denoted with “-” in the range and missing columns.

Table 4. Information about value range and percentage of missing values for clinical attributes discussed in the text.

Absence of an attribute from a dataset is denoted with “-” in the range and missing columns. If some dataset has equal range as D1, this is denoted with “-||-” in the appropriate field.

The first dataset (D1) contained 1,737 subjects. The dataset contained a number of biological attributes such as APOE genotype and different brain measurements, such as the volume of the whole brain, the hippocampus, ventricles, and many other structures, including brain images obtained by using the ¹⁸F-fluorodeoxyglucose (FDG)-PET method. The dataset contained various blood analyses, such as levels of white and red blood cells, protein (RCT12) and glucose (RCT11) levels, and many others. It also contained a number of neuropsychological tests, such as the Alzheimer Disease Assessment Scale (ADAS11, ADAS13, etc.), several different Rey Auditory Verbal Learning Tests (RAVLT), Mini-Mental State Examination (MMSE), Functional Assessment Questionnaire (FAQ), and others, including several attributes related to clinical dementia rating (CDR) and geriatric depression scale (GDS). Several features describing the subject’s symptoms, such as presence of nausea (BCNAUSEA), vomiting (BCVOMIT), sweating (BCSWEATN), as well as results of various neurological examinations were also included. Information about attributes and subjects contained in D1 is available in S1 File.

The second dataset (D2) contained 918 subjects. In addition to features contained in the first dataset, it also contained features describing subjects’ performance on the Montreal Cognitive Assessment (MOCA) scale and features related to the Eastern Cooperative Oncology Group (ECOG) Scale of Performance Status. It also contained values of cerebrospinal fluid (CSF), total tau (TAU) and phospho-tau (PTAU) levels. Information about attributes and subjects contained in D2 is available in S2 File.

The third dataset (D3) contained 820 subjects. It was extremely useful for studying the differences and special properties of healthy subjects as compared to patients with severe stages of dementia. This dataset lacked information about ECOG Scale of Performance Status, MOCA, and information about CSF biomarkers, but it contained several additional attributes related to hormones and proteins measured. It also contained information about T2 weighted total cranial vault segmentation (T2TCV) and plasma biomarkers Aβ1−40 and Aβ1−42. One particularly useful imaging attribute was Spatial Pattern of Abnormalities for Recognition of Early AD (SPARE_AD), which was specifically constructed to help in early detection of AD. Dataset D3 also contained the attribute PAPP-A, which is analysed in more detail in this work. The AD assessment scale contained many additional attributes corresponding to different cognitive tasks, the full set of attributes being publicly available on the ADNI web page. Information about attributes and subjects contained in D3 is available in S3 File.

Relation between attributes used in different datasets is visible in Fig 1.

Fig 1. Relations between attributes used in constructed datasets D1, D2 and D3.

Left Venn diagram depicts clinical and right Venn diagram biological attributes.

Division among subjects in the constructed datasets is as follows: D1 = D2 ∪ D3, D2 ∩ D3 = {2002}, where 2002 denotes the roster id (RID), the unique id of the subject contained in the intersection.
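As a quick consistency check on the subject counts reported above, and assuming the relation reads D1 = D2 ∪ D3 with a single shared subject (RID 2002), inclusion-exclusion reproduces the size of D1:

$$|D_1| = |D_2| + |D_3| - |D_2 \cap D_3| = 918 + 820 - 1 = 1737.$$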

In all analysed datasets, there were slightly more males than females. Males constitute 55% of the first, 52% of the second and 58% of the third dataset. They also constitute 57%, 53% and 61% of all subjects with some level of cognitive impairment in these datasets. Pregnancy in female subjects can alter levels of PAPP-A attribute. Although the information about the pregnancy status for female subjects analysed was not directly available in our dataset, documents describing ADNI1 exclusion criteria (which cover patients contained in our dataset D3) [33] clearly state that female participants must be sterile or two years past childbearing potential to be included in the study group. Documents related to ADNIGO [34] and ADNI2 exclusion criteria [35] state that the participant must not be pregnant, lactating or of childbearing potential. As a result of these exclusion criteria, we can assume that the PAPP-A levels, for the studied female subjects, were not influenced by pregnancy.

Redescription mining

Redescription mining [27] works on a dataset D, containing |D| instances and one set, or two disjoint sets of attributes (views, denoted as W1 and W2) describing these instances. A redescription (as for example R = (q1, q2)) is a pair of queries, containing one query per view. Each query is a propositional logic formula that can contain conjunction, disjunction or negation operators and is used to define conditions on values of a subset of attributes from a particular view. The subset of instances described by a query qi, denoted supp(qi) is called the query support set. The support set of a redescription is the set of instances described by both queries that constitute this redescription: supp(R) = supp(q1) ∩ supp(q2). We also use the notation E1,1 to denote the set of instances described by both queries, E1,0 a set of instances described by the first query but not described by the second query, E0,1 a set of instances described by the second query but not described by the first query, E0,0 a set of instances that are not described by either query. E?,1 denotes a set of instances for which it is not possible to determine if they are described by the first query, due to missing values, but are described by the second query, E1,? contains a set of instances described by the first query but for which it is not possible to determine if they are described by the second query, due to missing values, E?,0 denotes a set of instances for which it is not possible to determine if they are described by the first query, due to missing values, and are not described by the second query, E0,? contains a set of instances not described by the first query but for which it is not possible to determine if they are described by the second query, due to missing values. The set E?,? contains instances for which it is not possible to determine if they are described by either query due to missing values. 
attr(R) denotes the multiset of attributes contained in the redescription queries, whereas attrs(R) represents the corresponding set of attributes. attr(D) denotes all attributes contained in both views of the dataset, and $\mathcal{R}$ denotes a redescription set.
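The partition of instances into the sets E1,1, E1,0, E?,1 and so on can be illustrated with a small sketch. It is not taken from CLUS-RM: the helper names `interval_conjunction` and `partition` are hypothetical, the subject values are invented, and the handling of missing values (a conjunction is false as soon as one known condition fails, and undetermined only when the remaining conditions depend on a missing value) is an assumption made for illustration.

```python
def interval_conjunction(bounds):
    """Build a conjunctive query from {attribute: (low, high)} bounds.

    The returned query yields True, False, or None (undetermined because
    of a missing value). This three-valued reading is an assumption.
    """
    def query(instance):
        undetermined = False
        for attr, (low, high) in bounds.items():
            value = instance.get(attr)       # None means "missing"
            if value is None:
                undetermined = True          # this condition cannot be evaluated
            elif not (low <= value <= high):
                return False                 # one failed conjunct decides the query
        return None if undetermined else True
    return query


def partition(instances, q1, q2):
    """Group instance ids into E_{a,b} with a, b in {'1', '0', '?'}."""
    label = {True: "1", False: "0", None: "?"}
    groups = {}
    for inst_id, inst in instances.items():
        key = (label[q1(inst)], label[q2(inst)])
        groups.setdefault(key, set()).add(inst_id)
    return groups


# Invented toy subjects; attribute names follow the paper, values do not.
subjects = {
    "s1": {"GDTOTAL": 1.0, "CDMEMORY": 0.0, "SPARE_AD": -2.0, "HMT18": 3.0},
    "s2": {"GDTOTAL": 1.0, "CDMEMORY": 0.0, "SPARE_AD": 1.5, "HMT18": 3.0},
    "s3": {"GDTOTAL": 5.0, "CDMEMORY": 1.0, "SPARE_AD": -1.0, "HMT18": None},
    "s4": {"GDTOTAL": None, "CDMEMORY": 0.0, "SPARE_AD": -2.5, "HMT18": 1.0},
}
q_clin = interval_conjunction({"GDTOTAL": (0.0, 2.0), "CDMEMORY": (0.0, 0.0)})
q_bio = interval_conjunction({"HMT18": (0.5, 16.0), "SPARE_AD": (-3.86, -0.93)})

groups = partition(subjects, q_clin, q_bio)
support = groups.get(("1", "1"), set())      # supp(R) = E_{1,1}
```

With these toy values, only `s1` lies in the redescription support, while `s3` and `s4` fall into the sets that involve an undetermined query.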

We evaluate the quality of mined redescriptions by using two measures [36]: i) the Jaccard index, which measures the similarity of support sets of the two redescription queries (also often called accuracy of redescription, since it measures how close two query support sets are to containing identical set of instances) and ii) statistical significance of the observed redescription, expressed through a p-value.

The Jaccard index is defined as:

$$J(R) = \frac{|\mathrm{supp}(q_1) \cap \mathrm{supp}(q_2)|}{|\mathrm{supp}(q_1) \cup \mathrm{supp}(q_2)|}$$

Assessment of the statistical significance of the redescription R = (q1, q2) is based on the assumption that the support sets of the two queries q1 and q2 are selected randomly, with marginal probabilities $p_1 = \frac{|\mathrm{supp}(q_1)|}{|D|}$ and $p_2 = \frac{|\mathrm{supp}(q_2)|}{|D|}$, respectively. The statistical significance of a redescription measures how probable it is to obtain an overlap of size |supp(R)| or larger when sampling two subsets of instances from a set of size |D|, using sampling probabilities p1 and p2 respectively. The size of the intersection follows a binomial distribution, and the probability we are looking for can hence be written as:

$$p_V(R) = \sum_{n=|\mathrm{supp}(R)|}^{|D|} \binom{|D|}{n} (p_1 p_2)^n (1 - p_1 p_2)^{|D| - n}$$

Example 1. Redescription Rex = (qclin, qbio), discovered on dataset D3, whose queries are defined as:

qclin: 0.0 ≤ GDTOTAL ≤ 2.0 ∧ GDALIVE = 0.0 ∧ CDMEMORY = 0.0

qbio: 0.5 ≤ HMT18 ≤ 16.0 ∧ −3.86 ≤ SPARE_AD ≤ −0.93,

provides alternative descriptions of 156 different normal control subjects. Query qclin describes 204 subjects with specific values for the following clinical attributes: memory score (CDMEMORY), total score on the geriatric depression scale (GDTOTAL), and the score on the question “Do you think it’s wonderful to be alive now?” (GDALIVE), while query qbio describes 172 subjects having specific values for biological attributes such as the percentage of eosinophils (HMT18) and the Spatial Pattern of Abnormalities for Recognition of Early Alzheimer’s disease (SPARE_AD). The set of subjects described by at least one query of redescription Rex contains 220 subjects, i.e. |supp(qclin) ∪ supp(qbio)| = 220. For 156 of these 220 subjects, both queries are valid, i.e. |supp(qclin) ∩ supp(qbio)| = 156. This means that the Jaccard index (accuracy) for this redescription is $J(R_{ex}) = \frac{156}{220} \approx 0.71$. The redescription is statistically significant, with a p-value < 2 · 10^−17 (which can be computed by using the formula above). This means that it is highly unlikely to observe a redescription of support size 156 or larger given that we combine two statistically independent queries, with marginal probabilities $p_1 = \frac{204}{820}$ and $p_2 = \frac{172}{820}$, into a redescription Rex.
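The numbers in Example 1 can be reproduced with a short sketch. It assumes the binomial model described above (overlap size distributed as Binomial(|D|, p1·p2)) and takes |D| = 820 as the size of dataset D3; the function names are illustrative and are not part of CLUS-RM. The tail sum is evaluated in log space to avoid overflow in the binomial coefficients.

```python
import math


def jaccard_from_counts(n_both, n_q1, n_q2):
    """Jaccard index from |supp(q1) ∩ supp(q2)|, |supp(q1)| and |supp(q2)|."""
    n_union = n_q1 + n_q2 - n_both
    return n_both / n_union


def binomial_tail_p_value(n_both, n_q1, n_q2, n_total):
    """P(overlap >= n_both) for overlap ~ Binomial(n_total, p1 * p2).

    Assumes 0 < p1 * p2 < 1 so that both logarithms are defined.
    """
    p = (n_q1 / n_total) * (n_q2 / n_total)
    log_p, log_q = math.log(p), math.log1p(-p)
    total = 0.0
    for n in range(n_both, n_total + 1):
        # log of the binomial coefficient C(n_total, n)
        log_comb = (math.lgamma(n_total + 1) - math.lgamma(n + 1)
                    - math.lgamma(n_total - n + 1))
        total += math.exp(log_comb + n * log_p + (n_total - n) * log_q)
    return total


# Counts from Example 1: 156 shared subjects, 204 and 172 per query, |D3| = 820.
j_ex = jaccard_from_counts(156, 204, 172)
p_ex = binomial_tail_p_value(156, 204, 172, 820)
print(round(j_ex, 2))      # 0.71
print(p_ex < 2e-17)        # True
```

Under independence the expected overlap is only about 820 · p1 · p2 ≈ 43 subjects, which is why an observed overlap of 156 yields such a small tail probability.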

Existing approaches for redescription mining.

The first algorithm for redescription mining, called CARTwheels, was developed by Ramakrishnan et al. [27]. Several redescription mining algorithms have been developed since, all of which can handle Boolean attributes. Of these, some algorithms [29, 30, 37, 38] also work with categorical and numerical attributes. Currently, only two redescription mining algorithms, ReReMi [37] and CLUS-RM [29, 30], work with attributes containing missing values.

Redescription mining algorithms can be divided into three main categories: a) algorithms based on itemset mining, b) greedy algorithms and c) tree-based algorithms.

Itemset mining based redescription mining algorithms utilize different itemset mining methods to create itemsets, which are used to create redescriptions. The approach by Zaki and Ramakrishnan [31] and the approach by Parida and Ramakrishnan [39] use a lattice (partially ordered set) of attribute sets to find redescriptions. The approach developed by Gallo et al. [40] is based on frequent itemset mining. The field is known as Frequent Itemset Mining because the notion of frequency (support size, the apriori principle) is central to obtaining practical algorithms.

Greedy algorithms for redescription mining work by incrementally updating queries with the goal of increasing redescription accuracy. The first algorithm developed in this category was the greedy algorithm from Gallo et al. [40]. This algorithm has been extended by Galbrun and Miettinen [37], under the name ReReMi, to work with categorical and numerical attributes.
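The greedy idea can be illustrated with a deliberately simplified sketch. This is not the algorithm of Gallo et al. or ReReMi: it assumes Boolean attributes only, conjunctive queries only, no missing values, and a hypothetical representation in which each view maps an attribute name to the set of instances for which the attribute is true.

```python
def jaccard(a, b):
    """Jaccard index of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def greedy_redescription(view1, view2, max_literals=3):
    """Greedily grow two conjunctive queries over Boolean attributes.

    view1, view2: dict mapping attribute name -> set of instance ids
    for which the attribute is true (toy representation, an assumption).
    """
    # Seed with the single-attribute pair of highest Jaccard index.
    score, a1, a2 = max(
        ((jaccard(s1, s2), n1, n2)
         for n1, s1 in view1.items() for n2, s2 in view2.items()),
        key=lambda t: t[0],
    )
    q1, q2 = [a1], [a2]
    supp1, supp2 = set(view1[a1]), set(view2[a2])

    while True:
        # Score every one-literal conjunctive extension of either query.
        candidates = []
        if len(q1) < max_literals:
            candidates += [(jaccard(supp1 & s, supp2), 1, n)
                           for n, s in view1.items() if n not in q1]
        if len(q2) < max_literals:
            candidates += [(jaccard(supp1, supp2 & s), 2, n)
                           for n, s in view2.items() if n not in q2]
        if not candidates:
            break
        new_score, side, name = max(candidates, key=lambda t: t[0])
        if new_score <= score:
            break  # no extension improves accuracy, so stop
        score = new_score
        if side == 1:
            q1.append(name)
            supp1 &= view1[name]
        else:
            q2.append(name)
            supp2 &= view2[name]
    return q1, q2, score


# Invented toy views over instance ids 1..7.
toy_view1 = {"a": {1, 2, 3, 4, 5}, "b": {1, 2, 3, 6}}
toy_view2 = {"x": {1, 2, 3}, "y": {1, 2, 3, 4, 7}}
result = greedy_redescription(toy_view1, toy_view2)
print(result)  # (['b', 'a'], ['x'], 1.0)
```

On the toy data the search starts from the pair (b, x) with Jaccard index 0.75 and reaches 1.0 by adding a to the first query; published greedy algorithms additionally handle disjunctions, negations and, in the case of ReReMi, numerical and categorical attributes.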

Tree-based algorithms use decision trees [41] or Predictive Clustering trees (PCTs) [42] to create redescriptions. This category includes the first developed algorithm for redescription mining, CARTwheels, developed by Ramakrishnan et al. [27]. This algorithm works by building two decision trees per iteration (one for each view) that are joined in the leaves. Redescriptions are created by reading off the conditions along the paths from the root node of the first tree to some specified class (which constitutes one redescription query) and the paths from the root node to the matching leaves of the second tree (which constitutes the second redescription query). All created trees are of the same predefined depth, and the process iterates for a predefined number of iterations. This algorithm uses multiclass classification to guide the search between the two views.

The Layered trees (LayeredT) and Split trees (SplitT) algorithms developed by Zinchenko [38] use a different methodology of decision tree construction to obtain redescriptions. Instead of creating fully grown trees of predefined depth, the Layered trees algorithm creates one or more depth one trees at each algorithm step. For each leaf of the tree under construction, at some fixed iteration, the Layered trees algorithm builds a new depth one tree and appends it to the corresponding leaf of the existing tree (thus increasing its complexity and size). The algorithm allows creating more informed splits, since at a certain step of tree construction, the algorithm uses information about splits created at a corresponding level of the tree constructed on the opposite view. To construct a tree of maximal depth, the algorithm considers all nodes of the tree created on the opposite view (not just the leaves of a fully grown tree as in CARTwheels).

The Split trees algorithm creates decision trees of increasing size. At each step of tree construction, the depth is increased by one and a whole new tree of larger depth is built (completely replacing the previously constructed tree) until trees of maximally allowed depth are built. This algorithm simultaneously refines classes (since it obtains finer splits with trees of larger depth) and trees (by increasing their complexity and providing more specific classes).

The CLUS-RM algorithm developed by Mihelčić et al. [29, 30] uses multi-target Predictive Clustering trees (PCTs) [42, 43] instead of decision trees to construct redescriptions. Using multi-target PCTs allows using information about all nodes (intermediate nodes as well as leaves) in the constructed PCT simultaneously to create redescriptions (which increases the accuracy, diversity and number of produced redescriptions). This algorithm has been extended by Mihelčić et al. [44] to use a random forest of PCTs, which further increases the accuracy and diversity of produced redescriptions. CLUS-RM is also equipped with a redescription set construction procedure called redescription set optimization [29, 30, 44]. It enables incorporating quality constraints in a multi-objective optimization manner and uses all produced redescriptions to create a reduced redescription set of user-defined size. A generalized version of redescription set optimization has been presented by Mihelčić et al. [45]. In addition to its main purpose of redescription set construction, this procedure allows the use of ensembles of redescription mining algorithms, influencing the structure of produced sets through user-defined importance weights, and performing computationally efficient construction of multiple redescription sets with different properties, which is beneficial for exploration [45].

Choice of methodology, redescription accuracy measure and a query language

In this section, we describe our motivation for using the CLUS-RM algorithm and the extensions made to allow constraint-based redescription mining. In addition, we explain our choice of the redescription accuracy evaluation measure and of the specific query language used to construct redescriptions.

Choice of redescription mining algorithm.

To create redescriptions, we used the CLUS-RM algorithm [29, 30] based on Predictive Clustering trees (PCT) [42, 43]. PCTs allow clustering on both target and descriptive space. By using their multi-label and multi-target capability, one can use multiple (or all) nodes in a given tree simultaneously to produce redescriptions. Due to the property of inductive transfer [46], multi-target classification can outperform single-target classification, which improves the overall accuracy of produced redescriptions. The CLUS-RM algorithm incorporates a redescription set optimization procedure (a novelty compared to other redescription mining approaches), which uses the large number of diverse redescriptions produced to optimize a redescription set of user-defined size.

Using a large number of produced redescriptions in the optimization process increases the quality of the redescription set presented to the user. The optimization process evaluates redescriptions according to accuracy, significance and redundancy (with respect to redescription support sets and attributes contained in redescription queries).

Since our data contain missing values, we could only use the CLUS-RM or the ReReMi algorithm to find redescriptions. Given our goal of using the produced redescription sets for further statistical analysis, several reasons motivate the use of CLUS-RM as the redescription mining algorithm in this work. CLUS-RM can produce potentially large sets of redescriptions that can be used to perform statistical analysis (e.g. of obtained associations). Multiple different redescriptions containing the same attribute pair and describing different subsets of instances reinforce the importance of frequently co-occurring attributes. CLUS-RM can constrain redescription support set size to an interval, which provides experts with a range of associations (hypotheses), from general (intervals containing larger support set sizes) to more specific (intervals containing smaller support set sizes). It can also produce redescription sets of user-defined size, which allows creating sets that contain an equal number of members per support interval for further statistical analysis. Because of this, association statistics will not be constructed only from very general or very specific redescriptions, but from redescriptions covering a whole range of different support sizes. The experiments with CLUS-RM [30] and its extension [44], as well as the integration of CLUS-RM into a redescription mining framework for redescription set construction [45], show that the produced redescription sets were fully competitive with other state-of-the-art solutions, and in some cases (as when only conjunctions are used in redescription query construction) the resulting redescription sets can even contain significantly more accurate and diverse redescriptions.

To obtain the results presented in this work, we required the constraint-based redescription mining capability, mostly using one attribute as a constraint. However, developing a constraint-based methodology that can use multiple attributes (or instances) as constraints was straightforward and is also presented as a part of this work. The proposed extensions include several modes of constraint-based redescription mining (CBRM) that allow exploring interactions of multiple attributes from different views with Boolean, categorical and numerical variables, extending the state of the art in CBRM. Instance-level constraints can be incorporated in an analogous fashion.

The one-attribute CBRM capability of Siren [47] allows selecting one attribute as constraint and defining its numerical interval (for numerical attributes). The resulting redescription set is comprised of redescriptions that are obtained by extending the initial query supplied by the user. When compared to this limited CBRM capability of Siren, the CLUS-RM extension operates in a fully automated constraint-based setting (allowing multiple attributes as constraints). Also, it is not necessary to manually select numerical bounds as is currently the case in Siren. In general, performing interactive constraint-based redescription mining can demand significant effort and time from the domain expert (in addition to examination of computed redescriptions, which also needs to be done in our approach), but can potentially enable tuning the algorithm better to find information about some specific, targeted problem.

Analysis and exploration of precomputed redescription sets, based on multiple different redescription criteria, exploration of different attribute associations and groupings of instances based on a produced redescription set is also possible with the tool InterSet [48].

Choice of redescription accuracy measure.

Since the data contain missing values, we used the query non-missing Jaccard index, introduced in [30] and further explained in [45], to evaluate redescriptions. The query non-missing Jaccard index is defined as:

Query non-missing Jaccard index evaluates instances as being a part of redescription support set only if there is enough information in the data (given the query language) to deduce that these instances satisfy the conditions of both redescription queries. The construction of this measure is guided by the principle that the query cannot contain an instance in its support set if it cannot be evaluated due to missing values. Because of this, the measure does not penalize the score with instances contained in the sets E?,?, E0,?, E?,0 and rather treats them as if they were contained in the set E0,0 but penalizes the score with instances contained in the sets E?,1 and E1,? and treats them as if they are contained in sets E1,0 and E0,1.

Query non-missing Jaccard index has been designed to trade-off between the pessimistic and the optimistic Jaccard index [36], which are each forcing opposite extreme values and are thus leading to less realistic estimates of the true Jaccard index. Query non-missing Jaccard index is optimistic because it does not penalize the score with instances that are not described by one query and cannot be evaluated by the other query, due to missing values (E?,0, E0,?). On the other hand, it is pessimistic, since it penalizes the score with instances that are described by one redescription query, but cannot be evaluated by the other, due to missing values (E1,? and E?,1). Redescription accuracy estimates provided by query non-missing, pessimistic and optimistic Jaccard index have already been compared experimentally in [45].
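The treatment of the instance sets described above can be illustrated with a minimal Python sketch. It follows only the prose description (instances in E1,? and E?,1 are counted like E1,0 and E0,1, while instances in E?,?, E0,? and E?,0 are counted like E0,0). Representing a query outcome as True, False or None (None meaning the query cannot be evaluated because of missing values) is an assumption made for this example, as is the function name.

```python
from typing import Optional, Sequence


def qnm_jaccard(q1: Sequence[Optional[bool]],
                q2: Sequence[Optional[bool]]) -> float:
    """Query non-missing Jaccard index, sketched from the prose description.

    q1[i] / q2[i] is True or False when the query can be evaluated on
    instance i, and None when missing values prevent evaluation.
    """
    both = 0       # instances in E1,1
    penalised = 0  # instances in E1,0, E0,1, E1,? and E?,1
    for a, b in zip(q1, q2):
        if a is True and b is True:
            both += 1
        elif a is True or b is True:
            # exactly one query is known to hold; the other is False or missing
            penalised += 1
        # E0,0, E0,?, E?,0 and E?,? do not affect the score
    denominator = both + penalised
    return both / denominator if denominator else 0.0


# One instance per possible situation (hypothetical toy data).
left = [True, True, True, False, None, None, False]
right = [True, False, None, None, None, False, True]
print(qnm_jaccard(left, right))
```

In the toy data, one instance supports both queries and three instances are penalised, so the index is 1/4.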

Choice of a query language.

In this work, our redescriptions consist exclusively of conjunctive queries. Queries containing only conjunction operators are easier to understand and usually shorter than those containing a combination of all operators. In redescriptions with queries containing only conjunction operators, every attribute used in the queries must describe all instances from the redescription support set. Thus, such redescriptions discover stronger associations between attributes than redescriptions with queries containing all operators. These reasons make us believe that applying CLUS-RM restricted to conjunctions to the ADNI data is the right choice, which may reveal useful medical hypotheses that can be further developed by the domain experts. The described query language is similar to the one used in bi-directional association rules which can, for instance, be produced by the two-view data association discovery approach discussed in the Introduction section. In general, using negation and disjunction operators in redescription construction can increase the diversity and accuracy of produced redescriptions, but it can also make them more difficult to understand for domain experts.

CLUS-RM algorithm description

All experiments were performed with the CLUS-RM redescription mining algorithm [29, 30], presented in Algorithm 1. CLUS-RM uses PCTs [43] to find descriptions of groups of instances (i.e. subjects, as is the case in our medical study).

Algorithm 1 The CLUS-RM algorithm

Require: First view (W1), Second view (W2), maxIter, Quality constraints

Ensure: A set of redescriptions

1: procedure CLUS-RM

2:   createInitialData(W1, W2)

3:   createInitialPCTs(, )

4:   extractRulesFromPCT()

5:  for Ind ∈{1, …, maxIter} do

6:    constructTargets(, )

7:    createPCTs()

8:    extractRulesFromPCT()

9:   for () do

10:     addReplaceDiscard()

11:   minimizeQueries()

12:  return

The presented algorithm pseudocode describes the CLUS-RM functionality in case only conjunction logical operators are used to create redescription queries. The extended version of the algorithm pseudocode for the case in which conjunction, negation and disjunction logical operators can be used in redescription query construction is described in [30] and supplementary document S18 File.

The algorithm consists of four main parts: 1) Initialization, 2) Query creation (divided into query construction 2.1 and query exploration 2.2), 3) Redescription creation and 4) Redescription set optimization.

1) In the initialization phase (line 2 in Algorithm 1), the algorithm makes a copy of each instance from the original dataset and shuffles the attribute values for the copies. For each attribute, the algorithm selects a random instance from the dataset and copies its value for the selected attribute to the target copy (the value of one instance from the original dataset can be copied multiple times). This procedure breaks correlations between attributes in the copied instances. Each instance from the original dataset is assigned a target value of 1.0 and each artificially created instance a target value of 0.0. The PCT algorithm can then be used to create initial clusters from such a dataset by distinguishing between original instances and copies containing shuffled values (line 3 in Algorithm 1). The described procedure is repeated independently for both views contained in the dataset.
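The initialization step described above can be sketched in Python as follows. This is an illustrative reading of the prose, not the CLUS-RM implementation: the function name, the list-of-lists representation of a view and the seed parameter are assumptions introduced for the example.

```python
import random
from typing import List, Sequence, Tuple


def create_initial_data(view: Sequence[Sequence[float]],
                        seed: int = 0) -> Tuple[List[List[float]], List[float]]:
    """Return the original instances plus shuffled copies, with targets.

    Original instances get target 1.0. For every original instance one
    artificial copy is created; each attribute value of the copy is taken
    from a randomly selected original instance (values may repeat), which
    breaks correlations between attributes. Copies get target 0.0.
    """
    rng = random.Random(seed)
    n_attributes = len(view[0])
    rows = [list(instance) for instance in view]
    targets = [1.0] * len(rows)
    for _ in view:
        copy = [view[rng.randrange(len(view))][j] for j in range(n_attributes)]
        rows.append(copy)
        targets.append(0.0)
    return rows, targets


# Hypothetical toy view with three instances and two attributes.
toy_view = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
rows, targets = create_initial_data(toy_view)
print(len(rows), targets)
```

A tree learner trained to separate target 1.0 from target 0.0 on such data has to pick up attribute combinations that occur in the real instances but not in the shuffled ones, which is what yields the initial clusters.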

2.1) Each node in the obtained PCTs represents a cluster. These nodes are transformed to rules (line 4 in Algorithm 1) which are valid for the corresponding group of instances. More details about transforming PCTs to rules can be seen in [49].

2.2) The next step is to describe the same groups of instances as those described by the produced rules, but with the second attribute set (lines 6−8 in Algorithm 1). To do this, for each instance of the original dataset, the algorithm constructs a set of target variables containing as many targets as there are rules constructed using the first set of attributes (for more details see [30]). The instance has a target value of 1 on position j if it is described by the j-th rule from the set of rules constructed on the first set of attributes; otherwise the value is 0. Instances for which missing information makes it impossible to determine membership in the support set of the query are also labelled with 0. We use the multi-target classification and regression capability of PCTs to construct clusters on different views containing similar instances. The procedure is repeated by creating initial rules on the second view and describing similar sets of instances by using attributes from the first view.
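The target construction in step 2.2 can be sketched as follows. The representation is an assumption made for illustration: a rule is a conjunction of closed numerical intervals stored as a dictionary, an instance is a dictionary whose missing values are None, and the attribute AGE and all interval bounds are hypothetical.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Rule = Dict[str, Tuple[float, float]]      # attribute -> (lower, upper)
Instance = Dict[str, Optional[float]]      # attribute -> value or None


def covers(rule: Rule, instance: Instance) -> bool:
    """True only if every condition of the conjunctive rule is known to hold."""
    for attribute, (lower, upper) in rule.items():
        value = instance.get(attribute)
        if value is None or not (lower <= value <= upper):
            # a missing or out-of-range value means membership cannot be confirmed
            return False
    return True


def construct_targets(rules: Sequence[Rule],
                      instances: Sequence[Instance]) -> List[List[int]]:
    """One row per instance, one 0/1 target per rule from the other view."""
    return [[1 if covers(rule, inst) else 0 for rule in rules]
            for inst in instances]


rules = [{"MMSE": (0, 24)}, {"MMSE": (25, 30), "AGE": (60, 80)}]
instances = [
    {"MMSE": 20, "AGE": 70},
    {"MMSE": 28, "AGE": 65},
    {"MMSE": 28, "AGE": None},   # missing value, labelled 0 for the second rule
]
print(construct_targets(rules, instances))
```

The resulting matrix is what a multi-target learner would be trained on, with attributes of the other view as descriptive variables.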

3) Once the algorithm obtains rules for both views, it combines them by computing the Cartesian product of the two rule sets (line 9 in Algorithm 1). Each redescription is evaluated against various user-predefined constraints (such as minimal redescription accuracy, minimal support and maximal p-value, contained in a set of redescription quality constraints) to select candidates for redescription set optimization.
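A minimal sketch of this combination step is given below. It assumes that each rule is already summarised by its support set and, for brevity, uses the plain Jaccard index without missing values and omits the p-value constraint. The function name and the toy rule labels are hypothetical.

```python
from itertools import product
from typing import Dict, List, Set, Tuple


def combine_rules(rules_w1: Dict[str, Set[int]],
                  rules_w2: Dict[str, Set[int]],
                  min_jaccard: float,
                  min_support: int) -> List[Tuple[str, str, float]]:
    """Pair every rule of view 1 with every rule of view 2 and keep the
    pairs that satisfy the accuracy and support constraints."""
    candidates = []
    for (q1, s1), (q2, s2) in product(rules_w1.items(), rules_w2.items()):
        intersection = s1 & s2
        union = s1 | s2
        if not union:
            continue
        jaccard = len(intersection) / len(union)
        if jaccard >= min_jaccard and len(intersection) >= min_support:
            candidates.append((q1, q2, jaccard))
    return candidates


# Hypothetical rules, each mapped to the set of instance ids it describes.
view1 = {"a": {1, 2, 3, 4}, "b": {5, 6}}
view2 = {"x": {1, 2, 3}, "y": {6, 7, 8}}
print(combine_rules(view1, view2, min_jaccard=0.5, min_support=2))
```

Only the pair ("a", "x") passes here: its queries share three of four described instances, while the other pairs share at most one.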

4) Each redescription satisfying all user-defined redescription quality constraints is a candidate for redescription set optimization (line 10 in Algorithm 1). Satisfactory redescriptions are added to the redescription set, in the order of creation, until the maximal number of redescriptions (user-defined parameter) is reached. When this number is reached, the algorithm computes the score difference, defined in [29, 30], between the new redescription and every redescription already contained in the redescription set based on redescription score. The score of some redescription , based on its support set and a redescription set , is computed as: where denotes the number of times, the instance i is described by redescriptions from the redescription set . The denominator of a score redScoreInst(R) can be also written as . Similarly, the redescription score: is based on attributes contained in redescription queries, where denotes the number of times attribute a is used in queries of redescriptions contained in . The denominator of a score redScoreAttr(R) can be also written as .

The score of a newly created redescription Rnew is computed in the same way as the score for some but using frequencies for all redescriptions contained in the set in the numerator of redScore and redScoreAttr.

The error score is computed as errSc(R) = 1.0 − J(R) and the final redescription score is computed as: where . Lower total redescription score is favourable because it implies smaller error in redescription accuracy and a smaller level of instance and attribute redundancy with respect to other redescriptions from the set . The user-defined weights αk regulate the importance of different scores, which affects the properties of the resulting redescription set. In this work, we use . The redescription contained in the redescription set that has the highest score difference with the newly created redescription is replaced, thus improving the overall redescription set quality. At each redescription exchange all frequency scores are updated.

The minimization procedure introduced in [30] and performed in line 11 of Algorithm 1 is a heuristic procedure designed to reduce the size of redescription queries by removing redundant attributes (attributes that can be removed without changing redescription accuracy). It is performed individually on each redescription of the resulting redescription set.
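The idea of removing attributes that do not change redescription accuracy can be sketched with a simple greedy pass. This is only an illustration of the principle and not the heuristic from [30]: it assumes interval conditions, data without missing values, the plain Jaccard index, and a fixed support set for the query of the other view. All names are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Query = Dict[str, Tuple[float, float]]   # attribute -> (lower, upper)


def support(query: Query, data: List[Dict[str, float]]) -> Set[int]:
    """Indices of instances satisfying every interval condition of the query."""
    return {i for i, inst in enumerate(data)
            if all(lo <= inst[a] <= hi for a, (lo, hi) in query.items())}


def jaccard(s1: Set[int], s2: Set[int]) -> float:
    union = s1 | s2
    return len(s1 & s2) / len(union) if union else 0.0


def minimize_query(query: Query, other_support: Set[int],
                   data: List[Dict[str, float]]) -> Query:
    """Greedily drop attributes whose removal leaves the accuracy unchanged."""
    current = dict(query)
    baseline = jaccard(support(current, data), other_support)
    for attribute in list(current):
        if len(current) == 1:
            break  # keep at least one condition in the query
        trial = {a: c for a, c in current.items() if a != attribute}
        if jaccard(support(trial, data), other_support) == baseline:
            current = trial
    return current


data = [{"x": 1, "y": 5}, {"x": 2, "y": 5}, {"x": 9, "y": 5}]
query = {"x": (0, 3), "y": (0, 10)}
print(minimize_query(query, other_support={0, 1}, data=data))
```

In the toy data the condition on y holds for every instance, so it is removed, while removing the condition on x would lower the accuracy and is therefore rejected.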

Constraint-based redescription mining.

In this work, we extended the CLUS-RM algorithm to a constraint-based redescription mining setting. The algorithm incorporates constraints in redescription creation and one additional score in the optimization function used for redescription set creation. The CBRM extensions of the CLUS-RM algorithm needed when conjunction, negation and disjunction operators can be used in redescription query construction are described in supplementary document S18 File.

We present the attribute level constraints useful for gaining knowledge as demonstrated in this work. Constraints involving instances can be introduced in the analogous fashion by using redescription support set (supp(R)) instead of attribute set (attrs(R)) in formulas (1), (2) and (3).

Constraint-based redescription mining, first defined in [31], allows placing constraints on attributes that must occur in redescription queries or instances that must be contained in the redescription support set. The constraints are in the form , where each constraint Ci specifies a set of attributes that must occur in redescription queries or a set of instances that must be contained in the redescription support set. In the original formulation, at least one constraint Ci must be satisfied by a redescription (it must contain all attributes or instances specified in the set) for it to be presented to the user. We denote this original definition as strict constraint-based redescription mining and mostly use it in our study. In practice, various relaxed versions of constraint-based redescription mining might be useful. In the continuation, we specify one existing (strict) and two newly defined (soft and suggested) modes of constraint-based redescription mining (focusing only on attribute constraints):

  1. Strict constraint-based redescription mining: there must exist at least one constraint such that all defined attributes occur in redescription queries.
  2. Soft constraint-based redescription mining: there must exist at least one constraint such that a part of defined attributes occurs in redescription queries. Satisfying larger portion of constraints is favoured by the redescription evaluation score.
  3. Suggested constraint-based redescription mining: defined constraints are used as suggestions that increase the overall redescription score depending on the number of satisfied constraints, however high quality redescriptions not satisfying any of these constraints can also enter redescription set if their total score is high enough.
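The difference between the modes listed above can be illustrated with a small Python sketch operating on the set of attributes used in a redescription and a list of constraint sets. The function names are assumptions, the example constraint sets are hypothetical, and the fraction computed by best_fraction is only an illustration of "satisfying a larger portion of constraints", not one of the scores defined in this work.

```python
from typing import FrozenSet, Iterable, Set


def satisfies_strict(attrs: Set[str],
                     constraints: Iterable[FrozenSet[str]]) -> bool:
    """Strict mode: some constraint set is fully contained in the queries."""
    return any(c <= attrs for c in constraints)


def satisfies_soft(attrs: Set[str],
                   constraints: Iterable[FrozenSet[str]]) -> bool:
    """Soft mode: some constraint set is at least partly contained."""
    return any(c & attrs for c in constraints)


def best_fraction(attrs: Set[str],
                  constraints: Iterable[FrozenSet[str]]) -> float:
    """Largest fraction of a single constraint set covered by the queries."""
    return max((len(c & attrs) / len(c) for c in constraints if c),
               default=0.0)


constraints = [frozenset({"PAPP-A", "MMSE"}), frozenset({"ADAS"})]
redescription_attrs = {"PAPP-A", "CDRSB"}
print(satisfies_strict(redescription_attrs, constraints))  # no set fully covered
print(satisfies_soft(redescription_attrs, constraints))    # PAPP-A is present
print(best_fraction(redescription_attrs, constraints))     # half of the first set
```

In suggested mode no such filter is applied; agreement with the constraints only influences the redescription score, so a redescription with none of the listed attributes can still enter the set.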

Strict constraint-based redescription mining can be used when the expert already has a hypothesis (obtained through domain knowledge and extensive experimentation) and wants to explore the specified associations in more detail. Soft constraint-based redescription mining can be used when a set of attributes of interest has been determined (by applying the combination of domain knowledge and experimentation) but it is not clear which interactions from the set should be fully explored. Thus, further study of their interactions is needed to form, refine or confirm the expert hypothesis. Suggested constraint-based redescription mining can be used when the expert, knowing the research domain (having a priori knowledge about the problem), selects a set of attributes that are known or suspected to be (currently) more interesting for exploration, though at current stage there is no immediate focus on any particular hypothesis.

To allow constraint-based redescription mining, we extend the CLUS-RM algorithm by adding a new set of constraints containing the user-defined attributes of special interest and a type of CBRM used (parameter ). Line 9 of Algorithm 1 is changed to . Thus, redescriptions are created only by combining those queries that satisfy predefined constraints. For each redescription Rnew, we apply query minimization procedure before using redescription set optimization (defined in line 10 of Algorithm 1). If query minimization procedure removes any of the key constraint attributes, defined in set of CBRM, the created redescription is discarded.

In addition, CLUS-RM is extended with a new score scConstr, which is used in suggested constraint-based redescription mining to increase the overall score of a redescription satisfying user-defined attribute constraints. The score is defined as: (1)

The first term in the score rewards redescriptions satisfying higher fraction of constraints from some set Ci. Due to the fact that more disjoint or partially overlapping constraint sets can be given and the fact that some redescriptions can satisfy parts of larger number of constraint sets Ci, we take the maximum score achieved among constraint sets as a quality of redescription—thus favouring compliance with larger number of constraints from a single constraint set. The second term favours redescriptions that, among the attributes contained in their queries, have larger fraction of attributes of interest to the user. Here, we reward satisfied constraints from any constraint set defined by the user.

The score used for soft constraint-based redescription mining is defined as: (2) Similarly, the score used for strict constraint-based redescription mining is defined as: (3) Higher scores denote higher level of agreement of redescriptions with the imposed constraints (redescriptions with higher score are thus preferable).

Finally, redescription score sc(R) is extended to: where and redScoreConst(R) denotes any variant of the constraint-based score chosen by the user. Redescriptions with the score value of ∞ are not allowed to enter redescription set.

With the extension introduced above, CLUS-RM is the only redescription mining algorithm capable of performing fully automated constraint-based redescription mining with more than one attribute constraint on data containing categorical and numerical attributes and missing values.

Experiments and results

In this section, we present the experimental setup and some selected results obtained through the analyses of the produced redescription sets.


Our main goal was to study clinical and biological attributes, and to find interesting relations among them. To retrieve maximum information from the data and to obtain deeper insight into it, we divided redescriptions by the number of described subjects and used the diagnosis of the level of cognitive impairment to further assess the relevance and interestingness of the obtained redescriptions. For each dataset, we created four redescription sets containing redescriptions with different supports, describing [5, 10], [11, 39], [40, 99] and at least 100 subjects. The maximum support threshold was set to subjects contained in the dataset Di, i ∈ {1, 2, 3}. We are interested in re-describing subsets of subjects with some level of cognitive impairment and using cognitively normal subjects as a control group. Studying different biological and clinical attributes and their interactions in the context of different levels of cognitive impairment is also of high interest. Higher homogeneity of described subjects increases the amount of information obtained about different changes in biological and clinical attributes occurring as a result of different levels of cognitive impairment. Developing an approach with the combined properties of redescription mining and subgroup discovery may also be interesting in this setting, but is beyond the scope of this work. Each set contains 100 redescriptions with a minimal Jaccard index of 0.2 and a maximal p-value of 0.01. Allowed support intervals, as well as other parameter limits, were found through experimentation. Redescriptions contained up to 8 attributes per query.

The same support intervals were used to create redescriptions on each dataset. This allows making easier comparisons of redescriptions and statistics of attribute co-occurrence across different datasets. Distribution analysis of redescription quality measures, in the produced redescription sets, reveals potentially interesting datasets, attributes and support intervals.

Since PAPP-A showed interesting associations with cognitive impairment in the experiments described above, we performed constraint-based redescription mining with the same algorithmic parameters but focusing redescription search on redescriptions containing pregnancy associated plasma protein A (PAPP-A) in the redescription queries. We created one redescription set containing 100 redescriptions describing at least 100 subjects.

Redescription accuracy and homogeneity analysis

We merged the four sets of redescriptions, of different supports, created on each dataset (D1, D2, D3) and formed one larger redescription set (RS) per dataset, denoted (see Fig 2). For the obtained redescriptions, contained in the corresponding redescription sets (), we analysed the homogeneity of the described subsets of subjects with respect to the degree of cognitive impairment (CN, SMC, EMCI, LMCI and AD) by computing the entropy of the described subjects' medical diagnoses (demonstrated in Fig 2).

Fig 2. Entropy (i) and Jaccard index (ii) value distributions for the redescription sets created on each dataset (first dataset—D1 at the top, third dataset—D3 at the bottom).

For a dataset Di, i ∈ {1, 2, 3}, we create four redescription sets so that the number of described subjects in each redescription (from a particular redescription set) falls in the corresponding interval shown on the y-axis (boxplots representing distributions for each interval are shown in different colours). Each redescription set contains 100 redescriptions.

The entropy was computed for the support set of each redescription by using the package entropy developed for the programming language R. The package allows estimating Shannon’s entropy () [50] of some finite set of probabilities obtained from the observed counts (occurrence frequencies of each level of cognitive impairment in the redescription support set). In this use-case, N equals the number of different target classes occurring in the support set of a redescription. Probability pi is computed as , where targeti, i ∈ {0, …, N − 1} denotes a set of entities with target label i. Characteristics of redescription sets produced with different support intervals (1., 2., 3., 4. in Fig 2), can be seen on a plot showing entropy distributions (i in Fig 2) and distributions of redescriptions’ Jaccard index (ii in Fig 2).
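For readers who want to reproduce the homogeneity measure outside R, the computation can be sketched in Python. This is a plain empirical (plug-in) estimate from the observed diagnosis counts and not a re-implementation of the R package; using the natural logarithm is an assumption of this sketch, and the function name is hypothetical.

```python
import math
from collections import Counter
from typing import Sequence


def diagnosis_entropy(labels: Sequence[str]) -> float:
    """Shannon entropy of the diagnosis labels in a redescription support set.

    Each probability is the share of subjects in the support set that
    carry a given diagnosis (CN, SMC, EMCI, LMCI or AD).
    """
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


# Hypothetical support sets.
homogeneous = ["AD"] * 8
mixed = ["AD", "AD", "LMCI", "LMCI", "CN", "CN", "EMCI", "EMCI"]
print(diagnosis_entropy(homogeneous))  # zero: a single diagnosis
print(diagnosis_entropy(mixed))        # log(4): four equally frequent diagnoses
```

An entropy of zero corresponds to a support set with a single diagnosis, and the value grows as diagnoses become more evenly mixed.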

Due to the smaller diversity in target classes (containing no SMC subjects and only 1 EMCI subject), it was easier to distinguish between different groups of subjects on dataset 3 (which is illustrated in Fig 2) than on the other two datasets. On dataset 3, we obtained many clusters of various size, homogeneous with respect to medical diagnosis, which gives us confidence that we found attribute combinations and numerical intervals useful for the analysis and understanding of cognitive impairment connected to AD.

The entropy increases with the number of described subjects, while the Jaccard index shows stronger associations in redescriptions with support in the first (|supp(R)| ≥ 100 in Fig 2) and the last interval (|supp(R)| ∈ [5, 10] in Fig 2). Redescriptions describing the smallest number of subjects (the last interval) use a larger number of attributes with very specific numerical intervals to isolate groups of subjects that are very homogeneous with respect to the medical diagnosis, and describe many different groups of subjects suffering from severe cognitive impairment (LMCI, AD). In contrast, many accurate redescriptions in the first interval use larger numerical intervals, thus often describing subjects with various levels of cognitive impairment. An additional reason for higher accuracy in this interval compared to the middle two intervals is the detection of highly accurate redescriptions describing subgroups of CN subjects. Missing values in the data and potential noise, arising from errors in measurements and data processing, negatively affect the Jaccard index.

Analyses based on examination of redescription sets

Redescription set analyses, which included: a) the examination and expert evaluation of individual redescriptions, b) the distribution analysis of level of dementia for the described subjects of these redescriptions, c) comparative analyses of attribute value distribution between different subsets of subjects (LMCI/AD vs CN or supp(R) vs CN), allowed us to find useful information related to subjects with cognitive impairment.

From the clinical attributes, we noticed that ADAS, MOCA, Geriatric Depression Scale, Rey Auditory Verbal Learning Test (especially the percent forgetting score), and Mini-Mental State Exam (MMSE) occurred frequently in queries of obtained redescriptions that describe subjects suffering from various degrees of cognitive impairment. Nevertheless, there were instances where some CN subjects fell in the identified intervals of values for these attributes. Attributes connected to Clinical Dementia Rating distinguished well between CN subjects and those with different degrees of cognitive impairment. Redescriptions mostly contained the attributes CDMEMORY, CDGLOBAL and CDR-SB (clinical dementia rating sum of boxes). From the biological attributes, we often encountered attributes connected to brain volume, hippocampus, various blood and urinary tests (attributes HMT and RCT), intracranial volume (ICV), 18F-fluorodeoxyglucose positron emission tomography (FDG-PET) and 18F-florbetapir (AV45). These attributes have been studied before by Gamberger et al. [3, 13]. We noticed that the biological attribute SPARE_AD (Spatial Pattern of Abnormalities for Recognition of Early AD) correlated very well with the subject's diagnosis and occurred frequently in redescriptions constructed on dataset 3, which contains it. Also, the gene variant APOE ε4 was present exclusively in redescriptions describing subjects diagnosed with LMCI and AD.

We report several attributes, discovered during our analyses, for which we detected variations in levels connected to AD or discovered interesting subgroups of patients with a significantly different distribution of values for a given attribute compared to CN subjects. Difference in distribution is measured with three different statistical tests: the Anderson-Darling test (ADT) [51, 52], the Kolmogorov-Smirnov test (KST) [53, 54] and the Mann-Whitney U test (MWUT) [55]. For Anderson-Darling we perform a two-sided test and report simulated (ps) and asymptotic (pa) p-values, while for the Kolmogorov-Smirnov and Mann-Whitney U tests we report p-values obtained by performing one-sided tests, and the observed direction of the shift of distribution. The alternative hypothesis (a) for one-sided tests has two possible forms: a = less (l) or a = greater (g). Depending on the choice of statistical test, the alternative hypothesis has a different meaning (explained in S17 File). Simulated p-values in ADT are obtained with default parameters (1000 simulations). A short motivation for the statistical tests used, with references to implementations and the meaning of the chosen alternative for the one-sided tests, is available in supplementary document S17 File. Tests of statistical significance of the difference in distribution between one selected example group and a group of CN subjects for all mentioned attributes are displayed in Table 5. Information about attributes with a statistically significant difference in distribution between AD/LMCI and CN subjects is reported in Table 6.

Table 5. Attributes analysed in this section with corresponding example redescription containing this attribute.

For each selected attribute we present example redescription that describes subjects with statistically significant difference in attribute value distribution compared to a group of CN subjects.

Table 6. Analysed attributes with statistically significant difference in value distribution between groups of LMCI or AD patients and CN subjects.

By observing redescriptions describing very homogeneous groups of subjects with high level of cognitive impairment (LMCI and AD), we discovered groups where testosterone levels (TSTSTRNT) were significantly decreased. Although some studies (e.g. Zhao et al. [56]) and meta-analyses (e.g. Xu et al. [57]) showed no differences in plasma levels of testosterone between AD and matched controls, other studies, such as those of Hogervorst et al. [58] and Lv et al. [59], found low free testosterone level to be an independent risk factor for AD. Plasma testosterone levels display circadian variation, peaking during sleep, and reaching a lowest level in the late afternoon, with a superimposed ultradian rhythm with pulses every 90 min reflecting the underlying rhythm of pulsatile luteinizing hormone (LH) secretion [60]. The increase in testosterone during sleep requires at least 3 hours of sleep with normal sleep architecture. However, since noradrenergic locus coeruleus and serotonergic dorsal raphe nucleus are among the first neurons affected by neurofibrillary tau pathology, their changes lead to the early and prominent deterioration of the sleep-wake cycle in AD (for a review, see Šimić et al. [61]), which may add to a reduction of testosterone levels with advancing age. Experimental data obtained in animal models of AD suggest that low levels of testosterone increase Aβ and tau pathology through both androgen and estrogen pathways (testosterone is metabolized in the brain into androgen dihydrotestosterone, DHT, and 17β-estradiol, the E2 estrogen) [62, 63].

Unlike previous scarce data and negative correlation [64], we also found increased levels of ciliary neurotrophic factor (CNTF) in plasma in several redescriptions describing subjects with high level of cognitive impairment, together with decreased levels of leptin. The distribution of leptin level differs significantly between groups of AD/LMCI patients and CN subjects (it is lower for AD and LMCI patients). This is in agreement with the results of Marwarha and Ghribi [65], showing that lower leptin levels detected in AD subjects can be a possible target for developing supplementation therapies for reducing the progression of AD. Some groups of subjects (such as R45 from S14 File) had significantly increased levels of plasma angiopoietin-2 (ANG2). This is in agreement with research by Thirumangalakudi et al. [66] and research by Grammas et al. [67], which revealed elevated expression of angiopoietin-2 in the brains of AD subjects and the transgenic AD mice, respectively.

Increased levels of plasma brain natriuretic peptide (BNP) were found in several redescriptions containing subjects with severe cognitive impairment. Previous research [68] suggested that this peptide has more significant association with vascular dementia than with AD. This could suggest either that this group of subjects, described by redescriptions containing BNP attribute, suffered from both types of dementia (mixed dementia), or that these cases do not suffer from AD but indeed suffer from vascular dementia. Distributions of level of BNP are significantly different, in dataset D3, between groups of LMCI/AD and CN subjects.

Finally, we also found alterations in plasma levels of several other attributes whose relationship with AD has already been shown in the literature. These include an increase in serum apolipoprotein B (APOB) [69] and pancreatic polypeptide (PPP) [70, 71] and, for very small groups, an increase in plasma insulin [72] and in the CSF macrophage migration inhibitory factor (MCRPHMIF) [73] reported in AD. In our dataset, Fas (CD95) ligand (FASL) levels are significantly decreased in LMCI patients compared to AD and CN subjects. Levels are also lower in AD patients than in CN subjects, but the difference is not statistically significant. Although one study suggests the upregulation of FASL in the AD brain [74], the levels and variations seem to depend heavily on the part of the brain. For instance, FASL levels are significantly decreased in the hippocampus of patients suffering from AD [75]. Several groups of LMCI/AD patients with significantly lower levels of APOAII compared to the CN subjects were detected (which corresponds to the research reported in [76, 77]). In dataset D3, the difference in value distribution between the groups of LMCI/AD patients and CN subjects is significant. Alterations in the levels of the PAPP-A attribute between CN subjects and LMCI/AD patients are very interesting (see Tables 5 and 6). The PAPP-A levels rise in LMCI subjects and then drop significantly in AD subjects. This property has already been reported in [78].

For each redescription set, we extracted one interesting, statistically significant redescription, and displayed its queries, along with the diagnosis distribution of the subjects described by this redescription (as shown in Fig 3).

Fig 3. Example redescriptions (one for each dataset), each describing at least 100 subjects.

All subjects described are diagnosed with EMCI, LMCI or AD. Attribute explanations can be seen in Tables 2 and 3 (P denotes PAPP-A and FDG denotes FDG-PET).

The three redescriptions (shown in Fig 3 from top to bottom) describe 602, 118 and 365 subjects, respectively, with different proportions of EMCI, LMCI and AD diagnoses. They are statistically significant and describe 46%, 20% and 62% of all subjects with some level of cognitive impairment contained in the corresponding dataset. Their queries mostly contain well-known attributes listed in Table 2 and in S1–S3 Files. The clinical attributes they contain are the memory score (CDMEMORY), the Clinical Dementia Rating Scale Sum of Boxes (CDRSB), the judgement and problem solving score (CDJUDGE), the Alzheimer’s Disease Assessment Scale (ADAS) and the Mini-Mental State Exam (MMSE). The biological attributes they contain are neutrophils (HMT8), 18F-florbetapir (AV45), 18F-fluorodeoxyglucose positron emission tomography (FDG-PET), Spatial Pattern of Abnormalities for Recognition of Early Alzheimer’s disease (SPARE_AD) and Pregnancy-Associated Plasma Protein-A (PAPP-A) measurements.

Pairwise attribute association analysis based on co-occurrences

In this section, we present the results of attribute association analyses based on attribute co-occurrences in the queries of redescriptions contained in our redescription sets. To obtain these associations, we studied the attribute co-occurrence frequencies in these redescriptions, focusing on redescriptions describing subjects with some level of cognitive impairment. Co-occurrence frequencies were computed separately for pairs of attributes contained in views bio-bio, clin-clin and bio-clin, where bio denotes the view containing biological attributes and clin denotes the view containing clinical attributes. Finally, we merged all redescriptions computed on all three datasets to obtain global information about pairwise attribute associations, again for the bio-bio, clin-clin and bio-clin combinations of views. Besides the associations, we also computed the pairwise attribute correlations, using the values of all subjects in the corresponding dataset for the selected pair of attributes, and the statistical significance of these correlations. For each attribute, we performed the Kolmogorov-Smirnov test to assess whether its values, for subjects contained in the dataset, follow a normal distribution. If we obtained p-values smaller than 0.05 for both attributes in the pair, we computed the Pearson correlation coefficient [79]; otherwise we computed Spearman’s correlation coefficient [80] and the p-value of the corresponding significance test. Spearman’s coefficient was also used to compute correlations involving attributes with ordinal values.
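The co-occurrence counting described above can be illustrated with a minimal Python sketch. It assumes a hypothetical representation in which each redescription is a dictionary holding the set of attributes used in its biological query (`bio`) and in its clinical query (`clin`). This structure, the function name `cooccurrence_counts` and the toy redescriptions are illustrative assumptions; they do not reproduce the CLUS-RM output format or any result of this study.

```python
from collections import Counter
from itertools import combinations, product

# Hypothetical toy redescriptions (NOT taken from the study's redescription sets).
# Each entry lists the attributes appearing in the query of each view.
redescriptions = [
    {"bio": {"FDG-PET", "Hippocampus"}, "clin": {"CDRSB", "MMSE"}},
    {"bio": {"FDG-PET", "Entorhinal"}, "clin": {"CDRSB", "ADAS13"}},
    {"bio": {"FDG-PET", "Hippocampus", "PAPP-A"}, "clin": {"CDRSB"}},
]


def cooccurrence_counts(redescriptions):
    """Count attribute-pair co-occurrences separately per view combination."""
    counts = {"bio-bio": Counter(), "clin-clin": Counter(), "bio-clin": Counter()}
    for r in redescriptions:
        # Sort so that each unordered pair always maps to the same tuple key.
        bio, clin = sorted(r["bio"]), sorted(r["clin"])
        counts["bio-bio"].update(combinations(bio, 2))
        counts["clin-clin"].update(combinations(clin, 2))
        # Cross-view pairs: every biological attribute with every clinical one.
        counts["bio-clin"].update(product(bio, clin))
    return counts


counts = cooccurrence_counts(redescriptions)
top_bio_clin = counts["bio-clin"].most_common(1)
print(top_bio_clin)
```

On these toy redescriptions the most frequent bio-clin pair is FDG-PET with CDRSB, which co-occurs in all three queries. Ranking the pairs of each view combination by count and keeping the first five gives lists of the same form as those reported in Tables 7, 8 and 9.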

A short list of top 5 pairwise associations (by co-occurrence) between attributes contained in the analysed datasets is provided in Tables 7, 8 and 9.

Table 7. The top five associations between pairs of biological attributes as measured by their co-occurrence in redescription queries.

Attribute correlations for a redescription set obtained on dataset Di are computed on Di. P denotes the Pearson correlation coefficient and S denotes the Spearman’s correlation coefficient. Correlations for attribute pairs from the merged redescription set are computed on the largest dataset containing both attributes.

Table 8. The top five associations between pairs of clinical attributes as measured by their co-occurrence in redescription queries.

Attribute correlations for a redescription set obtained on dataset Di are computed on Di. P denotes the Pearson correlation coefficient and S denotes the Spearman’s correlation coefficient. Correlations for attribute pairs from the merged redescription set are computed on the largest dataset containing both attributes.

Table 9. The top five associations between pairs of attributes consisting of a clinical and a biological attribute.

The association is measured as their co-occurrence in redescription queries. Attribute correlations for a redescription set obtained on dataset Di are computed on Di. P denotes the Pearson correlation coefficient and S denotes the Spearman’s correlation coefficient. Correlations for attribute pairs from the merged redescription set are computed on the largest dataset containing both attributes.

Table 7 shows a high association between FDG-PET and the volume of the hippocampus, the volume of the entorhinal cortex, and an attribute related to the volume of the lateral ventricles. A high association was also found between intracranial volume and creatine kinase levels (CKMB). This enzyme is present in greatest amounts in skeletal muscle, myocardium, and brain. The FDG-PET attribute often occurred in the same descriptive rules as the attribute measuring the level of vitamin B12 (BAT126). Administration of vitamin B12 is known to have beneficial effects on cognition when the level of B12 in the organism is insufficient [81, 82]. The incidence of AD increases with age, and older adults often show a deficiency of vitamin B12, mainly due to impaired vitamin B12 uptake in the gastrointestinal tract [83]. AD patients also have increased homocysteine levels in the blood. Since homocysteine is directly associated with brain atrophy, it is possible that vitamin B12 supplementation (which reduces homocysteine levels) can slow the progression of brain atrophy [81]. However, since meta-analyses failed to prove [84, 85] the connection of vitamin B12 supplementation with homocysteine levels and improved cognition, further studies should be conducted to resolve this issue. The correlation between FDG-PET and B12 values in our dataset was not statistically significant, though it may be more pronounced on a subset of subjects (for instance, those above a certain age). It has been reported [86] that diagnosis based on FDG-PET can lead to a false diagnosis of AD in subjects who are cognitively normal or whose cognitive impairment has a reversible cause.

The clinical attributes ADAS, MOCA, MMSE, CDR, FAQ and RAVLT co-occurred frequently. Interestingly, question number 13 (number of targets hit) from the ADAS test occurred very frequently in redescription queries. In this task, the participants are required to cross out specific digits from a long list of digits. High-frequency co-occurrences and the corresponding correlations for all of the aforementioned attributes can be seen in Table 8.

There was also a strong association of the ADAS, CDR and MOCA clinical attributes with FDG-PET and SPARE_AD, the volume of the entorhinal cortex and the hippocampus, and other biological attributes (see Table 9). Correlations between these attributes were statistically significant. One of the most interesting associations is that between CDRSB and PAPP-A, a protein that is used in screening tests for Down syndrome. CDRSB and PAPP-A were negatively correlated (−0.15), and the correlation was statistically significant at the significance level of 0.01.

Associations with PAPP-A.

Motivated by the statistically significant association between PAPP-A and CDRSB, we used constraint-based redescription mining to create a new redescription set (on dataset D3) by focusing only on redescriptions containing PAPP-A as one of the attributes in the redescription queries (corresponding redescription set is presented in supplementary document S16 File). The associations from this redescription set, containing 100 redescriptions, are presented in Table 10. Support sets of all constructed redescriptions contained both male and female subjects with diagnosis LMCI and AD.

Table 10. The top four associations of PAPP-A with other attributes based on attribute pair occurrences in redescription queries obtained by using constraint-based redescription mining on dataset D3.

S denotes Spearman’s correlation coefficient. The produced redescription set contains 100 different redescriptions.

The associations presented in Table 10 show that PAPP-A occurs frequently in redescription queries together with the clinical tests CDMEMORY, CDRSB, MMSE and ADAS13. Correlations between PAPP-A and all these attributes were statistically significant at the significance level of 0.01. Interestingly, SPARE_AD and PAPP-A occurred in every redescription from the redescription set obtained with constraint-based redescription mining. As noted earlier, the correlation between these two attributes was not statistically significant when measured for all subjects in the dataset. However, the correlation (Spearman’s ρ = −0.096) was statistically significant (with p = 0.026) when measured for subjects with AD and LMCI at the significance level of 0.05. The fact that every redescription in the set obtained with constraint-based redescription mining described exclusively subjects with AD and LMCI possibly explains the high frequency of association between those attributes and necessitates further exploration of the role of PAPP-A in AD and LMCI. Additionally, we found an interesting association between PAPP-A and two other biological attributes: the volume of the entorhinal cortex and the volume of the fusiform gyrus (Fusiform). Correlations between PAPP-A and these biological attributes were statistically significant at the significance level of 0.05.
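The effect of restricting a correlation to a subset of subjects can be illustrated with a small, self-contained Python sketch. It computes Spearman’s ρ as the Pearson correlation of average ranks, first on all records and then only on records diagnosed as AD or LMCI. The records, the attribute values and the helper names (`average_ranks`, `spearman_rho`, `rho_for`) are synthetic assumptions made for illustration; they are not ADNI data, and the significance test reported in the text is not reproduced here.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho computed as the Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


# Synthetic toy records: (diagnosis, attribute_a, attribute_b). NOT ADNI data.
subjects = [
    ("CN", 1.0, 2.0), ("CN", 2.0, 4.0), ("CN", 3.0, 6.0),
    ("AD", 4.0, 5.0), ("LMCI", 5.0, 3.0), ("AD", 6.0, 1.0),
]


def rho_for(records):
    """Spearman's rho between the two attributes of the given records."""
    return spearman_rho([r[1] for r in records], [r[2] for r in records])


rho_all = rho_for(subjects)
rho_impaired = rho_for([r for r in subjects if r[0] in {"AD", "LMCI"}])
print(round(rho_all, 3), round(rho_impaired, 3))
```

In this toy example the coefficient over all records is weak because the CN records follow the opposite trend, while the coefficient restricted to the AD and LMCI records is strongly negative. This mirrors, in form only, the situation described above for SPARE_AD and PAPP-A.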

Discussion

The redescription mining approach to segmenting high-dimensional datasets offers several advantages over classical clustering, subgroup discovery and association mining, such as the capability to generate relevant equivalence associations among combinations of attributes. We performed redescription mining experiments on three different datasets, created by extracting different sets of attributes from the ADNI database, and measured the redescription accuracy and the level of homogeneity (in terms of level of cognitive impairment) of the subjects described by each redescription. The main aim of our study was to differentiate between cognitively normal subjects and those with some level of cognitive impairment, using clinical and biological attributes potentially related to AD. Our experiments on the constructed datasets were deliberately split into different support ranges (numbers of subjects described by a redescription) to allow the extraction of both general and specific AD-related information.

In this study, we found a number of surprisingly large and homogeneous groups and many smaller, more specific subgroups of subjects that are described by informative redescriptions, to a large extent confirming the findings of previous works, corroborating some previously debatable findings or providing additional information about various attributes. After obtaining interesting associations with PAPP-A, we used the introduced extensions to the CLUS-RM algorithm to perform constraint-based redescription mining, allowing us to further explore associations of various attributes with PAPP-A. CLUS-RM was extended to perform fully automated constraint-based redescription mining on data containing numerical attributes, categorical attributes and missing values. In addition, it is equipped with the soft and suggested constraint-based redescription mining (CBRM) capabilities introduced in this work.

The clinical attribute CDR (CDMEMORY, CDGLOBAL and CDR-SB) was shown to be a very good attribute for differentiating CN subjects and subjects with some level of cognitive impairment. The gene variant APOE ε4 was associated with subjects with high level of cognitive impairment (LMCI and AD), whereas the biological attribute SPARE_AD was highly correlated with the subject’s diagnosis.

Additionally, a high association of the ADAS, CDR, and MOCA clinical attributes with FDG-PET, SPARE_AD, and the volume of the entorhinal cortex and hippocampus was shown. In homogeneous groups of subjects with a high level of cognitive impairment (LMCI and AD), a decrease of testosterone plasma levels and an increase of CNTF and BNP plasma levels were observed. Likewise, changes in other biological attributes previously reported as being altered in AD were found, such as increased levels of serum apolipoprotein B, pancreatic polypeptide and plasma insulin, and altered levels of Fas (CD95) ligand.

Finally, probably the most important finding of this study was the detection, in subjects with cognitive impairment, of altered levels of biological attributes that could have potential as therapeutic targets in AD, namely decreased leptin and increased angiopoietin-2 plasma levels. Increasing leptin levels has been suggested to alleviate AD-related cellular changes in rabbit organotypic slices [87] and in human neuroblastoma cell culture [88, 89], suggesting that the lowered leptin levels detected in AD subjects can be a possible target for developing supplementation therapies for reducing the progression of AD. The finding of increased angiopoietin-2 plasma levels in AD patients is in accordance with the study of Thirumangalakudi et al. [66], who showed that angiopoietin-2 is expressed by AD-derived, but not control-derived, microvessels, supporting the idea of targeting the angiogenic changes in the microcirculation of the AD brain as a potential therapeutic approach in AD [67]. Altogether, analysing redescriptions from all three datasets allowed us to find many different associations. Some of these associations, such as that between SPARE_AD and PAPP-A, are novel and require more in-depth analysis with the supervision of domain experts. The correlation between SPARE_AD and PAPP-A was not statistically significant when computed for all subjects contained in the dataset, but it was statistically significant at the significance level of 0.05 when computed only for subjects with AD and LMCI. PAPP-A showed a significant correlation with the volume of the fusiform gyrus and the volume of the entorhinal cortex, both already known to be associated with AD [90, 91]. Further, PAPP-A had a statistically significant correlation with the most widely used clinical cognitive tests: ADAS, Mini-Mental State Examination and Clinical Dementia Rating Sum of Boxes.

A study measuring PAPP-A in 52 healthy males and 74 healthy, non-pregnant women [92] established reference intervals of <22.9 ng/mL for men and <33.6 ng/mL for non-pregnant women. The median PAPP-A value in males (6.85 ng/mL, range [undetectable, 24.40] ng/mL) was significantly higher than the median in females (3.4 ng/mL, range [undetectable, 36.7] ng/mL). For both males and females, non-smokers had higher levels of PAPP-A than smokers; the difference was statistically significant for males but not for females. PAPP-A levels in pre-menopausal women were lower than in post-menopausal women, but the difference was not statistically significant. Serum PAPP-A levels correlated positively with subjects’ age; this correlation was statistically significant in male subjects but not in female subjects.

Our PubMed search (performed on 3 March 2016) using the keywords pappalysin-1/Pregnancy-associated plasma protein-A (PAPP-A) and Alzheimer’s disease revealed only one publication [93], which associates PAPP-A with depressive symptoms.

Results by Llano et al. [78] show that PAPP-A is among the most significant descriptors in plasma proteomic data for distinguishing between CN, MCI and AD patients with different supervised machine learning algorithms. We discovered associations between PAPP-A and cognitive status (LMCI, AD). These results demonstrate the importance of further study of PAPP-A as a potential marker for early detection of AD.

Distribution analyses of PAPP-A values based on our data and on those of Llano et al. [78] show that PAPP-A levels are increased in MCI and LMCI patients but are significantly decreased in subjects diagnosed with AD. In our data, the decrease in PAPP-A levels from LMCI to AD patients is more pronounced in female than in male patients. The possible link between PAPP-A and the AD-related genes ABCA1 and ABCG1, discovered by Hu et al. [94], is explained by Tang et al. [95]. The latter publication discusses the role of PAPP-A in the pathogenesis of atherosclerosis through its inhibition of liver X receptor α (LXRα) via the insulin-like growth factor (IGF)-I-mediated signalling pathway, and the resulting negative regulation of the expression of the ABCA1 and ABCG1 genes, both significantly associated with AD [94]. Although LXRs are best known as the key regulators of cholesterol metabolism and transport, LXR signalling has also been shown to have significant anti-inflammatory properties [96]. Various studies surveyed in Štefulj et al. [96] implicate LXRs in the pathogenesis, modulation, and therapy of AD.

A further potential association between PAPP-A and AD can be seen through the study of patients suffering from type-2 diabetes. It has been shown [97] that patients suffering from type-2 diabetes also have significantly increased levels of PAPP-A. Akter et al. [98] showed the potentially shared pathology of type-2 diabetes and AD, and some research (e.g. [99]) shows a strong influence of type-2 diabetes on the potential development of AD. Also, one study performed on mice [100] suggested that changes in the brain during AD can potentially cause diabetes.

Conclusions

The association of PAPP-A (also known as pappalysin-1) with cognitive status is probably the most intriguing and novel finding of this study, as it has been scarcely investigated in this context.

PAPP-A was detected as a significant attribute in differentiating between CN, MCI and AD subjects [78] through use of different supervised machine learning algorithms. It has also been shown that it is significant in predicting the progression from MCI to AD, though none of the used subsets of attributes provided adequate predictions of progression between these two classes. High association of PAPP-A with depressive symptoms has already been demonstrated [93] by using the ensemble machine learning algorithm of Random Forests.

In our work, we detected an important correlation between the attribute PAPP-A and the cognitive test CDRSB. By applying the newly developed constraint-based extensions of the CLUS-RM algorithm, we detected a larger number of attributes with a statistically significant correlation with PAPP-A. In addition to CDRSB, we observed further clinical tests, such as MMSE and ADAS13, with statistically significant correlations with PAPP-A. Interesting and significant correlations were also observed with two biological attributes, the volume of the fusiform gyrus and the volume of the entorhinal cortex, both known to be associated with AD [90, 91]; the volume of the entorhinal cortex is significantly reduced even in mild cases of AD [91].

The importance of our finding lies in the fact that PAPP-A is a metalloproteinase, already known to cleave insulin-like growth factor (IGF) binding proteins (IGFBPs). Perhaps even more importantly, it shares similar substrates with the A Disintegrin and Metalloproteinase (ADAM) family of enzymes, the main group of enzymes that act as α-secretase to physiologically cleave the amyloid precursor protein (APP) in the so-called non-amyloidogenic pathway [101]. PAPP-A could therefore be directly involved in the metabolism of APP in the very early stages of AD. Based on the above, the role of PAPP-A in AD should be investigated in greater detail.

Supporting information

S4 File. Redescriptions obtained on D1 with support in [5, 10] interval.


S5 File. Redescriptions obtained on D1 with support in [11, 39] interval.


S6 File. Redescriptions obtained on D1 with support in [40, 99] interval.


S7 File. Redescriptions obtained on D1 with support in [100, 820] interval.


S8 File. Redescriptions obtained on D2 with support in [5, 10] interval.


S9 File. Redescriptions obtained on D2 with support in [11, 39] interval.


S10 File. Redescriptions obtained on D2 with support in [40, 99] interval.


S11 File. Redescriptions obtained on D2 with support in [100, 470] interval.


S12 File. Redescriptions obtained on D3 with support in [5, 10] interval.


S13 File. Redescriptions obtained on D3 with support in [11, 39] interval.


S14 File. Redescriptions obtained on D3 with support in [40, 99] interval.


S15 File. Redescriptions obtained on D3 with support in [100, 420] interval.


S16 File. Redescriptions obtained on D3 by using constraint-based redescription mining with support larger than 100 subjects.


S17 File. Motivation and explanation of statistical tests used in this work.


S18 File. Pseudocode of the CLUS-RM algorithm that can use conjunction, negation and disjunction logical operator in redescription query construction and explanation of introduced constraint-based redescription mining extensions.


Acknowledgments

Data used in preparation of this article were obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database. As such, the investigators within the ADNI contributed to the design and implementation of ADNI and/or provided data, but did not participate in analysis or writing of this report. A complete listing of ADNI investigators can be found at:

References

  1. 1. ADNI database: last access 18.08.2017. Available from:
  2. 2. Smith G E, Bondi M W. Mild Cognitive Impairment and Dementia. Oxford University Press; 2013.
  3. 3. Gamberger D, Ženko B, Mitelpunkt A, Lavrač N. Multilayer Clustering: Biomarker Driven Segmentation of Alzheimer’s Disease Patient Population. In: Ortuño F, Rojas I, editors. Bioinformatics and Biomedical Engineering. vol. 9043 of Lecture Notes in Computer Science. Springer International Publishing; 2015. p. 134–145.
  4. 4. Adaszewski S, Dukart J, Kherif F, Frackowiak R, Draganski B, ADNI, et al. How early can we predict Alzheimer’s disease using computational anatomy? Neurobiology of aging. 2013;34(12):2815–2826. pmid:23890839
  5. 5. Liu X, Tosun D, Weiner MW, Schuff N, Initiative ADN, et al. Locally linear embedding (LLE) for MRI based Alzheimer’s disease classification. NeuroImage. 2013;83:148–157. pmid:23792982
  6. 6. Dukart J, Mueller K, Barthel H, Villringer A, Sabri O, Schroeter ML, et al. Meta-analysis based SVM classification enables accurate detection of Alzheimer’s disease across different clinical centers using FDG-PET and MRI. Psychiatry Research: Neuroimaging. 2013;212(3):230–236. pmid:23149027
  7. 7. Doraiswamy P, Sperling R, Johnson K, Reiman E, Wong T, Sabbagh M, et al. Florbetapir F 18 amyloid PET and 36-month cognitive decline: a prospective multicenter study. Molecular Psychiatry. 2014;19(9):1044–1051. pmid:24614494
  8. 8. Donovan NJ, Wadsworth LP, Lorius N, Locascio JJ, Rentz DM, Johnson KA, et al. Regional cortical thinning predicts worsening apathy and hallucinations across the Alzheimer disease spectrum. The American Journal of Geriatric Psychiatry. 2014;22(11):1168–1179. pmid:23890751
  9. 9. Guo LH, Alexopoulos P, Wagenpfeil S, Kurz A, Perneczky R, ADNI, et al. Brain size and the compensation of Alzheimer’s disease symptoms: a longitudinal cohort study. Alzheimer’s & Dementia. 2013;9(5):580–586.
  10. 10. Hostage CA, Roy Choudhury K, Doraiswamy PM, Petrella JR, ADNI. Dissecting the Gene Dose-Effects of the APOE epsilon4 and epsilon2 Alleles on Hippocampal Volumes in Aging and Alzheimer’s Disease. PLoS ONE. 2013;8(2):e54483. pmid:23405083
  11. 11. Risacher SL, Kim S, Shen L, Nho K, Foroud T, Green RC, et al. The role of apolipoprotein E (APOE) genotype in early mild cognitive impairment (E-MCI). Frontiers in Aging Neuroscience. 2013;5. pmid:23554593
  12. 12. Gamberger D, Mihelčić M, Lavrač N. Multilayer Clustering: A Discovery Experiment on Country Level Trading Data. In: Proceedings of the 17th International Conference, Discovery Science, DS 2014, Bled, Slovenia; 2014. p. 87–98.
  13. 13. Gamberger D, Ženko B, Mitelpunkt A, Lavrač N, ADNI. Identification of Gender Specific Biomarkers for Alzheimer’s Disease. In: Guo Y, Friston K, Aldo F, Hill S, Peng H, editors. Brain Informatics and Health. vol. 9250 of Lecture Notes in Computer Science. Springer International Publishing; 2015. p. 57–66.
  14. 14. Breskvar M, Ženko B, Džeroski S. Relating Biological and Clinical Features of Alzheimer’s Patients With Predictive Clustering Trees. In: Proceedings of the 18th International Information Society. vol. E of IS’15. Ljubljana, Slovenia; 2015. p. 5–8.
  15. 15. Cox DR. Note on Grouping. Journal of the American Statistical Association. 1957;52(280):543–547.
  16. 16. Fisher WD. On Grouping for Maximum Homogeneity. Journal of the American Statistical Association. 1958;53(284).
  17. 17. Ward JH. Hierarchical Grouping to Optimize an Objective Function. Journal of the American Statistical Association. 1963;58(301):236–244.
  18. 18. Jain AK, Murty MN, Flynn PJ. Data Clustering: A Review. ACM Computing Surveys. 1999;31(3):264–323.
  19. 19. Xu D, Tian Y. A Comprehensive Survey of Clustering Algorithms. Annals of Data Science. 2015;2(2):165–193.
  20. 20. Michalski RS. Knowledge Acquisition Through Conceptual Clustering: A Theoretical Framework and an Algorithm for Partitioning Data into Conjunctive Concepts. Journal of Policy Analysis and Information Systems. 1980;4(3):219–244.
  21. 21. Fisher DH. Knowledge Acquisition Via Incremental Conceptual Clustering. Machine Learning. 1987;2(2):139–172.
  22. 22. Agrawal R, Mannila H, Srikant R, Toivonen H, Verkamo AI. Fast Discovery of Association Rules. In: Fayyad UM, Piatetsky-Shapiro G, Smyth P, Uthurusamy R, editors. Advances in Knowledge Discovery and Data Mining. Menlo Park CA, USA: American Association for Artificial Intelligence; 1996. p. 307–328. Available from:
  23. 23. Klösgen W. Explora: A Multipattern and Multistrategy Discovery Assistant. In: Fayyad UM, Piatetsky-Shapiro G, Smyth P, Uthurusamy R, editors. Advances in Knowledge Discovery and Data Mining. Menlo Park CA, USA: American Association for Artificial Intelligence; 1996. p. 249–271. Available from:
  24. 24. Wrobel S. An algorithm for multi-relational discovery of subgroups. In: Komorowski J, Zytkow J, editors. Principles of Data Mining and Knowledge Discovery. vol. 1263 of Lecture Notes in Computer Science. Berlin / Heidelberg: Springer; 1997. p. 78–87. Available from:
  25. 25. Bay SD, Pazzani MJ. Detecting group differences: Mining contrast sets. Data Mining and Knowledge Discovery. 2001;5(3):213–246.
  26. 26. van Leeuwen M, Galbrun E. Association Discovery in Two-View Data. IEEE Trans Knowl Data Eng. 2015;27(12):3190–3202.
  27. 27. Ramakrishnan N, Kumar D, Mishra B, Potts M, Helm RF. Turning CARTwheels: An Alternating Algorithm for Mining Redescriptions. In: Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. KDD’04. New York NY, USA: ACM; 2004. p. 266–275. Available from:
  28. 28. Weiner MW, et al. The Alzheimer’s Disease Neuroimaging Initiative: A review of papers published since its inception. Alzheimer’s & Dementia. Journal of the Alzheimer’s Association. 2012;8:1–68.
  29. 29. Mihelčić M, Džeroski S, Lavrač N, Šmuc T. Redescription mining with multi-label Predictive Clustering Trees. In: Proceedings of the 4th workshop on New Frontiers in Mining Complex Patterns. NFMCP’15. Porto, Portugal; 2015. p. 86–97. Available from:
  30. 30. Mihelčić M, Džeroski S, Lavrač N, Šmuc T. Redescription Mining with Multi-target Predictive Clustering Trees. In: Ceci M, Loglisci C, Manco G, Masciari E, Ras ZW, editors. New Frontiers in Mining Complex Patterns—4th International Workshop, NFMCP 2015, Held in Conjunction with ECML-PKDD 2015, Porto, Portugal, Revised Selected Papers. vol. 9607 of Lecture Notes in Computer Science. Springer; 2015. p. 125–143. Available from:
  31. 31. Zaki MJ, Ramakrishnan N. Reasoning About Sets Using Redescription Mining. In: Proceedings of the 11th ACM SIGKDD International Conference on Knowledge Discovery in Data Mining. KDD’05. New York, NY, USA: ACM; 2005. p. 364–373. Available from:
  32. 32. R Core Team. R: A Language and Environment for Statistical Computing; 2014. Available from:
  33. 33. ADNI1 procedures manual: last access 22.08.2017.;. Available from:
  34. 34. ADNIGO procedures manual: last access 22.08.2017.;. Available from:
  35. 35. ADNI2 procedures manual: last access 22.08.2017.;. Available from:
  36. 36. Galbrun E. Methods for Redescription mining [Ph.D. dissertation]. University of Helsinki; 2013.
  37. 37. Galbrun E, Miettinen P. From black and white to full color: extending redescription mining outside the Boolean world. Statistical Analysis and Data Mining. 2012;5(4):284–303.
  38. 38. Zinchenko T. Redescription Mining Over non-Binary Data Sets Using Decision Trees, M.Sc. thesis [MSc dissertation]. Universität des Saarlandes Saarbrücken. Germany; 2014.
  39. 39. Parida L, Ramakrishnan N. Redescription Mining: Structure Theory and Algorithms. In: Veloso MM, Kambhampati S, editors. AAAI. AAAI Press / The MIT Press; 2005. p. 837–844. Available from:
  40. 40. Gallo A, Miettinen P, Mannila H. Finding Subgroups having Several Descriptions: Algorithms for Redescription Mining. In: SDM. SIAM; 2008. p. 334–345. Available from:
  41. 41. Breiman L, Friedman JH, Olshen RA, Stone CJ. Classification and Regression Trees. Monterey, CA: Wadsworth and Brooks; 1984.
  42. 42. Blockeel H, De Raedt L. Top-down Induction of First-order Logical Decision Trees. Artificial Intelligence. 1998;101(1-2):285–297.
  43. 43. Kocev D, Vens C, Struyf J, Dzeroski S. Tree ensembles for predicting structured outputs. Pattern Recognition. 2013;46(3):817–833.
  44. 44. Mihelčić M, Džeroski S, Lavrač N, Šmuc T. Redescription mining augmented with random forest of multi-target predictive clustering trees. Journal of Intelligent Information Systems. 2017; p. 1–34.
  45. 45. Mihelčić M, Džeroski S, Lavrač N, Šmuc T. A framework for redescription set construction. Expert Systems with Applications. 2017;68:196–215.
  46. 46. Piccart B. Algorithms for Multi-target Learning [Ph.D. dissertation]. Katholieke Universiteit Leuven. Belgium; 2012.
  47. Galbrun E, Miettinen P. Siren: An Interactive Tool for Mining and Visualizing Geospatial Redescriptions. In: Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. KDD’12. New York, NY, USA: ACM; 2012. p. 1544–1547.
  48. Mihelcic M, Smuc T. InterSet: Interactive Redescription Set Exploration. In: Discovery Science—19th International Conference, DS 2016, Bari, Italy, October 19-21, 2016, Proceedings; 2016. p. 35–50.
  49. Aho T, Ženko B, Džeroski S, Elomaa T. Multi-target Regression with Rule Ensembles. J Mach Learn Res. 2012;13(1):2367–2407.
  50. Shannon CE. A mathematical theory of communication. ACM SIGMOBILE Mobile Computing and Communications Review. 2001;5(1):3–55.
  51. Anderson TW, Darling DA. Asymptotic Theory of Certain “Goodness of Fit” Criteria Based on Stochastic Processes. Ann Math Statist. 1952;23(2):193–212.
  52. Scholz FW, Stephens MA. K-Sample Anderson–Darling Tests. Journal of the American Statistical Association. 1987;82(399):918–924.
  53. Kolmogorov AN. Sulla Determinazione Empirica di una Legge di Distribuzione. Giornale dell’Istituto Italiano degli Attuari. 1933;4:83–91.
  54. Smirnov N. Table for Estimating the Goodness of Fit of Empirical Distributions. Ann Math Statist. 1948;19(2):279–281.
  55. Mann HB, Whitney DR. On a Test of Whether one of Two Random Variables is Stochastically Larger than the Other. Ann Math Statist. 1947;18(1):50–60.
  56. Zhao JV, Lam TH, Jiang C, Cherny SS, Liu B, Cheng KK, et al. A Mendelian randomization study of testosterone and cognition in men. Scientific Reports. 2016;6.
  57. Xu J, Xia LL, Song N, Chen SD, Wang G. Testosterone, Estradiol, and Sex Hormone-Binding Globulin in Alzheimer’s Disease: A Meta-Analysis. Current Alzheimer Research. 2016;13(3):215–222. pmid:26679858
  58. Hogervorst E, Bandelow S, Combrinck M, Smith A. Low free testosterone is an independent risk factor for Alzheimer’s disease. Experimental Gerontology. 2004;39(11):1633–1639. pmid:15582279
  59. Lv W, Du N, Liu Y, Fan X, Wang Y, Jia X, et al. Low testosterone level and risk of Alzheimer’s disease in the elderly men: a systematic review and meta-analysis. Molecular neurobiology. 2016;53(4):2679–2684. pmid:26154489
  60. Wittert G, et al. The relationship between sleep disorders and testosterone in men. Asian Journal of Andrology. 2014;16(2):262. pmid:24435056
  61. Šimić G, Leko MB, Wray S, Harrington CR, Delalle I, Jovanov-Milošević N, et al. Monoaminergic Neuropathology in Alzheimer’s disease. Progress in Neurobiology. 2016.
  62. Papasozomenos SC, Shanavas A. Testosterone prevents the heat shock-induced overactivation of glycogen synthase kinase-3β but not of cyclin-dependent kinase 5 and c-Jun NH2-terminal kinase and concomitantly abolishes hyperphosphorylation of τ: Implications for Alzheimer’s disease. Proceedings of the National Academy of Sciences of the United States of America. 2002;99(3):1140–1145. pmid:11805297
  63. Rosario ER, Carroll J, Pike CJ. Testosterone regulation of Alzheimer-like neuropathology in male 3xTg-AD mice involves both estrogen and androgen pathways. Brain Research. 2010;1359:281–290. pmid:20807511
  64. Gelernter J, Dyck CV, van Kammen DP, Malison R, Price LH, Cubells JF, et al. Ciliary neurotrophic factor null allele frequencies in schizophrenia, affective disorders, and Alzheimer’s disease. American Journal of Medical Genetics. 1997;74(5):497–500. pmid:9342199
  65. Marwarha G, Ghribi O. Leptin signaling and Alzheimer’s disease. American Journal of Neurodegenerative Disease. 2012;1(3):245. pmid:23383396
  66. Thirumangalakudi L, Samany PG, Owoso A, Wiskar B, Grammas P. Angiogenic proteins are expressed by brain blood vessels in Alzheimer’s disease. Journal of Alzheimer’s Disease. 2006;10(1):111–118. pmid:16988487
  67. Grammas P, Tripathy D, Sanchez A, Yin X, Luo J. Brain microvasculature and hypoxia-related proteins in Alzheimer’s disease. International Journal of Clinical and Experimental Pathology. 2011;4(6):616. pmid:21904637
  68. Kondziella D, Göthlin M, Fu M, Zetterberg H, Wallin A. B-type natriuretic peptide plasma levels are elevated in subcortical vascular dementia. Neuroreport. 2009;20(9):825–827. pmid:19424098
  69. Caramelli P, Nitrini R, Maranhao R, Lourenço A, Damasceno M, Vinagre C, et al. Increased apolipoprotein B serum concentration in Alzheimer’s disease. Acta Neurologica Scandinavica. 1999;100(1):61–63. pmid:10416513
  70. Soares HD, Potter WZ, Pickering E, Kuhn M, Immermann FW, Shera DM, et al. Plasma biomarkers associated with the apolipoprotein E genotype and Alzheimer disease. Archives of neurology. 2012;69(10):1310–1317. pmid:22801723
  71. Roberts RO, Aakre JA, Cha RH, Kremers WK, Mielke MM, Velgos SN, et al. Association of pancreatic polypeptide with mild cognitive impairment varies by APOE ε4 allele. Frontiers in aging neuroscience. 2015;7. pmid:26441635
  72. Watson GS, Craft S. The role of insulin resistance in the pathogenesis of Alzheimer’s disease. CNS Drugs. 2003;17(1):27–45. pmid:12467491
  73. Bacher M, Deuster O, Aljabari B, Egensperger R, Neff F, Jessen F, et al. The role of macrophage migration inhibitory factor in Alzheimer’s disease. Molecular Medicine. 2010;16(3-4):116. pmid:20200619
  74. Su JH, Anderson AJ, Cribbs DH, Tu C, Tong L, Kesslack P, et al. Fas and Fas Ligand are associated with neuritic degeneration in the AD brain and participate in β-amyloid-induced neuronal death. Neurobiology of Disease. 2003;12(3):182–193. pmid:12742739
  75. Ferrer I, Puig B, Krupinski J, Carmona M, Blanco R. Fas and Fas ligand expression in Alzheimer’s disease. Acta neuropathologica. 2001;102(2):121–131. pmid:11563626
  76. Song F, Poljak A, Crawford J, Kochan NA, Wen W, Cameron B, et al. Plasma Apolipoprotein Levels Are Associated with Cognitive Status and Decline in a Community Cohort of Older Individuals. PLOS ONE. 2012;7(6):1–11.
  77. Ma C, Li J, Bao Z, Ruan Q, Yu Z. Serum levels of ApoA1 and ApoA2 are associated with cognitive status in older men. BioMed research international. 2015;2015.
  78. Llano DA, Devanarayan V, Simon AJ, ADNI. Evaluation of plasma proteomic data for Alzheimer disease state classification and for the prediction of progression from mild cognitive impairment to Alzheimer disease. Alzheimer Disease & Associated Disorders. 2013;27(3):233–243.
  79. Pearson K. Note on regression and inheritance in the case of two parents. Proceedings of the Royal Society of London. 1895;58(347-352):240–242.
  80. Spearman C. The Proof and Measurement of Association Between Two Things. American Journal of Psychology. 1904;15:88–103.
  81. Smith AD, Smith SM, de Jager CA, Whitbread P, Johnston C, Agacinski G, et al. Homocysteine-Lowering by B Vitamins Slows the Rate of Accelerated Brain Atrophy in Mild Cognitive Impairment: A Randomized Controlled Trial. PLoS ONE. 2010;5(9):1–10.
  82. Douaud G, Refsum H, de Jager CA, Jacoby R, Nichols TE, Smith SM, et al. Preventing Alzheimer’s disease-related gray matter atrophy by B-vitamin treatment. Proceedings of the National Academy of Sciences of the United States of America. 2013;110(23):9523–9528. pmid:23690582
  83. Gröber U, Kisters K, Schmidt J. Neuroenhancement with vitamin B12–underestimated neurological significance. Nutrients. 2013;5(12):5031–5045. pmid:24352086
  84. Ford AH, Almeida OP. Effect of homocysteine lowering treatment on cognitive function: a systematic review and meta-analysis of randomized controlled trials. Journal of Alzheimer’s Disease. 2012;29(1):133–149. pmid:22232016
  85. Clarke R, Bennett D, Parish S, Lewington S, Skeaff M, Eussen SJ, et al. Effects of homocysteine lowering with B vitamins on cognitive aging: meta-analysis of 11 trials with cognitive data on 22,000 individuals. The American Journal of Clinical Nutrition. 2014;100(2):657–666. pmid:24965307
  86. Shipley SM, Frederick MC, Filley CM, Kluger BM. Potential for misdiagnosis in community-acquired PET scans for dementia. Neurology: Clinical Practice. 2013;3(4):305–312.
  87. Marwarha G, Dasari B, Prasanthi JR, Schommer J, Ghribi O. Leptin reduces the accumulation of Aβ and phosphorylated tau induced by 27-hydroxycholesterol in rabbit organotypic slices. Journal of Alzheimer’s Disease. 2010;19(3):1007–1019. pmid:20157255
  88. Marwarha G, Dasari B, Ghribi O. Endoplasmic reticulum stress-induced CHOP activation mediates the down-regulation of leptin in human neuroblastoma SH-SY5Y cells treated with the oxysterol 27-hydroxycholesterol. Cellular Signalling. 2012;24(2):484–492. pmid:21983012
  89. Marwarha G, Raza S, Meiers C, Ghribi O. Leptin attenuates BACE1 expression and amyloid-β genesis via the activation of SIRT1 signaling pathway. Biochimica et Biophysica Acta (BBA)-Molecular Basis of Disease. 2014;1842(9):1587–1595.
  90. Chang YT, Huang CW, Chen NC, Lin KJ, Huang SH, Chang WN, et al. Hippocampal amyloid burden with downstream fusiform gyrus atrophy correlate with face matching task scores in early stage Alzheimer’s disease. Frontiers in aging neuroscience. 2016;8.
  91. Juottonen K, Laakso M, Insausti R, Lehtovirta M, Pitkänen A, Partanen K, et al. Volumes of the entorhinal and perirhinal cortices in Alzheimer’s disease. Neurobiology of aging. 1998;19(1):15–22. pmid:9562498
  92. Coskun A, Serteser M, Duran S, Inal TC, Erdogan BE, Ozpinar A, et al. Reference interval of pregnancy-associated plasma protein-A in healthy men and non-pregnant women. Journal of Cardiology. 2013;61(2):128–131. pmid:23159209
  93. Arnold S, Xie S, Leung Y, Wang L, Kling M, Han X, et al. Plasma biomarkers of depressive symptoms in older adults. Translational Psychiatry. 2012;2(1):e65. pmid:22832727
  94. Hu YS, Xin J, Hu Y, Zhang L, Wang J. Analyzing the genes related to Alzheimer’s disease via a network and pathway-based approach. Alzheimer’s Research & Therapy. 2017;9(1):29.
  95. Tang SL, Chen WJ, Yin K, Zhao GJ, Mo ZC, Lv YC, et al. PAPP-A negatively regulates ABCA1, ABCG1 and SR-B1 expression by inhibiting LXRα through the IGF-I-mediated signaling pathway. Atherosclerosis. 2012;222(2):344–354. pmid:22503545
  96. Štefulj J, Panzenboeck U, Hof PR, Šimić G. Pathogenesis, modulation, and therapy of Alzheimer’s disease: A perspective on roles of liver-X receptors. Translational Neuroscience. 2013;4(3):349–356.
  97. Heidari B, Fotouhi A, Sharifi F, Mohammad K, Pajouhi M, Paydary K, et al. Elevated serum levels of pregnancy-associated plasma protein-A in type 2 diabetics compared to healthy controls: associations with subclinical atherosclerosis parameters. Acta Medica Iranica. 2015;53(7):395–402. pmid:26520625
  98. Akter K, Lanza EA, Martin SA, Myronyuk N, Rua M, Raffa RB. Diabetes mellitus and Alzheimer’s disease: shared pathology and treatment? British journal of clinical pharmacology. 2011;71(3):365–376. pmid:21284695
  99. Stanley M, Macauley SL, Holtzman DM. Changes in insulin and insulin signaling in Alzheimer’s disease: cause or consequence? Journal of Experimental Medicine. 2016; p. jem–20160493.
  100. Plucinska K, Dekeryte R, Koss D, Shearer K, Mody N, Riedel G, et al. Neuronal Human BACE1 Knock-in Induces Systemic Diabetes in Mice. In: Diabetes. vol. 65. Amer. Diabetes Assoc. 1701 N Beauregard st, Alexandria, VA 22311-1717 USA; 2016. p. A430–A430.
  101. Kuhn PH, Colombo AV, Schusser B, Dreymueller D, Wetzel S, Schepers U, et al. Systematic substrate identification indicates a central role for the metalloprotease ADAM10 in axon targeting and synapse function. eLife. 2016;5:e12748. pmid:26802628