Measuring heterogeneity in normative models as the effective number of deviation patterns

Normative modeling is an increasingly popular method for characterizing the ways in which clinical cohorts deviate from a reference population, with respect to one or more biological features. In this paper, we extend the normative modeling framework with an approach for measuring the amount of heterogeneity in a cohort. This heterogeneity measure is based on the Representational Rényi Heterogeneity method, which generalizes diversity measurement paradigms used across multiple scientific disciplines. We propose that heterogeneity in the normative modeling setting can be measured as the effective number of deviation patterns; that is, the effective number of coherent patterns by which a sample of data differs from a distribution of normative variation. We show that a lower effective number of deviation patterns is associated with the presence of systematic differences from a (non-degenerate) normative distribution. This finding is shown to be consistent across (A) application of a Gaussian process model to synthetic and real-world neuroimaging data, and (B) application of a variational autoencoder to a well-understood database of handwritten images.


Concern 2
Your ethics statement should only appear in the Methods section of your manuscript. If your ethics statement is written in any section besides the Methods, please move it to the Methods section and delete it from any other section. Please ensure that your ethics statement is included in your manuscript, as the ethics statement entered into the online submission form will not be published alongside your manuscript.
Thank you. We have split the "Background and Methods" section into separate "Background" and "Methods" sections, such that the ethics statement is now more clearly included in the Methods.

Figure caption: Results of the MNIST experiment. Each plot corresponds to a normative model (cVAE [1,2]) trained on images from the "normative" digit class identified in the title. The x-axes index "clinical" or "target" digit classes. The y-axes plot the β-heterogeneity at q = 1 (Π

Concern 1
The first dataset needs more information about the type of data and explanation of data like second dataset.
Thank you. We have updated the Methods section to read as follows:

Experiment using Synthetic Data
Our first experiment uses synthetic data from a simple system of M = 30 real-valued features, $y_i$ (which may be considered analogous to 30 regions of interest in a neuroimaging study), generated using the function given in Equations 11-13, which is parameterized by five simulated "covariates" (Fig. 1A), where $x_i = (x_{ij})_{j=1,2,\ldots,5}$ are the covariates sampled from an isotropic multivariate Gaussian. Specific model parameters are included in the reproducible supplementary notebooks. The generative model specified by Equations 11-13 was selected in order to generate nonlinear patterns that remain easy to visualize, yet can result in non-trivial patterns of differences between simulated groups. That is, two groups simulated under this model will not simply differ along one feature dimension, but could show heterogeneous differences across multiple features.

Concern 2
Is probability distribution random?
Thank you. Yes, the probability distribution over covariates is random. We updated the subsection describing the synthetic data as follows:

We generated data from this system for N = 50 subjects in a "normative cohort" defined by a specific parameterization of an isotropic multivariate Gaussian distribution over covariates (Fig. 1F).

Concern 3
"Unlike variance and entropies, a 50% increase in heterogeneity corresponds to a 50% increase of variation in the system." This needs explanation.
Thank you. We have added a supplementary figure to illustrate this phenomenon (Figure 1 in this letter).
In other words, unlike variance and entropies, a 50% increase in Rényi heterogeneity corresponds to a 50% increase of variation in the system (illustrated graphically in S2 Fig).
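As a minimal numerical sketch of this scaling property (not taken from the manuscript or from S2 Fig), consider two uniform systems with 10 and 15 equally likely states, so that the second has 50% more variation. The helper names `renyi_heterogeneity` and `shannon_entropy` are illustrative assumptions, and the heterogeneity is computed in the standard numbers-equivalent (Hill number) form.

```python
import numpy as np


def renyi_heterogeneity(p, q=1.0):
    """Effective number of states (numbers equivalent) of order q for a probability vector p."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # states with zero probability do not contribute
    if np.isclose(q, 1.0):
        # The limit q -> 1 is the exponential of the Shannon entropy
        return float(np.exp(-np.sum(p * np.log(p))))
    return float(np.sum(p ** q) ** (1.0 / (1.0 - q)))


def shannon_entropy(p):
    """Shannon entropy in nats."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))


small = np.full(10, 1 / 10)  # 10 equally likely states
large = np.full(15, 1 / 15)  # 15 equally likely states (50% more)

het_ratio = renyi_heterogeneity(large) / renyi_heterogeneity(small)
ent_ratio = shannon_entropy(large) / shannon_entropy(small)

print(f"Renyi heterogeneity ratio: {het_ratio:.3f}")  # 1.500
print(f"Shannon entropy ratio:     {ent_ratio:.3f}")  # 1.176
```

In this toy case the heterogeneity grows by exactly 50%, whereas the entropy grows by only about 18%, which is the nonlinearity the sentence above refers to.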

Concern 4
What is c-deviant?
Thank you. We have added the following clarifying statement to the background.
..., N be the N × M absolute standardized deviance matrix, where $z^*_i$ is the absolute standardized deviance for subject $i \in \{1, 2, \ldots, N\}$. We set the bottom 100(1 − c)% of values in $z^*_i$ to 0 and rescale the remainder such that they sum to one, thereby generating a probability distribution over the c-deviant features (i.e., those features with the top 100c% of standardized deviance values).
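The thresholding and rescaling step described above can be sketched for a single subject as follows. This is an illustration under stated assumptions rather than the code in S1 File: the function name `c_deviant_distribution` is hypothetical, and the number of retained features is assumed to be the ceiling of c × M.

```python
import numpy as np


def c_deviant_distribution(z_abs, c=0.1):
    """Map one subject's absolute standardized deviances to a distribution over c-deviant features."""
    z_abs = np.asarray(z_abs, dtype=float)
    m = z_abs.size
    # Assumed rounding rule: keep ceil(c * M) features, and at least one
    k = max(1, int(np.ceil(round(c * m, 8))))
    keep = np.argsort(z_abs)[-k:]  # indices of the k largest deviances
    p = np.zeros(m)
    p[keep] = z_abs[keep]          # the bottom 100(1 - c)% stay at zero
    total = p.sum()
    if total == 0:
        raise ValueError("all retained deviances are zero; cannot normalize")
    return p / total               # rescale so the retained values sum to one


# Toy subject with M = 10 features and c = 0.2 (top 20% of features retained)
z = [0.2, 3.0, 0.5, 1.0, 0.1, 2.0, 0.3, 0.4, 0.6, 0.7]
p = c_deviant_distribution(z, c=0.2)
print(p)  # mass 0.6 on feature 1 and 0.4 on feature 5, zero elsewhere
```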

Concern 5
The method generated data from this system for N = 50 subjects in a "normative cohort"-How you get the value N=50 as subjects?
Thank you. We have added the following statement to the description of the synthetic dataset.
We generated data from this system for N = 50 subjects in a "normative cohort" defined by a specific parameterization of an isotropic multivariate Gaussian distribution over covariates (Fig. 1F). The sample size was chosen to be large enough to allow for clear visualization of differences, while remaining small enough that the analyses could be reasonably reproduced by readers running typical personal computers. Interested readers may manipulate all parameters of this analysis in the Supplementary code (S1 File). A Gaussian process normative model was fit to these data (implemented in GPyTorch v.1.1 [4]) and evaluated for generalizability on an independently sampled normative cohort (Fig. 1G).

Concern 6
Sampling rate is not specified.
Thank you. We have specified all necessary sampling rates for all analyses. In the synthetic data analysis, we generate M = 30 features for N = 50 subjects. Note that this is not a time series. In the MNIST data, there are 70,000 images across 10 classes, with the digit class-specific normative models trained using 10-fold cross-validation. In the autism dataset, there were 199 subjects in each of the control and autism groups, with the generalizability of the model evaluated under 10-fold cross-validation.

Concern 7
Specify result of the method in a tabular form and there is a comparative study between your method and existing methods.
Thank you. We have added a table describing the results of the experiment on the ABIDE dataset (included at the end of this document). We have not added the heterogeneity accumulation curves in tabular format, as that table would be large and would not convey the main point as well as the figures do. Similarly, the results on the synthetic and MNIST data would not be better conveyed by a tabular representation.
Finally, there are no other approaches to measuring heterogeneity in normative modeling that can be compared to our method, so no comparative results can be offered. We have highlighted the novelty of our work in this respect in the Introduction:
... To date, there is no proposed method to rigorously measure heterogeneity in normative models. ...
We therefore introduce a Rényi heterogeneity measurement for normative modeling studies, the effective number of deviation patterns (ENDevs), which estimates the number of distinct ways in which a cohort deviates from the normative distribution. To our knowledge, this is the first such explicitly defined heterogeneity measurement approach for the normative modeling framework. We also demonstrate a useful property of ENDevs: that cohorts sampled from outside a non-trivial normative distribution will tend to have lower ENDevs than cohorts sampled from the normative distribution proper.
and in the Background:
... Statistics summarizing deviance at the subject level can then be computed in various ways, including summation or average deviance, thresholding of the Z-score [5], or by application of extreme value statistics to model largest deviations [6]. However, there are no existing approaches for quantifying the absolute amount of heterogeneity using normative modeling.

Abraham Nunes MD PhD MBA Department of Psychiatry 5909 Veterans Memorial Lane QEII Health Sciences Centre Halifax, NS, B3H 2E2, Canada E-Mail: nunes@dal.ca

However, we have also noted that the effective number of deviation patterns (ENDevs) is not the only possible heterogeneity measure that could be derived from normative models, and have called for further research in this respect:
In principle, one may also derive various other RRH-derived indices for the normative modeling approach. New measures could be identified by defining other representations of deviation from the normative range; so long as one can also specify a probability distribution over such a representation, Rényi heterogeneity will be applicable, and the heterogeneity index will inherit its well understood properties [3,7,8].

Thank you. While we have not added a complete section on motivation, we have expanded the first paragraph of the introduction to read as follows:
Psychiatric disorders are defined by their clinical presentations, which means that several different biological abnormalities could result in what we erroneously call a single condition. This has important consequences for biological and treatment studies that assume subjects form a single homogeneous population. Instead, in reality, a given psychiatric condition such as bipolar disorder might differ from normal population variation in heterogeneous ways. Normative modeling is a popular method for disentangling this heterogeneity in clinical cohorts [5,9-17]. This method involves learning a distribution of normal variation, with the assumption that clinically relevant phenotypes are identifiable by significant deviations from this normative range. Unfortunately, it does not measure the amount of heterogeneity in a cohort. To improve our understanding of psychiatric nosology, we must aim to understand the factors that cause biological and other forms of heterogeneity in our diagnostic system, as well as the consequences of that heterogeneity on research and clinical practice. However, to study heterogeneity rigorously, we must be able to measure it rigorously. To date, there is no proposed method to rigorously measure heterogeneity in normative models.

Concern 2
Figures are also stretched.
Thank you. We believe this is an artifact of the upload to the manuscript submission server, since all images were submitted in their original sizes, which are not stretched.

Concern 2
Additionally, a typographic error in Line No. 178 needs to be corrected.
Thank you. This has been corrected.

The introduction section should explain the problem at hand more elaborately.

Thank you. We have updated the introduction to now read as follows.
Psychiatric disorders are defined by their clinical presentations, which means that several different biological abnormalities could result in what we erroneously call a single condition. This has important consequences for biological and treatment studies that assume subjects form a single homogeneous population. Instead, in reality, a given psychiatric condition such as bipolar disorder might differ from normal population variation in heterogeneous ways. Normative modeling is a popular method for disentangling this heterogeneity in clinical cohorts [5,9-17]. This method involves learning a distribution of normal variation, with the assumption that clinically relevant phenotypes are identifiable by significant deviations from this normative range. Unfortunately, it does not measure the amount of heterogeneity in a cohort. To improve our understanding of psychiatric nosology, we must aim to understand the factors that cause biological and other forms of heterogeneity in our diagnostic system, as well as the consequences of that heterogeneity on research and clinical practice. However, to study heterogeneity rigorously, we must be able to measure it rigorously. To date, there is no proposed method to rigorously measure heterogeneity in normative models.
Heterogeneity measurement has been studied for more than a century [18], but most indices, such as variance and entropies [19-21], have inconsistent units and can scale counterintuitively [7,22]. Conversely, ecologists and others have adopted the Rényi heterogeneity family of indices as a "true diversity" index [8,23-25]. The Rényi heterogeneity measures a system's effective number of configurations (numbers equivalent [26]). This measure scales linearly and generalizes most commonly used diversity indices [3,7,8,27,28].
We therefore introduce a Rényi heterogeneity measurement for normative modeling studies, the effective number of deviation patterns (ENDevs), which estimates the number of distinct ways in which a cohort deviates from the normative distribution. To our knowledge, this is the first such explicitly defined heterogeneity measurement approach for the normative modeling framework. We also demonstrate a useful property of ENDevs: that cohorts sampled from outside a non-trivial normative distribution will tend to have lower ENDevs than cohorts sampled from the normative distribution proper.

Concern 2
The authors have not emphasized the necessity (novelty) of this work.
Thank you. In the introduction, we have now emphasized that there are no specifically defined approaches for measuring heterogeneity in the normative modeling framework:
We therefore introduce a Rényi heterogeneity measurement for normative modeling studies, the effective number of deviation patterns (ENDevs), which estimates the number of distinct ways in which a cohort deviates from the normative distribution. To our knowledge, this is the first such explicitly defined heterogeneity measurement approach for the normative modeling framework. We also demonstrate a useful property of ENDevs: that cohorts sampled from outside a non-trivial normative distribution will tend to have lower ENDevs than cohorts sampled from the normative distribution proper.
The following added sentences highlight the importance of measuring heterogeneity in an absolute sense:
To improve our understanding of psychiatric nosology, we must aim to understand the factors that cause biological and other forms of heterogeneity in our diagnostic system, as well as the consequences of that heterogeneity on research and clinical practice. However, to study heterogeneity rigorously, we must be able to measure it rigorously. To date, there is no proposed method to rigorously measure heterogeneity in normative models.

Concern 3
The review of past work has been covered in a sentence with a series of references; however, the readers may not be well aware of the subject. A paragraph towards the same should suffice.
Thank you. We have expanded the first paragraph of our introduction to provide further context, as follows:
Psychiatric disorders are defined by their clinical presentations, which means that several different biological abnormalities could result in what we erroneously call a single condition. This has important consequences for biological and treatment studies that assume subjects form a single homogeneous population. Instead, in reality, a given psychiatric condition such as bipolar disorder might differ from normal population variation in heterogeneous ways. Normative modeling is a popular method for disentangling this heterogeneity in clinical cohorts [5,9-17]. This method involves learning a distribution of normal variation, with the assumption that clinically relevant phenotypes are identifiable by significant deviations from this normative range. Unfortunately, it does not measure the amount of heterogeneity in a cohort. To improve our understanding of psychiatric nosology, we must aim to understand the factors that cause biological and other forms of heterogeneity in our diagnostic system, as well as the consequences of that heterogeneity on research and clinical practice. However, to study heterogeneity rigorously, we must be able to measure it rigorously. To date, there is no proposed method to rigorously measure heterogeneity in normative models.

In the Background section, we also provide detailed descriptions of normative modeling and of heterogeneity measurement:

Normative Modeling
Normative modeling involves four steps [9]. First, one defines the spaces of predictor and response variables, $\mathcal{X} = \mathbb{R}^K$ and $\mathcal{Y} = \mathbb{R}^M$, respectively. A dataset $\mathcal{D}_o = \{(x_i, y_i) : i \in \{1, 2, \ldots, N\}\}$ is collected for a "normative" cohort comprising predictor variables $x \in \mathcal{X}$ and response variables $y \in \mathcal{Y}$. Often, the predictor variables are clinical covariates such as behavioural traits, sex, and age, with the response variable being some biological measurement such as the volume of some brain region on structural neuroimaging.
The second and third steps involve learning a model of the mapping $\mathcal{X} \to \mathcal{Y}$ using the normative cohort's data $\mathcal{D}_o$, with out-of-sample model criticism. Ideally, the model should map predictors onto a space of probability distributions on the response variable, that is, $g : \mathcal{X} \to \mathcal{P}(\mathcal{Y})$ [9].
Finally, one collects a dataset $\mathcal{D} = \{(x_i, y_i) : i \in \{1, 2, \ldots, N\}\}$ from a target cohort of N subjects, such as clinical patients. One then computes the degree to which Y deviates from the predictions given by g(X). A common approach to quantifying this deviation when g is a Gaussian process regression model is a Z-score that we henceforth call the "standardized deviance." For each subject $i \in \{1, 2, \ldots, N\}$ and response variable $j \in \{1, 2, \ldots, M\}$, the standardized deviance is
$$z_{ij} = \frac{y_{ij} - \mu_j(x_i)}{\sqrt{\sigma_j^2(x_i) + \eta_j^2}},$$
where $\mu_j(x_i)$ and $\sigma_j^2(x_i)$ are the expected value and predicted variance, respectively, given predictors $x_i$, and $\eta_j^2$ is the variance in the response variable learned by the normative model. Statistics summarizing deviance at the subject level can then be computed in various ways, including summation or average deviance, thresholding of the Z-score [5], or by application of extreme value statistics to model largest deviations [6]. However, there are no existing approaches for quantifying the absolute amount of heterogeneity using normative modeling.
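For the standardized deviance described in the Normative Modeling passage, a minimal sketch of the computation is given here. It assumes the usual Z-score form, in which the residual is divided by the square root of the sum of the predictive variance and the learned noise variance. The function name `standardized_deviance` and the toy arrays are illustrative assumptions and are not taken from the supplementary code.

```python
import numpy as np


def standardized_deviance(y, mu, sigma2, eta2):
    """Z-score of observed responses against a normative model's predictions.

    y, mu, sigma2 : arrays of shape (N, M) with observed responses, predicted
                    means, and predicted variances for N subjects, M features.
    eta2          : array of shape (M,) with the learned noise variance per feature.
    """
    y, mu = np.asarray(y, dtype=float), np.asarray(mu, dtype=float)
    sigma2, eta2 = np.asarray(sigma2, dtype=float), np.asarray(eta2, dtype=float)
    # eta2 broadcasts across subjects (rows)
    return (y - mu) / np.sqrt(sigma2 + eta2)


# Toy example: N = 2 subjects, M = 2 response variables
y = np.array([[1.0, 4.0], [0.0, 2.0]])
mu = np.array([[0.0, 2.0], [0.0, 2.0]])
sigma2 = np.array([[0.5, 3.0], [0.5, 3.0]])
eta2 = np.array([0.5, 1.0])

z = standardized_deviance(y, mu, sigma2, eta2)
z_abs = np.abs(z)  # absolute standardized deviance
print(z)  # [[1. 1.] [0. 0.]]
```

The first subject deviates by one standard unit on both features, while the second subject sits exactly on the normative prediction.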

Measuring Heterogeneity
Heterogeneity is the degree to which a system deviates from perfect conformity, and is measured as the effective size of the system's event space using the Rényi heterogeneity indices [3,24,25,29]. Given a system with probability distribution $p = (p_i)_{i=1,2,\ldots,n}$ over n categorical states, the Rényi heterogeneity of order $q \geq 0$ is defined as follows:
$$\Pi_q(p) = \left( \sum_{i=1}^{n} p_i^q \right)^{\frac{1}{1-q}}$$

Our first experiment used synthetic data to provide a simple illustration of how a "clinical" or "target" cohort will have a lower ENDevs than a sample of individuals drawn from the distribution of normative variation. Our second experiment seeks to evaluate whether this same phenomenon will occur with more complex and high-dimensional data, such as natural images of objects from different categorical classes. For such a dataset to be appropriate, we must be able to guarantee that there are meaningful differences between groups in the raw data, and that these differences can be easily visualized for illustrative purposes. We therefore use the MNIST dataset, which includes 70,000 images of handwritten digits, roughly evenly balanced over classes {0, 1, ..., 9} [30]. The MNIST dataset is among the simplest non-trivial examples in which consistent deviation patterns exist between robustly defined classes. That is, (A) digit classes are valid partitions, and (B) their use for written communication mandates the existence of consistent and circumscribed deviation patterns between classes.

Concern 6
Is it possible to adapt datasets used by similar study with the proposed methodology.
Thank you. We have updated a paragraph in the discussion as follows:
Representational Rényi heterogeneity is a general theory of heterogeneity that can be applied to any dataset, given a suitable probabilistic representation [3]. Since normative models can generate probability distributions over patterns of deviation from normal variation, RRH (here with units of ENDevs) can therefore be measured in any normative modeling study, although the normative modeling framework at present has been largely isolated to neuroimaging applications [9]. Future work should extend normative modeling, and consequently our heterogeneity measure, to genetic, connectomic, and behavioural data.

Concern 7
Discussion section should include comparative study with other similar work (at least 2-3 articles that have approached the problem at hand).
Thank you. The following modification was made to the introduction to explain that there have been no previous attempts to measure heterogeneity using normative models.
We therefore introduce a Rényi heterogeneity measurement for normative modeling studies, the effective number of deviation patterns (ENDevs), which estimates the number of distinct ways in which a cohort deviates from the normative distribution. To our knowledge, this is the first such explicitly defined heterogeneity measurement approach for the normative modeling framework. We also demonstrate a useful property of ENDevs: that cohorts sampled from outside a non-trivial normative distribution will tend to have lower ENDevs than cohorts sampled from the normative distribution proper.
However, to provide further clarification on the chain of papers leading to the present study, we have modified a paragraph of the discussion to read as follows:
Our method exploits a potential synergy between normative modeling [9] and representational Rényi heterogeneity (RRH) [3]. The RRH theory was developed after previous work showed that there were no existing heterogeneity measures capable of application to the types of data used in modern psychiatric research [18,31]. Furthermore, RRH was shown to generalize heterogeneity measures used across multiple disciplines, such as ecology [7,24], economics [25], and statistical physics [28]. Representational Rényi heterogeneity involves measuring the size of a system's event space, which can be generated (along with a probability distribution over it) by a normative model. In the present study, this was the space of c-deviation patterns, where c was an extreme value threshold. If this set of deviation patterns constitutes a relevant representation for the condition of scientific interest, then Rényi heterogeneity will provide an axiomatically sound measure of diversity in that set [7,8,18,22,29].