SiGMoiD: A super-statistical generative model for binary data

In modern computational biology, there is great interest in building probabilistic models to describe collections of a large number of co-varying binary variables. However, current approaches to build generative models rely on modelers’ identification of constraints and are computationally expensive to infer when the number of variables is large (N~100). Here, we address both these issues with Super-statistical Generative Model for binary Data (SiGMoiD). SiGMoiD is a maximum entropy-based framework where we imagine the data as arising from super-statistical system; individual binary variables in a given sample are coupled to the same ‘bath’ whose intensive variables vary from sample to sample. Importantly, unlike standard maximum entropy approaches where modeler specifies the constraints, the SiGMoiD algorithm infers them directly from the data. Due to this optimal choice of constraints, SiGMoiD allows us to model collections of a very large number (N>1000) of binary variables. Finally, SiGMoiD offers a reduced dimensional description of the data, allowing us to identify clusters of similar data points as well as binary variables. We illustrate the versatility of SiGMoiD using multiple datasets spanning several time- and length-scales.


Author summary
Collectively varying binary variables are ubiquitous in modern biology. Given that the number of possible configurations of these systems typically far exceeds the number of available samples, generative models have become an essential tool in quantitative descriptions of binary data. The state-of-the-art approaches to build generative models have several conceptual limitations. Specifically, they rely on the modeler choosing system-appropriate constraints, which can be challenging in systems with many complex interactions. Moreover, they are computationally expensive to infer when the number of variables is large (N~100). To address this issue, we propose a theoretical generalization of the maximum entropy approach that allows us to model very high dimensional data; at least an order of magnitude higher than what is currently possible. This framework will be a significant advancement in the computational analysis of covarying binary variables.

Introduction
Recent technical advances allow us to collect high resolution and high dimensional data across several biological systems. In several cases, these data can be accurately represented as collectively varying binary variables. Significant examples can be found in genomics, where sequencing data is first mapped to gene families and then the presence or absence of thousands of genes across microbial genomes is investigated [1], in microbial ecology, where sequences of the 16s ribosomal gene are first mapped onto operational taxonomic units (OTUs) and then presence or absence of species across microbiomes is investigated to identify direct metabolic interactions [2], or in neuroscience, where electrical current recordings from hundreds of thousands neurons are binarized into spike trains which are then related to organismal level tasks [3]. Unfortunately, estimating the frequency of occurrence of every possible binary configuration from available samples is not possible for any reasonably sized collection; a system with N co-varying binary variables has 2 N possible configurations and the number of collected samples is typically orders of magnitude lower than the number of configurations. At the same time, given the complexity of interactions, in most cases, it is infeasible to build bottom-up mechanistic models to describe these systems. A popular alternative is to derive approximate top-down probabilistic models and train those models on the data. Over the past two decades, the maximum entropy (max ent) method [4] has emerged as perhaps the only candidate for building approximate generative models across a variety of contexts [5][6][7][8][9][10][11][12]. Briefly, amongst all probability distributions (models) that are consistent with user-specified constraints, max ent chooses the least biased one; the max ent distribution does not disfavor any outcome unless warranted by the imposed constraints. However, traditional application of max ent has several drawbacks. (1) Perhaps the biggest limitation is that the modeler is required to a priori identify constraints that are appropriate for a given system. Depending on the complexity of interactions, these constraints may not be obvious. A work around is to impose a very large number of constraints. For example, a max ent model to analyze correlated firing N neurons will typically involve N constraints on mean firing rates of individual neurons and~N 2 constraints on covariations in firing rates for all pairs of neurons. (2) These user-identified constraints are imposed using Lagrange multipliers and the multipliers need to be tuned such that model predictions numerically match imposed constraints. However, most often these multipliers cannot be determined analytically and have to be inferred numerically. The most common approach is to use Markov chain Monte Carlo (MCMC) methods to estimate the gradients of the log-likelihood of the data with respect to the Lagrange multipliers and then performing gradient ascent [10,13,14]. These calculations are computationally infeasible even when the number of dimensions is only moderately large (N~100). (3) The numerical values of the imposed constraints are often evaluated using experimental samples which implicitly assumes that samples points are statistically independent of each other. This assumption is not true in most practical applications, for example, in temporally correlated firing of neurons or phylogenetically related protein sequences [10,15]. These limitations taken together have severely limited the application of max ent when there an increasing interest in describing the collective behavior of thousands of cells, genes or microbial species, among others.
In order to study covariation in a large number of binary variables in a constraint-agnostic and numerically efficient manner, we propose a novel dimensionality reduction framework inspired from statistical physics; Super-statistical Generative Model for binary Data (SiG-MoiD). SiGMoiD is a generalization of the max ent approach and has several salient features that distinguish it from the state-of-the-art models of binary variables. (1) In SiGMoiD, the modeler only specifies the total number of constraints. The constraints are optimally learned from the collected samples and a max ent model is fit to those constraints. (2) As a result of this optimal choice of constraints, SiGMoiD requires a much smaller number of constraints than traditional max ent. Consequently, the inference in SiGMoiD is significantly faster than typical max ent models, allowing us to analyze very high dimensional data sets (dimensions �1000) that remain well out of the reach of current max ent methods. (3) SiGMoiD does not assume that collected samples are drawn from the same distribution. Instead, motivated by superstatistics, it imagines each sample as arising from its own max ent probability distribution. Since each sample is approximated by a small set of Lagrange multipliers, SiGMoiD is also a non-linear dimensionality reduction method. Below, we first sketch the outline of SiG-MoiD and then illustrate it utility by applying it to several data sets.

The model
We assume that experimental measurements are in a form where individual samples (data points), indexed by subscript s, comprise N binary variables {σ si } (i2 [1,N], s2 [1, S]) that take values 0 or 1. Let us denote by π si the probability that σ si = 1 and by π s the vector of probabilities π s = {π s1 , π s2 ,. . .,π sN }. To motivate our model framework (Fig 1), we imagine the following physical process: for a fixed sample s, each binary variable i in the collection of N variables is interacting with the same bath that can exchange K types of extensive variables (energies). The k th type of energy (feature) for each binary variable in the state when it is active (σ si = 1) is E ki and zero when it is inactive (σ si = 0) (denoted collectively by E). Under these circumstances, the probability of the i th binary variable is equal to 1 in the s th sample is given by the Gibbs-Boltzmann distribution [16]: In Eq 1, β s = {β s1 , β s2 ,. . .,β sK } are the intensive variables (latent space representation or latents) specific to sample s. The probabilities in Eq 1 are the maximum entropy probability distributions when averages hE ki i s (k2 [1,K]) of the K types of energies are specified for each variable (i2 [1,N]) for every sample s.
We have set up the model such that the latents β sk depend on the sample index s but not on the index i of the binary variables. In contrast, the features E ki depend on the binary variables i but are shared across all samples s. Let us consider that we are given S samples {σ si } (i2 [1,N], s2 [1,S]) of the binary variables. From these samples, we infer sample-specific latents β s and sample-independent features E. To that end, we take a maximum likelihood approach. We write the log-likelihood of the data given the parameters: The log-likelihood can be maximized to determine the parameters using gradient ascent. The gradients are given by: SiGMoiD has several salient features. First, similar to other non-linear dimensionality reduction methods, if K�N, SiGMoiD offers a reduced dimensional description of the data; the K dimensional vectors β s embed the N dimensional data point σ s in a K�N dimensional space. In addition, since SiGMoiD is a fully probabilistic approach, it can also be used as a generative model. Random samples can be generated as follows. We first select a random set of latents β s , evaluate the probabilities π s and sample random variables σ as Bernoulli variables using those probabilities. SiGMoiD also allows us to evaluate the probability of a new set of binary variables σ given the other observations. Specifically, the probability is where is the probability of observing the binary variables σ when the latents are fixed at β s . We note that even though we have proposed to identify the parameters using a maximum likelihood approach, given a suitable prior p prior (β, E) over the parameters, we can also estimate the Bayesian uncertainty in parameter estimation using the posterior distribution Finally, we comment on the degeneracies in SiGMoiD inference procedure. If we multiply the S×K matrix of latents β sk by a K×K invertible matrix M:β!βM and simultaneously multiply the K×N matrix of features E ki by M −1 :E!M −1 E, the SiGMoiD predictions do not change. Therefore, SiGMoiD-based inference of parameters will significantly depend on the initialization. These degeneracies can therefore be minimized or completely removed without changing model performance by imposing additional restrictions on the parameters, for example, by requiring that the latents or the features are orthogonal to each other, i.e. imposing β T β = I K or EE T = I K . We leave these explorations for future studies.

Accuracy of SiGMoiD as a probabilistic model: Modeling the collective firing of neurons
Before illustrating SiGMoiD using high dimensional data sets, we first show a comparison between SiGMoiD and the standard approach to model binary variables; a max ent model. We use a previously collected data set measuring the collective firing of 160 retinal neurons for the duration of a movie that lasted 19 seconds [17,18] (see Supplementary Information). We note that inference of a max ent model for the collective firing of all 160 neurons is currently computationally prohibitive. We chose the 15 most active neurons in the data (15 highest firing propensities) to illustrate our approach. First, we inferred a max ent model from the data that constrained mean firing rates and pairwise correlations. The max ent model describes the probability of any configuration σ as: In Eq 7, J ij are coupling constants (Lagrange multipliers) that need to be inferred from the data, typically using gradient ascent of the log likelihood of the data [10,13,14]. Given that there are only 2 15~3 ×10 4 states for 15 neurons, we could estimate model predictions and therefore the coupling constants by an exhaustive brute force summation over all possible states without resorting to MCMC simulations to estimate the gradients. This minimized the errors in max ent inference that arise due to inaccuracies in MCMC-based estimates of average firing rates and neuron-neuron correlations. The resulting max ent model perfectly captured the average firing rates and the pair correlations (S1 Fig).
In Fig 2, we compare the max ent model with SiGMoiD. The max ent model has 15 2 ! ¼ 105 neuron-specific parameters. To match that number, we choose K = 7 in SiGMoiD. In panel (a), we show a comparison between the raw frequencies of individual configurations obtained from data (x-axis) to model predicted probabilities (y-axis, red: max ent, blue: SiG-MoiD). The raw frequencies were obtained by counting the number of instances of individual configurations across all 10 4 samples. It is clear that SiGMoiD has smaller error compared to the max ent model (mean absolute error 8.6×10 −6 vs 1.4×10 −5 ). In panel (b), we plot the probability that n neurons fire at the same time as observed in the data (black), predicted using SiG-MoiD (blue), and using the max ent model (red). Here too, the SiGMoiD model performs well when capturing the probability of simultaneous activity. In panels (c) and (d), we plot the absolute values of the three-body correlations |hδσ i δσ j δσ k i| as observed in the data (x-axis) and as predicted by the model (y-axis, SiGMoiD, panel (c), max ent, panel (d)). Both models capture the three body correlations with reasonable accuracy; the mean absolute error is 8.2×10 −4 vs 1.4×10 −3 for the SiGMoiD and the max ent model respectively. This analysis illustrates that the SiGMoiD is better than the max ent based model at capturing the data and making predictions. Next, we move to systems that are currently well out of the reach of max ent methods.

Inference of interactions from bacterial co-occurrences using SiGMoiD
Gut microbiomes are complex ecosystems whose statistical properties have received significant attention in the last couple of years [19,20]. Gut bacteria live in species-rich communities where they compete for nutrients and also exchange metabolites with each other. Describing these interactions is critical to map the ecological networks of gut microbiomes and identify targets for controlling microbial communities [21]. However, many of the direct metabolic interactions between gut microbes are likely to occur at a micron length scale [2]. Therefore, it is infeasible to infer these interactions from macroscopic, community-wide abundance covariation.
To address this issue, Sheth et al. [2] recently probed the spatial organization of the gut microbiome at the micron length scale, allowing them to capture putative direct interactions between bacteria. In these experiments, Sheth et al. [2] fractionated mice guts into particles with a median diameter of 30 μm and quantified the membership of 347 operational taxonomic units (OTUs) across 1406 particles. However, given that co-occurrences are transitive (if A interacts and co-occurs with B, and B interacts and co-occurs with C, then A cooccurs with C even in the absence of interactions), it is not possible to use simple co-occurrence calculations to identify putative pairs of directly interacting OTUs [22].
Given that SiGMoiD can directly model occurrence of individual OTUs across particles, it can be used to identify clusters of OTUs that co-vary across particles as well as clusters of particles that show specific OTU occurrence profiles. We therefore analyzed the data collected by Sheth et al. [2] using SiGMoiD (see Supplementary Information). Each particle was characterized by a binary vector of dimension representing the OTUs present in that particle. It is evident that SiGMoiD will fit the data better as the number of components K increases. Unlike the neuron samples which were correlated in time and across different trials, the microbiome particles are likely to be closer to statistical independence. Therefore, we can use information theory-based criteria to select the optimal K that fits the data but avoids overfitting. In S2 Fig,  we show the Akaike information criteria (AIC) vs. K for the OTU data. The model picks out K = 8 as the optimal value which we use in further analysis. When individual samples are correlated, one can use cross-validation; splitting the samples in a training vs. a validation set and then evaluating the probability of the validation set, to avoid overfitting.
The number of species found in each particle, a gross descriptor of the complexity of the community [8], varies substantially from particle to particle. As shown in Fig 3, generative modeling of the particles using SiGMoiD accurately captures this quantifier of ecological PLOS COMPUTATIONAL BIOLOGY complexity. In Fig 3A we show the probability of co-occurrence of multiple OTUs in any community as observed in the data (black circles) and as predicted by SiGMoiD (blue line). We compare these distributions with the null expectation given by the probabilities of occurrence of n OTUs in any given particle calculated using the occurrence frequencies of individual OTUs but neglecting the correlations between OTUs (red line). The significant difference between the two suggests that SiGMoiD can accurately capture the interactions between OTUs, which in turn allows it to predict the co-occurrence distribution.
In fact, SiGMoiD can be used to identify specific bacteria with similar occurrence profiles across particles. SiGMoiD characterizes each binary variable (here, OTU presence/absence) using a K dimensional vector of features. OTUs with similar features will have similar cooccurrence profiles as well. Therefore, the feature vector can be used to identify clusters of cooccurring OTUs. SiGMoiD-based clustering of OTUs is a more direct way of identifying clusters by relying on inferred inherent properties of the OTUs rather than their co-occurrence profiles. Fig 3B shows

PLOS COMPUTATIONAL BIOLOGY
SiGMoiD: A super-statistical generative model for binary data tailed hypergeometric distribution p-value 0.009). Notably, the OTUs belonging to this cluster had predominantly positive correlations across different particles; 2330 out of the 2346 unique pairs had a positive correlation with 92% of pairs with a p-value less than 10 −2 (84% of pairs with a p-value less than 10 −4 and 74% of pairs with a p-value less than 10 −6 ). In comparison, only 50% of unique pairs from other OTUs had a positive correlation and only 23% of those correlations had a p-value less than 0.01. These analyses suggest that SiGMoiD-based features can identify clusters of OTUs that significantly co-occur in a given ecology. There are two types of metabolic interactions between bacteria that lead to co-occurrence in an ecosystem [23], especially at the micron length scale [2]. Genetically related bacteria tend to co-occur because they have similar metabolic networks and can compete for the same resources. In contrast, genetically dissimilar bacteria have different metabolic networks and can cross-feed each other; one species utilizing the metabolic byproducts of another. Therefore, this cluster likely represents the co-occurrence of multiple species in the Lachnospiraceae family that compete with each other for the same resources in the mouse gut.
In addition to identifying OTUs that have similar occurrence profiles across communities, SiGMoiD can also be used to identify communities that have similar OTU occurrence profiles. SiGMoiD embeds each high dimensional binary sample in a much lower dimensional space of sample-specific β latents. Using K-means clustering of sample-specific β latents, we identified 3 clusters of particles (S2 Fig). Principal component analysis (PCA)-based visualization of the particles clearly shows the three identified clusters (Fig 3C). Notably, several specific OTUs were co-present with much higher occurrence frequencies in the identified small cluster (cluster 3, comprising 47 particles). In Fig 3D, we compare the pairwise co-occurrence frequency of 10 OTUs whose occurrence frequency was identified to be most significantly different between particles in cluster 3 compared to the baseline using a hypergeometric test (S1 Table). It is clear that compared to the baseline co-occurrence frequency (sub-diagonal half of Fig 3D), the pairwise co-occurrence frequencies of the 10 OTUs are significantly elevated in the communities in cluster 3. These analyses show that SiGMoiD can also identify specific communities that comprise strongly co-occurring bacteria that differentiate them from other communities. These significant clusters can potentially be investigated for direct co-operative or competitive interactions, as well as their association with distinct regions of the gut. Importantly, clusters of particles with these tightly correlated species were not detected when we clustered the samples (binary vectors) directly using the same approach (S3 Fig).

Identifying missing metabolic reactions using SiGMoiD
The metabolic repertoire of microorganisms enables them to convert nutrients into biomass and energy and underlies phenotypic traits central to their ecosystem roles [24]. E.g. microbial fermentation in the gut and its impact on human health [25] or methane production in animal agriculture or wetland ecosystems [26,27].
Genome sequencing and annotation methods have enabled the identification of metabolic transformations that individual microbes can potentially carry out through the reconstruction of their metabolic networks. Metagenomics sequencing on the other hand, has allowed the study of the genomes and metabolic properties of microorganisms in microbiomes of interest which have not yet been cultured and characterized. Nevertheless, due to the complexity of these microbial communities, it is often not possible to determine the full genomic content of most members of a given microbiome. Therefore, beyond a few highly abundant microbes whose genomes can be inferred to a reasonable degree of completeness from metagenomics data, the metabolic capabilities of many microbes of interest can only be partially assessed through metagenome annotation and binning methods. Here we show that SiGMoiD can be used to infer missing reactions in the metabolic repertoire of incompletely sequenced genomes, as are often produced by metagenome assembly and binning pipelines.
To that end, we downloaded genome-scale metabolic reconstructions for~4000 bacteria from KBase [28] generated with the ModelSeed pipeline [29] (see Supplementary Information). Often, intercompartmental transport reactions and other reactions are added to metabolic reconstructions to ensure mass balance and viability of biomass production without a clear identification of the genes that may carry out these reactions. To avoid biasing our approach towards or against these ad hoc additions, we only retained those reactions that were assigned to one or more genes in the reconstructions. This resulted in a total of~3300 reactions across all bacterial metabolic reaction sets. We randomly selected 400 bacteria as a test set to quantify the accuracy of our predictions. We inferred SiGMoiD parameters on the rest of the bacteria and used the inferred parameters to identify missing reactions in the test dataset as follows.
First, for each bacteria in the testing set, we removed a fixed fraction of reactions ranging from 10% to 90% to simulate incomplete genome coverage from metagenomic sequencing. Next, we used SiGMoiD and the known reactions for each bacteria in the testing set to predict the missing reactions. We used K = 15 components. We identified groups of reactions with similar occurrence profiles across the bacterial world by clustering their corresponding E ki features obtained from the training data using agglomerative hierarchical clustering [30]. The similarity d ij between two metabolic reactions i and j was defined as the L2 norm: Hierarchical clustering divides the reactions into multiple clusters depending on a tunable level of clustering. On one end, all reactions are clubbed into a single cluster, and on the other end, every reaction is its own cluster. For any given level of clustering, we employed a simple rule to predict missing reactions from known reactions. If any one of the reactions in a cluster is known to be present in the metabolic repertoire of a bacterium, all other reactions in the cluster are also predicted to be in the network. Using this simple prediction model and by varying the level of clustering, we obtained true and false positive rates for the predictions for each bacteria. These rates were averaged across all bacteria for a given level of clustering and plotted as a receiver operating characteristic (ROC) curve. Fig 4A shows that this simple approach to predict the missing metabolic reactions performs exceedingly well. The area under the curve when only 10% of the reactions are known is 0.916, which increases to 0.988 when 90% of the reactions are known. Notably, the performance of our approach is extremely Importantly, the highest overlap between the true metabolic repertoire and the predicted metabolic repertoire as quantified by the Jaccard index occurs near the true size of the network (Fig 4B).
Notably, unlike other gene inference methods that rely on phylogenetic placement of the query genome against a reference database [31,32] or approaches that can use phenotypic knowledge about the organism (nutrients that can sustain its growth, biomass composition, etc.) [33] our approach predicts reactions solely based on their co-occurrence profile across the bacterial kingdom. Therefore, our approach can be integrated with other methods to robustly predict missing reactions in genome scale metabolic reconstructions.

Discussion
A deluge of biophysical data in the last decade has called for the development of top-down modeling approaches. Here, instead of describing the data from first principles mechanistic models, one constructs probability distributions that represent it. As a result, generative models of collective behavior have become essential to modeling several biophysical systems. The most popular way to generate top-down models is the maximum entropy (max ent) approach wherein one approximates the data using a probabilistic model that reproduces lower order statistics estimated from the data. The max ent approach has the significant conceptual advantage that it represents the simplest model consistent with the imposed constraints. However, there are two significant drawbacks. First, the constraints are hand-picked by the modeler and the model therefore depends on these constraints. For binary data, constraints of averages and pair correlations have become popular. Second, the inference of max ent models for large data sets can be computationally expensive and it may be unrealistic to infer models for >100 binary variables.
To address these issues, we developed SiGMoiD. SiGMoiD takes an agnostic approach about the constraints. In SiGMoiD, instead of specifying the constraints, the user only specifies the total number of constraints. SiGMoiD learns these constraints from the data. Moreover, parameter inference in SiGMoiD is orders of magnitude faster than max ent inference. We showed using three data sets of varying complexity that SiGMoiD not only performs as well as max ent models in terms of accuracy but can also be applied to study very large data sets that are currently out of the reach of max ent inference. Going forward, we believe that this computationally efficient and conceptually straightforward approach will be immensely valuable in modeling collective behavior of high dimensional data.
We have previously developed SiGMoiD-like approaches [16,34] to model multinomially distributed abundance data common in sequencing studies including 16s sequencing based characterization of the microbiome [34]. Going forward, the most straightforward generalization to SiGMoiD is applying it to study amino acid/nucleotide variation in sequencing data. Concretely, a collection of N protein sequences of length L each can be represented by binary variables σ nia = 0/1 where n2 [1, N] is the index of the sample, i2 [1, L] is the position of the amino acid in the protein sequence, and a2 [1,21] is the identity of the amino acid (there are 20 naturally occurring amino acids, plus an additional index for a gap in the multiple sequence alignment). The probability π nia of observing amino acid a in position i in the n th sequence can be modeled using a tensor-based decomposition as was recently done in the analysis of variability in the microbiome [35]: We leave developing this generalization to future work.