GENERALIST: A latent space based generative model for protein sequence families

Generative models of protein sequence families are an important tool in the repertoire of protein scientists and engineers alike. However, state-of-the-art generative approaches face inference, accuracy, and overfitting-related obstacles when modeling moderately sized to large proteins and/or protein families with low sequence coverage. Here, we present a simple-to-learn, tunable, and accurate generative model, GENERALIST: GENERAtive nonLInear tenSor-factorizaTion for protein sequences. GENERALIST accurately captures several high-order summary statistics of amino acid covariation. GENERALIST also predicts conservative locally optimal sequences which are likely to fold into stable 3D structures. Importantly, unlike current methods, the density of sequences in GENERALIST-modeled sequence ensembles closely resembles that of the corresponding natural ensembles. Finally, GENERALIST embeds protein sequences in an informative latent space. GENERALIST will be an important tool to study protein sequence variability.

The authors propose a method called Generalist to model the sequence variability in a protein family and to generate novel protein sequences. The method is based on the idea of a latent space: a hidden state of dimension $K$ is attached to each sequence in the protein family, and sites are independent for a fixed state of the hidden variables. The authors show that their method generates sequences with statistical properties very close to those of the MSA, accurately reproducing high-order statistics. They also show that generated sequences are relatively distant from those found in the MSA, showing that the model is able to generalize.
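In other words, the generative structure can be summarized (in generic notation, since I am paraphrasing rather than quoting the paper's equations) as

$$P(a_1, \ldots, a_L \mid \mathbf{z}_s) \;=\; \prod_{i=1}^{L} p_i(a_i \mid \mathbf{z}_s),$$

where $\mathbf{z}_s \in \mathbb{R}^K$ is the latent state attached to sequence $s$ and the $p_i$ are single-site distributions.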
I found the model and results very interesting and the paper relatively clear. However, I found that the authors remain a bit superficial in the analysis of their results, even in the SI. I also have doubts about whether the good performance of Generalist in reproducing sequence statistics is due to it also fitting phylogenetic relations between sequences, which is something I believe a generative model should avoid (or that should at least be discussed).
I list more detailed comments below.

Comments

1
I found the paragraph about the mathematical aspects of Generalist a bit confusing.
As far as I understand, the quantity defined there is the probability to observe a given state at a given position for a fixed state of the latent space. This is a general relation and is unrelated to the training data and the alignment, but this is not clear from the text. Why not use a more explicit notation such as $p_i(a \mid \mathbf{z})$? Maybe this is a misunderstanding on my part.
I found the sentence "Sequences are modeled as arising from their own Gibbs-Boltzmann distribution" (l.128) a bit confusing. Shouldn't it be something like "Given a state $\mathbf{z}$ of the latent variables, a sequence is distributed according to a Gibbs-Boltzmann distribution containing only fields"? Of course, a state of the latent variables is attached to each sequence of the alignment.
One has to read the SI to understand how sequences are actually sampled. Given that the main point of the article is the statistical properties of sampled sequences, I think this could go into the main text.
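For the reader's benefit, here is what I understand the sampling step to amount to, assuming integer-encoded amino acids and access to the single-site probabilities $p_i(a \mid \mathbf{z})$ (this is my own illustration, not the authors' code):

```python
import numpy as np

def sample_sequence(site_probs, rng=None):
    """Sample one sequence from per-site distributions.

    site_probs: array of shape (L, q), where row i holds p_i(a | z) over
                the q amino-acid states at position i.
    Returns an integer-encoded sequence of length L.
    """
    rng = np.random.default_rng() if rng is None else rng
    L, q = site_probs.shape
    # Given the latent state, positions are independent, so each site is
    # drawn from its own categorical distribution.
    return np.array([rng.choice(q, p=site_probs[i]) for i in range(L)])
```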

2
The authors use Potts models and Variational Auto-encoders as a comparison to Generalist. However, it seemed to me that Restricted Boltzmann Machines (RBMs) are a very natural comparison. The main difference between RBMs and Generalist is that in the latter, hidden units (or latent variables) are fixed to predetermined states $\mathbf{z}_s$, one for each sequence $s$ of the original alignment. In both setups, once the latent state is fixed, all positions in the sequence are chosen independently.
I was a bit surprised not to find any discussion of this connection with RBMs, and am curious what the authors think about it.
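To make the comparison concrete, as I understand it the two likelihoods differ roughly as follows (again in my own generic notation):

$$P_{\mathrm{RBM}}(\mathbf{a}) \;\propto\; \sum_{\mathbf{h}} e^{-E(\mathbf{a}, \mathbf{h})}, \qquad P_{\mathrm{Generalist}}(\mathbf{a} \mid \mathbf{z}_s) \;=\; \prod_{i} p_i(a_i \mid \mathbf{z}_s),$$

i.e. the RBM marginalizes over its hidden units, whereas Generalist treats $\mathbf{z}_s$ as a per-sequence parameter to be fitted.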

3
A question about training: is the likelihood a convex function? Does it have a unique optimum?
To illustrate this, I ran a simple numerical experiment using the code and notebooks that the authors provided. I trained Generalist on an alignment of 400 sequences of length 2, each being either "AC" or "CA".
The magnetizations are such that the letters "A" and "C" are heavily favored over the other amino acids, and there is a very strong correlation between the two positions.
I used a latent space of dimension $K = 1$. As far as I understand, Generalist should be able to perfectly reproduce the statistics with parameters such as fields favoring "A" over "C" at the first position (and the opposite at the second), together with "temperatures" of opposite sign for the "AC" and the "CA" sequences.
At sampling time, if $z > 0$ is chosen, the most likely sequence is "AC"; if $z < 0$ is chosen, the most likely sequence is "CA".
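To spell out the arithmetic behind this proposed solution, here is a minimal sketch assuming single-site probabilities of the softmax form $p_i(a \mid z) \propto e^{z\, h_i(a)}$ (the field values are illustrative and not taken from the provided code):

```python
import numpy as np

# Two positions, two relevant letters: index 0 = "A", 1 = "C".
# Fields favor "A" at the first position and "C" at the second.
h = np.array([[+1.0, -1.0],   # h_1(A), h_1(C)
              [-1.0, +1.0]])  # h_2(A), h_2(C)

def site_probs(z, h):
    """Per-site probabilities, assuming p_i(a | z) is proportional to exp(z * h_i(a))."""
    logits = z * h
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

for z in (+1.0, -1.0):
    p = site_probs(z, h)
    best = "".join("AC"[k] for k in p.argmax(axis=1))
    print(f"z = {z:+.0f}: most likely sequence = {best}")
# z = +1 makes "AC" most likely and z = -1 makes "CA" most likely, so a
# single latent coordinate is enough to capture the correlation.
```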
Here are the results: with the default settings, Generalist did not converge to the state described above, but instead to a state in which the latent coordinates take essentially the same value for all sequences and the fields weight "A" and "C" equally at both positions. Of course, this reproduces the magnetizations, but not the correlations, as the four states "AA", "AC", "CA", "CC" will have the same energy.
When I initialized the model with fields close to those of the solution described above, training nicely converged to that solution.
I conclude from this that training strongly depends on the initial state of the parameters, and that the likelihood may possess several local maxima. However, I could see no mention of this in the article. Did I misunderstand something?
I think the authors should comment a bit more on the training procedure: does it strongly depend on the initial conditions? Does the optimum vary from training to training? Why are the latent coordinates initialized to positive values, whereas allowing negative values could increase the expressiveness of the model?

4
The authors find optimal values for the dimension $K$ of the latent space using a quite interesting method that avoids overfitting. I have questions regarding the results.
The optimal $K$ varies significantly for the different families, for instance between mTOR and DHFR. I would appreciate a comment from the authors on this variation. Was it expected from the diversity of sequences found in the different alignments? Looking at Figure 2, it seems that the least diverse alignments have the lowest $K$: this could easily be explored quantitatively.
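For instance, one could correlate the optimal $K$ with a simple diversity measure of each alignment, such as the mean pairwise Hamming distance (a hypothetical sketch, not tied to the authors' pipeline):

```python
import numpy as np

def mean_pairwise_hamming(msa, n_pairs=10_000, rng=None):
    """Estimate the mean pairwise Hamming distance (per site) of an MSA.

    msa: integer-encoded alignment of shape (n_sequences, L).
    Random pairs are used to keep the cost manageable for large MSAs.
    """
    rng = np.random.default_rng() if rng is None else rng
    n, _ = msa.shape
    i = rng.integers(0, n, size=n_pairs)
    j = rng.integers(0, n, size=n_pairs)
    keep = i != j
    return (msa[i[keep]] != msa[j[keep]]).mean()
```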

How robust is the optimal value of $K$? For instance, for BPT1, going from one value of $K$ to the next completely changes the distribution of Hamming distances. How much do these values change, say, when using different initial conditions for the training? As a suggestion, maybe plotting the optimized quantity as a function of $K$ would give the reader more intuition.
It would be interesting to see how the reproduction of second- and higher-order statistics behaves as a function of $K$. Curiously, this is done for the VAE in the SI, but not for Generalist.

5
In Figure 5, the authors show that, for large alignments, the minimal Hamming distance between pairs of Generalist-generated sequences and the minimal distance to a natural sequence are close to those found for the natural MSA, and on average much lower than when using VAEs and Potts models.
First, this result is not completely surprising, as $K$ has been optimized exactly for this. Second, I have the impression that this means Generalist reproduces the phylogenetic bias found in the natural sequences. If this is the case, the good performance of Generalist in reproducing high-order statistics could be due to the fact that these statistics are indicative of phylogenetic relations and not of functional constraints. These are relations that Potts models try not to learn. I think it is important that the authors discuss this potential pitfall.
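For clarity, the statistic in question is essentially the following (a sketch assuming integer-encoded alignments over the same columns; not the authors' evaluation code):

```python
import numpy as np

def min_hamming_to_natural(generated, natural):
    """For each generated sequence, the per-site Hamming distance to its
    closest natural sequence. Both inputs have shape (n, L)."""
    out = np.empty(len(generated))
    for k, seq in enumerate(generated):
        out[k] = (natural != seq).mean(axis=1).min()
    return out
```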

6
The authors use Potts models as a comparison for Generalist and discuss them at length in the introduction. A major difference between Potts models (and also RBMs) and more involved techniques such as VAEs or neural networks is that the parameters of the former have a biological interpretation (to some extent). This is even pointed out in the introduction, as a downside of deep generative models (l.90).
Did the authors explore possibilities to interpret the inferred parameters of Generalist?

Minor comments
In the introduction (around l.78), the authors make a well-argued case against the hyper-parameters used in training Potts models. Does Generalist use regularization? If not, how does one avoid infinite values for some parameters? The Adam optimizer also has hyper-parameters: do those influence the predictions (if the likelihood has a single optimum, I imagine they should not)? I think the article would benefit from a more explicit discussion of these issues.
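On the regularization question, even a one-line statement would help, e.g. whether an L2 penalty (or the optimizer's weight-decay option) is used to keep parameters from diverging; the snippet below is a generic PyTorch illustration of what I mean, not the authors' training code:

```python
import torch

# Dummy parameters; in practice these would be the model's fields and
# latent coordinates.
params = [torch.zeros(10, requires_grad=True)]

# weight_decay adds an L2 penalty, one standard way to prevent parameters
# from running off to infinity when a state is never observed in the data.
optimizer = torch.optim.Adam(params, lr=1e-3, weight_decay=1e-4)
```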
In the paragraph starting at line 140, the authors describe the protein families used in the article. I am a bit annoyed by the use of the "approximate" sign ($\approx$). Why not give the exact numbers of sequences and amino-acid lengths? Also, a table would be really helpful here: it could contain the number of sequences, the length, the optimal $K$, and the fitting quality for the different families.
Code: it would be nice to provide installation instructions. Generalist depends on many other Python libraries (torch, numpy, etc.), and I had to figure out manually which ones to install.
l.73 - "inefficient" might not be the best word; maybe "computationally hard"?
l.308 - I am not sure that one can refer to Potts models as "physics-based".