Fig 1.
Naive sequence validation plot.
The Hamming distances between the simulated naive DNA sequences and their corresponding linearham, partis, and ARPP estimates versus the tree imbalance values of the simulated trees. Linear regression lines are superimposed for each method to indicate how the results vary as trees get more imbalanced. For reference, we plot the tree imbalance values for the PC64 and VRC01 trees.
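As a minimal sketch of the distance plotted here, the Hamming distance between a simulated naive sequence and an estimate is the number of aligned positions at which the two differ. The function name, the toy sequences, and the method labels below are illustrative assumptions, not output from linearham, partis, or ARPP.

```python
def hamming_distance(seq_a: str, seq_b: str) -> int:
    """Count positions at which two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have equal length")
    return sum(a != b for a, b in zip(seq_a, seq_b))


# Hypothetical toy data: one "true" naive sequence and two estimates.
true_naive = "CAGGTGCAGCTG"
estimates = {
    "method_a": "CAGGTGCAGCTG",
    "method_b": "CAGGTACAGTTG",
}

# One distance per method, as would be plotted against tree imbalance.
distances = {
    name: hamming_distance(true_naive, estimate)
    for name, estimate in estimates.items()
}
```

The sketch assumes the sequences are already aligned to a common length; unaligned sequences would need an alignment step first.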
Table 1.
Naive sequence validation table.
Fig 2.
Intermediate ancestral sequence validation plot.
The positive predictive values and the true positive rates versus the tree imbalance values of the simulated trees, stratified by decision boundary ρ. Positive predictive values and true positive rates are computed on DNA sequences for the linearham, RevBayes, and dnaml programs. Linear regression lines are superimposed for each program to indicate how the results vary as trees get more imbalanced. For reference, we plot the tree imbalance values for the PC64 and VRC01 trees (vertical dashed lines).
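The two metrics follow the standard definitions PPV = TP / (TP + FP) and TPR = TP / (TP + FN). The sketch below shows one possible way a decision boundary ρ could enter. It assumes that a base at a site counts as a predicted positive when its posterior probability is at least ρ, and as a true positive when it also matches the simulated base. That reading, the function name, and the toy posteriors are assumptions made for illustration; the caption itself does not specify the exact procedure.

```python
import math


def ppv_and_tpr(site_posteriors, true_sequence, rho):
    """Return (PPV, TPR) for per-site base posteriors at boundary rho.

    site_posteriors: list of dicts mapping base -> posterior probability.
    true_sequence:   string of true bases, one per site.
    Assumption: a (site, base) pair is "predicted" when prob >= rho.
    """
    true_pos = false_pos = 0
    for probs, true_base in zip(site_posteriors, true_sequence):
        for base, prob in probs.items():
            if prob >= rho:
                if base == true_base:
                    true_pos += 1
                else:
                    false_pos += 1
    # Every site has exactly one true base; missed ones are false negatives.
    false_neg = len(true_sequence) - true_pos
    predicted = true_pos + false_pos
    ppv = true_pos / predicted if predicted else math.nan
    actual = true_pos + false_neg
    tpr = true_pos / actual if actual else math.nan
    return ppv, tpr


# Hypothetical posteriors for a 3-site sequence whose truth is "AGT".
toy_posteriors = [
    {"A": 0.9, "C": 0.1},
    {"A": 0.5, "G": 0.5},
    {"T": 0.3, "C": 0.7},
]
loose = ppv_and_tpr(toy_posteriors, "AGT", rho=0.5)
strict = ppv_and_tpr(toy_posteriors, "AGT", rho=0.8)
```

Under this reading, raising ρ trades true positive rate for positive predictive value: fewer bases clear the boundary, but those that do are more often correct.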
Table 2.
Intermediate ancestral sequence validation table.
Fig 3.
Naive sequence posterior probability logos.
The linearham-inferred (top) and ARPP-inferred (middle) amino acid naive sequence posterior probability logos for (a) the pruned PC64 dataset of 100 sequences and (b) the trimmed VRC01 alignment of 268 sequences. We also display the empirical sequence logo (bottom) for each dataset and highlight the inferred CDR3 regions (black lines).
Fig 4.
Naive-to-tip sequence trajectory graphics.
The linearham-inferred naive-to-tip amino acid sequence trajectories for the pruned PC64 dataset of 100 sequences and the trimmed VRC01 alignment of 268 sequences, displaying only the edges that satisfy the given posterior probability threshold, and only the nodes that contact edges above the threshold. The tip sequences of interest for the PC64 and VRC01 datasets are chosen to be PCT64-35M and NIH45-46, respectively, and we use 0.04 probability cutoffs for these lineage graphics (such that any edge with probability less than this threshold is discarded). The nodes correspond to unique ancestral sequences filled with red color, where the opacity is proportional to the posterior probability of the associated sequence. The directed edges connecting nodes represent ancestral sequence transitions and are shaded blue with an opacity proportional to the posterior probability of the associated sequence transition. Nodes without any probable edges connecting them are not displayed in these graphics. The absence of many nodes for VRC01 indicates that these naive-to-tip sequence trajectories are highly uncertain. A more detailed version of this graphic, including predicted lineage mutations, is included as S1 Fig.
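The thresholding described in this caption can be sketched as a two-step filter: discard every edge whose posterior probability is below the cutoff, then keep only the nodes that touch a surviving edge. The function name, the dictionary layout, and the toy probabilities are hypothetical and are not taken from the linearham code.

```python
def filter_trajectory_graph(edge_probs, node_probs, cutoff):
    """Keep edges with probability >= cutoff and the nodes they touch.

    edge_probs: dict mapping (parent, child) -> posterior probability.
    node_probs: dict mapping node -> posterior probability.
    """
    kept_edges = {
        edge: prob for edge, prob in edge_probs.items() if prob >= cutoff
    }
    # A node is drawn only if at least one surviving edge contacts it.
    touched = {node for edge in kept_edges for node in edge}
    kept_nodes = {
        node: prob for node, prob in node_probs.items() if node in touched
    }
    return kept_edges, kept_nodes


# Hypothetical toy graph with two competing naive-to-tip paths.
toy_edges = {
    ("naive", "anc_a"): 0.90,
    ("anc_a", "tip"): 0.50,
    ("naive", "anc_b"): 0.02,
    ("anc_b", "tip"): 0.01,
}
toy_nodes = {"naive": 1.0, "anc_a": 0.9, "anc_b": 0.03, "tip": 1.0}

edges, nodes = filter_trajectory_graph(toy_edges, toy_nodes, cutoff=0.04)
```

In a rendering step, the retained probabilities would then set the opacity of each red node and blue edge.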
Fig 5.
(A) A schematic representation of the naive rearrangement process from [11]. First, V (green), D (orange), and J (purple) genes are randomly selected from the respective gene pools in the body. Then, nucleotides are randomly deleted (red X’s) from both ends of the V-D and D-J junction regions and random bases (blue) are added to the same junction regions before the V, D, and J germline genes can be joined together. The BCR sequences can be partitioned into framework (FWK) and complementarity-determining (CDR) regions. (B) Our Bayesian phylo-HMM jointly models V(D)J recombination at the root of the tree (using an HMM) and then subsequent diversification (via a phylogenetic tree). We do posterior inference conditioning on the observed sequence alignment in a clonal family, but not on a fixed inferred naive sequence.
Fig 6.
A graphical model representation of our phylo-HMM for an example alignment with m = 3 sequences and n = 3 sites.
The τ, t, π, and e nodes represent the 4-tip unrooted tree topology, the associated 5 branch lengths, the GTR exchangeability rates, and the GTR equilibrium base frequencies, respectively. The parameter α denotes the shape parameter of the K-class discrete gamma distribution, which is used to model phylogenetic rate variation among sites; r denotes the vector of K discrete rates that is deterministically induced by α. A set of site-specific nodes defines the rates that are drawn from r at each site. The “hidden state” node collection represents the Markov process that stochastically generates the naive sequence in our phylo-HMM. Two further node sets denote the internal nodes of τ excluding the naive sequence Ynaive and the observed MSA, respectively. We draw plates around the internal-node and D(j) node sets for j ∈ {1, 2, 3} to indicate that any directed edges touching a plate apply to all nodes in the plate (except for edges that originate from t, which apply element-wise to the nodes in the plate).