Fig 1.
Schematic representation of the comparison of CBN models.
A Data sets D1 and D2 consist of N1 and N2 genotypes, respectively, and, in this example, p = 4 mutations. We combine both data sets into a single data set D0 with N1 + N2 genotypes. B We randomly split data set D0 into data sets S1 and S2, repeating this B times. C For each data set, we apply the H-CBN2 approach to learn the structure of the network, and for each pair S1 and S2 we compute the Jaccard distance. D The empirical distribution of the test statistic is computed by aggregating the distances between pairs S1 and S2. E We compare the CBN posets inferred from the original data sets D1 and D2 by computing the Jaccard distance and assessing its significance.
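The procedure in panels A to E can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: `learn_poset` is a hypothetical stand-in for the H-CBN2 structure learner, posets are assumed to be represented as sets of directed edges, the random splits are assumed to have sizes N1 and N2, and the add-one p-value estimate is one common convention rather than something stated in the caption.

```python
import random


def jaccard_distance(edges_a, edges_b):
    """Jaccard distance between two posets given as sets of directed edges."""
    a, b = set(edges_a), set(edges_b)
    union = a | b
    if not union:
        return 0.0  # two empty posets are treated as identical
    return 1.0 - len(a & b) / len(union)


def permutation_test(d1, d2, learn_poset, n_splits=100, seed=0):
    """Compare posets learned from d1 and d2 against random splits of the pool.

    learn_poset is a hypothetical callable (stand-in for H-CBN2) that maps a
    list of genotypes to a set of edges.
    """
    rng = random.Random(seed)
    observed = jaccard_distance(learn_poset(d1), learn_poset(d2))

    pooled = list(d1) + list(d2)  # panel A: combined data set D0
    null = []
    for _ in range(n_splits):  # panel B: repeated random splits
        shuffled = pooled[:]
        rng.shuffle(shuffled)
        # Assumption: S1 and S2 keep the original sizes N1 and N2.
        s1, s2 = shuffled[:len(d1)], shuffled[len(d1):]
        # Panel C: learn a poset per split and compute the distance.
        null.append(jaccard_distance(learn_poset(s1), learn_poset(s2)))

    # Panels D and E: empirical p-value with an add-one correction (assumption).
    p_value = (1 + sum(d >= observed for d in null)) / (1 + n_splits)
    return observed, null, p_value


def toy_learner(genotypes):
    """Toy learner for demonstration only: edge (i, j) if j never occurs without i."""
    p = len(genotypes[0])
    return {
        (i, j)
        for i in range(p)
        for j in range(p)
        if i != j and all(g[j] <= g[i] for g in genotypes)
    }
```

Genotypes here are tuples of 0/1 indicators, one entry per mutation. Any structure learner that returns a set of edges could be passed in place of `toy_learner`.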
Fig 2.
Assessment of H-CBN2 on simulated data.
A Box plots of the difference between true (ϵ) and estimated (ϵ̂) error rate (y-axis) for each of the evaluated poset sizes (x-axis). B Box plots of the relative median absolute error (RMAE; y-axis) of the estimated rate parameters. C Average run time of the MCEM/EM step (y-axis, logarithmic scale) for different poset sizes (x-axis, logarithmic scale). The blue dotted line corresponds to linear scaling, whereas the red line corresponds to quadratic scaling. In panels A to C, different colors indicate different importance sampling schemes, and we show results of 100 simulated data sets for each combination of the simulation settings. The true error rate is ϵ = 0.05, the number of samples drawn from the proposal distribution is set to L = 1000 unless specified otherwise, and we run 100 iterations of the MCEM/EM algorithm. D Error in the estimation of the log-likelihood. E Box plots of F1 scores for reconstructed network edges. In panels D and E, we show results of 20 different networks with 16 mutations and an error rate of 5%. We fix the ideal acceptance rate to 1/p and run 25,000 iterations of the simulated annealing algorithm. The initial temperature is set to Θ0 = 50 for all runs, and for adaptive simulated annealing, three adaptation rates are evaluated (ar = 0.1, 0.3, 0.5). Comparison of H-CBN2 to MC-CBN methods in terms of F the difference in normalized log-likelihood and G F1 scores for two poset sizes and various error rates. For the H-CBN2 results shown in panels F and G, we employ the ASA algorithm. SA: simulated annealing, ASA: adaptive simulated annealing, +: with additional new moves.
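As an illustration of the F1 score for reconstructed network edges reported in panels E and G, the sketch below compares a true and an inferred poset. It assumes both are given as sets of directed (parent, child) edges; whether the original evaluation uses cover relations or the full transitive closure is not stated in the caption, and the function name `edge_f1` is hypothetical.

```python
def edge_f1(true_edges, inferred_edges):
    """F1 score (harmonic mean of precision and recall) over directed edges."""
    t, e = set(true_edges), set(inferred_edges)
    tp = len(t & e)  # edges present in both posets
    if tp == 0:
        return 0.0  # also covers the case of empty edge sets
    precision = tp / len(e)
    recall = tp / len(t)
    return 2 * precision * recall / (precision + recall)
```

For example, with true edges {(1, 2), (2, 3), (3, 4)} and inferred edges {(1, 2), (2, 3), (1, 4)}, two of three inferred edges are correct and two of three true edges are recovered, so precision, recall, and F1 all equal 2/3.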
Fig 3.
Consensus posets for lopinavir resistance for two different HIV-1 subtype C data sets.
Shown are the consensus posets for A the South African cohort and B the remaining HIV-1 subtype C sequences retrieved from the HIVDB. Nodes in the network correspond to amino acid changes in the HIV-1 protease, where mutations at the same locus are grouped together in one event. Only edges with a bootstrap support greater than 0.7 are shown, and the edge thickness indicates the bootstrap support. Nodes with a white background show residues with at least one major PI mutation.
Fig 4.
Consensus poset for the accumulation of mutations in HIV-1 subtype B under lopinavir treatment.
The underlying data set contains 470 genotypes retrieved from the HIVDB and SHCS. Nodes in the network correspond to amino acid changes in the HIV-1 protease, and mutations at the same locus are grouped together. Edge labels indicate the bootstrap support, and we show only edges with a bootstrap support greater than or equal to 0.7.
Fig 5.
Empirical null distribution of pairwise Jaccard distances estimated by permuting group labels.
Displayed are the histograms of Jaccard distances for the comparison of subtypes B and C for H-CBN2 posets with A 11 mutations and B 18 mutations, as well as the histograms of Jaccard distances for the comparison of two data sets for subtype C for H-CBN2 posets with C 11 mutations and D 19 mutations. Vertical dotted lines indicate the distance between the CBNs obtained from the observed data.