DeepDynaForecast: Phylogenetic-informed graph deep learning for epidemic transmission dynamic prediction

doi:10.1371/journal.pcbi.1011351

Fig 1.

Ranging complexity of tree topological features resulting from a structured infected population.

n transmission clusters, distinct from the background population (maroon), can vary in effective reproductive number (R_e) over time (x-axis) and rate of infection (m) by or to individuals from other groups. Variations in transmission dynamics are imprinted in branch patterns within the corresponding phylogeny and can aid in identifying groups of interest.

Fig 2.

Our proposed DeepDynaForecast architecture.

A: Pathogen genomic data collected during outbreak molecular surveillance. B: A phylogenetic tree, reconstructed from the genomic data, is used as input to trace transmission among the populations. In this tree, nodes represent individuals, while edges represent transmission or mutation events. The phylogenetic tree is modeled as a bi-directed graph, in which the initial node representation vector v_i is randomly generated for each node i, and the edge representation vector e_ij for the edge from node i to node j is initialized from the branch length by a neural network. C-D: Example of the Primal-Dual Graph Long Short-Term Memory (PDGLSTM) learning architecture on a subtree, updating v_i and e_ij at the l-th layer. Two parallel LSTM modules update the node and edge representations in each message-passing iteration. Within this process, each edge/node aggregates adjacent node/edge representations and encodes low-dimensional messages with the neural networks ϕ_E and ϕ_N. These node/edge messages are input into their corresponding LSTM modules to update the node and edge representations. E: The system sequentially applies N rounds of message-passing iterations, producing updated node and edge representations. F: Cross-layer Prediction (CLP) module on each leaf node. A series of neural networks {ψ_L} predicts the dynamics of leaf n from the various levels of node representations {v_n}, followed by dropout layers and a summation operation to generate the final prediction. G: Predicted dynamics for leaves on the phylogenetic tree.
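
The graph construction in panel B and one message-passing round from panels C-D can be sketched roughly as follows. This is an illustration under stated assumptions, not the published implementation: the function names, the vector dimension, and the synchronous update order are hypothetical, the branch length is simply repeated in place of the learned edge-initialization network, and a plain tanh update is swapped in for the two LSTM modules and for ϕ_E and ϕ_N.

```python
import math
import random


def build_bidirected_graph(branches, dim=4, seed=0):
    """Turn tree branches (parent, child, branch_length) into a bi-directed graph.

    Node vectors are randomly initialized, as in the caption. Edge vectors are
    derived from the branch length; repeating the length `dim` times is a
    placeholder for the neural network used in the paper (assumption).
    """
    rng = random.Random(seed)
    nodes, edges = {}, {}
    for parent, child, length in branches:
        for name in (parent, child):
            if name not in nodes:
                nodes[name] = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        # One tree branch becomes two directed edges, one per direction.
        edges[(parent, child)] = [length] * dim
        edges[(child, parent)] = [length] * dim
    return nodes, edges


def message_passing_round(nodes, edges):
    """One round: edges aggregate adjacent nodes, nodes aggregate incoming edges.

    Both updates read the representations from the previous round
    (synchronous update, an assumption). tanh(state + message) stands in
    for the LSTM update.
    """
    new_edges = {}
    for (i, j), e in edges.items():
        # Edge message: mean of the two endpoint node vectors.
        msg = [(a + b) / 2.0 for a, b in zip(nodes[i], nodes[j])]
        new_edges[(i, j)] = [math.tanh(x + m) for x, m in zip(e, msg)]

    new_nodes = {}
    for name, v in nodes.items():
        # Node message: mean of the vectors on edges pointing into this node.
        incoming = [e for (i, j), e in edges.items() if j == name]
        msg = [sum(col) / len(incoming) for col in zip(*incoming)]
        new_nodes[name] = [math.tanh(x + m) for x, m in zip(v, msg)]
    return new_nodes, new_edges


if __name__ == "__main__":
    tree = [("root", "leaf_a", 0.10), ("root", "leaf_b", 0.25)]
    nodes, edges = build_bidirected_graph(tree)
    for _ in range(3):  # panel E: several rounds applied in sequence
        nodes, edges = message_passing_round(nodes, edges)
    print(sorted(nodes))
```

Keeping the representation from every round, rather than only the last one, is what would allow a cross-layer predictor such as the CLP module in panel F to combine several levels of node representations.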

Table 1.

Performance for two baseline models and DeepDynaForecast on transmission dynamic prediction of external nodes with three training scenarios.

Fig 3.

Figurative performance comparison of five models on combined ARI and TB test sets.

A: Confusion matrices with row-wise normalized elements. B: One-versus-rest receiver operating characteristic (ROC) curve for each class, with the macro-averaged ROC curve shown as a dashed magenta line. The corresponding AUC is indicated for each curve. C: UMAP visualization of the aggregated node representations learned during the message-passing iterations. Plots were generated from 50 randomly sampled phylogenetic trees in the ARI and TB test sets.
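
The quantities in panels A and B can be computed along the following lines. This is a minimal sketch, not the authors' evaluation code: the function names are hypothetical, the class labels are taken from the prediction categories named in Fig 6, the AUC is computed with the rank-based (Mann-Whitney) formulation, and the macro average is assumed to be the unweighted mean of the per-class AUCs.

```python
def row_normalized_confusion(y_true, y_pred, classes):
    """Confusion matrix whose rows (true classes) each sum to 1."""
    index = {c: k for k, c in enumerate(classes)}
    counts = [[0] * len(classes) for _ in classes]
    for t, p in zip(y_true, y_pred):
        counts[index[t]][index[p]] += 1
    normalized = []
    for row in counts:
        total = sum(row)
        # A class with no true samples gets an all-zero row.
        normalized.append([c / total if total else 0.0 for c in row])
    return normalized


def binary_auc(labels, scores):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for is_pos, s in zip(labels, scores) if is_pos]
    neg = [s for is_pos, s in zip(labels, scores) if not is_pos]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def one_vs_rest_auc(y_true, prob_rows, classes):
    """Per-class one-versus-rest AUC plus their unweighted (macro) mean."""
    aucs = {}
    for k, c in enumerate(classes):
        labels = [t == c for t in y_true]        # class c against all others
        scores = [row[k] for row in prob_rows]   # predicted probability of class c
        aucs[c] = binary_auc(labels, scores)
    aucs["macro"] = sum(aucs[c] for c in classes) / len(classes)
    return aucs


if __name__ == "__main__":
    classes = ["decay", "static", "growth"]
    y_true = ["decay", "static", "growth", "growth"]
    probs = [[0.7, 0.2, 0.1], [0.2, 0.5, 0.3], [0.1, 0.3, 0.6], [0.2, 0.2, 0.6]]
    y_pred = [classes[row.index(max(row))] for row in probs]
    print(row_normalized_confusion(y_true, y_pred, classes))
    print(one_vs_rest_auc(y_true, probs, classes))
```

Row-wise normalization makes each diagonal entry the recall of that class, which keeps the matrices comparable across classes of very different sizes.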

Fig 4.

Sensitivity of the DeepDynaForecast model to cluster size and risk group transmission type.

A: Balanced accuracy when predicting leaves within groups of different external node sizes. Binned intervals for quantitative data were generated using eight quantiles. Left panel: model performance among all groups. Middle panel: model performance on decaying clusters only. Right panel: model performance on growing clusters only. B: Performance of varying risk groups in ARI and TB simulations (see Supplementary S1 Table).
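
The binned evaluation in panel A can be sketched as follows, assuming each leaf carries the size of its group (number of external nodes) alongside its true and predicted labels. The function names are hypothetical and the binning convention (cut points from `statistics.quantiles`, values equal to a cut point assigned to the lower bin) is an assumption rather than the procedure used for the figure.

```python
import bisect
import statistics
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    hits, totals = defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += int(t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


def balanced_accuracy_by_size_bin(sizes, y_true, y_pred, n_bins=8):
    """Balanced accuracy within quantile bins of group size.

    statistics.quantiles returns n_bins - 1 cut points; bisect maps each
    size to the index of the bin it falls in (0 .. n_bins - 1).
    """
    cuts = statistics.quantiles(sizes, n=n_bins)
    grouped = defaultdict(lambda: ([], []))
    for size, t, p in zip(sizes, y_true, y_pred):
        bin_index = bisect.bisect_left(cuts, size)
        grouped[bin_index][0].append(t)
        grouped[bin_index][1].append(p)
    return {b: balanced_accuracy(t, p) for b, (t, p) in sorted(grouped.items())}


if __name__ == "__main__":
    sizes = [3, 3, 5, 8, 8, 13, 21, 34, 55, 89]
    y_true = ["growth", "decay", "static", "growth", "decay",
              "static", "growth", "decay", "static", "growth"]
    y_pred = ["growth", "decay", "static", "static", "decay",
              "static", "growth", "growth", "static", "growth"]
    print(balanced_accuracy_by_size_bin(sizes, y_true, y_pred))
```

Balanced accuracy is preferred over plain accuracy here because the classes are imbalanced; a bin may contain only one or two of the classes, in which case the score averages over those present.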

Fig 5.

Florida HIV-1 subtype B pol sequence phylogeny (2012–2017).

The maximum likelihood phylogeny was generated as described in Rich et al. [19] for 27,115 partial pol sequences sampled from individuals across the state of Florida for whom metadata were provided. County of residence, categorized according to EHE prioritization, is shown in the corresponding heatmap. Cluster status for each external branch according to MicrobeTrace [33] and dynamic prediction using DeepDynaForecast are also shown. Branches are scaled in substitutions/site.

Fig 6.

Relationship of predicted transmission dynamics with genetic clustering and patient risk factors.

Transmission clusters were previously identified in Rich et al. [19] using MicrobeTrace [33], and the corresponding information was used herein to determine the percentage of clustered sequences classified according to prediction category (A), as well as the percentage of predicted categories represented among clustered sequences (B). A multivariate logistic regression model was used to identify predictors for each transmission category, regardless of clustering status (C). Only risk factor categories with significant or marginally significant predictors are depicted, with the exception of county of origin, which is included owing to its importance in clustering [34]. A linear regression model, treating prediction categories as ordinal, was used to quantify the relationship between year of birth of individuals and prediction status (D). Prediction categories refer to growth, decay, or static transmission relative to the background, or majority, infected population and were determined using the trained PDGLSTM model implemented in DeepDynaForecast.
