An improved dataset for predicting mammal infecting viruses from genetic sequence information
Fig 2
Distribution of viral genomes in the datasets used in this work, categorized by human infectivity and training and test data split.
Both datasets are improved versions of the original dataset analyzed by Mollentze et al. [2], which was previously curated by several others [4,25,26]. In (A), problematic genomes were removed and known human infectivity was updated. In (B), the datasets were additionally rebalanced by random shuffling while preserving human infectivity ratios.
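The rebalancing in (B), random shuffling that preserves human infectivity ratios, resembles a stratified shuffle split. The sketch below illustrates that general idea only and is not the authors' procedure. The function name `stratified_split`, the `test_fraction` and `seed` parameters, and the toy labels are all hypothetical, and it assumes the ratio is preserved between the training and test sets.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split indices into train/test while keeping each label's share similar.

    Hypothetical helper for illustration; not taken from the paper.
    """
    rng = random.Random(seed)

    # Group genome indices by label (e.g. human-infecting True/False).
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_label.values():
        # Shuffle within each label group, then cut off the same fraction.
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])

    # Shuffle again so the groups are interleaved within each split.
    rng.shuffle(train_idx)
    rng.shuffle(test_idx)
    return train_idx, test_idx


# Toy labels (made up): 30 human-infecting and 70 other genomes.
labels = [True] * 30 + [False] * 70
train_idx, test_idx = stratified_split(labels, test_fraction=0.2, seed=42)
```

With these toy labels, the training and test sets both keep the same 30% share of human-infecting genomes, whichever seed is used.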