Evaluating machine learning approaches for host prediction using H3 influenza genomic data

Abstract

Background

H3 influenza A viruses (IAV) have been shown to frequently cross the species barrier, which can be an important factor in sustained transmission and spread. Machine learning methods have been widely explored for host prediction of IAV using genomic data; however, this is often done using data from only one of the eight IAV segments or by using all available IAV data to predict broad categories of hosts.

Objective

The objective of this study was to combine machine learning algorithms with H3 IAV sequence data from all eight segments to train predictive machine learning models for distinct host prediction and validate model performance.

Methods

Models were trained on k-mer and amino acid property features for each of the eight IAV genome segments using machine learning algorithms that included random forest and XGBoost. Models were then validated on a test dataset by analyzing model class predicted probabilities and subsequently used to investigate between-species transmission patterns within case studies including canine H3N8, swine H3N2 2010.2, and duck H3 sequences.

Results

Models demonstrated strong performance in host prediction across all eight segments on the test dataset, with overall accuracies and κ (kappa) values ranging from 0.995–0.997 and 0.984–0.990, respectively. Misclassified test dataset sequences with high predicted probabilities (> 90%) were validated using available literature and were identified to be frequently associated with between-species transmission events. Between-species transmission patterns within case study model class predicted probabilities were also identified to be consistent with the literature in cases of both correct and incorrect classification.

Conclusions

These models allow for rapid and accurate host prediction of H3 IAV datasets from any of the eight IAV segments and provide a solid framework that allows for identification of variants with higher than typical between-species transmission potential. However, results obtained on selected case studies suggest further improvements of the training and validation processes should be considered.

Introduction

The influenza virus is a pathogen that is classified into the family Orthomyxoviridae and comprises four of its seven genera, each of which consists of one influenza virus species (A-D). Influenza A and B viruses are the primary agents responsible for causing seasonal epidemics in humans, with influenza A virus (IAV) being the only species to have caused pandemics to date [1,2]. Influenza C and D viruses are associated with more sporadic cases of infections, with influenza C viruses primarily affecting humans and influenza D viruses primarily affecting cattle [3,4]. Of these four species, IAV is generally considered the most important species in humans due to the global economic burdens associated with seasonal epidemics that result in over 650,000 deaths annually within human populations [5]. Similarly, significant economic losses are attributed to IAV infection in many different animal production industries that include poultry and swine, with millions of avians culled globally per annum and an estimated cost of $1-$5 per market hog within the United States [6–8].

IAV can be classified into subtypes based on the antigenic properties of the outer surface glycoproteins hemagglutinin (HA) and neuraminidase (NA), with 18 HA (H1-H18) and 11 NA (N1-N11) subtypes identified thus far [9]. H3 IAVs are of notable interest as they are often associated with more severe influenza seasons and also infect a very wide range of hosts that includes mammals such as humans, swine, canines, and equines in addition to avians such as chickens, mallards, and geese [1,10,11]. This wide host range has become an ongoing global concern to both public and animal health as many between-species transmission events that allow for novel strains to arise have been documented across the globe and these novel strains have been determined as the root cause of several past pandemics [10,12–15]. Spill-over events have been identified where H3 avian influenza viruses were found to be the origin of outbreaks within swine, canines, equines, and seals [16–19]. Several hundred cases of zoonotic infections in humans have also been identified to be caused by swine H3 IAVs in addition to numerous cases of swine IAV spilling over into turkeys [20–24]. Reverse zoonosis of human H3 IAVs has also been found to occur in swine with regularity [25–27]. Therefore, surveillance and timely identification of hosts with higher than typical potential for between-species transmission is important to limit the spread of H3 IAV and prevent the occurrence of future H3 IAV pandemics.

Large public repositories of whole genome sequence data from past IAV infections have become increasingly available over time as a result of advancements in next generation sequencing techniques, providing opportunities for machine learning techniques that require very large datasets. Machine learning algorithms such as eXtreme Gradient Boosting (XGBoost), random forest, and multinomial logistic regression with ridge penalization have previously been used in the classification of IAVs and prediction of their hosts/sources based on sequence data. However, this is often performed using methodology involving only the HA segment [28] or by using all subtypes with broad host categories such as avian [29–31]. Utilizing the full genome of IAV rather than only the HA segment has been demonstrated to result in more accurate surveillance and genetic characterization of IAV strains and allows for better understanding of reassortment of the remaining 7 segments, which is often overlooked [32,33]. Furthermore, machine learning models have been previously trained on H3 HA sequence data from distinct avian classes such as turkeys, mallards, and chickens and predicted with high levels of accuracy, indicating that sequences from broad categories can be delineated into distinct species without degradation in host predictability [28]. Thus, the primary objective of this study was to combine machine learning algorithms with sequence data from all 8 segments of H3NX IAVs to train predictive machine learning models for distinct host prediction and validate model performance. Validated models were trained with the ultimate goal of contributing to the development of a framework for identifying IAV variants with higher than typical potential for between-species transmission.

Materials and methods

Dataset retrieval

The dataset used in this study was obtained on April 16th, 2024 by retrieving whole genome nucleotide and protein sequence sets from the National Center for Biotechnology Information (NCBI) Influenza Virus Database (IVD), Bacterial and Viral Bioinformatics Resource Center (BV-BRC) database, and Global Initiative on Sharing All Influenza Data (GISAID) Epiflu database [34–36]. Sequence selection criteria from each database are summarized in S1 File.

Dataset preprocessing

Preprocessing of the dataset began by extracting the hosts from the labels of the 37328 sequence sets and assigning a host category as shown in Table 1. Whole genome sequence set duplicates within host species and across databases were checked for and excluded. Sequence sets from hosts with fewer than 100 sequence sets available were excluded, as this was considered an insufficient amount of data for model training. Sequence sets from the avian category were excluded, as avian is considered too broad a category to be useful in host prediction. Sequence sets from the duck and goose categories were not excluded so that they could be later used as a case study and as an additional waterfowl class for model training, respectively (Table 1). After these exclusions, the dataset consisted of 7 host classes of canine, chicken, equine, goose, human, mallard, and swine for model training in addition to the 2 case study classes of duck and environment. Included species for each class are shown in S1 Table.

Table 1. Sequence sets obtained for the combined IVD, BV-BRC, and Epiflu dataset initially, after preprocessing, after case study selection, and the final dataset used in model training/testing.

The dataset was then filtered by requiring each whole genome sequence set to have exactly one nucleotide sequence per genome segment (PB2, PB1, PA, HA, NP, NA, MP, NS) for a total of 8 nucleotide sequences. Protein sequences for the 10 essential proteins of PB2, PB1, PA, HA, NP, NA, M1, M2, NS1, NS2 and accessory protein PB1-F2 were also required for a total of 11 protein sequences. Protein sequences from the accessory protein PA-X were not included as preliminary analysis identified that PA-X protein sequences were absent in a significant portion of the genome sets (~15%).

Leading and trailing Ns within sequences were trimmed and whole genome sequence sets that contained a sequence with an N count of greater than 5% of the expected sequence length for any of the eight genome segments were excluded. Sequence sets that were identified to be lab strains or contained sequences that were shorter than 90% of the expected sequence length were excluded. Subtypes were also checked, and any sequence sets not belonging to the H3 subtype were excluded, with the subtypes for the remaining 21429 sequence sets shown in S2 Table.
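The N-trimming and length filters described above can be sketched as follows. This is a minimal Python illustration rather than the study's own code (the tooling used for this step is not stated); the function names and the `expected` length table are hypothetical.

```python
def trim_terminal_ns(seq: str) -> str:
    """Remove leading and trailing N characters from a nucleotide sequence."""
    return seq.upper().strip("N")


def passes_quality_filters(seq: str, expected_length: int) -> bool:
    """Apply the two thresholds from the text to one segment sequence.

    Rejected when the N count exceeds 5% of the expected segment length,
    or when the trimmed sequence is shorter than 90% of that length.
    """
    trimmed = trim_terminal_ns(seq)
    too_many_ns = trimmed.count("N") > 0.05 * expected_length
    too_short = len(trimmed) < 0.90 * expected_length
    return not (too_many_ns or too_short)


def keep_sequence_set(segments: dict, expected: dict) -> bool:
    """Keep a whole genome set only if every segment passes (assumed rule:
    one failing segment excludes the whole set, as the text describes)."""
    return all(passes_quality_filters(seq, expected[name])
               for name, seq in segments.items())
```

Applying the filter at the level of the whole genome set, rather than per sequence, keeps all eight segment datasets aligned on the same strains.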

Case study datasets

Four separate case study groups were extracted from the preprocessed dataset for model validation and investigation of between-species transmission patterns using model class predicted probabilities. The four case studies consist of sequence sets from 1) canine H3N8, 2) the swine H3N2 2010.2 clade, 3) duck H3NX, and 4) environment H3NX. Case study 1 was used to investigate canine H3N8, which was first detected in canines within the United States in 2004 and was identified to have spilled over from equines into canines [37–39]. Case study 2 was used to investigate the swine H3N2 2010.2 clade, which was first detected in 2017 as the second occurrence of a distinct seasonal human H3N2 being transmitted from humans to swine within the United States [25,40]. Sequence sets for case study 2 were extracted by using the OctoFlu tool [41] to label all present H3 swine sequence sets and extracting only sets labelled as part of the H3N2 2010.2 clade. Case study 3 was used to investigate class predicted probabilities for the broad category of duck, which contained sequence sets from several different species of ducks. Case study 4 was used to investigate the possible hosts for environmental H3NX sequence sets, which comprised sequences not labelled with an animal host but with locations such as “environment” or “water” instead. Each case study dataset was separated by genome segment, with one dataset per genome segment.

Training and test datasets

The remaining preprocessed dataset was randomly partitioned into a training dataset (70%) and a test dataset (30%). The training and test datasets were then separated by genome segment, with one training and test dataset per genome segment.
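As a rough illustration of this partitioning step, the Python sketch below splits genome set identifiers 70/30 before any per-segment separation. It is not the study's R code; the function name, the seed, and the use of simple (unstratified) shuffling are assumptions.

```python
import random


def partition(set_ids, train_fraction=0.70, seed=0):
    """Randomly split genome set identifiers into training and test lists."""
    shuffled = list(set_ids)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    cut = round(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


# Toy usage with ten hypothetical genome set identifiers.
train_ids, test_ids = partition(range(10))
```

Splitting on whole genome sets first, and only then separating by segment, means that all eight segments of a given strain land on the same side of the split.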

Feature extraction

K-mers were extracted from both nucleotide and protein sequences using the Biostrings [42] and kmer [43] packages in R. K-mers of length 1–6 (n = 4, 16, 64, 256, 1024, 4096) were extracted from the nucleotide sequences and k-mers of length 1–3 (n = 20, 400, 8000) were extracted from the protein sequences. Amino acid properties (n = 6) were also extracted from the protein sequences using the alakazam package in R [44]. The chosen amino acid properties for feature extraction were amino acid length, grand average of hydropathicity (GRAVY), bulkiness, aliphatic index, polarity, and overall net charge. Sequences from genome segments with one protein available (HA, NA, NP, PA, PB2) had a total of 13887 features extracted, whereas sequences from genome segments with two proteins available (PB1, NS, MP) had a total of 22314 features extracted. For models with two proteins available, protein 1 refers to PB1, NS1, and M1 and protein 2 refers to PB1-F2, NS2, and M2.
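To make the k-mer feature space concrete, the following Python sketch counts overlapping k-mers over a fixed alphabet. It is an illustration only and does not reproduce the Biostrings or kmer packages; the function names are hypothetical, and ignoring k-mers that contain ambiguous characters is an assumption.

```python
from collections import Counter
from itertools import product

NUCLEOTIDES = "ACGT"
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def kmer_counts(seq, k, alphabet):
    """Count overlapping k-mers, returning one entry per possible k-mer.

    K-mers containing characters outside the alphabet (e.g. N) are dropped.
    """
    observed = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    all_kmers = ("".join(p) for p in product(alphabet, repeat=k))
    return {kmer: observed.get(kmer, 0) for kmer in all_kmers}


def feature_space_sizes(alphabet, max_k):
    """Number of possible k-mers for k = 1..max_k."""
    return [len(alphabet) ** k for k in range(1, max_k + 1)]
```

The sizes returned by `feature_space_sizes` for nucleotides (k = 1–6) and amino acids (k = 1–3) match the per-length counts listed in the text.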

Feature selection

Feature selection was performed to reduce computational time in R and to use only the most important features from each genome segment for host prediction. This was done by training a preliminary random forest model per genome segment on all available features using the caret package in R with “rf” as the chosen algorithm and default parameters [45]. Cross-validation was not performed during feature selection to avoid further increasing computational time. After the preliminary models were trained, feature importance was determined using mean decrease Gini scores [46] and the features with the top 10% highest mean decrease Gini scores were selected to train the final models for each genome segment.
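The top-10% selection rule can be sketched in a few lines of Python. This is a hedged illustration, not the caret workflow itself: the importance scores below are placeholders, the function name is hypothetical, and rounding the cutoff down is an assumption.

```python
def select_top_fraction(importances, fraction=0.10):
    """Return the feature names with the highest importance scores.

    `importances` maps feature name -> importance (e.g. mean decrease Gini).
    The number kept is rounded down (assumed).
    """
    n_keep = int(len(importances) * fraction)
    ranked = sorted(importances, key=importances.get, reverse=True)
    return ranked[:n_keep]


# Placeholder scores for a one-protein segment with 13887 features.
toy_scores = {f"feature_{i}": float(i) for i in range(13887)}
selected = select_top_fraction(toy_scores)
```

With the cutoff rounded down, 13887 and 22314 input features yield 1388 and 2231 selected features, respectively.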

Model training

Three models were trained for each of the eight genome segments for a total of 24 models. The three models per genome segment were trained using multinomial logistic regression with ridge penalization, random forest, and XGBoost which were implemented through the caret and glmnet packages in R [45,47,48]. Models were also cross-validated using 5-fold stratified cross-validation to ensure that all classes were represented proportionally due to the imbalanced nature of the dataset.
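The idea behind stratified cross-validation, keeping rare classes such as goose represented in every fold, can be illustrated with the short Python sketch below. It is a generic fold-assignment routine written for this explanation and is not caret's implementation; the function name and seed are assumptions.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=5, seed=0):
    """Assign each sample to one of k folds, distributing every class evenly."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    fold_of = [None] * len(labels)
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled members of this class round-robin across folds.
        for position, idx in enumerate(indices):
            fold_of[idx] = position % k
    return fold_of


# Toy imbalanced labels: 50 majority-class and 5 minority-class samples.
toy_labels = ["human"] * 50 + ["goose"] * 5
toy_folds = stratified_folds(toy_labels)
```

In this toy example each of the 5 folds receives 10 majority-class and 1 minority-class sample, whereas a purely random split could leave some folds with no minority samples at all.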

Model hyperparameter tuning

Model hyperparameters were tuned using a grid search during cross-validation to obtain the most robust models possible. The λ (regularization parameter) and mtry (number of features randomly sampled at each split) hyperparameters were tuned for the multinomial logistic regression with ridge penalization and random forest models, respectively. For the XGBoost models, the hyperparameters consisting of the number of rounds (nrounds), learning rate (eta), maximum depth (max_depth), column subsampling (colsample_bytree), minimum sum of instance weight required within a child (min_child_weight), subsample ratio (subsample), and gamma (minimum split loss) were tuned in stepwise groups.

Model testing

Trained models were validated on the test dataset to observe their performance on unseen data. Confusion matrices were generated during model validation and metrics that include overall accuracies, 95% confidence intervals, κ, no-information rates, p-values, and class sensitivities and specificities were calculated. κ refers to Cohen’s κ which is calculated using the following equation:

\[ \kappa = \frac{p_o - p_e}{1 - p_e} \]

where \(p_o\) is the observed agreement and \(p_e\) is the expected agreement of the model [49]. No-information rate refers to the overall accuracy of a naïve classifier that predicts every input as the majority class.
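As a numerical illustration of these metrics, the Python sketch below computes overall accuracy, Cohen's κ, and the no-information rate from a confusion matrix. The matrix values are invented for the example and are not results from this study; the function name is hypothetical.

```python
def accuracy_kappa_nir(matrix):
    """matrix[i][j] = number of sequences of true class i predicted as class j."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    true_totals = [sum(row) for row in matrix]
    pred_totals = [sum(matrix[i][j] for i in range(n)) for j in range(n)]

    # Observed agreement: fraction on the diagonal (the overall accuracy).
    p_o = sum(matrix[i][i] for i in range(n)) / total
    # Expected agreement: chance agreement implied by the marginal totals.
    p_e = sum(t * p for t, p in zip(true_totals, pred_totals)) / total ** 2
    kappa = (p_o - p_e) / (1 - p_e)
    # No-information rate: accuracy of always predicting the majority class.
    nir = max(true_totals) / total
    return p_o, kappa, nir


# Invented two-class example.
acc, kappa, nir = accuracy_kappa_nir([[45, 5], [10, 40]])
```

Because κ discounts agreement expected by chance, it is a more conservative summary than accuracy when one class dominates the dataset.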

The best performing model for each genome segment was then determined using overall accuracies and κ values. In cases where two or more models had the same overall accuracies and κ values within a genome segment, models using the algorithm with the best performance overall across all 8 segments were chosen as the best performing model. Host predictions from the best performing models for each segment were then tallied per whole genome sequence set (8 sequences, 1 per genome segment) and host prediction frequencies were investigated (e.g., all 8 segments correctly predicted as human for a human sequence set or with mixed host prediction, 7 segments correctly predicted as human, 1 segment predicted as swine).
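The per-set tally can be sketched as follows. This Python example is illustrative only: the function name is hypothetical, and taking the most frequent segment-level prediction as the host call for the whole set is an assumption, since the rule used for mixed sets is not spelled out here.

```python
from collections import Counter


def categorize(true_host, segment_predictions):
    """Summarize the eight segment-level host predictions for one genome set.

    Assumption: the host call for the set is the most frequent prediction.
    """
    tally = Counter(segment_predictions.values())
    consensus = tally.most_common(1)[0][0]
    outcome = "correct" if consensus == true_host else "misclassified"
    agreement = "uniform" if len(tally) == 1 else "mixed"
    return outcome, agreement, dict(tally)


# The mixed example from the text: 7 segments human, 1 segment swine.
segments = ["PB2", "PB1", "PA", "HA", "NP", "NA", "MP", "NS"]
predictions = {name: "human" for name in segments}
predictions["NS"] = "swine"
result = categorize("human", predictions)
```

Sets that come back as "mixed" are the ones worth inspecting for possible reassortment or recent between-species transmission.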

Model class predicted probabilities

Model class predicted probabilities were investigated by creating heatmaps with the predicted probabilities of correctly classified and misclassified sequences using only the best model per genome segment. Groups of sequences with available literature were labelled with the notation “Pattern #” on the heatmaps and were used for subsequent investigation of possible explanations for model predicted probabilities. Additionally, patterns in model class predicted probabilities were explored for each of the case study datasets by using the best performing models per genome segment on the case study datasets, and plotting the model class predicted probabilities averaged by year for case studies 1 and 2 and by sequence number for case studies 3 and 4.

Case study model predicted probabilities were further validated through phylogenetic analysis by constructing a representative maximum likelihood phylogenetic tree for the HA segment. Representative sequences were obtained for each of the 7 host classes and 4 case study datasets separately using the CD-HIT EST tool with a 90% similarity threshold [50]. Representative sequences were then aligned using MUSCLE with default settings and a maximum likelihood tree with 100 bootstrap iterations was constructed with default settings using MEGA11 software [51]. Patristic distances were then calculated using the ape package in R [52]. This tree was then imported into the Interactive Tree of Life (iTOL) tool for visualization [53].
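For readers unfamiliar with the term, a patristic distance is the sum of branch lengths along the path connecting two tips of a tree. The Python sketch below computes it on a toy rooted tree; the tree, tip names, and branch lengths are invented for illustration, and the code is not the ape implementation used in the study.

```python
def patristic_distance(tree, tip_a, tip_b):
    """Sum of branch lengths on the path between two tips.

    `tree` maps each non-root node to (parent, branch_length).
    Assumes non-negative branch lengths.
    """
    def distances_to_ancestors(node):
        dists = {node: 0.0}
        travelled = 0.0
        while node in tree:
            parent, length = tree[node]
            travelled += length
            dists[parent] = travelled
            node = parent
        return dists

    up_a = distances_to_ancestors(tip_a)
    up_b = distances_to_ancestors(tip_b)
    # The shortest combined climb meets at the most recent common ancestor.
    return min(up_a[n] + up_b[n] for n in up_a if n in up_b)


# Invented toy tree: tips A and B are sisters, C joins at the root.
toy_tree = {
    "A": ("n1", 0.02),
    "B": ("n1", 0.03),
    "n1": ("root", 0.10),
    "C": ("root", 0.20),
}
```

In this toy tree the sister tips A and B are separated by 0.05, while A and C are separated by 0.32, which is the kind of contrast used to judge whether a case study sequence sits closer to one host class than another.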

Results

Model feature selection

The number of features before and after feature selection is summarized in Table 2. Models with one protein (HA, NA, NP, PA, PB2) had 13887 features available and models with two proteins (PB1, NS, MP) had 22314 features available, which were reduced by 90% to 1388 and 2231 after performing feature selection, respectively. All models were found to select features from all nucleotide k-mer categories of size 1–6 and protein k-mer categories of size 1–3. For models with two proteins (PB1, NS, MP), more features were generally selected from protein one (PB1, NS1, M1) than from protein two (PB1-F2, NS2, M2), with exceptions in the PB1 and MP models, where a higher number of amino acid properties was selected for the PB1-F2 protein versus the PB1 protein and a higher number of 1-mers was selected for the M2 protein versus the M1 protein, respectively.

Table 2. Number of features extracted from nucleotide and protein sequence sets and number of features selected after feature selection.

https://doi.org/10.1371/journal.pone.0336142.t002

Amino acid properties

Amino acid properties selected by each model are summarized in Table 2. A minimum of 1 amino acid property was selected per protein for each of the models with the exception of the PB1 model where protein 1 had 0 amino acid properties selected. The HA, NA, and PB2 models differed from the other models by only having 1 amino acid property selected which was identified as net charge.

Top ten most important features

The top 10 most important features selected by each of the preliminary models are summarized in S3 Table. A pattern was identified where models either had predominantly nucleotide 5-mers and 6-mers as the top 10 features (HA, NA, MP) or models had predominantly amino acid 2-mers and 3-mers as the top 10 features (NP, PA, PB2, NS). The PB1 model differed by having a mix of both nucleotide and amino acid k-mers as the top 10 features. Additionally, the PB2 model was the only model with an amino acid property (net charge) within the top 10 features.

Model validation results

Model performance metrics for all final trained models using the test dataset are summarized in Table 3. All models were shown to perform very well in host prediction across all 8 genome segments with overall accuracies and κ values ranging from 0.9951–0.9967 and 0.9844–0.9896, respectively. Random forest and XGBoost models performed exactly the same in 5 segments (HA, NP, PB2, PB1, MP) where they had the same overall accuracies and κ values. XGBoost models outperformed random forest models in two segments (NA, PA) and the random forest model marginally outperformed the XGBoost model in one segment (NS). Multinomial logistic regression with ridge penalization was found to be the worst performing model overall, performing worse than both XGBoost and random forest models across all genome segments except for the MP genome segment where all models were found to have the same overall accuracies and κ values. All of the models were found to have significantly greater (p < 0.01) overall accuracies than no-information rates (0.8175). Macro F1 scores ranged from 0.8457–0.9240 for multinomial logistic regression models and 0.9251–0.9696 for XGBoost and random forest models. F1 scores for each class are summarized in S4 Table.
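The gap between overall accuracy and macro F1 on an imbalanced dataset can be illustrated with the Python sketch below. The confusion matrix is invented for the example and does not come from Table 3; the function name is hypothetical.

```python
def macro_f1(matrix):
    """Unweighted mean of per-class F1 scores.

    matrix[i][j] = number of sequences of true class i predicted as class j.
    """
    n = len(matrix)
    scores = []
    for c in range(n):
        tp = matrix[c][c]
        fn = sum(matrix[c]) - tp
        fp = sum(matrix[r][c] for r in range(n)) - tp
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / n


# Invented imbalanced example: 990 majority and 10 minority sequences.
toy = [[990, 0], [8, 2]]
toy_accuracy = (toy[0][0] + toy[1][1]) / 1000
toy_macro_f1 = macro_f1(toy)
```

Here accuracy is 0.992 while macro F1 is roughly 0.66, because every class contributes equally to the macro average regardless of its size. This is one plausible reason, offered as an interpretation rather than a result, for macro F1 values sitting below overall accuracy when a small class is poorly predicted.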

Table 3. Model performance metrics on the testing dataset.

https://doi.org/10.1371/journal.pone.0336142.t003

Model class sensitivities

Model class sensitivities per genome segment are summarized in Table 4. Class sensitivities for the human, mallard, swine, and goose classes were shown to vary within each model per genome segment whereas the canine, chicken, and equine classes had little to no variation in model class sensitivities across all genome segments. Large variation was observed in class sensitivities for the goose class where class sensitivities ranged from 0.5000–0.7500 for the XGBoost models, 0.3750–0.7500 for the random forest models, and 0–0.3750 for the multinomial logistic regression with ridge penalization models. Model class specificities are summarized in S5 Table and ranged from 0.9874–1 across all classes, with slight variability observed in the human and mallard classes and little to no variability observed in the swine, goose, equine, chicken, and canine classes.

Table 4. Class sensitivities for each of the 7 classes on the testing dataset.

https://doi.org/10.1371/journal.pone.0336142.t004

Best performing models

The best performing models for the HA, NA, NP, PA, PB2, PB1, and MP genome segments were determined as the XGBoost models and the best performing model for the NS genome segment was identified as the random forest model. Host predictions from the best performing models per segment were then tallied for each of the whole genome sequence sets within the test datasets and are shown in S6 Table. The majority of sequence sets were correctly classified with all segments predicted as the same host (6043/6078, 99.4%). A small number of sequence sets were classified as the correct host but with one or more segments predicted to be from a different host (13/6078, 0.213%), misclassified as the wrong host with all segments being predicted as the same host (15/6078, 0.247%), or misclassified as the wrong host with one or more segments being predicted as a different host (7/6078, 0.115%).

Model class predicted probability heatmaps

For each genome segment a heatmap was generated to investigate model predicted probabilities of correctly classified and misclassified sequences. The heatmap for the HA genome segment is shown in Fig 1 with the heatmaps for the remaining genome segments shown in S1–S7 Figs. In Fig 1, the predicted probabilities for the correctly classified sequences are depicted in Fig 1A and the predicted probabilities for the misclassified sequences are depicted in Fig 1B. In Fig 1A, 6037/6055 (99.7%) sequences were classified with predicted probabilities of 90–100% for the correct class and 18/6055 (0.3%) were correctly classified with mixed host predicted probabilities. From the latter group of 18 sequences, 7 sequences were identified to have mixed host predicted probabilities > 10% split between human and swine, 6 sequences split between goose and mallard, 2 sequences split between chicken, goose, human, and mallard, and 3 split between mallard and classes with less than 10% predicted probabilities. As an example, strain A/mallard/Hungary/19616/2007(H3N8) (accession # GQ240821) from Pattern 1 had predicted probabilities of 89% mallard, 7.2% goose, 2.1% human and 1.2% chicken. The remaining predicted probabilities for sequences from the mixed host Patterns 1–3 are detailed in S2 File.

Fig 1. Heatmaps of correctly classified and misclassified HA sequences.

Heatmaps for the HA genome segment using the XGBoost model (XGB). A shows the predicted probabilities for the correctly classified sequences and B shows the predicted probabilities for the misclassified sequences. Predicted probabilities are read as rows, and the N on the y axis denotes the number of sequences with the predicted probability pattern shown for the respective row. Rows labelled as Pattern 1–7 indicate rows where a representative sequence was investigated using available literature.

https://doi.org/10.1371/journal.pone.0336142.g001

In Fig 1B, 17/23 (73.9%) sequences were misclassified with predicted probabilities of 90–100% for the incorrect class and 6/23 (26.1%) were misclassified with mixed host predicted probabilities. From the former group of 17 sequences, 11 sequences were misclassified with 90–100% predicted probabilities as human, and 6 sequences were misclassified with 90–100% predicted probabilities as mallard. As an example, strain A/chicken/Nanjing/B854-2/2011(H3N8) (accession # KU158890) from Pattern 4 was misclassified as mallard with predicted probabilities of 91% mallard, 2.8% human, 1.6% goose, 1.5% chicken, 1.5% canine, and 1% swine. The remaining predicted probabilities for misclassified sequences from Patterns 4–7 are detailed in S2 File.
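The grouping used in these heatmaps, correct versus misclassified and 90–100% versus mixed host probabilities, can be expressed as a small Python function. This is an illustrative sketch: the function name and labels are hypothetical, and treating a top probability of at least 0.90 as the cutoff is an assumption based on the ranges quoted in the text. The two probability vectors are the examples given for Patterns 1 and 4.

```python
def describe_prediction(probabilities, true_host, confident=0.90):
    """Label one sequence by correctness and by how concentrated the call is."""
    predicted = max(probabilities, key=probabilities.get)
    outcome = "correct" if predicted == true_host else "misclassified"
    spread = "confident" if probabilities[predicted] >= confident else "mixed"
    return predicted, outcome, spread


# A/mallard/Hungary/19616/2007(H3N8), Pattern 1 (values from the text).
pattern_1 = {"mallard": 0.89, "goose": 0.072, "human": 0.021, "chicken": 0.012}

# A/chicken/Nanjing/B854-2/2011(H3N8), Pattern 4 (values from the text).
pattern_4 = {"mallard": 0.91, "human": 0.028, "goose": 0.016,
             "chicken": 0.015, "canine": 0.015, "swine": 0.01}

label_1 = describe_prediction(pattern_1, "mallard")
label_4 = describe_prediction(pattern_4, "chicken")
```

Under this rule the mallard example is a correct call with mixed probabilities, and the chicken example is a confident misclassification, the category that was followed up in the literature for evidence of between-species transmission.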

Case study predicted probabilities

The best performing models per genome segment were used on the case study datasets and model predicted probabilities were plotted averaged by year for case study 1 (canine H3N8) and case study 2 (swine H3N2 2010.2 clade) as shown in Figs 2-3, respectively, and plotted by sequence number for case study 3 (duck H3NX) and case study 4 (environment H3NX) as shown in Figs 4-5, respectively. In Fig 2, the 33 canine H3N8 sequence sets in case study 1 from 2004–2016 were misclassified as the equine class with high averaged predicted probabilities across all genome segments.

Fig 2. Predicted probability plots by year for the canine H3N8 case study.

Averaged predicted probability plots by year for each genome segment using the best performing model per segment (random forest model for the NS segment, XGBoost models for the remaining 7 segments) on the canine H3N8 case study 1 dataset.

https://doi.org/10.1371/journal.pone.0336142.g002

Fig 3. Predicted probability plots by year for the swine H3N2 2010.2 case study.

Averaged predicted probability plots by year for each genome segment using the best performing model per segment (random forest model for the NS segment, XGBoost models for the remaining 7 segments) on the swine H3N2 2010.2 clade case study 2 dataset.

https://doi.org/10.1371/journal.pone.0336142.g003

Fig 4. Predicted probability plots by sequence number for the duck H3NX case study.

Predicted probability plots by sequence number for each genome segment using the best performing model per segment (random forest model for the NS segment, XGBoost models for the remaining 7 segments) on the duck H3NX case study 3 dataset.

https://doi.org/10.1371/journal.pone.0336142.g004

Fig 5. Predicted probability plots by sequence number for the environment H3NX case study.

Predicted probability plots by sequence number for each genome segment using the best performing model per segment (random forest model for the NS segment, XGBoost models for the remaining 7 segments) on the environment H3NX case study 4 dataset.

https://doi.org/10.1371/journal.pone.0336142.g005

In Fig 3, the 40 H3N2 2010.2 clade sequence sets in case study 2 from 2017–2023 were misclassified as the human class with very high averaged predicted probabilities for the HA and NA segments, with the remaining 6 segments being predicted as the swine class with very high averaged predicted probabilities. The HA and NA segments had predicted probabilities of 0.98–1 for the human class that remained constant throughout 2017–2023. The NP, PA, PB2, PB1, NS and MP segments had predicted probabilities of 0.75–0.80 for the swine class in 2017, which increased to 0.90–1 from 2018–2023.

In Fig 4, the 977 H3 duck sequence sets in case study 3 were predicted as the mallard class with high predicted probabilities for the majority of sequences across all genome segments. Predicted probabilities for the mallard class ranged from 0.75−1 for most sequence sets, with a moderate number of sequence sets predicted as the mallard class with predicted probabilities ranging from 0.25–0.75 for all genome segments. Smaller groups of sequences were also predicted as either the goose or chicken class with predicted probabilities ranging from 0.25–0.95 in each genome segment. Additionally, a few sequences were predicted as swine or human with predicted probabilities ranging from 0.25–0.85 in each genome segment, with a very small number also being predicted as canine with predicted probabilities ranging from 0.40–0.75 in the PA and PB1 segments.

In Fig 5, the 105 environment sequence sets in case study 4 were predicted as the mallard class with high predicted probabilities for the majority of sequence sets across all genome segments, in addition to a small number of sequence sets being predicted as swine with high predicted probabilities. Predicted probabilities for the mallard class ranged from 0.5−1 for most sequence sets, with a small group of sequence sets ranging from 0.7−1 for the swine class. The PB2 segment differed from the other segments by having a small number of sequences predicted as the human class with predicted probabilities ranging from 0.27–0.80; these sequences are shown in S7 Table.

Phylogenetic analysis of case study datasets

Phylogenetic analysis was conducted using a maximum likelihood tree constructed from 64 representative HA sequences as shown in Fig 6. The representative H3N8 canine case study sequence was descriptively clustered with an equine representative sequence with accession number MH796298. The representative H3N2 2010.2 swine clade sequence was descriptively clustered with a human representative sequence with accession number KY925323. The duck case study had multiple representative sequences which descriptively formed many different clusters generally comprised of a mix of mallard, chicken, and goose representative sequences. The environment case study also had multiple representative sequences which descriptively formed clusters which generally consisted of mallard and duck representative sequences in addition to a few clusters with swine, human, or goose representative sequences. Summary statistics for patristic distances are shown in S8 Table. Species-species patristic distances are summarized in S9 Table, with patristic distances between and within each host class shown in S8 Fig.

Fig 6. Representative maximum-likelihood phylogenetic tree for HA sequences.

Maximum-likelihood phylogenetic tree with 100 bootstrap iterations constructed using the 64 HA representative sequences retrieved from the 4 case study datasets and 7 classes within the whole dataset excluding the case study sequences.

https://doi.org/10.1371/journal.pone.0336142.g006

Discussion

Comparison of model performance during model validation revealed that all three types of models were capable of predicting the host for each genome segment with high accuracy using nucleotide and protein features. Random forest and XGBoost models had the same predictive performance in most genome segments, with XGBoost models slightly outperforming random forest in two segments whereas random forest outperformed XGBoost in only 1 segment. Prior studies have compared predictive capabilities of these models where similar results were found with boosting machines generally outperforming random forest models [28,54–56]. Notably, datasets from these other studies were often not imbalanced, whereas the models in our study were trained on an imbalanced dataset. Class sensitivities for our models were very high across all genome segments with the exception of the goose class, which had class sensitivities that were as low as zero for some of the multinomial logistic regression with ridge penalization models. This can be explained by the fact that the dataset was imbalanced and that there were only 29 sequence sets available for the goose class; however, sensitivities for goose ranged from 0.3750–0.7500 for the random forest and XGBoost models indicating that these ensemble methods may perform satisfactorily even when trained on a low number of sequence sets. Our results therefore support previous studies indicating that ensemble methods such as boosting machines and random forests are resilient options when working with imbalanced datasets [57–59].

Prior studies involving HA IAV classification have achieved similar performance using different approaches, with overall accuracies > 0.98 or F1 scores > 0.96 for predicting the host classes of human, swine, and the broad class of avian [29–31]. Using a broad avian class reduces the number of classes to learn from and mitigates issues associated with small sample size. Despite these advantages, this approach also results in the loss of species-level information and limits the ability to investigate between-species transmission patterns for many important avian hosts such as turkeys [21,60]. A recent study demonstrated that H3 HA IAV avian sequences can be delineated into distinct species and predicted with moderate to high class sensitivities, such as 0.84 for the chicken class and 0.57 for the turkey class [28]. The dataset in that study was imbalanced, with very few turkey sequences available (n = 26), and the low class sensitivity for the turkey class was thought to be due to small sample size. This result closely parallels the goose class in our present study, where the small sample size (n = 29) also resulted in low class sensitivity across all of our models. Therefore, these results indicate the need for a large sample size when separating broad avian categories into distinct species for host prediction.

Many studies have used nucleotide sequences and nucleotide k-mers for host prediction of viruses; however, protein sequences and amino acid k-mers are used less frequently, and even fewer studies have used both in conjunction [28,61–63]. Feature selection for our models demonstrated that k-mers from both nucleotide and protein sequences are useful for IAV genome segment host prediction, with all preliminary models selecting features from all available nucleotide and protein sequence k-mer categories. Protein sequences are more conserved than their nucleotide counterparts; therefore, features such as amino acid k-mers and amino acid properties may be more informative over wider evolutionary distances in host prediction [64]. Nucleotide k-mers of length 5–6 and amino acid k-mers of length 2–3 comprised the top 10 most important features in our models, with some models predominantly favouring nucleotide or amino acid k-mers. Optimal k-mer length has been shown to vary depending on the type of virus involved, with lengths of 2–4 shown to be optimal for a mix of RNA viruses in multi-host prediction and lengths >= 6 optimal for phages [63,65,66]. Longer k-mers may improve predictive performance as there could be host-specific k-mers present that can distinguish between hosts [67], which may explain why the longer k-mers were selected among the top 10 most important features in our models. Of the six amino acid properties included, net charge was selected by 7 of the 8 preliminary models for protein 1, was the only amino acid property selected by the HA, NA, and PB2 models, and was the 2nd most important feature in the PB2 model. Net charge has previously been identified to alter the electrostatic interactions involved in receptor and protein binding affinities for many genome segments, including HA and PB2, in addition to playing a role in antigenic evolution of NA [68–71]. Previous studies have also identified net charge as a feature with high importance in host prediction of the HA segment and in prediction of host tropism for the HA, NS1, and PB2 proteins [28,68]. Consequently, amino acid net charge and feature importance in general are important characteristics that warrant further investigation in future studies involving host prediction of IAV genome segments.
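
To make the two feature families discussed above concrete, the sketch below counts overlapping k-mers in a sequence and computes a deliberately crude net charge for a protein. Both functions and the example sequences are illustrative assumptions: the charge rule simply scores K and R as +1 and D and E as −1, which is a simplification and not necessarily how the amino acid properties in this study were calculated.

```python
from collections import Counter


def kmer_counts(sequence, k):
    """Count all overlapping k-mers in a nucleotide or amino acid sequence."""
    sequence = sequence.upper()
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def approximate_net_charge(protein):
    """Crude net charge near neutral pH: K/R count as +1, D/E as -1.

    Simplifying assumption for illustration; histidine, termini, and
    pKa-based partial charges are ignored.
    """
    protein = protein.upper()
    positive = sum(protein.count(aa) for aa in "KR")
    negative = sum(protein.count(aa) for aa in "DE")
    return positive - negative


if __name__ == "__main__":
    # Hypothetical short sequences, not real influenza segments.
    print(kmer_counts("ATGATGC", 5))
    print(kmer_counts("MKTIIAL", 2))
    print(approximate_net_charge("MKTIIALSYIFCLAK"))
```

Each distinct k-mer becomes one feature column, so the number of possible features grows quickly with k (4^k for nucleotides, 20^k for amino acids), which is one reason feature selection is needed before model training.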

Possible biological explanations were sought in the literature for the predicted probabilities of correctly classified sequences with mixed host predictions and of misclassified sequences from the HA segment. Patterns 1–3 corresponded to correctly classified sequences and Patterns 4–7 corresponded to misclassified sequences from the HA heatmap (Fig 1). Sequence GQ240821 from Pattern 1 (89% mallard, 7.2% goose) was phylogenetically analysed and was found to cluster with various wild avian H3N8 isolates from Northern to Southern Europe [72]. Sequence AB569511 from Pattern 2 (85.3% goose, 13% mallard) clustered with waterfowl strains [73]. Sequence CY116315 from Pattern 3 (82.2% swine, 14.8% human) clustered with swine H3N2 human seasonal-origin strains [74]. Sequence KU158890 from Pattern 4 (91% mallard, sampled host was chicken) clustered with duck strains sampled within the same study, and it was noted that the slaughterhouses where these samples were retrieved contained many different avian species, including chickens and ducks, that remained in contact with each other for many days, providing a fertile environment for between-species transmission of IAV [75]. Sequence LC644998 from Pattern 5 (100% human, sampled host was swine) was found to cluster closely with seasonal human H3N2 strains, specifically seasonal human strains from the same region (both pig and human strains were from Zambia), suggesting that reverse zoonosis had occurred [76]. Sequence JX080759 from Pattern 6 (47.4% mallard, 37.5% goose, sampled host was goose) was found to have been sampled from the Yukon-Kuskokwim Delta, a region considered a high priority for surveillance of between-species transmission because it is a major migratory flyway across the eastern and western hemispheres [77]. Sequence JX096504 from Pattern 7 (81.6% mallard, 6.6% chicken, sampled host was swine) clustered with strains from domestic aquatic birds and was the first detection of a swine H3N2 strain with a genome segment from H5N1 highly pathogenic avian influenza [78]. As demonstrated, substantial evidence from the literature supports the model predicted probabilities for the HA segment and denotes a pattern pertinent to risk assessment, where misclassified sequences with high predicted probabilities are very often involved in between-species transmission. Furthermore, correctly classified and misclassified sequences with predicted probabilities that were very close between two hosts (e.g., Pattern 6, 47.4% mallard, 37.5% goose) represent sequences that warrant further investigation, as these sequences may be suggestive of recent between-species transmission or potential for further such events.
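
The two patterns described above (a sampled host that disagrees with a high-probability prediction, and two hosts with nearly equal probabilities) can be expressed as a simple triage rule. The sketch below is one possible formulation, not a procedure used in this study: the 0.90 cut-off follows the > 90% threshold used for misclassified sequences, whereas the 0.15 margin and the category names are hypothetical choices.

```python
def flag_sequence(probabilities, sampled_host, high=0.90, margin=0.15):
    """Triage one sequence from its predicted class probabilities.

    probabilities: dict mapping host class -> predicted probability.
    high:   cut-off for a high-confidence prediction.
    margin: hypothetical gap below which the top two hosts count as close.
    """
    ranked = sorted(probabilities.items(), key=lambda kv: kv[1], reverse=True)
    top_host, top_p = ranked[0]
    second_p = ranked[1][1] if len(ranked) > 1 else 0.0
    if top_p - second_p < margin:
        return "ambiguous between hosts"
    if top_host != sampled_host:
        return "high-confidence mismatch" if top_p > high else "mismatch"
    return "no flag"


if __name__ == "__main__":
    # Probabilities as reported in the text for three HA patterns; only the
    # classes quoted in the text are listed, so they need not sum to 1.
    print(flag_sequence({"mallard": 0.89, "goose": 0.072}, "mallard"))  # Pattern 1
    print(flag_sequence({"human": 1.0, "swine": 0.0}, "swine"))         # Pattern 5
    print(flag_sequence({"mallard": 0.474, "goose": 0.375}, "goose"))   # Pattern 6
```

A rule of this kind only prioritizes sequences for follow-up; as in the examples above, interpretation still depends on phylogenetic context and sampling metadata.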

Predicted probability plots for case study 1 revealed that all segments of canine H3N8 were misclassified as the equine class with high averaged predicted probabilities from 2004–2016. Phylogenetic analysis using our maximum-likelihood tree supported these model predictions, as the representative canine H3N8 HA sequence clustered with a representative equine H3N8 HA sequence. Equine H3N8 was first isolated in equines in 1963 and was identified to have spilled over into canines in 1999, with the first detection occurring in the United States in 2004 [37,39]. This subtype was primarily restricted to the United States and seemingly became extinct from 2016 onwards, hence the limited time frame of 2004–2016 in our dataset. A previous phylogenetic analysis of equine and canine H3N8 determined that these strains were distinct, indicating that the virus had evolved in canines after introduction [79]. This was partially reflected in the predicted probability plots for this case study, where the predicted probabilities for equine decreased slightly and those for swine increased slightly from 2008–2016, suggesting that changes were occurring in this canine subtype over this timeframe. Further investigation should be conducted to determine why the models predicted this strain as becoming more swine-like rather than canine-like. Additionally, this misclassification indicates that the models are not capable of accurate host prediction for host and subtype combinations that are not present within the training dataset. However, this type of misclassification, where all segments are misclassified with high predicted probability, is still informative when combined with phylogenetic analysis and could be useful in systematic identification of variants of interest if it is shown to be a recurrent pattern.

Predicted probability plots for case study 2 identified that the HA and NA segments of the swine H3N2 2010.2 clade were misclassified as the human class with very high averaged predicted probabilities, whereas the remaining 6 segments were predicted as the swine class with very high averaged predicted probabilities from 2017–2023. Phylogenetic analysis using our maximum-likelihood tree supported these model predictions, as the representative swine H3N2 2010.2 clade HA sequence clustered with a representative human H3N2 HA sequence. This strain was first detected in swine from Oklahoma in 2017, and the HA and NA of this strain were identified as having 99% nucleotide identity with seasonal human strains from 2017, suggesting that this strain was a product of reverse zoonosis [40]. That study also identified the remaining segments of PB2, PB1, PA, NP, and NS to be most similar to those of triple-reassortant swine-origin strains, with the MP segment found by phylogenetic analysis to be most similar to the 2009 H1N1 pandemic swine strains. Therefore, this evidence supports the model predicted probabilities for this case study, and this type of misclassification, where one or more segments are misclassified with high predicted probability, could be useful in systematic identification of reassortants if the pattern occurs consistently.
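
The idea of using segment-level disagreement as a reassortment signal can be sketched as follows. This is an illustrative assumption about how such a check might be written, not the study's procedure: the function name, the 0.90 cut-off, and the example probabilities are hypothetical, and only the pattern of HA and NA predicted as human with the other six segments predicted as swine mirrors case study 2.

```python
from collections import Counter


def discordant_segments(segment_calls, high=0.90):
    """Return confidently predicted segments whose host differs from the majority.

    segment_calls: dict mapping segment name -> (predicted host, probability).
    Segments predicted with probability <= high are ignored.
    """
    confident = {seg: host for seg, (host, p) in segment_calls.items() if p > high}
    if not confident:
        return {}
    majority_host, _ = Counter(confident.values()).most_common(1)[0]
    return {seg: host for seg, host in confident.items() if host != majority_host}


if __name__ == "__main__":
    # Hypothetical probabilities arranged to mirror the case study 2 pattern.
    calls = {
        "PB2": ("swine", 0.99), "PB1": ("swine", 0.98), "PA": ("swine", 0.99),
        "HA": ("human", 0.99), "NP": ("swine", 0.97), "NA": ("human", 0.98),
        "MP": ("swine", 0.99), "NS": ("swine", 0.96),
    }
    print(discordant_segments(calls))
```

A check of this kind would not flag the case study 1 pattern, in which every segment agrees on a host that differs from the sampled host; that situation requires comparing the predictions with the sampling metadata instead.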

Predicted probability plots for case study 3 showed that a majority of duck H3 strains were predicted as the mallard class with moderate to high predicted probabilities, with minor groups of sequences predicted as goose or chicken with moderate to high predicted probabilities across all genome segments. Phylogenetic analysis using our maximum-likelihood tree somewhat supported these model predictions, as the representative duck HA sequences clustered with a mix of mallard, goose, and chicken HA representative sequences. The duck case study comprised sequence sets from over 40 different species of ducks (S1 Table) excluding mallards, indicating that mallards may be representative of duck H3 strains as a whole. Mallards are the most common duck species found within the Northern Hemisphere and are key natural reservoirs for IAV infection [80,81]; it is therefore logical that a majority of the duck sequence sets had high predicted probabilities for the mallard class. However, numerous sequence sets in this case study were simply labelled as “duck”, so there is still some uncertainty as to whether these sequence sets are from mallards or other types of ducks. Furthermore, small groups of sequences being predicted as goose and chicken were not surprising given that these are the other two avian classes present in the models and are hosts that are commonly infected with avian IAV. Further investigation of the specific instances where sequence sets are predicted as goose or chicken with high predicted probability may be useful to identify whether between-species transmission of strains had occurred in these instances.

Predicted probability plots for case study 4 are summarized in S3 File. A majority of the environmental sequences were predicted as mallard, with a small group of sequence sets consistently predicted as swine across all segments. Most of the sequences predicted as mallard were identified as originating from Maryland, United States, with no further information available. In contrast, six sequence sets were consistently predicted as swine across all 8 segments and were identified to have originated from Indiana, United States, with four of these sequence sets identified to be from agricultural fairs involving swine [82] and two identified to be from livestock exhibitions. A prior study had also predicted the same four sequence sets from agricultural fairs as swine using only the HA segment, supporting the model predictions for these environmental sequence sets as swine [28].

Limitations

Case study 1 demonstrated a limitation of the models: no segments of canine H3N8 were predicted as canine because all canine H3N8 sequences had been removed from the training dataset to be used as a case study. Evidently, these models require a minimum number of sequence sets from a given combination of subtype and host to learn from before being able to predict it accurately; therefore, newly emerging subtypes and subtypes from classes outside of those included in model training will almost certainly be misclassified. However, misclassification of this type is useful in itself, as sequences flagged by these models as misclassified with high predicted probabilities can then be investigated further.

Another limitation of these models is that sequence data were insufficient or absent for many host classes of interest, resulting in poor performance for the goose class and the non-inclusion of important hosts such as turkeys. Class sensitivities for the goose class were much lower and inconsistent as a result of having only 29 sequence sets available for model training, but would be expected to improve if at least 100 sequence sets were obtained. Turkeys have also been identified to be involved with H3 IAV between-species transmission events in many hosts including swine and wild waterfowl, warranting further investigation [21,60]. More data should therefore be collected from additional databases to retrieve as many sequence sets as possible, to maximize predictive performance and to ensure that all important hosts are included as classes for model prediction. Alternative approaches may include applying oversampling methods such as the synthetic minority oversampling technique (SMOTE) when databases are exhausted and additional sequence sets are not available [83–85]. Additionally, recombination of sequences was not considered within this study, although it is an important feature of viral evolution and should correspondingly be investigated in future studies regarding host prediction [86].
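
The core idea behind SMOTE is to create synthetic minority-class samples by interpolating between an existing minority sample and one of its nearest minority-class neighbours. The sketch below shows only that interpolation step under simplifying assumptions; it is not the reference implementation, and the function name, parameters, and toy feature vectors are hypothetical.

```python
import random


def smote_like(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic points by interpolating between a randomly
    chosen minority sample and one of its k nearest minority neighbours.

    minority: list of equal-length numeric feature vectors (at least two).
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` by squared Euclidean distance.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


if __name__ == "__main__":
    # Hypothetical minority-class feature vectors (e.g., scaled k-mer counts).
    goose_like = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
    for point in smote_like(goose_like, n_new=3):
        print(point)
```

Two caveats apply to data of the kind used here. Interpolated k-mer counts are no longer integers and do not correspond to any real sequence, and oversampling should be applied only to the training portion of the data so that synthetic points derived from test sequences cannot leak into training.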

A further limitation is that the features chosen during feature selection may be biased towards our models, as cross-validation was not performed during feature selection. This lowers the generalizability of our features to other models. Future studies should therefore perform cross-validation while conducting feature selection to obtain the most robust and unbiased features possible for subsequent model training. Other approaches to cross-validation and model validation should also be considered to limit the possibility of data leakage when splitting the data randomly into training and test sets.
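
One way to limit leakage from a purely random split is to assign whole groups of closely related sequences to either the training or the test set, so that near-identical sequences never appear on both sides. The sketch below illustrates such a group-aware split; the grouping variable (for example, a sequence-identity cluster label), the function name, and the test fraction are assumptions made for illustration and were not part of this study.

```python
import random
from collections import defaultdict


def group_split(sample_ids, group_of, test_fraction=0.3, seed=0):
    """Split samples so that every group falls entirely in train or in test.

    sample_ids: list of sample identifiers.
    group_of:   dict mapping sample id -> group label (e.g., a cluster ID).
    """
    groups = defaultdict(list)
    for sid in sample_ids:
        groups[group_of[sid]].append(sid)
    group_ids = sorted(groups)
    random.Random(seed).shuffle(group_ids)
    n_test_groups = max(1, round(len(group_ids) * test_fraction))
    test_groups = set(group_ids[:n_test_groups])
    train = [s for s in sample_ids if group_of[s] not in test_groups]
    test = [s for s in sample_ids if group_of[s] in test_groups]
    return train, test


if __name__ == "__main__":
    # Hypothetical sample identifiers and cluster labels.
    ids = ["s1", "s2", "s3", "s4", "s5", "s6"]
    clusters = {"s1": "c1", "s2": "c1", "s3": "c2",
                "s4": "c3", "s5": "c3", "s6": "c4"}
    train, test = group_split(ids, clusters)
    print(train, test)
```

The same principle applies to feature selection: if features are selected within each training fold only, the held-out fold gives a less optimistic estimate of how well the selected features generalize.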

Conclusion

In conclusion, all models demonstrated strong performance in distinct host prediction of IAV whole genome sequence data using features comprised solely of k-mers and amino acid properties, with overall accuracies and κ values greater than 0.995 and 0.984, respectively. Models involving the ensemble methods of random forest and XGBoost were also shown to be resilient options for host prediction of very small minority classes in imbalanced datasets, as these models had class sensitivities that ranged from 0.375–0.750 for the smallest minority class of goose. In comparison, models involving the non-ensemble method of multinomial logistic regression with ridge penalization had lower class sensitivities that ranged from 0–0.375. Furthermore, misclassified sequence sets with high predicted probabilities were identified as possible indicators for systematic identification of between-species transmission events and reassortant strains, and were strongly supported by external validation using past literature and case study datasets. Similarly, correctly classified and misclassified sequences with predicted probabilities that were very close for two or more hosts were also indicative of recent and potential between-species transmission events. Application of these models as classifiers for H3 IAVs will therefore allow for accurate and rapid host prediction of IAV sequence data from any of the 8 genome segments and provide a strong framework that can be expanded upon for risk assessment and investigation of variants with higher than typical potential for between-species transmission. Nonetheless, the case studies that resulted in misclassification warrant caution, and the training and validation process should be further improved to prevent data leakage between the training and validation datasets.

Supporting information

S1 Table. Sequence labels included from each class.

Sequence labels included from each species class after extracting the hosts from the labels and preprocessing to only have the classes of canine, equine, goose, human, mallard, swine, duck, and environment.

https://doi.org/10.1371/journal.pone.0336142.s001

(DOCX)

S2 Table. Subtype distribution after preprocessing.

Subtype distribution of the 21429 H3 whole genome sequence sets by host class retrieved from the NCBI Influenza Virus Database, BV-BRC database, and EpiFlu database after preprocessing was completed.

https://doi.org/10.1371/journal.pone.0336142.s002

(DOCX)

S3 Table. Top 10 most important features for each genome segment.

Top 10 features with the highest mean decrease Gini scores after feature selection for each of the 8 preliminary random forest models.

https://doi.org/10.1371/journal.pone.0336142.s003

(DOCX)

S4 Table. F1 Scores for each class on the testing dataset.

F1 scores for each of the 7 classes of canine, chicken, equine, goose, human, mallard, and swine obtained during model validation using the trained models from each genome segment on their respective test datasets.

https://doi.org/10.1371/journal.pone.0336142.s004

(DOCX)

S5 Table. Class specificities for each class on the testing dataset.

Class specificities for each of the 7 classes of canine, chicken, equine, goose, human, mallard, and swine obtained during model validation using the trained models from each genome segment on their respective test datasets.

https://doi.org/10.1371/journal.pone.0336142.s005

(DOCX)

S6 Table. Segment host prediction counts on the testing dataset.

Segment host prediction counts obtained from using the best performing models per genome segment (random forest model for the NS segment, XGBoost models for the remaining 7 segments) on the test dataset of 6078 sequences.

https://doi.org/10.1371/journal.pone.0336142.s006

(DOCX)

S7 Table. Accession numbers and labels for PB2 environment sequences being predicted as human.

https://doi.org/10.1371/journal.pone.0336142.s007

(DOCX)

S8 Table. Summary statistics for patristic distances from the maximum-likelihood phylogenetic tree.

Mean, median, mode, minimum, and maximum patristic distances were calculated from the representative HA maximum-likelihood phylogenetic tree.

https://doi.org/10.1371/journal.pone.0336142.s008

(DOCX)

S9 Table. Summary statistics for species-species patristic distances.

Mean, median, and patristic distance range for each of the species-species pairs.

https://doi.org/10.1371/journal.pone.0336142.s009

(DOCX)

S1 Fig. Heatmaps of correctly classified and misclassified PB2 sequences.

Heatmaps for the PB2 genome segment using the XGBoost model (XGB). Panel A shows the predicted probabilities for the correctly classified sequences and panel B shows the predicted probabilities for the misclassified sequences. Predicted probabilities are read as rows, and the N on the y-axis denotes the number of sequences with the predicted probability pattern shown for the respective row.

https://doi.org/10.1371/journal.pone.0336142.s010

(TIFF)

S2 Fig. Heatmaps of correctly classified and misclassified PB1 sequences.

Heatmaps for the PB1 genome segment using the XGBoost model (XGB). Panel A shows the predicted probabilities for the correctly classified sequences and panel B shows the predicted probabilities for the misclassified sequences. Predicted probabilities are read as rows, and the N on the y-axis denotes the number of sequences with the predicted probability pattern shown for the respective row.

https://doi.org/10.1371/journal.pone.0336142.s011

(TIFF)

S3 Fig. Heatmaps of correctly classified and misclassified PA sequences.

Heatmaps for the PA genome segment using the XGBoost model (XGB). Panel A shows the predicted probabilities for the correctly classified sequences and panel B shows the predicted probabilities for the misclassified sequences. Predicted probabilities are read as rows, and the N on the y-axis denotes the number of sequences with the predicted probability pattern shown for the respective row.

https://doi.org/10.1371/journal.pone.0336142.s012

(TIFF)

S4 Fig. Heatmaps of correctly classified and misclassified NP sequences.

Heatmaps for the NP genome segment using the XGBoost model (XGB). Panel A shows the predicted probabilities for the correctly classified sequences and panel B shows the predicted probabilities for the misclassified sequences. Predicted probabilities are read as rows, and the N on the y-axis denotes the number of sequences with the predicted probability pattern shown for the respective row.

https://doi.org/10.1371/journal.pone.0336142.s013

(TIFF)

S5 Fig. Heatmaps of correctly classified and misclassified NA sequences.

Heatmaps for the NA genome segment using the XGBoost model (XGB). Panel A shows the predicted probabilities for the correctly classified sequences and panel B shows the predicted probabilities for the misclassified sequences. Predicted probabilities are read as rows, and the N on the y-axis denotes the number of sequences with the predicted probability pattern shown for the respective row.

https://doi.org/10.1371/journal.pone.0336142.s014

(TIFF)

S6 Fig. Heatmaps of correctly classified and misclassified MP sequences.

Heatmaps for the MP genome segment using the XGBoost model (XGB). Panel A shows the predicted probabilities for the correctly classified sequences and panel B shows the predicted probabilities for the misclassified sequences. Predicted probabilities are read as rows, and the N on the y-axis denotes the number of sequences with the predicted probability pattern shown for the respective row.

https://doi.org/10.1371/journal.pone.0336142.s015

(TIFF)

S7 Fig. Heatmaps of correctly classified and misclassified NS sequences.

Heatmaps for the NS genome segment using the random forest model (RF). Panel A shows the predicted probabilities for the correctly classified sequences and panel B shows the predicted probabilities for the misclassified sequences. Predicted probabilities are read as rows, and the N on the y-axis denotes the number of sequences with the predicted probability pattern shown for the respective row.

https://doi.org/10.1371/journal.pone.0336142.s016

(TIFF)

S8 Fig. Between and within species patristic distances for each of the 7 classes.

Patristic distances between and within each of the 7 host classes plotted separately.

https://doi.org/10.1371/journal.pone.0336142.s017

(TIFF)

S1 File. Supplementary information on the sequence selection criteria for each of the databases.

https://doi.org/10.1371/journal.pone.0336142.s018

(DOCX)

S2 File. Supplementary information on the Fig 1A and 1B heatmap predicted probabilities.

https://doi.org/10.1371/journal.pone.0336142.s019

(DOCX)

S3 File. Supplementary information on the environment case study.

https://doi.org/10.1371/journal.pone.0336142.s020

(DOCX)

Acknowledgments

We would like to thank researchers for depositing their sequences into public repositories allowing for ease of access.

References

  1. Allen JD, Ross TM. H3N2 influenza viruses in humans: Viral mechanisms, evolution, and evaluation. Hum Vaccin Immunother. 2018;14(8):1840–7. pmid:29641358
  2. Taubenberger JK, Morens DM. Influenza: The Once and Future Pandemic. Public Health Rep. 2010;125(3_suppl):15–26.
  3. Liu R, Sheng Z, Huang C, Wang D, Li F. Influenza D virus. Curr Opin Virol. 2020;44:154–61. pmid:32932215
  4. Sederdahl BK, Williams JV. Epidemiology and Clinical Characteristics of Influenza C Virus. Viruses. 2020;12(1):89. pmid:31941041
  5. Tyrrell CS, Allen JLY, Gkrania-Klotsas E. Influenza: epidemiology and hospital management. Medicine (Abingdon). 2021;49(12):797–804. pmid:34849086
  6. Boni MF, Galvani AP, Wickelgren AL, Malani A. Economic epidemiology of avian influenza on smallholder poultry farms. Theor Popul Biol. 2013;90:135–44. pmid:24161559
  7. Friesema IHM, Havelaar AH, Westra PP, Wagenaar JA, van Pelt W. Poultry culling and Campylobacteriosis reduction among humans, the Netherlands. Emerg Infect Dis. 2012;18(3):466–8. pmid:22377498
  8. Moraes DCA, L Vincent Baker A, Wang X, Zhu Z, Berg E, Trevisan G, et al. Veterinarian perceptions and practices in prevention and control of influenza virus in the Midwest United States swine farms. Front Vet Sci. 2023;10:1089132. pmid:36816189
  9. Tong S, Zhu X, Li Y, Shi M, Zhang J, Bourgeois M, et al. New world bats harbor diverse influenza A viruses. PLoS Pathog. 2013;9(10):e1003657. pmid:24130481
  10. Bean WJ, Schell M, Katz J, Kawaoka Y, Naeve C, Gorman O, et al. Evolution of the H3 influenza virus hemagglutinin from human and nonhuman hosts. J Virol. 1992;66(2):1129–38. pmid:1731092
  11. Webster RG, Bean WJ, Gorman OT, Chambers TM, Kawaoka Y. Evolution and ecology of influenza A viruses. Microbiol Rev. 1992;56(1):152–79.
  12. Guan Y, Vijaykrishna D, Bahl J, Zhu H, Wang J, Smith GJD. The emergence of pandemic influenza viruses. Protein Cell. 2010;1(1):9–13. pmid:21203993
  13. Long JS, Mistry B, Haslam SM, Barclay WS. Host and viral determinants of influenza A virus species specificity. Nat Rev Microbiol. 2019;17(2):67–81. pmid:30487536
  14. Short KR, Richard M, Verhagen JH, van Riel D, Schrauwen EJA, van den Brand JMA, et al. One health, multiple challenges: The inter-species transmission of influenza A virus. One Health. 2015;1:1–13. pmid:26309905
  15. Webster RG, Shortridge KF, Kawaoka Y. Influenza: interspecies transmission and emergence of new pandemics. FEMS Immunol Med Microbiol. 1997;18(4):275–9. pmid:9348163
  16. Anthony SJ, St Leger JA, Pugliares K, Ip HS, Chan JM, Carpenter ZW, et al. Emergence of fatal avian influenza in New England harbor seals. mBio. 2012;3(4):e00166–12. pmid:22851656
  17. Guo Y, Wang M, Kawaoka Y, Gorman O, Ito T, Saito T, et al. Characterization of a new avian-like influenza A virus from horses in China. Virology. 1992;188(1):245–55. pmid:1314452
  18. Karasin AI, West K, Carman S, Olsen CW. Characterization of avian H3N3 and H1N1 influenza A viruses isolated from pigs in Canada. J Clin Microbiol. 2004;42(9):4349–54. pmid:15365042
  19. Tu J, Zhou H, Jiang T, Li C, Zhang A, Guo X, et al. Isolation and molecular characterization of equine H3N8 influenza viruses from pigs in China. Arch Virol. 2009;154(5):887–90. pmid:19396578
  20. Anderson TK, Chang J, Arendsee ZW, Venkatesh D, Souza CK, Kimble JB, et al. Swine Influenza A Viruses and the Tangled Relationship with Humans. Cold Spring Harb Perspect Med. 2021;11(3):a038737. pmid:31988203
  21. Choi YK, Lee JH, Erickson G, Goyal SM, Joo HS, Webster RG, et al. H3N2 influenza virus transmission from swine to turkeys, United States. Emerg Infect Dis. 2004;10(12):2156–60. pmid:15663853
  22. Freidl GS, Meijer A, de Bruin E, de Nardi M, Munoz O, Capua I, et al. Influenza at the animal-human interface: a review of the literature for virological evidence of human infection with swine or avian influenza viruses other than A(H5N1). Euro Surveill. 2014;19(18):20793. pmid:24832117
  23. Hoschler K, Thompson C, Casas I, Ellis J, Galiano M, Andrews N, et al. Population susceptibility to North American and Eurasian swine influenza viruses in England, at three time points between 2004 and 2011. Euro Surveill. 2013;18(36):pii=20578. pmid:24079379
  24. Raghunath S, Pudupakam RS, Deventhiran J, Tevatia R, Leroith T. Pathogenicity and transmission of triple reassortant H3N2 swine influenza A viruses is attenuated following Turkey embryo propagation. Vet Microbiol. 2017;201:208–15. pmid:28284612
  25. Sharma A, Zeller MA, Souza CK, Anderson TK, Vincent AL, Harmon K. Characterization of a 2016-2017 human seasonal H3 influenza A virus spillover now endemic to U.S. swine. mSphere. 2022;7.
  26. Zeller MA, Carnevale de Almeida Moraes D, Ciacci Zanella G, Souza CK, Anderson TK, Baker AL, et al. Reverse zoonosis of the 2022-2023 human seasonal H3N2 detected in swine. Npj Viruses. 2024;2(1):27. pmid:40295797
  27. Zhou NN, Senne DA, Landgraf JS, Swenson SL, Erickson G, Rossow K, et al. Genetic reassortment of avian, swine, and human influenza A viruses in American pigs. J Virol. 1999;73(10):8851–6. pmid:10482643
  28. Alberts F, Berke O, Maboni G, Petukhova T, Poljak Z. Utilizing machine learning and hemagglutinin sequences to identify likely hosts of influenza H3Nx viruses. Prev Vet Med. 2024;233:106351. pmid:39353303
  29. Chrysostomou C, Alexandrou F, Nicolaou MA, Seker H. Classification of Influenza Hemagglutinin Protein Sequences using Convolutional Neural Networks. Annu Int Conf IEEE Eng Med Biol Soc. 2021;2021:1682–5. pmid:34891609
  30. Xu Y, Wojtczak D. Predicting Influenza A Viral Host Using PSSM and Word Embeddings. In: 2021 IEEE Conference on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB), 2021. 1–10. doi: https://doi.org/10.1109/cibcb49929.2021.9562959
  31. Xu Y, Wojtczak D. Dive into machine learning algorithms for influenza virus host prediction with hemagglutinin sequences. Biosystems. 2022;220:104740. pmid:35934256
  32. Van Poelvoorde L, Vanneste K, De Keersmaecker SCJ, Thomas I, Van Goethem N, Van Gucht S, et al. Whole-Genome Sequence Approach and Phylogenomic Stratification Improve the Association Analysis of Mutations With Patient Data in Influenza Surveillance. Front Microbiol. 2022;13:809887. pmid:35516436
  33. Van Poelvoorde LAE, Bogaerts B, Fu Q, De Keersmaecker SCJ, Thomas I, Van Goethem N, et al. Whole-genome-based phylogenomic analysis of the Belgian 2016-2017 influenza A(H3N2) outbreak season allows improved surveillance. Microb Genom. 2021;7(9):000643. pmid:34477544
  34. Bao Y, Bolotov P, Dernovoy D, Kiryutin B, Zaslavsky L, Tatusova T, et al. The influenza virus resource at the National Center for Biotechnology Information. J Virol. 2008;82(2):596–601. pmid:17942553
  35. Khare S, Gurry C, Freitas L, Schultz MB, Bach G, Diallo A, et al. GISAID’s Role in Pandemic Response. China CDC Wkly. 2021;3(49):1049–51. pmid:34934514
  36. Olson RD, Assaf R, Brettin T, Conrad N, Cucinell C, Davis JJ, et al. Introducing the Bacterial and Viral Bioinformatics Resource Center (BV-BRC): a resource combining PATRIC, IRD and ViPR. Nucleic Acids Res. 2023;51:D678–D689.
  37. Crawford PC, Dubovi EJ, Castleman WL, Stephenson I, Gibbs EPJ, Chen L, et al. Transmission of equine influenza virus to dogs. Science. 2005;310(5747):482–5. pmid:16186182
  38. Payungporn S, Crawford PC, Kouo TS, Chen L, Pompey J, Castleman WL, et al. Influenza A virus (H3N8) in dogs with respiratory disease, Florida. Emerg Infect Dis. 2008;14(6):902–8. pmid:18507900
  39. Wasik BR, Rothschild E, Voorhees IEH, Reedy SE, Murcia PR, Pusterla N, et al. Understanding the divergent evolution and epidemiology of H3N8 influenza viruses in dogs and horses. Virus Evol. 2023;9:vead052.
  40. Zeller MA, Li G, Harmon KM, Zhang J, Vincent AL, Anderson TK, et al. Complete Genome Sequences of Two Novel Human-Like H3N2 Influenza A Viruses, A/swine/Oklahoma/65980/2017 (H3N2) and A/Swine/Oklahoma/65260/2017 (H3N2), Detected in Swine in the United States. Microbiol Resour Announc. 2018;7(20):e01203–18. pmid:30533826
  41. Chang J, Anderson TK, Zeller MA, Gauger PC, Vincent AL. octoFLU: Automated Classification for the Evolutionary Origin of Influenza A Virus Gene Sequences Detected in U.S. Swine. Microbiol Resour Announc. 2019;8(32):e00673–19. pmid:31395641
  42. Pagès H, Aboyoun P, Gentleman R, DebRoy S. Biostrings: Efficient manipulation of biological strings. https://bioconductor.org/packages/Biostrings. 2024.
  43. Wilkinson S. Kmer v1.0.2. Zenodo. 2018. doi: https://doi.org/10.5281/zenodo.1227690
  44. Gupta NT, Vander Heiden JA, Uduman M, Gadala-Maria D, Yaari G, Kleinstein SH. Change-O: a toolkit for analyzing large-scale B cell immunoglobulin repertoire sequencing data. Bioinformatics. 2015;31(20):3356–8. pmid:26069265
  45. Kuhn M. Building Predictive Models in R Using the caret Package. J Stat Soft. 2008;28(5).
  46. Louppe G, Wehenkel L, Sutera A, Geurts P. Understanding variable importances in forests of randomized trees. Adv Neural Inf Process Syst. 2013;26.
  47. Simon N, Friedman J, Hastie T, Tibshirani R. Regularization Paths for Cox’s Proportional Hazards Model via Coordinate Descent. J Stat Softw. 2011;39(5):1–13. pmid:27065756
  48. Tay JK, Narasimhan B, Hastie T. Elastic Net Regularization Paths for All Generalized Linear Models. J Stat Softw. 2023;106:1. pmid:37138589
  49. Cohen J. A Coefficient of Agreement for Nominal Scales. Educ Psychol Meas. 1960;20:37–46.
  50. Li W, Godzik A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics. 2006;22(13):1658–9. pmid:16731699
  51. Tamura K, Stecher G, Kumar S. MEGA11: Molecular Evolutionary Genetics Analysis Version 11. Mol Biol Evol. 2021;38(7):3022–7. pmid:33892491
  52. Paradis E, Schliep K. ape 5.0: an environment for modern phylogenetics and evolutionary analyses in R. Bioinformatics. 2019;35(3):526–8. pmid:30016406
  53. Letunic I, Bork P. Interactive Tree Of Life (iTOL) v5: an online tool for phylogenetic tree display and annotation. Nucleic Acids Res. 2021;49(W1):W293–6. pmid:33885785
  54. 54. Hong W, Zhou X, Jin S, Lu Y, Pan J, Lin Q, et al. A Comparison of XGBoost, Random Forest, and Nomograph for the Prediction of Disease Severity in Patients With COVID-19 Pneumonia: Implications of Cytokine and Immune Cell Profile. Front Cell Infect Microbiol. 2022;12:819267. pmid:35493729
  55. 55. Omar ED, Mat H, Abd Karim AZ, Sanaudi R, Ibrahim FH, Omar MA, et al. Comparative Analysis of Logistic Regression, Gradient Boosted Trees, SVM, and Random Forest Algorithms for Prediction of Acute Kidney Injury Requiring Dialysis After Cardiac Surgery. Int J Nephrol Renovasc Dis. 2024;17:197–204. pmid:39070075
  56. 56. Sahin EK. Assessing the predictive capability of ensemble tree methods for landslide susceptibility mapping using XGBoost, gradient boosting machine, and random forest. SN Appl Sci. 2020;2: 1–17. doi:https://doi.org/10.1007/S42452-020-3060-1/TABLES/1
  57. 57. Liu L, Wu X, Li S, Li Y, Tan S, Bai Y. Solving the class imbalance problem using ensemble algorithm: application of screening for aortic dissection. BMC Med Inform Decis Mak. 2022;22(1):82. pmid:35346181
  58. 58. Muchlinski D, Siroky D, He J, Kocher M. Comparing Random Forest with Logistic Regression for Predicting Class-Imbalanced Civil War Onset Data. Polit anal. 2016;24(1):87–103.
  59. 59. Tanha J, Abdi Y, Samadi N, Razzaghi N, Asadpour M. Boosting methods for multi-class imbalanced data classification: an experimental review. J Big Data. 2020;7(1).
  60. 60. Guo X, Flores C, Munoz-Aguayo J, Halvorson DA, Lauer D, Cardona CJ. Historical and Recent Cases of H3 Influenza A Virus in Turkeys in Minnesota. Avian Dis. 2015;59(4):512–7. pmid:26629625
  61. 61. Ahlgren NA, Ren J, Lu YY, Fuhrman JA, Sun F. Alignment-free $d_2^*$ oligonucleotide frequency dissimilarity measure improves prediction of hosts from metagenomically-derived viral sequences. Nucleic Acids Res. 2017;45(1):39–53. pmid:27899557
  62. 62. Raj A, Dewar M, Palacios G, Rabadan R, Wiggins CH. Identifying hosts of families of viruses: a machine learning approach. PLoS One. 2011;6(12):e27631. pmid:22174744
  63. 63. Young F, Rogers S, Robertson DL. Predicting host taxonomic information from viral genomes: A comparison of feature representations. PLoS Comput Biol. 2020;16(5):e1007894. pmid:32453718
  64. 64. Wheeler D, Bhagwat M. BLAST QuickStart: Example-Driven Web-Based BLAST Tutorial. Comparative Genomics. Humana Press:149–76. doi: https://doi.org/10.1385/1-59745-514-8:149
  65. 65. Zhang M, Yang L, Ren J, Ahlgren NA, Fuhrman JA, Sun F. Prediction of virus-host infectious association by supervised learning methods. BMC Bioinformatics. 2017;18(Suppl 3):60. pmid:28361670
  66. 66. Perelygin FS, Lukashev AN, Aleshina YA. The effect of taxonomic, host-dependent features and sample bias on virus host prediction using machine learning and short sequence k-mers. Sci Rep. 2025;15(1):31592. pmid:40866484
  67. 67. Moeckel C, Mareboina M, Konnaris MA, Chan CSY, Mouratidis I, Montgomery A, et al. A survey of k-mer methods and applications in bioinformatics. Comput Struct Biotechnol J. 2024;23:2289–303. pmid:38840832
  68. 68. Eng CLP, Tong JC, Tan TW. Predicting host tropism of influenza A virus proteins using random forest. BMC Med Genomics. 2014;7 Suppl 3(Suppl 3):S1. pmid:25521718
  69. 69. Hensley SE, Das SR, Bailey AL, Schmidt LM, Hickman HD, Jayaraman A, et al. Hemagglutinin receptor binding avidity drives influenza A virus antigenic drift. Science. 2009;326(5953):734–6. pmid:19900932
  70. 70. Kobayashi Y, Suzuki Y. Compensatory evolution of net-charge in influenza A virus hemagglutinin. PLoS One. 2012;7(7):e40422. pmid:22808159
  71. 71. Wang Y, Lei R, Nourmohammad A, Wu NC. Antigenic evolution of human influenza H3N2 neuraminidase is constrained by charge balancing. Elife. 2021;10:e72516. pmid:34878407
  72. 72. Szeleczky Z, Bálint A, Gyarmati P, Metreveli G, Dán A, Ursu K, et al. Characterization of two low pathogenic avian influenza viruses isolated in Hungary in 2007. Vet Microbiol. 2010;145(1–2):142–7. pmid:20363081
  73. 73. Simulundu E, Ishii A, Igarashi M, Mweene AS, Suzuki Y, Hang’ombe BM, et al. Characterization of influenza A viruses isolated from wild waterfowl in Zambia. J Gen Virol. 2011;92(Pt 6):1416–27. pmid:21367986
  74. 74. Lycett SJ, Baillie G, Coulter E, Bhatt S, Kellam P, McCauley JW, et al. Estimating reassortment rates in co-circulating Eurasian swine influenza viruses. J Gen Virol. 2012;93(Pt 11):2326–36. pmid:22971819
  75. 75. Cui H, Shi Y, Ruan T, Li X, Teng Q, Chen H, et al. Phylogenetic analysis and pathogenicity of H3 subtype avian influenza viruses isolated from live poultry markets in China. Sci Rep. 2016;6:27360. pmid:27270298
  76. 76. Harima H, Okuya K, Kajihara M, Ogawa H, Simulundu E, Bwalya E, et al. Serological and molecular epidemiological study on swine influenza in Zambia. Transbound Emerg Dis. 2022;69(4):e931–43. pmid:34724353
  77. 77. Reeves AB, Pearce JM, Ramey AM, Ely CR, Schmutz JA, Flint PL, et al. Genomic analysis of avian influenza viruses from waterfowl in western Alaska, USA. J Wildl Dis. 2013;49(3):600–10. pmid:23778609
  78. 78. Su S, Chen J, Qi H, Zhu W, Xie J, Huang Z, et al. Complete Genome Sequence of a Novel Avian-Like H3N2 Swine Influenza Virus Discovered in Southern China. J Virol. 2012;86(17):9533. pmid:22879607
  79. 79. Rivailler P, Perry IA, Jang Y, Davis CT, Chen L-M, Dubovi EJ, et al. Evolution of canine and equine influenza (H3N8) viruses co-circulating between 2005 and 2008. Virology. 2010;408(1):71–9. pmid:20880564
  80. 80. Arsnoe DM, Ip HS, Owen JC. Influence of body condition on influenza A virus infection in mallard ducks: experimental infection data. PLoS One. 2011;6(8):e22633. pmid:21857940
  81. 81. Latorre-Margalef N, Brown JD, Fojtik A, Poulson RL, Carter D, Franca M, et al. Competition between influenza A virus subtypes through heterosubtypic immunity modulates re-infection and antibody dynamics in the mallard duck. PLoS Pathog. 2017;13(6):e1006419. pmid:28640898
  82. 82. Lauterbach SE, Wright CM, Zentkovich MM, Nelson SW, Lorbach JN, Bliss NT, et al. Detection of influenza A virus from agricultural fair environment: Air and surfaces. Prev Vet Med. 2018;153:24–9. pmid:29653731
  83. 83. Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP. SMOTE: Synthetic Minority Over-sampling Technique. jair. 2002;16:321–57.
  84. 84. Gunasekaran H, Ramalakshmi K, Rex Macedo Arokiaraj A, Deepa Kanmani S, Venkatesan C, Suresh Gnana Dhas C. Analysis of DNA Sequence Classification Using CNN and Hybrid Models. Comput Math Methods Med. 2021;2021:1835056. pmid:34306171
  85. 85. Mujahid M, Kına E, Rustam F, Villar MG, Alvarado ES, De La Torre Diez I, et al. Data oversampling and imbalanced datasets: an investigation of performance for machine learning and feature engineering. J Big Data. 2024;11(1).
  86. 86. Jaya FR, Brito BP, Darling AE. Evaluation of recombination detection methods for viral sequencing. Virus Evol. 2023;9.