Sequence-Based Antigenic Change Prediction by a Sparse Learning Method Incorporating Co-Evolutionary Information

Rapid identification of influenza antigenic variants will be critical in selecting optimal vaccine candidates and thus a key to developing an effective vaccination program. Recent studies suggest that multiple simultaneous mutations at antigenic sites accumulatively enhance antigenic drift of influenza A viruses. However, pre-existing methods on antigenic variant identification are based on analyses from individual sites. Because the impacts of these co-evolved sites on influenza antigenicity may not be additive, it will be critical to quantify the impact of not only those single mutations but also multiple simultaneous mutations or co-evolved sites. Here, we developed and applied a computational method, AntigenCO, to identify and quantify both single and co-evolutionary sites driving the historical antigenic drifts. AntigenCO achieved an accuracy of up to 90.05% for antigenic variant prediction, significantly outperforming methods based on single sites. AntigenCO can be useful in antigenic variant identification in influenza surveillance.


Introduction
Influenza A virus causes both seasonal and pandemic outbreaks, presenting a continuous challenge to public health. Influenza A virus is an RNA virus in the family Orthomyxoviridae, and its genome is composed of eight negative stranded RNA segments. Two processes, namely antigenic drift and antigenic shift, drive the antigenic changes of influenza A virus. Antigenic drift is mainly caused by mutations in influenza surface glycoproteins hemagglutinin (HA) and neuraminidase (NA), which are primary targets for host immune systems. Antigenic shift occurs when an influenza strain with antigenically distinct HA and/or NA genes appears, usually resulting from genetic reassortment. Genetic reassortment is the exchange of one or more discrete RNA segments into multipartite viruses when two or more viruses infect the same cell. Antigenic drift events occur more frequently than antigenic shift events.
Vaccine is the primary option in counteracting influenza outbreaks. Antigenic matches between circulating strains and vaccine seed strain is the key to an effective vaccination program. Recently, we developed AntigenBridges, a sequence-dependent influenza antigenicity quantification method [1]. This method identified antigenicity-associated sites using sparse learning and developed a quantification score using single mutations. This same method was shown to be effective in inferring influenza antigenicity up to an accuracy of 83.78%. Compared to other laboratory-based methods such as hemaglutination inhibition (HI) and microneutralization (MN), this sequence-dependent method allowed us to perform large-scale antigenic characterization in influenza surveillance. More importantly, it can facilitate antigenic characterization for those viruses requiring a high biosafety facility, such as H5 and H7 influenza A virus, which generally require BioSafety Level 3 (BSL-3) facility.
During influenza evolution, multiple sites can co-evolve. A recent study on HA1 proteins of H3N2 influenza A viruses from 1968 to 2005 showed that 88 of the 95 substitutions occurred in groups, and two or more of these residues can mutate simultaneously [2]. These multiple simultaneous mutations at antigenic sites cumulatively enhance antigenic drift [2]. Other studies also identified 308 putative pairs of co-evolved amino acid positions [3]. These studies have suggested that the residues under correlated evolution (co-evolution) are more likely to be physically close in the three-dimensional structure of the protein [4]. Because the impacts of these co-evolved sites are not necessarily additive, the single site-based method described in the previous study would need to be optimized. In this study, we developed AntigenCO, a sparse learning method by incorporating evolutionary information. Our results showed that AntigenCO outperformed Anti-genBridges, and can reach up to 90.05% predictive accuracy.

Results
AntigenCO is a new sparse learning algorithm to quantify antigenic distances among influenza A viruses using influenza HA1 protein sequence information. AntigenCO identifies top features determining antigenic profiles embedded in serological data. This feature can be either a single residue, co-evolved residues, or a residue coupled within a certain physical distances in three-dimensional structures of HA protein. We summarized in Table S1 various features and feature parameters. Two validation schemes, namely sequential validation and 5-fold cross-validation process were used in tuning the model parameters, and the top selected features were used for constructing a model for sequencebased antigenic prediction. This framework was applied into H3N2 influenza dataset containing 512 viruses and 133 serums collected between 1968 and 2007 together with the corresponding HA1 protein, and the obtained prediction model was tested on H3N2 influenza viruses from 2002 to 2013. As influenza A viruses have been evolved into antigenic clusters [5], we applied AntigenCO to infer mutations leading to the antigenic drift events among these clusters.
A set of 12 single or co-mutations were predicted to be responsible for antigenic drifts among these antigenic clusters (Table 2, Figure 2, S1, and S2). Single mutations K156E and S193F were responsible for antigenic drifts TX77-.BK79 and CA04-.BR07, respectively. The other 10 antigenic drift events were instead driven by co-evolutionary mutations, which can be located at the same antibody binding site or across different antibody binding sites. Antigenic cartography demonstrated those mutations drove antigenic drift among these clusters ( Figure 3 and Figure S3).

Co-evolutionary and structural information increases the accuracy in antigenic distance measurement
We compared the prediction accuracies of the separate methods using single sites, co-evolutionary, and co-neighboring sites. Our results showed that both the co-evolutionary information and the co-neighboring information can improve antigenic distance measurement accuracies of sparse learning method, and the coevolutionary information seemed to be more effective ( Figure 4, Table S2). The method combining both co-evolutionary and structural information was also tested, but the prediction accuracy remained similar to those using co-evolutionary information alone. The method ''sinco+EvolT4'' outperformed all other methods in most years. Comparing to ''single'' sites, the prediction RMSE of ''sinco+EvolT4'' decreased by 21.54% on average and decreased up to 60% in some years : 1987, 1995, and 2003. A comparison of different methods and Lasso parameters based on average prediction RMSE from 1985 to 2003 was plotted in Figure 4, Figure 5, and Table S2. The method ''sinco+EvolT4'' with Lasso parameter 16 is used in future prediction studies.
The features with co-evolutionary information were more effective than those with single sites in antigenic variant identification We compared prediction accuracy of the sparse learning framework in this study with three reported feature sets, including 44-single sites [5], 25-single sites [6], and 39-single sites [1]. Our results clearly showed that the feature set integrating coevolutionary information outperformed the other three feature sets (Table 3 and Figure 6). The improvements were up to 41.1%, 32.4%, 28.7% compared with the method using 44-single sites [5], 25-single sites [6], and 39-single sites [1], respectively.
Sequence based antigenicity predication using coevolutionary sites  [7]. In addition, the cartography shows that there is a small antigenic distance between viruses before and after 2011, and the viruses after 2011 had a large extent of antigenic variations.

Discussion
Antigenic changes in seasonable influenza viruses were recently shown to occur more gradually by our recent study [1] and others [2]. These results suggested that multiple mutations in antibody binding sites can occur but not necessarily simultaneously. Some antigenic drift events were driven by multiple mutations, and the impacts of these mutations on antigenic changes are not necessarily additive. For example, our previous experiment showed that mutation N145K and G172D from virus JO/33 changed antigenic distance by 1.29 and 0.44 unit respectively. The N145K-G172D double mutation resulted in an antigenic distance change of 1.83 units, which are different from the simple sum of the antigenic distances from two corresponding individual mutations [1]. This motivated us to improve our earlier prediction functions of influenza A viruses in this study by incorporating features with the co-evolved residues in addition to those single residues. Our results confirmed that incorporation of coupled residues into the prediction function does improve the predictive function of antigenic distances.
This study identified 65 features derived from 38 individual residues contributing to antigenic changes of H3N2 influenza A viruses. These residues included 13 single sites and 52 pairs of coevolutionary sites (Table 1). Except for site 244, 12 out of these 13 single sites were identical to those identified from our previous study [1]. The learning results demonstrated that the impacts of these residues on antigenic drift are not additive, confirming our hypothesis. Nevertheless, the performance of the new predictive function incorporated evolutionary information has been significantly improved.
Our previous study suggested multiple mutations leading to a single antigenic drift event can occur at not only the same antibody Table 1. Top 65 predominant antigenicity associated sites for H3N2 influenza A viruses. Weight denotes the importance of the single and co-evolutionary sites in shaping the antigenic evolution. As suggested by the parameter tuning process ( binding site but also more than one antibody binding site [1]. In this study, we evaluated the co-evolved residues within the same binding site and those across multiple antibody binding sites. Our results further confirmed our previous study and are also consistent with those reported recently [8]. From 1968 to 2007, only two of those 12 historical antigenic drift events, TX77-BK79 and CA04-BR07, were caused by a single mutation whereas the other 12 by two or three residues within or across at least two antibody binding sites ( Table 2). In summary, this study developed a predictive function, AntigenCO, in quantifying antigenic changes using antigenicityassociated sites derived from HI results by sparse learning. The impacts of individual residues on antigenic changes were shown to be non-additive. AntigenCO incorporates such information and  Table 2. Single and co-evolutionary sites driving the 12 antigenic drift events between successive clusters from HK68, EN72, VI75, TX77, BK79, SI87, BE89, BE92, WU95, SY97, FU02, CA04 and BR07.

Drift
Sites Domain CA04-BR07 S193F B As suggested by parameter tuning process (Table S3 and  achieved an accuracy of up to 90.05% for antigenic variant prediction.

AntigenCO
AntigenCO is a sparse learning algorithm to quantify antigenic distance among influenza A viruses using influenza HA1 protein sequence. We implemented AntigenCO in Matlab and all source codes and data used for this study are available at http://sysbio. cvm.msstate.edu/AntigenCO. Sparse learning algorithms. Specifically, the pairwise antigenic distances among viruses are measured by antigenic cartography [9] based on a serological dataset, e.g. HI data. Let the number of viruses be N, then the distances could be formulated as a vector Y of length N 2 , in which each entry calculates the antigenic distance between a pair of viruses. And similarly, we used a matrix X~½x 1 ,x 2 , Á Á Á ,x m to model the genetic profile, where m is the number of single and correlated co-sites, and x i is a vector of length N 2 representing the pairwise genetic change at single or correlated co-site i. To identify antigenicity-associated sites, we mapped the antigenic distances to genetic profile and selected sites whose mutations shape antigenic vector Y. This is a typical feature selection problem in machine learning. Methods like Lasso [10,11] and Ridge regression [11] are effective for selecting a small to moderate number of antigenicity-associated single and correlated co-sites. The Lasso [10] formulates the problem as where s is a threshold parameter that can be tuned to optimize accuracy and w~½w 1 ,w 2 , Á Á Á ,w m denotes the weights of each single and co-evolved sites. Similarly, the Ridge regression formulation of the problem is Lasso is solved by the Matlab code from Sköglund [12] and Ridge regression by the Matlab built-in function ''ridge.m.'' For our H3N2 data, Lasso performs slightly better in prediction root-mean-square error (RMSE) than Ridge regression in two randomly selected antigenic drift data and three sequential data  (Table S5). Thus, we adopted Lasso to perform the analysis throughout this study.
Construction of feature vector and scoring schemes. To generalize the single feature in Sun et. al [1], we introduced cofeatures modeling co-mutations at two sites and then combined single and co-features to model the contribution of both types of sites. There are three types of feature selection methods ''single,'' ''co'', and ''sinco''. The feature type ''single'' takes each single site as a feature and does not consider their correlations; the feature ''co'' considers each pair of single sites as a feature and only considers their co-mutation information; the feature ''sinco'' combines the ''single'' and ''co'' features and models both the contribution of single amino acid sites and their correlations to antigenic evolution. The residues on the surface of HA protein play central roles in shaping antigenic evolution and coevolutionary residues [5] and structural co-neighboring residues work together to define protein functions [4]. Thus, here, the feature vectors only include the residues located at the HA surface, which either co-evolve or are physically close in the protein structure.
Feature encoding functions. We adopted two-feature encoding schemes in this study, binary and PIMA ( Figure S4) scoring schemes, as described in our previous study [1]. A simple comparison in Table S5 shows that the performance of PIMA slightly outperforms that of binary scheme, and thus PIMA will be used in this study.
Single feature. The construction of the single feature is the same as described by Sun et. al [1]. Specifically, let S be the HA1 is the binary or PIMA score of a pair of amino acids S a,i and S b,i for 1ƒa,bƒN and S a,i denotes the amino acid of virus a at position i. Co-and sinco feature. Let i and j be two sites and X i and X j be their representing single features. We construct the co-feature of site i and j by the inner product of vector X i and X j . For example, if the vectors of the ''single'' features are This ''co-feature'' models the co-occurrence of mutations in site i and j. The ''sinco'' feature are those integrating both ''single'' and ''co-'' features.
Identifying surface sites and structural co-neighboring sites. Residues predicted to be on the surface of the HA homotrimer were determined as described previously [1]. Jmol (www.jmol.org) was used to identify amino acid residues having distances less than a predetermined distance threshold from 1 to  10 angstrom with step 1 from H3N2 HA structure (pdb file 2VIU). By structural co-neighbor restriction, we only allowed the cofeatures with amino acid pairs having distances less than the threshold. In Figure S5, the curve shows the number of neighboring pairs against the distance threshold from 1 to 10. A pre-analysis showed that threshold around 6 to 10 (data not shown) maximized the prediction accuracy, and we selected two thresholds 6 and 10 for further analysis. The two methods are denoted as ''sinco+Struct6A'' and ''sinco+Struct10A.'' Mutual information to identify co-evolutionary sites. It has previously been shown that co-evolution at antigenic sites cumulatively enhances antigenic drift [2]. We adopted simple mutual information methods to identify co-evolutionary amino acid sites: Let A~A 1 ,A 2 , Á Á Á ,A 20 f gbe the amino acid set. Then, the entropy of a single amino acid site x is defined to be where p(x~A i ) denotes the frequency of amino acid A i at site x. The joint entropy of two sites x and y is defined as where p(x~A i ,y~A j ) denotes the frequency of amino acid A i at site x and amino acid A j at site y simultaneously. The mutual information of site x and y is then defined as Martin et. al [13] previously showed that MI(x,y)=H(x,y) removes the background information and outperforms others, thus we adopted this normalization scheme. After calculating the mutual information for all pairs, we calculated its Z{score of the mutual information of a pair of sits x and y as , where M MI denotes the mean value of the mutual information of all pairs and s(MI) denotes the standard deviation of mutual information. A threshold value was set to Z to determine co-evolutionary pairs. Figure S6 shows the correlation between the number of co-evolutionary pairs inferred and the thresholds of Z-value for the benchmark 512 sequence from 1968 to 2007. A simple cross-validation analysis showed that the best threshold was from 4 to 16 based on prediction RMSE (data not shown). Thus, we selected thresholds 4, 8, 10, and 16 for further analysis and the methods are denoted as ''sinco+EvolT4,'' ''sinco+EvolT8,'' ''sinco+EvolT10,'' and ''sinco+EvolT16'' respectively.
Combining co-evolutionary and structural information. It is natural to combine co-evolutionary and structural information as they may function together. In the study, we only tested one method combining ''sinco+EvolT2'' and ''sinco+Struct10A,'' which is denoted as ''sinco+Struct10A+EvolT2.'' Figure 5 illustrates how this method is outperformed by ''sinco+EvolT4.'' We believe that the combination of co-evolutionary and structural information may not increase the prediction accuracy significantly. Thus, we did not perform a deeper analysis.
Determining the top number of features. The top number of features of sequential prediction and antigenic drifts are determined by the prediction RMSE curve against the number of features (see Figure S7, S8, and S9). The two figures show that 10-15 features would be enough for antigenic drift data and 30-65 features work well for sequential prediction.
Sequence-based antigenic distance predicting function. Using Lasso, suppose we selected p antigenicityassociated single and co-sites and their associated weights. For simplicity, the p single and co-sites are re-labeled from 1 to p. We quantified the antigenic distance using the function where X X i is the normalized features on single or co-site i, y y is the mean antigenic distance in the training set, and w i is the weight assigned to each selected feature i. The predicted antigenic distances of the viruses are then plotted into two-dimensional or three-dimensional cartography using a multidimensional scaling method (Figure 1, 3, and 7).

Selecting top single or co-evolutionary sites for antigenic drifts
In selecting the top single or co-evolutionary sites for antigenic drift events, we first applied our sparse learning framework with co-evolutionary information into the drift data, obtaining the weights for both single and co-sites. Then, we searched complete graphs in the co-evolutionary file, adding the weights up for all the single and double sites. For example, the weight of co-evolutionary site 135-145-193 is defined as Currently, we have only identified co-evolutionary sites of sizes up to 3. In the end, we ranked all the cliques and selected the one with the highest weight as the single or co-evolutionary sites responsible for the antigenic drift (see Table 2).

Evaluation methods and parameter tuning
Similar to [1], the root mean square error (RMSE) and Pearson correlation coefficient (CC) were used as measures of prediction and training accuracy for tuning best model parameters, e.g. Lasso and Ridge parameter, feature types, and top number of single or co-sites to choose. Specifically, the Lasso and Ridge parameter was tuned from 2 210 to 2 10 with a multiple of two. In addition, we compared single feature alone, ''sinco+StructnA'' with n from 2 to 10, ''sinco+EvolTm'' with m being 0, 2, 4, 8, 10, 16 and 32, and ''sinco+Struct10A+EvolT2''. For brevity, we only showed the prediction accuracy of representative feature types and parameters, for example, ''sinco+EvolT4'' and ''2 21 ''.
We applied two types of validation methods, namely 5-fold cross-validation and sequential validation for antigenic drift data and the whole H3N2 data (See Figure S10), respectively. For the 5-folder cross-validation, we randomly selected 20% viruses as a testing set and the remaining 80% viruses as a training set. Then, we examined the true and predicting distances within viruses in the testing set as well as between testing viruses and training viruses. To avoid the influence of randomness, we reran the program 100 times and used the mean RMSE as the criterion. For sequential parameter tuning, five schemes (Pred1, Pred2, Pred3, Pred4, and Pred5) were applied. Pred1 predicted the pairwise distances of viruses in each pair of consecutive years k and k21 for k [ 1990,2003 ½ using viruses in [1968, k21] as training data. Pred2 predicted the distances between viruses in year k and k21, and between viruses in year k22 and those in years k and k21 using viruses in [1968, k22] as training data. Similar definitions hold for Pred3, Pred4, and Pred5.

Antigenic cartography
The two-dimensional antigenic cartography was constructed by AntigenMap [9] and the three dimensional cartography by AntigenMap3D [14], where the lower reactor was set to 20. The distance-based cartography was constructed using Matlab's builtin nonmetric multidimensional scaling function, ''mdscale.m''. The downloaded sequences were aligned using MUSCLE [15].    Table S3 Compare the influence of restriction methods on antigenic drifts. Each cell records the prediction RMSE of the corresponding restriction method, e.g. ''Single'' on antigenic drift data, e.g. ''HK68-EN72''. To avoid randomness, the RMSE are averaged over 100 runs. In each run, we perform a 5 folder cross validation. For brevity, ''6A'' indicates co-neighbor restriction with distance 6 angstrom, and ''T4'' indicates evolutionary restriction with Z-score threshold 4. Similar definition applies for other methods. (DOC) Table S4 Compare the influence of lasso parameter on antigenic drifts. Each cell records the prediction RMSE of the corresponding lasso parameter, e.g. ''2 21 '' on antigenic drift data, e.g. ''HK68-EN72''. To avoid randomness, the RMSE are averaged over 100 runs. In each run, we perform a 5 folder cross validation. (DOC) Table S5 Comparison of 2 machine learning methods Lasso and Ridge regression and 2 scoring schemes 0-1 and PIMA. Each cell lists the smallest average prediction RMSE for all feature types and model parameters on drift data ''HK68-EN72'' and ''BE92-WU95'', and sequential data [1968,1985], [1968,1986] and [1968,1987]. (DOC)