Towards a Better Detection of Horizontally Transferred Genes by Combining Unusual Properties Effectively

Background: Horizontal gene transfer (HGT) is one of the major mechanisms contributing to microbial genome diversification. A number of computational methods for finding horizontally transferred genes have been proposed over the past decades; however, none of them has yet provided a reliable detector. In existing parametric approaches, only a single compositional property participates in the detection process, or the results obtained from each single property are simply combined. Different properties carry different information, so a single property cannot sufficiently capture the information encoded by gene sequences. In addition, the class imbalance problem in the datasets, which also causes large errors in gene detection, has not been considered by the published methods. Here we developed an effective classifier system (Hgtident) that uses a support vector machine (SVM) and combines unusual properties effectively for HGT detection.
Results: Our approach, Hgtident, includes the introduction of more representative datasets, optimization of the SVM model, feature selection, handling of the imbalance problem in the datasets, and extensive performance evaluation via systematic cross-validation. Through feature selection, we found that JS-DN and JS-CB have higher discriminating power for HGT detection, while GC1–GC3 and k-mer (k = 1, 2, …, 7) make the least contribution. Extensive experiments indicated that the new classifier reduces Mean error dramatically and also improves Recall to some extent. For the testing genomes, compared with the existing popular multiple-threshold approach, on average our Recall improved by 2.81% and our Mean error fell by 26.32%, which means that numerous false positives were classified correctly.
Conclusions: Hgtident, introduced here, is an effective approach for better detecting HGT. Combining multiple features of HGT is also essential for detecting a wider range of HGT events.


Introduction
Horizontal gene transfer (HGT, also called lateral gene transfer) is the transfer of genetic material from one lineage to another, and it has played a key role in species evolution and microbial genome diversification [1,2]. Transfers can occur between both closely and distantly related species or strains, and are thought to be frequent events [3]. In addition, horizontal gene transfer has been proposed to contribute to the emergence of novel human diseases and to pose several risks to humans [4,5]. As sequence data have accumulated, evidence for rampant HGT has increased dramatically. Detecting HGT therefore has great practical significance, both for better understanding the impact of HGT on genome evolution and for identifying new drug targets.
At present, there are two primary strategies for detecting genes that have been transferred horizontally: phylogenetic approaches and parametric approaches [6,7]. Phylogenetic approaches are typically based on the comparative study of numerous genomes to find genes with unusual taxonomic distributions. However, many other phenomena, such as biased mutation rates, gene loss and long-branch attraction, can also cause the phylogenetic tree of a gene to differ from that of the species. Phylogenetic approaches are therefore time-consuming and insufficiently robust [8,9].
In contrast, parametric approaches (also called composition-based approaches) rest on the common premise that the unusual characteristics of horizontally transferred genes distinguish them from the other genes in a genome. This kind of approach is computationally less demanding and can be carried out on a single genome. Various parametric approaches have been proposed so far, but they share one drawback: only a single compositional property is used to identify the transferred genes in each prediction experiment. Different properties carry different information, so this limitation leads to large errors in HGT detection. Combined methods were proposed by Becq et al. [10] and Azad et al. [11] to resolve this problem, but these methods merely combine the predictions obtained from each single property; in essence, each prediction still relies on one property. How to sufficiently extract the information encoded by genes has therefore become an open and challenging issue. In addition, machine learning has been applied widely to HGT detection [12,13], but these studies did not consider the class imbalance problem, which can result in poor classification performance with respect to the minority class [14].
In light of these caveats, in this study we developed a new strategy (Hgtident) that uses a support vector machine (SVM) to detect horizontally transferred genes by combining the unusual properties effectively, while also addressing the class imbalance problem. The information from the combined properties can sufficiently represent the whole gene sequence. To our knowledge, this is the first use of such an integrated strategy to identify horizontally transferred genes. As a result, Hgtident achieves better performance than the existing methods.

Datasets
Previously published studies put forward various artificial datasets [3,5,6,7,8,10,11,12,13], each composed of donor genes and a recipient genome; the task is to recover as many of the donor genes as possible. It is important to note, however, that such datasets do not account for the genes that the recipient genome itself acquired from other species during its evolutionary history.
Therefore, to validate the performance of Hgtident on genuine genomes, we chose six commonly used genomes from the more reliable HGT-DB database (http://genomes.urv.cat/HGT-DB/) [15]: E. coli K12, E. coli O157 Sakai, S. enterica Typhi CT18, S. enterica Paratyphi ATCC 9150, C. pneumoniae CWL029 and S. agalactiae 2603. In each genome, the horizontally transferred genes were regarded as positives and all other genes as negatives. The results predicted by Hgtident were compared with those of the popular multiple-threshold approach proposed by Azad et al. [11].

Selection of SVM Model
SVM is a supervised machine learning paradigm derived from statistical learning theory and based on the structural risk minimization principle; it solves linear and non-linear classification and regression problems [21]. We chose SVM as our classification paradigm because of its high generalization capability, its ability to find global classification solutions [21], and its successful application in bioinformatics and other practical domains.
Model selection for an SVM involves choosing a kernel function and the parameter values that yield the best classification performance for a given dataset [21]. Among the available kernel functions, we chose the popular and widely used Radial Basis Function (RBF) because of its higher reliability in finding optimal classification solutions in most practical situations [22]. The performance of the classifier at each parameter point (c, g), that is, the penalty parameter and the RBF kernel parameter gamma, was evaluated by 5-fold cross-validation on the training dataset. After the best parameters were found, a new SVM model was trained on the complete training dataset with those parameters, and a separate testing dataset was then used to measure the performance of the resulting classifier. The C++ interface of the LIBSVM 3.1 package [23] was used to develop the SVM model. Before training the SVM classifier systems, the complete dataset was scaled into the [-1, +1] interval.
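The scaling and parameter search described above can be sketched as follows. This is a minimal illustration in plain Python, not the LIBSVM C++ code used in this study. The function names (fit_scaler, scale_row, grid_search) are hypothetical, the log2 grid ranges are an assumption borrowed from common LIBSVM practice rather than values reported here, and cv_score stands in for a routine that would return the 5-fold cross-validation performance of an RBF SVM at a given (c, g).

```python
import itertools
import math


def fit_scaler(rows):
    """Record the per-feature minimum and maximum of the given rows."""
    return [(min(col), max(col)) for col in zip(*rows)]


def scale_row(row, ranges, lower=-1.0, upper=1.0):
    """Linearly map each feature into [lower, upper]; constant features map to lower."""
    scaled = []
    for value, (lo, hi) in zip(row, ranges):
        if hi == lo:
            scaled.append(lower)
        else:
            scaled.append(lower + (upper - lower) * (value - lo) / (hi - lo))
    return scaled


def grid_search(cv_score, c_exponents, g_exponents):
    """Try every (c, g) = (2**i, 2**j) and return (best_score, c, g).

    cv_score(c, g) is supplied by the caller; here it is a placeholder for
    5-fold cross-validation of an RBF SVM on the training data.
    """
    best = None
    for c_exp, g_exp in itertools.product(c_exponents, g_exponents):
        c, g = 2.0 ** c_exp, 2.0 ** g_exp
        score = cv_score(c, g)
        if best is None or score > best[0]:
            best = (score, c, g)
    return best


def toy_cv_score(c, g):
    """Hypothetical stand-in scorer that peaks at c = 2**3 and g = 2**-5."""
    return -abs(math.log2(c) - 3) - abs(math.log2(g) + 5)


ranges = fit_scaler([[0, 10], [5, 20], [10, 30]])
scaled_mid = scale_row([5, 20], ranges)
# ASSUMPTION: exponent ranges follow a commonly used LIBSVM-style coarse grid.
best_score, best_c, best_g = grid_search(toy_cv_score, range(-5, 16, 2), range(-15, 4, 2))
```

In a real run, the model would be retrained on the whole training dataset at the returned (c, g) before being applied to the separate testing dataset.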

Feature Selection
Selecting the most discriminative set of features increases the performance, efficiency and comprehensibility of a classifier system by reducing its complexity. In particular, by analyzing the optimal feature subsets for these genomes, we can see clearly which features contribute most to HGT detection. Here a genetic algorithm (GA) was chosen as our feature selection paradigm because of its strong random search ability to find a convincingly optimal feature subset. The GA evaluates a feature subset as a whole rather than one feature at a time, which allows the combination of features to be optimized [24]. First, a number of feature subsets were generated randomly; new feature subsets were then obtained through selection, crossover and mutation. After many iterations of this procedure, the result converges to the optimal solution, which corresponds to the optimal feature subset. 5-fold cross-validation was used to test the generalization ability of each feature subset, and the subset that obtained the best classification performance was taken as the optimal feature subset.
The feature selection procedure was carried out on the initial imbalanced dataset. Each individual in the population was encoded as a binary string, in which 1 meant that the corresponding feature was selected and 0 meant that it was not. The fitness function was f(x) = 10000 × Recall, because we needed to evaluate the Mean error at the point where the highest Recall was obtained. Each generation contained 100 individuals. The crossover probability, mutation probability and number of iterations were set to 0.9, 0.1 and 200, respectively. We employed the classical proportional selection operator, and the two best individuals in every generation were passed directly to the next generation.
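A minimal sketch of such a GA, using the settings listed above (binary encoding, 100 individuals, crossover probability 0.9, mutation probability 0.1, 200 generations, proportional selection, two elite individuals), is given below. Several details are assumptions made for illustration because they are not specified here: single-point crossover, mutation applied per individual as one bit flip, and a non-negative fitness. The toy_fitness function is a hypothetical stand-in for 10000 × Recall obtained from 5-fold cross-validation of the SVM.

```python
import random


def run_ga(fitness, n_features, pop_size=100, p_cross=0.9, p_mut=0.1,
           generations=200, n_elite=2, seed=0):
    """Search for a binary feature mask that maximizes a non-negative fitness."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_features)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        scores = [fitness(ind) for ind in population]
        ranked = sorted(zip(scores, population), key=lambda pair: pair[0], reverse=True)
        # Elitism: the best individuals pass unchanged to the next generation.
        next_gen = [list(ind) for _, ind in ranked[:n_elite]]
        # Proportional (roulette-wheel) selection; the offset keeps the total weight positive.
        weights = [score + 1e-9 for score in scores]
        while len(next_gen) < pop_size:
            mum, dad = rng.choices(population, weights=weights, k=2)
            child_a, child_b = list(mum), list(dad)
            # ASSUMPTION: single-point crossover.
            if n_features > 1 and rng.random() < p_cross:
                cut = rng.randint(1, n_features - 1)
                child_a = mum[:cut] + dad[cut:]
                child_b = dad[:cut] + mum[cut:]
            for child in (child_a, child_b):
                # ASSUMPTION: mutation flips one randomly chosen bit of the individual.
                if rng.random() < p_mut:
                    pos = rng.randrange(n_features)
                    child[pos] = 1 - child[pos]
                if len(next_gen) < pop_size:
                    next_gen.append(child)
        population = next_gen
    return max(population, key=fitness)


def toy_fitness(mask):
    """Hypothetical fitness: rewards features 0 and 3, mildly penalizes the rest."""
    informative = mask[0] + mask[3]
    extras = sum(mask) - informative
    toy_recall = (informative / 2) * (1 - 0.05 * extras)
    return 10000 * toy_recall


best_mask = run_ga(toy_fitness, n_features=7)
```

Because the fitness is evaluated on whole masks, the search scores combinations of features rather than ranking features one at a time.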

Class Imbalance Problem
It is well known that horizontally transferred genes are far fewer than the other genes in each genome, which inevitably produces a class imbalance problem. Machine learning research has established that training a classifier on a dataset with imbalanced positive and negative classes results in poor classification performance with respect to the minority class [14,25]; in this case, that is the horizontally transferred gene class. According to a previous study, the Synthetic Minority Over-sampling Technique (SMOTE), which is independent of the learning algorithm and operates as a pre-processing step on the training data, has been applied successfully to this kind of problem [26]. It is an oversampling technique that introduces new synthetic examples in the neighborhood of the existing minority examples. SMOTE was therefore chosen to resolve the class imbalance problem in this study.
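The idea behind SMOTE can be sketched in a few lines. This is a simplified variant written for illustration, not the implementation used in this study: each synthetic example is placed at a random point on the segment between a randomly chosen minority example and one of its k nearest minority neighbours. The function name smote, the value k = 5 and the random choice of the base example are assumptions.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic points by interpolating between minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority examples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # The k nearest minority neighbours of the base example (excluding itself).
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        # Random position on the segment from base towards the neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy minority class: the four corners of the unit square.
minority_examples = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_examples = smote(minority_examples, n_synthetic=6, k=2)
```

The synthetic examples are added to the minority class of the training data only, so the classifier sees a more balanced training set.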

Evaluation Criteria
In this research, we used the detection rate (Recall) as our primary evaluation criterion, as in the paper by Azad et al. [11]. In addition, Mean error was used as a second criterion to evaluate the performance of Hgtident more fully. They are defined as follows:
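As a hedged sketch of how the two criteria can be computed from confusion-matrix counts: Recall is taken in its standard form, TP / (TP + FN). For Mean error, the code assumes the average of the false-negative rate and the false-positive rate; this is an assumption made for illustration and should be checked against the definition in Azad et al. [11]. The function names and the example counts are hypothetical.

```python
def recall(tp, fn):
    """Fraction of true horizontally transferred genes that were detected."""
    return tp / (tp + fn)


def mean_error(tp, fn, fp, tn):
    """ASSUMPTION: Mean error = average of false-negative and false-positive rates."""
    false_negative_rate = fn / (tp + fn)
    false_positive_rate = fp / (fp + tn)
    return (false_negative_rate + false_positive_rate) / 2


# Hypothetical counts for one genome: 100 transferred genes, 1000 other genes.
example_recall = recall(tp=80, fn=20)
example_mean_error = mean_error(tp=80, fn=20, fp=100, tn=900)
```

Under this assumed definition, a method can keep Recall high and still show a large Mean error if it produces many false positives.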

Feature Selection Results
In the first experiment, we trained an SVM model on the complete imbalanced dataset to observe the classification performance. The complete dataset was randomly divided into five equally sized partitions, each containing the same ratio of positives to negatives. Four partitions were used together as the training dataset to develop an SVM classifier, and the resulting model was tested on the fifth partition. This procedure was repeated five times with different combinations of training and testing datasets, and the results were averaged. Table 1 shows the classification results obtained with all features and with the optimal feature subset. For each genome, using the optimal feature subset improved Recall and reduced Mean error. On average, Recall improved by 6.50% and Mean error fell by 4.67%, which shows that the optimal feature subset has a significant influence on classification results. The resulting optimal feature subsets, with fewer features, not only gave better classification results but also greatly reduced computational complexity.
The optimal feature subset for each genome was also analyzed, and the results are summarized in Table 2. JS-DN and JS-CB appeared in five of the six feature subsets, which indicates that these two features have higher discriminating power for HGT detection than the others. Next came Karlin's codon bias, χ² dinucleotide and χ² codon bias. In contrast, GC1-GC3 and k-mer (k = 1, 2, …, 7) hardly ever appeared in the optimal feature subsets, which indicates that these features contribute least to HGT detection. The same conclusions can be drawn from the results obtained with the multiple-threshold approach in the section "Comparison of Multiple-threshold Approach with Hgtident".
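The partitioning step can be illustrated with a short sketch. It deals the shuffled indices of each class round-robin into five folds, so every fold keeps approximately the same ratio of positives to negatives. The function name stratified_folds and the toy label vector are hypothetical; this is not the code used in this study.

```python
import random


def stratified_folds(labels, n_folds=5, seed=0):
    """Split sample indices into folds that preserve the class ratio."""
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for cls in sorted(set(labels)):
        indices = [i for i, label in enumerate(labels) if label == cls]
        rng.shuffle(indices)
        # Deal the shuffled indices of this class round-robin across the folds.
        for position, index in enumerate(indices):
            folds[position % n_folds].append(index)
    return folds


# Toy genome: 10 transferred genes (1) and 90 other genes (0).
toy_labels = [1] * 10 + [0] * 90
toy_folds = stratified_folds(toy_labels)
```

Each fold then serves once as the testing partition while the other four are merged into the training dataset.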

Class Imbalance Learning Results
The imbalance learning experiments were carried out with 5-fold cross-validation. First, an SVM model was trained after applying SMOTE to a training dataset containing four-fifths of the complete dataset. Its performance was then tested on the remaining one-fifth of the dataset, which was left imbalanced. This procedure was repeated five times with different combinations of training and testing datasets, and the results were averaged. Table 3 presents the classification results of the class imbalance learning method with the optimal feature subsets. Compared with the preliminary classification results obtained on the imbalanced datasets, applying SMOTE improved Recall and reduced Mean error. On average, Recall improved by 6.53% and Mean error fell by 6.02%, which provides good evidence for applying SMOTE to this problem in order to develop a better-performing classifier for imbalanced positive and negative classes.
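The key design point of this procedure, oversampling the training folds while leaving the testing fold untouched, can be sketched as below. The routine is generic: oversample, train and evaluate are caller-supplied placeholders (in this study they would correspond to SMOTE, SVM training and the Recall or Mean error computation). All names are hypothetical, and the toy oversampler simply duplicates positives, which is a stand-in and not SMOTE.

```python
def cross_validate_with_oversampling(samples, folds, oversample, train, evaluate):
    """Average the test score over folds, oversampling only the training part."""
    scores = []
    for held_out, test_indices in enumerate(folds):
        train_samples = [samples[i] for f, fold in enumerate(folds)
                         if f != held_out for i in fold]
        test_samples = [samples[i] for i in test_indices]
        # The held-out fold is never oversampled, so it keeps its natural imbalance.
        model = train(oversample(train_samples))
        scores.append(evaluate(model, test_samples))
    return sum(scores) / len(scores)


# Toy data: (feature, label) pairs with two positives among ten samples.
toy_samples = [(x, 1 if x < 2 else 0) for x in range(10)]
toy_folds = [[0, 2, 3, 4, 5], [1, 6, 7, 8, 9]]
log = []


def duplicate_positives(train_set):
    """Stand-in oversampler (NOT SMOTE): repeat each positive three more times."""
    return train_set + [s for s in train_set if s[1] == 1] * 3


def fake_train(train_set):
    log.append(("train", len(train_set), sum(label for _, label in train_set)))
    return None


def fake_evaluate(model, test_set):
    log.append(("test", len(test_set), sum(label for _, label in test_set)))
    return 1.0


average_score = cross_validate_with_oversampling(
    toy_samples, toy_folds, duplicate_positives, fake_train, fake_evaluate)
```

In the toy run, each training set is balanced to four positives and four negatives, while each testing fold still holds one positive among five samples.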

Comparison of Multiple-threshold Approach with Hgtident
At present, the multiple-threshold approach proposed by Azad et al. [11] is very popular because it can obtain better results. We therefore compared the two approaches to evaluate the performance of Hgtident (Table 4). In the multiple-threshold approach, each Recall value was obtained at the cost of a higher Mean error, which means that a large number of false positives were produced. The likely reason is that only a single property can be applied in this approach, and no single property can sufficiently express the comprehensive information encoded by genes. This information should be captured by different properties that describe the genes from different directions. Therefore, these seven comprehensive and representative features were applied together in this research. Table 4 shows clearly that Hgtident effectively reduced the Mean error, which supports our viewpoint.
In addition, for each genome we took the highest Recall obtained with the multiple-threshold approach, together with the corresponding Mean error, and compared them with our results (Fig. 1). For E. coli K12, our Recall was lower by 0.76%; for E. coli O157 Sakai, S. enterica Typhi CT18, S. enterica Paratyphi ATCC 9150, C. pneumoniae CWL029 and S. agalactiae 2603, our Recall was higher by 0.73%, 5.95%, 0.82%, 3.80% and 6.33%, respectively. Over all six genomes, the mean improvement was 2.81%. For every genome, our Mean error was reduced dramatically, with an overall mean reduction of 26.32%. Tsirigos et al. [12] and Chen et al. [13] also used SVM to predict horizontally transferred genes, but they used simulated datasets and, most importantly, only a single property, which indicates that the information encoded by genes was extracted insufficiently. Surprisingly, neither study considered a proper class imbalance learning method for classifier development. As a result, their results were inferior even to those obtained with the multiple-threshold approach. We can therefore state that the results reported in our research are considerably more reliable and better than those published for other existing approaches.
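The reported mean Recall change can be checked directly from the six per-genome values quoted above; the variable names are illustrative only.

```python
# Per-genome change in Recall (%), in the order the genomes are listed above.
recall_changes = [-0.76, 0.73, 5.95, 0.82, 3.80, 6.33]

mean_change = sum(recall_changes) / len(recall_changes)
print(round(mean_change, 2))  # 2.81
```

The single decrease for E. coli K12 is included in the average, so the 2.81% figure is a net improvement over all six genomes.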

Conclusions
In this research, an integrated strategy that describes the biological information encoded by genes more comprehensively was proposed to identify horizontally transferred genes. SMOTE was also applied to address the class imbalance problem. Extensive experiments indicated that extracting sufficient information reduces Mean error dramatically and also improves Recall to some extent. However, change in gene inventory is a historical process, and how to thoroughly extract the useful information encoded by genes remains an open and challenging issue. Further study is needed to decrease both false positives and false negatives.