Fig 1.
Similarity can reveal useful meaning if you compare with the right things.
Gene A is compared against two other genes, B and C. By direct comparison, Gene A would be deemed related to B but not to C. Because Gene A is highly similar to Gene B, the comparison yields a large block of conserved regions. If we instead compare with Gene D, which is evolutionarily more distant, only the blue motif is revealed as conserved, making it more likely to be critical to gene function. Note, however, that both Gene C and Gene D show limited similarity to Gene A.
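The idea in Fig 1 can be illustrated with a minimal sketch, assuming the sequences are already aligned and of equal length. The toy sequences and the helper names `conserved_positions` and `percent_identity` are hypothetical and are not taken from the figure; they only show how a distant comparison narrows the conserved set to a short motif.

```python
def conserved_positions(seq_a: str, seq_b: str) -> list[int]:
    """Return indices where two pre-aligned sequences carry the same residue.

    Gap characters ("-") are never counted as conserved.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return [i for i, (x, y) in enumerate(zip(seq_a, seq_b)) if x == y and x != "-"]


def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percentage of aligned positions that are identical."""
    return 100.0 * len(conserved_positions(seq_a, seq_b)) / len(seq_a)


# Hypothetical toy sequences (illustrative only, not real genes).
gene_a = "ACGTACGTAC"
gene_b = "ACGTACGTAA"  # close relative of A: almost everything matches
gene_d = "TGCAACGTGT"  # distant relative of A: only a short motif matches

close_block = conserved_positions(gene_a, gene_b)
distant_motif = conserved_positions(gene_a, gene_d)
```

With these toy inputs, the comparison against the close relative marks nearly the whole sequence as conserved, while the comparison against the distant relative isolates a short run of positions, which is the candidate functional motif.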
Fig 2.
Having sufficient variation in training data is important.
Fig 3.
EnsembleFam model architecture in 3 steps: feature collection, model training, and ensemble decision.
For each protein family, a model is built to identify its members using a combination of similarity and dissimilarity features.
Table 1.
Performance comparison of EnsembleFam with other methods on the twilight zone (identity ≤ 30%) of the COG and GPCR data sets.
In the COG data set, the results are divided into 6 subgroups based on the number of predictions made by EnsembleFam, where EnsembleFam and pHMM provide more than one prediction (indicated by predCount) and DeepFam provides exactly one. A result is counted as correct for EnsembleFam and pHMM if any one of their predictions in a subgroup is correct, and for DeepFam if its single prediction is correct. All results shown are averages over 5-fold cross-validation.
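The scoring rule described in the Table 1 caption can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the function names and the example family labels are hypothetical, and each method is assumed to return a list of predicted families per protein (a single-element list for a method that makes exactly one prediction).

```python
from typing import Sequence


def is_correct(predictions: Sequence[str], true_family: str) -> bool:
    """A protein counts as correct if any predicted family is the true one."""
    return true_family in predictions


def accuracy(all_predictions: Sequence[Sequence[str]],
             true_families: Sequence[str]) -> float:
    """Fraction of proteins with at least one correct prediction."""
    if len(all_predictions) != len(true_families):
        raise ValueError("one prediction list is required per protein")
    if not true_families:
        return 0.0
    hits = sum(is_correct(p, t) for p, t in zip(all_predictions, true_families))
    return hits / len(true_families)


def mean_over_folds(fold_accuracies: Sequence[float]) -> float:
    """Average a per-fold metric, as in k-fold cross-validation."""
    return sum(fold_accuracies) / len(fold_accuracies)


# Hypothetical labels: a multi-prediction method and a single-prediction method.
multi_preds = [["FAM1", "FAM2"], ["FAM3", "FAM5"]]
single_preds = [["FAM1"], ["FAM3"]]
truth = ["FAM2", "FAM3"]

multi_acc = accuracy(multi_preds, truth)
single_acc = accuracy(single_preds, truth)
```

Under this rule a method that returns several candidates is credited whenever the true family is among them, so its accuracy is best read alongside the number of predictions it made per protein.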
Fig 4.
Depictions of Data Doppelgänger Identification (DDI) (Panel A), the experimental setup for the Doppelgänger Inflation Test (DIT) (Panel B), the expected “leakage-like” effect during DIT (Panel C), and the procedure for identifying highly correlated features.