TXSelect: A multi-task learning model to identify secretory effectors

doi:10.1371/journal.pcbi.1013677

Fig 1.

ESM group ranking and performance of selected ESM pooling strategies in multi-task classification.

(A–C) Feature group ranking based on silhouette scores. Supervised Uniform Manifold Approximation and Projection (UMAP) with 5-fold cross-validation was used to evaluate clustering ability of different ESM pooling strategies across tasks. Bars indicate the mean validation silhouette score ± standard deviation for (A) TXSE (T1/2/3/4/6SE), (B) T1/2SE subset, and (C) T3/4/6SE subset. (D–E) Performance of selected ESM features. Based on ranking results and widely recognized pooling strategies, ESM mean, ESM max, ESM N-terminal mean, and ESM core region mean were selected for multi-task training. Radar plots show their classification performance across TXSE, T1/2SE, and T3/4/6SE tasks on the (D) validation set and (E) test set. Among these, ESM core region mean, ESM N-terminal mean, and ESM mean consistently achieved strong performance.
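
The silhouette score used for the ranking in panels A–C can be illustrated with a minimal NumPy sketch. It assumes the samples have already been projected to a low-dimensional space and uses Euclidean distance. The supervised UMAP projection and the 5-fold cross-validation loop are omitted, and the function name `mean_silhouette` is hypothetical rather than taken from the TXSelect code.

```python
import numpy as np

def mean_silhouette(X, labels):
    """Mean silhouette score of points X (n_samples, n_dims) under the given labels."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    if classes.size < 2:
        raise ValueError("silhouette needs at least two clusters")

    # Pairwise Euclidean distances between all points.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))

    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            continue  # convention: a singleton cluster scores 0
        # a: mean distance to the other members of the point's own cluster
        # (dist[i, i] is 0, so dividing by n_same - 1 excludes the point itself).
        a = dist[i, same].sum() / (n_same - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(dist[i, labels == c].mean() for c in classes if c != labels[i])
        denom = max(a, b)
        if denom > 0:
            scores[i] = (b - a) / denom
    return scores.mean()

# Toy example: two well-separated groups in 2D.
points = [[0, 0], [0, 1], [10, 0], [10, 1]]
good = mean_silhouette(points, [0, 0, 1, 1])   # labels match the geometry
bad = mean_silhouette(points, [0, 1, 0, 1])    # labels cut across the groups
```

Scores close to 1 indicate compact, well-separated classes, while scores near or below 0 indicate overlapping classes, which is why a higher bar in panels A–C corresponds to a better-ranked feature group.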

Fig 2.

Classical sequence descriptor group ranking and performance of selected descriptors in multi-task classification.

(A–C) Feature group ranking based on silhouette scores. Supervised UMAP with 5-fold cross-validation was applied to evaluate the clustering ability of various handcrafted sequence descriptors. Bars indicate the mean validation silhouette score ± standard deviation for (A) TXSE (T1/2/3/4/6SE), (B) T1/2SE subset, and (C) T3/4/6SE subset. (D–E) Performance of selected descriptors. Radar plots summarize the classification performance of representative descriptors (DR, SC-PseAAC, PC-PseAAC, QSOrder, AAC, and APAAC) across tasks. Results are shown for the (D) validation set and (E) test set. Among these, DR, SC-PseAAC, and QSOrder consistently achieved strong performance across tasks.

Fig 3.

Performance comparison of feature combinations and detailed performance of the optimal TXSelect framework.

(A–C) Model performance based on combinations of selected ESM pooling and classical sequence descriptors. Fusion experiments were conducted by combining ESM core region mean, ESM mean, and ESM N-terminal mean with DR, SC-PseAAC, and QSOrder. Bars indicate classification metrics (AUC, F1, Precision, Recall). Among these, ESM N-terminal mean + DR + SC-PseAAC achieved the highest validation F1 score (0.867) and also performed best on the test set (F1 = 0.8645). Adding QSOrder resulted in a comparable validation F1 score (0.863); however, its test performance decreased (F1 = 0.8507), confirming ESM N-terminal mean + DR + SC-PseAAC as the optimal combination. (D–E) Classification performance of the optimal TXSelect. Heatmaps show per-class performance of the optimal feature combination (ESM N-terminal mean + DR + SC-PseAAC) on the (D) validation dataset and (E) test dataset. Metrics (AUC, F1, Precision, Recall) are reported for each effector type (T1SE, T2SE, T3SE, T4SE, T6SE).

Table 1.

Data sources and sample sizes for each effector category.

Fig 4.

Overview of the TXSelect framework for multi-task identification of secretory effectors.

(A) Dataset construction. Secretory effectors (T1SE, T2SE, T3SE, T4SE, and T6SE) were collected from the literature, and redundant sequences were removed using CD-HIT. For each task, the target label was set to 1, whereas the labels of the other tasks were set to 0 (e.g., in the T1SE task, the T1SE label is 1, while the labels for T2/3/4/6SE are 0). (B) Model architecture. Multiple feature descriptors, including the evolutionary scale modelling (ESM) N-terminal mean embedding, distance-based residue (DR), and series correlation pseudo amino acid composition (SC-PseAAC), were integrated to construct sequence representations. These representations are processed through a shared backbone network composed of stacked linear, ReLU, and dropout layers, followed by task-specific heads that predict the different effector types. (C) Model performance. Training loss and validation F1-scores of the T1SE, T2SE, T3SE, T4SE, and T6SE tasks across 500 epochs. The curves demonstrate stable convergence of the shared multi-task framework and balanced performance across effector classes.
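
The labelling scheme in panel A and the shared-backbone design in panel B can be sketched as follows. This is a forward-pass illustration in NumPy under stated assumptions, not the TXSelect implementation: the hidden layer sizes, the dropout rate, the sigmoid output of each head, the weight initialization, and the names `one_vs_rest_labels` and `MultiTaskMLP` are all hypothetical, and the input is assumed to be a single feature vector per sequence (for example, the descriptors concatenated).

```python
import numpy as np

TASKS = ["T1SE", "T2SE", "T3SE", "T4SE", "T6SE"]

def one_vs_rest_labels(effector_types):
    """Build an (n_samples, n_tasks) 0/1 matrix: 1 for the sample's own type, 0 elsewhere."""
    index = {task: j for j, task in enumerate(TASKS)}
    y = np.zeros((len(effector_types), len(TASKS)), dtype=int)
    for i, effector in enumerate(effector_types):
        y[i, index[effector]] = 1
    return y

class MultiTaskMLP:
    """Shared linear/ReLU/dropout backbone followed by one binary head per task."""

    def __init__(self, in_dim, hidden_dims=(64, 32), dropout=0.3, seed=0):
        # hidden_dims and dropout are assumed values, chosen only for illustration.
        self.rng = np.random.default_rng(seed)
        self.dropout = dropout
        dims = (in_dim, *hidden_dims)
        self.backbone = [
            (self.rng.normal(0.0, np.sqrt(2.0 / d_in), (d_in, d_out)), np.zeros(d_out))
            for d_in, d_out in zip(dims[:-1], dims[1:])
        ]
        self.heads = {
            task: (self.rng.normal(0.0, 0.01, (dims[-1], 1)), np.zeros(1))
            for task in TASKS
        }

    def forward(self, x, train=False):
        h = np.asarray(x, dtype=float)
        for W, b in self.backbone:
            h = np.maximum(h @ W + b, 0.0)  # linear layer followed by ReLU
            if train and self.dropout > 0:
                # Inverted dropout: zero random units, rescale the survivors.
                keep = self.rng.random(h.shape) >= self.dropout
                h = h * keep / (1.0 - self.dropout)
        # Each task-specific head maps the shared representation to one probability.
        return {
            task: (1.0 / (1.0 + np.exp(-(h @ W + b))))[:, 0]
            for task, (W, b) in self.heads.items()
        }

labels = one_vs_rest_labels(["T1SE", "T6SE"])
model = MultiTaskMLP(in_dim=8)
probs = model.forward(np.ones((3, 8)))
```

Because every head reads the same backbone output, the representation is shared across effector types, and only the final projection is specific to each task.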

Fig 5.

Feature processing strategies for ESM representations.

(A) Basic pooling operations, including mean, max, min, and standard deviation pooling, applied to residue-level embeddings. (B) Region-specific feature extraction. Protein sequences were divided into N-terminal, core region, and C-terminal segments according to sequence length. The lengths of terminal regions were dynamically determined with lower/upper constraints on amino acid counts. The remaining residues constituted the core region. Minimum length constraints were further applied to ensure balanced representation. Detailed rules for length assignment and minimum thresholds are provided in the Methods.
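
The two strategies in panels A and B can be sketched in NumPy as below. The exact rules for segment lengths are given in the Methods and are not reproduced here, so the terminal fraction (`frac`), the lower and upper bounds (`lo`, `hi`), and the minimum core length (`min_core`) are placeholder assumptions, as are the function names.

```python
import numpy as np

def basic_pooling(emb):
    """Pool residue-level embeddings (seq_len, dim) into fixed-length vectors."""
    emb = np.asarray(emb, dtype=float)
    return {
        "mean": emb.mean(axis=0),
        "max": emb.max(axis=0),
        "min": emb.min(axis=0),
        "std": emb.std(axis=0),
    }

def split_regions(length, frac=0.2, lo=10, hi=50, min_core=1):
    """Return slices for the N-terminal, core, and C-terminal segments.

    frac, lo, hi, and min_core are assumed values, not those used in the paper.
    """
    if length < min_core + 2:
        raise ValueError("sequence too short to form three non-empty regions")
    # Terminal length scales with sequence length, clamped to [lo, hi].
    term = max(lo, min(hi, int(round(length * frac))))
    # Shrink the terminals if they would leave fewer than min_core core residues.
    term = min(term, (length - min_core) // 2)
    return slice(0, term), slice(term, length - term), slice(length - term, length)

def region_mean_pooling(emb, **kwargs):
    """Mean-pool each region of a residue-level embedding separately."""
    emb = np.asarray(emb, dtype=float)
    n_term, core, c_term = split_regions(emb.shape[0], **kwargs)
    return {
        "n_terminal_mean": emb[n_term].mean(axis=0),
        "core_mean": emb[core].mean(axis=0),
        "c_terminal_mean": emb[c_term].mean(axis=0),
    }

# Toy embedding: 100 residues, 2 dimensions.
toy = np.arange(200).reshape(100, 2)
pooled = basic_pooling(toy)
regions = region_mean_pooling(toy)
```

With these placeholder settings a 100-residue sequence is split into 20 N-terminal, 60 core, and 20 C-terminal residues, and each region yields a vector with the same dimensionality as the residue embeddings.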
