Leveraging conformal prediction to annotate enzyme function space with limited false positives
Fig 2
PenLight2, the base ML model of CPEC, outperforms the state-of-the-art methods for EC number prediction.
(A) We evaluated DeepEC [9], ProteInfer [11], DeepFRI [10], and PenLight2 on 4th-level EC number prediction, using F1 score, normalized discounted cumulative gain (nDCG), and coverage as metrics. Coverage is defined as the proportion of test proteins for which a method makes at least one EC number prediction. (B) We further evaluated all methods on 4th-level EC number prediction for more challenging test proteins with [0, 30%) sequence identity to the training proteins, and plotted the micro-averaged precision-recall curves. On each curve, the labeled point marks the maximum F1 score (Fmax).
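
The coverage and Fmax definitions in this caption can be illustrated with a minimal sketch. It assumes each method outputs a confidence score per (protein, EC number) pair and that the precision-recall curve is obtained by sweeping a decision threshold over those scores. The function names, the dictionary layout, and the threshold sweep are illustrative assumptions, not the evaluation code used for the figure.

```python
from typing import Dict, Iterable, List, Optional, Set, Tuple

Scores = Dict[str, Dict[str, float]]   # protein ID -> {EC number: confidence}
Labels = Dict[str, Set[str]]           # protein ID -> set of true EC numbers


def coverage(predictions: Dict[str, Set[str]], test_ids: List[str]) -> float:
    """Proportion of test proteins with at least one predicted EC number."""
    covered = sum(1 for pid in test_ids if predictions.get(pid))
    return covered / len(test_ids)


def micro_pr_curve(
    scores: Scores, labels: Labels, thresholds: Iterable[float]
) -> List[Tuple[float, float, float]]:
    """Return (threshold, precision, recall) with counts pooled over all proteins."""
    curve = []
    for t in thresholds:
        tp = fp = fn = 0
        for pid, true_ecs in labels.items():
            # Keep only EC numbers whose confidence reaches the threshold.
            predicted = {ec for ec, s in scores.get(pid, {}).items() if s >= t}
            tp += len(predicted & true_ecs)
            fp += len(predicted - true_ecs)
            fn += len(true_ecs - predicted)
        if tp + fp == 0:
            continue  # precision is undefined when nothing is predicted
        precision = tp / (tp + fp)
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        curve.append((t, precision, recall))
    return curve


def f_max(curve: List[Tuple[float, float, float]]) -> Tuple[Optional[float], float]:
    """Return (threshold, F1) for the point on the curve with the highest F1."""
    best_t, best_f1 = None, 0.0
    for t, p, r in curve:
        f1 = 2 * p * r / (p + r) if (p + r) else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1
```

Micro-averaging here means that true positives, false positives, and false negatives are summed across all proteins before precision and recall are computed, so proteins with many EC labels contribute more than proteins with one.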