Large-scale statistical dissection of sequence-derived biochemical features distinguishing soluble and insoluble proteins

Abstract

Protein solubility critically influences recombinant expression efficiency and downstream biotechnological applications. While deep learning models have improved predictive accuracy, the intrinsic magnitude, redundancy, and interpretability of classical sequence-derived determinants remain insufficiently characterized. We performed a large-scale univariate analysis on a curated dataset of 78,031 proteins (46,450 soluble; 31,581 insoluble). Thirty-six biochemical descriptors were evaluated using Mann-Whitney U tests with Benjamini-Hochberg false discovery rate correction. Effect sizes were quantified using Cliff’s δ, and discriminative performance was assessed by ROC AUC. Although 34 features remained statistically significant after correction, most exhibited small effect sizes and substantial overlap between classes. The strongest effects were associated with size-related features (sequence length and molecular weight), whereas charge-related descriptors, particularly the proportion of negatively charged residues (AUC = 0.575), showed consistent but modest shifts. Spearman correlation analysis revealed near-complete redundancy among major size-related variables (ρ up to 0.998). Applying a redundancy threshold on |ρ|, we derived a parsimonious composite integrating sequence length and negative charge proportion, achieving AUC = 0.624 (MCC = 0.1746). These findings suggest that sequence-level solubility information is consistent with a low-dimensional organization at the level of global sequence-derived descriptors and is governed by coordinated weak effects, establishing a transparent statistical baseline for large-scale solubility characterization.

Introduction

Protein solubility is a fundamental physicochemical property that governs protein folding, intracellular stability, recombinant expression efficiency, and downstream biotechnological applications [1–3]. In heterologous expression systems, poor solubility frequently results in aggregation, inclusion body formation, reduced functional yield, and substantial experimental cost [4,5]. From a molecular perspective, solubility reflects the balance between intermolecular attractive forces that promote aggregation and intramolecular stabilizing interactions that favor properly folded conformations. Understanding how this balance is encoded in the primary amino acid sequence therefore remains a central problem in computational and structural biology.

A substantial body of experimental and computational work has shown that protein solubility is influenced by multiple sequence-derived properties. Chain length affects folding complexity, translational burden, and the likelihood of partially folded intermediates. Amino acid composition shapes the overall physicochemical landscape, including hydrophobicity, polarity, and charge distribution. Electrostatic interactions contribute to colloidal stability, while hydrophobic clustering and aggregation-prone segments facilitate intermolecular association [6,7]. Importantly, these determinants are not independent: many classical descriptors are structurally or mathematically coupled, suggesting that observed effects may reflect shared latent physicochemical axes rather than distinct mechanisms.

Early computational approaches captured these determinants using interpretable sequence-derived descriptors such as amino acid composition, secondary structure propensity scales, and hydropathy indices [8,9]. These features remain attractive due to their transparency and computational efficiency. However, their relative contribution, redundancy structure, and practical magnitude remain insufficiently characterized at large scale. In high-powered datasets, extremely small p-values can arise from negligible distributional shifts, making it difficult to distinguish statistical significance from biological relevance [10].

Recent advances in machine learning, particularly deep neural networks and protein language models, have achieved strong predictive performance in solubility benchmarks [11–14]. Despite their accuracy, these approaches often operate as high-capacity black boxes, obscuring the marginal contribution and interaction structure of individual physicochemical features. Moreover, such models typically require substantial computational resources, limiting their applicability in experimental settings where rapid and accessible prediction is needed.

A key unresolved question is whether sequence-level determinants of solubility act primarily through additive contributions of independent physicochemical factors, or whether they depend on higher-order sequence patterns that cannot be captured by classical descriptors. While higher-order representations may improve predictive performance, they introduce additional modeling layers, increase computational complexity, and may accumulate prediction error when inferred from primary sequence alone.

In parallel, many widely accepted determinants of solubility have been identified in relatively small or model-specific datasets, and their importance is often interpreted qualitatively. As a result, their true effect size, redundancy, and practical discriminative contribution remain unclear at scale. This motivates a systematic and statistically rigorous re-evaluation of classical sequence-derived features in a large and heterogeneous dataset.

In the present study, we perform a large-scale statistical dissection of 36 sequence-derived biochemical descriptors using a curated benchmark of soluble and insoluble proteins. Rather than developing a high-capacity predictive model, our objective is to quantify effect magnitude, redundancy structure, and practical discrimination under strict statistical control. By combining non-parametric testing, FDR correction, effect size estimation, ROC analysis, and correlation-based redundancy analysis, we examine whether protein solubility is driven by dominant single determinants or by coordinated weak signals across a low-dimensional physicochemical space.

This study provides a statistically principled and interpretable baseline framework for protein solubility derived from primary sequence features. By integrating large-scale effect size characterization with a reproducible analytical pipeline, it quantifies the practical magnitude and limitations of classical biochemical descriptors. Rather than proposing a higher-capacity predictor, the framework helps characterize the level of discrimination achievable within the restricted space of global sequence-derived physicochemical features, providing a transparent reference for assessing the added value of more complex predictive models.

Importantly, this study is not intended to develop or benchmark a predictive model. Instead, the objective is to quantify the intrinsic statistical structure of sequence-derived physicochemical features under minimal modeling assumptions. Any discriminative performance reported in this work should therefore be interpreted as a descriptive property of the feature space rather than as an estimate of predictive generalization.

Materials and methods

Dataset

The dataset was obtained from the curated solubility benchmark introduced by Zhang et al. (2024) [14], comprising more than 78,000 protein sequences annotated as soluble or insoluble. The original release provides independent training, validation, and test splits constructed for predictive benchmarking, distributed as three FASTA files containing approximately 74,000, 2,000, and 2,000 protein sequences, respectively.

For the present analysis, all splits were merged into a single dataset because the objective was large-scale distributional characterization rather than predictive generalization. The final dataset comprised 78,031 protein sequences (46,450 soluble and 31,581 insoluble). All sequences and labels are publicly available via Zenodo [14]. Class imbalance in the dataset is unlikely to materially affect rank-based statistics such as the Mann–Whitney U test or ROC AUC.

Protein sequences are represented as contiguous strings of single-letter amino acid codes (e.g., “ACDEFGHIK...”) without gaps or alignment. No additional filtering or preprocessing was applied, as the dataset had already been curated in the original benchmark study. In particular, no filtering based on sequence length, redundancy, or sequence identity was performed, and all sequences provided in the original dataset were retained.

Two sequences containing rare or ambiguous amino acid symbols (“X”) were preserved. These residues were excluded from amino acid frequency calculations but did not affect other sequence-derived descriptors. All features were computed directly from the raw FASTA sequences.

The FASTA files used in this study correspond exactly to those provided in the original benchmark without modification.

Feature extraction

All protein sequences were retained in their original form. Feature extraction was performed over the 20 canonical amino acids; rare or ambiguous residues (e.g., “X”) were left in the raw sequences but excluded from frequency-based calculations to avoid distorting compositional features. Such residues were rare and did not affect global descriptors beyond the amino acid frequency measures.

A total of 36 sequence-derived biochemical features were computed for each protein, including:

  • Amino acid frequency features (20 variables). For each of the 20 canonical amino acids a, the frequency is defined as:
    $f_a = n_a / L$
    where $n_a$ is the number of occurrences of residue a in the sequence and $L$ is the sequence length (excluding ambiguous residues such as “X”). These variables quantify compositional bias at the residue level.
  • Functional residue group ratios. Residues were grouped according to physicochemical properties, including: positively charged ({K,R,H}), negatively charged ({D,E}), polar ({S,T,N,Q}), hydrophobic ({A,V,I,L,M,F,W,Y}), small ({A,G,S,T}), and sulfur-containing ({C,M}). For each group G, the ratio is defined as:
    $r_G = \frac{1}{L} \sum_{a \in G} n_a$
    These descriptors capture coarse-grained physicochemical composition.
  • Global physicochemical descriptors. Molecular weight was computed as the sum of residue masses. The isoelectric point (pI) and net charge at pH 7 were estimated using standard residue pKa values and the Henderson–Hasselbalch equation. Mean hydropathy was computed as the average Kyte–Doolittle hydropathy index across the sequence [9]. These variables characterize global physicochemical properties of the sequence.
  • Secondary structure propensity proxies. Helix, sheet, and turn propensities were approximated by averaging residue-level Chou–Fasman parameters across the sequence [8]. For a given structural class s, the feature is:
    $\bar{p}_s = \frac{1}{L} \sum_{i=1}^{L} p_s(a_i)$
    where $p_s(a_i)$ is the Chou–Fasman propensity of residue $a_i$ for structure s.
  • Intrinsic disorder proxy. Intrinsic disorder tendency was approximated using residue-level disorder propensity scales derived from established literature [15,16]. The disorder ratio was computed as the fraction of residues exceeding a predefined disorder propensity threshold:
    $\mathrm{disorder\_ratio} = \frac{1}{L} \sum_{i=1}^{L} \mathbb{1}\left[ d(a_i) > \tau \right]$
    where $d(a_i)$ denotes the disorder score and $\tau$ is a fixed threshold. The exact propensity scale and threshold used are provided in Table S1 in S1 File and in the accompanying code repository.
  • Aggregation-related proxy. Aggregation propensity was approximated by the longest contiguous hydrophobic segment in the sequence, that is, the length $\ell_{\max}$ of the longest run of consecutive residues in the hydrophobic set.
    This proxy captures local clustering of hydrophobic residues associated with aggregation risk.

These descriptors form a structured feature matrix for subsequent statistical evaluation.
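As an illustration of the compositional descriptors listed above, the following minimal sketch computes the 20 residue frequencies, the six group ratios, and the longest hydrophobic run for one sequence. It is not the authors' code: the function names are hypothetical, and of the feature names only `freq_*`, `neg_ratio`, and `hydrophobic_ratio` appear in the text; the remaining group names are assumed for illustration.

```python
from collections import Counter

CANONICAL = "ACDEFGHIKLMNPQRSTVWY"

# Residue groups as listed in the text; key names other than
# neg_ratio and hydrophobic_ratio are assumed for illustration.
GROUPS = {
    "pos_ratio": set("KRH"),
    "neg_ratio": set("DE"),
    "polar_ratio": set("STNQ"),
    "hydrophobic_ratio": set("AVILMFWY"),
    "small_ratio": set("AGST"),
    "sulfur_ratio": set("CM"),
}


def composition_features(seq):
    """Residue frequencies and group ratios; ambiguous symbols are ignored."""
    counts = Counter(r for r in seq.upper() if r in CANONICAL)
    n = sum(counts.values())  # L, excluding ambiguous residues such as "X"
    if n == 0:
        raise ValueError("sequence contains no canonical residues")
    feats = {f"freq_{a}": counts[a] / n for a in CANONICAL}
    for name, members in GROUPS.items():
        feats[name] = sum(counts[a] for a in members) / n
    return feats


def longest_hydrophobic_run(seq):
    """Length of the longest run of consecutive hydrophobic residues."""
    best = run = 0
    for r in seq.upper():
        run = run + 1 if r in GROUPS["hydrophobic_ratio"] else 0
        best = max(best, run)
    return best
```

Both functions make a single pass over the sequence, so their cost grows linearly with sequence length.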

Analytical workflow

The analytical workflow was designed to quantify practical effect magnitude, control false discoveries, eliminate redundancy, and subsequently construct an interpretable composite index.

For each of the 36 sequence-derived descriptors, distributional differences between soluble and insoluble proteins were first evaluated using the Mann-Whitney U test [17]. Resulting p-values were adjusted via the Benjamini-Hochberg procedure to control the false discovery rate (FDR) [18]. Effect sizes (Cliff’s δ), which measure stochastic dominance without distributional assumptions [19], were computed for all features. For features meeting FDR significance, effect magnitudes were interpreted alongside confidence intervals. Median shifts were expressed using the Hodges–Lehmann estimator with 95% confidence intervals [20]. Stability of δ was assessed via percentile bootstrap resampling (B = 2,000) [21]. Univariate discriminative capacity was evaluated using ROC AUC [22], with optimal thresholds determined by Youden’s J statistic [23].
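The rank-based quantities in this workflow can be sketched in a few lines. The example below is an illustrative sketch rather than the study's pipeline: it derives the Mann-Whitney U statistic from midranks, converts it to AUC and Cliff's δ, and applies the Benjamini-Hochberg step-up adjustment to a list of p-values. The p-values themselves would typically come from a library routine such as `scipy.stats.mannwhitneyu`; all function names here are hypothetical.

```python
def midranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def auc_and_cliffs_delta(soluble, insoluble):
    """AUC = U1 / (n1 * n2) and Cliff's delta = 2 * AUC - 1."""
    n1, n2 = len(soluble), len(insoluble)
    ranks = midranks(list(soluble) + list(insoluble))
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2  # Mann-Whitney U for group 1
    auc = u1 / (n1 * n2)
    return auc, 2 * auc - 1


def benjamini_hochberg(pvalues):
    """BH-adjusted q-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=pvalues.__getitem__)
    q = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        q[idx] = running_min
    return q
```

With this sign convention, a positive δ means the feature tends to be larger in soluble proteins, and an AUC below 0.5 corresponds to a negative δ.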

Features were subsequently ranked according to absolute δ (|δ|) to identify descriptors exhibiting the strongest practical separation. Because high-effect features may capture overlapping physicochemical dimensions, pairwise redundancy was assessed prior to composite modeling using Spearman’s rank correlation coefficient [24,25]. Spearman correlation was selected due to its robustness to non-normality and its ability to capture monotonic relationships.

Feature pairs exhibiting strong monotonic association (|ρ| above the predefined redundancy threshold) were considered redundant. Within correlated clusters, only a single representative descriptor was retained to avoid double-counting latent physicochemical axes. This redundancy-aware filtering ensured that subsequent integration combined orthogonal dimensions rather than correlated proxies of the same underlying property.
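One way to implement this kind of redundancy filter is a greedy pass over the features in priority order, sketched below. This is an assumption-laden illustration, not the authors' procedure: the greedy strategy, the function names, and the treatment of a correlation exactly at the threshold as redundant are all choices made here, and `threshold` is left as a required argument because its value is not restated in this text.

```python
from math import sqrt


def midranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho, computed as the Pearson correlation of midranks."""
    rx, ry = midranks(x), midranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / sqrt(vx * vy)


def filter_redundant(columns, priority, threshold):
    """Keep a feature only if |rho| with every kept feature is below threshold.

    `priority` lists feature names in the order they should be considered,
    for example by decreasing |delta| (an assumed ordering).
    """
    kept = []
    for name in priority:
        if all(abs(spearman(columns[name], columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept
```

For example, with toy columns in which `molecular_weight` increases monotonically with `length`, calling `filter_redundant(columns, ["length", "molecular_weight", "neg_ratio"], threshold)` drops `molecular_weight` for any threshold at or below 1.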

The final composite-δ index was constructed as a linear combination of retained features, robustly scaled using the median and interquartile range (IQR) [26,27], and weighted by their corresponding δ values. Proteins were classified according to the sign of the composite score.

The composite-δ index is not a trained predictive model. Its construction is fully data-dependent and based on empirical effect size estimates derived from the same dataset. Because no parameter fitting or hyperparameter optimization is performed, the resulting index summarizes distributional separability under a statistically controlled and redundancy-filtered feature set.

Consequently, any performance metrics (e.g., AUC, MCC) reported for this index do not represent out-of-sample predictive performance, but rather quantify descriptive separability within the observed feature space.
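The composite construction described above (robust scaling, δ-weighted sum, sign-based classification) can be sketched as follows. This is a minimal illustration under stated assumptions: the quartile convention (`method="inclusive"`), the assignment of a score of exactly zero to the insoluble class, and all function names are choices made here, and the weights in any usage are placeholders rather than the study's estimates.

```python
from math import sqrt
from statistics import median, quantiles


def robust_scale(values):
    """Center by the median and scale by the interquartile range."""
    med = median(values)
    q1, _, q3 = quantiles(values, n=4, method="inclusive")  # assumed convention
    iqr = q3 - q1
    if iqr == 0:
        raise ValueError("IQR is zero; feature cannot be robustly scaled")
    return [(v - med) / iqr for v in values]


def composite_scores(columns, deltas):
    """Sum of delta-weighted, robustly scaled features for each protein."""
    scaled = {name: robust_scale(columns[name]) for name in deltas}
    n = len(next(iter(scaled.values())))
    return [sum(deltas[name] * scaled[name][i] for name in deltas) for i in range(n)]


def classify(scores):
    """1 = soluble when the score is positive, else 0 (ties go to 0 here)."""
    return [1 if s > 0 else 0 for s in scores]


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

Because the weights and scaling constants are estimated on the same data that are then scored, any AUC or MCC computed this way is descriptive, in line with the caveat above.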

Results

Dataset overview and multiple-testing adjusted significance

An overview of the analytical workflow is provided in Fig 1. Following preprocessing and feature extraction, the final analysis dataset comprised N = 78,031 protein sequences, including 46,450 soluble and 31,581 insoluble entries. For each sequence, 36 sequence-derived biochemical descriptors were computed, spanning composition, charge-related variables, hydrophobicity measures, structural propensities, and aggregation proxies.

Group-wise distributional differences were evaluated using the Mann-Whitney U test [17], with Benjamini-Hochberg correction to control the false discovery rate (FDR) [18]. After adjustment, 34 of 36 features remained statistically significant (q < 0.05), indicating that most classical sequence-level descriptors exhibit detectable distributional shifts between soluble and insoluble proteins.

However, given the large sample size, statistical significance alone is insufficient to infer biological relevance. In high-powered datasets, even negligible shifts can produce extremely small p-values [10]. We therefore focus on two complementary quantities: (i) effect magnitude quantified by Cliff’s δ [19], which measures stochastic dominance independent of distributional assumptions, and (ii) univariate discriminative capacity assessed via ROC AUC [22], which provides a descriptive measure of class separability.

To express group differences on the original measurement scale, we report the Hodges–Lehmann (HL) location shift with 95% confidence intervals [20]. Uncertainty of the δ estimates was evaluated using percentile bootstrap intervals (B = 2,000) [21]. For completeness, Youden’s J statistic and its corresponding optimal threshold T are also reported [23]; in the presence of substantial distributional overlap, J remains close to zero even when statistical significance is achieved.
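The three summary quantities named above can be illustrated with a brute-force sketch suitable for small samples. It is not the study's implementation: the pairwise computations are quadratic in sample size and would need a more efficient algorithm at N = 78,031, the percentile indexing is one of several common conventions, and all function names are hypothetical.

```python
import random
from statistics import median


def hodges_lehmann(x, y):
    """Two-sample HL shift: median of all pairwise differences x_i - y_j."""
    return median(a - b for a in x for b in y)


def cliffs_delta(x, y):
    """P(x > y) - P(x < y), estimated over all pairs."""
    gt = sum(a > b for a in x for b in y)
    lt = sum(a < b for a in x for b in y)
    return (gt - lt) / (len(x) * len(y))


def bootstrap_ci(x, y, stat, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval, resampling each group independently."""
    rng = random.Random(seed)
    reps = sorted(
        stat(rng.choices(x, k=len(x)), rng.choices(y, k=len(y)))
        for _ in range(n_boot)
    )
    lo = reps[int((alpha / 2) * (n_boot - 1))]
    hi = reps[int((1 - alpha / 2) * (n_boot - 1))]
    return lo, hi


def youden_j(pos, neg):
    """Maximum of TPR - FPR over thresholds t, predicting positive if v >= t."""
    best_j, best_t = 0.0, None
    for t in sorted(set(pos) | set(neg)):
        tpr = sum(v >= t for v in pos) / len(pos)
        fpr = sum(v >= t for v in neg) / len(neg)
        if tpr - fpr > best_j:
            best_j, best_t = tpr - fpr, t
    return best_j, best_t
```

For a feature that is higher in the insoluble class, `youden_j` as written would return a J of zero; applying it to the negated feature values handles that reversed direction.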

Collectively, this framework distinguishes statistical detectability from practical magnitude, enabling a biologically grounded interpretation of whether sequence-derived physicochemical features exert strong or merely subtle influence on protein solubility.

Global physicochemical features

Table 1 summarizes 16 global and grouped-composition descriptors. The largest absolute effects were observed for length and molecular_weight, both exhibiting negative Cliff’s δ, indicating that insoluble proteins are, on average, longer and heavier. On the original measurement scale, the Hodges–Lehmann estimates correspond to a median shift of −70 amino acids (95% CI [−77, −62]) for length and a corresponding shift (in Da) for molecular_weight.

Table 1. Results for 16 global sequence-derived features.

https://doi.org/10.1371/journal.pone.0344883.t001

From a mechanistic standpoint, increased chain length elevates folding complexity, prolongs exposure of partially folded intermediates, and increases the probability of intermolecular encounters that nucleate aggregation [1,4]. Larger proteins also present greater solvent-accessible surface area and a higher combinatorial risk of hydrophobic clustering. Despite these biologically plausible trends, univariate discrimination remained limited (AUC below 0.5 for both features; 0.393 for molecular_weight). Values below 0.5 reflect reversed directionality rather than meaningful separability; indeed, inverting the decision rule yields the complementary value 1 − AUC, yet substantial distributional overlap persists, as confirmed by small Youden’s J statistics.

Charge-related descriptors exhibited coherent and biologically interpretable shifts. The proportion of negatively charged residues (neg_ratio) was enriched in soluble proteins (Cliff’s δ with 95% CI [0.134, 0.173]; HL = 0.00810 with 95% CI [0.00674, 0.00937]). Conversely, isoelectric_point and net_charge_pH7 showed negative δ, indicating higher pI and reduced net negativity among insoluble proteins. These findings align with electrostatic stabilization theory: increased net negative charge enhances intermolecular repulsion, reduces colloidal attraction, and suppresses aggregation propensity [4–6]. The moderate AUC values (≈0.57) indicate measurable but limited standalone discriminatory power.
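As a reading aid, Cliff's δ and ROC AUC are linked by a simple identity, assuming the AUC is computed as the Mann–Whitney probability that a soluble protein has the larger feature value, with ties counted as one half. Whether the reported AUC values were computed exactly this way is an assumption; the symbols $U_1$, $n_1$, and $n_2$ (the U statistic of the soluble group and the two group sizes) are introduced here for illustration.

```latex
\delta = 2\,\mathrm{AUC} - 1,
\qquad
\mathrm{AUC} = \frac{U_1}{n_1 n_2}

% Worked arithmetic from values quoted in the text:
% neg_ratio:  AUC = 0.575  =>  \delta = 2(0.575) - 1 = 0.150
% AUC = 0.393              =>  \delta = 2(0.393) - 1 = -0.214
% inverted decision rule:      1 - 0.393 = 0.607
```

Under that assumption, the value 0.150 implied for neg_ratio lies inside the reported interval [0.134, 0.173], and an AUC below 0.5 simply corresponds to a negative δ of the same magnitude as its inverted counterpart.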

Hydrophobicity-related descriptors exhibited smaller effects. Both mean_hydropathy and hydrophobic_ratio showed small shifts in the direction expected from the role of exposed hydrophobic patches in promoting solvent exclusion and intermolecular association [7,9]. However, effect sizes were modest and AUC values remained close to 0.5, underscoring extensive overlap between soluble and insoluble distributions. These observations suggest that global hydrophobic averages are insufficient to capture context-dependent aggregation mechanisms.

Disorder and secondary structure proxies

Intrinsic disorder tendency, approximated by disorder_ratio, exhibited a small positive effect, suggesting slight enrichment of disorder-associated residues among soluble proteins. Disordered or flexible regions may increase solvent accessibility and modulate intermolecular interaction landscapes, although their impact appears context-dependent and modest in magnitude [15,16].

Secondary structure propensity proxies (helix_prop_mean, sheet_prop_mean, and turn_prop_mean) demonstrated very small effect sizes. This indicates that simple global aggregation of Chou-Fasman propensities across full-length sequences does not yield strong univariate separation in a large and heterogeneous dataset [8].

Amino acid composition

Table 2 reports results for the 20 amino acid frequency variables. Many compositional descriptors achieved extremely small q-values, reflecting high statistical power; however, most displayed small |δ| and AUC values only marginally deviating from 0.5. Notably, freq_E and freq_D were elevated in soluble proteins, consistent with the neg_ratio signal, whereas freq_R and freq_C were relatively enriched in insoluble proteins. These trends are compatible with electrostatic contributions and residue-specific side chain chemistry influencing folding stability and aggregation kinetics [4,6,7]. Only two features, freq_M and freq_T, did not remain significant after FDR correction (q ≥ 0.05), with effect sizes and AUC values indicating negligible univariate discriminatory information within this dataset.

Table 2. Results for amino acid frequency features.

https://doi.org/10.1371/journal.pone.0344883.t002

Emergent weak-signal regime and biological interpretation

Taken together, these findings indicate that protein solubility is not governed by a dominant sequence-level determinant but instead reflects coordinated contributions of multiple weak physicochemical signals [2,5]. Size-related variables capture a structural burden axis, while charge-related descriptors reflect electrostatic stabilization; hydrophobic and compositional features contribute smaller contextual effects. The generally small Youden’s J values across individual descriptors confirm that no single feature provides practical threshold-based separation. Rather, the data support a weak-signal, potentially low-dimensional structure in which overlapping physicochemical axes jointly influence solubility. This observation motivates subsequent redundancy-aware integration of selected descriptors into a parsimonious composite index, as detailed below.

Fig 2 illustrates the ROC curves corresponding to the features with the largest absolute effect sizes. The observed asymmetry between negatively and positively charged residues warrants further interpretation. Enrichment of negatively charged residues in soluble proteins is consistent with established biophysical principles, as increased net negative charge enhances electrostatic repulsion and reduces intermolecular association, thereby promoting solubility [28,29].

Fig 2. Univariate ROC curves for the four features with the largest absolute Cliff’s δ.

https://doi.org/10.1371/journal.pone.0344883.g002

In contrast, positively charged residues may participate in non-specific interactions with nucleic acids, membranes, or other cellular components, which can facilitate aggregation under certain conditions.

It is important to note that the dataset is derived from recombinant expression in Escherichia coli, and therefore the observed patterns may partially reflect system-specific constraints. The intracellular environment and expression constraints of this system may introduce biases that favor negatively charged, more soluble constructs. Therefore, the observed effect likely reflects a combination of general physicochemical principles and system-specific expression biases. Generalization to other expression systems or native proteomes warrants further investigation.

Redundancy structure and composite-δ refinement

Spearman redundancy analysis.

Prior to finalizing the composite formulation, pairwise redundancy among selected high-effect features was evaluated using Spearman’s rank correlation coefficient [24]. The resulting correlation matrix is shown in Fig 3.

Fig 3. Spearman correlation matrix among top ranked global features.

https://doi.org/10.1371/journal.pone.0344883.g003

Strong monotonic associations were observed among size-related descriptors. In particular, sequence length and molecular weight exhibited near-complete collinearity, reflecting their deterministic structural coupling. Aggregation-related metrics also displayed strong monotonic association with size-related variables, indicating that these descriptors largely capture a shared latent structural axis.

In contrast, the proportion of negatively charged residues (neg_ratio) showed minimal correlation with size-related variables, suggesting relative independence of electrostatic and size dimensions. Applying the predefined redundancy criterion on |ρ|, correlated size-related descriptors were considered redundant. To avoid double-counting a single latent physicochemical axis, only one representative size descriptor was retained. We selected length due to its direct structural interpretability and deterministic relation to molecular mass. The threshold was adopted as a conservative criterion to remove strongly collinear features while retaining distinct physicochemical dimensions. A formal dimensionality analysis (e.g., PCA) was not performed and is left for future work.

To illustrate the extent of distributional overlap, Fig 4 presents the density distributions of the two principal orthogonal features: sequence length and negative charge proportion. In both cases, soluble and insoluble proteins exhibit substantial overlap despite statistically significant differences in central tendency.

Fig 4. Distributions of sequence length (left) and negative charge proportion (right) for soluble and insoluble proteins, showing substantial overlap between classes.

https://doi.org/10.1371/journal.pone.0344883.g004

For sequence length, insoluble proteins display a right-shifted distribution, reflecting a tendency toward longer chains. However, a considerable fraction of soluble proteins occupies the same range, limiting practical separability. Similarly, for negative charge proportion, soluble proteins show a modest enrichment in negatively charged residues, yet the two classes remain strongly intermixed.

These observations provide direct visual confirmation that, although detectable at scale, the underlying physicochemical differences are small relative to within-class variability. This substantial overlap explains the limited univariate discriminative performance observed in ROC analysis and reinforces the interpretation that protein solubility is governed by coordinated weak effects rather than strongly separable individual features.

The initial composite-δ index integrated the top ranked descriptors by absolute effect size, robustly scaled using median and interquartile range (IQR) to mitigate the influence of outliers.

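$$z_j = \frac{x_j - \mathrm{median}(x_j)}{\mathrm{IQR}(x_j)} \tag{1}$$

$$S = \sum_{j} \delta_j \, z_j \tag{2}$$

$$\hat{y} = \operatorname{sign}(S) \tag{3}$$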

Reduced composite formulation.

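Using global medians and IQR values (Table 3), the reduced composite-δ score becomes:

$$S = \delta_{\mathrm{length}} \, \frac{\mathrm{length} - \mathrm{median}(\mathrm{length})}{\mathrm{IQR}(\mathrm{length})} + \delta_{\mathrm{neg}} \, \frac{\mathrm{neg\_ratio} - \mathrm{median}(\mathrm{neg\_ratio})}{\mathrm{IQR}(\mathrm{neg\_ratio})} \tag{4}$$
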
Table 3. Robust summaries of the two retained physicochemical dimensions defining the reduced composite index.

https://doi.org/10.1371/journal.pone.0344883.t003

This two-dimensional formulation preserves the dominant size and electrostatic axes while eliminating redundant contributions. As shown below, discriminative performance remains comparable to the full multi-feature composite, supporting the interpretation that sequence-level solubility information is consistent with a low-dimensional organization at the level of global sequence-derived descriptors. The use of a linear formulation is intentional, as the objective is to isolate the contribution of first-order physicochemical features without introducing higher-order interactions that may obscure interpretability.

Performance metrics (AUC and MCC) of reference models were taken directly from the original benchmark study [14] to provide contextual reference. These values are shown alongside the descriptive metrics of the composite-δ index but do not constitute a direct head-to-head comparison.

The comparison in Table 4 is provided for contextual reference only. The reported performance of the composite-δ index is not directly comparable to supervised models evaluated under strict train/test separation, as the present analysis does not involve independent validation. Instead, these values should be interpreted as descriptive indicators of signal strength in sequence-derived features.

Table 4. Contextual performance reference for existing solubility prediction methods and the composite-δ index.

https://doi.org/10.1371/journal.pone.0344883.t004

Because the present study merges the original training, validation, and test splits for distributional analysis rather than predictive evaluation, the reported metrics should not be interpreted as estimates of out-of-sample performance.

Importantly, the proposed composite-δ index does not constitute a trained predictive model. Its construction is fully data-dependent and based on empirical effect size estimates derived from the same dataset. Consequently, AUC and MCC are reported solely as descriptive measures of separability within the observed feature space, rather than as indicators of predictive generalization.

The purpose of the composite-δ formulation is therefore not to achieve state-of-the-art predictive performance, but to provide an interpretable statistical reference that characterizes the magnitude and structure of sequence-derived solubility signals.

To contextualize computational efficiency, Table 5 provides a qualitative comparison of inference-time complexity across representative models. For clarity, we define the notation used in the complexity analysis as follows: L denotes the sequence length, d the number of input features, t the number of trees or ensemble components, k the kernel or filter size in convolutional models, H the number of transformer layers, and h the number of attention heads per layer. All complexity estimates refer to inference-time computational cost under standard implementations.

Table 5. Qualitative comparison of inference-time computational requirements across representative solubility prediction approaches.

https://doi.org/10.1371/journal.pone.0344883.t005

The redundancy-aware composite-δ baseline achieved AUC = 0.624 and MCC = 0.1746. Although this performance remains below that of high-capacity protein language model architectures such as PLM_Sol [14], it is comparable to, and in several cases exceeds, traditional physicochemical feature-based predictors reported in the literature. Importantly, the composite-δ formulation involves no parameter fitting, no embedding extraction, and no hyperparameter optimization. The decision rule is fully determined by robust scaling [26,27] and statistically estimated effect sizes derived directly from empirical distributions. Consequently, the observed discrimination reflects separability of sequence-level biochemical features rather than model-driven representation learning.

From a computational standpoint, the complexity of the proposed composite-δ model depends on the level of representation considered. When applied directly to raw amino acid sequences, feature extraction requires a single pass over the sequence, resulting in a linear time complexity of O(L), where L is the sequence length.

However, once sequence-derived features are computed, the composite-δ score itself is obtained through a simple linear combination, which operates in constant time O(1). Therefore, the model is best interpreted as having O(L) end-to-end complexity with an O(1) scoring step.

This distinction is important in practice, as feature extraction is computationally inexpensive and scales linearly with sequence length, while the absence of training, parameter optimization, or iterative inference makes the overall framework substantially more efficient than high-capacity models such as protein language models, which typically scale at least quadratically with sequence length. In practical settings, where sequence-derived features are often precomputed or cached, the effective runtime of the scoring step becomes negligible. In contrast, classical machine-learning predictors based on handcrafted descriptors scale at least linearly with the number of features or ensemble components (O(d)–O(t)) [30]. Convolutional neural networks introduce sequence-length-dependent cost [31], while transformer-based protein language models incur quadratic complexity with respect to sequence length due to the self-attention mechanism [32]. This progression corresponds to substantially higher computational cost.

The moderate performance gap between composite-δ and PLM-based architectures therefore reflects a clear trade-off between representational capacity and computational efficiency. While transformer models capture higher-order contextual interactions at substantial resource cost, the present results demonstrate that a low-dimensional, redundancy-controlled linear formulation retains a non-trivial portion of discriminative signal with negligible computational overhead.

From a mechanistic perspective, the achieved performance indicates that global physicochemical descriptors encode a measurable but limited solubility signal. The modest AUC and MCC values are consistent with the substantial distributional overlap observed in univariate analyses and reinforce the interpretation of solubility as a weak-signal, low-dimensional, and context-dependent phenotype [2,5]. Rather than suggesting inadequacy of classical descriptors, these findings clarify their role: global sequence-derived features provide an interpretable empirical reference for the level of discrimination such descriptors can achieve, and they reflect the physicochemical axes upon which more complex models may implicitly build. In this sense, the composite-δ baseline serves as both a transparent reference model and a mechanistic anchor for evaluating the added value of high-capacity predictive frameworks under explicit computational constraints.

The robustness of the proposed framework also deserves consideration. Because the composite-δ formulation is derived directly from global distributional properties and involves no optimization or parameter tuning beyond direct empirical estimation from the same dataset, the framework may be less sensitive to dataset partitioning than trained models, although this was not formally evaluated. Given the large sample size (N > 78,000), the variance of the estimated effect sizes is expected to be limited; a formal stability assessment under resampling is left for future work.
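One way such a resampling check could be set up is a percentile bootstrap of Cliff's δ for a single descriptor. The sketch below is an assumption about how the assessment might be done, not an analysis performed in this study; the function names, the number of bootstrap replicates, and the 95% interval are illustrative choices.

```python
import numpy as np

def cliffs_delta(x, y):
    """Cliff's delta = P(x > y) - P(x < y), computed by sorting y once."""
    x = np.asarray(x, dtype=float)
    y = np.sort(np.asarray(y, dtype=float))
    # For each x: number of y values strictly below it (pairs with x > y).
    below = np.searchsorted(y, x, side="left")
    # For each x: number of y values strictly above it (pairs with x < y).
    above = len(y) - np.searchsorted(y, x, side="right")
    return (below.sum() - above.sum()) / (len(x) * len(y))

def bootstrap_delta(x, y, n_boot=1000, seed=0):
    """Point estimate and 95% percentile bootstrap interval for Cliff's delta."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    deltas = np.empty(n_boot)
    for b in range(n_boot):
        # Resample each class independently, with replacement.
        xb = rng.choice(x, size=len(x), replace=True)
        yb = rng.choice(y, size=len(y), replace=True)
        deltas[b] = cliffs_delta(xb, yb)
    lo, hi = np.percentile(deltas, [2.5, 97.5])
    return cliffs_delta(x, y), lo, hi
```

Here `x` and `y` would hold one descriptor's values for the soluble and insoluble classes, respectively. A narrow interval across replicates would support the expectation that effect-size estimates are stable at this sample size.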

Conclusion

This study presents a large-scale, statistically controlled, and interpretable analysis of sequence-derived biochemical determinants of protein solubility using 78,031 labeled proteins. By combining non-parametric testing, multiple-testing correction, effect-size estimation, uncertainty quantification, ROC-based evaluation, and redundancy analysis, we distinguish statistical detectability from practical discriminative relevance.

The results show that soluble and insoluble proteins are statistically distinguishable across many sequence-derived descriptors, but the corresponding effects are generally small and strongly overlapping. Protein solubility at the sequence level therefore appears to lie in a weak-signal regime, where no individual descriptor provides strong standalone discrimination. Instead, separability emerges from the coordinated contribution of multiple physicochemical features with modest individual effects.
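As a rough numerical guide to this weak-signal regime, Cliff's δ and the ROC AUC of a single descriptor are linked by a standard identity, assuming both are computed on the same two samples and that ties count as one half in the AUC. Here X and Y denote the descriptor value in a randomly drawn soluble and insoluble protein, respectively; this notation is introduced for illustration only.

```latex
\delta = P(X > Y) - P(X < Y), \qquad
\mathrm{AUC} = P(X > Y) + \tfrac{1}{2}\,P(X = Y)
\quad\Longrightarrow\quad
\delta = 2\,\mathrm{AUC} - 1 .
```

Under this identity, the AUC of 0.575 reported for the proportion of negatively charged residues would correspond to |δ| of about 0.15. This figure is implied by the identity rather than taken from the reported effect sizes, and it illustrates how far a statistically significant descriptor can remain from strong standalone discrimination.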

This conclusion refines the interpretation of classical solubility determinants such as sequence length, molecular weight, charge, and hydrophobicity. Although these factors are statistically significant, their practical effect sizes are considerably smaller than may be inferred from significance alone. Redundancy analysis further indicates that much of the signal is organized along a limited number of latent physicochemical axes. Size-related descriptors largely reflect a shared structural-burden dimension, whereas charge-related features form a comparatively independent electrostatic axis. After redundancy filtering, a parsimonious two-dimensional composite based on sequence length and negative charge proportion retained measurable descriptive discrimination.

The observed patterns are consistent with predominantly additive contributions of global physicochemical features, although higher-order dependencies were not explicitly modeled. The present framework deliberately relies on global sequence-derived descriptors and therefore does not capture positional or contextual effects. In proteins, the contribution of a residue depends strongly on its local and structural environment; identical amino acids may affect folding stability, aggregation propensity, or solvent exposure differently depending on context [33]. While protein language models can implicitly capture such ordered and higher-order dependencies, they introduce greater computational cost and multi-stage inference. In contrast, this study isolates the intrinsic contribution of first-order sequence descriptors, providing a transparent and computationally efficient lower-bound characterization of sequence-based solubility information.

All dataset partitions were merged to maximize statistical power for descriptive distributional analysis; consequently, no out-of-sample predictive claims are made. The composite-δ index should therefore be interpreted as a data-dependent statistical reference rather than a trained predictive model. Although its descriptive discrimination is moderate relative to high-capacity models, it offers a favorable trade-off between interpretability, computational efficiency, and practical applicability. It operates in linear time with respect to sequence length, requires no training, and supports rapid pre-screening in resource-constrained experimental settings.

Overall, this work establishes a statistically principled and interpretable reference framework for sequence-based solubility analysis. By quantifying effect size, redundancy, and low-dimensional structure in a large-scale dataset, it clarifies the role of classical descriptors, refines their biological interpretation, and provides a transparent baseline for assessing the added value of more complex machine-learning and protein-language-model approaches.

Supporting information

References

  1. Baneyx F. Recombinant protein expression in Escherichia coli. Curr Opin Biotechnol. 1999;10(5):411–21. pmid:10508629
  2. Chiti F, Dobson CM. Protein misfolding, amyloid formation, and human disease. Annu Rev Biochem. 2017;86:27–68.
  3. Rosano GL, Ceccarelli EA. Recombinant protein expression in Escherichia coli: advances and challenges. Front Microbiol. 2014;5:172. pmid:24860555
  4. Wilkinson DL, Harrison RG. Predicting the solubility of recombinant proteins in Escherichia coli. Nat Biotechnol. 1991;9(5):443–8.
  5. Chiti F, Dobson CM. Protein aggregation in disease and biotechnology. Annu Rev Biochem. 2006;75:333–66.
  6. Idicula-Thomas S, Balaji PV. Understanding the relationship between the primary structure of proteins and its propensity to be soluble on overexpression in Escherichia coli. Protein Sci. 2005;14(3):582–92. pmid:15689506
  7. Ventura S. Sequence determinants of protein aggregation: tools to increase protein solubility. Microb Cell Fact. 2005;4(1):11. pmid:15847694
  8. Chou PY, Fasman GD. Prediction of protein conformation. Biochemistry. 1974;13(2):222–45. pmid:4358940
  9. Kyte J, Doolittle RF. A simple method for displaying the hydropathic character of a protein. J Mol Biol. 1982;157(1):105–32. pmid:7108955
  10. Wasserstein RL, Schirm AL, Lazar NA. Moving to a world beyond p < 0.05. Am Stat. 2019;73(Suppl 1):1–19.
  11. Khurana S, Rawi R, Kunji K, Chuang G-Y, Bensmail H, Mall R. DeepSol: a deep learning framework for sequence-based protein solubility prediction. Bioinformatics. 2018;34(15):2605–13. pmid:29554211
  12. Hon J, Marusiak M, Martinek T, Kunka A, Zendulka J, Bednar D, et al. SoluProt: prediction of soluble protein expression in Escherichia coli. Bioinformatics. 2021;37(1):23–8. pmid:33416864
  13. Thumuluri V, Martiny H-M, Almagro Armenteros JJ, Salomon J, Nielsen H, Johansen AR. NetSolP: predicting protein solubility in Escherichia coli using language models. Bioinformatics. 2022;38(4):941–6. pmid:35088833
  14. Zhang X, Hu X, Zhang T, Yang L, Liu C, Xu N, et al. PLM_Sol: predicting protein solubility by benchmarking multiple protein language models with the updated Escherichia coli protein solubility dataset. Brief Bioinform. 2024;25(5):bbae404. pmid:39179250
  15. Dunker AK, Lawson JD, Brown CJ, Williams RM, Romero P, Oh JS, et al. Intrinsically disordered protein. J Mol Graph Model. 2001;19(1):26–59. pmid:11381529
  16. Uversky VN. Intrinsic disorder-based protein interactions and their modulators. Curr Pharm Des. 2013;19(23):4191–213. pmid:23170892
  17. Mann HB, Whitney DR. On a test of whether one of two random variables is stochastically larger than the other. Ann Math Stat. 1947;18(1):50–60.
  18. Benjamini Y, Hochberg Y. Controlling the false discovery rate. J R Stat Soc B. 1995;57(1):289–300.
  19. Cliff N. Dominance statistics. Psychol Bull. 1993;114(3):494–509.
  20. Hodges JL, Lehmann EL. Estimates of location based on rank tests. Ann Math Stat. 1963;34(2):598–611.
  21. Efron B, Tibshirani RJ. An introduction to the bootstrap. Chapman & Hall; 1993.
  22. Hanley JA, McNeil BJ. ROC curve interpretation. Radiology. 1982;143(1):29–36.
  23. Youden WJ. Index for rating diagnostic tests. Cancer. 1950;3(1):32–5. pmid:15405679
  24. Spearman C. Association measurement. Am J Psychol. 1904;15(1):72–101.
  25. Conover WJ. Practical nonparametric statistics. 3rd ed. Wiley; 1999.
  26. Huber PJ. Robust Statistics. Wiley; 1981.
  27. Hampel FR, et al. Robust Statistics. Wiley; 1986.
  28. Kramer RM, et al. Increased negative surface charge correlates with solubility. Biophys J. 2012;102(8):1907–15.
  29. Chan P. Soluble expression correlates with lack of positive charge. J Mol Biol. 2013;425(8):1427–35.
  30. Bishop CM. Pattern recognition and machine learning. Springer; 2006.
  31. Lecun Y, Bottou L, Bengio Y, Haffner P. Gradient-based learning applied to document recognition. Proc IEEE. 1998;86(11):2278–324.
  32. Vaswani A, et al. Attention is all you need. NeurIPS. 2017.
  33. Rives A, et al. Biological structure from protein language models. PNAS. 2021;118(15):e2016239118.