
Revisiting co-expression-based automated function prediction in yeast with neural networks and updated Gene Ontology annotations

  • Cole E. McGuire,

    Roles Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Software, Validation, Visualization, Writing – original draft, Writing – review & editing

    Affiliation Department of Computer Science, Trinity University, San Antonio, Texas, United States of America

  • Matthew A. Hibbs

    Roles Conceptualization, Formal analysis, Funding acquisition, Investigation, Methodology, Project administration, Resources, Supervision, Writing – original draft, Writing – review & editing

    mhibbs@trinity.edu

    Affiliation Department of Computer Science, Trinity University, San Antonio, Texas, United States of America

Abstract

Automated function prediction (AFP) is the process of predicting the function of genes or proteins with machine learning models trained on high-throughput biological data. Deep learning with neural networks has become the dominant machine learning methodology of contemporary AFP models. However, it is unclear what difference exists between neural networks and classical machine learning techniques for AFP. Therefore, we created a model of AFP in yeast using feedforward neural networks that is trained on gene co-expression data to predict Gene Ontology (GO) labels and directly compared it to two previous, experimentally-validated AFP models. When trained on the same input data, our model outperforms these classical machine learning models (a Bayesian network and adaptive query-driven search) when predicting individual genes involved in mitochondrion organization. In particular, we found our neural network model better distinguished mis-annotated negatives in its training data. Finally, we quantified how differences in gene expression data and Gene Ontology annotations affect the performance of our model across each of its predicted GO terms. Our results suggest that feedforward neural networks can be more performant and robust to GO mis-annotations compared to the specific classical approaches examined here for co-expression-based AFP of some biological processes.

1. Introduction

Automated function prediction (AFP) is the process of predicting the function of genes or proteins by learning patterns in high-throughput biological data using machine learning. The goal of AFP is to predict novel protein functions to increase the rate at which researchers can understand disease and other phenotypes. The history of AFP shows a trend of using increasingly sophisticated machine learning methodologies and making predictions in increasingly complex organisms. One conception of AFP uses supervised machine learning models trained on high-throughput biological data to predict gene or protein function in terms of Gene Ontology labels [1]. The first iterations of this approach occurred in 2003 and 2004 using models built with Bayesian networks. These models combined high-throughput gene co-expression, protein interaction, and genetic interaction data along with expert-curated priors to predict gene function in Saccharomyces cerevisiae [2,3].

SPELL, MEFIT, and bioPIXIE were subsequent models of automated function prediction in S. cerevisiae that leveraged gene co-expression data and innovated on many of the ideas found in prior work [4–6]. MEFIT and bioPIXIE employed naive Bayesian networks while SPELL used a directed query search through gene expression datasets. These were among the first AFP methods to use some conception of model training. MEFIT and bioPIXIE replaced expert-curated model priors with priors learned from training data, and SPELL dynamically generated model weights at runtime in response to each query. These three methods were trained on similar input data and were experimentally validated in a systematic study of mitochondrion organization and biogenesis in yeast. This led to the discovery of over 100 genes involved in that process. These studies demonstrated that AFP models can be useful for directing experimental biology; furthermore, they evaluated the impact of underlying data types (e.g., expression data vs protein interaction data) on AFP performance [7–9]. At the same time, models were developed that used more sophisticated machine learning methodologies, such as support vector machines [10–12]. Additionally, methodologies began to branch out to predicting gene and protein function in organisms other than yeast [13,14], and predictions were extended beyond GO annotations to predict other genomic features, such as knockout phenotypes and gene-disease associations [15–17].

In the past decade, much of the cutting-edge research in machine learning, and its applications to biology, has transitioned towards the use of neural networks. Neural networks transform input data through layers of “artificial neurons” to produce an output prediction. Neural networks quantify the discrepancy (or “loss”) between the predicted output and the target output value of labeled training examples using a loss function. Given this loss value, neural networks can update their own parameters to make more accurate predictions via an algorithm known as “backpropagation.” Deep neural networks (those with several layers of artificial neurons) are responsible for breakthroughs in numerous problem domains including image recognition and natural language processing due to their ability to learn complex, hierarchical, and non-linear representations of data. However, the complexity of the functions learned by deep neural networks makes it challenging for humans to understand the rationale behind how these models make their predictions. Improving the explainability of neural network outputs and finding ways to interpret the inner workings of these models is an ongoing area of research [18].

Advancements in deep learning with neural networks are responsible for a paradigmatic shift in how researchers approach automated function prediction. Deep neural networks have displayed a marked increase in predictive accuracy over past AFP models [19]. Several variants of neural networks have been shown to be particularly well suited for modeling complex relationships in various modalities of biological input data. Some notable variants include convolutional neural networks, graph neural networks, and transformers [19]. Furthermore, neural networks have been associated with a trend in AFP towards predicting function primarily from a protein’s amino acid sequence (in the absence of species information) [20–23]. In parallel, a substantial body of work has explored network-integration-based AFP approaches by combining multiple biological networks using techniques such as matrix factorization, graph embedding, and deep representation learning [24–28].

Researchers lack a nuanced understanding of how the differences between specific deep learning architectures and earlier, widely-used machine learning methodologies affect the quality of predictions of co-expression-based AFP models. In the field of AFP, models lose the attention of researchers once they have been surpassed in predictive accuracy by other state-of-the-art techniques, and this trend has only been exacerbated by the rise of deep learning in AFP. There is little analysis of how deep learning versus classical machine learning techniques are suited for tackling the domain-specific challenges of AFP (such as noisy input data and changing Gene Ontology annotations). Furthermore, most AFP models lack systematic experimental validation beyond evaluating performance on newly annotated proteins (that were not present when the model was trained) [1]. Therefore, it is unclear how most deep learning and classical AFP models differ, especially in the context of directing novel biological experimentation.

The large variety of machine learning techniques and input data modalities in AFP makes it difficult to determine how differences between deep learning and classical machine learning methodologies affect prediction quality. As a first step, it is useful to establish a baseline by comparing the performance of early, classical AFP models against the simplest deep learning architecture. Therefore, we chose to compare the classical AFP models MEFIT and SPELL to a model built with feedforward neural networks trained on the same gene co-expression input data. While none of these models represent the state-of-the-art in contemporary AFP, we believe that this analysis will establish a baseline that will be useful for comparisons with more complex deep learning architectures and other input data modalities in the future. We emphasize that our comparison is not meant to be exhaustive of all “classical” machine learning techniques; rather, it is intended to establish a historically-grounded baseline of comparison for examining methodological behaviors and variance.

In this paper, we directly compare feedforward neural networks to experimentally-validated, classical machine learning techniques for co-expression-based AFP in yeast to assess how deep learning can improve prediction quality in this domain. First, we describe our methodology for building a model of AFP using feedforward neural networks that predicts 79 GO biological process labels from co-expression data of pairs of genes. We trained our model with the same gene expression data and GO labels that MEFIT and SPELL were trained with in 2007. Second, we compared each model’s ability to predict genes involved in mitochondrion organization. We found that our model is more performant than MEFIT and SPELL, in part because it better distinguishes mis-annotated negatives in its training data. Third, we trained our model with current-day gene expression data and GO labels to assess how the accumulation of more input data and higher-quality training labels affects our model’s performance. Finally, we discuss the relative merits of the approaches investigated and place our results in the broader context of more modern AFP methods.

2. Methods

Our model trains feedforward neural networks to predict the functional relationship between pairs of genes in the context of 79 Gene Ontology biological processes (Fig 1). As input, our model uses the correlations between the gene expression profiles of pairs of genes in 113 microarray assay datasets. Once trained, our model uses its predictions to construct a functional relationship graph for each GO term that can be converted into experimentally-testable, single-gene functional predictions. Our initial model is trained using the same Gene Ontology annotations and gene expression datasets that SPELL and MEFIT trained with as of April 15th, 2007.

Fig 1. Overview of methodology.

(A) Training data is created by calculating pairwise gene expression similarities in 113 gene expression datasets and by curating a set of “gold standard” training labels for genes with known functions. (B) Feedforward neural networks are trained to predict the Gene Ontology co-annotations of pairs of genes given feature vectors of their gene expression similarity. (C) Our method translates pairwise functional relationship predictions into testable single gene rankings for 79 biological processes.

https://doi.org/10.1371/journal.pone.0322689.g001

2.1 Curation of labeled training data

The input data for our model represents each pair of genes as a vector of 113 z-scores derived from Pearson correlations across experimental conditions for each of the 113 gene expression datasets. To train our model, we curated training labels for gene pairs with known functional relationships and gene pairs thought to be unrelated. Positive and negative labels are determined for each of the 79 predicted GO terms based on each gene pair’s co-annotation to a term, as detailed below.

2.1.1 Gene expression compendium curation.

In order to accurately compare between approaches, our model uses the same gene expression compendium as MEFIT and SPELL, which we obtained from the Saccharomyces Genome Database (SGD) [29,30]. This compendium consists of the 113 publicly available gene expression datasets curated by SGD as of April 15th, 2007 and comprises 1842 experimental conditions. These raw gene expression levels are processed for use as input data by normalizing Fisher-transformed Pearson correlations between pairs of genes for each of the 113 datasets in the compendium. For each gene expression dataset, the Pearson correlation between a pair of genes’ expression values is calculated as:

r = \frac{\sum_{i=1}^{n} (x_i - \mu_x)(y_i - \mu_y)}{n \, \sigma_x \sigma_y}

where x and y are the gene expression vectors of size n from a given dataset, μx and μy are their means, and σx and σy are their standard deviations. In the case of missing measurements, gene expression vectors x and y only contain the experimental conditions that have observations for both gene x and gene y (further details and other minor adjustments are described in Section 1.1 of S1 Text). As the distribution of Pearson correlations can vary greatly between gene expression datasets, the Fisher transformation is applied to improve comparability across datasets [31]:

f = \frac{1}{2} \ln\left(\frac{1 + r}{1 - r}\right)

After Pearson correlations have been Fisher transformed, the resulting values are normalized by the following equation to produce a standard normal score (z-score) for each pair of genes:

z = \frac{f - \mu_f}{\sigma_f}

where μf and σf are the estimated mean and standard deviation calculated using a subset of 200,000 random Fisher-transformed Pearson correlations from that dataset. The resulting distribution is approximately standard normal, N(0, 1). In all, each value in each dataset is a z-score that represents the number of standard deviations a gene pair’s expression correlation lies from the dataset’s mean of pairwise correlations [5]. This is the same preprocessing and normalization approach MEFIT and SPELL used for their gene expression compendiums [4,5].
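Taken together, the three steps above can be sketched as follows (a minimal NumPy illustration of the per-pair, per-dataset computation; the published MEFIT/SPELL implementations may differ in details):

```python
import numpy as np

def expression_zscore(x, y, mu_f, sigma_f):
    """Pearson -> Fisher -> z-score for one gene pair in one dataset.

    mu_f and sigma_f are the dataset's estimated mean and standard
    deviation of Fisher-transformed correlations (in the paper, estimated
    from 200,000 random gene pairs).
    """
    # Keep only conditions observed for both genes
    mask = ~np.isnan(x) & ~np.isnan(y)
    x, y = x[mask], y[mask]
    # Pearson correlation between the two expression profiles
    r = np.corrcoef(x, y)[0, 1]
    # Fisher transformation (diverges for |r| = 1)
    f = 0.5 * np.log((1 + r) / (1 - r))
    # Normalize against the dataset's background of pairwise correlations
    return (f - mu_f) / sigma_f
```

In practice mu_f and sigma_f would be computed once per dataset and reused for every gene pair drawn from it.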

2.1.2 Gold standard curation and cross validation.

To train our model, we need training labels consisting of gene pairs known to be functionally related (positive examples) and gene pairs thought to lack functional relationships (negative examples). To derive these labels, we first curate a set of single genes that have known functions, which we will call our “gold standard” genes. We defined genes with known functions as genes annotated to at least one of a set of 79 biological process Gene Ontology terms curated by experts as being informative enough to direct laboratory experiments [32]. These GO terms form a subset of biological process terms (distributed as a GO slim), and we used annotations to this subset from January 1st, 2007 in order to enable comparison to prior methods trained at that time. We further filtered the GO slim to only include terms with a sufficient number of annotations for model training (n >= 10). Genes not annotated to any of these 79 GO terms were considered functionally uncharacterized.

Our model uses 4-fold cross validation to assess overfitting and make its final predictions. The gold standard genes are randomly divided into four independent subsets. Each “fold” of the model uses three of these subsets as training genes while holding out the remaining subset as testing genes. Each fold generates a set of related (positive) gene pairs and unrelated (negative) gene pairs from the training genes or testing genes. Related pairs are those that are co-annotated to at least one of the 79 GO terms. Unrelated pairs are those whose smallest co-annotated GO term contains at least 10% of the genes in the yeast genome, indicating that each gene has a known function and that those functions differ from each other. This approach is not a perfect measure, since it assumes that two genes with known, different functions are unlikely to share a functional relationship. Note that to avoid data leakage and circularity, gene pairs used as training examples are only pairs where both genes are in the training gene set, and testing examples are only pairs where both genes are in the testing gene set [33,34].
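The fold construction and the leakage rule above can be sketched as follows (a minimal illustration with hypothetical names; the positive/negative labeling by co-annotation is omitted for brevity):

```python
import itertools
import random

def make_fold_pairs(gold_genes, n_folds=4, seed=0):
    """Split gold-standard genes into folds and yield, per fold, the
    within-train and within-test gene pairs.

    Pairs that span the train/test split are never generated, so no
    testing gene leaks into any training pair.
    """
    rng = random.Random(seed)
    genes = list(gold_genes)
    rng.shuffle(genes)
    folds = [genes[i::n_folds] for i in range(n_folds)]
    for i in range(n_folds):
        test_genes = set(folds[i])
        train_genes = set(genes) - test_genes
        # Only pairs where BOTH genes fall on the same side of the split
        train_pairs = list(itertools.combinations(sorted(train_genes), 2))
        test_pairs = list(itertools.combinations(sorted(test_genes), 2))
        yield train_pairs, test_pairs
```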

2.1.3 Features and training labels.

Our model represents each gene pair as a vector of input features associated with a vector of training labels. The input features for each gene pair form a vector of 113 z-scores derived from the gene expression datasets, and the training labels are binary vectors representing the co-annotations of a gene pair to the 79 Gene Ontology terms. For each GO term, a gene pair is labeled positive (encoded as 1) if the genes are co-annotated to that GO term and negative (encoded as 0) otherwise.

The full gene expression compendium and Gene Ontology files used for all training are available in S1 Dataset.

2.2 Pairwise function prediction with neural networks

Our model trains four cross-validated, feedforward neural networks to predict 79 GO term labels from feature vectors of gene expression z-scores. The performance of each fold’s neural network is measured for each GO term, and folds are compared to each other to evaluate the consistency of the cross-validated neural networks. After the model’s performance is evaluated, the model feeds all gene pairs through the neural networks to produce a fully-connected functional relationship graph for each GO term, where nodes correspond to genes, and edge weights correspond to the neural network output confidence predictions.

2.2.1 Neural network architecture and training loop.

The feature and label vectors are used to train feedforward neural networks to predict a gene pair’s co-involvement in each of the previously described 79 GO terms. The neural network was implemented with PyTorch (PyTorch 1.13.1 and Python 3.9.12). The network consists of an input layer with 113 neurons and an output layer of 79 neurons. The number and size of the model’s hidden layers are flexible. The model that we use to make our final comparisons against SPELL and MEFIT has three hidden layers with 500, 200, and 100 neurons respectively. Each hidden layer applies a ReLU activation function. At the beginning of the training process, weights are initialized by sampling a uniform distribution U(−√k, √k), where k = 1/(input size of the given layer).

Our model represents the problem of gene pair function prediction as a multilabel prediction problem by using a binary cross entropy loss function. The prediction of each GO term is treated as an independent binary classification task. Our model is trained with stochastic gradient descent; during each iteration of the training loop, a mini-batch of 50 gene pairs is sampled from the training data. This mini-batch consists of 25 gene pairs sampled from the related pairs and 25 gene pairs sampled from the unrelated pairs. After the mini-batch is fed forward through the model, the binary cross entropy loss between the network’s output and training labels is used to update the neural network’s weights via backpropagation. The training loop runs for enough iterations for the loss of each mini-batch to converge, which we empirically found to be at least 600,000 iterations. Each network is trained with a learning rate of 0.01 and momentum of 0.9 as hyperparameters. Further information about hyperparameters, loss curves, and training hardware can be found in Section 1.2 of S1 Text.
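Under these hyperparameters, the architecture and one balanced training iteration can be sketched in PyTorch roughly as follows (the tensors here are random placeholders standing in for real z-score features and co-annotation labels):

```python
import torch
import torch.nn as nn

# 113 expression z-scores in, 79 per-GO-term confidences out;
# hidden layer sizes taken from the text (500, 200, 100)
model = nn.Sequential(
    nn.Linear(113, 500), nn.ReLU(),
    nn.Linear(500, 200), nn.ReLU(),
    nn.Linear(200, 100), nn.ReLU(),
    nn.Linear(100, 79), nn.Sigmoid(),
)
loss_fn = nn.BCELoss()  # independent binary classification per GO term
opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# One training iteration: a mini-batch of 25 related + 25 unrelated pairs.
# Real labels are per-term co-annotation vectors; all-ones/all-zeros here
# are placeholders only.
pos_x, pos_y = torch.randn(25, 113), torch.ones(25, 79)
neg_x, neg_y = torch.randn(25, 113), torch.zeros(25, 79)
x = torch.cat([pos_x, neg_x])
y = torch.cat([pos_y, neg_y])

opt.zero_grad()
loss = loss_fn(model(x), y)  # binary cross entropy over all 79 terms
loss.backward()              # backpropagation
opt.step()
```

In the full training loop this iteration repeats (at least 600,000 times in our experiments) until the mini-batch loss converges.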

2.2.2 Performance evaluation.

The performance of each cross-validated neural network is measured with receiver operating characteristic (ROC) curves, which show the tradeoff between the true positive rate (TPR) and false positive rate (FPR). Once a network has been trained, the performance for each of the 79 output GO terms is measured by ranking the predictions of positive pairs and negative pairs derived from the fold’s testing genes or training genes. For a given GO term, positive pairs are gene pairs that are co-annotated to the GO term, and negative pairs are pairs of genes where neither gene is annotated to the GO term. A random subset of up to 10,000 positive pairs and a random subset of negative pairs 10 times that size are evaluated. After these testing pairs have been fed forward through the network, each pair is ranked by its output value for the GO term being evaluated, and a ROC curve is generated from this ranking.

The performance for each GO term is summarized with the average area under the ROC curve (AUC) across all four folds of a model. The overfitting of a single GO term is assessed by comparing the average testing and training data AUCs of that term across each fold of the model.
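For reference, the AUC of such a ranking equals the probability that a randomly chosen positive pair outranks a randomly chosen negative pair, which can be computed directly from the scores (a minimal sketch):

```python
import numpy as np

def roc_auc(scores, labels):
    """Area under the ROC curve for a ranking of scored examples.

    Computed as the normalized Mann-Whitney U statistic: the fraction of
    positive-negative pairs where the positive scores higher (ties count
    as half a win).
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))
```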

2.2.3 Generation of functional relationship graphs.

Once the networks have been trained, functional relationships are predicted for all possible gene pairs in the yeast genome. The final scores for each gene pair are calculated as the average outputs of the neural network folds that held out either of the genes in a pair. This is done so that only independent network folds contribute to the final score of a pair, meaning that each fold’s networks are only used to predict pairs they were not trained on. Since pairs that contain uncharacterized genes (which are not included in the gold standard) are held out of all folds’ training data, these pairs’ final scores are calculated as the average predictions of all folds of the model.
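The fold-averaging rule can be sketched as follows (hypothetical interfaces: `fold_of` maps each gold-standard gene to the fold that held it out, and `fold_preds` are per-fold prediction functions):

```python
def pair_score(gene_a, gene_b, fold_of, fold_preds, n_folds=4):
    """Average predictions over the folds that held out either gene.

    Uncharacterized genes are absent from fold_of; a pair containing only
    such genes was seen by no fold, so all folds' predictions are averaged.
    """
    held_out = {fold_of[g] for g in (gene_a, gene_b) if g in fold_of}
    folds = held_out or set(range(n_folds))
    return sum(fold_preds[i](gene_a, gene_b) for i in folds) / len(folds)
```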

Once all pairwise scores between genes have been calculated, a functional relationship graph is constructed for each GO term using each gene pair’s score for that GO term.

A functional relationship graph for a GO term is a weighted graph where the vertices represent genes and the weighted edges represent the predicted confidence of a pair of genes sharing a functional relationship for that GO term.

2.3 Post-processing for testable biological predictions

The aggregated outputs of our neural network model are functional relationship graphs for each predicted GO term that contain information about how confident our model is that two genes are functionally related in the context of each respective GO term. Importantly, our predicted GO terms were collected from a list of biological processes experts deemed to be informative for directing experimental biology [32]. While our model’s pairwise predictions contain a wealth of potentially novel biological information, we expect that they are also fairly noisy. We can reduce this noise by aggregating pairwise predictions into predictions of the functions of single genes. Furthermore, this allows us to compare against the performance of other automated function prediction models (such as MEFIT and SPELL) that also make single gene functional predictions. Similar to MEFIT and SPELL [7,9], we process our functional relationship graphs to generate testable predictions by ranking individual genes by their predicted involvement in a GO term as measured by the mean of their connections to other genes annotated to that term (Fig 2).

Fig 2. Diagram illustrating process of generating testable predictions from predicted pairwise relationships.

For each GO term, every gene in the yeast genome is ranked by the weighted average of its connections to genes annotated to that GO term in that GO term’s functional relationship graph.

https://doi.org/10.1371/journal.pone.0322689.g002

To predict an individual gene’s involvement in a GO term’s biological process, we use the confidence of the gene’s predicted relationship to other genes annotated to that term. For each GO term’s functional relationship graph, each gene in the yeast genome is ranked by its average weighted connection to all other genes annotated to that GO term. Therefore, for each GO term, our model produces a ranked list of individual genes where the top ranked genes are predicted to be more likely to be involved in that biological process. The performance of the model is measured by generating ROC and precision-recall curves from these single gene rankings [35]. Genes annotated to a GO term are labeled positive, all other genes in the gold standard set are labeled negative, and all uncharacterized genes are unlabeled and do not affect the performance evaluation.
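This ranking step can be sketched as follows, assuming a symmetric functional relationship graph stored as a dict keyed by sorted gene pairs (a hypothetical representation for illustration):

```python
import numpy as np

def rank_genes(graph, annotated, genes):
    """Rank genes by mean edge weight to genes annotated to a GO term.

    graph[(a, b)] -> predicted functional-relationship confidence, with
    keys stored as sorted tuples; 'annotated' is the set of genes
    annotated to the term being evaluated.
    """
    def score(g):
        others = [a for a in annotated if a != g]
        return np.mean([graph[tuple(sorted((g, a)))] for a in others])
    return sorted(genes, key=score, reverse=True)
```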

3. Results

3.1 Model performance

We evaluated the performance and overfitting of our neural network model by comparing the model’s average pairwise training AUC, average pairwise testing AUC, and single gene AUC for each of the 79 predicted GO terms. We plotted each GO term’s testing AUC against its training AUC, with the size and color of each point proportional to its GO term’s single gene AUC (Fig 3). The overfitting for each GO term (measured as the difference between its training and testing AUC) is represented by its point’s distance from the black diagonal line. We found that a GO term’s pairwise testing AUC was positively correlated with its single gene AUC (r = 0.847; p << 0.001) and negatively correlated with its overfitting (r = −0.859; p << 0.001). Together, this indicates that while our model does suffer from some overfitting, the effect is significantly smaller for the GO terms with high testing AUC performance. The single gene rankings for each predicted GO term of this model can be found in S1 File; further analysis of overfitting can be found in S1 Text.

Fig 3. Comparison of performance of neural network model for all 79 output GO terms.

For each GO term, the term’s average testing AUC is plotted against its average training AUC, and the size and color of the point is proportional to the AUC of the GO term’s single gene ranking. The black diagonal line shows where the testing AUC and training AUC are the same, which indicates no overfitting. The red circle and text denote the six GO terms with the worst testing AUC, and the green circle and text denote the six GO terms with the best testing AUC. This figure was constructed using data found in S1 Table.

https://doi.org/10.1371/journal.pone.0322689.g003

There is great variation in testing AUC performance and overfitting across the predicted GO terms of the model. We suspected that this results from our input gene expression data being fundamentally more informative for predicting some biological processes than others. To examine this claim, we first analyzed the six best-performing and six worst-performing GO terms by testing AUC, which are the terms circled in green and red respectively (Fig 3). We found that the highest-ranking terms were enriched for processes related to ribosome biogenesis, a process whose genes are highly co-regulated at the level of gene expression [36]. Conversely, we expected that the weakest-performing terms lacked strongly coordinated co-expression among most pairs of their annotated genes in our data compendium. Consistent with this, the lowest-ranking terms were pseudohyphal growth (a process not commonly occurring in laboratory yeast strains [37]) and terms encompassing a diverse set of biological pathways (organelle fusion, protein acylation and dephosphorylation, etc.).

These findings align with our expectation that the testing AUC performance and overfitting for each GO term are correlated with the extent to which a GO term’s genes are co-regulated at the level of gene expression. To evaluate this quantitatively, we examined the difference in co-expression distributions for co-annotated gene pairs to the background distribution for all gene pairs. For each GO term, we derived a probability distribution, T, of the average z-scores (across our 113 gene expression datasets) by randomly sampling 200,000 gene pairs co-annotated to that GO term. Additionally, we created a background probability distribution, B, by sampling the average z-scores for 200,000 of all the training gene pairs. For each GO term, we measured the Kullback–Leibler (KL) divergence between the term’s probability distribution and the background probability distribution as DKL(T‖B). KL divergence measures the extent of discrepancy between two distributions by quantifying the amount of information lost by approximating one distribution with another [38]. The KL divergence of each term was positively correlated with the testing AUC of the term (r = 0.735; p << 0.001) and negatively correlated with the term’s overfitting (r = −0.660; p << 0.001) (S4 Fig in S1 Text). This indicates that as a GO term’s pairwise co-expression differs more from the background co-expression distribution, the GO term tends to make more accurate predictions and suffer from less overfitting. Therefore, these findings support our expectation that our model performance for a given GO term is correlated with the extent to which the term’s genes have coordinated co-expression.
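A simple histogram-based estimate of this divergence can be sketched as follows (our own estimator choice for illustration; the paper's exact binning and smoothing may differ):

```python
import numpy as np

def kl_divergence(term_scores, background_scores, bins=50):
    """Histogram estimate of D_KL(T || B) between a GO term's
    average-z-score distribution T and the background distribution B."""
    lo = min(term_scores.min(), background_scores.min())
    hi = max(term_scores.max(), background_scores.max())
    t, _ = np.histogram(term_scores, bins=bins, range=(lo, hi))
    b, _ = np.histogram(background_scores, bins=bins, range=(lo, hi))
    # A small pseudocount avoids log(0) and division by zero in empty bins
    t = (t + 1e-9) / (t + 1e-9).sum()
    b = (b + 1e-9) / (b + 1e-9).sum()
    return float(np.sum(t * np.log(t / b)))
```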

3.2 Comparing mitochondrion organization predictions to SPELL and MEFIT

We set out to compare the performance between different machine learning models when predicting functional involvement in mitochondrion organization (GO:0007005) when trained on the same gene co-expression data. We compared the performance of our neural network model, SPELL, and MEFIT by comparing the ROC curves generated by each model’s rankings of single genes predicted to be involved in mitochondrion organization using Gene Ontology labels from January 1st, 2007 (Fig 4). We found that our neural network model performed better than both SPELL and MEFIT in the TPR-FPR tradeoff as reflected by over a 0.04 increase in AUC compared to both models.

Fig 4. Comparison of ROC performance to MEFIT and SPELL.

ROC curve comparing tradeoff between recall and false positive rate for rankings of genes predicted to be involved in mitochondrion organization for MEFIT, SPELL, and our neural network model. GO labels used in ROC curve calculations were derived from the January 1st, 2007 Gene Ontology. This figure was constructed using data found in S2 Table.

https://doi.org/10.1371/journal.pone.0322689.g004

We also assessed the precision of each model on their top predictions, since those predictions are the most ripe for experimental validation. Using the same rankings mentioned above, we generated precision-recall curves for each model’s rankings (Fig 5A). We found that the neural network model had worse precision at low recall (0–0.2) compared to SPELL and MEFIT when using labels derived from the 2007 Gene Ontology, despite our model’s overall improvement in ROC performance. In order to assess which model has better precision on top predictions in terms of our current understanding of biology, we relabeled the aforementioned rankings from each model using labels derived from June 16th, 2022 GO annotations. Using these modern labels, we found that our neural network model had higher precision than SPELL and MEFIT at low recall (Fig 5B).

Fig 5. Comparison of precision-recall tradeoff with MEFIT and SPELL.

Convex-hulled precision-recall curves with logarithmic x-axes comparing the tradeoff between precision and recall for rankings of genes predicted to be involved in mitochondrion organization for MEFIT, SPELL, and our neural network model, all trained using labels from the January 1st, 2007 Gene Ontology. (A) Rankings labeled using January 1st, 2007 Gene Ontology annotations (solid lines) and (B) using June 22nd, 2022 Gene Ontology annotations (dashed lines). This figure was constructed using data found in S2 Table.

https://doi.org/10.1371/journal.pone.0322689.g005

In order to understand our model’s performance increase on top predictions using 2022 GO labels, we analyzed the distribution of genes that had their annotations changed from 2007 to 2022. Partially due to the efforts of SPELL and MEFIT, the genes and proteins annotated to mitochondrion organization in yeast have changed significantly since 2007. One particularly interesting set is the class of 117 genes that were labeled as negatives using 2007 GO annotations (in the gold standard but not annotated to mitochondrion organization in 2007) but are labeled as positives using 2022 GO annotations (annotated to mitochondrion organization in 2022). These genes were labeled as negatives for mitochondrion organization in our training data (and in the training data of MEFIT and SPELL), but according to our current understanding of biological reality, they should be labeled as positive. Due to these mis-annotations in the Gene Ontology, these examples contaminated our negative training set. As such, these 117 “contaminated negatives” are penalized for high-ranking predictions under the 2007 evaluation, even though their high rank is biologically accurate. Therefore, we analyzed the distribution of rankings for the contaminated-negative mitochondrion organization genes ranked within the top 20% of each model’s predictions for involvement in mitochondrion organization (Fig 6). Our neural network approach ranks these mis-annotated genes significantly higher than either MEFIT or SPELL (two-sample KS test; neural network–MEFIT p-value = 0.0274; neural network–SPELL p-value = 0.0268), suggesting that our approach is more robust to label noise in its training data, at least in this particular case.

thumbnail
Fig 6. Distribution of mitochondrion organization mis-annotated negatives.

Density plots of the relative rank, for each model, of high-confidence mis-annotated negative mitochondrion organization genes. A gene’s rank is calculated as its relative placement in the single-gene ranking of genes involved in mitochondrion organization. For each model, n represents the number of mis-annotated negatives predicted with a relative rank between 0.8 and 1.0. Two-sample right-tailed Kolmogorov-Smirnov tests were performed between the distribution of the neural network’s high-confidence mis-annotated negatives and the corresponding SPELL or MEFIT distributions. The null hypothesis was rejected for both tests (Neural network–MEFIT p-value = 0.0274; Neural network–SPELL p-value = 0.0268).

https://doi.org/10.1371/journal.pone.0322689.g006

3.3 Model performance using modern gene expression data and Gene Ontology labels

In the previous section, we found that our model outperformed SPELL and MEFIT when trained with the same 2007 gene expression data and Gene Ontology annotations. However, since 2007, the scientific community has accumulated significantly more gene expression data, and the annotation quality of the Gene Ontology has continued to improve. Therefore, we wanted to assess the performance of our approach when using modern gene expression data and Gene Ontology annotations, and to explore the effects of additional input data versus improved label accuracy. Models used in this section share an architecture consisting of three layers of 80 hidden neurons each; this shared architecture was chosen so that models with different-sized input vectors could be compared consistently.
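The shared architecture can be sketched as a plain numpy forward pass: three ReLU hidden layers of 80 units feeding a sigmoid output score. The input dimension (one co-expression feature per dataset), the ReLU activation, and the He initialization are assumptions for illustration, not details confirmed by the text above.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

def init_layer(n_in, n_out):
    # He initialization, a common default for ReLU layers (assumed here)
    return rng.normal(0.0, np.sqrt(2.0 / n_in), (n_in, n_out)), np.zeros(n_out)

n_features = 430  # e.g., one co-expression z-score per 2022 dataset (assumed)
sizes = [n_features, 80, 80, 80, 1]
layers = [init_layer(a, b) for a, b in zip(sizes[:-1], sizes[1:])]

def forward(x):
    # Three hidden layers with ReLU; sigmoid output gives a score in (0, 1)
    # interpretable as predicted functional relatedness of a gene pair.
    for W, b in layers[:-1]:
        x = relu(x @ W + b)
    W, b = layers[-1]
    return 1.0 / (1.0 + np.exp(-(x @ W + b)))

score = forward(rng.normal(size=(1, n_features)))
```

Because only the hidden layers are fixed at 80 units, the same architecture accommodates the 113-dataset and 430-dataset input vectors by changing `n_features` alone.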

We trained two models with the same 2007 Gene Ontology labels, but different gene expression input data. The first model used the 113 gene expression datasets available in 2007 and the second model used 430 gene expression datasets curated by SGD in 2022. When comparing each model’s AUC for each GO term’s single gene rankings, we found that across all GO terms, the AUC tended to be higher for the model trained with 2022 gene expression data (right-tailed paired t-test; p << 0.001) (Fig 7A). This indicates that additional input data generally improves the accuracy of predictions.
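The per-term AUC comparison reduces to a right-tailed paired t-test across GO terms, which can be outlined as below. The AUC arrays are synthetic stand-ins for the real per-term values (available in S3 Table).

```python
import numpy as np
from scipy.stats import ttest_rel

rng = np.random.default_rng(1)

# Hypothetical per-GO-term AUCs for the two models (79 terms assumed).
auc_2007_data = rng.uniform(0.5, 0.9, size=79)
auc_2022_data = np.clip(auc_2007_data + rng.normal(0.05, 0.03, size=79), 0.0, 1.0)

# Right-tailed paired test: do AUCs from the 2022-data model exceed those
# from the 2007-data model, term by term?
stat, p = ttest_rel(auc_2022_data, auc_2007_data, alternative="greater")
```

Pairing by GO term is the key design choice: it controls for the large between-term variation in predictability, so the test isolates the effect of the added expression data.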

thumbnail
Fig 7. Comparison of performance with different combinations of gene expression input data and Gene Ontology training labels.

(A) Scatterplot of the AUCs derived from single gene rankings for each GO term derived from two models trained with different input data and the same 2007 training labels. The x axis represents AUCs from a model trained with all gene expression datasets available in 2007 and the y axis represents AUCs from a model trained with all gene expression datasets available in 2022. The black diagonal line separates terms that perform better with 2007 vs 2022 gene expression data. (B) Scatterplot of the AUCs derived from single gene rankings for each GO term derived from two models, one trained with 2007 Gene Ontology labels (x axis) and one with 2022 Gene Ontology labels (y axis). The black diagonal line separates terms that perform better with 2007 vs 2022 Gene Ontology labels. (C) For each GO term, probability distributions were created by sampling the average z-score of 200,000 gene pairs each from the set of genes annotated in 2007 (T2007) and the set of genes annotated in 2022 (T2022). Background probability distributions were created from samples of the average z-score of 200,000 gene pairs from the set of all training gene pairs in 2007 and 2022 (B2007 and B2022). For each GO term, the 2007 KL divergence was measured as DKL(T2007,B2007) and the 2022 KL divergence was measured as DKL(T2022,B2022). The difference between the 2022 Ontology AUC and the 2007 Ontology AUC is plotted against the difference between the 2022 KL divergence and the 2007 KL divergence with a regression line (dashed line). This figure was constructed using data found in S3 Table.

https://doi.org/10.1371/journal.pone.0322689.g007

We also trained two models using the same 2022 gene expression input data, but with different Gene Ontology training labels. The first model (as described in the previous paragraph) was trained and evaluated on labels from the 2007 Gene Ontology while the second model used labels from the 2022 Gene Ontology. We found a large disparity in performance between the two models across GO terms (Fig 7B). Some GO terms had better AUCs when training and evaluating on 2022 labels while others performed better with the 2007 labels. Given our findings in section 3.1 that each GO term’s performance is correlated with the extent to which its genes are co-regulated at the level of gene expression, we decided to investigate if a similar effect could explain the disparities in performance between these models.

We hypothesized that these performance disparities could be partially explained by the change from 2007 to 2022 in how correlated the gene sets of each GO term were compared to the background distribution. For each GO term, probability distributions were created by sampling the average z-score (across 430 gene expression datasets) of 200,000 gene pairs each from the set of genes annotated in 2007 (T2007) and the set of genes annotated in 2022 (T2022). Background probability distributions were created from samples of the average z-score of 200,000 gene pairs from the set of all training gene pairs in 2007 and 2022 (B2007 and B2022). For each GO term, the 2007 KL divergence was measured as DKL(T2007,B2007) and the 2022 KL divergence was measured as DKL(T2022,B2022). We found that the difference in a GO term’s AUC from 2022 labels and 2007 labels was positively correlated with the difference in the KL divergence for the 2022 and 2007 z-score probability distributions (r = 0.485; p << 0.001) (Fig 7C). This suggests that terms that had worse performance with 2022 labels had z-score distributions that became more similar to the background distribution between 2007 and 2022, making them harder to predict.
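The KL-divergence calculation described above can be outlined as follows: discretize the sampled term and background z-scores onto a shared grid, then compute DKL(term || background). The Gaussian samples, bin count, and add-one smoothing are illustrative assumptions, not the study’s exact procedure.

```python
import numpy as np
from scipy.stats import entropy

rng = np.random.default_rng(2)

def kl_from_samples(term_z, background_z, bins=50):
    # Shared bin edges so both histograms are comparable distributions.
    lo = min(term_z.min(), background_z.min())
    hi = max(term_z.max(), background_z.max())
    edges = np.linspace(lo, hi, bins + 1)
    p, _ = np.histogram(term_z, bins=edges)
    q, _ = np.histogram(background_z, bins=edges)
    # Add-one smoothing avoids empty bins; scipy's entropy(p, q) normalizes
    # the counts and returns sum(p * log(p / q)), i.e. D_KL(p || q).
    return entropy(p + 1, q + 1)

# Hypothetical samples standing in for the 200,000 averaged z-scores:
# term pairs shifted toward higher co-expression than the background.
term = rng.normal(1.0, 1.0, 200_000)
background = rng.normal(0.0, 1.0, 200_000)
dkl = kl_from_samples(term, background)
```

A term whose z-score distribution drifts toward the background between annotation snapshots would show a shrinking `dkl`, matching the pattern in Fig 7C.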

4. Discussion

In this work, we set out to compare feedforward neural networks to two classical machine learning methodologies for co-expression-based automated function prediction in yeast. We created a model of AFP in yeast using neural networks that can predict the functional relationship between pairs of genes in the context of 79 biological processes. Our model uses these pairwise predictions to rank individual genes by their predicted involvement for these 79 different biological processes. When training our model with the same gene expression input data and Gene Ontology labels as SPELL and MEFIT, we found that our model’s predictions of genes involved in mitochondrion organization had better ROC performance than these models. This indicates that, had our model been created at the same time as MEFIT and SPELL, it could have been used in conjunction with them to successfully direct an experimental study of mitochondrion organization [7,9]. However, our model still needs to be experimentally validated to determine if it is useful for directing experimental biology in the modern day. In future work, we plan to use petite frequency assays to test functionally uncharacterized genes our model predicts to be involved in mitochondrion organization (as did MEFIT and SPELL) [7,9].

Revisiting automated function prediction with gene expression data and Gene Ontology labels from 2007 revealed how the application of neural networks for automated function prediction interacts with our accumulation of biological knowledge. We found that the accumulation of more high-throughput gene expression data between 2007 and 2022 led to a significant increase in the performance of our model across most biological processes (Fig 7A). This suggests that despite the large corpus of gene expression data for yeast we have already amassed, acquiring more data can still improve co-expression-based AFP’s accuracy.

The conclusions for updates to Gene Ontology annotations are more complicated. We found that the difference in performance between a model trained with 2007 Gene Ontology annotations and 2022 Gene Ontology annotations varied greatly between GO terms (Fig 7B). We found that GO terms that performed better with 2007 GO annotations had co-expression z-score distributions that were less similar to the background training distributions in 2007 compared to 2022 (Fig 7C).

These results highlight an important limitation of our approach that is shared by most supervised AFP methods: prediction quality is highly sensitive to the annotations used to derive training labels. GO annotations are derived from the published literature and therefore reflect a bias toward well-studied genes and pathways, with limited coverage for poorly characterized genes. Additionally, the GO structure itself encodes strong assumptions about functional organization that may not fully reflect biological reality in all organisms or circumstances. These factors may further constrain the discovery of genuinely novel functional relationships using GO-based supervised approaches. Our results indicate that AFP methods must recognize these limitations when constructing training labels from GO annotations for supervised models. Furthermore, our results highlight the benefits of fully unsupervised or weakly supervised approaches to AFP from network integration, particularly for uncovering function among uncharacterized genes [24].

Our analyses of past GO annotations highlight that the way our model, and many others, define negative training examples is somewhat naïve and potentially problematic. Recall that in our model, MEFIT, and SPELL, genes and pairs of genes are labeled as negative for a term if they are not annotated to that term but are annotated to another term in the GO slim. Yeast proteins often exhibit multiple distinct functions; thus, assuming that a gene’s involvement in one biological process implies a lack of involvement in another is reductive. As such, there may be great value in developing new approaches to defining negative training examples, or approaches that forgo defining negative training examples entirely [39–41]. In the face of this problem, we note our model’s ability to better distinguish mis-annotated negative mitochondrion organization genes in its training set compared to SPELL and MEFIT (Fig 6). This is evidence that neural networks may have better potential to overcome the issue of problematic negatives compared to other machine learning models in the domain of AFP. However, further experiments and analyses are warranted to more fully explore this relationship.
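The negative-labeling rule recalled above (also the 1/-1/0 encoding used in S2 Table) can be written out as a small sketch; the gene names and annotation sets here are hypothetical.

```python
# A gene is positive (1) for a term if annotated to it, negative (-1) if it
# carries some other GO slim annotation but not this term, and unlabeled (0)
# if it has no GO slim annotation at all.
annotations = {
    "GENE_A": {"mitochondrion organization"},
    "GENE_B": {"ribosome biogenesis"},
    "GENE_C": set(),  # uncharacterized gene: neither positive nor negative
}

def label(gene, term):
    terms = annotations[gene]
    if term in terms:
        return 1
    if terms:
        return -1  # the potentially problematic assumption described above
    return 0

labels = {g: label(g, "mitochondrion organization") for g in annotations}
```

The `return -1` branch is exactly where a multifunctional gene like the 117 contaminated negatives gets mislabeled: one known function silently rules out all others.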

In this study, we established a baseline for how feedforward neural networks and two classical machine learning methods compare for co-expression-based AFP. As we look to expand this analysis to more complex deep learning architectures, we feel it is important to place our methodology within the broader context of contemporary AFP research. Our methodology represents a different paradigm of supervised AFP than contemporary AFP approaches. Contemporary AFP largely focuses on predicting Gene Ontology labels of proteins from amino acid sequence embeddings generated from protein language models [19,23]. Other common input data modalities for contemporary AFP models are protein structure [42] and text embeddings from large language models trained on scientific literature or GO term descriptions [23,43]. These models are typically trained on proteins from multiple species and therefore predict function in a species-agnostic fashion. Early AFP approaches, as well as our own, predict Gene Ontology labels by learning patterns in the co-expression profiles of pairs of genes. They learn patterns of interactions between genes of the same species and are therefore species-specific. A proper comparison of these paradigms would require two kinds of analyses. First, which biological functions are gene expression and amino acid sequence each informative of? Second, which biological processes are species-specific and species-agnostic AFP models each better at predicting? While these analyses were out of scope for our study, we believe they are highly valuable and worthy of future work.

Our study shows that there is room for innovation in co-expression-based, species-specific AFP. First, we showed that, for some biological processes, feedforward neural networks are more performant at predicting GO labels from co-expression data than the classical machine learning methods of Bayesian belief networks and directed query searches. As we will discuss more, other deep learning architectures may achieve even better performance and are worthy of further inquiry. Second, our findings suggest the accumulation of more genome-wide gene expression data significantly improves our ability to predict gene function (Fig 7A). Given the greater abundance of data and innovations in machine learning methodologies, we believe that there is still a wealth of functional insights to be mined from gene expression data.

5. Future work and conclusions

Our future work consists of improving the interpretability of our model, exploring other deep learning architectures, and integrating data modalities other than gene expression into our methodology comparisons. First, while our model displayed better predictive accuracy than MEFIT and SPELL for mitochondrion organization, we note that our neural network model is less explainable and interpretable than these other methods. Given a predicted gene function, MEFIT and SPELL both display the input gene expression datasets that were most important for making that prediction [4,5]. Our model lacks this explainability. We did improve the interpretability of our model by finding that a GO term’s performance is correlated with the extent to which the co-expression of a term’s gene pairs diverges from the background training distribution (S4 Fig in S1 Text). That being said, our future work includes making our model more interpretable and explainable by implementing strategies like feature importance attribution [44].

Second, we want to explore other neural network architectures for co-expression-based AFP. We chose to build our model using feedforward neural networks because we wanted to compare the simplest deep learning architecture to MEFIT and SPELL. We found that feedforward neural networks can provide a modest increase in ROC performance over Bayesian networks and directed query searches (Fig 4). Now that we have established this baseline for deep learning models, we plan to compare feedforward networks to other neural network architectures. In particular, we believe we can improve performance by using graph neural networks to directly learn functional relationship graphs rather than constructing them from the pairwise predictions made by our current model.

The last portion of our future work lies in testing the performance of models that integrate modalities of biological data beyond co-expression into their input. As illustrated in Fig 3, gene co-expression data alone is not informative of every biological process a gene can participate in. However, integrating other data modalities into a model’s input can compensate for the lack of signal in gene expression data for some biological processes. For instance, bioPIXIE is a Bayesian-network-based model (similar to MEFIT) that includes physical interaction, genomic interaction, and cellular localization data in addition to co-expression data in its input [6]. We plan to augment our model by including these data modalities in our input features. We then want to compare this model to bioPIXIE to see if neural networks can improve performance by learning patterns between data modalities (which naïve Bayesian networks are unable to do).

In summary, we directly compared neural networks to two classical, experimentally validated machine learning models for co-expression-based AFP. We found that our model outperformed these prior methodologies at predicting the involvement of yeast genes in mitochondrion organization, in part due to being robust to mis-annotated negatives in its training data. As such, we suspect that the predictions for other biological processes made by our model are similarly useful for directing laboratory biology experiments. Overall, we found that a simple feedforward neural network can compete with or outperform earlier co-expression-based AFP models such as MEFIT and SPELL, which highlights how architectural choices can influence robustness to annotation noise and the ability to exploit growing data availability when predicting a gene’s biological function.

Supporting information

S1 Text. Supplemental figures, methods and results.

A document containing additional analysis and supplemental figures relating to our methodology and results. The supplemental methods contain sections about Pearson correlation calculation adjustments, model hyperparameters and training, an analysis of overfitting, and a time complexity analysis. The supplemental results contain an analysis of our methodology’s sensitivity to initialization parameters and the cross-validation data split, and it contains a figure analyzing the relationship between model performance and a GO term’s co-expression z-score distribution.

https://doi.org/10.1371/journal.pone.0322689.s001

(PDF)

S1 File. Single-gene function predictions for each GO term.

A zip file of tables in csv format containing individual genes ranked by their predicted involvement in each predicted GO term. These predictions were made by Model 1 as described in Table A of S1 Text.

https://doi.org/10.1371/journal.pone.0322689.s002

(ZIP)

S1 Table. Model 1 performance data.

Table containing the performance statistics of Model 1 and KL-divergence metrics for each GO term used to construct Fig 3 and S3 Fig in S1 Text.

https://doi.org/10.1371/journal.pone.0322689.s003

(CSV)

S2 Table. Mitochondrion organization prediction rankings data for MEFIT, SPELL, and neural network model.

Table containing each yeast gene and its predicted confidence or rank of involvement in mitochondrion organization (GO:0007005) for MEFIT, SPELL, and our neural network model. Genes are labeled in the following way: positives are ‘1’, negatives are ‘-1’, and genes that are neither positive nor negative are ‘0’. This data was used to construct Figs 4–6.

https://doi.org/10.1371/journal.pone.0322689.s004

(CSV)

S3 Table. 2007 Versus 2022 gene expression data and Gene Ontology annotations model comparison data.

This table contains the single gene ROC AUCs for Models 2–4 (as described in Table A of S1 Text) and KL-divergence metrics in 2007 and 2022 for each GO term. This data was used to construct Fig 7.

https://doi.org/10.1371/journal.pone.0322689.s005

(CSV)

Acknowledgments

We would like to thank Dr. Bethany Strunk and Dr. Brian Teague for their help in editing the manuscript. We would also like to thank the members of Strunk Lab for their thoughtful conversations.

References

  1. Makrodimitris S, van Ham RCHJ, Reinders MJT. Automatic gene function prediction in the 2020’s. Genes (Basel). 2020;11(11):1264. pmid:33120976
  2. Troyanskaya OG, Dolinski K, Owen AB, Altman RB, Botstein D. A Bayesian framework for combining heterogeneous data sources for gene function prediction (in Saccharomyces cerevisiae). Proc Natl Acad Sci U S A. 2003;100(14):8348–53. pmid:12826619
  3. Lee I, Date SV, Adai AT, Marcotte EM. A probabilistic functional network of yeast genes. Science. 2004;306(5701):1555–8. pmid:15567862
  4. Hibbs MA, Hess DC, Myers CL, Huttenhower C, Li K, Troyanskaya OG. Exploring the functional landscape of gene expression: directed search of large microarray compendia. Bioinformatics. 2007;23(20):2692–9. pmid:17724061
  5. Huttenhower C, Hibbs M, Myers C, Troyanskaya OG. A scalable method for integration and functional analysis of multiple microarray datasets. Bioinformatics. 2006;22(23):2890–7. pmid:17005538
  6. Myers CL, Robson D, Wible A, Hibbs MA, Chiriac C, Theesfeld CL, et al. Discovery of biological networks from diverse functional genomic data. Genome Biol. 2005;6(13):R114. pmid:16420673
  7. Hibbs MA, Myers CL, Huttenhower C, Hess DC, Li K, Caudy AA, et al. Directing experimental biology: a case study in mitochondrial biogenesis. PLoS Comput Biol. 2009;5(3):e1000322. pmid:19300515
  8. Huttenhower C, Hibbs MA, Myers CL, Caudy AA, Hess DC, Troyanskaya OG. The impact of incomplete knowledge on evaluation: an experimental benchmark for protein function prediction. Bioinformatics. 2009;25(18):2404–10. pmid:19561015
  9. Hess DC, Myers CL, Huttenhower C, Hibbs MA, Hayes AP, Paw J, et al. Computationally driven, quantitative experiments discover genes required for mitochondrial biogenesis. PLoS Genet. 2009;5(3):e1000407. pmid:19300474
  10. Vinayagam A, König R, Moormann J, Schubert F, Eils R, Glatting K-H, et al. Applying support vector machines for Gene Ontology based gene function prediction. BMC Bioinformatics. 2004;5:116. pmid:15333146
  11. Guan Y, Myers CL, Hess DC, Barutcuoglu Z, Caudy AA, Troyanskaya OG. Predicting gene function in a hierarchical context with an ensemble of classifiers. Genome Biol. 2008;9 Suppl 1(Suppl 1):S3. pmid:18613947
  12. Barutcuoglu Z, Schapire RE, Troyanskaya OG. Hierarchical multi-label prediction of gene function. Bioinformatics. 2006;22(7):830–6. pmid:16410319
  13. Falda M, Toppo S, Pescarolo A, Lavezzo E, Di Camillo B, Facchinetti A, et al. Argot2: a large scale function prediction tool relying on semantic similarity of weighted Gene Ontology terms. BMC Bioinformatics. 2012;13 Suppl 4(Suppl 4):S14. pmid:22536960
  14. Vinayagam A, del Val C, Schubert F, Eils R, Glatting K-H, Suhai S, et al. GOPET: a tool for automated predictions of Gene Ontology terms. BMC Bioinformatics. 2006;7:161. pmid:16549020
  15. Guan Y, Gorenshteyn D, Burmeister M, Wong AK, Schimenti JC, Handel MA, et al. Tissue-specific functional networks for prioritizing phenotype and disease genes. PLoS Comput Biol. 2012;8(9):e1002694. pmid:23028291
  16. Grinberg NF, Orhobor OI, King RD. An evaluation of machine-learning for predicting phenotype: studies in yeast, rice, and wheat. Mach Learn. 2020;109(2):251–77. pmid:32174648
  17. Wang PI, Marcotte EM. It’s the machine that matters: predicting gene function and phenotype from protein networks. J Proteomics. 2010;73(11):2277–89. pmid:20637909
  18. Şahin E, Arslan NN, Özdemir D. Unlocking the black box: an in-depth review on interpretability, explainability, and reliability in deep learning. Neural Comput Appl. 2024;37(2):859–965.
  19. Boadu F, Lee A, Cheng J. Deep learning methods for protein function prediction. Proteomics. 2025;25(1–2):e2300471. pmid:38996351
  20. Radivojac P, Clark WT, Oron TR, Schnoes AM, Wittkop T, Sokolov A, et al. A large-scale evaluation of computational protein function prediction. Nat Methods. 2013;10(3):221–7. pmid:23353650
  21. Wu J, Qing H, Ouyang J, Zhou J, Gao Z, Mason CE, et al. HiFun: homology independent protein function prediction by a novel protein-language self-attention model. Brief Bioinform. 2023;24(5):bbad311. pmid:37649370
  22. Jiang Y, Oron TR, Clark WT, Bankapur AR, D’Andrea D, Lepore R, et al. An expanded evaluation of protein function prediction methods shows an improvement in accuracy. Genome Biol. 2016;17(1):184. pmid:27604469
  23. Wang M, Nie Z, He Y, Vasilakos AV, Cheng Q (Shawn), Ren Z. Deep learning methods for protein representation and function prediction: a comprehensive overview. Eng Appl Artif Intell. 2025;155:110977.
  24. Forster DT, Li SC, Yashiroda Y, Yoshimura M, Li Z, Isuhuaylas LAV, et al. BIONIC: biological network integration using convolutions. Nat Methods. 2022;19(10):1250–61. pmid:36192463
  25. Gligorijevic V, Barot M, Bonneau R. deepNF: deep network fusion for protein function prediction. Bioinformatics. 2018;34(22):3873–81. pmid:29868758
  26. Cho H, Berger B, Peng J. Compact integration of multi-network topology for functional analysis of genes. Cell Syst. 2016;3(6):540-548.e5. pmid:27889536
  27. Nasser R, Sharan R. BERTwalk for integrating gene networks to predict gene- to pathway-level properties. Bioinform Adv. 2023;3(1):vbad086. pmid:37448813
  28. Li MM, Huang Y, Sumathipala M, Liang MQ, Valdeolivas A, Ananthakrishnan AN, et al. Contextual AI models for single-cell protein biology. Nat Methods. 2024;21(8):1546–57. pmid:39039335
  29. Cherry JM, Adler C, Ball C, Chervitz SA, Dwight SS, Hester ET. SGD: Saccharomyces genome database. Nucleic Acids Res. 1998;26:73–9.
  30. Wong ED, Miyasato SR, Aleksander S, Karra K, Nash RS, Skrzypek MS, et al. Saccharomyces genome database update: server architecture, pan-genome nomenclature, and external resources. Genetics. 2023;224(1):iyac191. pmid:36607068
  31. Fisher RA. Frequency distribution of the values of the correlation coefficient in samples from an indefinitely large population. Biometrika. 1915;10(4):507.
  32. Myers CL, Barrett DR, Hibbs MA, Huttenhower C, Troyanskaya OG. Finding function: evaluation methods for functional genomic data. BMC Genomics. 2006;7:187. pmid:16869964
  33. Park Y, Marcotte EM. Flaws in evaluation schemes for pair-input computational predictions. Nat Methods. 2012;9(12):1134–6. pmid:23223166
  34. Bernett J, Blumenthal DB, List M. Cracking the black box of deep sequence-based protein-protein interaction prediction. Brief Bioinform. 2024;25(2):bbae076. pmid:38446741
  35. Fawcett T. An introduction to ROC analysis. Pattern Recognit Lett. 2006;27(8):861–74.
  36. Zencir S, Dilg D, Rueda MP, Shore D, Albert B. Mechanisms coordinating ribosomal protein gene transcription in response to stress. Nucleic Acids Res. 2020;48(20):11408–20. pmid:33084907
  37. Gancedo JM. Control of pseudohyphae formation in Saccharomyces cerevisiae. FEMS Microbiol Rev. 2001;25(1):107–23. pmid:11152942
  38. Shlens J. Notes on Kullback-Leibler divergence and likelihood. arXiv [cs.IT]; 2014. Available from: http://arxiv.org/abs/1404.2000
  39. Li F, Dong S, Leier A, Han M, Guo X, Xu J, et al. Positive-unlabeled learning in bioinformatics and computational biology: a brief review. Brief Bioinform. 2022;23(1):bbab461. pmid:34729589
  40. Youngs N, Penfold-Brown D, Bonneau R, Shasha D. Negative example selection for protein function prediction: the NoGO database. PLoS Comput Biol. 2014;10(6):e1003644. pmid:24922051
  41. Youngs N, Penfold-Brown D, Drew K, Shasha D, Bonneau R. Parametric Bayesian priors and better choice of negative examples improve protein function prediction. Bioinformatics. 2013;29(9):1190–8. pmid:23511543
  42. Zhao C, Liu T, Wang Z. PANDA-3D: protein function prediction based on AlphaFold models. NAR Genom Bioinform. 2024;6(3):lqae094. pmid:39108640
  43. Kulmanov M, Guzmán-Vega FJ, Duek Roggli P, Lane L, Arold ST, Hoehndorf R. Protein function prediction as approximate semantic entailment. Nat Mach Intell. 2024;6:220–8.
  44. Mandler H, Weigand B. A review and benchmark of feature importance methods for neural networks. ACM Comput Surv. 2024;56(12):1–30.