Comparing statistical learning methods for complex trait prediction from gene expression

  • Noah Klimkowski Arango,

    Roles Data curation, Formal analysis, Investigation, Software, Validation, Visualization, Writing – original draft, Writing – review & editing

    Affiliations Center for Human Genetics, Clemson University, Greenwood, SC, United States of America, Department of Genetics and Biochemistry, Clemson University, Clemson, SC, United States of America

  • Fabio Morgante

    Roles Conceptualization, Funding acquisition, Methodology, Project administration, Resources, Supervision, Validation, Writing – original draft, Writing – review & editing

    fabiom@clemson.edu

    Affiliations Center for Human Genetics, Clemson University, Greenwood, SC, United States of America, Department of Genetics and Biochemistry, Clemson University, Clemson, SC, United States of America

Abstract

Accurate prediction of complex traits is an important task in quantitative genetics. Genotypes have been used for trait prediction using a variety of methods such as mixed models, Bayesian methods, penalized regression methods, dimension reduction methods, and machine learning methods. Recent studies have shown that gene expression levels can produce higher prediction accuracy than genotypes. However, only a few prediction methods were tested in these studies. Thus, a comprehensive assessment of methods is needed to fully evaluate the potential of gene expression as a predictor of complex trait phenotypes. Here, we used data from the Drosophila Genetic Reference Panel (DGRP) to compare the ability of several existing statistical learning methods to predict starvation resistance and startle response from gene expression in the two sexes separately. The methods considered differ in assumptions about the distribution of gene effects—ranging from models that assume that every gene affects the trait to more sparse models—and their ability to capture gene-gene interactions. We also used functional annotation (i.e., Gene Ontology (GO)) as a source of biological information to inform prediction models. The results show that differences in prediction accuracy exist. For example, methods performing variable selection achieved higher prediction accuracy for starvation resistance in females, while they generally had lower accuracy for startle response in both sexes. Incorporating GO annotations further improved prediction accuracy for a few GO terms of biological significance. Biological significance extended to the genes underlying highly predictive GO terms. Notably, the Insulin-like Receptor (InR) was prevalent across methods and sexes for starvation resistance. For startle response, crumbs (crb) and imaginal disc growth factor 2 (Idgf2) were found for females and males, respectively. 
Our results confirmed the potential of transcriptomic prediction and highlighted the importance of selecting appropriate methods and strategies in order to achieve accurate predictions.

Introduction

Predicting yet-to-be observed phenotypes for complex traits is an important task for many branches of quantitative genetics. Complex trait prediction was developed in agricultural breeding to select the best performing individuals for economically important traits such as milk yield in dairy cattle using estimated breeding values (EBVs). While EBVs have been traditionally computed using pedigree information, the availability of genotyping arrays has made it possible to compute genomic EBVs (GEBVs) [1, 2]. GEBVs are linear combinations of the genotypes of the target individuals and the effect sizes of many genetic variants along the genome computed in a reference population. The same concept has later been applied to human genetics, especially in the context of precision medicine. Here, the goal is to predict medically relevant phenotypes such as body mass index (BMI) or disease susceptibility using Polygenic Scores (PGSs) [3, 4]. While GEBVs and PGSs are technically the same, the different goals of these two quantities (i.e., selection for GEBVs and prevention/monitoring for PGSs) have important implications. We refer the readers to [2] for a comprehensive treatment of this topic.

The estimation of the effect sizes of genetic variants to be used for prediction can be done using a variety of methods. The most common methods have regression at their core, where the response variable is the phenotype of interest and the predictor variables are the genotypes for a set of genetic variants [5]. Since the number of genetic variants, p, is usually much larger than the sample size, n, (the well-known p ≫ n problem in statistics [6]), methods that perform variable selection or regularization of the effect sizes are needed. These methods encompass dimension reduction methods (e.g., principal components regression), penalized regression methods (e.g., ridge regression), linear mixed models (e.g., GBLUP), Bayesian methods (e.g., BayesC), and machine learning (e.g., random forest) [7–11]. These methods differ in the assumptions they make regarding the distribution of the effect sizes, with some methods performing only effect shrinkage and some methods performing variable selection as well [5]. Research focused on comparing several methods has shown that there is no single best method, with performance being affected by the genetic architecture of the trait of interest (e.g., sparse vs dense), the biology of the species (e.g., the extent of linkage disequilibrium), and the assumptions of the method [12, 13].

Traditionally, genotype data have been used for complex trait prediction since they are easy and cost-effective to obtain. However, it is now possible to obtain multidimensional molecular data such as gene expression or metabolite levels at a reasonable cost. This advancement has made the use of additional layers of data for complex trait prediction feasible. Broadly, genetic information flows from DNA to RNA to proteins, and then to metabolites, which affect phenotypes [14]. Using these intermediate layers of data could improve prediction accuracy for some traits. In addition to being biologically ‘closer’ to phenotypes, these molecular data can be thought of as endophenotypes, which are affected by environmental conditions as well as genetic effects. Thus, endophenotypes could capture environmental and gene-by-environment effects [15].

Recent work has shown that using additional omic data types can result in more accurate predictions [15–20]. In particular, transcriptomic data has shown good potential for improving prediction accuracy. For example, Wheeler et al. used lymphoblastoid cell line data to show that gene expression levels provided much higher accuracy than genotypes when predicting intrinsic growth rate [16]. Morgante et al. found that prediction accuracy of starvation resistance in Drosophila melanogaster was higher when using gene expression levels instead of genotypes [18].

While these studies have shown the potential of gene expression as a predictor of complex phenotypes, only a few statistical methods were used, with most studies using linear mixed models. Studies using genotypes have found that prediction accuracy may vary substantially depending on the method used. However, a comprehensive comparison of methods for transcriptomic prediction is currently missing (to the best of our knowledge). Thus, in this study, we sought to compare several common methods spanning dimension reduction, penalized linear regression, Bayesian linear regression, linear mixed model, and machine learning in their ability to predict starvation resistance and startle response from gene expression levels using data from the Drosophila Genetic Reference Panel [21].

Materials and methods

Data processing

The Drosophila melanogaster Genetic Reference Panel (DGRP) is a collection of over 200 inbred lines that have full genome sequences and phenotypic measurements for several traits [22]. Additionally, prior work from [23] obtained sex-specific whole-body transcriptome profiles for 200 DGRP lines by RNA sequencing. We followed the steps described in [18] to filter for genetically variable and highly expressed genes. This filtering process resulted in 11,338 genes in females and 13,575 genes in males.

In this work, we chose starvation resistance as a model complex trait because it can be predicted with decent accuracy considering the small sample size [18]. We analyzed the two sexes separately to account for the presence of genetic variation in sexual dimorphism (i.e., cross-sex genetic correlation significantly different from 1) in starvation resistance [22, 24, 25]. We also analyzed startle response, a complex behavioral trait with low genetic variation in sexual dimorphism [22]. Despite the low genetic variation in sexual dimorphism for startle response, we still analyzed the two sexes separately because differences between sexes also exist in the transcriptome [23]. We used line means adjusted for the effect of Wolbachia infection and major inversions [21]. After removing lines with missing phenotypic measurements or gene expression profiles, we were left with 198 and 199 lines for starvation resistance and startle response, respectively, for use in all further analyses.

Transcriptomic prediction

Unless stated, prediction methods assessed in this study follow the general multiple regression model:

$$y = X\beta + e \tag{1}$$

where y is an n-vector of phenotypes for n lines, X is an n × m matrix of expression levels for m genes, β is an m-vector of effect sizes, and e is an n-vector of residuals. We assume that the columns of X and y have been centered to mean 0.

Principal Component Regression (PCR).

PCR [26] uses Principal Component Analysis [27] to reduce the dimensionality of the predictor matrix, X, by selecting a set of k orthogonal components that are linear combinations of the original predictors and maximize their variance. Then, the n × k matrix of principal components, T, is used in place of X in Eq 1 [28]. We used the algorithm implemented in the package v. 2.8–2 [29] with default parameters. We used 5-fold cross validation in the training set to select the number of principal components to be used for prediction in the test set.

Partial Least Squares Regression (PLSR).

Like PCR, PLSR [30] also reduces the dimensionality of the predictor matrix, X. However, this is achieved via a simultaneous decomposition of X and y to select a set of k components that maximizes the covariance between X and y. This addresses a limitation of PCR where the components that best ‘explain’ X may not necessarily be the most relevant to y. Then, the n × k matrix of latent vectors, T, is used in place of X in Eq 1 [31]. We used the algorithm implemented in the package v. 2.8–2 [29] and chose the ‘widekernelpls’ method, which is suited for the wide (m ≫ n) matrix of gene expression, along with other default parameters. We used 5-fold cross validation in the training set to select the number of latent vectors to be used for prediction in the test set. For the startle response analysis, the maximum number of iterations was increased from 100, the default value, to 500 due to complications with model convergence.

Ridge Regression (RR).

Ridge Regression [7] is a penalized regression method that uses an ℓ2 penalty to achieve shrinkage of the estimates of the effect sizes. The amount of shrinkage is determined by a tuning parameter, λ, such that large values of λ result in more shrinkage. We used the algorithm implemented in the package v. 4.1–8 [32]. We used 5-fold cross validation to select λ. All other parameters were left as their default values.

Least Absolute Shrinkage Selector Operator (LASSO).

LASSO [33] is a penalized regression method that uses an ℓ1 penalty to perform both variable selection (by setting some effects to be exactly 0) and shrinkage of the estimates of the effect sizes. The amount of shrinkage and variable selection is determined by a tuning parameter, λ, such that large values of λ result in more shrinkage. We used the algorithm implemented in the package v. 4.1–8 [32]. We used 5-fold cross validation to select λ. All other parameters were left as their default values.
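The difference between shrinkage only (ridge) and shrinkage plus selection (LASSO) is easiest to see for a single effect. The sketch below is illustrative only and is not the implementation used in the study: it assumes a simplified one-dimensional problem in which `b` is a hypothetical unpenalized estimate, and it uses the textbook closed-form minimizers of the two penalized objectives named in the docstrings.

```python
def ridge_shrink(b, lam):
    """Minimizer of 0.5*(b - beta)**2 + 0.5*lam*beta**2: proportional shrinkage."""
    return b / (1.0 + lam)


def lasso_shrink(b, lam):
    """Minimizer of 0.5*(b - beta)**2 + lam*abs(beta): soft thresholding."""
    magnitude = max(abs(b) - lam, 0.0)
    if magnitude == 0.0:
        return 0.0  # effects smaller than lam are set exactly to zero
    return magnitude if b > 0 else -magnitude


# Hypothetical unpenalized estimates for three genes.
unpenalized = [2.0, -0.3, 0.05]
lam = 0.5

ridge_effects = [ridge_shrink(b, lam) for b in unpenalized]
lasso_effects = [lasso_shrink(b, lam) for b in unpenalized]
```

With these inputs, every ridge estimate is pulled toward zero but remains non-zero, whereas the two small LASSO estimates become exactly zero and only the large effect survives.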

BayesC.

BayesC [9] is a Bayesian regression method that imposes a spike-and-slab prior on the effect sizes:

$$\beta_j \sim \pi \, N(0, \sigma_\beta^2) + (1 - \pi) \, \delta_0 \tag{2}$$

where π is the probability that the effect of the jth variable comes from a Normal distribution with mean 0 and variance $\sigma_\beta^2$, and δ0 is a point-mass at 0. In this way, both variable selection and effect shrinkage are achieved. In the implementation from the BGLR package v. 1.1.0 [34], the posterior distribution of the effect sizes and some model parameters are computed using Markov Chain Monte Carlo (MCMC) methods. We ran the algorithm for 130,000 iterations, discarded the first 30,000 samples as burn-in, and retained every 50th sample. We assessed convergence through visual inspection of the trace plots. The expected proportion of variance explained by the predictors, R2, was set to 0.8 in accordance with the broad sense heritability estimates of line means for starvation resistance and startle response [22, 25].
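To make the spike-and-slab prior concrete, the following sketch draws effect sizes from it: each effect comes from the Normal "slab" with probability π and is exactly zero otherwise. This only illustrates the prior, not the MCMC sampler, and the values of `pi`, `slab_sd`, and `m` are hypothetical choices made for the example.

```python
import random


def draw_spike_and_slab(m, pi, slab_sd, rng):
    """Draw m effects: N(0, slab_sd^2) with probability pi, exactly 0 otherwise."""
    effects = []
    for _ in range(m):
        if rng.random() < pi:
            effects.append(rng.gauss(0.0, slab_sd))  # slab component
        else:
            effects.append(0.0)                      # point mass at zero
    return effects


rng = random.Random(1)  # fixed seed so the draw is reproducible
effects = draw_spike_and_slab(m=10_000, pi=0.05, slab_sd=0.1, rng=rng)
share_nonzero = sum(e != 0.0 for e in effects) / len(effects)
```

The share of non-zero draws is close to `pi`, which is why a small π encodes the prior belief that only a few genes affect the trait.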

Variational Bayesian Variable Selection (VARBVS).

VARBVS is a Bayesian regression method that imposes the same spike-and-slab prior as BayesC. However, VARBVS computes posterior distributions using Variational Inference, which is computationally more efficient than MCMC [35]. The algorithm implemented in the package v. 2.6–10 [35] was fit with default parameters.

Multiple Regression with Adaptive Shrinkage (MR.ASH).

MR.ASH is a Bayesian regression method that imposes a scale mixture-of-Normals prior on the effect sizes:

$$\beta_j \sim \sum_{k=1}^{K} \pi_k \, N(0, \sigma_k^2) \tag{3}$$

for a fixed grid of variances, $\sigma_1^2, \ldots, \sigma_K^2$. Thus, like BayesC and VARBVS, MR.ASH performs both variable selection and effect shrinkage. MR.ASH is able to model complex distributions of the effect sizes thanks to a more flexible prior than VARBVS and BayesC. MR.ASH uses a Variational Empirical Bayes approach to estimate the prior (i.e., the mixture weights, π) from the data and compute the posterior distribution of the effect sizes [36]. The algorithm implemented in the package v. 0.1–43 [37] was fit with default parameters and initialized using the effect size estimates from LASSO.

Transcriptomic Best Linear Unbiased Predictor (TBLUP).

TBLUP [38] is a linear mixed model that aggregates the effects of all of the genes into a single random effect. Let $W = [w_1, \ldots, w_m]$, where wj is a standardized version of xj to have unit variance, then:

$$y = t + e \tag{4}$$

where t is an n-vector of transcriptomic effects, $t \sim N(0, T\sigma_t^2)$, and $T = WW^\top / m$ is the Transcriptomic Relationship Matrix (TRM). TBLUP was implemented using the BGLR package v. 1.1.0 [34]. BGLR uses a Bayesian approach to estimate the transcriptomic and residual variance components. We ran the algorithm for 85,000 iterations, discarded the first 10,000 samples as burn-in, and retained every 50th sample. We assessed convergence through visual inspection of the trace plots. The expected proportion of variance explained by the predictors, R2, was set to 0.8 in accordance with the broad sense heritability estimates of line means for starvation resistance and startle response [22, 25]. All other parameters were left as their default values.
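As an illustration of how a Transcriptomic Relationship Matrix can be built, the sketch below standardizes each gene (column) and forms the scaled cross-product WW'/m. The cross-product form and the use of the sample variance are assumptions made for this example, the expression values are hypothetical, and the code is not taken from the study.

```python
import math


def standardize_columns(X):
    """Center each column and scale it to unit sample variance (n - 1 denominator)."""
    n, m = len(X), len(X[0])
    W = [[0.0] * m for _ in range(n)]
    for j in range(m):
        col = [X[i][j] for i in range(n)]
        mean = sum(col) / n
        sd = math.sqrt(sum((v - mean) ** 2 for v in col) / (n - 1))
        for i in range(n):
            W[i][j] = (col[i] - mean) / sd  # assumes the gene varies across lines
    return W


def relationship_matrix(W):
    """Return W W' / m, an n x n matrix of expression similarity between lines."""
    n, m = len(W), len(W[0])
    return [[sum(W[i][k] * W[j][k] for k in range(m)) / m for j in range(n)]
            for i in range(n)]


# Hypothetical expression values: 4 lines (rows) x 3 genes (columns).
X = [[5.1, 2.0, 7.3],
     [4.8, 2.4, 6.9],
     [5.6, 1.7, 7.8],
     [5.0, 2.2, 7.1]]
T = relationship_matrix(standardize_columns(X))
```

The result is symmetric with one row per line, so its size depends on the number of lines rather than on the number of genes. That is what lets the model summarize all genes in a single random effect.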

All of the previous methods assume that no gene-gene interactions affect the phenotype. Thus, we decided to add some more flexible machine learning methods capable of capturing interaction effects to the comparison.

Random Forest (RF).

Random Forest is a machine learning method whereby a collection of decision trees is grown, each on a different bootstrap sample of the data [11]. This method has been used to identify gene-gene interactions successfully [39]. The model is given by:

$$y = \sum_{s=1}^{S} c_s h_s(y; X) + e \tag{5}$$

where S is the number of decision trees, cs is a shrinkage factor that averages the trees, and hs(y;X) is a decision tree that is grown using only a subset of predictors at each node [11]. The algorithm implemented in the package v. 1.2–20 [40] was fit with 1000 trees and default parameters.

Neural Networks (NN).

Artificial Neural Networks are a type of machine learning method that use layers of nodes, or neurons, to process data similar to how the human brain works. Neural networks are built using input layers, hidden layers, and output layers [41]. Networks can vary by hidden layer count, neuron count per layer, and activation function per layer. Neuron count selection is a fundamental problem in constructing networks [42]. Neural networks use nonlinear activation functions to determine whether neurons in hidden layers should be activated based on their inputs. This feature can be used to model gene-gene interactions [43]. The neural network implemented through the package v1.44.2 [44] used default parameters and a custom neuron structure with a hidden layer of 1,000 neurons. In our model, weights for each neuron in the hidden layer were learned using resilient backpropagation [45].

Gene Ontology-informed transcriptomic prediction

While some of the methods above try to enrich the prediction model for genes that are particularly predictive of the trait by performing internal variable selection, this procedure becomes difficult with a small sample size such as in the DGRP. Informing prediction models with functional annotation has been shown to be effective at disentangling signal from noise and improving accuracy in complex trait prediction [18, 46–48]. Edwards et al. [46] and Morgante et al. [18] used Gene Ontology (GO) annotations [49] to improve prediction accuracy for three complex traits in Drosophila. However, these applications only used BLUP-type models to include GO information. Here, we tested two additional methods described below. For each sex, we selected GO terms that included at least five genes present in the DGRP expression data, in line with previous work [18]. This procedure resulted in 2,628 terms for females and 2,580 terms for males being retained for further analysis. For all methods, GO-informed models were fit with one GO term at a time for all GO terms specified for each sex.
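The GO term filter described above (keep a term only if at least five of its genes are present in the expression data) can be sketched as follows. The function name, the dictionary layout, and the gene and term identifiers are hypothetical and serve only to illustrate the rule.

```python
def filter_go_terms(go_to_genes, expressed_genes, min_genes=5):
    """Keep GO terms with at least min_genes members present in the expression data."""
    expressed = set(expressed_genes)
    kept = {}
    for term, genes in go_to_genes.items():
        present = sorted(set(genes) & expressed)  # drop genes without expression data
        if len(present) >= min_genes:
            kept[term] = present
    return kept


# Hypothetical annotation and gene identifiers, for illustration only.
go_to_genes = {
    "GO:A": ["g1", "g2", "g3", "g4", "g5", "g9"],  # 5 of 6 genes are expressed
    "GO:B": ["g1", "g2", "g7"],                    # only 2 expressed genes
}
expressed_genes = ["g1", "g2", "g3", "g4", "g5", "g6"]
kept = filter_go_terms(go_to_genes, expressed_genes)
```

Because the expression data were filtered separately in each sex, applying the same rule per sex can retain a different number of terms in females and males.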

Sparse Group LASSO.

Sparse group LASSO [50] is a penalized regression method that uses a combination of the ℓ1 penalty and a group LASSO penalty [50]. The Group LASSO [51] applies variable selection on entire groups of predictors, while the ℓ1 penalty achieves effect shrinkage and variable selection at the individual variable level. The strength of the penalties is determined by a tuning parameter, λ, such that larger values of λ result in more shrinkage/selection. In our application, one group included all of the genes in the selected GO term and the other group included all of the remaining genes. We used the Sparse Group LASSO implementation in the package v1.0.2 [52] with default parameters.

GO-BayesC.

GO-BayesC is an extension of BayesC that imposes independent spike-and-slab priors on the effect sizes of genes grouped by GO term association. It follows the model

$$y = X_{GO}\beta_{GO} + X_{notGO}\beta_{notGO} + e \tag{6}$$

where XGO is the subset of X containing the genes associated with the selected GO term, βGO is the vector of effects of the genes in the selected GO term, XnotGO is the subset of X containing all other genes, and βnotGO is the vector of effects of all other genes. βGO,j and βnotGO,j are assigned separate spike-and-slab priors as in Eq 2, with group-specific prior inclusion probability and effect size variance. This method uses the same algorithm from the BGLR package [34]. We ran the algorithm for 130,000 iterations, discarded the first 30,000 samples as burn-in, and retained every 50th sample. We assessed convergence through visual inspection of the trace plots. The expected proportion of variance explained by the predictors, R2, was set to 0.8 in accordance with the broad sense heritability estimates of line means for starvation resistance and startle response [22, 25].

GO-TBLUP.

GO-TBLUP is an extension of TBLUP that includes two random effects: one associated with genes in the selected GO term and one associated with all of the other genes. It follows the model

$$y = t_{GO} + t_{notGO} + e \tag{7}$$

where tGO is an n-vector of transcriptomic effects associated with genes in the GO term, $t_{GO} \sim N(0, T_{GO}\sigma_{t_{GO}}^2)$, $T_{GO} = W_{GO}W_{GO}^\top / m_{GO}$, WGO is the subset of W containing the $m_{GO}$ genes associated with the selected GO term, tnotGO is an n-vector of transcriptomic effects associated with all other genes, $t_{notGO} \sim N(0, T_{notGO}\sigma_{t_{notGO}}^2)$, $T_{notGO} = W_{notGO}W_{notGO}^\top / m_{notGO}$, and WnotGO is the subset of W containing all $m_{notGO}$ other genes. GO-TBLUP was implemented using the BGLR package v. 1.1.0 [34]. BGLR uses a Bayesian approach to estimate the transcriptomic and residual variance components. We ran the algorithm for 85,000 iterations, discarded the first 10,000 samples as burn-in, and retained every 50th sample. We assessed convergence through visual inspection of the trace plots. The expected proportion of variance explained by the predictors, R2, was set to 0.8 in accordance with the broad sense heritability estimates of line means for starvation resistance and startle response [22, 25]. All other parameters were left as their default values.
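The MCMC settings reported for the BayesC-type and TBLUP-type models imply a specific number of posterior samples behind each estimate. The arithmetic is sketched below, under the assumption that thinning is applied to the iterations that remain after burn-in. The helper name is hypothetical, and the exact bookkeeping inside a given sampler may differ slightly.

```python
def retained_samples(n_iter, burn_in, thin):
    """Number of posterior samples kept after discarding burn-in and thinning."""
    return (n_iter - burn_in) // thin


# Settings reported for BayesC and GO-BayesC.
bayesc_kept = retained_samples(n_iter=130_000, burn_in=30_000, thin=50)

# Settings reported for TBLUP and GO-TBLUP.
tblup_kept = retained_samples(n_iter=85_000, burn_in=10_000, thin=50)
```

Under this assumption, the BayesC-type chains contribute 2,000 samples and the TBLUP-type chains 1,500 samples to each posterior summary.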

Evaluation scheme

We fitted each method to 90% of the data (i.e., the training set) to estimate the model parameters. The trained model was then validated by predicting phenotypes for the remaining 10% of the data (i.e., the test set). Prediction accuracy was measured as the correlation coefficient between the observed and predicted phenotypes. We repeated this procedure for 25 random training-test splits and used the average correlation across splits as our final metric to assess prediction accuracy.
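The evaluation scheme can be sketched as a repeated hold-out loop. The code below is a minimal illustration, not the study's pipeline: the toy data, the seeds, and the one-predictor regression standing in for a prediction method are all hypothetical.

```python
import random
import statistics


def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    sab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    saa = sum((u - ma) ** 2 for u in a)
    sbb = sum((v - mb) ** 2 for v in b)
    return sab / (saa * sbb) ** 0.5


def repeated_holdout(y, fit_predict, n_splits=25, test_fraction=0.1, seed=0):
    """Average test-set correlation over random training-test splits."""
    rng = random.Random(seed)
    n_test = round(len(y) * test_fraction)
    accuracies = []
    for _ in range(n_splits):
        idx = list(range(len(y)))
        rng.shuffle(idx)
        test, train = idx[:n_test], idx[n_test:]
        predicted = fit_predict(train, test)  # model sees training lines only
        accuracies.append(pearson([y[i] for i in test], predicted))
    return statistics.fmean(accuracies), accuracies


# Hypothetical toy data: one predictor and a noisy phenotype for 198 lines.
data_rng = random.Random(42)
x = [data_rng.gauss(0, 1) for _ in range(198)]
y = [0.8 * xi + data_rng.gauss(0, 1) for xi in x]


def simple_regression(train, test):
    """Stand-in for any prediction method: one-predictor least squares."""
    mx = statistics.fmean([x[i] for i in train])
    my = statistics.fmean([y[i] for i in train])
    slope = (sum((x[i] - mx) * (y[i] - my) for i in train)
             / sum((x[i] - mx) ** 2 for i in train))
    return [my + slope * (x[i] - mx) for i in test]


mean_r, per_split = repeated_holdout(y, simple_regression)
```

Any of the methods compared here could replace `simple_regression`, provided that tuning steps such as cross validation use only the training indices.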

Results

The figures and tables for the analyses of starvation resistance are reported in the main text, while those for startle response can be found in the supplementary materials.

Transcriptomic prediction

We first fitted a few widely used prediction methods and compared their accuracy. The results for starvation resistance are shown in Fig 1 and S1 Table. Overall, prediction accuracy was low to moderate for all methods, especially considering that the analyses were based on line means of many individual flies, which substantially increases the broad sense heritability of the trait to values around 0.8 (the broad sense heritability of line means is $H^2 = \sigma_G^2 / (\sigma_G^2 + \sigma_E^2 / f)$, where $\sigma_G^2$ is the genetic variance, $\sigma_E^2$ is the environmental variance, and f is the number of measured flies per line) [22, 53]. However, differences in prediction accuracy between methods existed, both within each sex and across sexes. In males, we found that TBLUP (r = 0.455±0.035) and Ridge Regression (r = 0.455±0.034) provided the highest accuracy, with BayesC (r = 0.448±0.033), PLSR (r = 0.442±0.034), and PCR (r = 0.430±0.039) being competitive. On the other hand, Neural Network provided the lowest prediction accuracy for males (r = 0.143±0.079). In females, we observed more marked differences in prediction accuracy across methods. Methods that perform variable selection (i.e., VARBVS, MR.ASH, and LASSO) tended to perform better than the other methods, with VARBVS (r = 0.503±0.031) providing the highest accuracy. Neural Network provided the lowest prediction accuracy in females (r = 0.064±0.051) as well.
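The effect of averaging over flies on broad sense heritability can be illustrated numerically. The sketch assumes the standard expression for the heritability of line means, in which the environmental variance is divided by the number of flies measured per line. The variance components and the value of f are hypothetical and are not estimates from the DGRP.

```python
def line_mean_heritability(var_g, var_e, f):
    """Broad sense heritability of line means: var_g / (var_g + var_e / f)."""
    return var_g / (var_g + var_e / f)


# Hypothetical variance components, chosen only to illustrate the effect of f.
var_g, var_e = 1.0, 5.0

h2_single_fly = line_mean_heritability(var_g, var_e, f=1)   # one fly per line
h2_line_mean = line_mean_heritability(var_g, var_e, f=20)   # mean of 20 flies
```

With these made-up values, heritability rises from about 0.17 for a single fly to 0.8 for the mean of 20 flies, because averaging removes most of the environmental noise.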

Fig 1. Prediction accuracy for starvation resistance.

Prediction accuracy of 25 replicates in females (A) and males (B) for all standard methods. Methods are colored by family, where dimension reduction (blue), penalized regression (cyan), Bayesian regression (lime), linear mixed model (orange), and machine learning methods (red) are ordered from left to right. The mean correlation coefficient is denoted by diamonds. Outliers are denoted by circles.

https://doi.org/10.1371/journal.pone.0317516.g001

The results for startle response are presented in S1 Fig and S2 Table. A few methods (i.e., PCR, PLSR, RR, LASSO, and NN) produced intercept-only models or failed to converge, resulting in no correlation coefficient to measure prediction accuracy for some replicates (S3 Table); an intercept-only model predicts the same value for every line, so the correlation is undefined. In particular, LASSO produced intercept-only models for 23 (92%) and 12 (48%) replicates in females and males, respectively. PLSR failed to converge for 20 (80%) replicates in females. Thus, we excluded LASSO and PLSR from our comparisons because each lacked an accuracy estimate for more than 40% of replicates in at least one sex, making the results unreliable. Of the remaining methods, the best performing ones were BayesC (r = 0.182±0.045 for females, r = 0.273±0.041 for males) and TBLUP (r = 0.173±0.045 for females, r = 0.258±0.042 for males) in both sexes. In females, all of the other methods achieved a correlation lower than 0.100, with PCR performing the worst (r = −0.025 ± 0.069; 15 replicates). In males, RF (r = 0.251 ± 0.047) and MR.ASH (r = 0.236 ± 0.049) achieved prediction accuracies comparable to the top methods. All of the remaining methods achieved a correlation lower than 0.200, with Neural Network achieving the lowest prediction accuracy (r = −0.028 ± 0.053; 15 replicates).

Collectively, these results show that differences in prediction accuracy between methods are present, with performance being dependent on the genetic architecture of the trait and the methods’ assumptions.

Gene Ontology informed transcriptomic prediction

It has been shown previously that informing prediction models with functional information can improve prediction accuracy [18, 46–48]. Thus, we also tested methods that could include external information. In this work, we focused on GO annotation and extensions of BayesC and TBLUP, namely GO-BayesC and GO-TBLUP. The results are summarized in Fig 2 and S4 Table for starvation resistance, and in S2 Fig and S5 Table for startle response. We also sought to use the Sparse Group LASSO. However, in our initial testing using starvation resistance, the prediction accuracies provided by that method were nearly identical for all GO terms tested (S3 Fig). This pattern was also seen for GO terms found to be highly predictive by GO-BayesC and GO-TBLUP. Thus, we decided not to assess Sparse Group LASSO further.

Fig 2. Prediction accuracy for starvation resistance using GO terms.

Prediction accuracy in the two sexes using GO-BayesC (A for females, C for males) and GO-TBLUP (B for females, D for males). Each dot represents the mean correlation between true and predicted phenotypes (r) across 25 replicates for a GO term. The solid line indicates the mean r from the respective standard method (i.e., BayesC and TBLUP). The dashed black line represents the 99th percentile of terms ranked by prediction accuracy.

https://doi.org/10.1371/journal.pone.0317516.g002

For both traits analyzed, we found that GO-BayesC and GO-TBLUP provided accuracies that were similar to or lower than the respective standard model (i.e., BayesC and TBLUP) for the majority of GO terms in both sexes. However, some GO terms seemed to be particularly predictive of the trait, yielding accuracies that were substantially higher than the standard models. For starvation resistance, the accuracies provided by GO-BayesC and GO-TBLUP generally agreed for both sexes (r = 0.836), as shown in Fig 3. The same pattern held for startle response in both females (r = 0.856) and males (r = 0.872), as shown in S4 Fig.

Fig 3. Correlation of prediction accuracy of GO-annotated methods for starvation resistance.

Prediction accuracy for all GO terms of GO-BayesC (x-axis) against GO-TBLUP (y-axis) for females (A) and males (B). The black line represents the line of least squares fit for each panel.

https://doi.org/10.1371/journal.pone.0317516.g003

In females, four of the five most predictive GO terms are shared between GO-BayesC and GO-TBLUP for starvation resistance. GO:0017056 (GO-BayesC r = 0.477 ± 0.034, GO-TBLUP r = 0.457 ± 0.041) and GO:0006606 (GO-BayesC r = 0.464 ± 0.039, GO-TBLUP r = 0.434 ± 0.044) are both related to nuclear import by function and structure, respectively. Lee et al. have demonstrated that starvation resistance induces nuclear pore degradation in yeast [54]. GO:0055088 (GO-BayesC r = 0.446 ± 0.040, GO-TBLUP r = 0.460 ± 0.036) and GO:0045819 (GO-BayesC r = 0.451 ± 0.038, GO-TBLUP r = 0.427 ± 0.042) are related to macromolecule metabolism in lipids and carbohydrates, respectively. GO:0055088 has been implicated in starvation resistance using Korean rockfish [55]. GO:0017056 was the most predictive term for GO-BayesC (r = 0.477 ± 0.034) and GO-TBLUP (r = 0.457 ± 0.041). However, some differences between methods existed. For example, GO:0016042, which is involved in lipid catabolism, was found to be highly predictive by GO-BayesC (r = 0.447 ± 0.038), while GO:0008586, which is involved in wing vein morphogenesis, was highly predictive in GO-TBLUP (r = 0.424 ± 0.039). For startle response, GO:0008061 (GO-BayesC r = 0.294 ± 0.043, GO-TBLUP r = 0.262 ± 0.046) and GO:0043066 (GO-BayesC r = 0.303 ± 0.042, GO-TBLUP r = 0.270 ± 0.047) were found within the top five most predictive GO terms in both methods for females. These terms cover chitin binding (GO:0008061) and negative regulation of apoptotic processes (GO:0043066).

In males, two of the five most predictive GO terms are shared between methods for starvation resistance. GO:0042593 (GO-BayesC r = 0.539 ± 0.032, GO-TBLUP r = 0.520 ± 0.035) is involved in glucose homeostasis, while GO:0035003 (GO-BayesC r = 0.526 ± 0.036, GO-TBLUP r = 0.512 ± 0.030) is involved in the subapical complex. The subapical complex is involved in nutrient acquisition in the intestines as part of the barrier between host cells and the gut microbiome [56]. Four of the top five GO terms found by either method are implicated in cellular growth and development. In GO-BayesC, GO:0042461 (r = 0.530 ± 0.036) is involved in photoreceptor cells, GO:0001738 (r = 0.530 ± 0.036) is involved in epithelial tissue, and GO:0045186 (r = 0.529 ± 0.037) is involved in the assembly of the zonula adherens. GO:0007485, which is involved in genital disc formation, was highly predictive in GO-TBLUP (r = 0.520 ± 0.030). Multiple studies have found connections between cell size regulation and overall body size with starvation resistance [57–59]. The top GO term for GO-TBLUP in males, GO:0035008 (r = 0.526 ± 0.026), is involved in the positive regulation of the melanization defense response. The biological connection of this process to starvation resistance is unclear, as this response increases oxidative stress in wounds to prevent infection [60]. For startle response, four of the top five most predictive GO terms are shared between methods in males. These terms include chitin binding (GO:0008061; GO-BayesC r = 0.294 ± 0.043, GO-TBLUP r = 0.262 ± 0.046), imaginal disc growth factor receptor binding (GO:0008084; GO-BayesC r = 0.368 ± 0.031, GO-TBLUP r = 0.365 ± 0.028), protein tyrosine phosphatase activity (GO:0004725; GO-BayesC r = 0.348 ± 0.032, GO-TBLUP r = 0.349 ± 0.033), and plasma membrane (GO:0005886; GO-BayesC r = 0.348 ± 0.037, GO-TBLUP r = 0.358 ± 0.035).

Overall, the increased prediction accuracy for both traits provided by specific GO terms highlights the usefulness of external information for improving accuracy.

Gene analysis

Given that many of the most predictive GO terms were biologically relevant to the traits analyzed, we investigated whether any particular genes were included in such terms. For starvation resistance, we selected the 1% most predictive GO terms for each method and sex (26 and 25 GO terms for females and males, respectively). For each prediction method and sex combination, we counted how many times each gene was found across these GO terms. We then examined the distribution of the count (S5 Fig) and decided to focus only on the most frequently occurring genes. This resulted in selecting genes appearing in 5 or more GO terms for GO-TBLUP in females and genes appearing in 4 or more GO terms for all other method and sex combinations. These results are summarized in Table 1 and S6 Table. For startle response, a similar procedure led us to selecting genes appearing in 4 or more top GO terms for all method and sex combinations (S6 Fig). These results are summarized in S7 and S8 Tables.
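The gene-counting procedure can be sketched as follows: rank GO terms by mean prediction accuracy, keep the top fraction, and count how often each gene appears across the retained terms. The function names, the rounding down of the 1% cutoff (which matches the 26 and 25 terms reported above), and the toy accuracies and memberships are assumptions made for illustration.

```python
from collections import Counter


def top_terms(accuracy_by_term, fraction=0.01):
    """Return the most predictive GO terms (at least one), best first."""
    ranked = sorted(accuracy_by_term, key=accuracy_by_term.get, reverse=True)
    n_top = max(1, int(len(ranked) * fraction))  # round the cutoff down
    return ranked[:n_top]


def frequent_genes(terms, go_to_genes, min_count=4):
    """Count each gene once per term and keep genes seen in >= min_count terms."""
    counts = Counter(g for t in terms for g in set(go_to_genes[t]))
    return {g: c for g, c in counts.items() if c >= min_count}


# Hypothetical accuracies and memberships; fraction=0.5 only because the toy set is tiny.
accuracy = {"GO:1": 0.50, "GO:2": 0.48, "GO:3": 0.47, "GO:4": 0.46,
            "GO:5": 0.20, "GO:6": 0.10, "GO:7": 0.05, "GO:8": 0.01}
go_to_genes = {"GO:1": ["geneA", "geneB"], "GO:2": ["geneA", "geneC"],
               "GO:3": ["geneA", "geneB"], "GO:4": ["geneA"],
               "GO:5": ["geneB"], "GO:6": ["geneC"],
               "GO:7": ["geneB"], "GO:8": ["geneC"]}

top = top_terms(accuracy, fraction=0.5)
frequent = frequent_genes(top, go_to_genes, min_count=4)
```

In this toy example only `geneA` appears in all four retained terms, so it is the only gene that passes the threshold of 4.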

Table 1. Most frequent genes across the top 1% of GO terms for starvation resistance.

https://doi.org/10.1371/journal.pone.0317516.t001

For both sexes, the results show that some overlapping genes were found by both GO-BayesC and GO-TBLUP, while some genes were picked up by only one method. For starvation resistance, significant enrichment (Fisher’s Exact Test P < 0.001) of protein kinases emerged when considering top genes across all setups. For startle response, top genes were enriched (Fisher’s Exact Test P < 0.001) for neuronal development and sensory receptors across all setups.
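For readers unfamiliar with the enrichment test mentioned above, a one-sided Fisher's Exact Test for over-representation reduces to the upper tail of a hypergeometric distribution. The sketch below is an illustration only: whether the analysis used a one-sided or two-sided test, and which background gene set was used, are not stated here, so both are assumptions, and the counts are invented.

```python
from math import comb


def fisher_enrichment_p(k, n, K, N):
    """One-sided Fisher's exact test p-value for over-representation.

    N: genes in the background; K: background genes carrying the annotation;
    n: genes in the selected list; k: selected genes carrying the annotation.
    Returns P(X >= k) under the hypergeometric null.
    """
    upper = min(n, K)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / comb(N, n)


# Invented counts for illustration only: 8 of 20 selected genes are annotated,
# against 300 annotated genes in a background of 12,000.
p_value = fisher_enrichment_p(k=8, n=20, K=300, N=12000)
print(f"one-sided p = {p_value:.3g}")
```

The sum is carried out in exact integer arithmetic and divided once at the end, which avoids accumulating floating-point error across the tail terms.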

In females, GO-BayesC and GO-TBLUP found AkhR, Akt1, InR, Egfr, and Erk7 in common out of the top 1% of GO terms for starvation resistance. Of these genes, AkhR, Akt1, and InR are related to insulin signaling and lipid metabolism [61–63]. Aside from insulin signaling, Egfr and Erk7 have been implicated in starvation resistance [64, 65]. For startle response, only crb, a gene involved in Notch regulation and photoreceptor morphogenesis [66], was shared between GO-BayesC and GO-TBLUP. For starvation resistance, most of the genes found by only GO-BayesC are from the nucleoporin family (mbo, Nup53, Nup54, Nup93–1, Nup93–2, Nup98–96, Nup153, and Nup205). The remaining genes are tangential to the InR signaling pathway. For startle response, aPKC was the only other gene found by GO-BayesC. For starvation resistance, eleven distinct genes found by GO-TBLUP only are involved in various complexes and pathways. hpo, Ack, and Sik3 are related to the Hippo or Salvador-Warts-Hippo signaling pathway [67–69]. pdk1 is part of the InR signaling pathway [70]. put and babo are involved in the activin signaling pathway [71]. hep is a kinase involved in regulation of gut metabolism [72]. For startle response, Orco, dco, and park were found by GO-TBLUP only. These genes are related to nervous system functions, where Orco is involved in olfactory receptors [73], dco is part of circadian signaling [74], and park is related to locomotion [75].

In males, GO-BayesC and GO-TBLUP found sdt, aPKC, PDZ-GEF, par-6, Patj, crb, Ilp2, and InR in common out of the top 1% of GO terms for starvation resistance. The two major categories that emerge from these genes are carbohydrate metabolism and cell polarization. For carbohydrate metabolism, the insulin-like receptor InR and an insulin-like peptide Ilp2 are key components of InR signaling [76]. InR is the only gene that was found by both GO methods in both sexes. The remaining genes are all related to cell polarization. From both methods, crb, sdt, and Patj are involved in the Crumbs complex [66], while aPKC and par-6 are involved in the PAR complex [77]. Outside of these complexes, PDZ-GEF is involved in epithelial cell polarization [78]. The startle response analysis yielded two prevalent genes in common between GO-BayesC and GO-TBLUP. Idgf2 is involved in stress response [79], while stg is involved in cell cycle progression [80]. Six genes were found by GO-BayesC only in the top 1% of GO terms in males for starvation resistance. baz, scrib, and dlg1 are related to the Scribble and PAR complexes [81], while Moe regulates the Crumbs complex [82]. AkhR, described above, and shg, which is part of the Egfr signaling pathway [83], were also found uniquely by GO-BayesC. For startle response, GO-BayesC found five Idgf genes (in addition to Idgf2) and neuromuscular genes wg, Ptp99A, and Ptp69D. Only two genes were found uniquely by GO-TBLUP in males for starvation resistance. Desat1 is a lipid desaturase [84], while Ras85D is an oncogenic cell growth promoter [85]. For startle response, GO-TBLUP uniquely found Notch, Cul3, and Dl.

Detailed descriptions of the genes and their connections with starvation resistance and startle response are given in S1 Text.

Discussion

In this study, we evaluated ten statistical methods on their ability to predict starvation resistance and startle response, two well-documented quantitative traits [22], using transcriptomic data [23]. As expected, we found differences in the prediction accuracy provided by the methods tested, both within each sex and between sexes. While most methods were somewhat predictive, neural networks provided minimal prediction accuracy for both sexes. This is in agreement with previous work showing the importance of feature selection prior to model fitting in the p ≫ n regime for neural networks to perform well [12]. However, we caution readers that our work focused on out-of-the-box performance of the different methods. Better performance may be achieved with additional fine-tuning of each method. For starvation resistance, the most predictive methods in females, VARBVS and MR.ASH, are both Bayesian regression methods that perform effect shrinkage and variable selection. These methods allow for the underlying genetic architecture to be sparse, suggesting that not all genes affect starvation resistance. In contrast, the most predictive methods in males (TBLUP and ridge regression) only perform effect shrinkage, which suggests that the genetic architecture of starvation resistance is denser in males. The difference in best performing methods between sexes is not surprising because starvation resistance is known to have high genetic variation in sexual dimorphism, i.e., the two sexes have different genetic architectures [22, 24, 25].
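The distinction between methods that only shrink effects and methods that also select variables can be made concrete with a textbook special case. The sketch below assumes an orthonormal design, where ridge regression rescales every least-squares estimate while the LASSO applies soft-thresholding (the exact formulas depend on the scaling convention of the penalty). The LASSO rule is used here as a simple stand-in for variable selection; VARBVS and MR.ASH rely on different, Bayesian machinery. The effect sizes are invented.

```python
import math


def ridge_shrink(b, lam):
    """Ridge estimate under an orthonormal design: every effect is scaled toward zero."""
    return [bj / (1.0 + lam) for bj in b]


def soft_threshold(b, lam):
    """LASSO estimate under an orthonormal design: small effects become exactly zero."""
    out = []
    for bj in b:
        magnitude = max(abs(bj) - lam, 0.0)
        out.append(math.copysign(magnitude, bj) if magnitude > 0 else 0.0)
    return out


# Invented least-squares effects: two large, three small.
ols = [2.0, -1.5, 0.3, -0.1, 0.05]
lam = 0.5

ridge = ridge_shrink(ols, lam)
lasso = soft_threshold(ols, lam)

print("ridge:", [round(v, 3) for v in ridge])  # all five effects remain non-zero
print("lasso:", lasso)                         # the three small effects are removed
```

A dense architecture with many small effects is better matched by the first rule, and a sparse architecture with a few larger effects by the second, which mirrors the contrast between sexes described above.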

The results for startle response, a complex behavioral trait with low genetic variation in sexual dimorphism, show that prediction accuracy for all methods is substantially lower than for starvation resistance and is higher in males than in females. However, we note that a few methods (i.e., PCR, PLSR, RR, LASSO, and NN) could not find signal in the data or failed to converge for several replicates. While this is a result in itself in that these methods seem less robust, it also limits the reliability of the comparisons. Among the methods that did not have any issues, BayesC and TBLUP achieved the highest prediction accuracies for both sexes, in agreement with the low genetic variation in sexual dimorphism for startle response.

Overall, the results for both traits highlight the importance of choosing methods with assumptions that match the genetic architecture of the trait under investigation. Additionally, prediction analysis can provide hypotheses about the genetic architecture of the traits, which can guide further experiments and analyses.

Previous studies have shown that including functional annotation information in prediction models can help improve prediction accuracy [18, 46–48]. Thus, we selected two methods that allow the incorporation of additional information, BayesC and TBLUP, and annotated them with Gene Ontology (GO) information. We showed that a small number of biologically relevant GO terms achieved substantially higher prediction accuracies for GO-BayesC and GO-TBLUP than the standard BayesC and TBLUP methods. While the correlation between prediction accuracies for each GO term between GO-BayesC and GO-TBLUP was high for both sexes (r = 0.84–0.87), differences in method assumptions resulted in different top GO terms between methods, especially in males. For starvation resistance, the most predictive GO terms for both sexes shared genes with biological connections with the trait (S1 Text). For example, InR and AkhR are involved in carbohydrate and lipid metabolism. Both genes have been associated with starvation resistance in previous studies [86, 87]. However, many of the genes shared by the most predictive GO terms in the two sexes were different (e.g., Egfr in females and sdt in males). These findings also suggest that both methods may be able to highlight sex-shared and sex-specific genes of interest for traits with unknown genetic architectures. The results for startle response showed similar trends. A few GO terms substantially increased prediction accuracy, with the most predictive GO terms (e.g., GO:0008061, chitin binding) suggesting a plausible biological relation to startle response through neurological mechanisms [88]. The most predictive GO terms also included genes related to startle response mechanisms (e.g., crb regulating sensory receptors in females, and Notch and Dl promoting neuron development in males [82, 89]).
Overall, our findings suggest that additional information from GO terms may help disentangle signal from noise to improve prediction accuracy and our understanding of the complex trait of interest.

In conclusion, we found that differences in prediction performance between methods exist and depend on the assumptions made by the model relative to the genetic architecture of the trait of interest. We also confirmed that external information, such as GO term annotation, can improve prediction accuracy for biologically relevant data. However, there are a number of limitations and considerations to address. First, the data are limited by the small number of available DGRP lines. In fact, the estimates of the effects of the p = 12,000 genes using only n = 200 lines may not be precise, resulting in low prediction accuracy. We hypothesize that increasing the sample size of the DGRP would improve signal detection and estimation for all methods and traits. Second, the linear regression performed by most methods used here does not account for interactions between genes and may perform poorly for complex traits with epistatic interactions. At the same time, accounting for interactions results in many more effects to be estimated, which suffers even more from small sample sizes such as that of the DGRP. Previous work describing a deep neural network architecture for detecting gene-gene interactions notes that these types of methods are generally applicable to larger data sets [43]. The poor performance of neural networks in our study also seems to confirm this issue. Third, the gene expression profiles were obtained from whole flies reared under standard conditions. Higher prediction accuracy may be achieved if gene expression were measured under the same conditions used to score starvation resistance and startle response. For similar reasons, higher prediction accuracy may also be achieved when using gene expression from only tissues relevant to the trait analyzed (e.g., brain tissues for startle response).
Despite these limitations, we have shown that gene expression data coupled with appropriate model selection and external information can be effective for complex trait prediction.

Supporting information

S1 Fig. Prediction accuracy for startle response.

Prediction accuracy for 25 replicates in females (A) and males (B) for all standard methods. Methods are colored by family, where dimension reduction (blue), penalized regression (cyan), Bayesian regression (lime), linear mixed model (orange), and machine learning methods (red) are ordered from left to right. The mean correlation coefficient is denoted by diamonds. Outliers are denoted by circles.

https://doi.org/10.1371/journal.pone.0317516.s001

(TIF)

S2 Fig. Prediction accuracy for startle response using GO terms.

Prediction accuracy in the two sexes using GO-BayesC (A for females, C for males) and GO-TBLUP (B for females, D for males). Each dot represents the mean correlation between true and predicted phenotypes (r) across 25 replicates for a GO term. The solid line indicates the mean r from the respective standard method (i.e., BayesC and TBLUP). The dashed black line represents the 99th percentile of terms ranked by prediction accuracy.

https://doi.org/10.1371/journal.pone.0317516.s002

(TIF)

S3 Fig. Prediction accuracy using Sparse Group LASSO.

Violin plot comparison of Sparse Group Lasso results for top GO terms from GO-BayesC/GO-TBLUP along with randomly selected GO terms in females.

https://doi.org/10.1371/journal.pone.0317516.s003

(TIF)

S4 Fig. Correlation of prediction accuracy of GO-annotated methods for startle response.

Prediction accuracy for all GO terms using GO-BayesC (x-axis) against GO-TBLUP (y-axis) for females (A) and males (B). The black line represents the line of least squares fit for each panel.

https://doi.org/10.1371/journal.pone.0317516.s004

(TIF)

S5 Fig. Histogram of genes present in top GO terms for starvation resistance.

Distribution of number of overlapping genes for the top 1% of GO terms in the two sexes using GO-BayesC (A for females, C for males) and GO-TBLUP (B for females, D for males). The selection cutoff is marked by the vertical bar.

https://doi.org/10.1371/journal.pone.0317516.s005

(TIF)

S6 Fig. Histogram of genes present in top GO terms for startle response.

Distribution of number of overlapping genes for the top 1% of GO terms in the two sexes using GO-BayesC (A for females, C for males) and GO-TBLUP (B for females, D for males). The selection cutoff is marked by the vertical bar.

https://doi.org/10.1371/journal.pone.0317516.s006

(TIF)

S1 Table. Prediction accuracy for starvation resistance.

Mean prediction accuracy and standard error for all methods in females and males.

https://doi.org/10.1371/journal.pone.0317516.s007

(XLSX)

S2 Table. Prediction accuracy for startle response.

Mean prediction accuracy and standard error for all methods in females and males.

https://doi.org/10.1371/journal.pone.0317516.s008

(XLSX)

S3 Table. Successful replicates for startle response.

Number of successful replicates out of 25 replicates for all methods in both sexes.

https://doi.org/10.1371/journal.pone.0317516.s009

(XLSX)

S4 Table. Prediction accuracy for starvation resistance for each GO term.

Mean prediction accuracy and standard error for each GO term in GO-BayesC and GO-TBLUP for females and males.

https://doi.org/10.1371/journal.pone.0317516.s010

(XLSX)

S5 Table. Prediction accuracy for startle response for each GO term.

Mean prediction accuracy and standard error for each GO term in GO-BayesC and GO-TBLUP for females and males.

https://doi.org/10.1371/journal.pone.0317516.s011

(XLSX)

S6 Table. All genes across top GO terms for starvation resistance.

Genes in the top 1% of GO terms for GO-BayesC and GO-TBLUP in females and males ordered by gene count.

https://doi.org/10.1371/journal.pone.0317516.s012

(XLSX)

S7 Table. Most frequent genes across the top 1% of GO terms for startle response.

Top overlapping genes across the 1% most predictive GO terms for each method and sex combination. Genes appearing in all four setups are highlighted in green. Genes appearing in both methods for females are highlighted in red. Genes appearing for both methods in males are highlighted in blue.

https://doi.org/10.1371/journal.pone.0317516.s013

(XLSX)

S8 Table. All genes across top GO terms for startle response.

Genes in the top 1% of GO terms for GO-BayesC and GO-TBLUP in females and males ordered by gene count.

https://doi.org/10.1371/journal.pone.0317516.s014

(XLSX)

S1 Text. Details about the genes underlying the most predictive GO terms.

Detailed description of the genes identified in the gene analysis, including their relevance to the traits of interest.

https://doi.org/10.1371/journal.pone.0317516.s015

(PDF)

Acknowledgments

We thank Trudy Mackay for helpful comments on an earlier version of this manuscript, and Liangjiang Wang for suggestions about the neural network analyses.

References

  1. Meuwissen TH, Hayes BJ, Goddard M. Prediction of total genetic value using genome-wide dense marker maps. Genetics. 2001;157(4):1819–1829. pmid:11290733
  2. Wray NR, Kemper KE, Hayes BJ, Goddard ME, Visscher PM. Complex trait prediction from genome data: contrasting EBV in livestock to PRS in humans: genomic prediction. Genetics. 2019;211(4):1131–1141. pmid:30967442
  3. Purcell SM, Wray NR, Stone JL, Visscher PM, O’Donovan MC, Sullivan PF, et al. Common polygenic variation contributes to risk of schizophrenia and bipolar disorder. Nature. 2009;460(7256):748–752. pmid:19571811
  4. Lewis CM, Vassos E. Polygenic risk scores: from research tools to clinical instruments. Genome medicine. 2020;12(1):44. pmid:32423490
  5. de Los Campos G, Hickey JM, Pong-Wong R, Daetwyler HD, Calus MP. Whole-genome regression and prediction methods applied to plant and animal breeding. Genetics. 2013;193(2):327–345. pmid:22745228
  6. Candes E, Tao T. The Dantzig selector: Statistical estimation when p is much larger than n. The Annals of Statistics. 2007;35(6):2313–2351.
  7. Hoerl AE, Kennard RW. Ridge Regression: Biased Estimation for Nonorthogonal Problems. Technometrics. 1970;12(1):55–67.
  8. de Los Campos G, Vazquez AI, Fernando R, Klimentidis YC, Sorensen D. Prediction of complex human traits using the genomic best linear unbiased predictor. PLoS genetics. 2013;9(7):e1003608. pmid:23874214
  9. Habier D, Fernando RL, Kizilkaya K, Garrick DJ. Extension of the Bayesian alphabet for genomic selection. BMC bioinformatics. 2011;12:1–12. pmid:21605355
  10. Massy WF. Principal components regression in exploratory statistical research. Journal of the American Statistical Association. 1965;60(309):234–256.
  11. Breiman L. Random Forests. Machine Learning. 2001;45(1):5–32.
  12. Azodi CB, Bolger E, McCarren A, Roantree M, de Los Campos G, Shiu SH. Benchmarking parametric and machine learning models for genomic prediction of complex traits. G3: Genes, Genomes, Genetics. 2019;9(11):3691–3702. pmid:31533955
  13. Ma Y, Zhou X. Genetic prediction of complex traits with polygenic scores: a statistical review. Trends in Genetics. 2021;37(11):995–1011. pmid:34243982
  14. Civelek M, Lusis AJ. Systems genetics approaches to understand complex traits. Nature Reviews Genetics. 2014;15(1):34–48. pmid:24296534
  15. Azodi CB, Pardo J, VanBuren R, de Los Campos G, Shiu SH. Transcriptome-based prediction of complex traits in maize. The Plant Cell. 2020;32(1):139–151. pmid:31641024
  16. Wheeler HE, Aquino-Michaels K, Gamazon ER, Trubetskoy VV, Dolan ME, Huang RS, et al. Poly-omic prediction of complex traits: OmicKriging. Genetic epidemiology. 2014;38(5):402–415. pmid:24799323
  17. Guo Z, Magwire MM, Basten CJ, Xu Z, Wang D. Evaluation of the utility of gene expression and metabolic information for genomic prediction in maize. Theoretical and applied genetics. 2016;129:2413–2427. pmid:27586153
  18. Morgante F, Huang W, Sørensen P, Maltecca C, Mackay TF. Leveraging multiple layers of data to predict drosophila complex traits. G3: Genes, Genomes, Genetics. 2020;10(12):4599–4613. pmid:33106232
  19. Zhou S, Morgante F, Geisz MS, Ma J, Anholt RR, Mackay TF. Systems genetics of the Drosophila metabolome. Genome Research. 2020;30(3):392–405. pmid:31694867
  20. Rohde PD, Kristensen TN, Sarup P, Muñoz J, Malmendal A. Prediction of complex phenotypes using the Drosophila melanogaster metabolome. Heredity. 2021;126(5):717–732. pmid:33510469
  21. Huang W, Massouras A, Inoue Y, Peiffer J, Ràmia M, Tarone AM, et al. Natural variation in genome architecture among 205 Drosophila melanogaster Genetic Reference Panel lines. Genome research. 2014;24(7):1193–1208. pmid:24714809
  22. Mackay TF, Richards S, Stone EA, Barbadilla A, Ayroles JF, Zhu D, et al. The Drosophila melanogaster genetic reference panel. Nature. 2012;482(7384):173–178. pmid:22318601
  23. Everett LJ, Huang W, Zhou S, Carbone MA, Lyman RF, Arya GH, et al. Gene expression networks in the Drosophila genetic reference panel. Genome research. 2020;30(3):485–496. pmid:32144088
  24. Harbison ST, Yamamoto AH, Fanara JJ, Norga KK, Mackay TF. Quantitative trait loci affecting starvation resistance in Drosophila melanogaster. Genetics. 2004;166(4):1807–1823. pmid:15126400
  25. Everman ER, McNeil CL, Hackett JL, Bain CL, Macdonald SJ. Dissection of complex, fitness-related traits in multiple Drosophila mapping populations offers insight into the genetic control of stress resistance. Genetics. 2019;211(4):1449–1467. pmid:30760490
  26. Hotelling H. THE RELATIONS OF THE NEWER MULTIVARIATE STATISTICAL METHODS TO FACTOR ANALYSIS. British Journal of Statistical Psychology. 1957;10(2):69–79.
  27. Pearson K. LIII. On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science. 1901;2(11):559–572.
  28. Jolliffe IT. A note on the use of principal components in regression. Journal of the Royal Statistical Society Series C: Applied Statistics. 1982;31(3):300–303.
  29. Liland KH, Mevik BH, Wehrens R. pls: Partial Least Squares and Principal Component Regression; 2023. Available from: https://CRAN.R-project.org/package=pls.
  30. Höskuldsson A. PLS regression methods. Journal of Chemometrics. 1988;2(3):211–228.
  31. Abdi H. Partial least square regression (PLS regression). Encyclopedia for research methods for the social sciences. 2003;6(4):792–795.
  32. Tay JK, Narasimhan B, Hastie T. Elastic Net Regularization Paths for All Generalized Linear Models. Journal of Statistical Software. 2023;106(1):1–31. pmid:37138589
  33. Tibshirani R. Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society Series B (Methodological). 1996;58(1):267–288.
  34. Perez P, de los Campos G. Genome-Wide Regression and Prediction with the BGLR Statistical Package. Genetics. 2014;198(2):483–495. pmid:25009151
  35. Carbonetto P, Stephens M. Scalable Variational Inference for Bayesian Variable Selection in Regression, and Its Accuracy in Genetic Association Studies. Bayesian Analysis. 2012;7(1):73–108.
  36. Kim Y, Wang W, Carbonetto P, Stephens M. A flexible empirical Bayes approach to multiple linear regression and connections with penalized regression. Journal of Machine Learning Research. 2024;25(185):1–59.
  37. Kim Y, Carbonetto P, Stephens M. mr.ash.alpha: Multiple Regression with Adaptive Shrinkage; 2023. Available from: https://github.com/stephenslab/mr.ash.alpha.
  38. Li Z, Gao N, Martini JW, Simianer H. Integrating gene expression data into genomic prediction. Frontiers in genetics. 2019;10:430679. pmid:30858865
  39. Orlenko A, Moore JH. A comparison of methods for interpreting random forest models of genetic association in the presence of non-additive interactions. BioData Mining. 2021;14(1):9. pmid:33514397
  40. Zeileis A, Hothorn T, Hornik K. Model-Based Recursive Partitioning. Journal of Computational and Graphical Statistics. 2008;17(2):492–514.
  41. Marchevsky AM. The Use of Artificial Neural Networks for the Diagnosis and Estimation of Prognosis in Cancer Patients. In: Outcome prediction in cancer. Elsevier; 2007. p. 243–259.
  42. O’Connor LJ, Schoech AP, Hormozdiari F, Gazal S, Patterson N, Price AL. Extreme polygenicity of complex traits is explained by negative selection. The American Journal of Human Genetics. 2019;105(3):456–476. pmid:31402091
  43. Cui T, El Mekkaoui K, Reinvall J, Havulinna AS, Marttinen P, Kaski S. Gene–gene interaction detection with deep learning. Communications Biology. 2022;5(1):1238. pmid:36371468
  44. Fritsch S, Guenther F, Wright MN. neuralnet: Training of Neural Networks; 2019. Available from: https://CRAN.R-project.org/package=neuralnet.
  45. Riedmiller M. Advanced supervised learning in multi-layer perceptrons—from backpropagation to adaptive learning algorithms. Computer Standards & Interfaces. 1994;16(3):265–278.
  46. Edwards SM, Sørensen IF, Sarup P, Mackay TF, Sørensen P. Genomic prediction for quantitative traits is improved by mapping variants to gene ontology categories in Drosophila melanogaster. Genetics. 2016;203(4):1871–1883. pmid:27235308
  47. Márquez-Luna C, Gazal S, Loh PR, Kim SS, Furlotte N, Auton A, et al. Incorporating functional priors improves polygenic prediction accuracy in UK Biobank and 23andMe data sets. Nature Communications. 2021;12(1):6052. pmid:34663819
  48. Zheng Z, Liu S, Sidorenko J, Wang Y, Lin T, Yengo L, et al. Leveraging functional genomic annotations and genome coverage to improve polygenic prediction of complex traits within and between ancestries. Nature Genetics. 2024. pmid:38689000
  49. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, et al. Gene ontology: tool for the unification of biology. Nature genetics. 2000;25(1):25–29.
  50. Simon N, Friedman J, Hastie T, Tibshirani R. A Sparse-Group Lasso. Journal of Computational and Graphical Statistics. 2013;22(2):231–245.
  51. Yuan M, Lin Y. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society Series B: Statistical Methodology. 2006;68(1):49–67.
  52. McDonald DJ, Liang X, Solón Heinsfeld A, Cohen A. sparsegl: Sparse Group Lasso; 2023. Available from: https://CRAN.R-project.org/package=sparsegl.
  53. Falconer DS. Introduction to quantitative genetics. Pearson Education India; 1996.
  54. Lee CW, Wilfling F, Ronchi P, Allegretti M, Mosalaganti S, Jentsch S, et al. Selective autophagy degrades nuclear pore complexes. Nature cell biology. 2020;22(2):159–166. pmid:32029894
  55. Han X, Wang J, Li B, Song Z, Li P, Huang B, et al. Analyses of regulatory network and discovery of potential biomarkers for Korean rockfish (Sebastes schlegelii) in responses to starvation stress through transcriptome and metabolome. Comparative Biochemistry and Physiology Part D: Genomics and Proteomics. 2023;46:101061. pmid:36796184
  56. Van IJzendoorn SC, Maier O, Van Der Wouden JM, Hoekstra D. The subapical compartment and its role in intracellular trafficking and cell polarity. Journal of cellular physiology. 2000;184(2):151–160. pmid:10867639
  57. Gergs A, Jager T. Body size-mediated starvation resistance in an insect predator. Journal of Animal Ecology. 2014;83(4):758–768. pmid:24417336
  58. Privalova V, Labecka AM, Szlachcic E, Sikorska A, Czarnoleski M. Systemic changes in cell size throughout the body of Drosophila melanogaster associated with mutations in molecular cell cycle regulators. Scientific Reports. 2023;13(1):7565. pmid:37160985
  59. Watson SP, Clements MO, Foster SJ. Characterization of the starvation-survival response of Staphylococcus aureus. Journal of bacteriology. 1998;180(7):1750–1758. pmid:9537371
  60. Tang H. Regulation and function of the melanization reaction in Drosophila. Fly. 2009;3(1):105–111. pmid:19164947
  61. Bharucha K, Tarr P, Zipursky S. A glucagon-like endocrine pathway in Drosophila modulates both lipid and carbohydrate homeostasis. Journal of Experimental Biology. 2008;211(19):3103–3110. pmid:18805809
  62. Nojima A, Yamashita M, Yoshida Y, Shimizu I, Ichimiya H, Kamimura N, et al. Haploinsufficiency of akt1 prolongs the lifespan of mice. PloS one. 2013;8(7):e69178. pmid:23935948
  63. Paaby AB, Bergland AO, Behrman EL, Schmidt PS. A highly pleiotropic amino acid polymorphism in the Drosophila insulin receptor contributes to life-history adaptation. Evolution. 2014;68(12):3395–3409. pmid:25319083
  64. Sevelda F, Mayr L, Kubista B, Lötsch D, van Schoonhoven S, Windhager R, et al. EGFR is not a major driver for osteosarcoma cell growth in vitro but contributes to starvation and chemotherapy resistance. Journal of Experimental & Clinical Cancer Research. 2015;34:1–12. pmid:26526352
  65. Zacharogianni M, Kondylis V, Tang Y, Farhan H, Xanthakis D, Fuchs F, et al. ERK7 is a negative regulator of protein secretion in response to amino-acid starvation by modulating Sec16 membrane association. The EMBO journal. 2011;30(18):3684–3700. pmid:21847093
  66. Pénalva C, Mirouse V. Tissue-specific function of Patj in regulating the Crumbs complex and epithelial polarity. Development. 2012;139(24):4549–4554. pmid:23136386
  67. Harvey K, Tapon N. The Salvador–Warts–Hippo pathway—an emerging tumour-suppressor network. Nature Reviews Cancer. 2007;7(3):182–191. pmid:17318211
  68. Hu L, Xu J, Yin MX, Zhang L, Lu Y, Wu W, et al. Ack promotes tissue growth via phosphorylation and suppression of the Hippo pathway component Expanded. Cell discovery. 2016;2(1):1–14. pmid:27462444
  69. Wehr MC, Holder MV, Gailite I, Saunders RE, Maile TM, Ciirdaeva E, et al. Salt-inducible kinases regulate growth through the Hippo signalling pathway in Drosophila. Nature cell biology. 2013;15(1):61–71. pmid:23263283
  70. Alessi DR, Deak M, Casamayor A, Caudwell FB, Morrice N, Norman DG, et al. 3-Phosphoinositide-dependent protein kinase-1 (PDK1): structural and functional homology with the Drosophila DSTPK61 kinase. Current biology. 1997;7(10):776–789. pmid:9368760
  71. Song W, Cheng D, Hong S, Sappe B, Hu Y, Wei N, et al. Midgut-derived activin regulates glucagon-like action in the fat body and glycemic control. Cell metabolism. 2017;25(2):386–399. pmid:28178568
  72. Karpac J, Biteau B, Jasper H. Misregulation of an adaptive metabolic response contributes to the age-related disruption of lipid homeostasis in Drosophila. Cell reports. 2013;4(6):1250–1261. pmid:24035390
  73. Eddison M, Belay AT, Sokolowski MB, Heberlein U. A genetic screen for olfactory habituation mutations in Drosophila: analysis of novel foraging alleles and an underlying neural circuit. PLoS One. 2012;7(12):e51684. pmid:23284741
  74. Chabot CC, Taylor DH. Circadian modulation of the rat acoustic startle response. Behavioral neuroscience. 1992;106(5):846. pmid:1445660
  75. Von Coelln R, Thomas B, Savitt JM, Lim KL, Sasaki M, Hess EJ, et al. Loss of locus coeruleus neurons and reduced startle in parkin null mice. Proceedings of the National Academy of Sciences. 2004;101(29):10744–10749. pmid:15249681
  76. Park S, Alfa RW, Topper SM, Kim GE, Kockel L, Kim SK. A genetic strategy to measure circulating Drosophila insulin reveals genes regulating insulin production and secretion. PLoS genetics. 2014;10(8):e1004555. pmid:25101872
  77. Suzuki A, Ohno S. The PAR-aPKC system: lessons in polarity. Journal of cell science. 2006;119(6):979–987. pmid:16525119
  78. Consonni SV, Brouwer PM, van Slobbe ES, Bos JL. The PDZ domain of the guanine nucleotide exchange factor PDZGEF directs binding to phosphatidic acid during brush border formation. PLoS One. 2014;9(5):e98253. pmid:24858808
  79. Broz V, Kucerova L, Rouhova L, Fleischmannova J, Strnad H, Bryant PJ, et al. Drosophila imaginal disc growth factor 2 is a trophic factor involved in energy balance, detoxification, and innate immunity. Scientific reports. 2017;7(1):43273. pmid:28230183
  80. Edgar BA, O’Farrell PH. The three postblastoderm cell cycles of Drosophila embryogenesis are regulated in G2 by string. Cell. 1990;62(3):469–480. pmid:2199063
  81. Humbert PO, Dow LE, Russell SM. The Scribble and Par complexes in polarity and migration: friends or foes? Trends in cell biology. 2006;16(12):622–630. pmid:17067797
  82. Sherrard KM, Fehon RG. The transmembrane protein Crumbs displays complex dynamics during follicular morphogenesis and is regulated competitively by Moesin and aPKC. Development. 2015;142(10):1869–1878. pmid:25926360
  83. O’Keefe DD, Prober DA, Moyle PS, Rickoll WL, Edgar BA. Egfr/Ras signaling regulates DE-cadherin/Shotgun localization to control vein morphogenesis in the Drosophila wing. Developmental biology. 2007;311(1):25–39. pmid:17888420
  84. Parisi F, Riccardo S, Zola S, Lora C, Grifoni D, Brown LM, et al. dMyc expression in the fat body affects DILP2 release and increases the expression of the fat desaturase Desat1 resulting in organismal growth. Developmental biology. 2013;379(1):64–75. pmid:23608455
  85. Hirabayashi S. The interplay between obesity and cancer: A fly view. Disease Models & Mechanisms. 2016;9(9):917–926.
  86. Strilbytska OM, Semaniuk UV, Storey KB, Yurkevych IS, Lushchak O. Insulin signaling in intestinal stem and progenitor cells as an important determinant of physiological and metabolic traits in Drosophila. Cells. 2020;9(4):803. pmid:32225024
  87. Yu Y, Huang R, Ye J, Zhang V, Wu C, Cheng G, et al. Regulation of starvation-induced hyperactivity by insulin and glucagon signaling in adult Drosophila. Elife. 2016;5:e15693. pmid:27612383
  88. Pinteac R, Montalban X, Comabella M. Chitinases and chitinase-like proteins as biomarkers in neurologic disorders. Neurology: Neuroimmunology & Neuroinflammation. 2020;8(1):e921. pmid:33293459
  89. Sargin D, Botly LC, Higgs G, Marsolais A, Frankland PW, Egan SE, et al. Reprint of: Disrupting Jagged1–Notch signaling impairs spatial memory formation in adult mice. Neurobiology of learning and memory. 2013;105:20–30. pmid:23850596