Large-scale investigation of the reasons why potentially important genes are ignored

doi:10.1371/journal.pbio.2006643

Fig 1.

Physical, chemical, and biological features of genes predict the number of publications.

(A) Illustration of modeling approach and prediction of number of research publications for single genes using information on 430 physical, chemical, and biological features of genes (S1 Data). (B) Research publications on individual genes grouped by t-SNE visualization using the 15 features most important to the models used in (A). Heatmaps show z-scored values of the 15 features for the genes in each cluster. Order of features as indicated in S3A Fig (S1 Data). SRP, signal recognition particle; t-SNE, t-distributed stochastic neighbor embedding.
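
The z-scoring used for the heatmaps in (B) can be illustrated with a minimal sketch. The caption does not say whether the population or sample standard deviation was used, so the population form below is an assumption, and the function name `z_scores` and the example values are hypothetical.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Return (value - mean) / standard deviation for each value.

    Assumption: the population standard deviation is used; the caption
    does not specify which estimator the authors applied.
    """
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant feature carries no spread, so every z-score is zero.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical values of one feature across eight genes.
feature_values = [2, 4, 4, 4, 5, 5, 7, 9]
scored = z_scores(feature_values)
```

After z-scoring, each feature has mean 0 and unit spread, which lets features measured in different units share one heatmap color scale.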

Fig 2.

Features of genes and homologous genes predict discovery of human genes.

(A) Number of publications per gene for past and recent research. Publications of past research (until 2010) are scaled so that the total number of publications matches present research (2011–2015). Dashed grey lines delimit three standard deviations away from the mean. (B) Prediction of the number of research publications for the model of Fig 1A extended by including the year of the first publication on the specific human gene (S1 Data). (C) Prediction of the year of discovery using the features from Fig 1A (S1 Data). (D) Percentage of publications that cite publications with nonhuman genes more frequently than they cite publications with human genes (S1 Data). (E) Prediction of the year of initial publications on individual genes using the features from Fig 1A and the year of the initial publication on homologous genes of nonhuman model organisms (S1 Data). (F) Prediction of the number of research publications using the features of Fig 1A and the number of publications on homologous genes (S1 Data).
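
The rescaling described in (A) can be sketched as follows, assuming a single global factor is applied to every gene so that the past total equals the recent total. The caption does not give the exact procedure, and the function name `scale_past_counts`, the gene labels, and the counts are hypothetical.

```python
def scale_past_counts(past, recent):
    """Rescale past per-gene publication counts to match the recent total.

    Assumption: one global multiplicative factor is applied to all genes.
    """
    total_past = sum(past.values())
    total_recent = sum(recent.values())
    if total_past == 0:
        raise ValueError("past counts sum to zero; cannot rescale")
    factor = total_recent / total_past
    return {gene: count * factor for gene, count in past.items()}


# Hypothetical publication counts for two genes.
past_counts = {"GENE_A": 300, "GENE_B": 100}    # until 2010
recent_counts = {"GENE_A": 120, "GENE_B": 80}   # 2011-2015
scaled_past = scale_past_counts(past_counts, recent_counts)
```

With both periods on the same total, a gene's position relative to the diagonal reflects a change in its share of attention rather than overall growth in the literature.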

Fig 3.

Many potentially important genes are not being studied enough.

(A) Relative enrichment of the presence of genes with genetic loss-of-function (LoF) intolerance, presence of genes with GWAS traits, and the attention within publications. (B) Predicted versus actual NIH budget spending on individual genes (dots). The black line shows a lowess fit and the dashed lines show the two distinct regimes of the prediction (S1 Data). (C) Fraction of disease-linked genes with at least one experimental drug conditioned on the predicted order of discovery according to the model shown in Fig 2B. Error bars show 95% confidence intervals for the estimations. GWAS, genome-wide association study; LoF, loss-of-function; NIH, National Institutes of Health; USD, US dollar.

Fig 4.

Identifying and exploring ignored genes.

(A) Estimation of the years until all genes are studied if scientific enterprise continues to follow trends reported above. Number of genes with at least n focused (single-gene) publications per year. Dashed lines show extrapolation of the bounds of linear regression for recent years. (B) Percentage of highly cited studies (top 5% in number of citations) in the 8 years following their publication. Error bars show 95% confidence intervals. (C) Percentage of genes with a strong RNAi phenotype, at least one tissue with moderate RNA abundance, presence of a Drosophila melanogaster homolog, or membership in a complex with highly studied genes. Highly studied genes show higher percentages for all these characteristics, but many unstudied genes also share those characteristics. (D) Illustration of bias in identification of hits in distinct large-scale experimental approaches. Interaction studies refer to studies labelled as “High throughput” within BioGRID. Relative hits marks fold enrichment over equal occurrence (S1 Data). (E) Genes grouped by t-SNE visualization using the 15 features most important to the models used in Fig 1A. Large circles highlight genes with frequently discovered GWAS traits. Heatmaps show presence of strong genetic evidence (G), experimental potential (E), and homolog in invertebrate model organism (M). Note the lack of a strong correlation between GEM characteristics and research attention. E, experimental potential; FPKM, fragments per kilobase of transcript per million mapped reads; G, strong genetic support; GEM, strong genetic support and experimental potential and homolog in invertebrate model organism; GWAS, genome-wide association study; M, model organism; RNAi, RNA interference; t-SNE, t-distributed stochastic neighbor embedding.
