All authors agree with the content of the manuscript. The authors declare that they have no financial or personal relationships with other people or organizations that could inappropriately influence their work.
Conceived and designed the experiments: NK MB. Analyzed the data: NK MB. Contributed reagents/materials/analysis tools: NK JT MR MB. Wrote the paper: NK MB BP.
Genomics experiments are widely acknowledged to produce a huge amount of data to be analysed. The challenge is to extract meaningful biological context for proteins or genes, which is currently difficult because the lack of an integrative workflow hinders the efficiency and robustness of data mining performed by biologists working on ruminants. Thus, we designed ProteINSIDE, a free web service (
A main challenge for scientists working on the efficiency of ruminant production and the quality of their products (meat, milk…) is to understand which genes and proteins control nutrient metabolism and partitioning between tissues or which genes and proteins control tissues growth and physiology [
Here, we present ProteINSIDE, which aims to seamlessly integrate complementary analyses to produce information about proteins in one online automated package. ProteINSIDE comprises four modules: biological knowledge retrieval; annotations relative to biological process, molecular function, and subcellular location according to the GO; prediction of secreted proteins; and PPi analysed as a network. ProteINSIDE provides graphical and interactive results that are viewable on the website or downloadable. From a single input, ProteINSIDE extracts information for a list of gene or protein ID from many data sources. It thus avoids searching for the different ID required to query the multiple existing bioinformatics tools, and it saves analysis time for biologists. A bench test on 1000 random proteins per species showed that ProteINSIDE performs as well as or better than currently used web services and dedicated resources. The relevance of results produced by ProteINSIDE was also checked with data that were previously partly analysed [
ProteINSIDE is an online workflow with an interface devoted to user-friendly and fully customisable analyses from lists of protein or gene ID. Registered users have access to a private session to run, save, and visualise their results. Unregistered users can also use ProteINSIDE, but without an analysis manager, and their results are deleted each month. Uploaded data are encrypted to ensure confidentiality. ProteINSIDE is divided into three parts: the workflow, the database, and the web interface. The workflow is a combination of Perl and R scripts that query databases, recover protein data, perform calculations, and run algorithms for signal peptide prediction and network visualisation. The MySQL database aims to reduce server load and stores settings and results from queries. ProteINSIDE’s database also stores available knowledge from major public biological databases. The web interface is the structure of ProteINSIDE and allows creating an analysis, viewing results, and keeping users informed of updates (
The four modules (querying the available biological information, annotating according to the GO, predicting secreted proteins, and visualizing PPi) are either all run in the basic analysis or individually selected and run with specific settings in the custom analysis. The basic analysis runs ProteINSIDE with automatic settings. The custom analysis operates with the settings selected by the user: an option to include GO terms Inferred from Electronic Annotation (IEA, which are automatically excluded in the basic analysis), an option to draw GOTree chart networks with Cytoscape web, an option to search PPi among the 31 databases proposed by ProteINSIDE, an option to search PPi in other species using orthologous proteins, an option to extend the PPi network with proteins that are not in the dataset, and an option to choose the sensitivity used to detect signal peptides with SignalP 4.1.
ProteINSIDE performs either a “basic analysis” (in which settings are locked and the workflow acquires GO terms, signal peptide predictions, and PPi data from IntAct, UniProt, and BioGrid) or a “custom analysis” (in which the user chooses the modules, the settings, and the databases for PPi) (
Submitted ID, names, or accession numbers are compared to the ProteINSIDE database to ascertain a match with genes or proteins from Human, Rat, Murine, Bovine, Ovine, or Caprine species. This is achieved by merging the local database, which is updated every month, with data from UniProt [
ProteINSIDE imports GO terms by querying the QuickGO database. QuickGO was chosen because of its daily updates, accessibility, and performance. In the basic analysis, ProteINSIDE only imports GO terms that have been selected by evidence codes (GO terms Inferred from Electronic Annotation (IEA) are excluded) and agreed by curator review. The use of IEA is a setting of the custom analysis. For each GO term, ProteINSIDE provides the number of genes or proteins annotated by the term relative to the total number of genes or proteins within both the dataset (frequency within the list) and the GO (frequency within the genome). The GO module of ProteINSIDE also analyses over- and under-represented terms to identify the most relevant and the most specific terms associated with the uploaded list, according to the functional enrichment first proposed by FatiGO [
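The two frequencies and the over- or under-representation of a GO term can be illustrated with a hypergeometric test, which is one common way to compute this kind of functional enrichment. The sketch below is only an illustration under that assumption: the function names and the counts are hypothetical, and it is not the code of the ProteINSIDE workflow.

```python
from math import comb

def term_frequencies(k, n, K, N):
    """Frequency of a GO term within the list (k of n) and within the genome (K of N)."""
    return k / n, K / N

def over_representation_p(k, n, K, N):
    """P(X >= k): chance of seeing at least k annotated proteins in a list of n,
    when K of the N proteins of the genome carry the term."""
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(n, K) + 1)) / comb(N, n)

def under_representation_p(k, n, K, N):
    """P(X <= k): chance of seeing at most k annotated proteins in the list."""
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(0, min(k, n, K) + 1)) / comb(N, n)

# Hypothetical counts: 12 of 143 uploaded proteins carry a term that
# annotates 400 of 20000 proteins in the reference genome.
freq_list, freq_genome = term_frequencies(12, 143, 400, 20000)
print(freq_list, freq_genome, over_representation_p(12, 143, 400, 20000))
```

A term is a candidate for over-representation when its frequency within the list is higher than within the genome and the upper-tail probability is small. In practice the p-values of all tested terms would also need a correction for multiple testing.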
The GO module also provides a view of network that links GO terms as a tree ancestor. The GO terms are linked by their parental association using ProteINSIDE database. ProteINSIDE database has its own version of ontology annotation from GO consortium [
In eukaryotes, at least five different routes of protein secretion out of the cells are reported: (I) the classical Golgi/ER-dependent secretory pathway for proteins that contain N-terminal signal peptide and non-classical protein exports or ER/Golgi-independent protein secretions that ensure protein secretion by (II) endosomal recycling, (III) plasma membrane transporter, (IV) membrane flip-flop, and (V) membrane blebbing that involves formation of vesicles or exosomes [
To identify proteins that are putatively secreted by the classical Golgi/ER-dependent secretory pathway, ProteINSIDE predicts the presence of a signal peptide on a protein sequence (imported by the biological knowledge retrieval script) through a local version of the SignalP tool (version 4.1 [
PPi identification and visualisation within a network point out how various genes or proteins contribute to cellular or metabolic processes. ProteINSIDE uses Psicquic web service [
Thus, ProteINSIDE constructs networks with PPi recorded in well-annotated (Rat, Mouse, or Human) and poorly annotated (Bovine, Ovine, or Caprine) species. Comparing PPi networks allows knowledge from the well-annotated species to be extended to the poorly annotated ones. After a custom analysis with PPi extension, a GO analysis with the ID of genes/proteins of the extended network can be launched directly by clicking on the button “Run a job to analyse the Gene Ontology for ID from this network”. Results of this GO analysis are available as a new work on the user home page, and provide GO terms with
Results are available through a unique code or a link provided after the submission of a list. Four separate pages provide results from the four analyses. The results are dynamic tables and charts that can be sorted and filtered online by specific criteria such as biological function or protein and gene ID. Tables and charts are downloadable, and diagrams or histograms are printable. Networks are downloadable as pictures (.pdf and .png) or as network software input files (.sif, .xgmml, or .graphml). Both types of networks (GOTree and PPi) are dynamic thanks to Cytoscape web, which gives options to sort nodes, change layout, and search by proteins or biological function (Figs
The network was built with Cytoscape web. Settings are available to sort the network according to ontology groups, biological function, result of GO enrichment (p-value), numbers (Nb) of GO, or network layout. GO terms are sorted and coloured depending on the ontology group and the number of annotated proteins. (A) Dynamic network view of GO terms related to Molecular Function. (B) Clicking on a GO term provides the GO number, proteins from the sample list annotated by this GO, and links to public GO databases (AmiGO and QuickGO).
The network was built with Cytoscape web. (A) The colour of the edge depends on the experimental methods used to identify PPi. White nodes are proteins from the dataset and grey nodes are known interacting proteins not included in the dataset. (B) Clicking on a protein/node provides biological information such as gene and protein ID, function, and a link to the UniProt database. (C) Available options to sort the network and highlight proteins of interest.
The web interface was programmed in PHP, HTML, and JavaScript. The workflow has been programmed using Perl (with CPAN modules (Comprehensive Perl Archive Network) and BioPerl [
To ensure the sustainability of ProteINSIDE, we have set up an automatic update of the database. Each month, a program updates the biological information in the ProteINSIDE database by comparison with the current, freely available NCBI and UniProt resources. Moreover, it gathers information from the QuickGO and AmiGO databases in order to update the GO term database of ProteINSIDE (function and annotated genes expected by GO term for each species) or to add new GO terms. The workflow update is done manually. If a program requires an update, it is set up and tested locally to avoid conflicts with users' running analyses. Once the update is applied to ProteINSIDE, users can press the "reload analysis" button (green arrow) to restart a previous analysis and benefit from the new version of the workflow. Each new version of ProteINSIDE’s database or workflow is announced on the website by a news item or by a message in the “About” section and the “User’s main page”. Users should be aware that results obtained with two different versions of ProteINSIDE may not be reproducible because of the deletion of obsolete data and the use of new data.
To test ProteINSIDE’s performances, we have created six datasets made of 1000 random proteins for each species (
Moreover, to demonstrate the added-value of the biological meanings provided by ProteINSIDE, we used 2 datasets that are lists of bovine proteins identified in the skeletal muscle
We have tested ProteINSIDE with 1000 random proteins ID from six species by comparison with widely used resources (
Analyses | ProteINSIDE | DAVID | BioMyn | AgBase |
---|---|---|---|---|
x | x | x | x | |
x | x | x | x | |
x | x | x | ||
x | x | |||
x | x | |||
x | ||||
x | x | x | x | |
x | x | x | ||
x | x | x | ||
x | x | |||
x | ||||
x | x | |||
x | x | x | ||
x | ||||
x | ||||
x | ||||
x | x | x | ||
31 | 4 | 10 | ||
x | x | |||
x | x | x | x | |
x | x | x | ||
x | x | |||
x | ||||
6 | 6 | Human | Bovine & Sheep | |
Monthly | Sep. 2009 | Mar. 2013 | Monthly |
Analyses performed by ProteINSIDE in comparison with DAVID [
x indicates that the analysis is performed by the resource.
Species | ProteINSIDE | DAVID | BioMyn | AgBase |
---|---|---|---|---|
1000 | 899 | 998 | ||
998 | 993 | |||
1000 | 949 | |||
979 | 378 | 1000 | ||
1000 | 6 | |||
1000 | 959 | 1000 | ||
99.62 | 69.73 | |||
99.30 |
44.77 |
100 |
Numbers of ID retrieved by the ID mapping module of ProteINSIDE and by DAVID, BioMyn, or AgBase when a list of 1000 random proteins per species was uploaded.
a from Sheep, Goat and Bovine results
b from Sheep and Bovine results
Species | ProteINSIDE | DAVID | BioMyn | AgBase | Percent of results shared by ProteINSIDE and | |||
---|---|---|---|---|---|---|---|---|
DAVID | BioMyn | AgBase | ||||||
Annotated proteins | 816 (945) | - (864) | - (803) | - (90) | - (85) | |||
GO terms | 2676 (3641) | - (4167) | - (1930) | - (64) | - (50) | |||
Annotated proteins | 938 (997) | - (991) | - (89) | |||||
GO terms | 2435 (3514) | - (3545) | - (55) | |||||
Annotated proteins | 783 (979) | - (950) | - (91) | |||||
GO terms | 2790 (4074) | - (5134) | - (67) | |||||
Annotated proteins | 159 (916) | - (372) | 125 (938) | - (40) | 78 (100) | |||
GO terms | 834 (2446) | - (1252) | 1069 (2541) | - (51) | 100 (100) | |||
Annotated proteins | 32 (503) | - (6) | - (1) | |||||
GO terms | 190 (612) | - (82) | - (4) | |||||
Annotated proteins | 365 (886) | - (957) | 286 (898) | - (86) | 77 (100) | |||
GO terms | 1418 (3085) | - (3850) | 1569 (3130) | - (61) | 100 (100) |
The percent of results shared by ProteINSIDE and by DAVID, BioMyn, or AgBase were calculated for comparison.
() for annotations that include IEA (Inferred electronic annotation).
- indicates that the resource does not provide annotation without IEA.
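The percentage of shared results can be read as a set overlap. The short sketch below assumes that the ProteINSIDE result is the denominator, as suggested by the wording "annotated by ProteINSIDE were also annotated by DAVID"; the function name and the identifiers are hypothetical.

```python
def percent_shared(reference, other):
    """Percentage of items in `reference` that are also present in `other`.

    Assumption: the reference set (here, the ProteINSIDE result) is the
    denominator of the percentage.
    """
    reference, other = set(reference), set(other)
    if not reference:
        return 0.0
    return 100.0 * len(reference & other) / len(reference)

# Hypothetical GO terms returned by two resources for the same protein list.
proteinside_terms = {"GO:0006096", "GO:0005739", "GO:0005576", "GO:0006914"}
other_terms = {"GO:0006096", "GO:0005739", "GO:0016020"}
print(percent_shared(proteinside_terms, other_terms))
```

Because the denominator is one resource only, the measure is not symmetric: swapping the two arguments generally gives a different percentage.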
3 databases | 9 databases | |||
---|---|---|---|---|
PPi within the dataset | PPi outside the dataset | PPi within the dataset | PPi outside the dataset | |
396 | 5518 | 1703 | 6269 | |
111 | 1410 | 702 | 8475 | |
27 | 271 | 42 | 425 | |
0 | 0 | 0 | 0 | |
0 | 0 | 0 | 0 | |
12 | 96 | 34 | 131 | |
12 | 5921 |
ProteINSIDE had queried both 3 (BioGrid, Uniprot, and IntAct selected by default in the basic analysis) and 9 (BioGrid, IntAct, MINT, MatrixDB, STRING, Reactome, InnateDB-IMEx, UniProt, and I2D-IMEx chosen by the user) databases to record PPi that have been identified by experiments. Within a species, PPi were searched between proteins within (core network) and outside (extended network) the dataset. Bovine proteins ID were uploaded to search for known interactions with their orthologs in Human (EXT Human).
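The distinction between the core network (both partners within the dataset) and the extended network (one partner outside the dataset) can be sketched as follows. The example assumes interactions delivered as tab-separated MITAB lines, the tabular format returned by PSICQUIC services, in which the first two columns hold the interactor identifiers prefixed by their database (for example `uniprotkb:`). The function names and accession strings are hypothetical, and this is not the ProteINSIDE code.

```python
def parse_mitab_pair(line):
    """Return the two interactor accessions from the first two MITAB columns."""
    fields = line.rstrip("\n").split("\t")
    # Identifiers look like "uniprotkb:P12345"; keep the part after the prefix.
    return tuple(field.split(":", 1)[-1] for field in fields[:2])

def split_interactions(lines, dataset):
    """Split interactions into core (both partners in the dataset) and
    extended (exactly one partner in the dataset)."""
    dataset = set(dataset)
    core, extended = set(), set()
    for line in lines:
        a, b = parse_mitab_pair(line)
        if a == b:
            continue  # simplification: self-interactions are ignored here
        pair = frozenset((a, b))  # undirected, so A-B equals B-A
        if a in dataset and b in dataset:
            core.add(pair)
        elif a in dataset or b in dataset:
            extended.add(pair)
    return core, extended

# Hypothetical accessions; real MITAB lines carry further columns.
lines = [
    "uniprotkb:P1\tuniprotkb:P2\t-",
    "uniprotkb:P2\tuniprotkb:Q9\t-",
    "uniprotkb:Q8\tuniprotkb:Q9\t-",
]
core, extended = split_interactions(lines, dataset={"P1", "P2"})
print(len(core), len(extended))
```

Storing each interaction as an unordered pair also removes duplicates when the same interaction is reported by several databases, which matters when 3 or 9 sources are queried for the same list.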
The high ability of ProteINSIDE to retrieve biological information for 1000 random proteins ID from Human, Rat, Murine, Bovine, Ovine, or Caprine species is shown by the retrieval of 100% of ID from Bovine, Goat, Human, and Rat; as well as of 99.8% and 97.8% of ID from Mouse and Sheep, respectively (
For each uploaded ID, ProteINSIDE obtained and summarized, as a downloadable table, the gene or protein ID, gene and protein names, the protein function, the gene chromosomal location, information on tissue expression, the cellular location, orthologous ID, and the FASTA sequence of the protein. These results are directly viewed on the “ID resume” web page of ProteINSIDE, and each protein or gene ID is linked to the corresponding UniProt and NCBI web pages. Part of these biological data is also provided by DAVID, BioMyn, and AgBase.
ProteINSIDE annotated protein or gene ID with GO terms selected by evidence codes (IEA excluded) and agreed by curator review in the basic analysis, and additionally with GO terms from IEA as a setting of the custom analysis. Like ProteINSIDE, AgBase allows IEA annotations to be excluded, while DAVID and BioMyn use IEA by default. Thus, results from the GO module of ProteINSIDE for 1000 random proteins ID were compared to DAVID and BioMyn results with IEA, and to AgBase results with and without IEA (
On average, 90% of protein ID from Human, Rat, or Mouse annotated by ProteINSIDE were also annotated by DAVID. However, the percentage of proteins annotated by both ProteINSIDE and DAVID decreased to 86%, 40%, and 0.6% for Bovine, Ovine, and Caprine ID, respectively. Compared to ProteINSIDE, the numbers of GO terms from DAVID were around 52% higher for Human, Mouse, Rat, and Bovine proteins but 86% and 48% lower for Caprine and Ovine proteins, respectively. Consequently, around 64% of GO terms from ProteINSIDE were common to those provided by DAVID for five species. On average, 85% of Human proteins were annotated both by ProteINSIDE and BioMyn, with 50% of GO terms provided by the two resources. For Bovine and Ovine proteins, ProteINSIDE annotated more proteins with fewer GO terms than AgBase in analyses without IEA, while AgBase annotated more proteins with more GO terms when IEA annotations were used. In both analyses, GO terms provided by ProteINSIDE were all retrieved with AgBase. The annotation differences between ProteINSIDE and DAVID, or to a lesser extent BioMyn, may result from the lack of a recent update of the databases used by these two resources. For example, from our lists of proteins, DAVID provided GO terms that have been declared obsolete by the GO consortium (the previous name of GO:0006096 was “Glycolysis”, replaced on 29 March 2014 by “Glycolytic process”) and that thus could not be retrieved by resources such as ProteINSIDE and AgBase, which integrate the GO consortium updates. It is noteworthy that AgBase provided reliable GO annotations for two ruminant species, probably because of its own curated database, which is also updated monthly. Thus, both AgBase and ProteINSIDE perform GO annotations for bovine and ovine proteins, while only ProteINSIDE annotates ID from Caprine species. It is also noteworthy that ProteINSIDE annotated on average 45% more proteins with 46% more GO terms when IEA annotations were included than when they were excluded.
The results are viewed on the “GO” web page of ProteINSIDE as tables, diagrams, and GOTree charts. Unlike the other available tools, ProteINSIDE summarized main results of GO as GOTree charts which are ordered tree layout networks that link related GO terms (
Whatever the species, 95% of signal peptides predicted by ProteINSIDE thanks to SignalP were also predicted by PrediSi and Phobius (
Resources used to predict signal peptides were SignalP that was included in ProteINSIDE, as well as PrediSI [
An added-value of ProteINSIDE is the use of GO terms and TargetP, both to reinforce the prediction of proteins that are secreted (
Results are viewed on the “Secreted Protein” web page of ProteINSIDE as two main tables: the first table listed proteins secreted by the classical Golgi/ER-dependent secretory pathway thanks to a signal peptide and the second table listed proteins that are predicted to be secreted by non-classical secretory pathways.
The numbers of PPi identified by querying 3 major PPi databases (UniProt, BioGrid, and IntAct, preselected in the basic analysis) or 9 widely used PPi databases (BioGrid, IntAct, MINT, MatrixDB, STRING, Reactome, InnateDB, I2D, and UniProt, chosen through the settings of the custom analysis) are summarized in
On the “Protein Interactions” web page of ProteINSIDE, PPi are viewed as tables of listed PPi and as dynamic networks. A dynamic network (
The bench test provided evidence for the reliability and accuracy of the results produced by ProteINSIDE. We then tested the ability of ProteINSIDE to generate new research hypotheses and knowledge for biologists.
Custom analyses were performed with 143 and 120 proteins ID identified by proteomics from perirenal AT [
ProteINSIDE successfully uploaded and provided a fast overview of the biological information available in UniProt and NCBI databases for 143 proteins that were identified in the AT at each foetal age.
To check for the relevance of ProteINSIDE results, we compared GO terms recovered by ProteINSIDE to those previously published after a GO analysis with DAVID [
ProteINSIDE also brought new knowledge since it had predicted 18 proteins as secreted thanks to a signal peptide. Among them, 13 were also annotated by GO terms relative to secretion processes and predicted to be located out of the cell either by TargetP or the subcellular location provided by UniProt (
Adipose tissue | Muscle tissue | Adipose and muscle tissues |
---|---|---|
ADIPOQ |
ADIPOQ |
ADIPOQ |
NDUS3 | ALB |
APOA1 |
ERLIN2 |
AFP |
AFP |
NDUS8 | APOA1 |
ALB |
NDUFA10 | GSN |
PDIA3 |
SERPINA1 |
GARS |
SERPINA1 |
APOA1 |
SERPINA1 |
|
APOA2 |
P4HB |
|
FGG |
PDIA3 |
|
TTR |
||
ALB |
||
AFP |
||
TF |
||
PCCB | ||
COL6A2 |
||
HSP90B1 |
||
PDIA3 |
||
RCN1 |
1 confirmed by GO terms
2 confirmed by TargetP
3 confirmed by Subcellular location provided by UniProt resource
The PPi module of ProteINSIDE had revealed that 107 proteins of the dataset were linked by 300 interactions (
We highlighted the relationships among different proteins involved in the same process: (A) mitochondrial metabolism, (B) redox activity, (C) proteasome complex, (D) cell proliferation, and (E) differentiation and metabolism of AT. Squares indicate key proteins identified by sorting the network with algorithms of betweenness (an average of 400) and closeness centralities (an average of 0.3).
ProteINSIDE correctly identified 120 muscular proteins present at all the five foetal ages [
As for data from AT, the relevance of ProteINSIDE results was checked. We have compared enriched GO terms recovered by ProteINSIDE to those previously published [
As new knowledge, ProteINSIDE had predicted 9 secreted proteins among the 120 proteins from foetal muscle (
ProteINSIDE revealed 171 interactions between 76 proteins of the dataset (
We highlighted the relationships among different proteins involved in the same process: (A) muscle development, (B) energetic complexes, (C) respiratory chain, and (D) cell proliferation. Squares indicate key proteins identified by sorting the network with algorithms of betweenness (an average of 100) and closeness centralities (an average of 0.4).
There is striking evidence for developmental and functional links between muscle and AT: the successive waves of growth of muscle and AT suggest a priority for muscle growth, and the comparison between lean and fat bovine breeds suggests that an increased muscular development is concomitant with a decrease in AT mass [
The 46 proteins were annotated by enriched GO terms (
The abundance of ADIPOQ was normalized to the mRNA abundance of ribosomal protein P0 (RPLP0). Results are ΔΔ CT for foetal and adult AT or muscle samples, relatively to a control sample that is an adult AT. PCR were carried out as previously described [
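The normalisation described in this legend corresponds to the usual ΔΔCT arithmetic, sketched below. The CT values are hypothetical, and the conversion to a fold change assumes a PCR efficiency of 100%, which the legend does not state.

```python
def delta_delta_ct(ct_target, ct_reference, ct_target_control, ct_reference_control):
    """ΔΔCT of a sample relative to a control sample.

    ΔCT = CT(target gene) - CT(reference gene), here ADIPOQ and RPLP0;
    ΔΔCT = ΔCT(sample) - ΔCT(control sample).
    """
    delta_sample = ct_target - ct_reference
    delta_control = ct_target_control - ct_reference_control
    return delta_sample - delta_control

def fold_change(ddct):
    """Relative abundance, assuming the amplicon doubles at each cycle."""
    return 2.0 ** (-ddct)

# Hypothetical CT values: a foetal muscle sample against the adult AT control.
ddct = delta_delta_ct(ct_target=24.0, ct_reference=18.0,
                      ct_target_control=22.0, ct_reference_control=18.0)
print(ddct, fold_change(ddct))
```

A positive ΔΔCT means that the target mRNA is less abundant in the sample than in the control, since a higher CT reflects a lower starting quantity.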
Among the 46 common proteins, 23 were linked by 40 interactions and grouped in 2 subnetworks of proteins related to muscle proteasome and mitochondrial metabolism (already described in Fig
ProteINSIDE built a network between the 46 proteins identified in AT and muscle and proteins outside of the dataset that are known to interact with them in Human. We filtered and sorted the network using high values of betweenness (an average of 10000) and closeness centralities (an average of 0.318). We identified 6 proteins that were highly central in the dataset (linked with the maximum of proteins and pathways), 12 proteins that were moderately central (engaged in the maximum of pathways but not necessarily with many proteins), and 17 proteins that were more weakly central (less linked with the maximum of proteins and less engaged in pathways, but central in the network). White boxes indicate proteins that are from the bovine dataset and grey boxes indicate proteins that are external to the dataset.
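Betweenness and closeness centralities, used to sort the networks in these figures, can be computed on an unweighted, undirected PPi network as in the following sketch. It uses breadth-first search and Brandes' accumulation and reports raw, non-normalised betweenness; the protein names are placeholders, and the code only illustrates the two measures rather than the implementation used by the network viewer.

```python
from collections import deque

def build_graph(edges):
    """Adjacency sets for an undirected network given as (a, b) pairs."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph

def closeness(graph, source):
    """Reachable nodes divided by the sum of shortest-path lengths to them
    (computed within the connected component of `source`)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total else 0.0

def betweenness(graph):
    """Raw betweenness: for each node, the number of shortest paths between
    other node pairs that pass through it (fractional when paths tie)."""
    score = dict.fromkeys(graph, 0.0)
    for s in graph:
        order, preds = [], {v: [] for v in graph}
        sigma = dict.fromkeys(graph, 0)  # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = dict.fromkeys(graph, 0.0)
        for w in reversed(order):  # farthest nodes first
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each unordered pair was counted from both of its ends.
    return {v: c / 2 for v, c in score.items()}

# Placeholder network: one hub protein linked to three partners.
network = build_graph([("HUB", "A"), ("HUB", "B"), ("HUB", "C")])
print(betweenness(network), closeness(network, "HUB"))
```

A protein with high betweenness lies on many shortest paths between other proteins, while a protein with high closeness is only a few interactions away from the rest of the network, which is why the two measures are combined to rank proteins as highly, moderately, or weakly central.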
VCAM1 also named CD106 is a cell surface marker of mesenchymal stem cells found in AT [
Using a list of genes or proteins, ProteINSIDE provides a fast overview of biological information for the ID of the uploaded dataset, a functional annotation according to GO, a prediction of secreted proteins, and a simple and interactive view of known and experimentally proven PPi. The similar or better reliability of results produced by ProteINSIDE was demonstrated by a bench test involving other available resources. Moreover, we have verified the biological relevance of the results by comparison with our previous analyses. Lastly, we have used ProteINSIDE to propose new research hypotheses that should help to better understand the growth of AT and muscle in bovine. Among the new hypotheses that deserve to be investigated, we have focused on an uptake of adiponectin by the foetal muscle and the role of autophagy in the ontogenesis of AT and muscle. To go further, ProteINSIDE can easily be upgraded to broaden the range of species (for example to other farm species such as pig or chicken) by an update of the database and the workflow.
The authors acknowledge A. Delavaud for conducting the mRNA quantification.