System-Level Insights into the Cellular Interactome of a Non-Model Organism: Inferring, Modelling and Analysing Functional Gene Network of Soybean (Glycine max)

Cellular interactome, in which genes and/or their products interact on several levels, forming transcriptional regulatory-, protein interaction-, metabolic-, signal transduction networks, etc., has attracted decades of research focuses. However, such a specific type of network alone can hardly explain the various interactive activities among genes. These networks characterize different interaction relationships, implying their unique intrinsic properties and defects, and covering different slices of biological information. Functional gene network (FGN), a consolidated interaction network that models fuzzy and more generalized notion of gene-gene relations, have been proposed to combine heterogeneous networks with the goal of identifying functional modules supported by multiple interaction types. There are yet no successful precedents of FGNs on sparsely studied non-model organisms, such as soybean (Glycine max), due to the absence of sufficient heterogeneous interaction data. We present an alternative solution for inferring the FGNs of soybean (SoyFGNs), in a pioneering study on the soybean interactome, which is also applicable to other organisms. SoyFGNs exhibit the typical characteristics of biological networks: scale-free, small-world architecture and modularization. Verified by co-expression and KEGG pathways, SoyFGNs are more extensive and accurate than an orthology network derived from Arabidopsis. As a case study, network-guided disease-resistance gene discovery indicates that SoyFGNs can provide system-level studies on gene functions and interactions. This work suggests that inferring and modelling the interactome of a non-model plant are feasible. It will speed up the discovery and definition of the functions and interactions of other genes that control important functions, such as nitrogen fixation and protein or lipid synthesis. The efforts of the study are the basis of our further comprehensive studies on the soybean functional interactome at the genome and microRNome levels. Additionally, a web tool for information retrieval and analysis of SoyFGNs can be accessed at SoyFN: http://nclab.hit.edu.cn/SoyFN.


Introduction
The living body is a complex system of storing and processing information. Full understanding of this system means characterising the function of its components and their interactions. The cell, as the most basic system of life, is a system of hierarchical organisation from individual molecules (such as genes, mRNAs, proteins, and metabolites) to complex molecular pathways (such as gluconeogenesis and tricarboxylic acid cycle), in which molecular interactions play an important role. Interacting molecules form functional modules (such as groups of molecules involved in the same biological process), which in turn interact with each other to drive larger scale biological processes. Comprehensive maps of the interactions among biomolecules provide an overall view of the cell. The past decade has witnessed significant effort aimed at modelling, identifying, organising, and analysing cellular interactomes. Such effort, grounded in significant advances in our understanding of molecular biology, is supported by the omic-level highthroughput data collections and acquisition techniques, which are used to interrogate the states and interactions of biomolecules at multiple levels, and to further map the structure of the genome-wide interaction networks. If the complex system of a cell is regarded as a gene society, although it is in fact composed of a variety of biological molecules, the heterogeneous interactions between biological molecules are, essentially, interactions between genes. A same gene society may be modelled by various networks, of which the most popular are the protein-protein interaction network (PPIN), gene regulatory network (GRN, or transcriptional regulatory network, TRN) and metabolic network (MN). In addition, there exist various other types of connections upon which to model gene interactions, such as signal transduction pathways, co-expression networks, genetic interactions, and so forth ( Figure 1). However, these models characterise the different interactive relationships between genes, implying their unique intrinsic properties and defects, and covering different slices of biological information. In other words, one specific type of connection alone cannot explain the various interactions among genes. Integrating them would contribute to a comprehensive view of the cellular system. Therefore, a challenging problem of network integration arises.
Although reconstruction of FGNs, depending on a variety of functionassociated data (Figure 1), has been successful in many model plant species, especially, for example, the dicot Arabidopsis [7] and the monocot rice [9], integrating diverse genomic data into network models for many other plants, such as soybean, is still problematic. First, the genomic data are heterogeneous in their sensitivity and specificity for relationships between genes. For example, experimental methods such as mass spectrometry preferentially observe abundant proteins, whereas comparative genomics methods apply only to evolutionarily conserved genes. Second, genomic data sets vary widely in their utility for reconstructing gene networks. Thus, we need robust benchmarking methods that can be used to evaluate each data set and allow comparison of their relative merits. Third, data sets are often correlated, but the correlations are always difficult to measure because of data incompleteness (a common problem) and sampling biases [4]. For most species, the richness and accuracy of these various functionassociated data are quite inconsistent. For example, for model organisms, such as Arabidopsis, a wealth of data resources is available owing to extensive research, but for other non-model organisms, such as soybean, there are not enough data to construct such networks. We therefore need a cross-species and minimally datadependent approach to construct the FGNs of non-model organisms.
The Gene Ontology (GO) project [14] has integrated information from multiple data sources to annotate genes to specific biological process (BP), molecular function (MF) or cellular component (CC), which are three subontologies (or aspects). GO annotation (GOA) itself can be regarded as a de facto way to integrate diverse unstructured data into a single structured data source. Therefore, GOA is important for inferring FGNs based on the fact that the strength of functional interaction between genes is proportional to their functional similarity (FS). Thus we can calculate the FS among all the genes of an organism based on GOA and further construct a genome-wide network, referred to as an FGN. As a weighted network model, edge weights in the FGN represent the functional similarity rather than the likelihood of interaction between genes in a PFGN.
In comparison to the PFGN, the FGN based on GOA seems to be much easier to construct. However, construction of such a genome-wide FGN for soybean is challenging for several reasons. First, whereas A. thaliana has <27,000 protein coding genes (The Arabidopsis Information Resource, release 9) [15], soybean is predicted to have 46,430 protein coding genes, 70% more than Arabidopsis [16], but it in fact has 54174 protein-coding genes annotated by EnsemblPlants, as of May 2013 (v1.0, JGI-Glyma-1.1). This increased genome complexity results in a combinatorial explosion for the number of pairwise relations between genes (theoretically <1.5 billion pairs in total but actually we computed more than 2.7 billion pairs because of the three aspects of GO), complicating discovery of true functional associations. Second, the current reference knowledge and raw omic data available for modelling gene interactions are much sparser for soybean than for Arabidopsis, reducing the predictive power of resulting networks and increasing the difficulty of evaluating this power. Despite these hurdles, we constructed the first version soybean FGNs, called SoyFGNs, using the three aspects of GOA published by UniprotKB in September 2012 (version 111), which cover <70% of the 54174 soybean genes (Ensembls) recorded by EnsemblPlants. The construction of the second version SoyFGNs covering all 54174 genes is under way. The entire construction process described below includes the following steps: 1) measuring the pairwise functional similarities of genes annotated by GO; 2) setting a threshold to determine how similar in function the gene pairs should be to be connected in the network; 3) dissecting the validity of SoyFGNs by topology analysis, comparative analysis and functional verification.

Datasets
Gene ontology (GO) The GO data were downloaded from the Gene Ontology website [17] (data version: 1.1.3499), excluding cross-products, inter-ontology and ''has-part'' relationships. This dataset contains 38137 terms, including 1692 obsolete terms. The total valid terms in BP, MF and CC number 23928, 9467 and 3050, respectively. The ''is-a'' and ''part-of'' relationships number 56718 and 6127, respectively.

GO annotations (GOA)
The GOAs of soybean (Glycine max) were downloaded from UniProt-GOA (http://www.ebi.ac.uk/GOA/, version 111). A total of 165040 annotations annotate 37827 (,70%) of the 54174 soybean genes (recorded by EnsemblPlants, release 18 April 2013). The entries annotated in BP, MF and CC number 47452, 92374 and 25214, respectively. The genes annotated in BP, MF and CC number 27594, 33189 and 14150, respectively. Here we use UniprotKB AC/IDs or Ensembl Genome IDs to represent corresponding genes.

Functional similarities of pairwise genes
We previously proposed a shortest semantic differentiation distance (SSDD) method to calculate the semantic similarity between GO terms from a novel perspective [18]. An overlapping directed acyclic graph (DAG, a sub-graph of GO) was generated to represent two given terms. Such a DAG was then viewed as a semantic genealogy wherein a term inherits the semantics of its ancestors and distributes it to its descendants. We introduced the concept of semantic differentiation to represent the transition of a term from one pattern of semantic integration to another and the concept of semantic totipotency to represent the capacity of this differentiation. Taking into account all paths linking a term and its ancestors, the semantic totipotency of a given term t is quantified as a T-value (T(t)) as follows: where r represents a root term. The semantic totipotency of the three root terms is given as 1. The variable v is the semantic differentiation factor for edge linking term t with its parent t p . The T-values of any other terms are derived as the average of all of its parents' T-values multiplied by the semantic differentiation factor (v). The differentiation capacity (T(t)) should decrease moving down the hierarchy and be positively proportional to the number of descendants, or local density. Thus, the v between a term t and its parent t p should be greater than 0 and less than 1, and can be calculated as where Dst(t) is the number of descendants of the term t, including itself. Based on T-values, we proposed the SSDD to measure the semantic similarity in the GO. Given two terms t A and t B , the normalised distance between them is defined as where path t A ,t B ð Þ represents a set of terms on the shortest path connecting the terms t A and t B via their lowest common ancestors(LCAs). The arctan function is used to normalise the distance to (0, 1). Apparently, Dist t A ,t B ð Þ is symmetric, i.e. Dist t A ,t B ð Þ~Dist t B ,t A ð Þ. After normalisation, the semantic similarity is defined as: SSDD was shown to be effective for measuring the semantic similarity of pairwise GO terms. We also need a method for integrating pairwise semantic similarities into a single FS of genes because a gene is often annotated by more than one term in GOA. Three distinct approaches have been proposed for this integration: Lord et al. [19,20] used an arithmetic average (Avg) of pairwise similarities between all terms of the first protein set and the second one; Sevilla et al. [21] used only the maximum (Max) similarity between all term pairs; Couto et al. [22], Schlicker et al. [23] and Azuaje et al. [24] developed the best-match average (BMA) method, in which each term of the first protein is paired only with the most similar term of the second one and vice versa. We take the BMA approach to compare gene similarities, as it was found to be most effective [25]. Given two genes, g 1 and g 2 , BMA is defined as where go 1i (go 2j ) denotes a term that belongs to the term set with a size of m(n) that annotates g 1 (g 2 ). Thus, each gene pair is assigned three FSs based on three orthogonal aspects of GO. We also need a single integrated FS for each gene pair (denoted by FS INT ). Thus, we calculate the weighted average of the three FSs as their integration (hereinafter denoted by INT), which can be formulated as where, FS bp , FS mf , and FS cc are three FSs for each gene pair; w bp , w mf and w cc are the corresponding weights of the three GO aspects. Though the absence of a criterion to quantify the weights of the different aspects of GO on gene's function, we let the weight be equal to the corresponding FS, based mainly two considerations. First, because genes function unequally in the three GO aspects, the one yielding greater similarity should have a greater weight. Second, a great reduction in the integrated FS can be avoided even though the gene pair receives a zero FS in some aspect. The final formula for the integrated FS is where, FS INT also ranges between 0 and 1.

SoyFGNs construction
As shown in our previous work [18], our method yields more reliable gene FS for such species that has shallow gene annotations as soybean, somewhat resolving a critical problem in functional network construction. In doing so, we can calculate any pairwise FSs for a list of genes g 1 ,g 2 , Á Á Á ,g N f g , and further get an N|N similarity matrix M~FS ij Â Ã , in which the element FS ij represent the functional similarity of the gene g i and g j . The next is to filter the matrix M to derive an adjacency matrix A~a ij Â Ã representing the functional gene network. The key to do this is to determine how similar in function must the two genes be to be linked in the network, i.e. appropriate threshold is needed to ensure that gene pairs with FSs greater than or equal to the threshold value will be connected by edges (a ij~F S ij ); otherwise, they are not connected directly(a ij~0 ).
In this study, we adopted clustering coefficient-based threshold selection. The clustering coefficient (C i ) of a node (i) in a network is defined as C i~2 n i =k i (k i {1), where n i represents the number of edges between k i (w1) first neighbours of a gene i; if k i~1 , we define C i~0 . The clustering coefficient of a network is defined as the average clustering coefficient of all of its nodes, where N is the number of nodes in the network. If N~0, we define C~0.
The construction of a gene network can be viewed as a process in which links are removed from the initially complete graph by gradually increasing the FS threshold. Because all FSs range between 0 and 1, we set a series of incremental thresholds t (0ƒtƒ1) with an increment of 0.01. For each threshold t, we construct a network by set a ij~0 if FS ij vt. In systems biology, a genuine biological network should be scale-free and highly modular; its clustering coefficient, denoted by c t ð Þ, should be significantly higher than that of the corresponding random network, denoted by c r (t). Here, we denote the difference between c(t) and c r (t) by Dc t ð Þ, i.e. Dc t ð Þ~c t ð Þ-c r t ð Þ. We conjectured that the most appropriate threshold should be the maximum t, which can produce a monotonically increasing Dc t ð Þ when the links are removed gradually as the threshold increases from 0 to t. More specifically, we formulated this as a discrete optimisation problem, where the critical cut-off threshold t Ã was determined by finding the first t, which lets Dc tz0:01 ð Þ-Dc t ð Þv0 over a set of t gradually increasing from 0 to 1. Note that calculating c r (t) of the randomise networks is nontrivial by formula (8) because it is not clear which random network model should be used for this purpose. Hence, we adopted a statistical method proposed by Elo et al. [26] for its solution. If N denotes the total number of nodes and k i denotes the degree of a node i for the original network, then c r (t) is calculated as the expected value of the clustering coefficient as follows: where k~1=N P N i k i , and k 2~1 =N P N i k i 2 . Finally, an FGN can be constructed and represented as G(V,E,W,T), where V~fg 1 ,g 2 , Á Á Á ,g N g represents the genes involved in the network, E~fe ij[ g i ,g j ] j FS ij §Tg represents the edges between gene pairs with FSs greater than or equal to the threshold T, W~FS ij represents the weights of the edges, which are the FSs of pairwise genes.
Using the pairwise FS of all soybean genes and the clustering coefficient-based threshold selection, we construct four soybean functional gene networks (SoyFGNs) in BP, MF, CC and INT, respectively.

Topologic characterisation of SoyFGNs
One way to characterise biological networks is to study their topologic properties. We using Cytoscape 2.8.2 [27], investigated the global properties of the resulting SoyFGNs. In addition, we conducted an in-depth analysis of the degree distribution and degree correlation, as described in the next two subsections.

Degree distribution
Many early studies observed that biological networks are generally scale free and their degree distribution follows the power law [28,29]. A number of later studies have argued that there are other distributions, such as the log-normal distribution, which explain the degree distribution better than power law [30,31]. We used three models to investigate the distributions of the four resultant FGNs: lognormal, power law and exponential. All model fittings and visualisations are completed with the use of Origin 9 (http://www.originlab.com).

Degree correlation
Degree correlation is a basic structural metric for calculating the likelihood that nodes link to nodes of similar or dissimilar nodal degree. The former case is called positive degree correlation, and the latter is called negative degree correlation. In the social sciences, a network with positive degree correlation is referred to as an assortative network, whereas a network with negative degree correlation is referred to as disassortative network [32]. Three ways of characterising the amount of degree correlation are used, each involving less detail and expressing the result in more compact terms. They are the joint degree distribution (JDD), the k-nearest neighbours (knn) and the Pearson degree correlation (PDC).
The JDD is defined as the distribution in which each entry D ij is the number of edges that the nodes at their endpoints have degrees i and j, respectively. JDD is actually a two-dimensional distribution of the number of edges with respect to the degree of their connected nodes.
Instead of recording every pair of nodes, as JDD does, knn simply averages the degrees of the neighbours of each node of a given degree and plots the results as linear, semi-log, and log-log plots. If a degree is missing, it is skipped in the graph. A rise in knn along with a rise in nodal degree indicates that nodes of similar degree tend to be linked, whereas a fall in knn with a rise in degree indicates the opposite.
PDC is the most condensed way to characterise the degree-link structure of a network. It consists of the conventional Pearson correlation calculation applied to each pair of linked nodes. The result always lies in the range [21,1], with a negative value indicating that nodes of dissimilar degree tend to be linked and a positive value indicating that nodes of similar degree tend to be linked.

A Soybean network generated by orthology from Arabidopsis
An alternative approach to constructing a soybean gene network might be simple to transfer linkages from orthologous gene pairs of the existing gene network. This approach does not require modelling using soybean annotations or any of the soybean-derived experimental data. The value of this approach has been shown in reconstruction of gene networks for C. elegans [5] and Arabidopsis [7]. To assess the accuracy of the SoyFGNs in comparison to such an orthology-derived network, we first identified the orthologs between soybean and Arabidopsis using BLASTN, the results are shown in Table S1. We then downloaded the gene network of Arabidopsis from BioGRID (3.2.96) and infer soybean gene linkages based on linkages of this network, generating an orthology-derived soybean gene network, which consists of 16566 nodes (genes) and 146562 edges (linkages).

Inferring functional linkages from KEGG pathways and validating a query network
To validate SoyFGNs using independent annotations, we employed the Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway database [33]. KEGG is based on manual curation and is thus considered generally accurate and largely independent from both SoyFGNs and the orthology-derived network. We downloaded equivalent link information for soybean genes from LinkDB (http:// www.genome.jp/linkdb/) using UniprotKB AC/ID on March 2013. All links were also mapped to Ensembl Genomes IDs. As a result, 3145 genes were mapped to 238 pathways, which can be retrieved by our web database (http://nclab.hit.edu. cn/SoyFN/tar_pathway.php). As a benchmark network, the KEGG-derived network was constructed by generating linkages between genes sharing KEGG annotation terms, i.e. sharing the same KO IDs. The validation of a query network by KEGG-derived network is mainly based on the gene coverage and the linkage accuracy. The gene coverage (G cvrg ) is defined as G crvg~Nshared =N KEGG ð Þ |100%, where N shared is the number of genes shared by the query network and the KEGGderived network, N kegg is the number of genes involved in KEGG-derived network. The linkage accuracy (L acc ) is defined as L acc~Lshared =L KEGG À Á |100%, where L shared is the number of linkages between N shared genes in the query network, L KEGG is the number of linkages between N shared genes in the KEGG-derived network.
Inferring functional linkages from co-expression data and validating a query network Another major source of functional associations is mRNA co-expression data. So we additionally inferred functional associations from mRNA co-expression profiles to evaluate SoyFGNs. 11 datasets for Glycine max genes was downloaded from the Gene Expression Omnibus (GEO) [34] on March 2013 (Table 1). In order to reduce the false positive rate, 4 datasets that have less than 20 samples each were discarded. The remaining 7 datasets were then filtered by removing the uninformative sets by testing for a significant correlation between the Pearson correlation coefficients (PCCs) between pairs of genes' expression vectors and removing the genes not sharing a specific Ensembl Genomes ID for further analysis. For each dataset, the PCC between pairs of genes' expression profile was used as the measure for inferring the co-expression linkages. The pairs of genes, between which the absolute value of PCC is more than 0.8, were linked. Finally, all linkages derived from 7 expression datasets were merged into a final co-expression network. The inclusiveness of a network versus co-expression network is also measured by the gene coverage (G cvrg ) and the linkage accuracy (L acc ).The gene where N shared is the number of genes shared by the query network and the co-expression network, N co{exp is the number of genes involved in co-expression network. The linkage accuracy (L acc ) is defined as L acc~Lshared =L co{exp |100%, where L shared is the number of linkages between N shared genes in the query network, L co{exp is the number of linkages between N shared genes in the co-expression network.
For the reason that co-expression network is generated from an approach different from the one to generate GO, while the KEGG network is generated from the same approach to generate GO. It would generate a very low linkage accuracy. To evaluate the perhaps low linkage accuracies are statistically significantly higher than the background accuracy, we made an additional statistical analysis between the linkage accuracies of the original ontology-derived network and SoyFGNs and their corresponding randomized networks. A randomized network is generated by doing randomly perturbations to the edges, but maintaining the same nodes and their degree distributions. As our pre-experiments showed that more than 400 times perturbations could provide a stable-property randomized networks, we used the average of 400 randomized networks to evaluate the background linkage accuracy. The p-values are given to indicate their difference significances (by ANOVA).

Network-guided disease-resistant gene discovery
The aforementioned pathway and co-expression analysis showed that genes for similar biological processes or with similar expression profiles can be successfully associated in SoyFGNs. We next, as a case study, specifically tested the feasibility of predicting the genes governing plant disease resistance by using SoyFGN-INT in two steps: network-guided discovery and in silico verification. Plant disease resistance protects plants from pathogens. Resistance genes (Rgenes) are genes in plant genomes that convey plant disease resistance against pathogens by producing R-proteins. The main classes of R-genes consists of a nucleotide binding domain (NB) and a leucine rich repeat (LRR) domain(s) and are often referred to as (NB-LRR) R-genes. NB-LRR R-genes can be further subdivided into toll interleukin 1 receptor (TIR-NB-LRR) and coiled-coil (CC-NB-LRR) [35]. To implement this study, randomly selected 24 genes were used as query genes to predict R-genes using the Gaussian smoothing guilt-by-association method [36]. In order to evaluate the stability of prediction, 6 (1/4) of 24 query genes were putative R-genes, while others were experimentally verified R-genes (see Table S2). To be noted that, we here predicted candidate genes by only using the direct network neighbors via guilt-by-association. 225 of 737 candidate genes, which are highly connected with 14 query R-genes and constitute the biggest disease resistant module, were used for further analysis (see Table S3). For these 225 disease-resistant candidates, we first defined their functions by extensive databases and literatures searches. Second, we assigned a weighted rating (WR) score for each candidate according to its connected known R-genes to prioritize their possibilities to be R-genes. Obviously, WR of a gene should be positive proportional to both the number of its neighbor function-known R-genes and the average weight of edges link it to the neighbors, which were represented as functional similarity (FS). We used the so called 'true Bayesian estimate' to compute such WR, which is a useful weighting mechanism used by the Internet Movie Database (IMDb) to adjust a movie's rating score based on the number of votes it has received. The formula is defined as: v vzm : Fz m vzm : C where F, the average FS of each gene; C, the total average FS of all genes; v, the number of neighbor function-known R-genes; m, a minimum number of neighbors to a R gene, which was set to be the first quartile of the neighbor number distribution of all 225 genes. The resulting WR scores of 225 genes are also provided in Table S3 (xls). We excluded the FSs of the genes themselves because these will not be used for subsequent construction of the no-loop networks. The distribution of these four types of pairwise FSs is shown in Figure 2. The complete data are provided on our website (http://nclab.hit.edu.cn/ SoyFN) because their sizes exceed the upper limit of supplementary files (each larger than 8 GB). All genes can be retrieved by the UniprotKB AC/ID or the Ensembl Genome ID on our website (e.g., K7MVA4 and GLYMA18G52145).

Functional similarities of pairwise genes
Hereafter in this paper, we use the UniprotKB AC/ID to refer to the corresponding gene.

SoyFGNs construction
Our SoyFGNs are weighted undirected graphs in which the nodes represent genes and the edges represent their functional associations weighted by the pairwise functional similarities (FSs) of genes they link. Given pairwise FSs, the next step is to set an appropriate threshold to ensure that gene pairs with FSs greater than or equal to the threshold will be connected by edges; otherwise, they are not connected directly. We adopted the clustering coefficient-based threshold selection, which is based on the fact that given a threshold t, a biological network   should be scale-free and highly modular, and thus its average clustering coefficient, denoted by c t ð Þ, should be significantly higher than that of the corresponding random network, denoted by c r (t).
By setting a series of incremental thresholds t (from 0 to 1)with an increment of 0.01, we used each threshold to filter the original networks (including all pairwise similarities of genes) in BP, MF, CC and INT, respectively. As a result, we obtained 100 networks each in BP, MF, CC and INT. Using our own JAVA script, we calculated the cluster coefficient of each network (c(t)) and that of its random model (c r (t)). As shown in Figure 3, the first stop of monotonically increasing of the Dc t ð Þ occurs at t~0.99, 0.99, 0.84, and 0.99 in BP, MF, CC, and INT, respectively, which indicates that these thresholds are the most appropriate ones for constructing the FGNs in BP, MF, CC, and INT, respectively (for more explanation, see the corresponding parts of the METHODS section).
Using the above-mentioned thresholds, we constructed four FGNs in BP  Table 2).

Degree distribution
Three models were used to investigate the distributions of the four SoyFGNs: lognormal, power law, and exponential. Graphs of the degree distribution and the three fitted models for each network are shown in Figure 4. The detailed parameters of these models and their performances (represented by R-squared, R 2 ) are listed in Table 3. Our results showed that the exponential models followed by power law models fit the degree distribution best, and the lognormal models were the worst. The degree distributions indicate that SoyFGNs have the typical characteristics of biological networks, e.g., scale free, small world, rather than the characteristics of random networks, for which the degree distribution fit Poisson distribution best. We would like to clarify that Poisson distribution was also used to fit the degree distribution of SoyFGNs, but the results are not given because they deviated fully from the degree distribution of each network.  Table 2. Second, the extremely sharp protrusions show that little nodes share a large number of edges, indicating the existence of local dense functional modules in SoyFGNs. Third, the majority of apparent peaks (the number of edges >2000) appear in the low-low and high-high degree node pairs, suggesting that the genes tend to interact with those of similar degrees in SoyFGNs, indicating their assortative features.
Similar results were obtained in the analysis of the knn and the PDC ( Figure 6). The overall ascending knns and large positive PDCs indicate that genes of similar degrees tend to be connected with each other more in all four SoyFGNs.

SoyFGNs is more extensive and accurate than a network generated by orthology from Arabidopsis
It is an open question how well a gene network derived from a bettercharacterised dicot such as Arabidopsis might faithfully reconstruct a gene network in another dicot such as soybean. To assess the accuracy of such a network, we defined an orthology-derived soybean gene network from  Figure 7). Therefore, in terms of genome coverage, SoyFGNs are more extensive than orthology-derived network. We further assessed the quality of SoyFGNs in comparison to orthology-derived network by two additional computational analyses using two independent data sources: KEGG pathways and co-expression profiles.

Assessment using linkages derived from KEGG pathways
We tested the accuracy of the SoyFGNs versus the orthology-derived network using linkages derived from the KEGG pathway database. As a result, the KEGGderived network consists of 380969 edges between 3144 genes, which share 238 KO IDs in 123 pathways of Soybean. The validation of a query network by the KEGG-derived network was based mainly on the gene coverage (G cvrg ) and the linkage accuracy (L acc ) defined in Method section. Compared with orthologyderived network, SoyFGNs have significantly higher G cvrg and L acc (Table 4), indicating SoyFGNs shared significantly more genes and linkages with KEGGderived network than did the orthology-derived network. Assessed by linkages derived from KEGG pathways, SoyFGNs are therefore considered to be more extensive and accurate than orthology-derived network. We noted that SoyFGN-CC gets a lower G cvrg than orthology-derived network. This is mainly because (1) GO annotates the fewest number of genes in CC (see section 2.1.2) and (2) KEGG annotates genes mainly considering their molecular function (MF) or biological

Assessment using linkages derived from co-expression profiles
We evaluated the SoyFGNs by comparing them to the orthology-derived network using the soybean gene co-expression network. As described in Introduction section, the co-expression network can, to some extent, reveal the genes function and the complex mechanism of action between genes, although genes interacting with each other do not always have similar gene expression profiles, and vice versa. Therefore we used the co-expression network as another independent reference network to evaluate the extent to which the more inclusive SoyFGNs consolidate the gene interactions derived from co-expression, referred to as inclusiveness. To do this, 7 of 11 in total Gene Expression Omnibus (GEO) [34] datasets were used to infer gene co-expression interactions, resulting in a coexpression network of 12933 genes linked by 2971228 edges. The inclusiveness of a network versus the co-expression network was also measured by the gene coverage (G cvrg ) and the linkage accuracy (L acc ), which led to similar results (Table 5), i.e., SoyFGNs shared more genes and linkages with the co-expression network than did the orthology-derived network, indicating the greater extensity and accuracy of SoyFGNs. Additionally, the statistics analysis showed that the shared edges (accuracy) are significantly higher than background (p,0.05, Table 5). Thus, according to the above two comparisons, reconstructing a gene network specifically for soybean genes rather than simply generating the network from orthology is essential and improves both accuracy and coverage of the network.

Network-guided discovery of disease-resistant genes
Eighteen randomly selected true R-genes and six putative R-genes (listed in Table  S2) were used as query genes to predict the potential R-genes in SoyFNG-INT by using guilt-by-association [36]. As a result, we identified 737 candidate genes with a predicted function in disease resistance, accounting for 95.0% of 776 genes deposited as disease resistance genes in UniprotKB as of June 2013. The 737 Figure 6. The k-nearest neighbours (knn) and Pearson degree correlations (PDCs) of SoyFGNs. Note that the PDC in each sub-graph was calculated according to the degree of two endpoints of all edges, rather than results derived from this graph.
doi:10.1371/journal.pone.0113907.g006 candidate genes and 24 query genes together constitute a disease-resistant gene network (Figure 8), which shows that most true R-genes (shaped like red octagons) are more connective with each other and share more candidates than the putative R-genes (red diamonds) do. We selected only the genes (14 genes, pink background in Table S2 or red-filled octagons in Figure 8) and their first neighbors (225 genes, yellow-filled ellipses in Figure 8) that constitute the largest disease-resistance module (listed in Table S3) as the high-confidence predictions to further evaluate the predictability of SoyFGN. Of these 225 disease-resistant candidates, 117 genes (52.00%) were previously known as R-genes, 25 of which  were experimentally validated and 92 putative; 103 genes (45.78%), of which the functions were previously unknown, were newly predicted to be disease-resistant genes by using SoyFGN-INT; only 5 genes were not confirmed to be associated with plant disease resistance, i.e. false positive (2.22%). The results are also briefly summarised in Figure 9A.

In silico verification of newly predicted disease-resistant genes
As the results of network-guided R-gene discovery, a highly confident diseaseresistance module consisting of 14 query R-genes and 225 more predicted candidates was obtained by using the Gaussian smoothing guilt-by-association method. Among the 225 candidates (Table S3 (xls)), except 6 false-positive genes, 103 predicted candidates' functions are unknown. For these 103 newly predicted disease-resistant genes, it is nontrivial for us to validate them one by one using wet-lab experiments. Here we provide an in silico verification to check the performance of SoyFGN-INT on predicting the function of unknown genes by using Blast2GO [37]. The verification procedure includes the following: 1) BLAST the protein sequences of 103 candidates against the non-redundant protein database of NCBI (nr) using BLASTP to hunt for their orthologues; 2) analysis of the enrichment functions of BLAST hits by integrating the functional information retrieved from GO annotations, domain/motif and the KEGG pathways; 3) mapping the enrichment functions of orthologues to the unknown genes. We made many settings to reduce the false positive. In BLAST step, all sequences were blasted to nr (non-redundant protein database of NCBI) using BLASTP, with the minimum E-value of 1.0E-8; top 20 hits were selected to be used in next step; genes with less than 5 hits were excluded. In annotation step, in addition to GO annotations, we also ran the 'InterProScan' using all available applications as well as 'GO-Slim' using 'goslim_plant.obo' to enrich the annotations. Additionally, Enzyme codes and KEGG pathways were taken into account to enhance the annotations.
As a results, 97 of the 103 genes got at least 1 hit, of which 89 genes got at least 10 hits and 87 genes got more than 20 hits (see Table S4). 97 matched genes finally got 451 annotations in three aspects of GO (P, biological process; F, molecular function; C, cellular component) in total (see Table S4). 14 matched genes were mapped to 8 Enzyme codes and involved in 18 KEGG pathways ( Table 6). The species distribution of the annotations is shown in Figure 10. The enriched putative functions of all these 97 matched genes are shown in Figure 11. According to the annotations as well as the extensive database and literature searches, 77 of the 103 genes were newly predicted to be putative disease-resistant genes, 13 were recognised as non-resistant genes, and the rest (13 genes) remained unknown. Finally, 194 (86.22%) of 225 genes were identified as disease-resistant genes, an increase of 65.8% over the previously known R-genes, 13 retained unknown function and 18 were false positive ( Figure 9B). Additionally, all predicted R genes were prioritised by assigning each with a weighted rating (WR) score. The results are provided in Table S4, which will help biologist identify the R-genes from the most likely candidates.

System-level insight into the cellular interactome of non-model organisms becomes feasible
We have shown here that inferring, modelling, and analysing the intracellular interactome of a non-model species became a reality based on the notion of functional gene network (FGN). Although FGNs have been constructed for many model species, the methods cannot yet be extended to other infrequently studied species, such as Glycine max, due to the absence of sufficient heterogeneous and previously known omic-level interaction data as shown in Figure 1. Using GO annotations and our SSDD method, proposed for comparing gene functional similarity (FS), we identified the pairwise genes' FSs for soybean and further modelled the gene network on the notion of functional association. The schemes introduced here seem much simpler than those integrating heterogeneous omic data, yet it is currently the best solution for non-model species because the GO annotations actually provide a way to integrate diverse data into a single structured dataset. Inferring from orthologs, co-expression, and sharing KEGG terms are some alternative solutions, by which, however, the networks were proved to be less extensive and accurate than SoyFGNs. Additionally, as a case study, the successful application of SoyFGN-INT to predict the soybean diseaseresistant genes further illustrates that SoyFGNs constructed on the basis of GO similarities can also provide system-level insight into the intracellular interactome as the networks of model organisms did, and this will speed up the discovery and definition of the function and interaction of genes that control important plant characteristics such as disease resistance, symbiotic nitrogen fixation, and protein  and lipid synthesis in soybean. A study conducted on the soybean microRNA interactome based on SoyFGNs is an additional powerful evidence of the important roles of SoyFGN in future studies of the soybean functional interactome at the genome and microRNome levels [38].

The first global view of soybean gene functional interaction
Soybean (Glycine max) is one of the most economically important crops and a major food source. A soybean whole-genome shotgun sequence of Glycine max var. Williams 82 was first reported in 2010 [16], which stimulated research on soybean at the genome level. The work introduced herein is the first study on soybean gene interaction on the genome level. We drew four functional gene networks (SoyFGNs), containing up to 70% of the soybean genes reported by EnsemblPlants (release 18, April 2013) and the construction of the second version SoyFGNs covering all genes is about to release. The topological analysis showed that, like other biological networks, SoyFGNs are scale free, and their degree distributions fit best to exponential and power-law distributions. Their degree correlations indicate that the genes of similar degrees tend to be connected with each other more in all four SoyFGNs, referred to as assortativity, implying the existence of functional modules in SoyFGNs. The achievements we report here will be fundamental to further studies on the interactome of soybean at the genome level. We admit that the SoyFGNs certainly contain false positives and even errors, just as or more than the model organisms contain. However, the effort involved in this work seems to be the best solution with the best outcome for such non-model organisms in absence sufficient data sources. The inherent deficiencies will certainly be overcome with increasingly enrichment of the data.

Availability
Based on the research described herein, we developed a user-interactive web platform for information retrieval and analysis of the SoyFGNs and the aforementioned microRNA networks derived from SoyFGNs, SoyFN: http://nclab. hit.edu.cn/SoyFN/.

Prospects
Our SoyFGNs provide a systematic view of the whole soybean genome, and hence such construction of the genome-wide networks has been followed by attempts to discover and predict function within the system as a whole. Therefore, the effort represented by our study is just the beginning of characterising the soybean functional genome. As shown in Figure 12, our whole research project consists of three main focuses: 1) construction of SoyFGN as described herein (shapes in blue background); 2) inferring the microRNA functional network of soybean based on the SoyFGNs (shapes in yellow background); 3) module detection, miRNA-gene two layer network analysis, and further interactive module analysis coupled with genomic context analysis to discover the gene-miRNA regulatory mechanism involved in stress resistance, nitrogen fixation, protein and lipid synthesis along with other biological processes in soybean (shapes in red background). Overall, the efforts of the study described herein are the basis of our further comprehensive studies on the soybean functional interactome at the genome and microRNome levels.

Conclusions
As the most important biomolecules in a cell, genes rarely act alone. They interact functionally with other genes to synergistically mediate their biological functions. So far, hardly anything is known about this functionally interplay between genes in Soybean (Glycine max) at the genome level. The only large-scale genomic study in Soybean investigated the whole-genome shotgun sequences of Williams 82 [16], which stimulated our research on genome-level interactions among genes.
As an initial step on the way to fully expose the ensemble of all functional associations between genes, we here present the first FGNs of soybean (SoyFGNs). Instead of combining unavailable genomic, transcriptomic and comparative genomic data to predict associations (interactions) between gene pairs, we inferred the gene functional associations from GOA resulting in four comprehensive networks of gene associations that covers 70 percent of the predicted genes of soybean. We showed that SoyFGNs are scale free, and in which the genes of similar degrees tend to be connected with each other more in all four SoyFGNs, referred to as assortativity, implying the existence of functional modules in SoyFGNs. Verified by co-expression and KEGG pathways, SoyFGNs are more extensive and accurate than an orthology network derived from Arabidopsis. Network-guided disease-resistance gene discovery indicates that SoyFGNs constructed on the basis of GOA can also provide system-level insights into the intracellular interactome as the networks of model organisms did, which will speed up the discovery and definition of the function and interaction of genes that control important plant characteristics such as disease resistance, symbiotic nitrogen fixation, and protein and lipid synthesis in soybean. The availability of the predicted functional association network allows a gradual transition from a single gene perspective to a more comprehensive understanding of the complex biology of soybean. Additionally, a web tool for information retrieval and analysis of SoyFGNs can be accessed at SoyFN: http://nclab.hit.edu.cn/SoyFN. Table S1. Orthologs between soybean and Arabidopsis using BLASTN.