Nonlinear Network Reconstruction from Gene Expression Data Using Marginal Dependencies Measured by DCOL

Reconstruction of networks from high-throughput expression data is an important tool to identify new regulatory relations. Given that nonlinear and complex relations exist between biological units, methods that can utilize nonlinear dependencies may yield insights that are not provided by methods using linear associations alone. We have previously developed a distance to measure predictive nonlinear relations, the Distance based on Conditional Ordered List (DCOL), which is sensitive and computationally efficient on large matrices. In this study, we explore the utility of DCOL in the reconstruction of networks, by combining it with local false discovery rate (lfdr)–based inference. We demonstrate in simulations that the new method named nlnet is effective in recovering hidden nonlinear modules. We also demonstrate its utility using a single cell RNA seq dataset. The method is available as an R package at https://cran.r-project.org/web/packages/nlnet.


Introduction
High-throughput expression techniques such as microarray, deep sequencing, and liquid chromatography-mass spectrometry (LC-MS) are able to generate measurements of tens of thousands of biological units simultaneously [1,2]. Identifying relations between the biological units and detecting modules in the system can result in a better understanding of the underlying regulatory system [3][4][5]. However, the large number of biological units involved, the complexity of their associations, and the under-determined nature of the problem poses great challenges toward data analysis [6].
The topic of network reconstruction from high-throughput data has received a great deal of attention in recent years. A number of methods with different objectives have been proposed, including network reconstruction based on marginal correlation [7,8], Gaussian graphical models [9,10], Bayesian network [11,12], mutual information-based network inference [13][14][15][16], etc. Some methods go beyond inference using a single approach, by taking previous knowledge into consideration [17], or combining the predictions from multiple methods [18,19]. Given a gene expression dataset, different algorithms put emphasis on different aspects of data characteristics, and the prediction of different algorithms always show large discrepancies [20].
Nonlinear and complex relations between genes has been widely documented. They may represent nonlinear response, conditional dependency, or time-lagged dependency [21][22][23][24]. Mutual information-based network reconstruction methods can establish links between genes based on nonlinear relations [13][14][15][16]25]. However nonlinear relations can be diverse, and generally the statistical power of detecting such relations is lower compared to detecting linear relations based on correlation.
We have previously developed a very sensitive method to detect predictive nonlinear relations, named the Distance Based On Conditional Ordered List (DCOL) [26]. Given the method is rank-based, and the asymptotic normality of the test statistic under the null hypothesis, the computation is very efficient on large matrices. Also, it is amenable to statistical inference, i.e. rigorously defining links between genes based on p-value or false discovery rate [27]. These properties make DCOL a very good candidate for nonlinear network reconstruction.
In this study, we develop a network reconstruction method based on DCOL. First, we compute pairwise relations between genes using DCOL. We then conduct gene-wise false discovery rate inference [27] to determine significant links for each gene. Such an approach will yield a network with marginal nonlinear dependencies, i.e. not considering conditional dependencies. It has been shown previously that reconstruction based on marginal linear dependencies performed very well compared to other methods [20]. Given that nonlinear relations can be quite diverse, and the statistical power of detecting different types of nonlinear relations can vary substantially especially when considering conditional dependencies, it is our belief that constructing a network using marginal dependencies and allowing users to trim the links based on biological considerations is a viable approach.
we demonstrate the performance of our method using simulations, mostly in its capability of recovering hidden community structures, in comparison with a few other methods. We also apply the method to a real dataset. The real data analysis shows that our method successfully detects meaningful biological relations supported by existing knowledge, as well as detects plausible new links. In the following discussions, we refer to our method as nlnet.

Distance Based on Conditional Ordered List (DCOL)
Given a data matrix G p×n with p genes from n samples, the key issue is to find significant relations between genes that are either linearly or nonlinearly associated. We have previously proposed the Distance Based on Conditional Ordered List (DCOL) as a non-linear distance [26]. To find the DCOL(g j |g i ),8j = 1,. . .,p, where g i is a row-vector in matrix G, we sort the columns of the matrix based on the value of g i first to create an ascending order, Then we obtain the DCOL based on the sorted data, In actual computation, for each g i , we permute the columns of matrix G based on the order of the values of g i , and find DCOL(g j |g i ) for all j's simultaneously.

Null Distribution
It has been shown that the DCOL under the null hypothesis, i.e. the two genes are independent of each other, follows normal distribution [26]. We can obtain a gene-specific null distribution by permutation. In order to estimate a null distribution of genes, we randomly permute the columns of data matrix G for B times (B = 500 in this study). For each permutation, we calculate the sum of absolute difference between adjacent columns in a row, We then estimate the null distribution parameters for gene i, m i ¼ 1

Gene Network Reconstruction
By applying the DCOL we introduced in section 2.1, we find the gene distance matrix D p×p , where D i,j = DCOL(g j |g i ). For each column of the matrix, we compare the distance vector to the null distribution to obtain the one-sided p-value, P i;j ¼ F À1 D i;j Àm j s j . We then feed the column vector of p-values to the fdrtool package [28] to obtain local false discovery rate (lfdr) values [29], which is the posterior probability that g j is dependent on g i Up to this step, we have a matrix L p×p of local fdr values. Next we threshold the local fdr values to obtain connections between genes. In order to control the overall level of edge density in the network, we apply a dynamic thresholding procedure. A parameter is provided to set the target average degrees over all nodes. This target corresponds to a target fdr value. If this target fdr is between 0.05 and 0.2, we will use the target fdr as cutoff. Otherwise, if the target fdr is larger than 0.2, we will use 0.2; if the target fdr is smaller than 0.05, we will use 0.05 as cutoff.

Community Detection in Network
As the biological system is modular [30], to facilitate data interpretation, we further decompose the gene network into communities. There are many mature community detection methods which are commonly applied in social computing but are also suitable for genes networks. In this study, we apply two common community finding methods: multi-level optimization of modularity, and label propagation [31,32].

Simulation
We conduct a simulation study to examine the capability of the proposed method to recover nonlinear gene modules. The reason for focusing on module recovery, rather than link recovery, is because in the nonlinear situation, methods have different statistical power on different types of nonlinear functions, and it is unreliable to define conditional dependency given the existence of measurement noise. As modules are more robust against the gain/loss of a small proportion of links, we believe the recovery of gene modules is a more meaningful goal.
We conduct the simulation by setting parameters in two main scenarios. We set the average number of genes in a module to 100, and set the number of hidden modules to 10 or 20. An additional 10% pure noise genes are also added. In each scenario, the number of genes are calculated based on modules number and size, namely 1100 genes for scenario 1 and 2200 genes for scenario 2. Noise is added to the generated gene expression vectors by adjusting the noise level from 0.2 to 0.8 (noise standard deviation divided by true signal standard deviation). We use three different mechanisms for generating modules, (1) 0-dependent approach, (2) 1-dependent approach, (2) 2-dependent approach. To establish nonlinear relations, we use four link functions including (1) linear function, (2) sine function, (3) box wave function, (4) absolute value function.
For 0-dependent approach, for each module, we generate the expression level of a hidden controlling node x by sampling the standard normal distribution. Then for each gene in the module, a link function f() would be randomly chosen from the function group, and the gene expression level is generated as y = f(x) + ε, where ε is random noise. For 1-dependent approach, for each module, the first gene is generated by sampling the standard normal distribution. From the second gene on, one existing gene is randomly selected, and a link function is randomly selected. The expressions of the new gene is generated as y (new) = f(y (selected) ) + ε. The 2-dependent approach is similar to the 1-dependent approach, except each new gene depends on two existing genes, y (new) = f(y (selected1) ) + g(y (selected2) ) + ε

Simulation Result
We compared our algorithm with three methods that deal with nonlinear associations. One is the mutual information-based network inference methodology ARACNE, and its R package name is parmigene [14]. Two other methods are nonlinear clustering methods: General Dependency Hierarchical Clustering (GDHC) [26] and the K-profile method [33].
The Adjusted Rand Index (ARI) [34] is used to evaluate the results. Higher ARI values means better agreement between detected and hidden true modules. The simulation results are shown in Fig 2. There are two rows in the figure. The first row represents algorithm performance under the 10 hidden gene module circumstance, and the second row represents 20 hidden module circumstances.
For the 0-dependent scenarios, apparently the result of nlnet combined with the multi-level algorithm outperformed all other methods across all noise levels. The result by ARACNE is the worst one, in which the optimal ARI is below 0.5 and in most of cases, it stays at the bottom of the plots. So it is safe to say that the ARACNE algorithm is not adapted well to our benchmark.
For the 1-dependent scenarios, nlnet performs the best when noise level is low. The result of K-profile becomes better when the noise level grows higher, while GDHC and ARACNE lag behind. This trend persists into the 2-dependent scenarios. We notice that the 2-dependent scenarios, the dependencies between gene pairs are weaker but more pervasive. When noise level is high, it is likely that network-based methods lose too many true connections, causing the modules to break up.
The second row of Fig 2 (20 true modules) shows a similar trend as the first row (10 true modules). Compared with nlnet paired with multi-level or label-propagation method, the K-Profile algorithm only shows a better result in cases where the noise is high, or the dependencies are weaker but more pervasive. Overall, the nlnet-based module recovery outperforms the existing methods in lower noise situations, and when the dependency structure is simpler.

Real Data Result
We apply the method on the single-cell RNA seq data on sensory organs of the neonatal inner ear in mice (GSE71982) [35]. In the inner ear, sound and acceleration stimuli are transduced by cochlear and utricular sensory epithelia. We compare the cochlear and utricular single cell RNA seq data through our method of non-linear network reconstruction and community detection. Firstly, we filter the sequencing data by requiring non-zero values of each selected gene to be greater than 2/3 of the samples in either of the tissue types. Secondly we log-transform the data by taking log(x+1) of all values. Finally, we apply our algorithm to construct networks for utricular and cochlear sensory epithelia separately. For community detection, the minimum module size of is set to 100. Fig 3 shows the overall network. The cochlear network has 2412 connected genes and 33846 connections. The utricular network has 2906 connected genes and a total of 45068 connections. With an average degree of 15.5 compared to 14 in the cochlear network, the utricular network is more densely connected. We color the highly connected genes in the utricular network (!50 connections) red in the cochlear network (Fig 3a), and vice versa (Fig 3b). We see that many highly connected genes in one cell type are no longer highly connected in the other type of cell. Functional analysis on highly connected genes (!50 connections in cochlear cells alone, or !50 connections in utricular cells alone) shows an over-representation of regulatory and sensory developmental function (Table 1). In our result, there are two communities (523 and 1726 genes respectively) in cochlear cells. The second community shows a clear functional link to inner ear and sensory development ( Table 2). There are 5 communities (535, 205,159,1235, and 553 genes respectively) in utricular cells. However there is no clear link to inner ear development at the community level. Given the communities are too large to yield simple biological interpretations, we focus on some genes that have been widely studied in ear development.
Genes such as Tgfb2, Stat3, Notch2, and Fgf10 (Fig 4), were widely studied in the development of the ear. These genes are found in communities of both cochlear and utricular system. But their connection patterns are different between the two cell types.
We can find there are more genes connected with Notch2 and Tgfb2 in utricle than cochlea. The Notch signaling pathway is thought to regulate development of inner ear. Overall, Notch2 has 28 connections in the cochlear cells, and 159 connections in the utricular cells. Functional analysis using GOstats [36] shows that Notch2-connected genes over-represent biological  processes related to the genesis of cellular structure in cochlear cells, and sensory development in utricular cells (Table 3). Aberrant number and arrangement of hair cells and supporting cells of the ear are caused by mutations in the Notch signaling pathway. Moayedi et al found the Sfswap Tg mutation can result in defects in the mouse ear, and the defects were consistent with disrupted Notch signaling [37]. In this study, Sfswap mutants exhibit defects in hair cells and supporting cells in the cochlea, but exhibited much more severe vestibular organ defects.
In our results, we can only find the connection between Sfswap and notch pathway in the utricular cells, which is part of the vestibular organ. Slowik et al found the ligand of Notch pathway, Jag1, can attend mammalian hair cell regeneration through Tgf family [38]. Jag1 is connected with Tgfb2 in cochlear cells, but not in utricular cells, although Tgfb2 is connected with more genes in utricular cells, which over-represent neurological system process. Stat3 has more direct link genes in cochlea than utricle. Qin et al found the cross talk of Klf4 and Stat3 can regulates axon regeneration, which was also associated with ear nervous sensor development [39,40]. In the estimated networks, Klf4 is connected with Stat3 only cochlear cells.
Fgf10 plays a major role in ear morphogenesis. Urness et al found a dosage-sensitive requirement for Fgf10 in vestibular development [41], and that the mutant cochlear epithelium of the Fgf10−/− embryos lacks Reissner's membrane and a large portion of the outer sulcus, consistent with earlier findings on a null mutation in Fgfr2b, one of Fgf10's main receptors [42]. In our result, Fgf10 is connected with Jag1 in both cochlear and utricular system. Jag1, a notch ligand, is also required for sensory progenitor development in the mammalian inner ear [43]. Together, our result suggests the notch pathway has interacted with other signaling pathways such as the Tgf [38] and Fgf [43] pathways to create sensory organs of ear.
There are some other genes that only have connections in one of the cell types. As an example, Igf1r is a classic transmembrane tyrosine kinase receptor and is believed to mediate all cellular responses to the Igf ligands. During the development of cochlea, Igf signaling and PI3K/ AKT signaling can regulate cochlear length and hair cell number. Igf signaling also regulates Atoh expression by PI3K/Akt signaling. PI3K/Akt signaling is considered to be the downstream of Igf signaling [44,45]. Igf1r is found in a community of cochlear network, with connection to another important gene Chol11a1 in cochlea development [46], but not in the utricular system (Fig 4). Adam10 is also related with the development of cochlear. Yan et al found regional expression of the Adam10 in developing chicken cochlea [47]. Lin et al also showed that Adam10 is expressed and regulated in the early developing cochlea [48]. Some more genes such as Rbpj, that take part in the functional and morphological development of cochlear organ [49], also only found in the community of cochlear network.
Overall, our results show some agreements and distinctions between the cellular networks of cochlear and utricular cells, much of which agree with existing knowledge, with some new links that are plausible but requires experimental studies to validate. This indicates the nlnet method can produce biologically meaningful results. The computation of the nlnet method is relatively efficient. As shown in Table 4, among all the methods used in simulation, nlnet is the second fastest. It uses~20 seconds for a data matrix of dimension 1100 × 100, on a laptop computer with a 2.5 GHz intel i7 processor. When the data dimension is larger in real data analysis, nlnet is the fastest among all methods tested, finishing in~5 minutes for a data matrix of dimension 3724 × 164.

Conclusions
In this manuscript, we presented a network reconstruction method based on nonlinear associations. Linear associations are also utilized as a special case of the general association. The method is based on the distance measure DCOL, which enjoys high sensitivity and efficient computation. The method is effective in reconstruction of the network based on marginal nonlinear associations. The nonlinear network reconstruction method can also capture linear relations. However when sample size is not large, its statistical power to detect linear relations is lower than methods developed specifically for linear relations. Thus it may lose some weak linear relations, while gaining nonlinear relations. At the same time, currently there is no straight-forward definition of conditional independence based on DCOL. Thus such a nonlinear methods could be seen as a compliment to linear network reconstruction methods. It can potentially be combined with linear methods to make the results more comprehensive.