Gene Cascade Finder: A tool for identification of gene cascades and its application in Caenorhabditis elegans

Obtaining a comprehensive understanding of the gene regulatory networks, or gene cascades, involved in cell fate determination and cell lineage segregation in Caenorhabditis elegans is a long-standing challenge. Although RNA-sequencing (RNA-Seq) is a promising technique to resolve these questions, the bioinformatics tools to identify associated gene cascades from RNA-Seq data remain inadequate. To overcome these limitations, we developed Gene Cascade Finder (GCF) as a novel tool for building gene cascades by comparison of mutant and wild-type RNA-Seq data along with integrated information of protein-protein interactions, expression timing, and domains. Application of GCF to RNA-Seq data confirmed that SPN-4 and MEX-3 regulate the canonical Wnt pathway during embryonic development. Moreover, lin-35, hsp-3, and gpa-12 were found to be involved in MEX-1-dependent neurogenesis, and MEX-3 was found to control the gene cascade promoting neurogenesis through lin-35 and apl-1. Thus, GCF could be a useful tool for building gene cascades from RNA-Seq data.


Introduction
Spatially and temporally regulated gene expression is essential to precisely modulate cellular behaviors during development in multicellular organisms. Elucidating gene cascades during early embryonic development may improve our understanding of mechanisms of cell fate determination and lineage segregation [1][2][3]. The nematode Caenorhabditis elegans, a model organism of development research, comprises 959 cells in adult hermaphrodites with robustness and reproducibility of the cell lineage [3]. Additionally, over 80% of the C. elegans proteome shows homology with human proteins [4], providing a particularly valuable model organism for studies of the developmental system. PAR proteins, which are expressed immediately after fertilization, are associated with formation of the anterior-posterior polarity of P0 cells and control the localization of polarity mediators such as SPN-4, MEX-1, and MEX-3 in C. elegans [5]. Aberrant  genes encoding these proteins affects cellular and developmental regulation, leading to embryonic lethality in early embryogenesis [6][7][8][9]. Specifically, SPN-4 is localized in all blastomeres at the four-cell stage and plays essential roles in axial rotation [8,[10][11][12]; MEX-1 is expressed in the P1 blastomere, and loss of its function leads to excessive muscle formation [7,13]; and MEX-3 is expressed in the AB blastomere, and causes excessive muscle formation and hatching failure in mutants [14][15][16]. Although polarity mediators regulate protein synthesis by binding to the 3 0 -untranslated region of the target mRNA, it is difficult to directly identify their associated gene cascades. Conventional genetic and molecular biological approaches have focused on the target gene to be identified and have clarified functions and identified related genes, representing a bottom-up approach. Both functional analysis of individual genes and comprehensive analysis of the genome are indispensable for identification of gene cascades. After determination of the whole genome sequence of C. elegans in 1998, genome-wide analysis via a top-down approach was made possible [17], representing the beginning of the post-genome sequencing era. Transcriptomics via DNA microarray analysis [18], proteomics via mass spectrometry analysis [19], and phenomenon analysis by RNA interference [20] have been extensively reported. Furthermore, several methods for comprehensive analyses have been developed, including protein-protein interaction analysis using a yeast two-hybrid system and phage display [21,22] and multiple mutation analysis using knockdown mutants. At the same time, the WormBase database was constructed to integrate the vast quantities of data obtained from these genomewide analyses [23]. Accordingly, the development of new technologies and methodologies has enabled the accumulation of detailed genome-wide data.
Next-generation sequencing (NGS) has now replaced conventional Sanger sequencing [24]. Conventional Sanger sequencing can simultaneously analyze 8-96 sequencing reactions, whereas NGS can simultaneously run millions to billions of sequencing reactions in parallel. This technique can dramatically and quickly determine the gene sequences in organisms whose whole-genome sequences have already been determined. Even at the laboratory level, genomic sequencing results can be produced in only a few days, enabling researchers to obtain genome-wide information rapidly. Furthermore, RNA sequencing (RNA-Seq) has recently been developed to measure gene expression levels by counting the number of sequence reads obtained from converting RNA into cDNA [25]. Existing RNA-Seq data analysis tools include RSeQC [26], which measures the quality control of the obtained data; Cufflinks [27], which involves genome mapping; and IsoEM [28], which identifies isoforms within a dataset. These tools can be used to identify gene expression variations from RNA-Seq data. TopGO [29] is an analytical tool used to identify gene function based on RNA-Seq data and can confirm the functions of genes with varying expression levels. In addition, Cascade R was established to identify the gene cascade of a query gene [30]. Cascade R constructs an intergenic network of knockout genes from the results of DNA microarray analysis. However, it requires multiple timeline datasets from microarray analyses.
Genes, especially those expressed in early embryogenesis, function in chronological order rather than having only a single function, and genes responsible for functional expression often exert their effects at the bottom of gene cascades. STRING [31], BIOGRID [32], and WormBase [23] are databases of protein-protein interactions and the gene-dependent regulation of transcription and translation. In order to predict genetic cascades from these databases, researchers currently must perform separate analyses. Moreover, although RNA-Seq can be used to easily acquire large amounts of data via a semi-automatic process, the subsequent analysis must be performed manually and is therefore quite time-consuming. Therefore, the data acquisition capacity currently exceeds the data analysis capacity. Accordingly, automation of the analysis using bioinformatics tools is an important research subject.

Guideline for RNA-Seq analysis and comparative gene expression analysis
The guideline for RNA-seq data analysis for gene cascade prediction is indicated in the supplementary material (S1 Fig). RNA-seq analysis of wild type (N2) and sin-3(tm1476) mutant early embryos (N = 3) was performed by following the Illumina protocols (available on the Illumina, Inc. website). Library preparation for RNA-Seq was performed using the TruSeq Stranded Total RNA LT Sample Prep Kit (Illumina, Inc.). Next, using a MiSeq Reagent Kit v3 (Illumina, Inc.), the DNA samples were denatured, diluted, and paired-end sequencing (75 bp length) was performed in MiSeq sequencer (Illumina). The quality of raw sequence data obtained by RNA-Seq was checked using FastQC. Trimmomatic [35] was employed to trim low quality reads and the sequence data were mapped to a C. elegans reference genome (WormBase Version 261) using HISAT2. The count data of wild type and sin-3 mutant were compared with DESeq2 [36] and differentially expressed genes (DEGs) (p-value < 0.05, log 2 Fold changes between sin-3 mutant/wild type (log 2 count data of sin-3 mutant/count data of wild type)) were identified (S1 Table).

RNA-Seq analysis of mex-1, mex-3, and spn-4
The mRNAs of the synchronized C. elegans early embryos were purified using RNeasy Minikit (Qiagen NV, Venlo, the Netherlands). Purified mRNAs were reverse-transcribed into cDNAs, amplified by polymerase chain reaction, and fragmented using a TruSeq RNA Sample Prep Kit (Illumina, Inc.). The amplified cDNAs were sequenced using Hi-Seq2000 (Illumina, Inc.) and the sequenced cDNAs were mapped to the C. elegans genome sequence and counted according to WormBase (WS190) [23] using DNAnexus. Using this procedure, the mRNA expression levels were obtained as reads per kilobase of exon per million mapped reads (RPKM) [37]. The gene name and RPKM values of wild-type and mutant genes were filed for input data in GCF (S2 Table). Although we used the RPKM method to perform static evaluation, it is better to use DESeq2 or edgeR to define and normalize the fluctuations or differential expression patterns. However, since RPKM is still utilized by many researchers up until now [38,39], we initially develop the tool for this classical method and will incorporate other more sophisticated normalization methods to GCF (see Guideline for RNA-Seq analysis and comparative gene expression analysis).

Comparative quantitative gene expression analysis of mex-1, mex-3, and spn-4
Expression levels of gene i in the wild-type and mutant were defined as xi and yi, respectively, and the change rates in these gene expression levels (Ri) were determined as shown in Eq 1.
Further, genes that satisfied the condition of Eq 2 were assumed to show expression level fluctuations, and genes satisfying the condition of Eq 3 were assumed to not show expression level fluctuations. N is the number of genes, and Mi and Qi are the median and quartile deviation, respectively. The expression level of each gene was spatially and temporally regulated in the early embryonic development of C. elegans. To analyze the effect of the expression of each gene in mutant for the polarity mediator genes, we compared the relative expression level of each gene between the wild type and deletion mutants for the polarity mediator gene (S2 Table).

Dataset for the software
Information on the expression timing and interactions of all genes in C. elegans was extracted from WormBase (Version 256) [23] using the application programming interface. The total transcription factors of C. elegans were acquired from the gene ontology database Amigo 2 (http://amigo.geneontology.org/amigo) [40] using the keyword search "GO: 0006351". Furthermore, all gene IDs, protein IDs, and domain information from Pfam (https://pfam.xfam. org/) [41] in C. elegans required for functional analysis were extracted from UniProt [42]. We used the embryonic gene expression and gene interaction information from WormBase, and about 20,000 domain information of the gene product from UniProt.

Development of Gene Cascade Finder
The programming language Ruby was used to construct Gene Cascade Finder (GCF). The web interface of GCF was written in Python. The input data for GCF were data from the wild-type and mutant strains, as shown in S2 Table. The output data from GCF included discovered gene cascades (S3-S5 Tables), the data input for Cytoscape (S6 Table), and gene cascade-specific domains and their gene cascades (S7 Table). GCF is available at http://www.gcf.sk. ritsumei.ac.jp.

Direct target prediction by GCF
GCF was developed by the algorithm shown in Fig 1. To identify and verify the candidate for the direct target genes of a query gene, we performed the following tasks: (1) evaluate expression level fluctuation between wild-type and a mutant of the query gene, (2) confirm that the candidate gene is transcription factor, (3) confirm that the candidate gene and query gene are simultaneously expressed, (4) confirm phenotypic similarity between a mutant of the candidate gene and that of the query gene. For identification of the downstream target of the polarity mediator-mediated gene cascade, we performed the following. First, the candidate genes of the target of the query gene were found from transcription factors and genes with no gene expression level fluctuations, as calculated by Eq 3, with the same cellular localization and phenotype. Query genes bind to target mRNAs to regulate their translation. Thus, the mRNA expression levels of the target genes showed no changes. In addition, the target genes needed to be expressed with the same timing as the query gene because the query gene is directly bound to the target mRNA. Furthermore, the gene cascade of the target genes was mostly consistent with the query gene cascade, suggesting that the target gene may have the same phenotype as the query gene. Therefore, to expand the gene cascade, the target gene should be a transcription factor with downstream genes.

Downstream gene identification by GCF
For identification of genes downstream to the gene cascade, we continuously performed the following. First, the genes from transcription or protein-protein interactions were extracted as downstream gene candidates of the target gene. Second, the expression timing of the candidate genes was checked. Only candidate genes noted as being expressed in early embryos or in embryos in WormBase were defined as downstream genes of the query's target genes. These two steps were repeated to obtain the next downstream genes. Finally, the procedure was repeated until there were no genes left to be extracted to obtain the final gene. Lastly, GCF output the cascade data (S3-S5 Tables). An example showing input obtained from output data from GCF to Cytoscape [43] is provided in S2 Fig.

Specific domain search from the constructed gene cascade
Each direct target gene was rooted, and the functions of their bottom genes were investigated using Pfam [41] in UniProt [42]. The P-values of the domains from the bottom gene products were evaluated using the same formula for Gene Ontology in Panther [44]. If the transcription-related domain was extracted from a cascade, the cascade was no longer considered since it would be functioning only after the early embryo stage.

Results and discussion
In this work, we developed a novel software that is useful to predict the gene cascade during C. elegans development. To evaluate the accuracy of the gene cascades that were predicted by GCF, we performed the following. First, we isolated candidate direct target genes of the polarity mediators using RNA-seq data by comparing the data for wild type and each mutant of the polarity mediators, SPN-4, MEX-1 and MEX-3. Next, GCF predicted the gene cascades from the candidate direct target genes localized at the top of the gene cascade. In this step, information that is stored in WormBase was used to predict the gene cascade. Finally, we assumed that the GCF gene cascade prediction might be accurate when the predicted gene cascade may have an essential role in early embryogenesis and may regulate any signaling pathway. Our purpose was assumed to be achieved when the predicted gene cascade was obtained previously or was considered to be useful for further detailed analysis of the cascades involved in C. elegans embryonic development. Although we used GCF to predict the polarity mediator gene cascade, it is applicable to analyze any other query gene.

Analysis of mRNA expression by comparative RNA-Seq
To explore polarity mediator-dependent mechanisms, the effects of deficiencies in polarity mediators were analyzed by performing RNA-Seq analysis in early embryos. From the results of comparative RNA-Seq of the wild-type strain and the spn-4, mex-1, and mex-3 mutant strains, 15,288, 15,265, and 15,005 genes were identified, respectively (S8 Table). In these gene groups, expression level fluctuations were calculated by examining the median ± quartile deviations. From this analysis, 6,417 genes distributed at −0.65 < log2 (RNA expression level ratio) < 0.69 in the spn-4 gene, 6,456 genes distributed at −0.74 < log2 (RNA expression level ratio) < 0.82 in the mex-3 gene, and 6,491 genes distributed at −0.82 < log2 (RNA expression level ratio) < 1.10 in the mex-1 gene were defined as genes showing no expression level variations (S9 Table).

Gene cascade prediction using Gene Cascade Finder
As shown in S1 Table, gene cascade prediction was performed by inputting data obtained by comparative RNA-Seq into GCF. GCF can predict gene cascades by continuously integrating the results from RNA-Seq along with data on gene expression and intermolecular interactions from WormBase. In total, 127, 180, and 226 gene cascades were predicted from 6,418, 6,457, and 8,513 genes from the comparative analysis of the spn-4, mex-1, and mex-3 mutant strains, respectively (Fig 2, S3 Fig and S7-S9 Tables).

Extraction of gene cascade-specific domains using Gene Cascade Finder
The genes and domains located at the bottom of the gene cascade were extracted for functional analysis of the predicted gene cascade (S10 Table). Overall, 53, 146, and 143 genes with 32, 34, and 54 specific domains as the bottom gene were extracted from the gene cascades in the spn-4, mex-1, and mex-3 mutants, respectively.

Domain analysis of spn-4, mex-1, and mex-3 cascades
To predict the functions of the gene cascades, we focused on the functions of genes localized at the bottom of the gene cascade by analyzing the domains of the gene products using the Pfam protein family database. The functions of the 53 SPN-4-mediated genes were obtained as bottom genes to calculate the functional trends in the spn-4 cascade. Within the 53 bottom genes, 32 domains were classified based on information from the Pfam database (Table 1) [41]. When we calculated the numbers of these genes, transcription and signal transduction were obtained at high frequency.
Similarly, in the gene cascade of 146 mex-1-mediated genes, 27 domains were classified and obtained from the domain analyses to have functions in early embryonic development, cell division, transcription, DNA replication, and signal transduction (Table 1). In contrast, in the gene cascades of the 143 mex-3-mediated genes, 54 domains were classified and obtained from the domain analyses to have functions in development, cell cycle, transcription, and signal transduction (Table 1).

Distinct availability between Gene Cascade Finder and a gene cascade prediction package, Cascade R
Among the previously established software packages, Cascade R was established to find out the gene cascade of a query gene [30]. Cascade R [30] constructs an intergenic network of knockout genes from the results of DNA microarray. The difference in the datasets between Cascade R and GCF lies in that Cascade R constructs gene cascade from the series of microarray data, and GCF needs RNA-seq data and existing database information. Thus, Cascade R needs multiple timeline datasets of microarray. In contrast, using GCF, we can easily visualize the existing information and predict the direct target when a translational regulator is selected as a query gene. Although researchers need to refer to the function of the domain from Pfam, GCF can predict gene cascade by continuously integrating the results from RNA-Seq data and gene expression and intermolecular interaction data from WormBase. Thus, GCF offers inexpensive prediction of gene cascade.

Evaluation of GCF by assessment of gene cascades in the canonical Wnt signaling pathway
Next, we focused on genes involved in SPN-4-mediated signal transduction (Fig 3A). First, we found that MOM-2, a nematode homolog of the Wnt ligand, is involved in the signal transduction cascade (Table 1). Since both SPN-4 and MOM-2 were previously reported to regulate EMS cell lineage formation and spindle orientation [11,45], we hypothesized that the SPN-4/MOM-2 gene cascade may have an essential role in early embryogenesis and may regulate the Wnt signaling pathway. Moreover, because OMA-1, MEX-1, and PIE-1, which are known to be essential for MOM-2 expression in embryonic development [45], were also identified in this pathway, the GCF-mediated gene cascade prediction was assumed to be accurate.
Similarly, unc-37, which encodes a Groucho/TLE homolog that suppresses Wnt signaling, was found as the "bottom gene" in the spn-4 and mex-3 cascades (Fig 3A and 3B). Thus, we propose that GCF-mediated gene cascade prediction may be useful for identification of gene cascades involved in C. elegans embryonic development.

Examples of the application of GCF for prediction of new biological functions involved in a gene cascade
Endoplasmic reticulum (ER) stress response pathway in a MEX-1-mediated gene cascade. Because cell division and DNA replication are regulated by a MEX-1-dependent cascade, we further focused on this signal transduction cascade. In the gene cascade related to signal transduction, paqr-2, which encodes an adiponectin receptor, was isolated ( Fig 3C). Interestingly, a stress response pathway is known to regulate stress responses in both mouse and C. elegans embryogenesis [45,46]. Thus, an evolutionarily conserved gene cascade against environmental stress may be identified using GCF. Moreover, because adiponectin receptor regulates insulin signaling [47], it is likely that MEX-1/PAQR-2-mediated gene cascades may be involved in ER stress tolerance signaling within or parallel to the insulin signaling pathway during embryogenesis.
lin-35, hsp-3, and gpa-12 in a MEX-1/DPY-23-mediated gene cascade in neuronal development. In a MEX-1/DPY-23-mediated gene cascade, the mex-1, lin-14, let-60, ces-2, unc-13, and dpy-23 mutants were shown to exhibit specific phenotypes in neuronal development ( Fig   Fig 2. Schematic illustrations of polarity mediator-dependent gene cascades during C. elegans embryogenesis. Rendering of each gene cascade was performed using Cytoscape. Nodes indicate each gene in the cascade. Edges indicate the interactions between two genes. Large and intermediate nodes indicate the query genes and the direct target of the query genes, respectively. Other nodes indicate downstream genes. Green nodes indicate genes that are expressed during the early embryonic stage. Purple nodes indicate presumptive early embryonic genes. Red, blue, and black edges indicate positive regulation, negative regulation, and genetic interactions, respectively. Dotted lines indicate protein-protein interactions. Properties of the gene product of the bottom genes were calculated by domain analysis. The sum of the characteristic features of each cascade (P < 0.05) was then calculated. Domains with a score less than 1 were classified as "Others".
https://doi.org/10.1371/journal.pone.0215187.t001 3D) [12,[48][49][50]. Thus, six of the nine genes of this gene cascade were shown to have essential roles in neuronal development, indicating that MEX-1/DPY-23-mediated gene cascades may regulate neuronal function. Although their roles in neuronal development have not yet been investigated, locomotion defects have been reported in hsp-3 and gpa-12 mutants [51,52]. Similarly, the lin-35(n745) mutation has been shown to enhance the neuronal phenotype of the neuronal regulator genes dpy-13 and unc-104 [53]. Thus, lin-35, hsp-3, and gpa-12 may be involved in a DPY-23-mediated gene cascade in neuronal development in embryos [50]. However, further studies are required to examine this possibility. MEX-3/APL-1-mediated neuronal patterning and MEX-3/CDC-14-mediated cell fate determination in the MEX-3-mediated gene cascade. Because MEX-3 is specifically expressed in AB cells at the four-cell stage, spatiotemporal-regulated synaptic formation defects in hbl-1 mutants and apl-1-dependent embryonic neuronal patterning may be elucidated by identifying MEX-3/APL-1-mediated gene cascades (Fig 3E) [54][55][56]. In parallel, when we focused on the MEX-3/CDC-14-mediated gene cascade (Fig 3E), CDC-14B, a zebrafish homolog of CDC-14, was shown to be involved in formation of the cilium in sensory neurons [55]. Because sensory neurons have cilia in C. elegans [57], CDC-14 may be involved in an evolutionarily conserved signaling pathway. Similarly, the lin-35(n745) mutation was shown to enhance the neuronal phenotype of neuronal regulator genes [53]. Thus, lin-35 may be involved in a MEX-3/CDC-14-mediated gene cascade in sensory neuron development. Accordingly, our findings suggested that GCF may be useful for predicting the comprehensive functions of query genes and for identification of new genes involved in known gene cascades.

Conclusion
In this study, we created a software program called GCF, which could comprehensively identify genes downstream of the query genes by integrating RNA-Seq data and previously characterized data from WormBase. Using GCF, we analyzed gene cascades of the polarity mediator proteins SPN-4, MEX-1, and MEX-3, and identified 127, 180, and 226 putative gene cascades, respectively. By analyzing the functions of these gene cascades, we confirmed that SPN-4 and MEX-3 regulate the canonical Wnt pathway during embryonic development. Furthermore, we found that the ER stress response and motor neuron development are regulated by MEX-1-dependent cascades, and that neural development is regulated by MEX-3-dependent cascades. Although we used GCF only to evaluate SPN-4, MEX-1, and MEX-3 functions in this study, the method is applicable for other translation or transcription factors involved in early embryogenesis. In addition, GCF provides a general method for predicting the functions of genes involved in a gene cascade during C. elegans embryonic development. Taken together, we propose that our strategy using the GCF tool offers a reliable approach for comprehensively identifying networks of embryo-specific gene cascades in C. elegans. Importantly, GCF can also be applied to humans and other model organisms such as mice and Drosophila.
In the future, by adding threshold for quantitative gene expression analysis of RNAsequencing data and expanding the algorithm to fit the cell lineage-segregation of C. elegans [3], we will be able to predict more reliable quantitative-data-oriented precise gene cascades reflecting four-dimensional (spatial and temporal) regulation [58]. Combinational analysis of improved GCF and molecular biology techniques such as RNA-pull down assays, fluorescent in situ hybridization, and phenotypic characterization of the mutants may be required to build a more reliable regulatory network for these gene cascades [59,60].
Supporting information S1