Prediction pipeline for discovery of regulatory motifs associated with Brugia malayi molting

Filarial nematodes can cause debilitating diseases in humans. They have complicated life cycles involving an insect vector and mammalian hosts, and they go through a number of developmental molts. While whole genome sequences of parasitic worms are now available, very little is known about transcription factor (TF) binding sites and their cognate transcription factors that play a role in regulating development. To address this gap, we developed a novel motif prediction pipeline, Emotif Alpha, that integrates ten different motif discovery algorithms, multiple statistical tests, and a comparative analysis of conserved elements between the filarial worms Brugia malayi and Onchocerca volvulus, and the free-living nematode Caenorhabditis elegans. We identified stage-specific TF binding motifs in B. malayi, with a particular focus on those potentially involved in the L3-L4 molt, a stage important for the establishment of infection in the mammalian host. Using an in vitro molting system, we tested and validated three of these motifs demonstrating the accuracy of the motif prediction pipeline.

Introduction differentially expressed during this unique process. We used RNA-seq to profile transcription at different time points during the molt, collecting samples from the infective L3 (from mosquitoes), L3 Day 6, and L3 Day 9 worms recovered from infected gerbils (NCBI PRJNA557263). We combined this transcriptome data with previously published L4 data [7] that corresponds to Day 14 post infection of gerbils (Table 1). In total, 2.36 billion reads were generated, with 1.38 billion reads mapping to the B. malayi genome. Each biological replicate received an average of 272 million reads, with an average of 173 million reads that were successfully mapped ( Table 1). Of the 11,841 B. malayi gene models, 87.6% were expressed in at least one stage of the L3 to L4 molt (Fig 1). The molting expression data shows unique stagespecific profiles for each stage of the molt with significant differential expression between days. Principle component analysis based on gene expression shows close clustering of biological replicates (S1 Fig).
We determined differentially expressed genes during the iL3 to L4 molt using both DESeq [25] and EdgeR [26] to perform pairwise comparisons between the four samples. To get a high confidence list of differentially expressed genes, we used the consensus of the two algorithms (FDR �0.01, log fold change � 2.5) (S1 Table). All differentially expressed genes had a minimum coverage of 7 read counts in at least one stage. For the purposes of this study, we focused on the genes that were up-regulated at each stage of molting, as compared to the other stages, and did a gene annotation enrichment analysis for each stage comparison (S2 Table). We found that up-regulated genes in iL3 larvae-as compared to L3D6, L3D9, and L4-were enriched for annotations involving cysteine-type peptidase activity as well as serpin domains and serpin family proteins. Cysteine-type peptidases are essential for molting in B. malayi [27,28], and serpins are serine protease inhibitors that have previously been shown to be involved in immunomodulation and host immune evasion during infection [29]. We identified five different cysteine-type peptidases and two cysteine-type endopeptidase inhibitors that were upregulated in iL3 larvae. By day 6 of molting, structural constituents of the cuticle, including collagen (the main component of the cuticle) were enriched in the up-regulated gene sets. We also see the up-regulation of several metalloproteases. At day 9, genes involved in signaling were enriched among the up-regulated genes, as were several different metalloproteases. At day 14 (L4 larvae), we again see an enrichment of structural constituents of the cuticle. As for those enriched in L3 day 6 larvae, they are all mostly orthologs of C. elegans col (COLlagen) genes, which are themselves orthologs of human MARCO genes (macrophage Table 1. RNA-Seq summary. receptor with collagenous structure). The structural constituents enriched at day 14 are, however, a completely unique set of collagen genes as compared to the genes observed at day 6. These stage-specific enrichments reflect the order of peptidases and structural constituents necessary for the building of a new L4 cuticle, the separation of the old L3 cuticle from the developing L4 cuticle, and the shedding of the old L3 cuticle.

Library a Total reads (millions) b Stage Total (millions) Total Mapped Reads (millions) Stage Total Mapped Reads (millions) % B. malayi Genes Expressed
Identification of 12 motifs associated with transcription factor binding that are enriched in the L3 to L4 molt To better understand the regulatory program of B. malayi during the L3 to L4 molt, we analyzed statistically over-represented DNA motifs in regions upstream of genes that were upregulated during molting. To do so, we developed a motif identification pipeline called Emotif Alpha (Fig 2). First, we used the transcriptome data from the different stages of the L3 to L4 molt to generate using pair-wise comparisons lists of genes up-regulated at each stage of the molt. We then did a motif discovery analysis on each gene set using a combination of three motif discovery tools: GimmeMotifs, DME, and DECOD. GimmeMotifs is an ensemble of generative motif discovery tools-including Homer [30], AMD [31], BioProspecter, MDmodule [32], MEME, Weeder, GADEM [33], and Improbizer-while DME and DECOD are discriminative motif discovery tools. We did a discriminative motif discovery analysis by randomly selecting background promoter region sets from all B. malayi genes, excluding the differentially expressed genes. These background sets are three times larger than the foreground sets. We selected motif lengths between 6-and 15-mer. In total, we identified 20,025 motifs.
To select statistically significant motifs, we first assessed the motifs by a random forest classifier using scikit-learn [34]. The random forest algorithm uses bootstrap sampling and constructs a decision tree for each sub-sample. To evaluate the motifs, we used both Gini impurity [35] and information gain [36] criteria, and retained the union of the resulting top 40 motifs. We further filtered the motifs by foreground coverage (i.e. UG%), removing motifs occurring in less than 30% of the genes up-regulated at that stage of molting. We then used a Z-test to compare the frequency of a motif in the up-regulated genes with the expected frequency in the background promoters. Using a significance level (p-value) cutoff of 10 −3 , we selected 395 motifs (S3 Table).
We retrieved a collection of 163 known nematode transcription factor binding sites (TFBSs) from the MEME suite (http://meme-suite.org/), searching the motif databases JAS-PAR CORE 2016 nematodes [37], CIS-BP Brugia malayi [14], and Uniprobe worm [38]. We matched the remaining motifs to known TFBSs with TOMTOM [39]. If two motifs were matched to the same binding site and they were discovered from the same gene list, we considered them to be redundant and kept the one with the lowest over-representation p-value. This step narrowed our list down to 27 motifs that had matches to known binding sites.
We next performed a conservation analysis amongst nematodes using an adaptation of a published method [40]. We retrieved orthologous gene information among B. malayi, C. elegans, and O. volvulus from Wormbase ParaSite Biomart [41]. We extracted up to 1kb upstream from the translation start sites for B. malayi genes, assuming these regions would contain the promoter. We performed multiple sequence alignments using CLUSTALW2 [42] and defined a motif as conserved if it occurred at the same position in the orthologous promoter region alignment of either C. elegans or O. volvulus. This step resulted in 12 remaining motifs (Fig 3,  Table 2) that were (1) enriched (p-value < 10 −3 ) and (2) conserved in either C. elegans or O. volvulus. The frequency of motif occurrence in the putative promoter regions of up-regulated genes ranges from 33% to 94%. The fold enrichment, representing the ratio between motif frequencies in the up-regulated gene promoters and background promoters, ranges from 1.28 to 2.19.
The 12 selected motifs matched known binding sites for 6 transcription factors in C. elegans ( Table 2) Table). Moreover, the occurrence of M3.2 in the Bm10655 promoter region is conserved in orthologs in both C. elegans (promoter of C28A5.3) and O. volvulus (promoter of OVOC9600). We find that a number of the B. malayi orthologs to the C. elegans transcription factors that we predict could bind the predicted TFBS are differentially expressed as well (S5 Table).
The motif analysis reveals how some of the differential expression of different proteases may be orchestrated during the L3 to L4 molt. Motif M1.4 is found in the promoter region of Bm2270, a predicted metalloprotease significantly up-regulated in L3D9 worms (S4 Table).

Fig 1. Expression of Brugia malayi genes during the L3 to L4 molt. Expression is in
FPKMs and is Z-scale normalized by row prior to clustering. High expression is indicated by red and low expression by blue. Time-points included infective L3 larvae (iL3), L3 larvae at Day 6 of molting (L3D6), L3 larvae at Day 9 of molting (L3D9), and L4 larvae. Biological replicates have been combined.
https://doi.org/10.1371/journal.pntd.0008275.g001 Bm2270 is an ortholog of nas-37 in C. elegans and has been shown to be involved in collagen and cuticulin-based cuticle development and ecdysis. Motif M5 is found in the promoter region of Bm1938 and is predicted to encode a serpin (S4 Table). Bm1938 is one of the serpins that was found to be significantly up-regulated in the iL3 larvae.

L3 stage-specific transcription factor binding motifs can be validated in vitro
Three of the motifs (M5, M6, and M7) were chosen for validation based on their enrichment in the promoters of genes up-regulated in the mid to late stages of the L3 to L4 molt. Three separate genes, each containing one chosen motif, were tested. The 1 kb upstream region of each gene was amplified from B. malayi genomic DNA and cloned upstream of the firefly luciferase reporter gene in the expression vector pGL3 Basic [23]. B. malayi L3 were then transfected with the constructs in a co-culture system, as previously described [44]. The parasites were induced to molt in vitro and then assayed for luciferase activity. The number of relative light units (RLUs) observed were normalized to those obtained from parasites transfected in parallel with a construct consisting of the B. malayi HSP70 promoter driving the expression of the firefly luciferase reporter [44]. The experiment was performed with both the native promoter and a mutant promoter where the nucleotides of the motif had been randomly shuffled. All of the native promoters produced significant amounts of reporter luciferase activity in the molting parasites (ranging from 40-70% of the activity produced by the HSP70 construct transfected Note that M7 was included in the experimental validation because it passed 4 out of 5 filters, including the foreground coverage filter, the random forest filter, the known motif filter, and the conservation filter. However, it was not included in the 12 reported motifs due to its non-significant p-value.  Fig 4). However, when the putative motifs were mutated, the activity in all the promoters tested decreased by 80-90% (Fig 4).

Discussion
Parasitic nematodes such as B. malayi maintain a complicated lifecycle involving both an insect and a mammalian host, and undergo a number of molts. The L3 to L4 molt that occurs immediately upon infection of the mammalian host is of particular interest as it marks the establishment of infection and thus represents an attractive point for drug intervention. Little is known, however, about how B. malayi regulates the transitions between these stages. Prior to our study, nothing was known about promoter motifs that regulate developmentally expressed genes in B. malayi. Because stage transitions rely on precise transcriptional control through the interaction of transcription factors and their binding sites, we set out to characterize the potential transcription factor binding motifs of these parasites and to identify motifs that contribute to stage-specific expression of genes involved in early worm development in the mammalian host.
We found a number of enriched motifs and were able to define both conserved motifs across molting as well as stage-specific motifs. While some of the motifs we identified are conserved in other nematodes, such as C. elegans, a number of motifs represent novel binding sites potentially reflecting the differences in development and the parasitic lifestyle. It is known that molting is regulated by an ecdysone-like response system [45,46,47,48]. Two of the identified motifs and their cognate TFs appear to be related to the ecdysone response. For example, zfh-2, the transcription factor predicted to bind four of our identified motifs, is a common cofactor implicated in ecdysone signaling in D. melanogaster [49]. Blimp-1, the transcription factor predicted to bind three of our identified motifs, is an ecdysone-inducible repressor that is essential for the prepupal development in Drosophila [50]. Validation results suggest that our pipeline is able to identify biologically-relevant motifs involved in molting. This analysis provided biological insight into the development of the parasite as well as the identification of novel drug targets.
Future studies should expand this analysis across the lifecycle of the nematode at different stages of development, in its different hosts (i.e. human vs. mosquito). While the validation results from our in vitro system are promising, we must assume there are biological differences to molting in vivo. Transcriptomic data from other life stages already exist and can be used to predict motifs; however, validation at other stages in vivo will prove more difficult. This is due, in large part, to the fact that the human filariae are obligate parasites without a free-living stage. This has limited the ability to genetically manipulate these stages in ways necessary to validate predictions made by transcriptomic analyses. Until recently, studies of promoter structure and function were limited by the fact that the only way to transfect B. malayi involved biolistic transfection of isolated embryos [23] which were developmentally incompetent. While this system has been used to identify the conserved motifs necessary for promoter function in constitutive promoters [24,51,52,53] and cis acting regulatory regions in some regulated promoters [54,55], these studies were limited to genes expressed in embryos. However, with recent innovations in filarial transgenics, it is now possible to efficiently produce stably transfected developmentally competent infective larvae, which can in turn be used to produce transgenic parasites in which transgenes are stably integrated into the parasite genome [44]. This study represents the first application of this new technology to the study of promoter function in a lifecycle stage other than isolated developmentally incompetent embryos. The ability to produce stable transgenic parasite lines will permit in vivo functional testing of promoter motifs for stage-specific expressed genes in all lifecycle stages of the parasite.

Transcriptomic study design
All parasites were obtained from FR3 (Filariasis Research Reagent Resource Center; BEI Resources, Manassas, VA, USA) where they were isolated from infected gerbils (Meriones unguiculatus) or mosquitoes (Aedes aegypti). Worms were flash-frozen and shipped to the New York Blood Center for processing. For transcriptomic sequencing, infective third-stage larvae (iL3) were recovered from mosquitoes and mammalian stage larvae were recovered from gerbils at 6 and 9 days post infection (dpi). At 6 dpi, larvae are typically undergoing the molt from L3 to L4, while by 9 dpi some worms may be finished molting [56] while others may not finish until 10 dpi [57]. Data was combined with previously published stages 14 dpi (L4) [7].

RNA isolation, library preparation and sequencing
Total RNA was prepared from B. malayi worms as previously described [7]. RNA was prepared from 3 biological replicates of infective L3 (iL3; 2000 larvae each), 3 replicates of 6 dpi larvae (1500 each) and 2 replicates of 9 dpi larvae (1300 each). B. malayi worms were homogenized in Trizol (ThermoFisher) using a hand-held pestle in 1.5mL tubes containing the A) Promoter motif validation in L3 worms that were molting in vitro using the native promoter of Bm1559 and a mutated motif M5 version of the same promoter. B) Promoter motif validation using the native promoter of Bm1938 and a mutated motif M6 version of the promoter. C) Promoter motif validation using the native promoter of Bm4184 with a mutated motif M7 version of the promoter. A total of four wells, each containing 100 L3, were transfected with each construct, as previously described [44]. Parasites in each well were harvested on days 5-8 and firefly luciferase activity determined as previously described [23]. In each panel luciferase activity obtained from the constructs is normalized against parasites transfected with a construct containing the Bm HSP70 promoter driving the expression of the firefly luciferase reporter.
https://doi.org/10.1371/journal.pntd.0008275.g004 worms. Total RNA was extracted by organic extraction using Trizol and the PureLink RNA mini kit (ThermoFisher) and after being treated with DNaseI (New England Biolabs). Ribosomal RNA (rRNA) depletion was performed using Terminator (Epicentre), a 5'-phosphatedependent exonuclease that degrades transcripts with a 5' monophosphate. Libraries were prepared using the NEBNext Ultra II RNA Library Prep Kit for Illumina (New England Biolabs) according to manufacturer instructions. Library quality was assessed using a D1000 Screen-Tape Assay (Agilent) prior to sequencing. Library concentrations were assessed using the qPCR library quantification protocol (KAPA biosystems). Libraries were sequenced on the Illumina NextSeq500 platform with 150bp paired-end reads. To minimize the confounding effects of lane-to-lane variation, libraries were multiplexed and sequenced with technical replicates on multiple lanes. Each biological replicate received an average of 135 million mapped reads (PRJNA557263).
Potential promoter sequences were retrieved from WormBase ParaSite Biomart [41] web interface, capturing the 1000bp upstream of the translation start site for each gene.

The Emotif Alpha pipeline for regulatory motif identification
The Emotif Alpha pipeline (freely available at: https://github.com/YichaoOU/Emotif_Alpha) was developed to automate motif discovery analysis for the 10 up-regulated gene lists. This pipeline was written in python and was applied to perform all aforementioned motif analyses. The motif discovery step used multiple tools and was run in parallel at the Ohio SuperComputer Center. Motif length search was from 6 to 14. Motif scanning was done using FIMO [62] with a default p-value threshold of 10 −4 . We implemented 5 different motif filters: (1) Foreground coverage (i.e., UG%) was defined as the proportion of up-regulated gene promoters containing the given motif. We set a minimal foreground coverage at 30%; (2) Motifs were then filtered by a random forest classifier. The union of the top 40 motifs that resulted from either Gini impurity or information gain criterion was retained; (3) Motif enrichment p-value was calculated using Z-test and the cutoff was 10 −3 ; (4) Known motif filter was performed using TOMTOM and a collection of 163 known nematode TFBSs. The motif similarity p-value threshold was 10 −4 ; (5) Conservation analysis was performed using a method described in [40]. Only conserved motifs were kept.

In vitro validation of promoter transcription motifs
The putative TF motifs M5, M6, and M7 were chosen for validation based on their enrichment in the promoters of genes upregulated in the mid to late stages of the L3 to L4 molt. Three different genes, each containing one of the chosen motifs, were used for the validation assay. As previously described [23], we amplified the 1 kbp region upstream of each gene from B. malayi genomic DNA and cloned it upstream of the firefly luciferase reporter gene in the expression vector pGL3 Basic. We then transfected B. malayi L3 larvae with the constructs in a co-culture system, as previously described [44]. The parasites were induced to molt by the addition of ascorbic acid on day 5, and parasites were assayed for luciferase activity on days 5-8, as by day 10 the molting was complete. We normalized the number of RLUs observed to those obtained from parasites transfected in parallel with a construct consisting of the B. malayi HSP70 promoter driving the expression of the firefly luciferase reporter [23] to control for accumulation of the firefly luciferase over time during the duration of the experiment. We did the experiment with both the native promoter and a mutant promoter where the nucleotides of the motif had been randomly shuffled (S6 Table). Only one replicate was carried out for each motif validation.