Abstract
As the number of sequenced insect genomes continues to grow, there is a pressing need for rapid and accurate annotation of their regulatory component. SCRMshaw is a computational tool designed to predict cis-regulatory modules (“enhancers”) in the genomes of various insect species. A key advantage of SCRMshaw is its accessibility. It requires minimal resources—just a genome sequence and training data from known Drosophila regulatory sequences, which are readily available for download. Even users with modest computational skills can run SCRMshaw on a desktop computer for basic applications, although a high-performance computing cluster is recommended for optimal results. SCRMshaw can be tailored to specific needs: users can employ a single set of training data to predict enhancers associated with a particular gene expression pattern, or utilize multiple sets to provide a first-pass regulatory annotation for a newly-sequenced genome. This protocol provides an extensive update to the previously published SCRMshaw protocol and aligns with the methods used in a recent annotation of over 30 insect regulatory genomes. It includes the most recent modifications to the SCRMshaw protocol and details an end-to-end pipeline that begins with a sequenced genome and ends with a fully-annotated regulatory genome. Relevant scripts are available via GitHub, and a living protocol that will be updated as necessary is linked to this article at protocols.io.
Citation: Asma H, Liu L, Halfon MS (2024) SCRMshaw: Supervised cis-regulatory module prediction for insect genomes. PLoS ONE 19(12): e0311752. https://doi.org/10.1371/journal.pone.0311752
Editor: Arnar Palsson, University of Iceland, ICELAND
Received: June 12, 2024; Accepted: September 24, 2024; Published: December 5, 2024
Copyright: © 2024 Asma et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All relevant data are within the manuscript and its Supporting Information files.
Funding: Funding for this work was provided by USDA grant 2019-67013-29354 to M.S.H. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
Introduction
Although metazoan whole-genome sequences continue to accumulate at a rapid pace, comprehensive annotation of these genomes lags significantly behind. Particularly lacking is identification of non-coding regulatory sequences, which comprise a substantial fraction of the genome yet are typically absent from available annotations. One reason regulatory annotation so often lags behind genome sequencing is that, historically, finding regulatory elements has been difficult even in well-studied model organisms because of their distant positions from target genes, the absence of a clear universal biochemical regulatory sequence marker, and the cell type specificity of regulatory element activity [1–4]. Although both empirical and computational methods for regulatory element discovery have been developed (for review see [5–8]), a variety of limitations prevent them from being easily adopted for efficient annotation of newly-sequenced genomes. Empirical approaches are costly; can be difficult to validate depending on the availability of biological resources such as cell lines, antibodies, and tissue samples, or the existence of relevant technologies, such as transgenesis; and carry apparent false-positive and false-negative rates that can be surprisingly high (false-positive rates range as high as 40% for some ChIP-based methods [9–11] and from 10–20% for some ATAC-seq studies [12, 13]). Because regulatory elements may be functional only in certain cell types or under specific conditions, assays need to be applied to multiple tissues over many developmental stages and/or under varying environmental conditions to achieve comprehensive annotation. Although computational methods in theory avoid these issues, many approaches still rely on experimental data either for training or as input, which often negates their advantages.
For non-model organisms, where only limited functional genomic data tend to be available, regulatory annotation is therefore often highly challenging.
We previously developed SCRMshaw, a supervised cis-regulatory module machine-learning method that shows strong performance in predicting transcriptional regulatory sequences (“enhancers”) [14–19]. When trained on known enhancers from the fruit fly Drosophila melanogaster, SCRMshaw successfully predicts regulatory sequences across a wide range of insect species [17, 19–21]. Importantly, SCRMshaw requires as input only a sequenced genome and a basic gene annotation, making it highly suitable for producing preliminary regulatory annotations rapidly following initial publication of new genome assemblies. While benchmarking enhancer discovery methods is complicated by the lack of established true positive/true negative data sets against which to compare competing approaches [14], success rates from SCRMshaw appear to be on a par with or better than those from other rigorously-evaluated methods. Moreover, while we expect that other sequence-based enhancer discovery methods will, like SCRMshaw, be capable of cross-species discovery, only SCRMshaw so far has a track record of success in the cross-species setting.
We present here a detailed updated SCRMshaw protocol that replaces the earlier protocol published in 2019 [16] and aligns with the procedures described in our recent publication presenting the regulatory annotation of over 30 insect genomes [19]. The current protocol incorporates changes to the original SCRMshaw protocol as described by Asma and Halfon [14] and Asma et al. [19]. It includes additional steps developed by Asma et al. [19] to implement an end-to-end pipeline that begins with a sequenced genome and ends with a fully-annotated regulatory genome formatted for easy comparison to other SCRMshaw-annotated genomes.
The SCRMshaw pipeline consists of four basic parts (Fig 1). A pre-processing step checks the input files for proper formatting and removes minor unmapped/unannotated scaffolds from the genome. The SCRMshaw algorithm itself, which consists of three underlying statistical methods, is then run to scan the genome for high-scoring sequences. This step is dependent on training data, which can be downloaded from our GitHub site. Custom training sets can also be developed by the user; for instructions on custom training set generation, we refer the reader to Kazemian and Halfon [16]. The basic protocol described here is for “SCRMshaw_HD” [14], which requires use of a high-performance computing cluster and a minimum of 25 available nodes. However, SCRMshaw can also be run in its original lighter-weight form on a desktop computer by following the relevant protocol notes and/or referring to [16]. The third step in the SCRMshaw pipeline is post-processing, which determines the final set of enhancer predictions based on the SCRMshaw scores. Details on this step can be found in [19] and are diagrammed in Fig 2. The final step in the SCRMshaw pipeline, which is optional, maps putative enhancer target genes to their Drosophila orthologs. We find this step helpful because it attaches recognizable gene names to what are otherwise frequently arbitrary designations, and enables locus-by-locus comparison of SCRMshaw outputs from multiple species. Note that target gene assignments are based solely on nearest-gene relationships, which is a simple but often inaccurate method of determining enhancer targets. Users may want to incorporate additional information into their analysis if accurate enhancer-target relationships are a primary goal of their study.
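The nearest-gene relationship mentioned above can be illustrated with a minimal Python sketch. This is not the pipeline's own code: the function name, the tuple layout, and the `left`/`right`/`overlap` labels are assumptions made for illustration, and strand is ignored for simplicity.

```python
def nearest_gene(pred_start, pred_end, genes):
    """Return (gene_name, distance_bp, location) for the gene closest to a prediction.

    genes: list of (start, end, name) tuples on the same chromosome
    (a hypothetical layout, not the SCRMshaw annotation format).
    Returns None when no genes are supplied.
    """
    best = None
    for g_start, g_end, name in genes:
        if pred_end < g_start:        # prediction lies entirely to the left of the gene
            dist, loc = g_start - pred_end, "left"
        elif pred_start > g_end:      # prediction lies entirely to the right of the gene
            dist, loc = pred_start - g_end, "right"
        else:                         # prediction overlaps the gene body
            dist, loc = 0, "overlap"
        if best is None or dist < best[1]:
            best = (name, dist, loc)
    return best


genes = [(1000, 2000, "geneA"), (5000, 6000, "geneB")]
print(nearest_gene(2500, 3000, genes))
```

A prediction midway between two genes is assigned to whichever is closer in base pairs, regardless of which gene it actually regulates. This is why nearest-gene assignments should be treated as provisional.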
Fig 1. The left side shows pre-processing steps; the right side, post-processing. Input to SCRMshaw consists of the genome sequence and gene annotation. A protein sequence annotation is supplied later for the ortholog mapping step. Adapted from [19].
Fig 2. Post-processing is described in [14], with modifications in [19]. (A) Scores from each of the 25 individual SCRMshaw instances are assessed and (B) any 500 bp window whose score is below the value of the 5000th ranked score is eliminated by having its score reset to zero. (C) The “elbow” point of the SCRMshaw score curve of the 5000 top scores from each instance is then determined, and (D) any scores below the elbow point are reset to zero. (E, F) After these two rounds of score evaluation, windows are grouped together and (G) subjected to peak calling on 10 bp intervals. (H) Final predictions are chosen as peaks with an amplitude above the elbow point of the amplitude curve, represented by a red dot. Adapted from [19].
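The two score-filtering rounds in panels A to D can be sketched in Python as follows. This is an illustrative sketch, not the SCRMshaw post-processing script. In particular, the elbow is located here with a common heuristic (the point on the ranked curve farthest from the straight line joining its first and last points), which is an assumption; the procedure actually used is described in [14] and [19]. Function names are hypothetical.

```python
def zero_below_top_n(scores, n=5000):
    """Reset to zero every score below the n-th highest score (panel B)."""
    if len(scores) <= n:
        return list(scores)
    cutoff = sorted(scores, reverse=True)[n - 1]
    return [s if s >= cutoff else 0.0 for s in scores]


def elbow_value(scores):
    """Score at the elbow of the descending score curve (panel C).

    Assumed heuristic: the ranked point with the greatest distance from
    the chord joining the first and last points of the curve.
    """
    ranked = sorted(scores, reverse=True)
    n = len(ranked)
    if n < 3:
        return ranked[-1] if ranked else 0.0
    x1, y1, x2, y2 = 0, ranked[0], n - 1, ranked[-1]

    def chord_distance(i):
        # Proportional to the perpendicular distance from (i, ranked[i]) to the chord.
        return abs((y2 - y1) * i - (x2 - x1) * ranked[i] + x2 * y1 - y2 * x1)

    return ranked[max(range(n), key=chord_distance)]


def zero_below_elbow(scores):
    """Reset to zero every score below the elbow value (panel D)."""
    elbow = elbow_value(scores)
    return [s if s >= elbow else 0.0 for s in scores]


window_scores = [10, 9, 8, 2, 1.5, 1, 0.5]
print(elbow_value(window_scores))       # 2
print(zero_below_elbow(window_scores))  # [10, 9, 8, 2, 0.0, 0.0, 0.0]
```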
Materials and methods
The protocol described in this peer-reviewed article is published on protocols.io (doi:10.17504/protocols.io.e6nvw1129lmk/v2) and is included for printing as S1 File with this article.
Expected results
The SCRMshaw pipeline described in the accompanying protocol (S1 File) was used most recently to produce the 33 genome annotations reported by Asma et al. [19]. The complete set of scripts and SCRMshaw software from that study is available for download at our GitHub site (https://github.com/HalfonLab), and the annotation data can be searched at REDfly (http://redfly.ccr.buffalo.edu) or downloaded from Dryad (https://doi.org/10.5061/dryad.3j9kd51t0).
The SCRMshaw_HD pipeline can analyze an average-sized insect genome using several training sets in a matter of hours; we are able to run all but the largest or most fragmented genomes with our full default set of 48 training sets in under 72 hours. The bulk of the computational time is spent in the SCRMshaw step. To decrease run times, chromosomes and/or training sets can be split out and run as separate instances on additional sets of 25 nodes, if available, as a simple parallelization strategy. Storage requirements increase with genome size, mostly due to the larger number of k-mers that must be stored, and can reach several TB for larger genomes. However, the majority of this space can be released upon completion of the pipeline by deleting intermediate and temporary files, using “cleanup” scripts we have made available. If sufficient temporary storage space is not available, it is advisable to run the original lightweight SCRMshaw [16] rather than SCRMshaw_HD, which will keep storage requirements below 100 GB for most genomes.
Each step of the pipeline produces output. An example of the output from the preflight step is provided in S2 File. Preflight validates the format of the input files and produces a comprehensive log file that highlights any issues along with basic information such as the number of chromosomes/scaffolds and their sizes, data types present in the annotation (e.g., ‘gene’, ‘exon’, ‘ncRNA’, etc.), and average intergenic distances. Preflight also identifies minor scaffolds that are not annotated as containing genes, which can then be discarded.
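The kinds of checks described for preflight can be approximated in a few lines of Python. The sketch below is not the preflight script itself: the function names are hypothetical, it assumes a FASTA genome and a tab-delimited GFF-style annotation with the feature type in the third column, and it reports only scaffold sizes, feature types, and scaffolds lacking any `gene` feature.

```python
from collections import Counter


def fasta_lengths(lines):
    """Map each sequence name in a FASTA file to its length in bases."""
    lengths = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]   # keep only the ID before any description
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def gff_summary(lines):
    """Count feature types and collect scaffolds carrying at least one 'gene'."""
    feature_types = Counter()
    scaffolds_with_genes = set()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:              # skip malformed rows
            continue
        feature_types[fields[2]] += 1
        if fields[2] == "gene":
            scaffolds_with_genes.add(fields[0])
    return feature_types, scaffolds_with_genes


def unannotated_scaffolds(lengths, scaffolds_with_genes):
    """Scaffolds present in the genome but without annotated genes."""
    return sorted(name for name in lengths if name not in scaffolds_with_genes)


fasta = [">chr1 example", "ACGT", "ACG", ">scaf9", "AC"]
gff = ["##gff-version 3",
       "chr1\tsrc\tgene\t1\t5\t.\t+\t.\tID=g1",
       "chr1\tsrc\texon\t1\t5\t.\t+\t.\tParent=g1"]
lengths = fasta_lengths(fasta)
types, with_genes = gff_summary(gff)
print(lengths, dict(types), unannotated_scaffolds(lengths, with_genes))
```

With real files, the same functions can be passed open file handles in place of the in-memory lists.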
Output from SCRMshaw itself is described in detail in [16]. When using the preferred SCRMshaw_HD process, this is mainly intermediate output that is used for subsequent steps but not directly evaluated.
Post-processing generates two principal files. scrmshawOutput_offset_0to240.bed contains the top 5000 raw results from each of the multiple individual SCRMshaw instances run by SCRMshaw_HD. The post-processing script then uses these raw results to make final enhancer predictions, which it outputs in the peaks_AllSets.bed file. This file can be used as SCRMshaw output for downstream analysis and differs from the final output described below only in that putative target genes are not mapped to their Drosophila orthologs. If that final step is not conducted, it is recommended to sort and merge the results using BEDTools [22] before considering the analysis completed, as there may be overlap in enhancer predictions from different training sets or statistical models.
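To show what sorting and merging accomplishes, the following pure-Python sketch collapses overlapping or directly adjacent intervals on the same chromosome. It stands in for the BEDTools step for illustration only; BEDTools [22] remains the recommended tool. The function name and the three-field tuples are assumptions, and the extra columns of peaks_AllSets.bed are ignored here.

```python
def sort_and_merge(intervals):
    """Sort (chrom, start, end) intervals and merge those that overlap or touch."""
    merged = []
    for chrom, start, end in sorted(intervals):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Overlaps (or abuts) the previous interval: extend it.
            merged[-1][2] = max(merged[-1][2], end)
        else:
            merged.append([chrom, start, end])
    return [tuple(m) for m in merged]


predictions = [
    ("chr2", 100, 600),
    ("chr1", 500, 1000),
    ("chr1", 900, 1500),   # overlaps the previous chr1 prediction
    ("chr1", 2000, 2500),
]
print(sort_and_merge(predictions))
# [('chr1', 500, 1500), ('chr1', 2000, 2500), ('chr2', 100, 600)]
```

Sorting first guarantees that each interval needs to be compared only with the most recently merged one, which is why an unsorted file must be sorted before merging.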
Final results are obtained following the ortholog mapping procedure and merging of overlapping results. An example of final output is provided in S3 File. The final output is in the form of an 18-column tab-delimited file organized as follows:
- Chromosome
- Start coordinate
- End coordinate
- Peak amplitude
- SCRMshaw score
- Flanking gene
- D. melanogaster ortholog of flanking gene
- Distance of hit from flanking gene (basepairs)
- Location of hit relative to flanking gene
- Local rank
- Next closest flanking gene
- D. melanogaster ortholog of next flanking gene
- Distance of hit from next flanking gene (basepairs)
- Location of hit relative to next flanking gene
- Local rank
- Training set
- Method (hexmcd, imm, pac)
- Rank
If the orthologous gene is not known, it is listed as “No_OrthoPara.” Where predictions are merged, multiple results may be provided in each column, depending on the results of the merge (e.g., for method, “imm, hexmcd”). Peak amplitude, score, and rank will contain the best value from among the merged predictions. “Local rank” is described in [17], although its utility as a metric when using the SCRMshaw_HD post-processing procedure has not been determined.
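For downstream analysis, the 18-column output can be read into named fields. The sketch below is a minimal example under stated assumptions: the column identifiers in `COLUMNS` are hypothetical names chosen here to mirror the list above (the file itself is assumed to have no header row), and merged multi-value fields are assumed to be comma-separated as in the “imm, hexmcd” example.

```python
import csv
import io

# Hypothetical field names mirroring the 18 columns listed above.
COLUMNS = [
    "chrom", "start", "end", "amplitude", "score",
    "gene1", "gene1_ortholog", "gene1_distance", "gene1_location", "gene1_local_rank",
    "gene2", "gene2_ortholog", "gene2_distance", "gene2_location", "gene2_local_rank",
    "training_set", "method", "rank",
]


def read_predictions(handle):
    """Yield one dict per prediction from a tab-delimited SCRMshaw output file."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row or row[0].startswith("#"):
            continue
        if len(row) != len(COLUMNS):
            raise ValueError(f"expected {len(COLUMNS)} columns, found {len(row)}")
        record = dict(zip(COLUMNS, row))
        record["start"] = int(record["start"])
        record["end"] = int(record["end"])
        # Merged predictions may list several methods, e.g. "imm, hexmcd".
        record["methods"] = [m.strip() for m in record["method"].split(",")]
        record["has_ortholog"] = record["gene1_ortholog"] != "No_OrthoPara"
        yield record


example = ("chr1\t100\t600\t12.5\t30.1\tLOC1\twg\t500\tupstream\t1\t"
           "LOC2\tNo_OrthoPara\t2000\tdownstream\t3\tmapping2.wing\timm, hexmcd\t7\n")
for rec in read_predictions(io.StringIO(example)):
    print(rec["chrom"], rec["start"], rec["end"], rec["methods"], rec["has_ortholog"])
```

Other columns are left as strings because merged rows may hold more than one value in a single field.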
SCRMshaw’s performance relies heavily on training set quality. Based on various measures over many individual studies [inter alia 14, 15, 17–20, 23], we estimate that true-positive rates for enhancer prediction range from 50–85%, with most training sets reaching or exceeding 70%. We plan to continue to make additional and improved training data available via our GitHub training set repository (https://github.com/HalfonLab/dmel_training_sets). The protocol linked to this article (S1 File) exists as a living protocol at protocols.io (http://dx.doi.org/10.17504/protocols.io.e6nvw1129lmk/v2), and we will continue to update it as improvements to the SCRMshaw pipeline are developed. Aspects of SCRMshaw under continued development include investigating optimal repeat masking strategies for genomes with different types and degrees of repeats, and how best to combine and weight the scores from the three individual SCRMshaw scoring methods of IMM, HexMCD, and PAC-rc. Users should bear in mind that SCRMshaw predictions are, ultimately, predictions, and appropriate validation experiments should be undertaken for any sequences of particular interest.
Supporting information
S1 File. Step-by-step protocol, also available on protocols.io.
https://doi.org/10.1371/journal.pone.0311752.s001
(PDF)
S2 File. Example of the output from the “preflight” step run on the Apis mellifera (honeybee) genome.
https://doi.org/10.1371/journal.pone.0311752.s002
(PDF)
S3 File. Example final output for SCRMshaw_HD run on the Apis mellifera genome using the “mapping2.wing” training set.
https://doi.org/10.1371/journal.pone.0311752.s003
(BED)
Acknowledgments
We thank members of the Halfon lab for helpful comments on the protocol and the manuscript.
References
- 1. Pennacchio LA, Bickmore W, Dean A, Nobrega MA, Bejerano G. Enhancers: five essential questions. Nat Rev Genet. 2013;14(4):288–95. Epub 2013/03/19. pmid:23503198; PubMed Central PMCID: PMC4445073.
- 2. Halfon MS. Studying Transcriptional Enhancers: The Founder Fallacy, Validation Creep, and Other Biases. Trends Genet. 2019;35(2):93–103. Epub 2018/12/17. pmid:30553552; PubMed Central PMCID: PMC6338480.
- 3. Catarino RR, Stark A. Assessing sufficiency and necessity of enhancer activities for gene expression and the mechanisms of transcription activation. Genes Dev. 2018;32(3–4):202–23. Epub 2018/03/02. pmid:29491135; PubMed Central PMCID: PMC5859963.
- 4. Gasperini M, Tome JM, Shendure J. Towards a comprehensive catalogue of validated and target-linked human enhancers. Nat Rev Genet. 2020;21(5):292–310. Epub 2020/01/29. pmid:31988385.
- 5. Asma H, Halfon MS. Annotating the Insect Regulatory Genome. Insects. 2021;12(7):591. Epub 2021/07/03. pmid:34209769; PubMed Central PMCID: PMC8305585.
- 6. Suryamohan K, Halfon MS. Identifying transcriptional cis-regulatory modules in animal genomes. Wiley Interdisciplinary Reviews: Developmental Biology. 2015;4(2):59–84. pmid:25704908.
- 7. Phan LT, Oh C, He T, Manavalan B. A comprehensive revisit of the machine-learning tools developed for the identification of enhancers in the human genome. Proteomics. 2023;23(13–14):e2200409. Epub 2023/04/07. pmid:37021401.
- 8. Barshai M, Tripto E, Orenstein Y. Identifying Regulatory Elements via Deep Learning. Annual Review of Biomedical Data Science. 2020;3(1):315–38.
- 9. Blow MJ, McCulley DJ, Li Z, Zhang T, Akiyama JA, Holt A, et al. ChIP-Seq identification of weakly conserved heart enhancers. Nat Genet. 2010;42(9):806–10. Epub 2010/08/24. pmid:20729851; PubMed Central PMCID: PMC3138496.
- 10. May D, Blow MJ, Kaplan T, McCulley DJ, Jensen BC, Akiyama JA, et al. Large-scale discovery of enhancers from human heart tissue. Nat Genet. 2011;44(1):89–93. Epub 2011/12/06. pmid:22138689; PubMed Central PMCID: PMC3246570.
- 11. Visel A, Blow MJ, Li Z, Zhang T, Akiyama JA, Holt A, et al. ChIP-seq accurately predicts tissue-specific activity of enhancers. Nature. 2009;457(7231):854–8. Epub 2009/02/13. pmid:19212405.
- 12. Cusanovich DA, Daza R, Adey A, Pliner HA, Christiansen L, Gunderson KL, et al. Multiplex single cell profiling of chromatin accessibility by combinatorial cellular indexing. Science. 2015;348(6237):910–4. Epub 2015/05/09. pmid:25953818; PubMed Central PMCID: PMC4836442.
- 13. Bravo Gonzalez-Blas C, Quan XJ, Duran-Romana R, Taskiran II, Koldere D, Davie K, et al. Identification of genomic enhancers through spatial integration of single-cell transcriptomics and epigenomics. Mol Syst Biol. 2020;16(5):e9438. Epub 2020/05/21. pmid:32431014; PubMed Central PMCID: PMC7237818.
- 14. Asma H, Halfon MS. Computational enhancer prediction: evaluation and improvements. BMC bioinformatics. 2019;20(1):174. pmid:30953451.
- 15. Kantorovitz MR, Kazemian M, Kinston S, Miranda-Saavedra D, Zhu Q, Robinson GE, et al. Motif-blind, genome-wide discovery of cis-regulatory modules in Drosophila and mouse. Dev Cell. 2009;17(4):568–79. Epub 2009/10/27. pmid:19853570.
- 16. Kazemian M, Halfon MS. CRM Discovery Beyond Model Insects. Methods Mol Biol. 2019;1858:117–39. Epub 2018/11/11. pmid:30414115; PubMed Central PMCID: PMC6482005.
- 17. Kazemian M, Suryamohan K, Chen JY, Zhang Y, Samee MA, Halfon MS, et al. Evidence for deep regulatory similarities in early developmental programs across highly diverged insects. Genome biology and evolution. 2014;6(9):2301–20. Epub 2014/09/01. pmid:25173756; PubMed Central PMCID: PMC4217690.
- 18. Kazemian M, Zhu Q, Halfon MS, Sinha S. Improved accuracy of supervised CRM discovery with interpolated Markov models and cross-species comparison. Nucleic Acids Res. 2011;39(22):9463–72. Epub 2011/08/09. pmid:21821659; PubMed Central PMCID: PMC3239187.
- 19. Asma H, Tieke E, Deem KD, Rahmat J, Dong T, Huang X, et al. Regulatory genome annotation of 33 insect species. eLife. 2024;13:RP96738. pmid:39392676.
- 20. Suryamohan K, Hanson C, Andrews E, Sinha S, Scheel MD, Halfon MS. Redeployment of a conserved gene regulatory network during Aedes aegypti development. Dev Biol. 2016;416(2):402–13. Epub 2016/06/28. pmid:27341759; PubMed Central PMCID: PMC4983235.
- 21. Schember I, Halfon MS. Identification of new Anopheles gambiae transcriptional enhancers using a cross-species prediction approach. Insect molecular biology. 2021;30(4):410–9. Epub 2021/04/19. pmid:33866636; PubMed Central PMCID: PMC8266755.
- 22. Quinlan AR, Hall IM. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010;26(6):841–2. Epub 2010/01/30. pmid:20110278; PubMed Central PMCID: PMC2832824.
- 23. Weinstein ML, Jaenke CM, Asma H, Spangler M, Kohnen KA, Konys CC, et al. A novel role for trithorax in the gene regulatory network for a rapidly evolving fruit fly pigmentation trait. PLoS Genet. 2023;19(2):e1010653. Epub 2023/02/17. pmid:36795790; PubMed Central PMCID: PMC9977049.