Identifying promoter sequence architectures via a chunking-based algorithm using non-negative matrix factorisation

Core promoters are stretches of DNA at the beginning of genes that contain information that facilitates the binding of transcription initiation complexes. Different functional subsets of genes have core promoters with distinct architectures and characteristic motifs. Some of these motifs inform the selection of transcription start sites (TSS). By discovering motifs with fixed distances from known TSS positions, we could in principle classify promoters into different functional groups. Due to the variability and overlap of architectures, promoter classification is a difficult task that requires new approaches. In this study, we present a new method based on non-negative matrix factorisation (NMF) and the associated software called seqArchR that clusters promoter sequences based on their motifs at near-fixed distances from a reference point, such as TSS. When combined with experimental data from CAGE, seqArchR can efficiently identify TSS-directing motifs, including known ones like TATA, DPE, and nucleosome positioning signal, as well as novel lineage-specific motifs and the function of genes associated with them. By using seqArchR on developmental time courses, we reveal how relative use of promoter architectures changes over time with stage-specific expression. seqArchR is a powerful tool for initial genome-wide classification and functional characterisation of promoters. Its use cases are more general: it can also be used to discover any motifs at near-fixed distances from a reference point, even if they are present in only a small subset of sequences.


Introduction
Promoter sequences are stretches of DNA flanking the transcription start site (TSS) of a gene.Historically, a window spanning −40 to + 40 base pairs (bp) (upstream denoted by '-' and downstream by '+') relative to the TSS is called the core promoter sequence.Further upstream from −40 to −250 bp is called the proximal promoter region [1].These regions work together with enhancers-which can be anywhere in the genome, upstream or downstream of the gene they affect-to regulate gene expression.Transcription of genes by RNA polymerase II (RNA PolII) happens by assembly of the pre-initiation complex of RNA PolII and different basal transcription factors (TFs) [2].These TFs recognise different sequence motifs present in the core promoters for binding.
However, there is no universal promoter architecture.Not all motif elements are present in all core promoters.For instance, TATA-box, which is bound by the TATA-box binding protein (TBP), is a popular textbook example of a core promoter motif, but at most *20% of the known core promoters of protein-coding genes in animals contain a TATA-box [2].Furthermore, transcription initiation can be 'focused' or 'dispersed' [3,4].In focused transcription, there is a single nucleotide (nt) position where most of the transcription begins.In dispersed transcription there are multiple weaker start positions spread up to *100 bp within the promoter, and the dominant position is determined by the distance from the + 1 nucleosome [5,6].The resultant promoters are thus called sharp and broad respectively.Sharp promoters are overwhelmingly associated with tissue-specific genes, while broad promoters are associated with constitutively expressed genes and a subset of tissue restricted ones (reviewed in [7]).The most important realisation about different promoter architectures is that there is a TSS-selection determining sequence element at near-fixed distance from the dominant TSS (Fig 1).Different TSS selection mechanisms suggest that there are multiple fundamentally different promoter architectures, and that they are functionally specialised.Heterogeneity in architectures at the organism-level gives rise to additional diversity [3,7,8].Also, the advent of chromatin interaction experiments in the last decade has enabled better exploration of the enhancer-promoter interaction specificity.Thus, studying and defining the promoter sequence architectures that give rise to such specificities in transcription regulation mechanisms is a key step in understanding transcriptional regulation.
Promoter sequence architectures are not always characterised just by occurrences of short sequence motifs but at times also by the sequence composition, for example, GC-richness or CpG island promoters.Using motif finding algorithms to process promoter sequences, as is traditional, has limitations.First, they cannot identify differences in architectures stemming from overall sequence compositions.Second, in scenarios where promoter sequences are indeed characterised by motifs, scanning the sequences for motifs from a database (when available) limits the study to looking at only known motifs.These known motifs often do not explain much of the diversity among the promoter sequences.In case of novel organisms for whom no databases of known motifs exist yet, a de novo motif search can be performed.But most motif finding algorithms expect the motifs to be statistically over-represented in the promoter sequences in comparison to background sequences.Such approaches can fail to identify motifs that are present in a relatively small subset of sequences and therefore not statistically 'significant', and do not provide precise position information of the motifs.Furthermore, they also often miss discovering low-complexity or degenerate motifs.New approaches, which can overcome the above limitations in using motif finding algorithms and identify architectures in promoter sequences de novo are required to study promoter sequence architectures.
Here, we present seqArchR, an unsupervised approach using non-negative matrix factorisation (NMF) [9] for de novo characterisation of diversity in promoter sequence architectures.seqArchR does not require background sequences.It needs no information on the kind (motifs-or sequence composition-based) or number of architectures expected to be present in the input sequences.When architectures characterised by motifs may be present, it does not require any specifications on the number of motifs, their sizes or any gaps in motifs or between them when they work in a cooperative fashion.seqArchR's chunking-based approach can reveal architectures harboured by very few sequences out of the whole input.This is often helpful to precisely tease out minor positional variations in features which can be functionally relevant.For instance, the spacing between the TATA-box and the TSS is known to correlate with tissue specificity [10].
seqArchR is inspired by 'No promoter left behind' (NPLB) [11], the current state-of-the-art method identifying such sequence architectures de novo from aligned promoter sequences.NPLB uses a Bayesian approach to jointly learn features and important positions in an architecture.It identifies the number of architectures best explaining the complete set of promoter sequences by cross-validation where the model with the highest loglikelihood is chosen.NPLB has a relatively long execution time making it less effective as an exploratory tool.It takes about 10 hours to process ca.6600 promoter sequences in Drosophila compared to seqArchR's 42 minutes when run serially to obtain comparable results.Additionally, seqArchR gives the user complete control over the granularity of identified clusters which is often useful in the exploratory setting.
Since Lee and Seung's work [9], NMF has become a popular technique finding wide applications across domains.Some earlier applications of NMF in biology include those for the inference of biological processes [12], sequence motifs characterisation [13], and multi-omics analysis [14].In particular, Hutchins et al. [13] used NMF for characterizing sequence data sets by identifying position-dependent motifs in them.In their approach, the input matrix encodes the counts of the k-mers and their occurrence positions in the given collection of sequences.This approach assumes that all sequences in the data set contain the same motifs or minor variations thereof.Although this approach is well-suited for identifying protein-DNA binding sites from ChIP-seq like data, it assumes a single mode of binding and does not cater to the scenario of multiple modes of protein-DNA binding [15].In comparison, seqArchR makes no such assumptions and models the collection of sequences to harbour any number of architectures, each with an unknown number of motifs-or non-motifs-based sequence features.
With seqArchR, we demonstrate the general applicability of NMF for de novo identification of near fixed-position motifs from sequences, and simultaneously clustering the sequences based on the identified motifs.We note that seqArchR is not a general sequence motif discovery tool.It will not detect those overrepresented promoter motifs that are found at a wide distribution of distances from transcription start sites, such as GC-box and CAATbox in vertebrates, or DNA recognition element (DRE) in Drosophila.After seqArchR identifies architecturally distinct clusters based on TSS-associated sequence elements, the obtained clusters can be analysed using general motif finders to investigate which non-TSS determining motifs are associated with which architecture.(See Discussion section for further details.).
This paper describes seqArchR in complete detail and presents results from computational experiments demonstrating its efficacy.We compare seqArchR with NPLB in analysing a simulated dataset where the ground truth of clusters/architectures is known, and a set of CAGEbased core promoter sequences from Chen et al. [16] (S1 Text).Furthermore, we show results from processing CAGE-based promoter sequences in fruit fly [17], zebrafish [18] and human [19,20] to demonstrate seqArchR's ability to seamlessly identify heterogeneous architectures across organisms with large differences in promoter sequence composition.seqArchR is provided as an R/Bioconductor package accessible at https://www.bioconductor.org/packages/seqArchR or https://snikumbh.github.io/seqArchR.

seqArchR: A novel algorithm using non-negative matrix factorisation for de novo identification of sequence architectures
To classify promoters based on their fixed-position determinants, we developed, seqArchR, a chunking-based iterative algorithm using NMF for de novo identification of architectural elements.The input to seqArchR is a (0, 1)-matrix which is a one-hot encoded representation of dinucleotide profiles of a gapless alignment of DNA sequences.As shown in the schematic in Fig 2, seqArchR processes the whole collection of input sequences one chunk (subset of sequences) at a time.The (0, 1)-matrix for each chunk of sequences is processed with NMF.NMF decomposes the matrix into two low-rank matrices-the basis matrix and the coefficients matrix.Columns of the basis matrix (a.k.a.basis vectors) represent the different potential architectures, and along its rows are the loadings for the features per architecture (see Fig 2B).On the other hand, in the coefficients matrix, where each column corresponds to one sequence, its rows hold the per-sequence coefficients for each architecture.A sequence with a high coefficient for an architecture is expected to harbour features captured by that architecture.For example, consider matrix H k×n in Fig 2B, the first and second sequence have the highest coefficient for architecture I among all four architectures, and are expected to harbour sequence features prominent for architecture I (see corresponding to the sequence logos on the top).Because the input matrix already encodes the position information of the nucleotides in the sequences, the factorisation identifies the characteristic architectures (combinations of nucleotides over-represented at specific positions) describing different subset of sequences.
Per chunk, seqArchR performs a model selection procedure to find the appropriate number of basis vectors suitable to represent the set of sequences in the chunk in a lower-dimensional space.Two model selection procedures are available: stability-and cross-validation-based (see the Methods section for details).This decomposition is used to obtain clusters of sequences characterised by a diverse set of sequence features.
After all chunks have been factorised and sequence clusters obtained, similar clusters from different chunks are collated.Similarity of clusters is judged based on the similarity of corresponding NMF basis vectors (see Methods section for additional details).The collated set of clusters may need further processing to weed out any mis-assignments or to crisply identify For each chunk of sequences being processed with NMF, the sequences are represented as a one-hot encoded matrix (hence, 0/1 matrix), denoted in the schematic by matrix V p×n ; matrices W p×k and H k×n are respectively the basis matrix and coefficients matrix obtained upon factorisation.n denotes the number of input sequences, p, the number of features, k, the optimal number of dimensions selected for the low-rank representation and L, the length of the input sequences.The schematic depicts one-hot encoding of the dinucleotide profile of sequences.One can use the mono-or dinucleotide profile of sequences.https://doi.org/10.1371/journal.pcbi.1011491.g002architectures with shared motifs or those with minor positional shifts.In this case, seqArchR iterates over the collated clusters treating each of them as separate chunks and repeating the above steps.The number of iterations required depends on the structure in the input data.The algorithm is described in detail in the Methods section.

seqArchR is fast and attains high accuracy on simulated data
In order to demonstrate seqArchR's efficacy, we performed computational experiments on simulated data as follows.We generated a set of 1000 simulated DNA sequences, each 100 nucleotides long.All nucleotides at any position in these sequences appeared with a uniform probability of 0.25.We planted different motifs at specific positions in these sequences such that they resulted in four clusters with as many architectures.These are described in Table 1.
All motifs except those in cluster A were planted with mutations, as follows.For planting a motif in a sequence, any one position in the motif was randomly selected for mutation.The nucleotide at this position was then mutated with probability (mutation rate) {0.1, 0.2, 0.3, 0.4, 0.5}.This resulted in five variations of DNA sequences-one mutated position × five mutation rates for all motifs.This procedure was repeated, selecting two and three positions for mutation per motif (instead of one), giving us a total of 15 variations.For each of these variations, we performed experiments with chunk sizes {200, 500, 1000}.Thus, all parameter combinations resulted in a total of 45 datasets to be processed by seqArchR.For robustness, each of these was repeated 10 times with different random seeds.Here, we used stability-based model selection.
For comparison, we also processed these 15 datasets with NPLB [11].Performance results of seqArchR and NPLB are presented next.All reported timing information is for experiments performed in parallel on 16 cores of Intel(R) Xeon(R) Gold 6154 CPUs.
Performance as adjusted Rand index (ARI).Performance is reported using the adjusted Rand index (ARI) which measures the similarity between two clustering solutions or a clustering and ground truth (when available), and, is corrected for chance.As seen in Fig 3A, irrespective of the chunk size, seqArchR attains a high ARI value when the mutation rate and the number of mutated motif positions is low.For any given chunk size, the performance degrades with increase in the mutation rate, while for the same mutation rate, increasing the number of mutated motif positions hardly affects performance.For smaller chunk size, seqArchR's performance shows a slight increase in variance with increase in mutation rate.With a larger chunk size, seqArchR demonstrates very low variance in comparison to NPLB which shows high variance for all mutation rates.
Fig 3E and 3F show the time taken (in minutes) by seqArchR and NPLB.In Fig 3E, the colours denote time taken by seqArchR to complete different iterations.We observe that processing the whole data in many, small chunks, takes more time overall.This is expected: for each chunk, the stability-based model selection procedure starts from checking for just two clusters and continues checking further values until the bounding threshold on the instability of identified cluster features is satisfied (see more details of the algorithm in the Methods section).Since sequences from each cluster are distributed uniformly randomly across the complete set of sequences, for each chunk, the optimum number of clusters is expected to be around the true number of clusters.Thus, there is some redundancy of checking the lower values for every chunk.This redundancy reduces for larger chunk sizes.

seqArchR identifies fixed elements of promoter sequence architectures in different organisms
In general, when dealing with DNA sequences, different organisms are known to present different challenges for sequence analysis methods.For example, the high GC and especially CpG content in human genomic sequence is known to affect approaches using motif over-representation [21,22].More specifically, with respect to gene promoters, there is diversity of sequence elements within promoters of one organism and also between those of different organisms.Different promoter types have different sequence composition, some of which are highly constrained, and some are not.Drosophila promoter sequences have multitude of well defined sequence motifs viz. the TATA-box and various Ohler motifs [23,24].In comparison, promoters of vertebrates have relatively fewer well defined motifs; they are also characterised by the stretches of varying sequence compositions, e.g., CG/GC dinucleotide content vs A/T-rich W-box motifs, owing to different architectures (sometimes even intertwined) determining TSS selection at play during different stages of development [5,25].This diversity of promoter sequence elements reflects the diversity in mechanisms of transcription regulation.Thus, computational methods that work for one organism may not work off-the-shelf for other organisms.In the following sections, we evaluate seqArchR's ability to identify fixed-element promoter sequence architectures in fruit fly, zebrafish and humans.Note that promoter architecture clusters reported here for all three organisms are obtained by manually curating (post-processing) the raw clusters identified by seqArchR in its final iteration.This is discussed further in the "Curation of collated clusters" subsection in the Methods section.

Fruit fly (Drosophila melanogaster)
Schor et al. [17] produced high resolution CAGE data and studied the effect of natural genetic variation on transcription strength and distribution of transcription start sites in gene promoters at different developmental stages in Drosophila.From the various samples available, we randomly selected one sample 'RAL28' (this number identifies the D. melanogaster line from Drosophila melanogaster Genetic Reference Panel (DGRP) that were used by Schor et al. [17]) to help demonstrate seqArchR's efficacy.
Specifically, we used seqArchR to identify promoter architectures at three stages of transition during embryogenesis-2-4 hours (h), 6-8h, and 10-12h after egg laying (AEL).Each stage was analysed with seqArchR separately.We used 91 bp-long sequences-45 bp upstream and downstream of the TSS-as core promoter sequences.Clustering results for D. melanogaster are reported from experiments using stability bound 10 −8 (see Methods).Figs 4, 5, and 6 show the sequence logos of architectures of all clusters identified for the three respective transitions.For brevity, we use the following shorthand notation to refer to clusters at any transition: CnX or CnY or CnZ.Here, 'C' stands for cluster, n, the cluster number, and letters X, Y and Z denote the three transitions 2-4h, 6-8h, and 10-12h AEL.A single cluster, say cluster 3 at 2-4h, is named C3X, and multiple clusters, say clusters 2-6 or clusters 3 and 5 at 10-12h are named C2-6Z or C3,5Z respectively.Unless specified otherwise, all clusters are arranged in ascending order of their median interquantile width (IQW).
Motifs.seqArchR identifies several of the known core promoter elements in Drosophila.These are directional motifs with a precise or variable positioning within promoters-abbreviated as DMp and DMv respectively.The identified DMp motifs include the TATA-box, initiator (Inr, canonical and other non-canonical ones), the downstream promoter element (DPE).Among DMv motifs, seqArchR identifies Ohler motifs 1 and 6, and another motif named DMv5 [23,26].seqArchR also identifies some of the non-directional motifs (NDMs) reported by FitzGerald et al. [26]   With a strict stability bound 10 −8 , seqArchR identified a strong DPE signal at all three stages AEL, while a strong TATA-box architecture was only identified at 10-12h AEL (see Figs K, L and M in S1 Text).Some TATA-box promoters seemed to be misclassified.So, we processed the promoters at all three stages individually using more lenient bound values (see S1 Text, subsection "Ensuring identification of TATA-box at all stages in D. melanogaster development" for more details) to tease out any missed TATA signal.The final set of clusters reported here accommodate these previously missed TATA architecture clusters identified with lenient bound values.
TATA-box is observed at all three stages (Table 2).Of these, cluster C4Z, with, comparatively, the highest number of promoters, is the one where the TATA-box appears most downstream, nearest to the dominant TSS and combines with a canonical dinucleotide initiator C −1 A +1 , while the others combine with a longer, full Inr (consensus TCA +1 KTY) [23,26].In C9Z, there are two initiators, the canonical CA and longer, full Inr (consensus TCA +1 KTY) [23,26], their A +1 3nt apart.We speculate that the canonical CA is paired with the TATA-box at −31 bp from the TSS, and the full Inr 3 bp downstream is paired with a weaker DPE which is also correspondingly shifted by 3 bp downstream.C12Z has an Inr with a TATA-like sequence at −30.Many CTSSs in these tissue-specific architectures (with TATA and DPE) are unique to 10-12h (relatively higher percentages of sites unique to a transition point) denoting that these start sites are not active earlier.
TGT motif at initiator position is found in all timepoints-C20,22X and C20Y and C21Z.TTAGT is identified at the initiator location in C10Z.A variant of this initiator with a weaker G is found in C12X, C14Y.C10Z has a few different GO terms enriched in comparison to C12X and C14Y, both of which have many common enriched terms as seen in Functional specialisation of clusters.As expected, TATA-box and DPE are among the top-ranked, focused architectures (clusters with lower median IQW) overall.The polypyrimidine initiator, TCT architecture, is observed for genes responsible for translational machinery, such as the ribosomal protein genes (see top-5 enriched GO terms in Fig 7C).These genes are known to have ubiquitously high expression [28].The corresponding clusters across stages, namely C9X, C13Y and C11Z, have the highest median TPM values while still being relatively sharp.
Detection of recently expanded gene families.seqArchR can identify promoters associated with histone genes which are known to have sequence repeats.These are identified across all timepoints.Specifically, C1-4,6X, C2,4,5Y and C8-9Y, and C1Z and C6-8Z are enriched for histone genes His1, His2A/2B and His4/4r and His3.This explains the very high information content of the sequence logo throughout the core promoter region.All CTSSs of His1 gene, namely C2X, C4Y and C7Z, have TSSs in 5' UTRs.Clusters C4X, C8Y, and C6Z, all clusters of His2B gene CTSSs, are contaminated with some non-histone ones (see section 4.2 in the S1 Text for more details on this contamination).We recommend that studies looking to generate a compendium of genome-wide CAGE-derived promoters and their properties for any non-model, as yet un(der)-studied, organism should report them but may choose to sequester them from any downstream analyses especially when these CTSSs are originating from non-promoter regions, e.g., 5' UTRs (Fig J in S1 Text).
Novel Observations.seqArchR also detects a novel positional motif consisting of a polythymine stretch, reminiscent of T-blocks (with >= 3Ts) common in C. elegans core promoters [29].We observe such T-blocks at transitions 2-4h and 10-12h AEL.Specifically, C13X and C14Z, both contain T-blocks flanking the CTSS of core promoters in this cluster.T-stretches have been reported in human genome, too, where they were shown to be able to bind TBP [30].
C17X, C16Y and C16Z, all have A/T enrichment at 30 to 45 bp downstream of the CTSS, with a marked increase precisely at 35 bp.The GO terms enriched for these clusters are also similar-mainly including splicing-and catabolic/metabolic process-related terms (Fig 7E and Fig J in S1 Text).We speculate that this architecture is related to the reported pausing of the RNA PolII between + 20 bp and + 50 bp with the centre at + 35 bp from the TSS [31].This arrangement places the front end of PolII just 10 bp away from the + 1 nucleosome border and enables their contact.Kwak et al. [32] also report on promoter-proximal PolII pausing (termed focused-proximal pausing) at 35-40 bp downstream of the TSS.Indeed, the gene RpS7, an example proximal-pausing gene from Kwak et al. [32] is present in these clusters.
Temporal successions of motif use in Drosophila development.A look at the different motifs present (or absent) in the promoter architectures at successive developmental stages, termed 'successions of motifs' here, can help gain insight into biologically meaningful preferences across stages.When comparing promoter sequence architectures prevalent at different stages of Drosophila development, some trends immediately emerge.While the DPE is observed from 2-4h and beyond in large numbers (ca.1500/25% of all promoters at each stage), very few promoters (only ca.40) have a canonical TATA-box.This number grows beyond 100 only at 10-12h stage (C2,4Z), albeit proportionally still very low in comparison to the DPE clusters.This confirms the idea that the DPE is associated with developmental regulation, and is on average used earlier than TATA, which drives mostly tissue-specific expression in differentiated cell types [33].Unlike TATA, DPE, TTT and initiator, other motifs identified by over-representation (e.g., Ohler motif 1/6, DRE, etc.) from Ohler et al. [23]) and FitzGerald et al. [26] do not occur at fixed distance from the TSS.
For completeness, we also analysed Drosophila melanogaster promoter sequences from modENCODE [16] with seqArchR to compare its performance on real promoter sequences with that of the current state-of-the-art approach, NPLB [11].seqArchR identifies all architectures including variations of the TATA-box architectures with their precise positional changes.While seqArchR was run for 5 iterations for this data, it identifies a comparable set of clusters/ architectures after three iterations and just 38 minutes compared to the 600 minutes taken by NPLB (for its first pass).See additional details in S1 Text, section 2.

Zebrafish (Danio rerio)
Nepal et al. [18] produced the first single nucleotide resolution transcriptome (CAGE) data for a vertebrate, zebrafish, across different stages of embryonic development.We used seqArchR to identify different promoter architectures in zebrafish at three different developmental stages, from early to late, namely 64 cells (early maternal), 30% Epiboly/Dome (late maternal/ early zygotic transition), and the Prim-6 stage (late zygotic).Promoter sequences at each stage were independently analysed with seqArchR to identify clusters and architectures.We use the same naming scheme, as used for D. melanogaster, for referring to clusters identified in zebrafish, except X, Y and Z here denote the three developmental stages of zebrafish analysed in this study.
Motifs.In comparison to Drosophila, core promoters of zebrafish (and other vertebrates) do not possess as much variety of sequence motifs at fixed distance from the dominant TSS.In zebrafish, across developmental stages, genome-wide transcription initiation occurs at Y −1 R +1 (see Figs 8, 9 and 10; [18]).Motif GGG +1 is seen in some clusters (C4X, C7Y, C7Z) where majority of the CTSSs are located in non-promoter regions, especially 3' UTR.
At 64 cells stage, which is only 2 hours post fertilisation (hpf), all architectures contain the canonical pyrimidine-purine initiator with an upstream TATA-box or a W-box-a TATA-like element, but more degenerate than canonical TATA box.This is the architecture used for maternal transcription in the oocyte [5].The TATA-box containing cluster with the sharpest architecture among all is enriched for lectins (e.g., rhamnose-binding lectins) which are carbohydrate binding proteins (Fig 11B ) and are known to provide essential innate immune response in early embryos in marine animals.While it is known that the W-box architecture is sufficient for oocyte expression, rhamnose-binding lectins are also known to be expressed in other types of cells in ovaries and also in gut epithelium [34], this might explain why they have a canonical TATA-box.
While clusters C6X at 64 cells and C13Y at 30% Epiboly/Dome stage have enrichment of WW downstream starting at ca. 50 bp, this signal is not periodic.In comparison, cluster C12Z, among the broader ones at the Prim-6 stage, exhibits a 10 bp periodicity in dinucleotide WW.This periodicity is known to be an intrinsic property of promoters with a well-positioned + 1 nucleosome downstream of the nucleosome free region (NFR) spanning the promoter [6,35].Clusters C13,15Z also show a weak periodic signal visible in the WW motif occurrence heatmap (Fig AD in S1 Text).This is characteristic of broad promoter architectures where the position of the downstream nucleosome defines a so called 'catchment' area for transcription initiation [7].
Functional specialisation of clusters.Even though more than 90% of the stable RNA in 30% Epiboly/Dome stage is still maternally inherited [5], seqArchR already detects zygotically activated promoters and their specific architectures.Notably, it detects the promoters of the first wave of zygotic genome activation (ZGA), primarily coming from a dedicated gene cluster on chromosome 4 (clusters C2,9,12,14,15Y) (Fig 11A , and Fig AF in S1 Text).These clusters have genes encoding mir-430, which are present in hundreds of moderately divergent copies [36], and whose miRNA product specifically targets maternal RNAs for degradation [37,38].Additionally, at the Prim-6 stage, cluster C6Z contains predominantly forward strand promoters on located on chromosome 22 (Fig AG in S1 Text).The genes in this locus all belong to a family of NACHT-domain and leucine-rich-repeat-containing (NLR) proteins, involved in innate immunity.In zebrafish lineage, this family has been tandemly expanded into approximately 400 copies on chromosome 22 [39].Since the expansion was relatively recent, the duplicated promoter sequences did not have time to diverge, even in neutrally evolving positions.The ability to detect all these architectures in a sample dominated by maternal RNA  stage, about half of the promoter sequences have the architecture where dispersed transcription initiation is orchestrated by the first downstream nucleosome (C12Z).Baranasic et al. [35] also reported that the chromatin organisation of promoters between 30% Epiboly/Dome and Prim-6 stages remains stable.

Human (Homo sapiens)
Human promoter sequences are known to be challenging for sequence analysis methods owing to their sequence heterogeneity and, in the case of majority of protein-coding genes, high GC and CpG percentages.To test seqArchR on human core promoter sequences, we gathered CAGE data for different human cell lines and tissues from ENCODE.They were pooled by merging CAGE data for all cell lines (see Fig N in S1 Text).Using a threshold of 1 TPM, about 9500 promoter sequences were analysed using seqArchR.
Representative promoter windows.First, we tried the same sequence window as in zebrafish (−45, +150) because we expected from the FANTOM5 analysis (supplementary figure 13 in Andersson et al. [40]) that a majority of CpG-island overlapping and broad promoters will have a stable position of +1 nucleosome and an underlying nucleosome positioning signal, albeit weaker than in zebrafish.Keeping the bound value moderately stringent at 10 −6 , the analysis separated architectures that had a TATA-like signal, ca.30 bp upstream, from the CGrich architectures (see Fig AC in S1 Text).The inability to identify a stronger, canonical, TATA The reason for this low percentage is due to the way the set of promoters was obtained: by pooling CAGE signal from many different tissue and cell types, the signal from tissue-specific genes is diluted, unlike that of broadly expressed genes, bringing a higher proportion of the former under the 1TPM threshold (see S1 Text, section "Diminished tissue-specific signal among the all-pooled CAGE data for H. sapiens.")We used the τ values as a measure of tissue specificity of promoters [41]).Of these, clusters C1, C3 and C4 have a higher median tissue specificity score than the rest of the clusters.The proportion of housekeeping genes is also extremely low in these clusters (Fig 13B).The top-5 enriched GO terms for each cluster  (excluding C2 and C7-8, which have more than 25% non-promoter CTSSs) are shown in Fig 13C .A clear distinction of enriched terms is observed between tissue-specific/focused architectures and broader architectures.
The TCT architecture seen in C6 contains the largest percentage of ribosomal protein genes (Fig 13A).They are also seen in the TATA architecture clusters, albeit much fewer in proportion.This is not surprising given that earlier studies have reported that the promoters of these genes in vertebrates are known to harbour TATA or TATA-like sequences [42][43][44].Nepal et al. [25] reported on dual initiation promoters, where YC initiation (which includes TCT) is intertwined with canonical initiation (YR) in metazoans, and also being pervasive in human and Drosophila.It was also shown that dual initiation promoters are often broad in shape owing to many TSSs in vicinity of each other.Our observations here for human promoter architectures are in line with this (see Fig 13B)-proportion of dual initiation promoters is higher in clusters with broad promoters, with cluster C2 (non-promoter CTSSs) and C6 (TCT/YC architecture) as exceptions.
Exclusion of initiator sequence and subsequent analysis.We envisaged that, at least in vertebrates, because the initiator is degenerate but still adhering to the YR consensus, it may lead to splitting clusters by the four possible YR dinucleotides, a split uninformative of the functional promoter architecture.Thus, to test this impact of the initiator on the clustering, we compared the clusters obtained with and without the initiator sequence.Refer to Fig AM in S1 Text for the following.The results show some interesting observations.In most of the clusters, a large proportion of promoters stay together when with or without the initiators.Examples of such behaviour are clusters C2-5, C8-9, C11-13, C17,19-20 (left panel; with initiator), and approximately retain their shape-based rankings between the two.In some clusters, such as C14-16 (with initiator) where the flanks show slight GC enrichment, a good chunk of these promoters get distributed across some narrow and broad architecture clusters, implying that they clustered together due to the initiator sequence TG.

Discussion
Promoter sequence architectures are much more than just individual sequence motifs lying anywhere within the promoter region.In some instances, these sequence motifs must occur in tandem, at fixed (relative) distances from one another.In addition to motifs, the sequence composition also plays an important role.This gives rise to a diverse set of sequence architectures even within a single species.Organisms are known to use different core promoter grammars in different regulatory contexts, such as different stages of embryogenesis, e.g., early vs late, or for different functions, e.g., tissue specific vs housekeeping functions.Moreover, there is diversity even across organisms.Therefore, there is a need for de novo approaches to explore and study the diverse promoter sequence architectures across species and organisms.
We present seqArchR, an approach for de novo identification of sequence elements based on non-negative matrix factorisation.seqArchR can be used for studying promoter sequences aligned by the position of their TSSs.seqArchR's chunking-based iterative algorithm can handle scenarios of complex interactions between sequence (motif) elements.Experiments on simulated DNA sequences show its ability to identify strong and/or weak artificially planted sequence elements.seqArchR is recognised as much faster (ca.20-30x) than No Promoter Left Behind, the current state-of-the-art approach for de novo identification of promoter sequence architectures.
Different organisms are known to pose different challenges for sequence analysis.For example, Drosophila core promoter sequences harbour multiple motifs with high information content, e.g., TATA-box, DPE, Ohler motif 5, embedded in otherwise low information regions.
In comparison, most promoters in some other organisms lack such motifs.For example, most zebrafish zygotic promoters have only a YR initiator and WW periodicity downstream demarcating nucleosome position [5].Only a small percentage of core promoters in human have a TATA-box, while the majority of their promoter sequences have a higher GC content throughout.seqArchR can detect all types of motifs and non-motifs-based sequence elements.We managed to retrieve this information by processing promoter sequences across the three aforementioned organisms.Depending on the nature of sequence elements, some parameters for seqArchR, namely the stability bound, number of iterations, and per-iteration collation decision, need to be set appropriately.Adjusting the stability bound value enables adjusting the sensitivity of seqArchR to identify even weak sequence features that may be enriched in very few sequences.As discussed in the Methods section, a good choice of number of iterations and the per-iteration collations can help identify overall defragmented clusters while still keeping clusters with minor positional variations separate.Indeed, the ability to tune seqArchR in this way makes it work seamlessly across organisms.
seqArchR is developed to discover the promoter sequence components tied to the precise transcription initiation sites identified using CAGE experiments.Therefore, it treats absolute position information of sequence features as an important characteristic of any architecture.We note that, for this approach to work, nucleotide-resolution data such as that from CAGE is required.For instance, ChIP-seq signal is not sufficiently high-resolution to be useful for aligning the sequences for NMF analysis.We envisage that ChIP-exo or ChIP-nexus data, which yield high-resolution information of transcription factor binding sites, can be analysed informatively with seqArchR.

Is seqArchR another motif finder?
seqArchR has some similarities with general motif finding method, and important differences.Because seqArchR analyses sequences that can contain enriched motifs, it is able to identify motifs, albeit only positioned, i.e., aligned ones.However, seqArchR uses the complete sequence to directly infer clusters of sequences with common features.This is affected by the composition of the whole sequence, including the flanks of the motif, if present.This is unlike any motif finding approach, which looks for statistically enriched words of pre-determined size (motif length) (usually 6-15 bp and only seldom larger).Another important distinction is that seqArchR does not require/use any background (sequence) information as motif finders do.This limits seqArchR's ability to find motifs enriched in set A vs set B which is a common usecase scenario for motif finders (i.e., differential motif enrichment).In this case, seqArchR can still be used to independently analyse set A and set B and then compare the identified features.In a nutshell, seqArchR is a clustering approach, where as typical motif finders can be considered classification-based.
seqArchR could be further improved in the following ways.It currently uses the same set bound value for every chunk/cluster in every iteration.It may be beneficial to extend this to adaptively change the bound value per chunk/cluster in any iteration.In terms of its algorithm, the approach for identifying overfit clusters can be improved.Furthermore, candidate approaches to automate the decision to collate after every iteration can be explored.Also, the current approach for collation can be improved to reduce manual intervention.In its current form, seqArchR only reports complete profiles of the same length as the input sequences.We want to extend it to be able to extract motifs from complete profiles like TF-MoDISco [45].

Conclusions
seqArchR is a de novo approach for discovering and clustering promoter sequence architectures.With seqArchR, we demonstrated the ability of NMF to simultaneously discover clusters of sequences with diverse positioned sequence motifs.seqArchR seamlessly identifies promoter architectures across organisms which are characterised by sequence motifs or the underlying sequence compositions.It is much faster than the current state-of-the-art, NPLB.seqArchR is made available as an R/Bioconductor package http://www.bioconductor.org/packages/seqArchR and a website accessible at https://snikumbh.github.io/seqArchR.

Materials
All experimental CAGE datasets analysed in this study were obtained and processed as follows.CAGE data for embryonic developmental stages in Drosophila melanogaster were obtained from Schor et al. [17].Out of different strains/samples available, sample 'RAL_28' was randomly chosen.CAGE data for three transitions in embryonic development, namely 2-4h, 6-8h and 10-12h after egg laying, was processed using CAGEr v2.0.2.See this Rmarkdown document for more details.
CAGE data for the embryonic developmental stages in Zebrafish were obtained from Nepal et al. [18].Out of all stages ranging over maternal to zygotic transition, we selected three, namely 64 cells, 30% Epiboly or Dome, and Prim-6 for this study.These were processed using CAGEr v1.20.See this Rmarkdown document for more details.
For Homo sapiens, all cell lines-specific CAGE data was obtained from ENCODE consortium which was made available previously as an R package ENCODEprojectCAGE_1.0.1.tar.gzavailable at http://promshift.genereg.net/CAGEr/PackageSource/. For this study, data from all cell lines was merged and processed using CAGEr v1.The list of housekeeping genes in humans was obtained from Eisenberg and Levanon [46].The τ index, first introduced by Yanai et al. [41]), for all human genes were obtained from Palmer et al. [47].Palmer et al. used the GTEx data to compute the τ index of tissue-specificity for all human genes [47].A value closer to zero indicates equally expressed across included tissues in the GTEx data and a value closer to 1 indicates expressed in one tissue.
The list of dual initiation promoters in humans was obtained from Nepal et al. [25].

Methods
NMF seeks a low-rank decomposition of any given matrix V p×n which only has non-negative entries.Here, n is the number of samples and p the number of attributes.We note that attributes can also be referred to as features and V as data matrix.
The matrix H k×n records the coefficients of the n samples in the k-dimensional representation, k � p.The matrix W p×k provides the loadings of the original attributes, p, in the kdimensional space.W p×k is called the basis matrix (and we refer to each column of W as a basis vector), and H the coefficients matrix.All entries in W and H are non-negative.Thus, NMF achieves dimensionality reduction, and can be further used for clustering using information in the coefficients matrix.Here, it is assumed that the lower-dimensional representation obtained using NMF corresponds to different clusters present in the data, and the clusters are characterised by different groups of attributes.Due to non-negativity, NMF lends an intuitive parts-based representation in many applications where negative values do not have an intuitive meaning.
The basic idea of seqArchR is to process a given set of sequences with NMF.These sequences are represented in the form of a one-hot encoded matrix, whose each column corresponds to a sequence and the sequence features are along its rows.In the following, we describe in detail the input to seqArchR and its chunking-based iterative algorithm.Together with it, information on the relavent parameters from the corresponding R package seqArchR is also given.

Input
The input to seqArchR is a matrix of one-hot encoded representation of DNA sequences of same length, L. For each DNA sequence, its mononucleotide profile is one-hot encoded as follows.The four channels in its one-hot encoding-due to DNA alphabet, S = {A, C, G, T} -are concatenated to produce a single column vector representation per sequence.The size of this vector is four times the length of the sequence itself.Thus, for an input matrix representing mononucleotide profiles of n sequences, its dimensions are (4L × n).All entries in this matrix are non-negative, or more specifically, 0 or 1.
Similarly, the dinucleotide profile of a sequence can be one-hot encoded.The input matrix dimensions in this case are (4 2 L × n).It is known that dinucleotide profiles of DNA sequences hold more information than mononucleotide profiles [48,49].All results reported in this article are obtained using dinucleotide profiles.
As shown in Fig 2B, in the decomposition of the one-hot encoded input matrix V p×n to W p×k and H k×n , W p×k assigns scores to each (di)nucleotide at a given position.Each column of W p×k , a single basis vector, captures combinations of various nucleotides at characteristic positions in sequences and can be said to represent an architecture.As an example, consider architectures I-IV in Fig 2B .All important features ((di)nucleotide-position pairs) are captured at once in a single basis vector of W. H k×n gives the coefficients for each sequence on all k basis vectors.These coefficients are relative in nature.When a particular sequence has features characterised by basis vector A but not those by basis vector B, it gets assigned a higher coefficient for basis vector A than for B. This information can be used for soft clustering of sequences.In this fashion, coefficients for all n sequences can be interpreted.Thus, NMF can simultaneously learn the complex interdependencies and interactions between sequence features de novo, and also provide soft clustering of input sequences.
Given a set of DNA sequences as a FASTA file or a Biostrings::DNAStringSet object [50], the R package seqArchR has provision to generate one-hot encoding of mono-as well as dinucleotide profiles of sequences.This can then be used as an input.

Algorithm
seqArchR's algorithm for de novo identification of clusters and architectures among the given sequences is depicted pictorially in Fig 2A .Here, we describe the algorithm in full.
1. Chunk the given sequences.Given the input matrix, the total collection of sequences is divided into smaller chunks (subsets).These chunks are of a pre-defined size that can be set by the user (parameter: chunk_size).Note on terminology: With every chunking operation on a set of sequences, the resulting chunks of sequences are referred to as inner chunks, and the parent set of sequences are referred to as the outer chunk.At the first iteration, we have one outer chunk (the complete set of input sequences), and many inner chunks.At subsequent iterations, we expect more than one outer chunks, and one or more inner chunks per outer chunk.
2. Independently process each inner chunk with NMF as follows.
(a) Finding the appropriate number of clusters/basis vectors ( model selection): The number of basis vectors is given by k in Eq (1).We test a range of values for k (parameters: k_min and k_max) from which the optimum value is selected based on either of the two quantitative criteria mentioned below (parameter: mod_sel_type).
• Model selection using stability of obtained basis vectors.Wu et al. [51] proposed and used the Amari-type distance that for a given value of k, measures the instability of the identified basis vectors over multiple runs.They termed the set of basis vectors R learned in a single run as a dictionary.With R m dictionaries from m runs, the dissimilarity between any two dictionary pairs R, R 0 is given as The instability of the set of identified k basis vectors is the average dissimilarity of all dictionary pairs over m multiple runs as computed by Eq (2).Here, C K×K is the cross-correlation matrix of basis vectors (columns of W) from R and R 0 while j and k are column and row iterators respectively.Wu et al. [51] used this for studying spatial gene expression patterns.
With increasing values of k, the instability increases.seqArchR sets a bound on the instability.The maximal value of k whose instability is lower than the bound is selected.In the R package, the bound parameter can be set by the user.Based on experiments on simulated data and various biological datasets, the recommended value of bound is one of 10 {−6,−7,−8} .Higher powers (e.g., 10 −8 ) are more stringent and lower ones (e.g., 10 −6 ) more lenient.The user can choose values beyond this range as may be suitable for a dataset.Note that some weaker clusters may not be identified with stricter bound values, in which case lenient bound value is recommended.The default value of bound is set to 10 −6 (lenient).The default/recommended number of bootstrapped iterations when using this approach for model selection is 100 (parameter: n_runs).
• Model selection using bi-cross-validation.r-fold bi-cross-validation (bi-CV) approach was proposed by Owen and Perry [52] and recently used by Eng et al. [53].This approach tests the re-construction accuracy of the NMF solution for each value of k and chooses the one that maximises it.Multiple bootstrapped iterations of NMF are performed for each value of k.The one with the maximum average re-construction accuracy over all bootstrapped iterations is the best one.We further apply the one-standard error rule to choose the most parsimonious model: the smallest value of k for which the reconstruction accuracy is no lower than the one-standard error of the best performing k [54].
The default/recommended number of folds and bootstrapped iterations when using this approach is 5 and 500 respectively (parameter: cv_folds and n_runs).Note that the multiple runs are performed per-fold.The effective number of iterations thus performed is 2500 per value of k.
The stability-based model selection procedure is the recommended approach since it is faster owing to the smaller number of bootstrapped runs required per value of k.
set_ocollation).One can suppress collation for a particular iteration or all of the specified number of iterations.For instance, if three iterations are to be performed, the user can choose to collate clusters only for the second iteration.Deferring collation is a useful strategy, especially in scenarios where the sequence features/motifs have minor positional variations.For example, in Drosophila promoter sequences, where TATA-box is observed anywhere between 32 to 28 bp upstream of the TSS, deferring collation to the second iteration is helpful to detect these minor positional differences right away.In fact, in all experiments reported here, except those with simulated data, we defer collation to the second iteration.Number of iterations.The appropriate number of iterations depends on the nature and number of sequence architectures.It is almost always a good idea to perform at least three iterations or at least one iteration after a deferred collation exercise when analysing real promoter sequences.

Curation of collated clusters
Another point to note is that depending on the choices for distance measure and agglomeration method, programmatic collation of basis vectors with hierarchical clustering can sometimes combine clusters which are perhaps better off separate.Indeed many studies require some manual curation when combining position weight matrices (PWMs) with hierarchical clustering [57] Therefore, we recommend that the clusters from the last iteration be left uncollated (by appropriately setting the parameter set_ocollation to FALSE for the last iteration).The final output from seqArchR will still be a collation of clusters from the last iteration which, as noted above, is treated as independent of per-iteration collation.Keeping clusters from the last iteration uncollated provides the user with a choice to collate the clusters in a custom manner and curate them when necessary.The users can either use built-in functions from seqArchR package or any other custom script.All results reported in the manuscript follow this strategy (except simulated data results).Thus, seqArchR requires no prior information or specification about the nature of architectures.It can distinguish sequences with motif-based architectures from non-motif-based architectures.If there are motifs, it does not require any information on the expected number of motifs or motif lengths or if any of the motifs have gaps (and the size of the gaps), or if some of them work cooperatively resulting in a combinatorial interplay.

Rationale for the chunking-based algorithm
As discussed in the Results section, promoter sequence architectures often harbour only slightly varying sequence elements at the same position in all sequences.For example, the canonical YR initiator-case of same length motifs overlapping in occurrence positions, and Inr CA overlapping with the first CA of Ohler1 motif in Drosophila.A chunking-based approach makes it relatively easier to select the best model by alleviating the above mentioned co-occurrence cases and handling fewer sequences at once.

Fig 2 .
Fig 2. Overview of seqArchR.(A) Schematic showing seqArchR's chunking-based, iterative algorithm.(B) Schematic describing input to seqArchR and the factorisation output.For each chunk of sequences being processed with NMF, the sequences are represented as a one-hot encoded matrix (hence, 0/1 matrix), denoted in the schematic by matrix V p×n ; matrices W p×k and H k×n are respectively the basis matrix and coefficients matrix obtained upon factorisation.n denotes the number of input sequences, p, the number of features, k, the optimal number of dimensions selected for the low-rank representation and L, the length of the input sequences.The schematic depicts one-hot encoding of the dinucleotide profile of sequences.One can use the mono-or dinucleotide profile of sequences.

Fig 3 .
Fig 3. Assessment of performance of seqArchR on simulated data.(A) Adjusted Rand index (ARI) attained by seqArchR with various parameter combinations on simulated data (1000 sequences) (data described in Table 1).Bar heights represent average over ten runs.Experiments are performed for various chunk sizes, mutation rates (m) and number of mutated motif positions (p).(B) ARI attained by NPLB for 1000 sequences setting.(C, D) ARI attained by seqArchR and NPLB respectively for two scaled up versions of the simulated data: 5000 and 10000 sequences.See main text for more details.(E, F) Time taken by seqArchR and NPLB respectively to process simulated data (see main text for details).https://doi.org/10.1371/journal.pcbi.1011491.g003 such as the downstream regulatory element (DRE) or Ohler motif 2 (NDM4).The non-directional sequence elements are visible in the sequence logos of the raw clusters from seqArchR (see Figs F, G, and H in S1 Text).These were manually combined to obtain the final clusters as seen in the main figures (see Methods section, "Curation of collated clusters" subsection; and Figs F, G, and H in S1 Text).TATA and DPE sequence elements are known to pair mutually exclusively with the Inr for architectures characteristic of tissue-specific and developmental genes respectively [27].DPE appears very weak in the sequence logo for C8X (Fig 4), C7,10Y, (Fig 5) and C3,5Z (Fig 6) as against C1Y (Fig 5) due to collation of clusters with similar but stronger/weaker DPE signals resulting in an overall weaker DPE signal in the final cluster (Table 2).C8X with Inr+DPE architecture and C1Y with Inr+DPE1 (variant of DPE) architecture have a very short median

Fig 4 .
Fig 4. Clusters and architectures identified by seqArchR for D. melanogaster, 2-4h AEL.Sequence clusters arranged by the median interquantile widths (IQW) of CAGE TCs in seqArchR clusters (shortest on top, broadest at the bottom).From left to right: Box and whisker plots of per-cluster IQWs, TPMs, and PhastCons scores followed by per-cluster sequence logos, and stacked barplots showing proportion of TCs unique/shared between transitions.Sequence logos for histone gene clusters are shown with a grey background.'All' denoting common between all stages.TPM, Tags per million.https://doi.org/10.1371/journal.pcbi.1011491.g004 Fig 7A compares the top-5 gene ontology (GO) terms that the TATA-box and DPE clusters are enriched for.Interestingly, many GO terms enriched in the DPE clusters are common across stages, while all the TATA-box clusters are enriched for mostly different GO terms, but

Fig 5 .
Fig 5. Clusters and architectures identified by seqArchR for D. melanogaster, 6-8h AEL.Sequence clusters arranged by the median interquantile widths (IQW) of CAGE TCs in seqArchR clusters (shortest on top, broadest at the bottom).From left to right: Box and whisker plots of per-cluster IQWs, TPMs, and PhastCons scores followed by per-cluster sequence logos, and stacked barplots showing proportion of TCs unique/shared between transitions.Sequence logos for histone gene clusters are shown with a grey background.'All' denoting common between all stages.TPM, Tags per million.https://doi.org/10.1371/journal.pcbi.1011491.g005

Fig 6 .
Fig 6.Clusters and architectures identified by seqArchR for D. melanogaster, 10-12h AEL.Sequence clusters arranged by the median interquantile widths (IQW) of CAGE TCs in seqArchR clusters (shortest on top, broadest at the bottom).From left to right: Box and whisker plots of per-cluster IQWs, TPMs, and PhastCons scores followed by per-cluster sequence logos, and stacked barplots showing proportion of TCs unique/shared between transitions.Sequence logos for histone gene clusters are shown with a grey background.'All' denoting common between all stages.TPM, Tags per million.https://doi.org/10.1371/journal.pcbi.1011491.g006

Fig 8 .
Fig 8. Clusters and architectures identified by seqArchR for 64 cells stage.Sequence clusters arranged by the median interquantile widths (IQW) of CAGE TCs in seqArchR clusters (shortest on top, broadest at the bottom).Each panel, from left to right: Box and whisker plots of per-cluster IQWs, and TPMs followed by per-cluster sequence logos, and stacked barplots showing proportion of different genomic annotations.TPM, Tags per million.https://doi.org/10.1371/journal.pcbi.1011491.g008

Fig 9 .
Fig 9. Clusters and architectures identified by seqArchR for 30% Epiboly/Dome stage.Sequence clusters arranged by the median IQW of CAGE TCs in seqArchR clusters (shortest on top, broadest at the bottom).Each panel, from left to right: Box and whisker plots of per-cluster IQWs, and TPMs followed by per-cluster sequence logos, and stacked barplots showing proportion of different genomic annotations.TPM, Tags per million.https://doi.org/10.1371/journal.pcbi.1011491.g009

Fig 10 .Fig 11 .
Fig 10.Clusters and architectures identified by seqArchR for Prim-6 stage of zebrafish development.Sequence clusters arranged by the median interquantile widths (IQW) of CAGE TCs in seqArchR clusters (shortest on top, broadest at the bottom).From left to right: Box and whisker plots of percluster IQWs, and TPMs followed by per-cluster sequence logos, and stacked barplots showing proportion of different genomic annotations.For cluster C12Z, an additional zoomed-in view of the section from 45 bp to 150 bp downstream is shown.TPM, Tags per million.https://doi.org/10.1371/journal.pcbi.1011491.g010

20 .
Library sizes of all cell lines that are included are shown in Fig AJ in S1 Text.
Figs F, G, and H in S1 Text show how this process was carried out for D.melanogaster; Figs L, M, and N in S1 Text show this for D.rerio; and Fig Q in S1 Text shows this for H.sapiens.
Fig D in S1 Text shows the architectures of clusters, identified among Drosophila promoter sequences from modENCODE, using NMF with stability-based bound criterion for model selection in the first iteration.It can be seen that, the first set of identified clusters (across chunks) are majorly governed by initiator elements.Additionally, for larger chunk-sizes, the memory usage increases linearly (see Fig B in S1 Text).Therefore, having the ability to process very large datasets one chunk at a time enables