Fig 1.
Schematic depicting how, across organisms, different promoter architectures have a TSS-selection-determining sequence element at a near-fixed distance from their dominant transcription start sites (TSSs).
nt = Nucleotide(s). The silhouettes for Drosophila, Danio rerio and Homo sapiens were obtained from PhyloPic (http://phylopic.org). Licenses: “CC0 1.0 Universal Public Domain Dedication” for Drosophila (link); “CC0 1.0 Universal Public Domain Dedication” for Danio rerio (link); and “CC0 1.0 Universal Public Domain Dedication” for Homo sapiens (link).
Fig 2.
(A) Schematic showing seqArchR’s chunking-based, iterative algorithm. (B) Schematic describing the input to seqArchR and the factorisation output. For each chunk of sequences processed with NMF, the sequences are represented as a one-hot encoded (hence 0/1) matrix, denoted in the schematic by $V_{p \times n}$; $W_{p \times k}$ and $H_{k \times n}$ are, respectively, the basis matrix and the coefficients matrix obtained upon factorisation. Here n denotes the number of input sequences; p, the number of features; k, the optimal number of dimensions selected for the low-rank representation; and L, the length of the input sequences. The schematic depicts one-hot encoding of the dinucleotide profile of sequences; either the mono- or the dinucleotide profile can be used.
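As an illustration of the input matrix described in (B), the following minimal sketch builds a 0/1 matrix with one column per sequence. The function name `one_hot_matrix` and the row layout (position-major, with L − k + 1 positions for k-mers) are assumptions made for this example and are not taken from seqArchR itself.

```python
from itertools import product


def one_hot_matrix(seqs, k=1, alphabet="ACGT"):
    """Return a p x n 0/1 matrix (list of rows) for n equal-length sequences.

    k=1 encodes the mononucleotide profile, k=2 the dinucleotide profile.
    Row layout is an assumption: rows are grouped by position, then by k-mer.
    """
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("all sequences must have the same length L")
    kmers = ["".join(t) for t in product(alphabet, repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    n_pos = len(seqs[0]) - k + 1           # number of k-mer start positions
    p = len(kmers) * n_pos                 # number of features
    V = [[0] * len(seqs) for _ in range(p)]
    for col, seq in enumerate(seqs):
        for pos in range(n_pos):
            # Exactly one row is set per position and sequence.
            row = pos * len(kmers) + index[seq[pos:pos + k]]
            V[row][col] = 1
    return V
```

Under these assumptions, each column contains exactly one 1 per position, so V is non-negative and can be passed to any NMF routine. Characters outside the alphabet (for example N) raise a `KeyError` in this sketch.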
Table 1.
Sequence motifs corresponding to four architectures/clusters in the simulated data.
Fig 3.
Assessment of performance of seqArchR on simulated data.
(A) Adjusted Rand index (ARI) attained by seqArchR with various parameter combinations on simulated data (1000 sequences; data described in Table 1). Bar heights represent the average over ten runs. Experiments are performed for various chunk sizes, mutation rates (m) and numbers of mutated motif positions (p). (B) ARI attained by NPLB for the 1000-sequence setting. (C, D) ARI attained by seqArchR and NPLB, respectively, for two scaled-up versions of the simulated data: 5000 and 10000 sequences. (E, F) Time taken by seqArchR and NPLB, respectively, to process the simulated data. See main text for details.
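For readers unfamiliar with the ARI, the sketch below computes it from two label vectors using the standard pair-counting formula. It is a generic illustration and not the code used to produce this figure; the function name is chosen for this example.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two clusterings of the same items."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label vectors must have the same length")
    total_pairs = comb(len(labels_true), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two items: clusterings agree trivially
    # Pairs of items placed together in both, in the first, and in the second clustering.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    together_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    together_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = together_true * together_pred / total_pairs
    maximum = (together_true + together_pred) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. a single cluster in both clusterings
    return (together_both - expected) / (maximum - expected)
```

An ARI of 1 means the predicted clusters match the true architectures up to relabelling, and values near 0 indicate agreement no better than chance.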
Fig 4.
Clusters and architectures identified by seqArchR for D. melanogaster, 2–4h AEL.
Sequence clusters arranged by the median interquantile widths (IQW) of CAGE TCs in seqArchR clusters (shortest on top, broadest at the bottom). From left to right: Box and whisker plots of per-cluster IQWs, TPMs, and PhastCons scores, followed by per-cluster sequence logos and stacked barplots showing the proportion of TCs unique/shared between transitions. Sequence logos for histone gene clusters are shown with a grey background. ‘All’ denotes TCs common to all stages. TC, tag cluster. TPM, Tags per million.
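To make the IQW measure concrete, the following sketch computes the width of the region that holds the central share of CAGE signal in one tag cluster. The 0.1 and 0.9 quantiles are an assumption (a common choice in CAGE analyses) and are not stated in this caption; the function name and its inputs are hypothetical.

```python
def interquantile_width(positions, counts, q_low=0.1, q_high=0.9):
    """Width (in bp) between the positions where the cumulative tag count
    first reaches the q_low and q_high fractions of the cluster total.

    positions: genomic coordinates of the CTSSs in one tag cluster
    counts:    tag counts at those coordinates
    """
    total = sum(counts)
    if total <= 0:
        raise ValueError("tag cluster has no signal")
    cumulative = 0
    low_pos = high_pos = None
    for pos, count in sorted(zip(positions, counts)):
        cumulative += count
        fraction = cumulative / total
        if low_pos is None and fraction >= q_low:
            low_pos = pos
        if fraction >= q_high:
            high_pos = pos
            break
    return high_pos - low_pos + 1
```

A small IQW corresponds to a sharp promoter with most signal at a few positions, and a large IQW to a broad one.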
Fig 5.
Clusters and architectures identified by seqArchR for D. melanogaster, 6–8h AEL.
Sequence clusters arranged by the median interquantile widths (IQW) of CAGE TCs in seqArchR clusters (shortest on top, broadest at the bottom). From left to right: Box and whisker plots of per-cluster IQWs, TPMs, and PhastCons scores, followed by per-cluster sequence logos and stacked barplots showing the proportion of TCs unique/shared between transitions. Sequence logos for histone gene clusters are shown with a grey background. ‘All’ denotes TCs common to all stages. TC, tag cluster. TPM, Tags per million.
Fig 6.
Clusters and architectures identified by seqArchR for D. melanogaster, 10–12h AEL.
Sequence clusters arranged by the median interquantile widths (IQW) of CAGE TCs in seqArchR clusters (shortest on top, broadest at the bottom). From left to right: Box and whisker plots of per-cluster IQWs, TPMs, and PhastCons scores, followed by per-cluster sequence logos and stacked barplots showing the proportion of TCs unique/shared between transitions. Sequence logos for histone gene clusters are shown with a grey background. ‘All’ denotes TCs common to all stages. TC, tag cluster. TPM, Tags per million.
Table 2.
Clusters with DPE motifs at all three developmental stages in D. melanogaster.
Fig 7.
Visualisation of GO terms enriched for various clusters at different developmental stages of Drosophila melanogaster.
Top-5 enriched GO terms are shown for clusters with: (A) TATA vs DPE architectures; (B) TTANT architecture; (C) TCT architecture; (D) cAAA architecture; and (E) polythymine stretch architecture.
Fig 8.
Clusters and architectures identified by seqArchR for the 64-cell stage of zebrafish development.
Sequence clusters arranged by the median interquantile widths (IQW) of CAGE TCs in seqArchR clusters (shortest on top, broadest at the bottom). Each panel, from left to right: Box and whisker plots of per-cluster IQWs and TPMs, followed by per-cluster sequence logos and stacked barplots showing the proportion of different genomic annotations. TPM, Tags per million.
Fig 9.
Clusters and architectures identified by seqArchR for the 30% Epiboly/Dome stage of zebrafish development.
Sequence clusters arranged by the median interquantile widths (IQW) of CAGE TCs in seqArchR clusters (shortest on top, broadest at the bottom). Each panel, from left to right: Box and whisker plots of per-cluster IQWs and TPMs, followed by per-cluster sequence logos and stacked barplots showing the proportion of different genomic annotations. TPM, Tags per million.
Fig 10.
Clusters and architectures identified by seqArchR for the Prim-6 stage of zebrafish development.
Sequence clusters arranged by the median interquantile widths (IQW) of CAGE TCs in seqArchR clusters (shortest on top, broadest at the bottom). From left to right: Box and whisker plots of per-cluster IQWs and TPMs, followed by per-cluster sequence logos and stacked barplots showing the proportion of different genomic annotations. For cluster C12Z, an additional zoomed-in view of the section from 45 bp to 150 bp downstream is shown. TPM, Tags per million.
Fig 11.
(A) Genomic locations of per-cluster promoters in the zebrafish developmental stages 64 cells, 30% Epiboly/Dome and Prim-6. (B) Comparison of top-5 enriched GO terms for different clusters per stage of zebrafish development. Clusters are grouped by architecture attributes: presence vs absence of downstream enrichment of W (A/T) signal.
Fig 12.
Clusters and architectures identified by seqArchR for CAGE-derived core promoter sequences from cell lines and tissues of H. sapiens.
Sequence clusters arranged by the median interquantile widths (IQW) of CAGE TCs in seqArchR clusters (shortest on top, broadest at the bottom). From left to right: Box and whisker plots of per-cluster IQWs, TPMs, and tissue specificity scores (τ), followed by per-cluster sequence logos and stacked barplots showing the proportion of different genomic annotations. TC, tag cluster. TPM, Tags per million.
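As a rough guide to reading the τ panel, the sketch below computes a tissue specificity score under the assumption that τ follows the widely used tau index, τ = Σ(1 − x_i / x_max) / (n − 1), over expression values x_i in n tissues. The caption does not give the exact formula or any preprocessing (such as log transformation), so this is illustrative only.

```python
def tissue_specificity_tau(expression):
    """Tau index for one gene or promoter given its expression in n tissues.

    Assumes non-negative expression values; returns a value in [0, 1].
    """
    n = len(expression)
    if n < 2:
        raise ValueError("at least two tissues are required")
    if any(x < 0 for x in expression):
        raise ValueError("expression values must be non-negative")
    x_max = max(expression)
    if x_max == 0:
        raise ValueError("tau is undefined when expression is zero everywhere")
    return sum(1 - x / x_max for x in expression) / (n - 1)
```

Under this definition, values near 0 indicate uniform (housekeeping-like) expression and values near 1 indicate expression restricted to a single tissue.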
Fig 13.
Additional analyses of seqArchR clusters for H. sapiens.
(A) Per-cluster proportions of ribosomal and other (non-ribosomal) genes (left) and proportions of dual-initiation and non-dual-initiation promoters (right). (B) Proportion of housekeeping (HK) genes in each cluster, with clusters arranged in ascending order of % HK genes (bottom), and absolute numbers (top). (C) The top-5 enriched GO terms for all clusters except C2 and C7–8, which have more than 25% non-promoter CTSSs. seqArchR cluster names correspond to those from Fig 12.