Fig 1.
First, preprocessing transforms genomic assay sequencing reads into signal datasets. Second, with signal datasets as input, a SAGA algorithm partitions the genome and assigns an integer label to each segment, yielding an annotation. Third, a researcher interprets the labels, assigning a biological interpretation to each. ChIP-seq, chromatin immunoprecipitation followed by sequencing; SAGA, segmentation and genome annotation.
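The second and third steps can be illustrated with a minimal sketch. It assumes the SAGA algorithm has already produced one integer label per fixed-width genomic bin; the function name `labels_to_segments`, the 200-bp bin size, and the label-to-term mapping are hypothetical choices for illustration, not taken from any particular SAGA method.

```python
from itertools import groupby


def labels_to_segments(bin_labels, bin_size=200):
    """Collapse per-bin integer labels into (start, end, label) segments.

    bin_size is an assumed resolution in base pairs.
    """
    segments = []
    start = 0
    for label, run in groupby(bin_labels):
        length = len(list(run)) * bin_size  # consecutive bins sharing a label
        segments.append((start, start + length, label))
        start += length
    return segments


# Hypothetical interpretation step: a researcher maps each integer label
# to a biological term after inspecting the annotation.
interpretation = {0: "Quiescent", 1: "Promoter", 2: "Transcribed"}

segments = labels_to_segments([0, 0, 1, 1, 1, 2, 0])
annotation = [(start, end, interpretation[label]) for start, end, label in segments]
```

Here the integer labels carry no meaning on their own; the `interpretation` dictionary stands in for the manual third step.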
Table 1.
Timeline of selected SAGA methods.
Fig 2.
Two representations of an HMM.
(A) Conditional dependence diagram representation of an unrolled HMM with sequence of hidden states and sequence of observations. In this representation, each node represents a hidden discrete (white rectangle) or observed continuous (gray circle) random variable. For every index t, each hidden random variable Qt takes on some value qt; similarly, each observed variable Xt takes on some value xt. Xt may represent either scalar or vector observations. Solid arcs represent conditional dependence relationships between random variables. (B) State transition diagram representation of Rover and Thomas’s weather example. In this representation, each node represents a potential value of the hidden variable Qt. The hidden variable takes on values r (rainy) or ¬r (not rainy) on any given day t. Solid arcs represent transitions between hidden states, which have transition probabilities A. HMM, hidden Markov model.
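As a rough illustration of how the two representations fit together, the sketch below decodes the most probable hidden state sequence for a two-state weather HMM using the Viterbi algorithm. All numbers are assumptions made up for the example: the initial distribution, the transition probabilities in `A`, and the Gaussian emission parameters for a scalar observation are not given in the source.

```python
import math

# Hypothetical parameters, chosen only for illustration (not from the source).
STATES = ("r", "not_r")                 # rainy / not rainy
INITIAL = {"r": 0.5, "not_r": 0.5}      # assumed distribution of the first hidden state
A = {                                   # assumed transition probabilities P(Q_t | Q_{t-1})
    "r": {"r": 0.7, "not_r": 0.3},
    "not_r": {"r": 0.2, "not_r": 0.8},
}
# Assumed Gaussian emission (mean, standard deviation) for a scalar observation.
EMISSION = {"r": (0.8, 0.1), "not_r": (0.4, 0.1)}


def log_gaussian(x, mean, std):
    """Log density of a normal distribution evaluated at x."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)


def viterbi(observations):
    """Return the most probable hidden state sequence for a non-empty list of observations."""
    # best[s] holds the log probability of the best path that ends in state s.
    best = {
        s: math.log(INITIAL[s]) + log_gaussian(observations[0], *EMISSION[s])
        for s in STATES
    }
    backpointers = []
    for x in observations[1:]:
        step, new_best = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: best[p] + math.log(A[p][s]))
            step[s] = prev
            new_best[s] = best[prev] + math.log(A[prev][s]) + log_gaussian(x, *EMISSION[s])
        backpointers.append(step)
        best = new_best
    # Trace back from the best final state to recover the full path.
    state = max(STATES, key=lambda s: best[s])
    path = [state]
    for step in reversed(backpointers):
        state = step[state]
        path.append(state)
    return path[::-1]


path = viterbi([0.82, 0.79, 0.41, 0.38])
```

With these assumed parameters the emissions are well separated, so the decoded path switches from rainy to not rainy where the observations drop.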
Fig 3.
Annotating multiple cell types.
(A) Datasets generated by the ENCODE and Roadmap Epigenomics consortia as of 2019. The black cells represent the datasets actually generated out of a larger number of potential combinations of cell type and assay type. (B) Annotating 6 datasets from 3 different samples: 3 from cell type A, 2 from cell type B, and 1 from cell type C. Colored letters over signal data indicate data associated with those samples. One can use 3 different SAGA strategies with this collection of datasets. Independent: performing training and inference completely independently on each sample. This yields a different annotation for each sample. Concatenated (horizontal sharing): training a single model across all cell types. This yields 1 annotation per sample with a shared label set. Each sample must have the same datasets, necessitating imputation of any missing datasets. Stacked (vertical sharing): performing training and inference on datasets from all samples. This yields a single pan–cell-type annotation. ChIP-seq, chromatin immunoprecipitation followed by sequencing; DNase-seq, DNase I hypersensitive sites sequencing; ENCODE, Encyclopedia of DNA Elements; SAGA, segmentation and genome annotation.
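The difference between the three strategies in panel B is largely a matter of how the input is arranged before training. The sketch below shows one way to picture it, using toy signals with 3 genomic bins per dataset. The assay names, signal values, and function names are hypothetical, and `None` is used only as a placeholder for values that would have to be imputed.

```python
# Hypothetical toy signals: sample -> assay -> one signal value per genomic bin.
signals = {
    "A": {"assay1": [1.0, 0.2, 0.1], "assay2": [0.9, 0.8, 0.0], "assay3": [1.2, 0.7, 0.1]},
    "B": {"assay1": [0.1, 0.3, 1.1], "assay2": [0.0, 0.6, 0.9]},
    "C": {"assay1": [0.5, 0.5, 0.4]},
}
ASSAYS = ["assay1", "assay2", "assay3"]


def independent_inputs(signals):
    """One matrix per sample: rows are bins, columns are that sample's own assays."""
    return {
        sample: [list(row) for row in zip(*tracks.values())]
        for sample, tracks in signals.items()
    }


def concatenated_inputs(signals, assays):
    """One matrix whose rows are the bins of every sample laid end to end.

    Every sample needs every assay; missing ones are marked None here,
    standing in for values that would have to be imputed.
    """
    rows = []
    for tracks in signals.values():
        n_bins = len(next(iter(tracks.values())))
        for i in range(n_bins):
            rows.append([tracks[a][i] if a in tracks else None for a in assays])
    return rows


def stacked_inputs(signals):
    """One matrix with one row per bin and one column per (sample, assay) dataset."""
    columns = [track for tracks in signals.values() for track in tracks.values()]
    return [list(row) for row in zip(*columns)]


independent = independent_inputs(signals)          # 3 matrices: 3x3, 3x2, 3x1
concatenated = concatenated_inputs(signals, ASSAYS)  # 9 rows x 3 columns
stacked = stacked_inputs(signals)                  # 3 rows x 6 columns
```

In this picture, the concatenated input grows along the genome axis (more rows, shared columns), while the stacked input grows along the dataset axis (same rows, more columns), which is why the former yields one annotation per sample and the latter a single annotation.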
Fig 4.
Visualizations of SAGA annotations.
(A) Genome browser display showing 164 cell type annotations for a 20-kbp region on human chromosome 15 (GRCh37/hg19) [76]. Each annotation has 9 labels: Promoter (red), Enhancer (orange), Transcribed (green), Permissive regulatory (yellow), Bivalent (purple), Facultative heterochromatin (light blue), Constitutive heterochromatin (black), Quiescent (gray), and Low Confidence (light gray). (B) Importance score (CAAS) for the same region. Total height at each position indicates the position’s estimated importance. Height of a given color band denotes the contribution toward importance of the associated label. (C) Genome-wide visualization of the SAGA annotation for 164 samples aggregated over GENCODE [77] protein-coding gene components. Rows: the 9 labels of the annotation. Columns: gene components, including 10-kbp flanking regions upstream and downstream. Each cell shows the enrichment of the row’s label at a position along the column’s component. Figures derived from [14]. CAAS, conservation-associated activity score; SAGA, segmentation and genome annotation.