Segmentation and genome annotation algorithms for identifying chromatin state and other genomic patterns
Fig 3
Annotating multiple cell types.
(A) Datasets generated by the ENCODE and Roadmap Epigenomics consortia as of 2019. Black cells mark the datasets actually generated, out of the larger number of potential combinations of cell type and assay type. (B) Annotating 6 datasets from 3 different samples: 3 from cell type A, 2 from cell type B, and 1 from cell type C. Colored letters over the signal data indicate which sample each dataset comes from. One can use 3 different SAGA strategies with this collection of datasets. Independent: performing training and inference completely independently on each sample. This yields a different annotation for each sample, each with its own label set. Concatenated (horizontal sharing): training a single model across all cell types. This yields 1 annotation per sample with a shared label set. Each sample must have the same set of datasets, necessitating imputation of any missing datasets. Stacked (vertical sharing): performing training and inference on the datasets from all samples together. This yields a single pan–cell-type annotation. ChIP-seq, chromatin immunoprecipitation followed by sequencing; DNase-seq, DNase I hypersensitive sites sequencing; ENCODE, Encyclopedia of DNA Elements; SAGA, segmentation and genome annotation.
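The three strategies in panel B differ mainly in how the signal tracks are arranged into the input matrix that a SAGA model sees. The following is a minimal sketch of that arrangement only, not of any particular SAGA tool. The assay names, signal values, and the `naive_impute` helper are hypothetical placeholders; in particular, the per-position mean used for imputation merely stands in for a real imputation method.

```python
# Toy signal tracks: sample -> assay -> per-position signal (4 genomic positions).
# Assay names and values are illustrative and are not taken from the figure.
samples = {
    "A": {"assay1": [1.0, 0.0, 2.0, 0.0],
          "assay2": [0.0, 3.0, 0.0, 1.0],
          "assay3": [2.0, 2.0, 0.0, 0.0]},
    "B": {"assay1": [3.0, 0.0, 0.0, 0.0],
          "assay2": [0.0, 1.0, 0.0, 1.0]},
    "C": {"assay1": [2.0, 0.0, 1.0, 0.0]},
}


def to_matrix(tracks):
    """Arrange a list of tracks as a positions x tracks matrix."""
    return [list(row) for row in zip(*tracks)]


def independent_inputs(samples):
    """Independent: one input matrix (and hence one model) per sample."""
    return {name: to_matrix([assays[a] for a in sorted(assays)])
            for name, assays in samples.items()}


def naive_impute(samples, assay):
    """Hypothetical stand-in for imputation: per-position mean of the
    assay over the samples in which it was measured."""
    observed = [s[assay] for s in samples.values() if assay in s]
    return [sum(values) / len(values) for values in zip(*observed)]


def concatenated_input(samples):
    """Concatenated: samples are joined along the position axis, so every
    sample needs the same assays; missing ones are imputed first."""
    all_assays = sorted({a for s in samples.values() for a in s})
    rows = []
    for name in sorted(samples):
        tracks = [samples[name][a] if a in samples[name]
                  else naive_impute(samples, a)
                  for a in all_assays]
        rows.extend(to_matrix(tracks))
    return rows, all_assays


def stacked_input(samples):
    """Stacked: every dataset from every sample becomes its own track."""
    tracks = [samples[name][a]
              for name in sorted(samples)
              for a in sorted(samples[name])]
    return to_matrix(tracks)


independent = independent_inputs(samples)       # 4x3, 4x2, and 4x1 matrices
concatenated, assays = concatenated_input(samples)  # one 12x3 matrix
stacked = stacked_input(samples)                # one 4x6 matrix
```

In this sketch, the independent strategy produces three separate matrices with different numbers of tracks, the concatenated strategy produces one matrix with 3 tracks and 3 times as many rows (one block of positions per sample), and the stacked strategy produces one matrix with 6 tracks over the original positions, which is why it yields a single annotation.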