Systematic clustering algorithm for chromatin accessibility data and its application to hematopoietic cells

The huge amount of data acquired by high-throughput sequencing requires data reduction for effective analysis. Here we give a clustering algorithm for genome-wide open chromatin data using a new data reduction method. This method regards the genome as a string of 1s and 0s based on a set of peaks and calculates the Hamming distances between the strings. This algorithm with the systematically optimized set of peaks enables us to quantitatively evaluate differences between samples of hematopoietic cells and classify cell types, potentially leading to a better understanding of leukemia pathogenesis.

For the γ-th chromosome (γ ∈ X), the length of the corresponding DNA sequence is written as L γ , where 5.0 × 10 7 ≤ L γ ≤ 2.5 × 10 8 and the total length is L = 22 γ=1 L γ + L X + L Y ∼ 3.1 × 10 9 . To position the x-th base pair in the γ-th chromosome, we setchromatin b γ x ∈ D, with 1 ≤ x ≤ L γ . In this paper, for a set SET, we write the number of elements in SET as |SET|. For example, we have | D | = 4 and |X| = 24.
Chromatin is a complex of DNA and associated proteins such as histones. A chromatin has "open" regions, around which the density of the DNA and the associated proteins are rather low and also "closed" regions, around which the opposite situation happens. Gene expressions are largely regulated by the interactions between DNA and transcription factors depending on the open and closed regions. The analysis of open/closed chromatin regions is necessary for the understanding of cell differentiation and phenotype [1,2].
ATAC-seq was developed for the genome-wide detection of open chromatin regions. One of the features of ATAC-seq is that it uses Tn5 transposase. At a certain proper condition, Tn5-transposase mainly cut DNA in open chromatin regions and the sequences of those DNA fragments are obtained by sequencers [3]. ATAC-seq has several advantages compared to the other epigenomic sequencing methods [4]. For example, to analyze open chromatin regions, DNase-seq needs about 10 7 -10 8 cells and takes 4-5 days to obtain the data. ATAC-seq, on the other hand, requires only about 10 3 -10 4 cells and takes half a day.

Reads
As briefly reviewed above, one Tn5-transposase cuts and splits DNA into two parts or fragments. If there are two Tn5-transposases, two locations of DNA are cut to make three fragments.
Thus, we can view fragment f as a subsequence of a DNA sequence consisting of successive symbols in D. Since we refer to the same DNA sequence of the human genome in this study, fragment f can be also represented by three coordinates: the the chromosome number γ ∈ X, the start position s = s(f ), and the end position e = e(f ), , that can be expressed as f = (γ, s, e).
A sample is, in our settings, a product generated by a certain experimental procedure through ATAC-seq library preparation from a set of cells [3].
The input of a sequencer is the set of the obtained fragments , and the number of fragments is denoted as N f . A sequencer with "paired-end sequencing" outputs a DNA sequence of the two edges of a fragment as two reads (R s i , R e i ) where meaning that each length of the two reads (R s i , R e i ) is i . We consider read as a sequence of four symbols in D of length less than or equal to 0 , where 0 can be changed as a parameter controlled by the sequencer. Note that for the case of "single-end sequencing", where one gets only a read from one edge, we obtain read R i = {R j } i j=1 . In the end, we obtain the data of reads R : where the number of reads is denoted as N r . Note that in the case of "paired-end sequencing", one may regard both R s i and R e i as R i . This is the starting point of our analysis because sequencers do not directly give the actual values of f i .
Summarizing the relationship between fragments and reads, let us assume that all reads are obtained from "paired-end sequencing" and that the sample preparation and the sequencer output are "ideal" as follows. If we denote fragment f i as sequence (b γi si , b γi si+1 , . . . , b γi ei ), then the beginning read R s i and the terminal read R e i corresponding to f i are In other words, if the length L(f i ) of fragment f i is greater than or equal to 0 , the beginning read R s i is the direct inference of the first 0 symbols of the fragment f i . The condition for terminal read R e i is similar. If length L(f i ) is less than 0 , we directly infer R s i = R e i = f i as a sequence of four symbols, where we see that the two reads have the same length.
However, the situation above is "ideal" and there are unexpected errors that stochastically flip symbols in the ideal situation. Thus, we need to infer the information of fragments in a statistical manner. Note that this inference can be straightforwardly applied to the case of "single-end sequencing".

Alignment of reads onto the reference genome
Hereafter, for simplicity, we consider single-ended reads R = {R i } N r i=1 because similar processes can be done for paired-end reads. We perform mapping of the reads data R from a sequencer onto the DNA sequence.
We use the BWA-MEM algorithm of the software BWA (v0.7.16a) with no options. This algorithm aligns each read onto the hg19 reference sequence (b γ x ) γ∈X,1≤x≤Lγ and gives an estimate of the quality of the alignment (for details, see [5] and references therein). Then we obtain the following data: • Chromosome number γ(R i ) ∈ X ∪ {U} with the start position s(R i ) and the end position e(R i ) of read R i mapped onto the DNA sequence, where 1 ≤ s(R i ) ≤ e(R i ) ≤ L γ(Ri) . Note that U is a set of unplaced sequences in any elements in X. Hereafter, X includes U with L U 3.7 × 10 6 .
• The mapping quality score MQ(R i ) ≥ 0 of read R i calculated by using the Phred quality score.
Therefore, ( γ(R i ), s(R i ), e(R i )) infers the coordinates (γ i , s i , e i ) = ( γ(R i ), s(R i ), e(R i )) of read R i onto the DNA sequence. For R i , we definê T(R i ) asT To select reliable data withT(R i ), we preprocess the outputs obtained above as follows: 1. In order to reduce duplicated reads, which could be produced artificially in the sequence sample preparation, we apply the command MarkDuplicates in PICARD software (v1.119) (http://broadinstitute.github.io/picard/) with the REMOVE DUPLICATE option.
2. Then we cut off reads with a mapping quality score MQ(R i ) less than 30. We used samtools for this purpose [6].
After processes (1) and (2), we obtain where i is the length of R i and N r denotes the number of reads after preprocessing.
. This is part of the information obtained by the preprocessing. Note that MQ(R i ) ≥ 30 holds for any i with 1 ≤ i ≤ N r and there are no duplicated pairs inP. For simplicity, hereafter, we sometimes expressP(R) asP. We use similar abbreviations for other symbols.

Pilings of reads
From the dataP, we can calculate how many reads are on position (γ, x) in the DNA sequence. We consider the set of reads located on position (γ, x) symbolically by defining Y γ,x (P(R)) := {1 ≤ i ≤ N r | γ(R i ) = γ and ( s(R i ) ≤ x ≤ e(R i ))} .
For two samples in reads data R obtained from SRA (SRR2920495.sra and SRR2920466.sra), we show Y γ,x := |Y γ,x |, which is the number of reads on each position (γ, x) in the DNA sequence in Fig S1. In this study, we used reads data R from the Gene Expression Omnibus (GEO) with accession number GSE74912 as the initial input of the analysis.