Identifying promoter sequence architectures via a chunking-based algorithm using non-negative matrix factorisation
Fig 2
(A) Schematic showing seqArchR’s chunking-based, iterative algorithm. (B) Schematic describing the input to seqArchR and the factorisation output. For each chunk of sequences processed with NMF, the sequences are represented as a one-hot encoded (hence 0/1) matrix, denoted in the schematic by matrix V (p × n); matrices W (p × k) and H (k × n) are, respectively, the basis matrix and the coefficients matrix obtained upon factorisation. Here, n denotes the number of input sequences; p, the number of features; k, the optimal number of dimensions selected for the low-rank representation; and L, the length of the input sequences. The schematic depicts one-hot encoding of the dinucleotide profile of sequences; either the mono- or the dinucleotide profile can be used.
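
The encoding and factorisation described in panel B can be illustrated with a minimal Python sketch. This is not seqArchR's implementation: the function names `one_hot_encode` and `nmf`, the `order` parameter, the toy sequences, and the fixed k = 2 are all hypothetical, and the sketch assumes that the dinucleotide profile is built from adjacent, overlapping dinucleotides (giving p = 16 × (L − 1) features; for mononucleotides, p = 4 × L). A plain multiplicative-update NMF stands in for whatever solver seqArchR uses, and no selection of the optimal k is attempted.

```python
import itertools
import numpy as np


def one_hot_encode(seqs, order=1):
    """Return a 0/1 matrix V of shape (p, n).

    order=1 encodes the mononucleotide profile, order=2 the dinucleotide
    profile (assumption: adjacent, overlapping dinucleotides).
    """
    alphabet = ["".join(t) for t in itertools.product("ACGT", repeat=order)]
    index = {kmer: i for i, kmer in enumerate(alphabet)}
    L = len(seqs[0])                # all sequences are assumed to share length L
    n_pos = L - order + 1           # number of positions that hold a full k-mer
    V = np.zeros((len(alphabet) * n_pos, len(seqs)))
    for j, s in enumerate(seqs):
        for pos in range(n_pos):
            row = pos * len(alphabet) + index[s[pos:pos + order]]
            V[row, j] = 1.0
    return V


def nmf(V, k, n_iter=200, seed=0, eps=1e-9):
    """Approximate V (p x n) as W (p x k) @ H (k x n) with non-negative factors.

    Uses multiplicative updates for the Frobenius-norm objective; this is a
    stand-in solver chosen for brevity.
    """
    rng = np.random.default_rng(seed)
    p, n = V.shape
    W = rng.random((p, k))
    H = rng.random((k, n))
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H


# Toy input: n = 4 sequences of length L = 6 (made up for illustration).
seqs = ["ACGTAC", "ACGTTC", "TTGACA", "TTGACG"]
V = one_hot_encode(seqs, order=2)   # dinucleotide profile
W, H = nmf(V, k=2)                  # basis matrix and coefficients matrix
print(V.shape, W.shape, H.shape)
```

In this sketch each column of V corresponds to one input sequence, each column of W to one basis vector over the positional features, and each column of H gives the coefficients of one sequence in the k-dimensional representation.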