Identifying promoter sequence architectures via a chunking-based algorithm using non-negative matrix factorisation

doi:10.1371/journal.pcbi.1011491

Fig 2

Overview of seqArchR.

(A) Schematic of seqArchR’s chunking-based, iterative algorithm. (B) Schematic of the input to seqArchR and the factorisation output. For each chunk of sequences processed with NMF, the sequences are represented as a one-hot encoded (hence 0/1) matrix, denoted in the schematic by V_p×n; W_p×k and H_k×n are, respectively, the basis matrix and the coefficients matrix obtained upon factorisation. Here n denotes the number of input sequences, p the number of features, k the optimal number of dimensions selected for the low-rank representation, and L the length of the input sequences. The schematic depicts one-hot encoding of the dinucleotide profile of the sequences; either the mono- or the dinucleotide profile can be used.

doi: https://doi.org/10.1371/journal.pcbi.1011491.g002
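
The factorisation in panel (B) can be illustrated with a minimal Python sketch. It is not seqArchR's implementation (seqArchR is an R package), and it rests on several assumptions: it uses the simpler mononucleotide profile, it assumes a feature layout of p = 4 × L rows (one row per base per position), it fixes k by hand instead of selecting an optimal value, and it uses plain multiplicative-update NMF on a single chunk. The function names `one_hot_encode`, `nmf` and `reconstruction_error` are hypothetical.

```python
import random

ALPHABET = "ACGT"


def one_hot_encode(seqs):
    """Build V (p x n) for n equal-length sequences.

    Assumed layout: p = 4 * L, with row 4 * pos + base_index set to 1.
    """
    L = len(seqs[0])
    if any(len(s) != L for s in seqs):
        raise ValueError("all sequences must have the same length L")
    V = [[0.0] * len(seqs) for _ in range(4 * L)]
    for j, seq in enumerate(seqs):
        for pos, base in enumerate(seq.upper()):
            V[4 * pos + ALPHABET.index(base)][j] = 1.0
    return V


def transpose(A):
    return [list(row) for row in zip(*A)]


def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]


def nmf(V, k, n_iter=200, seed=0, eps=1e-9):
    """Approximate V (p x n) by W (p x k) times H (k x n), all non-negative.

    Uses multiplicative updates for the squared (Frobenius) error.
    """
    rng = random.Random(seed)
    p, n = len(V), len(V[0])
    W = [[rng.random() + eps for _ in range(k)] for _ in range(p)]
    H = [[rng.random() + eps for _ in range(n)] for _ in range(k)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H), element-wise
        Wt = transpose(W)
        num = matmul(Wt, V)
        den = matmul(matmul(Wt, W), H)
        H = [[H[a][j] * num[a][j] / (den[a][j] + eps) for j in range(n)]
             for a in range(k)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        Ht = transpose(H)
        num = matmul(V, Ht)
        den = matmul(W, matmul(H, Ht))
        W = [[W[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)]
             for i in range(p)]
    return W, H


def reconstruction_error(V, W, H):
    """Squared Frobenius norm of V - W H."""
    WH = matmul(W, H)
    return sum((v - r) ** 2
               for v_row, r_row in zip(V, WH)
               for v, r in zip(v_row, r_row))


# Toy chunk: n = 4 sequences of length L = 6, so p = 24 under the assumed layout.
toy_seqs = ["ACGTAC", "ACGTAC", "TTGCAA", "TTGCAA"]
V_toy = one_hot_encode(toy_seqs)
W_toy, H_toy = nmf(V_toy, k=2)
```

In this sketch, each column of W can be read as a position-specific sequence pattern, and each column of H gives the weights of those patterns for one input sequence. Switching to the dinucleotide profile would change only the encoding step and the value of p.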