Efficient discovery of frequently co-occurring mutations in a sequence database with matrix factorization

doi:10.1371/journal.pcbi.1012391

Fig 1.

The workflow of our complete methodology for deriving CMPs, including the branch that uses the brute-force method for the comparative analysis.

Fig 2.

Example cost or distance function computation for a point mutation for the matrix V.

Here, an asparagine (N) is mutated to a glutamic acid (E). Only the highlighted (diagonal) cells of the matrix are considered and summed, giving a final score of 2 (1 + 0 + 1) for the N-to-E positional comparison. This computation is repeated for each position in the protein, across all protein sequences in the database.

Fig 3.

The dissimilarity parameter (DSIM), which indicates the uniqueness of a set of factors in the H matrices, calculated for each r value across a range of r.

To better visualize the range of DSIM values, we plotted their logarithm. The light blue region beneath the function curve highlights which r values produced the most unique factors. We opted to keep factors for r values that included mostly positive DSIM values.

Table 1.

Examples of how much overlap there was between CMPs and their dropped subset mutations.

CMP ID refers to the ID column of Table 2; Sequences is the absolute difference between the frequency of a CMP and that of its dropped subset; and Total is the percentage that Sequences makes up of the 750,391 sequences in the data.
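
The two derived columns described in this caption are simple arithmetic, sketched below as a minimal illustration. The function name `overlap_row` and its parameter names are hypothetical and do not come from the paper; only the database size of 750,391 is taken from the caption. The frequencies in the usage line are made-up inputs, not values from Table 1.

```python
# Database size stated in the Table 1 caption.
TOTAL_SEQUENCES = 750_391


def overlap_row(cmp_frequency, subset_frequency, total=TOTAL_SEQUENCES):
    """Compute the 'Sequences' and 'Total' columns for one CMP/subset pair.

    cmp_frequency and subset_frequency are sequence counts (hypothetical
    parameter names). Returns (sequences, total_percent).
    """
    # 'Sequences': absolute difference between the two frequencies.
    sequences = abs(cmp_frequency - subset_frequency)
    # 'Total': that difference as a percentage of all sequences in the data.
    total_percent = 100.0 * sequences / total
    return sequences, total_percent


# Illustrative inputs only; these are not values reported in the paper.
sequences, total_percent = overlap_row(500_000, 499_000)
print(sequences, round(total_percent, 4))
```

A small Total suggests that the CMP and its dropped subset occur in nearly the same set of sequences.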

Fig 4.

Example clustered heat-map of an H matrix for r = 12.

The X-axis shows a number arbitrarily assigned to each factor according to its position in the matrix (i.e., 0-11). The Y-axis shows the residue positions that remain after all-zero rows are removed, so only the H matrix rows that contained CMPs are displayed.
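
The row filtering mentioned in this caption can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the plotted matrix is laid out with one row per residue position and one column per factor, and the function name `drop_all_zero_rows` and the toy values are hypothetical.

```python
def drop_all_zero_rows(matrix):
    """Keep only rows that contain at least one nonzero entry.

    matrix is a list of rows (one per residue position, by assumption),
    each row holding one value per factor. Returns the original indices
    of the kept rows together with the kept rows themselves.
    """
    kept_indices = []
    kept_rows = []
    for index, row in enumerate(matrix):
        if any(value != 0 for value in row):
            kept_indices.append(index)
            kept_rows.append(row)
    return kept_indices, kept_rows


# Toy example: 4 residue positions by 3 factors (made-up values).
toy = [
    [0.0, 0.0, 0.0],
    [0.7, 0.0, 0.0],
    [0.0, 0.0, 0.0],
    [0.0, 0.2, 0.9],
]
positions, rows = drop_all_zero_rows(toy)
print(positions)
```

Returning the original indices alongside the rows keeps the Y-axis labels tied to the true residue positions after filtering.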

Table 2.

List of CMPs with their respective frequencies. Each element in a CMP is the position number preceded by the residue in the source (Wuhan) sequence.

Fig 5.

Bar chart showing each CMP’s prevalence in a PANGO lineage, based on the lineage label within the data.

To simplify comparison, only the top PANGO labels (those with more than 100,000 counts per CMP) are shown here, so the sum along the X-axis may not reflect the entire database count. The labels (C1-C30) on the Y-axis are sorted by each CMP’s first and last occurring PANGO label (the “birth” and “death” of the CMP) as well as by its count.
