DPCfam: Unsupervised protein family classification by Density Peak Clustering of large sequence datasets

doi:10.1371/journal.pcbi.1010610

Fig 1.

Graphical sketch of the DPCfam workflow.

(A) for each query sequence in the reference dataset (UniRef50), we collect all search sequence regions (SSRs) that align to it (BLAST with E-value < 0.1); (B) we use Density Peak Clustering (DPC) to group together SSRs that align to similar portions of the query; (C) primary clusters obtained in (B) for all queries are “metaclustered” (using, again, DPC) according to the number of SSRs they share; (D) finally, metaclusters that still share a significant number of SSRs undergo a further merging step. Material from: ‘Russo et al., Density Peak clustering of protein sequences associated to a Pfam clan reveals clear similarities and interesting differences with respect to manual family annotation, BMC Bioinformatics, published 2021, BioMed Central Ltd.’ [16].

Fig 2.

Schematic representation of the primary clusters’ distance matrix organised into blocks.

The cluster list is first split into N groups of equal size and then the distance matrix is divided into N² blocks with each block containing distances between clusters in two groups. Since the matrix is symmetric, we can consider a single block for each pair of groups. Blue blocks represent off-diagonal blocks, green blocks represent diagonal blocks.
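The block layout described above can be sketched in a few lines of Python. This is only an illustration, not the DPCfam implementation: the function names `split_into_groups` and `block_pairs` are hypothetical, and the handling of a cluster count that is not divisible by N (the first groups get one extra element) is an assumption, since the caption only states that the groups have equal size.

```python
from itertools import combinations_with_replacement


def split_into_groups(n_clusters, n_groups):
    """Split cluster indices 0..n_clusters-1 into contiguous groups.

    Assumption: when n_clusters is not divisible by n_groups, the first
    groups receive one extra cluster each.
    """
    base, extra = divmod(n_clusters, n_groups)
    groups, start = [], 0
    for g in range(n_groups):
        size = base + (1 if g < extra else 0)
        groups.append(range(start, start + size))
        start += size
    return groups


def block_pairs(n_groups):
    """Return one (i, j) block per unordered pair of groups, with i <= j.

    Because the distance matrix is symmetric, block (j, i) carries the same
    information as block (i, j) and is skipped.
    """
    return list(combinations_with_replacement(range(n_groups), 2))


groups = split_into_groups(10, 4)
blocks = block_pairs(4)
diagonal = [(i, j) for i, j in blocks if i == j]      # green blocks in Fig 2
off_diagonal = [(i, j) for i, j in blocks if i != j]  # blue blocks in Fig 2
```

With N groups, only N(N+1)/2 of the N² blocks need to be computed: N diagonal blocks and N(N−1)/2 off-diagonal blocks.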

Fig 3.

General properties of Metaclusters.

A: Complementary cumulative distribution function (1-CDF) of MC (purple) and Pfam family (green) sizes. All 210,802 MCs are considered, with the size of an MC equal to the number of its seed sequences. Pfam families (v. 33) are all those in UniRef50 ∩ UniProtKB (18,189 in total), and family size is the ‘full’ size, that is, not restricted to Pfam seed sequences. Black line: best fit of the MC 1-CDF with a power law (exponent γ = −1.15, corresponding to an exponent of −2.15 for the probability distribution function). In S3 Fig, we additionally plot the size of MCs/Pfam families as a function of their size rank. B: Venn diagram showing the percentages of Low Complexity MCs (green), Disordered MCs (red) and Coiled-Coil MCs (purple) (see Methods for definitions). Note that, in this case, only MCs containing at least 50 seed sequences and with average length ≥ 50 are considered (46,828 in total).
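The relation between the two exponents quoted in panel A follows from differentiating the fitted curve. The short derivation below is a sketch under two assumptions that the caption does not state: sizes are treated as a continuous variable, and the fit applies to the plotted 1-CDF. The symbols s (size), S (the corresponding random variable) and p (probability density) are introduced here for illustration only.

```latex
\[
1 - \mathrm{CDF}(s) = P(S \ge s) \propto s^{\gamma}, \qquad \gamma = -1.15
\]
\[
p(s) = -\frac{d}{ds}\bigl[1 - \mathrm{CDF}(s)\bigr]
     \propto -\gamma\, s^{\gamma - 1} = 1.15\, s^{-2.15}
\]
```

The density therefore decays with exponent γ − 1 = −2.15, one unit steeper than the 1-CDF.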

Fig 4.

Histogram showing average overlap (O_MC) between MCs (only those with %DA≥50) and their associated Pfam DAs.

Colors reflect the contribution of each MC category to each bin (equivalent, reduced, extended and shifted, see S1(B) Fig and Methods for definitions). The legend on the right side of the histogram reports total counts of MCs in each category and, additionally, total count of MCs with %DA<50 and total count of unknown MCs.

Fig 5.

Comparison between areas of sequence space covered by MCs in fully redundant pairs.

Note that we exclude MC pairs in parent-child relationships and are left with 25,980 pairs overall. For each MC, we generate the list of IDs of all proteins that map to at least one MC member (using the profile-HMM-based definition of MC membership, see Methods). Then, for each pair, we calculate the fraction of protein IDs that are shared between the two MCs (where the fraction is calculated with respect to the MC with the shorter protein ID list).
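The shared fraction described above, normalised by the shorter of the two protein ID lists, can be written as a small set operation. The sketch below is illustrative only: the function name `shared_fraction`, the protein IDs and the choice to return 0.0 for an empty list are assumptions, not details taken from the paper.

```python
def shared_fraction(ids_a, ids_b):
    """Fraction of protein IDs shared by two MCs.

    The fraction is taken with respect to the MC with the shorter list of
    (unique) protein IDs. Returning 0.0 for an empty list is an assumption.
    """
    a, b = set(ids_a), set(ids_b)
    smaller = min(len(a), len(b))
    if smaller == 0:
        return 0.0
    return len(a & b) / smaller


# Hypothetical protein ID lists for two MCs in a fully redundant pair.
mc1_ids = ["P1", "P2", "P3"]
mc2_ids = ["P2", "P3", "P4", "P5"]
fraction = shared_fraction(mc1_ids, mc2_ids)  # 2 shared IDs out of 3
```

Normalising by the shorter list means the value reaches 1 whenever one MC's protein set is fully contained in the other's, regardless of how much larger the second set is.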

Fig 6.

Histograms showing average overlap between Pfam families and their representative MCs.

Colors reflect the contribution of each MC category to each bin (equivalent, reduced, extended and shifted, see S1(B) Fig and Methods for definitions). A: Overlap between individual Pfam families and their representative MCs. B: Overlap between individual Pfam families or architectures and representative MCs. Given a Pfam family and its representative MC (same pairs as in A), we search for a better overlap of the representative MC with any multi-family architecture featuring the original Pfam family and up to two additional families. The reported average overlap value is thus the best among the overlap with the original family and the overlaps with all such Pfam architectures. Note that the Pfam architecture labels (equivalent/reduced/extended/shifted) are still assigned according to the overlap of the representative MC with the original Pfam family, so as to show to what extent the overlap in each MC category increases with respect to A.

Fig 7.

Histograms showing average overlap between ECOD families and their representative MCs.

Colors reflect the contribution of each MC category to each bin (equivalent, reduced, extended and shifted, see S1(B) Fig and Methods for definitions). A: Overlap between individual ECOD families and their representative MCs (cf. Fig 6). B: Overlap between individual ECOD families or architectures and representative MCs. Given an ECOD family and its representative MC (same pairs as in A), we search for a better overlap of the representative MC with any multi-family architecture featuring the original ECOD family and up to two additional families. The reported average overlap value is thus the best among the overlap with the original family and the overlaps with all such ECOD architectures. Note that the ECOD architecture labels (equivalent/reduced/extended/shifted) are still assigned according to the overlap of the representative MC with the original ECOD family, so as to show to what extent the overlap in each MC category increases with respect to A.
