Fig 1.

Diagram summarizing the different steps of the pipeline to generate the LSH Forest and hash signatures for each HOG.

The labelled phylogenetic trees generated by pyHam are converted into phylogenetic profiles and used to generate a weighted MinHash signature with Datasketch. The hash signatures are inserted into the LSH Forest and stored in an HDF5 file.
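
As a rough illustration of the signature step, the sketch below builds plain (unweighted) MinHash signatures in pure Python and compares the estimated Jaccard similarity with the exact value. It is not the Datasketch weighted MinHash used in the pipeline, and it omits the LSH Forest and HDF5 storage. The signature length of 128, the `taxon:event` labels and all function names are assumptions made for the example.

```python
import hashlib

NUM_PERM = 128  # assumed signature length, not taken from the paper


def _hash(seed: int, item: str) -> int:
    """Deterministic 64-bit hash of `item` under hash function number `seed`."""
    digest = hashlib.blake2b(f"{seed}:{item}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(events: set, num_perm: int = NUM_PERM) -> list:
    """For each hash function, keep the minimum hash over the (non-empty) set."""
    return [min(_hash(seed, e) for e in events) for seed in range(num_perm)]


def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    """The fraction of matching signature positions estimates Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def exact_jaccard(a: set, b: set) -> float:
    """Exact Jaccard similarity, for comparison with the estimate."""
    return len(a & b) / len(a | b)


# Hypothetical profiles: each element stands for one labelled tree node.
profile_a = {f"taxon{i}:presence" for i in range(0, 60)}
profile_b = {f"taxon{i}:presence" for i in range(20, 80)}

sig_a = minhash_signature(profile_a)
sig_b = minhash_signature(profile_b)
print("exact:", exact_jaccard(profile_a, profile_b))
print("estimated:", estimated_jaccard(sig_a, sig_b))
```

The signatures have a fixed length regardless of profile size, which is what allows them to be indexed and compared cheaply.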

Fig 2.

ROC curves for all profiling methods.

a. Yeast protein-protein interactions. Our method (MinHash Jaccard HogProf) performs best overall, but when high precision is required, the Enhanced Phylogenetic Tree (EPT) score [19] is still slightly more accurate. b. Human protein-protein interactions. Jaccard Hash HogProf again performs better than all other metrics overall, but when high precision is required, the EPT score is still slightly more accurate. Binary Pearson refers to a distance using binary vectors and Pearson correlation, as described in [26]. Occurrence Euclidean and Occurrence Pearson refer to occurrence profiles compared with Euclidean distance and Pearson correlation, as described in [27].

Table 1.

AUC values for profiling distance metrics.
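
For readers unfamiliar with the metric, the AUC of a ROC curve equals the probability that a randomly chosen positive pair receives a higher score than a randomly chosen negative pair, with ties counted as one half. The minimal sketch below computes it that way; the function name and the toy scores are hypothetical and are not values from the table.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    scores: similarity scores, higher meaning "more likely to interact".
    labels: 1 for a known interaction, 0 for a non-interacting pair.
    Assumes at least one positive and one negative example.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # A correctly ordered pair counts 1, a tie counts 0.5.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: one positive pair is scored below one negative pair.
print(roc_auc([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0]))  # 0.75
```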

Fig 3.

Recovery of kinetochore and APC complexes.

After mapping each of the protein families presented in van Hooff et al. [10] to its corresponding HOG, a distance matrix was constructed by computing the Jaccard hash distance between profiles using HogProf. Name colors in the rows and columns of the matrix correspond to the kinetochore and APC subcomplex components as defined manually using literature sources [10].

Fig 4.

Putative novel components of the kinetochore and APC complexes.

The profiles associated with all HOGs mapping to known kinetochore components shown in Fig 3 were used to search the LSH Forest and retrieve the top 10 closest coevolving HOGs, resulting in a list of 871 HOGs including the queries from the original complexes. The Jaccard distance matrix between the hash signatures of all query and result HOGs is shown. UPGMA clustering was applied to the rows and columns of the distance matrix. Labelled rows and columns correspond to profiles from the starting kinetochore dataset [10]. A hierarchical clustering distance cutoff of 1.3 was manually chosen (blue lines) to limit the maximum cluster size to fewer than 50 HOGs. This cutoff resulted in a total of 142 clusters of HOGs, which were used for GO enrichment to identify functional modules. The coloring of the protein family names to the right of and below the matrix is identical to the complex-related coloring shown in Fig 3.
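
The clustering step can be sketched as average-linkage (UPGMA) agglomeration that stops merging once the closest pair of clusters is farther apart than a chosen cutoff. The pure-Python version below is a minimal illustration on a toy matrix; the function name, the toy distances and the cutoff of 0.5 are assumptions, and the exact distance on which the published cutoff of 1.3 was applied is not reproduced here.

```python
from itertools import combinations


def upgma_clusters(dist, cutoff):
    """Average-linkage clustering of a symmetric distance matrix.

    Repeatedly merges the two clusters with the smallest mean pairwise
    distance and stops when that distance exceeds `cutoff`.
    Returns the clusters as sorted lists of row indices.
    """
    clusters = [[i] for i in range(len(dist))]

    def avg_dist(c1, c2):
        # UPGMA distance: mean over all cross-cluster pairs.
        return sum(dist[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        (a, b), d = min(
            (((x, y), avg_dist(clusters[x], clusters[y]))
             for x, y in combinations(range(len(clusters)), 2)),
            key=lambda pair: pair[1],
        )
        if d > cutoff:
            break
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return sorted(sorted(c) for c in clusters)


# Toy distance matrix with two obvious groups: {0, 1} and {2, 3}.
D = [
    [0.0, 0.1, 0.9, 0.8],
    [0.1, 0.0, 0.85, 0.9],
    [0.9, 0.85, 0.0, 0.2],
    [0.8, 0.9, 0.2, 0.0],
]
print(upgma_clusters(D, cutoff=0.5))  # [[0, 1], [2, 3]]
```

Lowering the cutoff yields more, smaller clusters, which is the lever used to cap cluster size before enrichment analysis.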

Table 2.

Manually curated, biologically relevant search results for interactors coevolving with the kinetochore and APC protein families selected by van Hooff et al. [10].

Protein families returned within clusters containing query HOGs are listed with their pertinent annotation and literature. This is a non-exhaustive summary of some selected results. The full enrichment results are available as S1 Data.

Table 3.

Manually curated biologically relevant enriched GO terms from returned results.

The query sequence Hap2 is UniProt identifier F4JP36 with OMA identifier ARATH26614, belonging to OMA HOG:0406399. The full enrichment results are available as S2 Data.

Table 4.

Manually curated biologically relevant enriched GO terms from returned results.

The query sequence Gex1 is UniProt identifier Q681K7 with OMA identifier ARATH38809 belonging to OMA HOG:0416115. The full enrichment results are available as S3 Data.

Table 5.

Manually curated biologically relevant enriched GO terms from returned results.

The query sequence Spo11-1 is UniProt identifier Q9M4A2 with OMA identifier ARATH19148 belonging to OMA HOG:0605395.

Fig 5.

HogProf’s reproductive network.

A list of proteins known to be involved in sexual reproduction was compiled and mapped to OMA HOGs. These queries were used to search for the 20 closest coevolving HOGs in an LSH Forest containing all HOGs in OMA. A Jaccard kernel was generated by performing an all-vs-all comparison of the hash signatures of search results and queries. UPGMA clustering was performed on the rows and columns of the kernel. A cutoff distance of 0.995 (blue lines) was manually chosen to limit cluster sizes to fewer than 50 HOGs. This generated a total of 215 clusters of HOGs. Names for queries are shown with Saccharomyces cerevisiae gene names (apart from Hap2, which is not present in fungi).

Table 6.

Manually curated biologically relevant putative interactors from sexual reproduction search results.

Protein families within clusters containing query HOGs are listed with their pertinent annotation and literature. GO enrichment results of clusters containing one or more queries were analyzed manually. Full enrichment results are available as S5 Data.

Fig 6.

To illustrate the advantageous scaling properties of MinHash data structures, synthetic profiles of length 100 were generated in the form of binary vectors (0 and 1 equiprobable).

Profiles were then processed in one of three ways: (i) clustered using an explicit calculation of the Jaccard distance; (ii) reduced to a lower dimensionality (5 dimensions) with truncated SVD, normalized and explicitly clustered using Euclidean distance, as in SVD-Phy [18]; or (iii) transformed into MinHash signatures and inserted into an LSH Forest object, as in our method. Orders of magnitude corresponding to typical use cases for profiling pipelines are marked on the x-axis. Curves were fitted to each set of timepoints to empirically determine the time complexity of each approach.
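
To make the explicit baseline concrete, the sketch below generates synthetic binary profiles of length 100 and computes the Jaccard distance for every pair. It covers only the explicit all-pairs calculation, not the SVD or LSH Forest variants, and the function names and random seed are assumptions for illustration. The number of comparisons is n(n-1)/2, so doubling the number of profiles roughly quadruples the work.

```python
import random
from itertools import combinations


def random_profiles(n, length=100, seed=0):
    """n binary vectors of the given length, with 0 and 1 equiprobable."""
    rng = random.Random(seed)
    return [[rng.randint(0, 1) for _ in range(length)] for _ in range(n)]


def jaccard_distance(u, v):
    """1 - |intersection| / |union| over the positions set to 1."""
    both = sum(1 for a, b in zip(u, v) if a and b)
    either = sum(1 for a, b in zip(u, v) if a or b)
    return 1.0 - both / either if either else 0.0


def all_pairs_jaccard(profiles):
    """Explicit all-vs-all comparison: one distance per unordered pair."""
    return {
        (i, j): jaccard_distance(profiles[i], profiles[j])
        for i, j in combinations(range(len(profiles)), 2)
    }


for n in (20, 40, 80):
    distances = all_pairs_jaccard(random_profiles(n))
    print(n, "profiles ->", len(distances), "comparisons")
```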
