CGG toolkit: Software components for computational genomics

doi:10.1371/journal.pcbi.1011498

Fig 1.

Revived software tools.

A 2008 snapshot of the ‘Key software’ section of the CGG website followed by services (partly shown), with the list of tools made available again.


Table 1.

A list of the tools presented, and selected additional work that benefited from them.

Columns. GitHub: name of the GitHub repository where the tools and documentation are available (NA: not applicable, as a case study); the prefix of the GitHub folders implies a typical workflow (outlined in Fig 2). tool: tool name (or, in the case of studies, a codeword); year: year of original publication; PMID: PubMed identifier; citations: number of citations reported by Google Scholar on 28-Mar-2023; citations/yr: number of citations per year since original publication; short description: self-explanatory; for further details, please see the original publications. The table is sorted by PMID (which reflects the time of publication).


Fig 2.

Representation of a typical workflow using the reported tools.

Pre-processing may start with a genome collection (database symbol, upper left), optionally mixed with a curated sequence resource such as UniProt (database symbol in green, upper left). MagicMatch can optionally be used to cross-index entries at the sequence level, or simply to identify them. The sequence collection can be submitted to GeneCAST to mask compositional bias and prepare the query for sensitive searches (disk symbol with Q, lower left). For genome-scale analysis, species codes can be generated for the reference (target) set with cogent_utils, to create a uniformly named sequence set (disk symbol with R, lower middle, optionally mixed with UniProt or any other annotated collection). Sequence comparisons are executed with BLAST or other options, with query Q vs. reference R (or, in the case of all-vs-all, the disk symbol in green-blue gradient, upper middle). The vertical gray line separates this pre-processing phase from the next phase and marks the computationally intensive step, which may have a long wall-time. Two (non-mutually exclusive) output alternatives are shown: the pairs-list (in pink, upper right) and full alignments (also in pink, lower right). The pairs-list can be processed with clustt_utils, which launches Tribe-MCL and generates protein families, or it can be used as input for network visualization with BioLayout or similar software. The full alignments can be further processed with GeneRAGE or DifFuse for multi-domain or gene-fusion detection, respectively, as well as inspected and parsed for multiple alignments.
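
The pairs-list output described above can be illustrated with a minimal sketch in Python. It is not part of the reported tools and does not reproduce the interface of clustt_utils or Tribe-MCL. It assumes that the sequence comparison was run with BLAST+ in its standard 12-column tabular format (-outfmt 6), in which the query identifier, the reference identifier and the e-value are columns 1, 2 and 11. The function name blast_to_pairs, the e-value cutoff and the choice to drop self-hits are hypothetical, chosen only for illustration.

```python
import csv
import io


def blast_to_pairs(handle, max_evalue=1e-5):
    """Reduce BLAST tabular hits to a (query, reference, e-value) pairs-list.

    Assumes the 12-column BLAST+ tabular layout (-outfmt 6). The cutoff and
    the removal of self-hits are illustrative choices, not settings taken
    from the original tools.
    """
    best = {}
    for row in csv.reader(handle, delimiter="\t"):
        # Skip blank lines and comment lines (e.g. from -outfmt 7).
        if not row or row[0].startswith("#"):
            continue
        query, ref, evalue = row[0], row[1], float(row[10])
        # Drop self-hits (relevant for all-vs-all runs) and weak hits.
        if query == ref or evalue > max_evalue:
            continue
        # Several alignments may exist per pair; keep the lowest e-value.
        key = (query, ref)
        if key not in best or evalue < best[key]:
            best[key] = evalue
    return [(q, r, e) for (q, r), e in sorted(best.items())]


# Hypothetical hits, hand-written for this example (not real data).
sample = (
    "seqA\tseqA\t100.0\t120\t0\t0\t1\t120\t1\t120\t1e-80\t250\n"
    "seqA\tseqB\t85.0\t110\t16\t0\t1\t110\t5\t114\t1e-40\t150\n"
    "seqA\tseqB\t60.0\t40\t16\t0\t70\t110\t60\t100\t1e-10\t45\n"
    "seqB\tseqC\t30.0\t50\t35\t0\t1\t50\t1\t50\t0.5\t20\n"
)
pairs = blast_to_pairs(io.StringIO(sample))
for query, ref, evalue in pairs:
    print(query, ref, evalue, sep="\t")
```

A three-column list of this kind (identifier, identifier, score) is the general shape of input that graph-clustering and network-visualization software commonly accepts. The exact format expected by each tool in the workflow should be taken from its own documentation.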
