Fig 1.
Schematic representation of the PhenotypeSeeker workflow.
Panel A shows the 'PhenotypeSeeker modeling' steps, which generate the phenotype prediction model based on the input genomes and their phenotype values. Panel B shows the 'PhenotypeSeeker prediction' steps, which use the previously generated model to predict the phenotypes for input genomes.
Fig 2.
The influence of k-mer length on the CPU time and total RAM usage of PhenotypeSeeker (bars, left axis) and on the number of different k-mers present in the genomes (line, right axis).
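The relationship between k-mer length and the number of different k-mers can be illustrated with a minimal sketch. It assumes plain string sequences and simple sliding-window counting without reverse-complement handling; the function name `distinct_kmers` and the toy sequences are hypothetical and are not taken from PhenotypeSeeker itself.

```python
def distinct_kmers(sequences, k):
    """Return the set of distinct k-mers found in any of the sequences.

    Illustrative only: strands are not merged, so a k-mer and its
    reverse complement are counted separately.
    """
    seen = set()
    for seq in sequences:
        # Slide a window of length k along the sequence.
        for i in range(len(seq) - k + 1):
            seen.add(seq[i:i + k])
    return seen


# Toy sequences (hypothetical, not real genomes).
genomes = ["ACGTACGTGA", "ACGTTCGTGA"]

# Number of different k-mers for several k-mer lengths.
counts = {k: len(distinct_kmers(genomes, k)) for k in (2, 3, 4, 5)}
print(counts)
```

The count is bounded above by both 4^k and the total number of windows in the input, so in real genome collections it tends to grow with k until it is limited by the amount of sequence.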
Fig 3.
The positions of ciprofloxacin-resistant P. aeruginosa strains on the cladogram.
The MIC values (mg/l) are shown at the external nodes next to the corresponding strain names. Strains with MIC > 0.5 mg/l are highlighted in yellow to denote ciprofloxacin resistance according to EUCAST breakpoints [16]. Strains with detected mutations in the QRDR of gyrA and parC are marked by a color code on the perimeter of the cladogram.
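The resistance labelling described in this caption amounts to a strict threshold on the MIC value. The sketch below assumes MIC values are available as floats in mg/l; the strain names and values are hypothetical placeholders, and only the 0.5 mg/l breakpoint comes from the caption.

```python
# Breakpoint stated in the caption: MIC > 0.5 mg/l denotes resistance.
RESISTANCE_BREAKPOINT_MG_L = 0.5


def is_resistant(mic_mg_l, breakpoint=RESISTANCE_BREAKPOINT_MG_L):
    """Return True if the MIC is strictly above the breakpoint."""
    return mic_mg_l > breakpoint


# Hypothetical strain names and MIC values, for illustration only.
mics = {"strain_A": 0.125, "strain_B": 0.5, "strain_C": 4.0}

resistant = [name for name, mic in mics.items() if is_resistant(mic)]
print(resistant)
```

Because the comparison is strict, a strain with an MIC of exactly 0.5 mg/l is not labelled resistant.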
Fig 4.
Virulence genes, grouped by cluster, and wzi that are represented in the PhenotypeSeeker prediction model for K. pneumoniae strains (13-mers, weighted, max. 10 000 k-mers for the regression model).
Each row represents one strain, and each column represents one protein-coding gene. A blue cell indicates that 13-mers from the model are present in the corresponding gene of that strain. Genes in the colibactin, aerobactin and yersiniabactin clusters show the most differentiating pattern between carrier and invasive/infectious strains.
Table 1.
F1-measure and running time of the models.
The results with 13-mers and weighting are shown. The maximum number of 13-mers selected for the regression model was 1000. In cases where sequencing reads were used as the input, a minimum frequency of 5 for a 13-mer was required to reduce the influence of sequencing errors.
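Two ideas from this caption can be sketched briefly: discarding k-mers seen fewer than 5 times in sequencing reads, and computing the F1-measure from binary labels. This is an illustrative sketch only; the function names, the toy reads and the label encoding (1 = positive phenotype, 0 = negative) are assumptions, not PhenotypeSeeker's own code.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count how often each k-mer occurs across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def filter_by_min_frequency(counts, min_freq=5):
    """Keep only k-mers seen at least min_freq times.

    Rare k-mers in raw reads are more likely to stem from
    sequencing errors, so they are dropped.
    """
    return {kmer: n for kmer, n in counts.items() if n >= min_freq}


def f1_measure(true_labels, predicted_labels):
    """F1 = 2*TP / (2*TP + FP + FN) for binary labels (1 = positive)."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# Toy reads (hypothetical); k=3 keeps the example small,
# whereas the caption refers to 13-mers.
reads = ["ACGT"] * 5 + ["ACGA"]
kept = filter_by_min_frequency(count_kmers(reads, 3), min_freq=5)
print(kept)
print(f1_measure([1, 1, 0, 0], [1, 0, 1, 0]))
```

In the toy data, the 3-mer that appears only in the single divergent read falls below the threshold and is removed, which mimics how a one-off sequencing error would be filtered out.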
Table 2.
Comparison of PhenotypeSeeker with Kover and SEER using P. aeruginosa and C. difficile data.
PhenotypeSeeker was used with the weighting option and a maximum of 1000 k-mers for the regression model.