A fast and agnostic method for bacterial genome-wide association studies: Bridging the gap between k-mers and genetic events

Genome-wide association study (GWAS) methods applied to bacterial genomes have shown promising results for genetic marker discovery or detailed assessment of marker effect. Recently, alignment-free methods based on k-mer composition have proven their ability to explore the accessory genome. However, they lead to redundant descriptions and results which are sometimes hard to interpret. Here we introduce DBGWAS, an extended k-mer-based GWAS method producing interpretable genetic variants associated with distinct phenotypes. Relying on compacted De Bruijn graphs (cDBG), our method gathers cDBG nodes, identified by the association model, into subgraphs defined from their neighbourhood in the initial cDBG. DBGWAS is alignment-free and only requires a set of contigs and phenotypes. In particular, it does not require prior annotation or reference genomes. It produces subgraphs representing phenotype-associated genetic variants such as local polymorphisms and mobile genetic elements (MGE). It offers a graphical framework which helps interpret GWAS results. Importantly it is also computationally efficient—experiments took one hour and a half on average. We validated our method using antibiotic resistance phenotypes for three bacterial species. DBGWAS recovered known resistance determinants such as mutations in core genes in Mycobacterium tuberculosis, and genes acquired by horizontal transfer in Staphylococcus aureus and Pseudomonas aeruginosa—along with their MGE context. It also enabled us to formulate new hypotheses involving genetic variants not yet described in the antibiotic resistance literature. An open-source tool implementing DBGWAS is available at https://gitlab.com/leoisl/dbgwas.


43
The aim of Genome-Wide Association Studies (GWAS) is to identify associ-44 ations between genetic variants and a phenotype observed in a population. 45 They have recently emerged as an important tool in the study of bacteria, 46 given the availability of large panels of bacterial genomes combined with phenotypic data (Farhat et al., 2013;Sheppard et al., 2013;Alam et al., Our approach, coined DBGWAS, for De Bruijn Graph GWAS, bridges the gap between, on the one hand, SNP-and gene-based representations 86 lacking the right level of flexibility to cover complete genomic variation, and, 87 on the other hand, kmer-based representations which are flexible but not 88 readily interpretable. We use De Bruijn graphs (de Bruijn, 1946) (DBGs), 89 which are widely used for de novo genome assembly (Pevzner et al., 2001;90 Zhang et al., 2011) and variant calling (Iqbal et al., 2012;Le Bras et al., 91 2016). These graphs connect overlapping kmers (here DNA fragments), In this example two sequences s 1 and s 2 of length 12 differ by a single letter. All kmers (k = 4) present in these sequences are listed. A) A link is drawn between two kmers when the k − 1 = 3 last nucleotides of the first kmer equal the 3 first nucleotides of the second kmer. B) The bubble pattern represents the SNP C to A; each branch of the bubble represents an allele. C) linear paths of the graph are compacted; the compacted DBG of the example only contains four nodes (unitigs) and represents the same variation as the original DBG, which contained 13 nodes (kmers).  represents a distinct genetic event. Panel A shows the subgraph with lowest min q extracted for P. aeruginosa levofloxacin resistance. It was composed of 27 unitigs, 5 of which were significantly associated with resistance. Susceptible unitigs are shown in blue, while resistant unitigs in red. All unitigs of this subgraph mapped to the gyrA gene. Panels B, C, D, E correspond to the top subgraphs obtained for other panels/phenotypes. The larger the node, the higher the allele frequency. Grey nodes were present in > 99% or < 1% of the strains and were not tested. Bright blue (resp. bright red) nodes were present almost exclusively in susceptible (resp. resistant) strains. Pale blue (resp. pale red) nodes were present with a larger frequency in susceptible (resp. resistant) strains. Circled black nodes mapped to annotated genes.

Coloured bubbles highlight local polymorphism in
140 core genes, accessory genes and noncoding regions 141 For P. aeruginosa levofloxacin resistance, the subgraph obtained with the 142 lowest min q highlighted a polymorphic region in a core gene (Figure 2A).

143
Indeed, it showed a linear structure containing a complex bubble, with a 144 fork separating susceptible (blue) and resistant (red) strains. The anno-145 tation revealed that all unitigs in this subgraph mapped to the quinolone 146 resistance-determining region (QRDR) of the gyrA gene. gyrA codes for a 147 subunit of the DNA gyrase targeted by quinolone antibiotics such as lev-148 ofloxacin and its alteration is therefore a prevalent and efficient mechanism of 149 resistance (Hooper and Jacoby, 2015;Lowy, 2003 M. tuberculosis, whose genome does not contain the parC gene (Piton et al.,156 2010) (first, gyrA: min q = 9.66 × 10 −144 ).

157
For P. aeruginosa amikacin resistance, the top subgraph (min q = 5.86 × 158 10 −9 ) highlighted a SNP in an accessory gene ( Figure 2B). As in Figure 2A, 159 it contained a fork separating a blue and a red node. However, other remain-160 ing nodes were not grey: they represented an accessory sequence because 161 they were not present in all the strains. Most of these nodes were pale-red, 162 showing that the accessory sequence was more frequent in resistant sam-163 ples. The annotation revealed that this subgraph corresponded to aac(6'), 164 a gene coding for an aminoglycoside 6-acetyltransferase, an enzyme capable 165 of inactivating aminoglycosides, such as amikacin, by acetylation (Lambert, 166 2002). Most unitigs in this gene had a low association with resistance, ex-167 cept for the ones describing this particular SNP. This mutation, L83S, lying 168 in the enzyme binding site, was previously shown to be responsible for sub-169 strate specificity alteration in a strain of Pseudomonas fluorescens (Lambert 170 et al., 1994). It appeared thus to increase the amikacin acetylation ability of 171 aac(6'), making its association to amikacin resistance more significant than 172 the gene presence itself.

173
Finally, for M. tuberculosis ethionamide resistance, the top subgraph 174 (min q = 7.86 × 10 −11 , Figure 2C) represented a polymorphic region in a 175 core gene promoter. The subgraph was mostly grey and linear with a lo-176 calised blue and red fork. The most reliable annotation for this subgraph was 177 fabG1 (also known as mabA), a core gene previously shown to be involved in 178 ethionamide and isoniazid resistance (Lee et al., 2000;Farhat et al., 2016).

179
None of the significantly associated unitigs mapped to the fabG1 gene, but 180 their close neighbours did (highlighted in Figure 2C by black circles), sug- (min q = 2.69 × 10 −100 ). As shown in Figure 2E, the subgraph described the 207 circular structure of a 2,500 bp-long plasmid known to carry the causal ermC 208 gene (Westh et al., 1995;Gordon et al., 2014) together with a replication 209 and maintenance protein in strong linkage disequilibrium with ermC.

210
For P. aeruginosa amikacin resistance, the third subgraph (min q = 211 2.21 × 10 −6 ) represented a 10,000 bp plasmid acquisition. Using the NCBI 212 nucleotide database (Benson et al., 2012), most of the unitigs in this sub-213 graph mapped to the predicted prophage regions of an integrative and con-214 jugative plasmid, whose structure was recently described as the pHS87b  tuberculosis (TB) and P. aeruginosa (PA) panels. For each antibiotic, subgraphs were reported with their rank, number of significant unitigs over all unitigs in the subgraph (Sign. unit.), q-value of the unitig with the lowest q-value (min q ), the corresponding estimated effect (β coefficient of the linear mixed model) and annotation of the subgraph. The type of event represented by the subgraph was colour-coded as: yellow for MGE, light blue for local polymorphism in gene (LPG), and dark blue for local polymorphism in noncoding region (LPN). Known positives were indicated in dark green (Pos), regions in LD with a positive in light green (LD), determinants caused by cross-resistance in orange (CR) and unknown determinants in grey (Ukn).   All the methods identified several markers described for other antibiotics.

251
This observed cross-resistance to antibiotics is a well described phenomenon 252 in M. tuberculosis species (Traore et al., 2000;Palomino and Martin, 2014  In this figure, we report deduplicated annotations of features identified as significant with the default parameters (p-value for SEER and HAWK or q-value for RWAS and DBGWAS). The total number of reported features is given in the header. For kmer-based methods, annotations were retrieved by mapping unitig/kmer sequences on the resistance and Uniprot databases. Green cells correspond to resistance determinants already described in the literature, orange cells to resistance determinants described for association with other antibiotics (annotations not found by RWAS are written in bold), and grey cells to unknown determinants. Within each category, annotations are ordered by increasing minimum p/q-values, corresponding to the lowest p/q-value found for each annotation before deduplication (p/q-values are reported only for the most significant annotations). For each method, the annotation with the lowest p/q-values is underlined. The execution time and memory load (in Gigabytes) are shown in the header (see also Supplementary Table S2). tance. Within them, in the context of streptomycin resistance, the mmpS1 annotation was reported by the three methods, but not by the RWAS 263 approach, as this gene was not included in the targeted approach prior.  Table S2).

293
As DBGWAS screens the genomic variations without prior knowledge, it 294 documented associations never previously described in resistance literature. into genotype to phenotype correlation, also beyond antibiotic resistance.

374
This will include complex traits such as biofilm formation, epidemicity and 375 virulence.

378
DBGs are directed graphs that efficiently represent all the information con-  Figure 1A).
These graphs can be compacted into cDBGs by merging linear paths (se-385 quences of nodes not linked to more than two other nodes) into a single node 386 referred to as a unitig (Butler et al., 2008;Zerbino and Birney, 2008;Chikhi 387 et al., 2016) ( Figure 1C). Compaction yields a graph with locally optimal Step 1 Step 2 Step 3 • DBG construction; • DBG compaction; • Strain mapping.
• Local neighborhood retrieval around significant unitigs; • Graph decoration with phenotype and statistical data, and annotations databases; • Visualization on web browsers; Tool: GATB (Drezen, 2014) Tool: bugwas (Earle, 2016) Tools: Boost (boost.org) _____-Blast+ (Camacho, 2009) _____-Cytoscape.js (Franz, 2015) Figure 4: DBGWAS pipeline. DBGWAS takes as input draft assemblies and phenotype data for a panel of bacterial strains. Variant matrix X is built in step 1 using cDBG nodes. Variants are tested in step 2 using a linear mixed model. Significant variants are post-processed in step 3 to provide an interactive interface assisting with their interpretation.

Representing strains by their unitig content (step 1)
cDBG construction. We build a single DBG from all genomes given as 399 input using the GATB C++ library (Drezen et al., 2014). We start from 400 contigs rather than reads to be robust to sequencing errors. Consequently, 401 we do not need to filter out low abundance kmers, allowing for the explo-402 ration of any variation present in the set of input genomes. 403 We use a k = 31 length for our kmers, as it produced the best per- kmers is fixed and not updated, and because we always query kmers that 424 are guaranteed to be in the DBG (since we do not filter out any kmer).

425
The unitig description on all the input genomes is stored into a matrix U : 426 U i,j = 1, if the j-th unitig is present in the i-th input genome; 0, otherwise.
We then transform the matrix U into Z, giving minor allele descrip-427 tion (Earle et al., 2016). Z is identical to U except for columns with a mean 428 larger than 0.5, which are complemented: Z j = 1 − U j for these columns.

429
We then restrict Z to its set of unique columns. If several unitigs have 430 the same minor allele presence pattern, then they will be represented by 431 a single column. Keeping duplicates would lead to performing the same 432 statistical test several times. Finally, we filter out columns whose average is 433 below 0.01. We denote the de-duplicated, filtered matrix of patterns by X. that correcting for population structure using LMMs is often preferable to 444 using a fixed effect correction or not correcting at all. 445 We thus rely on the bugwas method (Earle et al., 2016), which uses the 446 linear mixed model (LMM) implemented in the GEMMA library (Zhou and 447 Stephens, 2012) to test for association with phenotypes while correcting for 448 the population structure. This method also offers the possibility to test for represented by a set of factors W ∈ R n p , by: β is the fixed effect of the tested candidate on the phenotype, α is the random 456 effect of the population structure, and ε ij iid ∼ N (0, σ 2 ) are the residuals with 457 variance σ 2 > 0. W is estimated from the Z matrix which includes duplicate 458 columns representing both core and accessory genome. 459 We test H 0 : β = 0 versus H 1 : β = 0 in equation 1 for each unitig using 460 a likelihood ratio procedure producing p-values and maximum likelihood 461 estimatesβ. Finally, we compute the q-values, which are the  Hochberg transformed p-values controlling for false discovery rate (FDR) in 463 the situation of multiple testing (Benjamini and Hochberg, 1995).

464
Interpretation of significant unitigs (step 3) 465 The LMM can be used to identify deduplicated minor allele presence pat- unitigs is therefore essential to assess its significance.  Figure 5: Subgraphs induced by the neighbourhood of significantly associated unitigs. In this example, a neighbourhood of size 2 was used: any unitig distant up to 2 edges from a significant unitig is retrieved to define its neighbourhood. Neighbourhoods are merged if they share at least one node, e.g. the neighbourhoods of U 1 and U 2 are merged because they share N 6 , and will be represented in a single subgraph.
databases. Users can then easily retrieve the annotations which are the most 516 supported by the nodes in the subgraph, or with the lowest E-value. We ing Cytoscape.js (Franz et al., 2015).  fections. It is subject to HGT and many plasmids, mobile elements, and 561 phage sequences have been described in its genome. However, this does not 562 affect the species' genome size which is always close to 3 Mbp (Mlynarczyk 563 et al., 1998). Most antibiotic resistance mechanisms are well determined by 564 known variants as shown in a previous study (Gordon et al., 2014). This 565 study obtained an overall sensitivity of 97% for predicting 12 phenotypes 566 from rules based on antibiotic marker mapping. We use this study panel of

Resistome-based GWAS (RWAS)
RWAS are performed to validate that DBGWAS retrieves all known deter-598 minants found by a targeted approach. In this validation study we used 599 bugwas with the same phenotypes and population structure matrix W so 600 the RWAS analyses and DBGWAS only differ by their input variant matrix not reads. Moreover, for assembling the significant kmers, we used ABYSS 635 v2.0.2 (Jackman et al., 2017). Finally, the last step was performed similarly 636 as the one described for SEER.

Data and source code access
All data used in this work were previously published. Data generated by our method and discussed in the manuscript are available at http://leoisl.gitlab.io/DBGWAS_support/experiments.
The source code and precompiled version of our method is available on gitlab: https://gitlab.com/leoisl/dbgwas/.