Efficient association mapping from k-mers—An application in finding sex-specific sequences

Genome wide association studies (GWAS) attempt to map genotypes to phenotypes in organisms. This is typically performed by genotyping individuals using microarray or by aligning whole genome sequencing reads to a reference genome. Both approaches require knowledge of a reference genome which hinders their application to organisms with no or incomplete reference genomes. This caveat can be removed by using alignment-free association mapping methods based on k-mers from sequencing reads. Here we present an improved implementation of an alignment free association mapping method. The new implementation is faster and includes additional features to make it more flexible than the original implementation. We have tested our implementation on an E. Coli ampicillin resistance dataset and observe improvement in execution time over the original implementation while maintaining accuracy in results. We also demonstrate that the method can be applied to find sex specific sequences.


Introduction
Association mapping is the process of associating phenotypes with genotypes. In genome wide association studies (GWAS), individuals are typically genotyped using microarrays or by aligning sequencing reads from individuals to a reference genome. However, both these approaches require a reference genome of the organism which makes them inappropriate for association mapping in non-model organisms with incomplete reference genomes or none at all.
To address this issue, alignment free approaches for association mapping have been explored. A number of methods have been developed to perform association studies in bacterial genomes that do not require aligning reads to reference genomes [1][2][3][4]. The high plasticity in bacterial genomes means structural variants and even large genomic segments in various strains are missing in the reference genomes which makes application of reference based methods difficult. However, these methods do not scale to organisms with large genomes, and as many of them have incomplete reference genomes, there were challenges in association mapping in these organisms. To overcome this, Rahman et al. [5] and Voichek et al. [ methods for mapping associations in large genomes, to categorical phenotypes and to both categorical and quantitative phenotypes, respectively. The methods are primarily based on finding k-mers i.e. contiguous sequences of length k in sequenced reads and identifying k-mers associated with the phenotype.
In the association mapping tool named HAWK developed by Rahman et al. [5], frequencies of k-mers are analyzed to find k-mers associated with a phenotype and then they are assembled to form the associated sequences. First, they count k-mers in reads from each individual using Jellyfish [7]. Second, using likelihood ratio test, they find k-mers with significantly different counts in case and control samples. Next, population structure is determined from k-mer counts using Eigenstrat [8,9]. After that, associations to k-mers after correcting for population structure are determined. Finally, the k-mers found associated may be assembled to get a sequence for each associated loci.
Here we re-implement HAWK with the goal to reduce its execution time and make it more convenient for users. We have re-implemented the step for finding associated k-mers after population structure correction using C++, which was previously implemented in R. We have also extended support for Jellyfish 2 and implemented Benjamini-Hochberg procedure [10], which can be used to correct for multiple tests when the study is underpowered for Bonferroni correction. We have tested our implementation with a dataset on E.coli ampicillin resistance and have compared its output with the output of the original implementation. We have also analyzed the execution times of the two implementations. Our implementation is faster and more flexible to run compared to the original implementation while producing results similar to the original one.
Finally, we show that our method can be used to find sequences in the sex chromosomes. We apply our method to sequencing data from two populations in the 1000 genomes dataset [11], labeling males and females as cases and controls. We find that the k-mers determined by HAWK cover the entire sequenced regions in X and Y chromosomes. It is worth noting that other reference free methods for association mapping mentioned above are based on presence and absence of k-mers, and hence are not suitable for finding sequences in sex chromosomes present in both sexes e.g. the human chromosome X.

Implementation
Here we summarize the improvements and new features we have added. Fig 1 shows the workflow for HAWK highlighting the additions and modifications. Results supporting the improvements is presented in the following section.

Re-implementation of correction for population structure
Population stratification is a known confounder in association studies. Without correcting for this confounding factor, one may falsely associate non-significant genotypes with phenotypes. In the HAWK pipeline, population structure was estimated using Eigenstrat [8,9]. Eigenstrat performs a principal component analysis (PCA) on the presence or absence status in each sample of a randomly selected set of k-mers. Then the population structure is represented by the projections of the data points along the principal components. By default, first two principal components are used to denote population stratification but users have the option to choose up to first ten of them. Subsequently, p-values were adjusted for population structure and other confounders i.e. the associations between the k-mers and the phenotype were adjusted for confounders and p-values were re-estimated using the glm function (for fitting logistic regression models) and the ANOVA function (for testing the goodness of fit) in R.
Here, we re-implement this process in C++ and thereby, improve the performance of the pipeline. R uses the IRLS (Iteratively Re-weighted Least Square) method to fit the model [12]. Therefore in our implementation of the glm function in C++, we also used IRLS for fitting the model. Iteratively re-weighted least squares for finding the MLE (Maximum Likelihood Estimate) for logistic regression is a special case of Newton's algorithm. If the problem is written in vector matrix form, with parameters w T = [β 0 , β 1 , β 2 , . . .], explanatory variables x(i) = [1, . .] T and expected value of Bernoulli distribution mðiÞ ¼ 1 1 þ e À w T xðiÞ , the parameters w can be found using the following iterative algorithm: . .] is the vector of expected values, It can be observed that the B matrix is of dimensionality N × N, where N is the number of instances. For large volume of data, this matrix can greatly affect the performance of the implementation. However, we need to only keep the values along the diagonal as this is a diagonal matrix; thereby precluding the potential performance drawbacks. The pseudo-code of both glm and our implementation are given in Algorithm 1 and 2 respectively. For each k-mer, we perform a hypothesis test to compute a p-value. The null hypothesis, i.e. the k-mer is not associated with the phenotype, is represented by fitting a logistic regression model against population structure and other confounders whereas for the alternate hypothesis we fit a logistic regression model against the confounders as well as the k-mer counts. Then a p-value is computed using a likelihood ratio test to check whether the null can be rejected.
In our implementation, there are two hyper parameters that need to be tuned before running the process. One is the learning rate of the logistic regression model and the other is the number of maximum iterations allowed for convergence. We used maximum iteration as 25

PLOS ONE
Efficient association mapping from k-mers-An application in finding sex-specific sequences because we found that the glm implementation of R has 25 maximum iteration by default [13]. We used the learning rate value of 0.1.

Algorithm 1: glm
Inputs: X, y, α, MaxIter The implementation also makes it easier for users to specify the number of principal components to be used for population structure correction as well as additional covariates using command line parameters and input files.

Bug fixes
An error was found in the old implementation regarding the order of samples during adjustment of p-values using the confounding factors. This has been corrected in the new C++ implementation.

Implementation of Benjamini-Hochberg procedure
A number of approaches exist for adjusting p-value thresholds when multiple tests are being performed. Two such methods are: Bonferroni correction and Benjamini-Hochberg correction. The previous implementation performed hypothesis testing on each k-mer and performed Bonferroni correction using the total number of k-mers to determine k-mers associated with the phenotype in question. However, Bonferroni correction is known to be conservative i.e. it may fail to reject the null hypothesis even when it should be rejected [10].
Here, we implemented Benjamini-Hochberg correction which controls the false discovery rate (FDR). This can be used in studies underpowered for Bonferroni correction. The new implementations gives the provision for performing either correction according to the user preference.

Support for Jellyfish 2
The original implementation of HAWK used a modified version of Jellyfish [7]. Subsequently, Jellyfish2 has been released which provides better performance. The present implementation of HAWK allows k-mer counting using a modified version of Jellyfish2 available through our Github repository.

Re-implementation of post-processing
Once sequences corresponding to each loci associated with the phenotype are obtained, information such as average p-values of constituent k-mers as well as the average number of times they are present in case and control samples could be looked up using scripts provided with the original implementation. However, these scripts used a combination of C++ codes and shell commands, and was found to be slow in some cases [4]. Here, we re-implemented the script using hash tables in C++ to speed up the look-up.

Results
To assess the performance and accuracy of our implementation, we use the E.coli dataset on ampicillin resistance which was analyzed using the original implementation of HAWK [5].

Experimental setup
All the experiments are performed on a machine with CPU Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz, 386GB memory with OS Ubuntu 18.04.3. There are two CPUs in the system with total 48 logical cores. The C++ implementation are compiled with g++-7.4.0 with no optimization flag. Execution time is measured using the "date" command.

Hyper parameters
The code has 3 hyper parameters which are described below along with the values used in Table 1.

Comparison of p-values
We have compared the output of our implementation and the output of the previous implementation by plotting logarithm of p-values obtained using the C++ implementation against that of p-values obtained using the R implementation in Fig 2. An identical result should give a line with slope 1. Graph produced by our implementation are almost linear with small deviations at a few points.

PLOS ONE
Efficient association mapping from k-mers-An application in finding sex-specific sequences  Fig 3(a) and 3(b). We find associations near the β-lactamase TEM-1 (blaTEM-1) gene, the presence of which is known to confer ampicillin resistance, as in [5]. However, some of the associations outside of this gene detected previously in [5], that are likely to be spurious, is no longer observed after the correction of error.
The above analysis was also performed with the k-mers found significant using the new C++ implementation and the Manhattan plots are shown in Fig 3(c) and 3(d). We observe that associations are detected in same regions as those found using the R implementation.

Controlling FDR using the Benjamini-Hochberg procedure
HAWK uses Bonferroni correction to address multiple testing by default. However, we provide the option to control false discovery rate (FDR) using the Benjamini-Hochberg procedure. Fig  3(e) and 3(f) show Manhattan plots when k-mers are considered associated with ampicillin resistance after controlling FDR at level α = 0.05 using the Benjamini-Hochberg procedure. We observe that many k-mers outside of the β-lactamase TEM-1 gene are considered significant. We therefore recommend using Bonferroni correction and using the Benjamini-Hochberg procedure only if the study is under-powered for Bonferroni correction.

Comparison of running time
The execution times of the old and new implementations are compared in Fig 4. Reported times are obtained by running both implementations using 32 threads. We find that the C++ implementation of the confounder correction phase is approximately three times faster than the previous implementation in R. However, the overall execution time is dominated by the kmer counting step.

Fig 2. Comparison of log of p-values with values computed using the previous implementation in R (-log(pvalue_case_R) and -log (pvalue_control_R)) along the x-axis and the ones computed using the new implementation using C++ (-log(pvalue_case_cpp) and -log (pvalue_control_cpp)) along the y-axis, for k-mers positively correlated with (a) cases, and (b) controls.
https://doi.org/10.1371/journal.pone.0245058.g002  Table 2 shows comparison of execution times of Jellyfish and Jellyfish 2. We observe that, although Jellyfish 2 is faster overall, the performance improvement it provides is not substantial.

Finding sex-specific sequences
The HAWK pipeline can be used to find sequences in sex chromosomes in organisms with unassembled or poorly assembled genomes. To assess the performance, we ran HAWK on sequencing data from the Yoruba in Ibadan, Nigeria (YRI) and the Toscani in Italia (TSI) populations from the 1000 genomes project dataset [11]. Of the 110 YRI and 109 TSI individuals, 107 were male and 112 were female. The sexes of individuals were used as cases and controls and the pipeline was executed.
The initial step revealed 106,272,845 and 17,056,781 k-mers that are present significantly more times in female and male samples respectively compared to the other. Principal

PLOS ONE
Efficient association mapping from k-mers-An application in finding sex-specific sequences component analysis was then performed on the binary matrix denoting presence or absence of 32,699,548 randomly chosen k-mers, where each k-mer present in between 1% and 99% of the samples was selected with probability 0.01. The PCA plots for the first six principal components are shown in Fig 5. The first six PCs explain 14.90% of the variance among which the first and second PCs explain 7.68% and 2.51% of the total variance respectively. We observe that the first two PCs capture the

PLOS ONE
Efficient association mapping from k-mers-An application in finding sex-specific sequences population structure whereas no relationship is seen between populations and the next four PCs. Hence PC1 and PC2 are used as confounders, as done by default. It may be noted from Fig 5(d) that the fifth PC nearly separate the two sexes. As such treating this as a confounder would lead to removal of many sex-specific k-mers.
After correcting for confounders, we obtain 14,473,058 and 54,256,206 k-mers positively correlated with male and female samples respectively. The k-mers were mapped to the human reference genome (hg38/GRCh38 [GCA_000001405.15]) using Bowtie 2 [14] to analyze their locations. The results are summarized in Table 3. We find that 99.97% and 96.37% of k-mers positively correlated with female and males samples map to Chromosome X and Chromosome Y respectively. The remaining k-mers map to other locations in the human reference genome or stay unmapped.
The positions of the k-mers, positively correlated with female and male samples, in Chromosomes X and Y respectively are shown in Fig 6(a) and 6(b). We observe that k-mers throughout the entire sequenced regions of the two chromosomes are detected using HAWK. It is worth noting that the region in Chromosome Y, where no k-mer could be mapped, is missing from the reference genome i.e. represented by a sequence of Ns (the wild card character used to denote any nucleic acid in fasta format) in the reference [15]. Our analysis reveals that the k-mers that can be mapped to Chromosome Y have an average total count of 754.10 in the male samples whereas that of the unmapped k-mers is 2730.06. The histograms of counts of mapped and unmapped k-mers in Fig 6(c) also show that the count distribution of unmapped k-mers has a heavier tail compared to that of the mapped ones. This suggests that many of the k-mers positively correlated to male samples that could not be mapped to Chromosome Y are from the missing region in the chromosome since the missing region is known to be rich in repeats.

Discussion
We have re-implemented portions of HAWK, which is a tool for association mapping using kmers. The re-implementation in C++ makes it faster and more convenient to use while retaining accuracy. We have also added support for the new version of k-mer counting tool Jellyfish and correction for multiple testing using the Benjamini-Hochberg procedure. The k-mer counting step remains the bottleneck in the pipeline which may be addressed by when faster k-mer counting tools emerge. Finally, we show how the method can be applied to determine sex specific sequences in organisms accurately.