Genome-Wide Association Study with Targeted and Non-targeted NMR Metabolomics Identifies 15 Novel Loci of Urinary Human Metabolic Individuality

Genome-wide association studies with metabolic traits (mGWAS) uncovered many genetic variants that influence human metabolism. These genetically influenced metabotypes (GIMs) contribute to our metabolic individuality, our capacity to respond to environmental challenges, and our susceptibility to specific diseases. While metabolic homeostasis in blood is a well investigated topic in large mGWAS with over 150 known loci, metabolic detoxification through urinary excretion has only been addressed by few small mGWAS with only 11 associated loci so far. Here we report the largest mGWAS to date, combining targeted and non-targeted 1H NMR analysis of urine samples from 3,861 participants of the SHIP-0 cohort and 1,691 subjects of the KORA F4 cohort. We identified and replicated 22 loci with significant associations with urinary traits, 15 of which are new (HIBCH, CPS1, AGXT, XYLB, TKT, ETNPPL, SLC6A19, DMGDH, SLC36A2, GLDC, SLC6A13, ACSM3, SLC5A11, PNMT, SLC13A3). Two-thirds of the urinary loci also have a metabolite association in blood. For all but one of the 6 loci where significant associations target the same metabolite in blood and urine, the genetic effects have the same direction in both fluids. In contrast, for the SLC5A11 locus, we found increased levels of myo-inositol in urine whereas mGWAS in blood reported decreased levels for the same genetic variant. This might indicate less effective re-absorption of myo-inositol in the kidneys of carriers. In summary, our study more than doubles the number of known loci that influence urinary phenotypes. It thus allows novel insights into the relationship between blood homeostasis and its regulation through excretion. The newly discovered loci also include variants previously linked to chronic kidney disease (CPS1, SLC6A13), pulmonary hypertension (CPS1), and ischemic stroke (XYLB). By establishing connections from gene to disease via metabolic traits our results provide novel hypotheses about molecular mechanisms involved in the etiology of diseases.


Introduction
Genome-wide association studies with metabolic traits (mGWAS) investigate the relationship between genetic variance and metabolic phenotypes (metabotypes). In 2008, Gieger et al. presented the first mGWAS in serum of 284 individuals [1]. Since then, numerous mGWAS using different analytical platforms and ever larger study populations were published [2][3][4][5][6][7][8]. These studies discovered more than 150 genetic loci that associate with blood levels of more than 300 distinct metabolites. We refer to these loci as the genetically influenced metabotypes (GIMs), their ensemble defining the genetic part of human metabolic individuality. Many of the single nucleotide polymorphisms (SNPs) that associate with metabolic traits map to genetic regions coding for enzymes or metabolite transporters that are biochemically linked to the associated metabolites. Moreover, a large number of these GIMs have been previously linked to clinically relevant phenotypic traits. As intermediate traits on the pathways of many disorders, these GIMs have become valuable tools that allow unraveling disease mechanisms on the molecular level [9].
However, so far mGWAS have mostly been limited to studies of serum or plasma metabolite levels, thereby focusing on genetically influenced metabolic homeostasis in blood. Only a few studies investigated urine as a complementary body fluid enabling studies of kidney function and the detoxification capabilities of the human body. In 2011, we published the first mGWAS in urine [10] using proton nuclear magnetic resonance spectroscopy ( 1 H NMR) to determine metabolite concentrations in urine of 862 male participants of the SHIP-0 cohort. We identified five genetic loci (SLC6A20, AGXT2, NAT2, HPD, and SLC7A9) that modulate urinary metabolite levels. While for this study metabolite concentrations were manually derived from the NMR spectra for a targeted set of metabolites, Nicholson et al. [5] directly used spectral features as abstract, non-targeted urinary metabolic traits in an mGWAS. Based on data for 211 participants of the MolTWIN and MolOBB studies, the authors identified SNPs at three loci (ALMS1/NAT8, AGXT2, and PYROXD2) that were associated with metabolic traits in urine. Two of these loci (ALMS1/NAT8 and PYROXD2) were replicated in an NMR-based mGWAS published by Montoliu et al. For that study, the authors analyzed non-targeted urinary traits from 265 subjects from the São Paolo metropolitan area [11]. Recently, Rueedi et al. [12] reported significant associations of NMR-derived non-targeted urinary traits in ten loci (ALMS1/NAT8, ACADL, AGXT2, NAT2, ABO, PYROXD2, ACADS, PSMD9, SLC7A9, and FUT2) using data from 835 participants of the CoLaus study, thus bringing the total number of reported urinary GIMs to eleven.
Here, we substantially extend our previous mGWAS with metabolic traits in urine, both in size and in scope. First, we metabolically characterize the urine samples of 3,861 male and female participants of the SHIP-0 study, thereby quadrupling the sample size when compared to previous studies. Second, we combine both targeted and non-targeted NMR-based metabolomics. In this way, we implement the approaches used in the studies by Nicholson et al., Montoliu et al., and Rueedi et al. alongside the targeted metabolomics approach used in our previous study. For an unbiased interpretation of our mGWAS results, we apply tools for evidence-based locus-to-gene mapping and automated assignment of metabolites to non-targeted NMR spectral features. Finally, besides determining the overlap of variants identified in our study with variants previously linked to clinical traits, we specifically investigate the overlap between variants influencing metabolic traits in both urine and blood.

Results
Our study is based on one-dimensional 1 H NMR spectra of urine samples from 3,861 genotyped participants in the SHIP-0 cohort (see Methods). For the targeted metabolomics analysis, we manually quantified a set of 60 metabolites in these spectra ( Fig 1A). For the non-targeted analysis, we used the same spectra and applied an automated processing algorithm to align the spectra and to perform dimensionality reduction [13]. In the subsequent analysis, we screened the targeted and the non-targeted metabolic traits as well as the pairwise ratios within each trait type for associations with genotyped and imputed variants in a two-step approach ( Fig  1B). We identified a total of 23 genetic loci that display significant associations with targeted and/or non-targeted metabolic traits (Fig 2, Tables 1 and 2). All but one of the discovered loci replicated in data from the KORA F4 cohort (N = 1,691). For 15 loci, our study is, to the best of our knowledge, the first to report associations with urinary traits. For 7 of these 15 loci, associations have previously been reported with blood metabolites. Thus, 8 loci are entirely new ( Fig  3). Finally, 11 of the 22 replicated loci host significantly associated variants that were previously associated with phenotypes of clinical relevance (Table 3).

mGWAS with targeted and non-targeted NMR features
For the targeted metabolomics analysis, the 1 H NMR spectra were manually annotated to derive absolute metabolite concentrations (out of a panel of 60 compounds) for each sample. For the non-targeted analysis, the same spectra were automatically aligned and processed using (a) Genotyping and metabotyping of 3,861 SHIP-0 study participants. One-dimensional 1 H NMR spectra of the urine samples were recorded to derive targeted and non-targeted metabolic traits. (b) Two-staged mGWAS. First stage: genome-wide association tests using genotyped SNPs and 15,379 targeted and non-targeted traits. Second stage: fine mapping of regions with potentially significant associations using imputed SNPs. (c) Replication and interpretation. Genome-wide significantly associated SNPs were assigned to one of 23 distinct genetic loci. The loci and the significantly associated non-targeted traits were annotated using algorithmic approaches. 22 of the 23 loci could be replicated using genotype and metabotype data from 1,691 KORA F4 participants.
doi:10.1371/journal.pgen.1005487.g001 the FOCUS package [13]. This resulted in NMR signal intensities at 166 distinct spectral positions ("chemical shifts") per sample (see Methods). In previous mGWAS, we demonstrated the potential of testing pairwise ratios of metabolite concentrations to boost genetic association signals [1,4,6,8,10,21,53]. Recently, we showed that this approach can also be successfully applied to NMR-based mGWAS with non-targeted features [22]. Thus, we calculated the pairwise ratios of all metabolite concentrations with at least 300 valid data points over all samples (55×54/2 = 1,485 ratios of targeted traits) and NMR signal intensities (166×165/2 = 13,695 ratios of non-targeted traits), respectively ( Fig 1A). Out of all 15,401 metabolic features (targeted and non-targeted traits and the ratios thereof), a total of 15,379 features with at least 300 valid data points were screened for genetic associations using 620,456 genotyped autosomal SNPs (Fig 1B). To this end, we computed age-and sex-adjusted linear models under the assumption of additive genetic effects for each SNP-metabolic trait pair. A total of 499 genotyped variants display associations with metabolic traits with P-values below 5×10 −8 .

Fine-mapping of chromosomal regions with associated variants
We used the 499 variants identified in the mGWAS to tag 54 distinct chromosomal regions at a window size of at least 2 Mb (centered to the tag SNPs). We then performed additional association studies using imputed variants (1000 genomes project imputation) in the tagged regions ( Fig 1B). We considered associations with a P-value below the Bonferroni-adjusted significance Manhattan plot of genetic associations to targeted and non-targeted traits. SNPs are plotted according to chromosomal location and the-log 10 transformed P-value of the strongest association with targeted traits (top) and non-targeted traits (bottom). In case of associations with ratios, only associations with P-gain exceeding 15,180 (targeted metabolic traits) or 138,610 (non-targeted traits) were considered. Associations of genome-wide significance (P < 3.25×10 −12 ) are plotted in red. Triangles indicate associations with P < 1.0×10 −100 . Significant associations within a physical distance of 1 Mb were assigned to a locus labeled after the most likely causative gene (as determined using an evidence-based approach for the identification of candidate genes; see Methods). threshold of α' = 5×10 −8 /15,379 = 3.25×10 −12 to be genome-wide significant. For ratio traits, we also required the P-gain to be greater than 1.52×10 4 for targeted traits and 1.38×10 5 for non-targeted traits (10 times the number of tested traits [53]). P-gain reflects the increase of association strength with the ratio trait when compared to the P-values that result from associations with the individual traits buildinging the ratio. A total of 2,882 genotyped or imputed SNPs display association signals below P < 3.25×10 −12 and, in case of ratios, above the imposed P-gain threshold (Fig 2). All significantly associated SNPs within a physical distance of 1 Mb were assigned to one of 23 distinct genetic loci. Three loci display significant association signals only when imputed SNPs were used, and 8 loci show significant associations only when pairwise ratios of metabolic traits were considered. Twelve loci show significant associations in both targeted and non-targeted data sets. Three loci are only significantly associated with targeted traits (i.e., quantified metabolite concentrations or ratios thereof), whereas 8 loci are only significantly associated with non-targeted traits (spectral features or ratios thereof) (Fig 3). For each locus, we list the SNP that displays the strongest association signal (lead SNP) and its associated metabolic trait in Tables 1 and 2. In addition, we provide boxplots, regional which both genotype and phenotype data were available for the tested SNP/metabolic trait pair. beta': beta' is defined as 10 beta -1 where beta depicts the relative effect size representing the slope of the regression line in the linear model when using log 10 -scaled metabolic traits and the occurrence of the SNP's minor allele (coded as 0,1, and 2). Thus, beta' describes the relative difference per minor allele copy for non-scaled metabolic traits in comparison to the estimated mean of the metabolic trait in the major homozygote test subjects. P-gain: Defined as min(P(M 1 )/P(M 1 /M 2 ), P(M 2 )/P(M 1 /M 2 )), where M 1 and M 2 represent the two traits of which the ratio M 1 /M 2 is built. Replicated: SNP/metabolic trait pair was replicated in KORA F4 (P < 1.32×10 −3 ) (S1 Table). Non-targeted: SNP or proxy in linkage disequilibrium (LD) is also associated with a non-targeted metabolic trait (P < 3.25×10 −12 ) (

Systematic assignment of loci to genes and annotation of non-targeted metabolic traits
In general, the biological interpretation of association results from mGWAS requires the mapping of SNPs to candidate genes that are most likely causally linked to the observed changes in Table 2. Twenty genetic loci as discovered in the SHIP-0 data set and their most significant associations to non-targeted metabolic traits. Chr/Position: Chromosomal location of the lead SNP according to the human reference genome (GRCh37). EA/EAF: Effect allele and frequency. Trait or pairwise ratio: Tested metabolic trait (chemical shift). In case of ratios, the trait that drives the association (i.e., shows the stronger association signal) is named first. N: Number of samples where both genotype and phenotype data were available for the tested SNP/metabolic trait pair. beta': beta' is defined as 10 beta -1 where beta depicts the relative effect size representing the slope of the regression line in the linear model when using log 10 -scaled metabolic traits and the occurrence of the SNP's minor allele (coded as 0,1, and 2). Thus, beta' describes the relative difference per minor allele copy for non-scaled  Table). Targeted: SNP or proxy in LD is also associated with a targeted metabolic trait (P < 3.25×10 −12 ) ( the metabotype. Furthermore, non-targeted metabolic traits that exhibit significant association signals have to be assigned to distinct metabolites. In our study, we implemented algorithmic approaches for both the locus-to-gene mapping and the assignment of non-targeted metabolic features. As the first step in the candidate gene selection, we assigned the significantly associated SNPs to distinct loci using a physical distance threshold of 1Mb. Assigning variants within a locus to one of the covered genes based only on proximity or plausibility ignores haploblock structure and existing regulatory information for the SNPs such as expression quantitative trait loci (eQTL). To take such information into account and to achieve an unbiased selection of candidate genes, we collected evidence for each significantly associated SNP and its proxies in strong linkage disequilibrium (LD) using the SNiPA web server [51]. For each locus, we received a list of candidate genes that are linked to one or more associated variants (or a proxy in LD). Thereby, genes are linked via genomic proximity (i.e., if any of the variants is located within the candidate gene or is in close proximity), via eQTL associations (i.e., if any of the variants is associated with expression levels of the gene in a previous eQTL study), or via regulatory element association (i.e., if any of the variants is contained in a promoter/enhancer/ repressor element that is associated with the gene). Moreover, missense variants or known pathogenic variants in the locus are considered to provide additional types of evidence for the linked genes. We finally assigned the locus to the gene with the strongest functional evidence (i.e., the gene showing the highest number of different types of evidences (max. 5) among the Loci with associated urinary metabolic traits and their overlap with previous mGWAS in blood and urine. We identified and replicated genome-wide significant associations between metabolic traits and genetic variants in 22 genetic loci (named after the most likely causative gene). Three loci could only be identified using targeted metabolic traits, while 7 loci were exclusively discovered with non-targeted traits. 12 loci were identified using both targeted and non-targeted approaches. Loci with hitherto unknown associations with urinary metabolic traits are highlighted (totaling 15). We identified and replicated significant associations in 7 of the 11 loci that were reported in previous mGWAS in urine [5,[10][11][12]. We also discovered significant associations of the ABO locus (marked with an asterisk) with non-targeted traits, but this locus could not be replicated in KORA F4. When compared to previous mGWAS in blood, we find 14 loci that display associations with metabolic traits in both urine and blood [1,4,[6][7][8][14][15][16][17][18][19][20][21][22][23][24].
doi:10.1371/journal.pgen.1005487.g003 Table 3. Twenty-two identified and replicated loci and their overlap with associations to metabolic traits and clinical phenotypes.

Locus
AGXT is an alanine-glyoxylate aminotransferase, which is highly expressed in liver. The encoded protein is also highly localized in liver.
AGXT is linked to type 1 Hyperoxaluria, which can cause renal failure (MIM 259900). TKT encodes a transketolase that connects the pentose phosphate pathway to glycolysis.

ACSM3
shows substrate specificity for the C5 fatty acid isovalerate [46]. It is estimated that ACSM3 acts on "acids from C4 to C11 and on the corresponding 3-hydroxy (.
ACSM3 is suspected to be a risk locus for overweight and hypertension (MIM 145505).
SLC5A11 is short for solute carrier family 5 sodium/myo-inositol transporter, member 11 (see main text).  [49]. It is highly expressed in kidney. For each locus, we selected all variants that displayed genome-wide significant association signals to metabolic traits in the SHIP-0 cohort. We added their proxy variants in LD (r 2 ! 0.8; based on 1000 genomes project data [50,51]). These variant sets were used for the selection of candidate genes and the comparison with association results from other studies. Candidate genes: selection of genes based on variant evidence (genes hit or close-by, eQTL, potentially regulatory effects, or missense variants) (S3 Table). candidate genes; see Methods). In case of ambiguous assignments, the gene with the most plausible biological function was chosen. As an example, one locus contains a high number of SNPs with strong associations with non-targeted traits corresponding to N-acetylated compounds. These SNPs cover 12 different genes (see regional association plots in S2 Fig). The gene covered by the highest number of SNPs is ALMS1. However, there are 3 more genes in this locus with the same amount of functional evidence count as ALMS1 (S3 Table). One of these genes is NAT8, which encodes an N-acetyltransferase. Since there is a biologically meaningful link between the function of the NAT8 gene product and the associated metabolic traits, we annotated this locus with NAT8 as the most likely candidate gene. According to our evidence-based candidate genes assignment approach, the 23 loci map to the genes NAT8, HIBCH, CPS1, AGXT, XYLB, SLC6A20, TKT, ETNPPL, SLC6A19, AGXT2, DMGDH, SLC36A2, NAT2, ABO, GLDC, PYROXD2, SLC6A13, HPD, ACSM3, SLC5A11, PNMT, SLC7A9, and SLC13A3. S3 Table provides a complete list of candidate genes and the corresponding collected evidences.
For the identification of metabolites underlying non-targeted NMR traits, we used pseudospectra that display the strength of associations of a given SNP across the complete NMR spectrum [12,22]. If the association is strong enough, these "association spectra" often exhibit a striking similarity to the reference NMR spectrum of the underlying metabolite(s). For the present study, we applied the "metabomatching" method introduced by Rueedi et al. [12] to perform an automated annotation of the association spectra for each genetic locus of interest. For 19 of the 20 loci that display significant associations with non-targeted traits, metabomatching suggests plausible metabolite candidates matching signals present in the association spectra (S1 Fig). For 10 of these 19 loci (CPS1, SLC6A20, ETNPPL, SLC6A19, AGXT2, SLC36A2, HPD, ACSM3, PNMT, and SLC13A3), the match between the association signal and the NMR spectrum of the candidate metabolite (as provided by the Urine Metabolome Database [54]) is strong and unique, which makes the assignment of a metabolite identity to a nontargeted trait unambiguous in these cases.

Replication
To replicate our findings, we used genotype data and urine samples from participants of the KORA F4 cohort (N = 1,691). From recorded 1 H NMR spectra of the urine samples, we derived the targeted and non-targeted metabolic traits (metabolite concentrations, NMR spectral features, and the respective pairwise ratios) as for the discovery study. For 14 of the 15 new loci that show significant associations with targeted metabolic traits in the SHIP-0 data set, the topranking SNP/metabolic trait association replicates in KORA F4 (S1 Table). For the SLC7A9 locus, the association with lysine/valine does not replicate, possibly due to the difficulty in annotating lysine from the NMR spectra (> 75% missing values for lysine). However, the second-best, still genome-wide significant association of the tested SNP with valine replicates. For 15 of 20 loci that display significant association signals in the GWAS with non-targeted traits, we were able to replicate the best SNP/NMR trait association or, if this failed, the next, still significant follow-up association (S2 Table). The failure to replicate the remaining 5 loci might be due to the lower sample size in KORA, due to different fasting states of the subjects in the different cohorts, or due to a less perfect alignment of the NMR spectra, since we chose the same FOCUS parameters for aligning SHIP and KORA spectra instead of treating them separately. However, 4 of these 5 loci (ETNPPL, SLC6A19, DMGDH, PNMT) show also significant associations in the targeted SHIP-0 data set that replicate in the targeted KORA F4 data set ( Table 1). Out of the 23 loci identified in the discovery study, ABO is the only locus that could not be replicated using either a targeted or a non-targeted metabolic trait in KORA F4, leaving 22 loci that display stable associations with metabolic traits in urine.

Overlap with previous mGWAS in urine and blood
We evaluated each identified and replicated locus in the light of previously reported associations with metabolic phenotypes and clinical traits. To this end, we selected all SNPs within a locus for which we found genome-wide significant associations with any urinary metabolic trait in the SHIP-0 cohort. Furthermore, we added all bi-allelic variants from the 1000 genomes project [50] (phase 1, version 3, European ancestry) that are in strong LD to these SNPs (r 2 ! 0.8).
We further compared our association results with those of published mGWAS with metabolic traits in blood (P < 5×10 −8 ), including all studies listed in the NHGRI GWAS catalog [23] and other studies such as the mGWAS by Shin et al., which is based on metabolomics data from the KORA F4 and TwinsUK cohorts [1, 4, 6-8, 14-22, 24]. In total, 14 loci show significant associations with metabolic traits both in blood in one of these mGWAS and in urine in our study (Fig 3, Table 3). For 3 of these 14 loci (SLC6A20, PNMT, AGXT), we consider the associated metabolic traits in both media to be unrelated. In 5 cases (NAT2, NAT8, PYROXD2, SLC7A9, TKT), the genetic association analyses identified different, but related metabolites (i.e., the associated metabolites from urine and blood are either products/substrates of the locus' candidate gene product, or are biochemically converted within another known enzymatic reaction, or belong to the same metabolite class). In 6 cases, the associations target the same metabolites in urine and blood (CPS1, AGXT2, DMGDH, SLC6A13, HPD, SLC5A11). For 5 of these 6 loci, the direction of the observed effect is the same, whereas for SLC5A11 (associated with myo-inositol), we observe an increase in urinary metabolite concentration per copy of the effect allele, as opposed to decreased levels reported in blood (Table 3). For this locus, we additionally investigated whether the effects seen in blood and urine are directly coupled. To this end, we made use of myo-inositol levels (normalized to circulating creatinine) measured through mass spectrometry (MS) in blood serum samples of the same KORA F4 participants [6] that form the replication cohort in this study. The ratio between the urinary myo-inositol (this study) and the serum myo-inositol levels shows an increase in association strength to the lead SNP in SLC5A11 (rs17702912) by seven orders of magnitude in comparison to the association of urinary myo-inositol alone (P urine < 1.95×10 −24 , P blood < 1.50×10 −4 , P ratio < 2.43×10 −31 ).

Discussion
In this study, we present the largest genome-wide association study with metabolic traits (mGWAS) in urine to date. In addition to quadrupling the sample size compared to previous mGWAS in urine, we analyzed both targeted traits (metabolite concentrations manually derived from NMR spectra) and non-targeted traits (NMR spectral features).

Fifteen new genetic loci linked to urinary metabolic traits
In total, we identified 23 genetic loci with significant associations between genetic variants and targeted or non-targeted metabolic traits in urine of SHIP-0 participants, 22 of which replicate in the independent KORA F4 cohort. To the best of our knowledge, 15 loci have not been linked to changes in the urine metabolome before. For the remaining 7 loci, our results are in line with the results from previous mGWAS in urine [5,10,12] regarding both the associated metabolic traits and the direction of the genetic effects (Table 3).

Targeted and non-targeted metabolomics are complementary
Though derived from the same NMR spectra, the list of GIMs identified with targeted traits and non-targeted traits partly differ. Of the 22 genetic loci reported in this study, only 12 loci were discovered in both targeted and non-targeted traits, whereas 7 loci show significant associations only with non-targeted traits, and 3 only with targeted traits (Fig 3 and Table 3).
For the data set used in the targeted analysis, the NMR spectra were manually annotated to identify and quantify the metabolites underlying the spectra. Involving human expert knowledge usually allows metabolite identification with very high confidence and yields more precise quantification, especially if signals of multiple metabolites overlap in the NMR spectrum. Furthermore, a manual annotation can to some extent compensate for different experimental and sample conditions, as alignment and pre-processing can be optimized for each spectrum individually. As an example, lysine exhibits characteristic signals in the NMR spectral region between δ = 1.68 and 1.76 ppm, which is often dominated by signals from a variety of additional metabolites, making the annotation of lysine a very difficult task. Thus, while lysine concentrations could be determined for 888 samples of the discovery cohort through manual quantification and yielded a significant genetic association at SLC7A9, the non-targeted approach did not capture any association signals for this locus.
However, a manual spectral annotation as performed in the targeted analysis is quite laborious, which limits the number of quantifiable metabolites in large studies. This leads to a bias towards a certain set of metabolites and, as a consequence, significant associations actually present in the NMR data might be missed. Also, a manual annotation in general bears some risk of annotator-induced bias [58]. As an automated method, the non-targeted analysis of spectra has the potential to overcome some of the limitations of targeted analyses. Here, the most prominent example is the PYROXD2 locus, where SNPs display exceptionally strong associations (P < 1.0×10 −307 ) to the NMR signal intensities at δ = 2.854 ppm. We could not identify any significant associations within this locus using the targeted data. Thus, we assumed that our set of targeted traits did not cover the metabolite(s) corresponding to these signals. The challenge with genetically associated non-targeted traits lies in the lack of biochemical interpretability. To facilitate the assignment of non-targeted NMR traits to chemical compounds, we applied the metabomatching algorithm introduced by Rueedi et al. [12]. In case of PYROXD2, metabomatching suggests that the associated NMR signals correspond to trimethylamine. Thereby, the automated method replicates the findings of Nicholson et al. [5] where the authors manually annotated the associated signals based on expert knowledge. In case of our mGWAS with targeted traits, trimethylamine was not part of the metabolite panel and thus the association with PYROXD2 could only be discovered using non-targeted metabolic traits in combination with the automated metabomatching processing. Of course, automated annotation of non-targeted traits also has its limitations: the annotation through metabomatching relies on the association signals that genetic variants display over the NMR spectral range ("association spectra") as well as on the existence of the relevant reference metabolite spectrum (see Methods and S1 Fig). In some cases, these association spectra are not meaningful enough to allow an unambiguous assignment of non-targeted features to metabolites, or they may be pointing to a metabolite not present in the reference set.
In summary, our study demonstrates that GWAS with NMR-determined metabolic traits can benefit from a combined application of both targeted and non-targeted metabolomics. Our results suggest that a targeted approach is better suited for the annotation of metabolites for which the corresponding NMR signals are in regions with a plethora of other signals as in some cases these signals cannot be resolved through non-targeted methods. Furthermore, genetic associations with targeted traits appear to be more robust, since 5 of the 12 loci that display associations with both targeted and non-targeted traits clearly display stronger association signals in the targeted data set (several orders of magnitude in case of the SLC6A20 locus; Tables 1 and 2). However, the non-targeted metabolic traits provide a less biased view on the metabolome, which in our case results in additional significantly associated genetic loci.

Functional metabolomics: from GIMs to testable hypotheses
Fifteen of the 22 identified and replicated loci show a plausible biochemical connection between functionally annotated genes and their associated metabolic traits (Table 3). This is similar to observations from previous mGWAS. For instance, Shin et al. reported biologically meaningful links between metabolites and genetic loci for 101 of 145 GIMs [21]. In case of genes with vague functional annotations, gene-metabolite associations from mGWAS provide testable hypotheses for further gene characterization. As an example, Suhre et al. experimentally confirmed the mGWAS driven hypothesis of SLC16A9 being a carnitine transporter [6]. Vice versa, with the help of mGWAS, the chemical structure of a non-targeted metabolic trait was elucidated through the function of the associated gene [8].
Another prominent finding of previous mGWAS is the overlap between disease relevant genetic variants and variants associated with metabolic traits. In the present study, we found 11 loci hosting variants that have previously been linked to clinical phenotypes. This includes associations with the estimated glomerular filtration rate (eGFR) and chronic kidney disease (CKD). Thus, the associated metabolites might, on the one hand, serve as intermediate traits for clinical endpoints. On the other hand, the associations might provide new insights regarding the involvement of specific metabolic pathways in pathomechanisms and the mediation of genetic risk loci through metabolic changes. For all 22 GIMs, we provide information on both the match of gene and metabolite function and the link to clinical traits in Table 3. In the following, we exemplify the value of our results for the characterization of gene functions in the light of clinical phenotypes.
As a first example, we identified significant associations of variants upstream of ETNPPL with ethanolamine. Interestingly, at the time when we received the first results from our association studies this gene was named AGXT2L1 and was assumed to encode an alanine-glyoxylate-aminotransferase. Based on this gene annotation, there was no obvious relation to the associated metabolite ethanolamine. In such cases, only dedicated experiments (similar to the one for the carnitine/SLC16A9 association mentioned above [6]) could validate the connection of ethanolamine to the gene product. Meanwhile, Veiga-da-Cunha et al. experimentally investigated the locus in an independent study and found that AGXT2L1 actually encodes an ethanolaminephosphate-phospholyase [59]. As a consequence, AGXT2L1 now carries the gene symbol ETNPPL. As ethanolamine is a direct precursor of ethanolaminephosphate via ethanolamine kinase (EC 2.7.1.28), our finding indeed matches the actual gene function. Besides the functional characterization of this locus, Veiga-da-Cunha et al. suggest that the ETNPPL-mediated degradation of ethanolaminephosphate balances the concentration of that metabolite in the central nervous system. They concluded that an altered ethanolaminephosphate homeostasis might contribute to mental disorders such as schizophrenia [59]. In line with this hypothesis, the ETNPPL expression rate in brain was previously found to be associated with schizophrenia [60]. ETNPPL is primarily expressed in brain and liver and the encoded protein is, amongst other tissues, highly localized in the cerebral cortex and the kidney (S4 Table and The Human Protein Atlas [52], http://www.proteinatlas.org/ENSG00000164089/tissue). Our results suggest that an excess of ethanolamine in urine could indicate alterations in ethanolaminephosphate homeostasis linked to a genetically reduced enzymatic activity of ETNPPL.
As a second example, we identified significant associations of genetic variants with 2-hydroxyisobutyrate (2-HIBA) in a locus comprising 9 different genes. According to our evidencebased locus to gene assignment, 4-hydroxyphenylpyruvate dioxygenase (HPD) is the most probable effector gene candidate. The association between 2-HIBA and this locus represents a well replicated finding: it was already identified in our previous NMR-based mGWAS in urine [10] and it has meanwhile also been discovered in an MS-based GWAS with blood metabolites [21]. Nonetheless, to the best of our knowledge, there is no obvious, known biological link between 2-HIBA and the HPD gene or any of the remaining 8 genes covered by this locus. In the literature, 2-HIBA is often referred to as a secondary metabolite that can be found in urine of humans and rats exposed to the volatile gasoline additives methyl-tert-butylether and ethyltert-butylether [61][62][63]. However, 2-HIBA has been identified by both MS-and NMR-based methods in almost all serum and urine samples of large human cohorts (e.g. ARIC [25], CoLaus [12], KORA [6,21], SHIP [10], TasteSensomics [12], and TwinsUK [6,21]) in relatively high concentrations (~40 μM in urine in this study), which suggests sources beyond gasoline for this metabolite (e.g. microbiota [64] or medication [65]). Interestingly, Dai et al. recently showed that 2-HIBA is an intermediate for the newly discovered but common 2-hydroxyisobutyrylation of lysine residues of histones [66], thus indicating an endogenous role of 2-HIBA. In this context, it is interesting to note that SETD1B is one of the genes within the identified locus on chromosome 12. SETD1B is a component of the methyltransferase complex that specifically methylates the lysine-4 residue of histone H3 [67]. This residue is amongst the 63 sites for 2-hydroxyisobutyrylation presented by Dai et al. [66]. Thus, one could speculate that in addition to its activity as a histone methylase, SETD1B may also be involved in the newly discovered process of histone hydroxyisobutyrylation, a hypothesis that may now be tested by dedicated experiments.
As a third example, we discuss the association of variants in XYLB with increased urinary glycolate levels. One of these variants, rs17118, causes an amino acid exchange in the XYLB gene product. XYLB encodes the enzyme xylulokinase [33], which catalyzes the phosphorylation of D-xylulose to D-xylulose-5-phosphate. In humans, the vast majority of D-xylulose is metabolized via xylulokinase [33,68,69]. However, there is an alternative metabolic pathway in which D-xylulose is metabolized by phosphofructokinase (PFK) (S3 Fig) [70]. Therein, one of the downstream products is glycolate. Thus, the genetic variants in XYLB might reduce the enzymatic activity of xylulokinase and thereby cause a shift towards the alternative pathway. Interestingly, the minor allele of rs17118 has been implicated in increased susceptibility for ischemic stroke [32]. Furthermore, Jung et al. found a significant association between elevated glycolate levels in plasma and cerebral infarction [34]. In the alternative pathway, glycolate is a precursor of oxalate, whose toxic effect has been demonstrated repeatedly [33,71]. Very recently, Rao et al. postulated that circulating oxalate precipitate might be a potential mechanism for stroke [72]. In this context, the association between the SNP rs17118 and glycolate (identified in our study) suggests that the carriers of this variants have a higher risk of stroke (identified in [32]) possibly via increased levels of glycolate or oxalate through favoring the alternative D-xylulose degradation. Unfortunately, oxalate or any other metabolite in the two D-xylulose degradation pathways are not detected in our metabolomics analysis to further support our hypothesis.

Extending blood GIMs to urine
In total, 26 genetic loci that associate with urinary metabolic traits are known to date (22 identified or confirmed in this study plus 4 identified in previous studies [5,11,12] , Fig 3). Of the 26 loci, only 8 loci lack corresponding SNP-metabolite associations in blood, and, based on current mGWAS, represent urine specific hits. All of these 8 loci were first reported in the present study. In case of the 14 loci with overlapping associations between blood and urine in our study (Table 3), 6 target the same metabolite in both media (CPS1, AGXT2, DMGDH, SLC6A13, HPD, SLC5A11). Interestingly, in all but one case (SLC5A11) the genetic effect has the same direction in both fluids, thus indicating that urine can be regarded as "diluted plasma" to some extent. For 5 of the 14 loci, we considered the associated metabolic traits in blood and urine to be biochemically related. Here, the metabolites are either products of the enzyme coded by the candidate gene (NAT8: N-acetylated compounds), or they are linked through an enzymatic reaction other than the reaction catalyzed by the candidate gene's product (NAT2: 1,3-dimethylurate and 1-methylurate [73]; PYROXD2: trimethylamine and dimethylamine [EC 1.5.8.2]; SLC7A9: lysine and homocitrulline [EC 2.1.3.8]), or they belong to the same metabolite class (TKT: gluconate and erythronate are aldonates). The observed associations of related but different metabolites in blood and urine may be indicative either for biochemical conversions before excretion, or simply be a result of differences in the composition of the metabolite panels covered by the various mGWAS. In case of the remaining 3 loci, we find no direct biochemical or metabolic relationship between the metabolites in both media, since AGXT associates with an unknown compound in blood, PNMT associates with amino acids in urine and HDL cholesterol in blood, and SLC6A20 targets loosely related amino acid derivatives.
As an example for parallel effects in blood and urine, we identified an association between variants in the Carbamoyl-Phosphate Synthetase 1 (CPS1) gene and elevated urinary glycine levels. The strongest associated SNP rs715 was also identified in previous mGWAS with higher glycine concentrations in blood [14,21,24]. This variant has been highlighted previously as a putative regulator of CPS1 expression [74][75][76]. Furthermore, the second strongest glycine-associated SNP rs1047891 causes a non-synonymous mutation (Thr>Asn) in the C-terminal domain of the CPS1 polypeptide, which hosts the binding site for the allosteric activator Nacteyl-L-glutamate (NAG) [77]. Both SNPs are therefore potentially causative variants in this metabolic association. CPS1 is highly expressed in liver (S4 Table) and controls the first step in the urea cycle: ammonia is catalyzed to carbamoyl-phosphate, which in turn is the entry substrate of the urea cycle. CPS1 deficiency can lead to high ammonia levels in the body (Hyperammonemia, OMIM #237300) (S4 Fig). The association of the CPS1 variants with glycine can be explained by the conversion of excess ammonia to glycine via the glycine cleavage system [78,79] and is thus biologically meaningful. The association between common variants in CPS1 and glycine might therefore be driven by mild forms of genetically induced Hyperammonemia. In this study, we could establish a link between genetic factors and a potential urinary marker for this condition.
SLC5A11 is the only locus where we observe an association with exactly the same metabolite in blood and urine but with reversed effects: while myo-inositol concentrations in urine increase per effect allele copy of the lead SNP, they decrease in serum (Table 3) [21]. The Solute Carrier Family 5 (Sodium/Inositol Cotransporter), Member 11 (SLC5A11) is a co-transporter of myo-inositol with sodium [80]. SLC5A11 was postulated to play a role in the regulation of serum myo-inositol concentrations [81], which was recently confirmed by an mGWAS in blood [21]. On the one hand, the influence of SLC5A11 on myo-inositol concentrations has been linked to apical transport and absorption in intestine [82]. On the other hand, SLC5A11 may be implicated in the re-absorption of myo-inositol in the proximal tubule of the kidney [83]. The opposite direction of the genetic influence in blood and urine as observed in our study suggests that SLC5A11 is actively involved in the re-absorption of myo-inositol. This assumption is further supported by the strong increase of the association strength when testing the ratio between urinary and serum myo-inositol. This could indicate that the reduced levels in blood are indeed caused by a reduced re-absorption rate in subjects that are homozygous regarding the effect allele.
As these examples demonstrate, mGWAS in urine extend our understanding of genetically influenced biochemical processes and can facilitate the knowledge transfer from blood to urine and vice versa. Currently, this transfer is limited by the comparatively low number of GIMs in urine (26) versus blood (>150). Further increasing the sample sizes of mGWAS in urine and the application of more sensitive MS-based metabolomics platforms as already used for blood mGWAS could compensate this bias.

Study samples
For this study, we used data from SHIP (Study of Health in Pomerania) and, for replication, from the KORA (Kooperative Gesundheitsforschung in der Region Augsburg) study. Both studies have been described extensively in the study design papers [84][85][86] and in our previous publications [4,6,10,21].
SHIP is a longitudinal population study conducted in West-Pomerania, located in the northeastern part of Germany. 4,308 inhabitants in that region participated in the first phase "SHIP-0". For the GWAS presented here, metabolically characterized urine samples and genotype data were jointly available for 3,861 study participants (1,960 female and 1,901 male, aged 20 to 81 years). KORA is a population study conducted in the municipal region of Augsburg in southern Germany. The KORA F4 cohort comprises 3,080 subjects. For the study presented here, both genotype and urine samples from 1,691 participants (865 female and 826 male, age 32 to 77) were available. In both studies, all participants have given written informed consent and the local ethics committees (SHIP: ethics committee of the University of Greifswald; KORA: ethics committee of the Bavarian Chamber of Physicians, Munich) have approved the studies.
Metabolomics data acquisition and processing NMR measurements. In SHIP-0, non-fasting, spontaneous urine samples were collected from the study participants. In contrast, KORA F4 study participants were overnight-fasting prior to urine sample collection. All urine samples were stored at -80°C until the analysis. In preparation for the recording of NMR spectra, 75μl of phosphate buffer were added to 675μl of urine to set the pH to 7.0 (+/-0.35). The deuterated buffer contained 0.5 mM sodium trimethylsilylphosphate (TSP) to provide a reference substance for the annotation of the NMR spectra. One-dimensional 1 H NMR spectra were recorded at the University of Greifswald, Germany using a Bruker DRX-400 spectrometer (Bruker BioSpin GmbH, Rheinstetten, Germany). Spectra were acquired at 300K and a frequency of 400.13 MHz using a standard onedimensional NOESY-PRESAT pulse sequence with water peak suppression, as previously described in [10].
Targeted analysis. The Fourier-transformed and baseline-corrected NMR spectra were manually annotated by spectral pattern matching using the Chenomx Worksuite 7.0 by Chenomx, Inc. (Edmonton, Canada) to deduce absolute urinary metabolite concentrations. The panel of targeted metabolites comprises 60 compounds (including creatinine, which was used for normalization). In the discovery mGWAS, we used both metabolite concentrations (59) and pairwise ratios thereof (59 × 58/2 = 1,711) from the SHIP-0 data as phenotypic traits. Prior to analysis, individual metabolite concentrations were normalized to the annotated creatinine levels, and both concentrations and ratios were log 10 transformed. Individual data points more than three times the standard deviation away from the mean were removed to avoid spurious associations. Finally, we considered only metabolic traits with at least 300 valid data points for the GWAS. In total, the final SHIP-0 data set comprises 1,518 targeted metabolic traits (55 metabolites and 1,463 ratios) that were tested for genetic associations.
Likewise, we processed the NMR spectra originating from the KORA F4 data set that was used for replication. Due to the smaller data set size, we lowered the burden for non-missing values to 100. The replication data set contains 53 metabolites and 1,359 pairwise ratios. It covers all metabolic traits that significantly associated in the discovery GWAS.
Non-targeted analysis. The same Fourier-transformed and baseline-corrected NMR spectra that were used for the targeted analysis were binned at a segment width of 0.0005 ppm. All signal intensities were normalized to the TSP reference peak. We used the FOCUS software (http://www.urr.cat/FOCUS) [13] for the subsequent processing of the NMR spectra. We excluded the spectral regions between δ = 4.6 and 5.0 ppm (water peak) and the regions below δ = 0.0 and above δ = 10.0 ppm, which do not contain any signals relevant for a metabolomics analysis. We performed the spectral alignment and feature extraction using the default parameters of FOCUS, resulting in a total of 166 NMR peaks in SHIP-0 and 217 peaks in KORA F4. As for the targeted data set, we computed pairwise ratios of the NMR peaks (SHIP-0: 13,695; KORA F4: 23,436). To compensate for dilution effects, we additionally normalized the signal intensities of individual NMR peaks to the annotated creatinine concentrations prior to analysis. All non-targeted traits (i.e., NMR peaks and peak ratios) were log 10 transformed. In comparison to the targeted data set, we chose a less stringent removal of extreme values. Here, individual data points more than four standard deviations away from the mean were removed.

Genotype data
Both SHIP and KORA samples were genotyped using Affymetrix Human SNP Array 6.0 gene chips. SNPs were called using the Birdseed2 algorithm. In both data sets, the total genotyping rate was above 99%. 909,508 SNPs were genotyped in the SHIP-0 cohort, and 906,716 SNPs in the KORA F4 cohort. We excluded SNPs that violated the Hardy-Weinberg equilibrium (P HWE < 1.0×10 −6 , 8,623 in SHIP and 32,033 in KORA), or had a genotyping rate below 95% (57,160 in SHIP and 84,351 in KORA), or displayed minor allele frequencies (MAF) below 5% (227,967 in SHIP and 224,723 in KORA). After the exclusion, 620,456 autosomal SNPs remained in the SHIP-0 data set and 593,830 autosomal SNPs in the KORA F4 data set. Both SHIP and KORA genotypes were imputed in a two-stage process (pre-phasing followed by imputation). According to data from the 1000 genomes project (phase 1, March 2012 release [50]), we used SHAPEIT (v1.416) [87] for phasing in KORA F4 and IMPUTE (v2.2.2) [88] for imputation in KORA F4 and both phasing and imputation in SHIP-0. For the association analyses, we considered only imputed variants with a MAF ! 5%, P HWE ! 1.0×10 −6 , and an imputation quality score (IMPUTE info-score) ! 0.8.

Statistical analysis
Genome-wide association study. In both the SHIP-0 and KORA F4 cohorts, we used PLINK (v1.07) [89] to compute age-and sex corrected linear regression models under the assumption of an additive genetic model. For the discovery study based on the SHIP-0 data set, we carried out association tests for each genotyped autosomal SNP (filtered for the criteria described above) and all 15,379 targeted and non-targeted metabolic traits. In addition, we tested all imputed, quality-filtered SNPs within a physical distance of 1 Mb to each genotyped SNP that displayed an association signal of P < 5×10 −8 . This two-stage approach of testing imputed variants in selected candidate regions results in a drastically lowered computational burden when compared to association studies with imputed variants over the whole genomic range (0.9M vs. 15.9M tested SNPs). We considered associations to be genome-wide significant if the resulting P value was below the Bonferroni-adjusted significance threshold of 5×10 −8 / 15,379 = 3.25×10 −12 . In case of associations with ratios, we imposed an additional significance criterion on the P-gain as suggested by Petersen et al. [53]. P-gain describes the observed increase in strength of association when compared to the association with the individual metabolic traits from which the ratio was computed (min(P(M 1 )/P(M 1 /M 2 ), P(M 2 )/P(M 1 /M 2 ))). A conservative Bonferroni-type lower limit on the P-gain at a significance level of 5% is given by ten times the number of tested ratio pairs. Thus, associations with ratios were considered to be significant if the P-gain exceeded 15,180 in case of targeted metabolic ratios and 138,610 in case of non-targeted ratios. The analysis of the PLINK output files was performed using inhouse R (v3.0.2) code (available at http://www.gwas.eu).
Replication of association results. To replicate the significant associations discovered in the SHIP-0 data set, we used data from the independent KORA F4 cohort. For each genetic locus, we attempted to replicate the association of the SNP/metabolic trait pair that displayed the lowest P-value (S1 and S2 Tables). If an association did not replicate, we checked whether the tested SNP was significantly associated with other metabolic traits in SHIP-0. In that case, we tried to replicate these associations, beginning with the one that displayed the strongest association signal. To replicate associations with non-targeted traits, we selected the NMR feature with the minimum difference in chemical shift. In total, we attempted the replication of 38 SNP/metabolic trait associations. Thus, we considered associations with P < 0.05/ 38 = 1.32×10 −3 to be successfully replicated in the KORA F4 cohort.

Genetic variant annotation and evidence-based locus to gene mapping
Annotation data for genetic variants as well as linkage disequilibrium (LD) data from the 1000 genomes project (phase 1 version 3, EUR panel [50]) were retrieved from SNiPA v1 (http:// www.snipa.org) [51]. Full lists of association signals from the serum-based mGWAS [21] were obtained from the Metabolomics GWAS server (http://www.gwas.eu).
All genotyped and imputed SNPs that displayed genome-wide significant association signals (according to the aforementioned P-value and P-gain criteria) were assigned to distinct genetic regions (loci), based on a physical distance threshold of 1 Mb. Each of the resulting 23 genomewide significant loci was then projected to candidate genes using an evidence-based procedure. To this end, we used the "block annotation" feature of SNiPA on the LD-extended (r 2 ! 0.8) list of associated variants at each locus. This feature provides a condensed view of genes that are linked to any of the significantly associated variants or their LD-proxies via genomic proximity, eQTL association, or regulatory elements. Additionally, the block annotation highlights missense and pathogenic variants. Based on these data, we defined the following criteria to identify candidate genes: 1) Genomic proximity: genes that harbor or are in close proximity (<5kb) to any of the variants in the list. 2) eQTL association: genes where altered expression levels have been discovered to associate with any of the variants in the list. 3) Regulatory elements: potentially regulated genes that are associated with a promoter/enhancer/repressor element containing a variant of the list. Further evidence for potential involvement of a gene was assumed if 4) the variant list contains a missense variant for a protein product of this gene and 5) if an intragenic variant in the list is annotated as pathogenic in one of the phenotype databases contained in SNiPA. For each gene, we counted how many of the aforementioned criteria are met. Thus, the maximum evidence count for a candidate gene is five. If evidence-based gene selection was ambiguous, the gene with the most plausible biological function was chosen (S3 Table).

Metabomatching of non-targeted NMR traits
Metabomatching [12] is an automated annotation method that identifies metabolites likely to underlie an observed genetic association between a SNP and one or more non-targeted metabolic traits (www.unil.ch/cbg). It does so by comparing the association signal between a SNP and all non-targeted traits (pseudo-spectrum or association spectrum) to the 1 H-NMR spectra of metabolites in a reference set, and assigning a score to each pseudo-spectrum to NMR spectrum match. The metabolites most likely to underlie the genetic association are the ones with the highest scores. We applied metabomatching to each SNP showing a significant association with an NMR feature or feature-ratio, using the 180 metabolites listed in the urine metabolome database [54] with an experimental NMR spectrum as reference set. The urine metabolome database is the subset of metabolites in the Human Metabolome Database (HMDB) [90] present in urine. Where indicated, 2-compound or effect direction-specific metabomatching was applied. Final candidates were manually identified, usually among the few top-ranked matches. Most likely candidates are used in the text; potential alternatives are listed, along with the metabomatching details, in S1  Table 2 (except HIBCH, for which no match was found). Top panels contain the pseudo-spectrum of the SNP with the strongest association within its locus, bottom panels show the NMR spectrum of the most likely candidate metabolite(s). For pseudo-spectra, only peaks with-log(P) > 1.3 (P < 0.05) are displayed, and peak color indicates effect direction, with blue for β > 0 and orange for β < 0. The most likely candidate metabolite is not necessarily the metabolite of top rank: instead, it is manually selected among top-ranked metabolites. For some loci, the pseudo-spectrum indicates the involvement of more than one compound. For CPS1 (b) and HPD (o), 2-compound metabomatching, which ranks compound pairs rather than individual compounds, produces viable candidates. For SLC6A19 (g), NAT2 (k), and PNMT (r), the pseudo-spectra allow too many viable compound pairs even for 2-compound metabomatching, so these matches were manually selected. (PDF) S2 Fig. Regional association plots, box plots, and quantile-quantile plots for the loci with genome-wide significant associations. Top: The regional association plot displays SNPs in a locus and the strength of association to the top-associated metabolic trait (Tables 1 and 2). The plot window (500 kb) is centered on the SNP that displays the strongest association signal and, in case of ratios, meets the P-gain criterion ("lead SNP", indicated in blue). The variants are colored according to the degree of linkage disequilibrium (LD) to the lead SNP. The plot symbols and border colors indicate functional annotations (e.g. effects on transcripts) as well as results from other association studies. Plots were created with the SNiPA web service (available at www.snipa.org) using data from the 1000 genomes project (phase 1 version 3, European super-population) and Ensembl 75. Bottom left: Top associated metabolic trait in SHIP-0 grouped by genotype of the lead SNP (in order major allele homozygotes, heterozygotes, and minor allele homozygotes). Bottom right: Quantile-quantile plots displaying the observed vs. the expected distribution of P-values for associations to the top-associated metabolic trait. (PDF) S3 Fig. Polymorphisms in XYLB might induce a switch to an alternative xylulose pathway via phosphofructokinase (PFK). The mGWAS identified significant associations of SNP rs3132440 (intronic region of XYLB) and rs17118 (non-synonymous variant in XYLB) with glycolate. The urinary concentration increases per copy of the effect allele (indicated by the yellow arrow). Glycolate is a downstream product of the PFK-mediated D-xylulose pathway. Thus, the associated variants could decrease the enzymatic activity of XYLB's gene product (xylulokinase; XK), which would lead to an increased D-xylulose metabolism via PFK. A deficiency of Carbamoyl-Phosphate Synthetase 1 (CPS1) can lead to high ammonia levels in blood (top). We detected significant associations of variants in CPS1 and elevated urinary glycine levels (indicated by the yellow arrow). We hypothesize that the excess ammonia is converted to glycine via the Glycine Cleavage System (bottom). (PDF) S1 Table. Replication results for the 15 lead SNPs and their associations to targeted metabolic traits in the KORA F4 cohort. In case of association to ratios, the metabolite that shows the stronger association signal is listed in the numerator. Tests: number of conducted association tests on the SNP. The significance level was adjusted for 38 tests (P < 0.05/38 = 1.32×10 −3 ), which is the sum of replication attempts using targeted traits and non-targeted traits (S2 Table). In the targeted data set, the strongest identified association between rs7247977 (SLC7A9) and the lysine/valine ratio could not be replicated. Instead, the second-strongest, still significant association of this SNP with the concentrations of valine was successfully replicated. (DOCX) S2 Table. Replication results for the 20 lead SNPs and their associations to non-targeted metabolic traits in the KORA F4 cohort. Non-targeted traits are reported as chemical shifts (i.e., the position in the NMR spectrum). To replicate the non-targeted association results, we selected the NMR peaks or ratios thereof that were closest to the NMR features used in the SHIP-0 data set. For five loci (ETNPPL, SLC6A19, DMGDH, ABO, and PNMT), we were unable to replicate the top associations in the KORA F4 data set. These associations are marked with an asterisk ( Ã ). (DOCX) S3 Table. Evidence-based locus-to-gene assignment.