Risk estimation model for nonalcoholic fatty liver disease in the Japanese using multiple genetic markers

The genetic factors affecting the natural history of nonalcoholic fatty liver disease (NAFLD), including the development of nonalcoholic steatohepatitis (NASH) and NASH-derived hepatocellular carcinoma (NASH-HCC), are still unknown. In the current study, we sought to identify genetic factors related to the development of NAFLD, NASH, and NASH-HCC, and to establish risk-estimation models for them. For these purposes, 936 histologically proven NAFLD patients were recruited, and genome-wide association (GWA) studies were conducted for 902, including 476 NASH and 58 NASH-HCC patients, against 7,672 general-population controls. Risk estimations for NAFLD and NASH were then performed using the SNPs identified as having significant associations in the GWA studies. We found that rs2896019 in PNPLA3 [p = 2.3x10-31, OR (95%CI) = 1.85 (1.67–2.05)], rs1260326 in GCKR [p = 9.6x10-10, OR (95%CI) = 1.38(1.25–1.53)], and rs4808199 in GATAD2A [p = 2.3x10-8, OR (95%CI) = 1.37 (1.23–1.53)] were significantly associated with NAFLD. Notably, the number of risk alleles in PNPLA3 and GATAD2A was much higher in Matteoni type 4 (NASH) patients than in type 1, type 2, and type 3 NAFLD patients. In addition, we newly identified rs17007417 in DYSF [p = 5.2x10-7, OR (95%CI) = 2.74 (1.84–4.06)] as a SNP associated with NASH-HCC. Rs641738 in TMC4, which showed association with NAFLD in patients of European descent, was not replicated in our study (p = 0.73), although the complicated LD pattern in the region suggests the necessity for further investigation. The genetic variants of PNPLA3, GCKR, and GATAD2A were then used to estimate the risk for NAFLD. The obtained Polygenic Risk Scores showed that the risk for NAFLD increased with the accumulation of risk alleles [AUC (95%CI) = 0.65 (0.63–0.67)]. Conclusions: We demonstrated that NASH is genetically and clinically different from the other NAFLD subgroups. We also established risk-estimation models for NAFLD and NASH using multiple genetic markers. These models can be used to improve the accuracy of NAFLD diagnosis and to guide treatment decisions for patients.

Introduction Nonalcoholic fatty liver disease (NAFLD) is frequently associated with metabolic syndrome, a broad range of pathologies including nonalcoholic fatty liver (NAFL), nonalcoholic steatohepatitis (NASH), cirrhosis, and hepatocellular carcinoma (NASH-HCC). NAFLD is classified into four subgroups based on its long-term histological progression. Type 1 (steatosis) and type 2 (steatonecrosis) are classified as NAFL, and type 3 (steatohepatitis) and type 4 (steatohepatitis with fibrosis) as NASH [1]. The prevalence of NAFLD varies widely across the world; for example, it is rare in Asian countries and more common in North America [2]. It was recently reported that the overall NASH prevalence among biopsied NAFLD patients is 59.1% [2].
NAFLD including NASH has a highly varied natural history. In a longitudinal study of 81 NASH and 27 NAFL patients with serial liver biopsies, 45 (42%) patients showed fibrosis progression, while 20 (18%) showed regression after the median follow-up period of 6.6 years [3].
NASH-HCC is considered to be derived mainly from cirrhotic liver, although other factors such as advanced fibrosis and the presence of diabetes mellitus are high risk factors for HCC development [4,5]. Notably, approximately one third of NASH-HCC cases are derived from non-cirrhotic liver [6,7]. These findings indicate that multiple environmental, lifestyle, and genetic factors are involved in its onset and progression. Recent studies demonstrated that, among the pathologic features of NAFLD, only fibrosis independently predicts long-term liver-related mortality [8,9].
The first genome-wide association (GWA) study for NAFLD used NAFLD cases diagnosed by liver fat content, and identified PNPLA3 as a major genetic determinant for fatty liver and triglyceride content [10] in Hispanic, African American, and European populations. Subsequent studies [6,[11][12][13][14][15] showed an association of PNPLA3 with inflammation, fibrosis, and HCC development. An association of PNPLA3 with NASH-HCC in European and Japanese populations was demonstrated by genotyping candidate SNPs [6,15,16]. Exome-wide analyses and subsequent replication studies showed an association of TM6SF2 with NAFLD [17,18]. Rs641738, originally regarded as located in the MBOAT7 locus but now confirmed to be located in TMC4, was initially reported as a susceptibility variant for alcohol-related cirrhosis [19], and was later found to be associated with NAFLD patients of European descent [20]. This variant was not replicated further in any other populations.
We previously reported the results of a GWA study using 529 histologically proven NAFLD cases, that demonstrated that Matteoni type 4 (histologically typical NASH) was both genetically and clinically different from the other three Matteoni types in the Japanese [21]. However, in that study only 29 Matteoni type 3 cases and no NASH-HCC cases were included.

Statistical analysis
Logistic regression was used for statistical analyses of the GWA studies, comparing 1) all NAFLD patients or 2) NASH-HCC patients with the controls. Population stratification was assessed by the genomic control method [23] and adjusted for 10 principle components (PCs) calculated using a tool for Genome-wide Complex Trait Analysis [24]. Genome-wide significance was set as p = 5.3x10 -7 based on Bonferroni's correction for multiple testing. Regional genotype imputations were performed with MACH [25] using the 1000 Genomes Project Consortium [26] phase I release version 3 as a template, and SNPs passing an imputation quality threshold of r2>0.5 were used. Linkage disequilibrium (LD) indices were calculated by PLINK [27]. We also conducted a GWA study for the Brunt stage, Brunt grade, and fat-droplet content using ordinal logistic regression. The allele distributions of the genome-wide significant SNPs were compared between the different subgroups of Matteoni types by logistic regression, adjusting for age, sex, BMI, and the 10 PCs.
Polygenic Risk Scores (PRS) were calculated for all NAFLD patients compared with controls. We also calculated the PRS for Matteoni type 4 + NASH-HCC compared with Matteoni type 1 to 3, and for NASH-HCC compared with Matteoni type 4. Using the genome-wide significant SNPs identified in the GWA studies, we generated models by a forward stepwise selection procedure for each comparison, including sex as basic genetic background. PRS were then calculated for each subject using the estimated model, and the study subjects were divided into quintile groups (Q1 to Q5). We compared the lowest quintile group (Q1) with the other groups (Q2 to Q5) using fisher.test in the R package. We also sought to improve the model by adding SNPs previously reported to be associated with NAFLD. We performed additional GWAs for these SNPs, and those with a low p-value (p<1e-4) were included in the models.

Characteristics of the study population
The clinical characteristics of the 902 patients and 7,672 control subjects are summarized in Table 1. We compared the distribution of clinical traits between the general-population controls and the Matteoni type 1 subgroup (21 traits), the type 1 and type 2 subgroups (31 traits), the type 2 and type 3 subgroups (31 traits), the type 3 and type 4 subgroups (32 traits), and the type 4 and NASH-HCC subgroups (31 traits). In agreement with our previous study, the results suggested that Matteoni type 4 (histologically typical NASH) was clinically different from the other three Matteoni types. The NASH-HCC and control subjects also showed a clearly different clinical background from the NAFLD patients. A significant difference (p<3.5x10 -4 ) was observed for 34 of the 146 traits examined, of which 16 were observed between the controls and the type 1 subgroup. There were no significant differences between the type 1 and type 2 or the type 2 and type 3 subgroups, except for the fat-droplet content (p = 3.2x10 -4 ) between type 1 and type 2. In contrast, six traits, including two biomarkers for liver fibrosis (type IV collagen 7S and hyaluronic acid) were significantly different between the type 3 and type 4 subgroups. These results suggested that Matteoni type 1, type 2, and type 3 belonged to the same subgroup, and Matteoni type 4 formed a distinct subgroup. In the comparison between type 4 and NASH-HCC, 11 clinical traits showed significant differences. All of these 11 markers were associated with the severity of fibrosis, decline of liver function, or a higher age range in the NASH-HCC compared to the type 4 NASH patients.

Genome-wide association studies
We conducted a GWA study between 902 NAFLD patients including 58 NASH-HCC cases and 7,672 population controls for 93,606 SNP markers. A slight increase in p-values was observed after adjusting the population stratification using 10 PCs (λ = 1.12). Significant association signals (p<5.3x10 −7 ) were detected in three chromosomes ( Fig 1A, Table 2). The strongest association was observed for rs2896019 (p = 2.3x10 -31 ), at 22q13.31 in the PNPLA3 gene, which has repeatedly been reported as a strong genetic determinant for NAFLD. Rs738409, a non-synonymous SNP reportedly associated with NAFLD, was not identified in this analysis. However, its association with NAFLD was observed in the regional imputation analysis (p = 1.0x10 -29 ) (Fig 2A). The second strongest association was detected for rs1260326 (p = 9.6x10 -10 ), at 2p23.3 in the glucokinase regulator (GCKR) gene, another known susceptibility gene for NAFLD [12] (Fig 2B). A previously reported SNP, rs780094, also showed significant association with NAFLD (p = 2.1x10 -8 ). The third strongest association was detected for rs4808199 (p = 2.3x10 -8 ), in the vicinity of the GATA Zinc Finger Domain Containing 2A (GATAD2A) gene at 19p13.11. Rs4808199 was located in a 360-kb LD block encompassing NCAN and TM6SF2, which are both known to be associated with NAFLD [12,17] (Fig 2C). However, no association was detected for rs2228603 in NCAN, which was reported to be associated with NAFLD. The regional imputation analysis detected an indicative association signal for rs58542926 in TM6SF2 (p = 2.2x10 -4 ). A weak linkage disequilibrium between rs58542926 and rs4808199 was observed (r2 = 0.21.) The associations of PNPLA3 and GATAD2A were lost when the type 1 to type 3 patients were used as cases. In contrast, they became stronger when only type 4 and NASH-HCC patients were used (p = 2.9x10 −34 for rs2896019 and p = 2.0x10 −8 for rs4808199, respectively), but the association of GCKR was lost (S2 Table).
We also performed a GWA study between 58 NASH-HCC patients and the same 7,672 controls. Population stratification was not observed (λ = 1.00) for this analysis. The significant association of PNPLA3 was again observed (p = 1.8x10 -8 for rs2896019). In addition, we detected a significant association in the vicinity of the dystrophy-associated fer-1-like protein (Dysferlin or DYSF) gene on 2p13.3 (p = 5.2x10 -7 for rs17007417) (Fig 1B and Table 2). Regional imputation analysis detected an association peak consisting of multiple SNP markers, of which rs17007417 showed the highest association ( Fig 2D). The list of SNP markers for which p<1.0x10 -5 is provided as supplementary information (S3 Table) We also performed a GWA study using the ordinal logistic regression for Brunt grade as a metric for necroinflammatory activity, Brunt stage as the fibrosis stage, and fat-droplet content. No genome-wide significant SNPs were identified in these studies except for rs2896019 in PNPLA3, which showed a moderate association with fat-droplet content (p = 3.7x10 -4 ) (S4 Table and

Associations of previously reported SNPs with NAFLD
Rs641738, a genetic variant that was recently shown to be associated with NAFLD [20], was included in the SNP markers examined in the current GWA study. However, we did not find any association of rs641738 with NAFLD, or with the Matteoni type 4 or NASH-HCC subgroup. We also examined the association of 14 other SNPs reviewed in Anstee et al. that were reported to be associated with the disease [28]. None of them showed genome-wide significance in our analysis, although rs58542926 in TM6SF2 (p = 2.2x10 -4 ) and rs1800234 in PPARA (p = 6.5x10 -5 ) showed significant associations when the cut-off p-value was corrected for multiple testing to p<0.0035 (S5 Table). In addition, we examined whether the above 14 markers showed an association with the Brunt stage, Brunt grade, or fat-droplet content in the patient population. However, we did not find associations of any of the 14 markers with these NAFLD-related phenotypes.

Impact of genetic variations on the pathogenicity of NAFLD
We next investigated the impact of the genetic variations identified as significant in the GWA studies on the pathogenicity of the disease. The genotype distributions of rs2896019 in PNPLA3, rs4808199 in GATAD2A, rs1260326 in GCKR, and rs17007417 in DYSF were compared among the controls and the five patient subgroups. As reported previously [21], a significant difference (p<3.3x10 -3 ) was observed for rs2896019 in Matteoni type 4 compared with the controls or the type 1 subgroup [p = 1.0x10 -28 with an odds ratio (OR) of 2.23 and 95% confidence interval (95%CI) between 1.93 and 2.56, and p = 7.90x10 -6 , OR (95%CI) = 1.93 (1.45-2.58), respectively] (Fig 3A and S6A Table). The same trends were observed when type 4 was compared with type 2 and type 3, although the differences were not significant after correcting for multiple testing. The difference in rs2896019 was greatest between NASH-HCC and controls, type1, or type 3. The association was at the border of significance when NASH-HCC was compared with type 2 (p = 3.9x10 -3 ), but more importantly, no association was observed when compared with type 4. Rs1260326 in GCKR showed significant associations with type 3, type 4, and NASH-HCC when compared with controls ( Fig 3B and S6B Table).
Notably, no statistical differences were observed among the five subgroups of patients. For rs4808199 in GATAD2A, only the comparison of type 4 and controls showed a significant difference (Fig 3C and S6C Table). Rs17007417 in DYSF was significantly different in NASH-HCC cases compared with controls or with any of the four Matteoni subgroups ( Fig  3D and S6D Table).

Risk estimation of NAFLD using genetic variations
We next assessed the influence of risk alleles associated with NAFLD on the development of NAFLD. We first generated an estimated model by forward stepwise logistic regression using rs2896019 in PNPLA3, rs1260326 in GCKR, rs4808199 in GATAD2A, and rs17007417 in DYSF. Subsequently, rs2896019, rs1260326, and rs4808199 remained in the model. The ORs and 95%CIs compared with the 1st quintile PRS were 1.89 (1.40-2.58), 2.24 (1.68-3.01), 3.30 Risk estimation model for nonalcoholic fatty liver disease in the Japanese using multiple genetic markers (2.51-4.38), and 5.00 (3.83-6.57) for the 2nd to 5th quintile PRS, respectively ( Fig 4A). The area under the curve (AUC) was 0.65 (95% CI = 0.63-0.67). Next, the model was refined by including rs780094, rs738409, and rs58542926 instead of rs2896019, rs1260326, and rs4808199 and including 14 previously reported SNPs for NAFLD (S5 Table). Although rs56225452 in SLC27A5, rs1800234 in PPARRA, rs1799945 in HFE, and rs17883901 in GCLC were added to the model, the AUC was not improved (AUC = 0.65, 95%CI = 0.64-0.67). To estimate the possible maximum AUC for the current GWA study, we additionally included 10 candidate SNPs for which the p-value was <1x10 -4 in our GWA study (S7 Table), and 13 total SNPs remained after the model selection. The AUC was increased to 0.69 (95%CI = 0.66-0.70) (Fig 4B). Risk estimation results for Matteoni type 1 to 3 vs. type 4 or NASH-HCC and for type 4 vs. NASH-HCC are shown in the S2 Fig and S8 Table.

Discussion
PNPLA3 is the strongest genetic determinant known for the development of NAFLD and NASH-HCC [6,12]. PNPLA3 is a membrane protein located on the surface of lipid droplets. Rs738409, the I148M variant of PNPLA3, decreases triglyceride breakdown, leading to lipid  Table. retention in hepatocyte lipid droplets. The strong association of PNPLA3 with type 4 and NASH-HCC but not with type 1 to type 3 indicates that it is involved in the later stages of NAFLD, particularly in liver fibrosis. We identified GATAD2A, the function of which is not well understood, as a novel susceptibility gene for NAFLD. It is located in an LD block spanning TM6SF2 and NCAN, which were previously reported as susceptible genes for NAFLD [12,17]. We also found that rs58542926 in TM6SF2, which had weak LD with rs4808199 in GATAD2A, showed a moderate association with NASH. Similar to PNPLA3, the association of GATAD2A was not observed with the type 1 to type 3 subgroups and was stronger with type 4 and NASH-HCC, indicating that it is related to the development of NASH. Rs4808199 in GATAD2A was previously shown to be strongly associated with the expression of GATAD2A and MAU2 [22]. Furthermore, rs58542926 in TM6SF2 and rs4808199 in GATAD2A are located in the same LD block, and genome-wide significant SNPs were also found in TM6SF2. Since there is accumulating biological evidence for the association between TM6SF2 and NAFLD, the association of rs4808199 may be driven by the group of SNPs in high LD, which includes rs58542826. Given that there is no direct evidence that the nonsynonymous variant rs4808199 is causative, it is possible that other genetic variants functionally affect TM6SF2.
GCKR is an inhibitor of glucokinase (GCK), and its hepatic concentration is increased in NAFLD [29,30]. The risk allele of rs780094 in GCKR was shown to increase liver fat, possibly by increasing the expression of C2orf16 [12]. However, no effect of rs780094 genotypes on C2orf16 expression has been observed [22,31]. Rs780094 was also reported to be associated with the blood C reactive protein (CRP) level [32], but we did not observe this association using a high-sensitive CRP test (p = 0.77 by linear regression). Hence, it is still premature to draw clear conclusions about the mechanisms underlying the biological effects of this disease.
We identified an association peak at chromosome 2p13.3 in the GWA study using the NASH-HCC cases (Figs 1B and 2D). This SNP, rs17007417, was located 125-kb downstream of DYSF and was in an LD block encompassing a 'gene desert'. DYSF is reported to be the causative gene for monogenic muscular disorders, such as muscular dystrophy, limb-girdle, type 2b, and Miyoshi muscular dystrophy 1. However, there is no report showing DYSF to be susceptible for multigenic diseases, including liver-related disorders. We also did not find an effect of rs17007417 on any gene expressions [22]. In addition, a limited number of NASH-HCC DNAs (58 samples) were included in the study due to the small number of biopsyproven NASH-HCC patients, and this result was not validated using an independent sample set. Therefore, the involvement of rs17007417 in NASH-HCC still needs further investigation.
Rs641738 in TMC4 located near MBOAT7 on chromosome 19 did not show association with NAFLD and NASH-HCC in our Japanese study (S5 Table). While the association of rs641738 in patients of European descent was initially reported in 2015 [19], the association has not been replicated in any other population to date. This may be due to the difference in LD patterns of the populations. According to the varLD [33], LD pattern between CEU and JPT population was significantly different within the MBOAT7 (p<0.001) and TMC4 (p = 0.002, p-values were calculated using 1000 genome phase3 dataset) regions. In addition, there are no large LD blocks in this region, so we could only successfully impute the SNP genotypes located near the genotyped SNPs (S3 Fig). These data suggest that rs641738 is not the causal variant, and in the European population the actual causal variant may lie close to and show high LD with rs641738. To confirm this hypothesis, a very dense genotyping or target sequencing in this region is necessary.
Risk estimation by PRS using the identified genome-wide significant SNPs for NAFLD clearly showed that the effect of the risk alleles, namely PNPLA3, GATAD2A, and GCKR was cumulative and increased the risk for NAFLD. Clinical and lifestyle information obtained from a prospective study might further improve this model.
The general population used as the control in this study could potentially include NAFLD patients. In addition, according to the 1000 Genomes Project phase 3 dataset [26], only 42.3% of the common SNPs in the Japanese population are tagged by the SNPs used in the current study with r2>0.5. These issues can lower the statistical power of the study and elicit false-negative results. In addition, even though almost all of the patient samples and all of the control samples were collected on Honshu island in Japan, some population stratification was observed, and none of our results were replicated in an independent cohort. Further confirmation of our findings is needed to draw conclusions with higher confidence.
The present study clearly demonstrated that genetic background exerted a marked influence on the severity of liver fibrosis and the development of NASH-HCC. We believe that the risk estimation using genetic markers will improve the accuracy of NAFLD diagnoses and help to guide treatment strategy decisions for patients.