Comparative Genome of GK and Wistar Rats Reveals Genetic Basis of Type 2 Diabetes

The Goto-Kakizaki (GK) rat, which has been developed by repeated inbreeding of glucose-intolerant Wistar rats, is the most widely studied rat model for Type 2 diabetes (T2D). However, the detailed genetic background of T2D phenotype in GK rats is still largely unknown. We report a survey of T2D susceptible variations based on high-quality whole genome sequencing of GK and Wistar rats, which have generated a list of GK-specific variations (228 structural variations, 2660 CNV amplification and 2834 CNV deletion, 1796 protein affecting SNVs or indels) by comparative genome analysis and identified 192 potential T2D-associated genes. The genes with variants are further refined with prior knowledge and public resource including variant polymorphism of rat strains, protein-protein interactions and differential gene expression. Finally we have identified 15 genetic mutant genes which include seven known T2D related genes (Tnfrsf1b, Scg5, Fgb, Sell, Dpp4, Icam1, and Pkd2l1) and eight high-confidence new candidate genes (Ldlr, Ccl2, Erbb3, Akr1b1, Pik3c2a, Cd5, Eef2k, and Cpd). Our result reveals that the T2D phenotype may be caused by the accumulation of multiple variations in GK rat, and that the mutated genes may affect biological functions including adipocytokine signaling, glycerolipid metabolism, PPAR signaling, T cell receptor signaling and insulin signaling pathways. We present the genomic difference between two closely related rat strains (GK and Wistar) and narrow down the scope of susceptible loci. It also requires further experimental study to understand and validate the relationship between our candidate variants and T2D phenotype. Our findings highlight the importance of sequenced-based comparative genomics for investigating disease susceptibility loci in inbreeding animal models.


Background
Type 2 diabetes (T2D), also known as non-insulin-dependent diabetes is characterized by defects in both insulin secretion and utilization and accounts for about 90% of the 346 million diabetic cases around the world [1]. Both environmental and genetic factors contribute to the etiology and progression of T2D [2,3]. In the last two decades, significant efforts, ranging from read in GK/Slac and Wistar/Slac sequencing, respectively; 93.02% and 93.65% genome were covered by at least five reads).
In order to get accurate variant calls, we reassessed the quality of the mapping results through the GATK workflow [39]. After these steps, 0.42% (0.58%) of the Solexa reads, and 6.05% (4.87%) of the SOLiD reads were removed from GK/Slac (Wistar/Slac) mapping results, respectively (S3 File). In total we aligned 34.06 Gb (34.70 Gb) from GK/Slac (Wistar/Slac) sequencing data to the BN reference rat genome which corresponded to 13.27X (13.52X) coverage of high-quality reads for GK/Slac (Wistar/Slac) [38].
To evaluate the accuracy of variant calling, we compared the primary results to a public dataset from the STAR project [41]. This dataset includes genotypes for 20,238 SNVs across 167 distinct inbred rat strains including 10 GK and 6 Wistar strains. There were 7368 (4000) positions that were mutant in all GK (Wistar) strains, and 2491 (1372) positions that were not polymorphic in all GK (Wistar) strains. We checked our SNV calling results against these positions and calculated sensitivity and specificity. For GK/Slac strain, we called 6984 SNVs among the 7368 positions (94.79% sensitivity) and 2489 out of the 2491 non-polymorphic positions (99.9% specificity). For Wistar/Slac strain, we called 3319 SNVs among the 4000 positions (82.97% sensitivity) and 1104 out of the 1372 non-polymorphic positions (80.47% specificity).

Comparative genome analysis
Since the GK rat was obtained by selective inbreeding of Wistar rats, their specific genetic changes from Wistar should indicate the cause of type 2 diabetic phenotypes observed. Therefore, GK/Slac specific variants were determined by comparing variants in GK/Slac with Wistar/Slac. All GK/Slac specific variants were shown on a circular chromosome map (Fig 2).
There were 1,354,739 GK/Slac specific SNVs and 134,554 GK/Slac specific indels. The density of GK/Slac specific SNV/indel was calculated in each 1Mb segment, and their distribution was plotted in Fig 3A and 3B. Most genomic regions were relatively conserved with extremely low SNV density (0-0.0001) and regions with median SNV density (0.0001-0.001) were evenly distributed. When the SNV density increased, the frequency decreased smoothly (0.001-0.002). A long tail indicated the existence of extremely high SNV density (>0.002) regions.
The distribution of indel density was similar to the SNV distribution ( Fig 3A and 3B). We calculated the Pearson correlation coefficient between SNV density and indel density in each 1Mb genomic segment. The result showed that the density frequency of GK/Slac specific SNVs and indels were positively correlated (R 2 = 0.959, Fig 3C). This indicated some SNVs and indels tend to co-localize at highly mutated regions of the genome (S5 File). As expected, there were regions in the genome that showed no or very few differences between GK/Slac and Wistar/ Slac strain, termed variants deserts. Examples included chromosome 1 (20Mb-21Mb) and chromosome X (53Mb-54Mb).
Besides SNV and small indels, we also compared large SVs and CNVs between GK/Slac and Wistar/Slac. We identified 228 GK/Slac specific SVs, including 174 deletions, 12 inversions, 36 tandem duplications and 6 translocations. Based on sequence coverage, we predicted 2660 CNV amplified regions and 2834 CNV deleted regions between the GK/Slac and Wistar/Slac rat genomes. To validate our CNV calling, we compared the CNV candidates with a set of 116 CNVs identified by CGH array data from our previous work [36]. Out of 58 CNV gain regions identified by array, we successfully identified 38 in the sequencing results. The sequencing results also identified 31 out of the 58 CNV loss regions identified by array.
Identification of potential T2D candidate mutations specifically presented in the GK rat We were interesting in the GK/Slac specific variants that might contribute to the development of type 2 diabetes phenotype in GK/Slac. We mapped GK/Slac specific SVs and CNVs to regions containing functional transcripts. For SVs, 26 genes were affected, including 18 olfactory receptor genes (ORs) (S7C File). 75 genes were associated with CNVs, among them 24 Densities for 7 kinds of GK/Slac specific variants on the rat genome. Each tiny bar stand for variants density normalized in 1 Mb genomic segments (see Methods), which was plotted on the circular chromosomes by CIRCOS (http://http://circos.ca/). Layers from outside to inside represent for rat kayrotype and the density of SNV, small indel, large deletion, inversion, tandem duplication, CNV gain, and CNV loss. were OR genes (S7D File). The OR gene family is the largest superfamily in mammalian genome. There are 1,493 OR genes in the rat and 19.5% may be pseudo-genes [42]. OR genes are frequently clustered in regions with a large number of retrotransposons or around subtelomeric regions [43] [44], which tend to exhibit a high rate of mutation. Therefore, we thought GK/Slac specific variants in ORs were background mutations rather than causal of the T2D phenotype. Besides the OR genes, other SV or CNV affected genes were randomly distributed with no enrichment in T2D related pathways and no literature evidence of either direct or indirect association between these genes and T2D.
Next we investigated potential T2D candidate variants from GK/Slac-specific SNVs and indels. We divided SNP/indels into five groups to illustrate their genotype patterns in GK/Slac and Wistar/Slac ( Fig 4A). Group1 (0/1, 0/0) contained sites that were heterozygous variant in GK/Slac and homozygous reference in Wistar/Slac; Group2 (1/1, 0/0) contained sites that were homozygous variant in GK/Slac and homozygous reference in Wistar/Slac; Group3 (1/1, 0/1) contained sites that were homozygous variant in GK/Slac and heterozygous variant in Wistar/ Slac; Group4 (1/2, 0/0) was similar with Group2 and Group5 (1/2, 0/1) was similar with Group3, which were rare sites with two mutant alleles. Among 1,354,739 GK/Slac specific SNVs, group 1 to 3 accounted for the majority of SNVs with 3.6%, 92.9% and 3.5%, respectively (Table 1). Like SNVs, among 134,554 GK/Slac specific indels, proportion of group1 to group3 were 5.0%, 81.9% and 13.1%, respectively. In summary, most SNV and indel variants belonged to groups 1, 2 and 3, and only few allele sites had the complicated allele composition in group 4 and 5 which was consistent with the low probability of de novo production of two rare alleles in the lab inbreeding strain. Group 2 accounted for a large proportion that was concordant with the high homozygosity rate of inbred laboratory rat. Next we annotated the functional effect of GK/Slac specific SNVs/indels by ANNOVAR [45]. Table 2 showed the number of SNPs/indels in each genotype group and functional class. Variants had potential to interrupt the protein functions were called protein affecting variants (PAVs), including nonsynonymous, stopgain, stoploss, splicing, frameshift indels and exonic ncRNA. We detected 1796 PAVs, including 1762 SNVs and 34 indels (S7AB File).
To further refine the above PAVs, we compared our variants with the variants of public RGD datasets. Atanur et al. reported whole-genome sequencing results of 28 laboratory rat strains [46]. Depending on these variants and ours, we plotted a phylogenetic tree for these rats ( Fig 5). As the phylogenetic relationship showed, GK/Slac was close to GK/Ox, and Wistar/Slac  In the light of the public resources of variants from different rat strains, we were able to further narrow down the mutant profile. Fig 4B, 4C and 4D showed the genotype profiles of 1762 GK/Slac specific PAVs in 28 rat strains, the overlap with T2D prior genes (S6 File), and the predicted functional effect of PAVs. To identify T2D phenotype-specific genetic changes, we further filtered the 1796 GK/Slac specific PAVs based on the genotype profile of 11 Wistar strains (except BBDP/Wor, which is a type 1 diabetic model) and 1 GK/Ox strain. Our GK specific variants, which had potential to contribute to T2D phenotype, were required to present in the GK/Ox strain but not the other 11 Wistar strains.
Considering the laboratory inbreeding process, we supposed homozygous variants in GK rat have a higher probability to account for the disease phenotype. Among the 1762 GK/Slac specific protein affecting SNVs, 300 were homozygous variants in both GK strains (GK/Slac in our report and GK/Ox strain studied by Atanur et.al. [46]), but did not present in other 11 Wistar strains. These 300 SNVs were located in 252 genes including 60 OR genes and the other 192 genes were used for further analysis (S8A File). We also checked 34 protein affecting indels. Besides 7 indels were heterozygous in GK/Slac, one homozygous indel resided in the T2D prior gene Hif1a, but many other rat strains also had this homozygous indel; the other 26 homozygous indels were either located in OR genes or not reported to be associated with T2D.

Refinement of the genes with homozygous mutation based on PPI network and gene expression identify prior T2D genes and new candidates
Among 192 potential T2D associated genes, seven of them (Tnfrsf1b, Scg5, Fgb, Sell, Dpp4, Icam1, and Pkd2l1) were clearly reported to be T2D prior genes (see Methods for detailed description, S6 File)). As T2D phenotype was correlated with protein network dysregulation, we hypothesized T2D candidate genes were more likely to have interactions with reported T2D prior genes. We used Fisher's test to evaluate whether their interaction partners were enriched with T2D prior genes (S8B File). There were 16 genes whose interaction partners were enriched with prior genes (adjusted p-value<0.05), among of which six genes (Tnfrsf1b, Scg5, Fgb, Sell, Dpp4, and Icam1) were known T2D prior genes. Fig 6A shows the protein-protein interaction (PPI) network between these six genes and other T2D prior genes. PPIs contained validated and predicted protein associations from STRING database and genes were annotated by KEGG pathway database. Five T2D related pathways were labeled by different colored boxes, including Adipocytokine signaling pathway, Glycerolipid metabolism, PPAR signaling pathway, T cell receptor signaling pathway and Insulin signaling pathway. The six T2D prior genes were closely correlated with T2D phenotype according to previous investigations. Genetic variation in or near Tnfrsf1b might predispose clinical neuropathy,  [46]. Distance between all possible pairs of strains were measured by net nucleotide substitutions [88]. The phylogenetic tree was constructed using UPGMA (unweighted pair-group method with arithmetic means) method in MEGA 6.06 package.
doi:10.1371/journal.pone.0141859.g005 reduced glycosylated hemoglobin, and increased HDL cholesterol in type 2 diabetes patients. The latter could be part of a protective response [47]. Tnfrsf1b and its interacting proteins were involved in the adipocytokine signaling pathway and increased TNF-alpha action would protect the organism from the damage by increasing HDL cholesterol in T2D patients [47,48]. The nonsynonymous SNV in Scg5 (chr3: 99641204:G->C) was predicted to be deleterious ( Fig  4D) by SIFT [49]. Its homologous site in mouse is annotated as "type 2 diabetes mellitus 2 in SMXA RI mice" based on QTL data in UCSC genome browser. Also Scg5 (SGNE1) might impair glucose intolerance and insulin resistance [50], which was consistent with the insulin resistant phenotype of GK strain. Fibrinogen (Fgb) was described as one of the cardiovascular risk factors [51] and Fgb concentration was correlated with fasting insulin concentration [52]. Fgb was also involved in T2D related PPARγ signaling pathways [53]. Sell was a cell surface adhesion/homing receptor that played important roles in leukocyte-endothelial cell interactions. Although its interaction partners did not show enrichment in any T2D related pathway, previous literature had reported that Sell was associated with T2D-associated pathologies, such as diabetic microangiopathy [54], nephropathy [55] and diabetic retinopathy [56]. Dpp4 was a famous drug target of T2D [57], and Dpp4 inhibitors could ameliorate many symptom of T2D [58,59]. PPI showed that Dpp4 was involved in a number of biological functions (Fig 6A) [57]. Dpp4 played a critical role in both the adipocytokine signaling pathway and insulin signaling pathway [60]. Inhibiting Dpp4 activity increased glucose-dependent insulinotropic polypeptide and glucagon like peptide 1 induced insulin secretion [61]. T2D patients showed increased ICAM-1 and VCAM-1 plasma concentrations, which was thought to be related to hyperglycaemia [62]. Pkd2l1 had been associated with T2D by GWAS study [63] and Mancusi S et al. demonstrated that the expression change of PKD2, which was responsible for the formation of the renal cysts and associated with early diabetes onset [64].
Some of the mutant genes were supposed to show expression changes between GK and Wistar strain. We compared the expression profile of 192 potential T2D genes in GK and Wistar rats by analyzing a public microarray dataset (GSE13271). There were 32 differentially expressed genes and 38 differential co-expressed genes in at least one tissue between GK and Wistar rat (S8B File). Among above 16 candidate genes, seven of them (Tnfrsf1b, Ldlr, Pik3c2a, Sell, Icam1, Eef2k, Cpd) also had significant expression changes (differentially expression or differential co-expression) between GK and Wistar. (Table 3) Integrating prior knowledge, PPI network and differential gene expression, we finally selected 15 higher confidence T2D candidate genes with homozygous variants in GK strains (Table 3). These 15 genes were involved in multiple T2D related pathways. We counted the number of shared interaction partners between any two genes, and constructed a relationship network for 14 genes out of the 15 high-confidence T2D candidate genes ( Fig 6B). Fig 6B illustrates the close relationship between 8 new genes (Ldlr, Ccl2, Erbb3, Akr1b1, Pik3c2a, Cd5, Eef2k, Cpd) and known T2D prior genes, indicating these 8 genes have strong relationships with T2D associated pathways. We manually checked their function and possible relationship to T2D in published literature. For instance, Ldlr had previously been shown to be associated with diabetes mellitus and its lipid phenotype [65]. A GWAS study of French population also identified Ldlr as a T2D risk locus [66]. Akt2/Ldlr double knockout mice displayed impaired glucose tolerance [67], and increased inflammation response [68]. CCL2/C-C chemokine receptor 2 (CCR2) signaling was suggested to play a significant role in diabetic nephropathy and in adipose tissue inflammation during insulin resistant. Blocking CCL2/CCR2 signaling not only improved blood glucose levels but also altered renal nephrin and VEGF expressions in type 2 diabetic mouse model [69]. Butcher et.al. showed that type 2 diabetic islets displayed higher CCL2 expression than healthy islets [70]. Polymorphism of Akr1b1 was associated with diabetic nephropathy and type 2 diabetes mellitus by two GWAS studies [71,72]. Pik3c2a played a critical role in insulin secretion in β cells and down-regulation of PI3K-C2α might be a feature of type 2 diabetes [73]. Eef2k was a kinase of Eef2 and renal cortical homogenates from db/db mice in early stage of type 2 diabetes showed decrease in Eef2 phosphorylation and increment in Eef2 kinase phosphorylation [74]. Carboxypeptidase D (Cpd) and carboxypeptidase E (Cpe) belonged to same family of enzymes and defects in Cpe could lead to β-cell stress and hyperproinsulinemia, both of which were features of type 2 diabetes in GK rat [75]. Chu KY et al. also found that Cpd was significantly up-regulated by elevated glucose or low doses of insulin [76].

Conclusion
We presented a comprehensive analysis pipeline of re-sequencing study with general case-control study design. Besides identifying some prior T2d genes with mutations, we found 8 new candidate genes which required further wet-lab experimental evaluation and validation. As a bioinformatic analysis of NGS data, our workflow could be adopted in other re-sequencing study of organism with disease model.

Discussion
Rodents have been used to model human diseases because of their similarity in genome and physiology. GK is a classic rat T2D model, which is obtained by selective inbreeding of Wistar rats. GK/Slac and Wistar/Slac rats have been bought from a commercial company (www. slaccas.com), which import rat strains from the place of production and then bred locally in China. These two strains have been widely used by Chinese investigators [77][78][79][80][81][82]. Our work provides the whole-genome sequences of GK/Slac strain and Wistar/Slac strain for the first time. This sequencing dataset will be very useful for the researchers who use these two strains as study objects. It is also an important complement to the Rat Genome Database (RGD) [83], which include the international sequencing resources of different rat strains. Comparing the whole genome sequence of T2D phenotypic GK rats with the corresponding Wistar rats provides insights into the genomic evolution of GK during the laboratory selective inbreeding for developing the insulin resistant T2D phenotype. In the light of sequencing technology, the genetic difference between T2D GK and control Wistar rats is easy to identify. Many years of selective inbreeding in the laboratory makes these genetic variants are correlated to a consistent phenotype. Such advantages make the GK rat an ideal model to discover T2D causative genes. Here we studied the genomes of GK and Wistar rats by both sequencing and computational strategies. We have combined two sequencing platforms with different read lengths and insert-sizes. The reads are processed with stringent quality control to obtain accurate high-quality variants. In order to identify the causal variants of T2D phenotype, we used Wistar strains as background to screen GK specific variants, which not only are required to present consistently homozygous in both our and public GK rat samples, but also are absent in either our Wistar sample or any Wistar derived samples from the public databases. Then we selected high-confidence affected genes by integrating T2D prior knowledge, protein interaction and gene expression data. Finally we got fifteen genes with homozygous SNVs in GK rats and their functions are related with T2D phenotype. The integration of public resources and prior knowledge can increase the power of detection and narrow down the scale of candidates. Our data mining approach would inspire similar bioinformatic studies for disease animal model.
Our analysis focus on variants that affect protein sequences so that variants in the intergenic or intronic regions are ignored due to the lack of function annotations for these regions in rat genome. The understanding of noncoding regions may be improved by more studies on translational regulation and evolutionary conserved regions. We also observed that GK strains and Wistar strains coming from different laboratories have slightly genetic difference (Fig 4B, Fig  5), showing it is important to use biological repeats even for inbreed organisms. With decreased sequencing cost and improved computational ability, it is possible to sequence more samples to increase analysis power. The whole-genome sequencing-based disease study will be extended to other disease models and our approach can be used as an example to study these disease model organisms.

Sample preparation
One male GK/Slac rat and one male Wistar/Slac rat were obtained from SHANGHAI SLAC LABORATORY ANIMAL CO. LTD (www.slaccas.com). The rats were anesthetized by formalin at the age of 8 weeks, and the blood was taken from the pericardia with anticoagulant. Genomic DNA was then isolated using DNeasy Blood & Tissue Kit (Qiagen, p/n69504). All animal experiments were approved by the Biomedical Research Ethics Committee of Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences (IRB00005813).
Whole genome sequencing and data preprocessing GK/Slac and Wistar/Slac rat DNA samples were sequenced by ABI SOLiD and Illumina Solexa paired-end sequencing technologies. To increase the coverage of genome, three SOLiD sequencing libraries and one Solexa library were constructed with different read length and insert length (S1 File). All sequence reads were deposited in the European Nucleotide Archive under accession number PRJEB6678.
Read quality was assessed by per base sequence quality, per sequence quality score, per base N content and overrepresented sequences using software FastQC (http://www.bioinformatics. babraham.ac.uk/projects/fastqc/). Low quality reads were filtered by stringent criteria: 1) removing overrepresented adaptor found by FastQC using FASTX-Toolkit (http://hannonlab. cshl.edu/fastx_toolkit/), 2) removing N base and low quality base (Phred quality score was below 20), 3) removing reads that's shorter than 15 bp or paired read was filtered.

Mapping to reference genome
In order to combine data from two different sequencing platforms, we transferred ABI SOLiD color space encoding data and their quality file to Solexa base space encoding format (Fastq file). Then we aligned the high-quality reads to the BN rat reference genome (UCSC rn4 / Baylor HGSC Build 3.4) by Bowtie 2 with default parameters [84]. The coverage proportion of reference genome and estimated genome were calculated by the following formula: reference coverage proportion ¼ number of bases that0s covered by at least 1 read total bases in reference genome genome coverage proportion ¼ number of bases that0s covered by at least 1 read estimated number of bases in the genome We further filtered the mapping results to increase the accuracy of variant calling. Firstly, we did local realignment around known indels using Smith-Waterman algorithm. Then we removed duplicate reads to reduce amplification bias. Lastly we recalibrated base quality depending on the reference genome and dbSNP information. These three main processes were done by GATK (The Genome Analysis Toolkit) [39], and PICARD TOOLS and SAMTOOLS [85] were used to sort the bam file, fix mate pair information and do format transformation, which facilitated the GATK running process.

Variants calling and comparison
After pre-mapping and post-mapping quality control, remained bam files were used to call variants: single nucleotide variant (SNVs), small insertion and deletion (indels), structural variation (SVs), and copy number variations (CNVs).
Small indels and SNVs were called by the UnifiedGenotyper module in GATK software and filtered by following filtering criteria: minimum number of consensus is 5, minimum base quality required to consider a base for calling is 17. Furthmore, we filtered the candidate small indels by criteria: minimum depth (DP) is 8 and allele number (AN) is 4. We filtered SNVs by criteria: minimum depth (DP) for each allele in per sample is 10, allele number (AN) is 4, minimum base quality is 30, minimum qual by depth (QD) is 5, maximum mapping quality zero (MQ0) is 4, removing SNVs that were located on indel regions or in SNV cluster regions (defined by 3 SNV calling in a 10 bp window).
DELLY was used to detect structural variants from discordantly mapped read pairs [86]. The predicted SVs were classified as four groups: deletions, Inversions, tandem duplications and translocations. To avoid false-positive SVs in GK, DELLY was run with "-p" option that combined discordant alignment with split-read to get higher confident SVs. Then GK SVs were compared with Wistar SVs to get GK/Slac specific SVs.
The software BIC-seq [87] was used to detect copy number alterations between GK strain and Wistar strain. Differential CNVs were selected by two criteria: a) ratio of mapped reads number between GK and Wistar is greater than 2; b) Bofferoni adjusted p-value is smaller than 0.01.

Variant density and function annotation
To illustrate the genome-scale difference between GK/Slac and Wistar/Slac, we analyzed the density and distribution of GK/Slac specific variants. Reference genome were segmented into 1Mb bins and variant density was defined as the number of variants in each bin divided by the number of nucleotide bases in the same bin that were covered by at least three reads in GK sequencing.
ANNOVAR was used to annotate SNVs/indels to gene region (exonic, splicing, ncRNA, UTR, intronic, up/down-stream, and intergenic). Functional impacts of exonic SNVs/indels were further classified as synonymous, nonsynonymous, stopgain, stoploss, and frameshift indels. SIFT was used to predict whether a nonsynonymous SNV affects protein function. SVs and CNVs were compared with gene annotation to get effected genes.

Functional analysis of potential candidate genes
Candidate gene lists were further filtered by integrating other information: T2D prior genes, protein-protein interaction, and differential gene expression.
We manually curated T2D related genes from published literatures and human GWAS catalog (http://www.genome.gov/admin/gwascatalog.txt) [90]. Totally, 506 T2D related genes were collected as prior genes (S6 File), including T2D susceptible genes, genes involved in important T2D pathways (such as insulin signaling pathway, adipocytokine signaling pathway), and genes associated with T2D related traits. Genes with GK/Slac specific variants were compared with T2D prior genes to narrow down the candidate gene list.
Known and predicted protein-protein interaction were obtained from STRING database [91], which quantitatively integrates interaction data from previous knowledge, genomic context, high-throughput experiments and conserved gene co-expression. We only used interaction pairs whose score is higher than 0.4. For T2D candidate gene list, we counted the number of their interaction partners in rat genome and in 506 T2D prior genes. Then we used Fisher test to calculate P-value, and adjusted it by multiple test. Genes with P-value< = 0.05 were regarded as better candidate genes.
Gene expression data was downloaded from a GEO dataset GSE13271(http://www.ncbi. nlm.nih.gov/geo/query/acc.cgi?acc=GSE13271), which measured the expression of GK and Wistar in three tissues (liver, muscle and adipose) at five time points during T2D development [92]. To compare GK and Wistar, T-test and fold-change threshold was used to get significantly differentially expressed genes (P-value< = 0.01 and fold-change>2); R package DCGL 2.0 [93] was used to mine differentially co-expressed genes (P-value< = 0.05 after Bonferroni correction). Results of each time point were combined to get the final differential gene list. S8 File. Progress for selecting potential T2D candidate genes. (A) 300 GK/Slac specific PAVs in 252 genes, which are homozygous mutant locus in GK/Slac strain but not in Wistar derived strains [46]. (B) After removing 60 ORs genes, there are 228 GK/Slac specific PAVs in 192 genes. These genes are analyzed by the following steps: 1) comparison with T2D prior genes; 2) differential (co-)expression between GK and Wistar rats in liver, muscle or adipose; 3) protein-protein interaction with T2D prior genes.

Author Contributions
Conceived and designed the experiments: LL YYL YXL. Analyzed the data: TCL HL. Wrote the paper: TCL HL. Performed phylogenetic analysis: GHD ZW. Performed data preprocessing: YQC.