Deep sequencing of HBV pre-S region reveals high heterogeneity of HBV genotypes and associations of word pattern frequencies with HCC

Hepatitis B virus (HBV) infection is a common problem in the world, especially in China. More than 60–80% of hepatocellular carcinoma (HCC) cases can be attributed to HBV infection in high HBV prevalent regions. Although traditional Sanger sequencing has been extensively used to investigate HBV sequences, NGS is becoming more commonly used. Further, it is unknown whether word pattern frequencies of HBV reads by Next Generation Sequencing (NGS) can be used to investigate HBV genotypes and predict HCC status. In this study, we used NGS to sequence the pre-S region of the HBV sequence of 94 HCC patients and 45 chronic HBV (CHB) infected individuals. Word pattern frequencies among the sequence data of all individuals were calculated and compared using the Manhattan distance. The individuals were grouped using principal coordinate analysis (PCoA) and hierarchical clustering. Word pattern frequencies were also used to build prediction models for HCC status using both K-nearest neighbors (KNN) and support vector machine (SVM). We showed the extremely high power of analyzing HBV sequences using word patterns. Our key findings include that the first principal coordinate of the PCoA analysis was highly associated with the fraction of genotype B (or C) sequences and the second principal coordinate was significantly associated with the probability of having HCC. Hierarchical clustering first groups the individuals according to their major genotypes followed by their HCC status. Using cross-validation, high area under the receiver operational characteristic curve (AUC) of around 0.88 for KNN and 0.92 for SVM were obtained. In the independent data set of 46 HCC patients and 31 CHB individuals, a good AUC score of 0.77 was obtained using SVM. It was further shown that 3000 reads for each individual can yield stable prediction results for SVM. Thus, another key finding is that word patterns can be used to predict HCC status with high accuracy. Therefore, our study shows clearly that word pattern frequencies of HBV sequences contain much information about the composition of different HBV genotypes and the HCC status of an individual.


Genotype NGS reads using STAR 1
STAR is a widely used software package that can identify the genotype as well as recombination of a specific HBV sequence read. STAR assigns a read eight normalized Z scores corresponding to genotypes A to H. If the highest Z score is above 2, the read is predicted to have the genotype corresponding to the highest Z score according to the recommendations given in 1 . Otherwise, STAR uses a sliding window approach of 150bps by step 1bp to detect recombination and assigns the read as recombinant from more than one genotypes. For each sample, we calculated the fraction of recombinant reads. Figure S1 shows the histogram of the fraction of recombinant reads for the 139 samples under study. About 95% of the samples (132/139) have the fraction of recombinant reads below 5%, and only 3 out of 139 samples have more than 20% recombinant reads.

Genotype NGS reads using jpHMM 2
In order to see how the results of our study depend on different genotyping software tools, we used another program jpHMM to genotype the NGS reads. jpHMM uses a jumping hidden Markov model to identify regions of a read potentially belonging to different genotypes. The computational speed of jpHMM is much slower than that of STAR. In our study, if the length of a consecutive region of a read having the same genotype is above 400 (the length of the whole region without insertions or deletions in our study is 457 bp), the read is defined as having that genotype. Otherwise, the read is considered as recombinant. For each sample, we calculated the fraction of recombinant reads and plotted the histogram of the fractions of recombinant reads of the 139 samples as shown in Figure 2(a). We also calculated the fraction of reads having genotype B among the reads having genotype B or C and compared with the fraction calculated using STAR as shown in Figure S2(b). It can be seen from the figure that most dots scatter around the line y = x and there are a few dots deviate from y = x. Since the results come from different software tools and different criteria were used for genotyping, some differences between the two fractions using STAR and jpHMM were expected. On the other hand, the figure clearly shows that the two fractions from STAR and jpHMM are highly associated and our results are robust and consistent to different genotyping tools. The relationship between the fractions of genotype B using jpHMM and STAR. For STAR, we only considered reads having score above 2.0. For jpHMM, only reads with at least 400 bps consecutive region of the same genotype were considered. All fractions were normalized such that the sum of genotypes B and C is 1. Each dot corresponds to a sample. Figure S3(a) shows the histograms of fraction of genotype B reads among the 94 HCC and 45 CHB samples, respectively, and Figure S3(b) shows the relationship between the ratio of the fraction of HCC individuals in the bin over that of the CHB individuals and the fraction of genotype B reads based on the genotyping results using jpHMM. In Figure S3(b), we merged some bins as there are too few samples in those bins, which is the same to Figure 1 in the main text. Again we showed that the probability of having HCC decreases with the fraction of genotype B reads.  Figure S4. PCoA plots based on the 94 HCC and 45 CHB patients. The distance matrix is calculated based on the Manhattan distance between the frequency vectors of word patterns of length (a) k = 6 and (b) k = 8, respectively. Color shows the fractions of genotypes B and C reads based on the jpHMM genotyping results. Red represents 100% genotype B and blue represents 100% genotype C. Reference B and C sequences are also added on the figures as references. The relationship between the first principal coordinate and the fraction of genotype B calculated using jpHMM, (c): k = 6, (d): k = 8. Figure S4 (ab) show the PCoA plots of the 139 patients marked with colors indicating the fractions of genotypes B and C calculated using jpHMM. We can see that the first principal coordinate increases with the fraction of genotype B reads. Figure S4 (cd) show the relationship between the first principal coordinate and the fraction of genotype B calculated using jpHMM. It is clear that the dots scatter around the line y = x such that they are highly correlated. The results are consistent with the results shown in Fig 2 in the main text based on STAR.
We further calculated the Spearman and Pearson correlations between the first principal coordinate using PCoA and the fraction of genotype B calculated using jpHMM. The results are shown in Table S2. It shows that both the Spearman and Pearson correlations increase and become stable with the increase of word length k, consistent with the conclusions from Table 1 in the main text based on STAR.

Results based on selecting a subset of words for predicting HCC status
Tables S3-S8 give the prediction results using selected words with different values of α = 0.05, 0.01, 0.001, respectively. Although selecting subsets of words can give better results for cross-validation, the prediction results for independent data are worse. 4 Results based on trimming data using quality score threshold of 30

α=0.05
To see the effect of trimming the reads with different quality score thresholds on our results, we also trimmed the paired-end reads using quality score 30. We then calculated the word frequencies of the samples and re-did the analyses. Table S9 gives the correlation coefficients between the first principal coordinate of PCoA and the fraction of genotype B reads. Here the Manhattan distances were computed using the new word frequencies of different word length k after trimming the paired-end reads using quality score 30. The fraction of genotype B reads was calculated using STAR. The correlation coefficients increase with word length, consistent with our conclusion shown in Table 1 in the main text based on quality score threshold of 20. Tables S10-S11 give the prediction results using word frequencies after trimming the paired-end reads using quality score threshold of 30. The prediction accuracies are close to but lower than the results based on quality score threshold of 20 shown in Tables 3-4 in the main text. As quality score 30 is a higher standard that may result in shorter reads, much more information could be lost considering the relatively short region (457 bp) in our study.  Figure S6 gives the histograms of read lengths before and after linking using quality score threshold of 20 (S6 (a)) and 30 (S6 (b)), respectively. Under the scenario of Q20, the highest bars for R1, R2 and the linked data are 251-300, 151-200, and 451-500, respectively. This shows that the overall quality of R1 reads is higher than that for the R2 reads. About 40% of the paired-end reads were successfully linked. Under the scenario of Q30, most reads in the R1 file still have length 251-300 bps but for the R2 file the most frequent bin is 101-150 bps, which indicates that the reads in the R1 file have relatively higher quality than reads in the R2 file. Under the scenario of Q30, the reads of the R1, R2 and linked files all tend to be shorter as expected, because Q30 is a more stringent criteria for trimming the reads. Also, only about one fourth of the reads are successfully linked.