Statistical methods for classification of 5hmC levels based on Illumina Infinium HumanMethylation450 (450k) array data under paired bisulfite (BS) and oxidative bisulfite (oxBS) treatment

5-Hydroxymethylcytosine (5hmC) is a well-known epigenetic mark that is involved in gene regulation and may impact genome stability. To investigate a possible role of 5hmC in cancer development and progression, one must first be able to detect and quantify its level. In this paper, we address the issue of 5hmC detection at single-base resolution, starting with the well-established 5hmC measure Δβ and, in particular, with an analysis of its properties, both analytical and empirical. We then propose several alternative hydroxymethylation measures and compare their properties with those of Δβ. In the absence of a gold standard, the (pairwise) resemblance of those 5hmC measures to Δβ is characterized by means of a similarity analysis and a relative accuracy analysis. All results are illustrated on matched healthy and cancer tissue data sets obtained by means of bisulfite (BS) and oxidative bisulfite (oxBS) conversion procedures.


Prevalence of positive results
Here we compare the prevalences of positive results estimated sample-wise on raw and normalized data. First, Table 1 (mostly) confirms a reduction of the 5hmC levels in cancer tissue compared to healthy tissue. In particular, in terms of the 5hmC measure ∆β(100), such a reduction was observed on both raw and normalized data. On the other hand, in terms of the measure ∆m∞, only two normalized data sets, funNorm and Illumina, confirmed this reduction of the 5hmC levels. Finally, for the measure ∆h, no significant reduction of the 5hmC levels in cancer tissue was detected, either on raw or on normalized data.
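To make the notion of a sample-wise prevalence concrete, the sketch below computes, for a single sample, the per-probe difference ∆β = β_BS − β_oxBS (the standard definition of ∆β) and the fraction of probes called positive. The calling rule used here (a fixed cutoff on ∆β), the function names and all numbers are hypothetical placeholders; the actual classification rules behind ∆β(100), ∆m∞ and ∆h are defined elsewhere in the paper and are not reproduced here.

```python
def delta_beta(beta_bs, beta_oxbs):
    """Per-probe difference beta_BS - beta_oxBS for one sample."""
    return [bs - ox for bs, ox in zip(beta_bs, beta_oxbs)]


def prevalence_of_positives(values, cutoff):
    """Fraction of probes whose 5hmC measure exceeds the cutoff.

    The cutoff rule is an assumption made for illustration only.
    """
    calls = [v > cutoff for v in values]
    return sum(calls) / len(calls)


# Hypothetical beta values for four probes of one sample.
beta_bs = [0.80, 0.50, 0.30, 0.60]
beta_oxbs = [0.60, 0.50, 0.35, 0.45]

d_beta = delta_beta(beta_bs, beta_oxbs)
print(prevalence_of_positives(d_beta, cutoff=0.1))
```

Repeating this computation for every sample yields one prevalence per sample and data set, which is the quantity compared between tissues and normalizations in this section.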
Further, when comparing the sample-wise prevalence of positive results of ∆β(100) on raw and normalized data, no significant difference in prevalence was observed on healthy tissue. On cancer tissue, the prevalence of positive results of ∆β(100) estimated on raw data exceeded the corresponding prevalence estimated on SWAN data (paired Wilcoxon test; p = 0.023, sample estimate of the pseudomedian 0.01) and on Illumina data (paired Wilcoxon test; p = 0.041, sample estimate of the pseudomedian 0.02). In terms of the 5hmC measure ∆m∞, the prevalence of positive results estimated on raw data always exceeded the corresponding prevalence estimated on normalized data, on both healthy and cancer tissue. The same holds for the measure ∆h, although only on cancer tissue.
On healthy tissue, the prevalence of positive results estimated on funNorm data is significantly lower than the corresponding prevalence estimated on Illumina data, both for the measure ∆m∞ (paired Wilcoxon test; p = 0.006, sample estimate of the pseudomedian −0.04) and for ∆h (paired Wilcoxon test; p = 0.021, sample estimate of the pseudomedian −0.04).

Table 1. ... (the panel (a)), ∆m∞ (the panel (b)) and ∆h (the panel (c)) were considered. The p-values in the tables correspond to the p-values provided by the applied test; the pm-values are the corresponding pseudomedian estimates. For instance, the first row of the table (a) shows a significant difference in the prevalence of positive results of ∆β on healthy and cancer tissue (p-value < 0.001) when estimated on raw data; the corresponding pseudomedian estimate is equal to 0.04. On the other hand, the first row of the table (b) shows no significant difference in the prevalence of positive results of ∆m∞ on healthy and cancer tissue when estimated on raw data (p-value > 0.05).
On cancer tissue, the prevalence of positive results of ∆β(100) estimated on funNorm data exceeds the corresponding prevalence estimated on Illumina data (paired Wilcoxon test; p = 0.002, sample estimate of the pseudomedian 0.03). In terms of ∆m∞, the prevalence of positive results estimated on SWAN data is the highest, compared to the corresponding prevalences estimated on funNorm data (paired Wilcoxon test; p = 0.001, sample estimate of the pseudomedian 0.13) as well as on Illumina data (paired Wilcoxon test; p = 0.021, sample estimate of the pseudomedian 0.1). This result remains true for the 5hmC measure ∆h as well.
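The comparisons in this section report a paired Wilcoxon p-value together with a pseudomedian estimate. As a rough illustration of what these two numbers are, the following sketch computes the signed-rank statistic with an exact two-sided p-value, and the Hodges-Lehmann estimate (the median of the Walsh averages of the paired differences), which is the usual estimator of the pseudomedian. The function names and the prevalence values are hypothetical, the sign convention (first data set minus second) is an assumption, and the software used for the reported analyses is not stated in this section, so details such as the handling of ties and zeros may differ.

```python
from statistics import median


def signed_ranks(diffs):
    """Return the non-zero differences and the midranks of their absolute values."""
    nonzero = [d for d in diffs if d != 0]  # zero differences are dropped
    order = sorted(range(len(nonzero)), key=lambda i: abs(nonzero[i]))
    ranks = [0.0] * len(nonzero)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and abs(nonzero[order[end + 1]]) == abs(nonzero[order[start]])):
            end += 1
        midrank = (start + end) / 2 + 1  # mean of the 1-based ranks start+1..end+1
        for k in range(start, end + 1):
            ranks[order[k]] = midrank
        start = end + 1
    return nonzero, ranks


def wilcoxon_signed_rank(x, y):
    """Return (W_plus, exact two-sided p-value) for paired samples x and y."""
    # Rounding guards against floating-point noise creating spurious non-ties.
    diffs = [round(a - b, 10) for a, b in zip(x, y)]
    nonzero, ranks = signed_ranks(diffs)
    w_plus = sum(r for d, r in zip(nonzero, ranks) if d > 0)
    # Null distribution of W+: each rank enters the sum independently with probability 1/2.
    counts = {0.0: 1}
    for r in ranks:
        shifted = {s + r: c for s, c in counts.items()}
        for s, c in shifted.items():
            counts[s] = counts.get(s, 0) + c
    center = sum(ranks) / 2
    deviation = abs(w_plus - center)
    extreme = sum(c for s, c in counts.items() if abs(s - center) >= deviation)
    return w_plus, extreme / 2 ** len(ranks)


def pseudomedian(x, y):
    """Hodges-Lehmann estimate: median of the Walsh averages (d_i + d_j) / 2, i <= j."""
    d = [a - b for a, b in zip(x, y)]
    walsh = [(d[i] + d[j]) / 2 for i in range(len(d)) for j in range(i, len(d))]
    return median(walsh)


# Hypothetical sample-wise prevalences on two versions of the same samples.
raw = [0.21, 0.25, 0.19, 0.30, 0.27, 0.24]
swan = [0.20, 0.23, 0.19, 0.27, 0.28, 0.22]
print(wilcoxon_signed_rank(raw, swan), pseudomedian(raw, swan))
```

Under this convention a positive pseudomedian indicates that the first data set tends to give the larger prevalence, and a negative one the opposite. The exact null distribution is practical only for small numbers of samples; for larger ones a normal approximation is commonly used instead.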

Similarity analyses
As discussed previously, pairwise similarity between any two 5hmC measures can be characterized by means of the similarity coefficient S. To check for a possible tissue dependence in similarities among the considered 5hmC measures, we compared the pairwise similarities on healthy and cancer tissue. The results are presented in Table 2. As that table shows, on both funNorm and Illumina data, similarities described in terms of S(∆β(100), ∆m∞) on healthy tissue exceed the corresponding similarities on cancer tissue; an analogous result holds for similarities measured by S(∆β(100), ∆h). On the Illumina data set, the similarity measured by S(∆h, ∆m∞) on cancer tissue exceeds the analogous similarity on healthy tissue (paired Wilcoxon test; p = 0.009, sample estimate of the pseudomedian −0.01). Altogether, when comparing pairwise similarities for the three considered 5hmC measures on healthy tissue, we observe the relation S(∆m∞, ∆h) > S(∆β(100), ∆m∞) > S(∆β(100), ∆h), on both raw and normalized data. An analogous result holds on cancer tissue.

Table 2. On pairwise similarity among three considered 5hmC measures, as estimated on healthy and cancer tissue. Analyses were performed sample-wise on raw and normalized data, by means of the paired Wilcoxon signed rank test; in terms of the similarity coefficient S, the quantities S(∆β(100), ∆m∞) (the panel (a)), S(∆β(100), ∆h) (the panel (b)) and S(∆h, ∆m∞) (the panel (c)) were analyzed. The p-values in the tables correspond to the p-values provided by the applied test; the pm-values are the corresponding pseudomedian estimates.
When comparing similarity coefficients estimated on the three normalized data sets, funNorm, SWAN and Illumina, no difference in the similarities described by the coefficient S(∆β(100), ∆h) was observed, on either healthy or cancer tissue. Further, on healthy tissue, the similarity characterized by S(∆β(100), ∆m∞) is the strongest on the funNorm data set (paired Wilcoxon test; p = 0.001, sample estimate of the pseudomedian 0.03), followed by the similarity estimated on the SWAN data set (paired Wilcoxon test; p < 0.001, sample estimate of the pseudomedian −0.036); an analogous result (with different p-values and pseudomedians) holds on cancer tissue as well. As for the similarity characterized by S(∆h, ∆m∞), we observed that on healthy tissue it is the strongest on SWAN data, whereas on cancer tissue it is the strongest on the Illumina data set.