Parenclitic Network Analysis of Methylation Data for Cancer Identification

We make use of ideas from the theory of complex networks to implement a machine learning classification of human DNA methylation data, that carry signatures of cancer development. The data were obtained from patients with various kinds of cancers and represented as parenclictic networks, wherein nodes correspond to genes, and edges are weighted according to pairwise variation from control group subjects. We demonstrate that for the 10 types of cancer under study, it is possible to obtain a high performance of binary classification between cancer-positive and negative samples based on network measures. Remarkably, an accuracy as high as 93−99% is achieved with only 12 network topology indices, in a dramatic reduction of complexity from the original 15295 gene methylation levels. Moreover, it was found that the parenclictic networks are scale-free in cancer-negative subjects, and deviate from the power-law node degree distribution in cancer. The node centrality ranking and arising modular structure could provide insights into the systems biology of cancer.


Introduction
Epigenetic information is stored in a genome in the form of heritable modifications to the chemical structure of DNA, such as methylation of CpG di-nucleotides, and a number of chemical modifications of histone proteins. It can be modulated during the lifetime of an organism by environmental signals [1][2][3], and these changes persist in subsequent mitoses, as an acquired change of phenotype. However, besides the fundamental role of epigenetics in mediating environmental effects on the genome, it leaves a backdoor for environmental risk factors.
In particular, variations in DNA methylation (DNAm) accompany the early stages of human carcinogenesis [1], and could, therefore, manifest as quantitative signatures of such illnesses or as a risk of their development [4][5][6][7]. Due to the huge number of individual CpG sites (of the order of 10 5 ) at which methylation levels are assessed, there is a substantial interest in developing aggregate measures, so as to reduce the dimension of the mathematical problem, whilst still taking account of the key genomic effects of cancer. a1111111111 a1111111111 a1111111111 a1111111111 a1111111111 Recently, it has been demonstrated that gene-specific measures of DNA methylation, such as mean, variance, mean derivative and suchlike reflect cancer-related changes and enable differentiation between normal and tumour tissue samples [8]. Furthermore, a method to construct a network representation of the data was proposed, with genes taken as nodes. Edge weights then quantify the extent to which the methylation profiles of these genes covary in the same way as the healthy population profiles do [9]. Some edges of this network representation were shown to be associated with survival outcome for patients with different types of cancer. It was also found that natural groupings of these prognostic edges could be identified as subnetwork modules, relevant to a number of biological functions. This indicated that epigenetic network models and measures do not just technically reduce the complexity of a computational problem, but naturally reflect intrinsic collective behaviour and interactions of such groups of genes.
Inspired by these findings, we address the problem of constructing epigenetic data networks and identifying network measures for distinguishing between normal and cancer cells. We seek a solution implementing recently developed parenclitic network analysis [10,11]. This approach identifies generic biomedical measurements with nodes and specifies that an edge exists between each pair of nodes if their values for a particular subject are significantly different from the linear regression model for a control "healthy" group (or weight of the edge is proportional to the mismatch of the regression model). In result one obtains a network for each subject, which properties are expected to be different in health and disease.
We demonstrate that parenclictic networks built on DNAm data correctly represent source data, therefore classifiers based on routine network measures (average node degree, diameter, etc.) can produce approximately 93−99% accuracy. The statistics of parenclictic networks for healthy tissues exhibits power-law tails in the node-degree distributions, that indicates substantial natural fluctuations in methylation levels. Remarkably, cancer modifications in DNAm induce a qualitative change in network topology: heavy-tailed too, they show marked deviations from power-law scaling. Exceptions are found in one case only, where the network architecture does not change noticeably and at the same time the performance of the respective classifier drops to about 90%.

Methods Data
Methylation data, collected via the Illumina Infinium Human Methylation 450 platform, were downloaded from The Cancer Genome Atlas (TCGA) project [12] at level 3. Data were obtained from ten different healthy and tumour tissues: Bladder Urothelial Carcinoma  Table 1. Raw data were pre-processed as described in [8], briefly summarised as follows. First, probes were removed if they have non-unique mappings or map to SNPs (as identified in the TCGA level 3 data); probes mapping to sex chromosomes were also removed; in total 98384 probes were removed in this way from all data sets. After removal of these probes, 270985 probes with known gene annotations remained. Individually for each data set, probes were then removed if they had less than 95% coverage across samples; probe values were also replaced if they had corresponding detection p-value greater than 5%, by KNN (k nearest neighbour) imputation (k = 5). Methylation level has been considered across the whole gene, as a way of simplifying the large quantity of data. And whilst some information will undoubtedly be lost by rescaling all methylation values into range [0, 1] and characterising each gene by the average level of methylation, we still found that changes in methylation level across the whole gene were indicative of disease. In future work, it will be important to refine the analysis by considering mean methylation across specific functional regions within the gene, such as the promoter, the first exon, et cetera. The total number of genes of interest was 15295.

Construction of networks
We utilize the parenclictic network approach [10,11] in order to construct and analyze graphs from gene methylation data. The resulting network is a complete weighted graph: each vertex corresponds to a specific gene, and edge weight is proportional to the variation of methylation levels in specific gene pair in cancer-positive and negative phenotypes. The procedure is aimed at unveiling hidden relations between methylation levels of gene pairs and discovery of global dependencies. The procedure originally applied to different biomedical data [10,11] includes the following steps: 1. Select a control group from healthy tissue samples.
2. Adjust methylation levels for each pair of genes, m i and m j , to a linear regression based on the control group subjects: where α i, j and β i, j are regression coefficients.
3. Build complete weighted graph for each cancer-positive and negative sample, excluding the control group, such that each vertex corresponds to a particular gene, and edges are weighted according to where x i and x j are respective methylation levels, and σ i, j is the standard deviation of errors in the linear regression model for control objects Eq (1).
This process is illustrated in Fig 1 through the use of the two genes ZFP106 and NEUROD1 for BRCA data. Each point corresponds to gene methylation levels in the control group Table 1. Number of samples obtained from normal tissue (healthy subjects) and tumour tissue (cancer subjects), see the text for abbreviations. The data were downloaded from TCGA portal.

Type of cancer Number of healthy subjects Number of cancer subjects
(green), and other BRCA-negative (blue) and BRCA-positive (red) subjects. The data points for the subjects with the disease are substantially more distant from the linear regression model (solid line), as compared to the data points for the healthy ones. Thus at least some of the edge weights in the parenclictic networks for the former group will be considerably greater, potentially introducing detectable modifications in global network characteristics [10,11]. Linear regression approach is quite clear and gives good results in many cases. However, at least for data of interest, it does not always yield a decent estimation of the closeness between the control and cancer subjects. Indeed, changing one of the genes in the example above, that is, considering ZFP106 and TRIM9 methylation levels in BRCA data, we find that all three of the clusters (control, BRCA-negative and positive) match the linear regression model (1) well (see Fig 2). Consequently, for all sample classes the ZFP106-TRIM9 edge weights Eq (2) will, typically, be relatively small and of the same order of magnitude. At the same time, the BRCAnegative and positive clusters are visibly distinct.
To overcome this problem we implement the Mahalanobis distance [13], which, essentially, measures separation between data sets. In particular, instead of Eq (2) the edge weight w i, j is caclulated as: where, as before, x i and x j are gene methylation levels in an investigated sample, μ i and μ j are the gene methylation levels in the control group, , and S i, j = cov(m i , m j ). In this way, the abnormal modifications in methylation of gene pairs are better captured, which, in turn, improves sample classification accuracy for our data by 1−3%.

Metrics of network topology
The massive number of edges in constructed graphs makes a straightforward solution of machine learning classification problem intractable in practice, also manifesting a huge imbalance between the number of features and available samples. Therefore, we utilize a number of topology metrics, widely used for characterising complex networks [14][15][16], appropriately generalised for weighted graphs G ≔ (V, E) with |V| vertices and |E| edges: • Node degree deg(v) as the sum of incidental edges weights.
• The distance d(v i , v j ) between the nodes v i, j 2 V, defined as the sum of the edge weights in the shortest path.
• The diameter of the graph G as the maximal distance between a pair of vertices.
• The degree centrality C D (G) of the graph G defined as the normalised graph degree centrality H(G) which is based on the node degree centrality C D ðvÞ ¼ degðvÞ jVj , and where v Ã is the node with the maximal degree centrality, and H max = (|V| − 1)(|V| − 2) is the maximal graph degree centrality, obtained for the star topology.
• Graph efficiency E C (G) defined as [17] E C ðGÞ ¼ C C ðGÞ jVjðjVj À 1Þ ; ð6Þ based on the graph centrality measure • Betweenness centrality C B (v k ) of a node as the number of the shortest paths the particular node v k belongs to: where σ v i ,vj is the number of the shortest paths between the nodes v i and v j , among which These quantities in one way or another should reflect the expected differences between the sample classes. For instance, increasing separation of data sets produces greater edge weights and may result in substantial decrease of the graph diameter. Likewise, nodes with large centrality scores signify their key role is class distinguishing and give us a hint at biological importance.

Results and Discussion
As we can see from the examples of the pairwise gene methylation level diagrams, there exists intrinsic variability in both healthy and tumour tissue samples (see Figs 1 and 2). Therefore, parenclictic networks built from both data classes should have random-like and complex structure. To get a general idea on the appearance of the resulting networks, we present indicative examples of BRCA-positive and negative donors, plotting the 1000 edges with largest weights and incident nodes (Fig 3). Remarkably, for cancer subjects they typically comprise several star-type subgraphs, with the most abnormally methylated genes as vertices. At the same time, healthy subjects yield considerably more homogeneous networks. Distinguishing between cancer-positive and negative DNAm profiles is implemented by standard machine learning classification algorithms, namely, Support Vector Machine (SVM) and Random Forest (RF) [18,19]. The binary classifiers are trained on 12 topological indices calculated for parenclictic networks as described in Methods: • mean, variation and maximal values of edge weights, • mean, variation and maximal values of vertex degree, • mean and variation of shortest path lengths, • diameter of graph, • degree centrality, • efficiency, • and betweenness centrality.
In addition, RF was also trained on the original gene average methylation level data. All code was written in Python 3. For classification we have used Python scikit-learn package.
In order to assess the performance of our classifiers we used the two-step cross-validation technique. The first step included cross-validation for selecting a control group from healthy tissue samples. For data sets where the number of healthy subjects was less than 50 instances, we applied 2-fold cross-validation; otherwise, 4-fold cross-validation was used. The second step involved factual 10-fold cross-validation for every data set with topology indices.
Limited by the amount of data, we couldn't make use of AUC as a classification metrics. Indeed, a quarter or a half of healthy patients data were put aside after the first cross-validation step. During the 10-fold cross-validation at the second step, the data were divided further and, after all, each fold contained one or two healthy patients at best. In result, AUC could be calculated in rare cases only.
The results, summarized in Tables 2 and 3, demonstrate an excellent performance of classifiers trained on network measures for almost all kinds of cancer, with accuracy in the range 93 −99%. Interestingly, the performance does not manifest any considerable dependence on the type of classification algorithm in most cases, with the only noticeable dissimilarity of RF observed for BRCA, PRAD and THCA groups. We suggest that the accuracy of classification would not depend strongly on the particular choice of machine learning algorithm. Comparing performance of the RF classifier built on the network measures and the original average methylation levels of 15295 genes, we do not notice systematic weaknesses of the latter. Moreover, in some cases, using network measures enhances classification results. Except for the case with PRAD, which we will discuss hereinafter, the network measures seem to incorporate cancer modifications of gene methylation levels very well, and the substantial reduction of the complexity of classification task does not impair the resulting accuracy. Another benefit of implementing classification on network measures is that in this case the number of features is considerably smaller than the number of samples, which prevents overfitting.
Let us investigate it in more detail, why the performance of the network measure classifiers for PRAD is slightly poorer, at about 87−90%. Clearly, the answer must be sought in the less pronounced differences between the classes in the global network characteristics. To get a deeper insight we analysed the node degree distributions for parenclictic networks constructed from the data, corresponding to different types of cancer. Remarkably, the results consistently revealed that for cancer-negative subjects the networks are scale-free, with the complementary cumulative degree distributions closely following a power-law (see Fig 4 for representative cases). On the contrary, parenclictic networks for cancer-positive subjects demonstrate pronounced deviations from power-law scaling, except for PRAD, where the node degree statistics does not change significantly and network measures give worse classification accuracy (Fig 4). This suggests that the further work here should primarily focus on modifying the network construction method itself, rather than introducing other network measures.

Conclusions
We demonstrated that the parenclictic network approach can be successfully implemented in order to obtain graph representations of gene methylation levels for cancer-positive and negative subjects. These graphs can be characterised by 12 global network measures, which provide a basis for binary classification by routine machine learning algorithms, Support Vector Machine and Random Forest. For almost all cancer types the performance of both algorithms remains much the same so that the particular choice does not seem crucial.
Comparing to performance of Random Forest classifier built on the original average methylation levels of 15295 genes does not reveal a substantial difference, with the accuracy reaching 90−99% for all the types of cancer. This means that a cardinal reduction to 12 features, topological indices, does not lead to the loss of important information. Yet another strong benefit of the considerably decreased complexity is the avoidance of overfitting, which could be a serious problem when using the original 15295 gene methylation levels as features, due to the relatively small number of samples. Finally, network analysis does not only allow identifying cancer methylation profiles but provides additional insight into the systems biology of cancer. That is, the nodes with high centrality are good candidates for playing a significant role in cancer development. Strong deviations from scale-free node degree distributions for almost all cancer types is a hallmark of the global changes in the network topology induced by cancer, which remains to be understood. Another open question is the modular properties of the arising parenclictic networks, observed by naked eye inspection, and their correspondence to biological functions.