Multilayer network analysis of miRNA and protein expression profiles in breast cancer patients

MiRNAs and proteins play important roles in different stages of breast tumor development and serve as biomarkers for the early diagnosis of breast cancer. A new algorithm that combines machine learning algorithms and multilayer complex network analysis is hereby proposed to explore the potential diagnostic values of miRNAs and proteins. XGBoost and random forest algorithms were employed to screen the most important miRNAs and proteins. Maximal information coefficient was applied to assess intralayer and interlayer connection. A multilayer complex network was constructed to identify miRNAs and proteins that could serve as biomarkers for breast cancer. Proteins and miRNAs that are nodes in the network were subsequently categorized into two network layers considering their distinct functions. The betweenness centrality was used as the first measurement of the importance of the nodes within each single layer. The degree of the nodes was chosen as the second measurement to map their signalling pathways. By combining these two measurements into one score and comparing the difference of the same candidate between normal tissue and cancer tissue, this novel multilayer network analysis could be applied to successfully identify molecules associated with breast cancer.


Introduction
Breast cancer is the second leading cause of cancer death among women and results in millions of new cases every year [1]. Often assuming regulatory roles in eukaryotic cells, miRNAs are small, non-coding RNAs of roughly 20~22 nucleotides that can bind to and inhibit protein coding mRNAs [2]. The expression profiles of miRNAs are correlated with cancer type, stage, and other clinical variables [3]. Therefore, miRNA expression profiling could be a useful tool for cancer diagnosis and prognosis. MiRNAs play important roles in almost all aspects of cancer biology, including proliferation, apoptosis, tissue invasion, metastasis, and angiogenesis [4]. miRNAs also play important roles in toxicogenomics and may explain the relationship between toxicant exposure and tumorigenesis. Previous work has identified 63 miRNA genes shown to be epigenetically regulated in association with 21 diseases, including 11 cancer types [4]. Many this biological multilayer network model exhibits the interrelationship between the miRNA and protein, thereby studying their combined action on cancer at different scales and levels.
To better understand their roles in the context of biological networks, miRNA and protein expression profile networks were constructed for both normal and breast tissues. Furthermore, due to the large number of miRNAs and proteins, in order to prevent analysis process from interference of unrelated variables, random forest model and XGBoost were applied to filter miRNAs and proteins before establishing a multilayer network. The filtered molecules were used as nodes in the network. Both threshold and MIC values between every two nodes determined the final structure of the multilayer network. Comparing the betweenness centrality of the node between health control and patient samples could lead to the novel finding of miR-NAs and proteins related to cancer. The miRNA expression data and protein expression data were obtained from the same patient. By considering the bias raised by different studies, it is highly recommended to use the data from the same study for the analysis. Other physiological factors in different individuals will affect the accuracy of the analysis results. In order to avoid this effect, we selected paracancerous normal tissues of the same individual as controls.

Data
Expression data of primary solid tumor and normal tissues were categorized. For both proteins and miRNAs, candidates with expression level with a value of zero are considered as noise and then filtered in the further analysis. After filtering, the miRNA and protein expression data dimensions ultimately used for analysis were 1182×320 and 925×147 (the number of miRNA tissue samples was 1182, and the number of miRNA species was 320; the number of protein tissue samples was 925, and the number of protein species was 147). ordinary decision tree, k attributes are first selected as candidate attributes, one of which is selected to divide the tree node. Given the number of miRNAs and proteins is large, to reduce computing costs, feature selection, one of the commonly data dimension reduction methods, was applied. It is based on a criterion that selects parts of original features that can best separate different types of samples. According to the feature evaluation strategy [26], feature selection algorithm can be divided into Filter and Wrapper which are two complementary methods that were combined to characterize the molecular expression levels of normal tissues and tumor tissues. The Filter method is independent of the machine learning algorithm that was subsequently adopted. This method calculates for each feature a statistic that can represent how well a feature has distinguished the sample. On the other hand, the Wrapper method randomly selects a subset of the feature set as a temporary feature set for the random forest model, wherein the set with the smallest prediction error and fewer feature numbers serves as the final feature set.

XGBoost algorithm
XGBoost [27] is similar to Boosting for accurate classification through gradient iterations of weak classifiers. In order to efficiently retrieve the best segmentation, the training data sets are sorted before training. As both miRNA data and protein data are labeled, it belongs to supervised learning model. In XGBoost, the best model is selected by applying the accuracy rate (or error rate) and the logistic loss as evaluation criterion. Then the best prediction is achieved for the known training data set and the test data set under the optimized evaluation criterion.
Similar to random forests, each classifier is also a decision tree. Consider this similarity between random forest and XGBoost, to avoid missing important cancer-associated molecules, the results of feature selection of these two algorithms were merged into one set as the final feature set. The construction of the decision tree in the random forest algorithm is independent, however, in the XGBoost algorithm, classifiers are not independent to each other, every latter classifer is optimized based on the classifer result of the previous one. Formally, the mathematical model of XGBoost can be presented as the following formula: Each icon denotes an analytical process. Icon 1 denotes data containing miRNAs and proteins. Icon 2 shows the feature selection process, which includes XGBoost and random forest algorithms. Icon 3 indicates calculation of the MIC value for every two nodes in the network, which represents the interaction between nodes. Icon 4 is the process that generates edges in the network by setting a specific threshold of MIC. Icon 5 represents the construction of a multilayer network. Icon 6 is the final step of analysis process that gives each node an importance score related to breast cancer. https://doi.org/10.1371/journal.pone.0202311.g001 whereŷ ðjÞ i represents the classification result of the first j-th classifier. The XGBoost algorithm adds a new function to the original model in each iteration. The reason to add a new function is to minimize the loss of the objective function. By minimizing the following objective function Obj(Θ) (t) as follow: where L(Θ) is the loss function to compute error of training set and O(Θ) is the regularization term to control complexity of the base classifiers. We used the method of Taylor series expansion to approximate the objective function: Where g i represents the first derivative of the function lðy i ;ŷ ðtÀ 1Þ i Þ, and h i represents the second derivative of the function lðy i ;ŷ ðtÀ 1Þ i Þ. The breast cancer data used in this paper are all classified as two categories, one is tumor tissue data and the other one is paracancerous normal tissue data. Logistic function was selected as the loss function for the model. For tree structure splitting in the training, each miRNA and protein representing a leaf node will split. The training samples on each leaf node will get a probability value that is fed back through the model. The loss function is applied in the training, so the effective splitting features and optimal splitting points can be obtained by the loss function before and after the leaf nodes are splitted. As long as the sample cumulative loss function on each leaf node is minimized, the loss function of the overall sample set is minimized. At the end, the final model with the smallest loss function can be obtained.

Maximal information coefficient
Mutual information has been widely used to find non-linear relationships between two variables. Reshef [28] proposed the method of Maximal Information Coefficient (MIC) based on mutual information. The primary advantage of MIC is that a broad correlation analysis can be captured on a sufficient number of statistical samples. M(X,Y) represents the population feature matrix of X,Y.
MðX; YÞ s;t ¼ I � ððX; YÞ; s; tÞ log minfs; tg where I(X,Y) is interactive information of X and Y, s,t are the number of divisions on the horizontal and vertical axes, s t < n 0.6 (empirical values), and n is the number of samples. For the miRNAs and proteins screened by the algorithm above, each miRNA or protein is treated as a node, and its expression level in different patients is the attribute of the node. In this study, to measure the correlation between any two molecules in the network, the MIC values between any two nodes were calculated. The greater the MIC is, the stronger the correlation is.

Multilayer network
Multilayer network is denoted by M = (G,C), where G = {G α ;α 2 {1,2,. . .,m}} is a set of single layer networks which is denoted as G α = (X α ,E α ). X α and E α is the set of nodes and edges belongs to the layer G α , respectively. C = {E αβ �X α × X β ;α, β 2 {1,2,. . .,m}, α 6 ¼ β} represents the set of edges that connects the nodes in different layers. Elements in C are called cross-layer connected edges. Element in E α is called the intra-layer node connection of M. The set of nodes in layer G α is denoted as: X a ¼ fx a 1 ; . . . ; x a N a g, and the adjacency matrix in layer G α is denoted as: The adjacency matrix in cross-layer E αβ is denoted as: In this study, a single layer network was established between miRNAs, and another single layer network was composed of proteins, which together constituted a two-tier multilayer network. The structure of a multi-layered network can sort out the internal interactions of the same kind of molecules while also taking into account the interactions of different kinds of molecules. Thanks to multilayer structure, in the process of cancer-associated biomarker recognition, the identification of a molecule will no longer be limited to the interaction of the same kind of molecules.

Betweenness centrality
The betweenness centrality [29] can measure the importance of nodes in the network. If the two network nodes, v i and v j are two non-adjacent nodes, the shortest path between them will pass through some nodes. If the certains nodes exists in many of these paths, one can infer that the node is relatively important. The betweenness centrality of node B k is represented as: where N represents the number of shortest paths. The betweenness centrality reflects the role of the node in the entire network and has a strong practical significance. In different networks, if the betweenness centrality of the same molecule is distinctly different, thereby indicating that this molecule (miRNA or protein) has played a significant role in the breast cancer. In this study we adopted the centrality function of MatLab to calculate the betweenness centrality of the nodes.

Importance score of nodes
To determine the importance score of nodes in the miRNA layer related to breast cancer, B miRNA normal ¼ fB miRNA normal;k g and B miRNA cancer ¼ fB miRNA cancer;k g were used to represent betweenness centrality of nodes of normal tissue and cancer tissue, respectively. In miRNA layer, difference in betweenness centrality of the same miRNA belongs to different tissues is taken as importance score of nodes related to breast cancer. We standardize B miRNA k as follows: Difference in betweenness centrality, denoted as B miRNA ¼ fjB miRNA normal;k À B miRNA cancer;k jg ¼ fB miRNA k g, σ(B miRNA ) represents the standard deviation of the B miRNA set, and E(B miRNA ) represents the mean of the B miRNA set. D miRNA normal;k represents the degree of node k in the miRNA network of normal tissue. Note that the calculation of degree here only considers cross-layer connected edges: Similarly, D miRNA cancer;k represents the same indicator of cancer tissue. Absolute value of difference of degree is D miRNA ¼ fjD miRNA normal;k À D miRNA cancer;k jg ¼ fD miRNA k g. We standardize D miRNA k as follows: Degree distribution is often power law distribution. However, by the Jarque-Bera test, fB miRNA normal;k À B miRNA cancer;k g can be considered as normal distribution when the MIC threshold is 0.35 (alpha = 0.05, p = 0.1592). Finally, the importance score of node k in the miRNA network is: In the same way, calculate the score of the protein molecule as S protein k .

Feature selection
XGBoost for feature selection. Because of the imbalance between positive and negative samples of miRNA and protein expression data, up-sampling was used to amplify positive samples. The leave-one-out method was used to train and validate the datasets. The error rate, logic loss, Root Mean Squared Error (RMSE) of the training and testing datasets (Fig 2a) gradually decrease in the model, and the Area under the Curve of ROC (AUC) (Fig 2a) gradually increases and stabilizes after 35 iterations. The error rates are 0.0005, 0.005; logical loss values 0.0098, 0.0274; the AUC values close to 1, 1; and the RMSE values 0.0293, 0.0709 in the training and testing datasets, respectively. Similarly, the XGBoost algorithm has a high accuracy for classification on miRNA expression data. Computation of importance scores (Fig 2b)  Error rate logic loss and RMSE of the training and testing datasets decrease in the model, whereas the AUC gradually increases then stabilizes after 30 iterations (Fig 3a). The error rates are 0, 0.0118 in the training and testing datasets, respectively; logical loss values are 0.0074, 0.04; AUC values are 1, 0.999; and the RMSE values are 0.0126, 0.0941. Similarly, the XGBoost algorithm is accurate in classifying protein expression data. Importance score as calculated by XGBoost shows that Bax, GSK3.alpha.beta, E-cadherin, Rab11, Caveolin.1 and Collagen_VI contribute to the high classification accuracy of tumor and normal tissue in breast cancer (Fig 3b).
Random forest for feature selection. To estimate the accuracy of the classification, 10-fold cross-validation method was used to assess the classification model (Table 1). When the number of selected miRNAs is 50 in the breast cancer dataset, the cross-validation accuracy rate is 98.50%.
The accuracy coefficient measures the correct rate of sample classification, and the Kappa [39] coefficient is used for checking consistency and could also measure the effect of classification accuracy. As accuracy and Kappa coefficients increase, their standard deviations decrease (Table 1).  In this study, four cancer-associated miRNAs were screened by XGBoost algorithm, and three of them, namely mir.21, mir.96, and mir.183, were screened out by random forests. A comparison of the two miRNA datasets indicated that 28% of the feature selections are consistent. Similar to the analysis miRNA datasets, a 10-fold cross-validation method was used to assess the classification model to obtain protein classification ( Table 2).
Summary of feature selection. After obtaining two miRNA candidate sets and two protein candidate sets selected by two algorithms, the union of the two miRNA sets was taken as the final miRNA candidate set, in the same manner, the final proteins candidate set was obtained. The number of selected miRNA sets is 86, and the number of selected protein sets is 30.

Calculate MIC and threshold setting
As the MIC increases, the number of nodes and edges decreases. If the selected MIC threshold is so small that the number of nodes and edges in both network becomes too large, identification of nodes that have significant differences becomes more difficult. If the selected MIC threshold is so large that the network becomes too sparse, many connections are missed, which is not conducive to analyze the relationship between the nodes. MIC was calculated between any two candidates in the miRNA and protein datasets obtained through feature selection, and the threshold was set to 0.2, 0.35, and 0.5.
Under the MIC threshold of 0.5, miRNA network of cancer tissue and normal tissue was plotted (Fig 4). The cancer network (Fig 4a) is sparser than the normal network (Fig 4b). This finding indicates that the interaction of miRNA networks differs significantly by cell type and supports the use of complex networks for breast cancer analysis. Similarly, under the MIC threshold of 0.5, protein network of cancer tissue and normal tissue was also plotted. The protein network also shows the same characteristics as miRNAs, that the network of cancerous tissue is much sparser than that of normal tissue (Fig 5). Because the miRNA network of cancer cells has a small number of nodes at an MIC threshold of 0.5 and may miss some important proteins, we decided not to use this MIC threshold to construct a complex network.
Figure analysis was applied to determine which MIC threshold should be adopted. While the number of nodes varies inappreciably when MIC threshold is set to 0.35, the number of connections between nodes is significantly reduced, suggesting that these complex networks are distinct.
Several principles were considered when selecting a threshold. Firstly, the MIC threshold selected must not lead to the loss of too many nodes. For instance, in the analytical process described, less than 5% of nodes are lost. Secondly, the number of edges of the network could not be too small, and the number of edges of the two networks must be significantly different. Number of edges in the miRNA network of normal tissue (Fig 6b) is about 1.59 times that in the miRNA network of cancer tissue (Fig 6a). Through observing the difference in structure of network under different thresholds (Fig  7), it was be found that when the MIC threshold is 0.35, the decrease in the number of network nodes is not obvious, effectively fulfilling the first principle of threshold selection. Difference in the number of edges between the two networks is also kept at a relatively high level; that is, this MIC threshold could effectively differentiate the two networks, which is in accordance with the second principle of threshold selection. Therefore, we selected 0.35 as the MIC threshold for analysis.

Multilayer network
Multilayer networks were generated after calculating MIC between nodes and setting MIC thresholds to 0.2, 0.35, or 0.5 (Fig 8).
A multilayer network with a threshold of 0.2 had more edges than other multilayer network with higher threshold, which causes the impact of key connections in the network become  smaller. Hence, under the threshold of 0.2, two types of cells is difficult to be distinguished well with this threshold. When the threshold is 0.5, there are obvious differences between the two multi-layer networks, but the number of edges is sparse and some important relationships may be mistakenly omitted. These problems are averted when the threshold of 0.35 is used, which affirms the use of this threshold.

Node ranking
To better understand the details of the networks, the nodes representing different candidates were ranked according to betweenness centrality and node degree (Figs 9 and 10).
Among the top 15 miRNAs, the relationships of 11 miRNAs with breast cancer in previously published studies were confirmed (Table 3).
Among the selected proteins, we were able to confirm the relationships of the top 10 proteins with breast cancer using published literature (Table 4).

Discussion
The multilayer network analysis proposed helps identify miRNAs and proteins that could be associated with breast cancer. While biomarkers were previously selected using machine learning, this study is novel in that a combination of machine learning and multilayer network methods was used.
This combinatorial approach to identify cancer biomarkers could prevent missing critical miRNA or protein candidates and ensure a more robust analysis. For example, the final ranking of nodes generated from multilayer network method in combination with machine learning differs from that using machine learning alone. This finding suggests that the combinatorial effect of multilayer network analysis and machine learning yields more comprehensive information. It also shows that the multilayer network analysis method could facilitate the discovery of novel molecular candidates. Although published work has described miRNA or protein networks of expression profiles separately, interrelationships between two networks have not been thoroughly investigated. To address this knowledge gap, the interrelationship between miRNA and protein networks was studied through the MIC. The most suitable MIC threshold was determined by analyzing the Table 3. Cancer-related miRNAs.

miRNA Candidate
Description mir-203a [44] Reconstitution of Runx2 in MDA-MB-231-luc cells delivered with miR-203 reverses the inhibitory effect of the miRNAs on tumor growth and metastasis.
mir-374a [46] The Wnt/β-catenin signaling is hyperactivated in metastatic breast cancer cells that express miR-374a. mir-28 [47] In breast cancer cells, miR-28 regulates Nrf2 expression at the posttranscriptional level by binding to the 3 0 UTR of Nrf2 mRNA and resulting in Nrf2 mRNA degradation.
mir-155 [48] miR-155 expression is upregulated in breast cancer cells, which reduces the levels of RAD51 and affects the cellular response to ionizing radiation.
mir-193a [49] miR-193a expression is downregulated in breast cancer cell lines and tissues when compared with the adjacent non-tumor tissues.
mir-365b [50] miR-365 expression levels are significantly higher in breast cancer tissues when compared with adjacent non-tumor tissues.

mir-1301 [51]
miR-1301 is overexpressed in breast cancer tissues and cell lines and cell tissues, whereas downregulation of miR-1301 inhibits the proliferation of breast cancer cells in vitro.

mir-200c [52]
For the claudin-low breast cancer, miR-200c has therapeutic effects in an in vivo model.

mir-10a [53]
The median expression levels of miR-10b in tumor tissue when compared with adjacent nontumor tissue are significantly higher in relapsed patients than in relapse-free patients.
mir-148b [54] miR148b is a major coordinator of breast cancer progression in a relapse-associated microRNA signature.

E-Cadherin-R-V [36]
E-cadherin is regulated epigenetically via methylation of the promoter in most intraductal breast carcinomas.

Caveolin-1-R-V [38]
Caveolin-1 expression is significantly decreased in breast cancer-associated fibroblasts compared to normal fibroblasts and is associated with increased invasion-promoting capacity.
PI3K-p85-R-V [40] The inhibition of PI3K promotes ER activity, as manifested by increases in ER binding to target promoters and ER target gene expression.

Collagen_VI-R-V [41]
Collagen VI are upregulated in breast cancer, generating a microenvironment that promotes tumour progression and metastasis.

Rictor-R-C [55]
Rictor expression is upregulated significantly as compared with nonmalignant tissues in invasive breast cancer specimens.

CDK1-R-V [58]
Combined inhibition of Cdk1 and PARP in BRCA-wild-type cancer cells (breast cancerassociated cell) results in reduced colony formation, delayed growth of human tumor xenografts.

mTOR-R-V [60]
The PI3K/Akt/mTOR pathway results in cell growth and tumor proliferation, and it plays a significant role in endocrine resistance in breast cancer.
interrelationship between the MIC threshold and the number of nodes and edges of the network. By using the most optimized MIC threshold to construct the multilayer networks, miR-NAs and proteins associated with breast cancer were identified. Although the top-ranked candidates for protein biomarkers were previously identified, the combinatorial approach proposed reveals potentially novel miRNAs associated with breast cancer, such as mir-331, mir-486-2, mir-1307 and mir-1287. Roles of these miRNA candidates in breast cancer will be confirmed through molecular means. A minor drawback associated with the multilayer network analysis is that the optimized threshold value must be determined through analysis of the distribution map (Figs 6 and 7) measuring the number of edges and nodes in network under different thresholds. The approximate range can be selected but the optimal value cannot be obtained automatically. This shortcoming could be overcome by establishing algorithms that facilitate selection of optimized threshold values.
Because single network analysis provides limited information, the proposed combinatorial approach will allow for a deeper understanding of multiple networks and signaling pathway in cancer. The regulatory architecture of miRNA and protein in breast cancer patients analyzed in multiple network-wide will potentially enable novel cancer biomarker discovery.
Supporting information S1 File. miRNA expression data of breast cancer. The first column of the data is the name of miRNAs, and first row of the data is the code of the corresponding cases of samples in TCGA, where the fields ending with '-11' is normal samples and the rest are cancer tissues.