Analysis on Differential Gene Expression Data for Prediction of New Biological Features in Permanent Atrial Fibrillation

Permanent Atrial fibrillation (pmAF) has largely remained incurable since the existing information for explaining precise mechanisms underlying pmAF is not sufficient. Microarray analysis offers a broader and unbiased approach to identify and predict new biological features of pmAF. By considering the unbalanced sample numbers in most microarray data of case - control, we designed an asymmetric principal component analysis algorithm and applied it to re - analyze differential gene expression data of pmAF patients and control samples for predicting new biological features. Finally, we identified 51 differentially expressed genes using the proposed method, in which 42 differentially expressed genes are new findings compared with two related works on the same data and the existing studies. The enrichment analysis illustrated the reliability of identified differentially expressed genes. Moreover, we predicted three new pmAF – related signaling pathways using the identified differentially expressed genes via the KO-Based Annotation System. Our analysis and the existing studies supported that the predicted signaling pathways may promote the pmAF progression. The results above are worthy to do further experimental studies. This work provides some new insights into molecular features of pmAF. It has also the potentially important implications for improved understanding of the molecular mechanisms of pmAF.


Introduction
Atrial fibrillation (AF) is an extremely common cardiac rhythm disorder in population [1,2]. The outstanding progresses have shown that AF pathophysiology involves electrical, structural and contractile remodeling, which is related to changes in cardiac gene expression.
Over the last decade, several studies have characterized the molecular basis of remodeling on a more global scale [3][4][5][6][7], in which Barth et al. [5] and Censi, et al. [6] respectively analyzed the same differential gene expression data with 10 permanent AF (pmAF) patients and 20 controls and found different biological features. In the atrial myocardium, Barth et al. [5] identified 1434 genes deregulated in pmAF. Besides, they found that most ventricular -predominant genes were upregulated in pmAF. They thought that dedifferentiation with adoption of a ventricular -like signature is a general feature of the fibrillating atrium. In the work of Censi, et al. [6], they applied the oblique principal component analysis (OPCA, a PCA -based method) to mine some new pmAF -related differentially expressed genes (DEGs) that were different from those in the work of Barth et al. [5] and found an attractor-like property of gene expression. Furthermore, Censi, et al. [7] analyzed the connection relationships among the identified DEGs. Their work showed that the connection relationships between AF and normal patient populations have great difference. However, any a quantitative analysis method for gene expression data has its limitations. For example, the PCAbased method might ignore some components that are, though statistically unimportant, biologically meaningful. The classical supervised approaches (such as SAM) were not able to well characterize different pathophysiological features of AF. Therefore, constant refinement is needed to evolve better methodologies to find more new biological features associated with AF for improved understanding of pathophysiological mechanisms of AF.
In this work, we designed an asymmetric principal component analysis (APCA) algorithm by considering the unbalanced sample numbers in most microarray data of case -control. From the data used by Barth et al. [5] and Censi, et al. [6], we identified 51 differentially expressed genes (DEGs) using the APCA algorithm, in which 42 DEGs are new findings compared with two related works and the existing studies. The enrichment analysis on GO and GAD showed that most identified DEGs are associated with the etiological factors inducing atrial fibrillation. Moreover, we predicted three new signaling pathways. Our analysis and the existing studies supported that the predicted signaling pathways may promote the pmAF progression. The obtained results in this work are worthy to do further experimental studies.

Data
The gene expression data used in this work were obtained from the public functional genomics data repository of the National Institute of Health (called Gene Expression Omnibus, GEO). The data with record #GSE2240 consists of samples from 30 patients, in which 10 patients had pmAF, whereas 20 patients had no history of AF and were in SR. The data are related to two Affymetrix platforms U133A and U133B. According to our statistics, 20995 of the 22283 genes in U133A have been annotated and the percentage of annotated genes is 94.2%. Only 14586 of the 22283 genes in U133B have been annotated and the annotated percentage is less than 65%. This may result in wrong interpretation to the final results. Therefore, our study focus on the data in U133A, which is a gene expression data set with 22283 rows (probes) and 30 columns (samples).

APCA algorithm for identifying differentially expressed genes
The existing researches have shown that the PCA-based method on this data set succeed in discriminating patients from controls and in offering a mechanistic view relating AF condition to both cardiac muscle organization and inflammatory processes [6]. However, applying PCA-based method, the total covariance matrixes does not effectively remove the unreliable dimensions if one class is represented by its training data much better or much worse than the other class. The asymmetric principal component analysis (APCA) [8] and linear subspace learning-based dimensionality reduction [9] alleviate this problem by asymmetrically weighting the class conditional covariance matrices and by considering the important weak composition. Thus, we design an novel APCA-based method to identify the differentially expressed genes between patients with pmAF and the control group by combining the ideas in two works above [8,9].
Assume that a gene expression dataset is represented by X with m rows (probes) and l columns (samples) and there is m..l. The novel APCA -based method consists of four phases. In the phase one, we project X into a low dimension matrix Xs. The steps are as follows.
(1) Construct S = X T X, where T denotes the transpose. Obviously, the size of S is l6l. (2) Apply PCA to S for obtaining the eigenvector matrix EIV S with size l6l, where one column is one eigenvector. (3) Compute EIV L = XEIV S . The EIV L is the eigenvector matrix with size m6l corresponding to the nonzero eigenvalues of XX T . (4) Project the X into X S through computing Xs = EIV L T X, which is of size l6l.
All principal components (PCs) and classification on Xs will be exactly the same with that on X. In phase two, we construct a new data matrix, shown as below.
(1) Suppose that there are n control samples and d disease samples in the original data matrix X, and n + d = l. Xs is divided into two block matrices, that is, X S = [X Sn, X Sd ].
Thus, X Sn is of size l6n and X Sd l6d. (2) Calculate the mean vectors of X Sn , X Sd and X S , M Sn , M Sd and M S whose elements are the mean values over rows and whose sizes are all l61. (3) Centralize the data matrices above by subtracting the means from X Sn , X Sd and X S to yield Y Sn , Y Sd and Y S . In the third phase, we extract the principal components from C using the PCA algorithm [9]. The 'shared variance' is accounted for by the first PC and the minor PCs (from the second PC onward) keep trace of the relevant among sample difference [6]. In APCA, since the first two PCs could optimally classify the patients/control samples under the selected a and b, the second PC is used as the discriminant PC, which will be used to identify differentially expressed genes.
The discriminant PC is a vector with dimension of 2228361, in which the value in each row represents the score of corresponding gene. In final phase, we identify the DEGs according to the scores of the genes. Those genes, whose scores (in absolute value) are higher than a given threshold (denoted by h), are identified as the DEGs. As we all know, the selection of the threshold is very important. Many biologically relevant DEGs could not be identified if the threshold is too high, while lowering the identification threshold will increase the number of potential DEGs, including many false ones. Based on the reasons above, we select about half of the numerical range of the score S as the threshold.
Assuming that the score and its maximum value (in absolute value) are respectively indicated by S and |S max |, the numerical range of the score S is 0#S#|S max |, Thus, the threshold is set as h = v|S max |/2w.

Prediction of pmAF -related pathways
The Affymetrix IDs of all the identified DEGs are entered into the KO-Based Annotation System (KOBAS) [10], which identifies the statistically enriched pathways. An identified pathway is predicted to be associated with pmAF when it satisfies the following criterion: the value of EASE (corrected P-value) for the pathway is less than or equal to 0.05 and meanwhile the number of DEGs involved in the pathway is larger than or equal to 2. This criterion ensures statistical credibility to the predicted results but not a consequence of chance.

Calculation of ROC for the identified DEGs
In order to illustrate the reasonableness of threshold in our method as well as the reliability of the results, we assess the discrimination powers of identified DEG expression levels to classify normal and pmAF samples by calculating their receiver operating characteristic (ROC) curves and the areas under the curve (AUC). In ROC calculation, we respectively define pmAF and normal patients as positive and negative samples. Therefore, the true positive rate indicates how many correct positive results occur among all pmAF samples available during the test. False positive rate, on the other hand, represents how many incorrect positive results occur among all normal samples available during the test.
It is well known that pmAF is a kind of polygenic and multifactorial disease and so its occurrence and progress are associated with the combinational work of multiple genes. Thus, we calculate the ROCs and the AUCs of combinations among 51 identified DEGs. In order to find the combination rules of these DEGs, we analyze the connection relationships among them using correlation technique. Using the gene expression values of 51 DEGs across 10 samples in pmAF group of U133A, we calculate the correlation coefficient (CC) between each DEG pair. Under CC$0.9 [7], we construct the connection relationship among 51 identified DEGs correspondent to pmAF patients and then combine these DEGs according to the connection relationship for conducting the calculation of ROCs and AUCs. For comparison, we also calculate the ROC and the AUC of each DEG for finding the discrimination powers of these DEGs individually.

The discriminating PC
In U133A data set, there are respectively 22283 probes and 30 samples, which include 20 control samples and 10 disease samples. The independent number of control samples is 19 since the gene expression values of two samples in the control group are completely same. Thus, the total independent sample number is 29. The dimension of data matrix in our experiment are m = 22283, l = 29, n = 19 and d = 10.
We let b be an integer of 20 and then uniformly sampled a over the interval (0, 1). Our simulation showed that two kinds of samples can be correctly classified by the first two PCs when taking a = 0.3 and b = 20 in the APCA algorithm, as shown in Figure 1. Table 1 gave the proportional and cumulative variances represented by the first 10 PCs under a = 0.3 and b = 20 when our method was applied to the U133A data set.
The results in the Table 1 show that the first PC accounts for 99.61% of total variability in both groups. With the large value of b (here b = 20), our algorithm ensures that the first PC is caused by the two class means and it represents the commonality between gene expression profiles on the genome-wide scale. Hence, all minor PCs are caused by the within-class variations of the both classes, which mainly concentrate on the second PC as shown in Table 1. The second PC captures the major difference between the two within-class variations of the two classes. Further, the pmAF patient and control samples are able to be correctly classified using the first two PC and so we select the second PC as the discriminating PC.

The identified DEGs
We obtained each gene's score using the second PC in Table 1. The maximum value of the score is 11.02. Thus, the threshold of the score h is calculated as 5. We identified 51 DEGs (corresponding to the 63 Affymetrix IDs) that have the scores (in absolute value) higher than the threshold. Their symbols, Affymetrix IDs (ID_REF), titles and scores are shown in Table 2, in which the genes DICER1, IGFBP2, LBH and NPR3 marked with italic are the differential expression genes previously reported in the work of [5] on the same data. Also, the genes IGFBP2, IGH@/IGHG1/IGHG2/IGHM/IGHV4-31, LOC100133662/ RPS4Y1, RBP4, SFRP1 and XIST marked with bold are the DEGs previously found on the same data in the work of [6]. Compared with the results of two previous works, 42 new DEGs are identified by our novel method, which fully considers the important weak composition and so can identify some new genes that cannot be found by PCA algorithm and classical supervised approaches.

The predicted pmAF -related signaling pathways
The DEGs in Table 2 are fed into the KOBAS for predicting the pmAF-related signaling pathways. The KOBAS identified three signaling pathways that satisfy the criterion, as shown in Table 3. We inferred these pathways to be associated with development or maintenance of pmAF.

Discrimination powers of identified DEGs
The connection relationships among 51 DEGs correspondent to pmAF patients and the calculation of corresponding ROCs and AUCs are respectively shown in Figure S1, Table S1 and S2. The connection network among 51 DEGs ( Figure S1) consisted of two subnetworks, four DEG connection pairs and 29 individual DEGs. In two subnetworks, all the DEGs were found to give an AUC of less than 0.7 or more than 0.3 individually (Table S1), which illustrated single DEG had poor discrimination power. When the DEGs in each subnetwork are combined, we calculated the ROC and AUC of each combinational DEGs and found that two AUCs were elevated to 1 (P = 0) (Table S2). In 8 DEGs of four pairs, only DICER1 (No. 8) were found to have an AUC of more than 0.7. When these 8 DEGs combined, the AUC was equal to 1 (P = 0). 10   individual DEGs was also found to have an AUC of more than 0.7 (AUC = 1, P = 0). As have discussed above, different combinations among 51 DEG according to their connection relationships in pmAF samples showed great discrimination power of classifying normal and pmAF samples. This demonstrated that the identified DEGs are reliable and the threshold of identifying DEGs is reasonable.

Discussion
The enrichment analysis of identified DEGs The results for the enrichment analysis of biological process and cellular component on Gene Ontology (GO) and the diseases on Genetic Association Database (GAD) are respectively shown in Table S3, S4 and S5. The existing experimental studies showed that the etiological factors inducing AF mainly include cardiac muscle or organ [1,11], cardiovascular [2], inflammation [12,13], proliferation or differentiation [14], fiber/fibrosis [15,16], external/hormone stimulation [17,18], extracellular region/matrix [19,20] and metabolism [5]. 32 of 51 DEGs are included in the statistically enriched GO terms of biological processes, in which 11 are relevant to the cardiac muscle, muscle cell or muscle organ, such as BMP10, RBP4 and MYL2 and other seven DEGs are related to the inflammation, for instance, ADIPOQ, FABP4 and C3; 38 of 51 DEGs are included in the statistically enriched GO terms of cellular component, in which 24 DEGs are located in extracellular region or extracellular matrix (ECM) at the levels of subcellular structures and macromolecular complexes, such as ADIPOQ, BMP10 and IGF1, while others are located in sarcomere, myofibril, contractile fiber and adherens junction; 22 of 51 DEGs are included in the statistically enriched GAD terms of disease, most of which are associated with metabolism and cardiovascular diseases. For example, the ADIPOQ, AMY1A, CFB, HP and HBB are associated with the metabolic diseases, while the FBP4, HP, LPL and MYL2 are related to the cardiovascular diseases.
In order to further illustrate the reliability of identified DEGs, we established the association between the AF-related etiological factors and all the identified DEGs. We firstly connected the factors and the ''terms'' according to the biological meaning of each term and then established the relationships between the identified DEGs and the etiological factors via the terms in the enrichment analysis results. The 51 DEGs and their association with the AF -related etiological factors are shown in Table S6. The results showed that 37 of 51 DEGs are closely related to the etiological factors inducing AF and so our results have high reliability. Since the pathophysiological mechanisms of AF have not completely been explained, the known factors causing pmAF are not comprehensive. Thus, those genes, such as DIRAS3, HBA1/HBA2, IGH@/IGHA1/IGHA2/IGHV3OR16-13/ LOC100126583, MMD, PRKACA and SLC16A7, which do not correlated with any a known etiological factor of AF, may provide new insights for understanding pathophysiological mechanisms of pmAF.

Analysis of association between the predicted pathways and pmAF
There are respectively 5, 4, and 3 DEGs in the PPAR, focal adhesion and dilated cardiomyopathy signaling pathways (Table 3). Our previous analysis illustrated that these DEGs are closely associated with pmAF. The abnormal expressions of the DEGs in three predicted signaling pathways are probably one of the reasons that these signaling pathways promote the pmAF progression. Further, using gene expression data in U133A, we analyzed the connections among the DEGs involved in each predicted pathway in AF patients and controls respectively [7]. The connection relationships among five DEGs involved in the PPAR signaling pathway are shown in Figure 2. We found that the connections between ADIPOQ and FABP45 and between ADIPOQ and LPL disappear in pmAF patients (Figure 2(A)), while there are strong pairwise connections among ADIPOQ, FABP4, LPL and PLIN in the controls (Figure 2(B)). The ACK1 is isolated in both cases. The similar results are obtained for the focal adhesion and dilated cardiomyopathy pathways (the data are not given). For instance, in the focal adhesion pathway, the MYL2 and SPP1 interacted in the control (CC = 0.86), but they were not correlated with each other in the pmAF patients (CC = 0.17); although all of the connections among the DEGs in the dilated cardiomyopathy pathway were weak correlation in both pmAF patients and controls, there are great difference between the corresponding CCs in both cases. Thus, we inferred that the alterations of connections among the DEGs in three pathways may be another cause that these signaling pathways promote pmAF.
In addition, some existing researches indirectly supported our prediction. For the PPAR signaling pathway, [21] and [22] illustrated that the peroxisome proliferator-activated receptors (PPARs) are lipid-activated transcription factors that regulate lipid and lipoprotein metabolism, glucose homeostasis, inflammation and cardiovascular system; The PPARs are a family of three nuclear hormone receptors, PPARa, -b/d, and -c, in which the PPARc activator pioglitazone can attenuate congestive heart failure-induced atrial structural remodeling and AF promotion, with effects similar to those of candesartan [15]. The focal adhesions are large multi-protein assemblies that form at the basal surface of cells on planar dishes, and that mediate cell signaling, force transduction and adhesion to the substratum [23]. The modulation of focal adhesion assembly/disassembly in response to mechanical load may be related to a primary role for focal adhesion assembly in myofibrillogenesis [24]. Like their costameric counterparts in vivo, the cardiomyocyte focal adhesions contain vinculin and other cytoskeletal proteins that form a dense adhesion plaque at sites of close approximation of the sarcolemma to the ECM. The increase in cardiomyocyte ECM deposition results in abnormal conduction through the atria, thus creating a substrate for atrial fibrillation [25]. The Dilated cardiomyopathy (DCM), a genetically heterogeneous disorder, causes heart failure and rhythm disturbances. The dilated cardiomyopathy was typically preceded by atrial fibrillation, sinus node dysfunction, and conduction block [26]. Remodeling occurs in both ventricle and atrium in dilated cardiomyopathy. Thus, the dilated cardiomyopathy might cause pmAF by the alteration of atrial ECM components during remodeling [20].

Comparison between the APCA and other related methods
The study of Censi, et al. [6] illustrated the effectiveness and feasibility of PCA method in finding disease -related biological features. APCA is an improved PCA and both have same theoretical basis. Therefore we first compare APCA with PCA. Figure 3 shows the first 10 PCs extracted by APCA and PCA respectively. Their first PCs respectively account for 99.61% and 98.42%. In minor PCs, the second PC of APCA is much larger than the third PCs onward, while the second PC of PCA is comparable with the third to the fifth PCs. Our simulation showed that PCA is undesirable or has drawbacks for the data analysis with different numbers of samples in the different classes because PCA uses the number of the samples to weight the class conditional covariance matrix in constructing the total scatter matrix. As such, the class with large number of samples will dominate the results of the principle components of PCA while the information of the class with small number of samples cannot be well shown in its principal components. Now the APCA takes a = 0.3 and so the larger weight ((1-a) = 0.7 comparing to 0.345 (10/29) of PCA) is used for the class of pmAF. Thus, information of the class of pmAF is emphasized in APCA (0.7.0.5) while it is deemphasized in PCA (0.345,0.5). Furthermore, with b = 20 (it is significantly larger than b = 1 in PCA), APCA forces the largest PC to capture the difference of the class means and hence clearly separates the information about the difference of the class means from the information about the within-class variations into different principal components. PCA with b = 1 makes these two different types of information mixed in various PCs. Thus, the first two PCs of APCA have higher discriminating power of classifying normal and pmAF samples than that of PCA since APCA considers the unbalanced sample numbers. Numerous feature selection methods have been applied to the identification of DEGs on microarray, including Fold change, Welch t-statistic, SAM (Significance Analysis of Microarray), etc. [27]. The feature selection methods separately identify each DEG that has significant difference in statistics and the number of identified DEGs is usually very large, while APCA identify DEGs whose expressions are correlated. Since the AF signature is activated by a general modulation of the whole genome but a single gene, APCA is able to better characterize different pathophysiological aspects of AF. Typically, the number of samples is limited by the availability of sufficient patients or cost and the noise is inevitable in a microarray study. The number of samples and noise are significant challenge to any feature selection approaches [27], while APCA is more robust to both factors [28]. For a microarray data with unbalanced samples, APCA is able to allocate larger weight to the group with fewer sample number for reducing the influence of imbalance on the final results. Therefore APCA can produce more reliable results than other methods that do not consider the problem of unbalanced sample number when processing U133A dataset, which is a typical microarray data with unbalanced samples.

Comparing with the existing results
By PCA, Censi, et al. identified 50 pmAF -related DEGs from the same data set [6]. APCA and PCA' mechanisms of weighting two classes of samples (pmAF and control) are very different so that the scores of same a gene generated by APCA and PCA are very different. Therefore, APCA and PCA identify different DEG lists that have very low overlap. This is the main reason why only 6 genes are same between two DEG lists identified by our and Censi, et al.'s methods.
Our enrichment analysis about biological process and cellular component on GO for 50 DEGs also shows the majority of them (27 DEGs, while ours is 37 DEGs) are individually related to the etiological factors inducing AF. Using 50 DEGs extracted by Censi, et al., we do not find any a gene is included in the statistically enriched GAD terms of disease on GAD (we have 22 DEGs), and only one statistically enriched pathway named focal adhesion is found on KOBAS, in which genes JUN, PIK3R1, TNC and THBS4 are involved. This illustrates that the correlation in biological functions among our 51 DEGs is higher than that of  50 DEGs. Therefore, there are more genes and combinational works of multiple genes in our 51 DEGs to be associated with occurrence and progress of pmAF. APCA is a more appropriate method to microarray data that have unbalanced samples.
Finally, it is worthy explaining that we do not analyze the U133B data set because too many genes were not annotated on this chip, which may result in wrong interpretation to the final results. The pathophysiology of pmAF is extremely complex. In our future work, we shall validate the suggested pmAF-related DEGs in experiments and integrate multiple types of data (such as gene sequence, RNA and miRNA expression profiles, proteinprotein interactions) to build functional networks promoting pmAF for more comprehensive understanding of pmAF pathophysiology.

Conclusion
This work proposes a novel method to identify the DEGs from microarray data with unbalanced sample numbers. 51 DEGs associated with pmAF are identified, in which 42 DEGs are different from the existing related results. The PPAR, focal adhesions and dilated cardiomyopathy signaling pathways are predicted to be associated with pmAF based on all of the identified DEGs. This work provides some new insights into biological features of pmAF and has also the potentially important implications for improved understanding of the molecular mechanisms of pmAF. Figure S1 The connection network among 51 identified DEGs. The No. of each DEG is same with that in Table 2. (TIF)