The biological knowledge discovery by PCCF measure and PCA-F projection

In the process of biological knowledge discovery, PCA is commonly used to complement clustering analysis, but PCA typically gives poor visualizations for most gene expression data sets. Here, we propose the PCCF measure and use PCA-F to display clusters of PCCF, where PCCF and PCA-F are built from the modified cumulative probabilities of genes. From the analysis of simulated and experimental data sets, we demonstrate that PCCF is more appropriate and reliable for analyzing gene expression data than other commonly used distance or similarity measures, and that PCA-F is a good visualization technique for identifying clusters of PCCF. We focus on data sets in which the expression values of genes are collected at different time points.


Introduction
In the process of biological knowledge discovery, clustering and visualization analysis play central roles [1][2][3]. Clustering algorithms are used to search for patterns that provide additional insight into the biological function and relevance of genes [4,5]. Among the most popular are unsupervised clustering algorithms, such as K-means [5]. K-means analysis depends on choosing an appropriate distance or similarity measure that takes into account the underlying biology and the nature of the data [6]. Commonly used measures include PCC (the Pearson correlation coefficient) and Euclidean distance [7]. However, K-means cannot reveal underlying global patterns in the data, or relationships between the clusters found. PCA is commonly used to complement K-means for this purpose, but for most gene expression data PCA typically gives a poor visualization [8,9]. Because of these limitations, nonlinear dimension reduction methods that attempt to preserve local structure in the data have been developed, such as t-SNE (t-distributed Stochastic Neighbor Embedding) [8,10,11]. t-SNE has been successful in displaying clusters of Euclidean distance [8], but it usually gives poor visualizations for clusters of PCC.
Here, we use PCCF to measure the similarity of genes, and PCA-F to display clusters of PCCF, where PCCF is the correlation coefficient of F-points, PCA-F is the principal component analysis of F-points, and the F-point of a gene is constructed from the modified cumulative probabilities of the positively and reversely normalized gene. To evaluate the PCCF measure, we apply it to group four gene expression data sets. The clustering results demonstrate that PCCF is statistically more reliable and biologically more relevant than other commonly used distance or similarity measures. For PCA-F, the cumulative variance of its principal components is greater than 85% for every reference data set in this paper, far more than for PCA of the normalized points. Furthermore, we demonstrate that PCA-F is able to project similar F-points into the same regions, to accurately depict distant F-points, and to accurately reveal the relationships between clusters of PCCF. These superior performances of PCCF and PCA-F stem from the validity of F-points. The most prominent feature of F-points is that their curve shapes are almost like a capital N. That is, F-points weaken the curve shape differences between genes with similar expression behavior. Moreover, F-points enlarge the element discrepancy of dissimilar genes through their two cumulative probabilities.
However, for PCA-F maps of many expression data sets, the projections in the internal regions are usually crowded; these crowded projections come from genes whose elements are relatively equal. A 2D projection map should help an investigator interpret any particular region of the visualization, but crowded regions make this inconvenient. To clearly distinguish every projected region, we propose PCA-FO, a similarity transformation of PCA-F. For gene points, the positional relationship of their PCA-FO projections is the same as that of their PCA-F projections, but the spacing of PCA-FO projections is more uniform than that of PCA-F.
In this study, data sets from published studies are used to investigate and illustrate the performance of PCCF and PCA-F, including the yeast metabolic cycle data [12], K562 cell line data [13], human embryo data [14], and mouse retinal data [7]. PCCF is first applied to divide these data sets into clusters, and the clustering results are then overlaid onto PCA-F maps. The results show that PCCF is able to group genes with similar expression behavior into the same clusters, and PCA-F is able to project genes of the same cluster together. That is, PCCF and PCA-F can be used in conjunction to understand the logic of cluster partitions and to identify co-regulated genes. We suggest that PCCF and PCA-F provide new insights for analyzing large-scale transcriptome data.
induction at least at one time point with respect to the control sample (t = 0; before PMA treatment), and 1779 probes satisfying this requirement have been determined [13,15].

Data set 3
Yeast metabolic cycle data: NCBI GEO accession number GSE3431. This data set described the transcriptional changes in the metabolic cycle of the budding yeast Saccharomyces cerevisiae [12,14]. In this experiment, gene expression behaved in a periodic manner, comprising a non-respiratory phase followed by a respiratory phase. The transcriptome was assayed every 25 min over three consecutive cycles, resulting in 36 samples (T1-T36). These were profiled using Affymetrix YG_S98 oligonucleotide arrays. Probes that had at least three 'present' calls, as generated by the Affymetrix GeneChip software, were classified as expressed, and the data were normalized using GeneSpring v7 per-chip normalization. Using a periodicity algorithm described in the original paper, the authors classified 3552 genes as periodic, corresponding to 3656 probe sets. From these 3552 genes, the 2913 genes whose expression values were greater than 5 in at least one of the 36 samples were selected.

Data set 4
Human embryo data: NCBI GEO accession number GSE18887. The resulting matrix contained expression measurements for 5441 transcripts across 18 samples, denoted as the human organogenesis expression matrix [14] (Carnegie stages 9-14, S9-S14). A total of 5441 probe sets were identified as differentially expressed using an Extraction of Differential Gene Expression (EDGE)-based methodology. Initially, Hai Fang had used SOM-SVD to identify co-expressed genes of the human embryo data [10,14], which yielded six clusters. From their analysis, they extracted 2148 differentially expressed probe sets. We used this set of 2148 probe sets for our analysis.

Data set 5
The raw mouse retinal data consisted of 10 SAGE libraries (38818 unique tags with tag counts ≥ 2) from developing retina taken at 2-day intervals. The samples ranged from embryonic to postnatal to adult. Among the 38818 tags, the 1467 tags that had counts greater than or equal to 20 in at least one of the 10 libraries were selected [7]. The purpose of this selection was to exclude genes with uniformly low expression. The count of each tag in a SAGE library was Poisson distributed, and these Poisson distributions were independent of each other across different tags and libraries [7].
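The selection rule used here (keep a tag if its count reaches the threshold in at least one library) can be sketched as follows. This is only an illustration: the function name `select_rows` and the toy counts are hypothetical and are not taken from the original data.

```python
def select_rows(counts, threshold):
    """Keep the rows whose largest value is >= threshold.

    `counts` maps a (hypothetical) tag name to its counts across libraries.
    """
    return {tag: row for tag, row in counts.items() if max(row) >= threshold}


# Toy counts for three made-up tags across four libraries.
toy_counts = {
    "tag_a": [0, 3, 25, 7],
    "tag_b": [2, 2, 4, 3],
    "tag_c": [20, 0, 0, 1],
}

selected = select_rows(toy_counts, threshold=20)
print(sorted(selected))
```

The filter for data set 3 is analogous, except that it uses a strict comparison (expression greater than 5), so `>=` would be replaced by `>` there.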

Methods
The gene expression points can be represented as n-dimensional vectors, where $X_i = (x_{i1}, x_{i2}, \ldots, x_{in})$ represents the i-th gene, and $x_{ij}$ represents its expression level at the j-th time point.
Here, $W'_i$ is named the P-point of $X_i$, and $F_i$ is named the F-point of $X_i$. $F_i$ is a 2n-dimensional vector, and the sum of its elements is n.
For $W_i$, the last element of its cumulative probability is always 1, so part of the information in $w_{in}$ may be lost; we therefore use the modified cumulative probability. Since the elements of $W'_i$ and $Y'_i$ are monotonically non-decreasing, the curve shape of $F_i$ is almost like a capital N. That is, F-points weaken the curve shape differences between genes with similar expression behavior. Admittedly, the curve shapes of genes with dissimilar expression behavior are then also similar. However, F-points enlarge the element discrepancy of dissimilar genes through their two modified cumulative probabilities. That is, the curve shapes of genes with dissimilar expression behavior are different N shapes.

PCCF measure
Here, the PCC between $F_i$ and $F_j$ (or $W'_i$ and $W'_j$) is defined as the PCCF (or PCCP) of $X_i$ and $X_j$. Likewise, the Euclidean distance between $F_i$ and $F_j$ (or $W'_i$ and $W'_j$) is defined as the EuF (or EuP) of $X_i$ and $X_j$.
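As a minimal sketch of these definitions, the two helpers below compute the Pearson correlation coefficient and the Euclidean distance of two vectors. They assume that the F-points (or P-points) have already been constructed; the vectors `f_i` and `f_j` are made-up placeholders with 2n = 6 elements summing to n = 3, not F-points of real genes.

```python
import math


def pearson(u, v):
    """Pearson correlation coefficient of two equal-length vectors.

    Raises ZeroDivisionError if either vector is constant.
    """
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    norm_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    return cov / (norm_u * norm_v)


def euclidean(u, v):
    """Euclidean distance of two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Placeholder vectors standing in for the F-points of two genes (assumption).
f_i = [0.1, 0.4, 0.9, 0.9, 0.5, 0.2]
f_j = [0.3, 0.5, 0.7, 0.8, 0.5, 0.2]

pccf = pearson(f_i, f_j)    # PCCF when the inputs are F-points
euf = euclidean(f_i, f_j)   # EuF when the inputs are F-points
print(pccf, euf)
```

Passing P-points instead of F-points to the same two functions would give PCCP and EuP.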
In fact, $W'_i$ and $F_i$ can be expressed in terms of each other (Eq (3)). Based on Eq (3), the EuF of $X_i$ and $X_j$ is determined by $\mathrm{EuP}(i, j)$. That is, EuF and EuP are in essence the same distance.
For the PCCP and PCCF of $X_i$ and $X_j$, however, the situation differs: the mean of $F_i$ is always 0.5, whereas the means of $W'_i$ and $W'_j$ are unlikely to both be 0.5. The PCCP and PCCF of $X_i$ and $X_j$ therefore differ significantly.

PCA-F and PCA-FO
Here, $f_i(1)$ and $f_i(2)$ are the first and second principal components of $F_i$, respectively, $m$ is the number of genes in the data set, and $n(f_i(1))$ and $n(f_i(2))$ are the ordering numbers of $f_i(1)$ and $f_i(2)$, respectively. That is, all $f_i(1)$ (or $f_i(2)$) are first ordered from the smallest value to the largest, and $n(f_i(1))$ (or $n(f_i(2))$) is the position of $f_i(1)$ (or $f_i(2)$) in this ordering.
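As an illustration of the ordering number, the sketch below ranks a list of component values from smallest to largest and returns each value's 1-based position. The function name `ordering_numbers` and the toy values are hypothetical, and breaking ties by original order is an assumption, since the text does not say how ties are handled. The PCA-FO formula that consumes these ordering numbers is not reproduced here.

```python
def ordering_numbers(values):
    """Return the 1-based ascending rank of every value.

    Ties are broken by original position (an assumption of this sketch).
    """
    order = sorted(range(len(values)), key=lambda idx: values[idx])
    ranks = [0] * len(values)
    for position, idx in enumerate(order, start=1):
        ranks[idx] = position
    return ranks


# Toy first principal components of four genes (made-up numbers).
first_components = [0.31, -1.20, 0.05, 2.40]
print(ordering_numbers(first_components))  # [3, 1, 2, 4]
```

Because the ordering number depends only on the sort order, any transformation built from it keeps the left-to-right (or bottom-to-top) order of the original components.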

S-value
The average silhouette value is a quantitative way to compare different clustering solutions [16]. For a data set, we use the average silhouette value to quantify the clustering results of its normalized points, P-points and F-points. Here, we use the S1-value to denote the average silhouette value of the data set,

$$\mathrm{S1} = \frac{1}{m}\sum_{i=1}^{m}\frac{b_i - a_i}{\max(a_i, b_i)},$$

where $a_i$ is the average distance from $Y_i$ to the other points in the same cluster as $Y_i$, $b_i$ is the minimum, over the other clusters, of the average distance from $Y_i$ to the points of that cluster, $Y_i$ is the i-th point of the data set, and $m$ is the number of genes in the data [16]. Moreover, we use the S2-value to evaluate whether projections in the same region come from similar points. Projections are first divided into clusters by Euclidean distance, and the cluster membership of $Y_i$ is k if its projection belongs to the k-th cluster. The S2-value is then the average silhouette value of the $Y_i$ under this membership. When the S2-value is used to evaluate the quality of projections, it is called the S2-value of PCCF if the similarity of genes is defined by the PCCF measure, and so on.
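The average silhouette value can be sketched in a few lines. The version below follows the standard silhouette definition with a pluggable distance function; the function names, the toy points, and the convention that a point in a singleton cluster contributes 0 are assumptions of this sketch rather than details stated in the text.

```python
import math


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def average_silhouette(points, labels, dist=euclidean):
    """Average of (b_i - a_i) / max(a_i, b_i) over all points.

    Requires at least two clusters. A point that is alone in its
    cluster contributes 0 (a common convention, assumed here).
    """
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)

    total = 0.0
    for i, point in enumerate(points):
        own = clusters[labels[i]]
        if len(own) == 1:
            continue
        # a_i: mean distance to the other members of the same cluster.
        a_i = sum(dist(point, points[j]) for j in own if j != i) / (len(own) - 1)
        # b_i: smallest mean distance to the members of any other cluster.
        b_i = min(
            sum(dist(point, points[j]) for j in members) / len(members)
            for label, members in clusters.items()
            if label != labels[i]
        )
        total += (b_i - a_i) / max(a_i, b_i)
    return total / len(points)


# Two well-separated toy groups (made-up coordinates).
toy_points = [[0, 0], [0, 1], [10, 10], [10, 11]]
good = average_silhouette(toy_points, [0, 0, 1, 1])
bad = average_silhouette(toy_points, [0, 1, 0, 1])
print(good, bad)
```

For the S1-value the labels come from clustering the points themselves; for the S2-value the same function would be called with labels obtained by clustering the 2D projections. A correlation-based dissimilarity such as 1 − PCC could be passed as `dist`, which is one possible choice and not necessarily the one used in the paper.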

D-plot
For a dimension reduction technique, we call it a 'locally valid' (or 'globally valid') visualization if the i-th closest neighbour (or farthest point) of a point is its j-th closest neighbour (or farthest point) in 2D space, where i, j and |i − j| are relatively small numbers. Here, point neighbours are located by the PCC measure, while projection neighbours are located by Euclidean distance.
The local and global validity can be quantified by the $D_1$-plot and the $D_2$-plot, respectively, where $m$ is the number of points in the data, $k$ is a chosen limit of local validity, $\rho_2(i, a)$ is the PCC between $X_i$ and its a-th closest neighbor in 2D space, $\rho_n(i, c)$ is the PCC between $X_i$ and its c-th closest neighbor in high dimensional space, $\rho_2(i, e)$ is the PCC between $X_i$ and its e-th farthest point in 2D space, and $\rho_n(i, f)$ is the PCC between $X_i$ and its f-th farthest point in high dimensional space.
In general, when we use PCC to locate point neighbours, the closest neighbors of a projection do not necessarily come from real point neighbors. That is, for the c-th closest neighbor of $X_i$ in high dimensional space, if its projection is the s-th (s > k) closest neighbor of the projection of $X_i$, then $\rho_n(i, c)$ does not appear in $\sum_{a=1}^{b}\rho_2(i, a)$. Thus,

$$\sum_{a=1}^{b}\rho_2(i, a) \le \sum_{c=1}^{b}\rho_n(i, c).$$

Moreover, for large-scale gene expression data and a relatively small k, $\rho_n(i, c)$ is usually nonnegative. We connect the points $(b, D_1(b))$ into a broken line, which is named the $D_1$-plot. The closer the $D_1$-plot is to the line Y = 1, the more high-dimensional nearest neighbours are located close to one another in the 2D map. The $D_2$-plot is defined similarly; the closer it is to Y = 1, the more accurately the relationship of distant points is depicted.
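The idea behind the D1-plot can be sketched as below. The exact formula for D1(b) is not reproduced in the text above, so this sketch assumes a pooled ratio: the sum over all points of the PCCs to the b closest 2D neighbours, divided by the sum over all points of the b largest PCCs. That form is an assumption chosen to be consistent with the inequality above; the function names and toy data are hypothetical.

```python
import math


def pearson(u, v):
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    norm_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    return cov / (norm_u * norm_v)


def d1_plot(points, projections, k):
    """Return [D1(1), ..., D1(k)] under the pooled-ratio assumption."""
    m = len(points)
    rho_by_2d = []   # PCCs ordered by increasing 2D Euclidean distance
    rho_by_pcc = []  # PCCs ordered from largest to smallest
    for i in range(m):
        others = [j for j in range(m) if j != i]
        rho = {j: pearson(points[i], points[j]) for j in others}
        nearest_2d = sorted(
            others, key=lambda j: math.dist(projections[i], projections[j])
        )
        rho_by_2d.append([rho[j] for j in nearest_2d])
        rho_by_pcc.append(sorted(rho.values(), reverse=True))

    curve = []
    for b in range(1, k + 1):
        numerator = sum(sum(row[:b]) for row in rho_by_2d)
        denominator = sum(sum(row[:b]) for row in rho_by_pcc)
        curve.append(numerator / denominator)
    return curve


# Toy data: three rising and three falling profiles, with 2D positions
# that keep each group together (all values are made up).
toy_points = [
    [1, 2, 3, 4], [1, 2, 4, 5], [2, 3, 5, 9],
    [4, 3, 2, 1], [5, 4, 2, 1], [9, 5, 3, 2],
]
toy_projections = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
d1 = d1_plot(toy_points, toy_projections, k=2)
print(d1)
```

In the toy example the two closest 2D neighbours of every point are exactly its two most correlated points, so D1(2) equals 1 up to rounding. A D2-plot could be sketched the same way by sorting in the opposite directions (farthest projections, smallest PCCs).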

Results
Here, all clustering results were generated by K-means on the normalized points, with PCCF, PCC, PCCP, EuF, Euclidean distance, TransChisq and PoissonC chosen as the distance or similarity measures of genes. The number of clusters mainly came from the corresponding references. In detail, Limb JK et al. had divided data set 2 into 8 clusters by Euclidean distance [13]; Natascha B et al. had divided data set 3 into 3 clusters, and data set 4 into 6 and 10 clusters, by Euclidean distance [8]; and data set 5 had been grouped into 30 clusters by the TransChisq and PoissonC measures [7,17]. Furthermore, for every clustering result, K-means was iterated at least 1000 times.

The statistical reliability of PCCF
Here, we used the S1-value to demonstrate the statistical reliability of clusters of PCCF. For comparison, the normalized genes of each experimental data set were divided into clusters by Euclidean distance, PCC, PCCP, EuF and PCCF. The S1-values of these clustering results are summarized in Table 1. Within each data set, Table 1 showed that the S1-value of the clusters of PCCF was the largest, far greater than those of the other measures. That is, clusters of PCCF were better separated than those of the other measures.

The biochemical reliability of PCCF
In general, the patterns revealed by the clusters under different measures roughly agreed with each other. For instance, data set 5 had been grouped into 30 clusters by the TransChisq and PoissonC measures, and these studies used five mouse photoreceptor genes and thirty-four cell-specific genes to demonstrate that TransChisq and PoissonC were more efficient for analyzing SAGE data than PCC and Euclidean distance [7,17]. The five photoreceptor genes showed high tag counts in late retinal development (adult), and the thirty-four tags showed the most dynamic and cell-specific expression in the mouse neonatal retina (developmental stages P0 to P6) [7]. For comparison, we also used PCCF and PCCP to group these 1,467 tags into 30 clusters.
For the five rhodopsin tags, only PCCF was able to group them together, while the other measures divided them into two clusters (Table 2). Moreover, the thirty-four 'cell-specific' tags were used to test the sensitivity and specificity of these measures, and the comparison statistics are summarized in Table 2. For each measure, its three most dynamic clusters that contained 'cell-specific' tags were selected. In Table 2, the clusters of PCCF, TransChisq and PoissonC showed no significant difference for these cell-specific genes. That is, PCCF was also appropriate and reliable for analyzing SAGE data.

The projecting reliability of PCA-F
The cumulative variance of the principal components is commonly used to assess the projecting reliability of PCA [18]. For all data sets in this paper, the cumulative variances of PCA-F, PCA-P and PCA-N are summarized in Table 3, where PCA-P and PCA-N are the PCA of P-points and of normalized points, respectively. For every data set, Table 3 showed that the cumulative variances of PCA-F and PCA-P had no significant difference, with PCA-P slightly greater than PCA-F. Importantly, the cumulative variances of PCA-F and PCA-P were greater than 85% for every data set. In contrast, the cumulative variance of PCA-N was far less than those of PCA-F and PCA-P for every data set, and only for data set 4 was it slightly greater than 85%.
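A minimal sketch of how such a cumulative variance can be computed is given below, using NumPy. It assumes that rows are genes (points), that columns are coordinates, and that the quantity of interest is the share of total variance captured by the first two principal components, since the maps are 2D; the function name and the toy matrix are hypothetical.

```python
import numpy as np


def cumulative_variance(data, n_components=2):
    """Fraction of total variance explained by the leading components.

    Rows are points, columns are coordinates. Using two components is
    an assumption that matches a 2D projection.
    """
    x = np.asarray(data, dtype=float)
    x = x - x.mean(axis=0)                       # centre every coordinate
    singular_values = np.linalg.svd(x, compute_uv=False)
    variances = singular_values ** 2             # proportional to PC variances
    return float(variances[:n_components].sum() / variances.sum())


# Toy points that lie in a plane of 3D space (third coordinate constant).
toy = [[1, 2, 0], [2, 1, 0], [3, 5, 0], [4, 3, 0]]
print(cumulative_variance(toy, n_components=2))
print(cumulative_variance(toy, n_components=1))
```

Applied to a matrix of F-points, P-points or normalized points, the same function would give the kind of value compared across PCA-F, PCA-P and PCA-N.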
Furthermore, we used data set 1 to assess the statistical reliability of PCA-F. According to the population membership of its points, data set 1 was mapped on PCA-F, PCA-P and PCA-N (Fig 1). From Fig 1(a) and 1(c), although there was a little intermixing between adjacent populations, PCA-F and PCA-P were able to project most points of the same population together. Importantly, even if all elements of the points were relatively equal, PCA-F and PCA-P were able to project them together. For instance, PCA-F and PCA-P projected most points of (N(20,2),N(20,2),N(20,2),N(20,2)) together, where these points were marked by 11 in Fig 1(a) and 1(c). Moreover, PCA-N clearly projected points onto seven regions, but each region contained projections of two or more populations with significant intermixing (Fig 1(e)).

Table 2 note: The numbers in the third column are the numbers of rhodopsin genes (or cell-specific genes) in a cluster; Total, the total number of cluster members; sensitivity, Numbers/5 (or 34); specificity, Numbers/Total.

The feature of F-points
Here, the down-regulated genes of data set 2 were selected to explore the features of F-points, where data set 2 was divided into 12 clusters by PCCF and by PCC, respectively. The 3 clusters of PCCF and the 4 clusters of PCC that contained down-regulated genes were selected, and the curve shapes of the F-points and normalized points of these clusters are shown in Fig 2. For clusters of PCCF, Fig 2 showed that the curve shapes of the F-points within any cluster were almost like a capital N, whereas the elements of F-points from different PCCF clusters differed significantly. Furthermore, Fig 2 showed that the similarity among F-points and the similarity among normalized points differed significantly. For instance, for genes in the second cluster of PCCF, the curve shapes of their normalized points showed no specific pattern (Fig 2(b)), but their F-points differed only slightly (Fig 2(i)).

The consistency between PCA-F and PCCF
When a measure is used to define the similarity of genes, a good visualization should project similar points into the same regions. This can be displayed visually by 2D maps of clustering results. For data sets 1, 2 and 5, the clusters of PCCF, PCCP and PCC were shown on PCA-F, PCA-P and PCA-N maps, where the numbers of clusters for data sets 1, 2 and 5 were 7, 8 and 13, respectively. The results showed that PCA-F gave a good visualization for every clustering result of PCCF (Figs 1(b), 3(a) and 3(b)), PCA-P maps had significant intermixing for every clustering result of PCCP (Figs 1(d), 3(c) and 3(d)), and PCA-N gave poor visualizations for clusters of PCC (Fig 1(f)). In fact, for clusters of PCCF, PCA-F was able to give a good visualization even if the number of clusters was not very appropriate. For instance, for the clusters of data set 1 generated by PCCF, PCA-FO gave clear cluster boundaries for cluster numbers from 2 to 12. These results demonstrate that PCA-F was able to project similar points into the same regions.
Moreover, in a good visualization, close projections should come from similar points, and this feature can be evaluated by the S2-value. For each data set in this paper, the normalized points were divided into clusters by Euclidean distance, PCC, PCCP, EuF and PCCF, and the S2-values of these clustering results are summarized in Table 4. Table 4 showed that the S2-values for PCCF were the largest, far greater than those of the other measures. That is, if projections of PCA-F were close neighbours in 2D space, their corresponding F-points were also strongly Pearson-correlated.

Comparison of PCA-FO and PCA-F
Here, data set 4 was divided into 6 and 20 clusters by PCCF, and these clustering results were overlaid on PCA-FO and PCA-F maps (Fig 4). In the internal regions, the PCA-F maps were crowded (Fig 4(c) and 4(d)), while the PCA-FO maps were relatively loose and clear (Fig 4(a) and 4(b)). In fact, for either component of two nearest PCA-FO projections, the spacing was greater than l/2m, where l was the largest exhibition size and m was the number of genes in the data set. In a limited display space, this feature of PCA-FO ensures that projections are relatively loose and clear. Furthermore, the positional relationships of the projections were almost the same in PCA-F and PCA-FO. In fact, the first (or second) components of PCA-FO had the same rank order as those of PCA-F.

Comparison of PCA-FO and t-SNE
Here, we also used the simple t-SNE to construct 2D projections of F-points; we named the t-SNE of F-points t-SNE-F, and the dimension of the F-points was used as the perplexity value of t-SNE-F.
Here, data set 3 was first divided into 3 and 7 clusters by PCCF, and these clustering results were then overlaid on PCA-FO and t-SNE-F maps (Fig 5). For the local validity, PCA-FO and PCA-F had no significant difference, but they were lower than t-SNE-F and t-SNE-N. For the global validity, PCA-FO, PCA-F and t-SNE-F were almost the same, and they were far better than t-SNE-N and PCA-N.
The poor global validity of t-SNE-N and PCA-N explains why they gave poor visualizations for clusters of PCC. That is, the relationship of distant normalized genes was not accurately depicted by t-SNE-N and PCA-N. For t-SNE-F, the global validity was the same as that of PCA-FO, and its local validity was superior to that of PCA-FO. However, for clusters of PCCF, t-SNE-F maps had significant intermixing between adjacent clusters (Fig 5(c) and 5(d)). In fact, for gene neighbors that lay far from any cluster center, t-SNE-F tried to project them together, but PCCF did not necessarily group them together.

The gene neighbor map of PCA-FO
To readily see which nearby 2D points were truly similar, a map of the nearest and second closest gene neighbors was generated with PCA-FO. Here, we constructed the nearest and second closest gene neighbor map of data set 2, which is shown in Fig 7. Fig 7 showed that the majority of high-dimensional nearest neighbours were located close to one another in the PCA-FO map.
The gene neighbor map revealed which pairs of truly close high-dimensional points were also close in 2D space, and which pairs were in fact distant. Moreover, PCA-FO maps combined with nearest neighbour maps provided an intuitive means to understand the relationship between clusters and the affiliation of genes with specific clusters.
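The list of pairs that such a neighbor map draws can be sketched as follows. The sketch assumes that high-dimensional neighbours are located by PCC (as for the D-plots) and that each point is linked to its two most correlated points; the function names and toy profiles are hypothetical.

```python
import math


def pearson(u, v):
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    norm_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    return cov / (norm_u * norm_v)


def neighbour_edges(points, n_neighbours=2):
    """Link every point to its n_neighbours most correlated points.

    Returns a sorted list of undirected index pairs (i, j) with i < j.
    """
    edges = set()
    for i in range(len(points)):
        others = [j for j in range(len(points)) if j != i]
        others.sort(key=lambda j: pearson(points[i], points[j]), reverse=True)
        for j in others[:n_neighbours]:
            edges.add((min(i, j), max(i, j)))
    return sorted(edges)


# Toy profiles: three rising genes followed by three falling genes.
toy_points = [
    [1, 2, 3, 4], [1, 2, 4, 5], [2, 3, 5, 9],
    [4, 3, 2, 1], [5, 4, 2, 1], [9, 5, 3, 2],
]
edges = neighbour_edges(toy_points)
print(edges)
```

On a 2D map, each pair would be drawn as a line segment between the two projections; short segments indicate truly similar genes that are also close in the map, and long segments flag similar genes that the projection has separated.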

Discussion
Although the modified cumulative probabilities are in one-to-one correspondence with their normalized points, their magnitudes differ significantly, which can cause PCA-P to give poor visualizations for clusters of PCCP. Moreover, the elements at different positions of a normalized point are not superposed equally often in the modified cumulative probability, which can make PCCP excessively dependent on the first few elements of the normalized points. These defects of the modified cumulative probability are removed by F-points. That is, all F-points have the same magnitude, and F-points ensure that all elements of a normalized point are superposed equally often. Importantly, for data sets 2 and 4, PCA-N also gave good visualizations for clusters of PCCF (such as Fig 4(e) and 4(f)). That is, F-points retain the differences between the normalized genes.
For a complex gene expression data set, a difficult issue in K-means is the estimation of K, the number of clusters. If K is unknown, starting with an arbitrary K is a relatively poor method. Here, this defect of K-means is partially mitigated by PCCF and PCA-F. That is, for genes with similar expression behavior, even if the number of clusters is not very appropriate, PCCF can group them into appropriate clusters, and PCA-F is also able to reveal their relationships.

Conclusion
In this paper, we demonstrate that PCCF is more reliable for analyzing gene expression data than other commonly used measures. Moreover, PCA-F gives good visualizations for clusters of PCCF. The success of PCCF and PCA-F indicates that effective methods for analyzing large-scale gene expression data must be based on an understanding of the biological nature of the experimental data.