## Abstract

In biological knowledge discovery, principal component analysis (PCA) is commonly used to complement clustering analysis, but it typically gives poor visualizations for most gene expression data sets. Here, we propose the PCCF measure and use PCA-F to display clusters of PCCF, where both PCCF and PCA-F are built on the modified cumulative probabilities of genes. Using simulated and experimental data sets, we demonstrate that PCCF is more appropriate and reliable for analyzing gene expression data than other commonly used distance or similarity measures, and that PCA-F is a good visualization technique for identifying clusters of PCCF. We focus on data sets in which the expression values of genes are collected at different time points.

**Citation:** Jia X, Zhu G, Han Q, Lu Z (2017) The biological knowledge discovery by PCCF measure and PCA-F projection. PLoS ONE 12(4): e0175104. https://doi.org/10.1371/journal.pone.0175104

**Editor:** Lingling An, University of Arizona, UNITED STATES

**Received:** May 5, 2016; **Accepted:** March 21, 2017; **Published:** April 11, 2017

**Copyright:** © 2017 Jia et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

**Data Availability:** All relevant data are within the paper and its Supporting Information files.

**Funding:** The author(s) received no specific funding for this work.

**Competing interests:** The authors have declared that no competing interests exist.

## Introduction

In biological knowledge discovery, clustering and visualization play central roles [1–3]. Clustering algorithms are used to search for patterns that provide additional insight into the biological function and relevance of genes [4, 5]. Among the most popular are unsupervised clustering algorithms such as K-means [5]. K-means analysis depends on choosing a distance or similarity measure that takes into account the underlying biology and the nature of the data [6]. Commonly used measures include the Pearson correlation coefficient (PCC) and Euclidean distance [7]. However, K-means cannot reveal underlying global patterns in the data or relationships between the clusters it finds. PCA is commonly used to complement K-means for this purpose, but for most gene expression data it typically gives a poor visualization [8, 9]. Because of these limitations, nonlinear dimension reduction methods that attempt to preserve local structure in the data have been developed, such as t-SNE (t-distributed Stochastic Neighbor Embedding) [8, 10, 11]. t-SNE has been successful in displaying clusters of Euclidean distance [8], but it usually gives poor visualizations for clusters of PCC.

Here, we use PCCF to measure the similarity of genes and PCA-F to display clusters of PCCF, where PCCF is the correlation coefficient of F-points, PCA-F is the principal component analysis of F-points, and the F-point of a gene is constructed from the modified cumulative probabilities of the positively and reversely normalized gene. To evaluate the PCCF measure, we apply it to cluster four gene expression data sets. The clustering results show that PCCF has greater statistical reliability and biological relevance than other commonly used distance or similarity measures. For PCA-F, the cumulative variance of its principal components is greater than 85% for every reference data set in this paper, far more than for PCA of the normalized points. Furthermore, we demonstrate that PCA-F projects similar F-points into the same regions, accurately depicts distant F-points, and accurately reveals the relationships between clusters of PCCF. These performances of PCCF and PCA-F stem from the properties of F-points. The most prominent feature of F-points is that their curves are shaped almost like a capital *N*. That is, F-points weaken the differences in curve shape between genes with similar expression behavior. Moreover, F-points enlarge the element discrepancy between dissimilar genes through their two cumulative probabilities.

However, in PCA-F maps of many expression data sets, projections in the internal regions are usually crowded; these crowded projections come from genes whose elements are relatively equivalent. A 2D projection map should help an investigator interpret any particular region of the visualization, and crowded regions make this inconvenient. To distinguish every region of the projection clearly, we propose PCA-FO, which is a similarity transformation of PCA-F. The relative positions of the PCA-FO projections of gene points are the same as those of their PCA-F projections, but the PCA-FO projections are more uniformly spaced.

In this study, data sets from published studies are used to investigate and illustrate the performance of PCCF and PCA-F, including the yeast metabolic cycle data [12], K562 cell line data [13], human embryo data [14], and mouse retinal data [7]. PCCF is first applied to divide these data sets into clusters, and the clustering results are then overlaid onto PCA-F maps. The results show that PCCF groups genes with similar expression behavior into the same clusters, and that PCA-F projects genes of the same cluster together. That is, PCCF and PCA-F can be used in conjunction to understand the logic of cluster partitions and to identify co-regulated genes. We suggest that PCCF and PCA-F provide new insights for analyzing large-scale transcriptome data.

## Materials and methods

### Data set 1

The simulated data set contained 1500 four-dimensional points belonging to 14 populations. Each population was constructed from four independent normal distributions, each being either N(10,1) or N(20,2). These two distributions can form 16 four-dimensional normal populations. Of these, (N(10,1),N(10,1),N(10,1),N(10,1)) and (N(10,1),N(20,2),N(20,2),N(20,2)) were discarded, (N(20,2),N(20,2),N(20,2),N(10,1)) consisted of 200 points, and each of the other populations consisted of 100 points.

For points of (N(20,2),N(20,2),N(20,2),N(20,2)), all elements were relatively equivalent. For points of the other populations, at least half of the elements were relatively equivalent.
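A data set with this layout can be generated as in the minimal sketch below. It is only an illustration: the second parameter of N(·,·) is read as a standard deviation (it could equally be a variance), and the function name, the seed and the population labels 0 to 13 are hypothetical and do not correspond to the markers used in the figures.

```python
import itertools

import numpy as np

LOW = (10.0, 1.0)    # N(10,1): second parameter read as a standard deviation (assumption)
HIGH = (20.0, 2.0)   # N(20,2): second parameter read as a standard deviation (assumption)


def simulate_data_set_1(seed=0):
    """Return (points, labels): 1500 four-dimensional points from 14 populations."""
    rng = np.random.default_rng(seed)
    dropped = {(LOW, LOW, LOW, LOW), (LOW, HIGH, HIGH, HIGH)}   # the two discarded populations
    doubled = (HIGH, HIGH, HIGH, LOW)                           # the population with 200 points

    points, labels = [], []
    label = 0
    for combo in itertools.product((LOW, HIGH), repeat=4):      # all 16 combinations
        if combo in dropped:
            continue
        size = 200 if combo == doubled else 100
        means = np.array([mean for mean, _ in combo])
        spreads = np.array([spread for _, spread in combo])
        points.append(rng.normal(means, spreads, size=(size, 4)))
        labels.append(np.full(size, label))                     # hypothetical label 0..13
        label += 1
    return np.vstack(points), np.concatenate(labels)


points, labels = simulate_data_set_1()
print(points.shape, len(np.unique(labels)))
```

Fixing the seed makes the simulated points reproducible, so clustering runs can be compared on identical data.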

### Data set 2

#### K562 cell line data: NCBI GEO accession number GSE12736.

Time course microarray data were obtained at seven independent time points, with duplicate experiments performed for each time point. Selecting genes with a significant detection p-value yielded 14000 probes out of a total of 23920. Quantile normalization was carried out for each data set at the seven time points using the average expression value. It was reasoned that significant genes should show more than two-fold induction at one or more time points with respect to the control sample (t = 0; before PMA treatment), and 1779 probes satisfying this requirement were identified [13, 15].

### Data set 3

#### Yeast metabolic cycle data: NCBI GEO accession number GSE3431.

This data set described the transcriptional changes in the metabolic cycle of the budding yeast *Saccharomyces cerevisiae* [12, 14]. In this experiment, gene expression behaved in a periodic manner, comprising a non-respiratory phase followed by a respiratory phase. The transcriptome was assayed every 25 min over three consecutive cycles, resulting in 36 samples (T1-T36), which were profiled using Affymetrix YG_S98 oligonucleotide arrays. Probes that had at least three ‘present’ calls, as generated by Affymetrix GeneChip software, were classified as expressed, and the data were normalized using GeneSpring v7 per-chip normalization. Using a periodicity algorithm described in the original paper, the authors classified 3552 genes as periodic, corresponding to 3656 probe sets. From these 3552 genes, the 2913 genes whose expression value was greater than 5 in at least one of the 36 samples were selected.

### Data set 4

#### Human embryo data: NCBI GEO accession number GSE18887.

The human organogenesis expression matrix [14] contained expression measurements for 5441 transcripts across 18 samples (Carnegie stages 9-14, S9-S14). These 5441 probe sets were identified as differentially expressed using a methodology based on Extraction of Differential Gene Expression (EDGE). Fang et al. initially used SOM-SVD to identify co-expressed genes in the human embryo data [10, 14], which yielded six clusters. From their analysis, they extracted 2148 differentially expressed probe sets, and we used this set of 2148 probe sets for our analysis.

### Data set 5

The raw mouse retinal data consisted of 10 SAGE libraries (38818 unique tags with tag counts ≥ 2) from developing retina taken at 2-day intervals. The samples ranged from embryonic through postnatal to adult. Among the 38818 tags, the 1467 tags with counts greater than or equal to 20 in at least one of the 10 libraries were selected [7]. The purpose of this selection was to exclude genes with uniformly low expression. The count of each tag in a SAGE library was modeled as Poisson distributed, and these Poisson distributions were independent of each other across tags and libraries [7].

### Methods

Gene expression data can be represented as a set of *n*-dimensional vectors, where *X*_{i} = {*x*_{i1}, *x*_{i2},⋯, *x*_{in}} represents the *i*-th gene and *x*_{ij} represents its expression level at the *j*-th time point.

### F-points

*X*_{i} is first normalized into *W*_{i} by Eq (1), so that the elements of *W*_{i} sum to 1. The expression levels of some genes may be negative at some time points, as for genes of data set 2; in that case *x*_{it} is replaced by an adjusted value, which is the same as *x*_{it} if all expression levels of *X*_{i} are nonnegative. The F-point is then built in three steps:

- The P-point of *X*_{i} is constructed as the modified cumulative probability of *W*_{i}.
- *V*_{i}, the ON-point of *X*_{i}, is constructed from *W*_{i}, together with the modified cumulative probability of *V*_{i}.
- The two modified cumulative probabilities are merged into *F*_{i}, the F-point of *X*_{i}, by Eq (2). *F*_{i} is a 2*n*-dimensional vector, and the sum of its elements is *n*.

The last element of the ordinary cumulative probability of *W*_{i} is always 1, which may lose part of the information in *w*_{in}, so we use the modified cumulative probability instead. Since the elements of both modified cumulative probabilities are monotonically non-decreasing, the curve of *F*_{i} is shaped almost like a capital *N*. That is, F-points weaken the differences in curve shape between genes with similar expression behavior. Admittedly, the curve shapes of genes with dissimilar expression behavior are then similar as well. However, F-points enlarge the element discrepancy between dissimilar genes through their two modified cumulative probabilities. That is, the curves of genes with dissimilar expression behavior are different *N* shapes.
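The construction can be sketched in code under two loudly stated assumptions that are not confirmed by the text: the modified cumulative probability is taken to be the midpoint form (the sum of all earlier elements plus half of the current one), and the ON-point is taken to be *W*_{i} in reverse order. These choices reproduce the stated properties (2*n* elements summing to *n*, two non-decreasing halves, an *N*-like curve), but other definitions may do so as well. The function names are hypothetical, and negative expression levels are rejected rather than adjusted, because the adjustment is not specified here.

```python
import numpy as np


def modified_cumulative(w):
    """Midpoint cumulative sum along the last axis (assumed form of the modification)."""
    return np.cumsum(w, axis=-1) - 0.5 * w


def f_points(x):
    """Return F-points (one row per gene) for a nonnegative genes-by-time matrix."""
    x = np.asarray(x, dtype=float)
    if np.any(x < 0):
        raise ValueError("negative expression levels need the adjustment step first")
    w = x / x.sum(axis=1, keepdims=True)      # normalized points: each row sums to 1
    p = modified_cumulative(w)                # P-points
    v = w[:, ::-1]                            # assumed ON-points: time order reversed
    q = modified_cumulative(v)
    return np.hstack([p, q])                  # 2n-dimensional F-points


genes = np.array([[1.0, 2.0, 3.0, 4.0],
                  [4.0, 3.0, 2.0, 1.0]])
f = f_points(genes)
print(f.sum(axis=1))   # each row sums to n = 4 under these assumptions
```

With this reading, each half of an F-point rises monotonically and the second half starts again from a low value, which gives the *N*-like profile described above.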

### PCCF measure

Here, the PCC between *F*_{i} and *F*_{j} (or between the P-points of *X*_{i} and *X*_{j}) is defined as the PCCF (or PCCP) of *X*_{i} and *X*_{j}. Likewise, the Euclidean distance between *F*_{i} and *F*_{j} (or between the two P-points) is defined as the EuF (or EuP) of *X*_{i} and *X*_{j}.

In fact, the P-point of *X*_{i} and *F*_{i} can be expressed as in Eq (3).

Based on Eq (3), the EuF and EuP between *X*_{i} and *X*_{j} are directly related. That is, EuF and EuP are essentially the same distance.

The PCCP and PCCF of *X*_{i} and *X*_{j}, in contrast, are given by Eq (4), where the mean of *F*_{i} is 0.5. Since the means of the P-points of *X*_{i} and *X*_{j} are unlikely to both be 0.5, the PCCP and PCCF of *X*_{i} and *X*_{j} differ significantly.
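These relations can be illustrated under one explicit assumption, which the description suggests but does not confirm. Writing *p*_{it} for the elements of the P-point of *X*_{i}, suppose the second half of *F*_{i} has elements *q*_{it} = 1 − *p*_{i,n−t+1}. This is what results, for example, if the modified cumulative probability takes the midpoint form and the ON-point is *W*_{i} in reverse order. Then:

```latex
\begin{aligned}
\sum_{t=1}^{n} p_{it} + \sum_{t=1}^{n} q_{it}
  &= \sum_{t=1}^{n} p_{it} + \sum_{t=1}^{n}\bigl(1 - p_{i,n-t+1}\bigr) = n,
  \qquad \text{so the mean of } F_i \text{ is } \tfrac{n}{2n} = 0.5,\\
\mathrm{EuF}(X_i, X_j)^2
  &= \sum_{t=1}^{n}\bigl(p_{it}-p_{jt}\bigr)^2
   + \sum_{t=1}^{n}\bigl(p_{i,n-t+1}-p_{j,n-t+1}\bigr)^2
   = 2\,\mathrm{EuP}(X_i, X_j)^2 .
\end{aligned}
```

Under this assumption EuF is √2 times EuP, so the two distances rank gene pairs identically. PCC, by contrast, centers each vector on its own mean: every *F*_{i} is centered on 0.5, whereas the mean of a P-point changes from gene to gene, which is why PCCF and PCCP need not agree.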

### PCA-F and PCA-FO

Here, (*f*_{i}(1), *f*_{i}(2)) is called the PCA-F projection of *X*_{i}, where *f*_{i}(1) and *f*_{i}(2) are the first and second principal components of *F*_{i}, respectively. Moreover, (*F*_{i}(1), *F*_{i}(2)) is the PCA-FO projection of *X*_{i}, defined by Eq (5), where *m* is the number of genes in the data set, and *n*(*f*_{i}(1)) and *n*(*f*_{i}(2)) are the ordering numbers of *f*_{i}(1) and *f*_{i}(2), respectively. That is, all *f*_{i}(1) (or *f*_{i}(2)) are first ordered from the smallest value to the largest, and *n*(*f*_{i}(1)) (or *n*(*f*_{i}(2))) is the position of *f*_{i}(1) (or *f*_{i}(2)) in that ordering. For instance, if *f*_{i}(1) is the *u*-th smallest value among all *f*_{j}(1), then *n*(*f*_{i}(1)) is *u*.
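The two ingredients named here, the first two principal components and their ordering numbers, can be computed as in the sketch below. It is an illustration only: the function names are hypothetical, PCA is done by a plain SVD of the centered data, and the way Eq (5) combines the ordering numbers with *m* is deliberately not reproduced.

```python
import numpy as np


def pca_projection(points, n_components=2):
    """Project rows of `points` onto their leading principal components."""
    points = np.asarray(points, dtype=float)
    centered = points - points.mean(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


def ordering_numbers(values):
    """Return u for each value, where the value is the u-th smallest (1-based)."""
    values = np.asarray(values)
    order = np.argsort(values, kind="stable")
    ranks = np.empty(len(values), dtype=int)
    ranks[order] = np.arange(1, len(values) + 1)
    return ranks


rng = np.random.default_rng(0)
demo_points = rng.normal(size=(50, 8))          # stand-in for F-points
projection = pca_projection(demo_points)        # columns play the role of f_i(1), f_i(2)
rank_1 = ordering_numbers(projection[:, 0])     # n(f_i(1))
rank_2 = ordering_numbers(projection[:, 1])     # n(f_i(2))
```

Because ordering numbers depend only on the sort order, any transformation built from them keeps the left-to-right and bottom-to-top order of the PCA-F projections.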

### *S*-value

The average silhouette value is a quantitative way to compare different clustering solutions [16]. For a data set, we use the average silhouette value to quantify the clustering results of its normalized points, P-points and F-points. We use the *S*1-value to denote the average silhouette value of the data set,

$$S1 = \frac{1}{m}\sum_{i=1}^{m}\frac{b_i - a_i}{\max(a_i, b_i)},$$

where *a*_{i} is the average distance from *Y*_{i} to the other points in the same cluster as *Y*_{i}, *b*_{i} is the minimum, over the other clusters, of the average distance from *Y*_{i} to the points of that cluster, *Y*_{i} is the *i*-th point of the data set, and *m* is the number of genes in the data [16].
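The average silhouette value can be computed directly from a pairwise distance matrix, as in the sketch below. It follows the standard silhouette definition [16]; the function name is hypothetical, singleton clusters are given a silhouette of 0 by convention, and turning a similarity such as PCC into a distance (for instance 1 − PCC) is left to the caller as an assumption.

```python
import numpy as np


def average_silhouette(dist, labels):
    """Average silhouette value from a symmetric distance matrix and cluster labels."""
    dist = np.asarray(dist, dtype=float)
    labels = np.asarray(labels)
    clusters = np.unique(labels)
    scores = np.zeros(len(labels))
    for i in range(len(labels)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            continue                                  # singleton cluster: silhouette 0
        a = dist[i, same].sum() / (n_same - 1)        # dist[i, i] is 0, so it adds nothing
        b = min(dist[i, labels == c].mean() for c in clusters if c != labels[i])
        if max(a, b) > 0:
            scores[i] = (b - a) / max(a, b)
    return scores.mean()


positions = np.array([0.0, 1.0, 10.0, 11.0])
demo_dist = np.abs(positions[:, None] - positions[None, :])
demo_score = average_silhouette(demo_dist, np.array([0, 0, 1, 1]))
```

Passing a precomputed distance matrix keeps the function independent of the measure, so the same code can score clusters of any of the distances or similarities compared in this paper.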

Moreover, we use the *S*2-value to evaluate whether projections in the same region come from similar points. The projections are first divided into clusters by Euclidean distance, and the cluster membership of *Y*_{i} is set to *k* if its projection belongs to the *k*-th cluster. The *S*2-value is then the average silhouette value of the points *Y*_{i} under this membership. When the *S*2-value is used to evaluate the quality of projections, it is referred to as the *S*2-value of PCCF if the similarity of genes is defined by the PCCF measure, and similarly for the other measures.

### *D*-plot

We term a dimension reduction technique a ‘locally valid’ (or ‘globally valid’) visualization if the *i*-th closest neighbour (or farthest point) of a point is its *j*-th closest neighbour (or farthest point) in 2D space, where *i*, *j* and |*i* − *j*| are relatively small numbers. Point neighbours are located by the PCC measure, while projection neighbours are located by Euclidean distance.

The local and global validity can be quantified by the *D*_{1}-plot and the *D*_{2}-plot, respectively, which are defined by Eq (6), where *m* is the number of points in the data, *k* is a chosen limit of local validity, *ρ*_{2}(*i*, *a*) is the PCC between *X*_{i} and its *a*-th closest neighbor in 2D space, *ρ*_{n}(*i*, *c*) is the PCC between *X*_{i} and its *c*-th closest neighbor in the high-dimensional space, *ρ*_{2}(*i*, *e*) is the PCC between *X*_{i} and its *e*-th farthest point in 2D space, and *ρ*_{n}(*i*, *f*) is the PCC between *X*_{i} and its *f*-th farthest point in the high-dimensional space.

In general, when PCC is used to locate point neighbours, the closest neighbors of a projection do not necessarily come from the real neighbors of the point. That is, if the projection of the *c*-th closest neighbor of *X*_{i} in the high-dimensional space is the *s*-th (*s* > *k*) closest neighbor of the projection of *X*_{i}, then *ρ*_{n}(*i*, *c*) does not appear among the *ρ*_{2}(*i*, *a*) terms, so the sum of the PCCs over the 2D neighbors cannot exceed the sum over the high-dimensional neighbors. Moreover, for large-scale gene expression data and a relatively small *k*, *ρ*_{n}(*i*, *c*) is usually nonnegative, so that *D*_{1}(*b*) does not exceed 1.

We connect the points (*b*, *D*_{1}(*b*)) into a broken line, which is named the *D*_{1}-plot. The closer the *D*_{1}-plot is to *Y* = 1, the more of the high-dimensional nearest neighbours are located close to one another in the 2D map. The *D*_{2}-plot is defined similarly: the closer it is to *Y* = 1, the more accurately the relationships between distant points are depicted.
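One plausible reading of the *D*_{1}-plot, consistent with the quantities defined for Eq (6) but not taken from it, is the ratio between the PCC accumulated over the *b* closest 2D neighbours and the PCC accumulated over the *b* closest high-dimensional neighbours, summed over all points. The sketch below implements that reading; the function name and the ratio itself are assumptions.

```python
import numpy as np


def d1_curve(points, projections, k):
    """Hypothetical D1(b) for b = 1..k (requires k < number of points)."""
    points = np.asarray(points, dtype=float)
    projections = np.asarray(projections, dtype=float)
    m = len(points)
    corr = np.corrcoef(points)                       # PCC between every pair of rows

    # High-dimensional neighbours: largest PCC first, the point itself excluded.
    ranked = corr.copy()
    np.fill_diagonal(ranked, -np.inf)
    high_order = np.argsort(-ranked, axis=1)[:, :k]

    # 2D neighbours: smallest Euclidean distance first, the point itself excluded.
    diff = projections[:, None, :] - projections[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)
    low_order = np.argsort(dist, axis=1)[:, :k]

    rows = np.arange(m)[:, None]
    rho_n = corr[rows, high_order]                   # rho_n(i, c), shape (m, k)
    rho_2 = corr[rows, low_order]                    # rho_2(i, a), shape (m, k)
    return np.cumsum(rho_2.sum(axis=0)) / np.cumsum(rho_n.sum(axis=0))


rng = np.random.default_rng(1)
demo_points = rng.normal(size=(30, 6))
demo_curve = d1_curve(demo_points, demo_points[:, :2], k=5)   # crude 2D "projection"
```

Because the *b* closest high-dimensional neighbours have the largest possible PCC sum, the ratio stays at or below 1 whenever the denominator is positive, and a value near 1 means the 2D neighbours are close to the true ones.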

## Results

All clustering results were generated by K-means on the normalized points, with PCCF, PCC, PCCP, EuF, Euclidean distance, TransChisq and PoissonC chosen as the distance or similarity measure of genes. The number of clusters was mainly taken from the corresponding references. In detail, Limb et al. divided data set 2 into 8 clusters by Euclidean distance [13]; Bushati et al. divided data set 3 into 3 clusters, and data set 4 into 6 and 10 clusters, by Euclidean distance [8]; and data set 5 had been grouped into 30 clusters by the TransChisq and PoissonC measures [7, 17], respectively. Furthermore, for every clustering result, K-means was iterated at least 1000 times.

### The statistical reliability of PCCF

We used the *S*1-value to demonstrate the statistical reliability of clusters of PCCF. For comparison, the normalized genes of each experimental data set were divided into clusters by Euclidean distance, PCC, PCCP, EuF and PCCF. The *S*1-values of these clustering results were summarized in Table 1. Within each data set, Table 1 showed that the *S*1-value of the clusters of PCCF was the largest, far greater than those of the other measures. That is, clusters of PCCF were better separated than those of the other measures.

### The biochemical reliability of PCCF

In general, the patterns revealed by the clusters under different measures roughly agreed with each other. For instance, data set 5 had been grouped into 30 clusters by the TransChisq and PoissonC measures, and those studies used five mouse photoreceptor genes and thirty-four cell-specific genes to demonstrate that the TransChisq and PoissonC measures were more efficient for analyzing SAGE data than PCC and Euclidean distance [7, 17]. The five photoreceptor genes showed high tag counts in late retinal development (adult), and the thirty-four tags showed the most dynamic and cell-specific expression in the mouse neonatal retina (developmental stages *P*_{0} − *P*_{6}) [7]. For comparison, we also used PCCF and PCCP to group these 1467 tags into 30 clusters.

Only PCCF grouped the five rhodopsin tags together, while the other measures divided them into two clusters (Table 2). Moreover, the thirty-four ‘cell-specific’ tags were used to test the sensitivity and specificity of these measures, and the comparison statistics were summarized in Table 2. For each measure, its three most dynamic clusters that contained ‘cell-specific’ tags were selected. In Table 2, the clusters of PCCF, TransChisq and PoissonC showed no significant difference for these cell-specific genes. That is, PCCF was also appropriate and reliable for analyzing SAGE data.

### The projecting reliability of PCA-F

The cumulative variance of the principal components is commonly used to assess the projecting reliability of PCA [18]. For all data sets in this paper, the cumulative variances of PCA-F, PCA-P and PCA-N were summarized in Table 3, where PCA-P and PCA-N are the PCA of P-points and of normalized points, respectively. For every data set, Table 3 showed that the cumulative variances of PCA-F and PCA-P had no significant difference, with PCA-P slightly greater than PCA-F. Importantly, the cumulative variances of PCA-F and PCA-P were greater than 85% for every data set. In contrast, the cumulative variance of PCA-N was far less than those of PCA-F and PCA-P for every data set, and only for data set 4 was it slightly greater than 85%.
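As a minimal sketch, the cumulative variance can be read as the fraction of total variance captured by the leading principal components. Taking the first two components (those used for a 2D map) is an assumption, and the function name is hypothetical.

```python
import numpy as np


def cumulative_variance(points, n_components=2):
    """Fraction of total variance captured by the leading principal components."""
    points = np.asarray(points, dtype=float)
    centered = points - points.mean(axis=0)
    singular = np.linalg.svd(centered, compute_uv=False)
    variances = singular ** 2               # proportional to the component variances
    return variances[:n_components].sum() / variances.sum()


# Four corners of a unit square embedded in 3D: all variance lies in two directions.
square = np.array([[0.0, 0.0, 0.0],
                   [1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0],
                   [1.0, 1.0, 0.0]])
two_component_share = cumulative_variance(square, n_components=2)
one_component_share = cumulative_variance(square, n_components=1)
```

A value above 0.85 for two components would correspond to the 85% threshold discussed for Table 3.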

Furthermore, we used data set 1 to assess the statistical reliability of PCA-F. According to the population membership of its points, data set 1 was mapped by PCA-F, PCA-P and PCA-N (Fig 1). Fig 1(a) and 1(c) showed that, although there was a little intermixing between adjacent populations, PCA-F and PCA-P projected most points of the same population together. Importantly, PCA-F and PCA-P projected points together even when all of their elements were relatively equivalent. For instance, PCA-F and PCA-P projected most points of (N(20,2),N(20,2),N(20,2),N(20,2)) together; these points were marked by 11 in Fig 1(a) and 1(c). In contrast, PCA-N clearly projected the points onto seven regions, but each region contained significantly intermixed projections of two or more populations (Fig 1(e)).

**Fig 1.** **(a)** PCA-F map of 14 populations. **(b)** Overlay of 7 clusters of PCCF onto PCA-F map. **(c)** PCA-P map of 14 populations. **(d)** Overlay of 7 clusters of PCCP onto PCA-P map. **(e)** PCA-N map of 14 populations. **(f)** Overlay of 7 clusters of PCC onto PCA-N map.

### The feature of F-points

The down-regulated genes of data set 2 were selected to explore the features of F-points, where data set 2 was divided into 12 clusters by PCCF and by PCC, respectively. The 3 clusters of PCCF and the 4 clusters of PCC that contained down-regulated genes were selected, and the curve shapes of the F-points and normalized points of these clusters were shown in Fig 2. For clusters of PCCF, Fig 2 showed that the curves of the F-points within every cluster were shaped almost like a capital *N*, whereas the elements of F-points from different PCCF clusters differed significantly.

**Fig 2.** The *X*-axis represents the different time points. The *Y*-axis represents the expression level. **(a, b and c)** Profile plots of the normalized points of three clusters of PCCF. **(d, e, f and g)** Profile plots of the normalized points of four clusters of PCC. **(h, i and j)** Profile plots of the F-points of three clusters of PCCF. **(k, l, m and n)** Profile plots of the F-points of four clusters of PCC.

Furthermore, Fig 2 showed that similarity among F-points differed significantly from similarity among normalized points. For instance, for genes in the second cluster of PCCF, the curves of the normalized points showed no specific pattern (Fig 2(b)), but their F-points differed only slightly (Fig 2(i)).

### The consistency between PCA-F and PCCF

When a measure is used to define the similarity of genes, a good visualization should project similar points into the same regions. This can be displayed visually by 2D maps of clustering results. For data sets 1, 2 and 5, the clusters of PCCF, PCCP and PCC were shown on PCA-F, PCA-P and PCA-N maps, where the numbers of clusters for data sets 1, 2 and 5 were 7, 8 and 13, respectively. The results showed that PCA-F gave a good visualization for every clustering result of PCCF (Figs 1(b), 3(a) and 3(b)), PCA-P maps had significant intermixing for every clustering result of PCCP (Figs 1(d), 3(c) and 3(d)), and PCA-N gave poor visualizations for clusters of PCC (Fig 1(f)). In fact, for clusters of PCCF, PCA-F gave a good visualization even if the number of clusters was not very appropriate. For instance, for clusters of data set 1 generated by PCCF, PCA-FO gave clear cluster boundaries for numbers of clusters from 2 to 12. These results demonstrate that PCA-F projected similar points into the same regions.

**Fig 3.** **(a)** Overlay of 8 clusters of PCCF of data set 2 onto PCA-F map. **(b)** Overlay of 13 clusters of PCCF of data set 5 onto PCA-F map. **(c)** Overlay of 8 clusters of PCCP of data set 2 onto PCA-P map. **(d)** Overlay of 13 clusters of PCCP of data set 5 onto PCA-P map. **(e)** Overlay of 8 clusters of EuF of data set 2 onto PCA-P map. **(f)** Overlay of 13 clusters of EuF of data set 5 onto PCA-P map.

Moreover, in a good visualization, close projections should come from similar points, and this feature can be evaluated by the *S*2-value. For each data set in this paper, the normalized points were divided into clusters by Euclidean distance, PCC, PCCP, EuF and PCCF. The *S*2-values of these clustering results were summarized in Table 4. For every data set, Table 4 showed that the *S*2-value of PCCF was the largest, far greater than those of the other measures. That is, if PCA-F projections were close neighbours in 2D space, their corresponding F-points were also highly Pearson correlated.

### Comparison of PCA-FO and PCA-F

Data set 4 was divided into 6 and 20 clusters by PCCF, and these clustering results were overlaid on PCA-FO and PCA-F maps (Fig 4). Fig 4 showed that PCA-FO and PCA-F gave good visualizations for both clustering results. However, for projections in the internal regions, PCA-F maps were crowded (Fig 4(c) and 4(d)), while PCA-FO maps were relatively loose and clear (Fig 4(a) and 4(b)).

**Fig 4.** **(a)** Overlay of 6 clusters onto PCA-FO map. **(b)** Overlay of 20 clusters onto PCA-FO map. **(c)** Overlay of 6 clusters onto PCA-F map. **(d)** Overlay of 20 clusters onto PCA-F map. **(e)** Overlay of 6 clusters onto PCA-N map. **(f)** Overlay of 20 clusters onto PCA-N map.

In fact, for each component, the spacing between the two nearest PCA-FO projections was greater than *l*/2*m*, where *l* was the largest exhibition size and *m* was the number of genes in the data set. In a limited display space, this feature of PCA-FO ensured that projections were relatively loose and clear. Furthermore, the relative positions of the projections were almost the same in PCA-F and PCA-FO. In fact, the first (or second) components of PCA-FO had the same order of size as those of PCA-F.

### Comparison of PCA-FO and t-SNE

We also used the simple t-SNE to construct 2D projections of F-points. We named the t-SNE of F-points t-SNE-F, and the dimension of the F-points was used as the perplexity value of t-SNE-F.

Data set 3 was first divided into 3 and 7 clusters by PCCF, and these clustering results were then overlaid on PCA-FO and t-SNE-F maps (Fig 5). Fig 5(a) and 5(b) showed that PCA-FO gave good 2D projections of these clustering results. However, Fig 5(c) and 5(d) showed that t-SNE-F maps had significant intermixing for both clustering results.

**Fig 5.** **(a)** Overlay of 3 clusters onto PCA-FO map. **(b)** Overlay of 7 clusters onto PCA-FO map. **(c)** Overlay of 3 clusters onto t-SNE-F map. **(d)** Overlay of 7 clusters onto t-SNE-F map.

### The local and global validity of PCA-FO

The *D*_{1}-plot and *D*_{2}-plot were used to assess the local and global validity of different dimension reduction techniques, including t-SNE-N, the t-SNE of normalized points; the plots for data set 2 were shown in Fig 6(a) and 6(b), respectively. For local validity, Fig 6(a) showed that PCA-FO, PCA-F and PCA-N had no significant difference, but all three were worse than t-SNE-F and t-SNE-N. For global validity, PCA-FO, PCA-F and t-SNE-F were almost the same, and they were far better than t-SNE-N and PCA-N.

**Fig 6.** The *D*-plots of PCA-FO, PCA-F, PCA-N, t-SNE-F and t-SNE-N are displayed as a green line, red line, gray dotted line, blue line and pink dotted line, respectively. **(a)** *D*_{1}-plot of PCA-FO, PCA-F, PCA-N, t-SNE-F and t-SNE-N. **(b)** *D*_{2}-plot of PCA-FO, PCA-F, PCA-N, t-SNE-F and t-SNE-N.

The poor global validity of t-SNE-N and PCA-N explains why they gave poor visualizations for clusters of PCC. That is, the relationships between distant normalized genes were not accurately depicted by t-SNE-N and PCA-N. For t-SNE-F, the global validity was the same as that of PCA-FO, and the local validity was superior to that of PCA-FO. However, for clusters of PCCF, t-SNE-F maps had significant intermixing between adjacent clusters (Fig 5(c) and 5(d)). In fact, for gene neighbors far from every cluster center, t-SNE-F tried to project them together, but PCCF did not necessarily group them together.

### The gene neighbor map of PCA-FO

To readily see which nearby 2D points were truly similar, a map of the nearest and second-closest gene neighbors was generated on the PCA-FO projection. We constructed this map for data set 2 (Fig 7). Fig 7 showed that the majority of high-dimensional nearest neighbours were located close to one another in the PCA-FO map.
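The neighbour pairs behind such a map can be found as in the sketch below, which locates the nearest and second-closest neighbour of every gene by PCC, the measure used above to locate point neighbours. The function name and the toy profiles are hypothetical, and drawing the connecting lines on the 2D map is left out.

```python
import numpy as np


def two_nearest_neighbors(points):
    """Indices of the nearest and second-closest neighbour of each row, by PCC."""
    corr = np.corrcoef(np.asarray(points, dtype=float))
    np.fill_diagonal(corr, -np.inf)            # a point is not its own neighbour
    order = np.argsort(-corr, axis=1)          # largest correlation first
    return order[:, 0], order[:, 1]


profiles = np.array([[1.0, 2.0, 3.0, 4.0],     # rising
                     [2.0, 4.0, 6.0, 8.5],     # rising
                     [4.0, 3.0, 2.0, 1.0],     # falling
                     [8.0, 6.0, 4.0, 1.0]])    # falling
nearest, second = two_nearest_neighbors(profiles)
```

Each pair (*i*, nearest[*i*]) would then be joined by one line style on the projection map and each pair (*i*, second[*i*]) by another, so that long lines expose neighbours that the projection has pulled apart.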

**Fig 7.** The nearest and second-closest neighbors of the genes of data set 2. Nearest gene neighbors are connected by red lines, and second-closest gene neighbors by blue lines.

The gene neighbor map revealed which pairs of high-dimensional points were truly close, and which pairs were in fact distant in 2D space. Moreover, PCA-FO maps combined with nearest neighbour maps provided an intuitive means to understand the relationships between clusters and the affiliation of genes with specific clusters.

## Discussion

Although modified cumulative probabilities are in one-to-one correspondence with their normalized points, their magnitudes differ significantly, which can cause PCA-P to give poor visualizations for clusters of PCCP. Moreover, the elements at different positions of a normalized point are not superposed equally often in the modified cumulative probability, which can make PCCP excessively dependent on the first few elements of the normalized points. These defects of the modified cumulative probability are removed by F-points: the magnitudes of F-points are the same, and F-points ensure that all elements of a normalized point are superposed equally often. Importantly, for data sets 2 and 4, PCA-N also gave good visualizations for clusters of PCCF (such as Fig 4(e) and 4(f)). That is, F-points retain the differences between the normalized genes.

For a complex gene expression data set, a difficult issue in K-means is the estimation of *K*, the number of clusters. If *K* is unknown, starting with an arbitrary *K* is a relatively poor method. This defect of K-means is partially mitigated by PCCF and PCA-F. That is, for genes with similar expression behavior, even if the number of clusters is not very appropriate, PCCF can group them into appropriate clusters, and PCA-F can reveal their relationships.

## Conclusion

In this paper, we demonstrate that PCCF is more reliable for analyzing gene expression data than other commonly used measures. Moreover, PCA-F gives good visualizations of clusters of PCCF. The success of PCCF and PCA-F indicates that effective methods for analyzing large-scale gene expression data must be based on an understanding of the biological nature of the experimental data.

## Supporting information

### S1 File. A freely available MATLAB code that can obtain F-points, PCA-FO and nearest neighbour maps for a data set.

https://doi.org/10.1371/journal.pone.0175104.s001

(ZIP)

## Acknowledgments

This work rests almost entirely on open data, and the contributors of these data are gratefully acknowledged. We also deeply thank Mr Yisu Liu (Linyi No. 1 High School, PR China) for carefully reviewing our manuscript.

## Author Contributions

**Conceptualization:** XJ. **Data curation:** XJ GZ. **Formal analysis:** XJ QH. **Funding acquisition:** ZL. **Investigation:** XJ. **Methodology:** XJ. **Project administration:** ZL. **Resources:** ZL. **Software:** XJ. **Supervision:** ZL. **Validation:** ZL. **Visualization:** XJ. **Writing – original draft:** XJ. **Writing – review & editing:** XJ.

## References

- 1. Gehlenborg N, O’Donoghue SI, Baliga NS, Goesmann A, Hibbs MA, Kitano H, et al. Visualization of omics data for systems biology. Nat Methods, 2010; 7:s56–s68. pmid:20195258
- 2. Taskesen E, Reinders MJ. 2D Representation of Transcriptomes by t-SNE Exposes Relatedness between Human Tissues. PloS one, 2016; 11: e0149853. pmid:26906061
- 3. Lotfi E, Keshavarz A. Gene expression microarray classification using PCA-BEL. Comput Biol Med, 2014; 54:180–187. pmid:25282708
- 4. Tavazoie S, Hughes JD, Campbell MJ, Cho RJ, Church GM. Systematic determination of genetic network architecture. Nat Genet, 1999; 22:281–285. pmid:10391217
- 5. Jaskowiak PA, Campello RJ, Costa IG. On the selection of appropriate distances for gene expression data clustering. BMC Bioinformatics, 2014; 15:390–400.
- 6. Quackenbush J. Computational analysis of microarray data. Nat Rev Genet, 2001; 2:418–427.
- 7. Cai L, Huang H, Blackshaw S, Liu JS, Cepko C, Wong WH. Clustering analysis of SAGE data using a Poisson approach. Genome Biology, 2004; 5:R51. pmid:15239836
- 8. Bushati N, Smith J, Briscoe J, Watkins C. An intuitive graphical visualization technique for the interrogation of transcriptome data. Nucleic Acids Research, 2011; 39:7380–7389. pmid:21690098
- 9. Sanguinetti G. Dimensionality reduction of clustered data sets. IEEE Trans Pattern Anal Mach Intell, 2008; 30:535–540. pmid:18195446
- 10. van der Maaten L, Hinton G. Visualizing data using t-SNE. J Mach Learn Res, 2008; 9:2579–2605.
- 11. Hinton G, Roweis S. Stochastic Neighbor Embedding. Neural Information Processing Systems, 2003; 15:857–864.
- 12. Tu BP, Kudlicki A, Rowicka M, McKnight SL. Logic of the yeast metabolic cycle: temporal compartmentalization of cellular processes. Science, 2005; 310:1152–1158. pmid:16254148
- 13. Limb JK, Yoon S, Lee KE, Kim BH, Lee S, Bae YS, et al. Regulation of megakaryocytic differentiation of K562 cells by FosB, a member of the Fos family AP-1 transcription factors. Cell Mol Life Sci, 2009; 66:1962–1973. pmid:19381435
- 14. Fang H, Yang Y, Li C, Fu S, Yang Z, Jin G, et al. Transcriptome analysis of early organogenesis in human embryos. Dev Cell, 2010; 19:174–184.
- 15. Saeed AI, Bhagabati NK, Braisted JC, Liang W, Sharov V, Howe EA, et al. TM4 microarray software suite. Methods Enzymol, 2006; 411:134–193. pmid:16939790
- 16. Rousseeuw PJ. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 1987; 20:53–65.
- 17. Kim K, Zhang S, Jiang K, Cai L, Lee IB, Feldman LJ, et al. Measuring similarities between gene expression profiles through new data transformations. BMC Bioinformatics, 2007; 8:29. pmid:17257435
- 18. Yeung KY, Ruzzo WL. Principal Component Analysis for clustering gene expression data. Bioinformatics, 2001; 17:763–774. pmid:11590094