Rigid geometry solves “curse of dimensionality” effects: an application to proteomics

Motivation: The quality of long-term sample preservation at ultralow temperatures is not well studied. To improve our understanding, we need an evaluation strategy for analyzing protein degradation or metabolism at subfreezing temperatures. In this manuscript, we obtained LC/MS (liquid chromatography-mass spectrometry) data of calculated protein signal intensities in HEK-293 cells to monitor these processes.
Results: Our first attempt at directly clustering the values failed to arrange the sample clusters properly, most likely because of the “curse of dimensionality”. By utilizing rigid geometry with a p-adic (I-adic) metric, however, we succeeded in rearranging the sample clusters into a meaningful order. Thus we could eliminate the “curse of dimensionality” from the data set. We discuss a possible interpretation of a group of protein signals as a quasiparticle Majorana fermion. It is possible that our calculation elucidates a characteristic value of a system in an almost neutral logarithmic Boltzmann distribution of any type.
Contacts: f.peregrinusns@mbox.kyoto-inet.or.jp


Introduction
Even when frozen, biological samples are said to degrade with age, and most frozen cell cultures are stored only until they are two years old. However, what actually happens in those samples is not well studied, so far as we know. There are a few reports that describe enzymatic activities in frozen cultures, such as lipase and peroxidase activities (e.g. Parducci and Fennema 1978; Voituron et al. 2006). However, we still do not know the proteomic details of cells stored at subfreezing temperatures. For LC/MS (liquid chromatography-mass spectrometry), the only report we know of that deals with cooled environments concerns frogs kept in conditions mimicking winter (Kiss et al. 2011). That report lacks solid statistical analysis and does not address subfreezing environments. Therefore we need a solid proteomic data set from actual frozen cultures under long-term storage at subfreezing temperatures to evaluate the potential degradation/metabolism.
To do this, we first need to set up an evaluation procedure that can reliably distinguish samples from long-term storage from freshly prepared samples.
Clustering analyses are popular approaches for such evaluation. Based on particular criteria that evaluate similarity/dissimilarity, clustering analyses can reveal meaningful groups in the data. The approaches are based on bottom-up calculation from the data, and there is no criterion outside of the system. However, problems remain, such as how to define the groups and which clustering method to select. If the topological structure of the hierarchical tree or the index numbers of the clustering groups are the same across all the different clustering methods, the output of the analyses is sound; however, this is not always the case: there may be discrepancies, and they cast doubt on the confidence of the results.
There are two main types of clustering analysis: hierarchical clustering and non-hierarchical clustering. Hierarchical clustering can be calculated whenever some distance/dissimilarity is defined between data points; it joins the data points according to their closeness, step by step, until the whole observed data set is combined. Roughly speaking, it reduces multidimensional data to two-dimensional data, with a data labeling axis and a clustering distance axis. The representative methods are: single linkage, complete linkage, group average, weighted average, centroid, median, and Ward's method. If we set the dissimilarities of i, j, k as C_i, C_j, C_k, [...]
Non-hierarchical clustering, for example the k-means method, is an optimization method based on partitioning into groups and classification. First of all, we have to set the number of groups, k, in the data set. As the dissimilarity, we use the squared Euclidean distance. After that, we set an initial grouping, score each group, and assign the samples one by one. We select the lower-scoring assignment and repeat the process. In the sense that we have to select the number of groups it is a top-down approach, but in other respects it is bottom-up. All eight of these methods are easy to implement on a computer and are very frequently used, compared to other, more complex methodologies.
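The two families described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the implementation used in this study: the four toy points, the function names `single_linkage` and `kmeans`, and the choice of the first k points as initial centres are all hypothetical.

```python
from itertools import combinations


def sq_euclid(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def single_linkage(points):
    """Agglomerative clustering with single linkage.

    Returns the merge history as (cluster_a, cluster_b, distance) tuples,
    where each cluster is a tuple of point indices.
    """
    clusters = [(i,) for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        # Single linkage: cluster distance is that of the closest member pair.
        def dist(pair):
            a, b = pair
            return min(sq_euclid(points[i], points[j]) for i in a for j in b)

        a, b = min(combinations(clusters, 2), key=dist)
        merges.append((a, b, dist((a, b))))
        clusters = [c for c in clusters if c not in (a, b)]
        clusters.append(tuple(sorted(a + b)))
    return merges


def kmeans(points, k, n_iter=100):
    """Lloyd's algorithm; initial centres are the first k points (assumption)."""
    centres = [list(p) for p in points[:k]]
    labels = None
    for _ in range(n_iter):
        # Assign every point to its nearest centre.
        new_labels = [
            min(range(k), key=lambda c: sq_euclid(p, centres[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Move each centre to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centres[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical toy data: two well-separated pairs of points.
points = [(0.0, 0.0), (1.0, 0.0), (10.0, 10.0), (12.0, 10.0)]
merges = single_linkage(points)
labels = kmeans(points, 2)
```

On this toy input both approaches recover the same two groups; with real high-dimensional data the methods may disagree, which is the discrepancy discussed above.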
One problem for the analyses is that, owing to the high dimensionality (more than 1000) of the samples, there is a "curse of dimensionality" effect: the variances among samples become large and the data sparse, resulting in meaningless output of the clustering analysis (e.g. Ronan et al. 2016).

In brief, 'fresh' means fresh samples that immediately underwent the protein extraction process. '1 h' means they were harvested and kept for 1 h at -80°C in freezing medium. 'o/n-o/n' means they were harvested and kept overnight at -80°C in freezing medium, then transferred to liquid nitrogen storage overnight.

Utilizing a p-adic (I-adic) metric embedded on rigid geometry
Now we set an analogy (a grounding metaphor) between biological data space (as base) and mathematical space (as target). We will not go into details of the opposite direction of the analogy, as we still do not understand in detail how the target space behaves mathematically. We consider a projection from the base to the target, and also utilize theories of formal schemes in analyses of the projected data (linking metaphors). The improvement of the mathematical metric for the Unused values was guided by ideas from rigid geometry as follows. Please also refer to Adachi (2016). In brief, the data from each sample were first arranged by the ranks k of the Unused values N_k, approximated by a logarithmic approximation: [...] Infinite dimensional covering of the last sentence is Schottky-type uniformization (c.f. Fujiwara and Kato 2006) [...] is a crystalline complex, cohomological to crystalline cohomology (Grothendieck, 1966; 1968) [...] is an i-th crystalline cohomology group. Therefore H^i_dR = 0 and that 'any i-th closed form is an exact form' are equivalent. We can take a set of rigid analytic spaces, modular N_k, as Ω. Please note that, setting p as an element of a Coxeter group, an identity element of p corresponds to an identity element of the Hecke ring. d = p is thus proper. Furthermore, i = v is smooth when p ≠ 1. Since the exterior derivative of p is 0, p is obviously an exact form on unit polydiscs of the rigid analytic space, rendering a locally ringed G-topologized space with a sheaf of non-archimedean fields, which ensures a covering by open subspaces isomorphic to affinoids. Shifting from the -v metric (non-archimedean for 1/N_k = p^(-v)) to the v metric (non-archimedean for N_k = p^v) does not change this property, considering the 1/N_k space as the basis. In other words, ln N_k is related to the kernel of the present signal space and ln p is related to the kernel of the potential signal space, which is the image of the past signal space. Their quotient, v = ln N_k / ln p, is the image of the potential signal space, which reflects the physiological situation of the system adapted to the expected environments without any noise from the current system. Overall, the system described here has a rigid cohomology (Kedlaya 2009). Considering N k = PD Nk and P = 1/(D a ζ (s)) in this case (Adachi 2016), overconvergence of the v values is thus achieved owing to the cancelling out of high dimensionality in N and a, together with the topological characteristics (G-topology) of v on quasi-compactness and quasi-separation as mentioned before.

[...] only the 3-year sample fell between those clusters (Fig. 1). These results suggest that there might be "curse of dimensionality" effects, arising from widely varying values of each data point in high dimensionality that disturb convergence of the output values (e.g. Ronan et al. 2016). To confirm this idea, the number of unknown parameters in the neural network is 1630. The image of these ideas is described in Fig. 2A.
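As a minimal sketch of the v metric described in this section, the relation N_k = p^v can be inverted to v = ln N_k / ln p for each ranked value. Everything concrete below is an assumption made for illustration: the input values, the choice p = 2, the function names, and in particular the form N_k ≈ a - b ln k for the logarithmic approximation, whose exact expression is not shown in the text above.

```python
import math


def rank_values(values):
    """Sort signal values in descending order so that rank k = 1 is the largest."""
    return sorted(values, reverse=True)


def fit_log_rank(ranked):
    """Least-squares fit of N_k ~ a - b * ln(k); returns (a, b).

    The functional form is an assumption for this sketch.
    """
    xs = [math.log(k) for k in range(1, len(ranked) + 1)]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ranked) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ranked))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, -slope


def v_metric(ranked, p):
    """v such that N_k = p**v, i.e. v = ln(N_k) / ln(p).

    Requires N_k > 0, p > 0 and p != 1.
    """
    return [math.log(n_k) / math.log(p) for n_k in ranked]


# Hypothetical Unused values for one sample and a hypothetical p.
unused = [8.0, 2.0, 32.0, 4.0]
ranked = rank_values(unused)
v = v_metric(ranked, p=2)

# Synthetic values that follow a - b * ln(k) exactly, to exercise the fit.
synthetic = [10.0 - 2.0 * math.log(k) for k in range(1, 6)]
a, b = fit_log_rank(synthetic)
```

The resulting list `v` would then replace the raw values as input to clustering; how p is actually chosen for each sample is described in Adachi (2016) and is not reproduced here.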

Direct analyses of Unused values in LC/MS resulted in non-
Obviously, the actual structure of the geometric space is important for the resulting output of the calculation (Ronan et al. 2016). In this first analysis, we used the plain Unused values as dissimilarity, partly as squared Euclidean distance.
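The concern about using Euclidean-type distances directly on high-dimensional values can be illustrated with a small simulation of distance concentration. This sketch uses uniform random points in a unit cube, not the LC/MS data; the function name `relative_contrast`, the number of points and the seed are arbitrary choices for illustration.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """Return (d_max - d_min) / d_min for Euclidean distances from one
    random query point to n_points uniform random points in [0, 1]^dim."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.sqrt(sum((a - b) ** 2 for a, b in zip(query, point))))
    return (max(dists) - min(dists)) / min(dists)


low = relative_contrast(2)      # two dimensions
high = relative_contrast(1000)  # comparable to more than 1000 protein signals
```

In low dimension the nearest and farthest points differ greatly, so `low` is large; at dimension 1000 all distances are nearly equal and `high` is small, which is one reason distance-based clustering can lose its discriminating power.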

A p-adic metric based on rigid geometry eliminated "curse of dimensionality" effects on LC/MS data
To avoid the pitfalls described above, the first choice of solution is to design a more appropriate metric for the calculations. If we set a proper metric based on geometry, which enables nilpotency for convergence/divergence of values and convergence to the value of -1 as oscillation, we can extract more overconverged output from the observed data set to discriminate the characteristics observed. One of the popular frameworks for this purpose is rigid geometry. A non-archimedean valued field in this geometry converges more easily than the Archimedean real or complex fields, with a p-adic (I-adic) metric including a subring of norm < |1|. The geometry makes the values converge globally, but the values are locally free, enabling freedom from the restriction imposed by the "curse of dimensionality". An example image of utilizing this idea through a quotient is shown in Fig. 2B (e.g. Cornelissen and Kato 2005). Consider an icosahedron with its 12 vertices in blue, the 20 barycenters (the centers of the 20 triangular faces) in green, and the 30 midpoints of the 30 edges in red. Projecting the icosahedron from its center onto a sphere gives a tessellation of the sphere by 120 triangles, as shown in Fig. 2B left. The angles are π/2 for red, π/3 for green, and π/5 for blue. A generator is: [...] The icosahedron has 6 cyclic subgroups of order 5, 10 cyclic subgroups of order 3, and 15 cyclic subgroups of order 2. The quotient of this Riemann sphere by the group I is shown in Fig. 2B right. 2, 3, 5 correspond to midpoints of edges, barycenters of faces, and vertices, respectively. The complexity of the system is thereby greatly reduced.
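The subgroup counts quoted for the icosahedron can be checked by brute force. The sketch below relies on the standard fact that the rotation group of the icosahedron is isomorphic to the alternating group A5 (the even permutations of five objects, 60 elements); the variable names are chosen for illustration only.

```python
from collections import Counter
from itertools import permutations


def is_even(perm):
    """True if the permutation has an even number of inversions."""
    inversions = sum(
        1
        for i in range(len(perm))
        for j in range(i + 1, len(perm))
        if perm[i] > perm[j]
    )
    return inversions % 2 == 0


def order(perm):
    """Smallest n >= 1 such that perm composed with itself n times is the identity."""
    identity = tuple(range(len(perm)))
    current, n = perm, 1
    while current != identity:
        current = tuple(perm[i] for i in current)
        n += 1
    return n


# All 60 rotations of the icosahedron, represented as elements of A5.
rotations = [p for p in permutations(range(5)) if is_even(p)]
element_orders = Counter(order(p) for p in rotations)

# A cyclic group of order n has phi(n) generators: phi(2)=1, phi(3)=2, phi(5)=4.
phi = {2: 1, 3: 2, 5: 4}
cyclic_subgroups = {n: element_orders[n] // phi[n] for n in phi}
```

Dividing the number of elements of each order by the number of generators per cyclic subgroup gives 15, 10 and 6 subgroups of orders 2, 3 and 5, matching the edge midpoints, face barycenters and vertex axes described above.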
We then defined a p-adic (I-adic) metric based on rigid geometry, as in the Methods section, as a pretreatment of the data before clustering/machine learning, and obtained the results shown in Fig. 3. Control samples and long-term storage samples clearly clustered separately in every proposed method, suggesting freedom from the "curse of dimensionality". The means of the variances in the original method and the rigid method do not by themselves represent the situation (60±10 and 6000±8000 at 95% confidence, respectively): the data from the rigid method have 10 outliers (see Fig. 4 for the skewness) with larger values than the Euclidean values of the same ranks. When the samples with the top 10 variances are excluded, the means of the variances become 44±5 and 14±3, respectively, with p = 5 × 10^-20 in a t-test, indicating release from the "curse of dimensionality". As a control, machine learning by a neural network showed the same tendency as in the previous section, with 1630 unknown parameters.
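The comparison of variances with and without the largest outliers can be sketched as follows. This is an illustration on made-up numbers, not a reanalysis: the two input lists are synthetic, the 95% interval uses a normal approximation, and Welch's form of the t statistic is an assumption, since the text does not say which t-test variant was used.

```python
import math
import statistics


def mean_ci95(xs):
    """Mean and normal-approximation 95% half-width (1.96 * standard error)."""
    se = statistics.stdev(xs) / math.sqrt(len(xs))
    return statistics.mean(xs), 1.96 * se


def drop_top(xs, n_drop):
    """Remove the n_drop largest values."""
    return sorted(xs)[: len(xs) - n_drop]


def welch_t(xs, ys):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    vx = statistics.variance(xs) / len(xs)
    vy = statistics.variance(ys) / len(ys)
    t = (statistics.mean(xs) - statistics.mean(ys)) / math.sqrt(vx + vy)
    df = (vx + vy) ** 2 / (vx ** 2 / (len(xs) - 1) + vy ** 2 / (len(ys) - 1))
    return t, df


# Synthetic per-rank variances: a stable set and a set with 10 huge outliers.
variances_original = [40.0 + (i % 7) for i in range(50)]
variances_rigid = [12.0 + (i % 5) for i in range(40)] + [5000.0] * 10

mean_before, half_before = mean_ci95(variances_rigid)
trimmed_rigid = drop_top(variances_rigid, 10)
mean_after, half_after = mean_ci95(trimmed_rigid)
t_stat, dof = welch_t(variances_original, trimmed_rigid)
```

With the outliers included, the interval half-width is of the same order as the mean itself, so the mean says little; after trimming, the two sets separate clearly. Converting `t_stat` and `dof` into a p-value requires a t-distribution function, which is left out of this stdlib-only sketch.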
Interestingly, the distribution of v values can be approximated by a power function with an absolute value of multiplier as ~3/2 (Fig. 4). If we set the multiplier as

Discussion
Utilizing p-adic rigid geometry, we seem to have succeeded in eliminating the "curse of dimensionality" [...] (Weyl 1953). 800 x 11 = 8800 > 1630, and the observed convergence of v is expected beyond an underdetermined system. To support this idea, f-1, f-2 and 2y2, which were mis-clustered in Fig. 1, could be clustered well with the v metric in all the methods (Fig. 5), with 800 x 3 = 2400 > 1606. At least this allows us to evaluate whether the samples are from nearly fresh materials or have undergone significant lengths of storage at low temperatures. The success is entirely based on an algebraic, analytic and topological geometric analysis based on rigid geometry. So far as we know, this is the first work that applies 'rigid geometry', a term that has been developing in mathematics since 1962, to biological studies. The interesting point is that this methodology is applicable to any type of almost neutral logarithmic Boltzmann-type distribution in any system of interest. The agreement of results between supervised machine learning and several unsupervised clustering analyses demonstrates the power of this methodology. Even within biology, we can apply a similar approach, from the protein society inside cells in this study to community dynamics in microbes (Adachi 2016).

Conclusions
We have succeeded in releasing from the "curse of dimensionality" the observed differences between long-term storage samples and control samples in LC/MS data from HEK-293 cells. The success was entirely based on the topological characteristics of the p-adic metric on rigid geometry. It may have the potential to calculate a characteristic value of a system with an almost neutral logarithmic Boltzmann distribution of any type.

Figures
Fig. 1 Please also read Methods.