
Zero inflated high dimensional compositional data with DeepInsight

Abstract

Through the Human Microbiome Project, research on human-associated microbiomes has been conducted in various fields. New sequencing techniques such as Next Generation Sequencing (NGS) and High-Throughput Sequencing (HTS) have enabled the inclusion of a wide range of features of the microbiome. These advancements have also contributed to the development of numerical proxies like Operational Taxonomic Units (OTUs) and Amplicon Sequence Variants (ASVs). Studies involving such microbiome data often encounter zero-inflated and high-dimensional problems. Based on the need to address these two issues and the recent emphasis on compositional interpretation of microbiome data, we conducted our research. To solve the zero-inflated problem in compositional microbiome data, we transformed the data onto the surface of the hypersphere using a square root transformation. Then, to solve the high-dimensional problem, we modified DeepInsight, an image-generating method using Convolutional Neural Networks (CNNs), to fit the hypersphere space. Furthermore, to resolve the common issue of distinguishing between true zero values and fake zero values in zero-inflated images, we added a small value to the true zero values. We validated our approach using pediatric inflammatory bowel disease (IBD) fecal sample data and achieved an area under the curve (AUC) value of 0.847, which is higher than the previous study’s result of 0.83.

Introduction

The Human Microbiome Project has significantly advanced our understanding of the human microbiome [1], enabling comprehensive studies on its role in various diseases such as cancer [2–4], Crohn’s disease [5,6], and obesity [7,8]. By utilizing numerical proxies such as Operational Taxonomic Units (OTUs) and Amplicon Sequence Variants (ASVs), researchers have been able to quantify microbiome data more effectively, facilitating more natural data representations [9–12]. Consequently, a wide range of statistical and machine learning techniques have been developed and applied to analyze these microbiome datasets. Although microbiome data are fundamentally count data, limitations of sequencing technologies like 16S rRNA sequencing constrain the total sequencing depth, leading to the interpretation of microbiome data as compositional data [13].

Compositional data is a type of non-Euclidean data in which all components are positive and their sum is fixed at a constant value. Due to this fixed sum, compositional data does not have full rank. Therefore, compositional data is defined within a specialized space known as the simplex [14]. This non-Euclidean sample space poses significant challenges for the direct application of traditional statistical methods, which typically assume Euclidean space properties. To overcome these challenges, log-ratio transformations have become the most widely used approach for mapping compositional data into Euclidean space [15,16]. However, these transformations are unable to properly handle the zero-inflated problem commonly encountered in microbiome data. The zero-inflated problem occurs when datasets contain an excessive number of zeros that cannot be explained by typical statistical distributions [17]. In microbiome data, this often happens due to the absence of certain microbial taxa in some samples, leading to many zero values. These zeros can be either structural, where a particular taxon is truly absent, or sampling-related, where the taxon is present but undetected due to limitations in sequencing depth or sampling [18]. This poses challenges for standard modeling approaches, which do not account for the dual nature of these zeros. To address the zero-inflated problem, methods such as Bayesian-Multiplicative replacement, implemented in the cmultRepl function of the R package zCompositions, have been used to replace zero values [19]. Despite their utility, zero-replacement methods can cause distortions, prompting the exploration of alternative approaches that directly handle zero values without replacement [20–24].

The square-root transformation provides an effective solution by mapping compositional data onto the surface of the hypersphere, enabling the use of probability distributions defined on the hypersphere, such as the Kent distribution [25–27]. Furthermore, dimension reduction is also possible using Principal Geodesic Analysis (PGA). PGA extends Principal Component Analysis (PCA) by incorporating the intrinsic geometry of Riemannian manifolds. PGA identifies the main modes of variation along geodesics, the shortest paths on the manifold, providing a more accurate representation of the data’s variability on curved surfaces [28,29]. The detailed methodology of applying PGA in this study is provided in the Materials and Methods section. Based on the square-root transformation, this study analyzes zero-inflated compositional data, preserving the integrity of the original data while effectively handling exactly zero values.

New sequencing techniques, known as Next Generation Sequencing (NGS) or High-Throughput Sequencing (HTS), have led to a rapid increase in both the volume and complexity of microbiome data. High-dimensional microbiome datasets contain a wide range of features, including genomes, transcriptomes, proteomes, and metagenomes, often outnumbering the available biological samples. This high dimensionality poses significant challenges for data analysis, including heightened computational demands, increased risk of overfitting, and difficulties in interpreting results [30–32]. Image generating methods, such as DeepInsight [33] and Image Generator for Tabular Data (IGTD) [34], have been proposed to address high-dimensional problems by leveraging powerful deep learning models like Convolutional Neural Networks (CNNs). These methods convert non-image data into image formats through techniques such as dimension reduction, pairwise correlation matrices, and clustering, enabling CNNs to effectively analyze complex patterns and relationships within the data. This approach is particularly beneficial for managing unstructured and high-dimensional datasets, including those found in microbiome and voice data analyses.

Segmentation, the process of separating the foreground from the background in image analysis, is a fundamental task that significantly impacts the accuracy and effectiveness of image interpretation. In deep learning-based image analysis, methods such as U-Net and SegNet have demonstrated remarkable performance and have garnered considerable attention in the field. These segmentation techniques play a critical role in clarifying the boundaries between the foreground and background, thereby facilitating the learning of image features by CNNs. By accurately delineating different regions within an image, segmentation algorithms enable models to focus on relevant patterns and structures, improving overall performance in tasks like object recognition and scene understanding. In our study, we use image datasets generated by DeepInsight, which transforms non-image data into image formats based on the data values themselves. This data-driven approach can present challenges when dealing with zero-inflated data. To address this issue, we propose adding small values to the true zero values to distinguish between true zero values associated with the foreground and fake zero values associated with the background. The method for adding these values is described in the Materials and Methods section.

In this study, we conducted an analysis of zero-inflated high-dimensional compositional data. First, to address the zero-inflated problem, we transformed the data onto the surface of the hypersphere using the square-root transformation. Next, we modified the DeepInsight algorithm to suit the hypersphere and applied it for analysis. In this process, we added a small value to distinguish between true zero values and fake zero values in the zero-inflated image. Finally, we validated the results using the fecal sample data employed by Papa et al. (2012) [35]. This dataset was designed for the classification of Inflammatory Bowel Disease (IBD), such as Crohn’s disease. Using this dataset, we compared the performance of our modified DeepInsight with that of previous studies.

Materials and methods

Compositional data

Compositional data is a type of non-Euclidean spatial data consisting of nonnegative components that sum to a constant value. Although the sum of the components is not inherently restricted, it is commonly normalized to 1 for simplicity, effectively treating the sample space as a simplex [14]. When considering a d-dimensional compositional vector x, it is expressed as x ∈ S^d, where S^d denotes the unit simplex defined as follows:

S^d = { x = (x_1, x_2, …, x_d) : x_i ≥ 0 for i = 1, …, d, and x_1 + x_2 + ⋯ + x_d = 1 }  (1)
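As a small illustration of the closure to the simplex, raw counts can be normalized so that each sample sums to 1. This is a NumPy sketch with made-up OTU counts, not data from the study:

```python
import numpy as np

# Hypothetical OTU count table: 3 samples x 4 taxa (values are illustrative only)
counts = np.array([
    [10,  0, 30, 60],
    [ 5,  5,  0, 90],
    [25, 25, 25, 25],
], dtype=float)

# Closure operation: divide each row by its total so the components sum to 1,
# mapping each sample onto the simplex of Eq (1); zeros are preserved as zeros
composition = counts / counts.sum(axis=1, keepdims=True)

print(composition.sum(axis=1))  # each row sums to 1
```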

Square root transformation

The square-root transformation, a type of power transformation, maps a compositional vector x ∈ S^d (see Eq 1) onto the surface of the (d − 1)-dimensional unit hypersphere S^{d−1} = { u ∈ R^d : ‖u‖ = 1 }. The transformation is defined as follows:

u = (√x_1, √x_2, …, √x_d),  so that ‖u‖² = x_1 + x_2 + ⋯ + x_d = 1.
The square-root transformation offers several advantages. It allows direct handling of zeros; unlike log-ratio transformations, which are undefined for zero values and require substitution or adjustment, the square-root transformation naturally accommodates zeros. By mapping data onto the surface of the hypersphere, we can apply statistical methods developed for directional data, including using probability distributions like the von Mises–Fisher and Kent distributions to model and analyze the data’s directional characteristics [26]. Techniques such as spherical regression, clustering on the sphere, and spherical harmonic analysis become applicable, providing deeper insights into the data structure.
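A minimal NumPy sketch of the square-root transformation (the sample vector is illustrative); note that zeros pass through unchanged and the transformed vector lands exactly on the unit hypersphere:

```python
import numpy as np

def sqrt_transform(composition: np.ndarray) -> np.ndarray:
    """Map a compositional vector (nonnegative, summing to 1) onto the unit hypersphere."""
    return np.sqrt(composition)

# Zero-inflated compositional sample (illustrative values)
x = np.array([0.0, 0.25, 0.0, 0.75])
u = sqrt_transform(x)

print(np.linalg.norm(u))  # 1.0: the transformed point lies on the hypersphere
print(u[0])               # 0.0: exact zeros require no replacement
```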

DeepInsight

DeepInsight, proposed by Sharma et al. (2019) [33], is a methodology for converting non-image data into image format to apply CNNs. It offers a general approach applicable to various irregular and non-structured data such as genomics, transcriptomics, methylation, mutations, text, spoken words, and financial data.

Algorithm 1 provides a concise overview of the DeepInsight algorithm, and Fig 1 is the pipeline presented by Sharma et al. (2019) [33]. The process begins by transposing the input data matrix X into a feature-focused matrix G, which allows the algorithm to concentrate on relationships between features rather than samples.

Algorithm 1: DeepInsight pipeline with convex hull.

Require: Training set X with n samples and d features, where X ∈ R^{n×d}

Ensure: Image representation of non-image data for CNNs input

1: Step 1: Transpose Data Matrix

2: Define the feature matrix G = X^T by transposing X

3: Step 2: Dimensionality Reduction

4: Apply dimensionality reduction (e.g., t-SNE, kernel PCA) on G to obtain 2D coordinates

5: Step 3: Feature Location Mapping & Rearrangement

6: Map the 2D coordinates to their corresponding pixel locations on a grid, and then perform rearrangement through clustering

7: Step 4: Apply Convex Hull Algorithm

8: Apply Convex Hull algorithm to find the smallest bounding polygon for the feature locations

9: Step 5: Rotate or Adjust the Grid

10: Rotate or adjust the grid to fit the CNNs input format

11: Step 6: Feature Normalization

12: Normalize feature values using one of the following methods:

13: (a) Independent normalization: Normalize each feature independently by its own minimum and maximum values

14: (b) Topology-preserving normalization: Normalize all features using a single global maximum value from the training set

15: Select the normalization method that produces the lowest validation error

16: Step 7: Final Image Generation

17: Create the final image with the pixel values representing the normalized features
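The steps above can be sketched in a simplified, NumPy-only form. This is our illustration, not DeepInsight's actual implementation: PCA via SVD stands in for t-SNE/kernel PCA, the convex-hull and rotation steps are omitted, and all names are ours:

```python
import numpy as np

def tabular_to_images(X: np.ndarray, size: int = 32) -> np.ndarray:
    """Simplified DeepInsight-style conversion: locate features on a 2D grid
    via PCA on the transposed (feature-focused) matrix, then paint each
    sample's values at those pixel locations."""
    # Steps 1-2: transpose to a features-by-samples matrix and reduce to 2D (PCA via SVD)
    G = X.T - X.T.mean(axis=0)
    U, S, _ = np.linalg.svd(G, full_matrices=False)
    coords = U[:, :2] * S[:2]                      # one 2D point per feature
    # Step 3: map 2D coordinates to integer pixel locations on a size x size grid
    mn, mx = coords.min(axis=0), coords.max(axis=0)
    pix = np.floor((coords - mn) / (mx - mn + 1e-12) * (size - 1)).astype(int)
    # Steps 6-7: normalize each sample and paint pixels (colliding features keep the max)
    imgs = np.zeros((X.shape[0], size, size))
    for i, row in enumerate(X):
        vals = row / (row.max() + 1e-12)
        np.maximum.at(imgs[i], (pix[:, 0], pix[:, 1]), vals)
    return imgs

rng = np.random.default_rng(0)
imgs = tabular_to_images(rng.random((5, 100)))
print(imgs.shape)  # (5, 32, 32): one image per sample
```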

Fig 1. DeepInsight pipeline.

This is the DeepInsight pipeline provided by Sharma et al. (2019). It creates representative images through dimension reduction and optimizes the images using the Convex Hull algorithm. Subsequently, it generates image-specific differentiation through feature matrices and normalization, and creates images by mapping them to pixels.

https://doi.org/10.1371/journal.pone.0320832.g001

Next, dimension reduction methods such as t-SNE or kernel PCA are applied to project the high-dimensional features onto a 2D plane. These 2D coordinates are then mapped to their corresponding pixel locations on a grid, which serves as the basis for the image representation. Following this, clustering techniques are employed to rearrange the pixels based on their spatial relationships. Through this clustering and rearrangement process, the inherent features and hidden patterns within the image can be effectively identified. For example, Fig 2 illustrates representative images generated from The Cancer Genome Atlas Program (TCGA) RNA data using PCA (Fig 2a), kernel PCA (Fig 2b), and t-SNE (Fig 2c). As the figure illustrates, the choice of dimension reduction method substantially changes the resulting images, affecting the rearrangement and the corresponding pixel locations. To optimize for CNNs, the Convex Hull algorithm is used to define the smallest bounding polygon around the feature points, optimizing the space for the feature layout.

Fig 2. DeepInsight example.

Representative images from The Cancer Genome Atlas Program RNA data. a) Process using PCA, b) Process using kernel PCA, and c) Process using t-SNE. Using the Convex Hull algorithm, the red boundary represents the smallest polygon that encloses the corresponding pixels, while the green boundary denotes the smallest rectangle containing them.

https://doi.org/10.1371/journal.pone.0320832.g002

The grid is then rotated or adjusted to fit the expected input dimensions of the CNN model. Feature values are normalized using either independent normalization (each feature is normalized by its own min/max values) or topology-preserving normalization (a single global maximum from the entire training set). The method that results in the lowest validation error is selected. Finally, the normalized feature values are used to generate an image that can be processed by CNNs. Fig 3 shows a sample image that will be used for CNNs training. Through this process, DeepInsight demonstrates strengths in classification compared to traditional methods.
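The two normalization options in Step 6 of Algorithm 1 can be contrasted in a few lines. This NumPy sketch uses a made-up training matrix; the key difference is that the topology-preserving variant keeps relative magnitudes between features:

```python
import numpy as np

# Hypothetical training matrix: 3 samples x 2 features (illustrative values)
train = np.array([[0.0, 10.0],
                  [2.0, 30.0],
                  [4.0, 50.0]])

# (a) Independent normalization: each feature scaled by its own min and max
f_min, f_max = train.min(axis=0), train.max(axis=0)
independent = (train - f_min) / (f_max - f_min)

# (b) Topology-preserving normalization: one global maximum for all features,
#     so the relative scale between features is preserved
g_max = train.max()
topology = train / g_max

print(independent[:, 0])  # [0.   0.5  1. ]
print(topology[:, 0])     # [0.   0.04 0.08]
```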

Fig 3. Characteristic image.

A characteristic image generated from The Cancer Genome Atlas Program RNA data using t-SNE-based DeepInsight for CNNs training.

https://doi.org/10.1371/journal.pone.0320832.g003

Principal geodesic analysis

PGA is introduced to address the limitations of PCA when dealing with data on Riemannian manifolds such as the hypersphere, which are curved, non-linear spaces. In the context of the hypersphere, data points reside on a curved surface where straight lines are replaced by geodesics, the shortest paths along the sphere’s surface that respect its curvature. PCA struggles to capture the intrinsic geometry of the surface of the hypersphere because it relies on linear approximations and Euclidean distances, which are not suitable for spherical data. PGA overcomes this limitation by utilizing geodesics to define principal directions on the surface of the hypersphere, effectively accounting for its curvature and providing a more accurate analysis of data variability.

When considering the d-dimensional unit hypersphere S^d, the point p ∈ S^d is the Fréchet mean. The Fréchet mean is the point that minimizes the sum of squared geodesic distances to all data points and is used as the reference point for PGA. Typically, a rotation transformation is applied to set this reference point at (0, 0, ⋯, 1).

PGA aims to project the data onto the tangent space at the point p. The Logarithmic map enables projection onto the tangent space, and its definition for x ∈ S^d with x ≠ p is as follows:

Log_p(x) = (θ / sin θ) (x − cos θ · p),  where θ = arccos(p · x) is the geodesic distance between p and x.

This mapping allows the transformation of data from the curved directional space into a flat Euclidean tangent space, enabling the application of PCA. By mapping the data from the manifold onto the tangent space using the logarithmic map, PGA performs PCA in the tangent space to identify principal directions. These directions are then mapped back onto the manifold using the Exponential map, ensuring that the analysis respects the manifold’s curvature.
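This projection can be sketched in pure NumPy (the geomstats package used in our study provides the same operations; the base point and data point below are illustrative):

```python
import numpy as np

def log_map(p: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Logarithmic map on the unit sphere: project x onto the tangent space at p."""
    cos_t = np.clip(np.dot(p, x), -1.0, 1.0)
    theta = np.arccos(cos_t)          # geodesic distance between p and x
    if theta < 1e-12:                 # x coincides with the base point
        return np.zeros_like(x)
    return theta / np.sin(theta) * (x - cos_t * p)

p = np.array([0.0, 0.0, 1.0])                      # reference point (e.g., Frechet mean)
x = np.array([np.sin(0.3), 0.0, np.cos(0.3)])      # a point on the sphere

v = log_map(p, x)
print(np.linalg.norm(v))  # 0.3: the tangent vector's length equals the geodesic distance
print(np.dot(v, p))       # ~0: v lies in the tangent space at p
```

PCA can then be applied to the collection of such tangent vectors, exactly as PGA prescribes.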

Additionally, Fig 4 shows an example of PGA on the 3-dimensional hypersphere S^2. The Python geomstats package implements this methodology in code, and we utilized it in our study.

Fig 4. Geodesic PCA.

A data point X on the 3-dimensional hypersphere is mapped to the tangent space at the Fréchet mean p via the Logarithmic map along the geodesic path. This projection flattens the manifold while preserving the principal directions of variation, enabling the application of PCA in curved spaces.

https://doi.org/10.1371/journal.pone.0320832.g004

Segmentation

Segmentation enhances image analysis performance by clearly identifying the features or shapes of the foreground through the distinction between the foreground and background. In the case of the existing DeepInsight method, the foreground of the generated images is based on the characteristics of the data, and the shapes are created through dimension reduction, resulting in all images having identical forms. Therefore, the necessity of segmentation was not prominently highlighted. However, when analyzing zero-inflated data using DeepInsight, the color values of the corresponding pixels are determined based on the original data values. If the original value is exactly zero, the pixel color is exactly the same as the background. This causes the shapes of the sample images to be inconsistent, which adversely affects the learning process. An example of this phenomenon is shown in Fig 5.

To address this issue, we added a small value to the corresponding pixels generated through DeepInsight, thereby standardizing the shapes of all images as shown in Fig 6.

Fig 5. Different structure.

To visualize the variation in the overall structure of zero-inflated sample images, we assigned maximum values to the corresponding pixels based on cross-sectional images generated using PCA.

https://doi.org/10.1371/journal.pone.0320832.g005

Fig 6. Same structure.

To visualize the overall structure of zero-inflated sample images after adding a small tolerance at the corresponding pixels, we assigned maximum values to those pixels based on cross-sectional images generated using PCA.

https://doi.org/10.1371/journal.pone.0320832.g006

Results

Real data

We analyze the Pediatric Inflammatory Bowel Disease (IBD) dataset from Papa et al. (2012) [35] as a real data example. This dataset consists of 16S rRNA sequencing of fecal samples from 91 children and young adults who were treated in the gastroenterology program at Children’s Hospital in Boston, USA, including 24 positive cases diagnosed with IBD and 67 negative controls. It contains 36,349 columns, each representing an OTU count, and approximately 98% of the data entries are exactly zero. Because the IBD dataset consists of count data collected through 16S rRNA sequencing, it can be interpreted as compositional data. To convert it into a compositional format, each sample is divided by its total sum. Subsequently, the square-root transformation is applied to project the data onto the surface of the hypersphere.

Fig 7. Generated image samples.

a) is the image actually used for CNNs training. b) is a rescaled version of a), adjusted for visual clarity because the original training image was too dark; this rescaled image was not used for actual training.

https://doi.org/10.1371/journal.pone.0320832.g007

Classification performance

Algorithm 2: Classification pipeline.

1: Step 1: Convert to Compositional Data

2: Normalize each sample by dividing feature values by the total sum to transform raw count data into compositional data.

3: Step 2: Square Root Transformation to Directional Space

4: Apply square root transformation to map the compositional data onto the directional space.

5: Step 3: Project to Tangent Space using PGA

6: Perform Principal Geodesic Analysis (PGA) to project points in directional space onto tangent space.

7: Step 4: Apply DeepInsight for Image Generation

8: Execute DeepInsight to convert tangent space data into images. In Modified DeepInsight, apply Segmentation by adding a small constant to the corresponding pixel values.

9: Step 5: Train CNNs and Evaluate Results

10: Train a CNNs model on the generated images and analyze the classification results
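Step 5 of the pipeline above reports AUC; for reference, AUC can be computed directly from scores and labels with a short rank-based routine. This is a generic pure-Python sketch with made-up predictions, not the study's evaluation code:

```python
def auc(labels, scores):
    """Area under the ROC curve: the fraction of (positive, negative) pairs
    where the positive sample receives the higher score (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Illustrative predictions from one cross-validation fold
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auc(labels, scores))  # 0.75
```

In 10-fold cross-validation, such a score is computed on each held-out fold and the fold-level results are aggregated.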

Our modified DeepInsight model focuses on two main objectives. First, DeepInsight is adapted to effectively handle data on the surface of a hypersphere. Second, segmentation is implemented to address the issue where background and foreground are treated identically at exact zero values. To achieve these objectives, PGA is employed within the DeepInsight process. Subsequently, segmentation is implemented by adding small values at the positions of the corresponding pixels. As a control group, we also evaluated the original DeepInsight model, which does not apply segmentation, on the IBD dataset. After the sample images were generated, we used ResNet50 as the CNNs architecture and AdamW as the optimizer. The learning rate and weight decay were tuned using Optuna, an automated optimization software [36], to maximize the Area Under the Curve (AUC). The learning rate range was set to [], while the weight decay range was set to (). A log-scale search was applied to ensure an even distribution of sampled values across different magnitudes. The input image size was set to 224×224 pixels to match the input requirements of ResNet50. Fig 7a shows the image sample actually used for CNNs training, while Fig 7b shows a rescaled and resized version adjusted to improve visibility for the human eye. Note that the rescaled image was not used for training. The final performance was evaluated by calculating the AUC using 10-fold cross-validation. Algorithm 2 provides a concise representation of our sequential process.

As shown in Table 1, we conducted hyperparameter tuning over 1,000 trials for both the Modified DeepInsight and the original DeepInsight using Optuna, selecting the maximum AUC value obtained as the final result for each model. The Modified DeepInsight achieved an AUC of 0.847, exceeding the average AUC of 0.83 reported by Papa et al. (2012) [35], who analyzed fecal samples using Synthetic Learning in Microbial Ecology (SLiME) under three repeated 10-fold cross-validations. Additionally, we observed that the Original DeepInsight achieved an AUC of 0.817, a notably lower performance than the Modified DeepInsight. Through these results obtained from real data, we confirmed the applicability of DeepInsight to zero-inflated high-dimensional compositional data and emphasized the necessity of distinguishing between true zero values and fake zero values.

Table 1. Comparison of classification performance. Papa et al. (2012) applied sequencing data to supervised learning classification algorithms using a software pipeline called Synthetic Learning in Microbial Ecology (SLiME), which utilizes relevant metadata as classification labels. They achieved an average AUC of 0.83 on fecal samples over three repeated 10-fold cross-validations. We trained both the Modified DeepInsight and the Original DeepInsight models for 1,000 trials using Optuna and selected the maximum AUC value as the final result.

https://doi.org/10.1371/journal.pone.0320832.t001

Discussion

In this study, we reconfirmed that DeepInsight continues to demonstrate effective performance on unstructured high-dimensional data. Specifically, we observed that the segmentation issue, which arises during the process of converting zero-inflated microbiome data into images, also occurs in conventional image analysis. To address this problem, we proposed the Modified DeepInsight, which applies a simple method of adding a small constant to the data, achieving a performance improvement.

Although the dataset size was small for deep learning, the Modified DeepInsight showed a slight improvement in AUC compared to SLiME. This demonstrates that deep learning can still be effective with small datasets when appropriate preprocessing and model modifications are applied. Due to methodological differences from previous studies, we could not perform a performance comparison under completely identical conditions, but our results support the validity of DeepInsight. Notably, it is significant that we achieved a higher AUC than the value reported in the previous study.

Furthermore, since we confirmed performance improvement with just the simple method of adding a small constant, we believe that developing more sophisticated zero-value handling techniques could lead to additional performance enhancements. Additionally, after transitioning to Tangent Space, we anticipate that employing more advanced dimension reduction methods beyond the PCA we used could further improve performance.

For future research, we aim to enhance data diversity by incorporating raw microbiome data, which we could not address in this study due to technical constraints, through collaboration with domain experts. Moreover, we seek to further explore the applicability of our approach to a broader range of zero-inflated high-dimensional datasets.

References

  1. Human Microbiome Project Consortium. Structure, function and diversity of the healthy human microbiome. Nature 2012;486(7402):207–14. pmid:22699609
  2. Helmink BA, Khan MAW, Hermann A, Gopalakrishnan V, Wargo JA. The microbiome, cancer, and cancer therapy. Nat Med 2019;25(3):377–88. pmid:30842679
  3. Schwabe RF, Jobin C. The microbiome and cancer. Nat Rev Cancer 2013;13(11):800–12. pmid:24132111
  4. Sepich-Poore GD, Zitvogel L, Straussman R, Hasty J, Wargo JA, Knight R. The microbiome and human cancer. Science. 2021;371(6536):eabc4552. pmid:33766858
  5. Gevers D, Kugathasan S, Denson LA, Vázquez-Baeza Y, Van Treuren W, Ren B, et al. The treatment-naive microbiome in new-onset Crohn’s disease. Cell Host Microbe 2014;15(3):382–92. pmid:24629344
  6. Halfvarson J, Brislawn CJ, Lamendella R, Vázquez-Baeza Y, Walters WA, Bramer LM, et al. Dynamics of the human gut microbiome in inflammatory bowel disease. Nat Microbiol. 2017;2:17004. pmid:28191884
  7. Ley RE. Obesity and the human microbiome. Curr Opin Gastroenterol 2010;26(1):5–11. pmid:19901833
  8. Maruvada P, Leone V, Kaplan LM, Chang EB. The human microbiome and obesity: Moving beyond associations. Cell Host Microbe 2017;22(5):589–99. pmid:29120742
  9. Blaxter M, Mann J, Chapman T, Thomas F, Whitton C, Floyd R, et al. Defining operational taxonomic units using DNA barcode data. Philos Trans R Soc Lond B Biol Sci 2005;360(1462):1935–43. pmid:16214751
  10. Schloss PD, Handelsman J. Introducing DOTUR, a computer program for defining operational taxonomic units and estimating species richness. Appl Environ Microbiol 2005;71(3):1501–6. pmid:15746353
  11. Callahan BJ, McMurdie PJ, Holmes SP. Exact sequence variants should replace operational taxonomic units in marker-gene data analysis. ISME J 2017;11(12):2639–43. pmid:28731476
  12. Callahan BJ, Wong J, Heiner C, Oh S, Theriot CM, Gulati AS, et al. High-throughput amplicon sequencing of the full-length 16S rRNA gene with single-nucleotide resolution. Nucleic Acids Res 2019;47(18):e103. pmid:31269198
  13. Gloor GB, Macklaim JM, Pawlowsky-Glahn V, Egozcue JJ. Microbiome datasets are compositional: And this is not optional. Front Microbiol. 2017;8:2224. pmid:29187837
  14. Aitchison J. The statistical analysis of compositional data. J R Stat Soc Ser B: Stat Methodol 1982;44(2):139–60.
  15. Aitchison J. The statistical analysis of compositional data: Monographs on statistics and applied probability. London: Chapman & Hall Ltd.; 1986. 416 p.
  16. Egozcue JJ, et al. Isometric logratio transformations for compositional data analysis. Math Geol 2003;35(3):279–300.
  17. Tu W. Zero-inflated data. Encyclopedia of environmetrics; 2006.
  18. Chen L, Reeve J, Zhang L, Huang S, Wang X, Chen J. GMPR: A robust normalization method for zero-inflated count data with application to microbiome sequencing data. PeerJ. 2018;6:e4600. pmid:29629248
  19. Palarea-Albaladejo J, Martín-Fernández JA. zCompositions — R package for multivariate imputation of left-censored data under a compositional approach. Chemometr Intell Lab Syst. 2015;143:85–96.
  20. Kim K, Park J, Jung S. Principal component analysis for zero-inflated compositional data. Comput Stat Data Anal. 2024;198:107989.
  21. Tang Z-Z, Chen G. Zero-inflated generalized Dirichlet multinomial regression model for microbiome compositional data analysis. Biostatistics 2019;20(4):698–713. pmid:29939212
  22. Ha MJ, Kim J, Galloway-Peña J, Do K-A, Peterson CB. Compositional zero-inflated network estimation for microbiome data. BMC Bioinform. 2020;21(Suppl 21):581. pmid:33371887
  23. Zhang X, Guo B, Yi N. Zero-inflated Gaussian mixed models for analyzing longitudinal microbiome data. PLoS One 2020;15(11):e0242073. pmid:33166356
  24. Xu L, Paterson AD, Turpin W, Xu W. Assessment and selection of competing models for zero-inflated microbiome data. PLoS One 2015;10(7):e0129606. pmid:26148172
  25. Mardia KV, Jupp PE. Directional statistics. John Wiley & Sons; 2009.
  26. Scealy JL, Welsh AH. Regression for compositional data by using distributions defined on the hypersphere. J R Stat Soc Ser B: Stat Methodol 2011;73(3):351–75.
  27. Scealy JL, Welsh AH. Fitting Kent models to compositional data with small concentration. Stat Comput 2012;24(2):165–79.
  28. Fletcher PT, Lu C, Pizer SM, Joshi S. Principal geodesic analysis for the study of nonlinear statistics of shape. IEEE Trans Med Imaging 2004;23(8):995–1005. pmid:15338733
  29. Pennec X, Sommer S, Fletcher T, editors. Riemannian geometric statistics in medical image analysis. Academic Press; 2019.
  30. Borah K, Das HS, Seth S, Mallick K, Rahaman Z, Mallik S. A review on advancements in feature selection and feature extraction for high-dimensional NGS data analysis. Funct Integr Genomics 2024;24(5):139. pmid:39158621
  31. Li H. Microbiome, metagenomics, and high-dimensional compositional data analysis. Annu Rev Stat Appl 2015;2(1):73–94.
  32. Ju F, Zhang T. 16S rRNA gene high-throughput sequencing data mining of microbial diversity and interactions. Appl Microbiol Biotechnol 2015;99(10):4119–29. pmid:25808518
  33. Sharma A, Vans E, Shigemizu D, Boroevich KA, Tsunoda T. DeepInsight: A methodology to transform a non-image data to an image for convolution neural network architecture. Sci Rep 2019;9(1):11399. pmid:31388036
  34. Zhu Y, Brettin T, Xia F, Partin A, Shukla M, Yoo H, et al. Converting tabular data into images for deep learning with convolutional neural networks. Sci Rep 2021;11(1):11325. pmid:34059739
  35. Papa E, Docktor M, Smillie C, Weber S, Preheim SP, Gevers D, et al. Non-invasive mapping of the gastrointestinal microbiota identifies children with inflammatory bowel disease. PLoS One 2012;7(6):e39242. pmid:22768065
  36. Akiba T, Sano S, Yanase T, Ohta T, Koyama M. Optuna: A next-generation hyperparameter optimization framework. In: Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining; 2019. p. 2623–31.