Analysis of categorical data from biological experiments with logistic regression and CMH tests

Rebecca J. Androwski; Tatiana Popovitchenko; Anna J. Smart; Sho Ogino; Guoqiang Wang; Mark Saba; Christopher Rongo; Monica Driscoll; Jason Roy

doi:10.1371/journal.pone.0335143

Abstract

The choice of appropriate statistical tests in experimental biology is critical for scientific rigor and can be challenging in the case of categorical data analysis. Using example datasets from Caenorhabditis elegans research, we conduct statistical analysis of (1) a rare cellular event involving the formation of a neuronal extrusion called an exopher and (2) a variable behavioral response across time. We employ the Cochran–Mantel–Haenszel (CMH) test and logistic regression for analysis. Recognizing there are potential accessibility issues using logistic regression, we provide step-by-step tutorials and example code. We emphasize that logistic regression can handle both simple and complex multivariable datasets; logistic regression can also provide more comprehensive insights into experimental outcomes when compared to simpler tests like CMH. By analyzing real biological examples and demonstrating their analysis with R code, we provide a practical guide for biologists to enhance the rigor and reproducibility of categorical data analysis in experimental studies.

Citation: Androwski RJ, Popovitchenko T, Smart AJ, Ogino S, Wang G, Saba M, et al. (2025) Analysis of categorical data from biological experiments with logistic regression and CMH tests. PLoS One 20(11): e0335143. https://doi.org/10.1371/journal.pone.0335143

Editor: Gregg Roman, University of Mississippi School of Pharmacy, UNITED STATES OF AMERICA

Received: May 6, 2025; Accepted: October 7, 2025; Published: November 17, 2025

Copyright: © 2025 Androwski et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: All relevant data are within the manuscript and its Supporting information files. The R code and supplementary materials are publicly available from GitHub at https://github.com/Randrowski/Logistic-Regression-for-Biologists.

Funding: Research reported in this publication was supported by the National Center For Advancing Translational Sciences of the National Institutes of Health under Award Number UM1TR004789 to JR. Additional funding sources: NIH 5T32NS115700-04 to RA and TP; NIH R01AG047101 and R37AG56510 to MD; NIH R01GM101972 to CR. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

Introduction

Genetic research has historically leveraged the advantages of large population sizes and, for some model systems, clonal reproduction, to examine biological principles in action. Thanks to large population sizes, infrequent yet biologically significant phenomena can be detected and quantified in genetic models. The significance of these infrequent events can be difficult to represent numerically as occasional events are often outnumbered by prevalent baseline biological outcomes. Here we consider the challenge of detecting statistical differences between unbalanced groups, particularly those that are measured categorically. Appropriate statistical analysis of categorical data provides a layer of confidence and rigor in its interpretation. Deciding on an appropriate statistical test of categorical data can be challenging.

Categorical outcomes are common in both cell biological experiments and behavioral assays involving binomial or multinomial datasets. Categorical data (or categorical variables) are data that can be described by a specific classification. Such data can include nominal categorical variables, which do not have a natural specific order (e.g., color or genotype); ordinal variables, which are ranked categories (e.g., level of damage or progression of disease); and dichotomous variables, which are variables with only two outcomes (e.g., alive or dead). Dichotomous variables are abundant in the Caenorhabditis elegans (C. elegans) experimental repertoire (e.g., alive or dead, on food or off food, moving or not moving). In one of the examples we detail here, we focus on neuronal production of large extracellular vesicles (called exophers) under conditions of proteostress [1]. Exophers bud directly from the neuronal cell body (commonly one per neuron) and can be equal in size to the cell body itself (~5µm). The large size of the exopher combined with the highly stereotypical C. elegans nervous system allows for efficient scoring of exophers. Importantly, individual animals can be easily categorized by the presence or absence of the exopher, and the frequency of exopher events can be assessed over a large sample size of nematodes. In this example, the dichotomous variables are whether the neuron produced an exopher or did not produce an exopher such that an exopher frequency (number of exophers observed/total animals assayed) can be calculated for a sample population.

Multiple statistical approaches can be used to analyze categorical variables. Categorical data are often plotted as a percentage (%) of the whole, also known as a frequency. Since percentage/frequency scores appear to be continuous data, investigators might incorrectly use the t-test or one-way ANOVA to analyze these categorical data, or even use a nonparametric alternative that relies on continuous data such as the Kolmogorov-Smirnov test [2]. The frequency of a binary event from a single trial would more appropriately be analyzed using a Chi-square or Fisher’s exact test. When comparing replicate trials of categorical outcomes, a better option is the Cochran–Mantel–Haenszel (CMH) test [3]. CMH measures the association between a binary independent variable, commonly called the predictor in statistics, and a binary dependent variable, called the outcome. CMH uses outcomes grouped into categories for analysis, summarizes those categories into one contingency table per stratum (e.g., per replicate trial), and calculates a common odds ratio (the odds of the outcome in one group divided by the odds of the outcome in the other group). However, the CMH method is limited to situations where there is a binary outcome and a binary treatment, such that the data fall into relatively few categories. Further, the CMH test functions optimally under the assumption of a common odds ratio across strata (i.e., the odds ratio between treatment and outcome is the same in every stratum). Although a common odds ratio is not mathematically required to calculate the CMH statistic, and some statistical relevance can be gleaned from CMH in the event of an odds ratio mismatch, the power of the test will be diminished.
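To make the CMH calculation concrete, the following is a minimal Python sketch and not the authors' R code (which is available in the linked repository). It computes the CMH chi-square statistic, with an optional continuity correction, and the Mantel–Haenszel common odds ratio from one 2×2 table per replicate trial. The function name `cmh_test` and all counts are hypothetical and chosen only for illustration.

```python
import math

def cmh_test(tables, correct=True):
    """Cochran-Mantel-Haenszel test for a list of 2x2 tables, one per stratum.

    Each table is ((a, b), (c, d)): rows are the two groups, columns are
    outcome present / outcome absent.
    Returns (chi_square, p_value, common_odds_ratio).
    """
    diff = 0.0    # sum over strata of (observed - expected) count in cell a
    var = 0.0     # sum over strata of the hypergeometric variance of cell a
    or_num = 0.0  # numerator of the Mantel-Haenszel odds ratio
    or_den = 0.0  # denominator of the Mantel-Haenszel odds ratio
    for (a, b), (c, d) in tables:
        n = a + b + c + d
        diff += a - (a + b) * (a + c) / n
        var += (a + b) * (c + d) * (a + c) * (b + d) / (n * n * (n - 1))
        or_num += a * d / n
        or_den += b * c / n
    # Continuity correction is applied only when it cannot flip the sign
    yates = 0.5 if (correct and abs(diff) >= 0.5) else 0.0
    chi2 = (abs(diff) - yates) ** 2 / var
    # Survival function of a chi-square distribution with 1 degree of freedom
    p = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p, or_num / or_den

# Hypothetical counts (NOT from the paper): (exopher, no exopher) per age group
trials = [
    ((10, 40), (2, 48)),  # trial 1: younger adults, older adults
    ((12, 38), (3, 47)),  # trial 2: younger adults, older adults
]
chi2, p, common_or = cmh_test(trials)
print(f"CMH chi-square = {chi2:.3f}, p = {p:.5f}, common OR = {common_or:.2f}")
```

Because each trial contributes its own table, a trial with a high baseline and a trial with a low baseline can be combined without pooling their raw counts.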

Another statistical approach to analyzing categorical data is logistic regression. Logistic regression is an extension of linear regression to binary outcomes that examines the relationship between an independent variable/predictor and its outcome [4]. A logistic regression models the log-odds of the outcome as a linear function of the predictors; when converted back to a probability, this relationship can be depicted as an S-shaped curve on an X-Y graph. Yes/no values can be represented as “one” for yes and “zero” for no, and the height of the curve at a given value of the predictor indicates the probability that an animal will be scored as “one” rather than “zero.” Logistic regression can be used for categorical outcomes, can include binary (e.g., alive or dead) or continuous predictors (e.g., increasing doses of a drug), and can include variables to control for factors such as replicate trials or time. Thus, logistic regression can be tailored to the specific experimental design and offers the capacity to analyze data over multiple variables [5].
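The relationship between log-odds and probability can be illustrated with a short Python sketch. The coefficients `b0` and `b1` and the dose values are hypothetical and serve only to show the S-shaped mapping and the fact that each one-unit increase in a predictor multiplies the odds by exp(b1).

```python
import math

def logistic(log_odds):
    """Convert log-odds to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-log_odds))

def logit(p):
    """Convert a probability (strictly between 0 and 1) to log-odds."""
    return math.log(p / (1.0 - p))

# Hypothetical model: log-odds = b0 + b1 * dose
b0, b1 = -2.0, 0.8
for dose in range(0, 6):
    eta = b0 + b1 * dose
    print(f"dose {dose}: log-odds {eta:+.1f}, probability {logistic(eta):.3f}")

# Each one-unit increase in dose multiplies the odds by this factor
odds_ratio_per_unit = math.exp(b1)
print(f"odds ratio per unit dose = {odds_ratio_per_unit:.3f}")
```

The model is linear on the log-odds scale, which is why the fitted probabilities flatten toward 0 and 1 rather than crossing them.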

Logistic regression is a powerful and adaptable test, but is not commonly taught in introductory statistics classes [6,7]. For this reason, logistic regression analysis may be underutilized in the field [8]. Here, we emphasize that the less commonly applied logistic regression model provides the most comprehensive statistical analysis for categorical data. To illustrate the application of this statistical test, we apply CMH and logistic regression to an example of large vesicle extrusion from neurons. We also apply an ANOVA analysis and logistic regression to locomotory behavior in multiple genetic backgrounds over a time course. We include advice for improved data hygiene, a basic RStudio tutorial provided in the supplement, and sample R code for direct implementation (found at https://github.com/Randrowski/Logistic-Regression-for-Biologists.git).

Materials and methods

Strains

All C. elegans strains were derived from wild-type Bristol strain N2. Worm husbandry was performed according to standard methods [9]. Strains used in this study: N2, ZG31 hif-1(ia4), and JT307 egl-9(sa307), all grown at 20 °C on OP50. Touch neuron-specific mCherry (strain ZB4065) was used to visualize touch neuron morphology. A complete list of strains used in this study is provided in the S1 Table.

Anoxia suspended animation assay

This assay was performed as previously published [10]. Animals at larval stage four (L4) of the strains of interest were picked onto fresh seeded plates and sealed inside Anaero-pack bags (Mitsubishi R681001), each of which contained an AnaeroPouch (Mitsubishi R682001) and a methylene indicator strip (BD 271051). The AnaeroPouch creates the anaerobic environment by absorbing oxygen and generating carbon dioxide, and the methylene indicator strip was used to confirm that the atmosphere in the bag reached anoxia. After 24 hours, the animals were reoxygenated, and the number of moving worms and the total number of worms were counted at five-minute intervals for one hour. Plates were dropped or tapped to stimulate movement. Worms in suspended animation do not move even when disturbed in this way.

Confocal microscopy

Live animals synchronized at the fourth larval stage and imaged on adult day one were mounted on slides with 5 mM levamisole. Fluorescence images were obtained using a spinning-disk confocal imaging system: Zeiss Axiovert Z1 microscope equipped with X-Light V2 Spinning Disk Confocal Unit (CrestOptics), 7-line LDI Laser Launch (89 North), Prime 95B Scientific CMOS camera (Photometrics), and a 100x oil-immersion objective. Fluorescence images were captured using Metamorph 7.7 software.

Results

Logistic regression for a simple example problem

Our first example dataset for how to start using logistic regression deals with the quantification of exophers in C. elegans neurons. Large vesicles can be extruded from proteo-stressed touch sensory neurons in C. elegans via a process called exophergenesis [11] (Fig 1a). Exopher events at a specific neuron are rare for each animal; for example, only 5–20% of wild-type animals typically produce exophers from their ALM neuron. In our exopher dataset the categorical outcome is the presence of the exopher itself (exopher or no exopher), and the variable we compare across is the age of the animal, adult day two compared to adult day five (Fig 1a, b). While the exopher event is rare, we can easily analyze large populations (>50 animals) and perform >3 biological replicates and multiple trials. Thus, while enough observations can be made for a clear conclusion [12], the resulting data are severely unbalanced, which complicates analysis [13]. For example, statistical tests such as t-tests that rely on a comparison of means or ANOVAs that rely on a sum of squares can dismiss exopher production entirely. Yet, as we have argued, exopher production is a consequential biological event that merits analysis [11]. We required a statistical test that could correctly assess categorical results across replicate trials, would allow for the high variation in baseline that is inherent to our biological model, and could identify differences among relatively infrequent events where total sample sizes could vary amongst trials.

Fig 1. Analysis of exophers and animal movement using logistic regression.

a) Fluorescence micrographs show that exophers “E” are extruded from the neuronal soma “S” and are scored as a binary event, as exopher or no exopher. The touch receptor neuron ALM is labeled with Pmec-4::mCherry to allow for easy assessment of cell soma and exopher. Scale bar, 10 μm. b) Exopher frequency decreases on adult day 5 (AD5) compared to the frequency on adult day 2 (AD2). *** p < 0.001 by both the Cochran–Mantel–Haenszel (CMH) test and logistic regression, N = 50 per trial over two trials. c) Comparison of multiple mating schemes and their effect on exopher frequency. On AD5, there is no difference between unmated animals and animals mated with sterile males, while mating with fertile males results in a significant increase in exopher frequency. *** p < 0.001, “ns” = not significant by logistic regression, N = 50 per trial over two trials. Error bars = standard error of the percentage. All example exopher data were originally published in Wang et al 2024.

https://doi.org/10.1371/journal.pone.0335143.g001

Comparison of CMH vs. logistic regression

We found that the CMH statistical test met the criteria for analysis of exopher frequencies when single comparisons were made. For example, we compared exopher frequency in young adults (adult day 2) to that in older adults (adult day 5) in replicate trials; CMH analysis yielded p = 0.0000624 (Fig 1b). Evaluation of the same data with the logistic regression model yielded a similar significance level (p = 0.000301). While logistic regression has potential for accommodating experiments with more comparisons, CMH has the benefit of being computationally simpler: the CMH statistic is calculated directly, whereas logistic regression is fit by an iterative calculation. A reliable conclusion drawn from a logistic regression relies on these iterations settling on stable estimates, called convergence. Convergence problems can arise from various features of the dataset, for example the presence of outliers, high correlation between variables, or complete separation, in which a group contains only “1” or only “0” outcomes. If the regression analysis fails to converge, it becomes unreliable and the better option would be to use another test like CMH. A comparison of features between the CMH and logistic regression analyses is summarized in Table 1.
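The iterative nature of the fit, and what a convergence failure looks like, can be sketched in a few lines of Python. This is an illustrative Newton-Raphson fit of a model with one intercept and one binary predictor, not the routine used by R; the helper names `make_data` and `fit_logistic`, the iteration limit, the tolerance, and all counts are assumptions made for this example.

```python
import math

def make_data(events0, n0, events1, n1):
    """Expand group counts into per-animal predictor (x) and outcome (y) lists."""
    x = [0] * n0 + [1] * n1
    y = [1] * events0 + [0] * (n0 - events0) + [1] * events1 + [0] * (n1 - events1)
    return x, y

def fit_logistic(x, y, max_iter=25, tol=1e-8):
    """Newton-Raphson fit of log-odds = b0 + b1 * x.

    Returns (b0, b1, converged, iterations_used).
    """
    b0 = b1 = 0.0
    for iteration in range(1, max_iter + 1):
        g0 = g1 = 0.0          # gradient of the log-likelihood
        h00 = h01 = h11 = 0.0  # information matrix entries
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p
            g1 += (yi - p) * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        if det < 1e-12:        # information matrix is numerically singular
            return b0, b1, False, iteration
        step0 = (h11 * g0 - h01 * g1) / det
        step1 = (h00 * g1 - h01 * g0) / det
        b0 += step0
        b1 += step1
        if max(abs(step0), abs(step1)) < tol:
            return b0, b1, True, iteration
    return b0, b1, False, max_iter

# Hypothetical counts: 20/100 events in the control, 8/100 in the treated group
print(fit_logistic(*make_data(20, 100, 8, 100)))
# Complete separation: no events at all in the control group
print(fit_logistic(*make_data(0, 20, 10, 20)))
```

In the first call the estimates settle within a handful of iterations. In the second, the intercept keeps drifting toward negative infinity because no finite value fits a group with zero events, so the function reports that it did not converge.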

Table 1. Comparison of features between CMH and logistic regression.

https://doi.org/10.1371/journal.pone.0335143.t001

The CMH test is simpler and more specialized, allowing for analysis of categorical data that has one variable while controlling for stratification (dividing into groups for analysis); that is, it adjusts for subgroup differences by evaluating the association between the independent (predictor) and dependent (outcome) variables within each stratum separately. This control allows for a more accurate analysis by reducing bias due to confounding factors that might vary across the strata. Logistic regression is more powerful and versatile, allowing for the modeling of multiple predictors. However, logistic regression relies on stronger modeling assumptions (for example, that the log-odds of the outcome are a linear function of the predictors) and requires more computational effort.

In this work, we found that a specific formatting of collected data enabled easy application of the statistical test. Step-by-step instructions for formatting data for CMH, including annotated PDFs for both simple and multiple comparisons, can be found in a linked GitHub repository (https://github.com/Randrowski/Logistic-Regression-for-Biologists.git). As an alternative to using R, we also provide an Excel spreadsheet that will calculate a CMH p-value according to the Handbook of Biological Statistics [14]. The CMH statistic can also be applied through the following web application, https://j-2464.github.io/LabStatTest/.

We found that reporting our results as an odds ratio was ideal. The odds ratio compares the odds of an event in one group with the odds of the same event in another group. While probability is the likelihood of something happening out of the total, the odds are the probability of the event divided by the probability of no event; the odds ratio compares these odds between two groups and is especially informative in experiments in which there is a control condition. An odds ratio of “1” indicates that an event is equally likely in both groups, while an odds ratio of “2” indicates that the odds of the event are twice as high in one group as in the other. Additionally, one can report the 95% confidence interval for the odds ratio, which provides a range of plausible values for the true odds ratio and gives some insight into the precision of the estimate. Statistical significance can be supported from the confidence interval of the odds ratio by simply determining if the confidence interval includes “1”. Since an odds ratio of “1” indicates no difference between the treatments, a 95% confidence interval that includes “1” would suggest the difference is not significant, whereas a confidence interval that excludes “1” suggests that there is a significant difference between the treatments. CMH and logistic regression produce slightly different outputs, the risk difference estimate and the conditional odds ratio, respectively. The unconditional odds ratio is easily calculated from these values but does require additional steps (annotated code for calculating the odds ratio and 95% confidence intervals is included in https://github.com/Randrowski/Logistic-Regression-for-Biologists.git).
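As a worked illustration of the odds ratio and its confidence interval, the Python sketch below computes both from a single 2×2 table using the standard Wald interval on the log scale. It ignores stratification by trial, uses 1.96 as an approximate 95% critical value, and requires every cell of the table to be non-zero; the function name `odds_ratio_ci` and the counts are hypothetical, and the authors' own R code for this step is in the linked repository.

```python
import math

def odds_ratio_ci(events_t, n_t, events_c, n_c, z_crit=1.96):
    """Odds ratio (treated vs control) with an approximate 95% Wald interval.

    All four cells of the 2x2 table must be greater than zero.
    Returns (odds_ratio, lower_bound, upper_bound).
    """
    a, b = events_t, n_t - events_t  # treated: events, non-events
    c, d = events_c, n_c - events_c  # control: events, non-events
    log_or = math.log((a * d) / (b * c))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # SE of the log odds ratio
    return (math.exp(log_or),
            math.exp(log_or - z_crit * se),
            math.exp(log_or + z_crit * se))

# Hypothetical counts: 8/100 events in the treated group, 20/100 in the control
odds_ratio, lower, upper = odds_ratio_ci(8, 100, 20, 100)
verdict = "excludes" if (upper < 1 or lower > 1) else "includes"
print(f"OR = {odds_ratio:.3f}, 95% CI {lower:.3f} to {upper:.3f} ({verdict} 1)")
```

The interval is built on the log scale and then exponentiated, which is why it is not symmetric around the odds ratio itself.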

Notably, logistic regression supplies more descriptive information from the outset in this simple comparison by reporting the direction of the change in the form of a z-value. A z-value is also generated during the CMH calculation but is not reported in the R output, which displays a chi-square value stripped of its directionality. The z-value is the estimated regression coefficient (the log odds ratio relative to the control) divided by its standard error. An increase in frequency for the experimental versus the control would yield a positive numerical value for the z statistic; a negative z statistic would indicate a decrease in frequency. In our exopher example, we tested whether day 5 adults have a higher or lower frequency of exophergenesis compared to day 2 controls. The z-value was z = −3.614, indicating that the exopher frequency in day 5 adults was lower than that in day 2 adults. The magnitude of the z-value reflects the strength of the evidence for a change: here the estimated log odds ratio lies more than three and a half standard errors below zero, the value expected if age had no effect. The size of the change itself is best read from the odds ratio and its confidence interval. Thus, logistic regression analysis demonstrates that C. elegans hermaphrodites experience a statistically significant decrease in exopher frequency by adult day 5. CMH and logistic regression thus both provide a P statistic and an odds ratio, but logistic regression further provides information on the direction of the change without an additional calculation; this leads us to choose the logistic regression test as the best application for the analysis of exophergenesis.
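The link between a z-value and its p-value can be checked directly. The short Python sketch below converts a z-value into a two-sided p-value under the standard normal distribution and reads the direction of change from its sign. The function name `two_sided_p` is an assumption for this example; the two z-values are those reported in this article, and the resulting p-values agree with the reported ones (0.000301 and 0.0000473) to within rounding.

```python
import math

def two_sided_p(z):
    """Two-sided p-value for a z statistic under the standard normal."""
    return math.erfc(abs(z) / math.sqrt(2.0))

reported = [
    ("adult day 5 vs adult day 2", -3.614),
    ("mated with fertile males vs unmated", 4.068),
]
for label, z in reported:
    direction = "decrease" if z < 0 else "increase"
    print(f"{label}: z = {z:+.3f} ({direction}), p = {two_sided_p(z):.7f}")
```

The sign carries the direction and the p-value depends only on the magnitude, which is why a chi-square statistic (the square of a z-value) loses the direction.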

Data collection framework

An important feature of a rigorous and reproducible statistical analysis is in the formatting of the initial data sheet or data frame. This careful formatting is otherwise known as data hygiene. In general, a single individual assessed for a phenotype should occupy a single row in the data sheet, whereas features of all individuals should be organized into columns. The simplest arrangement would be two columns labeled genotype and outcome, indicated by a one or a zero. However, there could be other additional relevant information that needs to be considered in the comparison. For example, the anoxia-behavioral dataset example we consider below contains four columns (trial, time, genotype, and outcome), allowing one to assess a dichotomous outcome (movement vs no movement in our example) along a time series and across trials.

Importantly, attention to careful data hygiene ensures that data are formatted for import into R scripts. As an added benefit, standardized formatting of data allows for easy sharing within lab groups and with biostatistical collaborators (Fig 2). It may seem more useful to have data sheets formatted with color coded text or columns, which we often see in biological research, but in practice the documentation of an observation on an individual level leads to enhanced accessibility, rigor, reproducibility, and application. We present sample data collection spreadsheets, R code for reformatting data for analysis, and an annotated PDF of the code (highlighting terms that might be specific to an individual experiment) in Extended Data sections for each of the datasets discussed here, thereby demonstrating how we efficiently process data from collection into an analysis-ready dataset.

Fig 2. Data Hygiene and Optimization for R import.

The datasheet in a) shows three trials with information organized into three separate tables. The inset indicated by the dashed box is expanded into an optimized format in b), which now has all information organized into columns that can be called as individual variables. c) shows the further expanded dashed box inset in b). In c), variables are still organized in columns, but now each individual animal from the experiment is represented on a single row. Reformatting code in Supplementary information can be used to expand b to c.

https://doi.org/10.1371/journal.pone.0335143.g002

Logistic regression applications for more complex datasets

Adding conditions/variables.

It is often of interest to include multiple comparisons with the control strain in an experiment. A substantial advantage of logistic regression over CMH is that the former can readily accommodate complex situations. For example, logistic regression accommodates the type of datasets produced when comparing multiple genetic conditions or performing an RNAi screen; in other words, logistic regression accommodates experimental designs with multiple perturbations. In the published example that follows, the authors examined the effect of fertility on exopher rates by monitoring exopher production in hermaphrodites under three conditions: unmated, mated with sterile males, and mated with fertile males [1]. To detect any significant changes across all groups, we included the three conditions within a single spreadsheet as input for our analysis (See: https://github.com/Randrowski/Logistic-Regression-for-Biologists.git). We included code for reformatting/expansion of the data from total instances (i.e., total exophers produced within a trial) to one observation per animal (i.e., did a particular animal produce an exopher, yes/no). This data expansion step generates a spreadsheet optimally formatted for analysis.
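The expansion from totals to one row per animal can be sketched as follows. This is a Python illustration of the idea rather than the authors' R reformatting code; the function name `expand_counts`, the column names (`trial`, `condition`, `exophers`, `total`, `outcome`), and the counts are hypothetical.

```python
def expand_counts(summary):
    """Expand per-trial totals into one row per animal with a 0/1 outcome."""
    rows = []
    for record in summary:
        events, total = record["exophers"], record["total"]
        if not 0 <= events <= total:
            raise ValueError(f"invalid counts in record: {record}")
        for i in range(total):
            rows.append({
                "trial": record["trial"],
                "condition": record["condition"],
                "outcome": 1 if i < events else 0,  # 1 = exopher, 0 = none
            })
    return rows

# Hypothetical summary sheet: one line per trial and condition
summary = [
    {"trial": 1, "condition": "unmated", "exophers": 5, "total": 50},
    {"trial": 1, "condition": "fertile_males", "exophers": 14, "total": 50},
    {"trial": 2, "condition": "unmated", "exophers": 6, "total": 48},
    {"trial": 2, "condition": "fertile_males", "exophers": 16, "total": 50},
]
rows = expand_counts(summary)
print(len(rows), "rows; first row:", rows[0])
```

The order of the rows within a trial carries no meaning; what matters is that every animal contributes exactly one row and that trials of unequal size (48 versus 50 here) are represented by their true number of animals.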

Using the reformatted data spreadsheet as input, the analysis code that we provide includes an initial calculation of the mean and standard error of the percentage. We plotted the data to take an initial look at the sample averages and variation between the groups, and then we calculated the logistic regression. This approach included a determination of whether the data were statistically different (p-value), the direction of the difference and its size relative to its standard error (z-value), and how the odds of the neuron producing an exopher compared between conditions (odds ratio). The logistic regression code we generated makes specific comparisons among multiple groups monitored in the same study; here, we consider only three (unmated, mated with sterile males, and mated with fertile males). As the number of conditions increases, the same model accommodates the additional comparisons. In this case, co-culturing hermaphrodites and sterile males resulted in no significant change in exopher rates compared to unmated hermaphrodites (p = 0.756). However, mating to fertile males increased the exopher rate significantly compared to the unmated control (p = 0.0000473, and a positive z-value of 4.068, indicating an increased frequency compared to control) (Fig 1c). Our analysis contains code that allows one to specify the control treatment, which should be kept in mind during data interpretation, as a positive z-value indicates that the frequency in the experimental group is greater than that of the control, whereas a negative z-value indicates it is less than in the control.
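How the choice of control determines the sign of each z-value can be illustrated with a simplified Python sketch. For a logistic regression whose only predictor is the condition (no trial term, which is a simplification relative to the analysis described here), each coefficient equals the log odds ratio against the control and its standard error has a closed form, so no iterative fit is needed. The function name `contrasts_vs_control` and the counts are hypothetical.

```python
import math

def contrasts_vs_control(counts, control):
    """Compare every group with a chosen control group.

    counts: {group_name: (events, total)}; all cells must be non-zero.
    Returns {group_name: {"odds_ratio", "z", "p"}} for the non-control groups.
    """
    c_events, c_total = counts[control]
    c_non = c_total - c_events
    results = {}
    for group, (events, total) in counts.items():
        if group == control:
            continue
        non = total - events
        log_or = math.log(events / non) - math.log(c_events / c_non)
        se = math.sqrt(1 / events + 1 / non + 1 / c_events + 1 / c_non)
        z = log_or / se                           # Wald z-value
        p = math.erfc(abs(z) / math.sqrt(2.0))    # two-sided p-value
        results[group] = {"odds_ratio": math.exp(log_or), "z": z, "p": p}
    return results

# Hypothetical pooled counts (events, animals scored) for three conditions
counts = {"unmated": (10, 100), "sterile_males": (11, 100), "fertile_males": (30, 100)}
for group, stats in contrasts_vs_control(counts, control="unmated").items():
    print(f"{group}: OR = {stats['odds_ratio']:.2f}, "
          f"z = {stats['z']:+.2f}, p = {stats['p']:.4f}")
```

Calling the same function with `control="fertile_males"` flips the sign of the z-value for the unmated group, which is the reason the reference condition must be stated when reporting z-values.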

Assessing how a phenotype can change over time.

Often a phenotype is changeable over the timeframe in which it is scored. The final example we consider here is an established anoxia reoxygenation model that involves a predictable behavioral response in C. elegans. A tutorial with line-by-line code explanation in R is provided for this dataset (See: https://github.com/Randrowski/Logistic-Regression-for-Biologists.git). Anoxia is an environmental condition of extreme oxygen deprivation (<0.1% O₂). When exposed to anoxia, nematodes enter a reversible hypometabolic state called suspended animation in which movement, feeding, egg laying, and cell division (except anaphase) are arrested [10]. This behavior is reversible, as nematodes reintroduced to normal oxygen levels can resume activity and develop into reproductive adults.

Reoxygenation recovery is genetically variable, and mutants of the Hypoxia Inducible Factor (HIF) pathway can be tested for differences in mobility in response to reoxygenation. Briefly, the HIF pathway includes the transcriptional effector HIF-1, which is activated under conditions of oxygen deprivation, and the prolyl hydroxylase EGL-9, which senses oxygen and inhibits the activity of HIF-1 under normal oxygen conditions [15]. We previously demonstrated that at ten minutes post reoxygenation, egl-9(sa307) loss-of-function mutants show an enhanced ability to recover when compared to the wild-type (N2) and hif-1(ia4) strains [16]. Although this result is robust (Fig 3a, ten minutes post reoxygenation), the dynamics of reoxygenation observed over longer durations reveal unexpected patterns that could be biologically informative (Fig 3b, one hour post reoxygenation). We anticipated that representation of categorical proportions and trends of each mutant could be rigorously assessed using a logistic regression analysis, which we describe below.

Fig 3. Analysis of anoxia recovery with logistic regression models.

a) Bar plot showing the average proportion of animals moving at the ten-minute time point of the anoxia recovery assay. Bars represent means of trials (shown as points). Error bars represent SEM. ** = p < 0.01. b) Time course plot of anoxia recovery for all time points. Solid lines represent mean animals moving at each time point. Error is represented by shaded areas as SEM. c) A plot showing the generalized additive model (GAM) for all three genotypes across all time points.

https://doi.org/10.1371/journal.pone.0335143.g003

Issues with blanket ANOVA statistics that can be better addressed with logistic regression

In our study, we counted animals that were moving or not moving (categorical data) every five minutes post reoxygenation, initially generating a curve that represents repeated measures along a time course (Fig 3b). We note, however, that the curve that we generated violates the assumptions of many statistical tests like t-tests and ANOVA. The first of these assumptions is the independence of measurements, which is violated because the same animals are being measured from time point to time point [17]. Second, complications arise with the representation of the outcome as a proportion (a typical way of presenting the data in the field): if four of ten animals are moving, then the proportion of animals moving is 0.4 at a given time point. The plotting of a proportion fails to capture the variability of the biology being observed because it ignores variability in the denominator of that proportion, the total number of animals scored for that point. In our example, post reoxygenation animals often explore off the agar and either meet premature desiccation death or escape detection at the time of first counting (but make their way back to the moist safety of the agar later) such that the absolute population (i.e., the denominator) effectively changes from time point to time point within a single plate. Logistic regression accounts for a variable denominator because it is fit to the outcome of each individual animal (or, equivalently, to the counts of moving and non-moving animals), so the number of animals scored at each time point enters the analysis directly whether it is constant or variable; ANOVA does not account for a variable denominator, as it relies on a single continuous variable in the form of a proportion [18]. Although there is consideration of overall sample size in ANOVA, the individual data are lost once the data have been collapsed into a proportion. It is important to emphasize that these aspects of data (non-independent sampling; changing absolute population numbers) are frequently present in basic research [18] and yet the more appropriate logistic regression statistics are not commonly applied [8].
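The information lost when counts are collapsed into a proportion can be shown with a small numerical sketch in Python. The function name `proportion_and_se` and the counts are hypothetical; the standard error is the usual binomial approximation.

```python
import math

def proportion_and_se(moving, total):
    """Proportion moving and its binomial standard error."""
    p = moving / total
    return p, math.sqrt(p * (1.0 - p) / total)

# Same proportion, different denominators (hypothetical counts)
p_small, se_small = proportion_and_se(4, 10)
p_large, se_large = proportion_and_se(40, 100)
print(f"4/10:   proportion {p_small:.2f}, SE {se_small:.3f}")
print(f"40/100: proportion {p_large:.2f}, SE {se_large:.3f}")

# Two hypothetical plates scored at the same time point
plates = [(4, 10), (45, 50)]
mean_of_proportions = sum(m / t for m, t in plates) / len(plates)
pooled_proportion = sum(m for m, _ in plates) / sum(t for _, t in plates)
print(f"mean of proportions {mean_of_proportions:.3f}, "
      f"pooled by animal {pooled_proportion:.3f}")
```

Both samples in the first comparison plot as 0.4, yet the estimate from 100 animals is far more precise than the one from 10. In the second comparison, averaging the two proportions gives each plate equal weight regardless of how many animals were scored, whereas an analysis based on individual animals weights each observation equally.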

The variability itself may be a feature of the data that is different amongst genotypes. For continuous data, a simple calculation of the coefficient of variation (CV = SD/mean) can be conducted. Comparing standard deviations can also be accomplished by an F test (F = SD²₁/SD²₂) and then referencing an F table at a desired alpha. Yet, for categorical data, these comparisons are meaningless. A measure of the frequency of differences between measurements or the likelihood that two measurements in the dataset are identical has been presented as the coefficient of unalikeability or the Kader-Perry coefficient [9].
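For categorical data, the coefficient of unalikeability is commonly written as one minus the sum of the squared category proportions, which is the chance that two observations drawn with replacement fall into different categories. The Python sketch below assumes that form; the function name `unalikeability` and the example data are hypothetical.

```python
from collections import Counter

def unalikeability(observations):
    """Coefficient of unalikeability: 1 - sum of squared category proportions.

    0 means every observation is identical; for a binary variable the
    maximum is 0.5, reached when the two categories are equally common.
    """
    n = len(observations)
    if n == 0:
        raise ValueError("at least one observation is required")
    counts = Counter(observations)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

# Hypothetical binary scores (1 = moving, 0 = not moving)
print(unalikeability([1] * 10))            # no variability
print(unalikeability([1] * 2 + [0] * 8))   # rare event
print(unalikeability([1] * 5 + [0] * 5))   # maximal variability for two categories
```

Unlike the standard deviation, this measure needs no numerical distance between categories, so it applies equally to nominal variables such as genotype.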

There are other instances, such as our first example of detecting a rare event (exopher production), when standard approaches are limited. Exophers occur approximately 20% of the time; the categorical measurement in our case is the occurrence or not of exopher production. In cases where the data are continuous and not categorical, this kind of rare event is ideally analyzed with rare-event statistics [19,20]. However, with the limitation of categorical classification, CMH and logistic regression are more appropriate.

Logistic regression program

We chose R, an open-source and relatively accessible coding language, to conduct our statistical analyses. The R code is split into three sections, which deal with (1) installation of the necessary packages and import of an Excel spreadsheet, (2) visualization of the data, and (3) execution of the statistical analyses. We emphasize that three main types of analyses are tested here (CMH, ANOVA, and logistic regression), but these only scratch the surface of the regressions and adaptations that can be applied. For example, we applied a generalized additive model (GAM) to the behavioral data set to show the additive effect of the reoxygenation behavior across all genotypes (Fig 3c). GAMs are regression models that allow for non-linear relationships between a predictor (such as age or time) and an outcome, and they are appropriate when one or more predictors are unlikely to have a linear relationship with the outcome. In this case, the smooth term for time was significant (p = 0.0122), which means that time is a necessary element in describing this behavior: the relationship between movement and the categorical genotypes cannot be explained with a simple line. The GAM is useful for extrapolating trends from data so that future genotypes can be compared to the general trend to reveal adherence or deviation. The GAM used here further demonstrated that the egl-9 pattern of movement over time was significantly different from that of wild type (p = 0.006). Code and details on how to run these analyses are available on GitHub (https://github.com/Randrowski/Logistic-Regression-for-Biologists.git).

While the GAM was the most appropriate method to represent the integration of genotypes across time, an alternative would have been to use ensemble integration. Ensemble integration is a method that uses logistic regression to integrate multimodal data [21], which are frequently a part of biological observation [22]. The moving/not moving data set presented in Fig 3 could be further broken down into multiple variables: the observed behavior, the time course, the genotype, and the quality of the line or adherence to the mean at any given time point. All of these variables contribute to the pattern we observe. While further deconstruction of this dataset is outside the scope of this project, ensemble integration is a valid alternative for describing data composed of categorical variables.
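For readers who want to see what a logistic regression fit computes, the following is a minimal Python sketch that estimates an intercept and one slope by Newton-Raphson (equivalent to iteratively reweighted least squares). It is a simplified illustration under stated assumptions, not the R workflow in our repository and not the GAM described above. The function names, the grouped-data layout, and the counts are hypothetical. In R, the same model would typically be fit with `glm(..., family = binomial)`.

```python
import math


def _score_and_information(groups, b0, b1):
    """Gradient and information matrix of the binomial log-likelihood."""
    g0 = g1 = h00 = h01 = h11 = 0.0
    for x, successes, trials in groups:
        p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))  # fitted probability
        resid = successes - trials * p              # observed minus expected events
        w = trials * p * (1.0 - p)                  # binomial variance weight
        g0 += resid
        g1 += x * resid
        h00 += w
        h01 += w * x
        h11 += w * x * x
    return g0, g1, h00, h01, h11


def fit_logistic(groups, tol=1e-10, max_iter=50):
    """Fit logit(P(event)) = b0 + b1 * x to grouped binary data.

    groups: iterable of (x, successes, trials), e.g. x = 0 for control
    and x = 1 for mutant. Returns estimates, the slope's standard error,
    the odds ratio exp(b1), and a two-sided Wald p-value for the slope.
    """
    b0 = b1 = 0.0
    for _ in range(max_iter):
        g0, g1, h00, h01, h11 = _score_and_information(groups, b0, b1)
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system (information matrix) * step = gradient
        step0 = (h11 * g0 - h01 * g1) / det
        step1 = (h00 * g1 - h01 * g0) / det
        b0 += step0
        b1 += step1
        if max(abs(step0), abs(step1)) < tol:
            break
    # Standard error of the slope from the inverse information at the estimate
    _, _, h00, h01, h11 = _score_and_information(groups, b0, b1)
    det = h00 * h11 - h01 * h01
    se_slope = math.sqrt(h00 / det)
    z = b1 / se_slope
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail
    return {
        "intercept": b0,
        "slope": b1,
        "se_slope": se_slope,
        "odds_ratio": math.exp(b1),
        "p_value": p_value,
    }


if __name__ == "__main__":
    # Hypothetical counts (NOT data from this study): (genotype code, events, animals)
    example = [(0, 8, 50), (1, 15, 50)]
    fit = fit_logistic(example)
    print(f"odds ratio = {fit['odds_ratio']:.2f}, "
          f"SE(log OR) = {fit['se_slope']:.3f}, p = {fit['p_value']:.4f}")
```

With a single binary predictor, the fitted slope is the log odds ratio between the two groups, so the result can be checked by hand against the 2×2 table. Additional predictors such as trial or time would add further columns to the same framework. That case calls for a general linear solver, or an established package, in place of the explicit 2×2 inverse used here.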

Conclusions

Biologists are tasked with observing and describing phenomena within nature. As scientists, we must reduce our observations to variables that can be repeatedly observed in many circumstances. Statistical analysis allows us to compare these variables. We describe here the analysis of two biological phenomena: neuronal exopher production (Fig 1) and changes in locomotory movement over time (Fig 3). Since the two phenomena we studied can each be reduced to a binary variable (present/absent and yes/no, respectively), logistic regression allows us to implement a rigorous and informative statistical analysis (Table 1). The simplest way to accomplish this analysis is through R coding, with which over half of the authors on this publication were unfamiliar at the outset of this collaboration. Beyond the sample code we provide, R is open source and easily accessible, and many talented programmers share their code widely. In addition, AI machine learning models, such as Google’s “Gemini” engine, have become invaluable resources for troubleshooting R code. During the preparation of our datasets for import into R, we realized that even if logistic regression is not a good fit for a data set, improved data hygiene invariably benefits a lab’s reproducibility and analysis efforts (Fig 2). We argue that logistic regression should be a default consideration for the analysis of categorical data in biology, and we hope that the arguments presented here will facilitate more rigorous analyses within our scientific community.

Supporting information

S1 Table. Strain List.

A complete list of strains used in this study.

https://doi.org/10.1371/journal.pone.0335143.s001

(PDF)

S1 File. Step-by-step Protocol.

Detailed explanation of the provided code to reformat data for analysis, run CMH and logistic regression analyses, and plot these results in R.

https://doi.org/10.1371/journal.pone.0335143.s002

(PDF)

S2 File. Logistic-Regression-for-Biologists-main.

https://doi.org/10.1371/journal.pone.0335143.s003

(ZIP)

Acknowledgments

We thank Nelson Mejia and Ryan Nguyen for testing our code and providing comments. We also thank the Caenorhabditis Genetics Center (CGC, funded by the National Institutes of Health – Office of Research Infrastructure Programs (P40 OD010440)) for providing some strains.

References

1. Wang G, Guasp RJ, Salam S, Chuang E, Morera A, Smart AJ, et al. Mechanical force of uterine occupation enables large vesicle extrusion from proteostressed maternal neurons. Elife. 2024;13:RP95443. pmid:39255003
2. Chakravarti I, Laha R, Roy J. Handbook of Methods of Applied Statistics. New York: John Wiley and Sons; 1967.
3. Spitzer K, Pelizzola M, Futschik A. Modifying the chi-square and the cmh test for population genetic inference: adapting to overdispersion. Ann Appl Stat. 2020;14(1):202–20.
4. Aggarwal R, Ranganathan P. Common pitfalls in statistical analysis: Linear regression analysis. Perspect Clin Res. 2017;8(2):100–2. pmid:28447022
5. Hosmer DW Jr, Lemeshow S, Sturdivant RX. Applied Logistic Regression. Wiley Series in Probability and Statistics. 2023.
6. Tackett M. Three Principles for Modernizing an Undergraduate Regression Analysis Course. Journal of Statistics and Data Science Education. 2023;31(2):116–27.
7. Kunene N, Toskin K. An approach for ushering logistic regression early in introductory analytics courses. Information Systems Education Journal. 2002;20(5):42–53.
8. Fienberg SE. The Analysis of Cross‐Classified Categorical Data. J R Stat Soc: Ser A (Gen). 1978;141(4):551.
9. Brenner S. The genetics of Caenorhabditis elegans. Genetics. 1974;77(1):71–94. pmid:4366476
10. Padilla PA, Ladage ML. Suspended animation, diapause and quiescence: arresting the cell cycle in C. elegans. Cell Cycle. 2012;11(9):1672–9. pmid:22510566
11. Melentijevic I, Toth ML, Arnold ML, Guasp RJ, Harinath G, Nguyen KC, et al. C. elegans neurons jettison protein aggregates and mitochondria under neurotoxic stress. Nature. 2017;542(7641):367–71. pmid:28178240
12. Arnold ML, Cooper J, Grant BD, Driscoll M. Quantitative approaches for scoring in vivo neuronal aggregate and organelle extrusion in large exopher vesicles in C. elegans. J Vis Exp. 2020;163.
13. Hector A, von Felten S, Schmid B. Analysis of variance with unbalanced data: an update for ecology & evolution. J Anim Ecol. 2010;79(2):308–16. pmid:20002862
14. McDonald JH. Handbook of Biological Statistics. 3rd ed. Baltimore, Maryland: Sparky House Publishing; 2014.
15. Powell-Coffman JA. Hypoxia signaling and resistance in C. elegans. Trends Endocrinol Metab. 2010;21(7):435–40. pmid:20335046
16. Ghose P, Park EC, Tabakin A, Salazar-Vasquez N, Rongo C. Anoxia-reoxygenation regulates mitochondrial dynamics through the hypoxia response pathway, SKN-1/Nrf, and stomatin-like protein STL-1/SLP-2. PLoS Genet. 2013;9(12):e1004063. pmid:24385935
17. Aarts S, van den Akker M, Winkens B. The importance of effect sizes. Eur J Gen Pract. 2014;20(1):61–4. pmid:23992128
18. Yu Z, Guindani M, Grieco SF, Chen L, Holmes TC, Xu X. Beyond t test and ANOVA: applications of mixed-effects models for more rigorous statistical analysis in neuroscience research. Neuron. 2022;110(1):21–35. pmid:34784504
19. Coles S. An Introduction to Statistical Modeling of Extreme Values. Springer Ser Stat; 2001.
20. Pinheiro EC, Ferrari SLP. A comparative review of generalizations of the extreme value distribution. arXiv. 2015.
21. Li YC, Wang L, Law JN, Murali TM, Pandey G. Integrating multimodal data through interpretable heterogeneous ensembles. Bioinform Adv. 2022;2(1):vbac065. pmid:36158455
22. Matthews TJ, Borges PAV, Whittaker RJ. Multimodal species abundance distributions: a deconstruction approach reveals the processes behind the pattern. Oikos. 2014;123(5):533–44.

