## Abstract

We consider the problem of constructing a complete set of parameters that account for all of the degrees of freedom for point-biserial variation. We devise an algorithm in which sorting, as an intrinsic property of both numbers and labels, is used to generate the parameters. Algebraically, point-biserial variation is represented by a Cartesian product of statistical parameters for two sets of data, and the difference between mean values (*δ*) corresponds to the representation of variation in the center of mass coordinates, (*δ*, *μ*). The existence of alternative effect size measures is explained by the fact that mathematical considerations alone do not specify a preferred coordinate system for the representation of point-biserial variation. We develop a novel algorithm for estimating the nonoverlap proportion (*ρ*_{pb}) of two sets of data. *ρ*_{pb} is obtained by sorting the labeled data and analyzing the induced order in the categorical data using a diagonally symmetric 2 × 2 contingency table. We examine the correspondence between *ρ*_{pb} and point-biserial correlation (*r*_{pb}) for uniform and normal distributions. We identify the homogeneous coordinate, projective line, and positive half-circle representations for Pearson product-moment correlation, Cohen’s *d*, and *r*_{pb}. We compare the performance of *r*_{pb} versus *ρ*_{pb} and the sample size proportion corrected correlation (*r*_{pbd}), confirm that invariance with respect to the sample size proportion is important in the formulation of the effect size, and give an example where three parameters (*r*_{pbd}, *μ*, *ρ*_{pb}) are needed to distinguish different forms of point-biserial variation in CART regression tree analysis. We discuss the importance of providing an assessment of cost-benefit trade-offs between relevant system parameters because ‘substantive significance’ is specified by mapping functional or engineering requirements into the effect size coordinates.
Distributions and confidence intervals for the statistical parameters are obtained using Monte Carlo methods.

**Citation: **Luck S (2020) Nonoverlap proportion and the representation of point-biserial variation. PLoS ONE 15(12): e0244517. https://doi.org/10.1371/journal.pone.0244517

**Editor: **Alan D. Hutson, Roswell Park Cancer Institute, UNITED STATES

**Received: **July 20, 2020; **Accepted: **December 10, 2020; **Published: ** December 28, 2020

**Copyright: ** © 2020 Stanley Luck. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

**Data Availability: **The data are available from FigShare, at https://doi.org/10.6084/m9.figshare.11591334.v2.

**Funding: **The author(s) received no specific funding for this work. The author, Stanley Luck, is a member of Vector Analytics, LLC, which is a science consulting company. Since there is no salary, Vector Analytics did not provide any funds for this work. Vector Analytics, LLC did not have any additional role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript. The specific roles of the author are articulated in the ‘author contributions’ section.

**Competing interests: ** The author, Stanley Luck, is a member of Vector Analytics, LLC, which is a science consulting company. This affiliation also does not alter our adherence to PLOS ONE policies on sharing data and materials. There are no competing interests connected with our consulting work at Vector Analytics, LLC. This work is not associated with any patents or commercial products.

## 1 Introduction

This work began when we noticed that results from classification and regression tree (CART) analyses did not correspond well with statistical associations in genome-wide association studies (GWAS) [1]. Then, we discovered the extensive research literature discussing confounding properties of effect size measures used in our analyses. Statistical components of our bioinformatics system came from open source software packages that are widely used for research. In data analysis, there are two important requirements for obtaining reproducible results. First, statistics methodology is subject to the general physical principle that it is necessary to account for all of the degrees of freedom when studying a quantitative phenomenon. Second, analysis protocols must correct for dependence on data acquisition parameters, including unbalanced sample sizes, in order to obtain interpretable results for effect size. Our work on proportional variation and the phi coefficient for 2 × 2 contingency tables was recently published in this journal; we refer to this as Paper1 [2]. There, we demonstrate that odds-ratio or relative risk, as standalone effect size measures, do not account for all of the degrees of freedom and are therefore subject to ambiguity. Using matrix factorization for the marginal sums, we identified the four alternative forms of proportional variation which serve as the basis for specifying the effect size. There is also an elementary discussion of projective geometry for fractional variation that might be helpful to the reader. Here, we study similar problems in the formulation of effect size for point-biserial variation and the associated correlation coefficient, *r*_{pb}. First, the term ‘point-biserial’ comes from psychology statistics, and we explain its use as a general reference for the two-group data analysis problem. The difference between mean values for two sets of data, *δ*, serves as the basis for specifying effect size for system response to perturbation.
Statistically, analysis of *δ* corresponds to measuring the relation or association between a continuous variable and a binary categorical variable obtained by individually labeling the data. The standard procedure is to replace the labels with numeric {0, 1} indicators. The Pearson product-moment correlation coefficient (*r*) calculated from these numeric data is known as the point-biserial correlation coefficient (*r*_{pb}) [3]. This connection between *r*_{pb} and *δ* explains our use of the term ‘point-biserial’. It is standard terminology in the effect size literature. We provide a short discussion of the literature which gave us much inspiration, and note that there are several books on effect size methods as well [4, 5]. In their discussion of physical principles in the formulation of effect size, Kelley & Preacher recommend that an effect size should serve as a sample size independent estimate of a system parameter [6]. The existence of alternative effect size measures, and their classification as relationship, group difference, and group overlap, is discussed by Huberty [7]. A recently proposed group overlap measure is nonparametric but requires the use of kernel density estimators to produce an approximate representation of the unknown densities [8]. McGrath and Meyer give a nice review of research into the limitations of *r*_{pb}, and point out that different measures can “lead to different conclusions about the size or importance” of an effect [3]. Various researchers have already noted that there are two complications that can limit the range of *r*_{pb}. The first difficulty arises from the definition of *r*_{pb}, requiring the {0, 1} representation to allow the calculation of *r*. The {0, 1} representation corresponds to binary groupings of the data, comprising a pair of many-to-one mappings.
The latter are incompatible with *r* as a measure of the degree to which two variables are linearly related [9] and raise questions about the interpretation of *r*_{pb}. It has been shown that when the {**y**_{A}, **y**_{B}} data are obtained by a dichotomy of a normal distribution, *r*_{pb} has a maximum value of 0.79 [3, 10]. In contrast, when each set corresponds to a normal distribution, *r*_{pb} still ranges from −1.0 to 1.0 [11, 12], with the proviso that the extremal values are reached in the limit as |*δ*| approaches infinity. Secondly, *r*_{pb} is subject to confounding from unbalanced sample sizes for the {**y**_{A}, **y**_{B}} data; in the effect size literature, the sample size proportions are usually referred to as ‘base rates’. Then, variation in the sampling proportions between data sets leads to irreproducibility, which complicates the interpretation of *r*_{pb}. The machine learning community has rediscovered the problems associated with unbalanced sample sizes, creating the new term “classification imbalance” [13].

It is accepted practice to report a single effect size such as Cohen’s *d* as the basis for deciding the outcome of an experiment. However, *d* is associated with an implicit parameterization that does not account for all of the degrees of freedom for point-biserial variation, which results in ambiguity. Consequently, our objective is to construct a computational framework for a complete parameterization of the variation (**v**_{pb}). We use an inductive approach based on connections between *r*_{pb}, Cohen’s *d*, and the mean squared error information gain (IG_{MSE}). These measures play an important role because of their connections with elementary statistical concepts. We show that Cohen’s *d* is a perspective function of center of mass coordinates (*δ*, *μ*) for the mean value vector. We also identify a novel association measure, *ρ*_{pb}, which measures the degree of nonoverlap between two sets of data. *ρ*_{pb} is calculated directly from the data and is therefore nonparametric because the underlying densities are unspecified. A particular goal is to examine the dependence of *r*_{pb} on unbalanced sample sizes because of concerns about the effect on reproducibility. We address other problems as well, including the use of Monte Carlo methods to estimate the joint distribution for statistical parameters. As in Paper1, we use CART association graphs to compare the performance of various effect size measures. However, in this work we are particularly interested in the case where the target variable is a quantitative variable, which corresponds to the regression tree implementation (rCART) [14]. We show that *ρ*_{pb} and the sample size proportion corrected correlation (*r*_{pbd}) serve as effect size measures for rCART while avoiding complications associated with *r*_{pb}.
The main novel contributions of this work are as follows: 1) a computational model for generating statistical parameters for point-biserial variation **v**_{pb}, which corresponds to the Cartesian product of parameters for two sets of data, and identification of the fact that pure mathematics alone is not sufficient to specify a preferred effect size, 2) a sorting algorithm to estimate the nonoverlap proportion, *ρ*_{pb}, of two sets of data using a diagonally symmetric 2 × 2 contingency table, 3) identification of the homogeneous coordinate, projective line, and positive half-circle representations for Pearson correlation, 4) demonstration of the equivalence between *r*_{pb} and IG_{MSE}, and 5) demonstration of the importance of adjusting for unbalanced sample sizes in impurity measures in rCART analysis.

## 2 Methods

The specification of a complete set of parameters for point-biserial variation, **v**_{pb}, is a prerequisite for the rigorous formulation of effect size. Then, a measure for effect size is associated with a perspective function of **v**_{pb}. We begin with an examination of limitations of *r*_{pb} in section 2.1. Then, we use an inductive approach to construct an algebraic framework for point-biserial variation in sections 2.2–2.5.

### 2.1 The effect of unbalanced sample sizes on *r*_{pb}

The derivation and limitations of *r*_{pb} are reviewed by McGrath and Meyer [3]. Two sets, **y**_{A} and **y**_{B}, are combined to form a set of paired values, {(*c*_{i}, *y*_{i})}, where *c*_{i} is a group membership label, and the {(*c*_{i}, *y*_{i})} data correspond to the vectors, (**c**, **y**). The standard practice is to invoke a numeric {0, 1} representation for **c** to obtain an indicator vector, **I**_{c}. Then, application of the Pearson product-moment formula produces the point-biserial correlation coefficient [3]
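$$r_{pb} = \frac{\mathrm{Cov}(\mathbf{I}_c, \mathbf{y})}{\sqrt{\mathrm{Var}(\mathbf{I}_c)\,\mathrm{Var}(\mathbf{y})}} \tag{1}$$

$$r_{pb} = \frac{d\,\sqrt{p_A p_B}}{\sqrt{1 + d^2\,p_A p_B}} \tag{2}$$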
where *p*_{A} = *N*_{A}/(*N*_{A} + *N*_{B}) and *p*_{B} = 1 − *p*_{A} are sample size proportions, Cohen’s *d* is defined as
$$d = \frac{\delta}{S_p} \tag{3}$$

and the pooled variance, *S*_{p}^{2}, is the weighted average of the sample variances. Thus, |*r*_{pb}| approaches unity as |*d*| → ∞ [11, 12] for 0 < *p*_{A} < 1. Rearranging Eq 2, we obtain the quadratic relation
$$p_A (1 - p_A)\,d^2 \left(1 - r_{pb}^2\right) = r_{pb}^2 \tag{4}$$
For a fixed value of *r*_{pb}, there is a range of (*d*, *p*_{A}) values (Fig 1). Alternatively, the variation in (*r*_{pb}, *p*_{A}) for fixed *d* becomes a source of irreproducibility in *r*_{pb} because *p*_{A} can vary between experiments depending on the data acquisition protocol. This ambiguity explains why researchers have expressed concern about the confounding effect of unbalanced sample sizes on *r*_{pb}, and effect size in general [3, 6]. Furthermore, the binomial *p*_{A} *p*_{B} dependence originates from the covariance
(5) (6)
and variance, Var(**I**_{c}) = *p*_{A} *p*_{B}. Therefore, the criticism about *p*_{A} *p*_{B} dependence applies more broadly to the use of the numeric {0, 1} indicator variable. Various researchers have already recommended that the proportions should be equalized, *p*_{A} = *p*_{B} = 1/2, in Eq 2 to give [3]
$$r_{pbd} = \frac{d}{\sqrt{d^2 + 4}} \tag{7}$$
This ‘attenuation-corrected’ coefficient is denoted as *r*_{c} in [4]. The *r*_{pb} and *r*_{pbd} curves in Fig 2 provide an illustration of this correction. The one-to-one projective relation between *r*_{pbd} and Cohen’s *d* is discussed in section 2.4, and the application of *r*_{pbd} in rCART is discussed in section 2.5.
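The dependence of *r*_{pb} on the sample size proportion can be explored numerically. The following Python sketch assumes the standard relation *r*_{pb} = *d*√(*p*_{A}*p*_{B})/√(1 + *d*²*p*_{A}*p*_{B}) reported by McGrath and Meyer [3], with population-form variances; the function names are hypothetical and chosen only for illustration.

```python
import math


def r_pb_from_d(d, p_a):
    """Point-biserial correlation implied by Cohen's d and the proportion p_a.

    Assumed relation: r_pb = d*sqrt(p_a*p_b) / sqrt(1 + d**2 * p_a*p_b).
    """
    if not 0.0 < p_a < 1.0:
        raise ValueError("p_a must lie strictly between 0 and 1")
    pq = p_a * (1.0 - p_a)
    return d * math.sqrt(pq) / math.sqrt(1.0 + d * d * pq)


def r_pbd_from_d(d):
    """Proportion-corrected coefficient: the same relation with p_a = p_b = 1/2."""
    return d / math.sqrt(d * d + 4.0)


def d_from_r_pb(r_pb, p_a):
    """Invert the assumed relation: the d needed to reach r_pb at a given p_a."""
    pq = p_a * (1.0 - p_a)
    return r_pb / math.sqrt(pq * (1.0 - r_pb * r_pb))


if __name__ == "__main__":
    d = 0.5
    # Same d, different sample size proportions: r_pb changes, r_pbd does not.
    for p_a in (0.5, 0.8, 0.95, 15000 / 15500):
        print(f"p_A={p_a:.3f}  r_pb={r_pb_from_d(d, p_a):.3f}  "
              f"r_pbd={r_pbd_from_d(d):.3f}")
    # Same r_pb, different proportions: the implied d changes.
    for p_a in (0.5, 0.8, 0.95):
        print(f"r_pb=0.2 at p_A={p_a:.2f} requires d={d_from_r_pb(0.2, p_a):.3f}")
```

With *d* held fixed, the computed *r*_{pb} shrinks as *p*_{A} moves away from 1/2 while *r*_{pbd} is unchanged, and a single value such as *r*_{pb} = 0.2 is compatible with a range of *d* values.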

Fig 1. For the fixed value *r*_{pb} = 0.2, there is a range for Cohen’s *d* and the sample size proportion, *p*_{A}. This ambiguity complicates the interpretation of *r*_{pb} as an effect size measure.

Fig 2. Theoretical curves and estimated values for point-biserial correlation, *r*_{pb}, nonoverlap proportion, *ρ*_{pb}, and sample size adjusted correlation, *r*_{pbd}, for simulated data with unequal sample sizes (*N*_{A} : *N*_{B} = 15000 : 500) and the difference between mean values, *δ*. Compared to *r*_{pbd}, *r*_{pb} is attenuated due to the confounding effect of the binomial sampling factor. A: Uniform unit width distributions. B: Standard normal (*σ* = 1) distributions.

### 2.2 Statistical parameters for point-biserial variation

In this section, we consider the question of how to generate a set of parameters for statistical variation in point-biserial data. The fact that *r*_{pb} is subject to confounding effects suggests that replacing categorical labels with {0, 1} numeric values is an improper procedure, because the labels acquire arithmetic properties in an ad-hoc way. Instead, we propose a new framework where sort is used as an intrinsic property of both numbers and labels. Suppose there is a machine which generates numbers with labels, (*c*_{i}, *y*_{i}), in no particular order, placing them in a data table to produce a point-biserial data set. Then, the table can be sorted using either **c** or **y**, to obtain orderings denoted as **y**_{c} and **c**_{y}, respectively. As we discuss next, these orderings are associated with statistical parameters, **v**_{c} and **v**_{y}, respectively. However, there is no rule that specifies which parameterization, **v**_{c} or **v**_{y}, might be preferred. Therefore, we make the following proposition,

**Proposition 1**. *Point-biserial variation is parameterized by the Cartesian product of statistical parameters for the* **y**_{c} *and* **c**_{y} *orderings*,
$$\mathbf{v}_{pb} = \mathbf{v}_c \times \mathbf{v}_y \tag{8}$$
The **y**_{c} ordering corresponds to sorting the **y** data into two sets, **y**_{c} ↔ {**y**_{A}, **y**_{B}}. Then, the statistical parameters for the two sets are associated with a two-component Cartesian product structure, yielding the familiar effect size measures, Cohen’s *d* and *r*_{pb}, as discussed in section 2.3. The **c**_{y} ordering is associated with a new nonoverlap measure, *ρ*_{pb}. The two types of **y**-sort, ascending or descending, produce orderings where either {(*c*_{i}, *y*_{i})|*y*_{i} ≤ *y*_{i+1}} or {(*c*_{i}, *y*_{i})|*y*_{i} ≥ *y*_{i+1}}, respectively. Then, the **c**-column corresponds to a **y**-ordered string, **c**_{y}. The induced order from the **y**-sorting is reflected in the degree of mixing of As and Bs in **c**_{y}. Next, we sort the data with respect to **c**, obtaining a maximally ordered string, **c**_{M}, where the As and Bs are completely separated. **c**_{M} corresponds to the condition where **y**_{A} and **y**_{B} are disjoint, which has been characterized as “perfect correlation” [11]. Our **c**_{y}-sorting algorithm requires equal sample sizes, *N*_{A} = *N*_{B}. When the sample sizes are unequal, a preprocessing step is required. Suppose *N*_{B} < *N*_{A}. Then, the **y**_{B} data are replicated to create a new data set, **y**_{Brep}, such that *N*_{Brep} = *N*_{A}. If the difference in sample size is small, 0 < *N*_{A} − *N*_{B} < *N*_{B}, then a subset of **y**_{B} uniformly spaced by rank is replicated. The **y**_{Brep} and **y**_{A} data are combined to obtain the (**c**_{y}, **c**_{M}) strings. They constitute a set of joint observations for two categorical variables, which are summarized in a diagonally symmetric 2 × 2 contingency table of the form [[*a*, *b*], [*b*, *a*]]. The symmetric form results from the equal sample size condition, which requires that the rows and columns each sum to *N*_{A}. Then, the nonoverlap proportion is given by the difference in proportions
$$\rho_{pb} = p_a - p_b \tag{9}$$

where *p*_{a} = *a*/*N*_{A}, and *p*_{b} = 1 − *p*_{a}. When **y**_{A} and **y**_{B} are disjoint, |*ρ*_{pb}| = 1. The sign of *ρ*_{pb} is arbitrary because the order of the columns (or rows) of the 2 × 2 table depends on the direction of the sort in **y** or **c**_{M}. In our implementation, the sign is chosen to be consistent with Cohen’s *d*. The *ρ*_{pb} values in Fig 2 were obtained using this sort algorithm. The overlap between uniform unit width distributions is an important pedagogical case because the expressions for Cohen’s *d*, *r*_{pbd}, and *ρ*_{pb} take a simple form. Geometrically, the overlap (*θ*_{U}) is given by a rectangle with area *θ*_{U} = 1 − *δ* for the difference between mean values, with 0 ≤ *δ* ≤ 1, and *θ*_{U} = 0 for *δ* > 1. The nonoverlap is given by *ρ*_{pbU} = 1 − *θ*_{U} = *δ*, with 0 ≤ *δ* ≤ 1. Similarly,
$$d_U = 2\sqrt{3}\,\delta \tag{10}$$

$$r_{pbdU} = \frac{\sqrt{3}\,\delta}{\sqrt{1 + 3\delta^2}} \tag{11}$$
For the overlap of standard normal (*σ* = 1) distributions, we obtain
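$$\theta_N = 2\,\Phi\!\left(-\frac{\delta}{2}\right) \tag{12}$$

$$\rho_{pbN} = 1 - \theta_N = 2\,\Phi\!\left(\frac{\delta}{2}\right) - 1 \tag{13}$$

$$r_{pbdN} = \frac{\delta}{\sqrt{\delta^2 + 4}} \tag{14}$$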
where Φ is the cumulative normal distribution function [8]. In Fig 2, we observe that at a large enough *δ*, *r*_{pbd} is attenuated compared to *ρ*_{pb}, as expected [11]. However, for small *δ*, the inequality is reversed, i.e., *r*_{pbd} > *ρ*_{pb}. Nevertheless, there is close correspondence between *r*_{pbd} and *ρ*_{pb} for both the uniform and normal distributions. This is particularly true for highly correlated data where both *r*_{pbd} and *ρ*_{pb} are near 1, and are therefore equivalent. However, in section 3 we demonstrate that when the data are not well correlated, both *r*_{pbd} and *ρ*_{pb} are needed in order to distinguish different forms of point-biserial variation. We conclude that *r*_{pbd} and Cohen’s *d* serve as measures of the nonoverlap of distributions but are not necessarily equivalent to *ρ*_{pb}.
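The sorting procedure for *ρ*_{pb} described above can be sketched in Python for the equal sample size case. This is a minimal illustration under stated assumptions, not the author's implementation: the replication step for unequal sample sizes is omitted, ties in **y** are resolved by the stable sort, **c**_{M} is taken as all As followed by all Bs, and the sign is set to follow the difference of the group means. The function name is hypothetical.

```python
from statistics import mean


def nonoverlap_proportion(y_a, y_b):
    """Sort-based nonoverlap proportion for two equal-sized samples (sketch)."""
    n = len(y_a)
    if n == 0 or len(y_b) != n:
        raise ValueError("this sketch requires equal, nonzero sample sizes")

    # Label the data and sort by y (ascending) to get the y-ordered string c_y.
    labeled = [("A", y) for y in y_a] + [("B", y) for y in y_b]
    c_y = [c for c, _ in sorted(labeled, key=lambda pair: pair[1])]

    # Maximally ordered string c_M: the As and Bs completely separated.
    c_m = ["A"] * n + ["B"] * n

    # Joint (c_y, c_M) counts form the symmetric table [[a, b], [b, a]].
    a = sum(1 for cy, cm in zip(c_y, c_m) if cy == "A" and cm == "A")
    b = n - a  # each row and column sums to n
    p_a = a / n
    p_b = b / n
    rho = p_a - p_b

    # Assumed sign convention: follow the sign of mean(A) - mean(B),
    # which is also the sign of Cohen's d for that ordering of groups.
    sign = 1.0 if mean(y_a) >= mean(y_b) else -1.0
    return sign * abs(rho)


if __name__ == "__main__":
    # Evenly spaced stand-ins for two unit-width uniform samples shifted by 0.3.
    n = 1000
    grid = [(i + 0.5) / n for i in range(n)]
    shifted = [v + 0.3 for v in grid]
    print(nonoverlap_proportion(shifted, grid))
```

For evenly spaced unit-width data shifted by *δ*, the magnitude returned by the sketch tracks the uniform-distribution result *ρ*_{pbU} = *δ*, and disjoint samples give a magnitude of 1.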

### 2.3 Coordinates for a two-component system of distributed effects

In this section, we discuss the fact that *d* and *ρ*_{pb} are only two elements of a minimal set of parameters for representing point-biserial variation. The one-to-one correspondence, *d* ↔ *r*_{pbd}, will be discussed in section 2.4. Algebraically, **v**_{c} corresponds to the Cartesian product of statistical parameters for two sets of data, . Introducing the center of mass parameter, , the mean values vector is expressed as
(15) (16)
where (1, 1) and (1, −1) comprise the center of mass basis. We note that the generalization for a weighted average is straightforward. A similar decomposition holds for variances
(17)
where and . A further reduction is obtained if the variances are homoscedastic, , yielding , and . Finally, we obtain
(18)
as a minimal set of parameters for point-biserial variation. However, we observe that **v**_{pb} is not unique because functions of the components, {*f*_{i}(*v*_{pb,i})}, including linear fractional transformations can be introduced to obtain alternative representations. Mathematics alone is not sufficient to specify a preferred vector basis, which explains why there are alternative effect size measures [6, 7]. Furthermore, *r*_{pb} and Cohen’s *d* correspond to perspective functions [15] of **v**_{pb} and do not account for all of the degrees-of-freedom. Consequently, the practice of using one of these measures to serve as a one-parameter summary of experimental results will be subject to irreproducibility.
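As a worked illustration of the center of mass coordinates, the decomposition of the mean value vector can be written out explicitly. The symbols $\bar{y}_A$ and $\bar{y}_B$ for the group means, the sign convention for *δ*, and the unweighted definition of *μ* are assumptions made here for illustration; a weighted average would change the coefficients.

```latex
\begin{aligned}
\delta &= \bar{y}_A - \bar{y}_B, \qquad \mu = \tfrac{1}{2}\left(\bar{y}_A + \bar{y}_B\right),\\
(\bar{y}_A,\ \bar{y}_B) &= \left(\mu + \tfrac{\delta}{2},\ \mu - \tfrac{\delta}{2}\right)
  = \mu\,(1,\ 1) + \tfrac{\delta}{2}\,(1,\ -1).
\end{aligned}
```

For example, group means of 3 and 1 give *δ* = 2 and *μ* = 2, so that (3, 1) = 2 (1, 1) + 1 (1, −1).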

The term ‘substantive significance’ has been used to refer to the magnitude of an effect that would be regarded as practically important in a given application [6]. Suppose functional or engineering requirements are expressed in terms of a vector, **h**, of system parameters. Then, the utility of an effect would be specified as a mapping, *u*(**h**). The specification of *u*(**h**) would account for differences in cost-benefit trade-offs for variation in the {*h*_{i}} components. The substantive significance for the effect size would be determined by the mapping, *u*(**h**) → *u*(**v**_{pb}). Without this information, it is difficult to reach a consensus on the merits of an effect size. This explains the criticism of Cohen’s thresholds for small, medium, and large effects as “somewhat arbitrary” [16] and suggestions that the significance of the magnitude of an effect size depends on the research question [3, 17, 18].

A fundamental limitation arises from the fact that the (*δ*, *μ*) center of mass decomposition does not extend to higher dimensions in a straightforward way. Consider the group means vector for three sets, i.e., . The default center of mass parameter is defined as . However, there is no standard procedure for choosing the two additional deviation parameters needed to specify a complete basis. Consequently, the formulation of an effect size measure for multiple group variation is not a well-posed problem, i.e., there is no unique solution [19]. This explains why Cohen’s *d* does not generalize to schemes involving more than two groups [20] and provides support for previous recommendations to break down ‘complicated hypotheses’, p. 526 [21], and ‘reduce any multiple-level or multiple-variable relationship’ into a set of two-variable effect size relationships [17]. This provides the raison d'être for the development of exploratory methodologies such as CART in high-dimensional data analytics [22, 23].

### 2.4 Homogeneous coordinates for Pearson correlation

In the effect size literature, it is accepted practice to distinguish three different types of effect size measure, ‘relationship’, ‘group difference’, and ‘group overlap’ [3, 7]. In this section, we discuss the fact that this classification is misleading. We have already discussed the fact that Cohen’s *d*, *r*_{pbd} and *ρ*_{pb} all serve as measures of nonoverlap (section 2.2). Now, we point out that *r*_{pbd} and Cohen’s *d* are two sides of the same coin because relationship and group difference correspond to different coordinate systems for representing fractional variation. Such correspondences are quite useful in exploring statistical dependence in high-dimensional data. Consider a vector (*a*, *b*). Division by the y-component produces the ratio vector, (*α*, 1), with *α* = *a*/*b*. Ratios can be distinguished by their representations as points in the projective line. However, normalization of a ratio vector by the Euclidean length, √(*α*² + 1), produces the unit vector (*α*, 1)/√(*α*² + 1), which is a point in the positive half-circle. Thus, a fractional quantity can be represented as a point in either the projective line or the positive half-circle. Algebraically, the two representations are related by linear fractional transformations. In the terminology of projective geometry, a ratio corresponds to a perspective function, *P*(**u**, *t*) = **u**/*t*, for vector **u** [15]. The scaling invariance property of *α* is represented by the equivalence relation

$$(a, b) \sim (a, b)\,t$$

with *t* ≠ 0. Geometrically, this relation specifies points on the line passing through the origin, (*a*, *b*) and (*α*, 1). The points, (*a*, *b*)*t*, constitute the homogeneous coordinates [24] for the line. The homogeneous coordinates concept shows that there is a natural correspondence between ‘relationship’ and ‘group difference’ effect size. Expressing the Pearson product-moment correlation coefficient as the rescaled covariance [9]

$$r = \frac{\mathrm{Cov}(\mathbf{x}, \mathbf{y})}{\sqrt{\mathrm{Var}(\mathbf{x})\,\mathrm{Var}(\mathbf{y})}},$$
the corresponding projective geometric structure is as summarized in Table 1. Vector representations for *r*_{pb} and *r*_{pbd} are also listed, and a geometric visualization for *r*_{pb} is shown in Fig 3. Consequently, *r*_{pbd}, Cohen’s *d*, and *ρ*_{pb} each possess projective line and positive half-circle representations and serve as measures of group overlap, as described in section 2.2. Therefore, we conclude that the general classification of effect size as a ‘relationship’, ‘group difference’, or ‘group overlap’ index is misleading. We also observe that the question of the merits of Cohen’s *d* versus *r*_{pb} in [3] is complicated by the fact that these measures correspond to points in different spaces, the projective line and the positive half-circle, respectively. The limitations of *r*_{pb} are more easily understood by considering its representation as the vector, (√(*p*_{A}*p*_{B}) *d*, 1). The binomial factor has a confounding effect, particularly since base rates are determined by the experimental protocol. This is analogous to the confounding effect of the marginal sums on the *ϕ* coefficient for a 2 × 2 contingency table (Paper1). Therefore, neither *r*_{pb} nor *ϕ* meets the criterion for a well-behaved effect size of serving to quantify ‘some phenomenon that addresses a question of interest’ [6]. In section 2.5, we give an example where *r*_{pb} gives nonintuitive results in rCART analysis.
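The correspondence between the ratio and unit-vector representations can be sketched in Python. The homogeneous coordinates used below for *r*_{pb}, (√(*p*_{A}*p*_{B}) *d*, 1), and for *r*_{pbd}, (*d*, 2), are assumptions inferred from the standard relation between *r*_{pb} and Cohen’s *d*; the function names are hypothetical.

```python
import math


def perspective(a, b):
    """Perspective function: the ratio alpha = a / b, i.e. the point (alpha, 1)."""
    if b == 0:
        raise ValueError("b must be nonzero")
    return a / b


def half_circle(a, b):
    """Normalize (a, b), b > 0, by its Euclidean length (positive half-circle)."""
    if b <= 0:
        raise ValueError("b must be positive")
    length = math.hypot(a, b)
    return (a / length, b / length)


def ratio_to_unit_component(alpha):
    """Map the ratio alpha to the first component of the unit vector."""
    return alpha / math.sqrt(alpha * alpha + 1.0)


if __name__ == "__main__":
    d, p_a = 0.8, 0.9
    pq = p_a * (1.0 - p_a)

    # Assumed homogeneous coordinates; any positive rescaling gives the same point.
    r_pb = half_circle(math.sqrt(pq) * d, 1.0)[0]
    r_pbd = half_circle(d, 2.0)[0]
    print(f"r_pb={r_pb:.4f}  r_pbd={r_pbd:.4f}")

    # Scaling invariance: (a, b) and (a, b) * t represent the same ratio.
    for t in (1.0, 3.0, 10.0):
        print(t, perspective(d * t, 2.0 * t), half_circle(d * t, 2.0 * t))
```

In this picture Cohen’s *d* and *r*_{pbd} are two coordinates for the same line through the origin, which is why one can be converted into the other without loss of information.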

Fig 3. The point-biserial correlation coefficient, *r*_{pb}, corresponds to a point on the positive half-circle and to a point on the projective line. The homogeneous coordinates for *r*_{pb} correspond to points on the line through the origin. {*p*_{A}, *p*_{B}}: sample size proportions, *d*: Cohen’s *d*.

### 2.5 Point-biserial variation in regression tree analysis

The CART association graph was introduced in Paper1 as a new method for analyzing statistical association in point-biserial data. In this section, we investigate the role of point-biserial variation in rCART, particularly the connection between IG_{MSE} and *r*_{pb}, and introduce the rCART graph as a new method for analyzing association for (**x**, **y**) data. The CART decision tree algorithm creates a decision tree by recursive partitioning of the association between response and independent variables [2, 14]. Each node of the tree corresponds to a binary partition of the range of an independent variable. In standard implementations, the partition parameters for a node are determined by maximizing the information gain (IG) for the response variable in an exhaustive search of associations over all independent variables. The rCART implementation is of particular interest because it involves the analysis of point-biserial variation. In each iteration, the set of statistics obtained for partitions of an independent variable constitutes a CART association graph [2]. For the partition value *x*_{j}, the data for a node (*V*) are divided into two subsets, i.e., *V*_{A} = {(*x*_{i}, *y*_{i})|*x*_{i} ≤ *x*_{j}} and *V*_{B} = {(*x*_{i}, *y*_{i})|*x*_{i} > *x*_{j}}, from which data vectors {**y**_{A}, **y**_{B}} are obtained. Alternatively, if *x*_{j} is categorical, the subsets are specified using matching criteria *V*_{A} = {(*x*_{i}, *y*_{i})|*x*_{i} = *x*_{j}} and *V*_{B} = {(*x*_{i}, *y*_{i})|*x*_{i} ≠ *x*_{j}}. The standard rCART impurity measure is the mean square error for the response, $\mathrm{MSE}(\mathbf{y}) = \frac{1}{N_V}\sum_i (y_i - \bar{y})^2$, where *N*_{V} is the sample size and $\bar{y}$ is the mean [14]. Then, IG is defined as the parent node impurity minus the weighted impurities for the subsets
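$$\mathrm{IG}_{MSE}(\mathbf{y}_A, \mathbf{y}_B) = \mathrm{MSE}(\mathbf{y}) - p_A\,\mathrm{MSE}(\mathbf{y}_A) - p_B\,\mathrm{MSE}(\mathbf{y}_B) \tag{19}$$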
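where *p*_{A} and *p*_{B} are the sample size proportions. Partitioning the sum of squares, MSE(**y**), gives [3, 21]

$$\mathrm{MSE}(\mathbf{y}) = p_A\,\mathrm{MSE}(\mathbf{y}_A) + p_B\,\mathrm{MSE}(\mathbf{y}_B) + p_A p_B\,\delta^2$$

Substitution for MSE(**y**) in Eq 19 gives

$$\mathrm{IG}_{MSE}(\mathbf{y}_A, \mathbf{y}_B) = p_A p_B\,\delta^2 \tag{20}$$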
Thus, IG_{MSE}(**y**_{A}, **y**_{B}) is equivalent to the vector representation of *r*_{pb} with *S*_{p} = 1 (Table 1); IG_{MSE} does not account for the variation in *S*_{p}. To the best of our knowledge, this connection between IG_{MSE} and *r*_{pb} has not been reported previously. We conclude that the analysis of point-biserial variation serves as the basis for rCART, and we use the terms ‘effect size’ and ‘information gain’ interchangeably. The *x*_{j} partition produces subsets with sample sizes *j* and *N*_{V} − *j*. An association graph is obtained by searching over all partitions where the sample size proportions, *p*_{j} and (1 − *p*_{j}), vary over their entire range, producing a large parabolic variation in the *p*_{j}(1 − *p*_{j}) factor. Thus, an association graph is a convenient way to compare the sample size proportion dependence of effect size measures. In the next section, we demonstrate that *r*_{pb} gives misleading results in rCART, while *r*_{pbd} and *ρ*_{pb} produce more intuitive results. However, when the (**x**, **y**) data are highly correlated and Pearson *r*(**x**, **y**) → 1, the rCART graph becomes a horizontal line or nearly so, because *r*_{pbd} ≈ *ρ*_{pb} ≈ 1 for all *x*_{j} partitions. Then, the rCART graph and Pearson *r* are equivalent representations. Thus, CART methodology is most useful when the data are poorly correlated, which includes population studies where system performance is determined by trade-offs between multiple factors. Typical applications include GWAS, and other high-dimensional search problems such as nursing home performance as discussed in the next section.
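The equivalence between the information gain and the point-biserial parameters can be checked numerically. The sketch below assumes population-form (divide by *N*) mean square errors and takes *δ* as the difference between the subset means; the function names are hypothetical, and the exhaustive scan is a simplified stand-in for a single rCART node search over one numeric partition variable.

```python
def mse(y):
    """Population-form mean square error about the mean."""
    m = sum(y) / len(y)
    return sum((v - m) ** 2 for v in y) / len(y)


def ig_mse(y_a, y_b):
    """Parent impurity minus the proportion-weighted subset impurities."""
    p_a = len(y_a) / (len(y_a) + len(y_b))
    p_b = 1.0 - p_a
    return mse(list(y_a) + list(y_b)) - p_a * mse(y_a) - p_b * mse(y_b)


def ig_from_delta(y_a, y_b):
    """The same quantity written as p_A * p_B * delta**2."""
    p_a = len(y_a) / (len(y_a) + len(y_b))
    delta = sum(y_a) / len(y_a) - sum(y_b) / len(y_b)
    return p_a * (1.0 - p_a) * delta ** 2


def association_graph(x, y):
    """Scan all splits x_i <= x_j and return (x_j, p_j, IG_MSE) triples."""
    pairs = sorted(zip(x, y))
    n = len(pairs)
    graph = []
    for j in range(1, n):
        if pairs[j - 1][0] == pairs[j][0]:
            continue  # equal x values cannot be separated by a threshold
        y_a = [v for _, v in pairs[:j]]
        y_b = [v for _, v in pairs[j:]]
        graph.append((pairs[j - 1][0], j / n, ig_mse(y_a, y_b)))
    return graph


if __name__ == "__main__":
    x = [1, 2, 3, 4, 5]
    y = [1, 2, 3, 4, 6]
    for x_j, p_j, ig in association_graph(x, y):
        print(f"x_j={x_j}  p_j={p_j:.1f}  IG_MSE={ig:.3f}")
```

Because the gain reduces to *p*_{A}*p*_{B}*δ*², a split-selection rule based on it favors partitions with balanced proportions unless the difference between means grows quickly enough to compensate.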

## 3 Data analysis and results

In Paper1, we used the publicly accessible Nursing Home Compare (NHC) data [25] in CART analysis to demonstrate the importance of adjusting for the dependence on marginal sums for 2 × 2 contingency tables [2]. In this section, we use a similar NHC data set for a discussion of point-biserial variation and the rCART algorithm. Our objective is to provide a practical demonstration of the limitations of *r*_{pb} due to the confounding effect of unbalanced sample sizes and to compare the behaviors of *r*_{pbd} and *ρ*_{pb}. We also discuss the importance of accounting for three degrees of freedom, (*r*_{pbd}, *μ*, *ρ*_{pb}), and the use of Monte Carlo methods to estimate the joint distribution of statistical parameters.

### 3.1 rCART association graphs for NHC quality measures

NHC data for the fourth quarter of 2018 were retrieved for 20 quality measures (*Q*_{i}) for 15341 nursing homes; detailed descriptions of these continuous variables can be found on the NHC website [26]. A histogram of the nursing home occupancy is shown in Fig 4A. Since performance estimates for nursing homes with low occupancy would be less reliable, a minimum occupancy criterion of at least 50 ‘Average number of residents per day’ was applied to obtain a restricted data set of 11053 nursing homes for further analysis [27]. Pearson correlation coefficients, *r*(*Q*_{i}, *Q*_{j}), and association graphs were calculated for all pairs of quality measures, {(*Q*_{i}, *Q*_{j})|*i* ≠ *j*}. On average, the information gain for the rCART partition is larger when the (*Q*_{i}, *Q*_{j}) variables are highly correlated (Fig 5A); the *r*(*Q*_{i}, *Q*_{j}) correlations are distributed with 95% less than 0.16 and a maximum of 0.65. The distribution for ‘Number of outpatient emergency department visits per 1000 long-stay resident days’ (‘Emergency visits’) versus ‘Number of hospitalizations per 1000 long-stay resident days’ (‘Hospitalizations’) with correlation *r* = 0.37 is skewed, with a long tail towards larger values (Fig 4B). rCART association graphs are shown for the ‘Hospitalizations’ response and ‘Emergency visits’ partition variables (Fig 6A and 6B), and for the reverse, i.e., ‘Emergency visits’ response and ‘Hospitalizations’ partition variables (Fig 6C and 6D). The high correlation between *r*_{pb} and the binomial sampling factor (*r* = 0.99) is typical and indicates that variation in the binomial sampling factor overrides the smaller variation in Cohen’s *d* (Eq 2). We also note that the graphs for *r*_{pb} and IG_{MSE} (not shown) are superimposable, as expected from Eq 20 and because the variation in *S*_{p} is small. Thus, *r*_{pb} and IG_{MSE} mainly correspond to the variation in sample size proportion.
In general, we observe that the association curves for *r*_{pbd} and *ρ*_{pb} can be categorized as monotonically increasing or decreasing, or even U-shaped (concave up), depending on how the (*Q*_{i}, *Q*_{j}) data are distributed. Here, the U-shaped dependence of *r*_{pbd} correlates well with *δ* (*r* = 0.999) and contrasts sharply with the concave down variation for *r*_{pb}. Consequently, *r*_{pb} and *r*_{pbd} produce very different rCART partitions (Table 2). In Fig 6A, the *r*_{pb} partition for the split value, *x*_{j} = 0.8, produces subnodes with comparable sample sizes, *N*_{A} = 5742 and *N*_{B} = 4890 (Table 2). It is useful to view this partition from a statistical perspective. As a first approximation, we expect that the majority of nursing homes belong to a broad distribution for average performance. Then, the *r*_{pb} partition with a split value close to the median, 0.85, is analogous to splitting a normal distribution nearly in half, producing subsets with different mean ‘Emergency visits’ values {0.5, 1.4} that nevertheless correspond to entities with average performance. Thus, *r*_{pb} and IG_{MSE} produce rCART subsets that are not well distinguished from a functional perspective. In comparison, for *r*_{pbd}, there are two possible rCART partitions at either low (*x*_{j} = 0.3) or high (*x*_{j} = 2.5) split values. Each partition produces a large subset corresponding to a broad distribution for average performance and a much smaller subset for either above- or below-average performance. Thus, *r*_{pbd} produces more functionally relevant classifications.
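
The contrast between *r*_{pb} and *r*_{pbd} can be reproduced with a small scan over candidate split values. The sketch below uses synthetic skewed data rather than the NHC measures. It computes Cohen’s *d* with the population-weighted pooled standard deviation and assumes that *r*_{pbd} takes the equal-proportion form *d*/√(*d*² + 4), i.e., *r*_{pb} evaluated at *p*_{A} = *p*_{B} = 1/2. The function name `association_scan` and the data are hypothetical and serve only as an illustration.

```python
import numpy as np

def association_scan(x, y, splits):
    """Scan candidate split values of x and return effect sizes for y.

    Assumptions (not taken from the paper's code): Cohen's d uses the
    population-weighted pooled SD, and r_pbd is the equal-proportion
    form d / sqrt(d**2 + 4).
    """
    rows = []
    for s in splits:
        a, b = y[x <= s], y[x > s]
        if len(a) < 2 or len(b) < 2:
            continue
        p_a, p_b = len(a) / len(y), len(b) / len(y)
        delta = b.mean() - a.mean()
        s_p = np.sqrt(p_a * a.var() + p_b * b.var())  # pooled SD (ddof=0)
        d = delta / s_p
        f = np.sqrt(p_a * p_b)                         # binomial sampling factor
        r_pb = d * f / np.sqrt(1.0 + (d * f) ** 2)
        r_pbd = d / np.sqrt(d ** 2 + 4.0)
        rows.append((s, p_a, delta, d, r_pb, r_pbd))
    return np.array(rows)

rng = np.random.default_rng(0)
x = rng.gamma(shape=2.0, scale=0.5, size=5000)            # skewed partition variable
y = 0.4 * x + rng.gamma(shape=2.0, scale=0.3, size=5000)  # skewed response
scan = association_scan(x, y, np.quantile(x, np.linspace(0.05, 0.95, 19)))
for s, p_a, delta, d, r_pb, r_pbd in scan:
    print(f"split={s:5.2f} p_A={p_a:4.2f} delta={delta:5.2f} "
          f"d={d:5.2f} r_pb={r_pb:5.3f} r_pbd={r_pbd:5.3f}")
```

With these definitions, *r*_{pb} computed from *d* and the sampling factor agrees with the Pearson correlation between the response and the binary split label, and the two coefficients coincide when the split is at the median (*p*_{A} = *p*_{B}).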

Fig 4. A. Histogram of ‘Average number of residents per day’ for 15341 nursing homes. B. Two-dimensional Gaussian kernel density estimate of the distribution of ‘Number of outpatient emergency department visits per 1000 long-stay resident days’ (‘Emergency visits’) versus ‘Number of hospitalizations per 1000 long-stay resident days’ (‘Hospitalizations’), with correlation *r* = 0.37.

Fig 5. These graphs display data obtained from association graphs for 380 pairs of quality measures, {(*Q*_{i}, *Q*_{j})|*i* ≠ *j*}. A. *r*_{pbd} effect size for the rCART split versus correlation *r*(*Q*_{i}, *Q*_{j}). On average, the largest information gain is obtained when the response and partition variables are highly correlated. B. Correlation *r*(*r*_{pbd}, *ρ*_{pb}) between the effect sizes versus *r*(*Q*_{i}, *Q*_{j}) for the association graphs. There is good correlation between *r*_{pbd} and *ρ*_{pb} in many cases, but there are exceptions.

Fig 6. A,B: ‘Hospitalizations’ response versus ‘Emergency visits’ partition variables, with correlation *r*(*r*_{pbd}, *ρ*_{pb}) = 0.93. C,D: ‘Emergency visits’ response versus ‘Hospitalizations’ partition variables, with correlation *r*(*r*_{pbd}, *ρ*_{pb}) = 0.49. Bar plot histograms are shown for ‘Emergency visits’ (B inset) and ‘Hospitalizations’ (D inset). *r*_{pb}: point-biserial correlation coefficient, {*p*_{A}, *p*_{B}}: sample size proportions, *r*_{pbd}: sample size corrected correlation coefficient, *ρ*_{pb}: nonoverlap proportion, (*δ*, *μ*): center of mass parameters.

The importance of accounting for variation in both degrees of freedom, (*r*_{pbd}, *μ*), is illustrated in Fig 6B and 6D. Here, *μ* is monotonically increasing, and one of the *r*_{pbd} partitions might be preferred depending on *μ*. However, this requires an assessment of the cost-benefit trade-offs for (*r*_{pbd}, *μ*) variation, which will depend on the particular application. A close correspondence between *r*_{pbd} and *ρ*_{pb} is observed in many cases, with *r*(*r*_{pbd}, *ρ*_{pb}) ≥ 0.8 in 68% of the association graphs (Fig 5B), but there are many cases where they differ depending on how the (*Q*_{i}, *Q*_{j}) data are skewed. Fig 6C shows an example of the difference between the *ρ*_{pb} and *r*_{pbd} curves with *r*(*r*_{pbd}, *ρ*_{pb}) = 0.49. The *r*_{pbd} partition for the lower split value might be preferred because it is associated with higher *ρ*_{pb}, depending on how the cost-benefit trade-off is assessed for (*r*_{pbd}, *ρ*_{pb}) variation. Consequently, three coordinates (*r*_{pbd}, *μ*, *ρ*_{pb}) are needed to distinguish different forms of point-biserial variation. These observations provide support for previous remarks stating that interpreting the magnitude of an effect size as a measure of substantive significance depends on the particular application [6, 18]. A more precise approach would take into account the multidimensional nature of point-biserial variation and involve the specification of functional or engineering requirements for a relevant vector basis. Then, an analysis of the effect size for the system response could involve separate thresholds for each coordinate. The ability to account for all relevant degrees of freedom is also important in assessing reproducibility. A one-parameter representation using an effect size such as *r*_{pbd} or Cohen’s *d* gives an incomplete picture and leads to ambiguous results because of the loss of information.
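
One way to make separate thresholds concrete is to treat each candidate partition as a point in (*r*_{pbd}, *μ*, *ρ*_{pb}) coordinates and test every coordinate against its own requirement. The sketch below is illustrative only: the class `EffectPoint`, the function `meets_requirements`, the direction of each inequality, and all numeric values are hypothetical and do not come from the NHC analysis.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EffectPoint:
    r_pbd: float   # sample size corrected correlation
    mu: float      # center of mass coordinate
    rho_pb: float  # nonoverlap proportion

def meets_requirements(p, min_r_pbd, max_mu, min_rho_pb):
    """Return True only if every coordinate satisfies its own threshold."""
    return p.r_pbd >= min_r_pbd and p.mu <= max_mu and p.rho_pb >= min_rho_pb

# Two hypothetical partitions with the same r_pbd but different mu and rho_pb.
low_split = EffectPoint(r_pbd=0.40, mu=0.9, rho_pb=0.55)
high_split = EffectPoint(r_pbd=0.40, mu=2.1, rho_pb=0.30)

for name, p in [("low", low_split), ("high", high_split)]:
    ok = meets_requirements(p, min_r_pbd=0.3, max_mu=1.5, min_rho_pb=0.5)
    print(name, ok)
```

A comparison based on *r*_{pbd} alone cannot separate these two partitions, whereas the coordinate-wise check does.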

### 3.2 Distributed effects in point-biserial variation

The reproducibility of nursing home performance data depends on stochastic effects in the measurement of patient outcome. Then, the observed data are associated with a distribution of data sets and corresponding distributions of the statistical parameters and effect size. The specification of this distribution of data sets must be based on a realistic assessment of all sources of error and uncertainty to form an error model for the data. Then, the determination of the distribution for the effect size requires propagation of the error in the data. For fractional quantities such as Cohen’s *d* and *r*_{pbd}, it is necessary to account for stochastic effects in both the numerator and denominator. However, analytical methods for estimating distributions for ratios [28, 29], proportions [30, 31], and correlation coefficients [32] are complicated by fractional transformation, a bounded range, and discreteness. Thus, iterative procedures are needed for the analysis of noncentral effect size distributions and for estimating confidence intervals for deviations above and below the effect size estimate [5, 18]. Alternatively, Monte Carlo (MC) methods [2, 33, 34] provide a more practical approach to estimating the distribution for the effect size. In an MC simulation, the error model specifies error parameters for each observed value in the original data. Then, a point-biserial MC data set is obtained by random sampling to produce MC instances for **y**_{A} and **y**_{B}. The MC sampling process is repeated many times to obtain a collection of MC data sets that forms an estimate of the distribution of data sets. Statistical parameters are calculated for the data sets in this collection to obtain estimates of distributions and histograms for point-biserial effects. Many MC runs are performed to obtain a set of such estimates, which allows the determination of the degree of convergence for the MC simulation. However, the information needed to construct an error model is not included in the NHC quality measures data.

For this demonstration, we use a rudimentary ‘Emergency visits’ error model, where *σ*_{i} = *y*_{i}/5. MC simulations for (*r*_{pbd}, *μ*) and (*r*_{pbd}, *ρ*_{pb}) for ‘Emergency visits’ response with ‘Hospitalizations’ rCART split value, 3.3 (Table 2), are shown in Fig 7. The discrete structure of the *ρ*_{pb} distribution is due to stochastic effects in the **c**_{y} sorting. The separate confidence intervals in Fig 6 for positive and negative deviation from the observed effect size estimate were estimated from the MC distributions. In practical applications, the advantage of the MC method is that it allows detailed simulation of the data acquisition process, including heterogeneity within groups, and specifications for the error model can include heteroscedasticity, measurement error, and misclassification [17, 35, 36].
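
The MC procedure can be sketched in a few lines. The example below applies the error model *σ*_{i} = *y*_{i}/5 to synthetic data, assumes normally distributed errors (the form of the error distribution is an assumption made here), and assumes that *r*_{pbd} takes the equal-proportion form *d*/√(*d*² + 4). The functions `r_pbd` and `mc_r_pbd`, the data, and the reduced numbers of runs and samples are hypothetical stand-ins chosen to keep the example fast.

```python
import numpy as np

def r_pbd(y_a, y_b):
    """Assumed equal-proportion form d / sqrt(d**2 + 4), pooled SD with ddof=0."""
    n_a, n_b = len(y_a), len(y_b)
    s_p = np.sqrt((n_a * y_a.var() + n_b * y_b.var()) / (n_a + n_b))
    d = (y_b.mean() - y_a.mean()) / s_p
    return d / np.sqrt(d ** 2 + 4.0)

def mc_r_pbd(y_a, y_b, n_samples, rng, rel_sigma=0.2):
    """Draw MC data sets with sigma_i = rel_sigma * y_i; return r_pbd for each."""
    out = np.empty(n_samples)
    for k in range(n_samples):
        mc_a = rng.normal(y_a, rel_sigma * y_a)  # one MC instance of y_A
        mc_b = rng.normal(y_b, rel_sigma * y_b)  # one MC instance of y_B
        out[k] = r_pbd(mc_a, mc_b)
    return out

rng = np.random.default_rng(1)
y_a = rng.gamma(2.0, 0.4, size=800) + 0.05  # synthetic, strictly positive
y_b = rng.gamma(2.0, 0.6, size=200) + 0.05
observed = r_pbd(y_a, y_b)

# Several independent MC runs indicate how well the simulation has converged.
runs = [mc_r_pbd(y_a, y_b, n_samples=400, rng=rng) for _ in range(5)]
run_means = np.array([r.mean() for r in runs])
lo, hi = np.percentile(np.concatenate(runs), [2.5, 97.5])

print(f"observed r_pbd = {observed:.3f}")
print(f"deviation below / above: {observed - lo:.3f} / {hi - observed:.3f}")
print(f"spread of run means = {run_means.std():.4f}")
```

Because the percentiles are taken directly from the MC distribution, the deviations above and below the observed value need not be equal. In this simplified sketch, adding noise to the observed values also inflates the within-group variance, so the MC distribution is shifted slightly toward zero relative to the observed *r*_{pbd}; a realistic error model would need to account for this.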

Fig 7. 2D histograms of MC distributions for (*r*_{pbd}, *μ*) (A) and (*r*_{pbd}, *ρ*_{pb}) (B) for ‘Emergency visits’ response with ‘Hospitalizations’ rCART split value, 3.3 (Table 2). The 1*σ* error bars for the *r*_{pbd} histogram (A inset) serve as an indication of convergence for the simulation; the mean for the normal curve corresponds to the observed *r*_{pbd} value, 0.398. *r*_{pbd}: sample size corrected correlation, *ρ*_{pb}: nonoverlap proportion, *μ*: center of mass parameter, number of MC runs: 25, samples per MC run: 4000.

## 4 Discussion

In this work, we use sort as an intrinsic property of both numbers and labels to generate a complete set of parameters for point-biserial variation, **v**_{pb}. We demonstrate that Cohen’s *d* is associated with the center of mass representation for a two-component system of normal distributions. However, a parameterization can also be constructed for skewed distributions. We do not attempt to incorporate requirements for ‘substantive significance’ because this depends on the particular application, which might require different or additional parameters. The specification of performance criteria for all of the parameters in **v**_{pb} is also required. The (*δ*, *μ*) effect size representation does not generalize because there is no standard center of mass parameterization for a multicomponent system. However, this does not constitute a fundamental limitation in the application of effect size for high-dimensional data analytics. Instead, the (*δ*, *μ*) coordinates serve as a minimal framework for analyzing dependency using exploratory methodologies such as rCART. CART methodology is useful in population studies where the performance or system response is distributed due to complex interactions. Then, a decision tree for identifying outperforming individuals can help in the determination of predictive criteria for improved performance, and the construction of a functional model. We also demonstrate the use of replication as a nonparametric method for equalizing sample sizes in the estimation of *ρ*_{pb}. This replication protocol can be used in other classification algorithms where adjustment for unbalanced sample size is needed. We also demonstrate that the Monte Carlo method is a practical way to estimate the distribution of a fractional statistical quantity from the detailed specification of an error model for the data. Then, the assessment of substantive significance must take into account the distribution in effect size parameters. 
We conclude that a better understanding of the applied algebraic foundations and an improved methodology are important for the application of effect size in data analytics.

## Acknowledgments

I thank many former colleagues in the Genetic Discovery group at DuPont for stimulating my interest in statistical problems in genome-wide association studies and CART.

## References

- 1. Beló A, Luck SD. Association Mapping for the Exploration of Genetic Diversity and Identification of Useful Loci for Plant Breeding. In: Meksem K, Kahl G, editors. The Handbook of Plant Mutation Screening. Weinheim, Germany: Wiley-VCH Verlag GmbH & Co. KGaA; 2010. p. 231–246. Available from: http://onlinelibrary.wiley.com/doi/10.1002/9783527629398.ch14/summary; http://doi.wiley.com/10.1002/9783527629398.ch14.
- 2. Luck S. Factoring a 2 x 2 contingency table. PLOS ONE. 2019;14(10):e0224460. pmid:31652283
- 3. McGrath RE, Meyer GJ. When effect sizes disagree: The case of r and d. Psychological Methods. 2006;11(4):386–401. pmid:17154753
- 4. Grissom RJ, Kim JJ. Effect Sizes for Research. 2nd ed. New York, NY: Routledge; 2011.
- 5. Cumming G. Understanding The New Statistics. New York, NY: Routledge; 2012.
- 6. Kelley K, Preacher KJ. On effect size. Psychological Methods. 2012;17(2):137–152.
- 7. Huberty CJ. A History of Effect Size Indices. Educational and Psychological Measurement. 2002;62(2):227–240.
- 8. Pastore M, Calcagnì A. Measuring Distribution Similarities Between Samples: A Distribution-Free Overlapping Index. Frontiers in Psychology. 2019;10:1089.
- 9. Lee Rodgers J, Nicewander WA. Thirteen Ways to Look at the Correlation Coefficient. The American Statistician. 1988;42(1):59–66.
- 10. Gradstein M. Maximal Correlation between Normal and Dichotomous Variables. Journal of Educational Statistics. 1986;11(4):259–261.
- 11. Chambers RG. Correlation coefficients from 2 x 2 tables and from biserial data. British Journal of Mathematical and Statistical Psychology. 1982;35(2):216–227.
- 12. Cheng Y, Liu H. A short note on the maximal point-biserial correlation under non-normality. British Journal of Mathematical and Statistical Psychology. 2016;69(3):344–351.
- 13. Johnson JM, Khoshgoftaar TM. Survey on deep learning with class imbalance. Journal of Big Data. 2019;6(1):27.
- 14. Krzywinski M, Altman N. Classification and regression trees. Nature Methods. 2017;14(8):757–758.
- 15. Boyd SP, Vandenberghe L. Convex optimization. New York, NY: Cambridge University Press; 2004.
- 16. Schäfer T, Schwarz MA. The Meaningfulness of Effect Sizes in Psychological Research: Differences Between Sub-Disciplines and the Impact of Potential Biases. Frontiers in Psychology. 2019;10(APR):813.
- 17. Nakagawa S, Cuthill IC. Effect size, confidence interval and statistical significance: a practical guide for biologists. Biological Reviews. 2007;82(4):591–605.
- 18. Fritz CO, Morris PE, Richler JJ. Effect size estimates: Current use, calculations, and interpretation. Journal of Experimental Psychology: General. 2012;141(1):2–18.
- 19. Logan JD. Applied Mathematics. 2nd ed. New York, NY: John Wiley & Sons, Inc.; 1997.
- 20. Richardson JTE. Measures of effect size. Behavior Research Methods, Instruments, & Computers. 1996;28(1):12–22.
- 21. Casella G, Berger R. Statistical Inference. 2nd ed. Pacific Grove, CA: Duxbury; 2002.
- 22. Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning. Springer Series in Statistics. New York, NY: Springer New York; 2009.
- 23. de Ville B. Decision trees. Wiley Interdisciplinary Reviews: Computational Statistics. 2013;5(6):448–455.
- 24. Ghali S. Introduction to Geometric Computing. London: Springer London; 2008.
- 25. Nursing Home Compare datasets; 2020. Available from: https://data.medicare.gov/data/nursing-home-compare.
- 26. NHC Quality Measures; 2020. Available from: https://www.medicare.gov/NursingHomeCompare/About/nhcinformation.html.
- 27. Luck S. Data for the paper “Nonoverlap proportion and point-biserial variation”; 2020. Available from: https://doi.org/10.6084/m9.figshare.11591334.v2.
- 28. Marsaglia G. Ratios of Normal Variables. Journal of Statistical Software. 2006;16(4):1–10.
- 29. von Luxburg U, Franz VH. A Geometric Approach to Confidence Sets for Ratios: Fieller’s Theorem, Generalizations, and Bootstrap. Statistica Sinica. 2009;19:1095–1117.
- 30. Newcombe RG. Interval estimation for the difference between independent proportions: comparison of eleven methods. Statistics in Medicine. 1998;17(8):873–890.
- 31. Agresti A. Dealing with discreteness: making ‘exact’ confidence intervals for proportions, differences of proportions, and odds ratios more exact. Statistical Methods in Medical Research. 2003;12(1):3–21.
- 32. Bishara AJ, Hittner JB. Reducing Bias and Error in the Correlation Coefficient Due to Nonnormality. Educational and Psychological Measurement. 2015;75(5):785–804.
- 33. Bevington PR, Robinson DK. Data Reduction and Error Analysis for the Physical Sciences. 3rd ed. New York, NY: McGraw-Hill; 2003.
- 34. Kroese DP, Brereton T, Taimre T, Botev ZI. Why the Monte Carlo method is so important today. Wiley Interdisciplinary Reviews: Computational Statistics. 2014;6(6):386–392.
- 35. Höfler M. The effect of misclassification on the estimation of association: a review. International Journal of Methods in Psychiatric Research. 2005;14(2):92–101.
- 36. Buonaccorsi JP. Measurement error: models, methods, and applications. Boca Raton: Chapman and Hall/CRC; 2010.