
Nonoverlap proportion and the representation of point-biserial variation

Abstract

We consider the problem of constructing a complete set of parameters that account for all of the degrees of freedom for point-biserial variation. We devise an algorithm where sort, as an intrinsic property of both numbers and labels, is used to generate the parameters. Algebraically, point-biserial variation is represented by a Cartesian product of statistical parameters for two sets of data, and the difference between mean values (δ) corresponds to the representation of variation in the center of mass coordinates, (δ, μ). The existence of alternative effect size measures is explained by the fact that mathematical considerations alone do not specify a preferred coordinate system for the representation of point-biserial variation. We develop a novel algorithm for estimating the nonoverlap proportion (ρpb) of two sets of data. ρpb is obtained by sorting the labeled data and analyzing the induced order in the categorical data using a diagonally symmetric 2 × 2 contingency table. We examine the correspondence between ρpb and point-biserial correlation (rpb) for uniform and normal distributions. We identify the homogeneous-coordinate (vector), half-circle, and projective-line representations for Pearson product-moment correlation, Cohen’s d, and rpb. We compare the performance of rpb versus ρpb and the sample size proportion corrected correlation (rpbd), confirm that invariance with respect to the sample size proportion is important in the formulation of the effect size, and give an example where three parameters (rpbd, μ, ρpb) are needed to distinguish different forms of point-biserial variation in CART regression tree analysis. We discuss the importance of providing an assessment of cost-benefit trade-offs between relevant system parameters because ‘substantive significance’ is specified by mapping functional or engineering requirements into the effect size coordinates. Distributions and confidence intervals for the statistical parameters are obtained using Monte Carlo methods.

1 Introduction

This work began when we noticed that results from classification and regression tree (CART) analyses did not correspond well with statistical associations in genome-wide association studies (GWAS) [1]. Then, we discovered the extensive research literature discussing confounding properties of the effect size measures used in our analyses. Statistical components of our bioinformatics system came from open source software packages that are widely used for research. In data analysis, there are two important requirements for obtaining reproducible results. First, statistics methodology is subject to the general physical principle that it is necessary to account for all of the degrees of freedom when studying a quantitative phenomenon. Second, analysis protocols must correct for dependence on data acquisition parameters, including unbalanced sample sizes, in order to obtain interpretable results for effect size. Our work on proportional variation and the phi coefficient for 2 × 2 contingency tables was recently published in this journal; we refer to this as Paper1 [2]. There, we demonstrate that odds ratio or relative risk, as standalone effect size measures, do not account for all of the degrees of freedom and are therefore subject to ambiguity. Using matrix factorization for the marginal sums, we identified the four alternative forms of proportional variation which serve as the basis for specifying the effect size. There is also an elementary discussion of projective geometry for fractional variation that might be helpful to the reader. Here, we study similar problems in the formulation of effect size for point-biserial variation and the associated correlation coefficient, rpb. First, the term ‘point-biserial’ comes from psychology statistics, and we explain its use as a general reference for the two-group data analysis problem. The difference between mean values for two sets of data, δ, serves as the basis for specifying effect size for system response to perturbation. Statistically, analysis of δ corresponds to measuring the relation or association between a continuous variable and a binary categorical variable obtained by individually labeling the data. The standard procedure is to replace the labels with numeric {0, 1} indicators. The Pearson product moment correlation coefficient (r) calculated from these numeric data is known as the point-biserial correlation coefficient (rpb) [3]. This connection between rpb and δ explains our use of the term ‘point-biserial’. It is standard terminology in the effect size literature. We provide a short discussion of the literature which gave us much inspiration, and note that there are several books on effect size methods as well [4, 5]. In their discussion of physical principles in the formulation of effect size, Kelley & Preacher recommend that an effect size should serve as a sample size independent estimate of a system parameter [6]. The existence of alternative effect size measures, and their classification as relationship, group difference, and group overlap, is discussed by Huberty [7]. A recently proposed group overlap measure is nonparametric but requires the use of kernel density estimators to produce an approximate representation of the unknown densities [8]. McGrath and Meyer give a nice review of research into the limitations of rpb, and point out that different measures can “lead to different conclusions about the size or importance” of an effect [3].
Various researchers have already noted that there are two complications that can limit the range of rpb. The first difficulty arises from the definition of rpb, which requires the {0, 1} representation to allow the calculation of r. The {0, 1} representation corresponds to binary groupings of the data, comprising a pair of many-to-one mappings. The latter are incompatible with r as a measure of the degree to which two variables are linearly related [9] and raise questions about the interpretation of rpb. It has been shown that when the {yA, yB} data are obtained by a dichotomy of a normal distribution, rpb has a maximum value of 0.79 [3, 10]. In contrast, when each set corresponds to a normal distribution, rpb still ranges from −1.0 to 1.0 [11, 12], with the proviso that the extremal values are reached in the limit as |δ| approaches infinity. Secondly, rpb is subject to confounding from unbalanced sample sizes for the {yA, yB} data; in the effect size literature, the sample size proportions are usually referred to as ‘base rates’. Then, variation in the sampling proportions between data sets leads to irreproducibility, which complicates the interpretation of rpb. The machine learning community has rediscovered the problems associated with unbalanced sample sizes, creating the new term “class imbalance” [13].

It is accepted practice to report a single effect size such as Cohen’s d as the basis for deciding the outcome of an experiment. However, d is associated with an implicit parameterization that does not account for all of the degrees of freedom for point-biserial variation, which results in ambiguity. Consequently, our objective is to construct a computational framework for a complete parameterization of the variation (vpb). We use an inductive approach based on connections between rpb, Cohen’s d, and the mean squared error information gain (IGMSE). These measures play an important role because of their connections with elementary statistical concepts. We show that Cohen’s d is a perspective function of the center of mass coordinates, (δ, μ), for the mean value vector, (ȳA, ȳB). We also identify a novel association measure, ρpb, which measures the degree of nonoverlap between two sets of data. ρpb is calculated directly from the data and is therefore nonparametric because the underlying densities are unspecified. A particular goal is to examine the dependence of rpb on unbalanced sample sizes because of concerns about the effect on reproducibility. We address other problems as well, including the use of Monte Carlo methods to estimate the joint distribution for statistical parameters. As in Paper1, we use CART association graphs to compare the performance of various effect size measures. However, in this work we are particularly interested in the case where the target variable is a quantitative variable, which corresponds to the regression tree implementation (rCART) [14]. We show that ρpb and the sample size proportion corrected correlation (rpbd) serve as effect size measures for rCART while avoiding complications associated with rpb. The main novel contributions of this work are as follows: 1) a computational model for generating statistical parameters for point-biserial variation, vpb, which corresponds to the Cartesian product of parameters for two sets of data, and identification of the fact that pure mathematics alone is not sufficient to specify a preferred effect size, 2) a sorting algorithm to estimate the nonoverlap proportion, ρpb, of two sets of data using a diagonally symmetric 2 × 2 contingency table, 3) identification of the homogeneous-coordinate (vector), half-circle, and projective-line representations for Pearson correlation, 4) demonstration of the equivalence between rpb and IGMSE, and 5) demonstration of the importance of adjusting for unbalanced sample sizes in impurity measures in rCART analysis.

2 Methods

The specification of a complete set of parameters for point-biserial variation, vpb, is a prerequisite for the rigorous formulation of effect size. Then, a measure for effect size is associated with a perspective function of vpb. We begin with an examination of the limitations of rpb in section 2.1. Then, we use an inductive approach to construct an algebraic framework for point-biserial variation in sections 2.2–2.5.

2.1 The effect of unbalanced sample sizes on rpb

The derivation and limitations of rpb are reviewed by McGrath and Meyer [3]. Two sets, yA and yB, are combined to form a set of paired values, {(ci, yi)}, where ci is a group membership label, and the {(ci, yi)} data correspond to the vectors, (c, y). The standard practice is to invoke a numeric {0, 1} representation for c to obtain an indicator vector, Ic. Then, application of the Pearson product-moment formula produces the point-biserial correlation coefficient [3]

(1) rpb = (ȳA − ȳB) √(pA pB) / Sy

(2) = d √(pA pB) / √(1 + d² pA pB)

where Sy is the standard deviation of the combined y data, pA = NA/(NA + NB) and pB = 1 − pA are sample size proportions, Cohen’s d is defined as

(3) d = (ȳA − ȳB) / Sp

and the pooled variance is the weighted average of the sample variances, Sp² = pA SA² + pB SB². Thus, |rpb| approaches unity as |d| → ∞ [11, 12] for 0 < pA < 1. Rearranging Eq 2, we obtain the quadratic relation

(4) pA (1 − pA) d² (1 − rpb²) = rpb²

For a fixed value of rpb, there is a range of (d, pA) values (Fig 1). Alternatively, the variation in (rpb, pA) for fixed d becomes a source of irreproducibility in rpb because pA can vary between experiments depending on the data acquisition protocol. This ambiguity explains why researchers have expressed concern about the confounding effect of unbalanced sample sizes on rpb, and effect size in general [3, 6]. Furthermore, the binomial pA pB dependence originates from the covariance

(5) Cov(Ic, y) = E[Ic y] − E[Ic] E[y]

(6) = pA pB (ȳA − ȳB)

and variance, Var(Ic) = pA pB. Therefore, the criticism about pA pB dependence applies more broadly to the use of the numeric {0, 1} indicator variable. Various researchers have already recommended that the proportions should be equalized, pA = pB = 1/2, in Eq 2 to give [3]

(7) rpbd = d / √(d² + 4)

This ‘attenuation-corrected’ coefficient is denoted as rc in [4]. The rpb and rpbd curves in Fig 2 provide an illustration of this correction. The one-to-one projective relation between rpbd and Cohen’s d is discussed in section 2.4, and the application of rpbd in rCART is discussed in section 2.5.
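
As a concrete illustration of Eqs 1–3 and 7, the following minimal Python sketch (NumPy only; the function name and the simulated data, which echo the Fig 2 sample sizes, are our own illustration and not part of any published software) computes Cohen’s d, rpb, and rpbd directly from two samples.

import numpy as np

def point_biserial_stats(yA, yB):
    # Cohen's d, r_pb, and the proportion-corrected r_pbd (cf. Eqs 1-3 and 7)
    yA, yB = np.asarray(yA, float), np.asarray(yB, float)
    nA, nB = len(yA), len(yB)
    pA = nA / (nA + nB)
    pB = 1.0 - pA
    delta = yA.mean() - yB.mean()
    Sp2 = pA * yA.var() + pB * yB.var()          # pooled variance: weighted average of group variances
    d = delta / np.sqrt(Sp2)
    r_pb = d * np.sqrt(pA * pB) / np.sqrt(1.0 + d**2 * pA * pB)
    r_pbd = d / np.sqrt(d**2 + 4.0)              # proportions equalized, pA = pB = 1/2
    return d, r_pb, r_pbd

# Unbalanced samples (NA : NB = 15000 : 500) attenuate r_pb relative to r_pbd
rng = np.random.default_rng(0)
yA = rng.normal(1.0, 1.0, 15000)
yB = rng.normal(0.0, 1.0, 500)
print(point_biserial_stats(yA, yB))              # d is near 1, r_pb roughly 0.17, r_pbd roughly 0.45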

Fig 1. Quadratic dependence of the point-biserial correlation coefficient, rpb.

For the fixed value rpb = 0.2, there is a range for Cohen’s d and the sample size proportion, pA. This ambiguity complicates the interpretation of rpb as an effect size measure.

https://doi.org/10.1371/journal.pone.0244517.g001

Fig 2. Nonoverlap proportion and point-biserial correlation.

Theoretical curves and estimated values for point-biserial correlation, rpb, nonoverlap proportion, ρpb, and sample size adjusted correlation, rpbd, for simulated data with unequal sample sizes (NA : NB = 15000 : 500), plotted as a function of the difference between mean values, δ. Compared to rpbd, rpb is attenuated due to the confounding effect of the binomial sampling factor. A: Uniform unit width distributions. B: Standard normal (σ = 1) distributions.

https://doi.org/10.1371/journal.pone.0244517.g002

2.2 Statistical parameters for point-biserial variation

In this section, we consider the question of how to generate a set of parameters for statistical variation in point-biserial data. The fact that rpb is subject to confounding effects suggests that replacing categorical labels with {0, 1} numeric values is an improper procedure, because the labels acquire arithmetic properties in an ad-hoc way. Instead, we propose a new framework where sort is used as an intrinsic property of both numbers and labels. Suppose there is a machine which generates numbers with labels, (ci, yi), in no particular order, placing them in a data table to produce a point-biserial data set. Then, the table can be sorted using either c or y, to obtain orderings denoted as yc and cy, respectively. As we discuss next, these orderings are associated with statistical parameters, vc and vy, respectively. However, there is no rule that specifies which parameterization, vc or vy, might be preferred. Therefore, we make the following proposition,

Proposition 1. Point-biserial variation is parameterized by the Cartesian product of statistical parameters for the yc and cy orderings,

(8) vpb = vc × vy

The yc ordering corresponds to sorting the y data into two sets, yc ↔ {yA, yB}. Then, the statistical parameters for the two sets are associated with a two-component Cartesian product structure, yielding the familiar effect size measures, Cohen’s d and rpb, as discussed in section 2.3. The cy ordering is associated with a new nonoverlap measure, ρpb. The two types of y-sort, ascending or descending, produce orderings where either {(ci, yi) | yi ≤ yi+1} or {(ci, yi) | yi ≥ yi+1}, respectively. Then, the c-column corresponds to a y-ordered string, cy. The induced order from the y-sorting is reflected in the degree of mixing of As and Bs in cy. Next, we sort the data with respect to c, obtaining a maximally ordered string, cM, where the As and Bs are completely separated. cM corresponds to the condition where yA and yB are disjoint in y, which has been characterized as “perfect correlation” [11]. Our cy-sorting algorithm requires equal sample sizes, NA = NB. When the sample sizes are unequal, a preprocessing step is required. Suppose NB < NA. Then, the yB data are replicated to create a new data set, yBrep, such that NBrep = NA. If the difference in sample size is small, 0 < NA − NB < NB, then a subset of yB uniformly spaced by rank is replicated. The yBrep and yA data are combined to obtain the (cy, cM) strings. They constitute a set of joint observations for two categorical variables, which are summarized in a diagonally symmetric 2 × 2 contingency table of the form [[a, b], [b, a]]. The symmetric form results from the equal sample size condition, which requires that the rows and columns each sum to NA. Then, the nonoverlap proportion is given by the difference in proportions

(9) ρpb = pa − pb

where pa = a/(a + b), and pb = 1 − pa. When yA and yB are disjoint, |ρpb| = 1. The sign of ρpb is arbitrary because the order of the columns (or rows) of the 2 × 2 table depends on the direction of the sort in y or cM. In our implementation, the sign is chosen to be consistent with Cohen’s d. The ρpb values in Fig 2 were obtained using this sort algorithm. The overlap between uniform unit width distributions is an important pedagogical case because the expressions for Cohen’s d, rpbd, and ρpb take a simple form. Geometrically, the overlap (θU) is given by a rectangle with area θU = 1 − δ for the difference between mean values, with 0 ≤ δ ≤ 1, and θU = 0 for δ > 1. The nonoverlap is given by ρpbU = 1 − θU = δ, with 0 ≤ δ ≤ 1. Similarly,

(10) dU = δ/Sp = √12 δ

(11) rpbdU = dU/√(dU² + 4) = √3 δ/√(3δ² + 1)

For the overlap of standard normal (σ = 1) distributions, we obtain

(12) θN = 2Φ(−δ/2)

(13) ρpbN = 1 − θN = 2Φ(δ/2) − 1

(14) rpbdN = δ/√(δ² + 4)

where Φ is the cumulative normal distribution function [8]. In Fig 2, we observe that at a large enough δ, rpbd is attenuated compared to ρpb, as expected [11]. However, for small δ, the inequality is reversed, i.e., rpbd > ρpb. Nevertheless, there is close correspondence between rpbd and ρpb for both the uniform and normal distributions. This is particularly true for highly correlated data, where both rpbd and ρpb are near 1 and are therefore equivalent. However, in section 3 we demonstrate that when the data are not well correlated, both rpbd and ρpb are needed in order to distinguish different forms of point-biserial variation. We conclude that rpbd and Cohen’s d serve as measures of the nonoverlap of distributions but are not necessarily equivalent to ρpb.
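
The following Python sketch implements the cy-sorting estimator as we read the description above; the rank-uniform replication step, the stable-sort tie handling, and the function name are our own simplifying choices rather than a reproduction of the published implementation.

import numpy as np

def nonoverlap_proportion(yA, yB):
    # Estimate rho_pb from the (c_y, c_M) strings and the diagonally symmetric 2 x 2 table
    yA, yB = np.asarray(yA, float), np.asarray(yB, float)
    sign = 1.0 if yA.mean() >= yB.mean() else -1.0              # sign chosen to agree with Cohen's d
    y_large, y_small = np.sort(yA), np.sort(yB)
    if len(y_small) > len(y_large):
        y_large, y_small = y_small, y_large
    n, n_small = len(y_large), len(y_small)
    if n_small < n:                                             # preprocessing: equalize sample sizes
        idx = np.floor(np.arange(n) * n_small / n).astype(int)
        y_small = y_small[idx]                                  # replicate, uniformly spaced by rank
    y = np.concatenate([y_large, y_small])
    c_M = np.concatenate([np.zeros(n, int), np.ones(n, int)])   # maximally ordered string
    c_y = c_M[np.argsort(y, kind="stable")]                     # labels induced by the y-sort
    a = int(np.sum((c_y == 0) & (c_M == 0)))                    # table [[a, b], [b, a]]; rows and columns sum to n
    b = n - a
    return sign * abs(a - b) / n                                # rho_pb = p_a - p_b, up to the sign convention

# Disjoint sets give |rho_pb| = 1; well-mixed sets give values near 0
print(nonoverlap_proportion([1, 2, 3, 4], [5, 6, 7]))           # -1.0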

2.3 Coordinates for a two-component system of distributed effects

In this section, we discuss the fact that d and ρpb are only two elements of a minimal set of parameters for representing point-biserial variation. The one-to-one correspondence, d ↔ rpbd, will be discussed in section 2.4. Algebraically, vc corresponds to the Cartesian product of statistical parameters for two sets of data, vc = vA × vB. Introducing the center of mass parameter, μ = (ȳA + ȳB)/2, the mean values vector is expressed as

(15) (ȳA, ȳB) = μ (1, 1) + (δ/2) (1, −1)

(16) δ = ȳA − ȳB

where (1, 1) and (1, −1) comprise the center of mass basis. We note that the generalization for a weighted average is straightforward. A similar decomposition holds for variances

(17) (SA², SB²) = S̄² (1, 1) + (ΔS²/2) (1, −1)

where S̄² = (SA² + SB²)/2 and ΔS² = SA² − SB². A further reduction is obtained if the variances are homoscedastic, SA² = SB², yielding ΔS² = 0, and S̄² = Sp². Finally, we obtain

(18) vpb = (μ, δ, Sp, ρpb)

as a minimal set of parameters for point-biserial variation. However, we observe that vpb is not unique because functions of the components, {fi(vpb,i)}, including linear fractional transformations, can be introduced to obtain alternative representations. Mathematics alone is not sufficient to specify a preferred vector basis, which explains why there are alternative effect size measures [6, 7]. Furthermore, rpb and Cohen’s d correspond to perspective functions [15] of vpb and do not account for all of the degrees of freedom. Consequently, the practice of using one of these measures to serve as a one-parameter summary of experimental results will be subject to irreproducibility.
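
A short numerical sketch of the center of mass decomposition (Python/NumPy; the function name and the equal-weight pooling are our own choices, and the returned tuple follows our reading of the minimal parameter set discussed above):

import numpy as np

def center_of_mass_coordinates(yA, yB):
    # Decompose the group means into (mu, delta) and pool the variances (cf. Eqs 15-18)
    yA, yB = np.asarray(yA, float), np.asarray(yB, float)
    mA, mB = yA.mean(), yB.mean()
    mu = 0.5 * (mA + mB)                         # center of mass of the group means
    delta = mA - mB                              # difference between mean values
    Sp = np.sqrt(0.5 * (yA.var() + yB.var()))    # equal-weight pooling (homoscedastic case)
    # reconstruction of (mA, mB) in the (1, 1), (1, -1) center of mass basis
    assert np.allclose([mA, mB], mu * np.array([1.0, 1.0]) + 0.5 * delta * np.array([1.0, -1.0]))
    return mu, delta, Sp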

The term ‘substantive significance’ has been used to refer to the magnitude of an effect that would be regarded as practically important in a given application [6]. Suppose functional or engineering requirements are expressed in terms of a vector, h, of system parameters. Then, the utility of an effect would be specified as a mapping, u(h). The specification of u(h) would account for differences in cost-benefit trade-offs for variation in the {hi} components. The substantive significance for the effect size would be determined by the mapping, u(h) → u(vpb). Without this information, it is difficult to reach a consensus on the merits of an effect size. This explains the criticism of Cohen’s thresholds for small, medium, and large effects as “somewhat arbitrary” [16] and suggestions that the significance of the magnitude of an effect size depends on the research question [3, 17, 18].

A fundamental limitation arises from the fact that the (δ, μ) center of mass decomposition does not extend to higher dimensions in a straightforward way. Consider the group means vector for three sets, i.e., (ȳA, ȳB, ȳC). The default center of mass parameter is defined as μ = (ȳA + ȳB + ȳC)/3. However, there is no standard procedure for choosing the two additional deviation parameters needed to specify a complete basis. Consequently, the formulation of an effect size measure for multiple group variation is not a well-posed problem, i.e., there is no unique solution [19]. This explains why Cohen’s d does not generalize to schemes involving more than two groups [20] and provides support for previous recommendations to break down ‘complicated hypotheses’, p. 526 [21], and ‘reduce any multiple-level or multiple-variable relationship’ into a set of two-variable effect size relationships [17]. This provides the raison d'être for the development of exploratory methodologies such as CART in high-dimensional data analytics [22, 23].

2.4 Homogeneous coordinates for Pearson correlation

In the effect size literature, it is accepted practice to distinguish three different types of effect size measure, ‘relationship’, ‘group difference’, and ‘group overlap’ [3, 7]. In this section, we discuss the fact that this classification is misleading. We have already discussed the fact that Cohen’s d, rpbd, and ρpb all serve as measures of nonoverlap (section 2.2). Now, we point out that rpbd and Cohen’s d are two sides of the same coin because relationship and group difference correspond to different coordinate systems for representing fractional variation. Such correspondences are quite useful in exploring statistical dependence in high-dimensional data. Consider a vector (a, b). Division by the y-component produces the ratio vector, (α, 1), where α = a/b. Ratios can be distinguished by their representations as points in the projective line. However, normalization of a ratio vector by the Euclidean length, √(a² + b²), produces the unit vector, (a, b)/√(a² + b²), which is a point in the positive half-circle. Thus, a fractional quantity can be represented as a point in either the projective line or the positive half-circle. Algebraically, the two representations are related by linear fractional transformations. In the terminology of projective geometry, a ratio corresponds to a perspective function, P(u, t) = u/t, for vector u [15]. The scaling invariance property of α is represented by the equivalence relation (a, b) ∼ (a, b)t with t ≠ 0. Geometrically, this relation specifies points on the line passing through the origin, (a, b), and (α, 1). The points, (a, b)t, constitute the homogeneous coordinates [24] for the line. The homogeneous coordinates concept shows that there is a natural correspondence between ‘relationship’ and ‘group difference’ effect size. Expressing the Pearson product-moment correlation coefficient as the rescaled covariance, r = Cov(x, y)/(Sx Sy) [9], the corresponding projective geometric structure is summarized in Table 1. Vector representations for rpb and rpbd are also listed, and a geometric visualization for rpb is shown in Fig 3. Consequently, rpbd, Cohen’s d, and ρpb each possess projective-line and half-circle representations and serve as measures of group overlap, as described in section 2.2. Therefore, we conclude that the general classification of effect size as a ‘relationship’, ‘group difference’, or ‘group overlap’ index is misleading. We also observe that the question of the merits of Cohen’s d versus rpb in [3] is complicated by the fact that these measures correspond to points in different spaces, the projective line and the positive half-circle, respectively. The limitations of rpb are more easily understood by considering its vector representation in Table 1. The binomial factor has a confounding effect, particularly since base rates are determined by the experimental protocol. This is analogous to the confounding effect of the marginal sums on the ϕ coefficient for a 2 × 2 contingency table (Paper1). Therefore, neither rpb nor ϕ meets the criterion for a well-behaved effect size of serving to quantify ‘some phenomenon that addresses a question of interest’ [6]. In section 2.5, we give an example where rpb gives nonintuitive results in rCART analysis.
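
The one-to-one correspondence between the group difference coordinate d and the bounded coordinate rpbd (Eq 7) is a simple example of the linear fractional relation between the projective-line and half-circle representations. A minimal Python sketch (the function names are ours, introduced only for illustration):

import numpy as np

def d_to_rpbd(d):
    # ratio coordinate (projective line) -> bounded coordinate (half-circle), Eq 7
    return d / np.sqrt(d**2 + 4.0)

def rpbd_to_d(r):
    # inverse map, one-to-one for |r| < 1
    return 2.0 * r / np.sqrt(1.0 - r**2)

d = np.array([0.2, 0.5, 1.0, 2.0, 5.0])
assert np.allclose(rpbd_to_d(d_to_rpbd(d)), d)   # both coordinates carry the same information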

Fig 3. Projective spaces for the representation of point-biserial correlation.

The point-biserial correlation coefficient, rpb, corresponds to a point on the positive half-circle and a point on the projective line. The homogeneous coordinates for rpb correspond to points on the line through the origin. {pA, pB}: sample size proportions, d: Cohen’s d.

https://doi.org/10.1371/journal.pone.0244517.g003

Table 1. Homogeneous coordinates for Pearson correlation.

https://doi.org/10.1371/journal.pone.0244517.t001

2.5 Point-biserial variation in regression tree analysis

The CART association graph was introduced in Paper1 as a new method for analyzing statistical association in point-biserial data. In this section, we investigate the role of point-biserial variation in rCART, particularly the connection between IGMSE and rpb, and introduce the rCART graph as a new method for analyzing association for (x, y) data. The CART algorithm creates a decision tree by recursive partitioning of the association between response and independent variables [2, 14]. Each node of the tree corresponds to a binary partition of the range of an independent variable. In standard implementations, the partition parameters for a node are determined by maximizing the information gain (IG) for the response variable in an exhaustive search of associations over all independent variables. The rCART implementation is of particular interest because it involves the analysis of point-biserial variation. In each iteration, the set of statistics obtained for partitions of an independent variable constitutes a CART association graph [2]. For the partition value xj, the data for a node (V) are divided into two subsets, i.e., VA = {(xi, yi) | xi ≤ xj} and VB = {(xi, yi) | xi > xj}, from which data vectors {yA, yB} are obtained. Alternatively, if xj is categorical, the subsets are specified using the matching criteria VA = {(xi, yi) | xi = xj} and VB = {(xi, yi) | xi ≠ xj}. The standard rCART impurity measure is the mean square error for the response, MSE(y) = (1/NV) Σi (yi − ȳV)², where NV is the sample size and ȳV is the mean [14]. Then, IG is defined as the parent node impurity minus the weighted impurities for the subsets

(19) IGMSE(yA, yB) = MSE(y) − pA MSE(yA) − pB MSE(yB)

where pA and pB are the sample size proportions. Partitioning the sum of squares, MSE(y), gives [3, 21]

MSE(y) = pA MSE(yA) + pB MSE(yB) + pA pB (ȳA − ȳB)²

Substitution for MSE(y) in Eq 19 gives

(20) IGMSE(yA, yB) = pA pB (ȳA − ȳB)² = pA pB d² Sp²

Thus, IGMSE(yA, yB) is equivalent to r²pb with Sp = 1 (Table 1); IGMSE does not account for the variation in Sp. To the best of our knowledge, this connection between IGMSE and rpb has not been reported previously. We conclude that the analysis of point-biserial variation serves as the basis for rCART, and we use the terms ‘effect size’ and ‘information gain’ interchangeably. The xj partition produces subsets with sample sizes, j and NV − j, for j = 1, …, NV − 1. An association graph is obtained by searching over all partitions where the sample size proportions, pj and (1 − pj), vary over their entire range, producing a large parabolic variation in the pj(1 − pj) factor. Thus, an association graph is a convenient way to compare the sample size proportion dependence of effect size measures. In the next section, we demonstrate that rpb gives misleading results in rCART, while rpbd and ρpb produce more intuitive results. However, when the (x, y) data are highly correlated and Pearson r(x, y) → 1, the rCART graph becomes a horizontal line or nearly so, because rpbd ≈ ρpb ≈ 1 for all xj partitions. Then, the rCART graph and Pearson r are equivalent representations. Thus, CART methodology is most useful when the data are poorly correlated, which includes population studies where system performance is determined by trade-offs between multiple factors. Typical applications include GWAS, and other high-dimensional search problems such as nursing home performance as discussed in the next section.
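
The identity behind Eqs 19 and 20 is easy to verify numerically. The brute-force Python sketch below (the function name and the simulated step-effect data are ours; production CART code uses incremental updates rather than recomputing each split) evaluates IGMSE at every split point of a single independent variable, which is the raw material of an rCART association graph.

import numpy as np

def association_graph(x, y):
    # IG_MSE for every split of x, computed two ways: Eq 19 and the closed form of Eq 20
    x, y = np.asarray(x, float), np.asarray(y, float)
    order = np.argsort(x)
    x, y = x[order], y[order]
    n = len(y)
    rows = []
    for j in range(1, n):                                  # split between positions j-1 and j
        yA, yB = y[:j], y[j:]
        pA, pB = j / n, 1.0 - j / n
        ig = y.var() - (pA * yA.var() + pB * yB.var())     # Eq 19
        ig_direct = pA * pB * (yA.mean() - yB.mean())**2   # Eq 20
        rows.append((x[j - 1], ig, ig_direct))
    return np.array(rows)

rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, 400)
y = 2.0 * (x > 0.7) + rng.normal(0.0, 1.0, 400)            # a step effect at x = 0.7
graph = association_graph(x, y)
assert np.allclose(graph[:, 1], graph[:, 2])               # the two forms of IG_MSE agree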

3 Data analysis and results

In Paper1, we used the publicly accessible Nursing Home Compare (NHC) data [25] in CART analysis to demonstrate the importance of adjusting for the dependence on marginal sums for 2 × 2 contingency tables [2]. In this section, we use a similar NHC data set for a discussion of point-biserial variation and the rCART algorithm. Our objective is to provide a practical demonstration of the limitations of rpb due to the confounding effect of unbalanced sample sizes and to compare the behaviors of rpbd and ρpb. We also discuss the importance of accounting for three degrees of freedom, (rpbd, μ, ρpb), and the use of Monte Carlo methods to estimate the joint distribution of statistical parameters.

3.1 rCART association graphs for NHC quality measures

NHC data for the fourth quarter of 2018 were retrieved for 20 quality measures (Qi) for 15341 nursing homes; detailed descriptions of these continuous variables can be found on the NHC website [26]. A histogram of the nursing home occupancy is shown in Fig 4A. Since performance estimates for nursing homes with low occupancy would be less reliable, a minimum occupancy criterion of at least 50 ‘Average number of residents per day’ was applied to obtain a restricted data set of 11053 nursing homes for further analysis [27]. Pearson correlation coefficients, r(Qi, Qj), and association graphs were calculated for all pairs of quality measures, {(Qi, Qj) | i ≠ j}. On average, the information gain for the rCART partition is larger when the (Qi, Qj) variables are highly correlated (Fig 5A); the r(Qi, Qj) correlations are distributed with 95% of values less than 0.16 and a maximum of 0.65. The distribution for ‘Number of outpatient emergency department visits per 1000 long-stay resident days’ (‘Emergency visits’) versus ‘Number of hospitalizations per 1000 long-stay resident days’ (‘Hospitalizations’) with correlation r = 0.37 is skewed, with a long tail towards larger values (Fig 4B). rCART association graphs are shown for the ‘Hospitalizations’ response and ‘Emergency visits’ partition variables (Fig 6A and 6B), and for the reverse, i.e., ‘Emergency visits’ response and ‘Hospitalizations’ partition variables (Fig 6C and 6D). The high correlation between rpb and √(pA pB) (r = 0.99) is typical and indicates that variation in the binomial sampling factor overrides the smaller variation in Cohen’s d (Eq 2). We also note that the graphs for rpb and IGMSE (not shown) are superimposable, as expected from Eq 20 and because the variation in Sp is small. Thus, rpb and IGMSE mainly correspond to the variation in sample size proportion. In general, we observe that the association curves for rpbd and ρpb can be categorized as monotonically increasing or decreasing, or even U-shaped (concave up), depending on how the (Qi, Qj) data are distributed. Here, the U-shaped dependence of rpbd correlates well with δ (r = 0.999) and contrasts sharply with the concave down variation for rpb. Consequently, rpb and rpbd produce very different rCART partitions (Table 2). In Fig 6A, the rpb partition for the split value, xj = 0.8, produces subnodes with comparable sample sizes, NA = 5742 and NB = 4890 (Table 2). It is useful to view this partition from a statistical perspective. As a first approximation, we expect that the majority of nursing homes belong to a broad distribution for average performance. Then, the rpb partition with a split value close to the median, 0.85, is analogous to splitting a normal distribution nearly in half, producing subsets with different mean ‘Emergency visits’ values {0.5, 1.4} that nevertheless correspond to entities with average performance. Thus, rpb and IGMSE produce rCART subsets that are not well distinguished from a functional perspective. In comparison, for rpbd, there are two possible rCART partitions at either low (xj = 0.3) or high (xj = 2.5) split values. Each partition produces a large subset corresponding to a broad distribution for average performance and a much smaller subset for either above- or below-average performance. Thus, rpbd produces more functionally relevant classifications.

Fig 4. Skewed distributions for NHC quality measures.

A. Histogram of ‘Average number of residents per day’ for 15341 nursing homes. B. Two-dimensional Gaussian kernel density estimate of the distribution of ‘Number of outpatient emergency department visits per 1000 long-stay resident days’ (‘Emergency visits’) versus ‘Number of hospitalizations per 1000 long-stay resident days’ (‘Hospitalizations’), with correlation r = 0.37.

https://doi.org/10.1371/journal.pone.0244517.g004

Fig 5. The relation between rpbd and ρpb in rCART.

These graphs display data obtained from association graphs for 380 pairs of quality measures, {(Qi, Qj) | i ≠ j}. A. rpbd effect size for the rCART split versus correlation r(Qi, Qj). On average, the largest information gain is obtained when the response and partition variables are highly correlated. B. Correlation between the effect size curves, r(rpbd, ρpb), versus r(Qi, Qj) for the association graphs. There is good correlation between rpbd and ρpb in many cases, but there are exceptions.

https://doi.org/10.1371/journal.pone.0244517.g005

Fig 6. rCART association graphs for effect size.

A,B: ‘Hospitalizations’ response versus ‘Emergency visits’ partition variables, with correlation r(rpbd, ρpb) = 0.93. C,D: ‘Emergency visits’ response versus ‘Hospitalizations’ partition variables, with correlation r(rpbd, ρpb) = 0.49. Bar plot histograms are shown for ‘Emergency visits’ (B inset) and ‘Hospitalizations’ (D inset). rpb: point-biserial correlation coefficient, {pA, pB}: sample size proportions, rpbd: sample size corrected correlation coefficient, ρpb: nonoverlap proportion, (δ, μ): center of mass parameters.

https://doi.org/10.1371/journal.pone.0244517.g006

The importance of accounting for variation in both degrees of freedom, (rpbd, μ), is illustrated in Fig 6B and 6D. Here, μ is monotonically increasing, and one of the rpbd partitions might be preferred depending on μ. However, this requires an assessment of the cost-benefit trade-offs for (rpbd, μ) variation, which will depend on the particular application. A close correspondence between rpbd and ρpb is observed in many cases, with r(rpbd, ρpb) ≥ 0.8 in 68% of the association graphs (Fig 5B), but there are many cases where they differ depending on how the (Qi, Qj) data are skewed. Fig 6C shows an example of the difference between the ρpb and rpbd curves with r(rpbd, ρpb) = 0.49. The rpbd partition for the lower split value might be preferred because it is associated with higher ρpb, depending on how the cost-benefit trade-off is assessed for (rpbd, ρpb) variation. Consequently, three coordinates (rpbd, μ, ρpb) are needed to distinguish different forms of point-biserial variation. These observations provide support for previous remarks stating that interpreting the magnitude of an effect size as a measure of substantive significance depends on the particular application [6, 18]. A more precise approach would take into account the multidimensional nature of point-biserial variation and involve the specification of functional or engineering requirements for a relevant vector basis. Then, an analysis of the effect size for the system response could involve separate thresholds for each coordinate. The ability to account for all relevant degrees of freedom is also important in assessing reproducibility. A one-parameter representation using an effect size such as rpbd or Cohen’s d gives an incomplete picture and leads to ambiguous results because of the loss of information.

3.2 Distributed effects in point-biserial variation

The reproducibility of nursing home performance data depends on stochastic effects in the measurement of patient outcome. Then, the observed data are associated with a distribution of data sets and corresponding distributions of the statistical parameters and effect size. The specification of this distribution must be based on a realistic assessment of all sources of error and uncertainty, which together form an error model for the data. Then, the determination of the distribution for the effect size requires propagation of the error in the data. For fractional quantities such as Cohen’s d and rpbd, it is necessary to account for stochastic effects in both the numerator and denominator. However, analytical methods for estimating distributions for ratios [28, 29], proportions [30, 31], and correlation coefficients [32] are complicated by fractional transformation, a bounded range, and discreteness. Thus, iterative procedures are needed for the analysis of noncentral effect size distributions and for estimating confidence intervals for deviations above and below the effect size estimate [5, 18]. Alternatively, Monte Carlo (MC) methods [2, 33, 34] provide a more practical approach to estimating the distribution for the effect size. In an MC simulation, the error model specifies error parameters for each observed value in the original data. Then, a point-biserial MC data set is obtained by random sampling to produce MC instances of yA and yB. The MC sampling process is repeated many times to obtain a collection of MC data sets, which forms an estimate of the data set distribution. Statistical parameters are calculated for the data sets in this collection to obtain estimates of distributions and histograms for point-biserial effects. Many MC runs are performed to obtain a set of such collections, which allows the determination of the degree of convergence for the MC simulation. However, the information needed to construct an error model is not included in the NHC quality measures data. For this demonstration, we provided a rudimentary ‘Emergency visits’ error model, where σi = yi/5. MC simulations for (rpbd, μ) and (rpbd, ρpb) for the ‘Emergency visits’ response with ‘Hospitalizations’ rCART split value, 3.3 (Table 2), are shown in Fig 7. The discrete structure of the ρpb distribution is due to stochastic effects in the cy sorting. The separate confidence intervals in Fig 6 for positive and negative deviation from the observed effect size estimate were estimated from the MC distributions. In practical applications, the advantage of the MC method is that it allows detailed simulation of the data acquisition process, including heterogeneity within groups, and specifications for the error model can include heteroscedasticity, measurement error, and misclassification [17, 35, 36].
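
The following Python sketch outlines the MC procedure described above under the rudimentary error model σi = yi/5 used in this section; the function name, the choice of a normal per-observation error distribution, and the lognormal demonstration data are our own assumptions, and the default run sizes echo those quoted in Fig 7.

import numpy as np

def monte_carlo_distribution(yA, yB, statistic, n_samples=4000, n_runs=25, seed=0):
    # Propagate a per-observation error model (sigma_i = y_i / 5) through a two-group statistic;
    # returns one array of statistic values per MC run, so convergence across runs can be checked
    rng = np.random.default_rng(seed)
    yA, yB = np.asarray(yA, float), np.asarray(yB, float)
    runs = []
    for _ in range(n_runs):
        vals = []
        for _ in range(n_samples):
            yA_mc = rng.normal(yA, np.abs(yA) / 5.0)       # resample every observed value
            yB_mc = rng.normal(yB, np.abs(yB) / 5.0)
            vals.append(statistic(yA_mc, yB_mc))
        runs.append(np.array(vals))
    return runs

# Example: MC distribution of the difference between mean values, delta;
# any of the effect size estimators sketched earlier could be passed in as `statistic`
rng = np.random.default_rng(2)
yA, yB = rng.lognormal(0.3, 0.5, 300), rng.lognormal(0.0, 0.5, 300)
runs = monte_carlo_distribution(yA, yB, lambda a, b: a.mean() - b.mean(), n_samples=500, n_runs=5)
lo, hi = np.percentile(np.concatenate(runs), [2.5, 97.5])  # asymmetric deviations around the estimate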

Fig 7. Monte Carlo simulation of the distribution of stochastic effects for point-biserial variation.

2D histograms of MC distributions for (rpbd, μ) (A) and (rpbd, ρpb) (B) for the ‘Emergency visits’ response with ‘Hospitalizations’ rCART split value, 3.3 (Table 2). The 1σ error bars for the rpbd histogram (A inset) serve as an indication of convergence for the simulation; the mean for the normal curve corresponds to the observed rpbd value, 0.398. rpbd: sample size corrected correlation, ρpb: nonoverlap proportion, μ: center of mass parameter, number of MC runs: 25, samples per MC run: 4000.

https://doi.org/10.1371/journal.pone.0244517.g007

4 Discussion

In this work, we use sort as an intrinsic property of both numbers and labels to generate a complete set of parameters for point-biserial variation, vpb. We demonstrate that Cohen’s d is associated with the center of mass representation for a two-component system of normal distributions. However, a parameterization can also be constructed for skewed distributions. We do not attempt to incorporate requirements for ‘substantive significance’ because this depends on the particular application, which might require different or additional parameters. The specification of performance criteria for all of the parameters in vpb is also required. The (δ, μ) effect size representation does not generalize because there is no standard center of mass parameterization for a multicomponent system. However, this does not constitute a fundamental limitation in the application of effect size for high-dimensional data analytics. Instead, the (δ, μ) coordinates serve as a minimal framework for analyzing dependency using exploratory methodologies such as rCART. CART methodology is useful in population studies where the performance or system response is distributed due to complex interactions. Then, a decision tree for identifying outperforming individuals can help in the determination of predictive criteria for improved performance, and the construction of a functional model. We also demonstrate the use of replication as a nonparametric method for equalizing sample sizes in the estimation of ρpb. This replication protocol can be used in other classification algorithms where adjustment for unbalanced sample size is needed. We also demonstrate that the Monte Carlo method is a practical way to estimate the distribution of a fractional statistical quantity from the detailed specification of an error model for the data. Then, the assessment of substantive significance must take into account the distribution in effect size parameters. We conclude that a better understanding of the applied algebraic foundations and an improved methodology are important for the application of effect size in data analytics.

Acknowledgments

I thank many former colleagues in the Genetic Discovery group at DuPont for stimulating my interest in statistical problems in genome-wide association studies and CART.

References

  1. Beló A, Luck SD. Association Mapping for the Exploration of Genetic Diversity and Identification of Useful Loci for Plant Breeding. In: Meksem K, Kahl G, editors. The Handbook of Plant Mutation Screening. Weinheim, Germany: Wiley-VCH Verlag GmbH & Co. KGaA; 2010. p. 231–246. Available from: http://onlinelibrary.wiley.com/doi/10.1002/9783527629398.ch14/summary; http://doi.wiley.com/10.1002/9783527629398.ch14.
  2. Luck S. Factoring a 2 x 2 contingency table. PLOS ONE. 2019;14(10):e0224460. pmid:31652283
  3. McGrath RE, Meyer GJ. When effect sizes disagree: The case of r and d. Psychological Methods. 2006;11(4):386–401. pmid:17154753
  4. Grissom RJ, Kim JJ. Effect Sizes for Research. 2nd ed. New York, NY: Routledge; 2011.
  5. Cumming G. Understanding The New Statistics. New York, NY: Routledge; 2012.
  6. Kelley K, Preacher KJ. On effect size. Psychological Methods. 2012;17(2):137–152.
  7. Huberty CJ. A History of Effect Size Indices. Educational and Psychological Measurement. 2002;62(2):227–240.
  8. Pastore M, Calcagnì A. Measuring Distribution Similarities Between Samples: A Distribution-Free Overlapping Index. Frontiers in Psychology. 2019;10:1089.
  9. Lee Rodgers J, Nicewander WA. Thirteen Ways to Look at the Correlation Coefficient. The American Statistician. 1988;42(1):59–66.
  10. Gradstein M. Maximal Correlation between Normal and Dichotomous Variables. Journal of Educational Statistics. 1986;11(4):259–261.
  11. Chambers RG. Correlation coefficients from 2 x 2 tables and from biserial data. British Journal of Mathematical and Statistical Psychology. 1982;35(2):216–227.
  12. Cheng Y, Liu H. A short note on the maximal point-biserial correlation under non-normality. British Journal of Mathematical and Statistical Psychology. 2016;69(3):344–351.
  13. Johnson JM, Khoshgoftaar TM. Survey on deep learning with class imbalance. Journal of Big Data. 2019;6(1):27.
  14. Krzywinski M, Altman N. Classification and regression trees. Nature Methods. 2017;14(8):757–758.
  15. Boyd SP, Vandenberghe L. Convex optimization. New York, NY: Cambridge University Press; 2004.
  16. Schäfer T, Schwarz MA. The Meaningfulness of Effect Sizes in Psychological Research: Differences Between Sub-Disciplines and the Impact of Potential Biases. Frontiers in Psychology. 2019;10:813.
  17. Nakagawa S, Cuthill IC. Effect size, confidence interval and statistical significance: a practical guide for biologists. Biological Reviews. 2007;82(4):591–605.
  18. Fritz CO, Morris PE, Richler JJ. Effect size estimates: Current use, calculations, and interpretation. Journal of Experimental Psychology: General. 2012;141(1):2–18.
  19. Logan JD. Applied Mathematics. 2nd ed. New York, NY: John Wiley & Sons, Inc.; 1997.
  20. Richardson JTE. Measures of effect size. Behavior Research Methods, Instruments, & Computers. 1996;28(1):12–22.
  21. Casella G, Berger R. Statistical Inference. 2nd ed. Pacific Grove, CA: Duxbury; 2002.
  22. Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning. Springer Series in Statistics. New York, NY: Springer New York; 2009.
  23. de Ville B. Decision trees. Wiley Interdisciplinary Reviews: Computational Statistics. 2013;5(6):448–455.
  24. Ghali S. Introduction to Geometric Computing. London: Springer London; 2008.
  25. Nursing Home Compare datasets; 2020. Available from: https://data.medicare.gov/data/nursing-home-compare.
  26. NHC Quality Measures; 2020. Available from: https://www.medicare.gov/NursingHomeCompare/About/nhcinformation.html.
  27. Luck S. Data for the paper “Nonoverlap proportion and point-biserial variation”; 2020. Available from: https://doi.org/10.6084/m9.figshare.11591334.v2.
  28. Marsaglia G. Ratios of Normal Variables. Journal of Statistical Software. 2006;16(4):1–10.
  29. von Luxburg U, Franz VH. A Geometric Approach to Confidence Sets for Ratios: Fieller’s Theorem, Generalizations, and Bootstrap. Statistica Sinica. 2009;19:1095–1117.
  30. Newcombe RG. Interval estimation for the difference between independent proportions: comparison of eleven methods. Statistics in Medicine. 1998;17(8):873–890.
  31. Agresti A. Dealing with discreteness: making ‘exact’ confidence intervals for proportions, differences of proportions, and odds ratios more exact. Statistical Methods in Medical Research. 2003;12(1):3–21.
  32. Bishara AJ, Hittner JB. Reducing Bias and Error in the Correlation Coefficient Due to Nonnormality. Educational and Psychological Measurement. 2015;75(5):785–804.
  33. Bevington PR, Robinson DK. Data Reduction and Error Analysis for the Physical Sciences. 3rd ed. New York, NY: McGraw-Hill; 2003.
  34. Kroese DP, Brereton T, Taimre T, Botev ZI. Why the Monte Carlo method is so important today. Wiley Interdisciplinary Reviews: Computational Statistics. 2014;6(6):386–392.
  35. Höfler M. The effect of misclassification on the estimation of association: a review. International Journal of Methods in Psychiatric Research. 2005;14(2):92–101.
  36. Buonaccorsi JP. Measurement error: models, methods, and applications. Boca Raton: Chapman and Hall/CRC; 2010.