Nonoverlap proportion and the representation of point-biserial variation

doi:10.1371/journal.pone.0244517

Fig 1.

Quadratic dependence of the point-biserial correlation coefficient, r_pb.

A fixed value of r_pb = 0.2 is consistent with a range of values for Cohen’s d and the sample size proportion, p_A. This ambiguity complicates the interpretation of r_pb as an effect size measure.
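
The ambiguity described in this caption can be illustrated numerically. The sketch below assumes the standard large-sample identity r_pb = d / sqrt(d^2 + 1/(p_A p_B)), with p_B = 1 - p_A; the article's own expression may include finite-sample factors that are not reproduced here. The function names are hypothetical and chosen for illustration only.

```python
import math


def r_pb_from_d(d, p_a):
    """Point-biserial r implied by Cohen's d and group proportion p_a.

    Assumes the large-sample identity r_pb = d / sqrt(d**2 + 1 / (p_a * p_b)).
    """
    p_b = 1.0 - p_a
    return d / math.sqrt(d ** 2 + 1.0 / (p_a * p_b))


def d_from_r_pb(r_pb, p_a):
    """Invert the identity above: Cohen's d for a given r_pb and p_a."""
    p_b = 1.0 - p_a
    return r_pb / math.sqrt(p_a * p_b * (1.0 - r_pb ** 2))


# The same r_pb = 0.2 maps to different values of d as p_A changes.
for p_a in (0.5, 0.3, 0.1, 0.03):
    print(f"p_A = {p_a:4.2f}   d = {d_from_r_pb(0.2, p_a):.3f}")
```

Under this assumption, the squared form r_pb^2 / (1 - r_pb^2) = p_A p_B d^2 makes the quadratic dependence on d and on the proportions explicit: the more unbalanced the groups, the larger the d needed to reach the same r_pb.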

Fig 2.

Nonoverlap proportion and point-biserial correlation.

Theoretical curves and estimated values for point-biserial correlation, r_pb, nonoverlap proportion, ρ_pb, and sample size adjusted correlation, r_pbd, for simulated data with unequal sample sizes (N_A : N_B = 15000 : 500) and the difference between mean values. Compared to r_pbd, r_pb is attenuated due to the confounding effect of the binomial sampling factor. A: Uniform unit width distributions. B: Standard normal (σ = 1) distributions.
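
The attenuation of r_pb under unequal sample sizes can be reproduced with a small simulation. The sketch below is an illustration under stated assumptions, not the article's simulation: the mean difference `delta = 1.0` is a hypothetical value, and the balanced-design value d / sqrt(d^2 + 4) is used as a stand-in for a sample size adjusted correlation; whether it coincides with the article's r_pbd is an assumption. The nonoverlap proportion ρ_pb is not computed.

```python
import math
import random
import statistics

random.seed(0)

# Group sizes from the caption; delta is a hypothetical mean difference.
n_a, n_b, delta = 15000, 500, 1.0

# Standard normal groups (sigma = 1), as in panel B.
a = [random.gauss(0.0, 1.0) for _ in range(n_a)]
b = [random.gauss(delta, 1.0) for _ in range(n_b)]

mean_a, mean_b = statistics.fmean(a), statistics.fmean(b)

# Cohen's d with the pooled sample standard deviation.
pooled_var = ((n_a - 1) * statistics.variance(a)
              + (n_b - 1) * statistics.variance(b)) / (n_a + n_b - 2)
d = (mean_b - mean_a) / math.sqrt(pooled_var)

# Point-biserial r: Pearson correlation between the 0/1 group label and
# the response, written in its closed form for a binary variable.
n = n_a + n_b
p_a, p_b = n_a / n, n_b / n
sd_all = statistics.pstdev(a + b)
r_pb = (mean_b - mean_a) * math.sqrt(p_a * p_b) / sd_all

# Large-sample identity linking r_pb to d and the proportions.
r_from_d = d / math.sqrt(d ** 2 + 1.0 / (p_a * p_b))

# Value r_pb would take for equal group sizes (p_a = p_b = 0.5).
r_balanced = d / math.sqrt(d ** 2 + 4.0)

print(f"d = {d:.3f}")
print(f"r_pb (observed)      = {r_pb:.3f}")
print(f"r_pb (from d, p_A)   = {r_from_d:.3f}")
print(f"balanced-design r    = {r_balanced:.3f}")
```

With proportions this unbalanced, the factor 1/(p_A p_B) is much larger than its balanced value of 4, so the observed r_pb falls well below the balanced-design value for the same d.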

Fig 3.

Projective spaces for the representation of point-biserial correlation.

The point-biserial correlation coefficient, r_pb, corresponds to a point on the positive half-circle and a point on the projective line. The homogeneous coordinates correspond to points on the line through the origin. {p_A, p_B}: sample size proportions, d: Cohen’s d.

Table 1.

Homogeneous coordinates for Pearson correlation.

Fig 4.

Skewed distributions for NHC quality measures.

A. Histogram of ‘Average number of residents per day’ for 15341 nursing homes. B. Two-dimensional Gaussian kernel density estimate of the distribution of ‘Number of outpatient emergency department visits per 1000 long-stay resident days’ (‘Emergency visits’) versus ‘Number of hospitalizations per 1000 long-stay resident days’ (‘Hospitalizations’), with correlation r = 0.37.

Fig 5.

The relation between r_pbd and ρ_pb in rCART.

These graphs display data obtained from association graphs for 380 pairs of quality measures, {(Q_i, Q_j)|i ≠ j}. A. r_pbd effect size for rCART split versus correlation r(Q_i, Q_j). On average, the largest information gain is obtained when the response and partition variables are highly correlated. B. Correlation r(r_pbd, ρ_pb) between effect size and r(Q_i, Q_j) for association graphs. There is good correlation between r_pbd and ρ_pb in many cases, but there are exceptions.

Fig 6.

rCART association graphs for effect size.

A,B: ‘Hospitalizations’ response versus ‘Emergency visits’ partition variables, with correlation r(r_pbd, ρ_pb) = 0.93. C,D: ‘Emergency visits’ response versus ‘Hospitalizations’ partition variables, with correlation r(r_pbd, ρ_pb) = 0.49. Bar plot histograms are shown for ‘Emergency visits’ (B inset) and ‘Hospitalizations’ (D inset). r_pb: point-biserial correlation coefficient, {p_A, p_B}: sample size proportions, r_pbd: sample size corrected correlation coefficient, ρ_pb: nonoverlap proportion, (δ, μ): center of mass parameters.

Table 2.

rCART subnode parameters.

Fig 7.

Monte Carlo simulation of the distribution of stochastic effects for point-biserial variation.

2D histograms of MC distributions for (r_pbd, μ) (A) and (r_pbd, ρ_pb) (B) for ‘Emergency visits’ response with ‘Hospitalizations’ rCART split value, 3.3 (Table 2). The 1σ error bars for the r_pbd histogram (A inset) serve as an indication of convergence for the simulation; the mean for the normal curve corresponds to the observed r_pbd value, 0.398. r_pbd: sample size corrected correlation, ρ_pb: nonoverlap proportion, μ: center of mass parameter, number of MC runs: 25, samples per MC run: 4000.
