A refined reweighing technique for nondiscriminatory classification

Yuefeng Liang; Cho-Jui Hsieh; Thomas C. M. Lee

doi:10.1371/journal.pone.0308661

Abstract

Discrimination-aware classification methods remedy socioeconomic disparities exacerbated by machine learning systems. In this paper, we propose a novel data pre-processing technique that assigns weights to training instances in order to reduce discrimination without changing any of the inputs or labels. While the existing reweighing approach only looks into sensitive attributes, we refine the weights by utilizing both sensitive and insensitive ones. We formulate our weight assignment as a linear programming problem. The weights can be directly used in any classification model into which they are incorporated. We demonstrate three advantages of our approach on synthetic and benchmark datasets. First, discrimination reduction comes at a small cost in accuracy. Second, our method is more scalable than most other pre-processing methods. Third, the trade-off between fairness and accuracy can be explicitly monitored by model users. Code is available at https://github.com/frnliang/refined_reweighing.

Citation: Liang Y, Hsieh C-J, Lee TCM (2024) A refined reweighing technique for nondiscriminatory classification. PLoS ONE 19(8): e0308661. https://doi.org/10.1371/journal.pone.0308661

Editor: Yang Wang, Xi’an Jiaotong University, CHINA

Received: October 8, 2023; Accepted: July 28, 2024; Published: August 20, 2024

Copyright: © 2024 Liang et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: The data sets used in the paper are well-known benchmark data sets that are publicly available. They can be downloaded from https://github.com/frnliang/refined_reweighing.

Funding: This work was partially supported by the National Science Foundation under grants CCF-1934568, DMS-1916125, DMS-2113605, DMS-2210388, IIS-2008173 and IIS-2048280.

Competing interests: The authors have declared that no competing interests exist.

1 Introduction

Advances in computing technology enable automated decision-making to be popularized in many social contexts. Artificial intelligence can be more efficient at candidate screening than human resources recruiters [1]. Predictive policing helps forecast crime in law enforcement operations [2]. However, the rising popularity unintentionally introduces socioeconomic and racial disparities [3]. As anti-discrimination laws state that it may be illegal to introduce serious bias [4, 5], discrimination-aware classification has become an important research topic in machine learning [6].

Reweighing (RW) [7] is one of the earliest bias mitigation algorithms studied by researchers. It alleviates sample size disparity [4] by assigning weights to all cohort and label tuples in the training data. Despite its easy implementation [7–12] and accuracy preservation [7], RW has two limitations. First, it ignores potential relations between the within-cohort attributes and other features. If other features are proxies of the within-cohort attributes, the re-weighted data may not be discrimination-free [4]. Second, it does not allow decision makers to control the cost in accuracy they would pay for fairness in real-world problems.

To overcome those two limitations, we propose the Refined Reweighing (RRW) technique that generates more fine-grained weights than RW’s by investigating distributional inequity across all attributes in a two-phase process. In Phase I, we calculate the sample sizes of all observed categorical attribute-label combinations and transform them into weights. We formulate the weight assignment as a linear programming problem. If there exist numerical attributes, we proceed with Phase II that integrates their probability distribution with weights obtained in Phase I. The final weight assignment is independent of the ultimate prediction method, which makes RRW versatile.

The empirical results are promising. RRW is competitive with state-of-the-art pre-processing treatments [6, 12, 13] in the AI Fairness 360 toolkit [14] on accuracy and fairness in three classification tasks. As the numbers of data points in those datasets are all under 50,000, and the numbers of attributes are all below 6, we conduct an extensive simulation study to evaluate its scalability on more attributes and larger sample sizes. For example, RRW manages to handle 40 million instances with 15 attributes in 2 minutes, while it takes some other methods at least 30 minutes to do so. The effectiveness of both optimization phases is also evaluated by simulation studies and real data experiments.

The rest of the paper is organized as follows. We review related work in Section 2 and introduce our method in Section 3. Experimental results are presented and discussed in Section 4, followed by concluding remarks in Section 5.

2 Related work

Researchers have studied various fairness measures such as statistical parity [15–17], predictive parity [18] and equalized odds [15, 19]. Indeed, all measures fall into three main categories. Group fairness ensures that the subjects in the protected and unprotected cohorts have similar outcomes [13, 20, 21]. Individual fairness emphasizes that subjects possessing similar attributes are similarly labeled [16, 22, 23]. Counterfactual fairness declares fairness to a subject if the outcome is the same in both the real world and a counterfactual world where that subject belonged to a different cohort based on causal graphs [24]. Our goal is to secure statistical parity, a notion for group fairness.

Discrimination prevention algorithms can mitigate bias at different stages of predictive modeling. Pre-processing [6, 10, 12, 13, 25] approaches modify the training data. In-processing [26–28] approaches adjust classification models. Post-processing [19, 22, 29] approaches change the predicted labels. We focus on pre-processing in this work.

Suppression, resampling, modification and reweighing are four typical techniques in pre-processing approaches [10]. Suppression removes sensitive attributes in training and testing. This removal alone is ineffective when some of the insensitive attributes are highly correlated with the sensitive ones [4]. Resampling refers to stratified sampling applied on all combinations of cohort and labels [10]. Each combination is either under-sampled or over-sampled. Modification changes (1) the attributes, (2) the labels [9], or both [6]. Reweighing assigns weights to all pairs of attributes and label. Data can be freed from discrimination without changing its space or value [10]. The last two techniques are more relevant to RRW. We mainly compare RRW with RW [7] and the three modification algorithms in the AI Fairness 360 toolkit [14] summarized below.

RW [7] applies appropriate weights to different (cohort, label) tuples in the training data. The weight is equal to the product of the marginal probabilities of cohort and label over the observed probability of their joint distribution. If the data were unbiased, all cohorts and labels would be statistically independent and the weights would be one.
Learning Fair Representations (LFR) [12] is a prototype-based clustering algorithm. It maps each individual to a probability distribution in a latent representation space in order to obfuscate any cohort-related information, while retaining information of other attributes as much as possible.
Disparate Impact Remover (DIR) [13] edits attributes from which the protected cohort can be predicted. It explores disparate impact [30, 31], balanced error rate and ϵ-fairness to remove the attributes’ ability to distinguish between different cohorts while preserving rank-ordering within cohorts.
Optimized Pre-Processing strategy (OPT) [6] learns a probabilistic transformation that modifies attributes and labels by taking group fairness, individual distortion and data utility into consideration. An appropriate choice of distortion metric is essential for effective discrimination reduction.
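The RW weight described in the first item above can be sketched in a few lines. This is a minimal illustration rather than the authors' code: the function name `rw_weights` and the toy data are hypothetical, and the formula is the verbal description rewritten with counts, W(y, d) = n_y n_d / (n n_{y,d}).

```python
from collections import Counter

def rw_weights(labels, cohorts):
    """Return the RW weight for every observed (label, cohort) tuple.

    Weight = P(label) * P(cohort) / P(label, cohort), written with counts.
    """
    n = len(labels)
    n_y = Counter(labels)                 # marginal label counts
    n_d = Counter(cohorts)                # marginal cohort counts
    n_yd = Counter(zip(labels, cohorts))  # joint counts
    return {
        (y, d): n_y[y] * n_d[d] / (n * n_yd[(y, d)])
        for (y, d) in n_yd
    }

# Toy data (hypothetical): cohort "a" receives the positive label more often.
labels = [1, 1, 1, 0, 1, 0, 0, 0]
cohorts = ["a", "a", "a", "a", "b", "b", "b", "b"]
weights = rw_weights(labels, cohorts)
```

On this toy data the over-represented tuples (1, "a") and (0, "b") receive weight 2/3, while the under-represented tuples (0, "a") and (1, "b") receive weight 2, so the weighted positive rate is 0.5 in both cohorts.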

RW and RRW are reweighing methods. LFR, DIR and OPT belong to the modification category mentioned earlier. A fundamental difference between these two groups is that modification alters the elements of the instances, while reweighing updates the empirical distribution of the training instances, or their influence in a classifier. This distinction leads to their different levels of scalability and training time complexity. RW does not require any optimization, but it does not provide any control for the fairness-accuracy trade-off. Although RRW entails a two-phase optimization, its processing time is much shorter than the amount of time required by most modification methods. DIR requires the user to specify a repair level, a hyper-parameter indicating how much the user wishes for the distributions of the cohorts to overlap. LFR changes the space of the data distribution so that classification predictions are made on prototypes. Hyper-parameters controlling individual fairness, group fairness and prediction accuracy need to be carefully tuned to produce ideal results. OPT involves a large amount of calibration that would potentially undermine its feasibility, and it is only applicable to categorical attributes. The simulation study in Section 4 validates these statements.

3 Proposed technique

The main idea of RRW is to attach customized weights to training instances with different sensitive attributes (cohorts), insensitive attributes and labels. The choice of sensitive attribute(s) is presumably determined by humans. We first define some terminology used in this work. An unfavorable class in a sensitive attribute is a discriminated cohort. An underrepresented class in an insensitive attribute is the rarest among all classes. An instance is underprivileged if it is unfavorable, underrepresented, or both. The aim of the weight assignment is to give higher weights to positive instances if the instances are underprivileged, and lower weights to positive instances if they are privileged. The goal of RRW is to reduce discrimination by handling unfavorable attributes, while maintaining the overall prediction accuracy by caring about underrepresented ones. We start the discussion of weight assignment with the problem formulation.

Let {(X_i, Y_i, D_i)}, i = 1, …, n, be n samples from a joint distribution p_X,Y,D with domain 𝒳 × 𝒴 × 𝒟. X, Y and D denote insensitive attributes, labels, and sensitive attributes such as race and gender, respectively. In this work we focus on a discrete and finite domain 𝒟 and binary labels 𝒴 = {0, 1}. We only present the derivation under a univariate and binary scenario 𝒟 = {d₁, d₂}, while the proposed framework is applicable to higher-dimensional 𝒟. We now define statistical parity, the notion of our fairness goal.

Definition 1 A binary classifier Ŷ satisfies statistical parity with respect to a sensitive attribute D if Ŷ is independent of D, that is, for all predicted labels ŷ and all cohorts d,

$$ p_{\hat{Y} \mid D}(\hat{y} \mid d) = p_{\hat{Y}}(\hat{y}). $$

To distinguish categorical insensitive attributes from numerical ones, we introduce the categorical domain 𝒳^(c) and the numerical domain 𝒳^(n), and we have X = (X^(c), X⁽ⁿ⁾) in 𝒳^(c) × 𝒳^(n). If both 𝒳^(c) and 𝒳^(n) are non-empty, then the weight assignment is a two-phase optimization problem: Phase I assigns weights under 𝒳^(c), and Phase II assigns weights under 𝒳^(n). If only one type of X is observed, then we implement its corresponding phase alone.

3.1 Phase I optimization

Sample size plays a pivotal role in Phase I. Let n_x,y,d be the number of instances containing the triplet (x, y, d), and n_y,d be the number of instances containing the tuple (y, d). n_y and n_d are defined in a similar manner. Let W₁ be a weight function on triplets of categorical attribute values, label and cohort. We now assign a non-negative weight W₁(x, y, d) to every triplet (x, y, d), so in total there are at most |𝒳^(c)| × |𝒴| × |𝒟| weights to be sought, one per possible triplet. We impose a set of constraints, referred to as constraint set (1), when we search for the optimal weights. The weights are considered optimized if they have the following three properties.

I. Independence guarantee. To achieve this goal, we borrow strength from RW, which attaches weights to all instances according to Y and D. Let W₁(y, d) be the weights for all tuples (y, d). RW can be summarized as

$$ W_1(y, d) = \frac{n_y \, n_d}{n \, n_{y,d}}. \tag{2} $$

W₁(y, d) in RW and W₁(x, y, d) in RRW are closely related, as W₁(y, d) can be regarded as a weighted average of W₁(x, y, d) over all x’s:

$$ W_1(y, d) = \sum_{x} \frac{n_{x,y,d}}{n_{y,d}} \, W_1(x, y, d). \tag{3} $$

If we combine Eqs (2) and (3), we obtain another equation that is intrinsically aligned with the first three equations in constraint set (1):

$$ \sum_{x} n_{x,y,d} \, W_1(x, y, d) = \frac{n_y \, n_d}{n}. \tag{4} $$

The agreement between RW and W₁(x, y, d) can be acknowledged in two ways. First, constraint set (1) on W₁(x, y, d) is compatible with the analysis of W₁(y, d). Second, Y and D are independent under the weighted distribution. This can be seen by dividing both sides of Eq (4) by n_d, which shows that the weighted distribution of Y conditional on D equals the original distribution of Y.
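The division step at the end of this argument can be written out explicitly. The following is a sketch that assumes Eq (4) has the form obtained by equating the RW weight n_y n_d / (n n_{y,d}) with the frequency-weighted average of W₁(x, y, d) over x; the symbol \tilde{p} for the weighted distribution is introduced here only for illustration.

```latex
\sum_{x} n_{x,y,d}\, W_1(x,y,d) = \frac{n_y\, n_d}{n}
\;\Longrightarrow\;
\tilde{p}_{Y \mid D}(y \mid d)
  := \frac{1}{n_d} \sum_{x} n_{x,y,d}\, W_1(x,y,d)
  = \frac{n_y}{n}
  = p_Y(y).
```

Because the right-hand side does not depend on d, the weighted label distribution is the same in every cohort. Summing the left equality over y also gives \sum_{x,y} n_{x,y,d} W_1(x,y,d) = n_d, so under this form the total weight within each cohort matches its original size.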

II. Discrimination control. The above conditional probability argument leads to the second goal. Group fairness is enforced when the difference between p_Y|D(y|d₁) and p_Y|D(y|d₂) is under control. If we extend our scope from p_Y|D to p_{Y|X^(c), D}, we may further reduce the discrepancy by considering the relation between X^(c) and D. To this end, we introduce a weighted version of the conditional probability of Y given X^(c) and D, and we reduce its discrepancy between d₁ and d₂ over all (x, y) by minimizing the objective in expression (5).

III. RW-Accuracy preservation. To embrace the advantages of RW, we keep our optimal weights from deviating too far from RW’s solution by controlling the difference between W₁(x, y, d) and W₁(y, d) for all (x, y, d), as measured by expression (6).

Optimization formulation.

Putting constraint set (1) and Eqs (4)–(6) together, we arrive at optimization problem (7) for determining W₁(x, y, d) for all (x, y, d): we minimize the discrepancy in expression (5) plus λ times the deviation in expression (6), subject to the constraints, where λ > 0 is a tuning parameter. A smaller λ pulls the weights more toward group fairness, as less emphasis is put on maintaining the accuracy provided by RW. Indeed, RW is a special case of RRW.

Proposition 1 RRW is a generalized version of RW.

Proof 1 If the λ in problem (7) is sufficiently large, the first summation in the objective function is dominated by the second summation. As the second summation is non-negative, the minimization drives it down to 0. As λ → ∞, W₁(x, y, d) → W₁(y, d) for all (x, y, d). Since the RW weights W₁(y, d) are non-negative and satisfy Eq (4) for all (y, d), the optimal solution in RW satisfies all constraints provided by RRW. When λ is small, the optimal solution in RRW deviates from the one in RW.
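To see why a problem of this kind reduces to linear programming, the standard trick is to replace each absolute value with an auxiliary variable bounded from both sides. The block below is a sketch under stated assumptions, not the paper's exact problem (7): it assumes the discrepancy term compares W₁(x, y, d) n_{x,y,d} / n_{x,d} across the two cohorts, that the deviation term is |W₁(x, y, d) − W₁(y, d)|, and that the constraints are the Eq (4) form plus non-negativity. The auxiliary variables u and v are introduced here for illustration.

```latex
% a_{x,y}(W_1) is an assumed form of the cohort discrepancy:
%   a_{x,y}(W_1) = W_1(x,y,d_1)\,\frac{n_{x,y,d_1}}{n_{x,d_1}}
%                - W_1(x,y,d_2)\,\frac{n_{x,y,d_2}}{n_{x,d_2}}
\begin{aligned}
\min_{W_1,\,u,\,v}\quad
  & \sum_{x,y} u_{x,y} \;+\; \lambda \sum_{x,y,d} v_{x,y,d} \\
\text{s.t.}\quad
  & -u_{x,y} \;\le\; a_{x,y}(W_1) \;\le\; u_{x,y}
      && \text{for all } (x,y), \\
  & -v_{x,y,d} \;\le\; W_1(x,y,d) - W_1(y,d) \;\le\; v_{x,y,d}
      && \text{for all } (x,y,d), \\
  & \sum_{x} n_{x,y,d}\, W_1(x,y,d) = \frac{n_y\, n_d}{n}
      && \text{for all } (y,d), \\
  & W_1(x,y,d) \ge 0
      && \text{for all } (x,y,d).
\end{aligned}
```

At the optimum each u and v equals the absolute value it bounds, so the objective and all constraints are linear in (W₁, u, v). The RW weights W₁(y, d) are constants computed from the counts, and setting W₁(x, y, d) = W₁(y, d) is always feasible for this sketch.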

This optimization problem can be formulated as, and therefore efficiently solved by, linear programming. Here we address two practical concerns for implementation.

What happens if not all combinations of X^(c), Y and D exist? The underlying assumption of RRW is that p_{X^(c), Y, D} is known along with its marginals and conditionals. If there exists at least one (x, d) pair such that n_{x, d} = 0, the corresponding probability is undefined. To overcome this limitation, we exclude unobserved pairs in problem (7), and their weights will be 1, the original value without any treatment.

What happens if there are too many x’s in 𝒳^(c)? The time complexity of linear programming in problem (7) is exponential in the number of categorical attributes. To avoid the potential combinatorial explosion, we consider the bias-corrected Cramér’s V [32], denoted by V, which measures pairwise association. For every categorical attribute x_j, if ∑_{k≠j} V(x_j, x_k), the sum of its pairwise association with the other attributes, exceeds a certain threshold, then we claim the correlation between x_j and the other attributes is so strong that x_j can either be excluded from both phases, or be viewed as numerical and handled by Phase II. This selection process ensures that the number of attributes handled by Phase I optimization remains sufficiently small for linear programming in high-dimensional categorical attribute spaces. In Section 4.4, we argue that the second route is preferred.
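The screening statistic can be sketched as follows. This is a minimal pure-Python illustration of the bias-corrected Cramér's V and of the per-attribute sum of pairwise associations; the function names and the toy columns are hypothetical, and no threshold is suggested because the text does not state one.

```python
import math
from collections import Counter

def cramers_v_corrected(a, b):
    """Bias-corrected Cramér's V between two categorical sequences."""
    n = len(a)
    rows, cols = Counter(a), Counter(b)
    joint = Counter(zip(a, b))
    # Pearson chi-squared statistic of the contingency table.
    chi2 = 0.0
    for r, n_r in rows.items():
        for c, n_c in cols.items():
            expected = n_r * n_c / n
            chi2 += (joint.get((r, c), 0) - expected) ** 2 / expected
    r_, k_ = len(rows), len(cols)
    # Bias corrections for phi^2 and for the table dimensions.
    phi2 = max(0.0, chi2 / n - (k_ - 1) * (r_ - 1) / (n - 1))
    r_corr = r_ - (r_ - 1) ** 2 / (n - 1)
    k_corr = k_ - (k_ - 1) ** 2 / (n - 1)
    denom = min(r_corr - 1, k_corr - 1)
    return math.sqrt(phi2 / denom) if denom > 0 else 0.0

def total_association(columns):
    """Sum of pairwise V between each attribute and all the others."""
    return {
        j: sum(cramers_v_corrected(columns[j], columns[k])
               for k in columns if k != j)
        for j in columns
    }

# Toy columns (hypothetical): x2 duplicates x1, x3 is unrelated to both.
columns = {
    "x1": [0, 0, 1, 1] * 25,
    "x2": [0, 0, 1, 1] * 25,
    "x3": [0, 1, 0, 1] * 25,
}
scores = total_association(columns)
```

Attributes whose score exceeds the chosen threshold would then be routed to Phase II or dropped, as described above. Here x1 and x2 each score about 1 and x3 scores 0.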

3.2 Phase II optimization

The probability distribution of numerical insensitive attributes plays an essential role in Phase II. Suppose x′ is a numerical insensitive attribute re-scaled to [0, 1]. For every (x, y, d), let f(x′|x, y, d) be the frequency of x′ over all instances with (x, y, d). For simplicity, we write f(x′) = f(x′|x, y, d). When x′ is discrete, we obtain f(x′) by counting the frequencies of all available values in the training set. When x′ is continuous, we bucketize the values into equal-sized buckets, treating these buckets as discrete values as in the first scenario. The bucket size is determined by making each bucket as granular as possible while remaining non-empty. Given an unknown value c ranging from min f(x′) to max f(x′), we define dev_c(x′) = f(x′) − c, which measures the deviation of f(x′) from c. The value of c can be found by binary search over [min f(x′), max f(x′)]. Geometrically, c is a horizontal line across f(x′) whose positive and negative vertical distances to f(x′) integrate to 0. We provide a graphical illustration on the UCI Adult income data in the top-left plot of Fig 1 to demonstrate that c = 131 equalizes the red and purple areas.
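The search for c can be sketched as a bisection. The sketch assumes the balancing condition is the discrete one, namely that the signed deviations f(x′) − c sum to zero over the observed values; the exact condition used by the authors may differ. The function names and the toy ages are hypothetical.

```python
from collections import Counter

def value_frequencies(values):
    """Frequency f(x') of each observed value of a discrete attribute."""
    return Counter(values)

def balance_level(freqs, tol=1e-9):
    """Bisect for the level c at which the deviations f - c sum to zero.

    Assumed condition: sum over observed values of (f(x') - c) equals 0.
    The sum decreases as c grows, so a positive sum means c is too low.
    """
    freqs = list(freqs)
    lo, hi = min(freqs), max(freqs)
    while hi - lo > tol:
        c = (lo + hi) / 2
        if sum(f - c for f in freqs) > 0:
            lo = c
        else:
            hi = c
    return (lo + hi) / 2

# Toy attribute (hypothetical) for one (x, y, d) group.
ages = [20, 20, 30, 30, 30, 40]
f = value_frequencies(ages)
c = balance_level(f.values())
dev = {age: count - c for age, count in f.items()}
```

Under this assumed condition the search converges to the mean frequency, which is 2 for the toy data, so the deviations are 0 for age 20, +1 for age 30 and −1 for age 40.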

Fig 1. Graphical illustrations for the role of c with the Adult income data.

Top-left: f(age) against age; top-right: dev(age)/max dev(age) against age; bottom-left: (age) against age; and bottom-right: (age) against age.

https://doi.org/10.1371/journal.pone.0308661.g001

As c is now fixed, we denote dev_c(x′) by dev(x′) for simplicity. Next, let t ∈ [0, 1] be another unknown value to be solved for. We define a weight function of x′ indexed by t. For illustrative purposes, we express this weight function as a continuous function in this section. It is not necessarily smooth in application, as the weight is only defined for the points observed in the training set. When t = 1, the weight function is identical to the horizontal reflection of f(x′). When t ∈ [0, 1), geometrically, it shrinks toward W₁ vertically by a factor of t. The last three plots in Fig 1 illustrate that the weight function and dev(x′)/max dev(x′) move in opposite vertical directions. A nice property of the weight function is that, for all t ∈ [0, 1], its weighted average over x′ equals W₁(x, y, d). As shown in Fig 2, in the Adult data, lower weights are assigned to more frequent combinations of age and education. The weighted average of W₂(race, gender, income, age, education) over age and education is exactly W₁(race, gender, income).

Fig 2. Top-left: f(age) for (non-white, female, < 50K income) individuals; top-right: f(education) for (non-white, female, < 50K income) individuals; and bottom: W₂(non-white, female, < 50K, age, education) in the Adult data.

https://doi.org/10.1371/journal.pone.0308661.g002

Similar to expression (5), the optimized t minimizes the discrepancy of the weighted conditional probability between d₁ and d₂, which we refer to as problem (8). This t can also be found by binary search over [0, 1]. After we obtain the optimized value, we denote the resulting weight function by W₂ for simplicity.

In general, given m numerical insensitive attributes x′_k, k = 1, …, m, we assume they contribute equally to label prediction, as the classification model is not available under the pre-processing setting. The weight for an instance (x, y, d, x′_1, …, x′_m) therefore combines the m per-attribute weights with equal contribution, and it is easy to check that W₁(x, y, d) is equivalent to the weighted average of these weights over the numerical attributes. Every t is solved independently in problem (8) for each numerical attribute and each (x, y, d), so the cost of Phase II optimization is the sum of the costs of these independent searches. In high-dimensional categorical attribute spaces, where not all categorical attributes are handled by Phase I, Phase II has a linear time complexity relative to the number of remaining categorical attributes.

3.3 Training and prediction

RRW aims to help classifiers satisfy statistical parity by seeking independence between the predicted label ŷ and d. While the relationship between the observed y in the training set and d is established in the two optimization phases, minimizing the gap between p_Y|D(y|d) and p_Y|D(y|d′) for all d and d′, we also need to establish a connection between the observed y in the training set and the predicted ŷ in the test set. This connection requires the assumption that the training and test sets share the same conditional distribution of labels given the sensitive attribute. In other words, statistical parity can be achieved when p_Y|D(y|d) is the same in the training and test sets.

In practice, when we train a model with the assigned weights, we update the weights of training instances, and feed them to the classification model. To predict labels of test instances, we run the model on testing data as usual without any extra steps.
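How instance weights enter a classifier can be illustrated with a small weighted logistic regression. This is a pure-Python sketch for exposition, not the authors' training code: the function names, learning rate and step count are hypothetical choices. In practice many libraries accept per-instance weights directly, for example through the `sample_weight` argument of scikit-learn's `fit` methods.

```python
import math

def fit_weighted_logreg(X, y, w, lr=0.5, steps=2000):
    """Minimise the weighted mean log-loss by batch gradient descent.

    X: list of feature lists, y: 0/1 labels, w: non-negative weights.
    Returns coefficients with the intercept stored in the last position.
    """
    dim = len(X[0])
    beta = [0.0] * (dim + 1)
    total = sum(w)
    for _ in range(steps):
        grad = [0.0] * (dim + 1)
        for x_i, y_i, w_i in zip(X, y, w):
            z = beta[-1] + sum(b * v for b, v in zip(beta, x_i))
            p = 1.0 / (1.0 + math.exp(-z))
            # Each instance contributes to the gradient in proportion to w_i.
            for j, v in enumerate(x_i):
                grad[j] += w_i * (p - y_i) * v
            grad[-1] += w_i * (p - y_i)
        beta = [b - lr * g / total for b, g in zip(beta, grad)]
    return beta

def predict_proba(beta, x):
    """Predicted probability of the positive label; no weights are needed."""
    z = beta[-1] + sum(b * v for b, v in zip(beta, x))
    return 1.0 / (1.0 + math.exp(-z))

# Toy training set (hypothetical) with one up-weighted instance.
X_train = [[0.0], [0.0], [1.0], [1.0], [1.0]]
y_train = [0, 1, 0, 1, 1]
w_train = [2.0, 1.0, 1.0, 1.0, 1.0]
beta = fit_weighted_logreg(X_train, y_train, w_train)
```

Giving an instance weight 2 has the same effect on this loss as including it twice, which is why the weights change the fitted model without altering any input or label. Prediction on test instances calls `predict_proba` with no weights.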

4 Experimental results

We evaluate the performance of RRW on both synthetic and benchmark datasets, and conduct an extensive comparison among two baselines, six pre-processing approaches and one in-processing method on three fairness metrics. Denote predicted labels by Ŷ. The first fairness measure, Discrimination_sp in Eq (9), is motivated by the “80% rule” [33] and statistical parity [15–17]. The second measure, given in Eq (10), is disparate impact [30, 31], which focuses on the proportions of the two cohorts that receive the positive outcome. The last measure is the false discovery rate. We report the trade-off between the empirical discrimination on the test set and the empirical accuracy, measured by the area under the ROC curve (AUC). As the accuracy versus discrimination pattern remains stable across various classifiers such as logistic regression, support vector machine, random forest and Gaussian naïve Bayes for all pre-processing methods [6, 12], we only present results on logistic regression. All experiments are conducted on a machine with an Intel Xeon E5-2690 2.90GHz CPU and 256GB RAM.
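The three kinds of measure can be sketched with their common textbook definitions. These are assumptions for illustration and may differ in detail from Eqs (9) and (10): statistical parity is taken as the absolute difference in positive prediction rates, disparate impact as the ratio of the unfavorable cohort's rate to the favorable cohort's rate, and the false discovery rate as the share of positive predictions that are wrong. All function names and the toy data are hypothetical.

```python
def positive_rate(y_pred, d, cohort):
    """Share of instances in a cohort that receive the positive prediction."""
    selected = [p for p, g in zip(y_pred, d) if g == cohort]
    return sum(selected) / len(selected)

def statistical_parity_difference(y_pred, d, unfavorable, favorable):
    """Absolute gap in positive prediction rates between the two cohorts."""
    return abs(positive_rate(y_pred, d, favorable)
               - positive_rate(y_pred, d, unfavorable))

def disparate_impact_ratio(y_pred, d, unfavorable, favorable):
    """Positive rate of the unfavorable cohort over that of the favorable one."""
    return (positive_rate(y_pred, d, unfavorable)
            / positive_rate(y_pred, d, favorable))

def false_discovery_rate(y_true, y_pred):
    """False positives divided by all positive predictions."""
    predicted_pos = sum(1 for p in y_pred if p == 1)
    false_pos = sum(1 for t, p in zip(y_true, y_pred) if p == 1 and t == 0)
    return false_pos / predicted_pos if predicted_pos else 0.0

# Toy predictions (hypothetical): four favorable and four unfavorable instances.
d = ["fav"] * 4 + ["unfav"] * 4
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]
y_true = [1, 0, 0, 0, 1, 0, 1, 0]
spd = statistical_parity_difference(y_pred, d, "unfav", "fav")
di = disparate_impact_ratio(y_pred, d, "unfav", "fav")
fdr = false_discovery_rate(y_true, y_pred)
```

On the toy data the positive rates are 0.5 and 0.25, giving a parity difference of 0.25 and an impact ratio of 0.5, and one of the three positive predictions is wrong. Under the “80% rule”, an impact ratio below 0.8 is usually read as evidence of adverse impact.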

4.1 Phase I simulation

Data.

Let D = (D₁, …, D_n), with D_i ∈ {+1, −1} for all i, be a sensitive attribute vector of length n. 80% of the entries are set to be unfavorable. Categorical insensitive attributes {X_ij}, i = 1, …, n, j = 1, …, q−1, are randomly generated, and set to be binary with 1/2 chance for each outcome in {+1, −1}. D and all X’s are independent. Let Y = (Y₁, …, Y_n) be the true label vector of length n. We assign different coefficients, including β_j = 0.01 × j for j = 1, …, q−1, to different outcomes of D and X so that the linear combination of D_i and X_ij can be used to determine the true labels by treating Y_i as a Bernoulli random variable for all i. We randomly split 80% and 20% of all instances into training and test sets.
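A data generator along these lines can be sketched as follows. Several details are assumptions made only for illustration: −1 is taken to encode the unfavorable cohort, the coefficient on D (`beta_d`) is a hypothetical value because it is not stated here, and a logistic link is assumed to turn the linear combination into a Bernoulli probability.

```python
import math
import random

def simulate(n, q, beta_d=0.5, seed=0):
    """Generate (D, X, Y) and an 80/20 train/test index split.

    Assumptions: -1 marks the unfavorable cohort, beta_d is hypothetical,
    and labels follow a logistic model of the linear combination.
    """
    rng = random.Random(seed)
    n_unfav = int(0.8 * n)
    D = [-1] * n_unfav + [1] * (n - n_unfav)
    rng.shuffle(D)
    # q - 1 independent binary insensitive attributes, each +1/-1 with chance 1/2.
    X = [[rng.choice((1, -1)) for _ in range(q - 1)] for _ in range(n)]
    betas = [0.01 * j for j in range(1, q)]
    Y = []
    for d_i, x_i in zip(D, X):
        eta = beta_d * d_i + sum(b * v for b, v in zip(betas, x_i))
        p = 1.0 / (1.0 + math.exp(-eta))
        Y.append(1 if rng.random() < p else 0)
    idx = list(range(n))
    rng.shuffle(idx)
    cut = int(0.8 * n)
    return D, X, Y, idx[:cut], idx[cut:]

D, X, Y, train_idx, test_idx = simulate(n=1000, q=5)
```

With a positive `beta_d`, the favorable cohort receives the positive label more often, which is the kind of disparity the reweighing is meant to correct.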

Implementation.

The parameters in LFR [12] are chosen according to its authors’ recommendation: A_x = 0.01, A_y and A_z ∈ {0.1, 0.5, 1, 5, 10}. The regularization parameter in DIR [13] is selected based on balanced error rate. The parameters in OPT [6] are chosen as its authors suggest whenever they are publicly available. To demonstrate the impact of sample size on computational time, we pick seven different values for n from 10⁵ to 10⁷, and set q = 5. We compare RRW with LFR, DIR and OPT and leave out RW, as sample size does not affect the speed of RW. We consider Eq (9) as the fairness measure.

Results.

The left plot in Fig 3 reveals that OPT and LFR consume more time than DIR and RRW. Even though there are up to 2^(q+1) weights to be sought in RRW, the underlying data representations in OPT and LFR take even longer than the reweighing in RRW. Little computational burden is added to the linear programming step in RRW, as all n_x,y,d’s are summarized in the coefficient matrix before the optimization is conducted.

Fig 3. Elapsed time versus sample size for OPT, LFR, DIR and RRW when q = 5 (left), and elapsed time versus sample size for RRW when q = 5, 10, 15 (right).

https://doi.org/10.1371/journal.pone.0308661.g003

To investigate the influence of the number of attributes for RRW, we pick different values for n from 10⁴ to 4 × 10⁷, and q = 5, 10, 15. As we have mentioned in Section 3.1, we select the 10 most uncorrelated insensitive attributes out of 14 when q = 15. The right plot in Fig 3 illustrates that weights can be assigned to 40 million instances within 30 seconds. Although elapsed time grows linearly as sample size increases, the growth rate is indeed significantly lower than those of LFR, DIR and OPT, which is demonstrated by the left plot in Fig 3.

To compare their performance on fairness and accuracy, we set q = 5 (1 sensitive attribute and 4 categorical insensitive attributes), n = 10⁴, and set the AUC for all methods to be 0.505. We report Discrimination_sp under the best sets of parameters in Table 1. RRW outperforms all other methods. Furthermore, for n = 10⁵ and q = 5, 10, 15, as shown in Table 2, under the best λ, RRW outperforms RW in Discrimination_sp by at least 25%.

Table 1. Discrimination for q = 5 and n = 10⁴ under AUC = 0.505.

https://doi.org/10.1371/journal.pone.0308661.t001

Table 2. AUC and discrimination for full, RW and RRW and q = 5, 10, 15.

https://doi.org/10.1371/journal.pone.0308661.t002