AI bias refers to systematic errors in AI model outputs that create unfair outcomes for certain groups of people — often related to race, gender, age, disability, or socioeconomic status. Bias enters through training data (reflecting historical inequalities), model architecture choices, evaluation metrics, and deployment decisions. Fairness in AI means designing and auditing systems to ensure their outputs are equitable across demographic groups.
Where bias comes from: the pipeline
Bias is not a single problem with a single fix — it enters at every stage of the AI development pipeline, often in hard-to-detect ways:
| Stage | Source of bias | Real-world example | Detection method |
|---|---|---|---|
| Data collection | Non-representative training data | Facial recognition trained mostly on light-skinned faces; error rates reached 34.7% for darker-skinned women vs 0.8% for lighter-skinned men (Buolamwini & Gebru, 2018) | Demographic breakdown of dataset; representation audits |
| Label collection | Human annotator bias | Sentiment labelers rated the same text as more negative when written in African American English | Inter-annotator agreement per demographic; bias in annotation guidelines |
| Feature engineering | Proxy variables encode protected attributes | ZIP code encodes race; using it in a loan model discriminates indirectly | Correlation analysis between features and protected attributes |
| Model training | Class imbalance; optimization for average accuracy | High overall accuracy masks 40% error rate on the minority class | Disaggregated evaluation metrics per subgroup |
| Evaluation | Benchmark datasets under-represent minority groups | A model scores 94% on a benchmark where 90% of test examples are from one group | Stratified evaluation; held-out subgroup test sets |
| Deployment | Distribution shift; feedback loops | A biased hiring model rejects minority candidates → less diverse training data → more bias next cycle | Monitoring production outputs; disparate impact audits |
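Several of the detection methods in the last column reduce to the same operation: disaggregating a metric by subgroup instead of reporting a single average. A minimal sketch (the function name and data below are hypothetical):

```python
from collections import defaultdict

def disaggregated_accuracy(y_true, y_pred, groups):
    """Accuracy computed separately for each demographic subgroup."""
    correct, total = defaultdict(int), defaultdict(int)
    for yt, yp, g in zip(y_true, y_pred, groups):
        total[g] += 1
        correct[g] += int(yt == yp)
    return {g: correct[g] / total[g] for g in total}

# Hypothetical predictions: overall accuracy (62.5%) hides the gap.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 1, 1, 0, 1, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
print(disaggregated_accuracy(y_true, y_pred, groups))
# → {'a': 1.0, 'b': 0.25}
```

A single aggregate number (62.5% here) would look mediocre but unremarkable; the per-group breakdown is what exposes the 40-point gap.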
Definitions of fairness — and why they conflict
One of the most important (and counterintuitive) results in algorithmic fairness is that many common definitions of fairness are mathematically incompatible — you can't satisfy all of them simultaneously.
| Fairness definition | What it requires | Mathematical condition | Problem with it |
|---|---|---|---|
| Demographic parity | Equal positive prediction rates across groups | P(Ŷ=1 \| A=0) = P(Ŷ=1 \| A=1) | Ignores actual base rates; can force unequal error rates |
| Equal opportunity | Equal true positive rates (recall) across groups | P(Ŷ=1 \| Y=1, A=0) = P(Ŷ=1 \| Y=1, A=1) | Can allow very different false positive rates |
| Equalized odds | Equal TPR and FPR across groups | Both TPR and FPR equal across A | Mathematically incompatible with calibration when base rates differ |
| Calibration | Predicted probabilities match actual outcomes equally for all groups | P(Y=1 \| score=s, A=0) = P(Y=1 \| score=s, A=1) | Incompatible with equalized odds when base rates differ |
| Individual fairness | Similar individuals should receive similar predictions | If d(x, x′) is small, \|f(x) − f(x′)\| should be small | Requires defining "similar" without bias |
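The first three definitions can all be checked from per-group confusion matrices. A sketch, with all names and data hypothetical:

```python
def group_rates(y_true, y_pred, groups, group):
    """Selection rate, TPR, and FPR for one demographic group."""
    tp = fp = fn = tn = 0
    for yt, yp, g in zip(y_true, y_pred, groups):
        if g != group:
            continue
        if yt == 1 and yp == 1:
            tp += 1
        elif yt == 0 and yp == 1:
            fp += 1
        elif yt == 1 and yp == 0:
            fn += 1
        else:
            tn += 1
    n = tp + fp + fn + tn
    return {
        "selection_rate": (tp + fp) / n,            # demographic parity compares this
        "tpr": tp / (tp + fn) if tp + fn else 0.0,  # equal opportunity compares this
        "fpr": fp / (fp + tn) if fp + tn else 0.0,  # equalized odds also requires this
    }

# Hypothetical predictions for two groups:
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["m", "m", "m", "m", "f", "f", "f", "f"]
print(group_rates(y_true, y_pred, groups, "m"))  # TPR 1.0, FPR 0.5
print(group_rates(y_true, y_pred, groups, "f"))  # TPR 0.5, FPR 0.0
```

On this toy data both demographic parity (selection rates 0.75 vs 0.25) and equal opportunity (TPR 1.0 vs 0.5) are violated, illustrating how the table's conditions translate into concrete per-group numbers.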
The impossibility theorem
Chouldechova (2017) and Kleinberg et al. (2016) proved that when base rates differ across groups, calibration, false positive rate parity, and false negative rate parity cannot all be achieved simultaneously. Any real system must choose which fairness criteria matter most for the specific application — there is no mathematically perfect solution.
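A small numeric example makes the theorem concrete: a score that is perfectly calibrated in both groups (P(Y=1 given score s) equals s) still yields different false positive rates when base rates differ. The function and numbers below are illustrative, not taken from any cited study:

```python
def fpr_calibrated(score_fracs, threshold=0.5):
    """False positive rate under perfect calibration:
    P(Y=1 | score=s) = s, predicting positive when s >= threshold.
    score_fracs maps each score to the fraction of the group holding it."""
    false_pos = sum(f * (1 - s) for s, f in score_fracs.items() if s >= threshold)
    negatives = sum(f * (1 - s) for s, f in score_fracs.items())
    return false_pos / negatives

# Hypothetical score distributions; both groups are perfectly calibrated.
group_a = {0.8: 0.5, 0.2: 0.5}   # base rate 0.50
group_b = {0.8: 0.2, 0.2: 0.8}   # base rate 0.32
print(round(fpr_calibrated(group_a), 3))  # → 0.2
print(round(fpr_calibrated(group_b), 3))  # → 0.059
```

Equalizing the two FPRs would require group-specific thresholds or miscalibrating the scores, which is exactly the trade-off the theorem forces.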
Practical mitigation techniques
| Technique | When applied | How it works | Tradeoff |
|---|---|---|---|
| Data resampling / reweighting | Pre-processing | Oversample underrepresented groups; assign higher loss weights to minority samples | Can improve parity but may reduce overall accuracy |
| Adversarial debiasing | In-training | Train a classifier to predict the target AND an adversary to predict the protected attribute from representations; penalize the adversary | Adds training complexity; can be unstable |
| Reranking / post-processing | Post-processing | Adjust decision thresholds per group to equalize specified metrics | Requires group labels at inference; legally sensitive in some jurisdictions |
| Counterfactual data augmentation | Pre-processing | Generate versions of training examples with protected attributes swapped; train on both | Effective for text/NLP; harder for structured data |
| RLHF with fairness constraints | LLM fine-tuning | Include fairness criteria in human feedback; penalize biased outputs in reward model | Expensive; hard to define "fair" consistently across annotators |
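As a concrete instance of the pre-processing row, reweighting in the style of Kamiran & Calders assigns each (group, label) cell a weight that makes group membership statistically independent of the label. A sketch with hypothetical data:

```python
from collections import Counter

def reweighing_weights(groups, labels):
    """Weight per (group, label) cell that makes group membership
    statistically independent of the label:
    w(g, y) = P(g) * P(y) / P(g, y)."""
    n = len(labels)
    p_g, p_y = Counter(groups), Counter(labels)
    p_gy = Counter(zip(groups, labels))
    return {
        (g, y): (p_g[g] / n) * (p_y[y] / n) / (p_gy[(g, y)] / n)
        for (g, y) in p_gy
    }

# Hypothetical data: positives from group "b" are underrepresented relative
# to independence, so they receive the largest weight (1.4), while
# overrepresented cells such as ("a", 1) get weights below 1.
groups = ["a"] * 8 + ["b"] * 2
labels = [1, 1, 1, 1, 1, 1, 0, 0, 1, 0]
print(reweighing_weights(groups, labels))
```

The resulting weights are typically passed as per-sample loss weights during training, which is how this technique trades a little average accuracy for better parity.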
Fairness auditing tools
Open-source libraries: Fairlearn (Microsoft), AI Fairness 360 (IBM/AIF360), What-If Tool (Google), and Aequitas (U Chicago). For LLMs specifically: the BOLD benchmark measures social bias in open-ended text generation, and WinoBias measures gender stereotype bias in coreference resolution.
Practice questions
- What is the difference between disparate treatment and disparate impact in AI systems? (Answer: Disparate treatment: the AI explicitly uses a protected characteristic (race, gender, age) as an input to make decisions — intentional discrimination. Disparate impact: the AI uses neutral-seeming variables (zip code, name, education institution) that correlate with protected characteristics, producing discriminatory outcomes without explicit use of those characteristics. Both can be illegal under anti-discrimination law. Most AI bias cases involve disparate impact since training on historical data automatically captures proxy correlations.)
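In US employment contexts, disparate impact is commonly screened with the EEOC "four-fifths rule": a group's selection rate should be at least 80% of the highest group's rate. A sketch with hypothetical counts:

```python
def disparate_impact_ratio(selected, applicants):
    """Minimum selection rate divided by maximum selection rate.
    selected / applicants: {group: count}."""
    rates = {g: selected[g] / applicants[g] for g in applicants}
    return min(rates.values()) / max(rates.values())

# Hypothetical hiring outcomes: group "b" is selected at half the rate of "a".
ratio = disparate_impact_ratio(
    selected={"a": 50, "b": 25},
    applicants={"a": 100, "b": 100},
)
print(ratio)         # → 0.5
print(ratio >= 0.8)  # → False: fails the four-fifths screen
```

Note that the ratio only flags a disparity; it says nothing about which neutral-seeming proxy variable produced it.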
- What is the word embedding test for gender bias (WEAT) and what did studies find? (Answer: Word Embedding Association Test (WEAT): measures whether gendered words (he/she) are more similar to certain career or attribute words in embedding space. Caliskan et al. (2017) found: word2vec and GloVe associate 'programmer, engineer, scientist' more closely with male pronouns; 'nurse, teacher, librarian' more closely with female pronouns — mirroring US labour market statistics. These biases reflect historical data but are problematic when used in hiring/recommendation systems.)
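The association WEAT measures can be illustrated with cosine similarity on toy vectors. The 2-d "embeddings" below are entirely made up for illustration; real WEAT uses sets of target and attribute words from trained embeddings plus a permutation test for significance:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def association(word, male_words, female_words):
    """WEAT-style association: mean cosine similarity to the male attribute
    words minus mean similarity to the female ones. Positive = male-leaning."""
    m = sum(cosine(word, v) for v in male_words) / len(male_words)
    f = sum(cosine(word, v) for v in female_words) / len(female_words)
    return m - f

# Toy 2-d vectors standing in for real embeddings (hypothetical values):
he, she = [1.0, 0.1], [0.1, 1.0]
programmer, nurse = [0.9, 0.2], [0.2, 0.9]
print(association(programmer, [he], [she]) > 0)  # → True  (male-associated)
print(association(nurse, [he], [she]) < 0)       # → True  (female-associated)
```

In real embeddings the effect is the same but subtler: the career and attribute vectors sit measurably closer to one gendered direction than the other.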
- COMPAS is a recidivism prediction tool used in US courts. What bias issue did ProPublica identify? (Answer: ProPublica (2016) found COMPAS predicted Black defendants would reoffend at nearly twice the false positive rate of White defendants — Black defendants who did NOT reoffend were labelled high risk almost twice as often. Northpointe (COMPAS developer) argued the tool was 'fair' by a calibration/predictive-parity metric: a given risk score corresponded to similar actual reoffence rates across groups. This exemplifies the fairness impossibility theorem: ProPublica's definition (equal FPR) and Northpointe's definition (calibration) are mathematically incompatible when base rates differ.)
- What is 'technical debt' in AI fairness and why is it hard to address retroactively? (Answer: Technical debt: deploying a biased model creates a record of biased decisions (denied loans, failed interviews) that becomes the next round of training data if not carefully managed. The biased model's outputs may influence real-world distributions (denying loans to a community reduces economic activity, making future loan applications from that community look riskier). Retroactive debiasing requires: identifying the source of bias, retraining on corrected data, addressing real-world impacts of past decisions — none of which are technically straightforward.)
- What is 'intersectional fairness' and why is standard demographic fairness analysis insufficient? (Answer: Intersectional fairness (Crenshaw's intersectionality applied to ML): a model may be fair for Black individuals AND fair for women when evaluated separately, but unfair specifically for Black women. Standard fairness analysis evaluates one dimension at a time. Intersectional analysis evaluates all combinations of demographic groups. Practical challenge: small group sizes at intersections (e.g., 'non-binary Hispanic individuals') make statistical analysis unreliable. But ignoring intersections misses systematic harm to specific communities.)
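The failure mode in the last answer can be shown concretely: selection rates that look equal along each single axis can hide a stark disparity at the intersections. All data below is hypothetical and constructed to make the effect extreme:

```python
from collections import defaultdict

def selection_rates(y_pred, keys):
    """Selection rate and sample size per key (a group or a combination)."""
    pos, total = defaultdict(int), defaultdict(int)
    for yp, k in zip(y_pred, keys):
        total[k] += 1
        pos[k] += yp
    return {k: (pos[k] / total[k], total[k]) for k in total}

# Decisions that are "fair" along each single axis (every race and every
# gender is selected at exactly 50%) but maximally unfair at intersections:
y_pred = [0, 0, 1, 1, 1, 1, 0, 0]
attrs = [("black", "f"), ("black", "f"), ("black", "m"), ("black", "m"),
         ("white", "f"), ("white", "f"), ("white", "m"), ("white", "m")]
print(selection_rates(y_pred, [r for r, _ in attrs]))  # both races: 0.5
print(selection_rates(y_pred, [g for _, g in attrs]))  # both genders: 0.5
print(selection_rates(y_pred, attrs))                  # intersections: 0.0 or 1.0
```

Returning the sample size alongside each rate also surfaces the practical challenge from the answer: the intersectional cells here have only 2 examples each, too few for reliable statistics in real audits.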