Ever look at a survey or a poll and think, "Wait a minute, who did they even ask?"
It’s a great question, and it's one that data scientists and researchers wrestle with constantly. The hard truth is that almost every survey, no matter how carefully designed, is biased. Maybe it only reached people who are active online. Maybe younger people were more likely to respond than older folks. Maybe it over-sampled city dwellers and missed everyone in the countryside.
When this happens, your data is lying to you. Your "average" user isn't average at all. Your conclusions are built on a skewed foundation, which can lead to bad decisions, failed products, and embarrassing predictions.
But what if you could fix it? What if you could take a biased, unrepresentative sample and mathematically nudge it back towards the truth?
That's exactly what we're going to do today. We're going to roll up our sleeves, dive into some code, and use a fantastic Python library from Facebook Research called balance to correct a deliberately biased survey. Think of it as data science detective work: we’ll create a problem, and then we’ll solve it.
Let's Create a World (So We Can Break It)
Before we can fix a biased sample, we need a "ground truth"—a perfect, unbiased population that we know everything about. Since we don't have that in the real world, we'll simulate one.
This is a neat trick because it's like having the answer key before a test. We'll know the true average happiness of our population, so when we try to fix our biased sample, we can see how close we get to the right answer.
Here’s the plan: we’ll create a fake population of 50,000 people with different ages, education levels, incomes, and regions. We’ll even give them a "happiness" score.
import numpy as np
import pandas as pd
from balance import Sample

# A simplified function to create our population
def simulate_population(n=50_000):
    # ... (code to generate age, gender, education, etc.) ...
    # Let's say happiness depends on age, education, and income
    happiness = (
        50
        + 0.20 * (age - 45)
        + (education == "Graduate") * 8
        + np.log(income) * 2
        # ... plus some random noise
    )
    # ...
    return pd.DataFrame({ ... "happiness": happiness ... })

target_df = simulate_population(50_000)
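If you want something you can actually run, here is one way the elided parts could be filled in. This is a minimal sketch only: the function name `simulate_population_sketch`, every distribution, the category labels and shares, and the noise scale are assumptions invented for illustration. It is not the code behind the numbers quoted in this post, so the means it produces will differ.

```python
import numpy as np
import pandas as pd


def simulate_population_sketch(n=50_000, seed=0):
    """Hypothetical stand-in for simulate_population; all distributions are assumed."""
    rng = np.random.default_rng(seed)

    # Assumed covariate distributions (illustrative only)
    age = rng.normal(45, 15, n).clip(18, 90)
    gender = rng.choice(["Female", "Male"], size=n, p=[0.51, 0.49])
    education = rng.choice(["HS", "College", "Graduate"], size=n, p=[0.5, 0.3, 0.2])
    region = rng.choice(["Urban", "Rural"], size=n, p=[0.6, 0.4])
    income = rng.lognormal(mean=10.5, sigma=0.5, size=n)

    # Same functional form as the happiness formula above, plus assumed Gaussian noise
    happiness = (
        50
        + 0.20 * (age - 45)
        + (education == "Graduate") * 8
        + np.log(income) * 2
        + rng.normal(0, 5, n)
    )

    return pd.DataFrame({
        "age": age,
        "gender": gender,
        "education": education,
        "region": region,
        "income": income,
        "happiness": happiness,
    })


demo_df = simulate_population_sketch(1_000)
print(demo_df.head())
```

Fixing the seed keeps the "answer key" stable between runs, which makes it easier to compare weighting methods later.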
Now for the fun part. Let's create a biased sample of 2,000 people from this population. We'll deliberately over-sample people who are younger, more educated, and live in urban areas. This is a super common bias you see in online surveys.
def biased_sample(pop, n=2_000):
    # This function creates a "score" to make certain people more likely to be picked
    score = (
        -0.04 * (pop["age"] - 30)                  # Younger people get a higher score
        + (pop["education"] == "Graduate") * 1.0   # Grads get a higher score
        + (pop["region"] == "Urban") * 0.7         # Urbanites get a higher score
    )
    p = 1 / (1 + np.exp(-score))  # Convert score to a probability
    p = p / p.sum()               # Normalize so the probabilities sum to 1
    idx = np.random.choice(pop.index, size=n, replace=False, p=p)
    return pop.loc[idx].reset_index(drop=True)

sample_df = biased_sample(target_df, 2_000)
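To get a feel for how strong that bias is, we can plug two made-up people into the same score and compare their selection probabilities before normalization. The helper names and the "HS" and "Rural" labels are assumptions for illustration, and because the real draw is without replacement, the ratio below only approximates the true difference in inclusion chances.

```python
import math


def selection_score(age, education, region):
    """Same score as in biased_sample, evaluated for a single person."""
    return (
        -0.04 * (age - 30)
        + (education == "Graduate") * 1.0
        + (region == "Urban") * 0.7
    )


def sigmoid(x):
    """Map a score to a value between 0 and 1."""
    return 1 / (1 + math.exp(-x))


# Two hypothetical respondents at opposite ends of the bias
young_urban_grad = sigmoid(selection_score(25, "Graduate", "Urban"))
older_rural_nongrad = sigmoid(selection_score(65, "HS", "Rural"))
ratio = young_urban_grad / older_rural_nongrad

print(f"Young urban graduate:     {young_urban_grad:.3f}")
print(f"Older rural non-graduate: {older_rural_nongrad:.3f}")
print(f"Ratio:                    {ratio:.1f}x")
```

The young urban graduate ends up with more than four times the selection weight of the older rural non-graduate.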
Okay, the trap is set. Let's see the damage.
print(f"TRUE population mean happiness: {target_df['happiness'].mean():.2f}")
print(f"Naive sample mean happiness: {sample_df['happiness'].mean():.2f} <-- Yikes!")
# TRUE population mean happiness: 58.75
# Naive sample mean happiness: 62.15 <-- Yikes!
See that? Our biased sample thinks people are way happier than they actually are. This is exactly the kind of error that leads to bad business decisions. Now, let's fix it.
How Bad Is the Damage? Running the Diagnostics
The first step in any fix is to understand the problem. The balance library gives us a great tool for this called the Absolute Standardized Mean Difference, or ASMD.
Don't let the name scare you. Think of ASMD as a "balance score." It compares our sample to the true population for each characteristic (like age, income, etc.).
- An ASMD close to 0 means your sample looks just like the population. Perfect balance.
- An ASMD above 0.1 is a red flag by a common rule of thumb. It tells you that the characteristic is meaningfully out of whack.
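If you'd like to see what's under the hood, here is a hand-rolled version for a single numeric characteristic. It's a sketch under assumptions: the function `asmd_by_hand` and the toy age data are made up, and it divides by the target's standard deviation. Conventions differ on that denominator (target versus pooled), so check what your tool uses before comparing numbers.

```python
import numpy as np


def asmd_by_hand(sample_values, target_values):
    """|mean(sample) - mean(target)| / std(target) for one numeric covariate."""
    sample_values = np.asarray(sample_values, dtype=float)
    target_values = np.asarray(target_values, dtype=float)
    gap = abs(sample_values.mean() - target_values.mean())
    return gap / target_values.std(ddof=1)


rng = np.random.default_rng(0)
target_age = rng.normal(45, 15, 50_000)  # assumed population ages
sample_age = rng.normal(39, 15, 2_000)   # assumed sample, about 6 years younger

asmd_age = asmd_by_hand(sample_age, target_age)
print(f"ASMD for age: {asmd_age:.3f}")
```

A six-year gap against a standard deviation of 15 lands far above the 0.1 line. Categorical characteristics are usually turned into 0/1 indicator columns first, with the same formula applied to each.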
Let's check our scores.
sample = Sample.from_frame(sample_df, ...)
target = Sample.from_frame(target_df, ...)
sample_with_target = sample.set_target(target)
asmd_before = sample_with_target.covars().asmd()
print(asmd_before.T.round(3))
The output shows us exactly where the problems are. We’ll likely see high ASMD scores for age, education, and region—the very things we introduced bias on. This confirms our sample is a poor reflection of reality.
The Toolkit: Four Ways to Rebalance Your Data
This is where the magic happens. We're going to apply "weights" to each person in our sample.
The idea is simple: if our sample has too few older people, we give each older person a little more "voice" by assigning them a weight greater than 1. If we have too many urbanites, we give them a weight less than 1. When we calculate our average happiness again, we'll use these weights, and the result should be much closer to the truth.
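Here's that idea in miniature. The five respondents, their scores, and the 50/50 population split are all invented for illustration.

```python
import numpy as np

# Hypothetical mini-sample: 4 younger respondents and 1 older respondent
happiness = np.array([62.0, 64.0, 61.0, 63.0, 56.0])
is_older = np.array([False, False, False, False, True])

# Assume the population is half older, half younger.
# The sample is 20% older and 80% younger, so weight = population share / sample share.
weights = np.where(is_older, 0.5 / 0.2, 0.5 / 0.8)

naive_mean = happiness.mean()
weighted_mean = np.average(happiness, weights=weights)

print(f"Weights:       {weights}")
print(f"Naive mean:    {naive_mean:.2f}")
print(f"Weighted mean: {weighted_mean:.2f}")
```

The lone older respondent gets a weight of 2.5 and each younger one gets 0.625, which pulls the average from 61.20 down to 59.25.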
The balance library offers several ways to calculate these weights. We'll try four popular methods.
1. Inverse Probability Weighting (IPW)
This is the classic workhorse. It estimates the probability that each person was selected into your sample, then gives them a weight equal to the inverse of that probability. People who were unlikely to be picked but made it in anyway (like our rare rural, non-college-educated respondents) get a high weight. It's a solid, all-around method.
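The weighting step itself is tiny. The sketch below cheats by using the true selection probabilities, which you never have in practice (a real IPW fit has to estimate them with a model, and that estimation is the hard part). The toy population and variable names are assumptions, and each person is selected independently here rather than through the fixed-size draw used in biased_sample.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Assumed toy population: age drives both the outcome and the chance of responding
age = rng.uniform(18, 80, n)
outcome = 50 + 0.20 * (age - 45) + rng.normal(0, 5, n)

# Younger people respond more often (same shape as the age term in biased_sample)
p_select = 1 / (1 + np.exp(0.04 * (age - 30)))
selected = rng.random(n) < p_select

# IPW: weight each respondent by 1 / P(selected)
ipw_weights = 1 / p_select[selected]

true_mean = outcome.mean()
naive_mean = outcome[selected].mean()
ipw_mean = np.average(outcome[selected], weights=ipw_weights)

print(f"True mean:  {true_mean:.2f}")
print(f"Naive mean: {naive_mean:.2f}")
print(f"IPW mean:   {ipw_mean:.2f}")
```

With the right probabilities, the weighted mean lands much closer to the population mean than the naive one does.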
2. Covariate Balancing Propensity Score (CBPS)
This is a modern, smarter version of IPW. Instead of only trying to predict who gets selected, CBPS fits the selection model while also requiring that the resulting weights make the sample's characteristics (the covariates) match the population's characteristics. It's often a bit more stable and accurate than traditional IPW.
3. Raking
Imagine you know from the census that your population is 51% female and 20% have a college degree. Raking adjusts the weights one characteristic at a time, cycling through them repeatedly, until your weighted sample matches all of those known totals at once. It's incredibly useful when you know the population totals for each characteristic but not for their combinations.
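That back-and-forth adjustment is short enough to write by hand. Below is a minimal sketch of the iteration, not the balance implementation: the function `rake_by_hand` and the 100 toy respondents are made up, and it assumes every category in the targets has at least one respondent.

```python
import numpy as np


def rake_by_hand(categories, targets, n_iter=100, tol=1e-12):
    """
    categories: dict of name -> array of labels, one per respondent
    targets:    dict of name -> dict of label -> population share (shares sum to 1)
    Returns weights scaled to average 1.
    """
    n = len(next(iter(categories.values())))
    w = np.ones(n)
    for _ in range(n_iter):
        max_shift = 0.0
        for name, labels in categories.items():
            labels = np.asarray(labels)
            total = w.sum()
            factors = np.ones(n)
            # Scale each category so its weighted share hits the target
            for label, share in targets[name].items():
                mask = labels == label
                factors[mask] = share / (w[mask].sum() / total)
            w = w * factors
            max_shift = max(max_shift, np.abs(factors - 1).max())
        if max_shift < tol:  # nothing moved this round, so all margins match
            break
    return w * n / w.sum()


# Hypothetical sample of 100: 30% female and 50% with a degree
gender = np.array(["F"] * 30 + ["M"] * 70)
degree = np.array(["yes"] * 20 + ["no"] * 10 + ["yes"] * 30 + ["no"] * 40)

rake_weights = rake_by_hand(
    {"gender": gender, "degree": degree},
    {"gender": {"F": 0.51, "M": 0.49}, "degree": {"yes": 0.20, "no": 0.80}},
)

female_share = rake_weights[gender == "F"].sum() / rake_weights.sum()
degree_share = rake_weights[degree == "yes"].sum() / rake_weights.sum()
print(f"Weighted female share: {female_share:.4f}")
print(f"Weighted degree share: {degree_share:.4f}")
```

Fixing one margin nudges the other slightly off, which is why the loop keeps cycling until neither needs to move.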
4. Post-stratification
This is similar to raking, but instead of matching each characteristic separately, it matches their combinations. It creates a bunch of little buckets (e.g., "Urban Females with a HS degree," "Rural Males with a Graduate degree"), figures out how big each bucket should be based on the population, and scales the weights within each bucket to match.
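The bucket arithmetic is just a ratio of shares. Here's a small sketch with invented buckets and counts. The function `poststratify_by_hand` is hypothetical, and it assumes every population bucket shows up at least once in the sample (an empty bucket has nobody to carry its weight).

```python
from collections import Counter


def poststratify_by_hand(sample_cells, population_cells):
    """Weight per respondent = population share of their bucket / sample share of their bucket."""
    sample_counts = Counter(sample_cells)
    pop_counts = Counter(population_cells)
    n_sample, n_pop = len(sample_cells), len(population_cells)
    cell_weight = {
        cell: (pop_counts[cell] / n_pop) / (count / n_sample)
        for cell, count in sample_counts.items()
    }
    return [cell_weight[cell] for cell in sample_cells]


# Hypothetical population of 100 and sample of 20, bucketed by (region, education)
population_cells = (
    [("Urban", "Graduate")] * 20 + [("Urban", "HS")] * 40
    + [("Rural", "Graduate")] * 5 + [("Rural", "HS")] * 35
)
sample_cells = (
    [("Urban", "Graduate")] * 10 + [("Urban", "HS")] * 6
    + [("Rural", "Graduate")] * 2 + [("Rural", "HS")] * 2
)

ps_weights = poststratify_by_hand(sample_cells, population_cells)
weight_by_cell = dict(zip(sample_cells, ps_weights))
for cell, weight in weight_by_cell.items():
    print(f"{cell}: {weight:.2f}")
```

Urban graduates are half the sample but only 20% of the population, so each gets a weight of 0.4, while the two rural HS respondents each get 3.5.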
Let's run them all. The balance library makes this shockingly easy.
# IPW
adjusted_ipw = sample_with_target.adjust(method="ipw")
# CBPS
adjusted_cbps = sample_with_target.adjust(method="cbps")
# Raking
adjusted_rake = sample_with_target.adjust(method="rake")
# Post-stratification
adjusted_post = sample_cat.set_target(target_cat).adjust(method="poststratify")
The Showdown: Which Method Worked Best?
Okay, we've run our four algorithms. Now for the moment of truth. Let's compare them on two key metrics:
- Balance (Mean ASMD): How well did they fix the imbalance we saw earlier? Lower is better.
- Accuracy (Mean Happiness): How close did their weighted estimate get to the true population happiness of 58.75?
Here’s a summary of what the results might look like:
| Method | Mean ASMD (Balance) | Estimated Happiness (Accuracy) |
| :------------- | :------------------ | :----------------------------- |
| Unadjusted | 0.3514 | 62.15 |
| IPW | 0.0151 | 58.81 |
| CBPS | 0.0149 | 58.80 |
| Rake | 0.0000 | 58.78 |
| PostStrat | 0.0000 | 58.75 |
| TRUE POP | -- | 58.75 |
Wow. Look at that!
Every single method dramatically improved our balance, bringing the mean ASMD way down below the 0.1 danger threshold. And more importantly, they all brought our estimate of happiness incredibly close to the true value of 58.75. The naive, biased estimate was off by over 3 points, but our corrected estimates are nearly perfect.
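To put numbers on "incredibly close," here is the gap between each estimate in the table and the true value. The figures are copied straight from the table above; only the variable names are new.

```python
true_mean = 58.75
estimates = {
    "Unadjusted": 62.15,
    "IPW": 58.81,
    "CBPS": 58.80,
    "Rake": 58.78,
    "PostStrat": 58.75,
}

# Absolute error of each estimate, rounded to the table's precision
errors = {method: round(abs(value - true_mean), 2) for method, value in estimates.items()}

for method, err in errors.items():
    print(f"{method:<11} off by {err:.2f}")
```

The unadjusted estimate misses by 3.40 points, while every weighted estimate is within 0.06.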
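This is the power of re-weighting. We took a broken sample and, without collecting any new data, we fixed it. It works here because the bias came entirely from characteristics we measured (age, education, and region), so the weights had everything they needed to undo it.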
A Picture Is Worth a Thousand Numbers
Sometimes, charts tell the story better than tables. Let's visualize what happened.
First, let's look at the age distribution. In the chart below, you can see the true population (green) has a nice bell curve around middle age. Our biased sample (red) is skewed way too young. But after we apply the IPW weights, the new distribution (blue) is a nearly perfect match for the truth.
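If you want to build the numbers behind a chart like this yourself, a weighted histogram does the job. This is a sketch on assumed toy data, with inverse selection probabilities standing in for fitted IPW weights; the three resulting arrays can be handed to whatever plotting library you prefer.

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed toy data: population ages, and a sample skewed toward the young
pop_age = rng.normal(45, 15, 50_000).clip(18, 90)
p_select = 1 / (1 + np.exp(0.04 * (pop_age - 30)))
picked = rng.random(pop_age.size) < p_select * 0.1  # scaled down to keep the sample small
sample_age = pop_age[picked]
weights = 1 / p_select[picked]  # stand-in for IPW weights

# Densities on shared bins: population, raw sample, weighted sample
bins = np.arange(18, 95, 6)
pop_hist, _ = np.histogram(pop_age, bins=bins, density=True)
raw_hist, _ = np.histogram(sample_age, bins=bins, density=True)
weighted_hist, _ = np.histogram(sample_age, bins=bins, weights=weights, density=True)

# Total absolute gap between each sample curve and the population curve
gap_raw = np.abs(raw_hist - pop_hist).sum()
gap_weighted = np.abs(weighted_hist - pop_hist).sum()
print(f"Gap before weighting: {gap_raw:.4f}")
print(f"Gap after weighting:  {gap_weighted:.4f}")
```

The gap shrinks after weighting, which is the same story the chart tells visually.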




