A Perfectly Cromulent Intro to Simpson's Paradox


In this post, we will go over one of my favorite statistical phenomena, Simpson's paradox, using interactive data modules.

The Paradox

What Is Simpson's Paradox?

In technical terms, Simpson's paradox "refers to a phenomenon whereby the association between a pair of variables (X, Y) reverses sign upon conditioning on a third variable, Z, regardless of the value taken by Z" (Pearl 1). What this means is that we can observe one trend when we are looking at our data in aggregate and a completely different trend when we segment our data set.

A classic example of Simpson's paradox comes from the UC Berkeley gender discrimination case. The graph below shows the admissions results from 6 UC Berkeley graduate departments in 1973, grouped by gender. As you can see, when we look at the raw statistics, men are 15% more likely to be admitted than women.

However, if you hit the "split" button to look at the admission results by individual department, you will notice that women are actually more likely to be admitted than men in 4 out of 6 departments. So if we look at admission rates by gender alone, we conclude that men are more likely to be admitted; but if we split by gender and department, we conclude that women are. That's Simpson's paradox in action.
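If you want to reproduce the numbers behind the interactive module, here is a short Python sketch. The counts are the department-level figures reported by Bickel, Hammel & O'Connell in their 1975 analysis of this case (the standard published version of the dataset), not values pulled from the module itself.

```python
# Admissions counts from the classic 1973 Berkeley dataset
# (Bickel, Hammel & O'Connell); format: dept -> (admitted, applied).
men = {"A": (512, 825), "B": (353, 560), "C": (120, 325),
       "D": (138, 417), "E": (53, 191),  "F": (22, 373)}
women = {"A": (89, 108),  "B": (17, 25),  "C": (202, 593),
         "D": (131, 375), "E": (94, 393), "F": (24, 341)}

def rate(admitted, applied):
    return admitted / applied

# Aggregated over all departments, men look clearly favored...
m_adm, m_app = map(sum, zip(*men.values()))
w_adm, w_app = map(sum, zip(*women.values()))
print(f"overall: men {rate(m_adm, m_app):.1%}, women {rate(w_adm, w_app):.1%}")
# -> overall: men 44.5%, women 30.4%

# ...but department by department, women have the higher rate in 4 of 6.
for dept in men:
    mr, wr = rate(*men[dept]), rate(*women[dept])
    winner = "women" if wr > mr else "men"
    print(f"dept {dept}: men {mr:.1%}, women {wr:.1%}  ({winner} higher)")
```

Running this shows the reversal directly: the aggregate gap favors men by about 14 percentage points, while women out-admit men in departments A, B, D, and F.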

What Causes Simpson's Paradox to Occur?

Unequal Distribution

A common source of confusion around Simpson's paradox is the assumption that subpopulations are uniformly distributed. Returning to the UC Berkeley example above, it is easy to be perplexed by the results if you assume that the male and female applicants behaved uniformly. However, when we look at the distribution of applicants by department

We see that men and women had very different preferences for which departments they applied to.

Furthermore, when we look at the general admissions rates for each department

We see that departments A and B had the highest admissions rates but were the least popular departments among female applicants, while the more competitive departments drew a higher proportion of female applicants. In contrast, 50% of men applied to the departments with the highest acceptance rates.

The difference in distribution allows us to resolve the paradox. Women could have a higher admissions rate department by department but a lower admissions rate overall because female applicants were overrepresented in highly selective programs, producing a misleading average. This process can be visualized in the animations below, which combine the departments from left to right.
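The arithmetic behind this resolution is just a weighted average. The numbers below are hypothetical, chosen for clarity rather than taken from the Berkeley data, but they reproduce the same reversal in miniature: women have the higher rate in both departments yet the lower rate overall, purely because of where each group applies.

```python
# Toy example (hypothetical numbers): an overall admission rate is a
# weighted average of per-department rates, weighted by applicant counts.
#                 (applicants, admission rate)
women = {"easy": (10, 0.80), "hard": (90, 0.30)}
men   = {"easy": (90, 0.60), "hard": (10, 0.25)}

def overall(group):
    total = sum(n for n, _ in group.values())
    return sum(n * r for n, r in group.values()) / total

# Women beat men in both departments (80% > 60%, 30% > 25%), yet:
print(f"women overall: {overall(women):.1%}")  # -> 35.0%
print(f"men overall:   {overall(men):.1%}")    # -> 56.5%
```

Because 90% of the women apply to the "hard" department, their overall rate is pulled toward 30%, while the men's overall rate is pulled toward the "easy" department's 60%.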

Now that you know that segmented data and aggregate data can give contradictory results, you may be wondering which result to believe. Most articles I have read on Simpson's paradox focus on cases where aggregating the data leads to a false conclusion and segmenting gives the correct result. However, this is misleading. As we will see, the aggregated result can be wrong, the segmented result can be wrong, or both can be wrong. In order to know which result is correct, we need to know what is causing the paradox to occur in the data.

Conditioning On A Collider

In statistics, when two variables are correlated, it could be because the variables are causally linked, or it could be because a hidden third variable links the two. For example, smoking stains teeth and smoking causes lung cancer, so we may find a correlation between having lung cancer and having stained teeth, even though stained teeth don't cause lung cancer. In this example, smoking is called a confounder because it influences both the dependent and independent variables. Conditioning on a confounder generally produces more meaningful results: if we look at stained teeth vs. lung cancer in a population of only smokers or only nonsmokers, we expect the correlation to disappear because we have accounted for the hidden variable.
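The smoking example can be sketched as a small simulation. The probabilities below are hypothetical, chosen only to make the effect visible: smoking raises the chance of both stained teeth and lung cancer, and there is no direct teeth-to-cancer link, yet the pooled data shows a strong association that vanishes once we condition on the confounder.

```python
import random

random.seed(1)

# Hypothetical probabilities: smoking causes both stained teeth and
# lung cancer; teeth and cancer have no direct causal connection.
n = 20_000
rows = []
for _ in range(n):
    smoker = random.random() < 0.3
    teeth = random.random() < (0.7 if smoker else 0.1)    # stained teeth
    cancer = random.random() < (0.2 if smoker else 0.02)  # lung cancer
    rows.append((smoker, teeth, cancer))

def cancer_rate(data, stained):
    sub = [c for _, t, c in data if t == stained]
    return sum(sub) / len(sub)

# Pooled data: stained teeth appear to "predict" cancer...
print(f"pooled:       stained {cancer_rate(rows, True):.3f} "
      f"vs clean {cancer_rate(rows, False):.3f}")

# ...but within each smoking stratum, the association disappears.
for s in (True, False):
    stratum = [r for r in rows if r[0] == s]
    print(f"smoker={s}: stained {cancer_rate(stratum, True):.3f} "
          f"vs clean {cancer_rate(stratum, False):.3f}")
```

In the pooled data the stained-teeth group has a much higher cancer rate simply because it contains mostly smokers; within each stratum the two rates are essentially equal.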

But what happens if the situation is reversed, and our dependent variable, A, and independent variable, B, both cause a third variable, C? In this case, C is called a collider because it is caused by both the dependent and independent variables, and conditioning on a collider can create statistical correlations where there should be none or erase correlations that should exist. So when we partition our data on a collider, we can obtain misleading correlations. Unfortunately, people sometimes partition their data on a collider thinking that it is a confounder, since the two can be hard to distinguish.
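A small simulation makes the collider effect concrete. The setup below is hypothetical: A and B are generated completely independently, C depends on both, and conditioning on C manufactures a correlation between A and B out of nothing.

```python
import random

random.seed(0)

# A and B are independent; C is a collider caused by both.
n = 10_000
A = [random.gauss(0, 1) for _ in range(n)]
B = [random.gauss(0, 1) for _ in range(n)]
C = [a + b > 1.0 for a, b in zip(A, B)]

def pearson(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# In the full population, A and B are uncorrelated...
r_all = pearson(A, B)

# ...but among the C = True subgroup, a negative correlation appears:
# given that a + b cleared the threshold, a large A makes a large B
# less necessary, and vice versa.
sel = [(a, b) for a, b, c in zip(A, B, C) if c]
r_sel = pearson([a for a, _ in sel], [b for _, b in sel])

print(f"correlation overall:      {r_all:+.3f}")
print(f"correlation given C=True: {r_sel:+.3f}")
```

This is exactly the trap described above: if an analyst mistook C for a confounder and segmented on it, they would report a relationship between A and B that does not exist in the population.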

How Does This Apply to Data Science?

The interesting thing about Simpson's paradox is that there is no mathematical rule to determine whether you should use the aggregate statistic or the segmented statistic. It is essentially up to the statistician to use the context of the situation to determine which statistic is correct to use. This is why I love Simpson's paradox so much; not only is it an example of how counterintuitive statistics can be, it is also a reminder that it is our job to always take context into consideration when working with statistics.

Further Reading

Normal Deviate gives a more mathematical explanation of Simpson's paradox.

Victor Powell's blog has some great visualizations for Simpson's paradox. I modeled my visualizations off of his.

Understanding Simpson’s Paradox by Judea Pearl is an academic paper that goes into the history and implications of the paradox.

The Wikipedia page gives a brief overview of Simpson's paradox and provides multiple real-world instances of Simpson's paradox in action.
