{ "cells": [ { "cell_type": "markdown", "metadata": { "hide_input": true }, "source": [ "In this post, we will go over one of my favorite statistical phenomenons, Simpson's paradox, using interactive data modules. \n", "" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "hide_input": true, "hide_output": true }, "outputs": [], "source": [ "library(reshape2)\n", "library(ggplot2)\n", "library(plotly)\n", "library(dplyr)\n", "library(scales)" ] }, { "cell_type": "markdown", "metadata": { "hide_input": true, "toc": "true" }, "source": [ " # Table of Contents\n", "
In technical terms, Simpson's paradox \"refers to a phenomenon whereby the the association between a pair of variables (X, Y) reverses sign upon conditioning of a third variable, Z, regardless of the value taken by Z\"(Pearl 1). What this means is that we can observe one trend when we are looking at our data in aggregate and a completely different trend when we segment our data set.
\n", "A classic example of Simpson's paradox comes from the UC Berkley gender discrimination case. The graph bellow shows the admissions results from 6 UCB graduate departments in 1973, sorted by gender. As you can see, when we look at the raw statistics men are 15% more likely to be admitted than women.
\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [], "source": [ "data(UCBAdmissions)\n", "UCB <- as.data.frame(UCBAdmissions)\n", "UCB <- mutate(UCB, Admit = ifelse(Admit ==\"Admitted\", \"Accepted\", \"Rejected\"))\n", "\n", "gender <-\n", " group_by(UCB, Gender) %>%\n", " summarize(Total=sum(Freq))\n", "\n", "gender.accpt <-\n", " group_by(UCB, Admit, Gender) %>%\n", " summarize(Freq=sum(Freq)) %>%\n", " inner_join(gender) %>%\n", " mutate(Prop=Freq/Total)\n", "\n", "\n", "dept <-\n", " group_by(UCB, Dept) %>%\n", " summarize(Total=sum(Freq))\n", "\n", "dept.accpt <-\n", " group_by(UCB, Admit, Dept) %>%\n", " summarize(Freq=sum(Freq)) %>%\n", " inner_join(dept) %>%\n", " mutate(Prop=Freq/Total)\n", "\n", "\n", "dept.gender <-\n", " group_by(UCB, Gender, Dept) %>%\n", " summarize(Freq=sum(Freq)) %>%\n", " inner_join(dept) %>%\n", " mutate(Prop=Freq/Total)\n", "\n", "\n", "UCB.totals <- group_by(UCB, Gender, Dept) %>%\n", " summarise(Total=sum(Freq))\n", "\n", "UCB <- inner_join(UCB, UCB.totals) %>%\n", " mutate(Prop=Freq/Total)\n" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/html": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "Department = c(\"A\", \"B\", \"C\", \"D\", \"E\", \"F\")\n", "Men = UCB$Prop[UCB$Gender == \"Male\" & UCB$Admit == \"Accepted\"]\n", "Women = UCB$Prop[UCB$Gender == \"Female\" & UCB$Admit == \"Accepted\"]\n", "\n", "proportion = data.frame(Department, Men, Women)\n", "\n", "AdmissionRates = plot_ly(proportion) %>%\n", " add_bars(x = ~Department, y = ~Women, name = 'Female', visible = FALSE, color = I(\"blue\")) %>%\n", " add_bars(x = ~Department, y = ~Men, name = 'Male', visible = FALSE, color = I(\"red\")) %>%\n", " add_bars(x = ~c(\"Female\"), y = ~c(0.3035422), name = 'Female', color = I(\"blue\")) %>%\n", " add_bars(x = ~c(\"Male\"), y = ~c(0.4451877), name = 'Male', color = I(\"red\")) %>%\n", " layout(title = \"Admission Rates\", xaxis = list(title = 'Gender'), yaxis = list(tickformat = \"%\", title = 'Percentage Admitted', range = c(0, 1)), barmode = 'group',\n", " updatemenus = list(\n", "\n", " list(\n", " type = \"buttons\", \n", " y = 1,\n", " buttons = list(\n", " list(\n", " label = \"Aggregate\",\n", " method = \"update\",\n", " args = list(list(visible = c(FALSE, FALSE,TRUE, TRUE)),\n", " list(xaxis.title = \"Gender\"))),\n", "\n", " list(\n", " label = \"Split\",\n", " method = \"update\",\n", " args = list(list(visible = c(TRUE, TRUE, FALSE, FALSE)),\n", " list(xaxis.title = \"Department & Gender\"),\n", " list(title = \"Admission Numbers by Gender & Department\")))\n", " )\n", " ) \n", " )\n", ")\n", "\n", "embed_notebook(AdmissionRates, height = 500)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "However, if you hit the \"split\" button to look at the admission results by individual department, you will notice that women are actually more likely to be admitted than men in 4 out of 6 departments. So if we look at admission rates by gender we reach the conclusion that men are more likely to be admitted, but if we split by gender and department, we conclude that women are more likely to be admitted. That's Simpson's paradox in action.
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# What Causes Simpson's Paradoc to Occur" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Unequal Distribution\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A common cause of Simpson's Paradox is assuming that there is uniform distribution among subpopulations. Returning to the UC Berkley example above, it is easy to be perplexed by the results if you assume that the male and female applicants are behaving uniformly. However, when we look at the distribution of applicants by department
" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/html": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "Department = c(\"A\", \"B\", \"C\", \"D\", \"E\", \"F\")\n", "Men = UCB$Total[UCB$Gender == \"Male\" & UCB$Admit == \"Accepted\"]\n", "Women = UCB$Total[UCB$Gender == \"Female\" & UCB$Admit == \"Accepted\"]\n", "\n", "proportion = data.frame(Department, Men, Women)\n", "\n", "Applications = plot_ly(proportion) %>%\n", " add_bars(x = ~Department, y = ~Women, name = 'Female', visible = TRUE, color = I(\"blue\")) %>%\n", " add_bars(x = ~Department, y = ~Men, name = 'Male', visible = TRUE, color = I(\"red\")) %>%\n", " add_bars(x = ~Department, y = ~(Women/sum(Women)), name = 'Female', visible = FALSE, color = I(\"blue\")) %>%\n", " add_bars(x = ~Department, y = ~(Men/sum(Men)), name = 'Male', visible = FALSE, color = I(\"red\")) %>%\n", " layout(title = \"Applications\", xaxis = list(title = 'Department'), yaxis = list( title = 'Number of Students Applied'), barmode = 'group',\n", " updatemenus = list(\n", "\n", " list(\n", " type = \"buttons\", \n", " y = 1.2,\n", " buttons = list(\n", " list(\n", " label = \"Count\",\n", " method = \"update\",\n", " args = list(list(visible = c(TRUE, TRUE, FALSE, FALSE)),\n", " list(yaxis=list(tickformat=\"\", title = \"Number of Students Applied\", range = c(0,800))))), \n", " list(\n", " label = \"Percentage\",\n", " method = \"update\",\n", " args = list(list(visible = c(FALSE, FALSE,TRUE, TRUE)),\n", " list(yaxis=list(tickformat = \"%\", title = \"Percentage of Students Applied\",range = c(0, .5)))))\n", "\n", "\n", " )\n", " ) \n", " )\n", ")\n", "\n", "embed_notebook(Applications, height = 500)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We see that men and women had very different preferences for which department they were applying to.
\n", "Furthermore, when we look at the general admissions rates for each department
" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "hide_input": false, "scrolled": true }, "outputs": [ { "data": { "text/html": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "plot =\n", " ggplot(dept.accpt[dept.accpt$Admit == \"Accepted\",], aes(x=Dept, y=Prop)) +\n", " ggtitle(\"Acceptance Rate By Department\") +\n", " guides(fill = guide_legend(reverse = TRUE)) +\n", " geom_bar(stat = \"identity\", fill =\"blue\") +\n", " xlab(\"Department\") + \n", " scale_y_continuous(labels = percent_format())\n", "plot = ggplotly(plot)%>% layout(yaxis=list(title = \"\"))\n", "\n", "embed_notebook(plot)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We see that departments A and B had the highest admissions rate, but were the least popular departments among female applicants, while more competitive departments had a higher proportion by female applicants. In contrast, 50% of men applied to the departments with the highest acceptance rates.
\n", "\n", "The difference in distribution allows us to resolve the paradox. Women could have a higher admissions rate by department but a lower admissions rate overall compared to men because the female applicants were overrepresented in highly selective programs, causing a misleading average result. This process can be visualized in the animations bellow, which combines the departments from left to right.
" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "hide_input": true }, "outputs": [], "source": [ "Department = c(\"A\", \"B\", \"C\", \"D\", \"E\", \"F\", \"Combined\")\n", "WomenAccepted = c(89, 17, 202, 131, 94, 24, 557)\n", "WomenRejected = c(19, 8, 391, 244, 299, 317, 1278)\n", "MenAccepted = c(512, 353, 120, 138, 53, 22, 1198 )\n", "MenRejected = c(313, 207, 205, 279, 138, 351, 1493)\n", "f =c(1,1,1,1,1,1,7) \n", "data = data.frame(Department, WomenAccepted, WomenRejected, MenAccepted, MenRejected, f)\n", "\n", "xform <- list(categoryorder = \"array\",\n", " categoryarray = c(\"A\", \"B\", \"C\", \"D\", \"E\", \"F\", \"Combined\"))\n", "\n", "\n", "for (i in seq(5)){\n", " ndata = data[(i+1):6,]\n", " ndata[1,2:5] = colSums(data[1:(i+1),2:5]) \n", " ndata$f = (i+1)\n", " data = rbind(data, ndata)\n", "}\n", "\n", "data$textw = data$WomenAccepted/(data$WomenAccepted+data$WomenRejected)\n", "data$textw = round(data$textw, 3)*100\n", "data$textw = paste(data$textw, \"% Admitted\", sep = \"\")\n", "\n", "data$textm = data$MenAccepted/(data$MenAccepted+data$MenRejected)\n", "data$textm = round(data$textm, 3)*100\n", "data$textm = paste(data$textm, \"% Admitted\", sep = \"\")" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/html": [ "