Simpson’s Paradox: The Danger of Over-Aggregation

Picture this: You’re the manager at a telemarketing company and you want to look at your employee performance. Specifically, you want to view employees’ average time spent on calls, and their average sales amount per call. You create a scatterplot to easily view this information.

Scatterplot | Astrato Analytics

Surprisingly, you notice that employees who spend longer on calls actually have lower sales amounts. You wonder how to make sense of this information: should you be encouraging employees to speed through their sales pitches?

But as you explore the data more, you uncover additional information that totally changes your perspective! You filter down to just the junior level employees. Wait! Now the relationship has changed. It looks like being on the call longer is associated with higher sales. 

Junior employees | Astrato

Understanding contradictions in your data

You’re on to something! Out of curiosity, you filter to just your mid-level employees, and then to just your senior employees. 

Mid and senior employees | Astrato

You’re seeing more of the same: longer calls go with higher sales. So how can each individual group, when evaluated on its own, show a clear positive relationship, while all three groups evaluated together show exactly the opposite?

Let’s put them all back on one graph, with the colors for each group added.

All employees | Astrato

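Now we can see the whole picture and extract meaningful insight. The senior-level employees are experienced and can convey information more effectively to make a bigger sale. It takes them less time to make their point and close the deal. The junior employees, on the other hand, take more time, and their inexperience leads to lower sales. Because the groups sit in different regions of the chart (seniors at short calls and high sales, juniors at long calls and low sales), the overall trend slopes downward even though the trend within each group slopes upward.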

This seeming contradiction is an example of Simpson’s Paradox.

What is Simpson’s Paradox?

Simpson’s Paradox is a statistical phenomenon in which an association between two variables in aggregated data reverses (or disappears) when the data is disaggregated (divided into subgroups). It typically occurs when the variable used to form the subgroups is a confounding or ‘lurking’ variable that influences both of the variables being measured.

In our example, call time and sales amount seemed to have a negative relationship (longer call times went with lower sales amounts), but once we split the data by experience level (which was influencing both of these variables), the relationship within each group reversed!
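To make the reversal concrete, here is a minimal Python sketch of the telemarketing example. The numbers are invented purely for illustration (they are not the data behind the charts above), and the `calls` dictionary and helper names are hypothetical. It computes the Pearson correlation between call time and sales amount within each experience level, and then again for all employees pooled together.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical data: (average call minutes, average sale amount) per employee.
calls = {
    "junior": [(10, 20), (11, 23), (12, 22), (13, 27)],
    "mid": [(7, 40), (8, 43), (9, 42), (10, 47)],
    "senior": [(4, 60), (5, 63), (6, 62), (7, 67)],
}


def correlation(points):
    """Correlation between the x and y values of a list of (x, y) points."""
    xs, ys = zip(*points)
    return pearson(xs, ys)


# Correlation inside each experience level (disaggregated view).
by_level = {level: correlation(points) for level, points in calls.items()}

# Correlation across all employees at once (aggregated view).
overall = correlation([p for points in calls.values() for p in points])

for level, r in by_level.items():
    print(f"{level:>7}: r = {r:+.2f}")
print(f"overall: r = {overall:+.2f}")
```

With this made-up data, every experience level shows a positive correlation while the pooled correlation is negative, which is the same pattern as in the scatterplots.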

Here are some other visual examples of Simpson’s Paradox at play: 

Example: When aggregated, an increasing linear relationship between X and Y appears. When disaggregated, there is no relationship between X and Y within each subgroup. 

Example 3 Simpson's Paradox | Astrato Analytics

Example: When aggregated, there is no relationship between X and Y. When disaggregated, each subgroup has a different relationship between X and Y, one increasing and one decreasing. 

Example 2 Simpson's Paradox | Astrato Analytics
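The two patterns above can also be reproduced with a few points each. The sketch below uses invented data and hypothetical names (`flat_groups`, `opposing_groups`, `summarize`), and compares the least-squares trend-line slope of the pooled data with the slope inside each subgroup.

```python
def slope(points):
    """Least-squares slope of y on x for a list of (x, y) points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in points)
    var_x = sum((x - mean_x) ** 2 for x, _ in points)
    return cov / var_x


# Pattern 1 (hypothetical data): increasing when pooled, flat within each group.
flat_groups = {
    "A": [(1, 2), (2, 1), (3, 1), (4, 2)],
    "B": [(6, 7), (7, 6), (8, 6), (9, 7)],
}

# Pattern 2 (hypothetical data): flat when pooled, opposite trends within groups.
opposing_groups = {
    "A": [(1, 1), (2, 2), (3, 3), (4, 4)],
    "B": [(1, 4), (2, 3), (3, 2), (4, 1)],
}


def summarize(groups):
    """Return (pooled slope, {group name: within-group slope})."""
    pooled = [p for points in groups.values() for p in points]
    within = {name: slope(points) for name, points in groups.items()}
    return slope(pooled), within


flat_overall, flat_within = summarize(flat_groups)
opp_overall, opp_within = summarize(opposing_groups)

print("pattern 1:", round(flat_overall, 2), flat_within)
print("pattern 2:", round(opp_overall, 2), opp_within)
```

In the first dataset the pooled slope is positive only because group B sits higher and further right than group A. In the second, the rising and falling groups cancel each other out in the pooled view.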

The takeaway

When does this paradox show up? When creating a dashboard, there is a balance between simplicity and detail. You need to convey a lot of important information, but not so much that you overload the user. If you create one graph with a lot of aggregation behind it, but never check what happens when you apply certain filters, you may misunderstand a key relationship between variables.

Alternatively, if the aggregated version is the default presentation of your data, and you depend on whoever is consuming the dashboard to apply the filters themselves, they may totally miss the relationship switch. 

So, what do you do about Simpson’s Paradox?

Check your data carefully for this paradox! If Simpson’s Paradox is present in your data, it’s important to make it clear to the viewer. Generally the disaggregated version is the most ‘accurate,’ but not always. Which view to trust depends on the data, so critical thinking and knowledge of the data’s context are essential.
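One way to start checking is to compare the direction of the trend in the pooled data with the direction inside each subgroup. The sketch below is a rough screening aid under stated assumptions, not a statistical test: the function names and data are hypothetical, it looks only at the sign of the covariance, and it ignores sample size and noise. Any group it flags still needs a human look.

```python
def trend_sign(points, tolerance=1e-9):
    """Return +1, -1, or 0 for the direction of the linear trend in (x, y) points.

    The sign of the covariance matches the sign of the trend-line slope.
    """
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in points)
    if abs(cov) <= tolerance:
        return 0
    return 1 if cov > 0 else -1


def disagreeing_groups(groups):
    """Return (pooled trend sign, names of groups whose sign differs from it)."""
    pooled = [p for points in groups.values() for p in points]
    overall = trend_sign(pooled)
    flagged = [name for name, points in groups.items()
               if trend_sign(points) != overall]
    return overall, flagged


# Hypothetical data: (average call minutes, average sale amount) per employee.
calls = {
    "junior": [(10, 20), (11, 23), (12, 22), (13, 27)],
    "mid": [(7, 40), (8, 43), (9, 42), (10, 47)],
    "senior": [(4, 60), (5, 63), (6, 62), (7, 67)],
}

overall, flagged = disagreeing_groups(calls)
print("pooled trend sign:", overall)
print("groups that disagree:", flagged)
```

If every subgroup disagrees with the pooled trend, as happens with this invented data, that is a strong hint that the grouping variable deserves a place on the chart, for example as a color or a default filter.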