Simpson’s Paradox Part 2: Let’s Get Technical

In our last article on this topic, we introduced you to Simpson’s Paradox, a statistical paradox that occurs when an association between two variables in aggregated data reverses (or disappears) when the data is disaggregated (divided into subgroups). This happens when the variable used to form the subgroups is actually a confounding or ‘lurking’ variable that influences both of the variables being measured.

To gain an even deeper understanding of Simpson’s Paradox and how it might skew your data – and what you can do to prevent that from happening! – let’s look at another example. 

Example of Simpson’s Paradox

Here’s the scene: You’re a marketing researcher. You decide to post the exact same number of targeted ads to women, aged 18-30, on Instagram and TikTok. Then, you record the clickthrough rate on the ads to determine whether they’ve been successful at engaging your audience. You plot the results in a bar chart. 

All records | Astrato Analytics

Based on these initial results, you may conclude that ads on Instagram (2.3% clickthrough rate) perform better than ads on TikTok (1.8% clickthrough rate).

But! Since this isn’t your first time analyzing data, you know that it’s important to filter your data into different subgroups so that you can analyze any differences in results. 

And just as you suspected, when you split the data into two age groups, 25 and over and under 25, you see that in both groups, the TikTok ads had a higher clickthrough rate!

Records by Age | Astrato Analytics

So how is it that the data, when taken in aggregate, shows that Instagram ads perform better, when both subgroups show that TikTok is actually the better performing platform?

The answer is that Simpson’s Paradox has reared its head! 

In this case, age group is the lurking variable: it affects both which social media platform a person uses and how often they click on ads. The older group prefers Instagram while the younger group prefers TikTok. The older group also has more disposable income to spend and clicks on ads at a higher rate.

Flow diagram | Astrato Analytics

If you like numbers, it may help to look at the table. Take a minute to figure out how this switch in results happened by checking out the image below:

Data set | Astrato Analytics

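The 25+ group makes up a larger share of the Instagram audience and also clicks at a higher rate, which skews the aggregate results towards Instagram. Put another way, each platform’s overall clickthrough rate is a weighted average of its two age groups: Instagram’s average is dominated by the high-clicking 25+ group, while TikTok’s is dominated by the lower-clicking under-25 group.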
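If you’d rather see the switch in code than in a table, here is a minimal sketch in Python. The impression and click counts are hypothetical: they are not taken from the table above, and were simply chosen so that the overall rates come out to the 2.3% and 1.8% quoted earlier. The function name `clickthrough_rates` is also just an illustration.

```python
# Hypothetical counts, chosen for illustration only (not the article's data).
# Each row: (platform, age_group, impressions, clicks)
ROWS = [
    ("Instagram", "25+", 8000, 208),
    ("Instagram", "under 25", 2000, 22),
    ("TikTok", "25+", 2000, 60),
    ("TikTok", "under 25", 8000, 120),
]


def clickthrough_rates(rows, by_age=False):
    """Return clicks / impressions per platform,
    or per (platform, age_group) when by_age is True."""
    totals = {}
    for platform, age_group, impressions, clicks in rows:
        key = (platform, age_group) if by_age else platform
        shown, clicked = totals.get(key, (0, 0))
        totals[key] = (shown + impressions, clicked + clicks)
    return {key: clicked / shown for key, (shown, clicked) in totals.items()}


aggregate = clickthrough_rates(ROWS)
by_group = clickthrough_rates(ROWS, by_age=True)

for platform, rate in aggregate.items():
    print(f"{platform} overall: {rate:.1%}")
for (platform, age_group), rate in by_group.items():
    print(f"{platform} {age_group}: {rate:.1%}")
```

With these made-up numbers, Instagram wins overall (2.3% vs 1.8%), yet TikTok wins in the 25+ group (3.0% vs 2.6%) and in the under-25 group (1.5% vs 1.1%). Both platforms get the same total number of impressions; only the age mix differs.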

How to avoid this paradoxical pitfall!

Anticipate when it’s going to show up

When creating a dashboard, it’s critical to interrogate your data set, and ask yourself what you can do to ensure that the information you’re presenting is as accurate and easy-to-understand as possible. This means ensuring that you’ve applied all key filters.

If you create one graph, or KPI, with a lot of aggregation behind it, but never check what happens when you apply certain filters, you may misunderstand a key relationship between variables.

Alternatively, if the aggregated version is the default presentation of your data, and you depend on whoever is consuming the dashboard to apply the filters themselves, they may totally miss the relationship switch. 
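One way to build this habit into your workflow is to check for a reversal automatically before publishing an aggregated chart. The sketch below is only one possible approach, under a few assumptions: the helper name `check_for_reversal` is hypothetical, the sample rows are made up, and every row is assumed to have more than zero impressions.

```python
def check_for_reversal(rows):
    """Compare the overall winner with the winner inside each subgroup.

    rows: iterable of (platform, subgroup, impressions, clicks),
    where impressions is assumed to be greater than zero.
    Returns (overall_winner, winners_by_subgroup, reversed_everywhere).
    """
    overall = {}
    per_group = {}
    for platform, subgroup, impressions, clicks in rows:
        shown, clicked = overall.get(platform, (0, 0))
        overall[platform] = (shown + impressions, clicked + clicks)
        group = per_group.setdefault(subgroup, {})
        shown, clicked = group.get(platform, (0, 0))
        group[platform] = (shown + impressions, clicked + clicks)

    def winner(counts):
        # Platform with the highest clicks / impressions ratio.
        return max(counts, key=lambda p: counts[p][1] / counts[p][0])

    overall_winner = winner(overall)
    winners_by_subgroup = {g: winner(c) for g, c in per_group.items()}
    # True when the overall winner does not win in any subgroup.
    reversed_everywhere = all(
        w != overall_winner for w in winners_by_subgroup.values()
    )
    return overall_winner, winners_by_subgroup, reversed_everywhere


# Made-up sample data for illustration only.
sample_rows = [
    ("Instagram", "25+", 8000, 208),
    ("Instagram", "under 25", 2000, 22),
    ("TikTok", "25+", 2000, 60),
    ("TikTok", "under 25", 8000, 120),
]

overall_winner, winners_by_subgroup, reversed_everywhere = check_for_reversal(sample_rows)
if reversed_everywhere:
    print(f"Warning: {overall_winner} wins overall but loses in every subgroup.")
```

A check like this only covers the filters you think to test, so it complements, rather than replaces, knowing which variables in your data set could be lurking.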

Key takeaway

In conclusion, always be wary of Simpson’s Paradox! If it is present in your data, make that clear to the viewer. The disaggregated version is generally the most ‘accurate,’ but not always, so use critical thinking skills and your knowledge of the data context to make the best judgement.