Causation vs Correlation – What's the difference?
by iperceptions, on Mar 12, 2020
Data in the right hands can be extremely powerful and should be a key element of any decision. A famous quote by American statistician, W. Edwards Deming, reads “In God we trust. All others must bring data.”
But too often, data can be misconstrued and misunderstood. One of the biggest misunderstandings is the difference between causation and correlation.
There are countless articles that share wild, often tongue-in-cheek conclusions as a result of two strongly correlated data sets. For example, Harvard Business Review looked at the "possibility" that:
- Spending more to see sports matches reduces your likelihood to consume high-fructose corn syrup
- More iPhones sold means more people die from falling down the stairs
Obviously, these are extreme examples, but they show the dangers of not understanding the difference between causation and correlation.
While there may indeed be similarities between two data sets, some additional vetting is required before a correlation can qualify as causation.
What is the difference between correlation and causation?
Let’s start with the basics - What is the definition of causation versus correlation?
What is correlation?
The Australian Bureau of Statistics provides a great definition of correlation:
“A statistical measure (expressed as a number) that describes the size and direction of a relationship between two or more variables.”
In other words, if you were to plot the data for two variables in the same chart, a change in one variable will typically be mirrored by a positive or negative change in the other.
What are the different types of correlations?
- Positive correlation: Variable A and Variable B move in the same direction. For example, as one variable increases, the other variable increases too.
- Negative correlation: Variable A and Variable B move in opposite directions. For example, as one variable increases, the other variable decreases.
- No correlation: There is no apparent link between the two variables.
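To make the three cases concrete, here is a minimal Python sketch that computes the Pearson correlation coefficient, one common way of expressing correlation as a number between -1 and 1. The data sets and the helper name pearson_r are invented for illustration and do not come from any of the sources cited in this post.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical data, made up purely for illustration.
variable_a = [1, 2, 3, 4, 5]
same_direction = [2, 4, 5, 8, 10]      # rises as variable_a rises
opposite_direction = [10, 8, 5, 4, 2]  # falls as variable_a rises
no_pattern = [3, 1, 4, 1, 3]           # no consistent direction

r_pos = pearson_r(variable_a, same_direction)       # close to +1
r_neg = pearson_r(variable_a, opposite_direction)   # close to -1
r_none = pearson_r(variable_a, no_pattern)          # approximately 0

print(r_pos, r_neg, r_none)
```

A value near +1 corresponds to a positive correlation, a value near -1 to a negative correlation, and a value near 0 to no linear correlation.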
However, correlation does not necessarily mean that there is an actual relationship between the two variables, which brings us to causation…
What is causation?
The Australian Bureau of Statistics goes on to define causation, also known as ‘causality’, the following way:
“[It] indicates that one event is the result of the occurrence of the other event; i.e., there is a causal relationship between the two events. This is also referred to as cause and effect.”
In other words, does one variable actually impact the other?
Causation vs. Correlation Examples
The website Spurious Correlations is an entertaining resource to check out that shows many examples of data sets that correlate strongly with one another, but are not caused by one another. At least, they should not be.
Case in point: is eating margarine actually behind Maine’s divorce rate, or vice-versa?
Sticking to interesting food correlations, could mozzarella cheese be the secret fuel that powers civil engineers in their studies?
Both these charts show very strong correlations between the variables. However, unless margarine is a touchy subject in Maine, or there are ground-breaking side effects to eating large amounts of cheese, these correlation vs. causation examples are very likely cases of "correlation does not necessarily mean causation".
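One plausible reason unrelated data sets can correlate so strongly is that both simply trend in the same direction over time. The Python sketch below illustrates that idea with two made-up yearly series (the numbers and the names series_a, series_b and pearson_r are assumptions for illustration, not the Spurious Correlations data). It compares the correlation of the raw levels with the correlation of the year-to-year changes, which is one simple sanity check and not a full causal analysis.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def yearly_changes(series):
    """Difference between each value and the one before it."""
    return [later - earlier for earlier, later in zip(series, series[1:])]

# Hypothetical yearly figures; both series happen to rise over ten years.
series_a = [10, 12, 13, 15, 16, 18, 19, 21, 22, 24]
series_b = [100, 105, 115, 120, 130, 135, 145, 150, 160, 165]

# The raw levels look tightly linked because both share an upward trend.
r_levels = pearson_r(series_a, series_b)

# Once the shared trend is removed, the picture changes completely.
r_changes = pearson_r(yearly_changes(series_a), yearly_changes(series_b))

print(r_levels, r_changes)
```

In this invented data the levels are almost perfectly correlated, yet the year-to-year changes move in opposite directions, which is a hint that the shared trend, and not a real link, is doing the work.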
Why is knowing the difference between correlation and causation important?
Getting the difference right is critical.
Avinash Kaushik, Digital marketing evangelist at Google, wrote in 2016 about how not understanding the difference can be very problematic. Kaushik highlighted an article from The Economist, which featured the assertion that eating more ice cream can help boost student scores on the PISA reading scale.
"To normal people (non-analysts), this graph and article looks legit," wrote Kaushik. "After all this is a reputable site and it is a reputable team. Oh, and look there is a red line, what looks like a believable distribution, and a R-squared!"
But Kaushik wants us to think a bit harder about the data at hand and not take things at face value.
He points out that despite a reasonable correlation between these data sets, nothing establishes that one causes the other. While there may appear to be a clear link connecting reading scores to ice cream consumption, the data does not definitively reveal anything aside from that correlation.
Making bold claims
In our everyday life, at work or at home, we now have access to more data than ever before. Key decisions, opinions, even business strategies can depend on our ability to tell the difference between correlation and causation.
Kaushik uses the Economist example as a jumping-off point to remind us - and analysts everywhere - to be more skeptical of claims that draw bold conclusions from correlated data points. Kaushik's call-to-action encouraged readers to look deeper at the data and avoid the easy conclusions – to question the causal vs correlational relationships between data sets.
"Our job is to be skeptical, to dig and understand and poke and prod and to reject the outrageously wrong and if it is not outrageously wrong then to figure out how right it might be so that you can make an educated recommendation," he continued.
Causality vs. correlation is also a topic Michael Molnar examined in a recent article for Forbes. Molnar warns that:
“Confusing correlation with causation is not an unknown issue but it is becoming increasingly problematic as data increases and computers get more powerful… It gets to the heart of what we know - or think we know - about how the world works.”
Getting it right
Causality is an area that is frequently misunderstood, and it can be notoriously difficult to infer causation between two variables without doing a randomized controlled experiment.
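A small simulation can show why randomization helps. In the Python sketch below, everything is an assumption made for illustration: a hidden factor drives both variables, and x has no effect on y at all. When x is merely observed, the two still correlate; when x is assigned at random, the correlation all but disappears.

```python
import random
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

random.seed(0)  # fixed seed so the run is repeatable
n = 5000

# Hypothetical hidden factor that influences both variables.
hidden = [random.gauss(0, 1) for _ in range(n)]

# Observational data: x and y both depend on the hidden factor,
# but x has no direct effect on y.
x_observed = [h + random.gauss(0, 1) for h in hidden]
y_observed = [h + random.gauss(0, 1) for h in hidden]

# Randomized experiment: x is assigned by coin flip, so it cannot
# depend on the hidden factor. y is generated exactly as before.
x_assigned = [random.choice([0, 1]) for _ in range(n)]
y_experiment = [h + random.gauss(0, 1) for h in hidden]

r_observational = pearson_r(x_observed, y_observed)
r_randomized = pearson_r(x_assigned, y_experiment)

print(r_observational, r_randomized)
```

The observational correlation is sizeable even though x does nothing to y, while the randomized one sits close to zero. Real experiments involve far more care than this toy model, but the principle is the same.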
Correlation can still be a useful measure. However, it has limitations: the most commonly used correlation coefficient measures only how strong a linear relationship is, so it can miss relationships that follow a different shape.
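The linear limitation is easy to demonstrate. In this minimal Python sketch (the data and the helper name pearson_r are illustrative assumptions), y is completely determined by x, yet the Pearson correlation coefficient comes out as zero because the relationship is a curve, not a straight line.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# y depends entirely on x, but through a U-shaped curve.
x_values = [-3, -2, -1, 0, 1, 2, 3]
y_values = [x ** 2 for x in x_values]

# The falling left half and the rising right half cancel out,
# so the linear correlation is zero.
r_curve = pearson_r(x_values, y_values)

print(r_curve)
```

A correlation of zero therefore does not prove that two variables are unrelated, which is one more reason to plot the data before drawing conclusions.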
Understanding that correlation does not imply causation, knowing the difference between the two, and being more skeptical before making bold claims (as Kaushik warns) are critical in today’s data-driven world.
Below are some great resources that explain correlation vs cause and effect.
- Data-driven Intelligence - If correlation doesn’t imply causation, then what does?
- Harvard Business Review – When to Act on Correlation, and When Not To
- Australian Bureau of Statistics – Statistical Language - Correlation and Causation
- Sense About Science USA – Causation vs Correlation
- Towards Data Science - Everything you need to know about interpreting correlations
This blog post was originally published on June 10, 2016, and has been updated with recent stats and examples, and expanded in parts.
Banner image source: Pexels