Data in the right hands can be extremely powerful and should be a key element of any decision. One of the most famous quotes by American statistician W. Edwards Deming is, “In God we trust. Everyone else, bring data.”
But more often than not, data is misconstrued and misunderstood. One of the biggest misunderstandings is the difference between causation and correlation.
A while back, Bloomberg released a tongue-in-cheek article on the dangers of mixing the two up. The article drew wild conclusions, such as that Facebook is driving the Greek debt crisis or that the popularity of the baby name ‘Ava’ caused the US housing bubble. Obviously, these are extreme examples, but they show the dangers of not understanding the difference.
What are causation and correlation?
Let’s start off with the basics. What is the definition of causation vs correlation?
Well, according to the Bureau of Statistics, correlation is “A statistical measure (expressed as a number) that describes the size and direction of a relationship between two or more variables.”
Causation, by contrast, “Indicates that one event is the result of the occurrence of the other event; i.e. there is a causal relationship between the two events. This is also referred to as cause and effect.”
The classic causation vs correlation example is smoking: smoking is correlated with alcoholism but doesn’t cause alcoholism, whereas smoking does cause an increase in the risk of developing lung cancer.
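To make "expressed as a number" concrete, here is a minimal sketch of one common correlation measure, the Pearson coefficient, written in plain Python. The function name pearson_r and the three small data sets are made up for illustration and do not come from any of the articles mentioned here.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2 points)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Co-variation of the two variables around their means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return cov / math.sqrt(var_x * var_y)

# Made-up illustrative data, not real measurements
hours_studied = [1, 2, 3, 4, 5]
test_score = [52, 58, 61, 70, 74]
errors_made = [9, 8, 6, 4, 3]

r_pos = pearson_r(hours_studied, test_score)   # close to +1: rise together
r_neg = pearson_r(hours_studied, errors_made)  # close to -1: one rises, one falls
print(f"hours vs score:  {r_pos:.2f}")
print(f"hours vs errors: {r_neg:.2f}")
```

The sign gives the direction of the relationship and the magnitude (between 0 and 1) gives its size. Nothing in the number says which variable, if either, is the cause.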
Why is the difference important?
Getting the difference right is critical. Digital marketing evangelist Avinash Kaushik recently wrote about how not understanding the difference can be very problematic. Kaushik highlighted an article from The Economist, which featured the assertion that eating more ice cream can help boost student scores on the PISA reading scale.
"To normal people (non-analysts), this graph and article looks legit," wrote Kaushik. "After all this is a reputable site and it is a reputable team. Oh, and look there is a red line, what looks like a believable distribution, and a R-squared!"
But Kaushik wants us to think a bit harder about the data at hand and not take things at face value.
He points out that despite a reasonable correlation between these data sets, nothing establishes that one causes the other. While there may appear to be a clear link connecting IQ to ice cream consumption, the data reveal nothing definitive beyond that correlation.
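One way such a correlation can arise without any causal link is a hidden third factor that drives both variables. The sketch below simulates that situation with entirely synthetic data. The variable names (wealth, ice_cream, reading) and the noise levels are assumptions chosen for illustration, not figures from The Economist or from Kaushik.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

rng = random.Random(42)  # fixed seed so the run is repeatable
n = 2000

# Hypothetical hidden factor that influences both observed variables
wealth = [rng.gauss(0, 1) for _ in range(n)]

# Neither observed variable depends on the other, only on the hidden factor
ice_cream = [w + rng.gauss(0, 0.5) for w in wealth]
reading = [w + rng.gauss(0, 0.5) for w in wealth]

r_raw = pearson_r(ice_cream, reading)

# Subtract the hidden factor's contribution. This is possible here only
# because we built the data ourselves and know exactly how it was generated.
ice_resid = [i - w for i, w in zip(ice_cream, wealth)]
read_resid = [r - w for r, w in zip(reading, wealth)]
r_adj = pearson_r(ice_resid, read_resid)

print(f"raw correlation:               {r_raw:.2f}")
print(f"after removing hidden factor:  {r_adj:.2f}")
```

The raw correlation is strong even though ice cream never enters the formula for reading scores. Once the shared driver is accounted for, the relationship largely disappears. With real data the hidden factor is usually unknown or unmeasured, which is why the skepticism is warranted.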
Making bold claims
Ultimately, Kaushik uses the Economist example as a jumping-off point to remind us - and analysts everywhere - to be more skeptical of claims that draw bold conclusions from correlated data points. He cited a number of other examples, including science and suicide, jet airline quality ratings and flight schedules. Kaushik's call to action encouraged readers to look deeper at the data and avoid the easy conclusions.
"Our job is to be skeptical, to dig and understand and poke and prod and to reject the outrageously wrong and if it is not outrageously wrong then to figure out how right it might be so that you can make an educated recommendation," he continued.
Getting it right
Causality is frequently misunderstood, and it can be notoriously difficult to infer causation between two variables without running a randomized controlled experiment. Furthermore, correlation is a useful measure but has limitations: its most common form, the Pearson coefficient, measures only the strength of a linear relationship. But understanding that correlation does not imply causation, and knowing the difference, is a good place to start.
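The linear-relationship limitation is easy to see with a small sketch. Below, one variable is completely determined by the other, yet the Pearson coefficient comes out as zero because the relationship is a curve rather than a line. The helper pearson_r and the data are illustrative assumptions.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

xs = list(range(-5, 6))          # -5, -4, ..., 5

line = [2 * x + 1 for x in xs]   # perfectly linear relationship
curve = [x ** 2 for x in xs]     # perfectly determined by x, but U-shaped

r_line = pearson_r(xs, line)
r_curve = pearson_r(xs, curve)

print(f"linear relationship:   r = {r_line:.2f}")
print(f"U-shaped relationship: r = {r_curve:.2f}")
```

A coefficient near zero therefore does not mean "no relationship", only "no linear relationship". Plotting the data before trusting a single number is a cheap safeguard.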
Below are some great resources that explain correlation vs cause and effect.