I’ve always thought that statistics is one of the most important, yet underrepresented, cores of engineering. Stats matter. But there’s a Great Danger in using stats naively – the danger of confusing correlation and causation.
Consider the figure above, from this blog post. It plots the decline in Internet Explorer’s market share against the drop in the US murder rate from 2006 to 2011. Clearly, the two trends are correlated; that is, they rise and fall together, and the relationship is statistically significant. But there’s a problem: there’s no known causal link between the two variables. All we have are two trends that happen to look alike but are in fact entirely unrelated. The catch is this: nothing inherent in a statistical analysis tells you whether there is a causal relationship between two variables. So using stats alone just isn’t enough. Correlations like this, with no causation behind them, are called spurious correlations.
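To make that concrete, here is a minimal sketch in Python. The two series are made up for illustration (they are not the real browser or crime figures), and the helper name `pearson` is my own. Each series is generated independently, yet because both simply decline over the same six years, the correlation coefficient comes out close to 1.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(0)  # fixed seed so the run is repeatable
years = list(range(2006, 2012))

# Hypothetical, made-up series: two quantities that both happen to fall
# over the same period. Neither one is computed from the other.
series_a = [70 - 5 * i + rng.gauss(0, 1) for i in range(len(years))]
series_b = [17000 - 400 * i + rng.gauss(0, 100) for i in range(len(years))]

r = pearson(series_a, series_b)
print(f"correlation between the two unrelated series: {r:.2f}")
```

The arithmetic only measures how closely the two lines move together. Nothing in it knows, or can know, whether one series has anything to do with the other.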
There are many examples of this kind of correlation without causation: global temperature increase versus number of pirates, Facebook usage versus the Greek debt crisis, US highway death rate versus importation of lemons from Mexico, and many others. In fact, you can create your own quite easily:
- Choose a city, anywhere in the world, that has a fire department.
- Get data on fires in that city for, say, 10 years.
- Plot the number of fire engines at each fire against the damage in dollars at each fire.
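No matter what city you choose, you’ll be able to argue that fire engines cause damage. The real driver, of course, is the size of the fire: bigger fires draw more engines and do more damage.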
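If you don’t have a city’s fire records handy, the recipe can be simulated. The sketch below is a toy model under stated assumptions, not real data: every number in it is invented, and the names `fires`, `overall_r`, and `within_r` are mine. In this model, both the number of engines and the damage depend only on the size of the fire, and the engines have no effect on damage at all.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(42)  # fixed seed so the run is repeatable

# Toy model with invented numbers: fire size (1 to 5) drives both how many
# engines are sent and how much damage is done. Engines never affect damage.
fires = []
for _ in range(5000):
    size = rng.randint(1, 5)
    engines = size + rng.randint(0, 1)            # bigger fire, more engines
    damage = 10_000 * size + rng.gauss(0, 5_000)  # bigger fire, more damage
    fires.append((size, engines, damage))

# Across all fires, engines and damage look strongly related.
overall_r = pearson([f[1] for f in fires], [f[2] for f in fires])
print(f"all fires: r = {overall_r:.2f}")

# Among fires of the same size, the relationship disappears.
within_r = {}
for size in range(1, 6):
    group = [f for f in fires if f[0] == size]
    within_r[size] = pearson([f[1] for f in group], [f[2] for f in group])
    print(f"fires of size {size}: r = {within_r[size]:.2f}")
```

Across all fires the correlation is strong, but once you compare fires of the same size it drops to roughly zero. The statistics were never wrong; they just can’t tell you which variable is doing the work.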
If that’s too much work, you can try Google Correlate. For instance, follow this link to see the correlation between weight loss and houses for rent, or this link for the correlation between Facebook and xhamster (the latter being a porn site).
This may all seem like a bunch of fun and games, but it can be quite serious too. For instance, this spurious correlation between sales of organic foods and incidence of autism actually made the headlines for a time, until real experts debunked it. This page lists a series of spurious correlations taken directly from published scientific papers. Not only are they difficult to understand because of the technical language used, but they could also lead to real harm.
This is one of the reasons that science can seem to move so slowly sometimes. Scientists need to be extremely careful about how they interpret their data. Their caution is, of course, worth it. But it’s also worth it to learn about statistics, and to be wary of the dangers in its indiscriminate use.