Correlation VS Causation
Introduction
The concepts of correlation and causation are often confusing to amateur researchers. In practice, I have often seen researchers treat a correlation as causation and draw wrong conclusions. Mathematically, correlation is a necessary but insufficient condition for causation. In other words, if two things are causally related, they must also be correlated. However, if two things are correlated, they are not necessarily causally related.
In this blog post, I will use an example to discuss the concepts of correlation and causation, how to verify causation using experiments, and the caveats of using experiments to verify causation.
Example
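Suppose we have a system containing four variables: temperature, the volume of water Mike drinks daily, the volume of urine Mike has daily, and the number of fires in California daily.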
The ground truth relationships are listed in the relationships column in the table, and these relationships are unknown to researchers.
In the data collection, we only collected the data for
Variable | Symbol | Relationships | Data #1 | Data #2 | Data #3 | Data #4 | Data #5 |
---|---|---|---|---|---|---|---|
Temperature | | | 0 | 10 | 20 | 30 | 40 |
Volume of Water Mike Drinks Daily (mL) | | | 500 | 800 | 900 | 1500 | 3000 |
Volume of Urine Mike Has Daily (mL) | | | 500 | 800 | 900 | 1500 | 3000 |
Number of Fires in California Daily | | | 5 | 8 | 9 | 15 | 30 |
From the data, we found that each pair of the collected variables is positively correlated.
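The correlations can be checked directly from the numbers in the table. Below is a minimal sketch that computes the Pearson correlation coefficient for every pair of variables. The dictionary keys and the `pearson` helper are names I made up for illustration, and Pearson's coefficient is just one common choice of correlation measure.

```python
from math import sqrt

# Data copied from the table above (five observations per variable).
data = {
    "temperature": [0, 10, 20, 30, 40],
    "water_ml": [500, 800, 900, 1500, 3000],
    "urine_ml": [500, 800, 900, 1500, 3000],
    "fires": [5, 8, 9, 15, 30],
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Correlation for every unordered pair of variables.
names = list(data)
correlations = {
    (a, b): pearson(data[a], data[b])
    for i, a in enumerate(names)
    for b in names[i + 1:]
}

for (a, b), r in correlations.items():
    print(f"{a} vs {b}: r = {r:.3f}")
```

With this data, water, urine, and fires are perfectly correlated with one another (r = 1, because the columns are identical up to a scale factor), and temperature correlates with each of them at about r = 0.90. The numbers alone say nothing about which of these pairs, if any, are causally linked.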
Correlation Is Not Necessarily Causation
If we have some common sense, we know that
This example concretely shows that correlation is a necessary but insufficient condition for causation.
The next question is how to separate the causal relationships from all the correlations. The correct way is to do experiments.
Determine Causation By Experiment
In this case, if we keep
Of course, Mike would produce more urine daily, but the number of fires in California should not change. This experiment result confirms that
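The experiment described above can be sketched as a small simulation. Everything in it is an assumption made for illustration: the causal structure (temperature drives both water intake and fires, and water drives urine), the coefficients, and the function name `simulate_day`. It is a toy model, not something fitted to the data in the table.

```python
def simulate_day(temperature, water_ml=None):
    """Simulate one day in a toy causal model (all mechanisms are assumptions).

    Assumed structure: temperature -> water, water -> urine, temperature -> fires.
    Passing water_ml overrides Mike's natural intake, which models an intervention.
    """
    if water_ml is None:
        water_ml = 500 + 50 * temperature  # hypothetical drinking habit
    urine_ml = water_ml                    # hypothetical: urine equals intake
    fires = 5 + 0.5 * temperature          # hypothetical: fires depend on heat only
    return {"water_ml": water_ml, "urine_ml": urine_ml, "fires": fires}


# Hold temperature fixed so that water intake is the only thing that changes.
FIXED_TEMPERATURE = 20
baseline = simulate_day(FIXED_TEMPERATURE)
treated = simulate_day(FIXED_TEMPERATURE, water_ml=baseline["water_ml"] + 1000)

urine_change = treated["urine_ml"] - baseline["urine_ml"]
fires_change = treated["fires"] - baseline["fires"]
print(f"change in urine: {urine_change} mL, change in fires: {fires_change}")
```

In this sketch, forcing Mike to drink 1000 mL more raises his urine volume by 1000 mL while the number of fires stays exactly the same, which is the pattern an experiment needs to show in order to separate a causal link from a mere correlation.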
Caveats
However, if we do the experiments incorrectly, we might find erroneous causal relationships.
For example, if we are not aware
Since we are not aware of
Such experiments are called control experiments. Essentially, when a control experiment shows something inconsistent with the causal relationship you found, that relationship is spurious. Control experiments are extremely useful: even if you do not know how many hidden variables you failed to capture, as long as you can guarantee that the variables with no causal relationship to the ones you are experimenting on remain the same in both the control experiment and the actual experiment, the causal relationships you find, if any, are reliable.
However, in practice, the system is much more complicated, and the single variable you think you are changing might actually turn out to contain many variables. For example, when Mike starts to drink more water, he has to use exactly the same cup he used before, and the water has to be exactly the same water he used to drink. If the cup or the water quality changed, in principle, the conclusion that
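To make the role of a control experiment concrete, here is a minimal sketch under stated assumptions. I assume the hidden variable is temperature, that it drifts from 10 to 30 during the study without the experimenter noticing, and that fires depend on temperature only. The functions `fires_per_day` and `run_arm` and all the numbers are hypothetical.

```python
def fires_per_day(temperature):
    # Hypothetical mechanism: fires depend on temperature only, never on water.
    return 5 + 0.5 * temperature


def run_arm(temp_before, temp_after, extra_water_ml):
    """Return the change in daily fires observed over one arm of the study.

    extra_water_ml is accepted to mirror the experimental protocol, but by
    construction it has no effect on fires in this toy model.
    """
    return fires_per_day(temp_after) - fires_per_day(temp_before)


# Unnoticed by the experimenter, the weather warms from 10 to 30 during the study.
treatment_effect = run_arm(10, 30, extra_water_ml=1000)  # Mike drinks more water
control_effect = run_arm(10, 30, extra_water_ml=0)       # Mike changes nothing

# The effect attributable to water is the treatment change minus the control change.
water_effect_on_fires = treatment_effect - control_effect
print(treatment_effect, control_effect, water_effect_on_fires)
```

Looking at the treatment arm alone, fires rise by 10 per day, and it would be tempting to blame Mike's drinking. The control arm shows the same rise without any change in drinking, so the effect attributable to water is zero, and the apparent causal relationship is exposed as spurious.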
Conclusions
The determination of causation is extremely complicated and can often go wrong. This is because an experiment involves an effectively unlimited number of variables: even with a very good control experiment, you might not be aware of how many variables you changed in the actual experiment, and you might derive the wrong conclusion. This is essentially why many sophisticated scientific findings, especially in the life and biomedical sciences, turn out not to be true.
Nevertheless, the key ideas in doing experiments to determine causal relationships are:
- Run control experiments.
- Minimize, and stay aware of, changes to other variables while you think you are changing only a single variable.