Correlation VS Causation
The concepts of correlation and causation often confuse amateur researchers. In practice, I have often seen researchers treat a correlation as causation and draw wrong conclusions. Mathematically, correlation is a necessary but insufficient condition for causation. In other words, if two things have a causal relationship, they must also be correlated. However, if two things are correlated, they do not necessarily have a causal relationship.
In this blog post, I will use an example to discuss the concepts of correlation and causation, how to verify causation using experiments, and the caveats of using experiments to verify causation.
Suppose we have a system containing four variables: temperature ($t$), the volume of water Mike drinks daily ($x_1$), the volume of urine Mike produces daily ($x_2$), and the number of fires in California daily ($x_3$). The ground truth values are listed in the table, and we assume there is no measurement error during experimental data collection.
The ground truth relationships are listed in the Relationships column of the table, and these relationships are unknown to researchers.
During data collection, we recorded only $x_1$, $x_2$, and $x_3$, but not $t$.
| Variable | Symbol | Relationships | Data #1 | Data #2 | Data #3 | Data #4 | Data #5 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Volume of Water Mike Drinks Daily (mL) | $x_1$ | $x_1=f_1(t)$ | 500 | 800 | 900 | 1500 | 3000 |
| Volume of Urine Mike Produces Daily (mL) | $x_2$ | $x_2=f_2(t)$ | 500 | 800 | 900 | 1500 | 3000 |
| Number of Fires in California Daily | $x_3$ | $x_3=f_3(t)$ | 5 | 8 | 9 | 15 | 30 |
From the data, we find that every pair among $x_1$, $x_2$, and $x_3$ is highly correlated. Can we say that $x_1$ and $x_2$ have a causal relationship and, further, that $x_1$ caused $x_2$? Can we say that $x_1$ and $x_3$ have a causal relationship and, further, that $x_1$ caused $x_3$?
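The claim that every pair is highly correlated can be checked directly from the five data points in the table. The snippet below is a minimal sketch that computes the Pearson correlation coefficient by hand; the helper name `pearson` and the dictionary layout are my own choices for illustration.

```python
from itertools import combinations
from math import sqrt

# The five observations from the table (t itself was never recorded).
data = {
    "x1": [500, 800, 900, 1500, 3000],  # water Mike drinks daily (mL)
    "x2": [500, 800, 900, 1500, 3000],  # urine Mike produces daily (mL)
    "x3": [5, 8, 9, 15, 30],            # fires in California daily
}


def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / sqrt(var_a * var_b)


# Correlation for each of the three variable pairs.
correlations = {
    (u, v): pearson(data[u], data[v]) for u, v in combinations(data, 2)
}

for pair, r in correlations.items():
    print(pair, round(r, 6))
```

Since the table values are exact multiples of each other, each coefficient is 1 up to floating-point rounding. The numbers alone cannot distinguish the pair $(x_1, x_2)$ from the pair $(x_1, x_3)$.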
With some common sense, we know that $x_1$, the volume of water Mike drinks daily, causes $x_2$, the volume of urine Mike produces daily. However, since Mike is just an ordinary person without divine power, $x_1$, the volume of water Mike drinks daily, certainly does not cause $x_3$, the number of fires in California daily.
This example concretely shows that correlation is a necessary but insufficient condition for causation.
The next question is how to confirm or rule out causal relationships among all the correlations. The correct way is to do experiments.
In this case, we keep $t$ the same (although we are not monitoring it), increase $x_1$, and monitor the changes in $x_2$ and $x_3$. That is to say, we keep the temperature the same, ask Mike to drink more water daily, and monitor the changes in the volume of urine Mike produces daily and the number of fires in California daily.
Of course, Mike would produce more urine daily, but the number of fires in California should not change. This experimental result confirms that $x_1$ causes $x_2$ but does not cause $x_3$.
However, if we do the experiments incorrectly, we might find erroneous causal relationships.
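The experiment with $t$ held fixed can be sketched as a tiny simulation. The functional forms below are assumptions chosen only to reproduce the table ($x_1 = 100t$, $x_2 = x_1$, $x_3 = t$, with $t$ in arbitrary units); the post never specifies $f_1$, $f_2$, or $f_3$. The function `simulate` and its `x1_override` parameter are hypothetical names.

```python
def simulate(t, x1_override=None):
    """Toy model of the system; every functional form here is an assumption."""
    # Without intervention, water intake follows temperature: x1 = f1(t).
    # An intervention replaces that value with one chosen by the experimenter.
    x1 = 100.0 * t if x1_override is None else x1_override
    x2 = x1  # assumed: urine volume is driven by water intake
    x3 = t   # assumed: fires depend only on temperature
    return {"x1": x1, "x2": x2, "x3": x3}


# Same temperature in both runs; only x1 is changed by the experimenter.
baseline = simulate(t=8.0)
intervened = simulate(t=8.0, x1_override=1600.0)

delta_x2 = intervened["x2"] - baseline["x2"]
delta_x3 = intervened["x3"] - baseline["x3"]
print("change in x2:", delta_x2)
print("change in x3:", delta_x3)
```

Under these assumptions, doubling $x_1$ from 800 to 1600 moves $x_2$ by 800 and leaves $x_3$ unchanged, which is the pattern the experiment relies on.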
For example, suppose $t$ is going up without our knowing it, since we are not monitoring it. If we increase $x_1$ and monitor the changes in $x_2$ and $x_3$, then $x_2$ and $x_3$ would both increase. We would then draw the incorrect conclusion that $x_1$ caused $x_3$, i.e., that the volume of water Mike drinks daily caused the number of fires in California daily.
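To make the failure mode concrete, here is a minimal sketch of the same kind of experiment in which $t$ drifts upward unnoticed. The functional forms ($x_1 = 100t$, $x_2 = x_1$, $x_3 = t$) and the specific temperatures are assumptions made for illustration, and `simulate` is a hypothetical helper.

```python
def simulate(t, x1_override=None):
    """Toy model of the system; every functional form here is an assumption."""
    x1 = 100.0 * t if x1_override is None else x1_override
    x2 = x1  # assumed: urine volume is driven by water intake
    x3 = t   # assumed: fires depend only on temperature
    return {"x1": x1, "x2": x2, "x3": x3}


# The experimenter raises x1, but t also rises from 8 to 12 during the study.
before = simulate(t=8.0)
after = simulate(t=12.0, x1_override=1600.0)

# Only x2 and x3 are observed; t is hidden from the experimenter.
observed_change = {k: after[k] - before[k] for k in ("x2", "x3")}
print(observed_change)
```

Both observed quantities go up, so an experimenter who cannot see $t$ has no way, from this single run, to tell that the rise in $x_3$ came from temperature rather than from Mike's water intake.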
Since we are not aware that $t$ is increasing, or may not even know that $t$ exists, how do we rule out that $x_1$ caused $x_3$? In this particular problem, it is almost impossible. However, if we had a clone of Mike who is exactly the same as Mike in every respect, we could ask the first Mike to drink more water daily and the second Mike to keep drinking his usual amount, and monitor the changes in $x_2$ and $x_3$. We would find that, although the first Mike's result suggests that $x_1$ caused $x_3$, the second Mike's result shows that $x_3$ increased even though $x_1$ stayed the same, which is inconsistent with $x_1$ causing $x_3$. This rules out a causal relationship between $x_1$ and $x_3$.
Such experiments are called control experiments. Essentially, when a control experiment shows something inconsistent with the causal relationship you found, that relationship is spurious. Control experiments are extremely useful: even if you do not know how many hidden variables you failed to capture, as long as you can guarantee that the variables with no causal relationship to the ones you are experimenting with are the same in both the control experiment and the actual experiment, the causal relationships you find, if any, are reliable.
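The two-Mike control experiment can be sketched as follows. Both arms share the same hidden temperature drift, and the effect attributed to $x_1$ is the change in the treated arm minus the change in the control arm. The functional forms, the temperatures, and names such as `run_arm` are assumptions for illustration only.

```python
def urine(x1):
    """Assumed form: urine volume equals water intake."""
    return x1


def fires(t):
    """Assumed form: number of fires depends only on temperature."""
    return t


# Hidden temperature drift, identical for both Mikes (arbitrary units).
T_BEFORE, T_AFTER = 8.0, 12.0
USUAL_INTAKE = 800.0  # mL per day that both Mikes drink before the study


def run_arm(x1_after):
    """Change in the observed variables for one Mike over the study."""
    before = {"x2": urine(USUAL_INTAKE), "x3": fires(T_BEFORE)}
    after = {"x2": urine(x1_after), "x3": fires(T_AFTER)}
    return {k: after[k] - before[k] for k in before}


treated = run_arm(x1_after=1600.0)       # first Mike drinks more water
control = run_arm(x1_after=USUAL_INTAKE)  # second Mike keeps his usual amount

# Effect attributable to x1: treated change minus control change.
effect = {k: treated[k] - control[k] for k in treated}
print("treated:", treated)
print("control:", control)
print("effect of x1:", effect)
```

In this toy setting $x_3$ rises by the same amount in both arms, so the difference is zero and the apparent effect of $x_1$ on $x_3$ disappears, while the effect on $x_2$ survives the comparison.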
However, in practice, the system is much more complicated, and the single variable you thought you were changing might actually turn out to contain many variables. For example, when Mike starts to drink more water, he has to use exactly the same cup he used before, and the water has to be exactly the same water he used to drink. If the cup or the water quality changed, then, in principle, the conclusion that $x_1$ causes $x_2$ and $x_1$ does not cause $x_3$ would no longer be guaranteed to hold.
Determining causation is extremely complicated and can often go wrong. This is because there is an effectively unlimited number of variables in any experiment. Even with a very good control experiment, you might not be aware of how many variables you changed in the actual experiment, and you might derive the wrong conclusion. This is an important reason why many sophisticated scientific findings, especially in the life and biomedical sciences, later turn out not to be true.
Nevertheless, the key ideas in doing experiments to determine causal relationships are:
- Doing control experiments.
- Minimizing, and staying aware of, changes to other variables while you think you are changing only a single variable.