Correlation VS Causation

Introduction

The concepts of correlation and causation often confuse novice researchers. In practice, I have often seen researchers treat a correlation as causation and draw wrong conclusions. Mathematically, correlation is a necessary but insufficient condition for causation. In other words, if two things are causally related, they must also be correlated. However, if two things are correlated, they are not necessarily causally related.

In this blog post, I will use an example to discuss the concepts of correlation and causation, how to verify causation using experiments, and the caveats of using experiments to verify causation.

Example

Suppose we have a system of four variables: temperature (t), the volume of water Mike drinks daily (x1), the volume of urine Mike produces daily (x2), and the number of fires in California daily (x3). The ground truth values are listed in the table, and we assume there is no measurement error in the data collection.

The ground truth relationships are listed in the relationships column of the table, and these relationships are unknown to the researchers.

During data collection, we recorded only x1, x2, and x3, but not t.

| Variable | Symbol | Relationships | Data #1 | Data #2 | Data #3 | Data #4 | Data #5 |
|---|---|---|---|---|---|---|---|
| Temperature | t | t | 0 | 10 | 20 | 30 | 40 |
| Volume of Water Mike Drinks Daily (mL) | x1 | x1=f1(t) | 500 | 800 | 900 | 1500 | 3000 |
| Volume of Urine Mike Produces Daily (mL) | x2 | x2=f2(t) | 500 | 800 | 900 | 1500 | 3000 |
| Number of Fires in California Daily | x3 | x3=f3(t) | 5 | 8 | 9 | 15 | 30 |

From the data, we find that every pair among x1, x2, and x3 is highly correlated. Can we say that x1 and x2 are causally related, and further that x1 causes x2? Can we say that x1 and x3 are causally related, and further that x1 causes x3?
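To make "highly correlated" concrete, here is a minimal sketch that computes the Pearson correlation coefficient for each pair, using the five data points from the table. The helper name `pearson` and the dictionary layout are my own choices for illustration.

```python
from itertools import combinations
from math import sqrt

# The five observations of x1, x2, and x3 from the table (t was not recorded).
data = {
    "x1": [500, 800, 900, 1500, 3000],
    "x2": [500, 800, 900, 1500, 3000],
    "x3": [5, 8, 9, 15, 30],
}


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / sqrt(var_a * var_b)


# Correlation for every unordered pair of recorded variables.
correlations = {
    (p, q): pearson(data[p], data[q]) for p, q in combinations(data, 2)
}

for (p, q), r in correlations.items():
    print(f"corr({p}, {q}) = {r:.3f}")
```

Because x2 equals x1 and x3 equals x1 / 100 in every column of the table, all three coefficients come out at 1 up to floating-point rounding. The coefficient alone says nothing about which pair, if any, is causal.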

Correlation Is Not Necessarily Causation

Common sense tells us that x1, the volume of water Mike drinks daily, causes x2, the volume of urine Mike produces daily. However, since Mike is just an ordinary person without divine power, x1 certainly does not cause x3, the number of fires in California daily.

This example concretely shows that correlation is a necessary but insufficient condition for causation: x1 and x3 are highly correlated, yet x1 does not cause x3.

The next question is how to confirm or rule out causation among all the correlated pairs. The correct way is to run experiments.

Determine Causation By Experiment

In this case, we keep t the same (although we are not monitoring it), increase x1, and monitor the changes in x2 and x3. That is to say, we keep the temperature the same, ask Mike to drink more water daily, and monitor the changes in the volume of urine Mike produces daily and the number of fires in California daily.

Of course, Mike would produce more urine daily, but the number of fires in California should not change. This experimental result confirms that x1 causes x2, but x1 does not cause x3.
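The experiment can be sketched as a tiny simulation. The model below is an assumption for illustration only: f1 and f3 are lookups into the table, and x2 is taken to equal x1, which matches the table but is not stated as the true mechanism. The function name `simulate` and the parameter `forced_x1` are hypothetical.

```python
# Table values, indexed by temperature t.
WATER_ML = {0: 500, 10: 800, 20: 900, 30: 1500, 40: 3000}  # x1 = f1(t)
FIRES = {0: 5, 10: 8, 20: 9, 30: 15, 40: 30}               # x3 = f3(t)


def simulate(t, forced_x1=None):
    """Return x1, x2, x3 at temperature t.

    If forced_x1 is given, the experimenter overrides how much Mike drinks.
    """
    x1 = WATER_ML[t] if forced_x1 is None else forced_x1
    x2 = x1        # assumption: urine volume follows water intake
    x3 = FIRES[t]  # assumption: fires depend on temperature only
    return {"x1": x1, "x2": x2, "x3": x3}


# Hold t fixed at 20 and ask Mike to drink 500 mL more than usual.
baseline = simulate(t=20)
treated = simulate(t=20, forced_x1=baseline["x1"] + 500)

print("baseline:", baseline)
print("treated: ", treated)
```

Under these assumptions x2 rises from 900 to 1400 while x3 stays at 9, which is the pattern described above.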

Caveats

However, if we run the experiments incorrectly, we might find spurious causal relationships.

For example, suppose t is going up without our noticing, since we are not monitoring it. If we increase x1 and monitor the changes in x2 and x3, then x2 and x3 would both increase. We would then draw the incorrect conclusion that x1 caused x3, i.e., that the volume of water Mike drinks daily drives the number of fires in California daily.
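A short sketch of this failure mode, under the same illustrative assumptions as the rest of the example: f1 and f3 are read from the table, x2 is taken to equal x1, and `observe` and `forced_x1` are hypothetical names. The temperatures 20 and 30 and the forced intake of 2000 mL are arbitrary picks.

```python
# Table values; t is the hidden variable the researchers never record.
WATER_ML = {0: 500, 10: 800, 20: 900, 30: 1500, 40: 3000}  # x1 = f1(t)
FIRES = {0: 5, 10: 8, 20: 9, 30: 15, 40: 30}               # x3 = f3(t)


def observe(t, forced_x1=None):
    """Return the (x1, x2, x3) the researchers see; t itself stays hidden."""
    x1 = WATER_ML[t] if forced_x1 is None else forced_x1
    x2 = x1        # assumption: urine volume follows water intake
    x3 = FIRES[t]  # assumption: fires depend on temperature only
    return x1, x2, x3


before = observe(t=20)                 # an ordinary day
after = observe(t=30, forced_x1=2000)  # Mike drinks more, but t also rose unnoticed

x3_change = after[2] - before[2]
print("before:", before)
print("after: ", after)
print("change in x3:", x3_change)
```

The researchers only see x1 and x3 go up together, so nothing in the recorded data tells them the increase in x3 came from t.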

Since we are not aware that t is increasing, or might not even know that t exists, how do we rule out that x1 caused x3? In this particular problem, it is almost impossible. However, if we had a clone of Mike identical to Mike in every respect, we could ask the first Mike to drink more water daily and the second Mike to keep drinking his usual amount, and monitor the changes in x2 and x3. We would find that although the first Mike suggests that x1 caused x3, the second Mike shows that x3 increased even though his x1 remained the same, which is inconsistent with the claim that x1 caused x3. This rules out a causal relationship between x1 and x3.

Such experiments are called control experiments. Essentially, when a control experiment shows something inconsistent with the causal relationship you found, that relationship is spurious. Control experiments are extremely useful: even if you do not know how many hidden variables you failed to capture, as long as you can guarantee that every variable not causally affected by the one you are manipulating takes the same values in both the control experiment and the actual experiment, the causal relationships you find, if any, are reliable.
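The two-Mike comparison can be sketched as follows. As before, this is an illustration under assumed mechanics: x2 is taken to equal x1, x3 is looked up from the table by the hidden temperature, and `run_arm`, the temperature of 30, and the intake of 2000 mL are hypothetical choices.

```python
# Number of fires as a function of the hidden temperature, from the table.
FIRES = {0: 5, 10: 8, 20: 9, 30: 15, 40: 30}  # x3 = f3(t)


def run_arm(t, x1):
    """One arm of the experiment: one Mike drinks x1 mL at hidden temperature t."""
    return {
        "x1": x1,
        "x2": x1,        # assumption: urine volume follows water intake
        "x3": FIRES[t],  # assumption: fires depend on temperature only
    }


# The temperature drifted up to 30 without anyone noticing.
# Both Mikes live through the same temperature, so it cancels in the comparison.
t_hidden = 30
treated = run_arm(t_hidden, x1=2000)  # first Mike drinks more
control = run_arm(t_hidden, x1=900)   # second Mike drinks his usual amount

effect_on_x2 = treated["x2"] - control["x2"]
effect_on_x3 = treated["x3"] - control["x3"]

print("effect of x1 on x2:", effect_on_x2)
print("effect of x1 on x3:", effect_on_x3)
```

Comparing the two arms rather than before and after removes the contribution of t: the difference in x2 is 1100 mL, while the difference in x3 is 0.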

However, in practice the system is much more complicated, and the single variable you think you are changing might actually turn out to bundle many variables. For example, when Mike starts to drink more water, he has to use exactly the same cup he used before, and the water has to be exactly the same water he used to drink. If the cup or the water quality changed, then in principle the conclusion that x1 causes x2 and x1 does not cause x3 would no longer be guaranteed to hold.

Conclusions

Determining causation is extremely complicated and can often go wrong. This is because an experiment involves an effectively unlimited number of variables. Even with a very good control experiment, you might not be aware of how many variables you have changed in the actual experiment, and so derive the wrong conclusion. This is essentially why many sophisticated scientific findings, especially in the life and biomedical sciences, turn out not to be true.

Nevertheless, the key ideas in running experiments to determine causal relationships are:

  • Doing control experiments.
  • Minimizing, and staying aware of, changes to other variables when you think you are changing only a single variable.

Author

Lei Mao

Posted on

11-28-2019

Updated on

11-28-2019
