Correlation is not causation!
This phrase is repeated in every textbook. The results of a statistical study should never be treated as absolute. It is important to delimit clearly the scope of the study and to identify the cases where its conclusions should be treated with caution¹.
A correlation can sometimes be entirely fortuitous². For example, annual data from 1897 to 1985 reportedly show a correlation of 0.91 between U.S. national income and the number of sunspots (darker, cooler areas of the sun's surface). Yet no one can seriously argue that there is any relationship between these two quantities.
A correlation can also hide the influence of another factor. For example, a negative relationship can be shown between people's height and the length of their hair. One could always come up with more or less psychological explanations, but before going any further, we should go back to how the data were collected and check that no information is hidden. Here, men are on average taller than women, and women on average have longer hair than men. The person's gender then plays the role of a confounding factor.
The apparent link is an artifact caused by an uncontrolled factor.
When the confounding factor is qualitative, the problem is easy to detect by drawing a scatter plot that distinguishes the subgroups.
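The subgroup check can also be done numerically. The sketch below is only an illustration on synthetic data: the group means, spreads and sample sizes are assumptions chosen for the example, not measurements, and the `pearson` helper is written here for the occasion. Within each group, height and hair length are drawn independently, so any pooled correlation comes from the grouping alone.

```python
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(1)  # fixed seed so the run is repeatable

# Assumed (mean height, mean hair length) in cm for each group.
groups = {"men": (178.0, 10.0), "women": (165.0, 30.0)}

sample = {}
for name, (mean_height, mean_hair) in groups.items():
    # Height and hair length are independent inside a group.
    heights = [rng.gauss(mean_height, 7.0) for _ in range(150)]
    hair = [max(0.0, rng.gauss(mean_hair, 6.0)) for _ in range(150)]
    sample[name] = (heights, hair)

# Correlation when the two groups are mixed together.
all_heights = sample["men"][0] + sample["women"][0]
all_hair = sample["men"][1] + sample["women"][1]
pooled_r = pearson(all_heights, all_hair)

# Correlation inside each group taken separately.
within_r = {name: pearson(h, l) for name, (h, l) in sample.items()}

print(f"pooled: {pooled_r:+.2f}")
for name, r in within_r.items():
    print(f"{name}: {r:+.2f}")
```

With these assumed parameters the pooled coefficient is clearly negative while the within-group coefficients stay close to zero, which is the signature of a qualitative confounding factor.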
When the factor is quantitative, things are a little more complicated. For example, there is no direct link between sales of sunglasses and sales of ice cream: it is sunshine or temperature that makes them vary together. This case is studied with partial correlations.
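As a minimal sketch of a partial correlation, the example below uses the standard first-order formula r_xy.z = (r_xy - r_xz * r_yz) / sqrt((1 - r_xz^2) * (1 - r_yz^2)). The sales figures are synthetic: both series are assumed to depend linearly on temperature plus independent noise, and the coefficients and the helper names `pearson` and `partial_corr` are illustrative choices.

```python
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def partial_corr(x, y, z):
    """Correlation of x and y once the linear effect of z is removed."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / ((1 - r_xz ** 2) * (1 - r_yz ** 2)) ** 0.5

rng = random.Random(2)  # fixed seed so the run is repeatable

# Assumed model: temperature drives both sales series, nothing else links them.
temperature = [rng.gauss(20.0, 5.0) for _ in range(300)]
sunglasses = [2.0 * t + rng.gauss(0.0, 5.0) for t in temperature]
ice_cream = [3.0 * t + rng.gauss(0.0, 8.0) for t in temperature]

raw_r = pearson(sunglasses, ice_cream)
partial_r = partial_corr(sunglasses, ice_cream, temperature)

print(f"raw correlation:     {raw_r:+.2f}")
print(f"partial correlation: {partial_r:+.2f}")
```

Under these assumptions the raw correlation is strong, while the partial correlation, which holds temperature fixed, falls back towards zero.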
The clustering illusion
This is our tendency to wrongly perceive patterns in random data. It arises because our mind expects a certain result and settles on a conclusion a priori. If we do not carry the statistical study through to the end, it may lead us into an illusion.
To illustrate this phenomenon, the researchers Gilovich, Vallone and Tversky showed that the idea that a basketball player has a “hot hand” after a series of successful shots is wrong. Analyses of the Philadelphia team did not show that players strung together successful shots more often than chance would predict. When a player makes his first throw, he makes the second 75% of the time. But when he misses the first throw, he also makes the second 75% of the time. In other words, a player is just as likely to succeed after a miss as after a hit. If we had conducted our statistical study only on the shots that follow a success, we would have “confirmed” the hot hand and fallen for an illusion. Looking at the shots that follow a miss as well breaks this illusion.
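The comparison described above can be reproduced on simulated shots. This is a sketch under a strong assumption: every shot succeeds with the same probability of 0.75, independently of the previous one, so there is no hot hand by construction. The number of simulated pairs and the variable names are arbitrary.

```python
import random

rng = random.Random(3)  # fixed seed so the run is repeatable
P_HIT = 0.75            # assumed constant success rate

# Each pair is (first shot made?, second shot made?), drawn independently.
pairs = [(rng.random() < P_HIT, rng.random() < P_HIT) for _ in range(20000)]

# Split the second shots according to the outcome of the first.
after_hit = [second for first, second in pairs if first]
after_miss = [second for first, second in pairs if not first]
p_after_hit = sum(after_hit) / len(after_hit)
p_after_miss = sum(after_miss) / len(after_miss)

# Longest run of consecutive hits in 100 independent shots.
shots = [rng.random() < P_HIT for _ in range(100)]
longest = current = 0
for made in shots:
    current = current + 1 if made else 0
    longest = max(longest, current)

print(f"P(hit | previous hit):  {p_after_hit:.3f}")
print(f"P(hit | previous miss): {p_after_miss:.3f}")
print(f"longest streak in 100 shots: {longest}")
```

The two conditional frequencies come out nearly equal, yet the sequence of 100 shots still contains long streaks of successes. Streaks alone are therefore not evidence of anything beyond chance.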
We sometimes remove data so that the result goes in the direction that “suits us”. We may do this quite deliberately (each of us has our reasons) or simply without realising the consequences of this retouching.
This happens, for example, when, to accentuate the effect we are looking for, we remove one or more measurement points on the grounds that they are outliers, without having clearly shown that they are.
In fact, for each “non-normal” measurement, which is a priori an outlier, we must investigate whether it is a genuine anomaly (a measurement error…) or a point that has to be kept because there is no good reason to exclude it.
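To see how much such retouching can change a result, here is a small sketch on synthetic data. The two variables are drawn independently, so there is no real relationship between them. The removal rule (dropping the ten points that most contradict a positive relationship) is a deliberately biased, hypothetical procedure used only to show the effect; it is not a recommended outlier test.

```python
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(4)  # fixed seed so the run is repeatable

# Two independent series: any correlation between them is pure chance.
x = [rng.gauss(0.0, 1.0) for _ in range(50)]
y = [rng.gauss(0.0, 1.0) for _ in range(50)]
r_before = pearson(x, y)

# Sort points from "most against" to "most in favour of" a positive link.
mx, my = sum(x) / len(x), sum(y) / len(y)
order = sorted(range(len(x)), key=lambda i: (x[i] - mx) * (y[i] - my))

# Discard the ten most inconvenient points and recompute.
kept = sorted(order[10:])
r_after = pearson([x[i] for i in kept], [y[i] for i in kept])

print(f"all 50 points:       {r_before:+.2f}")
print(f"after removing 10:   {r_after:+.2f}")
```

The coefficient rises noticeably after the removal although nothing links the two series, which is why each suspected outlier needs its own justification before it is excluded.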
The Texas sharpshooter fallacy
The Texas sharpshooter fallacy illustrates this kind of data distortion. It takes its name from an American joke:
A person fires a series of bullets at the wall of a barn. Once done, he draws a target around each bullet hole and writes “I am a sharpshooter”.
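A statistical version of drawing the target after shooting is to compute many correlations first and only then pick the one worth reporting. The sketch below assumes ten observations per series and two hundred unrelated candidate series, all pure noise; these sizes, the threshold of 0.5 and the `pearson` helper are illustrative choices.

```python
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(5)  # fixed seed so the run is repeatable
N_POINTS, N_CANDIDATES = 10, 200

# One series of interest and many candidates, all independent noise.
target = [rng.gauss(0.0, 1.0) for _ in range(N_POINTS)]
candidates = [[rng.gauss(0.0, 1.0) for _ in range(N_POINTS)]
              for _ in range(N_CANDIDATES)]

correlations = [pearson(target, c) for c in candidates]

# "Draw the target afterwards": keep only the most impressive result.
best = max(correlations, key=abs)
share_strong = sum(abs(r) > 0.5 for r in correlations) / N_CANDIDATES

print(f"best correlation found: {best:+.2f}")
print(f"share of candidates with |r| > 0.5: {share_strong:.0%}")
```

Reporting only the best coefficient hides the fact that most candidates showed nothing, so the number of comparisons tried should always accompany the result.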
A short aside on statistics
We all sense that statistics are powerful tools, yet we often feel “fooled” by statistical figures, especially in political debates. An excellent article from the Harvard Business Review (downloadable above) highlights this phenomenon and the caution we should exercise with statistical figures and studies.
1 – Y. Dodge, V. Rousson (2004) – Applied regression analysis
2 – J. Johnston, J. DiNardo (1999) – Econometric methods
C. Ryan (1960) – The Longest Day