Regressions are mathematical tools for studying how variables behave in relation to one another. They are used to predict behavior by analyzing the correlations between variables. They therefore make it possible to:
- Determine the function that links the variables
- Measure the strength of the link between the variables
Regression studies can also be used for other purposes. For example, suppose a correlation graph shows that the rate of late arrivals at meetings is linked to the quality of those meetings. To set up an indicator, we would then prefer the delay rate, which is easier to measure, to the quality of the meetings.
Regressions may only be used when the explanatory data (or information) is quantitative (continuous or discrete) and according to the following table:
[Table not recovered. Surviving headers: "Continuous or discrete quantitative" and "Number of quantitative explanatory variables (continuous or discrete)".]
Qualitative variables can also be used, as long as they are transformed into dichotomous (0/1) or class variables.
Example: male and female are coded as 0 and 1.
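The coding described above can be sketched in a few lines. This is only an illustration: the function names `encode_binary` and `encode_classes` are hypothetical, and the choice of which category receives 1 is arbitrary.

```python
def encode_binary(values, positive):
    """Map a two-level qualitative variable to 0/1 (1 for `positive`)."""
    return [1 if v == positive else 0 for v in values]


def encode_classes(values):
    """Build one 0/1 indicator column per class, in sorted class order."""
    classes = sorted(set(values))
    rows = [[1 if v == c else 0 for c in classes] for v in values]
    return classes, rows


# Made-up observations; "female" is arbitrarily chosen as the level coded 1.
sex = ["male", "female", "female", "male"]
sex_coded = encode_binary(sex, positive="female")
print(sex_coded)  # [0, 1, 1, 0]

# A variable with more than two levels becomes several class columns.
classes, shift_coded = encode_classes(["day", "night", "evening", "day"])
```

With more than two levels, one indicator column is usually dropped before fitting a model so that the columns are not redundant; that step is not shown here.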
Simple or monotonic regression
It was Francis Galton (1822-1911), a mathematician and cousin of Charles Darwin, who introduced the term "regression". Working on the transmission of hereditary traits, he noticed that although tall parents tend to have tall children and vice versa, the average height of children tended to move towards the average height of the population. In other words, the height of children born to unusually tall or short parents approached the population average [1].
Galton's universal law of regression was confirmed by his friend Karl Pearson, who collected more than a thousand height records from family groups [2]. He found that the average height of the sons of a group of tall fathers was lower than that of their fathers, and that the average height of the sons of a group of short fathers was greater than that of their fathers, thus "regressing" tall and short sons alike towards the average height.
1. Collect Data
The first step is to collect the data. For a regression study, the explanatory variable (or variables) must be quantitative. In other words, a regression study cannot be carried out on data of the yes/no or white/blue type… In such cases, it is necessary to use specific hypothesis tests or, if possible, to transform the data into a quantitative variable.
Take the example of a quality control for traces on a label. No trace is wanted, and the control is currently a good/not good check. A hypothesis test could be set up, but a regression can also be of interest. In that case, the appearance of a trace is expressed by measuring its size: when there is no trace the value is 0, and when one appears its surface area is measured. The data thus become a quantitative variable.
In addition, as in any statistical study, data collection must follow the basic rules. In particular, outliers should be removed where necessary.
It is also necessary to ensure that the same number of observations is collected for each of the two variables.
Finally, the values must be checked for independence. This can be done either by logic or with the Durbin-Watson test.
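As an illustration of the Durbin-Watson check, here is a minimal sketch of the statistic itself. It assumes the test is applied to the residuals of a fitted model, taken in the order in which the data were collected; the function name and the sample residuals are made up for the example. The statistic lies between 0 and 4: values near 2 suggest no first-order autocorrelation, values well below 2 suggest positive autocorrelation, and values well above 2 suggest negative autocorrelation.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic of a sequence of residuals, in time order."""
    if len(residuals) < 2:
        raise ValueError("need at least two residuals")
    # Sum of squared differences between consecutive residuals.
    squared_steps = sum(
        (residuals[i] - residuals[i - 1]) ** 2 for i in range(1, len(residuals))
    )
    # Sum of squared residuals.
    squared_total = sum(e * e for e in residuals)
    return squared_steps / squared_total


alternating = [1.0, -1.0, 1.0, -1.0]          # made-up residuals that flip sign
drifting = [1.0, 1.0, 1.0, -1.0, -1.0, -1.0]  # made-up residuals that persist

dw_alternating = durbin_watson(alternating)   # above 2: negative autocorrelation
dw_drifting = durbin_watson(drifting)         # below 2: positive autocorrelation
```

A formal decision compares the statistic with tabulated lower and upper bounds that depend on the sample size and the number of explanatory variables; those tables are not reproduced here.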
2. Identify the type of regression
The type of regression to set up is selected according to the type of data, using the diagram presented in the introduction.
3. Characterizing the relationship
In this step, the data are plotted to characterize the relationship and choose a model. Whether the regression is simple or multiple, each graph always shows the variable to be explained against only one of the other variables. We therefore draw as many graphs as there are explanatory variables.
Since the residuals are the differences between the predictions of our model and our data, the "best" regression is obtained when the sum of the squared residuals is as small as possible.
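The least-squares idea can be made concrete with a short sketch for the straight-line case. The closed-form slope and intercept below are the standard ones for a simple linear regression; the function names and the data are made up for the example.

```python
def fit_line(x, y):
    """Least-squares slope and intercept of y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def sum_squared_residuals(x, y, slope, intercept):
    """Sum over all points of (observed y - predicted y) squared."""
    return sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))


x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]  # made-up measurements

slope, intercept = fit_line(x, y)
best = sum_squared_residuals(x, y, slope, intercept)

# Any other line gives a larger sum of squared residuals.
steeper = sum_squared_residuals(x, y, slope + 0.1, intercept)
shifted = sum_squared_residuals(x, y, slope, intercept + 0.1)
```

Comparing `best` with `steeper` and `shifted` shows that moving away from the fitted line in either direction increases the criterion.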
There are three types of relationship:
|Type of link||Description|
|Linear link||The simplest case: the two variables are related by a straight line, which can be ascending or descending.|
|Monotonic link||A more complex case: the link is not linear but is either strictly increasing or strictly decreasing.|
|Non-monotonic link||The link shows a "break" (a change of direction), but it can still be represented mathematically.|
4. Quantifying the intensity of the correlation
The intensity of the correlation is quantified. For this, there are three different coefficients that we describe below and which are to be used according to the table below.
The Bravais-Pearson coefficient (r)
The Bravais-Pearson coefficient measures the co-variation of the two variables. It is the ratio of the variation the two measures have in common to the maximum variation they could share. It is expressed according to the following formula (PEARSON function in Excel):
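In its standard textbook form, the coefficient is r = Σ(xi − x̄)(yi − ȳ) / sqrt(Σ(xi − x̄)² · Σ(yi − ȳ)²), that is, the covariance divided by the product of the standard deviations. The sketch below implements that definition directly; the function name and the sample values (a delay rate and a meeting quality score) are made up for the example.

```python
import math


def pearson_r(x, y):
    """Pearson correlation: co-variation over the product of the spreads."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of observations")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    co_variation = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x)
    spread_y = sum((b - mean_y) ** 2 for b in y)
    # Raises ZeroDivisionError if one variable is constant (r is undefined).
    return co_variation / math.sqrt(spread_x * spread_y)


delay_rate = [0.05, 0.10, 0.20, 0.30, 0.40]  # made-up values
quality = [9.0, 8.5, 7.0, 6.5, 4.0]          # made-up values

r = pearson_r(delay_rate, quality)  # strongly negative for this sample
```

The same value should be obtained from a spreadsheet or a statistics library; computing it by hand here only serves to show what the ratio contains.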
If r is squared (r²), the result is called the coefficient of determination. It gives the share of variance that the two samples have in common. Expressed as a percentage, the closer it is to 100%, the better our regression model explains our data.
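For a simple linear regression, the coefficient of determination can also be read as one minus the ratio of the residual variation to the total variation of the variable to be explained. The sketch below computes it that way, under the assumption of a straight-line model fitted by least squares; the function name and the data are made up.

```python
def r_squared(x, y):
    """Share of the variation of y explained by a least-squares line in x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Variation left unexplained by the fitted line.
    ss_residual = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    # Total variation of y around its mean.
    ss_total = sum((b - mean_y) ** 2 for b in y)
    return 1 - ss_residual / ss_total


delay_rate = [0.05, 0.10, 0.20, 0.30, 0.40]  # made-up values
quality = [9.0, 8.5, 7.0, 6.5, 4.0]          # made-up values

r2 = r_squared(delay_rate, quality)
percent_explained = 100 * r2
```

For this straight-line case the result equals the square of the Bravais-Pearson coefficient of the same two samples.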
In the particular case of a time series of a single variable, the Bravais-Pearson coefficient can be computed between the series and a lagged copy of itself. This is called the autocorrelation.
It tells us whether or not our data follow the same trend over time.
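A minimal sketch of this idea: the series is paired with a copy of itself shifted by a chosen lag, and the Bravais-Pearson coefficient of the two is computed. This is the simple "correlation of lagged copies" version; some references use a slightly different estimator based on the overall mean and variance. The names and data are made up.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    co_variation = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x)
    spread_y = sum((b - mean_y) ** 2 for b in y)
    return co_variation / math.sqrt(spread_x * spread_y)


def autocorrelation(series, lag=1):
    """Pearson correlation between the series and itself shifted by `lag`."""
    if not 0 < lag < len(series) - 1:
        raise ValueError("lag must leave at least two overlapping points")
    return pearson_r(series[:-lag], series[lag:])


trend = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]  # made-up series that keeps rising
zigzag = [1, -1, 1, -1, 1, -1, 1, -1]    # made-up series that alternates

ac_trend = autocorrelation(trend)    # close to +1: each value follows the last
ac_zigzag = autocorrelation(zigzag)  # close to -1: each value reverses the last
```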
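The Spearman coefficient (ρ)
The Spearman coefficient is essentially the Pearson coefficient applied to the ranks of the observations rather than to their values. It is based on the differences between the ranks, which makes it a non-parametric measure.
The Kendall coefficient (τ)
Kendall's tau is also a non-parametric measure. It is based on the ordering of the ranks: it compares the number of pairs of observations that are ranked in the same order on both variables with the number of pairs that are ranked in opposite order.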
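The two rank-based coefficients can be sketched as follows. This is a simplified illustration that assumes there are no tied values: Spearman's coefficient is computed as the Pearson coefficient of the ranks, and Kendall's tau in its simplest form (often called tau-a) from concordant and discordant pairs. The function names and the data are made up.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    co_variation = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x)
    spread_y = sum((b - mean_y) ** 2 for b in y)
    return co_variation / math.sqrt(spread_x * spread_y)


def ranks(values):
    """Rank from 1 (smallest) to n; assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rho(x, y):
    """Spearman coefficient: Pearson coefficient of the ranks (no ties)."""
    return pearson_r(ranks(x), ranks(y))


def kendall_tau(x, y):
    """Tau-a: (concordant - discordant) / number of pairs (no ties)."""
    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            direction = (x[i] - x[j]) * (y[i] - y[j])
            if direction > 0:
                concordant += 1
            elif direction < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


x = [1, 2, 3, 4, 5]
y = [1, 8, 27, 64, 125]  # made-up data: monotonic but not linear

rho = spearman_rho(x, y)
tau = kendall_tau(x, y)
r = pearson_r(x, y)
```

On this monotonic but non-linear sample, both rank-based coefficients reach their maximum while the Pearson coefficient stays below it, which is why the rank-based coefficients suit monotonic links.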
In all three cases, the coefficients take values between -1 and 1 and are interpreted according to the following diagram:
5. Validate the significance of the study
We must check whether the results obtained are meaningful. The details of the tests are given in the articles on linear regression, multiple regression…, so only the list is given below:
- A test on the model's R²
- A test on the slope of the model
- The calculation of the confidence interval of the slope of the model
- The calculation of the p-value
- The calculation of the partial correlation coefficient, to identify whether other factors should be included in the model
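To give an idea of what the slope test, its confidence interval and the p-value involve, here is a minimal sketch for a simple linear regression. It assumes the usual Student t test with n − 2 degrees of freedom, and it approximates the t distribution numerically (Simpson's rule and bisection) only to stay within the standard library; a statistics package would normally be used instead. All function names and data are made up for the example.

```python
import math


def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    norm = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return norm * (1 + t * t / df) ** (-(df + 1) / 2)


def two_sided_p(t_stat, df, steps=2000):
    """P(|T| >= |t_stat|), integrating the density with Simpson's rule."""
    bound = abs(t_stat)
    if bound == 0:
        return 1.0
    h = bound / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(bound, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    inside = 2 * total * h / 3  # probability mass on [-bound, bound]
    return max(0.0, 1.0 - inside)


def t_critical(alpha, df):
    """Value c such that P(|T| >= c) = alpha, found by bisection."""
    low, high = 0.0, 1.0
    while two_sided_p(high, df) > alpha:
        high *= 2
    for _ in range(60):
        mid = (low + high) / 2
        if two_sided_p(mid, df) > alpha:
            low = mid
        else:
            high = mid
    return (low + high) / 2


def slope_inference(x, y, alpha=0.05):
    """Slope, its standard error, t statistic, p-value and confidence interval."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_residual = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    df = n - 2
    std_error = math.sqrt(ss_residual / df / sxx)
    t_stat = slope / std_error  # tests the hypothesis "slope = 0"
    margin = t_critical(alpha, df) * std_error
    return {
        "slope": slope,
        "std_error": std_error,
        "t": t_stat,
        "p_value": two_sided_p(t_stat, df),
        "ci": (slope - margin, slope + margin),
    }


x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2.0, 4.1, 3.8, 6.5, 6.1, 8.9, 8.2, 10.4, 11.9, 11.0]  # made-up data
result = slope_inference(x, y)
```

If the p-value is below the chosen risk level (5% here) and the confidence interval does not contain 0, the slope is considered significantly different from zero. The numerical p-value is an approximation and loses precision for very small probabilities.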
1 – F. Galton (1886) – Family likeness in stature
2 – K. Pearson (1903) – On the laws of inheritance
K. Pearson (1896) – Mathematical contributions to the theory of evolution
C. E. Spearman (1904) – The proof and measurement of association between two things
R. Rafiq (2012) – Correlation analysis
D. N. Gujarati (2003) – Basic Econometrics