**A phenomenon rarely has a single root cause. Often we face a multitude of parameters, each with a more or less strong influence on the result. That is the whole point of multiple regression.**

## Introduction

We have seen that a **simple linear regression** predicts the reaction of one variable with respect to another via a function Y = f(X). However, **a phenomenon rarely has a single cause**. For example, what explains a cork-screwing problem on a bottle? The quality of the cork, the rotation speed, the quality of the bottle, the cap/bottle friction coefficient…

**Multiple regression** can be used to find a cause-and-effect relationship between more than two variables.

But not only. A multiple regression is also used when there is a **polynomial correlation** (also called **quadratic** in the case of a polynomial of order 2: Y = a_{0} + a_{1}x + a_{2}x^{2}) between 2 variables. In other words, when we have a complex relationship that can take many non-monotonic forms (parabolic, cubic…).

## The principle

The general model is:

- For a regression with several different variables: **Y = a_{0} + a_{1}x_{1} + a_{2}x_{2} + …**
- For a polynomial regression between 2 variables: **Y = a_{0} + a_{1}x + a_{2}x^{2} + …** It should be noted that each added “order” (x squared, x cubed…) adds a fold to the curve.
- For a “*mixed*” regression: **Y = a_{0} + a_{1}x_{1} + a_{2}x_{1}^{2} + a_{3}x_{2} + a_{4}x_{3}**

We are looking for a function f that links the values of Y to those of the X, such that f(X_{i}) is as close as possible to Y_{i}. To identify the basic model, there are 2 methods:

- Forward stepwise regression: variables are entered into the model one after the other, starting with the most explanatory variable X_{i}, then the one that explains the largest part of the variance left to explain…
- Backward stepwise regression: starting from the full model, variables are eliminated one after the other, removing first the variable X_{i} that is least significant for Y…

Recent simulations show that the forward methodology tends to keep fewer explanatory variables than the backward one^{1}. For our part, as a precaution, we prefer the backward method, so as not to risk the statistical error of omitting an explanatory variable. It is the one described below.

In the same way, for a polynomial regression, it is considered that from the 5th order onwards (Y = a_{0} + a_{1}x + a_{2}x^{2} + a_{3}x^{3} + a_{4}x^{4} + a_{5}x^{5}), the model becomes too complex and difficult to analyze. It is then recommended to apply a backward regression, starting by removing the least significant coefficients.
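As a quick illustration of the quadratic case Y = a_{0} + a_{1}x + a_{2}x^{2}, here is a minimal sketch with NumPy; the data points are invented for the example:

```python
import numpy as np

# Invented parabolic data, roughly y = 1 + 2x + 3x^2 with a little noise.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.1, 5.9, 17.2, 33.8, 57.1, 86.0])

# Fit a 2nd-order (quadratic) polynomial; polyfit returns the
# coefficients from the highest order down: [a2, a1, a0].
a2, a1, a0 = np.polyfit(x, y, deg=2)
print(a0, a1, a2)
```

The fitted coefficients should come out close to the values used to generate the data (about 1, 2 and 3).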

## 1. Calculation of coefficients a_{0}, a_{1}…

The first step is to select the general model of the regression. For a multiple regression with several variables, it suffices to identify the set of “*plausible*” variables to integrate into the model.

For a polynomial regression, the general model depends on the shape of the point cloud.

Once the general model is chosen, the different coefficients are estimated via a matrix product. The procedure is as follows^{2}:

1. A matrix is created: the first column is always 1, to account for the constant coefficient *a_{0}* of the equation; the following columns contain the values of the explanatory variables.

2. The matrix product of this matrix with its transpose is calculated. The result is called the *information matrix* or *Fisher matrix*.

3. The *information matrix* is inverted. The result is called the *dispersion matrix*.

4. The matrix product of the transposed matrix with the response column Y is calculated.

5. The matrix product of the result of step 4 is carried out with the *dispersion matrix* obtained in step 3.

6. The result of step 5 gives the different coefficients a_{0}, a_{1}…

Note that with the LINEST function of Excel, all these coefficients can be calculated directly.
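The six steps above can be sketched with NumPy; the data are invented for illustration:

```python
import numpy as np

# Toy data: two explanatory variables (invented values for illustration).
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
y  = np.array([4.1, 5.9, 9.2, 10.8, 14.9, 16.1])

# Step 1: design matrix X, first column of 1s for the constant a0.
X = np.column_stack([np.ones_like(x1), x1, x2])

# Step 2: information (Fisher) matrix X'X.
info = X.T @ X

# Step 3: dispersion matrix, the inverse of the information matrix.
disp = np.linalg.inv(info)

# Step 4: product of the transposed matrix with the response column Y.
Xty = X.T @ y

# Step 5: product of the dispersion matrix with the result of step 4
# gives, in step 6, the coefficients a0, a1, a2.
a = disp @ Xty
print(a)
```

The result matches what a least-squares solver (or Excel's LINEST) returns on the same data.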

## 2. Calculate correlation and determination coefficients

As in simple linear regression studies, the different correlation coefficients are calculated. We find:

- Pearson’s R, which is deduced from the square root of R^{2}
- The R^{2}, which is calculated using the following formula^{3}:
- Finally, the adjusted R^{2}, which takes into account the fact that the more the number of explanatory variables increases, the greater the value of R^{2}. The most widely used formula is that of Ezekiel (1930)^{4}:

With:

- **n:** number of observation couples
- **p:** number of explanatory variables
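For reference, the usual expressions of the coefficient of determination and of Ezekiel's adjusted R^{2} can be written as follows (with ŷ_{i} the value predicted by the model and ȳ the mean of the observed Y):

```latex
R^{2} = 1 - \frac{\sum_{i}\left(y_i - \hat{y}_i\right)^2}{\sum_{i}\left(y_i - \bar{y}\right)^2}
\qquad\qquad
R^{2}_{adj} = 1 - \left(1 - R^{2}\right)\,\frac{n - 1}{n - p - 1}
```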

## 3. Test the correlation in its entirety

We then test the quality of the result as a whole. In other words, we check that the coefficients a_{1}, a_{2}… are not due to chance and that the identified model therefore makes it possible to predict Y.

For this, a **Fisher test** is carried out with the hypothesis H0: a_{1} = a_{2} =… = 0

**Practical value**

**Critical value**

It follows **Fisher’s law** (F.INV in Excel) for:

- **Probability:** 1 – α (α most often being 5%)
- **Degree of freedom 1:** the number of explanatory variables p
- **Degree of freedom 2:** n – p – 1

| Result | Statistical conclusion | Practical conclusion |
|---|---|---|
| Practical value > Critical value | We reject H0 | It is concluded that the identified model is correct. |
| Practical value < Critical value | We retain H0 | It is concluded that the model is not significant: variables are superfluous or missing. |

If the test is not significant, first check whether **aberrant values** are present. If not, continue the process and go to step 4 to understand which values are significant and which are not.

We also calculate the **p-value**. This follows **Fisher’s law** and is calculated for the p explanatory variables and n – p – 1 degrees of freedom. Here too, the p-value is read, as always, via a scale:

- **≤ 0.01:** strong significance
- **between 0.01 and 0.05:** moderate significance
- **≥ 0.05:** low significance
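The global test of this step can be sketched as follows, using the standard F ratio built from R^{2}; the helper name and the data are invented for the example:

```python
import numpy as np
from scipy import stats

def global_f_test(y, y_hat, p, alpha=0.05):
    """Fisher test of H0: a1 = a2 = ... = 0 (hypothetical helper name)."""
    n = len(y)
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    # Practical value: explained variance per variable over residual variance.
    f_prac = (r2 / p) / ((1.0 - r2) / (n - p - 1))
    # Critical value: Fisher's law, the equivalent of F.INV in Excel.
    f_crit = stats.f.ppf(1.0 - alpha, p, n - p - 1)
    p_value = stats.f.sf(f_prac, p, n - p - 1)
    return f_prac, f_crit, p_value

# Invented data with one explanatory variable (p = 1).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 4.0, 6.2, 7.9, 10.1, 12.0])
X = np.column_stack([np.ones_like(x), x])
coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
f_prac, f_crit, p_value = global_f_test(y, X @ coefs, p=1)
```

Here the relation is very strong, so the practical value far exceeds the critical value and H0 is rejected.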

## 4. Test coefficient correlation by coefficient

Each variable’s coefficient is tested to understand whether it is significant or not. For this, we pose the hypothesis H0: a_{1} = 0, then a_{2} = 0…

**Practical value**

The estimated σ of the coefficient is calculated by the LINEST function of Excel or by the following formula: **corresponding diagonal term of the dispersion matrix (step 3) × residual sum of squares/(n – p – 1)**

**Critical value**

The critical value follows a **Student’s law** for a given risk α and n – p – 1 degrees of freedom.

| Result | Statistical conclusion | Practical conclusion |
|---|---|---|
| Practical value > Critical value | We reject H0 | It is concluded that the coefficient is significant. |
| Practical value < Critical value | We retain H0 | It is concluded that the coefficient is not significant. Be careful: do not remove these variables from the model immediately. The coefficients correspond to partial contributions and take the other variables into account. Thus, if variables are correlated, they interfere with each other and share their influence, so that individually they may not seem interesting. |

We also calculate the **p-value**. It follows **Student’s law** and is calculated for the coefficient a_{i} of the variable and n – p – 1 degrees of freedom. Here too, the p-value is read, as always, via a scale:

| Result | Statistical conclusion | Practical conclusion |
|---|---|---|
| p-value < α | We reject H0 | The coefficient is significant, with a risk of being wrong of p-value %. |
| p-value > α | We retain H0 | The coefficient is not significant. |
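The coefficient-by-coefficient test can be sketched as follows; the helper name and the data are invented, and the standard errors are computed from the diagonal of the dispersion matrix as described above:

```python
import numpy as np
from scipy import stats

def coefficient_t_tests(X, y, alpha=0.05):
    """Student test of H0: a_i = 0 for each coefficient (hypothetical helper).
    X must already contain the leading column of 1s for a0."""
    n, k = X.shape                        # k = p + 1 (constant included)
    disp = np.linalg.inv(X.T @ X)         # dispersion matrix
    a = disp @ X.T @ y                    # coefficients a0, a1, ...
    dof = n - k                           # n - p - 1
    resid_var = np.sum((y - X @ a) ** 2) / dof
    se = np.sqrt(np.diag(disp) * resid_var)       # estimated sigma of each coefficient
    t_prac = a / se                               # practical values
    t_crit = stats.t.ppf(1.0 - alpha / 2.0, dof)  # two-sided critical value
    p_values = 2.0 * stats.t.sf(np.abs(t_prac), dof)
    return t_prac, t_crit, p_values

# Invented data: y depends strongly on x (slope close to 2).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([3.1, 4.9, 7.2, 9.0, 11.1, 12.9, 15.2, 17.0])
X = np.column_stack([np.ones_like(x), x])
t_prac, t_crit, p_values = coefficient_t_tests(X, y)
```

On this data the slope coefficient is clearly significant: its practical value exceeds the Student critical value.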

## 5. Test a block of coefficients

This test helps to understand whether some variables are “*too much*” in the model. Generally, we decide to “*exit*” the variable(s) with a slightly or not at all significant coefficient (practical value < critical value) detected in the previous step.

A test is carried out between 2 coefficients of determination:

- The initial R^{2}, with all the variables of the study
- Another R_{1}^{2}, which takes into account the set of variables minus the variable(s) we wish to remove and which a priori are not significant in the model. The number of variables removed (total number of variables minus the number of variables in this second model) is called Q.

The hypothesis H0 is posed: a_{1} = a_{2} = … = 0 for the Q removed variables.

The practical value is calculated via the following ratio^{5}:

**Practical value**

**Critical value**

It follows **Fisher’s law** (F.INV in Excel) for:

- Probability: 1 – α (α most often being 5%)
- Degree of freedom 1 = the number of removed variables Q
- Degree of freedom 2 = n – p – 1

| Result | Statistical conclusion | Practical conclusion |
|---|---|---|
| Practical value > Critical value | We reject H0 | We consider that the second model is not equivalent to the first: the variables we removed have an influence and must be kept. |
| Practical value < Critical value | We retain H0 | We consider that the second model teaches us nothing new. In other words, the variables removed from the initial model have no influence, and we can remove them to simplify the equation. We therefore retain the second model. |

Then we calculate the **p-value**. It also follows **Fisher’s law** and is calculated for the Q removed variables and n – p – 1 degrees of freedom. Here too, the p-value is read via a scale:

| Result | Statistical conclusion | Practical conclusion |
|---|---|---|
| p-value < α | We reject H0 | The removed variables have an influence and must be kept, with a risk of being wrong of p-value %. |
| p-value > α | We retain H0 | The removed variables have no influence; the simplified model is retained. |

This is done by successive iterations until we obtain a strong significance and a high correlation coefficient.
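Under the assumption that the ratio referred to above is the standard partial F statistic comparing the two coefficients of determination, the test can be sketched as follows (helper name and numbers invented):

```python
from scipy import stats

def block_f_test(r2_full, r2_reduced, n, p, q, alpha=0.05):
    """Partial Fisher test of H0: the q removed coefficients are all 0
    (hypothetical helper; standard partial F ratio assumed)."""
    f_prac = ((r2_full - r2_reduced) / q) / ((1.0 - r2_full) / (n - p - 1))
    f_crit = stats.f.ppf(1.0 - alpha, q, n - p - 1)   # F.INV in Excel
    p_value = stats.f.sf(f_prac, q, n - p - 1)
    return f_prac, f_crit, p_value

# Invented example: removing q = 2 of the p = 5 variables barely changes R^2.
f_prac, f_crit, p_value = block_f_test(r2_full=0.95, r2_reduced=0.94,
                                       n=30, p=5, q=2)
```

Here the drop in R^{2} is small relative to the residual variance, so H0 is retained and the two variables can be removed.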

## Source

1 – F. G. Blanchet, P. Legendre, D. Borcard (2008) – Forward selection of explanatory variables

2 – P. Legendre, L. F. J. Legendre (1998) – Numerical ecology

3 – B. Scherrer (2009) – Biostatistics

4 – M. Ezekiel (1930) – Methods of correlation analysis

5 – J. Jaccard, R. Turrisi (2003) – Interaction effects in multiple regression

R. Rafiq (2013) – Econometrics, simple and multiple linear regression

Z. Aïvazian (1978) – Statistical study of dependencies

R. Bourbonnais (1998) – Econometrics

P. Bressoux (2008) – Statistical modelling applied to the social sciences

J. Condone, M. Le Leigh (2006) – First steps in linear regression with SAS

P. Moshood (2006) – Theoretical and applied statistics – statistical inference in one and two dimensions

D. Laffly (2006) – Multiple regression: principles and examples of application

Y. Dodge, V. Rousse (2004) – Applied regression analysis

D. Borcard (2012) – Multiple regression

B. Delyon (2014) – Regression