Logistic regression is a regression model used when the variable to be explained is qualitative. For example, if we want to predict the appearance of a defect (good/not good), the variable to explain is qualitative.
A prediction model is constructed from one or more explanatory variables X, which may be quantitative or qualitative. The model is written in the following way:
Y = a + a1 * x1 + a2 * x2…
This method is especially popular in the health sciences and the social sciences, where the variable to predict is often the presence or absence of a disease. For example, a study of major depression may look for the factors that best predict it by examining variables such as age, sex, self-esteem, interpersonal relationships… (1)
The principle is to calculate the probability that one event occurs relative to the probability of another. This ratio is called the odds. Let Y be a qualitative variable with J modalities. The odds express how likely one modality is to occur relative to another.
For example, we want to build a model that predicts the appearance of a defect. The probability that the defect occurs is compared with the probability that it does not occur, as a function of one or more explanatory parameters. If the probability of occurrence of the defect (which we note π) is 0.2, then the odds are 0.2/0.8, or 0.25. In other words, the odds of a defect are 1 to 4: not having a defect is four times more likely than having one.
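As a minimal sketch, the odds calculation above can be reproduced in Python. The function name `odds` is illustrative and does not come from the original article.

```python
def odds(pi):
    """Return the odds pi / (1 - pi) of an event with probability pi."""
    if not 0.0 <= pi < 1.0:
        raise ValueError("pi must be in [0, 1)")
    return pi / (1.0 - pi)

# Example from the text: a defect with probability 0.2
defect_odds = odds(0.2)   # about 0.25, i.e. odds of 1 to 4
```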
In logistic regression, this is expressed with the logit function: the natural logarithm of the probability of belonging to a group divided by the probability of not belonging to it. It is written:
C = Logit (π) = ln (π/(1 – π)) = a + a1 * x1 + a2 * x2 +…
Where π is the probability of the occurrence of the event you wish to study.
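The logit and its inverse can be sketched as follows. The names `logit` and `inverse_logit` are illustrative; the inverse is obtained by solving ln(π/(1 – π)) = C for π.

```python
import math

def logit(pi):
    """Natural log of the odds: ln(pi / (1 - pi)), for 0 < pi < 1."""
    return math.log(pi / (1.0 - pi))

def inverse_logit(c):
    """Recover the probability pi from a logit value c."""
    return 1.0 / (1.0 + math.exp(-c))

c = logit(0.2)               # ln(0.25), negative because pi < 0.5
pi_back = inverse_logit(c)   # returns to approximately 0.2
```

A logit of 0 corresponds to π = 0.5, that is, odds of 1 to 1.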
The regression coefficients are calculated iteratively. Starting from initial values for a, a1…, the goal is to maximize the likelihood, which means fitting the model as closely as possible to the point cloud (2). The likelihood is the probability of obtaining the observed points given the model's estimated coefficients.
Step 1: Encode Y values
The first, fundamental step of the analysis is to transform the modalities of the variable Y into the values 0 and 1. In a logistic regression study, the variable to explain is qualitative. For example, it can be good/not good if we want to predict the appearance of a defect, or mortgage credit/consumer credit/revolving credit if we want to predict the type of credit a person takes out according to their age…
These values are called the modalities. In the first case there are 2 modalities, in the second 3. The goal of this first step is to turn them into 0 and 1.
Case 1: Two modalities
For example, we want to predict the appearance of a defect according to a machine setting, a type of raw material… The variable Y can take the modality good or not good.
We replace good with 0 and not good with 1.
Case 2: More than two modalities
For example, we want to study the likelihood of purchasing a product that is available in 3 configurations, called A, B and C.
We create a table with one 0/1 column per configuration. Each time a person bought product A, we put a 1 in column A; this value is 0 when the person bought product B or C. And so on for columns B and C.
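The two encoding cases can be sketched in Python as follows. The helper names and the sample data are hypothetical and serve only to illustrate the 0/1 coding.

```python
def encode_binary(values, positive):
    """Return 1 where the value equals the `positive` modality, else 0."""
    return [1 if v == positive else 0 for v in values]

def encode_indicators(values):
    """Return one 0/1 column per modality, keyed by modality name."""
    modalities = sorted(set(values))
    return {m: [1 if v == m else 0 for v in values] for m in modalities}

# Case 1: good -> 0, not good -> 1
quality = ["good", "not good", "good", "good"]
y = encode_binary(quality, positive="not good")

# Case 2: purchases of configurations A, B, C (hypothetical data)
purchases = ["A", "C", "B", "A"]
columns = encode_indicators(purchases)
```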
Step 2: Identify the starting values
The solver in the calculation software requires starting values for the coefficients a, a1, a2… of the model equation. In theory, we could choose these values completely at random.
In practice, to ease the calculation and reduce processing time, it helps to enter values that we believe are close to reality. If we have no idea about the values, we can test several sets of starting values and check that we always get the same result.
Step 3: Calculate the Log likelihood
Named LL, it is the logarithm of the likelihood. The logarithm is used only to simplify the calculations. It depends on several values, detailed below.
1. Calculate the logit for each value of Y – Ck
The logit function C is calculated first for each value of Y, using the following formula:
Ck = a0,k + a1,k * x1 + a2,k * x2…
2. Calculate the value of π
For each value of Y, the probability of occurrence π is calculated. In the two-modality case, it is obtained by inverting the logit with the following formula:
π = exp(C) / (1 + exp(C))
3. Calculate the log likelihood per point – LL
The log likelihood is also computed for each value of Y. It measures how far the model is from the point. It is calculated using the following formula:
In the case of a two-modality variable Y (good, not good), with Y encoded 0 or 1:
LL = Y * ln(π) + (1 – Y) * ln(1 – π)
In the case of a multimodal variable Y (blue, red, green…), with one 0/1 indicator Yk and one probability πk per modality:
LL = Y1 * ln(π1) + Y2 * ln(π2) + …
4. Calculate the total log likelihood – LLm
Finally, the log likelihoods of all the points are added together to obtain the total log likelihood of the model, which we call LLm.
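The four sub-steps can be sketched for a two-modality Y and a single explanatory variable. This assumes the usual Bernoulli form of the log likelihood, y * ln(π) + (1 – y) * ln(1 – π); the function names and the readings are hypothetical.

```python
import math

def point_log_likelihood(y, x, a0, a1):
    """Log likelihood of one reading (y is 0 or 1) for a one-variable model."""
    c = a0 + a1 * x                      # 1. logit
    pi = 1.0 / (1.0 + math.exp(-c))      # 2. probability of occurrence
    return y * math.log(pi) + (1 - y) * math.log(1.0 - pi)  # 3. LL per point

def total_log_likelihood(ys, xs, a0, a1):
    """4. Sum of the per-point log likelihoods (LLm)."""
    return sum(point_log_likelihood(y, x, a0, a1) for y, x in zip(ys, xs))

# Hypothetical readings: x is a machine setting, y = 1 means "not good"
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [0, 0, 1, 0, 1, 1]

ll_start = total_log_likelihood(ys, xs, a0=0.0, a1=0.0)    # all pi = 0.5
ll_better = total_log_likelihood(ys, xs, a0=-3.5, a1=1.0)  # closer to the data
```

A log likelihood closer to 0 indicates a model that fits the points better, which is why `ll_better` is larger than `ll_start` here.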
Step 4: Calculating deviance
This is the last step before optimizing the model. The deviance (also called residual deviance) is the value we want to minimize when optimizing the model. It measures the total deviation of the points from the model and plays the same role as the sum of squared deviations in a linear regression model. Deviance is calculated using the following formula:
Dm = –2 * LLm
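In code, the deviance is a one-line transformation of the total log likelihood. The value of `ll_m` below is hypothetical.

```python
def deviance(log_likelihood):
    """Deviance D = -2 * LL."""
    return -2.0 * log_likelihood

# Hypothetical total log likelihood of a model
ll_m = -2.51
d_m = deviance(ll_m)   # about 5.02
```

Because the log likelihood is negative or zero, the deviance is positive or zero, and maximizing the likelihood is the same as minimizing the deviance.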
Step 5: Optimize the model
The model is computed by successive iterations of an algorithm whose task is to minimize the deviance with respect to the parameters a, a1, a2… This calculation must be done with software (Excel, Minitab, SPSS…) because the number of iterations can be very large.
There are several iterative algorithms (Newton-Raphson, Fisher scoring…), which is why results can differ between software packages. The differences are generally small and are no cause for concern.
In Excel, the calculation is done with the Solver add-in, which must be loaded first. The explanations are given at: http://office.microsoft.com/fr-fr/excel-help/definir-et-resoudre-un-probleme-a-laide-du-solveur-HP010342416.aspx
In the Solver, we ask it to minimize the deviance by adjusting the parameters a, a1, a2… of the model.
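For readers without a solver at hand, here is a minimal Newton-Raphson sketch for a model with a single explanatory variable and a 0/1 variable Y. It is an illustration under those assumptions, not the procedure Excel's Solver follows; the function name `fit_logistic` and the data are hypothetical.

```python
import math

def fit_logistic(xs, ys, a0=0.0, a1=0.0, max_iter=50, tol=1e-10):
    """Newton-Raphson fit of ln(pi / (1 - pi)) = a0 + a1 * x.

    Returns (a0, a1, deviance). Starting values default to 0.
    """
    for _ in range(max_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            pi = 1.0 / (1.0 + math.exp(-(a0 + a1 * x)))
            w = pi * (1.0 - pi)
            g0 += y - pi           # gradient of LL with respect to a0
            g1 += (y - pi) * x     # gradient of LL with respect to a1
            h00 += w               # entries of the 2x2 information matrix
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        step0 = (h11 * g0 - h01 * g1) / det
        step1 = (h00 * g1 - h01 * g0) / det
        a0, a1 = a0 + step0, a1 + step1
        if abs(step0) < tol and abs(step1) < tol:
            break
    ll = 0.0
    for x, y in zip(xs, ys):
        pi = 1.0 / (1.0 + math.exp(-(a0 + a1 * x)))
        ll += y * math.log(pi) + (1 - y) * math.log(1.0 - pi)
    return a0, a1, -2.0 * ll

# Hypothetical readings (not perfectly separable, so the optimum is finite)
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [0, 0, 1, 0, 1, 1]

a0, a1, d_m = fit_logistic(xs, ys)
# A second run from different starting values should reach the same optimum
b0, b1, d_b = fit_logistic(xs, ys, a0=-1.0, a1=0.5)
```

Running the fit from two sets of starting values mirrors the check suggested in Step 2: if both runs agree, the result does not depend on the starting point.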
Step 6: Analyze the Model
The question is whether our model is close to reality and can predict the behavior of the dependent variable Y. To answer it, we calculate the pseudo R2, which is analogous to the R2 (coefficient of determination) of other types of regression.
It is calculated by comparing the model we have just optimized with a so-called "trivial" model that contains only the constant a. In other words, we compare our presumed best model with the most basic model we could build.
1. Calculating the parameters of the trivial model
First, we compute the log likelihood of the trivial model, called LL0, and the associated deviance D0.
The formula for LL0 depends on the number of modalities of the variable Y.
Case 1: dichotomous variable
If the variable has 2 modalities (good or not good, for example), LL0 is calculated in the following way:
LL0 = n+ * ln(p+) + (n – n+) * ln(1 – p+)
- n: the total number of readings
- n+: the number of positive values
- p+: the proportion of positive values
For example, we have 20 data records, of which 6 have a positive Y value. That gives us:
- n = 20
- n+ = 6
- p+ = 6/20 = 0.3
- LL0 = 6 * ln(0.3) + 14 * ln(0.7) ≈ –12.22
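The example can be checked with a short sketch. It assumes the usual expression for the constant-only model, LL0 = n+ * ln(p+) + (n – n+) * ln(1 – p+); the function name is illustrative.

```python
import math

def trivial_log_likelihood(n, n_pos):
    """LL0 of the constant-only model for a 0/1 variable."""
    p_pos = n_pos / n
    return n_pos * math.log(p_pos) + (n - n_pos) * math.log(1.0 - p_pos)

ll_0 = trivial_log_likelihood(n=20, n_pos=6)   # about -12.22
d_0 = -2.0 * ll_0                              # about 24.43
```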
Case 2: Multimodal variable
Here the variable Y has several modalities (blue, green, red…). LL0 is calculated using the following formula:
LL0 = n1 * ln(n1/n) + n2 * ln(n2/n) + … + nK * ln(nK/n)
- n: the total number of readings
- nk: the number of positive values for modality k of variable Y
Let us take the previous example, where we have 20 readings, but now with a variable Y having 3 modalities with respectively 4, 10 and 6 values equal to 1. We get:
LL0 = 4 * ln(4/20) + 10 * ln(10/20) + 6 * ln(6/20) ≈ –20.59
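The multimodal case can be sketched the same way, assuming the usual expression in which each modality k contributes nk * ln(nk/n). The function name is illustrative.

```python
import math

def trivial_log_likelihood_multi(counts):
    """LL0 of the constant-only model from the count of each modality."""
    n = sum(counts)
    return sum(n_k * math.log(n_k / n) for n_k in counts if n_k > 0)

ll_0 = trivial_log_likelihood_multi([4, 10, 6])   # about -20.59
```

With only two counts, for instance `[6, 14]`, the same function gives the dichotomous result.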
2. Calculation of the Pseudo R2
The pseudo R2 can be calculated with various methods; the best known include those of Cox and Snell and of Nagelkerke. However, McFadden's pseudo R2 is recognized as the most suitable for logistic regression (3). It is calculated using the following formula:
R2MF = 1 – LLm / LL0
Its interpretation is rather simple:
- LLm = LL0: R2MF is equal to 0, so we consider that our model does no better than the trivial model.
- Conversely, the better our model, the closer R2MF is to 1.
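A minimal sketch of this indicator, assuming McFadden's usual definition 1 – LLm / LL0. The log likelihood values are hypothetical.

```python
def mcfadden_r2(ll_model, ll_trivial):
    """McFadden pseudo R2: 1 - LLm / LL0."""
    return 1.0 - ll_model / ll_trivial

# Hypothetical log likelihoods
r2_same = mcfadden_r2(ll_model=-12.22, ll_trivial=-12.22)  # 0.0: no better than trivial
r2_good = mcfadden_r2(ll_model=-3.0, ll_trivial=-12.22)    # closer to 1
```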
Step 7: Evaluating the Model
Now that we have built the logistic regression model and checked that it is "better" than the trivial model, we will evaluate it. To do so, we calculate and interpret several indicators.
1. Build the contingency matrix
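The objective of this matrix is to compare the observed values with the values predicted by the model. A table of this form is constructed, with the observed values in rows and the result of the model we built in columns:
Observed / Predicted | Positive | Negative | Total
Positive | A | B | A + B
Negative | C | D | C + D
Total | A + C | B + D | A + B + C + D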
It reads as follows:
- A and D are the correct predictions: when our model predicted a modality, that modality was the one actually observed.
- B and C are the errors: when our model predicted a modality, it was not the one actually observed.
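The four cells can be counted with a short sketch. It assumes the layout implied by the sensitivity formula A/(A + B) given further on: observed values in rows, predicted values in columns. The function name and the data are hypothetical.

```python
def contingency_counts(observed, predicted):
    """Return (A, B, C, D) for 0/1 observed and predicted values.

    A: observed 1, predicted 1    B: observed 1, predicted 0
    C: observed 0, predicted 1    D: observed 0, predicted 0
    """
    a = b = c = d = 0
    for obs, pred in zip(observed, predicted):
        if obs == 1 and pred == 1:
            a += 1
        elif obs == 1:
            b += 1
        elif pred == 1:
            c += 1
        else:
            d += 1
    return a, b, c, d

# Hypothetical observed values and model predictions
observed = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]
A, B, C, D = contingency_counts(observed, predicted)
```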
2. Calculation of the error rate ε
From this matrix, a first indicator is derived: the error rate ε. It relates the number of errors made by the model to the total number of readings, and estimates the probability that our model makes a mistake. It is calculated in the following way:
ε = (B + C) / n
- n: the total number of readings
- B and C: the misclassified readings in our matrix
Note the case where a first model has been established with, for example, 3 explanatory variables X, and we wish to compare it with another model that we think is better but that has only 2 explanatory variables X. In this case the error rate comparison is skewed, and it is better to use the other indicators given below.
3. Calculation of the success rate – θ
In contrast to the error rate, the success rate is the probability that our model finds the right modality of the variable Y. It is the complement of the error rate and is calculated simply as:
θ = 1 – ε
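Both rates follow directly from the four cells of the matrix. The sketch below assumes that the errors are the cells B and C; the cell values are hypothetical.

```python
def error_rate(a, b, c, d):
    """Share of misclassified readings: (B + C) / n."""
    return (b + c) / (a + b + c + d)

def success_rate(a, b, c, d):
    """Complement of the error rate: theta = 1 - epsilon."""
    return 1.0 - error_rate(a, b, c, d)

# Hypothetical matrix: A = 3, B = 1, C = 1, D = 3
epsilon = error_rate(3, 1, 1, 3)    # 0.25
theta = success_rate(3, 1, 1, 3)    # 0.75
```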
4. Calculation of Odds Ratio – OR
As we have seen above, the aim of logistic regression is to calculate the odds of one event relative to another. From the equations obtained by logistic regression, the odds ratio associated with an explanatory variable is the exponential of its coefficient, for example OR = e^(a1) for the variable x1. It is interpreted as follows:
- OR = 1: the variable Y is independent of this variable X, which can be removed from the model.
- OR > 1: the probability of Y increases with X.
- OR < 1: the probability of Y decreases with X.
For example, suppose the equation obtained for modality 1 relative to modality 3 is as follows:
C1 = C1/3 = 0.16508 – 0.38229 * X1
The odds ratio is then:
OR (1/3) = e^(–0.38229) = 0.68
In other words, each one-unit increase in X1 multiplies the odds of choosing 1 rather than 3 by 0.68; equivalently, it multiplies the odds of choosing 3 rather than 1 by 1/0.68 ≈ 1.47.
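The figures of this example can be checked as follows. The function name `odds_ratio` is illustrative.

```python
import math

def odds_ratio(coefficient):
    """Odds ratio for a one-unit increase of the variable: exp(coefficient)."""
    return math.exp(coefficient)

or_1_vs_3 = odds_ratio(-0.38229)   # about 0.68
inverse = 1.0 / or_1_vs_3          # about 1.47
```

A negative coefficient always gives an odds ratio below 1, and its inverse gives the odds ratio for the opposite comparison.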
5. Interpretation and other indicators
A good model must have a low error rate, as close to 0 as possible, and conversely a high success rate, as close to 1 as possible.
There are other indicators (Youden index, F-measure…) to evaluate the model. Among the most used:
- The sensitivity, also called recall or true positive rate, indicates the ability of the model to retrieve the positive modalities. It is calculated using the following formula (with reference to our table):
Sensitivity = true positives of modality X / observed total of modality X = A/(A + B)
- The precision indicates the proportion of true positives among the individuals that were classified as positive. It is calculated using the following formula:
Precision = true positives of modality X / predicted total of modality X = A/(A + C)
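These two indicators can be sketched directly from the cells A, B and C of the matrix. The cell values are hypothetical.

```python
def sensitivity(a, b):
    """True positives over all observed positives: A / (A + B)."""
    return a / (a + b)

def precision(a, c):
    """True positives over all predicted positives: A / (A + C)."""
    return a / (a + c)

# Hypothetical matrix: A = 30, B = 10, C = 20, D = 40
sens = sensitivity(30, 10)   # 0.75
prec = precision(30, 20)     # 0.6
```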
Step 8: Evaluation of the significance
1. Likelihood ratio test
This test compares the "optimized" model with another model from which one or more variables have been removed. Depending on the comparison model chosen, we draw conclusions about the significance of the model as a whole or of individual variables.
Two cases occur:
- We want to test the model as a whole. We build a comparison model of the form Y = a0.
- We want to test the significance of the variables one after the other.
We set the null hypothesis H0: a1 = a2 = … = 0. In other words, if the model loses nothing when a variable (or several variables) is removed, then it can probably be removed to simplify the model.
The significance of the result is evaluated with a χ2 distribution. Its number of degrees of freedom is calculated using the following formula:
Number of variables in the optimized model – number of variables in the comparison model
For example:
To test our model against the trivial model, the number of degrees of freedom is equal to the number of variables in our optimized model.
To test our optimized model Y = a0 + a1 * x1 + a2 * x2, we build the model Y = a0 + a1 * x1 (so we test the significance of the variable x2). The number of degrees of freedom is 2 – 1 = 1.
To do this, the solver is run again with the coefficient of the variable we want to test forced to 0. We then test the difference between the deviance DH0 of the constrained model and the deviance DM of the optimized model.
If the p-value given by the χ2 distribution (CHIDIST function in Excel) is significant, then H0 is rejected, and we conclude that the optimized model performs better than the simplified model we just calculated.
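The test can be sketched with the standard library only. The helper `chi2_sf` is a hypothetical stand-in for the right-tail probability that CHIDIST returns; it uses closed forms that hold for integer degrees of freedom. The deviances are hypothetical.

```python
import math

def chi2_sf(x, df):
    """Upper tail probability of a chi-square law, integer df >= 1, x >= 0."""
    z = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-z)               # closed form for df = 2
    else:
        a, q = 0.5, math.erfc(math.sqrt(z))    # closed form for df = 1
    # Recurrence Q(a + 1, z) = Q(a, z) + z**a * exp(-z) / gamma(a + 1)
    while a < df / 2.0:
        q += z ** a * math.exp(-z) / math.gamma(a + 1.0)
        a += 1.0
    return q

def likelihood_ratio_test(deviance_h0, deviance_model, df):
    """Return (statistic, p_value) for the difference of deviances."""
    statistic = deviance_h0 - deviance_model
    return statistic, chi2_sf(statistic, df)

# Hypothetical deviances: trivial model versus a model with 1 variable
stat, p_value = likelihood_ratio_test(24.43, 18.20, df=1)
significant = p_value < 0.05   # H0 is rejected at the 5% level if True
```

The 5% threshold is a common convention and is an assumption here; the article does not state which level to use.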
2. The ROC curve
The ROC curve (Receiver Operating Characteristic) is a graphical tool well suited to logistic regression. Coupled with the AUC criterion (area under the curve), it makes it possible to evaluate visually the quality of the model we have just built.
The ROC curve plots the rate of true positives (sensitivity) against the rate of false positives (1 – specificity). The curve is constructed in the following way:
- Calculate the score π of each individual using the prediction model
- Calculate the rates of true positives and false positives for each value of π used as a threshold
- The ROC curve is the graph that connects the (false positive rate, true positive rate) points. The first point is necessarily (0, 0), the last (1, 1).
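The construction steps above can be sketched as follows. The function name, the scores and the observed values are hypothetical; an individual is predicted positive when its score is at least equal to the threshold.

```python
def roc_points(scores, labels):
    """Return the list of (false positive rate, true positive rate) points.

    Each distinct score is used in turn as the classification threshold.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points

# Hypothetical scores pi and observed 0/1 values
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
curve = roc_points(scores, labels)
```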
When the model is perfect, all the positives are ranked ahead of the negatives and the ROC curve hugs the left and top edges of the plot.
When the model is bad, it assigns the results at random and the ROC curve is a straight line at 45°.
From this curve, the quality of the model is summarized numerically by the AUC indicator. It expresses the probability that the model ranks a positive individual ahead of a negative one. It is interpreted according to the following table:
- AUC = 0.5: the ROC curve is a straight line at 45°; the model predicts the behavior of the variable at random.
- 0.7 <= AUC < 0.8: Acceptable level
- 0.8 <= AUC < 0.9: Excellent level
- AUC >= 0.9: Exceptional level
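The probabilistic reading of the AUC can be sketched by comparing every positive individual with every negative one. The function name and the data are hypothetical; counting ties as one half is a common convention assumed here.

```python
def auc(scores, labels):
    """Probability that a random positive scores higher than a random negative.

    Ties count for one half.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical scores pi and observed 0/1 values
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
area = auc(scores, labels)   # 8 of the 9 positive/negative pairs are well ranked
```

With these data the AUC is about 0.89, which falls in the 0.8 to 0.9 band of the table above.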
1 – B. G. Tabachnick, L. S. Fidell (2000) – Using multivariate statistics
2 – C. D. Howell (1998) – Statistical methods in the social sciences and humanities
3 – S. Menard (2002) – Applied Logistic regression analysis
J. Jacques (2013) – Statistical modeling
J. Bouyer (2012) – Logistic regression, quantitative variable modeling
O. Godechot (2012) – Introduction to Regression
J. Desjardins (2005) – Logistic regression analysis
E. A. Saulean, N. Meyer (2009) – Logistic regression