Introduction
Hypothesis tests compare groups of data to determine whether or not they are similar. They can also compare a group of data with an expected target to check whether it is on target. Finally, they can support or supplement design of experiments and regression when analyzing cause-and-effect relationships. The questions addressed can be of any kind:
- Has the defect percentage really decreased as a result of the actions taken?
- Have we significantly lowered the inventory level?
- Are more than half of the employees driving the improvement process?
- Does changing this part really have an impact on product quality?
- …
Step 0: Define the purpose of the test
Compare a sample with a target
Data type | Qualitative, 2 modalities | Qualitative, more than 2 modalities | Quantitative |
---|---|---|---|
Example | We want to compare a defect percentage (good / not good) with a target percentage. | We want to compare percentages, for data divided into different categories, with expected (target) percentages. | We want to compare a mean or a standard deviation with a desired target. |
Test to use | Student's t-test for a proportion | χ2 goodness-of-fit test | Student's t-test for a mean; χ2 test for a standard deviation |
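As an illustration of the quantitative and proportion cases above, the hedged sketch below runs one-sample comparisons against a target with SciPy; the sample values, the 3% target and the 10.0 nominal value are made-up assumptions, and the exact binomial test is used here as a stand-in for the one-proportion test named in the table.

```python
# Minimal sketch (illustrative data): comparing one sample with a target.
from scipy import stats

# Proportion vs. target: 4 defects observed out of 120 parts, target p0 = 3 %.
# SciPy's exact binomial test stands in for the one-proportion test.
res_prop = stats.binomtest(k=4, n=120, p=0.03, alternative="two-sided")
print("proportion vs 3% target, p-value:", res_prop.pvalue)

# Mean vs. target: Student's t-test for one sample against a nominal value of 10.0.
sample = [9.8, 10.1, 10.3, 9.7, 10.0, 10.2, 9.9, 10.4]
t_stat, p_value = stats.ttest_1samp(sample, popmean=10.0)
print("mean vs 10.0 target, t =", t_stat, ", p-value:", p_value)
```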
Compare 2 samples together
Data type | Qualitative, 2 modalities | Qualitative, more than 2 modalities | Ordinal qualitative | Quantitative |
---|---|---|---|---|
Example | We want to compare the defect percentages (good / not good) of 2 samples. | We want to compare percentages for data divided into various categories. | We want to compare the rankings proposed by several juries. | We want to compare the means or the standard deviations of 2 samples. |
Independent data | Student's t-test for proportions; χ2 test for association | χ2 test for association | Wilcoxon-Mann-Whitney test | Student's t-test for the means; Brown-Forsythe or Fisher-Snedecor test for the variances (Brown-Forsythe is preferable) |
Paired data | McNemar test | Wilcoxon test | Wilcoxon test | Student's t-test for paired means or for variances |
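To make the two-sample, quantitative case concrete, here is a hedged sketch contrasting the parametric and non-parametric routes with SciPy; the measurement values are invented for illustration. For paired data, scipy.stats.ttest_rel and scipy.stats.wilcoxon play the corresponding roles.

```python
# Minimal sketch (illustrative data): comparing 2 independent samples.
from scipy import stats

group_a = [5.1, 4.9, 5.3, 5.0, 5.2, 4.8, 5.1]
group_b = [5.4, 5.6, 5.3, 5.7, 5.5, 5.4, 5.6]

# Parametric route: Student's t-test for the means (equal_var=True assumes similar variances).
t_stat, p_t = stats.ttest_ind(group_a, group_b, equal_var=True)

# Non-parametric route (rank based): Wilcoxon-Mann-Whitney test.
u_stat, p_u = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")

print("t-test p-value:", p_t, "| Mann-Whitney p-value:", p_u)
```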
Compare more than 2 samples
Data type | Qualitative, 2 modalities | Qualitative, more than 2 modalities | Ordinal qualitative | Quantitative |
---|---|---|---|---|
Example | We want to compare the defect percentages (good / not good) of several samples. | We want to compare percentages for data divided into various categories. | We want to compare the rankings proposed by several juries. | We want to compare the means or standard deviations of more than 2 samples. |
Independent data | χ2 test for association | χ2 test for association | Kruskal-Wallis test; Kendall's tau; Spearman's rho | ANOVA for the means; Brown-Forsythe test for the variances |
Paired data | Cochran's Q test | Friedman test | Friedman test | Friedman test; blocked ANOVA; MANOVA for the means |
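For the quantitative column of this last table, the hedged sketch below contrasts the parametric and non-parametric routes with SciPy; the three groups of measurements are invented for illustration.

```python
# Minimal sketch (illustrative data): comparing more than 2 samples.
from scipy import stats

s1 = [20.1, 19.8, 20.3, 20.0, 19.9]
s2 = [20.5, 20.7, 20.4, 20.6, 20.8]
s3 = [19.7, 19.9, 19.6, 20.0, 19.8]

# Parametric route, independent data: one-way ANOVA on the means.
f_stat, p_anova = stats.f_oneway(s1, s2, s3)

# Non-parametric route, independent data: Kruskal-Wallis test on the ranks.
h_stat, p_kw = stats.kruskal(s1, s2, s3)

# Non-parametric route, paired data: Friedman test
# (position i is treated as the same block measured under the three conditions).
chi_stat, p_fr = stats.friedmanchisquare(s1, s2, s3)

print(p_anova, p_kw, p_fr)
```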
Non-parametric vs. parametric tests
In the previous tables, we find 2 families of tests:
- Parametric tests : they compare the different groups under a number of conditions: the distribution must be normal and the variances of the samples must be similar.
- Non-parametric tests : they are almost all based on the notion of ranks. The principle is to substitute each value with its rank (order number) in the data set. The interest of these tests is that they can be used for quantitative, qualitative and even ordinal data, and when the distribution is not normal.
 | Non-parametric tests | Parametric tests |
---|---|---|
Test names | Friedman; Kruskal-Wallis; Wilcoxon-Mann-Whitney; Wilcoxon; Cochran's Q; McNemar; Kendall's tau; Spearman's rho | Brown-Forsythe; Fisher-Snedecor; Student; Student for paired data; χ2; MANOVA; ANOVA and blocked ANOVA |
Advantages | No conditions to meet; handle qualitative, quantitative and even ordinal data easily; less sensitive to outliers; suitable for small samples | More precise if the conditions are met. |
Disadvantages | Less precise than parametric tests. | Require the normality condition to be met; require "similar" variances for tests that compare a parameter other than the variance. |

Step 1: Validate the test conditions
Data distribution
For parametric tests, the data must follow a normal distribution. To check this, we perform a goodness-of-fit (normality) test.
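As a hedged sketch of such a check, the Shapiro-Wilk test below is one common normality test available in SciPy; the sample values are invented for illustration.

```python
# Minimal sketch (illustrative data): checking normality before a parametric test.
from scipy import stats

sample = [9.8, 10.1, 10.3, 9.7, 10.0, 10.2, 9.9, 10.4, 10.1, 9.9]
stat, p_value = stats.shapiro(sample)

# A p-value above the chosen alpha (e.g. 0.05) gives no evidence against normality.
print(stat, p_value)
```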
Homogeneity of the Variance
Parametric tests are effective only if the variances of the samples are similar. To check this, we perform a homoscedasticity test.
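As a hedged sketch, SciPy's Levene test with center="median" corresponds to the Brown-Forsythe variant mentioned in the tables above; the two groups are invented for illustration.

```python
# Minimal sketch (illustrative data): checking homogeneity of variances.
from scipy import stats

group_a = [5.1, 4.9, 5.3, 5.0, 5.2]
group_b = [5.4, 5.6, 5.3, 5.7, 5.5]

# center="median" is the Brown-Forsythe variant of Levene's test.
stat, p_value = stats.levene(group_a, group_b, center="median")
print(stat, p_value)  # p-value > alpha: the variances can be considered similar
```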
Independence or paired data
The principle of pairing is to build pairs of data (the same unit measured under each condition) in order to reduce the risk of misreading the results.
Example
We want to analyze how well an additive reduces vehicle fuel consumption. Consider 2 scenarios.
In the first, we choose 10 vehicles and split them into 2 groups, one of which receives the additive. All vehicles drive the same route and we compare the consumptions. The results will probably be biased, because we do not know whether the two groups have comparable baseline consumptions.
In the second, we choose 5 vehicles. Each drives the same route twice, first without the additive and then with it. We then compare the consumptions.
In the second case, our measurements are much more precise: pairing removes the vehicle-to-vehicle variability that would distort the results.
Sample size
In general, it is always preferable to have similar sample sizes across all the groups being tested: this reduces the variability of the variance estimates.
Moreover, the larger the samples, the more their distribution tends towards a normal distribution. It is recommended to have a sample size of at least 30 individuals per group.
Absence of outliers
Outliers are a source of biased results. Even though some tests, particularly non-parametric ones, are robust to this type of value, it is necessary to identify and remove them before the study.
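One common way to flag such values is the 1.5 × IQR (Tukey) fence; the hedged sketch below applies it to an invented data set and is only one possible screening rule, not one prescribed by the text.

```python
# Minimal sketch (illustrative data): flagging outliers with 1.5 * IQR fences.
import numpy as np

data = np.array([9.8, 10.1, 10.3, 9.7, 10.0, 10.2, 9.9, 14.8])
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = data[(data < lower) | (data > upper)]
cleaned = data[(data >= lower) & (data <= upper)]
print("outliers:", outliers, "| cleaned sample:", cleaned)
```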
Step 2: State the hypotheses
Expression of hypotheses
The principle of a hypothesis test is to compare the probability of a hypothesis against that of its opposite. For example, we compare the assumption that we have 3% defects against the assumption that we do not have 3% defects.
We therefore call :
- Null hypothesis H0 : the hypothesis under which we learn nothing, i.e. the one where our result is not significant and could have been obtained by chance. For example, when testing a drug, the null hypothesis is that the drug has no effect. Note that H0 is always expressed with an equality.
- Alternative hypothesis H1 : the hypothesis under which we learn something, i.e. the tested result is significant and something other than chance is at work: the data collected show a statistical difference. It usually represents what we want to demonstrate, unless what we want to demonstrate is equality.
For example, we think we have 3% defects on average and we want to test it. Our hypotheses will be:
- H0 : our defect percentage is equal to 3%
- H1 : our defect percentage is different from 3%
The test direction
We define which way the balance may tip. To do this, we give the test a direction and express the hypotheses as follows (a short sketch follows this list):
- H0 : p = 3% and H1 : p ≠ 3% : a bilateral (two-sided) test
- H0 : p = 3% and H1 : p < 3% : a unilateral (one-sided) test on the left
- H0 : p = 3% and H1 : p > 3% : a unilateral (one-sided) test on the right
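The hedged sketch below runs the same proportion test with the three possible directions; the 4 defects out of 120 parts and the 3% target are invented figures, and the exact binomial test stands in for the one-proportion test.

```python
# Minimal sketch (illustrative data): the same test run with the three directions.
from scipy import stats

for alternative in ("two-sided", "less", "greater"):
    res = stats.binomtest(k=4, n=120, p=0.03, alternative=alternative)
    print(alternative, res.pvalue)
```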
Step 3: Identify the practical value
Also called the test statistic, this is the value calculated from our samples that will be compared with the critical value. Its calculation depends on the chosen test.
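For instance, for a one-sample Student's t-test the practical value is t = (x̄ − target) / (s / √n); the hedged sketch below computes it by hand on invented data and cross-checks it with SciPy.

```python
# Minimal sketch (illustrative data): the practical value of a one-sample t-test.
import math
from scipy import stats

sample = [9.8, 10.1, 10.3, 9.7, 10.0, 10.2, 9.9, 10.4]
target = 10.0
n = len(sample)
mean = sum(sample) / n
std = math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))

t_practical = (mean - target) / (std / math.sqrt(n))      # the test statistic
t_scipy, _ = stats.ttest_1samp(sample, popmean=target)
print(t_practical, t_scipy)                               # both values match
```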
Step 4: Choose the level of risk
In hypothesis testing, there are two types of risk.
 | Truth: H0 is true | Truth: H1 is true |
---|---|---|
Decision: retain H0 | Correct conclusion | Error of the second kind |
Decision: reject H0 (retain H1) | Error of the first kind | Correct conclusion |
In other words, by definition :
- The risk of the first kind, noted α : the risk of rejecting the null hypothesis when it is true (also called the significance level). It is the risk of seeing an effect where there is none: we condemn an innocent person.
- The risk of the second kind, noted β : the risk of retaining the null hypothesis when it is false. It is the risk of not seeing an effect where there is one: we acquit a guilty person.
Statistical power
Calculated as 1 – β, the power must be as high as possible, i.e. the risk of the second kind must be as low as possible.
More precisely, the statistical power is the probability of rejecting H0 when H0 is false; it therefore represents our probability of detecting a difference.
Ideally it should be 0.8 or above in order to detect a reasonable deviation from the null hypothesis.
The power depends on several parameters: the size of the effect to be detected, the sample size, the underlying variability and the risk α.
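As a hedged sketch of such a power calculation, statsmodels (an assumed package choice, not prescribed by the text) can solve for the sample size once the effect size, α and the desired power are fixed.

```python
# Minimal sketch (illustrative parameters): power analysis for a two-sample t-test.
from statsmodels.stats.power import TTestIndPower

n_per_group = TTestIndPower().solve_power(
    effect_size=0.5,   # standardized effect size to detect (assumption)
    alpha=0.05,        # risk of the first kind
    power=0.90,        # 1 - beta, with beta = 10 %
)
print(n_per_group)     # required sample size per group
```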

In practice, it is customary to set the risk of error α to 5% and the risk β to 10%. These values are set arbitrarily, essentially by philosophical choice:
we prefer to accept a bad lot rather than reject a good one, and to acquit a guilty person rather than condemn an innocent one.
Thus the risk associated with the error of the first kind, considered the most serious, is the one that is controlled most tightly. The two risks α and β are antagonistic: choosing an α risk that is too small leads to rejecting H0 very rarely, whereas choosing one that is too large leads to retaining H0 very rarely. The β risk can then be deduced by calculation, provided the distribution under H1 is known.
The confidence level is calculated with the formula 1 – α, where α is the risk (significance level).
Example :
Consider the test of the following hypotheses :
- Hypothesis H0 : the patient must be hospitalized,
- Alternative hypothesis H1 : the patient should not be hospitalized.
The error of the first kind is not hospitalizing a patient who needed it. This error is very serious, since it can lead to the death of the patient. The error of the second kind, hospitalizing a patient who did not need it, is less serious.
Step 5: Establish the decision rule
The critical value of the test is calculated from the previously defined confidence level. This critical value separates 2 decision areas:
- Non-rejection area : the set of values of the test statistic that are likely under H0; if the practical value falls in this area, we retain H0.
- Rejection area : the set formed by the other values; if the practical value falls in this area, we reject H0 and retain H1.
Unilateral or bilateral tests
For all distributions, one chooses between a bilateral test and a unilateral test, according to the direction defined in Step 2: in a bilateral test the risk α is split between the two tails of the distribution, whereas in a unilateral test it is concentrated in a single tail (left or right).
Calculate the critical value
The critical value is read from specific tables developed by the test designers. It depends on the distribution applicable to the chosen test and, most of the time, on the number of degrees of freedom.
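Instead of paper tables, the critical value can also be read directly from the corresponding distribution; the hedged sketch below does so with SciPy for an assumed α of 5% and 9 degrees of freedom.

```python
# Minimal sketch (illustrative parameters): critical values from the distributions.
from scipy import stats

alpha, df = 0.05, 9

t_crit_bilateral = stats.t.ppf(1 - alpha / 2, df)   # Student t, alpha split over both tails
t_crit_right     = stats.t.ppf(1 - alpha, df)       # Student t, unilateral test on the right
chi2_crit        = stats.chi2.ppf(1 - alpha, df)    # chi-square test, same degrees of freedom

print(t_crit_bilateral, t_crit_right, chi2_crit)
```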

Degree of freedom concept
The number of degrees of freedom is a measure of the amount of information that can be obtained from an observation. The more degrees of freedom we have, the more information we have.
For example, in the equation A * B = 10, consider two possible solutions:
- if A = 2, then B = 5
- if A = 5, then B = 2
In other words, if we fix one of the two parameters, the other is automatically determined. In this case we therefore have only 1 degree of freedom, i.e. n – 1 with n = 2 parameters.
Step 6: Calculate the p-value
The p-value, or significance index, is an important concept in statistics. Introduced by Fisher, it quantifies the role that chance alone could have played in the observed result.
It is computed from the same distribution as the test statistic of the chosen test.
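The hedged sketch below recovers the p-value of a bilateral one-sample t-test from the test statistic and the Student distribution, and cross-checks it with SciPy; the data are invented.

```python
# Minimal sketch (illustrative data): p-value of a bilateral one-sample t-test.
from scipy import stats

sample = [9.8, 10.1, 10.3, 9.7, 10.0, 10.2, 9.9, 10.4]
t_stat, p_scipy = stats.ttest_1samp(sample, popmean=10.0)

df = len(sample) - 1
p_manual = 2 * stats.t.sf(abs(t_stat), df)   # both tails for a bilateral test
print(p_manual, p_scipy)                     # the two values match
```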
Step 7: Take the statistical decision
7.1 Reading the comparison between the practical value and the critical value
The results are always read with respect to the null hypothesis. There are 2 possible cases:
- We have retained H0 : we conclude that the alternative hypothesis H1 is not demonstrated.
- We have rejected H0 : we conclude that the alternative hypothesis H1 is true.
7.2 Reading the p-value
The p-value is interpreted as follows:
- p < α : the result is significant and is not due to chance alone
- p > α : the result is not significant and may be due to chance alone
Step 8: The post-hoc test
When more than 2 samples have been compared and the statistical conclusion is that one or more of them differ from the others, post-hoc tests are performed. These tests make it possible to identify which sample or samples differ from the others.
Note, however, that a simple pairwise analysis is in most cases sufficient to identify these samples.
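As a hedged sketch of such a post-hoc analysis, the Tukey HSD pairwise comparison from statsmodels (an assumed package choice, not prescribed by the text) can be run after a significant ANOVA; the three groups below are invented.

```python
# Minimal sketch (illustrative data): Tukey HSD post-hoc comparison after an ANOVA.
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

values = np.array([20.1, 19.8, 20.3, 20.0, 19.9,    # group A
                   20.5, 20.7, 20.4, 20.6, 20.8,    # group B
                   19.7, 19.9, 19.6, 20.0, 19.8])   # group C
groups = np.repeat(["A", "B", "C"], 5)

result = pairwise_tukeyhsd(endog=values, groups=groups, alpha=0.05)
print(result)  # each pair is listed with its adjusted p-value and reject=True/False
```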