The Kappa test is the counterpart of Gage R&R for qualitative data. Widely used in quality control, it helps assess the reliability of inspection judgments.


The Kappa test is the equivalent of the Gage R&R for qualitative data. It is a non-parametric test, also called the Cohen test¹, which qualifies the capability of our measurement system across different operators. It is used to evaluate the concordance between two or more observers (inter-observer variability), or between repeated observations made by the same person (intra-observer variability). In other words, it answers the question:

Can the observers reliably classify parts into the same categories?

This is the case, for example, when you want to compare the diagnoses of different doctors. It is also the case when quality controllers must classify parts as "good / not good".

This concept originated in the human sciences. In psychology or psychiatry, it is difficult to measure, for example, how depressed a patient is. An ordinal qualitative scale of the type "mildly depressed" or "severely depressed" is used instead.

1 – Kappa Calculation


The calculation compares the agreement actually observed with the agreement that would be expected if the observers assigned the categories at random. The generalized formula² is based on an analysis of the variance of the responses between the different observers (the number of observations must be the same for each observer):
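Written out, and assuming the generalized formula referred to here is the Fleiss kappa, the coefficient takes the following form. The symbols N, m, x_ic and f_c are those listed just below, with f_c taken as the share of all N·m judgments that fall in category c; P_o (observed agreement) and P_E (chance agreement) are names introduced here for readability.

```latex
\[
\kappa = \frac{P_o - P_E}{1 - P_E},
\qquad
P_o = \frac{1}{N\,m\,(m-1)} \left( \sum_{i=1}^{N} \sum_{c} x_{ic}^{2} - N\,m \right),
\qquad
P_E = \sum_{c} f_c^{2},
\qquad
f_c = \frac{1}{N\,m} \sum_{i=1}^{N} x_{ic}
\]
```

When every judge puts every sample in the same category, P_o = 1 and kappa = 1; when the observed agreement equals the chance agreement, kappa = 0.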


  • N: the number of samples studied
  • m: the number of judges
  • x_ic: the number of judgments for observation i in category c
  • f_c: the proportion of samples assigned to category c
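The notation above maps directly onto a small function. The sketch below is a minimal Python illustration, assuming the Fleiss form of the generalized kappa; the function name `fleiss_kappa` and the sample table are hypothetical and only serve the example.

```python
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Generalized kappa from an N x C table of judgment counts.

    counts[i][c] is x_ic: how many of the m judges put sample i in category c.
    Assumes the Fleiss formulation and at least two judges per sample.
    """
    n_samples = len(counts)                  # N
    m = sum(counts[0])                       # judges per sample
    if any(sum(row) != m for row in counts):
        raise ValueError("every sample needs the same number of judgments")

    n_categories = len(counts[0])
    # f_c: share of all N*m judgments that fall in category c
    f = [
        sum(row[c] for row in counts) / (n_samples * m)
        for c in range(n_categories)
    ]

    # Observed agreement: mean over samples of the pairwise agreement rate
    p_obs = sum(
        (sum(x * x for x in row) - m) / (m * (m - 1)) for row in counts
    ) / n_samples
    # Chance agreement: sum of f_c squared
    p_exp = sum(fc * fc for fc in f)
    return (p_obs - p_exp) / (1 - p_exp)


# Hypothetical data: 4 parts, 3 judges, categories (good, not good)
table = [[3, 0], [0, 3], [2, 1], [3, 0]]
print(round(fleiss_kappa(table), 3))  # 0.625
```

The function is undefined when all judgments fall in a single category (chance agreement of 1), so a real study needs parts from more than one category.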

2 – Significance of the result

We run a significance test of the null hypothesis that the Kappa coefficient is equal to 0 against the alternative hypothesis that the Kappa is different from 0.

2.1 Calculation of the test variable

Under the assumption that the samples are assigned to the different categories at random, the test variable is written:
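One common way to write this test variable is sketched here. It assumes the large-sample variance of kappa under the null hypothesis commonly attributed to Fleiss, Nee and Landis (1979), which may differ from the variance the original formula used; z and Var_0 are names introduced for the illustration, and the other symbols are those listed just below.

```latex
\[
z = \frac{\kappa}{\sqrt{\operatorname{Var}_0(\kappa)}},
\qquad
\operatorname{Var}_0(\kappa) =
\frac{2}{N\,m\,(m-1)} \cdot
\frac{(1 - P_E)^{2} - \sum_{c} f_c\,(1 - f_c)\,(1 - 2 f_c)}{(1 - P_E)^{2}},
\qquad
P_E = \sum_{c} f_c^{2}
\]
```

With only two categories the sum in the numerator vanishes, and the variance reduces to 2 / (N m (m - 1)).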


  • N: the number of samples studied
  • m: the number of judges
  • x_ic: the number of judgments for observation i in category c
  • f_c: the proportion of samples assigned to category c
  • P_E: the total agreement ratio over all categories, P_E = Σ f_c²
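As a numerical illustration of these quantities, the sketch below computes a test variable from a table of judgment counts. It is a minimal Python example that assumes the Fleiss kappa and the null variance usually attributed to Fleiss, Nee and Landis (1979); the function name `kappa_test_variable` and the data are hypothetical.

```python
import math
from typing import Sequence


def kappa_test_variable(counts: Sequence[Sequence[int]]) -> float:
    """Kappa divided by its standard error under the null hypothesis.

    counts[i][c] is x_ic. The variance formula is an assumption of this
    sketch, not necessarily the one used by the original article.
    """
    n = len(counts)                          # N samples
    m = sum(counts[0])                       # m judges per sample
    k = len(counts[0])                       # number of categories
    f = [sum(row[c] for row in counts) / (n * m) for c in range(k)]

    p_e = sum(fc ** 2 for fc in f)           # P_E = sum of f_c squared
    p_obs = sum(
        (sum(x * x for x in row) - m) / (m * (m - 1)) for row in counts
    ) / n
    kappa = (p_obs - p_e) / (1 - p_e)

    s = 1 - p_e                              # equals sum of f_c * (1 - f_c)
    t = sum(fc * (1 - fc) * (1 - 2 * fc) for fc in f)
    var0 = 2 * (s * s - t) / (n * m * (m - 1) * s * s)
    return kappa / math.sqrt(var0)


# Hypothetical data: 4 parts, 3 judges, categories (good, not good)
table = [[3, 0], [0, 3], [2, 1], [3, 0]]
print(kappa_test_variable(table))
```

A larger absolute value of the test variable means stronger evidence against the hypothesis that the agreement is purely random.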

2.2 Calculation of P-Value

Under the null hypothesis, the test variable follows a standard normal distribution, so the two-sided P-Value is calculated in the following way:

P-Value = 2 * (1 – NORMSDIST(ABS(test variable)))

It is interpreted in the same way as in any other hypothesis test:

  • if the P-Value is < α: the result is significant; the agreement is unlikely to be due to chance
  • if the P-Value is > α: the result is not significant; the agreement may well be due to chance
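The spreadsheet formula and the decision rule above can be reproduced with the standard library alone. This is a minimal Python sketch: `normsdist` is a hypothetical helper that mirrors the spreadsheet function of the same name, and the risk level α = 0.05 is an assumption chosen for the example.

```python
import math


def normsdist(z: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def kappa_p_value(test_variable: float) -> float:
    """Two-sided P-Value: 2 * (1 - NORMSDIST(ABS(test variable)))."""
    return 2.0 * (1.0 - normsdist(abs(test_variable)))


alpha = 0.05          # assumed risk level, not fixed by the method
p = kappa_p_value(2.165)
verdict = "significant" if p < alpha else "not significant"
print(f"P-Value = {p:.4f} -> {verdict}")
```

Because the absolute value is taken, a strongly negative test variable gives the same P-Value as a strongly positive one; the sign of the Kappa itself tells which case applies.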

3 – Interpretation of the Kappa

Once the test has been shown to be significant, we read and interpret the Kappa value: the closer it is to 1, the better the agreement between the different judges. We use the following table³:

  • 0.8 to 1: almost perfect agreement
  • 0.6 to 0.8: strong agreement
  • 0.4 to 0.6: moderate agreement
  • 0.2 to 0.4: low agreement
  • 0 to 0.2: very low agreement
  • below 0: disagreement
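The table translates into a simple lookup. In this minimal Python sketch the function name `interpret_kappa` is hypothetical, and the handling of the shared boundaries is an assumption: a value that falls exactly on a limit (0.2, 0.4, 0.6, 0.8) is assigned to the lower band.

```python
def interpret_kappa(kappa: float) -> str:
    """Map a Kappa value to the agreement level of the table above."""
    if kappa < 0:
        return "disagreement"
    # Upper limits are treated as inclusive (assumption of this sketch)
    bands = [
        (0.2, "very low agreement"),
        (0.4, "low agreement"),
        (0.6, "moderate agreement"),
        (0.8, "strong agreement"),
    ]
    for upper, label in bands:
        if kappa <= upper:
            return label
    return "almost perfect agreement"


print(interpret_kappa(0.625))  # strong agreement
```

The label only makes sense once the significance test has been passed, so in practice this lookup comes after the P-Value check.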

In practice

The Kappa test is very often used when setting up defect libraries (défauthèques, reference catalogues of defects). The Kappa is calculated against the criteria of the defect library, which quantifies how reliably those criteria are applied. The principle is as follows:

  1. A first defect library (défauthèque) is put in place.
  2. The personnel in charge of self-inspection and/or the quality personnel are trained.
  3. A first Kappa test is run.
  4. The coefficient and its significance are interpreted.
  5. If the significance is good but the Kappa is weak, the training of the personnel or the quality of the judgement criteria in the defect library is reworked.
  6. The test is repeated until both the significance and the Kappa value are good.


1 – J. Cohen (1960) – A coefficient of agreement for nominal scales

2 – J. L. Fleiss (1981) – Statistical methods for rates and proportions

3 – J. R. Landis, G. G. Koch (1977) – The measurement of observer agreement for categorical data

J. Fermanian (1984) – Measure of the agreement between two judges

E. Rafiq (2011) – Study of dependencies
