**The Kappa Test is the counterpart of Gage R&R for qualitative data. Very useful in quality control, it helps quantify the reliability of a measurement system.**

## Introduction

The Kappa Test is the equivalent of the **Gage R&R** for qualitative data. It is a non-parametric test, also called the Cohen test^{1}, which qualifies the capability of our measurement system across different operators. It is used to evaluate the concordance between two or more observers (inter-rater agreement), or between repeated observations made by the same person (intra-rater agreement). In other words, it answers the question:

**Can we classify parts into the same categories reliably?**

This is the case, for example, if you want to compare the diagnoses of different doctors. It is also the case when quality controllers must classify parts as “*good / not good*”.

This concept was born in the human sciences. In psychology or psychiatry, it is difficult to measure, for example, the state of depression of a patient. An ordinal qualitative scale of the type “*mildly depressed*” to “*severely depressed*” is used instead.

## 1 – Kappa Calculation

The calculation is based on the ratio between the agreement actually observed and the agreement that would be observed if the observers assigned the categories at random. The generalized formula^{2} is based on an analysis of the variance of the responses between the different observers (the number of observations must be the same for each observer):

$$\kappa = \frac{\bar{P} - P_e}{1 - P_e} \qquad \text{with} \qquad \bar{P} = \frac{1}{N}\sum_{i=1}^{N}\frac{1}{m(m-1)}\sum_{c} x_{ic}\,(x_{ic}-1) \qquad \text{and} \qquad P_e = \sum_{c} f_c^{\,2}$$

With:

- **N:** the number of samples studied
- **m:** the number of “*judges*”
- **x_{ic}:** the number of judgments for observation *i* in category *c*
- **f_c:** the proportion of samples assigned to category *c*, i.e. f_c = Σ_i x_{ic} / (N·m)
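This calculation can be sketched in a few lines of Python/NumPy. The function name `fleiss_kappa` and the table layout (one row per sample, one column per category) are our own conventions, not from the source:

```python
import numpy as np

def fleiss_kappa(x):
    """Generalized (Fleiss) kappa.

    x[i, c] = number of judges who placed sample i in category c;
    every row must sum to the same number of judges m.
    """
    x = np.asarray(x, dtype=float)
    N, k = x.shape
    m = x[0].sum()                                  # number of judges per sample
    f = x.sum(axis=0) / (N * m)                     # f_c: proportion per category
    P_i = (x * (x - 1)).sum(axis=1) / (m * (m - 1)) # agreement on each sample
    P_bar = P_i.mean()                              # observed agreement
    P_e = (f ** 2).sum()                            # agreement expected by chance
    return (P_bar - P_e) / (1 - P_e)
```

With perfect agreement (every judge picks the same category for each sample) the function returns 1; systematic disagreement drives it below 0.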

## 2 – Significance of the result

We put in place a **Significance test** where we test the null hypothesis where the coefficient kappa = 0 against the alternative hypothesis where the Kappa is different from 0.

### 2.1 Calculation of the test variable

Under the assumption of random assignment of the samples to the different categories, the test statistic is written:

$$Z = \frac{\kappa}{\sqrt{\operatorname{Var}(\kappa)}} \qquad \text{with} \qquad \operatorname{Var}(\kappa) = \frac{2}{N\,m(m-1)} \cdot \frac{P_e - (2m-3)\,P_e^{\,2} + 2(m-2)\sum_c f_c^{\,3}}{(1-P_e)^2}$$

With:

- **N:** the number of samples studied
- **m:** the number of “*judges*”
- **x_{ic}:** the number of judgments for observation *i* in category *c*
- **f_c:** the proportion of samples assigned to category *c*
- **P_e:** the total chance-agreement ratio over all categories = Σ_c f_c²
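As a sketch, the test variable can be computed directly from the same judgment table, using the null-hypothesis variance given by Fleiss (the function name `fleiss_z` is ours):

```python
import numpy as np

def fleiss_z(x):
    """Z statistic for testing H0: kappa = 0 (Fleiss).

    x[i, c] = number of judges who placed sample i in category c;
    every row must sum to the same number of judges m.
    """
    x = np.asarray(x, dtype=float)
    N, k = x.shape
    m = x[0].sum()
    f = x.sum(axis=0) / (N * m)
    P_bar = ((x * (x - 1)).sum(axis=1) / (m * (m - 1))).mean()
    P_e = (f ** 2).sum()
    kappa = (P_bar - P_e) / (1 - P_e)
    # Variance of kappa under random assignment of categories
    var = 2.0 / (N * m * (m - 1)) * (
        P_e - (2 * m - 3) * P_e ** 2 + 2 * (m - 2) * (f ** 3).sum()
    ) / (1 - P_e) ** 2
    return kappa / np.sqrt(var)
```

A large |Z| means the observed agreement is unlikely to be the product of chance alone.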

### 2.2 Calculation of P-Value

The test variable follows a standard normal law under the null hypothesis, and the **P-Value** is calculated in the following way:

**P-Value = 2 \* (1 – NORMSDIST(ABS(test variable)))**

Its interpretation is standard, as for all other hypothesis tests, and reads in the following way:

- **If the P-Value is < α:** the result is significant; the agreement is not due to chance
- **If the P-Value is > α:** the result is not significant; the observed agreement is probably due to chance
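The Excel formula above translates into a few lines of standard-library Python (the helper name `p_value` is ours); `0.5 * (1 + erf(z / √2))` is the standard normal CDF that NORMSDIST computes:

```python
import math

def p_value(z):
    """Two-sided p-value for a test variable z: 2 * (1 - Phi(|z|))."""
    phi = 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0)))  # standard normal CDF
    return 2.0 * (1.0 - phi)
```

For example, a test variable of 1.96 gives a p-value close to 0.05, the usual significance threshold.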

## 3 – Interpretation of the Kappa

Once we have validated that our test is significant, we read and interpret the value of the Kappa: the closer it is to 1, the better the agreement between the different judges. We will use the following table^{3}:

- 0.8 to 1: almost perfect agreement
- 0.6 to 0.8: strong agreement
- 0.4 to 0.6: moderate agreement
- 0.2 to 0.4: low agreement
- 0 to 0.2: very low agreement
- below 0: disagreement
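The table above can be encoded as a small lookup helper, for example when reporting results automatically (the function name `landis_koch` is ours):

```python
def landis_koch(kappa):
    """Return the qualitative agreement label for a kappa value."""
    if kappa < 0:
        return "disagreement"
    thresholds = [(0.2, "very low agreement"),
                  (0.4, "low agreement"),
                  (0.6, "moderate agreement"),
                  (0.8, "strong agreement"),
                  (float("inf"), "almost perfect agreement")]
    for bound, label in thresholds:
        if kappa <= bound:
            return label
```

Note that the boundary values (0.2, 0.4, …) are conventional cut-offs, not hard statistical limits.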

## In practice

The Kappa test is very often used to set up defect libraries (*défauthèques*). Against the criteria of the defect library, the Kappa is calculated and the reliability of the classification is quantified. The principle is as follows:

- A first defect library is put in place
- The personnel in charge of self-inspection and/or the quality personnel are trained
- A first Kappa test is carried out
- The coefficient and its significance are interpreted
- If the significance is good but the Kappa is weak, work again on the training of the personnel or on the quality of the judgment criteria of the defect library
- Repeat the test until you get both a good significance and a good Kappa value

## Source

1 – J. Cohen (1960) – A coefficient of agreement for nominal scales

2 – J. L. Fleiss (1981) – Statistical methods for rates and proportions

3 – J. R. Landis, G. G. Koch (1977) – The Measurement of Observer Agreement for Categorical Data

J. Fermanian (1984) – Measure of the agreement between two judges

E. Rafiq (2011) – Study of dependencies