Introduction
Distribution charts, also called histograms (a name coined by Karl Pearson in 1895), are among the most fundamental tools of statistics. Historically, they were used to represent populations in the early censuses.
Distribution charts show how the values of a single quantitative variable (age, weight, length…) are distributed. For example, they show the number of parts weighing between 10 and 15 kg, between 15 and 20 kg, and so on. This is useful in many situations, such as quality control of parts or noise measurement.
The construction of the graph
The construction of a distribution chart follows precise statistical rules. For a histogram, the process is as follows^{1}.
1. Collect data (n)
Collect the data according to the chosen collection protocol. The number of data points is denoted n.
2. Calculate the range (R)
The range to be taken into account is the difference between the maximum and minimum values tolerable by the customer.
3. Identify the number of classes (k)
The number of classes and the associated formula depend on the type of variable. In every case, two rules must be followed:
- The number of classes is rounded to the nearest integer.
- The number of classes should preferably be no less than 5 and no greater than 20.
Note that for discrete variables, the number of classes k is equal to the number of distinct values that the variable can take.
For continuous variables, there are various empirical formulas:
- Sturges formula: k = 1 + Log_{2}(n)
- Sturges-Huntsberger formula: k = 1 + 3.3 * Log_{10}(n)
- Brooks-Carruthers formula: k = 5 * Log_{10}(n)
- Freedman-Diaconis formula: k = (Max - Min)/(2 * IQR * n^{(-1/3)}), where IQR is the interquartile range
- Scott formula: k = (Max - Min)/(3.5 * σ * n^{(-1/3)}), where σ is the standard deviation
- Yule formula: k = 2.5 * ^{4}√(n), suitable for other types of distributions. (Note that ^{4}√(n) is computed in Excel by entering “=n^(1/4)”.)
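The empirical formulas above can be compared side by side. The following is a minimal Python sketch, not a reference implementation. The function name `class_counts` and the sample data are hypothetical. Sturges' rule is written in its usual form k = 1 + Log_{2}(n), the interquartile range comes from `statistics.quantiles` with its default method, and σ is assumed to be the sample standard deviation.

```python
import math
import statistics


def class_counts(data):
    """Return the number of classes k suggested by each empirical rule."""
    n = len(data)
    data_range = max(data) - min(data)
    # Interquartile range: third quartile minus first quartile.
    q1, _, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    # Assumption: sigma is the sample standard deviation.
    sigma = statistics.stdev(data)
    raw = {
        "sturges": 1 + math.log2(n),
        "sturges_huntsberger": 1 + 3.3 * math.log10(n),
        "brooks_carruthers": 5 * math.log10(n),
        "freedman_diaconis": data_range / (2 * iqr * n ** (-1 / 3)),
        "scott": data_range / (3.5 * sigma * n ** (-1 / 3)),
        "yule": 2.5 * n ** 0.25,
    }
    # Rule from the text: round each result to the nearest integer.
    return {name: round(value) for name, value in raw.items()}


# Hypothetical sample of 100 measurements.
sample = list(range(1, 101))
print(class_counts(sample))
```

The rules can disagree noticeably on the same sample, which is why the text also recommends keeping k between 5 and 20.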
It should also be noted that an empirical rule gives the number of classes according to the number of values:
Number of points | Number of classes
50 to 100 | 6, 7, 8
100 to 150 | 9, 10, 11
150 to 200 | 12, 13, 14
More than 200 | 15 and more
In all cases, the number of classes is chosen according to the following criteria^{2}:
- For a visual representation aimed at non-specialists, choose 7 to 8 classes at most. Beyond 8 classes, the eye can no longer distinguish the data clearly and quickly, and the chart loses readability at the expense of the message it is meant to convey.
- For a visual representation aimed at specialists, or for an in-depth study, choose the method that gives the largest number of classes. This yields a finer breakdown of the data and a more precise analysis.
- Finally, it is entirely possible to set the number of classes manually, choosing whatever suits the analysis best.
4. Identify the width of the classes
Also called the class interval, it is the ratio between the range and the number of classes:
l = R/k
Note that every class must have the same width so that the area of each bar of the histogram is proportional to the number of values in the class.
If the classes cannot all have the same width, this proportionality must be preserved: instead of plotting the absolute frequency, plot the relative frequency divided by the class width (the frequency density).
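As a small illustration of l = R/k, the sketch below derives the boundaries of k equal-width classes. The function name `class_edges` and the 10 to 40 kg example are hypothetical, chosen to match the 5 kg classes mentioned in the introduction.

```python
def class_edges(lower, upper, k):
    """Return the k + 1 boundaries of k equal-width classes over [lower, upper]."""
    width = (upper - lower) / k  # l = R / k
    return [lower + i * width for i in range(k + 1)]


# Hypothetical example: parts weighing 10 to 40 kg, split into 6 classes.
print(class_edges(10, 40, 6))  # [10.0, 15.0, 20.0, 25.0, 30.0, 35.0, 40.0]
```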
5. Identify the frequency of each class
Two measures can be used:
- Absolute frequency (f): the number of values belonging to the class.
- Relative frequency (F): F = f/n
Counting follows the so-called “right-priority rule”: if a value falls exactly on the boundary between two classes, it is counted in the class on the right.
6. Transform the number of measures into percentages
To ensure a good visualization of the proportions, the y-axis is expressed as a percentage. The frequencies must be expressed accordingly.
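Steps 5 and 6 can be sketched as follows. This is one possible reading, under stated assumptions: the names `class_frequencies` and `to_percentages` and the sample weights are hypothetical, values outside the class boundaries are ignored, and the overall maximum is kept in the last class even though it sits on a boundary.

```python
from bisect import bisect_right


def class_frequencies(data, edges):
    """Count values per class; a value on a shared boundary goes to the right class."""
    k = len(edges) - 1
    counts = [0] * k
    for x in data:
        if x < edges[0] or x > edges[-1]:
            continue  # outside the studied range: ignored in this sketch
        # bisect_right places a boundary value in the class to its right.
        i = bisect_right(edges, x) - 1
        # Assumption: the overall maximum stays in the last class.
        counts[min(i, k - 1)] += 1
    return counts


def to_percentages(counts):
    """Relative frequency of each class, F = f / n, expressed in percent."""
    n = sum(counts)
    return [100 * f / n for f in counts]


# Hypothetical weights in kg; 15, 20 and 25 fall exactly on class boundaries.
weights = [10, 12, 15, 15, 17, 19.5, 20, 22, 25, 30]
edges = [10, 15, 20, 25, 30]
counts = class_frequencies(weights, edges)
print(counts)                  # [2, 4, 2, 2]
print(to_percentages(counts))  # [20.0, 40.0, 20.0, 20.0]
```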
7. Draw the chart
Interpret a distribution graph
The behavior of the center
The behavior of the center is studied by analyzing where the center of the distribution lies within the tolerance interval. The center can be defined in three ways.
The arithmetic mean
In the case of distribution charts, the mean should be weighted by the number of values in each class. The calculation is done as follows:
- Calculate the midpoint of each class: the maximum value of the class plus the minimum value of the class, divided by two.
- Multiply the midpoint of each class by the frequency of that class.
- Sum the results of step 2 and divide by the total number n.
Note that whatever the technique, if the sampling is random, the weighted mean and the mean computed from the raw data must be very close. Otherwise, the sampling was not truly random.
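The three steps of the weighted mean can be sketched in a few lines of Python. The function name `grouped_mean` and the class boundaries and frequencies used in the example are hypothetical.

```python
def grouped_mean(edges, counts):
    """Mean of grouped data: class midpoints weighted by class frequencies."""
    # Step 1: midpoint of each class, (minimum + maximum) / 2.
    midpoints = [(lo + hi) / 2 for lo, hi in zip(edges, edges[1:])]
    # Steps 2 and 3: weight each midpoint by its frequency, then divide by n.
    n = sum(counts)
    return sum(c * f for c, f in zip(midpoints, counts)) / n


# Hypothetical classes 10-15, 15-20, 20-25, 25-30 with frequencies 2, 4, 2, 2.
print(grouped_mean([10, 15, 20, 25, 30], [2, 4, 2, 2]))  # 19.5
```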
The median
To calculate the median, first find the class containing the (n/2)^{th} individual of the sample. The median is then calculated as follows:
Me = B_{inf} + a * ((n/2 - n_{inf})/f_{me})
With:
- B_{inf}: lower bound of the median class (the class in the middle of the desired tolerance zone)
- n: total number of values in the sample
- n_{inf}: sum of the absolute frequencies of the classes to the left of the median class
- f_{me}: absolute frequency of the median class
- a: class interval (maximum class value - minimum class value)
Note that if the distribution of the data is symmetrical, the median is close to, or even identical to, the arithmetic mean.
Example:
The table below tells us that the median class is that of 25 to 26, the tolerance interval being 22 to 28. So the formula for calculating the median gives us:
Me = 25 + 1 * ((40/2-11)/14) = 25.64
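The same calculation can be checked with a short Python sketch. The function name `grouped_median` is hypothetical, and the argument values are those of the example above.

```python
def grouped_median(b_inf, a, n, n_inf, f_me):
    """Median of grouped data by linear interpolation inside the median class."""
    return b_inf + a * ((n / 2 - n_inf) / f_me)


# Values from the example: B_inf = 25, a = 1, n = 40, n_inf = 11, f_me = 14.
print(round(grouped_median(25, 1, 40, 11, 14), 2))  # 25.64
```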
The mode
The mode is by definition the most frequent value of the sample. In the case of distribution charts, two calculation techniques can be used:
- Take the midpoint of the class with the highest frequency
- Perform a linear interpolation as follows: Mo = B_{inf} + (a * ΔS)/(ΔS + Δs)
With:
- B_{inf}: lower bound of the modal class (the class with the highest frequency)
- a: class interval
- ΔS: difference in frequency between the modal class and the adjacent lower class
- Δs: difference in frequency between the modal class and the adjacent upper class
Likewise, if the distribution is symmetrical, the mode is close to, or even identical to, the arithmetic mean.
Example:
Taking the example above, we get:
- As an approximate value: the class with the highest frequency is the 25 to 26 class, so its midpoint is 25.5 (class maximum plus class minimum, divided by two)
- As an interpolated value: 25 + (1 * 6)/(3 + 6) = 25.67
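The interpolated mode can be sketched in Python as follows. The function name `grouped_mode` is hypothetical. Since the frequency table is not reproduced here, assigning 6 to the difference with the lower class and 3 to the difference with the upper class is an assumption drawn from the figures in the example.

```python
def grouped_mode(b_inf, a, delta_lower, delta_upper):
    """Mode of grouped data by linear interpolation inside the modal class."""
    return b_inf + a * delta_lower / (delta_lower + delta_upper)


# Assumed values: B_inf = 25, a = 1, lower difference = 6, upper difference = 3.
print(round(grouped_mode(25, 1, 6, 3), 2))  # 25.67
```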
Interpretation of the behavior of the center
In this analysis, the center is calculated in the three ways above, and one of the following conclusions is drawn:
- The center is outside the tolerance interval: the process produces few or no good results, and either the tolerance interval or the process must be reviewed quickly.
- The center is within the tolerance interval but to the left or right of its middle: this is less critical than the previous case, but it calls for in-depth work on the tolerance interval and on the process itself.
- The center is in the middle of the tolerance interval: the process is centered, which strongly favors customer satisfaction.
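The three cases can be expressed as a simple classification. This is only a sketch: the function name `centering` and the `margin` parameter are hypothetical, because the text does not say how close to the middle the value must be to count as centered. Here a value within 5% of the interval width is treated as centered.

```python
def centering(center, tol_low, tol_high, margin=0.05):
    """Classify a center value (mean, median or mode) against a tolerance interval.

    margin is a hypothetical threshold: the fraction of the interval width
    around the middle within which the process is treated as centered.
    """
    if center < tol_low or center > tol_high:
        return "outside tolerance"
    middle = (tol_low + tol_high) / 2
    if abs(center - middle) <= margin * (tol_high - tol_low):
        return "centered"
    return "off-center"


# Median of the earlier example (25.64) against the 22 to 28 tolerance interval.
print(centering(25.64, 22, 28))  # off-center
```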
The dispersion
The study of dispersion looks at how the data is spread across the chart. Dispersion is measured by the variance. In the specific case of distribution charts, the variance formula is as follows, where c_{i} is the midpoint of class i, f_{i} its absolute frequency and x̄ the weighted mean:
σ^{2} = (1/n) * Σ f_{i} * (c_{i} - x̄)^{2}
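A common way to compute the variance of grouped data, assumed here, weights the squared deviations of the class midpoints by the class frequencies and divides by n. The sketch below follows that form. The function name `grouped_variance` and the example classes are hypothetical, and a sample-variance variant would divide by n - 1 instead.

```python
def grouped_variance(edges, counts):
    """Variance of grouped data, using class midpoints weighted by frequencies."""
    midpoints = [(lo + hi) / 2 for lo, hi in zip(edges, edges[1:])]
    n = sum(counts)
    mean = sum(c * f for c, f in zip(midpoints, counts)) / n
    # Assumption: population form, dividing by n.
    return sum(f * (c - mean) ** 2 for c, f in zip(midpoints, counts)) / n


# Hypothetical classes 10-15, 15-20, 20-25, 25-30 with frequencies 2, 4, 2, 2.
print(grouped_variance([10, 15, 20, 25, 30], [2, 4, 2, 2]))  # 26.0
```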
There are 3 cases of dispersion:
- Range < tolerance interval: the process delivers 100% quality. Beyond the strong customer satisfaction, it is worth checking whether this is over-quality that generates extra costs.
- Range = tolerance interval: the most favorable case. This dispersion yields 100% customer satisfaction while respecting cost constraints. The process produces exactly what is required.
- Range > tolerance interval: the most unfavorable case. The process generates non-quality and therefore customer dissatisfaction. If the constraints cannot be renegotiated with the customer, the process must be reviewed.
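The three dispersion cases reduce to comparing two widths, as in the sketch below. The function name `dispersion_case` and the returned labels are hypothetical. The comparison covers widths only, as the text does, and uses `math.isclose` so that floating-point measurements can still count as equal.

```python
import math


def dispersion_case(data_min, data_max, tol_low, tol_high):
    """Compare the range of the data with the width of the tolerance interval."""
    data_range = data_max - data_min
    tolerance = tol_high - tol_low
    if math.isclose(data_range, tolerance):
        return "range = tolerance: exactly what is required"
    if data_range < tolerance:
        return "range < tolerance: possible over-quality"
    return "range > tolerance: non-quality"


# Hypothetical measurements from 23 to 27 against a 22 to 28 tolerance interval.
print(dispersion_case(23, 27, 22, 28))  # range < tolerance: possible over-quality
```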
The distribution
The study of distribution looks at how the values are distributed on the chart. In short, where dispersion concerns the width of the chart, distribution concerns its height. There are many forms of distribution, the main ones being:
- Normal, also called Gaussian or bell curve
- Binomial
- F distribution
- t distribution
- Chi-squared distribution
- Weibull distribution
The study of distribution is undoubtedly the key part of a statistical analysis. If a distribution matches one of these 6 models, inferential statistics can be used.
Two distributions are observed
With a single data set, two modes appear on the chart. An investigation is then necessary, because such a situation is not normal a priori. Several hypotheses can be put forward:
- The data was collected on two different machines.
- Two different operators carried out the measurements.
- The data were taken from 2 different batches of parts or from 2 suppliers.
Source
1 – A. Schärlig, O. Blanc (2000) – Making the figures speak: descriptive statistics for the management Service
2 – M. Walas, A. De Fombelle, S. Schmid, P. Scotto, A. Stella-Carter, V. Thyrault (1999) – Statistical tools for management
F. Bertrand, M. Maumy-Bertrand (2014) – Introduction to statistics with R
H. A. Sturges (1926) – The choice of a class interval
D. Devika (2002) – Mathematics: Tools for Biology
J. C. Oriol (2007) – Training in statistics through the practice of questionnaire surveys and simulation
J. Levy (2010) – Web Math