Descriptive statistics aims to describe a population, or a sample of a population, in a simple, summarized way. The challenge is to obtain a mathematical and graphical representation of the data.
Descriptive statistics breaks down into 4 large families of tools, described below.
The central tendency parameters
They characterize the center of the population by calculating different position characteristics. Note that the formulas below are valid for non-grouped data. For grouped data, see the article on distribution charts. We find:
The average is the best-known and simplest criterion. It is the sum of all the values of the population divided by the number of values.
In general, the average is the best indicator of the center of the population. However, it is strongly influenced by extreme values and poorly represents a heterogeneous population.
Note that the average is also referred to as the mathematical expectation. The expectation is used when the points carry “weights” that must be applied when calculating the average.
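As a minimal sketch in Python (the language choice, the weighted example and its values are illustrative assumptions), the plain average and a weighted average can be computed as follows:

```python
values = [5, 3, 6, 4, 7, 5, 9, 6, 4, 3, 2, 6]

# Plain average: sum of the values divided by the number of values.
average = sum(values) / len(values)

# Weighted average (expectation): each point is multiplied by its weight,
# and the total is divided by the sum of the weights.
points = [10, 20, 30]   # hypothetical values
weights = [2, 5, 3]     # hypothetical weights
expectation = sum(p * w for p, w in zip(points, weights)) / sum(weights)

print(average)       # 5.0
print(expectation)   # 21.0
```

Dividing by the sum of the weights means the weights do not have to add up to 1.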
The median is the numerical middle, i.e. the value for which the cumulative frequency equals 50%. This calculation does not apply to nominal variables, because the median requires the data to be ordered. It is calculated by sorting all the values and taking the one in the middle, i.e. the one that has as many values below it as above it [1].
In the series 5, 3, 6, 4, 7, 5, 9, 6, 4, 3, 2, 6, the median is 5 (sorted: 2, 3, 3, 4, 4, 5, 5, 6, 6, 6, 7, 9).
Several cases may arise:
- N is odd: The median is the middle value.
- N is even: The median is the average between the 2 middle values.
The median is most relevant when there are exceptional values (very high or very low). These values, called outliers, have a strong impact on the mean but do not affect the median. The median is nevertheless poorly suited to further statistical calculations, because it only represents the value that separates the sample into 2 equal parts.
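The two cases (N odd, N even) and the robustness to outliers can be illustrated with a short Python sketch. The function name `median_by_hand` and the outlier value 900 are hypothetical choices made for the example:

```python
from statistics import mean, median


def median_by_hand(values):
    """Sort the values, then take the middle one or the average of the 2 middle ones."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                        # N is odd: the middle value
    return (ordered[mid - 1] + ordered[mid]) / 2   # N is even: average of the 2 middle values


series = [5, 3, 6, 4, 7, 5, 9, 6, 4, 3, 2, 6]
series_median = median_by_hand(series)

# Hypothetical outlier: the largest value (9) is replaced by 900.
with_outlier = [900 if x == 9 else x for x in series]
mean_shift = mean(with_outlier) - mean(series)         # the mean moves a lot
median_shift = median(with_outlier) - median(series)   # the median does not move

print(series_median, mean_shift, median_shift)
```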
The mode is the most frequent value in the series. In our previous example, 5, 3, 6, 4, 7, 5, 9, 6, 4, 3, 2, 6, the mode is 6. If there are 2 peaks, the series is said to be bimodal.
The mode is a good indicator for a heterogeneous population, since it is not influenced by extreme values. It is, however, poorly suited to further statistical calculations, because it only represents the values close to the modal class.
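As an illustration, Python's standard `statistics.multimode` returns every value that reaches the highest count, which also covers the bimodal case. The second series is a hypothetical example:

```python
from statistics import multimode

series = [5, 3, 6, 4, 7, 5, 9, 6, 4, 3, 2, 6]
modes = multimode(series)          # values with the highest count

# Hypothetical bimodal series: 2 and 4 both appear twice.
bimodal_series = [1, 2, 2, 3, 4, 4]
bimodal_modes = multimode(bimodal_series)

print(modes)           # [6]
print(bimodal_modes)   # [2, 4]
```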
Count and frequency
The count of a given value of a variable is the number of individuals for which the variable takes that value. The total count is the sum of the counts of all the values of the variable.
The frequency of a given value is the ratio of its count to the total count. The sum of all the frequencies is always equal to 1.
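A minimal Python sketch of these two definitions, reusing the example series from the median section (the variable names are illustrative):

```python
from collections import Counter

series = [5, 3, 6, 4, 7, 5, 9, 6, 4, 3, 2, 6]

counts = Counter(series)        # count of each distinct value
total = sum(counts.values())    # total count = number of individuals

# Frequency of a value = its count divided by the total count.
frequencies = {value: count / total for value, count in counts.items()}

for value in sorted(counts):
    print(value, counts[value], frequencies[value])
```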
The dispersion parameters
To describe statistical series, the concept of the median is generalized to separate the measurements not into 2 subsets but into k subsets. The values that separate these subsets are called quantiles:
- If k = 4: we speak of quartiles. There are 3 separation values (Q1, Q2 and Q3): 25% of the values are below Q1, 25% are above Q3, and Q2 is the median.
- If k = 10: we speak of deciles. In the same way, 10% of the values are below D1, 10% are above D9…
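Quartiles and deciles can be sketched with Python's standard `statistics.quantiles`. Several interpolation conventions exist for quantiles, so the exact values of Q1 and Q3 may differ slightly from those of another tool. The default method of the library is assumed here:

```python
from statistics import median, quantiles

series = [5, 3, 6, 4, 7, 5, 9, 6, 4, 3, 2, 6]

# k = 4: three cut points Q1, Q2, Q3.
q1, q2, q3 = quantiles(series, n=4)

# k = 10: nine cut points D1 ... D9.
deciles = quantiles(series, n=10)

print(q1, q2, q3)
print(deciles)
print(q2 == median(series))   # Q2 is the median
```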
The variance measures the average squared distance of the points from the average, and so represents the level of dispersion of the values. By definition, the variance is never negative. Its unit is the square of that of the variable. This change of units makes the variance difficult to use directly as a measure of dispersion: it has no direct biological meaning, unlike the standard deviation, which is expressed in the same units as the average.
The variance of a sample, s, is calculated using the following formula (VAR function in Excel):
$s = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
- $x_i$: the individuals
- $\bar{x}$: the average of the individuals
- n: the number of individuals in the sample
The variance of a population, S, is calculated using the following formula (VARP function in Excel):
$S = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}$
- $x_i$: the individuals
- $\bar{x}$: the average of the individuals
- n: the number of individuals in the population
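The variance of the population can be estimated from a sample of size n (very useful for hypothesis testing). If S is computed on the sample values with the divisor n, the estimate is:
$\hat{S} = S \times \frac{n}{n - 1}$
This gives the same value as s computed with the divisor n - 1.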
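The two divisors and the n/(n - 1) correction can be checked numerically. The sketch below computes both variances by hand and compares them with Python's standard `statistics.variance` (divisor n - 1, like VAR) and `statistics.pvariance` (divisor n, like VARP). The variable names are illustrative:

```python
from statistics import pvariance, variance

series = [5, 3, 6, 4, 7, 5, 9, 6, 4, 3, 2, 6]
n = len(series)
average = sum(series) / n

# Sum of the squared deviations from the average.
squared_deviations = sum((x - average) ** 2 for x in series)

var_divisor_n = squared_deviations / n                # population formula
var_divisor_n_minus_1 = squared_deviations / (n - 1)  # sample formula

# Correcting the divisor-n variance by n / (n - 1) gives the divisor-(n - 1) variance.
corrected = var_divisor_n * n / (n - 1)

print(var_divisor_n, var_divisor_n_minus_1, corrected)
```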
The covariance is a number that indicates the direction in which two sets of data vary together, and thus characterizes the dependence between the two variables. It is interpreted in the following way:
- Covariance > 0: on the whole, the two values of each pair deviate from their respective means in the same direction.
- Covariance < 0: on the whole, the two values of each pair deviate from their respective means in opposite directions.
- Covariance = 0: there is no linear relationship between the two variables. Independent variables have a zero covariance, but so can variables linked by a nonlinear relationship, so beware of hasty conclusions.
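The covariance of a sample is calculated using the following formula (COVARIANCE.S function in Excel):
$\mathrm{cov}(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$
The covariance of a population is calculated using the following formula (COVARIANCE.P function in Excel):
$\mathrm{cov}(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n}$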
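A small Python sketch of the sample and population covariance, written by hand so that the divisor is explicit. The helper name `covariance` and the data are hypothetical. The last case shows a perfect nonlinear dependence (y = x²) whose covariance is nevertheless zero:

```python
def covariance(xs, ys, sample=True):
    """Covariance of paired data; divisor n - 1 for a sample, n for a population."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    total = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return total / (n - 1 if sample else n)


xs = [-2, -1, 0, 1, 2]
increasing = [2 * x for x in xs]    # varies in the same direction as xs
decreasing = [-2 * x for x in xs]   # varies in the opposite direction
squared = [x ** 2 for x in xs]      # fully determined by xs, but not linearly

print(covariance(xs, increasing))                 # 5.0
print(covariance(xs, increasing, sample=False))   # 4.0
print(covariance(xs, decreasing))                 # -5.0
print(covariance(xs, squared))                    # 0.0
```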
The standard deviation: σ, called sigma
The standard deviation is the square root of the variance. It is the most widely used indicator because it is the most representative of the dispersion of the values of the population. As with the variance, the larger the standard deviation, the more dispersed the values of the population, and the population is then said to be “heterogeneous”.
The standard deviation is a key value in statistics and in Six Sigma. Indeed, “six sigma” means “six times the standard deviation”. The principle of the Six Sigma method is to ensure that all the items produced by the process under study lie within an interval of at most 6 sigma on either side of the average of the population produced by this process. Reducing the variability of the values reduces the risk that the product or service is rejected by its recipient for falling outside their expectations or specifications.
As with the variance, the calculation of the standard deviation depends on whether we have the whole population or only a sample. In Excel:
- for a sample: STDEV function
- for a population: STDEVP function
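To illustrate, Python's standard library offers `statistics.stdev` (sample) and `statistics.pstdev` (population). The two series below are hypothetical: they share the same average but differ strongly in dispersion.

```python
from math import sqrt
from statistics import mean, pstdev, pvariance, stdev

homogeneous = [49, 50, 51, 50, 50]     # hypothetical, tightly grouped values
heterogeneous = [10, 90, 50, 30, 70]   # hypothetical, widely spread values

# Same average, very different standard deviations.
print(mean(homogeneous), mean(heterogeneous))
print(stdev(homogeneous), stdev(heterogeneous))     # sample formula
print(pstdev(homogeneous), pstdev(heterogeneous))   # population formula

# The standard deviation is the square root of the variance.
check = abs(pstdev(homogeneous) - sqrt(pvariance(homogeneous)))
```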
Table: conversion factor d as a function of the sample size.
Source: E. S. Pearson, H. O. Hartley (1970) – Biometrika Tables for Statisticians
Standard error of the average
The standard deviation measures the spread of the data around the average of a single sample. Now imagine that we repeat the measurements several times, or that we measure the average of several samples. The standard error of the average measures how much the average varies between these different groups. The formula is as follows:
Standard error of the average = σ/√n
- σ: the standard deviation.
- n: the number of individuals.
In the end, this indicator gives the level of precision of the calculated average. The more data collected, the lower the standard error: the more exhaustive the data collection, the more precise the average.
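The effect of the sample size can be sketched in a few lines of Python. The function name and the numerical values are hypothetical:

```python
from math import sqrt


def standard_error_of_average(sigma, n):
    """Standard error of the average: sigma / sqrt(n)."""
    if n <= 0:
        raise ValueError("n must be a positive number of individuals")
    return sigma / sqrt(n)


sigma = 2.0   # hypothetical standard deviation
small_sample = standard_error_of_average(sigma, 100)
large_sample = standard_error_of_average(sigma, 400)

print(small_sample)   # 0.2
print(large_sample)   # 0.1
```

With the same standard deviation, collecting 4 times more data halves the standard error.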
Standard error of a percentage
It represents the degree of precision in calculating a percentage. The formula is as follows:
Standard error of a percentage = √((p × q)/n)
- n: the number of individuals.
- p: the observed frequency (example: 5% of the parts are non-compliant).
- q: the complement of p, i.e. 1 - p (in our example, q = 95%).
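As a worked example in Python, taking the 5% non-compliance rate above and assuming, hypothetically, that it was observed on 200 parts:

```python
from math import sqrt


def standard_error_of_percentage(p, n):
    """Standard error of an observed frequency p (between 0 and 1) on n individuals."""
    if not 0 <= p <= 1:
        raise ValueError("p must be a frequency between 0 and 1")
    if n <= 0:
        raise ValueError("n must be a positive number of individuals")
    q = 1 - p   # complement of p
    return sqrt(p * q / n)


# p = 5% of non-compliant parts; n = 200 is a hypothetical sample size.
error = standard_error_of_percentage(0.05, 200)
print(round(100 * error, 2), "percentage points")
```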
Standard error of the standard deviation
It represents the degree of precision in calculating the standard deviation. The formula is as follows:
Standard error of the standard deviation = σ/√(2 × n)
- n: the number of individuals.
- σ: the standard deviation of the sample.
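Applied to data, the formula can be sketched as follows in Python. The function name is hypothetical, and the series is the one used in the median example:

```python
from math import sqrt
from statistics import stdev


def standard_error_of_stdev(values):
    """Standard error of the sample standard deviation: sigma / sqrt(2 * n)."""
    n = len(values)
    sigma = stdev(values)   # standard deviation of the sample
    return sigma / sqrt(2 * n)


series = [5, 3, 6, 4, 7, 5, 9, 6, 4, 3, 2, 6]
error = standard_error_of_stdev(series)
print(stdev(series), error)
```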
The coefficient of variation
Variance and standard deviation are absolute dispersion parameters: they measure the variation of the data in absolute terms, regardless of the order of magnitude of the data.
The coefficient of variation, noted C.V., is a relative dispersion index that takes this order of magnitude into account. It is equal to:
$C.V. = 100 \times \frac{\sigma}{\bar{x}}$
Generally, a coefficient of variation below 15% indicates good homogeneity of the distribution of the measurements [2].
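A short Python sketch of the coefficient of variation and of the 15% rule of thumb. It assumes the sample standard deviation is used for σ, and the first series is hypothetical:

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """C.V. in percent: 100 * standard deviation / average."""
    average = mean(values)
    if average == 0:
        raise ValueError("the coefficient of variation is undefined for a zero average")
    return 100 * stdev(values) / average   # assumption: sample standard deviation


tight = [49, 50, 51, 50, 50]                       # hypothetical measurements
spread = [5, 3, 6, 4, 7, 5, 9, 6, 4, 3, 2, 6]      # series from the median example

cv_tight = coefficient_of_variation(tight)
cv_spread = coefficient_of_variation(spread)

print(cv_tight, cv_tight < 15)     # below 15%: homogeneous
print(cv_spread, cv_spread < 15)   # above 15%: not homogeneous
```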
Distribution charts make it possible to visualize the dispersion of the population over its full range and to understand the behaviour of its center.
From these charts, we identify the distribution law that the data follow and can thus use the associated statistical tools.
Data Evolution Charts
By representing data over time, they allow us to identify trends and developments. They include trend charts and control charts. Control charts are used when we have limit values (dimensional tolerances of parts, for example) against which we want to compare the collected data.
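Comparing collected data with limit values can be sketched as below in Python. This is only the comparison step, not a full control chart, and the measurements, limits and function name are all hypothetical:

```python
def out_of_tolerance(measurements, lower, upper):
    """Return (position, value) pairs for the measurements outside [lower, upper]."""
    return [(i, x) for i, x in enumerate(measurements) if x < lower or x > upper]


# Hypothetical part dimensions (mm), in the order they were measured.
measurements = [10.01, 9.98, 10.03, 10.12, 9.99, 9.87, 10.00]
lower_limit, upper_limit = 9.90, 10.10   # hypothetical dimensional tolerances

flagged = out_of_tolerance(measurements, lower_limit, upper_limit)
print(flagged)   # [(3, 10.12), (5, 9.87)]
```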
1 – R. Veysseyre (2006) – Statistics and probability for engineer
2 – D. Broclain, J. Doubovetzky (2000) – Know how to read a medical article to decide
N. Bommena (2002) – Statistical reminders
A. Baccini (2010) – Basic descriptive statistics
Mr. Love (2010) – Introduction to descriptive statistics
D. Devika (2002) – Descriptive statistics
J. Levy (2010) – Web Math
CD4 3534-1:2003 Standard
ISO Standard/DIS 3534-2