data
Khôi Ròm - 11:15 PM -

Grading on a curve in college instilled a habit for using mean and standard deviation to describe a set of continuous data points. On any given assessment, about 68 percent of students were within one standard deviation of the mean. These were the "B" students. About 15 percent were one standard deviation above; these students each received an "A" – and the rest were "other." At the end of the semester, there might be one or two students in a 100-student freshman course who were two standard deviations above the class average. These students would get a nice letter from the head of the department.
The habit of using "mean" and "standard deviation" and the convenient rule that 68 percent of samples are within one standard deviation of the mean and 95 percent are within two standard deviations makes these measures attractive. Unfortunately, mean and standard deviation are trickier to use than you might remember.
For starters, these statistics only work well on normally distributed "bell curve" data. Such things as test scores or heights / weights of a population are all generally "normal." Errors on a broadband service, network capacity estimates, and stability metrics are generally not normal. Error distributions, for example, are often "positively skewed" with the left side of the bell curve compressed (most lines have low error counts) and with a long tail on the right side of the mean (a few outliers have continuous errors).
Another problem is that mean and standard deviation are not robust against outliers. Below are two groups of data with identical mean and standard deviation. But they’re not identical: Group I has a wider distribution below the mean and Group II has a single high outlier.
An alternative to mean and standard deviation are median and interquartile range (IQR). IQR is the difference between the third and first quartiles (a.k.a. the 75th and 25th quantiles). IQR is often reported using the "five-number summary," which includes: minimum, first quartile, median, third quartile and maximum.
Like mean and standard deviation, median and IQR measure the central tendency and spread, respectively, but are robust against outliers and non-normal data. They have a couple of additional advantages:
Below are the Group I and II data with the IQR and median. Now the red median bar makes it clear that Group I's values are typically higher than Group II's, but Group I also has a wider spread, as indicated by the wider IQR shaded bar.
When doing an analysis, I'll typically create rows for the five-number summary, IQR, skew, and count of low and high outliers. Excel has built-in functions for all of these, except outliers, which can be computed using a basic formula.
Measure | Excel Function |
min and max | MIN() MAX() |
quartiles Q1 and Q3 | QUARTILE.INC(, 1) QUARTILE.INC(, 3) |
median | MEDIAN() or QUARTILE.INC(, 2) |
IQR | Q3 – Q1 |
Outliers (low and high) | COUNTIF(, "< "&(Q1-1.5*IQR) ) COUNTIF(, "< "&(Q3+1.5*IQR) ) |
The open source statistical software R also makes it easy to calculate these values with the built-in summary function. I'll look at this in a future blog post and also look at R's box plot capability, which shows median, IQR, and outliers in a concise graphical summary that enables rapid scanning of many variables or of a single variable over time.