Describing Data: Why median and IQR are often better than mean and standard deviation

Grading on a curve in college instilled a habit for using mean and standard deviation to describe a set of continuous data points. On any given assessment, about 68 percent of students were within one standard deviation of the mean. These were the "B" students. About 15 percent were one standard deviation above; these students each received an "A" – and the rest were "other." At the end of the semester, there might be one or two students in a 100-student freshman course who were two standard deviations above the class average. These students would get a nice letter from the head of the department.

The habit of using "mean" and "standard deviation" and the convenient rule that 68 percent of samples are within one standard deviation of the mean and 95 percent are within two standard deviations makes these measures attractive. Unfortunately, mean and standard deviation are trickier to use than you might remember.

For starters, these statistics only work well on normally distributed "bell curve" data. Such things as test scores or heights / weights of a population are all generally "normal." Errors on a broadband service, network capacity estimates, and stability metrics are generally not normal. Error distributions, for example, are often "positively skewed" with the left side of the bell curve compressed (most lines have low error counts) and with a long tail on the right side of the mean (a few outliers have continuous errors).

Another problem is that mean and standard deviation are not robust against outliers. Below are two groups of data with identical mean and standard deviation. But they’re not identical: Group I has a wider distribution below the mean and Group II has a single high outlier.

Median and IQR

An alternative to mean and standard deviation are median and interquartile range (IQR). IQR is the difference between the third and first quartiles (a.k.a. the 75th and 25th quantiles). IQR is often reported using the "five-number summary," which includes: minimum, first quartile, median, third quartile and maximum.

Like mean and standard deviation, median and IQR measure the central tendency and spread, respectively, but are robust against outliers and non-normal data. They have a couple of additional advantages:

Outlier Identification. IQR makes it easy to do an initial estimate of outliers by looking at values more than one-and-a-half times the IQR distance below the first quartile or above the third quartile.
Skewness. Comparing the median to the quartile values shows whether data is skewed. For example, Group I has a high proportion of larger values, and the median is therefore closer to the third quartile than the first quartile. By contrast, the values in Group II are more evenly distributed.

Below are the Group I and II data with the IQR and median. Now the red median bar makes it clear that Group I's values are typically higher than Group II's, but Group I also has a wider spread, as indicated by the wider IQR shaded bar.

Calculating Median and IQR

When doing an analysis, I'll typically create rows for the five-number summary, IQR, skew, and count of low and high outliers. Excel has built-in functions for all of these, except outliers, which can be computed using a basic formula.

Measure	Excel Function
min and max	MIN() MAX()
quartiles Q1 and Q3	QUARTILE.INC(, 1) QUARTILE.INC(, 3)
median	MEDIAN() or QUARTILE.INC(, 2)
IQR	Q3 – Q1
Outliers (low and high)	COUNTIF(, "< "&(Q1-1.5IQR) ) COUNTIF(, "< "&(Q3+1.5IQR) )

The open source statistical software R also makes it easy to calculate these values with the built-in summary function. I'll look at this in a future blog post and also look at R's box plot capability, which shows median, IQR, and outliers in a concise graphical summary that enables rapid scanning of many variables or of a single variable over time.

data

Deciding With Data

Space Tools

Median and IQR

Calculating Median and IQR

Xem Nhiều

Liên kết

Contact

Họ và tên

Email của bạn

Nội dung

data

Deciding With Data

Space Tools

Breadcrumbs

Describing Data: Why median and IQR are often better than mean and standard deviation

Median and IQR

Calculating Median and IQR

Xem Nhiều

Liên kết

Contact

Họ và tên

Email của bạn

Nội dung