The mean of a data set is obtained by summing the data elements and dividing by the total number of data elements (this is what most people think of as the average). The median of a data set is the middle value of the data elements when the data elements are sorted. Both of these numerical summaries are measures of center: each can be used to represent what a typical data value looks like in a data distribution. Is there a difference between these two types of “average”? How do we know which one to use in a given situation?
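To make the two definitions concrete, here is a minimal Python sketch. The function names `mean` and `median` and the sample values are assumptions made up for illustration; they are not part of the original discussion.

```python
def mean(values):
    """Sum the data elements and divide by how many there are."""
    return sum(values) / len(values)


def median(values):
    """Return the middle value of the sorted data.

    With an even number of elements there is no single middle value,
    so the two middle values are averaged.
    """
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


# Hypothetical sample values, chosen only for illustration.
sample = [7, 3, 9, 4, 22]
print(mean(sample))    # 9.0
print(median(sample))  # 7
```

Even in this tiny sample the two summaries disagree, because the single large value 22 raises the mean but not the middle value.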
In order to make the distinction between mean and median crystal clear, let’s look at a hypothetical example. Let’s say there are 10 middle-class people sitting in a bar. Then the person with the highest income leaves and Bill Gates walks in.
Because the richest person in the bar is now much richer than before, the mean income in the bar soars. The income of the nine people who aren’t Bill Gates has not increased and it is hard to convince them that they are better off because the mean income is now way up.
This example points to a very important concept called a resistant measure (or resistant numerical summary). To make this notion clearer, let’s say the following represents the annual income of the 10 people prior to the arrival of Bill Gates.
The income data in the table are already sorted. The total income of the group is $505,000, leading to a mean income of $50,500. Because there are 10 data values (an even number), there is no single middle data value, so we take the middle two data values ($45,000 and $50,000 from Marcus and Randy, respectively). Averaging these gives the median income of $47,500. Suppose Jack (the highest income earner in the group) leaves and Bill Gates walks in. Then we have the following income information.
With Bill Gates’ income of $1 billion (one followed by nine zeros), the mean income of the group shoots way up. The total income of the group is now $1 billion plus change. Thus the mean income is slightly above 0.1 billion dollars, or $100 million. However, the median income does not change; it is still halfway between $45,000 and $50,000 (i.e. $47,500).
Most people would agree that $47,500 is a better reflection of the incomes of the people sitting in the bar and that the amount of $100 million has nothing to do with the reality of the 9 people in the bar who are not Bill Gates. It is hard to argue that a mean annual income of $100 million is a good indication of the financial well-being of the people sitting in the bar.
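The arithmetic in the bar example can be checked with a short Python sketch. The ten incomes below are hypothetical: they are not the values from the article's table, and were only chosen so that the total is $505,000 and the two middle values are $45,000 and $50,000, as stated in the text.

```python
from statistics import mean, median

# Hypothetical incomes for the 10 people (NOT the article's table).
# Chosen so the total is $505,000 and the two middle values are
# $45,000 and $50,000, matching the figures quoted in the text.
incomes_before = [30_000, 35_000, 40_000, 42_000, 45_000,
                  50_000, 55_000, 60_000, 68_000, 80_000]

# The highest earner leaves and a $1 billion income takes that seat.
incomes_after = incomes_before[:-1] + [1_000_000_000]

for label, incomes in [("before", incomes_before), ("after", incomes_after)]:
    print(f"{label}: mean = {mean(incomes):,.0f}, "
          f"median = {median(incomes):,.0f}")
```

Under these assumed incomes the mean moves from $50,500 to a little over $100 million, while the median stays at $47,500, because swapping the largest value does not touch the two middle values.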
Numerical summaries that are not affected by extreme data values are said to be resistant. The Bill Gates example illustrates that the mean is not a resistant numerical summary while the median is. The extreme income of Bill Gates severely skews the mean income, yet with Bill Gates in the room the median income does not change. In general, the median is less likely to be affected by extreme data values.
The preceding discussion leads us to an interesting and useful observation.
- For a skewed data distribution, the median is a better measure of center.
- For any reasonably symmetric distribution with no outliers, the mean is a better measure of center.
When the data distribution is skewed (left or right), the extreme data values in the longer tail tend to pull the mean in the direction of that tail. If the data distribution is skewed right, then the extreme data values in the right tail pull the mean up. If the data distribution is skewed left, then the extreme data values in the left tail pull the mean down. Thus we have another useful observation.
- For a skewed right data distribution, the mean is substantially larger than the median.
- For a skewed left data distribution, the mean is substantially smaller than the median.
- For a symmetric data distribution, the mean is roughly equal to the median.
Thus, the relation between the mean and the median can give us some indication of the skewness (or lack of it) in the data distribution in question. However, the three bullet points are just guidelines and may not hold in all cases. The guidelines tend to hold up well for continuous data distributions but may be violated in discrete distributions.
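The three guidelines, and one way they can fail, can be sketched in Python. All four data sets below are hypothetical and were made up for illustration. The helper `skewness` is also an assumption: it uses the moment-based definition (third central moment divided by the variance raised to the power 1.5), which is one common way to quantify skew.

```python
from statistics import mean, median


def skewness(values):
    """Moment-based skewness: third central moment over variance**1.5."""
    m = mean(values)
    n = len(values)
    m2 = sum((x - m) ** 2 for x in values) / n
    m3 = sum((x - m) ** 3 for x in values) / n
    return m3 / m2 ** 1.5


# Hypothetical data sets, chosen only to illustrate the guidelines.
skewed_right = [1, 2, 2, 3, 3, 3, 4, 20]
skewed_left = [-x for x in skewed_right]  # mirror image of the above
symmetric = [1, 2, 3, 4, 5, 6, 7]

# A discrete data set with a long right tail where the guideline fails:
# the skewness is positive, yet the mean is below the median.
exception = [0, 0, 0, 0, 1, 1, 1, 1, 1, 4]

for name, data in [("skewed right", skewed_right),
                   ("skewed left", skewed_left),
                   ("symmetric", symmetric),
                   ("discrete exception", exception)]:
    print(f"{name}: skewness = {skewness(data):+.2f}, "
          f"mean = {mean(data)}, median = {median(data)}")
```

In the last data set the many repeated values at 0 and 1 fix the median at 1, while the mean is 0.9, so the mean sits slightly below the median even though the tail points to the right.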
Which Spread Should We Use?
A related question is about the spread. We consider two measures of spread, the standard deviation and the inter-quartile range (IQR).
The standard deviation is a numerical summary that measures how much variation or dispersion there is from the mean. Because the calculation of the standard deviation involves the mean, the standard deviation is an arithmetically based calculation just like the mean. Thus the standard deviation is not a resistant measure. Extreme data values skew the standard deviation just as they skew the mean. This can be seen with the Bill Gates example.
Before Bill Gates walks into the bar, the standard deviation of income is $14,975.91. With Bill Gates in the room, the standard deviation of income is larger than $313 million. On the other hand, the IQR of income remains the same, with or without Bill Gates.
The Bill Gates example shows that the inter-quartile range (IQR) is a resistant measure. Note that the IQR is built from measures of position: it is defined as the distance between the first quartile and the third quartile.
So the rule above for the mean and the median carries over to measures of spread. Whenever a data distribution is skewed or contains extreme data values, the IQR is a better measure of spread because it is resistant. On the other hand, when a data distribution is reasonably symmetric and has no outliers, the standard deviation is a more appropriate measure of spread.
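The same comparison can be sketched in Python for the two measures of spread. The incomes below are hypothetical (they are not the article's table), so the standard deviation before Bill Gates arrives will not reproduce the $14,975.91 quoted above. The helper `iqr` is an assumption built on `statistics.quantiles` with its default method; other quartile conventions can give slightly different quartiles.

```python
from statistics import quantiles, stdev

# Hypothetical incomes for the 10 people (NOT the article's table).
incomes_before = [30_000, 35_000, 40_000, 42_000, 45_000,
                  50_000, 55_000, 60_000, 68_000, 80_000]

# The highest earner leaves and a $1 billion income takes that seat.
incomes_after = incomes_before[:-1] + [1_000_000_000]


def iqr(values):
    """Distance between the first and third quartiles."""
    q1, _, q3 = quantiles(values, n=4)
    return q3 - q1


for label, incomes in [("before", incomes_before), ("after", incomes_after)]:
    print(f"{label}: standard deviation = {stdev(incomes):,.2f}, "
          f"IQR = {iqr(incomes):,.2f}")
```

With these assumed incomes the sample standard deviation jumps from the tens of thousands to more than $313 million, while the IQR is identical in both cases, since the quartiles are computed from values away from the extreme end of the sorted data.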
A Take-Away Lesson
- For a skewed data distribution, the median is a better measure of center and the inter-quartile range (IQR) is a better measure of spread.
- For any reasonably symmetric distribution with no outliers, the mean is a better measure of center and the standard deviation is a better measure of spread.