Two soldiers, both statisticians, were fighting side by side in a battlefield. They spotted an enemy soldier and they both fired their rifles. One statistician soldier fired one foot to the left of the enemy soldier and the other statistician soldier fired one foot to the right of the enemy soldier. They immediately gave each other a high five and exclaimed, “on average the enemy soldier is dead!”
Of course this is an absurd story. Not only the enemy soldier was not dead, he was ready to fire back at the two statistician soldiers. This story also reminds the author of this blog about a man who had one foot in a bucket of boiling hot water and another foot in a bucket of ice cold water. The man said, ‘on average, I ought to feel pretty comfortable!”
In statistics, center (or central tendency or average) refers to a set of numerical summaries that attempt to describe what a typical data value might look like. These are “average” value or representative value of the data distribution. The more common notions of center or average are mean and median. The absurdity in these two stories points to the inadequacy of using center alone in describing a data distribution. To get a more complete picture, we need to use spread too.
Spread (or dispersion) refers to a set of numerical summaries that describe the degree to which the data are spread out. Some of the common notions of spread are range, 5-number summary, interquartile range (IQR), and standard deviation. We will not get into the specifics of these notions here. Refer to Looking at Spread for a more detailed discussion of these specific notions of spread. Our purpose is to discuss the importance of spread.
Why is spread important? Why is using average alone not sufficient in describing a data set? Here are several points to consider.
1. Using average alone can be misleading.
The stories mentioned at the beginning aside, using average alone gives incomplete information. Depending on the point of view, using average along can make things look better than they are or worse than they really are.
A handy example would be the Atlanta Olympic in 1996. In the summer time, Atlanta is hot. In fact, some people call the city Hotlanta! Yet in the bid for the right to host the Olympic, the planning committee described the temperate using average only (the daily average temperate being 75 Fahrenheit). Of course, the temperature of 75 degrees would be indeed comfortable, but was clearly not what the visitors would experience during the middle of the day!
2. A large spread usually means inconsistent results or performances.
A large spread indicates that there is a great deal of dispersion or scatter in the data. If the data are measurements of a manufacturing process, a large spread indicates that the product may be unreliable or substandard. So a quality control procedure would monitor average as well as spread.
If the data are exam scores, a large spread indicates that there exists a wide range of abilities among the students. Thus teaching a class with a large spread may require a different approach than teaching a class with a smaller spread.
3. In investment, standard deviation is used as a measure of risk.
As indicated above, standard deviation is one notion of spread. In investment, risk refers to the chance that the actual return on an investment may deviate greatly from the expected return (i.e. average return). One way to quantify risk is to calculate the standard deviation of the distribution of returns on the investment. The calculation of the standard deviation is based on the deviation of each data value from the mean. It gives an indication of the deviation of an “average” data point from the mean.
A large standard deviation of the returns on an investment indicates that there will be a broad range of possibilities of investment returns (i.e. it is a risky investment in that there is a chance to make a great deal of money and there is also a chance to lose your original investment). A small standard deviation indicates that there will likely be not many surprises (i.e. the actual returns will be likely not too much different from the average return).
Thus it pays for any investor to pay attention to the average returns that are expected as well as the standard deviation of the rate of returns.
4. Without a spread, it is hard to gauge the significance of an observed data value.
When an observed data value deviates from the mean, it may not be easy to gauge the significance of the observed data value. For the type of measurements that we deal with in our everyday experience (e.g. height and weight measurements), we usually have good ideas whether the data values we observe are meaningful.
But for data measurements that are not familiar to us, we usually have a harder time making sense of the data. For example, if the observed data value is different from the average, how do we know if the difference is just due to chance or if the difference is real and significant. This kind of question is at the heart of statistical inference. Many procedures in statistical inference requires the use of a spread in addition to an average.
For more information about the notions of spread, refer to other discussion in this blog or the following references.
- Moore. D. S., Essential Statistics, W. H. Freeman and Company, New York, 2010
- Moore. D. S., McCabe G. P., Craig B. A., Introduction to the Practice of Statistics, 6th ed., W. H. Freeman and Company, New York, 2009