One way to estimate probabilities is to use empirical data. However, if the histogram of the data is shaped like a bell curve (or reasonably close to one), we can use a normal curve to estimate probabilities instead. All we need to know are the mean and standard deviation of the data. We use SAT scores as an example.
The version of the SAT described here has three sections: critical reading, mathematics, and writing. The scores range from 600 to 2,400. The PDF file at this link lists all 1,547,990 SAT scores from 2010 as a frequency distribution (e.g. 382 students with a perfect score of 2400, 219 students with a score of 2390, etc.). The mean of all 1,547,990 SAT scores in 2010 is 1509 and the standard deviation is 312 (displayed at the bottom of the table).
The table also displays the percentile ranking of each score. For example, the score of 1850 is the 85th percentile, meaning that about 85% of the scores are less than 1850. We can compute this from the frequency distribution. There are 1,313,812 scores below 1850. Thus the proportion of scores less than 1850 is:
$$\frac{1{,}313{,}812}{1{,}547{,}990} \approx 0.849$$
What percentage of scores is greater than 2050? According to the table, it should be about 5%, since 2050 is the 95th percentile. We can count the number of scores greater than 2050 (there are 74,165 such scores). We have:
$$\frac{74{,}165}{1{,}547{,}990} \approx 0.0479$$
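The two empirical proportions can be checked with a few lines of Python. This is only a sketch of the arithmetic: the counts are the ones quoted in the text, and the variable names are illustrative.

```python
# Counts quoted in the text for the 2010 SAT frequency distribution.
TOTAL_SCORES = 1_547_990
SCORES_BELOW_1850 = 1_313_812
SCORES_ABOVE_2050 = 74_165

# Empirical proportions: count of interest divided by the total count.
prop_below_1850 = SCORES_BELOW_1850 / TOTAL_SCORES
prop_above_2050 = SCORES_ABOVE_2050 / TOTAL_SCORES

print(f"Proportion below 1850: {prop_below_1850:.3f}")
print(f"Proportion above 2050: {prop_above_2050:.4f}")
```

The results agree with the percentile rankings in the table: roughly 85% of scores fall below 1850 and roughly 5% fall above 2050.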
Even though this table provides a useful resource to interpret SAT scores, the distribution of SAT scores can be described with just two numbers, namely the mean (1509) and the standard deviation (312). Instead of estimating proportion or percentile ranking using the data in the table, we can use a normal curve.
Figure 1 below is a histogram of all 1,547,990 SAT scores taken in 2010. The frequency distribution that is used to draw the histogram is found here.
There are 181 bars in this histogram, corresponding to the 181 distinct possible scores in the data (the scores come in increments of 10: 600, 610, 620, and so on up to 2390 and 2400). Because so many bars are crammed into a small graph, each bar appears as a thin vertical line. The histogram is symmetrical around a single peak and it tapers down smoothly on each side. Most of the data is clustered in the middle. The bars around the middle are very tall (i.e. most students score in the middle range). On the other hand, the bars at the far left and the far right are very short (very few students score at the top or at the bottom). As a result, the histogram has a “bell” shape. The following (Figure 2) is a representation of the same 1,547,990 SAT scores as a smooth curve, which also has a “bell” shape.
Both Figure 1 and Figure 2 can be approximated by the normal curve shown in Figure 3. Figure 2 is a graph of the actual SAT scores (about 1.5 million scores). On the other hand, Figure 3 is a mathematical model: a bell-shaped curve centered at 1509 with standard deviation 312. Note that the curve in Figure 2 is a little jagged, while the curve in Figure 3 is very smooth.
The total area under the normal curve in Figure 3 is 1.0, representing 100% of the data (in this case, the SAT scores). Finding the percentile ranking of the score 1850 amounts to finding the area under the curve to the left of 1850 (see Figure 3a).
The green area in Figure 3a is 0.8621, which is about 0.01 away from the actual proportion of 0.849 computed from the frequency distribution.
Finding the proportion of SAT scores greater than 2050 amounts to finding the area under the curve in Figure 3 to the right of 2050 (see Figure 3b).
The green area in Figure 3b is 0.0418, which is about 0.006 away from the actual proportion of 0.0479 computed from the frequency distribution.
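Both areas can be reproduced with Python's standard library, as in the sketch below. It uses `statistics.NormalDist` with the mean and standard deviation quoted in the text. The second half is an assumption about how the quoted figures were obtained: it rounds the standardized score (the distance from the mean in standard deviations) to two decimal places, as a printed table of normal areas typically requires.

```python
from statistics import NormalDist

MEAN, SD = 1509, 312  # mean and standard deviation quoted in the text

sat = NormalDist(mu=MEAN, sigma=SD)

# Area to the left of 1850: estimated percentile ranking of 1850.
area_left_1850 = sat.cdf(1850)

# Area to the right of 2050: estimated proportion of scores above 2050.
area_right_2050 = 1 - sat.cdf(2050)

# Assumption: a printed table indexes standardized scores to two decimals.
standard = NormalDist()  # mean 0, standard deviation 1
z_1850 = round((1850 - MEAN) / SD, 2)
z_2050 = round((2050 - MEAN) / SD, 2)
table_left_1850 = standard.cdf(z_1850)
table_right_2050 = 1 - standard.cdf(z_2050)

print(f"Unrounded:   {area_left_1850:.4f}, {area_right_2050:.4f}")
print(f"Table-style: {table_left_1850:.4f}, {table_right_2050:.4f}")
```

The table-style values match the 0.8621 and 0.0418 quoted above. The unrounded values differ slightly in the third decimal place, which is small compared with the gap between the normal model and the actual data.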
There is no simple formula for calculating the area under a normal curve: the area cannot be expressed in closed form using elementary functions. To find the green areas in Figure 3a and Figure 3b, either use software that calculates areas under a normal curve or use a table of areas under a normal curve.
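As a rough illustration of what such software might do internally, the sketch below computes the area to the left of a score in two ways: through the error function `math.erf`, and by adding up thin rectangles under the curve. The function names, the choice to start the sum 10 standard deviations below the mean, and the number of steps are all illustrative assumptions, not a description of any particular package.

```python
import math

def normal_area_left(x, mean, sd):
    """Area under the normal curve to the left of x, via the error function."""
    z = (x - mean) / sd
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def normal_area_left_numeric(x, mean, sd, steps=100_000):
    """Same area by the midpoint rule, summing from mean - 10*sd up to x.

    Assumes x is above mean - 10*sd; the area further left is negligible.
    """
    start = mean - 10 * sd
    width = (x - start) / steps
    total = 0.0
    for i in range(steps):
        t = start + (i + 0.5) * width  # midpoint of the i-th thin rectangle
        total += math.exp(-0.5 * ((t - mean) / sd) ** 2)
    # Scale by the rectangle width and the normalizing constant of the density.
    return total * width / (sd * math.sqrt(2.0 * math.pi))

exact = normal_area_left(1850, 1509, 312)
approx = normal_area_left_numeric(1850, 1509, 312)
print(f"erf-based: {exact:.4f}, numeric: {approx:.4f}")
```

The two methods agree to many decimal places, which shows that the lack of a simple formula is no practical obstacle.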
The advantage of using a normal curve to estimate ranking of SAT scores (instead of using the actual data) is that we only need to know two pieces of information, the mean SAT score and the standard deviation of SAT scores. Once we know these two items, we can estimate the probabilities or proportions of SAT scores using software or a table of areas under a normal curve.
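To make the "two pieces of information" point concrete, here is a small sketch of helpers that take only a mean and a standard deviation. The function names are hypothetical, and the example range (one standard deviation on either side of the mean, 1197 to 1821) is chosen purely for illustration.

```python
from statistics import NormalDist

def percentile_rank(score, mean, sd):
    """Estimated proportion of values below `score` under a normal model."""
    return NormalDist(mu=mean, sigma=sd).cdf(score)

def proportion_between(low, high, mean, sd):
    """Estimated proportion of values between `low` and `high`."""
    dist = NormalDist(mu=mean, sigma=sd)
    return dist.cdf(high) - dist.cdf(low)

MEAN, SD = 1509, 312  # the only inputs taken from the SAT data

rank_of_mean = percentile_rank(MEAN, MEAN, SD)
within_one_sd = proportion_between(MEAN - SD, MEAN + SD, MEAN, SD)

print(f"Percentile rank of the mean score: {rank_of_mean:.2f}")
print(f"Proportion between 1197 and 1821: {within_one_sd:.4f}")
```

Under the normal model, half of the scores fall below the mean and roughly 68% fall within one standard deviation of it. No individual scores are needed for either estimate.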
We can also use normal curves in many other settings where normal distributions are good descriptions of real data. For example, standardized test scores such as SAT and ACT closely follow normal distributions. Normal distributions arise naturally in many physical, biological, and social measurement situations. Normal distributions are also important in statistical inference. As long as we know the mean and standard deviation of the normal distribution in question, we can estimate probabilities (areas under a normal curve) without using actual measurements. One caveat to keep in mind is that there are many data distributions that are not normal. For example, data on income tend to be skewed to the right. So for such distributions, we need to use normal curves with care.