## Benford’s Law and US Census Data, Part I

The first digit (or leading digit) of a number is its leftmost digit (e.g. the first digit of 567 is 5). The first digit of a number can only be 1, 2, 3, 4, 5, 6, 7, 8, or 9, since we do not usually write a number such as 567 as 0567. Some fraudsters may believe that the first digits of the numbers in financial documents appear with equal frequency (i.e. each digit appears about 11% of the time). In fact, this is not the case. Simon Newcomb discovered in 1881, and the physicist Frank Benford rediscovered in 1938, that the first digits in many data sets occur according to the probability distribution indicated in Figure 1 below:

The above probability distribution is now known as Benford’s law. It is a powerful yet relatively simple tool for detecting financial and accounting fraud (see this previous post). For example, according to Benford’s law, about 30% of the numbers in legitimate data have 1 as the first digit. Fraudsters who do not know this will tend to produce far fewer 1’s as first digits in their faked data.
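The distribution in Figure 1 follows the formula $P(d) = \log_{10}(1 + 1/d)$, where $d$ is the first digit. As a quick illustration (a minimal Python sketch, not part of the original posts), the nine probabilities can be computed directly:

```python
import math

def benford_prob(d):
    """Probability under Benford's law that the first digit is d."""
    return math.log10(1 + 1 / d)

probs = {d: round(benford_prob(d), 3) for d in range(1, 10)}
# digit 1 gets probability 0.301, while digit 9 gets only 0.046
```

Note that the nine probabilities sum to exactly 1, since the sum telescopes to $\log_{10} 10$.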

Data for which Benford’s law is applicable tend to be data that spread across multiple orders of magnitude. Examples include the income data of a large population and census data such as the populations of cities and counties. In addition to demographic and scientific data, Benford’s law is also applicable to many types of financial data, including income tax data, stock exchange data, and corporate disbursement and sales data (see [1]). The author of [1], Mark Nigrini, also discussed data analysis methods (based on Benford’s law) that are used in forensic accounting and auditing.

In a previous post, we compared the trade volume data of the S&P 500 stock index to Benford’s law. In this post, we provide another example of Benford’s law in action: we analyze the population data of all 3,143 counties in the United States (data are found here). The following figures show the distribution of first digits in the population counts of all 3,143 counties. Figure 2 shows the actual counts and Figure 3 shows the proportions.

In both of these figures, there are nine bars, one for each possible first digit. In Figure 2, we see that 972 counties in the United States have 1 as the first digit of the population count (0.309, or 30.9%, of the total). This agrees quite well with the proportion of 0.301 from Benford’s law. Comparing the proportions in Figure 1 and Figure 3, we see that the actual proportions of first digits in the county population counts are in good general agreement with Benford’s law. The following figure shows a side-by-side comparison of Figure 1 and Figure 3.
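The tallies behind Figures 2 and 3 amount to extracting the leading digit of each population count and counting how often each digit occurs. A Python sketch of that tabulation (the population figures below are made-up stand-ins, not actual census values):

```python
from collections import Counter

def first_digit(n):
    """Return the leftmost digit of a positive integer."""
    return int(str(n)[0])

# hypothetical county population counts (stand-ins for the 3,143 real values)
populations = [152743, 98401, 1203456, 87219, 43512, 2750000, 16894]
counts = Counter(first_digit(p) for p in populations)
proportions = {d: counts[d] / len(populations) for d in range(1, 10)}
```

Running the same tabulation over the real census file and plotting `proportions` against Benford’s law reproduces a comparison like Figure 4.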

Looking at Figure 4, it is clear that the actual proportions of first digits in the 3,143 county population counts follow Benford’s law quite closely. Such a close match would be expected in authentic, unmanipulated data. A comparison like the one shown in Figure 4 lies at the heart of any data analysis technique based on Benford’s law.

References

1. Nigrini, M. J., I’ve Got Your Number, Journal of Accountancy, May 1999.
2. US Census Bureau – American FactFinder.
3. Wikipedia’s entry on Benford’s law.

## Is It Possible to Spot Fraudulent Numbers?

People who engage in financial fraud have a need to produce false data as part of their criminal activities. Is it possible to look at false data and determine that they are not real? A probability model known as Benford’s law is a powerful and relatively simple tool for detecting potential financial fraud or errors. In this post, we give some indication of how Benford’s law can be used, using data from the S&P 500 stock index as an example.

The first digit (or leading digit) of a number is the leftmost digit. For example, the first digit of $35,987 is 3. Since zero cannot be a first digit, there are only 9 possible choices for the first digit. Some data fabricators may try to distribute the first digits in false data fairly uniformly. However, the first digits of numbers in legitimate documents tend not to be distributed uniformly across the 9 possible digits. According to Benford’s law, about 30% of the numbers in many types of data sets have 1 as the first digit, and about 18% have 2. The higher digits have even lower frequencies, with 9 occurring only about 5% of the time. The following figure shows the probability distribution according to Benford’s law.

Frank Benford was a physicist working for the General Electric Company in the 1920s and 1930s. He observed that the first few pages of his logarithm tables (the pages corresponding to the lower digits) were dirtier and more worn than the pages for the higher digits. Without electronic calculators or modern computers, people in those days used logarithm tables to facilitate numerical calculation. Benford concluded that he was looking up the first few pages of the table more often (hence doing calculations involving lower first digits more often). He then hypothesized that there were more numbers with lower first digits in the real world. To test this hypothesis, Benford analyzed various types of data sets, including areas of rivers, baseball statistics, numbers in magazine articles, and street addresses listed in “American Men of Science”, a biographical directory. In all, the analysis involved a total of 20,229 numerical data values. Benford found that the leading digits in his data were distributed very much like the model described in Figure 1 above.
As a hands-on introduction to Benford’s law, we use data from the S&P 500 stock index from November 11, 2011 (data were found here). The S&P 500 is a stock index covering 500 large companies in the U.S. economy. The following table lists the prices and volumes of shares (the number of shares traded) for the first 6 companies of the S&P 500 index as of November 11, 2011. The first digits of these 6 prices are 8, 5, 5, 5, 7, 2. The first digits of these 6 share volumes are 3, 4, 2, 2, 1, 8.

$\displaystyle (1) \ \ \ \ \ \ \begin{bmatrix} \text{Stock Symbol}&\text{ }&\text{Company Name}&\text{ }&\text{Price}&\text{ }&\text{Volume} \\\text{ }&\text{ }&\text{ } \\ \text{MMM}&\text{ }&\text{3M Co}&\text{ }&\$82.29&\text{ }&\text{3.6M} \\ \text{ABT}&\text{ }&\text{Abbott Laboratories}&\text{ }&\$54.53&\text{ }&\text{4.9M} \\ \text{ANF}&\text{ }&\text{Abercrombie and Fitch Co}&\text{ }&\$56.80&\text{ }&\text{2.5M} \\ \text{ACN}&\text{ }&\text{Accenture PLC}&\text{ }&\$58.97&\text{ }&\text{2.2M} \\ \text{ACE}&\text{ }&\text{ACE Ltd}&\text{ }&\$71.24&\text{ }&\text{1.6M} \\ \text{ADBE}&\text{ }&\text{Adobe Systems Inc}&\text{ }&\$28.43&\text{ }&\text{8.4M} \end{bmatrix}$

The following figures show the frequencies of the first digits in the prices and volumes for the entire S&P 500 on November 11, 2011. It is clear that in both the prices and the volumes, the lower digits occur more frequently as first digits. For example, of the 500 closing prices of the S&P 500 on November 11, 2011, 82 prices have 1 as the first digit (16.4% of the total), while only 14 prices have 9 as the leading digit (2.8% of the total). For the trade volumes of the 500 stocks, the skewness is even more pronounced: there are 166 volumes with 1 as the first digit (33.2% of the total), while there are only 25 volumes with 9 as the first digit (5% of the total). The following figures express the same distributions in terms of proportions (or probabilities).
Anyone who tries to fake S&P 500 prices and volumes purely by random chance will not produce convincing results, not even results that can withstand a casual analysis based on figures such as Figures 2 through 5 above. What is even more interesting is the comparison between Figure 1 (Benford’s law) and Figure 5 (trade volumes of the S&P 500). The following figure is a side-by-side comparison. There is a remarkable agreement between the distribution of the first digits of the 500 trade volumes and Benford’s law. According to the distribution in Figure 1 (Benford’s law), about 60% of the leading digits in legitimate data are the digits 1, 2 and 3 (0.301 + 0.176 + 0.125 = 0.602). In the actual S&P 500 trade volumes on 11/11/2011, about 65% of the leading digits are one of the first three digits (0.332 + 0.202 + 0.114 = 0.648). We cannot expect the actual percentages to match those of Benford’s law exactly. However, the general agreement between the expected proportions (Benford’s law) and the actual data (S&P 500 volumes) is remarkable and informative.

There are many sophisticated computer tests that apply Benford’s law in fraud detection. However, at the heart of the method is a simple comparison like the one performed above: compare the actual frequencies of the digits with the frequencies predicted by Benford’s law. If a data fabricator produces numbers whose first digits are distributed fairly uniformly, a simple comparison will expose the discrepancy between the false data and Benford’s law. Too big a discrepancy between the actual data and Benford’s law (e.g. too few 1’s) is sometimes enough to raise suspicion of fraud. Many white-collar criminals do not know about Benford’s law and will not expect that, in many types of realistic data, 1 occurs as the first digit about 30% of the time.

Benford’s law does not fit every type of data.
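To make the comparison concrete, here is a small Python sketch that sets the two volume counts quoted above (166 ones and 25 nines out of 500) against the counts Benford’s law would predict. The observed numbers are taken from the text; everything else is illustrative:

```python
import math

n = 500  # number of trade-volume figures in the S&P 500
observed = {1: 166, 9: 25}  # first-digit counts quoted in the text

# expected counts under Benford's law: n * log10(1 + 1/d)
expected = {d: n * math.log10(1 + 1 / d) for d in observed}
# digit 1: Benford predicts about 150.5, we observed 166
# digit 9: Benford predicts about 22.9, we observed 25
```

Both observed counts sit close to the Benford predictions, which is exactly the kind of agreement an investigator looks for in legitimate data.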
It does not fit numbers that are randomly generated. For example, lottery numbers are drawn at random from balls in a big glass jar; hence lottery numbers are uniformly distributed (i.e. every number has an equal chance of being selected). Even some naturally generated numbers do not follow Benford’s law. For example, data that are confined to a relatively narrow range do not follow it; examples include the heights of human adults and IQ scores. Another example is the S&P 500 stock prices in Figure 2. Note that the pattern of the bars in Figure 2 and Figure 4 does not quite match the pattern of Benford’s law. Most of the stock prices in the S&P 500 fall below $100 (on 11/11/2011, all prices were either 2-digit or 3-digit numbers, with only 35 of the 500 prices being 3-digit). The following figure shows a side-by-side comparison between the S&P 500 stock prices and Benford’s law. Note that there are too few 1’s as first digits in the S&P 500 prices.

Even though the S&P 500 prices do not follow Benford’s law, they are far from uniformly distributed (the smaller digits still come up more frequently). Any attempt to fake S&P 500 stock prices by making each digit equally likely as a first digit will still not produce convincing results (at least not to an experienced investigator).

Data for which Benford’s law is applicable tend to be data that spread across multiple orders of magnitude. In the example of the S&P 500 trade volume data, the data range from about half a million shares to 210 million shares (from 6-digit to 9-digit numbers, i.e. across 3 orders of magnitude). In contrast, the S&P 500 prices cover only 1 order of magnitude. Other examples of data for which Benford’s law is usually applicable: income data of a large population, and census data such as the populations of cities and counties.

Benford’s law is also applicable to financial data such as income tax data and corporate expense data. In [1], Nigrini discussed data analysis methods (based on Benford’s law) that are used in forensic accounting and auditing. For more information about Benford’s law, see the references below.

References

1. Nigrini, M. J., I’ve Got Your Number, Journal of Accountancy, May 1999.
2. S&P 500 data.
3. Wikipedia’s entry on Benford’s law.

## Presidential Elections and Statistical Inference

We use an example of presidential elections to illustrate the reasoning process of statistical inference.

The current holder of a political office is called an incumbent. Does being an incumbent give an edge for reelection? The answer seems to be yes. For example, in the United States House of Representatives, the percentage of incumbents winning reelection has routinely been over 80% for more than 50 years (sometimes over 90%). Since 1936, there have been 13 US presidential elections involving an incumbent. In these 13 elections, the incumbent won 10 times (see the table below).

$\displaystyle \begin{bmatrix} \text{ }&\text{ }&\text{ } \\ \text{Year}&\text{ }&\text{Candidate}&\text{ }&\text{Electoral Votes}&\text{ }&\text{Candidate}&\text{ }&\text{Electoral Votes}&\text{Winner} \\ \text{ }&\text{ }&\text{Incumbent}&\text{ }&\text{Incumbent}&\text{ }&\text{Challenger}&\text{ }&\text{Challenger}&\text{ } \\\text{ }&\text{ }&\text{ } \\ \text{1936}&\text{ }&\text{Roosevelt}&\text{ }&\text{523}&\text{ }&\text{Landon}&\text{ }&\text{8}&\text{Incumbent} \\ \text{1940}&\text{ }&\text{Roosevelt}&\text{ }&\text{449}&\text{ }&\text{Willkie}&\text{ }&\text{82}&\text{Incumbent} \\ \text{1944}&\text{ }&\text{Roosevelt}&\text{ }&\text{432}&\text{ }&\text{Dewey}&\text{ }&\text{99}&\text{Incumbent} \\ \text{1948}&\text{ }&\text{Truman}&\text{ }&\text{303}&\text{ }&\text{Dewey}&\text{ }&\text{189}&\text{Incumbent} \\ \text{1956}&\text{ }&\text{Eisenhower}&\text{ }&\text{457}&\text{ }&\text{Stevenson}&\text{ }&\text{73}&\text{Incumbent} \\ \text{1964}&\text{ }&\text{Johnson}&\text{ }&\text{486}&\text{ }&\text{Goldwater}&\text{ }&\text{52}&\text{Incumbent} \\ \text{1972}&\text{ }&\text{Nixon}&\text{ }&\text{520}&\text{ }&\text{McGovern}&\text{ }&\text{17}&\text{Incumbent} \\ \text{1976}&\text{ }&\text{Ford}&\text{ }&\text{240}&\text{ }&\text{Carter}&\text{ }&\text{297}&\text{Challenger} \\ \text{1980}&\text{ }&\text{Carter}&\text{ }&\text{49}&\text{ }&\text{Reagan}&\text{ }&\text{489}&\text{Challenger} \\ \text{1984}&\text{ }&\text{Reagan}&\text{ }&\text{525}&\text{ }&\text{Mondale}&\text{ }&\text{13}&\text{Incumbent} \\ \text{1992}&\text{ }&\text{GHW Bush}&\text{ }&\text{168}&\text{ }&\text{Clinton}&\text{ }&\text{370}&\text{Challenger} \\ \text{1996}&\text{ }&\text{Clinton}&\text{ }&\text{379}&\text{ }&\text{Dole}&\text{ }&\text{159}&\text{Incumbent} \\ \text{2004}&\text{ }&\text{GW Bush}&\text{ }&\text{286}&\text{ }&\text{Kerry}&\text{ }&\text{252}&\text{Incumbent} \end{bmatrix}$

Clearly, more than half of these presidential elections were won by incumbents. Note that the three wins by challengers occurred in times of economic turmoil or in the wake of political scandal. So presidential incumbents seem to have an edge, except possibly in times of economic or political turmoil.

The power of incumbency in political elections at the presidential and congressional level seems like an overwhelming force (a reelection rate of over 80% in the US House and 10 wins for the last 13 presidential incumbents). Incumbents in political elections at other levels likely have a similar advantage. So the emphasis here is not on establishing the fact that incumbents have an advantage. Rather, our goal is to illustrate the reasoning process of statistical inference using the example of presidential elections.

The statistical question we want to ask is: does this observed result provide evidence of a real advantage for incumbents in US presidential elections? Is the high number of incumbent wins due to a real incumbent advantage or just due to chance?

One good way to think about this question is through a “what if”. What if there is no incumbency advantage? Then a challenger is just as likely to win as an incumbent. Under this “what if”, each election is like a toss of a fair coin. This “what if” is called the null hypothesis.

Assuming each election is a coin toss, incumbents should win about half of the elections, which would be 6.5 (in practice, 6 or 7). Does the observed difference between 10 and 6.5 indicate a real difference, or is it just due to random chance (the incumbents were just lucky)?

Assuming each election is a coin toss, how likely is it to have 10 or more wins for incumbents in 13 presidential elections involving incumbents? Equivalently, how likely is it to get 10 or more heads in 13 tosses of a fair coin?

Based on our intuitive understanding of coin tossing, getting 10 or more heads out of 13 tosses of a fair coin does not seem likely (getting 6 or 7 heads would be a more likely outcome). So we should reject this “what if” notion that every presidential election is like a coin toss, rather than believing that the incumbents were just very lucky in the last 13 presidential elections involving incumbents.

If you feel that you do not have a good handle on the probability of getting 10 or more heads in 13 tosses of a fair coin, we can try simulation. We toss a fair coin 13 times and count the number of heads, and we repeat this process 10,000 times (done in an Excel spreadsheet). The following figure is a summary of the results.
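The spreadsheet simulation just described can be reproduced in a few lines of Python (a sketch, not the original Excel workbook; the seed is arbitrary and chosen only for reproducibility):

```python
import random

random.seed(0)  # arbitrary seed so the run is reproducible

def count_extreme(repetitions=10_000, tosses=13, threshold=10):
    """Count the repetitions in which a fair coin lands heads
    `threshold` or more times out of `tosses` flips."""
    extreme = 0
    for _ in range(repetitions):
        heads = sum(random.randint(0, 1) for _ in range(tosses))
        if heads >= threshold:
            extreme += 1
    return extreme

hits = count_extreme()
p_value = hits / 10_000  # lands near the exact binomial value of 0.0461
```

Each run will give a slightly different count, but the estimated probability stays close to the exact value computed below.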

Note that of the 10,000 simulated repetitions (each consisting of 13 coin tosses), only 465 produced 10 or more heads (350 + 100 + 15 = 465). So getting 10 or more heads in 13 coin tosses is unlikely: in our simulation, it happened only 465 times out of 10,000. If we performed another 10,000 repetitions, we would get a similarly small likelihood of getting 10 or more heads.

Our simulated probability of obtaining 10 or more heads in 13 coin tosses is 465/10,000 = 0.0465. This probability can also be computed exactly using the binomial distribution; it turns out to be 0.0461 (not far from the simulated result).
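The exact probability is a binomial tail sum, $P(X \ge 10)$ for $X \sim \text{Binomial}(13, \tfrac{1}{2})$. A short Python check:

```python
from math import comb

# P(X >= 10) where X counts heads in 13 tosses of a fair coin
tail = sum(comb(13, k) for k in range(10, 14))  # 286 + 78 + 13 + 1 = 378
p_value = tail / 2**13                          # 378/8192, about 0.0461
```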

Under the null hypothesis that incumbents have no advantage over challengers, we can use a simple model of tosses of a fair coin. Then we look at the observed result of 10 heads in 13 coin tosses (10 wins for incumbents in the last 13 presidential elections involving incumbents). We ask: how likely is this observed result if elections are like coin tosses? Based on the simulation, we see that the observed result is unlikely (it happened 465 times out of 10,000). An exact computation gives the probability as 0.0461. So we reject the null hypothesis rather than believe that incumbents have no advantage over challengers.

The reasoning process for a test of significance problem can be summarized in the following steps.

1. You have some observed data (e.g. data from experiments or observational studies). The observed data appear to contradict a conventional wisdom (or a neutral position). We want to know whether the difference between the observed data and the conventional wisdom is a real difference or just due to chance. In this example, the observed data are the 10 incumbent wins in the last 13 presidential elections involving incumbents. The neutral position is that incumbents have no advantage over the challengers. We want to know whether the high number of incumbent wins is due to a real incumbent advantage or just due to chance.
2. The starting point of the reasoning process is a “what if”. What if the observed difference is due to chance (i.e. there is really no difference between the observed data and the neutral position)? So we assume the neutral position is valid. We call this the null hypothesis.
3. We then evaluate the observed data, asking: what would be the likelihood of this happening if the null hypothesis were true? This probability is called a p-value. In this example, the observed data are the 10 incumbent wins in the last 13 presidential elections involving incumbents. The p-value is the probability of seeing 10 or more incumbent wins if incumbents indeed have no real advantage.
4. We estimate the p-value. If the p-value is small, we reject the neutral position (the null hypothesis) rather than believing that the observed difference is due to random chance. In this example, we estimated the p-value by a simulation exercise (it can also be computed by a direct calculation). Because the p-value is so small, we reject the notion that there is no incumbent advantage rather than believe that the high number of incumbent wins is just due to the incumbents being lucky.

Any statistical inference problem that involves testing a hypothesis works the same way as the presidential election example described here. The details used for calculating the p-value may change, but the reasoning process remains the same. We realize that this reasoning process may still not come naturally to some students. One reason may be that our intuition is less reliable when working with p-values in problems that involve normal distributions, binomial distributions, and other probability models. So it is critical that students in an introductory statistics class get a good grounding in working with these probability models.

See this previous post for a more intuitive example of statistical inference.

## A Case of Restaurant Arson and the Reasoning of Statistical Inference

I know of a case of a restaurant owner who was convicted of burning down his restaurant for the purpose of collecting insurance money. It turns out that this case is a good example for introducing the reasoning process of statistical inference from an intuitive point of view.

This restaurant owner was convicted of burning down his restaurant and received a lengthy jail sentence. What was the red flag that alerted the insurance company that something was not quite right in the first place? The same owner’s previous restaurant had also burned to the ground!

Of course, the insurance company could not file charges against the owner simply because of two losses in a row. But the two losses did raise a giant red flag for the insurer, which brought all of its investigative resources to bear on the case. For us independent observers, this case provides an excellent illustration of the intuitive reasoning process behind statistical inference.

The reasoning process behind the suspicion is a natural one for many people, including students in my introductory statistics class. Once we learn that two restaurants in a row burned down, we ask: are the two successive losses just due to bad luck, or are they due to other causes such as fraud? Most people would feel that two fire losses in a row are unlikely if no fraud is involved. It is natural to settle on the possibility of fraud rather than attribute the losses to bad luck.

The reasoning process is mostly unspoken and intuitive. For the sake of furthering the discussion, let me write it out:

1. The starting point of the reasoning process is an unspoken belief that the restaurant owner did not cause the fire (we call this the null hypothesis).
2. We then evaluate the observed data (losing two restaurants in a row to fire). What would be the likelihood of this happening if the null hypothesis were true? Though we assess this likelihood intuitively, this probability is called a p-value, which we feel is small in this case.
3. Since we feel that the p-value is very small, we reject the initial belief (null hypothesis), rather than believing that the rare event of two fire losses in a row was solely due to chance.

The statistical inference problems that we encounter in a statistics course are all based on the same intuitive reasoning process described above. However, unlike the restaurant arson case described here, the reasoning process in many statistical inference problems does not come naturally for students in introductory statistics classes. The reason may be that it is not easy to grasp intuitively the implication of having a small p-value in many statistical inference problems. It is difficult for some students to grasp why a small p-value should lead to the rejection of the null hypothesis.

In our restaurant arson example, we do not have to calculate a p-value; we simply rely on our intuition to know that the p-value, whatever it is, is likely to be very small. In the statistical inference problems that we do in a statistics class, we need a probability model to calculate a p-value. Beginning inference problems in an introductory class are usually based on normal distributions or, in some cases, the binomial distribution. With the normal or binomial model, our intuition is much less reliable in computing and interpreting the p-value (the probability of obtaining the observed data given that the null hypothesis is true). This is a challenge for students to overcome. To address it, it is critical that students have a good grounding in the normal and binomial distributions. Simulation projects can also help students gain an intuitive sense of the p-value.

In any case, just know that in terms of reasoning, any inference problem works the same way as the intuitive example described here. It starts off with a question or a claim. The solution starts with a null hypothesis, which is a neutral proposition (e.g. the restaurant owner did not do it). We then evaluate the observed data to calculate the p-value. The p-value forms the basis for judging the strength of the observed data: the smaller the p-value, the less credible the null hypothesis and the stronger the case for rejecting it.

Repeated large losses for the same claimant raise suspicion for property insurance companies as well as life insurance companies. The concept of p-value is important for insurance companies. The example of repeated large insurance losses described here is an excellent example for illustrating the intuitive reasoning behind statistical inference.

## Rolling Dice to Buy Gas

The price of gas you pay at the pump is influenced by many factors. One such factor is the price of crude oil in the international petroleum market, which can be highly dependent on global macroeconomic conditions. Why don’t we let probability determine the price of gas? We discuss here an experiment that generates gas prices using random chance. The goal of the experiment is to shed some light on probability concepts such as the central limit theorem, the law of large numbers, and the sampling distribution of the sample mean. These are difficult concepts for students in an introductory statistics class, and we hope the examples shown here will help.

We came across the idea for this example in [1], which devotes only one page to the intriguing idea of random gas prices (on page 722). We took the idea and added our own simulations and observations to make the example more accessible.

The Experiment

There is a gas station called the 1-Die Gas Station. The price per gallon you pay at this gas station is determined by the roll of a die: whatever number comes up, that is the price per gallon you pay. You may end up paying $1 per gallon if you are lucky, or $6 if you are not. But if you buy gas repeatedly at this gas station, you pay $3.50 per gallon on average. Down the street is another gas station called the 2-Dice Gas Station. The price you pay there is determined by the average of two dice. For example, if the roll of two dice results in 3 and 5, the price per gallon is $4.

The prices at the 3-Dice Gas Station are determined by taking the average of a roll of three dice. In general, the n-Dice Gas Station works similarly.

The possibility of paying $1 per gallon is certainly attractive to customers. On the other hand, paying $6 per gallon is not so desirable. How likely are customers to buy gas at these extreme prices? How likely are they to pay in the middle price range, say between $3 and $4? In other words, what is the probability distribution of the gas prices at these n-Dice Gas Stations? Another question: is there any difference between the gas prices at the 1-Die Gas Station, the 2-Dice Gas Station, and the other higher-dice gas stations? As the number of dice increases, what happens to the probability distribution of the gas prices?

One way to look into the above questions is to generate gas prices by rolling dice. After recording the rolls, we can use graphs and numerical summaries to look for patterns. Instead of actually rolling dice, we simulate the rolls in an Excel spreadsheet. We simulate 10,000 gas purchases at each of the following gas stations:

$\displaystyle (0) \ \ \ \ \begin{bmatrix} \text{Simulation}&\text{ }&\text{Gas Station}&\text{ }&\text{Number of Gas Prices} \\\text{ }&\text{ }&\text{ } \\\text{1}&\text{ }&\text{1-Die}&\text{ }&\text{10,000} \\\text{2}&\text{ }&\text{2-Dice}&\text{ }&\text{10,000} \\\text{3}&\text{ }&\text{3-Dice}&\text{ }&\text{10,000} \\\text{4}&\text{ }&\text{4-Dice}&\text{ }&\text{10,000} \\\text{5}&\text{ }&\text{10-Dice}&\text{ }&\text{10,000} \\\text{6}&\text{ }&\text{30-Dice}&\text{ }&\text{10,000} \\\text{7}&\text{ }&\text{50-Dice}&\text{ }&\text{10,000} \end{bmatrix}$

How Gas Prices are Simulated

We use the $RAND()$ function in Excel to simulate rolls of dice. Here’s how a roll of a die is simulated. The $RAND()$ function generates a random number $x$ between $0$ and $1$. If $x$ is between $0$ and $\frac{1}{6}$, it is considered a roll of $1$. If $x$ is between $\frac{1}{6}$ and $\frac{2}{6}$, it is considered a roll of $2$, and so on. The following rule describes how a random number is assigned a value of the die:

$\displaystyle (1) \ \ \ \ \begin{bmatrix} \text{Random Number}&\text{ }&\text{Value of Die} \\\text{ }&\text{ }&\text{ } \\ 0 \le x <\frac{1}{6}&\text{ }&\text{1} \\\text{ }&\text{ }&\text{ } \\ \frac{1}{6} \le x < \frac{2}{6}&\text{ }&\text{2} \\\text{ }&\text{ }&\text{ } \\ \frac{2}{6} \le x < \frac{3}{6}&\text{ }&\text{3} \\\text{ }&\text{ }&\text{ } \\ \frac{3}{6} \le x < \frac{4}{6}&\text{ }&\text{4} \\\text{ }&\text{ }&\text{ } \\ \frac{4}{6} \le x < \frac{5}{6}&\text{ }&\text{5} \\\text{ }&\text{ }&\text{ } \\ \frac{5}{6} \le x < 1&\text{ }&\text{6} \end{bmatrix}$
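The interval rule in table (1) is equivalent to scaling the uniform random number by 6 and rounding down. A Python sketch of the same mapping (with Excel’s random number generator replaced by `random.random()`; the seed is arbitrary):

```python
import random
from collections import Counter

def roll_die():
    """Map a uniform random number in [0, 1) to a die value, per table (1)."""
    x = random.random()
    return int(x * 6) + 1  # the six intervals of width 1/6

random.seed(1)  # arbitrary seed for reproducibility
rolls = [roll_die() for _ in range(60_000)]
tally = Counter(rolls)  # each face should come up roughly 10,000 times
```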

For the 1-Die Gas Station, we simulated 10,000 rolls of a die (as described above). These 10,000 die values are the gas prices for 10,000 purchases. For the 2-Dice Gas Station, the simulation consists of 10,000 simulated rolls of a pair of dice; the 10,000 gas prices are obtained by taking the average of each pair of dice values. For the 3-Dice Gas Station, the simulation consists of 10,000 iterations, where each iteration is one simulated roll of three dice (i.e. 10,000 random samples of dice values, each of size 3); the 10,000 gas prices are then obtained by taking the average of the three dice values in each iteration (i.e. the mean of each sample). The other n-Dice Gas Stations are simulated in a similar fashion.
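The spreadsheet procedure just described can be sketched in Python as follows (the seed and sample sizes are illustrative, not from the original spreadsheet):

```python
import random
import statistics

random.seed(2)  # arbitrary seed for reproducibility

def simulate_prices(n_dice, iterations=10_000):
    """Gas prices at the n-Dice Gas Station: each price is the mean of n fair dice."""
    return [statistics.mean([random.randint(1, 6) for _ in range(n_dice)])
            for _ in range(iterations)]

one_die = simulate_prices(1)
thirty_dice = simulate_prices(30)
# both sets of prices average near $3.50, but the 30-dice prices
# cluster much more tightly around that value
```

Plotting histograms of `one_die` and `thirty_dice` reproduces the flat and bell-shaped pictures discussed below.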

By going through the process described above, 10,000 gas prices are simulated for each gas station indicated in $(0)$. The following shows the first 10 simulated gas prices for the first three gas stations listed in $(0)$.

$\displaystyle (3a) \text{ First 10 Simulated Gas Prices} \ \ \ \ \begin{bmatrix} \text{Iteration}&\text{ }&\text{Value of Die}&\text{ }&\text{1-Die Gas Price} \\\text{ }&\text{ }&\text{ } \\ 1&\text{ }&\text{2}&\text{ }&\$2 \\ 2&\text{ }&\text{3}&\text{ }&\$3 \\ 3&\text{ }&\text{2}&\text{ }&\$2 \\ 4&\text{ }&\text{1}&\text{ }&\$1 \\ 5&\text{ }&\text{3}&\text{ }&\$3 \\ 6&\text{ }&\text{3}&\text{ }&\$3 \\ 7&\text{ }&\text{2}&\text{ }&\$2 \\ 8&\text{ }&\text{5}&\text{ }&\$5 \\ 9&\text{ }&\text{5}&\text{ }&\$5 \\ 10&\text{ }&\text{2}&\text{ }&\$2 \end{bmatrix}$

$\displaystyle (3b) \text{ First 10 Simulated Gas Prices} \ \ \ \ \begin{bmatrix} \text{Iteration}&\text{ }&\text{Values of Dice}&\text{ }&\text{2-Dice Gas Price} \\\text{ }&\text{ }&\text{ } \\ 1&\text{ }&\text{4, 6}&\text{ }&\$5.00 \\ 2&\text{ }&\text{2, 5}&\text{ }&\$3.50 \\ 3&\text{ }&\text{3, 4}&\text{ }&\$3.50 \\ 4&\text{ }&\text{2, 2}&\text{ }&\$2.00 \\ 5&\text{ }&\text{5, 4}&\text{ }&\$4.50 \\ 6&\text{ }&\text{2, 6}&\text{ }&\$4.00 \\ 7&\text{ }&\text{5, 2}&\text{ }&\$3.50 \\ 8&\text{ }&\text{6, 1}&\text{ }&\$3.50 \\ 9&\text{ }&\text{5, 5}&\text{ }&\$5.00 \\ 10&\text{ }&\text{5, 2}&\text{ }&\$3.50 \end{bmatrix}$

$\displaystyle (3c) \text{ First 10 Simulated Gas Prices} \ \ \ \ \begin{bmatrix} \text{Iteration}&\text{ }&\text{Values of Dice}&\text{ }&\text{3-Dice Gas Price} \\\text{ }&\text{ }&\text{ } \\ 1&\text{ }&\text{1, 6, 6}&\text{ }&\$4.33 \\ 2&\text{ }&\text{5, 5, 6}&\text{ }&\$5.33 \\ 3&\text{ }&\text{1, 4, 6}&\text{ }&\$3.67 \\ 4&\text{ }&\text{3, 2, 1}&\text{ }&\$2.00 \\ 5&\text{ }&\text{3, 3, 2}&\text{ }&\$2.67 \\ 6&\text{ }&\text{3, 6, 6}&\text{ }&\$5.00 \\ 7&\text{ }&\text{1, 6, 2}&\text{ }&\$3.00 \\ 8&\text{ }&\text{1, 1, 6}&\text{ }&\$2.67 \\ 9&\text{ }&\text{4, 1, 5}&\text{ }&\$3.33 \\ 10&\text{ }&\text{1, 4, 5}&\text{ }&\$3.33 \end{bmatrix}$

Looking at the Simulated Gas Prices Graphically
We summarized the 10,000 gas prices in each gas station into a frequency distribution and a histogram. Watch the progression of the histograms from 1-Die to 2-Dice and all the way to 50-Dice.

It is a remarkable fact that as the number of dice increases, the distribution of the gas prices changes shape, becoming more and more bell-shaped. The gas prices at the 1-Die Gas Station are uniform (the bars in the histogram have essentially the same height). The gas prices at the 2-Dice Gas Station are no longer uniform. The 3-Dice and 4-Dice gas prices are approaching a bell shape, and the shapes of the 30-Dice and 50-Dice gas prices are undeniably bell-shaped.

Each gas price is the average of a sample of dice values. For example, each 1-Die gas price is the mean of a sample of one die value, each 2-Dice gas price is the mean of a sample of two dice values, each 3-Dice gas price is the mean of a sample of three dice values, and so on. What we are witnessing in the above series of histograms is that as the sample size increases, the shape of the distribution of the sample mean becomes more and more normal. The series of histograms (from Figure 4b to Figure 10b) is a demonstration of the central limit theorem in action.

Looking at the Simulated Gas Prices Using Numerical Summaries

Note that all the above histograms center at about $3.50, and that the spread gets smaller as the number of dice increases. To get a sense that the spread is shrinking, note in the histograms (as well as the frequency distributions) that the min and max gas prices for 1-Die, 2-Dice, 3-Dice and 4-Dice are $1 and $6, respectively. However, the price range for the higher-dice gas stations is smaller. For example, the price ranges are $1.70 to $5.30 (10-Dice), $2.33 to $4.60 (30-Dice) and $2.60 to $4.64 (50-Dice). So in these higher-dice stations, very cheap gas (e.g. $1) and very expensive gas (e.g. $6) essentially never occur. To confirm what we see, look at the following table of numerical summaries, calculated from the 10,000 simulated gas prices for each gas station.

$\displaystyle (11) \text{ Numerical Summaries} \ \ \ \ \begin{bmatrix} \text{Gas Station}&\text{ }&\text{Mean}&\text{ }&\text{St Dev}&\text{ }&\text{Min Price}&\text{ }&\text{Max Price} \\\text{ }&\text{ }&\text{ } \\ \text{1-Die}&\text{ } &\$3.4857&\text{ }&\$1.6968&\text{ } &\$1.00&\text{ } &\$6.00 \\ \text{2-Dice}&\text{ } &\$3.4939&\text{ }&\$1.2131&\text{ } &\$1.00&\text{ } &\$6.00 \\ \text{3-Dice}&\text{ } &\$3.5011&\text{ }&\$0.9862&\text{ } &\$1.00&\text{ } &\$6.00 \\ \text{4-Dice}&\text{ } &\$3.5220&\text{ }&\$0.8555&\text{ } &\$1.00&\text{ } &\$6.00 \\ \text{10-Dice}&\text{ } &\$3.4930&\text{ }&\$0.5468&\text{ } &\$1.70&\text{ } &\$5.30 \\ \text{30-Dice}&\text{ } &\$3.5023&\text{ }&\$0.3108&\text{ } &\$2.33&\text{ } &\$4.60 \\ \text{50-Dice}&\text{ } &\$3.5009&\text{ }&\$0.2421&\text{ } &\$2.60&\text{ } &\$4.64 \end{bmatrix}$

Note that the mean gas price is about $3.50 in all the gas stations, while the standard deviation gets smaller as the number of dice increases. Each standard deviation in the above table is the standard deviation of gas prices, and as noted before, each gas price is the mean of a sample of simulated dice values.
What we are seeing is that the standard deviation of sample means gets smaller as the sample size increases.

The theoretical mean and standard deviation of the 1-Die gas prices are $3.50 and $\frac{\sqrt{105}}{6}=1.707825$, respectively. As the number of dice increases, the mean of the gas prices (sample means) should remain close to $3.50 (as seen in the above table), but the standard deviation of the sample averages gets smaller. The standard deviation of the gas prices shrinks according to the following formula:

$\displaystyle (12) \ \ \ \ \text{Standard Deviation of n-Dice Gas Prices}=\frac{1.707825}{\sqrt{n}}$
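Formula $(12)$ is easy to evaluate directly. Here is a short Python sketch; the helper name `sd_n_dice` is our own.

```python
import math

SIGMA_ONE_DIE = math.sqrt(105) / 6  # sd of a single fair die, about 1.707825

def sd_n_dice(n):
    """Standard deviation of the n-Dice gas prices per formula (12)."""
    return SIGMA_ONE_DIE / math.sqrt(n)

for n in (1, 2, 3, 4, 10, 30, 50):
    print(f"{n}-Dice: {sd_n_dice(n):.4f}")
```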

According to the above formula, the standard deviation of gas prices shrinks toward zero as the number of dice increases. This means that the price you pay at these gas stations will be close to $3.50 as long as the gas station uses a large number of dice to determine the price. The following table compares the standard deviations of the simulated prices with the standard deviations computed from formula $(12)$. Note the agreement between the observed and theoretical standard deviations.

$\displaystyle (13) \text{ Comparing St Dev} \ \ \ \ \begin{bmatrix} \text{Gas Station}&\text{ }&\text{St Dev}&\text{ }&\text{St Dev} \\\text{ }&\text{ }&\text{(Simulated)}&\text{ }&\text{(Theoretical)} \\\text{ }&\text{ }&\text{ } \\ \text{1-Die}&\text{ } &\$1.6968&\text{ } &\$1.7078 \\ \text{2-Dice}&\text{ } &\$1.2131&\text{ } &\$1.2076 \\ \text{3-Dice}&\text{ } &\$0.9862&\text{ } &\$0.9860 \\ \text{4-Dice}&\text{ } &\$0.8555&\text{ } &\$0.8539 \\ \text{10-Dice}&\text{ } &\$0.5468&\text{ } &\$0.5401 \\ \text{30-Dice}&\text{ } &\$0.3108&\text{ } &\$0.3118 \\ \text{50-Dice}&\text{ } &\$0.2421&\text{ } &\$0.2415 \\\text{ }&\text{ }&\text{ } \\ \text{100-Dice}&\text{ } &\text{ }&\text{ } &\$0.1708 \\ \text{1,000-Dice}&\text{ } &\text{ }&\text{ } &\$0.0540 \\ \text{5,000-Dice}&\text{ } &\text{ }&\text{ } &\$0.0242 \\ \text{100,000-Dice}&\text{ } &\text{ }&\text{ } &\$0.0054 \\ \text{1,000,000-Dice}&\text{ } &\text{ }&\text{ } &\$0.0017 \end{bmatrix}$

The standard deviation of gas prices in the 50-Dice Gas Station is about $0.24 (call it 25 cents for a quick calculation). Since the gas prices have an approximately normal distribution, the empirical rule applies: about 99.7% of the prices fall within three standard deviations (75 cents) of $3.50, so in the 50-Dice Gas Station you can expect to pay anywhere from $2.75 to $4.25 per gallon. Likewise, about 95% of the gas prices are between $3.00 and $4.00.

Table $(13)$ also displays the theoretical standard deviations for several gas stations that we did not simulate. For example, the standard deviation of the gas prices in the 1,000-Dice Gas Station is about $0.054 (call it 5 cents). So about 99.7% of the gas prices will be between $3.35 and $3.65. In such a gas station, the customers are essentially paying $3.50 per gallon.

The standard deviation of gas prices in the 5,000-Dice Gas Station is about 2 pennies, and in the 100,000-Dice Gas Station it is about half a penny. In these two gas stations, the price you pay will be within pennies of $3.50 (for all practical purposes, the customer should just pay $3.50). In the 1,000,000-Dice (one million dice) Gas Station, the clerk should just skip the rolling of dice and collect $3.50 per gallon.

Looking at the Long Run Average in the 1-Die Gas Station

Another observation we would like to make is that the average gas price is neither predictable nor stable when we only use a small number of simulated prices. For example, if we only simulate 10 gas prices for the 1-Die Gas Station, will the average of these 10 prices be close to the theoretical $3.50? If we increase the number of simulations to 100, what will the average be? If the number of simulations keeps increasing, what will the mean gas price look like? The answer lies in looking at the partial averages, i.e. the average of the first 10 gas prices, the first 100 gas prices, and so on for the 1-Die Gas Station. We can then watch how the partial average progresses.

$\displaystyle (14) \text{ 1-Die Gas Station: Average of the } \ \ \ \ \begin{bmatrix} \text{First 1 Price}&\text{ }&\$2.00 \\\text{First 2 Prices}&\text{ }&\$2.50 \\\text{First 3 Prices}&\text{ }&\$2.33 \\\text{First 4 Prices}&\text{ }&\$2.00 \\\text{First 5 Prices}&\text{ }&\$2.20 \\\text{First 6 Prices}&\text{ }&\$2.33 \\\text{First 7 Prices}&\text{ }&\$2.29 \\\text{First 8 Prices}&\text{ }&\$2.63 \\\text{First 9 Prices}&\text{ }&\$2.89 \\\text{First 10 Prices}&\text{ }&\$2.80 \\ \text{First 50 Prices}&\text{ }&\$3.16 \\ \text{First 100 Prices}&\text{ }&\$3.32 \\ \text{First 500 Prices}&\text{ }&\$3.436 \\ \text{First 1000 Prices}&\text{ }&\$3.445 \\ \text{First 5000 Prices}&\text{ }&\$3.4292 \\ \text{All 10000 Prices}&\text{ }&\$3.4857 \end{bmatrix}$

The above table illustrates that the sample mean is unpredictable and unstable when the number of gas prices is small. The first gas price is $2.00. Within the first 10 gas purchases at this gas station, the average stays under $3.00; the average of the first 10 prices is $2.80. But as the number of prices increases, the average gets closer and closer to the theoretical mean $3.50. The sample mean $\overline{x}$ is an accurate estimate of the population mean $3.50 only when more and more gas prices are added to the mix (i.e. as the sample size increases). This remarkable fact is called the law of large numbers. The following two figures show how the sample mean $\overline{x}$ of a random sample of gas prices from the 1-Die Gas Station changes as we add more prices to the sample. Figure 15a shows how $\overline{x}$ varies over the first 100 simulated gas prices, and Figure 15b shows the variation of $\overline{x}$ over all 10,000 simulated gas prices. Figure 15a shows that within the first 30 prices or so, the sample mean $\overline{x}$ fluctuates wildly below the horizontal line at $3.50. The sample mean becomes more stable as the sample size increases. Figure 15b shows that over the entire 10,000 simulated gas prices, the sample mean $\overline{x}$ is stable and predictable. Eventually the sample mean gets close to the population mean $\mu=\$3.50$ and settles down at that line.

Figures 15a and 15b show the behavior of $\overline{x}$ for one instance of a simulation of 10,000 gas prices. If we perform another simulation of 10,000 gas prices for the 1-Die Gas Station, both figures will show a different path from left to right. However, the law of large numbers says that whatever path we get will always settle down at $\mu=\$3.50$.
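One such path of partial averages can be traced with a few lines of Python. This is a sketch with an arbitrary fixed seed; a different seed traces a different path that still settles near $3.50.

```python
import random

random.seed(7)  # arbitrary fixed seed so the run is reproducible

prices = [random.randint(1, 6) for _ in range(10_000)]  # 1-Die gas prices

# Partial (running) averages: the mean of the first k prices, k = 1..10,000
running_mean, total = [], 0
for k, price in enumerate(prices, start=1):
    total += price
    running_mean.append(total / k)

# The early averages wander; the final average is close to the theoretical 3.50
print(running_mean[9], running_mean[99], running_mean[-1])
```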

____________________________________________________________________

Discussions

All the observations we make can be generalized outside of the context of n-Dice Gas Stations. There are several important probability concepts that can be drawn from the n-Dice Gas Station example. They are:

1. The sample mean as a random variable.
2. The central limit theorem.
3. The mean and standard deviation of the sample mean.
4. The law of large numbers.

1. The Sample Mean As A Random Variable. Given a sample of data values, the sample mean $\overline{x}$ is simply the arithmetic average of all the data values in the sample. It is important to note that the mean of a sample (of a given size) varies. As soon as the data values in the sample change, the calculation of $\overline{x}$ will produce a different value. Take the gas prices from the 3-Dice Gas Station listed in table $(3c)$ as an example. The first simulation produced the sample $1, \ 6, \ 6$. The mean is $\overline{x}=4.33$. The second sample is $5,\ 5, \ 6$ and the mean is $\overline{x}=5.33$. As the three dice are rolled again, another sample is produced and the sample mean $\overline{x}$ will be a different value. So the sample mean $\overline{x}$ is not a static quantity. Because the samples are determined by random chance, the value of $\overline{x}$ cannot be predicted in advance with certainty. Hence the sample mean $\overline{x}$ is a random variable.

2. The Central Limit Theorem. Once we understand that the sample mean $\overline{x}$ is a random variable (that it varies based on random chance), the next question is: what is the probability distribution of the sample mean $\overline{x}$? With respect to the n-Dice Gas Stations, how are the gas prices distributed?

The example of n-Dice Gas Stations shows that the distribution of the sample mean $\overline{x}$ becomes more and more normal as the sample size increases. This means that we can use the normal distribution to answer probability questions about the sample mean $\overline{x}$ whenever the sample size $n$ is “sufficiently large”.

The 1-Die gas prices (the histogram in Figure 4b) form the underlying distribution (or population) from which the random samples for the higher-dice gas prices are drawn. In this particular case, the underlying distribution has a shape that is symmetric (in fact the histogram is flat). The point we would like to make is that even if the histogram in Figure 4b (the starting histogram) had a shape that was skewed, the subsequent histograms would still become more and more normal. That is, the distribution of the sample mean becomes more and more normal as the sample size increases, regardless of the shape of the underlying population.

Because we start out with a symmetric histogram, it does not take a large number of dice for the sample mean to become approximately normal. Note that the 4-Dice gas prices already produce a histogram that looks sufficiently normal. Thus if the underlying distribution is symmetric (but not bell-shaped), the sample mean of a sample of size 4 or 5 will already have a distribution that is adequately normal (as in the n-Dice Gas Station example discussed here).

However, if the underlying distribution is skewed, it will take a larger increase of the sample size to get a distribution that is close to normal (usually increasing to $n=30$ or greater).

Moreover, if the underlying population is approximately normal, the sample mean of a sample of size 2 or 3 will have a distribution that is very close to normal. If the underlying distribution is exactly normal, the sample mean of any sample size has a normal distribution.

3. The Mean and Standard Deviation of Sample Mean

The above histograms (Figure 4b to Figure 10b) all center at $3.50. However, the spread of the histograms gets smaller as the number of dice increases (also see Table $(13)$). Thus the gas prices hover around $3.50 no matter which n-Dice Gas Station you are in. The standard deviation of the gas prices shrinks according to Formula $(12)$.

The theoretical mean of the population of 1-Die gas prices is $\mu=\$3.50$ and the theoretical standard deviation of this population is $\sigma=\$1.707825$. The standard deviation of the gas prices at the n-Dice Gas Station is smaller than the population standard deviation $\sigma$, according to Formula $(12)$ above.

In general, the sampling distribution of the sample mean $\overline{x}$ is centered at the population mean $\mu$ and is less spread out than the distribution of the individual population values. If $\mu$ is the mean of the individual population values and $\sigma$ is the standard deviation of the individual population values, then the mean and standard deviation of the sample mean $\overline{x}$ (of size $n$) are:

$\displaystyle \begin{aligned}(16) \ \ \ \ \ \ \ \ \ \ \ &\text{Mean of }\overline{x}=\mu \\&\text{ } \\&\text{Standard Deviation of }\overline{x}=\frac{\sigma}{\sqrt{n}} \end{aligned}$
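For a small case, both facts in $(16)$ can be verified exactly, without simulation, by enumerating all equally likely outcomes. The following Python sketch does this for samples of two dice.

```python
import itertools
import statistics

die = range(1, 7)
mu = statistics.mean(die)       # 3.5
sigma = statistics.pstdev(die)  # sqrt(35/12), about 1.7078

# All 6^2 = 36 equally likely two-dice outcomes, and the mean of each
two_dice_means = [statistics.mean(pair) for pair in itertools.product(die, repeat=2)]

# The mean of the sample means is mu; their sd is sigma/sqrt(2), per (16)
assert abs(statistics.mean(two_dice_means) - mu) < 1e-12
assert abs(statistics.pstdev(two_dice_means) - sigma / 2**0.5) < 1e-9
```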

4. The Law of Large Numbers. If you buy gas from the 1-Die Gas Station only once, the price you pay may be $1 if you are lucky. However, in the long run, the overall average price per gallon will settle near $3.50. This is the essence of the law of large numbers. In the long run, the owner can expect the average price over all purchases to be $3.50. The long-run business results of the 1-Die Gas Station are stable and predictable (as long as customers keep coming back to buy gas, of course). In fact, the law of large numbers is the business model of the casino. When a gambler makes bets, the short-run results are unpredictable. The gambler may get lucky and win a few bets. However, the long-run average (over thousands or tens of thousands of bets, for example) will be very stable and predictable for the casino. For example, the game of roulette gives the casino an edge of 5.26%. Over a large number of bets at the roulette table, the casino can expect to make 5.26 cents in profit for each $1 wagered.

____________________________________________________________________

Summary

The following are the statements of the probability concepts discussed in this post.

• The Law of Large Numbers. Suppose that you draw observations from a population with finite mean $\mu$. As the number of observations increases, the mean $\overline{x}$ of the observed values becomes closer and closer to the population mean $\mu$.

• The Mean and Standard Deviation of Sample Mean. Suppose that $\overline{x}$ is the mean of a simple random sample of size $n$ drawn from a population with mean $\mu$ and standard deviation $\sigma$. Then the sampling distribution of $\overline{x}$ has mean $\mu$ and standard deviation $\displaystyle \frac{\sigma}{\sqrt{n}}$.

• The Central Limit Theorem. Suppose that you draw a simple random sample of size $n$ from any population with mean $\mu$ and finite standard deviation $\sigma$. When the sample size $n$ is large, the sampling distribution of the sample mean $\overline{x}$ is approximately normal with mean $\mu$ and standard deviation $\displaystyle \frac{\sigma}{\sqrt{n}}$.

To read more about the probability concepts discussed here, you can consult your favorite statistics texts or see [2] or [3].

Reference

1. Burger E. B., Starbird M., The Heart of Mathematics, An invitation to effective thinking, 3rd ed., John Wiley & Sons, Inc, 2010
2. Moore D. S., Essential Statistics, W. H. Freeman and Company, New York, 2010
3. Moore D. S., McCabe G. P., Craig B. A., Introduction to the Practice of Statistics, 6th ed., W. H. Freeman and Company, New York, 2009

## The Middle 80% Is Not 80th Percentile

A student sent me an email about a practice problem involving finding the middle 80% of a normal distribution. The student was confusing the middle 80% of a bell curve with the 80th percentile. The student then tried to answer the practice problem with the 80th percentile, which of course, did not match the answer key. The student sent me an email asking for an explanation. The question in the email is a teachable moment and deserves a small blog post.

The following figures show the difference between the middle 80% under a bell curve and the 80th percentile of a bell curve.

The middle 80% under a bell curve (Figure 1) is the middle section of the bell curve that excludes 10% of the area on the left and 10% of the area on the right. The 80th percentile (Figure 2) is the point below which lies a left tail containing 80% of the area, excluding the 20% of the area on the right.

Finding the 80th percentile (or any other percentile) is easy: use software or a standard normal table. Looking up the area 0.8000 in a standard normal table, there is no exact match; the closest area is 0.7995, which corresponds to the z-score $z=0.84$. Once this is found, the z-score can be converted to the measurement scale relevant to the practice problem at hand.
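If software is available, the table lookup can be replaced by an inverse-CDF call; Python's standard library provides one. The measurement scale below (mean 100, standard deviation 15) is a hypothetical example, not from the problem in the email.

```python
from statistics import NormalDist

z80 = NormalDist().inv_cdf(0.80)   # z-score of the 80th percentile, about 0.8416
print(round(z80, 2))               # 0.84, matching the table lookup

# Converting to a hypothetical measurement scale with mean 100 and sd 15
mean, sd = 100, 15
x = mean + z80 * sd                # about 112.62
```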

On the other hand, to find the middle 80%, you need to find the 90th percentile, because the standard normal table only provides the areas of left tails. The middle area of 80% plus the 10% on the left makes a left tail of size 90% (or 0.9000). Figure 3 below makes this clear.

To find the 90th percentile, look up the area 0.9000 in the standard normal table. There is no exact match and the closest area to 0.9000 is 0.8997, which has a z-score of $z=1.28$. Thus the middle 80% of a normal distribution is between $z=-1.28$ and $z=1.28$. Now just convert these to the data in the measurement scale that is relevant in the given practice problem at hand.

Finding the middle x% is an important skill in an introductory statistics class. The z-score for the middle x% is called a critical value (or z-critical value). The most common critical values are those for the middle 90%, middle 95% and middle 99%.

$\displaystyle \text{Some common z-critical values} \ \ \ \ \begin{bmatrix} \text{Middle x}\%&\text{ }&\text{Critical Value} \\\text{Or Confidence Level }&\text{ }&\text{ } \\\text{ }&\text{ }&\text{ } \\ 90\%&\text{ }&\text{z=1.645} \\ 95\%&\text{ }&\text{z=1.960} \\ 99\%&\text{ }&\text{z=2.575} \end{bmatrix}$

How about critical values not found in the above table? It is good practice to find the z-scores for the middle 85%, 92%, 93%, 94%, 96%, 97% and 98%.
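These practice values can be generated in Python with the standard library's NormalDist; the helper name `z_critical` is our own.

```python
from statistics import NormalDist

def z_critical(middle_pct):
    """z such that the middle middle_pct% of the standard normal lies in [-z, z]."""
    tail = (1 - middle_pct / 100) / 2  # area in each of the two excluded tails
    return NormalDist().inv_cdf(1 - tail)

for pct in (85, 92, 93, 94, 96, 97, 98):
    print(f"middle {pct}%: z = {z_critical(pct):.3f}")
```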

## Coin Tossing and Family Size

If you toss a coin 9 times, what is the probability of obtaining 8 heads and 1 tail? I just tossed one penny 9 times and the following is the resulting sequence of heads and tails.

$. \ \ \ \ \ \ \ \ T \ T \ H \ T \ H \ H \ H \ T \ H$

The above little experiment produced 5 heads and 4 tails, matching what many people think should happen when you toss a coin repeatedly, i.e., roughly half of the tosses are heads and half of the tosses are tails. Does this mean that it is impossible to get 8 heads in 9 tosses? It turns out that it is possible, just that such scenarios do not happen very often. On average, you will have to do the 9-toss experiments many times before you will see a result such as $H \ H \ H \ T \ H \ H \ H \ H \ H$. The coin tossing example is a great way to introduce the binomial distribution.

One concrete example of 8 heads in 9 tosses is the family of Olive and George Osmond. They are the parents of nine children, 8 boys and 1 girl, seven of whom formed a popular and successful family singing group called The Osmond Brothers. Two of its members, Donny and Marie Osmond, also had successful solo musical careers; both were teen idols in the 1970s. The first picture below shows Donny and Marie Osmond in their heyday. The second is a photo of seven of the Osmond siblings who are in show business.

Donny and Marie Osmond in their heyday

Seven of the Osmond siblings who are in show business

Assuming that a boy is equally likely as a girl in a pregnancy, the sex of a child is like a coin toss (from a probability point of view). So the Osmond family shows that tossing a coin 9 times can result in 8 heads. But how often does this happen? Of all the families with 9 children, how many of them have 8 boys and 1 girl?

We will show that the probability of obtaining 8 heads in 9 tosses is 0.0176. This means that if the 9-toss experiment is repeated 10,000 times, only about 176 of the repetitions produce 8 heads and 1 tail. Looking at this from a family perspective, out of 10,000 families with 9 children, only about 176 have 8 boys and 1 girl (1.76%). So families such as the Osmond family are pretty rare.

The Problem

The problem we want to work on is this:

• In tossing a fair coin 9 times, what is the probability that there are $k$ heads? Here, $k$ can be any whole number from $0$ to $9$.

For convenience, we use $X$ to denote the number of heads that appear as a result of tossing a coin 9 times. We are interested in knowing the probability that $X=k$ ($k$ can be any whole number from $0$ to $9$). We use the notation $P(X=k)$ to denote this probability. In the subsequent discussion, we derive $P(X=k)$ for each value of $k$.

There are two important things to note about this problem. One is that there are $2^9=512$ possible outcomes in tossing a coin 9 times. To see this, note that there are 2 outcomes in tossing a coin 1 time (H or T) and 4 outcomes in tossing a coin 2 times (HH, HT, TH, and TT). In general, the number of outcomes in a coin-tossing experiment is 2 raised to the number of tosses. For convenience, we denote each outcome by the string of Hs and Ts in the order the heads (Hs) and tails (Ts) appear. The following are four examples of such strings:

$. \ \ \ \ \ \ \ \ T \ T \ T \ T \ T \ T \ T \ T \ T$

$. \ \ \ \ \ \ \ \ T \ T \ H \ T \ T \ T \ T \ T \ T$

$. \ \ \ \ \ \ \ \ H \ T \ H \ T \ T \ T \ T \ T \ T$

$. \ \ \ \ \ \ \ \ T \ T \ H \ T \ H \ H \ T \ T \ T$

The second important thing is that each of the 512 strings has probability $\frac{1}{512}$ since we are using a fair coin (in any toss, the probability of a head is $\frac{1}{2}$). So finding $P(X=k)$ amounts to counting how many of the 512 strings have $k$ Hs and $9-k$ Ts. In other words, the problem at hand is a counting problem (or a combinatorial problem).

For example, there are nine strings consisting of 8 Hs and 1 T (the one T can be in any one of the nine positions). So we have:

$\displaystyle . \ \ \ \ \ \ \ \ P(X=\text{8})=9 \times \frac{1}{512}=\frac{9}{512}=0.0176$

The following is how we find the probability $P(X=k)$:

$\displaystyle . \ \ \ \ \ \ \ \ P(X=k) = \text{(the number of strings with k Hs)} \times \frac{1}{512}$

To find $P(X=0)$, we need to find the number of strings with zero H. There is only one (all nine positions are T). So we have:

$\displaystyle . \ \ \ \ \ \ \ \ P(X=0)=1 \times \frac{1}{512}=\frac{1}{512}=0.001953$

To find $P(X=1)$, we need to count the number of strings with exactly 1 H. There are 9 such strings since the one H can be in any one of the nine positions. So we have:

$\displaystyle . \ \ \ \ \ \ \ \ P(X=1)=9 \times \frac{1}{512}=\frac{9}{512}=0.0176$

To find $P(X=2)$, we need to count the number of strings with exactly 2 Hs in the nine positions. Here is where we need a formula to help us do the counting.

The Binomial Coefficient

We need a combinatorial formula to count the number of Hs in a string of 9 letters of H and T. How many of the 512 strings have 2 Hs and 7 Ts? There are 36 (HHTTTTTTT and HTHTTTTTT are two such strings). The calculation is:

$\displaystyle (1) \ \ \ \ \ \ \ \ \frac{9!}{2! \times (9-2)!}=\frac{9!}{2! \times 7!}=\frac{9 \times 8}{2}=36$

The above calculation uses the factorial notation:

$. \ \ \ \ \ \ \ \ n!= n \times (n-1) \times (n-2) \times \cdots \times 3 \times 2 \times 1$

In addition, we define $0!=1$. More about the formula $(1)$ later. For now we can calculate $P(X=2)$:

$\displaystyle . \ \ \ \ \ \ \ \ P(X=2)=36 \times \frac{1}{512}=\frac{36}{512}=0.070313$
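The counts above can be checked in Python; `math.comb` computes the binomial coefficient directly, without writing out the factorials.

```python
import math

# Number of strings with 2 Hs among 9 positions, per calculation (1)
count = math.factorial(9) // (math.factorial(2) * math.factorial(7))
assert count == 36 == math.comb(9, 2)  # math.comb does the same in one call

print(36 / 512)  # 0.0703125, the value of P(X=2)
```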

Suppose we have $n$ positions and each position is H or T. There are $2^n$ strings consisting of Hs and Ts. The general formula, called the binomial coefficient, counts the number of strings with $r$ Hs and $n-r$ Ts.

$\displaystyle (2) \ \ \ \ \ \ \ \ _nC_r = \frac{n!}{r! \times (n-r)!} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \text{Binomial Coefficient}$

Plugging into the formula, we have $_9C_3=84$, and $_9C_4=126$. So of the 512 strings consisting of Hs and Ts, 84 of them have 3 Hs and 6 Ts, and 126 of them have 4 Hs and 5 Ts. We have the following probabilities:

$\displaystyle \begin{aligned}(3) \ \ \ \ \ \ \ \ \text{binomial probabilities} \ \ \ &P(X=k)=_9C_k \times \frac{1}{512} \\&\text{ } \\&P(X=0)=\frac{1}{512}=0.001953 \\&\text{ } \\&P(X=1)=\frac{9}{512}=0.017578 \\&\text{ } \\&P(X=2)=\frac{36}{512}=0.070313 \\&\text{ } \\&P(X=3)=\frac{84}{512}=0.164063 \\&\text{ } \\&P(X=4)=\frac{126}{512}=0.246094 \\&\text{ } \\&P(X=5)=\frac{126}{512}=0.246094 \\&\text{ } \\&P(X=6)=\frac{84}{512}=0.164063 \\&\text{ } \\&P(X=7)=\frac{36}{512}=0.070313 \\&\text{ } \\&P(X=8)=\frac{9}{512}=0.017578 \\&\text{ } \\&P(X=9)=\frac{1}{512}=0.001953 \end{aligned}$
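The whole table in $(3)$ can be generated in one line with `math.comb` (a sketch, not how the original table was produced):

```python
import math

# P(X = k) for a fair coin tossed 9 times: C(9, k) favorable strings out of 2^9 = 512
probs = {k: math.comb(9, k) / 2**9 for k in range(10)}

print(probs[8])   # 0.017578125, the 8-heads case from the Osmond example
```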

The Binomial Experiment

The example of tossing a coin 9 times (or having 9 children) is called a binomial experiment. There are four points worth noting. The coin is tossed a fixed number of times. The result of one toss has no effect on the subsequent tosses. Each toss has only two outcomes (H or T). The probability of a success (say H) is 0.5, which is the same across the nine tosses. These four points are critical in working with the binomial distribution. Here are the four conditions that define a binomial experiment:

• There are a fixed number of trials or observations (say, $n$).
• The $n$ observations are independent, meaning that each observation has no effect on the other observations.
• Each observation has two distinct outcomes, which for convenience are called successes and failures.
• The probability of a success, denoted by $p$, is the same across all observations.

These four conditions are important because only when a random experiment or problem setting satisfies all four requirements can we apply the binomial distribution. For example, suppose that an opinion poll calls residential phone numbers at random and that about 25% of the calls reach a live person. A telephone poll worker uses a random dialing machine to make 20 calls and counts the number of calls that are answered by a live person. This is a binomial experiment.

However, suppose that the poll worker keeps making calls until she reaches a live person, and that she records the number of calls it takes to reach a live person. This would not be a binomial experiment since the number of trials is not fixed. In general, whenever one of the four conditions is violated, the random experiment or problem setting can no longer be called a binomial experiment.

The Binomial Distribution
The Binomial Distribution
Note that in the coin-tossing example demonstrated above, the probability of success in each toss is 0.5 or $\frac{1}{2}$, so each of the 512 possible outcomes is equally likely. The binomial probabilities in $(3)$ are calculated based on this assumption. In general, the probability of success in a binomial experiment need not be 0.5. For example, the coin used could be biased, or the ratio of boys to girls may not be exactly 1 to 1. The following formula shows how binomial probabilities are calculated in the general case.

Suppose we have a binomial experiment in which $n$ is the number of observations and $p$ is the probability of success. Let $X$ be the count of successes in these $n$ observations. The possible values of $X$ are $0,1,2,\cdots,n$. If $k$ is any whole number from $0$ to $n$, the probability of $k$ successes is:

$\displaystyle (4) \ \ \ \ \ \ \ \ P(X=k)=_nC_k \ \ p^k \ \ (1-p)^{n-k}$

Let’s discuss the thought process behind the formula $(4)$. To do this, suppose that we have a biased coin such that the probability of getting a head is $p=0.6$. Suppose that we toss this coin $n=9$ times. There are $2^9=512$ many outcomes, just like the above example. However, in this new example, the 512 strings are not equally likely. For example, the string $HHHHHHHHT$ has probability

$\displaystyle . \ \ \ \ \ \ \ \ p^8 \ \ (1-p)^1=(0.6)^8 \ \ (0.4)^1$

So the overall probability of exactly 8 heads and 1 tail is

$\displaystyle . \ \ \ \ \ \ \ \ P(X=\text{8})=_9C_8 \ \ p^8 \ \ (1-p)^1=9 \ \ (0.6)^8 \ \ (0.4)^1=0.060466176$

Similarly, the overall probability of exactly 5 heads and 4 tails is:

$\displaystyle . \ \ \ \ \ \ \ \ P(X=5)=_9C_5 \ \ p^5 \ \ (1-p)^4=126 \ \ (0.6)^5 \ \ (0.4)^4=0.250822656$
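Formula $(4)$ translates directly into code. The following is a sketch with our own function names, including a cumulative helper for probabilities of the form $P(X \le k)$.

```python
import math

def binom_pmf(n, k, p):
    """P(X = k) for a binomial experiment: n observations, success probability p."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def binom_cdf(n, k, p):
    """P(X <= k), the cumulative binomial probability."""
    return sum(binom_pmf(n, j, p) for j in range(k + 1))

# The biased-coin examples above (n = 9, p = 0.6)
print(binom_pmf(9, 8, 0.6))   # about 0.060466176
print(binom_pmf(9, 5, 0.6))   # about 0.250822656
```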

The variable $X$ defined above is said to have the binomial distribution with parameters $n$ and $p$; the binomial coefficient $_nC_k$ is defined in $(2)$. The binomial probability formula $(4)$ can be tedious to calculate except when the number of observations $n$ is small. Still, knowing how to use the binomial formula, especially in conjunction with the example demonstrated in this post, is critical to understanding the thought process behind the binomial distribution. For large $n$, use a graphing calculator or software; binomial probabilities $P(X=k)$ and cumulative probabilities $P(X \le k)$ can be readily obtained this way.

For more information and for practice problems on the binomial distribution, see your favorite statistics textbook or one of the references listed below.

References

1. Moore, D. S., Essential Statistics, W. H. Freeman and Company, New York, 2010.
2. Moore, D. S., McCabe, G. P., Craig, B. A., Introduction to the Practice of Statistics, 6th ed., W. H. Freeman and Company, New York, 2009.

## Comparing Growth Charts

Suppose both an 8-year old boy and a 10-year old boy are 54 inches tall (four feet six inches). Physically they are the same height. But a better way to compare is to find out where each boy stands in the distribution for his age group. One convenient way to do this is to compare the growth charts for these two ages. A clinical growth chart from the Centers for Disease Control and Prevention contains the height percentiles for boys from age 2 to age 20. The following table lists the height percentiles for age 8 and age 10, in both centimeters and inches. A copy of the chart is found here.

| Percentile | Age 8 (cm) | Age 10 (cm) | Age 8 (inches) | Age 10 (inches) |
|---|---|---|---|---|
| 95th | 138 | 150 | 53.82 | 58.5 |
| 90th | 135.5 | 147 | 52.845 | 57.33 |
| 75th | 132 | 143 | 51.48 | 55.77 |
| 50th | 128 | 138.5 | 49.92 | 54.015 |
| 25th | 124 | 134 | 48.36 | 52.26 |
| 10th | 121 | 130.5 | 47.19 | 50.895 |
| 5th | 118.5 | 128 | 46.215 | 49.92 |

With charts as in the above table, a doctor or a parent knows right away that about 90% of all 8-year old boys are between 46 inches (2 inches under four feet) and 54 inches (four feet 6 inches). Any 8-year old boy outside of this range is unusually short or unusually tall. If an 8-year old boy is taller than 54 inches, he is extremely tall (only 5% of this age group belongs to this height range).

If a 10-year old boy is 56 inches tall, his parent should be satisfied that the boy is developing normally. Not only is a height of 56 inches above average for a 10-year old, it is at about the 75th percentile (taller than 75% of the boys in this age group).

Back to the two 54-inch tall boys mentioned at the beginning. According to the above chart, the 8-year old is at the 95th percentile while the 10-year old is at the 50th percentile (the median). The 8-year old is exceedingly tall for his age while the 10-year old is just average. Though they have the same height measurement, relative to their age groups, one is very tall and the other is average.

The 25th percentile, the 50th percentile, and the 75th percentile are called the first quartile, the second quartile, and the third quartile, respectively, because these three percentiles divide the data (or the distribution) into quarters. The first quartile and the third quartile bracket the middle 50% of the data, which indicates the middle range of the heights. The upper percentiles (like the 90th and 95th) indicate the upper end of the growth range. The lower percentiles (like the 10th and 5th) indicate the lower end of the growth range.

The takeaway is that measures of position such as the median, the quartiles, and other upper and lower percentiles are easy to use. These several percentiles are included in the growth charts, and their inclusion is what makes the charts so easy to use. A doctor or nurse takes one look at the growth chart and knows immediately where the child stands in his age group: whether the child is average or above average, and if above average, by how much. Using the percentiles requires no additional calculation.

The alternative to using measures of position is to report the growth in heights with a mean and a standard deviation. Doing so is also a valid approach, but requires additional calculation. For example, the heights of 10-year old boys can be summarized with a mean of 54 inches and a standard deviation of 2.6 inches. Since height measurements are well described by a normal curve, the doctor knows that about 68% of all 10-year old boys are within one standard deviation (2.6 inches) of the mean (54 inches), i.e. between 51.4 inches (54 − 2.6) and 56.6 inches (54 + 2.6).

To find out what counts as a tall height and what counts as a short height, the doctor has to look two standard deviations away from the mean. About 95% of 10-year old boys are between 48.8 inches (54 − 2 × 2.6) and 59.2 inches (54 + 2 × 2.6), so any 10-year old boy outside this range is unusually tall or unusually short. A growth chart made up of a mean and a standard deviation is therefore not as easy to use: you have to do some calculation before knowing how tall the boy is in relation to other boys (the mean by itself does not tell you much). A busy doctor is probably not going to reach for a calculator for this kind of analysis.
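The arithmetic the mean-and-standard-deviation approach requires is exactly what normal-curve software does. A minimal sketch in Python, assuming (as in the text) that heights of 10-year old boys are normal with mean 54 inches and standard deviation 2.6 inches:

```python
from statistics import NormalDist

# Assumed model from the text: mean 54 in, standard deviation 2.6 in
heights = NormalDist(mu=54, sigma=2.6)

# Percentile standing of a 56-inch 10-year old (about the 75th percentile)
print(heights.cdf(56))  # ≈ 0.78

# Proportion within one standard deviation of the mean (the 68% rule)
print(heights.cdf(54 + 2.6) - heights.cdf(54 - 2.6))  # ≈ 0.68
```

The percentile chart hands the doctor these answers precomputed; the mean-and-sd summary makes every reader redo this calculation.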

In contrast, the median and other percentiles are so easy to use. You take one look and immediately know where the child stands in relation to other boys in the same age group.

If the data are skewed (e.g. income data), we obviously want to use the median and quartiles, since measures of position are resistant to extreme data values. On the other hand, if the data are symmetric and have no outliers, we can describe the data using the mean (as center) and the standard deviation (as spread). This kind of discussion appears in many introductory statistics texts, and I make sure to present it to my students. However, the growth chart example here tells us that there is a good reason to use the median, quartiles, and other percentiles even when the data are normal (described by a bell curve, such as the height measurements in this example).

The median and the quartiles are very useful for describing a data set. This is the case even in the situations where we are taught to use mean and standard deviation.

Interestingly, if we decide to use the median and other percentiles to describe the data we work with, the median serves as the center and the other percentiles (such as the quartiles) play the role of a spread. For example, the first quartile and the third quartile tell us how spread out the middle 50% of the data are. If we also include the 10th and 90th percentiles, we know the spread of the middle 80% of the data. If we include the 5th and 95th percentiles, we know the spread of the middle 90% of the data. This is essentially the information provided by the growth chart we examined.
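This percentiles-as-spread idea is easy to compute. The sketch below simulates heights from a hypothetical normal model (mean 54 inches, sd 2.6 inches, echoing the growth-chart discussion, not real CDC data) and reads off the quartiles and the middle-80% range using Python's standard library:

```python
import random
from statistics import quantiles

# Simulated heights under a hypothetical model: mean 54 in, sd 2.6 in
random.seed(0)
heights = [random.gauss(54, 2.6) for _ in range(100_000)]

# Quartiles: the median is the center; Q1 and Q3 bracket the middle 50%
q1, median, q3 = quantiles(heights, n=4)
print(median, q3 - q1)  # center, and spread of the middle 50%

# Deciles: the 10th and 90th percentiles bracket the middle 80%
deciles = quantiles(heights, n=10)
p10, p90 = deciles[0], deciles[-1]
print(p90 - p10)  # spread of the middle 80%
```

Once computed, these few numbers summarize the whole distribution at a glance, which is precisely what the growth chart does.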

## An Example of a Normal Curve

One way to estimate probabilities is to use empirical data. However, if the histogram of the data is shaped like a bell curve (or reasonably close to one), we can use a normal curve to estimate probabilities. All we need to know is the mean and the standard deviation of the data. We use SAT scores as an example.

The version of the SAT described here has three sections – critical reading, mathematics and writing. The scores range from 600 to 2,400. The PDF file at this link lists all 1,547,990 SAT scores from tests taken in 2010. It presents the scores as a frequency distribution (e.g. 382 students with a perfect score of 2400, 219 students with a score of 2390, etc.). The mean of all 1,547,990 SAT scores in 2010 is 1509 and the standard deviation is 312 (displayed at the bottom of the table).

The table also displays the percentile ranking of each score. For example, the score of 1850 is the 85th percentile, meaning that about 85% of the scores are less than 1850. We can compute this from the frequency distribution. There are 1,313,812 scores below 1850. Thus the proportion of the scores less than 1850 is:

$\displaystyle (1) \ \ \ \ \ \ \ \ \frac{1313812}{1547990}=0.849$

What is the percentage of scores that is greater than 2050? According to the table, it should be about 5% since 2050 is 95th percentile. We can count the number of scores greater than 2050 (there are 74,165 such scores). We have:

$\displaystyle (2) \ \ \ \ \ \ \ \ \frac{74165}{1547990}=0.0479 \approx 0.05$
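Both proportions are just counts divided by the total number of scores; as a quick sanity check:

```python
total = 1_547_990            # all SAT scores taken in 2010

# Proportion of scores below 1850 (the table's 85th percentile), as in (1)
below_1850 = 1_313_812
print(below_1850 / total)    # ≈ 0.849

# Proportion of scores above 2050 (about 5%, per the 95th percentile), as in (2)
above_2050 = 74_165
print(above_2050 / total)    # ≈ 0.0479
```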

Even though this table provides a useful resource to interpret SAT scores, the distribution of SAT scores can be described with just two numbers, namely the mean (1509) and the standard deviation (312). Instead of estimating proportion or percentile ranking using the data in the table, we can use a normal curve.

Figure 1 below is a histogram of all 1,547,990 SAT scores taken in 2010. The frequency distribution that is used to draw the histogram is found here.

There are 181 bars in this histogram, corresponding to the 181 distinct possible scores in the data (the scores come in increments of 10, ranging from 600, 610, 620, all the way up to 2390 and 2400). Because so many bars are crammed into a small graph, each bar appears as a thin vertical line. The histogram is symmetric around a single peak and tapers down smoothly on each side. Most of the data are clustered in the middle: the bars around the middle are very tall (most students score in the middle range), while the bars at the far left and far right are very short (very few students score at the top or at the bottom). As a result, the histogram has a “bell” shape. The following (Figure 2) is a representation of the same 1,547,990 SAT scores as a smooth curve, which also has a “bell” shape.

Both Figure 1 and Figure 2 can be approximated by the normal curve shown in Figure 3. Figure 2 is a graph of the actual SAT scores (about 1.5 million scores), while Figure 3 is a mathematical model shaped like a bell curve, centered at 1509 with standard deviation 312. Note that the curve in Figure 2 is a little jagged, while the curve in Figure 3 is perfectly smooth.

The total area under the normal curve in Figure 3 is 1.0, representing 100% of the data (in this case, the SAT scores). Finding the percentile ranking of the score 1850 is to find the area under the curve to the left of 1850 (see Figure 3a).

The green area in Figure 3a is 0.8621, which is about 0.013 away from the actual proportion of 0.849 in $(1)$.

Finding the proportion of SAT scores greater than 2050 is to find the area under the curve in Figure 3 to the right of 2050 (see Figure 3b).

The green area in Figure 3b is 0.0418, which is about 0.006 away from the actual proportion of 0.0479 in $(2)$.

There is no simple formula for the area under a normal curve (the normal density has no elementary antiderivative). To find the green areas in Figure 3a and Figure 3b, either use software that calculates areas under a normal curve or use a table of such areas.
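For instance, Python's standard library can play the role of such software. A sketch using the mean (1509) and standard deviation (312) from the text:

```python
from statistics import NormalDist

# Normal model for the 2010 SAT scores: mean 1509, standard deviation 312
sat = NormalDist(mu=1509, sigma=312)

# Area to the left of 1850 (Figure 3a); the actual data proportion was 0.849
print(sat.cdf(1850))        # ≈ 0.86

# Area to the right of 2050 (Figure 3b); the actual data proportion was 0.0479
print(1 - sat.cdf(2050))    # ≈ 0.041
```

Small differences from the figures quoted above come from rounding in the normal table used there.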

The advantage of using a normal curve to estimate ranking of SAT scores (instead of using the actual data) is that we only need to know two pieces of information, the mean SAT score and the standard deviation of SAT scores. Once we know these two items, we can estimate the probabilities or proportions of SAT scores using software or a table of areas under a normal curve.

We can also use normal curves in many other settings where normal distributions are good descriptions of real data. For example, standardized test scores such as SAT and ACT closely follow normal distributions. Normal distributions arise naturally in many physical, biological, and social measurement situations. Normal distributions are also important in statistical inference. As long as we know the mean and standard deviation of the normal distribution in question, we can estimate probabilities (areas under a normal curve) without using actual measurements. One caveat to keep in mind is that there are many data distributions that are not normal. For example, data on income tend to be skewed to the right. So for such distributions, we need to use normal curves with care.


## A Student’s View of the Normal Distribution

In my teaching, I always strive to encourage students to look at statistics from a practical point of view. In a recent class period covering the normal distribution, I pointed out that data values more than 3 standard deviations away from the mean are rare: the odds of seeing such data points are about 3 in 1,000. After class, a student came up to me and said that she understood the lecture except for what I said about 3 out of 1,000. She did not know what to make of it.

The empirical rule, which can be thought of as a shorthand form of the normal distribution, says that 99.7% of the data are within 3 standard deviations of the mean. That means only 0.3% of the data are more than 3 standard deviations away from the mean.

If an event has only a 0.3% chance of happening, the odds are 0.3 out of 100. Suppose we are talking about people and the data are measurements of height (in inches). Then only 0.3 people out of 100 have heights more than 3 standard deviations from the mean. Since we cannot have 0.3 people, it is better to say that 3 people out of 1,000 are either at least 3 standard deviations taller than the mean or at least 3 standard deviations shorter than the mean.

The ratio can be scaled up further: 30 people out of 10,000, or, adding two more zeros, 3,000 people out of 1,000,000 (one million), are at least 3 standard deviations taller or shorter than the mean.

So out of one million people of the same gender and of similar age (say, young adult males aged 20 to 29), only about 3,000 people or so are either very tall or very short. I would say seeing such people is a rare event. Height measurements (and other biological measurements) from a group of people of the same gender (and of similar age) tend to follow a bell-shaped distribution.

To make it even easier to see, let’s say the heights of young adult males follow a normal distribution with mean = 69 inches and standard deviation = 3 inches. Approximately 3,000 out of one million young adult males are either 9 or more inches taller than 69 inches (over 6 feet 6 inches) or 9 or more inches shorter than 69 inches (less than 5 feet). Since the bell curve is symmetrical, about 1,500 out of one million are taller than 6 feet 6 inches.
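As a quick check with Python's standard library: the empirical rule's 0.3% is a rounding of the exact two-tail probability of about 0.27%, so the exact normal-curve counts come out slightly below the 1,500 and 3,000 quoted above (a sketch under the assumed mean of 69 inches and standard deviation of 3 inches):

```python
from statistics import NormalDist

# Assumed model from the text: mean 69 in, standard deviation 3 in
heights = NormalDist(mu=69, sigma=3)

# Probability of being 9+ inches (3 standard deviations) taller than the mean
p_tall = 1 - heights.cdf(78)
print(p_tall * 1_000_000)    # ≈ 1,350 per million (the 0.3% rule gives ~1,500)

# Both tails together: at least 3 standard deviations taller or shorter
p_either = p_tall + heights.cdf(60)
print(p_either * 1_000_000)  # ≈ 2,700 per million (~3,000 by the 0.3% rule)
```

Either way, the order of magnitude is the same: a few thousand per million, which is what makes such heights rare.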

To look at this visually, the following is a bell curve describing the heights of the young adult males. Note that the bell curve ranges from about 55 inches to 80 inches. But most of the area under the bell curve is from 60 to about 78 inches.

There are about 21.5 million young adult males in the U.S. (a figure looked up on the website of the US Census Bureau). So the estimated number of young adult males taller than 6 feet 6 inches is 32,250 (= 1,500 × 21.5). Statistically speaking, then, virtually all young adult males are shorter than 6 feet 6 inches. If all U.S. young adult males taller than 6 feet 6 inches were to attend the same baseball game in Dodger Stadium, there would still be about 23,000 empty seats!

In my experience, many students have no problem reciting the empirical rule (the three sentences about 1, 2, and 3 standard deviations). Some of them have a hard time applying it, especially using it as a quick gauge of the significance of data. I think understanding it within a practical context makes this easier. I essentially gave the same explanation to my student, and she found it helpful.