## Another Look at LA Rainfall

In two previous posts, we examined the annual rainfall data in Los Angeles (see Looking at LA Rainfall Data and LA Rainfall Time Plot). The data examined in these two posts contain 132 years' worth of annual rainfall measurements collected at the Los Angeles Civic Center from 1877 to 2009 (data found in the Los Angeles Almanac). These annual rainfall data are an excellent opportunity to learn techniques from the body of data analysis methods grouped under the broad topic of descriptive statistics (i.e. using graphs and numerical summaries to answer questions or find meaning in data).

Here are two graphics presented in Looking at LA Rainfall Data.

Figure 1

Figure 2

These charts are called histograms. They have the same shape, but they present slightly different information. Figure 1 shows the frequency of annual rainfall. Figure 2 shows the relative frequency of annual rainfall.

For example, Figure 1 indicates that there were only 3 years (out of the 132 years) with annual rainfall under 5 inches. On the other hand, there were only 2 years with annual rainfall above 35 inches. So drought years did happen, but not very often (only 3 out of 132 years). Extremely wet seasons also happened, but not very often. Based on Figure 1, we see that in most years, annual rainfall ranges from 5 to about 25 inches. The most likely range is 10 to 15 inches (45 years out of the 132 years). In Los Angeles, annual rainfall above 25 inches is rare (it happened in only 12 of the 132 years).

Figure 1 is all about counts. It tells you how many of the data points fall in a certain range (e.g. 45 years with rainfall between 10 and 15 inches). For this reason, it is called a frequency histogram. Figure 2 gives the same information in terms of proportions (or relative frequencies). For example, looking at Figure 2, we see that about 34% of the time, annual rainfall is from 10 to 15 inches. Thus, Figure 2 is called a relative frequency histogram.
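
As an illustration of how the two kinds of histograms are built, here is a minimal sketch, assuming Python with numpy; the `rainfall` list is a short made-up stand-in, since the 132 actual annual totals are not reproduced in this post.

```python
import numpy as np

# Made-up stand-in for the 132 annual rainfall totals (in inches)
rainfall = [11.3, 7.9, 21.0, 14.1, 33.4, 12.7, 4.4, 17.8, 9.6, 26.3]

bins = np.arange(0, 45, 5)                         # 0-5, 5-10, ..., 35-40 inches
counts, edges = np.histogram(rainfall, bins=bins)  # frequencies (as in Figure 1)
rel_freq = counts / len(rainfall)                  # relative frequencies (as in Figure 2)

for lo, hi, c, r in zip(edges[:-1], edges[1:], counts, rel_freq):
    print(f"{lo:>2.0f}-{hi:<2.0f} inches: {c} years ({r:.3f})")
```
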

Keep in mind that raw data are usually not informative until they are summarized. The first step in summarizing should be a graph (if possible). After we have graphs, we can examine the data further using numerical calculations (i.e. various numerical summaries such as the mean, median, standard deviation, and five-number summary). To see how this is done, see the previous post Looking at LA Rainfall Data.

What kind of information can we get from graphics such as Figure 1 and Figure 2 above? For example, we can tell what data points are most likely (e.g. annual rainfall of 10 to 15 inches). What data points are considered rare or unlikely? Where do most of the data points fall?

This last question should be expanded upon. Looking at Figure 2, we see that about 60% of the data are under 15 inches (0.023 + 0.242 + 0.341 = 0.606). So for close to 80 of the 132 years, the annual rainfall was 15 inches or less. About 81% of the data are 20 inches or less. So in the overwhelming majority of the years, annual rainfall was 20 inches or less. Annual rainfall of more than 20 inches is relatively rare (it happened only about 20% of the time).

There is a name for the pattern we see in Figure 1 and Figure 2. The annual rainfall data in Los Angeles have a skewed right distribution. This is because most of the data points are on the left side of the histogram. Another way to see this is that the tallest bar in the histogram is the one at 10 to 15 inches, and the side to the right of the peak is longer than the side to the left of the peak. In other words, when the right tail of the histogram is longer, it is a skewed right distribution. See the figure below.

Figure 3

Besides the look of the histogram, a skewed right distribution has another characteristic: the mean is typically quite a bit larger than the median. For example, the mean of the annual rainfall data is 14.98 inches (essentially 15 inches), yet the median is only 13.1 inches, almost two inches lower. Whenever the mean and the median are far apart, we have a skewed distribution on hand. When the mean is a lot higher, it is a skewed right distribution. When the opposite occurs (the mean is a lot lower than the median), it is a skewed left distribution. When the mean and median are roughly equal, the distribution is likely symmetric.
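
A quick numerical check for skewness is to compare the mean and the median, as in this sketch (again assuming Python with numpy and the same made-up stand-in data):

```python
import numpy as np

rainfall = [11.3, 7.9, 21.0, 14.1, 33.4, 12.7, 4.4, 17.8, 9.6, 26.3]  # stand-in data

mean, median = np.mean(rainfall), np.median(rainfall)
print(f"mean = {mean:.2f}, median = {median:.2f}")
if mean > median:
    print("mean exceeds median: suggests a right-skewed distribution")
elif mean < median:
    print("mean below median: suggests a left-skewed distribution")
```
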

## Is College Worth It?

Is college worth it? This was the question posed by the authors of the report College Majors, Unemployment and Earnings, produced recently by The Center on Education and the Workforce. We do not plan to give a detailed account of the report; any interested reader can read the report here. Instead, we would like to look at two graphics in the report, which are reproduced below. These two graphics are very interesting and capture the main points of the report. The data used in the report came from the American Community Survey for the years 2009 and 2010.

Figure 1

Figure 2

Figure 1 shows the unemployment rates by college major for three groups of college degree holders, namely recent college graduates (shown with green markings), experienced college graduates (blue markings) and college graduates who hold graduate degrees (red markings). Figure 2 shows the median earnings by major for the same three groups of college graduates (using the same colored markings).

Figure 1 ranks the unemployment rates for recent college graduates from highest to lowest. You can see the green markings descending from 13.9% (architecture) to 5.4% (education and health). So this graphic shows clearly that the employment prospects of college graduates depend on their majors, which is one of the main points of the report.

The graphic in Figure 1 also shows that recent college graduates across all majors are having a hard time finding work. The overall unemployment rate for recent college graduates is 8.9% (not shown in Figure 1). The employment picture for recent architecture graduates is especially bleak, reflecting the collapse of the construction and home building industry in the recession. The unemployment rates for recent graduates who majored in education and healthcare are relatively low, reflecting the reality that these fields are either stable or growing.

Everyone is feeling the pinch in this tough economic environment. Even recent graduates in technical fields are experiencing higher than usual unemployment rates. For example, the unemployment rates for recent college graduates in engineering and science, though relatively low compared to architecture, are 7.5% and 7.7%, respectively. For recent graduates in computers and mathematics, the unemployment rate is 8.2%, approaching the average rate of 8.9% for all recent college graduates.

Experienced college graduates fare much better than recent graduates; they are much more likely to be working. Another observation from Figure 1 is that graduate degrees make a huge difference in employment prospects across all majors.

The graphic in Figure 2 suggests that earnings of college graduates also depend on the subjects they study, which is another main point of the report. The technical majors earn the most. For example, the median earning among recent engineering graduates is $55,000, while the median for arts majors is $30,000. Aside from the higher-paying technical, business and healthcare majors, the median earnings of recent college graduates are in the low $30,000s (just look at the green markings in Figure 2). Figure 2 also shows that people with graduate degrees have higher earnings across all majors. The premium in earnings for graduate degree holders is substantial and is found across the board. Though the graduate degree advantage is seen in all majors, it is especially pronounced among the technical fields (just look at the descending red markings in Figure 2).

So two of the main points are (1) the employment prospects of college graduates depend on their majors, and (2) the earning potential of college graduates also depends on the subjects they study. Is college worth it? The report is not trying to persuade college-bound high school seniors not to go to college. On the contrary, the authors of the report answer the question in the affirmative. They are merely providing the facts that all prospective college students should consider before they pick their majors. The two graphics shown above are effective demonstrations of the facts presented in the report. According to the authors, students “should do their homework before picking a major, because, when it comes to employment prospects and compensation, not all college degrees are created equal.”

## Cryptography and Presidential Inaugural Speeches

Given a letter in English, how often does it appear in normal usage of the English language? Some letters appear more often than others. For example, the last letter Z is not common. The vowels are very common because they are needed in making words. The figure below shows the relative frequency of the English letters obtained empirically (see [1]). Dewey, the author of [1], obtained this frequency distribution after examining a total of 438,023 letters. We came across this letter frequency distribution in Example 2.11 on page 24 of [2]. Figure 1 displays the letter frequencies in descending order. A letter frequency distribution such as Figure 1 is important in cryptography. We explore briefly why this is the case and give an indication of why breaking a cipher is often a statistical process. We then check the Dewey letter frequency distribution against the letter frequencies in the presidential inaugural speeches of George Washington (two speeches) and Barack Obama (one speech).

The study of the frequency of letters in text is very important in cryptography. When an algorithm is used to encrypt a message, the original information is called plaintext and the encrypted message is called ciphertext. In a simple encryption scheme called a substitution cipher, each letter of the plaintext is replaced by another letter. To break such a cipher, it is necessary to know the letter frequency of the language being coded. For example, if the letter W is the most frequently appearing letter in the ciphertext, this might suggest that W in the ciphertext corresponds to the letter E in the plaintext, since E is the most frequently occurring English letter (see Figure 1).

Figure 1 shows that the most frequently occurring letter in English is E (about 12.68% of the time). The least used letter is Z.
The top 5 letters (E, T, A, O, I) comprise about 45% of the total usage. The top 8 letters comprise close to 65% of the total usage. The top 12 letters are used about 80% of the time (80.87%). Another interesting result from Dewey's letter frequency is that the vowels comprise about 40% of the total usage. This means that the frequency of consonants is about 60%.

\displaystyle \begin{aligned}(1) \ \ \ \ \ \text{relative frequency of vowels}&=\text{relative frequency of A + relative frequency of E} \\&\ \ \ + \text{relative frequency of I + relative frequency of O} \\&\ \ \ +\text{relative frequency of U + relative frequency of Y} \\&=0.0788+0.1268+0.0707+0.0776+0.0280+0.0202 \\&=0.4021 \end{aligned}

\displaystyle \begin{aligned}(2) \ \ \ \ \ \text{relative frequency of consonants}&=1-0.4021 \\&=0.5979 \end{aligned}

The probability distribution of the letters displayed in Figure 1 is a useful tool that can aid the process of breaking an intercepted cipher. The general idea is to compare the frequency of the letters in the encrypted message with the frequency of the letters in Figure 1. Thus the most used letter in the ciphertext might correspond to the letter E, or it might correspond to T or A (as T and A are also very common in plaintext). But the most used letter in the ciphertext is unlikely to be a Z or a Q. The second most used letter in the ciphertext might be the letter T in the plaintext, or it might be another one of the top letters. The cryptanalyst will likely need to try various combinations of mappings between the letters in the ciphertext and the plaintext. The idea described here is not a sure-fire approach; it is rather a trial and error process that helps the analyst put the statistical puzzle pieces together.

We now use the letters in presidential inaugural speeches to see how the Dewey letter frequency holds up. We want to use text from another era (so we choose the two inaugural speeches of George Washington) and text that is contemporary (so we choose the inaugural speech of Barack Obama). The text of presidential inaugural speeches can be found here.

Figure 2 below shows the letter frequency in the two inaugural speeches of George Washington. There are a total of 7,641 letters (we only use the body of the speeches). Figure 3 below is a side-by-side comparison between the letter frequency in Figure 1 (Dewey) and the letter frequency in Washington's two speeches (Figure 2). Figure 3 shows that the letter frequency in Washington's speeches is on the whole very similar to the letter frequency of Dewey. We cannot expect an exact match, but overall there is a general agreement between the two distributions.

Figure 4 below shows the letter frequency in the inaugural speech of Barack Obama. There are a total of 10,627 letters (we only use the body of the speech). Figure 5 below is a side-by-side comparison between the letter frequency in Figure 1 (Dewey) and the letter frequency in Obama's speech (Figure 4). There is also very good agreement between the letter frequency in Dewey (the benchmark) and the letter frequency in Obama's speech.

Despite the passage of almost 200 years, there is excellent agreement between the letter usage in Washington's speeches in 1789 and the distribution obtained by Dewey in 1970 (see Figure 3). Some letters appeared more often in Washington's speeches (e.g. E, I and N) and some appeared less often (e.g. A).
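
A comparison of this kind is easy to carry out with a little code. The following is a minimal sketch, assuming Python; since the speech text is not reproduced in this post, a short placeholder string stands in for the body of a speech.

```python
from collections import Counter

def letter_frequency(text):
    """Relative frequency of the letters A-Z in a block of text."""
    letters = [ch for ch in text.upper() if "A" <= ch <= "Z"]
    counts = Counter(letters)
    return {letter: counts[letter] / len(letters) for letter in counts}

# Placeholder text; the body of an inaugural speech would go here
speech = "We the people of the United States, in order to form a more perfect union"
freq = letter_frequency(speech)
top5 = sorted(freq, key=freq.get, reverse=True)[:5]
print(top5, round(sum(freq[l] for l in top5), 3))  # Dewey's top 5 (E, T, A, O, I) carry about 45%
```
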
The general pattern of the letter distribution in Washington's speeches is unmistakably similar to that of Dewey's. Similar observations can be made about the comparison between the letter frequency in Obama's speech and Dewey's distribution (see Figure 5). The following table shows the frequency of the top letter, the top 5 letters, the top 8 letters and the top 12 letters in Dewey's distribution alongside the corresponding frequencies in the speeches of Washington and Obama. Table (1) shows that the frequencies of the top letters are quite close between Dewey's distribution and the speeches of Washington and Obama.

$\displaystyle (1) \ \ \ \ \begin{bmatrix} \text{Top Letters in Dewey's Distribution}&\text{ }&\text{Dewey}&\text{ }&\text{Washington}&\text{ }&\text{Obama} \\\text{ }&\text{ }&\text{ } \\\text{E}&\text{ }&0.1268&\text{ }&0.1309&\text{ }&0.1268 \\ \text{E, T, A, O, I}&\text{ }&0.4517&\text{ }&0.4485&\text{ }&0.4441 \\ \text{E, T, A, O, I, N, S, R}&\text{ }&0.6451&\text{ }&0.6409&\text{ }&0.6525 \\ \text{E, T, A, O, I, N, S, R, H, L, D, U}&\text{ }&0.8087&\text{ }&0.7981&\text{ }&0.8163 \end{bmatrix}$

### Reference

1. Dewey, G., Relative Frequency of English Spellings, Teachers College Press, Columbia University, New York, 1970.
2. Larsen, R. J., Marx, M. L., An Introduction to Mathematical Statistics and Its Applications, Prentice-Hall, Inc., Englewood Cliffs, New Jersey, 1981.

## Which Car Rental Company Is More Expensive, Budget or Avis?

Any “budget” conscious consumer or traveler would want to find a good deal wherever and whenever he or she can, especially when it comes to airfare and rental cars. In tough economic times, bargain hunting is the norm rather than the exception. This is an exercise in price comparison for rental cars, focusing on two popular car rental companies, Budget and Avis. It is also an opportunity to use the one-sample t-procedures (test and confidence interval) on matched pairs data.

The following table shows the prices found on the websites of Budget Car Rental and Avis Car Rental. The prices are one-day rental prices for full size sedans quoted for December 12, 2011 (a non-holiday Monday) at 35 of the busiest airports in the United States. These prices are basic car rental prices without any discount or upgrade.

### Matched Pairs Data

The price data in Table 1 are best viewed as matched pairs data. In such data, observations are taken on the same individual (in this case the same airport) under different conditions (one price is for Budget and one price is for Avis). The Budget car rental prices and the Avis car rental prices are said to be dependent samples. The alternative to thinking of Table 1 as matched pairs data is to view the Budget prices and the Avis prices as two independent samples and then use two-sample t-procedures to perform the analysis. But this approach is not the best way of using the data. When you are at the Chicago airport, you do not care about the car rental prices at the Tampa airport or any other airport. Just as you only compare prices among the car rental companies at the same airport, we should compare prices within each matched pair. So thinking of the data as matched pairs affords us the best way to compare prices between the two companies in Table 1.

To analyze the data in Table 1, we first take the difference between the Budget rental prices and the Avis rental prices (Avis minus Budget). These 35 differences form a single sample (the last column in Table 1).
The first difference is -$3.29 (indicating that Avis is cheaper by this amount at the Atlanta airport). The second difference is $45.91 (indicating that Budget is cheaper by this amount at the Chicago airport). Most of the calculations and analysis will be done using this “differenced” sample. Thus, the comparative design of using matched pairs data makes use of single-sample procedures (in this case, the one-sample t-procedures).

### Initial Look at the Data

Most of the differences are positive (i.e. Avis charges more than Budget). Some of the differences are small, but some are in the $30 to $50 range. So we need to take a closer look. The following table shows the sample means and sample standard deviations for the Budget prices, the Avis prices and the differences. At these 35 airports, the average one-day rental price for Budget is about $60 and the average Avis price is $73.74. The price differential is $13.64, meaning that the average Avis price is 22.7% over the average Budget price. Any “budget” conscious traveler should care about a difference of $13. The question is: is the price differential we are seeing statistically significant? Specifically, are the data in Table 1 evidence that Budget Car Rental is less expensive than Avis?

### The Requirements for Using the t-Procedures

The use of the t-procedures (confidence interval and test) rests on two assumptions. One is that the data set is a simple random sample. The second is that the distribution of the data measurements has no outliers and follows a normal distribution (or that the sample size is large). The car rental prices in Table 1 are not a random sample; they are just the car rental prices from Budget and Avis at the 35 busiest airports in the United States. Because these busy airports are spread out across various regions of the United States and because they are of varying sizes, we feel that the sample car rental prices indicated here are representative of car renting experiences at these airports. For these reasons, we feel that there is value in carrying out this comparison. Because the sample size is relatively large ($n=35$), the need for checking the normality assumption is not critical. The car rental prices do not seem to have any extreme data values.

### About Technology

To carry out the t-test and t-interval, we should use technology (we use a TI-83 Plus). If software is not used, a t-distribution table is needed to find the p-value and the t-critical value. Refer to your favorite statistics textbook for a t-table or use this t-table.

### One-Sample t-Test

With a price differential of $13.64, we see that Avis is more expensive. Let's confirm it with a one-sample t-test. To assess whether Budget is less expensive than Avis, we test the following hypotheses:

\displaystyle \begin{aligned}(1) \ \ \ \ \ \ \ \ \ \ \ &H_0: \mu = 0 \\&H_1: \mu > 0 \end{aligned}

where $\mu$ is the mean difference in car rental prices (Avis minus Budget). The null hypothesis $H_0$ says that there is no difference in prices between Budget and Avis. The alternative hypothesis $H_1$ says that Budget is less expensive than Avis (i.e. Avis minus Budget > $0$).

The mean and standard deviation of the “differenced” sample (the last column in Table 1) are:

\displaystyle \begin{aligned}(2) \ \ \ \ \ \ \ \ \ \ \ &\overline{x}=\$13.64029 \\&s=\$14.60763 \end{aligned}

The one-sample t-statistic is:

\displaystyle \begin{aligned}(3) \ \ \ \ \ \ \ \ \ \ \ t&=\frac{\overline{x}-0}{\displaystyle \frac{s}{\sqrt{n}}}=\frac{13.64029-0}{\displaystyle \frac{14.60763}{\sqrt{35}}}=5.5243 \end{aligned}

The p-value of this t-test is found from the t-distribution with 34 degrees of freedom (one less than the sample size). There are two ways to accomplish this (looking up a table or using software). Based on a t-distribution table (such as this one), $p<0.0005$. Software (using TI-83 plus) gives a p-value that is much smaller, $p=1.78871591 \times 10^{-6}$, which is approximately $p=0.0000018$.
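
The same test statistic and p-value (and the 95% confidence interval computed later in the post) can also be reproduced from the summary statistics with a few lines of code. Here is a minimal sketch, assuming Python with scipy (not used in the post, which relies on a TI-83 Plus and a t-table):

```python
from math import sqrt
from scipy import stats

n, xbar, s = 35, 13.64029, 14.60763    # Avis-minus-Budget differences: size, mean, std dev

t_stat = (xbar - 0) / (s / sqrt(n))
p_value = stats.t.sf(t_stat, df=n - 1)  # one-sided P(T > t) with 34 degrees of freedom
print(round(t_stat, 4), p_value)        # about 5.5243 and 1.8e-06

t_crit = stats.t.ppf(0.975, df=n - 1)   # critical value for a 95% confidence interval
margin = t_crit * s / sqrt(n)
print(xbar - margin, xbar + margin)     # roughly (8.62, 18.66)
```
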

Because of the small p-value, the data provide clear evidence in favor of the alternative hypothesis (i.e. we reject the null hypothesis $H_0$). A price differential this large (as large as $13.64) is very unlikely to occur by chance if there is indeed no difference in prices between Budget and Avis. We now have evidence that Budget is less expensive on average (Avis is more expensive on average).

### One-Sample t-Interval

What is the magnitude of the price differential of Avis over Budget, with a margin of error? We want to obtain a 95% confidence interval for the mean difference in car rental prices. To this end, we need the critical value $t=2.032$ from a t-distribution table. The margin of error is:

\displaystyle \begin{aligned}(4) \ \ \ \ \ \ \ \ \ \ \ &t \times \frac{s}{\sqrt{n}}=2.032 \times \frac{14.60763}{\sqrt{35}}=5.01729 \end{aligned}

and the confidence interval is:

\displaystyle \begin{aligned}(5) \ \ \ \ \ \ \ \ \ \ \ \overline{x} \pm t \times \frac{s}{\sqrt{n}}&=13.64027 \pm 5.01729 \\&=(8.62298,18.65756) \end{aligned}

The estimated average price differential of Avis over Budget is $13.64 with a margin of error of $5.02, with 95% confidence. On average, you tend to save anywhere from $8.62 to $18.66 on a one-day car rental if you go with Budget.

### Remark

It is clear that Avis is more expensive (at least in terms of one-day rentals of full size sedans on a Monday). Perhaps other factors could alter the picture. For example, this comparison does not account for discounts or special promotions. There are variations to the exercise done here. One is to compare week-long rentals. Another is to compare vehicles in other classes (e.g. economy or SUV). Another is to compare prices for busy seasons (e.g. holiday weekends).

### Reference

1. Moore, D. S., Essential Statistics, W. H. Freeman and Company, New York, 2010.
2. Moore, D. S., McCabe, G. P., Craig, B. A., Introduction to the Practice of Statistics, 6th ed., W. H. Freeman and Company, New York, 2009.

## Benford's Law and US Census Data, Part II

The Benford's law is a probability model that is a powerful tool for detecting fraud and data irregularities. The first digit (or leading digit) of a number is the leftmost digit and can only be 1, 2, 3, 4, 5, 6, 7, 8 or 9. The Benford's law tells us the first digit is 1 about 30% of the time, the first digit is 2 about 17.6% of the time, and so on (see Figure 1 below). The key to the detection of fraud is to compare the distribution of first digits in the data being investigated with the Benford distribution. Too big a discrepancy between the actual data and the Benford's law (e.g. too few 1's) is sometimes enough to raise suspicion of fraud. In this post, we demonstrate the use of the chi-square goodness-of-fit test to compare the actual distribution of first digits with the Benford's law. We use the example discussed in the previous post (population counts of 3,143 counties in the United States).

The following two figures show the distribution of the first digits in the population counts for the 3,143 counties in the United States. Figure 2 shows the count for each first digit. Figure 3 shows the proportion for each first digit.

The following figure is a side-by-side comparison of Figure 1 (the Benford's law) and Figure 3 (the actual proportions from the census data). It is quite clear that the distribution of first digits in the population counts of U.S. counties follows the Benford's law quite closely.
The blue bars and the orange bars have roughly the same height, with the possible exception of the first digit 5. The blue bar and the orange bar at 5 differ by 1.4% (0.014 = 0.079 - 0.065). Is the difference for the first digit of 5 problematic? Visually speaking, we see that the actual distribution of first digits in the county population data matches the Benford's law quite well. Is it possible to confirm that with a statistical test? To answer these questions, we use the chi-square goodness-of-fit test.

The above questions are translated into a null hypothesis and an alternative hypothesis. The null hypothesis is that the first digits in the population counts for the 3,143 counties follow the Benford's law. The alternative hypothesis is that the first digits in the population counts do not follow the Benford's law. The following states the hypotheses more explicitly:

$\displaystyle (1) \ \ \ \ \ H_0: \text{The first digits in the population counts follow the Benford's law}$

$\displaystyle (2) \ \ \ \ \ \ \ H_1: \text{The first digits in the population counts do not follow the Benford's law}$

If the null hypothesis is true, we would expect 946 first digits of 1 (3,143 times 0.301), 553.2 first digits of 2 (3,143 times 0.176), and so on. The following figure shows the expected counts of first digits under the assumption of the null hypothesis. The expected counts in Figure 5 are obtained by multiplying the proportions from the Benford's law by the total count of 3,143 (the total number of counties in the United States).

We use the chi-square statistic to measure the difference between the observed counts (in Figure 2) and the expected counts (in Figure 5). The formula for the chi-square statistic is:

$\displaystyle (3) \ \ \ \ \ \chi^2=\sum \frac{(\text{observed count - expected count})^2}{\text{expected count}}$

The above chi-square statistic has an approximate chi-square distribution with 8 degrees of freedom. There are 9 categories of counts (i.e. the 9 possible first digits), and the degrees of freedom are always one less than the number of categories in the observed counts. The computation of the chi-square statistic is usually performed using software. The following shows the idea behind the calculation:

\displaystyle \begin{aligned} (4) \ \ \ \ \ \chi^2&=\frac{(972-946)^2}{946}+\frac{(573-553.2)^2}{553.2}+\frac{(376-392.9)^2}{392.9} \\&\ \ \ +\frac{(325-304.9)^2}{304.9}+\frac{(205-248.3)^2}{248.3}+\frac{(209-210.3)^2}{210.3} \\&\ \ \ +\frac{(179-182.3)^2}{182.3}+\frac{(155-160.3)^2}{160.3}+\frac{(149-144.6)^2}{144.6}=11.4 \end{aligned}

The idea behind the chi-square statistic is that for each possible first digit, we take the difference between the observed count and the expected count. We then square the difference and normalize it by dividing by the expected count. For example, for digit 1, the difference between the observed count and the expected count is 26 (= 972 - 946). Squaring it produces 676, and dividing 676 by 946 produces 0.7146. The sum of all 9 normalized differences is 11.4. The value $\chi^2=11.4$ is a realization of the chi-square statistic stated in $(3)$, which has an approximate chi-square distribution with 8 degrees of freedom. The probability that a chi-square distribution (with 8 degrees of freedom) takes a value greater than 11.4 is $p=0.179614$. This probability is called the p-value and is usually estimated using a chi-square table or obtained by using software.
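
The same calculation can be reproduced with a few lines of code. Here is a minimal sketch, assuming Python with scipy (the post itself relies on a TI-83 Plus), using the observed counts from Figure 2 and the expected counts from Figure 5:

```python
from scipy import stats

observed = [972, 573, 376, 325, 205, 209, 179, 155, 149]                  # Figure 2
expected = [946, 553.2, 392.9, 304.9, 248.3, 210.3, 182.3, 160.3, 144.6]  # Figure 5

chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
p_value = stats.chi2.sf(chi_sq, df=8)  # 9 digit categories, so 8 degrees of freedom

print(round(chi_sq, 1), round(p_value, 4))  # roughly 11.4 and 0.18, matching the post
```
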
We obtained this p-value using the TI-83 Plus graphing calculator. With the p-value being 0.1796, we do not reject the null hypothesis $H_0$. The differences that we see between the observed counts in Figure 2 and the expected counts in Figure 5 are not sufficient evidence for us to believe that the first digits in the county population counts do not follow the Benford's law.

The calculated chi-square statistic of 11.4 captures the differences between the observed counts (in Figure 2) and the expected counts (in Figure 5). As expected, a large portion of the 11.4 is due to the difference in the digit 5.

$\displaystyle (5) \ \ \ \ \ \frac{(205-248.3)^2}{248.3}=7.55$

Even with the relatively large deviation in digit 5, the calculated chi-square statistic of 11.4 is not large enough for us to believe that the first digits from the census data at hand deviate from the Benford's law. Consequently we have no reason to doubt that the distribution of first digits in this census data set follows the Benford's law.

## Benford's Law and US Census Data, Part I

The first digit (or leading digit) of a number is the leftmost digit (e.g. the first digit of 567 is 5). The first digit of a number can only be 1, 2, 3, 4, 5, 6, 7, 8 or 9, since we do not usually write a number such as 567 as 0567. Some fraudsters may think that the first digits of the numbers in financial documents appear with equal frequency (i.e. each digit appears about 11% of the time). In fact, this is not the case. It was discovered by Simon Newcomb in 1881 and rediscovered by physicist Frank Benford in 1938 that the first digits in many data sets occur according to the probability distribution indicated in Figure 1 below:

The above probability distribution is now known as the Benford's law. It is a powerful and yet relatively simple tool for detecting financial and accounting frauds (see this previous post). For example, according to the Benford's law, about 30% of numbers in legitimate data have 1 as a first digit. Fraudsters who do not know this will tend to have far fewer ones as first digits in their faked data.

Data for which the Benford's law is applicable are data that tend to be distributed across multiple orders of magnitude. Examples include income data of a large population and census data such as populations of cities and counties. In addition to demographic data and scientific data, the Benford's law is also applicable to many types of financial data, including income tax data, stock exchange data, corporate disbursement and sales data (see [1]). The author of [1], Mark Nigrini, also discussed data analysis methods (based on the Benford's law) that are used in forensic accounting and auditing.

In a previous post, we compared the trade volume data of the S&P 500 stock index to the Benford's law. In this post, we provide another example of the Benford's law in action. We analyze the population data of all 3,143 counties in the United States (data are found here). The following figures show the distribution of first digits in the population counts of all 3,143 counties. Figure 2 shows the actual counts and Figure 3 shows the proportions. In both of these figures, there are nine bars, one for each of the possible first digits.

In Figure 2, we see that there are 972 counties in the United States with 1 as the first digit of the population count (0.309, or 30.9% of the total). This agrees quite well with the proportion of 0.301 from the Benford's law.
Comparing the proportions between Figure 1 and Figure 3, we see that the actual proportions of first digits in the county population counts are in good general agreement with the Benford's law. The following figure shows a side-by-side comparison of Figure 1 and Figure 3.

Looking at Figure 4, it is clear that the actual proportions of the first digits in the 3,143 county population counts follow the Benford's law quite closely. Such a close match with the Benford's law would be expected in authentic and unmanipulated data. A comparison such as the one shown in Figure 4 lies at the heart of any technique in data analysis using the Benford's law.

### Reference

1. Nigrini, M. J., I've Got Your Number, Journal of Accountancy, May 1999. Link
2. US Census Bureau – American Fact Finder.
3. Wikipedia's entry for the Benford's law.

## Is It Possible to Spot Fraudulent Numbers?

People who engage in financial fraud have a need to produce false data as part of their criminal activities. Is it possible to look at the false data and determine that they are not real? A probability model known as the Benford's law is a powerful and relatively simple tool for detecting potential financial frauds or errors. In this post, we give some indication of how the Benford's law can be used, and we use data from the S&P 500 stock index as an example.

The first digit (or leading digit) of a number is the leftmost digit. For example, the first digit of $35,987 is 3. Since zero cannot be a first digit, there are only 9 possible choices for the first digit. Some data fabricators may try to distribute the first digits in false data fairly uniformly. However, the first digits of numbers in legitimate documents tend not to distribute uniformly across the 9 possible digits. According to the Benford's law, about 30% of the numbers in many types of data sets have 1 as a first digit and about 17.6% have 2. The higher digits have even lower frequencies, with 9 occurring about 5% of the time. The following figure shows the probability distribution according to the Benford's law.
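
Though the post presents the Benford's law only as a figure, the distribution has a simple closed form (a standard fact about the law, not stated explicitly in the post): the probability that the first digit is d equals log10(1 + 1/d). A minimal Python sketch reproduces the proportions in Figure 1:

```python
from math import log10

# Benford's law: P(first digit = d) = log10(1 + 1/d)
for d in range(1, 10):
    print(d, round(log10(1 + 1 / d), 3))
# prints 0.301 for d=1, 0.176 for d=2, ..., 0.046 for d=9
```
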

Frank Benford was a physicist working for the General Electric Company in the 1920s and 1930s. He observed that the first few pages of his logarithm tables (the pages corresponding to the lower digits) were dirtier and more worn than the pages for the higher digits. Without electronic calculators or modern computers, people in those times used logarithm tables to facilitate numerical calculations. He concluded that he was looking up the first few pages of the logarithm table more often (hence doing calculations involving lower first digits more often). Benford then hypothesized that there were more numbers with lower first digits in the real world (hence the need to look up logarithms of numbers with lower first digits more often).

To test his hypothesis, Benford analyzed various types of data sets, including areas of rivers, baseball statistics, numbers in magazine articles, and street addresses listed in “American Men of Science”, a biographical directory. In all, the data analysis involved a total of 20,229 numerical data values. Benford found that the leading digits in his data were distributed very similarly to the model described in Figure 1 above.

As a hands-on introduction to the Benford's law, we use data from the S&P 500 stock index for November 11, 2011 (data were found here). The S&P 500 is a stock index covering 500 large companies in the U.S. economy. The following table lists the prices and volumes of shares (the number of shares traded) of the first six companies of the S&P 500 index as of November 11, 2011. The first digits of these six prices are 8, 5, 5, 5, 7, 2. The first digits of these six share volumes are 3, 4, 2, 2, 1, 8.

$\displaystyle (1) \ \ \ \ \ \ \begin{bmatrix} \text{Stock Symbol}&\text{ }&\text{Company Name}&\text{ }&\text{Prices}&\text{ }&\text{Volume} \\\text{ }&\text{ }&\text{ } \\ \text{MMM}&\text{ }&\text{3M Co}&\text{ }&\$82.29&\text{ }&\text{3.6M} \\ \text{ABT}&\text{ }&\text{Abbott Laboratories}&\text{ }&\$54.53&\text{ }&\text{4.9M} \\ \text{ANF}&\text{ }&\text{Abercrombie and Fitch Co}&\text{ }&\$56.80&\text{ }&\text{2.5M} \\ \text{ACN}&\text{ }&\text{Accenture PLC}&\text{ }&\$58.97&\text{ }&\text{2.2M} \\ \text{ACE}&\text{ }&\text{ACE Ltd}&\text{ }&\$71.24&\text{ }&\text{1.6M} \\ \text{ADBE}&\text{ }&\text{Adobe Systems Inc}&\text{ }&\$28.43&\text{ }&\text{8.4M} \end{bmatrix}$

The following figures show the frequencies of the first digits in the prices and volumes from the entire S&P 500 on November 11, 2011.

It is clear that in both the prices and the volumes, the lower digits occur more frequently as first digits. For example, of the 500 closing prices of the S&P 500 on November 11, 2011, 82 prices have 1 as the first digit (16.4% of the total) while only 14 prices have 9 as the leading digit (2.8% of the total). For the trade volumes of the 500 stocks, the skewness is even more pronounced. There are 166 volumes with 1 as the first digit (33.2% of the total) while there are only 25 volumes with 9 as the first digit (5% of the total). The following figures express the same distributions in terms of proportions (or probabilities).

Anyone who tries to fake S&P 500 price and volume data purely by random chance will not produce convincing results, not even results that can withstand a casual analysis based on figures such as Figures 2 through 5 above.

What is even more interesting is the comparison between Figure 1 (Benford's law) and Figure 5 (trade volumes of the S&P 500). The following figure is a side-by-side comparison.

There is a remarkable agreement between the distribution of the first digits in the 500 trade volumes of the S&P 500 index and the Benford's law. According to the distribution in Figure 1 (Benford's law), about 60% of the leading digits in legitimate data consist of the digits 1, 2 and 3 (0.301 + 0.176 + 0.125 = 0.602). In the actual S&P 500 trade volumes on 11/11/2011, about 65% of the leading digits are one of the first three digits (0.332 + 0.202 + 0.114 = 0.648). We cannot expect the actual percentages to exactly match those of the Benford's law. However, the general agreement between the expected (Benford's law) and the actual data (S&P 500 volumes) is remarkable and very informative.

There are many sophisticated computer tests that apply the Benford's law in fraud detection. However, the heart of the method is the simple comparison performed above, i.e. comparing the actual frequencies of the digits with the frequencies predicted by the Benford's law. If a data fabricator produces numbers whose first digits are distributed fairly uniformly, a simple comparison will expose the discrepancy between the false data and the Benford's law. Too big a discrepancy between the actual data and the Benford's law (e.g. too few 1's) is sometimes enough to raise suspicion of fraud. Many white collar criminals do not know about the Benford's law and will not expect that, in many types of realistic data, 1 occurs as a first digit about 30% of the time.
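
That comparison takes only a few lines of code. Here is a minimal sketch, assuming Python; the handful of volume figures below are hypothetical stand-ins for a real data set such as the 500 trade volumes used above.

```python
from collections import Counter
from math import log10

def first_digit(x):
    """Leading (leftmost nonzero) digit of a positive number."""
    for ch in f"{x:.15g}":
        if ch in "123456789":
            return int(ch)

# Hypothetical stand-in data; a real check would use e.g. the 500 trade volumes
data = [3_600_000, 4_900_000, 2_500_000, 2_200_000, 1_600_000, 8_400_000, 12_700_000]
counts = Counter(first_digit(x) for x in data)
total = len(data)

for d in range(1, 10):
    print(d, counts[d] / total, round(log10(1 + 1 / d), 3))  # observed vs. Benford
```
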

Benford’s law does not fit every type of data. It does not fit numbers that are randomly generated. For example, lottery numbers are drawn at random from balls in a big glass jar. Hence lottery numbers are uniformly distributed (i.e. every number has an equal chance of being selected).

Even some naturally generated numbers do not follow the Benford's law. For example, data that are confined to a relatively narrow range do not follow the Benford's law. Examples of such data include heights of human adults and IQ scores. Another example is the S&P 500 stock prices in Figure 2. Note that the pattern of the bars in Figure 2 and Figure 4 does not quite match the pattern of the Benford's law. Most of the stock prices in the S&P 500 fall below $100 (on 11/11/2011, all prices are either 2-digit or 3-digit numbers, with only 35 of the 500 prices being 3-digit). The following figure shows the side-by-side comparison between the S&P 500 stock prices and the Benford's law. Note that there are too few 1's as first digits in the S&P 500 prices.

Even though the S&P 500 prices do not follow the Benford's law, they are far from uniformly distributed (the smaller digits still come up more frequently). Any attempt to fake S&P 500 stock prices by making each digit equally likely as a first digit will still not produce convincing results (at least not to an experienced investigator).

Data for which the Benford's law is applicable are data that tend to be distributed across multiple orders of magnitude. In the example of the S&P 500 trading volume data, the range of the data is from about half a million shares to 210 million shares (from 6-digit to 9-digit numbers, i.e., across 3 orders of magnitude). In contrast, the S&P 500 prices only cover 1 order of magnitude. Other examples of data for which the Benford's law is usually applicable: income data of a large population, and census data such as populations of cities and counties. The Benford's law is also applicable to financial data such as income tax data and corporate expense data. In [1], Nigrini discussed data analysis methods (based on the Benford's law) that are used in forensic accounting and auditing. For more information about the Benford's law, see the references below or search in Google.

### Reference

1. Nigrini, M. J., I've Got Your Number, Journal of Accountancy, May 1999. Link
2. S&P 500 data.
3. Wikipedia's entry for the Benford's law.

## Presidential Elections and Statistical Inference

We use the example of presidential elections to illustrate the reasoning process of statistical inference. The current holder of a political office is called an incumbent. Does being an incumbent give an edge for reelection? The answer seems to be yes. For example, in the United States House of Representatives, the percentage of incumbents winning reelection has been routinely over 80% for over 50 years (sometimes over 90%). Since 1936, there have been 13 US presidential elections involving an incumbent. In these 13 presidential elections, incumbents won 10 times (see the table below).
$\displaystyle \begin{bmatrix} \text{ }&\text{ }&\text{ } \\ \text{Year}&\text{ }&\text{Candidate}&\text{ }&\text{Electoral Votes}&\text{ }&\text{Candidate}&\text{ }&\text{Electoral Votes}&\text{Winner} \\ \text{ }&\text{ }&\text{Incumbent}&\text{ }&\text{Incumbent}&\text{ }&\text{Challenger}&\text{ }&\text{Challenger}&\text{ } \\\text{ }&\text{ }&\text{ } \\ \text{1936}&\text{ }&\text{Roosevelt}&\text{ }&\text{523}&\text{ }&\text{Landon}&\text{ }&\text{8}&\text{Incumbent} \\ \text{1940}&\text{ }&\text{Roosevelt}&\text{ }&\text{449}&\text{ }&\text{Willkie}&\text{ }&\text{82}&\text{Incumbent} \\ \text{1944}&\text{ }&\text{Roosevelt}&\text{ }&\text{432}&\text{ }&\text{Dewey}&\text{ }&\text{99}&\text{Incumbent} \\ \text{1948}&\text{ }&\text{Truman}&\text{ }&\text{303}&\text{ }&\text{Dewey}&\text{ }&\text{189}&\text{Incumbent} \\ \text{1956}&\text{ }&\text{Eisenhower}&\text{ }&\text{457}&\text{ }&\text{Stevenson}&\text{ }&\text{73}&\text{Incumbent} \\ \text{1964}&\text{ }&\text{Johnson}&\text{ }&\text{486}&\text{ }&\text{Goldwater}&\text{ }&\text{52}&\text{Incumbent} \\ \text{1972}&\text{ }&\text{Nixon}&\text{ }&\text{520}&\text{ }&\text{McGovern}&\text{ }&\text{17}&\text{Incumbent} \\ \text{1976}&\text{ }&\text{Ford}&\text{ }&\text{240}&\text{ }&\text{Carter}&\text{ }&\text{297}&\text{Challenger} \\ \text{1980}&\text{ }&\text{Carter}&\text{ }&\text{49}&\text{ }&\text{Reagan}&\text{ }&\text{489}&\text{Challenger} \\ \text{1984}&\text{ }&\text{Reagan}&\text{ }&\text{525}&\text{ }&\text{Mondale}&\text{ }&\text{13}&\text{Incumbent} \\ \text{1992}&\text{ }&\text{GHW Bush}&\text{ }&\text{168}&\text{ }&\text{Clinton}&\text{ }&\text{370}&\text{Challenger} \\ \text{1996}&\text{ }&\text{Clinton}&\text{ }&\text{379}&\text{ }&\text{Dole}&\text{ }&\text{159}&\text{Incumbent} \\ \text{2004}&\text{ }&\text{GW Bush}&\text{ }&\text{286}&\text{ }&\text{Kerry}&\text{ }&\text{252}&\text{Incumbent} \end{bmatrix}$

Clearly, more than half of these presidential elections were won by incumbents. Note that the three wins for the challenger occurred in times of economic turmoil or in the wake of political scandal. So presidential incumbents seem to have an edge, except possibly in times of economic or political turmoil. The power of incumbency in political elections at the presidential and congressional level seems like an overwhelming force (a reelection rate of over 80% in the US House and 10 wins for the last 13 presidential incumbents). Incumbents in political elections at other levels likely have a similar advantage. So the emphasis here is not to establish the fact that incumbents have an advantage. Rather, our goal is to illustrate the reasoning process of statistical inference using the example of presidential elections.

The statistical question we want to ask is: does this observed result really provide evidence that there is a real advantage for incumbents in US presidential elections? Is the high number of incumbent wins due to a real incumbent advantage or just due to chance?

One good way to think about this question is through a “what if”. What if there is no incumbency advantage? Then it is just as likely for a challenger to win as it is for an incumbent. Under this “what if”, each election is like a toss of a fair coin. This “what if” is called a null hypothesis. Assuming each election is a coin toss, incumbents should win about half of the elections, which would be 6.5 (in practice, 6 or 7).
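
Under this coin-toss null hypothesis, the number of incumbent wins in 13 elections follows a binomial distribution with n = 13 and p = 0.5. The sketch below, assuming Python with numpy and scipy (the post's own simulation was done in an Excel spreadsheet), computes the chance of 10 or more wins both exactly and by simulation; this is the quantity examined in the next few paragraphs.

```python
import numpy as np
from scipy import stats

# Exact probability of 10 or more incumbent wins in 13 elections under the
# fair-coin ("no incumbent advantage") null hypothesis
p_exact = stats.binom.sf(9, 13, 0.5)   # P(X >= 10) with n = 13 trials, p = 0.5
print(round(p_exact, 4))               # about 0.0461

# The same probability estimated by simulation (10,000 repetitions of 13 tosses)
rng = np.random.default_rng()
heads = rng.binomial(13, 0.5, size=10_000)
print((heads >= 10).mean())            # roughly 0.046
```
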
Does the observed difference between 10 and 6.5 indicate a real difference, or is it just due to random chance (the incumbents were just being lucky)? Assuming each election is a coin toss, how likely is it to have 10 or more wins for incumbents in 13 presidential elections involving incumbents? How likely is it to have 10 or more heads in tossing a fair coin 13 times?

Based on our intuitive understanding of coin tossing, getting 10 or more heads out of 13 tosses of a fair coin does not seem likely (getting 6 or 7 heads would be a more likely outcome). So we should reject this “what if” notion that every presidential election is like a coin toss, rather than believing that the incumbents were just very lucky in the last 13 presidential elections involving incumbents.

If you think that you do not have a good handle on the probability of getting 10 or more heads in 13 tosses of a fair coin, we can try simulation. We can toss a fair coin 13 times and count the number of heads. We repeated this process 10,000 times (done in an Excel spreadsheet). The following figure is a summary of the results.

Note that in the 10,000 simulated repetitions (each consisting of 13 coin tosses), only 465 repetitions have 10 or more heads (350 + 100 + 15 = 465). So getting 10 or more heads in 13 coin tosses is unlikely. In our simulation, it happened only 465 times out of 10,000. If we perform another 10,000 repetitions of coin tossing, we will get a similarly small likelihood of getting 10 or more heads. Our simulated probability of obtaining 10 or more heads in 13 coin tosses is 0.0465. This probability can also be computed exactly using the binomial distribution, which gives 0.0461 (not far from the simulated result).

Under the null hypothesis that incumbents have no advantage over challengers, we can use a simple model of tosses of a fair coin. Then we look at the observed result of 10 heads in 13 coin tosses (10 wins for incumbents in the last 13 presidential elections involving incumbents). We ask: how likely is this observed result if elections are like coin tosses? Based on the simulation, we see that the observed result is not likely (it happened 465 times out of 10,000). An exact computation gives the probability as 0.0461. So we reject the null hypothesis rather than believing that incumbents have no advantage over challengers.

The reasoning process for a test of significance can be summarized in the following steps.

1. You have some observed data (e.g. data from experiments or observational studies). The observed data appear to contradict a conventional wisdom (or a neutral position). We want to know whether the difference between the observed data and the conventional wisdom is a real difference or just due to chance. In this example, the observed data are the 10 incumbent wins in the last 13 presidential elections involving incumbents. The neutral position is that incumbents have no advantage over the challengers. We want to know whether the high number of incumbent wins is due to a real incumbent advantage or just due to chance.

2. The starting point of the reasoning process is a “what if”. What if the observed difference is due to chance (i.e. there is really no difference between the observed data and the neutral position)? So we assume the neutral position is valid. We call this the null hypothesis.

3. We then evaluate the observed data, asking: what would be the likelihood of this happening if the null hypothesis were true? This probability is called a p-value.
In this example, the observed data are the 10 incumbent wins in the last 13 presidential elections involving incumbents. The p-value is the probability of seeing 10 or more incumbent wins if indeed incumbents have no real advantage.

4. We estimate the p-value. If the p-value is small, we reject the neutral position (null hypothesis), rather than believing that the observed difference is due to random chance. In this example, we estimate the p-value by a simulation exercise (though it can also be computed by a direct calculation). Because the p-value is so small, we reject the notion that there is no incumbent advantage rather than believing that the high number of incumbent wins is just due to incumbents being lucky.

Any statistical inference problem that involves testing a hypothesis works the same way as the presidential election example described here. The details used to calculate the p-value may change, but the reasoning process remains the same.

We realize that the reasoning process in the presidential election example may still not come naturally to some students. One reason may be that our intuition is not as reliable in working with p-values in some statistical inference problems, which may involve normal distributions, binomial distributions and other probability models. So it is critical that students in an introductory statistics class get a good grounding in working with these probability models. See this previous post for a more intuitive example of statistical inference.

## A Case of Restaurant Arson and the Reasoning of Statistical Inference

I know of a case of a restaurant owner who was convicted of burning down his restaurant for the purpose of collecting insurance money. It turns out that this case is a good example for introducing the reasoning process of statistical inference from an intuitive point of view.

This restaurant owner was convicted of burning down his restaurant and received a lengthy jail sentence. What was the red flag that alerted the insurance company that something was not quite right in the first place? The same restaurant owner's previous restaurant had also burned to the ground! Of course, the insurance company could not just file charges against the owner simply because of two losses in a row. But the two losses in a row did raise a giant red flag for the insurer, which brought all its investigative resources to bear on this case.

For us independent observers, this case provides an excellent illustration of an intuitive reasoning process for statistical inference. The reasoning process behind the suspicion is a natural one for many people, including students in my introductory statistics class. Once we learn that there were two burned down restaurants in a row, we ask: are the two successive losses just due to bad luck, or are the losses due to other causes such as fraud? Most people would feel that two successive fire losses are unlikely if there is no fraud involved. It is natural that we would settle on the possibility of fraud rather than attributing the losses to bad luck.

The reasoning process is mostly unspoken and intuitive. For the sake of furthering the discussion, let me write it out:

1. The starting point of the reasoning process is an unspoken belief that the restaurant owner did not cause the fire (we call this the null hypothesis).

2. We then evaluate the observed data (losing two restaurants in a row to fire).
What would be the likelihood of this happening if the null hypothesis were true? Though we assess this likelihood intuitively, this probability is called a p-value, which we feel is small in this case.

3. Since we feel that the p-value is very small, we reject the initial belief (the null hypothesis), rather than believing that the rare event of two fire losses in a row was solely due to chance.

The statistical inference problems that we encounter in a statistics course are all based on the same intuitive reasoning process described above. However, unlike the restaurant arson case described here, the reasoning process in many statistical inference problems does not come naturally to students in introductory statistics classes. The reason may be that it is not easy to grasp intuitively the implication of having a small p-value in many statistical inference problems. It is difficult for some students to grasp why a small p-value should lead to the rejection of the null hypothesis.

In the restaurant arson example, we do not have to calculate a p-value; we simply rely on our intuition to know that the p-value, whatever it is, is likely to be very small. In the statistical inference problems that we do in a statistics class, we need a probability model to calculate a p-value. Beginning inference problems in an introductory class are usually based on normal distributions or, in some cases, the binomial distribution. With the normal model or the binomial model, our intuition is much less reliable in computing and interpreting the p-value (the probability of obtaining the observed data given that the null hypothesis is true). This is a challenge for students to overcome. To address this challenge, it is critical that students have a good grounding in normal distributions and the binomial distribution. Simulation projects can also help students gain an intuitive sense of the p-value.

In any case, just know that in terms of reasoning, any inference problem works the same way as the intuitive example described here. It starts off with a question or a claim. The solution starts with a null hypothesis, which is a neutral proposition (e.g. the restaurant owner did not do it). We then evaluate the observed data to calculate the p-value. The p-value forms the basis for judging the strength of the observed data: the smaller the p-value, the less credible the null hypothesis and the stronger the case for rejecting the null hypothesis.

Repeated large losses for the same claimant raise suspicion for property insurance companies as well as life insurance companies. The concept of the p-value is important for insurance companies. The case of repeated large insurance losses described here is an excellent example for illustrating the intuitive reasoning behind statistical inference.

## Rolling Dice to Buy Gas

The price of gas you pay at the pump is influenced by many factors. One such factor is the price of crude oil in the international petroleum market, which can be highly dependent on global macroeconomic conditions. Why don't we let probability determine the price of gas? We discuss here an experiment that generates gas prices using random chance. The goal of this experiment is to shed some light on probability concepts such as the central limit theorem, the law of large numbers, and the sampling distribution of the sample mean. These are difficult concepts for students in an introductory statistics class. We hope that the examples shown here will be of help to these students.
We came across the idea for this example in [1], which devotes only one page to the intriguing idea of random gas prices (page 722). We took the idea and added our own simulations and observations to make the example more accessible.

### The Experiment

There is a gas station called the 1-Die Gas Station. The price per gallon you pay at this gas station is determined by the roll of a die. Whatever number comes up in rolling the die, that is the price per gallon you pay. You may end up paying $1 per gallon if you are lucky, or $6 if you are not. But if you buy gas repeatedly at this gas station, you pay $3.50 per gallon on average.

Down the street is another gas station called the 2-Dice Gas Station. The price you pay there is determined by the average of two dice. For example, if the roll of two dice results in 3 and 5, the price per gallon is $4. The prices at the 3-Dice Gas Station are determined by taking the average of a roll of three dice. In general, the n-Dice Gas Station works similarly.

The possibility of paying $1 per gallon is certainly attractive for customers. On the other hand, paying $6 per gallon is not so desirable. How likely are customers to buy gas at these extreme prices? How likely are they to pay in the middle price range, say between $3 and $4? In other words, what is the probability distribution of the gas prices at these n-Dice Gas Stations? Another question is: is there any difference between the gas prices at the 1-Die Gas Station, the 2-Dice Gas Station and the other higher-dice gas stations? As the number of dice increases, what happens to the probability distribution of the gas prices?

One way to look into the above questions is to generate gas prices by rolling dice. After recording the rolls of the dice, we can use graphs and numerical summaries to look for patterns. Instead of actually rolling dice, we simulate the rolling of dice in an Excel spreadsheet. We simulate 10,000 gas purchases at each of the following gas stations:

$\displaystyle (0) \ \ \ \ \begin{bmatrix} \text{Simulation}&\text{ }&\text{Gas Station}&\text{ }&\text{Number of Gas Prices} \\\text{ }&\text{ }&\text{ } \\\text{1}&\text{ }&\text{1-Die}&\text{ }&\text{10,000} \\\text{2}&\text{ }&\text{2-Dice}&\text{ }&\text{10,000} \\\text{3}&\text{ }&\text{3-Dice}&\text{ }&\text{10,000} \\\text{4}&\text{ }&\text{4-Dice}&\text{ }&\text{10,000} \\\text{5}&\text{ }&\text{10-Dice}&\text{ }&\text{10,000} \\\text{6}&\text{ }&\text{30-Dice}&\text{ }&\text{10,000} \\\text{7}&\text{ }&\text{50-Dice}&\text{ }&\text{10,000} \end{bmatrix}$

### How Gas Prices Are Simulated

We use the function $Rand()$ in Excel to simulate rolls of dice. Here's how a roll of a die is simulated. The $Rand()$ function generates a random number $x$ between $0$ and $1$. If $x$ is between $0$ and $\frac{1}{6}$, it is considered a roll of a die that produces a $1$. If $x$ is between $\frac{1}{6}$ and $\frac{2}{6}$, it is considered a roll of a die that produces a $2$, and so on. The following rule describes how a random number is assigned a value of the die:

$\displaystyle (1) \ \ \ \ \begin{bmatrix} \text{Random Number}&\text{ }&\text{Value of Die} \\\text{ }&\text{ }&\text{ } \\ 0< x <\frac{1}{6}&\text{ }&\text{1} \\\text{ }&\text{ }&\text{ } \\ \frac{1}{6} \le x < \frac{2}{6}&\text{ }&\text{2} \\\text{ }&\text{ }&\text{ } \\ \frac{2}{6} \le x < \frac{3}{6}&\text{ }&\text{3} \\\text{ }&\text{ }&\text{ } \\ \frac{3}{6} \le x < \frac{4}{6}&\text{ }&\text{4} \\\text{ }&\text{ }&\text{ } \\ \frac{4}{6} \le x < \frac{5}{6}&\text{ }&\text{5} \\\text{ }&\text{ }&\text{ } \\ \frac{5}{6} \le x < 1&\text{ }&\text{6} \end{bmatrix}$

For the 1-Die Gas Station, we simulated 10,000 rolls of a die (as described above). These 10,000 die values are the gas prices for 10,000 purchases. For the 2-Dice Gas Station, the simulation consists of 10,000 simulated rolls of a pair of dice, and the 10,000 gas prices are obtained by taking the average of each pair of simulated dice values. For the 3-Dice Gas Station, the simulation consists of 10,000 iterations where each iteration is one simulated roll of three dice (i.e. 10,000 random samples of dice values, each sample of size 3).
By going through the process described above, 10,000 gas prices are simulated for each gas station indicated in $(0)$. The following shows the first 10 simulated gas prices at the first three gas stations listed in $(0)$.

$\displaystyle (3a) \text{ First 10 Simulated Gas Prices} \ \ \ \ \begin{bmatrix} \text{Iteration}&\text{ }&\text{Value of Die}&\text{ }&\text{1-Die Gas Price} \\\text{ }&\text{ }&\text{ } \\ 1&\text{ }&\text{2}&\text{ }&\$2 \\ 2&\text{ }&\text{3}&\text{ }&\$3 \\ 3&\text{ }&\text{2}&\text{ }&\$2 \\ 4&\text{ }&\text{1}&\text{ }&\$1 \\ 5&\text{ }&\text{3}&\text{ }&\$3 \\ 6&\text{ }&\text{3}&\text{ }&\$3 \\ 7&\text{ }&\text{2}&\text{ }&\$2 \\ 8&\text{ }&\text{5}&\text{ }&\$5 \\ 9&\text{ }&\text{5}&\text{ }&\$5 \\ 10&\text{ }&\text{2}&\text{ }&\$2 \end{bmatrix}$

$\displaystyle (3b) \text{ First 10 Simulated Gas Prices} \ \ \ \ \begin{bmatrix} \text{Iteration}&\text{ }&\text{Values of Dice}&\text{ }&\text{2-Dice Gas Price} \\\text{ }&\text{ }&\text{ } \\ 1&\text{ }&\text{4, 6}&\text{ }&\$5.0 \\ 2&\text{ }&\text{2, 5}&\text{ }&\$3.5 \\ 3&\text{ }&\text{3, 4}&\text{ }&\$3.5 \\ 4&\text{ }&\text{2, 2}&\text{ }&\$2.0 \\ 5&\text{ }&\text{5, 4}&\text{ }&\$4.5 \\ 6&\text{ }&\text{2, 6}&\text{ }&\$4.0 \\ 7&\text{ }&\text{5, 2}&\text{ }&\$3.5 \\ 8&\text{ }&\text{6, 1}&\text{ }&\$3.5 \\ 9&\text{ }&\text{5, 5}&\text{ }&\$5.0 \\ 10&\text{ }&\text{5, 2}&\text{ }&\$3.5 \end{bmatrix}$

$\displaystyle (3c) \text{ First 10 Simulated Gas Prices} \ \ \ \ \begin{bmatrix} \text{Iteration}&\text{ }&\text{Values of Dice}&\text{ }&\text{3-Dice Gas Price} \\\text{ }&\text{ }&\text{ } \\ 1&\text{ }&\text{1, 6, 6}&\text{ }&\$4.33 \\ 2&\text{ }&\text{5, 5, 6}&\text{ }&\$5.33 \\ 3&\text{ }&\text{1, 4, 6}&\text{ }&\$3.67 \\ 4&\text{ }&\text{3, 2, 1}&\text{ }&\$2.00 \\ 5&\text{ }&\text{3, 3, 2}&\text{ }&\$2.67 \\ 6&\text{ }&\text{3, 6, 6}&\text{ }&\$5.00 \\ 7&\text{ }&\text{1, 6, 2}&\text{ }&\$3.00 \\ 8&\text{ }&\text{1, 1, 6}&\text{ }&\$2.67 \\ 9&\text{ }&\text{4, 1, 5}&\text{ }&\$3.33 \\ 10&\text{ }&\text{1, 4, 5}&\text{ }&\$3.33 \end{bmatrix}$

Looking at the Simulated Gas Prices Graphically

We summarized the 10,000 gas prices at each gas station into a frequency distribution and a histogram. Watch the progression of the histograms from 1-Die, to 2-Dice, all the way to 50-Dice. As the number of dice increases, the histogram becomes more and more bell-shaped.

It is a remarkable fact that as the number of dice increases, the distribution of the gas prices changes shape. The gas prices at the 1-Die Gas Station are uniform (the bars in the histogram have essentially the same height). The gas prices at the 2-Dice Gas Station are no longer uniform. The 3-Dice and 4-Dice gas prices are approaching a bell shape. The shapes of the 30-Dice and 50-Dice gas prices are unmistakably bell-shaped.

Each gas price is the average of a sample of dice values. For example, each 1-Die gas price is the mean of a sample of one die value, each 2-Dice gas price is the mean of a sample of two dice values, each 3-Dice gas price is the mean of a sample of three dice values, and so on. What we are witnessing in the above series of histograms is that as the sample size increases, the shape of the distribution of the sample mean becomes more and more normal.
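The progression of histograms can be reproduced with a short matplotlib sketch like the one below (our own illustration; it re-simulates the prices, so the exact bar heights will differ slightly from the figures in this post).

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=2010)

# One histogram per gas station: each of the 10,000 prices is the mean of n die values.
dice_counts = [1, 2, 3, 4, 10, 30, 50]
fig, axes = plt.subplots(1, len(dice_counts), figsize=(21, 3), sharey=True)
for ax, n in zip(axes, dice_counts):
    prices = rng.integers(1, 7, size=(10_000, n)).mean(axis=1)
    ax.hist(prices, bins=30, range=(1, 6), density=True)
    ax.set_title(f"{n}-Dice")
    ax.set_xlabel("price per gallon")
axes[0].set_ylabel("relative frequency")
plt.tight_layout()
plt.show()
```

Here `rng.integers(1, 7, ...)` draws die values directly, which is equivalent to the $Rand()$ mapping described above.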
The above series of histograms (from Figure 4b to Figure 10b) is a demonstration of the central limit theorem in action.

Looking at the Simulated Gas Prices Using Numerical Summaries

Note that all the above histograms center at about $3.50, and that the spread gets smaller as the number of dice increases. To get a sense that the spread is getting smaller, note in the histograms (as well as in the frequency distributions) that the min and max gas prices for 1-Die, 2-Dice, 3-Dice and 4-Dice are $1 and $6, respectively. However, the price range for the higher dice gas stations is smaller. For example, the price ranges are $1.70 to $5.30 (10-Dice), $2.33 to $4.60 (30-Dice) and $2.60 to $4.64 (50-Dice). So at these higher dice stations, very cheap gas (e.g. $1) and very expensive gas (e.g. $6) essentially never occur. To confirm what we see, look at the following table of numerical summaries, calculated from the 10,000 simulated gas prices for each gas station.

$\displaystyle (11) \text{ Numerical Summaries} \ \ \ \ \begin{bmatrix} \text{Gas Station}&\text{ }&\text{Mean}&\text{ }&\text{St Dev}&\text{ }&\text{Min Price}&\text{ }&\text{Max Price} \\\text{ }&\text{ }&\text{ } \\ \text{1-Die}&\text{ } &\$3.4857&\text{ }&\$1.6968&\text{ } &\$1.00&\text{ } &\$6.00 \\ \text{2-Dice}&\text{ } &\$3.4939&\text{ }&\$1.2131&\text{ } &\$1.00&\text{ } &\$6.00 \\ \text{3-Dice}&\text{ } &\$3.5011&\text{ }&\$0.9862&\text{ } &\$1.00&\text{ } &\$6.00 \\ \text{4-Dice}&\text{ } &\$3.5220&\text{ }&\$0.8555&\text{ } &\$1.00&\text{ } &\$6.00 \\ \text{10-Dice}&\text{ } &\$3.4930&\text{ }&\$0.5468&\text{ } &\$1.70&\text{ } &\$5.30 \\ \text{30-Dice}&\text{ } &\$3.5023&\text{ }&\$0.3108&\text{ } &\$2.33&\text{ } &\$4.60 \\ \text{50-Dice}&\text{ } &\$3.5009&\text{ }&\$0.2421&\text{ } &\$2.60&\text{ } &\$4.64 \end{bmatrix}$

Note that the mean gas price is about $3.50 at all the gas stations. The standard deviation gets smaller as the number of dice increases. Note that each standard deviation in the above table is the standard deviation of gas prices. As noted before, each gas price is the mean of a sample of simulated dice values. What we are seeing is that the standard deviation of sample means gets smaller as the sample size increases. The theoretical mean and standard deviation for the 1-Die gas prices are $3.50 and $\frac{\sqrt{105}}{6}=1.707825$, respectively. As the number of dice increases, the mean of the gas prices (sample means) should remain close to $3.50 (as seen in the above table). But the standard deviation of the sample averages gets smaller. The standard deviation of the gas prices shrinks according to the following formula:

$\displaystyle (12) \ \ \ \ \text{Standard Deviation of n-Dice Gas Prices}=\frac{1.707825}{\sqrt{n}}$

According to the above formula, the standard deviation of gas prices gets smaller and smaller, approaching zero as the number of dice increases. This means that the price you pay at these gas stations will be close to $3.50 as long as the gas station uses a large number of dice to determine the price. The following table compares the standard deviations of the simulated prices and the standard deviations computed using formula $(12)$. Note the agreement between the observed standard deviations and the theoretical standard deviations.

$\displaystyle (13) \text{ Comparing St Dev} \ \ \ \ \begin{bmatrix} \text{Gas Station}&\text{ }&\text{St Dev}&\text{ }&\text{St Dev} \\\text{ }&\text{ }&\text{(Simulated)}&\text{ }&\text{(Theoretical)} \\\text{ }&\text{ }&\text{ } \\ \text{1-Die}&\text{ } &\$1.6968&\text{ } &\$1.7078 \\ \text{2-Dice}&\text{ } &\$1.2131&\text{ } &\$1.2076 \\ \text{3-Dice}&\text{ } &\$0.9862&\text{ } &\$0.9860 \\ \text{4-Dice}&\text{ } &\$0.8555&\text{ } &\$0.8539 \\ \text{10-Dice}&\text{ } &\$0.5468&\text{ } &\$0.5401 \\ \text{30-Dice}&\text{ } &\$0.3108&\text{ } &\$0.3118 \\ \text{50-Dice}&\text{ } &\$0.2421&\text{ } &\$0.2415 \\\text{ }&\text{ }&\text{ } \\ \text{100-Dice}&\text{ } &\text{ }&\text{ } &\$0.1708 \\ \text{1,000-Dice}&\text{ } &\text{ }&\text{ } &\$0.0540 \\ \text{5,000-Dice}&\text{ } &\text{ }&\text{ } &\$0.0242 \\ \text{100,000-Dice}&\text{ } &\text{ }&\text{ } &\$0.0054 \\ \text{1,000,000-Dice}&\text{ } &\text{ }&\text{ } &\$0.0017 \end{bmatrix}$
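The theoretical column in table $(13)$ comes directly from formula $(12)$; the following few lines (our own sketch) reproduce it.

```python
import math

sigma = math.sqrt(105) / 6  # theoretical st dev of a single die value, about 1.707825

# theoretical standard deviation of the n-Dice gas prices, per formula (12)
for n in [1, 2, 3, 4, 10, 30, 50, 100, 1_000, 5_000, 100_000, 1_000_000]:
    print(f"{n}-Dice: {sigma / math.sqrt(n):.4f}")
```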

The standard deviation of gas prices at the 50-Dice Gas Station is about $0.24 (let's call it 25 cents for a quick calculation). Since the gas prices have an approximately normal distribution, about 99.7% of the gas prices fall within 75 cents (three standard deviations) of $3.50. So at the 50-Dice Gas Station, you can expect to pay anywhere from $2.75 to $4.25 per gallon (this is the 99.7 part of the empirical rule). Likewise, about 95% of the gas prices are between $3.00 and $4.00. Table $(13)$ also displays the theoretical standard deviations for several gas stations that we did not simulate. For example, the standard deviation of the gas prices at the 1,000-Dice Gas Station is about $0.054 (roughly 5 cents). So about 99.7% of the gas prices will be between $3.35 and $3.65. At such a gas station, the customers will essentially be paying $3.50 per gallon. The standard deviation of gas prices at the 5,000-Dice Gas Station is about 2 pennies. The standard deviation of gas prices at the 100,000-Dice Gas Station is about half a penny. At these two gas stations, the price you pay for gas will be within pennies of $3.50 (for all practical purposes, the customer should just pay $3.50). At the 1,000,000-Dice (one million dice) Gas Station, the clerk should just skip the rolling of dice and collect $3.50 per gallon.
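The empirical-rule intervals quoted above are simple arithmetic on the standard deviations; this small sketch (ours, with the helper name `empirical_rule_ranges` chosen only for illustration) prints the approximate 95% and 99.7% price ranges for any n. It uses the exact standard deviations, so the endpoints differ by a few cents from the rounded figures in the text.

```python
import math

MU = 3.50                          # long-run mean gas price
SIGMA_1_DIE = math.sqrt(105) / 6   # st dev of a single die value

def empirical_rule_ranges(n):
    """Approximate 95% (2 st dev) and 99.7% (3 st dev) price intervals
    for the n-Dice Gas Station, assuming an approximately normal shape."""
    sd = SIGMA_1_DIE / math.sqrt(n)
    return (MU - 2 * sd, MU + 2 * sd), (MU - 3 * sd, MU + 3 * sd)

for n in [50, 1_000]:
    r95, r997 = empirical_rule_ranges(n)
    print(f"{n}-Dice: 95% within ${r95[0]:.2f}-${r95[1]:.2f}, "
          f"99.7% within ${r997[0]:.2f}-${r997[1]:.2f}")
```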

Looking at the Long Run Average in the 1-Die Gas Station

Another observation we would like to make is that the average gas price is not predictable or stable when we only use a small number of simulated prices. For example, if we only simulate 10 gas prices for the 1-Die Gas Station, will the average of these 10 prices be close to the theoretical $3.50? If we increase the size of the simulation to 100, what will the average be? If the number of simulations keeps increasing, what will the mean gas price look like? The answer lies in looking at the partial averages, i.e. the average of the first 10 gas prices, the first 100 gas prices, and so on for the 1-Die Gas Station. We can look at how the partial average progresses.

$\displaystyle (14) \text{ 1-Die Gas Station: Average of the } \ \ \ \ \begin{bmatrix} \text{First 1 Price}&\text{ }&\$2.00 \\\text{First 2 Prices}&\text{ }&\$2.50 \\\text{First 3 Prices}&\text{ }&\$2.33 \\\text{First 4 Prices}&\text{ }&\$2.00 \\\text{First 5 Prices}&\text{ }&\$2.20 \\\text{First 6 Prices}&\text{ }&\$2.33 \\\text{First 7 Prices}&\text{ }&\$2.29 \\\text{First 8 Prices}&\text{ }&\$2.63 \\\text{First 9 Prices}&\text{ }&\$2.89 \\\text{First 10 Prices}&\text{ }&\$2.80 \\ \text{First 50 Prices}&\text{ }&\$3.16 \\ \text{First 100 Prices}&\text{ }&\$3.32 \\ \text{First 500 Prices}&\text{ }&\$3.436 \\ \text{First 1000 Prices}&\text{ }&\$3.445 \\ \text{First 5000 Prices}&\text{ }&\$3.4292 \\ \text{All 10000 Prices}&\text{ }&\$3.4857 \end{bmatrix}$

The above table illustrates that the sample mean is unpredictable and unstable when the number of gas prices is small. The first gas price is $2.00. Within the first 10 gas purchases at this gas station, the average stays under $3.00; the average of the first 10 prices is $2.80. But as the number of prices increases, the average gets closer and closer to the theoretical mean of $3.50. The sample mean $\overline{x}$ is an accurate estimate of the population mean of $3.50 only when more and more gas prices are added to the mix (i.e. as the sample size increases). This remarkable fact is called the law of large numbers.
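Partial averages like those in table $(14)$ take only a couple of lines to compute. The sketch below (our own) simulates a fresh run of 10,000 1-Die prices, so the early values will differ from table $(14)$, but the later partial averages should settle near $3.50.

```python
import numpy as np

rng = np.random.default_rng()
prices = rng.integers(1, 7, size=10_000)                  # 10,000 simulated 1-Die gas prices
partial_means = np.cumsum(prices) / np.arange(1, 10_001)  # average of the first k prices

for k in [1, 2, 3, 10, 50, 100, 500, 1_000, 5_000, 10_000]:
    print(f"First {k} prices: ${partial_means[k - 1]:.4f}")
```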

The following two figures show how the sample mean $\overline{x}$ of a random sample of gas prices from the 1-Die Gas Station changes as we add more prices to the sample. Figure 15a shows how the sample mean $\overline{x}$ varies for the first 100 simulated gas prices. Figure 15b shows the variation of $\overline{x}$ over all 10,000 simulated gas prices.

Figure 15a shows that within the first 30 prices or so, the sample mean $\overline{x}$ fluctuates wildly below the horizontal line at $3.50. The sample mean $\overline{x}$ becomes more stable as the sample size increases. Figure 15b shows that over the entire 10,000 simulated gas prices, the sample mean $\overline{x}$ is stable and predictable. Eventually the sample mean $\overline{x}$ gets close to the population mean $\mu=\$3.50$ and settles down at that line. Figures 15a and 15b show the behavior of $\overline{x}$ for one instance of a simulation of 10,000 gas prices. If we perform another simulation of 10,000 gas prices for the 1-Die Gas Station, both figures will show a different path from left to right. However, the law of large numbers says that whatever path we get will always settle down at $\mu=\$3.50$.

____________________________________________________________________

Discussions

All the observations we made can be generalized outside of the context of the n-Dice Gas Stations. There are several important probability concepts that can be drawn from the n-Dice Gas Station example. They are:

1. The sample mean as a random variable.
2. The central limit theorem.
3. The mean and standard deviation of the sample mean.
4. The law of large numbers.

1. The Sample Mean As A Random Variable

Given a sample of data values, the sample mean $\overline{x}$ is simply the arithmetic average of all the data values in the sample. It is important to note that the mean of a sample (of a given size) varies. As soon as the data values in the sample change, the calculation of $\overline{x}$ produces a different value. Take the gas prices from the 3-Dice Gas Station listed in table $(3c)$ as an example. The first simulation produced the sample $1, \ 6, \ 6$. The mean is $\overline{x}=4.33$. The second sample is $5,\ 5, \ 6$ and the mean is $\overline{x}=5.33$. As the three dice are rolled again, another sample is produced and the sample mean $\overline{x}$ takes yet another value. So the sample mean $\overline{x}$ is not a static quantity. Because the samples are determined by random chance, the value of $\overline{x}$ cannot be predicted in advance with certainty. Hence the sample mean $\overline{x}$ is a random variable.

2. The Central Limit Theorem

Once we understand that the sample mean $\overline{x}$ is a random variable (that it varies based on random chance), the next question is: what is the probability distribution of the sample mean $\overline{x}$? With respect to the n-Dice Gas Stations, how are the gas prices distributed? The example of the n-Dice Gas Stations shows that the distribution of the sample mean $\overline{x}$ becomes more and more normal as the sample size increases. This means that we can use the normal distribution to make probability statements about the sample mean $\overline{x}$ whenever the sample size $n$ is "sufficiently large". The 1-Die gas prices (the histogram in Figure 4b) form the underlying distribution (or population) from which the random samples for the higher dice gas prices are drawn. In this particular case, the underlying distribution is symmetric (in fact the histogram is flat). The point we would like to make is that even if the starting histogram in Figure 4b were skewed, the subsequent histograms would still become more and more normal. That is, the distribution of the sample mean becomes more and more normal as the sample size increases, regardless of the shape of the underlying population.
Because we start out with a symmetric histogram, it does not take a large increase in the number of dice for the distribution of the sample mean to become normal. Note that the 4-Dice gas prices already produce a histogram that looks sufficiently normal. Thus if the underlying distribution is symmetric (but not bell-shaped), the sample mean of a sample of size 4 or 5 will have a distribution that is adequately normal (as in the example of the n-Dice Gas Stations discussed here). However, if the underlying distribution is skewed, it takes a larger sample size to get a distribution that is close to normal (usually $n=30$ or greater). Moreover, if the underlying population is approximately normal, the sample mean of a sample of size 2 or 3 will be very close to normal. If the underlying distribution is exactly normal, the sample mean of any sample size will have a normal distribution.

3. The Mean and Standard Deviation of the Sample Mean

The above histograms (Figure 4b to Figure 10b) all center at $3.50. However, the spread of the histograms gets smaller as the number of dice increases (also see Table $(13)$). Thus the gas prices hover around $3.50 no matter which n-Dice Gas Station you are in. The standard deviation of the gas prices gets smaller and smaller according to Formula $(12)$. The theoretical mean of the population of the 1-Die gas prices is $\mu=\$3.50$ and the theoretical standard deviation of this population is $\sigma=\$1.707825$. The standard deviation of the gas prices at the n-Dice Gas Station is smaller than the population standard deviation $\sigma$, according to Formula $(12)$ above. In general, the sampling distribution of the sample mean $\overline{x}$ is centered at the population mean $\mu$ and is less spread out than the distribution of the individual population values. If $\mu$ is the mean of the individual population values and $\sigma$ is the standard deviation of the individual population values, then the mean and standard deviation of the sample mean $\overline{x}$ (of a sample of size $n$) are:

\displaystyle \begin{aligned}(16) \ \ \ \ \ \ \ \ \ \ \ &\text{Mean of }\overline{x}=\mu \\&\text{ } \\&\text{Standard Deviation of }\overline{x}=\frac{\sigma}{\sqrt{n}} \end{aligned}

A short sketch verifying $(16)$ empirically appears at the end of this discussion.

4. The Law of Large Numbers

If you buy gas from the 1-Die Gas Station just once, the price you pay may be $1 if you are lucky. However, in the long run, the overall average price per gallon will settle near $3.50. This is the essence of the law of large numbers. In the long run, the owner can expect the average price over all the purchases to be $3.50. The long run business results at the 1-Die Gas Station are stable and predictable (as long as customers keep coming back to buy gas, of course).

In fact, the law of large numbers is the business model of the casino. When a gambler makes bets, the short run results are unpredictable. The gambler may get lucky and win a few bets. However, the long run average (over thousands or tens of thousands of bets, for example) will be very stable and predictable for the casino. For example, the game of roulette gives the casino an edge of 5.26%. Over a large number of bets at the roulette table, the casino can expect to make 5.26 cents in profit for each $1 in wager.
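To close the discussion, here is a minimal sketch (our own) that checks $(16)$ empirically for the 3-Dice Gas Station: the average of many simulated 3-Dice prices should land near $\mu=\$3.50$ and their standard deviation near $\sigma/\sqrt{3} \approx \$0.986$.

```python
import math
import numpy as np

rng = np.random.default_rng(seed=42)

n = 3                                                       # dice per price (sample size)
prices = rng.integers(1, 7, size=(10_000, n)).mean(axis=1)  # 10,000 simulated 3-Dice prices

sigma = math.sqrt(105) / 6                                  # st dev of one die value
print("mean of sample means:  ", round(prices.mean(), 4))          # close to 3.50
print("st dev of sample means:", round(prices.std(ddof=1), 4))     # close to sigma / sqrt(3)
print("theoretical st dev:    ", round(sigma / math.sqrt(n), 4))   # about 0.986
```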

____________________________________________________________________

Summary

The following are the statements of the probability concepts discussed in this post.

• The Law of Large Numbers. Suppose that you draw observations from a population with finite mean $\mu$. As the number of observations increases, the mean $\overline{x}$ of the observed values becomes closer and closer to the population mean $\mu$.

• The Mean and Standard Deviation of Sample Mean. Suppose that $\overline{x}$ is the mean of a simple random sample of size $n$ drawn from a population with mean $\mu$ and standard deviation $\sigma$. Then the sampling distribution of $\overline{x}$ has mean $\mu$ and standard deviation $\displaystyle \frac{\sigma}{\sqrt{n}}$.

• The Central Limit Theorem. Suppose that you draw a simple random sample of size $n$ from any population with mean $\mu$ and finite standard deviation $\sigma$. When the sample size $n$ is large, the sampling distribution of the sample mean $\overline{x}$ is approximately normal with mean $\mu$ and standard deviation $\displaystyle \frac{\sigma}{\sqrt{n}}$.

To read more about the probability concepts discussed here, you can consult your favorite statistics texts or see [2] or [3].

References

1. Burger E. B., Starbird M., The Heart of Mathematics: An Invitation to Effective Thinking, 3rd ed., John Wiley & Sons, 2010
2. Moore D. S., Essential Statistics, W. H. Freeman and Company, New York, 2010
3. Moore D. S., McCabe G. P., Craig B. A., Introduction to the Practice of Statistics, 6th ed., W. H. Freeman and Company, New York, 2009