The price of gas you pay at the pump is influenced by many factors. One such factor is the price of crude oil in the international petroleum market, which can be highly dependent on global macroeconomic conditions. Why don’t we let probability determine the price of gas? We discuss here an experiment that generates gas prices using random chance. The goal of this experiment is to shed some light on probability concepts such as the central limit theorem, the law of large numbers, and the sampling distribution of the sample mean. These concepts are difficult for students in an introductory statistics class. We hope that the examples shown here will be of help to these students.
We came across the idea for this example in one of the references listed at the end of this post, which devotes only one page to the intriguing idea of random gas prices (on page 722). We took the idea and added our own simulations and observations to make the example more accessible.
There is a gas station called 1-Die Gas Station. The price per gallon you pay at this gas station is determined by the roll of a die. Whatever the number that comes up in rolling the die, that is the price per gallon you pay. You may end up paying $1 per gallon if you are lucky, or $6 if not lucky. But if you buy gas repeatedly at this gas station, you pay $3.50 per gallon on average.
Down the street is another gas station called 2-Dice Gas Station. The price you pay there is determined by the average of two dice. For example, if the roll of two dice results in 3 and 5, the price per gallon is $4.
The prices at the 3-Dice Gas Station are determined by taking the average of a roll of three dice. In general, the n-Dice Gas Station works similarly.
The possibility of paying $1 per gallon is certainly attractive for customers. On the other hand, paying $6 per gallon is not so desirable. How likely is it that customers buy gas at these extreme prices? How likely is it that they pay a price in the middle range, say between $3 and $4? In other words, what is the probability distribution of the gas prices in these n-Dice Gas Stations? Another question: is there any difference between the gas prices in the 1-Die Gas Station, the 2-Dice Gas Station and the stations that use more dice? As the number of dice increases, what happens to the probability distribution of the gas prices?
One way to look into the above questions is to generate gas prices by rolling dice. After recording the rolls of the dice, we can use graphs and numerical summaries to look for patterns. Instead of actually rolling dice, we simulate the rolling of dice in an Excel spreadsheet. We simulate 10,000 gas purchases in each of the following gas stations: 1-Die, 2-Dice, 3-Dice, 4-Dice, 10-Dice, 30-Dice and 50-Dice.
How Gas Prices are Simulated
We use the RAND() function in Excel to simulate rolls of dice. Here’s how a roll of a die is simulated. The function generates a random number x that is between 0 and 1. If x is between 0 and 1/6, it is considered a roll of a die that produces a 1. If x is between 1/6 and 2/6, it is considered a roll of a die that produces a 2, and so on. The following rule describes how a random number is assigned a value of the die:
die value = k if (k - 1)/6 ≤ x < k/6, for k = 1, 2, 3, 4, 5, 6
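The assignment rule can also be written as a few lines of code. The following Python sketch is only an illustration and is not the Excel spreadsheet used for the simulations; the function names `die_from_uniform` and `roll_die` are hypothetical, and Python's `random.random()` stands in for the spreadsheet's random number function.

```python
import math
import random


def die_from_uniform(x: float) -> int:
    """Map a uniform random number x in [0, 1) to a die value 1..6.

    The interval [0, 1) is cut into six pieces of equal width 1/6;
    the k-th piece corresponds to the die value k.
    """
    return math.floor(6 * x) + 1


def roll_die(rng: random.Random) -> int:
    """Simulate one roll of a fair die from one uniform random number."""
    return die_from_uniform(rng.random())


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    print([roll_die(rng) for _ in range(10)])
```

Because each of the six intervals has the same width, each die value comes up with probability 1/6.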
For the 1-Die Gas Station, we simulated 10,000 rolls of a die (as described above). These 10,000 die values are considered the gas prices for 10,000 purchases. For the 2-Dice Gas Station, the simulation consists of 10,000 simulated rolls of a pair of dice. The 10,000 gas prices are obtained by taking the average of each pair of simulated dice values. For the 3-Dice Gas Station, the simulation consists of 10,000 iterations where each iteration is one simulated roll of three dice (i.e. 10,000 random samples of dice values where each sample is of size 3). Then the 10,000 gas prices are obtained by taking the average of the three dice values in each iteration (i.e. taking the mean of each sample). The other n-Dice Gas Stations are simulated in a similar fashion.
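The simulation procedure just described can be sketched in Python as follows. This is a minimal sketch under our own assumptions, not a reproduction of the spreadsheet: the function name `simulate_prices`, its parameters and the seed are hypothetical, and `random.randint` replaces the spreadsheet's die-assignment rule.

```python
import random


def simulate_prices(n_dice, n_purchases=10_000, seed=0):
    """Return a list of simulated gas prices for an n-Dice Gas Station.

    Each price is the mean of one simulated roll of n_dice fair dice,
    i.e. the mean of one random sample of size n_dice.
    """
    rng = random.Random(seed)
    prices = []
    for _ in range(n_purchases):
        roll = [rng.randint(1, 6) for _ in range(n_dice)]
        prices.append(sum(roll) / n_dice)
    return prices


if __name__ == "__main__":
    for n in (1, 2, 3):
        first_ten = simulate_prices(n)[:10]
        print(f"{n}-dice:", [round(p, 2) for p in first_ten])
```

The simulated prices will differ from the ones reported in this post, since every run of a simulation draws its own random numbers.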
By going through the process in the above paragraphs, 10,000 gas prices are simulated for each of the gas stations. The following shows the first 10 simulated gas prices in the first three gas stations.
Looking at the Simulated Gas Prices Graphically
We summarized the 10,000 gas prices in each gas station into a frequency distribution and a histogram. Watch the progression of the histograms from 1-Die, 2-Dice, and all the way to 50-Dice.
As the number of dice increases, the histogram becomes more and more bell-shaped. It is a remarkable fact that as the number of dice increases, the distribution of the gas prices changes shape. The gas prices at the 1-Die Gas Station are uniform (the bars in the histogram have basically the same height). The gas prices at the 2-Dice Gas Station are no longer uniform. The 3-Dice and 4-Dice gas prices are approaching bell-shaped. The shapes of the 30-Dice and 50-Dice gas prices are undeniably bell-shaped.
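For a small number of dice, the distribution of the gas prices can also be worked out exactly instead of being simulated. The sketch below enumerates all equally likely outcomes of rolling n dice; the function name `exact_price_distribution` is hypothetical, and the approach is only practical for a handful of dice because the number of outcomes is 6 to the power n.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def exact_price_distribution(n_dice):
    """Return {price: probability} for the mean of n_dice fair dice."""
    # Count how many of the 6**n_dice equally likely rolls give each total.
    counts = Counter(sum(roll) for roll in product(range(1, 7), repeat=n_dice))
    outcomes = 6 ** n_dice
    return {
        Fraction(total, n_dice): Fraction(count, outcomes)
        for total, count in sorted(counts.items())
    }


if __name__ == "__main__":
    for price, prob in exact_price_distribution(2).items():
        print(f"${float(price):.2f}  {prob}")
```

For two dice the probabilities rise from 1/36 at $1 to 1/6 at $3.50 and fall back to 1/36 at $6, which is why the 2-Dice histogram is no longer flat.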
Each gas price is the average of a sample of dice values. For example, each 1-Die gas price is the mean of a sample of 1 die value, each 2-Dice gas price is the mean of a sample of two dice values and each 3-Dice gas price is the mean of a sample of 3 dice values and so on. What we are witnessing in the above series of histograms is that as the sample size increases, the shape of the distribution of the sample mean becomes more and more normal. The above series of histograms (from Figure 4b to Figure 10b) is a demonstration of the central limit theorem in action.
Looking at the Simulated Gas Prices Using Numerical Summaries
Note that all the above histograms center at about $3.50, and that the spread is getting smaller as the number of dice increases. To get a sense that the spread is getting smaller, note in the histograms (as well as the frequency distributions) that the min and max gas prices for 1-Die, 2-Dice, 3-Dice and 4-Dice are $1 and $6, respectively. However, the price range for the higher dice gas stations is smaller. For example, the price ranges are $1.70 to $5.30 (for 10-Dice), $2.33 to $4.60 (30-Dice) and $2.60 to $4.64 (50-Dice). So in these higher dice stations, very cheap gas (e.g. $1) and very expensive gas (e.g. $6) are still possible in principle, but they are so unlikely that they never appeared in 10,000 simulated purchases. To confirm what we see, look at the following table of numerical summaries, which are calculated using the 10,000 simulated gas prices for each gas station.
Note that the mean gas price is about $3.50 in all the gas stations. The standard deviation is getting smaller as the number of dice increases. Note that each standard deviation in the above table is the standard deviation of gas prices. As noted before, each gas price is the mean of a sample of simulated dice values. What we are seeing is that the standard deviation of sample means gets smaller as the sample size increases.
The theoretical mean and standard deviation for the 1-Die gas prices are $3.50 and about $1.708 (the square root of 35/12), respectively. As the number of dice increases, the mean of the gas prices (sample means) should remain close to $3.50 (as seen in the above table). But the standard deviation of the sample averages gets smaller. The standard deviation of the gas prices gets smaller according to the following formula:
standard deviation of n-Dice gas prices = σ/√n ≈ 1.708/√n, where σ ≈ 1.708 is the standard deviation of a single die value and n is the number of dice
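For readers who want to see where the theoretical mean and standard deviation of a single die come from, here is a short worked derivation. It assumes a fair die, so each value 1 through 6 has probability 1/6; the symbols X (one die value), μ and σ are our notation.

```latex
\begin{aligned}
\mu &= E[X] = \frac{1+2+3+4+5+6}{6} = \frac{21}{6} = 3.5 \\
E[X^2] &= \frac{1^2+2^2+3^2+4^2+5^2+6^2}{6} = \frac{91}{6} \\
\sigma^2 &= E[X^2] - \mu^2 = \frac{91}{6} - \frac{49}{4} = \frac{35}{12} \\
\sigma &= \sqrt{\frac{35}{12}} \approx 1.708
\end{aligned}
```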
According to the above formula, the standard deviation of gas prices will get smaller and smaller toward zero as the number of dice increases. This means that the price you pay at these gas stations will be close to $3.50 as long as the gas station uses a large number of dice to determine the price. The following table compares the standard deviations of the simulated prices and the standard deviations computed using the above formula. Note the agreement between the observed standard deviations and the theoretical standard deviations.
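The comparison between observed and theoretical standard deviations can be repeated with a few lines of Python. This is a sketch under our own assumptions (hypothetical function names, a fresh simulation with an arbitrary seed), so the simulated values will not match the ones in this post digit for digit.

```python
import math
import random
from statistics import pstdev

SIGMA_ONE_DIE = math.sqrt(35 / 12)  # standard deviation of one fair die


def simulated_sd(n_dice, n_purchases=10_000, seed=1):
    """Standard deviation of n_purchases simulated n-dice gas prices."""
    rng = random.Random(seed)
    prices = [
        sum(rng.randint(1, 6) for _ in range(n_dice)) / n_dice
        for _ in range(n_purchases)
    ]
    return pstdev(prices)


def theoretical_sd(n_dice):
    """Standard deviation of the mean of n_dice dice: sigma / sqrt(n)."""
    return SIGMA_ONE_DIE / math.sqrt(n_dice)


if __name__ == "__main__":
    for n in (1, 2, 3, 4, 10, 30, 50):
        print(f"{n:>2}-dice  simulated {simulated_sd(n):.3f}"
              f"  theoretical {theoretical_sd(n):.3f}")
```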
The standard deviation of gas prices in the 50-Dice Gas Station is about $0.24 (let’s say it is 25 cents for use in a quick calculation). Since the gas prices have an approximately normal distribution, we know that virtually all of the gas prices fall within three standard deviations of the mean, i.e. from 75 cents below $3.50 to 75 cents above $3.50. So in the 50-Dice Gas Station, you can expect to pay anywhere from $2.75 to $4.25 per gallon (this is the 99.7 part of the empirical rule). On the other hand, about 95% of the gas prices are between $3.00 and $4.00.
The table also displays the theoretical standard deviations for several gas stations that we did not simulate. For example, the standard deviation of the gas prices in the 1000-Dice Gas Station is about $0.054 (let’s say 5 cents). So about 99.7% of the gas prices will be between $3.35 and $3.65. In such a gas station, the customers will be essentially paying $3.50 per gallon.
The standard deviation of gas prices in the 5000-Dice Gas Station is about 2 pennies. The standard deviation of gas prices in the 100,000-Dice Gas Station is about half a penny. In these two gas stations, the prices you pay for gas will be within pennies of $3.50 (for all practical purposes, the customer should just pay $3.50). In the 1,000,000-Dice (one million dice) Gas Station, the clerk should just skip the rolling of dice and collect $3.50 per gallon.
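The quick empirical-rule calculations above can be done without rounding the standard deviation first. The sketch below assumes the prices are approximately normal, which is reasonable for a large number of dice; the function name `price_interval` and the constant names are hypothetical.

```python
import math

MU = 3.5                     # mean price at every n-Dice Gas Station
SIGMA = math.sqrt(35 / 12)   # standard deviation of a single die


def price_interval(n_dice, k=3):
    """Return (low, high) = mean plus or minus k standard deviations.

    With k=2 the interval holds about 95% of the prices and with k=3
    about 99.7%, assuming an approximately normal distribution.
    """
    half_width = k * SIGMA / math.sqrt(n_dice)
    return MU - half_width, MU + half_width


if __name__ == "__main__":
    for n in (50, 1000, 5000, 100_000):
        low, high = price_interval(n)
        print(f"{n:>7}-dice: ${low:.3f} to ${high:.3f}")
```

Without rounding, the three-standard-deviation interval for the 1000-Dice Gas Station is roughly $3.34 to $3.66, slightly wider than the quick calculation based on 5 cents.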
Looking at the Long Run Average in 1-Die Gas Station
Another observation we would like to make is that the average gas price is neither predictable nor stable when we use only a small number of simulated prices. For example, if we simulate only 10 gas prices for the 1-Die Gas Station, will the average of these 10 prices be close to the theoretical $3.50? If we increase the size of the simulation to 100, what will the average be? If the number of simulated prices keeps increasing, what will the mean gas price be like? The answer lies in looking at the partial averages, i.e. the average of the first 10 gas prices, the first 100 gas prices and so on for the 1-Die Gas Station. We can look at how the partial average progresses.
The above table illustrates that the sample mean is unpredictable and unstable if the number of gas prices is small. The first gas price is $2.00. Within the first 10 gas purchases at this gas station, the average is under $3.00. The average of the first 10 prices is $2.80. But as the number of prices increases, the average becomes closer and closer to the theoretical mean $3.50. The sample mean becomes an accurate estimate of the population mean $3.50 only as more and more gas prices are added to the mix (i.e. as the sample size increases). This remarkable fact is called the law of large numbers.
The following two figures show how the sample mean x̄ of a random sample of gas prices from the 1-Die Gas Station changes as we add more prices to the sample. Figure 15a shows how the sample mean varies for the first 100 simulated gas prices. Figure 15b shows the variation of x̄ over all 10,000 simulated gas prices.
Figure 15a shows that within the first 30 prices or so, the sample mean fluctuates wildly under the horizontal line at $3.50. The sample mean becomes more stable as the sample size increases. Figure 15b shows that over the entire 10,000 simulated gas prices, the sample mean is stable and predictable. Eventually the sample mean gets close to the population mean and settles down at that line.
Figures 15a and 15b show the behavior of the sample mean x̄ for one instance of simulation of 10,000 gas prices. If we perform another simulation of 10,000 gas prices for the 1-Die Gas Station, both figures will show a different path from left to right. However, the law of large numbers says that whatever path we get will settle down at the population mean μ = $3.50.
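The partial averages discussed in this section are easy to compute for a fresh simulation. The following Python sketch is our own illustration (the function name `running_means` and the seed are hypothetical); it produces a different path from the one shown in the figures, but a path that should settle near $3.50 in the same way.

```python
import random
from itertools import accumulate


def running_means(n_rolls=10_000, seed=2):
    """Return the partial averages of n_rolls simulated 1-Die gas prices.

    Element k - 1 of the result is the mean of the first k prices.
    """
    rng = random.Random(seed)
    rolls = [rng.randint(1, 6) for _ in range(n_rolls)]
    totals = accumulate(rolls)  # running sums of the prices
    return [total / count for count, total in enumerate(totals, start=1)]


if __name__ == "__main__":
    means = running_means()
    for k in (1, 10, 100, 1000, 10_000):
        print(f"mean of first {k:>6} prices: {means[k - 1]:.3f}")
```

Changing the seed gives a different path through the early prices, while the averages over thousands of prices stay close to $3.50.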
All the observations we make can be generalized outside of the context of n-Dice Gas Stations. There are several important probability concepts that can be drawn from the n-Dice Gas Station example. They are:
- The sample mean as a random variable.
- The central limit theorem.
- The mean and standard deviation of the sample mean.
- The law of large numbers.
1. The Sample Mean As A Random Variable. Given a sample of data values, the sample mean x̄ is simply the arithmetic average of all the data values in the sample. It is important to note that the mean of a sample (of a given size) varies. As soon as the data values in the sample change, the calculation of x̄ will produce a different value. Take the gas prices from the 3-Dice Gas Station as an example. Each simulated roll of three dice produces one sample, and the mean of that sample is one gas price. For instance, if the three dice show 2, 5 and 6, the mean is 13/3, or about 4.33; if the next roll shows 1, 3 and 4, the mean is 8/3, or about 2.67. As the three dice are rolled again, another sample is produced and the sample mean will generally be a different value. So the sample mean is not a static quantity. Because the samples are determined by random chance, the value of x̄ cannot be predicted in advance with certainty. Hence the sample mean is a random variable.
2. The Central Limit Theorem. Once we understand that the sample mean is a random variable (that it varies based on random chance), the next question is: what is the probability distribution of the sample mean x̄? With respect to the n-Dice Gas Stations, how are the gas prices distributed?
The example of n-Dice Gas Stations shows that the distribution of the sample mean becomes more and more normal as the sample size increases. This means that we can use the normal distribution to make probability statements about the sample mean whenever the sample size is “sufficiently large”.
The distribution of the 1-Die gas prices (the histogram in Figure 4b) is the underlying distribution (or population) from which the random samples for the higher dice gas prices are drawn. In this particular case, the underlying distribution has a shape that is symmetric (in fact the histogram is flat). The point we would like to make is that even if the histogram in Figure 4b (the starting histogram) had a shape that is skewed, the subsequent histograms would still become more and more normal. That is, the distribution of the sample mean becomes more and more normal as the sample size increases, regardless of the shape of the underlying population.
Because we start out with a symmetric histogram, it does not take a large increase in the number of dice for the distribution of the sample mean to become approximately normal. Note that the 4-Dice gas prices produce a histogram that looks sufficiently normal. Thus if the underlying distribution is symmetric (but not bell-shaped), the mean of a sample of size 4 or 5 will have a distribution that is adequately normal (like the example of n-Dice Gas Stations discussed here).
However, if the underlying distribution is skewed, it will take a larger sample size (often considerably larger than 4 or 5) to get a distribution that is close to normal.
Moreover, if the underlying population is approximately normal, the mean of a sample of size 2 or 3 will be very close to normal. If the underlying population has a normal distribution, the sample mean will have a normal distribution for any sample size.
3. The Mean and Standard Deviation of Sample Mean
The above histograms (Figure 4b to Figure 10b) center at $3.50. However, the spread of the histograms gets smaller as the number of dice increases (also see the table of standard deviations above). Thus the gas prices hover around $3.50 no matter which n-Dice Gas Station you are in. The standard deviation of the gas prices gets smaller and smaller according to the formula given earlier.
The theoretical mean of the population of the 1-Die gas prices is μ = $3.50 and the theoretical standard deviation of this population is σ ≈ $1.708. The standard deviation of the gas prices at the n-Dice Gas Station is smaller than the population standard deviation by a factor of √n, according to the formula above.
In general, the sampling distribution of the sample mean is centered at the population mean and is less spread out than the distribution of the individual population values. If μ is the mean of the individual population values and σ is the standard deviation of the individual population values, then the mean and standard deviation of the sample mean x̄ (of a sample of size n) are:
mean of x̄ = μ, standard deviation of x̄ = σ/√n
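A brief justification of these two facts may be helpful. The derivation below assumes that the n observations are independent draws from the same population, as the dice rolls are; the symbols X_1, ..., X_n for the individual observations are our notation.

```latex
\begin{aligned}
\bar{x} &= \frac{1}{n}\sum_{i=1}^{n} X_i \\
E[\bar{x}] &= \frac{1}{n}\sum_{i=1}^{n} E[X_i] = \frac{1}{n}\, n\mu = \mu \\
\operatorname{Var}(\bar{x}) &= \frac{1}{n^2}\sum_{i=1}^{n} \operatorname{Var}(X_i)
  = \frac{1}{n^2}\, n\sigma^2 = \frac{\sigma^2}{n} \\
\operatorname{SD}(\bar{x}) &= \frac{\sigma}{\sqrt{n}}
\end{aligned}
```

Independence is what allows the variance of the sum to be written as the sum of the variances in the third line.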
4. The Law of Large Numbers. If you buy gas from the 1-Die Gas Station only once, the price you pay may be $1 if you are lucky. However, in the long run, the overall average price per gallon will settle near $3.50. This is the essence of the law of large numbers. In the long run, the owner can expect the average price over all the purchases to be $3.50. The long run business results in the 1-Die Gas Station are stable and predictable (as long as customers keep coming back to buy gas, of course).
In fact, the law of large numbers is the business model of the casino. When a gambler makes bets, the short run results are unpredictable. The gambler may get lucky and win a few bets. However, the long run average (over thousands or tens of thousands of bets, for example) will be very stable and predictable for the casino. For example, the game of roulette gives the casino an edge of 5.26%. Over a large number of bets at the roulette table, the casino can expect to make 5.26 cents in profit for each $1 wagered.
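The 5.26% figure can be checked with a short calculation and a simulation. The sketch below rests on assumptions that are ours and are not stated in this post: an American-style wheel with 38 pockets and an even-money bet (such as red) that wins on 18 of them. The function names are hypothetical.

```python
import random
from fractions import Fraction

POCKETS = 38   # assumption: American wheel, 1 to 36 plus 0 and 00
WINNING = 18   # assumption: even-money bet such as red


def house_edge():
    """Exact expected casino profit per $1 wagered on the even-money bet."""
    p_win = Fraction(WINNING, POCKETS)
    player_expectation = p_win * 1 + (1 - p_win) * (-1)
    return -player_expectation


def average_casino_profit(n_bets, seed=3):
    """Simulated average casino profit per $1 bet over n_bets bets."""
    rng = random.Random(seed)
    profit = 0
    for _ in range(n_bets):
        player_wins = rng.randrange(POCKETS) < WINNING
        profit += -1 if player_wins else 1
    return profit / n_bets


if __name__ == "__main__":
    print(f"exact edge: {float(house_edge()):.4f}")
    for n in (10, 1000, 100_000):
        print(f"average profit over {n:>6} bets: {average_casino_profit(n):+.4f}")
```

Under these assumptions the exact edge is 2/38, or about 5.26 cents per dollar. The simulated average can be far from that value over a handful of bets and moves toward it as the number of bets grows.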
The following are the statements of the probability concepts discussed in this post.
- The Law of Large Numbers. Suppose that you draw observations from a population with finite mean μ. As the number of observations increases, the mean x̄ of the observed values becomes closer and closer to the population mean μ.
- The Mean and Standard Deviation of Sample Mean. Suppose that x̄ is the mean of a simple random sample of size n drawn from a population with mean μ and standard deviation σ. Then the sampling distribution of x̄ has mean μ and standard deviation σ/√n.
- The Central Limit Theorem. Suppose that you draw a simple random sample of size n from any population with mean μ and finite standard deviation σ. When the sample size n is large, the sampling distribution of the sample mean x̄ is approximately normal with mean μ and standard deviation σ/√n.
To read more about the probability concepts discussed here, you can consult your favorite statistics texts or see the references listed below.
- Burger E. B., Starbird M., The Heart of Mathematics, An invitation to effective thinking, 3rd ed., John Wiley & Sons, Inc, 2010
- Moore D. S., Essential Statistics, W. H. Freeman and Company, New York, 2010
- Moore D. S., McCabe G. P., Craig B. A., Introduction to the Practice of Statistics, 6th ed., W. H. Freeman and Company, New York, 2009