## The probability of breaking the bank

What is the likelihood of a gambler winning all the cash in a casino? A better question is: what is the likelihood of a gambler losing all the money he or she brings into the casino?

Casinos have certainly gone out of business. But those bankruptcies were the result of business failures, e.g. not attracting enough customers, and not the result of gamblers breaking the bank. The business model of the casino is based on a built-in advantage called the house edge, which in mathematical terms is the casino’s average profit from a player’s bet, expressed as a percentage of the bet. The house edge differs from game to game but is always positive (e.g. the house edge for the game of roulette is 5.26%).

As long as a casino has a steady stream of customers willing to plunk down cash to play various gambling games, the casino will win in the long run. For example, for every $100 bet in the game of roulette, the house wins $5.26 on average. This post discusses a problem called gambler’s ruin that further highlights a gambler’s dim prospects against the house.

The problem of gambler’s ruin features two players A and B, betting on the outcome of a coin. At the start of the game, there are $n$ units of wealth between the two players, with player A owning $i$ units and player B owning $n-i$ units. In each play of the game, a coin is tossed. If the result of the coin toss is a head, player A collects 1 unit from player B. If the result of the coin toss is a tail, player A pays player B 1 unit. The game continues until one player owns all $n$ units. What is the probability that player A ends up with all the $n$ units of wealth? What is the probability that player B ends up with all the $n$ units of wealth? The probability of winning for one player is the probability of losing for the other player. The following further clarifies the probabilities being discussed.

If the total wealth $n$ is a large number and player A owns a small proportion of that, then player A is like the gambler and player B is like the house. Then the probability of player A owning all $n$ units of initial wealth is the probability of player A breaking the bank, and the probability of player B owning all $n$ units of wealth is the probability of ruin for player A.

In playing this game, player A gains 1 unit for each coin toss that results in a head and loses 1 unit for each coin toss that results in a tail. Theoretically player A can go broke (losing all the $i$ units that he starts with) or break the bank (owning the total $n$ units between the two players). Losing it all for player A is certainly plausible especially if there is a long run of tails in the coin tosses. On the other hand, winning everything is plausible for player A too, especially if there is a long run of heads. Which is more likely?

The answer to the problem of gambler’s ruin will produce two formulas that can tell us the likelihood of player A winning, and the probability of player A losing everything.

______________________________________________________________

Gambler’s Ruin

The answers were worked out in a recent post in a companion blog on probability. In this post, we discuss the answers and make some remarks on gambling. We first discuss the case for even odds and then the general case.

Even Odds. Suppose that player A and player B bet on tosses of a fair coin. Suppose further that player A owns $i$ units of wealth and player B owns $n-i$ units at the beginning. The long run probability of player A winning all the units is $\displaystyle A_i=\frac{i}{n}$. The long run probability of player B winning all the units is $\displaystyle B_i=\frac{n-i}{n}$.

The probability of a player owning all the original combined wealth is the ratio of the initial wealth of that player to the total initial wealth. Putting it another way, the probability of ruin for a player is the ratio of the initial wealth of the other player to the total initial wealth. For the player with only a tiny percentage of the combined wealth, ruin is virtually certain. For example, if the house (player B) has 10,000,000 units and you the gambler (player A) have 100 units, then the probability of ruin for you is 0.99999 (99.999% chance). This is based on the assumption that all bets are at even odds, which is not a realistic scenario when playing against the house. In the uneven odds case, the losing probabilities for the gambler are even worse.

Uneven Odds. Suppose that player A and player B bet on tosses of a coin such that the probability of tossing a head is $p$ with $p$ not 0.5. Let $q=1-p$. Suppose further that player A owns $i$ units of wealth and player B owns $n-i$ units at the beginning. The long run probability of player A winning all the units is $\displaystyle A_i=\frac{1-\biggl(\displaystyle \frac{q}{p} \biggr)^i}{\displaystyle 1-\biggl(\frac{q}{p} \biggr)^n}$. The long run probability of player B winning all the units is $\displaystyle B_i=\frac{1-\biggl(\displaystyle \frac{p}{q} \biggr)^{n-i}}{\displaystyle 1-\biggl(\frac{p}{q} \biggr)^n}$.

First, let’s unpack these formulas. Let’s say the coin has a slight bias with the probability of a head being 0.49. Then $p$ = 0.49 (the house edge is 1%) and $q$ = 0.51 with $\frac{q}{p}=\frac{0.51}{0.49}$ = 1.040816327. Let’s say the total initial wealth is 100 units and player A owns 10 units at the beginning. Then $A_{10}$ is the following:

$\displaystyle A_{10}=\frac{1-\biggl(\displaystyle \frac{0.51}{0.49} \biggr)^{10}}{1-\biggl(\displaystyle \frac{0.51}{0.49} \biggr)^{100}}=0.00917265$

In the even odds case, player A would have a 10% chance of winning everything (by owning 10% of the total initial wealth). With the house having just a 1% edge, the chance for winning everything is now 0.917%, less than 1% chance!
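
The two formulas can be checked numerically. The following is a minimal Python sketch that evaluates $A_i$ directly from the formulas above, treating the fair coin as a special case. The function name `prob_a_wins` is our own label and does not come from the companion post.

```python
import math


def prob_a_wins(i: int, n: int, p: float) -> float:
    """Chance that player A, starting with i of the n units, ends up with all n.

    p is the probability of a head (player A gains 1 unit) on each toss.
    """
    if not 0 < p < 1:
        raise ValueError("p must be strictly between 0 and 1")
    if n <= 0 or not 0 <= i <= n:
        raise ValueError("need 0 <= i <= n with n positive")
    if math.isclose(p, 0.5):
        return i / n  # even odds: player A's share of the total initial wealth
    ratio = (1 - p) / p  # this is q / p
    return (1 - ratio**i) / (1 - ratio**n)


# The worked example: 10 of 100 units, fair coin versus p = 0.49.
print(f"fair coin: {prob_a_wins(10, 100, 0.5):.8f}")
print(f"p = 0.49 : {prob_a_wins(10, 100, 0.49):.8f}")
```

By symmetry, the chance that player B wins is the same function seen from B's side: B starts with $n-i$ units and wins each toss with probability $q$, so `prob_a_wins(90, 100, 0.51)` gives $B_{10}$.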

To calculate the probability of player B winning everything, use the ratio $\frac{p}{q}$ instead of $\frac{q}{p}$. In the numerator, $\frac{p}{q}$ is raised to $n-i$, the initial wealth of player B. In other words, the ratio in the numerator is always raised to the initial wealth of the player whose winning probability is being calculated. In the example of $p$ = 0.49, plug in $\frac{p}{q}=\frac{0.49}{0.51}$ to obtain $B_{10}$ = 0.99082735. There is a more than 99% chance that player B (the house) will own all 100 units.

Note that $A_{10}+B_{10}$ equals 1.0. In fact, $A_{i}+B_i$ is always 1.0. Thus $B_{i}$ can be calculated by $1-A_{i}$. In other words, the probability of breaking the bank plus the probability of ruin is always 1.0. When a gambler plays at a casino, there are only two possibilities, either breaking the bank or ruin. There is no third possibility, e.g. the game goes indefinitely with no winner.
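
One informal way to see this is to simulate the game. The sketch below is only an illustration under assumptions of our own: a small hypothetical game ($i$ = 5, $n$ = 20, $p$ = 0.45) chosen so that it runs quickly, a fixed random seed, and helper names (`play_until_done`, `simulate`) that are ours. Every simulated game stops at either 0 or $n$ units, and the fraction of games won by player A should land close to the formula for $A_i$.

```python
import random


def play_until_done(i: int, n: int, p: float, rng: random.Random) -> bool:
    """Toss the coin until player A holds 0 or n units; True if A holds all n."""
    wealth = i
    while 0 < wealth < n:
        wealth += 1 if rng.random() < p else -1
    return wealth == n


def simulate(i: int, n: int, p: float, games: int, seed: int = 2017) -> float:
    """Fraction of simulated games in which player A breaks the bank."""
    rng = random.Random(seed)
    wins = sum(play_until_done(i, n, p, rng) for _ in range(games))
    return wins / games


# Hypothetical small game (not one of the casino games in this post).
i, n, p = 5, 20, 0.45
ratio = (1 - p) / p  # q / p
formula = (1 - ratio**i) / (1 - ratio**n)
estimate = simulate(i, n, p, games=20_000)

print(f"formula for A_i : {formula:.4f}")
print(f"simulated share : {estimate:.4f}")
```

The simulated share differs from the formula only by sampling noise, which shrinks as the number of games grows.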

______________________________________________________________

More Calculations

Using Excel, the winning probabilities for player A are calculated for two scenarios, one for total wealth of 100 units and one for total wealth of 10,000 units. The results are in Table 1 and Table 2. Each table uses four values of $p$, starting with the fair game of $p$ = 0.5. The column for $p$ = 0.49 does not correspond to an actual casino game. It is used to demonstrate what happens when the house has a 1% edge. The last two columns are for the game of craps and the game of roulette.

Table 1 – Probabilities of Player A Breaking the Bank
Total Initial Wealth = 100 Units

$\begin{array}{ccccccccc} \text{Player A} & \text{ } & \text{ } & \text{ } & \text{ } & \text{ } & \text{ } & \text{ } & \text{ } \\ \text{Initial} & \text{ } & \text{Fair} & \text{ } & \text{House} & \text{ } & \text{Craps} & \text{ } & \text{Roulette} \\ \text{Wealth} & \text{ } & \text{Game} & \text{ } & \text{Edge 1\%} & \text{ } & \text{ } & \text{ } & \text{ } \\ \text{i} & \text{ } & p=0.5 & \text{ } & p=0.49 & \text{ } & p=0.493 & \text{ } & p=0.474 \\ \text{ } & \text{ } & \text{ } & \text{ } & \text{ } & \\ i=10 & \text{ }& 0.10 & \text{ } & 0.0092 & \text{ } & 0.0209 & \text{ } & 0.0001 \\ i=20 & \text{ }& 0.20 & \text{ } & 0.0229 & \text{ } & 0.0486 & \text{ } & 0.0002 \\ i=30 & \text{ }& 0.30 & \text{ } & 0.0433 & \text{ } & 0.0852 & \text{ } & 0.0007 \\ i=40 & \text{ }& 0.40 & \text{ } & 0.0737 & \text{ } & 0.1337 & \text{ } & 0.0019 \\ i=50 & \text{ }& 0.50 & \text{ } & 0.1192 & \text{ } & 0.1978 & \text{ } & 0.0055 \\ i=60 & \text{ }& 0.60 & \text{ } & 0.1870 & \text{ } & 0.2826 & \text{ } & 0.0155 \\ i=70 & \text{ }& 0.70 & \text{ } & 0.2881 & \text{ } & 0.3949 & \text{ } & 0.0440 \\ i=80 & \text{ }& 0.80 & \text{ } & 0.4390 & \text{ } & 0.5434 & \text{ } & 0.1247 \\ i=90 & \text{ }& 0.90 & \text{ } & 0.6641 & \text{ } & 0.7400 & \text{ } & 0.3531 \\ \end{array}$

Table 2 – Probabilities of Player A Breaking the Bank
Total Initial Wealth = 10,000 Units

$\begin{array}{ccccccccc} \text{Player A} & \text{ } & \text{ } & \text{ } & \text{ } & \text{ } & \text{ } & \text{ } & \text{ } \\ \text{Initial} & \text{ } & \text{Fair} & \text{ } & \text{House} & \text{ } & \text{Craps} & \text{ } & \text{Roulette} \\ \text{Wealth} & \text{ } & \text{Game} & \text{ } & \text{Edge 1\%} & \text{ } & \text{ } & \text{ } & \text{ } \\ \text{i} & \text{ } & p=0.5 & \text{ } & p=0.49 & \text{ } & p=0.493 & \text{ } & p=0.474 \\ \text{ } & \text{ } & \text{ } & \text{ } & \text{ } & \\ i=100 & \text{ }& 0.010 & \text{ } & 0.0000 & \text{ } & 0.0000 & \text{ } & 0.0000 \\ i=500 & \text{ }& 0.050 & \text{ } & 0.0000 & \text{ } & 0.0000 & \text{ } & 0.0000 \\ i=1000 & \text{ }& 0.100 & \text{ } & 0.0000 & \text{ } & 0.0000 & \text{ } & 0.0000 \\ i=5000 & \text{ }& 0.500 & \text{ } & 0.0000 & \text{ } & 0.0000 & \text{ } & 0.0000 \\ i=9800 & \text{ }& 0.980 & \text{ } & 0.0003 & \text{ } & 0.0037 & \text{ } & 0.0000 \\ i=9850 & \text{ }& 0.985 & \text{ } & 0.0025 & \text{ } & 0.0150 & \text{ } & 0.0000 \\ i=9900 & \text{ }& 0.990 & \text{ } & 0.0183 & \text{ } & 0.0608 & \text{ } & 0.0000 \\ i=9950 & \text{ }& 0.995 & \text{ } & 0.1353 & \text{ } & 0.2466 & \text{ } & 0.0055 \\ i=9990 & \text{ }& 0.999 & \text{ } & 0.6703 & \text{ } & 0.7558 & \text{ } & 0.3531 \\ \end{array}$
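
The tables in this post were produced in Excel. As a rough cross-check, the following Python sketch recomputes them. One detail is an assumption of ours rather than part of the original calculation: for $p$ below 0.5 the formula is rearranged (numerator and denominator divided by $(q/p)^n$) so that only powers of $\frac{p}{q}$ appear. Evaluated directly, $(q/p)^{10000}$ for the roulette column is too large for a double-precision float. The names `win_prob` and `print_table` are hypothetical.

```python
def win_prob(i: int, n: int, p: float) -> float:
    """Probability that player A (i of n units, head probability p) wins all n."""
    q = 1.0 - p
    if abs(p - q) < 1e-12:
        return i / n  # fair game
    if p > q:
        r = q / p  # below 1, so large powers shrink toward 0
        return (1 - r**i) / (1 - r**n)
    s = p / q  # below 1; same formula after dividing through by (q/p)**n
    return (s ** (n - i) - s**n) / (1 - s**n)


P_VALUES = (0.5, 0.49, 0.493, 0.474)  # fair, p = 0.49, craps, roulette


def print_table(total: int, starts: list[int]) -> None:
    """Print player A's winning probability for each start and each p."""
    print(f"Total initial wealth = {total:,} units")
    print("     i " + "".join(f"  p={p:<6}" for p in P_VALUES))
    for i in starts:
        row = "".join(f"  {win_prob(i, total, p):8.4f}" for p in P_VALUES)
        print(f"{i:>6} {row}")


print_table(100, list(range(10, 100, 10)))
print()
print_table(10_000, [100, 500, 1000, 5000, 9800, 9850, 9900, 9950, 9990])
```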

Table 1 describes the gambling situations where the two players have a combined initial wealth of 100 units. It is not a realistic scenario for the situation of gambler vs. house as the total wealth is quite moderate. Perhaps it best describes interpersonal gambling situations where the initial wealth positions of the two players are moderate in size. It is clear that the values of $p$ are critical to the eventual success or ruin for player A. Let’s look at the row for $i$ = 50 (both players in equal initial position). For a fair game, breaking the bank and ruin are equally likely. As the game becomes more unfair, the chance of breaking the bank shrinks, dropping below 1% for roulette.

In Table 1, the same pattern for $i$ = 50 appears for the other initial positions as well. As $p$ drops further below 0.5 (as the game becomes more unfair to player A), it becomes less likely for player A to win and more likely for player B (the player who has the edge) to win.

Table 2, with the total wealth of 10,000 units, is designed to more realistically reflect the gambling situation of a typical gambler versus the house. With the higher combined wealth, the effect of $p$ is more pronounced. For example, in the top half of Table 2, there is virtually no chance for player A to win at any of the unfair games, i.e. ruin is virtually certain for the gambler. Even when player A has the same initial wealth as the casino ($i$ = 5000), there is still virtually no chance to win at the unfair games.

The bottom half of Table 2 shows that when player A has substantially more wealth than the casino, player A begins to have a noticeable chance of winning (though in most rows the probability is still very small). Of course the gambler having more wealth than the casino is far from a realistic scenario.

The formulas for $A_i$ and $B_i$ only provide the long run probability for player A (the gambler) to break the bank and the long run probability of ruin for player A, respectively. They give no information about the number of plays of the game in order to reach the eventual success or ruin.

For example, in Table 2, if the initial wealth of player A is 100, the probability of winning at the game of roulette is zero to four decimal places. This means that player A will almost certainly lose the 100 units eventually. But the results in Table 2 do not tell us on average how many plays of the game have to happen before eventual ruin. If the gambler is lucky, he or she may be able to rack up some winnings for a period of time before getting to the point of ruin. One thing is clear. When the game is unfair (to the gambler), it is virtually certain that the gambler will lose all the money that he or she brings into the casino. This is essentially the story told by the formulas discussed here.

______________________________________________________________
$\copyright$ 2017 – Dan Ma

## Is your public pool free from urine?

Some people believe that the answer to the question posed in the title of this blog post is no. Now there is evidence to back it up. The study profiled here also gives an indication of how serious the problem of urine in the pool might be.

The study, conducted by a team of Canadian researchers, did not measure urine in pool water directly. The research team collected 250 samples from 31 pools and tubs at hotels and recreation facilities in two Canadian cities. They measured the amount of a sweetener called acesulfame-K (ACE), which is widely consumed and is completely excreted into urine.

Concentrations of ACE in the samples range from 30 to 7110 nanograms per liter (ng/L) and are up to 570 times greater than in tap water. In particular, the research team determined the levels of ACE over a 3-week period in two pools (110,000 and 220,000 U.S. gallons) and then used the ACE levels to estimate the volume of urine in the two pools to be 30 and 75 liters, respectively.

Thus the researchers found that the 220,000-gallon public commercial-size pool contained about 75 liters (about 20 gallons) of urine. At the same proportion, this estimate would translate to roughly 2.7 gallons in a typical residential pool (20 feet by 40 feet by 5 feet). So there may be close to 3 gallons of urine in a typical residential pool.
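
The scaling from the large pool to a residential pool is simple proportional arithmetic. The sketch below spells it out, assuming (as the estimate above implicitly does) that the residential pool holds urine in the same proportion as the 220,000-gallon pool. The variable names are our own, and the unit conversion factors are standard values.

```python
LITERS_PER_US_GALLON = 3.785411784
US_GALLONS_PER_CUBIC_FOOT = 7.48052

# Figures reported for the larger of the two pools in the study.
large_pool_gallons = 220_000
urine_liters = 75

urine_gallons = urine_liters / LITERS_PER_US_GALLON
urine_fraction = urine_gallons / large_pool_gallons

# Residential pool of 20 ft x 40 ft x 5 ft, converted to gallons.
residential_pool_gallons = 20 * 40 * 5 * US_GALLONS_PER_CUBIC_FOOT
residential_urine_gallons = urine_fraction * residential_pool_gallons

print(f"urine in large pool      : {urine_gallons:.1f} gallons")
print(f"residential pool volume  : {residential_pool_gallons:,.0f} gallons")
print(f"urine in residential pool: {residential_urine_gallons:.1f} gallons")
```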

The sweetener is widely consumed in North America. ACE has been detected in other places too (e.g. China). The study seems to confirm some people’s suspicion that people are peeing in swimming pools. The peeing is frequent enough for ACE to show up in sufficient quantity.

Urine in swimming pools is obviously gross. What are the potential health hazards? Ever wonder why there is a sharp odor of chlorine after the pool has been used by a lot of people? When chlorine is mixed with urine, a host of potentially toxic compounds called disinfection byproducts are created. One group of such byproducts is the chloramines, which give off the sharp odor that people may mistake for just the smell of chlorine. Other byproducts are more harmful, e.g. cyanogen chloride, which is classified as a chemical warfare agent, and nitrosamines, which can cause cancer. The study does not have evidence to say whether the nitrosamine levels in pools increase cancer risk. Nonetheless, this is still a striking finding.

The potential health hazards are compounded by the fact that it is not uncommon for pool water to go unchanged for years. When that is the case, the pool operator simply adds more water and more chlorine to disinfect. Such practice will lead to formation of more disinfection byproducts.

Here’s a report on the study from npr.org. The study, published in Environmental Science & Technology Letters, was conducted by a research team from the University of Alberta. Here’s a link to the actual paper.

The lead researcher in the study said that she is a regular swimmer and that the study is not meant to scare people away from a healthy activity. The intention is to promote public awareness of the potential hazards of urine in the pool and to promote best practices in using swimming pools.

## Is there anything we can do to avoid the flu?

In many places, influenza activity peaks between December and February. What are some of the best practices to reduce the spread of the flu and to minimize the risk of catching it? The standard recommendations are: get a flu shot if you are in a risk group, cough and sneeze into your elbow, and wash your hands frequently. On an individual basis, we can follow these recommendations and hope for the best outcome. From a public health standpoint, it is a good idea to understand how a flu outbreak starts. A recent article published in the Journal of Clinical Virology gives insight in this regard.

Researchers at Sahlgrenska Academy in the University of Gothenburg in Gothenburg, Sweden conducted the study during the period between October 2010 and July 2013. During that period, researchers collected more than 20,000 nasal swabs from people seeking medical care in and around the city of Gothenburg, and analyzed them for influenza A and other respiratory viruses.

The incidence of respiratory viruses was then compared over time with weather data from the Swedish Meteorological and Hydrological Institute (SMHI). The findings: Flu outbreaks seem to be activated about one week after the first really cold period with low outdoor temperatures and low humidity.

The finding of this study has been reported in Time.com. Here are the abstracts of the study in three different places (here, here and here) for anyone who would like more detailed information. Here is the press release from the Sahlgrenska Academy.

The most interesting takeaway from the study is the predictable timing, with the flu outbreak starting about a week after a cold snap. The onset of a cold snap is thus a leading indicator of a flu outbreak. The correlation is a useful piece of information.

If you know when the cold month begins, you know when the flu activity starts to get intense. Thus this knowledge can be put to good use. For example, start publicizing the need for flu shots in the period leading up to the first cold month. It can also help hospitals and other health providers prepare in advance for a higher volume of patients seeking care.

Even individuals can make use of this insight. In the period leading up to the first cold spell, be extra careful and take extra precautions, for example washing hands and sneezing into the elbow.

Cold temperatures and dry weather together allow the flu virus to spread more easily. According to the study, aerosol particles containing virus and liquid are more able to spread in cold and dry weather. The dry air absorbs moisture, so the aerosol particles shrink and can remain airborne.

Of course, the cold and dry weather alone is not enough to cause an outbreak of the flu. According to one of the researchers, “the virus has to be present among the population and there have to be enough people susceptible to the infection.”

The study indicates that the findings for seasonal flu (Influenza A) also hold true for a number of other common viruses that cause respiratory tract infections, such as RS-virus and coronavirus. The combination of cold and dry weather seems to exacerbate the problems caused by these viruses. On the other hand, some viruses such as rhinovirus, a common cause of the cold, are independent of weather factors and are present all year round.

______________________________________________________
$\copyright$ 2017

In the previous post Two Statisticians in a Battlefield, we discussed the importance of reporting a spread in addition to an average when describing data. In this post we look at three specific notions of spread. They are measures that indicate the “range” of the data. First, look at the following three statements:

1. The annual household incomes in the United States range from under $10,000 to $7.8 billion.
2. The middle 50% of the annual household incomes in the United States range from $25,000 to $90,000.
3. The birth weight of European baby girls ranges from 5.4 pounds to 9.1 pounds (mean minus 2 standard deviations to mean plus 2 standard deviations).

All three statements describe the range of the data in question. Each can be interpreted as a measure of spread since each statement indicates how spread out the data measurements are.

__________________________________________________________
1. The Range

The first statement is related to the notion of spread called the range, which is calculated as:

$\displaystyle (1) \ \ \ \ \ \text{Range}=\text{Maximum Data Value}-\text{Minimum Data Value}$

Note that the range as calculated in $(1)$ is simply the length of the interval from the smallest data value to the largest data value. We do not know exactly what the largest household income is. So we assume it is the household of Bill Gates ($7.8 billion is a number we found on the Internet). So the range of annual household incomes is about $7.8 billion (all the way from an amount near $0 to $7.8 billion).

Figure 1

The range is a measure of spread since the larger this numerical summary, the more dispersed the data (i.e. the data are more scattered within the interval of “min to max”). However, the range is not very informative. It is calculated using the two data values that are, in this case, outliers. Because the range is influenced by extreme data values (i.e. outliers), it is not used often and is usually not emphasized. It is given here to provide a contrast to the other measures of spread.

__________________________________________________________
2. Interquartile Range

The interval described in the second statement is the more stable part of the income scale. It does not contain outliers such as Bill Gates and Warren Buffett. It also does not contain the households in extreme poverty. As a result, the interval of $25,000 to $90,000 describes the “range” of household incomes that are more representative of the working families in the United States.

The measure of spread at work here is called the interquartile range (IQR), which is based on the numerical summary called the 5-number summary, as indicated in $(2)$ below and in the following figure.

$\displaystyle \begin{aligned}(2) \ \ \ \ \ \text{5-number summary}&\ \ \ \ \ \text{Min}=\$0 \\&\ \ \ \ \ \text{first quartile}=Q_1=\$25,000 \\&\ \ \ \ \ \text{median}=\$50,000 \\&\ \ \ \ \ \text{third quartile}=Q_3=\$90,000 \\&\ \ \ \ \ \text{Max}=\$7.8 \text{ billion} \end{aligned}$

Figure 2

As demonstrated in Figure 2, the 5-number summary breaks up the data range into 4 quarters. Thus the interval from the first quartile ($Q_1$) to the third quartile ($Q_3$) contains the middle 50% of the data. The interquartile range (IQR) is defined to be the length of this interval. The IQR is computed as in the following:

$\displaystyle \begin{aligned}(3) \ \ \ \ \ \text{IQR}&=Q_3-Q_1 \\&=\$90,000-\$25,000 \\&=\$65,000 \end{aligned}$

Figure 3

The IQR is the length of the interval from $Q_1$ to $Q_3$. The larger this measure of spread, the more dispersed (or scattered) the data are within this interval. The smaller the IQR, the more tightly the data cluster around the median. The median as a measure of center is a resistant measure since it is not influenced significantly by extreme data values. Likewise, IQR is also not influenced by extreme data values. So IQR is also a resistant numerical summary. Thus IQR is typically used as a measure of spread for skewed distributions.
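
To see the contrast between the range and the IQR, the following sketch computes both for a small set of made-up household incomes (in thousands of dollars; these are illustrative numbers, not census data), first without and then with one extreme value added. It uses `statistics.quantiles` from the Python standard library. Note that different software packages use slightly different conventions for computing quartiles, so the quartile values may differ a little from a hand calculation.

```python
import statistics


def spread_summary(values):
    """Return the range and the interquartile range of a list of numbers."""
    data = sorted(values)
    q1, _median, q3 = statistics.quantiles(data, n=4)
    return {"range": data[-1] - data[0], "iqr": q3 - q1}


# Hypothetical household incomes in thousands of dollars.
incomes = [12, 25, 31, 40, 50, 58, 72, 90, 110]
with_outlier = incomes + [7_800_000]  # $7.8 billion expressed in thousands

base = spread_summary(incomes)
extreme = spread_summary(with_outlier)

print(f"without outlier: range = {base['range']:,}  IQR = {base['iqr']:,}")
print(f"with outlier   : range = {extreme['range']:,}  IQR = {extreme['iqr']:,}")
```

One added household changes the range by several orders of magnitude while the IQR moves only modestly, which is what it means for the IQR to be a resistant measure.
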
__________________________________________________________
3. Standard Deviation

The third statement also presents a “range” to describe the spread of the data (in this case birth weights in pounds of girls of European descent). We will see that the interval of 5.4 pounds to 9.1 pounds covers approximately 95% of the baby girls in this ethnic group.

The measure of spread at work here is called standard deviation. It is a numerical summary that measures how far a typical data value deviates from the mean. Standard deviation is usually used with data that are symmetric.

The information in the third statement is found in this online article. The source data are from this table and are repeated below.

$\displaystyle \begin{aligned}(4) \ \ \ \ \ \text{ }&\text{Birth weights of European baby girls} \\&\text{ } \\&\ \ \ \text{mean}=3293.5 \text{ grams }=7.25 \text{ pounds} \\&\ \ \ \text{standard deviation }=423.3 \text{ grams}=0.93126 \text{ pounds} \end{aligned}$

The third statement at the beginning tells us that birth weights of girls of this ethnic background can range from 5.4 pounds to 9.1 pounds. The interval spans two standard deviations from either side of the mean. That is, the value of 5.4 pounds is two standard deviations below the mean of 7.25 pounds and the value of 9.1 pounds is two standard deviations above the mean.

$\displaystyle \begin{aligned}(5) \ \ \ \ \ \text{ }&\text{Birth weights of European baby girls} \\&\text{ } \\&\ \ \ 7.25-2 \times 0.93126=5.4 \text{ pounds} \\&\ \ \ 7.25+2 \times 0.93126=9.1 \text{ pounds} \end{aligned}$

There certainly are babies with birth weights above 9.1 pounds and below 5.4 pounds. So what is the proportion of babies that fall within this range?

We assume that the birth weights of babies in a large enough sample follow a normal bell curve. The standard deviation has a special interpretation if the data follow a normal distribution. In a normal bell curve, about 68% of the data are within one standard deviation of the mean (counting both below and above). About 95% of the data are within two standard deviations of the mean. About 99.7% of the data are within three standard deviations of the mean. With respect to the birth weights of baby girls of European descent, we have the following bell curve.

Figure 4
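
The intervals and percentages behind this bell curve can be reproduced with a few lines of Python. This is only a sketch under the normality assumption stated above. It uses `statistics.NormalDist` from the standard library together with the mean and standard deviation (in pounds) from $(4)$. The helper name `interval_and_share` is our own.

```python
from statistics import NormalDist

MEAN_LB = 7.25
SD_LB = 0.93126

# Normal model for birth weights (an assumption, as stated in the text).
weights = NormalDist(mu=MEAN_LB, sigma=SD_LB)


def interval_and_share(k: float):
    """Interval of k standard deviations around the mean and its share of data."""
    low = MEAN_LB - k * SD_LB
    high = MEAN_LB + k * SD_LB
    share = weights.cdf(high) - weights.cdf(low)
    return low, high, share


for k in (1, 2, 3):
    low, high, share = interval_and_share(k)
    print(f"within {k} SD: {low:.1f} to {high:.1f} pounds, {share:.1%} of babies")
```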

__________________________________________________________
Remark

The three measures of spread discussed here all try to describe the spread of the data by presenting a “range”. The first one, called the range, is not very useful since the min and the max can be outliers. The median (as a center) and the interquartile range (as a spread) are typically used to describe skewed distributions. The mean (as a center) and the standard deviation (as a spread) are typically used to describe data distributions that have no outliers and are symmetric.

__________________________________________________________
Related Blog Posts

When Bill Gates Walks into a Bar

Choosing a High School

For a more detailed discussion of the measures of spread discussed here, go to the following blog post:
Looking at LA Rainfall Data

__________________________________________________________
Reference

1. Moore, D. S., Essential Statistics, W. H. Freeman and Company, New York, 2010
2. Moore, D. S., McCabe, G. P., Craig, B. A., Introduction to the Practice of Statistics, 6th ed., W. H. Freeman and Company, New York, 2009

## Two Statisticians in a Battlefield

Two soldiers, both statisticians, were fighting side by side in a battlefield. They spotted an enemy soldier and they both fired their rifles. One statistician soldier fired one foot to the left of the enemy soldier and the other statistician soldier fired one foot to the right of the enemy soldier. They immediately gave each other a high five and exclaimed, “on average the enemy soldier is dead!”

Of course this is an absurd story. Not only was the enemy soldier not dead, he was ready to fire back at the two statistician soldiers. This story also reminds the author of this blog of a man who had one foot in a bucket of boiling hot water and the other foot in a bucket of ice cold water. The man said, “on average, I ought to feel pretty comfortable!”

In statistics, center (or central tendency or average) refers to a set of numerical summaries that attempt to describe what a typical data value might look like. These are “average” or representative values of the data distribution. The more common notions of center or average are the mean and the median. The absurdity in these two stories points to the inadequacy of using center alone in describing a data distribution. To get a more complete picture, we need to use spread too.

Spread (or dispersion) refers to a set of numerical summaries that describe the degree to which the data are spread out. Some of the common notions of spread are range, 5-number summary, interquartile range (IQR), and standard deviation. We will not get into the specifics of these notions here. Refer to Looking at Spread for a more detailed discussion of these specific notions of spread. Our purpose is to discuss the importance of spread.

Why is spread important? Why is using average alone not sufficient in describing a data set? Here are several points to consider.

________________________________________________________________
1. Using average alone can be misleading.

The stories mentioned at the beginning aside, using average alone gives incomplete information. Depending on the point of view, using average alone can make things look better or worse than they really are.

A handy example is the 1996 Atlanta Olympics. In the summer, Atlanta is hot. In fact, some people call the city Hotlanta! Yet in the bid for the right to host the Olympics, the planning committee described the temperature using the average only (the daily average temperature being 75 degrees Fahrenheit). Of course, a temperature of 75 degrees would indeed be comfortable, but it was clearly not what visitors would experience during the middle of the day!

________________________________________________________________
2. A large spread usually means inconsistent results or performances.

A large spread indicates that there is a great deal of dispersion or scatter in the data. If the data are measurements of a manufacturing process, a large spread indicates that the product may be unreliable or substandard. So a quality control procedure would monitor average as well as spread.

If the data are exam scores, a large spread indicates that there exists a wide range of abilities among the students. Thus teaching a class with a large spread may require a different approach than teaching a class with a smaller spread.

________________________________________________________________
3. In investment, standard deviation is used as a measure of risk.

As indicated above, standard deviation is one notion of spread. In investment, risk refers to the chance that the actual return on an investment may deviate greatly from the expected return (i.e. average return). One way to quantify risk is to calculate the standard deviation of the distribution of returns on the investment. The calculation of the standard deviation is based on the deviation of each data value from the mean. It gives an indication of the deviation of an “average” data point from the mean.

A large standard deviation of the returns on an investment indicates that there will be a broad range of possible investment returns (i.e. it is a risky investment in that there is a chance to make a great deal of money and there is also a chance to lose your original investment). A small standard deviation indicates that there will likely not be many surprises (i.e. the actual returns will likely not differ much from the average return).

Thus it pays for any investor to pay attention to the average returns that are expected as well as the standard deviation of the rate of returns.
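
To make the calculation concrete, here is a small sketch that computes the standard deviation step by step from the deviations of each data value from the mean, and then compares two investments with the same average return. The return figures are invented for illustration, and the function name `sample_std_dev` is ours. The sketch uses the sample version of the standard deviation (dividing by one less than the number of data values), which matches `statistics.stdev` in the Python standard library.

```python
import math
import statistics


def sample_std_dev(values):
    """Standard deviation built up from deviations of each value from the mean."""
    mean = sum(values) / len(values)
    squared_deviations = [(v - mean) ** 2 for v in values]
    return math.sqrt(sum(squared_deviations) / (len(values) - 1))


# Hypothetical annual returns in percent for two investments.
steady_fund = [4, 5, 6, 5, 5]
volatile_fund = [-15, 25, 5, -10, 20]

for name, returns in (("steady", steady_fund), ("volatile", volatile_fund)):
    average = statistics.mean(returns)
    risk = sample_std_dev(returns)
    print(f"{name:>8}: average return = {average:.1f}%, std dev = {risk:.2f}%")
```

Both made-up investments have the same average return, so the average alone cannot tell them apart. The standard deviation is what separates the steady one from the risky one.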

________________________________________________________________
4. Without a spread, it is hard to gauge the significance of an observed data value.

When an observed data value deviates from the mean, it may not be easy to gauge the significance of the observed data value. For the type of measurements that we deal with in our everyday experience (e.g. height and weight measurements), we usually have good ideas whether the data values we observe are meaningful.

But for data measurements that are not familiar to us, we usually have a harder time making sense of the data. For example, if the observed data value is different from the average, how do we know if the difference is just due to chance or if the difference is real and significant? This kind of question is at the heart of statistical inference. Many procedures in statistical inference require the use of a spread in addition to an average.

Reference

1. Moore, D. S., Essential Statistics, W. H. Freeman and Company, New York, 2010
2. Moore, D. S., McCabe, G. P., Craig, B. A., Introduction to the Practice of Statistics, 6th ed., W. H. Freeman and Company, New York, 2009

## The Tax Return of Mitt Romney

Mitt Romney is currently a candidate for the 2012 Republican Party nomination for U.S. President. He recently bowed to pressure from another Republican presidential candidate and released his past tax returns. The release of these tax returns opened up a window on the personal finances of a very rich presidential candidate. Immediately upon the release on January 24, much of the discussion in the media centered on the fact that Romney paid an effective tax rate of about 15%, which is much less than the rates paid by many ordinary Americans. Our discussion here is neither about tax rates nor politics. For the author of this blog, Romney’s tax return provides a rich opportunity to talk about statistics. Mitt Romney is an excellent example for opening up a discussion on income distribution and several related statistical concepts.

Mitt Romney

Mitt Romney’s tax return in 2010 consisted of over 200 pages. The 2010 tax return can be found here (the PDF file can be found here). The following is a screen grab from the first page.

Figure 1

Note that the total adjusted gross income was over $21 million. The taxable interest income alone was $3.2 million. Most of Romney’s income was from capital gains (about $12.5 million). It is clear that Romney is an ultra rich individual. How wealthy? For example, where is Romney placed on the income scale relative to other Americans?

To get a perspective, let’s look at some income data from the US Census Bureau. The following is a histogram constructed using a frequency table from the Current Population Survey. The source data are from this table.

Figure 2

The horizontal axis in Figure 2 is divided into intervals made up of increments of $10,000, all the way from $0 to $250,000 plus. According to the Current Population Survey, about 7.8% of all American households had income under $10,000 in 2010 (almost 1 out of 13 households). About 12.1% of all households had income between $10,000 and $20,000 (about 1 in 8 households). Only 2.1% of the households had income over $250,000 in 2010. Obviously Romney belongs to this category. The graphic in Figure 2 shows that Romney is in the top 2% of all American households.

Of course, being in the top 2% is not the entire story. There is a long way from $250,000 (a quarter of a million) to $21 million! Clearly Romney is in the top range of the top class indicated in Figure 2. Romney is actually in the top 1% of the income distribution. According to the Wall Street Journal, Romney is well above the top 1% category. According to this online reporting from the Wall Street Journal, Romney is in the top 0.0025%! According to one calculation (mentioned in the Wall Street Journal piece), there are at least 4,000 families in the category of being in the top 0.0025%. Could it be that the families in this category number in the thousands?

The figure below shows that the sum of the percentages from the first 5 bars in the histogram equals 50.5%. This confirms the well known figure that the median household income in the United States is around $50,000.

Figure 3

The histograms in both Figure 2 and Figure 3 are clear visual demonstrations that the income distribution is skewed (e.g. most of the households make modest income). Most of the households are located in the lower income range. Just the first 5 intervals alone contain 50% of the households. The sum of the percentages of the first 10 vertical bars ($0 to $99,999) is about 80%. So making a 6-figure income lands you in the top 20% of the households.

Both histograms are classic examples of a skewed right distribution. The vertical bars on the left are tall and the bars taper off gradually at first but later drop rather precipitously. The last two vertical bars (in green) are aggregations of all the vertical bars in the $250,000+ range had we continued to draw the histograms using $10,000 increments.

Another clear visual sign that this is a skewed distribution is that the left tail (the length of the horizontal axis to the left of the median) differs greatly from the right tail (the length of the horizontal axis to the right of the median). When the right tail is much longer than the left tail, it is called a skewed right distribution (see the figure below).

Figure 4

On the other hand, when the left tail is longer, it is called a skewed left distribution.

Another indication that the income distribution is skewed is that the mean income and the median income are far apart. According to the source data, the mean household income in 2010 was $67,530, much higher than the median of about $50,000 (see the figure below).

Figure 5

Whenever the mean is much higher than the median, it is usually the case that it is a skewed right distribution (as in Figure 5).
On the other hand, when the opposite is true (the median is much higher than the mean), most of the time it is a skewed left distribution.

A related statistical concept is that of resistant measures. The median is a resistant measure because it is not affected significantly by extreme data values (in this case extremely high income and wealth). On the other hand, the mean is not resistant. As a result, in a skewed distribution, the median is a better indication of an average. This is why income is usually reported using the median (e.g. in newspaper articles). For a more detailed read on resistant measures, see When Bill Gates Walks into a Bar and Choosing a High School.

## What Are Helicopter Parents Up To?

We came across several interesting graphics about helicopter parents. These graphics give some indication as to what the so-called helicopter parents are doing in terms of shepherding their children through the job search process. These graphics are screen grabs from a report that came from Michigan State University. Here are the graphics (I put the most interesting one first).

Figure 1

Figure 2

Figure 3

Figure 1 shows a list of 9 possible job search or work related activities that helicopter parents undertake on behalf of their children. The information in this graphic came from a survey of more than 700 employers that were in the process of hiring recent college graduates. Nearly one-third of these employers indicated that parents had submitted resumes on behalf of their children, sometimes without the knowledge of their children. About one quarter of the employers in the sample reported that they saw parents trying to urge these companies to hire their children! To me, the most interesting point is that about 4% of these employers reported that parents actually showed up for the interviews! Maybe these companies should hire the parents instead! Which companies do not like job candidates that are enthusiastic and show initiative?
Figures 2 and 3 present the same information in different formats. One is a pie chart and the other is a bar graph. Both break down the survey responses according to company size. The upshot: large employers tend to see more cases of helicopter parental involvement in their college recruiting. This makes sense since larger companies tend to have regional and national brand recognition. Larger companies also tend to recruit on campus more often than smaller companies.

Any interested reader can read more about this report that came out of Michigan State University. I found this report through a piece on npr.org called Helicopter Parents Hover in the Workplace.

## Another Look at LA Rainfall

In two previous posts, we examined the annual rainfall data in Los Angeles (see Looking at LA Rainfall Data and LA Rainfall Time Plot). The data we examined in these two posts contain 132 years’ worth of annual rainfall data collected at the Los Angeles Civic Center from 1877 to 2009 (data found in the Los Angeles Almanac). These annual rainfall data represent an excellent opportunity to learn the techniques from a body of data analysis methods grouped under the broad topic of descriptive statistics (i.e. using graphs and numerical summaries to answer questions or find meaning in data).

Here are two graphics presented in Looking at LA Rainfall Data.

Figure 1

Figure 2

These charts are called histograms and they look the same (i.e. have the same shape). But they present slightly different information. Figure 1 shows the frequency of annual rainfall. Figure 2 shows the relative frequency of rainfall. For example, Figure 1 indicates that there were only 3 years (out of the last 132 years) with annual rainfall under 5 inches. On the other hand, there were only 2 years with annual rainfall above 35 inches. So drought years did happen but not very often (only 3 out of 132 years). Extremely wet seasons did happen but not very often.
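The link between a frequency and a relative frequency is a single division: count divided by the total number of years. As a small sketch, here are the two counts quoted above turned into proportions. The helper name `relative_frequency` is chosen for this illustration only.

```python
TOTAL_YEARS = 132  # number of annual records cited in the post

# Counts read off the frequency histogram, as quoted in the text.
counts = {
    "under 5 inches": 3,
    "above 35 inches": 2,
}

def relative_frequency(count, total):
    """Convert a raw count into a proportion of the total."""
    return count / total

for label, count in counts.items():
    share = relative_frequency(count, TOTAL_YEARS)
    print(f"{label}: {count} years, {share:.3f} of all years")
```

The same division applied to every bar of a frequency histogram gives the heights of the bars in the relative frequency histogram, which is why the two charts have the same shape.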
Based on Figure 1, we see that in most years, annual rainfall records range from 5 to about 25 inches. The most likely range is 10 to 15 inches (45 years out of the last 132 years). In Los Angeles, annual rainfall above 25 inches is rare (it only happened in 12 years out of 132).

Figure 1 is all about counts. It tells you how many of the data points are in a certain range (e.g. 45 years between 10 and 15 inches). For this reason, it is called a frequency histogram. Figure 2 gives the same information in terms of proportions (or relative frequency). For example, looking at Figure 2, we see that about 34% of the time, annual rainfall is from 10 to 15 inches. Thus, Figure 2 is called a relative frequency histogram.

Keep in mind that raw data usually are not informative until they are summarized. The first step in summarization should be a graph (if possible). After we have graphs, we can look at the data further using numerical calculation (i.e. using various numerical summaries such as mean, median, standard deviation, 5-number summary, etc). To see how this is done, see the previous post Looking at LA Rainfall Data.

What kind of information can we get from graphics such as Figure 1 and Figure 2 above? For example, we can tell what data points are most likely (e.g. annual rainfall of 10 to 15 inches). What data points are considered rare or unlikely? Where do most of the data points fall?

This last question should be expanded upon. Looking at Figure 2, we see that about 60% of the data are under 15 inches (0.023+0.242+0.341=0.606). So for close to 80 years out of the last 132 years, the annual rainfall records were 15 inches or less. About 81% of the data are 20 inches or less. So in the overwhelming majority of the years, the annual rainfall records are 20 inches or less. So annual rainfall of more than 20 inches is relatively rare (it only happened about 20% of the time).

We have a name for the data situation we see in Figure 1 and Figure 2.
The annual rainfall data in Los Angeles have a skewed right distribution. This is because most of the data points are on the left side of the histogram. Another way to see this is that the tallest bar in the histogram is the one at 10 to 15 inches. Note that the side to the right of the peak of the histogram is longer than the side to the left of the peak. In other words, when the right tail of the histogram is longer, it is a skewed right distribution. See the figure below.

Figure 3

Besides the look of the histogram, a skewed right distribution has another characteristic. The mean is typically a lot larger than the median in a skewed right distribution. For example, the mean of the annual rainfall data is 14.98 inches (essentially 15 inches). Yet the median is only 13.1 inches, almost two inches lower. Whenever the mean and the median are significantly far apart, we have a skewed distribution on hand. When the mean is a lot higher, it is a skewed right distribution. When the opposite situation occurs (the mean is a lot lower than the median), it is a skewed left distribution. When the mean and median are roughly equal, it is likely a symmetric distribution.

## Is College Worth It?

Is college worth it? This was the question posed by the authors of the report called College Majors, Unemployment and Earnings, which was produced recently by The Center on Education and the Workforce. We do not plan on giving a detailed account of this report. Any interested reader can read the report here. Instead, we would like to look at two graphics in this report, which are reproduced below. These two graphics are very interesting and capture all the main points of the report. The data used in the report came from the American Community Survey for the years 2009 and 2010.
Figure 1

Figure 2

Figure 1 shows the unemployment rates by college major for three groups of college degree holders, namely the recent college graduates (shown with green markings), the experienced college graduates (blue markings) and the college graduates who hold graduate degrees (red markings). Figure 2 shows the median earnings by major for the same three groups of college graduates (using the same colored markings).

Figure 1 ranks the unemployment rates for recent college graduates from the highest to the lowest. You can see the green markings descending from 13.9% (architecture) to 5.4% (education and health). So this graphic shows clearly that the employment prospects of college graduates depend on their majors, which is one of the main points of the report.

The graphic in Figure 1 shows that all recent college graduates are having a hard time finding work. The unemployment rate for recent college graduates is 8.9% (not shown in Figure 1). The employment picture for recent architecture graduates is especially bleak, which is due to the collapse of the construction and home building industry in the recession. The unemployment rates for recent college graduates who majored in education and healthcare are relatively low, reflecting the reality that these fields are either stable or growing.

Everyone is feeling the pinch in this tough economic environment. Even the recent graduates in technical fields are experiencing higher than usual unemployment rates. For example, the unemployment rates for recent college graduates in engineering and science, though relatively low compared to architecture, are at 7.5% and 7.7%, respectively. For recent graduates in computers and mathematics, the unemployment rate is 8.2%, approaching the average rate of 8.9% for recent college graduates.

The experienced college graduates fare much better than recent graduates. It is much more likely for experienced college graduates to be working.
Looking at Figure 1, another observation is that graduate degrees make a huge difference in employment prospects across all majors.

The graphic in Figure 2 suggests that earnings of college graduates also depend on the subjects they study, which is another main point of the report. The technical majors earn the most. For example, the median earnings among recent engineering college graduates are $55,000 and the median for arts majors is $30,000. Aside from the highly technical, business and healthcare majors, the median earnings of recent college graduates are in the low $30,000s (just look at the green markings in Figure 2).

Figure 2 also shows that people with graduate degrees have higher earnings across all majors. The premium in earnings for graduate degree holders is substantial and is found across the board. Though the graduate degree advantage is seen in all majors, it is especially pronounced among the technical fields (just look at the descending red markings in Figure 2).

So two of the main points are (1) the employment prospects of college graduates depend on their majors, and (2) the earning potential of college graduates also depends on the subjects they study. Is college worth it? The report is not trying to persuade college bound high school seniors not to go to college. On the contrary, the authors of the report answer the question in the affirmative. The authors of the report are merely providing the facts that all prospective college students should consider before they pick their majors. The two graphics shown above are effective demonstrations of the facts presented by the report. According to the authors, students “should do their homework before picking a major, because, when it comes to employment prospects and compensation, not all college degrees are created equal.”

## Cryptography and Presidential Inaugural Speeches

Given a letter in English, how often does it appear in normal usage of the English language? Some letters appear more often than others. For example, the last letter Z is not common. The vowels are very common because they are needed in making words. The following figure shows the relative frequency of the English letters obtained empirically (see [1]). Dewey, the author of [1], obtained this frequency distribution after examining a total of 438,023 letters. We came across this letter frequency distribution in Example 2.11 on page 24 of [2]. Figure 1 displays the letter frequency in descending order.

A letter frequency such as Figure 1 is important in cryptography. We explore briefly why this is the case. We give an indication why breaking a cipher is often a statistical process. We then confirm the Dewey letter frequency distribution by examining the letter frequency in the presidential inaugural speeches of George Washington (two speeches) and Barack Obama (one speech).

The study of the frequency of letters in text is very important in cryptography. In using an algorithm to encrypt a message, the original information is called plaintext and the encrypted message is called ciphertext. In a simple encryption scheme called a substitution cipher, each letter of the plaintext is replaced by another letter. To break such a cipher, it is necessary to know the letter frequency of the language being coded. For example, if the letter W is the most frequently appearing letter in the ciphertext, this might suggest that the letter W in the ciphertext corresponds to the letter E in the plaintext, since the letter E is the most frequently occurring English letter (see Figure 1).

Figure 1 shows that the most frequently occurring letter in English is E (about 12.68% of the time). The least used letter is Z. The top 5 letters (E, T, A, O, I) comprise about 45% of the total usage. The top 8 letters comprise close to 65% of the total usage. The top 12 letters are used about 80% of the time (80.87%).

Another interesting result from Dewey’s letter frequency is that the vowels comprise about 40% of the total usage. This means that the frequency of consonants is about 60%.

$\displaystyle \begin{aligned}(1) \ \ \ \ \ \text{relative frequency of vowels}&=\text{relative frequency of A + relative frequency of E} \\&\ \ \ + \text{relative frequency of I + relative frequency of O} \\&\ \ \ +\text{relative frequency of U + relative frequency of Y} \\&=0.0788+0.1268+0.0707+0.0776+0.0280+0.0202 \\&=0.4021 \end{aligned}$

$\displaystyle \begin{aligned}(2) \ \ \ \ \ \text{relative frequency of consonants}&=1-0.4021 \\&=0.5979 \end{aligned}$

The probability distribution of the letters displayed in Figure 1 is a useful tool that can aid the process of breaking an intercepted cipher. The general idea is to compare the frequency of the letters in the encrypted message with the frequency of the letters in Figure 1. Thus the most used letter in the ciphertext might correspond to the letter E, or might correspond to T or A (as T and A are also very common in plaintext). But the most used letter in the ciphertext is not likely to be a Z or a Q. The second most used letter in the ciphertext might be the letter T in the plaintext, or might be another one of the top letters. The cryptanalyst will likely need to try various combinations of mapping between the letters in the ciphertext and the plaintext. The idea described here is not a sure-fire approach, but is rather a trial and error process that can help the analyst put the statistical puzzle pieces together.
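As a rough illustration of this idea, the sketch below uses a shift cipher, a very simple kind of substitution cipher in which every letter is moved the same number of places along the alphabet. This is an assumption made to keep the example short. A general substitution cipher needs the trial and error described above. The message is made up and was chosen so that E really is its most common letter, and the function names are for illustration only.

```python
from collections import Counter
import string

ALPHABET = string.ascii_uppercase

def shift_encrypt(plaintext, shift):
    """Encrypt with a shift cipher (a simple special case of substitution)."""
    out = []
    for ch in plaintext.upper():
        if ch in ALPHABET:
            out.append(ALPHABET[(ALPHABET.index(ch) + shift) % 26])
        else:
            out.append(ch)  # leave spaces and punctuation alone
    return "".join(out)

def most_common_letter(text):
    """Return the most frequent letter in text, ignoring non-letters."""
    counts = Counter(ch for ch in text.upper() if ch in ALPHABET)
    return counts.most_common(1)[0][0]

def guess_shift(ciphertext, assumed_plain_letter="E"):
    """Guess the shift by assuming the top ciphertext letter stands for E."""
    top = most_common_letter(ciphertext)
    return (ALPHABET.index(top) - ALPHABET.index(assumed_plain_letter)) % 26

# Made-up message in which E is the most common letter.
message = "MEET ME NEAR THE GREEN TREE WHERE THE RIVER BENDS"
ciphertext = shift_encrypt(message, 18)  # with this shift, E is written as W
shift = guess_shift(ciphertext)
recovered = shift_encrypt(ciphertext, -shift)  # undo the guessed shift

print(ciphertext)
print(shift, recovered)
```

Here the first guess (top ciphertext letter = E) happens to be right. With a short message or an unusual text, the top letter could just as well stand for T or A, and the analyst would have to try the next candidates.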

We now use the letters in presidential inaugural speeches to see how the Dewey letter frequency holds up. We want to use text that is from another era (so we choose the two inaugural speeches of George Washington) and to use text that is contemporary (so we choose the inaugural speech of Barack Obama). The text of presidential inaugural speeches can be found here.

Figure 2 below shows the letter frequency in the two inaugural speeches of George Washington. There are a total of 7,641 letters (we only use the body of the speeches). Figure 3 below is a side by side comparison between the letter frequency in Figure 1 (Dewey) and the letter frequency in Washington’s two speeches (Figure 2).
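The counting behind a chart of this kind can be sketched in a few lines. This is only an illustration of the general approach, not the exact procedure used to produce the figures. The function name `letter_frequency` and the short sample string are placeholders, and in practice the input would be the body of a speech.

```python
from collections import Counter

def letter_frequency(text):
    """Return {letter: relative frequency}, from most to least common."""
    # Keep only the 26 English letters; ignore case, digits and punctuation.
    letters = [ch for ch in text.upper() if "A" <= ch <= "Z"]
    counts = Counter(letters)
    total = len(letters)
    return {letter: count / total for letter, count in counts.most_common()}

# Placeholder text standing in for the body of a speech.
sample = "We the people"
freq = letter_frequency(sample)

for letter, share in freq.items():
    print(f"{letter}: {share:.4f}")
```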

Figure 3 shows that the letter frequency in Washington’s speeches is on the whole very similar to the letter frequency of Dewey. We cannot expect an exact match. But overall there is a general agreement between the two distributions.

Figure 4 below shows the letter frequency in the inaugural speech of Barack Obama. There are a total of 10,627 letters (we only use the body of the speech). Figure 5 below is a side by side comparison between the letter frequency in Figure 1 (Dewey) and the letter frequency in Obama’s speech (Figure 4).

There is also a very good agreement between the letter frequency in Dewey (the benchmark) and the letter frequency in Obama’s speech.

Despite the passage of almost 200 years, there is quite an excellent agreement between the letter usage in Washington’s speeches in 1789 and the distribution obtained by Dewey in 1970 (see Figure 3). Some letters appeared more often in Washington’s speeches (e.g. E, I and N) and some appeared less often (e.g. A). The general pattern of the letter distribution in Washington’s speeches is unmistakably similar to that of Dewey’s. Similar observations can be made about the comparison between the letter frequency in Obama’s speech and Dewey’s distribution (see Figure 5).

The following table shows the frequency of the top letter, the top 5 letters, the top 8 letters and the top 12 letters in Dewey’s distribution alongside the corresponding frequencies in the speeches of Washington and Obama. Table (1) shows that the frequencies of the top letters are quite close between Dewey’s distribution and the speeches of Washington and Obama.

$\displaystyle (1) \ \ \ \ \begin{bmatrix} \text{Top Letters in Dewey's Distribution}&\text{ }&\text{Dewey}&\text{ }&\text{Washington}&\text{ }&\text{Obama} \\\text{ }&\text{ }&\text{ } \\\text{E}&\text{ }&0.1268&\text{ }&0.1309&\text{ }&0.1268 \\ \text{E, T, A, O, I}&\text{ }&0.4517&\text{ }&0.4485&\text{ }&0.4441 \\ \text{E, T, A, O, I, N, S, R}&\text{ }&0.6451&\text{ }&0.6409&\text{ }&0.6525 \\ \text{E, T, A, O, I, N, S, R, H, L, D, U}&\text{ }&0.8087&\text{ }&0.7981&\text{ }&0.8163 \end{bmatrix}$
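To put a number on “quite close”, the following sketch computes the largest absolute gap between Dewey’s figures and each speaker’s figures, using only the values in Table (1). The helper `largest_gap` is a name chosen for this illustration. The largest gap is just one simple way to summarize agreement and is not a formal statistical test.

```python
# Frequencies copied from Table (1): top letter, top 5, top 8, top 12.
dewey = [0.1268, 0.4517, 0.6451, 0.8087]
speeches = {
    "Washington": [0.1309, 0.4485, 0.6409, 0.7981],
    "Obama": [0.1268, 0.4441, 0.6525, 0.8163],
}

def largest_gap(benchmark, observed):
    """Largest absolute difference between two lists of frequencies."""
    return max(abs(b - o) for b, o in zip(benchmark, observed))

for name, values in speeches.items():
    gap = largest_gap(dewey, values)
    print(f"{name}: largest gap from Dewey = {gap:.4f}")
```

In both cases the largest gap is roughly one percentage point or less, which is consistent with the close agreement seen in the side by side charts.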

Reference

1. Dewey, G., Relative Frequency of English Spellings, Teachers College Press, Columbia University, New York, 1970
2. Larsen, R. J., Marx, M. L., An Introduction to Mathematical Statistics and its Applications, Prentice-Hall, Inc., Englewood Cliffs, New Jersey, 1981