Is your public pool free from urine?

Some people suspect that the answer to the question posed in the title of this post is no. Now there is evidence to back up that suspicion. The study profiled here also gives an indication of how serious the problem of urine in the pool might be.


The study, conducted by a team of Canadian researchers, did not measure urine in pool water directly. Instead, the research team collected 250 samples from 31 pools and hot tubs at hotels and recreation facilities in two Canadian cities. They measured the amount of a sweetener called acesulfame-K (ACE), which is widely consumed and passes through the body into urine without being metabolized.

Concentrations of ACE in the samples ranged from 30 to 7,110 nanograms per liter (ng/L), up to 570 times greater than in tap water. The research team also tracked ACE levels over a 3-week period in two pools (110,000 and 220,000 U.S. gallons) and used those levels to estimate the volume of urine in each pool: about 30 and 75 liters, respectively.

In other words, the 220,000-gallon commercial-size pool contained about 75 liters (roughly 20 gallons) of urine. Scaled down, that works out to roughly 2.7 gallons in a typical residential pool (20 feet by 40 feet by 5 feet), so a typical residential pool may hold close to 3 gallons of urine.
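The logic of the estimate is a simple mass balance: all the ACE in the pool is assumed to have arrived via urine with some average ACE concentration. The sketch below illustrates the arithmetic only; the pool reading and the urine ACE concentration used here are made-up illustrative values, not figures taken from the paper.

```python
# Back-of-the-envelope urine estimate from an ACE reading -- a sketch of
# the mass-balance idea, not the study's actual method.

GALLON_TO_LITER = 3.78541

def estimate_urine_liters(ace_pool_ng_per_l, pool_gallons, ace_urine_ng_per_l):
    """Estimate liters of urine in a pool from a measured ACE level.

    Mass balance: (pool volume) * (pool ACE level) = (urine volume) * (urine ACE level),
    assuming all ACE in the pool came from urine.
    """
    pool_liters = pool_gallons * GALLON_TO_LITER
    total_ace_ng = ace_pool_ng_per_l * pool_liters
    return total_ace_ng / ace_urine_ng_per_l

# Illustrative numbers only: 2,100 ng/L measured in a 220,000-gallon pool,
# assuming urine carries about 2.3e7 ng/L of ACE on average.
print(round(estimate_urine_liters(2100, 220000, 2.3e7), 1))
```

With these assumed inputs the formula lands in the same ballpark as the study's 75-liter figure, which is the point of the exercise: a trace chemical plus a mass balance yields a volume estimate.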

The sweetener is widely consumed in North America and has been detected elsewhere too (e.g. in China). The study seems to confirm the suspicion that people pee in swimming pools, and often enough for ACE to show up in measurable quantity.

Urine in swimming pools is obviously gross, but what are the potential health hazards? Ever wonder why there is a sharp odor of chlorine after the pool has been used by a lot of people? When chlorine mixes with urine, a host of potentially toxic compounds called disinfection byproducts are created. Among these byproducts are the chloramines, which give off the sharp odor that people may mistake for the smell of chlorine itself. Other byproducts are more harmful, e.g. cyanogen chloride, which is classified as a chemical warfare agent, and nitrosamines, which can cause cancer. The study does not provide evidence on whether the nitrosamine levels in pools increase cancer risk. Nonetheless, this is still a striking finding.

The potential health hazards are compounded by the fact that it is not uncommon for pool water to go unchanged for years. In that case, the pool operator simply adds more water and more chlorine to disinfect. Such a practice leads to the formation of even more disinfection byproducts.

The study, published in Environmental Science & Technology Letters, was conducted by a research team from the University of Alberta. Here’s a link to the actual paper.

The lead researcher in the study said that she is a regular swimmer and that the study is not meant to scare people away from a healthy activity. The intention is to promote public awareness of the potential hazards of urine in the pool and to promote best practices in using swimming pools.


Is there anything we can do to avoid the flu?

In many places, influenza activity peaks between December and February. What are some of the best practices to reduce the spread of the flu and to minimize the risk of catching it? The standard recommendations are: get a flu shot if you are in a risk group, cough and sneeze into your elbow, and wash your hands frequently. On an individual basis, we can follow these recommendations and hope for the best. From a public health standpoint, it is also a good idea to understand how a flu outbreak starts. A recent article published in the Journal of Clinical Virology gives insight in this regard.

Researchers at the Sahlgrenska Academy at the University of Gothenburg, Sweden, conducted the study between October 2010 and July 2013. During that period, they collected more than 20,000 nasal swabs from people seeking medical care in and around the city of Gothenburg and analyzed them for influenza A and other respiratory viruses.

The incidence of respiratory viruses was then compared over time with weather data from the Swedish Meteorological and Hydrological Institute (SMHI). The finding: flu outbreaks seem to be triggered about one week after the first really cold period with low outdoor temperatures and low humidity.

Here are the abstracts of the study in three different places (here, here and here) for anyone who would like more detailed information. Here is the press release from the Sahlgrenska Academy.

The most interesting takeaway from the study is the predictable timing: the flu outbreak starts about a week after a cold snap. With that lag, the onset of a cold snap is a leading indicator of a flu outbreak, which makes the correlation a useful piece of information.

If you know when the first cold spell begins, you know when flu activity starts to intensify, and this knowledge can be put to good use. For example, public health agencies can publicize the need for flu shots in the period leading up to the first cold spell. It can also help hospitals and other health providers prepare in advance for a higher volume of patients seeking care.

Even individuals can make use of this insight. In the period leading up to the first cold spell, be extra careful and take extra precautions: hand washing and sneezing into your elbow, for example.

Cold temperatures and dry air together allow the flu virus to spread more easily. According to the study, aerosol particles containing virus and liquid spread more readily in cold and dry weather: the dry air absorbs moisture, so the aerosol particles shrink and can remain airborne longer.

Of course, the cold and dry weather alone is not enough to cause an outbreak of the flu. According to one of the researchers, “the virus has to be present among the population and there have to be enough people susceptible to the infection.”

The study indicates that the findings for seasonal flu (influenza A) also hold true for a number of other common viruses that cause respiratory tract infections, such as RS-virus and coronavirus. The combination of cold and dry weather seems to exacerbate the problems caused by these viruses. On the other hand, some viruses, such as rhinovirus, a common cause of the common cold, are independent of weather factors and are present all year round.

© 2017


Looking at Spread

In the previous post Two Statisticians in a Battlefield, we discussed the importance of reporting a spread in addition to an average when describing data. In this post we look at three specific notions of spread. They are measures that indicate the “range” of the data. First, look at the following three statements:

  1. The annual household incomes in the United States range from under $10,000 to $7.8 billion.
  2. The middle 50% of the annual household incomes in the United States range from $25,000 to $90,000.
  3. The birth weight of European baby girls ranges from 5.4 pounds to 9.1 pounds (mean minus 2 standard deviations to mean plus 2 standard deviations).

All three statements describe the range of the data in question. Each can be interpreted as a measure of spread since each statement indicates how spread out the data measurements are.

1. The Range

The first statement is related to the notion of spread called the range, which is calculated as:

\displaystyle (1) \ \ \ \ \ \text{Range}=\text{Maximum Data Value}-\text{Minimum Data Value}

Note that the range as calculated in (1) is simply the length of the interval from the smallest data value to the largest data value. We do not know exactly what the largest household income is. So we assume it is the household of Bill Gates ($7.8 billion is a number we found on the Internet). So the range of annual household incomes is about $7.8 billion (all the way from an amount near $0 to $7.8 billion).
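As a quick illustration of (1), the range can be computed directly. The income figures below are made up for illustration, with one extreme value standing in for the Bill Gates household:

```python
# Range = max - min, on a small made-up income sample (values in dollars).
incomes = [8_000, 25_000, 50_000, 90_000, 7_800_000_000]

data_range = max(incomes) - min(incomes)
print(data_range)  # the single extreme value dominates the result
```

Notice that the answer is driven almost entirely by the one extreme value, which is exactly the weakness of the range discussed below.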

Figure 1

The range is a measure of spread since the larger this numerical summary, the more dispersed the data (i.e. the data are more scattered within the interval of “min to max”). However, the range is not very informative. It is calculated using the two data values that are, in this case, outliers. Because the range is influenced by extreme data values (i.e. outliers), it is not used often and is usually not emphasized. It is given here to provide a contrast to the other measures of spread.

2. Interquartile Range

The interval described in the second statement is the more stable part of the income scale. It does not contain outliers such as Bill Gates and Warren Buffett. It also does not contain the households in extreme poverty. As a result, the interval of $25,000 to $90,000 describes the “range” of household incomes that are more representative of the working families in the United States.

The measure of spread at work here is called the interquartile range (IQR), which is based on the numerical summary called the 5-number summary, as indicated in (2) below and in the following figure.

\displaystyle \begin{aligned}(2) \ \ \ \ \ \text{5-number summary}&\ \ \ \ \ \text{Min}=\$0 \\&\ \ \ \ \ \text{first quartile}=Q_1=\$25,000 \\&\ \ \ \ \ \text{median}=\$50,000 \\&\ \ \ \ \ \text{third quartile}=Q_3=\$90,000 \\&\ \ \ \ \ \text{Max}=\$7.8 \text{ billion} \end{aligned}

Figure 2

As demonstrated in Figure 2, the 5-number summary breaks up the data range into 4 quarters. Thus the interval from the first quartile (Q_1) to the third quartile (Q_3) contains the middle 50% of the data. The interquartile range (IQR) is defined to be the length of this interval. The IQR is computed as in the following:

\displaystyle \begin{aligned}(3) \ \ \ \ \ \text{IQR}&=Q_3-Q_1 \\&=\$90,000-\$25,000 \\&=\$65,000 \end{aligned}

Figure 3

The IQR is the length of the interval from Q_1 to Q_3. The larger this measure of spread, the more dispersed (or scattered) the data are within this interval. The smaller the IQR, the more tightly the data cluster around the median. The median as a measure of center is a resistant measure since it is not influenced significantly by extreme data values. Likewise, IQR is also not influenced by extreme data values. So IQR is also a resistant numerical summary. Thus IQR is typically used as a measure of spread for skewed distributions.
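The 5-number summary and the IQR in (2) and (3) are easy to compute with Python's standard library. The sketch below uses a small made-up data set (not the income data above) so the quartiles are easy to check by hand:

```python
import statistics

# Five-number summary and IQR for a small illustrative data set.
data = [2, 4, 4, 5, 7, 8, 9, 11, 12, 13, 18]

# statistics.quantiles with n=4 returns [Q1, median, Q3]
# (default "exclusive" method).
q1, median, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

print("5-number summary:", min(data), q1, median, q3, max(data))
print("IQR =", iqr)
```

Different software uses slightly different quartile conventions, so other packages may give slightly different Q1 and Q3 on the same data; the idea of the middle 50% is the same in every convention.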

3. Standard Deviation

The third statement also presents a “range” to describe the spread of the data (in this case birth weights in pounds of girls of European descent). We will see that the interval of 5.4 pounds to 9.1 pounds covers approximately 95% of the baby girls in this ethnic group.

The measure of spread at work here is called standard deviation. It is a numerical summary that measures how far a typical data value deviates from the mean. Standard deviation is usually used with data that are symmetric.

The information in the third statement is found in this online article. The source data are from this table and are repeated below.

\displaystyle \begin{aligned}(4) \ \ \ \ \ \text{ }&\text{Birth weights of European baby girls} \\&\text{ } \\&\ \ \ \text{mean}=3293.5 \text{ grams }=7.25 \text{ pounds} \\&\ \ \ \text{standard deviation }=423.3 \text{ grams}=0.93126 \text{ pounds} \end{aligned}

The third statement at the beginning tells us that birth weights of girls of this ethnic background can range from 5.4 pounds to 9.1 pounds. The interval spans two standard deviations from either side of the mean. That is, the value of 5.4 pounds is two standard deviations below the mean of 7.25 pounds and the value of 9.1 pounds is two standard deviations above the mean.

\displaystyle \begin{aligned}(5) \ \ \ \ \ \text{ }&\text{Birth weights of European baby girls} \\&\text{ } \\&\ \ \ 7.25-2 \times 0.93126=5.4 \text{ pounds} \\&\ \ \ 7.25+2 \times 0.93126=9.1 \text{ pounds} \end{aligned}

There certainly are babies with birth weights above 9.1 pounds and below 5.4 pounds. So what proportion of babies falls within this range?

We assume that the birth weights of babies in a large enough sample follow a normal bell curve. The standard deviation has a special interpretation when the data follow a normal distribution: about 68% of the data fall within one standard deviation of the mean, about 95% fall within two standard deviations of the mean, and about 99.7% fall within three standard deviations of the mean. For the birth weights of baby girls of European descent, we have the following bell curve.

Figure 4
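The 68-95-99.7 rule can be checked empirically by simulation. The sketch below draws synthetic birth weights from a normal distribution using roughly the mean and standard deviation quoted above; the sample is simulated, not real birth data:

```python
import random

# Simulated birth weights (pounds): normal with roughly the mean and SD
# quoted above.  Synthetic data, used only to illustrate the rule.
random.seed(1)
mean_lb, sd_lb = 7.25, 0.93
weights = [random.gauss(mean_lb, sd_lb) for _ in range(100_000)]

for k in (1, 2, 3):
    lo, hi = mean_lb - k * sd_lb, mean_lb + k * sd_lb
    share = sum(lo <= w <= hi for w in weights) / len(weights)
    print(f"within {k} SD: {share:.3f}")  # expect about 0.683, 0.954, 0.997
```

The simulated proportions land very close to 68%, 95% and 99.7%, which is why the interval "mean plus or minus 2 standard deviations" covers roughly 95% of the babies.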


The three measures of spread discussed here all try to describe the spread of the data by presenting a “range”. The first one, called the range, is not very useful since the min and the max can be outliers. The median (as a center) and the interquartile range (as a spread) are typically used to describe skewed distributions. The mean (as a center) and the standard deviation (as a spread) are typically used to describe data distributions that have no outliers and are symmetric.

Related Blog Posts

For more information about resistant numerical summaries, go to the following blog posts:

When Bill Gates Walks into a Bar

Choosing a High School

For a more detailed discussion of the measures of spread discussed here, go to the following blog post:
Looking at LA Rainfall Data


  1. Moore, D. S., Essential Statistics, W. H. Freeman and Company, New York, 2010
  2. Moore, D. S., McCabe, G. P., Craig, B. A., Introduction to the Practice of Statistics, 6th ed., W. H. Freeman and Company, New York, 2009

Two Statisticians in a Battlefield

Two soldiers, both statisticians, were fighting side by side in a battlefield. They spotted an enemy soldier and they both fired their rifles. One statistician soldier fired one foot to the left of the enemy soldier and the other statistician soldier fired one foot to the right of the enemy soldier. They immediately gave each other a high five and exclaimed, “on average the enemy soldier is dead!”

Of course this is an absurd story. Not only was the enemy soldier not dead, he was ready to fire back at the two statistician soldiers. The story also reminds the author of this blog of a man who had one foot in a bucket of boiling hot water and the other foot in a bucket of ice cold water. The man said, “on average, I ought to feel pretty comfortable!”

In statistics, center (or central tendency or average) refers to a set of numerical summaries that attempt to describe what a typical data value might look like. These are “average” or representative values of the data distribution. The more common notions of center or average are the mean and the median. The absurdity in these two stories points to the inadequacy of using center alone in describing a data distribution. To get a more complete picture, we need to use spread too.

Spread (or dispersion) refers to a set of numerical summaries that describe the degree to which the data are spread out. Some of the common notions of spread are range, 5-number summary, interquartile range (IQR), and standard deviation. We will not get into the specifics of these notions here. Refer to Looking at Spread for a more detailed discussion of these specific notions of spread. Our purpose is to discuss the importance of spread.

Why is spread important? Why is using average alone not sufficient in describing a data set? Here are several points to consider.

1. Using average alone can be misleading.

The stories mentioned at the beginning aside, using average alone gives incomplete information. Depending on the point of view, using average alone can make things look better or worse than they really are.

A handy example is the 1996 Atlanta Olympics. In the summer, Atlanta is hot; in fact, some people call the city Hotlanta! Yet in the bid for the right to host the Games, the planning committee described the temperature using the average only (a daily average temperature of 75 degrees Fahrenheit). A temperature of 75 degrees would indeed be comfortable, but it was clearly not what visitors would experience during the middle of the day!

2. A large spread usually means inconsistent results or performances.

A large spread indicates that there is a great deal of dispersion or scatter in the data. If the data are measurements of a manufacturing process, a large spread indicates that the product may be unreliable or substandard. So a quality control procedure would monitor average as well as spread.

If the data are exam scores, a large spread indicates that there exists a wide range of abilities among the students. Thus teaching a class with a large spread may require a different approach than teaching a class with a smaller spread.

3. In investment, standard deviation is used as a measure of risk.

As indicated above, standard deviation is one notion of spread. In investment, risk refers to the chance that the actual return on an investment may deviate greatly from the expected return (i.e. average return). One way to quantify risk is to calculate the standard deviation of the distribution of returns on the investment. The calculation of the standard deviation is based on the deviation of each data value from the mean. It gives an indication of the deviation of an “average” data point from the mean.

A large standard deviation of the returns on an investment indicates a broad range of possible investment returns (i.e. it is a risky investment in that there is a chance to make a great deal of money and also a chance to lose the original investment). A small standard deviation indicates that there will likely not be many surprises (i.e. the actual returns will likely not differ much from the average return).

Thus it pays for any investor to pay attention to the expected average return as well as the standard deviation of the returns.
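To illustrate the point with made-up return figures, two investments can share exactly the same average return while carrying very different risk as measured by the standard deviation:

```python
import statistics

# Two hypothetical investments with the same average yearly return (5%)
# but very different spreads in their returns.  All figures are made up.
steady = [0.05, 0.06, 0.04, 0.05, 0.05]
volatile = [0.30, -0.20, 0.25, -0.15, 0.05]

for name, returns in [("steady", steady), ("volatile", volatile)]:
    print(name,
          "mean =", round(statistics.mean(returns), 3),
          "stdev =", round(statistics.stdev(returns), 3))
```

Reporting the 5% average alone would make the two investments look identical; the standard deviation is what separates the safe one from the risky one.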

4. Without a spread, it is hard to gauge the significance of an observed data value.

When an observed data value deviates from the mean, it may not be easy to gauge the significance of the observed data value. For the type of measurements that we deal with in our everyday experience (e.g. height and weight measurements), we usually have good ideas whether the data values we observe are meaningful.

But for data measurements that are not familiar to us, we usually have a harder time making sense of the data. For example, if an observed data value is different from the average, how do we know whether the difference is just due to chance or whether it is real and significant? This kind of question is at the heart of statistical inference. Many procedures in statistical inference require the use of a spread in addition to an average.

For more information about the notions of spread, refer to other discussion in this blog or the following references.


  1. Moore, D. S., Essential Statistics, W. H. Freeman and Company, New York, 2010
  2. Moore, D. S., McCabe, G. P., Craig, B. A., Introduction to the Practice of Statistics, 6th ed., W. H. Freeman and Company, New York, 2009

The Tax Return of Mitt Romney

Mitt Romney is currently a candidate for the 2012 Republican Party nomination for U.S. President. Recently, bowing to pressure from another Republican presidential candidate, he released his past tax returns. The release of these tax returns opened up a window on the personal finances of a very rich presidential candidate. Immediately upon the release on January 24, much of the discussion in the media centered around the fact that Romney paid an effective tax rate of about 15%, which is much less than the rates paid by many ordinary Americans. Our discussion here is neither about tax rates nor politics. For the author of this blog, Romney’s tax return provides a rich opportunity to talk about statistics. Mitt Romney is an excellent example for opening up a discussion on income distribution and several related statistical concepts.

Mitt Romney

Mitt Romney’s tax return in 2010 consisted of over 200 pages. The 2010 tax return can be found here (the PDF file can be found here). The following is a screen grab from the first page.

Figure 1

Note that the total adjusted gross income was over $21 million. The taxable interest income alone was $3.2 million. Most of Romney’s income was from capital gains (about $12.5 million). It is clear that Romney is an ultra rich individual. How wealthy? Where, for example, does Romney sit on the income scale relative to other Americans? To get a perspective, let’s look at some income data from the US Census Bureau. The following is a histogram constructed using a frequency table from the Current Population Survey. The source data are from this table.

Figure 2

The horizontal axis in Figure 2 is divided into intervals made up of increments of $10,000, all the way from $0 to $250,000 plus. According to the Current Population Survey, about 7.8% of all American households had income under $10,000 in 2010 (almost 1 out of 13 households). About 12.1% of all households had income between $10,000 and $20,000 (about 1 in 8 households). Only 2.1% of the households had income over $250,000 in 2010. Obviously Romney belongs to this category. The graphic in Figure 2 shows that Romney is in the top 2% of all American households.

Of course, being in the top 2% is not the entire story. There is a long way from $250,000 (a quarter of a million) to $21 million! Clearly Romney is at the top of the top class indicated in Figure 2; he is in the top 1% of the income distribution, and well above it. According to this online reporting from the Wall Street Journal, Romney is in the top 0.0025%! One calculation mentioned in the Wall Street Journal piece puts at least 4,000 families in that top 0.0025% category.

The figure below shows that the sum of the percentages from the first 5 bars in the histogram equals 50.5%. This confirms the well known figure that the median household income in the United States is around $50,000.

Figure 3

The histograms in Figures 2 and 3 are clear visual demonstrations that the income distribution is skewed: most households make modest incomes and are located in the lower income range. The first 5 intervals alone contain about 50% of the households, and the sum of the percentages of the first 10 vertical bars ($0 to $99,999) is about 80%. So making a 6-figure income lands you in the top 20% of households. Both histograms are classic examples of a skewed right distribution: the vertical bars on the left are tall, and the bars taper off gradually at first and then drop rather precipitously.

The last two vertical bars (in green) are aggregations of all the vertical bars in the $250,000+ range had we continued to draw the histograms using $10,000 increments. Another clear visual sign that this is a skewed distribution is that the left tail (the length of the horizontal axis to the left of the median) differs greatly from the right tail (the length of the horizontal axis to the right of the median). When the right tail is much longer than the left tail, it is called a skewed right distribution (see the figure below).

Figure 4

On the other hand, when the left tail is longer, it will be called a skewed left distribution.

Another indication that the income distribution is skewed is that the mean income and the median income are far apart. According to the source data, the mean household income in 2010 was $67,530, much higher than the median of about $50,000 (see the figure below).

Figure 5

Whenever the mean is much higher than the median, it is usually the case that it is a skewed right distribution (as in Figure 5). On the other hand, when the opposite is true (the median is much higher than the mean), most of the time it is a skewed left distribution.

A related statistical concept is the so-called resistant measure. The median is a resistant measure because it is not influenced significantly by extreme data values (in this case, extremely high income and wealth). On the other hand, the mean is not resistant. As a result, in a skewed distribution, the median is a better indication of an average. This is why income is usually reported using the median (e.g. in newspaper articles).
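A quick sketch of resistance, using made-up income figures: adding one extreme value to a small sample barely moves the median but drags the mean far upward.

```python
import statistics

# Median is resistant, the mean is not: add one extreme income to a
# small made-up sample and watch what happens to each summary.
incomes = [30_000, 45_000, 50_000, 60_000, 80_000]
with_outlier = incomes + [21_000_000]   # one ultra-high earner joins

print("before:", statistics.mean(incomes), statistics.median(incomes))
print("after: ", statistics.mean(with_outlier), statistics.median(with_outlier))
```

The median creeps from $50,000 to $55,000 while the mean jumps from $53,000 to over $3.5 million, which is exactly why skewed income data are reported with the median.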

For a more detailed read of resistant measures, see When Bill Gates Walks into a Bar and Choosing a High School.


What Are Helicopter Parents Up To?

We came across several interesting graphics about helicopter parents. These graphics give some indication as to what the so-called helicopter parents are doing in terms of shepherding their children through the job search process. The graphics are screen grabs from a report from Michigan State University. Here they are (I put the most interesting one first).

Figure 1

Figure 2

Figure 3

Figure 1 shows a list of 9 possible job search or work related activities that helicopter parents undertake on behalf of their children. The information in this graphic comes from a survey of more than 700 employers that were in the process of hiring recent college graduates. Nearly one-third of these employers indicated that parents had submitted resumes on behalf of their children, sometimes without the children’s knowledge. About one quarter of the employers in the sample reported seeing parents trying to urge the companies to hire their children!

To me, the most interesting point is that about 4% of these employers reported that parents actually showed up for the interviews! Maybe these companies should hire the parents instead! Which companies do not like job candidates that are enthusiastic and show initiative?

Figures 2 and 3 present the same information in different formats. One is a pie chart and the other is a bar graph. Both break down the survey responses by company size. The upshot: large employers tend to see more cases of helicopter parental involvement in their college recruiting. This makes sense since larger companies tend to have regional and national brand recognition. Larger companies also tend to recruit on campus more often than smaller companies.

Any interested reader can read more about this report from Michigan State University. I found the report through an article called Helicopter Parents Hover in the Workplace.


Another Look at LA Rainfall

In two previous posts, we examined the annual rainfall data in Los Angeles (see Looking at LA Rainfall Data and LA Rainfall Time Plot). The data we examined in these two posts contain 132 years’ worth of annual rainfall figures collected at the Los Angeles Civic Center from 1877 to 2009 (data found in the Los Angeles Almanac). These annual rainfall data represent an excellent opportunity to learn the techniques from a body of data analysis methods grouped under the broad topic of descriptive statistics (i.e. using graphs and numerical summaries to answer questions or find meaning in data).

Here are two graphics presented in Looking at LA Rainfall Data.

Figure 1

Figure 2

These charts are called histograms and they look the same (i.e. have the same shape). But they present slightly different information. Figure 1 shows the frequency of annual rainfall. Figure 2 shows the relative frequency of rainfall.

For example, Figure 1 indicates that there were only 3 years (out of the last 132) with annual rainfall under 5 inches, and only 2 years with annual rainfall above 35 inches. So drought years did happen, but not very often. Extremely wet seasons also happened, but rarely. Based on Figure 1, in most years annual rainfall ranged from 5 to about 25 inches, with 10 to 15 inches being the most likely range (45 of the last 132 years). In Los Angeles, annual rainfall above 25 inches is rare (it happened in only 12 of 132 years).

Figure 1 is all about counts. It tells you how many of the data points fall in a certain range (e.g. 45 years between 10 and 15 inches). For this reason, it is called a frequency histogram. Figure 2 gives the same information in terms of proportions (or relative frequencies). For example, looking at Figure 2, we see that about 34% of the time annual rainfall is from 10 to 15 inches. Thus, Figure 2 is called a relative frequency histogram.
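The relationship between the two kinds of histograms is easy to reproduce. The sketch below bins a small synthetic set of rainfall figures (made-up values, not the actual LA data) into 5-inch intervals and reports both the counts (frequencies) and the proportions (relative frequencies):

```python
from collections import Counter

# Frequency vs. relative frequency: bin made-up annual rainfall figures
# (inches) into 5-inch intervals, as in the histograms above.
rainfall = [4.1, 7.6, 9.2, 11.0, 12.3, 12.9, 13.5, 14.8, 16.2, 19.7, 22.4, 31.0]

bins = Counter(5 * int(r // 5) for r in rainfall)   # key = bin lower bound
n = len(rainfall)
for lower in sorted(bins):
    freq = bins[lower]
    print(f"{lower}-{lower + 5} in: count={freq}, proportion={freq / n:.3f}")
```

Dividing every count by the total number of years is the only difference between the two charts, which is why the two histograms have exactly the same shape.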

Keep in mind raw data usually are not informative until they are summarized. The first step in summarization should be a graph (if possible). After we have graphs, we can look at the data further using numerical calculation (i.e. using various numerical summaries such as mean, median, standard deviation, 5-number summary, etc). To see how this is done, see the previous post Looking at LA Rainfall Data.

What kind of information can we get from graphics such as Figure 1 and Figure 2 above? For example, we can tell what data points are most likely (e.g. annual rainfall of 10 to 15 inches). What data points are considered rare or unlikely? Where do most of the data points fall?

This last question should be expanded upon. Looking at Figure 2, we see that about 60% of the data are under 15 inches (0.023+0.242+0.341=0.606). So for close to 80 years out of the last 132 years, the annual rainfall was 15 inches or less. About 81% of the data are 20 inches or less. So in the overwhelming majority of the years, the annual rainfall was 20 inches or less. Annual rainfall of more than 20 inches is relatively rare (it happened only about 20% of the time).

We have a name for the data situation we see in Figure 1 and Figure 2. The annual rainfall data in Los Angeles have a skewed right distribution. This is because most of the data points are on the left side of the histogram. Another way to see this is that the tallest bar in the histogram is the one at 10 to 15 inches, and the side to the right of the peak is longer than the side to the left of the peak. In other words, when the right tail of the histogram is longer, it is a skewed right distribution. See the figure below.

Figure 3

Besides the look of the histogram, a skewed right distribution has another characteristic: the mean is typically a good deal larger than the median. For example, the mean of the annual rainfall data is 14.98 inches (essentially 15 inches), yet the median is only 13.1 inches, almost two inches lower. Whenever the mean and the median are significantly far apart, we have a skewed distribution on hand. When the mean is a lot higher, it is a skewed right distribution. When the opposite occurs (the mean is a lot lower than the median), it is a skewed left distribution. When the mean and median are roughly equal, the distribution is likely symmetric.


Is College Worth It?

Is college worth it? This was the question posed by the authors of the report called College Majors, Unemployment and Earnings, which was produced recently by The Center on Education and the Workforce. We do not plan on giving a detailed report here; any interested reader can read the full report here. Instead, we would like to look at two graphics from the report, which are reproduced below. These two graphics are very interesting and capture the main points of the report. The data used in the report came from the American Community Survey for the years 2009 and 2010.

Figure 1

Figure 2

Figure 1 shows the unemployment rates by college major for three groups of college degree holders, namely the recent college graduates (shown with green marking), the experienced college graduates (blue marking) and the college graduates who hold graduate degrees (red marking). Figure 2 shows the median earnings by major for the same three groups of college graduates (using the same colored markings).

Figure 1 ranks the unemployment rates for recent college graduates from highest to lowest. You can see the green markings descending from 13.9% (architecture) to 5.4% (education and health). This graphic shows clearly that the employment prospects of college graduates depend on their majors, which is one of the main points of the report.

The graphic in Figure 1 shows that all recent college graduates are having a hard time finding work. The unemployment rate for recent college graduates overall is 8.9% (not shown in Figure 1). The employment picture for recent architecture graduates is especially bleak, due to the collapse of the construction and home building industry in the recession. The unemployment rates for recent graduates who majored in education and healthcare are relatively low, reflecting the reality that these fields are either stable or growing.

Everyone is feeling the pinch in this tough economic environment. Even recent graduates in technical fields are experiencing higher than usual unemployment rates. For example, the unemployment rates for recent college graduates in engineering and science, though relatively low compared to architecture, are at 7.5% and 7.7%, respectively. For recent graduates in computers and mathematics, the unemployment rate is 8.2%, approaching the average rate of 8.9% for recent college graduates.

The experienced college graduates fare much better than recent graduates. It is much more likely for experienced college graduates to be working. Looking at Figure 1, another observation is that graduate degrees make a huge difference in employment prospects across all majors.

The graphic in Figure 2 suggests that earnings of college graduates also depend on the subjects they study, which is another main point of the report. The technical majors earn the most. For example, the median earning among recent engineering graduates is $55,000, while the median for arts majors is $30,000. Aside from the higher-paying technical, business and healthcare majors, the median earnings of recent college graduates are in the low $30,000s (just look at the green markings in Figure 2).

Figure 2 also shows that people with graduate degrees have higher earnings across all majors. The premium in earnings for graduate degree holders is substantial and is found across the board. Though the graduate degree advantage is seen in all majors, it is especially pronounced among the technical fields (just look at the descending red markings in Figure 2).

So two of the main points are: (1) the employment prospects of college graduates depend on their majors, and (2) the earning potential of college graduates also depends on the subjects they study. Is college worth it? The report is not trying to persuade college-bound high school seniors not to go to college. On the contrary, the authors of the report answer the question in the affirmative. The authors are merely providing the facts that all prospective college students should consider before they pick their majors. The two graphics shown above are effective demonstrations of the facts presented in the report. According to the authors, students “should do their homework before picking a major, because, when it comes to employment prospects and compensation, not all college degrees are created equal.”


Cryptography and Presidential Inaugural Speeches

Given a letter in English, how often does it appear in normal usage of the English language? Some letters appear more often than others. For example, the last letter Z is not common, while the vowels are very common because they are needed to form words. The following figure shows the relative frequency of the English letters obtained empirically (see [1]). Dewey, the author of [1], obtained this frequency distribution after examining a total of 438,023 letters. We came across this letter frequency distribution in Example 2.11 on page 24 of [2]. Figure 1 displays the letter frequencies in descending order.

A letter frequency distribution such as the one in Figure 1 is important in cryptography. We briefly explore why this is the case and give an indication of why breaking a cipher is often a statistical process. We then check the Dewey letter frequency distribution against the letter frequencies in the presidential inaugural speeches of George Washington (two speeches) and Barack Obama (one speech).

The study of the frequency of letters in text is very important in cryptography. When an algorithm is used to encrypt a message, the original information is called plaintext and the encrypted message is called ciphertext. In a simple encryption scheme called a substitution cipher, each letter of the plaintext is replaced by another letter. To break such a cipher, it helps to know the letter frequency of the language being encoded. For example, if the letter W is the most frequently appearing letter in the ciphertext, this might suggest that W in the ciphertext corresponds to the letter E in the plaintext, since E is the most frequently occurring English letter (see Figure 1).
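As a minimal illustration of a substitution cipher, the sketch below builds a random one-to-one mapping of the 26 letters. The function and variable names are our own, and this is a toy scheme for illustration, not a secure one.

```python
import random
import string

# A toy substitution cipher: each plaintext letter is replaced by a
# unique randomly chosen letter. Non-letters (spaces, etc.) pass through.
alphabet = string.ascii_uppercase
shuffled = random.sample(alphabet, len(alphabet))
encrypt_key = dict(zip(alphabet, shuffled))
decrypt_key = {v: k for k, v in encrypt_key.items()}

def encrypt(plaintext):
    return "".join(encrypt_key.get(c, c) for c in plaintext.upper())

def decrypt(ciphertext):
    return "".join(decrypt_key.get(c, c) for c in ciphertext)

print(encrypt("ATTACK AT DAWN"))
```

Because each plaintext letter always maps to the same ciphertext letter, the letter frequencies of the language survive the encryption, and that is precisely what a frequency-based attack exploits.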

Figure 1 shows that the most frequently occurring letter in English is E (about 12.68% of the time). The least used letter is Z. The top 5 letters (E, T, A, O, I) comprise about 45% of the total usage. The top 8 letters comprise close to 65% of the total usage. The top 12 letters are used about 80% of the time (80.87%).

Another interesting result from Dewey's letter frequencies is that the vowels comprise about 40% of the total usage, which means that the consonants comprise about 60%.

\displaystyle \begin{aligned}(1) \ \ \ \ \  \text{relative frequency of vowels}&=\text{relative frequency of A + relative frequency of E} \\&\ \ \ + \text{relative frequency of I + relative frequency of O} \\&\ \ \ +\text{relative frequency of U + relative frequency of Y} \\&=0.0788+0.1268+0.0707+0.0776+0.0280+0.0202 \\&=0.4021 \end{aligned}

\displaystyle \begin{aligned}(2) \ \ \ \ \  \text{relative frequency of consonants}&=1-0.4021 \\&=0.5979 \end{aligned}

The probability distribution of the letters displayed in Figure 1 is a useful tool that can aid the process of breaking an intercepted cipher. The general idea is to compare the frequency of the letters in the encrypted message with the frequencies in Figure 1. Thus the most used letter in the ciphertext might correspond to the letter E, or might correspond to T or A (as T and A are also very common in plaintext). But the most used letter in the ciphertext is not likely to be a Z or a Q. The second most used letter in the ciphertext might be the letter T in the plaintext, or might be another one of the top letters. The cryptanalyst will likely need to try various combinations of mappings between the letters in the ciphertext and the plaintext. The idea described here is not a sure-fire approach, but rather a trial and error process that can help the analyst put the statistical puzzle pieces together.
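The trial-and-error process just described can be seeded mechanically: rank the ciphertext letters by frequency and pair them with the English letters ranked by frequency. The sketch below produces only a starting guess, not a decryption. The first 12 letters of the ordering string follow Dewey's ranking quoted in this post (E, T, A, O, I, N, S, R, H, L, D, U); the order of the remaining letters is our assumption.

```python
from collections import Counter

# English letters from most to least common. The first 12 match the
# Dewey ranking discussed in this post; the tail order is assumed.
ENGLISH_ORDER = "ETAOINSRHLDUCMFWGYPBVKXJQZ"

def first_guess(ciphertext):
    """Map each ciphertext letter to the English letter of the same
    frequency rank -- a starting point for trial and error, not a solution."""
    counts = Counter(c for c in ciphertext.upper() if c.isalpha())
    cipher_order = [letter for letter, _ in counts.most_common()]
    return {c: e for c, e in zip(cipher_order, ENGLISH_ORDER)}
```

On a long enough ciphertext, the most common ciphertext letter is guessed to be E, the next to be T, and so on; the analyst then swaps pairs in this mapping until readable plaintext emerges.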

We now use the letters in presidential inaugural speeches to see how the Dewey letter frequency distribution holds up. We want to use text from another era (so we choose the two inaugural speeches of George Washington) as well as contemporary text (so we choose the inaugural speech of Barack Obama). The text of presidential inaugural speeches can be found here.

Figure 2 below shows the letter frequency in the two inaugural speeches of George Washington. There are a total of 7,641 letters (we only use the body of the speeches). Figure 3 below is a side by side comparison between the letter frequency in Figure 1 (Dewey) and the letter frequency in Washington’s two speeches (Figure 2).

Figure 3 shows that the letter frequency in Washington’s speeches is on the whole very similar to the letter frequency of Dewey. We cannot expect an exact match. But overall there is a general agreement between the two distributions.

Figure 4 below shows the letter frequency in the inaugural speech of Barack Obama. There are a total of 10,627 letters (we only use the body of the speech). Figure 5 below is a side by side comparison between the letter frequency in Figure 1 (Dewey) and the letter frequency in Obama's speech (Figure 4).

There is also a very good agreement between the letter frequency in Dewey (the benchmark) and the letter frequency in Obama’s speech.

Despite the passage of almost 200 years, there is excellent agreement between the letter usage in Washington's speeches (delivered in 1789 and 1793) and the distribution obtained by Dewey in 1970 (see Figure 3). Some letters appeared more frequently in Washington's speeches (e.g. E, I and N) and some appeared less often (e.g. A). Still, the general pattern of the letter distribution in Washington's speeches is unmistakably similar to that of Dewey's. Similar observations can be made about the comparison between the letter frequency in Obama's speech and Dewey's distribution (see Figure 5).

The following table shows the frequency of the top letter, the top 5 letters, the top 8 letters and the top 12 letters in Dewey's distribution alongside the corresponding frequencies in the speeches of Washington and Obama. Table (1) shows that the frequencies of the top letters are quite close between Dewey's distribution and the speeches of Washington and Obama.

\displaystyle (1) \ \ \ \ \begin{bmatrix} \text{Top Letters in Dewey's Distribution}&\text{ }&\text{Dewey}&\text{ }&\text{Washington}&\text{ }&\text{Obama}  \\\text{ }&\text{ }&\text{ } \\\text{E}&\text{ }&0.1268&\text{ }&0.1309&\text{ }&0.1268  \\ \text{E, T, A, O, I}&\text{ }&0.4517&\text{ }&0.4485&\text{ }&0.4441  \\ \text{E, T, A, O, I, N, S, R}&\text{ }&0.6451&\text{ }&0.6409&\text{ }&0.6525   \\ \text{E, T, A, O, I, N, S, R, H, L, D, U}&\text{ }&0.8087&\text{ }&0.7981&\text{ }&0.8163    \end{bmatrix}


  1. Dewey, G., Relative Frequency of English Spellings, Teachers College Press, Columbia University, New York, 1970
  2. Larsen, R. J., Marx., M. L., An Introduction to Mathematical Statistics and its Applications, Prentice-Hall, Inc., Englewood Cliffs, New Jersey, 1981

Which Car Rental Company Is More Expensive, Budget or Avis?

Any “budget” conscious consumer/traveler wants to find a good deal wherever and whenever he or she can, especially when it comes to airfare and rental cars. In tough economic times, bargain hunting is the norm rather than the exception. This post is an exercise in price comparison for rental cars, focusing on two popular car rental companies, Budget and Avis. It is also an opportunity to use the one-sample t-procedures (test and confidence interval) on matched pairs data.

The following table shows the prices found on the websites of Budget Car Rental and Avis Car Rental. The prices are one-day rental prices for full size sedans quoted for December 12, 2011 (a non-holiday Monday) at 35 of the busiest airports in the United States. These prices are basic car rental prices without any discount or upgrade.

Matched Pairs Data

The price data in Table 1 are best viewed as matched pairs data. In such data, observations are taken on the same individual (in this case the same airport) under different conditions (one price is for Budget and one price is for Avis). The Budget car rental prices and the Avis car rental prices are said to be dependent samples.

The alternative to treating Table 1 as matched pairs data is to view the Budget prices and the Avis prices as two independent samples and then use two-sample t-procedures to perform the analysis. But this approach is not the best way of using the data. When you are at the Chicago airport, you do not care about the car rental prices at the Tampa airport or any other airport. Just as a traveler only compares prices among the car rental companies at the same airport, we should compare prices within each matched pair. Thinking of the data as matched pairs thus affords us the best way to compare prices between the two companies in Table 1.

To analyze the data in Table 1, we first take the difference between the Budget rental prices and the Avis rental prices (Avis minus Budget). These 35 differences form a single sample (the last column in Table 1). The first difference is -$3.29 (indicating that Avis is cheaper by this amount at the Atlanta airport). The second difference is $45.91 (indicating that Budget is cheaper by this amount at the Chicago airport). Most of the calculations and analysis will be done using this “differenced” sample. Thus, the comparative design of using matched pairs data makes use of single-sample procedures (in this case, the one-sample t-procedures).
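The differencing step can be sketched in a few lines. The individual prices below are hypothetical stand-ins (Table 1 is not reproduced here), but the first two rows are chosen so that their differences match the Atlanta and Chicago figures quoted above.

```python
import statistics

# Hypothetical matched-pairs prices, one row per airport, standing in for
# Table 1 (the actual table has 35 airports). The first two rows are set
# so the differences match the Atlanta (-3.29) and Chicago (45.91) values.
budget = [58.69, 48.00, 62.10, 71.50, 58.30]
avis   = [55.40, 93.91, 60.80, 79.20, 70.10]

# Form the single "differenced" sample (Avis minus Budget).
diffs = [a - b for a, b in zip(avis, budget)]
xbar = statistics.mean(diffs)
s = statistics.stdev(diffs)  # sample standard deviation (n - 1 denominator)

print(diffs)
print(f"mean difference = {xbar:.2f}, sd = {s:.2f}")
```

All subsequent inference (the t-test and the t-interval) works only with `diffs`, `xbar` and `s`, which is exactly why matched pairs analysis reduces to one-sample procedures.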

Initial Look of the Data

Most of the differences are positive (i.e. Avis charges more than Budget). Some of the differences are small. But some of the differences are in the $30 to $50 range. So we need to take a closer look.

The following table shows the sample means and sample standard deviations for the Budget prices, the Avis prices and the differences. At these 35 airports, the average one-day rental price for Budget is $60 and the average Avis price is $73.74. The average price differential is $13.64, meaning that the Avis price is 22.7% over the average Budget price. Any “budget” conscious traveler should care about a difference of $13. The question is: is the price differential we are seeing statistically significant? Specifically, do the data in Table 1 provide evidence that Budget Car Rental is less expensive than Avis?

The Requirements for Using the t-Procedures

The use of the t-procedures (confidence interval and test) rests on two assumptions. One is that the data set is a simple random sample. The second is that the distribution of the data measurements has no outliers and follows a normal distribution (or that the sample size is large).

The car rental prices in Table 1 are not a random sample; they are simply the car rental prices from Budget and Avis at the 35 busiest airports in the United States. Because these busy airports are spread out across various regions of the United States and because they are of varying sizes, we feel that the sampled car rental prices are representative of car renting experiences at these airports. For these reasons, we feel that there is value in carrying out this comparison.

Because the sample size is relatively large (n=35), checking the normality assumption is not critical. The car rental prices do not seem to have any extreme data values.

About Technology

To carry out the t-test and t-interval, we should use technology (we use TI-83 plus). If software is not used, a t-distribution table is needed to find the p-value and t-critical value. Refer to your favorite statistics textbook for a t-table or use this t-table.

One-Sample t-Test

With a price differential of $13.64, we see that Avis is more expensive. Let’s confirm it with a one-sample t-test. To assess whether Budget is less expensive than Avis, we test the following hypotheses:

\displaystyle \begin{aligned}(1) \ \ \ \ \ \ \ \ \ \ \  &H_0: \mu = 0 \\&H_1: \mu > 0  \end{aligned}

where \mu is the mean difference in car rental prices (Avis minus Budget). The null hypothesis H_0 says that there is no difference in prices between Budget and Avis. The alternative hypothesis H_1 says that Budget is less expensive than Avis (i.e. Avis minus Budget > 0).

The mean and standard deviation of the “differenced” sample (the last column in Table 1) are:

\displaystyle \begin{aligned}(2) \ \ \ \ \ \ \ \ \ \ \  &\overline{x}=\$13.64029 \\&s=\$14.60763  \end{aligned}

The one-sample t-statistic is:

\displaystyle \begin{aligned}(3) \ \ \ \ \ \ \ \ \ \ \  t&=\frac{\overline{x}-0}{\displaystyle \frac{s}{\sqrt{n}}}=\frac{13.64029-0}{\displaystyle \frac{14.60763}{\sqrt{35}}}=5.5243  \end{aligned}
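The t-statistic in (3) can be reproduced directly from the summary values in (2):

```python
import math

# One-sample t-statistic for H0: mu = 0 vs H1: mu > 0, computed
# from the summary statistics of the differenced sample.
n = 35
xbar = 13.64029   # mean of the differences, Avis minus Budget (dollars)
s = 14.60763      # sample standard deviation of the differences (dollars)

t = (xbar - 0) / (s / math.sqrt(n))
print(round(t, 4))  # ≈ 5.5243
```

With 34 degrees of freedom, a t-statistic above 5.5 lies far in the right tail, which is why the p-value reported next is so small.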

The p-value of this t-test is found from the t-distribution with 34 degrees of freedom (one less than the sample size). There are two ways to accomplish this (looking up a table or using software). Based on a t-distribution table (such as this one), p<0.0005. Software (using TI-83 plus) gives a p-value that is much smaller, p=1.78871591 \times 10^{-6}, which is approximately p=0.0000018.

Because of the small p-value, the data provide clear evidence in favor of the alternative hypothesis (i.e. we reject the null hypothesis H_0). A price differential this large (as large as $13.64) is very unlikely to occur by chance if there is indeed no difference in prices between Budget and Avis. We now have evidence that Budget is less expensive on average (Avis is more expensive on average).

One-Sample t-Interval

What is the magnitude of the price differential of Avis over Budget with a margin of error? We want to obtain a 95% confidence interval for the mean difference in car rental prices. To this end, we need the critical value t=2.032 from a t-distribution table. The margin of error is:

\displaystyle \begin{aligned}(4) \ \ \ \ \ \ \ \ \ \ \  &t \times \frac{s}{\sqrt{n}}=2.032 \times \frac{14.60763}{\sqrt{35}}=5.01729  \end{aligned}

and the confidence interval is:

\displaystyle \begin{aligned}(5) \ \ \ \ \ \ \ \ \ \ \  \overline{x} \pm t \times \frac{s}{\sqrt{n}}&=13.64027 \pm 5.01729 \\&=(8.62298,18.65756)  \end{aligned}
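The margin of error in (4) and the interval in (5) can be recomputed from the summary values (any difference in the last decimal places is rounding):

```python
import math

# 95% confidence interval for the mean price difference, using the
# critical value t* = 2.032 from the t-distribution with 34 df.
n = 35
xbar = 13.64029
s = 14.60763
t_star = 2.032

moe = t_star * s / math.sqrt(n)
lower, upper = xbar - moe, xbar + moe
print(f"margin of error = {moe:.2f}")
print(f"95% CI = ({lower:.2f}, {upper:.2f})")
```

The interval lies entirely above zero, which agrees with the test's rejection of the null hypothesis of no price difference.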

With 95% confidence, the estimated average price differential of Avis over Budget is $13.64 with a margin of error of $5.02. On average, you tend to save anywhere from $8.62 to $18.66 on a one-day car rental if you go with Budget.


It is clear that Avis is more expensive (at least for one-day rentals of full size sedans on a Monday). Perhaps other factors could alter the picture. For example, this comparison does not account for discounts or special promotions. There are variations on the exercise done here. One is to compare week-long rentals. Another is to compare vehicles in other classes (e.g. economy or SUV). Yet another is to compare prices for busy seasons (e.g. holiday weekends).

