Looking at Spread

In the previous post Two Statisticians in a Battlefield, we discussed the importance of reporting a spread in addition to an average when describing data. In this post we look at three specific notions of spread. They are measures that indicate the “range” of the data. First, look at the following three statements:

  1. The annual household incomes in the United States range from under $10,000 to $7.8 billion.
  2. The middle 50% of the annual household incomes in the United States range from $25,000 to $90,000.
  3. The birth weight of European baby girls ranges from 5.4 pounds to 9.1 pounds (mean minus 2 standard deviations to mean plus 2 standard deviations).

All three statements describe the range of the data in question. Each can be interpreted as a measure of spread since each statement indicates how spread out the data measurements are.

__________________________________________________________
1. The Range

The first statement is related to the notion of spread called the range, which is calculated as:

\displaystyle (1) \ \ \ \ \ \text{Range}=\text{Maximum Data Value}-\text{Minimum Data Value}

Note that the range as calculated in (1) is simply the length of the interval from the smallest data value to the largest data value. We do not know exactly what the largest household income is, so we assume it is the household of Bill Gates ($7.8 billion is a number we found on the Internet). The range of annual household incomes is thus about $7.8 billion (all the way from an amount near $0 to $7.8 billion).

Figure 1

The range is a measure of spread since the larger this numerical summary, the more dispersed the data (i.e. the data are more scattered within the interval of “min to max”). However, the range is not very informative. It is calculated using the two data values that are, in this case, outliers. Because the range is influenced by extreme data values (i.e. outliers), it is not used often and is usually not emphasized. It is given here to provide a contrast to the other measures of spread.
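To make the sensitivity to extreme values concrete, here is a minimal Python sketch of the calculation in (1); the income values are hypothetical:

```python
# A small hypothetical income sample (dollar values made up for illustration)
incomes = [9_500, 25_000, 48_000, 52_000, 90_000, 7_800_000_000]

# Range = maximum data value - minimum data value, as in (1)
data_range = max(incomes) - min(incomes)
print(data_range)  # 7_799_990_500: the single extreme value dominates the range
```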

__________________________________________________________
2. Interquartile Range

The interval described in the second statement is the more stable part of the income scale. It does not contain outliers such as Bill Gates and Warren Buffett. It also does not contain the households in extreme poverty. As a result, the interval of $25,000 to $90,000 describes the “range” of household incomes that are more representative of the working families in the United States.

The measure of spread at work here is called the interquartile range (IQR), which is based on the numerical summary called the 5-number summary, as indicated in (2) below and in the following figure.

\displaystyle \begin{aligned}(2) \ \ \ \ \ \text{5-number summary}&\ \ \ \ \ \text{Min}=\$0 \\&\ \ \ \ \ \text{first quartile}=Q_1=\$25,000 \\&\ \ \ \ \ \text{median}=\$50,000 \\&\ \ \ \ \ \text{third quartile}=Q_3=\$90,000 \\&\ \ \ \ \ \text{Max}=\$7.8 \text{ billion} \end{aligned}

Figure 2

As demonstrated in Figure 2, the 5-number summary breaks up the data range into 4 quarters. Thus the interval from the first quartile (Q_1) to the third quartile (Q_3) contains the middle 50% of the data. The interquartile range (IQR) is defined to be the length of this interval. The IQR is computed as in the following:

\displaystyle \begin{aligned}(3) \ \ \ \ \ \text{IQR}&=Q_3-Q_1 \\&=\$90,000-\$25,000 \\&=\$65,000 \end{aligned}

Figure 3

The IQR is the length of the interval from Q_1 to Q_3. The larger this measure of spread, the more dispersed (or scattered) the data are within this interval. The smaller the IQR, the more tightly the data cluster around the median. The median as a measure of center is a resistant measure since it is not influenced significantly by extreme data values. Likewise, the IQR is not influenced by extreme data values, so it too is a resistant numerical summary. Thus the IQR is typically used as a measure of spread for skewed distributions.
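Here is a minimal Python sketch of computing the 5-number summary and the IQR; the income values are hypothetical, and note that software packages use slightly different quartile conventions:

```python
import numpy as np

# Hypothetical income sample for illustration (dollars)
incomes = np.array([12_000, 25_000, 38_000, 50_000, 65_000, 90_000, 250_000])

# 5-number summary: min, Q1, median, Q3, max
minimum, q1, median, q3, maximum = np.percentile(incomes, [0, 25, 50, 75, 100])

# IQR = Q3 - Q1, the length of the interval covering the middle 50% of the data
iqr = q3 - q1
print(q1, median, q3, iqr)
```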
__________________________________________________________
3. Standard Deviation

The third statement also presents a “range” to describe the spread of the data (in this case birth weights in pounds of girls of European descent). We will see that the interval of 5.4 pounds to 9.1 pounds covers approximately 95% of the baby girls in this ethnic group.

The measure of spread at work here is called standard deviation. It is a numerical summary that measures how far a typical data value deviates from the mean. Standard deviation is usually used with data that are symmetric.

The information in the third statement is found in this online article. The source data are from this table and are repeated below.

\displaystyle \begin{aligned}(4) \ \ \ \ \ \text{ }&\text{Birth weights of European baby girls} \\&\text{ } \\&\ \ \ \text{mean}=3293.5 \text{ grams }=7.25 \text{ pounds} \\&\ \ \ \text{standard deviation }=423.3 \text{ grams}=0.93126 \text{ pounds} \end{aligned}

The third statement at the beginning tells us that birth weights of girls of this ethnic background can range from 5.4 pounds to 9.1 pounds. The interval spans two standard deviations on either side of the mean. That is, the value of 5.4 pounds is two standard deviations below the mean of 7.25 pounds and the value of 9.1 pounds is two standard deviations above the mean.

\displaystyle \begin{aligned}(5) \ \ \ \ \ \text{ }&\text{Birth weights of European baby girls} \\&\text{ } \\&\ \ \ 7.25-2 \times 0.93126=5.4 \text{ pounds} \\&\ \ \ 7.25+2 \times 0.93126=9.1 \text{ pounds} \end{aligned}

There certainly are babies with birth weights above 9.1 pounds and below 5.4 pounds. So what proportion of babies fall within this range?

We assume that the birth weights of babies in a large enough sample follow a normal bell curve. The standard deviation has a special interpretation if the data follow a normal distribution. In a normal bell curve, about 68% of the data are within one standard deviation of the mean, about 95% of the data are within two standard deviations of the mean, and about 99.7% of the data are within three standard deviations of the mean. With respect to the birth weights of baby girls of European descent, we have the following bell curve.

Figure 4
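As a quick check of this interpretation, here is a small simulation sketch, assuming birth weights follow a normal distribution with the mean and standard deviation given in (4):

```python
import numpy as np

# Simulate birth weights under the normality assumption, using the
# mean and standard deviation from (4)
rng = np.random.default_rng(0)
mean, sd = 7.25, 0.93126  # pounds
weights = rng.normal(mean, sd, 1_000_000)

# The 68-95-99.7 rule: proportion within k standard deviations of the mean
for k in (1, 2, 3):
    within = np.mean(np.abs(weights - mean) < k * sd)
    print(k, round(within, 3))  # approximately 0.683, 0.954, 0.997
```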

__________________________________________________________
Remark

The three measures of spread discussed here all try to describe the spread of the data by presenting a “range”. The first one, called the range, is not very useful since the min and the max can be outliers. The median (as a center) and the interquartile range (as a spread) are typically used to describe skewed distributions. The mean (as a center) and the standard deviation (as a spread) are typically used to describe data distributions that have no outliers and are symmetric.

__________________________________________________________
Related Blog Posts

For more information about resistant numerical summaries, go to the following blog posts:

When Bill Gates Walks into a Bar

Choosing a High School

For a more detailed discussion of the measures of spread discussed here, go to the following blog post:
Looking at LA Rainfall Data


Two Statisticians in a Battlefield

Two soldiers, both statisticians, were fighting side by side in a battlefield. They spotted an enemy soldier and they both fired their rifles. One statistician soldier fired one foot to the left of the enemy soldier and the other statistician soldier fired one foot to the right of the enemy soldier. They immediately gave each other a high five and exclaimed, “on average the enemy soldier is dead!”

Of course this is an absurd story. Not only was the enemy soldier not dead, he was ready to fire back at the two statistician soldiers. This story also reminds the author of this blog of a man who had one foot in a bucket of boiling hot water and the other foot in a bucket of ice cold water. The man said, “on average, I ought to feel pretty comfortable!”

In statistics, center (or central tendency or average) refers to a set of numerical summaries that attempt to describe what a typical data value might look like. These are “average” or representative values of the data distribution. The more common notions of center or average are the mean and the median. The absurdity in these two stories points to the inadequacy of using center alone in describing a data distribution. To get a more complete picture, we need to use spread too.

Spread (or dispersion) refers to a set of numerical summaries that describe the degree to which the data are spread out. Some of the common notions of spread are range, 5-number summary, interquartile range (IQR), and standard deviation. We will not get into the specifics of these notions here. Refer to Looking at Spread for a more detailed discussion of these specific notions of spread. Our purpose is to discuss the importance of spread.

Why is spread important? Why is using average alone not sufficient in describing a data set? Here are several points to consider.

________________________________________________________________
1. Using average alone can be misleading.

The stories mentioned at the beginning aside, using average alone gives incomplete information. Depending on the point of view, using average alone can make things look better or worse than they really are.

A handy example would be the Atlanta Olympics in 1996. In the summer time, Atlanta is hot. In fact, some people call the city Hotlanta! Yet in the bid for the right to host the Olympics, the planning committee described the temperature using average only (the daily average temperature being 75 degrees Fahrenheit). Of course, a temperature of 75 degrees would indeed be comfortable, but it was clearly not what the visitors would experience during the middle of the day!

________________________________________________________________
2. A large spread usually means inconsistent results or performances.

A large spread indicates that there is a great deal of dispersion or scatter in the data. If the data are measurements of a manufacturing process, a large spread indicates that the product may be unreliable or substandard. So a quality control procedure would monitor average as well as spread.

If the data are exam scores, a large spread indicates that there exists a wide range of abilities among the students. Thus teaching a class with a large spread may require a different approach than teaching a class with a smaller spread.

________________________________________________________________
3. In investment, standard deviation is used as a measure of risk.

As indicated above, standard deviation is one notion of spread. In investment, risk refers to the chance that the actual return on an investment may deviate greatly from the expected return (i.e. the average return). One way to quantify risk is to calculate the standard deviation of the distribution of returns on the investment. The calculation of the standard deviation is based on the deviation of each data value from the mean; it gives an indication of how far an “average” data point deviates from the mean.

A large standard deviation of the returns on an investment indicates a broad range of possible investment returns (i.e. it is a risky investment: there is a chance to make a great deal of money and there is also a chance to lose your original investment). A small standard deviation indicates that there will likely not be many surprises (i.e. the actual returns will likely not be too different from the average return).

Thus it pays for any investor to pay attention to the average returns that are expected as well as the standard deviation of the rate of returns.
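To make this concrete, here is a small Python sketch comparing two hypothetical investments with the same average return but very different standard deviations; all the return figures are made up for illustration:

```python
import statistics

# Hypothetical annual returns (in percent) for two investments that have
# the same average return but very different spreads
steady = [6.5, 7.0, 7.2, 6.8, 7.5, 7.0]
risky = [25.0, -12.0, 30.0, -8.0, 15.0, -8.0]

for name, returns in [("steady", steady), ("risky", risky)]:
    mean = statistics.mean(returns)
    sd = statistics.stdev(returns)  # sample standard deviation as a risk measure
    print(f"{name}: mean = {mean:.1f}%, standard deviation = {sd:.1f}%")
```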

________________________________________________________________
4. Without a spread, it is hard to gauge the significance of an observed data value.

When an observed data value deviates from the mean, it may not be easy to gauge the significance of the observed data value. For the type of measurements that we deal with in our everyday experience (e.g. height and weight measurements), we usually have good ideas whether the data values we observe are meaningful.

But for data measurements that are not familiar to us, we usually have a harder time making sense of the data. For example, if the observed data value is different from the average, how do we know whether the difference is just due to chance or whether the difference is real and significant? This kind of question is at the heart of statistical inference. Many procedures in statistical inference require the use of a spread in addition to an average.

For more information about the notions of spread, refer to other discussion in this blog or the following references.

Reference

  1. Moore. D. S., Essential Statistics, W. H. Freeman and Company, New York, 2010
  2. Moore. D. S., McCabe G. P., Craig B. A., Introduction to the Practice of Statistics, 6th ed., W. H. Freeman and Company, New York, 2009

The Tax Return of Mitt Romney

Mitt Romney is currently a candidate for the 2012 Republican Party nomination for U.S. President. Recently, bowing to pressure from another Republican presidential candidate, he released his past tax returns. The release of these tax returns opened up a window on the personal finances of a very rich presidential candidate. Immediately upon the release on January 24, much of the discussion in the media centered around the fact that Romney paid an effective tax rate of about 15%, which is much less than the rates paid by many ordinary Americans. Our discussion here is neither about tax rates nor politics. For the author of this blog, Romney’s tax return provides a rich opportunity to talk about statistics. Mitt Romney is an excellent example for opening up a discussion on income distribution and several related statistical concepts.


Mitt Romney

Mitt Romney’s tax return in 2010 consisted of over 200 pages. The 2010 tax return can be found here (the PDF file can be found here). The following is a screen grab from the first page.

Figure 1

Note that the total adjusted gross income was over $21 million. The taxable interest income alone was $3.2 million. Most of Romney’s income was from capital gains (about $12.5 million). It is clear that Romney is an ultra rich individual. But how wealthy? For example, where does Romney fall on the income scale relative to other Americans? To get a perspective, let’s look at some income data from the US Census Bureau. The following is a histogram constructed using a frequency table from the Current Population Survey. The source data are from this table.

Figure 2

The horizontal axis in Figure 2 is divided into intervals made up of increments of $10,000 all the way from $0 to $250,000 plus. According to the Current Population Survey, about 7.8% of all American households had income under $10,000 in 2010 (almost 1 out of 13 households). About 12.1% of all households had income between $10,000 and $20,000 (about 1 in 8 households). Only 2.1% of the households had income over $250,000 in 2010. Obviously Romney belongs to this category. The graphic in Figure 2 shows that Romney is in the top 2% of all American households.

Of course, being in the top 2% is not the entire story. There is a long way from $250,000 (a quarter of a million) to $21 million! Clearly Romney is in the top range of the top class indicated in Figure 2. Romney is actually in the top 1% of the income distribution. In fact, according to this online report from the Wall Street Journal, Romney is well above the top 1% category: he is in the top 0.0025%! According to one calculation (mentioned in the Wall Street Journal piece), there are at least 4,000 families in this top 0.0025% category, so even this rarefied group numbers in the thousands.

The figure below shows that the sum of the percentages from the first 5 bars in the histogram equals 50.5%. This confirms the well-known figure that the median household income in the United States is around $50,000.

Figure 3
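The arithmetic behind Figure 3 amounts to accumulating the bracket percentages until they pass 50%. Here is a sketch; the first two proportions are the CPS figures quoted above, while the remaining three are hypothetical stand-ins chosen so that the first five brackets sum to 50.5%:

```python
import numpy as np

# Proportions of households in successive $10,000 brackets. The first two
# (7.8% and 12.1%) are the CPS figures quoted above; the last three are
# hypothetical stand-ins for illustration.
props = np.array([0.078, 0.121, 0.110, 0.100, 0.096])

cumulative = np.cumsum(props)
# The median bracket is the first one where the cumulative proportion passes 0.50
median_bracket = int(np.argmax(cumulative >= 0.5))
print(median_bracket, cumulative[median_bracket])  # 4 -> the $40,000-$49,999 bracket, 0.505
```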

The histograms in Figures 2 and 3 are clear visual demonstrations that income distributions are skewed (most of the households make modest incomes). Most of the households are located in the lower income range; the first 5 intervals alone contain 50% of the households. The sum of the percentages of the first 10 vertical bars ($0 to $99,999) is about 80%. So making a six-figure income lands you in the top 20% of the households. Both histograms are classic examples of a skewed right distribution. The vertical bars on the left are tall and the bars taper off gradually at first but later drop rather precipitously.

The last two vertical bars (in green) are aggregations of all the vertical bars in the $250,000+ range had we continued to draw the histograms using $10,000 increments. Another clear visual sign that this is a skewed distribution is that the left tail (the length of the horizontal axis to the left of the median) differs greatly from the right tail (the length of the horizontal axis to the right of the median). When the right tail is much longer than the left tail, the distribution is called a skewed right distribution (see the figure below).

Figure 4

On the other hand, when the left tail is longer, it is called a skewed left distribution.

Another indication that the income distribution is skewed is that the mean income and the median income are far apart. According to the source data, the mean household income in 2010 was $67,530, much higher than the median of about $50,000 (see the figure below).

Figure 5

Whenever the mean is much higher than the median, it is usually the case that the distribution is skewed right (as in Figure 5). On the other hand, when the opposite is true (the median is much higher than the mean), it is most of the time a skewed left distribution.

A related statistical concept is that of resistant measures. The median is a resistant measure because it is not influenced significantly by extreme data values (in this case extremely high income and wealth). On the other hand, the mean is not resistant. As a result, in a skewed distribution, the median is a better indication of an average. This is why income is usually reported using the median (e.g. in newspaper articles).

For a more detailed discussion of resistant measures, see When Bill Gates Walks into a Bar and Choosing a High School.


What Are Helicopter Parents Up To?

We came across several interesting graphics about helicopter parents. These graphics give some indication as to what the so-called helicopter parents are doing in terms of shepherding their children through the job search process. The graphics are screen grabs from a report that came from Michigan State University. Here are the graphics (I put the most interesting one first).

Figure 1

Figure 2

Figure 3

Figure 1 shows a list of 9 possible job search or work related activities that helicopter parents undertake on behalf of their children. The information in this graphic comes from a survey of more than 700 employers that were in the process of hiring recent college graduates. Nearly one-third of these employers indicated that parents had submitted resumes on behalf of their children, sometimes without the knowledge of their children. About one quarter of the employers in the sample reported that they saw parents trying to urge these companies to hire their children!

To me, the most interesting point is that about 4% of these employers reported that parents actually showed up for the interviews! Maybe these companies should hire the parents instead! Which companies do not like job candidates that are enthusiastic and show initiative?

Figures 2 and 3 present the same information in different formats. One is a pie chart and the other is a bar graph. Both break down the survey responses according to company size. The upshot: large employers tend to see more cases of helicopter parental involvement in their college recruiting. This makes sense since larger companies tend to have regional and national brand recognition. Larger companies also tend to recruit on campus more often than smaller companies.

Any interested reader can read more about this report that came out of Michigan State University. I found this report through a story on npr.org called Helicopter Parents Hover in the Workplace.


Another Look at LA Rainfall

In two previous posts, we examined the annual rainfall data in Los Angeles (see Looking at LA Rainfall Data and LA Rainfall Time Plot). The data we examined in these two posts contain 132 years’ worth of annual rainfall data collected at the Los Angeles Civic Center from 1877 to 2009 (data found in the Los Angeles Almanac). These annual rainfall data represent an excellent opportunity to learn the techniques from a body of data analysis methods grouped under the broad topic of descriptive statistics (i.e. using graphs and numerical summaries to answer questions or find meaning in data).

Here are two graphics presented in Looking at LA Rainfall Data.

Figure 1

Figure 2

These charts are called histograms and they look the same (i.e. they have the same shape). But they present slightly different information. Figure 1 shows the frequency of annual rainfall. Figure 2 shows the relative frequency of annual rainfall.

For example, Figure 1 indicates that there were only 3 years (out of the last 132 years) with annual rainfall under 5 inches. On the other hand, there were only 2 years with annual rainfall above 35 inches. So drought years did happen but not very often (only 3 out of 132 years). Extremely wet seasons did happen but not very often. Based on Figure 1, we see that in most years, annual rainfall ranges from 5 to about 25 inches. The most likely range is 10 to 15 inches (45 years out of the last 132 years). In Los Angeles, annual rainfall above 25 inches is rare (occurring in only 12 of the 132 years).

Figure 1 is all about counts. It tells you how many of the data points are in a certain range (e.g. 45 years with between 10 and 15 inches). For this reason, it is called a frequency histogram. Figure 2 gives the same information in terms of proportions (or relative frequencies). For example, looking at Figure 2, we see that about 34% of the time, annual rainfall is from 10 to 15 inches. Thus, Figure 2 is called a relative frequency histogram.
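Here is a sketch of how the counts behind Figure 1 and Figure 2 are related; it assumes the 132 annual rainfall totals have been saved to a file (the file name is a placeholder):

```python
import numpy as np

# Load the 132 annual rainfall totals in inches (hypothetical file name)
rainfall = np.loadtxt("la_annual_rainfall.txt")

bins = np.arange(0, 45, 5)  # 5-inch classes: 0-5, 5-10, ..., 35-40
counts, edges = np.histogram(rainfall, bins=bins)

# A frequency histogram plots the counts; a relative frequency histogram
# plots the same bars divided by the total number of years
proportions = counts / counts.sum()
for lo, c, p in zip(edges[:-1], counts, proportions):
    print(f"{lo:>2.0f}-{lo + 5:>2.0f} inches: {c:>3d} years ({p:.3f})")
```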

Keep in mind that raw data usually are not informative until they are summarized. The first step in summarization should be a graph (if possible). After we have graphs, we can look at the data further using numerical calculations (i.e. using various numerical summaries such as the mean, median, standard deviation, 5-number summary, etc.). To see how this is done, see the previous post Looking at LA Rainfall Data.

What kind of information can we get from graphics such as Figure 1 and Figure 2 above? For example, we can tell which data points are most likely (e.g. annual rainfall of 10 to 15 inches). Which data points are considered rare or unlikely? Where do most of the data points fall?

This last question should be expanded upon. Looking at Figure 2, we see that about 60% of the data are under 15 inches (0.023+0.242+0.341=0.606). So for close to 80 years out of the last 132 years, the annual rainfall was 15 inches or less. About 81% of the data are 20 inches or less. So in the overwhelming majority of the years, the annual rainfall was 20 inches or less. Annual rainfall of more than 20 inches is relatively rare (happening only about 20% of the time).

We have a name for the data situation we see in Figure 1 and Figure 2. The annual rainfall data in Los Angeles have a skewed right distribution. This is because most of the data points are on the left side of the histogram. Another way to see this is that the tallest bar in the histogram is the one at 10 to 15 inches, and the side to the right of the peak of the histogram is longer than the side to the left of the peak. In other words, when the right tail of the histogram is longer, it is a skewed right distribution. See the figure below.

Figure 3

Besides the look of the histogram, a skewed right distribution has another characteristic: the mean is typically quite a bit larger than the median. For example, the mean of the annual rainfall data is 14.98 inches (essentially 15 inches). Yet the median is only 13.1 inches, almost two inches lower. Whenever the mean and the median are significantly far apart, we have a skewed distribution on hand. When the mean is a lot higher, it is a skewed right distribution. When the opposite situation occurs (the mean is a lot lower than the median), it is a skewed left distribution. When the mean and median are roughly equal, it is likely a symmetric distribution.
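A quick simulation illustrates this mean-median relationship for a skewed right shape; the gamma model and its parameters below are arbitrary choices, not a fit to the rainfall data:

```python
import numpy as np

# Generate a right-skewed sample (gamma distribution chosen arbitrarily)
rng = np.random.default_rng(1)
skewed = rng.gamma(shape=2.0, scale=7.5, size=100_000)

print(round(float(skewed.mean()), 1))      # the long right tail pulls the mean up
print(round(float(np.median(skewed)), 1))  # the median sits below the mean
```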


Is College Worth It?

Is college worth it? This was the question posed by the authors of the report called College Majors, Unemployment and Earnings, which was produced recently by The Center on Education and the Workforce. We do not plan on giving a detailed account of this report. Any interested reader can read the report here. Instead, we would like to look at two graphics in this report, which are reproduced below. These two graphics are very interesting and capture all the main points of the report. The data used in the report came from the American Community Survey for the years 2009 and 2010.

Figure 1

Figure 2

Figure 1 shows the unemployment rates by college major for three groups of college degree holders, namely the recent college graduates (shown with green marking), the experienced college graduates (blue marking) and the college graduates who hold graduate degrees (red marking). Figure 2 shows the median earnings by major for the same three groups of college graduates (using the same colored markings).

Figure 1 ranks the unemployment rates for recent college graduates from highest to lowest. You can see the green markings descending from 13.9% (architecture) to 5.4% (education and health). So this graphic shows clearly that the employment prospects of college graduates depend on their majors, which is one of the main points of the report.

The graphic in Figure 1 shows that all recent college graduates are having a hard time finding work. The unemployment rate for recent college graduates is 8.9% (not shown in Figure 1). The employment picture for recent architecture graduates is especially bleak, which is due to the collapse of the construction and home building industry in the recession. The unemployment rates for recent college graduates who majored in education and healthcare are relatively low, reflecting the reality that these fields are either stable or growing.

Everyone is feeling the pinch in this tough economic environment. Even recent graduates in technical fields are experiencing higher than usual unemployment rates. For example, the unemployment rates for recent college graduates in engineering and science, though relatively low compared to architecture, are at 7.5% and 7.7%, respectively. For recent graduates in computers and mathematics, the unemployment rate is 8.2%, approaching the average rate of 8.9% for recent college graduates.

Experienced college graduates fare much better than recent graduates; it is much more likely for experienced college graduates to be working. Looking at Figure 1, another observation is that graduate degrees make a huge difference in employment prospects across all majors.

The graphic in Figure 2 suggests that the earnings of college graduates also depend on the subjects they study, which is another main point of the report. The technical majors earn the most. For example, the median earning among recent engineering graduates is $55,000, while the median for arts majors is $30,000. Aside from the technical, business, and healthcare majors, the median earnings of recent college graduates are in the low $30,000s (just look at the green markings in Figure 2).

Figure 2 also shows that people with graduate degrees have higher earnings across all majors. The premium in earnings for graduate degree holders is substantial and is found across the board. Though the graduate degree advantage is seen in all majors, it is especially pronounced among the technical fields (just look at the descending red markings in Figure 2).

So two of the main points are: (1) the employment prospects of college graduates depend on their majors, and (2) the earning potential of college graduates also depends on the subjects they study. Is college worth it? The report is not trying to persuade college bound high school seniors not to go to college. On the contrary, the authors of the report answer the question in the affirmative. The authors are merely providing the facts that all prospective college students should consider before they pick their majors. The two graphics shown above are effective demonstrations of the facts presented in the report. According to the authors, students “should do their homework before picking a major, because, when it comes to employment prospects and compensation, not all college degrees are created equal.”


Cryptography and Presidential Inaugural Speeches

Given a letter in English, how often does it appear in normal usage of the English language? Some letters appear more often than others. For example, the last letter Z is not common. The vowels are very common because they are needed in making words. The following figure shows the relative frequency of the English letters obtained empirically (see [1]). Dewey, the author of [1], obtained this frequency distribution after examining a total of 438,023 letters. We came across this letter frequency distribution in Example 2.11 on page 24 of [2]. Figure 1 displays the letter frequencies in descending order.

A letter frequency distribution such as the one in Figure 1 is important in cryptography. We explore briefly why this is the case and give an indication of why breaking a cipher is often a statistical process. We then confirm the Dewey letter frequency distribution by examining the letter frequencies in the presidential inaugural speeches of George Washington (two speeches) and Barack Obama (one speech).

The study of the frequency of letters in text is very important in cryptography. In using an algorithm to encrypt a message, the original information is called plaintext and the encrypted message is called ciphertext. In a simple encryption scheme called a substitution cipher, each letter of the plaintext is replaced by another letter. To break such a cipher, it is necessary to know the letter frequency of the language being coded. For example, if the letter W is the most frequently occurring letter in the ciphertext, this might suggest that the letter W in the ciphertext corresponds to the letter E in the plaintext, since the letter E is the most frequently occurring English letter (see Figure 1).

Figure 1 shows that the most frequently occurring letter in English is E (about 12.68% of the time). The least used letter is Z. The top 5 letters (E, T, A, O, I) comprise about 45% of the total usage. The top 8 letters comprise close to 65% of the total usage. The top 12 letters are used about 80% of the time (80.87%).

Another interesting result from Dewey’s letter frequency distribution is that the vowels comprise about 40% of the total usage. This means that the frequency of consonants is about 60%.

\displaystyle \begin{aligned}(1) \ \ \ \ \  \text{relative frequency of vowels}&=\text{relative frequency of A + relative frequency of E} \\&\ \ \ + \text{relative frequency of I + relative frequency of O} \\&\ \ \ +\text{relative frequency of U + relative frequency of Y} \\&=0.0788+0.1268+0.0707+0.0776+0.0280+0.0202 \\&=0.4021 \end{aligned}

\displaystyle \begin{aligned}(2) \ \ \ \ \  \text{relative frequency of consonants}&=1-0.4021 \\&=0.5979 \end{aligned}

The probability distribution of the letters displayed in Figure 1 is a useful tool that can aid the process of breaking an intercepted cipher. The general idea is to compare the frequency of the letters in the encrypted message with the frequency of the letters in Figure 1. Thus the most used letter in the ciphertext might correspond to the letter E, or it might correspond to T or A (as T and A are also very common in plaintext). But the most used letter in the ciphertext is likely not a Z or a Q. The second most used letter in the ciphertext might be the letter T in the plaintext, or it might be another one of the top letters. The cryptanalyst will likely need to try various combinations of mappings between the letters in the ciphertext and the plaintext. The idea described here is not a sure-fire approach, but rather a trial and error process that can help the analyst put the statistical puzzle pieces together.
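The first step of such an analysis, tallying the letter frequencies of an intercepted message, can be sketched as follows; the ciphertext below is a toy example (a Caesar shift), not a realistic cipher:

```python
from collections import Counter

def letter_frequencies(text):
    """Relative frequency of each letter A-Z, ignoring case and non-letters."""
    letters = [ch for ch in text.upper() if "A" <= ch <= "Z"]
    counts = Counter(letters)
    total = len(letters)
    return {ch: counts[ch] / total for ch in sorted(counts)}

# Toy ciphertext: "THIS IS A SECRET MESSAGE" under a Caesar shift of 3
ciphertext = "WKLV LV D VHFUHW PHVVDJH"
freqs = letter_frequencies(ciphertext)

# Rank the ciphertext letters by frequency and compare against E, T, A, O, I, ...
print(sorted(freqs.items(), key=lambda kv: -kv[1]))
```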

We now use the letters in presidential inaugural speeches to see how the Dewey letter frequency distribution holds up. We want to use text from another era (so we choose the two inaugural speeches of George Washington) as well as text that is contemporary (so we choose the inaugural speech of Barack Obama). The text of presidential inaugural speeches can be found here.

Figure 2 below shows the letter frequency in the two inaugural speeches of George Washington. There are a total of 7,641 letters (we only use the body of the speeches). Figure 3 below is a side by side comparison between the letter frequency in Figure 1 (Dewey) and the letter frequency in Washington’s two speeches (Figure 2).

Figure 3 shows that the letter frequency in Washington’s speeches is on the whole very similar to the letter frequency of Dewey. We cannot expect an exact match. But overall there is a general agreement between the two distributions.

Figure 4 below shows the letter frequency in the inaugural speech of Barack Obama. There are a total of 10,627 letters (we only use the body of the speech). Figure 5 below is a side by side comparison between the letter frequency in Figure 1 (Dewey) and the letter frequency in Obama’s speech (Figure 4).

There is also a very good agreement between the letter frequency in Dewey (the benchmark) and the letter frequency in Obama’s speech.

Despite the passage of almost 200 years, there is quite an excellent agreement between the letter usage in Washington’s speeches in 1789 and the distribution obtained by Dewey in 1970 (see Figure 3). Some letters appeared more often in Washington’s speeches (e.g. E, I and N) and some appeared less often (e.g. A). The general pattern of the letter distribution in Washington’s speeches is unmistakably similar to that of Dewey’s. Similar observations can be made about the comparison between the letter frequency in Obama’s speech and Dewey’s distribution (see Figure 5).

The following table shows the frequency of the top letter, the top 5 letters, the top 8 letters and the top 12 letters in Dewey’s distribution alongside the corresponding frequencies in the speeches of Washington and Obama. Table (1) shows that the frequencies of the top letters are quite close between Dewey’s distribution and the speeches of Washington and Obama.

\displaystyle (1) \ \ \ \ \begin{bmatrix} \text{Top Letters in Dewey's Distribution}&\text{ }&\text{Dewey}&\text{ }&\text{Washington}&\text{ }&\text{Obama}  \\\text{ }&\text{ }&\text{ } \\\text{E}&\text{ }&0.1268&\text{ }&0.1309&\text{ }&0.1268  \\ \text{E, T, A, O, I}&\text{ }&0.4517&\text{ }&0.4485&\text{ }&0.4441  \\ \text{E, T, A, O, I, N, S, R}&\text{ }&0.6451&\text{ }&0.6409&\text{ }&0.6525   \\ \text{E, T, A, O, I, N, S, R, H, L, D, U}&\text{ }&0.8087&\text{ }&0.7981&\text{ }&0.8163    \end{bmatrix}

Reference

  1. Dewey, G., Relative Frequency of English Spellings, Teachers College Press, Columbia University, New York, 1970
  2. Larsen, R. J., Marx, M. L., An Introduction to Mathematical Statistics and its Applications, Prentice-Hall, Inc., Englewood Cliffs, New Jersey, 1981

Which Car Rental Company Is More Expensive, Budget or Avis?

Any “budget” conscious consumer/traveler would want to find a good deal wherever and whenever he or she can, especially when it comes to airfare and rental cars. In tough economic times, bargain hunting is the norm rather than the exception. This is an exercise in price comparison for rental cars, focusing on two popular car rental companies, Budget and Avis. This is also an opportunity to use the one-sample t-procedures (test and confidence interval) on matched pairs data.

The following table shows the prices found on the websites of Budget Car Rental and Avis Car Rental. The prices are one-day rental prices for full size sedans quoted for the day of December 12, 2011 (a non-holiday Monday) at 35 of the busiest airports in the United States. These prices are basic car rental prices without any discount or upgrade.

Matched Pairs Data

The price data in Table 1 are best viewed as matched pairs data. In such data, observations are taken on the same individual (in this case the same airport) under different conditions (one price is for Budget and one price is for Avis). The Budget car rental prices and the Avis car rental prices are said to be dependent samples.

The alternative to thinking of Table 1 as matched pairs data is to view the Budget prices and the Avis prices as two independent samples and then use two-sample t-procedures to perform the analysis. But this approach is not the best way of using the data. When you are at the Chicago airport, you do not care about the car rental prices at the Tampa airport or any other airport. Just as you only compare prices among the car rental companies at the same airport, we should compare prices within each matched pair. Thinking of the data as matched pairs affords us the best way to compare prices between the two companies in Table 1.

To analyze the data in Table 1, we first take the difference between the Budget rental prices and the Avis rental prices (Avis minus Budget). These 35 differences form a single sample (the last column in Table 1). The first difference is -$3.29 (indicating that Avis is cheaper by this amount at the Atlanta airport). The second difference is $45.91 (indicating that Budget is cheaper by this amount at the Chicago airport). Most of the calculations and analysis will be done using this “differenced” sample. Thus, the comparative design of using matched pairs data makes use of single-sample procedures (in this case, the one-sample t-procedures).

Initial Look of the Data

Most of the differences are positive (i.e. Avis charges more than Budget). Some of the differences are small. But some of the differences are in the $30 to $50 range. So we need to take a closer look.

The following table shows the sample means and sample standard deviations for the Budget prices, the Avis prices and the differences. In these 35 airports, the average one-day rental price for Budget is $60 and the average Avis price is $73.74. The price differential is $13.64, meaning that the average Avis price is 22.7% over the average Budget price. Any “budget” conscious traveler should care about a difference of $13. The question is: is the price differential we are seeing statistically significant? Specifically, are the data in Table 1 evidence that Budget Car Rental is less expensive than Avis?

The Requirements for Using the t-Procedures

The use of the t-procedures (confidence interval and test) rests on two assumptions. One is that the data set is a simple random sample. The second is that the distribution of the data measurements has no outliers and follows a normal distribution (or that the sample size is large).

The car rental prices in Table 1 are not a random sample. They are just car rental prices from Budget and Avis at the 35 busiest airports in the United States. Because these busy airports are spread out across various regions of the United States and because they are of varying sizes, we feel that the sample car rental prices indicated here are representative of car renting experiences at these airports. For these reasons, we feel that there is value in carrying out this comparison.

Because the sample size is relatively large (n=35), the need for checking the normality assumption is not critical. The car rental prices do not seem to have any extreme data values.

About Technology

To carry out the t-test and t-interval, we should use technology (we use a TI-83 Plus). If software is not used, a t-distribution table is needed to find the p-value and the t-critical value. Refer to your favorite statistics textbook for a t-table or use this t-table.

One-Sample t-Test

With a price differential of $13.64, we see that Avis is more expensive. Let’s confirm it with a one-sample t-test. To assess whether Budget is less expensive than Avis, we test the following hypotheses:

\displaystyle \begin{aligned}(1) \ \ \ \ \ \ \ \ \ \ \  &H_0: \mu = 0 \\&H_1: \mu > 0  \end{aligned}

where \mu is the mean difference in car rental prices (Avis minus Budget). The null hypothesis H_0 says that there is no difference in prices between Budget and Avis. The alternative hypothesis H_1 says that Budget is less expensive than Avis (i.e. Avis minus Budget > 0).

The mean and standard deviation of the “differenced” sample (the last column in Table 1) are:

\displaystyle \begin{aligned}(2) \ \ \ \ \ \ \ \ \ \ \  &\overline{x}=\$13.64029 \\&s=\$14.60763  \end{aligned}

The one-sample t-statistic is:

\displaystyle \begin{aligned}(3) \ \ \ \ \ \ \ \ \ \ \  t&=\frac{\overline{x}-0}{\displaystyle \frac{s}{\sqrt{n}}}=\frac{13.64029-0}{\displaystyle \frac{14.60763}{\sqrt{35}}}=5.5243  \end{aligned}

The p-value of this t-test is found from the t-distribution with 34 degrees of freedom (one less than the sample size). There are two ways to accomplish this (looking up a table or using software). Based on a t-distribution table (such as this one), p<0.0005. Software (a TI-83 Plus) gives a p-value that is much smaller, p=1.78871591 \times 10^{-6}, which is approximately p=0.0000018.
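For readers who prefer software other than a graphing calculator, here is a sketch of the same test using scipy; the data file of the 35 differences is a hypothetical placeholder for the last column of Table 1:

```python
import numpy as np
from scipy import stats

# Load the 35 "Avis minus Budget" differences (hypothetical file name)
diffs = np.loadtxt("price_differences.txt")

# One-sample t-test of H0: mu = 0, as in (1); scipy reports a two-sided
# p-value, which we halve for the one-sided alternative H1: mu > 0
t_stat, p_two_sided = stats.ttest_1samp(diffs, popmean=0)
p_one_sided = p_two_sided / 2 if t_stat > 0 else 1 - p_two_sided / 2
print(t_stat, p_one_sided)  # about 5.52 and 0.0000018 for these data
```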

Because of the small p-value, the data provide clear evidence in favor of the alternative hypothesis (i.e. we reject the null hypothesis H_0). A price differential this large (as large as $13.64) is very unlikely to occur by chance if there is indeed no difference in prices between Budget and Avis. We now have evidence that Budget is less expensive on average (Avis is more expensive on average).

One-Sample t-Interval

What is the magnitude of the price differential of Avis over Budget with a margin of error? We want to obtain a 95% confidence interval for the mean difference in car rental prices. To this end, we need the critical value t=2.032 from a t-distribution table. The margin of error is:

\displaystyle \begin{aligned}(4) \ \ \ \ \ \ \ \ \ \ \  &t \times \frac{s}{\sqrt{n}}=2.032 \times \frac{14.60763}{\sqrt{35}}=5.01729  \end{aligned}

and the confidence interval is:

\displaystyle \begin{aligned}(5) \ \ \ \ \ \ \ \ \ \ \  \overline{x} \pm t \times \frac{s}{\sqrt{n}}&=13.64027 \pm 5.01729 \\&=(8.62298,18.65756)  \end{aligned}

The estimated average price differential of Avis over Budget is $13.64 with a margin of error of $5.02 (at 95% confidence). On average, you tend to save anywhere from $8.62 to $18.66 on a one-day car rental if you go with Budget.
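The interval computation can likewise be sketched in a few lines; as before, the data file is a hypothetical placeholder:

```python
import numpy as np
from scipy import stats

diffs = np.loadtxt("price_differences.txt")  # hypothetical file, as before
n = len(diffs)
xbar, s = diffs.mean(), diffs.std(ddof=1)

# 95% t-interval: xbar plus/minus t* x s/sqrt(n), with n - 1 degrees of freedom
t_crit = stats.t.ppf(0.975, df=n - 1)  # about 2.032 when n = 35
margin = t_crit * s / np.sqrt(n)
print(xbar - margin, xbar + margin)  # about (8.62, 18.66) for these data
```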

Remark

It is clear that Avis is more expensive (at least for one-day rentals of full size sedans on a Monday). Perhaps other factors could alter the picture. For example, this comparison does not account for discounts or special promotions. There are variations to the exercise done here. One is to compare week-long rentals. Another is to compare vehicles in other classes (e.g. economy or SUV). Yet another is to compare prices for busy seasons (e.g. holiday weekends).


Benford’s Law and US Census Data, Part II

Benford’s law is a probability model that is a powerful tool for detecting fraud and data irregularities. The first digit (or leading digit) of a number is the leftmost digit and can only be 1, 2, 3, 4, 5, 6, 7, 8 or 9. Benford’s law tells us that the first digit is 1 about 30% of the time, the first digit is 2 about 17.6% of the time, and so on (see Figure 1 below). The key to the detection of fraud is to compare the distribution of first digits in the data being investigated with the Benford distribution. Too big a discrepancy between the actual data and Benford’s law (e.g. too few 1’s) is sometimes enough to raise suspicion of fraud. In this post, we demonstrate the use of the chi-square goodness-of-fit test to compare the actual distribution of first digits with Benford’s law. We use the example discussed in the previous post (the population counts of 3,143 counties in the United States).

The following two figures show the distribution of the first digits in the population counts for the 3,143 counties in the United States. Figure 2 shows the count for each first digit. Figure 3 shows the proportion of each first digit.

The following figure is a side-by-side comparison of Figure 1 (Benford’s law) and Figure 3 (the actual proportions from the census data).

It is quite clear that the distribution of first digits in the population counts of U.S. counties follows Benford’s law quite closely. The blue bars and the orange bars have roughly the same height, with the possible exception of the first digit 5. The blue bar and the orange bar at 5 have a difference of 1.4% (0.014=0.079-0.065). Is the difference for the first digit of 5 problematic? Visually speaking, we see that the actual distribution of first digits in the county population data matches quite well with Benford’s law. Is it possible to confirm that with a statistical test? To answer these questions, we use the chi-square goodness-of-fit test.

The above questions are translated into a null hypothesis and an alternative hypothesis. The null hypothesis is that the first digits in the population counts of the 3,143 counties follow Benford’s law. The alternative hypothesis is that the first digits in the population counts do not follow Benford’s law. The following states the hypotheses more explicitly:

\displaystyle (1) \ \ \ \ \ H_0: \text{The first digits in the population counts follow Benford's law}

\displaystyle (2) \ \ \ \ \ H_1: \text{The first digits in the population counts do not follow Benford's law}

If the null hypothesis is true, we would expect 946 first digits of 1 (3,143 times 0.301) and 553.2 first digits of 2 (3,143 times 0.176) and so on. The following figure shows the expected counts of first digits under the assumption of the null hypothesis. The expected counts in Figure 5 are obtained by multiplying the proportions from Benford’s law by the total count of 3,143 (the total number of counties in the United States).

We use the chi-square statistic to measure the difference between the observed counts (in Figure 2) and the expected counts (in Figure 5). The formula for the chi-square statistic is:

\displaystyle (3) \ \ \ \ \ \chi^2=\sum \frac{(\text{observed count - expected count})^2}{\text{expected count}}

The above chi-square statistic has an approximate chi-square distribution with 8 degrees of freedom. There are 9 categories of counts (i.e. the 9 possible first digits), and the degrees of freedom are one less than the number of categories in the observed counts.

The computation of the chi-square statistic is usually performed using software. The following shows the idea behind the calculation:

\displaystyle (4) \ \ \ \ \ \chi^2=\frac{(972-946)^2}{946}+\frac{(573-553.2)^2}{553.2}+\frac{(376-392.9)^2}{392.9}

\displaystyle . \ \ \ \ \ \ \ \ \ \ \ \ \  +\frac{(325-304.9)^2}{304.9}+\frac{(205-248.3)^2}{248.3}+\frac{(209-210.3)^2}{210.3}

\displaystyle . \ \ \ \ \ \ \ \ \ \ \ \ \ +\frac{(179-182.3)^2}{182.3}+\frac{(155-160.3)^2}{160.3}+\frac{(149-144.6)^2}{144.6}=11.4

The idea behind the chi-square statistic is that for each possible first digit, we take the difference between the observed count and the expected count, square the difference, and normalize it by dividing by the expected count. For example, for the digit 1, the difference between the observed count and the expected count is 26 (=972-946). Squaring it produces 676, and dividing 676 by 946 produces 0.7146. The sum of all 9 normalized differences is 11.4.

The value of \chi^2=11.4 is a realization of the chi-square statistic stated in (3), which has an approximate chi-square distribution with 8 degrees of freedom. The probability that a chi-square random variable with 8 degrees of freedom takes a value greater than 11.4 is p=0.179614. This probability is called the p-value and is usually estimated using a chi-square table or obtained using software. We obtained this p-value using the graphing calculator TI-83 Plus.
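The same computation can be reproduced with scipy’s goodness-of-fit test; the observed counts below are the ones used in (4), and the Benford proportions are generated from the formula P(d) = log10(1 + 1/d):

```python
import numpy as np
from scipy import stats

# Observed first-digit counts for the 3,143 counties, read off Figure 2
observed = np.array([972, 573, 376, 325, 205, 209, 179, 155, 149])

# Benford proportions: P(d) = log10(1 + 1/d) for d = 1, ..., 9
digits = np.arange(1, 10)
expected = np.log10(1 + 1 / digits) * observed.sum()

# Chi-square goodness-of-fit test with 9 - 1 = 8 degrees of freedom
chi2, p_value = stats.chisquare(observed, f_exp=expected)
print(chi2, p_value)  # about 11.4 and 0.18, matching (4) and the p-value above
```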

With the p-value being 0.1796, we do not reject the null hypothesis H_0. The differences that we see between the observed counts in Figure 2 and the expected counts in Figure 5 are not sufficient evidence for us to believe that the first digits in the county population counts do not follow Benford’s law.

The calculated chi-square statistic of 11.4 captures the differences between the observed counts (in Figure 2) and the expected counts (in Figure 5). As expected, the largest portion of 11.4 is due to the difference in the digit 5.

\displaystyle (5) \ \ \ \ \ \frac{(205-248.3)^2}{248.3}=7.55

Even with the relatively large deviation in digit 5, the calculated chi-square statistic of 11.4 is not large enough for us to believe that the first digits from the census data at hand deviate from Benford’s law. Consequently the data remain consistent with the hypothesis that the distribution of first digits in this census data set follows Benford’s law.


Benford’s Law and US Census Data, Part I

The first digit (or leading digit) of a number is the leftmost digit (e.g. the first digit of 567 is 5). The first digit of a number can only be 1, 2, 3, 4, 5, 6, 7, 8, or 9, since we do not usually write a number such as 567 as 0567. Some fraudsters may think that the first digits of numbers in financial documents appear with equal frequency (i.e. each digit appears about 11% of the time). In fact, this is not the case. It was discovered by Simon Newcomb in 1881 and rediscovered by physicist Frank Benford in 1938 that the first digits in many data sets occur according to the probability distribution indicated in Figure 1 below:

The above probability distribution is now known as Benford’s law. It is a powerful and yet relatively simple tool for detecting financial and accounting fraud (see this previous post). For example, according to Benford’s law, about 30% of the numbers in legitimate data have 1 as the first digit. Fraudsters who do not know this will tend to have far fewer ones as first digits in their faked data.

Data for which Benford’s law is applicable are data that tend to spread across multiple orders of magnitude. Examples include income data of a large population and census data such as populations of cities and counties. In addition to demographic data and scientific data, Benford’s law is also applicable to many types of financial data, including income tax data, stock exchange data, and corporate disbursement and sales data (see [1]). The author of [1], Mark Nigrini, also discussed data analysis methods (based on Benford’s law) that are used in forensic accounting and auditing.

In a previous post, we compared the trade volume data of the S&P 500 stock index to Benford’s law. In this post, we provide another example of Benford’s law in action. We analyze the population data of all 3,143 counties in the United States (the data are found here). The following figures show the distribution of first digits in the population counts of all 3,143 counties. Figure 2 shows the actual counts and Figure 3 shows the proportions.

In both of these figures, there are nine bars, one for each of the possible first digits. In Figure 2, we see that there are 972 counties in the United States with 1 as the first digit in the population count (0.309 or 30.9% of the total). This agrees quite well with the proportion of 0.301 from Benford’s law. Comparing the proportions between Figure 1 and Figure 3, we see that the actual proportions of first digits in the county population counts are in good general agreement with Benford’s law. The following figure shows a side-by-side comparison of Figure 1 and Figure 3.

Looking at Figure 4, it is clear that the actual proportions of first digits in the 3,143 county population counts follow Benford’s law quite closely. Such a close match with Benford’s law would be expected in authentic and unmanipulated data. A comparison like the one shown in Figure 4 lies at the heart of any data analysis technique based on Benford’s law.
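As a sketch of how a comparison like Figure 4 is produced, here is the first-digit extraction step in Python; the census data file name is a hypothetical placeholder:

```python
import numpy as np

# Load the county population counts (hypothetical file name)
populations = np.loadtxt("county_populations.txt", dtype=int)

# The first digit is the leftmost digit of each population count
first_digits = [int(str(p)[0]) for p in populations]

# Observed proportion of each first digit, for comparison with Benford's law
for d in range(1, 10):
    proportion = first_digits.count(d) / len(first_digits)
    print(d, round(proportion, 3))
```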

Reference

  1. Nigrini M. J., I’ve Got Your Number, Journal of Accountancy, May 1999. Link
  2. US Census Bureau – American Fact Finder.
  3. Wikipedia’s entry for Benford’s law.