The first digit (or leading digit) of a number is its leftmost digit (e.g. the first digit of 567 is 5). The first digit can only be 1, 2, 3, 4, 5, 6, 7, 8, or 9, since we do not usually write a number such as 567 as 0567. Some fraudsters may think that the first digits of numbers in financial documents appear with equal frequency (i.e. each digit appears about 11% of the time). In fact, this is not the case. Simon Newcomb discovered in 1881, and physicist Frank Benford rediscovered in 1938, that the first digits in many data sets occur according to the probability distribution shown in Figure 1 below:

The above probability distribution is now known as Benford’s law. It is a powerful yet relatively simple tool for detecting financial and accounting fraud (see this previous post). For example, according to Benford’s law, about 30% of the numbers in legitimate data have 1 as the first digit. Fraudsters who do not know this will tend to have far fewer ones as first digits in their faked data.
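The distribution in Figure 1 can also be reproduced numerically. The post presents it only as a figure; the sketch below assumes the standard statement of Benford’s law, namely that the probability of first digit d is log10(1 + 1/d). The function name `benford_probability` is made up for this illustration.

```python
import math

def benford_probability(d: int) -> float:
    """Probability that the first digit equals d under Benford's law."""
    if not 1 <= d <= 9:
        raise ValueError("first digit must be between 1 and 9")
    # Assumed standard form of the law: P(d) = log10(1 + 1/d)
    return math.log10(1 + 1 / d)

# Probabilities for the nine possible first digits
benford = {d: benford_probability(d) for d in range(1, 10)}

for d, p in benford.items():
    print(f"{d}: {p:.3f}")
```

The value for digit 1 comes out near 0.301, matching the "about 30%" quoted above, and the nine probabilities add up to 1.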

Data to which Benford’s law applies tend to span multiple orders of magnitude. Examples include income data for a large population and census data such as the populations of cities and counties. In addition to demographic and scientific data, Benford’s law also applies to many types of financial data, including income tax data, stock exchange data, and corporate disbursement and sales data (see [1]). The author of [1], Mark Nigrini, also discusses data analysis methods based on Benford’s law that are used in forensic accounting and auditing.

In a previous post, we compared the trade volume data of the S&P 500 stock index to Benford’s law. In this post, we provide another example of Benford’s law in action. We analyze the population data of all 3,143 counties in the United States (data are found here). The following figures show the distribution of first digits in the population counts of all 3,143 counties. Figure 2 shows the actual counts and Figure 3 shows the proportions.

In both of these figures, there are nine bars, one for each of the possible first digits. In Figure 2, we see that there are 972 counties in the United States with 1 as the first digit in the population count (0.309 or 30.9% of the total). This agrees quite well with the proportion of 0.301 from Benford’s law. Comparing the proportions in Figure 1 and Figure 3, we see that the actual proportions of first digits in the county population counts are in good general agreement with Benford’s law. The following figure shows a side-by-side comparison of Figure 1 and Figure 3.

Looking at Figure 4, it is clear that the actual proportions of the first digits in the 3,143 county population counts follow Benford’s law quite closely. Such a close match with Benford’s law would be expected in authentic and unmanipulated data. A comparison like the one shown in Figure 4 lies at the heart of any data analysis technique that uses Benford’s law.
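A comparison of this kind is straightforward to sketch in code. The example below is only an illustration: the county population file is not reproduced here, so it uses the powers of two from 2^1 to 2^1000 as stand-in data, and the helper names `first_digit` and `first_digit_proportions` are invented for this sketch. The Benford proportions are computed from the standard formula log10(1 + 1/d), which is assumed rather than taken from the post.

```python
from collections import Counter
import math

def first_digit(n: int) -> int:
    """Return the leftmost digit of a positive integer, e.g. 567 -> 5."""
    if n <= 0:
        raise ValueError("expected a positive integer")
    return int(str(n)[0])

def first_digit_proportions(values):
    """Proportion of values starting with each digit 1 through 9."""
    values = list(values)
    counts = Counter(first_digit(v) for v in values)
    total = len(values)
    return {d: counts.get(d, 0) / total for d in range(1, 10)}

# Stand-in data (NOT the census counts): powers of two, 2^1 through 2^1000
sample = [2 ** k for k in range(1, 1001)]
observed = first_digit_proportions(sample)

# Side-by-side comparison, in the spirit of Figure 4
for d in range(1, 10):
    expected = math.log10(1 + 1 / d)
    print(f"{d}: observed {observed[d]:.3f}   Benford {expected:.3f}")
```

To repeat the county analysis, `sample` would be replaced by the list of 3,143 population counts; the rest of the comparison stays the same.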

**Reference**

1. Nigrini M. J., *I’ve Got Your Number*, Journal of Accountancy, May 1999.
2. US Census Bureau, American Fact Finder.
3. Wikipedia’s entry for Benford’s law.

## A Student’s View of the Normal Distribution

In my teaching, I always strive to encourage students to look at statistics from a practical point of view. In a recent class period covering the normal distribution, I indicated that data values more than 3 standard deviations away from the mean are rare. The odds for seeing such data points are about 3 in 1,000. After class, a student came up to me and said that she understood the lecture except the things I said about 3 out of 1,000. She did not know what to make of it.

The empirical rule, which can be thought of as a short form of the normal distribution, says that about 99.7% of the data are within 3 standard deviations of the mean. That means that only about 0.3% of the data are more than 3 standard deviations away from the mean.
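The 99.7% figure is a rounded value, and it can be checked directly. The sketch below relies on a standard fact that is not stated in the post: for a normal distribution, the fraction of data within k standard deviations of the mean equals erf(k / sqrt(2)). The function name `fraction_within` is chosen for this example only.

```python
import math

def fraction_within(k: float) -> float:
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    inside = fraction_within(k)
    print(f"within {k} SD: {inside:.4f}   outside: {1 - inside:.4f}")
```

For k = 3 the fraction outside is roughly 0.0027, which the empirical rule rounds to 0.3%.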

If some event has only a 0.3% chance of happening, the odds are 0.3 out of 100. Suppose that we are talking about people and the data are measurements of height (in inches). So only 0.3 people out of 100 have heights more than 3 standard deviations away from the mean. Since we cannot have 0.3 people, it is better to say that 3 people out of 1,000 are either 3 or more standard deviations taller than the mean or 3 or more standard deviations shorter than the mean.

The ratio can be expanded further. We can say that 30 people out of 10,000 are either 3 or more standard deviations taller than the mean or 3 or more standard deviations shorter than the mean. Adding two more zeros, we have: 3,000 people out of 1,000,000 (one million) are either 3 or more standard deviations taller than the mean or 3 or more standard deviations shorter than the mean.

So out of one million people of the same gender and of similar age (say, young adult males aged 20 to 29), only about 3,000 people or so are either very tall or very short. I would say seeing such people is a rare event. Height measurements (and other biological measurements) from a group of people of the same gender (and of similar age) tend to follow a bell-shaped distribution.

To make it even easier to see, let’s say the heights of young adult males follow a normal distribution with mean = 69 inches and standard deviation = 3 inches. Approximately 3,000 out of one million young adult males are either 9 or more inches taller than 69 inches (over 6 feet 6 inches) or 9 or more inches shorter than 69 inches (less than 5 feet). Since the bell curve is symmetrical, about 1,500 out of one million are taller than 6 feet 6 inches.
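The same height model can be explored with Python’s standard library. This is a rough sketch that simply takes the mean of 69 inches and standard deviation of 3 inches from the paragraph above; the variable names are illustrative.

```python
from statistics import NormalDist

# Height model from the text: mean 69 inches, standard deviation 3 inches
heights = NormalDist(mu=69, sigma=3)

upper = heights.mean + 3 * heights.stdev   # 78 inches, i.e. 6 feet 6 inches
lower = heights.mean - 3 * heights.stdev   # 60 inches, i.e. 5 feet

p_taller = 1 - heights.cdf(upper)   # share taller than 6 feet 6 inches
p_shorter = heights.cdf(lower)      # share shorter than 5 feet

print(f"cutoffs: {lower:.0f} and {upper:.0f} inches")
print(f"taller than the upper cutoff, per million: {p_taller * 1_000_000:.0f}")
print(f"shorter than the lower cutoff, per million: {p_shorter * 1_000_000:.0f}")
```

The unrounded tail comes to roughly 1,350 per million on each side, in line with the "about 1,500" obtained from the rounded 0.3% figure.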

To look at this visually, the following is a bell curve describing the heights of the young adult males. Note that the bell curve ranges from about 55 inches to 80 inches. But most of the area under the bell curve is from 60 to about 78 inches.

There are about 21.5 million young adult males in the U.S. (according to the US Census Bureau website). So the estimated number of young adult males taller than 6 feet 6 inches is 32,250 (= 1,500 times 21.5). So we can say that virtually all young adult males are shorter than 6 feet 6 inches (statistically speaking). If all U.S. young adult males taller than 6 feet 6 inches were to attend the same baseball game at Dodger Stadium, there would still be about 23,000 empty seats!
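The arithmetic behind the 32,250 estimate can be written out as a quick check. The sketch uses only the two numbers quoted in the post (21.5 million young adult males and 1,500 per million); the variable names are made up for this example.

```python
# Numbers quoted in the post
young_adult_males = 21_500_000   # about 21.5 million
rate_per_million = 1_500         # taller than 6 feet 6 inches, per million

# Integer arithmetic keeps the result exact
estimate = young_adult_males * rate_per_million // 1_000_000
share = estimate / young_adult_males

print(f"estimated number taller than 6 feet 6 inches: {estimate:,}")
print(f"share of all young adult males: {share:.2%}")
```

The share works out to 0.15%, which is why it is reasonable to say that virtually everyone in the group is shorter than 6 feet 6 inches.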

In my experience, many students have no problem reciting the empirical rule (reciting the three sentences about 1, 2, and 3 standard deviations). Some of them have a hard time applying it, especially using it as a quick gauge of the significance of data. I think understanding it within a practical context should make it easier to do so. I gave essentially the same explanation to my student, which she thought was helpful.