Benford’s Law and US Census Data, Part I

The first digit (or leading digit) of a number is its leftmost digit (e.g. the first digit of 567 is 5). The first digit can only be 1, 2, 3, 4, 5, 6, 7, 8, or 9, since we do not usually write a number such as 567 as 0567. Some fraudsters may think that the first digits of numbers in financial documents appear with equal frequency (i.e. each digit appears about 11% of the time). In fact, this is not the case. Simon Newcomb discovered in 1881, and physicist Frank Benford rediscovered in 1938, that the first digits in many data sets occur according to the probability distribution indicated in Figure 1 below:

The above probability distribution is now known as Benford’s law. It is a powerful yet relatively simple tool for detecting financial and accounting fraud (see this previous post). For example, according to Benford’s law, about 30% of the numbers in legitimate data have 1 as the first digit. Fraudsters who do not know this will tend to have far fewer ones as first digits in their faked data.
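The distribution can also be written as a formula. The post itself does not state it, but the standard form of Benford’s law gives the probability of first digit d as log10(1 + 1/d). The following is a minimal Python sketch that tabulates these probabilities; the function name benford_probability is ours, chosen for illustration.

```python
import math

def benford_probability(d: int) -> float:
    """Probability that the first digit equals d under Benford's law."""
    if not 1 <= d <= 9:
        raise ValueError("first digit must be between 1 and 9")
    return math.log10(1 + 1 / d)

# Tabulate the nine probabilities; they sum to 1.
benford = {d: benford_probability(d) for d in range(1, 10)}

for d, p in benford.items():
    print(f"{d}: {p:.3f}")
```

The value for d = 1 is log10(2), about 0.301, which is the "about 30%" figure quoted above.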

Benford’s law applies to data that tend to spread across multiple orders of magnitude. Examples include income data of a large population and census data such as the populations of cities and counties. In addition to demographic and scientific data, Benford’s law also applies to many types of financial data, including income tax data, stock exchange data, and corporate disbursement and sales data (see [1]). The author of [1], Mark Nigrini, also discusses data analysis methods based on Benford’s law that are used in forensic accounting and auditing.
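To get a feel for why spreading across orders of magnitude matters, here is a small sketch using a synthetic sequence rather than real data. As an assumption for illustration only, it uses the first 1,000 powers of 2 (which range from 1 up to a number with 301 digits) as a stand-in for data covering many orders of magnitude, and tabulates their first digits next to the Benford proportions.

```python
import math
from collections import Counter

# Assumption: powers of 2 are a synthetic stand-in for data that span
# many orders of magnitude. They are not census or financial data.
values = [2 ** n for n in range(1000)]

# The first digit is the leftmost character of the decimal representation.
counts = Counter(int(str(v)[0]) for v in values)
proportions = {d: counts[d] / len(values) for d in range(1, 10)}

for d in range(1, 10):
    expected = math.log10(1 + 1 / d)
    print(f"{d}: observed {proportions[d]:.3f}, Benford {expected:.3f}")
```

By contrast, numbers confined to a narrow range (say, adult heights in centimeters) would not be expected to follow this pattern.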

In a previous post, we compared the trade volume data of the S&P 500 stock index to Benford’s law. In this post, we provide another example of Benford’s law in action. We analyze the population data of all 3,143 counties in the United States (data are found here). The following figures show the distribution of first digits in the population counts of all 3,143 counties. Figure 2 shows the actual counts and Figure 3 shows the proportions.

In both of these figures, there are nine bars, one for each of the possible first digits. In Figure 2, we see that there are 972 counties in the United States with 1 as the first digit in the population count (0.309 or 30.9% of the total). This agrees quite well with the proportion of 0.301 from Benford’s law. Comparing the proportions in Figure 1 and Figure 3, we see that the actual proportions of first digits in the county population counts are in good general agreement with Benford’s law. The following figure shows a side-by-side comparison of Figure 1 and Figure 3.

Looking at Figure 4, it is clear that the actual proportions of first digits in the 3,143 county population counts follow Benford’s law quite closely. Such a close match with Benford’s law would be expected in authentic and unmanipulated data. A comparison such as the one shown in Figure 4 lies at the heart of any data analysis technique that uses Benford’s law.
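A comparison of this kind can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the list sample_populations below is made up for the example and is not the census data, and the function names are ours. With the real county counts in place of the sample, the same steps would produce the proportions compared in Figure 4.

```python
import math
from collections import Counter

def first_digit(n: int) -> int:
    """Return the leftmost digit of a nonzero integer."""
    return int(str(abs(n))[0])

def first_digit_proportions(values):
    """Proportion of values having each first digit 1 through 9."""
    counts = Counter(first_digit(v) for v in values if v != 0)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("need at least one nonzero value")
    return {d: counts.get(d, 0) / total for d in range(1, 10)}

def compare_with_benford(values):
    """Rows of (digit, observed, expected, difference)."""
    observed = first_digit_proportions(values)
    rows = []
    for d in range(1, 10):
        expected = math.log10(1 + 1 / d)  # Benford proportion for digit d
        rows.append((d, observed[d], expected, observed[d] - expected))
    return rows

# Hypothetical population counts, invented for this example only.
sample_populations = [1204, 18250, 9871, 23100, 145600,
                      3120, 1077, 56200, 11890, 2740]

rows = compare_with_benford(sample_populations)
for d, obs, exp, diff in rows:
    print(f"{d}: observed {obs:.3f}, Benford {exp:.3f}, diff {diff:+.3f}")
```

A sample of ten numbers is far too small to say anything about fit; the point is only to show the mechanics of the tabulation.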

References

  1. Nigrini, M. J., I’ve Got Your Number, Journal of Accountancy, May 1999.
  2. US Census Bureau, American FactFinder.
  3. Wikipedia’s entry for Benford’s law.