People who engage in financial fraud need to produce false data as part of their criminal activities. Is it possible to look at false data and determine that they are not real? A probability model known as Benford’s law is a powerful and relatively simple tool for detecting potential financial fraud or errors. In this post, we give some indication of how Benford’s law can be used, with data from the S&P 500 stock index as an example.
The first digit (or leading digit) of a number is the leftmost digit. For example, the first digit of $35,987 is 3. Since zero cannot be a first digit, there are only 9 possible choices for the first digit. Some data fabricators may try to distribute the first digits in false data fairly uniformly. However, the first digits of numbers in legitimate documents tend not to be distributed uniformly across the 9 possible digits. According to Benford’s law, about 30% of the numbers in many types of data sets have 1 as a first digit and about 18% have 2. The higher digits have even lower frequencies, with 9 occurring about 5% of the time. The following figure shows the probability distribution according to Benford’s law.
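The percentages quoted above come from the standard formula for Benford’s law, which this post does not spell out: the probability that the first digit is d equals log10(1 + 1/d). A minimal Python sketch (the function name `benford_probability` is ours, chosen for illustration) reproduces the values:

```python
import math

def benford_probability(d):
    """Probability that the first digit is d under Benford's law."""
    if d not in range(1, 10):
        raise ValueError("first digit must be an integer from 1 to 9")
    return math.log10(1 + 1 / d)

# Build the whole distribution for digits 1 through 9.
benford = {d: benford_probability(d) for d in range(1, 10)}

for d, p in benford.items():
    print(f"{d}: {p:.3f}")
```

The nine probabilities sum to 1, and the first three (0.301, 0.176, 0.125) are the values used in the comparison later in this post.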
Frank Benford was a physicist working for the General Electric Company in the 1920s and 1930s. He observed that the first few pages of his logarithm tables (the pages corresponding to the lower digits) were dirtier and more worn than the pages for the higher digits. Without electronic calculators or modern computers, people in those times used logarithm tables to facilitate numerical calculation. He concluded that he was looking up the first few pages of the logarithm tables more often (hence doing calculations involving lower first digits more often). Benford then hypothesized that there were more numbers with lower first digits in the real world (hence the need to look up logarithms of numbers with lower first digits more often).
To test his hypothesis, Benford analyzed various types of data sets, including areas of rivers, baseball statistics, numbers in magazine articles, and street addresses listed in “American Men of Science”, a biographical directory. In all, the analysis involved a total of 20,229 numerical data values. Benford found that the leading digits in his data were distributed very similarly to the model described in Figure 1 above.
As a hands-on introduction to Benford’s law, we use data from the S&P 500 stock index from November 11, 2011 (data were found here). The S&P 500 is a stock index covering 500 large companies in the U.S. economy. The following table lists the prices and share volumes (the number of shares traded) of the first few companies of the S&P 500 index as of November 11, 2011. The first digits of the listed prices are 8, 5, 5, 5, 7, 2. The first digits of the listed share volumes are 3, 4, 2, 2, 1, 8.
The following figures show the frequencies of the first digits in the prices and volumes from the entire S&P 500 on November 11, 2011.
It is clear that in both the prices and the volumes, the lower digits occur more frequently as first digits. For example, of the 500 closing prices of the S&P 500 on November 11, 2011, 82 prices have 1 as the first digit (16.4% of the total) while only 14 prices have 9 as the first digit (2.8% of the total). For the trade volumes of the 500 stocks, the skewness is even more pronounced. There are 166 volumes with 1 as the first digit (33.2% of the total) while there are only 25 volumes with 9 as the first digit (5% of the total). The following figures express the same distributions in terms of proportions (or probabilities).
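Tallies like these are easy to produce by program. The following is a minimal sketch rather than the method used to build the figures; the helper names and the sample amounts are invented for illustration (only 35987 echoes the $35,987 example earlier in the post).

```python
from collections import Counter

def first_digit(x):
    """Return the leading nonzero digit of a nonzero number."""
    x = abs(x)
    if x == 0:
        raise ValueError("zero has no first digit")
    while x >= 10:   # shrink large numbers into the range [1, 10)
        x /= 10
    while x < 1:     # grow small numbers into the range [1, 10)
        x *= 10
    return int(x)

def first_digit_counts(values):
    """Count how often each digit 1..9 appears as a first digit."""
    counts = Counter(first_digit(v) for v in values)
    return {d: counts[d] for d in range(1, 10)}

# Hypothetical amounts, made up for this example (not S&P 500 data).
sample = [35987, 1204, 187.5, 0.042, 9100, 1999]

counts = first_digit_counts(sample)
total = sum(counts.values())
for d in range(1, 10):
    print(d, counts[d], f"{counts[d] / total:.1%}")
```

Dividing each count by the total gives the proportions, in the same way that 82 out of 500 prices gives 16.4%. Because the helper rescales with floating-point arithmetic, values lying extremely close to a digit boundary could in principle be misclassified; for whole-number data, reading the first character of the number’s string form is a simple alternative.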
Anyone who tries to fake S&P 500 price and volume data by picking numbers purely at random will not produce convincing results, not even results that can withstand a casual analysis based on figures such as Figures 2 through 5 above.
What is even more interesting is the comparison of Figure 1 (Benford’s law) with Figure 5 (trade volumes of the S&P 500). The following figure is a side-by-side comparison.
The distribution of the first digits in the 500 traded volumes of the S&P 500 index agrees remarkably well with Benford’s law. According to the distribution in Figure 1 (Benford’s law), about 60% of the leading digits in legitimate data consist of the digits 1, 2 and 3 (0.301+0.176+0.125=0.602). In the actual S&P 500 traded volumes on 11/11/2011, about 65% of the leading digits are one of these three digits (0.332+0.202+0.114=0.648). We cannot expect the actual percentages to match those of Benford’s law exactly. However, the general agreement between the expected distribution (Benford’s law) and the actual data (S&P 500 volumes) is remarkable and very informative.
There are many sophisticated computer tests that apply Benford’s law in fraud detection. However, the heart of the method is a simple comparison such as the one performed above, i.e. comparing the actual frequencies of the first digits with the frequencies predicted by Benford’s law. If the data fabricator produces numbers that are distributed fairly uniformly across the digits, a simple comparison will expose the discrepancy between the false data and Benford’s law. Too big a discrepancy between the actual data and Benford’s law (e.g. too few 1’s) is sometimes enough to raise suspicion of fraud. Many white-collar criminals do not know about Benford’s law and will not expect that in many types of realistic data, 1 occurs as a first digit about 30% of the time.
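As a rough illustration of such a comparison, the sketch below measures the gap between observed first-digit counts and Benford’s law with a chi-square statistic. This is one common way to summarize the discrepancy, not necessarily what forensic tools use, and both sets of counts are invented for the example: one spreads 500 numbers almost evenly over the digits, the other follows Benford’s law closely.

```python
import math

# Benford probabilities for first digits 1..9.
BENFORD = [math.log10(1 + 1 / d) for d in range(1, 10)]

# Standard 5% critical value of the chi-square distribution
# with 8 degrees of freedom (9 digits minus 1).
CRITICAL_5_PERCENT = 15.507

def chi_square_vs_benford(counts):
    """counts: list of 9 observed counts for first digits 1..9."""
    if len(counts) != 9:
        raise ValueError("expected one count per digit 1..9")
    total = sum(counts)
    stat = 0.0
    for observed, p in zip(counts, BENFORD):
        expected = total * p
        stat += (observed - expected) ** 2 / expected
    return stat

def digit_gaps(counts):
    """Observed proportion minus Benford proportion, per digit."""
    total = sum(counts)
    return [c / total - p for c, p in zip(counts, BENFORD)]

# Hypothetical data sets of 500 numbers each (made up, not real data).
fabricated = [56, 55, 56, 55, 56, 55, 56, 55, 56]    # nearly uniform
benford_like = [151, 88, 62, 48, 40, 33, 29, 26, 23]  # close to Benford

for name, counts in [("fabricated", fabricated), ("benford_like", benford_like)]:
    stat = chi_square_vs_benford(counts)
    verdict = "worth a closer look" if stat > CRITICAL_5_PERCENT else "consistent"
    print(f"{name}: chi-square = {stat:.2f} ({verdict})")
```

In the nearly uniform set, the digit 1 accounts for about 11% of the numbers instead of about 30%, and that single gap is enough to push the statistic far past the threshold. A large statistic only flags data for further review; it is not proof of fraud.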
Benford’s law does not fit every type of data. It does not fit numbers that are generated uniformly at random. For example, lottery numbers are drawn at random from balls in a big glass jar. Hence lottery numbers are uniformly distributed (i.e. every number has an equal chance of being selected).
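To see why equally likely numbers do not follow Benford’s law, consider a hypothetical draw (not an actual lottery) in which every 3-digit number from 100 to 999 has the same chance of being picked. The short sketch below simply enumerates those numbers and counts their first digits.

```python
from collections import Counter

# Every 3-digit number, each assumed to be equally likely.
numbers = range(100, 1000)

# The first character of the decimal string is the first digit.
counts = Counter(int(str(n)[0]) for n in numbers)

total = len(numbers)
proportions = {d: counts[d] / total for d in range(1, 10)}

for d in range(1, 10):
    print(d, counts[d], f"{proportions[d]:.3f}")
```

Each digit leads exactly 100 of the 900 numbers, so every first digit has probability 1/9 (about 11%), far from the roughly 30% that Benford’s law assigns to the digit 1.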
Even some naturally generated numbers do not follow Benford’s law. For example, data that are confined to a relatively narrow range do not follow Benford’s law. Examples of such data include heights of human adults and IQ scores. Another example is the S&P 500 stock prices in Figure 2. Note that the patterns of the bars in Figure 2 and Figure 4 do not quite match the pattern of Benford’s law. Most of the stock prices of the S&P 500 fall below $100 (on 11/11/2011, all prices were either 2-digit or 3-digit numbers, with only 35 of the 500 prices being 3-digit). The following figure shows the side-by-side comparison between the S&P 500 stock prices and Benford’s law. Note that there are too few 1’s as first digits in the S&P 500 prices.
Even though the S&P 500 prices do not follow Benford’s law, they are far from uniformly distributed (the smaller digits still come up more frequently). Any attempt to fake S&P 500 stock prices by making each digit equally likely as a first digit will still not produce convincing results (at least not to an experienced investigator).
Benford’s law is applicable to data that tend to be distributed across multiple orders of magnitude. In the example of the S&P 500 trading volume data, the data range from about half a million shares to 210 million shares (from 6-digit to 9-digit numbers, i.e., across 3 orders of magnitude). In contrast, the S&P 500 prices only cover 1 order of magnitude. Other examples of data for which Benford’s law is usually applicable include income data of a large population and census data such as populations of cities and counties.
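The effect of the range can be illustrated with synthetic data. The sketch below is an illustration under stated assumptions, not a model of the S&P 500: the `wide` list is an evenly log-spaced grid covering exactly 3 orders of magnitude (1,000 up to just under 1,000,000), and the `narrow` list is evenly spaced between 20 and 100. Both lists and all helper names are invented for this example.

```python
import math
from collections import Counter

def first_digit(x):
    """Return the leading nonzero digit of a nonzero number."""
    x = abs(x)
    if x == 0:
        raise ValueError("zero has no first digit")
    while x >= 10:
        x /= 10
    while x < 1:
        x *= 10
    return int(x)

def first_digit_proportions(values):
    """Proportion of values having each digit 1..9 as first digit."""
    counts = Counter(first_digit(v) for v in values)
    n = sum(counts.values())
    return {d: counts[d] / n for d in range(1, 10)}

n = 3000
# Synthetic data spread evenly (on a log scale) over 3 orders of magnitude.
wide = [1000 * 10 ** (3 * i / n) for i in range(n)]
# Synthetic data confined to a narrow range: 20 up to just under 100.
narrow = [20 + 80 * i / n for i in range(n)]

wide_props = first_digit_proportions(wide)
narrow_props = first_digit_proportions(narrow)

for d in range(1, 10):
    benford = math.log10(1 + 1 / d)
    print(d, f"{benford:.3f}", f"{wide_props[d]:.3f}", f"{narrow_props[d]:.3f}")
```

The wide-range data track the Benford proportions closely, while the narrow-range data contain no leading 1’s at all, echoing the shortage of 1’s among the S&P 500 prices.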
Benford’s law is also applicable to financial data such as income tax data and corporate expense data. Nigrini discussed data analysis methods (based on Benford’s law) that are used in forensic accounting and auditing. For more information about Benford’s law, see the references below or search in Google.