Benford’s law is a probability model that is a powerful tool for detecting fraud and data irregularities. The first digit (or leading digit) of a number is its leftmost digit and can only be 1, 2, 3, 4, 5, 6, 7, 8 or 9. Benford’s law tells us that the first digit is 1 about 30% of the time, 2 about 17.6% of the time, and so on (see Figure 1 below). The key to detecting fraud is to compare the distribution of first digits in the data being investigated with the Benford distribution. Too large a discrepancy between the actual data and Benford’s law (e.g. too few 1’s) is sometimes enough to raise suspicion of fraud. In this post, we demonstrate the use of the chi-square goodness-of-fit test to compare the actual distribution of first digits with Benford’s law. We use the example discussed in the previous post (population counts of 3,143 counties in the United States).
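The proportions quoted above (about 30% for the digit 1, about 17.6% for the digit 2) can be reproduced with a few lines of Python. The post does not state a formula, so the sketch below assumes the standard statement of Benford’s law, in which the probability of first digit d is log10(1 + 1/d). The function name `benford_probabilities` is made up for this illustration.

```python
import math


def benford_probabilities():
    """Return {digit: probability} for first digits 1..9.

    Assumes the standard form of Benford's law: P(d) = log10(1 + 1/d).
    """
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


probs = benford_probabilities()
for digit, p in probs.items():
    # Print each digit with its probability rounded to three decimals.
    print(f"{digit}: {p:.3f}")
```

The nine probabilities sum to 1, since the sum telescopes to log10(10).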
The following two figures show the distribution of the first digits in the population counts for the 3,143 counties in the United States. Figure 2 shows the count for each first digit. Figure 3 shows the proportion of each first digit.
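Counts and proportions like those in the two figures can be tallied as in the following sketch. The helper names and the short `sample` list are invented for illustration; the list is not the census data, which is not reproduced in this post.

```python
from collections import Counter


def first_digit(n):
    """Return the leftmost digit of a positive integer."""
    if n <= 0:
        raise ValueError("expected a positive integer")
    return int(str(n)[0])


def first_digit_counts(values):
    """Count how often each digit 1..9 appears as a first digit."""
    counts = Counter(first_digit(v) for v in values)
    # Report every digit, including those that never occur.
    return {d: counts.get(d, 0) for d in range(1, 10)}


# Made-up numbers for illustration only (NOT the county population data).
sample = [1204, 15300, 47021, 113, 2650, 590, 8812, 96]

counts = first_digit_counts(sample)
proportions = {d: c / len(sample) for d, c in counts.items()}
print(counts)
print(proportions)
```

Applied to the 3,143 county population counts, the same two dictionaries would hold the observed counts and the observed proportions.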
The following figure is a side-by-side comparison of Figure 1 (Benford’s law) and Figure 3 (the actual proportions from the census data).
It is quite clear that the distribution of first digits in the population counts of U.S. counties follows Benford’s law quite closely. The blue bars and the orange bars have roughly the same height, with the possible exception of the first digit 5. The blue bar and the orange bar at 5 differ by 1.4 percentage points (0.014 = 0.079 - 0.065). Is the difference for the first digit 5 problematic? Visually speaking, the actual distribution of first digits in the county population data matches Benford’s law quite well. Is it possible to confirm that with a statistical test? To answer these questions, we use the chi-square goodness-of-fit test.
The above questions are translated into a null hypothesis and an alternative hypothesis. The null hypothesis is that the first digits in the population counts for the 3,143 counties follow Benford’s law. The alternative hypothesis is that the first digits in the population counts do not follow Benford’s law. The following states the hypotheses more explicitly:

$H_0$: The first digits in the population counts follow Benford’s law.
$H_1$: The first digits in the population counts do not follow Benford’s law.
If the null hypothesis is true, we would expect about 946 first digits of 1 (3,143 times 0.301), about 553.2 first digits of 2 (3,143 times 0.176), and so on. The following figure shows the expected counts of first digits under the assumption of the null hypothesis. The expected counts in Figure 5 are obtained by multiplying the proportions from Benford’s law by the total count of 3,143 (the total number of counties in the United States).
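The expected counts can be sketched in Python as follows. The sketch assumes the standard Benford formula log10(1 + 1/d) with unrounded probabilities, so its results differ slightly from the figures in the text, which use probabilities rounded to three decimals (0.301, 0.176). The variable names are illustrative.

```python
import math

N_COUNTIES = 3143  # total number of counties, from the post

# Expected count for each first digit under the null hypothesis,
# using unrounded Benford probabilities (an assumption of this sketch).
expected = {d: N_COUNTIES * math.log10(1 + 1 / d) for d in range(1, 10)}

# The same arithmetic with the rounded probabilities quoted in the text.
expected_1_rounded = N_COUNTIES * 0.301
expected_2_rounded = N_COUNTIES * 0.176

for digit, count in expected.items():
    print(f"{digit}: {count:.1f}")
print(round(expected_1_rounded, 1), round(expected_2_rounded, 1))
```

Expected counts need not be whole numbers, but they do add up to the total of 3,143.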
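We use the chi-square statistic to measure the difference between the observed counts (in Figure 2) and the expected counts (in Figure 5). The formula for the chi-square statistic is:

$$\chi^2 = \sum_{d=1}^{9} \frac{(O_d - E_d)^2}{E_d}$$

where $O_d$ is the observed count and $E_d$ is the expected count for first digit $d$.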
The above chi-square statistic has an approximate chi-square distribution with 8 degrees of freedom. There are 9 categories of counts (i.e. the 9 possible first digits) and, when no parameters are estimated from the data, the degrees of freedom are one less than the number of categories in the observed counts.
The computation of the chi-square statistic is usually performed using software, but the idea behind the calculation is simple.
For each possible first digit, we take the difference between the observed count and the expected count. We then square the difference and normalize it by dividing by the expected count. For example, for digit 1, the difference between the observed count and the expected count is 26 (= 972 - 946). Squaring it produces 676. Dividing 676 by 946 produces 0.7145877378. The sum of all 9 normalized squared differences is 11.4.
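The calculation described above can be sketched as a pair of small Python functions. The function names are made up for this illustration. Only the digit-1 counts (972 observed, 946 expected) are given in the text, so the example evaluates that single term instead of guessing the remaining counts.

```python
def chi_square_terms(observed, expected):
    """Return the normalized squared difference (O - E)^2 / E per category."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return [(o - e) ** 2 / e for o, e in zip(observed, expected)]


def chi_square_statistic(observed, expected):
    """Sum the per-category terms to get the chi-square statistic."""
    return sum(chi_square_terms(observed, expected))


# Digit 1 only: observed 972, expected 946 (both values from the text).
digit_1_term = chi_square_terms([972], [946])[0]
print(digit_1_term)
```

Passing all nine observed and expected counts to `chi_square_statistic` would give the full statistic, which the post reports as 11.4.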
The value $\chi^2 = 11.4$ is a realization of the chi-square statistic described above, which has an approximate chi-square distribution with 8 degrees of freedom. The probability that a chi-square distribution with 8 degrees of freedom produces a value greater than 11.4 is about 0.1796. This probability is called the p-value and is usually estimated using a chi-square table or obtained using software. We obtained this p-value using a TI-83 Plus graphing calculator.
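For readers without a graphing calculator, the p-value can be approximated with the Python standard library alone. The sketch below relies on a closed form for the upper tail of a chi-square distribution that holds only when the degrees of freedom are even (8 in this case); the function name is invented for this illustration. Plugging in the rounded statistic 11.4 gives roughly 0.180, very close to the 0.1796 reported in the post. The small gap is presumably due to the statistic being rounded to one decimal place.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable.

    Valid only for even, positive df, where the tail has the closed form
    exp(-x/2) * sum_{j=0}^{df/2 - 1} (x/2)^j / j!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires even, positive df")
    if x < 0:
        raise ValueError("x must be non-negative")
    half = x / 2
    term, total = 1.0, 0.0
    for j in range(df // 2):
        total += term           # add (x/2)^j / j!
        term *= half / (j + 1)  # advance to the next term of the series
    return math.exp(-half) * total


p_value = chi2_sf_even_df(11.4, 8)
print(round(p_value, 3))
```

If SciPy is installed, `scipy.stats.chi2.sf(11.4, 8)` computes the same tail probability for any degrees of freedom.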
With the p-value being 0.1796, we do not reject the null hypothesis. The differences that we see between the observed counts in Figure 2 and the expected counts in Figure 5 are not sufficient evidence for us to believe that the first digits in the county population counts do not follow Benford’s law.
The calculated chi-square statistic of 11.4 captures the differences between the observed counts (in Figure 2) and the expected counts (in Figure 5). As expected, a large portion of the 11.4 is due to the difference in the digit 5.
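A rough check of how much the digit 5 contributes can be made from the rounded proportions quoted earlier (0.065 observed versus 0.079 under Benford’s law). The sketch rewrites the term (O - E)^2 / E in terms of proportions, which is algebraically the same quantity. Because the inputs are rounded to three decimals, the result is only an estimate, not the exact term from the post’s calculation.

```python
n = 3143               # number of counties, from the post
observed_prop = 0.065  # observed proportion of first digit 5 (rounded)
benford_prop = 0.079   # Benford proportion of first digit 5 (rounded)

# (O - E)^2 / E with O = n * observed_prop and E = n * benford_prop
# simplifies to n * (observed_prop - benford_prop)^2 / benford_prop.
digit_5_term = n * (observed_prop - benford_prop) ** 2 / benford_prop

# Share of the reported chi-square statistic of 11.4.
share = digit_5_term / 11.4
print(round(digit_5_term, 1), round(share, 2))
```

On these rounded inputs the digit 5 accounts for roughly two thirds of the statistic, in line with the statement above.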
Even with the relatively large deviation in digit 5, the calculated chi-square statistic of 11.4 is not large enough for us to conclude that the first digits in the census data deviate from Benford’s law. Strictly speaking, failing to reject the null hypothesis does not prove it, but the test result agrees with the close visual match: the distribution of first digits in this census data set is consistent with Benford’s law.