Given a letter of the English alphabet, how often does it appear in ordinary English text? Some letters appear more often than others. For example, Z, the last letter, is uncommon, while the vowels are very common because nearly every word needs them. The following figure shows the relative frequency of the English letters obtained empirically (see Dewey in the references below). Dewey obtained this frequency distribution after examining a total of 438,023 letters. We came across this letter frequency distribution in Example 2.11 on page 24 of Larsen and Marx. Figure 1 displays the letter frequencies in descending order.
A letter frequency distribution such as the one in Figure 1 is important in cryptography. We briefly explore why this is the case and indicate why breaking a cipher is often a statistical process. We then check the Dewey letter frequency distribution against the letter frequencies in the presidential inaugural speeches of George Washington (two speeches) and Barack Obama (one speech).
The study of the frequency of letters in text is very important in cryptography. When an algorithm is used to encrypt a message, the original information is called plaintext and the encrypted message is called ciphertext. In a simple encryption scheme called a substitution cipher, each letter of the plaintext is replaced by another letter. To break such a cipher, it is necessary to know the letter frequency of the language being encoded. For example, if W is the most frequently appearing letter in the ciphertext, this might suggest that W in the ciphertext corresponds to E in the plaintext, since E is the most frequently occurring English letter (see Figure 1).
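To make the idea of a substitution cipher concrete, here is a minimal Python sketch. The function names (`make_key`, `encrypt`, `decrypt`) and the seeded random key are our own illustrative assumptions, not part of any particular cipher standard. The point to notice is that every occurrence of a plaintext letter maps to the same ciphertext letter, so the frequency pattern of the plaintext survives encryption.

```python
import random
import string

def make_key(seed=0):
    """Return a dict mapping each uppercase letter to a substitute letter.

    The key is a random permutation of A-Z; the seed is an illustrative
    assumption so that the example is repeatable.
    """
    letters = list(string.ascii_uppercase)
    shuffled = letters[:]
    random.Random(seed).shuffle(shuffled)
    return dict(zip(letters, shuffled))

def encrypt(plaintext, key):
    """Replace each letter using the key; leave other characters unchanged."""
    return "".join(key.get(ch, ch) for ch in plaintext.upper())

def decrypt(ciphertext, key):
    """Invert the key and undo the substitution."""
    inverse = {v: k for k, v in key.items()}
    return "".join(inverse.get(ch, ch) for ch in ciphertext)

key = make_key()
ciphertext = encrypt("Meet me here", key)
print(ciphertext)
print(decrypt(ciphertext, key))
```

Because the letter E appears five times in "MEET ME HERE", whichever letter stands in for E also appears five times in the ciphertext. That preserved count is what frequency analysis exploits.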
Figure 1 shows that the most frequently occurring letter in English is E (about 12.68% of the time). The least used letter is Z. The top 5 letters (E, T, A, I, O) comprise about 45% of the total usage. The top 8 letters comprise close to 65% of the total usage. The top 12 letters are used about 80% of the time (80.87%).
Another interesting result from Dewey’s letter frequency distribution is that the vowels comprise about 40% of the total usage. This means that the frequency of consonants is about 60%.
The probability distribution of the letters displayed in Figure 1 is a useful tool that can aid the process of breaking an intercepted cipher. The general idea is to compare the frequency of the letters in the encrypted message with the frequency of the letters in Figure 1. Thus the most used letter in the ciphertext might correspond to the letter E, or might correspond to T or A (as T and A are also very common in plaintext). But the most used letter in the ciphertext is unlikely to correspond to Z or Q. The second most used letter in the ciphertext might be the letter T in the plaintext, or might be another one of the top letters. The cryptanalyst will likely need to try various mappings between the letters in the ciphertext and the letters in the plaintext. The idea described here is not a sure-fire approach, but rather a trial-and-error process that can help the analyst put the statistical puzzle pieces together.
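The first step of that trial-and-error process can be sketched in Python as follows. The helper names `rank_letters` and `first_guess` are hypothetical, and the reference order "ETAIO" is simply the top five letters listed above; a real attempt would use the full ranking from Figure 1. The result is only a starting guess that the analyst would then revise.

```python
from collections import Counter
import string

def rank_letters(text):
    """Letters of text, most frequent first (ties broken alphabetically)."""
    counts = Counter(ch for ch in text.upper() if ch in string.ascii_uppercase)
    return sorted(counts, key=lambda ch: (-counts[ch], ch))

def first_guess(ciphertext, reference_order):
    """Pair the i-th most common ciphertext letter with the i-th reference letter.

    This is only an initial hypothesis, not a decryption: letters with
    similar frequencies (such as T and A) are easily swapped.
    """
    return dict(zip(rank_letters(ciphertext), reference_order))

# Toy ciphertext made up for illustration: W is most common, then X, Y, Z.
guess = first_guess("WWWWXXXYYZ", "ETAIO")
print(guess)
```

In this toy case the guess pairs W with E, matching the example given earlier. With real ciphertext the analyst would test the guess against common words and adjust the mapping where the partial decryption looks wrong.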
We now use the letters in presidential inaugural speeches to see how the Dewey letter frequency holds up. We want to use text from another era (so we choose the two inaugural speeches of George Washington) and text that is contemporary (so we choose the inaugural speech of Barack Obama). The text of presidential inaugural speeches is publicly available online.
Figure 2 below shows the letter frequency in the two inaugural speeches of George Washington. There are a total of 7,641 letters (we only use the body of the speeches). Figure 3 below is a side-by-side comparison between the letter frequency in Figure 1 (Dewey) and the letter frequency in Washington’s two speeches (Figure 2).
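A tally of this kind can be reproduced with a few lines of Python. The sketch below is one possible way to do it, assuming that only the 26 unaccented letters A to Z are counted and that case is ignored; the function name `letter_frequencies` and the sample sentence are ours, not taken from the speeches.

```python
from collections import Counter
import string

def letter_frequencies(text):
    """Return (total_letters, {letter: percent}) for the A-Z letters in text.

    Digits, punctuation and spaces are skipped; upper and lower case are
    counted together.
    """
    counts = Counter(ch for ch in text.upper() if ch in string.ascii_uppercase)
    total = sum(counts.values())
    if total == 0:
        return 0, {}
    freqs = {ch: 100.0 * counts[ch] / total for ch in string.ascii_uppercase}
    return total, freqs

# A short sample sentence stands in for the body of a speech.
total, freqs = letter_frequencies("The quick brown fox jumps over the lazy dog")
print(total)
for letter, pct in sorted(freqs.items(), key=lambda item: -item[1])[:5]:
    print(letter, round(pct, 2))
```

Running the same function on the full text of a speech would give the total letter count and the percentages needed for a chart such as Figure 2.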
Figure 3 shows that the letter frequency in Washington’s speeches is on the whole very similar to the letter frequency of Dewey. We cannot expect an exact match. But overall there is a general agreement between the two distributions.
Figure 4 below shows the letter frequency in the inaugural speech of Barack Obama. There are a total of 10,627 letters (we only use the body of the speech). Figure 5 below is a side-by-side comparison between the letter frequency in Figure 1 (Dewey) and the letter frequency in Obama’s speech (Figure 4).
There is also a very good agreement between the letter frequency in Dewey (the benchmark) and the letter frequency in Obama’s speech.
Despite the passage of almost 200 years, there is excellent agreement between the letter usage in Washington’s speeches (1789 and 1793) and the distribution obtained by Dewey in 1970 (see Figure 3). Some letters appear more often in Washington’s speeches (e.g. E, I and N) and some appear less often (e.g. A). The general pattern of the letter distribution in Washington’s speeches is unmistakably similar to that of Dewey’s. Similar observations can be made about the comparison between the letter frequency in Obama’s speech and Dewey’s distribution (see Figure 5).
The following table shows the frequency of the top letter, the top 5 letters, the top 8 letters and the top 12 letters in Dewey’s distribution alongside the corresponding frequencies in the speeches of Washington and Obama. Table (1) shows that the frequencies of the top letters are quite close between Dewey’s distribution and the speeches of Washington and Obama.
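The "top k letters" figures in such a table are cumulative sums over the sorted distribution. A minimal sketch of that computation is given below; the function name `top_k_share` is hypothetical, and the four-letter distribution is made-up toy data used only to show the arithmetic, not data from Dewey or the speeches.

```python
def top_k_share(freqs, k):
    """Sum of the k largest values in a {letter: percent} mapping."""
    return sum(sorted(freqs.values(), reverse=True)[:k])

# Toy distribution (made up for illustration only).
toy = {"E": 40.0, "T": 30.0, "A": 20.0, "Z": 10.0}
for k in (1, 2, 3):
    print(k, top_k_share(toy, k))
```

Applied to a full 26-letter distribution with k set to 1, 5, 8 and 12, the same function yields one column of the table.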
- Dewey, G., Relative Frequency of English Spellings, Teachers College Press, Columbia University, New York, 1970
- Larsen, R. J., Marx, M. L., An Introduction to Mathematical Statistics and its Applications, Prentice-Hall, Inc., Englewood Cliffs, New Jersey, 1981