My wife and I enjoy playing the iPhone game LetterPress which was recently released by atebits. Atebits's lead developer is Loren Brichter, who also wrote Tweetie, which became the official Twitter app for the iPhone, writes some great software. So, if you have not done so already, you should definitely try this free game [and then make the in-app purchase if you enjoy it]. That being said, I did come up with one tiny complaint.

Recently a board came up which five "Z"s, two "V"s, two "K"s, an "X", and a "J". Those letters are five of the seven least used letters in the English dictionary, only "Q" and "W" were missing. Almost half of the game board was composed of very infrequently used letters. The game was still playable, but I thought that gameplay might be a little better if concentrations of infrequently used letters were minimized. I think that the chance of getting a "Z" on any given tile should be less than 1 in 26, and instead of just complaining about the game I decided to see what it would take to create an improved random letter generator.

I wanted to know how the letters were distributed in the English language, so I started by creating a small python script which would count the frequency of letters in the English dictionary. I ran the script, which is published on github, against the unix system dictionary (/usr/share/dict/words) and got the following results:


a 196995
c 100944
b 39038
...
y 51542
x 6840
z 8230
Next, I wrote a script which would calculate the probability of each letter and format those results for the final script, which will perform the random letter selection. For the final script I wanted to use a binary tree to make lookups easier, so the output from the second scripts takes this into account. The script assumes that the binary tree will be filled starting with the most frequent letter to the least frequent in a top-down, left-to-right manner. The tree needs to be flattened into a number line so that the number range for each node can be calculated based upon its probability and position in the tree. The leftmost node of the tree will start at zero and the rightmost node in the tree will end with one. In this example the letter 'E', which is the most likely, will be 0.105022804 units long, starting at 0.538979491 and ending at 0.644002294 on the number line.


Once the number line ranges for each letter are calculated, the list is sorted by frequency again so that the tree will build properly in the final script. This following data is output form the second script:

('e', 0.105022804, 0.538979491, 0.644002294)
('i', 0.089845073, 0.267441153, 0.357286225)
('a', 0.088258618, 0.785527013, 0.87378563)
('o', 0.07619197,  0.105176925, 0.181368894)
('r', 0.071804464, 0.418384245, 0.490188708)
('n', 0.070815226, 0.682697737, 0.753512962)
('t', 0.067765522, 0.903911257, 0.971676778)
('s', 0.061441654, 0.023092087, 0.08453374)
('l', 0.057969911, 0.198858882, 0.256828792)
('c', 0.045225401, 0.366164283, 0.411389683)
('u', 0.039043109, 0.496249139, 0.535292247)
('p', 0.033992527, 0.647066784, 0.68105931)
('m', 0.030812,    0.754715013, 0.785527012)
('d', 0.030125626, 0.873785631, 0.903911256)
('h', 0.028323223, 0.971676779, 1.000000001)
('y', 0.023092087, 0.0,         0.023092086)
('g', 0.020643184, 0.084533741, 0.105176924)
('b', 0.017489987, 0.181368895, 0.198858881)
('f', 0.01061236,  0.256828793, 0.267441152)
('v', 0.008878057, 0.357286226, 0.366164282)
('k', 0.006994561, 0.411389684, 0.418384244)
('w', 0.00606043,  0.490188709, 0.496249138)
('z', 0.003687243, 0.535292248, 0.53897949)
('x', 0.003064489, 0.644002295, 0.647066783)
('q', 0.001638426, 0.681059311, 0.682697736)
('j', 0.00120205,  0.753512963, 0.754715012)

For simplicity that output is slightly modified and pasted into the third script as the body of the variable 'letter_list.' The third script uses this data to build a binary tree with knowledge of each letters range on the number line. A random number between zero and one is generated and the tree is searched to find the corresponding letter.

17:15:10 stollcri>./random_letter.py 25
e c a s i i h b p q t r o u a t r c i e i t a t m
17:15:17 stollcri>./random_letter.py 25
i g o o s u a e o p e i g o e t n l o d a s e f i
17:15:21 stollcri>./random_letter.py 25
c r c c e u o d r m a u e n w e a h o g s e i i r
17:15:25 stollcri>./random_letter.py 25
u s s n m t c n l b l i p t a a c l a i i e s n a
17:15:26 stollcri>./random_letter.py 25
e i o u d x n i s r a l e t a t n i n e n t l o h
17:15:29 stollcri>

We can now select random letters roughly based upon how likely the letter is to appear in the English language. However, now there is over a ten percent chance that the letter 'E' will be selected and a less than one percent chance that any of the seven least common letters will be selected. It would probably be a good idea to manually tune the probabilities by compressing the range; this might further enhance gameplay.

Comments (newest first)

Anonymous
You must have a lot of time to keep you busy with these things. Although the thought is very nice and keep up the good work.
Christopher Stoll
These are all good points; I figured that more work would be required to create a usable algorithm. You may be interested in checking out NLTK.org. I have only used the toolkit for processing words (or longer), but I think it includes tools for working with letter distributions as well.

Thanks for the feedback and good luck with your game.
John Swensen
I am going to reply to myself because I think I have a decent way to getting better letter selection. I started looking around for information about two-letter sequencing in the English language and found that this terminology is call Bi-grams. There is a great paper that is available at http://bit.ly/VttmEZ that gives the two letter frequencies in a table. This means that I can treat each row of this table as a conditional PMF, selecting the next letter using the PMF for the previous letter.
John Swensen
I have been working on a word game and have been dealing with the same issue. I have tried to use a couple of different distributions of letters. The first was basing my PMF on what Oxford reported at
http://oxforddictionaries.com/words/what-is-the-frequency-of-the-letters-of-the-alphabet-in-english
The second was using a PMF generated from the scrabble letter distributions
http://en.wikipedia.org/wiki/Scrabble_letter_distributions
The third was to analyze the dictionary I have come up with for the game using a similar approach to what you found.

The problem with all these approaches is that it seems to make vowels occur much too frequently for good gameplay. Having played a fair bit of Letterpress myself, my only conclusion was that he tuned the PMF for his random letter generator quite a bit because he doesn't get anywhere near the same vowel frequency that I have observer with the three PMFs described above.

Tags: Blogger Games iPhone github Python Code Algorithms

Published: 2012-11-04