The player with the secret writes a series of dashes, one representing each letter in the solution. Initially, no information is known about the target word, other than its length. The solver calls out letters, one-by-one. If a called letter appears in the solution, all occurrences in the solution are filled in. If the letter does not appear in the solution, the secret writer adds one element to a drawing of a gallows (complete with a stick man).
A complete rendering takes eleven moves. If the hangman drawing gets completed (eleven incorrect letters), then the secret writer has won. If all letters of the word are revealed before this happens, then the solver wins.
As a young person, when you first started to play the game, you probably called out random letters. Once you had hit a couple of letters, they helped you narrow down the solution. Next, you probably graduated to calling vowels first, having learned that (just about*) all words contain at least one vowel (or the letter ‘Y’).
* The very complete word dictionary I’m using for this exercise contains 172,806 words. Only twenty of these words contain neither a vowel nor the letter ‘Y’, e.g. CWM, TSKTSK, PSST, PHPHT and BRRR. (I counted just 121 words that do not contain any of the letters 'AEIOU', so the number that use just 'Y' as a vowel is 101.)
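The footnote's arithmetic (121 words without 'AEIOU', of which 20 also lack 'Y', leaving 101) can be reproduced with a short count. This is only a sketch: the function name `vowel_stats` and the tiny sample list are made up for illustration, and a real run would need the full word list, which is not included here.

```python
VOWELS = set("AEIOU")

def vowel_stats(words):
    """Return (words with no AEIOU, words with no AEIOU and no Y)."""
    no_aeiou = 0
    no_aeiou_or_y = 0
    for word in words:
        letters = set(word.upper())
        if letters & VOWELS:
            continue  # contains at least one of A, E, I, O, U
        no_aeiou += 1
        if "Y" not in letters:
            no_aeiou_or_y += 1
    return no_aeiou, no_aeiou_or_y

# Hypothetical sample; the article's dictionary has 172,806 words.
sample = ["CWM", "PSST", "RHYTHM", "MYTH", "HANGMAN"]
no_aeiou, no_vowel_at_all = vowel_stats(sample)
# Words that rely on 'Y' alone are the difference between the two counts.
print(no_aeiou, no_vowel_at_all, no_aeiou - no_vowel_at_all)  # 4 2 2
```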
Enlightenment - Guessing the first letter
Next, you probably graduated to learning that not all letters are used equally. It’s rare that the letter ‘Q’ appears in a word, whereas ‘T’ is used a lot more often.
Once you get just a couple of letters in a hangman puzzle, the game becomes easier. The solution set is drastically reduced, and skills like pattern matching and word knowledge become important. It’s crucial to get that first letter in the puzzle as soon as possible. Which letter should you guess first?
Code, Cyphers and Secret Writing
Growing up you probably invented or used your own substitution cypher (where each letter is replaced by a different letter or symbol). A classic example is the Pigpen Cypher. Messages encoded in a simple cypher are pretty easy to crack because the same letter is always represented by the same symbol. If you solved a lot of puzzle cyphers, then you probably learned and used the letter ordering below.
Ordering of letter frequency in English language: ETAOIN SHRDLU CMFWYP VBGKQJ XZ
The sequence above represents the usage order of letters in the English language, with the letter ‘E’ being the most common letter, followed by the letter ‘T’, all the way down to the letter ‘Z’, the least commonly used.
So, the first letter we should guess when trying to solve a hangman is the letter ‘E’, right?
Since ‘E’ is the most popular letter in English text, it will have the highest probability of being in our word, right?
Wrong!
First mistake
Yes, the ordering above is an accurate portrayal of the frequency of usage of letters in English text, and if we were examining English text it is what we should be using …
… But, we’re not looking at pages of text, we’re looking at isolated words.
English text is full of words that are used very frequently: THE, OF, AND, A, TO, IN, IS, YOU, THAT, IT …
A frequency table of letter usage based on English text is biased because of the substantial presence of these common words.
(About one-third of all printed English material is made up of the top 25 occurring words. The most popular 100 words make up approximately one-half of all printed English!).
Because we’re trying to guess a naked word in isolation, the above frequency distribution is not appropriate. It's distorted.
First Refinement
Instead, what we need to look for is the incidence of letters in the words in our dictionary, not the incidence of letters in all English text.
This will give a much better probability estimate for the frequency of letters because it will be unbiased by the frequency of common words.
We can further refine this strategy and do a little better. Since we’re happy whether we hit one or many letters in our target word, we do not want to double count a letter that appears more than once in the same word. Instead of counting every occurrence of every letter, we count each letter at most once per word. This gives, for each letter, the number of words in which that letter is present.
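The counting rule just described can be sketched in a few lines. The function name `words_containing` and the sample words are illustrative assumptions, and the sketch assumes every word consists only of letters.

```python
from collections import Counter

def words_containing(words):
    """For each letter, count the number of words in which it appears at least once."""
    counts = Counter()
    for word in words:
        # set() collapses repeated letters, so 'LETTER' adds 1 for 'E', not 2
        counts.update(set(word.upper()))
    return counts

# Hypothetical sample standing in for the full dictionary.
sample = ["LETTER", "TREE", "SEAT", "QUIZ"]
counts = words_containing(sample)
print(counts["E"], counts["Q"])  # 3 1
```

Dividing each count by the number of words gives the proportion of words containing that letter, and `counts.most_common()` lists the letters from most to least widespread.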
We can then sort this list based on the probabilities (count of the number of words that letter is present in). Here are the results: ESIARN TOLCDU PMGHBY FVKWZX QJ
There's a noticeable difference. Here, again, is the distribution based on frequency in English text (for comparison). ETAOIN SHRDLU CMFWYP VBGKQJ XZ
Whilst 'E' is still the most popular letter, the next most popular (based on number of words in the dictionary that contain it), is 'S' and not 'T'. 'T' has been relegated to seventh ordinal position (60.13% of all words in my dictionary have a letter 'S' in them, but only 48.23% of them have a letter 'T').
Next in popularity come two more vowels 'I' and 'A' ('O' having moved further back). 'R' occurs significantly more often in isolated words than it does when biased by the frequency in everyday text.
The ordering of vowels is now 'E I A O U' instead of 'E A O I U'
Interestingly, the least likely letter is now 'J' instead of 'Z'. (There are just 2,463 words in the dictionary that contain the letter 'J' cf. 4,592 with the letter 'X' and 7,028 containing the letter 'Z').
Now that we know the chances of a letter being in any word we can use this new table to select our guesses, right?
Wrong!
Don't forget about the length!
The above distribution has been calculated for all the words in the dictionary. But remember, when playing Hangman, we know the length of the word we are trying to guess. This allows us to further refine our searching.
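One way to apply this refinement is to keep a separate tally for each word length. The sketch below is an assumption about how such a grouping could be done, not the article's actual code; `letter_order_by_length` and the sample words are hypothetical, and ties are broken alphabetically purely for repeatability.

```python
from collections import Counter, defaultdict

def letter_order_by_length(words):
    """Map each word length to its letters, most widespread first."""
    by_length = defaultdict(Counter)
    for word in words:
        # Count each letter once per word, in the bucket for that word's length.
        by_length[len(word)].update(set(word.upper()))
    return {
        length: "".join(
            letter
            for letter, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
        )
        for length, counts in by_length.items()
    }

# Hypothetical sample standing in for the full dictionary.
sample = ["A", "I", "AT", "AS", "IT", "CAT", "SAT", "SIT"]
orders = letter_order_by_length(sample)
print(orders[2])  # ATIS
print(orders[3])  # TASCI
```

Letters that never occur at a given length simply do not appear in that length's ordering.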
Below is a table showing the popularity of letters in dictionary words grouped by the length of those words. The most popular letters are at the top of the table, and the least popular letters at the bottom. To the left are the shorter word lengths, and to the right are the longer ones.
Rank \ Length of Word | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
#1 | A | A | A | A | S | E | E | E | E | E | E | E | I | I | I | I | I | I | I | I |
#2 | I | O | E | E | E | S | S | S | S | I | I | I | E | E | E | E | E | S | E | O |
#3 | | E | O | S | A | A | I | I | I | S | S | S | N | T | T | T | T | E | T | E |
#4 | | I | I | O | R | R | A | A | R | R | N | N | T | S | N | S | N | T | O | T |
#5 | | M | T | I | O | I | R | R | A | A | A | T | S | N | S | N | S | O | N | R |
#6 | | H | S | R | I | O | N | N | N | N | R | A | A | A | O | A | O | N | A | S |
#7 | | N | U | L | L | L | T | T | T | T | T | R | O | O | A | O | A | R | S | A |
#8 | | U | P | T | T | N | O | O | O | O | O | O | R | R | R | R | R | A | R | N |
#9 | | S | R | N | N | T | L | L | L | L | L | L | L | L | L | L | L | L | L | C |
#10 | | T | N | U | U | D | D | D | C | C | C | C | C | C | C | C | C | C | C | L |
#11 | | Y | D | D | D | U | U | C | D | D | U | P | P | P | P | P | P | P | P | P |
#12 | | B | B | P | C | C | C | U | U | U | D | U | U | U | U | U | U | M | M | H |
#13 | | L | G | M | Y | M | G | G | G | G | P | M | M | M | M | M | M | U | U | U |
#14 | | P | M | H | P | P | P | M | M | M | M | D | G | D | D | H | H | H | H | M |
#15 | | X | Y | C | M | G | M | P | P | P | G | G | D | H | H | D | D | D | D | Y |
#16 | | D | L | B | H | H | H | H | H | H | H | H | H | G | G | Y | G | G | G | D |
#17 | | F | H | K | G | B | B | B | B | B | B | Y | Y | Y | Y | G | Y | Y | Y | G |
#18 | | R | W | G | B | Y | Y | Y | Y | Y | Y | B | B | B | B | B | B | B | B | B |
#19 | | W | F | Y | K | K | F | F | F | F | F | V | V | V | V | V | V | V | V | Z |
#20 | | G | C | W | F | F | K | K | V | V | V | F | F | F | F | F | F | Z | F | V |
#21 | | J | K | F | W | W | W | W | K | K | K | Z | Z | Z | Z | Z | Z | F | Z | F |
#22 | | K | X | V | V | V | V | V | W | W | W | K | X | X | X | X | X | X | X | K |
#23 | | | V | J | Z | Z | Z | Z | Z | Z | Z | W | K | K | W | W | Q | Q | K | X |
#24 | | | J | Z | X | X | X | X | X | X | X | X | W | W | K | Q | W | W | J | J |
#25 | | | Z | X | J | J | J | Q | Q | Q | Q | Q | Q | Q | Q | K | J | K | Q | Q |
#26 | | | Q | Q | Q | Q | Q | J | J | J | J | J | J | J | J | J | K | | W | |
There are so many fascinating things to point out about this table that I don't know where to start!
There are only two words with one letter! There are no two letter words containing the letter C, Q, V or Z.
From one to four letter words, the most popular letter is A. For five letter words it changes to S, then from six to twelve it is the letter E. From thirteen letters onwards, the most likely letter to be in a word is the letter I.
The letter A starts off as the most popular vowel, but by the time words grow to 15 letters long, it has been relegated to fourth most common vowel.
There is no word in the English language that is 18 letters long and contains the letter J. Similarly, there is no twenty letter word which contains the letter W. T is the most popular consonant in three letter words, and falls in popularity in mid-length words before regaining its popularity at fourteen. Z is never the least popular letter.
O falls in popularity in mid-length words.
… plus much more …
OK, so our strategy should be to find the column corresponding to the number of letters in the target word, and start calling down the letters from the top until we get a hit, right?
Wrong! (Though now we're a lot closer to an optimal strategy!)
Conditional Probability
Results
Here are the final results of these calculations. These charts tell you what order to call letters, based on length of the word, to maximize your chances of getting your first hit.
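The calculation behind such an ordering is not shown in this text, so the following is only a sketch of one plausible reading of the conditional-probability idea: after each miss, discard every candidate word that contains the missed letter, then recount among the words that remain. The name `first_hit_order`, the alphabetical tie-break, and the sample words are all assumptions made for illustration.

```python
from collections import Counter

def first_hit_order(words):
    """Greedy calling order for same-length candidate words, assuming every call so far missed."""
    remaining = [set(w.upper()) for w in words if w]
    order = []
    while remaining:
        counts = Counter()
        for letters in remaining:
            counts.update(letters)  # each letter counted once per word
        # Most widespread letter among the remaining words; ties broken alphabetically.
        best = min(counts, key=lambda c: (-counts[c], c))
        order.append(best)
        # A miss on `best` rules out every word that contains it.
        remaining = [letters for letters in remaining if best not in letters]
    return order

# Hypothetical three-letter candidates.
print(first_hit_order(["CAT", "CAR", "DOG", "DIG", "PUP"]))  # ['A', 'D', 'P']
```

In this toy example a plain frequency table would suggest 'C' straight after 'A', yet once 'A' has missed, no remaining word contains 'C'. Conditioning on the earlier misses avoids that wasted guess.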
There are some interesting takeaways from these results:
Carrying on
The above analysis (finding our first letter) is easy to render in table form because there are only two outcomes: we either miss, or we hit. If we miss, we simply try again. Once we've hit a letter or two, however, things get too complex to display in table format, e.g. "Show me the next best letter to guess for eight letter words that do not have an 'E' or 'I', but have an 'A' and a 'T'!" We'd have a stack of tables reaching up to the ceiling for all combinations of letters present or not, and their positions!
Computers are far better at filtering and sifting through databases. Once a first letter has been found, this knowledge (letters not present, letters found, and the positions of those letters) massively reduces the solution set of possible words. Tools like SQL and regular expressions can be quickly applied to find all possible words that match the comb filter built up.
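As a rough illustration of that kind of filtering, here is a sketch using Python's `re` module. The function `matching_words`, the `_` placeholder for unrevealed positions, and the sample words are assumptions for this example. It relies on the rule of the game that a correct call fills in every occurrence of the letter, so a blank can never hide a letter that is already revealed.

```python
import re

def matching_words(words, pattern, missed=""):
    """Return the words consistent with a hangman pattern such as '_A_T' and the missed letters."""
    pattern = pattern.upper()
    # Blanks cannot hold a missed letter, nor a letter that is already revealed.
    ruled_out = set(missed.upper()) | {c for c in pattern if c != "_"}
    blank = "[^" + "".join(sorted(ruled_out)) + "]" if ruled_out else "."
    regex = re.compile("".join(blank if c == "_" else re.escape(c) for c in pattern))
    # fullmatch also enforces the known word length.
    return [w for w in words if regex.fullmatch(w.upper())]

# Hypothetical candidates: 'A' and 'T' are revealed, 'E' and 'I' have missed.
sample = ["CART", "TART", "HALT", "WAIT", "FEAT", "SALT", "CARTS"]
print(matching_words(sample, "_A_T", missed="EI"))  # ['CART', 'HALT', 'SALT']
```

The surviving words can then be fed back into a letter count to choose the next guess.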
Pre-computed tables are fine only up to a point; after that, they become unmanageable. To paraphrase a famous quote:
"Battle plans are excellent up until the first shot is fired!"