Raising the WORDLE first-guess bar!
I had seen the term WORDLE mentioned in Discord servers, but didn’t check it out until a few days ago. It’s a fun game that is essentially a Word Mastermind variant for 5 letters:
- The answer (the ‘codeword’ / ‘code’) is a predefined 5-letter dictionary word (letters can be repeated — so for example PAPAL is a legitimate answer).
- The player will guess a word (which must be a dictionary word), and will receive feedback for each character in the guessed word: this character is not in the answer (grey), this character is in the answer but not in this spot (yellow), or this character is in the answer in this exact spot (green).
- If a guess is correct, the game is won. If the player does not win within six guesses, then they lose.
My first question was obviously “what’s the best first word?” — surely there exists some optimal first guess (in a given word-list).
I grabbed the British English word-list in /usr/share/dict and trimmed it down to just 5-letter alpha-only words (5872 of them). A few helper functions for gameplay components: the ability to create a game object (with a defined or random word), generate feedback for a guess, automatically cull impossible words based on this feedback, print out still-possible words, etc. I then realized the scale of the problem and how it would not be easily optimally solved — each game state needs to include:
- all the known grey (non-present) letters discovered
- a map for all green/orange letters (ones we know are in the answer), to their known positions.
- a map for all green/orange letters to a list of where they could possibly be.
Case #2 vs #3 highlights a small but important distinction for situations where there are repeated letters. For example if the answer is PUPPY but we guess PAPAL we know that P_P__ is correct, but we still need to track that P may exist in the answer at positions [0, 1, 2, 4] => i.e. {PUPPY, POPPY}.
I then took a break from coding and did some research — reading this article by Bertrand Fan. His analysis of the game was very similar and he explained that there was a different dictionary used, which was split into two sub-lists: ‘possible answers’ and ‘valid dictionary words that are guaranteed not answers’. He looked at frequency analysis (however his code ignores the fact that words can have multiple copies of the same letter, which could skew certain letters much more than others). His final analysis is the most key — sorting possible first guesses based on how much information they gave in terms of green/yellow/black characters. He does admit that the definition of “best” is arbitrary, but I think it can be improved.
I decided on just evaluating the number of possible words in (average case, worst case) left after each possible first guess. After all, the optimal solutions to this game is one which can always determine the answer in X guesses or fewer, but has the lowest expected # of guesses for all possible answers. Since we’re not doing an exhaustive search (so there’s no guarantee there exists an optimal solution for X=6 even) using the remaining search-space seems more appropriate than just valuing the information by Bertrand’s heuristic “number of greens * 2 + number of yellows)”.
Brute forcing all the possible first guesses took a little over 3 hours running on 7 cores of an i7–6700K.
Raw data output (in the form WORD: AVG (min: MIN, max: MAX)): https://gist.github.com/Noxville/d17d04e196b146a2c18eda67aa4d2e13 —
As said above we want our initial guess to be one which, on average, reduces the remaining pool the most — but we also care about the worst case. With this in mind, there are two clear favourites for best first guess — ROATE has the lowest average search-space but a relatively high worst case, and RAISE has a slightly higher average worse case than ROATE but a much lower worst case. Guessing ROATE decreases the remaining possible word pool to just ~60.42 words on average, just 1.03% of the original 5872 valid words.
Interesting to note that ARISE, AESIR and REAIS also appear in this shortlist (all anagrams of RAISE) — however because of the letter ordering they do a less efficient job at eliminating words than RAISE (at least, for the provided word-list).
The worst first guesses include the likely suspects with repeated letters and no E’s like PUPPY, FUZZY, KIBBI, and (my favourite) SUSUS. These have an average and worst case remaining word count ~11.5x —12.5x and 6.5x respectively when compared to the best first guesses.
Some final/random observations:
- The fact that the answer must be a dictionary word means that there’s no uniformity over the {a..z}⁵ space.
- All guesses being dictionary words means that it might be better to guess a word which you know is not a possible valid answer. This reduces your best-case (winning immediately with the guess) in order try reduce average and worst case.
- Not all search spaces are equal. Sometimes the remaining possible words are nice — when there exist some specific untested letters present in ~half the remaining words (and in the words they are present in they are spread between multiple possible locations). Obviously in practice there is a large overlap in redundant information for each letter’s feedback — so you’re not making a guaranteed ~32x reduction if you have 5 ‘nice’ letters and a word which exists containing all of them.
- Because of the difficult to evaluate search-spaces (a smaller search space not being strictly better than a larger one), a heuristic-based approach is likely the only real option — other than an exhaustive search which is likely too slow. Even if you somehow knew what the optimal first word was, calculating the best possible second guess for all 3⁵ feedback states you could’ve received (3 possible colors, 5 letters) remains a challenging problem. The information doesn’t sum nicely (since there’s overlap excluded words), and is obviously based on the gamestate after the response to the first guess.
Might spend some more time and see if there’s nice short-cuts to solve the game fully! Until then, I’m starting with RAISE!
- Noxville