If you’re anything like me, everytime you see a cryptogram in the entertainment section of the newspaper, you stare at it for about 60 seconds and then you give up. I’ve always wanted to be able to solve cryptograms but I never really learned how. Furthermore, even after I read a few things about the strategies you can use to solve them, I was intrigued by (in my mind) a much more interesting problem – could you write a computer program to solve cryptograms for you?
First, some definition of the problem is in order. The object of the program is to change a ciphertext consisting of cipher words into a cleartext made up of words which exist in a dictionary file, by mapping each cipher letter into exactly one cleartext letter. There is a 1-to-1 mapping in that no two cipher letters share the same cleartext and vice versa. I will also allow for apostrophes as letters which always map to apostrophes, and ignore all other characters (periods, commas). Words are separated by any non-alphabetic or apostrophe letter.
Gb py bgioy lkq ckd’y oqwwttc, yil, yil prpgd.
My first attempt at solving this problem was very rudimentary. Basically, assign a random mapping of all cipher letters to cleartext letters and produce the result. Check to see that every word is in fact in the dictionary. This solution will inevitably work, but it will also take forever. How many mappings are there for 26 letters? Well, for “A” you have 26 choices, for “B” you have 25 choices left, etc. In other words there are 26! mappings to choose from and only one will give you the right answer. In fact this is worse than just looping through all possibilities because you will start duplicating random assignments that you’ve already tried. Don’t do this.
This is a similar approach to what a human would do. Basically, we know that for most texts, e is the most common letter, followed by t, a, o, i, n, s, h, and r. Insert RSTLNE joke here. What we can try to do is to analyze our ciphertext and rank the letters by occurrence frequency and then assign letters based on our letter ranking list. As it turns out, this approach didn’t work very well. If your text has exactly the same letter distribution as the English language in general, then your first assignment should be the answer, and you’re done. This is not the case, and it’s even possible that the most common letter isn’t even E. And that’s a big problem, because when it comes to recursive algorithms, getting the first part wrong means you spend lots of time searching for solutions where none will be found. There may be a way to do this using some kind of breadth first search which will lead to a relatively quick solution, but I have to say I think this approach is misguided.
The solution I eventually came up with has much less to do with the letters themselves and more to do with the uniqueness of each word. When you’re looking at a cipher word, you don’t know what the letters actually represent, but you do know one very important thing. You know that everytime the same letter is used in the cipher, it is the same letter in the real word too. In other words, the letter pattern of a word is a property shared by the cleartext of that word.
Here’s an example:
oqwwttc (could be rewritten as)
0122334 (numerically speaking)
abccdde (as a normalized cipher word)
succeed (an actual word with this pattern)
As it turns out, in my dictionary, there are only five seven-letter words with that letter pattern, and succeed is one of them. Fantastic! We now know that the solution, if it will be found, must assign letters according to one of those five words. This leads us to the following pseudocode:
- Load the dictionary into a hash table where the hash is the normalized letter pattern of each word, and the real words are added to the table according to their hash
- For each ciphertext word, normalize it and get a list of all English words that match this pattern
- Sort the list of ciphertext words in increasing count of English words
- Recursively assign a word to the current cipher. Double check that this does not violate any previously assigned letters for previous words; if so, back out. Pick the next word and recurse.
This algorithm is actually really fast. For the cipher I gave above, there are only 5 choices for the first* word and only 7 for the second*, and this results in an assignment of 8 letters immediately which have a high (1/35) chance of being correct. The search space increases to 49 choices for the next word, and more for each word following, but by the time you reach the words with a high number of pattern matches you’ve already decided most of your letter assignments already, so your choices are easy to prune.
Here’s the source: View Source
A true programmatic crypogram solver doesn’t necessarily need to use the same skills as a human would use to solve the same problem; we can use the computer’s advantage of being able to complete repetitive tasks by pre-processing every word in the dictionary based on its letter pattern, and draw on that information to choose whole word assignments instead of individual letters. Choosing from the smallest number of options severely reduces the recursive search space, and from there it’s just a matter of testing the remaining word patterns until we reach the solution(s).
If you see any glaring errors in my code, or possible optimizations, please feel free to add a comment saying how I could improve. Thanks!
Oh and by the way, if at first you don’t succeed, try, try again.
* Not actually the first word, but the word with the lowest matching pattern count in the dictionary.