Spelling Suggestion on the JVM #1: Norvig's Approach to Spelling Correction
A Word on Spelling Suggestion
Spelling Suggestion is all about helping document writers spell words correctly. This service is pervasive among editors, word processors and, nowadays, even IDE’s.
We’ve chosen spelling suggestion as a tutorial subject because of how familiar an application domain it is. This application domain also lends itself to showcase modern language constructs such as functional programming, extension methods or type inference, to name a few.
Simple, readable code examples are first presented in Java 10 so as to ensure they’re readily understood; this reduces the cognitive effort of perusing implementations written in languages not yet familiar.
A typo is the mistaken transcription of a valid, in-dictionary word. Most typos originate in one or more of the following cases:
deletes: A letter is omitted from the word: dilbert becomes dlbert
inserts: A letter is inserted that doesn't belong in the word: dogbert becomes dogvbert
transposes Two contiguous letters are swapped: alice becomes alcie
replaces: A letter is changed to another letter: wally becomes walky
Note that a delete could be amended with an insert. Correspondingly, a transpose could be fixed with the appropriate replace. We have two groups of symmetric edits.
Hence, upon detecting a typo, we can hope to reconstitute the original word by applying an “opposing” edit. This insightful realization is at the core of Norvig’s correction algorithm.
Obviously, we don’t know what letter was omitted or what letters were transposed. Therefore. we need to traverse the entire problem space by brute-force generating all possible opposing edits.
Thus, for example, we can try to amend dlbert with dalbert, dbbert, dcbert, …, dzbert and check to see if one or more of the resulting strings actually occurs in the dictionary (which, happily in this case, happens to appear as dilbert).
The general strategy is, then, to exhaustively apply the four edits to a typo hoping to find at least one dictionary word that was mistyped.
Statistically, most typos arise from one edit only (which Norvig dubs
edit1). Two-edit typos
edit2), while less frequent, are sufficiently common so as to warrant their own two-pass
attempt at fixing the typo.
edit2 is used only when
edit1 fails to find at least one matching
edit2’s input is the result of a failed
These two scenarios should suffice for the vast majority of typo cases. Of course, it’s still possible for a typo to slip through our crib anyway but we’re not aiming for perfection ;-)
Another frequent type of typo is the joining of two contiguous words by omitting the intervening
space: two-word poor asok mutates to poorasok.
Norvig’s solution to this conundrum is to apply
edit1 and, possibly,
edit2 to all possible
typo word splits. Word split pairs progress from
("", "poorasok") to
("p", "oorasok") to
("poo", "rasok"") all the way down to
Next in our plate: Implementing Norvig’s Algo in Java