Spelling Suggestion on the JVM #1: Norvig's Approach to Spelling Correction
- Norvig’s Approach to Spelling Correction (this post)
- Implementing Norvig’s Algo in Java
- Implementing Norvig’s Algo in Kotlin
- Implementing Norvig’s Algo in Scala
- Implementing Norvig’s Algo in Xtend
Functional, tested code in the 4 languages is available on the project’s Spellbound repository.
A Word on Spelling Suggestion
Spelling Suggestion is all about helping document writers spell words correctly. This service is pervasive among editors, word processors and, nowadays, even IDE’s.
We’ve chosen spelling suggestion as a tutorial subject because of how familiar an application domain it is. This application domain also lends itself to showcase modern language constructs such as functional programming, extension methods or type inference, to name a few.
Simple, readable code examples are first presented in Java 10 so as to ensure they’re readily understood; this reduces the cognitive effort of perusing implementations written in languages not yet familiar.
Typo Edits
A typo is the mistaken transcription of a valid, in-dictionary word. Most typos originate in one or more of the following cases:
-
deletes: A letter is omitted from the word: dilbert becomes dlbert
-
inserts: A letter is inserted that doesn't belong in the word: dogbert becomes dogvbert
-
transposes Two contiguous letters are swapped: alice becomes alcie
-
replaces: A letter is changed to another letter: wally becomes walky
Note that a delete could be amended with an insert. Correspondingly, a transpose could be fixed with the appropriate replace. We have two groups of symmetric edits.
Hence, upon detecting a typo, we can hope to reconstitute the original word by applying an “opposing” edit. This insightful realization is at the core of Norvig’s correction algorithm.
Obviously, we don’t know what letter was omitted or what letters were transposed. Therefore. we need to traverse the entire problem space by brute-force generating all possible opposing edits.
Thus, for example, we can try to amend dlbert with dalbert, dbbert, dcbert, …, dzbert and check to see if one or more of the resulting strings actually occurs in the dictionary (which, happily in this case, happens to appear as dilbert).
The general strategy is, then, to exhaustively apply the four edits to a typo hoping to find at least one dictionary word that was mistyped.
Statistically, most typos arise from one edit only (which Norvig dubs edit1
). Two-edit typos
(edit2
), while less frequent, are sufficiently common so as to warrant their own two-pass
attempt at fixing the typo. edit2
is used only when edit1
fails to find at least one matching
dictionary word; edit2
’s input is the result of a failed edit1
.
These two scenarios should suffice for the vast majority of typo cases. Of course, it’s still possible for a typo to slip through our crib anyway but we’re not aiming for perfection ;-)
Word Splits
Another frequent type of typo is the joining of two contiguous words by omitting the intervening
space: two-word poor asok mutates to poorasok.
Norvig’s solution to this conundrum is to apply edit1
and, possibly, edit2
to all possible
typo word splits. Word split pairs progress from ("", "poorasok")
to ("p", "oorasok")
to
("poo", "rasok"")
all the way down to ("poorasok", "")
.
Next in our plate: Implementing Norvig’s Algo in Java