Ricardo Rocha's Website

Musings on Programming and Programming Languages

Email GitHub Twitter LinkedIn

Spelling Suggestion on the JVM #1: Norvig's Approach to Spelling Correction

This tutorial project implements a basic spelling suggestion service. The motivation is didactic: becoming familiar with “better java” languages around a simple but instructional example. Ease of implementation outranks optimal performance so readers can focus on the JVM languages as such. Examples and new concepts are introduced in Java 10 first so they’re immediately understandable to the experienced Java programmer. Crucially, the Java implementation is accompanied by equivalent (but idiomatic) implementations in today’s most relevant alternative JVM languages: Kotlin, Scala and Xtend. Code is available at Spellbound’s Github repo.

This first post discusses Norvig’s approach to generating typo corrections.

The next post (Implementing Norvig’s Algo in Java) illustrates how to implement Norvig’s spelling corrector in Java 9 following a functional approach.

A Word on Spelling Suggestion

Spelling Suggestion is all about helping document writers spell words correctly. This service is pervasive among editors, word processors and, nowadays, even IDE’s.

We’ve chosen spelling suggestion as a tutorial subject because of how familiar an application domain it is. This application domain also lends itself to showcase modern language constructs such as functional programming, extension methods or type inference, to name a few.

Simple, readable code examples are first presented in Java 10 so as to ensure they’re readily understood; this reduces the cognitive effort of perusing implementations written in languages not yet familiar.

Typo Edits

A typo is the mistaken transcription of a valid, in-dictionary word. Most typos originate in one or more of the following cases:

  • deletes: A letter is omitted from the word: dilbert becomes dlbert
  • inserts: A letter is inserted that doesn't belong in the word: dogbert becomes dogvbert
  • transposes Two contiguous letters are swapped: alice becomes alcie
  • replaces: A letter is changed to another letter: wally becomes walky

Note that a delete could be amended with an insert. Correspondingly, a transpose could be fixed with the appropriate replace. We have two groups of symmetric edits.

Hence, upon detecting a typo, we can hope to reconstitute the original word by applying an “opposing” edit. This insightful realization is at the core of Norvig’s correction algorithm.

Obviously, we don’t know what letter was omitted or what letters were transposed. Therefore. we need to traverse the entire problem space by brute-force generating all possible opposing edits.

Thus, for example, we can try to amend dlbert with dalbert, dbbert, dcbert, …, dzbert and check to see if one or more of the resulting strings actually occurs in the dictionary (which, happily in this case, happens to appear as dilbert).

The general strategy is, then, to exhaustively apply the four edits to a typo hoping to find at least one dictionary word that was mistyped.

Statistically, most typos arise from one edit only (which Norvig dubs edit1). Two-edit typos (edit2), while less frequent, are sufficiently common so as to warrant their own two-pass attempt at fixing the typo. edit2 is used only when edit1 fails to find at least one matching dictionary word; edit2’s input is the result of a failed edit1.

These two scenarios should suffice for the vast majority of typo cases. Of course, it’s still possible for a typo to slip through our crib anyway but we’re not aiming for perfection ;-)

Word Splits

Another frequent type of typo is the joining of two contiguous words by omitting the intervening space: two-word poor asok mutates to poorasok.

Norvig’s solution to this conundrum is to apply edit1 and, possibly, edit2 to all possible typo word splits. Word split pairs progress from ("", "poorasok") to ("p", "oorasok") to ("poo", "rasok"") all the way down to ("poorasok", "").


Next in our plate: Implementing Norvig’s Algo in Java

comments powered by Disqus