Ricardo Rocha's Website

Musings on Programming and Programming Languages

Email GitHub Twitter LinkedIn

Spelling Suggestion on the JVM #2: Implementing Norvig's Algo in Java

This tutorial project implements a basic spelling suggestion service with a didactic motivation: becoming familiar with “better Java” languages around a simple but instructional example (Norvig’s approach to spelling correction). Norvig’s algo is first implemented in Java 10. Subsequently, idiomatic implementations are presented in Kotlin, Scala and Xtend. The posts in this series are:

  1. Norvig’s Approach to Spelling Correction
  2. Implementing Norvig’s Algo in Java (this post)
  3. Implementing Norvig’s Algo in Kotlin
  4. Implementing Norvig’s Algo in Scala
  5. Implementing Norvig’s Algo in Xtend

Functional, tested code in the 4 languages is available on the project’s Spellbound repository.

If you haven’t already, taking a cursory look at Norvig’s Python script may be useful in digesting this Java implementation.

The Dictionary

Everything having to do with spelling correction revolves around a curated list of valid words we refer to as the the dictionary.

In our dictionary implementation we accompany every word with an integer rank indicating how frequently the word is used in relation to all other words. The higher the rank, the more frequently used the word is. Thus, for instance, the has rank 106295 while triose has rank 1.

The dictionary declaration is:

public class SpellingCorrector {
  private final Map<String, Integer> dictionary;
    
  public SpellingCorrector(Map<String, Integer> dictionary) {
    checkNotNull(dictionary);
    checkArgument(!dictionary.isEmpty());
    this.dictionary = dictionary;
  }
  // stuff...
}

Transforming the typo into a set of word splits

In order to hold word splits we define an inner data class WordSplit as follows:

public class SpellingCorrector {

  // stuff...
  
  static class WordSplit {
    final String left;
    final String right;

    WordSplit(String left, String right) {
      this.left = left;
      this.right = right;
    }
  }
}

Note: no getter/setter or other boilerplate is needed. This is an immutable data class visible to the SpellingCorrector and its tests only.

Armed with this definition we can now formulate a rather trivial implementation of method wordSplits:

static List<WordSplit> 
wordSplits(String word) {
  return IntStream
    .rangeClosed(0, word.length()).boxed()
    .map(i -> {
      String left = word.substring(0, i);
      String right = word.substring(i);
      return new WordSplit(left, right);
    })
    .collect(toList());
}

While we could have used a good-ole’ for loop to populate the resulting list, it’s interesting to see how to build it in a functional fashion. IntStream is a type of Stream that can only contain elements of type int. Factory method rangeClosed generates a stream of integers starting at 0 and ending right before word.length() (i.e., the closing range position is exclusive). Because we need to yield a List<WordSplit>we’re forced to “box” the IntStream so it becomes an object stream we can map over.

The wordSplits() method takes a string (the typo) and returns all possible pairs of word splits. Thus, for our beloved but misspelt hero dlbert we get:

List.of(
  new WordSplit("", "dlbert"),
  new WordSplit("dl", "bert"),
  new WordSplit("dlb", "ert"),
  new WordSplit("dlbe", "rt"),
  new WordSplit("dlber", "t"),
  new WordSplit("dlbert", "")
);

By the way, all four edit implementations take a List<WordSplit> argument rather than a simple List<String>. All of them return a Stream<String>, too.

Edits #1: deletes

Generating all possible deletes for a word is also quite simple:

static Stream<String> 
deletes(List<WordSplit> splits) {
  return splits.stream()
    .filter(split -> !split.right.isEmpty())
    .map(split -> split.left + split.right.substring(1));
}

Here, we stream over the input List<WordPair> selecting only the ones where right is not empty (which includes all splits but the very first one). Next, we concatenate each word split into a string formed by the left part followed by the “tail” of the right one.

Given the typo waly deletes() yields:

List.of("aly", "wly", "way", "wal");

Edits #2: inserts

Inserts are the opposite of deletes and their implementation is straightforward:

static Stream<String> 
inserts(List<WordSplit> splits) {
  return splits.stream()
    .flatMap(split ->
      Arrays.stream(LETTERS).map(letter ->
        split.left + letter + split.right)
    );
}

In this method our friend LETTERS makes its first appearance. Again for the sake of simplicity we deal only with lowercase ASCII characters. There is no support for diacritics though its implementation is simple anyways). LETTERS is statically defined as:

static final String[] LETTERS = "abcdefghijklmnopqrstuvwxyz".split("");

Splitting on an empty regex results in each string’s character becoming a 1-char string on its own right.

For our canine friend dgbert inserts() would yield:

List.of(
  "adgbert", "bdgbert", "cdgbert", "ddgbert", "edgbert", "fdgbert",
  "gdgbert", "hdgbert", "idgbert", "jdgbert", "kdgbert", "ldgbert",
  "mdgbert", "ndgbert", "odgbert", "pdgbert", "qdgbert", "rdgbert",
  "dgbertm", "dgbertn", "dgberto", "dgbertp", "dgbertq", "dgbertr",
  // lots of inserts ellided...
  "dgberts", "dgbertt", "dgbertu", "dgbertv", "dgbertw", "dgbertx",
  "dgberty", "dgbertz")

Edits #3: transposes

Implementing transposition requires the right fragment to have more than one character in length, which makes sense as transposing involves not one but two letters:

static Stream<String> 
transposes(List<WordSplit> splits) {
  return splits.stream()
    .filter(split -> split.right.length() > 1)
    .map(split ->
      split.left +
      split.right.substring(1, 2) +
      split.right.substring(0, 1) +
      split.right.substring(2)
    );
}

For our heroine alice transposes() yields:

List.of("laice", "ailce", "alcie", "aliec");

Edits #4: replaces

The replace edit is very prolific requiring the substitution of every letter in the typo for every letter in the alphabet. It also requires the right segment not to be empty (as something needs to be replaced). Because we’re nesting the processing of each letter inside the processing of every typo character we need to apply flatMap rather than map:

static Stream<String> 
replaces(List<WordSplit> splits) {
  return splits.stream()
    .filter(split -> !split.right.isEmpty())
    .flatMap(split ->
      Arrays.stream(LETTERS).map(letter ->
        split.left + letter + split.right.substring(1))
    );
}

For our little friend asok replaces() yields:

List.of(
        "asok", "bsok", "csok", "dsok", 
        "esok", "fsok", "gsok", "hsok",
        // quite a few replaces ellided...
        "asos", "asot", "asou", "asov",
        "asow", "asox", "asoy", "asoz"
    );

edit1: Applying All Edits Hoping for Valid Words

To cover all our bases we need to run and then concatenate all four edits. For this we define a static list of edits as follows:

private static final 
  List<Function<List<WordSplit>, Stream<String>>> edits =
    List.of(
      SpellingCorrector::deletes,
      SpellingCorrector::transposes,
      SpellingCorrector::replaces,
      SpellingCorrector::inserts
    );

This is an instance of a common functional pattern where code is treated as data. Here, we have a list of functions mapping from List<WordSplit> to Stream<String> (the type of transformation all our four edits implement). In this context, double colon-separated Java method references shine as they, effectively, provide us with a data-centric view of executable code.

We make use of this list of functions to implement edit1 as follows:

static Stream<String> 
edits1(String typo) {
  // Generate all wordSplits for typo
  var wordSplits = wordSplits(typo);

  // Generate and apply all 4 edits (in parallel)
  // to each split
  return edits
    .parallelStream()
    .flatMap(edit -> edit.apply(wordSplits));
}

Because the four edits are independent of one another we can parallelize their execution. This is transparently handled by Java by means of the Stream.parallelStream() method.

Also, because each edit returns a nested Stream<String> we need to use flatMap instead of map.

edit2: When edit1 Falters

Despite generating lots of possible suggestions, edit1 can fail to yield a dictionary word In this case we assume our typo stems not from one but two transcription mistakes.

edit2 is quite simply defined in terms of edit1 as follows:

static Stream<String> 
edits2(String typo) {
  // Apply all 4 edits twice to each split
  return edits1(typo)
    .flatMap(SpellingCorrector::edits1);
}

Again, we use flatMap because we generate candidate suggestions from the nested edit1’s Stream of results. Note also how the inner invocation of method edits1 is expressed as a method reference.

Putting it All Together

We’re almost there now. Our workhorse method getCorrections() method looks as follows:

public Optional<List<String>> 
getCorrections(String word) {
  // Ensure word format matches that of the dictionary: lowercase alphabetics
  var normalizedWord = normalize(word);

  // If word occurs in dictionary then return no suggestions
  if (dictionary.containsKey(normalizedWord)) {
    return Optional.empty();
  }

  // Corrections for one-edit typos; most typos contain just one error.
  // Method known removes duplicates, ensures presence in dictionary
  // and orders by rank
  var corrections = known(edits1(normalizedWord));

  // If edit1 yields no in-dictionary word, try with edits2.
  // Some typos stem from 2 errors; few come from more than 2
  if (corrections.isEmpty()) {
    corrections = known(edits2(normalizedWord));
 }

  // Return (possibly empty) list of suggested corrections
  return Optional.of(corrections);
}

Note we return Optional<List<String>> rather than a plain List<String>. Why?

When the word passed to our getCorrections method is a valid word occurring in the dictionary we return an Optional.empty() to indicate no corrections apply since the word is a valid one; here, we express the absence of suggestions.

If no similar dictionary word can be rebuilt from the typo then we return List.empty() (not Optional.empty()). By returning List.empty() we declare that the word in question is a typo, yes, but no dictionary word resembles it. This scenario arises when the word passed is gibberish in the likes of xwphjwl.

More commonly, though, we’ll return the equivalent of List.of("spelling", "spewing", "spiling") for typos like speling (one delete) and spelinmg (one delete and one insert).

We there yet?

Yes, we are ;-)

The complete Java implementation has 320+ lines including comments and empty lines. Its neutron star-dense incarnation comes down to 120+ lines.

Compare that to Norvig’s feat of packing the whole thing in 44 lines of Python code!

Java’s verbosity is one of the main reasons alternative JVM languages have emerged. In the next 3 posts we’ll see how the Kotlin, Scala and Xtend versions improve upon Java in terms of economy of expression and clarity of intent.


Next in our plate: Implementing Norvig’s Algo in Kotlin

comments powered by Disqus