Ricardo Rocha's Website

Musings on Programming and Programming Languages

Email GitHub Twitter LinkedIn

Spelling Suggestion on the JVM #4: Implementing Norvig's Algo in Scala

This tutorial project implements a basic spelling suggestion service with a didactic motivation: becoming familiar with “better Java” languages around a simple but instructional example (Norvig’s approach to spelling correction). Norvig’s algo is first implemented in Java 10. Subsequently, idiomatic implementations are presented in Kotlin, Scala and Xtend. The posts in this series are:

  1. Norvig’s Approach to Spelling Correction
  2. Implementing Norvig’s Algo in Java
  3. Implementing Norvig’s Algo in Kotlin
  4. Implementing Norvig’s Algo in Scala (this post)
  5. Implementing Norvig’s Algo in Xtend

Functional, tested code in the 4 languages is available on the project’s Spellbound repository.

The Dictionary

Everything having to do with spelling correction revolves around a curated list of valid words we refer to as the the dictionary.

In our dictionary implementation we associate every word with an integer rank indicating how frequently the word is used in relation to all other words. The higher the rank, the more frequently used the word is. Thus, for instance, the has rank 106295 while triose has rank 1.

The skeletal SpellingCorrector case class is:

case class SpellingCorrector(dictionary: Map[Word, Rank]) {
  require(dictionary != null && dictionary.nonEmpty)

  def getCorrectionsFor(word: Word): Option[Seq[Word]] = {
    require(word != null)
      .  .  .
  }
}

Note we can intersperse validation statements in the class’s body as well as in its methods.

Here, we make use of Scala’s case classes and their primary constructors.

Among other goodies, a case class provides sensible implementations for the toString(), hashCode() and equals() methods. It also defines a primary constructor defining all fields/properties making up the class data. Because case classes and their fields are immutable there is no need for Java-style “accessor” methods: fields are referred to directly through their names because they’re first-class properties of their containing class.

Of course, it’s still possible to transparently redefine field access via methods but this is beyond our current sphere of interest.

Object and Companion Objects

In addition to case classes (like the above SpellingCorrector) Scala also supports the notion of object: a singleton-like instance that may or may not be associated with a given class or interface. Objects may implement interfaces or extend classes as needed but they can also stand on their own.

Especially relevant for our purposes, it is possible for an object to be associated with an equally-named class in which case it’s referred to as the class’s companion object. Both the class and its companion object can mutually see their private members.

A frequent comparison between Java classes and Scala objects is that Scala objects methods can be seen as Java static methods. Although such correspondence can in effect be established, the notion of object goes much further. Thus, for instance, where static Java members are limited to fields and methods a Scala object can extend unrelated classes as well as implement arbitrary traits (comparable to -tough richer than- Java interfaces).

It is customary to invoke companion object methods from the object’s associated class. This is the case in our SpellingCorrector case class from whose getCorrections() implementation makes use of the companion object in expressions such as:

  • SpellingCorrector.normalize(word)
  • SpellingCorrector.edits1(normalizedWord)
  • SpellingCorrector.edits2(normalizedWord)

Let’s see:

case class SpellingCorrector(dictionary: Map[Word, Rank]) {
  require(dictionary != null && dictionary.nonEmpty)

  def getCorrections(word: Word): Option[Seq[Word]] = {

    require(word != null)

    // Invoke companion object's normalize() method
    val normalizedWord = SpellingCorrector.normalize(word)

    if (dictionary.contains(normalizedWord)) {
      // Curated dictionary word; no corrections needed
      None
    } else { // Try and find typo-originating words

      val corrections = {

        // Inner function to remove duplicates, weed out words not
        // present in dictionary and sort descending by rank
        def known(normalizedWords: Seq[Word]) =
          normalizedWords
            .distinct
            .filter(dictionary.contains) // Lambda usage similar to Java method reference
            .sortBy(-dictionary(_)) // reverse by rank


        // Invoke companion object's edits1() method
        val editResults1 = known(SpellingCorrector.edits1(normalizedWord))

        if (editResults1.nonEmpty) {

          // Dictionary words do occur in edits1() results, return them
          editResults1

        } else { // No dictionary words so far, try edits2()

          // Invoke companion object's edits2() method
          known(SpellingCorrector.edits2(normalizedWord))

        }
      }

      Some(corrections)
    }
  }
}

Transforming a typo into a set of word splits

Let’s take a look at the basic implementation of the edits1() and edits2() method in our companion object:

object SpellingCorrector {

  // Placeholder edit method implementations (detailed below)
  def wordSplits(word: Word): Seq[(Word, Word)] = ???
  def deletes(splits: Seq[(Word, Word)]): Seq[Word] = ???
  def inserts(splits: Seq[(Word, Word)]): Seq[Word] = ???
  def transposes(splits: Seq[(Word, Word)]): Seq[Word] = ???
  def replaces(splits: Seq[(Word, Word)]): Seq[Word] = ???

  def edits1(word: Word): Seq[Word] = {

    // wordSplits("dogbert"):
    //   ("", "dogbert"), ("d", "ogbert"), ("do", "gbert"), ("do", "gbert"),
    //   ("do", "gbert"), ("dog", "bert"), ("dogb", "ert"), ("dogbe", "rt"), ("dogber", "t")
    val splits = wordSplits(word)

    val edits = Seq[Seq[(Word, Word)] => Seq[Word]](
      deletes, // dilbert -> dilbrt, dlbert, ...
      inserts, // wally -> wallyt, walluy, ...
      transposes, // alice -> alcie, aliec ...
      replaces // boss -> bosz, bosd, ...
    )

    // Concurrently apply all edits on all possible word splits
    edits
      .par // compute separate edits in parallel
      .flatMap(edit => edit(splits)) // Apply each edit
      .seq // Make resulting collection sequential
  }

  // For less common but still frequent two-pronged typos
  def edits2(word: Word): Seq[Word] = {
    for {
      // Apply edits1() on given word
      e1 <- edits1(word).par // Run in parallel
      // Apply edits2() on each candidate yielded by edits1()
      e2 <- edits1(e1).par // Run in parallel
    } yield e2
  }
    .seq // Make resulting collection sequential
    
  // . . .  Additional logic elided
}

Here, we leave the actual edit operations unimplemented in order to emphasize how to use them as lambdas. All such edits (deletes, inserts, transposes and replaces) are implemented as methods but they’re referred to as lambdas in the edits collection.

The edits lambda collection is applied -concurrently- to each incoming word.

BTW, the ??? placeholder is of type Nothing and, therefore, is an instance of all types. Such placeholders are handy during algorithm formulation but will raise a runtime error if actually executed.

In the edits1 implementation we return:

// Concurrently apply all edits on all possible word splits
edits
  .par // compute separate edits in parallel
  .flatMap(edit => edit(splits))
  .seq // Make resulting collection sequential
}

Note Scala does not require the return keyword. The last expression in the method is returned to the caller.

For readability we’ve written above:

.flatMap(edit => edit(splits))

The Letters constant

Some of the edit operations explained below make use of the list of all valid letters. This is trivially defined as follows:

val Letters: String = ('a' to 'z').mkString // ASCII-only for simplicity

Splitting words: wordSplits

All edit operations (deletes, inserts, transposes and replaces) operate not on incoming words but on words splits derived from them.

The actual implementation of wordSplit() is straightforward:

def wordSplits(word: Word): Seq[(Word, Word)] =
  for (i <- 0 to word.length)
    yield (word.substring(0, i), word.substring(i))

The split() method takes a string (the typo) and returns all possible pairs of word splits. Thus, for our beloved but misspelt hero dlbert we get the equivalent of:

Seq(
  ("", "dlbert"),
  ("dl", "bert"),
  ("dlb", "ert"),
  ("dlbe", "rt"),
  ("dlber", "t"),
  ("dlbert", "")
)

We don’t need to define a separate WordSplit case class to hold generated word splits. In Scala a 2-element tuple is simpler and more idiomatic.

Implementing edits: delete

delete operates by suppressing each letter in turn:

def deletes(splits: Seq[(Word, Word)]): Seq[Word] =
  for ((left, right) <- splits if !right.isEmpty)
    yield left + right.substring(1)

Given the typo waly this deletes() anonymous implementation yields the equivalent of:

Seq("aly", "wly", "way", "wal")

Implementing edits: insert

insert introduces each letter to all word splits:

def inserts(splits: Seq[(Word, Word)]): Seq[Word] =
  for ((left, right) <- splits; letter <- Letters)
    yield left + letter + right

For our canine friend dgbert the above anonymous implementation of inserts() yields the equivalent of:

Seq(
  "adgbert", "bdgbert", "cdgbert", "ddgbert", "edgbert", "fdgbert",
  "gdgbert", "hdgbert", "idgbert", "jdgbert", "kdgbert", "ldgbert",
  "mdgbert", "ndgbert", "odgbert", "pdgbert", "qdgbert", "rdgbert",
  "dgbertm", "dgbertn", "dgberto", "dgbertp", "dgbertq", "dgbertr",
  // lots of inserts ellided...
  "dgberts", "dgbertt", "dgbertu", "dgbertv", "dgbertw", "dgbertx",
  "dgberty", "dgbertz")

Implementing edits: transpose

transpose operates by rotating consecutive characters:

def transposes(splits: Seq[(Word, Word)]): Seq[Word] =
  for ((left, right) <- splits if right.length > 1)
    yield left + right(1) + right(0) + right.substring(2)

For our heroine alice the above anonymous implementation of transposes() yields the equivalent of:

Seq("laice", "ailce", "alcie", "aliec")

Implementing edits: replace

replace operated by substituting every letter in the word split by each letter in the alphabet:

def replaces(splits: Seq[(Word, Word)]): Seq[Word] =
  for ((left, right) <- splits if !right.isEmpty; letter <- Letters)
    yield left + letter + right.substring(1)

For our little friend asok the above replaces() implementation yields the equivalent of:

Seq(
    "asok", "bsok", "csok", "dsok", 
    "esok", "fsok", "gsok", "hsok",
    // quite a few replaces ellided...
    "asos", "asot", "asou", "asov",
    "asow", "asox", "asoy", "asoz"
)

Other companion object methods used by the case class

In addition to all the methods listed above, the following implementations are used by the object’s associated class:

val Alphabetic: Regex = "^[\\p{Alpha}]+$".r

def isAlphabetic(word: Word): Boolean = Alphabetic.findFirstIn(word).isDefined

def normalize(word: Word): Word = {
  val normalizedWord = word.trim.toLowerCase
  if (isAlphabetic(normalizedWord)) normalizedWord
  else throw new IllegalArgumentException(s"Non-alpha word: $word")
}

edit2: When edit1 Falters

edit2 is quite simply defined in terms of edit1 as follows:

def edits2(word: Word): Seq[Word] = {
  for {
    e1 <- edits1(word).par
    e2 <- edits1(e1).par
  } yield e2
}
  .seq

Despite generating lots of possible suggestions, edit1 can fail to yield a dictionary word In this case we assume our typo stems not from one but two transcription mistakes.

Putting it All Together

We’re almost there now. Our workhorse method getCorrections() method looks as follows:

def getCorrections(word: Word): Option[Seq[Word]] = {

  require(word != null)

  import SpellingCorrector.{edits1, edits2, normalize}

  val normalizedWord = normalize(word)

  if (dictionary.contains(normalizedWord)) {
    // Curated dictionary word; no corrections needed
    None
  } else { // Try and find typo-originating words

    val corrections = {

      // Inner function to remove duplicates, weed out words not
      // present in dictionary and sort descending by rank
      def known(normalizedWords: Seq[Word]) =
        normalizedWords
          .distinct
          .filter(dictionary.contains)
          .sortBy(-dictionary(_)) // reverse by rank


      val editResults1 = known(edits1(normalizedWord))

      if (editResults1.nonEmpty) {

        // Dictionary words do occur in edits1() results
        editResults1

      } else {

        // Try edit2() only if edits1() fails; may fail itself
        known(edits2(normalizedWord))

      }
    }

    Some(corrections)
  }
}

Final words about Scala

The SpellingCorrector source file contain both the case class and companion object implementations.

The Scala implementation adds up to 117 lines including comments and empty lines. Its neutron star-dense incarnation comes down to only 74 lines!

Compare that to Norvig’s feat of packing the whole thing in 44 lines of (commented) Python code and the horror of Java taking up over 300 lines of code (120 in neutron star form).

Java’s verbosity is precisely one of the most important reason alternative JVM languages have emerged. In the next post we’ll see how the the Xtend language fares in terms of economy of expression and clarity of intent. So far Kotlin and Scala are our uncontested winners over Java.


Next in our plate: Implementing Norvig’s Algo in Xtend

comments powered by Disqus