Ricardo Rocha's Website

Musings on Programming and Programming Languages

Email GitHub Twitter LinkedIn

Spelling Suggestion on the JVM #3: Implementing Norvig's Algo in Kotlin

This tutorial project implements a basic spelling suggestion service with a didactic motivation: becoming familiar with “better Java” languages around a simple but instructional example (Norvig’s approach to spelling correction). Norvig’s algo is first implemented in Java 10. Subsequently, idiomatic implementations are presented in Kotlin, Scala and Xtend. The posts in this series are:

  1. Norvig’s Approach to Spelling Correction
  2. Implementing Norvig’s Algo in Java
  3. Implementing Norvig’s Algo in Kotlin (this post)
  4. Implementing Norvig’s Algo in Scala
  5. Implementing Norvig’s Algo in Xtend

Functional, tested code in the 4 languages is available on the project’s Spellbound repository.

The Dictionary

Everything having to do with spelling correction revolves around a curated list of valid words we refer to as the the dictionary.

In our dictionary implementation we associate every word with an integer rank indicating how frequently the word is used in relation to all other words. The higher the rank, the more frequently used the word is. Thus, for instance, the has rank 106295 while triose has rank 1.

The dictionary declaration is:

class SpellingCorrector(private val dictionary: Map<String, Int>) {
  // stuff...
}

Here, we make use of Kotlin’s primary constructor where a class instance is created from one or more values. Kotlin encourages using immutable values; this is reflected above by declaring the dictionary to be an immutable val.

In addition to declaring the dictionary field, this primary constructor also makes it inaccessible from the outside world by means of private. Unlike Java, Kotlin’s general default visibility modifier is public; this is very convenient since we strive to make all values immutable.

Transforming the typo into a set of word splits

In addition to good ole’ classes (like the above SpellingCorrector) Kotlin also supports the notion of object: a singleton-like instance that may or may not be associated with a corresponding class or interface. In other words, objects may implement interfaces or extend classes as needed but they can also stand on their own.

Importantly, objects can declare their own inner classes, values, variables, functions, etc.

In our case, we create an object called Edits that contains:

  • A data class WordSplit for storing left/right word splits
  • A function to generate word splits given a (typo) word
  • A list of lambdas transforming a set of word splits into a set of candidate words

On the surface our Edits Kotlin object looks like:

object Edits {

    // Data class for word splits
    data class WordSplit(val left: String, val right: String)

    // Build a collection of word splits from a given word
    fun wordSplits(word: String): Iterable<WordSplit> =
            IntRange(0, word.length).map {
                WordSplit(word.substring(0, it), word.substring(it))
            }

    // Store Norvig's four edit-generating functions as lambdas
    // Here, like in Java, we treat code as data
    val ALL_EDITS: List<(Iterable<WordSplit>) -> Iterable<String>> = listOf(
        fun(wordSplits) = // delete logic...,
        fun(wordSplits) = // insert logic...,
        fun(wordSplits) = // transpose logic...,
        fun(wordSplits) = // replace logic...
    )
}

Word Splits

In order to hold word splits we define an inner data class WordSplit as:

// Data class for word splits
data class WordSplit(val left: String, val right: String)

Kotlin data classes don’t require getters/setters for their properties (which must all be declared in the data class’s primary constructor as immutable values). A data class also implements sensible equals, hashCode and toString implementations for free!

Armed with this definition we can now formulate the rather trivial implementation of wordSplits():

// Build a collection of word splits from a given word
fun wordSplits(word: String): Iterable<WordSplit> =
        // No need to use "return": this is an expression, not a statement
        IntRange(0, word.length).map {
            // Look ma: no "new" needed to create WordSplit instance!
            WordSplit(word.substring(0, it), word.substring(it))
        }

Here we have an instance of an expression function using no curly braces to surround the function body. Instead, we use the = sign to separate the function’s declaration from its one-expression body. This body creates a collection from the range 0 to word.length (inclusive). Note, en passant, that we don’t need to use empty parens after length in the expression word.length; in Kotlin, it is a property rather than a function.

The range is, in effect, a collection of integers that can be mapped over to apply a lambda. In Kotlin, lambdas must be surrounded by curly braces (unlike Java, parenthesis just don’t cut it).

One-argument lambdas can use the implicitly defined value it to refer to their sole argument. Thus, above we use it to refer to each subsequent value in the range. In this usage it has type Int and is used as an index to the substring function.

The ALL_EDITS list

Even though we could use vanilla functions for each of Norvig’s four edits we purposefully choose to implement them as elements of a list of anonymous functions:

// Store Norvig's four edit-generating functions as lambdas
// Here, like in Java, we treat code as data
val ALL_EDITS: List<(Iterable<WordSplit>) -> Iterable<String>> = listOf(
    fun(wordSplits) = // delete logic...,
    fun(wordSplits) = // insert logic...,
    fun(wordSplits) = // transpose logic...,
    fun(wordSplits) = // replace logic...
)

Note the list’s generic type: (Iterable<WordSplit>) -> Iterable<String> which can be read roughly as: given a sequence of WordSplits generate a sequence of Strings.

By storing functions as data in a list we can apply them in a generic fashion to any given collection of word splits:

Edits.ALL_EDITS.flatMap { it(wordSplits) }

Here, we actually invoke each edit function by passing the wordSplits argument to it, cool!

The split() method takes a string (the typo) and returns all possible pairs of word splits. Thus, for our beloved but misspelt hero dlbert we get the equivalent of:

listOf(
  WordSplit("", "dlbert"),
  WordSplit("dl", "bert"),
  WordSplit("dlb", "ert"),
  WordSplit("dlbe", "rt"),
  WordSplit("dlber", "t"),
  WordSplit("dlbert", "")
)

By the way, all four edit implementations take an Iterable<WordSplit> argument while returning an Iterable<String>.

IntRange(startInclusive: Int, endInclusive: Int) returns a subtype of Iterable<Int>. In our example above this means all integers between 0 and 6.`

Subsequently, each successive integer value is used as a substring index for splitting the word into a pair of left and right fragments.

Edits #1: deletes

Generating all possible deletes for a word is quite simple:

// deletes
fun(wordSplits) = wordSplits
        .filterNot { it.right.isEmpty() }
        .map { it.left + it.right.substring(1) }

Here, we traverse an input Iterable<WordPair> selecting only the pairs where right is not empty (which is to say to all splits but the very first one). Next, we concatenate each word split into a string formed by the left part followed by the “tail” of the right one.

Given the typo waly this deletes() anonymous implementation yields the equivalent of:

listOf("aly", "wly", "way", "wal")

Edits #2: inserts

Inserts are the opposite of deletes and while their implementation is correspondingly simple:

// inserts
fun(wordSplits) = wordSplits
        .flatMap { wordSplit ->
            LETTERS.map { wordSplit.left + it + wordSplit.right }
        },

This inserts() edit uses a LETTERS value of type Iterable<String>. Again for the sake of simplicity we deal only with lowercase ASCII characters. There is no support for diacritics (whose implementation is not so complicated anyways). LETTERS is defined in the Edits object as:

private val LETTERS = CharRange('a', 'z')

CharRange is a subtype of Iterable<String>.

For our canine friend dgbert the above anonymous implementation of inserts() yields the equivalent of:

listOf(
  "adgbert", "bdgbert", "cdgbert", "ddgbert", "edgbert", "fdgbert",
  "gdgbert", "hdgbert", "idgbert", "jdgbert", "kdgbert", "ldgbert",
  "mdgbert", "ndgbert", "odgbert", "pdgbert", "qdgbert", "rdgbert",
  "dgbertm", "dgbertn", "dgberto", "dgbertp", "dgbertq", "dgbertr",
  // lots of inserts ellided...
  "dgberts", "dgbertt", "dgbertu", "dgbertv", "dgbertw", "dgbertx",
  "dgberty", "dgbertz")

Edits #3: transposes

Implementing transposition requires the right fragment to have more than one character in length, which makes sense as transposing involves not one but two letters:

// transposes
fun(wordSplits) = wordSplits
        .filter { it.right.length > 1 }
        .map {
            it.left +
                    it.right.substring(1, 2) +
                    it.right.substring(0, 1) +
                    it.right.substring(2)
        }

For our heroine alice the above anonymous implementation of transposes() yields the equivalent of:

listOf("laice", "ailce", "alcie", "aliec")

Edits #4: replaces

The replace edit is very prolific requiring the substitution of every letter in the typo for each letter in the alphabet. It also requires the right segment not to be empty (as something needs to be replaced). Because we’re nesting the processing of each letter inside the processing of every typo character we need to apply flatMap rathen than map:

// replaces
fun(wordSplits) = wordSplits
        .filterNot { it.right.isEmpty() }
        .flatMap { wordSplit ->
            LETTERS.map { wordSplit.left + it + wordSplit.right.substring(1) }
        }

For our little friend asok the above replaces() implementation yields the equivalent of:

listOf(
    "asok", "bsok", "csok", "dsok", 
    "esok", "fsok", "gsok", "hsok",
    // quite a few replaces ellided...
    "asos", "asot", "asou", "asov",
    "asow", "asox", "asoy", "asoz"
)

edit1: Applying All Edits Hoping for Valid Words

To cover all our bases we need to run and then concatenate all four edits. For this we define, in the Edits object, an Iterable<String> as follows:

val ALL_EDITS: List<(Iterable<WordSplit>) -> Iterable<String>> = listOf(
        // deletes
        fun(wordSplits) = wordSplits
                .filterNot { it.right.isEmpty() }
                .map { it.left + it.right.substring(1) },
        // inserts
        fun(wordSplits) = wordSplits
                .flatMap { wordSplit ->
                    LETTERS.map { wordSplit.left + it + wordSplit.right }
                },
        // transposes
        fun(wordSplits) = wordSplits
                .filter { it.right.length > 1 }
                .map {
                    it.left +
                            it.right.substring(1, 2) +
                            it.right.substring(0, 1) +
                            it.right.substring(2)
                },
        // replaces
        fun(wordSplits) = wordSplits
                .filterNot { it.right.isEmpty() }
                .flatMap { wordSplit ->
                    LETTERS.map { wordSplit.left + it + wordSplit.right.substring(1) }
                }
)

Note the type of ALL_EDITS: it’s a List<...> rather than an Iterable<...>. This is done so as to enable us to reference each edit by index on ALL_EDITS:

ALL_EDITS[0](wordSplits) // deletes(wordSplits)
ALL_EDITS[1](wordSplits) // inserts(wordSplits)
ALL_EDITS[2](wordSplits) // transposes(wordSplits)
ALL_EDITS[3](wordSplits) // replace(wordSplits)

The grouping of edits in a List<(Iterable<WordSplit>) -> Iterable<String>> is an instance of a common functional pattern where code is treated as data. Here, we have a list of functions mapping from Iterable<WordSplit> to Iterable<String> (which is, precisely, the type of transformation all of our four edits implement). In this context Kotlin anonymous functions shine as they, effectively, provide us with a data-centric view of executable code.

We make use of this list of functions to implement edit1 as follows:

fun edits1(typo: String): Iterable<String> {
    // Collect typo's word splits
    val wordSplits = Edits.wordSplits(typo)
    // Exercise all four edits in parallel and coalesce (*pack*)
    // the results
    return Edits.ALL_EDITS
            .flatMap { it(wordSplits) }
            .pack(dictionary)
}

Because each edit returns a nested Iterable<String> we need to use flatMap instead of map.

Because the four edits are independent of one another we can (and, indeed, should) parallelize their execution.

On the bright side, Kotlin doesn’t requires us to “take the stream()” of a collection as Java does. Functional functions such as filter, map and flatMap can be invoked directly on all subtypes of Iterable.

On the not-so-bright side, Kotlin doesn’t directly provide the equivalent of Java’s Stream.parallelStream() method.

We can, however, take advantage of Kotlin’s (still experimental) coroutines to enable concurrent execution of lambdas passed to flatMap (and potentially any other functional operation including map and filter):

// Run flattening mapper in parallel via coroutines
// Akin to Java's Collection.parallelStream() followed by flatMap()
fun <A, B> Iterable<A>.pFlatMap(mapper: suspend (A) -> Iterable<B>): Iterable<B> = runBlocking {
    map { async(CommonPool) { mapper(it) } }.flatMap { it.await() }
}

Note the heading Iterable<A> type prepended to pFlatMap. This is an instance of extension functions described below.

Here, each edit (represented here by the mapper function) is run on a separate thread and the overall pFlatMap() function waits for all of them to complete before returning itself.

Given this function we can now parallelize edit execution as in:

return Edits.ALL_EDITS
        .pFlatMap { it(wordSplits) }
        .pack(dictionary)

What is this trailing pack() function at the end? It doesn’t appear on Kotlin’s Iterable documentation…

As it happens, edit1 is not yet done when all edits have been concatenated.

Once edit1 generates all possible edits we need to screen them for duplicates as well as for actual appearance in the dictionary. We also need to order the resulting valid words so the most frequently used ones are shown first. This is all achieved by function pack() as follows:

fun Iterable<String>.pack(dictionary: Map<String, Int>): Iterable<String> =
        this
            .distinct()
            .filter { dictionary.containsKey(it) }
            .map { Pair(it, dictionary[it]!!) }
            .sortedBy { -it.second }
            .map { it.first }

Note here, again, the pack() function is preprended by Iterable<String>. What does this mean?

Kotlin provides a very powerful construct called extension functions. Extension functions enable developers to augment the repertoire of functions exposed by any type as if they had been declared in their original implementation.

This is why we can append a trailing pack(dictionary) to the collection transformation embodied in edit1 (and edit2 below). Note the body of pack() above uses this to refer to its target augmented type.

What about those two trailing bangs in the map line?. Kotlin uses the bracket operator to retrieve a value from a dictionary given its key. Thus, above, dictionary[it] is equivalent to dictionary.get(it). When the given key is not contained in the dictionary the get operation returns null. Thus the expression dictionary[it] is nullable. This limits what we can do with its value; for instance, we cannot invert the sign of the integer rank value directly by writing -it.second. We’re sure, however, that dictionary[it] will never return null because the preceding filter statement guarantees the dictionary does contain the word in question. For such cases, Kotlin provides the !! suffix operator to indicate “by the way, make the preceding expression not nullable”. Thanks to this refinement we can safely write -it.second or -1 * it.second because the type of it.second is no longer Int? (nullable) but plain Int (non-nullable).f

fun edits1(typo: String): Iterable<String> {
    // Collect typo's word splits
    val wordSplits = Edits.wordSplits(typo)
    // Exercise all four edits in parallel and coalesce (*pack*)
    // the results
    return Edits.ALL_EDITS
            .pFlatMap { it(wordSplits) }
            .pack(dictionary)
}

Coincidentally, pFlatMap is also an extension function used to type-safely augment the list of functions exposed by Iterable.

edit2: When edit1 Falters

Despite generating lots of possible suggestions, edit1 can fail to yield a dictionary word In this case we assume our typo stems not from one but two transcription mistakes.

edit2 is quite simply defined in terms of edit1 as follows:

fun edits2(typo: String): Iterable<String> =
    // Try to find dictionary words in the (failed) result
    // of `edit1` by applying, in parallel, all four edits
    // to each edit and coalescing results (with *pack*)
    edits1(typo)
            .pFlatMap { edits1(it) }
            .pack(dictionary)

Here we see our friend pack() in action once again to ensure word uniqueness, validity and order.

Again, we use flatMap (rather than map) because we generate candidate suggestions from the nested edit1 results which are themselves of type Iterable<String>.

Putting it All Together

We’re almost there now. Our workhorse method getCorrections() method looks as follows:

fun getCorrections(word: String): Iterable<String>? {

    val normalizedWord = normalize(word)

    // Dictionary words generate no suggestions;
    // this is expressed as a `null` word set
    if (dictionary.containsKey(normalizedWord)) {
        return null
    }

    // Attempt to reconstitute one or more dictionary words
    // from the typo assuming it comes from a single error
    val corrections1 = edits1(normalizedWord)
    return if (corrections1.iterator().hasNext()) corrections1
    else {
        // Failing the above, attempt to reconstitute one or
        // more dictionary words from the typo assuming it
        // comes from two errors. Return `emptyList()` to
        // indicate the word *is* a typo but no suggestions
        // were found for it
        val corrections2 = edits2(normalizedWord)
        if (corrections2.iterator().hasNext()) corrections2 else emptyList()
    }
}

Note the return type is Iterable<String>?. What on Earth is that trailing ? sign?

Unlike other JVM languages like Java itself, Scala or Xtend, Kotlin does not try to “fix” null (Hoare’s billion-dollar mistake).

Rather, Kotlin embraces the fact that null is a non-replaceable part of the JVM definition and provides what it calls null safety: a controlled way to make use of null to indicate the absence of value that is simpler and more intuitive than Java’s (and Guava’s) Optional<> or Scala’s Option[].

Rules are very simple: by default you cannot return null with impunity. If you do intend to return null as an indication of value absence you must explicitly annotate your function’s signature with a trailing question mark sign (Iterable<String>? above).

In our case, we need to express the absence of suggestions not because the typo doesn’t match anything known but, on the contrary, because it was handed a valid dictionary word for which the notion of suggestion simply doesn’t make sense. In this scenario we return null.

It could be argued that returning emptyList() would be appropriate to indicate such inapplicability.

In reality, however, it is perfectly possible for a word absent from the dictionary (.i.e, a typo) not to resemble any known word. In this case we still need to make it explicit we’re dealing with a typo. Thus, emptyList() is not semantically equivalent to null.

We there yet?

Well, almost ;-)

One last interesting bit is Kotlin’s companion object concept where an object implicitly named as its containing class “accompanies” the class in a similar fashion to Java’s static fields and methods. In Kotlin, by the way, there is no concept of static class members. If you need to achieve static functionality then the class’s companion object is the way to go as both the class and the companion object’s internals are visible to each other.

For our SpellingCorrector class its companion object looks like:

class SpellingCorrector(private val dictionary: Map<String, Int>) {

    companion object {
    
        private val alphabetic = "^[a-z]+$".toRegex()
    
        fun normalize(word: String): String {
            val normalizedWord = word.trim().toLowerCase()
            return if (alphabetic.containsMatchIn(normalizedWord)) normalizedWord
            else throw IllegalArgumentException("Non-alpha word: $normalizedWord")
        }
    
        fun Iterable<String>.pack(dictionary: Map<String, Int>): Iterable<String> =
                this
                        .distinct()
                        .filter { dictionary.containsKey(it) }
                        .map { Pair(it, dictionary[it]) }
                        .sortedBy { it.second }
                        .map { it.first }
    
        // Run flattening mapper in parallel via coroutines.
        // Akin to Java's Collection.parallelStream() followed by flatMap()
        fun <A, B> Iterable<A>.pFlatMap(mapper: suspend (A) -> Iterable<B>): Iterable<B> = runBlocking {
            map { async(CommonPool) { mapper(it) } }.flatMap { it.await() }
        }
    }
    
    // Class stuff comes here...
}

Final words about Kotlin

The SpellingCorrector and Edits Kotlin implementations add up to 200 lines including comments and empty lines. Their neutron star-dense incarnation comes down to only 75 lines!

Compare that to Norvig’s feat of packing the whole thing in 44 lines of (commented) Python code and the horror of Java taking up over 300 lines of code (120 in neutron star form).

Java’s verbosity is precisely one of the most important reason alternative JVM languages have emerged. In the next 2 posts we’ll see how the Scala and Xtend fare in terms of economy of expression and clarity of intent. So far Kotlin is an uncontested winner over Java.


Next in our plate: Implementing Norvig’s Algo in Scala

comments powered by Disqus