Spelling Suggestion on the JVM #3: Implementing Norvig's Algo in Kotlin
- Norvig’s Approach to Spelling Correction
- Implementing Norvig’s Algo in Java
- Implementing Norvig’s Algo in Kotlin (this post)
- Implementing Norvig’s Algo in Scala
- Implementing Norvig’s Algo in Xtend
Functional, tested code in the 4 languages is available on the project’s Spellbound repository.
The Dictionary
Everything having to do with spelling correction revolves around a curated list of valid words we refer to as the the dictionary.
In our dictionary implementation we associate every word with an integer rank indicating how
frequently the word is used in relation to all other words. The higher the rank, the more
frequently used the word is. Thus, for instance, the has rank 106295
while triose has rank
1
.
The dictionary declaration is:
class SpellingCorrector(private val dictionary: Map<String, Int>) {
// stuff...
}
Here, we make use of Kotlin’s primary constructor where a class instance is created
from one or more values. Kotlin encourages using immutable values; this is reflected
above by declaring the dictionary
to be an immutable val
.
In addition to declaring the dictionary
field, this primary constructor also makes it
inaccessible from the outside world by means of private
. Unlike Java, Kotlin’s general
default visibility modifier is public
; this is very convenient since we strive to make
all values immutable.
Transforming the typo into a set of word splits
In addition to good ole’ classes (like the above SpellingCorrector
) Kotlin also supports the
notion of object
: a singleton-like instance that may or may not be associated with a
corresponding class or interface. In other words, objects may implement interfaces or extend
classes as needed but they can also stand on their own.
Importantly, objects can declare their own inner classes, values, variables, functions, etc.
In our case, we create an object called Edits
that contains:
- A data class
WordSplit
for storing left/right word splits - A function to generate word splits given a (typo) word
- A list of lambdas transforming a set of word splits into a set of candidate words
On the surface our Edits
Kotlin object looks like:
object Edits {
// Data class for word splits
data class WordSplit(val left: String, val right: String)
// Build a collection of word splits from a given word
fun wordSplits(word: String): Iterable<WordSplit> =
IntRange(0, word.length).map {
WordSplit(word.substring(0, it), word.substring(it))
}
// Store Norvig's four edit-generating functions as lambdas
// Here, like in Java, we treat code as data
val ALL_EDITS: List<(Iterable<WordSplit>) -> Iterable<String>> = listOf(
fun(wordSplits) = // delete logic...,
fun(wordSplits) = // insert logic...,
fun(wordSplits) = // transpose logic...,
fun(wordSplits) = // replace logic...
)
}
Word Splits
In order to hold word splits we define an inner data class WordSplit as:
// Data class for word splits
data class WordSplit(val left: String, val right: String)
Kotlin data classes don’t require getters/setters for their properties (which must all
be declared in the data class’s primary constructor as immutable values). A data class also
implements sensible equals
, hashCode
and toString
implementations for free!
Armed with this definition we can now formulate the rather trivial implementation of wordSplits()
:
// Build a collection of word splits from a given word
fun wordSplits(word: String): Iterable<WordSplit> =
// No need to use "return": this is an expression, not a statement
IntRange(0, word.length).map {
// Look ma: no "new" needed to create WordSplit instance!
WordSplit(word.substring(0, it), word.substring(it))
}
Here we have an instance of an expression function using no curly braces to surround the
function body. Instead, we use the =
sign to separate the function’s declaration from its
one-expression body. This body creates a collection from the range 0
to word.length
(inclusive). Note, en passant, that we don’t need to use empty parens after length
in the
expression word.length
; in Kotlin, it is a property rather than a function.
The range is, in effect, a collection of integers that can be mapped over to apply a lambda. In Kotlin, lambdas must be surrounded by curly braces (unlike Java, parenthesis just don’t cut it).
One-argument lambdas can use the implicitly defined value it
to refer to their sole
argument. Thus, above we use it
to refer to each subsequent value in the range. In this
usage it
has type Int
and is used as an index to the substring
function.
The ALL_EDITS
list
Even though we could use vanilla functions for each of Norvig’s four edits we purposefully choose to implement them as elements of a list of anonymous functions:
// Store Norvig's four edit-generating functions as lambdas
// Here, like in Java, we treat code as data
val ALL_EDITS: List<(Iterable<WordSplit>) -> Iterable<String>> = listOf(
fun(wordSplits) = // delete logic...,
fun(wordSplits) = // insert logic...,
fun(wordSplits) = // transpose logic...,
fun(wordSplits) = // replace logic...
)
Note the list’s generic type: (Iterable<WordSplit>) -> Iterable<String>
which can be
read roughly as: given a sequence of WordSplit
s generate a sequence of String
s.
By storing functions as data in a list we can apply them in a generic fashion to any given collection of word splits:
Edits.ALL_EDITS.flatMap { it(wordSplits) }
Here, we actually invoke each edit function by passing the wordSplits
argument to
it
, cool!
The split()
method takes a string (the typo) and returns all possible pairs of word splits.
Thus, for our beloved but misspelt hero dlbert we get the equivalent of:
listOf(
WordSplit("", "dlbert"),
WordSplit("dl", "bert"),
WordSplit("dlb", "ert"),
WordSplit("dlbe", "rt"),
WordSplit("dlber", "t"),
WordSplit("dlbert", "")
)
By the way, all four edit implementations take an Iterable<WordSplit>
argument while
returning an Iterable<String>
.
IntRange(startInclusive: Int, endInclusive: Int)
returns a subtype of Iterable<Int>
.
In our example above this means all integers between 0
and 6
.`
Subsequently, each successive integer value is used as a substring index for splitting the word
into a pair of left
and right
fragments.
Edits #1: deletes
Generating all possible deletes for a word is quite simple:
// deletes
fun(wordSplits) = wordSplits
.filterNot { it.right.isEmpty() }
.map { it.left + it.right.substring(1) }
Here, we traverse an input Iterable<WordPair>
selecting only the pairs where right
is not
empty (which is to say to all splits but the very first one). Next, we concatenate each word
split into a string formed by the left
part followed by the “tail” of the right
one.
Given the typo waly this deletes()
anonymous implementation yields the equivalent of:
listOf("aly", "wly", "way", "wal")
Edits #2: inserts
Inserts are the opposite of deletes and while their implementation is correspondingly simple:
// inserts
fun(wordSplits) = wordSplits
.flatMap { wordSplit ->
LETTERS.map { wordSplit.left + it + wordSplit.right }
},
This inserts()
edit uses a LETTERS
value of type Iterable<String>
. Again for the sake of
simplicity we deal only with lowercase ASCII characters. There is no support for diacritics
(whose implementation is not so complicated anyways). LETTERS
is defined in the Edits
object as:
private val LETTERS = CharRange('a', 'z')
CharRange
is a subtype ofIterable<String>
.
For our canine friend dgbert the above anonymous implementation of inserts()
yields the equivalent of:
listOf(
"adgbert", "bdgbert", "cdgbert", "ddgbert", "edgbert", "fdgbert",
"gdgbert", "hdgbert", "idgbert", "jdgbert", "kdgbert", "ldgbert",
"mdgbert", "ndgbert", "odgbert", "pdgbert", "qdgbert", "rdgbert",
"dgbertm", "dgbertn", "dgberto", "dgbertp", "dgbertq", "dgbertr",
// lots of inserts ellided...
"dgberts", "dgbertt", "dgbertu", "dgbertv", "dgbertw", "dgbertx",
"dgberty", "dgbertz")
Edits #3: transposes
Implementing transposition requires the right
fragment to have more than one character
in length, which makes sense as transposing involves not one but two letters:
// transposes
fun(wordSplits) = wordSplits
.filter { it.right.length > 1 }
.map {
it.left +
it.right.substring(1, 2) +
it.right.substring(0, 1) +
it.right.substring(2)
}
For our heroine alice the above anonymous implementation of transposes()
yields the equivalent of:
listOf("laice", "ailce", "alcie", "aliec")
Edits #4: replaces
The replace
edit is very prolific requiring the substitution of every letter
in the typo for each letter in the alphabet. It also requires the right segment
not to be empty (as something needs to be replaced). Because we’re nesting the
processing of each letter inside the processing of every typo character we need
to apply flatMap
rathen than map
:
// replaces
fun(wordSplits) = wordSplits
.filterNot { it.right.isEmpty() }
.flatMap { wordSplit ->
LETTERS.map { wordSplit.left + it + wordSplit.right.substring(1) }
}
For our little friend asok the above replaces()
implementation yields the equivalent of:
listOf(
"asok", "bsok", "csok", "dsok",
"esok", "fsok", "gsok", "hsok",
// quite a few replaces ellided...
"asos", "asot", "asou", "asov",
"asow", "asox", "asoy", "asoz"
)
edit1
: Applying All Edits Hoping for Valid Words
To cover all our bases we need to run and then concatenate all four edits. For this we define, in
the Edits
object, an Iterable<String>
as follows:
val ALL_EDITS: List<(Iterable<WordSplit>) -> Iterable<String>> = listOf(
// deletes
fun(wordSplits) = wordSplits
.filterNot { it.right.isEmpty() }
.map { it.left + it.right.substring(1) },
// inserts
fun(wordSplits) = wordSplits
.flatMap { wordSplit ->
LETTERS.map { wordSplit.left + it + wordSplit.right }
},
// transposes
fun(wordSplits) = wordSplits
.filter { it.right.length > 1 }
.map {
it.left +
it.right.substring(1, 2) +
it.right.substring(0, 1) +
it.right.substring(2)
},
// replaces
fun(wordSplits) = wordSplits
.filterNot { it.right.isEmpty() }
.flatMap { wordSplit ->
LETTERS.map { wordSplit.left + it + wordSplit.right.substring(1) }
}
)
Note the type of ALL_EDITS
: it’s a List<...>
rather than an Iterable<...>
. This is
done so as to enable us to reference each edit by index on ALL_EDITS
:
ALL_EDITS[0](wordSplits) // deletes(wordSplits)
ALL_EDITS[1](wordSplits) // inserts(wordSplits)
ALL_EDITS[2](wordSplits) // transposes(wordSplits)
ALL_EDITS[3](wordSplits) // replace(wordSplits)
The grouping of edits in a List<(Iterable<WordSplit>) -> Iterable<String>>
is an instance
of a common functional pattern where code is treated as data. Here, we have a list of
functions mapping from Iterable<WordSplit>
to Iterable<String>
(which is, precisely,
the type of transformation all of our four edits implement). In this context Kotlin anonymous
functions shine as they, effectively, provide us with a data-centric view of executable code.
We make use of this list of functions to implement edit1
as follows:
fun edits1(typo: String): Iterable<String> {
// Collect typo's word splits
val wordSplits = Edits.wordSplits(typo)
// Exercise all four edits in parallel and coalesce (*pack*)
// the results
return Edits.ALL_EDITS
.flatMap { it(wordSplits) }
.pack(dictionary)
}
Because each edit returns a nested
Iterable<String>
we need to useflatMap
instead ofmap
.
Because the four edits are independent of one another we can (and, indeed, should) parallelize their execution.
On the bright side, Kotlin doesn’t requires us to “take the stream()
” of a collection as Java
does. Functional functions such as filter
, map
and flatMap
can be invoked directly on all
subtypes of Iterable
.
On the not-so-bright side, Kotlin doesn’t directly provide the equivalent of Java’s
Stream.parallelStream()
method.
We can, however, take advantage of Kotlin’s (still experimental)
coroutines
to enable concurrent execution of lambdas passed to flatMap
(and potentially any other
functional operation including map
and filter
):
// Run flattening mapper in parallel via coroutines
// Akin to Java's Collection.parallelStream() followed by flatMap()
fun <A, B> Iterable<A>.pFlatMap(mapper: suspend (A) -> Iterable<B>): Iterable<B> = runBlocking {
map { async(CommonPool) { mapper(it) } }.flatMap { it.await() }
}
Note the heading
Iterable<A>
type prepended topFlatMap
. This is an instance of extension functions described below.
Here, each edit (represented here by the mapper
function) is run on a separate thread and the
overall pFlatMap()
function waits for all of them to complete before returning itself.
Given this function we can now parallelize edit execution as in:
return Edits.ALL_EDITS
.pFlatMap { it(wordSplits) }
.pack(dictionary)
What is this trailing pack()
function at the end? It doesn’t appear on Kotlin’s
Iterable
documentation…
As it happens, edit1
is not yet done when all edits have been concatenated.
Once edit1
generates all possible edits we need to screen them for duplicates as well as for
actual appearance in the dictionary. We also need to order the resulting valid words so the most
frequently used ones are shown first. This is all achieved by function pack()
as follows:
fun Iterable<String>.pack(dictionary: Map<String, Int>): Iterable<String> =
this
.distinct()
.filter { dictionary.containsKey(it) }
.map { Pair(it, dictionary[it]!!) }
.sortedBy { -it.second }
.map { it.first }
Note here, again, the pack()
function is preprended by Iterable<String>.
What does this mean?
Kotlin provides a very powerful construct called extension functions. Extension functions enable developers to augment the repertoire of functions exposed by any type as if they had been declared in their original implementation.
This is why we can append a trailing pack(dictionary)
to the collection transformation embodied
in edit1
(and edit2
below). Note the body of pack()
above uses this
to refer to its target
augmented type.
What about those two trailing bangs in the map
line?. Kotlin uses the bracket operator to retrieve
a value from a dictionary given its key. Thus, above, dictionary[it]
is equivalent to
dictionary.get(it)
. When the given key
is not contained in the dictionary the get
operation
returns null
. Thus the expression dictionary[it]
is nullable. This limits what we can do with
its value; for instance, we cannot invert the sign of the integer rank value directly by writing
-it.second
. We’re sure, however, that dictionary[it]
will never return null because the preceding
filter
statement guarantees the dictionary does contain the word in question. For such cases, Kotlin
provides the !!
suffix operator to indicate “by the way, make the preceding expression not nullable”.
Thanks to this refinement we can safely write -it.second
or -1 * it.second
because the type of
it.second
is no longer Int?
(nullable) but plain Int
(non-nullable).f
fun edits1(typo: String): Iterable<String> {
// Collect typo's word splits
val wordSplits = Edits.wordSplits(typo)
// Exercise all four edits in parallel and coalesce (*pack*)
// the results
return Edits.ALL_EDITS
.pFlatMap { it(wordSplits) }
.pack(dictionary)
}
Coincidentally, pFlatMap
is also an extension function used to type-safely augment the list of
functions exposed by Iterable
.
edit2
: When edit1
Falters
Despite generating lots of possible suggestions, edit1
can fail to yield a dictionary word
In this case we assume our typo stems not from one but two transcription mistakes.
edit2
is quite simply defined in terms of edit1
as follows:
fun edits2(typo: String): Iterable<String> =
// Try to find dictionary words in the (failed) result
// of `edit1` by applying, in parallel, all four edits
// to each edit and coalescing results (with *pack*)
edits1(typo)
.pFlatMap { edits1(it) }
.pack(dictionary)
Here we see our friend pack()
in action once again to ensure word uniqueness, validity and
order.
Again, we use flatMap
(rather than map
) because we generate candidate suggestions from
the nested edit1
results which are themselves of type Iterable<String>
.
Putting it All Together
We’re almost there now. Our workhorse method getCorrections()
method looks as follows:
fun getCorrections(word: String): Iterable<String>? {
val normalizedWord = normalize(word)
// Dictionary words generate no suggestions;
// this is expressed as a `null` word set
if (dictionary.containsKey(normalizedWord)) {
return null
}
// Attempt to reconstitute one or more dictionary words
// from the typo assuming it comes from a single error
val corrections1 = edits1(normalizedWord)
return if (corrections1.iterator().hasNext()) corrections1
else {
// Failing the above, attempt to reconstitute one or
// more dictionary words from the typo assuming it
// comes from two errors. Return `emptyList()` to
// indicate the word *is* a typo but no suggestions
// were found for it
val corrections2 = edits2(normalizedWord)
if (corrections2.iterator().hasNext()) corrections2 else emptyList()
}
}
Note the return type is Iterable<String>?
. What on Earth is that trailing ?
sign?
Unlike other JVM languages like Java itself, Scala or Xtend, Kotlin does not try to
“fix” null
(Hoare’s billion-dollar mistake).
Rather, Kotlin embraces the fact that null
is a non-replaceable part of the JVM
definition and provides what it calls null
safety:
a controlled way to make use of null
to indicate the absence of value that is
simpler and more intuitive than Java’s (and Guava’s) Optional<>
or Scala’s Option[]
.
Rules are very simple: by default you cannot return null with impunity. If you do
intend to return null
as an indication of value absence you must explicitly annotate
your function’s signature with a trailing question mark sign (Iterable<String>?
above).
In our case, we need to express the absence of suggestions not because the typo doesn’t
match anything known but, on the contrary, because it was handed a valid dictionary word
for which the notion of suggestion simply doesn’t make sense. In this scenario we return null
.
It could be argued that returning emptyList()
would be appropriate to indicate such
inapplicability.
In reality, however, it is perfectly possible for a word absent from the dictionary (.i.e, a typo)
not to resemble any known word. In this case we still need to make it explicit we’re dealing with
a typo. Thus, emptyList()
is not semantically equivalent to null
.
We there yet?
Well, almost ;-)
One last interesting bit is Kotlin’s companion object concept where an object implicitly
named as its containing class “accompanies” the class in a similar fashion to Java’s static
fields and methods. In Kotlin, by the way, there is no concept of static
class members.
If you need to achieve static
functionality then the class’s companion object is the way
to go as both the class and the companion object’s internals are visible to each other.
For our SpellingCorrector
class its companion object looks like:
class SpellingCorrector(private val dictionary: Map<String, Int>) {
companion object {
private val alphabetic = "^[a-z]+$".toRegex()
fun normalize(word: String): String {
val normalizedWord = word.trim().toLowerCase()
return if (alphabetic.containsMatchIn(normalizedWord)) normalizedWord
else throw IllegalArgumentException("Non-alpha word: $normalizedWord")
}
fun Iterable<String>.pack(dictionary: Map<String, Int>): Iterable<String> =
this
.distinct()
.filter { dictionary.containsKey(it) }
.map { Pair(it, dictionary[it]) }
.sortedBy { it.second }
.map { it.first }
// Run flattening mapper in parallel via coroutines.
// Akin to Java's Collection.parallelStream() followed by flatMap()
fun <A, B> Iterable<A>.pFlatMap(mapper: suspend (A) -> Iterable<B>): Iterable<B> = runBlocking {
map { async(CommonPool) { mapper(it) } }.flatMap { it.await() }
}
}
// Class stuff comes here...
}
Final words about Kotlin
The
SpellingCorrector
and
Edits
Kotlin implementations add up to 200 lines including comments and empty lines. Their neutron
star-dense incarnation comes down to only 75 lines!
Compare that to Norvig’s feat of packing the whole thing in 44 lines of (commented) Python code and the horror of Java taking up over 300 lines of code (120 in neutron star form).
Java’s verbosity is precisely one of the most important reason alternative JVM languages have emerged. In the next 2 posts we’ll see how the Scala and Xtend fare in terms of economy of expression and clarity of intent. So far Kotlin is an uncontested winner over Java.
Next in our plate: Implementing Norvig’s Algo in Scala