# Japanese and Korean voice search


This paper was cited in the paper on Google’s Multilingual Neural Machine Translation System, which credits it as the original paper to use the word-piece model. That model is the focus of my notes here.

Here’s the WordPieceModel algorithm:

```
func WordPieceModel(D, chars, n, threshold) -> inventory:
    # D: training data
    # n: user-specified number of word units (often 200k)
    # chars: unicode characters used in the language
    #        (e.g. Kanji, Hiragana, Katakana, ASCII for Japanese)
    # threshold: stopping criterion for the likelihood increase
    # inventory: the set of word units created by the model

    inventory := chars
    likelihood := +INF
    while len(inventory) < n && likelihood >= threshold:
        lm := LM(inventory, D)
        inventory += argmax_{combined word unit}(lm.likelihood_{inventory + combined word unit}(D))
        likelihood = lm.likelihood_{inventory}(D)
    return inventory
```
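The greedy loop can be sketched concretely in Python. This is a minimal sketch, not the paper's implementation: I substitute a maximum-likelihood unigram model for the LM, re-segment the whole corpus after every merge, and evaluate every candidate pair exhaustively (i.e. without the optimizations discussed next). The function and helper names are my own.

```python
import math
from collections import Counter

def _log_likelihood(segments):
    # Corpus log-likelihood under a maximum-likelihood unigram model
    # over the current segmentation (stand-in for the paper's LM).
    counts = Counter(tok for seg in segments for tok in seg)
    total = sum(counts.values())
    return sum(c * math.log(c / total) for c in counts.values())

def _merge(segments, pair):
    # Replace every adjacent occurrence of `pair` with the combined unit.
    merged_tok = pair[0] + pair[1]
    out = []
    for seg in segments:
        new_seg, i = [], 0
        while i < len(seg):
            if i + 1 < len(seg) and (seg[i], seg[i + 1]) == pair:
                new_seg.append(merged_tok)
                i += 2
            else:
                new_seg.append(seg[i])
                i += 1
        out.append(new_seg)
    return out

def word_piece_model(corpus, chars, n, threshold):
    # Start from single characters, then greedily add the combined
    # word unit that most increases corpus likelihood.
    segments = [list(word) for word in corpus]
    inventory = set(chars)
    likelihood = _log_likelihood(segments)
    while len(inventory) < n:
        # Only pairs that actually occur in the data are candidates.
        pairs = Counter()
        for seg in segments:
            for a, b in zip(seg, seg[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best_pair, best_ll = None, None
        for pair in pairs:
            ll = _log_likelihood(_merge(segments, pair))
            if best_ll is None or ll > best_ll:
                best_pair, best_ll = pair, ll
        if best_ll - likelihood < threshold:
            break  # likelihood gain too small: stop
        segments = _merge(segments, best_pair)
        inventory.add(best_pair[0] + best_pair[1])
        likelihood = best_ll
    return inventory
```

Note the stopping rule here treats `threshold` as a minimum likelihood *gain* per merge, which is how I read the paper's stopping criterion.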


The algorithm can be optimized by

- testing only word pairs that exist in the training data
- testing only pairs with a significant chance of being the best
- combining several clustering steps into a single iteration (possible for groups of pairs that don’t affect each other)
- modifying the LM counts only for the affected entries
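The first optimization falls out naturally from counting adjacent pairs in the data, but the third needs a notion of pairs that “don’t affect each other.” A minimal sketch, assuming (my approximation, not the paper's definition) that pairs are independent when they share no units, so their merges cannot change each other's counts:

```python
def select_independent_pairs(ranked_pairs, k):
    """Greedily pick up to k pairs, best-first, whose units are
    pairwise disjoint, so their merges can be applied in one pass."""
    chosen, used = [], set()
    for pair in ranked_pairs:
        if used.isdisjoint(pair):
            chosen.append(pair)
            used.update(pair)
        if len(chosen) == k:
            break
    return chosen
```

Each batch of merges then costs one pass over the corpus instead of one pass per merge, which matters when building a 200k-unit inventory.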

After these optimizations, building a 200k word piece inventory can take a few hours on a single machine.

They also do something important to make sure the ASR output text has spaces formatted reasonably. It’s best explained in the following image from the paper:
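The image isn’t reproduced here, but the underlying trick (as also described in the GNMT paper's word-piece section) is to attach an explicit word-boundary marker to pieces before training, so that spaces can be restored deterministically from the decoder's piece sequence. A rough sketch, using `"_"` as the marker:

```python
def mark_word_starts(words):
    # Prefix each word with the boundary marker before segmentation,
    # so every piece that starts a word carries the marker.
    return ["_" + w for w in words]

def restore_spaces(pieces):
    # Re-join word pieces: a piece starting with "_" begins a new word,
    # so markers become spaces and the leading space is trimmed.
    return "".join(pieces).replace("_", " ").strip()
```

So a segmentation like `["_J", "et", "_makers", "_fe", "ud"]` round-trips back to `"Jet makers feud"` without any ambiguity about where spaces go.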