papers
These are my notes from research papers I read. Each page’s title is also a link to the abstract or PDF.
This was Manohar’s PhD dissertation at JHU.
Chapter 2 provides a relatively clear overview of how chain and non-chain models work in Kaldi.
In chapter 3 he tried using negative conditional entropy as the loss function for the unsupervised data, and it helped a bit.
In chapter 4, Manohar uses [CTC loss](/paper/ctc/).
In chapter 5, he discusses ways to do semi-supervised model training. It’s nice when you have parallel data in different domains, because then you can do a student-teacher model. When there’s no parallel data, the best you can do is decode the unsupervised data with the seed model and use that to train the LF-MMI model (see section 5.2.1).
RNNs generally require pre-segmented training data, but CTC avoids that need.
Basically, you have the RNN output probabilities for each label (or a blank) for every frame, and then you find the most likely path across that lattice of probabilities.
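The path-picking step can be sketched as a greedy best-path decode. Hedge: the paper's actual decoding uses prefix search over the lattice, and `ctc_best_path` is my name for this simplified version, not the paper's:

```python
# Greedy (best-path) CTC decoding sketch: take the argmax label per frame,
# collapse consecutive repeats, then drop blanks. This approximates the
# most-likely-path idea; the paper itself uses prefix search for decoding.
def ctc_best_path(frame_probs, blank=0):
    # frame_probs: per-frame probability lists over labels; index 0 is blank
    path = [max(range(len(p)), key=p.__getitem__) for p in frame_probs]
    decoded, prev = [], None
    for label in path:
        if label != prev and label != blank:
            decoded.append(label)
        prev = label
    return decoded

probs = [
    [0.1, 0.8, 0.1],    # frame 1 -> label 1
    [0.1, 0.7, 0.2],    # frame 2 -> label 1 again (collapsed)
    [0.9, 0.05, 0.05],  # frame 3 -> blank (dropped)
    [0.2, 0.1, 0.7],    # frame 4 -> label 2
]
print(ctc_best_path(probs))  # [1, 2]
```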
The section explaining the loss function was kind of complicated. They use a forward-backward algorithm (in the same family as Viterbi) to compute, for each symbol at each time step, the total probability of all paths corresponding to the target labeling that pass through that symbol; differentiating this quantity gives the derivatives with respect to the network outputs, and from there backpropagation proceeds as normal.
This model was superseded by this one.
They did some careful things with residual connections to keep the model highly parallelizable, putting each LSTM layer on a separate GPU. They also quantized the models: training uses full floating-point computation with a couple of restrictions, and the trained models are then converted to quantized versions.
They use the word-piece model from “Japanese and Korean Voice Search”, with 32,000 word pieces. (This is a lot less than the 200,000 used in that paper.) They state in the paper that the shared word-piece model is very similar to Byte-Pair-Encoding, which was used for NMT in this paper by researchers at U of Edinburgh.
The model and training process are exactly as in Google’s earlier paper. It takes 3 weeks on 100 GPUs to train, even after increasing batch size and learning rate.
This was mentioned in the paper on Google’s Multilingual Neural Machine Translation System. It’s regarded as the original paper to use the word-piece model, which is the focus of my notes here.
the WordPieceModel

Here’s the WordPieceModel algorithm:
```
func WordPieceModel(D, chars, n, threshold) -> inventory:
    # D: training data
    # n: user-specified number of word units (often 200k)
    # chars: unicode characters used in the language
    #        (e.g. Kanji, Hiragana, Katakana, ASCII for Japanese)
    # threshold: stopping criterion for likelihood increase
    # inventory: the set of word units created by the model
    inventory := chars
    likelihood := +INF
    while len(inventory) < n && likelihood >= threshold:
        lm := LM(inventory, D)
        inventory += argmax_{combined word unit}(lm.likelihood_{inventory + combined word unit}(D))
        likelihood = lm.likelihood_{inventory}(D)
    return inventory
```

The algorithm can be optimized by
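The greedy loop can be sketched in runnable form. Loud assumption: instead of training a language model and picking the merge that maximizes likelihood, this substitutes the most frequent adjacent pair (the BPE criterion the paper calls "very similar"); `word_piece_sketch` is a hypothetical name, not from the paper:

```python
from collections import Counter

def word_piece_sketch(corpus, n):
    """Toy word-piece inventory builder. corpus: list of words;
    n: target inventory size. Merge criterion is pair frequency,
    standing in for the paper's LM-likelihood gain."""
    words = [list(w) for w in corpus]            # start from characters
    inventory = {c for w in words for c in w}
    while len(inventory) < n:
        pairs = Counter()
        for w in words:
            pairs.update(zip(w, w[1:]))          # count adjacent unit pairs
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]      # best merge candidate
        inventory.add(a + b)
        for w in words:                          # re-segment the data
            i = 0
            while i < len(w) - 1:
                if w[i] == a and w[i + 1] == b:
                    w[i:i + 2] = [a + b]
                else:
                    i += 1
    return inventory

inventory = word_piece_sketch(["low", "lower", "lowest"] * 5, n=8)
```

The loop stops once the inventory reaches `n` units, mirroring the `len(inventory) < n` condition above; the likelihood-threshold stopping criterion has no analogue here since there is no LM.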