T0 builds on T5 by fine-tuning on more natural prompts and testing the model’s generalization to held-out tasks.
Compare the training format diagrams for T5 (top) and T0 (bottom):
Intuitively, the T0 prompts are phrased in natural language, so they are more likely to resemble the implicit and explicit prompting already present in the pretraining data. The authors also wrote several differently worded prompts for each dataset.
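To make the contrast concrete, here is a minimal sketch of the two input styles for one natural language inference example. The example sentence pair, the field names, and the wording of the three templates are assumptions made up for illustration; they are not the authors' actual prompts.

```python
# Hypothetical NLI example; the field names are assumptions for this sketch.
example = {
    "premise": "A dog is running through a field.",
    "hypothesis": "An animal is outdoors.",
}


def t5_style(ex):
    """Terse T5-style input: a task prefix followed by labeled fields."""
    return f"mnli hypothesis: {ex['hypothesis']} premise: {ex['premise']}"


# Several natural-language templates for the same dataset.
# The wording is illustrative only, not copied from the paper.
T0_STYLE_TEMPLATES = [
    'Suppose "{premise}" Can we infer that "{hypothesis}"? Yes, no, or maybe?',
    '{premise}\nQuestion: Does this imply that "{hypothesis}"? Yes, no, or maybe?',
    'Given that {premise} Therefore, it must be true that "{hypothesis}"? Yes, no, or maybe?',
]


def t0_style(ex):
    """Render the same example once per template."""
    return [template.format(**ex) for template in T0_STYLE_TEMPLATES]


if __name__ == "__main__":
    print(t5_style(example))
    for prompt in t0_style(example):
        print("---")
        print(prompt)
```

Each training example can be rendered through any of its dataset's templates, so the model sees the same underlying task phrased in several ways.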
The authors summarize their experiments as follows: “Our experiments study two questions. First, does multitask prompted training improve generalization to held-out tasks? Second, does training on a wider range of prompts improve robustness to prompt wording? For the first question, we find that multitask training enables zero-shot task generalization by showing that our model matches or exceeds the performance of GPT-3 on 9 out of 11 held-out datasets, despite being about 16x smaller. We also show that the model improves over a large baseline language model on 13 out of 14 tasks in the BIG-bench benchmark. For the second question, we find that training on more prompts per dataset consistently improves the median and decreases the variability of performance on held-out tasks. Training on prompts from a wider range of datasets also generally improves the median but does not consistently decrease the variability.”
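The robustness claim is about the spread of scores you get when the same held-out task is evaluated under different prompt wordings. A rough sketch of that summary is below. It assumes variability is measured as the interquartile range, and the scores are made-up placeholders, not results from the paper.

```python
import statistics


def summarize(scores):
    """Return (median, interquartile range) of per-prompt scores."""
    # quantiles(n=4) returns the three quartile cut points.
    q1, _, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    return statistics.median(scores), q3 - q1


# Placeholder accuracies for one held-out task, one score per evaluation prompt.
# These numbers are invented for illustration only.
placeholder_scores = {
    "model trained with few prompts per dataset": [0.48, 0.55, 0.61, 0.66, 0.72],
    "model trained with many prompts per dataset": [0.63, 0.65, 0.66, 0.68, 0.70],
}

if __name__ == "__main__":
    for name, scores in placeholder_scores.items():
        median, iqr = summarize(scores)
        print(f"{name}: median={median:.3f}, IQR={iqr:.3f}")
```

A higher median with a smaller interquartile range is the pattern the quoted passage describes for training on more prompts per dataset.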