Small Language Models Are Also Few-Shot Learners

Published: March 14, 2026
on the channel: Connor Shorten

This video explains the latest work on Pattern-Exploiting Training (PET). The paper shows that PET's scheme for distilling knowledge captured in pre-trained language models into discriminative classifiers also works in the few-shot setting. PET is compared directly with GPT-3, with both given 32 labeled examples per task on tasks like BoolQ and the Winograd Schema Challenge. The result is very interesting, but it is not a fair, apples-to-apples comparison with GPT-3. Thanks for watching! Please subscribe!
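
If you want to see what PET's cloze reformulation looks like in code, here is a minimal sketch. It uses bert-base-uncased as a stand-in for the ALBERT model from the paper just to keep things small, and the pattern and verbalizer below are illustrative, not the paper's exact ones:

    import torch
    from transformers import AutoTokenizer, AutoModelForMaskedLM

    # Stand-in model; the paper uses ALBERT, but any masked LM shows the idea.
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

    # Pattern: rephrase a BoolQ-style (passage, question) pair as a cloze question.
    passage = "The Eiffel Tower is located in Paris, France."
    question = "Is the Eiffel Tower in Paris"
    text = f"{passage} Question: {question}? Answer: {tokenizer.mask_token}."

    # Verbalizer: map each class label to a single vocabulary token.
    verbalizer = {True: "yes", False: "no"}

    inputs = tokenizer(text, return_tensors="pt")
    mask_index = (inputs.input_ids[0] == tokenizer.mask_token_id).nonzero().item()

    with torch.no_grad():
        logits = model(**inputs).logits[0, mask_index]  # vocabulary logits at the mask

    # Score each label by the logit of its verbalized token and pick the best.
    scores = {label: logits[tokenizer.convert_tokens_to_ids(tok)].item()
              for label, tok in verbalizer.items()}
    print(scores, "->", max(scores, key=scores.get))

The distillation step mentioned above then uses an ensemble of such pattern-specific models to soft-label unlabeled data and trains a regular classifier on those soft labels. A sketch of that loss, assuming hypothetical inputs `student_logits` (from the classifier being trained) and `soft_labels` (averaged ensemble distributions); the temperature value follows Hinton-style distillation convention and is an assumption here:

    import torch.nn.functional as F

    def distillation_loss(student_logits, soft_labels, temperature=2.0):
        # Cross-entropy between the ensemble's soft labels and the
        # temperature-scaled student distribution, averaged over the batch.
        log_probs = F.log_softmax(student_logits / temperature, dim=-1)
        return -(soft_labels * log_probs).sum(dim=-1).mean()
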

Paper Links:
This Paper: https://arxiv.org/abs/2009.07118
First PET Paper: https://arxiv.org/pdf/2001.07676.pdf
Next Word Prediction Demo: https://github.com/renatoviolin/next_...
Hacker News Reaction: https://news.ycombinator.com/item?id=...
HuggingFace NLP Viewer: https://huggingface.co/nlp/viewer/?da...
GPT-3: https://arxiv.org/pdf/2005.14165.pdf
SimCLRv2 (if curious about semi-supervised knowledge distillation in vision): https://arxiv.org/pdf/2006.10029.pdf
Measuring Massive Multitask Language Understanding: https://arxiv.org/pdf/2009.03300.pdf
GenAug: https://arxiv.org/pdf/2010.01794.pdf
Efficient Transformers Survey: https://arxiv.org/abs/2009.06732
T5: https://ai.googleblog.com/2020/02/exp...

Chapters
0:00 Introduction
1:17 Bold Headline on Hacker News
2:16 All Tasks are Language Modeling
3:15 Pattern-Exploiting Training Recap
4:40 Masked Word Prediction Demo
5:56 Iterative PET
6:38 Semi-Supervised Knowledge Distillation
8:05 Text-Input, Text-Output to All Tasks are Language Modeling
9:04 Datasets
13:28 GPT-3 Priming: Recap
14:56 PET vs. GPT-3
17:08 PET with Multiple Masks
18:27 Generative to Discriminative Models