
Small Models for Text Classification

Research bibliography for the small models post. Papers cover word embeddings, transformers, model compression, sentence embeddings, few-shot learning, and on-device deployment.

Word Representations
T. Mikolov, K. Chen, G. Corrado & J. Dean, "Efficient Estimation of Word Representations in Vector Space," ICLR Workshop Track, 2013.
Introduced Word2Vec, learning dense vector representations of words from large corpora. Showed that semantic relationships are encoded as geometric relationships in the vector space.
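The "geometry encodes semantics" claim is the famous analogy test. A minimal sketch with hand-picked toy vectors (real Word2Vec embeddings are learned from data and have 100-300 dimensions; these 3-d vectors are illustrative only):

```python
import numpy as np

# Hand-picked toy "embeddings" chosen so the gender offset is consistent;
# real Word2Vec vectors are learned, not constructed like this.
vectors = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
    "apple": np.array([0.5, 0.5, 0.1]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def analogy(a, b, c):
    """Return the word whose vector is nearest to vec(a) - vec(b) + vec(c)."""
    target = vectors[a] - vectors[b] + vectors[c]
    candidates = {w: v for w, v in vectors.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(candidates[w], target))

print(analogy("king", "man", "woman"))  # with these toy vectors: queen
```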
A. Joulin, E. Grave, P. Bojanowski & T. Mikolov, "Bag of Tricks for Efficient Text Classification," Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics (EACL), 2017.
Introduced FastText, combining word and character n-gram embeddings for fast, accurate text classification. Models are small (1-10 MB) and handle misspellings and rare words well.
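The character n-gram decomposition is easy to sketch. This is an illustration of the idea (with the `<`/`>` boundary markers FastText uses), not the library's implementation:

```python
def char_ngrams(word, n_min=3, n_max=6):
    """All character n-grams of a word, with boundary markers."""
    w = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        grams.extend(w[i:i + n] for i in range(len(w) - n + 1))
    return grams

# A word's vector is built from its n-gram vectors, so a misspelling
# still shares most n-grams with the correct form:
shared = set(char_ngrams("misspelling")) & set(char_ngrams("misspeling"))
print(len(shared) > 0)  # True
```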
Transformers
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser & I. Polosukhin, "Attention Is All You Need," Advances in Neural Information Processing Systems (NeurIPS), 2017.
Introduced the transformer architecture with self-attention as the sole mechanism for sequence modeling. The foundation for all subsequent encoder-only (BERT) and decoder-only (GPT) models.
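The core operation is scaled dot-product attention, softmax(QK^T / sqrt(d_k)) V. A single-head numpy sketch (random matrices stand in for learned projections):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)  # each query's weights sum to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))  # 4 positions, d_k = 8
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
out, w = attention(Q, K, V)
print(out.shape)  # (4, 8)
```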
J. Devlin, M. Chang, K. Lee & K. Toutanova, "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding," Proceedings of NAACL-HLT, 2019.
Introduced bidirectional pre-training with masked language modeling. BERT-base (110M parameters) became the foundation for a generation of NLP models and the starting point for most model compression work.
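A simplified sketch of the masked-LM objective's data preparation. The full BERT recipe replaces 80% of selected tokens with [MASK], 10% with a random token, and keeps 10% unchanged; this sketch does only the [MASK] case:

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Pick ~15% of positions as prediction targets and mask them."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            targets[i] = tok           # the model must reconstruct this token
            masked.append(mask_token)
        else:
            masked.append(tok)
    return masked, targets

masked, targets = mask_tokens(["tok%d" % i for i in range(200)])
print(len(masked))  # 200
```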
Model Compression
G. Hinton, O. Vinyals & J. Dean, "Distilling the Knowledge in a Neural Network," NIPS Deep Learning Workshop, 2015.
Introduced knowledge distillation: training a small student model on the soft probability distributions of a large teacher model. The foundational technique behind DistilBERT, TinyBERT, and other compressed models.
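The distillation loss itself is small enough to sketch: cross-entropy between the student's and the teacher's temperature-softened distributions (numpy illustration, not a training loop):

```python
import numpy as np

def softmax(z, T=1.0):
    e = np.exp(z / T - (z / T).max())
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, T=4.0):
    """Cross-entropy of the student against the teacher's softened outputs.
    A high temperature T exposes how the teacher ranks the wrong classes."""
    p_teacher = softmax(teacher_logits, T)
    log_p_student = np.log(softmax(student_logits, T))
    return float(-(p_teacher * log_p_student).sum())

teacher = np.array([6.0, 2.0, 1.0])  # confident teacher
aligned = np.array([5.0, 1.5, 0.5])  # student matching the teacher's ranking
flipped = np.array([0.5, 1.5, 5.0])  # student contradicting it
print(distillation_loss(aligned, teacher) < distillation_loss(flipped, teacher))  # True
```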
V. Sanh, L. Debut, J. Chaumond & T. Wolf, "DistilBERT, a Distilled Version of BERT: Smaller, Faster, Cheaper and Lighter," 5th Workshop on Energy Efficient Machine Learning and Cognitive Computing (NeurIPS), 2019.
Applied knowledge distillation to BERT, producing a 6-layer model (66M parameters) that retains 97% of BERT's language understanding while being 40% smaller and 60% faster.
X. Jiao, Y. Yin, L. Shang, X. Jiang, X. Chen, L. Li, F. Wang & Q. Liu, "TinyBERT: Distilling BERT for Natural Language Understanding," Findings of EMNLP, 2020.
Extended distillation to match intermediate representations (attention matrices and hidden states), producing a 14.5M parameter model that preserves more of BERT's behavior per parameter than output-only distillation.
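A rough sketch of what "matching intermediate representations" means: an MSE penalty between projected student hidden states and the teacher's, per matched layer. Dimensions follow the TinyBERT-4 configuration, but the random matrices here are illustrative, not the paper's training code:

```python
import numpy as np

rng = np.random.default_rng(0)
teacher_hidden = rng.normal(size=(16, 768))  # 16 tokens, teacher width 768
student_hidden = rng.normal(size=(16, 312))  # TinyBERT-4 width 312
W = rng.normal(size=(312, 768)) * 0.01       # learned projection (random here)

def hidden_loss(student_h, teacher_h, W):
    """MSE between projected student hidden states and the teacher's,
    computed for each matched layer pair during distillation."""
    return float(np.mean((student_h @ W - teacher_h) ** 2))

print(hidden_loss(student_hidden, teacher_hidden, W) > 0)  # True
```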
Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma & R. Soricut, "ALBERT: A Lite BERT for Self-supervised Learning of Language Representations," ICLR, 2020.
Achieved parameter reduction through cross-layer weight sharing and embedding factorization. ALBERT-tiny reaches 4M parameters, though with reduced accuracy compared to full BERT.
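The embedding factorization is simple arithmetic: replace the V x H embedding table with a V x E table plus an E x H projection. With BERT-like sizes (illustrative numbers):

```python
# Embedding-table parameter counts, untied vs ALBERT-style factorized.
V, H, E = 30000, 768, 128     # vocab size, hidden size, factorized embed size

bert_embed = V * H            # one V x H table
albert_embed = V * E + E * H  # V x E table, then an E x H projection

print(bert_embed, albert_embed)  # 23040000 3938304
```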
Z. Sun, H. Yu, X. Song, R. Liu, Y. Yang & D. Zhou, "MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices," Proceedings of ACL, 2020.
Designed specifically for mobile deployment with a bottleneck architecture that narrows each transformer block's internal width and compensates with stacked feed-forward layers. At 25M parameters, MobileBERT fits comfortably in a mobile app bundle.
Sentence Embeddings
N. Reimers & I. Gurevych, "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks," Proceedings of EMNLP-IJCNLP, 2019.
Fine-tuned BERT with a siamese network architecture to produce high-quality sentence embeddings. The foundation for the sentence-transformers library and the backbone models used in SetFit.
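On top of the encoder, SBERT's default pooling is just a mask-aware mean over token embeddings. A numpy sketch (random matrices stand in for BERT outputs):

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token embeddings over non-padding positions to get one
    fixed-size sentence vector."""
    mask = attention_mask[:, None].astype(float)
    return (token_embeddings * mask).sum(axis=0) / mask.sum()

rng = np.random.default_rng(0)
tokens = rng.normal(size=(10, 384))  # 10 token embeddings, 384-dim
mask = np.array([1] * 7 + [0] * 3)   # last 3 positions are padding
sent_vec = mean_pool(tokens, mask)
print(sent_vec.shape)  # (384,)
```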
Few-Shot Learning
L. Tunstall, N. Reimers, U. E. S. Jo, L. Bates, D. Korat, M. Wasserblat & O. Pereg, "Efficient Few-Shot Learning Without Prompts," arXiv preprint, 2022.
Introduced SetFit, a two-phase approach (contrastive fine-tuning of sentence transformers + classification head) that achieves strong few-shot classification without prompt engineering. The contrastive pairs amplify small datasets combinatorially.
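The combinatorial amplification is easy to see by enumerating pairs. This is one simple pairing scheme for illustration; the SetFit library samples pairs rather than exhaustively enumerating them:

```python
from itertools import combinations, product

def contrastive_pairs(labeled):
    """Build (a, b, same_label) pairs from few-shot examples: same-class
    pairs are positives (1), cross-class pairs are negatives (0)."""
    by_label = {}
    for text, label in labeled:
        by_label.setdefault(label, []).append(text)
    pairs = []
    for texts in by_label.values():
        pairs += [(a, b, 1) for a, b in combinations(texts, 2)]
    for l1, l2 in combinations(sorted(by_label), 2):
        pairs += [(a, b, 0) for a, b in product(by_label[l1], by_label[l2])]
    return pairs

# 8 examples per class, 2 classes -> 2 * C(8,2) + 8*8 = 56 + 64 = 120 pairs
data = [(f"pos{i}", "pos") for i in range(8)] + [(f"neg{i}", "neg") for i in range(8)]
print(len(contrastive_pairs(data)))  # 120
```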
Representation Learning
A. Kusupati, G. Bhatt, A. Rege, M. Wallingford, A. Sinha, V. Ramanujan, W. Howard-Snyder, K. Chen, S. Kakade, P. Jain & A. Farhadi, "Matryoshka Representation Learning," NeurIPS, 2022.
Trained models to produce embeddings useful at multiple dimensionalities. Allows deploying the same model at different quality/size trade-offs by truncating the embedding vector.
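At inference time the trade-off is just a slice. A minimal sketch, with re-normalization so cosine similarity still behaves (the 768-dim vector here is random, standing in for a Matryoshka-trained embedding):

```python
import numpy as np

def truncate(embedding, dim):
    """Keep only the first `dim` coordinates of a Matryoshka embedding
    and re-normalize to unit length."""
    v = embedding[:dim]
    return v / np.linalg.norm(v)

rng = np.random.default_rng(0)
full = rng.normal(size=768)          # stand-in for a trained embedding
small = truncate(full, 64)           # 12x smaller index, lower quality
print(small.shape)  # (64,)
```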