
Small Models for On-Device Text Classification

March 18, 2026 | 19 minutes (3527 words)

Imagine you’ve built a text classification system. It takes short incident descriptions like “EE hurt back lifting patient” and “forklift struck shelving unit in warehouse” and classifies them into one of fifteen safety categories. You’ve been using Claude or GPT to do this, and it works well: 85-90% accuracy, with the model reasoning through each classification. But every prediction requires an API call, costs money, and needs an internet connection. What if you could ship a model small enough to run entirely on a phone?

This is the on-device inference problem, and it sits at the intersection of natural language processing, model compression, and mobile engineering. The goal is to take the classification intelligence of a large language model and distill it into something that fits in a 20-50MB app bundle and runs in milliseconds on consumer hardware.

This post walks through the landscape of small text classification models, covering what they are, how they work, and the architectural ideas that make them possible. We’ll start from the simplest approaches and build toward the transformer-based methods that currently represent the state of the art for compact classifiers. By the end, you’ll understand each component in the pipeline below and how they fit together to get a classifier onto a mobile device.

From Transformer to On-Device Classifier
Each step builds on the previous one.
Attention / Transformers (context-dependent representations) → BERT (pre-trained bidirectional encoder) → Model Compression (DistilBERT · TinyBERT · MobileBERT shrink the encoder for mobile) → Sentence Transformer Training (similarity-optimized embeddings) → SetFit (few-shot contrastive fine-tuning) → On-Device Classifier (runs entirely on a phone or tablet)

From Words to Numbers

Every text classification model, no matter how sophisticated, must solve the same foundational problem: converting variable-length text into a fixed-length numerical representation that a classifier can consume. The history of NLP is largely the history of increasingly powerful solutions to this problem.

Bag of Words

The simplest approach ignores word order entirely. Build a vocabulary of all words in your training data, then represent each document as a vector of word counts. If your vocabulary has 10,000 words, every document becomes a 10,000-dimensional vector, mostly filled with zeros.

A variant called TF-IDF (term frequency-inverse document frequency) improves on raw counts by weighting words that appear in fewer documents more heavily. The word “forklift” appearing in an incident description is far more informative than the word “the,” and TF-IDF captures this.

Pair these representations with a logistic regression or support vector machine, and you have a text classifier that is:

  • Tiny: The model is a vocabulary mapping plus a weight matrix. A few megabytes at most.
  • Fast: Classification is a sparse vector dot product. Microseconds on any hardware.
  • Surprisingly competitive: For domain-specific tasks with distinctive vocabulary, TF-IDF classifiers can produce strong baselines.
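For reference, the whole stack is a few lines of scikit-learn. The incident texts and labels below are invented for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical incident descriptions and labels.
texts = [
    "EE hurt back lifting patient",
    "forklift struck shelving unit in warehouse",
    "nurse strained shoulder moving bed",
    "pallet fell and struck worker",
]
labels = ["strain_sprain", "struck_by", "strain_sprain", "struck_by"]

# The entire model is a vocabulary mapping (the vectorizer) plus one
# weight matrix (the logistic regression) -- a few megabytes at most.
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(texts, labels)

pred = clf.predict(["worker hit by falling box"])
```

With a realistic training set, the same three lines of pipeline code are all that changes between a toy and a production baseline.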

The limitation is that bag-of-words representations discard all semantic structure. “The forklift struck the worker” and “The worker struck the forklift” produce identical vectors, despite describing very different incidents. To do better, we need representations that understand meaning.

Word Embeddings

In 2013, Tomas Mikolov and colleagues at Google published Word2Vec,[1] a method for learning dense vector representations of words from large text corpora. The key insight was that words appearing in similar contexts should have similar representations. The model learns 100-300 dimensional vectors for each word such that semantic relationships are encoded as geometric relationships in the vector space.

Consider the famous example. The vector for “king” minus “man” plus “woman” lands near “queen.” These aren’t hand-crafted rules. They emerge from training on billions of words of text.

For classification, you can average the word embeddings of all words in a document to produce a single vector, then feed that to a classifier. This is the approach behind FastText,[2] released by Facebook in 2016. FastText goes further by also learning embeddings for character n-grams (subword units), which helps it handle misspellings and rare words. This is a significant advantage for messy, real-world text like incident reports.

FastText models are remarkably small (1-10MB), fast to train, and fast at inference. For many practical classification tasks, they represent an excellent trade-off between accuracy and simplicity. But they still average over word order, and they assign each word a single fixed embedding regardless of context. The word “bank” gets the same vector whether it appears in “river bank” or “bank account.”
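A toy sketch makes the averaging, and its order-blindness, concrete. The four-dimensional vectors below are invented, standing in for real pre-trained Word2Vec or FastText embeddings:

```python
import numpy as np

# Tiny hand-made 4-dim vectors standing in for pre-trained embeddings.
emb = {
    "forklift": np.array([0.9, 0.1, 0.0, 0.2]),
    "struck":   np.array([0.7, 0.3, 0.1, 0.0]),
    "worker":   np.array([0.2, 0.8, 0.4, 0.1]),
}

def embed(text):
    """Average the vectors of known words; out-of-vocabulary words are skipped."""
    vecs = [emb[w] for w in text.lower().split() if w in emb]
    return np.mean(vecs, axis=0) if vecs else np.zeros(4)

v1 = embed("The forklift struck the worker")
v2 = embed("The worker struck the forklift")  # same vector, different meaning
```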

Contextual Embeddings and the Attention Mechanism

The breakthrough that changed everything was the attention mechanism, introduced in 2017 by Vaswani et al. in the paper “Attention Is All You Need.”[3]

Instead of representing a word with a single fixed vector, compute its representation dynamically based on every other word in the sentence. When processing the word “bank” in “flooded river bank,” the model attends to “river” and “flooded” and produces a representation that captures the waterway meaning. In “opened a bank account,” the same word attends to “account” and “opened” and produces a completely different representation.

Mechanically, attention works through three learned projections of each word’s embedding: a query (what am I looking for?), a key (what do I contain?), and a value (what information do I provide?). The attention score between two words is the dot product of one word’s query with another’s key, scaled and normalized by a softmax. High scores mean “these words are relevant to each other.” The output for each word is a weighted sum of all values, where the weights are the attention scores.

\[\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V\]

This is self-attention, where the sequence attends to itself. Stack multiple layers of self-attention with feedforward networks between them, and you get a transformer, a model that builds increasingly abstract contextual representations of text through successive layers of attention.
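The formula above is a few lines of NumPy. This toy version omits the learned projection matrices W_Q, W_K, W_V and the multiple heads that real transformers add:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # pairwise relevance scores
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V                  # weighted sum of value vectors

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))             # 5 tokens, 8-dim embeddings
out = attention(x, x, x)                # self-attention: Q = K = V = x
```

Each output row is a context-mixed version of the corresponding token: the same input token produces a different output depending on what surrounds it.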


The Transformer Family Tree

The original transformer was designed for translation (sequence-to-sequence). But researchers quickly discovered that the architecture, or pieces of it, could be adapted for many tasks, producing three major branches.

  1. Encoder-only (BERT and descendants): Read the entire input at once, attending to words both before and after each position. Best for classification and understanding tasks.
  2. Decoder-only (GPT family): Process text left-to-right, generating one token at a time. Best for text generation.
  3. Encoder-decoder (T5, BART): Use both halves. Best for tasks that transform input text into different output text.

For text classification, we care about encoder-only models. These read the entire input and produce a single representation that captures its meaning, which is exactly what a classifier needs.

BERT: The Foundation

BERT (Bidirectional Encoder Representations from Transformers)[4] was released by Google in 2018 and became the foundation for a generation of NLP models. BERT-base has 12 transformer layers, 768-dimensional hidden states, 12 attention heads, and 110 million parameters. BERT-large doubles most of these dimensions to 340 million parameters.

BERT is pre-trained on a massive text corpus using two objectives: masked language modeling (predict randomly masked words from context) and next sentence prediction (determine if two sentences are consecutive in the original text). This pre-training teaches the model general language understanding. You then fine-tune BERT on your specific classification task by adding a small classification head on top and training for a few epochs on labeled examples.

Fine-tuning BERT typically achieves state-of-the-art results on text classification benchmarks, but the model is too large for mobile deployment. At 440MB (FP32), BERT-base alone exceeds reasonable app size budgets, and inference requires significant compute.

Making Transformers Smaller

The quest to shrink BERT while preserving its capabilities has produced a rich landscape of techniques. These fall into several families.

Knowledge Distillation

The idea behind distillation[5] is to train a small student model to mimic the behavior of a large teacher model. Instead of training the student on hard labels (class 0 or class 1), you train it on the teacher’s soft probability distribution over all classes. These soft targets carry more information than hard labels. A teacher that assigns 70% to the correct class and 20% to a related class is teaching the student about inter-class similarity.
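The soft-target idea is easy to see with a toy example. The logits below are invented; real recipes, following Hinton et al., also mix in a hard-label loss term alongside the distillation term:

```python
import numpy as np

def softmax(z, T=1.0):
    """Softmax with temperature T; higher T flattens the distribution."""
    z = np.asarray(z, dtype=float) / T
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical teacher logits for a 3-class problem.
teacher_logits = [4.0, 2.5, -1.0]

hard = np.array([1.0, 0.0, 0.0])       # hard label: argmax only
soft = softmax(teacher_logits, T=2.0)  # soft targets keep inter-class structure

# The student trains to match the soft distribution (cross-entropy against it).
student_logits = [3.0, 1.0, 0.0]
p = softmax(student_logits, T=2.0)
kd_loss = -np.sum(soft * np.log(p))
```

Note how the second class gets substantial probability mass under the soft targets but zero under the hard label: that mass is exactly the inter-class similarity signal the student learns from.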

DistilBERT[6] applied this to BERT, producing a 6-layer model (compared to BERT’s 12) that retains 97% of BERT’s language understanding while being 40% smaller and 60% faster. At 66 million parameters, it’s a meaningful reduction but still large for mobile.

TinyBERT[7] uses a more aggressive distillation method that matches not just the teacher’s output probabilities but also its intermediate representations, including attention matrices and hidden states at each layer. This richer supervision signal is what allows TinyBERT to compress into a much smaller architecture (6 layers, 14.5 million parameters) while still retaining useful behavior. The better the distillation, the smaller you can make the student.

Architectural Efficiency

ALBERT[8] (A Lite BERT) takes a different approach: instead of giving each transformer layer its own weight matrices, it shares one set of weights across all layers, applying the same parameters repeatedly like an unrolled recurrent network. ALBERT also factorizes the embedding matrix, decoupling the vocabulary embedding size from the hidden layer size. ALBERT-tiny achieves dramatic parameter reduction (4 million parameters) at the cost of some accuracy.

MobileBERT[9] was designed specifically for mobile deployment. It uses a “bottleneck” architecture where each transformer layer has a narrow hidden size (128 dimensions) with wider feedforward blocks, maintaining representational capacity while reducing the parameter count. At 25 million parameters, MobileBERT fits comfortably in a mobile app.

Sentence Transformers

The compressed models above give us smaller encoders, but classification requires one more step. A classifier needs a single fixed-length vector for each input, and it needs similar inputs to land near each other in the vector space so that decision boundaries are easy to draw. Standard BERT produces a separate vector for each token in the input. You can collapse these into a single vector by using the special [CLS] token (a dedicated token BERT prepends to every input, whose output vector is trained to represent the entire sequence), or by averaging all token vectors. But these representations are not optimized for similarity between inputs, so texts with the same label don’t necessarily cluster together. (Note that despite the name “sentence transformers,” these models work on any text up to their token limit, typically 512 tokens, which covers paragraphs and short documents.)
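Mean pooling with a padding mask, the standard way token vectors are collapsed into one sentence vector, is simple to sketch in NumPy (the shapes below are illustrative):

```python
import numpy as np

def mean_pool(token_vecs, mask):
    """Average token vectors per input, ignoring padding positions."""
    m = mask[:, :, None]                 # broadcast mask over the embedding dim
    return (token_vecs * m).sum(axis=1) / m.sum(axis=1)

# Hypothetical encoder output: 2 inputs, 4 token slots, 384 dims.
rng = np.random.default_rng(1)
token_vecs = rng.normal(size=(2, 4, 384))
mask = np.array([[1, 1, 1, 1],           # 1 = real token, 0 = padding
                 [1, 1, 0, 0]], dtype=float)

sent_vecs = mean_pool(token_vecs, mask)  # (2, 384) fixed-length vectors
```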

Sentence-BERT (SBERT)[10] fine-tunes BERT specifically to produce high-quality sentence embeddings. It uses a siamese network architecture in which the same BERT model processes two sentences independently, then a loss function pushes similar sentences together and dissimilar sentences apart in the embedding space.

The sentence-transformers library provides a family of pre-trained models optimized this way. The choice between them is primarily a size-vs-quality trade-off, shaped by your deployment constraints.

  • all-MiniLM-L6-v2 — 22M params · 384-dim · ~22 MB quantized. The default starting point for mobile. Small enough to bundle in an app, fast inference on-device, and strong enough for most domain-specific classification tasks. Start here and only move up if your eval accuracy plateaus.
  • all-MiniLM-L12-v2 — 33M params · 384-dim · ~33 MB quantized. The upgrade path from L6. Doubles the transformer layers (6 → 12) while keeping the same 384-dim embedding, so your downstream pipeline stays unchanged; only the encoder grows. Worth trying when L6 accuracy isn't enough.
  • A retrieval-trained MiniLM variant — 33M params · 384-dim · ~33 MB quantized. Trained with a different, retrieval-focused objective that produces embeddings better at distinguishing subtle semantic similarity. A good choice when your categories overlap, e.g., separating "struck by object" from "caught between objects" in safety classification.
  • all-mpnet-base-v2 — 110M params · 768-dim · ~110 MB quantized. Too large for mobile deployment. Useful server-side as a high-quality teacher for knowledge distillation, or when you need the best possible embeddings and model size doesn't matter. Think of it as the quality ceiling to benchmark against.

These sentence transformers are the building blocks for the approach we’ll explore next.


SetFit: Few-Shot Classification Without Prompts

Traditional fine-tuning requires hundreds or thousands of labeled examples per class to be effective. For many real-world tasks, labeled data is scarce and expensive to obtain. SetFit (Sentence Transformer Fine-tuning)[11] addresses this with a two-phase training approach designed for few-shot scenarios.

Phase 1: Contrastive Learning

SetFit generates pairs of sentences from your training data.

  • Positive pairs: Two sentences with the same label.
  • Negative pairs: Two sentences with different labels.

To make this concrete, imagine a training set with three labeled examples:

A. "EE hurt back lifting patient" → Strain/Sprain
B. "Nurse strained shoulder moving bed" → Strain/Sprain
C. "Forklift struck shelving in warehouse" → Struck By

SetFit generates pairs from these examples:

Positive: (A, B) — both Strain/Sprain → push embeddings closer
Negative: (A, C) — Strain/Sprain vs. Struck By → push embeddings apart
Negative: (B, C) — Strain/Sprain vs. Struck By → push embeddings apart

Three examples produce three pairs. With 10 examples per class across 5 classes (50 examples total), there are C(50, 2) = 1,225 candidate pairs. This combinatorial amplification is what makes SetFit effective with limited labeled data, because the model sees many more training signals than the raw label count suggests.
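The pair generation itself is a few lines of plain Python, using the three examples above:

```python
from itertools import combinations
import math

# The three labeled examples from above.
examples = [
    ("EE hurt back lifting patient", "Strain/Sprain"),
    ("Nurse strained shoulder moving bed", "Strain/Sprain"),
    ("Forklift struck shelving in warehouse", "Struck By"),
]

def generate_pairs(examples):
    """Every unordered pair of examples, flagged 1 (same label) or 0 (different)."""
    return [
        (t1, t2, 1 if y1 == y2 else 0)
        for (t1, y1), (t2, y2) in combinations(examples, 2)
    ]

pairs = generate_pairs(examples)  # 3 examples -> 3 pairs
n_candidates = math.comb(50, 2)   # 10 per class x 5 classes -> 1225 candidate pairs
```

In practice SetFit samples from the candidate pairs rather than enumerating all of them, but the combinatorics are what make a small labeled set go a long way.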

It then fine-tunes a pre-trained sentence transformer using a contrastive loss that pulls positive-pair embeddings together and pushes negative-pair embeddings apart. This phase reshapes the embedding space so that sentences of the same class cluster together.

Phase 2: Classification Head

After the embedding space has been reshaped, SetFit trains a simple classification head (typically logistic regression) on the fine-tuned embeddings. This head learns the decision boundaries between the clusters that Phase 1 created.

To visualize what this looks like, imagine projecting the 384-dimensional embeddings down to two dimensions. Before contrastive fine-tuning, incidents from different categories are scattered and overlapping. After Phase 1, they form distinct clusters.

[Figure: Embedding Space After Contrastive Fine-Tuning. Each sphere is one incident description; nearby points share similar meaning. Clusters shown: Strain/Sprain, Struck By, Fall.]

The classification head (Phase 2) then draws decision boundaries between these clusters. Because the clusters are well-separated, a simple logistic regression is enough. No deep network needed.

The classification head is tiny, just a weight matrix and bias vector. For 15 classes with 384-dimensional embeddings, the head is 15 x 384 = 5,760 parameters. The overwhelming majority of the model’s parameters are in the sentence transformer backbone.
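Sticking with scikit-learn, we can check the head's size directly on synthetic, well-separated clusters standing in for fine-tuned embeddings:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_classes, dim = 15, 384

# Synthetic, well-separated clusters standing in for fine-tuned embeddings:
# 15 class centers, 20 noisy points around each.
centers = rng.normal(scale=5.0, size=(n_classes, dim))
X = np.vstack([c + rng.normal(size=(20, dim)) for c in centers])
y = np.repeat(np.arange(n_classes), 20)

head = LogisticRegression(max_iter=1000).fit(X, y)

weights = head.coef_.size      # 15 * 384 = 5,760 weights
biases = head.intercept_.size  # plus 15 bias terms
```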

Why SetFit Works Well for Small Models

SetFit’s architecture is naturally suited to on-device deployment.

  1. The backbone is a standard sentence transformer, which can be exported to ONNX, Core ML, or TensorFlow Lite using well-established tooling.
  2. The classification head is trivially small, adding negligible size.
  3. Inference is two steps. Embed the text (one forward pass through the transformer), then classify the embedding (one matrix multiply). No autoregressive generation, no beam search.
  4. Quantization is straightforward. Sentence transformers respond well to INT8 quantization, typically losing less than 1% accuracy while reducing model size by 4x.

A SetFit model built on all-MiniLM-L6-v2, quantized to INT8, is approximately 22MB, comfortably within mobile app size budgets.
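The mechanics behind that 4x reduction are worth seeing once. Below is a minimal symmetric per-tensor scheme; real toolchains such as ONNX Runtime add calibration and per-channel scales, but the idea is the same:

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor INT8 quantization: w is approximated by scale * q."""
    scale = float(np.abs(w).max()) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

w = np.random.default_rng(0).normal(size=(384, 384)).astype(np.float32)
q, scale = quantize_int8(w)

w_hat = q.astype(np.float32) * scale  # dequantized approximation
size_ratio = w.nbytes / q.nbytes      # 4x smaller: 4-byte FP32 -> 1-byte INT8
max_err = np.abs(w - w_hat).max()     # rounding error bounded by scale / 2
```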


The Practical Landscape

Putting this together, here is the spectrum of approaches for on-device text classification, ordered from simplest to most capable.

TF-IDF + Logistic Regression
No neural network · 1-5 MB · No quantization needed
Trivial to deploy with no runtime dependencies beyond basic linear algebra. A sparse vocabulary mapping plus a weight matrix. Excellent baseline to benchmark against before reaching for anything more complex.
FastText
Shallow network · 1-10 MB · Prunable vocabulary for size control
Native mobile SDKs available for both iOS and Android. Handles misspellings and rare words well thanks to subword embeddings. The strongest option that doesn't require an ONNX or CoreML runtime.
SetFit + TinyBERT
14.5M params · 6 layers · ~15 MB INT8 · ONNX / CoreML / TFLite
The smallest transformer-based option. TinyBERT's aggressive distillation from BERT produces a compact encoder that benefits from SetFit's contrastive fine-tuning. Requires an ONNX or CoreML runtime in your app.
SetFit + all-MiniLM-L6-v2
22M params · 6 layers · 384-dim · ~22 MB INT8 · ONNX / CoreML / TFLite
The sweet spot for most on-device classification. Well-tested sentence transformer backbone, strong community support, and a good balance of size, speed, and accuracy. This is the approach we use for SIFp.
SetFit + retrieval-trained MiniLM
33M params · 12 layers · 384-dim · ~33 MB INT8 · ONNX / CoreML / TFLite
Worth trying if MiniLM-L6 struggles with semantically overlapping categories. The retrieval-optimized embeddings can better separate subtle distinctions, at the cost of a larger bundle.
MobileBERT (fine-tuned)
25M params · 24 layers · 512-dim · ~25 MB INT8 · ONNX / CoreML / TFLite
Designed from the ground up for mobile. Its bottleneck architecture trades embedding quality for inference speed. A good alternative to SetFit when you have enough labeled data for traditional fine-tuning (hundreds+ examples per class).
DistilBERT (fine-tuned)
66M params · 6 layers · 768-dim · ~65 MB INT8 · ONNX / CoreML / TFLite
Pushing the upper bound of what fits in a mobile bundle. Highest accuracy of the group, but the 65 MB size may be prohibitive depending on your app's size budget. Consider server-side deployment or as a distillation teacher.

Which model you choose depends on your deployment constraints. A strict app size budget pushes you toward TF-IDF or FastText; a need for contextual understanding with limited labeled data pushes you toward SetFit. Each step up in model size buys more representational power, but the accuracy gain is not uniform. The right trade-off depends on your application, including how much ambiguity exists between classes, how messy the input text is, and how much labeled data you can afford to produce.

Deployment Formats

Regardless of which model you choose, getting it onto a mobile device requires converting it from a training framework (PyTorch) to a mobile inference format.

  • ONNX Runtime Mobile: Cross-platform, supports iOS and Android, good quantization support. Probably the most pragmatic choice for cross-platform apps.
  • Core ML: Apple’s native format. Runs on the Neural Engine for maximum performance on iOS devices. Only works on Apple platforms.
  • TensorFlow Lite: Google’s mobile format. Strong Android support, also works on iOS. Mature quantization tools.

The conversion pipeline typically looks like:

PyTorch Model → Export to ONNX → Quantize (FP32 → INT8) → Convert to Core ML / TFLite → Bundle in App

What About Non-Transformer Alternatives?

For some tasks, the simplest approach is the best one. Before committing to a transformer-based model, it’s worth benchmarking TF-IDF or FastText on your specific task. If a 2MB FastText model achieves 85% accuracy and a 22MB MiniLM SetFit model achieves 89%, the 4 percentage points may not justify the additional complexity, size, and inference latency.

The advantage of transformer-based approaches becomes most pronounced in a few situations.

  • Your text is semantically ambiguous and word order matters (bag-of-words fails).
  • You have limited labeled data (SetFit’s contrastive learning amplifies small datasets).
  • You need to handle linguistic diversity like misspellings, abbreviations, multilingual input, and terse vs. verbose descriptions.

For workplace incident reports, which are often terse, full of abbreviations, grammatically imperfect, and span multiple industries, the contextual understanding of a transformer tends to matter.


Looking Forward

The small model landscape is evolving rapidly. A few directions are worth watching:

Structured pruning removes entire attention heads or feedforward neurons that contribute least to task performance, producing models that are architecturally smaller (not just numerically compressed). Unlike quantization, pruning reduces actual computation at inference time.

Matryoshka embeddings[12] train models to produce embeddings that are useful at multiple dimensionalities. The first 64 dimensions are optimized to be as informative as possible, then the first 128, then the first 256. This allows deploying the same model at different quality/size trade-offs by simply truncating the embedding vector.
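Deployment-time truncation is then trivial. This sketch assumes the embedding came from a model trained with the Matryoshka objective; truncating an ordinary embedding would simply discard information:

```python
import numpy as np

def truncate(e, d):
    """Keep the first d dimensions and renormalize to unit length."""
    t = e[:d]
    return t / np.linalg.norm(t)

# A unit-norm embedding standing in for a Matryoshka-trained model's output.
rng = np.random.default_rng(0)
emb = rng.normal(size=(384,))
emb /= np.linalg.norm(emb)

small = truncate(emb, 64)  # same model, 6x smaller vectors to store and compare
```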

On-device fine-tuning is beginning to emerge in frameworks like Core ML and TensorFlow Lite. The idea is to personalize a pre-shipped model based on the user’s specific data, without sending that data to a server. For safety classification, this could mean a base model that improves as a client reviews and corrects predictions.

The trajectory is clear. Capable models are getting smaller, and the tooling for deploying them on consumer devices is maturing. For text classification tasks, where the input is short, the output is a discrete label, and the model architecture is a simple encoder, we’re already at the point where on-device inference is practical, performant, and cost-free at the point of use.


References

See the annotated bibliography for notes on each paper and additional links.

  1. Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv:1301.3781
  2. Joulin, A., Grave, E., Bojanowski, P., & Mikolov, T. (2017). Bag of tricks for efficient text classification. arXiv:1607.01759
  3. Vaswani, A., et al. (2017). Attention is all you need. NeurIPS 2017. arXiv:1706.03762
  4. Devlin, J., Chang, M., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805
  5. Hinton, G., Vinyals, O., & Dean, J. (2015). Distilling the knowledge in a neural network. arXiv:1503.02531
  6. Sanh, V., Debut, L., Chaumond, J., & Wolf, T. (2019). DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv:1910.01108
  7. Jiao, X., et al. (2020). TinyBERT: Distilling BERT for natural language understanding. arXiv:1909.10351
  8. Lan, Z., et al. (2020). ALBERT: A lite BERT for self-supervised learning of language representations. arXiv:1909.11942
  9. Sun, Z., et al. (2020). MobileBERT: a compact task-agnostic BERT for resource-limited devices. arXiv:2004.02984
  10. Reimers, N. & Gurevych, I. (2019). Sentence-BERT: Sentence embeddings using Siamese BERT-networks. arXiv:1908.10084
  11. Tunstall, L., et al. (2022). Efficient few-shot learning without prompts. arXiv:2209.11055
  12. Kusupati, A., et al. (2022). Matryoshka representation learning. arXiv:2205.13147