Language models with billions of parameters, pre-trained on millions of documents, have changed our lives. We constantly ask ChatGPT, Bard, and others questions and receive remarkably accurate answers, and they are getting better all the time.
Beyond question-and-answer conversations with bots, and beyond generating marketing copy or documentation, these models are very useful for data labeling.
For example:
- You can label your customers’ comments based on their sentiment
- You can label your company’s documents based on their content
- You can classify your store’s products based on their description
- You can classify your customers’ complaints based on their content
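Taking the last case as a concrete illustration, here is a minimal sketch of a zero-shot labeling prompt for customer complaints. The label set and the prompt wording are assumptions made up for this example, not a recommended template:

```python
# Illustrative sketch: building a zero-shot labeling prompt for an LLM.
# The labels and wording below are assumptions for the example.

LABELS = ["billing", "shipping", "product_quality", "other"]

def build_labeling_prompt(complaint: str) -> str:
    """Format a customer complaint as a single-label classification prompt."""
    return (
        "Classify the following customer complaint into exactly one of "
        f"these categories: {', '.join(LABELS)}.\n"
        "Answer with the category name only.\n\n"
        f"Complaint: {complaint}"
    )

print(build_labeling_prompt("My package arrived two weeks late."))
```

The returned string is what you would send to the LLM; asking for "the category name only" keeps the model's answer easy to parse back into a label.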
Labeling data is powerful: it helps you stay organized, respond to users faster, and make better decisions.
Labeling data by hand, though, is tedious and expensive. Labeling it with an LLM (GPT-X) solves that, but has some drawbacks of its own:
- Models with billions of parameters don’t fit on an ordinary server, and calling them via API gets expensive
- Depending on the use case, LLMs can be slow in real-time interactions
What is a Distilled Model?
A distilled model is a model that has been trained to be smaller and faster than the original model, while maintaining most of its performance.
In other words, you can distill a model to label your customer complaints that:
- Works as well as a Large LM with well-crafted and tested prompts
- Is smaller and much faster
How does it work?
You use a large model as a teacher and a small model as a student: the teacher teaches the student, and the student learns to do what the teacher does. The student can be a foundation model, a conventional neural network, or something like BERT, one of Google’s early Transformer models, introduced in 2018 and built on the architecture from the paper “Attention Is All You Need” (Vaswani et al., 2017).
In BERT’s case, its simplicity is its advantage: unlike LLMs such as GPT-4, which need distributed systems with GPUs to run, BERT can run, and even be trained, on a laptop. It also works very well for text classification and data labeling tasks.
Why is it Important?
Because it gives you the freedom to adapt an engine to your business needs and the freedom to not depend on a service provider that charges you for each request you make.
Say you want to automate the classification of your customer complaints:
- You collect the pile of complaints you have
- You label them with a large model, carefully crafting the prompts and the output format
- You train a small model with the labels you’ve obtained
- You put the small model to work in your system
From time to time, you retrain the model, or supervise it with a human, or do both.
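The loop above can be sketched end to end. Everything here is a stand-in: in practice the teacher would be an LLM behind well-tested prompts and the student something like BERT, but a keyword teacher and a tiny Naive Bayes student keep the example self-contained:

```python
# Minimal sketch of the distillation loop, with both models mocked.
# Assumption: teacher_label() stands in for a real LLM API call, and
# NaiveBayesStudent stands in for a small model such as BERT.
from collections import Counter, defaultdict
import math

def teacher_label(complaint: str) -> str:
    """Stand-in for the LLM teacher."""
    text = complaint.lower()
    if "refund" in text or "charged" in text:
        return "billing"
    if "late" in text or "package" in text:
        return "shipping"
    return "other"

class NaiveBayesStudent:
    """Tiny student trained on the teacher's labels (word-count Naive Bayes)."""
    def fit(self, texts, labels):
        self.label_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        return self

    def predict(self, text):
        def score(label):
            total = sum(self.word_counts[label].values())
            s = math.log(self.label_counts[label])
            for w in text.lower().split():
                # Laplace smoothing so unseen words don't zero out a label
                s += math.log((self.word_counts[label][w] + 1) / (total + 1))
            return s
        return max(self.label_counts, key=score)

# Steps 1-2: collect the complaints and label them with the teacher.
complaints = [
    "I was charged twice, I want a refund",
    "My package is three weeks late",
    "The app crashes when I log in",
]
labels = [teacher_label(c) for c in complaints]

# Step 3: train the small model on the teacher's labels.
student = NaiveBayesStudent().fit(complaints, labels)

# Step 4: put the student to work in your system.
print(student.predict("the package never arrived, it is very late"))
```

The point of the sketch is the data flow, not the models: the student never sees human labels, only the teacher's, which is exactly what makes the manual-labeling savings possible.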
You save on:
- Manual labeling
- Infrastructure costs
- Third-party service costs
Problems You Might Encounter
- Lack of data: if you don’t have enough data to train the small model, it won’t work well
- Teacher quality: if the large model doesn’t label well, the small model won’t either
- Privacy: you must respect your customers’ data
The founder and CEO of Hugging Face, Clément Delangue, says that distilled models are the future of generative AI, and what he says seems sensible, doesn’t it?

> My prediction: in 2024, most companies will realize that smaller, cheaper, more specialized models make more sense for 99% of AI use-cases. The current market & usage is fooled by companies sponsoring the cost of training and running big models behind APIs (especially with cloud incentives).

