Month after month, foundation model developers like Meta, Google, and OpenAI release newer, larger LLMs. These models are more powerful and flexible than their smaller predecessors. However, for many organizations they are expensive to set up and run in production. For example, using GPT-4o to summarize a 10-page PDF could cost $4.50. This becomes unnecessarily expensive for enterprise use cases, especially when users interact through a chat interface that includes the document and the summary in each chat message’s context. A single such session with 20 chat messages brings the total to $90.

Using a technique called model distillation, practitioners can approach the accuracy of larger models while still benefiting from the lower latency and lower cost of smaller models.

In model distillation, a “teacher” LLM produces data that is used to fine-tune a smaller “student” LLM. Let’s say you need to summarize internal documents in real time. If you have an existing set of a few hundred documents similar to what you expect in production, you can generate summaries for them using GPT-4o. These pairs of documents and their summaries become your dataset for supervised fine-tuning of a smaller model such as the 7-billion-parameter LLaMA model.
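To make this concrete, here is a minimal sketch of the data-generation step, assuming the OpenAI Python SDK and a placeholder load_documents function standing in for your own document store. The JSONL chat format at the end is one common convention for supervised fine-tuning data; your fine-tuning tooling may expect a different layout.

```python
# Sketch: build a distillation dataset by having a teacher LLM (GPT-4o)
# summarize existing documents. load_documents() is a placeholder for
# however you access your own internal documents.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def load_documents() -> list[str]:
    # Placeholder: return the few hundred documents you already have.
    raise NotImplementedError


teacher_pairs = []
for doc in load_documents():
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Summarize the document in 3-5 sentences."},
            {"role": "user", "content": doc},
        ],
    )
    summary = response.choices[0].message.content
    teacher_pairs.append({"document": doc, "summary": summary})

# Save document/summary pairs as JSONL in a chat format for fine-tuning
# the student model.
with open("distillation_dataset.jsonl", "w") as f:
    for pair in teacher_pairs:
        record = {
            "messages": [
                {"role": "user", "content": f"Summarize:\n\n{pair['document']}"},
                {"role": "assistant", "content": pair["summary"]},
            ]
        }
        f.write(json.dumps(record) + "\n")
```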

Using LLaMA 7B in production rather than something like GPT-4o for our example task of summarizing documents would be a 97% cost savings, bringing $4.50 down to a negligible 14¢. Because the student LLaMA 7B model is fine-tuned on the teacher’s outputs, it can perform nearly as well as the teacher on this particular task and this set of documents. Empirical results show that for some tasks, the difference in quality between the student and the teacher can be less than 1%.

The generative AI industry is collectively betting that most use cases will benefit from using a distilled smaller model trained on the outputs of a larger model. In fact, LLaMA 3.1’s license explicitly permits users to distill the 405-billion-parameter model into smaller models.

The key to distillation is ensuring the quality of the distilled model, which is measured through quantitative evaluations. Then, when the model is used for inference in production, the production application needs to be monitored for changes in quality. If the evaluation metrics show a dip in quality due to a shift in the real-world usage of the LLM application, the student model needs to be re-trained. This is why quantitative evaluations and continuous fine-tuning are so important to maintaining the quality of a model, regardless of whether it was distilled.
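As one illustration of such a quantitative check, the sketch below scores student summaries against teacher summaries on a held-out sample and flags the student for re-training when the average score dips. It assumes the rouge-score package; ROUGE-L is just one possible metric, and the 0.85 threshold is purely illustrative.

```python
# Sketch: monitor distilled-model quality by comparing student summaries
# to teacher summaries and flagging a drop below a chosen threshold.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)


def needs_retraining(teacher_summaries: list[str],
                     student_summaries: list[str],
                     threshold: float = 0.85) -> bool:
    # Average ROUGE-L F1 of student outputs against teacher outputs.
    scores = [
        scorer.score(teacher, student)["rougeL"].fmeasure
        for teacher, student in zip(teacher_summaries, student_summaries)
    ]
    average = sum(scores) / len(scores)
    return average < threshold
```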

To learn more about Plum Defense’s LLM evaluations and supervised fine-tuning product, sign up on our website: https://www.plumdefense.com