July 18th, 2024

Every chatbot in use today, from ChatGPT to custom chatbots built from open-source large language models (LLMs), has been instruction-tuned. An LLM, like any language model, is simply a next-token predictor. To get a vanilla LLM to interact with a user like a chatbot, it must be fine-tuned on tens of thousands of example conversations between a user and an assistant. This process, called supervised fine-tuning (also known as instruction tuning), is a basic building block of productionizing an LLM application.
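Concretely, a single supervised fine-tuning example is one recorded conversation, flattened into the token stream the model learns to continue. A minimal sketch, using an illustrative chat format (the role tags below are placeholders; real chat templates vary by model):

```python
# One supervised fine-tuning example, in a common "messages" chat layout.
example = {
    "messages": [
        {"role": "system", "content": "You are a helpful support assistant."},
        {"role": "user", "content": "How do I reset my password?"},
        {"role": "assistant", "content": "Open Settings, choose Account, then select Reset Password."},
    ]
}

def to_training_text(example):
    """Flatten the conversation into the single token stream the
    next-token predictor is trained to continue, turn by turn."""
    return "\n".join(
        f"<|{m['role']}|>{m['content']}" for m in example["messages"]
    )

print(to_training_text(example))
```

During training, the model's loss is computed on the assistant turns, so it learns to produce the assistant's side of conversations like these.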

Publicly available LLMs are general-purpose models, and used directly they rarely produce high-quality results in specialized business applications. To be suitable for production use, they must be fine-tuned on domain-specific examples, and fine-tuned continuously as that domain evolves.

A modern supervised fine-tuning solution is built around low-rank adapters. Low-rank adapters are relatively small matrices (millions of elements, not billions) that sit alongside layers of the LLM and act as a sidekick. Their job is to translate the inputs and outputs of LLM layers into the proper domain, and because a trained adapter can be merged back into the layer's weights, it adds no latency in production.
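The idea can be sketched for a single linear layer: the frozen pretrained weight W is paired with two small trainable matrices, B and A, whose product is a low-rank correction to the layer's output. All dimensions below are toy values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r = 1024, 1024, 8   # layer size and adapter rank (illustrative)

W = rng.standard_normal((d_out, d_in))   # frozen pretrained weights
A = rng.standard_normal((r, d_in))       # trainable adapter half
B = rng.standard_normal((d_out, r))      # trainable adapter half

def adapted_forward(x):
    # The adapter's low-rank correction B @ A @ x is added to the
    # frozen layer's output.
    return W @ x + B @ (A @ x)

# The adapter holds a tiny fraction of the layer's parameters.
layer_params = W.size                # 1024 * 1024 = 1,048,576
adapter_params = A.size + B.size     # 2 * 8 * 1024 = 16,384 (~1.6%)

# After training, the correction merges into W, so serving runs a single
# matrix multiply and the adapter adds no extra latency.
W_merged = W + B @ A
x = rng.standard_normal(d_in)
y = adapted_forward(x)
```

The merge step is the reason adapters are latency-free in production: `W_merged @ x` produces the same output as running the frozen layer plus the adapter side by side.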

During the fine-tuning process, low-rank adapters are trained on gold-standard examples to teach an LLM how to respond. If the dataset is high quality and diverse, the fine-tuned LLM's output quality measurably improves with as few as 100 examples, rather than tens of thousands. Traditionally, these examples would be handcrafted by an expert, but writing them is time-consuming and labor-intensive. At Plum Defense, we automatically generate examples that are on par with human-written ones. This allows for continuous fine-tuning, which increases the quality of the LLM's responses on an ongoing basis.
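To illustrate the mechanics of adapter-only training, here is a toy NumPy loop: the pretrained weight W stays frozen while gradient descent fits only the small matrices A and B to a single gold-standard input/output pair. All dimensions, scales, and the learning rate are illustrative, not production values.

```python
import numpy as np

rng = np.random.default_rng(1)
d, r, lr = 64, 4, 0.05

W = rng.standard_normal((d, d)) / np.sqrt(d)   # frozen pretrained layer
A = rng.standard_normal((r, d))                # trainable adapter half
B = np.zeros((d, r))                           # trainable, starts at zero

x = rng.standard_normal(d)
x /= np.linalg.norm(x)                         # one gold-standard input
target = rng.standard_normal(d) / np.sqrt(d)   # its desired output

W_before = W.copy()
initial_loss = float(np.sum((W @ x - target) ** 2))

for _ in range(200):
    err = W @ x + B @ (A @ x) - target   # forward pass and output error
    B -= lr * np.outer(err, A @ x)       # gradient step on B only
    A -= lr * np.outer(B.T @ err, x)     # gradient step on A only

final_loss = float(np.sum((W @ x + B @ (A @ x) - target) ** 2))
```

Because only A and B receive updates, the memory and compute cost of each training step is a small fraction of full fine-tuning, which is what makes frequent re-training practical.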

By combining well-trained low-rank adapters with a well-written system prompt, a machine learning practitioner can produce a robust application that conforms well to the required output and is fast enough to use in production. A good system prompt conveys the intention but is concise enough to leave room for retrieval-augmented generation (RAG) systems to inject relevant facts into the application. The system prompt's length also directly affects latency: the shorter the prompt, the faster the application's average response time. With advanced techniques like soft-prompting, the size of the system prompt can be reduced significantly, which speeds up response time.

If you’d like to learn more about Plum Defense’s continuous fine-tuning and soft-prompting system for your production application, talk to us: [email protected]