LLM Knowledge Distillation: Teacher-Student Guide

Q: What is knowledge distillation in machine learning?

Knowledge distillation is a model-compression technique in which a small "student" model is trained to reproduce the behaviour of a larger, more capable "teacher" model, so the student can be deployed at a fraction of the teacher's cost and latency while retaining most of its quality on the target task. The idea was formalised by Hinton, Vinyals, and Dean in 2015: rather than training the small model only on hard ground-truth labels, you train it to match the teacher's output distribution, which carries richer information about how the teacher generalises. That extra signal is sometimes called "dark knowledge": the relative probabilities the teacher assigns to all the answers it did not choose. The landmark practical demonstration was DistilBERT, which retained roughly 97 percent of BERT's language-understanding capability with 40 percent fewer parameters while running 60 percent faster. Applied to modern LLMs, the dominant industrial form has shifted to something operationally simpler: generate high-quality outputs from a frontier teacher model on a large sample of your real task inputs, filter them, and fine-tune a small open-weight model on the resulting pairs. The student then serves the task on your own infrastructure at a small fraction of the frontier model's per-token price.

Q: What is the difference between distillation and fine-tuning?

Fine-tuning is the training mechanism; distillation is a strategy for where the training targets come from. Ordinary supervised fine-tuning requires a dataset of inputs paired with correct outputs, which usually means thousands of human-labelled examples that are expensive and slow to produce. Distillation manufactures those targets automatically: you collect representative inputs, which production traffic supplies for free, and have the teacher model generate the outputs, so little or no human labelling budget is needed. The training step that follows is typically standard supervised fine-tuning, often with LoRA or QLoRA to keep compute cheap, so mechanically the two look identical; the difference is entirely in the data supply chain. This also defines distillation's practical ceiling: the student learns to imitate the teacher, mistakes included, and will rarely exceed the teacher's quality on the task, whereas fine-tuning on carefully curated human labels can in principle push past what any general-purpose model does out of the box. In practice teams combine them: a distilled base plus a smaller set of human-corrected examples for the cases the teacher gets wrong, which gives the coverage of machine labelling with targeted human correction where it matters most.
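
The data supply chain described above can be sketched in a few lines of Python. This is a minimal illustration, not a production pipeline: `build_transfer_set`, `to_jsonl`, and `fake_teacher` are hypothetical names, the prompt/completion record layout is an assumption, and in real use the `teacher` callable would wrap whichever model API you actually call.

```python
import json
from typing import Callable, Iterable


def build_transfer_set(inputs: Iterable[str],
                       teacher: Callable[[str], str]) -> list[dict]:
    """Pair each raw input with the teacher's output for that input."""
    return [{"prompt": text, "completion": teacher(text)} for text in inputs]


def to_jsonl(pairs: list[dict]) -> str:
    """Serialise the pairs one JSON object per line (assumed layout)."""
    return "\n".join(json.dumps(pair) for pair in pairs)


def fake_teacher(text: str) -> str:
    """Stand-in for a real teacher model call (hypothetical)."""
    return "billing" if "invoice" in text.lower() else "other"


pairs = build_transfer_set(
    ["Where is my invoice?", "Reset my password"], fake_teacher
)
print(to_jsonl(pairs))
```

The output of this step is an ordinary supervised fine-tuning dataset; nothing downstream needs to know the labels came from a model rather than a person.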

Q: When should I distill instead of just using a frontier model API?

Distill when three conditions hold simultaneously. First, the task is narrow and stable: classification, extraction, routing, summarisation in a fixed format, templated drafting — something a small model can master because it does not require the teacher's full breadth, and something whose definition will not shift substantially month to month. Second, the volume is high enough that unit economics dominate: if you are pushing tens of millions of tokens per day through the task, the 30 to 100 times unit-cost difference between a frontier API and a self-hosted small student is real money; at a few thousand requests a day, the engineering cost of building and maintaining a distillation pipeline outweighs the savings. Third, you can measure quality: you need a held-out evaluation set and an acceptable-gap threshold against the teacher, or you cannot know whether the student is good enough to ship. Do not distill when the workload genuinely exercises frontier capability — open-ended reasoning, novel problem solving, broad knowledge — because a small student preserves narrow-task quality, not general intelligence, and the quality cliff will surface in production. The common successful pattern is hybrid: the student absorbs the high-volume routine slice while a router escalates the hard 5 to 10 percent to the teacher.
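
The volume condition is easiest to reason about as arithmetic. The sketch below is a rough break-even estimate under loudly stated assumptions: every price and the monthly pipeline cost are made-up illustrative figures (chosen to give a 50 times unit-cost ratio, inside the 30 to 100 times range mentioned above), and `monthly_saving` is a hypothetical helper, not a standard formula.

```python
def monthly_saving(tokens_per_day: float,
                   api_price_per_mtok: float,
                   student_price_per_mtok: float,
                   pipeline_cost_per_month: float) -> float:
    """Net monthly saving from serving a task on a student instead of an API.

    Prices are per million tokens. A negative result means the pipeline
    costs more to run than it saves.
    """
    mtok_per_day = tokens_per_day / 1_000_000
    daily_api = mtok_per_day * api_price_per_mtok
    daily_student = mtok_per_day * student_price_per_mtok
    return 30 * (daily_api - daily_student) - pipeline_cost_per_month


# Illustrative assumptions only: $10 vs $0.20 per million tokens,
# and $8,000 per month to build and maintain the pipeline.
high_volume = monthly_saving(50_000_000, 10.0, 0.20, 8_000)
low_volume = monthly_saving(2_000_000, 10.0, 0.20, 8_000)
print(f"high volume: {high_volume:+.0f}, low volume: {low_volume:+.0f}")
```

With these assumed figures the high-volume task comes out clearly positive and the low-volume task clearly negative; substitute your own prices and engineering costs before drawing any conclusion.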

Q: What are sequence-level, logit, and chain-of-thought distillation?

Sequence-level distillation, also called hard-label distillation, has the teacher generate complete output texts for a large set of inputs, and trains the student with ordinary supervised fine-tuning on those input-output pairs. It is the 2026 industrial default because it only requires the teacher's text — meaning a closed frontier API can teach an open-weight student across model families — and it slots into standard fine-tuning tooling. Logit distillation, the original research formulation, trains the student to match the teacher's full probability distribution over tokens, usually softened with a raised temperature; the distribution carries more information per example than a single sampled output, so it can transfer more quality from the same data, but it requires access to the teacher's logits, which generally means both models run in your own stack. Chain-of-thought or rationale distillation has the teacher produce its reasoning steps alongside the final answer, and trains the student to generate both; this measurably lifts small-model performance on multi-step reasoning tasks, because the student learns the path and not just the destination. These compose: a common recipe is sequence-level distillation with rationales included, filtered for correctness, trained via LoRA — cheap, cross-family, and strong on reasoning-adjacent tasks.
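
To make the logit variant concrete, here is a minimal pure-Python sketch of a temperature-softened distillation loss for a single token position. It assumes you already have both models' logits over the same vocabulary; the function names are illustrative, and the multiplication by the squared temperature follows the convention from the Hinton et al. paper for keeping gradient magnitudes comparable across temperatures. A real training loop would compute this over batches in a tensor library and often mix it with the ordinary hard-label loss.

```python
import math


def softmax(logits: list[float], temperature: float = 1.0) -> list[float]:
    """Softmax of logits divided by a temperature (higher = softer)."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: list[float], q: list[float]) -> float:
    """KL(p || q) for two distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distillation_loss(teacher_logits: list[float],
                      student_logits: list[float],
                      temperature: float = 2.0) -> float:
    """KL between softened teacher and student distributions, times T^2."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return temperature ** 2 * kl_divergence(p_teacher, p_student)


teacher = [4.0, 2.0, 0.5]
print(distillation_loss(teacher, [1.0, 1.0, 1.0]))  # untrained-looking student
print(distillation_loss(teacher, teacher))          # perfect match
```

The loss is zero only when the student reproduces the teacher's whole distribution, including the probabilities of the answers the teacher did not pick, which is exactly the information a single sampled output discards.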

Q: How much data do I need to distill an LLM?

For a narrow task, useful results typically start around ten thousand high-quality teacher-labelled examples, with meaningful gains continuing into the low hundreds of thousands; beyond that, returns depend almost entirely on whether the new data adds coverage or merely repeats what is already represented. The honest answer is that composition matters far more than count. Three properties of the transfer set dominate outcomes. Coverage: sample from real production traffic across the whole task distribution, deliberately including the long tail, the ambiguous cases, and the adversarial ones, because the student will fail confidently on anything it never saw. Quality: the teacher makes mistakes too, and every unfiltered error becomes a permanent conviction of the student, so validate outputs with schema checks, self-consistency across resamples, or a judge pass, and discard the bottom slice. Deduplication: near-duplicate examples inflate the count without adding information and can skew the student toward overrepresented patterns. Ten thousand excellent, deduplicated, well-covered pairs reliably beat a hundred thousand noisy ones. Budget-wise, generating a transfer set of fifty thousand examples through a frontier API is usually a few hundred dollars of inference — trivial next to what the resulting student saves in its first week at volume.
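
The quality and deduplication steps can be sketched as a single cleaning pass. This is a simplified illustration for a classification-style task, under several assumptions: each prompt has been sent to the teacher a few times (the "resamples"), the schema check is just membership in an allowed label set, and duplicates are detected by exact match after lowercasing and whitespace normalisation, which is much cruder than real near-duplicate detection. All function names and the 0.6 agreement threshold are hypothetical.

```python
import re
from collections import Counter


def normalise(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def self_consistent(samples: list[str], min_agreement: float = 0.6):
    """Return the majority answer if enough resamples agree, else None."""
    answer, count = Counter(samples).most_common(1)[0]
    return answer if count / len(samples) >= min_agreement else None


def clean_transfer_set(rows, allowed_labels: set[str]) -> list[dict]:
    """Drop repeated prompts, inconsistent answers, and out-of-schema labels."""
    seen = set()
    kept = []
    for prompt, samples in rows:
        key = normalise(prompt)
        if key in seen:
            continue  # a normalised copy of this prompt was already processed
        seen.add(key)
        label = self_consistent(samples)
        if label is None or label not in allowed_labels:
            continue  # teacher disagreed with itself or broke the schema
        kept.append({"prompt": prompt, "completion": label})
    return kept


rows = [
    ("Where is my invoice?", ["billing", "billing", "billing"]),
    ("where is  my INVOICE?", ["billing", "billing", "other"]),
    ("Reset password", ["account", "other", "billing"]),
    ("Cancel plan", ["refund", "refund", "refund"]),
]
clean = clean_transfer_set(rows, {"billing", "account", "other"})
print(clean)
```

In this toy run only the first row survives: the second is a duplicate, the third fails the agreement check, and the fourth carries a label outside the schema.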

Q: Is it legal to distill from a frontier model API?

It depends on the provider's terms of service, and this is a genuine legal checkpoint, not a formality. Provider terms differ in what they permit you to do with API outputs: some explicitly support distillation into models you deploy for your own products — OpenAI, for example, ships a first-party model-distillation feature for training smaller models within its own platform — while the same providers' usage policies commonly restrict using outputs to train models that compete with the provider. The distinction between "a small internal model that serves your product's narrow task" and "a competing general-purpose model" is where the legal nuance lives, and where you should involve counsel if the investment is significant. Practical guidance: read the current terms of the specific teacher you plan to use, because these clauses change; keep provenance records of what data came from which teacher under which terms; and note that using open-weight models as teachers, where the licence explicitly permits derivative training, removes most of the ambiguity — one reason permissively-licensed strong open models are popular teachers. None of this is legal advice, but "we never checked whether we were allowed to" is not a position you want to discover after the student is serving production traffic.

Q: How do I evaluate whether a distilled student is good enough to ship?

Freeze a held-out evaluation set before training — drawn from the same production distribution as the transfer set but never used in it — and measure both teacher and student on it, so the quantity you manage is the gap, not an absolute score. Define the acceptable gap per task in business terms: a routing task might tolerate two points of accuracy for a 50 times cost reduction, while a customer-facing drafting task might not. Slice the evaluation rather than trusting the aggregate, because distilled students characteristically fail unevenly: strong on the well-covered head of the distribution, weak on the tail, specific formats, or input types that were thin in the transfer set — aggregate parity can hide a segment that is badly broken. Include adversarial and edge cases explicitly. Then treat the ship decision as a gate, not a vibe: the student deploys only when every slice is within its threshold, and the same harness re-runs on every re-distillation. Post-launch, keep the teacher as the standing baseline — sample a fraction of live traffic, run it through both models, and track divergence, because task drift erodes the student silently. When the gap widens past threshold, that is the trigger for re-sampling the transfer set and re-distilling, on a cadence the drift rate dictates.
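
A per-slice ship gate of the kind described above might look like the following sketch. It assumes each evaluation record has already been scored as correct or incorrect for both models and tagged with a slice name; the record layout, the function names, and the thresholds are all illustrative assumptions.

```python
def slice_gaps(records: list[dict]) -> dict[str, float]:
    """Teacher accuracy minus student accuracy, computed per slice."""
    totals: dict[str, dict[str, int]] = {}
    for r in records:
        t = totals.setdefault(r["slice"], {"n": 0, "teacher": 0, "student": 0})
        t["n"] += 1
        t["teacher"] += int(r["teacher_correct"])
        t["student"] += int(r["student_correct"])
    return {name: (t["teacher"] - t["student"]) / t["n"]
            for name, t in totals.items()}


def ship_gate(records: list[dict], thresholds: dict[str, float]):
    """Pass only if every slice's gap is within that slice's threshold."""
    gaps = slice_gaps(records)
    failures = {name: gap for name, gap in gaps.items()
                if gap > thresholds[name]}
    return (not failures), failures


# Toy data: the student matches the teacher on the head but lags on the tail.
head = [{"slice": "head", "teacher_correct": True, "student_correct": True}
        for _ in range(10)]
tail = [{"slice": "tail", "teacher_correct": i < 9, "student_correct": i < 6}
        for i in range(10)]

ok, failures = ship_gate(head + tail, {"head": 0.02, "tail": 0.05})
print(ok, failures)
```

Requiring an explicit threshold for every slice is a deliberate choice here: a slice with no agreed threshold raises an error instead of silently passing the gate.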

Q: Can I combine distillation with quantization?

Yes, and you almost always should — they compress along different axes and multiply cleanly. Distillation reduces the parameter count itself, moving you from a frontier-scale teacher to a student of perhaps 1 to 8 billion parameters that has been trained to keep your task's quality. Quantization then reduces the memory footprint of that student's weights, from 16-bit floats down to 8-bit or 4-bit representations, cutting memory another 2 to 4 times and usually improving token throughput because generation is memory-bandwidth-bound. The combined effect is what makes the most aggressive deployment targets feasible: a distilled-then-quantized model is the standard recipe behind on-device and edge inference, where phone-sized models running offline are almost all quantized, distilled children of much larger parents. The order matters: distill first, then quantize the finished student, and evaluate after each lossy step separately — each transformation costs some quality, the losses compound, and if you only measure at the end you cannot tell which step to fix when a slice regresses. The same held-out evaluation harness that gated the distillation should gate the quantized artefact before it ships, with per-slice thresholds rather than a single aggregate number.
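
The memory arithmetic behind the quantization step is simple enough to check by hand. The sketch below estimates weight storage only, as a back-of-the-envelope figure: it ignores the KV cache, activations, and the small overhead of quantization scales, and `weight_memory_gb` is a hypothetical helper using decimal gigabytes.

```python
def weight_memory_gb(n_params: float, bits_per_weight: int) -> float:
    """Approximate storage for the weights alone, in decimal gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9


student_params = 8e9  # an 8-billion-parameter student, as an example
for bits in (16, 8, 4):
    print(f"{bits:>2}-bit: {weight_memory_gb(student_params, bits):.1f} GB")
```

For an 8-billion-parameter student this gives roughly 16 GB at 16-bit, 8 GB at 8-bit, and 4 GB at 4-bit, which is the 2 to 4 times reduction described above, layered on top of whatever the distillation step already saved in parameter count.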