Two weeks ago, Anthropic published a blog post accusing three Chinese AI companies of running “industrial-scale distillation attacks” against Claude. OpenAI sent a similar memo to Congress about DeepSeek. The controversy? These companies allegedly used millions of API calls to copy the behavior of frontier models – training their own smaller, cheaper versions without permission.
OpenAI distills GPT-4o into GPT-4o-mini. Anthropic distills Claude into cheaper variants. Google does it. Meta does it. Amazon Bedrock launched a whole distillation feature in October 2025. The difference? When you distill someone else’s API model, it’s “adversarial exploitation.” When they distill their own, it’s “efficient deployment.” Same technique. Different rules depending on who owns the keys.
What Distillation Actually Is
You train a small “student” model to mimic a large “teacher” model. The student learns not just the teacher’s final answers, but the confidence behind those answers – the probability distribution across all possible outputs.
Why does this work?
A well-trained teacher model encodes useful information even in its wrong answers. GPT-4o answers “Paris” to “What’s the capital of France?” – 95% confidence to Paris, 3% to Lyon, 1% to Marseille, 1% scattered elsewhere. That 3% for Lyon tells the student something: Lyon is more capital-like than, say, Toulouse. This richer signal – “soft labels” – lets the student model learn faster and generalize better than hard labels (Paris = correct, everything else = wrong).
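To make soft labels concrete, here's a pure-Python sketch. The probabilities are illustrative, not real GPT-4o outputs; the point is to show how each class contributes to the training loss under hard versus soft targets:

```python
import math

# Hypothetical teacher distribution for "What's the capital of France?"
teacher = {"Paris": 0.95, "Lyon": 0.03, "Marseille": 0.01, "Toulouse": 0.01}
# Hard label: one-hot, all the structure between wrong answers is discarded
hard = {"Paris": 1.0, "Lyon": 0.0, "Marseille": 0.0, "Toulouse": 0.0}
# A student mid-training, with its own (imperfect) distribution
student = {"Paris": 0.80, "Lyon": 0.15, "Marseille": 0.03, "Toulouse": 0.02}

def per_class_loss(target, pred):
    # Each class's contribution to cross-entropy: -p_target * log(p_student)
    return {k: -p * math.log(pred[k]) for k, p in target.items()}

# Hard label: only "Paris" contributes; the Lyon/Marseille entries are zero,
# so the student gets no gradient signal about them.
print(per_class_loss(hard, student))
# Soft labels: Lyon and Marseille also carry (small) training signal.
print(per_class_loss(teacher, student))
```

That nonzero Lyon term is exactly the "richer signal" described above: the soft target grades the student's ranking of wrong answers, not just its top pick.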
Geoffrey Hinton, Oriol Vinyals, and Jeff Dean formalized this in their 2015 paper “Distilling the Knowledge in a Neural Network”. A decade later, it’s how most efficient AI models get built.
One thing most guides won’t tell you: if you’re distilling for a specialized task (like customer support for your product), the student can actually outperform the teacher on that narrow domain. You’re trading breadth for depth – the student forgets everything irrelevant and doubles down on what matters.
How to Distill a Model
Step 1: Pick Your Teacher and Student
Teacher = the expensive model that works well. Student = the cheap model you want to deploy. For OpenAI, maybe GPT-4o (teacher) → GPT-4o-mini (student). For open models, Qwen3-32B → Qwen3-8B.
The student should be smaller but not tiny. A 7B-parameter student distilling from a 70B teacher works. A 1B student might lack the capacity to absorb what the teacher knows – this is the “capacity gap,” and it kills distillation.
Step 2: Generate Training Data From the Teacher
You need the teacher’s outputs on your task. Two approaches: use your real data (if you have user queries, run them through the teacher and save the responses – OpenAI’s API stores completions for 30 days by default), or synthesize data (prompt the teacher to generate question-answer pairs for your domain, like “Generate 50 customer support questions about password resets, then answer each one”).
How much data? Stanford’s Alpaca used 52K examples and cost under $600 (as of 2023). DeepSeek allegedly used 16 million API calls per Anthropic’s February 2026 accusations. You’re probably somewhere in between – start with 5K-10K high-quality examples.
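Once you've collected prompt/response pairs from the teacher, the data-prep step is just writing them out as JSONL. A minimal sketch; the example pairs and filename are placeholders, and the `messages` layout shown matches the chat format that fine-tuning APIs like OpenAI's expect:

```python
import json

# Assume these came from running your prompts through the teacher.
pairs = [
    ("How do I reset my password?", "Go to Settings > Security and choose Reset."),
    ("Why was my card declined?", "Declines usually mean the bank flagged the charge."),
]

# One JSON object per line, in chat format: the teacher's answer becomes
# the assistant turn the student will be trained to reproduce.
with open("distillation_data.jsonl", "w") as f:
    for question, teacher_answer in pairs:
        record = {
            "messages": [
                {"role": "user", "content": question},
                {"role": "assistant", "content": teacher_answer},
            ]
        }
        f.write(json.dumps(record) + "\n")
```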
Step 3: Fine-Tune the Student on Teacher Outputs
Take the teacher’s outputs and use them as training targets. In OpenAI’s platform, this is just supervised fine-tuning – upload a JSONL file, hit “Distill,” wait.
For open models, use Hugging Face Transformers or PyTorch. The key hyperparameter is temperature: when computing the distillation loss, divide both models' logits by a temperature (e.g., 2-5) before the softmax to "soften" the teacher's probability distribution. This amplifies the signal from low-probability classes. (Note this is the loss-side temperature from Hinton's paper, not the sampling temperature you set on an API call.)
```python
from transformers import Trainer, TrainingArguments

# Assumes student_model is a loaded checkpoint and teacher_outputs is a
# tokenized Dataset built from the teacher's responses. As written, this is
# plain supervised fine-tuning on the teacher's text; logit-level distillation
# with a temperature-softened loss requires subclassing Trainer's loss.
training_args = TrainingArguments(
    output_dir="./distilled-model",
    per_device_train_batch_size=8,
    num_train_epochs=3,
    logging_steps=100,
)

trainer = Trainer(
    model=student_model,
    args=training_args,
    train_dataset=teacher_outputs,
)
trainer.train()
```
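If you do have the teacher's logits (open models, not APIs), the temperature trick looks like this in miniature. A pure-Python toy over one example's logit vector, with made-up numbers; real code would use framework tensor ops over batches:

```python
import math

def softmax(logits, temperature=1.0):
    # Higher temperature flattens the distribution, amplifying small logits.
    exps = [math.exp(z / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=3.0):
    # KL divergence between temperature-softened distributions, scaled by T^2
    # (the correction from Hinton et al. 2015 that keeps gradients comparable).
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return kl * temperature ** 2

teacher_logits = [9.0, 3.0, 1.0, 1.0]            # peaked: teacher is confident
print(softmax(teacher_logits))                    # wrong answers nearly invisible
print(softmax(teacher_logits, temperature=3.0))   # softened: now they carry signal
print(distillation_loss([5.0, 2.0, 1.0, 0.5], teacher_logits))
```

Note that identical student and teacher logits give a loss of zero, and raising the temperature visibly lifts the probability mass on the non-argmax classes, which is the whole point of softening.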
Step 4: Evaluate Before You Deploy
Run the student on a held-out test set. Compare its outputs to the teacher’s. If the student’s accuracy is within 5-10% of the teacher on your task, you’re good. If worse, you need more training data or a bigger student model. OpenAI’s Evaluations tool lets you set up criteria like “Does the student’s answer match the teacher’s intent?” and run it across your test set.
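A held-out comparison can start as simply as an exact-match agreement rate. A toy sketch with placeholder data; real evaluations would use fuzzier criteria like semantic similarity or an LLM judge, but the shape is the same:

```python
# Held-out examples with both models' answers (placeholder data).
held_out = [
    {"q": "Reset password?", "teacher": "Settings > Security", "student": "Settings > Security"},
    {"q": "Refund window?",  "teacher": "30 days",             "student": "14 days"},
]

matches = sum(1 for ex in held_out if ex["student"] == ex["teacher"])
agreement = matches / len(held_out)
print(f"Student/teacher agreement: {agreement:.0%}")

# Rough rule of thumb from the text: ship if the student lands within
# 5-10% of the teacher on your task.
if agreement < 0.90:
    print("Consider more training data or a larger student.")
```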
The Legal Trap
Distillation is legal. The technique is just machine learning. But most API providers ban it in their terms of service. According to a January 2025 legal analysis from Law.asia, OpenAI, Anthropic, Mistral, and xAI all include "anti-competitive distillation" clauses. The exact wording varies, but the intent is the same: you can't use API outputs to train a model that competes with theirs.
What counts as “competing”?
They don't define it. Distill GPT-4o to build a coding assistant that only handles Python debugging for your internal team? Probably not competing. Distill it to build a general-purpose chatbot you sell to other companies? Definitely competing. The gray area in between – say, a customer support bot that only handles your company's FAQs – is where most real-world use cases live, and the ToS gives you nothing to go on.
The companies can detect this. OpenAI’s memo to Congress (as reported by Bloomberg and Rest of World in February 2026) mentioned they identified “accounts associated with DeepSeek employees” and tracked usage patterns for distillation – high query volumes, systematic prompt variation, third-party routers masking the source. Hammering an API with 10,000 calls per hour using similar prompt templates? Red flag.
Turns out the rules aren’t about what you build. They’re about who you are and what threat you pose. OpenAI, Anthropic, Google – they all run internal distillation pipelines. Amazon Bedrock added Claude distillation support in October 2025. Microsoft expanded Azure OpenAI distillation in January 2025. The technique is standard. The crime is doing it to someone else’s model without permission. Which raises a question: if the method is legal and the results are just compressed approximations, what exactly is being stolen?
When Distillation Backfires
Teacher model gets updated mid-project: Your distilled student is now based on an outdated teacher. GPT-4o-mini improved dramatically between late 2024 and early 2025 – faster than most distillation projects take to ship. Claude 3.5 Haiku got 60% faster on AWS Trainium2 as of early 2025. If the cheap base model improves on its own faster than you can distill, your distillation effort was wasted.
API costs exceed alternatives: Anthropic estimated 16 million exchanges for the alleged DeepSeek distillation (February 2026 report). At roughly $15 per million tokens (Claude API pricing), and assuming around a thousand tokens per exchange, that's $240K+. For many tasks, fine-tuning an open-source model from scratch costs less.
Student inherits teacher's bugs: If the teacher has a jailbreak vulnerability or outputs biased content, the student learns that too. You're freezing the teacher's flaws into production. OpenAI flagged a related risk in their Congressional memo – distilled models can shed the teacher's safety guardrails, enabling misuse in high-risk areas like bio/chem. So the failure cuts both ways: the student can inherit the teacher's flaws, and it can lose the teacher's protections.
API-only distillation loses signal: Most distillation guides assume you have access to the teacher’s soft outputs (probability distributions). If you’re distilling via API, you only get hard text completions – no logits, no confidence scores. This forces you into a less effective method (response-based only, not feature-based).
The Geopolitical Angle
DeepSeek’s R1 model, released in January 2025, claimed to match GPT-4-level performance for $6 million in training costs. US companies spend billions. If distillation can close that gap, it shifts the AI power balance.
Actually, no.
A distilled model can’t create capabilities the teacher doesn’t have – it can only compress what’s already there. DeepSeek’s efficiency likely came from a mix of distillation plus innovations in training techniques (like reinforcement learning from AI feedback). Distillation alone doesn’t explain the cost gap. But the controversy reveals something: the line between “legitimate optimization” and “IP theft” depends entirely on who’s doing it. Distill your own models or open models (LLaMA, Qwen, Mistral)? Fine. Distill a competitor’s API without permission? Lawsuit – or if you’re a Chinese company, a Congressional hearing.
OpenAI reportedly started investigating right after DeepSeek R1 launched in January 2025. The public accusations came a year later in February 2026. Why wait? DeepSeek’s success had already challenged US export controls on AI chips. Companies wanted policy backing before escalating. Some analysts suggest the timing was about justifying stricter chip restrictions and securing government support for US AI dominance. The technical violation might be real, but the response is geopolitical.
When NOT to Distill
Don’t distill if:
The base small model already works. Test GPT-4o-mini or Claude Haiku on your task first. If they work, you’re done. Distillation is for closing a specific performance gap, not a default step.
You don’t have 5K+ high-quality examples. Below that, distillation rarely beats few-shot prompting. Better off crafting a great system prompt.
Your task changes frequently. Distillation freezes knowledge at a point in time. If your domain shifts every few months (e.g., news summarization), you’ll be constantly re-distilling. Prompt engineering is more flexible.
You’re distilling a closed API model you don’t own. See the ToS trap. Even if you’re “just experimenting,” you risk account suspension. Stick to open models.
What This Means for You
Distillation isn't going away. Amazon Bedrock, Azure OpenAI, and Google Vertex all offer managed distillation now – they handle data generation, training, and evaluation in one pipeline. If you're paying $50K+/year in API costs for a repetitive task, distillation can cut that by 70-90%.
But the OpenAI/DeepSeek controversy just clarified the rules: you can distill what you own, not what you rent. Model behind an API with anti-distillation ToS? Assume you can’t touch it. Open-source or self-trained? Distill away.
Next step: Grab an open model (LLaMA 3, Qwen 3, Mistral) and try distilling it for a narrow task you care about. Generate 1K examples with the teacher, fine-tune a smaller student, compare outputs. You’ll learn more in an afternoon than reading ten more tutorials.
FAQ
Can I distill ChatGPT for my company’s internal use without violating OpenAI’s ToS?
Probably not. OpenAI’s terms prohibit using API outputs to train models that “replicate or compete with” their services. Internal use might fly under the radar, but there’s no safe harbor. Safer move: use OpenAI’s own distillation pipeline or distill an open model like LLaMA.
How is distillation different from fine-tuning?
Fine-tuning adapts a pre-trained model to your data – you’re teaching it new facts or formats. Distillation compresses a larger model’s behavior into a smaller one – you’re teaching it to imitate without adding new knowledge. You can combine them: distill first to get a small model, then fine-tune on your specific data. Some teams do exactly that, especially when they need both efficiency and domain specialization. For example, you might distill GPT-4o into a 7B model to capture its reasoning style, then fine-tune that 7B model on your company’s support tickets to handle your specific product terminology.
Why did Anthropic and OpenAI wait until February 2026 to accuse DeepSeek if they detected this in 2025?
They detected it early – OpenAI reportedly began investigating right after R1's January 2025 launch – but went public only in February 2026. The delay looks strategic: DeepSeek's success had already challenged US export controls on AI chips, and the companies wanted policy backing before escalating. Some analysts suggest the timing was about justifying stricter chip restrictions and securing government support. (Also worth asking: if this was truly an urgent security threat, why sit on it for 12 months?)