If you’re building anything in Europe right now, you’ve probably hit the same wall I have: every serious LLM API ships through US infrastructure, and your legal team has opinions about that. Apertus, the new open foundation model from Switzerland, is the first credible answer to that problem that doesn’t require a half-million-euro GPU cluster. It’s pitched as a model for sovereign AI, and unlike most “open” releases, this one actually ships the training data, the recipes, and the intermediate checkpoints.
It dropped on September 2, 2025 and developer Twitter has been arguing about it ever since. Some people are calling it the European DeepSeek moment. Others point at the benchmarks and say it’s not even close to Qwen yet. Both are kind of right.
This guide skips the press release. You’ll get a runnable 10-minute setup, the gotchas nobody is documenting, and a frank read on when Apertus is actually the right call.
What Apertus actually is (the 90-second version)
Apertus was built by EPFL, ETH Zurich, and the Swiss National Supercomputing Centre (CSCS). It comes in two sizes: 8B and 70B parameters. The architectural choices are genuinely unusual: a new xIELU activation function and the AdEMAMix optimizer, trained from scratch on 15 trillion tokens spanning 1,811 languages, with roughly 40% of that data non-English. Training ran on the Alps supercomputer using over 4,000 NVIDIA GH200 GPUs.
The interesting part isn’t the spec sheet. It’s the licensing posture. Both models ship under Apache 2.0 – fully open-source, commercial use included. Unlike many “open” models that publish weights without meaningful process detail, Apertus ships code, weights, intermediate checkpoints, and training documentation. You can audit the whole thing. Your compliance officer will, eventually, love this.
Run Apertus in 10 minutes (the 8B model)
The 8B Instruct model is the one to start with. It fits on a single consumer GPU at reasonable quantization, and the API matches anything else you’ve used with Hugging Face Transformers.
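To put “reasonable quantization” into numbers, here is a rough back-of-envelope sketch of weight memory at different precisions. It is an approximation under stated assumptions: “8B” is treated as exactly 8e9 parameters, and only the weights are counted, so the KV cache, activations, and quantization overhead will push real usage higher.

```python
# Rough weight-only memory estimate. Real usage is higher because the
# KV cache, activations, and quantization metadata are not counted here.
PARAMS_8B = 8_000_000_000  # assumption: "8B" taken as exactly 8e9 parameters


def weight_memory_gb(num_params: int, bits_per_param: int) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


for label, bits in [("bf16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{label}: ~{weight_memory_gb(PARAMS_8B, bits):.0f} GB of weights")
```

By this estimate the bf16 weights alone need about 16 GB, which is why a quantized load is the practical route on smaller cards.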
First, make sure your environment is current. The modeling code for Apertus requires transformers v4.56.0 or later – if you skip this step you’ll get a cryptic config error about unknown activation functions and waste an hour on Stack Overflow.
pip install --upgrade "transformers>=4.56.0" torch accelerate
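If you want to fail fast instead of decoding a config error later, a small version guard helps. This is a minimal sketch using only the standard library; the helper names (`version_tuple`, `is_supported`, `check_transformers`) are mine, and the parsing is deliberately simple (it only looks at the first three numeric components).

```python
from importlib import metadata
from itertools import takewhile

MIN_TRANSFORMERS = "4.56.0"  # minimum version stated in the text above


def version_tuple(version: str) -> tuple[int, ...]:
    """Turn '4.56.0' into (4, 56, 0), ignoring suffixes like 'rc1' or '.dev0'."""
    parts = []
    for piece in version.split(".")[:3]:
        digits = "".join(takewhile(str.isdigit, piece))
        parts.append(int(digits) if digits else 0)
    return tuple(parts)


def is_supported(installed: str, minimum: str = MIN_TRANSFORMERS) -> bool:
    """True when the installed version is at least the minimum."""
    return version_tuple(installed) >= version_tuple(minimum)


def check_transformers() -> str:
    """Report whether the installed transformers package is new enough."""
    try:
        installed = metadata.version("transformers")
    except metadata.PackageNotFoundError:
        return "transformers is not installed"
    if is_supported(installed):
        return f"transformers {installed} is new enough"
    return f"transformers {installed} is too old; need >= {MIN_TRANSFORMERS}"


print(check_transformers())
```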
Then the actual inference call. This is straight from the official model card, lightly trimmed:
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "swiss-ai/Apertus-8B-Instruct-2509"
device = "cuda" # or "cpu" if you enjoy waiting
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)
messages = [{"role": "user", "content": "Explain Romansh grammar in two sentences."}]
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer([text], return_tensors="pt", add_special_tokens=False).to(device)
output = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.8, top_p=0.9)  # do_sample=True so temperature/top_p take effect
print(tokenizer.decode(output[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
Two things to notice in that code. The team recommends temperature=0.8 and top_p=0.9, so don’t just leave the defaults: the model is noticeably worse at temperature 1.0. (In Transformers these settings only apply when sampling is enabled, so make sure do_sample=True is in effect.) And the prompt I used (Romansh grammar) is deliberate: this is exactly where Apertus should beat Llama, because Romansh is a Swiss national language that most US-trained models barely know exists.
If you don’t want to run it locally, the Public AI Inference Utility exposes Apertus through a hosted endpoint, and Swisscom offers it for business customers inside Switzerland. AWS SageMaker JumpStart has it too if you’re already in that ecosystem.
The gotchas nobody is writing about
Three things tripped me up that aren’t in the announcement posts.
1. The 6-month output filter
This one is unique to Apertus and strange the first time you read it. The team provides a separate output filter that reflects data protection deletion requests addressed to them as the developer – it lets you remove personal data contained in the model output, and they strongly advise downloading and applying this filter every six months. Most LLM workflows assume the model is static once you deploy it. Apertus’s GDPR posture means your filter file is a moving target, and treating it as one-time setup is a compliance bug waiting to happen.
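One way to stop the filter from becoming one-time setup is to treat its age as something you check on a schedule. The sketch below is only the date logic: where the last-download date comes from (a deployment record, a file timestamp) and how the filter itself is fetched and applied are left out, and the 183-day reading of “every six months” is my assumption.

```python
from datetime import date, timedelta

# Assumption: "every six months" is approximated as 183 days.
REFRESH_INTERVAL = timedelta(days=183)


def next_refresh_due(last_downloaded: date) -> date:
    """Date by which the output filter should be downloaded again."""
    return last_downloaded + REFRESH_INTERVAL


def filter_is_stale(last_downloaded: date, today: date) -> bool:
    """True when the filter is older than the refresh interval."""
    return today - last_downloaded > REFRESH_INTERVAL


last = date(2025, 9, 2)  # illustrative: filter fetched on release day
print("refresh due by:", next_refresh_due(last))
print("stale today?", filter_is_stale(last, date.today()))
```

Wiring `filter_is_stale` into a startup check or a scheduled job turns a forgotten refresh into a visible failure instead of a silent one.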
2. Context is 65K, not 128K (as of September 2025)
Apertus supports a default context length of 65,536 tokens. That’s plenty for chat and most RAG pipelines, but it’s half of the 128K that Llama 3.1 and Qwen2.5 offer. If you’re feeding it whole codebases or hundred-page contracts, you’ll be chunking.
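Chunking against that limit can be sketched as a sliding window over token IDs. This is a generic approach, not anything Apertus-specific: the `reserve_for_output` and `overlap` defaults are illustrative guesses, and you would also need to leave room for chat-template and instruction tokens in a real pipeline.

```python
CONTEXT_LIMIT = 65_536  # default context length stated above


def chunk_token_ids(
    token_ids: list[int],
    reserve_for_output: int = 1024,  # illustrative: room left for generated tokens
    overlap: int = 256,              # illustrative: tokens shared between neighbours
    context_limit: int = CONTEXT_LIMIT,
) -> list[list[int]]:
    """Split token IDs into overlapping windows that fit the context limit."""
    window = context_limit - reserve_for_output
    if window <= overlap:
        raise ValueError("window must be larger than the overlap")
    if not token_ids:
        return []
    step = window - overlap
    chunks = []
    start = 0
    while True:
        chunks.append(token_ids[start:start + window])
        if start + window >= len(token_ids):
            break  # this window already reached the end of the document
        start += step
    return chunks


# Usage: a stand-in "document" of 150,000 token IDs.
chunks = chunk_token_ids(list(range(150_000)))
print(len(chunks), [len(c) for c in chunks])
```

The overlap keeps a sentence that straddles a boundary visible in both neighbouring chunks, at the cost of processing those tokens twice.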
3. Instruct mode is weaker than the marketing suggests
This is the one that surprised me. An independent benchmark from DS-NLP Lab found Apertus-8B-Instruct hits 44.18% on IFEval – behind OLMo-2 at 57.67% and Qwen2.5 at 58.04%. Translation: when you ask for structured output (“give me JSON with these five fields”), Apertus is more likely to drift than its similarly-sized peers. Math reasoning is the bright spot – it scores 5.29% on Math-lvl-5, beating Mistral 7B’s 2.95%, and hits 31.14% on MMLU-Pro.
Pro tip: If you need reliable JSON output from Apertus-8B, wrap the model with a constrained-decoding library like Outlines or use grammar-guided generation in vLLM. Don’t trust the raw instruct mode for structured tasks – the IFEval numbers are honest about this weakness.
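If constrained decoding isn’t available in your stack, a cruder fallback is to validate the output and retry. To be clear, this is not constrained decoding: it only catches bad output after the fact. In the sketch below, `generate` is a placeholder for whatever function calls your model and returns text, and `generate_json` and its parameters are names I made up for illustration.

```python
import json
from typing import Callable, Optional


def generate_json(
    generate: Callable[[str], str],  # placeholder: your model call, prompt -> text
    prompt: str,
    required_fields: set[str],
    max_attempts: int = 3,
) -> Optional[dict]:
    """Call the model until it returns a JSON object with the required fields."""
    for _ in range(max_attempts):
        raw = generate(prompt)
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            continue  # not valid JSON, try again
        if isinstance(parsed, dict) and required_fields.issubset(parsed.keys()):
            return parsed
    return None  # caller decides how to handle repeated failure


# Usage with a stub model that drifts twice before complying.
replies = iter(["Sure! Here is the JSON:", '{"name": "Ada"}', '{"name": "Ada", "age": 36}'])
print(generate_json(lambda _prompt: next(replies), "Return name and age as JSON.", {"name", "age"}))
```

Retries cost latency and tokens, so this works best as a safety net behind grammar-guided generation rather than as a replacement for it.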
The retroactive opt-out – why this matters more than you think
Here’s the part that almost no English-language coverage explains properly.
The training corpus is built only on publicly available data, filtered to respect machine-readable opt-out requests from websites – even retroactively – and to remove personal data before training begins. Retroactively is the key word. As the Apertus team explained on Hacker News, back in 2013 – their oldest training data – LLMs didn’t exist, so website owners opting out today of AI crawlers might want the option to also remove their past contents.
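To make “retroactive” concrete, here is a toy illustration of the idea using Python’s standard `urllib.robotparser`. It is not the Apertus pipeline: the crawler name `ExampleAIBot`, the hosts, and the data layout are all hypothetical. The point is only that today’s robots.txt is applied to every document from a host, no matter when the document was crawled.

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def filter_retroactively(
    documents: list[tuple[str, int]],  # (url, crawl_year)
    current_robots: dict[str, str],    # host -> robots.txt text as of today
    user_agent: str,
) -> list[tuple[str, int]]:
    """Keep only documents that today's robots.txt would still allow."""
    parsers = {host: build_parser(text) for host, text in current_robots.items()}
    kept = []
    for url, crawl_year in documents:
        parser = parsers.get(urlparse(url).netloc)
        # crawl_year is deliberately ignored: a current opt-out removes old pages too.
        if parser is None or parser.can_fetch(user_agent, url):
            kept.append((url, crawl_year))
    return kept


# Usage: a 2013 page is dropped because its host opts out today.
robots_today = {"news.example": "User-agent: ExampleAIBot\nDisallow: /\n"}
corpus = [("https://news.example/2013/story", 2013), ("https://blog.example/post", 2015)]
print(filter_retroactively(corpus, robots_today, "ExampleAIBot"))
```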
What that means practically: if a publisher’s lawyer sues OpenAI in 2027 over content from 2015, that lawsuit also implicates every model that scraped 2015 content with no opt-out mechanism. Apertus, by design, removed itself from that universe of risk. Your compliance team didn’t ask for this feature yet. They will.
Performance: where it actually wins, where it doesn’t
Take vendor benchmarks with the usual salt. That’s why this snapshot comes from the independent DS-NLP Lab evaluation of the 8B Instruct model, compared to peers in the same weight class (a dash means I don’t have a figure for that model):
| Benchmark | Apertus-8B-Instruct | Qwen2.5 | OLMo-2 | Mistral 7B |
|---|---|---|---|---|
| IFEval (instructions) | 44.18% | 58.04% | 57.67% | – |
| Math-lvl-5 | 5.29% | – | – | 2.95% |
| MMLU-Pro | 31.14% | – | – | – |
The IFEval gap is real and it shapes the decision. Apertus trails Qwen2.5 on instruction-following by almost 14 percentage points (44.18% vs. 58.04%), and that’s not noise. For agentic tool-use or anything that depends on reliably shaped output, Qwen is still the better starting point. What Apertus offers instead: a model you can actually point at and say “here is every document that went into training this,” which no commercial model matches at this scale.
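For anyone who wants to check the arithmetic, the gaps follow directly from the table. The sketch below just subtracts the reported scores; the helper names are mine.

```python
# Scores copied from the table above (percent).
ifeval = {"Apertus-8B-Instruct": 44.18, "Qwen2.5": 58.04, "OLMo-2": 57.67}
math_lvl5 = {"Apertus-8B-Instruct": 5.29, "Mistral 7B": 2.95}


def point_gap(a: float, b: float) -> float:
    """Difference between two scores in percentage points."""
    return round(a - b, 2)


def relative_gap_pct(a: float, b: float) -> float:
    """How far a sits above b, as a percentage of b."""
    return round((a - b) / b * 100, 1)


apertus = ifeval["Apertus-8B-Instruct"]
print("IFEval gap vs Qwen2.5:", point_gap(ifeval["Qwen2.5"], apertus), "points")
print("IFEval gap vs OLMo-2:", point_gap(ifeval["OLMo-2"], apertus), "points")
print("Math-lvl-5 lead over Mistral 7B:",
      point_gap(math_lvl5["Apertus-8B-Instruct"], math_lvl5["Mistral 7B"]), "points")
```

Percentage points and relative percentages tell different stories here: the Math-lvl-5 lead is only a couple of points in absolute terms, even though it looks large as a ratio.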
When NOT to use Apertus
Skip Apertus for now if any of these apply:
- You need long-context document analysis – 65K tokens is half of what newer Llama and Qwen models offer.
- Your workflow depends on tight structured output – the IFEval gap is real, and “return valid JSON” prompts will fail more often than with Qwen2.5.
- You want a chat model that just works – Apertus is a foundation model first, a polished assistant second. Future updates are expected to expand its capabilities while maintaining strict transparency standards, but right now it’s closer to a research artifact than a ChatGPT replacement.
- You’re a hobbyist who just wants the best free chatbot – use the hosted version on Public AI and don’t bother with local setup yet.
The strongest use case right now: a regulated European org (health, law, public sector) that needs an in-country deployment with auditable training lineage. For that profile, no other model comes close.
FAQ
Can I fine-tune Apertus on my own data?
Yes – and it’s worth knowing where the process differs from Llama. Apache 2.0 means no licensing headaches, weights are public, and the team published the training recipes. LoRA, QLoRA, and full fine-tuning all work through standard Hugging Face tooling. The 8B model typically fits on a single A100 with QLoRA, though your actual VRAM headroom depends on sequence length and batch size. One thing to check before you start: the transformers version requirement (v4.56.0+) applies to fine-tuning too, not just inference – older PEFT integrations may silently use the wrong modeling code.
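As a starting point, a QLoRA setup through standard Hugging Face tooling might look like the sketch below. Treat it as a sketch under assumptions, not a tested recipe: the LoRA hyperparameters are generic illustrative values rather than anything the Apertus team recommends, `"all-linear"` relies on a reasonably recent PEFT release, and I have not verified this exact configuration against Apertus. It needs `torch`, `peft`, `bitsandbytes`, and a CUDA GPU to actually run.

```python
MODEL_NAME = "swiss-ai/Apertus-8B-Instruct-2509"

# Illustrative values only, not recommendations from the Apertus team.
LORA_SETTINGS = {
    "r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "target_modules": "all-linear",  # PEFT shortcut: adapt every linear layer
    "task_type": "CAUSAL_LM",
}


def load_qlora_model(model_name: str = MODEL_NAME):
    """Load the model in 4-bit and attach LoRA adapters (requires a CUDA GPU)."""
    # Heavy imports live inside the function so the settings can be inspected without them.
    import torch
    from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    model = AutoModelForCausalLM.from_pretrained(
        model_name, quantization_config=quant_config, device_map="auto"
    )
    model = prepare_model_for_kbit_training(model)
    return get_peft_model(model, LoraConfig(**LORA_SETTINGS))
```

From there, the returned model can go into whichever training loop you already use; calling `print_trainable_parameters()` on it is a quick sanity check that only the adapters are being trained.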
Why does Apertus support 1,811 languages when most models claim only a few dozen?
Because the team chose to. Most LLM vendors quietly drop low-resource languages from training data to chase English benchmark scores – if a language has under a few hundred million tokens of clean text, it’s a rounding error for MMLU. Apertus inverted that priority: multilingual support is a core design principle, and the team publishes multilingual evaluations in the technical report. Quality on tail languages still varies – don’t expect frontier-model fluency in Romansh – but the coverage is real, not marketing.
Is it really free for commercial use?
Yes, under Apache 2.0. No revenue thresholds, no acceptable-use clauses that quietly exclude competitors, no “contact us for enterprise.” That’s still rare among open-weight models at this scale.
Your next step: Pull swiss-ai/Apertus-8B-Instruct-2509 from Hugging Face, run the snippet above with a prompt in your native language, and compare the output side-by-side with Llama 3.1 8B on the same prompt. That comparison – not the benchmark table – is what’ll tell you whether Apertus belongs in your stack.