By the end of this guide you’ll have a Falcon 180B Chat endpoint listening on http://localhost:8080/generate, served by Text Generation Inference inside Docker, ready for cURL or LangChain. That’s the destination. The hard part is everything you have to work through – hardware, gated repos, quantization – to get there.
Falcon 180B is the 180-billion-parameter model from TII (Abu Dhabi), released in September 2023. It’s still the largest fully open-weights LLM you can self-host without legal acrobatics – provided you accept the license. Most tutorials hand-wave deployment with a Transformers pipeline() snippet that won’t actually run on anything you own. We’re skipping that entirely.
The hardware reality (read this before renting GPUs)
The official Hugging Face model card says you need approximately 8xA100 80GB or equivalent for full bfloat16 inference. That’s the headline. The fine print is more interesting.
| Configuration | VRAM | Reality |
|---|---|---|
| bfloat16, unquantized | ~400GB needed | 8xA100 80GB recommended (per model card). Community reports: 5xA100 80GB technically loads but idles at 90% VRAM – no room for real context. |
| GPTQ 4-bit (TheBloke) | ~100-120GB | Works on 2xA100 80GB or 4x48GB cards via TGI or Transformers only – see the AutoGPTQ note below. |
| GGUF Q4_K_M (CPU+GPU) | See llama.cpp docs | llama.cpp / LM Studio path; slower but lets you run on mixed hardware. Check TheBloke’s GGUF README for current memory estimates. |
TheBloke’s GGUF README puts the minimum for swift inference at 400GB. “Swift” matters there – quantized variants run with less, but the community-reported trade-off on 4-bit GPTQ is roughly 4 tokens/sec on 2xA100 (per Runpod’s testing), compared to higher throughput on the full-precision setup.
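Those figures aren't arbitrary. The raw weight size is simple arithmetic, and the gap up to ~400GB is runtime overhead; a back-of-envelope check:

```bash
# 180e9 params × 2 bytes each (bfloat16) ≈ 360 GB of raw weights;
# KV-cache, activations, and CUDA context push the practical total toward 400 GB
python3 -c 'print(f"{180e9 * 2 / 1e9:.0f} GB")'   # → 360 GB
```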
There’s something almost philosophical about a model that requires a small data-center rack to serve a single chat session. It’s a useful gut-check before you open a cloud console: is the quality gap between 180B and, say, 40B actually worth the 10x infrastructure difference for your specific task? That question doesn’t have a universal answer – which is why the FAQ at the bottom addresses it directly.
Prerequisites
- OS: Linux – Ubuntu 22.04 LTS is tested. WSL2 is technically possible but TGI’s NCCL setup is fragile there; don’t debug both at once.
- GPU drivers: NVIDIA drivers supporting CUDA 11.8 or higher (per the TGI README, as of early 2024 – check the TGI GitHub for current requirements before you rent).
- Docker + NVIDIA Container Toolkit: Mandatory for --gpus all to work. No toolkit, no GPU access from the container. (A quick sanity check follows this list.)
- Disk: ~360GB free for unquantized weights, ~100GB for GPTQ. Plus HF cache headroom – this catches people out after the first run.
- A Hugging Face account with the Falcon 180B license accepted (details in Step 1).
- PyTorch 2.0+ and Transformers 4.33+ if you’re using local dev tools alongside TGI. Older versions won’t load the architecture – Falcon 180B support landed in Transformers 4.33.
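Before pulling 360GB of weights, confirm the container toolkit actually exposes your GPUs. A minimal sanity check, assuming any CUDA base image (the tag below is just an example; swap in whatever you have locally):

```bash
# if the toolkit is wired up, this prints the same GPU table you'd see on the host
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
```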
Step 1: Get past the gate
Falcon 180B lives at huggingface.co/tiiuae/falcon-180B-chat and it’s gated. Visit the page while logged in, accept the license, then create a read token at Settings → Access Tokens.
```bash
export HF_TOKEN=hf_your_token_here
```
Skip this and TGI’s first download attempt fails with a 401 that doesn’t always print clearly inside the container logs. Also: the token must belong to the same account that accepted the license. A different account’s token – even with full read scope – won’t work. That one trips up teams sharing credentials.
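You can verify both conditions before Docker enters the picture. A quick probe against the Hub API; a 200 means the gate is open for this token, a 401 or 403 means it isn't:

```bash
# prints the HTTP status only; 200 = token valid and license accepted on this account
curl -s -o /dev/null -w "%{http_code}\n" \
    -H "Authorization: Bearer $HF_TOKEN" \
    https://huggingface.co/api/models/tiiuae/falcon-180B-chat
```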
Step 2: Pull TGI and launch
The model card itself recommends TGI: “since the 180B is larger than what can easily be handled with transformers+accelerate, we recommend using Text Generation Inference.” Take them at their word.
8xA100 80GB, unquantized chat model:
```bash
model=tiiuae/falcon-180B-chat
volume=$PWD/data

docker run --gpus all --shm-size 1g -p 8080:80 \
    -v $volume:/data \
    -e HF_TOKEN=$HF_TOKEN \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id $model \
    --num-shard 8 \
    --sharded true
```
Two things worth calling out. The -v $volume:/data mount caches weights on the host – skip it and every restart re-downloads ~360GB. The --shm-size 1g flag is not optional: TGI uses NCCL for tensor parallelism and falls back to host shared memory when peer-to-peer can’t negotiate (per the TGI README). Cut it and you’ll see hangs that look like deadlocks.
GPTQ 4-bit path on smaller hardware (2-4 A100s):
```bash
docker run --gpus all --shm-size 8g -p 8080:80 \
    -v $volume:/data \
    -e HF_TOKEN=$HF_TOKEN \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id TheBloke/Falcon-180B-Chat-GPTQ \
    --quantize gptq \
    --sharded true --num-shard 4 \
    --max-input-length 2048 --max-total-tokens 4096
```
The AutoGPTQ trap: TheBloke’s GPTQ build was sharded because of the model’s size – and that sharding breaks AutoGPTQ-based loaders entirely. It loads cleanly through Transformers ≥ 4.33 or TGI. Most “GPTQ Falcon doesn’t work” threads trace back to this mismatch. Don’t reach for AutoGPTQ here.
Step 3: Verify the endpoint
First boot is slow. The 360GB pull alone – even on a fast pipe – takes long enough to eat lunch, not just coffee. Then TGI runs a warmup pass that pre-allocates KV-cache, and only after that does Connected appear in the logs.
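One way to watch for that without staring at the terminal (a sketch; <container> is whatever docker ps reports for the TGI container):

```bash
# follow the logs and return as soon as the router reports it is serving
docker logs -f <container> 2>&1 | grep -m1 Connected
```

Once it appears, hit the endpoint: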
```bash
curl 127.0.0.1:8080/generate \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{"inputs":"System: You are a concise assistant.\nUser: What is multiquery attention?\nFalcon:","parameters":{"max_new_tokens":80}}'
```
The chat model expects a System: ... User: ... Falcon: ... turn structure. Download falcon-180B (base) instead of falcon-180B-chat and you’ll get a text completer with no concept of conversation – a common enough mix-up that it’s worth stating plainly.
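Multi-turn works the same way: replay the earlier turns verbatim and end the prompt with Falcon: to cue the next reply. A sketch (the conversation content is made up; the stop sequence keeps the model from writing the user's next turn itself):

```bash
curl 127.0.0.1:8080/generate \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{"inputs":"System: You are a concise assistant.\nUser: Hi!\nFalcon: Hello! How can I help?\nUser: Explain multiquery attention in one sentence.\nFalcon:","parameters":{"max_new_tokens":60,"stop":["\nUser:"]}}'
```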
Step 4: Failure modes
OOM during warmup. The container loads, then dies just before serving, with a CUDA out of memory raised from flash_causal_lm.py – one reported case hit this on 8xA100 40GB on GCP with GPTQ 4-bit. The warmup phase pre-allocates the KV-cache, and it won't fit. Fix: lower --max-total-tokens, --max-batch-prefill-tokens, and --max-batch-total-tokens aggressively, or pass -e DISABLE_EXLLAMA=True to drop the ExLlama kernels that eat extra memory. Start tiny, scale up once warmup passes.
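As a concrete starting point, here is a deliberately conservative variant of the Step 2 GPTQ launch (the token limits are illustrative floor values, not tuned numbers; raise them once warmup survives):

```bash
docker run --gpus all --shm-size 8g -p 8080:80 \
    -v $volume:/data \
    -e HF_TOKEN=$HF_TOKEN \
    -e DISABLE_EXLLAMA=True \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id TheBloke/Falcon-180B-Chat-GPTQ \
    --quantize gptq --sharded true --num-shard 4 \
    --max-input-length 1024 --max-total-tokens 1536 \
    --max-batch-prefill-tokens 1024
```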
NCCL hangs at “Initializing process group”. Multi-GPU on a system without NVLink – it falls back through PCIe and host RAM. Too-low --shm-size silently deadlocks that path. Bump to 8g or 16g for 180B.
401 from the container. The HF_TOKEN didn’t propagate. Check with docker exec <container> env | grep HF_TOKEN. Some shells eat variables you set without export.
The license clause that catches startups
The Falcon 180B license is royalty-free and built on Apache 2.0 – but with one carve-out. You can host it yourself, in your own or leased infrastructure, for internal use or to power a product. What requires a separate TII agreement is offering Falcon 180B as a shared inference API – where customers’ prompts hit a shared instance.
Dedicated tenant deployments: fine. Multi-tenant inference-as-a-service on top of Falcon: talk to TII first. The line is shared instance vs. dedicated, not commercial vs. non-commercial.
Upgrading and cleanup
Pull the new tag, swap the container – the $volume mount keeps your weights intact, so only the container layer changes:
```bash
docker pull ghcr.io/huggingface/text-generation-inference:latest
docker stop <container> && docker rm <container>
# re-run the docker run command above
```
To wipe everything including the ~360GB of weights:
```bash
docker rm -f $(docker ps -aq --filter ancestor=ghcr.io/huggingface/text-generation-inference)
docker rmi ghcr.io/huggingface/text-generation-inference:latest
rm -rf $PWD/data
```
That last line is the one people forget. The HF cache survives container deletion – great for upgrades, brutal on disk-quota dashboards.
Whether 180B is the right call depends on a question no benchmark answers cleanly: what’s the actual quality gap on your prompts? The parameter count is a proxy. Tokens per second on your real workload is the real number.
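A crude way to get that number with nothing but curl (a sketch: assumes the Step 3 endpoint, approximates by requesting a fixed budget of 200 new tokens, and over-counts slightly if the model stops early):

```bash
# time a fixed-length generation and divide; rough, but comparable across configs
t0=$(date +%s.%N)
curl -s 127.0.0.1:8080/generate -X POST -H 'Content-Type: application/json' \
    -d '{"inputs":"System: You are a concise assistant.\nUser: Summarize multiquery attention.\nFalcon:","parameters":{"max_new_tokens":200}}' > /dev/null
t1=$(date +%s.%N)
echo "~$(echo "200 / ($t1 - $t0)" | bc -l) tokens/sec"
```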
FAQ
Can I run Falcon 180B on a single consumer GPU?
No. Even aggressive 4-bit GPTQ needs ~100-120GB spread across multiple cards. A single 24GB GPU won’t load it.
Should I use Falcon 180B or one of the smaller Falcon models?
Start smaller – Falcon-7B-Instruct or Falcon-40B-Instruct handle 90% of use cases at a fraction of the infrastructure cost. The 180B variant earns its keep in specific situations: you need stronger open-weights reasoning, you’re running fine-tuning experiments at scale, or you have a compliance reason to self-host the largest available open model. If you’re prototyping a chatbot, measure the quality gap between 40B and 180B on your actual prompts before committing to the larger setup. Most teams find they don’t need it.
Is Falcon 180B still competitive?
As of mid-2025, the benchmark crown has moved on – newer MoE and reasoning-focused architectures outperform it per parameter. But benchmarks aren’t the whole story. For self-hosting use cases where weight stability, a known license, and air-gapped deployment matter, Falcon 180B remains a reasonable choice. Just don’t pick it to chase a leaderboard position it no longer holds.
Next: point a LangChain HuggingFaceTextGenInference client at http://localhost:8080 and start measuring tokens per second on your actual prompts – that number, not the parameter count, is what determines whether 180B is worth the bill.