I tested both approaches for two months. A $5/month VPS running Open WebUI with cloud API calls beats a $900 local AI PC for 80% of real use cases. The reason isn’t about model quality.
The Two Paths (One Works Better Than You Think)
Method A: True local – Ollama on your own hardware, models running entirely on your machine. Zero ongoing costs after you buy the PC. Complete privacy.
Method B: Cloud-hosted self-controlled – Open WebUI on a $5 VPS, connected to API providers you choose (Anthropic, OpenAI, or even Ollama running remotely). You control the interface and conversation history. The LLM runs elsewhere.
Method B wins for most people. Not because of cost – though it’s cheaper than you think. The context window trap kills Method A setups within three weeks.
Why I Started With Local (And Hit a Wall)
August 2025. I wanted my own ChatGPT that I controlled. Privacy mattered. Client data flowing through OpenAI’s servers? No thanks.
Setup: install Ollama, pull Llama 3.1 8B, run Open WebUI in Docker. Done. Llama 3.1 8B runs well on most modern systems with 8GB+ RAM (as of April 2026, per OpenClaw’s official guide).
It did run. 40 tokens per second on my 12GB GPU.
Week three, I noticed something. The assistant kept forgetting things I’d told it 10 messages ago. Not vague things – concrete facts like “I use TypeScript, not JavaScript.”
Ollama defaults to a 2048-token context window. Even if you set a higher limit in Open WebUI’s interface, Ollama itself still caps at 2048 unless you configure it separately. No tutorial covers this in depth.
I bumped it to 8192. My GPU choked. Inference: 40 tokens/sec → 8. Unusable for real-time chat.
Larger context windows eat VRAM. Way more than you’d guess. The “0.5GB VRAM per billion parameters” rule everyone quotes? Just the model weights. Turns out actual usage runs 10-20% higher due to KV cache, activations, and framework overhead (Onyx AI leaderboard data, April 2026).
A 12GB GPU can’t actually hold a 20B model at 4-bit quantization if you want a usable context window. The math doesn’t work.
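The back-of-envelope math, as a sketch – the 0.5 GB-per-billion-params figure and the 15% overhead are rough rules of thumb from the article's numbers, not exact values:

```shell
# Rough VRAM estimate for a 20B model at 4-bit quantization.
# "0.5 GB per billion parameters" covers the weights only; add ~15%
# for KV cache, activations, and framework overhead -- and that's
# before a larger context window eats even more.
PARAMS_B=20                               # model size in billions of params
WEIGHTS_GB=$(( PARAMS_B * 5 / 10 ))       # 4-bit ~= 0.5 GB per billion
OVERHEAD_GB=$(( WEIGHTS_GB * 15 / 100 ))  # ~15% runtime overhead
TOTAL_GB=$(( WEIGHTS_GB + OVERHEAD_GB ))
echo "Estimated VRAM: ${TOTAL_GB} GB"     # ~11 GB on a 12GB card: no headroom
```

Eleven of twelve gigabytes gone before you allocate a single token of context.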
The Moment I Switched to Cloud-Hosted
September. Spun up a $5/month Hetzner VPS, installed Open WebUI, connected it to Anthropic’s API using my own key.
First conversation: 200K token context window. It remembered everything from the start of the chat. No VRAM juggling. No quantization compromises.
Cost that month: $8.40 total. $5 for the VPS, $3.40 in Claude API calls.
The “free” local setup? Only free if you accept models that aren’t as good as Claude/GPT-4 for complex tasks (as of 2026). For anything hard, you end up calling cloud APIs anyway. Hybrid mode – which most real users run – still costs $5-15/month (OpenClaw pricing data, April 2026).
Think about it: you buy a $900 GPU to avoid $20/month ChatGPT Plus, then spend $10/month on APIs for the tasks your local model can’t handle. Net savings: $10/month. Break-even: 90 months – seven and a half years. And that’s only if your local setup actually works for everything else – which it won’t.
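The arithmetic, using the (typical but hypothetical) figures above:

```shell
# $900 GPU vs. $20/month ChatGPT Plus, with $10/month still going to
# cloud APIs for the tasks the local model can't handle.
HARDWARE_COST=900
SUBSCRIPTION=20
API_SPEND=10
NET_SAVINGS=$(( SUBSCRIPTION - API_SPEND ))              # $10/month actually saved
echo "Break-even: $(( HARDWARE_COST / NET_SAVINGS )) months"   # 90 months
```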
Start with the smallest VPS tier. Scale up only if you add team members. A single-core 1GB RAM instance handles Open WebUI fine – the LLM runs on the API provider’s side, not your server.
When Local Actually Wins
Three scenarios where local is the right call:
You process truly sensitive data – Medical records, legal docs, proprietary code under NDA. If your compliance requirements say “data cannot leave your infrastructure,” local is mandatory. Cloud-hosted still sends prompts to third-party APIs.
You already own the hardware – Gaming PC with a 24GB GPU sitting idle? Marginal cost of running Ollama: zero. The VPS + API combo only wins if you’re buying hardware specifically for AI.
You’re running batch workloads, not chat – Summarizing 500 documents overnight works great locally. Latency doesn’t matter, you can use cheaper quantized models, and you’re not fighting context limits in real-time.
For everything else – personal assistant, customer support, research, coding help – cloud-hosted beats local on availability (works from your phone), memory (no VRAM ceiling), and total cost when you factor in your time.
How to Actually Set This Up (Cloud Path)
The stack I’m running in production as of March 2026:
- Provision a VPS – Hetzner CX11 ($5/mo) or Oracle Cloud Always Free tier (4 ARM cores, 24GB RAM – officially supported for OpenClaw + Ollama as of April 2026). Ubuntu 24.04.
- Install Docker –
  curl -fsSL https://get.docker.com | sh
- Run Open WebUI –
  docker run -d -p 3000:8080 -v open-webui:/app/backend/data --name open-webui --restart always ghcr.io/open-webui/open-webui:main
- Connect your API key – Open WebUI supports Anthropic, OpenAI, and Gemini (as of 2026). Settings → Connections, add your key. You control which model handles which requests.
- Set up a domain + SSL – Caddy or Nginx with Let’s Encrypt. Exposes Open WebUI at https://ai.yourdomain.com. Accessible from anywhere, any device.
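For the Caddy route, a minimal config is all you need – Caddy provisions and renews the Let’s Encrypt certificate automatically. The domain is a placeholder; point its DNS A record at your VPS first:

```shell
# Minimal Caddyfile: terminate TLS and proxy to Open WebUI on port 3000.
cat > Caddyfile <<'EOF'
ai.yourdomain.com {
    reverse_proxy localhost:3000
}
EOF
# Then: caddy run --config Caddyfile (or install Caddy as a systemd service)
```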
Setup takes 20 minutes if you’ve used Docker before, 45 if you haven’t.
The interface looks identical to ChatGPT. Conversation history persists. You can upload documents for RAG. Mobile works perfectly.
The Memory Gotcha (And How to Fix It)
If you go local, the #1 issue that breaks setups:
Open WebUI shows “Ollama disconnected” after installation. Fix: set OLLAMA_HOST=0.0.0.0 before starting Ollama, so the Docker container can reach it. Default localhost binding doesn’t work with Docker’s network isolation.
export OLLAMA_HOST=0.0.0.0
ollama serve
Then in Open WebUI, connect to http://host.docker.internal:11434.
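One caveat: on Linux, host.docker.internal doesn’t resolve inside containers by default. If the connection still fails, restart the Open WebUI container with an explicit host mapping (the --add-host=host.docker.internal:host-gateway flag has been supported since Docker 20.10):

```shell
# Map host.docker.internal to the Docker host's gateway so the
# container can reach Ollama running on the host machine.
docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui --restart always \
  ghcr.io/open-webui/open-webui:main
```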
Second gotcha: context length. Don’t just set it in Open WebUI. You have to configure Ollama itself. Edit your model’s Modelfile or use the API to set the num_ctx parameter explicitly.
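A sketch of the Modelfile route – llama3.1:8b and the 8192 value are examples, so substitute your model and whatever context your VRAM can actually hold:

```shell
# Bake a larger context window into the model itself, so every client
# (including Open WebUI) gets it without per-chat overrides.
cat > Modelfile <<'EOF'
FROM llama3.1:8b
PARAMETER num_ctx 8192
EOF
# Build the variant if ollama is installed on this machine.
if command -v ollama >/dev/null 2>&1; then
  ollama create llama3.1-8k -f Modelfile
fi
```

Point Open WebUI at llama3.1-8k instead of the base model and the 2048-token cap is gone.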
Third: VRAM overhead. Budget 20% more than the model size suggests. A 7B model at 4-bit isn’t 3.5GB – it’s closer to 4.2GB in practice.
What Broke vs What Worked (60 Days, Real Usage)
Broke:
- Llama 70B locally on a 24GB GPU – context window collapsed under memory pressure, unusable for multi-turn conversations
- Expecting $0/month costs – hybrid mode (local + occasional API calls for hard questions) still ran $8-12/month
- Using my laptop as the “server” – works until you close the lid or take it to a coffee shop, then your assistant vanishes
Worked:
- VPS + Claude API for 90% of tasks, with Ollama on the same VPS for basic stuff (drafts, summaries) – brought monthly API cost down to $4
- Open WebUI’s model switching – click a dropdown, choose between local Llama or cloud GPT-4, same interface
- Document upload + RAG – fed it 80 pages of client docs, it cited sources accurately
FAQ
Is self-hosting actually cheaper than ChatGPT Plus?
Cloud-hosted (VPS + API) costs $8-20/month depending on usage (as of 2026) – comparable to ChatGPT Plus at $20/month but with more control. True local is free after hardware, but you need a $900+ PC with a decent GPU to run quality models. Break-even on that hardware: ~45 months of ChatGPT Plus at $20/month – nearly four years – and that assumes your local setup handles 100% of your queries. If you still use APIs for hard tasks, add $5-15/month and the break-even stretches past seven years.
Can I run a good AI assistant on my laptop?
Llama 3.1 8B runs on 8GB+ RAM at ~5 tokens/second CPU-only. Usable for summarization and drafts, too slow for real-time chat. Discrete GPU (16GB+ VRAM)? You can run 13B models smoothly. But your “assistant” disappears when you close the laptop or travel unless you set up remote access.
OpenClaw vs Open WebUI:
Open WebUI is a web interface for Ollama and other LLM APIs – a self-hosted ChatGPT UI. OpenClaw is an autonomous agent framework (as of 2026, per OpenClaw’s official guide) that connects to messaging apps (Telegram, Discord) and includes skills, memory, and task automation. Open WebUI: chat. OpenClaw: “AI assistant that books meetings and manages your calendar.” Both can be self-hosted.
Next step: pick your path. Privacy mandatory + you have hardware? Go local with Ollama. Want the best model quality and 24/7 availability? Spin up a VPS and connect your API key. Both are self-hosted – you just choose where the model runs.