Most people who set up a self-hosted personal AI assistant uninstall it within two weeks. Not because it doesn’t work – because they realize a $20 ChatGPT subscription gives them frontier-model-quality reasoning while their laptop wheezes through a 7B model. So before you copy-paste another Ollama tutorial, let’s be honest about what you’re actually getting and when it’s worth it.
This guide skips the privacy sermon. You already know the pitch. Instead, you’ll get exact RAM numbers, the security trap that ships in every default config, and a clear answer on when local actually loses to the cloud.
What you’re actually building
A self-hosted AI assistant is two pieces glued together: an inference engine that runs the model, and a frontend that gives you a chat window. The most common stack right now is Ollama plus Open WebUI. According to Sitepoint’s 2026 Ollama guide, Ollama wraps llama.cpp behind a simple CLI and REST API – it handles model quantization and GPU memory allocation so you don’t have to think about either.
The alternative is Jan, which bundles everything into a single desktop app. It also runs an OpenAI-compatible API server locally on port 1337, so anything you build against the OpenAI spec works with Jan too (as documented in Pinggy’s self-hosting guide). Both paths get you to the same place. Ollama gives you more control; Jan gets you running in five minutes.
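Because Jan speaks the OpenAI chat-completions format, a request against it can be sketched with plain curl. This is a minimal sketch under assumptions, not Jan’s documented example: the /v1/chat/completions path follows the OpenAI spec, chat_payload and ask_jan are hypothetical helper names, and the model id is a placeholder for whatever Jan reports as loaded. The local API server may also need to be switched on in Jan’s settings first.

```shell
# chat_payload MODEL PROMPT
# Hypothetical helper: builds an OpenAI-style chat request body.
# It does no JSON escaping, so keep prompts free of quotes and backslashes.
chat_payload() {
  printf '{"model": "%s", "messages": [{"role": "user", "content": "%s"}]}\n' "$1" "$2"
}

# ask_jan MODEL PROMPT
# Hypothetical helper: posts the payload to Jan's local server on port 1337.
# The endpoint path is assumed from the OpenAI spec.
ask_jan() {
  curl -s http://localhost:1337/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d "$(chat_payload "$1" "$2")"
}

# Usage (needs Jan running; "your-model-id" is a placeholder):
#   ask_jan "your-model-id" "Summarize this paragraph in one sentence"
```

Swapping the host and port is all it should take to point the same request at any other OpenAI-compatible server.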
Check your RAM before installing anything
Every tutorial glosses over this with “you need a decent computer.” Here’s the actual math, as of early 2026, from Sitepoint’s benchmarks:
| Model size | RAM/VRAM needed (q4_K_M) | Realistic use |
|---|---|---|
| 3B (Llama 3.2) | ~2 GB | Quick replies, basic Q&A |
| 7-8B (Mistral 7B, Llama 3.1 8B) | 4-6 GB | The sweet spot for most laptops |
| 13B | 8-10 GB | Better reasoning, slower |
| 70B | 38-48 GB | Workstation territory |
Rough rule (as of early 2026): ~0.6 GB per billion parameters at q4_K_M quantization, plus headroom for context. If you have an Apple Silicon Mac, you’re in better shape than the spec sheet suggests – the unified memory architecture means GPU and CPU share the same RAM pool, making local inference noticeably faster than on comparable Intel hardware (per MindStudio’s testing). That matters because it flips conventional wisdom: an M2 MacBook Air with 16 GB often outruns a much bigger Windows desktop without a dedicated GPU.
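The rule of thumb is easy to turn into a quick calculation. A minimal sketch: estimate_ram_gb is a hypothetical helper, and the default 1 GB of context headroom is an illustrative assumption rather than a figure from the benchmarks, so pass your own value if you run long contexts.

```shell
# estimate_ram_gb PARAMS_B [HEADROOM_GB]
# Hypothetical helper: applies the ~0.6 GB per billion parameters rule
# for q4_K_M models and adds headroom for context.
# The 1 GB default headroom is an assumed placeholder, not a measurement.
estimate_ram_gb() {
  local params_b="$1"
  local headroom_gb="${2:-1}"
  awk -v p="$params_b" -v h="$headroom_gb" \
    'BEGIN { printf "%.1f\n", p * 0.6 + h }'
}

# Usage:
#   estimate_ram_gb 7       # 7B model with the default headroom
#   estimate_ram_gb 70 0    # 70B model, weights only
```

Compare the result with your free memory, not your installed memory: the browser and the OS take their share first.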
Install Ollama and pull your first model
Ollama runs on macOS, Windows, and Linux – download from ollama.com and run the installer. No configuration needed. Then in your terminal:
# Pull the default Llama 3.1 model (8B, roughly 5 GB download)
ollama pull llama3.1
# Start chatting
ollama run llama3.1
# Or hit the API directly
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.1",
  "messages": [{"role": "user", "content": "Hello"}]
}'
The default pull is a 4-bit quantized (q4_K_M) model – that’s almost always what you want. Don’t pull the full-precision version unless you know why; you’ll burn roughly 3-4x the disk for a quality difference most people can’t detect in chat.
Quick catch: running ollama pull on an already-installed model checks for updates and downloads a newer version if one is available. If your assistant suddenly seems worse than last month, first check whether a recent pull replaced the model you were used to.
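One way to see whether a pull actually changed anything is to compare the model ID before and after. A minimal sketch, assuming ollama list prints a header row followed by NAME, ID, SIZE, and MODIFIED columns; model_id is a hypothetical helper, and the column layout is an assumption about the CLI output that may change between releases.

```shell
# model_id NAME
# Hypothetical helper: reads `ollama list`-style output on stdin and prints
# the ID column for the row whose NAME matches exactly.
# Assumes a header row and whitespace-separated NAME and ID columns.
model_id() {
  awk -v name="$1" 'NR > 1 && $1 == name { print $2 }'
}

# Usage (needs Ollama installed):
#   before=$(ollama list | model_id llama3.1:latest)
#   ollama pull llama3.1
#   after=$(ollama list | model_id llama3.1:latest)
#   [ "$before" = "$after" ] || echo "model changed: $before -> $after"
```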
Add a real chat interface
The terminal works but gets old fast. Open WebUI gives you a ChatGPT-style browser UI – deploy it with Docker:
docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:main
Open http://localhost:3000, create an account (it stays local), and you’ll see a familiar interface that auto-discovers your Ollama models. This is what most people end up with: a shared, self-hosted chat interface with no external API dependencies.
Pro tip: Don’t bother with multiple frontends. Pick Open WebUI or Jan or AnythingLLM and stick with it. Each stores chat history in its own database – switching means losing your conversation context. Trying all three in one weekend is the #1 way beginners give up.
The security trap nobody warns beginners about
You’ll eventually want to access your assistant from your phone. Every blog says “just set OLLAMA_HOST=0.0.0.0 and you’re good.” Read that setting carefully.
Binding to 0.0.0.0 exposes the Ollama API on all network interfaces with no authentication – any device on the same network can query your API and load arbitrary models on your machine (Sitepoint, 2026). Your Wi-Fi guests, your smart fridge, your neighbor if they’re already on the network – anyone can hit the endpoint. No password, no token, nothing.
The fix is a firewall rule scoped to your subnet:
# Linux with ufw - adjust CIDR to match your network
sudo ufw allow from 192.168.1.0/24 to any port 11434
sudo ufw deny 11434
Or skip remote access entirely and stay on the default 127.0.0.1:11434. If you genuinely need phone access, a VPN like Tailscale is the right answer – not opening the LAN.
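Before changing anything, it helps to know which side of that line a given OLLAMA_HOST value puts you on. A minimal sketch: bind_scope is a hypothetical helper that only inspects the string, treats an unset or empty value as the 127.0.0.1:11434 default, and does not check what the running server is actually bound to.

```shell
# bind_scope [OLLAMA_HOST value]
# Hypothetical helper: prints "loopback" if the address stays on this machine,
# "network" if other devices could reach it (for example 0.0.0.0).
bind_scope() {
  local host="${1:-127.0.0.1:11434}"
  # Strip an optional scheme prefix before matching.
  host="${host#http://}"
  host="${host#https://}"
  case "$host" in
    127.*|localhost|localhost:*|::1|\[::1\]*) echo "loopback" ;;
    *) echo "network" ;;
  esac
}

# Usage:
#   bind_scope "$OLLAMA_HOST"
# On Linux, `ss -ltn | grep 11434` shows the address the server is really listening on.
```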
Pitfalls
- Disk space disappears overnight. Models are large enough to fill all available storage on a typical computer (per Northwestern Feinberg’s LLM guide). Pulling Llama 3.1, Mistral, DeepSeek, and Qwen “to compare” eats 20+ GB before you’ve had coffee. Use ollama list and ollama rm regularly – and remember the 70B models alone run 38-48 GB each.
- Out-of-memory crashes look like a broken model. They’re not. Drop one tier: 13B → 7B, or 7B → 3B.
- Context is per-session. Open WebUI saves chat logs, but the model itself doesn’t “learn” between conversations. Long-term memory means building a RAG pipeline – a separate project.
- Don’t judge on the first response. The first reply after a cold start includes the time to load the model into memory, so it is slow and not representative. Run two or three prompts before deciding whether the model is actually useful for your workflow.
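For the disk-space pitfall, a running total is more useful than eyeballing the list. A minimal sketch, assuming ollama list prints a header row and a SIZE column written as a number followed by GB or MB in the third and fourth fields; total_model_gb is a hypothetical helper and the column layout is an assumption about the CLI output.

```shell
# total_model_gb
# Hypothetical helper: reads `ollama list`-style output on stdin and prints
# the combined size of all listed models in GB.
# Assumes fields 3 and 4 hold the size and its unit (GB or MB).
total_model_gb() {
  awk 'NR > 1 {
    if ($4 == "GB") total += $3
    else if ($4 == "MB") total += $3 / 1024
  }
  END { printf "%.1f\n", total }'
}

# Usage (needs Ollama installed):
#   ollama list | total_model_gb
#   ollama rm mistral    # then free space by removing what you do not use
```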
What performance actually looks like
The verified number: at least 10 tok/s for interactive generation from a 7B-q4_K_M model on a modern CPU with 16 GB RAM (Sitepoint, 2026). That’s roughly reading speed – workable for chat, frustrating if you’re generating long documents. Apple Silicon does better because of unified memory, though exact speeds vary by chip generation and this guide has no benchmark it can cite for them.
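You can measure the figure on your own machine instead of trusting a benchmark. The final Ollama API response reports eval_count (tokens generated) and eval_duration (in nanoseconds), and dividing one by the other gives tokens per second. A minimal sketch: json_number and tokens_per_second are hypothetical helpers, and the grep-based field extraction is a crude assumption that only works for flat numeric fields; use a real JSON parser for anything more serious.

```shell
# json_number FIELD
# Hypothetical helper: pulls the first numeric value of FIELD out of JSON on stdin.
# Crude by design: it pattern-matches text and handles only unquoted integers.
json_number() {
  grep -o "\"$1\": *[0-9]*" | head -n 1 | cut -d: -f2 | tr -d ' '
}

# tokens_per_second EVAL_COUNT EVAL_DURATION_NS
# Converts a token count and a duration in nanoseconds into tokens per second.
tokens_per_second() {
  awk -v n="$1" -v ns="$2" 'BEGIN { printf "%.1f\n", n / ns * 1000000000 }'
}

# Usage (needs Ollama running; "stream": false returns one JSON object):
#   resp=$(curl -s http://localhost:11434/api/chat -d '{
#     "model": "llama3.1", "stream": false,
#     "messages": [{"role": "user", "content": "Explain DNS in one paragraph"}]
#   }')
#   tokens_per_second "$(printf '%s' "$resp" | json_number eval_count)" \
#                     "$(printf '%s' "$resp" | json_number eval_duration)"
```

For a quick check without the API, ollama run llama3.1 --verbose prints timing statistics after each reply.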
Quality? A local 7B sits somewhere in the GPT-3.5 neighborhood for general chat – weaker on hard math and multi-step code, surprisingly decent for summarization and rewriting. It is not a frontier model. Anyone telling you otherwise hasn’t actually compared both tools this week.
Which raises the real question most tutorials dodge: what does “good enough” actually mean for your specific use case? The answer varies so much by workflow that no benchmark number settles it. The honest move is to run your actual prompts – not synthetic demos – against both a local model and a cloud model for a week, then decide.
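Running your own prompts is easier to stick with if it is one command. A minimal sketch, assuming a plain text file with one prompt per line: run_prompts is a hypothetical helper that passes each prompt as the last argument to whatever command you give it and saves each answer to its own file, so the same prompt file can be replayed against a local model and, with a different command, against a cloud one.

```shell
# run_prompts FILE OUTDIR CMD [ARGS...]
# Hypothetical helper: runs CMD once per non-empty line of FILE, with the line
# appended as the final argument, and writes each answer to OUTDIR/prompt-N.txt.
# Prints the number of prompts processed.
run_prompts() {
  local file="$1" outdir="$2" n=0 prompt
  shift 2
  mkdir -p "$outdir"
  while IFS= read -r prompt || [ -n "$prompt" ]; do
    [ -n "$prompt" ] || continue
    n=$((n + 1))
    # /dev/null on stdin keeps CMD from swallowing the rest of the prompt file.
    "$@" "$prompt" < /dev/null > "$outdir/prompt-$n.txt"
  done < "$file"
  echo "$n"
}

# Usage (needs Ollama installed; my-prompts.txt is your own file):
#   run_prompts my-prompts.txt answers-local ollama run llama3.1
```

Reading the saved answers side by side at the end of the week is a fairer test than judging each reply in the moment.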
When NOT to self-host
- You need frontier-level reasoning – complex code, hard math, nuanced legal or medical analysis.
- You have none of the following: 16 GB of RAM, an Apple Silicon chip, or a dedicated GPU.
- Your privacy concern is vague rather than concrete. If you can’t name the specific data you’re protecting, the cloud is fine.
- You’d use it less than 10 hours a month. Setup time alone exceeds the value.
- You want voice, image generation, and web browsing in one product. Cloud assistants ship that integrated; locally you’re bolting together five separate projects.
Self-hosting wins when you have specific privacy requirements (HIPAA-adjacent work, client NDAs, internal documents), when you’re building a product on top of an LLM and want predictable cost, or when you genuinely want to learn how the stack works. Those are real reasons. “OpenAI is creepy” is not, by itself, enough to justify the maintenance burden.
FAQ
Can I run this on a 5-year-old laptop with 8 GB of RAM?
Yes – Llama 3.2 3B or Phi-3 Mini will run. Anything larger will swap to disk and become unusable.
How does this compare to paying for ChatGPT Plus?
Different products, honestly. ChatGPT Plus gets you image generation, voice mode, web search, and code execution for $20/month, plus a model that handles hard reasoning problems. A self-hosted setup gives you a chat interface with a smaller model, no usage limits, and full data control. If your workflow is “write a doc, ask questions, summarize PDFs” – local is enough. If you need the model to reason through genuinely hard problems or browse the web, pay the $20. A local 7B isn’t going to close that gap anytime soon.
Is Ollama or Jan better for beginners?
Jan if you want a single app that just works. Ollama if you want to script, extend, or integrate later. Most people who stick with self-hosting end up on Ollama plus Open WebUI within a month anyway – so the starting point matters less than you’d think.
Your next move
Open a terminal, run ollama pull llama3.2, then ollama run llama3.2, and ask it something specific to your actual work. Not “hello” – something you’d genuinely use AI for tomorrow. Whatever you think of the answer, that’s your real baseline. Decide from there whether to invest the weekend in Open WebUI, or close the tab and go back to ChatGPT with no shame.