You want a capable LLM that runs on your laptop, doesn’t phone home, and costs $0/month. Cloud APIs are great until your usage spikes, your compliance team panics, or your internet hiccups mid-demo. Microsoft’s Phi-3 fills that gap – and unlike a 70B-parameter monster, you can deploy it on hardware you already own.
This guide skips the marketing recap. It covers which variant to pick, what RAM you actually need, the version trap that silently breaks the 128K context window, and how to confirm the thing works end to end.
Which variant you’re actually deploying
Phi-3 is a family. Picking the wrong size is the most common deployment mistake – and it’s almost never discussed upfront. Three sizes shipped: mini at 3.8B parameters, small at 7B, medium at 14B. The series has since expanded – Phi-3.5 and Phi-4 now sit under the same Phi brand on Azure – but mini and the 3.5 refresh remain the practical choices for local deployment in 2025.
Sizing rule of thumb: mini if you’re on 8 GB RAM, medium if you have 32 GB+ and reasoning quality matters more than speed.
| Variant | Params | Context | MMLU | Disk (q4) |
|---|---|---|---|---|
| Phi-3-mini | 3.8B | 4K / 128K | 69% | ~2.2 GB |
| Phi-3-small | 7B | 8K / 128K | – | – |
| Phi-3-medium | 14B | 4K / 128K | 78% | ~8 GB |
MMLU and disk figures from the Phi-3 technical report (mini: 69%, medium: 78%) and the Ollama library (mini q4: ~2.2 GB). Phi-3-small benchmarks are not independently verified here – check the technical report directly before relying on that variant.
Think of it like buying a car based on trunk size but forgetting to check fuel consumption. The disk footprint – 2.2 GB for mini-q4 – tells you almost nothing about how the model behaves when you actually feed it a long document. That gap between file size and runtime behavior is where most deployment surprises live.
System requirements (the honest version)
Most tutorials list “8 GB RAM” and stop. That number covers Phi-3-mini at 4K context with no other apps open. Beyond that:
- Phi-3-mini, 4K context, q4: 8 GB minimum, 16 GB comfortable. Any modern CPU.
- Phi-3-mini, 128K context: 16 GB minimum – the KV cache cost is real, covered in detail in the gotchas section below.
- Phi-3-medium, 4K context, q4: 16 GB minimum, 32 GB recommended.
- GPU (optional): an NVIDIA or AMD card with 8 GB+ VRAM, or Apple Silicon with equivalent unified memory, cuts inference latency by roughly 5-10×.
- OS: Windows 10+, macOS 11+, or any modern Linux distro.
- Disk: ~3 GB for mini-q4, ~8 GB for medium-q4.
The Raspberry Pi 5 runs Phi-3.5-mini via Ollama (quantized, ~2.2 GB download). If a Pi handles it, a mid-range laptop has no excuse.
Install Phi-3 with Ollama
Three deployment paths exist: Ollama (easiest), llama.cpp (most control), Hugging Face Transformers (most fragile). Ollama handles GGUF quantization, the local server, and the REST API in one binary – it’s the right starting point for most use cases.
Step 1 – Install Ollama
Grab the installer from the official Ollama download page. Ollama runs on Windows, macOS, and Linux. On Linux:
curl -fsSL https://ollama.com/install.sh | sh
On macOS or Windows, use the GUI installer. Then verify:
ollama --version
Check your version before anything else. The 128K context model requires Ollama 0.1.39 or later (per the Ollama library page). Older builds either silently fall back to the 4K variant or throw an error without explaining why. If you’re upgrading from an older Ollama install, do it now.
Step 2 – Pull the model
# Phi-3 mini (3.8B, default 4K context)
ollama pull phi3:mini
# Phi-3 medium (14B, needs more RAM)
ollama pull phi3:medium
# Phi-3.5 mini (128K context, ~2.2 GB)
ollama pull phi3.5
The phi3.5 tag’s ~2.2 GB download supports 128K tokens – confirmed on the Ollama library page as of 2025. Download time is bandwidth-dependent; expect 2-10 minutes.
Step 3 – First run
ollama run phi3:mini
Interactive prompt appears. Type anything. Response within a few seconds = working install. First inference takes 10-30 seconds while the model loads into RAM; everything after is faster.
Verify with the HTTP API
The CLI confirms the model loads. The API confirms your code can use it. Ollama exposes a local REST endpoint at localhost:11434 by default:
curl http://localhost:11434/api/generate -d '{
"model": "phi3:mini",
"prompt": "Reply with the single word: ready",
"stream": false
}'
JSON response with a response field = success. Connection refused? Run ollama serve in a separate terminal – the desktop app auto-starts the server, headless installs don’t.
LAN access: set `OLLAMA_HOST=0.0.0.0:11434` before running `ollama serve` to reach the API from other devices on your network. Useful for phone-based testing. Never expose this to the public internet without an auth layer in front.
Three gotchas most tutorials don’t cover
These don’t appear prominently in official docs or typical tutorials. All three have caused real deployment failures.
1. The knowledge cutoff is October 2023. Turns out this is harder to remember in practice than it sounds – the model is confident. Phi-3-mini is a static model trained on an offline dataset; anything it says about events after October 2023 is a hallucination regardless of how certain it sounds (per the HuggingFace model card, as of the original release). Pair it with retrieval if current facts matter.
2. Code generation outside Python is unreliable. The model card says the majority of Phi-3 training data is Python-based, using specific packages – typing, math, random, collections, datetime, itertools. Ask for Rust, Go, or even Python with pandas and you get plausible-looking but often wrong API calls. Verify everything non-trivial, and treat non-Python output as a draft, not a solution.
3. The 128K context window is a RAM trap. Here’s where file size becomes actively misleading. The 2.2 GB download for phi3.5 doesn’t change based on context length – but runtime memory does. The KV cache scales with sequence length, not weight size. Load 100K tokens into Phi-3-medium and you can push RAM usage past 20 GB on top of the model weights. The machine starts swapping to disk. Inference that took seconds now takes minutes. Fix: use mini for long-context jobs, or cap num_ctx in your Modelfile (8192 is a sane default for most tasks).
Why does this matter more for Phi-3 than larger models? Counter-intuitively, the small file size creates a false sense of safety – people assume a 2.2 GB model can’t possibly eat 20 GB of RAM. It can, if you fill that context window.
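Here’s what that num_ctx cap looks like in practice. A minimal sketch – the custom model name phi3-mini-8k is arbitrary, and 8192 is just the default suggested above; adjust both for your variant and RAM:

# Modelfile – pins the context window so the KV cache can't balloon past your RAM
FROM phi3:mini
PARAMETER num_ctx 8192

Build and run the capped variant:

ollama create phi3-mini-8k -f Modelfile
ollama run phi3-mini-8k

Prompts longer than the cap get truncated instead of silently pushing the machine into swap – usually the better failure mode.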
Common errors and fixes
- `Error: model 'phi3' not found` – you ran `run` before `pull`. Just use `ollama run phi3` – it pulls automatically.
- `connection refused` on port 11434 – the server isn’t running. Start it: `ollama serve`.
- Responses are gibberish – usually a corrupted download. Fix: `ollama rm phi3 && ollama pull phi3`.
- Extreme first-token delay – normal on first load; subsequent responses are faster once the model is resident in RAM.
- `CUDA out of memory` – drop to a smaller variant, or force CPU-only inference with `OLLAMA_NUM_GPU=0`.
Upgrade and uninstall
To pull the latest weights:
ollama pull phi3
# replaces old layers in place
Remove a single model:
ollama rm phi3:mini
Full uninstall on macOS: drag the app to trash, delete ~/.ollama. On Linux:
sudo systemctl stop ollama
sudo rm /usr/local/bin/ollama
sudo rm -rf /usr/share/ollama ~/.ollama
The ~/.ollama directory holds downloaded model weights – deleting it reclaims all disk space.
Where next
Phi-3 is running. The productive question now isn’t “what else can I configure” – it’s whether local inference actually fits your use case. If you need answers about events after October 2023, you need retrieval on top. If your codebase isn’t Python-heavy, the model’s output needs more scrutiny than you might expect from its benchmark scores.
Two concrete next steps: connect it to Open WebUI for a ChatGPT-style interface, or write a 20-line Python script against the Ollama REST API to feed it your own documents. Both take under an hour and turn a curiosity into something actually useful.
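For the second option, here’s a minimal sketch of that script. It assumes the default localhost:11434 endpoint and the phi3:mini tag pulled earlier, uses only the /api/generate fields already shown in the curl example, and notes.txt is a stand-in for whatever document you want to feed it:

import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def ask_phi3(question: str, document: str) -> str:
    """Send a local document plus a question to Phi-3 and return the model's answer."""
    payload = {
        "model": "phi3:mini",
        "prompt": f"Use only the document below to answer.\n\nDocument:\n{document}\n\nQuestion: {question}",
        "stream": False,
    }
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

if __name__ == "__main__":
    # notes.txt is a placeholder – point this at any text file you want summarized
    with open("notes.txt", encoding="utf-8") as f:
        doc = f.read()
    print(ask_phi3("Summarize the key points in three bullets.", doc))

Keep the document comfortably under your num_ctx cap (that’s tokens, not characters) or you’ll hit the RAM trap from gotcha 3.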
FAQ
Is Phi-3 actually open source?
Yes – MIT license, which allows commercial use, modification, and redistribution. No usage restrictions.
Should I use Phi-3 or wait for Phi-4?
Phi-3 (or Phi-3.5) has the most mature local deployment ecosystem right now – quantized GGUF builds are widely available and every major inference runtime supports it. The Phi-4 family is the newer generation under the same Phi brand on Azure, worth testing if you want the latest capabilities. The catch: Ollama support and quantized builds for Phi-4 variants may lag behind Phi-3.5. If you’re shipping something today, Phi-3.5 is the lower-risk pick; you can swap models later without rewriting anything – the Ollama API call is identical.
Can I fine-tune Phi-3 on my own data locally?
Not via Ollama. Fine-tuning happens upstream: use Hugging Face’s PEFT/LoRA workflow on the base weights, convert the adapter to GGUF format, then import the result into Ollama. LoRA on Phi-3-mini realistically needs a 16 GB+ VRAM GPU – and that’s the RAM-on-the-GPU figure, not system RAM, which trips up a lot of people who check the wrong number. Full fine-tuning needs substantially more. If you’re on a consumer GPU with 8 GB VRAM, a quantized LoRA setup is possible but you’ll need to be careful about batch size and sequence length – expect a fair amount of trial and error before the training run doesn’t OOM.
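For the final import step, Ollama’s Modelfile has an ADAPTER instruction that layers a GGUF LoRA onto a base model. A rough sketch – phi3-mini-lora.gguf and phi3-custom are placeholder names for whatever your conversion step actually produces:

# Modelfile – applies a converted LoRA adapter on top of the base weights
FROM phi3:mini
ADAPTER ./phi3-mini-lora.gguf

Then register it with ollama create phi3-custom -f Modelfile. The base model in FROM has to match the model the adapter was trained against, or output quality degrades badly.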