Jan v0.7.9: Local AI App for Mac (Install Guide)

Install Jan v0.7.9, the open-source local AI app for Mac. Real benchmarks (MLX vs GGUF), the M-series gotcha most guides miss, and clean uninstall steps.

8 min read · Intermediate

The #1 mistake people make installing a local AI app on Mac: they download Jan, grab the first GGUF model the Hub recommends, and never touch the engine settings. On Apple Silicon, that’s leaving roughly a quarter of your tokens-per-second on the table – sometimes more.

So this guide reverse-engineers the install around that decision. We’ll get Jan v0.7.9 running, but the goal isn’t merely an installed app – it’s Jan running on the right engine for your workload.

What Jan actually is (in one paragraph)

Jan is an open-source desktop app that runs LLMs locally and exposes an OpenAI-compatible API. It connects to GPT models via OpenAI, Claude models via Anthropic, plus Mistral, Groq, MiniMax, and others, and exposes a local server at localhost:1337 for other applications. Local models include Llama, Gemma, Qwen, GPT-oss, and others from Hugging Face. Think of it as a polished UI on top of llama.cpp and (newer) MLX, with a routing layer for cloud providers when you want them.

System requirements for the Mac build

Skip this section at your own risk – the macOS version floor matters because it gates MLX.

Component | Minimum | Recommended
OS | macOS 13.6 | macOS 14+ (required for MLX engine)
CPU | Intel x64 or Apple Silicon | M-series (M1 or newer)
RAM | 8 GB (3B-4B models only) | 16 GB+ for 7B-13B; 32 GB+ for 30B-class
Disk | ~5 GB for app + small model | 40+ GB if you collect bigger quants

The official docs spell out the architecture split: Apple Silicon Macs (M1, M2, M3) get Metal GPU acceleration enabled by default, making them faster than Intel Macs, which run on CPU only. If you’re still on an Intel Mac, you can run Jan, but treat it as a CPU-only tool – pick 3B-class models and don’t expect the MLX path to apply to you.
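Not sure which side of that split you’re on? Two stock macOS commands settle it from Terminal:

uname -m                  # arm64 = Apple Silicon; x86_64 = Intel
sw_vers -productVersion   # 14.x or later is needed for the MLX engine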

Download and install Jan v0.7.9

Go to jan.ai/download and grab the .dmg. The install dance itself is boring: open the .dmg, drag Jan to Applications, launch Jan. The first launch silently does two things worth knowing about.

  1. It auto-downloads a default foundation model so you can chat immediately – once the download completes, you’re ready to go with no further setup.
  2. It installs the Jan CLI – no extra steps needed. The binary lands at ~/.local/bin/jan on macOS/Linux; make sure that path is in your $PATH to use the jan command from any terminal.

Verify the install from Terminal:

jan --help
# If 'command not found', add to PATH:
echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.zshrc
source ~/.zshrc

The engine choice that nobody explains: MLX vs GGUF

This is where the install actually pays off – or doesn’t. Jan ships with two local engines: the mature llama.cpp path (GGUF models) and the newer MLX path. MLX uses Metal GPU acceleration for fast, efficient local inference – available on macOS 14+. MLX support in Jan is currently in beta and will be improved significantly over time.

The numbers tell a more interesting story than the docs do. In a direct test in Jan on a MacBook Pro with M4 Pro (16-core GPU, 48 GB), the native MLX implementation hit 44 t/s, significantly outpacing the GGUF implementation (35 t/s). This confirms that for smaller, faster models, MLX’s lower overhead and “zero-copy” optimization provide a tangible speed boost. That’s a ~25% free uplift just from picking the right engine.

But here’s the catch most tutorials skip. MLX actually lags behind llama.cpp in time-to-first-token (TTFT). Real-world data from a developer on an M1 Max shows that with a prompt of about 650 tokens, the effective tok/s (combining prefill and decode) for MLX was only 13 tok/s, while GGUF reached 20 tok/s. This is because MLX spent 94% of its time on prefill. Translation: MLX shines on long generations (coding agents, document drafts). For chat-style back-and-forth with short prompts, GGUF often feels faster even when raw tok/s is lower.
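To make “effective tok/s” concrete, here’s the arithmetic as a one-liner. The formula folds prefill time into the average; the speeds below are hypothetical round numbers for illustration, not the M1 Max measurements above:

# effective_tps = output_tokens / (prompt_tokens/prefill_tps + output_tokens/decode_tps)
# Illustrative: 650-token prompt, 150 output tokens, prefill 120 t/s, decode 44 t/s
echo "150 / (650/120 + 150/44)" | bc -l
# ≈ 17 t/s effective – well below the 44 t/s steady-state decode rate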

Pro tip: Default to MLX models from the mlx-community org on Hugging Face if your workload is generation-heavy. Switch to GGUF if it’s mostly short Q&A. Don’t pick once and forget.

First-time configuration that actually matters

After install, ignore the Hub for a minute and visit Settings first.

  • Model Providers → Hugging Face Token. Paste a token from huggingface.co/settings/tokens – you’ll need it for any gated model (Llama family especially). You can sanity-check the token with the snippet after this list.
  • Engine settings → ngl (GPU layers). Default is usually fine on M-series. Override only if you’re hitting OOM on a model that’s clearly within your unified-memory budget.
  • Hub → search by ID. Enter a model’s Hugging Face ID (e.g., org/model_name_or_id) in the Hub’s search bar. For MLX, search mlx-community/... directly.
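A quick way to confirm the token itself is valid before blaming Jan for a failed gated download – whoami-v2 is Hugging Face’s identity endpoint; swap in your own token for the placeholder:

curl -s -H "Authorization: Bearer hf_xxxx" https://huggingface.co/api/whoami-v2
# A valid token returns your account details as JSON; an invalid one returns an error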

Verify the API server is alive:

curl http://localhost:1337/v1/models
# Should return JSON listing whatever you've loaded.
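To go one step beyond listing models, send a minimal chat completion through the same OpenAI-compatible server. The model name is a placeholder – use an ID from the /v1/models response:

curl http://localhost:1337/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "<model-id-from-above>", "messages": [{"role": "user", "content": "Say hello in five words."}]}'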

Common errors and the fixes that actually work

Three you’ll likely hit, in roughly this order of frequency.

“Failed to Fetch” or “Something’s Amiss” mid-chat. Per Jan’s troubleshooting docs, the usual culprit is GPU layers, not a network problem. In Engine Settings in the right sidebar, check whether your ngl (number of GPU layers) value is too high; start lower and increase gradually based on your GPU memory. Counterintuitively, this also fires on Apple Silicon when the model is too large for available unified memory – a clearer OOM message would help, but isn’t there yet (as of v0.7.x).

EACCES: permission denied on first launch. The error message references the bundled extension: Error invoking remote method ‘installExtension’: Error Package /Applications/Jan.app/Contents/Resources/app.asar.unpacked/pre-install/janhq-assistant-extension-1.0.0.tgz does not contain a valid manifest: Error EACCES: permission denied. Don’t chmod files inside the bundle – tampering with a signed .app risks breaking Gatekeeper validation. Trash Jan, wipe the caches (commands in the uninstall section below), and reinstall from a fresh download.

RAM pressure during long chats. The general rule from Jan’s own docs: choose models that use less than 80% of your available RAM. On a 16 GB Mac, that effectively caps you at 7B Q4 with comfortable headroom. The unified memory architecture is generous but not magic.
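A rough sizing sanity check, under the common approximation that a Q4 quant needs about 0.6 GB per billion parameters plus a couple of GB for KV cache and runtime overhead (an estimate, not Jan’s official formula):

echo "7 * 0.6 + 2" | bc -l    # 7B at Q4: ≈ 6.2 GB total footprint
echo "16 * 0.8" | bc -l       # 80% of 16 GB = 12.8 GB budget – fits with headroom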

The data folder trap when relocating

One thing worth flagging because it’s caused real disk-space surprises: when you relocate the Jan data folder, it is duplicated into the new location while the original folder remains intact, and an app restart is required afterward.

If you have 30 GB of models, you now have 60 GB. Manually delete the old path after confirming the new one works.
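A minimal sketch of that cleanup, assuming the default ~/Jan location and a hypothetical new destination – verify both paths on your machine before running anything destructive:

du -sh ~/Jan /Volumes/External/Jan   # hypothetical new path; sizes should roughly match
# Only once Jan runs cleanly from the new location:
rm -rf ~/Jan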

Upgrading and uninstalling cleanly

Upgrading from v0.6.x or earlier? The simplest path is replace-in-place: download the new .dmg from github.com/janhq/jan/releases, drag it over the existing Jan in /Applications, choose Replace. Your data folder is untouched.

If something’s stuck, do a Terminal-level clean uninstall (commands compiled from community guides – verify each path before running):

killall Jan
rm -rf "/Applications/Jan.app"
rm -rf ~/Library/Caches/jan.ai.app
rm -rf ~/Library/Caches/9e2ed717.jan.ai.app
# Optional - wipes ALL models and chats:
rm -rf ~/Jan

The two cache paths (~/Library/Caches/jan.ai.app and ~/Library/Caches/9e2ed717.jan.ai.app) are where Jan stores its caches on macOS. Skip the last command if you plan to reinstall – your downloaded models and chats live in ~/Jan.

One thing the M5 changes

Worth noting because it’s recent and shifts the calculus. MLX works with all Apple silicon systems, and with the latest macOS beta it now takes advantage of the Neural Accelerators in the M5 chip, introduced in the 14-inch MacBook Pro. The Neural Accelerators provide dedicated matrix-multiplication operations, which are critical for many machine learning workloads, and enable even faster model inference on Apple silicon. If you’re on M5 hardware, the MLX engine in Jan is no longer just “slightly faster than GGUF on long generations” – the gap widens significantly. That tilts the engine choice further toward MLX for any M5 user.

FAQ

Does Jan run on Intel Macs?

Yes, but only on llama.cpp/CPU. MLX requires Apple Silicon plus macOS 14+, so the entire MLX-vs-GGUF discussion is moot on Intel. Stick to small quantized models.

Why does my first message take 30+ seconds when later ones are fast?

That’s MLX warmup combined with prefill on your first prompt. The model is being loaded into unified memory and the KV cache is being initialized. On an M1 Max running a 30B-class MLX model, the first generation can take roughly a minute end-to-end before tokens-per-second stabilizes – community benchmarks have measured around 40 of those 60 seconds as warmup. Subsequent messages in the same session reuse the warmed state, so they feel instant by comparison. If this latency is killing your workflow, switch that model to GGUF – its TTFT is consistently lower even though steady-state throughput is worse.

Can I use Jan as a drop-in replacement for the OpenAI API in my own apps?

Yes. Point your client at http://localhost:1337/v1, use any string as the API key, and existing OpenAI SDK code works. The catch: tool-calling fidelity depends entirely on the local model – small models hallucinate function arguments. Test before you ship.
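With the official OpenAI Python SDK (v1+), you often don’t need code changes at all: it reads the base URL and key from environment variables. A minimal sketch, assuming your script already uses that SDK (the script name is a placeholder):

export OPENAI_BASE_URL="http://localhost:1337/v1"
export OPENAI_API_KEY="any-non-empty-string"   # Jan accepts any key locally
python your_existing_script.py                 # hypothetical script – now talks to Jan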

Next step: open Settings → Hub right now, search mlx-community/Llama-3.2-3B-Instruct-4bit (or any MLX model that fits your RAM), and run a side-by-side test against the GGUF version of the same model. The 25% gap on Jan isn’t theoretical – it shows up on a stopwatch.
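A crude but honest way to run that stopwatch from Terminal: time an identical generation-heavy prompt against each engine’s copy of the model. The model ID below is an example – substitute whatever the two variants are called in your /v1/models list:

time curl -s http://localhost:1337/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "mlx-community/Llama-3.2-3B-Instruct-4bit", "messages": [{"role": "user", "content": "Write 300 words about the ocean."}], "max_tokens": 400}' > /dev/null
# Repeat with the GGUF model ID and compare wall-clock times;
# 'time' includes prefill, so this captures the TTFT tradeoff too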