
Best Local LLM for Your Hardware: whichllm Guide [2026]

A new Show HN tool ranks the best local LLM for your hardware by real benchmarks, not parameter count. Here's how it works and where it slips up.

6 min read · Beginner

So you want to run an LLM locally. Until recently, the only honest answer to “which one should I download?” was: try three, see which one doesn’t crash, give up. A Show HN tool that just blew up on Hacker News is trying to fix that. The question worth asking is whether it actually picks a better local LLM for your hardware than you would by eyeballing VRAM charts.

Short answer: yes, but with caveats the README doesn’t lead with.

The key takeaway upfront

whichllm is the tool – a Python CLI that auto-detects your GPU, CPU, and RAM, then ranks HuggingFace models by merged benchmark scores rather than “biggest thing that fits.” As of May 2026 it is at v0.5.2 and has accumulated over 500 GitHub stars since its creation in March 2026. The HN thread hit 144 points within a day.

If you’re choosing between this and the other tool everyone keeps confusing it with (CanIRun.ai), pick whichllm when you want a ranked recommendation you can run from one command. Pick CanIRun.ai when you don’t want to install anything.

Background – why this even matters

Local-LLM tooling has had a blind spot for two years: most “will it run?” checkers just compare model size to your VRAM. That answers the wrong question. A 32B model can “fit” your card and still be a worse pick than a 27B from a newer generation. whichllm’s own demo makes this explicit: on an RTX 4090 the 32B model fits fine, but the tool still ranks the 27B #1 because it scores higher on real benchmarks and comes from a newer generation. A size-only “what fits?” tool would hand you the bigger one.
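The difference between the two selection rules can be sketched in a few lines. This is a minimal illustration, not whichllm's actual code: the `Candidate` type and both picker functions are hypothetical names, and the three rows are the RTX 4090 figures from the README snapshot quoted later in this article, all assumed to fit in VRAM.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    params_b: float  # total parameters, in billions
    score: float     # merged benchmark score as shown in the snapshot


# Figures copied from the README snapshot (RTX 4090).
# Assumption: all three already passed the "fits in VRAM" check.
CANDIDATES = [
    Candidate("Qwen/Qwen3.6-27B", 27.8, 92.8),
    Candidate("Qwen/Qwen3-32B", 32.0, 83.0),
    Candidate("Qwen/Qwen3-30B-A3B", 30.0, 82.7),
]


def pick_by_size(models):
    """Size-only rule: the biggest model that fits wins."""
    return max(models, key=lambda m: m.params_b)


def pick_by_score(models):
    """Quality rule: the highest benchmark score among models that fit wins."""
    return max(models, key=lambda m: m.score)


if __name__ == "__main__":
    print("size-only pick:", pick_by_size(CANDIDATES).name)
    print("score-based pick:", pick_by_score(CANDIDATES).name)
```

With this data the size-only rule returns the 32B model and the score-based rule returns the 27B, which is the disagreement the demo highlights.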

whichllm vs CanIRun.ai – which one fits your workflow

Both tools dropped in 2026, both went viral on HN, both answer overlapping questions. They’re not the same product though.

| Feature | whichllm | CanIRun.ai |
|---|---|---|
| Format | Python CLI (pip/uv/Homebrew) | Website, WebGPU-based |
| Install required | Yes | No |
| Catalog size | Live HuggingFace data | ~60 curated open-source models |
| Ranking basis | Merged benchmarks + recency demotion | VRAM fit + bandwidth-bound speed estimate |
| Hardware detection | Reads your actual system | navigator.hardwareConcurrency + navigator.deviceMemory + WEBGL_debug_renderer_info against a database of ~40 GPUs and ~12 Apple Silicon chips |
| Best for | “What should I actually run?” | “Can my laptop handle this at all?” |

The real differentiator: whichllm tries to answer the quality question by merging real benchmarks (LiveBench, Artificial Analysis, Aider, multimodal/vision, Chatbot Arena ELO, Open LLM Leaderboard). CanIRun.ai is closer to a hardware-fit calculator with speed estimates. Different jobs.

Walkthrough: getting whichllm to give you a real answer

Install is a one-liner if you already have uv or pip. Then the basic flow:

# Install
pip install whichllm

# Auto-detect your hardware and rank models
whichllm

# Or simulate a GPU you're thinking of buying
whichllm --gpu "RTX 4090"

# Just pick the best one and start chatting
whichllm run

# Reverse lookup: what GPU do I need for this?
whichllm plan "llama 3 70b"

The output looks like this (snapshot from the README, RTX 4090):

#1 Qwen/Qwen3.6-27B 27.8B Q5_K_M score 92.8 27 t/s
#2 Qwen/Qwen3-32B 32.0B Q4_K_M score 83.0 31 t/s
#3 Qwen/Qwen3-30B-A3B 30.0B Q5_K_M score 82.7 102 t/s

Notice #3. That 102 t/s isn’t a typo – it’s a Mixture-of-Experts model with 3B active params per forward pass. whichllm ranks speed on active params and quality on total params, which is the correct way to do it but also the part most people miss. If you’re comparing rows in this table, you’re not comparing apples to apples on speed.
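To see why active parameters drive speed, it helps to estimate how many gigabytes of weights a decoder has to read per generated token. The sketch below is intuition only, not whichllm's formula: `gb_read_per_token` is a hypothetical helper, and the bits-per-weight values for Q4_K_M and Q5_K_M are rough assumptions.

```python
# Assumed average bits per weight for each quant (approximate, for illustration).
Q4_K_M_BPW = 4.85
Q5_K_M_BPW = 5.5


def gb_read_per_token(active_params_b: float, bits_per_weight: float) -> float:
    """Approximate GB of weights touched to generate one token.

    active_params_b is the number of parameters used per forward pass,
    in billions. For a dense model that is every parameter; for a
    Mixture-of-Experts model it is only the routed subset.
    """
    return active_params_b * bits_per_weight / 8


# Dense 32B at Q4_K_M: every weight is read for every token.
dense_gb = gb_read_per_token(32.0, Q4_K_M_BPW)

# MoE with 3B active parameters at Q5_K_M: only the active experts are read.
moe_gb = gb_read_per_token(3.0, Q5_K_M_BPW)

if __name__ == "__main__":
    print(f"dense 32B: {dense_gb:.2f} GB per token")
    print(f"MoE 3B active: {moe_gb:.2f} GB per token")
    print(f"idealized ratio: {dense_gb / moe_gb:.1f}x")
```

The snapshot's 102 vs 31 t/s gap is much smaller than this idealized ratio, so treat the sketch as showing the direction of the effect rather than its size.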

Pro tip: Before you trust any ranking, run whichllm hardware first to confirm the tool detected your GPU correctly. On laptops with switchable graphics it sometimes picks up the integrated chip, and every recommendation downstream is then wrong.

One feature most coverage skips: the snippet subcommand prints a copy-paste Python code snippet for any model, so you can drop it into a script instead of using the interactive chat. Handy if you’re building something rather than just kicking tires.

Edge cases the README doesn’t lead with

Three things to know before you take any ranking as gospel.

1. The estimates can be off for specific model/GPU combos. Within a day of launch, an HN commenter pointed out that for gpt-oss-120b on an RTX Pro 6000, every single number was off and the tool notably missed the most important quant for GPT-OSS, the MXFP4 variant. If you’re running a recently-released model, double-check the prediction against a real benchmark before committing to a download.

2. The speed numbers are bandwidth-bound estimates, not measurements. Both tools use roughly the same formula. CanIRun.ai documents theirs openly: inference speed ≈ Memory bandwidth ÷ Model VRAM × Efficiency (0.70 for discrete GPUs, 0.65 for Apple Silicon), with ±20% real-world variance depending on batch size, context length, and quantization format. A predicted 25 t/s realistically lands anywhere from 20 to 30 t/s before you even open a long context window.
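The formula CanIRun.ai documents is simple enough to sketch directly. The function names below are hypothetical, and the 1000 GB/s card with a 20 GB model is a made-up example rather than a measurement. Only the formula, the efficiency constants, and the ±20% spread come from the text above.

```python
EFFICIENCY_DISCRETE_GPU = 0.70   # from CanIRun.ai's documented formula
EFFICIENCY_APPLE_SILICON = 0.65  # from CanIRun.ai's documented formula


def estimate_tokens_per_second(bandwidth_gb_s: float,
                               model_vram_gb: float,
                               efficiency: float) -> float:
    """Bandwidth-bound estimate: bandwidth / model VRAM * efficiency."""
    if model_vram_gb <= 0:
        raise ValueError("model_vram_gb must be positive")
    return bandwidth_gb_s / model_vram_gb * efficiency


def variance_band(tokens_per_second: float, spread: float = 0.20):
    """Return the (low, high) range implied by a +/- spread."""
    return (tokens_per_second * (1 - spread),
            tokens_per_second * (1 + spread))


if __name__ == "__main__":
    # Hypothetical discrete GPU: 1000 GB/s bandwidth, 20 GB model.
    tps = estimate_tokens_per_second(1000.0, 20.0, EFFICIENCY_DISCRETE_GPU)
    low, high = variance_band(tps)
    print(f"estimate: {tps:.1f} t/s, plausible range {low:.1f}-{high:.1f} t/s")
```

Applying `variance_band(25.0)` gives the 20 to 30 t/s range quoted above.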

3. Unified-memory rigs need extra setup neither tool surfaces. If you’re running AMD Strix Halo (Ryzen AI Max+ 395 with 128GB unified RAM), the recommendation is technically correct but practically useless until you tune the kernel. Setting the Linux kernel params ttm.pages_limit=31457280 and ttm.page_pool_size=31457280 unlocks dynamic VRAM allocation up to 110-120 GB on Strix Halo. You set them once and don’t have to touch them again. Apple Silicon Just Works; AMD is competitive but needs this one-time fix.
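The value 31457280 looks arbitrary until you convert it. The check below assumes the limit is counted in memory pages and that the page size is 4 KiB (the usual x86-64 default). Both are assumptions for illustration, and `pages_to_gib` is a hypothetical helper.

```python
PAGE_SIZE_BYTES = 4096        # assumption: 4 KiB pages
TTM_PAGES_LIMIT = 31_457_280  # value quoted for ttm.pages_limit


def pages_to_gib(pages: int, page_size: int = PAGE_SIZE_BYTES) -> float:
    """Convert a page count into GiB."""
    return pages * page_size / 2**30


if __name__ == "__main__":
    print(f"{pages_to_gib(TTM_PAGES_LIMIT):.0f} GiB")
```

Under those assumptions the limit works out to 120 GiB, which matches the upper end of the 110-120 GB range.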

What the community actually runs locally (vs what gets recommended)

Here’s the gap nobody talks about. The rankings will happily suggest a 27B coding model. But should you run coding tasks locally at all? An HN veteran with 100+ hours of local-model experimentation concluded small models like Qwen3.5 9B are excellent for tool use, information extraction, and embedded applications – not for coding agents or general knowledge; for coding, cloud frontier models win every time.

So the ranking is honest about quality, but quality on a benchmark doesn’t equal usefulness for your task. Worth keeping in your head while reading the score column.

FAQ

Is whichllm free?

Yes. MIT-licensed, install with pip, uv, or Homebrew. No account, no cloud calls for the ranking itself.

How does it handle quantization choice?

It picks a quant per model based on what fits your VRAM and reports it in the output (Q4_K_M, Q5_K_M, etc.). As a rough rule, Q4_K_M typically uses about 60% less VRAM than F16 with only a small quality tradeoff, which is why it’s the most common pick for local AI. The tool also supports AWQ/GPTQ and full-precision formats such as FP16/BF16, but if you want one of those instead of the default quant, you’ll need to request it explicitly.

Should I use whichllm or just stick with Ollama’s model library?

Use both. Ollama’s library tells you what’s installable; whichllm tells you which of those is actually the strongest choice for your specific hardware right now. The two complement each other – whichllm can even generate the Ollama pull command for you.

Next step: run pip install whichllm && whichllm and post your top-3 result in the GitHub issues – the author has explicitly asked people to drop their picks there, and that’s also where the MXFP4-style edge cases get caught.