Stop Guessing Which LLM Your Hardware Can Actually Run
You've got a decent GPU and you want to run a local language model. So you head to HuggingFace, browse by size, pick something that fits in your VRAM, and hope for the best. But that approach is broken: the biggest model that fits isn't always the best one, and you have no way of knowing which 7B model outperforms which 13B model without hours of trial and error.
That's the problem whichllm solves. It's a command-line tool that auto-detects your hardware, pulls live benchmark data from HuggingFace, and tells you which local LLM you should run on your machine.
What It Does
whichllm is a Python tool (requires Python 3.11+) that scans your system's GPU, CPU, and RAM, then queries HuggingFace to find models that fit your hardware. It ranks them by real benchmark scores, not just parameter count. You run a single command, and it returns a ranked list with model names, quantization levels, quality scores, and estimated tokens per second.
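To make the "fits your hardware" step concrete, here is a minimal sketch of a VRAM fit check. It is not whichllm's actual code: the Candidate class, the bits-per-weight table, and the flat 1.5 GB overhead are all illustrative assumptions.

```python
from dataclasses import dataclass

# Rough bits per weight for a few common GGUF quantization levels.
# Illustrative figures only; they are NOT taken from whichllm.
BITS_PER_WEIGHT = {"Q4_K_M": 4.8, "Q8_0": 8.5, "F16": 16.0}


@dataclass
class Candidate:
    """Hypothetical record for one model/quantization pair."""
    name: str
    params_billion: float
    quant: str


def estimated_vram_gb(c: Candidate, overhead_gb: float = 1.5) -> float:
    """Weight size in GB plus a flat allowance for KV cache and buffers."""
    # params (billions) * bits per weight / 8 bits per byte = gigabytes
    weight_gb = c.params_billion * BITS_PER_WEIGHT[c.quant] / 8
    return weight_gb + overhead_gb


def fits(c: Candidate, vram_gb: float) -> bool:
    """True if the estimated footprint fits in the available VRAM."""
    return estimated_vram_gb(c) <= vram_gb
```

Under these assumptions, a 7B model at Q4_K_M needs about 5.7 GB and fits an 8 GB card, while a 13B model at F16 needs about 27.5 GB and does not fit a 24 GB card.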
The ranking engine merges data from multiple benchmarks: LiveBench, Artificial Analysis, Aider, Chatbot Arena ELO, Open LLM Leaderboard, and multimodal/vision evaluations. Every score comes tagged with a confidence level (direct, variant, base, interpolated, or self-reported), so you know how reliable each recommendation is. It also factors in recency, so a 2024 model can't outrank a current-generation model on the strength of stale benchmark scores.
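One way to picture how confidence tags could feed into a merged score is a confidence-weighted mean. The sketch below is an assumption for illustration: the weights and the merged_score function are hypothetical, and the article does not describe whichllm's actual formula.

```python
# Hypothetical trust weights for each confidence tag named above.
# The numeric values are assumptions, not whichllm's real settings.
CONFIDENCE_WEIGHT = {
    "direct": 1.0,
    "variant": 0.85,
    "base": 0.7,
    "interpolated": 0.5,
    "self-reported": 0.3,
}


def merged_score(scores: list[tuple[float, str]]) -> float:
    """Confidence-weighted mean of (score, confidence_tag) pairs.

    Scores are assumed to be normalized to a shared 0-100 scale already.
    """
    total_weight = sum(CONFIDENCE_WEIGHT[tag] for _, tag in scores)
    if total_weight == 0:
        return 0.0  # no evidence at all
    weighted_sum = sum(s * CONFIDENCE_WEIGHT[tag] for s, tag in scores)
    return weighted_sum / total_weight
```

With these weights, a directly measured 80 and a self-reported 40 merge to roughly 70.8, so the weaker evidence pulls the result down less than a plain average would.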
You can also simulate hardware you don't own yet. Want to know what you'd get with an RTX 4090 before buying one? Just pass --gpu "RTX 4090" and whichllm will show you the top picks as if you had that card. There's a plan command that works backward — tell it a model name and it tells you what GPU you need to run it. And if you just want to get started immediately, whichllm run "qwen 2.5 1.5b gguf" launches a chat session.
Why It's Cool
The obvious thing whichllm does is save you time — you don't have to manually cross-reference model sizes against your VRAM. But the deeper value is in the ranking logic.
It ranks by actual quality, not size. Most people assume a 13B model beats a 7B model. whichllm regularly recommends smaller models over larger ones because they score higher on real benchmarks. The README's example shows a 27.8B model ranked above a 32B model because it's a newer generation with better benchmark performance. That's the kind of insight you'd never get from a "what fits?" approach.
It's recency-aware. Old leaderboard scores don't get treated like new ones. Each model's lineage is tracked, and stale benchmarks are demoted. The benchmark snapshot date is printed under every ranking, so if you're looking at old data, you can see it immediately rather than trusting it silently.
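Demoting stale benchmarks can be sketched as an exponential decay on the age of a score. This is an assumed model, not whichllm's documented behavior: the function names and the one-year half-life are hypothetical.

```python
from datetime import date


def recency_factor(benchmark_date: date, snapshot_date: date,
                   half_life_days: float = 365.0) -> float:
    """Multiplier in (0, 1] that halves for every half_life_days of age."""
    # Clamp to zero so a benchmark dated after the snapshot is not boosted.
    age_days = max((snapshot_date - benchmark_date).days, 0)
    return 0.5 ** (age_days / half_life_days)


def demoted_score(raw_score: float, benchmark_date: date,
                  snapshot_date: date) -> float:
    """Raw benchmark score scaled down by how old the benchmark is."""
    return raw_score * recency_factor(benchmark_date, snapshot_date)
```

With a one-year half-life, a score of 80 measured a year before the snapshot counts as 40, so a fresh score of 70 would outrank it.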
Confidence grading prevents bad recommendations.