“What’s the best model I can actually run on my card?” is the most-asked question in local AI, and it resets every time a new GPU or model drops. The answer is almost always decided by one number: VRAM. Get the tier right and a local LLM is fast and private; get it wrong and you’re either offloading to system RAM at a crawl or running a model too small to be useful. Here’s the 2026 map, tier by tier, with the actual file sizes and the math behind them.

The rule of thumb

Memory for the weights scales with parameters and quantization:

  • ~2 GB per billion parameters at FP16 (half precision, the usual unquantized baseline).
  • ~0.5 GB per billion at 4-bit, then add 15–20% for the KV cache, activations, and framework overhead.
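As a rough illustration, the two bullets above can be turned into a few lines of Python. This is a minimal sketch: the function name `estimate_vram_gb` and its parameters are made up for this example, and the flat overhead fraction is a simplification of what the KV cache, activations, and framework actually consume.

```python
def estimate_vram_gb(params_billion, gb_per_billion=0.5, overhead=0.20):
    """Rule-of-thumb VRAM estimate: weights plus a flat overhead fraction.

    gb_per_billion: 0.5 for 4-bit, 2.0 for FP16 (figures from the rule of thumb).
    overhead: 0.15-0.20 for KV cache, activations, and framework overhead.
    """
    weights_gb = params_billion * gb_per_billion
    return weights_gb * (1.0 + overhead)


if __name__ == "__main__":
    # Print the 15%-20% overhead range for a few common model sizes at 4-bit.
    for size in (8, 14, 32, 70):
        low = estimate_vram_gb(size, overhead=0.15)
        high = estimate_vram_gb(size, overhead=0.20)
        print(f"{size}B at 4-bit: {low:.1f}-{high:.1f} GB")
```

These figures use the clean 0.5 GB/B estimate, so they come out a little lower than real Q4_K_M files.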

One important nuance: Q4_K_M isn’t true 4-bit. It’s mixed precision averaging about 4.83 bits per weight, so real GGUF files run a little larger than the clean 0.5 GB/B estimate — but it’s the everyday sweet spot, roughly half the VRAM of FP16 for only a ~3–5% quality drop. For a given card, fitting a bigger model at Q4_K_M usually beats a smaller one at Q8.
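To see how much the 4.83-bit average moves the estimate, here is a small sketch that computes weight size from bits per weight. The helper `gguf_weights_gb` is a hypothetical name, and the nominal sizes (8B, 32B, 70B) are an assumption: real parameter counts are often slightly above the nominal figure, which is one likely reason published files come out a bit larger than this arithmetic.

```python
def gguf_weights_gb(params_billion, bits_per_weight=4.83):
    """Approximate weight size in GB (1 GB = 1e9 bytes) from average bits per weight."""
    return params_billion * bits_per_weight / 8.0


if __name__ == "__main__":
    for size in (8, 32, 70):
        clean = size * 0.5              # the clean 4-bit estimate
        mixed = gguf_weights_gb(size)   # Q4_K_M at ~4.83 bits per weight
        print(f"{size}B: {clean:.1f} GB at true 4-bit, {mixed:.2f} GB at 4.83 bpw")
```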

For an exact figure including your context length, run the numbers through the LLM VRAM Calculator; for how fast it’ll generate, the LLM Speed Calculator.

The tiers (all at Q4_K_M)

8 GB — entry (RTX 3060 8GB, 4060)

7–8B models. Qwen3-8B at Q4_K_M is a 5.03 GB file; with overhead it lands ~6–7 GB, comfortable at moderate context. A 12–14B fits only at a lower quant or short context. This tier is excellent for fast 8B chat and coding helpers — not for the 20B+ class.

12 GB — comfortable small (RTX 3060 12GB, 5070)

8B with room to spare, and 14B at Q4 (~9 GB). You can run longer contexts on an 8B here, or step up to a 14B for better reasoning. Still below the 20B class.
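Why longer contexts cost VRAM: the KV cache stores one key and one value vector per layer for every token in the context. The sketch below uses the standard sizing formula for a transformer with grouped-query attention. The configuration in the example (36 layers, 8 KV heads, head dimension 128, FP16 cache) is an illustrative assumption for an 8B-class model, not a figure from this article, so check your model's config before relying on the numbers.

```python
def kv_cache_gb(context_tokens, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """KV cache size in GB (1 GB = 1e9 bytes) for a given context length.

    The factor of 2 covers one key and one value vector per layer per token.
    bytes_per_value=2 assumes an FP16 cache.
    """
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * context_tokens
    return total_bytes / 1e9


if __name__ == "__main__":
    # Hypothetical 8B-class configuration (assumed values, see the text above).
    for ctx in (4096, 8192, 32768):
        size = kv_cache_gb(ctx, n_layers=36, n_kv_heads=8, head_dim=128)
        print(f"{ctx} tokens: {size:.2f} GB of KV cache")
```

The cache grows linearly with context, which is why the same 8B model that is comfortable at 8k can crowd an 8 GB card at 32k.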

16 GB — the mid tier (RTX 4060 Ti 16GB, 4080, 5080)

14B comfortably, and GPT-OSS-20B. OpenAI’s 20B is explicitly designed to run in 16 GB: it’s an MoE with only ~3.6B active params, MXFP4-quantized rather than Q4_K_M, with ~12 GB of model data and ~14.9 GB total at 8k context. Cap the context at around 32k to stay inside 16 GB. A 32B won’t fit here.

24 GB — the sweet spot (RTX 3090, 4090)

Up to ~32B at Q4_K_M. Qwen3-32B Q4_K_M is a 19.76 GB file, leaving headroom for context, and 14–20B models run easily at higher quants. This is the best single-card tier for serious local work. What it can’t do comfortably is a 70B at Q4 (that needs ~45–50 GB).

32 GB — headroom (RTX 5090)

32B at higher quants (Q5/Q6) or with long context, and GPT-OSS-20B with room to spare. A 70B at Q4 (~45–50 GB) still doesn’t fit a single 32 GB card — you’d be offloading. Great for running a 32B comfortably or several smaller models at once.

48 GB+ — the 70B club (2× 24GB, workstation cards, big unified memory)

70B at Q4_K_M needs roughly 45–50 GB, so the options are two 24 GB GPUs (48 GB), a 48 GB workstation card, or a unified-memory machine (64–128 GB Apple Silicon / Strix Halo) that trades speed for capacity. On an 80 GB card such as the H100, GPT-OSS-120B fits (~64 GB at 8k); the MI300X, at 192 GB, has room to spare.

The models worth running in 2026

The dependable open-weights lineup: GPT-OSS (20B for 16 GB cards; 120B for 80 GB), Alibaba’s Qwen3 family (0.6B–32B dense plus 30B-A3B and 235B MoE) and the newer Qwen3.5 open-weights family (Apache 2.0, released early 2026), DeepSeek V3/V3.2, Meta’s Llama 3.x, Mistral Small 3.x, and Google’s Gemma 3. A useful pattern: MoE models punch above their size for speed — GPT-OSS-20B and Qwen3-30B-A3B only activate a few billion parameters per token, so they generate faster than their total parameter count implies (though they still occupy their full size in VRAM).

Fit is half the question — speed is the other

A model fitting in VRAM doesn’t mean it runs fast. Generation speed is bound by memory bandwidth (each token reads the active weights once), so a big model on a modest card can fit yet feel sluggish. Check both before you commit to a model or a card: the VRAM calculator for fit and the speed calculator for tokens/sec. And if you’re building around a specific card, the single-RTX-5060 local-LLM walkthrough shows the full setup end to end.
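The bandwidth argument can be sketched as a ceiling on generation speed: if every token has to read the active weights once, tokens per second cannot exceed bandwidth divided by the size of those weights. The function name and the 900 GB/s figure below are assumptions chosen for illustration. Real throughput will be lower, because this ignores KV cache reads, compute time, and framework overhead.

```python
def max_tokens_per_sec(bandwidth_gb_s, active_params_billion, bits_per_weight=4.83):
    """Upper bound on tokens/sec when generation is memory-bandwidth bound.

    Each generated token reads the active weights once, so the ceiling is
    bandwidth divided by the size of the active weights.
    """
    active_gb = active_params_billion * bits_per_weight / 8.0
    return bandwidth_gb_s / active_gb


if __name__ == "__main__":
    bandwidth = 900.0  # GB/s, an assumed round figure for a high-end consumer card
    dense = max_tokens_per_sec(bandwidth, 32)  # dense 32B: all weights active
    moe = max_tokens_per_sec(bandwidth, 3)     # MoE with ~3B active per token
    print(f"dense 32B ceiling: {dense:.0f} tok/s")
    print(f"MoE, ~3B active, ceiling: {moe:.0f} tok/s")
```

This is also why MoE models feel fast: the ceiling depends on active parameters, while the VRAM footprint still depends on the total.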

Bottom line

For most homelab GPUs, the practical pick is simple: an 8B at Q4_K_M on 8–12 GB cards, a 14B–20B on 16 GB, and a 32B at Q4_K_M on a 24 GB card — the single-card sweet spot. Reach for 48 GB+ only when a 70B is genuinely worth the cost and complexity. Match the model to the VRAM tier, default to Q4_K_M, and you’ll get fast, private inference without fighting your hardware.
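The bottom-line picks can be written as a small lookup. This is only a sketch of the summary above: `TIERS` and `suggest_model` are hypothetical names, and the thresholds are the tier boundaries from this article rather than hard limits.

```python
# (minimum VRAM in GB, suggested pick), checked from the largest tier down.
TIERS = [
    (48, "70B at Q4_K_M"),
    (24, "32B at Q4_K_M"),
    (16, "14B-20B"),
    (8, "8B at Q4_K_M"),
]


def suggest_model(vram_gb):
    """Return the suggested model class for a VRAM budget, or None below 8 GB."""
    for min_vram, pick in TIERS:
        if vram_gb >= min_vram:
            return pick
    return None


if __name__ == "__main__":
    for vram in (8, 12, 16, 24, 32, 48):
        print(f"{vram} GB -> {suggest_model(vram)}")
```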

Frequently asked questions

How much VRAM do I need to run a local LLM?
The rule of thumb is about 2 GB of VRAM per billion parameters at FP16, or roughly 0.5 GB per billion at 4-bit, then add 15–20% on top for the KV cache, activations, and framework overhead. Note that Q4_K_M isn’t true 4-bit (it averages ~4.83 bits per weight), so real GGUF files run a little larger than the 0.5 GB/B estimate: an 8B model at Q4_K_M is a ~5 GB file that needs ~6–7 GB with overhead, a 32B is a ~20 GB file before overhead, and a 70B needs roughly 45–50 GB in total.
What local LLM can I run on an 8GB GPU?
An 8 GB card (RTX 3060 8GB, 4060) comfortably runs 7–8B models at Q4_K_M — Qwen3-8B at Q4_K_M is a 5.03 GB file, leaving room for moderate context. You can squeeze a 12–14B at a lower quant or short context, but it’s tight. 8 GB is the entry tier: great for fast 8B chat and coding assistants, not for anything in the 20B+ class.
What can a 24GB GPU like a 3090 or 4090 run locally?
A 24 GB card is the local-LLM sweet spot: it runs up to ~32B models at Q4_K_M with room for context (Qwen3-32B Q4_K_M is a 19.76 GB file) and handles 14–20B models easily at higher quants. What it can’t do comfortably is a 70B at Q4 — that needs ~45–50 GB, so you’re looking at two 24 GB cards (48 GB) or heavy CPU offload (which is slow). For most people, 24 GB + a good 32B model is the practical ceiling for fast, fully-GPU inference.
Is Q4_K_M quantization good enough, or should I run higher?
Q4_K_M is the widely-recommended default — it’s roughly half the VRAM of FP16 with only about a 3–5% quality drop on most benchmarks, which is why it’s called the everyday sweet spot. Go higher (Q5_K_M, Q6_K, Q8) only if you have spare VRAM and want the last few percent of quality; go lower than Q4 only when you’re desperate to fit a bigger model, since quality falls off faster below 4-bit. For a given card, fitting a bigger model at Q4_K_M usually beats running a smaller model at Q8.
Can I run a 70B model locally, and on what hardware?
Yes, but not on a single consumer 24 GB card at usable speed. A 70B at Q4_K_M needs roughly 45–50 GB of VRAM including overhead, so the realistic options are two 24 GB GPUs (2× 3090/4090 = 48 GB), a 48 GB workstation card, or a unified-memory machine (a 64–128 GB Apple Silicon or Strix Halo box) that trades raw speed for capacity. On a single 24 or 32 GB card you’d be offloading layers to system RAM, which works but is much slower.
What are the best local LLMs to run in 2026?
The dependable 2026 lineup: OpenAI’s GPT-OSS (20B, designed to run in 16 GB; and 120B for 80 GB cards), Alibaba’s Qwen3 family (0.6B–32B dense plus 30B-A3B/235B MoE) and the newer Qwen3.5 open-weights family (Apache 2.0, released early 2026), DeepSeek V3/V3.2, Meta’s Llama 3.x, Mistral Small 3.x, and Google’s Gemma 3. For most homelab GPUs, an 8B (8–12 GB cards) or a 32B (24 GB cards) at Q4_K_M is the practical pick; MoE models like GPT-OSS-20B run faster than their total size suggests because only a few billion parameters are active per token.