LLM VRAM Calculator: Will a Model Fit Your GPU?

Pick your GPU, the model size you want to run, a quantization level, and a context length. The calculator estimates total VRAM as three parts: the model weights, the KV cache that holds your context, and a small runtime overhead. Here is how to read those numbers.

Weights are the big, predictable cost. A model’s weight memory is its parameter count times the quantization’s bits per weight, divided by eight to convert bits to bytes. An 8B model at FP16 (16 bits per weight) is about 16 GB; the same model at Q4_K_M (about 4.83 bits per weight) is about 4.8 GB. This is why quantization is the first lever to reach for when a model does not fit.
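The weight arithmetic is short enough to sketch in Python. The function name `weight_gb` is made up for this illustration, and the 4.83 bits-per-weight figure for Q4_K_M is the one this page's worked examples use. Treat it as a sketch of the stated formula, not the calculator's actual source.

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Weight memory in GB (10^9 bytes): parameters x bits per weight / 8."""
    return params_billion * bits_per_weight / 8


print(round(weight_gb(8, 16), 1))    # 8B at FP16   -> 16.0
print(round(weight_gb(8, 4.83), 1))  # 8B at Q4_K_M -> 4.8
```

Because the parameter count is given in billions, the result comes out directly in gigabytes.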

The KV cache scales with context, not parameters. Every token in your context window is cached per layer, so a long context can quietly cost more memory than the weights on a small model. The tool estimates the cache from the attention shape of modern GQA models and lets you quantize it (Q8 or Q4) to claw back room.
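For a GQA model, the textbook cache size is one key vector and one value vector per token, per layer, per KV head. The sketch below applies that formula to an assumed Llama 3.1 8B shape (32 layers, 8 KV heads, head dimension 128). The calculator's exact internal formula is not shown on this page, so this may not match the tool digit for digit; `kv_cache_bytes` is a hypothetical helper name.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bytes_per_value=2):
    """Standard GQA KV cache size in bytes.

    The leading 2 counts one key and one value vector per token, per layer.
    bytes_per_value=2 is FP16; 1 and 0.5 approximate Q8 and Q4 caches
    (a simplification that ignores quantization scale overhead).
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * n_tokens


# Assumed shape for illustration: Llama 3.1 8B at an 8K context, FP16 cache.
size = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, n_tokens=8192)
print(size / 2**30)  # size in GiB -> 1.0
```

Note that the parameter count never appears: only the attention shape, the cache precision, and the number of tokens matter.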

Headroom matters. The verdict bands leave a margin: a model that needs 92 percent of your VRAM is marked tight, not safe, because runtimes, the desktop, and transient spikes all want a slice. Size for the green band and you avoid out-of-memory errors mid-generation.
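As a rough sketch of how such bands could work, the function below sorts a result into three verdicts. The 0.90 cutoff is a hypothetical value chosen for illustration; the page only establishes that 92 percent counts as tight and does not publish the tool's real thresholds.

```python
def verdict(needed_gb: float, vram_gb: float, tight_at: float = 0.90) -> str:
    """Classify an estimate against available VRAM.

    tight_at is an ASSUMED threshold, not the calculator's documented one.
    """
    if needed_gb > vram_gb:
        return "won't fit"
    if needed_gb / vram_gb >= tight_at:
        return "tight"
    return "fits"


print(verdict(6.7, 16.0))   # about 42 percent of VRAM
print(verdict(45.6, 48.0))  # 95 percent of VRAM
print(verdict(24.7, 24.0))  # over budget
```

The three calls use the totals from this page's worked examples.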

These are planning estimates that assume a llama.cpp / GGUF-style setup. Your exact numbers depend on the specific model, the runtime, and how much you batch.

Worked examples

These are this calculator’s own outputs, computed with the same formula the tool runs in your browser — so you can see a real answer without touching a single input.

Mainstream 16 GB gaming/AI desktop: RTX 5080 (16 GB) running Llama 3.1 8B at Q4_K_M with an 8K context and an unquantized FP16 KV cache. This is the calculator’s own default state on page load.

Inputs: Graphics card = RTX 5080 (16 GB); Number of GPUs = 1; Model size = 7-8B (params_B = 8); Quantization = Q4_K_M (bpw = 4.83); Context length = 8K (8192 tokens); KV cache precision = FP16 (multiplier = 1)
Result: Headline total: 6.7 GB. Weights: 4.8 GB. KV cache: 1.0 GB. Runtime overhead: 0.8 GB. (The rounded parts add up to 6.6 GB; the headline appears to be computed from the unrounded values.) Your VRAM: 16.0 GB. Verdict: Fits on 16.0 GB of VRAM, with 9.3 GB to spare. In the per-quant table every quantization from Q8_0 down fits (Q8_0 10.3 GB, Q6_K 8.4 GB, Q5_K_M 7.5 GB, Q4_K_M 6.7 GB, Q3_K_M 5.7 GB, Q2_K 5.2 GB); only FP16 at 17.8 GB is marked ‘Won’t fit’.

Single 24 GB card (RTX 3090 or RTX 4090) trying to run a 27-32B class model (Gemma 2 27B / Qwen2.5 32B) at Q4_K_M with a 32K context and FP16 KV cache. This is a very common single-card homelab ask, and the interesting result is that it lands barely over budget. The calculator sizes this tier at 27B parameters, so a true 32B model needs about 3 GB more for weights.

Inputs: Graphics card = RTX 3090 / 3090 Ti (24 GB) [the RTX 4090, also 24 GB, gives an identical result]; Number of GPUs = 1; Model size = 27-32B (params_B = 27); Quantization = Q4_K_M (bpw = 4.83); Context length = 32K (32768 tokens); KV cache precision = FP16 (multiplier = 1)
Result: Headline total: 24.7 GB. Weights: 16.3 GB. KV cache: 7.6 GB. Runtime overhead: 0.8 GB. Your VRAM: 24.0 GB. Verdict: Won’t fit on 24.0 GB of VRAM — 0.7 GB over budget. Warning line recommends dropping to Q3_K_M (3.91 bpw), or shortening the context / quantizing the KV cache / adding a second GPU. (Note: selecting Q8 KV instead of FP16 halves the cache to 3.8 GB and brings the total to 20.9 GB, which fits.)

Classic dual-GPU homelab inference box: two used RTX 3090s (24 GB each, pooled by llama.cpp / vLLM) running Llama 3.3 70B at Q4_K_M with an 8K context and FP16 KV cache.

Inputs: Graphics card = RTX 3090 / 3090 Ti (24 GB); Number of GPUs = 2; Model size = 70-72B (params_B = 70); Quantization = Q4_K_M (bpw = 4.83); Context length = 8K (8192 tokens); KV cache precision = FP16 (multiplier = 1)
Result: Headline total: 45.6 GB. Weights: 42.3 GB. KV cache: 2.5 GB. Runtime overhead: 0.8 GB. Your VRAM: 48.0 GB. Verdict: Tight on 48.0 GB of VRAM — about 45.6 GB needed, 2.4 GB to spare. No fallback warning is shown because the total is under the available VRAM.

Estimates assume a modern grouped-query-attention (GQA) architecture and llama.cpp / GGUF-style quantization. Real usage varies by model, runtime, and batch size; treat these as planning numbers with a little headroom.

Frequently asked questions

How accurate is this VRAM estimate?

The weights figure is precise: parameters times bits-per-weight divided by eight. The KV cache figure is an estimate, because it depends on a model’s exact layer count, attention heads, and head dimension. This tool uses the math of modern grouped-query-attention models (Llama 3.x and Qwen2.5 class), which reproduces the commonly cited result of about 4 GB of KV cache for an 8B model at 32K context in FP16. Plan for the number shown plus roughly 10 to 15 percent, since your runtime, batch size, and any draft model add a little on top.

What quantization should I use for a local LLM?

Q4_K_M is the sweet spot for most people: at about 4.83 bits per weight it cuts a model to roughly 30 percent of its FP16 size while keeping quality close to the original. If you have VRAM to spare, Q5_K_M or Q6_K give a small quality bump. Q8_0 is near-lossless and worth it only when memory is no object. Q3_K_M and Q2_K trade visible quality for size and are best reserved for fitting a model that otherwise would not load at all.
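One way to turn that advice into a rule is to walk the quantizations from highest quality to lowest and stop at the first one that fits. The sketch below does this with the per-quant totals from the first worked example (8B model, 8K context, FP16 KV cache). `best_quant` and its `headroom` parameter are hypothetical names for this illustration, not part of the calculator.

```python
# (quantization, total GB) in descending quality order, copied from the
# first worked example on this page (8B model, 8K context, FP16 KV cache).
TOTALS_8B_8K = [
    ("FP16", 17.8), ("Q8_0", 10.3), ("Q6_K", 8.4), ("Q5_K_M", 7.5),
    ("Q4_K_M", 6.7), ("Q3_K_M", 5.7), ("Q2_K", 5.2),
]


def best_quant(vram_gb, totals=TOTALS_8B_8K, headroom=0.0):
    """Return the highest-quality quantization whose total fits, or None.

    headroom is the fraction of VRAM to keep free (an assumed knob).
    """
    budget = vram_gb * (1 - headroom)
    for name, total_gb in totals:
        if total_gb <= budget:
            return name
    return None


print(best_quant(16.0))  # Q8_0
print(best_quant(8.0))   # Q5_K_M
```

Raising `headroom` pushes the pick toward smaller quantizations, which mirrors sizing for the green band rather than the last gigabyte.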

Why does context length use so much memory?

Every token you feed the model is stored in the KV cache for each layer, so memory grows in a straight line with context length. Doubling the context roughly doubles the KV cache. A long 128K context can cost more VRAM than the model weights themselves on smaller models. Quantizing the KV cache to Q8 halves that cost and Q4 quarters it, usually with little quality loss for chat workloads.
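The linear scaling is easy to tabulate. This sketch starts from the 8B figure in the first worked example (1.0 GB of FP16 cache at 8K) and applies the multipliers described above: 1 for FP16, one half for Q8, one quarter for Q4. The names are illustrative, and other models will have a different per-8K baseline.

```python
KV_MULTIPLIER = {"FP16": 1.0, "Q8": 0.5, "Q4": 0.25}


def kv_cache_gb(context_tokens, precision="FP16", gb_per_8k=1.0):
    """Scale a known 8K baseline linearly with context length.

    gb_per_8k=1.0 is the 8B-model figure from this page's first example.
    """
    return gb_per_8k * (context_tokens / 8192) * KV_MULTIPLIER[precision]


for ctx in (8192, 32768, 131072):
    sizes = {p: kv_cache_gb(ctx, p) for p in KV_MULTIPLIER}
    print(ctx, sizes)
```

At 128K the FP16 cache for this 8B model comes to about 16 GB, more than three times the 4.8 GB its Q4_K_M weights need.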

Can I run a model that is bigger than my VRAM?

Yes, with tradeoffs. Tools like llama.cpp and Ollama can offload some layers to system RAM and run them on the CPU; the model loads but generation slows down in proportion to how much sits off the GPU. Splitting the model across two GPUs is faster than CPU offload if you have a second card. Otherwise, drop to a smaller quantization or a smaller model size.
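To get a feel for how much of a model stays on the GPU under partial offload, the rough sketch below estimates how many layers fit. It assumes every layer is the same size and that the full KV cache and runtime overhead stay on the GPU, which real runtimes only approximate. The layer count of 46 is an assumed value for a 27B-class model, and the function name is hypothetical. The result is the kind of number you would try as a starting point for llama.cpp's `--n-gpu-layers` option.

```python
import math


def gpu_layers_that_fit(weights_gb, n_layers, vram_gb, kv_gb, overhead_gb):
    """Estimate how many equal-sized layers fit in the VRAM left over
    after the KV cache and runtime overhead (both simplifications)."""
    room_gb = vram_gb - kv_gb - overhead_gb
    per_layer_gb = weights_gb / n_layers
    return max(0, min(n_layers, math.floor(room_gb / per_layer_gb)))


# Second worked example: 16.3 GB of weights, 7.6 GB KV cache, 0.8 GB overhead
# on a 24 GB card, with an ASSUMED 46 layers.
print(gpu_layers_that_fit(16.3, 46, 24.0, 7.6, 0.8))  # 44
```

Under these assumptions only two layers would spill to system RAM, so the slowdown would be modest compared with offloading half the model.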

Does the calculator handle multiple GPUs?

It adds your cards together: total VRAM is per-card memory times the number of GPUs. That matches how llama.cpp and vLLM pool memory across cards. Real multi-GPU setups lose a little to communication and per-GPU buffers, so leave some headroom rather than sizing to the last gigabyte.
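The pooling rule, plus an optional allowance for per-GPU buffers, can be sketched as follows. The `per_gpu_reserve_gb` parameter is a hypothetical knob for this illustration; the calculator itself simply multiplies per-card memory by the GPU count.

```python
def pooled_vram_gb(per_card_gb, n_gpus, per_gpu_reserve_gb=0.0):
    """Total usable VRAM across identical cards.

    per_gpu_reserve_gb is an ASSUMED allowance for per-GPU buffers;
    leave it at 0 to mirror the calculator's plain multiplication.
    """
    return (per_card_gb - per_gpu_reserve_gb) * n_gpus


total = pooled_vram_gb(24, 2)       # two RTX 3090s
print(total)                        # 48
print(round(total - 45.6, 1))       # spare after the 70B example: 2.4
print(pooled_vram_gb(24, 2, 0.5))   # 47.0 with half a GB held back per card
```

With even a small per-card reserve, the 2.4 GB of spare room in the dual-3090 example shrinks quickly, which is why that configuration is marked tight.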

Related guides and tools

Which Local LLM Fits Your GPU in 2026? VRAM Tiers

How to Self-Host a Local LLM on a Single RTX 5060 in 2026

$1,000 Local AI Workstation: RTX 5080-Class Build Guide (2026)

RTX 5080 Performance-per-Watt as a 24/7 AI Inference Card (2026)

PSU Calculator: Power Supply Wattage Calculator
