Pick your GPU, the model size you want to run, a quantization level, and a context length. The calculator estimates total VRAM as three parts: the model weights, the KV cache that holds your context, and a small runtime overhead. Here is how to read those numbers.

Weights are the big, predictable cost. Weight memory in bytes is the parameter count times the quantization's bits per weight, divided by eight. An 8B model at FP16 (16 bits per weight) is about 16 GB; the same model at Q4_K_M (roughly 4.8 bits per weight) is about 4.8 GB. This is why quantization is the first lever to reach for when a model does not fit.
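
The weight arithmetic can be sketched in a few lines of Python. The function name is made up for illustration, and the 4.8 bits per weight used for Q4_K_M is an approximation implied by the figures above, not an exact property of every GGUF file.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Weight memory in decimal GB: parameters x bits-per-weight / 8."""
    total_bits = params_billions * 1e9 * bits_per_weight
    return total_bits / 8 / 1e9  # bits -> bytes -> GB


# Assumption: ~4.8 bits per weight for Q4_K_M (approximate, varies by model).
fp16_gb = weight_memory_gb(8, 16)
q4_gb = weight_memory_gb(8, 4.8)
print(f"FP16: {fp16_gb:.1f} GB, Q4_K_M: {q4_gb:.1f} GB")
```

Running it prints 16.0 GB for FP16 and 4.8 GB for Q4_K_M, the two figures quoted above.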

The KV cache scales with context, not parameters. Every token in your context window is cached per layer, so a long context can quietly cost more memory than the weights on a small model. The tool estimates the cache from the attention shape of modern GQA models and lets you quantize it (Q8 or Q4) to claw back room.
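
A minimal sketch of the usual KV cache formula for a GQA model follows. The layer count, KV head count, and head dimension are assumptions for an 8B-class model, not values read from this tool, and treating Q8 and Q4 as exactly 1 and 0.5 bytes per element ignores the small per-block scale overhead of real quantized caches.

```python
GIB = 1024 ** 3


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_tokens,
                   bytes_per_element=2.0):
    """Bytes needed to cache keys and values for every token in context."""
    # The leading 2 covers one key tensor plus one value tensor per layer.
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_element


# Assumed 8B-class GQA shape: 32 layers, 8 KV heads, head dimension 128.
shape = dict(n_layers=32, n_kv_heads=8, head_dim=128, context_tokens=32 * 1024)

fp16 = kv_cache_bytes(**shape, bytes_per_element=2.0)
q8 = kv_cache_bytes(**shape, bytes_per_element=1.0)   # simplified Q8
q4 = kv_cache_bytes(**shape, bytes_per_element=0.5)   # simplified Q4

print(fp16 / GIB, q8 / GIB, q4 / GIB)  # 4.0 2.0 1.0
```

Note that the parameter count never appears in the formula: only the attention shape and the context length do.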

Headroom matters. The verdict bands leave a margin: a model that needs 92 percent of your VRAM is marked tight, not safe, because the runtime, the desktop, and transient spikes all want a slice. Size for the safe (green) band and you avoid out-of-memory errors mid-generation.
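
One way to picture the bands is a simple threshold check. This is only a sketch: the 85 percent cutoff and the band names are hypothetical, since the calculator's actual thresholds are not stated here. The only fixed point taken from the text is that 92 percent lands in the tight band.

```python
def verdict(required_gb: float, vram_gb: float, safe_fraction: float = 0.85) -> str:
    """Classify an estimate by the share of VRAM it would occupy.

    safe_fraction is a hypothetical cutoff, not the tool's real value.
    """
    utilization = required_gb / vram_gb
    if utilization > 1.0:
        return "does not fit"
    if utilization > safe_fraction:
        return "tight"
    return "safe"


print(verdict(0.92 * 24, 24))  # tight
print(verdict(12, 24))         # safe
print(verdict(25, 24))         # does not fit
```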

These are planning estimates that assume a llama.cpp / GGUF-style setup. Your exact numbers depend on the specific model, the runtime, and how much you batch.

[Interactive calculator. Inputs: your GPU memory and the model. Outputs: estimated VRAM to load this model and context, broken down into weights and KV cache against your VRAM, plus a table of this model at each quantization with the columns Quant, bpw, Total VRAM, and Verdict.]

Estimates assume a modern grouped-query-attention (GQA) architecture and llama.cpp / GGUF-style quantization. Real usage varies by model, runtime, and batch size; treat these as planning numbers with a little headroom.

Frequently asked questions

How accurate is this VRAM estimate?
The weights figure is precise: parameters times bits-per-weight divided by eight. The KV cache figure is an estimate, because it depends on a model’s exact layer count, attention heads, and head dimension. This tool uses the math of modern grouped-query-attention models (Llama 3.x and Qwen2.5 class), which reproduces the commonly cited result of about 4 GB of KV cache for an 8B model at 32K context in FP16. Plan for the number shown plus roughly 10 to 15 percent, since your runtime, batch size, and any draft model add a little on top.
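
The planning margin is simple arithmetic. As a small sketch, with a made-up helper name and the 10 to 15 percent range taken from the answer above:

```python
def planning_range_gb(estimate_gb: float,
                      low_margin: float = 0.10,
                      high_margin: float = 0.15) -> tuple[float, float]:
    """Return the (low, high) VRAM budget to plan for around an estimate."""
    return estimate_gb * (1 + low_margin), estimate_gb * (1 + high_margin)


low, high = planning_range_gb(10.0)
print(f"Plan for {low:.1f} to {high:.1f} GB")  # Plan for 11.0 to 11.5 GB
```
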
What quantization should I use for a local LLM?
Q4_K_M is the sweet spot for most people: it cuts a model to about 30 percent of its FP16 size while keeping quality close to the original. If you have VRAM to spare, Q5_K_M or Q6_K give a small quality bump. Q8_0 is near-lossless and worth it only when memory is no object. Q3_K_M and Q2_K trade visible quality for size and are best reserved for fitting a model that otherwise would not load at all.
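
The same advice can be sketched as "take the highest-quality level whose weights fit". The bits-per-weight values below are approximations used for illustration (K-quants mix precisions, so real files vary by model), Q2_K is left out, and the function name and `reserved_gb` parameter are hypothetical.

```python
# Approximate bits per weight, ordered from highest quality to smallest size.
# These are illustrative assumptions, not exact values for any given model.
QUANTS = [
    ("Q8_0", 8.5),
    ("Q6_K", 6.56),
    ("Q5_K_M", 5.7),
    ("Q4_K_M", 4.8),
    ("Q3_K_M", 3.9),
]


def best_quant_that_fits(params_billions, vram_gb, reserved_gb):
    """Return the first (highest-quality) quant whose weights fit, else None.

    reserved_gb is the VRAM set aside for KV cache, overhead, and headroom.
    """
    budget_gb = vram_gb - reserved_gb
    for name, bpw in QUANTS:
        if params_billions * bpw / 8 <= budget_gb:
            return name
    return None


print(best_quant_that_fits(8, vram_gb=8, reserved_gb=1.5))    # Q5_K_M
print(best_quant_that_fits(8, vram_gb=24, reserved_gb=1.5))   # Q8_0
print(best_quant_that_fits(70, vram_gb=24, reserved_gb=1.5))  # None
```
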
Why does context length use so much memory?
Every token in the context, prompt and generated output alike, is stored in the KV cache for each layer, so memory grows linearly with context length. Doubling the context roughly doubles the KV cache. On smaller models, a long 128K context can cost more VRAM than the model weights themselves. Quantizing the KV cache to Q8 roughly halves that cost and Q4 roughly quarters it, usually with little quality loss for chat workloads.
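
To see where the cache overtakes the weights, divide the weight size by the per-token cache cost. This sketch assumes an 8B-class GQA shape (32 layers, 8 KV heads, head dimension 128) and Q4_K_M weights of about 4.8 GB; both are illustrative assumptions, and halving the per-token cost for Q8 ignores block-scale overhead.

```python
def break_even_tokens(weight_bytes: int, kv_bytes_per_token: int) -> int:
    """Context length at which the KV cache grows as large as the weights."""
    return weight_bytes // kv_bytes_per_token


# Assumed shape: 2 (K and V) x 32 layers x 8 KV heads x 128 head dim x 2 bytes.
PER_TOKEN_FP16 = 2 * 32 * 8 * 128 * 2   # 131072 bytes per token
WEIGHTS_Q4 = 4_800_000_000              # ~4.8 GB, 8B model at Q4_K_M

print(break_even_tokens(WEIGHTS_Q4, PER_TOKEN_FP16))       # 36621
print(break_even_tokens(WEIGHTS_Q4, PER_TOKEN_FP16 // 2))  # 73242 with a Q8 cache

# At a full 128K context the FP16 cache is far past the weights.
print(128 * 1024 * PER_TOKEN_FP16 / 1024 ** 3)             # 16.0 (GiB)
```
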
Can I run a model that is bigger than my VRAM?
Yes, with tradeoffs. Tools like llama.cpp and Ollama can offload some layers to system RAM and run them on the CPU; the model loads but generation slows down in proportion to how much sits off the GPU. Splitting the model across two GPUs is faster than CPU offload if you have a second card. Otherwise, drop to a smaller quantization or a smaller model size.
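
A rough way to guess how many layers can stay on the GPU is to divide the free VRAM by the per-layer weight size. This sketch assumes weights are spread evenly across layers, which is only approximately true, and the 80-layer 70B example and the 4 GB reserve are illustrative assumptions. In llama.cpp the resulting count is the kind of value passed to `--n-gpu-layers`.

```python
import math


def gpu_layers_that_fit(weights_gb, n_layers, vram_gb, reserved_gb):
    """Estimate how many layers fit on the GPU; the rest run from system RAM.

    Assumes equal-sized layers. reserved_gb covers KV cache and overhead.
    """
    per_layer_gb = weights_gb / n_layers
    usable_gb = max(vram_gb - reserved_gb, 0.0)
    return min(n_layers, math.floor(usable_gb / per_layer_gb))


# Assumed example: 70B at ~4.8 bpw is 42 GB over 80 layers, on a 24 GB card.
print(gpu_layers_that_fit(42.0, 80, 24.0, 4.0))  # 38 of 80 layers on the GPU
```
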
Does the calculator handle multiple GPUs?
It adds your cards together: total VRAM is per-card memory times the number of GPUs. That approximates how llama.cpp and vLLM split a model across cards. Real multi-GPU setups lose a little to communication and per-GPU buffers, so leave some headroom rather than sizing to the last gigabyte.
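
The pooling rule, with an optional allowance for per-GPU buffers, can be sketched as below. The `per_gpu_reserve_gb` parameter and its 1 GB example value are hypothetical; the calculator itself simply multiplies.

```python
def pooled_vram_gb(per_card_gb: float, n_gpus: int,
                   per_gpu_reserve_gb: float = 0.0) -> float:
    """Total VRAM across identical cards, minus an optional per-card reserve."""
    return n_gpus * (per_card_gb - per_gpu_reserve_gb)


print(pooled_vram_gb(24, 2))                          # 48.0, the simple sum
print(pooled_vram_gb(24, 2, per_gpu_reserve_gb=1.0))  # 46.0 with a reserve
```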