How to use this calculator
Enter the model’s size in billions of parameters, pick the quantization you’ll run, and choose your GPU (or enter a custom memory bandwidth). The calculator estimates generation speed in tokens per second (the speed you feel while the model is typing its reply), plus the weights size and a rough VRAM figure. The table compares common quant levels so you can see the speed-vs-quality trade-off for your exact hardware.
If you’ve measured your real tokens/sec on a known model, nudge the efficiency slider until the estimate matches — then its predictions for other models and quants on your rig get more accurate.
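The calibration step can be sketched in a few lines of Python. This assumes the efficiency slider is a single multiplicative factor in the calculator's formula, tokens/sec ≈ (memory bandwidth ÷ weights size) × efficiency. The function names and all the numbers below are hypothetical, chosen only to illustrate the arithmetic.

```python
def calibrate_efficiency(measured_tps: float, bandwidth_gbs: float, weights_gb: float) -> float:
    """Solve the bandwidth-bound formula for efficiency, given one real measurement."""
    return measured_tps * weights_gb / bandwidth_gbs


def tokens_per_sec(bandwidth_gbs: float, weights_gb: float, efficiency: float) -> float:
    """Bandwidth-bound estimate: (bandwidth / weights size) * efficiency."""
    return bandwidth_gbs / weights_gb * efficiency


# Hypothetical measurement: 70 tok/s on a model with 5 GB of weights,
# on a card with 500 GB/s of memory bandwidth.
efficiency = calibrate_efficiency(measured_tps=70.0, bandwidth_gbs=500.0, weights_gb=5.0)

# Reuse the calibrated efficiency to predict a different model (10 GB of weights)
# on the same card.
predicted = tokens_per_sec(bandwidth_gbs=500.0, weights_gb=10.0, efficiency=efficiency)

print(f"calibrated efficiency: {efficiency:.2f}")
print(f"predicted speed for the 10 GB model: {predicted:.1f} tok/s")
```

One measurement on your own rig stands in for everything the simple formula leaves out (backend, drivers, kernel overhead), which is why the calibrated value tends to carry over to other models on the same hardware.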
Why bandwidth, not TFLOPS, decides speed
Local LLM generation is memory-bandwidth-bound. To produce each token, the GPU streams the model’s weights out of VRAM once and does comparatively little arithmetic per byte. So the limiting factor is how fast the card can read memory — its GB/s — not its raw compute. That’s why a memory-bandwidth number predicts token speed far better than a TFLOPS rating, and why the formula is simply:
tokens/sec ≈ (memory bandwidth ÷ weights size) × efficiency
A 70B model at Q4_K_M is ~39 GB of weights; divide a card’s bandwidth by that and you’re in the right ballpark. Halve the weights (a lighter quant) and you roughly double the speed.
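As a rough worked example, the formula can be written out in Python. The 4.5 bits per weight is an assumption picked because it reproduces the ~39 GB figure for a 70B model; the 1,000 GB/s bandwidth and 0.7 efficiency are illustrative placeholders, not values taken from the calculator.

```python
def weights_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weights size in GB (1 GB = 1e9 bytes)."""
    return params_billions * bits_per_weight / 8


def tokens_per_sec(bandwidth_gbs: float, weights_gb: float, efficiency: float) -> float:
    """Bandwidth-bound estimate: (bandwidth / weights size) * efficiency."""
    return bandwidth_gbs / weights_gb * efficiency


# Assumed inputs, for illustration only.
BANDWIDTH_GBS = 1000.0   # hypothetical card
EFFICIENCY = 0.7         # hypothetical efficiency setting

weights_gb = weights_size_gb(params_billions=70, bits_per_weight=4.5)
speed = tokens_per_sec(BANDWIDTH_GBS, weights_gb, EFFICIENCY)

# Halving the weights (a lighter quant) doubles the estimate.
speed_half_weights = tokens_per_sec(BANDWIDTH_GBS, weights_gb / 2, EFFICIENCY)

print(f"weights: {weights_gb:.1f} GB")
print(f"estimated speed: {speed:.1f} tok/s")
print(f"with half the weights: {speed_half_weights:.1f} tok/s")
```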
Speed vs. fit: two different questions
“How fast will it run?” and “will it even fit?” are separate problems. This tool answers the first. A model can fit comfortably in VRAM and still be slow (big model, modest bandwidth), or be fast in theory but not fit at all (not enough VRAM, forcing slow CPU offload). For the fit side — VRAM for weights plus the KV cache at your context length — use the LLM VRAM Calculator. Run both before you buy a card or pick a model.
Picking a model for your card
If the estimated speed is comfortable (say 15+ tok/s for chat) and the model fits, you’re set. If it’s too slow, your levers are: a lighter quant (faster and smaller, slight quality cost), a smaller model, or a card with more bandwidth. For build-level guidance, see the single-RTX-5060 local-LLM walkthrough and the $1,000 local-AI workstation for a tuned setup.
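The "lighter quant" lever can be sketched as a small search: try quant levels from heaviest to lightest and keep the first one whose estimated speed meets your target. The quant names and bits-per-weight values below are illustrative assumptions, not the calculator's own table, and the bandwidth and efficiency are hypothetical. This checks speed only; whether the model fits in VRAM is a separate question.

```python
from typing import Optional

# Nominal bits per weight. Illustrative assumptions, not the calculator's table.
QUANT_BITS = {"FP16": 16.0, "Q8": 8.5, "Q4": 4.5}


def estimate_tps(params_billions: float, bits: float,
                 bandwidth_gbs: float, efficiency: float) -> float:
    """Bandwidth-bound speed estimate for one quant level."""
    weights_gb = params_billions * bits / 8
    return bandwidth_gbs / weights_gb * efficiency


def heaviest_quant_meeting(params_billions: float, bandwidth_gbs: float,
                           efficiency: float, target_tps: float = 15.0) -> Optional[str]:
    """Return the highest-precision quant whose estimate meets the target, or None."""
    for name, bits in sorted(QUANT_BITS.items(), key=lambda kv: -kv[1]):
        if estimate_tps(params_billions, bits, bandwidth_gbs, efficiency) >= target_tps:
            return name
    return None  # no quant is fast enough: try a smaller model or a faster card


# Hypothetical card: 450 GB/s at 0.7 efficiency.
for size in (14, 70):
    choice = heaviest_quant_meeting(size, bandwidth_gbs=450.0, efficiency=0.7)
    print(f"{size}B model -> {choice}")
```

When the function returns None, the remaining levers are the other two from the list above: a smaller model or a card with more bandwidth.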
A note on accuracy
These are estimates from the bandwidth-bound model, most reliable for larger models and short-to-medium contexts. Real numbers vary with context length (the KV cache grows and gets re-read), the inference backend, and whether the whole model fits in VRAM. Use the figure to compare GPUs and quants and to size expectations — not as a guaranteed benchmark.
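To get a feel for the context-length effect, one rough extension of the formula adds the KV cache to the bytes read per token. This is an assumption layered on top of the simple model, not necessarily what the calculator does, and the cache sizes, bandwidth, and efficiency below are hypothetical.

```python
def tokens_per_sec_with_kv(bandwidth_gbs: float, weights_gb: float,
                           kv_cache_gb: float, efficiency: float) -> float:
    """Estimate speed when each token reads the weights plus the current KV cache."""
    return bandwidth_gbs / (weights_gb + kv_cache_gb) * efficiency


# Hypothetical card (1000 GB/s, 0.7 efficiency) and ~39 GB of weights.
short_ctx = tokens_per_sec_with_kv(1000.0, 39.0, kv_cache_gb=0.5, efficiency=0.7)
long_ctx = tokens_per_sec_with_kv(1000.0, 39.0, kv_cache_gb=8.0, efficiency=0.7)

print(f"short context: {short_ctx:.1f} tok/s")
print(f"long context:  {long_ctx:.1f} tok/s")
```

The larger the cache is relative to the weights, the further the real speed drifts below the weights-only estimate, which is one reason the figure is most reliable for larger models at short-to-medium contexts.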