Estimated speed by quantization (your model + GPU)

Quant | Weights size | Est. tokens/sec | Est. VRAM need

This estimates generation (decode) speed, which is memory-bandwidth-bound: each new token requires reading the active weights from VRAM once, so tokens/sec ≈ (bandwidth ÷ weights size) × efficiency. It's a ballpark — most accurate for larger models; small models and long contexts run slower than the pure-bandwidth math suggests (compute, sampling, and KV-cache reads add overhead), so treat results as ±30%. For whether a model fits in VRAM, use the LLM VRAM Calculator.

How to use this calculator

Enter the model’s size in billions of parameters, pick the quantization you’ll run, and choose your GPU (or enter a custom memory bandwidth). The calculator estimates the generation speed in tokens per second — the speed you feel while the model is typing its reply — plus the weights size and a rough VRAM figure. The table compares common quant levels so you can see the speed-vs-quality trade for your exact hardware.
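A minimal sketch of how a comparison table like this could be computed, for readers who want to reproduce the numbers by hand. The bytes-per-parameter values are assumptions, not the calculator's internals: FP16 is 2 bytes, Q8_0 is taken as 8.5 bits, and Q4_K_M is back-derived from the ~39 GB figure this page quotes for a 70B model. The 20% VRAM overhead and 70% efficiency are the rough defaults mentioned elsewhere on this page, and the function names are hypothetical.

```python
# Approximate bytes per parameter (assumptions, not exact file sizes).
# Q4_K_M is back-derived from the ~39 GB / 70B figure quoted on this page.
BYTES_PER_PARAM = {
    "FP16": 2.0,
    "Q8_0": 8.5 / 8,
    "Q4_K_M": 39 / 70,
}


def weights_gb(params_billions, quant):
    """Weights size in GB: billions of parameters times bytes per parameter."""
    return params_billions * BYTES_PER_PARAM[quant]


def quant_table(params_billions, bandwidth_gbs, efficiency=0.7, overhead=0.20):
    """Return (quant, weights GB, est. tok/s, rough VRAM GB) for each quant."""
    rows = []
    for quant in BYTES_PER_PARAM:
        size = weights_gb(params_billions, quant)
        speed = bandwidth_gbs / size * efficiency  # bandwidth-bound estimate
        vram = size * (1 + overhead)               # weights + ~20% for KV cache etc.
        rows.append((quant, size, speed, vram))
    return rows


# Example: a 70B model on a 1008 GB/s card.
for quant, size, speed, vram in quant_table(70, 1008):
    print(f"{quant:<8}{size:>8.1f} GB{speed:>8.1f} tok/s{vram:>8.1f} GB VRAM")
```

The speed column ignores whether the model actually fits on the card; it only shows how speed scales with quant size at a fixed bandwidth.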

If you’ve measured your real tokens/sec on a known model, nudge the efficiency slider until the estimate matches — then its predictions for other models and quants on your rig get more accurate.
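The calibration step amounts to solving the speed formula for efficiency. The sketch below shows that arithmetic; the function names and the measured figure are hypothetical, chosen only to illustrate the idea.

```python
def calibrate_efficiency(measured_tps, bandwidth_gbs, weights_gb):
    """Solve tokens/sec = (bandwidth / weights) * efficiency for efficiency."""
    return measured_tps * weights_gb / bandwidth_gbs


def predict_tps(bandwidth_gbs, weights_gb, efficiency):
    """Bandwidth-bound estimate using a calibrated efficiency."""
    return bandwidth_gbs / weights_gb * efficiency


# Hypothetical measurement: 120 tok/s on a 4.5 GB model with a 1008 GB/s card.
eff = calibrate_efficiency(measured_tps=120, bandwidth_gbs=1008, weights_gb=4.5)
print(f"calibrated efficiency: {eff:.2f}")

# Reuse that efficiency for a different model on the same rig.
print(f"predicted for 39 GB of weights: {predict_tps(1008, 39, eff):.1f} tok/s")
```

Because small models tend to run further below the pure-bandwidth figure than large ones, calibrating on a model close in size to the one you care about should transfer better.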

Why bandwidth, not TFLOPS, decides speed

Local LLM generation is memory-bandwidth-bound. To produce each token, the GPU streams the model’s weights out of VRAM once and does comparatively little arithmetic per byte. So the limiting factor is how fast the card can read memory — its GB/s — not its raw compute. That’s why a memory-bandwidth number predicts token speed far better than a TFLOPS rating, and why the formula is simply:

tokens/sec ≈ (memory bandwidth ÷ weights size) × efficiency

A 70B model at Q4_K_M is ~39 GB of weights; divide a card’s bandwidth by that and you’re in the right ballpark. Halve the weights (a lighter quant) and you roughly double the speed.
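As a quick illustration, the formula can be written as a one-line function. This is a sketch, not the calculator's actual code; the 1008 GB/s bandwidth (RTX 4090) and 70% efficiency are the example values used in the FAQ on this page, and the function name is hypothetical.

```python
def tokens_per_sec(bandwidth_gbs, weights_gb, efficiency=0.7):
    """tokens/sec ~= (memory bandwidth / weights size) * efficiency."""
    if weights_gb <= 0:
        raise ValueError("weights_gb must be positive")
    return bandwidth_gbs / weights_gb * efficiency


full = tokens_per_sec(1008, 39)    # 70B at Q4_K_M, ~39 GB of weights
half = tokens_per_sec(1008, 19.5)  # half the weights on the same card

print(f"39 GB:   {full:.1f} tok/s")
print(f"19.5 GB: {half:.1f} tok/s")
```

With these inputs the first estimate comes out at roughly 18 tok/s, and halving the weights exactly doubles it, since weights size sits alone in the denominator.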

Speed vs. fit: two different questions

“How fast will it run?” and “will it even fit?” are separate problems. This tool answers the first. A model can fit comfortably in VRAM and still be slow (big model, modest bandwidth), or be fast in theory but not fit at all (not enough VRAM, forcing slow CPU offload). For the fit side — VRAM for weights plus the KV cache at your context length — use the LLM VRAM Calculator. Run both before you buy a card or pick a model.
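To make the distinction concrete, here is a rough sketch that answers both questions separately. The fit check uses only the crude "weights + ~20%" VRAM figure this tool reports, which is an assumption and not a substitute for the VRAM calculator; the 48 GB / 300 GB/s card is hypothetical.

```python
def rough_vram_gb(weights_gb, overhead=0.20):
    """Crude VRAM need: weights plus ~20% for KV cache and overhead."""
    return weights_gb * (1 + overhead)


def assess(weights_gb, vram_gb, bandwidth_gbs, efficiency=0.7):
    """Answer 'does it fit?' and 'how fast if it fits?' independently."""
    need = rough_vram_gb(weights_gb)
    return {
        "fits": need <= vram_gb,
        "vram_need_gb": need,
        # Only meaningful when the whole model is resident in VRAM.
        "est_tps_if_resident": bandwidth_gbs / weights_gb * efficiency,
    }


# Fast in theory, but does not fit: 39 GB of weights on a 24 GB, 1008 GB/s card.
print(assess(weights_gb=39, vram_gb=24, bandwidth_gbs=1008))

# Fits, but slow: the same weights on a hypothetical 48 GB, 300 GB/s card.
print(assess(weights_gb=39, vram_gb=48, bandwidth_gbs=300))
```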

Picking a model for your card

If the estimated speed is comfortable (say 15+ tok/s for chat) and the model fits, you’re set. If it’s too slow, your levers are: a lighter quant (faster and smaller, slight quality cost), a smaller model, or a card with more bandwidth. For build-level guidance, see the single-RTX-5060 local-LLM walkthrough and the $1,000 local-AI workstation for a tuned setup.
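The same formula can be turned around to size those levers: given a target speed, how much bandwidth does a model need, or how large can the weights be on a given card? The sketch below is illustrative only; the function names are hypothetical and the 70% efficiency is an assumed default.

```python
def required_bandwidth_gbs(target_tps, weights_gb, efficiency=0.7):
    """Bandwidth needed to reach target_tps for a given weights size."""
    return target_tps * weights_gb / efficiency


def max_weights_gb(target_tps, bandwidth_gbs, efficiency=0.7):
    """Largest weights size that still reaches target_tps on a given card."""
    return bandwidth_gbs * efficiency / target_tps


# Lever: a faster card. What does 15 tok/s on 39 GB of weights require?
print(f"need about {required_bandwidth_gbs(15, 39):.0f} GB/s")

# Lever: a lighter quant or smaller model. What fits the 15 tok/s budget at 936 GB/s?
print(f"weights up to about {max_weights_gb(15, 936):.1f} GB")
```

Both results are speed budgets only; whether the weights also fit in VRAM is a separate check.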

A note on accuracy

These are estimates from the bandwidth-bound model, most reliable for larger models and short-to-medium contexts. Real numbers vary with context length (the KV cache grows and gets re-read), the inference backend, and whether the whole model fits in VRAM. Use the figure to compare GPUs and quants and to size expectations — not as a guaranteed benchmark.

Frequently asked questions

How is local LLM speed (tokens per second) estimated?
Generation speed is dominated by memory bandwidth: to produce each new token, the GPU must read the model’s weights out of VRAM once. So tokens/sec ≈ (memory bandwidth ÷ weights size in GB) × an efficiency factor. A 70B model quantized to Q4_K_M is about 39 GB of weights; on an RTX 4090 (1008 GB/s) at ~70% efficiency that’s roughly 18 tokens/sec. This calculator runs that math for your GPU, model size, and quant.
Why is LLM inference memory-bandwidth-bound and not compute-bound?
During generation, the model processes one token at a time, and each step reads every active weight from memory but does relatively little math per byte read. That makes the memory subsystem the bottleneck, not the GPU’s compute units — which is why a card’s GB/s memory bandwidth predicts token speed far better than its TFLOPS. Prompt processing (reading your input) is the opposite — compute-bound and much faster per token — so this tool estimates the generation speed you feel while the model is replying.
How many tokens per second will a 4090 or 3090 do on a 70B model?
Ballpark, at Q4_K_M (~39 GB of weights): an RTX 4090 (1008 GB/s) lands around 15–20 tok/s and an RTX 3090 (936 GB/s) around 14–18 tok/s, assuming the whole model sits in VRAM. A single 24 GB card cannot hold ~39 GB of weights, so many people use two. Smaller models are much faster: an 8B at Q4 on either card typically runs at 80–100 tok/s or more. Use the calculator with your exact card and quant for a per-setup estimate.
Does quantization make a model faster, or just smaller?
Both. A lighter quant stores fewer bytes per parameter, so there are fewer bytes to read per token, which directly raises tokens/sec on a bandwidth-bound workload, on top of shrinking the VRAM footprint. Q4_K_M reads a little over a quarter of the data of FP16 (about 39 GB versus about 140 GB for a 70B model), so it’s both far smaller and roughly three to four times faster. The trade is quality: lower-bit quants lose some accuracy, with Q4_K_M widely considered the sweet spot for most local use.
Why is my real tokens/sec slower than this estimate?
The pure-bandwidth model is closer to an upper bound than to a typical result, and it is most accurate for larger models. Real speed is lower because of KV-cache reads that grow with context length, sampling overhead, CPU/offload bottlenecks if the model doesn’t fully fit in VRAM, and backend efficiency (llama.cpp vs vLLM vs others). That’s why the tool exposes an efficiency slider: dialing it to match a number you’ve actually measured makes its other estimates more accurate. Treat results as ±30%.
Does this tell me whether a model fits in my GPU?
Not directly — this tool estimates speed, not fit. It shows the weights size and a rough VRAM figure (weights + ~20% for the KV cache and overhead), but actual VRAM use also depends on context length and batch size. For a proper fit check, use the companion LLM VRAM Calculator, which models the KV cache and context window. The two together answer ‘will it fit, and how fast will it run.’