Local LLM Speed Calculator: Estimate Tokens per Second

This estimates generation (decode) speed, which is memory-bandwidth-bound: each new token requires reading the active weights from VRAM once, so tokens/sec ≈ (bandwidth ÷ weights size) × efficiency. It's a ballpark — most accurate for larger models; small models and long contexts run slower than the pure-bandwidth math suggests (compute, sampling, and KV-cache reads add overhead), so treat results as ±30%. For whether a model fits in VRAM, use the LLM VRAM Calculator.
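As a minimal sketch of the formula above (the function and parameter names are illustrative, not the calculator's actual source), the estimate can be written in a few lines of Python. Weights size is taken as billions of parameters times bytes per parameter, which gives GB directly.

```python
def estimate_tokens_per_sec(params_b, bytes_per_param, bandwidth_gbs, efficiency=0.70):
    """Bandwidth-bound decode estimate: (bandwidth / weights size) * efficiency.

    params_b        -- model size in billions of parameters
    bytes_per_param -- average bytes per parameter for the chosen quant
    bandwidth_gbs   -- GPU memory bandwidth in GB/s
    efficiency      -- fraction of peak bandwidth actually realized (0..1)
    """
    if params_b <= 0 or bytes_per_param <= 0:
        raise ValueError("model size and bytes per parameter must be positive")
    weights_gb = params_b * bytes_per_param  # billions of params * bytes/param = GB
    return bandwidth_gbs / weights_gb * efficiency


# 8B model at Q4_K_M (0.60 bytes/param) on an 896 GB/s card at 70% efficiency
print(f"{estimate_tokens_per_sec(8, 0.60, 896):.0f} tokens/sec")  # prints "131 tokens/sec"
```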

Worked examples

These are this calculator’s own outputs, computed with the same formula the tool runs in your browser — so you can see a real answer without touching a single input.

Mainstream 2026 gaming/AI desktop running a small everyday chat model — an 8B model (Llama 3.1 8B class) at Q4_K_M on an RTX 5070 Ti, efficiency left at the 70% default

Inputs: Model size = 8 B params; Quantization = Q4_K_M (0.60 bytes/param); GPU = RTX 5070 Ti (896 GB/s); Efficiency = 70%
Result: Est. generation speed 131 tokens/sec · Model weights 4.8 GB (at 0.60 bytes/param) · Rough VRAM need 5.8 GB · Memory bandwidth 896 GB/s × 70% efficiency.

Quant	Weights size	Est. tokens/sec	Est. VRAM need
Q8	8.5 GB	74.0	10.2 GB
Q5_K_M	5.7 GB	110	6.8 GB
Q4_K_M	4.8 GB	131	5.8 GB
Q3_K_M	3.9 GB	160	4.7 GB

Common homelab box on a used 24 GB card — a 32B model (Qwen2.5 32B / QwQ class) at Q4_K_M on an RTX 3090, efficiency left at 70%

Inputs: Model size = 32 B params; Quantization = Q4_K_M (0.60 bytes/param); GPU = RTX 3090 (936 GB/s); Efficiency = 70%
Result: Est. generation speed 34.1 tokens/sec · Model weights 19.2 GB (at 0.60 bytes/param) · Rough VRAM need 23.0 GB · Memory bandwidth 936 GB/s × 70% efficiency.

Quant	Weights size	Est. tokens/sec	Est. VRAM need
Q8	33.9 GB	19.3	40.7 GB
Q5_K_M	22.7 GB	28.8	27.3 GB
Q4_K_M	19.2 GB	34.1	23.0 GB
Q3_K_M	15.7 GB	41.8	18.8 GB

Note the 23.0 GB VRAM figure against a 24 GB card: that is genuinely borderline, with only about 1 GB of headroom, so check the fit before counting on this speed.
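The per-quant tables in these examples can be reproduced with a short loop. This is a sketch under stated assumptions: only the Q4_K_M figure (0.60 bytes/param) is given in the text, and the other bytes-per-parameter values below are back-solved from the tables above, so they may not match the calculator's internal constants exactly. The rough VRAM figure is taken as weights plus about 20%.

```python
# Bytes per parameter. Only Q4_K_M (0.60) is stated in the text; the other
# three values are assumptions inferred from the worked-example tables.
BYTES_PER_PARAM = {"Q8": 1.06, "Q5_K_M": 0.71, "Q4_K_M": 0.60, "Q3_K_M": 0.49}
VRAM_OVERHEAD = 1.20  # weights + ~20% for KV cache and overhead


def quant_table(params_b, bandwidth_gbs, efficiency=0.70):
    """Return one row per quant: weights size, estimated speed, rough VRAM need."""
    rows = []
    for quant, bpp in BYTES_PER_PARAM.items():
        weights_gb = params_b * bpp
        rows.append({
            "quant": quant,
            "weights_gb": weights_gb,
            "tokens_per_sec": bandwidth_gbs / weights_gb * efficiency,
            "vram_gb": weights_gb * VRAM_OVERHEAD,
        })
    return rows


# 32B model on an RTX 3090 (936 GB/s) at the 70% default
for row in quant_table(32, 936):
    print(f"{row['quant']}\t{row['weights_gb']:.1f} GB\t"
          f"{row['tokens_per_sec']:.1f} tok/s\t{row['vram_gb']:.1f} GB")
```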

How to use this calculator

Enter the model’s size in billions of parameters, pick the quantization you’ll run, and choose your GPU (or enter a custom memory bandwidth). The calculator estimates the generation speed in tokens per second — the speed you feel while the model is typing its reply — plus the weights size and a rough VRAM figure. The table compares common quant levels so you can see the speed-vs-quality trade for your exact hardware.

If you’ve measured your real tokens/sec on a known model, nudge the efficiency slider until the estimate matches — then its predictions for other models and quants on your rig get more accurate.
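Matching the estimate to a measurement amounts to solving the formula for efficiency. The sketch below shows that step; the 110 tok/s measurement is a made-up number for illustration, and the function names are hypothetical.

```python
def calibrate_efficiency(measured_tps, params_b, bytes_per_param, bandwidth_gbs):
    """Solve tokens/sec = (bandwidth / weights) * efficiency for efficiency."""
    weights_gb = params_b * bytes_per_param
    return measured_tps * weights_gb / bandwidth_gbs


def estimate_tokens_per_sec(params_b, bytes_per_param, bandwidth_gbs, efficiency):
    """Bandwidth-bound decode estimate using a given efficiency."""
    return bandwidth_gbs / (params_b * bytes_per_param) * efficiency


# Hypothetical measurement: 110 tok/s for an 8B Q4_K_M model on an 896 GB/s card.
eff = calibrate_efficiency(110, 8, 0.60, 896)
print(f"calibrated efficiency: {eff:.0%}")

# Reuse the calibrated value to predict a 32B Q4_K_M model on the same card.
print(f"32B estimate: {estimate_tokens_per_sec(32, 0.60, 896, eff):.1f} tok/s")
```

A calibrated value taken from one model is only a starting point for others, since small models tend to realize a lower fraction of bandwidth than large ones.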

Why bandwidth, not TFLOPS, decides speed

Local LLM generation is memory-bandwidth-bound. To produce each token, the GPU streams the model’s weights out of VRAM once and does comparatively little arithmetic per byte. So the limiting factor is how fast the card can read memory — its GB/s — not its raw compute. That’s why a memory-bandwidth number predicts token speed far better than a TFLOPS rating, and why the formula is simply:

tokens/sec ≈ (memory bandwidth ÷ weights size) × efficiency

A 70B model at Q4_K_M is 42.0 GB of weights at the 0.60 bytes-per-parameter figure this calculator uses; divide a card’s bandwidth by that and you’re in the right ballpark. Halve the weights (a lighter quant) and you roughly double the speed.

Speed vs. fit: two different questions

“How fast will it run?” and “will it even fit?” are separate problems. This tool answers the first. A model can fit comfortably in VRAM and still be slow (big model, modest bandwidth), or be fast in theory but not fit at all (not enough VRAM, forcing slow CPU offload). For the fit side — VRAM for weights plus the KV cache at your context length — use the LLM VRAM Calculator. Run both before you buy a card or pick a model.
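To make the two questions concrete, here is a rough sketch that reports both answers side by side. It uses the calculator's coarse VRAM figure (weights plus about 20%), not the fuller KV-cache model of the LLM VRAM Calculator, and the 24 GB card capacity is passed in as an input rather than looked up.

```python
def speed_and_fit(params_b, bytes_per_param, bandwidth_gbs, vram_gb, efficiency=0.70):
    """Return the speed estimate and a coarse fit check as separate answers."""
    weights_gb = params_b * bytes_per_param
    rough_vram_gb = weights_gb * 1.20  # weights + ~20% for KV cache and overhead
    return {
        "tokens_per_sec": bandwidth_gbs / weights_gb * efficiency,
        "rough_vram_gb": rough_vram_gb,
        "fits": rough_vram_gb <= vram_gb,
    }


# Fits (barely) and is reasonably fast: 32B Q4_K_M on a 24 GB, 936 GB/s card.
print(speed_and_fit(32, 0.60, 936, 24))

# Fast in theory but does not fit: 70B Q4_K_M on a 24 GB, 1008 GB/s card.
print(speed_and_fit(70, 0.60, 1008, 24))
```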

Picking a model for your card

If the estimated speed is comfortable (say 15+ tok/s for chat) and the model fits, you’re set. If it’s too slow, your levers are: a lighter quant (faster and smaller, slight quality cost), a smaller model, or a card with more bandwidth. For which models actually fit each card in the first place, the VRAM-tier guide to local LLMs by GPU maps it tier by tier; for build-level guidance, see the single-RTX-5060 local-LLM walkthrough and the $1,000 local-AI workstation for a tuned setup.
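One way to turn that decision into code is to walk the quants from highest quality to lightest and stop at the first that both fits and clears a speed floor. This is an illustrative sketch, not the calculator's logic: the bytes-per-parameter values other than Q4_K_M are assumptions inferred from the worked-example tables, and the fit check is the coarse weights-plus-20% figure.

```python
# Ordered from highest quality to lightest. Only Q4_K_M (0.60) is stated in
# the text; the other values are assumptions inferred from the example tables.
QUANTS = [("Q8", 1.06), ("Q5_K_M", 0.71), ("Q4_K_M", 0.60), ("Q3_K_M", 0.49)]


def pick_quant(params_b, bandwidth_gbs, vram_gb, min_tps=15.0, efficiency=0.70):
    """Return (quant, tokens_per_sec) for the best quant that fits and is fast
    enough, or None if no quant qualifies (try a smaller model or another card)."""
    for name, bpp in QUANTS:
        weights_gb = params_b * bpp
        tps = bandwidth_gbs / weights_gb * efficiency
        if weights_gb * 1.20 <= vram_gb and tps >= min_tps:
            return name, tps
    return None


print(pick_quant(32, 936, 24))  # 32B on a 24 GB, 936 GB/s card
print(pick_quant(70, 936, 24))  # 70B on the same card: nothing qualifies
```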

A note on accuracy

These are estimates from the bandwidth-bound model, most reliable for larger models and short-to-medium contexts. Real numbers vary with context length (the KV cache grows and gets re-read), the inference backend, and whether the whole model fits in VRAM. Use the figure to compare GPUs and quants and to size expectations — not as a guaranteed benchmark.

Frequently asked questions

How is local LLM speed (tokens per second) estimated?

Generation speed is dominated by memory bandwidth: to produce each new token, the GPU must read the model’s weights out of VRAM once. So tokens/sec ≈ (memory bandwidth ÷ weights size in GB) × an efficiency factor. A 70B model quantized to Q4_K_M is 42.0 GB of weights (this calculator uses 0.60 bytes per parameter for Q4_K_M); on an RTX 4090 (1008 GB/s) at 70% efficiency that’s 16.8 tokens/sec. This calculator runs that math for your GPU, model size, and quant.

Why is LLM inference memory-bandwidth-bound and not compute-bound?

During generation, the model processes one token at a time, and each step reads every active weight from memory but does relatively little math per byte read. That makes the memory subsystem the bottleneck, not the GPU’s compute units — which is why a card’s GB/s memory bandwidth predicts token speed far better than its TFLOPS. Prompt processing (reading your input) is the opposite — compute-bound and much faster per token — so this tool estimates the generation speed you feel while the model is replying.

How many tokens per second will a 4090 or 3090 do on a 70B model?

Ballpark, at Q4_K_M (42.0 GB of weights): an RTX 4090 (1008 GB/s) lands around 15–20 tok/s and an RTX 3090 (936 GB/s) around 14–18 tok/s, assuming the model fits in VRAM. It does not fit on a single 24 GB card, since the weights alone are 42.0 GB, which is why many people run 70B Q4 across two cards. Smaller models are much faster: an 8B at Q4 on either card runs well over 100 tok/s. Use the calculator with your exact card and quant for a per-setup estimate.
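Those ranges are roughly what the formula gives when efficiency is swept over a band instead of fixed at 70%. The sketch below assumes a 60% to 80% band, which is an illustrative choice rather than a figure from the calculator.

```python
GPU_BANDWIDTH_GBS = {"RTX 4090": 1008, "RTX 3090": 936}  # bandwidths quoted above
WEIGHTS_GB = 70 * 0.60  # 70B at Q4_K_M, 0.60 bytes per parameter


def speed_range(bandwidth_gbs, weights_gb, low_eff=0.60, high_eff=0.80):
    """Return (low, high) tokens/sec over an assumed efficiency band."""
    ideal = bandwidth_gbs / weights_gb  # speed at 100% of peak bandwidth
    return ideal * low_eff, ideal * high_eff


for gpu, bandwidth in GPU_BANDWIDTH_GBS.items():
    low, high = speed_range(bandwidth, WEIGHTS_GB)
    print(f"{gpu}: {low:.1f} to {high:.1f} tok/s")
```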

Does quantization make a model faster, or just smaller?

Both. A lighter (lower-bit) quant means fewer bytes per parameter, so there are fewer bytes to read per token, which directly raises tokens/sec on a bandwidth-bound workload, on top of shrinking the VRAM footprint. At 0.60 bytes per parameter, Q4_K_M reads about 30% of the data of FP16 (2 bytes per parameter), so it's both far smaller and roughly three times faster. The trade is quality: lower quants lose some accuracy, with Q4_K_M widely considered the sweet spot for most local use.
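The size and speed ratios follow directly from bytes per parameter. A small sketch, taking FP16 as 2 bytes per parameter and Q4_K_M as the 0.60 figure this calculator uses (the helper name is illustrative):

```python
FP16_BYTES_PER_PARAM = 2.0     # 16 bits per weight
Q4_K_M_BYTES_PER_PARAM = 0.60  # figure used by this calculator


def compare_quants(bpp_small, bpp_large):
    """Return (data_ratio, speedup) of the smaller format relative to the larger.

    On a bandwidth-bound workload, speed scales inversely with bytes read,
    so the speedup is simply the inverse of the data ratio.
    """
    data_ratio = bpp_small / bpp_large
    return data_ratio, 1.0 / data_ratio


ratio, speedup = compare_quants(Q4_K_M_BYTES_PER_PARAM, FP16_BYTES_PER_PARAM)
print(f"Q4_K_M reads {ratio:.0%} of FP16's bytes, about {speedup:.1f}x the speed")
```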

Why is my real tokens/sec slower than this estimate?

The pure-bandwidth model is close to an upper bound and is most accurate for larger models. Real speed is lower because of KV-cache reads that grow with context length, sampling overhead, CPU/offload bottlenecks if the model doesn’t fully fit in VRAM, and backend efficiency (llama.cpp vs vLLM vs others). That’s why the tool exposes an efficiency slider: dialing it to match a number you’ve actually measured makes its other estimates more accurate. Treat results as ±30%.
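To see how KV-cache reads pull the number down, one simple extension is to add the cache size to the bytes read per token. This is an assumption layered on top of the calculator's formula, not something the tool itself does: it treats the whole KV cache as read once per generated token, and the cache sizes below are placeholder values you would replace with a real figure for your context length.

```python
def estimate_with_kv_cache(weights_gb, kv_cache_gb, bandwidth_gbs, efficiency=0.70):
    """Bandwidth-bound estimate where each token reads weights plus the KV cache.

    Assumption: the full KV cache is read once per generated token.
    """
    bytes_read_gb = weights_gb + kv_cache_gb
    return bandwidth_gbs / bytes_read_gb * efficiency


# 8B Q4_K_M (4.8 GB of weights) on an 896 GB/s card, with hypothetical cache sizes.
for kv_gb in (0.0, 1.0, 4.0):
    tps = estimate_with_kv_cache(4.8, kv_gb, 896)
    print(f"KV cache {kv_gb:.1f} GB: {tps:.0f} tok/s")
```

The same fixed amount of cache hurts a small model proportionally more than a large one, which is consistent with the estimate being most accurate for larger models.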

Does this tell me whether a model fits in my GPU?

Not directly — this tool estimates speed, not fit. It shows the weights size and a rough VRAM figure (weights + ~20% for the KV cache and overhead), but actual VRAM use also depends on context length and batch size. For a proper fit check, use the companion LLM VRAM Calculator, which models the KV cache and context window. The two together answer ‘will it fit, and how fast will it run.’

Related guides and tools

LLM VRAM Calculator

Which Local LLM Fits Your GPU in 2026? VRAM Tiers

How to Self-Host a Local LLM on a Single RTX 5060 in 2026

$1,000 Local AI Workstation: RTX 5080-Class Build Guide (2026)
