Token Generation Speed Simulator

Simulate how tokens-per-second and output length affect perceived LLM response time. Use it to plan streaming UX, long answers, and latency expectations.

What does tokens per second mean?

Tokens per second is the rate at which an LLM generates output tokens once generation has begun. It is not the whole user-visible latency story: before the first token arrives, the user also waits for request routing, any retrieval work, and prompt processing (prefill). This simulator isolates output length and generation speed so teams can estimate how long a visible answer takes to finish streaming.

Decode speed

Tokens per second estimates how fast the answer body appears after generation starts.

First-token latency

Users also wait for request handling and prefill before the first streamed token arrives.

Perceived latency

Streaming helps because users can begin reading while the completion is still being generated.

How to interpret tokens per second

Short responses

At 100 tokens/s, a 150-token answer appears in about 1.5 seconds.

Long responses

A 1,000-token answer at the same speed takes about 10 seconds to finish.

Streaming UX

With streaming, users start reading as soon as the first tokens arrive instead of waiting the full 1.5 or 10 seconds for the completed answer.
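The pacing described above can be sketched as a small function that reports how many tokens are visible at a given moment. This is a minimal illustration, not the simulator's actual implementation: the function name `stream_progress` and its parameters are hypothetical, and it assumes a constant generation speed with an optional fixed delay before the first token.

```python
def stream_progress(
    elapsed_s: float,
    output_tokens: int,
    tokens_per_second: float,
    time_to_first_token_s: float = 0.0,  # assumption: fixed wait before streaming starts
) -> tuple[int, float]:
    """Return (tokens generated so far, progress fraction) at a given elapsed time."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    # No tokens are visible until the first-token delay has passed.
    decode_elapsed = max(0.0, elapsed_s - time_to_first_token_s)
    # Generation stops once the full answer has been produced.
    generated = min(output_tokens, int(decode_elapsed * tokens_per_second))
    progress = generated / output_tokens if output_tokens else 1.0
    return generated, progress


# Halfway through a 1,000-token answer at 100 tokens/s.
print(stream_progress(5.0, 1000, 100))  # (500, 0.5)
```

Real generation speed varies during a response, so a constant rate is only a planning approximation.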

Response time formula

The simplest estimate divides output length by generation speed. For interactive products, add time to first token because users do not see streamed output until the first token arrives.

completion_time = output_tokens / tokens_per_second
perceived_total_time = time_to_first_token + completion_time

Completion time by output length and generation speed (time to first token excluded):

Output length    50 tokens/s    100 tokens/s    200 tokens/s
150 tokens       3.0 s          1.5 s           0.75 s
500 tokens       10.0 s         5.0 s           2.5 s
1,000 tokens     20.0 s         10.0 s          5.0 s
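The two formulas translate directly into code. The sketch below uses the same names as the formulas and recomputes the rows of the table above; the default of zero for `time_to_first_token` is an assumption made only so the table values can be reproduced.

```python
def completion_time(output_tokens: int, tokens_per_second: float) -> float:
    """Seconds needed to generate the full output at a constant speed."""
    return output_tokens / tokens_per_second


def perceived_total_time(
    output_tokens: int,
    tokens_per_second: float,
    time_to_first_token: float = 0.0,  # assumption: 0 reproduces the table
) -> float:
    """Seconds from request to the last streamed token."""
    return time_to_first_token + completion_time(output_tokens, tokens_per_second)


# Recompute the table rows: output length against 50, 100, and 200 tokens/s.
for tokens in (150, 500, 1000):
    row = [completion_time(tokens, speed) for speed in (50, 100, 200)]
    print(tokens, row)
# 150 [3.0, 1.5, 0.75]
# 500 [10.0, 5.0, 2.5]
# 1000 [20.0, 10.0, 5.0]
```

Adding a half-second time to first token to the 1,000-token, 100 tokens/s case gives `perceived_total_time(1000, 100, 0.5)`, which is 10.5 seconds.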

Latency terms to separate

LLM latency is easier to plan when teams separate the stages. Time to first token is how long the user waits before any streamed output appears. Prefill is the stage in which the model processes the prompt and context, and it is part of that wait. Decode is the repeated generation step that produces output tokens one at a time.

Time to first token

The wait before streaming begins. It is affected by routing, prompt processing, retrieval, and provider load.

Prefill

The model processes input context before output starts. Longer prompts and retrieved documents can increase this stage.

Decode

The model generates output tokens. This is the stage approximated by tokens per second in the simulator.
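One way to keep these stages separate in planning is to model them as distinct fields. The sketch below is a simplification under stated assumptions: the class `LatencyBreakdown` and its field names are hypothetical, the stages before the first token are treated as strictly sequential, and effects such as provider load are left out. The numbers in the example are illustrative, not measurements.

```python
from dataclasses import dataclass


@dataclass
class LatencyBreakdown:
    routing_s: float          # request handling and network, assumed value
    retrieval_s: float        # retrieval work before the model call, assumed value
    prefill_s: float          # prompt and context processing, assumed value
    output_tokens: int
    tokens_per_second: float  # decode speed

    @property
    def time_to_first_token_s(self) -> float:
        # Assumption: the stages before streaming run one after another.
        return self.routing_s + self.retrieval_s + self.prefill_s

    @property
    def decode_s(self) -> float:
        # The stage the simulator approximates.
        return self.output_tokens / self.tokens_per_second

    @property
    def total_s(self) -> float:
        return self.time_to_first_token_s + self.decode_s


plan = LatencyBreakdown(
    routing_s=0.25, retrieval_s=0.5, prefill_s=0.25,
    output_tokens=500, tokens_per_second=100.0,
)
print(plan.time_to_first_token_s, plan.decode_s, plan.total_s)  # 1.0 5.0 6.0
```

Keeping the fields apart makes it clear which stage a change affects: a longer prompt moves `prefill_s`, while a longer answer moves only `decode_s`.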

Where speed planning matters

Token generation speed is one part of LLM latency. Real user experience also depends on time to first token, network latency, prompt length, model size, retrieval steps, and whether the answer streams while it is generated. This simulator isolates output length and tokens per second so product teams can reason about response pacing.

Chat products

Use the simulator to set output limits that keep conversational replies feeling responsive.

Agent workflows

Long tool traces and multi-step plans can feel slow even when model throughput is high, because each step adds its own wait before the next one can start.

Document tasks

Summaries, reports, and code reviews often need streaming or progress states for longer outputs.
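For the chat case above, setting an output limit amounts to inverting the response time formula: given a latency budget, how many tokens fit? The helper below is a hypothetical sketch (the name `max_output_tokens` is not from any provider API) and assumes a constant decode speed and a known time to first token.

```python
import math


def max_output_tokens(
    latency_budget_s: float,
    tokens_per_second: float,
    time_to_first_token_s: float = 0.0,  # assumption: estimated separately
) -> int:
    """Largest output length that finishes streaming within the budget."""
    decode_budget_s = latency_budget_s - time_to_first_token_s
    if decode_budget_s <= 0:
        return 0  # the budget is used up before streaming starts
    return math.floor(decode_budget_s * tokens_per_second)


# A 5-second budget at 100 tokens/s with a 0.5-second first-token wait.
print(max_output_tokens(5.0, 100, 0.5))  # 450
```

The result is a planning ceiling rather than a guarantee, since actual speed and first-token latency vary between requests.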

Streaming references

Provider streaming APIs can improve perceived latency by sending partial model output while the full answer is still being generated. Check each provider's official documentation for exact streaming behavior.
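To compare real streams with the simulator's assumptions, time to first token and decode speed can be measured from any iterable of streamed chunks. The sketch below is provider-agnostic and makes no claim about a specific API: `measure_stream` is a hypothetical helper, and it counts chunks, which may or may not correspond to single tokens depending on the provider.

```python
import time
from typing import Callable, Iterable, Optional


def measure_stream(
    chunks: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> dict[str, Optional[float]]:
    """Measure first-chunk latency and chunk rate for a streamed response."""
    start = clock()
    first_at: Optional[float] = None
    last_at = start
    count = 0
    for _ in chunks:
        now = clock()
        if first_at is None:
            first_at = now  # arrival of the first streamed chunk
        last_at = now
        count += 1
    if first_at is None:
        return {"chunks": 0, "time_to_first_token_s": None, "tokens_per_second": None}
    decode_s = last_at - first_at
    # Rate over the decode phase only: chunks after the first, per second.
    rate = (count - 1) / decode_s if decode_s > 0 else None
    return {
        "chunks": count,
        "time_to_first_token_s": first_at - start,
        "tokens_per_second": rate,
    }
```

The `clock` parameter defaults to `time.perf_counter` and can be replaced with a fake clock in tests. In practice `chunks` would be the iterator returned by a provider's streaming client, wrapped so that it yields text pieces.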

Token Speed Simulator FAQ