Token Generation Speed Simulator
Simulate how tokens-per-second and output length affect perceived LLM response time. Use it to plan streaming UX, long answers, and latency expectations.
What does tokens per second mean?
Tokens per second is the rate at which an LLM generates output tokens after generation begins. It is not the whole user-visible latency story: real response time also includes the wait before the first token, which covers request routing, retrieval work, and model prefill (prompt processing). This simulator isolates output length and generation speed so teams can reason about how long a visible answer may take.
Decode speed
Tokens per second estimates how fast the answer body appears after generation starts.
First-token latency
Users also wait for request handling and prefill before the first streamed token arrives.
Perceived latency
Streaming helps because users can begin reading while the completion is still being generated.
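The difference streaming makes to the first visible output can be sketched in a few lines. This is a simplified model, assuming a non-streamed response is shown only once every token has been generated; the function name and parameters are hypothetical and not part of the simulator.

```python
def seconds_until_first_visible_text(
    ttft_seconds: float,
    output_tokens: int,
    tokens_per_second: float,
    streaming: bool,
) -> float:
    """Estimate how long the user waits before seeing any output.

    Assumption: without streaming, nothing is shown until decoding finishes.
    """
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    decode_seconds = output_tokens / tokens_per_second
    if streaming:
        # The first token is visible as soon as it arrives.
        return ttft_seconds
    # The full answer must be generated before anything is displayed.
    return ttft_seconds + decode_seconds
```

For a 500-token answer at 100 tokens/s with a 0.5 s time to first token, this model gives 0.5 s with streaming and 5.5 s without.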
How to interpret tokens per second
Short responses
At 100 tokens/s, a 150-token answer appears in about 1.5 seconds.
Long responses
A 1,000-token answer at the same speed takes about 10 seconds to finish.
Streaming UX
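Streaming improves perceived latency: at 100 tokens/s, a reader sees the start of a 1,000-token answer once the first token arrives instead of waiting about 10 seconds for the complete text.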
Response time formula
The simplest estimate divides output length by generation speed. For interactive products, add time to first token because users do not see streamed output until the first token arrives. The table below shows generation time only and excludes time to first token.
| Output length | 50 tokens/s | 100 tokens/s | 200 tokens/s |
|---|---|---|---|
| 150 tokens | 3.0 s | 1.5 s | 0.75 s |
| 500 tokens | 10.0 s | 5.0 s | 2.5 s |
| 1,000 tokens | 20.0 s | 10.0 s | 5.0 s |
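The table above can be reproduced with a short sketch. The function name `estimate_response_seconds` and its `ttft_seconds` parameter are illustrative assumptions; the table corresponds to `ttft_seconds=0`.

```python
def estimate_response_seconds(
    output_tokens: int,
    tokens_per_second: float,
    ttft_seconds: float = 0.0,
) -> float:
    """Estimate total response time: time to first token plus decode time."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    if output_tokens < 0:
        raise ValueError("output_tokens must not be negative")
    return ttft_seconds + output_tokens / tokens_per_second


# Rebuild the table rows (decode time only, no time to first token).
for tokens in (150, 500, 1000):
    row = [estimate_response_seconds(tokens, tps) for tps in (50, 100, 200)]
    print(tokens, row)
```

Passing a non-zero `ttft_seconds` shifts every cell by the same amount, which is why short answers are affected proportionally more by a slow first token than long ones.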
Latency terms to separate
LLM latency is easier to plan when teams separate the stages. Time to first token describes how long the user waits before any streamed output appears. Prefill time, which falls inside that wait, is the work needed to process the prompt and context. Decode time is the repeated generation step that produces output tokens one by one.
Time to first token
The wait before streaming begins. It is affected by routing, prompt processing, retrieval, and provider load.
Prefill
The model processes input context before output starts. Longer prompts and retrieved documents can increase this stage.
Decode
The model generates output tokens. This is the stage approximated by tokens per second in the simulator.
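One way to keep these stages separate when planning is to record them as distinct fields. The sketch below is a simplification: it treats time to first token as the plain sum of routing, retrieval, and prefill, and ignores provider load and any overlap between stages. The class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatencyBreakdown:
    routing_seconds: float      # request handling before the model runs
    retrieval_seconds: float    # fetching documents or other context
    prefill_seconds: float      # processing the prompt and context
    output_tokens: int
    tokens_per_second: float    # decode speed

    @property
    def ttft_seconds(self) -> float:
        # Assumption: stages run one after another with no overlap.
        return self.routing_seconds + self.retrieval_seconds + self.prefill_seconds

    @property
    def decode_seconds(self) -> float:
        return self.output_tokens / self.tokens_per_second

    @property
    def total_seconds(self) -> float:
        return self.ttft_seconds + self.decode_seconds
```

For example, `LatencyBreakdown(0.25, 0.5, 0.25, 500, 100)` gives a 1.0 s time to first token, 5.0 s of decode, and 6.0 s in total, which makes it clear that raising tokens per second only shortens the decode portion.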
Where speed planning matters
Token generation speed is only one part of LLM latency. Real user experience also depends on time to first token, network latency, prompt length, model size, retrieval steps, and whether the answer streams while it is generated. Because this simulator isolates output length and tokens per second, it is most useful for reasoning about response pacing in the cases below.
Chat products
Use the simulator to set output limits that keep conversational replies feeling responsive.
Agent workflows
Long tool traces and multi-step plans can feel slow even when model throughput is high.
Document tasks
Summaries, reports, and code reviews often need streaming or progress states for longer outputs.
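For the output limits mentioned under chat products, the response time estimate can be inverted to find how many tokens fit in a latency budget. This is a minimal sketch under the same simple model (time to first token plus output length divided by speed); the function name and the budget values are assumptions for illustration.

```python
import math


def max_output_tokens(
    budget_seconds: float,
    tokens_per_second: float,
    ttft_seconds: float = 0.0,
) -> int:
    """Largest output length that finishes within the latency budget."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    decode_budget = budget_seconds - ttft_seconds
    if decode_budget <= 0:
        # The budget is used up before the first token arrives.
        return 0
    return math.floor(decode_budget * tokens_per_second)
```

With a 3-second budget, 100 tokens/s, and a 0.5 s time to first token, the sketch returns 250 tokens. Longer outputs, such as the reports and reviews under document tasks, exceed budgets like this quickly, which is where streaming or progress states help.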
Streaming references
Provider streaming APIs can improve perceived latency by sending partial model output while the full answer is still being generated. Check each provider's official documentation for exact streaming behavior.
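The pacing of a streamed answer can be imitated locally without calling any provider API. The generator below is a rough sketch, not a model of any real streaming protocol: it waits for an assumed time to first token, then yields tokens at a fixed interval. The name `simulate_stream` and the injectable `sleep` parameter are hypothetical.

```python
import time
from typing import Callable, Iterable, Iterator


def simulate_stream(
    tokens: Iterable[str],
    tokens_per_second: float,
    ttft_seconds: float = 0.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[str]:
    """Yield tokens paced like a streamed response.

    Assumption: a constant decode speed, which real systems do not guarantee.
    """
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    interval = 1.0 / tokens_per_second
    sleep(ttft_seconds)  # wait before the first token appears
    for index, token in enumerate(tokens):
        if index > 0:
            sleep(interval)  # gap between consecutive tokens
        yield token
```

Iterating over `simulate_stream("a short test reply".split(), tokens_per_second=5, ttft_seconds=0.5)` and printing each token shows the pacing in a terminal. Passing a different `sleep` callable makes it possible to check the timing logic without actually waiting.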