High TTFT (Time to First Token)
Detects when the p95 time to first token exceeds the configured threshold.
Background
TTFT measures how long a client waits before receiving the first token. High TTFT indicates that requests are spending too long in prefill or in the waiting queue before prefill even begins.
TPOT (time per output token) reflects decode speed. When TTFT is high while TPOT stays stable, the bottleneck is most likely prefill compute or queue pressure, not decode.
Prometheus mode only
TTFT percentiles require histogram_quantile() over a time window. This rule does not fire in direct scrape mode.
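To illustrate why a percentile needs bucket data rather than a single scraped value, here is a minimal Python sketch of the linear interpolation that histogram_quantile() performs over cumulative buckets. The function name, the bucket boundaries, and the counts are illustrative assumptions, not this tool's implementation. In Prometheus the counts would typically come from a rate over a window, for example rate(vllm:time_to_first_token_seconds_bucket[5m]), where the 5m window is also an assumption.

```python
import math

def estimate_quantile(q, buckets):
    """Estimate a quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count), sorted by upper_bound,
    with the last entry being the +Inf bucket. Simplified sketch only.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations in the window
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for upper, count in buckets:
        if count >= rank:
            if math.isinf(upper):
                # Quantile falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            if count == prev_count:
                return upper  # avoid division by zero on an empty bucket
            # Interpolate linearly within the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (upper - prev_bound) * fraction
        prev_bound, prev_count = upper, count
    return prev_bound

# Hypothetical TTFT buckets (seconds) with cumulative request counts.
example_buckets = [(0.5, 50), (1.0, 80), (2.0, 90), (5.0, 100), (math.inf, 100)]
ttft_p95 = estimate_quantile(0.95, example_buckets)
```

With these example counts the 95th-percentile rank lands in the 2.0 to 5.0 second bucket, so the estimate would exceed the default 2.0 s threshold.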
Signals
| Signal | Condition |
|---|---|
| High TTFT | ttft_p95_seconds >= 2.0 (default) |
| TPOT stable | tpot_p95_seconds < 0.2 — decode is not the bottleneck |
| Requests queued | num_requests_waiting > 0 — prefill pressure confirmed |
Confidence
| Signals matched | Confidence |
|---|---|
| High TTFT only | Low |
| High TTFT + stable TPOT | Medium |
| High TTFT + stable TPOT + queue depth | High |
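The signal and confidence tables above can be read as a small decision function. The following Python sketch is one possible reading, not the tool's actual code: the function name and argument names are hypothetical, and the case of high TTFT with unstable TPOT is assumed to stay at low confidence regardless of queue depth, since the table does not cover it.

```python
from typing import Optional

def ttft_rule_confidence(
    ttft_p95_seconds: float,
    tpot_p95_seconds: float,
    num_requests_waiting: int,
    ttft_threshold: float = 2.0,   # default from the Configuration table
    tpot_threshold: float = 0.2,   # default from the Configuration table
) -> Optional[str]:
    """Return "low", "medium", "high", or None if the rule does not fire."""
    if ttft_p95_seconds < ttft_threshold:
        return None  # primary signal missing: rule does not fire
    if tpot_p95_seconds >= tpot_threshold:
        # Assumption: without stable TPOT, decode may also be involved.
        return "low"
    if num_requests_waiting > 0:
        return "high"  # high TTFT + stable TPOT + queued requests
    return "medium"    # high TTFT + stable TPOT only

# Example: slow first token, healthy decode, requests waiting in the queue.
result = ttft_rule_confidence(3.1, 0.05, 4)
```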
Likely causes
- Long input prompts increasing prefill compute time
- Queue pressure delaying prefill start
- Chunked prefill not enabled or misconfigured
- Insufficient capacity for current prompt load
Recommendations
- Enable or tune chunked prefill (--enable-chunked-prefill)
- Reduce max prompt length or filter long requests upstream
- Inspect queue depth — consider adding replicas
- Separate long-context traffic to dedicated instances
Metrics used
- vllm:time_to_first_token_seconds (histogram)
- vllm:time_per_output_token_seconds (histogram)
- vllm:num_requests_waiting
Configuration
| Setting | Default |
|---|---|
| TTFT p95 threshold | 2.0s |
| TPOT p95 threshold | 0.2s |