
High TTFT (Time to First Token)

Detects when the p95 time to first token (TTFT) reaches or exceeds the configured threshold.

Background

TTFT measures how long a client waits before receiving the first token. High TTFT indicates that requests are spending too long in prefill or in the waiting queue before prefill even begins.

TPOT (time per output token) reflects decode speed, whereas TTFT covers queueing and prefill. High TTFT with stable TPOT therefore strongly suggests that the bottleneck is prefill compute or queue pressure, not decode.

Prometheus mode only

TTFT percentiles require histogram_quantile() over a time window. This rule does not fire in direct scrape mode.
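
As a rough sketch of what such a query could look like, the Python helper below builds a PromQL expression for the p95 of a histogram metric. The helper name quantile_query, the 5m window, and the sum by (le) aggregation are assumptions made for illustration; this page does not show the query the rule actually runs. The _bucket suffix and the le label are the standard Prometheus histogram conventions.

```python
def quantile_query(metric: str, quantile: float = 0.95, window: str = "5m") -> str:
    """Build a PromQL quantile expression for a Prometheus histogram.

    Hypothetical helper: the name, default window, and aggregation are
    illustrative assumptions, not the rule's actual implementation.
    """
    if not 0.0 <= quantile <= 1.0:
        raise ValueError("quantile must be between 0 and 1")
    # Histograms are exposed as cumulative <metric>_bucket series labelled by `le`.
    # rate() converts them to per-second increases over the window, and
    # sum by (le) merges all series before histogram_quantile() interpolates.
    return (
        f"histogram_quantile({quantile}, "
        f"sum by (le) (rate({metric}_bucket[{window}])))"
    )


ttft_p95 = quantile_query("vllm:time_to_first_token_seconds")
tpot_p95 = quantile_query("vllm:time_per_output_token_seconds")
print(ttft_p95)
print(tpot_p95)
```

A single scrape only yields cumulative bucket counts since process start, which is why a windowed rate() is needed before a recent p95 can be estimated.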

Signals

Signal            Condition
High TTFT         ttft_p95_seconds >= 2.0 (default)
TPOT stable       tpot_p95_seconds < 0.2 (decode is not the bottleneck)
Requests queued   num_requests_waiting > 0 (prefill pressure confirmed)

Confidence

Signals matched                          Confidence
High TTFT only                           Low
High TTFT + stable TPOT                  Medium
High TTFT + stable TPOT + queue depth    High
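
The mapping from matched signals to confidence can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the rule's implementation: the names Thresholds and ttft_confidence are hypothetical, and the case of high TTFT with queued requests but unstable TPOT, which the table does not list, is treated here as Low.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Thresholds:
    """Default thresholds in seconds (hypothetical container for illustration)."""
    ttft_p95_seconds: float = 2.0
    tpot_p95_seconds: float = 0.2


def ttft_confidence(
    ttft_p95: float,
    tpot_p95: float,
    num_requests_waiting: int,
    thresholds: Thresholds = Thresholds(),
) -> Optional[str]:
    """Return "low", "medium", or "high", or None when TTFT is below threshold."""
    # Primary signal: without high TTFT nothing is reported.
    if ttft_p95 < thresholds.ttft_p95_seconds:
        return None
    tpot_stable = tpot_p95 < thresholds.tpot_p95_seconds
    queued = num_requests_waiting > 0
    if tpot_stable and queued:
        return "high"
    if tpot_stable:
        return "medium"
    # Assumption: high TTFT plus queueing with unstable TPOT is not in the
    # table, so this sketch reports it as low.
    return "low"


print(ttft_confidence(3.0, 0.1, 4))
print(ttft_confidence(3.0, 0.1, 0))
print(ttft_confidence(3.0, 0.5, 0))
```

Keeping the thresholds in one immutable object makes it straightforward to evaluate the same measurements against non-default settings.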

Likely causes

  • Long input prompts increasing prefill compute time
  • Queue pressure delaying prefill start
  • Chunked prefill not enabled or misconfigured
  • Insufficient capacity for current prompt load

Recommendations

  • Enable or tune chunked prefill (--enable-chunked-prefill)
  • Reduce max prompt length or filter long requests upstream
  • Inspect queue depth (vllm:num_requests_waiting); if requests queue persistently, consider adding replicas
  • Separate long-context traffic to dedicated instances

Metrics used

  • vllm:time_to_first_token_seconds (histogram)
  • vllm:time_per_output_token_seconds (histogram)
  • vllm:num_requests_waiting

Configuration

Setting               Default
TTFT p95 threshold    2.0 s
TPOT p95 threshold    0.2 s