Queue Latency

Detects when requests are spending too long in the waiting queue before prefill begins.

Background

vllm:request_queue_time_seconds measures the time a request spends in the WAITING phase — from arrival to when prefill starts. Unlike the waiting request count, this histogram gives a direct latency measurement: clients experience this delay before receiving any output.

High p95 queue time means the server cannot admit requests fast enough, even if the queue depth looks moderate.

Prometheus mode only

Queue time percentiles require histogram_quantile() over a time window. This rule does not fire in direct scrape mode.
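
To make the percentile step concrete, here is a minimal Python sketch of the linear interpolation that histogram_quantile() performs over classic cumulative buckets. The function name, the sample bucket counts, and the 5m window in the comment are illustrative assumptions, not values taken from this rule's implementation, and edge cases are simplified.

```python
import math

# Roughly what a PromQL query of the following shape computes
# (the 5m window is an assumed example, not this rule's setting):
#   histogram_quantile(0.95,
#     sum by (le) (rate(vllm:request_queue_time_seconds_bucket[5m])))


def estimate_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    `buckets` holds the per-bucket increase over the query window and must
    include the +Inf bucket. Simplified sketch of the classic-histogram logic.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be within [0, 1]")
    buckets = sorted(buckets)
    if not buckets or buckets[-1][0] != math.inf:
        raise ValueError("buckets must include the +Inf bucket")

    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations in the window

    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if bound == math.inf:
                # Quantile falls in the overflow bucket: report the
                # highest finite bound, since no upper limit is known.
                return buckets[-2][0] if len(buckets) > 1 else math.nan
            in_bucket = count - prev_count
            if in_bucket == 0:
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / in_bucket
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical window: 100 requests, most admitted in under 0.5 s.
sample = [(0.5, 80), (1.0, 90), (2.5, 98), (math.inf, 100)]
p95 = estimate_quantile(0.95, sample)
print(f"p95 queue time ~ {p95:.4f}s")
```

Because the estimate is interpolated inside a bucket, its precision depends on the bucket boundaries the server exports. A single scrape only provides cumulative totals since process start, which is why a windowed rate over multiple scrapes is needed.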

Signals

Signal            Condition
High queue time   queue_time_p95_seconds >= 1.0 (default threshold)
Active backlog    num_requests_waiting > 0 (backlog confirmed)

Confidence

Signals matched                      Confidence
High queue time only                 Low
High queue time + requests waiting   High
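
The signal and confidence tables can be read as a small decision function. The following is a hedged sketch only: the function name, the return values, and the use of None to represent a missing percentile (as in direct scrape mode) are assumptions made for illustration, not the tool's actual API.

```python
import math
from typing import Optional

# Default from the Configuration table; treated here as overridable.
DEFAULT_QUEUE_TIME_P95_THRESHOLD_S = 1.0


def queue_latency_confidence(
    queue_time_p95_seconds: Optional[float],
    num_requests_waiting: float,
    threshold_s: float = DEFAULT_QUEUE_TIME_P95_THRESHOLD_S,
) -> Optional[str]:
    """Return "low", "high", or None when the rule does not fire."""
    # Without a p95 value (for example, no Prometheus window), do not fire.
    if queue_time_p95_seconds is None or math.isnan(queue_time_p95_seconds):
        return None
    # The primary signal must match before any confidence is assigned.
    if queue_time_p95_seconds < threshold_s:
        return None
    # A non-empty waiting queue confirms the backlog.
    return "high" if num_requests_waiting > 0 else "low"


print(queue_latency_confidence(2.3, 5))
print(queue_latency_confidence(1.0, 0))
print(queue_latency_confidence(0.4, 12))
```

Note that a non-empty queue alone never triggers the rule in this sketch: the waiting count only raises confidence once the p95 threshold is already met.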

Likely causes

  • Insufficient replica capacity for current request rate
  • Long-context requests blocking admission of new sequences
  • Autoscaling has not reacted to traffic increase
  • KV cache exhaustion limiting sequence admission

Recommendations

  • Add replicas or increase concurrency limits
  • Inspect autoscaling thresholds and reaction time
  • Correlate with KV cache pressure; if the cache is full, consider reducing max_num_seqs
  • Separate long-context traffic to a dedicated replica

Metrics used

  • vllm:request_queue_time_seconds (histogram)
  • vllm:num_requests_waiting

Configuration

Setting                    Default
Queue time p95 threshold   1.0s