Queue Latency
Detects when requests are spending too long in the waiting queue before prefill begins.
Background
vllm:request_queue_time_seconds measures the time a request spends in the WAITING phase — from arrival to when prefill starts. Unlike the waiting request count, this histogram gives a direct latency measurement: clients experience this delay before receiving any output.
High p95 queue time means the server cannot admit requests fast enough, even if the queue depth looks moderate.
Prometheus mode only
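Queue time percentiles require histogram_quantile() over a time window. A single scrape exposes only cumulative bucket counts, so no windowed percentile can be derived from it, and this rule does not fire in direct scrape mode.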
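As a rough illustration of why a time window is needed, the sketch below estimates a p95 from the difference between two cumulative bucket snapshots, using the same linear interpolation idea as histogram_quantile(). The bucket boundaries, the sample counts, the 5m window in the example query, and the helper names window_increase and estimate_quantile are all assumptions made for illustration; they are not taken from vLLM or from this rule's implementation.

```python
import math

# Example PromQL for the same quantity (the 5m window is an assumed value):
EXAMPLE_QUERY = (
    "histogram_quantile(0.95, "
    "sum by (le) (rate(vllm:request_queue_time_seconds_bucket[5m])))"
)


def window_increase(earlier, later):
    """Per-bucket increase between two cumulative snapshots.

    Simplification: counter resets are ignored, whereas PromQL's rate()
    compensates for them.
    """
    return {le: later[le] - earlier.get(le, 0.0) for le in later}


def estimate_quantile(q, buckets):
    """Estimate the q-quantile from cumulative buckets {upper_bound: count}.

    Mirrors the linear interpolation used by histogram_quantile():
    observations are assumed to be spread evenly inside each bucket.
    """
    if not 0.0 < q <= 1.0:
        raise ValueError("q must be in (0, 1]")
    bounds = sorted(buckets)
    if not bounds or bounds[-1] != math.inf:
        raise ValueError("buckets must include the +Inf upper bound")

    total = buckets[math.inf]
    if total <= 0:
        return math.nan  # no observations in the window

    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound in bounds:
        count = buckets[bound]
        if count >= rank:
            if bound == math.inf:
                # Quantile falls in the overflow bucket: report the
                # highest finite upper bound instead.
                return prev_bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical cumulative bucket counts from two scrapes, one window apart.
earlier = {0.5: 10, 1.0: 20, 2.0: 20, math.inf: 20}
later = {0.5: 50, 1.0: 90, 2.0: 118, math.inf: 120}

p95 = estimate_quantile(0.95, window_increase(earlier, later))
```

With these made-up counts, the estimate lands in the 1.0 to 2.0 second bucket, which would exceed the default 1.0 s threshold. With only the later snapshot available, the result would reflect the whole process lifetime rather than recent traffic.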
Signals
| Signal | Condition |
|---|---|
| High queue time | queue_time_p95_seconds >= 1.0 (default threshold) |
| Active backlog | num_requests_waiting > 0 — backlog confirmed |
Confidence
| Signals matched | Confidence |
|---|---|
| High queue time only | Low |
| High queue time + requests waiting | High |
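The two tables above can be read as a small decision function. The following is a minimal sketch of that logic, assuming the rule receives the p95 as a float (or None when it is unavailable, as in direct scrape mode). The function name classify_queue_latency and the string return values are hypothetical and may not match how the rule is actually implemented.

```python
from typing import Optional

# Default from the Configuration table.
QUEUE_TIME_P95_THRESHOLD_SECONDS = 1.0


def classify_queue_latency(
    queue_time_p95_seconds: Optional[float],
    num_requests_waiting: float,
    threshold: float = QUEUE_TIME_P95_THRESHOLD_SECONDS,
) -> Optional[str]:
    """Return None if the rule does not fire, else "low" or "high" confidence."""
    # No percentile available (for example direct scrape mode): do not fire.
    if queue_time_p95_seconds is None:
        return None
    # Written as "not >=" so that a NaN p95 also results in no finding.
    if not (queue_time_p95_seconds >= threshold):
        return None
    # High queue time is required; an active backlog raises confidence.
    return "high" if num_requests_waiting > 0 else "low"
```

For example, a p95 of 1.5 s with three waiting requests yields "high", while the same p95 with an empty queue yields "low".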
Likely causes
- Insufficient replica capacity for current request rate
- Long-context requests blocking admission of new sequences
- Autoscaling has not reacted to traffic increase
- KV cache exhaustion limiting sequence admission
Recommendations
- Add replicas or increase concurrency limits
- Inspect autoscaling thresholds and reaction time
- Correlate with KV cache pressure; reduce max_num_seqs if the cache is full
- Separate long-context traffic to a dedicated replica
Metrics used
- vllm:request_queue_time_seconds (histogram)
- vllm:num_requests_waiting
Configuration
| Setting | Default |
|---|---|
| Queue time p95 threshold | 1.0s |