Skip to content

Low Throughput

Detects when the server is processing requests below expected throughput with no queue pressure.

Background

Low throughput with an empty queue means the server is underutilized — not overloaded. This is distinct from queue pressure, where low throughput is caused by saturation. Here, the server has capacity but is not using it, typically due to low incoming load, poor batching, or misconfigured concurrency limits.

Signals

Signal Condition
Prefill throughput low prompt_tokens_per_second < 10 (default)
Decode throughput low generation_tokens_per_second < 50 (default)
Very few active requests num_requests_running < 2 (default)

This rule is suppressed when num_requests_waiting > 0 — a queue means the low throughput is a capacity problem, not underutilization.

Confidence

Signals matched Confidence
Both prefill and decode low, or very few running Medium
Only one metric low Low

Likely causes

  • Low incoming request rate — server is idle
  • Poor batching due to few concurrent requests
  • max_num_seqs or max_num_batched_tokens configured too conservatively for current load

Recommendations

  • Increase concurrent requests to improve batching efficiency
  • Review max_num_seqs and max_num_batched_tokens settings
  • Compare against a benchmark baseline to confirm underperformance
  • Consider consolidating replicas if load is consistently low

Metrics used

  • vllm:prompt_tokens_per_second
  • vllm:generation_tokens_per_second
  • vllm:num_requests_running

Configuration

from vllm_doctor.rules.low_throughput import LowThroughputRule

rule = LowThroughputRule(
    low_prompt_tps=20.0,    # default: 10.0
    low_gen_tps=100.0,      # default: 50.0
    low_running=3,          # default: 2
)