Low Throughput
Detects when the server is processing requests below expected throughput with no queue pressure.
Background
Low throughput with an empty queue means the server is underutilized — not overloaded. This is distinct from queue pressure, where low throughput is caused by saturation. Here, the server has capacity but is not using it, typically due to low incoming load, poor batching, or misconfigured concurrency limits.
Signals
| Signal | Condition |
|---|---|
| Prefill throughput low | prompt_tokens_per_second < 10 (default) |
| Decode throughput low | generation_tokens_per_second < 50 (default) |
| Very few active requests | num_requests_running < 2 (default) |
This rule is suppressed when num_requests_waiting > 0 — a queue means the low throughput is a capacity problem, not underutilization.
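The signal table and the suppression condition can be sketched as a single check. This is a minimal illustration, not the rule's actual implementation: the `Snapshot` class, its field names, and `matched_signals` are hypothetical, and only the thresholds and the `num_requests_waiting > 0` suppression come from the text above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Snapshot:
    """Hypothetical container for one reading of the relevant metrics."""
    prompt_tps: float  # prompt_tokens_per_second
    gen_tps: float     # generation_tokens_per_second
    running: int       # num_requests_running
    waiting: int       # num_requests_waiting


def matched_signals(
    s: Snapshot,
    low_prompt_tps: float = 10.0,
    low_gen_tps: float = 50.0,
    low_running: int = 2,
) -> List[str]:
    """Return the names of the signals that fire for this snapshot."""
    # A non-empty queue suppresses the rule entirely.
    if s.waiting > 0:
        return []
    signals = []
    if s.prompt_tps < low_prompt_tps:
        signals.append("prefill_low")
    if s.gen_tps < low_gen_tps:
        signals.append("decode_low")
    if s.running < low_running:
        signals.append("few_running")
    return signals
```

With the default thresholds, a snapshot with an empty queue and all three values below their limits matches every signal, while the same snapshot with any waiting requests matches none.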
Confidence
| Signals matched | Confidence |
|---|---|
| Both prefill and decode low, or very few running | Medium |
| Only one metric low | Low |
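One possible reading of the confidence table, written out as a function. The function name and the lowercase return values are assumptions made for illustration, as is the choice to return `None` when nothing matches; the table itself does not say how the rule represents these internally.

```python
from typing import Optional


def confidence(prefill_low: bool, decode_low: bool, few_running: bool) -> Optional[str]:
    """Map matched signals to a confidence level, following the table above."""
    # Both throughput metrics low, or very few running requests: medium.
    if (prefill_low and decode_low) or few_running:
        return "medium"
    # Exactly one throughput metric low: low.
    if prefill_low or decode_low:
        return "low"
    # No signal matched, so there is nothing to report.
    return None
```

Under this reading, a low running-request count alone is enough for medium confidence, even when both throughput metrics look healthy.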
Likely causes
- Low incoming request rate — server is idle
- Poor batching due to few concurrent requests
- max_num_seqs or max_num_batched_tokens configured too conservatively for current load
Recommendations
- Increase concurrent requests to improve batching efficiency
- Review max_num_seqs and max_num_batched_tokens settings
- Compare against a benchmark baseline to confirm underperformance
- Consider consolidating replicas if load is consistently low
Metrics used
- vllm:prompt_tokens_per_second
- vllm:generation_tokens_per_second
- vllm:num_requests_running
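To inspect these values by hand, the sketch below pulls them out of a metrics dump. It assumes the metrics are exposed as plain samples in the Prometheus text exposition format under exactly the names listed above; `read_metrics` is a hypothetical helper, not part of vllm_doctor, and it does not handle label values that contain `}`.

```python
import re
from typing import Dict

WANTED = (
    "vllm:prompt_tokens_per_second",
    "vllm:generation_tokens_per_second",
    "vllm:num_requests_running",
)

# A sample line: metric name, optional {labels}, whitespace, then the value.
_SAMPLE = re.compile(r"^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{[^}]*\})?\s+(\S+)")


def read_metrics(text: str) -> Dict[str, float]:
    """Extract the metrics listed above from Prometheus-style text."""
    values: Dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the # HELP / # TYPE comment lines.
        if not line or line.startswith("#"):
            continue
        m = _SAMPLE.match(line)
        if m and m.group(1) in WANTED:
            # If a metric appears with several label sets, the last one wins.
            values[m.group(1)] = float(m.group(2))
    return values
```

A metric that is absent from the dump is simply missing from the returned dictionary, so callers should decide how to treat missing readings rather than assume zero.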
Configuration
from vllm_doctor.rules.low_throughput import LowThroughputRule

rule = LowThroughputRule(
    low_prompt_tps=20.0,  # default: 10.0
    low_gen_tps=100.0,    # default: 50.0
    low_running=3,        # default: 2
)