KV Cache Pressure
Detects when the GPU KV cache is near exhaustion.
Background
The KV cache stores the attention keys and values for each active sequence in GPU memory. When the cache fills up, vLLM cannot admit new sequences, so requests stall in the waiting queue even if GPU compute is otherwise available.
Classic failure mode: a few long-context requests fill the cache, blocking all other short requests behind them.
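
A rough per-token estimate shows why a handful of long requests can dominate the cache. The sketch below uses the usual sizing for decoder-only transformers (one key vector and one value vector per layer per KV head). The model dimensions are illustrative assumptions rather than values from any particular deployment, and the estimate ignores vLLM's block-level allocation granularity.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """Approximate KV cache bytes needed for one token of one sequence."""
    # Factor of 2: one key vector and one value vector per layer per KV head.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def kv_bytes_for_sequence(num_tokens: int, **dims: int) -> int:
    """Approximate KV cache bytes held by a sequence of num_tokens tokens."""
    return num_tokens * kv_bytes_per_token(**dims)


# Hypothetical model dimensions, chosen only to make the arithmetic concrete.
EXAMPLE_DIMS = dict(num_layers=32, num_kv_heads=8, head_dim=128, dtype_bytes=2)

long_request = kv_bytes_for_sequence(32_768, **EXAMPLE_DIMS)
short_request = kv_bytes_for_sequence(512, **EXAMPLE_DIMS)

print(f"long request:  {long_request / 2**30:.1f} GiB")
print(f"short request: {short_request / 2**20:.1f} MiB")
print(f"one long request costs as much as {long_request // short_request} short ones")
```

Under these assumed dimensions, a single 32k-token request occupies the same cache space as dozens of short requests, which is the blocking pattern described above.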
Signals
| Signal | Condition |
|---|---|
| Cache near full | kv_cache_usage_perc >= 0.90 (default) |
| Cache blocking queue | num_requests_waiting > 0 while kv_cache_usage_perc is at or above the threshold |
Confidence
| Signals matched | Confidence |
|---|---|
| Cache high only | Medium |
| Cache high + requests waiting | High |
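
The two tables above can be combined into a single decision function. This is a minimal sketch, not the detector's actual implementation: the function name and the `None` return for "no match" are assumptions made for illustration, and `kv_cache_usage_perc` is assumed to be a fraction between 0 and 1.

```python
from typing import Optional

# Default threshold taken from the Signals table.
DEFAULT_KV_CACHE_THRESHOLD = 0.90


def kv_cache_pressure_confidence(
    kv_cache_usage_perc: float,
    num_requests_waiting: int,
    threshold: float = DEFAULT_KV_CACHE_THRESHOLD,
) -> Optional[str]:
    """Return "high", "medium", or None when the detector does not match."""
    if kv_cache_usage_perc < threshold:
        # The cache is not near full, so neither signal applies.
        return None
    if num_requests_waiting > 0:
        # Cache near full and requests are queued behind it.
        return "high"
    # Cache near full, but nothing is waiting yet.
    return "medium"


print(kv_cache_pressure_confidence(0.95, 3))
print(kv_cache_pressure_confidence(0.92, 0))
print(kv_cache_pressure_confidence(0.40, 12))
```

A non-empty waiting queue alone does not match, because requests can queue for reasons unrelated to the cache.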
Likely causes
- Long-context requests holding large KV cache allocations
- `max_num_seqs` or `max_num_batched_tokens` set too high for available GPU memory
- Sudden spike in concurrent requests
Recommendations
- Reduce `max_num_seqs` to limit concurrent sequences
- Reduce `max_num_batched_tokens` to cap memory per step
- Increase `gpu_memory_utilization` if GPU memory headroom exists
- Route long-context requests to a dedicated replica
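
As a sketch of how the first three recommendations map onto a server launch, the helper below assembles a `vllm serve` command line using the `--max-num-seqs`, `--max-num-batched-tokens`, and `--gpu-memory-utilization` flags. The helper name, the model identifier, and the numeric values are hypothetical placeholders; suitable values depend on the model and the GPU.

```python
import shlex
from typing import List


def build_vllm_serve_command(
    model: str,
    max_num_seqs: int,
    max_num_batched_tokens: int,
    gpu_memory_utilization: float,
) -> List[str]:
    """Assemble argv for `vllm serve` with the KV-cache-related settings."""
    if not 0.0 < gpu_memory_utilization <= 1.0:
        raise ValueError("gpu_memory_utilization must be in (0, 1]")
    return [
        "vllm", "serve", model,
        "--max-num-seqs", str(max_num_seqs),
        "--max-num-batched-tokens", str(max_num_batched_tokens),
        "--gpu-memory-utilization", str(gpu_memory_utilization),
    ]


# Placeholder model name and values, for illustration only.
cmd = build_vllm_serve_command(
    model="my-org/my-model",
    max_num_seqs=64,
    max_num_batched_tokens=4096,
    gpu_memory_utilization=0.92,
)
print(shlex.join(cmd))
```

Changing one setting at a time and watching `vllm:kv_cache_usage_perc` afterwards makes it easier to tell which adjustment relieved the pressure.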
Metrics used
- `vllm:kv_cache_usage_perc`
- `vllm:num_requests_waiting`
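
Both metrics are gauges in the Prometheus text format served by vLLM's metrics endpoint. The sketch below pulls them out of a scraped payload using only the standard library. The function name and the sample payload (including its label names and values) are made up for illustration, and the parser assumes label values contain no `}` character.

```python
from typing import Dict, Iterable, List

WANTED_METRICS = ("vllm:kv_cache_usage_perc", "vllm:num_requests_waiting")


def read_samples(metrics_text: str,
                 wanted: Iterable[str] = WANTED_METRICS) -> Dict[str, List[float]]:
    """Collect sample values for the wanted metric names, one per label set."""
    samples: Dict[str, List[float]] = {name: [] for name in wanted}
    for raw_line in metrics_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "{" in line:
            name = line.split("{", 1)[0]
            rest = line.rsplit("}", 1)[1]
        else:
            parts = line.split(None, 1)
            if len(parts) != 2:
                continue  # malformed line without a value
            name, rest = parts
        fields = rest.split()
        if name in samples and fields:
            # The value is the first field after the name and labels.
            samples[name].append(float(fields[0]))
    return samples


# Hypothetical scrape output, for illustration only.
SAMPLE = """\
# TYPE vllm:kv_cache_usage_perc gauge
vllm:kv_cache_usage_perc{engine="0",model_name="example-model"} 0.93
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{engine="0",model_name="example-model"} 7.0
"""

parsed = read_samples(SAMPLE)
print(parsed)
```

Each metric maps to a list because a scrape may contain one sample per label set; how to aggregate them (for example, taking the maximum usage) is left to the caller.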
Configuration
This threshold is currently fixed in the CLI:
| Setting | Default |
|---|---|
| KV cache usage threshold | 0.90 |
CLI flags for threshold overrides are planned.