Prefix Cache Efficiency
Detects when the prefix cache hit rate is low despite requests being served.
Background
vLLM's prefix cache reuses KV computations for repeated prompt prefixes such as system prompts, few-shot examples, or shared context. When requests share a common prefix, vLLM skips recomputing it on every request, reducing time to first token (TTFT) and GPU load.
A low hit rate means those savings are not being realized. This is often a configuration oversight: prefix caching is disabled by default in some vLLM versions.
Signals
| Signal | Condition |
|---|---|
| Low hit rate | prefix_cache_hit_rate < 0.50 (default threshold) |
Confidence
| Hit rate | Confidence |
|---|---|
| < 20% | High |
| 20% – 50% | Medium |
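The two tables above can be read as one small decision rule. The following is a minimal sketch of that rule, not the check's actual implementation: the function name `classify_hit_rate` and the constant names are hypothetical, and the hit rate is assumed to be a fraction in the range 0 to 1.

```python
from typing import Optional

# Default threshold from the Signals table; the constant names are hypothetical.
MIN_HIT_RATE = 0.50
# Boundary between the High and Medium rows of the Confidence table.
HIGH_CONFIDENCE_BELOW = 0.20


def classify_hit_rate(hit_rate: float,
                      min_hit_rate: float = MIN_HIT_RATE) -> Optional[str]:
    """Return "high" or "medium" when the signal fires, otherwise None."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be a fraction between 0 and 1")
    if hit_rate >= min_hit_rate:
        return None  # hit rate is at or above the threshold: no signal
    if hit_rate < HIGH_CONFIDENCE_BELOW:
        return "high"
    return "medium"


if __name__ == "__main__":
    for rate in (0.05, 0.35, 0.80):
        print(rate, classify_hit_rate(rate))
```

A hit rate of exactly 0.20 is treated as Medium here, following the "20% – 50%" row; a hit rate of exactly 0.50 does not fire, following the strict `<` in the signal condition.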
Likely causes
- Prefix caching not enabled (`--enable-prefix-caching` not set)
- Requests do not share common prefixes (system prompts, few-shot examples)
- Cache eviction too aggressive for the workload
Recommendations
- Enable prefix caching: add `--enable-prefix-caching` to vLLM startup
- Ensure requests share a common system prompt or few-shot prefix
- Review `prefix_caching_hash_algo` if cache collisions are suspected
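To check whether requests actually share a prefix, it can help to measure the common prefix across a sample of prompts. The sketch below is a rough illustration only: `build_prompt` and `shared_prefix_chars` are hypothetical helpers, and the comparison is done on characters, whereas the cache matches on tokens, so the result is a proxy rather than an exact measure of reusable work.

```python
import os
from typing import List


def build_prompt(system_prompt: str, user_input: str) -> str:
    """Put the stable content first and the per-request content last."""
    return f"{system_prompt}\n\n{user_input}"


def shared_prefix_chars(prompts: List[str]) -> int:
    """Length in characters of the longest prefix common to all prompts."""
    if not prompts:
        return 0
    # os.path.commonprefix compares plain strings character by character.
    return len(os.path.commonprefix(prompts))


if __name__ == "__main__":
    system = "You are a helpful assistant."
    stable_first = [build_prompt(system, q)
                    for q in ("What is KV caching?", "Summarize this text.")]
    variable_first = [f"[request {i}] {system}" for i in (1, 2)]
    print(shared_prefix_chars(stable_first))    # covers the whole system prompt
    print(shared_prefix_chars(variable_first))  # stops at the request id
```

Placing any per-request value (an ID, a timestamp, a user name) ahead of the shared text shortens the common prefix to almost nothing, which is one way a workload with a shared system prompt can still show a low hit rate.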
Metrics used
- `vllm:prefix_cache_hits_total`
- `vllm:prefix_cache_queries_total`
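A hit rate can be derived from these two counters as hits divided by queries. The sketch below assumes the rate is computed over the interval between two scrapes of the Prometheus text output; whether the check uses an interval or lifetime totals is not stated here, so treat that as an assumption. The helper names `read_counters` and `hit_rate` are hypothetical.

```python
import re
from typing import Dict, Optional

HITS = "vllm:prefix_cache_hits_total"
QUERIES = "vllm:prefix_cache_queries_total"

# A sample line looks like: name{label="value"} 123.0
_SAMPLE = re.compile(r"^([^\s{]+)(?:\{.*\})?\s+(\S+)")


def read_counters(metrics_text: str) -> Dict[str, float]:
    """Sum both counters across all label sets in Prometheus text output."""
    totals = {HITS: 0.0, QUERIES: 0.0}
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _SAMPLE.match(line)
        if match and match.group(1) in totals:
            totals[match.group(1)] += float(match.group(2))
    return totals


def hit_rate(before: Dict[str, float],
             after: Dict[str, float]) -> Optional[float]:
    """Hit rate over the interval between two scrapes, or None if idle."""
    queries = after[QUERIES] - before[QUERIES]
    if queries <= 0:
        return None  # no queries in the interval (or the counters reset)
    return (after[HITS] - before[HITS]) / queries
```

Returning `None` when no queries occurred keeps an idle server from being reported as having a 0% hit rate, which matches the intent of flagging a low rate only while requests are being served.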
Configuration
| Setting | Default |
|---|---|
| Min hit rate | 0.50 |
CLI flags for threshold overrides are planned.