Skip to content

Prefix Cache Efficiency

Detects when the prefix cache hit rate is low despite requests being served.

Background

vLLM's prefix cache reuses KV computations for repeated prompt prefixes — system prompts, few-shot examples, or shared context. When requests share a common prefix, vLLM skips recomputing it on every request, reducing TTFT and GPU load.

A low hit rate means those savings are not being realized. This is often a configuration oversight: prefix caching is disabled by default in some vLLM versions.

Signals

Signal Condition
Low hit rate prefix_cache_hit_rate < 0.50 (default threshold)

Confidence

Hit rate Confidence
< 20% High
20% – 50% Medium

Likely causes

  • Prefix caching not enabled (--enable-prefix-caching not set)
  • Requests do not share common prefixes (system prompts, few-shot examples)
  • Cache eviction too aggressive for the workload

Recommendations

  • Enable prefix caching: add --enable-prefix-caching to vLLM startup
  • Ensure requests share a common system prompt or few-shot prefix
  • Review prefix_caching_hash_algo if cache collisions are suspected

Metrics used

  • vllm:prefix_cache_hits_total
  • vllm:prefix_cache_queries_total

Configuration

Setting Default
Min hit rate 0.50

CLI flags for threshold overrides are planned.