Replica Imbalance

Detects when load is unevenly distributed across the replicas of a deployment — one replica overloaded while its peers sit idle. This points at the routing layer rather than the model: an uneven load balancer, an unready replica receiving no traffic, or long-context requests pinned to a subset of pods.

Because one vLLM deployment serves one model, replicas are grouped by the model_name label and compared only against peers serving the same model. This keeps the comparison correct on a shared Prometheus that scrapes several deployments. A group with a single replica is skipped — there is nothing to compare.
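
The grouping step can be sketched roughly as follows. This is an illustration only, not the tool's actual code: the function name group_by_model, the replica_label parameter, and the (labels, value) sample shape are assumptions made for the example.

```python
from collections import defaultdict

def group_by_model(samples, replica_label="pod"):
    """Group per-replica samples by model_name (hypothetical helper).

    samples: iterable of (labels, value) pairs, where labels is a dict
    holding at least "model_name" and the chosen replica label.
    Returns {model_name: {replica: value}}, keeping only groups with
    two or more replicas, since a single replica has no peer to compare.
    """
    groups = defaultdict(dict)
    for labels, value in samples:
        groups[labels["model_name"]][labels[replica_label]] = value
    return {model: reps for model, reps in groups.items() if len(reps) >= 2}

samples = [
    ({"model_name": "llama", "pod": "pod-a"}, 4.0),
    ({"model_name": "llama", "pod": "pod-b"}, 1.0),
    ({"model_name": "mistral", "pod": "pod-c"}, 7.0),
]
grouped = group_by_model(samples)
```

Here the mistral group is dropped because it has only one replica, while the two llama replicas are compared against each other.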

Signals

Each signal is evaluated per model group.

Signal         | Condition
Running spread | max num_requests_running >= imbalance_factor × min num_requests_running across replicas (default 2×); evaluated only when num_requests_running summed over the group is at least min_total_running (default 5)
Cache gap      | max kv_cache_usage_perc − min kv_cache_usage_perc >= cache_gap (default 0.30)
Waiting skew   | at least one replica has queued requests (num_requests_waiting > 0) while another has none
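
As a minimal sketch of how the three conditions could be checked for one model group (assuming the thresholds above; the Replica class and matched_signals function are hypothetical names for illustration, not the tool's API):

```python
from dataclasses import dataclass

@dataclass
class Replica:
    running: float   # vllm:num_requests_running
    waiting: float   # vllm:num_requests_waiting
    kv_cache: float  # vllm:kv_cache_usage_perc, as a fraction in [0, 1]

def matched_signals(replicas, imbalance_factor=2.0, cache_gap=0.30,
                    min_total_running=5.0):
    """Return the names of the signals that match for one model group."""
    if len(replicas) < 2:
        return []  # a single replica has nothing to compare against
    running = [r.running for r in replicas]
    waiting = [r.waiting for r in replicas]
    cache = [r.kv_cache for r in replicas]

    matched = []
    # Running spread: written as a multiplication so that an idle replica
    # (min == 0) needs no division. This form is an assumption.
    if (sum(running) >= min_total_running
            and max(running) >= imbalance_factor * min(running)):
        matched.append("running_spread")
    # Cache gap: difference between the fullest and emptiest KV cache.
    if max(cache) - min(cache) >= cache_gap:
        matched.append("cache_gap")
    # Waiting skew: some replica queues requests while another queues none.
    if max(waiting) > 0 and min(waiting) == 0:
        matched.append("waiting_skew")
    return matched

skewed = [Replica(running=8, waiting=3, kv_cache=0.95),
          Replica(running=1, waiting=0, kv_cache=0.40)]
balanced = [Replica(running=3, waiting=0, kv_cache=0.50),
            Replica(running=3, waiting=0, kv_cache=0.50)]
```

With these inputs, matched_signals(skewed) reports all three signals and matched_signals(balanced) reports none.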

Confidence

Signals matched | Confidence
1               | Low
2               | Medium
3               | High
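
The mapping is a direct lookup on the number of matched signals. A small sketch follows; the function name is hypothetical, and returning None when no signal matches is an assumption about how "rule does not fire" might be represented.

```python
def confidence_for(signals_matched: int):
    """Map the count of matched signals to a confidence label."""
    levels = {1: "Low", 2: "Medium", 3: "High"}
    # Assumption: zero matches means the rule does not fire, modeled as None.
    return levels.get(signals_matched)

labels = [confidence_for(n) for n in (0, 1, 2, 3)]
```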

Likely causes

  • Load balancer not distributing requests evenly (sticky sessions or connection reuse)
  • A replica is not Ready, or was recently restarted, so traffic skips it
  • Long-context requests pinned to a subset of replicas
  • Autoscaler added replicas that are not yet receiving traffic

Recommendations

  • Check the load balancer / service routing and session affinity settings
  • Verify readiness probes — an unready replica receives no traffic
  • Compare per-replica latency and restart any unhealthy replica
  • Confirm newly added replicas are registered with the load balancer

Metrics used

  • vllm:num_requests_running
  • vllm:num_requests_waiting
  • vllm:kv_cache_usage_perc

Configuration

[rules.replica_imbalance]
imbalance_factor = 2.0     # busiest / least-busy running ratio
cache_gap = 0.30           # kv cache usage max − min (fraction)
min_total_running = 5.0    # minimum total running load before the running signal fires

Notes

  • Requires per-replica labels in the metrics (e.g. pod, instance, host). In direct scrape mode against a single endpoint there is only one replica, so this rule does not fire — use a Prometheus target that scrapes all replicas.
  • Latency imbalance (TTFT, TPOT, queue time) is not detected: those percentiles are returned as a single aggregate value, not per replica.