
Error Rate

Detects elevated server-side errors or client aborts relative to total requests.

Background

vLLM tracks completed requests by finished_reason:

Reason       Meaning
stop         Completed normally
error        Server-side failure (OOM, internal error)
abort        Client disconnected or request cancelled
length       Hit max_tokens limit
repetition   Stopped by repetition penalty

This rule monitors the error and abort rates. A high error rate indicates that the server is failing requests internally. A high abort rate often means clients are giving up, typically because responses are too slow.

Signals

Signal            Condition
Error rate high   errors / total >= 0.05 (default)
Abort rate high   aborts / total >= 0.10 (default)
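
The two conditions above can be sketched as a small function. This is an illustration, not the rule's actual implementation: the names RateSignals and evaluate_signals are hypothetical, and it assumes that "total" is the sum of requests across all finished_reason values in the evaluation window, which this page does not spell out.

```python
from dataclasses import dataclass


@dataclass
class RateSignals:
    error_rate: float
    abort_rate: float
    error_high: bool
    abort_high: bool


def evaluate_signals(counts, high_error_rate=0.05, high_abort_rate=0.10):
    """Evaluate both signals from per-reason request counts.

    counts maps finished_reason -> requests finished in the window.
    Assumption: total is the sum over every finished_reason present.
    """
    total = sum(counts.values())
    if total == 0:
        # No finished requests in the window, so neither signal can fire.
        return RateSignals(0.0, 0.0, False, False)
    error_rate = counts.get("error", 0) / total
    abort_rate = counts.get("abort", 0) / total
    return RateSignals(
        error_rate=error_rate,
        abort_rate=abort_rate,
        error_high=error_rate >= high_error_rate,
        abort_high=abort_rate >= high_abort_rate,
    )
```

With counts of 90 stop, 6 error, and 4 abort, the error rate is 0.06 and the abort rate is 0.04, so only the error signal matches under the default thresholds.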

Confidence

Signals matched           Confidence
Error high only           Low
Abort high only           Low
Both error + abort high   High

Severity

  • Critical when the error rate is high: the server is actively failing requests
  • Warning when only the abort rate is high: clients are disconnecting
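
The confidence table and severity rules combine into one mapping from the two signals to a finding. The sketch below is illustrative only. The function name classify and the lowercase string labels are assumptions, not part of vllm_doctor's API.

```python
def classify(error_high, abort_high):
    """Map the two signals to (confidence, severity), or None if neither matched."""
    if not (error_high or abort_high):
        return None
    # Confidence is high only when both signals match.
    confidence = "high" if (error_high and abort_high) else "low"
    # Any elevated error rate is critical; aborts alone are a warning.
    severity = "critical" if error_high else "warning"
    return confidence, severity
```

Note that a high error rate alone still yields critical severity, but with low confidence.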

Likely causes

  • Server-side OOM or internal errors under high load
  • Requests exceeding timeout limits causing client aborts
  • High latency causing clients to disconnect before completion
  • Resource exhaustion correlating with KV cache pressure

Recommendations

  • Inspect vLLM server logs for error details
  • Correlate with KV cache pressure and queue pressure findings
  • Check client timeout settings relative to observed TTFT and TPOT
  • Reduce load or add replicas if errors correlate with traffic spikes
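
For the client timeout check, a rough estimate of end-to-end latency is TTFT plus TPOT for each output token after the first. The sketch below applies that estimate. The function names, the 1.5x headroom factor, and the numbers in the usage note are illustrative assumptions, not values taken from vllm_doctor.

```python
def estimated_latency_s(ttft_s, tpot_s, output_tokens):
    """Approximate request latency: first token, then one TPOT per remaining token."""
    if output_tokens <= 0:
        return 0.0
    return ttft_s + tpot_s * (output_tokens - 1)


def timeout_is_sufficient(timeout_s, ttft_s, tpot_s, output_tokens, headroom=1.5):
    """True if the client timeout covers the estimate with some headroom."""
    return timeout_s >= headroom * estimated_latency_s(ttft_s, tpot_s, output_tokens)
```

For example, with a TTFT of 0.8 s, a TPOT of 0.05 s, and 512 output tokens, the estimate is 0.8 + 0.05 * 511 = 26.35 s, so a 30 s client timeout leaves little margin and may explain aborts on longer responses.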

Metrics used

  • vllm:request_success_total{finished_reason="error"}
  • vllm:request_success_total{finished_reason="abort"}
  • vllm:request_success_total{finished_reason="stop"}
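
These counters can be read from the Prometheus text exposed by the vLLM server. The sketch below parses such text with the standard library and sums counts per finished_reason. The helper names are hypothetical, and this is not how vllm_doctor itself collects metrics. Because the counters are cumulative, a rate over a time window needs the difference between two scrapes.

```python
import re
from collections import defaultdict

# Matches sample lines such as:
# vllm:request_success_total{finished_reason="stop",model_name="m"} 90.0
_SAMPLE = re.compile(r'^vllm:request_success_total\{([^}]*)\}\s+(\S+)')
_REASON = re.compile(r'finished_reason="([^"]*)"')


def parse_finished_counts(metrics_text):
    """Sum vllm:request_success_total per finished_reason across other labels."""
    counts = defaultdict(float)
    for line in metrics_text.splitlines():
        sample = _SAMPLE.match(line.strip())
        if not sample:
            continue  # skips "# HELP", "# TYPE", and unrelated metrics
        reason = _REASON.search(sample.group(1))
        if reason:
            counts[reason.group(1)] += float(sample.group(2))
    return dict(counts)


def window_counts(earlier, later):
    """Per-reason increase between two scrapes, clamped at 0 after a counter reset."""
    return {r: max(later.get(r, 0.0) - earlier.get(r, 0.0), 0.0) for r in later}
```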

Configuration

from vllm_doctor.rules.error_rate import ErrorRateRule

rule = ErrorRateRule(
    high_error_rate=0.02,   # default: 0.05
    high_abort_rate=0.05,   # default: 0.10
)