Introduction

vLLM Doctor

Diagnose vLLM serving issues from /metrics.

vLLM Doctor reads production metrics and turns them into operational findings: what looks wrong, how confident the diagnosis is, and which vLLM knobs are worth checking first.

vllm-doctor --url http://localhost:8000/metrics

Built for incident context

vLLM Doctor is not a dashboard replacement. It is a fast diagnostic snapshot for a single server or Prometheus target.

Why not just a dashboard?

Dashboards show metrics. vLLM Doctor explains inference-system behavior.

Dashboards vLLM Doctor
Shows raw metrics
Explains what's wrong
Recommends vLLM configs
Requires setup
Works on a single server

Installation

pip install vllm-doctor
uv tool install vllm-doctor

Quickstart

vllm-doctor --url http://localhost:8000/metrics
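
For a sense of what a direct scrape has to work with, here is a minimal sketch that pulls a few gauges out of a Prometheus text payload. It is not vLLM Doctor's own parser. The metric names (`vllm:num_requests_running`, `vllm:num_requests_waiting`, `vllm:gpu_cache_usage_perc`) and the `demo` model label are assumptions and can differ between vLLM versions, so check your server's /metrics output first.

```python
import re

# Assumed metric names mapped to display labels; verify against your /metrics.
GAUGES = {
    "vllm:num_requests_running": "Requests Running",
    "vllm:num_requests_waiting": "Requests Waiting",
    "vllm:gpu_cache_usage_perc": "GPU Cache Usage",
}

# One sample line: metric name, optional {labels}, then a numeric value.
# Simplification: label values containing "}" are not handled.
SAMPLE_RE = re.compile(r"^([A-Za-z_:][A-Za-z0-9_:]*)(\{[^}]*\})?\s+(\S+)")


def read_gauges(text: str) -> dict:
    """Return the known gauges found in a Prometheus text payload.

    Assumes one series per metric; if several label sets exist,
    the last one seen wins.
    """
    out = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match and match.group(1) in GAUGES:
            out[GAUGES[match.group(1)]] = float(match.group(3))
    return out


# Hand-written payload for illustration, not captured from a real server.
SAMPLE = """\
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="demo"} 12.0
vllm:num_requests_waiting{model_name="demo"} 7.0
vllm:gpu_cache_usage_perc{model_name="demo"} 0.94
"""

print(read_gauges(SAMPLE))
```

Against a live server, the payload could be fetched with `urllib.request.urlopen(url).read().decode()` and passed to `read_gauges`.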

Note

Direct scrape mode reads instant gauge values, so latency percentile rules (TTFT, TPOT) are not available. Use Prometheus mode for full diagnosis.

vllm-doctor --url http://localhost:9090
vllm-doctor --url http://localhost:8000/metrics --format json
vllm-doctor --url http://localhost:8000/metrics --verbose

Example output

─────────── vLLM Doctor  ·  Health: CRITICAL  ·  Window: 5m ────────────

╭─  KV cache pressure  [high confidence] ─────────────────────────────╮
│   GPU KV cache usage: 94%  ·  Waiting requests: 7                    │
│                                                                      │
│    Reduce max_num_seqs to limit concurrent sequences                │
│    Increase gpu_memory_utilization if GPU memory headroom exists    │
╰──────────────────────────────────────────────────────────────────────╯
╭─  Queue pressure  [low confidence] ─────────────────────────────────╮
│   Waiting requests: 7                                                │
│                                                                      │
│    Add replicas or increase concurrency limits                      │
│    Inspect autoscaling thresholds                                   │
╰──────────────────────────────────────────────────────────────────────╯

─────────────────────────── Observed Metrics ───────────────────────────

  Requests Running                             12
  Requests Waiting                              7
  GPU Cache Usage        ███████████████████░ 94%
  Generation Tokens/s                        42.0
  TTFT p95 (s)                              3.200
  TPOT p95 (s)                              0.050
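
To make the idea of a finding concrete, here is a small sketch of how a rule like "KV cache pressure" could be expressed. The 0.90 threshold, the confidence logic, and the `Finding` type are all hypothetical and chosen only to reproduce the shape of the example output; vLLM Doctor's actual rules may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Finding:
    title: str
    confidence: str
    evidence: str


# Hypothetical threshold, not taken from vLLM Doctor.
KV_CACHE_HIGH = 0.90


def kv_cache_pressure(cache_usage: float, waiting: int) -> Optional[Finding]:
    """Flag KV cache pressure when usage is high.

    Confidence is "high" only when requests are also queuing, since a
    full cache with an empty queue may just be a well-utilized server.
    """
    if cache_usage < KV_CACHE_HIGH:
        return None
    confidence = "high" if waiting > 0 else "low"
    evidence = f"GPU KV cache usage: {cache_usage:.0%}  ·  Waiting requests: {waiting}"
    return Finding("KV cache pressure", confidence, evidence)


finding = kv_cache_pressure(cache_usage=0.94, waiting=7)
if finding:
    print(f"{finding.title}  [{finding.confidence} confidence]")
    print(finding.evidence)
```

Combining a gauge with a second signal, as done here with the waiting queue, is one way a rule can attach a confidence level instead of alerting on a single threshold.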