Skip to content

Introduction

vLLM Doctor

Diagnose vLLM server bottlenecks from live metrics.

vllm-doctor demo

vLLM Doctor reads vLLM server metrics and turns them into diagnostic findings: what looks unhealthy, why it may be happening, and which vLLM settings are worth checking first.

vllm-doctor http://localhost:8000/metrics

Built for incident context

vLLM Doctor is not a dashboard replacement or benchmark runner. It is a fast server-side diagnostic snapshot for a single vLLM server or Prometheus target.

Why not just a dashboard?

Dashboards show metrics. vLLM Doctor explains server-side inference behavior.

Dashboards vLLM Doctor
Shows raw metrics
Explains what's wrong
Recommends vLLM configs
Requires setup
Works on a single server

How does this relate to GuideLLM?

GuideLLM is a good fit for generating workloads and measuring endpoint behavior. vLLM Doctor is a good fit for explaining server-side symptoms from vLLM metrics.

Used together, GuideLLM can create or replay load while vLLM Doctor helps explain bottlenecks such as queue pressure, KV cache pressure, high TTFT, or high TPOT.

Installation

pip install vllm-doctor
uv tool install vllm-doctor

Quickstart

vllm-doctor http://localhost:8000/metrics

Note

Direct scrape mode reads instant gauge values. Latency percentile rules (TTFT, TPOT) are not available — use Prometheus mode for full diagnosis.

vllm-doctor http://localhost:9090

Run with Docker

A prebuilt image is published to GitHub Container Registry:

docker run --rm ghcr.io/aminalaee/vllm-doctor <url>

<url> is your vLLM /metrics or Prometheus endpoint — the same argument the CLI takes — reachable from inside the container.

Options

Usage: vllm-doctor [OPTIONS] URL

Arguments:
  URL                         vLLM /metrics or Prometheus URL to diagnose.  [required]

Options:
  -s, --since    TEXT         Time window (e.g. '1h', '30m', 'now').  [default: now]
  -m, --model    TEXT         Filter metrics by model_name label (for a target serving several models).
  -w, --watch                 Refresh continuously every 5s (pipe through `watch -n N` for a different interval).
  -o, --output   [text|json]  Output format.  [default: text]
  -v, --verbose               Show additional diagnostic detail.
  -c, --config   PATH         Path to config file (default: vllm-doctor.toml).
      --version               Show version and exit.
      --help                  Show this message and exit.

Example verbose output

─────────────────────────────────── vLLM Doctor  ·  Health: CRITICAL  ·  Since: now ────────────────────────────────────

╭─  KV cache pressure  [high confidence] ─────────────────────────────────────────────────────────────────────────────╮
│   GPU KV cache usage: 94% (threshold: 90%)  ·  Waiting requests: 7 (blocked by full cache)                           │
│                                                                                                                      │
│    Reduce max_num_seqs to limit concurrent sequences                                                                │
│    Reduce max_num_batched_tokens to cap memory per step                                                             │
│    Increase gpu_memory_utilization if GPU memory headroom exists                                                    │
│    Route long-context requests to a dedicated replica                                                               │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─  High time to first token (TTFT)  [high confidence] ───────────────────────────────────────────────────────────────╮
│   TTFT p95: 3.200s  ·  TPOT p95: 0.050s  ·  Waiting requests: 7                                                      │
│                                                                                                                      │
│    Enable or tune chunked prefill (--enable-chunked-prefill)                                                        │
│    Reduce max prompt length or filter long requests                                                                 │
│    Inspect queue depth  consider adding replicas                                                                   │
│    Separate long-context traffic to dedicated instances                                                             │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─  Replica imbalance  [high confidence] ─────────────────────────────────────────────────────────────────────────────╮
│   meta-llama/Llama-3.1-8B: running vllm-1=10 vs vllm-0=2; cache 94% vs 41%; waiting vllm-1=7 vs vllm-0=0             │
│                                                                                                                      │
│    Check the load balancer / service routing and session affinity settings                                          │
│    Verify readiness probes  an unready replica receives no traffic                                                 │
│    Compare per-replica latency and restart any unhealthy replica                                                    │
│    Confirm newly added replicas are registered with the load balancer                                               │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─  Queue pressure  [low confidence] ─────────────────────────────────────────────────────────────────────────────────╮
│   Waiting requests: 7 (threshold: 5)                                                                                 │
│                                                                                                                      │
│    Add replicas or increase concurrency limits                                                                      │
│    Inspect autoscaling thresholds                                                                                   │
│    Separate long-context traffic to a dedicated replica                                                             │
│    Reduce incoming request rate                                                                                     │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

  KV Cache Pressure           critical    [high]
  High TTFT                   warning     [high]
  Replica Imbalance           warning     [high]
  Queue Pressure              warning     [low]
  Queue Latency               ok
  Preemption Pressure         ok
  Low Throughput              ok
  Error Rate                  ok
  High TPOT                   ok
  Prefix Cache Efficiency     ok

─────────────────────────────────────────────────── Observed Metrics ───────────────────────────────────────────────────

  Summary
  Requests Running                               12
  Requests Waiting                                7
  GPU Cache Usage          ███████████████████░ 94%
  Prefill Tokens/s                            390.0
  Decode Tokens/s                             252.0
  Requests Success                              114
  Requests Error                                  0
  Requests Aborted                                0
  TTFT p95 (s)                                3.200
  TPOT p95 (s)                                0.050
  Queue Time p95 (s)                          0.800
  Preemptions Total                               0
  Prefix Cache Hit Rate                         50%

─────────────────────────────────────────────── Observed Metrics per pod ───────────────────────────────────────────────

                       vllm-1    vllm-0
  Requests Running         10         2
  Requests Waiting          7         0
  GPU Cache Usage         94%       41%
  Prefill Tokens/s       80.0     310.0
  Decode Tokens/s        42.0     210.0
  Requests Success         30        84
  Requests Error            0         0
  Requests Aborted          0         0
  Preemptions Total         0         0