Debugging SGLang's NIXL PD Disaggregation: How a Missing CLI Argument Breaks Multi-Prefill Inference

February 20, 2025·13 min read·Sudipta Pathak

Background

SGLang is a high-performance serving framework for large language models. One of its advanced features is Prefill-Decode (PD) Disaggregation -- the ability to split the prefill phase and decode phase of LLM inference across separate GPU workers. The prefill worker computes the initial KV cache from the prompt, then transfers it via RDMA (using the NIXL backend) to a decode worker that generates tokens autoregressively.

This architecture scales well: you can have multiple prefill workers and multiple decode workers behind a Rust-based PD router. In theory.

In practice, issue #18414 reported that multi-prefill setups (2 prefills + 1 decode, or "2P1D") fail with alternating request failures that progressively degrade to 100% failure. Single-prefill setups work fine.

This post walks through how I root-caused the bug -- from initial symptoms to the three-line code path that breaks everything.

The Architecture

To understand the bug, you need to understand how PD disaggregation routes a request through four components:

User Request
     |
     v
 [PD Router]  (Rust, sgl-model-gateway)
     |
     +---> POST to selected Prefill (HTTP)
     +---> POST to Decode (HTTP)
             |
             v
        [Decode Worker]
             |
             | 1. Fetches bootstrap info from Prefill's HTTP bootstrap server
             | 2. Sends TransferInfo via ZMQ to Prefill's bootstrap thread
             |
             v
        [Prefill Worker]
             |
             | 3. Bootstrap thread receives TransferInfo, marks room as "bootstrapped"
             | 4. Prefill computes KV cache, initiates RDMA transfer to Decode
             |
             v
        [Decode Worker]
             | 5. Receives KV cache, begins autoregressive decoding

The critical coordination mechanism is the bootstrap room -- a random u64 generated by the router for each request. The router injects this room ID, along with the selected prefill's bootstrap_host and bootstrap_port, into the JSON request body. Both the prefill and decode receive this same JSON, so the decode knows which prefill's bootstrap server to contact for this specific request.
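As a rough illustration of that injection step, here is a minimal Python sketch. The real router is written in Rust; the function name inject_bootstrap and the use of secrets.randbits are illustrative assumptions, and only the three field names come from the code excerpts in this post.

```python
import json
import secrets


def inject_bootstrap(body, bootstrap_host, bootstrap_port, bootstrap_room=None):
    """Return a copy of the request body with the bootstrap fields added.

    Hypothetical helper: it mirrors what the post describes the router doing,
    not the router's actual implementation.
    """
    out = dict(body)
    out["bootstrap_host"] = bootstrap_host
    # A port of None serializes to JSON null, which matters later in this post.
    out["bootstrap_port"] = bootstrap_port
    # The room is a random unsigned 64-bit integer, one per request.
    out["bootstrap_room"] = (
        secrets.randbits(64) if bootstrap_room is None else bootstrap_room
    )
    return out


request = {"text": "Hello"}
routed = inject_bootstrap(request, "127.0.0.1", 8999)
# The same serialized body is sent to both the selected prefill and the decode.
wire = json.dumps(routed)
```

Because both workers receive the same body, the decode can only find the right prefill if bootstrap_host and bootstrap_port in that body identify it unambiguously.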

Reproducing the Bug

Setup: 4 GPUs on a RunPod instance

  • Prefill 1: GPU 0, TP=1, port 30000, bootstrap port 8998
  • Prefill 2: GPU 1, TP=1, port 30001, bootstrap port 8999
  • Decode: GPUs 2-3, TP=2, port 30002
  • Router: port 8000

1P1D (works perfectly):

sglang-router launch --pd-disaggregation \
  --prefill http://127.0.0.1:30000 \
  --decode http://127.0.0.1:30002 --port 8000

5/5 requests pass. Debug logs confirm that both decode TP ranks send TransferInfo, the prefill's bootstrap thread receives both messages, the room is marked bootstrapped, and each transfer completes in about 1 second.

2P1D (fails immediately):

sglang-router launch --pd-disaggregation \
  --prefill http://127.0.0.1:30000 \
  --prefill http://127.0.0.1:30001 \
  --decode http://127.0.0.1:30002 --port 8000

The very first request that gets routed to Prefill 2 hangs for 300 seconds, then times out.
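A small harness for sending sequential requests through the router might look like the sketch below. The /generate path and the payload fields are assumptions about the HTTP API being exercised, and post_json and run_sequential are hypothetical helper names. The client timeout is set slightly above 300 seconds so that the server-side timeout is what surfaces.

```python
import json
import urllib.request
from typing import Callable, List

# Assumption: the router exposes a /generate endpoint on port 8000.
ROUTER_URL = "http://127.0.0.1:8000/generate"


def post_json(payload: dict, url: str = ROUTER_URL, timeout_s: float = 310.0) -> str:
    """POST a JSON payload and return the response body as text."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return response.read().decode("utf-8")


def run_sequential(n: int, send: Callable[[dict], str] = post_json) -> List[bool]:
    """Send n requests one after another and record which ones succeeded."""
    results = []
    for i in range(1, n + 1):
        # Assumption: payload shape for a short text-generation request.
        payload = {"text": "Say hi", "sampling_params": {"max_new_tokens": 8}}
        try:
            send(payload)
        except OSError as exc:
            # urllib's URLError/HTTPError and socket timeouts all derive from OSError.
            print(f"Request {i}: FAIL ({exc})")
            results.append(False)
        else:
            print(f"Request {i}: OK")
            results.append(True)
    return results
```

Calling run_sequential(5) against a running router sends five requests in order; passing a different send callable makes the loop testable without a live server.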

The Investigation

Narrowing Down: The Bootstrap Mismatch

With debug logging enabled on all servers, I analyzed the failed 2P1D request (bootstrap room 2148025886387619708):

Decode log -- contacted Prefill 1's bootstrap (rank_port 42337):

[04:46:19 TP0] Fetched bootstrap info: {'rank_ip': '172.22.0.2', 'rank_port': 42337} for engine rank: 0
[04:46:19 TP1] Fetched bootstrap info: {'rank_ip': '172.22.0.2', 'rank_port': 42337} for engine rank: 1
[04:46:19 TP0] Sending to prefill server with bootstrap room 2148025886387619708

Prefill 1 log -- received TransferInfo and bootstrapped the room (but has no request for it!):

[04:46:19] got info room=2148025886387619708 agent_name='e8197e8e...' required_dst_info_num=2
[04:46:19] got info room=2148025886387619708 agent_name='a2bca0a5...' required_dst_info_num=2
[04:46:19] room=2148025886387619708 is bootstrapped

Prefill 2 log -- received the HTTP POST but never got TransferInfo, eventually aborted:

[04:46:51] Abort bootstrap queue request. req.rid='8ba391cea5ee418994eeff91231b4909'

The picture was clear: the router sent the POST to Prefill 2, but the decode sent TransferInfo to Prefill 1. A prefill/bootstrap mismatch.

Following the Bootstrap Address

The decode constructs bootstrap_addr from the request's bootstrap_host and bootstrap_port, then uses it to contact the prefill's bootstrap HTTP server and ZMQ endpoint:

# decode.py:359
kv_receiver = kv_receiver_class(
    mgr=self.kv_manager,
    bootstrap_addr=f"{req.bootstrap_host}:{req.bootstrap_port}",
    bootstrap_room=req.bootstrap_room,
)

The bootstrap_addr is also used as a cache key for the connection_pool -- a per-instance cache that stores bootstrap connection info to avoid re-fetching on every request:

# common/conn.py:396-436
bootstrap_key = f"{self.bootstrap_addr}_{self.target_dp_group}_{self.target_tp_rank}"

if bootstrap_key not in self.kv_mgr.connection_pool:
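    # Fresh fetch from bootstrap HTTP server (abridged excerpt: the fetched
    # entries are collected into self.bootstrap_infos before being cached)
    bootstrap_info = self._get_bootstrap_info_from_server(...)
    self.kv_mgr.connection_pool[bootstrap_key] = self.bootstrap_infos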
else:
    # Cache hit -- reuse existing connection info
    self.bootstrap_infos = self.kv_mgr.connection_pool[bootstrap_key]

If the bootstrap_addr is different for each prefill (e.g., "127.0.0.1:8998" vs "127.0.0.1:8999"), the cache keys are different, and each prefill gets its own fresh connection. If they're the same, every request hits the cache and contacts the same prefill -- regardless of which prefill the router selected.
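The effect of the key collapsing can be reproduced with a plain dictionary. This is a minimal sketch, not SGLang's code: FakeKVManager, make_key, and get_bootstrap_info are hypothetical stand-ins, and the rank_port values are the ones that appear in the logs in this post.

```python
from typing import Dict, List


class FakeKVManager:
    """Stand-in for the KV manager: a connection pool plus a fetch log."""

    def __init__(self) -> None:
        self.connection_pool: Dict[str, dict] = {}
        self.fetches: List[str] = []


def make_key(bootstrap_addr: str, target_dp_group: int, target_tp_rank: int) -> str:
    # Same key shape as the excerpt above.
    return f"{bootstrap_addr}_{target_dp_group}_{target_tp_rank}"


def get_bootstrap_info(mgr: FakeKVManager, bootstrap_addr: str, servers: Dict[str, dict]) -> dict:
    key = make_key(bootstrap_addr, 0, 0)
    if key not in mgr.connection_pool:
        # Cache miss: "fetch" from the bootstrap server at that address.
        mgr.fetches.append(bootstrap_addr)
        mgr.connection_pool[key] = servers[bootstrap_addr]
    return mgr.connection_pool[key]


SERVERS = {
    "127.0.0.1:8998": {"rank_port": 42337},  # Prefill 1
    "127.0.0.1:8999": {"rank_port": 34833},  # Prefill 2
}

# Distinct addresses: two keys, two fetches, two different prefills.
distinct = FakeKVManager()
first = get_bootstrap_info(distinct, "127.0.0.1:8998", SERVERS)
second = get_bootstrap_info(distinct, "127.0.0.1:8999", SERVERS)

# Collapsed addresses: one key, one fetch, always Prefill 1's info.
collapsed = FakeKVManager()
third = get_bootstrap_info(collapsed, "127.0.0.1:8998", SERVERS)
fourth = get_bootstrap_info(collapsed, "127.0.0.1:8998", SERVERS)
```

The cache is behaving as designed in both cases; what changes the outcome is whether the address it is keyed on actually distinguishes the prefills.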

The Smoking Gun

The decode log for the failed request showed NO "for DP 0 TP 0 PP 0" log line (which only appears on a fresh fetch). Only "for engine rank:" (which appears on every request). This meant the connection_pool cache was hit -- the bootstrap_key was the same as the 1P1D requests.

But how? If the router injected Prefill 2's bootstrap port (8999), the key would be "127.0.0.1:8999_0_0", different from the cached "127.0.0.1:8998_0_0".

I checked the router's startup log:

Starting router on 0.0.0.0:8000 | mode: PrefillDecode {
  prefill_urls: [
    ("http://127.0.0.1:30000", None),
    ("http://127.0.0.1:30001", None)
  ], ...
}

Both prefills had None for bootstrap_port. The router was started without specifying bootstrap ports.

The Root Cause: A Three-Component Bug

The failure involves three components collaborating to create the mismatch:

1. Router injects null bootstrap_port

When the router doesn't know a prefill's bootstrap port, it injects null into the JSON request:

// pd_router.rs:262-267
obj.insert(
    Self::BOOTSTRAP_PORT_KEY.to_string(),
    match prefill_worker.bootstrap_port() {
        Some(v) => Value::from(v),
        None => Value::Null,  // <-- null when not specified
    },
);

The router could discover this from the prefill's /server_info endpoint (which returns disaggregation_bootstrap_port), but the ServerInfo struct doesn't include that field:

// discover_metadata.rs:29-44
pub struct ServerInfo {
    pub model_id: Option<String>,
    pub model_path: Option<String>,
    pub tp_size: Option<usize>,
    pub disaggregation_mode: Option<String>,
    // ... no bootstrap_port field!
}

2. Decode defaults to its own bootstrap port

When the decode scheduler receives a request with bootstrap_port: null, it defaults to the decode server's own --disaggregation-bootstrap-port value:

# scheduler.py:1447-1449
if recv_req.bootstrap_port is None:
    # Use default bootstrap port
    recv_req.bootstrap_port = self.server_args.disaggregation_bootstrap_port

The decode was started with --disaggregation-bootstrap-port 8998. This happens to be Prefill 1's port -- and now it becomes the default for ALL requests, even those meant for Prefill 2.

3. Connection pool amplifies the problem

The bootstrap_addr becomes "127.0.0.1:8998" for every request (since port always defaults to 8998). The connection_pool caches Prefill 1's bootstrap info under key "127.0.0.1:8998_0_0". Every subsequent request -- even those the router intends for Prefill 2 -- hits this cache and contacts Prefill 1.

The Complete Failure Flow

1. Router selects Prefill 2 for the request
2. Router injects bootstrap_host="127.0.0.1", bootstrap_port=null
3. Router sends POST to Prefill 2 AND Decode simultaneously
4. Decode scheduler: bootstrap_port is None -> defaults to 8998 (Prefill 1!)
5. Decode: bootstrap_addr = "127.0.0.1:8998" -> cache hit -> Prefill 1's info
6. Decode sends TransferInfo via ZMQ to Prefill 1's bootstrap thread
7. Prefill 1 marks room as bootstrapped, but has NO HTTP request for this room
8. Prefill 2 has the HTTP request, waits for TransferInfo that never arrives
9. Prefill 2 waits in pop_bootstrapped() with no bootstrap-phase timeout (in the captured run, the request was aborted 32 seconds later)
10. Decode times out after 300s -> HTTP 500
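The core of this flow, steps 1 through 8, can be modeled in a few lines. The sketch below is a toy model under stated assumptions: the Request dataclass, the port-to-prefill table, and the function names are all hypothetical, and only the "default to 8998 when the port is None" rule is taken from the scheduler excerpt above.

```python
from dataclasses import dataclass
from typing import Optional

# From this post's setup: the decode's own bootstrap port and each prefill's port.
DECODE_DEFAULT_BOOTSTRAP_PORT = 8998
PREFILL_BY_BOOTSTRAP_PORT = {8998: "prefill-1", 8999: "prefill-2"}


@dataclass
class Request:
    routed_to: str                  # prefill the router POSTed the request to
    bootstrap_host: str
    bootstrap_port: Optional[int]   # None models the injected JSON null


def decode_contacts(req: Request) -> str:
    """Return which prefill the decode ends up sending TransferInfo to."""
    port = req.bootstrap_port
    if port is None:
        # Mirrors the scheduler fallback: use the decode's own bootstrap port.
        port = DECODE_DEFAULT_BOOTSTRAP_PORT
    return PREFILL_BY_BOOTSTRAP_PORT[port]


def is_mismatch(req: Request) -> bool:
    """True when the decode bootstraps a room on a prefill that has no request for it."""
    return decode_contacts(req) != req.routed_to


broken = Request(routed_to="prefill-2", bootstrap_host="127.0.0.1", bootstrap_port=None)
fixed = Request(routed_to="prefill-2", bootstrap_host="127.0.0.1", bootstrap_port=8999)
lucky = Request(routed_to="prefill-1", bootstrap_host="127.0.0.1", bootstrap_port=None)
```

The lucky case is why 1P1D never exposed the problem: with a null port, requests routed to Prefill 1 still land on the right bootstrap server by coincidence.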

The Fix and Proof

The immediate fix is simple -- pass bootstrap ports explicitly when starting the router:

# Broken -- no bootstrap ports, both show None:
sglang-router launch --pd-disaggregation \
  --prefill http://127.0.0.1:30000 \
  --prefill http://127.0.0.1:30001 \
  --decode http://127.0.0.1:30002 --port 8000

# Router log confirms:
# prefill_urls: [("http://127.0.0.1:30000", None), ("http://127.0.0.1:30001", None)]

# Fixed -- explicit bootstrap port per prefill:
sglang-router launch --pd-disaggregation \
  --prefill http://127.0.0.1:30000 8998 \
  --prefill http://127.0.0.1:30001 8999 \
  --decode http://127.0.0.1:30002 --port 8000

# Router log confirms:
# prefill_urls: [("http://127.0.0.1:30000", Some(8998)), ("http://127.0.0.1:30001", Some(8999))]

Before the Fix: Broken 2P1D

With the router launched without bootstrap ports, the first request routed to Prefill 2 fails. Here is the actual log evidence showing the mismatch:

Decode -- both TP ranks contact Prefill 1 (rank_port: 42337) instead of Prefill 2 (rank_port: 34833):

[04:46:19 TP0] Fetched bootstrap info: {'rank_ip': '172.22.0.2', 'rank_port': 42337, 'is_dummy': False} for engine rank: 0
[04:46:19 TP1] Fetched bootstrap info: {'rank_ip': '172.22.0.2', 'rank_port': 42337, 'is_dummy': False} for engine rank: 1
[04:46:19 TP0] Sending to prefill server with bootstrap room 2148025886387619708 is_dummy=False
[04:46:19 TP1] Sending to prefill server with bootstrap room 2148025886387619708 is_dummy=False

Prefill 1 -- receives TransferInfo, bootstraps the room. But it has no HTTP request for this room:

[04:46:19] got info room=2148025886387619708 agent_name='e8197e8e-84fb-4552-b02f-b9d79e334bf7' required_dst_info_num=2
[04:46:19] got info room=2148025886387619708 agent_name='a2bca0a5-f36b-4c23-8250-de11603b86b1' required_dst_info_num=2
[04:46:19] room=2148025886387619708 is bootstrapped

Prefill 2 -- has the HTTP request but never receives TransferInfo. Aborts after 32 seconds:

[04:46:51] Abort bootstrap queue request. req.rid='8ba391cea5ee418994eeff91231b4909'

Decode -- times out after 300 seconds waiting for KV transfer that never starts:

[04:51:19 TP0] Failure recorded for room 2148025886387619708: Request 2148025886387619708 timed out after 300.0s in KVPoll.WaitingForInput
[04:51:19 TP0] Decode transfer failed for request rank=0 decode_req.req.rid='9d468d4f41804f33bfe12ce2602c73eb' decode_req.req.bootstrap_room=2148025886387619708 with exception Request 2148025886387619708 timed out after 300.0s in KVPoll.WaitingForInput

After the Fix: Working 2P1D

With the router launched with explicit bootstrap ports, 15/15 sequential requests succeed with no hangs or timeouts:

Request 1:  OK  (response: Hi!)
Request 2:  OK  (response: Hi)
Request 3:  OK  (response: Hi!)
Request 4:  OK  (response: Hi!)
Request 5:  OK  (response: Hi)
Request 6:  OK  (response: Hi)
Request 7:  OK  (response: Hi!)
Request 8:  OK  (response: Hi)
Request 9:  OK  (response: Hi)
Request 10: OK  (response: Hi!)
Request 11: OK  (response: Hi)
Request 12: OK  (response: Hi!)
Request 13: OK  (response: Hi)
Request 14: OK  (response: Hi!)
Request 15: OK  (response: Hi)

Decode -- now correctly discovers and alternates between BOTH prefills' bootstrap servers:

# Decode discovers Prefill 2's bootstrap for the first time (cache miss -> fresh fetch):
[04:59:59 TP0] Fetch prefill parallel info from [127.0.0.1:8999]: DP size:1, TP size:1 PP size:1 Page size:1

# Subsequent requests alternate between both prefills:
[05:04:10 TP0] Fetched bootstrap info: {'rank_ip': '172.22.0.2', 'rank_port': 42337, 'is_dummy': False} ...  # Prefill 1
[05:04:10 TP0] Fetched bootstrap info: {'rank_ip': '172.22.0.2', 'rank_port': 34833, 'is_dummy': False} ...  # Prefill 2
[05:04:10 TP0] Fetched bootstrap info: {'rank_ip': '172.22.0.2', 'rank_port': 42337, 'is_dummy': False} ...  # Prefill 1
[05:04:10 TP0] Fetched bootstrap info: {'rank_ip': '172.22.0.2', 'rank_port': 34833, 'is_dummy': False} ...  # Prefill 2
[05:04:10 TP0] Fetched bootstrap info: {'rank_ip': '172.22.0.2', 'rank_port': 42337, 'is_dummy': False} ...  # Prefill 1

Prefill 1 -- receives TransferInfo and bootstraps rooms only for its own requests:

[05:04:10] got info room=7221185637627680706 agent_name='e8197e8e...' required_dst_info_num=2
[05:04:10] got info room=7221185637627680706 agent_name='a2bca0a5...' required_dst_info_num=2
[05:04:10] room=7221185637627680706 is bootstrapped
[05:04:10] got info room=8126175921748888596 agent_name='e8197e8e...' required_dst_info_num=2
[05:04:10] got info room=8126175921748888596 agent_name='a2bca0a5...' required_dst_info_num=2
[05:04:10] room=8126175921748888596 is bootstrapped

Prefill 2 -- now also receives TransferInfo correctly (this NEVER happened before the fix):

[05:04:10] got info room=2563102377340552573 agent_name='e8197e8e...' required_dst_info_num=2
[05:04:10] got info room=2563102377340552573 agent_name='a2bca0a5...' required_dst_info_num=2
[05:04:10] room=2563102377340552573 is bootstrapped
[05:04:10] got info room=2066834353112136973 agent_name='e8197e8e...' required_dst_info_num=2
[05:04:10] got info room=2066834353112136973 agent_name='a2bca0a5...' required_dst_info_num=2
[05:04:10] room=2066834353112136973 is bootstrapped

No aborts, no timeouts, no mismatches. Both prefills serve their share of requests successfully.

The Proper Upstream Fix

Rather than requiring users to pass bootstrap ports manually, the right fix is to have the router auto-discover each prefill's bootstrap port from the /server_info endpoint during worker registration. The prefill already exposes disaggregation_bootstrap_port in that response -- the router just wasn't reading it.

The fix is a 14-line change across 2 Rust files in the router (sgl-model-gateway).

Change 1: Deserialize the bootstrap port from /server_info

In discover_metadata.rs, add the field to the ServerInfo struct and extract it into the discovered labels:

pub struct ServerInfo {
    // ... existing fields ...
    pub disaggregation_mode: Option<String>,
    pub disaggregation_bootstrap_port: Option<u16>,  // NEW
    // ...
}

// In the label extraction block, following the same pattern as disaggregation_mode:
if let Some(port) = server_info.disaggregation_bootstrap_port {
    labels.insert(
        "disaggregation_bootstrap_port".to_string(),
        port.to_string(),
    );
}

Change 2: Use the discovered port as a fallback

In create_worker.rs, update parse_worker_type to use the discovered port when the CLI doesn't provide one:

fn parse_worker_type(config: &WorkerConfigRequest, labels: &HashMap<String, String>) -> WorkerType {
    config
        .worker_type
        .as_ref()
        .map(|t| match t.as_str() {
            "prefill" => WorkerType::Prefill {
                bootstrap_port: config.bootstrap_port.or_else(|| {
                    labels
                        .get("disaggregation_bootstrap_port")
                        .and_then(|s| s.parse().ok())
                }),
            },
            "decode" => WorkerType::Decode,
            _ => WorkerType::Regular,
        })
        .unwrap_or(WorkerType::Regular)
}

The or_else chain means: use the CLI-provided port if available, otherwise fall back to the auto-discovered port. This is backward compatible -- existing setups that pass bootstrap ports explicitly continue to work unchanged.
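For readers less familiar with Rust's Option combinators, the same precedence can be written out in Python. This is a sketch of the logic only: discovered_port and resolve_bootstrap_port are hypothetical names, and the /server_info body is assumed to be a JSON object that may carry a disaggregation_bootstrap_port field.

```python
import json
from typing import Optional


def discovered_port(server_info_body: str) -> Optional[int]:
    """Extract the bootstrap port from a /server_info JSON body, if present and valid."""
    info = json.loads(server_info_body)
    port = info.get("disaggregation_bootstrap_port")
    # Accept only real integers in the u16 range, like parsing into Option<u16>.
    if isinstance(port, int) and not isinstance(port, bool) and 0 <= port <= 65535:
        return port
    return None


def resolve_bootstrap_port(cli_port: Optional[int], server_info_body: str) -> Optional[int]:
    """CLI-provided port wins; otherwise fall back to the discovered one."""
    if cli_port is not None:
        return cli_port
    return discovered_port(server_info_body)


prefill_2_info = '{"disaggregation_mode": "prefill", "disaggregation_bootstrap_port": 8999}'
from_discovery = resolve_bootstrap_port(None, prefill_2_info)
from_cli = resolve_bootstrap_port(9000, prefill_2_info)
```

Comparing against None, rather than relying on truthiness, keeps an explicitly configured port from being overridden by accident.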

Verifying the Fix

I tested the fix on a 4x A100-80GB RunPod instance with the same 2P1D NIXL setup. The router was started without bootstrap ports on the CLI:

sglang-router launch --pd-disaggregation \
  --prefill http://127.0.0.1:30000 \
  --prefill http://127.0.0.1:30001 \
  --decode http://127.0.0.1:30002 --port 8000

The router log still shows None for the CLI-provided ports:

Prefill nodes: [("http://127.0.0.1:30000", None), ("http://127.0.0.1:30001", None)]

But now the router auto-discovers them from /server_info:

Prefill 1: "disaggregation_bootstrap_port": 8998
Prefill 2: "disaggregation_bootstrap_port": 8999

Result: 15/15 sequential requests succeeded. Both prefills served requests (7 and 8 respectively), zero hangs, zero timeouts. The same test that previously failed on every Prefill 2 request now works perfectly -- without any CLI changes.

Once the router always sends the correct port, the Python fallback in scheduler.py is no longer exercised for routed requests and can be cleaned up separately without breaking anyone.

Lessons Learned

1. Defaults that work for simple cases can silently break complex ones. The scheduler.py fallback was probably written for single-prefill setups where the decode's port naturally matches the prefill's. In multi-prefill setups, it creates an invisible mismatch.

2. Silent degradation is worse than a crash. The request didn't fail fast -- it hung for 300 seconds. The prefill's pop_bootstrapped() loop has no timeout for the bootstrapping phase, and the bootstrap thread has no error reporting for "received TransferInfo for unknown room." Adding timeouts and logging at these boundaries would have made this a 5-minute debug instead of a multi-session investigation.

3. Cache keying matters enormously. The connection_pool cache key includes the bootstrap_addr, which is the right design -- but when all bootstrap addresses collapse to the same string due to the port defaulting bug, the cache effectively becomes a single entry that always returns the wrong prefill's info.

4. Multi-component bugs require end-to-end tracing. This bug spans a Rust router, Python scheduler, Python KV manager, and a ZMQ bootstrap thread. No single component is broken in isolation. The router correctly injects what it knows (null). The scheduler correctly applies its default. The connection pool correctly caches. Each component is locally rational; the bug only emerges from their interaction.

5. The CLI is part of the API. When the --prefill URL [PORT] syntax makes the port optional and the system silently degrades without it, that's an API design bug. Either the port should be required in PD mode, or auto-discovery should fill it in.
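To make lesson 2 concrete, here is a minimal sketch of a bounded wait for the bootstrapping phase. It is not SGLang's pop_bootstrapped(); the function name, the BootstrapTimeout exception, and the polling approach are all assumptions chosen to show the idea of failing fast with a message that names the room.

```python
import time
from typing import Callable


class BootstrapTimeout(Exception):
    """Raised when a room is not bootstrapped before the deadline."""


def wait_until_bootstrapped(
    room: int,
    is_bootstrapped: Callable[[int], bool],
    timeout_s: float,
    poll_interval_s: float = 0.01,
) -> None:
    """Poll until the room is bootstrapped, or raise once timeout_s has elapsed."""
    deadline = time.monotonic() + timeout_s
    while True:
        if is_bootstrapped(room):
            return
        if time.monotonic() >= deadline:
            # Naming the room makes the mismatch searchable across worker logs.
            raise BootstrapTimeout(
                f"room {room} not bootstrapped within {timeout_s}s; "
                "check that the decode contacted this prefill's bootstrap server"
            )
        time.sleep(poll_interval_s)


bootstrapped_rooms = {7221185637627680706}

# Returns immediately: this room has received its TransferInfo.
wait_until_bootstrapped(7221185637627680706, bootstrapped_rooms.__contains__, timeout_s=0.05)

# Fails fast instead of hanging: this room's TransferInfo went elsewhere.
try:
    wait_until_bootstrapped(2148025886387619708, bootstrapped_rooms.__contains__, timeout_s=0.05)
    timed_out = False
except BootstrapTimeout as exc:
    timed_out = True
    message = str(exc)
```

A matching log line on the other side, where a bootstrap thread receives TransferInfo for a room it has no request for, would point at the mismatch from both directions.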