Skip to content

Add pd disaggregated inference#3558

Merged
Bihan merged 15 commits intodstackai:masterfrom
Bihan:add_pd_disaggregated_inference
Feb 18, 2026
Merged

Add pd disaggregated inference#3558
Bihan merged 15 commits intodstackai:masterfrom
Bihan:add_pd_disaggregated_inference

Conversation

@Bihan
Copy link
Collaborator

@Bihan Bihan commented Feb 10, 2026

Testing Steps

  1. Create (CPU node) in K8s cluster

  2. Create gateway in the CPU node using below config

type: gateway
name: bihan-gateway

backend: kubernetes
region: any

domain: bihan-gateway.dstack.ai
router: sglang
  1. Create GPU-node with 3 instances (1 Prefill, 1 Decode and 1 for testing scaling) in the same K8s cluster where gateway node exists.
    Note: See design doc for details on why the gateway and workers are required to be on the same network.

  2. Apply below prefill-decode service configuration

type: service
name: prefill-decode
image: lmsysorg/sglang:latest

env:
  - HF_TOKEN
  - MODEL_ID=meta-llama/Llama-3.2-1B-Instruct

replicas:
  - count: 1..2
    scaling:
      metric: rps
      target: 3
    commands:
      - |
          python -m sglang.launch_server \
            --model-path $MODEL_ID \
            --disaggregation-mode prefill \
            --disaggregation-transfer-backend mooncake \
            --host 0.0.0.0 \
            --port 8000 \
            --disaggregation-bootstrap-port 8998 \
            --log-level debug \
            > worker-server.log 2>&1
    resources:
      gpu: 1

  - count: 1
    commands:
      - |
          python -m sglang.launch_server \
            --model-path $MODEL_ID \
            --disaggregation-mode decode \
            --disaggregation-transfer-backend mooncake \
            --host 0.0.0.0 \
            --port 8000 \
            --log-level debug \
            > worker-server.log 2>&1
    resources:
      gpu: 1

port: 8000
model: meta-llama/Llama-3.2-1B-Instruct

probes:
  - type: http
    url: /health_generate
    interval: 15s

router:
  type: sglang
  policy: round_robin
  pd_disaggregation: true
  1. When rps>=3 prefill replica scales to 2.

Note: For testing you need to assign wheel to https://bihan-test-bucket.s3.eu-west-1.amazonaws.com/dstack_gateway-0.0.1-py3-none-any.whl

Bihan Rana added 2 commits February 10, 2026 12:04
Test2

Internal IP Test

Add worker with internal_ip

Check status and register

Add Status Ready Log

Add Prefill-Decode

Add PD to dstack

Test register worker without poll

Add router config in service config

Update remove worker

Clean Up router code

Clean Up

Further Cleanup
Optional[AnyRouterConfig],
Field(
description=(
"Router configuration for the service. Requires a gateway with matching router enabled. "
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(nit)

Suggested change
"Router configuration for the service. Requires a gateway with matching router enabled. "
"Router configuration for the service. Requires a gateway with matching router enabled"

@Bihan Bihan merged commit 3aae583 into dstackai:master Feb 18, 2026
28 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants

Comments