/`.
+
+## Configuration options
+
+### PD disaggregation
+
+If you create a gateway with the [`sglang` router](https://dstack.ai/docs/concepts/gateways/#sglang), you can run SGLang with [PD disaggregation](https://docs.sglang.io/advanced_features/pd_disaggregation.html).
+
+
+
+```yaml
+type: service
+name: prefill-decode
+image: lmsysorg/sglang:latest
+
+env:
+  - HF_TOKEN
+  - MODEL_ID=zai-org/GLM-4.5-Air-FP8
+
+replicas:
+  - count: 1..4
+    scaling:
+      metric: rps
+      target: 3
+    commands:
+      - |
+        python -m sglang.launch_server \
+          --model-path $MODEL_ID \
+          --disaggregation-mode prefill \
+          --disaggregation-transfer-backend mooncake \
+          --host 0.0.0.0 \
+          --port 8000 \
+          --disaggregation-bootstrap-port 8998
+    resources:
+      gpu: H200
+
+  - count: 1..8
+    scaling:
+      metric: rps
+      target: 2
+    commands:
+      - |
+        python -m sglang.launch_server \
+          --model-path $MODEL_ID \
+          --disaggregation-mode decode \
+          --disaggregation-transfer-backend mooncake \
+          --host 0.0.0.0 \
+          --port 8000
+    resources:
+      gpu: H200
+
+port: 8000
+model: zai-org/GLM-4.5-Air-FP8
+
+# Custom probe is required for PD disaggregation
+probes:
+  - type: http
+    url: /health_generate
+    interval: 15s
+
+router:
+  type: sglang
+  pd_disaggregation: true
+```
+
+
+
+Currently, auto-scaling only supports `rps` as the metric. TTFT and ITL metrics are coming soon.
+
+#### Gateway
+
+Note that running services with PD disaggregation currently requires the gateway to run in the same cluster as the service.
+
+For example, if you run services on the `kubernetes` backend, make sure to also create the gateway in the same backend:
+
+
+
+```yaml
+type: gateway
+name: gateway-name
+
+backend: kubernetes
+region: any
+
+domain: example.com
+router:
+  type: sglang
+```
+
+
+
+
## Source code
-The source-code of this example can be found in
-[`examples/llms/deepseek/sglang`](https://github.com/dstackai/dstack/blob/master/examples/llms/deepseek/sglang).
+The source code for these examples can be found in
+[`examples/llms/deepseek/sglang`](https://github.com/dstackai/dstack/blob/master/examples/llms/deepseek/sglang) and [`examples/inference/sglang`](https://github.com/dstackai/dstack/blob/master/examples/inference/sglang).
## What's next?
diff --git a/examples/inference/sglang/pd.dstack.yml b/examples/inference/sglang/pd.dstack.yml
new file mode 100644
index 000000000..614d4e72b
--- /dev/null
+++ b/examples/inference/sglang/pd.dstack.yml
@@ -0,0 +1,51 @@
+type: service
+name: prefill-decode
+image: lmsysorg/sglang:latest
+
+env:
+  - HF_TOKEN
+  - MODEL_ID=zai-org/GLM-4.5-Air-FP8
+
+replicas:
+  - count: 1..4
+    scaling:
+      metric: rps
+      target: 3
+    commands:
+      - |
+        python -m sglang.launch_server \
+          --model-path $MODEL_ID \
+          --disaggregation-mode prefill \
+          --disaggregation-transfer-backend mooncake \
+          --host 0.0.0.0 \
+          --port 8000 \
+          --disaggregation-bootstrap-port 8998
+    resources:
+      gpu: 1
+
+  - count: 1..8
+    scaling:
+      metric: rps
+      target: 2
+    commands:
+      - |
+        python -m sglang.launch_server \
+          --model-path $MODEL_ID \
+          --disaggregation-mode decode \
+          --disaggregation-transfer-backend mooncake \
+          --host 0.0.0.0 \
+          --port 8000
+    resources:
+      gpu: 1
+
+port: 8000
+model: zai-org/GLM-4.5-Air-FP8
+
+probes:
+  - type: http
+    url: /health_generate
+    interval: 15s
+
+router:
+  type: sglang
+  pd_disaggregation: true