diff --git a/docs/docs/concepts/gateways.md b/docs/docs/concepts/gateways.md index 55573bd74..6ed19c2a0 100644 --- a/docs/docs/concepts/gateways.md +++ b/docs/docs/concepts/gateways.md @@ -110,7 +110,11 @@ router: -!!! info "Policy" +If you configure the `sglang` router, [services](../concepts/services.md) can run either [standard SGLang workers](../../examples/inference/sglang/index.md) or [Prefill-Decode workers](../../examples/inference/sglang/index.md#pd-disaggregation) (aka PD disaggregation). + +> Note that if you want to run services with PD disaggregation, the gateway must currently run in the same cluster as the service. + +??? info "Policy" The `policy` property allows you to configure the routing policy: * `cache_aware` — Default policy; combines cache locality with load balancing, falling back to shortest queue. @@ -119,9 +123,6 @@ router: * `round_robin` — Cycles through workers in order. -> Currently, services using this type of gateway must run standard SGLang workers. See the [example](../../examples/inference/sglang/index.md). -> -> Support for prefill/decode disaggregation and auto-scaling based on inter-token latency is coming soon. ### Public IP diff --git a/docs/docs/concepts/services.md b/docs/docs/concepts/services.md index d40984866..1eb63dd01 100644 --- a/docs/docs/concepts/services.md +++ b/docs/docs/concepts/services.md @@ -182,6 +182,8 @@ Setting the minimum number of replicas to `0` allows the service to scale down t > The `scaling` property requires creating a [gateway](gateways.md). + + ??? info "Replica groups" A service can include multiple replica groups. Each group can define its own `commands`, `resources` requirements, and `scaling` rules. @@ -230,8 +232,9 @@ Setting the minimum number of replicas to `0` allows the service to scale down t > Properties such as `regions`, `port`, `image`, `env` and some other cannot be configured per replica group. This support is coming soon. -??? 
info "Disaggregated serving" - Native support for disaggregated prefill and decode, allowing both worker types to run within a single service, is coming soon. +### PD disaggregation + +If you create a gateway with the [`sglang` router](gateways.md#sglang), you can run SGLang with [Prefill-Decode disaggregation](https://docs.sglang.ai/advanced_features/pd_disaggregation.html). See the [corresponding example](../../examples/inference/sglang/index.md#pd-disaggregation). ### Authorization diff --git a/examples/inference/sglang/README.md b/examples/inference/sglang/README.md index 5b7dc640a..6549afe5c 100644 --- a/examples/inference/sglang/README.md +++ b/examples/inference/sglang/README.md @@ -9,7 +9,7 @@ This example shows how to deploy DeepSeek-R1-Distill-Llama 8B and 70B using [SGL ## Apply a configuration -Here's an example of a service that deploys DeepSeek-R1-Distill-Llama 8B and 70B using SgLang. +Here's an example of a service that deploys DeepSeek-R1-Distill-Llama 8B and 70B using SGLang. === "NVIDIA" @@ -108,15 +108,106 @@ curl http://127.0.0.1:3000/proxy/services/main/deepseek-r1/v1/chat/completions \ ``` -!!! info "SGLang Model Gateway" - If you'd like to use a custom routing policy, e.g. by leveraging the [SGLang Model Gateway](https://docs.sglang.ai/advanced_features/router.html#), create a gateway with `router` set to `sglang`. Check out [gateways](https://dstack.ai/docs/concepts/gateways#router) for more details. +!!! info "Router policy" + If you'd like to use a custom routing policy, create a gateway with `router` set to `sglang`. Check out [gateways](https://dstack.ai/docs/concepts/gateways#router) for more details. -> If a [gateway](https://dstack.ai/docs/concepts/gateways/) is configured (e.g. to enable auto-scaling or HTTPs, rate-limits, etc), the service endpoint will be available at `https://deepseek-r1.<gateway domain>/`. +> If a [gateway](https://dstack.ai/docs/concepts/gateways/) is configured (e.g. to enable auto-scaling, HTTPS, rate limits, etc.), the service endpoint will be available at `https://deepseek-r1.<gateway domain>/`. + +## Configuration options + +### PD disaggregation + +If you create a gateway with the [`sglang` router](https://dstack.ai/docs/concepts/gateways/#sglang), you can run SGLang with [PD disaggregation](https://docs.sglang.ai/advanced_features/pd_disaggregation.html). + +
+ +```yaml +type: service +name: prefill-decode +image: lmsysorg/sglang:latest + +env: + - HF_TOKEN + - MODEL_ID=zai-org/GLM-4.5-Air-FP8 + +replicas: + - count: 1..4 + scaling: + metric: rps + target: 3 + commands: + - | + python -m sglang.launch_server \ + --model-path $MODEL_ID \ + --disaggregation-mode prefill \ + --disaggregation-transfer-backend mooncake \ + --host 0.0.0.0 \ + --port 8000 \ + --disaggregation-bootstrap-port 8998 + resources: + gpu: H200 + + - count: 1..8 + scaling: + metric: rps + target: 2 + commands: + - | + python -m sglang.launch_server \ + --model-path $MODEL_ID \ + --disaggregation-mode decode \ + --disaggregation-transfer-backend mooncake \ + --host 0.0.0.0 \ + --port 8000 + resources: + gpu: H200 + +port: 8000 +model: zai-org/GLM-4.5-Air-FP8 + +# Custom probe is required for PD disaggregation +probes: + - type: http + url: /health_generate + interval: 15s + +router: + type: sglang + pd_disaggregation: true +``` + +
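Once the configuration above is saved (it also ships with this example as `examples/inference/sglang/pd.dstack.yml`), deploy it with `dstack apply`:

```shell
$ dstack apply -f examples/inference/sglang/pd.dstack.yml
```

`dstack` provisions the prefill and decode replica groups separately and scales each one independently according to its own `scaling` rules.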
+

Currently, auto-scaling only supports `rps` as the metric. TTFT and ITL metrics are coming soon.

#### Gateway

Note that running services with PD disaggregation currently requires the gateway to run in the same cluster as the service.

For example, if you run services on the `kubernetes` backend, make sure to also create the gateway in the same backend:

+
+ +```yaml +type: gateway +name: gateway-name + +backend: kubernetes +region: any + +domain: example.com +router: + type: sglang +``` + +
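To create the gateway, pass the configuration to `dstack apply` (here assuming the file is saved as `gateway.dstack.yml`; the filename is arbitrary):

```shell
$ dstack apply -f gateway.dstack.yml
```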
+

## Source code

-The source-code of this example can be found in
-[`examples/llms/deepseek/sglang`](https://github.com/dstackai/dstack/blob/master/examples/llms/deepseek/sglang).
+The source code of these examples can be found in
+[`examples/llms/deepseek/sglang`](https://github.com/dstackai/dstack/blob/master/examples/llms/deepseek/sglang) and [`examples/inference/sglang`](https://github.com/dstackai/dstack/blob/master/examples/inference/sglang).

## What's next?

diff --git a/examples/inference/sglang/pd.dstack.yml b/examples/inference/sglang/pd.dstack.yml
new file mode 100644
index 000000000..614d4e72b
--- /dev/null
+++ b/examples/inference/sglang/pd.dstack.yml
@@ -0,0 +1,51 @@
+type: service
+name: prefill-decode
+image: lmsysorg/sglang:latest
+
+env:
+  - HF_TOKEN
+  - MODEL_ID=zai-org/GLM-4.5-Air-FP8
+
+replicas:
+  - count: 1..4
+    scaling:
+      metric: rps
+      target: 3
+    commands:
+      - |
+        python -m sglang.launch_server \
+          --model-path $MODEL_ID \
+          --disaggregation-mode prefill \
+          --disaggregation-transfer-backend mooncake \
+          --host 0.0.0.0 \
+          --port 8000 \
+          --disaggregation-bootstrap-port 8998
+    resources:
+      gpu: 1
+
+  - count: 1..8
+    scaling:
+      metric: rps
+      target: 2
+    commands:
+      - |
+        python -m sglang.launch_server \
+          --model-path $MODEL_ID \
+          --disaggregation-mode decode \
+          --disaggregation-transfer-backend mooncake \
+          --host 0.0.0.0 \
+          --port 8000
+    resources:
+      gpu: 1
+
+port: 8000
+model: zai-org/GLM-4.5-Air-FP8
+
+probes:
+  - type: http
+    url: /health_generate
+    interval: 15s
+
+router:
+  type: sglang
+  pd_disaggregation: true