dstackai
diff --git a/blog/dstack-sky/index.html b/blog/dstack-sky/index.html
Lines changed: 2 additions & 3 deletions
diff --git a/blog/posts/dstack-sky.md b/blog/posts/dstack-sky.md
Lines changed: 2 additions & 3 deletions
diff --git a/docs/concepts/services.md b/docs/concepts/services.md
Lines changed: 42 additions & 30 deletions
diff --git a/docs/concepts/services/index.html b/docs/concepts/services/index.html
Lines changed: 68 additions & 33 deletions
diff --git a/examples/accelerators/tenstorrent.md b/examples/accelerators/tenstorrent.md
Lines changed: 1 addition & 1 deletion
diff --git a/examples/accelerators/tenstorrent/index.html b/examples/accelerators/tenstorrent/index.html
Lines changed: 1 addition & 1 deletion
diff --git a/examples/inference/nim.md b/examples/inference/nim.md
Lines changed: 3 additions & 5 deletions
@@ -4133,13 +4133,12 @@ <h2 id="what-is-dstack-sky">What is dstack Sky?<a class="headerlink" href="#what
 </code></pre></div>
 </div>
 
-<p>If it has a <code>model</code> mapping, the model will be accessible
-at <code>https://gateway.&lt;project name&gt;.sky.dstack.ai</code> via the OpenAI compatible interface.</p>
+<p>The service endpoint will be accessible at <code>https://&lt;run name&gt;.&lt;project name&gt;.sky.dstack.ai</code> via the OpenAI-compatible interface.</p>
 <div class="highlight"><pre><span></span><code><span class="kn">from</span><span class="w"> </span><span class="nn">openai</span><span class="w"> </span><span class="kn">import</span> <span class="n">OpenAI</span>
 
 
 <span class="n">client</span> <span class="o">=</span> <span class="n">OpenAI</span><span class="p">(</span>
-  <span class="n">base_url</span><span class="o">=</span><span class="s2">&quot;https://gateway.&lt;project name&gt;.sky.dstack.ai&quot;</span><span class="p">,</span>
+  <span class="n">base_url</span><span class="o">=</span><span class="s2">&quot;https://&lt;run name&gt;.&lt;project name&gt;.sky.dstack.ai/v1&quot;</span><span class="p">,</span>
   <span class="n">api_key</span><span class="o">=</span><span class="s2">&quot;&lt;dstack token&gt;&quot;</span>
 <span class="p">)</span>
 
 
@@ -121,15 +121,14 @@ model: mixtral
 ```
 </div>
 
-If it has a `model` mapping, the model will be accessible
-at `https://gateway.<project name>.sky.dstack.ai` via the OpenAI compatible interface.
+The service endpoint will be accessible at `https://<run name>.<project name>.sky.dstack.ai` via the OpenAI-compatible interface.
 
 ```python
 from openai import OpenAI
 
 
 client = OpenAI(
-  base_url="https://gateway.<project name>.sky.dstack.ai",
+  base_url="https://<run name>.<project name>.sky.dstack.ai/v1",
   api_key="<dstack token>"
 )
 
 
@@ -73,7 +73,7 @@ Model meta-llama/Meta-Llama-3.1-8B-Instruct is published at:
 
 `dstack apply` automatically provisions instances and runs the service.
 
-If a [gateway](gateways.md) is not configured, the service’s endpoint will be accessible at
+If you have not created a [gateway](gateways.md), the service endpoint will be accessible at
 `<dstack server URL>/proxy/services/<project name>/<run name>/`.
 
 <div class="termy">
@@ -95,37 +95,50 @@ $ curl http://localhost:3000/proxy/services/main/llama31/v1/chat/completions \
 
 </div>
 
-If the service defines the [`model`](#model) property, the model can be accessed with
-the global OpenAI-compatible endpoint at `<dstack server URL>/proxy/models/<project name>/`,
-or via `dstack` UI.
+If [authorization](#authorization) is not disabled, the service endpoint requires the `Authorization` header with `Bearer <dstack token>`.
 
-If [authorization](#authorization) is not disabled, the service endpoint requires the `Authorization` header with
-`Bearer <dstack token>`.
+## Configuration options
 
-??? info "Gateway"
-    Running services for development purposes doesn’t require setting up a [gateway](gateways.md).
+<!-- !!! info "No commands"
+    If `commands` are not specified, `dstack` runs `image`’s entrypoint (or fails if none is set). -->
 
-    However, you'll need a gateway in the following cases:
+### Gateway
 
-    * To use auto-scaling or rate limits
-    * To enable a support custom router, e.g. such as the [SGLang Model Gateway](https://docs.sglang.ai/advanced_features/router.html#)
-    * To enable HTTPS for the endpoint and map it to your domain
-    * If your service requires WebSockets
-    * If your service cannot work with a [path prefix](#path-prefix)
+A service may need a [gateway](gateways.md) in the following cases:
 
-    <!-- Note, if you're using [dstack Sky](https://sky.dstack.ai),
-    a gateway is already pre-configured for you. -->
+* To use [auto-scaling](#replicas-and-scaling) or [rate limits](#rate-limits)
+* To support a custom router, such as the [SGLang Model Gateway](https://docs.sglang.ai/advanced_features/router.html#)
+* To enable HTTPS for the endpoint and map it to your domain
+* If your service requires WebSockets
+* If your service cannot work with a [path prefix](#path-prefix)
 
-    If a [gateway](gateways.md) is configured, the service endpoint will be accessible at
-    `https://<run name>.<gateway domain>/`.
+<!-- Note, if you're using [dstack Sky](https://sky.dstack.ai),
+a gateway is already pre-configured for you. -->
 
-    If the service defines the `model` property, the model will be available via the global OpenAI-compatible endpoint 
-    at `https://gateway.<gateway domain>/`.
+If you want `dstack` to validate explicitly that a gateway is used, set the [`gateway`](../reference/dstack.yml/service.md#gateway) property in the service configuration to `true`. In this case, `dstack` raises an error during `dstack apply` if no default gateway has been created.
 
-## Configuration options
+You can also set the `gateway` property to the name of a specific gateway, if required.
+
+If you have created a [gateway](gateways.md), the service endpoint will be accessible at `https://<run name>.<gateway domain>/`:
+
+<div class="termy">
+
+```shell
+$ curl https://llama31.example.com/v1/chat/completions \
+    -H 'Content-Type: application/json' \
+    -H 'Authorization: Bearer &lt;dstack token&gt;' \
+    -d '{
+        "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
+        "messages": [
+            {
+                "role": "user",
+                "content": "Compose a poem that explains the concept of recursion in programming."
+            }
+        ]
+    }'
+```
 
-!!! info "No commands"
-    If `commands` are not specified, `dstack` runs `image`’s entrypoint (or fails if none is set).
+</div>
 
 ### Replicas and scaling
 
@@ -220,12 +233,6 @@ Setting the minimum number of replicas to `0` allows the service to scale down t
 ??? info "Disaggregated serving"
     Native support for disaggregated prefill and decode, allowing both worker types to run within a single service, is coming soon.
 
-### Model
-
-If the service is running a chat model with an OpenAI-compatible interface,
-set the [`model`](#model) property to make the model accessible via `dstack`'s 
-global OpenAI-compatible endpoint, and also accessible via `dstack`'s UI.
-
 ### Authorization
 
 By default, the service enables authorization, meaning the service endpoint requires a `dstack` user token.
@@ -364,7 +371,7 @@ set [`strip_prefix`](../reference/dstack.yml/service.md#strip_prefix) to `false`
 If your app cannot be configured to work with a path prefix, you can host it
 on a dedicated domain name by setting up a [gateway](gateways.md).
 
-### Rate limits { #rate-limits }
+### Rate limits
 
 If you have a [gateway](gateways.md), you can configure rate limits for your service
 using the [`rate_limits`](../reference/dstack.yml/service.md#rate_limits) property.
@@ -413,6 +420,11 @@ Limits apply to the whole service (all replicas) and per client (by IP). Clients
 
     </div>
 
+### Model
+
+If the service runs a model with an OpenAI-compatible interface, you can set the [`model`](#model) property to make the model accessible through `dstack`'s chat UI on the `Models` page.
+In this case, `dstack` will use the service's `/v1/chat/completions` endpoint.
+
 ### Resources
 
 If you specify memory size, you can either specify an explicit size (e.g. `24GB`) or a 
 
@@ -934,11 +934,11 @@
       <ul class="md-nav__list">
 
           <li class="md-nav__item">
-  <a href="#replicas-and-scaling" class="md-nav__link">
+  <a href="#gateway" class="md-nav__link">
     <span class="md-ellipsis">
 
         <span class="md-typeset">
-          Replicas and scaling
+          Gateway
         </span>
 
     </span>
@@ -947,11 +947,11 @@
 </li>
 
           <li class="md-nav__item">
-  <a href="#model" class="md-nav__link">
+  <a href="#replicas-and-scaling" class="md-nav__link">
     <span class="md-ellipsis">
 
         <span class="md-typeset">
-          Model
+          Replicas and scaling
         </span>
 
     </span>
@@ -1009,6 +1009,19 @@
     </span>
   </a>
 
+</li>
+        
+          <li class="md-nav__item">
+  <a href="#model" class="md-nav__link">
+    <span class="md-ellipsis">
+      
+        <span class="md-typeset">
+          Model
+        </span>
+      
+    </span>
+  </a>
+  
 </li>
 
           <li class="md-nav__item">
@@ -4282,11 +4295,11 @@
       <ul class="md-nav__list">
 
           <li class="md-nav__item">
-  <a href="#replicas-and-scaling" class="md-nav__link">
+  <a href="#gateway" class="md-nav__link">
     <span class="md-ellipsis">
 
         <span class="md-typeset">
-          Replicas and scaling
+          Gateway
         </span>
 
     </span>
@@ -4295,11 +4308,11 @@
 </li>
 
           <li class="md-nav__item">
-  <a href="#model" class="md-nav__link">
+  <a href="#replicas-and-scaling" class="md-nav__link">
     <span class="md-ellipsis">
 
         <span class="md-typeset">
-          Model
+          Replicas and scaling
         </span>
 
     </span>
@@ -4357,6 +4370,19 @@
     </span>
   </a>
 
+</li>
+        
+          <li class="md-nav__item">
+  <a href="#model" class="md-nav__link">
+    <span class="md-ellipsis">
+      
+        <span class="md-typeset">
+          Model
+        </span>
+      
+    </span>
+  </a>
+  
 </li>
 
           <li class="md-nav__item">
@@ -4691,7 +4717,7 @@ <h2 id="apply-a-configuration">Apply a configuration<a class="headerlink" href="
 </div>
 
 <p><code>dstack apply</code> automatically provisions instances and runs the service.</p>
-<p>If a <a href="../gateways/">gateway</a> is not configured, the service’s endpoint will be accessible at
+<p>If you have not created a <a href="../gateways/">gateway</a>, the service endpoint will be accessible at
 <code>&lt;dstack server URL&gt;/proxy/services/&lt;project name&gt;/&lt;run name&gt;/</code>.</p>
 <div class="termy">
 
@@ -4711,34 +4737,44 @@ <h2 id="apply-a-configuration">Apply a configuration<a class="headerlink" href="
 
 </div>
 
-<p>If the service defines the <a href="#model"><code>model</code></a> property, the model can be accessed with
-the global OpenAI-compatible endpoint at <code>&lt;dstack server URL&gt;/proxy/models/&lt;project name&gt;/</code>,
-or via <code>dstack</code> UI.</p>
-<p>If <a href="#authorization">authorization</a> is not disabled, the service endpoint requires the <code>Authorization</code> header with
-<code>Bearer &lt;dstack token&gt;</code>.</p>
-<details class="info">
-<summary>Gateway</summary>
-<p>Running services for development purposes doesn’t require setting up a <a href="../gateways/">gateway</a>.</p>
-<p>However, you'll need a gateway in the following cases:</p>
+<p>If <a href="#authorization">authorization</a> is not disabled, the service endpoint requires the <code>Authorization</code> header with <code>Bearer &lt;dstack token&gt;</code>.</p>
+<h2 id="configuration-options">Configuration options<a class="headerlink" href="#configuration-options" title="Permanent link">&para;</a></h2>
+<!-- !!! info "No commands"
+    If `commands` are not specified, `dstack` runs `image`’s entrypoint (or fails if none is set). -->
+
+<h3 id="gateway">Gateway<a class="headerlink" href="#gateway" title="Permanent link">&para;</a></h3>
+<p>A service may need a <a href="../gateways/">gateway</a> in the following cases:</p>
 <ul>
-<li>To use auto-scaling or rate limits</li>
+<li>To use <a href="#replicas-and-scaling">auto-scaling</a> or <a href="#rate-limits">rate limits</a></li>
 <li>To enable a support custom router, e.g. such as the <a href="https://docs.sglang.ai/advanced_features/router.html#">SGLang Model Gateway</a></li>
 <li>To enable HTTPS for the endpoint and map it to your domain</li>
 <li>If your service requires WebSockets</li>
 <li>If your service cannot work with a <a href="#path-prefix">path prefix</a></li>
 </ul>
-<!-- Note, if you're using <a href="https://sky.dstack.ai">dstack Sky</a>,
+<!-- Note, if you're using [dstack Sky](https://sky.dstack.ai),
 a gateway is already pre-configured for you. -->
-<p>If a <a href="../gateways/">gateway</a> is configured, the service endpoint will be accessible at
-<code>https://&lt;run name&gt;.&lt;gateway domain&gt;/</code>.</p>
-<p>If the service defines the <code>model</code> property, the model will be available via the global OpenAI-compatible endpoint 
-at <code>https://gateway.&lt;gateway domain&gt;/</code>.</p>
-</details>
-<h2 id="configuration-options">Configuration options<a class="headerlink" href="#configuration-options" title="Permanent link">&para;</a></h2>
-<div class="admonition info">
-<p class="admonition-title">No commands</p>
-<p>If <code>commands</code> are not specified, <code>dstack</code> runs <code>image</code>’s entrypoint (or fails if none is set).</p>
+
+<p>If you want <code>dstack</code> to validate explicitly that a gateway is used, set the <a href="../../reference/dstack.yml/service/#gateway"><code>gateway</code></a> property in the service configuration to <code>true</code>. In this case, <code>dstack</code> raises an error during <code>dstack apply</code> if no default gateway has been created.</p>
+<p>You can also set the <code>gateway</code> property to the name of a specific gateway, if required.</p>
+<p>If you have created a <a href="../gateways/">gateway</a>, the service endpoint will be accessible at <code>https://&lt;run name&gt;.&lt;gateway domain&gt;/</code>:</p>
+<div class="termy">
+
+<div class="highlight"><pre><span></span><code>$<span class="w"> </span>curl<span class="w"> </span>https://llama31.example.com/v1/chat/completions<span class="w"> </span><span class="se">\</span>
+<span class="w">    </span>-H<span class="w"> </span><span class="s1">&#39;Content-Type: application/json&#39;</span><span class="w"> </span><span class="se">\</span>
+<span class="w">    </span>-H<span class="w"> </span><span class="s1">&#39;Authorization: Bearer &amp;lt;dstack token&amp;gt;&#39;</span><span class="w"> </span><span class="se">\</span>
+<span class="w">    </span>-d<span class="w"> </span><span class="s1">&#39;{</span>
+<span class="s1">        &quot;model&quot;: &quot;meta-llama/Meta-Llama-3.1-8B-Instruct&quot;,</span>
+<span class="s1">        &quot;messages&quot;: [</span>
+<span class="s1">            {</span>
+<span class="s1">                &quot;role&quot;: &quot;user&quot;,</span>
+<span class="s1">                &quot;content&quot;: &quot;Compose a poem that explains the concept of recursion in programming.&quot;</span>
+<span class="s1">            }</span>
+<span class="s1">        ]</span>
+<span class="s1">    }&#39;</span>
+</code></pre></div>
+
 </div>
+
 <h3 id="replicas-and-scaling">Replicas and scaling<a class="headerlink" href="#replicas-and-scaling" title="Permanent link">&para;</a></h3>
 <p>By default, <code>dstack</code> runs a single replica of the service.
 You can configure the number of replicas as well as the auto-scaling rules.</p>
@@ -4826,10 +4862,6 @@ <h3 id="replicas-and-scaling">Replicas and scaling<a class="headerlink" href="#r
 <summary>Disaggregated serving</summary>
 <p>Native support for disaggregated prefill and decode, allowing both worker types to run within a single service, is coming soon.</p>
 </details>
-<h3 id="model">Model<a class="headerlink" href="#model" title="Permanent link">&para;</a></h3>
-<p>If the service is running a chat model with an OpenAI-compatible interface,
-set the <a href="#model"><code>model</code></a> property to make the model accessible via <code>dstack</code>'s 
-global OpenAI-compatible endpoint, and also accessible via <code>dstack</code>'s UI.</p>
 <h3 id="authorization">Authorization<a class="headerlink" href="#authorization" title="Permanent link">&para;</a></h3>
 <p>By default, the service enables authorization, meaning the service endpoint requires a <code>dstack</code> user token.
 This can be disabled by setting <code>auth</code> to <code>false</code>.</p>
@@ -4992,6 +5024,9 @@ <h3 id="rate-limits">Rate limits<a class="headerlink" href="#rate-limits" title=
 </code></pre></div>
 </div>
 </details>
+<h3 id="model">Model<a class="headerlink" href="#model" title="Permanent link">&para;</a></h3>
+<p>If the service runs a model with an OpenAI-compatible interface, you can set the <a href="#model"><code>model</code></a> property to make the model accessible through <code>dstack</code>'s chat UI on the <code>Models</code> page.
+In this case, <code>dstack</code> will use the service's <code>/v1/chat/completions</code> endpoint.</p>
 <h3 id="resources">Resources<a class="headerlink" href="#resources" title="Permanent link">&para;</a></h3>
 <p>If you specify memory size, you can either specify an explicit size (e.g. <code>24GB</code>) or a 
 range (e.g. <code>24GB..</code>, or <code>24GB..80GB</code>, or <code>..80GB</code>).</p>
 
@@ -96,7 +96,7 @@ at `<dstack server URL>/proxy/services/<project name>/<run name>/`.
 <div class="termy">
 
 ```shell
-$ curl http://127.0.0.1:3000/proxy/models/main/chat/completions \
+$ curl http://127.0.0.1:3000/proxy/services/main/tt-inference-server/v1/chat/completions \
     -X POST \
     -H 'Authorization: Bearer &lt;dstack token&gt;' \
     -H 'Content-Type: application/json' \
 
@@ -4195,7 +4195,7 @@ <h2 id="services">Services<a class="headerlink" href="#services" title="Permanen
 at <code>&lt;dstack server URL&gt;/proxy/services/&lt;project name&gt;/&lt;run name&gt;/</code>.</p>
 <div class="termy">
 
-<div class="highlight"><pre><span></span><code>$<span class="w"> </span>curl<span class="w"> </span>http://127.0.0.1:3000/proxy/models/main/chat/completions<span class="w"> </span><span class="se">\</span>
+<div class="highlight"><pre><span></span><code>$<span class="w"> </span>curl<span class="w"> </span>http://127.0.0.1:3000/proxy/services/main/tt-inference-server/v1/chat/completions<span class="w"> </span><span class="se">\</span>
 <span class="w">    </span>-X<span class="w"> </span>POST<span class="w"> </span><span class="se">\</span>
 <span class="w">    </span>-H<span class="w"> </span><span class="s1">&#39;Authorization: Bearer &amp;lt;dstack token&amp;gt;&#39;</span><span class="w"> </span><span class="se">\</span>
 <span class="w">    </span>-H<span class="w"> </span><span class="s1">&#39;Content-Type: application/json&#39;</span><span class="w"> </span><span class="se">\</span>
 
@@ -78,13 +78,12 @@ Provisioning...
 ```
 </div>
 
-If no gateway is created, the model will be available via the OpenAI-compatible endpoint
-at `<dstack server URL>/proxy/models/<project name>/`.
+If no gateway is created, the service endpoint will be available at `<dstack server URL>/proxy/services/<project name>/<run name>/`.
 
 <div class="termy">
 
 ```shell
-$ curl http://127.0.0.1:3000/proxy/models/main/chat/completions \
+$ curl http://127.0.0.1:3000/proxy/services/main/serve-distill-deepseek/v1/chat/completions \
     -X POST \
     -H 'Authorization: Bearer &lt;dstack token&gt;' \
     -H 'Content-Type: application/json' \
@@ -106,8 +105,7 @@ $ curl http://127.0.0.1:3000/proxy/models/main/chat/completions \
 
 </div>
 
-When a [gateway](https://dstack.ai/docs/concepts/gateways/) is configured, the OpenAI-compatible endpoint
-is available at `https://gateway.<gateway domain>/`.
+When a [gateway](https://dstack.ai/docs/concepts/gateways/) is configured, the service endpoint will be available at `https://serve-distill-deepseek.<gateway domain>/`.
 
 ## Source code
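
For reference, the two endpoint patterns this change documents can be summarized in a short sketch. The helper `service_base_url` and its parameter names are hypothetical and not part of `dstack`. It only assembles the URL patterns quoted in the diff: `<dstack server URL>/proxy/services/<project name>/<run name>/` without a gateway, and `https://<run name>.<gateway domain>/` with one.

```python
from typing import Optional


def service_base_url(
    run_name: str,
    project_name: str,
    server_url: str = "http://localhost:3000",
    gateway_domain: Optional[str] = None,
) -> str:
    """Build a service endpoint URL (hypothetical helper, not a dstack API)."""
    if gateway_domain:
        # With a gateway: https://<run name>.<gateway domain>/
        return f"https://{run_name}.{gateway_domain}/"
    # Without a gateway:
    # <dstack server URL>/proxy/services/<project name>/<run name>/
    return f"{server_url.rstrip('/')}/proxy/services/{project_name}/{run_name}/"


# Values taken from the curl examples in the docs above.
print(service_base_url("llama31", "main"))
print(service_base_url("llama31", "main", gateway_domain="example.com"))
```

An OpenAI-compatible client would then append `v1` to the returned base URL, as the updated `base_url` examples in the diff do.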