|
1099 | 1099 | </label> |
1100 | 1100 | <ul class="md-nav__list" data-md-component="toc" data-md-scrollfix> |
1101 | 1101 |
|
| 1102 | + <li class="md-nav__item"> |
| 1103 | + <a href="#ui" class="md-nav__link"> |
| 1104 | + <span class="md-ellipsis"> |
| 1105 | + |
| 1106 | + <span class="md-typeset"> |
| 1107 | + UI |
| 1108 | + </span> |
| 1109 | + |
| 1110 | + </span> |
| 1111 | + </a> |
| 1112 | + |
| 1113 | +</li> |
| 1114 | + |
| 1115 | + <li class="md-nav__item"> |
| 1116 | + <a href="#cli" class="md-nav__link"> |
| 1117 | + <span class="md-ellipsis"> |
| 1118 | + |
| 1119 | + <span class="md-typeset"> |
| 1120 | + CLI |
| 1121 | + </span> |
| 1122 | + |
| 1123 | + </span> |
| 1124 | + </a> |
| 1125 | + |
| 1126 | +</li> |
| 1127 | + |
1102 | 1128 | <li class="md-nav__item"> |
1103 | 1129 | <a href="#prometheus" class="md-nav__link"> |
1104 | 1130 | <span class="md-ellipsis"> |
|
3960 | 3986 | </label> |
3961 | 3987 | <ul class="md-nav__list" data-md-component="toc" data-md-scrollfix> |
3962 | 3988 |
|
| 3989 | + <li class="md-nav__item"> |
| 3990 | + <a href="#ui" class="md-nav__link"> |
| 3991 | + <span class="md-ellipsis"> |
| 3992 | + |
| 3993 | + <span class="md-typeset"> |
| 3994 | + UI |
| 3995 | + </span> |
| 3996 | + |
| 3997 | + </span> |
| 3998 | + </a> |
| 3999 | + |
| 4000 | +</li> |
| 4001 | + |
| 4002 | + <li class="md-nav__item"> |
| 4003 | + <a href="#cli" class="md-nav__link"> |
| 4004 | + <span class="md-ellipsis"> |
| 4005 | + |
| 4006 | + <span class="md-typeset"> |
| 4007 | + CLI |
| 4008 | + </span> |
| 4009 | + |
| 4010 | + </span> |
| 4011 | + </a> |
| 4012 | + |
| 4013 | +</li> |
| 4014 | + |
3963 | 4015 | <li class="md-nav__item"> |
3964 | 4016 | <a href="#prometheus" class="md-nav__link"> |
3965 | 4017 | <span class="md-ellipsis"> |
|
4101 | 4153 |
|
4102 | 4154 |
|
4103 | 4155 | <h1 id="metrics">Metrics<a class="headerlink" href="#metrics" title="Permanent link">¶</a></h1> |
| 4156 | +<p><code>dstack</code> automatically tracks essential metrics, which you can access via the CLI and UI. |
| 4157 | +You can also configure the <code>dstack</code> server to export metrics to Prometheus, which is required to access advanced metrics such as those from DCGM.</p> |
| 4158 | +<h2 id="ui">UI<a class="headerlink" href="#ui" title="Permanent link">¶</a></h2> |
| 4159 | +<p>To access metrics via the UI, open the page of the corresponding run or job and switch to the <code>Metrics</code> tab:</p> |
| 4160 | +<p><img alt="" src="https://dstack.ai/static-assets/static-assets/images/dstack-newsletter-metrics.png" width="800" /></p> |
| 4161 | +<p>This tab displays key CPU, memory, and GPU metrics collected during the last hour of the run or job.</p> |
| 4162 | +<h2 id="cli">CLI<a class="headerlink" href="#cli" title="Permanent link">¶</a></h2> |
| 4163 | +<p>As an alternative to the UI, you can track essential metrics in real time via the CLI. |
| 4164 | +The <code>dstack metrics</code> command displays the most recently tracked CPU, memory, and GPU metrics.</p> |
| 4165 | +<div class="termy"> |
| 4166 | + |
| 4167 | +<div class="highlight"><pre><span></span><code>dstack<span class="w"> </span>metrics<span class="w"> </span>gentle-mayfly-1 |
| 4168 | + |
| 4169 | +<span class="w"> </span>NAME<span class="w"> </span>STATUS<span class="w"> </span>CPU<span class="w"> </span>MEMORY<span class="w"> </span>GPU |
| 4170 | +<span class="w"> </span>gentle-mayfly-1<span class="w"> </span><span class="k">done</span><span class="w"> </span><span class="m">0</span>%<span class="w"> </span><span class="m">16</span>.27GB/2000GB<span class="w"> </span><span class="nv">gpu</span><span class="o">=</span><span class="m">0</span><span class="w"> </span><span class="nv">mem</span><span class="o">=</span><span class="m">72</span>.48GB/80GB<span class="w"> </span><span class="nv">util</span><span class="o">=</span><span class="m">0</span>% |
| 4171 | +<span class="w"> </span><span class="nv">gpu</span><span class="o">=</span><span class="m">1</span><span class="w"> </span><span class="nv">mem</span><span class="o">=</span><span class="m">64</span>.99GB/80GB<span class="w"> </span><span class="nv">util</span><span class="o">=</span><span class="m">0</span>% |
| 4172 | +<span class="w"> </span><span class="nv">gpu</span><span class="o">=</span><span class="m">2</span><span class="w"> </span><span class="nv">mem</span><span class="o">=</span>580MB/80GB<span class="w"> </span><span class="nv">util</span><span class="o">=</span><span class="m">0</span>% |
| 4173 | +<span class="w"> </span><span class="nv">gpu</span><span class="o">=</span><span class="m">3</span><span class="w"> </span><span class="nv">mem</span><span class="o">=</span>4MB/80GB<span class="w"> </span><span class="nv">util</span><span class="o">=</span><span class="m">0</span>% |
| 4174 | +<span class="w"> </span><span class="nv">gpu</span><span class="o">=</span><span class="m">4</span><span class="w"> </span><span class="nv">mem</span><span class="o">=</span>4MB/80GB<span class="w"> </span><span class="nv">util</span><span class="o">=</span><span class="m">0</span>% |
| 4175 | +<span class="w"> </span><span class="nv">gpu</span><span class="o">=</span><span class="m">5</span><span class="w"> </span><span class="nv">mem</span><span class="o">=</span>4MB/80GB<span class="w"> </span><span class="nv">util</span><span class="o">=</span><span class="m">0</span>% |
| 4176 | +<span class="w"> </span><span class="nv">gpu</span><span class="o">=</span><span class="m">6</span><span class="w"> </span><span class="nv">mem</span><span class="o">=</span>4MB/80GB<span class="w"> </span><span class="nv">util</span><span class="o">=</span><span class="m">0</span>% |
| 4177 | +<span class="w"> </span><span class="nv">gpu</span><span class="o">=</span><span class="m">7</span><span class="w"> </span><span class="nv">mem</span><span class="o">=</span>292MB/80GB<span class="w"> </span><span class="nv">util</span><span class="o">=</span><span class="m">0</span>% |
| 4178 | +</code></pre></div> |
| 4179 | + |
| 4180 | +</div> |
| 4181 | + |
4104 | 4182 | <h2 id="prometheus">Prometheus<a class="headerlink" href="#prometheus" title="Permanent link">¶</a></h2> |
4105 | | -<p>To collect and export fleet and run metrics to Prometheus, enable the |
4106 | | -<code>DSTACK_ENABLE_PROMETHEUS_METRICS</code> environment variable and configure Prometheus to fetch metrics from |
| 4183 | +<p>To enable exporting metrics to Prometheus, set the |
| 4184 | +<code>DSTACK_ENABLE_PROMETHEUS_METRICS</code> environment variable and configure Prometheus to scrape metrics from |
4107 | 4185 | <code>&lt;dstack server URL&gt;/metrics</code>.</p> |
| 4186 | +<p>Beyond the essential metrics available via the CLI and UI, <code>dstack</code> exports further metrics to Prometheus, including data on fleets, runs, and jobs, as well as DCGM metrics.</p> |
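<p>To check what the endpoint returns before pointing Prometheus at it, the response can be fetched and parsed by hand. The following is a minimal sketch, not part of <code>dstack</code>: it assumes the server is reachable without extra authentication at a placeholder URL, that the response uses the standard Prometheus text exposition format, and it uses simplified parsing (no escaped quotes or commas inside label values). The function names and the metric names in the usage comment are hypothetical.</p>

```python
from urllib.request import urlopen


def parse_prometheus_text(text):
    """Parse Prometheus text exposition format into (name, labels, value) tuples.

    Simplified on purpose: comment/HELP/TYPE lines are skipped, optional
    timestamps are ignored, and label values must not contain commas,
    closing braces, or escaped quotes.
    """
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if "{" in line:
            # Form: name{key="value",...} value [timestamp]
            name, rest = line.split("{", 1)
            label_str, tail = rest.rsplit("}", 1)
            labels = {}
            for pair in label_str.split(","):
                if pair:
                    key, val = pair.split("=", 1)
                    labels[key.strip()] = val.strip().strip('"')
            value = float(tail.split()[0])
        else:
            # Form: name value [timestamp]
            parts = line.split()
            name, labels, value = parts[0], {}, float(parts[1])
        samples.append((name, labels, value))
    return samples


def fetch_metrics(server_url):
    """Fetch <server_url>/metrics and return the parsed samples."""
    with urlopen(f"{server_url.rstrip('/')}/metrics") as resp:
        return parse_prometheus_text(resp.read().decode("utf-8"))


# Usage (assumption: replace the placeholder with your dstack server URL):
#   for name, labels, value in fetch_metrics("http://localhost:3000"):
#       print(name, labels, value)
```

<p>If the server requires a token or sits behind a proxy, the request would need the corresponding headers. For anything beyond a quick check, a maintained Prometheus client library is a safer choice than this hand-rolled parser.</p>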
4108 | 4187 | <details class="info"> |
4109 | 4188 | <summary>NVIDIA DCGM</summary> |
4110 | 4189 | <p>NVIDIA DCGM metrics are automatically collected for <code>aws</code>, <code>azure</code>, <code>gcp</code>, and <code>oci</code> backends, |
|