Docs/metrics export #263 — Merged (+269 −0)
Commits (18):
- `b5f1e31` docs: add metrics export integration guide (Hkhan161)
- `eb3b72b` Prettified Code! (Hkhan161)
- `9d76608` docs: address all feedback - how it works, clearer instructions, tabs… (Hkhan161)
- `84860a9` Prettified Code! (Hkhan161)
- `f381ab5` docs: fix metric names to match actual implementation (Hkhan161)
- `a7ede5c` Prettified Code! (Hkhan161)
- `04ea1c3` docs: clarify auth headers come from Step 1 (Hkhan161)
- `f3ce178` Prettified Code! (Hkhan161)
- `d798195` docs: fold verify step into troubleshooting note (Hkhan161)
- `43b915d` docs: replace inline API reference with link to API docs (Hkhan161)
- `ac20917` Prettified Code! (Hkhan161)
- `e85b9cd` docs: remove broken API reference link (Hkhan161)
- `e8caf84` Prettified Code! (Hkhan161)
- `425a098` docs: move metrics reference to top of page (Hkhan161)
- `a94391b` docs: add Prometheus naming note, troubleshooting section, improved G… (Hkhan161)
- `0a5c6a5` docs: add histogram naming note before Grafana PromQL examples (Hkhan161)
- `2651a16` Prettified Code! (Hkhan161)
- `592b488` docs: prevent metric name wrapping in tables (Hkhan161)
---
title: Exporting Metrics to Monitoring Platforms
description: Export your application metrics to any OTLP-compatible observability platform including Grafana Cloud, Datadog, Prometheus, New Relic, and more
---

Export real-time resource and execution metrics from your Cerebrium applications to your existing observability platform. Monitor CPU, memory, and GPU usage, request counts, and latency across your applications. We support most major monitoring platforms that are OTLP-compatible.

## What metrics are exported?

### Resource Metrics

| Metric | Type | Unit | Description |
| --- | --- | --- | --- |
| <code style={{whiteSpace: 'nowrap'}}>cerebrium_cpu_utilization_cores</code> | Gauge | cores | CPU cores actively in use per app |
| <code style={{whiteSpace: 'nowrap'}}>cerebrium_memory_usage_bytes</code> | Gauge | bytes | Memory actively in use per app |
| <code style={{whiteSpace: 'nowrap'}}>cerebrium_gpu_memory_usage_bytes</code> | Gauge | bytes | GPU VRAM in use per app |
| <code style={{whiteSpace: 'nowrap'}}>cerebrium_gpu_compute_utilization_percent</code> | Gauge | percent | GPU compute utilization (0-100) per app |
| <code style={{whiteSpace: 'nowrap'}}>cerebrium_containers_running_count</code> | Gauge | count | Number of running containers per app |
| <code style={{whiteSpace: 'nowrap'}}>cerebrium_containers_ready_count</code> | Gauge | count | Number of ready containers per app |

### Execution Metrics

| Metric | Type | Unit | Description |
| --- | --- | --- | --- |
| <code style={{whiteSpace: 'nowrap'}}>cerebrium_run_execution_time_ms</code> | Histogram | ms | Time spent executing user code |
| <code style={{whiteSpace: 'nowrap'}}>cerebrium_run_queue_time_ms</code> | Histogram | ms | Time spent waiting in queue |
| <code style={{whiteSpace: 'nowrap'}}>cerebrium_run_coldstart_time_ms</code> | Histogram | ms | Time for container cold start |
| <code style={{whiteSpace: 'nowrap'}}>cerebrium_run_response_time_ms</code> | Histogram | ms | Total end-to-end response time |
| <code style={{whiteSpace: 'nowrap'}}>cerebrium_run_total</code> | Counter | — | Total run count |
| <code style={{whiteSpace: 'nowrap'}}>cerebrium_run_successes_total</code> | Counter | — | Successful run count |
| <code style={{whiteSpace: 'nowrap'}}>cerebrium_run_errors_total</code> | Counter | — | Failed run count |

<Note>
  **Prometheus metric name mapping:** When metrics are ingested by Prometheus
  (including Grafana Cloud), OTLP automatically appends unit suffixes to metric
  names. Histogram metrics will appear with `_milliseconds` appended — for
  example, `cerebrium_run_execution_time_ms` becomes
  `cerebrium_run_execution_time_ms_milliseconds_bucket`, `_count`, and `_sum`.
  Counter metrics with the `_total` suffix remain unchanged. The example queries
  throughout this guide use the Prometheus-ingested names.
</Note>

### Labels

Every metric includes the following labels for filtering and grouping:

| Label | Description | Example |
| --- | --- | --- |
| `project_id` | Your Cerebrium project ID | `p-abc12345` |
| `app_id` | Full application identifier | `p-abc12345-my-model` |
| `app_name` | Human-readable app name | `my-model` |
| `region` | Deployment region | `us-east-1` |
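For example, these labels let you narrow a metric to a single app and region, or aggregate across a whole project (illustrative queries using the example label values from the table above; substitute your own IDs):

```promql
# Memory usage for one app in one region
cerebrium_memory_usage_bytes{app_name="my-model", region="us-east-1"}

# Total CPU cores in use per app across a project
sum by (app_name) (cerebrium_cpu_utilization_cores{project_id="p-abc12345"})
```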

## How it works

Cerebrium automatically pushes metrics from your applications to your monitoring platform every **60 seconds** using the [OpenTelemetry Protocol (OTLP)](https://opentelemetry.io/docs/specs/otlp/). You provide an OTLP endpoint and authentication credentials through the Cerebrium dashboard, and Cerebrium handles the rest — collecting resource usage and execution data, formatting it as OpenTelemetry metrics, and delivering it to your platform.

- Metrics are pushed every **60 seconds**
- Failed pushes are retried **3 times** with exponential backoff
- If pushes fail **10 consecutive times**, export is automatically paused to avoid noise (you can re-enable it at any time from the dashboard)
- Your credentials are stored encrypted and are never returned in API responses
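The push-and-retry behavior above can be sketched roughly as follows. This is a hypothetical illustration of the documented policy (retry three times with exponential backoff, pause after ten consecutive failures), not Cerebrium's actual exporter code; `send` stands in for whatever function performs the OTLP push.

```python
import time

def push_with_retry(send, payload, max_retries=3, base_delay=1.0):
    """Attempt one metrics push, retrying on failure with exponential backoff.

    Hypothetical sketch of the documented retry policy, not Cerebrium's
    actual implementation. `send` is any callable that raises on failure.
    """
    for attempt in range(max_retries + 1):
        try:
            send(payload)
            return True  # push succeeded
        except Exception:
            if attempt == max_retries:
                return False  # all retries exhausted; this push failed
            time.sleep(base_delay * 2 ** attempt)  # back off: 1s, 2s, 4s, ...

def export_loop(send, collect, interval=60.0, pause_after=10, base_delay=1.0):
    """Push metrics every `interval` seconds; stop (pause export) after
    `pause_after` consecutive failed pushes."""
    consecutive_failures = 0
    while consecutive_failures < pause_after:
        if push_with_retry(send, collect(), base_delay=base_delay):
            consecutive_failures = 0  # any success resets the counter
        else:
            consecutive_failures += 1
        time.sleep(interval)
    # Export is now paused; in the real system you would re-enable it
    # from the dashboard after fixing the underlying issue.
```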

### Supported destinations

- **Grafana Cloud** — Primary supported destination
- **Datadog** — Via OTLP endpoint
- **Prometheus** — Self-hosted with OTLP receiver enabled
- **Custom** — Any OTLP-compatible endpoint (New Relic, Honeycomb, etc.)

## Setup Guide

### Step 1: Get your platform credentials

Before heading to the Cerebrium dashboard, you'll need an OTLP endpoint and authentication credentials from your monitoring platform.

<Tabs>
<Tab title="Grafana Cloud">
1. Sign in to [Grafana Cloud](https://grafana.com)
2. Go to your stack → **Connections** → **Add new connection**
3. Search for **"OpenTelemetry"** and click **Configure**
4. Copy the **OTLP endpoint** — this will match your stack's region:
   - US: `https://otlp-gateway-prod-us-east-0.grafana.net/otlp`
   - EU: `https://otlp-gateway-prod-eu-west-0.grafana.net/otlp`
   - Other regions will show their specific URL on the configuration page
5. On the same page, generate an API token. Click **Generate now** and ensure the token has the **MetricsPublisher** role — this is a separate token from any Prometheus Remote Write tokens you may already have.
6. The page will show you an **Instance ID** and the generated token. Run the following in your terminal to create the Basic auth string:

```bash
echo -n "INSTANCE_ID:TOKEN" | base64
```

Copy the output — you'll paste it into the dashboard in the next step.
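If you'd rather not use the shell, the same Basic auth string can be produced in Python (equivalent to the `echo -n ... | base64` command above; `INSTANCE_ID:TOKEN` is a placeholder to replace with your own values):

```python
import base64

# Build the HTTP Basic auth string from your Grafana Instance ID and token.
# Replace INSTANCE_ID and TOKEN with the values shown on the Grafana page.
auth = base64.b64encode(b"INSTANCE_ID:TOKEN").decode("ascii")
print(f"Authorization: Basic {auth}")
```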

<Warning>
The API token **must** have the **MetricsPublisher** role. The default Prometheus Remote Write token will not work with the OTLP endpoint. If you're unsure, generate a new token from the OpenTelemetry configuration page — it will have the correct role by default.
</Warning>

</Tab>
<Tab title="Datadog">
1. Sign in to [Datadog](https://app.datadoghq.com)
2. Go to **Organization Settings** → **API Keys**
3. Create or copy an existing API key
4. Your OTLP endpoint depends on your [Datadog site](https://docs.datadoghq.com/getting_started/site/):

| Datadog Site | OTLP Endpoint |
| --- | --- |
| US1 (datadoghq.com) | `https://api.datadoghq.com/api/v2/otlp` |
| US3 (us3.datadoghq.com) | `https://api.us3.datadoghq.com/api/v2/otlp` |
| US5 (us5.datadoghq.com) | `https://api.us5.datadoghq.com/api/v2/otlp` |
| EU (datadoghq.eu) | `https://api.datadoghq.eu/api/v2/otlp` |
| AP1 (ap1.datadoghq.com) | `https://api.ap1.datadoghq.com/api/v2/otlp` |

You can find your site in your Datadog URL — for example, if you log in at `app.us3.datadoghq.com`, your site is US3.

Keep your API key and endpoint handy for the next step.

</Tab>
<Tab title="Self-hosted Prometheus">
1. Enable the OTLP receiver in your Prometheus config:
   - Add the `--enable-feature=otlp-write-receiver` flag
   - Or use an OpenTelemetry Collector as a sidecar
2. Your endpoint will be `http://YOUR_PROMETHEUS_HOST:4318` (this is the OTLP HTTP port — not `4317`, which is gRPC) — copy it for the next step
</Tab>
<Tab title="Custom OTLP">
Any platform that supports [OpenTelemetry OTLP over HTTP](https://opentelemetry.io/docs/specs/otlp/) will work, including New Relic, Honeycomb, Lightstep, and others.

1. Get the OTLP HTTP endpoint from your provider's documentation
2. Get the required authentication headers

**Common examples:**

| Platform | Auth Header Name | Auth Header Value |
| --- | --- | --- |
| New Relic | `api-key` | Your New Relic license key |
| Honeycomb | `x-honeycomb-team` | Your Honeycomb API key |
| Lightstep | `lightstep-access-token` | Your Lightstep token |

</Tab>
</Tabs>

### Step 2: Configure in the Cerebrium dashboard

1. In the [Cerebrium dashboard](https://dashboard.cerebrium.ai), go to your project → **Integrations** → **Metrics Export**
2. Paste your **OTLP endpoint** from Step 1
3. Add the **authentication headers** from Step 1:

<Tabs>
<Tab title="Grafana Cloud">
- **Header name:** `Authorization`
- **Header value:** `Basic YOUR_BASE64_STRING` (the output from the terminal command in Step 1)
</Tab>
<Tab title="Datadog">
- **Header name:** `DD-API-KEY`
- **Header value:** Your Datadog API key
</Tab>
<Tab title="Self-hosted Prometheus">
- **Header name:** `Authorization` (leave empty if auth is not enabled on your Prometheus)
- **Header value:** `Bearer your-token` (if auth is enabled)
</Tab>
<Tab title="Custom OTLP">
Add the authentication headers required by your platform. You can add multiple headers using the **Add Header** button.
</Tab>
</Tabs>

4. Click **Save & Enable**

Your metrics will start flowing within 60 seconds. The dashboard will show a green "Connected" status with the time of the last successful export.

If something doesn't look right, click **Test Connection** to verify that Cerebrium can reach your monitoring platform. You'll see a success or failure message with details to help you troubleshoot.

## Viewing Metrics

Once connected, metrics will appear in your monitoring platform within a minute or two (exact latency depends on your platform's ingestion pipeline).

<Tabs>
<Tab title="Grafana Cloud">
1. Go to your Grafana Cloud dashboard → **Explore**
2. Select your Prometheus data source — it will be named something like **grafanacloud-yourstack-prom** (you can find it under **Connections** → **Data sources** if you're unsure)
3. Search for metrics starting with `cerebrium_`

**Example queries:**

<Note>
Histogram metrics in Prometheus have `_milliseconds` appended by OTLP's unit suffix convention, so you'll see names like `cerebrium_run_execution_time_ms_milliseconds_bucket`. This is expected behavior — see the [metric name mapping note](#execution-metrics) above.
</Note>

```promql
# CPU usage by app
cerebrium_cpu_utilization_cores{project_id="YOUR_PROJECT_ID"}

# Memory for a specific app
cerebrium_memory_usage_bytes{app_name="my-model"}

# Container scaling over time
cerebrium_containers_running_count{project_id="YOUR_PROJECT_ID"}

# Request rate (requests per second over 5 minutes)
rate(cerebrium_run_total[5m])

# p99 execution latency
histogram_quantile(0.99, rate(cerebrium_run_execution_time_ms_milliseconds_bucket{app_name="my-model"}[5m]))

# p99 end-to-end response time
histogram_quantile(0.99, rate(cerebrium_run_response_time_ms_milliseconds_bucket{app_name="my-model"}[5m]))

# Error rate as a percentage
rate(cerebrium_run_errors_total{app_name="my-model"}[5m]) / rate(cerebrium_run_total{app_name="my-model"}[5m]) * 100

# Average cold start time
rate(cerebrium_run_coldstart_time_ms_milliseconds_sum{app_name="my-model"}[5m]) / rate(cerebrium_run_coldstart_time_ms_milliseconds_count{app_name="my-model"}[5m])
```

</Tab>
<Tab title="Datadog">
1. Go to **Metrics** → **Explorer** in your Datadog dashboard
2. Search for metrics starting with `cerebrium`
3. You can filter by `project_id`, `app_name`, and other labels using the "from" field
</Tab>
<Tab title="Prometheus">
Query your Prometheus instance directly. All Cerebrium metrics are prefixed with `cerebrium_`:

```promql
# List all Cerebrium metrics
{__name__=~"cerebrium_.*"}

# CPU usage across all apps
cerebrium_cpu_utilization_cores
```

</Tab>
</Tabs>

## Managing Metrics Export

You can manage your metrics export configuration from the dashboard at any time by going to **Integrations** → **Metrics Export**.

- **Disable export:** Toggle the switch off. Your configuration is preserved — you can re-enable export at any time without reconfiguring.
- **Update credentials:** Enter new authentication headers and click **Save Changes**. Useful when rotating API keys.
- **Change endpoint:** Update the OTLP endpoint field and click **Save Changes**.
- **Check status:** The dashboard shows whether export is connected, the time of the last successful export, and any error messages.

## Troubleshooting

### Metrics not appearing

1. **Check the dashboard status.** Go to **Integrations** → **Metrics Export** and look at the connection status. If it shows "Paused," export was automatically disabled after repeated failures — click **Re-enable** after fixing the issue.
2. **Run a connection test.** Click **Test Connection** on the dashboard. Common errors:
   - **401 Unauthorized / 403 Forbidden:** Your auth headers are wrong. For Grafana Cloud, make sure you're using a MetricsPublisher token (not a Prometheus Remote Write token). For Datadog, verify that your API key is active.
   - **404 Not Found:** The OTLP endpoint URL is incorrect. Double-check that the URL matches your platform and region.
   - **Connection timeout:** Your endpoint may be unreachable. For self-hosted Prometheus, confirm that the host is publicly accessible and port `4318` is open.
3. **Check your platform's data source.** In Grafana Cloud, make sure you're querying the correct Prometheus data source (not a Loki or Tempo source). In Datadog, check that your site region matches the endpoint you configured.

### Metrics appear but values look wrong

- **Histogram metrics have `_milliseconds` in the name.** This is normal — Prometheus appends unit suffixes from OTLP metadata. Use the full name (e.g., `cerebrium_run_execution_time_ms_milliseconds_bucket`) in your queries.
- **Container counts fluctuate during deploys.** This is expected — you may see temporary spikes in `cerebrium_containers_running_count` during rolling deployments as new containers start and old ones drain.
- **Gaps in metrics.** Short gaps (1-2 minutes) can occur during deployments or scaling events. If you see persistent gaps, check whether export was paused.

### Still stuck?

Reach out to [support@cerebrium.ai](mailto:support@cerebrium.ai) with your project ID and the error message from the dashboard — we can check the export logs on our side.