From b5f1e31301e9efe39d4b5cdf1c8ed92c804dc8d9 Mon Sep 17 00:00:00 2001 From: Harris Khan Date: Wed, 11 Feb 2026 19:29:59 -0500 Subject: [PATCH 01/18] docs: add metrics export integration guide Adds documentation for the metrics export feature: - Setup guides for Grafana Cloud, Datadog, Prometheus, and custom OTLP endpoints - Full metrics reference (resource + execution metrics with labels) - API reference for config, test, and enable/disable endpoints - Example Grafana queries - Metric name mapping (OTLP to Prometheus naming conventions) - Reliability details (retry, circuit breaker, credential security) --- cerebrium/integrations/metrics-export.mdx | 287 ++++++++++++++++++++++ docs.json | 7 + 2 files changed, 294 insertions(+) create mode 100644 cerebrium/integrations/metrics-export.mdx diff --git a/cerebrium/integrations/metrics-export.mdx b/cerebrium/integrations/metrics-export.mdx new file mode 100644 index 00000000..1acc06d3 --- /dev/null +++ b/cerebrium/integrations/metrics-export.mdx @@ -0,0 +1,287 @@ +--- +title: Metrics Export +description: Export your application metrics to Grafana Cloud, Datadog, Prometheus, or any OTLP-compatible platform +--- + +Export real-time resource and execution metrics from your Cerebrium applications to your existing observability platform. Monitor CPU, memory, GPU usage, request counts, and latency alongside your other services. + +## What metrics are exported? 
+ +### Resource Metrics + +| Metric | Type | Unit | Description | +|--------|------|------|-------------| +| `cerebrium_cpu_utilization_cores` | Gauge | cores | CPU cores actively in use per app | +| `cerebrium_memory_usage_bytes` | Gauge | bytes | Memory actively in use per app | +| `cerebrium_gpu_memory_usage_bytes` | Gauge | bytes | GPU VRAM in use per app | +| `cerebrium_gpu_compute_utilization_percent` | Gauge | percent | GPU compute utilization (0-100) per app | +| `cerebrium_containers_running_count` | Gauge | count | Number of running containers per app | +| `cerebrium_containers_ready_count` | Gauge | count | Number of ready containers per app | + +### Execution Metrics + +| Metric | Type | Unit | Description | +|--------|------|------|-------------| +| `cerebrium_run_execution_time_ms` | Histogram | ms | Time spent executing user code | +| `cerebrium_run_queue_time_ms` | Histogram | ms | Time spent waiting in queue | +| `cerebrium_run_total_response_time_ms` | Histogram | ms | Total end-to-end response time | +| `cerebrium_requests_total` | Counter | count | Total request count | +| `cerebrium_requests_success` | Counter | count | Successful requests (2xx) | +| `cerebrium_requests_errors` | Counter | count | Failed requests (4xx/5xx) | + +### Labels + +Every metric includes the following labels for filtering and grouping: + +| Label | Description | Example | +|-------|-------------|---------| +| `project_id` | Your Cerebrium project ID | `p-abc12345` | +| `app_id` | Full application identifier | `p-abc12345-my-model` | +| `app_name` | Human-readable app name | `my-model` | +| `region` | Deployment region | `us-east-1` | +| `environment` | Deployment environment | `prod` | + +## Supported Destinations + +Metrics are exported using the [OpenTelemetry Protocol (OTLP)](https://opentelemetry.io/docs/specs/otlp/), which is supported by most observability platforms: + +- **Grafana Cloud** — Primary supported destination +- **Datadog** — Via OTLP endpoint +- 
**Prometheus** — Self-hosted with OTLP receiver enabled +- **Custom** — Any OTLP-compatible endpoint (New Relic, Honeycomb, etc.) + +## Setup Guide + +### Prerequisites + +- A Cerebrium project with deployed apps +- Your Cerebrium API key +- An account with your chosen observability platform + +### Step 1: Get your destination credentials + + + + 1. Sign in to [Grafana Cloud](https://grafana.com) + 2. Go to your stack → **Connections** → **Add new connection** + 3. Search for **"OpenTelemetry"** and click **Configure** + 4. Copy the **OTLP endpoint** (e.g., `https://otlp-gateway-prod-us-east-0.grafana.net/otlp`) + 5. Note your **Instance ID** (a number like `755366`) + 6. Generate an API token with **MetricsPublisher** role + 7. Create the Basic auth string: + + ```bash + echo -n "YOUR_INSTANCE_ID:YOUR_TOKEN" | base64 + ``` + + **In the Cerebrium dashboard:** + - **OTLP Endpoint:** `https://otlp-gateway-prod-us-east-0.grafana.net/otlp` + - **Auth Header Name:** `Authorization` + - **Auth Header Value:** `Basic YOUR_BASE64_STRING` + + + Make sure the API token has the **MetricsPublisher** role. The default Prometheus Remote Write token will not work with the OTLP endpoint. + + + + 1. Sign in to [Datadog](https://app.datadoghq.com) + 2. Go to **Organization Settings** → **API Keys** + 3. Create or copy an existing API key + 4. Your endpoint depends on your Datadog site: + - US1: `https://api.datadoghq.com/api/v2/otlp` + - EU: `https://api.datadoghq.eu/api/v2/otlp` + + **In the Cerebrium dashboard:** + - **OTLP Endpoint:** `https://api.datadoghq.com/api/v2/otlp` (US1) or `https://api.datadoghq.eu/api/v2/otlp` (EU) + - **Auth Header Name:** `DD-API-KEY` + - **Auth Header Value:** `your-datadog-api-key` + + + 1. Enable the OTLP receiver in your Prometheus config: + - Add `--enable-feature=otlp-write-receiver` flag + - Or use an OpenTelemetry Collector as a sidecar + 2. 
Your endpoint is: `http://YOUR_PROMETHEUS:4318` + + **In the Cerebrium dashboard:** + - **OTLP Endpoint:** `http://your-prometheus-host:4318` + - **Auth Header Name:** `Authorization` (if auth is enabled, otherwise leave empty) + - **Auth Header Value:** `Bearer your-token` (if auth is enabled) + + + Any platform that supports [OpenTelemetry OTLP over HTTP](https://opentelemetry.io/docs/specs/otlp/) will work, including New Relic, Honeycomb, Lightstep, and others. + + 1. Get the OTLP HTTP endpoint from your provider's documentation + 2. Get the required authentication headers + + **Common examples:** + + | Platform | Auth Header Name | Auth Header Value | + |----------|-----------------|-------------------| + | New Relic | `api-key` | Your New Relic license key | + | Honeycomb | `x-honeycomb-team` | Your Honeycomb API key | + | Lightstep | `lightstep-access-token` | Your Lightstep token | + + You can add multiple auth headers if your platform requires them using the **Add Header** button. + + + +### Step 2: Configure metrics export + +**Option A: Dashboard UI** + +Go to your project → **Integrations** → **Metrics Export**. Enter your OTLP endpoint and authentication headers from Step 1, then click **Save Changes**. + +**Option B: API** + +```bash +curl -X PUT "https://rest.cerebrium.ai/v2/metrics-export/YOUR_PROJECT_ID/config" \ + -H "Authorization: Bearer YOUR_CEREBRIUM_API_KEY" \ + -H "Content-Type: application/json" \ + -d '{ + "otlpEndpoint": "https://otlp-gateway-prod-us-east-0.grafana.net/otlp", + "authHeaders": { + "Authorization": "Basic YOUR_BASE64_CREDENTIALS" + } + }' +``` + +The `authHeaders` field is a map of header name → header value. These are stored encrypted and never returned in API responses. 
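Before testing, it's worth sanity-checking the Grafana Basic auth value from Step 1 — a malformed base64 string is a common cause of 401 errors. A quick local check (the instance ID and token below are made-up placeholders):

```shell
# Made-up placeholders -- use your own instance ID and MetricsPublisher token.
INSTANCE_ID="755366"
TOKEN="glc_your_token_here"

# Build the header value in the Basic <base64(instance:token)> form
# the Grafana OTLP gateway expects.
AUTH_VALUE="Basic $(printf '%s' "${INSTANCE_ID}:${TOKEN}" | base64)"
echo "$AUTH_VALUE"

# Sanity check: the base64 part must decode back to INSTANCE_ID:TOKEN.
printf '%s' "$AUTH_VALUE" | cut -d' ' -f2 | base64 -d
```

If the decoded output doesn't exactly match `INSTANCE_ID:TOKEN` (watch for a trailing newline from using `echo` instead of `echo -n`), regenerate the string before saving it in the dashboard.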
+ +### Step 3: Test the connection + +```bash +curl -X POST "https://rest.cerebrium.ai/v2/metrics-export/YOUR_PROJECT_ID/test" \ + -H "Authorization: Bearer YOUR_CEREBRIUM_API_KEY" +``` + +**Success response:** +```json +{ + "success": true, + "message": "Successfully connected to grafana (145ms)", + "latencyMs": 145 +} +``` + +**Failure response:** +```json +{ + "success": false, + "error": "Authentication failed (HTTP 401). Check your API key or credentials." +} +``` + +### Step 4: Enable export + +```bash +curl -X PUT "https://rest.cerebrium.ai/v2/metrics-export/YOUR_PROJECT_ID/config" \ + -H "Authorization: Bearer YOUR_CEREBRIUM_API_KEY" \ + -H "Content-Type: application/json" \ + -d '{"enabled": true}' +``` + +Metrics start flowing within 60 seconds. + +## Viewing Metrics + +### Grafana Cloud + +1. Go to your Grafana dashboard → **Explore** +2. Select the **grafanacloud-*-prom** data source +3. Search for metrics starting with `cerebrium_` + +**Example queries:** + +```promql +# CPU usage by app +cerebrium_cpu_utilization_cores{project_id="YOUR_PROJECT_ID"} + +# Memory for a specific app +cerebrium_memory_usage_bytes{app_name="my-model"} + +# Container scaling over time +cerebrium_containers_running_count{project_id="YOUR_PROJECT_ID"} + +# Request rate +rate(cerebrium_requests_total{app_name="my-model"}[5m]) + +# p99 latency +histogram_quantile(0.99, rate(cerebrium_run_total_response_time_ms_bucket{app_name="my-model"}[5m])) +``` + +### Datadog + +Metrics appear under `cerebrium.*` in the Metrics Explorer. You can filter by `project_id`, `app_name`, and other labels. 
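Delivery status can also be monitored from a script, e.g. as a periodic health check. A minimal sketch that alerts when the most recent push failed — the JSON here is a canned sample of the config-endpoint response; in practice it would come from `curl`:

```shell
# Canned sample of GET /v2/metrics-export/{project_id}/config output.
# In a real check, capture it instead with:
#   STATUS=$(curl -s "https://rest.cerebrium.ai/v2/metrics-export/YOUR_PROJECT_ID/config" \
#     -H "Authorization: Bearer YOUR_CEREBRIUM_API_KEY")
STATUS='{"enabled": true, "lastExportStatus": "success", "lastExportAt": "2026-02-11T22:44:56Z"}'

# Alert if the most recent push did not succeed.
if echo "$STATUS" | grep -q '"lastExportStatus": "success"'; then
  echo "metrics export healthy"
else
  echo "metrics export failing -- check credentials and endpoint" >&2
fi
```

The `grep` keeps the script dependency-free; if `jq` is available, `jq -r .lastExportStatus` is a more robust way to read the field.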
## Managing Export

### Check status

```bash
curl "https://rest.cerebrium.ai/v2/metrics-export/YOUR_PROJECT_ID/config" \
  -H "Authorization: Bearer YOUR_CEREBRIUM_API_KEY"
```

```json
{
  "enabled": true,
  "destination": "grafana",
  "otlpEndpoint": "https://otlp-gateway-prod-us-east-0.grafana.net/otlp",
  "authHeadersConfigured": true,
  "lastExportAt": "2026-02-11T22:44:56Z",
  "lastExportStatus": "success"
}
```

### Disable export

Disabling preserves your configuration. You can re-enable at any time without reconfiguring.

```bash
curl -X PUT "https://rest.cerebrium.ai/v2/metrics-export/YOUR_PROJECT_ID/config" \
  -H "Authorization: Bearer YOUR_CEREBRIUM_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"enabled": false}'
```

### Update credentials

```bash
curl -X PUT "https://rest.cerebrium.ai/v2/metrics-export/YOUR_PROJECT_ID/config" \
  -H "Authorization: Bearer YOUR_CEREBRIUM_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "authHeaders": {
      "Authorization": "Basic NEW_CREDENTIALS"
    }
  }'
```

## Reliability

- Metrics are pushed every **60 seconds**
- Failed exports are retried **3 times** with exponential backoff
- If exports fail **10 consecutive times**, export is automatically disabled (circuit breaker)
- Re-enabling export resets the failure counter
- Your credentials are stored encrypted in AWS Secrets Manager and are never returned in API responses

## Metric Name Mapping

Cerebrium uses OpenTelemetry naming conventions. 
When metrics arrive in Prometheus-compatible systems (like Grafana Cloud), dots are converted to underscores and units are appended: + +| Cerebrium metric name | Prometheus/Grafana name | +|----------------------|------------------------| +| `cerebrium.cpu.utilization` | `cerebrium_cpu_utilization_cores` | +| `cerebrium.memory.usage_bytes` | `cerebrium_memory_usage_bytes` | +| `cerebrium.gpu.memory.usage_bytes` | `cerebrium_gpu_memory_usage_bytes` | +| `cerebrium.gpu.compute.utilization` | `cerebrium_gpu_compute_utilization_percent` | +| `cerebrium.containers.running` | `cerebrium_containers_running_count` | +| `cerebrium.containers.ready` | `cerebrium_containers_ready_count` | + +## API Reference + +| Method | Endpoint | Description | +|--------|----------|-------------| +| `GET` | `/v2/metrics-export/{project_id}/config` | Get current configuration | +| `PUT` | `/v2/metrics-export/{project_id}/config` | Update configuration | +| `POST` | `/v2/metrics-export/{project_id}/test` | Test connection | diff --git a/docs.json b/docs.json index 6c94824f..e5048f84 100644 --- a/docs.json +++ b/docs.json @@ -88,6 +88,13 @@ "cerebrium/partner-services/rime" ] }, + { + "group": "Integrations", + "pages": [ + "cerebrium/integrations/metrics-export", + "cerebrium/integrations/vercel" + ] + }, { "group": "Other concepts", "pages": [ From eb3b72b65a25a4ebedede3fe3b69d078eee13f2d Mon Sep 17 00:00:00 2001 From: Hkhan161 Date: Thu, 12 Feb 2026 17:49:12 +0000 Subject: [PATCH 02/18] Prettified Code! 
--- cerebrium/integrations/metrics-export.mdx | 76 ++++++++++++----------- 1 file changed, 41 insertions(+), 35 deletions(-) diff --git a/cerebrium/integrations/metrics-export.mdx b/cerebrium/integrations/metrics-export.mdx index 1acc06d3..f2aa9449 100644 --- a/cerebrium/integrations/metrics-export.mdx +++ b/cerebrium/integrations/metrics-export.mdx @@ -9,37 +9,37 @@ Export real-time resource and execution metrics from your Cerebrium applications ### Resource Metrics -| Metric | Type | Unit | Description | -|--------|------|------|-------------| -| `cerebrium_cpu_utilization_cores` | Gauge | cores | CPU cores actively in use per app | -| `cerebrium_memory_usage_bytes` | Gauge | bytes | Memory actively in use per app | -| `cerebrium_gpu_memory_usage_bytes` | Gauge | bytes | GPU VRAM in use per app | +| Metric | Type | Unit | Description | +| ------------------------------------------- | ----- | ------- | --------------------------------------- | +| `cerebrium_cpu_utilization_cores` | Gauge | cores | CPU cores actively in use per app | +| `cerebrium_memory_usage_bytes` | Gauge | bytes | Memory actively in use per app | +| `cerebrium_gpu_memory_usage_bytes` | Gauge | bytes | GPU VRAM in use per app | | `cerebrium_gpu_compute_utilization_percent` | Gauge | percent | GPU compute utilization (0-100) per app | -| `cerebrium_containers_running_count` | Gauge | count | Number of running containers per app | -| `cerebrium_containers_ready_count` | Gauge | count | Number of ready containers per app | +| `cerebrium_containers_running_count` | Gauge | count | Number of running containers per app | +| `cerebrium_containers_ready_count` | Gauge | count | Number of ready containers per app | ### Execution Metrics -| Metric | Type | Unit | Description | -|--------|------|------|-------------| -| `cerebrium_run_execution_time_ms` | Histogram | ms | Time spent executing user code | -| `cerebrium_run_queue_time_ms` | Histogram | ms | Time spent waiting in queue | -| 
`cerebrium_run_total_response_time_ms` | Histogram | ms | Total end-to-end response time | -| `cerebrium_requests_total` | Counter | count | Total request count | -| `cerebrium_requests_success` | Counter | count | Successful requests (2xx) | -| `cerebrium_requests_errors` | Counter | count | Failed requests (4xx/5xx) | +| Metric | Type | Unit | Description | +| -------------------------------------- | --------- | ----- | ------------------------------ | +| `cerebrium_run_execution_time_ms` | Histogram | ms | Time spent executing user code | +| `cerebrium_run_queue_time_ms` | Histogram | ms | Time spent waiting in queue | +| `cerebrium_run_total_response_time_ms` | Histogram | ms | Total end-to-end response time | +| `cerebrium_requests_total` | Counter | count | Total request count | +| `cerebrium_requests_success` | Counter | count | Successful requests (2xx) | +| `cerebrium_requests_errors` | Counter | count | Failed requests (4xx/5xx) | ### Labels Every metric includes the following labels for filtering and grouping: -| Label | Description | Example | -|-------|-------------|---------| -| `project_id` | Your Cerebrium project ID | `p-abc12345` | -| `app_id` | Full application identifier | `p-abc12345-my-model` | -| `app_name` | Human-readable app name | `my-model` | -| `region` | Deployment region | `us-east-1` | -| `environment` | Deployment environment | `prod` | +| Label | Description | Example | +| ------------- | --------------------------- | --------------------- | +| `project_id` | Your Cerebrium project ID | `p-abc12345` | +| `app_id` | Full application identifier | `p-abc12345-my-model` | +| `app_name` | Human-readable app name | `my-model` | +| `region` | Deployment region | `us-east-1` | +| `environment` | Deployment environment | `prod` | ## Supported Destinations @@ -82,6 +82,7 @@ Metrics are exported using the [OpenTelemetry Protocol (OTLP)](https://opentelem Make sure the API token has the **MetricsPublisher** role. 
The default Prometheus Remote Write token will not work with the OTLP endpoint. + 1. Sign in to [Datadog](https://app.datadoghq.com) @@ -95,6 +96,7 @@ Metrics are exported using the [OpenTelemetry Protocol (OTLP)](https://opentelem - **OTLP Endpoint:** `https://api.datadoghq.com/api/v2/otlp` (US1) or `https://api.datadoghq.eu/api/v2/otlp` (EU) - **Auth Header Name:** `DD-API-KEY` - **Auth Header Value:** `your-datadog-api-key` + 1. Enable the OTLP receiver in your Prometheus config: @@ -106,6 +108,7 @@ Metrics are exported using the [OpenTelemetry Protocol (OTLP)](https://opentelem - **OTLP Endpoint:** `http://your-prometheus-host:4318` - **Auth Header Name:** `Authorization` (if auth is enabled, otherwise leave empty) - **Auth Header Value:** `Bearer your-token` (if auth is enabled) + Any platform that supports [OpenTelemetry OTLP over HTTP](https://opentelemetry.io/docs/specs/otlp/) will work, including New Relic, Honeycomb, Lightstep, and others. @@ -122,6 +125,7 @@ Metrics are exported using the [OpenTelemetry Protocol (OTLP)](https://opentelem | Lightstep | `lightstep-access-token` | Your Lightstep token | You can add multiple auth headers if your platform requires them using the **Add Header** button. + @@ -155,6 +159,7 @@ curl -X POST "https://rest.cerebrium.ai/v2/metrics-export/YOUR_PROJECT_ID/test" ``` **Success response:** + ```json { "success": true, @@ -164,6 +169,7 @@ curl -X POST "https://rest.cerebrium.ai/v2/metrics-export/YOUR_PROJECT_ID/test" ``` **Failure response:** + ```json { "success": false, @@ -187,7 +193,7 @@ Metrics start flowing within 60 seconds. ### Grafana Cloud 1. Go to your Grafana dashboard → **Explore** -2. Select the **grafanacloud-*-prom** data source +2. Select the **grafanacloud-\*-prom** data source 3. Search for metrics starting with `cerebrium_` **Example queries:** @@ -269,19 +275,19 @@ curl -X PUT "https://rest.cerebrium.ai/v2/metrics-export/YOUR_PROJECT_ID/config" Cerebrium uses OpenTelemetry naming conventions. 
When metrics arrive in Prometheus-compatible systems (like Grafana Cloud), dots are converted to underscores and units are appended: -| Cerebrium metric name | Prometheus/Grafana name | -|----------------------|------------------------| -| `cerebrium.cpu.utilization` | `cerebrium_cpu_utilization_cores` | -| `cerebrium.memory.usage_bytes` | `cerebrium_memory_usage_bytes` | -| `cerebrium.gpu.memory.usage_bytes` | `cerebrium_gpu_memory_usage_bytes` | +| Cerebrium metric name | Prometheus/Grafana name | +| ----------------------------------- | ------------------------------------------- | +| `cerebrium.cpu.utilization` | `cerebrium_cpu_utilization_cores` | +| `cerebrium.memory.usage_bytes` | `cerebrium_memory_usage_bytes` | +| `cerebrium.gpu.memory.usage_bytes` | `cerebrium_gpu_memory_usage_bytes` | | `cerebrium.gpu.compute.utilization` | `cerebrium_gpu_compute_utilization_percent` | -| `cerebrium.containers.running` | `cerebrium_containers_running_count` | -| `cerebrium.containers.ready` | `cerebrium_containers_ready_count` | +| `cerebrium.containers.running` | `cerebrium_containers_running_count` | +| `cerebrium.containers.ready` | `cerebrium_containers_ready_count` | ## API Reference -| Method | Endpoint | Description | -|--------|----------|-------------| -| `GET` | `/v2/metrics-export/{project_id}/config` | Get current configuration | -| `PUT` | `/v2/metrics-export/{project_id}/config` | Update configuration | -| `POST` | `/v2/metrics-export/{project_id}/test` | Test connection | +| Method | Endpoint | Description | +| ------ | ---------------------------------------- | ------------------------- | +| `GET` | `/v2/metrics-export/{project_id}/config` | Get current configuration | +| `PUT` | `/v2/metrics-export/{project_id}/config` | Update configuration | +| `POST` | `/v2/metrics-export/{project_id}/test` | Test connection | From 9d766086802118a15a2097f5fa60aa05d8cd561f Mon Sep 17 00:00:00 2001 From: Harris Khan Date: Fri, 13 Feb 2026 12:31:15 -0500 Subject: [PATCH 
03/18] docs: address all feedback - how it works, clearer instructions, tabs for viewing --- cerebrium/integrations/metrics-export.mdx | 240 +++++++++++----------- 1 file changed, 117 insertions(+), 123 deletions(-) diff --git a/cerebrium/integrations/metrics-export.mdx b/cerebrium/integrations/metrics-export.mdx index f2aa9449..ed437c96 100644 --- a/cerebrium/integrations/metrics-export.mdx +++ b/cerebrium/integrations/metrics-export.mdx @@ -1,63 +1,63 @@ --- -title: Metrics Export -description: Export your application metrics to Grafana Cloud, Datadog, Prometheus, or any OTLP-compatible platform +title: Exporting Metrics to Monitoring Platforms +description: Export your application metrics to any OTLP-compatible observability platform including Grafana Cloud, Datadog, Prometheus, New Relic, and more --- -Export real-time resource and execution metrics from your Cerebrium applications to your existing observability platform. Monitor CPU, memory, GPU usage, request counts, and latency alongside your other services. +Export real-time resource and execution metrics from your Cerebrium applications to your existing observability platform. Monitor CPU, memory, GPU usage, request counts, and latency alongside your other services. We support most major monitoring platforms that are OTLP-compatible. + +## How it works + +Cerebrium automatically pushes metrics from your applications to your monitoring platform every **60 seconds** using the [OpenTelemetry Protocol (OTLP)](https://opentelemetry.io/docs/specs/otlp/). You provide an OTLP endpoint and authentication credentials, and Cerebrium handles the rest — collecting resource usage and execution data, formatting it as OpenTelemetry metrics, and delivering it to your platform. 
+ +- Metrics are pushed every **60 seconds** +- Failed pushes are retried **3 times** with exponential backoff +- If pushes fail **10 consecutive times**, export is automatically paused to avoid noise (you can re-enable at any time) +- Your credentials are stored encrypted and are never returned in API responses + +### Supported destinations + +- **Grafana Cloud** — Primary supported destination +- **Datadog** — Via OTLP endpoint +- **Prometheus** — Self-hosted with OTLP receiver enabled +- **Custom** — Any OTLP-compatible endpoint (New Relic, Honeycomb, etc.) ## What metrics are exported? ### Resource Metrics -| Metric | Type | Unit | Description | -| ------------------------------------------- | ----- | ------- | --------------------------------------- | -| `cerebrium_cpu_utilization_cores` | Gauge | cores | CPU cores actively in use per app | -| `cerebrium_memory_usage_bytes` | Gauge | bytes | Memory actively in use per app | -| `cerebrium_gpu_memory_usage_bytes` | Gauge | bytes | GPU VRAM in use per app | +| Metric | Type | Unit | Description | +|--------|------|------|-------------| +| `cerebrium_cpu_utilization_cores` | Gauge | cores | CPU cores actively in use per app | +| `cerebrium_memory_usage_bytes` | Gauge | bytes | Memory actively in use per app | +| `cerebrium_gpu_memory_usage_bytes` | Gauge | bytes | GPU VRAM in use per app | | `cerebrium_gpu_compute_utilization_percent` | Gauge | percent | GPU compute utilization (0-100) per app | -| `cerebrium_containers_running_count` | Gauge | count | Number of running containers per app | -| `cerebrium_containers_ready_count` | Gauge | count | Number of ready containers per app | +| `cerebrium_containers_running_count` | Gauge | count | Number of running containers per app | +| `cerebrium_containers_ready_count` | Gauge | count | Number of ready containers per app | ### Execution Metrics -| Metric | Type | Unit | Description | -| -------------------------------------- | --------- | ----- | 
------------------------------ | -| `cerebrium_run_execution_time_ms` | Histogram | ms | Time spent executing user code | -| `cerebrium_run_queue_time_ms` | Histogram | ms | Time spent waiting in queue | -| `cerebrium_run_total_response_time_ms` | Histogram | ms | Total end-to-end response time | -| `cerebrium_requests_total` | Counter | count | Total request count | -| `cerebrium_requests_success` | Counter | count | Successful requests (2xx) | -| `cerebrium_requests_errors` | Counter | count | Failed requests (4xx/5xx) | +| Metric | Type | Unit | Description | +|--------|------|------|-------------| +| `cerebrium_run_execution_time_ms` | Histogram | ms | Time spent executing user code | +| `cerebrium_run_queue_time_ms` | Histogram | ms | Time spent waiting in queue | +| `cerebrium_run_total_response_time_ms` | Histogram | ms | Total end-to-end response time | +| `cerebrium_requests_total` | Counter | count | Total request count | +| `cerebrium_requests_success` | Counter | count | Successful requests (2xx) | +| `cerebrium_requests_errors` | Counter | count | Failed requests (4xx/5xx) | ### Labels Every metric includes the following labels for filtering and grouping: -| Label | Description | Example | -| ------------- | --------------------------- | --------------------- | -| `project_id` | Your Cerebrium project ID | `p-abc12345` | -| `app_id` | Full application identifier | `p-abc12345-my-model` | -| `app_name` | Human-readable app name | `my-model` | -| `region` | Deployment region | `us-east-1` | -| `environment` | Deployment environment | `prod` | - -## Supported Destinations - -Metrics are exported using the [OpenTelemetry Protocol (OTLP)](https://opentelemetry.io/docs/specs/otlp/), which is supported by most observability platforms: - -- **Grafana Cloud** — Primary supported destination -- **Datadog** — Via OTLP endpoint -- **Prometheus** — Self-hosted with OTLP receiver enabled -- **Custom** — Any OTLP-compatible endpoint (New Relic, Honeycomb, etc.) 
+| Label | Description | Example | +|-------|-------------|---------| +| `project_id` | Your Cerebrium project ID | `p-abc12345` | +| `app_id` | Full application identifier | `p-abc12345-my-model` | +| `app_name` | Human-readable app name | `my-model` | +| `region` | Deployment region | `us-east-1` | ## Setup Guide -### Prerequisites - -- A Cerebrium project with deployed apps -- Your Cerebrium API key -- An account with your chosen observability platform - ### Step 1: Get your destination credentials @@ -65,50 +65,59 @@ Metrics are exported using the [OpenTelemetry Protocol (OTLP)](https://opentelem 1. Sign in to [Grafana Cloud](https://grafana.com) 2. Go to your stack → **Connections** → **Add new connection** 3. Search for **"OpenTelemetry"** and click **Configure** - 4. Copy the **OTLP endpoint** (e.g., `https://otlp-gateway-prod-us-east-0.grafana.net/otlp`) - 5. Note your **Instance ID** (a number like `755366`) - 6. Generate an API token with **MetricsPublisher** role - 7. Create the Basic auth string: + 4. Copy the **OTLP endpoint** — this will match your stack's region: + - US: `https://otlp-gateway-prod-us-east-0.grafana.net/otlp` + - EU: `https://otlp-gateway-prod-eu-west-0.grafana.net/otlp` + - Other regions will show their specific URL on the configuration page + 5. On the same page, generate an API token with the **MetricsPublisher** role + 6. The page will show you an **Instance ID** and the generated token. Run the following in your terminal to create the Basic auth string: ```bash - echo -n "YOUR_INSTANCE_ID:YOUR_TOKEN" | base64 + echo -n "INSTANCE_ID:TOKEN" | base64 ``` + Copy the output — you'll need it in the next step. + **In the Cerebrium dashboard:** - - **OTLP Endpoint:** `https://otlp-gateway-prod-us-east-0.grafana.net/otlp` + - **OTLP Endpoint:** The endpoint URL from step 4 - **Auth Header Name:** `Authorization` - **Auth Header Value:** `Basic YOUR_BASE64_STRING` Make sure the API token has the **MetricsPublisher** role. 
The default Prometheus Remote Write token will not work with the OTLP endpoint. - 1. Sign in to [Datadog](https://app.datadoghq.com) 2. Go to **Organization Settings** → **API Keys** 3. Create or copy an existing API key - 4. Your endpoint depends on your Datadog site: - - US1: `https://api.datadoghq.com/api/v2/otlp` - - EU: `https://api.datadoghq.eu/api/v2/otlp` + 4. Your OTLP endpoint depends on your [Datadog site](https://docs.datadoghq.com/getting_started/site/): + + | Datadog Site | OTLP Endpoint | + |-------------|---------------| + | US1 (datadoghq.com) | `https://api.datadoghq.com/api/v2/otlp` | + | US3 (us3.datadoghq.com) | `https://api.us3.datadoghq.com/api/v2/otlp` | + | US5 (us5.datadoghq.com) | `https://api.us5.datadoghq.com/api/v2/otlp` | + | EU (datadoghq.eu) | `https://api.datadoghq.eu/api/v2/otlp` | + | AP1 (ap1.datadoghq.com) | `https://api.ap1.datadoghq.com/api/v2/otlp` | + + You can find your site in your Datadog URL — for example, if you log in at `app.us3.datadoghq.com`, your site is US3. **In the Cerebrium dashboard:** - - **OTLP Endpoint:** `https://api.datadoghq.com/api/v2/otlp` (US1) or `https://api.datadoghq.eu/api/v2/otlp` (EU) + - **OTLP Endpoint:** The endpoint matching your Datadog site from the table above - **Auth Header Name:** `DD-API-KEY` - - **Auth Header Value:** `your-datadog-api-key` - + - **Auth Header Value:** Your Datadog API key from step 3 1. Enable the OTLP receiver in your Prometheus config: - Add `--enable-feature=otlp-write-receiver` flag - Or use an OpenTelemetry Collector as a sidecar - 2. Your endpoint is: `http://YOUR_PROMETHEUS:4318` + 2. 
Your endpoint will be `http://YOUR_PROMETHEUS_HOST:4318` — copy this for the next step **In the Cerebrium dashboard:** - **OTLP Endpoint:** `http://your-prometheus-host:4318` - **Auth Header Name:** `Authorization` (if auth is enabled, otherwise leave empty) - **Auth Header Value:** `Bearer your-token` (if auth is enabled) - Any platform that supports [OpenTelemetry OTLP over HTTP](https://opentelemetry.io/docs/specs/otlp/) will work, including New Relic, Honeycomb, Lightstep, and others. @@ -125,23 +134,21 @@ Metrics are exported using the [OpenTelemetry Protocol (OTLP)](https://opentelem | Lightstep | `lightstep-access-token` | Your Lightstep token | You can add multiple auth headers if your platform requires them using the **Add Header** button. - -### Step 2: Configure metrics export - -**Option A: Dashboard UI** +### Step 2: Configure in the Cerebrium dashboard -Go to your project → **Integrations** → **Metrics Export**. Enter your OTLP endpoint and authentication headers from Step 1, then click **Save Changes**. +Go to your project → **Integrations** → **Metrics Export**. Enter the OTLP endpoint and authentication headers from Step 1, then click **Save & Enable**. -**Option B: API** +You can also configure via the API: ```bash curl -X PUT "https://rest.cerebrium.ai/v2/metrics-export/YOUR_PROJECT_ID/config" \ -H "Authorization: Bearer YOUR_CEREBRIUM_API_KEY" \ -H "Content-Type: application/json" \ -d '{ + "enabled": true, "otlpEndpoint": "https://otlp-gateway-prod-us-east-0.grafana.net/otlp", "authHeaders": { "Authorization": "Basic YOUR_BASE64_CREDENTIALS" @@ -151,15 +158,18 @@ curl -X PUT "https://rest.cerebrium.ai/v2/metrics-export/YOUR_PROJECT_ID/config" The `authHeaders` field is a map of header name → header value. These are stored encrypted and never returned in API responses. +You can find your Cerebrium API key in the [dashboard](https://dashboard.cerebrium.ai) under **Settings** → **API Keys**. 
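Because `authHeaders` is a plain map, platforms that need a custom header name (see the Custom tab above) use the same call — only the key and value change. A sketch of building the request body in shell; the endpoint and header values below are placeholders:

```shell
# Placeholders -- substitute your platform's real endpoint and credentials.
OTLP_ENDPOINT="https://otlp.example.com/v1/metrics"
AUTH_HEADER_NAME="api-key"
AUTH_HEADER_VALUE="YOUR_PLATFORM_KEY"

# Build the config payload; authHeaders maps header name -> header value.
BODY=$(printf '{"enabled": true, "otlpEndpoint": "%s", "authHeaders": {"%s": "%s"}}' \
  "$OTLP_ENDPOINT" "$AUTH_HEADER_NAME" "$AUTH_HEADER_VALUE")
echo "$BODY"

# Then submit it (project ID and Cerebrium key are placeholders too):
# curl -X PUT "https://rest.cerebrium.ai/v2/metrics-export/YOUR_PROJECT_ID/config" \
#   -H "Authorization: Bearer YOUR_CEREBRIUM_API_KEY" \
#   -H "Content-Type: application/json" \
#   -d "$BODY"
```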
+ ### Step 3: Test the connection +Click **Test Connection** in the dashboard, or via the API: + ```bash curl -X POST "https://rest.cerebrium.ai/v2/metrics-export/YOUR_PROJECT_ID/test" \ -H "Authorization: Bearer YOUR_CEREBRIUM_API_KEY" ``` **Success response:** - ```json { "success": true, @@ -169,7 +179,6 @@ curl -X POST "https://rest.cerebrium.ai/v2/metrics-export/YOUR_PROJECT_ID/test" ``` **Failure response:** - ```json { "success": false, @@ -177,51 +186,56 @@ curl -X POST "https://rest.cerebrium.ai/v2/metrics-export/YOUR_PROJECT_ID/test" } ``` -### Step 4: Enable export - -```bash -curl -X PUT "https://rest.cerebrium.ai/v2/metrics-export/YOUR_PROJECT_ID/config" \ - -H "Authorization: Bearer YOUR_CEREBRIUM_API_KEY" \ - -H "Content-Type: application/json" \ - -d '{"enabled": true}' -``` - Metrics start flowing within 60 seconds. ## Viewing Metrics -### Grafana Cloud - -1. Go to your Grafana dashboard → **Explore** -2. Select the **grafanacloud-\*-prom** data source -3. Search for metrics starting with `cerebrium_` + + + 1. Go to your Grafana Cloud dashboard → **Explore** + 2. Select your Prometheus data source — it will be named something like **grafanacloud-yourstack-prom** (you can find it under **Connections** → **Data sources** if you're unsure) + 3. Search for metrics starting with `cerebrium_` -**Example queries:** + **Example queries:** -```promql -# CPU usage by app -cerebrium_cpu_utilization_cores{project_id="YOUR_PROJECT_ID"} + ```promql + # CPU usage by app (replace with your project ID, e.g. 
p-9676c59f) + cerebrium_cpu_utilization_cores{project_id="p-9676c59f"} -# Memory for a specific app -cerebrium_memory_usage_bytes{app_name="my-model"} + # Memory for a specific app + cerebrium_memory_usage_bytes{app_name="my-model"} -# Container scaling over time -cerebrium_containers_running_count{project_id="YOUR_PROJECT_ID"} + # Container scaling over time + cerebrium_containers_running_count{project_id="p-9676c59f"} -# Request rate -rate(cerebrium_requests_total{app_name="my-model"}[5m]) + # Request rate (requests per second over 5 minutes) + rate(cerebrium_requests_total{app_name="my-model"}[5m]) -# p99 latency -histogram_quantile(0.99, rate(cerebrium_run_total_response_time_ms_bucket{app_name="my-model"}[5m])) -``` + # p99 latency + histogram_quantile(0.99, rate(cerebrium_run_total_response_time_ms_bucket{app_name="my-model"}[5m])) + ``` + + + 1. Go to **Metrics** → **Explorer** in your Datadog dashboard + 2. Search for metrics starting with `cerebrium` + 3. You can filter by `project_id`, `app_name`, and other labels using the "from" field + + + Query your Prometheus instance directly. All Cerebrium metrics are prefixed with `cerebrium_`: -### Datadog + ```promql + # List all Cerebrium metrics + {__name__=~"cerebrium_.*"} -Metrics appear under `cerebrium.*` in the Metrics Explorer. You can filter by `project_id`, `app_name`, and other labels. 
+ # CPU usage across all apps + cerebrium_cpu_utilization_cores + ``` + + -## Managing Export +## Managing Metrics Export -### Check status +### Check export status ```bash curl "https://rest.cerebrium.ai/v2/metrics-export/YOUR_PROJECT_ID/config" \ @@ -231,7 +245,6 @@ curl "https://rest.cerebrium.ai/v2/metrics-export/YOUR_PROJECT_ID/config" \ ```json { "enabled": true, - "destination": "grafana", "otlpEndpoint": "https://otlp-gateway-prod-us-east-0.grafana.net/otlp", "authHeadersConfigured": true, "lastExportAt": "2026-02-11T22:44:56Z", @@ -239,7 +252,7 @@ curl "https://rest.cerebrium.ai/v2/metrics-export/YOUR_PROJECT_ID/config" \ } ``` -### Disable export +### Disable metrics export Disabling preserves your configuration. You can re-enable at any time without reconfiguring. @@ -250,7 +263,9 @@ curl -X PUT "https://rest.cerebrium.ai/v2/metrics-export/YOUR_PROJECT_ID/config" -d '{"enabled": false}' ``` -### Update credentials +### Update OTLP credentials + +If you need to rotate or change your monitoring platform credentials: ```bash curl -X PUT "https://rest.cerebrium.ai/v2/metrics-export/YOUR_PROJECT_ID/config" \ @@ -263,31 +278,10 @@ curl -X PUT "https://rest.cerebrium.ai/v2/metrics-export/YOUR_PROJECT_ID/config" }' ``` -## Reliability - -- Metrics are pushed every **60 seconds** -- Failed exports are retried **3 times** with exponential backoff -- If exports fail **10 consecutive times**, export is automatically disabled (circuit breaker) -- Re-enabling export resets the failure counter -- Your credential are stored encrypted in AWS Secrets Manager and are never returned in API responses - -## Metric Name Mapping - -Cerebrium uses OpenTelemetry naming conventions. 
When metrics arrive in Prometheus-compatible systems (like Grafana Cloud), dots are converted to underscores and units are appended: - -| Cerebrium metric name | Prometheus/Grafana name | -| ----------------------------------- | ------------------------------------------- | -| `cerebrium.cpu.utilization` | `cerebrium_cpu_utilization_cores` | -| `cerebrium.memory.usage_bytes` | `cerebrium_memory_usage_bytes` | -| `cerebrium.gpu.memory.usage_bytes` | `cerebrium_gpu_memory_usage_bytes` | -| `cerebrium.gpu.compute.utilization` | `cerebrium_gpu_compute_utilization_percent` | -| `cerebrium.containers.running` | `cerebrium_containers_running_count` | -| `cerebrium.containers.ready` | `cerebrium_containers_ready_count` | - ## API Reference -| Method | Endpoint | Description | -| ------ | ---------------------------------------- | ------------------------- | -| `GET` | `/v2/metrics-export/{project_id}/config` | Get current configuration | -| `PUT` | `/v2/metrics-export/{project_id}/config` | Update configuration | -| `POST` | `/v2/metrics-export/{project_id}/test` | Test connection | +| Method | Endpoint | Description | +|--------|----------|-------------| +| `GET` | `/v2/metrics-export/{project_id}/config` | Get current export configuration | +| `PUT` | `/v2/metrics-export/{project_id}/config` | Update export configuration | +| `POST` | `/v2/metrics-export/{project_id}/test` | Test connection to your monitoring platform | From 84860a9e74eab81f6ed2888b2e44ce71d22f6cfe Mon Sep 17 00:00:00 2001 From: Hkhan161 Date: Fri, 13 Feb 2026 17:31:29 +0000 Subject: [PATCH 04/18] Prettified Code! 
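The name mapping above is mechanical: dots become underscores, and where the metric name does not already carry its unit, the unit is appended as a suffix. An illustrative sketch of that rule (not Cerebrium's actual exporter code):

```bash
# Sketch of the OTLP -> Prometheus rename: dots to underscores, unit appended as a suffix
otlp_to_prom() {
  name="$1"; unit="$2"
  prom=$(printf '%s' "$name" | tr '.' '_')
  if [ -n "$unit" ]; then
    prom="${prom}_${unit}"
  fi
  printf '%s\n' "$prom"
}

otlp_to_prom "cerebrium.cpu.utilization" "cores"    # cerebrium_cpu_utilization_cores
otlp_to_prom "cerebrium.containers.running" "count" # cerebrium_containers_running_count
otlp_to_prom "cerebrium.memory.usage_bytes" ""      # cerebrium_memory_usage_bytes
```

This is why the same metric shows up with dots in raw OTLP tooling but underscores in Grafana Cloud and Prometheus.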
--- cerebrium/integrations/metrics-export.mdx | 60 +++++++++++++---------- 1 file changed, 34 insertions(+), 26 deletions(-) diff --git a/cerebrium/integrations/metrics-export.mdx b/cerebrium/integrations/metrics-export.mdx index ed437c96..716f0f72 100644 --- a/cerebrium/integrations/metrics-export.mdx +++ b/cerebrium/integrations/metrics-export.mdx @@ -25,36 +25,36 @@ Cerebrium automatically pushes metrics from your applications to your monitoring ### Resource Metrics -| Metric | Type | Unit | Description | -|--------|------|------|-------------| -| `cerebrium_cpu_utilization_cores` | Gauge | cores | CPU cores actively in use per app | -| `cerebrium_memory_usage_bytes` | Gauge | bytes | Memory actively in use per app | -| `cerebrium_gpu_memory_usage_bytes` | Gauge | bytes | GPU VRAM in use per app | +| Metric | Type | Unit | Description | +| ------------------------------------------- | ----- | ------- | --------------------------------------- | +| `cerebrium_cpu_utilization_cores` | Gauge | cores | CPU cores actively in use per app | +| `cerebrium_memory_usage_bytes` | Gauge | bytes | Memory actively in use per app | +| `cerebrium_gpu_memory_usage_bytes` | Gauge | bytes | GPU VRAM in use per app | | `cerebrium_gpu_compute_utilization_percent` | Gauge | percent | GPU compute utilization (0-100) per app | -| `cerebrium_containers_running_count` | Gauge | count | Number of running containers per app | -| `cerebrium_containers_ready_count` | Gauge | count | Number of ready containers per app | +| `cerebrium_containers_running_count` | Gauge | count | Number of running containers per app | +| `cerebrium_containers_ready_count` | Gauge | count | Number of ready containers per app | ### Execution Metrics -| Metric | Type | Unit | Description | -|--------|------|------|-------------| -| `cerebrium_run_execution_time_ms` | Histogram | ms | Time spent executing user code | -| `cerebrium_run_queue_time_ms` | Histogram | ms | Time spent waiting in queue | -| 
`cerebrium_run_total_response_time_ms` | Histogram | ms | Total end-to-end response time | -| `cerebrium_requests_total` | Counter | count | Total request count | -| `cerebrium_requests_success` | Counter | count | Successful requests (2xx) | -| `cerebrium_requests_errors` | Counter | count | Failed requests (4xx/5xx) | +| Metric | Type | Unit | Description | +| -------------------------------------- | --------- | ----- | ------------------------------ | +| `cerebrium_run_execution_time_ms` | Histogram | ms | Time spent executing user code | +| `cerebrium_run_queue_time_ms` | Histogram | ms | Time spent waiting in queue | +| `cerebrium_run_total_response_time_ms` | Histogram | ms | Total end-to-end response time | +| `cerebrium_requests_total` | Counter | count | Total request count | +| `cerebrium_requests_success` | Counter | count | Successful requests (2xx) | +| `cerebrium_requests_errors` | Counter | count | Failed requests (4xx/5xx) | ### Labels Every metric includes the following labels for filtering and grouping: -| Label | Description | Example | -|-------|-------------|---------| -| `project_id` | Your Cerebrium project ID | `p-abc12345` | -| `app_id` | Full application identifier | `p-abc12345-my-model` | -| `app_name` | Human-readable app name | `my-model` | -| `region` | Deployment region | `us-east-1` | +| Label | Description | Example | +| ------------ | --------------------------- | --------------------- | +| `project_id` | Your Cerebrium project ID | `p-abc12345` | +| `app_id` | Full application identifier | `p-abc12345-my-model` | +| `app_name` | Human-readable app name | `my-model` | +| `region` | Deployment region | `us-east-1` | ## Setup Guide @@ -86,6 +86,7 @@ Every metric includes the following labels for filtering and grouping: Make sure the API token has the **MetricsPublisher** role. The default Prometheus Remote Write token will not work with the OTLP endpoint. + 1. 
Sign in to [Datadog](https://app.datadoghq.com) @@ -107,6 +108,7 @@ Every metric includes the following labels for filtering and grouping: - **OTLP Endpoint:** The endpoint matching your Datadog site from the table above - **Auth Header Name:** `DD-API-KEY` - **Auth Header Value:** Your Datadog API key from step 3 + 1. Enable the OTLP receiver in your Prometheus config: @@ -118,6 +120,7 @@ Every metric includes the following labels for filtering and grouping: - **OTLP Endpoint:** `http://your-prometheus-host:4318` - **Auth Header Name:** `Authorization` (if auth is enabled, otherwise leave empty) - **Auth Header Value:** `Bearer your-token` (if auth is enabled) + Any platform that supports [OpenTelemetry OTLP over HTTP](https://opentelemetry.io/docs/specs/otlp/) will work, including New Relic, Honeycomb, Lightstep, and others. @@ -134,6 +137,7 @@ Every metric includes the following labels for filtering and grouping: | Lightstep | `lightstep-access-token` | Your Lightstep token | You can add multiple auth headers if your platform requires them using the **Add Header** button. + @@ -170,6 +174,7 @@ curl -X POST "https://rest.cerebrium.ai/v2/metrics-export/YOUR_PROJECT_ID/test" ``` **Success response:** + ```json { "success": true, @@ -179,6 +184,7 @@ curl -X POST "https://rest.cerebrium.ai/v2/metrics-export/YOUR_PROJECT_ID/test" ``` **Failure response:** + ```json { "success": false, @@ -214,6 +220,7 @@ Metrics start flowing within 60 seconds. # p99 latency histogram_quantile(0.99, rate(cerebrium_run_total_response_time_ms_bucket{app_name="my-model"}[5m])) ``` + 1. Go to **Metrics** → **Explorer** in your Datadog dashboard @@ -230,6 +237,7 @@ Metrics start flowing within 60 seconds. 
# CPU usage across all apps cerebrium_cpu_utilization_cores ``` + @@ -280,8 +288,8 @@ curl -X PUT "https://rest.cerebrium.ai/v2/metrics-export/YOUR_PROJECT_ID/config" ## API Reference -| Method | Endpoint | Description | -|--------|----------|-------------| -| `GET` | `/v2/metrics-export/{project_id}/config` | Get current export configuration | -| `PUT` | `/v2/metrics-export/{project_id}/config` | Update export configuration | -| `POST` | `/v2/metrics-export/{project_id}/test` | Test connection to your monitoring platform | +| Method | Endpoint | Description | +| ------ | ---------------------------------------- | ------------------------------------------- | +| `GET` | `/v2/metrics-export/{project_id}/config` | Get current export configuration | +| `PUT` | `/v2/metrics-export/{project_id}/config` | Update export configuration | +| `POST` | `/v2/metrics-export/{project_id}/test` | Test connection to your monitoring platform | From f381ab54fa11ead80182c698aee643774fa09ed7 Mon Sep 17 00:00:00 2001 From: Harris Khan Date: Fri, 13 Feb 2026 12:36:28 -0500 Subject: [PATCH 05/18] docs: fix metric names to match actual implementation --- cerebrium/integrations/metrics-export.mdx | 255 ++++++++++------------ 1 file changed, 111 insertions(+), 144 deletions(-) diff --git a/cerebrium/integrations/metrics-export.mdx b/cerebrium/integrations/metrics-export.mdx index 716f0f72..e21bcab7 100644 --- a/cerebrium/integrations/metrics-export.mdx +++ b/cerebrium/integrations/metrics-export.mdx @@ -7,11 +7,11 @@ Export real-time resource and execution metrics from your Cerebrium applications ## How it works -Cerebrium automatically pushes metrics from your applications to your monitoring platform every **60 seconds** using the [OpenTelemetry Protocol (OTLP)](https://opentelemetry.io/docs/specs/otlp/). 
You provide an OTLP endpoint and authentication credentials, and Cerebrium handles the rest — collecting resource usage and execution data, formatting it as OpenTelemetry metrics, and delivering it to your platform. +Cerebrium automatically pushes metrics from your applications to your monitoring platform every **60 seconds** using the [OpenTelemetry Protocol (OTLP)](https://opentelemetry.io/docs/specs/otlp/). You provide an OTLP endpoint and authentication credentials through the Cerebrium dashboard, and Cerebrium handles the rest — collecting resource usage and execution data, formatting it as OpenTelemetry metrics, and delivering it to your platform. - Metrics are pushed every **60 seconds** - Failed pushes are retried **3 times** with exponential backoff -- If pushes fail **10 consecutive times**, export is automatically paused to avoid noise (you can re-enable at any time) +- If pushes fail **10 consecutive times**, export is automatically paused to avoid noise (you can re-enable at any time from the dashboard) - Your credentials are stored encrypted and are never returned in API responses ### Supported destinations @@ -25,40 +25,43 @@ Cerebrium automatically pushes metrics from your applications to your monitoring ### Resource Metrics -| Metric | Type | Unit | Description | -| ------------------------------------------- | ----- | ------- | --------------------------------------- | -| `cerebrium_cpu_utilization_cores` | Gauge | cores | CPU cores actively in use per app | -| `cerebrium_memory_usage_bytes` | Gauge | bytes | Memory actively in use per app | -| `cerebrium_gpu_memory_usage_bytes` | Gauge | bytes | GPU VRAM in use per app | +| Metric | Type | Unit | Description | +|--------|------|------|-------------| +| `cerebrium_cpu_utilization_cores` | Gauge | cores | CPU cores actively in use per app | +| `cerebrium_memory_usage_bytes` | Gauge | bytes | Memory actively in use per app | +| `cerebrium_gpu_memory_usage_bytes` | Gauge | bytes | GPU VRAM in use per 
app | | `cerebrium_gpu_compute_utilization_percent` | Gauge | percent | GPU compute utilization (0-100) per app | -| `cerebrium_containers_running_count` | Gauge | count | Number of running containers per app | -| `cerebrium_containers_ready_count` | Gauge | count | Number of ready containers per app | +| `cerebrium_containers_running_count` | Gauge | count | Number of running containers per app | +| `cerebrium_containers_ready_count` | Gauge | count | Number of ready containers per app | ### Execution Metrics -| Metric | Type | Unit | Description | -| -------------------------------------- | --------- | ----- | ------------------------------ | -| `cerebrium_run_execution_time_ms` | Histogram | ms | Time spent executing user code | -| `cerebrium_run_queue_time_ms` | Histogram | ms | Time spent waiting in queue | -| `cerebrium_run_total_response_time_ms` | Histogram | ms | Total end-to-end response time | -| `cerebrium_requests_total` | Counter | count | Total request count | -| `cerebrium_requests_success` | Counter | count | Successful requests (2xx) | -| `cerebrium_requests_errors` | Counter | count | Failed requests (4xx/5xx) | +| Metric | Type | Unit | Description | +|--------|------|------|-------------| +| `cerebrium_run_execution_time_ms` | Histogram | ms | Time spent executing user code | +| `cerebrium_run_queue_time_ms` | Histogram | ms | Time spent waiting in queue | +| `cerebrium_run_coldstart_time_ms` | Histogram | ms | Time for container cold start | +| `cerebrium_run_response_time_ms` | Histogram | ms | Total end-to-end response time | +| `cerebrium_run_total_total` | Counter | — | Total run count | +| `cerebrium_run_successes_total` | Counter | — | Successful run count | +| `cerebrium_run_errors_total` | Counter | — | Failed run count | ### Labels Every metric includes the following labels for filtering and grouping: -| Label | Description | Example | -| ------------ | --------------------------- | --------------------- | -| `project_id` | Your 
Cerebrium project ID | `p-abc12345` | -| `app_id` | Full application identifier | `p-abc12345-my-model` | -| `app_name` | Human-readable app name | `my-model` | -| `region` | Deployment region | `us-east-1` | +| Label | Description | Example | +|-------|-------------|---------| +| `project_id` | Your Cerebrium project ID | `p-abc12345` | +| `app_id` | Full application identifier | `p-abc12345-my-model` | +| `app_name` | Human-readable app name | `my-model` | +| `region` | Deployment region | `us-east-1` | ## Setup Guide -### Step 1: Get your destination credentials +### Step 1: Get your platform credentials + +Before heading to the Cerebrium dashboard, you'll need an OTLP endpoint and authentication credentials from your monitoring platform. @@ -76,17 +79,11 @@ Every metric includes the following labels for filtering and grouping: echo -n "INSTANCE_ID:TOKEN" | base64 ``` - Copy the output — you'll need it in the next step. - - **In the Cerebrium dashboard:** - - **OTLP Endpoint:** The endpoint URL from step 4 - - **Auth Header Name:** `Authorization` - - **Auth Header Value:** `Basic YOUR_BASE64_STRING` + Copy the output — you'll paste it in the dashboard in the next step. Make sure the API token has the **MetricsPublisher** role. The default Prometheus Remote Write token will not work with the OTLP endpoint. - 1. Sign in to [Datadog](https://app.datadoghq.com) @@ -104,23 +101,13 @@ Every metric includes the following labels for filtering and grouping: You can find your site in your Datadog URL — for example, if you log in at `app.us3.datadoghq.com`, your site is US3. - **In the Cerebrium dashboard:** - - **OTLP Endpoint:** The endpoint matching your Datadog site from the table above - - **Auth Header Name:** `DD-API-KEY` - - **Auth Header Value:** Your Datadog API key from step 3 - + Keep your API key and endpoint handy for the next step. 1. 
Enable the OTLP receiver in your Prometheus config: - Add `--enable-feature=otlp-write-receiver` flag - Or use an OpenTelemetry Collector as a sidecar 2. Your endpoint will be `http://YOUR_PROMETHEUS_HOST:4318` — copy this for the next step - - **In the Cerebrium dashboard:** - - **OTLP Endpoint:** `http://your-prometheus-host:4318` - - **Auth Header Name:** `Authorization` (if auth is enabled, otherwise leave empty) - - **Auth Header Value:** `Bearer your-token` (if auth is enabled) - Any platform that supports [OpenTelemetry OTLP over HTTP](https://opentelemetry.io/docs/specs/otlp/) will work, including New Relic, Honeycomb, Lightstep, and others. @@ -135,67 +122,47 @@ Every metric includes the following labels for filtering and grouping: | New Relic | `api-key` | Your New Relic license key | | Honeycomb | `x-honeycomb-team` | Your Honeycomb API key | | Lightstep | `lightstep-access-token` | Your Lightstep token | - - You can add multiple auth headers if your platform requires them using the **Add Header** button. - ### Step 2: Configure in the Cerebrium dashboard -Go to your project → **Integrations** → **Metrics Export**. Enter the OTLP endpoint and authentication headers from Step 1, then click **Save & Enable**. - -You can also configure via the API: - -```bash -curl -X PUT "https://rest.cerebrium.ai/v2/metrics-export/YOUR_PROJECT_ID/config" \ - -H "Authorization: Bearer YOUR_CEREBRIUM_API_KEY" \ - -H "Content-Type: application/json" \ - -d '{ - "enabled": true, - "otlpEndpoint": "https://otlp-gateway-prod-us-east-0.grafana.net/otlp", - "authHeaders": { - "Authorization": "Basic YOUR_BASE64_CREDENTIALS" - } - }' -``` - -The `authHeaders` field is a map of header name → header value. These are stored encrypted and never returned in API responses. - -You can find your Cerebrium API key in the [dashboard](https://dashboard.cerebrium.ai) under **Settings** → **API Keys**. 
- -### Step 3: Test the connection - -Click **Test Connection** in the dashboard, or via the API: +1. In the [Cerebrium dashboard](https://dashboard.cerebrium.ai), go to your project → **Integrations** → **Metrics Export** +2. Paste your **OTLP endpoint** from Step 1 +3. Add your **authentication headers**: -```bash -curl -X POST "https://rest.cerebrium.ai/v2/metrics-export/YOUR_PROJECT_ID/test" \ - -H "Authorization: Bearer YOUR_CEREBRIUM_API_KEY" -``` + + + - **Header name:** `Authorization` + - **Header value:** `Basic YOUR_BASE64_STRING` (the output from the terminal command in Step 1) + + + - **Header name:** `DD-API-KEY` + - **Header value:** Your Datadog API key + + + - **Header name:** `Authorization` (if auth is enabled on your Prometheus, otherwise leave empty) + - **Header value:** `Bearer your-token` (if auth is enabled) + + + Add the authentication headers required by your platform. You can add multiple headers using the **Add Header** button. + + -**Success response:** +4. Click **Save & Enable** -```json -{ - "success": true, - "message": "Successfully connected to grafana (145ms)", - "latencyMs": 145 -} -``` +Your metrics will start flowing within 60 seconds. The dashboard will show a green "Connected" status with the time of the last successful export. -**Failure response:** +### Step 3: Verify the connection -```json -{ - "success": false, - "error": "Authentication failed (HTTP 401). Check your API key or credentials." -} -``` +Click **Test Connection** in the dashboard to verify Cerebrium can reach your monitoring platform. You'll see a success or failure message with details. -Metrics start flowing within 60 seconds. +If the test fails, double-check your endpoint URL and credentials from Step 1. ## Viewing Metrics +Once connected, metrics will appear in your monitoring platform within a minute. + 1. Go to your Grafana Cloud dashboard → **Explore** @@ -215,12 +182,14 @@ Metrics start flowing within 60 seconds. 
cerebrium_containers_running_count{project_id="p-9676c59f"} # Request rate (requests per second over 5 minutes) - rate(cerebrium_requests_total{app_name="my-model"}[5m]) + rate(cerebrium_run_total_total{app_name="my-model"}[5m]) - # p99 latency - histogram_quantile(0.99, rate(cerebrium_run_total_response_time_ms_bucket{app_name="my-model"}[5m])) - ``` + # p99 execution latency + histogram_quantile(0.99, rate(cerebrium_run_execution_time_ms_bucket{app_name="my-model"}[5m])) + # p99 end-to-end response time + histogram_quantile(0.99, rate(cerebrium_run_response_time_ms_bucket{app_name="my-model"}[5m])) + ``` 1. Go to **Metrics** → **Explorer** in your Datadog dashboard @@ -237,59 +206,57 @@ Metrics start flowing within 60 seconds. # CPU usage across all apps cerebrium_cpu_utilization_cores ``` - ## Managing Metrics Export -### Check export status - -```bash -curl "https://rest.cerebrium.ai/v2/metrics-export/YOUR_PROJECT_ID/config" \ - -H "Authorization: Bearer YOUR_CEREBRIUM_API_KEY" -``` - -```json -{ - "enabled": true, - "otlpEndpoint": "https://otlp-gateway-prod-us-east-0.grafana.net/otlp", - "authHeadersConfigured": true, - "lastExportAt": "2026-02-11T22:44:56Z", - "lastExportStatus": "success" -} -``` - -### Disable metrics export - -Disabling preserves your configuration. You can re-enable at any time without reconfiguring. 
- -```bash -curl -X PUT "https://rest.cerebrium.ai/v2/metrics-export/YOUR_PROJECT_ID/config" \ - -H "Authorization: Bearer YOUR_CEREBRIUM_API_KEY" \ - -H "Content-Type: application/json" \ - -d '{"enabled": false}' -``` - -### Update OTLP credentials - -If you need to rotate or change your monitoring platform credentials: - -```bash -curl -X PUT "https://rest.cerebrium.ai/v2/metrics-export/YOUR_PROJECT_ID/config" \ - -H "Authorization: Bearer YOUR_CEREBRIUM_API_KEY" \ - -H "Content-Type: application/json" \ - -d '{ - "authHeaders": { - "Authorization": "Basic NEW_CREDENTIALS" - } - }' -``` - -## API Reference - -| Method | Endpoint | Description | -| ------ | ---------------------------------------- | ------------------------------------------- | -| `GET` | `/v2/metrics-export/{project_id}/config` | Get current export configuration | -| `PUT` | `/v2/metrics-export/{project_id}/config` | Update export configuration | -| `POST` | `/v2/metrics-export/{project_id}/test` | Test connection to your monitoring platform | +You can manage your metrics export configuration from the dashboard at any time by going to **Integrations** → **Metrics Export**. + +- **Disable export:** Toggle the switch off. Your configuration is preserved — you can re-enable at any time without reconfiguring. +- **Update credentials:** Enter new authentication headers and click **Save Changes**. Useful when rotating API keys. +- **Change endpoint:** Update the OTLP endpoint field and click **Save Changes**. +- **Check status:** The dashboard shows whether export is connected, the time of the last successful export, and any error messages. + + + You can also manage metrics export programmatically. Find your Cerebrium API key in the [dashboard](https://dashboard.cerebrium.ai) under **Settings** → **API Keys**. 
+ + | Method | Endpoint | Description | + |--------|----------|-------------| + | `GET` | `/v2/metrics-export/{project_id}/config` | Get current export configuration | + | `PUT` | `/v2/metrics-export/{project_id}/config` | Update export configuration | + | `POST` | `/v2/metrics-export/{project_id}/test` | Test connection to your monitoring platform | + + **Enable with endpoint and credentials:** + + ```bash + curl -X PUT "https://rest.cerebrium.ai/v2/metrics-export/YOUR_PROJECT_ID/config" \ + -H "Authorization: Bearer YOUR_CEREBRIUM_API_KEY" \ + -H "Content-Type: application/json" \ + -d '{ + "enabled": true, + "otlpEndpoint": "https://otlp-gateway-prod-us-east-0.grafana.net/otlp", + "authHeaders": { + "Authorization": "Basic YOUR_BASE64_CREDENTIALS" + } + }' + ``` + + **Test connection:** + + ```bash + curl -X POST "https://rest.cerebrium.ai/v2/metrics-export/YOUR_PROJECT_ID/test" \ + -H "Authorization: Bearer YOUR_CEREBRIUM_API_KEY" + ``` + + **Disable export:** + + ```bash + curl -X PUT "https://rest.cerebrium.ai/v2/metrics-export/YOUR_PROJECT_ID/config" \ + -H "Authorization: Bearer YOUR_CEREBRIUM_API_KEY" \ + -H "Content-Type: application/json" \ + -d '{"enabled": false}' + ``` + + The `authHeaders` field is a map of header name → header value. These are stored encrypted and never returned in API responses. + From a7ede5cb1e3e7c6d96d35131837bc811be00ec2f Mon Sep 17 00:00:00 2001 From: Hkhan161 Date: Fri, 13 Feb 2026 17:36:45 +0000 Subject: [PATCH 06/18] Prettified Code! 
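The docs describe failed pushes being retried three times with exponential backoff before giving up. If you want the same behaviour around your own calls to the test endpoint, a generic sketch (the retry count and delays mirror the documented behaviour; the wrapped command is whatever call you need):

```bash
# Retry a command up to 3 times with exponential backoff (1s, 2s, 4s),
# mirroring the documented export retry behaviour.
with_retries() {
  max_retries=3
  delay=1
  attempt=0
  until "$@"; do
    attempt=$((attempt + 1))
    if [ "$attempt" -gt "$max_retries" ]; then
      echo "giving up after ${max_retries} retries" >&2
      return 1
    fi
    sleep "$delay"
    delay=$((delay * 2))
  done
}

# Example: poll the documented test endpoint until it succeeds
# with_retries curl -sf -X POST "https://rest.cerebrium.ai/v2/metrics-export/YOUR_PROJECT_ID/test" \
#   -H "Authorization: Bearer YOUR_CEREBRIUM_API_KEY"
```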
--- cerebrium/integrations/metrics-export.mdx | 141 ++++++++++++---------- 1 file changed, 74 insertions(+), 67 deletions(-) diff --git a/cerebrium/integrations/metrics-export.mdx b/cerebrium/integrations/metrics-export.mdx index e21bcab7..a8338723 100644 --- a/cerebrium/integrations/metrics-export.mdx +++ b/cerebrium/integrations/metrics-export.mdx @@ -25,37 +25,37 @@ Cerebrium automatically pushes metrics from your applications to your monitoring ### Resource Metrics -| Metric | Type | Unit | Description | -|--------|------|------|-------------| -| `cerebrium_cpu_utilization_cores` | Gauge | cores | CPU cores actively in use per app | -| `cerebrium_memory_usage_bytes` | Gauge | bytes | Memory actively in use per app | -| `cerebrium_gpu_memory_usage_bytes` | Gauge | bytes | GPU VRAM in use per app | +| Metric | Type | Unit | Description | +| ------------------------------------------- | ----- | ------- | --------------------------------------- | +| `cerebrium_cpu_utilization_cores` | Gauge | cores | CPU cores actively in use per app | +| `cerebrium_memory_usage_bytes` | Gauge | bytes | Memory actively in use per app | +| `cerebrium_gpu_memory_usage_bytes` | Gauge | bytes | GPU VRAM in use per app | | `cerebrium_gpu_compute_utilization_percent` | Gauge | percent | GPU compute utilization (0-100) per app | -| `cerebrium_containers_running_count` | Gauge | count | Number of running containers per app | -| `cerebrium_containers_ready_count` | Gauge | count | Number of ready containers per app | +| `cerebrium_containers_running_count` | Gauge | count | Number of running containers per app | +| `cerebrium_containers_ready_count` | Gauge | count | Number of ready containers per app | ### Execution Metrics -| Metric | Type | Unit | Description | -|--------|------|------|-------------| -| `cerebrium_run_execution_time_ms` | Histogram | ms | Time spent executing user code | -| `cerebrium_run_queue_time_ms` | Histogram | ms | Time spent waiting in queue | -| 
`cerebrium_run_coldstart_time_ms` | Histogram | ms | Time for container cold start | -| `cerebrium_run_response_time_ms` | Histogram | ms | Total end-to-end response time | -| `cerebrium_run_total_total` | Counter | — | Total run count | -| `cerebrium_run_successes_total` | Counter | — | Successful run count | -| `cerebrium_run_errors_total` | Counter | — | Failed run count | +| Metric | Type | Unit | Description | +| --------------------------------- | --------- | ---- | ------------------------------ | +| `cerebrium_run_execution_time_ms` | Histogram | ms | Time spent executing user code | +| `cerebrium_run_queue_time_ms` | Histogram | ms | Time spent waiting in queue | +| `cerebrium_run_coldstart_time_ms` | Histogram | ms | Time for container cold start | +| `cerebrium_run_response_time_ms` | Histogram | ms | Total end-to-end response time | +| `cerebrium_run_total_total` | Counter | — | Total run count | +| `cerebrium_run_successes_total` | Counter | — | Successful run count | +| `cerebrium_run_errors_total` | Counter | — | Failed run count | ### Labels Every metric includes the following labels for filtering and grouping: -| Label | Description | Example | -|-------|-------------|---------| -| `project_id` | Your Cerebrium project ID | `p-abc12345` | -| `app_id` | Full application identifier | `p-abc12345-my-model` | -| `app_name` | Human-readable app name | `my-model` | -| `region` | Deployment region | `us-east-1` | +| Label | Description | Example | +| ------------ | --------------------------- | --------------------- | +| `project_id` | Your Cerebrium project ID | `p-abc12345` | +| `app_id` | Full application identifier | `p-abc12345-my-model` | +| `app_name` | Human-readable app name | `my-model` | +| `region` | Deployment region | `us-east-1` | ## Setup Guide @@ -84,6 +84,7 @@ Before heading to the Cerebrium dashboard, you'll need an OTLP endpoint and auth Make sure the API token has the **MetricsPublisher** role. 
The default Prometheus Remote Write token will not work with the OTLP endpoint.
+
  1. Sign in to [Datadog](https://app.datadoghq.com)
@@ -102,6 +103,7 @@ Before heading to the Cerebrium dashboard, you'll need an OTLP endpoint and auth
  You can find your site in your Datadog URL — for example, if you log in at `app.us3.datadoghq.com`, your site is US3.
  Keep your API key and endpoint handy for the next step.
+
  1. Enable the OTLP receiver in your Prometheus config:
@@ -122,6 +124,7 @@ Before heading to the Cerebrium dashboard, you'll need an OTLP endpoint and auth
  | New Relic | `api-key`                | Your New Relic license key |
  | Honeycomb | `x-honeycomb-team`       | Your Honeycomb API key     |
  | Lightstep | `lightstep-access-token` | Your Lightstep token       |
+
@@ -133,19 +136,20 @@ Before heading to the Cerebrium dashboard, you'll need an OTLP endpoint and auth
-  - **Header name:** `Authorization`
-  - **Header value:** `Basic YOUR_BASE64_STRING` (the output from the terminal command in Step 1)
+  - **Header name:** `Authorization`
+  - **Header value:** `Basic YOUR_BASE64_STRING` (the output from the terminal command in Step 1)
-  - **Header name:** `DD-API-KEY`
-  - **Header value:** Your Datadog API key
+  - **Header name:** `DD-API-KEY`
+  - **Header value:** Your Datadog API key
-  - **Header name:** `Authorization` (if auth is enabled on your Prometheus, otherwise leave empty)
-  - **Header value:** `Bearer your-token` (if auth is enabled)
+  - **Header name:** `Authorization` (if auth is enabled on your Prometheus, otherwise leave empty)
+  - **Header value:** `Bearer your-token` (if auth is enabled)
-  Add the authentication headers required by your platform. You can add multiple headers using the **Add Header** button.
+  Add the authentication headers required by your platform. You can add
+  multiple headers using the **Add Header** button.
@@ -190,6 +194,7 @@ Once connected, metrics will appear in your monitoring platform within a minute.
# p99 end-to-end response time histogram_quantile(0.99, rate(cerebrium_run_response_time_ms_bucket{app_name="my-model"}[5m])) ``` + 1. Go to **Metrics** → **Explorer** in your Datadog dashboard @@ -206,6 +211,7 @@ Once connected, metrics will appear in your monitoring platform within a minute. # CPU usage across all apps cerebrium_cpu_utilization_cores ``` + @@ -221,42 +227,43 @@ You can manage your metrics export configuration from the dashboard at any time You can also manage metrics export programmatically. Find your Cerebrium API key in the [dashboard](https://dashboard.cerebrium.ai) under **Settings** → **API Keys**. - | Method | Endpoint | Description | - |--------|----------|-------------| - | `GET` | `/v2/metrics-export/{project_id}/config` | Get current export configuration | - | `PUT` | `/v2/metrics-export/{project_id}/config` | Update export configuration | - | `POST` | `/v2/metrics-export/{project_id}/test` | Test connection to your monitoring platform | - - **Enable with endpoint and credentials:** - - ```bash - curl -X PUT "https://rest.cerebrium.ai/v2/metrics-export/YOUR_PROJECT_ID/config" \ - -H "Authorization: Bearer YOUR_CEREBRIUM_API_KEY" \ - -H "Content-Type: application/json" \ - -d '{ - "enabled": true, - "otlpEndpoint": "https://otlp-gateway-prod-us-east-0.grafana.net/otlp", - "authHeaders": { - "Authorization": "Basic YOUR_BASE64_CREDENTIALS" - } - }' - ``` - - **Test connection:** - - ```bash - curl -X POST "https://rest.cerebrium.ai/v2/metrics-export/YOUR_PROJECT_ID/test" \ - -H "Authorization: Bearer YOUR_CEREBRIUM_API_KEY" - ``` - - **Disable export:** - - ```bash - curl -X PUT "https://rest.cerebrium.ai/v2/metrics-export/YOUR_PROJECT_ID/config" \ - -H "Authorization: Bearer YOUR_CEREBRIUM_API_KEY" \ - -H "Content-Type: application/json" \ - -d '{"enabled": false}' - ``` - - The `authHeaders` field is a map of header name → header value. These are stored encrypted and never returned in API responses. 
+| Method | Endpoint | Description | +| ------ | ---------------------------------------- | ------------------------------------------- | +| `GET` | `/v2/metrics-export/{project_id}/config` | Get current export configuration | +| `PUT` | `/v2/metrics-export/{project_id}/config` | Update export configuration | +| `POST` | `/v2/metrics-export/{project_id}/test` | Test connection to your monitoring platform | + +**Enable with endpoint and credentials:** + +```bash +curl -X PUT "https://rest.cerebrium.ai/v2/metrics-export/YOUR_PROJECT_ID/config" \ + -H "Authorization: Bearer YOUR_CEREBRIUM_API_KEY" \ + -H "Content-Type: application/json" \ + -d '{ + "enabled": true, + "otlpEndpoint": "https://otlp-gateway-prod-us-east-0.grafana.net/otlp", + "authHeaders": { + "Authorization": "Basic YOUR_BASE64_CREDENTIALS" + } + }' +``` + +**Test connection:** + +```bash +curl -X POST "https://rest.cerebrium.ai/v2/metrics-export/YOUR_PROJECT_ID/test" \ + -H "Authorization: Bearer YOUR_CEREBRIUM_API_KEY" +``` + +**Disable export:** + +```bash +curl -X PUT "https://rest.cerebrium.ai/v2/metrics-export/YOUR_PROJECT_ID/config" \ + -H "Authorization: Bearer YOUR_CEREBRIUM_API_KEY" \ + -H "Content-Type: application/json" \ + -d '{"enabled": false}' +``` + +The `authHeaders` field is a map of header name → header value. These are stored encrypted and never returned in API responses. 
+ From 04ea1c36188867f6a673a9d1a194ac8748770846 Mon Sep 17 00:00:00 2001 From: Harris Khan Date: Tue, 17 Feb 2026 14:52:49 -0500 Subject: [PATCH 07/18] docs: clarify auth headers come from Step 1 --- cerebrium/integrations/metrics-export.mdx | 171 +++++++++++----------- 1 file changed, 82 insertions(+), 89 deletions(-) diff --git a/cerebrium/integrations/metrics-export.mdx b/cerebrium/integrations/metrics-export.mdx index a8338723..6753e6ba 100644 --- a/cerebrium/integrations/metrics-export.mdx +++ b/cerebrium/integrations/metrics-export.mdx @@ -21,42 +21,6 @@ Cerebrium automatically pushes metrics from your applications to your monitoring - **Prometheus** — Self-hosted with OTLP receiver enabled - **Custom** — Any OTLP-compatible endpoint (New Relic, Honeycomb, etc.) -## What metrics are exported? - -### Resource Metrics - -| Metric | Type | Unit | Description | -| ------------------------------------------- | ----- | ------- | --------------------------------------- | -| `cerebrium_cpu_utilization_cores` | Gauge | cores | CPU cores actively in use per app | -| `cerebrium_memory_usage_bytes` | Gauge | bytes | Memory actively in use per app | -| `cerebrium_gpu_memory_usage_bytes` | Gauge | bytes | GPU VRAM in use per app | -| `cerebrium_gpu_compute_utilization_percent` | Gauge | percent | GPU compute utilization (0-100) per app | -| `cerebrium_containers_running_count` | Gauge | count | Number of running containers per app | -| `cerebrium_containers_ready_count` | Gauge | count | Number of ready containers per app | - -### Execution Metrics - -| Metric | Type | Unit | Description | -| --------------------------------- | --------- | ---- | ------------------------------ | -| `cerebrium_run_execution_time_ms` | Histogram | ms | Time spent executing user code | -| `cerebrium_run_queue_time_ms` | Histogram | ms | Time spent waiting in queue | -| `cerebrium_run_coldstart_time_ms` | Histogram | ms | Time for container cold start | -| `cerebrium_run_response_time_ms` | 
Histogram | ms | Total end-to-end response time | -| `cerebrium_run_total_total` | Counter | — | Total run count | -| `cerebrium_run_successes_total` | Counter | — | Successful run count | -| `cerebrium_run_errors_total` | Counter | — | Failed run count | - -### Labels - -Every metric includes the following labels for filtering and grouping: - -| Label | Description | Example | -| ------------ | --------------------------- | --------------------- | -| `project_id` | Your Cerebrium project ID | `p-abc12345` | -| `app_id` | Full application identifier | `p-abc12345-my-model` | -| `app_name` | Human-readable app name | `my-model` | -| `region` | Deployment region | `us-east-1` | - ## Setup Guide ### Step 1: Get your platform credentials @@ -84,7 +48,6 @@ Before heading to the Cerebrium dashboard, you'll need an OTLP endpoint and auth Make sure the API token has the **MetricsPublisher** role. The default Prometheus Remote Write token will not work with the OTLP endpoint. - 1. Sign in to [Datadog](https://app.datadoghq.com) @@ -103,7 +66,6 @@ Before heading to the Cerebrium dashboard, you'll need an OTLP endpoint and auth You can find your site in your Datadog URL — for example, if you log in at `app.us3.datadoghq.com`, your site is US3. Keep your API key and endpoint handy for the next step. - 1. Enable the OTLP receiver in your Prometheus config: @@ -124,7 +86,6 @@ Before heading to the Cerebrium dashboard, you'll need an OTLP endpoint and auth | New Relic | `api-key` | Your New Relic license key | | Honeycomb | `x-honeycomb-team` | Your Honeycomb API key | | Lightstep | `lightstep-access-token` | Your Lightstep token | - @@ -132,24 +93,23 @@ Before heading to the Cerebrium dashboard, you'll need an OTLP endpoint and auth 1. In the [Cerebrium dashboard](https://dashboard.cerebrium.ai), go to your project → **Integrations** → **Metrics Export** 2. Paste your **OTLP endpoint** from Step 1 -3. Add your **authentication headers**: +3. 
Add the **authentication headers** from Step 1: - - **Header name:** `Authorization` - **Header value:** `Basic - YOUR_BASE64_STRING` (the output from the terminal command in Step 1) + - **Header name:** `Authorization` + - **Header value:** `Basic YOUR_BASE64_STRING` (the output from the terminal command in Step 1) - - **Header name:** `DD-API-KEY` - **Header value:** Your Datadog API key + - **Header name:** `DD-API-KEY` + - **Header value:** Your Datadog API key - - **Header name:** `Authorization` (if auth is enabled on your Prometheus, - otherwise leave empty) - **Header value:** `Bearer your-token` (if auth is - enabled) + - **Header name:** `Authorization` (if auth is enabled on your Prometheus, otherwise leave empty) + - **Header value:** `Bearer your-token` (if auth is enabled) - Add the authentication headers required by your platform. You can add - multiple headers using the **Add Header** button. + Add the authentication headers required by your platform. You can add multiple headers using the **Add Header** button. @@ -163,6 +123,42 @@ Click **Test Connection** in the dashboard to verify Cerebrium can reach your mo If the test fails, double-check your endpoint URL and credentials from Step 1. +## What metrics are exported? 
+ +### Resource Metrics + +| Metric | Type | Unit | Description | +|--------|------|------|-------------| +| `cerebrium_cpu_utilization_cores` | Gauge | cores | CPU cores actively in use per app | +| `cerebrium_memory_usage_bytes` | Gauge | bytes | Memory actively in use per app | +| `cerebrium_gpu_memory_usage_bytes` | Gauge | bytes | GPU VRAM in use per app | +| `cerebrium_gpu_compute_utilization_percent` | Gauge | percent | GPU compute utilization (0-100) per app | +| `cerebrium_containers_running_count` | Gauge | count | Number of running containers per app | +| `cerebrium_containers_ready_count` | Gauge | count | Number of ready containers per app | + +### Execution Metrics + +| Metric | Type | Unit | Description | +|--------|------|------|-------------| +| `cerebrium_run_execution_time_ms` | Histogram | ms | Time spent executing user code | +| `cerebrium_run_queue_time_ms` | Histogram | ms | Time spent waiting in queue | +| `cerebrium_run_coldstart_time_ms` | Histogram | ms | Time for container cold start | +| `cerebrium_run_response_time_ms` | Histogram | ms | Total end-to-end response time | +| `cerebrium_run_total_total` | Counter | — | Total run count | +| `cerebrium_run_successes_total` | Counter | — | Successful run count | +| `cerebrium_run_errors_total` | Counter | — | Failed run count | + +### Labels + +Every metric includes the following labels for filtering and grouping: + +| Label | Description | Example | +|-------|-------------|---------| +| `project_id` | Your Cerebrium project ID | `p-abc12345` | +| `app_id` | Full application identifier | `p-abc12345-my-model` | +| `app_name` | Human-readable app name | `my-model` | +| `region` | Deployment region | `us-east-1` | + ## Viewing Metrics Once connected, metrics will appear in your monitoring platform within a minute. @@ -194,7 +190,6 @@ Once connected, metrics will appear in your monitoring platform within a minute. 
# p99 end-to-end response time histogram_quantile(0.99, rate(cerebrium_run_response_time_ms_bucket{app_name="my-model"}[5m])) ``` - 1. Go to **Metrics** → **Explorer** in your Datadog dashboard @@ -211,7 +206,6 @@ Once connected, metrics will appear in your monitoring platform within a minute. # CPU usage across all apps cerebrium_cpu_utilization_cores ``` - @@ -227,43 +221,42 @@ You can manage your metrics export configuration from the dashboard at any time You can also manage metrics export programmatically. Find your Cerebrium API key in the [dashboard](https://dashboard.cerebrium.ai) under **Settings** → **API Keys**. -| Method | Endpoint | Description | -| ------ | ---------------------------------------- | ------------------------------------------- | -| `GET` | `/v2/metrics-export/{project_id}/config` | Get current export configuration | -| `PUT` | `/v2/metrics-export/{project_id}/config` | Update export configuration | -| `POST` | `/v2/metrics-export/{project_id}/test` | Test connection to your monitoring platform | - -**Enable with endpoint and credentials:** - -```bash -curl -X PUT "https://rest.cerebrium.ai/v2/metrics-export/YOUR_PROJECT_ID/config" \ - -H "Authorization: Bearer YOUR_CEREBRIUM_API_KEY" \ - -H "Content-Type: application/json" \ - -d '{ - "enabled": true, - "otlpEndpoint": "https://otlp-gateway-prod-us-east-0.grafana.net/otlp", - "authHeaders": { - "Authorization": "Basic YOUR_BASE64_CREDENTIALS" - } - }' -``` - -**Test connection:** - -```bash -curl -X POST "https://rest.cerebrium.ai/v2/metrics-export/YOUR_PROJECT_ID/test" \ - -H "Authorization: Bearer YOUR_CEREBRIUM_API_KEY" -``` - -**Disable export:** - -```bash -curl -X PUT "https://rest.cerebrium.ai/v2/metrics-export/YOUR_PROJECT_ID/config" \ - -H "Authorization: Bearer YOUR_CEREBRIUM_API_KEY" \ - -H "Content-Type: application/json" \ - -d '{"enabled": false}' -``` - -The `authHeaders` field is a map of header name → header value. 
These are stored encrypted and never returned in API responses. - + | Method | Endpoint | Description | + |--------|----------|-------------| + | `GET` | `/v2/metrics-export/{project_id}/config` | Get current export configuration | + | `PUT` | `/v2/metrics-export/{project_id}/config` | Update export configuration | + | `POST` | `/v2/metrics-export/{project_id}/test` | Test connection to your monitoring platform | + + **Enable with endpoint and credentials:** + + ```bash + curl -X PUT "https://rest.cerebrium.ai/v2/metrics-export/YOUR_PROJECT_ID/config" \ + -H "Authorization: Bearer YOUR_CEREBRIUM_API_KEY" \ + -H "Content-Type: application/json" \ + -d '{ + "enabled": true, + "otlpEndpoint": "https://otlp-gateway-prod-us-east-0.grafana.net/otlp", + "authHeaders": { + "Authorization": "Basic YOUR_BASE64_CREDENTIALS" + } + }' + ``` + + **Test connection:** + + ```bash + curl -X POST "https://rest.cerebrium.ai/v2/metrics-export/YOUR_PROJECT_ID/test" \ + -H "Authorization: Bearer YOUR_CEREBRIUM_API_KEY" + ``` + + **Disable export:** + + ```bash + curl -X PUT "https://rest.cerebrium.ai/v2/metrics-export/YOUR_PROJECT_ID/config" \ + -H "Authorization: Bearer YOUR_CEREBRIUM_API_KEY" \ + -H "Content-Type: application/json" \ + -d '{"enabled": false}' + ``` + + The `authHeaders` field is a map of header name → header value. These are stored encrypted and never returned in API responses. From f3ce178ad19892b1f27412f8af6bb31e5c05d37a Mon Sep 17 00:00:00 2001 From: Hkhan161 Date: Tue, 17 Feb 2026 19:53:02 +0000 Subject: [PATCH 08/18] Prettified Code! 
--- cerebrium/integrations/metrics-export.mdx | 141 ++++++++++++---------- 1 file changed, 74 insertions(+), 67 deletions(-) diff --git a/cerebrium/integrations/metrics-export.mdx b/cerebrium/integrations/metrics-export.mdx index 6753e6ba..3dd96932 100644 --- a/cerebrium/integrations/metrics-export.mdx +++ b/cerebrium/integrations/metrics-export.mdx @@ -48,6 +48,7 @@ Before heading to the Cerebrium dashboard, you'll need an OTLP endpoint and auth Make sure the API token has the **MetricsPublisher** role. The default Prometheus Remote Write token will not work with the OTLP endpoint. + 1. Sign in to [Datadog](https://app.datadoghq.com) @@ -66,6 +67,7 @@ Before heading to the Cerebrium dashboard, you'll need an OTLP endpoint and auth You can find your site in your Datadog URL — for example, if you log in at `app.us3.datadoghq.com`, your site is US3. Keep your API key and endpoint handy for the next step. + 1. Enable the OTLP receiver in your Prometheus config: @@ -86,6 +88,7 @@ Before heading to the Cerebrium dashboard, you'll need an OTLP endpoint and auth | New Relic | `api-key` | Your New Relic license key | | Honeycomb | `x-honeycomb-team` | Your Honeycomb API key | | Lightstep | `lightstep-access-token` | Your Lightstep token | + @@ -97,19 +100,20 @@ Before heading to the Cerebrium dashboard, you'll need an OTLP endpoint and auth - - **Header name:** `Authorization` - - **Header value:** `Basic YOUR_BASE64_STRING` (the output from the terminal command in Step 1) + - **Header name:** `Authorization` - **Header value:** `Basic + YOUR_BASE64_STRING` (the output from the terminal command in Step 1) - - **Header name:** `DD-API-KEY` - - **Header value:** Your Datadog API key + - **Header name:** `DD-API-KEY` - **Header value:** Your Datadog API key - - **Header name:** `Authorization` (if auth is enabled on your Prometheus, otherwise leave empty) - - **Header value:** `Bearer your-token` (if auth is enabled) + - **Header name:** `Authorization` (if auth is enabled on 
your Prometheus, + otherwise leave empty) - **Header value:** `Bearer your-token` (if auth is + enabled) - Add the authentication headers required by your platform. You can add multiple headers using the **Add Header** button. + Add the authentication headers required by your platform. You can add + multiple headers using the **Add Header** button. @@ -127,37 +131,37 @@ If the test fails, double-check your endpoint URL and credentials from Step 1. ### Resource Metrics -| Metric | Type | Unit | Description | -|--------|------|------|-------------| -| `cerebrium_cpu_utilization_cores` | Gauge | cores | CPU cores actively in use per app | -| `cerebrium_memory_usage_bytes` | Gauge | bytes | Memory actively in use per app | -| `cerebrium_gpu_memory_usage_bytes` | Gauge | bytes | GPU VRAM in use per app | +| Metric | Type | Unit | Description | +| ------------------------------------------- | ----- | ------- | --------------------------------------- | +| `cerebrium_cpu_utilization_cores` | Gauge | cores | CPU cores actively in use per app | +| `cerebrium_memory_usage_bytes` | Gauge | bytes | Memory actively in use per app | +| `cerebrium_gpu_memory_usage_bytes` | Gauge | bytes | GPU VRAM in use per app | | `cerebrium_gpu_compute_utilization_percent` | Gauge | percent | GPU compute utilization (0-100) per app | -| `cerebrium_containers_running_count` | Gauge | count | Number of running containers per app | -| `cerebrium_containers_ready_count` | Gauge | count | Number of ready containers per app | +| `cerebrium_containers_running_count` | Gauge | count | Number of running containers per app | +| `cerebrium_containers_ready_count` | Gauge | count | Number of ready containers per app | ### Execution Metrics -| Metric | Type | Unit | Description | -|--------|------|------|-------------| -| `cerebrium_run_execution_time_ms` | Histogram | ms | Time spent executing user code | -| `cerebrium_run_queue_time_ms` | Histogram | ms | Time spent waiting in queue | -| 
`cerebrium_run_coldstart_time_ms` | Histogram | ms | Time for container cold start | -| `cerebrium_run_response_time_ms` | Histogram | ms | Total end-to-end response time | -| `cerebrium_run_total_total` | Counter | — | Total run count | -| `cerebrium_run_successes_total` | Counter | — | Successful run count | -| `cerebrium_run_errors_total` | Counter | — | Failed run count | +| Metric | Type | Unit | Description | +| --------------------------------- | --------- | ---- | ------------------------------ | +| `cerebrium_run_execution_time_ms` | Histogram | ms | Time spent executing user code | +| `cerebrium_run_queue_time_ms` | Histogram | ms | Time spent waiting in queue | +| `cerebrium_run_coldstart_time_ms` | Histogram | ms | Time for container cold start | +| `cerebrium_run_response_time_ms` | Histogram | ms | Total end-to-end response time | +| `cerebrium_run_total_total` | Counter | — | Total run count | +| `cerebrium_run_successes_total` | Counter | — | Successful run count | +| `cerebrium_run_errors_total` | Counter | — | Failed run count | ### Labels Every metric includes the following labels for filtering and grouping: -| Label | Description | Example | -|-------|-------------|---------| -| `project_id` | Your Cerebrium project ID | `p-abc12345` | -| `app_id` | Full application identifier | `p-abc12345-my-model` | -| `app_name` | Human-readable app name | `my-model` | -| `region` | Deployment region | `us-east-1` | +| Label | Description | Example | +| ------------ | --------------------------- | --------------------- | +| `project_id` | Your Cerebrium project ID | `p-abc12345` | +| `app_id` | Full application identifier | `p-abc12345-my-model` | +| `app_name` | Human-readable app name | `my-model` | +| `region` | Deployment region | `us-east-1` | ## Viewing Metrics @@ -190,6 +194,7 @@ Once connected, metrics will appear in your monitoring platform within a minute. 
# p99 end-to-end response time histogram_quantile(0.99, rate(cerebrium_run_response_time_ms_bucket{app_name="my-model"}[5m])) ``` + 1. Go to **Metrics** → **Explorer** in your Datadog dashboard @@ -206,6 +211,7 @@ Once connected, metrics will appear in your monitoring platform within a minute. # CPU usage across all apps cerebrium_cpu_utilization_cores ``` + @@ -221,42 +227,43 @@ You can manage your metrics export configuration from the dashboard at any time You can also manage metrics export programmatically. Find your Cerebrium API key in the [dashboard](https://dashboard.cerebrium.ai) under **Settings** → **API Keys**. - | Method | Endpoint | Description | - |--------|----------|-------------| - | `GET` | `/v2/metrics-export/{project_id}/config` | Get current export configuration | - | `PUT` | `/v2/metrics-export/{project_id}/config` | Update export configuration | - | `POST` | `/v2/metrics-export/{project_id}/test` | Test connection to your monitoring platform | - - **Enable with endpoint and credentials:** - - ```bash - curl -X PUT "https://rest.cerebrium.ai/v2/metrics-export/YOUR_PROJECT_ID/config" \ - -H "Authorization: Bearer YOUR_CEREBRIUM_API_KEY" \ - -H "Content-Type: application/json" \ - -d '{ - "enabled": true, - "otlpEndpoint": "https://otlp-gateway-prod-us-east-0.grafana.net/otlp", - "authHeaders": { - "Authorization": "Basic YOUR_BASE64_CREDENTIALS" - } - }' - ``` - - **Test connection:** - - ```bash - curl -X POST "https://rest.cerebrium.ai/v2/metrics-export/YOUR_PROJECT_ID/test" \ - -H "Authorization: Bearer YOUR_CEREBRIUM_API_KEY" - ``` - - **Disable export:** - - ```bash - curl -X PUT "https://rest.cerebrium.ai/v2/metrics-export/YOUR_PROJECT_ID/config" \ - -H "Authorization: Bearer YOUR_CEREBRIUM_API_KEY" \ - -H "Content-Type: application/json" \ - -d '{"enabled": false}' - ``` - - The `authHeaders` field is a map of header name → header value. These are stored encrypted and never returned in API responses. 
+| Method | Endpoint | Description | +| ------ | ---------------------------------------- | ------------------------------------------- | +| `GET` | `/v2/metrics-export/{project_id}/config` | Get current export configuration | +| `PUT` | `/v2/metrics-export/{project_id}/config` | Update export configuration | +| `POST` | `/v2/metrics-export/{project_id}/test` | Test connection to your monitoring platform | + +**Enable with endpoint and credentials:** + +```bash +curl -X PUT "https://rest.cerebrium.ai/v2/metrics-export/YOUR_PROJECT_ID/config" \ + -H "Authorization: Bearer YOUR_CEREBRIUM_API_KEY" \ + -H "Content-Type: application/json" \ + -d '{ + "enabled": true, + "otlpEndpoint": "https://otlp-gateway-prod-us-east-0.grafana.net/otlp", + "authHeaders": { + "Authorization": "Basic YOUR_BASE64_CREDENTIALS" + } + }' +``` + +**Test connection:** + +```bash +curl -X POST "https://rest.cerebrium.ai/v2/metrics-export/YOUR_PROJECT_ID/test" \ + -H "Authorization: Bearer YOUR_CEREBRIUM_API_KEY" +``` + +**Disable export:** + +```bash +curl -X PUT "https://rest.cerebrium.ai/v2/metrics-export/YOUR_PROJECT_ID/config" \ + -H "Authorization: Bearer YOUR_CEREBRIUM_API_KEY" \ + -H "Content-Type: application/json" \ + -d '{"enabled": false}' +``` + +The `authHeaders` field is a map of header name → header value. These are stored encrypted and never returned in API responses. 
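Because `authHeaders` values are stored encrypted and never returned, you can't read a value back from the API to check it — so it's worth verifying a `Basic` credential locally before submitting it. A minimal sketch (the encoded string below is a placeholder, not a real credential):

```shell
# Placeholder value — in practice, use the base64 string you generated in Step 1
BASIC_VALUE="MTIzNDU2OmdsY19leGFtcGxlX3Rva2Vu"

# Decode and confirm the "instance:token" shape before sending it to the API
DECODED=$(printf '%s' "$BASIC_VALUE" | base64 -d)
case "$DECODED" in
  *:*) echo "ok: decodes to id:token" ;;
  *)   echo "unexpected format: $DECODED" >&2; exit 1 ;;
esac
```

If the decoded value is missing the colon separator (or has stray whitespace), fix the encoding before calling the config endpoint.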
+ From d79819536dea3eb9c4af105e329836a33dd246fa Mon Sep 17 00:00:00 2001 From: Harris Khan Date: Tue, 17 Feb 2026 14:53:48 -0500 Subject: [PATCH 09/18] docs: fold verify step into troubleshooting note --- cerebrium/integrations/metrics-export.mdx | 6 +----- 1 file changed, 1 insertion(+), 5 deletions(-) diff --git a/cerebrium/integrations/metrics-export.mdx b/cerebrium/integrations/metrics-export.mdx index 3dd96932..5f25a1ed 100644 --- a/cerebrium/integrations/metrics-export.mdx +++ b/cerebrium/integrations/metrics-export.mdx @@ -121,11 +121,7 @@ Before heading to the Cerebrium dashboard, you'll need an OTLP endpoint and auth Your metrics will start flowing within 60 seconds. The dashboard will show a green "Connected" status with the time of the last successful export. -### Step 3: Verify the connection - -Click **Test Connection** in the dashboard to verify Cerebrium can reach your monitoring platform. You'll see a success or failure message with details. - -If the test fails, double-check your endpoint URL and credentials from Step 1. +If something doesn't look right, click **Test Connection** to verify Cerebrium can reach your monitoring platform. You'll see a success or failure message with details to help you troubleshoot. ## What metrics are exported? From 43b915d4665c9340ff7ab7e14559d5ab1ee8d9ae Mon Sep 17 00:00:00 2001 From: Harris Khan Date: Tue, 17 Feb 2026 14:55:37 -0500 Subject: [PATCH 10/18] docs: replace inline API reference with link to API docs --- cerebrium/integrations/metrics-export.mdx | 108 ++++++---------------- 1 file changed, 30 insertions(+), 78 deletions(-) diff --git a/cerebrium/integrations/metrics-export.mdx b/cerebrium/integrations/metrics-export.mdx index 5f25a1ed..4d03abd9 100644 --- a/cerebrium/integrations/metrics-export.mdx +++ b/cerebrium/integrations/metrics-export.mdx @@ -48,7 +48,6 @@ Before heading to the Cerebrium dashboard, you'll need an OTLP endpoint and auth Make sure the API token has the **MetricsPublisher** role. 
The default Prometheus Remote Write token will not work with the OTLP endpoint. - 1. Sign in to [Datadog](https://app.datadoghq.com) @@ -67,7 +66,6 @@ Before heading to the Cerebrium dashboard, you'll need an OTLP endpoint and auth You can find your site in your Datadog URL — for example, if you log in at `app.us3.datadoghq.com`, your site is US3. Keep your API key and endpoint handy for the next step. - 1. Enable the OTLP receiver in your Prometheus config: @@ -88,7 +86,6 @@ Before heading to the Cerebrium dashboard, you'll need an OTLP endpoint and auth | New Relic | `api-key` | Your New Relic license key | | Honeycomb | `x-honeycomb-team` | Your Honeycomb API key | | Lightstep | `lightstep-access-token` | Your Lightstep token | - @@ -100,20 +97,19 @@ Before heading to the Cerebrium dashboard, you'll need an OTLP endpoint and auth - - **Header name:** `Authorization` - **Header value:** `Basic - YOUR_BASE64_STRING` (the output from the terminal command in Step 1) + - **Header name:** `Authorization` + - **Header value:** `Basic YOUR_BASE64_STRING` (the output from the terminal command in Step 1) - - **Header name:** `DD-API-KEY` - **Header value:** Your Datadog API key + - **Header name:** `DD-API-KEY` + - **Header value:** Your Datadog API key - - **Header name:** `Authorization` (if auth is enabled on your Prometheus, - otherwise leave empty) - **Header value:** `Bearer your-token` (if auth is - enabled) + - **Header name:** `Authorization` (if auth is enabled on your Prometheus, otherwise leave empty) + - **Header value:** `Bearer your-token` (if auth is enabled) - Add the authentication headers required by your platform. You can add - multiple headers using the **Add Header** button. + Add the authentication headers required by your platform. You can add multiple headers using the **Add Header** button. 
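The `Basic YOUR_BASE64_STRING` value referenced above can be produced in any POSIX shell. A sketch — the instance ID and token below are placeholders, not real credentials:

```shell
# Placeholder Grafana Cloud credentials — substitute your own from Step 1
INSTANCE_ID="123456"
API_TOKEN="glc_example_token"

# Encode "instance:token"; printf avoids the trailing newline that echo would
# add, which would otherwise corrupt the encoded value
BASIC_VALUE=$(printf '%s:%s' "$INSTANCE_ID" "$API_TOKEN" | base64)
echo "Authorization: Basic $BASIC_VALUE"
```

Paste the resulting `Basic …` string into the **Header value** field.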
@@ -127,37 +123,37 @@ If something doesn't look right, click **Test Connection** to verify Cerebrium c ### Resource Metrics -| Metric | Type | Unit | Description | -| ------------------------------------------- | ----- | ------- | --------------------------------------- | -| `cerebrium_cpu_utilization_cores` | Gauge | cores | CPU cores actively in use per app | -| `cerebrium_memory_usage_bytes` | Gauge | bytes | Memory actively in use per app | -| `cerebrium_gpu_memory_usage_bytes` | Gauge | bytes | GPU VRAM in use per app | +| Metric | Type | Unit | Description | +|--------|------|------|-------------| +| `cerebrium_cpu_utilization_cores` | Gauge | cores | CPU cores actively in use per app | +| `cerebrium_memory_usage_bytes` | Gauge | bytes | Memory actively in use per app | +| `cerebrium_gpu_memory_usage_bytes` | Gauge | bytes | GPU VRAM in use per app | | `cerebrium_gpu_compute_utilization_percent` | Gauge | percent | GPU compute utilization (0-100) per app | -| `cerebrium_containers_running_count` | Gauge | count | Number of running containers per app | -| `cerebrium_containers_ready_count` | Gauge | count | Number of ready containers per app | +| `cerebrium_containers_running_count` | Gauge | count | Number of running containers per app | +| `cerebrium_containers_ready_count` | Gauge | count | Number of ready containers per app | ### Execution Metrics -| Metric | Type | Unit | Description | -| --------------------------------- | --------- | ---- | ------------------------------ | -| `cerebrium_run_execution_time_ms` | Histogram | ms | Time spent executing user code | -| `cerebrium_run_queue_time_ms` | Histogram | ms | Time spent waiting in queue | -| `cerebrium_run_coldstart_time_ms` | Histogram | ms | Time for container cold start | -| `cerebrium_run_response_time_ms` | Histogram | ms | Total end-to-end response time | -| `cerebrium_run_total_total` | Counter | — | Total run count | -| `cerebrium_run_successes_total` | Counter | — | Successful run count | -| 
`cerebrium_run_errors_total` | Counter | — | Failed run count | +| Metric | Type | Unit | Description | +|--------|------|------|-------------| +| `cerebrium_run_execution_time_ms` | Histogram | ms | Time spent executing user code | +| `cerebrium_run_queue_time_ms` | Histogram | ms | Time spent waiting in queue | +| `cerebrium_run_coldstart_time_ms` | Histogram | ms | Time for container cold start | +| `cerebrium_run_response_time_ms` | Histogram | ms | Total end-to-end response time | +| `cerebrium_run_total_total` | Counter | — | Total run count | +| `cerebrium_run_successes_total` | Counter | — | Successful run count | +| `cerebrium_run_errors_total` | Counter | — | Failed run count | ### Labels Every metric includes the following labels for filtering and grouping: -| Label | Description | Example | -| ------------ | --------------------------- | --------------------- | -| `project_id` | Your Cerebrium project ID | `p-abc12345` | -| `app_id` | Full application identifier | `p-abc12345-my-model` | -| `app_name` | Human-readable app name | `my-model` | -| `region` | Deployment region | `us-east-1` | +| Label | Description | Example | +|-------|-------------|---------| +| `project_id` | Your Cerebrium project ID | `p-abc12345` | +| `app_id` | Full application identifier | `p-abc12345-my-model` | +| `app_name` | Human-readable app name | `my-model` | +| `region` | Deployment region | `us-east-1` | ## Viewing Metrics @@ -190,7 +186,6 @@ Once connected, metrics will appear in your monitoring platform within a minute. # p99 end-to-end response time histogram_quantile(0.99, rate(cerebrium_run_response_time_ms_bucket{app_name="my-model"}[5m])) ``` - 1. Go to **Metrics** → **Explorer** in your Datadog dashboard @@ -207,7 +202,6 @@ Once connected, metrics will appear in your monitoring platform within a minute. 
# CPU usage across all apps cerebrium_cpu_utilization_cores ``` - @@ -220,46 +214,4 @@ You can manage your metrics export configuration from the dashboard at any time - **Change endpoint:** Update the OTLP endpoint field and click **Save Changes**. - **Check status:** The dashboard shows whether export is connected, the time of the last successful export, and any error messages. - - You can also manage metrics export programmatically. Find your Cerebrium API key in the [dashboard](https://dashboard.cerebrium.ai) under **Settings** → **API Keys**. - -| Method | Endpoint | Description | -| ------ | ---------------------------------------- | ------------------------------------------- | -| `GET` | `/v2/metrics-export/{project_id}/config` | Get current export configuration | -| `PUT` | `/v2/metrics-export/{project_id}/config` | Update export configuration | -| `POST` | `/v2/metrics-export/{project_id}/test` | Test connection to your monitoring platform | - -**Enable with endpoint and credentials:** - -```bash -curl -X PUT "https://rest.cerebrium.ai/v2/metrics-export/YOUR_PROJECT_ID/config" \ - -H "Authorization: Bearer YOUR_CEREBRIUM_API_KEY" \ - -H "Content-Type: application/json" \ - -d '{ - "enabled": true, - "otlpEndpoint": "https://otlp-gateway-prod-us-east-0.grafana.net/otlp", - "authHeaders": { - "Authorization": "Basic YOUR_BASE64_CREDENTIALS" - } - }' -``` - -**Test connection:** - -```bash -curl -X POST "https://rest.cerebrium.ai/v2/metrics-export/YOUR_PROJECT_ID/test" \ - -H "Authorization: Bearer YOUR_CEREBRIUM_API_KEY" -``` - -**Disable export:** - -```bash -curl -X PUT "https://rest.cerebrium.ai/v2/metrics-export/YOUR_PROJECT_ID/config" \ - -H "Authorization: Bearer YOUR_CEREBRIUM_API_KEY" \ - -H "Content-Type: application/json" \ - -d '{"enabled": false}' -``` - -The `authHeaders` field is a map of header name → header value. These are stored encrypted and never returned in API responses. 
- - +You can also manage metrics export programmatically via our [REST API](/api-reference/metrics-export). From ac20917f9dad8361c0262312ce87483434651234 Mon Sep 17 00:00:00 2001 From: Hkhan161 Date: Tue, 17 Feb 2026 19:55:50 +0000 Subject: [PATCH 11/18] Prettified Code! --- cerebrium/integrations/metrics-export.mdx | 64 +++++++++++++---------- 1 file changed, 35 insertions(+), 29 deletions(-) diff --git a/cerebrium/integrations/metrics-export.mdx b/cerebrium/integrations/metrics-export.mdx index 4d03abd9..284898cd 100644 --- a/cerebrium/integrations/metrics-export.mdx +++ b/cerebrium/integrations/metrics-export.mdx @@ -48,6 +48,7 @@ Before heading to the Cerebrium dashboard, you'll need an OTLP endpoint and auth Make sure the API token has the **MetricsPublisher** role. The default Prometheus Remote Write token will not work with the OTLP endpoint. + 1. Sign in to [Datadog](https://app.datadoghq.com) @@ -66,6 +67,7 @@ Before heading to the Cerebrium dashboard, you'll need an OTLP endpoint and auth You can find your site in your Datadog URL — for example, if you log in at `app.us3.datadoghq.com`, your site is US3. Keep your API key and endpoint handy for the next step. + 1. 
Enable the OTLP receiver in your Prometheus config: @@ -86,6 +88,7 @@ Before heading to the Cerebrium dashboard, you'll need an OTLP endpoint and auth | New Relic | `api-key` | Your New Relic license key | | Honeycomb | `x-honeycomb-team` | Your Honeycomb API key | | Lightstep | `lightstep-access-token` | Your Lightstep token | + @@ -97,19 +100,20 @@ Before heading to the Cerebrium dashboard, you'll need an OTLP endpoint and auth - - **Header name:** `Authorization` - - **Header value:** `Basic YOUR_BASE64_STRING` (the output from the terminal command in Step 1) + - **Header name:** `Authorization` - **Header value:** `Basic + YOUR_BASE64_STRING` (the output from the terminal command in Step 1) - - **Header name:** `DD-API-KEY` - - **Header value:** Your Datadog API key + - **Header name:** `DD-API-KEY` - **Header value:** Your Datadog API key - - **Header name:** `Authorization` (if auth is enabled on your Prometheus, otherwise leave empty) - - **Header value:** `Bearer your-token` (if auth is enabled) + - **Header name:** `Authorization` (if auth is enabled on your Prometheus, + otherwise leave empty) - **Header value:** `Bearer your-token` (if auth is + enabled) - Add the authentication headers required by your platform. You can add multiple headers using the **Add Header** button. + Add the authentication headers required by your platform. You can add + multiple headers using the **Add Header** button. 
@@ -123,37 +127,37 @@ If something doesn't look right, click **Test Connection** to verify Cerebrium c ### Resource Metrics -| Metric | Type | Unit | Description | -|--------|------|------|-------------| -| `cerebrium_cpu_utilization_cores` | Gauge | cores | CPU cores actively in use per app | -| `cerebrium_memory_usage_bytes` | Gauge | bytes | Memory actively in use per app | -| `cerebrium_gpu_memory_usage_bytes` | Gauge | bytes | GPU VRAM in use per app | +| Metric | Type | Unit | Description | +| ------------------------------------------- | ----- | ------- | --------------------------------------- | +| `cerebrium_cpu_utilization_cores` | Gauge | cores | CPU cores actively in use per app | +| `cerebrium_memory_usage_bytes` | Gauge | bytes | Memory actively in use per app | +| `cerebrium_gpu_memory_usage_bytes` | Gauge | bytes | GPU VRAM in use per app | | `cerebrium_gpu_compute_utilization_percent` | Gauge | percent | GPU compute utilization (0-100) per app | -| `cerebrium_containers_running_count` | Gauge | count | Number of running containers per app | -| `cerebrium_containers_ready_count` | Gauge | count | Number of ready containers per app | +| `cerebrium_containers_running_count` | Gauge | count | Number of running containers per app | +| `cerebrium_containers_ready_count` | Gauge | count | Number of ready containers per app | ### Execution Metrics -| Metric | Type | Unit | Description | -|--------|------|------|-------------| -| `cerebrium_run_execution_time_ms` | Histogram | ms | Time spent executing user code | -| `cerebrium_run_queue_time_ms` | Histogram | ms | Time spent waiting in queue | -| `cerebrium_run_coldstart_time_ms` | Histogram | ms | Time for container cold start | -| `cerebrium_run_response_time_ms` | Histogram | ms | Total end-to-end response time | -| `cerebrium_run_total_total` | Counter | — | Total run count | -| `cerebrium_run_successes_total` | Counter | — | Successful run count | -| `cerebrium_run_errors_total` | Counter | — | Failed 
run count | +| Metric | Type | Unit | Description | +| --------------------------------- | --------- | ---- | ------------------------------ | +| `cerebrium_run_execution_time_ms` | Histogram | ms | Time spent executing user code | +| `cerebrium_run_queue_time_ms` | Histogram | ms | Time spent waiting in queue | +| `cerebrium_run_coldstart_time_ms` | Histogram | ms | Time for container cold start | +| `cerebrium_run_response_time_ms` | Histogram | ms | Total end-to-end response time | +| `cerebrium_run_total_total` | Counter | — | Total run count | +| `cerebrium_run_successes_total` | Counter | — | Successful run count | +| `cerebrium_run_errors_total` | Counter | — | Failed run count | ### Labels Every metric includes the following labels for filtering and grouping: -| Label | Description | Example | -|-------|-------------|---------| -| `project_id` | Your Cerebrium project ID | `p-abc12345` | -| `app_id` | Full application identifier | `p-abc12345-my-model` | -| `app_name` | Human-readable app name | `my-model` | -| `region` | Deployment region | `us-east-1` | +| Label | Description | Example | +| ------------ | --------------------------- | --------------------- | +| `project_id` | Your Cerebrium project ID | `p-abc12345` | +| `app_id` | Full application identifier | `p-abc12345-my-model` | +| `app_name` | Human-readable app name | `my-model` | +| `region` | Deployment region | `us-east-1` | ## Viewing Metrics @@ -186,6 +190,7 @@ Once connected, metrics will appear in your monitoring platform within a minute. # p99 end-to-end response time histogram_quantile(0.99, rate(cerebrium_run_response_time_ms_bucket{app_name="my-model"}[5m])) ``` + 1. Go to **Metrics** → **Explorer** in your Datadog dashboard @@ -202,6 +207,7 @@ Once connected, metrics will appear in your monitoring platform within a minute. 
# CPU usage across all apps cerebrium_cpu_utilization_cores ``` + From e85b9cdafa3374f6663c4dbe1f65aa1e24cf2f50 Mon Sep 17 00:00:00 2001 From: Harris Khan Date: Tue, 17 Feb 2026 17:49:37 -0500 Subject: [PATCH 12/18] docs: remove broken API reference link --- cerebrium/integrations/metrics-export.mdx | 66 ++++++++++------------- 1 file changed, 29 insertions(+), 37 deletions(-) diff --git a/cerebrium/integrations/metrics-export.mdx b/cerebrium/integrations/metrics-export.mdx index 284898cd..283b0c5d 100644 --- a/cerebrium/integrations/metrics-export.mdx +++ b/cerebrium/integrations/metrics-export.mdx @@ -48,7 +48,6 @@ Before heading to the Cerebrium dashboard, you'll need an OTLP endpoint and auth Make sure the API token has the **MetricsPublisher** role. The default Prometheus Remote Write token will not work with the OTLP endpoint. - 1. Sign in to [Datadog](https://app.datadoghq.com) @@ -67,7 +66,6 @@ Before heading to the Cerebrium dashboard, you'll need an OTLP endpoint and auth You can find your site in your Datadog URL — for example, if you log in at `app.us3.datadoghq.com`, your site is US3. Keep your API key and endpoint handy for the next step. - 1. 
Enable the OTLP receiver in your Prometheus config: @@ -88,7 +86,6 @@ Before heading to the Cerebrium dashboard, you'll need an OTLP endpoint and auth | New Relic | `api-key` | Your New Relic license key | | Honeycomb | `x-honeycomb-team` | Your Honeycomb API key | | Lightstep | `lightstep-access-token` | Your Lightstep token | - @@ -100,20 +97,19 @@ Before heading to the Cerebrium dashboard, you'll need an OTLP endpoint and auth - - **Header name:** `Authorization` - **Header value:** `Basic - YOUR_BASE64_STRING` (the output from the terminal command in Step 1) + - **Header name:** `Authorization` + - **Header value:** `Basic YOUR_BASE64_STRING` (the output from the terminal command in Step 1) - - **Header name:** `DD-API-KEY` - **Header value:** Your Datadog API key + - **Header name:** `DD-API-KEY` + - **Header value:** Your Datadog API key - - **Header name:** `Authorization` (if auth is enabled on your Prometheus, - otherwise leave empty) - **Header value:** `Bearer your-token` (if auth is - enabled) + - **Header name:** `Authorization` (if auth is enabled on your Prometheus, otherwise leave empty) + - **Header value:** `Bearer your-token` (if auth is enabled) - Add the authentication headers required by your platform. You can add - multiple headers using the **Add Header** button. + Add the authentication headers required by your platform. You can add multiple headers using the **Add Header** button. 
@@ -127,37 +123,37 @@ If something doesn't look right, click **Test Connection** to verify Cerebrium c ### Resource Metrics -| Metric | Type | Unit | Description | -| ------------------------------------------- | ----- | ------- | --------------------------------------- | -| `cerebrium_cpu_utilization_cores` | Gauge | cores | CPU cores actively in use per app | -| `cerebrium_memory_usage_bytes` | Gauge | bytes | Memory actively in use per app | -| `cerebrium_gpu_memory_usage_bytes` | Gauge | bytes | GPU VRAM in use per app | +| Metric | Type | Unit | Description | +|--------|------|------|-------------| +| `cerebrium_cpu_utilization_cores` | Gauge | cores | CPU cores actively in use per app | +| `cerebrium_memory_usage_bytes` | Gauge | bytes | Memory actively in use per app | +| `cerebrium_gpu_memory_usage_bytes` | Gauge | bytes | GPU VRAM in use per app | | `cerebrium_gpu_compute_utilization_percent` | Gauge | percent | GPU compute utilization (0-100) per app | -| `cerebrium_containers_running_count` | Gauge | count | Number of running containers per app | -| `cerebrium_containers_ready_count` | Gauge | count | Number of ready containers per app | +| `cerebrium_containers_running_count` | Gauge | count | Number of running containers per app | +| `cerebrium_containers_ready_count` | Gauge | count | Number of ready containers per app | ### Execution Metrics -| Metric | Type | Unit | Description | -| --------------------------------- | --------- | ---- | ------------------------------ | -| `cerebrium_run_execution_time_ms` | Histogram | ms | Time spent executing user code | -| `cerebrium_run_queue_time_ms` | Histogram | ms | Time spent waiting in queue | -| `cerebrium_run_coldstart_time_ms` | Histogram | ms | Time for container cold start | -| `cerebrium_run_response_time_ms` | Histogram | ms | Total end-to-end response time | -| `cerebrium_run_total_total` | Counter | — | Total run count | -| `cerebrium_run_successes_total` | Counter | — | Successful run count | -| 
`cerebrium_run_errors_total` | Counter | — | Failed run count | +| Metric | Type | Unit | Description | +|--------|------|------|-------------| +| `cerebrium_run_execution_time_ms` | Histogram | ms | Time spent executing user code | +| `cerebrium_run_queue_time_ms` | Histogram | ms | Time spent waiting in queue | +| `cerebrium_run_coldstart_time_ms` | Histogram | ms | Time for container cold start | +| `cerebrium_run_response_time_ms` | Histogram | ms | Total end-to-end response time | +| `cerebrium_run_total_total` | Counter | — | Total run count | +| `cerebrium_run_successes_total` | Counter | — | Successful run count | +| `cerebrium_run_errors_total` | Counter | — | Failed run count | ### Labels Every metric includes the following labels for filtering and grouping: -| Label | Description | Example | -| ------------ | --------------------------- | --------------------- | -| `project_id` | Your Cerebrium project ID | `p-abc12345` | -| `app_id` | Full application identifier | `p-abc12345-my-model` | -| `app_name` | Human-readable app name | `my-model` | -| `region` | Deployment region | `us-east-1` | +| Label | Description | Example | +|-------|-------------|---------| +| `project_id` | Your Cerebrium project ID | `p-abc12345` | +| `app_id` | Full application identifier | `p-abc12345-my-model` | +| `app_name` | Human-readable app name | `my-model` | +| `region` | Deployment region | `us-east-1` | ## Viewing Metrics @@ -190,7 +186,6 @@ Once connected, metrics will appear in your monitoring platform within a minute. # p99 end-to-end response time histogram_quantile(0.99, rate(cerebrium_run_response_time_ms_bucket{app_name="my-model"}[5m])) ``` - 1. Go to **Metrics** → **Explorer** in your Datadog dashboard @@ -207,7 +202,6 @@ Once connected, metrics will appear in your monitoring platform within a minute. 
# CPU usage across all apps cerebrium_cpu_utilization_cores ``` - @@ -219,5 +213,3 @@ You can manage your metrics export configuration from the dashboard at any time - **Update credentials:** Enter new authentication headers and click **Save Changes**. Useful when rotating API keys. - **Change endpoint:** Update the OTLP endpoint field and click **Save Changes**. - **Check status:** The dashboard shows whether export is connected, the time of the last successful export, and any error messages. - -You can also manage metrics export programmatically via our [REST API](/api-reference/metrics-export). From e8caf84dd4ebdeb6f21d8231610603ed81eaaf17 Mon Sep 17 00:00:00 2001 From: Hkhan161 Date: Tue, 17 Feb 2026 22:49:54 +0000 Subject: [PATCH 13/18] Prettified Code! --- cerebrium/integrations/metrics-export.mdx | 64 +++++++++++++---------- 1 file changed, 35 insertions(+), 29 deletions(-) diff --git a/cerebrium/integrations/metrics-export.mdx b/cerebrium/integrations/metrics-export.mdx index 283b0c5d..44dc762b 100644 --- a/cerebrium/integrations/metrics-export.mdx +++ b/cerebrium/integrations/metrics-export.mdx @@ -48,6 +48,7 @@ Before heading to the Cerebrium dashboard, you'll need an OTLP endpoint and auth Make sure the API token has the **MetricsPublisher** role. The default Prometheus Remote Write token will not work with the OTLP endpoint. + 1. Sign in to [Datadog](https://app.datadoghq.com) @@ -66,6 +67,7 @@ Before heading to the Cerebrium dashboard, you'll need an OTLP endpoint and auth You can find your site in your Datadog URL — for example, if you log in at `app.us3.datadoghq.com`, your site is US3. Keep your API key and endpoint handy for the next step. + 1. 
Enable the OTLP receiver in your Prometheus config: @@ -86,6 +88,7 @@ Before heading to the Cerebrium dashboard, you'll need an OTLP endpoint and auth | New Relic | `api-key` | Your New Relic license key | | Honeycomb | `x-honeycomb-team` | Your Honeycomb API key | | Lightstep | `lightstep-access-token` | Your Lightstep token | + @@ -97,19 +100,20 @@ Before heading to the Cerebrium dashboard, you'll need an OTLP endpoint and auth - - **Header name:** `Authorization` - - **Header value:** `Basic YOUR_BASE64_STRING` (the output from the terminal command in Step 1) + - **Header name:** `Authorization` - **Header value:** `Basic + YOUR_BASE64_STRING` (the output from the terminal command in Step 1) - - **Header name:** `DD-API-KEY` - - **Header value:** Your Datadog API key + - **Header name:** `DD-API-KEY` - **Header value:** Your Datadog API key - - **Header name:** `Authorization` (if auth is enabled on your Prometheus, otherwise leave empty) - - **Header value:** `Bearer your-token` (if auth is enabled) + - **Header name:** `Authorization` (if auth is enabled on your Prometheus, + otherwise leave empty) - **Header value:** `Bearer your-token` (if auth is + enabled) - Add the authentication headers required by your platform. You can add multiple headers using the **Add Header** button. + Add the authentication headers required by your platform. You can add + multiple headers using the **Add Header** button. 
@@ -123,37 +127,37 @@ If something doesn't look right, click **Test Connection** to verify Cerebrium c ### Resource Metrics -| Metric | Type | Unit | Description | -|--------|------|------|-------------| -| `cerebrium_cpu_utilization_cores` | Gauge | cores | CPU cores actively in use per app | -| `cerebrium_memory_usage_bytes` | Gauge | bytes | Memory actively in use per app | -| `cerebrium_gpu_memory_usage_bytes` | Gauge | bytes | GPU VRAM in use per app | +| Metric | Type | Unit | Description | +| ------------------------------------------- | ----- | ------- | --------------------------------------- | +| `cerebrium_cpu_utilization_cores` | Gauge | cores | CPU cores actively in use per app | +| `cerebrium_memory_usage_bytes` | Gauge | bytes | Memory actively in use per app | +| `cerebrium_gpu_memory_usage_bytes` | Gauge | bytes | GPU VRAM in use per app | | `cerebrium_gpu_compute_utilization_percent` | Gauge | percent | GPU compute utilization (0-100) per app | -| `cerebrium_containers_running_count` | Gauge | count | Number of running containers per app | -| `cerebrium_containers_ready_count` | Gauge | count | Number of ready containers per app | +| `cerebrium_containers_running_count` | Gauge | count | Number of running containers per app | +| `cerebrium_containers_ready_count` | Gauge | count | Number of ready containers per app | ### Execution Metrics -| Metric | Type | Unit | Description | -|--------|------|------|-------------| -| `cerebrium_run_execution_time_ms` | Histogram | ms | Time spent executing user code | -| `cerebrium_run_queue_time_ms` | Histogram | ms | Time spent waiting in queue | -| `cerebrium_run_coldstart_time_ms` | Histogram | ms | Time for container cold start | -| `cerebrium_run_response_time_ms` | Histogram | ms | Total end-to-end response time | -| `cerebrium_run_total_total` | Counter | — | Total run count | -| `cerebrium_run_successes_total` | Counter | — | Successful run count | -| `cerebrium_run_errors_total` | Counter | — | Failed 
run count | +| Metric | Type | Unit | Description | +| --------------------------------- | --------- | ---- | ------------------------------ | +| `cerebrium_run_execution_time_ms` | Histogram | ms | Time spent executing user code | +| `cerebrium_run_queue_time_ms` | Histogram | ms | Time spent waiting in queue | +| `cerebrium_run_coldstart_time_ms` | Histogram | ms | Time for container cold start | +| `cerebrium_run_response_time_ms` | Histogram | ms | Total end-to-end response time | +| `cerebrium_run_total_total` | Counter | — | Total run count | +| `cerebrium_run_successes_total` | Counter | — | Successful run count | +| `cerebrium_run_errors_total` | Counter | — | Failed run count | ### Labels Every metric includes the following labels for filtering and grouping: -| Label | Description | Example | -|-------|-------------|---------| -| `project_id` | Your Cerebrium project ID | `p-abc12345` | -| `app_id` | Full application identifier | `p-abc12345-my-model` | -| `app_name` | Human-readable app name | `my-model` | -| `region` | Deployment region | `us-east-1` | +| Label | Description | Example | +| ------------ | --------------------------- | --------------------- | +| `project_id` | Your Cerebrium project ID | `p-abc12345` | +| `app_id` | Full application identifier | `p-abc12345-my-model` | +| `app_name` | Human-readable app name | `my-model` | +| `region` | Deployment region | `us-east-1` | ## Viewing Metrics @@ -186,6 +190,7 @@ Once connected, metrics will appear in your monitoring platform within a minute. # p99 end-to-end response time histogram_quantile(0.99, rate(cerebrium_run_response_time_ms_bucket{app_name="my-model"}[5m])) ``` + 1. Go to **Metrics** → **Explorer** in your Datadog dashboard @@ -202,6 +207,7 @@ Once connected, metrics will appear in your monitoring platform within a minute. 
# CPU usage across all apps cerebrium_cpu_utilization_cores ``` + From 425a098ead00aa955ed174d2eefab4579418f1c7 Mon Sep 17 00:00:00 2001 From: Harris Khan Date: Wed, 18 Feb 2026 09:07:30 -0500 Subject: [PATCH 14/18] docs: move metrics reference to top of page --- cerebrium/integrations/metrics-export.mdx | 72 +++++++++++------------ 1 file changed, 36 insertions(+), 36 deletions(-) diff --git a/cerebrium/integrations/metrics-export.mdx b/cerebrium/integrations/metrics-export.mdx index 44dc762b..c33b890e 100644 --- a/cerebrium/integrations/metrics-export.mdx +++ b/cerebrium/integrations/metrics-export.mdx @@ -5,6 +5,42 @@ description: Export your application metrics to any OTLP-compatible observabilit Export real-time resource and execution metrics from your Cerebrium applications to your existing observability platform. Monitor CPU, memory, GPU usage, request counts, and latency alongside your other services. We support most major monitoring platforms that are OTLP-compatible. +## What metrics are exported? 
+ +### Resource Metrics + +| Metric | Type | Unit | Description | +| ------------------------------------------- | ----- | ------- | --------------------------------------- | +| `cerebrium_cpu_utilization_cores` | Gauge | cores | CPU cores actively in use per app | +| `cerebrium_memory_usage_bytes` | Gauge | bytes | Memory actively in use per app | +| `cerebrium_gpu_memory_usage_bytes` | Gauge | bytes | GPU VRAM in use per app | +| `cerebrium_gpu_compute_utilization_percent` | Gauge | percent | GPU compute utilization (0-100) per app | +| `cerebrium_containers_running_count` | Gauge | count | Number of running containers per app | +| `cerebrium_containers_ready_count` | Gauge | count | Number of ready containers per app | + +### Execution Metrics + +| Metric | Type | Unit | Description | +| --------------------------------- | --------- | ---- | ------------------------------ | +| `cerebrium_run_execution_time_ms` | Histogram | ms | Time spent executing user code | +| `cerebrium_run_queue_time_ms` | Histogram | ms | Time spent waiting in queue | +| `cerebrium_run_coldstart_time_ms` | Histogram | ms | Time for container cold start | +| `cerebrium_run_response_time_ms` | Histogram | ms | Total end-to-end response time | +| `cerebrium_run_total_total` | Counter | — | Total run count | +| `cerebrium_run_successes_total` | Counter | — | Successful run count | +| `cerebrium_run_errors_total` | Counter | — | Failed run count | + +### Labels + +Every metric includes the following labels for filtering and grouping: + +| Label | Description | Example | +| ------------ | --------------------------- | --------------------- | +| `project_id` | Your Cerebrium project ID | `p-abc12345` | +| `app_id` | Full application identifier | `p-abc12345-my-model` | +| `app_name` | Human-readable app name | `my-model` | +| `region` | Deployment region | `us-east-1` | + ## How it works Cerebrium automatically pushes metrics from your applications to your monitoring platform every **60 
seconds** using the [OpenTelemetry Protocol (OTLP)](https://opentelemetry.io/docs/specs/otlp/). You provide an OTLP endpoint and authentication credentials through the Cerebrium dashboard, and Cerebrium handles the rest — collecting resource usage and execution data, formatting it as OpenTelemetry metrics, and delivering it to your platform. @@ -123,42 +159,6 @@ Your metrics will start flowing within 60 seconds. The dashboard will show a gre If something doesn't look right, click **Test Connection** to verify Cerebrium can reach your monitoring platform. You'll see a success or failure message with details to help you troubleshoot. -## What metrics are exported? - -### Resource Metrics - -| Metric | Type | Unit | Description | -| ------------------------------------------- | ----- | ------- | --------------------------------------- | -| `cerebrium_cpu_utilization_cores` | Gauge | cores | CPU cores actively in use per app | -| `cerebrium_memory_usage_bytes` | Gauge | bytes | Memory actively in use per app | -| `cerebrium_gpu_memory_usage_bytes` | Gauge | bytes | GPU VRAM in use per app | -| `cerebrium_gpu_compute_utilization_percent` | Gauge | percent | GPU compute utilization (0-100) per app | -| `cerebrium_containers_running_count` | Gauge | count | Number of running containers per app | -| `cerebrium_containers_ready_count` | Gauge | count | Number of ready containers per app | - -### Execution Metrics - -| Metric | Type | Unit | Description | -| --------------------------------- | --------- | ---- | ------------------------------ | -| `cerebrium_run_execution_time_ms` | Histogram | ms | Time spent executing user code | -| `cerebrium_run_queue_time_ms` | Histogram | ms | Time spent waiting in queue | -| `cerebrium_run_coldstart_time_ms` | Histogram | ms | Time for container cold start | -| `cerebrium_run_response_time_ms` | Histogram | ms | Total end-to-end response time | -| `cerebrium_run_total_total` | Counter | — | Total run count | -| 
`cerebrium_run_successes_total` | Counter | — | Successful run count | -| `cerebrium_run_errors_total` | Counter | — | Failed run count | - -### Labels - -Every metric includes the following labels for filtering and grouping: - -| Label | Description | Example | -| ------------ | --------------------------- | --------------------- | -| `project_id` | Your Cerebrium project ID | `p-abc12345` | -| `app_id` | Full application identifier | `p-abc12345-my-model` | -| `app_name` | Human-readable app name | `my-model` | -| `region` | Deployment region | `us-east-1` | - ## Viewing Metrics Once connected, metrics will appear in your monitoring platform within a minute. From a94391b5768df20fbf8275ab28ffde9445200194 Mon Sep 17 00:00:00 2001 From: Harris Khan Date: Wed, 18 Feb 2026 09:18:19 -0500 Subject: [PATCH 15/18] docs: add Prometheus naming note, troubleshooting section, improved Grafana instructions, more PromQL examples --- cerebrium/integrations/metrics-export.mdx | 65 +++++++++++++++++------ 1 file changed, 48 insertions(+), 17 deletions(-) diff --git a/cerebrium/integrations/metrics-export.mdx b/cerebrium/integrations/metrics-export.mdx index c33b890e..7632f570 100644 --- a/cerebrium/integrations/metrics-export.mdx +++ b/cerebrium/integrations/metrics-export.mdx @@ -3,7 +3,7 @@ title: Exporting Metrics to Monitoring Platforms description: Export your application metrics to any OTLP-compatible observability platform including Grafana Cloud, Datadog, Prometheus, New Relic, and more --- -Export real-time resource and execution metrics from your Cerebrium applications to your existing observability platform. Monitor CPU, memory, GPU usage, request counts, and latency alongside your other services. We support most major monitoring platforms that are OTLP-compatible. +Export real-time resource and execution metrics from your Cerebrium applications to your existing observability platform. 
Monitor CPU, memory, GPU usage, request counts, and latency metrics exported by your applications. We support most major monitoring platforms that are OTLP-compatible. ## What metrics are exported? @@ -26,10 +26,14 @@ Export real-time resource and execution metrics from your Cerebrium applications | `cerebrium_run_queue_time_ms` | Histogram | ms | Time spent waiting in queue | | `cerebrium_run_coldstart_time_ms` | Histogram | ms | Time for container cold start | | `cerebrium_run_response_time_ms` | Histogram | ms | Total end-to-end response time | -| `cerebrium_run_total_total` | Counter | — | Total run count | +| `cerebrium_run_total` | Counter | — | Total run count | | `cerebrium_run_successes_total` | Counter | — | Successful run count | | `cerebrium_run_errors_total` | Counter | — | Failed run count | + +**Prometheus metric name mapping:** When metrics are ingested by Prometheus (including Grafana Cloud), OTLP automatically appends unit suffixes to metric names. Histogram metrics will appear with `_milliseconds` appended — for example, `cerebrium_run_execution_time_ms` becomes `cerebrium_run_execution_time_ms_milliseconds_bucket`, `_count`, and `_sum`. Counter metrics with the `_total` suffix remain unchanged. The example queries throughout this guide use the Prometheus-ingested names. + + ### Labels Every metric includes the following labels for filtering and grouping: @@ -72,7 +76,7 @@ Before heading to the Cerebrium dashboard, you'll need an OTLP endpoint and auth - US: `https://otlp-gateway-prod-us-east-0.grafana.net/otlp` - EU: `https://otlp-gateway-prod-eu-west-0.grafana.net/otlp` - Other regions will show their specific URL on the configuration page - 5. On the same page, generate an API token with the **MetricsPublisher** role + 5. On the same page, generate an API token. Click **Generate now** and ensure the token has the **MetricsPublisher** role — this is a separate token from any Prometheus Remote Write tokens you may already have. 6. 
The page will show you an **Instance ID** and the generated token. Run the following in your terminal to create the Basic auth string: ```bash @@ -82,7 +86,7 @@ Before heading to the Cerebrium dashboard, you'll need an OTLP endpoint and auth Copy the output — you'll paste it in the dashboard in the next step. - Make sure the API token has the **MetricsPublisher** role. The default Prometheus Remote Write token will not work with the OTLP endpoint. + The API token **must** have the **MetricsPublisher** role. The default Prometheus Remote Write token will not work with the OTLP endpoint. If you're unsure, generate a new token from the OpenTelemetry configuration page — it will have the correct role by default. @@ -109,7 +113,7 @@ Before heading to the Cerebrium dashboard, you'll need an OTLP endpoint and auth 1. Enable the OTLP receiver in your Prometheus config: - Add `--enable-feature=otlp-write-receiver` flag - Or use an OpenTelemetry Collector as a sidecar - 2. Your endpoint will be `http://YOUR_PROMETHEUS_HOST:4318` — copy this for the next step + 2. Your endpoint will be `http://YOUR_PROMETHEUS_HOST:4318` (this is the OTLP HTTP port — not `4317`, which is gRPC) — copy this for the next step Any platform that supports [OpenTelemetry OTLP over HTTP](https://opentelemetry.io/docs/specs/otlp/) will work, including New Relic, Honeycomb, Lightstep, and others. @@ -161,7 +165,7 @@ If something doesn't look right, click **Test Connection** to verify Cerebrium c ## Viewing Metrics -Once connected, metrics will appear in your monitoring platform within a minute. +Once connected, metrics will appear in your monitoring platform within a minute or two (exact latency depends on your platform's ingestion pipeline). @@ -172,23 +176,29 @@ Once connected, metrics will appear in your monitoring platform within a minute. **Example queries:** ```promql - # CPU usage by app (replace with your project ID, e.g. 
p-9676c59f) - cerebrium_cpu_utilization_cores{project_id="p-9676c59f"} - + # CPU usage by app + cerebrium_cpu_utilization_cores{project_id="YOUR_PROJECT_ID"} + # Memory for a specific app cerebrium_memory_usage_bytes{app_name="my-model"} - + # Container scaling over time - cerebrium_containers_running_count{project_id="p-9676c59f"} - + cerebrium_containers_running_count{project_id="YOUR_PROJECT_ID"} + # Request rate (requests per second over 5 minutes) - rate(cerebrium_run_total_total{app_name="my-model"}[5m]) - + rate(cerebrium_run_total[5m]) + # p99 execution latency - histogram_quantile(0.99, rate(cerebrium_run_execution_time_ms_bucket{app_name="my-model"}[5m])) - + histogram_quantile(0.99, rate(cerebrium_run_execution_time_ms_milliseconds_bucket{app_name="my-model"}[5m])) + # p99 end-to-end response time - histogram_quantile(0.99, rate(cerebrium_run_response_time_ms_bucket{app_name="my-model"}[5m])) + histogram_quantile(0.99, rate(cerebrium_run_response_time_ms_milliseconds_bucket{app_name="my-model"}[5m])) + + # Error rate as a percentage + rate(cerebrium_run_errors_total{app_name="my-model"}[5m]) / rate(cerebrium_run_total{app_name="my-model"}[5m]) * 100 + + # Average cold start time + rate(cerebrium_run_coldstart_time_ms_milliseconds_sum{app_name="my-model"}[5m]) / rate(cerebrium_run_coldstart_time_ms_milliseconds_count{app_name="my-model"}[5m]) ``` @@ -219,3 +229,24 @@ You can manage your metrics export configuration from the dashboard at any time - **Update credentials:** Enter new authentication headers and click **Save Changes**. Useful when rotating API keys. - **Change endpoint:** Update the OTLP endpoint field and click **Save Changes**. - **Check status:** The dashboard shows whether export is connected, the time of the last successful export, and any error messages. + +## Troubleshooting + +### Metrics not appearing + +1. **Check the dashboard status.** Go to **Integrations** → **Metrics Export** and look for the connection status. 
If it shows "Paused," export was automatically disabled after repeated failures — click **Re-enable** after fixing the issue.
+2. **Run a connection test.** Click **Test Connection** on the dashboard. Common errors:
+   - **401 Unauthorized / 403 Forbidden:** Your auth headers are wrong. For Grafana Cloud, make sure you're using a MetricsPublisher token (not a Prometheus Remote Write token). For Datadog, verify your API key is active.
+   - **404 Not Found:** The OTLP endpoint URL is incorrect. Double-check the URL matches your platform and region.
+   - **Connection timeout:** Your endpoint may be unreachable. For self-hosted Prometheus, confirm the host is publicly accessible and port `4318` is open.
+3. **Check your platform's data source.** In Grafana Cloud, make sure you're querying the correct Prometheus data source (not a Loki or Tempo source). In Datadog, check that your site region matches the endpoint you configured.
+
+### Metrics appear but values look wrong
+
+- **Histogram metrics have `_milliseconds` in the name.** This is normal — Prometheus appends unit suffixes from OTLP metadata. Use the full name (e.g., `cerebrium_run_execution_time_ms_milliseconds_bucket`) in your queries.
+- **Container counts fluctuate during deploys.** This is expected — you may see temporary spikes in `cerebrium_containers_running_count` during rolling deployments as new containers start and old ones drain.
+- **Gaps in metrics.** Short gaps (1-2 minutes) can occur during deployments or scaling events. If you see persistent gaps, check whether export was paused.
+
+### Still stuck?
+
+Reach out to [support@cerebrium.ai](mailto:support@cerebrium.ai) with your project ID and the error message from the dashboard — we can check the export logs on our side.
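For the Grafana Cloud 401/403 case covered in the troubleshooting section above, a quick sanity check is to rebuild the Basic auth string by hand and compare it to what you pasted into the dashboard. A minimal shell sketch (the instance ID and token below are placeholders, not real credentials):

```shell
# Rebuild the Basic auth value for Grafana Cloud's OTLP endpoint.
# INSTANCE_ID and TOKEN are placeholder values; substitute your own.
INSTANCE_ID="123456"
TOKEN="glc_example_token"

# printf avoids the trailing newline that a bare echo would sneak into
# the encoded input; tr strips the line wraps GNU base64 inserts by
# default for long inputs (real Grafana Cloud tokens are long).
AUTH=$(printf '%s' "${INSTANCE_ID}:${TOKEN}" | base64 | tr -d '\n')
echo "Basic ${AUTH}"
```

Paste the full `Basic ...` value into the Authorization header field. An accidental trailing newline or wrapped base64 output is a common cause of otherwise mysterious 401s.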
From 0a5c6a54bf929d134a8a05c335179dd6bcd13295 Mon Sep 17 00:00:00 2001 From: Harris Khan Date: Wed, 18 Feb 2026 09:20:14 -0500 Subject: [PATCH 16/18] docs: add histogram naming note before Grafana PromQL examples --- cerebrium/integrations/metrics-export.mdx | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/cerebrium/integrations/metrics-export.mdx b/cerebrium/integrations/metrics-export.mdx index 7632f570..f328dbc6 100644 --- a/cerebrium/integrations/metrics-export.mdx +++ b/cerebrium/integrations/metrics-export.mdx @@ -175,6 +175,10 @@ Once connected, metrics will appear in your monitoring platform within a minute **Example queries:** + + Histogram metrics in Prometheus have `_milliseconds` appended by OTLP's unit suffix convention, so you'll see names like `cerebrium_run_execution_time_ms_milliseconds_bucket`. This is expected behavior — see the [metric name mapping note](#execution-metrics) above. + + ```promql # CPU usage by app cerebrium_cpu_utilization_cores{project_id="YOUR_PROJECT_ID"} From 2651a16c19361477461d1e4c48dc91ebddf4fc15 Mon Sep 17 00:00:00 2001 From: Hkhan161 Date: Wed, 18 Feb 2026 14:20:27 +0000 Subject: [PATCH 17/18] Prettified Code! --- cerebrium/integrations/metrics-export.mdx | 22 ++++++++++++++-------- 1 file changed, 14 insertions(+), 8 deletions(-) diff --git a/cerebrium/integrations/metrics-export.mdx b/cerebrium/integrations/metrics-export.mdx index f328dbc6..bc83504d 100644 --- a/cerebrium/integrations/metrics-export.mdx +++ b/cerebrium/integrations/metrics-export.mdx @@ -31,7 +31,13 @@ Export real-time resource and execution metrics from your Cerebrium applications | `cerebrium_run_errors_total` | Counter | — | Failed run count | -**Prometheus metric name mapping:** When metrics are ingested by Prometheus (including Grafana Cloud), OTLP automatically appends unit suffixes to metric names. 
Histogram metrics will appear with `_milliseconds` appended — for example, `cerebrium_run_execution_time_ms` becomes `cerebrium_run_execution_time_ms_milliseconds_bucket`, `_count`, and `_sum`. Counter metrics with the `_total` suffix remain unchanged. The example queries throughout this guide use the Prometheus-ingested names. + **Prometheus metric name mapping:** When metrics are ingested by Prometheus + (including Grafana Cloud), OTLP automatically appends unit suffixes to metric + names. Histogram metrics will appear with `_milliseconds` appended — for + example, `cerebrium_run_execution_time_ms` becomes + `cerebrium_run_execution_time_ms_milliseconds_bucket`, `_count`, and `_sum`. + Counter metrics with the `_total` suffix remain unchanged. The example queries + throughout this guide use the Prometheus-ingested names. ### Labels @@ -182,25 +188,25 @@ Once connected, metrics will appear in your monitoring platform within a minute ```promql # CPU usage by app cerebrium_cpu_utilization_cores{project_id="YOUR_PROJECT_ID"} - + # Memory for a specific app cerebrium_memory_usage_bytes{app_name="my-model"} - + # Container scaling over time cerebrium_containers_running_count{project_id="YOUR_PROJECT_ID"} - + # Request rate (requests per second over 5 minutes) rate(cerebrium_run_total[5m]) - + # p99 execution latency histogram_quantile(0.99, rate(cerebrium_run_execution_time_ms_milliseconds_bucket{app_name="my-model"}[5m])) - + # p99 end-to-end response time histogram_quantile(0.99, rate(cerebrium_run_response_time_ms_milliseconds_bucket{app_name="my-model"}[5m])) - + # Error rate as a percentage rate(cerebrium_run_errors_total{app_name="my-model"}[5m]) / rate(cerebrium_run_total{app_name="my-model"}[5m]) * 100 - + # Average cold start time rate(cerebrium_run_coldstart_time_ms_milliseconds_sum{app_name="my-model"}[5m]) / rate(cerebrium_run_coldstart_time_ms_milliseconds_count{app_name="my-model"}[5m]) ``` From 592b488ad59a2eb54c96d42b01b8e345129692c9 Mon Sep 17 00:00:00 
2001 From: Harris Khan Date: Wed, 18 Feb 2026 09:27:23 -0500 Subject: [PATCH 18/18] docs: prevent metric name wrapping in tables --- cerebrium/integrations/metrics-export.mdx | 34 +++++++++++------------ 1 file changed, 17 insertions(+), 17 deletions(-) diff --git a/cerebrium/integrations/metrics-export.mdx b/cerebrium/integrations/metrics-export.mdx index bc83504d..33c9f66f 100644 --- a/cerebrium/integrations/metrics-export.mdx +++ b/cerebrium/integrations/metrics-export.mdx @@ -9,26 +9,26 @@ Export real-time resource and execution metrics from your Cerebrium applications ### Resource Metrics -| Metric | Type | Unit | Description | -| ------------------------------------------- | ----- | ------- | --------------------------------------- | -| `cerebrium_cpu_utilization_cores` | Gauge | cores | CPU cores actively in use per app | -| `cerebrium_memory_usage_bytes` | Gauge | bytes | Memory actively in use per app | -| `cerebrium_gpu_memory_usage_bytes` | Gauge | bytes | GPU VRAM in use per app | -| `cerebrium_gpu_compute_utilization_percent` | Gauge | percent | GPU compute utilization (0-100) per app | -| `cerebrium_containers_running_count` | Gauge | count | Number of running containers per app | -| `cerebrium_containers_ready_count` | Gauge | count | Number of ready containers per app | +| Metric | Type | Unit | Description | +| ------------------------------------------------------------------------------------- | ----- | ------- | --------------------------------------- | +| cerebrium_cpu_utilization_cores | Gauge | cores | CPU cores actively in use per app | +| cerebrium_memory_usage_bytes | Gauge | bytes | Memory actively in use per app | +| cerebrium_gpu_memory_usage_bytes | Gauge | bytes | GPU VRAM in use per app | +| cerebrium_gpu_compute_utilization_percent | Gauge | percent | GPU compute utilization (0-100) per app | +| cerebrium_containers_running_count | Gauge | count | Number of running containers per app | +| cerebrium_containers_ready_count | Gauge | 
count | Number of ready containers per app | ### Execution Metrics -| Metric | Type | Unit | Description | -| --------------------------------- | --------- | ---- | ------------------------------ | -| `cerebrium_run_execution_time_ms` | Histogram | ms | Time spent executing user code | -| `cerebrium_run_queue_time_ms` | Histogram | ms | Time spent waiting in queue | -| `cerebrium_run_coldstart_time_ms` | Histogram | ms | Time for container cold start | -| `cerebrium_run_response_time_ms` | Histogram | ms | Total end-to-end response time | -| `cerebrium_run_total` | Counter | — | Total run count | -| `cerebrium_run_successes_total` | Counter | — | Successful run count | -| `cerebrium_run_errors_total` | Counter | — | Failed run count | +| Metric | Type | Unit | Description | +| --------------------------------------------------------------------------- | --------- | ---- | ------------------------------ | +| cerebrium_run_execution_time_ms | Histogram | ms | Time spent executing user code | +| cerebrium_run_queue_time_ms | Histogram | ms | Time spent waiting in queue | +| cerebrium_run_coldstart_time_ms | Histogram | ms | Time for container cold start | +| cerebrium_run_response_time_ms | Histogram | ms | Total end-to-end response time | +| cerebrium_run_total | Counter | — | Total run count | +| cerebrium_run_successes_total | Counter | — | Successful run count | +| cerebrium_run_errors_total | Counter | — | Failed run count | **Prometheus metric name mapping:** When metrics are ingested by Prometheus