diff --git a/docs/data-tests/data-freshness-sla.mdx b/docs/data-tests/data-freshness-sla.mdx new file mode 100644 index 000000000..d63e4a446 --- /dev/null +++ b/docs/data-tests/data-freshness-sla.mdx @@ -0,0 +1,143 @@ +--- +title: "data_freshness_sla" +sidebarTitle: "Data Freshness SLA" +--- + +import AiGenerateTest from '/snippets/ai-generate-test.mdx'; + + + + +`elementary.data_freshness_sla` + +Verifies that data in a model was updated before a specified SLA deadline time. + +This test checks the maximum timestamp value of a specified column in your data to determine whether the data was refreshed before your deadline. Unlike `freshness_anomalies` (which uses ML-based anomaly detection), this test validates against a fixed, explicit SLA time — making it ideal when you have a concrete contractual or operational deadline. + +### Use Case + +"Was the data in my model updated before 7 AM Pacific today?" + +### Test Logic + +1. If today is not a scheduled check day → **PASS** (skip) +2. Query the model for the maximum value of `timestamp_column` +3. If the max timestamp is from today → **PASS** (data is fresh) +4. If the SLA deadline hasn't passed yet → **PASS** (still time) +5. If the max timestamp is from a previous day → **FAIL** (DATA_STALE) +6. If no data exists in the table → **FAIL** (NO_DATA) + +### Test configuration + +_Required configuration: `timestamp_column`, `sla_time`, `timezone`_ + +{/* prettier-ignore */} +
+ 
+  data_tests:
+      - elementary.data_freshness_sla:
+          arguments:
+              timestamp_column: column name # Required - timestamp column to check for freshness
+              sla_time: string # Required - e.g., "07:00", "7am", "2:30pm", "14:30"
+              timezone: string # Required - IANA timezone name, e.g., "America/Los_Angeles"
+              day_of_week: string | array # Optional - Day(s) to check: "Monday" or ["Monday", "Wednesday"]
+              day_of_month: int | array # Optional - Day(s) of month to check: 1 or [1, 15]
+              where_expression: sql expression # Optional - filter the data before checking
+ 
+
+ + + +```yml Models +models: + - name: < model name > + data_tests: + - elementary.data_freshness_sla: + arguments: + timestamp_column: < column name > # Required + sla_time: < deadline time > # Required - e.g., "07:00", "7am", "2:30pm" + timezone: < IANA timezone > # Required - e.g., "America/Los_Angeles" + day_of_week: < day or array > # Optional + day_of_month: < day or array > # Optional + where_expression: < sql expression > # Optional +``` + +```yml Daily check +models: + - name: daily_revenue + data_tests: + - elementary.data_freshness_sla: + arguments: + timestamp_column: updated_at + sla_time: "07:00" + timezone: "America/Los_Angeles" + config: + tags: ["elementary"] + severity: error +``` + +```yml With filter expression +models: + - name: daily_events + data_tests: + - elementary.data_freshness_sla: + arguments: + timestamp_column: event_timestamp + sla_time: "6am" + timezone: "Europe/Amsterdam" + where_expression: "event_type = 'completed'" + config: + tags: ["elementary"] +``` + +```yml Weekly - only Mondays +models: + - name: weekly_report_data + data_tests: + - elementary.data_freshness_sla: + arguments: + timestamp_column: report_date + sla_time: "09:00" + timezone: "Asia/Tokyo" + day_of_week: ["Monday"] + config: + tags: ["elementary"] +``` + + + +### Features + +- **Data-level freshness**: Checks actual data timestamps, not just pipeline execution time +- **Flexible time formats**: Supports `"07:00"`, `"7am"`, `"2:30pm"`, `"14:30"`, and other common formats +- **IANA timezone support**: Uses standard timezone names like `"America/Los_Angeles"`, `"Europe/Amsterdam"`, etc. 
+- **Automatic DST handling**: Uses `pytz` for timezone conversions, so daylight saving time transitions are handled automatically +- **Database-agnostic**: All timezone logic happens at dbt compile time, so no warehouse-specific timezone functions are required +- **Schedule filters**: Optional `day_of_week` and `day_of_month` parameters to check only specific days +- **Filter support**: Use `where_expression` to check freshness of a specific subset of data + +### Parameters + +| Parameter | Required | Description | +| ------------------ | -------- | -------------------------------------------------------------- | +| `timestamp_column` | Yes | Column name containing timestamps to check for freshness | +| `sla_time` | Yes | Deadline time (e.g., `"07:00"`, `"7am"`, `"2:30pm"`) | +| `timezone` | Yes | IANA timezone name (e.g., `"America/Los_Angeles"`) | +| `day_of_week` | No | Day(s) to check: `"Monday"` or `["Monday", "Wednesday"]` | +| `day_of_month` | No | Day(s) of month to check: `1` or `[1, 15]` | +| `where_expression` | No | SQL expression to filter the data before checking | + +### Comparison with other freshness tests + +| Feature | `data_freshness_sla` | `freshness_anomalies` | `execution_sla` | +| --- | --- | --- | --- | +| What it checks | Data timestamps | Data timestamps | Pipeline run time | +| Detection method | Fixed SLA deadline | ML-based anomaly detection | Fixed SLA deadline | +| Best for | Contractual/operational deadlines | Detecting unexpected delays | Pipeline execution deadlines | +| Works with sources | Yes | Yes | No (models only) | + +### Notes + +- The `timestamp_column` values are assumed to be in **UTC** (or timezone-naive timestamps that represent UTC). If your data stores local timestamps, the comparison may be incorrect.
+- If both `day_of_week` and `day_of_month` are set, the test uses OR logic (checks if either matches) +- The test passes if the SLA deadline hasn't been reached yet, giving your data time to be updated diff --git a/docs/data-tests/volume-threshold.mdx b/docs/data-tests/volume-threshold.mdx new file mode 100644 index 000000000..067b60bfc --- /dev/null +++ b/docs/data-tests/volume-threshold.mdx @@ -0,0 +1,152 @@ +--- +title: "volume_threshold" +sidebarTitle: "Volume Threshold" +--- + +import AiGenerateTest from '/snippets/ai-generate-test.mdx'; + + + + +`elementary.volume_threshold` + +Monitors row count changes between time buckets using configurable percentage thresholds with multiple severity levels. + +Unlike `volume_anomalies` (which uses ML-based anomaly detection to determine what's "normal"), this test lets you define explicit percentage thresholds for warnings and errors — giving you precise control over when to be alerted. It uses Elementary's metric caching infrastructure to avoid recalculating row counts for buckets that have already been computed. + +### Use Case + +"Alert me if my table's row count drops or spikes by more than 10% compared to the previous period." + +### Test Logic + +1. Collect row count metrics per time bucket (using Elementary's incremental metric caching) +2. Compare the most recent completed bucket against the previous bucket +3. Calculate the percentage change between the two +4. If the previous bucket has fewer rows than `min_row_count` → **PASS** (insufficient baseline) +5. If the absolute change exceeds `error_threshold_percent` → **ERROR** +6. If the absolute change exceeds `warn_threshold_percent` → **WARN** +7. Otherwise → **PASS** + +### Test configuration + +_Required configuration: `timestamp_column`_ + +{/* prettier-ignore */} +
+ 
+  data_tests:
+      - elementary.volume_threshold:
+          arguments:
+              timestamp_column: column name # Required
+              warn_threshold_percent: int # Optional - default: 5
+              error_threshold_percent: int # Optional - default: 10
+              direction: [both | spike | drop] # Optional - default: both
+              time_bucket: # Optional
+                period: [hour | day | week | month]
+                count: int
+              where_expression: sql expression # Optional
+              days_back: int # Optional - default: 14
+              backfill_days: int # Optional - default: 2
+              min_row_count: int # Optional - default: 100
+ 
+
+ + + +```yml Models +models: + - name: < model name > + data_tests: + - elementary.volume_threshold: + arguments: + timestamp_column: < column name > # Required + warn_threshold_percent: < int > # Optional - default: 5 + error_threshold_percent: < int > # Optional - default: 10 + direction: < both | spike | drop > # Optional - default: both +``` + +```yml Default thresholds (5% warn, 10% error) +models: + - name: daily_orders + data_tests: + - elementary.volume_threshold: + arguments: + timestamp_column: created_at + config: + tags: ["elementary"] +``` + +```yml Custom thresholds +models: + - name: critical_transactions + data_tests: + - elementary.volume_threshold: + arguments: + timestamp_column: transaction_time + warn_threshold_percent: 3 + error_threshold_percent: 8 + direction: drop + config: + tags: ["elementary"] + severity: error +``` + +```yml With time bucket and filter +models: + - name: hourly_events + data_tests: + - elementary.volume_threshold: + arguments: + timestamp_column: event_timestamp + warn_threshold_percent: 10 + error_threshold_percent: 25 + direction: both + time_bucket: + period: hour + count: 1 + where_expression: "event_type = 'purchase'" + config: + tags: ["elementary"] +``` + + + +### Features + +- **Dual severity levels**: Separate thresholds for warnings and errors, giving you graduated alerting +- **Directional monitoring**: Choose to monitor `both` directions, only `spike` (increases), or only `drop` (decreases) +- **Incremental metric caching**: Uses Elementary's `data_monitoring_metrics` table to avoid recalculating row counts for previously computed time buckets +- **Minimum baseline protection**: The `min_row_count` parameter prevents false alerts when the baseline is too small +- **Configurable time buckets**: Works with hourly, daily, weekly, or monthly buckets + +### Parameters + +| Parameter | Required | Default | Description | +| ------------------------- | -------- | ------- | 
---------------------------------------------------------------------------- | +| `timestamp_column` | Yes | — | Column to determine time periods | +| `warn_threshold_percent` | No | 5 | Percentage change that triggers a warning | +| `error_threshold_percent` | No | 10 | Percentage change that triggers an error | +| `direction` | No | `both` | Direction to monitor: `both`, `spike`, or `drop` | +| `time_bucket` | No | `{period: day, count: 1}` | Time bucket configuration | +| `where_expression` | No | — | SQL expression to filter the data | +| `days_back` | No | 14 | Days of metric history to retain | +| `backfill_days` | No | 2 | Days to recalculate on each run | +| `min_row_count` | No | 100 | Minimum rows in the previous bucket required to trigger the check | + +### Comparison with volume_anomalies + +| Feature | `volume_threshold` | `volume_anomalies` | +| --- | --- | --- | +| Detection method | Fixed percentage thresholds | ML-based anomaly detection | +| Severity levels | Dual (warn + error) | Single (pass/fail) | +| Best for | Known acceptable ranges | Unknown/variable patterns | +| Configuration | Explicit thresholds | Sensitivity tuning | +| Baseline | Previous bucket | Training period average | + +### Notes + +- The `warn_threshold_percent` must be less than or equal to `error_threshold_percent` +- The test uses Elementary's metric caching infrastructure — row counts for previously computed time buckets are reused across runs +- If the previous bucket has fewer rows than `min_row_count`, the test passes (insufficient data for a meaningful comparison) +- The test only evaluates completed time buckets diff --git a/docs/docs.json b/docs/docs.json index ee4247209..aebe7a562 100644 --- a/docs/docs.json +++ b/docs/docs.json @@ -436,7 +436,9 @@ "group": "Other Tests", "pages": [ "data-tests/python-tests", - "data-tests/execution-sla" + "data-tests/execution-sla", + "data-tests/data-freshness-sla", + "data-tests/volume-threshold" ] }, {