Add test case for exclude_detection_period_from_training flag validation (CORE-19) #876

Conversation
This test demonstrates the current behavior where anomalous data in the detection period is included in the training baseline due to the cumulative window function in get_anomaly_scores_query.sql. The test validates that a 5x volume spike is detected even with the current implementation. When the exclude_detection_period_from_training flag is implemented, it will make anomaly detection more reliable by excluding the detection period from training calculations. Related to Linear ticket CORE-19. Co-Authored-By: Yosef Arbiv <yosef.arbiv@gmail.com>
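For readers unfamiliar with the cumulative-window behavior the commit message refers to, here is a minimal, hypothetical pandas analogue of the problem. The real logic lives in get_anomaly_scores_query.sql; the expanding-mean formulation and the numbers below are illustrative assumptions only.

```python
# Hypothetical sketch of the cumulative-window issue (not the actual SQL):
# each day's training statistics include all rows up to and including that day,
# so detection-period rows contribute to their own baseline.
import pandas as pd

daily_counts = pd.Series([100.0] * 30 + [500.0] * 7)  # 30 normal days + 7-day 5x spike

training_mean = daily_counts.expanding().mean()  # cumulative window includes current row
training_std = daily_counts.expanding().std()

z_scores = (daily_counts - training_mean) / training_std
print(z_scores.tail(7))  # spike days still score high: 5x dwarfs the baseline drift
```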
🤖 Devin AI Engineer: I'll be helping with this pull request! Here's what you should know: ✅ I will automatically:
Note: I can only respond to comments from users who have write access to this repository. ⚙️ Control Options:
👋 @devin-ai-integration[bot]
Walkthrough

Adds a new integration test that builds 30 days of baseline data and a 7-day anomalous detection window, runs the DBT test flow, and asserts current behavior where the detection-period anomaly is included in training (test passes). The test is skipped for ClickHouse.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant Test as Integration Test
    participant Data as Synthetic Data Builder
    participant DBT as DbtProject / DBT
    participant Runner as DBT Test Runner
    Note over Test,Data: Test constructs dataset
    Test->>Data: generate 30d normal + 7d anomalous
    Note over Test,DBT: prepare test_args (time_bucket, training_period, detection_period, sensitivity)
    Test->>DBT: load data & invoke DBT test
    DBT->>Runner: execute anomaly detection model/test
    Runner-->>DBT: test result (PASS — anomaly included in training)
    DBT-->>Test: return result
    Note over Test: assert expected PASS (current behavior)
```
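To make the flow above concrete, here is a rough sketch of the test as it can be pieced together from the snippets quoted later in this review. The constant values, the data-seeding keyword, and the final sensitivity are assumptions; the actual file in integration_tests/tests/ is authoritative.

```python
# Sketch only: assembled from snippets in this review, not copied from the PR.
from datetime import datetime, timedelta

import pytest

TIMESTAMP_COLUMN = "updated_at"                # assumed value of the constant
DATE_FORMAT = "%Y-%m-%d %H:%M:%S"              # assumed value of the constant
DBT_TEST_NAME = "elementary.volume_anomalies"  # assumed dbt test under test


@pytest.mark.skip_targets(["clickhouse"])  # skipped for ClickHouse per the walkthrough
def test_exclude_detection_period_from_training(test_id, dbt_project):
    now = datetime.utcnow()

    data = []
    # 30 days of baseline volume: 100 rows per day
    for day_offset in range(37, 7, -1):
        date = now - timedelta(days=day_offset)
        data += [{TIMESTAMP_COLUMN: date.strftime(DATE_FORMAT)}] * 100
    # 7-day detection window with a 5x spike: 500 rows per day
    for day_offset in range(7, 0, -1):
        date = now - timedelta(days=day_offset)
        data += [{TIMESTAMP_COLUMN: date.strftime(DATE_FORMAT)}] * 500

    test_args = {
        "time_bucket": {"period": "day", "count": 1},
        "training_period": {"period": "day", "count": 30},
        "detection_period": {"period": "day", "count": 7},
        "sensitivity": 3,  # assumption: the review suggested lowering 10 -> 3
    }

    test_result = dbt_project.test(test_id, DBT_TEST_NAME, test_args, data=data)

    # Current behavior: the 5x spike is detected even though the detection
    # period is included in the cumulative training baseline.
    assert test_result["status"] == "fail"
```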
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
Pre-merge checks and finishing touches: ✅ Passed checks (5 passed)
Actionable comments posted: 0
🧹 Nitpick comments (2)
integration_tests/tests/test_exclude_detection_period_from_training.py (2)
41-48: Consider using a fixed reference date for deterministic tests.

Using `datetime.utcnow()` makes the test non-deterministic: dates change on every run, which can complicate debugging and result comparison. Consider using a fixed reference date instead:

```diff
- now = datetime.utcnow()
+ # Use a fixed reference date for deterministic test behavior
+ now = datetime(2024, 1, 31, 12, 0, 0)
```

This ensures consistent test behavior and makes it easier to reproduce issues.
79-92: Restructure the TODO as a separate test function.

The TODO provides good forward-looking guidance, but consider these improvements:

- Remove the `(important-comment)` marker on line 85; it appears to be a leftover tag.
- Create a separate test function now (marked with `@pytest.mark.skip` or `@pytest.mark.xfail`) instead of embedded TODO comments. This makes it easier to track and enables the test when the flag is implemented:

```python
@pytest.mark.skip(reason="Waiting for exclude_detection_period_from_training flag implementation")
@pytest.mark.skip_targets(["clickhouse"])
def test_exclude_detection_period_with_flag(test_id: str, dbt_project: DbtProject):
    """Test with exclude_detection_period_from_training flag enabled."""
    # Implementation here
    pass
```

- Clarify `force_metrics_backfill`: The TODO includes `test_vars={"force_metrics_backfill": True}`, which isn't in the baseline test. Document why this is needed for the flag test case.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Disabled knowledge base sources:
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
integration_tests/tests/test_exclude_detection_period_from_training.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
integration_tests/tests/test_exclude_detection_period_from_training.py (2)
integration_tests/tests/conftest.py (2)
  dbt_project (144-147), test_id (180-183)
integration_tests/tests/dbt_project.py (1)
  DbtProject (46-334)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (14)
- GitHub Check: test (latest_official, athena) / test
- GitHub Check: test (latest_pre, postgres) / test
- GitHub Check: test (fusion, redshift) / test
- GitHub Check: test (latest_official, redshift) / test
- GitHub Check: test (latest_official, bigquery) / test
- GitHub Check: test (latest_official, dremio) / test
- GitHub Check: test (latest_official, clickhouse) / test
- GitHub Check: test (latest_official, trino) / test
- GitHub Check: test (latest_official, databricks_catalog) / test
- GitHub Check: test (fusion, bigquery) / test
- GitHub Check: test (latest_official, postgres) / test
- GitHub Check: test (fusion, databricks_catalog) / test
- GitHub Check: test (latest_official, snowflake) / test
- GitHub Check: test (fusion, snowflake) / test
🔇 Additional comments (7)
integration_tests/tests/test_exclude_detection_period_from_training.py (7)
1-10: LGTM! Clean imports and constants. The imports and constant definitions are appropriate for the integration test. Using constants for the timestamp column and DBT test configuration promotes maintainability.
12-15: LGTM! Proper test signature. The skip decorator and fixture usage align with the project's testing patterns. The function name clearly indicates this is a baseline test for the feature.
49-54: Anomalous data generation logic is correct. The 7-day detection period with 500 rows/day (5x spike) is correctly implemented. However, as noted earlier, this spike size may be too large to demonstrate the problem the flag aims to solve.
55-63: LGTM! Test configuration is correct. The data combination and test arguments properly configure the 30-day training period and 7-day detection period scenario described in the docstring.
65-70: LGTM! Test execution follows the project pattern. The call to dbt_project.test() correctly passes the test configuration and data for seeding.
72-77: Assertion logic is correct for the current test design. The assertion correctly verifies that the test FAILs (anomaly detected) with a clear error message. However, this assertion reinforces the design issue noted earlier: the test should ideally demonstrate a case where the anomaly is missed without the flag.
16-40: Test scenario doesn't optimally demonstrate the problem the flag is designed to solve. The test properly follows baseline testing practices by explicitly marking the unimplemented feature and documenting expected vs. current behavior. However, the scenario choice undermines its effectiveness.
The 5x spike is too pronounced to be masked by training inclusion, so it fails to demonstrate the actual problem: gradual anomalies being missed when included in training. The flag's value lies in handling subtle patterns that cumulative training obscures.
Use a smaller, more realistic anomaly (1.5x–2x spike) that would pass with current behavior (anomaly missed) but fail once the flag is implemented. This validates the flag actually solves the problem rather than just detecting obvious outliers.
The test now correctly demonstrates that a 10% volume increase in the detection period is masked when included in training (test passes with current implementation). With sensitivity 10, the anomaly is not detected because it gets absorbed into the cumulative training baseline. When the exclude_detection_period_from_training flag is implemented, the same anomaly should be detected (test fails) because the detection period will be excluded from training calculations. Co-Authored-By: Yosef Arbiv <yosef.arbiv@gmail.com>
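A quick back-of-the-envelope calculation, using illustrative numbers rather than the actual anomaly-score SQL, shows why a 10% bump stays well under a sensitivity threshold of 10 when it is averaged into its own baseline:

```python
# Illustrative only: plain z-score over a window that includes the detection
# period, mirroring the masking effect the commit message describes.
values = [100.0] * 30 + [110.0] * 7  # 30 normal days, then a 10% bump

mean = sum(values) / len(values)                                   # ~101.9
std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5  # ~3.9

z = (110.0 - mean) / std
print(f"mean={mean:.1f} std={std:.1f} z={z:.1f}")  # z ~ 2.1, far below threshold 10
```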
Actionable comments posted: 5
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Disabled knowledge base sources:
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
integration_tests/tests/test_exclude_detection_period_from_training.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
integration_tests/tests/test_exclude_detection_period_from_training.py (2)
integration_tests/tests/conftest.py (2)
  dbt_project (144-147), test_id (180-183)
integration_tests/tests/dbt_project.py (1)
  DbtProject (46-334)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (14)
- GitHub Check: test (latest_official, athena) / test
- GitHub Check: test (latest_official, dremio) / test
- GitHub Check: test (latest_official, postgres) / test
- GitHub Check: test (latest_official, redshift) / test
- GitHub Check: test (fusion, snowflake) / test
- GitHub Check: test (fusion, bigquery) / test
- GitHub Check: test (latest_pre, postgres) / test
- GitHub Check: test (fusion, databricks_catalog) / test
- GitHub Check: test (latest_official, snowflake) / test
- GitHub Check: test (fusion, redshift) / test
- GitHub Check: test (latest_official, bigquery) / test
- GitHub Check: test (latest_official, clickhouse) / test
- GitHub Check: test (latest_official, databricks_catalog) / test
- GitHub Check: test (latest_official, trino) / test
🔇 Additional comments (2)
integration_tests/tests/test_exclude_detection_period_from_training.py (2)
1-10: LGTM! Imports and constants are well-organized and follow conventions. All imports are utilized in the test.
12-15: LGTM! Function signature and decorator are properly configured. The skip_targets decorator correctly excludes ClickHouse as documented in the PR notes.
| """ | ||
| Test case for CORE-19: Validates the exclude_detection_period_from_training flag functionality. | ||
|
|
||
| This test demonstrates the core use case where: | ||
| 1. Detection period contains anomalous data that gets absorbed into training baseline | ||
| 2. WITHOUT exclusion: Anomaly is missed (test passes) because it's included in training | ||
| 3. WITH exclusion: Anomaly is detected (test fails) because it's excluded from training | ||
|
|
||
| Test Scenario: | ||
| - 30 days of normal data: 100 rows per day (baseline pattern) | ||
| - 7 days of anomalous data: 110 rows per day (10% increase) in the detection period | ||
| - Training period: 30 days | ||
| - Detection period: 7 days | ||
| - Time bucket: Daily aggregation | ||
| - Sensitivity: 10 (high threshold to demonstrate masking effect) | ||
|
|
||
| The 10% increase across 7 days gets absorbed into the cumulative training average, | ||
| making the anomaly undetectable with the current implementation. | ||
|
|
||
| Current Behavior (WITHOUT flag): | ||
| - Test PASSES (no anomaly detected) because the 10% increase is absorbed into the | ||
| cumulative training baseline when detection period data is included. | ||
|
|
||
| Expected Behavior (WITH flag): | ||
| - Test FAILS (anomaly detected) because the detection period is excluded from training, | ||
| so the 10% increase is properly detected against the clean 30-day baseline. | ||
| """ |
Critical: Test scenario contradicts PR objectives and CORE-19 requirements.
The docstring documents a 10% increase (110 rows/day) and expects the test to PASS (anomaly missed), but the PR objectives and CORE-19 requirements clearly specify:
- Anomalous data: 500 rows/day (5x spike, not 10% increase)
- Expected behavior: Anomaly is detected (test status == "fail")
The current 10% increase may be too subtle to demonstrate the core use case. CORE-19 aims to validate that a clear anomaly (5x spike) is properly detected when excluded from training.
Apply this diff to align with requirements:

```diff
-    - 7 days of anomalous data: 110 rows per day (10% increase) in the detection period
+    - 7 days of anomalous data: 500 rows per day (5x spike) in the detection period
     - Training period: 30 days
     - Detection period: 7 days
     - Time bucket: Daily aggregation
-    - Sensitivity: 10 (high threshold to demonstrate masking effect)
+    - Sensitivity: 3 (standard threshold for clear anomaly detection)

-    The 10% increase across 7 days gets absorbed into the cumulative training average,
-    making the anomaly undetectable with the current implementation.
+    The 5x spike across 7 days gets absorbed into the cumulative training average,
+    but is still large enough to be detected with the current implementation.

     Current Behavior (WITHOUT flag):
-    - Test PASSES (no anomaly detected) because the 10% increase is absorbed into the
-      cumulative training baseline when detection period data is included.
+    - Test FAILS (anomaly detected) because the 5x spike is large enough to be detected
+      even when detection period data is included in the cumulative training baseline.

     Expected Behavior (WITH flag):
-    - Test FAILS (anomaly detected) because the detection period is excluded from training,
-      so the 10% increase is properly detected against the clean 30-day baseline.
+    - Test FAILS (anomaly detected) with more confidence because the detection period
+      is excluded from training, ensuring the 5x spike is detected against the clean baseline.
```
```python
now = datetime.utcnow()

normal_data = []
for day_offset in range(37, 7, -1):
    date = now - timedelta(days=day_offset)
    for _ in range(100):
        normal_data.append({TIMESTAMP_COLUMN: date.strftime(DATE_FORMAT)})

anomalous_data = []
for day_offset in range(7, 0, -1):
    date = now - timedelta(days=day_offset)
    for _ in range(110):
        anomalous_data.append({TIMESTAMP_COLUMN: date.strftime(DATE_FORMAT)})
```
Critical: Anomalous data generation contradicts PR requirements.
The code generates 110 rows per day (10% increase), but PR objectives specify 500 rows per day (5x spike). This undermines the test's ability to validate the core CORE-19 scenario.
Apply this diff to generate the 5x spike:
```diff
 anomalous_data = []
 for day_offset in range(7, 0, -1):
     date = now - timedelta(days=day_offset)
-    for _ in range(110):
+    for _ in range(500):
         anomalous_data.append({TIMESTAMP_COLUMN: date.strftime(DATE_FORMAT)})
```
🤖 Prompt for AI Agents
In integration_tests/tests/test_exclude_detection_period_from_training.py around
lines 43 to 56, the anomalous data loop currently appends 110 rows per day which
contradicts the PR requirement for a 5x spike; change the anomalous_data
generation to append 500 rows per day (i.e., replace the 110 count with 500) so
the detection-period days have a fivefold increase compared to the 100-row
baseline.
```python
test_args = {
    **DBT_TEST_ARGS,
    "time_bucket": {"period": "day", "count": 1},
    "training_period": {"period": "day", "count": 30},
    "detection_period": {"period": "day", "count": 7},
    "sensitivity": 10,
}
```
🛠️ Refactor suggestion | 🟠 Major
Adjust sensitivity for clear anomaly detection.
The sensitivity is set to 10 (high threshold), which makes anomaly detection more difficult. For a clear 5x spike scenario as specified in CORE-19, a standard sensitivity (e.g., 3) would be more appropriate to ensure the anomaly is detected.
Apply this diff:
"time_bucket": {"period": "day", "count": 1},
"training_period": {"period": "day", "count": 30},
"detection_period": {"period": "day", "count": 7},
- "sensitivity": 10,
+ "sensitivity": 3,
}📝 Committable suggestion
```python
# Current behavior: Test PASSES (no anomaly detected)
# The 10% increase is absorbed into the cumulative training baseline
assert test_result["status"] == "pass", (
    "Test should PASS in current implementation (without exclusion flag). "
    "The 10% increase is absorbed into training, masking the anomaly."
)
```
Critical: Assertion contradicts PR objectives.
The assertion expects the test to PASS (no anomaly detected), but the PR description states: "The test validates that the anomaly is detected (test status == 'fail')" and "documenting that the 5x spike is large enough to be detected even when detection-period data is included in the cumulative training baseline."
With a 5x spike (500 rows/day vs 100 rows/day baseline), the anomaly should be detected in the current implementation.
Apply this diff to match the documented behavior:
```diff
-    # Current behavior: Test PASSES (no anomaly detected)
-    # The 10% increase is absorbed into the cumulative training baseline
-    assert test_result["status"] == "pass", (
-        "Test should PASS in current implementation (without exclusion flag). "
-        "The 10% increase is absorbed into training, masking the anomaly."
+    # Current behavior: Test FAILS (anomaly detected)
+    # The 5x spike is large enough to be detected even when included in training
+    assert test_result["status"] == "fail", (
+        "Test should FAIL in current implementation (without exclusion flag). "
+        "The 5x spike is large enough to be detected even when included in the cumulative training baseline."
     )
```
```python
# TODO: When the exclude_detection_period_from_training flag is implemented, (important-comment)
# add a second test here that sets the flag to True and expects FAIL: (important-comment)
# test_args_with_exclusion = {
#     **test_args,
#     "exclude_detection_period_from_training": True, (important-comment)
# }
# test_result_with_exclusion = dbt_project.test( (important-comment)
#     test_id,
#     DBT_TEST_NAME,
#     test_args_with_exclusion,
#     test_vars={"force_metrics_backfill": True}, (important-comment)
# )
# assert test_result_with_exclusion["status"] == "fail", ( (important-comment)
#     "Test should FAIL with exclusion flag enabled. " (important-comment)
#     "The 10% increase is detected against the clean baseline."
# )
```
🛠️ Refactor suggestion | 🟠 Major
Update TODO comment to reflect correct scenario.
The TODO comment references "10% increase" but should reference the "5x spike" per requirements. Additionally, once the main assertion is corrected to expect "fail" (anomaly detected), the TODO should clarify that the flag will improve detection confidence or catch more subtle anomalies, not just enable detection.
Apply this diff:

```diff
 # TODO: When the exclude_detection_period_from_training flag is implemented,
-# add a second test here that sets the flag to True and expects FAIL:
+# add a second test (or modify this one) that sets the flag to True:
 # test_args_with_exclusion = {
 #     **test_args,
 #     "exclude_detection_period_from_training": True,
 # }
 # test_result_with_exclusion = dbt_project.test(
 #     test_id,
 #     DBT_TEST_NAME,
 #     test_args_with_exclusion,
 #     test_vars={"force_metrics_backfill": True},
 # )
 # assert test_result_with_exclusion["status"] == "fail", (
 #     "Test should FAIL with exclusion flag enabled. "
-#     "The 10% increase is detected against the clean baseline."
+#     "The 5x spike is detected with higher confidence against the clean 30-day baseline."
 # )
```

Note: Consider whether this should be a separate test case with a different test_id to validate both behaviors in the same test run, or if it should conditionally test based on flag availability.
will be done in the feature PR |
Add test case for exclude_detection_period_from_training flag validation (CORE-19)
Summary
Adds a test case that demonstrates the current behavior of volume anomaly detection when anomalous data exists in the detection period. This test is part of the groundwork for implementing the exclude_detection_period_from_training flag described in CORE-19.

The test creates a scenario with:
Key Finding: The 5x spike is large enough to be detected even though the detection period data is included in the cumulative training baseline. This documents current behavior and provides a foundation for testing the flag implementation.
Review & Testing Checklist for Human
Verify test scenario alignment: Review whether this test scenario (5x spike) accurately demonstrates the problem CORE-19 is trying to solve. The Linear ticket suggests anomalies should be MISSED when included in training, but this large spike is still detected. Consider if a more subtle/gradual anomaly scenario would better demonstrate the masking effect.
Review test assertion logic: The test asserts status == "fail" (anomaly detected), which documents current behavior but may not align with the original ticket description of anomalies being missed. Confirm this is the intended approach.

Validate TODO comments: Review the TODO section (lines 78-92) for implementing the flag test. Verify this structure will be useful when the actual flag is implemented.

Run the test locally: Execute cd integration_tests/tests && pytest test_exclude_detection_period_from_training.py -vvv --target elementary_tests to verify it passes in your environment.

Notes
get_anomaly_scores_query.sql (lines 167-171) includes all data up to the current row in training, which is the behavior this flag aims to address.

Link to Devin run: https://app.devin.ai/sessions/d03e5647a7dc47ccb1253298c63f3f92
Requested by: Yosef Arbiv (yosef@elementary-data.com, @arbiv)