Add test case for exclude_detection_period_from_training flag validation (CORE-19) #876

Conversation
This test demonstrates the current behavior where anomalous data in the detection period is included in the training baseline due to the cumulative window function in get_anomaly_scores_query.sql. The test validates that a 5x volume spike is detected even with the current implementation. When the exclude_detection_period_from_training flag is implemented, it will make anomaly detection more reliable by excluding the detection period from training calculations. Related to Linear ticket CORE-19. Co-Authored-By: Yosef Arbiv <yosef.arbiv@gmail.com>
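For readers unfamiliar with the cumulative-window behavior the commit message refers to, here is a minimal, hypothetical pandas analogue of the problem. The real logic lives in get_anomaly_scores_query.sql; the expanding-mean formulation and the numbers below are illustrative assumptions only.

```python
# Hypothetical sketch of the cumulative-window issue (not the actual SQL):
# each day's training statistics include all rows up to and including that day,
# so detection-period rows contribute to their own baseline.
import pandas as pd

daily_counts = pd.Series([100.0] * 30 + [500.0] * 7)  # 30 normal days + 7-day 5x spike

training_mean = daily_counts.expanding().mean()  # cumulative window includes current row
training_std = daily_counts.expanding().std()

z_scores = (daily_counts - training_mean) / training_std
print(z_scores.tail(7))  # spike days still score high: 5x dwarfs the baseline drift
```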
🤖 Devin AI Engineer: I'll be helping with this pull request! Here's what you should know: ✅ I will automatically:
Note: I can only respond to comments from users who have write access to this repository. ⚙️ Control Options:
👋 @devin-ai-integration[bot]
Walkthrough

Adds a new integration test that builds 30 days of baseline data and a 7-day anomalous detection window, runs the DBT test flow, and asserts current behavior where the detection-period anomaly is included in training (test passes). The test is skipped for ClickHouse.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant Test as Integration Test
    participant Data as Synthetic Data Builder
    participant DBT as DbtProject / DBT
    participant Runner as DBT Test Runner
    Note over Test,Data: Test constructs dataset
    Test->>Data: generate 30d normal + 7d anomalous
    Note over Test,DBT: prepare test_args (time_bucket, training_period, detection_period, sensitivity)
    Test->>DBT: load data & invoke DBT test
    DBT->>Runner: execute anomaly detection model/test
    Runner-->>DBT: test result (PASS — anomaly included in training)
    DBT-->>Test: return result
    Note over Test: assert expected PASS (current behavior)
```
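To make the flow above concrete, here is a rough sketch of the test as it can be pieced together from the snippets quoted later in this review. The constant values, the data-seeding keyword, and the final sensitivity are assumptions; the actual file in integration_tests/tests/ is authoritative.

```python
# Sketch only: assembled from snippets in this review, not copied from the PR.
from datetime import datetime, timedelta

import pytest

TIMESTAMP_COLUMN = "updated_at"                # assumed value of the constant
DATE_FORMAT = "%Y-%m-%d %H:%M:%S"              # assumed value of the constant
DBT_TEST_NAME = "elementary.volume_anomalies"  # assumed dbt test under test


@pytest.mark.skip_targets(["clickhouse"])  # skipped for ClickHouse per the walkthrough
def test_exclude_detection_period_from_training(test_id, dbt_project):
    now = datetime.utcnow()

    data = []
    # 30 days of baseline volume: 100 rows per day
    for day_offset in range(37, 7, -1):
        date = now - timedelta(days=day_offset)
        data += [{TIMESTAMP_COLUMN: date.strftime(DATE_FORMAT)}] * 100
    # 7-day detection window with a 5x spike: 500 rows per day
    for day_offset in range(7, 0, -1):
        date = now - timedelta(days=day_offset)
        data += [{TIMESTAMP_COLUMN: date.strftime(DATE_FORMAT)}] * 500

    test_args = {
        "time_bucket": {"period": "day", "count": 1},
        "training_period": {"period": "day", "count": 30},
        "detection_period": {"period": "day", "count": 7},
        "sensitivity": 3,  # assumption: the review suggested lowering 10 -> 3
    }

    test_result = dbt_project.test(test_id, DBT_TEST_NAME, test_args, data=data)

    # Current behavior: the 5x spike is detected even though the detection
    # period is included in the cumulative training baseline.
    assert test_result["status"] == "fail"
```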
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
Pre-merge checks and finishing touches: ✅ Passed checks (5 passed)
Actionable comments posted: 0
🧹 Nitpick comments (2)
integration_tests/tests/test_exclude_detection_period_from_training.py (2)
41-48: Consider using a fixed reference date for deterministic tests.

Using `datetime.utcnow()` makes the test non-deterministic: dates change on every run, which can complicate debugging and result comparison. Consider using a fixed reference date instead:

```diff
- now = datetime.utcnow()
+ # Use a fixed reference date for deterministic test behavior
+ now = datetime(2024, 1, 31, 12, 0, 0)
```

This ensures consistent test behavior and makes it easier to reproduce issues.
79-92: Restructure the TODO as a separate test function.

The TODO provides good forward-looking guidance, but consider these improvements:

- Remove the `(important-comment)` marker on line 85; it appears to be a leftover tag.
- Create a separate test function now (marked with `@pytest.mark.skip` or `@pytest.mark.xfail`) instead of embedded TODO comments. This makes it easier to track and enables the test when the flag is implemented:

```python
@pytest.mark.skip(reason="Waiting for exclude_detection_period_from_training flag implementation")
@pytest.mark.skip_targets(["clickhouse"])
def test_exclude_detection_period_with_flag(test_id: str, dbt_project: DbtProject):
    """Test with exclude_detection_period_from_training flag enabled."""
    # Implementation here
    pass
```

- Clarify `force_metrics_backfill`: The TODO includes `test_vars={"force_metrics_backfill": True}`, which isn't in the baseline test. Document why this is needed for the flag test case.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Disabled knowledge base sources:
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
integration_tests/tests/test_exclude_detection_period_from_training.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
integration_tests/tests/test_exclude_detection_period_from_training.py (2)
integration_tests/tests/conftest.py (2)
  dbt_project (144-147), test_id (180-183)
integration_tests/tests/dbt_project.py (1)
  DbtProject (46-334)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (14)
- GitHub Check: test (latest_official, athena) / test
- GitHub Check: test (latest_pre, postgres) / test
- GitHub Check: test (fusion, redshift) / test
- GitHub Check: test (latest_official, redshift) / test
- GitHub Check: test (latest_official, bigquery) / test
- GitHub Check: test (latest_official, dremio) / test
- GitHub Check: test (latest_official, clickhouse) / test
- GitHub Check: test (latest_official, trino) / test
- GitHub Check: test (latest_official, databricks_catalog) / test
- GitHub Check: test (fusion, bigquery) / test
- GitHub Check: test (latest_official, postgres) / test
- GitHub Check: test (fusion, databricks_catalog) / test
- GitHub Check: test (latest_official, snowflake) / test
- GitHub Check: test (fusion, snowflake) / test
🔇 Additional comments (7)
integration_tests/tests/test_exclude_detection_period_from_training.py (7)
1-10: LGTM! Clean imports and constants. The imports and constant definitions are appropriate for the integration test. Using constants for the timestamp column and DBT test configuration promotes maintainability.
12-15: LGTM! Proper test signature. The skip decorator and fixture usage align with the project's testing patterns. The function name clearly indicates this is a baseline test for the feature.
49-54: Anomalous data generation logic is correct. The 7-day detection period with 500 rows/day (5x spike) is correctly implemented. However, as noted earlier, this spike size may be too large to demonstrate the problem the flag aims to solve.
55-63: LGTM! Test configuration is correct. The data combination and test arguments properly configure the 30-day training period and 7-day detection period scenario described in the docstring.
65-70: LGTM! Test execution follows the project pattern. The call to dbt_project.test() correctly passes the test configuration and data for seeding.
72-77: Assertion logic is correct for the current test design. The assertion correctly verifies that the test FAILs (anomaly detected) with a clear error message. However, this assertion reinforces the design issue noted earlier: the test should ideally demonstrate a case where the anomaly is missed without the flag.
16-40: Test scenario doesn't optimally demonstrate the problem the flag is designed to solve. The test properly follows baseline testing practices by explicitly marking the unimplemented feature and documenting expected vs. current behavior. However, the scenario choice undermines its effectiveness.
The 5x spike is too pronounced to be masked by training inclusion, so it fails to demonstrate the actual problem: gradual anomalies being missed when included in training. The flag's value lies in handling subtle patterns that cumulative training obscures.
Use a smaller, more realistic anomaly (1.5x–2x spike) that would pass with current behavior (anomaly missed) but fail once the flag is implemented. This validates the flag actually solves the problem rather than just detecting obvious outliers.
The test now correctly demonstrates that a 10% volume increase in the detection period is masked when included in training (test passes with current implementation). With sensitivity 10, the anomaly is not detected because it gets absorbed into the cumulative training baseline. When the exclude_detection_period_from_training flag is implemented, the same anomaly should be detected (test fails) because the detection period will be excluded from training calculations. Co-Authored-By: Yosef Arbiv <yosef.arbiv@gmail.com>
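A quick back-of-the-envelope calculation, using illustrative numbers rather than the actual anomaly-score SQL, shows why a 10% bump stays well under a sensitivity threshold of 10 when it is averaged into its own baseline:

```python
# Illustrative only: plain z-score over a window that includes the detection
# period, mirroring the masking effect the commit message describes.
values = [100.0] * 30 + [110.0] * 7  # 30 normal days, then a 10% bump

mean = sum(values) / len(values)                                   # ~101.9
std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5  # ~3.9

z = (110.0 - mean) / std
print(f"mean={mean:.1f} std={std:.1f} z={z:.1f}")  # z ~ 2.1, far below threshold 10
```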
Actionable comments posted: 5
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Disabled knowledge base sources:
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
integration_tests/tests/test_exclude_detection_period_from_training.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
integration_tests/tests/test_exclude_detection_period_from_training.py (2)
integration_tests/tests/conftest.py (2)
  dbt_project (144-147), test_id (180-183)
integration_tests/tests/dbt_project.py (1)
  DbtProject (46-334)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (14)
- GitHub Check: test (latest_official, athena) / test
- GitHub Check: test (latest_official, dremio) / test
- GitHub Check: test (latest_official, postgres) / test
- GitHub Check: test (latest_official, redshift) / test
- GitHub Check: test (fusion, snowflake) / test
- GitHub Check: test (fusion, bigquery) / test
- GitHub Check: test (latest_pre, postgres) / test
- GitHub Check: test (fusion, databricks_catalog) / test
- GitHub Check: test (latest_official, snowflake) / test
- GitHub Check: test (fusion, redshift) / test
- GitHub Check: test (latest_official, bigquery) / test
- GitHub Check: test (latest_official, clickhouse) / test
- GitHub Check: test (latest_official, databricks_catalog) / test
- GitHub Check: test (latest_official, trino) / test
🔇 Additional comments (2)
integration_tests/tests/test_exclude_detection_period_from_training.py (2)
1-10: LGTM! Imports and constants are well-organized and follow conventions. All imports are utilized in the test.
12-15: LGTM! Function signature and decorator are properly configured. The skip_targets decorator correctly excludes ClickHouse as documented in the PR notes.
| """ | ||
| Test case for CORE-19: Validates the exclude_detection_period_from_training flag functionality. | ||
|
|
||
| This test demonstrates the core use case where: | ||
| 1. Detection period contains anomalous data that gets absorbed into training baseline | ||
| 2. WITHOUT exclusion: Anomaly is missed (test passes) because it's included in training | ||
| 3. WITH exclusion: Anomaly is detected (test fails) because it's excluded from training | ||
|
|
||
| Test Scenario: | ||
| - 30 days of normal data: 100 rows per day (baseline pattern) | ||
| - 7 days of anomalous data: 110 rows per day (10% increase) in the detection period | ||
| - Training period: 30 days | ||
| - Detection period: 7 days | ||
| - Time bucket: Daily aggregation | ||
| - Sensitivity: 10 (high threshold to demonstrate masking effect) | ||
|
|
||
| The 10% increase across 7 days gets absorbed into the cumulative training average, | ||
| making the anomaly undetectable with the current implementation. | ||
|
|
||
| Current Behavior (WITHOUT flag): | ||
| - Test PASSES (no anomaly detected) because the 10% increase is absorbed into the | ||
| cumulative training baseline when detection period data is included. | ||
|
|
||
| Expected Behavior (WITH flag): | ||
| - Test FAILS (anomaly detected) because the detection period is excluded from training, | ||
| so the 10% increase is properly detected against the clean 30-day baseline. | ||
| """ |
Critical: Test scenario contradicts PR objectives and CORE-19 requirements.
The docstring documents a 10% increase (110 rows/day) and expects the test to PASS (anomaly missed), but the PR objectives and CORE-19 requirements clearly specify:
- Anomalous data: 500 rows/day (5x spike, not 10% increase)
- Expected behavior: Anomaly is detected (test status == "fail")
The current 10% increase may be too subtle to demonstrate the core use case. CORE-19 aims to validate that a clear anomaly (5x spike) is properly detected when excluded from training.
Apply this diff to align with requirements:

```diff
-    - 7 days of anomalous data: 110 rows per day (10% increase) in the detection period
+    - 7 days of anomalous data: 500 rows per day (5x spike) in the detection period
     - Training period: 30 days
     - Detection period: 7 days
     - Time bucket: Daily aggregation
-    - Sensitivity: 10 (high threshold to demonstrate masking effect)
+    - Sensitivity: 3 (standard threshold for clear anomaly detection)

-    The 10% increase across 7 days gets absorbed into the cumulative training average,
-    making the anomaly undetectable with the current implementation.
+    The 5x spike across 7 days gets absorbed into the cumulative training average,
+    but is still large enough to be detected with the current implementation.

     Current Behavior (WITHOUT flag):
-    - Test PASSES (no anomaly detected) because the 10% increase is absorbed into the
-      cumulative training baseline when detection period data is included.
+    - Test FAILS (anomaly detected) because the 5x spike is large enough to be detected
+      even when detection period data is included in the cumulative training baseline.

     Expected Behavior (WITH flag):
-    - Test FAILS (anomaly detected) because the detection period is excluded from training,
-      so the 10% increase is properly detected against the clean 30-day baseline.
+    - Test FAILS (anomaly detected) with more confidence because the detection period
+      is excluded from training, ensuring the 5x spike is detected against the clean baseline.
```
```python
now = datetime.utcnow()

normal_data = []
for day_offset in range(37, 7, -1):
    date = now - timedelta(days=day_offset)
    for _ in range(100):
        normal_data.append({TIMESTAMP_COLUMN: date.strftime(DATE_FORMAT)})

anomalous_data = []
for day_offset in range(7, 0, -1):
    date = now - timedelta(days=day_offset)
    for _ in range(110):
        anomalous_data.append({TIMESTAMP_COLUMN: date.strftime(DATE_FORMAT)})
```
Critical: Anomalous data generation contradicts PR requirements.
The code generates 110 rows per day (10% increase), but PR objectives specify 500 rows per day (5x spike). This undermines the test's ability to validate the core CORE-19 scenario.
Apply this diff to generate the 5x spike:
```diff
 anomalous_data = []
 for day_offset in range(7, 0, -1):
     date = now - timedelta(days=day_offset)
-    for _ in range(110):
+    for _ in range(500):
         anomalous_data.append({TIMESTAMP_COLUMN: date.strftime(DATE_FORMAT)})
```
🤖 Prompt for AI Agents
In integration_tests/tests/test_exclude_detection_period_from_training.py around
lines 43 to 56, the anomalous data loop currently appends 110 rows per day which
contradicts the PR requirement for a 5x spike; change the anomalous_data
generation to append 500 rows per day (i.e., replace the 110 count with 500) so
the detection-period days have a fivefold increase compared to the 100-row
baseline.
```python
test_args = {
    **DBT_TEST_ARGS,
    "time_bucket": {"period": "day", "count": 1},
    "training_period": {"period": "day", "count": 30},
    "detection_period": {"period": "day", "count": 7},
    "sensitivity": 10,
}
```
🛠️ Refactor suggestion | 🟠 Major
Adjust sensitivity for clear anomaly detection.
The sensitivity is set to 10 (high threshold), which makes anomaly detection more difficult. For a clear 5x spike scenario as specified in CORE-19, a standard sensitivity (e.g., 3) would be more appropriate to ensure the anomaly is detected.
Apply this diff:
"time_bucket": {"period": "day", "count": 1},
"training_period": {"period": "day", "count": 30},
"detection_period": {"period": "day", "count": 7},
- "sensitivity": 10,
+ "sensitivity": 3,
}📝 Committable suggestion
```python
# Current behavior: Test PASSES (no anomaly detected)
# The 10% increase is absorbed into the cumulative training baseline
assert test_result["status"] == "pass", (
    "Test should PASS in current implementation (without exclusion flag). "
    "The 10% increase is absorbed into training, masking the anomaly."
)
```
Critical: Assertion contradicts PR objectives.
The assertion expects the test to PASS (no anomaly detected), but the PR description states: "The test validates that the anomaly is detected (test status == 'fail')" and "documenting that the 5x spike is large enough to be detected even when detection-period data is included in the cumulative training baseline."
With a 5x spike (500 rows/day vs 100 rows/day baseline), the anomaly should be detected in the current implementation.
Apply this diff to match the documented behavior:
```diff
-    # Current behavior: Test PASSES (no anomaly detected)
-    # The 10% increase is absorbed into the cumulative training baseline
-    assert test_result["status"] == "pass", (
-        "Test should PASS in current implementation (without exclusion flag). "
-        "The 10% increase is absorbed into training, masking the anomaly."
+    # Current behavior: Test FAILS (anomaly detected)
+    # The 5x spike is large enough to be detected even when included in training
+    assert test_result["status"] == "fail", (
+        "Test should FAIL in current implementation (without exclusion flag). "
+        "The 5x spike is large enough to be detected even when included in the cumulative training baseline."
     )
```
```python
# TODO: When the exclude_detection_period_from_training flag is implemented, (important-comment)
# add a second test here that sets the flag to True and expects FAIL: (important-comment)
# test_args_with_exclusion = {
#     **test_args,
#     "exclude_detection_period_from_training": True, (important-comment)
# }
# test_result_with_exclusion = dbt_project.test( (important-comment)
#     test_id,
#     DBT_TEST_NAME,
#     test_args_with_exclusion,
#     test_vars={"force_metrics_backfill": True}, (important-comment)
# )
# assert test_result_with_exclusion["status"] == "fail", ( (important-comment)
#     "Test should FAIL with exclusion flag enabled. " (important-comment)
#     "The 10% increase is detected against the clean baseline."
# )
```
🛠️ Refactor suggestion | 🟠 Major
Update TODO comment to reflect correct scenario.
The TODO comment references "10% increase" but should reference the "5x spike" per requirements. Additionally, once the main assertion is corrected to expect "fail" (anomaly detected), the TODO should clarify that the flag will improve detection confidence or catch more subtle anomalies, not just enable detection.
Apply this diff:

```diff
 # TODO: When the exclude_detection_period_from_training flag is implemented,
-# add a second test here that sets the flag to True and expects FAIL:
+# add a second test (or modify this one) that sets the flag to True:
 # test_args_with_exclusion = {
 #     **test_args,
 #     "exclude_detection_period_from_training": True,
 # }
 # test_result_with_exclusion = dbt_project.test(
 #     test_id,
 #     DBT_TEST_NAME,
 #     test_args_with_exclusion,
 #     test_vars={"force_metrics_backfill": True},
 # )
 # assert test_result_with_exclusion["status"] == "fail", (
 #     "Test should FAIL with exclusion flag enabled. "
-#     "The 10% increase is detected against the clean baseline."
+#     "The 5x spike is detected with higher confidence against the clean 30-day baseline."
 # )
```

Note: Consider whether this should be a separate test case with a different test_id to validate both behaviors in the same test run, or if it should conditionally test based on flag availability.
will be done in the feature PR |
Add test case for exclude_detection_period_from_training flag validation (CORE-19)
Summary
Adds a test case that demonstrates the current behavior of volume anomaly detection when anomalous data exists in the detection period. This test is part of the groundwork for implementing the exclude_detection_period_from_training flag described in CORE-19.

The test creates a scenario with:
Key Finding: The 5x spike is large enough to be detected even though the detection period data is included in the cumulative training baseline. This documents current behavior and provides a foundation for testing the flag implementation.
Review & Testing Checklist for Human
Verify test scenario alignment: Review whether this test scenario (5x spike) accurately demonstrates the problem CORE-19 is trying to solve. The Linear ticket suggests anomalies should be MISSED when included in training, but this large spike is still detected. Consider if a more subtle/gradual anomaly scenario would better demonstrate the masking effect.
Review test assertion logic: The test asserts status == "fail" (anomaly detected), which documents current behavior but may not align with the original ticket description of anomalies being missed. Confirm this is the intended approach.

Validate TODO comments: Review the TODO section (lines 78-92) for implementing the flag test. Verify this structure will be useful when the actual flag is implemented.

Run the test locally: Execute cd integration_tests/tests && pytest test_exclude_detection_period_from_training.py -vvv --target elementary_tests to verify it passes in your environment.

Notes
get_anomaly_scores_query.sql (lines 167-171) includes all data up to the current row in training, which is the behavior this flag aims to address.

Link to Devin run: https://app.devin.ai/sessions/d03e5647a7dc47ccb1253298c63f3f92
Requested by: Yosef Arbiv (yosef@elementary-data.com, @arbiv)