Fix stale calibration targets by deriving time_period from dataset #505
@MaxGhenis we're doing pretty well on the new income tax target from CBO. The SNAP CBO target looks equally good. We're roughly 25% off on Social Security, SSI, and EITC, which is obviously not great. I'd still highly recommend pushing this through, and we can adjust from here. We'll finally be in 2024 for local areas, mapped to the 119th Congress.
PR Review
🔴 Critical (Must Fix)
🟡 Should Address
🟢 Suggestions
Validation Summary
Recommendation: COMMENT
The core fix (deriving CBO/Treasury year from the dataset) is sound and addresses the 18% income tax gap described in #503.
- Remove hardcoded CBO_YEAR and TREASURY_YEAR constants
- Add --dataset CLI argument to etl_national_targets.py
- Derive time_period from sim.default_calculation_period
- Default to HuggingFace production dataset

The dataset itself is now the single source of truth for the calibration year, preventing future drift when updating to new base years.

Closes #503

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The CBO income_tax parameter represents positive-only receipts (refundable credit payments in excess of liability are classified as outlays, not negative receipts). Using income_tax_positive matches this definition. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
All ETL scripts now derive their target year from the dataset's default_calculation_period instead of hardcoding years. This ensures all calibration targets stay synchronized when updating to a new base year annually.

Updated scripts:
- create_initial_strata.py
- etl_age.py
- etl_irs_soi.py (with configurable --lag for IRS data delay)
- etl_medicaid.py
- etl_snap.py
- etl_state_income_tax.py

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
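The `--lag` option for IRS data can be pictured as a simple offset from the dataset year. The sketch below is illustrative only: the function name `soi_target_year` and the default lag of 2 years are assumptions, not values taken from `etl_irs_soi.py`.

```python
def soi_target_year(dataset_year: int, lag: int = 2) -> int:
    """Return the IRS SOI tax year to load for a given dataset year.

    `lag` is the number of years by which published IRS SOI data trails
    the dataset year. The default of 2 is a hypothetical placeholder.
    """
    if lag < 0:
        raise ValueError("lag must be non-negative")
    return dataset_year - lag


# With a 2024 dataset and the assumed 2-year lag, SOI targets come from 2022.
print(soi_target_year(2024))
```

Keeping the lag as a parameter rather than a second hardcoded year means only the dataset needs to change when the base year is bumped.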
- Update parse_ucgid to recognize both 5001800US (118th) and 5001900US (119th Congress)
- Expand Puerto Rico and territory filters to handle both Congress code formats
- Update TERRITORY_UCGIDS and NON_VOTING_GEO_IDS with 119th Congress codes

This ensures consistent redistricting alignment: 2024 ACS data uses 119th Congress codes natively, and IRS SOI data is converted via the 116th→119th mapping matrix.

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
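To illustrate what recognizing both prefixes involves, here is a minimal sketch. It is not the repository's `parse_ucgid`: the function name `parse_cd_ucgid`, the returned dictionary shape, and the assumption that the prefix is followed by a two-digit state FIPS code and a two-character district code are all illustrative.

```python
# Congressional-district UCGID prefixes named in the commit message,
# mapped to the Congress session they denote.
CD_PREFIXES = {"5001800US": 118, "5001900US": 119}


def parse_cd_ucgid(ucgid: str) -> dict:
    """Split a congressional-district UCGID into its parts (hypothetical helper).

    Assumes the suffix is a 2-digit state FIPS code followed by a
    2-character district code.
    """
    for prefix, congress in CD_PREFIXES.items():
        if ucgid.startswith(prefix):
            suffix = ucgid[len(prefix):]
            if len(suffix) != 4 or not suffix[:2].isdigit():
                raise ValueError(f"Unexpected district suffix in {ucgid!r}")
            return {
                "congress": congress,
                "state_fips": suffix[:2],
                "district": suffix[2:],
            }
    raise ValueError(f"Not a recognized congressional-district UCGID: {ucgid!r}")


print(parse_cd_ucgid("5001900US0601"))
```

Driving the lookup from a single prefix table means a later Congress session only needs one new entry rather than another branch in the parser.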
Revert deterministic hash-based medicaid/SSI seed logic in cps.py, update Makefile seed to 3526. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Needed for income_tax_positive variable used in loss.py. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Adds aca_ptc ingestion from IRS SOI data (code 85530) to etl_irs_soi.py and updates DATABASE_GUIDE.md to reflect stratum_group_id 119. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Cherry-picked ACA PTC database changes from #508 into this branch. This shifts all calibration target group IDs (now 61 groups, numbered 0-60, up from the previous 53).
Prevents silent no-op promotes by detecting when HF commits don't change HEAD. Adds separate promote workflow for manual gate before pushing staging files to production. Also bumps calibration epochs from 200 to 250. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
MaxGhenis
left a comment
Updated Review (Feb 7)
Since my last review, two new commits were added:
New changes
1. Cherry-pick ACA PTC targets (c54ae1c)
- Adds `aca_ptc` ingestion from IRS SOI data (code 85530) to `etl_irs_soi.py`
- Updates `DATABASE_GUIDE.md` to reflect stratum_group_id 119
- Clean integration; follows existing patterns
2. Split local area publish into build+stage and promote phases (9a7a81b)
- New `local_area_promote.yaml` workflow for manual promotion gate
- Refactored `modal_app/local_area.py`: `atomic_upload` → `upload_to_staging` + separate `promote_publish`
- No-op detection in `data_upload.py`: raises `RuntimeError` if HF commits don't change HEAD
- Epochs bump: 200 → 250 in `enhanced_cps.py`
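The no-op detection idea can be sketched as a before/after comparison of the repository HEAD. This is not the code in `data_upload.py`: the function name `commit_and_verify` and the two callables are hypothetical stand-ins for whatever reads the HEAD revision and performs the commit.

```python
from typing import Callable


def commit_and_verify(get_head: Callable[[], str], commit: Callable[[], None]) -> str:
    """Run `commit` and fail loudly if the HEAD revision did not change.

    `get_head` returns the current HEAD identifier; `commit` performs the
    upload or commit. Both are placeholders for the real Hub calls.
    """
    before = get_head()
    commit()
    after = get_head()
    if after == before:
        raise RuntimeError(
            f"Commit did not change HEAD ({before}); treating it as a silent no-op."
        )
    return after


# Usage with an in-memory stand-in for a repository.
repo = {"head": "abc123"}
new_head = commit_and_verify(
    get_head=lambda: repo["head"],
    commit=lambda: repo.update(head="def456"),
)
print(new_head)
```

Raising instead of logging makes a promote that uploaded nothing fail the workflow rather than report success.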
Assessment
| Change | Status |
|---|---|
| ACA PTC integration | ✅ Clean |
| Build/stage/promote separation | ✅ Good operational improvement |
| No-op detection | ✅ Useful safety check |
Outstanding tech debt
The hardcoded 2024 dollar values tagged with dynamic time_period remain a future concern, but not blocking given the immediate need to move to 2024 calibration and 119th Congress alignment.
LGTM 🚀
- Extract shared etl_argparser() into utils/db.py to eliminate repeated boilerplate across 7 ETL scripts
- Label hardcoded dollar targets with HARDCODED_YEAR = 2024 instead of dynamic time_period; add warnings.warn when dataset year differs
- Delete dead get_pseudo_input_variables() and update callers
- Switch DEFAULT_DATASET to local storage path for local-first workflow
- Add promote-dataset Makefile target and HF_CLONE_DIR variable
- Add SOI Congress-session constants with RuntimeError guard for future tax-year bumps
- Update Makefile comments for stratified CPS parameters

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
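A minimal sketch of the `HARDCODED_YEAR` labeling with a warning on mismatch, assuming a helper named `year_for_hardcoded_targets` (hypothetical; the commit does not name the function that issues the warning):

```python
import warnings

# Year that the hardcoded dollar targets actually refer to (from the commit message).
HARDCODED_YEAR = 2024


def year_for_hardcoded_targets(dataset_year: int) -> int:
    """Return the year to tag hardcoded targets with, warning on a mismatch.

    Hardcoded dollar values are always labeled with HARDCODED_YEAR, never
    with the dataset year, so a stale value cannot be silently relabeled.
    """
    if dataset_year != HARDCODED_YEAR:
        warnings.warn(
            f"Dataset year {dataset_year} differs from HARDCODED_YEAR "
            f"{HARDCODED_YEAR}; hardcoded dollar targets may be stale.",
            UserWarning,
            stacklevel=2,
        )
    return HARDCODED_YEAR


print(year_for_hardcoded_targets(2024))
```

The warning surfaces the mismatch at ETL time, which is when someone bumping the base year is most likely to notice that the dollar values also need updating.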

Summary
- Remove hardcoded `CBO_YEAR` and `TREASURY_YEAR` constants from `etl_national_targets.py`
- Add `--dataset` CLI argument to specify the source dataset
- Derive `time_period` from `sim.default_calculation_period`; the dataset itself is now the single source of truth

Root Cause
The ETL had hardcoded year constants (`CBO_YEAR` and `TREASURY_YEAR`), but the calibration runs at `time_period=2024`. This caused an 18% gap for income tax alone ($2,051B vs $2,426B).

The Fix
Instead of hardcoding years, we now derive the time period from the dataset via `sim.default_calculation_period`.
This ensures CBO/Treasury targets always match the dataset's year, preventing future drift when updating to new base years annually.
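The approach can be sketched as follows. This is an illustration under stated assumptions, not the code in `etl_national_targets.py`: `sketch_argparser` and `derive_time_period` are hypothetical names, the simulation is replaced by a stub object, and it is assumed that `default_calculation_period` can be converted with `int()`.

```python
import argparse
from types import SimpleNamespace


def sketch_argparser() -> argparse.ArgumentParser:
    """Build a parser with a --dataset option (hypothetical stand-in)."""
    parser = argparse.ArgumentParser(description="Load national calibration targets")
    parser.add_argument(
        "--dataset",
        default=None,
        help="Source dataset whose default period sets the target year",
    )
    return parser


def derive_time_period(sim) -> int:
    """Read the target year from the simulation instead of a constant.

    Assumes sim.default_calculation_period is an int or a numeric string.
    """
    return int(sim.default_calculation_period)


args = sketch_argparser().parse_args(["--dataset", "example_2024.h5"])
# Stub simulation; the real script would build one from args.dataset.
sim = SimpleNamespace(default_calculation_period=2024)
time_period = derive_time_period(sim)
print(args.dataset, time_period)
```

Because the year is read from the loaded simulation, there is no constant left in the script to go stale when the dataset moves to a new base year.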
Cherry-picked from #508
This PR includes the ACA PTC database changes from #508 (Add ACA Premium Tax Credit targets from IRS SOI data). This was cherry-picked so that the calibration package could be rebuilt with the updated database and validated end-to-end in Stratified mode (436 CDs).
Usage
Test plan
- Run `make database` to regenerate policy_data.db
- Verify `income_tax` target is ~$2,426B (not $2,051B)

Closes #503