Add geographic target reconciliation to ETL pipeline #516
Open
Conversation
Recovered from JSONL agent transcripts after a session crash lost the uncommitted working tree. Files recovered:

- calibration/national_matrix_builder.py: DB-driven matrix builder for national calibration
- calibration/fit_national_weights.py: L0 national calibration using NationalMatrixBuilder
- db/etl_all_targets.py: ETL for all legacy loss.py targets into the calibration DB
- datasets/cps/enhanced_cps.py.recovered: Enhanced CPS with use_db dual-path (not applied yet)
- tests/test_calibration/: ~90 tests for the builder and weight fitting

These files need review against current main before integration.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Fixes identified by parallel review agents:

1. Wire up NationalMatrixBuilder in build_calibration_inputs() (was a TODO stub)
2. Convert the dense numpy matrix to scipy.sparse.csr_matrix for l0-python (l0-python calls .tocoo(), which only exists on sparse matrices); see the sketch below
3. Add missing PERSON_LEVEL_VARIABLES and SPM_UNIT_VARIABLES constants
4. Add spm_unit_count to COUNT_VARIABLES
5. Fix test method name mismatches:
   - _query_all_targets -> _query_active_targets
   - _get_constraints -> _get_all_constraints
6. Fix the zero-value target test to expect ValueError (the builder filters zeros)
7. Fix SQLModel import bug in etl_all_targets.py main()
8. Add missing test_db/ directory with test_etl_all_targets.py
9. Export fit_national_weights from the calibration __init__.py
10. Run black formatting on all files

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
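A minimal sketch of the sparse conversion in fix 2, assuming the target matrix starts out as a plain NumPy array (the variable names are illustrative, not the builder's internals):

```python
import numpy as np
from scipy import sparse

# Stand-in for the dense target matrix the builder produces.
dense = np.array([[1.0, 0.0, 3.0], [0.0, 2.0, 0.0]])

# l0-python calls .tocoo() on the matrix it receives, and .tocoo() only
# exists on scipy sparse types, so the ndarray is wrapped in CSR form first.
matrix = sparse.csr_matrix(dense)
coo = matrix.tocoo()  # would raise AttributeError on the raw ndarray
print(coo.row, coo.col, coo.data)
```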
Replaces the legacy-only EnhancedCPS.generate() with a dual-path architecture controlled by use_db flag (defaults to False/legacy). Adds _generate_db() for DB-driven calibration via NationalMatrixBuilder and _generate_legacy() for the existing HardConcrete reweight() path. Also deduplicates bad_targets list and removes the .recovered file. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
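A hedged sketch of that dispatch; the method bodies are placeholders, and whether use_db is passed to generate() or set on the instance is an assumption here:

```python
class EnhancedCPS:
    def generate(self, use_db: bool = False):
        # use_db defaults to False so existing callers keep the legacy behaviour.
        if use_db:
            return self._generate_db()  # DB-driven calibration via NationalMatrixBuilder
        return self._generate_legacy()  # existing HardConcrete reweight() path

    def _generate_db(self):
        ...  # build the matrix from the calibration DB and fit weights

    def _generate_legacy(self):
        ...  # legacy loss-matrix + reweight() path
```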
Make SparseMatrixBuilder inherit from BaseMatrixBuilder to eliminate duplicated code for __init__, _build_entity_relationship, _evaluate_constraints_entity_aware, and _get_stratum_constraints. Also remove the same duplicated methods from NationalMatrixBuilder (_build_entity_relationship, _evaluate_constraints, _get_stratum_constraints) and update all internal calls to use the base class method names. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
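The resulting class layout, roughly (only the class and method names come from the commit; the engine argument and the omitted bodies are placeholders):

```python
class BaseMatrixBuilder:
    # Shared plumbing now lives here once instead of in each builder.
    def __init__(self, engine):
        self.engine = engine

    def _build_entity_relationship(self, sim):
        ...

    def _evaluate_constraints_entity_aware(self, constraints, sim):
        ...

    def _get_stratum_constraints(self, stratum_id):
        ...


class NationalMatrixBuilder(BaseMatrixBuilder):
    ...  # only national-specific matrix construction remains here


class SparseMatrixBuilder(BaseMatrixBuilder):
    ...  # only sparse, state-by-state construction remains here
```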
Break the 1156-line monolithic ETL into focused modules:

- etl_helpers.py: shared helpers (fmt, get_or_create_stratum, upsert_target)
- etl_healthcare_spending.py: healthcare spending by age band (cat 4)
- etl_spm_threshold.py: AGI by SPM threshold decile (cat 5)
- etl_tax_expenditure.py: tax expenditure targets (cat 10)
- etl_state_targets.py: state pop, real estate taxes, ACA, age, AGI (cats 9, 11, 12, 14, 15)
- etl_misc_national.py: census age, EITC, SOI filers, neg market income, infant, net worth, Medicaid, SOI filing-status (cats 1, 2, 3, 6, 7, 8, 13, 16)

etl_all_targets.py is now a thin orchestrator that delegates to these modules and re-exports all extract functions for backward compatibility (sketched below). All 39 existing tests pass unchanged.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
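An illustrative shape for that orchestrator; the module names come from the list above, but the per-module entry-point names and the call order are assumptions:

```python
# db/etl_all_targets.py -- thin orchestrator (illustrative sketch)
from .etl_healthcare_spending import etl_healthcare_spending
from .etl_spm_threshold import etl_spm_threshold
from .etl_tax_expenditure import etl_tax_expenditure
from .etl_state_targets import etl_state_targets
from .etl_misc_national import etl_misc_national

# Re-export so existing `from db.etl_all_targets import ...` call sites keep working.
__all__ = [
    "etl_healthcare_spending",
    "etl_spm_threshold",
    "etl_tax_expenditure",
    "etl_state_targets",
    "etl_misc_national",
    "main",
]


def main() -> None:
    # Delegate to each focused module in turn.
    etl_healthcare_spending()
    etl_spm_threshold()
    etl_tax_expenditure()
    etl_state_targets()
    etl_misc_national()
```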
- Rename _evaluate_constraints → _evaluate_constraints_entity_aware in test_national_matrix_builder.py (4 tests)
- Remove duplicate tax_unit_is_filer constraint from the AGI stratum fixture (it inherits from the parent filer_stratum) (1 test)
- Fix TestImports to use importlib.import_module to get the module, not the function re-exported from __init__.py (2 tests)
- Fix the reweight_l0 test patch path to policyengine_us.Microsimulation, matching the actual import inside the method (1 test)
- Add a try/except fallback in build_calibration_inputs for when the DB path fails, gracefully falling back to the legacy build_loss_matrix (1 test); see the sketch below

All 100 calibration tests + 39 ETL tests pass.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
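A runnable sketch of that fallback behaviour with a simplified signature; _build_from_db and the placeholder build_loss_matrix below are stand-ins, not the real functions:

```python
import logging

logger = logging.getLogger(__name__)


def _build_from_db(sim):
    """Stand-in for the NationalMatrixBuilder-backed path."""
    raise RuntimeError("DB unavailable")  # simulate a DB failure for the demo


def build_loss_matrix(sim):
    """Stand-in for the existing legacy loss-matrix builder."""
    return "legacy-matrix"


def build_calibration_inputs(sim):
    try:
        # Preferred path: DB-driven matrix via NationalMatrixBuilder.
        return _build_from_db(sim)
    except Exception as exc:
        # Any failure on the DB path degrades gracefully to the legacy matrix.
        logger.warning("DB path failed (%s); falling back to build_loss_matrix", exc)
        return build_loss_matrix(sim)


print(build_calibration_inputs(sim=None))  # -> "legacy-matrix"
```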
State-level targets from independent data sources often don't sum to their corresponding national totals (e.g., state real_estate_taxes sums to only 17% of national). This causes contradictory signals for the calibration optimizer. Adds proportional rescaling: scale states to match national, then CDs to match corrected states. Original values preserved in new raw_value column for auditability. Validation step catches any future unreconciled targets. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
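A minimal sketch of that two-pass rescaling with illustrative pandas frames; the column names and function signature are assumptions, not the actual ETL code:

```python
import pandas as pd


def reconcile(national_total: float, states: pd.DataFrame, cds: pd.DataFrame):
    """Scale states to the national total, then CDs to the corrected states."""
    # Preserve originals for auditability, mirroring the raw_value column.
    states = states.assign(raw_value=states["value"])
    cds = cds.assign(raw_value=cds["value"])

    # Pass 1: rescale states so they sum to the national total.
    states["value"] *= national_total / states["value"].sum()

    # Pass 2: rescale each state's districts to the corrected state value.
    state_sum = cds.groupby("state")["value"].transform("sum")
    corrected = cds["state"].map(states.set_index("state")["value"])
    cds["value"] *= corrected / state_sum
    return states, cds


states = pd.DataFrame({"state": ["CA", "TX"], "value": [10.0, 7.0]})
cds = pd.DataFrame({"state": ["CA", "CA", "TX"], "value": [4.0, 5.0, 6.0]})
states, cds = reconcile(100.0, states, cds)  # states now sum to 100; CDs to each corrected state
```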
Resolve conflict in enhanced_cps.py by keeping the db/legacy dispatch from this branch. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Drop reweight() (HardConcrete/torch), _generate_legacy(), use_db flag, and ReweightedCPS_2024 class. EnhancedCPS.generate() now always uses the DB-driven calibration via NationalMatrixBuilder + l0-python. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Delete utils/loss.py (build_loss_matrix, print_reweighting_diagnostics), utils/soi.py (pe_to_soi, get_soi), utils/l0.py (HardConcrete), utils/seed.py (set_seeds). Remove legacy fallback from fit_national_weights.py. Strip legacy tests from test_sparse_enhanced_cps.py. Delete paper/scripts/generate_validation_metrics.py and tests/test_reproducibility.py (both fully legacy). -2,172 lines. DB-driven calibration is the only path. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- NationalMatrixBuilder: add _classify_target_geo() and a geo_level parameter to _query_active_targets() / build_matrix() for optional filtering by national/state/cd level (see the sketch below)
- fit_national_weights: pass through the geo_level param, add a --geo-level CLI flag (default: all)
- Tests: replace legacy build_loss_matrix mocks with NationalMatrixBuilder mocks, add test_passes_geo_level

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
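An illustrative version of that classification and filtering; the real _classify_target_geo() works against the target/stratum schema, so the attribute names here are assumptions:

```python
def classify_target_geo(target) -> str:
    """Bucket a target as 'national', 'state', or 'cd'."""
    if getattr(target, "congressional_district", None):
        return "cd"
    if getattr(target, "state", None):
        return "state"
    return "national"


def filter_by_geo_level(targets, geo_level=None):
    # geo_level=None corresponds to the CLI default of "all": keep every target.
    if geo_level is None:
        return list(targets)
    return [t for t in targets if classify_target_geo(t) == geo_level]
```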
New pipeline: clone the extended CPS ~130x, assign random census blocks to each clone, build a sparse matrix against all DB targets, run L0 calibration. Two L0 presets: "local" (1e-8, ~3-4M records) and "national" (1e-4, ~50K records).

New modules:

- clone_and_assign.py: population-weighted random block assignment (see the sketch below)
- unified_matrix_builder.py: sparse matrix builder (state-by-state)
- unified_calibration.py: CLI entry point with L0 presets

46 tests covering all modules.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
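A minimal sketch of population-weighted block assignment with toy block IDs and populations; the real clone_and_assign.py draws these from census block data:

```python
import numpy as np

rng = np.random.default_rng(0)


def assign_blocks(n_clones: int, block_ids: np.ndarray, block_pops: np.ndarray) -> np.ndarray:
    """Give each cloned household a block, with probability proportional to block population."""
    probs = block_pops / block_pops.sum()
    return rng.choice(block_ids, size=n_clones, p=probs)


blocks = assign_blocks(
    n_clones=10,
    block_ids=np.array([101, 102, 103]),           # toy block identifiers
    block_pops=np.array([500.0, 1500.0, 3000.0]),  # toy block populations
)
```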
Summary
- Adds a `raw_value` column to the `Target` model to preserve original source values for auditability

Problem
Geographic targets come from independent data sources and often don't agree. Before reconciliation, state-level targets did not sum to their national totals (state real_estate_taxes, for example, summed to only 17% of the national value).
These contradictory signals cause the calibration optimizer to struggle.
Approach
Two-pass proportional rescaling (ported from Ben Ogorek's `us-congressional-districts/pull_soi_targets.py`): scale states to match the national total, then scale congressional districts to match the corrected states.

Original values are preserved in `raw_value` for an audit trail. The algorithm is idempotent (safe to re-run) and handles non-geographic sub-strata (filer strata, AGI bands) by resolving them to their geographic ancestor.
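A toy check of the idempotency claim: once a set of targets already sums to its parent total, the scale factor is 1 and re-running the rescaling changes nothing (the numbers below are made up):

```python
import math

national = 100.0
states = [2.0, 5.0, 10.0]  # sums to only 17% of the national total


def rescale(values, total):
    scale = total / sum(values)
    return [v * scale for v in values]


once = rescale(states, national)   # states now sum to the national total
twice = rescale(once, national)    # the second run has scale ~1.0: a no-op
assert math.isclose(sum(once), national)
assert all(math.isclose(a, b) for a, b in zip(once, twice))
```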
Test plan

- `make database` pipeline run with the reconciliation step
- `validate_database.py` validation step

🤖 Generated with Claude Code