Add geographic target reconciliation to ETL pipeline #516

Open
MaxGhenis wants to merge 17 commits into main from unify-calibration-pipeline-v2

@MaxGhenis (Contributor)

Summary

  • Adds two-pass proportional rescaling so that child-level targets (state, congressional district (CD)) sum to their parent-level targets
  • Adds a raw_value column to the Target model to preserve original source values for auditability
  • Adds post-pipeline validation that catches any future unreconciled targets

Problem

Geographic targets come from independent data sources and often don't agree. Before reconciliation:

Variable                              National   State sum   Ratio (state sum / national)
real_estate_taxes                     604.8B     104.8B      0.17
eitc                                  122.5B     58.1B       0.47
unemployment_compensation             64.3B      29.3B       0.46
medical_expense_deduction             89.5B      78.1B       0.87
qualified_business_income_deduction   271.4B     208.3B      0.77

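The calibration optimizer cannot hit a national target and a set of state targets that sum to a different total at the same time, so these contradictory signals make it struggle.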

Approach

Two-pass proportional rescaling (ported from Ben Ogorek's us-congressional-districts/pull_soi_targets.py):

  1. Pass 1: Scale state targets so they sum to the national target
  2. Pass 2: Scale CD targets so they sum to their (corrected) state target

Original values are preserved in raw_value as an audit trail. The algorithm is idempotent (safe to re-run) and handles non-geographic sub-strata (filer strata, AGI bands) by resolving each to its geographic ancestor.
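As a rough illustration of the two passes, here is a minimal sketch for a single variable. It is not the code in this PR: the Target dataclass, the geo_id/parent_id fields, and the function names are hypothetical stand-ins, and the resolution of non-geographic sub-strata is left out. It assumes idempotency is obtained by always scaling from raw_value rather than from the already-scaled value.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Target:
    # Hypothetical stand-in for the real Target model.
    geo_id: str                        # e.g. "US", "CA", "CA-1"
    parent_id: Optional[str]           # None for the national target
    value: float
    raw_value: Optional[float] = None  # original source value


def rescale_children(parent: Target, children: List[Target]) -> None:
    """Scale children in place so their values sum to parent.value."""
    for child in children:
        if child.raw_value is None:
            child.raw_value = child.value  # preserve the source value once
    raw_sum = sum(child.raw_value for child in children)
    if raw_sum == 0:
        return  # zero-sum case: no proportions to scale, leave untouched
    factor = parent.value / raw_sum
    for child in children:
        # Scaling from raw_value makes a second run produce the same result.
        child.value = child.raw_value * factor


def reconcile(targets: List[Target]) -> None:
    by_id: Dict[str, Target] = {t.geo_id: t for t in targets}
    children_of: Dict[str, List[Target]] = {}
    for t in targets:
        if t.parent_id is not None:
            children_of.setdefault(t.parent_id, []).append(t)

    # Pass 1: states are scaled to the national target (skipped if absent).
    for parent_id, kids in children_of.items():
        parent = by_id.get(parent_id)
        if parent is not None and parent.parent_id is None:
            rescale_children(parent, kids)

    # Pass 2: CDs are scaled to their (now corrected) state target.
    for parent_id, kids in children_of.items():
        parent = by_id.get(parent_id)
        if parent is not None and parent.parent_id is not None:
            rescale_children(parent, kids)


targets = [
    Target("US", None, 100.0),
    Target("CA", "US", 30.0),
    Target("NY", "US", 10.0),
    Target("CA-1", "CA", 10.0),
    Target("CA-2", "CA", 20.0),
]
reconcile(targets)
```

In this toy input the states sum to 40 against a national 100, so pass 1 applies a factor of 2.5 (CA becomes 75, NY becomes 25), and pass 2 then scales the two CA districts to 25 and 50 so they sum to the corrected 75.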

Test plan

  • 10 new unit tests covering: scaling, raw_value preservation, two-pass, no-national, zero-sum, idempotency, non-geo sub-strata
  • All 49 tests (existing plus new) pass
  • Full make database pipeline run with reconciliation step
  • Verify reconciliation validation passes in validate_database.py
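For readers who want a picture of what a reconciliation check could look like, the sketch below flags every parent whose children do not sum to it within a relative tolerance. It is only an assumption about the shape of such a check, not the contents of validate_database.py: the function name, the two dictionaries, and the tolerance are all hypothetical.

```python
import math
from typing import Dict, List


def find_unreconciled(
    values: Dict[str, float],
    parent_of: Dict[str, str],
    rel_tol: float = 1e-6,
) -> List[str]:
    """Return parent IDs whose children's values do not sum to the parent."""
    child_sums: Dict[str, float] = {}
    for child, parent in parent_of.items():
        child_sums[parent] = child_sums.get(parent, 0.0) + values[child]
    return sorted(
        parent
        for parent, total in child_sums.items()
        # Parents with no target of their own cannot be checked.
        if parent in values
        and not math.isclose(total, values[parent], rel_tol=rel_tol)
    )


# Hypothetical values: states match the national total, CA's districts do not.
values = {"US": 100.0, "CA": 75.0, "NY": 25.0, "CA-1": 25.0, "CA-2": 40.0}
parent_of = {"CA": "US", "NY": "US", "CA-1": "CA", "CA-2": "CA"}
unreconciled = find_unreconciled(values, parent_of)
```

Here the states sum to the national 100, but the CA districts sum to 65 rather than 75, so only "CA" is reported.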

🤖 Generated with Claude Code

MaxGhenis and others added 17 commits February 8, 2026 09:30
Recovered from JSONL agent transcripts after session crash lost
uncommitted working tree. Files recovered:

- calibration/national_matrix_builder.py: DB-driven matrix builder for national calibration
- calibration/fit_national_weights.py: L0 national calibration using NationalMatrixBuilder
- db/etl_all_targets.py: ETL for all legacy loss.py targets into calibration DB
- datasets/cps/enhanced_cps.py.recovered: Enhanced CPS with use_db dual-path (not applied yet)
- tests/test_calibration/: ~90 tests for builder and weight fitting

These files need review against current main before integration.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Fixes identified by parallel review agents:

1. Wire up NationalMatrixBuilder in build_calibration_inputs() (was TODO stub)
2. Convert dense numpy matrix to scipy.sparse.csr_matrix for l0-python
   (l0-python calls .tocoo() which only exists on sparse matrices)
3. Add missing PERSON_LEVEL_VARIABLES and SPM_UNIT_VARIABLES constants
4. Add spm_unit_count to COUNT_VARIABLES
5. Fix test method name mismatches:
   - _query_all_targets -> _query_active_targets
   - _get_constraints -> _get_all_constraints
6. Fix zero-value target test to expect ValueError (builder filters zeros)
7. Fix SQLModel import bug in etl_all_targets.py main()
8. Add missing test_db/ directory with test_etl_all_targets.py
9. Export fit_national_weights from calibration __init__.py
10. Run black formatting on all files

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replaces the legacy-only EnhancedCPS.generate() with a dual-path
architecture controlled by use_db flag (defaults to False/legacy).
Adds _generate_db() for DB-driven calibration via NationalMatrixBuilder
and _generate_legacy() for the existing HardConcrete reweight() path.
Also deduplicates bad_targets list and removes the .recovered file.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Make SparseMatrixBuilder inherit from BaseMatrixBuilder to eliminate
duplicated code for __init__, _build_entity_relationship,
_evaluate_constraints_entity_aware, and _get_stratum_constraints.
Also remove the same duplicated methods from NationalMatrixBuilder
(_build_entity_relationship, _evaluate_constraints, _get_stratum_constraints)
and update all internal calls to use the base class method names.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Break the 1156-line monolithic ETL into focused modules:
- etl_helpers.py: shared helpers (fmt, get_or_create_stratum, upsert_target)
- etl_healthcare_spending.py: healthcare spending by age band (cat 4)
- etl_spm_threshold.py: AGI by SPM threshold decile (cat 5)
- etl_tax_expenditure.py: tax expenditure targets (cat 10)
- etl_state_targets.py: state pop, real estate taxes, ACA, age, AGI (cats 9,11,12,14,15)
- etl_misc_national.py: census age, EITC, SOI filers, neg market income, infant, net worth, Medicaid, SOI filing-status (cats 1,2,3,6,7,8,13,16)

etl_all_targets.py is now a thin orchestrator that delegates to these
modules and re-exports all extract functions for backward compatibility.
All 39 existing tests pass unchanged.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Rename _evaluate_constraints → _evaluate_constraints_entity_aware in
  test_national_matrix_builder.py (4 tests)
- Remove duplicate tax_unit_is_filer constraint from AGI stratum fixture
  (inherits from parent filer_stratum) (1 test)
- Fix TestImports to use importlib.import_module to get the module, not
  the function re-exported from __init__.py (2 tests)
- Fix reweight_l0 test patch path to policyengine_us.Microsimulation
  matching the actual import inside the method (1 test)
- Add try/except fallback in build_calibration_inputs for when DB path
  fails, gracefully falling back to legacy build_loss_matrix (1 test)

All 100 calibration tests + 39 ETL tests pass.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
State-level targets from independent data sources often don't sum to
their corresponding national totals (e.g., state real_estate_taxes sums
to only 17% of national). This causes contradictory signals for the
calibration optimizer.

Adds proportional rescaling: scale states to match national, then CDs to
match corrected states. Original values preserved in new raw_value column
for auditability. Validation step catches any future unreconciled targets.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Resolve conflict in enhanced_cps.py by keeping the db/legacy dispatch
from this branch.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Drop reweight() (HardConcrete/torch), _generate_legacy(), use_db flag,
and ReweightedCPS_2024 class. EnhancedCPS.generate() now always uses
the DB-driven calibration via NationalMatrixBuilder + l0-python.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Delete utils/loss.py (build_loss_matrix, print_reweighting_diagnostics),
utils/soi.py (pe_to_soi, get_soi), utils/l0.py (HardConcrete),
utils/seed.py (set_seeds). Remove legacy fallback from
fit_national_weights.py. Strip legacy tests from
test_sparse_enhanced_cps.py. Delete paper/scripts/generate_validation_metrics.py
and tests/test_reproducibility.py (both fully legacy).

-2,172 lines. DB-driven calibration is the only path.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- NationalMatrixBuilder: add _classify_target_geo() and geo_level
  parameter to _query_active_targets() / build_matrix() for optional
  filtering by national/state/cd level
- fit_national_weights: pass through geo_level param, add --geo-level
  CLI flag (default: all)
- Tests: replace legacy build_loss_matrix mocks with
  NationalMatrixBuilder mocks, add test_passes_geo_level

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
New pipeline: clone extended CPS ~130x, assign random census blocks
to each clone, build sparse matrix against all DB targets, run L0
calibration. Two L0 presets: "local" (1e-8, ~3-4M records) and
"national" (1e-4, ~50K records).

New modules:
- clone_and_assign.py: population-weighted random block assignment
- unified_matrix_builder.py: sparse matrix builder (state-by-state)
- unified_calibration.py: CLI entry point with L0 presets

46 tests covering all modules.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
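The population-weighted random block assignment mentioned for clone_and_assign.py can be pictured roughly as follows. This is a standard-library sketch under assumed names (assign_blocks, block_population, the seed parameter), not the module's actual interface, and it draws one block per clone independently.

```python
import random
from typing import Dict, List


def assign_blocks(
    block_population: Dict[str, int],
    n_clones: int,
    seed: int = 0,
) -> List[str]:
    """Draw one census block per clone, weighted by block population."""
    rng = random.Random(seed)  # local generator keeps the draw reproducible
    blocks = list(block_population)
    weights = [block_population[block] for block in blocks]
    # random.choices samples with replacement in proportion to the weights.
    return rng.choices(blocks, weights=weights, k=n_clones)


# Hypothetical block IDs and populations.
block_population = {"block_a": 900, "block_b": 100, "block_c": 0}
assignments = assign_blocks(block_population, n_clones=130)
```

A block with zero population is never drawn, and a fixed seed gives the same assignment on every run, which matters if the calibration has to be reproducible.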