Add geographic target reconciliation to ETL pipeline #516
Open
Conversation
Recovered from JSONL agent transcripts after a session crash lost the uncommitted working tree. Files recovered:

- calibration/national_matrix_builder.py: DB-driven matrix builder for national calibration
- calibration/fit_national_weights.py: L0 national calibration using NationalMatrixBuilder
- db/etl_all_targets.py: ETL for all legacy loss.py targets into the calibration DB
- datasets/cps/enhanced_cps.py.recovered: Enhanced CPS with use_db dual-path (not applied yet)
- tests/test_calibration/: ~90 tests for the builder and weight fitting

These files need review against current main before integration.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Fixes identified by parallel review agents:

1. Wire up NationalMatrixBuilder in build_calibration_inputs() (was a TODO stub)
2. Convert the dense numpy matrix to scipy.sparse.csr_matrix for l0-python (l0-python calls .tocoo(), which only exists on sparse matrices); see the sketch below
3. Add missing PERSON_LEVEL_VARIABLES and SPM_UNIT_VARIABLES constants
4. Add spm_unit_count to COUNT_VARIABLES
5. Fix test method name mismatches:
   - _query_all_targets -> _query_active_targets
   - _get_constraints -> _get_all_constraints
6. Fix the zero-value target test to expect ValueError (the builder filters zeros)
7. Fix SQLModel import bug in etl_all_targets.py main()
8. Add missing test_db/ directory with test_etl_all_targets.py
9. Export fit_national_weights from the calibration __init__.py
10. Run black formatting on all files

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
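A minimal sketch of the sparse conversion in fix 2, assuming the target matrix starts out as a plain NumPy array (the variable names are illustrative, not the builder's internals):

```python
import numpy as np
from scipy import sparse

# Stand-in for the dense target matrix the builder produces.
dense = np.array([[1.0, 0.0, 3.0], [0.0, 2.0, 0.0]])

# l0-python calls .tocoo() on the matrix it receives, and .tocoo() only
# exists on scipy sparse types, so the ndarray is wrapped in CSR form first.
matrix = sparse.csr_matrix(dense)
coo = matrix.tocoo()  # would raise AttributeError on the raw ndarray
print(coo.row, coo.col, coo.data)
```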
Replaces the legacy-only EnhancedCPS.generate() with a dual-path architecture controlled by use_db flag (defaults to False/legacy). Adds _generate_db() for DB-driven calibration via NationalMatrixBuilder and _generate_legacy() for the existing HardConcrete reweight() path. Also deduplicates bad_targets list and removes the .recovered file. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
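A hedged sketch of that dispatch; the method bodies are placeholders, and whether use_db is passed to generate() or set on the instance is an assumption here:

```python
class EnhancedCPS:
    def generate(self, use_db: bool = False):
        # use_db defaults to False so existing callers keep the legacy behaviour.
        if use_db:
            return self._generate_db()  # DB-driven calibration via NationalMatrixBuilder
        return self._generate_legacy()  # existing HardConcrete reweight() path

    def _generate_db(self):
        ...  # build the matrix from the calibration DB and fit weights

    def _generate_legacy(self):
        ...  # legacy loss-matrix + reweight() path
```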
Make SparseMatrixBuilder inherit from BaseMatrixBuilder to eliminate duplicated code for __init__, _build_entity_relationship, _evaluate_constraints_entity_aware, and _get_stratum_constraints. Also remove the same duplicated methods from NationalMatrixBuilder (_build_entity_relationship, _evaluate_constraints, _get_stratum_constraints) and update all internal calls to use the base class method names. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
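The resulting class layout, roughly (only the class and method names come from the commit; the engine argument and the omitted bodies are placeholders):

```python
class BaseMatrixBuilder:
    # Shared plumbing now lives here once instead of in each builder.
    def __init__(self, engine):
        self.engine = engine

    def _build_entity_relationship(self, sim):
        ...

    def _evaluate_constraints_entity_aware(self, constraints, sim):
        ...

    def _get_stratum_constraints(self, stratum_id):
        ...


class NationalMatrixBuilder(BaseMatrixBuilder):
    ...  # only national-specific matrix construction remains here


class SparseMatrixBuilder(BaseMatrixBuilder):
    ...  # only sparse, state-by-state construction remains here
```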
Break the 1156-line monolithic ETL into focused modules:

- etl_helpers.py: shared helpers (fmt, get_or_create_stratum, upsert_target)
- etl_healthcare_spending.py: healthcare spending by age band (cat 4)
- etl_spm_threshold.py: AGI by SPM threshold decile (cat 5)
- etl_tax_expenditure.py: tax expenditure targets (cat 10)
- etl_state_targets.py: state pop, real estate taxes, ACA, age, AGI (cats 9, 11, 12, 14, 15)
- etl_misc_national.py: census age, EITC, SOI filers, neg market income, infant, net worth, Medicaid, SOI filing-status (cats 1, 2, 3, 6, 7, 8, 13, 16)

etl_all_targets.py is now a thin orchestrator that delegates to these modules and re-exports all extract functions for backward compatibility (sketched below). All 39 existing tests pass unchanged.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
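An illustrative shape for that orchestrator; the module names come from the list above, but the per-module entry-point names and the call order are assumptions:

```python
# db/etl_all_targets.py -- thin orchestrator (illustrative sketch)
from .etl_healthcare_spending import etl_healthcare_spending
from .etl_spm_threshold import etl_spm_threshold
from .etl_tax_expenditure import etl_tax_expenditure
from .etl_state_targets import etl_state_targets
from .etl_misc_national import etl_misc_national

# Re-export so existing `from db.etl_all_targets import ...` call sites keep working.
__all__ = [
    "etl_healthcare_spending",
    "etl_spm_threshold",
    "etl_tax_expenditure",
    "etl_state_targets",
    "etl_misc_national",
    "main",
]


def main() -> None:
    # Delegate to each focused module in turn.
    etl_healthcare_spending()
    etl_spm_threshold()
    etl_tax_expenditure()
    etl_state_targets()
    etl_misc_national()
```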
- Rename _evaluate_constraints → _evaluate_constraints_entity_aware in test_national_matrix_builder.py (4 tests)
- Remove duplicate tax_unit_is_filer constraint from the AGI stratum fixture (it inherits from the parent filer_stratum) (1 test)
- Fix TestImports to use importlib.import_module to get the module, not the function re-exported from __init__.py (2 tests)
- Fix the reweight_l0 test patch path to policyengine_us.Microsimulation, matching the actual import inside the method (1 test)
- Add a try/except fallback in build_calibration_inputs for when the DB path fails, gracefully falling back to the legacy build_loss_matrix (1 test); see the sketch below

All 100 calibration tests + 39 ETL tests pass.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
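A runnable sketch of that fallback behaviour with a simplified signature; _build_from_db and the placeholder build_loss_matrix below are stand-ins, not the real functions:

```python
import logging

logger = logging.getLogger(__name__)


def _build_from_db(sim):
    """Stand-in for the NationalMatrixBuilder-backed path."""
    raise RuntimeError("DB unavailable")  # simulate a DB failure for the demo


def build_loss_matrix(sim):
    """Stand-in for the existing legacy loss-matrix builder."""
    return "legacy-matrix"


def build_calibration_inputs(sim):
    try:
        # Preferred path: DB-driven matrix via NationalMatrixBuilder.
        return _build_from_db(sim)
    except Exception as exc:
        # Any failure on the DB path degrades gracefully to the legacy matrix.
        logger.warning("DB path failed (%s); falling back to build_loss_matrix", exc)
        return build_loss_matrix(sim)


print(build_calibration_inputs(sim=None))  # -> "legacy-matrix"
```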
State-level targets from independent data sources often don't sum to their corresponding national totals (e.g., state real_estate_taxes sums to only 17% of national). This causes contradictory signals for the calibration optimizer. Adds proportional rescaling: scale states to match national, then CDs to match corrected states. Original values preserved in new raw_value column for auditability. Validation step catches any future unreconciled targets. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
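A minimal sketch of that two-pass rescaling with illustrative pandas frames; the column names and function signature are assumptions, not the actual ETL code:

```python
import pandas as pd


def reconcile(national_total: float, states: pd.DataFrame, cds: pd.DataFrame):
    """Scale states to the national total, then CDs to the corrected states."""
    # Preserve originals for auditability, mirroring the raw_value column.
    states = states.assign(raw_value=states["value"])
    cds = cds.assign(raw_value=cds["value"])

    # Pass 1: rescale states so they sum to the national total.
    states["value"] *= national_total / states["value"].sum()

    # Pass 2: rescale each state's districts to the corrected state value.
    state_sum = cds.groupby("state")["value"].transform("sum")
    corrected = cds["state"].map(states.set_index("state")["value"])
    cds["value"] *= corrected / state_sum
    return states, cds


states = pd.DataFrame({"state": ["CA", "TX"], "value": [10.0, 7.0]})
cds = pd.DataFrame({"state": ["CA", "CA", "TX"], "value": [4.0, 5.0, 6.0]})
states, cds = reconcile(100.0, states, cds)  # states now sum to 100; CDs to each corrected state
```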
Resolve conflict in enhanced_cps.py by keeping the db/legacy dispatch from this branch. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Drop reweight() (HardConcrete/torch), _generate_legacy(), use_db flag, and ReweightedCPS_2024 class. EnhancedCPS.generate() now always uses the DB-driven calibration via NationalMatrixBuilder + l0-python. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Delete utils/loss.py (build_loss_matrix, print_reweighting_diagnostics), utils/soi.py (pe_to_soi, get_soi), utils/l0.py (HardConcrete), utils/seed.py (set_seeds). Remove legacy fallback from fit_national_weights.py. Strip legacy tests from test_sparse_enhanced_cps.py. Delete paper/scripts/generate_validation_metrics.py and tests/test_reproducibility.py (both fully legacy). -2,172 lines. DB-driven calibration is the only path. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- NationalMatrixBuilder: add _classify_target_geo() and a geo_level parameter to _query_active_targets() / build_matrix() for optional filtering by national/state/cd level (see the sketch below)
- fit_national_weights: pass through the geo_level param, add a --geo-level CLI flag (default: all)
- Tests: replace legacy build_loss_matrix mocks with NationalMatrixBuilder mocks, add test_passes_geo_level

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
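An illustrative version of that classification and filtering; the real _classify_target_geo() works against the target/stratum schema, so the attribute names here are assumptions:

```python
def classify_target_geo(target) -> str:
    """Bucket a target as 'national', 'state', or 'cd'."""
    if getattr(target, "congressional_district", None):
        return "cd"
    if getattr(target, "state", None):
        return "state"
    return "national"


def filter_by_geo_level(targets, geo_level=None):
    # geo_level=None corresponds to the CLI default of "all": keep every target.
    if geo_level is None:
        return list(targets)
    return [t for t in targets if classify_target_geo(t) == geo_level]
```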
New pipeline: clone the extended CPS ~130x, assign random census blocks to each clone, build a sparse matrix against all DB targets, run L0 calibration. Two L0 presets: "local" (1e-8, ~3-4M records) and "national" (1e-4, ~50K records).

New modules:

- clone_and_assign.py: population-weighted random block assignment (see the sketch below)
- unified_matrix_builder.py: sparse matrix builder (state-by-state)
- unified_calibration.py: CLI entry point with L0 presets

46 tests covering all modules.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
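A minimal sketch of population-weighted block assignment with toy block IDs and populations; the real clone_and_assign.py draws these from census block data:

```python
import numpy as np

rng = np.random.default_rng(0)


def assign_blocks(n_clones: int, block_ids: np.ndarray, block_pops: np.ndarray) -> np.ndarray:
    """Give each cloned household a block, with probability proportional to block population."""
    probs = block_pops / block_pops.sum()
    return rng.choice(block_ids, size=n_clones, p=probs)


blocks = assign_blocks(
    n_clones=10,
    block_ids=np.array([101, 102, 103]),           # toy block identifiers
    block_pops=np.array([500.0, 1500.0, 3000.0]),  # toy block populations
)
```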
Summary
- Adds a `raw_value` column to the `Target` model to preserve original source values for auditability

Problem
Geographic targets come from independent data sources and often don't agree. Before reconciliation, state-level targets did not sum to their national totals (state real_estate_taxes, for example, summed to only 17% of the national value).
These contradictory signals cause the calibration optimizer to struggle.
Approach
Two-pass proportional rescaling (ported from Ben Ogorek's `us-congressional-districts/pull_soi_targets.py`): scale states to match the national total, then scale congressional districts to match the corrected states.

Original values are preserved in `raw_value` for an audit trail. The algorithm is idempotent (safe to re-run) and handles non-geographic sub-strata (filer strata, AGI bands) by resolving them to their geographic ancestor.
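A toy check of the idempotency claim: once a set of targets already sums to its parent total, the scale factor is 1 and re-running the rescaling changes nothing (the numbers below are made up):

```python
import math

national = 100.0
states = [2.0, 5.0, 10.0]  # sums to only 17% of the national total


def rescale(values, total):
    scale = total / sum(values)
    return [v * scale for v in values]


once = rescale(states, national)   # states now sum to the national total
twice = rescale(once, national)    # the second run has scale ~1.0: a no-op
assert math.isclose(sum(once), national)
assert all(math.isclose(a, b) for a, b in zip(once, twice))
```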
Test plan

- `make database` pipeline run with the reconciliation step
- `validate_database.py` validation step

🤖 Generated with Claude Code