@baogorek
Collaborator

@baogorek baogorek commented Dec 5, 2025

Summary

Adds infrastructure for local area (congressional district) calibration: the components related to the sparse X matrix (aka the "loss matrix") and the target vector, starting with SNAP targets only.

The Jupyter Notebook added to docs is a great way to get comfortable with the functionality.

Core components

  • sparse_matrix_builder.py: Database-driven approach for building calibration matrices
  • calibration_utils.py: Shared utilities (cache clearing, constraints, target grouping)
  • matrix_tracer.py: Debugging utility for tracing through sparse matrices
  • create_stratified_cps.py: Create stratified sample preserving high-income households
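
As a rough illustration of what "stratified sample preserving high-income households" can mean, here is a minimal sketch. It is not the code in create_stratified_cps.py: the function name, the record layout (dicts with "income" and "weight"), the threshold, and the reweighting rule are all assumptions made for the example.

```python
import random


def stratified_sample(households, income_threshold, sample_rate, seed=0):
    """Keep every high-income household and sample the rest.

    Hypothetical helper: each household is a dict with "income" and
    "weight". Sampled lower-income households get their weight divided
    by sample_rate so weighted totals are preserved in expectation.
    """
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    kept = []
    for hh in households:
        if hh["income"] >= income_threshold:
            # High-income stratum: always retained, weight unchanged.
            kept.append(dict(hh))
        elif rng.random() < sample_rate:
            # Lower-income stratum: retained with probability sample_rate.
            kept.append({**hh, "weight": hh["weight"] / sample_rate})
    return kept
```

Keeping the top stratum whole avoids losing the few records that carry most of the income mass, while the reweighting keeps the thinned stratum representative.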

Test plan

  • Matrix verification tests in tests/test_local_area_calibration/:
    • test_column_indexing.py: Verify column structure
    • test_same_state.py: Same-state household placement
    • test_cross_state.py: Cross-state benefit recalculation
    • test_geo_masking.py: Geographic masking for state targets
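
To make the properties under test concrete, here is a small sketch of one possible matrix layout. Everything in it is an assumption for illustration: the district-major column order, the function name, the toy district labels, and the explicit district-to-state mapping are not taken from sparse_matrix_builder.py. Rows are targets, columns are (district, household) pairs, and a household's value may differ by district because benefits are recalculated when it is placed in another state.

```python
def build_triplets(values, cd_ids, cd_to_state, targets):
    """Return COO-style (rows, cols, data, shape) for a target matrix.

    values[cd_idx][hh_idx] is the household's value when placed in that
    district. targets is a list of (level, geography) pairs, where level
    is "cd" or "state". All names here are hypothetical.
    """
    n_households = len(values[0])
    rows, cols, data = [], [], []
    for row_idx, (level, geo) in enumerate(targets):
        for cd_idx, cd in enumerate(cd_ids):
            # Geographic mask: a target only sees columns in its own area.
            if level == "cd":
                in_scope = cd == geo
            else:
                in_scope = cd_to_state[cd] == geo
            if not in_scope:
                continue
            for hh_idx, value in enumerate(values[cd_idx]):
                if value:  # skip zeros so the matrix stays sparse
                    rows.append(row_idx)
                    cols.append(cd_idx * n_households + hh_idx)
                    data.append(value)
    shape = (len(targets), len(cd_ids) * n_households)
    return rows, cols, data, shape


# Toy input: two districts in state "A", one in state "B", two households.
cd_ids = ["A1", "A2", "B1"]
cd_to_state = {"A1": "A", "A2": "A", "B1": "B"}
values = [[10.0, 0.0], [10.0, 0.0], [0.0, 5.0]]
targets = [("cd", "A2"), ("state", "A")]
rows, cols, data, shape = build_triplets(values, cd_ids, cd_to_state, targets)
```

The triplets are in the form accepted by `scipy.sparse.coo_matrix((data, (rows, cols)), shape=shape)`, if SciPy is the backing library.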

Data pipeline changes

  • Add LOCAL_AREA_CALIBRATION env var to cps.py and puf.py
  • Add LOCAL_AREA_CALIBRATION_MODE env var to extended_cps.py
  • Add PUF_2023, ExtendedCPS_2023 classes
  • Add make data-local-area target
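
For readers unfamiliar with this kind of switch, a minimal sketch of reading such a flag follows. The helper name and the set of accepted truthy strings are assumptions; the PR description does not say which values cps.py and puf.py actually check for.

```python
import os


def local_area_calibration_enabled(environ=None):
    """Report whether LOCAL_AREA_CALIBRATION is set to a truthy value.

    Hypothetical helper: the accepted spellings below are an assumption.
    Passing a dict as environ makes the function easy to test.
    """
    env = os.environ if environ is None else environ
    raw = env.get("LOCAL_AREA_CALIBRATION", "")
    return raw.strip().lower() in {"1", "true", "yes"}
```

Under these assumptions, running a build with `LOCAL_AREA_CALIBRATION=true` in the environment would turn the flag on, and leaving it unset would keep the default behavior.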

Documentation

  • Add docs/local_area_calibration_setup.ipynb notebook demonstrating matrix construction

baogorek and others added 4 commits December 5, 2025 11:22
Core components:
- sparse_matrix_builder.py: Database-driven approach for building calibration matrices
- calibration_utils.py: Shared utilities (cache clearing, constraints, geo helpers)
- matrix_tracer.py: Debugging utility for tracing through sparse matrices
- create_stratified_cps.py: Create stratified sample preserving high-income households
- test_sparse_matrix_builder.py: 6 verification tests for matrix correctness

Data pipeline changes:
- Add GEO_STACKING env var to cps.py and puf.py for geo-stacking data generation
- Add GEO_STACKING_MODE env var to extended_cps.py
- Add CPS_2024_Full, PUF_2023, ExtendedCPS_2023 classes
- Add policy_data.db download to prerequisites
- Add 'make data-geo' target for geo-stacking data pipeline

CI/CD:
- Add geo-stacking dataset build step to workflow
- Add sparse matrix builder test step after geo data generation

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Move sparse matrix tests to tests/test_local_area_calibration/
- Split large test file into focused modules (column indexing, same-state,
  cross-state, geo masking)
- Fix small_enhanced_cps.py enum encoding (decode_to_str before astype)
- Fix create_stratified_cps.py to use local storage instead of HuggingFace
- Remove CPS_2024_Full to keep PR minimal
- Revert ExtendedCPS_2024 to use CPS_2024

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
…tionality

- Rename GEO_STACKING to LOCAL_AREA_CALIBRATION in cps.py, puf.py, extended_cps.py
- Rename data-geo to data-local-area in Makefile and workflow
- Add create_target_groups function to calibration_utils.py
- Enhance MatrixTracer with get_group_rows method and variable_desc in row catalog
- Add TARGET GROUPS section to print_matrix_structure output
- Add local_area_calibration_setup.ipynb documentation notebook

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@baogorek baogorek changed the title Add sparse matrix builder for local area calibration Add sparse matrix builder for local area calibration - SNAP targets Dec 5, 2025
@baogorek baogorek requested a review from MaxGhenis December 6, 2025 00:29
…format

- Replace silent exception catch with debug logging for constraint evaluation
- Add comment explaining CD GEOID format (SSCCC where SS=state FIPS)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@MaxGhenis
Contributor

Code Review Summary

Overall this is a well-designed PR with clean architecture and comprehensive test coverage. I've pushed a small commit with minor improvements:

Changes Made (commit 7f6ea43)

  1. Added logging for constraint evaluation failures - Replaced silent except Exception: pass with debug-level logging so failures are traceable when needed

  2. Added comment explaining CD GEOID format - Documented that CD GEOIDs follow SSCCC format where SS is state FIPS and CCC is CD number
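
The pattern behind the first change looks roughly like the sketch below. It is a generic illustration, not the code in commit 7f6ea43: the function name, the constraint representation (a dict of name to predicate), and the log message are assumptions.

```python
import logging

logger = logging.getLogger(__name__)


def evaluate_constraints(record, constraints):
    """Evaluate each constraint, logging failures instead of hiding them.

    constraints maps a name to a predicate taking the record. A predicate
    that raises is skipped, as with `except Exception: pass`, but the
    failure is now recorded at debug level.
    """
    results = {}
    for name, predicate in constraints.items():
        try:
            results[name] = bool(predicate(record))
        except Exception as exc:
            logger.debug("Constraint %r could not be evaluated: %s", name, exc)
    return results
```

Behavior for callers is unchanged, but setting the logger to DEBUG now shows which constraints were skipped and why.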

Notes

  • The Alaska CD query includes both 200 and 201 - this is harmless as non-existent values just won't match
  • The documentation notebook has some hardcoded row indices (iloc[28], iloc[10]) that could break if target ordering changes, but this is just for demonstration purposes
  • PUF_2023 class doesn't have a url attribute unlike other PUF classes - appears intentional since it's generated locally
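
The first note can be checked with a throwaway SQLite query. The table name, column name, and contents below are made up for the demonstration, and which of the two Alaska values the real database holds is not stated here.

```python
import sqlite3

# In-memory database with a hypothetical table of district identifiers.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE districts (cd_geoid INTEGER PRIMARY KEY)")
conn.executemany(
    "INSERT INTO districts (cd_geoid) VALUES (?)",
    [(200,), (601,)],  # made-up contents; 201 is deliberately absent
)

# Asking for both candidate values returns only the one that exists.
cursor = conn.execute(
    "SELECT cd_geoid FROM districts WHERE cd_geoid IN (?, ?) ORDER BY cd_geoid",
    (200, 201),
)
matched = [row[0] for row in cursor.fetchall()]
conn.close()
```

An `IN` list filters rows, so a value with no matching row contributes nothing and raises no error.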

Waiting for CI to pass before merging.

Contributor

@MaxGhenis MaxGhenis left a comment

All CI checks passing. Code review complete with minor improvements pushed (logging for constraint failures, documentation of CD GEOID format). Approving.

@MaxGhenis MaxGhenis merged commit 999d696 into main Dec 8, 2025
6 checks passed
@MaxGhenis MaxGhenis deleted the local-area-snap branch December 8, 2025 19:52