@baogorek (Collaborator) commented Dec 9, 2025

Summary

  • Add stacked_dataset_builder.py for creating CD-stacked H5 datasets from calibrated weights
  • Add population-weighted P(county|CD) distributions computed from Census block data
  • Add county_assignment.py module for assigning counties to households based on congressional district
  • Add script to generate county-CD distributions from 119th Congress block equivalency files (BEFs) and 2020 Census population

Key Features

  • Stacked Dataset Builder: Creates H5 datasets with households replicated across congressional districts, using calibrated weights
  • County Assignment: Assigns counties to households according to the population-weighted P(county|CD) distribution for their CD, computed from Census block-level population data
  • 436 CDs covered: All 435 voting districts plus DC at-large
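
As a rough illustration of the population-weighted P(county|CD) idea, the sketch below sums block populations to (CD, county) pairs and divides by each CD's total. The column names (`cd_geoid`, `county_fips`, `population`) and the toy populations are assumptions for illustration only; they are not taken from the PR's script or from Census data.

```python
import pandas as pd


def county_given_cd(blocks: pd.DataFrame) -> pd.DataFrame:
    """Return P(county | CD) as each county's share of the CD's population."""
    # Sum block-level population up to (CD, county) pairs.
    pop = blocks.groupby(["cd_geoid", "county_fips"], as_index=False)[
        "population"
    ].sum()
    # Divide each pair's population by the total population of its CD.
    cd_total = pop.groupby("cd_geoid")["population"].transform("sum")
    pop["probability"] = pop["population"] / cd_total
    return pop[["cd_geoid", "county_fips", "probability"]]


# Toy block records (hypothetical numbers, not real Census counts).
blocks = pd.DataFrame(
    {
        "cd_geoid": ["3610", "3610", "3610", "0200"],
        "county_fips": ["36047", "36061", "36047", "02020"],
        "population": [300, 400, 300, 150],
    }
)
dist = county_given_cd(blocks)
```

By construction the probabilities within each CD sum to 1, which is a cheap invariant to check against a generated distributions file.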

Test Plan

  • Unit tests for county assignment pass (test_county_assignment.py)
  • NY-10 distribution verified: 55.6% Kings County, 44.4% New York County (matches Census)
  • Manual testing of stacked dataset generation

🤖 Generated with Claude Code

baogorek and others added 8 commits December 5, 2025 11:22
Core components:
- sparse_matrix_builder.py: Database-driven approach for building calibration matrices
- calibration_utils.py: Shared utilities (cache clearing, constraints, geo helpers)
- matrix_tracer.py: Debugging utility for tracing through sparse matrices
- create_stratified_cps.py: Create stratified sample preserving high-income households
- test_sparse_matrix_builder.py: 6 verification tests for matrix correctness

Data pipeline changes:
- Add GEO_STACKING env var to cps.py and puf.py for geo-stacking data generation
- Add GEO_STACKING_MODE env var to extended_cps.py
- Add CPS_2024_Full, PUF_2023, ExtendedCPS_2023 classes
- Add policy_data.db download to prerequisites
- Add 'make data-geo' target for geo-stacking data pipeline

CI/CD:
- Add geo-stacking dataset build step to workflow
- Add sparse matrix builder test step after geo data generation

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Move sparse matrix tests to tests/test_local_area_calibration/
- Split large test file into focused modules (column indexing, same-state,
  cross-state, geo masking)
- Fix small_enhanced_cps.py enum encoding (decode_to_str before astype)
- Fix create_stratified_cps.py to use local storage instead of HuggingFace
- Remove CPS_2024_Full to keep PR minimal
- Revert ExtendedCPS_2024 to use CPS_2024

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
…tionality

- Rename GEO_STACKING to LOCAL_AREA_CALIBRATION in cps.py, puf.py, extended_cps.py
- Rename data-geo to data-local-area in Makefile and workflow
- Add create_target_groups function to calibration_utils.py
- Enhance MatrixTracer with get_group_rows method and variable_desc in row catalog
- Add TARGET GROUPS section to print_matrix_structure output
- Add local_area_calibration_setup.ipynb documentation notebook

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
…t builder

- Add make_county_cd_distributions.py to compute P(county|CD) from Census block data
- Add county_cd_distributions.csv with distributions for all 436 CDs
- Add county_assignment.py module for assigning counties to households
- Add stacked_dataset_builder.py for creating CD-stacked H5 datasets
- Add tests for county assignment functionality
- Update calibration_utils.py with state/CD mapping utilities
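
A minimal sketch of how county assignment from a P(county|CD) table could work, assuming households in a CD receive a county by random draw from that CD's distribution. The function name, the dictionary layout, and the seeding scheme are hypothetical and may differ from county_assignment.py; the NY-10 shares are the rounded figures quoted in the test plan.

```python
import numpy as np

# Hypothetical in-memory form of the distributions: CD GEOID -> {county FIPS: share}.
DISTRIBUTIONS = {
    "3610": {"36047": 0.556, "36061": 0.444},  # NY-10: Kings, New York County
    "0200": {"02020": 1.0},  # toy single-county CD
}


def assign_counties(cd_geoid, n_households, distributions, seed=0):
    """Draw one county FIPS code per household from P(county | CD)."""
    rng = np.random.default_rng(seed)
    counties = list(distributions[cd_geoid].keys())
    probs = np.asarray(list(distributions[cd_geoid].values()), dtype=float)
    # Renormalize so rounded shares still sum to exactly 1.
    probs = probs / probs.sum()
    return rng.choice(counties, size=n_households, p=probs)


sample = assign_counties("3610", 1000, DISTRIBUTIONS, seed=42)
```

Fixing the seed makes the assignment reproducible across runs, which matters if the same household must land in the same county every time the dataset is rebuilt.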

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@baogorek (Collaborator, Author) commented Dec 9, 2025

Closes #458

@baogorek baogorek requested a review from MaxGhenis December 9, 2025 17:06
baogorek and others added 6 commits December 10, 2025 09:21
- New GitHub Actions workflow (local_area_publish.yaml) that:
  - Triggers on changes under local_area_calibration/, on repository_dispatch, or when run manually
  - Downloads calibration inputs from HF calibration/ folder
  - Builds 51 state + 436 district H5 files with checkpointing
  - Uploads to GCP and HF states/ and districts/ subdirectories

- New publish_local_area.py script with:
  - Per-state and per-district checkpointing for spot instance resilience
  - Immediate upload after each file is built
  - Support for --states-only, --districts-only, --skip-download flags

- Added upload_local_area_file() to data_upload.py for subdirectory uploads
- Added download_calibration_inputs() to huggingface.py
- Added publish-local-area Makefile target
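
The per-file checkpointing described above can be sketched as follows. This is an assumed design, not the code in publish_local_area.py: completed names are appended to a plain text file right after each build-and-upload step, so a restarted job (for example after a spot instance is reclaimed) skips work already done. The names `load_completed` and `run_with_checkpoint` are hypothetical.

```python
import tempfile
from pathlib import Path
from typing import Callable, Iterable


def load_completed(checkpoint: Path) -> set[str]:
    """Read the names recorded as finished in earlier runs."""
    if not checkpoint.exists():
        return set()
    lines = checkpoint.read_text().splitlines()
    return {line.strip() for line in lines if line.strip()}


def run_with_checkpoint(
    names: Iterable[str],
    build_and_upload: Callable[[str], None],
    checkpoint: Path,
) -> list[str]:
    """Process each name once, recording it immediately after it succeeds."""
    done = load_completed(checkpoint)
    processed = []
    for name in names:
        if name in done:
            continue  # finished in a previous run
        build_and_upload(name)
        with checkpoint.open("a") as f:
            f.write(name + "\n")
        processed.append(name)
    return processed


# Demo: the second call resumes and only handles the file not yet recorded.
with tempfile.TemporaryDirectory() as tmp:
    checkpoint_path = Path(tmp) / "completed.txt"
    built = []
    first = run_with_checkpoint(["AL", "AK"], built.append, checkpoint_path)
    second = run_with_checkpoint(["AL", "AK", "AZ"], built.append, checkpoint_path)
```

Writing the checkpoint entry only after the upload succeeds means an interrupted file is rebuilt on restart rather than silently skipped.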

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
- download_private_prerequisites.py: Download from calibration/policy_data.db
- calibration_utils.py: Look for db in storage/calibration/
- conftest.py: Update test fixture path
- huggingface.py: Fix download_calibration_inputs to return correct paths

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Create a minimal 50-household H5 fixture with known values for stable testing
of the stacked dataset builder without relying on sampled stratified CPS data.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Cast np.arange output to int32 to match column dtype.
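
For context, a small sketch of the dtype issue: `np.arange` returns a platform-default integer type (commonly int64), so it has to be cast when the target column is int32. The variable names here are illustrative only.

```python
import numpy as np

n_households = 5

# Default integer dtype is platform dependent (int64 on most 64-bit systems).
default_ids = np.arange(n_households)

# Request int32 explicitly so the IDs match an int32 column.
ids = np.arange(n_households, dtype=np.int32)

# Equivalent cast applied after the fact.
ids_cast = default_ids.astype(np.int32)
```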

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
baogorek and others added 7 commits December 12, 2025 11:32
…ication

- Add spm_unit_tenure_type mapping from SPM_TENMORTSTATUS in add_spm_variables
- Fix create_stratified_cps.py to use source sim's input_variables instead of empty sim
- Fix stacked_dataset_builder.py to use base_sim's input_variables instead of sparse_sim

The input_variables fix ensures variables like spm_unit_tenure_type are preserved
when creating stratified/stacked datasets, since input_variables is only populated
from variables that have actual data in the loaded dataset.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
…lder

- Add spm-calculator integration for SPM threshold calculation
- Replace random placeholder geoadj with real values from Census ACS rent data
- Add load_cd_geoadj_values() to compute geoadj from median 2BR rents
- Add calculate_spm_thresholds_for_cd() to calculate SPM thresholds per CD
- Add CD rent data CSV and fetch script (requires CENSUS_API_KEY)
- Update .gitignore to track rent CSV
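
One plausible form of a rent-based geographic adjustment is sketched below: the local-to-national median rent ratio is applied only to the housing portion of the threshold. Both the formula and the `housing_share` parameter are assumptions made for illustration; the PR does not spell out what load_cd_geoadj_values() computes, and the function name `geoadj_from_rent` is hypothetical.

```python
def geoadj_from_rent(
    local_median_rent: float,
    national_median_rent: float,
    housing_share: float,
) -> float:
    """Blend a rent index with 1.0, weighting the index by the housing share.

    Assumed form: geoadj = share * (local / national) + (1 - share).
    """
    if national_median_rent <= 0:
        raise ValueError("national_median_rent must be positive")
    if not 0.0 <= housing_share <= 1.0:
        raise ValueError("housing_share must be between 0 and 1")
    rent_index = local_median_rent / national_median_rent
    return housing_share * rent_index + (1.0 - housing_share)


# A CD whose median rent is 1.5x the national figure, with half of the
# threshold treated as housing cost (illustrative numbers only).
example_geoadj = geoadj_from_rent(1500.0, 1000.0, housing_share=0.5)
```

Under this form a CD with rent equal to the national median gets an adjustment of exactly 1.0, regardless of the housing share.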

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Add upload_local_area_batch_to_hf() to batch multiple files per commit
- Add skip_hf parameter to upload_local_area_file() for GCP-only uploads
- Modify publish_local_area.py to batch HF uploads (10 files per commit)
- Fix at-large district geoadj lookup (XX01 -> XX00 mapping for AK, DE, etc.)
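
Two of the items above can be sketched in a few lines: splitting a file list into groups of 10 so each group can go into a single commit, and mapping an at-large district code XX01 to XX00 before looking up geoadj values. The helper names and the set of at-large state FIPS codes are assumptions for illustration; the actual upload call and lookup table in the PR are not shown here.

```python
from typing import Iterator, Sequence

# Assumed state FIPS codes for single-district states: AK, DE, ND, SD, VT, WY.
AT_LARGE_STATE_FIPS = {"02", "10", "38", "46", "50", "56"}


def chunked(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start : start + size]


def geoadj_lookup_key(cd_geoid: str) -> str:
    """Map XX01 to XX00 for at-large states; leave other GEOIDs unchanged."""
    state, district = cd_geoid[:2], cd_geoid[2:]
    if state in AT_LARGE_STATE_FIPS and district == "01":
        return state + "00"
    return cd_geoid


files = [f"districts/file_{i:02d}.h5" for i in range(25)]  # hypothetical paths
batches = list(chunked(files, 10))
```

Each batch would then be handed to whatever function performs the upload, so 25 files produce three commits instead of 25.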

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
… to gitignore

Pseudo-inputs are variables with adds/subtracts that aggregate formula-based
components. Saving their stale pre-computed values corrupts calculations when
the dataset is reloaded.
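
The filtering idea can be sketched with a toy variable registry. Everything here is hypothetical (the `VariableMeta` class, its fields, and the variable names); it only illustrates the rule stated above, that a variable whose adds/subtracts components include formula-based variables should be left out when saving.

```python
from dataclasses import dataclass, field


@dataclass
class VariableMeta:
    """Toy stand-in for variable metadata (not the real framework class)."""

    name: str
    has_formula: bool = False
    adds: list[str] = field(default_factory=list)
    subtracts: list[str] = field(default_factory=list)


def is_pseudo_input(var: VariableMeta, registry: dict[str, VariableMeta]) -> bool:
    """True if any adds/subtracts component is computed by a formula."""
    components = var.adds + var.subtracts
    return any(registry[c].has_formula for c in components if c in registry)


def variables_to_save(
    candidates: list[str], registry: dict[str, VariableMeta]
) -> list[str]:
    """Drop pseudo-inputs so they are recomputed after the dataset is reloaded."""
    return [n for n in candidates if not is_pseudo_input(registry[n], registry)]


registry = {
    "wages": VariableMeta("wages"),
    "computed_credit": VariableMeta("computed_credit", has_formula=True),
    "total_with_credit": VariableMeta(
        "total_with_credit", adds=["wages", "computed_credit"]
    ),
}
saved = variables_to_save(["wages", "total_with_credit"], registry)
```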

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
…e_type

- Accept main's SPM threshold calculation using calculate_spm_thresholds_with_geoadj()
- Preserve branch's spm_unit_tenure_type variable for local area calibration
- Refactor calibration_utils.py to import TENURE_CODE_MAP from utils/spm.py
- Remove duplicate SPM_TENURE_CODE_TO_CALC definition

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2 participants