Commit a4322e5
Add stacked dataset builder and P(county|CD) distributions (#457)
* Add sparse matrix builder for local area calibration
Core components:
- sparse_matrix_builder.py: Database-driven approach for building calibration matrices
- calibration_utils.py: Shared utilities (cache clearing, constraints, geo helpers)
- matrix_tracer.py: Debugging utility for tracing through sparse matrices
- create_stratified_cps.py: Create stratified sample preserving high-income households
- test_sparse_matrix_builder.py: 6 verification tests for matrix correctness
Data pipeline changes:
- Add GEO_STACKING env var to cps.py and puf.py for geo-stacking data generation
- Add GEO_STACKING_MODE env var to extended_cps.py
- Add CPS_2024_Full, PUF_2023, ExtendedCPS_2023 classes
- Add policy_data.db download to prerequisites
- Add 'make data-geo' target for geo-stacking data pipeline
CI/CD:
- Add geo-stacking dataset build step to workflow
- Add sparse matrix builder test step after geo data generation
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
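A minimal sketch of the kind of sparse calibration matrix the builder produces, assuming a targets-by-households layout with invented targets and values; the real sparse_matrix_builder.py is database-driven and far more involved.

```python
import numpy as np
from scipy import sparse

# Hypothetical layout: rows are calibration targets, columns are households;
# each entry is the household's contribution to that target.
n_targets = 3
n_households = 5

rows, cols, vals = [], [], []
# Households 0 and 2 count toward target 0 (e.g. a population count).
for hh in (0, 2):
    rows.append(0)
    cols.append(hh)
    vals.append(1.0)
# Household 1 contributes $40,000 toward target 1 (e.g. total wages).
rows.append(1)
cols.append(1)
vals.append(40_000.0)

matrix = sparse.csr_matrix(
    (vals, (rows, cols)), shape=(n_targets, n_households)
)
weights = np.ones(n_households)
estimates = matrix @ weights  # per-target estimates under current weights
print(estimates)
```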
* Add changelog entry and format code
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* Refactor tests and fix enum encoding, minimize PR scope
- Move sparse matrix tests to tests/test_local_area_calibration/
- Split large test file into focused modules (column indexing, same-state,
cross-state, geo masking)
- Fix small_enhanced_cps.py enum encoding (decode_to_str before astype)
- Fix create_stratified_cps.py to use local storage instead of HuggingFace
- Remove CPS_2024_Full to keep PR minimal
- Revert ExtendedCPS_2024 to use CPS_2024
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
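A self-contained illustration of the principle behind the enum encoding fix (not the actual PolicyEngine API): casting raw enum codes to bytes stores the indices, so values must be decoded to their string labels first, which is what calling decode_to_str before astype achieves in small_enhanced_cps.py.

```python
import numpy as np

labels = np.array(["OWNED_WITH_MORTGAGE", "RENTED", "OWNED_OUTRIGHT"])
codes = np.array([1, 0, 2, 1])  # enum values stored as integer indices

# Casting the raw codes loses the labels: b'1', b'0', b'2', b'1'
wrong = codes.astype("S")

# Decoding to string labels first preserves them: b'RENTED', ...
right = labels[codes].astype("S")
print(wrong, right)
```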
* Rename GEO_STACKING to LOCAL_AREA_CALIBRATION and restore tracer functionality
- Rename GEO_STACKING to LOCAL_AREA_CALIBRATION in cps.py, puf.py, extended_cps.py
- Rename data-geo to data-local-area in Makefile and workflow
- Add create_target_groups function to calibration_utils.py
- Enhance MatrixTracer with get_group_rows method and variable_desc in row catalog
- Add TARGET GROUPS section to print_matrix_structure output
- Add local_area_calibration_setup.ipynb documentation notebook
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
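A sketch of how an environment-variable switch like LOCAL_AREA_CALIBRATION might gate the extra generation path; the actual logic in cps.py, puf.py, and extended_cps.py may differ, and the function names here are placeholders.

```python
import os

# Assumption: the flag is read as a truthy environment variable set before
# running the data pipeline (e.g. for the data-local-area target).
LOCAL_AREA_CALIBRATION = bool(os.environ.get("LOCAL_AREA_CALIBRATION"))

def build_standard_outputs():
    ...  # normal CPS/PUF dataset generation

def build_local_area_outputs():
    ...  # extra inputs needed for local area calibration

def generate_dataset():
    build_standard_outputs()
    if LOCAL_AREA_CALIBRATION:
        build_local_area_outputs()
```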
* Clear notebook outputs for Myst compatibility
* Pin mystmd>=1.7.0 to fix notebook rendering in docs
* Add population-weighted P(county|CD) distributions and stacked dataset builder
- Add make_county_cd_distributions.py to compute P(county|CD) from Census block data
- Add county_cd_distributions.csv with distributions for all 436 CDs
- Add county_assignment.py module for assigning counties to households
- Add stacked_dataset_builder.py for creating CD-stacked H5 datasets
- Add tests for county assignment functionality
- Update calibration_utils.py with state/CD mapping utilities
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
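A sketch of the P(county|CD) computation described above, assuming a Census block-level table with population, county FIPS, and congressional district columns; the column names and toy values are illustrative, not taken from make_county_cd_distributions.py.

```python
import numpy as np
import pandas as pd

blocks = pd.DataFrame(
    {
        "cd": ["CA12", "CA12", "CA12", "AK00"],
        "county_fips": ["06075", "06075", "06081", "02020"],
        "population": [1200, 800, 500, 3000],
    }
)

# Sum block population by (CD, county), then normalize within each CD:
# P(county | CD) = pop(county ∩ CD) / pop(CD).
county_pop = blocks.groupby(["cd", "county_fips"])["population"].sum()
p_county_given_cd = county_pop / county_pop.groupby(level="cd").transform("sum")

# Households placed in a CD can then be assigned a county by sampling from
# this distribution (one plausible use of county_assignment.py's output).
ca12 = p_county_given_cd.loc["CA12"]
assigned = np.random.choice(ca12.index, size=10, p=ca12.values)
print(p_county_given_cd, assigned)
```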
* Add local area H5 publishing workflow
- New GitHub Actions workflow (local_area_publish.yaml) that:
- Triggers on local_area_calibration/ changes, repository_dispatch, or manual
- Downloads calibration inputs from HF calibration/ folder
- Builds 51 state + 436 district H5 files with checkpointing
- Uploads to GCP and HF states/ and districts/ subdirectories
- New publish_local_area.py script with:
- Per-state and per-district checkpointing for spot instance resilience
- Immediate upload after each file is built
- Support for --states-only, --districts-only, --skip-download flags
- Added upload_local_area_file() to data_upload.py for subdirectory uploads
- Added download_calibration_inputs() to huggingface.py
- Added publish-local-area Makefile target
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
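A sketch of the per-area checkpointing pattern the workflow relies on: skip areas already published in a previous run and upload each file immediately after it is built, so a spot-instance interruption loses at most one file. The function and path names are illustrative, not the actual API of publish_local_area.py.

```python
import os

def publish_areas(area_codes, build_h5, upload, checkpoint_dir="checkpoints"):
    os.makedirs(checkpoint_dir, exist_ok=True)
    for code in area_codes:
        marker = os.path.join(checkpoint_dir, f"{code}.done")
        if os.path.exists(marker):
            continue  # already built and uploaded in an earlier run
        path = build_h5(code)      # e.g. states/CA.h5 or districts/CA12.h5
        upload(path)               # upload right away, not at the end
        open(marker, "w").close()  # record completion for resumability
```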
* Update paths to use calibration/ subdirectory for policy_data.db
- download_private_prerequisites.py: Download from calibration/policy_data.db
- calibration_utils.py: Look for db in storage/calibration/
- conftest.py: Update test fixture path
- huggingface.py: Fix download_calibration_inputs to return correct paths
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* Documentation updates
* Add deterministic test fixture and tests for stacked_dataset_builder
Create a minimal 50-household H5 fixture with known values for stable testing
of the stacked dataset builder without relying on sampled stratified CPS data.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* Fix dtype warning in stacked_dataset_builder person ID assignment
Cast np.arange output to int32 to match column dtype.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
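The fix above in one line: create (or cast) the person IDs with the column's dtype so the assignment does not downcast.

```python
import numpy as np

n_persons = 10
person_id = np.arange(n_persons).astype(np.int32)
# equivalently: np.arange(n_persons, dtype=np.int32)
```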
* Format test files with black
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* Add spm_unit_tenure_type and fix input_variables detection in stratification
- Add spm_unit_tenure_type mapping from SPM_TENMORTSTATUS in add_spm_variables
- Fix create_stratified_cps.py to use source sim's input_variables instead of empty sim
- Fix stacked_dataset_builder.py to use base_sim's input_variables instead of sparse_sim
The input_variables fix ensures variables like spm_unit_tenure_type are preserved
when creating stratified/stacked datasets, since input_variables is only populated
from variables that have actual data in the loaded dataset.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
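A sketch of the input_variables fix described above; the attribute name follows the commit text, while the surrounding function is illustrative.

```python
def variables_to_preserve(base_sim, fresh_sim):
    """Return the input variables to copy into a new dataset."""
    # Wrong: a freshly constructed simulation has no loaded data, so its
    # input_variables omits variables such as spm_unit_tenure_type.
    # return fresh_sim.input_variables

    # Right: take the list from the simulation that actually holds the
    # source data, since input_variables is populated from loaded values.
    return base_sim.input_variables
```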
* Add real rent-based SPM geographic adjustments to stacked dataset builder
- Add spm-calculator integration for SPM threshold calculation
- Replace random placeholder geoadj with real values from Census ACS rent data
- Add load_cd_geoadj_values() to compute geoadj from median 2BR rents
- Add calculate_spm_thresholds_for_cd() to calculate SPM thresholds per CD
- Add CD rent data CSV and fetch script (requires CENSUS_API_KEY)
- Update .gitignore to track rent CSV
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
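One plausible geoadj computation consistent with the commit: the ratio of a CD's median two-bedroom rent to the national median. The column names and formula are assumptions, not read from load_cd_geoadj_values().

```python
import pandas as pd

rents = pd.DataFrame(
    {"cd": ["CA12", "AL02"], "median_2br_rent": [3000.0, 900.0]}
)
national_median_2br_rent = 1400.0  # illustrative value

# Higher-rent districts get a geoadj above 1, lower-rent districts below 1.
rents["geoadj"] = rents["median_2br_rent"] / national_median_2br_rent
print(rents)
```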
* NYC workflow
* Add batched HuggingFace uploads and fix at-large district geoadj lookup
- Add upload_local_area_batch_to_hf() to batch multiple files per commit
- Add skip_hf parameter to upload_local_area_file() for GCP-only uploads
- Modify publish_local_area.py to batch HF uploads (10 files per commit)
- Fix at-large district geoadj lookup (XX01 -> XX00 mapping for AK, DE, etc.)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
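A sketch of batching several files into a single Hugging Face commit, plus the at-large lookup fix, using the public huggingface_hub API; the repo ID, paths, and state list are placeholders rather than the values used in data_upload.py.

```python
from huggingface_hub import CommitOperationAdd, HfApi

def upload_batch(local_paths, repo_paths, repo_id, message):
    # One commit containing every file in the batch (e.g. 10 H5 files).
    operations = [
        CommitOperationAdd(path_in_repo=rp, path_or_fileobj=lp)
        for lp, rp in zip(local_paths, repo_paths)
    ]
    HfApi().create_commit(
        repo_id=repo_id,
        operations=operations,
        commit_message=message,
        repo_type="model",  # assumption; could be "dataset"
    )

# At-large districts are coded XX00 in the geoadj data but XX01 in the
# district list, so the lookup maps e.g. "AK01" -> "AK00".
AT_LARGE_STATES = {"AK", "DE", "ND", "SD", "VT", "WY"}  # illustrative set

def geoadj_key(cd_code):
    state, number = cd_code[:2], cd_code[2:]
    if state in AT_LARGE_STATES and number == "01":
        return f"{state}00"
    return cd_code
```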
* Filter pseudo-input variables from H5 output and add checkpoint files to gitignore
Pseudo-inputs are variables with adds/subtracts that aggregate formula-based
components. Saving their stale pre-computed values corrupts calculations when
the dataset is reloaded.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
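A sketch of the filtering rule described above: drop any variable whose definition carries adds/subtracts, since its stored value aggregates formula-based components and would go stale on reload. The helper names are illustrative; the attribute access follows PolicyEngine's variable metadata.

```python
def is_pseudo_input(variable):
    """True if the variable aggregates other variables via adds/subtracts."""
    return bool(
        getattr(variable, "adds", None) or getattr(variable, "subtracts", None)
    )

def filter_output_variables(tax_benefit_system, variable_names):
    return [
        name
        for name in variable_names
        if not is_pseudo_input(tax_benefit_system.variables[name])
    ]
```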
* Add spm-calculator as a dependency
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* Fix database path and add download dependency
- Update notebook to use correct db path: storage/calibration/policy_data.db
- Add download as dependency of data target in Makefile
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* Fix test fixture path to use absolute path
Use os.path.dirname(__file__) instead of relative path so tests
work regardless of working directory.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
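The pattern described above, with an illustrative fixture filename: resolve the fixture relative to the test module so pytest can run from any working directory.

```python
import os

FIXTURE_PATH = os.path.join(
    os.path.dirname(__file__), "fixtures", "stacked_dataset_fixture.h5"
)
```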
* Trigger CI after runner restart
* Add uv.lock to pin dependency versions
- Add uv.lock file with all pinned dependencies
- Update all workflows to use `uv sync --dev` instead of pip install
- Add lock freshness check to PR workflow
- Narrow Python version to >=3.12 (required by microimpute)
This prevents stale cached packages on the self-hosted runner from
causing test failures (e.g., missing spm_unit_tenure_type variable).
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* Use uv run for all Python commands in workflows
uv sync creates a virtual environment, but commands were running
with system Python which still had stale cached packages.
All make/python/pytest commands now use `uv run` to execute within
the virtual environment where the locked dependencies are installed.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* Fix family_id reindexing and vectorize entity ID assignment
- Add family_id/person_family_id to entity reindexing loop
(was missing, causing ID collisions when the same household appears in multiple CDs)
- Vectorize entity reindexing using groupby().ngroup() instead of O(n²) nested loops
- Add comment explaining why fresh Microsimulation per CD is necessary
- Add tests for entity ID uniqueness across stacked CDs
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
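A sketch of the vectorized reindexing described above: groupby().ngroup() assigns a new contiguous ID to each (CD, original ID) pair in one pass, so a household reused across CDs gets distinct entity IDs without nested loops. Column names and values are illustrative.

```python
import pandas as pd

persons = pd.DataFrame(
    {
        "cd": ["CA12", "CA12", "NY10", "NY10"],
        "person_family_id": [7, 7, 7, 8],  # same source family reused per CD
    }
)

# New IDs: 0, 0, 1, 2 — the family appearing in two CDs is split apart.
persons["person_family_id"] = persons.groupby(
    ["cd", "person_family_id"], sort=False
).ngroup()
print(persons)
```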
* Update uv.lock with latest package versions
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
* Format with Black
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
---------
Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: Max Ghenis <mghenis@gmail.com>

Parent: 17fff13
File tree
29 files changed (+11017 / -57 lines)
- .github/workflows
- docs
- policyengine_us_data
- datasets/cps
- local_area_calibration
- storage
- calibration_targets
- tests/test_local_area_calibration
- utils