
Commit a4322e5

baogorek, claude, and MaxGhenis authored
Add stacked dataset builder and P(county|CD) distributions (#457)
* Add sparse matrix builder for local area calibration

  Core components:
  - sparse_matrix_builder.py: Database-driven approach for building calibration matrices
  - calibration_utils.py: Shared utilities (cache clearing, constraints, geo helpers)
  - matrix_tracer.py: Debugging utility for tracing through sparse matrices
  - create_stratified_cps.py: Create a stratified sample preserving high-income households
  - test_sparse_matrix_builder.py: 6 verification tests for matrix correctness

  Data pipeline changes:
  - Add GEO_STACKING env var to cps.py and puf.py for geo-stacking data generation
  - Add GEO_STACKING_MODE env var to extended_cps.py
  - Add CPS_2024_Full, PUF_2023, and ExtendedCPS_2023 classes
  - Add policy_data.db download to prerequisites
  - Add 'make data-geo' target for the geo-stacking data pipeline

  CI/CD:
  - Add geo-stacking dataset build step to the workflow
  - Add sparse matrix builder test step after geo data generation

* Add changelog entry and format code

* Refactor tests and fix enum encoding, minimize PR scope
  - Move sparse matrix tests to tests/test_local_area_calibration/
  - Split the large test file into focused modules (column indexing, same-state, cross-state, geo masking)
  - Fix small_enhanced_cps.py enum encoding (decode_to_str before astype)
  - Fix create_stratified_cps.py to use local storage instead of HuggingFace
  - Remove CPS_2024_Full to keep the PR minimal
  - Revert ExtendedCPS_2024 to use CPS_2024

* Rename GEO_STACKING to LOCAL_AREA_CALIBRATION and restore tracer functionality
  - Rename GEO_STACKING to LOCAL_AREA_CALIBRATION in cps.py, puf.py, and extended_cps.py
  - Rename data-geo to data-local-area in the Makefile and workflow
  - Add create_target_groups function to calibration_utils.py
  - Enhance MatrixTracer with a get_group_rows method and variable_desc in the row catalog
  - Add a TARGET GROUPS section to print_matrix_structure output
  - Add local_area_calibration_setup.ipynb documentation notebook

* Clear notebook outputs for MyST compatibility

* Pin mystmd>=1.7.0 to fix notebook rendering in docs

* Add population-weighted P(county|CD) distributions and stacked dataset builder
  - Add make_county_cd_distributions.py to compute P(county|CD) from Census block data
  - Add county_cd_distributions.csv with distributions for all 436 CDs
  - Add county_assignment.py module for assigning counties to households
  - Add stacked_dataset_builder.py for creating CD-stacked H5 datasets
  - Add tests for county assignment functionality
  - Update calibration_utils.py with state/CD mapping utilities
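  For context on the population weighting: each Census block carries a population count and lies in exactly one county and one congressional district, so P(county|CD) is just the county's share of the CD's block population. A minimal sketch of that computation and of sampling a county assignment from it (column names and data are illustrative, not the actual make_county_cd_distributions.py or county_assignment.py code):

  ```python
  import numpy as np
  import pandas as pd

  # Illustrative schema: one row per Census block, with its county FIPS,
  # congressional district code, and population count.
  blocks = pd.DataFrame(
      {
          "county_fips": ["36061", "36061", "36047", "36047"],
          "cd": ["NY12", "NY12", "NY12", "NY09"],
          "population": [1200, 800, 2000, 1500],
      }
  )

  # Sum block population within each (CD, county) cell ...
  pop = blocks.groupby(["cd", "county_fips"])["population"].sum()

  # ... then normalize by the CD total to get P(county | CD).
  p_county_given_cd = pop / pop.groupby(level="cd").transform("sum")
  print(p_county_given_cd)
  # cd    county_fips
  # NY09  36047          1.0
  # NY12  36047          0.5
  #       36061          0.5

  # Assigning a county to a household in NY12 then amounts to sampling
  # from this conditional distribution (illustrative):
  dist = p_county_given_cd.loc["NY12"]
  rng = np.random.default_rng(0)
  county = rng.choice(dist.index, p=dist.values)
  ```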
* Add local area H5 publishing workflow
  - New GitHub Actions workflow (local_area_publish.yaml) that:
    - Triggers on local_area_calibration/ changes, repository_dispatch, or manual dispatch
    - Downloads calibration inputs from the HF calibration/ folder
    - Builds 51 state + 436 district H5 files with checkpointing
    - Uploads to GCP and to the HF states/ and districts/ subdirectories
  - New publish_local_area.py script with:
    - Per-state and per-district checkpointing for spot instance resilience
    - Immediate upload after each file is built
    - Support for --states-only, --districts-only, and --skip-download flags
  - Added upload_local_area_file() to data_upload.py for subdirectory uploads
  - Added download_calibration_inputs() to huggingface.py
  - Added publish-local-area Makefile target

* Update paths to use calibration/ subdirectory for policy_data.db
  - download_private_prerequisites.py: Download from calibration/policy_data.db
  - calibration_utils.py: Look for the db in storage/calibration/
  - conftest.py: Update test fixture path
  - huggingface.py: Fix download_calibration_inputs to return correct paths

* Documentation updates

* Add deterministic test fixture and tests for stacked_dataset_builder

  Create a minimal 50-household H5 fixture with known values for stable testing of the stacked dataset builder without relying on sampled stratified CPS data.

* Fix dtype warning in stacked_dataset_builder person ID assignment

  Cast np.arange output to int32 to match the column dtype.

* Format test files with black

* Add spm_unit_tenure_type and fix input_variables detection in stratification
  - Add spm_unit_tenure_type mapping from SPM_TENMORTSTATUS in add_spm_variables
  - Fix create_stratified_cps.py to use the source sim's input_variables instead of an empty sim
  - Fix stacked_dataset_builder.py to use base_sim's input_variables instead of sparse_sim

  The input_variables fix ensures variables like spm_unit_tenure_type are preserved when creating stratified/stacked datasets, since input_variables is only populated from variables that have actual data in the loaded dataset.

* Add real rent-based SPM geographic adjustments to stacked dataset builder
  - Add spm-calculator integration for SPM threshold calculation
  - Replace the random placeholder geoadj with real values from Census ACS rent data
  - Add load_cd_geoadj_values() to compute geoadj from median 2BR rents
  - Add calculate_spm_thresholds_for_cd() to calculate SPM thresholds per CD
  - Add CD rent data CSV and fetch script (requires CENSUS_API_KEY)
  - Update .gitignore to track the rent CSV

* NYC workflow

* Add batched HuggingFace uploads and fix at-large district geoadj lookup
  - Add upload_local_area_batch_to_hf() to batch multiple files per commit
  - Add skip_hf parameter to upload_local_area_file() for GCP-only uploads
  - Modify publish_local_area.py to batch HF uploads (10 files per commit)
  - Fix at-large district geoadj lookup (XX01 -> XX00 mapping for AK, DE, etc.)

* Filter pseudo-input variables from H5 output and add checkpoint files to gitignore

  Pseudo-inputs are variables with adds/subtracts that aggregate formula-based components. Saving their stale pre-computed values corrupts calculations when the dataset is reloaded.
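  A sketch of the pseudo-input filter just described, assuming PolicyEngine-style variable metadata where aggregates declare adds/subtracts lists (the toy objects below are stand-ins, not the real stacked_dataset_builder.py code):

  ```python
  from types import SimpleNamespace

  # Toy stand-ins for variable metadata: a pseudo-input aggregates
  # other variables via `adds`/`subtracts`.
  variables = {
      "employment_income": SimpleNamespace(adds=None, subtracts=None),
      "household_net_income": SimpleNamespace(
          adds=["employment_income"], subtracts=["income_tax"]
      ),
  }

  def is_pseudo_input(var) -> bool:
      # A stored value for an adds/subtracts aggregate goes stale as soon
      # as any component changes, so it must be recomputed, not reloaded.
      return bool(getattr(var, "adds", None) or getattr(var, "subtracts", None))

  # Hypothetical filtering step before writing the H5 file:
  # keep only true inputs, whose stored values are authoritative.
  dataset_arrays = {"employment_income": [1.0], "household_net_income": [2.0]}
  to_save = {
      name: vals
      for name, vals in dataset_arrays.items()
      if not is_pseudo_input(variables[name])
  }
  print(list(to_save))  # ['employment_income']
  ```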
* Add spm-calculator as a dependency

* Fix database path and add download dependency
  - Update the notebook to use the correct db path: storage/calibration/policy_data.db
  - Add download as a dependency of the data target in the Makefile

* Fix test fixture path to use absolute path

  Use os.path.dirname(__file__) instead of a relative path so tests work regardless of the working directory.

* Trigger CI after runner restart

* Add uv.lock to pin dependency versions
  - Add uv.lock file with all pinned dependencies
  - Update all workflows to use `uv sync --dev` instead of pip install
  - Add a lock freshness check to the PR workflow
  - Narrow the Python version to >=3.12 (required by microimpute)

  This prevents stale cached packages on the self-hosted runner from causing test failures (e.g., a missing spm_unit_tenure_type variable).

* Use uv run for all Python commands in workflows

  uv sync creates a virtual environment, but commands were running with the system Python, which still had stale cached packages. All make/python/pytest commands now use `uv run` to execute within the virtual environment where the locked dependencies are installed.

* Fix family_id reindexing and vectorize entity ID assignment
  - Add family_id/person_family_id to the entity reindexing loop (it was missing, causing ID collisions when the same household appears in multiple CDs)
  - Vectorize entity reindexing using groupby().ngroup() instead of O(n²) nested loops (see the sketch after this commit list)
  - Add a comment explaining why a fresh Microsimulation per CD is necessary
  - Add tests for entity ID uniqueness across stacked CDs

* Update uv.lock with latest package versions

* Format with Black

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: Max Ghenis <mghenis@gmail.com>
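A minimal sketch of the vectorized reindexing referenced above (column names are assumptions; the real builder operates on PolicyEngine entity arrays): `groupby(...).ngroup()` labels each distinct (CD, original ID) pair with a dense integer in a single pass, which is what replaces the O(n²) nested loops and prevents cross-CD ID collisions.

```python
import pandas as pd

# Illustrative stacked person table: the same source household (hh 7)
# appears in two congressional districts after stacking.
persons = pd.DataFrame(
    {
        "cd": ["CA01", "CA01", "CA02", "CA02"],
        "household_id": [7, 8, 7, 9],
    }
)

# O(n) vectorized reindex: each distinct (cd, original id) pair gets a
# dense new ID, so cross-CD copies of a household no longer collide.
persons["new_household_id"] = persons.groupby(
    ["cd", "household_id"], sort=False
).ngroup()
print(persons)
#      cd  household_id  new_household_id
# 0  CA01             7                 0
# 1  CA01             8                 1
# 2  CA02             7                 2
# 3  CA02             9                 3
```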
1 parent 17fff13 · commit a4322e5

29 files changed · +11017 −57 lines changed

.github/workflows/code_changes.yaml

Lines changed: 2 additions & 2 deletions
```diff
@@ -37,9 +37,9 @@ jobs:
       - name: Install uv
         uses: astral-sh/setup-uv@v5
       - name: Install package
-        run: uv pip install -e .[dev] --system
+        run: uv sync --dev
       - name: Build package
-        run: python -m build
+        run: uv run python -m build
       - name: Publish a Python distribution to PyPI
         uses: pypa/gh-action-pypi-publish@release/v1
         with:
```
.github/workflows/local_area_publish.yaml

Lines changed: 70 additions & 0 deletions
```diff
@@ -0,0 +1,70 @@
+name: Publish Local Area H5 Files
+
+on:
+  push:
+    branches: [main]
+    paths:
+      - 'policyengine_us_data/datasets/cps/local_area_calibration/**'
+      - '.github/workflows/local_area_publish.yaml'
+  repository_dispatch:
+    types: [calibration-updated]
+  workflow_dispatch:
+
+# Trigger strategy:
+# 1. Automatic: Code changes to local_area_calibration/ pushed to main
+# 2. repository_dispatch: Calibration workflow triggers after uploading new weights
+# 3. workflow_dispatch: Manual trigger when you update weights/data on HF yourself
+
+jobs:
+  publish-local-area:
+    runs-on: self-hosted
+    permissions:
+      contents: read
+      id-token: write
+    env:
+      HUGGING_FACE_TOKEN: ${{ secrets.HUGGING_FACE_TOKEN }}
+
+    steps:
+      - name: Checkout repo
+        uses: actions/checkout@v4
+
+      - name: Install uv
+        uses: astral-sh/setup-uv@v5
+
+      - name: Set up Python
+        uses: actions/setup-python@v5
+        with:
+          python-version: '3.13'
+
+      - name: Authenticate to Google Cloud
+        uses: google-github-actions/auth@v2
+        with:
+          workload_identity_provider: "projects/322898545428/locations/global/workloadIdentityPools/policyengine-research-id-pool/providers/prod-github-provider"
+          service_account: "policyengine-research@policyengine-research.iam.gserviceaccount.com"
+
+      - name: Install package
+        run: uv sync --dev
+
+      - name: Download checkpoint (if exists)
+        continue-on-error: true
+        run: |
+          gsutil cp gs://policyengine-us-data/checkpoints/completed_states.txt . || true
+          gsutil cp gs://policyengine-us-data/checkpoints/completed_districts.txt . || true
+          gsutil cp gs://policyengine-us-data/checkpoints/completed_cities.txt . || true
+
+      - name: Build and publish local area H5 files
+        run: uv run make publish-local-area
+
+      - name: Upload checkpoint
+        if: always()
+        run: |
+          gsutil cp completed_states.txt gs://policyengine-us-data/checkpoints/ || true
+          gsutil cp completed_districts.txt gs://policyengine-us-data/checkpoints/ || true
+          gsutil cp completed_cities.txt gs://policyengine-us-data/checkpoints/ || true
+
+      - name: Clean up checkpoints on success
+        if: success()
+        run: |
+          gsutil rm gs://policyengine-us-data/checkpoints/completed_states.txt || true
+          gsutil rm gs://policyengine-us-data/checkpoints/completed_districts.txt || true
+          gsutil rm gs://policyengine-us-data/checkpoints/completed_cities.txt || true
```
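For context, the checkpoint files shuttled to and from GCS here are plain-text lists of completed areas; per the commit message, publish_local_area.py records each area after its file is built and uploaded, so a restarted spot instance skips finished work. A rough sketch of that resume loop (helper names below are hypothetical, not the script's actual API):

```python
from pathlib import Path

CHECKPOINT = Path("completed_districts.txt")

def build_h5(cd: str) -> None:
    print(f"building {cd}.h5")  # stand-in for the real H5 builder

def upload_h5(cd: str) -> None:
    print(f"uploading {cd}.h5")  # stand-in for the GCP/HF upload

def load_completed() -> set[str]:
    # Resume point: anything listed here was already built and uploaded.
    if CHECKPOINT.exists():
        return set(CHECKPOINT.read_text().split())
    return set()

def publish_all(districts: list[str]) -> None:
    done = load_completed()
    for cd in districts:
        if cd in done:
            continue  # already published before the interruption
        build_h5(cd)
        upload_h5(cd)
        # Record progress only after the upload succeeds, so a crash
        # mid-build just redoes the current district.
        with CHECKPOINT.open("a") as f:
            f.write(cd + "\n")
```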

.github/workflows/pr_code_changes.yaml

Lines changed: 21 additions & 1 deletion
```diff
@@ -30,8 +30,28 @@ jobs:
         fi
         echo "✅ PR is from the correct repository"

-  Lint:
+  check-lock-freshness:
+    name: Check uv.lock freshness
+    runs-on: ubuntu-latest
     needs: check-fork
+    steps:
+      - uses: actions/checkout@v4
+      - name: Set up Python
+        uses: actions/setup-python@v5
+        with:
+          python-version: '3.13'
+      - name: Install uv
+        uses: astral-sh/setup-uv@v5
+      - name: Check lock file is up-to-date
+        run: |
+          uv lock --upgrade
+          git diff --exit-code uv.lock || {
+            echo "::error::uv.lock is outdated. Run 'uv lock --upgrade' and commit the changes."
+            exit 1
+          }
+
+  Lint:
+    needs: [check-fork, check-lock-freshness]
     uses: ./.github/workflows/reusable_lint.yaml

   SmokeTestForMultipleVersions:
```
.github/workflows/reusable_test.yaml

Lines changed: 11 additions & 11 deletions
```diff
@@ -57,11 +57,11 @@ jobs:
           service_account: "policyengine-research@policyengine-research.iam.gserviceaccount.com"

       - name: Install package
-        run: uv pip install -e .[dev] --system
+        run: uv sync --dev

       - name: Download data inputs
         if: inputs.full_suite
-        run: make download
+        run: uv run make download

       # Temporarily disabled - database target causing issues
       # - name: Create and load calibration targets database
@@ -70,22 +70,22 @@

       - name: Build datasets
         if: inputs.full_suite
-        run: make data
+        run: uv run make data
         env:
           TEST_LITE: ${{ !inputs.upload_data }}
           PYTHON_LOG_LEVEL: INFO

       - name: Build datasets for local area calibration
         if: inputs.full_suite
         run: |
-          LOCAL_AREA_CALIBRATION=true python policyengine_us_data/datasets/cps/cps.py
-          LOCAL_AREA_CALIBRATION=true python policyengine_us_data/datasets/puf/puf.py
-          LOCAL_AREA_CALIBRATION=true python policyengine_us_data/datasets/cps/extended_cps.py
-          python policyengine_us_data/datasets/cps/local_area_calibration/create_stratified_cps.py 10500
+          LOCAL_AREA_CALIBRATION=true uv run python policyengine_us_data/datasets/cps/cps.py
+          LOCAL_AREA_CALIBRATION=true uv run python policyengine_us_data/datasets/puf/puf.py
+          LOCAL_AREA_CALIBRATION=true uv run python policyengine_us_data/datasets/cps/extended_cps.py
+          uv run python policyengine_us_data/datasets/cps/local_area_calibration/create_stratified_cps.py 10500

       - name: Run local area calibration tests
         if: inputs.full_suite
-        run: pytest policyengine_us_data/tests/test_local_area_calibration/ -v
+        run: uv run pytest policyengine_us_data/tests/test_local_area_calibration/ -v

       - name: Save calibration log
         if: inputs.full_suite
@@ -95,14 +95,14 @@
           path: calibration_log.csv

       - name: Run tests
-        run: pytest
+        run: uv run pytest

       - name: Upload data
         if: inputs.upload_data
-        run: make upload
+        run: uv run make upload

       - name: Test documentation builds
-        run: make documentation
+        run: uv run make documentation
         env:
           BASE_URL: ${{ inputs.deploy_docs && '/policyengine-us-data' || '' }}
```
.gitignore

Lines changed: 8 additions & 0 deletions
```diff
@@ -2,6 +2,7 @@
 **/__pycache__
 **/.DS_STORE
 **/*.h5
+**/*.npy
 **/*.csv
 **/_build
 **/*.pkl
@@ -23,4 +24,11 @@ node_modules
 !soi_targets.csv
 !policyengine_us_data/storage/social_security_aux.csv
 !policyengine_us_data/storage/SSPopJul_TR2024.csv
+!policyengine_us_data/storage/national_and_district_rents_2023.csv
 docs/.ipynb_checkpoints/
+
+## Batch processing checkpoints
+completed_*.txt
+
+## Test fixtures
+!policyengine_us_data/tests/test_local_area_calibration/test_fixture_50hh.h5
```

Makefile

Lines changed: 5 additions & 2 deletions
```diff
@@ -1,4 +1,4 @@
-.PHONY: all format test install download upload docker documentation data clean build paper clean-paper presentations
+.PHONY: all format test install download upload docker documentation data data-local-area publish-local-area clean build paper clean-paper presentations

 all: data test

@@ -62,7 +62,7 @@ database:
 	python policyengine_us_data/db/etl_irs_soi.py
 	python policyengine_us_data/db/validate_database.py

-data:
+data: download
 	python policyengine_us_data/utils/uprating.py
 	python policyengine_us_data/datasets/acs/acs.py
 	python policyengine_us_data/datasets/cps/cps.py
@@ -80,6 +80,9 @@ data-local-area: data
 	LOCAL_AREA_CALIBRATION=true python policyengine_us_data/datasets/cps/extended_cps.py
 	python policyengine_us_data/datasets/cps/local_area_calibration/create_stratified_cps.py 10500

+publish-local-area:
+	python policyengine_us_data/datasets/cps/local_area_calibration/publish_local_area.py
+
 clean:
 	rm -f policyengine_us_data/storage/*.h5
 	rm -f policyengine_us_data/storage/*.db
```

changelog_entry.yaml

Lines changed: 10 additions & 0 deletions
```diff
@@ -0,0 +1,10 @@
+- bump: minor
+  changes:
+    added:
+      - Sparse matrix builder for local area calibration with database-driven constraints
+      - Local area calibration data pipeline (make data-local-area)
+      - ExtendedCPS_2023 and PUF_2023 dataset classes
+      - Stratified CPS sampling to preserve high-income households
+      - Matrix verification tests for local area calibration
+      - Population-weighted P(county|CD) distributions from Census block data
+      - County assignment module for stacked dataset builder
```
