Closed
67 commits
2e87e54
UCGID removed. Age data input completed
baogorek Aug 29, 2025
98350e9
extensive refactoring of db
baogorek Aug 31, 2025
4aa6dd0
after a lot of database work
baogorek Sep 2, 2025
e9fa7c2
milestone of 2 state stacking with group loss
baogorek Sep 3, 2025
ff6e81d
state stacking proof of concept
baogorek Sep 4, 2025
9458a0e
temporarily removing microimpute
baogorek Sep 5, 2025
9b9d2df
temporarily disabling these init files
baogorek Sep 5, 2025
3ae2548
checkpoint
baogorek Sep 7, 2025
90e0785
checkpoint
baogorek Sep 7, 2025
e9ecd5b
checkpoint
baogorek Sep 7, 2025
a046d14
comment out the __init__.py lines again
baogorek Sep 7, 2025
4b2b665
checkpoint
baogorek Sep 8, 2025
ae0bb32
State Stacking sim working!
baogorek Sep 9, 2025
49729f5
State level h5s
baogorek Sep 10, 2025
661fc24
getting started with congressional districts
baogorek Sep 10, 2025
aca8382
congressional districts is training
baogorek Sep 11, 2025
8bca5e2
sparse congressional districts stacking
baogorek Sep 11, 2025
f26cc62
Accounting solid for congressional District level reweighting
baogorek Sep 15, 2025
6628677
checkpoint
baogorek Sep 15, 2025
7fb4d92
checkpoint
baogorek Sep 16, 2025
f7fa1ee
DC trip thursday commit
baogorek Sep 18, 2025
55189eb
getting new notebook to run
baogorek Sep 19, 2025
3aef84e
running in notebook
baogorek Sep 20, 2025
89f1707
checkpoint
baogorek Sep 21, 2025
e5f4f2f
matrix accounting completed
baogorek Sep 23, 2025
6b9b1e6
structurally things look good. Model fit is not so good
baogorek Sep 25, 2025
ce29dc8
checkpoint
baogorek Sep 25, 2025
cd61e4e
before the change
baogorek Oct 2, 2025
780900e
metrics_matrix_geo_stacking_sparse.py
baogorek Oct 2, 2025
e97f6c1
in the middle of major metrics_matrix surgery
baogorek Oct 3, 2025
f135789
work on metrics matrix completed
baogorek Oct 3, 2025
410a058
checkpoint
baogorek Oct 8, 2025
a5ea9d4
checkpoint
baogorek Oct 9, 2025
e2d5697
household tracing successful
baogorek Oct 10, 2025
9fe8b87
completing merge with main
baogorek Oct 10, 2025
10f2121
after linting
baogorek Oct 10, 2025
0a10f60
adding a changelog
baogorek Oct 10, 2025
cc73006
reverting unintended changes
baogorek Oct 10, 2025
d467109
lint
baogorek Oct 10, 2025
1d6b505
bringing down the version of microimpute
baogorek Oct 14, 2025
7e0c813
Fix CPS_2025 test failure in CI
baogorek Oct 15, 2025
f9034a5
checkpoint
baogorek Oct 15, 2025
85951f0
pipeline
baogorek Oct 23, 2025
43d7bcb
district tests and GCP workflow
baogorek Oct 23, 2025
cc772f2
Add temporary push trigger for testing
baogorek Oct 23, 2025
416e3b6
Fix workflow to skip data pipeline rebuild
baogorek Oct 23, 2025
4d7bd16
Fix create_sparse_cd_stacked to load dataset from path
baogorek Oct 23, 2025
745ec88
Fix workflow to download datasets from calibration bucket
baogorek Oct 23, 2025
50dc99d
Auto-upload dataset and db with calibration package
baogorek Oct 24, 2025
6b3f6ab
Fix SQLite database connection error in workflow
baogorek Oct 24, 2025
6664fb5
Fix hardcoded database path in get_cd_index_mapping
baogorek Oct 24, 2025
c1570b6
removed all states .h5 default
baogorek Oct 29, 2025
ead3526
Friday
baogorek Oct 31, 2025
0fc7599
nov 12 commit of changes
baogorek Nov 12, 2025
3077a63
Merge branch 'main' of github.com:PolicyEngine/policyengine-us-data i…
baogorek Nov 13, 2025
e232eff
docs
baogorek Nov 14, 2025
01c3e78
first end-to-end integration test
baogorek Nov 18, 2025
0a59dd3
nov 24 prior to working
baogorek Nov 24, 2025
d84ccd0
snap matching in test_walkthrough.py
baogorek Nov 24, 2025
ddc2426
adding jupyter walkthrough
baogorek Nov 24, 2025
7af8c99
checkpoint with snap tests passing
baogorek Nov 27, 2025
055d74e
checkpoint
baogorek Dec 2, 2025
777d201
Merge main into new-cd-var
baogorek Dec 2, 2025
bbd3005
Consolidate geo-stacking documentation into single README
baogorek Dec 3, 2025
23db3a6
Merge branch 'main' of github.com:PolicyEngine/policyengine-us-data i…
baogorek Dec 4, 2025
27c8fbd
Consolidate utilities into calibration_utils.py as single source of t…
baogorek Dec 5, 2025
4684411
Rename geo_stacking_calibration to local_area_calibration
baogorek Dec 5, 2025
104 changes: 104 additions & 0 deletions .github/workflows/validate_district_calibration.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,104 @@
name: Validate District-Level Calibration

on:
  push:
    branches:
      - new-cd-var
  workflow_dispatch:
    inputs:
      gcs_date:
        description: 'GCS date prefix (e.g., 2025-10-22-1721)'
        required: true
        type: string

jobs:
  validate-and-upload:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      id-token: write
    steps:
      - uses: actions/checkout@v4

      - name: Install uv
        uses: astral-sh/setup-uv@v5

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.13'

      - name: Install dependencies
        run: uv pip install -e .[dev] --system

      - name: Authenticate to Google Cloud
        uses: google-github-actions/auth@v2
        with:
          workload_identity_provider: "projects/322898545428/locations/global/workloadIdentityPools/policyengine-research-id-pool/providers/prod-github-provider"
          service_account: "policyengine-research@policyengine-research.iam.gserviceaccount.com"

      - name: Set up Cloud SDK
        uses: google-github-actions/setup-gcloud@v2

      - name: Download weights from GCS
        run: |
          GCS_DATE="${{ inputs.gcs_date || '2025-10-22-1721' }}"
          echo "Downloading weights from gs://policyengine-calibration/$GCS_DATE/outputs/"
          mkdir -p policyengine_us_data/storage/calibration
          gsutil ls gs://policyengine-calibration/$GCS_DATE/outputs/**/w_cd.npy | head -1 | xargs -I {} gsutil cp {} policyengine_us_data/storage/calibration/w_cd.npy
          echo "Downloaded w_cd.npy"

      - name: Download prerequisite datasets
        run: |
          GCS_DATE="${{ inputs.gcs_date || '2025-10-22-1721' }}"
          echo "Downloading stratified dataset and database from calibration run..."
          mkdir -p policyengine_us_data/storage
          gsutil cp gs://policyengine-calibration/$GCS_DATE/inputs/stratified_extended_cps_2023.h5 policyengine_us_data/storage/
          gsutil cp gs://policyengine-calibration/$GCS_DATE/inputs/policy_data.db policyengine_us_data/storage/

      - name: Verify downloaded files
        run: |
          echo "Verifying downloaded files exist..."
          if [ ! -f policyengine_us_data/storage/stratified_extended_cps_2023.h5 ]; then
            echo "ERROR: stratified_extended_cps_2023.h5 not found"
            exit 1
          fi
          if [ ! -f policyengine_us_data/storage/policy_data.db ]; then
            echo "ERROR: policy_data.db not found"
            exit 1
          fi
          echo "All required files present:"
          ls -lh policyengine_us_data/storage/stratified_extended_cps_2023.h5
          ls -lh policyengine_us_data/storage/policy_data.db

      - name: Create state files
        run: |
          echo "Creating state-level .h5 files..."
          python -m policyengine_us_data.datasets.cps.geo_stacking_calibration.create_sparse_cd_stacked \
            --weights-path policyengine_us_data/storage/calibration/w_cd.npy \
            --dataset-path policyengine_us_data/storage/stratified_extended_cps_2023.h5 \
            --db-path policyengine_us_data/storage/policy_data.db \
            --output-dir policyengine_us_data/storage/cd_states

      - name: Run district-level validation tests
        run: |
          echo "Running validation tests..."
          pytest -m "district_level_validation" -v

      - name: Upload state files to GCS
        if: success()
        run: |
          GCS_DATE="${{ inputs.gcs_date || '2025-10-22-1721' }}"
          echo "Tests passed! Uploading state files to GCS..."
          gsutil -m cp policyengine_us_data/storage/cd_states/*.h5 gs://policyengine-calibration/$GCS_DATE/state_files/
          gsutil -m cp policyengine_us_data/storage/cd_states/*_household_mapping.csv gs://policyengine-calibration/$GCS_DATE/state_files/
          echo ""
          echo "✅ State files uploaded to gs://policyengine-calibration/$GCS_DATE/state_files/"

      - name: Report validation failure
        if: failure()
        run: |
          echo "❌ District-level calibration validation FAILED"
          echo "Check the test output above for details"
          echo "State files were NOT uploaded to GCS"
          exit 1
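
The workflow above derives every bucket location from a single date prefix, falling back to `2025-10-22-1721` when no `gcs_date` input is supplied. A minimal Python sketch of that path layout follows. The helper name `calibration_paths` and the strict `YYYY-MM-DD-HHMM` check are assumptions made for illustration (the format is inferred from the `date +%Y-%m-%d-%H%M` call in the Makefile); nothing like this helper appears in the PR itself.

```python
import re
from datetime import datetime

BUCKET = "policyengine-calibration"   # bucket name as used in the workflow
DEFAULT_GCS_DATE = "2025-10-22-1721"  # fallback the workflow uses on push


def calibration_paths(gcs_date: str = "") -> dict:
    """Hypothetical helper: GCS locations for one calibration run."""
    # Mirrors the expression `inputs.gcs_date || '2025-10-22-1721'`.
    prefix = gcs_date or DEFAULT_GCS_DATE
    # Assumed format: YYYY-MM-DD-HHMM with fixed-width fields.
    if not re.fullmatch(r"\d{4}-\d{2}-\d{2}-\d{4}", prefix):
        raise ValueError(f"unexpected date prefix: {prefix!r}")
    # strptime raises ValueError for impossible dates or times.
    datetime.strptime(prefix, "%Y-%m-%d-%H%M")
    base = f"gs://{BUCKET}/{prefix}"
    return {
        "inputs": f"{base}/inputs/",            # dataset and policy_data.db
        "outputs": f"{base}/outputs/",          # optimized weights (w_cd.npy)
        "state_files": f"{base}/state_files/",  # validated state-level files
    }


if __name__ == "__main__":
    for name, uri in calibration_paths().items():
        print(f"{name}: {uri}")
```

Validating the prefix before building paths would surface a mistyped `gcs_date` early instead of as a failed `gsutil` call later in the job.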
4 changes: 4 additions & 0 deletions .gitignore
@@ -24,3 +24,7 @@ node_modules
!policyengine_us_data/storage/social_security_aux.csv
!policyengine_us_data/storage/SSPopJul_TR2024.csv
docs/.ipynb_checkpoints/

# Geo-stacking pipeline outputs
policyengine_us_data/storage/calibration/
policyengine_us_data/storage/cd_states/
64 changes: 0 additions & 64 deletions CLAUDE.md

This file was deleted.

71 changes: 71 additions & 0 deletions Makefile
@@ -56,6 +56,7 @@ documentation-dev:
database:
	python policyengine_us_data/db/create_database_tables.py
	python policyengine_us_data/db/create_initial_strata.py
	python policyengine_us_data/db/etl_national_targets.py
	python policyengine_us_data/db/etl_age.py
	python policyengine_us_data/db/etl_medicaid.py
	python policyengine_us_data/db/etl_snap.py
@@ -74,9 +75,79 @@ data:
	mv policyengine_us_data/storage/enhanced_cps_2024.h5 policyengine_us_data/storage/dense_enhanced_cps_2024.h5
	cp policyengine_us_data/storage/sparse_enhanced_cps_2024.h5 policyengine_us_data/storage/enhanced_cps_2024.h5

data-geo: data
	GEO_STACKING=true python policyengine_us_data/datasets/cps/cps.py
	GEO_STACKING=true python policyengine_us_data/datasets/puf/puf.py
	GEO_STACKING_MODE=true python policyengine_us_data/datasets/cps/extended_cps.py
	python policyengine_us_data/datasets/cps/geo_stacking_calibration/create_stratified_cps.py 10000

calibration-package: data-geo
	python policyengine_us_data/datasets/cps/geo_stacking_calibration/create_calibration_package.py \
		--db-path policyengine_us_data/storage/policy_data.db \
		--dataset-uri policyengine_us_data/storage/stratified_extended_cps_2023.h5 \
		--mode Stratified \
		--local-output policyengine_us_data/storage/calibration

optimize-weights-local: calibration-package
	python policyengine_us_data/datasets/cps/geo_stacking_calibration/optimize_weights.py \
		--input-dir policyengine_us_data/storage/calibration \
		--output-dir policyengine_us_data/storage/calibration \
		--total-epochs 100 \
		--device cpu

create-state-files: optimize-weights-local
	python -m policyengine_us_data.datasets.cps.geo_stacking_calibration.create_sparse_cd_stacked \
		--weights-path policyengine_us_data/storage/calibration/w_cd.npy \
		--dataset-path policyengine_us_data/storage/stratified_extended_cps_2023.h5 \
		--db-path policyengine_us_data/storage/policy_data.db \
		--output-dir policyengine_us_data/storage/cd_states

upload-calibration-package: calibration-package
	$(eval GCS_DATE := $(shell date +%Y-%m-%d-%H%M)) # For bash: GCS_DATE=$$(date +%Y-%m-%d-%H%M)
	python policyengine_us_data/datasets/cps/geo_stacking_calibration/create_calibration_package.py \
		--db-path policyengine_us_data/storage/policy_data.db \
		--dataset-uri policyengine_us_data/storage/stratified_extended_cps_2023.h5 \
		--mode Stratified \
		--gcs-bucket policyengine-calibration \
		--gcs-date $(GCS_DATE)
	@echo "Uploading dataset and database to GCS inputs..."
	gsutil cp policyengine_us_data/storage/stratified_extended_cps_2023.h5 gs://policyengine-calibration/$(GCS_DATE)/inputs/
	gsutil cp policyengine_us_data/storage/policy_data.db gs://policyengine-calibration/$(GCS_DATE)/inputs/
	@echo ""
	@echo "Calibration package uploaded to GCS"
	@echo "Date prefix: $(GCS_DATE)"
	@echo ""
	@echo "To submit GCP batch job, update batch_pipeline/config.env:"
	@echo " INPUT_PATH=$(GCS_DATE)/inputs"
	@echo " OUTPUT_PATH=$(GCS_DATE)/outputs"

optimize-weights-gcp:
	@echo "Submitting Cloud Batch job for weight optimization..."
	@echo "Make sure you've run 'make upload-calibration-package' first"
	@echo "and updated batch_pipeline/config.env with the correct paths"
	@echo ""
	cd policyengine_us_data/datasets/cps/geo_stacking_calibration/batch_pipeline && ./submit_batch_job.sh

download-weights-from-gcs:
	@echo "Downloading weights from GCS..."
	rm -f policyengine_us_data/storage/calibration/w_cd.npy
	@read -p "Enter GCS date prefix (e.g., 2025-10-22-1630): " gcs_date; \
	gsutil ls gs://policyengine-calibration/$$gcs_date/outputs/**/w_cd.npy | head -1 | xargs -I {} gsutil cp {} policyengine_us_data/storage/calibration/w_cd.npy && \
	gsutil ls gs://policyengine-calibration/$$gcs_date/outputs/**/w_cd_*.npy | xargs -I {} gsutil cp {} policyengine_us_data/storage/calibration/ && \
	echo "Weights downloaded successfully"

upload-state-files-to-gcs:
	@echo "Uploading state files to GCS..."
	@read -p "Enter GCS date prefix (e.g., 2025-10-22-1721): " gcs_date; \
	gsutil -m cp policyengine_us_data/storage/cd_states/*.h5 gs://policyengine-calibration/$$gcs_date/state_files/ && \
	gsutil -m cp policyengine_us_data/storage/cd_states/*_household_mapping.csv gs://policyengine-calibration/$$gcs_date/state_files/ && \
	echo "" && \
	echo "State files uploaded to gs://policyengine-calibration/$$gcs_date/state_files/"

clean:
	rm -f policyengine_us_data/storage/*.h5
	rm -f policyengine_us_data/storage/*.db
	rm -f policyengine_us_data/storage/*.pkl
	git clean -fX -- '*.csv'
	rm -rf policyengine_us_data/docs/_build
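
The new Makefile targets form a prerequisite chain, so asking for `create-state-files` first triggers `optimize-weights-local`, `calibration-package`, `data-geo`, and `data`. The sketch below resolves that order in Python to make the chain explicit. The `DEPS` table is transcribed from the target lines in this hunk; `build_order` is a hypothetical helper, and `data` is assumed to have no prerequisites of its own because its rule header is not visible in the diff.

```python
# Prerequisites as written on each target line of the Makefile hunk.
# Assumption: "data" is treated as a leaf because its own prerequisites
# are not visible in this diff.
DEPS = {
    "data": [],
    "data-geo": ["data"],
    "calibration-package": ["data-geo"],
    "optimize-weights-local": ["calibration-package"],
    "create-state-files": ["optimize-weights-local"],
    "upload-calibration-package": ["calibration-package"],
    "optimize-weights-gcp": [],
}


def build_order(target: str, deps: dict = DEPS) -> list:
    """Hypothetical helper: order in which make would run the targets."""
    order = []

    def visit(name: str) -> None:
        if name in order:  # already scheduled
            return
        for prerequisite in deps.get(name, []):
            visit(prerequisite)  # prerequisites run before the target
        order.append(name)

    visit(target)
    return order


if __name__ == "__main__":
    print(" -> ".join(build_order("create-state-files")))
```

Note that `optimize-weights-gcp` declares no prerequisites, which is consistent with its echo lines asking the user to run `make upload-calibration-package` by hand beforehand.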

9 changes: 9 additions & 0 deletions changelog_entry.yaml
@@ -0,0 +1,9 @@
- bump: minor
  changes:
    added:
      - Targets database infrastructure for geo-stacking calibration
      - Congressional district level estimation capability
      - Geo-stacking calibration utilities and modeling functionality
      - GEO_STACKING environment variable for specialized data pipeline
      - Hierarchical validation for calibration targets
      - Holdout validation framework for geo-stacking models