Conversation

@juaristi22 juaristi22 commented Jul 16, 2025

Fix #374

This PR starts addressing the need to organize and clean calibration targets following a common schema. It edits and creates .py files that "pull" targets from data sources. When run, these files produce one csv per data source, structured with the following columns:

DATA_SOURCE,GEO_ID,GEO_NAME,VARIABLE,VALUE,IS_COUNT,BREAKDOWN_VARIABLE,LOWER_BOUND,UPPER_BOUND
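For illustration, a minimal sketch of what one of these pull scripts could write, using Python's standard csv module. The example row values and the soi_targets.csv filename are hypothetical; only the column schema comes from this PR.

```python
import csv

# Column schema from the PR; everything else below is a hypothetical example.
SCHEMA = [
    "DATA_SOURCE", "GEO_ID", "GEO_NAME", "VARIABLE", "VALUE",
    "IS_COUNT", "BREAKDOWN_VARIABLE", "LOWER_BOUND", "UPPER_BOUND",
]

rows = [
    {
        "DATA_SOURCE": "soi",                     # hypothetical example values
        "GEO_ID": "0400000US06",
        "GEO_NAME": "California",
        "VARIABLE": "adjusted_gross_income",
        "VALUE": 1.23e12,
        "IS_COUNT": False,
        "BREAKDOWN_VARIABLE": None,               # None serializes as an empty cell
        "LOWER_BOUND": None,
        "UPPER_BOUND": None,
    },
]

# One csv per data source, all sharing the same header.
with open("soi_targets.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=SCHEMA)
    writer.writeheader()
    writer.writerows(rows)
```

The point of the shared schema is that downstream consumers (e.g. whatever builds the metrics matrix) can concatenate these files without per-source parsing logic.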

To ensure compatibility with the data sources that loss.py and enhanced_cps.py expect, I have kept tracking the old (unformatted) csv files that contain all the currently calibrated targets. These are the files we have been working with for the past month, the ones linked to the ~80% calibration score we expect for the ECPS, and they have nothing to do with the tests we have been running in us-congressional-districts. I made this decision to avoid duplicating .py scripts that pull the same targets in different formats, and to avoid confusing other users with different targets coming from different files or download processes. Moving these files to Hugging Face would also achieve this (let me know if that would be better).

To avoid making the PR even larger, I decided not to track the newly formatted files, but they can easily be generated by running make targets.

My ideal state is that once all the scripts that pull targets into the clean format are ready, we untrack and delete these old csv files and simply make running make targets a necessary step before calibration, pulling all data in a clean format without having to store large target data files.

@juaristi22 juaristi22 requested a review from baogorek July 16, 2025 16:14

@baogorek baogorek left a comment


Maria, thanks for being the first to take a swing at the target restructuring exercise! A couple of comments:

  • I was expecting to see a PR of identically formatted csvs. It seems this is a full transport from congressional-districts to us-data.

  • I was expecting all target files to have the format DATA_SOURCE,GEO_ID,GEO_NAME,VARIABLE,VALUE,IS_COUNT,BREAKDOWN_VARIABLE,LOWER_BOUND,UPPER_BOUND

  • It is a very large PR.

I cleaned up our Roadmap doc that we worked on together, specifically putting the decisions we made at the top. If you get some time, please take a look at it and give me any feedback you have. Let's sync tomorrow and make sure we're on the same page with this roadmap milestone.

@juaristi22 (Collaborator Author)

Hello Ben,

My rationale for the PR you are seeing is the following:

I tracked all the csv files that are currently necessary for loss.py to run, for backward compatibility. My understanding was that we wanted to clean all the target processing scripts before changing the logic that imports them to create the metrics_matrix, so that if we kept merging PRs with new files there wouldn't be merge conflicts. Sorry if this made the PR too big; most of the changes are just moving the old csv files to the calibration directory in storage.

The scripts that pull targets, like pull_soi_targets.py, are the ones that produce csv files following the new schema. To avoid making the PR even larger, I decided to push the scripts that clean the targets instead of the csv files that result from them. If you run make targets you should be able to see them (I am also happy to share them via Slack if that would be easier; I have them locally). The scripts resemble the files in congressional-districts because I thought it would be better to produce a single, thorough script that we will need as soon as we migrate to us-data than to have to update these files with new logic later. Nonetheless, they follow the schema that we agreed on, which congressional-districts doesn't have.

I left a small description of this in the README, let me know if I should provide more context in it. Happy to chat once you are back online.


@MaxGhenis MaxGhenis left a comment


Could you describe what this PR does in more detail? e.g. I see age_state is added; was that in the repo before, after running the script, just not committed?

It also looks like git messed up and didn't catch some file moves, making this PR somewhat smaller than it appears.

What's our strategy for files: are we committing them or pulling from HF? @baogorek please make the call.

In general please add more description to issues and PRs. Thank you for this cleanup!

**/*.pkl
venv
!real_estate_taxes_by_state_acs.csv
!np2023_d5_mid.csv
Contributor

is this a district file? let's avoid adding district files here

Collaborator Author

I don't think it's a district file; it contains a breakdown of population by age and race (at the national level, if I'm not mistaken).

venv
!real_estate_taxes_by_state_acs.csv
!np2023_d5_mid.csv
!snap_state.csv
Contributor

Doesn't seem to be working; this file is still in the PR.

Collaborator Author

@juaristi22 juaristi22 Jul 18, 2025

Yeah, I meant to track them: *.csv means all csvs are ignored, except for the ones whose names are specified with a ! in front. All the now-tracked files, including "np2023_d5_mid.csv", are files that were already being used by the ECPS before the clean-up.
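The pattern being described can be sketched as a .gitignore fragment (illustrative; only the filenames visible in this diff are real):

```
*.csv                                  # ignore every csv by default
!real_estate_taxes_by_state_acs.csv    # ...but re-include these named files
!np2023_d5_mid.csv
!snap_state.csv
```

Later rules win in .gitignore, so the ! negations must come after *.csv; running git check-ignore -v <file> shows which rule applies to a given file.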

baogorek and others added 28 commits July 18, 2025 14:11
* initial commit of L0 branch

* Add HardConcrete L0 regularization

* l0 example completed

* removing commented code

* pre lint cleanup

* post-lint cleanup

* Refactor reweighting diagnostics

* removed _clean from names in the reweighting function

* modifying print function and test

* Convert diagnostics prints to logging

* removing unused variable

* setting high tolerance for ssn test just to pass

* linting

* fixed data set creation logic. Modified parameters

* docs. more epochs
@nikhilwoodruff nikhilwoodruff merged commit 28508d3 into main Jul 18, 2025
7 checks passed
@juaristi22 juaristi22 deleted the maria/calibration-targets branch July 21, 2025 12:25


Development

Successfully merging this pull request may close these issues.

Clean ACS age, SOI agi, SNAP and hardcoded targets

5 participants