Conversation

@juaristi22 juaristi22 commented Jul 16, 2025

Fix #374

This PR starts addressing the need to organize and clean calibration targets following a common schema. It edits and creates .py files that "pull" targets from data sources. When run, these files produce one csv per data source, structured with the following columns:

DATA_SOURCE,GEO_ID,GEO_NAME,VARIABLE,VALUE,IS_COUNT,BREAKDOWN_VARIABLE,LOWER_BOUND,UPPER_BOUND
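For illustration, a minimal sketch of what one of these pull scripts could write, using Python's standard csv module. The example row values and the soi_targets.csv filename are hypothetical; only the column schema comes from this PR.

```python
import csv

# Column schema from the PR; everything else below is a hypothetical example.
SCHEMA = [
    "DATA_SOURCE", "GEO_ID", "GEO_NAME", "VARIABLE", "VALUE",
    "IS_COUNT", "BREAKDOWN_VARIABLE", "LOWER_BOUND", "UPPER_BOUND",
]

rows = [
    {
        "DATA_SOURCE": "soi",                     # hypothetical example values
        "GEO_ID": "0400000US06",
        "GEO_NAME": "California",
        "VARIABLE": "adjusted_gross_income",
        "VALUE": 1.23e12,
        "IS_COUNT": False,
        "BREAKDOWN_VARIABLE": None,               # None serializes as an empty cell
        "LOWER_BOUND": None,
        "UPPER_BOUND": None,
    },
]

# One csv per data source, all sharing the same header.
with open("soi_targets.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=SCHEMA)
    writer.writeheader()
    writer.writerows(rows)
```

The point of the shared schema is that downstream consumers (e.g. whatever builds the metrics matrix) can concatenate these files without per-source parsing logic.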

To ensure compatibility with the data sources that loss.py and enhanced_cps.py expect, I have kept tracking the old (unformatted) csv files that contain all the currently calibrated targets. These are the files we have been working with for the past month, the ones linked to the ~80% calibration score we expect for the ECPS, and they have nothing to do with the tests we have been running in us-congressional-districts. I made this decision to avoid duplicating .py scripts that pull the same targets in different formats, and to avoid confusing other users with different targets coming from different files or download processes. Moving these files to Hugging Face would also achieve this (let me know if that would be better).

To avoid making the PR even larger, I decided not to track the newly formatted files, but they can easily be generated by running make targets.

My ideal state is that once all the scripts that pull targets into the clean format are ready, we untrack and delete these old csv files and simply make running make targets a necessary step before calibration, pulling all data in a clean format without having to store large target data files.

@juaristi22 juaristi22 requested a review from baogorek July 16, 2025 16:14

@baogorek baogorek left a comment


Maria, thanks for being the first to take a swing at the target restructuring exercise! A couple of comments:

  • I was expecting to see a PR of identically formatted csvs. It seems this is a full transport from congressional-districts to us-data.

  • I was expecting all target files to have the format DATA_SOURCE,GEO_ID,GEO_NAME,VARIABLE,VALUE,IS_COUNT,BREAKDOWN_VARIABLE,LOWER_BOUND,UPPER_BOUND

  • It is a very large PR.

I cleaned up our Roadmap doc that we worked on together, specifically putting the decisions we made at the top. If you get some time, please take a look at it and give me any feedback you have. Let's sync tomorrow and make sure we're on the same page with this roadmap milestone.

@juaristi22 (Collaborator Author)

Hello Ben,

My rationale for the PR you are seeing is the following:

I tracked all the csv files that are currently necessary for loss.py to run, for backward compatibility. My understanding was that we wanted to clean all the target processing scripts before changing the logic that imports them to create the metrics_matrix, so that if we kept merging PRs with new files there wouldn't be merge conflicts. Sorry if this made the PR too big; most of the changes are just moving the old csv files to the calibration directory in storage.

The scripts that pull targets, like pull_soi_targets.py, are the ones that produce csv files following the new schema. To avoid making the PR even larger, I decided to push the scripts that clean the targets instead of the csv files that result from them. If you run make targets you should be able to see them (I am also happy to share them via Slack if that would be easier; I have them locally). The scripts resemble the files in congressional-districts because I thought it would be better to produce a single, thorough script that we will need as soon as we migrate to us-data than to have to update these files with new logic later. Nonetheless, they follow the schema that we agreed on, which congressional-districts doesn't have.

I left a small description of this in the README, let me know if I should provide more context in it. Happy to chat once you are back online.


@MaxGhenis MaxGhenis left a comment


Could you describe what this PR does in more detail? e.g. I see age_state is added; was that in the repo before, after running the script, just not committed?

It also looks like git messed up and didn't catch some file moves, making this PR somewhat smaller than it appears.

What's our strategy for files: are we committing them or pulling from HF? @baogorek please make the call.

In general please add more description to issues and PRs. Thank you for this cleanup!

**/*.pkl
venv
!real_estate_taxes_by_state_acs.csv
!np2023_d5_mid.csv
Contributor

is this a district file? let's avoid adding district files here

Collaborator Author

I don't think it's a district file; it contains a breakdown of population by age and race (at the national level, if I'm not mistaken).

venv
!real_estate_taxes_by_state_acs.csv
!np2023_d5_mid.csv
!snap_state.csv
Contributor

Doesn't seem to be working; this file is still in the PR.

Collaborator Author

@juaristi22 juaristi22 Jul 18, 2025

Yeah, I meant to track them: *.csv means all csvs are ignored, except for the ones whose names are specified with a ! in front. All the now-tracked files, including "np2023_d5_mid.csv", are files that were already being used by the ECPS before the clean-up.
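The pattern being described can be sketched as a .gitignore fragment (illustrative; only the filenames visible in this diff are real):

```
*.csv                                  # ignore every csv by default
!real_estate_taxes_by_state_acs.csv    # ...but re-include these named files
!np2023_d5_mid.csv
!snap_state.csv
```

Later rules win in .gitignore, so the ! negations must come after *.csv; running git check-ignore -v <file> shows which rule applies to a given file.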

baogorek and others added 28 commits July 18, 2025 14:11
* initial commit of L0 branch

* Add HardConcrete L0 regularization

* l0 example completed

* removing commented code

* pre lint cleanup

* post-lint cleanup

* Refactor reweighting diagnostics

* removed _clean from names in the reweighting function

* modifying print function and test

* Convert diagnostics prints to logging

* removing unused variable

* setting high tolerance for ssn test just to pass

* linting

* fixed data set creation logic. Modified parameters

* docs. more epochs
@nikhilwoodruff nikhilwoodruff merged commit 28508d3 into main Jul 18, 2025
7 checks passed
@juaristi22 juaristi22 deleted the maria/calibration-targets branch July 21, 2025 12:25


Development

Successfully merging this pull request may close these issues.

Clean ACS age, SOI agi, SNAP and hardcoded targets

5 participants