Cleaning ACS age, SOI agi, hardcoded, and SNAP targets #373
Conversation
baogorek
left a comment
Maria, thanks for being the first to take a swing at the target restructuring exercise! A couple of comments:
- I was expecting to see a PR of identically formatted csvs. It seems this is a full transport from congressional-districts to -us-data.
- I was expecting all target files to have the format DATA_SOURCE,GEO_ID,GEO_NAME,VARIABLE,VALUE,IS_COUNT,BREAKDOWN_VARIABLE,LOWER_BOUND,UPPER_BOUND
- It is a very large PR.
I cleaned up our Roadmap doc that we worked on together, specifically putting the decisions we made at the top. If you get some time, please take a look at it and give me any feedback you have. Let's sync tomorrow and make sure we're on the same page with this roadmap milestone.
Hello Ben, my rationale for the PR you are seeing is the following: I tracked all the csv files that are currently necessary for loss.py to run, for backward compatibility. My understanding was that we wanted to clean all the target-processing scripts before changing the logic that imports them to create the metrics_matrix, so that if we kept merging PRs with new files there wouldn't be merge conflicts. Sorry if this made the PR too big; most of the changes are just moving all the old csv files to the calibration directory in storage. I left a small description of the scripts that pull targets in the README; let me know if I should provide more context in it. Happy to chat once you are back online.
MaxGhenis
left a comment
Could you describe what this PR does in more detail? E.g., I see age_state is added; was that already in the repo before, after running the script, just not committed?
It also looks like git messed up and didn't catch some file moves, making this PR somewhat smaller than it appears.
What's our strategy for files: are we committing them or pulling from HF? @baogorek please make the call.
In general please add more description to issues and PRs. Thank you for this cleanup!
(.gitignore diff context)
**/*.pkl
venv
!real_estate_taxes_by_state_acs.csv
!np2023_d5_mid.csv
Is this a district file? Let's avoid adding district files here.
I don't think it's a district file; it contains a breakdown of population by age and race (at the national level, if I'm not mistaken).
(.gitignore diff context)
venv
!real_estate_taxes_by_state_acs.csv
!np2023_d5_mid.csv
!snap_state.csv
Doesn't seem to be working; this file is still in the PR.
Yeah, I meant to track them: *.csv means all csvs are ignored, except for the ones whose names are prefixed with a !. All the now-tracked files, including np2023_d5_mid.csv, are files that were already being used by the ECPS before the clean-up.
policyengine_us_data/storage/calibration_targets/district_mapping.py (outdated review thread, resolved)
* initial commit of L0 branch
* Add HardConcrete L0 regularization
* l0 example completed
* removing commented code
* pre lint cleanup
* post-lint cleanup
* Refactor reweighting diagnostics
* removed _clean from names in the reweighting function
* modifying print function and test
* Convert diagnostics prints to logging
* removing unused variable
* setting high tolerance for ssn test just to pass
* linting
* fixed data set creation logic. Modified parameters
* docs. more epochs
Fix #374
This PR starts addressing the need to organize and clean calibration targets following a common schema. It edits and creates .py files that "pull" targets from data sources. When run, these files produce one csv per data source, structured with the following columns:
DATA_SOURCE,GEO_ID,GEO_NAME,VARIABLE,VALUE,IS_COUNT,BREAKDOWN_VARIABLE,LOWER_BOUND,UPPER_BOUND
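For illustration, here is a minimal sketch of what one of these per-source files could look like, assuming pandas; the source label, GEO codes, values, and output filename are hypothetical placeholders, not targets produced by this PR's scripts.

```python
# Minimal sketch of the common target schema, assuming pandas.
# All values, GEO identifiers, and the output filename below are
# hypothetical placeholders for illustration only.
import pandas as pd

SCHEMA = [
    "DATA_SOURCE", "GEO_ID", "GEO_NAME", "VARIABLE", "VALUE",
    "IS_COUNT", "BREAKDOWN_VARIABLE", "LOWER_BOUND", "UPPER_BOUND",
]

example_rows = [
    {
        "DATA_SOURCE": "acs",           # hypothetical source label
        "GEO_ID": "0400000US06",        # hypothetical state GEO_ID
        "GEO_NAME": "California",
        "VARIABLE": "population",
        "VALUE": 1_000_000,             # placeholder, not a real target
        "IS_COUNT": True,
        "BREAKDOWN_VARIABLE": "age",
        "LOWER_BOUND": 0,               # breakdown range, e.g. ages 0-4
        "UPPER_BOUND": 4,
    },
]

# One csv per data source, all sharing the same column layout.
df = pd.DataFrame(example_rows, columns=SCHEMA)
df.to_csv("acs_targets_example.csv", index=False)
```

Each pull script writes one such file per data source, which is what lets a single run of make targets regenerate all of them.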
To ensure compatibility with the data sources that loss.py and enhanced_cps.py expect, I have tracked the old (unformatted) csv files that contain all the currently calibrated targets. These are the ones we have been working with for the past month, which correspond to the ~80% calibration score we expect for the ECPS and have nothing to do with the tests we have been running in us-congressional-districts. I made this decision to avoid duplicating the .py scripts that pull targets in different formats, and also to avoid the confusion that could arise for other users from having different targets come from different files or download processes. Moving these files to Hugging Face would also achieve this (let me know if that's preferable).
To avoid making the PR even larger, I decided not to track the newly formatted files, but they can be easily generated by running `make targets`. My ideal state is that once all the scripts that pull targets into the clean format are ready, we would untrack and delete these old csv files and simply make `make targets` a necessary step before running calibration, pulling all data in a clean format without having to store large target data files.