Merged
Conversation
- Extract shared etl_argparser() into utils/db.py to eliminate repeated boilerplate across 7 ETL scripts - Label hardcoded dollar targets with HARDCODED_YEAR = 2024 instead of dynamic time_period; add warnings.warn when dataset year differs - Delete dead get_pseudo_input_variables() and update callers - Switch DEFAULT_DATASET to local storage path for local-first workflow - Add promote-dataset Makefile target and HF_CLONE_DIR variable - Add SOI Congress-session constants with RuntimeError guard for future tax-year bumps - Update Makefile comments for stratified CPS parameters Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
MaxGhenis
approved these changes
Feb 9, 2026
Contributor
MaxGhenis
left a comment
There was a problem hiding this comment.
Clean, well-scoped PR that directly addresses the review feedback. A few notes:
etl_argparser()extraction is nicely done — theextra_args_fncallback for SOI's--lagis a good patternHARDCODED_YEAR = 2024withwarnings.warnis exactly right — those dollar targets were silently mislabeled before- SOI Congress
RuntimeErrorguard is smart defensive coding promote-datasetMakefile target will be useful
One minor thought: the DEFAULT_DATASET switch to local storage means fresh checkouts without the .h5 file will fail on python -m policyengine_us_data.db.etl_* with a missing file error instead of auto-downloading from HF. That's probably fine for the intended local-first workflow, but worth noting in case someone hits it.
LGTM!
On a fresh checkout without `make data`, the local DEFAULT_DATASET won't exist. Give a clear FileNotFoundError suggesting `make data` or `--dataset hf://...` instead of a cryptic load failure. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Addresses actionable items from @MaxGhenis's review of PR #505. Remaining deferred items tracked in #518.
etl_argparser()intoutils/db.py— eliminates ~15 lines of repeated argparse + Microsimulation boilerplate across 7 ETL scriptsHARDCODED_YEAR = 2024instead of dynamictime_period;warnings.warnfires when dataset year differsget_pseudo_input_variables()— was a no-opreturn set(); callers updatedDEFAULT_DATASETto local storage — local-first workflow; addpromote-datasetMakefile targetSOI_CONGRESS_PREFIX/SOI_DISTRICT_TAX_YEARwithRuntimeErrorguard for future tax-year bumpsHF_CLONE_DIRvariable, updated stratified CPS commentsTest plan
black -l 79 --checkpasses on all modified filesDEFAULT_DATASETimport prints local pathpytest— 120 passed, 5 failed (all pre-existing), 7 skippedCloses #518
🤖 Generated with Claude Code