Move all randomness to data package for deterministic country package #451
base: main
This change moves ALL random number generation from policyengine-us into the dataset generation in policyengine-us-data. The country package is now a purely deterministic rules engine.

## Key Changes

### policyengine-us-data:
- Add take-up rate YAML parameter files in `parameters/take_up/`
- Generate all stochastic boolean take-up decisions in CPS dataset
- Use seeded RNG (seed=100) for full reproducibility

### Stochastic variables generated:

**Take-up decisions (boolean):**
- takes_up_snap_if_eligible
- takes_up_aca_if_eligible
- takes_up_medicaid_if_eligible
- takes_up_eitc (already boolean)
- takes_up_dc_ptc (already boolean)

All random generation now uses np.random.default_rng(seed=100) for full reproducibility across dataset builds.

## Trade-offs

**IMPORTANT**: Take-up rates can no longer be adjusted dynamically via policy reforms or in the web app. They are fixed in the microdata. This is an acceptable trade-off for the cleaner architecture of keeping the country package purely deterministic.

To adjust take-up rates, the microdata must be regenerated.

Related: policyengine-us PR (must be merged after this)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
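For readers following along, a minimal sketch of the seeded take-up pattern described above: a uniform draw per unit compared against a take-up rate. The helper name `draw_take_up`, the unit count, and the 0.8 rate are assumptions for illustration and are not taken from the PR.

```python
import numpy as np


def draw_take_up(rate: float, n_units: int, rng: np.random.Generator) -> np.ndarray:
    """Return a boolean array; each unit takes up with probability `rate`."""
    return rng.random(n_units) < rate


# Same seed as the PR description; everything else here is illustrative.
rng = np.random.default_rng(seed=100)
n_units = 10_000
illustrative_rate = 0.8  # placeholder, not a real parameter value

takes_up_snap_if_eligible = draw_take_up(illustrative_rate, n_units, rng)

# Rebuilding with a fresh generator and the same seed reproduces the column.
rebuilt = draw_take_up(illustrative_rate, n_units, np.random.default_rng(seed=100))
```

Because the booleans are fixed once drawn, changing the rate afterwards has no effect until the column is regenerated, which is the trade-off the description calls out.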
- Create takeup parameter files with rates from NIEER report
  - Head Start: 40% (pre-pandemic), 30% (pandemic 2020-2021)
  - Early Head Start: 9%
- Generate stochastic takeup in CPS dataset using same pattern as SNAP/Medicaid
- Coordinates with policyengine-us PR adding takeup variables
c5844cf to a502f1a
Tests verify:
- Take-up rate parameters load correctly (EITC, SNAP, Medicaid, etc.)
- Seeded RNG produces deterministic results
- Take-up proportions match expected rates

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
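The PR's actual tests are not shown on this page. As a rough sketch of what checks of this kind could look like, the following pytest-style functions test determinism and compare the observed proportion with the target rate within a binomial tolerance. `simulate_take_up` and the rates used are hypothetical stand-ins.

```python
import numpy as np

SEED = 100  # seed named in the PR description


def simulate_take_up(rate: float, n: int, seed: int = SEED) -> np.ndarray:
    """Hypothetical stand-in for the dataset's take-up generation."""
    rng = np.random.default_rng(seed)
    return rng.random(n) < rate


def test_seeded_rng_is_deterministic():
    first = simulate_take_up(0.5, 1_000)
    second = simulate_take_up(0.5, 1_000)
    assert np.array_equal(first, second)


def test_proportion_matches_rate():
    rate, n = 0.3, 50_000  # illustrative values only
    observed = simulate_take_up(rate, n).mean()
    # Allow five binomial standard errors around the target rate.
    tolerance = 5 * np.sqrt(rate * (1 - rate) / n)
    assert abs(observed - rate) < tolerance
```

Scaling the tolerance with the sample size keeps the check meaningful for both small fixtures and full datasets.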
I haven't forgotten about this. I'll get to the review on Tues, Dec 16th.
baogorek
left a comment
@MaxGhenis apologies for taking so long on this review.
My first thought is: do we really want to add the responsibility of dealing with parameters (the take-up rates) to this package when they're already handled in policyengine-us? That's a lot of YAML files if we don't need them, and it feels like duplication.
I'm not crazy about the order of the computations mattering. There will almost certainly be breaks in reproducibility if the generator is used in between these take-up decisions.
Finally, and this is my most theoretical criticism: I was really thinking of a seed as being a potentially multivariate random draw that corresponds to an entity, forever and always. As is, the take-up decisions are entirely independent. That is probably the most practical way to do it, and this could always be an area for enhancement later, but I would ask you to think about the 1-1 assignment between units and random values and whether there's an advantage to doing that.
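One way to picture the 1-1 assignment between units and random values is to key each unit's draws on a stable identifier. This is only a sketch, not what the PR does: `unit_draws`, the program list, and the IDs are hypothetical, and it relies on `np.random.default_rng` accepting a sequence of non-negative integers as its seed.

```python
import numpy as np

BASE_SEED = 100
PROGRAMS = ("snap", "aca", "medicaid")  # column order of each unit's draw vector


def unit_draws(unit_id: int, n_values: int = len(PROGRAMS)) -> np.ndarray:
    """Uniform draws that depend only on (BASE_SEED, unit_id)."""
    rng = np.random.default_rng([BASE_SEED, unit_id])
    return rng.random(n_values)


unit_ids = [17, 42, 99]  # hypothetical stable unit identifiers
draws = np.vstack([unit_draws(u) for u in unit_ids])

# Reordering or dropping other units leaves a given unit's vector unchanged.
reordered = np.vstack([unit_draws(u) for u in reversed(unit_ids)])
```

Creating one generator per unit is slow for large datasets, so a real implementation would likely need a vectorized scheme; the point here is only the keying.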
    )

    # SNAP
    data["takes_up_snap_if_eligible"] = (
Just noting that all these are independent. In the real world, take-up decisions are probably correlated, but I understand that may be beyond the scope of the engine in 2025.
OTOH, we could think about what the latent dimensions are and then create "loadings" from, say, 3 random (even correlated) values to the different take-up seeds. Probably not worth it now, and we could always extend to that later, but just throwing it out there.
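To make the "loadings" idea concrete, here is a rough sketch using a Gaussian latent-factor construction: each program's latent score is a weighted sum of three shared factors plus its own noise, and take-up is decided by thresholding that score at the normal quantile of the target rate. The loadings and rates are invented for illustration, and the factors are drawn independently for simplicity.

```python
from statistics import NormalDist

import numpy as np

rng = np.random.default_rng(100)
n_units = 20_000
programs = ["snap", "medicaid", "aca"]
rates = np.array([0.80, 0.90, 0.60])  # illustrative, not real take-up rates

# Hypothetical loadings: rows are programs, columns are latent factors.
loadings = np.array([
    [0.6, 0.3, 0.0],
    [0.5, 0.0, 0.4],
    [0.3, 0.3, 0.3],
])

factors = rng.standard_normal((n_units, loadings.shape[1]))
noise = rng.standard_normal((n_units, len(programs)))

# Scale the program-specific noise so each latent score has unit variance.
unique_sd = np.sqrt(1.0 - (loadings ** 2).sum(axis=1))
latent = factors @ loadings.T + noise * unique_sd

# A standard normal score falls below the rate's quantile with probability `rate`.
thresholds = np.array([NormalDist().inv_cdf(r) for r in rates])
takes_up = latent < thresholds  # shape (n_units, n_programs)
```

The marginal take-up proportions still match the target rates, while programs that share factor loadings end up positively correlated.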
    PARAMETERS_DIR = Path(__file__).parent


    def load_take_up_rate(variable_name: str, year: int = 2018) -> float:
Why duplicate rather than import? The data package already depends on policyengine-us. Why not just read the parameters from there?
        for c in eitc_child_count
    ]
    )
    data["takes_up_eitc"] = (
Each call to generator.random() advances the RNG state. So the order of these calls matters. If someone reorders these lines, or inserts a new program in the middle, all downstream assignments get different random draws.
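A small demonstration of the ordering concern, plus one possible mitigation: giving each variable its own generator keyed on the variable name. This is a sketch, not the PR's code. The helper names are hypothetical, and `zlib.crc32` is used only because it is a stable hash across processes (unlike Python's built-in `hash` for strings).

```python
import zlib

import numpy as np

BASE_SEED = 100
N = 5


def shared_stream(order):
    """One generator shared by all variables: draws depend on call order."""
    rng = np.random.default_rng(BASE_SEED)
    return {name: rng.random(N) for name in order}


def per_variable_stream(order):
    """One generator per variable, keyed on a stable hash of its name."""
    out = {}
    for name in order:
        key = zlib.crc32(name.encode("utf-8"))
        out[name] = np.random.default_rng([BASE_SEED, key]).random(N)
    return out


before = shared_stream(["snap", "eitc"])
after = shared_stream(["snap", "new_program", "eitc"])
# "eitc" now receives different draws because "new_program" consumed some.
eitc_changed = not np.array_equal(before["eitc"], after["eitc"])

keyed_before = per_variable_stream(["snap", "eitc"])
keyed_after = per_variable_stream(["snap", "new_program", "eitc"])
# With per-variable generators, inserting a program leaves "eitc" untouched.
eitc_stable = np.array_equal(keyed_before["eitc"], keyed_after["eitc"])
```

The trade-off is that per-variable streams make the draws for different variables independent by construction, so any intended correlation has to be introduced explicitly.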
    eitc_takeup_rates = parameters.gov.irs.credits.eitc.takeup
    # Load take-up rates from parameter files
    eitc_rates_by_children = load_take_up_rate("eitc", self.time_period)
The eitc YAML has no values with dates, so what happens if someone adds time-varying EITC rates? Claude thinks it would silently ignore the date:
If someone updated eitc.yaml to:
    rates_by_children:
      0: 0.65
      1: 0.86
      2: 0.85
      3: 0.85
    values:
      2018-01-01: 0.80
      2025-01-01: 0.90
The function would still hit this first and return early:
    if "rates_by_children" in data:
        return data["rates_by_children"]  # never reaches the values lookup
So the dated values would be ignored entirely.
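One way to avoid the silent early return is to fail loudly when a file mixes both shapes and to resolve dated values explicitly. The sketch below is not the PR's `load_take_up_rate`; `resolve_take_up_rate` is a hypothetical helper that works on an already-parsed dictionary and assumes dated keys are `datetime.date` objects and that a value applies from its start date onward.

```python
from datetime import date


def resolve_take_up_rate(data: dict, year: int):
    """Return the rate (or rates by child count) applicable on 1 January of `year`."""
    has_by_children = "rates_by_children" in data
    has_dated = "values" in data

    if has_by_children and has_dated:
        # Refuse ambiguous files instead of silently ignoring one block.
        raise ValueError("file defines both 'rates_by_children' and dated 'values'")
    if has_by_children:
        return data["rates_by_children"]
    if has_dated:
        instant = date(year, 1, 1)
        started = [start for start in data["values"] if start <= instant]
        if not started:
            raise ValueError(f"no dated value in effect for {year}")
        # The most recent start date on or before the instant wins.
        return data["values"][max(started)]
    raise KeyError("expected 'rates_by_children' or 'values'")


dated = {"values": {date(2018, 1, 1): 0.80, date(2025, 1, 1): 0.90}}
by_children = {"rates_by_children": {0: 0.65, 1: 0.86, 2: 0.85, 3: 0.85}}
mixed = {**by_children, **dated}

rate_2024 = resolve_take_up_rate(dated, 2024)
rate_2025 = resolve_take_up_rate(dated, 2025)
```

Here `resolve_take_up_rate(mixed, 2025)` raises instead of returning the undated rates, which surfaces the conflict at build time.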
Summary
This PR moves ALL random number generation from policyengine-us into the dataset generation in policyengine-us-data. The country package is now a purely deterministic rules engine.
Changes
New take-up rate parameters
Added YAML parameter files in policyengine_us_data/parameters/take_up/:
CPS dataset generation
Stochastic variables generated
Take-up decisions (boolean):
Trade-offs
IMPORTANT: Take-up rates can no longer be adjusted dynamically via policy reforms or in the web app. They are fixed in the microdata at generation time. This is an acceptable trade-off for the cleaner architecture of keeping the country package purely deterministic.
To adjust take-up rates for analysis, the microdata must be regenerated with updated parameter values.
Related PRs
🤖 Generated with Claude Code
Co-Authored-By: Claude noreply@anthropic.com