Move all randomness to data package for deterministic country package #451

MaxGhenis · 2025-12-03T22:19:42Z

Summary

This PR moves ALL random number generation from policyengine-us into the dataset generation in policyengine-us-data. The country package is now a purely deterministic rules engine.

⚠️ MERGE ORDER: This PR must be merged BEFORE the companion policyengine-us PR #6635

Changes

New take-up rate parameters

Added YAML parameter files in policyengine_us_data/parameters/take_up/:

snap.yaml (0.82)
medicaid.yaml (0.93)
aca.yaml (0.672)
eitc.yaml (0.65/0.86/0.85 by children)
dc_ptc.yaml (0.32)
head_start.yaml (0.40/0.30)
early_head_start.yaml (0.09)

CPS dataset generation

Load take-up rates from YAML parameter files
Generate all stochastic boolean take-up decisions
Use seeded RNG (seed=100) for full reproducibility
Changed from seed variables to boolean decisions
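A minimal sketch of what the steps above might look like, assuming a flat rate for SNAP and a per-child-count rate for EITC. The helper `draw_take_up`, the unit count, and the synthetic child counts are hypothetical and are not taken from `cps.py`; only the rates and `seed=100` come from this description.

```python
import numpy as np

# Rates copied from the list above; in the PR they are read from YAML files.
SNAP_RATE = 0.82
# Assumption: the last EITC entry is treated as "2 or more children".
EITC_RATES_BY_CHILDREN = {0: 0.65, 1: 0.86, 2: 0.85}


def draw_take_up(rng, rates):
    """Return a boolean array that is True where a uniform draw falls below the rate."""
    rates = np.asarray(rates, dtype=float)
    return rng.random(rates.shape[0]) < rates


rng = np.random.default_rng(seed=100)
n_units = 10_000  # hypothetical number of units

# Flat rate: every unit has the same probability of taking up.
takes_up_snap = draw_take_up(rng, np.full(n_units, SNAP_RATE))

# Rate by child count: look up each unit's rate, then draw once per unit.
child_counts = np.arange(n_units) % 4  # synthetic child counts 0, 1, 2, 3
eitc_rates = np.array(
    [EITC_RATES_BY_CHILDREN[min(int(c), 2)] for c in child_counts]
)
takes_up_eitc = draw_take_up(rng, eitc_rates)
```

Storing the boolean result, rather than the underlying uniform draw, is what fixes the take-up rate in the microdata at generation time.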

Stochastic variables generated

Take-up decisions (boolean):

takes_up_snap_if_eligible
takes_up_aca_if_eligible
takes_up_medicaid_if_eligible
takes_up_eitc (already boolean, now uses YAML rates)
takes_up_dc_ptc (already boolean, now uses YAML rates)
takes_up_head_start_if_eligible
takes_up_early_head_start_if_eligible

Trade-offs

IMPORTANT: Take-up rates can no longer be adjusted dynamically via policy reforms or in the web app. They are fixed in the microdata at generation time. This is an acceptable trade-off for the cleaner architecture of keeping the country package purely deterministic.

To adjust take-up rates for analysis, the microdata must be regenerated with updated parameter values.

Related PRs

policyengine-us: #6635
Follows the same pattern as the UK PRs: policyengine-uk-data#203 (Move all randomness to data package for deterministic country package) and policyengine-uk#1355 (Make country package purely deterministic - read stochastic variables from dataset)

🤖 Generated with Claude Code

Co-Authored-By: Claude noreply@anthropic.com

This change moves ALL random number generation from policyengine-us into the dataset generation in policyengine-us-data. The country package is now a purely deterministic rules engine.

## Key Changes

### policyengine-us-data:

- Add take-up rate YAML parameter files in `parameters/take_up/`
- Generate all stochastic boolean take-up decisions in CPS dataset
- Use seeded RNG (seed=100) for full reproducibility

### Stochastic variables generated:

**Take-up decisions (boolean):**

- takes_up_snap_if_eligible
- takes_up_aca_if_eligible
- takes_up_medicaid_if_eligible
- takes_up_eitc (already boolean)
- takes_up_dc_ptc (already boolean)

All random generation now uses np.random.default_rng(seed=100) for full reproducibility across dataset builds.

## Trade-offs

**IMPORTANT**: Take-up rates can no longer be adjusted dynamically via policy reforms or in the web app. They are fixed in the microdata. This is an acceptable trade-off for the cleaner architecture of keeping the country package purely deterministic.

To adjust take-up rates, the microdata must be regenerated.

Related: policyengine-us PR (must be merged after this)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

- Create takeup parameter files with rates from NIEER report
  - Head Start: 40% (pre-pandemic), 30% (pandemic 2020-2021)
  - Early Head Start: 9%
- Generate stochastic takeup in CPS dataset using same pattern as SNAP/Medicaid
- Coordinates with policyengine-us PR adding takeup variables

Tests verify:

- Take-up rate parameters load correctly (EITC, SNAP, Medicaid, etc.)
- Seeded RNG produces deterministic results
- Take-up proportions match expected rates

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

baogorek · 2025-12-16T00:17:06Z

I haven't forgotten about this. Will get to it (to review) Tues, Dec 16th.

baogorek

@MaxGhenis apologies for taking so long on this review.

My first thought is, do we really want to add the responsibility of dealing with parameters (the take up rates) in this package when they're already handled in policyengine-us? That's a lot of yaml files if we don't need them, and it feels like duplication.

I'm not crazy about the order of the computations mattering. There will almost certainly be breaks in reproducibility if the generator is used in between these take-up decisions.

Finally, and this is my most theoretical criticism: I was really thinking of a seed as being a potentially multivariate random draw that corresponds to an entity, forever and always. As is, the take up decisions are entirely independent. That is probably the most practical way to do it, and this could always be an area for enhancement later, but I would ask you to think about the 1-1 assignment between units and random values and whether there's an advantage of doing that.
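One possible reading of the 1-1 assignment idea, sketched under the assumption that each entity has a stable integer ID across dataset builds: derive the random value from a hash of the ID and the variable name instead of from a shared generator. The names `entity_uniform` and `takes_up` are hypothetical and do not exist in either package.

```python
import hashlib


def entity_uniform(entity_id: int, variable: str, seed: int = 100) -> float:
    """Map (seed, entity_id, variable) to a fixed uniform value in [0, 1)."""
    key = f"{seed}:{entity_id}:{variable}".encode()
    digest = hashlib.sha256(key).digest()
    # Keep the top 53 bits of the first 8 bytes so the result is an exact
    # float strictly below 1.0.
    return (int.from_bytes(digest[:8], "big") >> 11) / 2**53


def takes_up(entity_id: int, variable: str, rate: float) -> bool:
    """Decide take-up by comparing the entity's fixed draw with the rate."""
    return entity_uniform(entity_id, variable) < rate


# The same entity always gets the same draw, whatever else is generated.
u_first = entity_uniform(42, "takes_up_snap_if_eligible")
u_again = entity_uniform(42, "takes_up_snap_if_eligible")
```

Because the draw depends only on the key, adding entities or programs does not change existing assignments. The draws are still independent across variables, so this addresses the assignment question but not correlation between take-up decisions.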

baogorek · 2025-12-16T22:06:36Z

policyengine_us_data/datasets/cps/cps.py

+    )
+
+    # SNAP
+    data["takes_up_snap_if_eligible"] = (


Just noting that all these are independent. Probably in the real world, takeups are correlated, but I understand that may be beyond the scope of the engine in 2025.

OTOH, we could think about what the latent dimensions are and then create "loadings" from, say 3 random, even correlated values to the different take up seeds. Probably not worth it now, and we could always extend to that later, but just throwing it out there.
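To make the "loadings" idea concrete, here is a rough sketch using a Gaussian threshold model: each program's latent score is a weighted sum of shared factors plus independent noise, scaled to unit variance so that thresholding at the normal quantile of the rate preserves the marginal take-up rate. The loadings and the choice of two independent factors are invented for illustration and are not estimates of anything.

```python
from statistics import NormalDist

import numpy as np

RATES = {"snap": 0.82, "medicaid": 0.93, "aca": 0.672}
# Hypothetical loadings on two latent factors (illustrative values only).
LOADINGS = {"snap": [0.6, 0.2], "medicaid": [0.5, 0.4], "aca": [0.1, 0.7]}


def correlated_take_up(n_units, rates, loadings, seed=100):
    """Draw boolean take-up decisions that share latent factors across programs."""
    rng = np.random.default_rng(seed)
    n_factors = len(next(iter(loadings.values())))
    # Shared standard-normal factors, one row per unit.
    factors = rng.standard_normal((n_units, n_factors))
    decisions = {}
    for name, rate in rates.items():
        w = np.asarray(loadings[name], dtype=float)
        explained = float(w @ w)
        if explained >= 1.0:
            raise ValueError(f"Squared loadings for {name} must sum to less than 1")
        # Program-specific noise brings the latent score to unit variance.
        noise = rng.standard_normal(n_units)
        latent = factors @ w + np.sqrt(1.0 - explained) * noise
        # Thresholding at the normal quantile keeps the marginal rate.
        decisions[name] = latent < NormalDist().inv_cdf(rate)
    return decisions


decisions = correlated_take_up(50_000, RATES, LOADINGS)
```

With these loadings, units that take up SNAP are more likely to take up Medicaid, while each program's overall rate stays close to its target.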

baogorek · 2025-12-16T22:23:13Z

policyengine_us_data/parameters/__init__.py

+PARAMETERS_DIR = Path(__file__).parent
+
+
+def load_take_up_rate(variable_name: str, year: int = 2018) -> float:


Why duplicate rather than import? The data package already depends on policyengine-us. Why not just read the parameters from there?

baogorek · 2025-12-16T22:29:06Z

policyengine_us_data/datasets/cps/cps.py

+            for c in eitc_child_count
+        ]
+    )
    data["takes_up_eitc"] = (


Each call to generator.random() advances the RNG state. So the order of these calls matters. If someone reorders these lines, or inserts a new program in the middle, all downstream assignments get different random draws.
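One way to remove the order dependence, sketched here as an assumption rather than as what the PR does, is to give each variable its own generator seeded from the base seed and a stable hash of the variable name. The helpers `rng_for` and `draw` are hypothetical.

```python
import zlib

import numpy as np

BASE_SEED = 100


def rng_for(variable_name, base_seed=BASE_SEED):
    """Return a generator whose stream depends only on the seed and the name."""
    # crc32 is stable across processes, unlike Python's salted built-in hash().
    name_key = zlib.crc32(variable_name.encode())
    return np.random.default_rng([base_seed, name_key])


def draw(variable_name, rate, n_units):
    """Draw boolean take-up decisions from the variable's own stream."""
    return rng_for(variable_name).random(n_units) < rate


snap_first = draw("takes_up_snap_if_eligible", 0.82, 1_000)
# Drawing another program in between does not disturb the SNAP stream.
medicaid = draw("takes_up_medicaid_if_eligible", 0.93, 1_000)
snap_again = draw("takes_up_snap_if_eligible", 0.82, 1_000)
```

With this layout, reordering the lines or inserting a new program leaves every existing variable's draws unchanged, at the cost of constructing one generator per variable.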

baogorek · 2025-12-16T22:33:01Z

policyengine_us_data/datasets/cps/cps.py


-    eitc_takeup_rates = parameters.gov.irs.credits.eitc.takeup
+    # Load take-up rates from parameter files
+    eitc_rates_by_children = load_take_up_rate("eitc", self.time_period)


The eitc YAML has no values with dates, so what happens if someone adds time-varying EITC rates? Claude thinks it would silently ignore the date:

If someone updated eitc.yaml to:

    rates_by_children:
      0: 0.65
      1: 0.86
      2: 0.85
      3: 0.85
    values:
      2018-01-01: 0.80
      2025-01-01: 0.90

The function would still hit this first and return early:

    if "rates_by_children" in data:
        return data["rates_by_children"]  # never reaches the values lookup

So the dated values would be ignored entirely.
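A hedged sketch of how a loader could avoid silently dropping dated values. It works on an already-parsed dict so it does not depend on a YAML library; `resolve_take_up_rate` and `_value_at` are hypothetical names, and both the "value in force on 1 January" rule and the choice to reject mixed files are assumptions, not behavior taken from the PR.

```python
from datetime import date


def _value_at(dated_values: dict, year: int) -> float:
    """Return the value in force on 1 January of `year` (assumed convention)."""
    instant = date(year, 1, 1)
    # YAML loaders may yield date objects or ISO strings as keys; accept both.
    parsed = {date.fromisoformat(str(k)): v for k, v in dated_values.items()}
    applicable = [d for d in parsed if d <= instant]
    if not applicable:
        raise ValueError(f"No value defined on or before {instant}")
    return parsed[max(applicable)]


def resolve_take_up_rate(data: dict, year: int):
    """Resolve a rate, or a per-child-count dict of rates, for `year`."""
    has_children = "rates_by_children" in data
    has_values = "values" in data
    if has_children and has_values:
        # Fail loudly instead of ignoring one of the two sections.
        raise ValueError("File defines both 'rates_by_children' and 'values'")
    if has_children:
        # Each entry may be a plain rate or its own dated mapping.
        return {
            n: _value_at(rate, year) if isinstance(rate, dict) else rate
            for n, rate in data["rates_by_children"].items()
        }
    if has_values:
        return _value_at(data["values"], year)
    raise KeyError("Expected 'rates_by_children' or 'values'")


dated = {"values": {"2018-01-01": 0.80, "2025-01-01": 0.90}}
rate_2024 = resolve_take_up_rate(dated, 2024)
rate_2025 = resolve_take_up_rate(dated, 2025)
```

Raising on a file that mixes the two sections turns the silent early return described above into an explicit error.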

MaxGhenis and others added 3 commits October 5, 2025 15:11

Add changelog entry and remove debug file

a502f1a

MaxGhenis force-pushed the migrate-random-to-data-upstream branch from c5844cf to a502f1a Compare December 3, 2025 22:42

MaxGhenis requested a review from baogorek December 10, 2025 16:14

baogorek requested changes Dec 16, 2025
