@MaxGhenis
Contributor

Summary

This PR moves ALL random number generation from policyengine-us into the dataset generation in policyengine-us-data. The country package is now a purely deterministic rules engine.

⚠️ MERGE ORDER: This PR must be merged BEFORE the companion policyengine-us PR #6635

Changes

New take-up rate parameters

Added YAML parameter files in policyengine_us_data/parameters/take_up/:

  • snap.yaml (0.82)
  • medicaid.yaml (0.93)
  • aca.yaml (0.672)
  • eitc.yaml (0.65/0.86/0.85 by children)
  • dc_ptc.yaml (0.32)
  • head_start.yaml (0.40/0.30)
  • early_head_start.yaml (0.09)

CPS dataset generation

  • Load take-up rates from YAML parameter files
  • Generate all stochastic boolean take-up decisions
  • Use seeded RNG (seed=100) for full reproducibility
  • Store boolean take-up decisions directly instead of seed variables
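
As a rough illustration of the generation step listed above, here is a minimal sketch. It assumes the rates have already been read into a plain dict; the `TAKE_UP_RATES` name, the `generate_take_up` helper, and the dict layout are hypothetical and are not taken from the PR's code. Only the rates, the variable names, and `np.random.default_rng(seed=100)` come from the description.

```python
import numpy as np

# Rates copied from the parameter list in this PR description.
# The dict itself is an illustrative stand-in for the YAML loader.
TAKE_UP_RATES = {"snap": 0.82, "medicaid": 0.93, "aca": 0.672}


def generate_take_up(n_units: int, seed: int = 100) -> dict:
    """Draw one boolean take-up decision per unit and program."""
    rng = np.random.default_rng(seed=seed)
    # Draws are consumed in dict order, so the order of programs
    # affects which uniforms each program receives.
    return {
        f"takes_up_{program}_if_eligible": rng.random(n_units) < rate
        for program, rate in TAKE_UP_RATES.items()
    }


decisions = generate_take_up(10_000)
```

With a fixed seed and a fixed program order, two builds produce identical boolean arrays, and the share of `True` values in each array should sit close to the configured rate.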

Stochastic variables generated

Take-up decisions (boolean):

  • takes_up_snap_if_eligible
  • takes_up_aca_if_eligible
  • takes_up_medicaid_if_eligible
  • takes_up_eitc (already boolean, now uses YAML rates)
  • takes_up_dc_ptc (already boolean, now uses YAML rates)
  • takes_up_head_start_if_eligible
  • takes_up_early_head_start_if_eligible

Trade-offs

IMPORTANT: Take-up rates can no longer be adjusted dynamically via policy reforms or in the web app. They are fixed in the microdata at generation time. This is an acceptable trade-off for the cleaner architecture of keeping the country package purely deterministic.

To adjust take-up rates for analysis, the microdata must be regenerated with updated parameter values.

Related PRs

  • policyengine-us PR #6635 (companion PR; must be merged after this one)

🤖 Generated with Claude Code

Co-Authored-By: Claude noreply@anthropic.com

MaxGhenis and others added 3 commits October 5, 2025 15:11
This change moves ALL random number generation from policyengine-us into the
dataset generation in policyengine-us-data. The country package is now a
purely deterministic rules engine.

## Key Changes

### policyengine-us-data:
- Add take-up rate YAML parameter files in `parameters/take_up/`
- Generate all stochastic boolean take-up decisions in CPS dataset
- Use seeded RNG (seed=100) for full reproducibility

### Stochastic variables generated:
**Take-up decisions (boolean):**
- takes_up_snap_if_eligible
- takes_up_aca_if_eligible
- takes_up_medicaid_if_eligible
- takes_up_eitc (already boolean)
- takes_up_dc_ptc (already boolean)

All random generation now uses np.random.default_rng(seed=100) for full
reproducibility across dataset builds.

## Trade-offs

**IMPORTANT**: Take-up rates can no longer be adjusted dynamically via policy
reforms or in the web app. They are fixed in the microdata. This is an
acceptable trade-off for the cleaner architecture of keeping the country
package purely deterministic. To adjust take-up rates, the microdata must be
regenerated.

Related: policyengine-us PR (must be merged after this)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Create takeup parameter files with rates from NIEER report
- Head Start: 40% (pre-pandemic), 30% (pandemic 2020-2021)
- Early Head Start: 9%
- Generate stochastic takeup in CPS dataset using same pattern as SNAP/Medicaid
- Coordinates with policyengine-us PR adding takeup variables
@MaxGhenis MaxGhenis force-pushed the migrate-random-to-data-upstream branch from c5844cf to a502f1a Compare December 3, 2025 22:42
Tests verify:
- Take-up rate parameters load correctly (EITC, SNAP, Medicaid, etc.)
- Seeded RNG produces deterministic results
- Take-up proportions match expected rates

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@MaxGhenis MaxGhenis requested a review from baogorek December 10, 2025 16:14
@baogorek
Collaborator

I haven't forgotten about this. I'll review it on Tuesday, Dec 16th.

@baogorek baogorek left a comment

@MaxGhenis apologies for taking so long on this review.

My first thought: do we really want this package to take on responsibility for parameters (the take-up rates) when policyengine-us already handles them? That's a lot of YAML files if we don't need them, and it feels like duplication.

I'm not crazy about the order of the computations mattering. Every draw advances the shared generator, so reproducibility will almost certainly break if the generator is used anywhere in between these take-up decisions.

Finally, and this is my most theoretical criticism: I was really thinking of a seed as a potentially multivariate random draw that corresponds to an entity, forever and always. As it stands, the take-up decisions are entirely independent. That is probably the most practical way to do it, and this could always be an area for enhancement later, but I would ask you to think about a 1-1 assignment between units and random values and whether there's an advantage to doing that.
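
To make the 1-1 idea concrete, here is one possible sketch in which each unit's draws are keyed by its ID rather than by its position in a shared stream. The `unit_draws` helper and the use of integer unit IDs are assumptions for illustration; nothing like this exists in the PR. Seeding `np.random.default_rng` with a `[seed, unit_id]` list is supported by NumPy, but building one generator per unit is slow and is shown only for clarity.

```python
import numpy as np


def unit_draws(unit_ids, n_draws: int, seed: int = 100) -> np.ndarray:
    """Return an (n_units, n_draws) array of uniforms keyed by unit ID.

    Each row depends only on (seed, unit ID), not on which other units
    are present or on the order they appear in.
    """
    return np.array(
        [
            np.random.default_rng([seed, int(uid)]).random(n_draws)
            for uid in unit_ids
        ]
    )


# Unit 1 gets the same draws whether or not units 2 and 3 are present.
full = unit_draws([1, 2, 3], n_draws=2)
subset = unit_draws([3, 1], n_draws=2)
```

Under this scheme, column 0 could feed SNAP, column 1 Medicaid, and so on, so a unit keeps its draws even when the sample is filtered or reordered.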

)

# SNAP
data["takes_up_snap_if_eligible"] = (

Just noting that all these are independent. In the real world, take-up decisions are probably correlated, but I understand that may be beyond the scope of the engine in 2025.

OTOH, we could think about what the latent dimensions are and then create "loadings" from, say, three random (possibly correlated) values onto the different take-up seeds. Probably not worth it now, and we could always extend to that later, but just throwing it out there.
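
For what it's worth, the loadings idea could look roughly like the sketch below. It uses a Gaussian latent-factor construction, which is my choice for illustration and not something proposed in the PR; `correlated_take_up` and the loading values are hypothetical. Each program's latent score has unit variance, so thresholding it at the normal quantile of the rate preserves the marginal take-up rate while shared factors induce correlation.

```python
import numpy as np
from statistics import NormalDist


def correlated_take_up(rates, loadings, n_units: int, seed: int = 100):
    """Return an (n_units, n_programs) boolean array of take-up decisions.

    loadings has shape (n_programs, n_factors); each row's squared
    loadings must sum to at most 1.
    """
    rng = np.random.default_rng(seed)
    loadings = np.asarray(loadings, dtype=float)
    shared_var = (loadings**2).sum(axis=1)
    if (shared_var > 1.0).any():
        raise ValueError("squared loadings per program must sum to <= 1")
    factors = rng.standard_normal((n_units, loadings.shape[1]))
    noise = rng.standard_normal((n_units, loadings.shape[0]))
    # Scale idiosyncratic noise so every latent score has variance 1.
    scores = factors @ loadings.T + noise * np.sqrt(1.0 - shared_var)
    thresholds = np.array([NormalDist().inv_cdf(r) for r in rates])
    return scores < thresholds


# Two programs (SNAP and Medicaid rates from this PR) on one shared factor.
decisions = correlated_take_up([0.82, 0.93], [[0.6], [0.6]], n_units=50_000)
```

Setting all loadings to zero recovers the independent draws used in the PR, so this would be a strict extension and could be added later without changing the marginal rates.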

PARAMETERS_DIR = Path(__file__).parent


def load_take_up_rate(variable_name: str, year: int = 2018) -> float:

Why duplicate rather than import? The data package already depends on policyengine-us. Why not just read the parameters from there?

for c in eitc_child_count
]
)
data["takes_up_eitc"] = (

Each call to generator.random() advances the RNG state. So the order of these calls matters. If someone reorders these lines, or inserts a new program in the middle, all downstream assignments get different random draws.
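
One way to remove the order dependence, sketched here as a suggestion only, is to give every variable its own generator derived from the base seed and the variable name. The `rng_for` helper is hypothetical. It hashes the name with `hashlib` because Python's built-in `hash()` is salted per process and would not be reproducible across runs.

```python
import hashlib

import numpy as np


def rng_for(variable_name: str, seed: int = 100) -> np.random.Generator:
    """Return a generator that depends only on (seed, variable_name)."""
    digest = hashlib.sha256(variable_name.encode("utf-8")).digest()
    name_key = int.from_bytes(digest[:8], "big")
    # NumPy accepts a sequence of non-negative ints as seed entropy.
    return np.random.default_rng([seed, name_key])


n = 1_000
snap = rng_for("takes_up_snap_if_eligible").random(n) < 0.82
aca = rng_for("takes_up_aca_if_eligible").random(n) < 0.672
```

Reordering these two lines, or inserting a new program between them, leaves both arrays unchanged because no generator state is shared.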


eitc_takeup_rates = parameters.gov.irs.credits.eitc.takeup
# Load take-up rates from parameter files
eitc_rates_by_children = load_take_up_rate("eitc", self.time_period)

The EITC YAML has no dated values, so what happens if someone adds time-varying EITC rates? Claude thinks the function would silently ignore the dates:

  If someone updated eitc.yaml to:

  rates_by_children:
    0: 0.65
    1: 0.86
    2: 0.85
    3: 0.85
  values:
    2018-01-01: 0.80
    2025-01-01: 0.90

  The function would still hit this first and return early:

  if "rates_by_children" in data:
      return data["rates_by_children"]  # never reaches the values lookup

  So the dated values would be ignored entirely. 
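
A possible guard, sketched under assumptions: the helper below works on the already-parsed YAML dict and refuses files that mix the two layouts instead of picking one silently. `resolve_take_up_rate` is a hypothetical name, and the key names `rates_by_children` and `values` are taken from the example above, not from the actual loader, whose full body is not shown in this thread.

```python
def resolve_take_up_rate(data: dict, year: int):
    """Pick a take-up rate from parsed YAML, failing loudly on ambiguity."""
    has_by_children = "rates_by_children" in data
    has_dated = "values" in data
    if has_by_children and has_dated:
        raise ValueError(
            "file defines both 'rates_by_children' and dated 'values'; "
            "refusing to guess which one applies"
        )
    if has_by_children:
        return data["rates_by_children"]
    if not has_dated:
        raise KeyError("expected 'rates_by_children' or 'values'")
    # Keys may be ISO strings or date objects (YAML parsers often convert
    # unquoted dates); str() gives 'YYYY-MM-DD' in both cases.
    # Dates are compared by calendar year only.
    applicable = {
        str(start): rate
        for start, rate in data["values"].items()
        if int(str(start)[:4]) <= year
    }
    if not applicable:
        raise ValueError(f"no rate defined on or before {year}")
    # ISO date strings sort chronologically, so max() is the latest start.
    return applicable[max(applicable)]


dated = {"values": {"2018-01-01": 0.80, "2025-01-01": 0.90}}
rate_2020 = resolve_take_up_rate(dated, 2020)  # uses the 2018 entry
```

Whether mixed files should be an error, or whether dated values should be nested under each child count, is a design decision for the PR author; the point is only that the ambiguity should not pass silently.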
