Merged
7 changes: 6 additions & 1 deletion .github/workflows/reusable_test.yaml
@@ -45,6 +45,11 @@ jobs:
with:
python-version: '3.13'

- name: Set up Node.js
uses: actions/setup-node@v4
with:
node-version: '24'

- uses: "google-github-actions/auth@v2"
if: inputs.upload_data
with:
@@ -94,5 +99,5 @@ jobs:
uses: JamesIves/github-pages-deploy-action@v4
with:
branch: gh-pages
folder: docs/_build/site
folder: docs/_build/html
clean: true
4 changes: 2 additions & 2 deletions Makefile
@@ -33,8 +33,8 @@ documentation:
rm -rf _build .jupyter_cache && \
rm -f _toc.yml && \
myst clean && \
timeout 10 myst build --html || true
cd docs && test -d _build/site && touch _build/site/.nojekyll || true
myst build --html
cd docs && test -d _build/html && touch _build/html/.nojekyll || true

documentation-build:
cd docs && \
11 changes: 11 additions & 0 deletions changelog_entry.yaml
@@ -0,0 +1,11 @@
- bump: patch
changes:
fixed:
- GitHub Pages documentation deployment (was deploying wrong directory causing blank pages)
- Removed timeout and error suppression from documentation build
added:
- Node.js 24 LTS setup to CI workflow for MyST builds
- H6 Social Security reform calibration for long-term projections (phases out OASDI taxation 2045-2054)
- H6 threshold crossover handling when OASDI thresholds exceed HI thresholds
- start_year parameter to run_household_projection.py CLI
- docs/README.md documenting MyST build output pitfall
46 changes: 46 additions & 0 deletions docs/README.md
@@ -0,0 +1,46 @@
# Documentation

This project uses [MyST Markdown](https://mystmd.org/) for documentation.

## Building Locally

### Requirements
- Python 3.13+ with dev dependencies: `uv pip install -e .[dev] --system`
- Node.js 20+ (required by MyST)

### Commands
```bash
make documentation # Build static HTML files
make documentation-serve # Serve locally on http://localhost:8080
```

## Important: MyST Build Outputs

**MyST creates two different outputs - DO NOT confuse them:**

- `_build/html/` - **Static HTML files (use for GitHub Pages deployment)**
- `_build/site/` - Dynamic content for `myst start` development server only

**GitHub Pages must deploy `_build/html/`**, not `_build/site/`. The `_build/site/` directory contains JSON files for MyST's development server and will result in a blank page on GitHub Pages.
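
A pre-deploy check along the following lines could catch the wrong-directory mistake before it reaches GitHub Pages. This is a minimal sketch, not part of the project's tooling: the function name `check_pages_dir` is hypothetical, and it assumes only that a deployable build is a directory named `html` containing `index.html` and the `.nojekyll` marker that the Makefile creates.

```python
from pathlib import Path


def check_pages_dir(build_dir):
    """Return a list of problems that make build_dir unsafe to deploy.

    Hypothetical helper: an empty list means the directory looks like
    the static HTML output rather than the dev-server output.
    """
    root = Path(build_dir)
    problems = []
    # _build/site holds JSON for `myst start`, not deployable static HTML.
    if root.name != "html":
        problems.append(f"expected a directory named 'html', got '{root.name}'")
    # A static build should contain an entry page.
    if not (root / "index.html").is_file():
        problems.append("index.html is missing")
    # The Makefile touches this marker after a successful build.
    if not (root / ".nojekyll").is_file():
        problems.append(".nojekyll is missing")
    return problems
```

Calling `check_pages_dir("docs/_build/html")` after `make documentation` and failing the job when the returned list is non-empty would surface a bad build as an error instead of a blank page.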

## GitHub Pages Deployment

- Site URL: https://policyengine.github.io/policyengine-us-data/
- Deployed from: `docs/_build/html/` directory
- Propagation time: 5-10 minutes after push to gh-pages branch
- Workflow: `.github/workflows/code_changes.yaml` (on main branch only)

## Troubleshooting

**Blank page after deployment:**
- Check that workflow deploys `folder: docs/_build/html` (not `_build/site`)
- Wait 5-10 minutes for GitHub Pages propagation
- Hard refresh browser (Ctrl+Shift+R / Cmd+Shift+R)

**Build fails in CI:**
- Ensure Node.js setup step exists in workflow (MyST requires Node.js)
- Never add timeouts or `|| true` to build commands - they mask failures

**Missing index.html:**
- MyST auto-generates index.html in `_build/html/`
- Do not create manual index.html in docs/
11 changes: 1 addition & 10 deletions docs/abstract.md
@@ -6,15 +6,6 @@ quantile regression forests to impute 67 tax variables from the PUF onto CPS rec
preserving distributional characteristics while maintaining household composition and member
relationships. The imputation process alone does not guarantee consistency with official
statistics, necessitating a reweighting step to align the combined dataset with known
population totals and administrative benchmarks. We apply a reweighting algorithm that
calibrates the dataset to 2,813 targets from
the IRS Statistics of Income, Census population projections, Congressional Budget
Office benefit program estimates, Treasury
expenditure data, Joint Committee on Taxation tax expenditure estimates, healthcare
spending patterns, and other benefit program costs. The reweighting employs dropout-regularized
gradient descent optimization
to ensure consistency with administrative benchmarks. Validation shows the enhanced dataset
reduces error in key tax components by [TO BE CALCULATED]% relative to the baseline CPS.
The dataset maintains the CPS's demographic detail and geographic granularity while
population totals and administrative benchmarks. We apply a reweighting algorithm that calibrates the dataset to 2,813 targets from the IRS Statistics of Income, Census population projections, Congressional Budget Office benefit program estimates, Treasury expenditure data, Joint Committee on Taxation tax expenditure estimates, healthcare spending patterns, and other benefit program costs. The reweighting employs dropout-regularized gradient descent optimization to ensure consistency with administrative benchmarks. The dataset maintains the CPS's demographic detail and geographic granularity while
incorporating tax reporting data from administrative sources. We release the enhanced
dataset, source code, and documentation to support policy analysis.
93 changes: 92 additions & 1 deletion docs/appendix.md
@@ -46,4 +46,95 @@ for iteration in range(5000):

### Table A1: Complete List of Imputed Variables

[TO BE GENERATED - Complete list of 72 imputed variables from PUF organized by category]
#### Variables Imputed from IRS Public Use File (67 variables)

**Income Variables:**
- employment_income
- partnership_s_corp_income
- social_security
- taxable_pension_income
- tax_exempt_pension_income
- long_term_capital_gains
- short_term_capital_gains
- taxable_ira_distributions
- self_employment_income
- qualified_dividend_income
- non_qualified_dividend_income
- rental_income
- taxable_unemployment_compensation
- taxable_interest_income
- tax_exempt_interest_income
- estate_income
- miscellaneous_income
- farm_income
- alimony_income
- farm_rent_income
- non_sch_d_capital_gains
- long_term_capital_gains_on_collectibles
- unrecaptured_section_1250_gain
- salt_refund_income

**Deductions and Adjustments:**
- interest_deduction
- unreimbursed_business_employee_expenses
- pre_tax_contributions
- charitable_cash_donations
- self_employed_pension_contribution_ald
- domestic_production_ald
- self_employed_health_insurance_ald
- charitable_non_cash_donations
- alimony_expense
- health_savings_account_ald
- student_loan_interest
- investment_income_elected_form_4952
- early_withdrawal_penalty
- educator_expense
- deductible_mortgage_interest

**Tax Credits:**
- cdcc_relevant_expenses
- foreign_tax_credit
- american_opportunity_credit
- general_business_credit
- energy_efficient_home_improvement_credit
- amt_foreign_tax_credit
- excess_withheld_payroll_tax
- savers_credit
- prior_year_minimum_tax_credit
- other_credits

**Qualified Business Income Variables:**
- w2_wages_from_qualified_business
- unadjusted_basis_qualified_property
- business_is_sstb
- qualified_reit_and_ptp_income
- qualified_bdc_income
- farm_operations_income
- estate_income_would_be_qualified
- farm_operations_income_would_be_qualified
- farm_rent_income_would_be_qualified
- partnership_s_corp_income_would_be_qualified
- rental_income_would_be_qualified
- self_employment_income_would_be_qualified

**Other Tax Variables:**
- traditional_ira_contributions
- qualified_tuition_expenses
- casualty_loss
- unreported_payroll_tax
- recapture_of_investment_credit

#### Variables Imputed from Survey of Income and Program Participation (1 variable)

- tip_income

#### Variables Imputed from Survey of Consumer Finances (3 variables)

- networth
- auto_loan_balance
- auto_loan_interest

#### Variables Imputed from American Community Survey (2 variables)

- rent
- real_estate_taxes
2 changes: 1 addition & 1 deletion docs/conclusion.md
@@ -18,7 +18,7 @@ Our work makes several key contributions:

The validation results demonstrate that combining survey and administrative data through principled statistical methods can achieve:
- Improved income distribution representation
- Better alignment with program participation totals
- Better alignment with program participation totals
- Maintained demographic and geographic detail
- Suitable accuracy for policy simulation

4 changes: 2 additions & 2 deletions docs/discussion.md
@@ -8,7 +8,7 @@ We examine the strengths, limitations, and potential applications of the Enhance

The Enhanced CPS uniquely combines:
- Demographic detail from the CPS including state identifiers
- Tax precision from IRS administrative data
- Tax precision from IRS administrative data
- Calibration to contemporary official statistics
- Open-source availability for research use

@@ -26,7 +26,7 @@ The large-scale calibration to 2,813 targets ensures consistency with administra

### Practical Advantages

For policy analysis, the dataset offers state-level geographic detail enabling subnational analysis, household structure for distributional studies, tax detail for revenue estimation, program participation for benefit analysis, and recent data calibrated to current totals.
For policy analysis, the dataset offers several key features: state-level geographic detail for subnational analysis, household structure for distributional studies, tax detail for revenue estimation, program participation for benefit analysis, and calibration to current administrative totals.

## Limitations

6 changes: 3 additions & 3 deletions docs/introduction.md
@@ -1,10 +1,10 @@
# Introduction

Microsimulation models require high-quality microdata that accurately represents both demographic characteristics and economic outcomes. The ideal dataset would combine the demographic richness and household structure of surveys with the income precision of administrative tax records. However, publicly available datasets typically excel in one dimension while lacking in the other.
Microsimulation models require high-quality microdata that accurately represent demographic characteristics and economic outcomes. The ideal dataset would combine the demographic richness and household structure of surveys with the income precision of administrative tax records. However, publicly available datasets typically excel in one dimension while lacking in the other.

The Current Population Survey (CPS) Annual Social and Economic Supplement provides detailed household demographics, family relationships, and program participation data for a representative sample of US households. However, it suffers from well-documented income underreporting, particularly at the top of the distribution. The IRS Public Use File (PUF) contains accurate tax return information but lacks household structure, demographic detail, and state identifiers needed for comprehensive policy analysis.

This paper presents a methodology for creating an Enhanced CPS dataset that combines the strengths of both sources. Through an enhancement process - imputation followed by reweighting - we create a dataset suitable for analyzing both tax and transfer policies at federal and state levels.
This paper presents a methodology for creating an Enhanced CPS dataset that combines the strengths of both sources. Through an enhancement process (imputation followed by reweighting), we create a dataset suitable for analyzing both tax and transfer policies at federal and state levels.

## Related Work

@@ -24,4 +24,4 @@ Our empirical contribution involves creating and validating a publicly available

From a practical perspective, we provide open-source tools and comprehensive documentation that enable researchers to apply these methods, modify the approach, or build upon our work. This transparency contrasts with existing proprietary models and supports reproducible research. Government agencies could use our framework to enhance their own microsimulation capabilities, while academic researchers gain access to data suitable for analyzing distributional impacts of tax and transfer policies. The modular design allows incremental improvements as new data sources become available.

We organize the remainder of this paper as follows. Section 2 describes our data sources including the primary datasets and calibration targets. Section 3 details the enhancement methodology including both the imputation and reweighting stages. Section 4 presents validation results comparing performance across datasets. Section 5 discusses limitations, applications, and future directions. Section 6 concludes with implications for policy analysis.
We organize the remainder of this paper as follows. Section 2 describes our data sources including the primary datasets and calibration targets. Section 3 details the enhancement methodology including both the imputation and reweighting stages. Section 4 presents validation results comparing performance across datasets. Section 5 discusses limitations, applications, and future directions. Section 6 concludes with implications for policy analysis.
@@ -3,10 +3,17 @@
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Comparison to Penn Wharton Budget Model: Eliminating Tax on Social Security 2025-2100\n",
"## Integrating Economic Uprating with Demographic Reweighting"
]
"source": "# Long Term Projections\n## Integrating Economic Uprating with Demographic Reweighting"
},
{
"cell_type": "markdown",
"source": "## Executive Summary\n\nThis document outlines an innovative approach for projecting federal income tax revenue through 2100 that uniquely combines sophisticated economic microsimulation with demographic reweighting. By harmonizing PolicyEngine's state-of-the-art tax modeling with Social Security Administration demographic projections, we can isolate and quantify the fiscal impact of population aging while preserving the full complexity of the tax code.",
"metadata": {}
},
{
"cell_type": "markdown",
"source": "## The Challenge\n\nProjecting tax revenue over a 75-year horizon requires simultaneously modeling two distinct but interrelated dynamics:\n\n**Economic Evolution**: How incomes, prices, and tax parameters change over time\n- Wage growth and income distribution shifts\n- Inflation affecting brackets and deductions\n- Legislative changes and indexing rules\n- Behavioral responses to tax policy\n\n**Demographic Transformation**: How the population structure evolves\n- Baby boom generation aging through retirement\n- Declining birth rates reducing working-age population\n- Increasing longevity extending retirement duration\n- Shifting household composition patterns\n\nTraditional approaches typically sacrifice either economic sophistication (using simplified tax calculations) or demographic realism (holding age distributions constant). Our methodology preserves both.",
"metadata": {}
},
{
"cell_type": "markdown",
@@ -176,17 +183,6 @@
"- `--save-h5`: Save year-specific .h5 files to `./projected_datasets/` directory"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"## Executive Summary\n",
"\n",
"This document outlines an innovative approach for projecting federal income tax revenue through 2100 that uniquely combines sophisticated economic microsimulation with demographic reweighting. By harmonizing PolicyEngine's state-of-the-art tax modeling with Social Security Administration demographic projections, we can isolate and quantify the fiscal impact of population aging while preserving the full complexity of the tax code."
]
},
{
"cell_type": "markdown",
"metadata": {},
@@ -210,13 +206,6 @@
"Traditional approaches typically sacrifice either economic sophistication (using simplified tax calculations) or demographic realism (holding age distributions constant). Our methodology preserves both."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Loading and Exploring the Data"
]
},
{
"cell_type": "markdown",
"metadata": {},
@@ -1023,4 +1012,4 @@
},
"nbformat": 4,
"nbformat_minor": 4
}
}
17 changes: 15 additions & 2 deletions docs/methodology.md
@@ -147,7 +147,20 @@ From the American Community Survey (ACS), we impute property taxes for homeowner

### Example: Tip Income Imputation

To illustrate how QRF preserves conditional distributions, consider tip income imputation. The training data from SIPP contains workers with employment income and tip income. For a worker with predictors of $30,000 employment income, age 25, and no children, QRF finds that similar workers in SIPP have a conditional distribution ranging from $0 at the 10th percentile (no tips) to $2,000 at the median, $8,000 at the 90th percentile, and $15,000 at the 99th percentile. If the random quantile drawn is 0.85, the imputed tip income would be approximately $6,500. This approach ensures that some similar workers receive no tips while others receive substantial tips, preserving realistic variation.
To illustrate how QRF preserves conditional distributions, consider tip income imputation. The training data from SIPP contains workers with employment income and tip income.

For a worker with the following characteristics:
- Employment income: \$30,000
- Age: 25
- Number of children: 0

QRF finds that similar workers in SIPP have a conditional distribution of tip income:
- 10th percentile: \$0 (no tips)
- 50th percentile: \$2,000
- 90th percentile: \$8,000
- 99th percentile: \$15,000

If the random quantile drawn is 0.85, the imputed tip income would be approximately \$6,500. This approach ensures that some similar workers receive no tips while others receive substantial tips, preserving realistic variation.
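
The quantile lookup in this example can be sketched in a few lines of Python. This is an illustration only, not the QRF implementation: the names `QUANTILES`, `tip_at_quantile`, and `impute_tip` are hypothetical, and the sketch assumes straight-line interpolation between the four percentiles listed above. Under that assumption a draw of 0.85 maps to about \$7,250, somewhat above the approximately \$6,500 quoted above, which presumably reflects the shape of the full conditional distribution between those points.

```python
from bisect import bisect_right

# Conditional tip-income percentiles for the example worker (from the text above).
QUANTILES = [(0.10, 0.0), (0.50, 2000.0), (0.90, 8000.0), (0.99, 15000.0)]


def tip_at_quantile(q, table=QUANTILES):
    """Linearly interpolate tip income at quantile q, clamping at both ends.

    Assumption: piecewise-linear interpolation between the listed
    percentiles; a fitted QRF would use the full set of leaf values.
    """
    levels = [level for level, _ in table]
    values = [value for _, value in table]
    if q <= levels[0]:
        return values[0]
    if q >= levels[-1]:
        return values[-1]
    i = bisect_right(levels, q)  # index of the first listed level above q
    lo_q, hi_q = levels[i - 1], levels[i]
    lo_v, hi_v = values[i - 1], values[i]
    return lo_v + (q - lo_q) / (hi_q - lo_q) * (hi_v - lo_v)


def impute_tip(rng):
    """Draw a uniform random quantile and map it to a tip income."""
    return tip_at_quantile(rng.random())
```

Repeated calls such as `impute_tip(random.Random(0))` with different seeds give \$0 for draws at or below the 10th percentile and larger values for higher draws, which is the variation across similar workers that the paragraph describes.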

## Stage 2: Reweighting

@@ -185,7 +198,7 @@ The calibration process incorporates tax and benefit calculations through Policy

### Convergence

The optimization converges within iterations. We monitor convergence through the loss value trajectory, weight stability across iterations, and target achievement rates.
The optimization converges within 500 epochs. We monitor convergence through the loss value trajectory, weight stability across iterations, and target achievement rates.
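
The three diagnostics named here could be computed roughly as follows. This is a hedged sketch rather than the project's monitoring code: the function names and the 5% tolerance are assumptions made for illustration, and the inputs are plain Python lists of per-epoch losses, household weights, and target values.

```python
def relative_loss_change(losses):
    """Relative change between the last two recorded loss values."""
    previous, current = losses[-2], losses[-1]
    return abs(current - previous) / max(abs(previous), 1e-12)


def max_weight_change(old_weights, new_weights):
    """Largest relative change in any single weight between two epochs."""
    return max(
        abs(new - old) / max(abs(old), 1e-12)
        for old, new in zip(old_weights, new_weights)
    )


def target_achievement_rate(estimates, targets, rel_tol=0.05):
    """Share of targets whose weighted estimate is within rel_tol of the target.

    rel_tol=0.05 is an illustrative threshold, not a value from the source.
    """
    hits = sum(
        abs(est - tgt) <= rel_tol * abs(tgt)
        for est, tgt in zip(estimates, targets)
    )
    return hits / len(targets)
```

A flat loss trajectory alone can hide weights that are still drifting, so checking all three together gives a more reliable signal than any one of them.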

## Validation

2 changes: 1 addition & 1 deletion docs/myst.yml
@@ -24,7 +24,7 @@ project:
- file: background.md
- file: data.md
- file: methodology.md
- file: pwbm_ss_comparison_2025_2100.ipynb
- file: long_term_projections.ipynb
- file: discussion.md
- file: conclusion.md
- file: appendix.md