Merged
1 change: 1 addition & 0 deletions .github/workflows/code_changes.yaml

```diff
@@ -3,6 +3,7 @@
 name: Code changes
 on:
   workflow_call:
+  workflow_dispatch:
   push:
     branches:
       - main
```
2 changes: 1 addition & 1 deletion .github/workflows/reusable_test.yaml

```diff
@@ -94,5 +94,5 @@ jobs:
       uses: JamesIves/github-pages-deploy-action@v4
       with:
         branch: gh-pages
-        folder: docs/_build/html
+        folder: docs/_build/site
         clean: true
```
3 changes: 3 additions & 0 deletions .gitignore

```diff
@@ -21,3 +21,6 @@ node_modules
 !age_state.csv
 !agi_state.csv
 !soi_targets.csv
+!policyengine_us_data/storage/social_security_aux.csv
+!policyengine_us_data/storage/SSPopJul_TR2024.csv
+docs/.ipynb_checkpoints/
```
4 changes: 2 additions & 2 deletions Makefile

```diff
@@ -34,7 +34,7 @@ documentation:
 	rm -f _toc.yml && \
 	myst clean && \
 	timeout 10 myst build --html || true
-	cd docs && test -d _build/html && touch _build/html/.nojekyll || true
+	cd docs && test -d _build/site && touch _build/site/.nojekyll || true
 
 documentation-build:
 	cd docs && \
@@ -44,7 +44,7 @@ documentation-build:
 	myst build --html
 
 documentation-serve:
-	cd docs/_build/html && python3 -m http.server 8080
+	cd docs/_build/site && python3 -m http.server 8080
 
 documentation-dev:
 	cd docs && \
```
8 changes: 8 additions & 0 deletions README.md

```diff
@@ -14,6 +14,14 @@ which installs the development dependencies in a reference-only manner (so that
 to the package code will be reflected immediately); `policyengine-us-data` is a dev package
 and not intended for direct access.
 
+## SSA Data Sources
+
+The following SSA data sources are used in this project:
+
+- [Latest Trustees Report (2025)](https://www.ssa.gov/oact/TR/2025/index.html) - Source for `social_security_aux.csv` (extracted via `extract_ssa_costs.py`)
+- [Single Year Supplementary Tables (2025)](https://www.ssa.gov/oact/tr/2025/lrIndex.html) - Long-range demographic and economic projections
+- [Single Year Age Demographic Projections (2024 - latest published)](https://www.ssa.gov/oact/HistEst/Population/2024/Population2024.html) - Source for `SSPopJul_TR2024.csv` population data
+
 ## Building the Paper
 
 ### Prerequisites
```
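For orientation, the two storage files named above can be inspected directly. This is a hypothetical usage sketch: the file paths come from the diffs in this PR, but the column layout is an assumption to verify against the actual CSVs.

```python
# Hypothetical inspection of the SSA-derived storage files; the paths appear
# in the .gitignore and README diffs above, but the schemas are not shown there.
import pandas as pd

costs = pd.read_csv("policyengine_us_data/storage/social_security_aux.csv")
population = pd.read_csv("policyengine_us_data/storage/SSPopJul_TR2024.csv")

print(costs.head())       # Trustees Report cost projections (schema unverified)
print(population.head())  # July single-year-of-age population (schema unverified)
```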
13 changes: 13 additions & 0 deletions changelog_entry.yaml

```diff
@@ -0,0 +1,13 @@
+- bump: minor
+  changes:
+    added:
+      - Additional calibration based on SSA Trustees data that extends projections until 2100
+      - Manual trigger capability for documentation deployment workflow
+      - Documentation for SSA data sources in storage README
+    changed:
+      - Renamed long-term projections notebook to clarify PWBM comparison scope (2025-2100)
+    fixed:
+      - GitHub Pages documentation deployment path
+      - Corrected number of imputed variables from 72 to 67 in documentation
+      - Corrected calibration target count from 7,000+ to 2,813 across all docs
+      - Removed inaccurate "two-stage" terminology in methodology descriptions
```
19 changes: 18 additions & 1 deletion docs/abstract.md

```diff
@@ -1,3 +1,20 @@
 # Abstract
 
-We present a methodology for creating enhanced microsimulation datasets by combining the Current Population Survey (CPS) with the IRS Public Use File (PUF). Our two-stage approach uses quantile regression forests to impute 72 tax variables from the PUF onto CPS records, preserving distributional characteristics while maintaining household composition and member relationships. The imputation process alone does not guarantee consistency with official statistics, necessitating a reweighting step to align the combined dataset with known population totals and administrative benchmarks. We apply a reweighting algorithm that calibrates the dataset to over 7,000 targets from six sources: IRS Statistics of Income, Census population projections, Congressional Budget Office program estimates, Treasury expenditure data, Joint Committee on Taxation tax expenditure estimates, and healthcare spending patterns. The reweighting employs dropout-regularized gradient descent optimization to ensure consistency with administrative benchmarks. Validation shows the enhanced dataset reduces error in key tax components by [TO BE CALCULATED]% relative to the baseline CPS. The dataset maintains the CPS's demographic detail and geographic granularity while incorporating tax reporting data from administrative sources. We release the enhanced dataset, source code, and documentation to support policy analysis.
+We present a methodology for creating enhanced microsimulation datasets by combining the
+Current Population Survey (CPS) with the IRS Public Use File (PUF). Our approach uses
+quantile regression forests to impute 67 tax variables from the PUF onto CPS records,
+preserving distributional characteristics while maintaining household composition and member
+relationships. The imputation process alone does not guarantee consistency with official
+statistics, necessitating a reweighting step to align the combined dataset with known
+population totals and administrative benchmarks. We apply a reweighting algorithm that
+calibrates the dataset to 2,813 targets from
+the IRS Statistics of Income, Census population projections, Congressional Budget
+Office benefit program estimates, Treasury
+expenditure data, Joint Committee on Taxation tax expenditure estimates, healthcare
+spending patterns, and other benefit program costs. The reweighting employs dropout-regularized
+gradient descent optimization
+to ensure consistency with administrative benchmarks. Validation shows the enhanced dataset
+reduces error in key tax components by [TO BE CALCULATED]% relative to the baseline CPS.
+The dataset maintains the CPS's demographic detail and geographic granularity while
+incorporating tax reporting data from administrative sources. We release the enhanced
+dataset, source code, and documentation to support policy analysis.
```
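The "dropout-regularized gradient descent" step in the abstract can be illustrated with a minimal sketch. This is not the repository's implementation: the loss function, the matrix layout (`M.T @ weights` yielding estimated totals), and the hyperparameters are all assumptions.

```python
# Minimal sketch of dropout-regularized gradient-descent reweighting.
# Assumptions: M is a (households x targets) estimate matrix such that
# M.T @ weights gives the dataset's estimate of each of the 2,813 targets.
import torch

def reweight(M: torch.Tensor, targets: torch.Tensor,
             epochs: int = 2000, dropout: float = 0.05,
             lr: float = 0.1) -> torch.Tensor:
    # Optimize in log space so household weights stay strictly positive.
    log_w = torch.zeros(M.shape[0], requires_grad=True)
    opt = torch.optim.Adam([log_w], lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        # Dropout regularization: randomly zero a share of households each
        # step so the fit cannot rely on a handful of extreme records.
        mask = (torch.rand_like(log_w) > dropout).float() / (1 - dropout)
        estimates = M.T @ (log_w.exp() * mask)
        # Relative error keeps targets of very different magnitudes comparable.
        loss = (((estimates - targets) / (targets.abs() + 1)) ** 2).mean()
        loss.backward()
        opt.step()
    return log_w.exp().detach()
```

Optimizing log-weights keeps the weights positive without explicit constraints, and the relative-error loss puts dollar-denominated totals and count targets on a comparable scale.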
2 changes: 1 addition & 1 deletion docs/conclusion.md

```diff
@@ -6,7 +6,7 @@ We present a methodology for creating enhanced microsimulation datasets that com
 
 Our work makes several key contributions:
 
-**Methodological Innovation**: The use of Quantile Regression Forests for imputation preserves distributional characteristics while maintaining computational efficiency. The large-scale calibration to 7,000+ targets pushes the boundaries of survey data enhancement.
+**Methodological Innovation**: The use of Quantile Regression Forests for imputation preserves distributional characteristics while maintaining computational efficiency. The large-scale calibration to 2,813 targets pushes the boundaries of survey data enhancement.
 
 **Practical Tools**: We provide open-source implementations that enable researchers to apply, modify, and extend these methods. The modular design facilitates experimentation with alternative approaches.
```
4 changes: 2 additions & 2 deletions docs/discussion.md

```diff
@@ -22,7 +22,7 @@ The use of Quantile Regression Forests for imputation represents an advance over
 - Maintains realistic variable correlations
 - Allows uncertainty quantification
 
-The large-scale calibration to 7,000+ targets ensures consistency with administrative benchmarks across multiple dimensions simultaneously.
+The large-scale calibration to 2,813 targets ensures consistency with administrative benchmarks across multiple dimensions simultaneously.
 
 ### Practical Advantages
 
@@ -44,7 +44,7 @@ These assumptions may not hold perfectly, particularly for subpopulations that t
 
 ### Calibration Trade-offs
 
-With 7,000+ targets, perfect fit to all benchmarks is impossible. The optimization must balance competing objectives across target types, the relative importance of different statistics, stability of resulting weights, and preservation of household relationships.
+With 2,813 targets, perfect fit to all benchmarks is impossible. The optimization must balance competing objectives across target types, the relative importance of different statistics, stability of resulting weights, and preservation of household relationships.
 
 Users should consult validation metrics for targets most relevant to their analysis.
```
6 changes: 3 additions & 3 deletions docs/introduction.md

```diff
@@ -4,7 +4,7 @@ Microsimulation models require high-quality microdata that accurately represents
 
 The Current Population Survey (CPS) Annual Social and Economic Supplement provides detailed household demographics, family relationships, and program participation data for a representative sample of US households. However, it suffers from well-documented income underreporting, particularly at the top of the distribution. The IRS Public Use File (PUF) contains accurate tax return information but lacks household structure, demographic detail, and state identifiers needed for comprehensive policy analysis.
 
-This paper presents a methodology for creating an Enhanced CPS dataset that combines the strengths of both sources. Through a two-stage enhancement process—imputation followed by reweighting—we create a dataset suitable for analyzing both tax and transfer policies at federal and state levels.
+This paper presents a methodology for creating an Enhanced CPS dataset that combines the strengths of both sources. Through an enhancement process—imputation followed by reweighting—we create a dataset suitable for analyzing both tax and transfer policies at federal and state levels.
 
 ## Related Work
 
@@ -14,13 +14,13 @@ Economic researchers address dataset limitations through various strategies. The
 
 Statistical agencies and researchers employ reweighting methods to align survey data with administrative totals. The Luxembourg Income Study uses calibration to improve cross-national comparability {cite:p}`gornick2013`. The Urban-Brookings Tax Policy Center employs reweighting in their microsimulation model but relies on proprietary data that cannot be shared publicly {cite:p}`khitatrakun2016`.
 
-Our approach differs from previous efforts in three key ways. First, we employ quantile regression forests to preserve distributional characteristics during imputation, improving upon traditional hot-deck and regression-based methods that may distort variable relationships. We conduct robustness checks comparing QRF performance to gradient boosting and neural network approaches, finding QRF provides the best balance of accuracy and interpretability. Second, we calibrate to over 7,000 targets from multiple administrative sources, far exceeding the scope of previous calibration efforts which typically use fewer than 100 targets. Third, we provide a fully open-source implementation enabling reproducibility and collaborative improvement, addressing the transparency limitations of existing proprietary models.
+Our approach differs from previous efforts in three key ways. First, we employ quantile regression forests to preserve distributional characteristics during imputation, improving upon traditional hot-deck and regression-based methods that may distort variable relationships. We conduct robustness checks comparing QRF performance to gradient boosting and neural network approaches, finding QRF provides the best balance of accuracy and interpretability. Second, we calibrate to 2,813 targets from multiple administrative sources, far exceeding the scope of previous calibration efforts which typically use fewer than 100 targets. Third, we provide a fully open-source implementation enabling reproducibility and collaborative improvement, addressing the transparency limitations of existing proprietary models.
 
 ## Contributions
 
 This paper makes three main contributions to the economic and public policy literature. Methodologically, we demonstrate how quantile regression forests can effectively impute detailed tax variables while preserving their joint distribution and relationship to demographics. This advances the statistical matching literature by showing how modern machine learning methods can overcome limitations of traditional hot-deck and parametric approaches. The preservation of distributional characteristics is particularly important for tax policy analysis where outcomes often depend on complex interactions between income sources and household characteristics.
 
-Our empirical contribution involves creating and validating a publicly available enhanced dataset that addresses longstanding data limitations in microsimulation modeling. By combining the demographic richness of the CPS with the tax precision of the PUF, we enable analyses that were previously infeasible with public data. The dataset's calibration to over 7,000 administrative targets ensures consistency with official statistics across multiple dimensions simultaneously.
+Our empirical contribution involves creating and validating a publicly available enhanced dataset that addresses longstanding data limitations in microsimulation modeling. By combining the demographic richness of the CPS with the tax precision of the PUF, we enable analyses that were previously infeasible with public data. The dataset's calibration to 2,813 administrative targets ensures consistency with official statistics across multiple dimensions simultaneously.
 
 From a practical perspective, we provide open-source tools and comprehensive documentation that enable researchers to apply these methods, modify the approach, or build upon our work. This transparency contrasts with existing proprietary models and supports reproducible research. Government agencies could use our framework to enhance their own microsimulation capabilities, while academic researchers gain access to data suitable for analyzing distributional impacts of tax and transfer policies. The modular design allows incremental improvements as new data sources become available.
```
4 changes: 2 additions & 2 deletions docs/methodology.md

````diff
@@ -1,6 +1,6 @@
 # Methodology
 
-We create the Enhanced CPS dataset through a two-stage process: imputation followed by reweighting. The imputation stage creates a copy of the CPS and uses Quantile Regression Forests to impute tax variables from the PUF onto this copy, creating the Extended CPS. The reweighting stage then optimizes household weights to match administrative targets, producing the Enhanced CPS with weights calibrated to statistics.
+We create the Enhanced CPS dataset through imputation followed by reweighting. The imputation stage creates a copy of the CPS and uses Quantile Regression Forests to impute tax variables from the PUF onto this copy, creating the Extended CPS. The reweighting stage then optimizes household weights to match administrative targets, producing the Enhanced CPS with weights calibrated to statistics.
 
 ```mermaid
 graph TD
@@ -37,7 +37,7 @@ graph TD
 
     Extended["Extended CPS - 2x households"]:::data
 
-    Targets{{"Administrative Targets - 7000+"}}:::data
+    Targets{{"Administrative Targets - 2,813"}}:::data
 
     Reweight("Reweight Optimization"):::process
 
````
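As context for the imputation stage described above, here is a minimal sketch of QRF-based imputation, assuming the open-source quantile-forest package and synthetic stand-in data; the repository's actual imputation code may differ.

```python
# Illustrative QRF imputation sketch using the quantile-forest package
# (pip install quantile-forest); synthetic arrays stand in for PUF/CPS data.
import numpy as np
from quantile_forest import RandomForestQuantileRegressor

rng = np.random.default_rng(0)

# Donor (PUF-like) records: shared predictors and a tax variable to impute.
X_puf = rng.normal(size=(5_000, 3))
y_puf = X_puf @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=5_000)
# Recipient (CPS-like) records with the same predictors.
X_cps = rng.normal(size=(3_000, 3))

qrf = RandomForestQuantileRegressor(n_estimators=100, random_state=0)
qrf.fit(X_puf, y_puf)

# Predict a grid of conditional quantiles, then sample one per record so the
# imputations reproduce the conditional distribution, not just its mean.
grid = np.linspace(0.05, 0.95, 19)
preds = qrf.predict(X_cps, quantiles=list(grid))  # shape (3000, 19)
draws = rng.integers(0, len(grid), size=len(X_cps))
imputed = preds[np.arange(len(X_cps)), draws]
```

Sampling a random quantile per record, rather than always taking the median, is what preserves the spread and tails of the imputed variable.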
7 changes: 1 addition & 6 deletions docs/myst.yml

```diff
@@ -7,12 +7,6 @@ project:
       family: Team
   copyright: '2024'
   github: policyengine/policyengine-us-data
-  thebe:
-    binder:
-      repo: policyengine/policyengine-us-data
-      provider: github
-      url: https://mybinder.org
-      ref: master
   jupyter:
   myst:
     enable_extensions:
@@ -30,6 +24,7 @@
     - file: background.md
    - file: data.md
    - file: methodology.md
+   - file: pwbm_ss_comparison_2025_2100.ipynb
    - file: discussion.md
    - file: conclusion.md
    - file: appendix.md
```