Merged
1 change: 1 addition & 0 deletions .github/workflows/code_changes.yaml

```diff
@@ -3,6 +3,7 @@
 name: Code changes
 on:
   workflow_call:
+  workflow_dispatch:
   push:
     branches:
       - main
```
2 changes: 1 addition & 1 deletion .github/workflows/reusable_test.yaml

```diff
@@ -94,5 +94,5 @@ jobs:
       uses: JamesIves/github-pages-deploy-action@v4
       with:
         branch: gh-pages
-        folder: docs/_build/html
+        folder: docs/_build/site
         clean: true
```
3 changes: 3 additions & 0 deletions .gitignore

```diff
@@ -21,3 +21,6 @@ node_modules
 !age_state.csv
 !agi_state.csv
 !soi_targets.csv
+!policyengine_us_data/storage/social_security_aux.csv
+!policyengine_us_data/storage/SSPopJul_TR2024.csv
+docs/.ipynb_checkpoints/
```
4 changes: 2 additions & 2 deletions Makefile

```diff
@@ -34,7 +34,7 @@ documentation:
 	rm -f _toc.yml && \
 	myst clean && \
 	timeout 10 myst build --html || true
-	cd docs && test -d _build/html && touch _build/html/.nojekyll || true
+	cd docs && test -d _build/site && touch _build/site/.nojekyll || true
 
 documentation-build:
 	cd docs && \
@@ -44,7 +44,7 @@ documentation-build:
 	myst build --html
 
 documentation-serve:
-	cd docs/_build/html && python3 -m http.server 8080
+	cd docs/_build/site && python3 -m http.server 8080
 
 documentation-dev:
 	cd docs && \
```
8 changes: 8 additions & 0 deletions README.md

```diff
@@ -14,6 +14,14 @@ which installs the development dependencies in a reference-only manner (so that
 to the package code will be reflected immediately); `policyengine-us-data` is a dev package
 and not intended for direct access.
 
+## SSA Data Sources
+
+The following SSA data sources are used in this project:
+
+- [Latest Trustees Report (2025)](https://www.ssa.gov/oact/TR/2025/index.html) - Source for `social_security_aux.csv` (extracted via `extract_ssa_costs.py`)
+- [Single Year Supplementary Tables (2025)](https://www.ssa.gov/oact/tr/2025/lrIndex.html) - Long-range demographic and economic projections
+- [Single Year Age Demographic Projections (2024 - latest published)](https://www.ssa.gov/oact/HistEst/Population/2024/Population2024.html) - Source for `SSPopJul_TR2024.csv` population data
+
 ## Building the Paper
 
 ### Prerequisites
```
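For orientation, the two storage files named above can be inspected directly. This is a hypothetical usage sketch: the file paths come from the diffs in this PR, but the column layout is an assumption to verify against the actual CSVs.

```python
# Hypothetical inspection of the SSA-derived storage files; the paths appear
# in the .gitignore and README diffs above, but the schemas are not shown there.
import pandas as pd

costs = pd.read_csv("policyengine_us_data/storage/social_security_aux.csv")
population = pd.read_csv("policyengine_us_data/storage/SSPopJul_TR2024.csv")

print(costs.head())       # Trustees Report cost projections (schema unverified)
print(population.head())  # July single-year-of-age population (schema unverified)
```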
13 changes: 13 additions & 0 deletions changelog_entry.yaml

```diff
@@ -0,0 +1,13 @@
+- bump: minor
+  changes:
+    added:
+      - Additional calibration based on SSA Trustees data that extends projections until 2100
+      - Manual trigger capability for documentation deployment workflow
+      - Documentation for SSA data sources in storage README
+    changed:
+      - Renamed long-term projections notebook to clarify PWBM comparison scope (2025-2100)
+    fixed:
+      - GitHub Pages documentation deployment path
+      - Corrected number of imputed variables from 72 to 67 in documentation
+      - Corrected calibration target count from 7,000+ to 2,813 across all docs
+      - Removed inaccurate "two-stage" terminology in methodology descriptions
```
19 changes: 18 additions & 1 deletion docs/abstract.md

```diff
@@ -1,3 +1,20 @@
 # Abstract
 
-We present a methodology for creating enhanced microsimulation datasets by combining the Current Population Survey (CPS) with the IRS Public Use File (PUF). Our two-stage approach uses quantile regression forests to impute 72 tax variables from the PUF onto CPS records, preserving distributional characteristics while maintaining household composition and member relationships. The imputation process alone does not guarantee consistency with official statistics, necessitating a reweighting step to align the combined dataset with known population totals and administrative benchmarks. We apply a reweighting algorithm that calibrates the dataset to over 7,000 targets from six sources: IRS Statistics of Income, Census population projections, Congressional Budget Office program estimates, Treasury expenditure data, Joint Committee on Taxation tax expenditure estimates, and healthcare spending patterns. The reweighting employs dropout-regularized gradient descent optimization to ensure consistency with administrative benchmarks. Validation shows the enhanced dataset reduces error in key tax components by [TO BE CALCULATED]% relative to the baseline CPS. The dataset maintains the CPS's demographic detail and geographic granularity while incorporating tax reporting data from administrative sources. We release the enhanced dataset, source code, and documentation to support policy analysis.
+We present a methodology for creating enhanced microsimulation datasets by combining the
+Current Population Survey (CPS) with the IRS Public Use File (PUF). Our approach uses
+quantile regression forests to impute 67 tax variables from the PUF onto CPS records,
+preserving distributional characteristics while maintaining household composition and member
+relationships. The imputation process alone does not guarantee consistency with official
+statistics, necessitating a reweighting step to align the combined dataset with known
+population totals and administrative benchmarks. We apply a reweighting algorithm that
+calibrates the dataset to 2,813 targets from
+the IRS Statistics of Income, Census population projections, Congressional Budget
+Office benefit program estimates, Treasury
+expenditure data, Joint Committee on Taxation tax expenditure estimates, healthcare
+spending patterns, and other benefit program costs. The reweighting employs dropout-regularized
+gradient descent optimization
+to ensure consistency with administrative benchmarks. Validation shows the enhanced dataset
+reduces error in key tax components by [TO BE CALCULATED]% relative to the baseline CPS.
+The dataset maintains the CPS's demographic detail and geographic granularity while
+incorporating tax reporting data from administrative sources. We release the enhanced
+dataset, source code, and documentation to support policy analysis.
```
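The "dropout-regularized gradient descent" step in the abstract can be illustrated with a minimal sketch. This is not the repository's implementation: the loss function, the matrix layout (`M.T @ weights` yielding estimated totals), and the hyperparameters are all assumptions.

```python
# Minimal sketch of dropout-regularized gradient-descent reweighting.
# Assumptions: M is a (households x targets) estimate matrix such that
# M.T @ weights gives the dataset's estimate of each of the 2,813 targets.
import torch

def reweight(M: torch.Tensor, targets: torch.Tensor,
             epochs: int = 2000, dropout: float = 0.05,
             lr: float = 0.1) -> torch.Tensor:
    # Optimize in log space so household weights stay strictly positive.
    log_w = torch.zeros(M.shape[0], requires_grad=True)
    opt = torch.optim.Adam([log_w], lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        # Dropout regularization: randomly zero a share of households each
        # step so the fit cannot rely on a handful of extreme records.
        mask = (torch.rand_like(log_w) > dropout).float() / (1 - dropout)
        estimates = M.T @ (log_w.exp() * mask)
        # Relative error keeps targets of very different magnitudes comparable.
        loss = (((estimates - targets) / (targets.abs() + 1)) ** 2).mean()
        loss.backward()
        opt.step()
    return log_w.exp().detach()
```

Optimizing log-weights keeps the weights positive without explicit constraints, and the relative-error loss puts dollar-denominated totals and count targets on a comparable scale.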
2 changes: 1 addition & 1 deletion docs/conclusion.md

```diff
@@ -6,7 +6,7 @@ We present a methodology for creating enhanced microsimulation datasets that com
 
 Our work makes several key contributions:
 
-**Methodological Innovation**: The use of Quantile Regression Forests for imputation preserves distributional characteristics while maintaining computational efficiency. The large-scale calibration to 7,000+ targets pushes the boundaries of survey data enhancement.
+**Methodological Innovation**: The use of Quantile Regression Forests for imputation preserves distributional characteristics while maintaining computational efficiency. The large-scale calibration to 2,813 targets pushes the boundaries of survey data enhancement.
 
 **Practical Tools**: We provide open-source implementations that enable researchers to apply, modify, and extend these methods. The modular design facilitates experimentation with alternative approaches.
```
4 changes: 2 additions & 2 deletions docs/discussion.md

```diff
@@ -22,7 +22,7 @@ The use of Quantile Regression Forests for imputation represents an advance over
 - Maintains realistic variable correlations
 - Allows uncertainty quantification
 
-The large-scale calibration to 7,000+ targets ensures consistency with administrative benchmarks across multiple dimensions simultaneously.
+The large-scale calibration to 2,813 targets ensures consistency with administrative benchmarks across multiple dimensions simultaneously.
 
 ### Practical Advantages
 
@@ -44,7 +44,7 @@ These assumptions may not hold perfectly, particularly for subpopulations that t
 
 ### Calibration Trade-offs
 
-With 7,000+ targets, perfect fit to all benchmarks is impossible. The optimization must balance competing objectives across target types, the relative importance of different statistics, stability of resulting weights, and preservation of household relationships.
+With 2,813 targets, perfect fit to all benchmarks is impossible. The optimization must balance competing objectives across target types, the relative importance of different statistics, stability of resulting weights, and preservation of household relationships.
 
 Users should consult validation metrics for targets most relevant to their analysis.
```
6 changes: 3 additions & 3 deletions docs/introduction.md

```diff
@@ -4,7 +4,7 @@ Microsimulation models require high-quality microdata that accurately represents
 
 The Current Population Survey (CPS) Annual Social and Economic Supplement provides detailed household demographics, family relationships, and program participation data for a representative sample of US households. However, it suffers from well-documented income underreporting, particularly at the top of the distribution. The IRS Public Use File (PUF) contains accurate tax return information but lacks household structure, demographic detail, and state identifiers needed for comprehensive policy analysis.
 
-This paper presents a methodology for creating an Enhanced CPS dataset that combines the strengths of both sources. Through a two-stage enhancement process—imputation followed by reweighting—we create a dataset suitable for analyzing both tax and transfer policies at federal and state levels.
+This paper presents a methodology for creating an Enhanced CPS dataset that combines the strengths of both sources. Through an enhancement process—imputation followed by reweighting—we create a dataset suitable for analyzing both tax and transfer policies at federal and state levels.
 
 ## Related Work
 
@@ -14,13 +14,13 @@ Economic researchers address dataset limitations through various strategies. The
 
 Statistical agencies and researchers employ reweighting methods to align survey data with administrative totals. The Luxembourg Income Study uses calibration to improve cross-national comparability {cite:p}`gornick2013`. The Urban-Brookings Tax Policy Center employs reweighting in their microsimulation model but relies on proprietary data that cannot be shared publicly {cite:p}`khitatrakun2016`.
 
-Our approach differs from previous efforts in three key ways. First, we employ quantile regression forests to preserve distributional characteristics during imputation, improving upon traditional hot-deck and regression-based methods that may distort variable relationships. We conduct robustness checks comparing QRF performance to gradient boosting and neural network approaches, finding QRF provides the best balance of accuracy and interpretability. Second, we calibrate to over 7,000 targets from multiple administrative sources, far exceeding the scope of previous calibration efforts which typically use fewer than 100 targets. Third, we provide a fully open-source implementation enabling reproducibility and collaborative improvement, addressing the transparency limitations of existing proprietary models.
+Our approach differs from previous efforts in three key ways. First, we employ quantile regression forests to preserve distributional characteristics during imputation, improving upon traditional hot-deck and regression-based methods that may distort variable relationships. We conduct robustness checks comparing QRF performance to gradient boosting and neural network approaches, finding QRF provides the best balance of accuracy and interpretability. Second, we calibrate to 2,813 targets from multiple administrative sources, far exceeding the scope of previous calibration efforts which typically use fewer than 100 targets. Third, we provide a fully open-source implementation enabling reproducibility and collaborative improvement, addressing the transparency limitations of existing proprietary models.
 
 ## Contributions
 
 This paper makes three main contributions to the economic and public policy literature. Methodologically, we demonstrate how quantile regression forests can effectively impute detailed tax variables while preserving their joint distribution and relationship to demographics. This advances the statistical matching literature by showing how modern machine learning methods can overcome limitations of traditional hot-deck and parametric approaches. The preservation of distributional characteristics is particularly important for tax policy analysis where outcomes often depend on complex interactions between income sources and household characteristics.
 
-Our empirical contribution involves creating and validating a publicly available enhanced dataset that addresses longstanding data limitations in microsimulation modeling. By combining the demographic richness of the CPS with the tax precision of the PUF, we enable analyses that were previously infeasible with public data. The dataset's calibration to over 7,000 administrative targets ensures consistency with official statistics across multiple dimensions simultaneously.
+Our empirical contribution involves creating and validating a publicly available enhanced dataset that addresses longstanding data limitations in microsimulation modeling. By combining the demographic richness of the CPS with the tax precision of the PUF, we enable analyses that were previously infeasible with public data. The dataset's calibration to 2,813 administrative targets ensures consistency with official statistics across multiple dimensions simultaneously.
 
 From a practical perspective, we provide open-source tools and comprehensive documentation that enable researchers to apply these methods, modify the approach, or build upon our work. This transparency contrasts with existing proprietary models and supports reproducible research. Government agencies could use our framework to enhance their own microsimulation capabilities, while academic researchers gain access to data suitable for analyzing distributional impacts of tax and transfer policies. The modular design allows incremental improvements as new data sources become available.
```
4 changes: 2 additions & 2 deletions docs/methodology.md

````diff
@@ -1,6 +1,6 @@
 # Methodology
 
-We create the Enhanced CPS dataset through a two-stage process: imputation followed by reweighting. The imputation stage creates a copy of the CPS and uses Quantile Regression Forests to impute tax variables from the PUF onto this copy, creating the Extended CPS. The reweighting stage then optimizes household weights to match administrative targets, producing the Enhanced CPS with weights calibrated to statistics.
+We create the Enhanced CPS dataset through imputation followed by reweighting. The imputation stage creates a copy of the CPS and uses Quantile Regression Forests to impute tax variables from the PUF onto this copy, creating the Extended CPS. The reweighting stage then optimizes household weights to match administrative targets, producing the Enhanced CPS with weights calibrated to statistics.
 
 ```mermaid
 graph TD
@@ -37,7 +37,7 @@ graph TD
 
     Extended["Extended CPS - 2x households"]:::data
 
-    Targets{{"Administrative Targets - 7000+"}}:::data
+    Targets{{"Administrative Targets - 2,813"}}:::data
 
     Reweight("Reweight Optimization"):::process
 
````
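As context for the imputation stage described above, here is a minimal sketch of QRF-based imputation, assuming the open-source quantile-forest package and synthetic stand-in data; the repository's actual imputation code may differ.

```python
# Illustrative QRF imputation sketch using the quantile-forest package
# (pip install quantile-forest); synthetic arrays stand in for PUF/CPS data.
import numpy as np
from quantile_forest import RandomForestQuantileRegressor

rng = np.random.default_rng(0)

# Donor (PUF-like) records: shared predictors and a tax variable to impute.
X_puf = rng.normal(size=(5_000, 3))
y_puf = X_puf @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=5_000)
# Recipient (CPS-like) records with the same predictors.
X_cps = rng.normal(size=(3_000, 3))

qrf = RandomForestQuantileRegressor(n_estimators=100, random_state=0)
qrf.fit(X_puf, y_puf)

# Predict a grid of conditional quantiles, then sample one per record so the
# imputations reproduce the conditional distribution, not just its mean.
grid = np.linspace(0.05, 0.95, 19)
preds = qrf.predict(X_cps, quantiles=list(grid))  # shape (3000, 19)
draws = rng.integers(0, len(grid), size=len(X_cps))
imputed = preds[np.arange(len(X_cps)), draws]
```

Sampling a random quantile per record, rather than always taking the median, is what preserves the spread and tails of the imputed variable.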
7 changes: 1 addition & 6 deletions docs/myst.yml

```diff
@@ -7,12 +7,6 @@ project:
       family: Team
   copyright: '2024'
   github: policyengine/policyengine-us-data
-  thebe:
-    binder:
-      repo: policyengine/policyengine-us-data
-      provider: github
-      url: https://mybinder.org
-      ref: master
   jupyter:
   myst:
     enable_extensions:
@@ -30,6 +24,7 @@
     - file: background.md
    - file: data.md
    - file: methodology.md
+   - file: pwbm_ss_comparison_2025_2100.ipynb
    - file: discussion.md
    - file: conclusion.md
    - file: appendix.md
```