16 changes: 16 additions & 0 deletions CLAUDE.md
@@ -144,6 +144,18 @@ pure Rust by default.
- **`diff_diff/two_stage_bootstrap.py`** - Bootstrap inference:
- `TwoStageDiDBootstrapMixin` - Mixin with GMM influence function bootstrap methods

- **`diff_diff/stacked_did.py`** - Stacked DiD estimator (Wing et al. 2024):
- `StackedDiD` - Stacked DiD with corrective Q-weights for compositional balance
- `stacked_did()` - Convenience function
- Builds sub-experiments per adoption cohort with clean controls
- IC1/IC2 trimming for compositional balance across event times
- Q-weights for aggregate, population, or sample share estimands (Table 1)
- WLS event study regression via sqrt(w) transformation
- Re-exports result class for backward compatibility
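  The sqrt(w) WLS transformation noted above rests on a standard identity — minimizing a weighted sum of squares is equivalent to running plain OLS on rows rescaled by sqrt(w). A minimal sketch of that identity (an illustration, not the library's internal code):

  ```python
  import numpy as np

  def wls_via_sqrt_w(X, y, w):
      # Minimizing sum_i w_i * (y_i - x_i'b)^2 is identical to OLS on rows
      # rescaled by sqrt(w_i), so one sqrt(w) transform turns any OLS routine
      # into a weighted one.
      sw = np.sqrt(w)
      beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
      return beta

  rng = np.random.default_rng(0)
  X = rng.normal(size=(40, 3))
  y = rng.normal(size=40)
  w = rng.uniform(0.5, 2.0, size=40)

  # Agrees with the normal-equations solution (X'WX)^{-1} X'Wy
  direct = np.linalg.solve(X.T @ (X * w[:, None]), X.T @ (y * w))
  assert np.allclose(wls_via_sqrt_w(X, y, w), direct)
  ```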

- **`diff_diff/stacked_did_results.py`** - Result container classes:
- `StackedDiDResults` - Results with overall ATT, event study, group effects, stacked data access

- **`diff_diff/triple_diff.py`** - Triple Difference (DDD) estimator:
- `TripleDifference` - Ortiz-Villavicencio & Sant'Anna (2025) estimator for DDD designs
- `TripleDifferenceResults` - Results with ATT, SEs, cell means, diagnostics
@@ -314,6 +326,7 @@ pure Rust by default.
├── TwoStageDiD
├── TripleDifference
├── TROP
├── StackedDiD
├── SyntheticDiD
└── BaconDecomposition
```
@@ -429,6 +442,7 @@ Tests mirror the source modules:
- `tests/test_sun_abraham.py` - Tests for SunAbraham interaction-weighted estimator
- `tests/test_imputation.py` - Tests for ImputationDiD (Borusyak et al. 2024) estimator
- `tests/test_two_stage.py` - Tests for TwoStageDiD (Gardner 2022) estimator, including equivalence tests with ImputationDiD
- `tests/test_stacked_did.py` - Tests for Stacked DiD (Wing et al. 2024) estimator
- `tests/test_triple_diff.py` - Tests for Triple Difference (DDD) estimator
- `tests/test_trop.py` - Tests for Triply Robust Panel (TROP) estimator
- `tests/test_bacon.py` - Tests for Goodman-Bacon decomposition
@@ -445,6 +459,8 @@ Tests mirror the source modules:

Session-scoped `ci_params` fixture in `conftest.py` scales bootstrap iterations and TROP grid sizes in pure Python mode — use `ci_params.bootstrap(n)` and `ci_params.grid(values)` in new tests with `n_bootstrap >= 20`. For SE convergence tests (analytical vs bootstrap comparison), use `ci_params.bootstrap(n, min_n=199)` with a conditional tolerance: `threshold = 0.40 if n_boot < 100 else 0.15`. The `min_n` parameter is capped at 49 in pure Python mode to keep CI fast, so convergence tests use wider tolerances when running with fewer bootstrap iterations.
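As a sketch of this pattern — the `CIParams` class below is a hypothetical stand-in modeling only the capping behavior described above; the real fixture lives in `conftest.py`:

```python
# Hypothetical stand-in for the session-scoped ci_params fixture; only the
# min_n capping behavior described above is modeled here.
class CIParams:
    def __init__(self, pure_python):
        self.cap = 49 if pure_python else None  # min_n capped at 49 in pure Python mode

    def bootstrap(self, n, min_n=20):
        if self.cap is not None:
            return min(max(n, min_n), self.cap)
        return max(n, min_n)

ci_params = CIParams(pure_python=True)
n_boot = ci_params.bootstrap(1000, min_n=199)  # capped to 49 in pure Python mode
# Conditional tolerance for SE convergence tests:
threshold = 0.40 if n_boot < 100 else 0.15
assert n_boot == 49 and threshold == 0.40
```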

**Slow test suites:** `tests/test_trop.py` is very time-consuming. Only run TROP tests when changes could affect the TROP estimator (e.g., `diff_diff/trop.py`, `diff_diff/trop_results.py`, `diff_diff/linalg.py`, `diff_diff/_backend.py`, or `rust/src/trop.rs`). For unrelated changes, exclude with `pytest --ignore=tests/test_trop.py`.

### Test Writing Guidelines

**For fallback/error handling paths:**
97 changes: 97 additions & 0 deletions METHODOLOGY_REVIEW.md
@@ -27,6 +27,7 @@ Each estimator in diff-diff should be periodically reviewed to ensure:
| SunAbraham | `sun_abraham.py` | `fixest::sunab()` | **Complete** | 2026-02-15 |
| SyntheticDiD | `synthetic_did.py` | `synthdid::synthdid_estimate()` | **Complete** | 2026-02-10 |
| TripleDifference | `triple_diff.py` | `triplediff::ddd()` | **Complete** | 2026-02-18 |
| StackedDiD | `stacked_did.py` | `stacked-did-weights` | **Complete** | 2026-02-19 |
| TROP | `trop.py` | (forthcoming) | Not Started | - |
| BaconDecomposition | `bacon.py` | `bacondecomp::bacon()` | Not Started | - |
| HonestDiD | `honest_did.py` | `HonestDiD` package | Not Started | - |
@@ -379,6 +380,102 @@ variables appear to the left of the `|` separator.

---

#### StackedDiD

| Field | Value |
|-------|-------|
| Module | `stacked_did.py` |
| Primary Reference | Wing, Freedman & Hollingsworth (2024), NBER WP 32054 |
| R Reference | `stacked-did-weights` (`create_sub_exp()` + `compute_weights()`) |
| Status | **Complete** |
| Last Review | 2026-02-19 |

**Verified Components:**
- [x] IC1 trimming: `a - kappa_pre >= T_min AND a + kappa_post <= T_max` (matches R reference)
- [x] IC2 trimming: Three clean control modes (not_yet_treated, strict, never_treated)
- [x] Sub-experiment construction: treated + clean controls within `[a - kappa_pre, a + kappa_post]`
- [x] Q-weights aggregate: treated Q=1, control `Q = (sub_treat_n/stack_treat_n) / (sub_control_n/stack_control_n)` per (event_time, sub_exp) — matches R `compute_weights()`
- [x] Q-weights population: `Q_a = (Pop_a^D / Pop^D) / (N_a^C / N^C)` (Table 1, Row 2)
- [x] Q-weights sample_share: `Q_a = ((N_a^D + N_a^C)/(N^D+N^C)) / (N_a^C / N^C)` (Table 1, Row 3)
- [x] WLS via sqrt(w) transformation (numerically equivalent to weighted regression)
- [x] Event study regression: `Y = α_0 + α_1·D_sa + Σ_{h≠-1}[λ_h·1(e=h) + δ_h·D_sa·1(e=h)] + U` (Eq. 3)
- [x] Reference period e=-1-anticipation normalized to zero (omitted from design matrix)
- [x] Delta-method SE for overall ATT: `SE = sqrt(ones' @ sub_vcv @ ones) / K`
- [x] Cluster-robust SEs at unit level (default) and unit×sub-experiment level
- [x] Anticipation parameter: reference period shifts to e=-1-anticipation, post-treatment includes anticipation periods
- [x] Rank deficiency handling (warn/error/silent via `solve_ols()`)
- [x] Never-treated encoding: both `first_treat=0` and `first_treat=inf` handled
- [x] R comparison: ATT matches within machine precision (diff < 2.1e-11)
- [x] R comparison: SE matches within machine precision (diff < 4.0e-10)
- [x] R comparison: Event study effects correlation = 1.000000, max diff < 4.5e-11
- [x] safe_inference() used for all inference fields
- [x] All REGISTRY.md edge cases tested
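
The aggregate Q-weight rule verified above can be sketched with pandas — the column names (`treated`, `event_time`, `sub_exp`) are illustrative, not the library's internal schema:

```python
import pandas as pd

def aggregate_q_weights(df):
    """Aggregate Q-weights per the rule above: treated rows get Q = 1; control
    rows get (sub_treat_n / stack_treat_n) / (sub_control_n / stack_control_n),
    computed per (event_time, sub_exp) cell.  Column names are illustrative."""
    stack_treat = df[df.treated == 1].groupby("event_time").size()
    stack_control = df[df.treated == 0].groupby("event_time").size()
    cell_treat = df[df.treated == 1].groupby(["event_time", "sub_exp"]).size()
    cell_control = df[df.treated == 0].groupby(["event_time", "sub_exp"]).size()

    def q(row):
        if row.treated == 1:
            return 1.0
        key = (row.event_time, row.sub_exp)
        treat_share = cell_treat.get(key, 0) / stack_treat[row.event_time]
        control_share = cell_control[key] / stack_control[row.event_time]
        return treat_share / control_share

    return df.assign(q=df.apply(q, axis=1))

# Two sub-experiments at event time 0: A has 2 treated / 1 control,
# B has 1 treated / 2 controls.
toy = pd.DataFrame({
    "treated":    [1, 1, 0, 1, 0, 0],
    "event_time": [0, 0, 0, 0, 0, 0],
    "sub_exp":    ["A", "A", "A", "B", "B", "B"],
})
weighted = aggregate_q_weights(toy)
# Control in A: (2/3) / (1/3) = 2.0; controls in B: (1/3) / (2/3) = 0.5
```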

**Test Coverage:**
- 72 tests in `tests/test_stacked_did.py` across 11 test classes:
- `TestStackedDiDBasic` (8): fit, event study, group/all raises, simple aggregation, known constant effect, dynamic effects
- `TestTrimming` (5): IC1 window, IC2 no-controls, trimmed groups reported, all-trimmed raises, wider window
- `TestQWeights` (4): treated=1, aggregate formula, sample_share formula, positivity
- `TestCleanControl` (5): not_yet_treated, strict, never_treated, missing never-treated raises
- `TestClustering` (2): unit, unit_subexp
- `TestStackedData` (4): accessible, required columns, event time range
- `TestEdgeCases` (8): single cohort, anticipation, unbalanced panel, NaN inference, never-treated encodings
- `TestSklearnInterface` (4): get_params, set_params, unknown raises, convenience function
- `TestResultsMethods` (7): summary, to_dataframe, is_significant, significance_stars, repr
- `TestValidation` (8): missing columns, invalid params, population required, no treated units
- R benchmark tests via `benchmarks/run_benchmarks.py --estimator stacked`

**R Comparison Results (200 units, 8 periods, kappa_pre=2, kappa_post=2):**
| Metric | Python | R | Diff |
|--------|--------|---|------|
| Overall ATT | 2.277699574579 | 2.2776995746 | 2.1e-11 |
| Overall SE | 0.062045687626 | 0.062045688027 | 4.0e-10 |
| ES e=-2 ATT | 0.044517975379 | 0.044517975379 | <1e-12 |
| ES e=0 ATT | 2.104181683763 | 2.104181683800 | <1e-11 |
| ES e=1 ATT | 2.209990715130 | 2.209990715100 | <1e-11 |
| ES e=2 ATT | 2.518926324845 | 2.518926324800 | <1e-11 |
| Stacked obs | 1600 | 1600 | exact |
| Sub-experiments | 3 | 3 | exact |
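
The delta-method SE from the checklist — the overall ATT is the simple mean of the K post-period event-study coefficients — reduces to a one-liner; the coefficients and diagonal vcv below are made up purely for illustration:

```python
import numpy as np

def overall_att_and_se(post_coefs, post_vcv):
    # Overall ATT = simple mean of the K post-period event-study coefficients;
    # the delta method gives Var(mean) = 1' @ Sigma @ 1 / K^2.
    K = len(post_coefs)
    ones = np.ones(K)
    att = post_coefs.mean()
    se = np.sqrt(ones @ post_vcv @ ones) / K
    return att, se

# Made-up coefficients and a diagonal vcv purely for illustration:
coefs = np.array([2.1, 2.2, 2.5])
vcv = np.diag([0.01, 0.01, 0.01])
att, se = overall_att_and_se(coefs, vcv)
# Diagonal case: se = sqrt(3 * 0.01) / 3
```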

**Corrections Made:**
1. **IC1 lower bound and time window aligned with R reference** (`stacked_did.py`,
`_trim_adoption_events()` and `_build_sub_experiment()`): The paper text specifies
time window `[a - kappa_pre - 1, a + kappa_post]` (including an extra pre-period),
but the R reference implementation by co-author Hollingsworth uses
`[a - kappa_pre, a + kappa_post]`. The extra period had no event-study dummy,
altering the baseline regression. Fixed to match R: removed `-1` from both
IC1 check (`a - kappa_pre >= T_min`) and time window start. Discrepancy documented
in `docs/methodology/papers/wing-2024-review.md` Gaps section.

2. **Q-weight computation: event-time-specific for aggregate weighting** (`stacked_did.py`,
`_compute_q_weights()`): Changed aggregate Q-weights from unit counts per sub-experiment
to observation counts per (event_time, sub_exp), matching R reference `compute_weights()`.
For balanced panels, results are unchanged. For unbalanced panels, weights now adjust for
varying observation density. Population/sample_share retain unit-count formulas (paper notation).

3. **Anticipation parameter: reference period and dummies** (`stacked_did.py`, `fit()`):
Reference period now shifts to `e = -1 - anticipation`. Event-time dummies cover the
full window `[-kappa_pre - anticipation, ..., kappa_post]`. Post-treatment effects include
anticipation periods. Consistent with ImputationDiD, TwoStageDiD, SunAbraham.

4. **Group aggregation removed** (`stacked_did.py`): `aggregate="group"` and `aggregate="all"`
removed. The pooled stacked regression cannot produce cohort-specific effects without
cohort×event-time interactions. Use CallawaySantAnna or ImputationDiD for cohort-level estimates.

5. **n_sub_experiments metadata** (`stacked_did.py`, `fit()`): Now tracks actual built
sub-experiments, not all events in omega_kappa. Warns if any sub-experiments are empty
after data filtering.
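
The R-aligned IC1 rule described in correction 1 can be sketched as a toy check (not library code):

```python
def ic1_keep(a, kappa_pre, kappa_post, t_min, t_max):
    # Keep adoption event `a` only when the full event window
    # [a - kappa_pre, a + kappa_post] fits inside the observed panel
    # [t_min, t_max] -- the R-aligned rule, without the paper's extra -1.
    return a - kappa_pre >= t_min and a + kappa_post <= t_max

# Periods 1..8 with kappa_pre = kappa_post = 2: cohorts 3..6 survive trimming.
kept = [a for a in range(1, 9) if ic1_keep(a, 2, 2, 1, 8)]
```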

**Outstanding Concerns:**
- Population/sample_share Q-weights use paper's unit-count formulas (no R reference to validate)
- Anticipation not validated against R (R reference doesn't test anticipation > 0)

**Deviations from R's stacked-did-weights:**
1. **NaN for invalid inference**: Python returns NaN for t_stat/p_value/conf_int when
SE is non-finite or zero. R would propagate through `fixest::feols()` error handling.

---

### Advanced Estimators

#### SyntheticDiD
130 changes: 129 additions & 1 deletion README.md
@@ -70,7 +70,7 @@ Signif. codes: '***' 0.001, '**' 0.01, '*' 0.05, '.' 0.1
- **Wild cluster bootstrap**: Valid inference with few clusters (<50) using Rademacher, Webb, or Mammen weights
- **Panel data support**: Two-way fixed effects estimator for panel designs
- **Multi-period analysis**: Event-study style DiD with period-specific treatment effects
- **Staggered adoption**: Callaway-Sant'Anna (2021), Sun-Abraham (2021), Borusyak-Jaravel-Spiess (2024) imputation, and Two-Stage DiD (Gardner 2022) estimators for heterogeneous treatment timing
- **Staggered adoption**: Callaway-Sant'Anna (2021), Sun-Abraham (2021), Borusyak-Jaravel-Spiess (2024) imputation, Two-Stage DiD (Gardner 2022), and Stacked DiD (Wing, Freedman & Hollingsworth 2024) estimators for heterogeneous treatment timing
- **Triple Difference (DDD)**: Ortiz-Villavicencio & Sant'Anna (2025) estimators with proper covariate handling
- **Synthetic DiD**: Combined DiD with synthetic control for improved robustness
- **Triply Robust Panel (TROP)**: Factor-adjusted DiD with synthetic weights (Athey et al. 2025)
@@ -974,6 +974,78 @@ TwoStageDiD(

Under homogeneous treatment effects, both estimators are efficient, yielding shorter confidence intervals than Callaway-Sant'Anna or Sun-Abraham.

### Stacked DiD (Wing, Freedman & Hollingsworth 2024)

Stacked DiD addresses TWFE bias in staggered adoption settings by constructing a "clean" comparison dataset for each treatment cohort and stacking them together. Each cohort's sub-experiment compares units treated at that cohort's timing against units that are not yet treated (or never treated) within a symmetric event-study window. This avoids the "bad comparisons" problem in TWFE while retaining a regression-based framework that practitioners familiar with event studies will find intuitive.

```python
from diff_diff import StackedDiD, generate_staggered_data

# Generate sample data
data = generate_staggered_data(n_units=200, n_periods=12,
cohort_periods=[4, 6, 8], seed=42)

# Fit stacked DiD with event study
est = StackedDiD(kappa_pre=2, kappa_post=2)
results = est.fit(data, outcome='outcome', unit='unit',
time='period', first_treat='first_treat',
aggregate='event_study')
results.print_summary()

# Access stacked data for custom analysis
stacked = results.stacked_data

# Convenience function
from diff_diff import stacked_did
results = stacked_did(data, 'outcome', 'unit', 'period', 'first_treat',
kappa_pre=2, kappa_post=2, aggregate='event_study')
```

**Parameters:**

```python
StackedDiD(
kappa_pre=1, # Pre-treatment event-study periods
kappa_post=1, # Post-treatment event-study periods
weighting='aggregate', # 'aggregate', 'population', or 'sample_share'
clean_control='not_yet_treated', # 'not_yet_treated', 'strict', or 'never_treated'
cluster='unit', # 'unit' or 'unit_subexp'
alpha=0.05, # Significance level
anticipation=0, # Anticipation periods
rank_deficient_action='warn', # 'warn', 'error', or 'silent'
)
```

> **Note:** Group aggregation (`aggregate='group'`) is not supported because the pooled
> stacked regression cannot produce cohort-specific effects. Use `CallawaySantAnna` or
> `ImputationDiD` for cohort-level estimates.

**When to use Stacked DiD vs Callaway-Sant'Anna:**

| Aspect | Stacked DiD | Callaway-Sant'Anna |
|--------|-------------|-------------------|
| Approach | Stack cohort sub-experiments, run pooled TWFE | 2x2 DiD aggregation |
| Symmetric windows | Enforced via kappa_pre / kappa_post | Not required |
| Control group | Not-yet-treated (default) or never-treated | Never-treated or not-yet-treated |
| Covariates | Passed to pooled regression | Doubly robust / IPW |
| Intuition | Familiar event-study regression | Nonparametric aggregation |

**Convenience function:**

```python
# One-liner estimation
results = stacked_did(
data,
outcome='outcome',
unit='unit',
time='period',
first_treat='first_treat',
kappa_pre=3,
kappa_post=3,
aggregate='event_study'
)
```

### Triple Difference (DDD)

Triple Difference (DDD) is used when treatment requires satisfying two criteria: belonging to a treated **group** AND being in an eligible **partition**. The `TripleDifference` class implements the methodology from Ortiz-Villavicencio & Sant'Anna (2025), which correctly handles covariate adjustment (unlike naive implementations).
@@ -2203,6 +2275,60 @@ TwoStageDiD(
| `print_summary(alpha)` | Print summary to stdout |
| `to_dataframe(level)` | Convert to DataFrame ('observation', 'event_study', 'group') |

### StackedDiD

```python
StackedDiD(
kappa_pre=1, # Pre-treatment event-study periods
kappa_post=1, # Post-treatment event-study periods
weighting='aggregate', # 'aggregate', 'population', or 'sample_share'
clean_control='not_yet_treated', # 'not_yet_treated', 'strict', or 'never_treated'
cluster='unit', # 'unit' or 'unit_subexp'
alpha=0.05, # Significance level
anticipation=0, # Anticipation periods
rank_deficient_action='warn', # 'warn', 'error', or 'silent'
)
```

**fit() Parameters:**

| Parameter | Type | Description |
|-----------|------|-------------|
| `data` | DataFrame | Panel data |
| `outcome` | str | Outcome variable column name |
| `unit` | str | Unit identifier column |
| `time` | str | Time period column |
| `first_treat` | str | First treatment period column (0 for never-treated) |
| `population` | str, optional | Population column (required if weighting='population') |
| `aggregate` | str | Aggregation: None, `"simple"`, or `"event_study"` |

### StackedDiDResults

**Attributes:**

| Attribute | Description |
|-----------|-------------|
| `overall_att` | Overall average treatment effect on the treated |
| `overall_se` | Standard error |
| `overall_t_stat` | T-statistic |
| `overall_p_value` | P-value for H0: ATT = 0 |
| `overall_conf_int` | Confidence interval |
| `event_study_effects` | Dict of relative time -> effect dict (if `aggregate='event_study'`) |
| `stacked_data` | The stacked dataset used for estimation |
| `n_treated_obs` | Number of treated observations |
| `n_untreated_obs` | Number of untreated (clean control) observations |
| `n_cohorts` | Number of treatment cohorts |
| `kappa_pre` | Pre-treatment window used |
| `kappa_post` | Post-treatment window used |

**Methods:**

| Method | Description |
|--------|-------------|
| `summary(alpha)` | Get formatted summary string |
| `print_summary(alpha)` | Print summary to stdout |
| `to_dataframe(level)` | Convert to DataFrame ('event_study') |

### TripleDifference

@@ -2689,6 +2815,8 @@ The `HonestDiD` module implements sensitivity analysis methods for relaxing the

- **Goodman-Bacon, A. (2021).** "Difference-in-Differences with Variation in Treatment Timing." *Journal of Econometrics*, 225(2), 254-277. [https://doi.org/10.1016/j.jeconom.2021.03.014](https://doi.org/10.1016/j.jeconom.2021.03.014)

- **Wing, C., Freedman, S. M., & Hollingsworth, A. (2024).** "Stacked Difference-in-Differences." *NBER Working Paper* 32054. [https://www.nber.org/papers/w32054](https://www.nber.org/papers/w32054)

### Power Analysis

- **Bloom, H. S. (1995).** "Minimum Detectable Effects: A Simple Way to Report the Statistical Power of Experimental Designs." *Evaluation Review*, 19(5), 547-556. [https://doi.org/10.1177/0193841X9501900504](https://doi.org/10.1177/0193841X9501900504)
13 changes: 3 additions & 10 deletions ROADMAP.md
@@ -10,7 +10,7 @@ For past changes and release history, see [CHANGELOG.md](CHANGELOG.md).

diff-diff v2.4.1 is a **production-ready** DiD library with feature parity with R's `did` + `HonestDiD` + `synthdid` ecosystem for core DiD analysis:

- **Core estimators**: Basic DiD, TWFE, MultiPeriod, Callaway-Sant'Anna, Sun-Abraham, Borusyak-Jaravel-Spiess Imputation, Synthetic DiD, Triple Difference (DDD), TROP, Two-Stage DiD (Gardner 2022)
- **Core estimators**: Basic DiD, TWFE, MultiPeriod, Callaway-Sant'Anna, Sun-Abraham, Borusyak-Jaravel-Spiess Imputation, Synthetic DiD, Triple Difference (DDD), TROP, Two-Stage DiD (Gardner 2022), Stacked DiD (Wing et al. 2024)
- **Valid inference**: Robust SEs, cluster SEs, wild bootstrap, multiplier bootstrap, placebo-based variance
- **Assumption diagnostics**: Parallel trends tests, placebo tests, Goodman-Bacon decomposition
- **Sensitivity analysis**: Honest DiD (Rambachan-Roth), Pre-trends power analysis (Roth 2022)
Expand All @@ -24,16 +24,9 @@ diff-diff v2.4.1 is a **production-ready** DiD library with feature parity with

High-value additions building on our existing foundation.

### Stacked Difference-in-Differences
### ~~Stacked Difference-in-Differences~~ (Implemented in v2.5)

An intuitive approach that explicitly constructs sub-experiments for each treatment cohort, avoiding forbidden comparisons.

- Creates separate datasets per cohort with valid controls only
- Stacks sub-experiments and applies corrective sample weights
- Returns variance-weighted ATT with proper compositional balance
- Conceptually simpler alternative to aggregation-based methods

**Reference**: [Wing, Freedman & Hollingsworth (2024)](https://www.nber.org/papers/w32054). *NBER Working Paper 32054*. Stata: `STACKDID`.
Implemented as `StackedDiD`. See `diff_diff/stacked_did.py`.

### Staggered Triple Difference (DDD)
