
Commit 1d5cda8

Merge pull request #172 from igerber/stacked-did
Add Stacked DiD estimator (Wing, Freedman & Hollingsworth 2024)
2 parents 9943d7f + bbe584f commit 1d5cda8

15 files changed, +3476 -88 lines changed


CLAUDE.md

Lines changed: 16 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -144,6 +144,18 @@ pure Rust by default.
144144
- **`diff_diff/two_stage_bootstrap.py`** - Bootstrap inference:
145145
- `TwoStageDiDBootstrapMixin` - Mixin with GMM influence function bootstrap methods
146146

147+
- **`diff_diff/stacked_did.py`** - Stacked DiD estimator (Wing et al. 2024):
148+
- `StackedDiD` - Stacked DiD with corrective Q-weights for compositional balance
149+
- `stacked_did()` - Convenience function
150+
- Builds sub-experiments per adoption cohort with clean controls
151+
- IC1/IC2 trimming for compositional balance across event times
152+
- Q-weights for aggregate, population, or sample share estimands (Table 1)
153+
- WLS event study regression via sqrt(w) transformation
154+
- Re-exports result class for backward compatibility
155+
156+
- **`diff_diff/stacked_did_results.py`** - Result container classes:
157+
- `StackedDiDResults` - Results with overall ATT, event study, group effects, stacked data access
158+
147159
- **`diff_diff/triple_diff.py`** - Triple Difference (DDD) estimator:
148160
- `TripleDifference` - Ortiz-Villavicencio & Sant'Anna (2025) estimator for DDD designs
149161
- `TripleDifferenceResults` - Results with ATT, SEs, cell means, diagnostics
@@ -314,6 +326,7 @@ pure Rust by default.
314326
├── TwoStageDiD
315327
├── TripleDifference
316328
├── TROP
329+
├── StackedDiD
317330
├── SyntheticDiD
318331
└── BaconDecomposition
319332
```
@@ -429,6 +442,7 @@ Tests mirror the source modules:
429442
- `tests/test_sun_abraham.py` - Tests for SunAbraham interaction-weighted estimator
430443
- `tests/test_imputation.py` - Tests for ImputationDiD (Borusyak et al. 2024) estimator
431444
- `tests/test_two_stage.py` - Tests for TwoStageDiD (Gardner 2022) estimator, including equivalence tests with ImputationDiD
445+
- `tests/test_stacked_did.py` - Tests for Stacked DiD (Wing et al. 2024) estimator
432446
- `tests/test_triple_diff.py` - Tests for Triple Difference (DDD) estimator
433447
- `tests/test_trop.py` - Tests for Triply Robust Panel (TROP) estimator
434448
- `tests/test_bacon.py` - Tests for Goodman-Bacon decomposition
@@ -445,6 +459,8 @@ Tests mirror the source modules:
445459

446460
Session-scoped `ci_params` fixture in `conftest.py` scales bootstrap iterations and TROP grid sizes in pure Python mode — use `ci_params.bootstrap(n)` and `ci_params.grid(values)` in new tests with `n_bootstrap >= 20`. For SE convergence tests (analytical vs bootstrap comparison), use `ci_params.bootstrap(n, min_n=199)` with a conditional tolerance: `threshold = 0.40 if n_boot < 100 else 0.15`. The `min_n` parameter is capped at 49 in pure Python mode to keep CI fast, so convergence tests use wider tolerances when running with fewer bootstrap iterations.
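
The conditional tolerance described above can be sketched as a small helper. This is an illustration only: `se_tolerance` and `relative_gap` are hypothetical names and are not part of `conftest.py` or the `ci_params` fixture.

```python
def se_tolerance(n_boot: int) -> float:
    """Relative tolerance for analytical-vs-bootstrap SE comparisons."""
    # Fewer bootstrap draws give a noisier bootstrap SE, so allow a wider gap.
    return 0.40 if n_boot < 100 else 0.15


def relative_gap(analytical_se: float, bootstrap_se: float) -> float:
    """Relative difference between the two standard errors."""
    return abs(bootstrap_se - analytical_se) / analytical_se


# 49 is the pure Python cap on min_n; 199 is the min_n requested in the text.
for n_boot in (49, 199):
    print(n_boot, se_tolerance(n_boot))
```

A convergence test would then assert `relative_gap(se, boot_se) < se_tolerance(n_boot)`, where `n_boot` is whatever `ci_params.bootstrap(...)` returned.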
447461

462+
**Slow test suites:** `tests/test_trop.py` is very time-consuming. Only run TROP tests when changes could affect the TROP estimator (e.g., `diff_diff/trop.py`, `diff_diff/trop_results.py`, `diff_diff/linalg.py`, `diff_diff/_backend.py`, or `rust/src/trop.rs`). For unrelated changes, exclude with `pytest --ignore=tests/test_trop.py`.
463+
448464
### Test Writing Guidelines
449465

450466
**For fallback/error handling paths:**

METHODOLOGY_REVIEW.md

Lines changed: 97 additions & 0 deletions
@@ -27,6 +27,7 @@ Each estimator in diff-diff should be periodically reviewed to ensure:
2727
| SunAbraham | `sun_abraham.py` | `fixest::sunab()` | **Complete** | 2026-02-15 |
2828
| SyntheticDiD | `synthetic_did.py` | `synthdid::synthdid_estimate()` | **Complete** | 2026-02-10 |
2929
| TripleDifference | `triple_diff.py` | `triplediff::ddd()` | **Complete** | 2026-02-18 |
30+
| StackedDiD | `stacked_did.py` | `stacked-did-weights` | **Complete** | 2026-02-19 |
3031
| TROP | `trop.py` | (forthcoming) | Not Started | - |
3132
| BaconDecomposition | `bacon.py` | `bacondecomp::bacon()` | Not Started | - |
3233
| HonestDiD | `honest_did.py` | `HonestDiD` package | Not Started | - |
@@ -379,6 +380,102 @@ variables appear to the left of the `|` separator.
379380

380381
---
381382

383+
#### StackedDiD
384+
385+
| Field | Value |
386+
|-------|-------|
387+
| Module | `stacked_did.py` |
388+
| Primary Reference | Wing, Freedman & Hollingsworth (2024), NBER WP 32054 |
389+
| R Reference | `stacked-did-weights` (`create_sub_exp()` + `compute_weights()`) |
390+
| Status | **Complete** |
391+
| Last Review | 2026-02-19 |
392+
393+
**Verified Components:**
394+
- [x] IC1 trimming: `a - kappa_pre >= T_min AND a + kappa_post <= T_max` (matches R reference)
395+
- [x] IC2 trimming: Three clean control modes (not_yet_treated, strict, never_treated)
396+
- [x] Sub-experiment construction: treated + clean controls within `[a - kappa_pre, a + kappa_post]`
397+
- [x] Q-weights aggregate: treated Q=1, control `Q = (sub_treat_n/stack_treat_n) / (sub_control_n/stack_control_n)` per (event_time, sub_exp) — matches R `compute_weights()`
398+
- [x] Q-weights population: `Q_a = (Pop_a^D / Pop^D) / (N_a^C / N^C)` (Table 1, Row 2)
399+
- [x] Q-weights sample_share: `Q_a = ((N_a^D + N_a^C)/(N^D+N^C)) / (N_a^C / N^C)` (Table 1, Row 3)
400+
- [x] WLS via sqrt(w) transformation (numerically equivalent to weighted regression)
401+
- [x] Event study regression: `Y = α_0 + α_1·D_sa + Σ_{h≠-1}[λ_h·1(e=h) + δ_h·D_sa·1(e=h)] + U` (Eq. 3)
402+
- [x] Reference period e=-1-anticipation normalized to zero (omitted from design matrix)
403+
- [x] Delta-method SE for overall ATT: `SE = sqrt(ones' @ sub_vcv @ ones) / K`
404+
- [x] Cluster-robust SEs at unit level (default) and unit×sub-experiment level
405+
- [x] Anticipation parameter: reference period shifts to e=-1-anticipation, post-treatment includes anticipation periods
406+
- [x] Rank deficiency handling (warn/error/silent via `solve_ols()`)
407+
- [x] Never-treated encoding: both `first_treat=0` and `first_treat=inf` handled
408+
- [x] R comparison: ATT matches within machine precision (diff < 2.1e-11)
409+
- [x] R comparison: SE matches within machine precision (diff < 4.0e-10)
410+
- [x] R comparison: Event study effects correlation = 1.000000, max diff < 4.5e-11
411+
- [x] safe_inference() used for all inference fields
412+
- [x] All REGISTRY.md edge cases tested
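
The sqrt(w) equivalence listed in the checklist can be checked numerically. The sketch below uses synthetic data and plain NumPy rather than the library's `solve_ols()`; every variable name is illustrative and the weights are arbitrary positive numbers, not Q-weights.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # intercept + one regressor
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)
w = rng.uniform(0.5, 2.0, size=n)  # strictly positive weights

# Route 1: scale each row by sqrt(w), then run ordinary least squares.
sw = np.sqrt(w)
beta_sqrt, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)

# Route 2: solve the weighted normal equations (X'WX) b = X'Wy directly.
beta_wls = np.linalg.solve(X.T @ (X * w[:, None]), X.T @ (w * y))

print(np.allclose(beta_sqrt, beta_wls))
```

Both routes minimize the same weighted sum of squared residuals, which is why the coefficients agree up to floating-point error.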
413+
414+
**Test Coverage:**
415+
- 72 tests in `tests/test_stacked_did.py` across 11 test classes:
416+
- `TestStackedDiDBasic` (8): fit, event study, group/all raises, simple aggregation, known constant effect, dynamic effects
417+
- `TestTrimming` (5): IC1 window, IC2 no-controls, trimmed groups reported, all-trimmed raises, wider window
418+
- `TestQWeights` (4): treated=1, aggregate formula, sample_share formula, positivity
419+
- `TestCleanControl` (5): not_yet_treated, strict, never_treated, missing never-treated raises
420+
- `TestClustering` (2): unit, unit_subexp
421+
- `TestStackedData` (4): accessible, required columns, event time range
422+
- `TestEdgeCases` (8): single cohort, anticipation, unbalanced panel, NaN inference, never-treated encodings
423+
- `TestSklearnInterface` (4): get_params, set_params, unknown raises, convenience function
424+
- `TestResultsMethods` (7): summary, to_dataframe, is_significant, significance_stars, repr
425+
- `TestValidation` (8): missing columns, invalid params, population required, no treated units
426+
- R benchmark tests via `benchmarks/run_benchmarks.py --estimator stacked`
427+
428+
**R Comparison Results (200 units, 8 periods, kappa_pre=2, kappa_post=2):**
429+
| Metric | Python | R | Diff |
430+
|--------|--------|---|------|
431+
| Overall ATT | 2.277699574579 | 2.2776995746 | 2.1e-11 |
432+
| Overall SE | 0.062045687626 | 0.062045688027 | 4.0e-10 |
433+
| ES e=-2 ATT | 0.044517975379 | 0.044517975379 | <1e-12 |
434+
| ES e=0 ATT | 2.104181683763 | 2.104181683800 | 3.7e-11 |
435+
| ES e=1 ATT | 2.209990715130 | 2.209990715100 | 3.0e-11 |
436+
| ES e=2 ATT | 2.518926324845 | 2.518926324800 | 4.5e-11 |
437+
| Stacked obs | 1600 | 1600 | exact |
438+
| Sub-experiments | 3 | 3 | exact |
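
The Diff column for the two overall rows can be reproduced from the printed values. A minimal check, using `decimal` so that binary floating-point noise does not blur differences at this scale:

```python
from decimal import Decimal

# Values copied from the Overall ATT and Overall SE rows of the table.
rows = {
    "Overall ATT": ("2.277699574579", "2.2776995746"),
    "Overall SE": ("0.062045687626", "0.062045688027"),
}

# Absolute Python-minus-R difference for each row.
diffs = {name: abs(Decimal(py) - Decimal(r)) for name, (py, r) in rows.items()}
for name, diff in diffs.items():
    print(f"{name}: {diff:.1E}")
```

The check only confirms the arithmetic on the figures as printed; it says nothing about digits beyond those shown.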
439+
440+
**Corrections Made:**
441+
1. **IC1 lower bound and time window aligned with R reference** (`stacked_did.py`,
442+
`_trim_adoption_events()` and `_build_sub_experiment()`): The paper text specifies
443+
time window `[a - kappa_pre - 1, a + kappa_post]` (including an extra pre-period),
444+
but the R reference implementation by co-author Hollingsworth uses
445+
`[a - kappa_pre, a + kappa_post]`. The extra period had no event-study dummy,
446+
altering the baseline regression. Fixed to match R: removed `-1` from both
447+
IC1 check (`a - kappa_pre >= T_min`) and time window start. Discrepancy documented
448+
in `docs/methodology/papers/wing-2024-review.md` Gaps section.
449+
450+
2. **Q-weight computation: event-time-specific for aggregate weighting** (`stacked_did.py`,
451+
`_compute_q_weights()`): Changed aggregate Q-weights from unit counts per sub-experiment
452+
to observation counts per (event_time, sub_exp), matching R reference `compute_weights()`.
453+
For balanced panels, results are unchanged. For unbalanced panels, weights now adjust for
454+
varying observation density. Population/sample_share retain unit-count formulas (paper notation).
455+
456+
3. **Anticipation parameter: reference period and dummies** (`stacked_did.py`, `fit()`):
457+
Reference period now shifts to `e = -1 - anticipation`. Event-time dummies cover the
458+
full window `[-kappa_pre - anticipation, ..., kappa_post]`. Post-treatment effects include
459+
anticipation periods. Consistent with ImputationDiD, TwoStageDiD, SunAbraham.
460+
461+
4. **Group aggregation removed** (`stacked_did.py`): `aggregate="group"` and `aggregate="all"`
462+
removed. The pooled stacked regression cannot produce cohort-specific effects without
463+
cohort×event-time interactions. Use CallawaySantAnna or ImputationDiD for cohort-level estimates.
464+
465+
5. **n_sub_experiments metadata** (`stacked_did.py`, `fit()`): Now tracks actual built
466+
sub-experiments, not all events in omega_kappa. Warns if any sub-experiments are empty
467+
after data filtering.
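
The event-time-specific aggregate weighting described in correction 2 can be sketched with plain counters. This is a toy illustration of the formula `Q = (sub_treat_n/stack_treat_n) / (sub_control_n/stack_control_n)`, not the library's `_compute_q_weights()`; the data and the `q_weight` helper are hypothetical.

```python
from collections import Counter

# Toy stacked observations: (sub_exp, event_time, treated). Hypothetical data.
obs = (
    [("a", 0, 1)] * 2 + [("a", 0, 0)] * 6 +
    [("b", 0, 1)] * 6 + [("b", 0, 0)] * 2
)

sub_n = Counter(obs)                           # counts per (sub_exp, event_time, treated)
stack_n = Counter((e, d) for _, e, d in obs)   # counts per (event_time, treated)


def q_weight(sub_exp, event_time, treated):
    """Aggregate Q-weight for an observation in one (event_time, sub_exp) cell."""
    if treated:
        return 1.0  # treated observations always get Q = 1
    treat_share = sub_n[(sub_exp, event_time, 1)] / stack_n[(event_time, 1)]
    control_share = sub_n[(sub_exp, event_time, 0)] / stack_n[(event_time, 0)]
    return treat_share / control_share


print(q_weight("a", 0, 0), q_weight("b", 0, 0))
```

Sub-experiment `a` holds a quarter of the treated observations but three quarters of the controls, so its controls are down-weighted; `b` is the mirror image and its controls are up-weighted.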
468+
469+
**Outstanding Concerns:**
470+
- Population/sample_share Q-weights use paper's unit-count formulas (no R reference to validate)
471+
- Anticipation not validated against R (R reference doesn't test anticipation > 0)
472+
473+
**Deviations from R's stacked-did-weights:**
474+
1. **NaN for invalid inference**: Python returns NaN for t_stat/p_value/conf_int when
475+
SE is non-finite or zero. R would propagate through `fixest::feols()` error handling.
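
The NaN behaviour can be illustrated with a small guard. This is a sketch under stated assumptions, not the library's `safe_inference()`: the function name `nan_safe_inference` is hypothetical, and the normal approximation is chosen only to keep the example within the standard library.

```python
import math
from statistics import NormalDist


def nan_safe_inference(effect, se, alpha=0.05):
    """Return (t_stat, p_value, conf_int), all NaN when the SE is unusable."""
    # Non-finite or zero SEs make the ratio meaningless; negative values are
    # rejected here as well, which is an extra assumption of this sketch.
    if not math.isfinite(se) or se <= 0:
        nan = float("nan")
        return nan, nan, (nan, nan)
    norm = NormalDist()
    t_stat = effect / se
    p_value = 2 * (1 - norm.cdf(abs(t_stat)))
    z = norm.inv_cdf(1 - alpha / 2)
    return t_stat, p_value, (effect - z * se, effect + z * se)


print(nan_safe_inference(2.0, 0.0))
print(nan_safe_inference(2.0, 1.0))
```

Returning NaN for every inference field keeps downstream code from reporting a p-value or interval that was derived from a degenerate standard error.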
476+
477+
---
478+
382479
### Advanced Estimators
383480

384481
#### SyntheticDiD

README.md

Lines changed: 129 additions & 1 deletion
@@ -70,7 +70,7 @@ Signif. codes: '***' 0.001, '**' 0.01, '*' 0.05, '.' 0.1
7070
- **Wild cluster bootstrap**: Valid inference with few clusters (<50) using Rademacher, Webb, or Mammen weights
7171
- **Panel data support**: Two-way fixed effects estimator for panel designs
7272
- **Multi-period analysis**: Event-study style DiD with period-specific treatment effects
73-
- **Staggered adoption**: Callaway-Sant'Anna (2021), Sun-Abraham (2021), Borusyak-Jaravel-Spiess (2024) imputation, and Two-Stage DiD (Gardner 2022) estimators for heterogeneous treatment timing
73+
- **Staggered adoption**: Callaway-Sant'Anna (2021), Sun-Abraham (2021), Borusyak-Jaravel-Spiess (2024) imputation, Two-Stage DiD (Gardner 2022), and Stacked DiD (Wing, Freedman & Hollingsworth 2024) estimators for heterogeneous treatment timing
7474
- **Triple Difference (DDD)**: Ortiz-Villavicencio & Sant'Anna (2025) estimators with proper covariate handling
7575
- **Synthetic DiD**: Combined DiD with synthetic control for improved robustness
7676
- **Triply Robust Panel (TROP)**: Factor-adjusted DiD with synthetic weights (Athey et al. 2025)
@@ -974,6 +974,78 @@ TwoStageDiD(
974974

975975
Both estimators are efficient under homogeneous treatment effects and produce shorter confidence intervals than Callaway-Sant'Anna or Sun-Abraham.
976976

977+
### Stacked DiD (Wing, Freedman & Hollingsworth 2024)
978+
979+
Stacked DiD addresses TWFE bias in staggered adoption settings by building a "clean" comparison dataset for each treatment cohort and stacking the datasets together. Each cohort's sub-experiment compares units first treated at that cohort's adoption time against clean controls (not-yet-treated or never-treated units) within a fixed event window of `kappa_pre` periods before and `kappa_post` periods after adoption. This avoids the "bad comparisons" problem in TWFE while keeping the regression-based framework familiar from event studies.
980+
981+
```python
982+
from diff_diff import StackedDiD, generate_staggered_data
983+
984+
# Generate sample data
985+
data = generate_staggered_data(n_units=200, n_periods=12,
986+
cohort_periods=[4, 6, 8], seed=42)
987+
988+
# Fit stacked DiD with event study
989+
est = StackedDiD(kappa_pre=2, kappa_post=2)
990+
results = est.fit(data, outcome='outcome', unit='unit',
991+
time='period', first_treat='first_treat',
992+
aggregate='event_study')
993+
results.print_summary()
994+
995+
# Access stacked data for custom analysis
996+
stacked = results.stacked_data
997+
998+
# Convenience function
999+
from diff_diff import stacked_did
1000+
results = stacked_did(data, 'outcome', 'unit', 'period', 'first_treat',
1001+
kappa_pre=2, kappa_post=2, aggregate='event_study')
1002+
```
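
How `kappa_pre` and `kappa_post` decide which cohorts enter the stack can be sketched as follows. The rule used is the IC1 condition `a - kappa_pre >= T_min AND a + kappa_post <= T_max` with window `[a - kappa_pre, a + kappa_post]`; the helper names, panel span, and adoption times are hypothetical and are not the library's internals.

```python
def ic1_keep(a, kappa_pre, kappa_post, t_min, t_max):
    """IC1: the full event window around adoption time a must be observed."""
    return a - kappa_pre >= t_min and a + kappa_post <= t_max


def window(a, kappa_pre, kappa_post):
    """Calendar periods covered by the sub-experiment for adoption time a."""
    return list(range(a - kappa_pre, a + kappa_post + 1))


t_min, t_max = 1, 8        # hypothetical panel span
cohorts = [2, 4, 6, 8]     # hypothetical adoption times

# Cohorts whose window would run past either end of the panel are trimmed.
kept = [a for a in cohorts if ic1_keep(a, 2, 2, t_min, t_max)]
print(kept)
print(window(4, 2, 2))
```

Widening either `kappa` value lengthens the event study but trims more cohorts near the edges of the panel, so the two settings trade off horizon against sample.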
1003+
1004+
**Parameters:**
1005+
1006+
```python
1007+
StackedDiD(
1008+
kappa_pre=1, # Pre-treatment event-study periods
1009+
kappa_post=1, # Post-treatment event-study periods
1010+
weighting='aggregate', # 'aggregate', 'population', or 'sample_share'
1011+
clean_control='not_yet_treated', # 'not_yet_treated', 'strict', or 'never_treated'
1012+
cluster='unit', # 'unit' or 'unit_subexp'
1013+
alpha=0.05, # Significance level
1014+
anticipation=0, # Anticipation periods
1015+
rank_deficient_action='warn', # 'warn', 'error', or 'silent'
1016+
)
1017+
```
1018+
1019+
> **Note:** Group aggregation (`aggregate='group'`) is not supported because the pooled
1020+
> stacked regression cannot produce cohort-specific effects. Use `CallawaySantAnna` or
1021+
> `ImputationDiD` for cohort-level estimates.
1022+
1023+
**When to use Stacked DiD vs Callaway-Sant'Anna:**
1024+
1025+
| Aspect | Stacked DiD | Callaway-Sant'Anna |
1026+
|--------|-------------|-------------------|
1027+
| Approach | Stack cohort sub-experiments, run one pooled weighted event-study regression | 2x2 DiD aggregation |
1028+
| Event window | Fixed by `kappa_pre` / `kappa_post` | Not required |
1029+
| Control group | Not-yet-treated (default) or never-treated | Never-treated or not-yet-treated |
1030+
| Covariates | Passed to pooled regression | Doubly robust / IPW |
1031+
| Intuition | Familiar event-study regression | Nonparametric aggregation |
1032+
1033+
**Convenience function:**
1034+
1035+
```python
1036+
# One-liner estimation
1037+
results = stacked_did(
1038+
data,
1039+
outcome='outcome',
1040+
unit='unit',
1041+
time='period',
1042+
first_treat='first_treat',
1043+
kappa_pre=3,
1044+
kappa_post=3,
1045+
aggregate='event_study'
1046+
)
1047+
```
1048+
9771049
### Triple Difference (DDD)
9781050

9791051
Triple Difference (DDD) is used when treatment requires satisfying two criteria: belonging to a treated **group** AND being in an eligible **partition**. The `TripleDifference` class implements the methodology from Ortiz-Villavicencio & Sant'Anna (2025), which correctly handles covariate adjustment (unlike naive implementations).
@@ -2203,6 +2275,60 @@ TwoStageDiD(
22032275
| `print_summary(alpha)` | Print summary to stdout |
22042276
| `to_dataframe(level)` | Convert to DataFrame ('observation', 'event_study', 'group') |
22052277

2278+
### StackedDiD
2279+
2280+
```python
2281+
StackedDiD(
2282+
kappa_pre=1, # Pre-treatment event-study periods
2283+
kappa_post=1, # Post-treatment event-study periods
2284+
weighting='aggregate', # 'aggregate', 'population', or 'sample_share'
2285+
clean_control='not_yet_treated', # 'not_yet_treated', 'strict', or 'never_treated'
2286+
cluster='unit', # 'unit' or 'unit_subexp'
2287+
alpha=0.05, # Significance level
2288+
anticipation=0, # Anticipation periods
2289+
rank_deficient_action='warn', # 'warn', 'error', or 'silent'
2290+
)
2291+
```
2292+
2293+
**fit() Parameters:**
2294+
2295+
| Parameter | Type | Description |
2296+
|-----------|------|-------------|
2297+
| `data` | DataFrame | Panel data |
2298+
| `outcome` | str | Outcome variable column name |
2299+
| `unit` | str | Unit identifier column |
2300+
| `time` | str | Time period column |
2301+
| `first_treat` | str | First treatment period column (0 or `inf` for never-treated) |
2302+
| `population` | str, optional | Population column (required if weighting='population') |
2303+
| `aggregate` | str, optional | Aggregation: `None`, `"simple"`, or `"event_study"` |
2304+
2305+
### StackedDiDResults
2306+
2307+
**Attributes:**
2308+
2309+
| Attribute | Description |
2310+
|-----------|-------------|
2311+
| `overall_att` | Overall average treatment effect on the treated |
2312+
| `overall_se` | Standard error |
2313+
| `overall_t_stat` | T-statistic |
2314+
| `overall_p_value` | P-value for H0: ATT = 0 |
2315+
| `overall_conf_int` | Confidence interval |
2316+
| `event_study_effects` | Dict of relative time -> effect dict (if `aggregate='event_study'`) |
2317+
| `stacked_data` | The stacked dataset used for estimation |
2318+
| `n_treated_obs` | Number of treated observations |
2319+
| `n_untreated_obs` | Number of untreated (clean control) observations |
2320+
| `n_cohorts` | Number of treatment cohorts |
2321+
| `kappa_pre` | Pre-treatment window used |
2322+
| `kappa_post` | Post-treatment window used |
2323+
2324+
**Methods:**
2325+
2326+
| Method | Description |
2327+
|--------|-------------|
2328+
| `summary(alpha)` | Get formatted summary string |
2329+
| `print_summary(alpha)` | Print summary to stdout |
2330+
| `to_dataframe(level)` | Convert to DataFrame ('event_study') |
2331+
22062332
### TripleDifference
22072333

22082334
```python
@@ -2689,6 +2815,8 @@ The `HonestDiD` module implements sensitivity analysis methods for relaxing the
26892815

26902816
- **Goodman-Bacon, A. (2021).** "Difference-in-Differences with Variation in Treatment Timing." *Journal of Econometrics*, 225(2), 254-277. [https://doi.org/10.1016/j.jeconom.2021.03.014](https://doi.org/10.1016/j.jeconom.2021.03.014)
26912817

2818+
- **Wing, C., Freedman, S. M., & Hollingsworth, A. (2024).** "Stacked Difference-in-Differences." *NBER Working Paper* 32054. [https://www.nber.org/papers/w32054](https://www.nber.org/papers/w32054)
2819+
26922820
### Power Analysis
26932821

26942822
- **Bloom, H. S. (1995).** "Minimum Detectable Effects: A Simple Way to Report the Statistical Power of Experimental Designs." *Evaluation Review*, 19(5), 547-556. [https://doi.org/10.1177/0193841X9501900504](https://doi.org/10.1177/0193841X9501900504)

ROADMAP.md

Lines changed: 3 additions & 10 deletions
@@ -10,7 +10,7 @@ For past changes and release history, see [CHANGELOG.md](CHANGELOG.md).
1010

1111
diff-diff v2.4.1 is a **production-ready** DiD library with feature parity with R's `did` + `HonestDiD` + `synthdid` ecosystem for core DiD analysis:
1212

13-
- **Core estimators**: Basic DiD, TWFE, MultiPeriod, Callaway-Sant'Anna, Sun-Abraham, Borusyak-Jaravel-Spiess Imputation, Synthetic DiD, Triple Difference (DDD), TROP, Two-Stage DiD (Gardner 2022)
13+
- **Core estimators**: Basic DiD, TWFE, MultiPeriod, Callaway-Sant'Anna, Sun-Abraham, Borusyak-Jaravel-Spiess Imputation, Synthetic DiD, Triple Difference (DDD), TROP, Two-Stage DiD (Gardner 2022), Stacked DiD (Wing et al. 2024)
1414
- **Valid inference**: Robust SEs, cluster SEs, wild bootstrap, multiplier bootstrap, placebo-based variance
1515
- **Assumption diagnostics**: Parallel trends tests, placebo tests, Goodman-Bacon decomposition
1616
- **Sensitivity analysis**: Honest DiD (Rambachan-Roth), Pre-trends power analysis (Roth 2022)
@@ -24,16 +24,9 @@ diff-diff v2.4.1 is a **production-ready** DiD library with feature parity with
2424

2525
High-value additions building on our existing foundation.
2626

27-
### Stacked Difference-in-Differences
27+
### ~~Stacked Difference-in-Differences~~ (Implemented in v2.5)
2828

29-
An intuitive approach that explicitly constructs sub-experiments for each treatment cohort, avoiding forbidden comparisons.
30-
31-
- Creates separate datasets per cohort with valid controls only
32-
- Stacks sub-experiments and applies corrective sample weights
33-
- Returns variance-weighted ATT with proper compositional balance
34-
- Conceptually simpler alternative to aggregation-based methods
35-
36-
**Reference**: [Wing, Freedman & Hollingsworth (2024)](https://www.nber.org/papers/w32054). *NBER Working Paper 32054*. Stata: `STACKDID`.
29+
Implemented as `StackedDiD`. See `diff_diff/stacked_did.py`.
3730

3831
### Staggered Triple Difference (DDD)
3932
