Add Stacked DiD estimator (Wing, Freedman & Hollingsworth 2024) #172

Merged
igerber merged 3 commits into main from stacked-did on Feb 19, 2026
Conversation

@igerber (Owner) commented Feb 19, 2026

Summary

  • Implement Stacked Difference-in-Differences estimator from NBER Working Paper 32054
  • Core StackedDiD class with IC1/IC2 trimming, Q-weight computation (3 schemes), WLS event study regression, delta-method SE
  • Three clean control modes (not_yet_treated, strict, never_treated), two clustering levels, anticipation support
  • StackedDiDResults dataclass with summary(), to_dataframe(), event study and group effects (usage sketched after this list)
  • 72 tests across 11 test classes covering methodology, edge cases, and sklearn interface
  • R and Python benchmark scripts validated against R reference implementation
  • Full documentation: README usage section, API docs, REGISTRY.md entry, METHODOLOGY_REVIEW.md entry
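
For illustration, a minimal usage sketch of the API described above. The clean_control, weighting, cluster, and anticipation options come from this PR; the kappa_pre/kappa_post names, the import path, and the fit() signature are assumptions, not the verified interface:

```python
# Usage sketch only: import path, kappa_* names, and fit() signature are
# assumptions; the option values are taken from this PR's description.
import pandas as pd
from diff_diff import StackedDiD  # import path assumed

panel = pd.DataFrame({
    "unit": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "period": [2000, 2001, 2002] * 3,
    "y": [1.0, 1.2, 2.1, 0.9, 1.1, 1.0, 1.5, 1.4, 1.6],
    "first_treat": [2001, 2001, 2001, 0, 0, 0, 0, 0, 0],  # 0 = never treated
})

est = StackedDiD(
    kappa_pre=1,                      # pre-event window (name assumed)
    kappa_post=1,                     # post-event window (name assumed)
    clean_control="not_yet_treated",  # or "strict" / "never_treated"
    weighting="sample_share",         # or "aggregate" / "population"
    cluster="unit",                   # or "unit_subexp"
    anticipation=0,
)
results = est.fit(panel, outcome="y", unit="unit",
                  time="period", treatment="first_treat")  # signature assumed
print(results.summary())
event_study = results.to_dataframe()  # event-study coefficients
```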

Methodology references (required if estimator / math changes)

  • Method name(s): Stacked Difference-in-Differences
  • Paper / source link(s): Wing, C., Freedman, S. M., & Hollingsworth, A. (2024). Stacked Difference-in-Differences. NBER Working Paper 32054. http://www.nber.org/papers/w32054
  • Reference implementation: https://github.com/hollina/stacked-did-weights (create_sub_exp() + compute_weights())
  • Any intentional deviations from the source (and why):
    • Time window follows R reference [a - kappa_pre, a + kappa_post] rather than paper text [a - kappa_pre - 1, a + kappa_post] (paper vs. R discrepancy documented in wing-2024-review.md; worked example after this list)
    • NaN for invalid inference (defensive enhancement over R's error behavior)
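
For concreteness on the window deviation: with adoption year a = 2005, kappa_pre = 3, and kappa_post = 2, the implemented (R-reference) window is [2002, 2007], while a literal reading of the paper text would give [2001, 2007], i.e. one extra pre-period.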

Validation

  • Tests added/updated: tests/test_stacked_did.py (72 tests, 11 classes)
  • R benchmark validation: ATT diff < 2.1e-11, SE diff < 4.0e-10, event study correlation = 1.0
  • Benchmark scripts: benchmarks/R/benchmark_stacked_did.R, benchmarks/python/benchmark_stacked_did.py
  • Full test suite: 1474 passed, 0 failures

Security / privacy

  • Confirm no secrets/PII in this PR: Yes

Generated with Claude Code

Implement the Stacked Difference-in-Differences estimator from NBER WP
32054. The estimator corrects bias in naive stacked regressions by
applying Q-weights that ensure stable composition across event time.
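
As a rough illustration of the corrective idea, here is a sketch following the general logic of compute_weights() in the R reference; the column names and the exact normalization are assumptions, not the package's verified code:

```python
# Sketch of sample-share corrective weights: treated observations keep
# weight 1, controls are scaled so their distribution across
# sub-experiments mirrors that of the treated. Column names assumed.
import pandas as pd

def sample_share_q_weights(stacked: pd.DataFrame) -> pd.Series:
    by_d = stacked.groupby("sub_exp")["treated"]
    n_treat_d = by_d.transform("sum")                    # treated obs in sub-exp d
    n_ctrl_d = by_d.transform(lambda s: (1 - s).sum())   # control obs in sub-exp d
    share_t = n_treat_d / stacked["treated"].sum()       # N^T_d / N^T
    share_c = n_ctrl_d / (1 - stacked["treated"]).sum()  # N^C_d / N^C
    return stacked["treated"] + (1 - stacked["treated"]) * share_t / share_c
```

Note that this sketch computes shares per sub-experiment only; the review below flags that the reference computes them per (event_time, sub_exp), which matters for unbalanced panels.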

Core implementation:
- StackedDiD class with IC1/IC2 trimming, Q-weight computation (aggregate,
  population, sample_share), WLS event study regression (Eq. 3), and
  delta-method SE for overall ATT
- Three clean control modes: not_yet_treated, strict, never_treated
- Clustering at unit or unit×sub-experiment level
- Anticipation parameter support

Validated against R reference implementation (stacked-did-weights by
co-author Hollingsworth): ATT matches within 2.1e-11, SE within 4.0e-10,
all event study coefficients match to machine epsilon.

Includes 72 tests, R/Python benchmark scripts, full documentation
(README, API docs, REGISTRY.md, METHODOLOGY_REVIEW.md), and ROADMAP
update.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions

Overall assessment: ⚠️ Needs changes

Executive Summary

  • Q-weight computation deviates from the event-time-specific reference and can mis-weight unbalanced panels.
  • anticipation expands the window but does not shift event-time indexing or the reference period, leaving extra periods unmodeled.
  • aggregate='group' returns identical cohort effects despite claiming cohort-weighted aggregation.
  • Docs list unsupported parameters and inconsistent control/cluster semantics; tests don’t validate the above behaviors.

Methodology

  • P1 | Impact: Q-weights are computed using unit counts per sub-experiment only, ignoring the event-time-specific shares used in the reference implementation; this changes the estimand and can bias estimates on unbalanced panels. | Fix: compute Q-weights by (event_time, sub_exp) as in the reference, or enforce balanced panels and document the deviation. (diff_diff/stacked_did.py:L615-L681, benchmarks/R/benchmark_stacked_did.R:L66-L95)
  • P1 | Impact: anticipation shifts the window start, but event-time dummies and the reference period remain anchored at e=-1; extra pre-periods are included without corresponding indicators, so anticipation>0 yields a mis-specified model and does not “shift treatment timing” as described. | Fix: define an effective adoption time a_eff = a - anticipation, recompute _event_time and event_times, and set the reference period to -1 - anticipation (or explicitly expand the dummy set to cover the added pre-periods). Align IC1/IC2 and window bounds accordingly; see the sketch after this list. (diff_diff/stacked_did.py:L273-L357, diff_diff/stacked_did.py:L563-L581)
  • P1 | Impact: aggregate='group' returns identical effects for all cohorts while the docstring claims cohort-weighted effects; this is misleading and not cohort-specific. | Fix: either remove/rename group aggregation or compute cohort-specific estimates (e.g., re-fit per sub-experiment or add cohort interactions and extract cohort-specific post-period averages). (diff_diff/stacked_did.py:L688-L771)
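
A sketch of the suggested anticipation fix referenced above; names are hypothetical and this is not the package's code:

```python
# Hypothetical sketch: index event time against an effective adoption date
# a_eff = a - anticipation, so the reference period stays at e = -1 relative
# to a_eff (equivalently e = -1 - anticipation relative to the stated
# adoption date a).
def event_time(period: int, adoption: int, anticipation: int = 0) -> int:
    a_eff = adoption - anticipation
    return period - a_eff

assert event_time(2004, adoption=2005, anticipation=1) == 0   # anticipation period counts as post
assert event_time(2003, adoption=2005, anticipation=1) == -1  # reference period
```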

Code Quality

  • P2 | Impact: n_sub_experiments is reported as len(omega_kappa) even when filtering leaves some sub-experiments empty and they are dropped, so the metadata is inconsistent with stacked_data. | Fix: track the actually appended sub-experiments (len(sub_experiments)) and consider warning when any cohort yields an empty sub-experiment. (diff_diff/stacked_did.py:L250-L407)

Performance
No findings.

Maintainability
No findings.

Tech Debt
No findings.

Security
No findings.

Documentation/Tests

  • P2 | Impact: README/API docs describe parameters that don’t exist (control_group, n_bootstrap, bootstrap_weights, seed, covariates) and misstate cluster semantics (column name vs fixed mode), which will mislead users and cause runtime errors. | Fix: update docs to match the actual signature (clean_control, weighting, cluster as "unit"/"unit_subexp", no bootstrap/covariates), or implement the documented parameters. (README.md:L1004-L1017, README.md:L2277-L2301, diff_diff/stacked_did.py:L116-L166)
  • P2 | Impact: REGISTRY algorithm step specifies a window [a - kappa_pre - 1, a + kappa_post] while the implementation uses [a - kappa_pre - anticipation, a + kappa_post] and the note below asserts [a - kappa_pre, a + kappa_post], creating internal inconsistency. | Fix: reconcile REGISTRY wording with the chosen implementation and anticipation semantics. (docs/methodology/REGISTRY.md:L692-L708, diff_diff/stacked_did.py:L563-L581)
  • P2 | Impact: Tests for anticipation and unbalanced panels only assert “does not crash” and do not validate correct event-time indexing or Q-weighting; group aggregation tests only check presence, not correctness. | Fix: add assertions for reference period under anticipation, event-time coverage, Q-weights under unbalanced data vs reference, and cohort-specific group effects (or update expectations if group effects are intentionally pooled). (tests/test_stacked_did.py:L109-L122, tests/test_stacked_did.py:L320-L358, tests/test_stacked_did.py:L552-L595)

CRITICAL:
- Fix Q-weight computation for unbalanced panels: aggregate weighting
  now uses observation counts per (event_time, sub_exp), matching the R
  reference compute_weights(); see the sketch after this list.
  Population/sample_share unchanged.
- Fix anticipation parameter: reference period shifts to e=-1-anticipation,
  post-treatment includes anticipation periods, consistent with
  ImputationDiD/TwoStageDiD/SunAbraham.
- Remove aggregate='group' and aggregate='all': pooled stacked regression
  cannot produce cohort-specific effects. Raises ValueError with guidance.
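
A sketch of the per-(event_time, sub_exp) weighting described in the first bullet, assuming a long stacked frame with event_time, sub_exp, and treated columns; it illustrates the described fix, not the verbatim package code:

```python
# Control weights computed separately at each event time, so unbalanced
# panels keep a stable treated/control composition within every period.
import pandas as pd

def q_weights_by_event_time(stacked: pd.DataFrame) -> pd.Series:
    cell = stacked.groupby(["event_time", "sub_exp"])["treated"]
    n_t_cell = cell.transform("sum")                    # treated obs in (e, d)
    n_c_cell = cell.transform(lambda s: (1 - s).sum())  # control obs in (e, d)
    et = stacked.groupby("event_time")["treated"]
    n_t_et = et.transform("sum")                        # treated obs at e
    n_c_et = et.transform(lambda s: (1 - s).sum())      # control obs at e
    share_t = n_t_cell / n_t_et
    share_c = n_c_cell / n_c_et
    return stacked["treated"] + (1 - stacked["treated"]) * share_t / share_c
```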

MEDIUM:
- Fix n_sub_experiments to track actual built sub-experiments, warn on empty.
- Fix README/API parameter blocks to match actual constructor signature.
- Fix REGISTRY.md algorithm steps and add anticipation edge cases.
- Strengthen test assertions for anticipation, Q-weights, and unbalanced panels.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@igerber (Owner, Author) commented Feb 19, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 2026f57039126b52c8f319e45102536aa339a96c


Overall assessment: ✅ Looks good

Executive Summary

  • StackedDiD implementation aligns with the Methodology Registry and the cited R reference for IC1/IC2 trimming, Q-weights, WLS event-study regression, clustering, and anticipation handling.
  • No P0/P1 methodology mismatches found.
  • Minor doc/test gaps: the anticipation window messaging doesn’t reflect the expanded window, and the population-weighting path lacks a formula-level test.
  • Tests not run in this review.

Methodology

  • No issues found. Cross-check against docs/methodology/REGISTRY.md indicates IC1/IC2 trimming, Q-weight schemes, Equation 3 regression spec, cluster SEs, and anticipation behavior are implemented as described in diff_diff/stacked_did.py.

Code Quality

  • No issues found.

Performance

  • No issues found.

Maintainability

  • No issues found.

Tech Debt

  • No issues found.

Security

  • No issues found.

Documentation/Tests

  • Severity P3. Impact: Trimming warnings and result summaries report the event window as [-kappa_pre, kappa_post] even when anticipation > 0, which can mislead users about the actual estimation window. Concrete fix: include anticipation in the displayed window (e.g., [-kappa_pre-anticipation, kappa_post]) or add an explicit note when anticipation > 0. (diff_diff/stacked_did.py:L492-L499, diff_diff/stacked_did_results.py:L136-L138)
  • Severity P3. Impact: Population-weighting formula isn’t validated by tests beyond “population column required,” so regressions could drift from the Registry formula without detection. Concrete fix: add a small deterministic test that checks population Q-weights against the formula (Table 1) for a hand-constructed dataset. (tests/test_stacked_did.py:L895-L905)
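
One hedged option for the suggested test; the accessor names (results.stacked_data, "q_weight") and fit() signature are hypothetical. Rather than guessing Table 1 values here, it checks the defining property that Q-weights hold the weighted treated share constant across event time; exact Table 1 values can be pinned once derived by hand:

```python
# pytest sketch -- accessor names and fit() signature are assumptions to
# verify against the actual API before adopting this test.
import pandas as pd

def test_population_q_weights_stabilize_composition():
    df = pd.DataFrame({
        "unit": [1, 1, 1, 2, 2, 2, 3, 3, 3],
        "period": [2000, 2001, 2002] * 3,
        "y": [1.0, 1.1, 2.0, 1.0, 0.9, 1.1, 1.2, 1.0, 1.3],
        "first_treat": [2001, 2001, 2001, 0, 0, 0, 0, 0, 0],
        "population": [100] * 3 + [300] * 3 + [600] * 3,
    })
    est = StackedDiD(kappa_pre=1, kappa_post=1, weighting="population")
    results = est.fit(df, outcome="y", unit="unit",
                      time="period", treatment="first_treat")
    stacked = results.stacked_data  # hypothetical accessor
    share = stacked.groupby("event_time").apply(
        lambda g: (g["treated"] * g["q_weight"]).sum() / g["q_weight"].sum()
    )
    assert share.max() - share.min() < 1e-12  # composition stable across e
```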

If you want, I can also add the population-weighting unit test and update the event-window messaging.

Expand the minimal test dataset from 4 to 6 observations (3 units × 2
periods) to avoid a saturated model (n=k=4), which causes division by
zero in the cluster-robust VCV adjustment. The Rust backend handles this
gracefully, but pure Python's (n-1)/(n-k) term requires n > k.
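
The arithmetic behind the fix, using the numbers from the message above:

```python
# The cluster-robust VCV adjustment described above scales by
# (n - 1) / (n - k), which is undefined when the model is saturated.
n, k = 4, 4
# (n - 1) / (n - k)  ->  3 / 0, ZeroDivisionError in the pure-Python path
n, k = 6, 4  # 3 units x 2 periods against the same regressors
adjustment = (n - 1) / (n - k)  # 5 / 2 = 2.5, finite, so SEs are defined
```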

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
igerber merged commit 1d5cda8 into main on Feb 19, 2026
10 checks passed
igerber deleted the stacked-did branch on February 19, 2026 at 21:54