Add Stacked DiD estimator (Wing, Freedman & Hollingsworth 2024) #172

Merged
igerber merged 3 commits into main from stacked-did on Feb 19, 2026
Conversation

@igerber (Owner) commented Feb 19, 2026

Summary

  • Implement Stacked Difference-in-Differences estimator from NBER Working Paper 32054
  • Core StackedDiD class with IC1/IC2 trimming, Q-weight computation (3 schemes), WLS event study regression, delta-method SE
  • Three clean control modes (not_yet_treated, strict, never_treated), two clustering levels, anticipation support
  • StackedDiDResults dataclass with summary(), to_dataframe(), event study and group effects (usage sketched after this list)
  • 72 tests across 11 test classes covering methodology, edge cases, and sklearn interface
  • R and Python benchmark scripts validated against R reference implementation
  • Full documentation: README usage section, API docs, REGISTRY.md entry, METHODOLOGY_REVIEW.md entry
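
For illustration, a minimal usage sketch of the API described above. The clean_control, weighting, cluster, and anticipation options come from this PR; the kappa_pre/kappa_post names, the import path, and the fit() signature are assumptions, not the verified interface:

```python
# Usage sketch only: import path, kappa_* names, and fit() signature are
# assumptions; the option values are taken from this PR's description.
import pandas as pd
from diff_diff import StackedDiD  # import path assumed

panel = pd.DataFrame({
    "unit": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "period": [2000, 2001, 2002] * 3,
    "y": [1.0, 1.2, 2.1, 0.9, 1.1, 1.0, 1.5, 1.4, 1.6],
    "first_treat": [2001, 2001, 2001, 0, 0, 0, 0, 0, 0],  # 0 = never treated
})

est = StackedDiD(
    kappa_pre=1,                      # pre-event window (name assumed)
    kappa_post=1,                     # post-event window (name assumed)
    clean_control="not_yet_treated",  # or "strict" / "never_treated"
    weighting="sample_share",         # or "aggregate" / "population"
    cluster="unit",                   # or "unit_subexp"
    anticipation=0,
)
results = est.fit(panel, outcome="y", unit="unit",
                  time="period", treatment="first_treat")  # signature assumed
print(results.summary())
event_study = results.to_dataframe()  # event-study coefficients
```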

Methodology references (required if estimator / math changes)

  • Method name(s): Stacked Difference-in-Differences
  • Paper / source link(s): Wing, C., Freedman, S. M., & Hollingsworth, A. (2024). Stacked Difference-in-Differences. NBER Working Paper 32054. http://www.nber.org/papers/w32054
  • Reference implementation: https://github.com/hollina/stacked-did-weights (create_sub_exp() + compute_weights())
  • Any intentional deviations from the source (and why):
    • Time window follows R reference [a - kappa_pre, a + kappa_post] rather than paper text [a - kappa_pre - 1, a + kappa_post] (paper vs. R discrepancy documented in wing-2024-review.md; worked example after this list)
    • NaN for invalid inference (defensive enhancement over R's error behavior)
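
For concreteness on the window deviation: with adoption year a = 2005, kappa_pre = 3, and kappa_post = 2, the implemented (R-reference) window is [2002, 2007], while a literal reading of the paper text would give [2001, 2007], i.e. one extra pre-period.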

Validation

  • Tests added/updated: tests/test_stacked_did.py (72 tests, 11 classes)
  • R benchmark validation: ATT diff < 2.1e-11, SE diff < 4.0e-10, event study correlation = 1.0
  • Benchmark scripts: benchmarks/R/benchmark_stacked_did.R, benchmarks/python/benchmark_stacked_did.py
  • Full test suite: 1474 passed, 0 failures

Security / privacy

  • Confirm no secrets/PII in this PR: Yes

Generated with Claude Code

Implement the Stacked Difference-in-Differences estimator from NBER WP
32054. The estimator corrects bias in naive stacked regressions by
applying Q-weights that ensure stable composition across event time.
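
As a rough illustration of the corrective idea, here is a sketch following the general logic of compute_weights() in the R reference; the column names and the exact normalization are assumptions, not the package's verified code:

```python
# Sketch of sample-share corrective weights: treated observations keep
# weight 1, controls are scaled so their distribution across
# sub-experiments mirrors that of the treated. Column names assumed.
import pandas as pd

def sample_share_q_weights(stacked: pd.DataFrame) -> pd.Series:
    by_d = stacked.groupby("sub_exp")["treated"]
    n_treat_d = by_d.transform("sum")                    # treated obs in sub-exp d
    n_ctrl_d = by_d.transform(lambda s: (1 - s).sum())   # control obs in sub-exp d
    share_t = n_treat_d / stacked["treated"].sum()       # N^T_d / N^T
    share_c = n_ctrl_d / (1 - stacked["treated"]).sum()  # N^C_d / N^C
    return stacked["treated"] + (1 - stacked["treated"]) * share_t / share_c
```

Note that this sketch computes shares per sub-experiment only; the review below flags that the reference computes them per (event_time, sub_exp), which matters for unbalanced panels.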

Core implementation:
- StackedDiD class with IC1/IC2 trimming, Q-weight computation (aggregate,
  population, sample_share), WLS event study regression (Eq. 3), and
  delta-method SE for overall ATT
- Three clean control modes: not_yet_treated, strict, never_treated
- Clustering at unit or unit×sub-experiment level
- Anticipation parameter support

Validated against R reference implementation (stacked-did-weights by
co-author Hollingsworth): ATT matches within 2.1e-11, SE within 4.0e-10,
all event study coefficients match to machine epsilon.

Includes 72 tests, R/Python benchmark scripts, full documentation
(README, API docs, REGISTRY.md, METHODOLOGY_REVIEW.md), and ROADMAP
update.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions

Overall assessment: ⚠️ Needs changes

Executive Summary

  • Q-weight computation deviates from the event-time-specific reference and can mis-weight unbalanced panels.
  • anticipation expands the window but does not shift event-time indexing or the reference period, leaving extra periods unmodeled.
  • aggregate='group' returns identical cohort effects despite claiming cohort-weighted aggregation.
  • Docs list unsupported parameters and inconsistent control/cluster semantics; tests don’t validate the above behaviors.

Methodology

  • P1 | Impact: Q-weights are computed using unit counts per sub-experiment only, ignoring the event-time-specific shares used in the reference implementation; this changes the estimand and can bias estimates on unbalanced panels. | Fix: compute Q-weights by (event_time, sub_exp) as in the reference, or enforce balanced panels and document the deviation. (diff_diff/stacked_did.py:L615-L681, benchmarks/R/benchmark_stacked_did.R:L66-L95)
  • P1 | Impact: anticipation shifts the window start, but event-time dummies and the reference period remain anchored at e=-1; extra pre-periods are included without corresponding indicators, so anticipation>0 yields a mis-specified model and does not “shift treatment timing” as described. | Fix: define an effective adoption time a_eff = a - anticipation, recompute _event_time and event_times, and set the reference period to -1 - anticipation (or explicitly expand the dummy set to cover the added pre-periods). Align IC1/IC2 and window bounds accordingly; see the sketch after this list. (diff_diff/stacked_did.py:L273-L357, diff_diff/stacked_did.py:L563-L581)
  • P1 | Impact: aggregate='group' returns identical effects for all cohorts while the docstring claims cohort-weighted effects; this is misleading and not cohort-specific. | Fix: either remove/rename group aggregation or compute cohort-specific estimates (e.g., re-fit per sub-experiment or add cohort interactions and extract cohort-specific post-period averages). (diff_diff/stacked_did.py:L688-L771)
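
A sketch of the suggested anticipation fix referenced above; names are hypothetical and this is not the package's code:

```python
# Hypothetical sketch: index event time against an effective adoption date
# a_eff = a - anticipation, so the reference period stays at e = -1 relative
# to a_eff (equivalently e = -1 - anticipation relative to the stated
# adoption date a).
def event_time(period: int, adoption: int, anticipation: int = 0) -> int:
    a_eff = adoption - anticipation
    return period - a_eff

assert event_time(2004, adoption=2005, anticipation=1) == 0   # anticipation period counts as post
assert event_time(2003, adoption=2005, anticipation=1) == -1  # reference period
```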

Code Quality

  • P2 | Impact: n_sub_experiments is reported as len(omega_kappa) even when filtering leaves some sub-experiments empty and they are dropped, so the metadata is inconsistent with stacked_data. | Fix: track the actually appended sub-experiments (len(sub_experiments)) and consider warning when any cohort yields an empty sub-experiment. (diff_diff/stacked_did.py:L250-L407)

Performance
No findings.

Maintainability
No findings.

Tech Debt
No findings.

Security
No findings.

Documentation/Tests

  • P2 | Impact: README/API docs describe parameters that don’t exist (control_group, n_bootstrap, bootstrap_weights, seed, covariates) and misstate cluster semantics (column name vs fixed mode), which will mislead users and cause runtime errors. | Fix: update docs to match the actual signature (clean_control, weighting, cluster as "unit"/"unit_subexp", no bootstrap/covariates), or implement the documented parameters. (README.md:L1004-L1017, README.md:L2277-L2301, diff_diff/stacked_did.py:L116-L166)
  • P2 | Impact: REGISTRY algorithm step specifies a window [a - kappa_pre - 1, a + kappa_post] while the implementation uses [a - kappa_pre - anticipation, a + kappa_post] and the note below asserts [a - kappa_pre, a + kappa_post], creating internal inconsistency. | Fix: reconcile REGISTRY wording with the chosen implementation and anticipation semantics. (docs/methodology/REGISTRY.md:L692-L708, diff_diff/stacked_did.py:L563-L581)
  • P2 | Impact: Tests for anticipation and unbalanced panels only assert “does not crash” and do not validate correct event-time indexing or Q-weighting; group aggregation tests only check presence, not correctness. | Fix: add assertions for reference period under anticipation, event-time coverage, Q-weights under unbalanced data vs reference, and cohort-specific group effects (or update expectations if group effects are intentionally pooled). (tests/test_stacked_did.py:L109-L122, tests/test_stacked_did.py:L320-L358, tests/test_stacked_did.py:L552-L595)

CRITICAL:
- Fix Q-weight computation for unbalanced panels: aggregate weighting
  now uses observation counts per (event_time, sub_exp), matching the R
  reference compute_weights(); see the sketch after this list.
  Population/sample_share unchanged.
- Fix anticipation parameter: reference period shifts to e=-1-anticipation,
  post-treatment includes anticipation periods, consistent with
  ImputationDiD/TwoStageDiD/SunAbraham.
- Remove aggregate='group' and aggregate='all': pooled stacked regression
  cannot produce cohort-specific effects. Raises ValueError with guidance.
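
A sketch of the per-(event_time, sub_exp) weighting described in the first bullet, assuming a long stacked frame with event_time, sub_exp, and treated columns; it illustrates the described fix, not the verbatim package code:

```python
# Control weights computed separately at each event time, so unbalanced
# panels keep a stable treated/control composition within every period.
import pandas as pd

def q_weights_by_event_time(stacked: pd.DataFrame) -> pd.Series:
    cell = stacked.groupby(["event_time", "sub_exp"])["treated"]
    n_t_cell = cell.transform("sum")                    # treated obs in (e, d)
    n_c_cell = cell.transform(lambda s: (1 - s).sum())  # control obs in (e, d)
    et = stacked.groupby("event_time")["treated"]
    n_t_et = et.transform("sum")                        # treated obs at e
    n_c_et = et.transform(lambda s: (1 - s).sum())      # control obs at e
    share_t = n_t_cell / n_t_et
    share_c = n_c_cell / n_c_et
    return stacked["treated"] + (1 - stacked["treated"]) * share_t / share_c
```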

MEDIUM:
- Fix n_sub_experiments to track actual built sub-experiments, warn on empty.
- Fix README/API parameter blocks to match actual constructor signature.
- Fix REGISTRY.md algorithm steps and add anticipation edge cases.
- Strengthen test assertions for anticipation, Q-weights, and unbalanced panels.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@igerber (Owner, Author) commented Feb 19, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 2026f57039126b52c8f319e45102536aa339a96c


Overall assessment: ✅ Looks good

Executive Summary

  • StackedDiD implementation aligns with the Methodology Registry and the cited R reference for IC1/IC2 trimming, Q-weights, WLS event-study regression, clustering, and anticipation handling.
  • No P0/P1 methodology mismatches found.
  • Minor doc/test gaps: the anticipation window messaging doesn’t reflect the expanded window, and the population-weighting path lacks a formula-level test.
  • Tests not run in this review.

Methodology

  • No issues found. Cross-check against docs/methodology/REGISTRY.md indicates IC1/IC2 trimming, Q-weight schemes, Equation 3 regression spec, cluster SEs, and anticipation behavior are implemented as described in diff_diff/stacked_did.py.

Code Quality

  • No issues found.

Performance

  • No issues found.

Maintainability

  • No issues found.

Tech Debt

  • No issues found.

Security

  • No issues found.

Documentation/Tests

  • Severity P3. Impact: Trimming warnings and result summaries report the event window as [-kappa_pre, kappa_post] even when anticipation > 0, which can mislead users about the actual estimation window. Concrete fix: include anticipation in the displayed window (e.g., [-kappa_pre-anticipation, kappa_post]) or add an explicit note when anticipation > 0. (diff_diff/stacked_did.py:L492-L499, diff_diff/stacked_did_results.py:L136-L138)
  • Severity P3. Impact: Population-weighting formula isn’t validated by tests beyond “population column required,” so regressions could drift from the Registry formula without detection. Concrete fix: add a small deterministic test that checks population Q-weights against the formula (Table 1) for a hand-constructed dataset. (tests/test_stacked_did.py:L895-L905)
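
One hedged option for the suggested test; the accessor names (results.stacked_data, "q_weight") and fit() signature are hypothetical. Rather than guessing Table 1 values here, it checks the defining property that Q-weights hold the weighted treated share constant across event time; exact Table 1 values can be pinned once derived by hand:

```python
# pytest sketch -- accessor names and fit() signature are assumptions to
# verify against the actual API before adopting this test.
import pandas as pd

def test_population_q_weights_stabilize_composition():
    df = pd.DataFrame({
        "unit": [1, 1, 1, 2, 2, 2, 3, 3, 3],
        "period": [2000, 2001, 2002] * 3,
        "y": [1.0, 1.1, 2.0, 1.0, 0.9, 1.1, 1.2, 1.0, 1.3],
        "first_treat": [2001, 2001, 2001, 0, 0, 0, 0, 0, 0],
        "population": [100] * 3 + [300] * 3 + [600] * 3,
    })
    est = StackedDiD(kappa_pre=1, kappa_post=1, weighting="population")
    results = est.fit(df, outcome="y", unit="unit",
                      time="period", treatment="first_treat")
    stacked = results.stacked_data  # hypothetical accessor
    share = stacked.groupby("event_time").apply(
        lambda g: (g["treated"] * g["q_weight"]).sum() / g["q_weight"].sum()
    )
    assert share.max() - share.min() < 1e-12  # composition stable across e
```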

If you want, I can also add the population-weighting unit test and update the event-window messaging.

Expand the minimal test dataset from 4 to 6 observations (3 units × 2
periods) to avoid a saturated model (n=k=4), which causes division by
zero in the cluster-robust VCV adjustment. The Rust backend handles this
gracefully, but pure Python's (n-1)/(n-k) term requires n > k.
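
The arithmetic behind the fix, using the numbers from the message above:

```python
# The cluster-robust VCV adjustment described above scales by
# (n - 1) / (n - k), which is undefined when the model is saturated.
n, k = 4, 4
# (n - 1) / (n - k)  ->  3 / 0, ZeroDivisionError in the pure-Python path
n, k = 6, 4  # 3 units x 2 periods against the same regressors
adjustment = (n - 1) / (n - k)  # 5 / 2 = 2.5, finite, so SEs are defined
```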

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
igerber merged commit 1d5cda8 into main on Feb 19, 2026
10 checks passed
igerber deleted the stacked-did branch on February 19, 2026 at 21:54