16 changes: 16 additions & 0 deletions CLAUDE.md
@@ -144,6 +144,18 @@ pure Rust by default.
- **`diff_diff/two_stage_bootstrap.py`** - Bootstrap inference:
- `TwoStageDiDBootstrapMixin` - Mixin with GMM influence function bootstrap methods

- **`diff_diff/stacked_did.py`** - Stacked DiD estimator (Wing et al. 2024):
- `StackedDiD` - Stacked DiD with corrective Q-weights for compositional balance
- `stacked_did()` - Convenience function
- Builds sub-experiments per adoption cohort with clean controls
- IC1/IC2 trimming for compositional balance across event times
- Q-weights for aggregate, population, or sample share estimands (Table 1)
- WLS event study regression via sqrt(w) transformation
- Re-exports result class for backward compatibility
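  The sqrt(w) WLS transformation noted above rests on a standard identity — minimizing a weighted sum of squares is equivalent to running plain OLS on rows rescaled by sqrt(w). A minimal sketch of that identity (an illustration, not the library's internal code):

  ```python
  import numpy as np

  def wls_via_sqrt_w(X, y, w):
      # Minimizing sum_i w_i * (y_i - x_i'b)^2 is identical to OLS on rows
      # rescaled by sqrt(w_i), so one sqrt(w) transform turns any OLS routine
      # into a weighted one.
      sw = np.sqrt(w)
      beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
      return beta

  rng = np.random.default_rng(0)
  X = rng.normal(size=(40, 3))
  y = rng.normal(size=40)
  w = rng.uniform(0.5, 2.0, size=40)

  # Agrees with the normal-equations solution (X'WX)^{-1} X'Wy
  direct = np.linalg.solve(X.T @ (X * w[:, None]), X.T @ (y * w))
  assert np.allclose(wls_via_sqrt_w(X, y, w), direct)
  ```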

- **`diff_diff/stacked_did_results.py`** - Result container classes:
- `StackedDiDResults` - Results with overall ATT, event study, group effects, stacked data access

- **`diff_diff/triple_diff.py`** - Triple Difference (DDD) estimator:
- `TripleDifference` - Ortiz-Villavicencio & Sant'Anna (2025) estimator for DDD designs
- `TripleDifferenceResults` - Results with ATT, SEs, cell means, diagnostics
@@ -314,6 +326,7 @@ pure Rust by default.
├── TwoStageDiD
├── TripleDifference
├── TROP
├── StackedDiD
├── SyntheticDiD
└── BaconDecomposition
```
@@ -429,6 +442,7 @@ Tests mirror the source modules:
- `tests/test_sun_abraham.py` - Tests for SunAbraham interaction-weighted estimator
- `tests/test_imputation.py` - Tests for ImputationDiD (Borusyak et al. 2024) estimator
- `tests/test_two_stage.py` - Tests for TwoStageDiD (Gardner 2022) estimator, including equivalence tests with ImputationDiD
- `tests/test_stacked_did.py` - Tests for Stacked DiD (Wing et al. 2024) estimator
- `tests/test_triple_diff.py` - Tests for Triple Difference (DDD) estimator
- `tests/test_trop.py` - Tests for Triply Robust Panel (TROP) estimator
- `tests/test_bacon.py` - Tests for Goodman-Bacon decomposition
@@ -445,6 +459,8 @@ Tests mirror the source modules:

Session-scoped `ci_params` fixture in `conftest.py` scales bootstrap iterations and TROP grid sizes in pure Python mode — use `ci_params.bootstrap(n)` and `ci_params.grid(values)` in new tests with `n_bootstrap >= 20`. For SE convergence tests (analytical vs bootstrap comparison), use `ci_params.bootstrap(n, min_n=199)` with a conditional tolerance: `threshold = 0.40 if n_boot < 100 else 0.15`. The `min_n` parameter is capped at 49 in pure Python mode to keep CI fast, so convergence tests use wider tolerances when running with fewer bootstrap iterations.
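As a sketch of this pattern — the `CIParams` class below is a hypothetical stand-in modeling only the capping behavior described above; the real fixture lives in `conftest.py`:

```python
# Hypothetical stand-in for the session-scoped ci_params fixture; only the
# min_n capping behavior described above is modeled here.
class CIParams:
    def __init__(self, pure_python):
        self.cap = 49 if pure_python else None  # min_n capped at 49 in pure Python mode

    def bootstrap(self, n, min_n=20):
        if self.cap is not None:
            return min(max(n, min_n), self.cap)
        return max(n, min_n)

ci_params = CIParams(pure_python=True)
n_boot = ci_params.bootstrap(1000, min_n=199)  # capped to 49 in pure Python mode
# Conditional tolerance for SE convergence tests:
threshold = 0.40 if n_boot < 100 else 0.15
assert n_boot == 49 and threshold == 0.40
```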

**Slow test suites:** `tests/test_trop.py` is very time-consuming. Only run TROP tests when changes could affect the TROP estimator (e.g., `diff_diff/trop.py`, `diff_diff/trop_results.py`, `diff_diff/linalg.py`, `diff_diff/_backend.py`, or `rust/src/trop.rs`). For unrelated changes, exclude with `pytest --ignore=tests/test_trop.py`.

### Test Writing Guidelines

**For fallback/error handling paths:**
97 changes: 97 additions & 0 deletions METHODOLOGY_REVIEW.md
@@ -27,6 +27,7 @@ Each estimator in diff-diff should be periodically reviewed to ensure:
| SunAbraham | `sun_abraham.py` | `fixest::sunab()` | **Complete** | 2026-02-15 |
| SyntheticDiD | `synthetic_did.py` | `synthdid::synthdid_estimate()` | **Complete** | 2026-02-10 |
| TripleDifference | `triple_diff.py` | `triplediff::ddd()` | **Complete** | 2026-02-18 |
| StackedDiD | `stacked_did.py` | `stacked-did-weights` | **Complete** | 2026-02-19 |
| TROP | `trop.py` | (forthcoming) | Not Started | - |
| BaconDecomposition | `bacon.py` | `bacondecomp::bacon()` | Not Started | - |
| HonestDiD | `honest_did.py` | `HonestDiD` package | Not Started | - |
@@ -379,6 +380,102 @@ variables appear to the left of the `|` separator.

---

#### StackedDiD

| Field | Value |
|-------|-------|
| Module | `stacked_did.py` |
| Primary Reference | Wing, Freedman & Hollingsworth (2024), NBER WP 32054 |
| R Reference | `stacked-did-weights` (`create_sub_exp()` + `compute_weights()`) |
| Status | **Complete** |
| Last Review | 2026-02-19 |

**Verified Components:**
- [x] IC1 trimming: `a - kappa_pre >= T_min AND a + kappa_post <= T_max` (matches R reference)
- [x] IC2 trimming: Three clean control modes (not_yet_treated, strict, never_treated)
- [x] Sub-experiment construction: treated + clean controls within `[a - kappa_pre, a + kappa_post]`
- [x] Q-weights aggregate: treated Q=1, control `Q = (sub_treat_n/stack_treat_n) / (sub_control_n/stack_control_n)` per (event_time, sub_exp) — matches R `compute_weights()`
- [x] Q-weights population: `Q_a = (Pop_a^D / Pop^D) / (N_a^C / N^C)` (Table 1, Row 2)
- [x] Q-weights sample_share: `Q_a = ((N_a^D + N_a^C)/(N^D+N^C)) / (N_a^C / N^C)` (Table 1, Row 3)
- [x] WLS via sqrt(w) transformation (numerically equivalent to weighted regression)
- [x] Event study regression: `Y = α_0 + α_1·D_sa + Σ_{h≠-1}[λ_h·1(e=h) + δ_h·D_sa·1(e=h)] + U` (Eq. 3)
- [x] Reference period e=-1-anticipation normalized to zero (omitted from design matrix)
- [x] Delta-method SE for overall ATT: `SE = sqrt(ones' @ sub_vcv @ ones) / K`
- [x] Cluster-robust SEs at unit level (default) and unit×sub-experiment level
- [x] Anticipation parameter: reference period shifts to e=-1-anticipation, post-treatment includes anticipation periods
- [x] Rank deficiency handling (warn/error/silent via `solve_ols()`)
- [x] Never-treated encoding: both `first_treat=0` and `first_treat=inf` handled
- [x] R comparison: ATT matches within machine precision (diff < 2.1e-11)
- [x] R comparison: SE matches within machine precision (diff < 4.0e-10)
- [x] R comparison: Event study effects correlation = 1.000000, max diff < 4.5e-11
- [x] safe_inference() used for all inference fields
- [x] All REGISTRY.md edge cases tested
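
The aggregate Q-weight rule verified above can be sketched with pandas — the column names (`treated`, `event_time`, `sub_exp`) are illustrative, not the library's internal schema:

```python
import pandas as pd

def aggregate_q_weights(df):
    """Aggregate Q-weights per the rule above: treated rows get Q = 1; control
    rows get (sub_treat_n / stack_treat_n) / (sub_control_n / stack_control_n),
    computed per (event_time, sub_exp) cell.  Column names are illustrative."""
    stack_treat = df[df.treated == 1].groupby("event_time").size()
    stack_control = df[df.treated == 0].groupby("event_time").size()
    cell_treat = df[df.treated == 1].groupby(["event_time", "sub_exp"]).size()
    cell_control = df[df.treated == 0].groupby(["event_time", "sub_exp"]).size()

    def q(row):
        if row.treated == 1:
            return 1.0
        key = (row.event_time, row.sub_exp)
        treat_share = cell_treat.get(key, 0) / stack_treat[row.event_time]
        control_share = cell_control[key] / stack_control[row.event_time]
        return treat_share / control_share

    return df.assign(q=df.apply(q, axis=1))

# Two sub-experiments at event time 0: A has 2 treated / 1 control,
# B has 1 treated / 2 controls.
toy = pd.DataFrame({
    "treated":    [1, 1, 0, 1, 0, 0],
    "event_time": [0, 0, 0, 0, 0, 0],
    "sub_exp":    ["A", "A", "A", "B", "B", "B"],
})
weighted = aggregate_q_weights(toy)
# Control in A: (2/3) / (1/3) = 2.0; controls in B: (1/3) / (2/3) = 0.5
```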

**Test Coverage:**
- 72 tests in `tests/test_stacked_did.py` across 11 test classes:
- `TestStackedDiDBasic` (8): fit, event study, group/all raises, simple aggregation, known constant effect, dynamic effects
- `TestTrimming` (5): IC1 window, IC2 no-controls, trimmed groups reported, all-trimmed raises, wider window
- `TestQWeights` (4): treated=1, aggregate formula, sample_share formula, positivity
- `TestCleanControl` (5): not_yet_treated, strict, never_treated, missing never-treated raises
- `TestClustering` (2): unit, unit_subexp
- `TestStackedData` (4): accessible, required columns, event time range
- `TestEdgeCases` (8): single cohort, anticipation, unbalanced panel, NaN inference, never-treated encodings
- `TestSklearnInterface` (4): get_params, set_params, unknown raises, convenience function
- `TestResultsMethods` (7): summary, to_dataframe, is_significant, significance_stars, repr
- `TestValidation` (8): missing columns, invalid params, population required, no treated units
- R benchmark tests via `benchmarks/run_benchmarks.py --estimator stacked`

**R Comparison Results (200 units, 8 periods, kappa_pre=2, kappa_post=2):**
| Metric | Python | R | Diff |
|--------|--------|---|------|
| Overall ATT | 2.277699574579 | 2.2776995746 | 2.1e-11 |
| Overall SE | 0.062045687626 | 0.062045688027 | 4.0e-10 |
| ES e=-2 ATT | 0.044517975379 | 0.044517975379 | <1e-12 |
| ES e=0 ATT | 2.104181683763 | 2.104181683800 | <1e-11 |
| ES e=1 ATT | 2.209990715130 | 2.209990715100 | <1e-11 |
| ES e=2 ATT | 2.518926324845 | 2.518926324800 | <1e-11 |
| Stacked obs | 1600 | 1600 | exact |
| Sub-experiments | 3 | 3 | exact |
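
The delta-method SE from the checklist — the overall ATT is the simple mean of the K post-period event-study coefficients — reduces to a one-liner; the coefficients and diagonal vcv below are made up purely for illustration:

```python
import numpy as np

def overall_att_and_se(post_coefs, post_vcv):
    # Overall ATT = simple mean of the K post-period event-study coefficients;
    # the delta method gives Var(mean) = 1' @ Sigma @ 1 / K^2.
    K = len(post_coefs)
    ones = np.ones(K)
    att = post_coefs.mean()
    se = np.sqrt(ones @ post_vcv @ ones) / K
    return att, se

# Made-up coefficients and a diagonal vcv purely for illustration:
coefs = np.array([2.1, 2.2, 2.5])
vcv = np.diag([0.01, 0.01, 0.01])
att, se = overall_att_and_se(coefs, vcv)
# Diagonal case: se = sqrt(3 * 0.01) / 3
```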

**Corrections Made:**
1. **IC1 lower bound and time window aligned with R reference** (`stacked_did.py`,
`_trim_adoption_events()` and `_build_sub_experiment()`): The paper text specifies
time window `[a - kappa_pre - 1, a + kappa_post]` (including an extra pre-period),
but the R reference implementation by co-author Hollingsworth uses
`[a - kappa_pre, a + kappa_post]`. The extra period had no event-study dummy,
altering the baseline regression. Fixed to match R: removed `-1` from both
IC1 check (`a - kappa_pre >= T_min`) and time window start. Discrepancy documented
in `docs/methodology/papers/wing-2024-review.md` Gaps section.

2. **Q-weight computation: event-time-specific for aggregate weighting** (`stacked_did.py`,
`_compute_q_weights()`): Changed aggregate Q-weights from unit counts per sub-experiment
to observation counts per (event_time, sub_exp), matching R reference `compute_weights()`.
For balanced panels, results are unchanged. For unbalanced panels, weights now adjust for
varying observation density. Population/sample_share retain unit-count formulas (paper notation).

3. **Anticipation parameter: reference period and dummies** (`stacked_did.py`, `fit()`):
Reference period now shifts to `e = -1 - anticipation`. Event-time dummies cover the
full window `[-kappa_pre - anticipation, ..., kappa_post]`. Post-treatment effects include
anticipation periods. Consistent with ImputationDiD, TwoStageDiD, SunAbraham.

4. **Group aggregation removed** (`stacked_did.py`): `aggregate="group"` and `aggregate="all"`
removed. The pooled stacked regression cannot produce cohort-specific effects without
cohort×event-time interactions. Use CallawaySantAnna or ImputationDiD for cohort-level estimates.

5. **n_sub_experiments metadata** (`stacked_did.py`, `fit()`): Now tracks actual built
sub-experiments, not all events in omega_kappa. Warns if any sub-experiments are empty
after data filtering.
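
The R-aligned IC1 rule described in correction 1 can be sketched as a toy check (not library code):

```python
def ic1_keep(a, kappa_pre, kappa_post, t_min, t_max):
    # Keep adoption event `a` only when the full event window
    # [a - kappa_pre, a + kappa_post] fits inside the observed panel
    # [t_min, t_max] -- the R-aligned rule, without the paper's extra -1.
    return a - kappa_pre >= t_min and a + kappa_post <= t_max

# Periods 1..8 with kappa_pre = kappa_post = 2: cohorts 3..6 survive trimming.
kept = [a for a in range(1, 9) if ic1_keep(a, 2, 2, 1, 8)]
```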

**Outstanding Concerns:**
- Population/sample_share Q-weights use paper's unit-count formulas (no R reference to validate)
- Anticipation not validated against R (R reference doesn't test anticipation > 0)

**Deviations from R's stacked-did-weights:**
1. **NaN for invalid inference**: Python returns NaN for t_stat/p_value/conf_int when
SE is non-finite or zero. R would propagate through `fixest::feols()` error handling.

---

### Advanced Estimators

#### SyntheticDiD
130 changes: 129 additions & 1 deletion README.md
@@ -70,7 +70,7 @@ Signif. codes: '***' 0.001, '**' 0.01, '*' 0.05, '.' 0.1
- **Wild cluster bootstrap**: Valid inference with few clusters (<50) using Rademacher, Webb, or Mammen weights
- **Panel data support**: Two-way fixed effects estimator for panel designs
- **Multi-period analysis**: Event-study style DiD with period-specific treatment effects
- **Staggered adoption**: Callaway-Sant'Anna (2021), Sun-Abraham (2021), Borusyak-Jaravel-Spiess (2024) imputation, and Two-Stage DiD (Gardner 2022) estimators for heterogeneous treatment timing
- **Staggered adoption**: Callaway-Sant'Anna (2021), Sun-Abraham (2021), Borusyak-Jaravel-Spiess (2024) imputation, Two-Stage DiD (Gardner 2022), and Stacked DiD (Wing, Freedman & Hollingsworth 2024) estimators for heterogeneous treatment timing
- **Triple Difference (DDD)**: Ortiz-Villavicencio & Sant'Anna (2025) estimators with proper covariate handling
- **Synthetic DiD**: Combined DiD with synthetic control for improved robustness
- **Triply Robust Panel (TROP)**: Factor-adjusted DiD with synthetic weights (Athey et al. 2025)
@@ -974,6 +974,78 @@ TwoStageDiD(

Under homogeneous treatment effects, both estimators are efficient, yielding shorter confidence intervals than Callaway-Sant'Anna or Sun-Abraham.

### Stacked DiD (Wing, Freedman & Hollingsworth 2024)

Stacked DiD addresses TWFE bias in staggered adoption settings by constructing a "clean" comparison dataset for each treatment cohort and stacking them together. Each cohort's sub-experiment compares units treated at that cohort's timing against units that are not yet treated (or never treated) within a symmetric event-study window. This avoids the "bad comparisons" problem in TWFE while retaining a regression-based framework that practitioners familiar with event studies will find intuitive.

```python
from diff_diff import StackedDiD, generate_staggered_data

# Generate sample data
data = generate_staggered_data(n_units=200, n_periods=12,
cohort_periods=[4, 6, 8], seed=42)

# Fit stacked DiD with event study
est = StackedDiD(kappa_pre=2, kappa_post=2)
results = est.fit(data, outcome='outcome', unit='unit',
time='period', first_treat='first_treat',
aggregate='event_study')
results.print_summary()

# Access stacked data for custom analysis
stacked = results.stacked_data

# Convenience function
from diff_diff import stacked_did
results = stacked_did(data, 'outcome', 'unit', 'period', 'first_treat',
kappa_pre=2, kappa_post=2, aggregate='event_study')
```

**Parameters:**

```python
StackedDiD(
kappa_pre=1, # Pre-treatment event-study periods
kappa_post=1, # Post-treatment event-study periods
weighting='aggregate', # 'aggregate', 'population', or 'sample_share'
clean_control='not_yet_treated', # 'not_yet_treated', 'strict', or 'never_treated'
cluster='unit', # 'unit' or 'unit_subexp'
alpha=0.05, # Significance level
anticipation=0, # Anticipation periods
rank_deficient_action='warn', # 'warn', 'error', or 'silent'
)
```

> **Note:** Group aggregation (`aggregate='group'`) is not supported because the pooled
> stacked regression cannot produce cohort-specific effects. Use `CallawaySantAnna` or
> `ImputationDiD` for cohort-level estimates.

**When to use Stacked DiD vs Callaway-Sant'Anna:**

| Aspect | Stacked DiD | Callaway-Sant'Anna |
|--------|-------------|-------------------|
| Approach | Stack cohort sub-experiments, run pooled TWFE | 2x2 DiD aggregation |
| Symmetric windows | Enforced via kappa_pre / kappa_post | Not required |
| Control group | Not-yet-treated (default) or never-treated | Never-treated or not-yet-treated |
| Covariates | Passed to pooled regression | Doubly robust / IPW |
| Intuition | Familiar event-study regression | Nonparametric aggregation |

**Convenience function:**

```python
# One-liner estimation
results = stacked_did(
data,
outcome='outcome',
unit='unit',
time='period',
first_treat='first_treat',
kappa_pre=3,
kappa_post=3,
aggregate='event_study'
)
```

### Triple Difference (DDD)

Triple Difference (DDD) is used when treatment requires satisfying two criteria: belonging to a treated **group** AND being in an eligible **partition**. The `TripleDifference` class implements the methodology from Ortiz-Villavicencio & Sant'Anna (2025), which correctly handles covariate adjustment (unlike naive implementations).
@@ -2203,6 +2275,60 @@ TwoStageDiD(
| `print_summary(alpha)` | Print summary to stdout |
| `to_dataframe(level)` | Convert to DataFrame ('observation', 'event_study', 'group') |

### StackedDiD

```python
StackedDiD(
kappa_pre=1, # Pre-treatment event-study periods
kappa_post=1, # Post-treatment event-study periods
weighting='aggregate', # 'aggregate', 'population', or 'sample_share'
clean_control='not_yet_treated', # 'not_yet_treated', 'strict', or 'never_treated'
cluster='unit', # 'unit' or 'unit_subexp'
alpha=0.05, # Significance level
anticipation=0, # Anticipation periods
rank_deficient_action='warn', # 'warn', 'error', or 'silent'
)
```

**fit() Parameters:**

| Parameter | Type | Description |
|-----------|------|-------------|
| `data` | DataFrame | Panel data |
| `outcome` | str | Outcome variable column name |
| `unit` | str | Unit identifier column |
| `time` | str | Time period column |
| `first_treat` | str | First treatment period column (0 for never-treated) |
| `population` | str, optional | Population column (required if weighting='population') |
| `aggregate` | str | Aggregation: None, `"simple"`, or `"event_study"` |

### StackedDiDResults

**Attributes:**

| Attribute | Description |
|-----------|-------------|
| `overall_att` | Overall average treatment effect on the treated |
| `overall_se` | Standard error |
| `overall_t_stat` | T-statistic |
| `overall_p_value` | P-value for H0: ATT = 0 |
| `overall_conf_int` | Confidence interval |
| `event_study_effects` | Dict of relative time -> effect dict (if `aggregate='event_study'`) |
| `stacked_data` | The stacked dataset used for estimation |
| `n_treated_obs` | Number of treated observations |
| `n_untreated_obs` | Number of untreated (clean control) observations |
| `n_cohorts` | Number of treatment cohorts |
| `kappa_pre` | Pre-treatment window used |
| `kappa_post` | Post-treatment window used |

**Methods:**

| Method | Description |
|--------|-------------|
| `summary(alpha)` | Get formatted summary string |
| `print_summary(alpha)` | Print summary to stdout |
| `to_dataframe(level)` | Convert to DataFrame ('event_study') |

### TripleDifference

@@ -2689,6 +2815,8 @@ The `HonestDiD` module implements sensitivity analysis methods for relaxing the

- **Goodman-Bacon, A. (2021).** "Difference-in-Differences with Variation in Treatment Timing." *Journal of Econometrics*, 225(2), 254-277. [https://doi.org/10.1016/j.jeconom.2021.03.014](https://doi.org/10.1016/j.jeconom.2021.03.014)

- **Wing, C., Freedman, S. M., & Hollingsworth, A. (2024).** "Stacked Difference-in-Differences." *NBER Working Paper* 32054. [https://www.nber.org/papers/w32054](https://www.nber.org/papers/w32054)

### Power Analysis

- **Bloom, H. S. (1995).** "Minimum Detectable Effects: A Simple Way to Report the Statistical Power of Experimental Designs." *Evaluation Review*, 19(5), 547-556. [https://doi.org/10.1177/0193841X9501900504](https://doi.org/10.1177/0193841X9501900504)
13 changes: 3 additions & 10 deletions ROADMAP.md
@@ -10,7 +10,7 @@ For past changes and release history, see [CHANGELOG.md](CHANGELOG.md).

diff-diff v2.4.1 is a **production-ready** DiD library with feature parity with R's `did` + `HonestDiD` + `synthdid` ecosystem for core DiD analysis:

- **Core estimators**: Basic DiD, TWFE, MultiPeriod, Callaway-Sant'Anna, Sun-Abraham, Borusyak-Jaravel-Spiess Imputation, Synthetic DiD, Triple Difference (DDD), TROP, Two-Stage DiD (Gardner 2022)
- **Core estimators**: Basic DiD, TWFE, MultiPeriod, Callaway-Sant'Anna, Sun-Abraham, Borusyak-Jaravel-Spiess Imputation, Synthetic DiD, Triple Difference (DDD), TROP, Two-Stage DiD (Gardner 2022), Stacked DiD (Wing et al. 2024)
- **Valid inference**: Robust SEs, cluster SEs, wild bootstrap, multiplier bootstrap, placebo-based variance
- **Assumption diagnostics**: Parallel trends tests, placebo tests, Goodman-Bacon decomposition
- **Sensitivity analysis**: Honest DiD (Rambachan-Roth), Pre-trends power analysis (Roth 2022)
Expand All @@ -24,16 +24,9 @@ diff-diff v2.4.1 is a **production-ready** DiD library with feature parity with

High-value additions building on our existing foundation.

### Stacked Difference-in-Differences
### ~~Stacked Difference-in-Differences~~ (Implemented in v2.5)

An intuitive approach that explicitly constructs sub-experiments for each treatment cohort, avoiding forbidden comparisons.

- Creates separate datasets per cohort with valid controls only
- Stacks sub-experiments and applies corrective sample weights
- Returns variance-weighted ATT with proper compositional balance
- Conceptually simpler alternative to aggregation-based methods

**Reference**: [Wing, Freedman & Hollingsworth (2024)](https://www.nber.org/papers/w32054). *NBER Working Paper 32054*. Stata: `STACKDID`.
Implemented as `StackedDiD`. See `diff_diff/stacked_did.py`.

### Staggered Triple Difference (DDD)
