Rewrite TripleDifference to match R's triplediff::ddd() by igerber · Pull Request #169 · igerber/diff-diff

igerber · 2026-02-18T21:18:47Z

Summary

Rewrite all 3 estimation methods (DR, IPW, RA) from naive cell-mean approach to three-DiD decomposition matching R's triplediff::ddd() (Ortiz-Villavicencio & Sant'Anna 2025)
Achieve <0.001% relative difference from R for both ATT and SE across all methods, with/without covariates, all 4 DGP types (24 comparisons)
Add 91 methodology verification tests covering hand calculations, pre-computed R comparison, live R comparison, edge cases, and scale validation
Add R and Python benchmark scripts for triple difference
Update METHODOLOGY_REVIEW.md (status: Complete) and REGISTRY.md (reference impl, SE formula, cross-ref table)

Methodology references (required if estimator / math changes)

Method name(s): Triple Difference (DDD) — regression adjustment, IPW, doubly robust
Paper / source link(s): Ortiz-Villavicencio, M., & Sant'Anna, P.H.C. (2025). Better Understanding Triple Differences Estimators. arXiv:2505.09942
R reference implementation: triplediff::ddd() v0.2.1 (CRAN) — official companion by paper authors
Any intentional deviations from the source: Uses panel=FALSE (repeated cross-section) mode only; panel mode not yet implemented

Validation

Tests added/updated:
- tests/test_methodology_triple_diff.py (91 new tests)
- tests/test_triple_diff.py (5 expectations updated)
All 141 tests pass (91 methodology + 50 existing)
R comparison: ATT <0.001% for all 24 comparisons (4 DGPs × 3 methods × 2 cov settings)
R comparison: SE <0.001% for all 24 comparisons

Security / privacy

Confirm no secrets/PII in this PR: Yes

Generated with Claude Code

…SE match) Replace naive cell-mean DDD with three-DiD decomposition from Ortiz-Villavicencio & Sant'Anna (2025), matching the official R companion package triplediff::ddd(). All 3 methods (DR, IPW, RA) now use pairwise DiD comparisons with influence function-based SE, achieving <0.001% relative difference from R across all 4 DGP types, with and without covariates. Key changes: - Decompose DDD into DiD_3 + DiD_2 - DiD_1 (three pairwise comparisons) - SE via combined influence function: std(w3*IF_3 + w2*IF_2 - w1*IF_1)/sqrt(n) - Per-comparison propensity scores and outcome regressions - Add 91 methodology verification tests (hand calc, R comparison, edge cases) - Add R/Python benchmark scripts and pre-computed R results (panel=FALSE) - Update METHODOLOGY_REVIEW.md and REGISTRY.md with verified status Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

github-actions · 2026-02-18T21:25:36Z

Overall assessment: ⚠️ Needs changes

Executive Summary

TripleDifference (RA/IPW/DR) was rewritten to a three‑DiD + IF SE approach; the high‑level structure matches the registry, but three methodology compliance gaps remain.
robust/cluster parameters are now no-ops; SEs are always unclustered IF, which can materially understate uncertainty.
Rank-deficiency handling no longer matches the documented contract (warn/silent), leaving collinearity undetected.
Registry-required warnings for low cell counts are missing.
Documentation still advertises robust/cluster SEs; tests don’t cover the warn path or clustering behavior.

Methodology

Severity: P1 | Impact: robust and cluster are accepted and documented, but SEs are always computed from unclustered influence functions across RA/IPW/DR, so clustering is silently ignored and uncertainty can be understated. | Fix: implement robust/cluster variance of IF contributions (e.g., cluster-robust variance of IF sums) or explicitly deprecate/raise for these params and update docs/registry accordingly. (diff_diff/triple_diff.py:L350-L560, diff_diff/triple_diff.py:L741-L885)
Severity: P1 | Impact: Rank-deficient handling no longer matches the contract/registry: "warn"/"silent" are documented to detect/drop collinearity, but _fit_predict_mu only errors and otherwise silently uses np.linalg.lstsq, so collinear covariates can change estimates without warning (affects RA/DR). | Fix: restore rank-deficiency detection for "warn" (emit warning + drop dependent columns) and "silent" (drop without warning), or reuse the prior solve_ols path with rank_deficient_action. (diff_diff/triple_diff.py:L359-L363, diff_diff/triple_diff.py:L937-L949, docs/methodology/REGISTRY.md:L808-L812)
Severity: P1 | Impact: Registry requires warnings for low cell counts, but validation only errors on empty cells; sparse cells pass silently, risking unstable DDD/IF variance without user warning. | Fix: add a min_cell_count threshold (or warn below a fixed default) for all 8 cells and each {j,4} pairwise subset. (docs/methodology/REGISTRY.md:L776-L779, diff_diff/triple_diff.py:L626-L647)

Code Quality
No findings.

Performance
No findings.

Maintainability
No findings.

Tech Debt
No findings.

Security
No findings.

Documentation/Tests

Severity: P2 | Impact: Module/class docs still claim robust/cluster SEs and “Current Implementation (v1.3)” details that no longer match IF‑only SEs, which will mislead users. | Fix: update docstrings (and any user-facing docs) to reflect IF SEs and the actual inference options. (diff_diff/triple_diff.py:L14-L18, diff_diff/triple_diff.py:L350-L353)
Severity: P3 | Impact: Tests cover rank_deficient_action="error" and "silent" but not "warn", and there are no tests for robust/cluster inference behavior, so regressions here are unguarded. | Fix: add a "warn" test that asserts a warning and add tests for cluster/robust variance once implemented. (tests/test_triple_diff.py:L891-L936)

Path(__file__) resolves to temp dir under pytest-xdist workers. Use diff_diff.__file__ to find repo root instead. Also skip gracefully when R DGP CSV files (gitignored) aren't available. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…ell warning - Make `cluster` param functional with Liang-Zeger cluster-robust SE on influence function; document `robust` as no-op (IF-based SEs inherently heteroskedasticity-robust) - Fix `rank_deficient_action` for "warn"/"silent" by replacing np.linalg.lstsq with solve_ols() in _fit_predict_mu() and R-squared computation - Add low cell count warning (<10 obs) in _validate_data() - Update module/class docstrings to reflect current implementation - Add 5 tests: rank_deficient warn/silent, cluster SE, low cell warning, robust no-op equivalence - Update REGISTRY.md with cluster-robust SE formula and edge cases Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

igerber · 2026-02-18T22:11:37Z

/ai-review

github-actions · 2026-02-18T22:17:42Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 6b84ec766c4d12ded1cfe3f48bb162a3a117d4c5

Overall assessment: ⚠️ Needs changes

Executive Summary

TripleDifference (RA/IPW/DR) is now implemented as the three‑DiD decomposition with IF‑based SEs and aligns with the Registry formulas; the added R-comparison tests are strong.
P1: Cluster‑robust SE divides by (G-1) without guarding G<2, yielding undefined variance for single‑cluster runs.
P1: The Registry lists propensity‑score overlap as an assumption check for IPW/DR, but the code only clips pscores without warning when overlap is weak.
P2: Missing‑value handling for cluster IDs can silently drop IF contributions while reporting a different cluster count.
Add targeted tests for cluster edge cases and overlap diagnostics.

Methodology

P1: Cluster‑robust SE is undefined for fewer than 2 clusters; the current formula divides by (G-1) without a guard. Impact: se becomes inf (and inference NaN) for single‑cluster runs, which is not a valid variance estimate. Fix: explicitly check n_clusters_val < 2 and raise ValueError or set se=np.nan with a warning before the division, consistent with other cluster‑robust paths. Location: diff_diff/triple_diff.py:L899-L910.
P1: Registry lists “Propensity score overlap required for IPW/DR methods” under assumption checks, but the implementation only clips pscores and provides diagnostics; it does not warn or fail when overlap is poor. Impact: users can receive biased IPW/DR estimates without any signal in low‑overlap subsets. Fix: detect when pscore_sub hits trim bounds (or when effective overlap is low per subset) and warn or error, and surface this in results. Locations: docs/methodology/REGISTRY.md:L776-L779, diff_diff/triple_diff.py:L818-L834.

Code Quality

P2: cluster IDs are not validated for missing values; np.unique includes NaN but self._cluster_ids == NaN is always false, so those observations are excluded from cluster sums, while n_clusters uses nunique() (drops NaN). Impact: understated variance and inconsistent reported cluster counts. Fix: treat cluster as required when provided and error on missing values (or drop with warning), and compute n_clusters from the same unique set used for SE. Locations: diff_diff/triple_diff.py:L520-L565, diff_diff/triple_diff.py:L899-L905.

Performance

No issues found.

Maintainability

No issues found.

Tech Debt

No issues found.

Security

No issues found.

Documentation/Tests

P2: No coverage for cluster edge cases (single cluster, missing cluster IDs) and no test for overlap‑diagnostic behavior. Impact: regressions in cluster SE handling or assumption checks may slip. Fix: add tests in tests/test_methodology_triple_diff.py or tests/test_triple_diff.py for cluster with 1 cluster, with NaN IDs, and for overlap warnings if implemented.

Guard cluster-robust SE against G<2 (ValueError), validate cluster IDs for NaN, add propensity score overlap warning for IPW/DR when >5% of observations are trimmed at bounds. Add 4 tests and update REGISTRY.md. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

igerber · 2026-02-18T23:34:43Z

/ai-review

github-actions · 2026-02-18T23:39:03Z

🔁 AI review rerun (requested by @igerber)

Head SHA: af777ffc4ad9a4779cd8e717857ed27bf8bdbb7c

Overall Assessment: ⚠️ Needs changes

Executive Summary

P1: Propensity-score failure fallback still applies PS correction/Hessian, which can misstate IF-based SEs when the PS model fails. diff_diff/triple_diff.py:L830-L857
P2: r_squared computation ignores rank_deficient_action and always uses "silent", which is inconsistent with estimator settings. diff_diff/triple_diff.py:L965-L979
Tests not run (review based on diff + registry/docstrings only).

Methodology

P1 | Impact: If _logistic_regression fails, the code falls back to constant pscore_sub but still computes a Hessian and applies PS correction in the IF. This mixes a non-estimated PS with an “estimated PS” correction and can distort SEs, violating the intended IF-based variance behavior in the Registry when PS estimation fails (especially likely in small pairwise subsets with separation). Concrete fix: When PS estimation fails, set hessian=None and skip PS correction, and emit a warning. Alternatively fit an intercept-only PS model and compute a consistent Hessian for that fallback. diff_diff/triple_diff.py:L830-L857

Code Quality

P2 | Impact: r_squared uses solve_ols(..., rank_deficient_action="silent") regardless of user-specified rank_deficient_action. This is inconsistent with the estimator behavior and could hide rank-deficiency issues in diagnostics. Concrete fix: Use self.rank_deficient_action here, or set r_squared=None if rank-deficient and rank_deficient_action=="error". diff_diff/triple_diff.py:L965-L979

Performance

No findings.

Maintainability

No findings.

Tech Debt

No findings.

Security

No findings.

Documentation/Tests

No findings (new methodology and R-comparison coverage looks thorough; not executed in this review).

…sistency When propensity score estimation fails, skip Hessian computation (set hessian=None) instead of computing a correction for a non-estimated model. Pass user's rank_deficient_action to r_squared solve_ols call instead of hardcoding "silent". Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

igerber · 2026-02-19T01:02:49Z

/ai-review

github-actions · 2026-02-19T01:08:40Z

🔁 AI review rerun (requested by @igerber)

Head SHA: ba50dd179e14361931ec58c632e65f233353494a

Overall assessment: ⚠️ Needs changes

Executive summary

Affected methods: TripleDifference RA/IPW/DR are fully rewritten to three‑DiD decomposition with IF-based SEs; registry and methodology review are updated and extensive tests/benchmarks added.
Methodology cross‑check against docs/methodology/REGISTRY.md and in‑code docstrings shows alignment on decomposition, IF SE, overlap warnings, and clustering.
Issue: non‑finite influence-function values are silently zeroed, which can mask invalid inference and violate the registry’s NaN‑propagation rule.
Minor doc mismatch: the convenience wrapper still claims robust toggles SEs despite being a no‑op now.

Methodology

P1: Impact: inf_j non‑finite values are replaced with 0.0, which can yield finite SE/t‑stats when the IF is undefined, violating the registry’s “NaN inference for undefined statistics” edge‑case guarantee. Concrete fix: propagate non‑finite IFs (e.g., if any ~np.isfinite(inf_j), set se=np.nan and let safe_inference produce NaNs, or raise/warn and skip inference). Location: diff_diff/triple_diff.py:L903-L912.
Note: Other registry requirements (3‑DiD decomposition, IF SE formula, overlap warning, cluster SE) appear consistent.

Code Quality
No findings.

Performance
No findings.

Maintainability
No findings.

Tech Debt
No findings.

Security
No findings.

Documentation/Tests

P3: Impact: triple_difference() docstring still says robust controls robust SEs, but it is now a no‑op per class docs/registry; this can mislead users. Concrete fix: update the wrapper docstring to match the class parameter note (“IF‑based SEs are inherently robust; robust retained for API compatibility”). Location: diff_diff/triple_diff.py:L1535-L1536.

Suggestions (optional)

Update IF NaN handling as above and add a small test that forces non‑finite IF (e.g., by monkeypatching _compute_did_rc_* to return NaN) to verify NaN propagation through inference.

…ng fix Non-finite influence function values (from extreme propensity scores or near-singular design) now warn and set SE to NaN instead of silently zeroing. Updated triple_difference() docstring to document robust as no-op, matching class docstring. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

igerber · 2026-02-19T02:11:04Z

/ai-review

github-actions · 2026-02-19T02:16:39Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 5d047482afcb65ed7261bed90fedf00109367b76

Overall assessment: ✅ Looks good

Executive Summary

Affected methods: TripleDifference RA/IPW/DR; the rewrite to three‑DiD decomposition with IF‑based SEs aligns with docs/methodology/REGISTRY.md and module docstrings.
Assumption/edge handling now matches the registry (8‑cell validation, low‑cell warnings, overlap warnings, NaN‑safe inference, cluster SE).
Extensive methodology and R‑comparison tests were added; I did not run them in this review.

Methodology

No issues found; implementation matches the Registry formulas and stated assumptions for TripleDifference (RA/IPW/DR).

Code Quality

Severity P2 | Impact: If cluster is set but the column is missing, data[self.cluster] raises a KeyError before validation, which is inconsistent with other input errors. | Concrete fix: Add cluster to required_cols in _validate_data when self.cluster is provided, or explicitly guard data[self.cluster] before access. diff_diff/triple_diff.py:520-569

Performance

No issues found.

Maintainability

No issues found.

Tech Debt

No issues found.

Security

No issues found.

Documentation/Tests

Severity P3 | Impact: Test status unknown in this review. | Concrete fix: Run the new and existing tests (e.g., pytest tests/test_methodology_triple_diff.py tests/test_triple_diff.py) to confirm.

…erence Add self.cluster to required_cols in _validate_data() so a missing cluster column raises a consistent ValueError instead of a raw pandas KeyError. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

igerber · 2026-02-19T13:15:04Z

/ai-review

github-actions · 2026-02-19T13:21:54Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 1614459aee62b6bc74c2e078bfc6bfca9f668563

Overall assessment: ⚠️ Needs changes

Executive summary

Methodology alignment with docs/methodology/REGISTRY.md looks consistent for the three‑DiD decomposition and IF‑based SEs; no mismatches found in code.
P1: The new R‑comparison tests silently skip because required DGP CSV fixtures are missing, so the claimed R parity isn’t actually enforced.
P3: Performance regression thresholds were loosened for CallawaySantAnna, which can mask unrelated regressions.

Methodology
No issues found.

Code Quality
No issues found.

Performance

P3 | Impact: Performance regression guardrails are weakened for CallawaySantAnna, which can hide regressions unrelated to this PR. | Fix: Either keep prior thresholds, justify with measured CI variance, or add a TripleDifference‑specific perf test and leave CSA thresholds intact. Location: tests/test_se_accuracy.py:L404.

Maintainability
No issues found.

Tech Debt
No issues found.

Security
No issues found.

Documentation/Tests

P1 | Impact: R‑comparison tests intended to validate parity with triplediff::ddd() are skipped because the DGP CSV fixtures are absent; this removes a key methodology verification step. | Fix: Add benchmarks/data/synthetic/ddd_r_dgp{1..4}.csv fixtures, or generate the DGP data deterministically in the test when files are missing. Location: tests/test_methodology_triple_diff.py:L125. Missing files: benchmarks/data/synthetic/ddd_r_dgp1.csv, benchmarks/data/synthetic/ddd_r_dgp2.csv, benchmarks/data/synthetic/ddd_r_dgp3.csv, benchmarks/data/synthetic/ddd_r_dgp4.csv.

Tests not run (review only).

Suggested next steps

Add or generate the missing R DGP CSV fixtures so R parity tests execute.
Decide whether to keep the relaxed perf thresholds or replace them with a TripleDifference‑specific benchmark.

The detailed section was updated in PR #169 but the summary table at the top of METHODOLOGY_REVIEW.md was missed. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

igerber and others added 2 commits February 18, 2026 16:38

Address PR #169 review round 5: validate cluster column in TripleDiff…

1614459

…erence Add self.cluster to required_cols in _validate_data() so a missing cluster column raises a consistent ValueError instead of a raw pandas KeyError. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

igerber merged commit 080ac0b into main Feb 19, 2026
10 checks passed

igerber deleted the method-review-triple-d branch February 19, 2026 14:11

This was referenced Feb 19, 2026

Update TripleDifference summary table to reflect completed review #170

Merged

Bump version to 2.4.3 #171

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Comments

Rewrite TripleDifference to match R's triplediff::ddd()#169

Rewrite TripleDifference to match R's triplediff::ddd()#169
igerber merged 7 commits intomainfrom
method-review-triple-d

igerber commented Feb 18, 2026

Uh oh!

github-actions bot commented Feb 18, 2026

Uh oh!

igerber commented Feb 18, 2026

Uh oh!

github-actions bot commented Feb 18, 2026

Uh oh!

igerber commented Feb 18, 2026

Uh oh!

github-actions bot commented Feb 18, 2026

Uh oh!

igerber commented Feb 19, 2026

Uh oh!

github-actions bot commented Feb 19, 2026

Uh oh!

igerber commented Feb 19, 2026

Uh oh!

github-actions bot commented Feb 19, 2026

Uh oh!

igerber commented Feb 19, 2026

Uh oh!

github-actions bot commented Feb 19, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Comments

Conversation

igerber commented Feb 18, 2026

Summary

Methodology references (required if estimator / math changes)

Validation

Security / privacy

Uh oh!

github-actions bot commented Feb 18, 2026

Uh oh!

igerber commented Feb 18, 2026

Uh oh!

github-actions bot commented Feb 18, 2026

Uh oh!

igerber commented Feb 18, 2026

Uh oh!

github-actions bot commented Feb 18, 2026

Uh oh!

igerber commented Feb 19, 2026

Uh oh!

github-actions bot commented Feb 19, 2026

Uh oh!

igerber commented Feb 19, 2026

Uh oh!

github-actions bot commented Feb 19, 2026

Uh oh!

igerber commented Feb 19, 2026

Uh oh!

github-actions bot commented Feb 19, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant