Skip to content

Comments

Rewrite TripleDifference to match R's triplediff::ddd()#169

Merged
igerber merged 7 commits intomainfrom
method-review-triple-d
Feb 19, 2026
Merged

Rewrite TripleDifference to match R's triplediff::ddd()#169
igerber merged 7 commits intomainfrom
method-review-triple-d

Conversation

@igerber
Copy link
Owner

@igerber igerber commented Feb 18, 2026

Summary

  • Rewrite all 3 estimation methods (DR, IPW, RA) from naive cell-mean approach to three-DiD decomposition matching R's triplediff::ddd() (Ortiz-Villavicencio & Sant'Anna 2025)
  • Achieve <0.001% relative difference from R for both ATT and SE across all methods, with/without covariates, all 4 DGP types (24 comparisons)
  • Add 91 methodology verification tests covering hand calculations, pre-computed R comparison, live R comparison, edge cases, and scale validation
  • Add R and Python benchmark scripts for triple difference
  • Update METHODOLOGY_REVIEW.md (status: Complete) and REGISTRY.md (reference impl, SE formula, cross-ref table)

Methodology references (required if estimator / math changes)

  • Method name(s): Triple Difference (DDD) — regression adjustment, IPW, doubly robust
  • Paper / source link(s): Ortiz-Villavicencio, M., & Sant'Anna, P.H.C. (2025). Better Understanding Triple Differences Estimators. arXiv:2505.09942
  • R reference implementation: triplediff::ddd() v0.2.1 (CRAN) — official companion by paper authors
  • Any intentional deviations from the source: Uses panel=FALSE (repeated cross-section) mode only; panel mode not yet implemented

Validation

  • Tests added/updated:
    • tests/test_methodology_triple_diff.py (91 new tests)
    • tests/test_triple_diff.py (5 expectations updated)
  • All 141 tests pass (91 methodology + 50 existing)
  • R comparison: ATT <0.001% for all 24 comparisons (4 DGPs × 3 methods × 2 cov settings)
  • R comparison: SE <0.001% for all 24 comparisons

Security / privacy

  • Confirm no secrets/PII in this PR: Yes

Generated with Claude Code

…SE match)

Replace naive cell-mean DDD with three-DiD decomposition from
Ortiz-Villavicencio & Sant'Anna (2025), matching the official R companion
package triplediff::ddd(). All 3 methods (DR, IPW, RA) now use pairwise DiD
comparisons with influence function-based SE, achieving <0.001% relative
difference from R across all 4 DGP types, with and without covariates.

Key changes:
- Decompose DDD into DiD_3 + DiD_2 - DiD_1 (three pairwise comparisons)
- SE via combined influence function: std(w3*IF_3 + w2*IF_2 - w1*IF_1)/sqrt(n)
- Per-comparison propensity scores and outcome regressions
- Add 91 methodology verification tests (hand calc, R comparison, edge cases)
- Add R/Python benchmark scripts and pre-computed R results (panel=FALSE)
- Update METHODOLOGY_REVIEW.md and REGISTRY.md with verified status

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions
Copy link

Overall assessment: ⚠️ Needs changes

Executive Summary

  • TripleDifference (RA/IPW/DR) was rewritten to a three‑DiD + IF SE approach; the high‑level structure matches the registry, but three methodology compliance gaps remain.
  • robust/cluster parameters are now no-ops; SEs are always unclustered IF, which can materially understate uncertainty.
  • Rank-deficiency handling no longer matches the documented contract (warn/silent), leaving collinearity undetected.
  • Registry-required warnings for low cell counts are missing.
  • Documentation still advertises robust/cluster SEs; tests don’t cover the warn path or clustering behavior.

Methodology

  • Severity: P1 | Impact: robust and cluster are accepted and documented, but SEs are always computed from unclustered influence functions across RA/IPW/DR, so clustering is silently ignored and uncertainty can be understated. | Fix: implement robust/cluster variance of IF contributions (e.g., cluster-robust variance of IF sums) or explicitly deprecate/raise for these params and update docs/registry accordingly. (diff_diff/triple_diff.py:L350-L560, diff_diff/triple_diff.py:L741-L885)
  • Severity: P1 | Impact: Rank-deficient handling no longer matches the contract/registry: "warn"/"silent" are documented to detect/drop collinearity, but _fit_predict_mu only errors and otherwise silently uses np.linalg.lstsq, so collinear covariates can change estimates without warning (affects RA/DR). | Fix: restore rank-deficiency detection for "warn" (emit warning + drop dependent columns) and "silent" (drop without warning), or reuse the prior solve_ols path with rank_deficient_action. (diff_diff/triple_diff.py:L359-L363, diff_diff/triple_diff.py:L937-L949, docs/methodology/REGISTRY.md:L808-L812)
  • Severity: P1 | Impact: Registry requires warnings for low cell counts, but validation only errors on empty cells; sparse cells pass silently, risking unstable DDD/IF variance without user warning. | Fix: add a min_cell_count threshold (or warn below a fixed default) for all 8 cells and each {j,4} pairwise subset. (docs/methodology/REGISTRY.md:L776-L779, diff_diff/triple_diff.py:L626-L647)

Code Quality
No findings.

Performance
No findings.

Maintainability
No findings.

Tech Debt
No findings.

Security
No findings.

Documentation/Tests

  • Severity: P2 | Impact: Module/class docs still claim robust/cluster SEs and “Current Implementation (v1.3)” details that no longer match IF‑only SEs, which will mislead users. | Fix: update docstrings (and any user-facing docs) to reflect IF SEs and the actual inference options. (diff_diff/triple_diff.py:L14-L18, diff_diff/triple_diff.py:L350-L353)
  • Severity: P3 | Impact: Tests cover rank_deficient_action="error" and "silent" but not "warn", and there are no tests for robust/cluster inference behavior, so regressions here are unguarded. | Fix: add a "warn" test that asserts a warning and add tests for cluster/robust variance once implemented. (tests/test_triple_diff.py:L891-L936)

igerber and others added 2 commits February 18, 2026 16:38
Path(__file__) resolves to temp dir under pytest-xdist workers.
Use diff_diff.__file__ to find repo root instead. Also skip
gracefully when R DGP CSV files (gitignored) aren't available.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ell warning

- Make `cluster` param functional with Liang-Zeger cluster-robust SE on
  influence function; document `robust` as no-op (IF-based SEs inherently
  heteroskedasticity-robust)
- Fix `rank_deficient_action` for "warn"/"silent" by replacing np.linalg.lstsq
  with solve_ols() in _fit_predict_mu() and R-squared computation
- Add low cell count warning (<10 obs) in _validate_data()
- Update module/class docstrings to reflect current implementation
- Add 5 tests: rank_deficient warn/silent, cluster SE, low cell warning,
  robust no-op equivalence
- Update REGISTRY.md with cluster-robust SE formula and edge cases

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Owner Author

igerber commented Feb 18, 2026

/ai-review

@github-actions
Copy link

🔁 AI review rerun (requested by @igerber)

Head SHA: 6b84ec766c4d12ded1cfe3f48bb162a3a117d4c5


Overall assessment: ⚠️ Needs changes

Executive Summary

  • TripleDifference (RA/IPW/DR) is now implemented as the three‑DiD decomposition with IF‑based SEs and aligns with the Registry formulas; the added R-comparison tests are strong.
  • P1: Cluster‑robust SE divides by (G-1) without guarding G<2, yielding undefined variance for single‑cluster runs.
  • P1: The Registry lists propensity‑score overlap as an assumption check for IPW/DR, but the code only clips pscores without warning when overlap is weak.
  • P2: Missing‑value handling for cluster IDs can silently drop IF contributions while reporting a different cluster count.
  • Add targeted tests for cluster edge cases and overlap diagnostics.

Methodology

  • P1: Cluster‑robust SE is undefined for fewer than 2 clusters; the current formula divides by (G-1) without a guard. Impact: se becomes inf (and inference NaN) for single‑cluster runs, which is not a valid variance estimate. Fix: explicitly check n_clusters_val < 2 and raise ValueError or set se=np.nan with a warning before the division, consistent with other cluster‑robust paths. Location: diff_diff/triple_diff.py:L899-L910.
  • P1: Registry lists “Propensity score overlap required for IPW/DR methods” under assumption checks, but the implementation only clips pscores and provides diagnostics; it does not warn or fail when overlap is poor. Impact: users can receive biased IPW/DR estimates without any signal in low‑overlap subsets. Fix: detect when pscore_sub hits trim bounds (or when effective overlap is low per subset) and warn or error, and surface this in results. Locations: docs/methodology/REGISTRY.md:L776-L779, diff_diff/triple_diff.py:L818-L834.

Code Quality

  • P2: cluster IDs are not validated for missing values; np.unique includes NaN but self._cluster_ids == NaN is always false, so those observations are excluded from cluster sums, while n_clusters uses nunique() (drops NaN). Impact: understated variance and inconsistent reported cluster counts. Fix: treat cluster as required when provided and error on missing values (or drop with warning), and compute n_clusters from the same unique set used for SE. Locations: diff_diff/triple_diff.py:L520-L565, diff_diff/triple_diff.py:L899-L905.

Performance

  • No issues found.

Maintainability

  • No issues found.

Tech Debt

  • No issues found.

Security

  • No issues found.

Documentation/Tests

  • P2: No coverage for cluster edge cases (single cluster, missing cluster IDs) and no test for overlap‑diagnostic behavior. Impact: regressions in cluster SE handling or assumption checks may slip. Fix: add tests in tests/test_methodology_triple_diff.py or tests/test_triple_diff.py for cluster with 1 cluster, with NaN IDs, and for overlap warnings if implemented.

Guard cluster-robust SE against G<2 (ValueError), validate cluster IDs
for NaN, add propensity score overlap warning for IPW/DR when >5% of
observations are trimmed at bounds. Add 4 tests and update REGISTRY.md.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Owner Author

igerber commented Feb 18, 2026

/ai-review

@github-actions
Copy link

🔁 AI review rerun (requested by @igerber)

Head SHA: af777ffc4ad9a4779cd8e717857ed27bf8bdbb7c


Overall Assessment: ⚠️ Needs changes

Executive Summary

  • P1: Propensity-score failure fallback still applies PS correction/Hessian, which can misstate IF-based SEs when the PS model fails. diff_diff/triple_diff.py:L830-L857
  • P2: r_squared computation ignores rank_deficient_action and always uses "silent", which is inconsistent with estimator settings. diff_diff/triple_diff.py:L965-L979
  • Tests not run (review based on diff + registry/docstrings only).

Methodology

  • P1 | Impact: If _logistic_regression fails, the code falls back to constant pscore_sub but still computes a Hessian and applies PS correction in the IF. This mixes a non-estimated PS with an “estimated PS” correction and can distort SEs, violating the intended IF-based variance behavior in the Registry when PS estimation fails (especially likely in small pairwise subsets with separation). Concrete fix: When PS estimation fails, set hessian=None and skip PS correction, and emit a warning. Alternatively fit an intercept-only PS model and compute a consistent Hessian for that fallback. diff_diff/triple_diff.py:L830-L857

Code Quality

  • P2 | Impact: r_squared uses solve_ols(..., rank_deficient_action="silent") regardless of user-specified rank_deficient_action. This is inconsistent with the estimator behavior and could hide rank-deficiency issues in diagnostics. Concrete fix: Use self.rank_deficient_action here, or set r_squared=None if rank-deficient and rank_deficient_action=="error". diff_diff/triple_diff.py:L965-L979

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • No findings.

Security

  • No findings.

Documentation/Tests

  • No findings (new methodology and R-comparison coverage looks thorough; not executed in this review).

…sistency

When propensity score estimation fails, skip Hessian computation (set
hessian=None) instead of computing a correction for a non-estimated
model. Pass user's rank_deficient_action to r_squared solve_ols call
instead of hardcoding "silent".

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Owner Author

igerber commented Feb 19, 2026

/ai-review

@github-actions
Copy link

🔁 AI review rerun (requested by @igerber)

Head SHA: ba50dd179e14361931ec58c632e65f233353494a


Overall assessment: ⚠️ Needs changes

Executive summary

  • Affected methods: TripleDifference RA/IPW/DR are fully rewritten to three‑DiD decomposition with IF-based SEs; registry and methodology review are updated and extensive tests/benchmarks added.
  • Methodology cross‑check against docs/methodology/REGISTRY.md and in‑code docstrings shows alignment on decomposition, IF SE, overlap warnings, and clustering.
  • Issue: non‑finite influence-function values are silently zeroed, which can mask invalid inference and violate the registry’s NaN‑propagation rule.
  • Minor doc mismatch: the convenience wrapper still claims robust toggles SEs despite being a no‑op now.

Methodology

  • P1: Impact: inf_j non‑finite values are replaced with 0.0, which can yield finite SE/t‑stats when the IF is undefined, violating the registry’s “NaN inference for undefined statistics” edge‑case guarantee. Concrete fix: propagate non‑finite IFs (e.g., if any ~np.isfinite(inf_j), set se=np.nan and let safe_inference produce NaNs, or raise/warn and skip inference). Location: diff_diff/triple_diff.py:L903-L912.
  • Note: Other registry requirements (3‑DiD decomposition, IF SE formula, overlap warning, cluster SE) appear consistent.

Code Quality
No findings.

Performance
No findings.

Maintainability
No findings.

Tech Debt
No findings.

Security
No findings.

Documentation/Tests

  • P3: Impact: triple_difference() docstring still says robust controls robust SEs, but it is now a no‑op per class docs/registry; this can mislead users. Concrete fix: update the wrapper docstring to match the class parameter note (“IF‑based SEs are inherently robust; robust retained for API compatibility”). Location: diff_diff/triple_diff.py:L1535-L1536.

Suggestions (optional)

  1. Update IF NaN handling as above and add a small test that forces non‑finite IF (e.g., by monkeypatching _compute_did_rc_* to return NaN) to verify NaN propagation through inference.

…ng fix

Non-finite influence function values (from extreme propensity scores or
near-singular design) now warn and set SE to NaN instead of silently
zeroing. Updated triple_difference() docstring to document robust as
no-op, matching class docstring.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Owner Author

igerber commented Feb 19, 2026

/ai-review

@github-actions
Copy link

🔁 AI review rerun (requested by @igerber)

Head SHA: 5d047482afcb65ed7261bed90fedf00109367b76


Overall assessment: ✅ Looks good

Executive Summary

  • Affected methods: TripleDifference RA/IPW/DR; the rewrite to three‑DiD decomposition with IF‑based SEs aligns with docs/methodology/REGISTRY.md and module docstrings.
  • Assumption/edge handling now matches the registry (8‑cell validation, low‑cell warnings, overlap warnings, NaN‑safe inference, cluster SE).
  • Extensive methodology and R‑comparison tests were added; I did not run them in this review.

Methodology

  • No issues found; implementation matches the Registry formulas and stated assumptions for TripleDifference (RA/IPW/DR).

Code Quality

  • Severity P2 | Impact: If cluster is set but the column is missing, data[self.cluster] raises a KeyError before validation, which is inconsistent with other input errors. | Concrete fix: Add cluster to required_cols in _validate_data when self.cluster is provided, or explicitly guard data[self.cluster] before access. diff_diff/triple_diff.py:520-569

Performance

  • No issues found.

Maintainability

  • No issues found.

Tech Debt

  • No issues found.

Security

  • No issues found.

Documentation/Tests

  • Severity P3 | Impact: Test status unknown in this review. | Concrete fix: Run the new and existing tests (e.g., pytest tests/test_methodology_triple_diff.py tests/test_triple_diff.py) to confirm.

…erence

Add self.cluster to required_cols in _validate_data() so a missing cluster
column raises a consistent ValueError instead of a raw pandas KeyError.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@igerber
Copy link
Owner Author

igerber commented Feb 19, 2026

/ai-review

@github-actions
Copy link

🔁 AI review rerun (requested by @igerber)

Head SHA: 1614459aee62b6bc74c2e078bfc6bfca9f668563


Overall assessment: ⚠️ Needs changes

Executive summary

  • Methodology alignment with docs/methodology/REGISTRY.md looks consistent for the three‑DiD decomposition and IF‑based SEs; no mismatches found in code.
  • P1: The new R‑comparison tests silently skip because required DGP CSV fixtures are missing, so the claimed R parity isn’t actually enforced.
  • P3: Performance regression thresholds were loosened for CallawaySantAnna, which can mask unrelated regressions.

Methodology
No issues found.

Code Quality
No issues found.

Performance

  • P3 | Impact: Performance regression guardrails are weakened for CallawaySantAnna, which can hide regressions unrelated to this PR. | Fix: Either keep prior thresholds, justify with measured CI variance, or add a TripleDifference‑specific perf test and leave CSA thresholds intact. Location: tests/test_se_accuracy.py:L404.

Maintainability
No issues found.

Tech Debt
No issues found.

Security
No issues found.

Documentation/Tests

  • P1 | Impact: R‑comparison tests intended to validate parity with triplediff::ddd() are skipped because the DGP CSV fixtures are absent; this removes a key methodology verification step. | Fix: Add benchmarks/data/synthetic/ddd_r_dgp{1..4}.csv fixtures, or generate the DGP data deterministically in the test when files are missing. Location: tests/test_methodology_triple_diff.py:L125. Missing files: benchmarks/data/synthetic/ddd_r_dgp1.csv, benchmarks/data/synthetic/ddd_r_dgp2.csv, benchmarks/data/synthetic/ddd_r_dgp3.csv, benchmarks/data/synthetic/ddd_r_dgp4.csv.

Tests not run (review only).

Suggested next steps

  1. Add or generate the missing R DGP CSV fixtures so R parity tests execute.
  2. Decide whether to keep the relaxed perf thresholds or replace them with a TripleDifference‑specific benchmark.

@igerber igerber merged commit 080ac0b into main Feb 19, 2026
10 checks passed
@igerber igerber deleted the method-review-triple-d branch February 19, 2026 14:11
igerber added a commit that referenced this pull request Feb 19, 2026
The detailed section was updated in PR #169 but the summary table at
the top of METHODOLOGY_REVIEW.md was missed.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant