From 5b669e718ba9cdec67a9c80d923ace8e2df6b460 Mon Sep 17 00:00:00 2001 From: igerber Date: Thu, 19 Feb 2026 18:58:21 -0500 Subject: [PATCH] Trim CLAUDE.md from 792 to 160 lines for better instruction-following MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Extract development checklists to .claude/commands/dev-checklists.md (accessible via /dev-checklists) and documentation requirements + test guidelines to CONTRIBUTING.md. Remove module structure listing, performance architecture, benchmark docs, and test structure listing — all discoverable from the codebase. Update cross-references in review-plan.md, docs-check.md, and check-plan-review.sh. Co-Authored-By: Claude Opus 4.6 --- .claude/commands/dev-checklists.md | 167 +++++++ .claude/commands/docs-check.md | 2 +- .claude/commands/review-plan.md | 15 +- .claude/hooks/check-plan-review.sh | 4 +- CLAUDE.md | 742 +++-------------------------- CONTRIBUTING.md | 77 +++ 6 files changed, 310 insertions(+), 697 deletions(-) create mode 100644 .claude/commands/dev-checklists.md create mode 100644 CONTRIBUTING.md diff --git a/.claude/commands/dev-checklists.md b/.claude/commands/dev-checklists.md new file mode 100644 index 0000000..342118a --- /dev/null +++ b/.claude/commands/dev-checklists.md @@ -0,0 +1,167 @@ +--- +description: Development checklists for code changes (params, methodology, warnings, reviews, bugs) +argument-hint: "[checklist-name]" +--- + +# Development Checklists + +## Adding a New Parameter to Estimators + +When adding a new `__init__` parameter that should be available across estimators: + +1. **Implementation** (for each affected estimator): + - [ ] Add to `__init__` signature with default value + - [ ] Store as `self.param_name` + - [ ] Add to `get_params()` return dict + - [ ] Handle in `set_params()` (usually automatic via `hasattr`) + +2. **Consistency** - apply to all applicable estimators per the **Estimator inheritance** map in CLAUDE.md + +3. **Testing**: + - [ ] Test `get_params()` includes new param + - [ ] Test parameter affects estimator behavior + - [ ] Test with non-default value + +4. **Downstream tracing**: + - [ ] Before implementing: `grep -rn "self\." diff_diff/.py` to find ALL downstream paths + - [ ] Parameter handled in ALL aggregation methods (simple, event_study, group) + - [ ] Parameter handled in bootstrap inference paths + +5. **Documentation**: + - [ ] Update docstring in all affected classes + - [ ] Update CLAUDE.md if it's a key design pattern + +## Implementing Methodology-Critical Code + +When implementing or modifying code that affects statistical methodology (estimators, SE calculation, inference, edge case handling): + +1. **Before coding - consult the Methodology Registry**: + - [ ] Read the relevant estimator section in `docs/methodology/REGISTRY.md` + - [ ] Identify the reference implementation(s) listed + - [ ] Note the edge case handling requirements + +2. **During implementation**: + - [ ] Follow the documented equations and formulas + - [ ] Match reference implementation behavior for standard cases + - [ ] For edge cases: either match reference OR document deviation + +3. **When deviating from reference implementations**: + - [ ] Add a **Note** in the Methodology Registry explaining the deviation + - [ ] Include rationale (e.g., "defensive enhancement", "R errors here") + - [ ] Ensure the deviation is an improvement, not a bug + +4. 
**Testing methodology-aligned behavior**: + - [ ] Test that edge cases produce documented behavior (NaN, warning, etc.) + - [ ] Assert warnings are raised (not just captured) + - [ ] Assert the warned-about behavior actually occurred + - [ ] For NaN results: assert `np.isnan()`, don't just check "no exception" + - [ ] All inference fields computed via `safe_inference()` (not inline) + +## Adding Warning/Error/Fallback Handling + +When adding code that emits warnings or handles errors: + +1. **Consult Methodology Registry first**: + - [ ] Check if behavior is documented in edge cases section + - [ ] If not documented, add it before implementing + +2. **Verify behavior matches message**: + - [ ] Manually trace the code path after warning/error + - [ ] Confirm the stated behavior actually occurs + +3. **Write behavioral tests**: + - [ ] Don't just test "no exception raised" + - [ ] Assert the expected outcome occurred + - [ ] For fallbacks: verify fallback behavior was applied + - [ ] Example: If warning says "setting NaN", assert `np.any(np.isnan(result))` + +4. **Protect arithmetic operations**: + - [ ] Wrap ALL related operations in `np.errstate()`, not just the final one + - [ ] Include division, matrix multiplication, and any operation that can overflow/underflow + +## Reviewing New Features or Code Paths + +When reviewing PRs that add new features, modes, or code paths (learned from PR #97 analysis): + +1. **Edge Case Coverage**: + - [ ] Empty result sets (no matching data for a filter condition) + - [ ] NaN/Inf propagation through ALL inference fields (SE, t-stat, p-value, CI) + - [ ] Parameter interactions (e.g., new param x existing aggregation methods) + - [ ] Control/comparison group composition for all code paths + +2. **Documentation Completeness**: + - [ ] All new parameters have docstrings with type, default, and description + - [ ] Methodology docs match implementation behavior (equations, edge cases) + - [ ] Edge cases documented in `docs/methodology/REGISTRY.md` + +3. **Logic Audit for New Code Paths**: + - [ ] When adding new modes (like `base_period="varying"`), trace ALL downstream effects + - [ ] Check aggregation methods handle the new mode correctly + - [ ] Check bootstrap/inference methods handle the new mode correctly + - [ ] Explicitly test control group composition in new code paths + +4. **Pattern Consistency**: + - [ ] Search for similar patterns in codebase (e.g., `t_stat = x / se if se > 0 else ...`) + - [ ] Ensure new code follows established patterns or updates ALL instances + - [ ] If fixing a pattern, grep for ALL occurrences first: + ```bash + grep -n "if.*se.*> 0.*else" diff_diff/*.py + ``` + +## Fixing Bugs Across Multiple Locations + +When a bug fix involves a pattern that appears in multiple places (learned from PR #97 analysis): + +1. **Find All Instances First**: + - [ ] Use grep/search to find ALL occurrences of the pattern before fixing + - [ ] Document the locations found (file:line) + - [ ] Example: `t_stat = effect / se if se > 0 else 0.0` appeared in 7 locations + +2. **Fix Comprehensively in One Round**: + - [ ] Fix ALL instances in the same PR/commit + - [ ] Create a test that covers all locations + - [ ] Don't fix incrementally across multiple review rounds + +3. **Regression Test the Fix**: + - [ ] Verify fix doesn't break other code paths + - [ ] For early-return fixes: ensure downstream code still runs when needed + - [ ] Example: Bootstrap early return must still compute per-effect SEs + +4. 
**Common Patterns to Watch For**: + - `if se > 0 else 0.0` -> should be `else np.nan` for undefined statistics + - `if len(data) > 0 else return` -> check what downstream code expects + - `mask = (condition)` -> verify mask logic for all parameter combinations + +## Pre-Merge Review Checklist + +Final checklist before approving a PR: + +1. **Behavioral Completeness**: + - [ ] Happy path tested + - [ ] Edge cases tested (empty data, NaN inputs, boundary conditions) + - [ ] Error/warning paths tested with behavioral assertions + +2. **Inference Field Consistency**: + - [ ] If one inference field (SE, t-stat, p-value) can be NaN, all related fields handle it + - [ ] Aggregation methods propagate NaN correctly + - [ ] Bootstrap methods handle NaN in base estimates + +3. **Documentation Sync**: + - [ ] Docstrings updated for all changed signatures + - [ ] README updated if user-facing behavior changes + - [ ] REGISTRY.md updated if methodology edge cases change + +## Quick Reference: Common Patterns to Check + +Before submitting methodology changes, verify these patterns: + +```bash +# Find potential NaN handling issues (should use np.nan, not 0.0) +grep -n "if.*se.*>.*0.*else 0" diff_diff/*.py + +# Find all t_stat calculations to ensure consistency +grep -n "t_stat.*=" diff_diff/*.py + +# Find all inference field assignments +grep -n "\(se\|t_stat\|p_value\|ci_lower\|ci_upper\).*=" diff_diff/*.py | head -30 +``` diff --git a/.claude/commands/docs-check.md b/.claude/commands/docs-check.md index edd792c..c3f58e0 100644 --- a/.claude/commands/docs-check.md +++ b/.claude/commands/docs-check.md @@ -200,6 +200,6 @@ Summary: 15/18 checks passed, 3 issues found ## Notes - This check is especially important after adding new estimators -- The CLAUDE.md file documents what documentation is required for new features +- The CONTRIBUTING.md file documents what documentation is required for new features - Missing references should cite the original methodology paper, not textbooks - When adding new estimators, update this skill's tables accordingly diff --git a/.claude/commands/review-plan.md b/.claude/commands/review-plan.md index dc7e839..57556d1 100644 --- a/.claude/commands/review-plan.md +++ b/.claude/commands/review-plan.md @@ -76,11 +76,12 @@ If no prior review is available from either source (conversation context or revi ### Step 2: Read CLAUDE.md for Project Context Read the project's `CLAUDE.md` file to understand: -- Module structure and architecture +- Key design patterns (sklearn-like API, formula interface, results objects, etc.) - Estimator inheritance map -- Development checklists (adding parameters, methodology-critical code, etc.) -- Test structure and conventions -- Design patterns (sklearn-like API, formula interface, results objects, etc.) +- Testing conventions +- Key reference file pointers (methodology registry, development checklists, etc.) + +Also read `.claude/commands/dev-checklists.md` for development checklists. If the plan modifies estimator math, standard error formulas, inference logic, or edge-case handling, also read `docs/methodology/REGISTRY.md` to understand the academic foundations and reference implementations for the affected estimator(s). 
@@ -224,7 +225,7 @@ Check for **missing related changes**: - Tests for new/changed functionality - `__init__.py` export updates - `get_params()` / `set_params()` updates for new parameters -- Documentation updates (README, RST, tutorials, CLAUDE.md) +- Documentation updates (README, RST, tutorials, CONTRIBUTING.md, CLAUDE.md if design patterns change) - For bug fixes: did the plan grep for ALL occurrences of the pattern, or just the one reported? Check for **unnecessary additions**: @@ -354,9 +355,9 @@ Present the review in the following format. Number each issue sequentially withi ## Checklist Gaps -Cross-reference against the relevant CLAUDE.md checklists. List which checklist items are not addressed by the plan. +Cross-reference against the relevant development checklists in `.claude/commands/dev-checklists.md`. List which checklist items are not addressed by the plan. -[Identify which CLAUDE.md checklist applies (e.g., "Adding a New Parameter to Estimators", "Implementing Methodology-Critical Code", "Fixing Bugs Across Multiple Locations") and list any items from that checklist that the plan doesn't cover.] +[Identify which checklist applies (e.g., "Adding a New Parameter to Estimators", "Implementing Methodology-Critical Code", "Fixing Bugs Across Multiple Locations") and list any items from that checklist that the plan doesn't cover.] **Registry Alignment** (if methodology files changed): - [ ] Plan equations match REGISTRY.md (or deviations documented) diff --git a/.claude/hooks/check-plan-review.sh b/.claude/hooks/check-plan-review.sh index cf75edb..864d809 100755 --- a/.claude/hooks/check-plan-review.sh +++ b/.claude/hooks/check-plan-review.sh @@ -18,7 +18,7 @@ # - A stale sentinel can approve an unreviewed newer plan. This occurs only when: # (a) the sentinel points to an older plan with a valid review, AND # (b) a newer plan was created without updating the sentinel. -# CLAUDE.md instructs sentinel updates on new plan creation (line 647), making (b) +# CLAUDE.md instructs sentinel updates on new plan creation (Plan Review section), making (b) # unlikely in practice. A runtime guard was tested and removed (cb8d5e3) because it # caused false denials in multi-plan workflows. Automated tests in # test-check-plan-review.sh verify existing behavior; full mitigation would require @@ -102,7 +102,7 @@ if [ -f "$REVIEW_FILE" ]; then fi exit 0 # Review exists and is fresh, allow else - deny "No plan review found. Expected: $REVIEW_FILE. Follow the Plan Review Before Approval instructions in CLAUDE.md." + deny "No plan review found. Expected: $REVIEW_FILE. Run the plan review workflow: offer review via AskUserQuestion, spawn /review-plan if requested, or write a Skipped marker. See CLAUDE.md Plan Review section." fi # Manual verification checklist: diff --git a/CLAUDE.md b/CLAUDE.md index 70da7ec..4e0db93 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -59,261 +59,18 @@ DIFF_DIFF_BACKEND=rust pytest pytest tests/test_rust_backend.py -v ``` -**Note**: As of v2.2.0, the Rust backend uses `faer` (pure Rust) for SVD and matrix inversion. -BLAS is optionally linked via Cargo features (`accelerate` on macOS, `openblas` on Linux) -for matrix-vector/matrix-matrix products. Windows builds remain fully pure Rust with no -external dependencies. Pre-built PyPI wheels include platform BLAS; source builds use -pure Rust by default. 
- -## Architecture - -### Module Structure - -- **`diff_diff/estimators.py`** - Core estimator classes implementing DiD methods: - - `DifferenceInDifferences` - Basic 2x2 DiD with formula or column-name interface - - `MultiPeriodDiD` - Full event-study DiD with treatment × period interactions for ALL periods (pre and post). Supports `unit` parameter for staggered adoption detection. Default reference period is last pre-period (e=-1 convention). Pre-period coefficients enable parallel trends assessment. `interaction_indices` maps periods to VCV column indices for robust sub-VCV extraction in HonestDiD/PreTrendsPower. - - Re-exports `TwoWayFixedEffects` and `SyntheticDiD` for backward compatibility - -- **`diff_diff/twfe.py`** - Two-Way Fixed Effects estimator: - - `TwoWayFixedEffects` - Panel DiD with unit and time fixed effects (within-transformation) - -- **`diff_diff/synthetic_did.py`** - Synthetic DiD estimator: - - `SyntheticDiD` - Synthetic control combined with DiD (Arkhangelsky et al. 2021) - - Frank-Wolfe solver matching R's `synthdid::sc.weight.fw()` for unit and time weights - - Auto-computed regularization from data noise level (`zeta_omega`, `zeta_lambda`) - - Two-pass sparsification for unit weights (100 iters → sparsify → 1000 iters) - - Bootstrap SE uses fixed weights (matches R's `bootstrap_sample`) - -- **`diff_diff/staggered.py`** - Staggered adoption DiD main module: - - `CallawaySantAnna` - Callaway & Sant'Anna (2021) estimator for heterogeneous treatment timing - - Core estimation methods: `_precompute_structures()`, `_compute_att_gt_fast()`, `fit()` - - Estimation approaches: `_outcome_regression()`, `_ipw_estimation()`, `_doubly_robust()` - - Re-exports result and bootstrap classes for backward compatibility - -- **`diff_diff/staggered_results.py`** - Result container classes: - - `GroupTimeEffect` - Container for individual group-time effects - - `CallawaySantAnnaResults` - Results with group-time ATT(g,t), `summary()`, `to_dataframe()` - -- **`diff_diff/staggered_bootstrap.py`** - Bootstrap inference: - - `CSBootstrapResults` - Bootstrap inference results (SEs, CIs, p-values for all aggregations) - - `CallawaySantAnnaBootstrapMixin` - Mixin with bootstrap methods - - `_generate_bootstrap_weights_batch()` - Vectorized weight generation - - Multiplier bootstrap with Rademacher, Mammen, or Webb weights - -- **`diff_diff/staggered_aggregation.py`** - Aggregation methods: - - `CallawaySantAnnaAggregationMixin` - Mixin with aggregation methods - - `_aggregate_simple()`, `_aggregate_event_study()`, `_aggregate_by_group()` - - `_compute_aggregated_se_with_wif()` - SE with weight influence function adjustment - -- **`diff_diff/sun_abraham.py`** - Sun-Abraham interaction-weighted estimator: - - `SunAbraham` - Sun & Abraham (2021) estimator using saturated regression - - `SunAbrahamResults` - Results with event study effects and cohort weights - - `SABootstrapResults` - Bootstrap inference results - - Alternative to Callaway-Sant'Anna with different weighting scheme - - Useful robustness check when both estimators agree - -- **`diff_diff/imputation.py`** - Borusyak-Jaravel-Spiess imputation DiD estimator: - - `ImputationDiD` - Borusyak et al. 
(2024) efficient imputation estimator for staggered DiD - - `imputation_did()` - Convenience function - - Steps: (1) OLS on untreated obs for unit+time FE, (2) impute counterfactual Y(0), (3) aggregate - - Conservative variance (Theorem 3) with `aux_partition` parameter for SE tightness - - Pre-trend test (Equation 9) via `results.pretrend_test()` - - Proposition 5: NaN for unidentified long-run horizons without never-treated units - - Re-exports result and bootstrap classes for backward compatibility - -- **`diff_diff/imputation_results.py`** - Result container classes: - - `ImputationBootstrapResults` - Multiplier bootstrap inference results - - `ImputationDiDResults` - Results with overall ATT, event study, group effects, pre-trend test - -- **`diff_diff/imputation_bootstrap.py`** - Bootstrap inference: - - `ImputationDiDBootstrapMixin` - Mixin with multiplier bootstrap methods - - Cluster-level influence function sums (Theorem 3) with Rademacher weights - -- **`diff_diff/two_stage.py`** - Gardner (2022) Two-Stage DiD estimator: - - `TwoStageDiD` - Two-stage estimator: (1) estimate unit+time FE on untreated obs, (2) regress residualized outcomes on treatment indicators - - `two_stage_did()` - Convenience function - - Point estimates identical to ImputationDiD; different variance estimator (GMM sandwich vs. conservative) - - Custom `_compute_gmm_variance()` — cannot reuse `compute_robust_vcov()` because correction term uses GLOBAL cross-moment - - No finite-sample adjustments (raw asymptotic sandwich, matching R `did2s`) - - Re-exports result and bootstrap classes for backward compatibility - -- **`diff_diff/two_stage_results.py`** - Result container classes: - - `TwoStageBootstrapResults` - Multiplier bootstrap inference on GMM influence function - - `TwoStageDiDResults` - Results with overall ATT, event study, group effects, per-observation treatment effects - -- **`diff_diff/two_stage_bootstrap.py`** - Bootstrap inference: - - `TwoStageDiDBootstrapMixin` - Mixin with GMM influence function bootstrap methods - -- **`diff_diff/stacked_did.py`** - Stacked DiD estimator (Wing et al. 
2024): - - `StackedDiD` - Stacked DiD with corrective Q-weights for compositional balance - - `stacked_did()` - Convenience function - - Builds sub-experiments per adoption cohort with clean controls - - IC1/IC2 trimming for compositional balance across event times - - Q-weights for aggregate, population, or sample share estimands (Table 1) - - WLS event study regression via sqrt(w) transformation - - Re-exports result class for backward compatibility - -- **`diff_diff/stacked_did_results.py`** - Result container classes: - - `StackedDiDResults` - Results with overall ATT, event study, group effects, stacked data access - -- **`diff_diff/triple_diff.py`** - Triple Difference (DDD) estimator: - - `TripleDifference` - Ortiz-Villavicencio & Sant'Anna (2025) estimator for DDD designs - - `TripleDifferenceResults` - Results with ATT, SEs, cell means, diagnostics - - `triple_difference()` - Convenience function for quick estimation - - Regression adjustment, IPW, and doubly robust estimation methods - - Proper covariate handling (unlike naive DDD implementations) - -- **`diff_diff/trop.py`** - Triply Robust Panel (TROP) estimator (v2.1.0): - - `TROP` - Athey, Imbens, Qu & Viviano (2025) estimator with factor model adjustment - - `trop()` - Convenience function for quick estimation - - Three robustness components: factor adjustment, unit weights, time weights - - Two estimation methods via `method` parameter: - - `"twostep"` (default): Per-observation model fitting (Algorithm 2 of paper) - - `"joint"`: Weighted least squares with homogeneous treatment effect (faster) - - Automatic rank selection via cross-validation, information criterion, or elbow detection - - Bootstrap and placebo-based variance estimation - - Re-exports result classes for backward compatibility - -- **`diff_diff/trop_results.py`** - Result container classes: - - `_LAMBDA_INF` - Sentinel value for disabled factor model (λ_nn=∞) - - `_PrecomputedStructures` - TypedDict for cached matrices - - `TROPResults` - Results with ATT, factors, loadings, unit/time weights - -- **`diff_diff/bacon.py`** - Goodman-Bacon decomposition for TWFE diagnostics: - - `BaconDecomposition` - Decompose TWFE into weighted 2x2 comparisons (Goodman-Bacon 2021) - - `BaconDecompositionResults` - Results with comparison weights and estimates by type - - `Comparison2x2` - Individual 2x2 comparison (treated_vs_never, earlier_vs_later, later_vs_earlier) - - `bacon_decompose()` - Convenience function for quick decomposition - - Integrated with `TwoWayFixedEffects.decompose()` method - -- **`diff_diff/linalg.py`** - Unified linear algebra backend (v1.4.0+): - - `solve_ols()` - OLS solver with R-style rank deficiency handling - - `_detect_rank_deficiency()` - Detect linearly dependent columns via pivoted QR - - `compute_robust_vcov()` - Vectorized HC1 and cluster-robust variance-covariance estimation - - `compute_r_squared()` - R-squared and adjusted R-squared computation - - `LinearRegression` - High-level OLS helper class with unified coefficient extraction and inference - - `InferenceResult` - Dataclass container for coefficient-level inference (SE, t-stat, p-value, CI) - - Single optimization point for all estimators (reduces code duplication) - - Cluster-robust SEs use pandas groupby instead of O(n × clusters) loop - - **Rank deficiency handling** (R-style): - - Detects rank-deficient matrices using pivoted QR decomposition - - `rank_deficient_action` parameter: "warn" (default), "error", or "silent" - - Dropped columns have NaN coefficients (like R's 
`lm()`) - - VCoV matrix has NaN for rows/cols of dropped coefficients - - Warnings include column names when provided - -- **`diff_diff/_backend.py`** - Backend detection and configuration (v2.0.0): - - Detects optional Rust backend availability - - Handles `DIFF_DIFF_BACKEND` environment variable ('auto', 'python', 'rust') - - Exports `HAS_RUST_BACKEND` flag and Rust function references - - `rust_backend_info()` — returns compile-time BLAS feature status dict - - Other modules import from here to avoid circular imports with `__init__.py` - -- **`rust/`** - Optional Rust backend for accelerated computation (v2.0.0+): - - **`rust/src/lib.rs`** - PyO3 module definition, exports Python bindings - - **`rust/src/bootstrap.rs`** - Parallel bootstrap weight generation (Rademacher, Mammen, Webb) - - **`rust/src/linalg.rs`** - OLS solver (SVD-based) and cluster-robust variance estimation - - **`rust/src/weights.rs`** - Synthetic control weights and simplex projection - - **`rust/src/trop.rs`** - TROP estimator acceleration: - - `compute_unit_distance_matrix()` - Parallel pairwise RMSE distance computation (4-8x speedup) - - `loocv_grid_search()` - Parallel LOOCV across tuning parameters (10-50x speedup) - - `bootstrap_trop_variance()` - Parallel bootstrap variance estimation (5-15x speedup) - - Uses pure-Rust `faer` library for SVD/matrix inversion (no external deps) - - Optional BLAS linking via Cargo features: `accelerate` (macOS), `openblas` (Linux) - - When BLAS is enabled, ndarray `.dot()` calls dispatch to platform-optimized dgemv/dgemm - - Cross-platform: Windows builds use pure Rust with no additional setup - - Provides 4-8x speedup for SyntheticDiD, 5-20x speedup for TROP - -- **`diff_diff/results.py`** - Dataclass containers for estimation results: - - `DiDResults`, `MultiPeriodDiDResults`, `SyntheticDiDResults`, `PeriodEffect` - - Each provides `summary()`, `to_dict()`, `to_dataframe()` methods - -- **`diff_diff/visualization.py`** - Plotting functions: - - `plot_event_study` - Publication-ready event study coefficient plots - - `plot_group_effects` - Treatment effects by cohort visualization - - `plot_sensitivity` - Honest DiD sensitivity analysis plots (bounds vs M) - - `plot_honest_event_study` - Event study with honest confidence intervals - - `plot_bacon` - Bacon decomposition scatter/bar plots (weights vs estimates by comparison type) - - `plot_power_curve` - Power curve visualization (power vs effect size or sample size) - - `plot_pretrends_power` - Pre-trends test power curve (power vs violation magnitude) - - Works with MultiPeriodDiD, CallawaySantAnna, SunAbraham, HonestDiD, BaconDecomposition, PowerAnalysis, PreTrendsPower, or DataFrames - -- **`diff_diff/utils.py`** - Statistical utilities: - - Robust/cluster standard errors (`compute_robust_se`) - - Parallel trends tests (`check_parallel_trends`, `check_parallel_trends_robust`, `equivalence_test_trends`) - - Synthetic control weight computation (`compute_synthetic_weights`, `compute_time_weights`) - - Wild cluster bootstrap (`wild_bootstrap_se`, `WildBootstrapResults`) - -- **`diff_diff/diagnostics.py`** - Placebo tests and DiD diagnostics: - - `run_placebo_test()` - Main dispatcher for different placebo test types - - `placebo_timing_test()` - Fake treatment timing test - - `placebo_group_test()` - Fake treatment group test (DiD on never-treated) - - `permutation_test()` - Permutation-based inference - - `leave_one_out_test()` - Sensitivity to individual treated units - - `run_all_placebo_tests()` - Comprehensive suite of 
diagnostics - - `PlaceboTestResults` - Dataclass for test results - -- **`diff_diff/datasets.py`** - Real-world datasets for teaching and examples: - - `load_card_krueger()` - Card & Krueger (1994) minimum wage dataset (classic 2x2 DiD) - - `load_castle_doctrine()` - Castle Doctrine / Stand Your Ground laws (staggered adoption) - - `load_divorce_laws()` - Unilateral divorce laws (staggered adoption, Stevenson-Wolfers) - - `load_mpdta()` - Minimum wage panel data from R `did` package (Callaway-Sant'Anna example) - - `list_datasets()` - List available datasets with descriptions - - `load_dataset(name)` - Load dataset by name - - `clear_cache()` - Clear locally cached datasets - - Datasets are downloaded from public sources and cached locally - -- **`diff_diff/honest_did.py`** - Honest DiD sensitivity analysis (Rambachan & Roth 2023): - - `HonestDiD` - Main class for computing bounds under parallel trends violations - - `DeltaSD`, `DeltaRM`, `DeltaSDRM` - Restriction classes for smoothness and relative magnitudes - - `HonestDiDResults` - Results with identified set bounds and robust CIs - - `SensitivityResults` - Results from sensitivity analysis over M grid - - `compute_honest_did()` - Convenience function for quick bounds computation - - `sensitivity_plot()` - Convenience function for plotting sensitivity analysis - -- **`diff_diff/power.py`** - Power analysis for study design: - - `PowerAnalysis` - Main class for analytical power calculations - - `PowerResults` - Results with MDE, power, sample size - - `SimulationPowerResults` - Results from Monte Carlo power simulation - - `simulate_power()` - Simulation-based power for any DiD estimator - - `compute_mde()`, `compute_power()`, `compute_sample_size()` - Convenience functions - -- **`diff_diff/pretrends.py`** - Pre-trends power analysis (Roth 2022): - - `PreTrendsPower` - Main class for assessing informativeness of pre-trends tests - - `PreTrendsPowerResults` - Results with power and minimum detectable violation (MDV) - - `PreTrendsPowerCurve` - Power curve across violation magnitudes with plot method - - `compute_pretrends_power()` - Convenience function for quick power computation - - `compute_mdv()` - Convenience function for minimum detectable violation - - Violation types: 'linear', 'constant', 'last_period', 'custom' - - Integrates with HonestDiD for comprehensive sensitivity analysis - -- **`diff_diff/prep.py`** - Data preparation utilities (core functions): - - `make_treatment_indicator`, `make_post_indicator` - Create binary indicators - - `wide_to_long`, `balance_panel` - Panel data reshaping - - `validate_did_data`, `summarize_did_data` - Data validation and summary - - `create_event_time` - Create event-time column for staggered adoption designs - - `aggregate_to_cohorts` - Aggregate unit-level data to cohort means - - `rank_control_units` - Rank control units by suitability for DiD/Synthetic control - - Re-exports all functions from `prep_dgp.py` for backward compatibility - -- **`diff_diff/prep_dgp.py`** - Data generation functions (DGP): - - `generate_did_data` - Create synthetic data with known treatment effect (basic 2x2 DiD) - - `generate_staggered_data` - Staggered adoption data for CallawaySantAnna/SunAbraham - - `generate_factor_data` - Factor model data for TROP/SyntheticDiD - - `generate_ddd_data` - Triple Difference (DDD) design data - - `generate_panel_data` - Panel data with optional parallel trends violations - - `generate_event_study_data` - Event study data with simultaneous treatment - -### Key Design Patterns 
+## Key Design Patterns 1. **sklearn-like API**: Estimators use `fit()` method, `get_params()`/`set_params()` for configuration 2. **Formula interface**: Supports R-style formulas like `"outcome ~ treated * post"` 3. **Fixed effects handling**: - `fixed_effects` parameter creates dummy variables (for low-dimensional FE) - `absorb` parameter uses within-transformation (for high-dimensional FE) -4. **Results objects**: Rich dataclass objects with statistical properties (`is_significant`, `significance_stars`) -5. **Unified linear algebra backend**: All estimators use `linalg.py` for OLS and variance estimation -6. **Estimator inheritance**: Understanding inheritance prevents consistency bugs +4. **Results objects**: Rich dataclass containers with `summary()`, `to_dict()`, `to_dataframe()` +5. **Unified `linalg.py` backend**: ALL estimators use `solve_ols()` / `compute_robust_vcov()` +6. **Inference computation**: ALL inference fields (t_stat, p_value, conf_int) MUST be computed + together using `safe_inference()` from `diff_diff.utils`. Never compute individually. +7. **Estimator inheritance** — understanding this prevents consistency bugs: ``` DifferenceInDifferences (base class) ├── TwoWayFixedEffects (inherits get_params/set_params) @@ -332,399 +89,58 @@ pure Rust by default. ``` When adding params to `DifferenceInDifferences.get_params()`, subclasses inherit automatically. Standalone estimators must be updated individually. -7. **Inference computation**: ALL inference fields (t_stat, p_value, conf_int) - MUST be computed together using `safe_inference()` from `diff_diff.utils`. - Never compute t_stat, p_value, or conf_int individually in new estimator code. - -### Performance Architecture (v1.4.0) - -diff-diff achieved significant performance improvements in v1.4.0, now **faster than R** at all scales. 
Key optimizations: - -#### Unified `linalg.py` Backend - -All estimators use a single optimized OLS/SE implementation: - -- **R-style rank deficiency handling**: Uses pivoted QR to detect linearly dependent columns, drops them, sets NaN for their coefficients, and emits informative warnings (following R's `lm()` approach) -- **Vectorized cluster-robust SE**: Uses pandas groupby aggregation instead of O(n × clusters) Python loop -- **Single optimization point**: Changes to `linalg.py` benefit all estimators - -```python -# All estimators import from linalg.py -from diff_diff.linalg import solve_ols, compute_robust_vcov - -# Example usage (warns on rank deficiency, sets NaN for dropped coefficients) -coefficients, residuals, vcov = solve_ols(X, y, cluster_ids=cluster_ids) - -# Suppress warning or raise error: -coefficients, residuals, vcov = solve_ols(X, y, rank_deficient_action="silent") # no warning -coefficients, residuals, vcov = solve_ols(X, y, rank_deficient_action="error") # raises ValueError - -# At estimator level (DifferenceInDifferences, MultiPeriodDiD): -from diff_diff import DifferenceInDifferences -did = DifferenceInDifferences(rank_deficient_action="error") # raises on collinear data -did = DifferenceInDifferences(rank_deficient_action="silent") # no warning -``` - -#### CallawaySantAnna Optimizations (`staggered.py`) - -- **Pre-computed data structures**: `_precompute_structures()` creates wide-format outcome matrix and cohort masks once -- **Vectorized ATT(g,t)**: `_compute_att_gt_fast()` uses numpy operations (23x faster than loop-based) -- **Batch bootstrap weights**: `_generate_bootstrap_weights_batch()` generates all weights at once -- **Matrix-based bootstrap**: Bootstrap iterations use matrix operations instead of nested loops (26x faster) - -#### Performance Results - -| Estimator | v1.3 (10K scale) | v1.4 (10K scale) | vs R | -|-----------|------------------|------------------|------| -| BasicDiD/TWFE | 0.835s | 0.011s | **4x faster than R** | -| CallawaySantAnna | 2.234s | 0.109s | **8x faster than R** | -| SyntheticDiD | Already optimized | N/A | **37x faster than R** | - -See `docs/performance-plan.md` for full optimization details and `docs/benchmarks.rst` for validation results. 
- -### Documentation - -- **`docs/methodology/REGISTRY.md`** - Methodology Registry: - - Academic foundations and citations for each estimator - - Key implementation requirements (equations, SE formulas, edge cases) - - Reference implementations (R packages, Stata commands) - - Requirements checklists for validation - - **Must be consulted before implementing methodology-related changes** - - **Must be updated when deviating from reference implementations** - -- **`docs/methodology/papers/`** - Paper review output files produced by `/paper-review` skill, formatted as Methodology Registry entries for implementation reference - -- **`docs/tutorials/`** - Jupyter notebook tutorials: - - `01_basic_did.ipynb` - Basic 2x2 DiD, covariates, fixed effects, wild bootstrap - - `02_staggered_did.ipynb` - Staggered adoption with Callaway-Sant'Anna, bootstrap inference - - `03_synthetic_did.ipynb` - Synthetic DiD with unit/time weights - - `04_parallel_trends.ipynb` - Parallel trends testing and diagnostics - - `05_honest_did.ipynb` - Honest DiD sensitivity analysis for parallel trends violations - - `06_power_analysis.ipynb` - Power analysis for study design, MDE, simulation-based power - - `07_pretrends_power.ipynb` - Pre-trends power analysis (Roth 2022), MDV, power curves - - `08_triple_diff.ipynb` - Triple Difference (DDD) estimation with proper covariate handling - - `09_real_world_examples.ipynb` - Real-world data examples (Card-Krueger, Castle Doctrine, Divorce Laws) - - `10_trop.ipynb` - Triply Robust Panel (TROP) estimation with factor model adjustment - - `11_imputation_did.ipynb` - Imputation DiD (Borusyak et al. 2024), pre-trend test, efficiency comparison - - `12_two_stage_did.ipynb` - Two-Stage DiD (Gardner 2022), GMM sandwich variance, per-observation effects - -### Benchmarks - -- **`benchmarks/`** - Validation benchmarks against R packages: - - `run_benchmarks.py` - Main orchestrator for running all benchmarks - - `compare_results.py` - Result comparison utilities - - `R/` - R benchmark scripts (did, synthdid, fixest, HonestDiD) - - `python/` - Python benchmark scripts mirroring R scripts - - `data/synthetic/` - Generated test data (not committed, use `--generate-data-only`) - - `results/` - JSON output files (not committed) - -Run benchmarks: -```bash -# Generate synthetic data first -python benchmarks/run_benchmarks.py --generate-data-only - -# Run all benchmarks -python benchmarks/run_benchmarks.py --all - -# Run specific estimator -python benchmarks/run_benchmarks.py --estimator callaway -python benchmarks/run_benchmarks.py --estimator multiperiod -``` - -See `docs/benchmarks.rst` for full methodology and validation results. - -### Test Structure - -Tests mirror the source modules: -- `tests/test_estimators.py` - Tests for DifferenceInDifferences, TWFE, MultiPeriodDiD, SyntheticDiD -- `tests/test_methodology_sdid.py` - Methodology tests for SDID: Frank-Wolfe solver, regularization, sparsification, edge cases -- `tests/test_staggered.py` - Tests for CallawaySantAnna -- `tests/test_sun_abraham.py` - Tests for SunAbraham interaction-weighted estimator -- `tests/test_imputation.py` - Tests for ImputationDiD (Borusyak et al. 2024) estimator -- `tests/test_two_stage.py` - Tests for TwoStageDiD (Gardner 2022) estimator, including equivalence tests with ImputationDiD -- `tests/test_stacked_did.py` - Tests for Stacked DiD (Wing et al. 
2024) estimator -- `tests/test_triple_diff.py` - Tests for Triple Difference (DDD) estimator -- `tests/test_trop.py` - Tests for Triply Robust Panel (TROP) estimator -- `tests/test_bacon.py` - Tests for Goodman-Bacon decomposition -- `tests/test_linalg.py` - Tests for unified OLS backend, robust variance estimation, LinearRegression helper, and InferenceResult -- `tests/test_utils.py` - Tests for parallel trends, robust SE, synthetic weights -- `tests/test_diagnostics.py` - Tests for placebo tests -- `tests/test_wild_bootstrap.py` - Tests for wild cluster bootstrap -- `tests/test_prep.py` - Tests for data preparation utilities -- `tests/test_visualization.py` - Tests for plotting functions -- `tests/test_honest_did.py` - Tests for Honest DiD sensitivity analysis -- `tests/test_power.py` - Tests for power analysis -- `tests/test_pretrends.py` - Tests for pre-trends power analysis -- `tests/test_datasets.py` - Tests for dataset loading functions - -Session-scoped `ci_params` fixture in `conftest.py` scales bootstrap iterations and TROP grid sizes in pure Python mode — use `ci_params.bootstrap(n)` and `ci_params.grid(values)` in new tests with `n_bootstrap >= 20`. For SE convergence tests (analytical vs bootstrap comparison), use `ci_params.bootstrap(n, min_n=199)` with a conditional tolerance: `threshold = 0.40 if n_boot < 100 else 0.15`. The `min_n` parameter is capped at 49 in pure Python mode to keep CI fast, so convergence tests use wider tolerances when running with fewer bootstrap iterations. - -**Slow test suites:** `tests/test_trop.py` is very time-consuming. Only run TROP tests when changes could affect the TROP estimator (e.g., `diff_diff/trop.py`, `diff_diff/trop_results.py`, `diff_diff/linalg.py`, `diff_diff/_backend.py`, or `rust/src/trop.rs`). For unrelated changes, exclude with `pytest --ignore=tests/test_trop.py`. - -### Test Writing Guidelines - -**For fallback/error handling paths:** -- Don't just test that code runs without exception -- Assert the expected behavior actually occurred -- Bad: `result = func(bad_input)` (only tests no crash) -- Good: `result = func(bad_input); assert np.isnan(result.coef)` (tests behavior) - -**For new parameters:** -- Test parameter appears in `get_params()` output -- Test `set_params()` modifies the attribute -- Test parameter actually affects behavior (not just stored) - -**For warnings:** -- Capture warnings with `warnings.catch_warnings(record=True)` -- Assert warning message was emitted -- Assert the warned-about behavior occurred - -**For NaN inference tests**: Use `assert_nan_inference()` from conftest.py to validate -ALL inference fields are NaN-consistent. Don't check individual fields separately. - -### Dependencies - -Core dependencies are numpy, pandas, and scipy only (no statsmodels). The library implements its own OLS, robust standard errors, and inference. - -## Documentation Requirements - -When implementing new functionality, **always include accompanying documentation updates**: - -### For New Estimators or Major Features - -1. **README.md** - Add: - - Feature mention in the features list - - Full usage section with code examples - - Parameter documentation table - - API reference section (constructor params, fit() params, results attributes/methods) - - Scholarly references if applicable - -2. **docs/api/*.rst** - Add: - - RST documentation with `autoclass` directives - - Method summaries - - References to academic papers - -3. 
**docs/tutorials/*.ipynb** - Update relevant tutorial or create new one: - - Working code examples - - Explanation of when/why to use the feature - - Comparison with related functionality - -4. **CLAUDE.md** - Update: - - Module structure section - - Test structure section - - Any relevant design patterns - -5. **ROADMAP.md** - Update: - - Move implemented features from planned to current status - - Update version numbers - -### For Bug Fixes or Minor Enhancements - -- Update relevant docstrings -- Add/update tests -- Update CHANGELOG.md (if exists) -- **If methodology-related**: Update `docs/methodology/REGISTRY.md` edge cases section - -### Scholarly References - -For methods based on academic papers, always include: -- Full citation in README.md references section -- Reference in RST docs with paper details -- Citation in tutorial summary - -Example format: -``` -Sun, L., & Abraham, S. (2021). Estimating dynamic treatment effects in -event studies with heterogeneous treatment effects. *Journal of Econometrics*, -225(2), 175-199. -``` - -## Development Checklists - -### Adding a New Parameter to Estimators - -When adding a new `__init__` parameter that should be available across estimators: - -1. **Implementation** (for each affected estimator): - - [ ] Add to `__init__` signature with default value - - [ ] Store as `self.param_name` - - [ ] Add to `get_params()` return dict - - [ ] Handle in `set_params()` (usually automatic via `hasattr`) - -2. **Consistency** - apply to all applicable estimators per the **Estimator inheritance** map above - -3. **Testing**: - - [ ] Test `get_params()` includes new param - - [ ] Test parameter affects estimator behavior - - [ ] Test with non-default value - -4. **Downstream tracing**: - - [ ] Before implementing: `grep -rn "self\." diff_diff/.py` to find ALL downstream paths - - [ ] Parameter handled in ALL aggregation methods (simple, event_study, group) - - [ ] Parameter handled in bootstrap inference paths - -5. **Documentation**: - - [ ] Update docstring in all affected classes - - [ ] Update CLAUDE.md if it's a key design pattern - -### Implementing Methodology-Critical Code - -When implementing or modifying code that affects statistical methodology (estimators, SE calculation, inference, edge case handling): - -1. **Before coding - consult the Methodology Registry**: - - [ ] Read the relevant estimator section in `docs/methodology/REGISTRY.md` - - [ ] Identify the reference implementation(s) listed - - [ ] Note the edge case handling requirements - -2. **During implementation**: - - [ ] Follow the documented equations and formulas - - [ ] Match reference implementation behavior for standard cases - - [ ] For edge cases: either match reference OR document deviation - -3. **When deviating from reference implementations**: - - [ ] Add a **Note** in the Methodology Registry explaining the deviation - - [ ] Include rationale (e.g., "defensive enhancement", "R errors here") - - [ ] Ensure the deviation is an improvement, not a bug - -4. **Testing methodology-aligned behavior**: - - [ ] Test that edge cases produce documented behavior (NaN, warning, etc.) - - [ ] Assert warnings are raised (not just captured) - - [ ] Assert the warned-about behavior actually occurred - - [ ] For NaN results: assert `np.isnan()`, don't just check "no exception" - - [ ] All inference fields computed via `safe_inference()` (not inline) - -### Adding Warning/Error/Fallback Handling - -When adding code that emits warnings or handles errors: - -1. 
**Consult Methodology Registry first**: - - [ ] Check if behavior is documented in edge cases section - - [ ] If not documented, add it before implementing - -2. **Verify behavior matches message**: - - [ ] Manually trace the code path after warning/error - - [ ] Confirm the stated behavior actually occurs - -3. **Write behavioral tests**: - - [ ] Don't just test "no exception raised" - - [ ] Assert the expected outcome occurred - - [ ] For fallbacks: verify fallback behavior was applied - - [ ] Example: If warning says "setting NaN", assert `np.any(np.isnan(result))` - -4. **Protect arithmetic operations**: - - [ ] Wrap ALL related operations in `np.errstate()`, not just the final one - - [ ] Include division, matrix multiplication, and any operation that can overflow/underflow - -### Reviewing New Features or Code Paths - -When reviewing PRs that add new features, modes, or code paths (learned from PR #97 analysis): - -1. **Edge Case Coverage**: - - [ ] Empty result sets (no matching data for a filter condition) - - [ ] NaN/Inf propagation through ALL inference fields (SE, t-stat, p-value, CI) - - [ ] Parameter interactions (e.g., new param × existing aggregation methods) - - [ ] Control/comparison group composition for all code paths - -2. **Documentation Completeness**: - - [ ] All new parameters have docstrings with type, default, and description - - [ ] Methodology docs match implementation behavior (equations, edge cases) - - [ ] Edge cases documented in `docs/methodology/REGISTRY.md` - -3. **Logic Audit for New Code Paths**: - - [ ] When adding new modes (like `base_period="varying"`), trace ALL downstream effects - - [ ] Check aggregation methods handle the new mode correctly - - [ ] Check bootstrap/inference methods handle the new mode correctly - - [ ] Explicitly test control group composition in new code paths - -4. **Pattern Consistency**: - - [ ] Search for similar patterns in codebase (e.g., `t_stat = x / se if se > 0 else ...`) - - [ ] Ensure new code follows established patterns or updates ALL instances - - [ ] If fixing a pattern, grep for ALL occurrences first: - ```bash - grep -n "if.*se.*> 0.*else" diff_diff/*.py - ``` - -### Fixing Bugs Across Multiple Locations - -When a bug fix involves a pattern that appears in multiple places (learned from PR #97 analysis): - -1. **Find All Instances First**: - - [ ] Use grep/search to find ALL occurrences of the pattern before fixing - - [ ] Document the locations found (file:line) - - [ ] Example: `t_stat = effect / se if se > 0 else 0.0` appeared in 7 locations - -2. **Fix Comprehensively in One Round**: - - [ ] Fix ALL instances in the same PR/commit - - [ ] Create a test that covers all locations - - [ ] Don't fix incrementally across multiple review rounds - -3. **Regression Test the Fix**: - - [ ] Verify fix doesn't break other code paths - - [ ] For early-return fixes: ensure downstream code still runs when needed - - [ ] Example: Bootstrap early return must still compute per-effect SEs - -4. **Common Patterns to Watch For**: - - `if se > 0 else 0.0` → should be `else np.nan` for undefined statistics - - `if len(data) > 0 else return` → check what downstream code expects - - `mask = (condition)` → verify mask logic for all parameter combinations - -### Pre-Merge Review Checklist - -Final checklist before approving a PR: - -1. **Behavioral Completeness**: - - [ ] Happy path tested - - [ ] Edge cases tested (empty data, NaN inputs, boundary conditions) - - [ ] Error/warning paths tested with behavioral assertions - -2. 
**Inference Field Consistency**: - - [ ] If one inference field (SE, t-stat, p-value) can be NaN, all related fields handle it - - [ ] Aggregation methods propagate NaN correctly - - [ ] Bootstrap methods handle NaN in base estimates - -3. **Documentation Sync**: - - [ ] Docstrings updated for all changed signatures - - [ ] README updated if user-facing behavior changes - - [ ] REGISTRY.md updated if methodology edge cases change - -## Task Implementation Workflow - -When implementing features or fixes, follow this workflow to ensure quality and catch issues early: - -### Phase 1: Planning -- Use `EnterPlanMode` for non-trivial tasks -- Consult `docs/methodology/REGISTRY.md` for methodology-critical code -- Identify all files and code paths that will be affected -- For bug fixes: grep for the pattern first to find ALL occurrences - -### Phase 2: Implementation -- Follow the relevant development checklists above -- Write tests alongside implementation (not after) -- For bug fixes: fix ALL occurrences in the same commit -- For new parameters: ensure all aggregation/bootstrap paths handle them - -### Phase 3: Pre-Merge Review -**Run `/pre-merge-check` before submitting**. This skill will: -1. Run automated pattern checks on changed files (NaN handling, etc.) -2. Check for missing test coverage -3. Display context-specific checklist items based on what changed -4. Optionally run the test suite - -Address any warnings before proceeding. - -### Plan Review Before Approval - -When writing a new plan file (via EnterPlanMode), update the sentinel so the hook tracks the correct plan: +8. **Dependencies**: numpy, pandas, and scipy ONLY. No statsmodels. + +## Testing Conventions + +- **`ci_params` fixture** (session-scoped in `conftest.py`): Use `ci_params.bootstrap(n)` and + `ci_params.grid(values)` to scale iterations in pure Python mode. For SE convergence tests, + use `ci_params.bootstrap(n, min_n=199)` with conditional tolerance: + `threshold = 0.40 if n_boot < 100 else 0.15`. +- **`assert_nan_inference()`** from conftest.py: Use to validate ALL inference fields are + NaN-consistent. Don't check individual fields separately. +- **Slow test suites**: `tests/test_trop.py` is very time-consuming. Skip with + `pytest --ignore=tests/test_trop.py` for unrelated changes. +- **Behavioral assertions**: Always assert expected outcomes, not just no-exception. + Bad: `result = func(bad_input)`. Good: `result = func(bad_input); assert np.isnan(result.coef)`. + +## Key Reference Files + +| File | Contains | +|------|----------| +| `docs/methodology/REGISTRY.md` | Academic foundations, equations, edge cases — **consult before methodology changes** | +| `CONTRIBUTING.md` | Documentation requirements, test writing guidelines | +| `.claude/commands/dev-checklists.md` | Checklists for params, methodology, warnings, reviews, bugs (run `/dev-checklists`) | +| `.claude/memory.md` | Debugging patterns, tolerances, API conventions (git-tracked) | +| `docs/performance-plan.md` | Performance optimization details | +| `docs/benchmarks.rst` | Validation results vs R | + +## Workflow + +- For non-trivial tasks, use `EnterPlanMode`. Consult `docs/methodology/REGISTRY.md` for methodology changes. +- For bug fixes, grep for the pattern across all files before fixing. +- Follow the relevant development checklists (run `/dev-checklists`). +- Before submitting: run `/pre-merge-check`. +- Submit with `/submit-pr`. 
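+
+As a concrete illustration of the `ci_params` convention above, an SE convergence test
+might be sketched as follows. This is a sketch only: `example_panel` is a hypothetical
+fixture, and the estimator call shape, the `n_bootstrap` parameter, and the `.se`
+attribute access are assumptions to verify against the actual API before use.
+
+```python
+def test_bootstrap_se_close_to_analytical(ci_params, example_panel):
+    from diff_diff import DifferenceInDifferences
+
+    # Scale bootstrap iterations; min_n is capped in pure Python mode to keep CI fast.
+    n_boot = ci_params.bootstrap(999, min_n=199)
+    # Conditional tolerance from the Testing Conventions above.
+    threshold = 0.40 if n_boot < 100 else 0.15
+
+    analytic = DifferenceInDifferences().fit(example_panel)                 # assumed call shape
+    boot = DifferenceInDifferences(n_bootstrap=n_boot).fit(example_panel)   # parameter name assumed
+
+    # Compare the bootstrap SE to the analytical SE within the scaled tolerance.
+    assert abs(boot.se - analytic.se) / analytic.se < threshold
+```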
+ +## Plan Review Before Approval + +When writing a new plan file (via EnterPlanMode), update the sentinel: ```bash echo "" > ~/.claude/plans/.last-reviewed ``` -Before calling `ExitPlanMode`, offer the user an independent plan review. Use `AskUserQuestion`: - +Before calling `ExitPlanMode`, offer the user an independent plan review via `AskUserQuestion`: - "Run review agent for independent feedback" (Recommended) - "Present plan for approval as-is" -If "present as-is" is chosen, ensure `~/.claude/plans` exists (`mkdir -p`), then write a minimal review marker to `~/.claude/plans/.review.md`: +**If review requested**: Spawn review agent (Task tool, `subagent_type: "general-purpose"`) +to read `.claude/commands/review-plan.md` and follow Steps 2-5. Display output in conversation. +Save to `~/.claude/plans/.review.md` with YAML frontmatter (plan path, +timestamp, verdict, issue counts). Update sentinel. Collect feedback and revise if needed. +Touch review file after revision to avoid staleness check failure. + +**If skipped**: Write a minimal review marker to `~/.claude/plans/.review.md`: ```yaml --- plan: @@ -737,56 +153,8 @@ flags: [] --- Review skipped by user. ``` -Also write the plan path to `~/.claude/plans/.last-reviewed`. This satisfies the ExitPlanMode hook. - -If review is requested: - -1. **Spawn review agent**: Use the Task tool with `subagent_type: "general-purpose"`. Prompt the agent to: - - Read `.claude/commands/review-plan.md` and follow its Steps 2 through 5 - - Review the plan file at the current plan path - - Return ONLY the structured review output (from "## Overall Assessment" through "## Summary") — do not include the "## Plan Content" display - Do NOT specify a model — let the agent inherit the current model for review quality. - -2. **Display the review agent's output** in the conversation, preceded by the plan content (display the plan yourself for terminal reading). The user reads both here in the terminal — this is the primary reading surface. - -3. **Save to file**: Ensure `~/.claude/plans` exists (`mkdir -p`) before writing. Write the review output (from "## Overall Assessment" through "## Summary") to `~/.claude/plans/.review.md`, with YAML frontmatter (plan path, timestamp, verdict, issue counts). Also write the plan path to `~/.claude/plans/.last-reviewed`. If the review file write fails, report an error and stop — the hook requires this file. If the sentinel write fails, warn but continue (the hook falls back to `ls -t`). - -4. **Collect feedback**: Use `AskUserQuestion`: - - "Address all issues and re-review" (Recommended) - - "Address all issues and present for approval" - - "Let me specify which items to address or dismiss" - If the user selects the third option, they will provide free-form text with directives (e.g., "disagree with CRITICAL #2", "skip all LOW", "for MEDIUM #1, do X instead"). +Update sentinel. The `check-plan-review.sh` hook enforces this workflow. -5. **Revise the plan** based on review feedback and user directives: - - CRITICAL: address unless user explicitly dismissed with justification - - MEDIUM: address unless user dismissed - - LOW: skip unless user explicitly requested - - Checklist gaps: add missing items as plan steps - - Append a `## Revision Notes` section documenting what was addressed, dismissed, and why - - After writing the revised plan, touch the review file to update its mtime (`touch "$REVIEW_FILE"`) so the hook's staleness check passes after the intentional revision - -6. 
**Loop or exit**: If user chose "re-review", go back to step 1 (spawn a fresh review agent for the revised plan). Otherwise, call `ExitPlanMode`. - -If `ExitPlanMode` is blocked by the plan review hook (no `.review.md` found), follow the steps above to run a review before retrying. - -**Rollback**: To remove the plan review workflow, delete this subsection from CLAUDE.md, remove the `PreToolUse` entry from `.claude/settings.json`, and delete `.claude/hooks/check-plan-review.sh`. - -### Phase 4: Submit -- Use `/submit-pr` to create the PR -- Automated AI review will run additional methodology and edge case checks -- The PR template will prompt for methodology references if applicable - -### Quick Reference: Common Patterns to Check - -Before submitting methodology changes, verify these patterns: - -```bash -# Find potential NaN handling issues (should use np.nan, not 0.0) -grep -n "if.*se.*>.*0.*else 0" diff_diff/*.py - -# Find all t_stat calculations to ensure consistency -grep -n "t_stat.*=" diff_diff/*.py - -# Find all inference field assignments -grep -n "\(se\|t_stat\|p_value\|ci_lower\|ci_upper\).*=" diff_diff/*.py | head -30 -``` +**Rollback**: To remove the plan review workflow, delete this section from CLAUDE.md, +remove the `PreToolUse` entry from `.claude/settings.json`, and delete +`.claude/hooks/check-plan-review.sh`. diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md new file mode 100644 index 0000000..5c158eb --- /dev/null +++ b/CONTRIBUTING.md @@ -0,0 +1,77 @@ +# Contributing to diff-diff + +## Documentation Requirements + +When implementing new functionality, **always include accompanying documentation updates**. + +### For New Estimators or Major Features + +1. **README.md** - Add: + - Feature mention in the features list + - Full usage section with code examples + - Parameter documentation table + - API reference section (constructor params, fit() params, results attributes/methods) + - Scholarly references if applicable + +2. **docs/api/*.rst** - Add: + - RST documentation with `autoclass` directives + - Method summaries + - References to academic papers + +3. **docs/tutorials/*.ipynb** - Update relevant tutorial or create new one: + - Working code examples + - Explanation of when/why to use the feature + - Comparison with related functionality + +4. **CLAUDE.md** - Update only if adding new critical rules or design patterns + +5. **ROADMAP.md** - Update: + - Move implemented features from planned to current status + - Update version numbers + +### For Bug Fixes or Minor Enhancements + +- Update relevant docstrings +- Add/update tests +- Update CHANGELOG.md (if exists) +- **If methodology-related**: Update `docs/methodology/REGISTRY.md` edge cases section + +### Scholarly References + +For methods based on academic papers, always include: +- Full citation in README.md references section +- Reference in RST docs with paper details +- Citation in tutorial summary + +Example format: +``` +Sun, L., & Abraham, S. (2021). Estimating dynamic treatment effects in +event studies with heterogeneous treatment effects. *Journal of Econometrics*, +225(2), 175-199. 
+``` + +## Test Writing Guidelines + +### For Fallback/Error Handling Paths + +- Don't just test that code runs without exception +- Assert the expected behavior actually occurred +- Bad: `result = func(bad_input)` (only tests no crash) +- Good: `result = func(bad_input); assert np.isnan(result.coef)` (tests behavior) + +### For New Parameters + +- Test parameter appears in `get_params()` output +- Test `set_params()` modifies the attribute +- Test parameter actually affects behavior (not just stored) + +### For Warnings + +- Capture warnings with `warnings.catch_warnings(record=True)` +- Assert warning message was emitted +- Assert the warned-about behavior occurred + +### For NaN Inference Tests + +Use `assert_nan_inference()` from conftest.py to validate ALL inference fields are +NaN-consistent. Don't check individual fields separately.
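+
+### Worked Example
+
+The guidelines above translate into tests shaped roughly like the sketch below. It is
+illustrative only: the collinear-data setup, the `formula=` keyword, the return value of
+`fit()`, and the results attribute names are assumptions — check the estimator docstrings
+and existing tests for the exact interfaces.
+
+```python
+import warnings
+
+import numpy as np
+import pandas as pd
+
+from diff_diff import DifferenceInDifferences
+
+
+def test_collinear_covariate_warns_and_sets_nan():
+    # A tiny 2x2 panel with two perfectly collinear covariates, intended to trigger
+    # the R-style rank-deficiency warning (rank_deficient_action="warn" is the default).
+    rng = np.random.default_rng(0)
+    df = pd.DataFrame(
+        {
+            "outcome": rng.normal(size=40),
+            "treated": np.repeat([0, 1], 20),
+            "post": np.tile([0, 1], 20),
+            "x1": rng.normal(size=40),
+        }
+    )
+    df["x2"] = 2.0 * df["x1"]  # perfectly collinear with x1
+
+    with warnings.catch_warnings(record=True) as caught:
+        warnings.simplefilter("always")
+        results = DifferenceInDifferences().fit(
+            df, formula="outcome ~ treated * post + x1 + x2"  # keyword name assumed
+        )
+
+    # Assert the warning was emitted...
+    assert len(caught) > 0
+    # ...and that the warned-about behavior actually occurred (NaN coefficient for the
+    # dropped column), not merely that no exception was raised.
+    assert np.any(np.isnan(results.coefficients))  # attribute name assumed
+```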