1 change: 1 addition & 0 deletions Makefile
@@ -56,6 +56,7 @@ documentation-dev:
database:
	rm -f policyengine_us_data/storage/calibration/policy_data.db
	python policyengine_us_data/db/create_database_tables.py
	python policyengine_us_data/db/create_field_valid_values.py
	python policyengine_us_data/db/create_initial_strata.py
	python policyengine_us_data/db/etl_national_targets.py
	python policyengine_us_data/db/etl_age.py
5 changes: 5 additions & 0 deletions changelog_entry.yaml
@@ -0,0 +1,5 @@
- bump: minor
  changes:
    added:
      - field_valid_values table in the targets database as source of truth for semantic or external target information.
      - constraint_validation.py to ensure constraint operations result in consistent and valid sets of constraints for a given stratum.
105 changes: 95 additions & 10 deletions policyengine_us_data/db/DATABASE_GUIDE.md
@@ -24,15 +24,16 @@ make promote-database # Copy DB + raw inputs to HuggingFace clone

| # | Script | Network? | What it does |
|---|--------|----------|--------------|
| 1 | `create_database_tables.py` | No | Creates empty SQLite schema (7 tables) |
| 2 | `create_initial_strata.py` | Census ACS 5-year | Builds geographic hierarchy: US > 51 states > 436 CDs |
| 3 | `etl_national_targets.py` | No | Loads ~40 hardcoded national targets (CBO, Treasury, CMS) |
| 4 | `etl_age.py` | Census ACS 1-year | Age distribution: 18 bins x 488 geographies |
| 5 | `etl_medicaid.py` | Census ACS + CMS | Medicaid enrollment (admin state-level, survey district-level) |
| 6 | `etl_snap.py` | USDA FNS + Census ACS | SNAP participation (admin state-level, survey district-level) |
| 7 | `etl_state_income_tax.py` | No | State income tax collections (Census STC FY2023, hardcoded) |
| 8 | `etl_irs_soi.py` | IRS | Tax variables, EITC by child count, AGI brackets, conditional strata |
| 9 | `validate_database.py` | No | Checks all target variables exist in policyengine-us |
| 1 | `create_database_tables.py` | No | Creates SQLite schema (8 tables) + validation triggers |
| 2 | `create_field_valid_values.py` | No | Populates field_valid_values with allowed values |
| 3 | `create_initial_strata.py` | Census ACS 5-year | Builds geographic hierarchy: US > 51 states > 436 CDs |
| 4 | `etl_national_targets.py` | No | Loads ~40 hardcoded national targets (CBO, Treasury, CMS) |
| 5 | `etl_age.py` | Census ACS 1-year | Age distribution: 18 bins x 488 geographies |
| 6 | `etl_medicaid.py` | Census ACS + CMS | Medicaid enrollment (admin state-level, survey district-level) |
| 7 | `etl_snap.py` | USDA FNS + Census ACS | SNAP participation (admin state-level, survey district-level) |
| 8 | `etl_state_income_tax.py` | No | State income tax collections (Census STC FY2023, hardcoded) |
| 9 | `etl_irs_soi.py` | IRS | Tax variables, EITC by child count, AGI brackets, conditional strata |
| 10 | `validate_database.py` | No | Checks all target variables exist in policyengine-us |

### Raw Input Caching

@@ -94,6 +95,36 @@ make database

**variable_metadata** - Display info for variables (display name, units, ordering)

### Validation Table

**field_valid_values** - Centralized registry of valid values for semantic fields

This table is the source of truth for which values are allowed in specific fields throughout
the database, specifically fields that carry semantic, external information rather than relationships inherent to the database design itself. SQL triggers enforce validation on INSERT and UPDATE operations.

| Field Validated | Table | Valid Values |
|-----------------|-------|--------------|
| `operation` | stratum_constraints | `==`, `!=`, `>`, `>=`, `<`, `<=` |
| `constraint_variable` | stratum_constraints | All policyengine-us variables |
| `active` | targets | `0`, `1` |
| `period` | targets | `2022`, `2023`, `2024`, `2025` |
| `variable` | targets | All policyengine-us variables |
| `type` | sources | `administrative`, `survey`, `synthetic`, `derived`, `hardcoded` |

**Triggers**: `validate_stratum_constraints_insert`, `validate_stratum_constraints_update`,
`validate_targets_insert`, `validate_targets_update`, `validate_sources_insert`, `validate_sources_update`

To add a new valid value (e.g., a new year):
```sql
INSERT INTO field_valid_values (field_name, valid_value, description)
VALUES ('period', '2026', NULL);
```

To check what values are valid for a field:
```sql
SELECT valid_value, description FROM field_valid_values WHERE field_name = 'operation';
```
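
The trigger-based validation described above can be tried out in isolation. The following is a minimal sketch using Python's built-in `sqlite3` module, not the project's actual schema: the `field_valid_values` columns follow the INSERT statement above, but the trimmed-down `sources` table, the trigger body, and the helper names `build_demo_db` and `try_insert_source` are assumptions made for illustration.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with a lookup table and one validation trigger."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE field_valid_values (
            field_name  TEXT NOT NULL,
            valid_value TEXT NOT NULL,
            description TEXT,
            PRIMARY KEY (field_name, valid_value)
        );

        -- Hypothetical, trimmed-down stand-in for the real sources table.
        CREATE TABLE sources (
            source_id INTEGER PRIMARY KEY,
            name      TEXT NOT NULL,
            type      TEXT NOT NULL
        );

        -- Assumed trigger body: abort the INSERT when the new type
        -- is not registered in field_valid_values.
        CREATE TRIGGER validate_sources_insert
        BEFORE INSERT ON sources
        BEGIN
            SELECT RAISE(ABORT, 'invalid value for sources.type')
            WHERE NOT EXISTS (
                SELECT 1 FROM field_valid_values
                WHERE field_name = 'type' AND valid_value = NEW.type
            );
        END;

        INSERT INTO field_valid_values (field_name, valid_value, description)
        VALUES ('type', 'administrative', NULL),
               ('type', 'survey', NULL);
        """
    )
    return conn


def try_insert_source(conn: sqlite3.Connection, name: str, source_type: str) -> bool:
    """Return True if the row was stored, False if the database rejected it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO sources (name, type) VALUES (?, ?)",
                (name, source_type),
            )
        return True
    except sqlite3.DatabaseError:
        return False


conn = build_demo_db()
accepted = try_insert_source(conn, "Census ACS", "survey")
rejected = try_insert_source(conn, "Made-up source", "guesswork")
```

In this sketch the first insert is stored and the second is rejected by the trigger. An analogous `BEFORE UPDATE` trigger would be needed to cover updates, matching the `*_update` trigger names listed above.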

## Key Concepts

### Stratum Groups
@@ -153,9 +184,63 @@ ETL scripts that pull Census data receive UCGIDs and create their own domain-spe

### Constraint Operations

All constraints use standardized operators validated by the `ConstraintOperation` enum:
All constraints use standardized operators validated by the `field_valid_values` table:
`==`, `!=`, `>`, `>=`, `<`, `<=`

### Constraint Validation

ETL scripts validate constraint sets before inserting them into the database using `ensure_consistent_constraint_set()` from `policyengine_us_data.utils.constraint_validation`. This prevents logically inconsistent constraints from being stored.

**Validation Rules:**

1. **Operation Compatibility** (per constraint_variable):

| Operation | Can combine with | Rationale |
|-----------|-----------------|-----------|
| `==` | Nothing (must be alone) | Equality is absolute |
| `!=` | Nothing (must be alone) | Exclusion is absolute |
| `>` | `<` or `<=` only | Forms valid range |
| `>=` | `<` or `<=` only | Forms valid range |
| `<` | `>` or `>=` only | Forms valid range |
| `<=` | `>` or `>=` only | Forms valid range |

**Invalid combinations:**
- `>` with `>=` (redundant/conflicting lower bounds)
- `<` with `<=` (redundant/conflicting upper bounds)
- `==` with anything else
- `!=` with anything else

2. **Value Checks** (if operations are compatible):
- No empty ranges: the lower bound must not exceed the upper bound
- If the bounds are equal, both must be inclusive (`>=` and `<=`); otherwise the range is empty
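
To make the two rules concrete, here is a standalone sketch of how such a check could be written. It is not the implementation of `ensure_consistent_constraint_set()`: the `Constraint` dataclass is a stand-in with the same three fields used in this guide, `check_constraint_set` is a hypothetical name, and treating range bounds as numbers via `float()` is an assumption.

```python
from collections import defaultdict
from dataclasses import dataclass

VALID_OPS = {"==", "!=", ">", ">=", "<", "<="}
LOWER_OPS = {">", ">="}
UPPER_OPS = {"<", "<="}


@dataclass(frozen=True)
class Constraint:
    """Stand-in for the project's Constraint: variable, operation, string value."""

    variable: str
    operation: str
    value: str


def check_constraint_set(constraints: list[Constraint]) -> None:
    """Raise ValueError if the constraints on any one variable are inconsistent."""
    by_variable: dict[str, list[Constraint]] = defaultdict(list)
    for c in constraints:
        if c.operation not in VALID_OPS:
            raise ValueError(f"unknown operation {c.operation!r}")
        by_variable[c.variable].append(c)

    for variable, group in by_variable.items():
        ops = [c.operation for c in group]

        # Rule 1a: == and != must be the only constraint on their variable.
        if ("==" in ops or "!=" in ops) and len(group) > 1:
            raise ValueError(f"{variable}: == and != cannot be combined")

        # Rule 1b: at most one lower bound and at most one upper bound.
        lowers = [c for c in group if c.operation in LOWER_OPS]
        uppers = [c for c in group if c.operation in UPPER_OPS]
        if len(lowers) > 1 or len(uppers) > 1:
            raise ValueError(f"{variable}: multiple bounds on the same side")

        # Rule 2: a two-sided range must not be empty.
        if lowers and uppers:
            low, high = float(lowers[0].value), float(uppers[0].value)
            both_inclusive = (
                lowers[0].operation == ">=" and uppers[0].operation == "<="
            )
            if low > high or (low == high and not both_inclusive):
                raise ValueError(f"{variable}: empty range")


# A valid half-open age band passes silently.
check_constraint_set(
    [Constraint("age", ">=", "25"), Constraint("age", "<", "30")]
)
```

Grouping by variable first keeps the rules local: constraints on different variables never interact, so each group can be checked independently.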

**Usage in ETL:**
```python
from policyengine_us_data.utils.constraint_validation import (
    Constraint,
    ensure_consistent_constraint_set,
)

# Build constraint list
constraint_list = [
    Constraint(variable="age", operation=">=", value="25"),
    Constraint(variable="age", operation="<", value="30"),
]

# Validate before creating StratumConstraint objects
ensure_consistent_constraint_set(constraint_list)

# Now safe to add to stratum
stratum.constraints_rel = [
    StratumConstraint(
        constraint_variable=c.variable,
        operation=c.operation,
        value=c.value,
    )
    for c in constraint_list
]
```

### Constraint Value Types

The `value` column stores all values as strings. Downstream code deserializes: