-
Notifications
You must be signed in to change notification settings - Fork 10
Description
Critical Data Bug: All US Congressional Districts Incorrectly Assigned as Children of Wyoming
Summary
The calibration database has a critical parent-child relationship error where all 436 US congressional districts are assigned as children of Wyoming (0400000US56) instead of their respective states. This causes catastrophic data corruption when running rescale_calibration_targets().
Impact
- The rescaling function would scale all US congressional district values down to match Wyoming's population (~580k)
- This affects all variables with congressional district targets (adjusted_gross_income, eitc, income_tax, person_count, etc.)
- Districts that should sum to ~330M people would be scaled to match Wyoming's values in the tens of thousands
Steps to Reproduce
from policyengine_data.calibration.target_rescaling import download_database
from sqlalchemy import create_engine
import pandas as pd
# Download the database from Hugging Face
db_uri = download_database()
engine = create_engine(db_uri)
# Query showing Wyoming has 436 congressional district children
query = """
SELECT
p.stratum_id as parent_id,
p_geo.value as parent_geo,
COUNT(DISTINCT c.stratum_id) as num_children,
COUNT(DISTINCT c_geo.value) as num_unique_districts,
SUM(c_target.value) as sum_of_children,
p_target.value as wyoming_value
FROM strata p
JOIN stratum_constraints p_geo ON p.stratum_id = p_geo.stratum_id
AND p_geo.constraint_variable = 'ucgid_str'
AND p_geo.value = '0400000US56' -- Wyoming
JOIN strata c ON c.parent_stratum_id = p.stratum_id
JOIN stratum_constraints c_geo ON c.stratum_id = c_geo.stratum_id
AND c_geo.constraint_variable = 'ucgid_str'
AND c_geo.value LIKE '5001800US%' -- Congressional districts
JOIN targets p_target ON p.stratum_id = p_target.stratum_id
AND p_target.variable = 'person_count'
AND p_target.active = 1
JOIN targets c_target ON c.stratum_id = c_target.stratum_id
AND c_target.variable = 'person_count'
AND c_target.active = 1
GROUP BY p.stratum_id, p_geo.value, p_target.value
LIMIT 5
"""
result = pd.read_sql_query(query, engine)
print(result)Expected Output vs Actual Output
Expected:
- Wyoming should have 1 congressional district child (5001800US5600)
- That district should have ~580k people
- Other states should have their own congressional districts as children
Actual Output:
parent_id parent_geo num_children num_unique_districts sum_of_children wyoming_value
557 0400000US56 436 436 18333697.0 30651.0
608 0400000US56 436 436 19799430.0 36438.0
659 0400000US56 436 436 21203879.0 38128.0
The sum_of_children (18-23 million) represents demographic slices of the entire US population, not Wyoming.
Additional Verification Query
This query confirms all 51 states' districts are under Wyoming:
SELECT
COUNT(DISTINCT SUBSTR(c_geo.value, 10, 2)) as num_states_represented,
COUNT(DISTINCT c_geo.value) as total_districts
FROM strata p
JOIN stratum_constraints p_geo ON p.stratum_id = p_geo.stratum_id
AND p_geo.constraint_variable = 'ucgid_str'
AND p_geo.value = '0400000US56'
JOIN strata c ON c.parent_stratum_id = p.stratum_id
JOIN stratum_constraints c_geo ON c.stratum_id = c_geo.stratum_id
AND c_geo.constraint_variable = 'ucgid_str'
AND c_geo.value LIKE '5001800US%';Returns:
num_states_represented: 51total_districts: 436
Root Cause
The parent_stratum_id assignments for congressional districts are incorrectly set during database creation. All districts appear to be assigned Wyoming's stratum IDs as their parent instead of their respective state stratum IDs.
Proposed Fix
Congressional districts need to be reassigned to their correct state parents. For example:
- 5001800US0101 (Alabama district 1) → parent should be 0400000US01 (Alabama)
- 5001800US0601 (California district 1) → parent should be 0400000US06 (California)
- 5001800US5600 (Wyoming at-large) → parent should be 0400000US56 (Wyoming)
The parent_stratum_id values need to be updated to point to the correct state strata based on matching the state FIPS code (characters 10-11 in the UCGID string).