Skip to content

Critical Data Bug: All Congressional Districts Assigned to Wyoming #435

@baogorek

Description

@baogorek

Critical Data Bug: All US Congressional Districts Incorrectly Assigned as Children of Wyoming

Summary

The calibration database has a critical parent-child relationship error where all 436 US congressional districts are assigned as children of Wyoming (0400000US56) instead of their respective states. This causes catastrophic data corruption when running rescale_calibration_targets().

Impact

  • The rescaling function would scale all US congressional district values down to match Wyoming's population (~580k)
  • This affects all variables with congressional district targets (adjusted_gross_income, eitc, income_tax, person_count, etc.)
  • Districts that should sum to ~330M people would be scaled to match Wyoming's values in the tens of thousands

Steps to Reproduce

from policyengine_data.calibration.target_rescaling import download_database
from sqlalchemy import create_engine
import pandas as pd

# Download the database from Hugging Face
db_uri = download_database()
engine = create_engine(db_uri)

# Query showing Wyoming has 436 congressional district children
query = """
SELECT 
    p.stratum_id as parent_id,
    p_geo.value as parent_geo,
    COUNT(DISTINCT c.stratum_id) as num_children,
    COUNT(DISTINCT c_geo.value) as num_unique_districts,
    SUM(c_target.value) as sum_of_children,
    p_target.value as wyoming_value
FROM strata p
JOIN stratum_constraints p_geo ON p.stratum_id = p_geo.stratum_id 
    AND p_geo.constraint_variable = 'ucgid_str'
    AND p_geo.value = '0400000US56'  -- Wyoming
JOIN strata c ON c.parent_stratum_id = p.stratum_id
JOIN stratum_constraints c_geo ON c.stratum_id = c_geo.stratum_id 
    AND c_geo.constraint_variable = 'ucgid_str'
    AND c_geo.value LIKE '5001800US%'  -- Congressional districts
JOIN targets p_target ON p.stratum_id = p_target.stratum_id 
    AND p_target.variable = 'person_count'
    AND p_target.active = 1
JOIN targets c_target ON c.stratum_id = c_target.stratum_id 
    AND c_target.variable = 'person_count'
    AND c_target.active = 1
GROUP BY p.stratum_id, p_geo.value, p_target.value
LIMIT 5
"""

result = pd.read_sql_query(query, engine)
print(result)

Expected Output vs Actual Output

Expected:

  • Wyoming should have 1 congressional district child (5001800US5600)
  • That district should have ~580k people
  • Other states should have their own congressional districts as children

Actual Output:

  parent_id  parent_geo  num_children  num_unique_districts  sum_of_children  wyoming_value
        557 0400000US56           436                   436       18333697.0        30651.0
        608 0400000US56           436                   436       19799430.0        36438.0
        659 0400000US56           436                   436       21203879.0        38128.0

The sum_of_children (18-23 million) represents demographic slices of the entire US population, not Wyoming.

Additional Verification Query

This query confirms all 51 states' districts are under Wyoming:

SELECT 
    COUNT(DISTINCT SUBSTR(c_geo.value, 10, 2)) as num_states_represented,
    COUNT(DISTINCT c_geo.value) as total_districts
FROM strata p
JOIN stratum_constraints p_geo ON p.stratum_id = p_geo.stratum_id 
    AND p_geo.constraint_variable = 'ucgid_str'
    AND p_geo.value = '0400000US56'
JOIN strata c ON c.parent_stratum_id = p.stratum_id
JOIN stratum_constraints c_geo ON c.stratum_id = c_geo.stratum_id 
    AND c_geo.constraint_variable = 'ucgid_str'
    AND c_geo.value LIKE '5001800US%';

Returns:

  • num_states_represented: 51
  • total_districts: 436

Root Cause

The parent_stratum_id assignments for congressional districts are incorrectly set during database creation. All districts appear to be assigned Wyoming's stratum IDs as their parent instead of their respective state stratum IDs.

Proposed Fix

Congressional districts need to be reassigned to their correct state parents. For example:

  • 5001800US0101 (Alabama district 1) → parent should be 0400000US01 (Alabama)
  • 5001800US0601 (California district 1) → parent should be 0400000US06 (California)
  • 5001800US5600 (Wyoming at-large) → parent should be 0400000US56 (Wyoming)

The parent_stratum_id values need to be updated to point to the correct state strata based on matching the state FIPS code (characters 10-11 in the UCGID string).

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions