A SQLite Database for Calibration Targets #398

baogorek · 2025-07-29T12:57:29Z

Closes #399

The branch name is now too narrow. Concepts are:

Age
Medicaid
SNAP

And, from the IRS SOI:
EITC by number of children
AGI

and, also from the IRS SOI, turning to the logs:

Loading amount data for IRS SOI data on qbid
Loading amount data for IRS SOI data on real_estate_taxes
Loading amount data for IRS SOI data on net_capital_gain
Loading amount data for IRS SOI data on ira_payments
Loading amount data for IRS SOI data on taxable_interest
Loading amount data for IRS SOI data on tax_exempt_interest
Loading amount data for IRS SOI data on oridinary_dividends
Loading amount data for IRS SOI data on qualified_dividends
Loading amount data for IRS SOI data on partnership_and_s_crop_net_income
Loading amount data for IRS SOI data on total_social_security
Loading amount data for IRS SOI data on pension_and_annuities
Loading amount data for IRS SOI data on unemployment_compensation
Loading amount data for IRS SOI data on business_net_income
Loading amount data for IRS SOI data on medical_and_dental_deduction
Loading amount data for IRS SOI data on salt_refunds
Loading amount data for IRS SOI data on salt_amount
Loading amount data for IRS SOI data on income_tax

Note that this PR does increase build time by adding the step of building the target database. Sooner or later, we should split up these data extractions into their own pipeline, but this approach follows the current standards like the CPS and ACS, which are downloaded every time.

juaristi22 · 2025-07-29T17:36:01Z

After policyengine-us #6322 is merged, -data should be able to identify that a household belongs to hierarchical geographies from a single calculation of ucgid_str. The "in" operation in the database will denote that the code should check for partial string matching. @baogorek
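A minimal sketch of how the "in" operation might be evaluated. It assumes (this is not confirmed in the thread) that ucgid_str is a single comma-separated string listing every geography a household belongs to. The function name, the sample codes, and the set of supported operations are hypothetical.

```python
def constraint_matches(household_value: str, operation: str, constraint_value: str) -> bool:
    """Evaluate one stratum constraint against a household's value."""
    if operation == "in":
        # Partial string match: the constraint's code appears somewhere
        # inside the household's ucgid_str.
        return constraint_value in household_value
    if operation == "equals":
        return household_value == constraint_value
    raise ValueError(f"Unsupported operation: {operation!r}")


# Hypothetical household: national, state, and congressional-district codes.
household_ucgid_str = "0100000US,0400000US06,5001800US0612"

print(constraint_matches(household_ucgid_str, "in", "0400000US06"))  # True
print(constraint_matches(household_ucgid_str, "in", "0400000US48"))  # False
```

One caveat with plain substring matching: a code that is a prefix of a longer code would also match, so splitting on the delimiter and comparing whole codes may be the stricter choice.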

MaxGhenis

Could be more modular but works for now

juaristi22 · 2025-08-04T08:59:32Z

In case we want to load variables in addition to AGI, this document lists all SOI targets and the -us sim variables that represent them. It could be useful for target variable naming.

policyengine_us_data/db/load_soi_targets.py

policyengine_us_data/db/load_age_targets.py

policyengine_us_data/db/etl_snap.py

policyengine_us_data/db/etl_irs_soi.py

juaristi22 · 2025-08-15T14:25:13Z

policyengine_us_data/db/etl_irs_soi.py

+            amount_value = amount_j.iloc[i][["target_value"]].values[0]
+
+            stratum.targets_rel.append(
+                # NOTE: If I do the counts, I'm going to need to explode the strata for the vars != 0


is there a particular reason why we wouldn't explode the strata? by explode we mean make really large or break the logic?

"Explode" was used in the sense of "combinatorial explosion." For instance, take partnership_and_s_crop_net_income. We can calibrate to the aggregate amount and just have one stratum per geography (almost 500). Or, we can choose to count the number of people who, say, have positive partnership_and_s_crop_net_income, or maybe just net_income different from zero, or why not negative, zero to 1 dollar, 1 dollar to 10,000 dollars, ... etc? There's really no right answer, at least not with the information we have now.

A huge benefit though to just using partnership_and_s_crop_net_income is that I can reuse the base geographic strata; it's a sum over that quantity for everyone in that geography (including the 0s). If I want to create bins, there are two ways to achieve this: 1) create new variables in -us to sum over and reuse the base geographic strata, 2) create new strata for these ranges and make the variable person_count or tax_unit_count (variables that are in -us). 1) is probably not going to fly so we're looking at creating a whole bunch of new strata.
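To make the trade-off concrete, here is a toy sketch of the two kinds of targets. The records, geography labels, and bin definitions are invented for illustration and are not taken from the PR.

```python
from collections import defaultdict

# Toy tax-unit records; geographies and amounts are made up.
records = [
    {"geo": "A", "income": 0.0},
    {"geo": "A", "income": 5_000.0},
    {"geo": "A", "income": -2_000.0},
    {"geo": "B", "income": 12_000.0},
    {"geo": "B", "income": 0.0},
]


def amount_targets(rows):
    """One target per geography: the sum over everyone, zeros included."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["geo"]] += row["income"]
    return dict(totals)


# Hypothetical bins; each one would need its own stratum per geography.
BINS = [
    ("negative", lambda x: x < 0),
    ("zero", lambda x: x == 0),
    ("positive", lambda x: x > 0),
]


def count_targets(rows):
    """One target per (geography, bin) stratum: a count of units in the bin."""
    counts = defaultdict(int)
    for row in rows:
        for label, in_bin in BINS:
            if in_bin(row["income"]):
                counts[(row["geo"], label)] += 1
    return dict(counts)


amounts = amount_targets(records)
counts = count_targets(records)
print(len(amounts), "amount strata vs up to", len(amounts) * len(BINS), "count strata")
```

With roughly 500 geographies, each extra bin adds on the order of 500 more strata per variable, which is the explosion described above.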

Maybe once things get going, we'll realize they're not all that bad to deal with and we can come back for them. But the complexity is already high and time is limited. We can always come back.

I should note too that the IRS SOI variables could also use the AGI splits for every variable. This would not be as hard (those strata could also be reused) but it will create many many more strata and will probably lead to zero cells in certain districts.

juaristi22

So nice to see so many targets getting loaded! should we make uploading the database to hugging face a step in CI following dataset standards?

Update: this is done yay

policyengine_us_data/db/etl_snap.py

policyengine_us_data/tests/test_database.py

juaristi22 · 2025-08-18T17:40:17Z

policyengine_us_data/db/etl_irs_soi.py

+    temp_df = df[["ucgid_str"]].copy()
+    temp_df["breakdown_variable"] = "one"
+    temp_df["breakdown_value"] = 1
+    temp_df["target_variable"] = "agi"


variables that do not exist in -us and need to be renamed:

agi --> adjusted_gross_income

eitc_children --> eitc_child_count

oridinary_dividends --> ordinary_dividends (typo)

business_net_income --> self_employment_income

ira_payments --> taxable_ira_distributions or tax_exempt_ira_distributions

medical_and_dental_deduction --> medical_expense_deduction

partnership_and_s_crop_net_income --> partnership_s_corp_income or tax_unit_partnership_s_corp_income

pension_and_annuities --> taxable_pension_income

qbid --> qualified_business_income_deduction

qualified_dividends --> qualified_dividend_income

salt_amount --> state_income_tax (if referring to state and local income taxes amount)

salt_refunds --> not sure what the correct var name is to represent this

tax_exempt_interest --> tax_exempt_interest_income_amount

taxable_interest --> taxable_interest_income

total_social_security --> taxable_social_security

some of these you may have to choose from given your best understanding of the target, because -us has more than one variable with matching names and I'm not sure which is the right one

most if not all of these are mapped in this document, make sure to review it in case i got something wrong
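The suggestions above could be captured as a lookup table, roughly as follows. This is only a sketch of the renames as proposed at this point in the review (several were revised later in the thread); the names RENAMES, CANDIDATES, and resolve are hypothetical, and salt_amount and salt_refunds are left out because no single replacement was proposed for them.

```python
# Unambiguous renames proposed in the review comment.
RENAMES = {
    "agi": "adjusted_gross_income",
    "eitc_children": "eitc_child_count",
    "oridinary_dividends": "ordinary_dividends",
    "business_net_income": "self_employment_income",
    "medical_and_dental_deduction": "medical_expense_deduction",
    "pension_and_annuities": "taxable_pension_income",
    "qbid": "qualified_business_income_deduction",
    "qualified_dividends": "qualified_dividend_income",
    "tax_exempt_interest": "tax_exempt_interest_income_amount",
    "taxable_interest": "taxable_interest_income",
    "total_social_security": "taxable_social_security",
}

# Names with more than one candidate; a person has to choose.
CANDIDATES = {
    "ira_payments": ["taxable_ira_distributions", "tax_exempt_ira_distributions"],
    "partnership_and_s_crop_net_income": [
        "partnership_s_corp_income",
        "tax_unit_partnership_s_corp_income",
    ],
}


def resolve(name: str) -> str:
    """Return the proposed -us name, or raise if the choice is ambiguous."""
    if name in CANDIDATES:
        raise LookupError(f"{name!r} is ambiguous; candidates: {CANDIDATES[name]}")
    # Names without a proposed rename pass through unchanged.
    return RENAMES.get(name, name)


print(resolve("agi"))         # adjusted_gross_income
print(resolve("income_tax"))  # income_tax
```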

Thanks for catching this. The good news is that it led to new levels of validation both at the time of data load and after all data has been loaded. At load time, an enum is created only at the ORM level from the policyengine_us variables.
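The load-time check can be pictured along these lines. This is a standard-library sketch only: the real code builds its enum at the ORM level from the policyengine_us variables, whereas KNOWN_VARIABLES here is a short stand-in list and validate_target_variable is a hypothetical helper.

```python
from enum import Enum

# Stand-in for the variable names that would come from policyengine_us.
KNOWN_VARIABLES = ["adjusted_gross_income", "eitc", "income_tax"]

# Functional Enum API: member name and value are both the variable name.
VariableName = Enum("VariableName", {name: name for name in KNOWN_VARIABLES})


def validate_target_variable(name: str) -> VariableName:
    """Return the enum member for a known variable, or raise ValueError."""
    try:
        return VariableName(name)
    except ValueError as exc:
        raise ValueError(f"{name!r} is not a known model variable") from exc


print(validate_target_variable("eitc").value)  # eitc
```

Under this scheme a stale name such as agi is rejected before it reaches the database, instead of surfacing later during calibration.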

@MaxGhenis mind taking a look at the variables @juaristi22 found and look at the version I ended up with?

TARGETS = [
    dict(code="59661", name="eitc", breakdown=("eitc_child_count", 0)),
    dict(code="59662", name="eitc", breakdown=("eitc_child_count", 1)),
    dict(code="59663", name="eitc", breakdown=("eitc_child_count", 2)),
    dict(code="59664", name="eitc", breakdown=("eitc_child_count", "3+")),
    dict(
        code="59664",
        name="qualified_business_income_deduction",
        breakdown=None,
    ),
    dict(code="18500", name="real_estate_taxes", breakdown=None),
    dict(code="01000", name="net_capital_gain", breakdown=None),
    dict(code="03150", name="retirement_distributions", breakdown=None),
    dict(code="00300", name="taxable_interest_income", breakdown=None),
    dict(code="00400", name="tax_exempt_interest_income", breakdown=None),
    dict(
        code="00600", name="non_qualified_dividend_income", breakdown=None
    ),
    dict(code="00650", name="qualified_dividend_income", breakdown=None),
    dict(
        code="26270",
        name="partnership_s_corp_income",
        breakdown=None,
    ),
    dict(code="02500", name="social_security", breakdown=None),
    dict(code="02300", name="unemployment_compensation", breakdown=None),
    dict(code="00700", name="salt_refund_income", breakdown=None),
    dict(code="18425", name="reported_salt", breakdown=None),
    dict(code="06500", name="income_tax", breakdown=None),
]

agi is not there but that's been changed to adjusted_gross_income later in the code

I left out the following because I wasn't sure there was a match:

medical and dental

pension and annuity

I also just used the social security variable.

I used a variable called "retirement distributions"

salt_amount would be more than state_income_tax - we should have another variable that includes real estate taxes, local income taxes, and state+local sales tax (you can take either but not both income or sales tax)
salt_refunds - we don't model refunds (would require knowing the timing of tax payments throughout the year)
total_social_security would be social_security not taxable_social_security - check how the source defines it

medical expense deduction is in the model

i think pension and annuity may also be in the model, at least when pavel reviewed my variable names document i had taxable_pension_income as the variable that maps to it

FYI, Pavel has weighed in on the documentation Maria shared, which I didn't realize at first (I thought it was the stock documentation). After spending time with that, Codex, and the above comments:

I think we're good with salt for state and local taxes: Codex: "The salt variable aggregates state and local income or sales tax with real estate taxes, pulling its components from the specified 'salt_and_real_estate' sources list"

I will remove salt_refund_income from the list

Moving from person-level partnership_s_corp_income to tax unit level tax_unit_partnership_s_corp_income

I will move to taxable_social_security

The IRS variable is Medical + Dental deduction, but I will include medical_deduction as Max believes it does include dental.

Moving from retirement_distributions to taxable_ira_distributions

I will use taxable_pension_income as I see that it includes annuity income.

@MaxGhenis, @juaristi22, I spent a lot of time with the LLMs. Added refundable_ctc. Changed the IRS variable I'm pulling in for SALT and added a comment about our omission of personal property taxes. I almost merged, but I thought I'd leave it open for one more night. Maria, you might find some other issue.

TARGETS = [
    dict(code="59661", name="eitc", breakdown=("eitc_child_count", 0)),
    dict(code="59662", name="eitc", breakdown=("eitc_child_count", 1)),
    dict(code="59663", name="eitc", breakdown=("eitc_child_count", 2)),
    dict(code="59664", name="eitc", breakdown=("eitc_child_count", "3+")),
    dict(
        code="04475",
        name="qualified_business_income_deduction",
        breakdown=None,
    ),
    dict(code="18500", name="real_estate_taxes", breakdown=None),
    dict(code="01000", name="net_capital_gain", breakdown=None),
    dict(code="01400", name="taxable_ira_distributions", breakdown=None),
    dict(code="00300", name="taxable_interest_income", breakdown=None),
    dict(code="00400", name="tax_exempt_interest_income", breakdown=None),
    dict(code="00600", name="dividend_income", breakdown=None),
    dict(code="00650", name="qualified_dividend_income", breakdown=None),
    dict(
        code="26270",
        name="tax_unit_partnership_s_corp_income",
        breakdown=None,
    ),
    dict(code="02500", name="taxable_social_security", breakdown=None),
    dict(code="02300", name="unemployment_compensation", breakdown=None),
    dict(code="17000", name="medical_expense_deduction", breakdown=None),
    dict(code="01700", name="taxable_pension_income", breakdown=None),
    dict(code="11070", name="refundable_ctc", breakdown=None),
    # NOTE: A18460 is the capped SALT deduction and matches the `salt` variable.
    # Our SALT base currently excludes personal property taxes (not modeled yet),
    # so amounts may be slightly below IRS totals.
    dict(code="18460", name="salt", breakdown=None),
    dict(code="06500", name="income_tax", breakdown=None),
]
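Comparing the two versions of the list, the earlier one uses code 59664 for both the EITC "3+" row and qualified_business_income_deduction, while this one uses 04475 for the latter. A small check like the sketch below would flag that kind of reuse. The helper name duplicate_codes is hypothetical, and the excerpts contain only a few rows copied from the lists in this thread.

```python
from collections import defaultdict


def duplicate_codes(targets):
    """Return {code: [names]} for codes shared by more than one variable name."""
    names_by_code = defaultdict(set)
    for target in targets:
        names_by_code[target["code"]].add(target["name"])
    return {
        code: sorted(names)
        for code, names in names_by_code.items()
        if len(names) > 1
    }


first_draft_excerpt = [
    dict(code="59664", name="eitc", breakdown=("eitc_child_count", "3+")),
    dict(code="59664", name="qualified_business_income_deduction", breakdown=None),
    dict(code="18500", name="real_estate_taxes", breakdown=None),
]
revised_excerpt = [
    dict(code="59664", name="eitc", breakdown=("eitc_child_count", "3+")),
    dict(code="04475", name="qualified_business_income_deduction", breakdown=None),
    dict(code="18500", name="real_estate_taxes", breakdown=None),
]

print(duplicate_codes(first_draft_excerpt))
print(duplicate_codes(revised_excerpt))
```

Grouping by name means the four eitc rows, which legitimately differ only in their breakdown, would not be reported as duplicates of one another.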

juaristi22 · 2025-08-20T08:54:59Z

I tried running calibration with the new dataset and found this:

from policyengine_us import Microsimulation

sim = Microsimulation(dataset="hf://policyengine/policyengine-us-data/cps_2023.h5")
sim.default_period = 2023
print(sim.calculate("tax_unit_partnership_s_corp_income").sum())

--
0.0

Seems like the calculation of this variable on the rules engine side is either not implemented, has an error, or something else i am unaware of, which leads to the estimates of all the corp income targets being 0 and thus, never improving with calibration. This is also the case with the partnership_s_corp_income variable.

All other variables seem to be loading fine, though eitc by child count at the national level is not calibrating properly. I will investigate further, but for now, i'd only worry about tax_unit_partnership_s_corp_income

baogorek · 2025-08-20T15:43:08Z

sim = Microsimulation(dataset="hf://policyengine/policyengine-us-data/cps_2023.h5")

Good eyes here, @juaristi22, but I think this is because this variable (very related to our conversation) is zero until the PUF comes in to create the enhanced_cps:


In [1]: from policyengine_us import Microsimulation

In [2]: sim = Microsimulation(dataset="hf://policyengine/policyengine-us-data/enhanced_cps_2024.h5")

In [3]: sim23 = Microsimulation(dataset="hf://policyengine/policyengine-us-data/cps_2023.h5")

In [4]: sim.calculate('tax_unit_partnership_s_corp_income').sum() / 1E9
Out[4]: np.float64(1023.7888367125104)

In [5]: sim23.calculate('tax_unit_partnership_s_corp_income').sum() / 1E9
Out[5]: np.float64(0.0)
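Since a target whose simulated total is exactly zero can never improve under calibration, a pre-calibration check along these lines may be useful. This is a sketch only: zero_total_variables is a hypothetical helper, and the totals are hard-coded from the two values reported in this thread so the example runs without policyengine_us installed.

```python
from typing import Callable, Iterable, List


def zero_total_variables(
    total_of: Callable[[str], float], variables: Iterable[str]
) -> List[str]:
    """Return the variables whose total is exactly zero in a dataset."""
    return [name for name in variables if total_of(name) == 0]


# Totals reported above, keyed by dataset file name (without extension).
reported_totals = {
    "cps_2023": {"tax_unit_partnership_s_corp_income": 0.0},
    "enhanced_cps_2024": {
        "tax_unit_partnership_s_corp_income": 1023.7888367125104e9
    },
}

flagged = {
    dataset: zero_total_variables(totals.__getitem__, totals)
    for dataset, totals in reported_totals.items()
}
print(flagged)

# With a real simulation, total_of could be something like:
#   lambda name: sim.calculate(name).sum()
```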

juaristi22

So exciting!

baogorek added 4 commits July 29, 2025 08:56

first round of eitc targets are added

cd6cf58

linting

867bec6

changelog_entry.yaml

c2dd4af

merging main

9f7f674

baogorek marked this pull request as ready for review July 29, 2025 13:13

baogorek requested a review from MaxGhenis July 29, 2025 13:14

MaxGhenis approved these changes Jul 29, 2025

baogorek added 2 commits August 2, 2025 07:56

Merge branch 'main' of github.com:PolicyEngine/policyengine-us-data i…

92979ca

…nto treasury

new file in progress

95a4a9a

moving to QBID and SALT

6fd3542

juaristi22 reviewed Aug 7, 2025

policyengine_us_data/db/load_soi_targets.py (outdated, resolved)

juaristi22 reviewed Aug 7, 2025

policyengine_us_data/db/load_age_targets.py (outdated, resolved)

baogorek added 2 commits August 7, 2025 13:50

new variables added

c73ef87

medicaid etl file

57d9850

MaxGhenis marked this pull request as draft August 8, 2025 21:24

baogorek added 8 commits August 10, 2025 21:44

merging main

33cf8e9

medicaid is loading in

9c4838e

medicaid and some SNAP data

57716f2

got SNAP settled

7b3cacc

progress

e45072e

all major targets loaded

6d482e7

linting

dddf689

fixed national stratum in agi script

9c3a460

juaristi22 reviewed Aug 15, 2025

policyengine_us_data/db/etl_snap.py (outdated, resolved)

baogorek added 4 commits August 15, 2025 08:43

refactor: use sqlmodel session

81e2011

storage file updates

d5b3571

Merge branch 'treasury' into codex/fix-orm-inconsistencies-and-improv…

a729450

…e-data-loading

Merge pull request #430 from PolicyEngine/codex/fix-orm-inconsistenci…

26b561b

…es-and-improve-data-loading refactor: use SQLModel session

baogorek changed the title ~~Add Treasury EITC targets to targets database~~ A SQLite Database for Calibration Targets Aug 15, 2025

baogorek added 3 commits August 15, 2025 09:35

adding make database to reusable test. Updating changelog_entry

b376726

removing TODOs

9078ed9

Removed troublesome logging. Updated Makefile

9913e3c

baogorek requested review from MaxGhenis and juaristi22 August 15, 2025 14:05

baogorek marked this pull request as ready for review August 15, 2025 14:05

juaristi22 reviewed Aug 15, 2025

policyengine_us_data/db/etl_irs_soi.py (outdated, resolved)

juaristi22 reviewed Aug 15, 2025

updated comments based on feedback. Removed old make target

fddc3ac

baogorek requested a review from juaristi22 August 18, 2025 14:19

juaristi22 reviewed Aug 18, 2025

policyengine_us_data/db/etl_snap.py (resolved)

juaristi22 approved these changes Aug 18, 2025

baogorek added 2 commits August 18, 2025 12:52

test: move database tests into package

0cf920a

Merge pull request #433 from PolicyEngine/codex/add-tests-for-policye…

35f78cd

…ngine-sqlite-database test: add database creation tests

juaristi22 requested changes Aug 18, 2025

baogorek added 4 commits August 18, 2025 14:31

Add Great Expectations validation for database

0571ff5

Merge pull request #434 from PolicyEngine/codex/add-great-expectation…

648eabf

…s-tests-to-make-database Add Great Expectations validation for database build

working pre lint

bdef501

post lint

d295926

baogorek requested a review from juaristi22 August 18, 2025 23:02

updating IRS target variables

bd104d0

baogorek closed this Aug 20, 2025

baogorek reopened this Aug 20, 2025

juaristi22 approved these changes Aug 20, 2025

changing the salt variable to uncapped

2875311

baogorek merged commit 4a395f8 into main Aug 20, 2025
6 checks passed

4 participants
