Skip to content

Conversation

@baogorek
Copy link
Collaborator

@baogorek baogorek commented Jul 29, 2025

Closes #399

The branch name is now too narrow. Concepts are:

Age
Medicaid
SNAP

And, from the IRS SOI:
ETIC by number of children
AGI

and, also from the IRS SOI, turning to the logs:

Loading amount data for IRS SOI data on qbid
Loading amount data for IRS SOI data on real_estate_taxes
Loading amount data for IRS SOI data on net_capital_gain
Loading amount data for IRS SOI data on ira_payments
Loading amount data for IRS SOI data on taxable_interest
Loading amount data for IRS SOI data on tax_exempt_interest
Loading amount data for IRS SOI data on oridinary_dividends
Loading amount data for IRS SOI data on qualified_dividends
Loading amount data for IRS SOI data on partnership_and_s_crop_net_income
Loading amount data for IRS SOI data on total_social_security
Loading amount data for IRS SOI data on pension_and_annuities
Loading amount data for IRS SOI data on unemployment_compensation
Loading amount data for IRS SOI data on business_net_income
Loading amount data for IRS SOI data on medical_and_dental_deduction
Loading amount data for IRS SOI data on salt_refunds
Loading amount data for IRS SOI data on salt_amount
Loading amount data for IRS SOI data on income_tax

Note that this PR does increase build time by adding the step of building the target database. Sooner or later, we should split up these data extractions into their own pipeline, but this approach follows the current standards like the CPS and ACS, which are downloaded every time.

@baogorek baogorek marked this pull request as ready for review July 29, 2025 13:13
@baogorek baogorek requested a review from MaxGhenis July 29, 2025 13:14
@juaristi22
Copy link
Collaborator

After policyengine-us #6322 is merged, -data should be able to identify that a household belongs to hierarchical geographies from a single calculation of ucgid_str. The "in" operation in the database will denote that the code should check for partial string matching. @baogorek

Copy link
Contributor

@MaxGhenis MaxGhenis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could be more modular but works for now

@juaristi22
Copy link
Collaborator

juaristi22 commented Aug 4, 2025

in case we want to load variables in addition to agi this document has the list of all soi targets and the -us sim variables that represent them, could be useful for target variable naming

@MaxGhenis MaxGhenis marked this pull request as draft August 8, 2025 21:24
@baogorek baogorek changed the title Add Treasury EITC targets to targets database A SQLite Database for Calibration Targets Aug 15, 2025
@baogorek baogorek marked this pull request as ready for review August 15, 2025 14:05
amount_value = amount_j.iloc[i][["target_value"]].values[0]

stratum.targets_rel.append(
# NOTE: If I do the counts, I'm going to need to explode the strata for the vars != 0
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a particular reason why we wouldn't explode the strata? by explode we mean make really large or break the logic?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Explode" was used in the sense of "combinatorial explosion." For instance, take partnership_and_s_crop_net_income. We can calibrate to aggregate amount and just have one stratum per geography (almost 500). Or, we can choose to count the number of people who, say have positive partnership_and_s_crop_net_income, or maybe just net_income different from zero, or why not negative, zero to 1 dollar, 1 dollar to 10,000 dollars, ... etc? There's really no right answer, at least not with the information we have now.

A huge benefit though to just using partnership_and_s_crop_net_income is that I can reuse the base geographic strata; it's a sum over that quantity for everyone in that geography (including the 0s). If I want to create bins, there are two ways to achieve this: 1) create new variables in -us to sum over and reuse the base geographic strata, 2) create new strata for these ranges and make the variable person_count or tax_unit_count (variables that are in -us). 1) is probably not going to fly so we're looking at creating a whole bunch of new strata.

Maybe once things get going, we'll realize they're not all that bad to deal with and we can come back for them. But the complexity is already high and time is limited. We can always come back.

I should note too that the IRS SOI variables could also use the AGI splits for every variable. This would not be as hard (those strata could also be reused) but it will create many many more strata and will probably lead to zero cells in certain districts.

Copy link
Collaborator

@juaristi22 juaristi22 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So nice to see so many targets getting loaded! should we make uploading the database to hugging face a step in CI following dataset standards?

Update: this is done yay

@baogorek baogorek requested a review from juaristi22 August 18, 2025 14:19
temp_df = df[["ucgid_str"]].copy()
temp_df["breakdown_variable"] = "one"
temp_df["breakdown_value"] = 1
temp_df["target_variable"] = "agi"
Copy link
Collaborator

@juaristi22 juaristi22 Aug 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

variables that do not exist in -us and need to be renamed:

  • agi --> adjusted_gross_income
  • eitc_children --> eitc_child_count
  • oridinary_dividends --> ordinary_dividends (typo)
  • business_net_income --> self_employment_income
  • ira_payments --> taxable_ira_distributions or tax_exempt_ira_distributions
  • medical_and_dental_deduction --> medical_expense_deduction
  • partnership_and_s_crop_net_income --> partnership_s_corp_income or tax_unit_partnership_s_corp_income
  • pension_and_annuities --> taxable_pension_income
  • qbid --> qualified_business_income_deduction
  • qualified_dividends --> qualified_dividend_income
  • salt_amount --> state_income_tax (if refering to state and local income taxes amount)
  • salt_refunds --> not sure what the correct var name is to represent this
  • tax_exempt_interest --> tax_exempt_interest_income_amount
  • taxable_interest --> taxable_interest_income
  • total_social_security --> taxable_social_security

some of these you may have to choose from given your best undersanding of the target because -us has more than one variable with matching names and im not sure which is the right one

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

most if not all of these are mapped in this document, make sure to review it in case i got something wrong

Copy link
Collaborator Author

@baogorek baogorek Aug 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for catching this. The good news is that it led to new levels of validation both at the time of data load and after all data has been loaded. At load time, an enum is created only at the ORM level from the policyengine_us variables.

@MaxGhenis mind taking a look at the variables @juaristi22 found and look at the version I ended up with?

    TARGETS = [
        dict(code="59661", name="eitc", breakdown=("eitc_child_count", 0)),
        dict(code="59662", name="eitc", breakdown=("eitc_child_count", 1)),
        dict(code="59663", name="eitc", breakdown=("eitc_child_count", 2)),
        dict(code="59664", name="eitc", breakdown=("eitc_child_count", "3+")),
        dict(
            code="59664",
            name="qualified_business_income_deduction",
            breakdown=None,
        ),
        dict(code="18500", name="real_estate_taxes", breakdown=None),
        dict(code="01000", name="net_capital_gain", breakdown=None),
        dict(code="03150", name="retirement_distributions", breakdown=None),
        dict(code="00300", name="taxable_interest_income", breakdown=None),
        dict(code="00400", name="tax_exempt_interest_income", breakdown=None),
        dict(
            code="00600", name="non_qualified_dividend_income", breakdown=None
        ),
        dict(code="00650", name="qualified_dividend_income", breakdown=None),
        dict(
            code="26270",
            name="partnership_s_corp_income",
            breakdown=None,
        ),
        dict(code="02500", name="social_security", breakdown=None),
        dict(code="02300", name="unemployment_compensation", breakdown=None),
        dict(code="00700", name="salt_refund_income", breakdown=None),
        dict(code="18425", name="reported_salt", breakdown=None),
        dict(code="06500", name="income_tax", breakdown=None),
    ]
  • agi is not there but that's been changed to adjusted_gross_income later in the code
  • I left out the following because I wasn't sure there was a match:
    • medical and dental
    • pension and annuity
  • I also just used the social security variable.
  • I used a variable called "retirement distributions"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

salt_amount would be more than state_income_tax - we should have another variable that includes real estate taxes, local income taxes, and state+local sales tax (you can take either but not both income or sales tax)
salt_refunds - we don't model refunds (would require knowing the timing of tax payments throughout the year)
total_social_security would be social_security not taxable_social_security - check how the source defines it

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medical expense deduction is in the model

Copy link
Collaborator

@juaristi22 juaristi22 Aug 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think pension and annuity may also be in the model, at least when pavel reviewed my variable names document i had taxable_pension_income as the variable that maps to it

Copy link
Collaborator Author

@baogorek baogorek Aug 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI Pavel has weighed in in the documentation Maria shared, which I didn't realize at first (thought it was the stock documentation. After time with that, Codex, and the above comments:

  1. I think we're good with salt for state and local taxes: Codex: "The salt variable aggregates state and local income or sales tax with real estate taxes, pulling its components from the specified 'salt_and_real_estate' sources list"
  2. I will remove salt_refund_income from the list
  3. Moving from person-level partnership_s_corp_income to tax unit level tax_unit_partnership_s_corp_income
  4. I will move to taxable_social_security
  5. The IRS variable is Medical + Dental deduction, but I will include medical_deduction as Max believes it does include dental.
  6. Moving from retirement_distributions to taxable_ira_distributions
  7. I will use taxable_pension_income as I see that it includes annuity income.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MaxGhenis , @juaristi22 , spent a lot of time with the LLMs. Added refundable_ctc. Changed the IRS variable I'm pulling in for SALT and added a comment about our omission of person property taxes. I almost merged, but I thought I'd leave it open for one more night. Maria, you might find some other issue.

    TARGETS = [
        dict(code="59661", name="eitc", breakdown=("eitc_child_count", 0)),
        dict(code="59662", name="eitc", breakdown=("eitc_child_count", 1)),
        dict(code="59663", name="eitc", breakdown=("eitc_child_count", 2)),
        dict(code="59664", name="eitc", breakdown=("eitc_child_count", "3+")),
        dict(
            code="04475",
            name="qualified_business_income_deduction",
            breakdown=None,
        ),
        dict(code="18500", name="real_estate_taxes", breakdown=None),
        dict(code="01000", name="net_capital_gain", breakdown=None),
        dict(code="01400", name="taxable_ira_distributions", breakdown=None),
        dict(code="00300", name="taxable_interest_income", breakdown=None),
        dict(code="00400", name="tax_exempt_interest_income", breakdown=None),
        dict(code="00600", name="dividend_income", breakdown=None),
        dict(code="00650", name="qualified_dividend_income", breakdown=None),
        dict(
            code="26270",
            name="tax_unit_partnership_s_corp_income",
            breakdown=None,
        ),
        dict(code="02500", name="taxable_social_security", breakdown=None),
        dict(code="02300", name="unemployment_compensation", breakdown=None),
        dict(code="17000", name="medical_expense_deduction", breakdown=None),
        dict(code="01700", name="taxable_pension_income", breakdown=None),
        dict(code="11070", name="refundable_ctc", breakdown=None),
        # NOTE: A18460 is the capped SALT deduction and matches the `salt` variable.
        # Our SALT base currently excludes personal property taxes (not modeled yet),
        # so amounts may be slightly below IRS totals.
        dict(code="18460", name="salt", breakdown=None),
        dict(code="06500", name="income_tax", breakdown=None),
    ]

@baogorek baogorek requested a review from juaristi22 August 18, 2025 23:02
@juaristi22
Copy link
Collaborator

juaristi22 commented Aug 20, 2025

I tried running calibration with the new dataset and found this:

from policyengine_us import Microsimulation

    sim = Microsimulation(dataset="hf://policyengine/policyengine-us-data/cps_2023.h5")
    sim.default_period = 2023
    print(sim.calculate("tax_unit_partnership_s_corp_income").sum())

-- 
0.0

Seems like the calculation of this variable on the rules engine side is either not implemented, has an error, or something else i am unaware of, which leads to the estimates of all the corp income targets being 0 and thus, never improving with calibration. This is also the case with the partnership_s_corp_income variable.

All other variables seem to be loading fine, though eitc by child count at the national level is not calibrating properly. I will investigate further, but for now, i'd only worry about tax_unit_partnership_s_corp_income

@baogorek
Copy link
Collaborator Author

baogorek commented Aug 20, 2025

sim = Microsimulation(dataset="hf://policyengine/policyengine-us-data/cps_2023.h5")

Good eyes here, @juaristi22, but I think this is because this variable (very related to our conversation) is zero until the PUF comes in to create the enhanced_cps:


In [1]: from policyengine_us import Microsimulation

In [2]: sim = Microsimulation(dataset="hf://policyengine/policyengine-us-data/enhanced_cps_2024.h5")

In [3]: sim23 = Microsimulation(dataset="hf://policyengine/policyengine-us-data/cps_2023.h5")

In [4]: sim.calculate('tax_unit_partnership_s_corp_income').sum() / 1E9
Out[4]: np.float64(1023.7888367125104)

In [5]: sim23.calculate('tax_unit_partnership_s_corp_income').sum() / 1E9
Out[5]: np.float64(0.0)

@baogorek baogorek closed this Aug 20, 2025
@baogorek baogorek reopened this Aug 20, 2025
Copy link
Collaborator

@juaristi22 juaristi22 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So exciting!

@baogorek baogorek merged commit 4a395f8 into main Aug 20, 2025
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add National EITC targets to target database

4 participants