-
Notifications
You must be signed in to change notification settings - Fork 10
A SQLite Database for Calibration Targets #398
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
After policyengine-us #6322 is merged, -data should be able to identify that a household belongs to hierarchical geographies from a single calculation of |
MaxGhenis
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could be more modular but works for now
|
in case we want to load variables in addition to agi this document has the list of all soi targets and the -us sim variables that represent them, could be useful for target variable naming |
…es-and-improve-data-loading refactor: use SQLModel session
| amount_value = amount_j.iloc[i][["target_value"]].values[0] | ||
|
|
||
| stratum.targets_rel.append( | ||
| # NOTE: If I do the counts, I'm going to need to explode the strata for the vars != 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there a particular reason why we wouldn't explode the strata? by explode we mean make really large or break the logic?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Explode" was used in the sense of "combinatorial explosion." For instance, take partnership_and_s_crop_net_income. We can calibrate to aggregate amount and just have one stratum per geography (almost 500). Or, we can choose to count the number of people who, say have positive partnership_and_s_crop_net_income, or maybe just net_income different from zero, or why not negative, zero to 1 dollar, 1 dollar to 10,000 dollars, ... etc? There's really no right answer, at least not with the information we have now.
A huge benefit though to just using partnership_and_s_crop_net_income is that I can reuse the base geographic strata; it's a sum over that quantity for everyone in that geography (including the 0s). If I want to create bins, there are two ways to achieve this: 1) create new variables in -us to sum over and reuse the base geographic strata, 2) create new strata for these ranges and make the variable person_count or tax_unit_count (variables that are in -us). 1) is probably not going to fly so we're looking at creating a whole bunch of new strata.
Maybe once things get going, we'll realize they're not all that bad to deal with and we can come back for them. But the complexity is already high and time is limited. We can always come back.
I should note too that the IRS SOI variables could also use the AGI splits for every variable. This would not be as hard (those strata could also be reused) but it will create many many more strata and will probably lead to zero cells in certain districts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So nice to see so many targets getting loaded! should we make uploading the database to hugging face a step in CI following dataset standards?
Update: this is done yay
…ngine-sqlite-database test: add database creation tests
| temp_df = df[["ucgid_str"]].copy() | ||
| temp_df["breakdown_variable"] = "one" | ||
| temp_df["breakdown_value"] = 1 | ||
| temp_df["target_variable"] = "agi" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
variables that do not exist in -us and need to be renamed:
-
agi-->adjusted_gross_income -
eitc_children-->eitc_child_count -
oridinary_dividends-->ordinary_dividends(typo) -
business_net_income-->self_employment_income -
ira_payments-->taxable_ira_distributionsortax_exempt_ira_distributions -
medical_and_dental_deduction-->medical_expense_deduction -
partnership_and_s_crop_net_income-->partnership_s_corp_incomeortax_unit_partnership_s_corp_income -
pension_and_annuities-->taxable_pension_income -
qbid-->qualified_business_income_deduction -
qualified_dividends-->qualified_dividend_income -
salt_amount-->state_income_tax(if refering to state and local income taxes amount) -
salt_refunds--> not sure what the correct var name is to represent this -
tax_exempt_interest-->tax_exempt_interest_income_amount -
taxable_interest-->taxable_interest_income -
total_social_security-->taxable_social_security
some of these you may have to choose from given your best undersanding of the target because -us has more than one variable with matching names and im not sure which is the right one
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
most if not all of these are mapped in this document, make sure to review it in case i got something wrong
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for catching this. The good news is that it led to new levels of validation both at the time of data load and after all data has been loaded. At load time, an enum is created only at the ORM level from the policyengine_us variables.
@MaxGhenis mind taking a look at the variables @juaristi22 found and look at the version I ended up with?
TARGETS = [
dict(code="59661", name="eitc", breakdown=("eitc_child_count", 0)),
dict(code="59662", name="eitc", breakdown=("eitc_child_count", 1)),
dict(code="59663", name="eitc", breakdown=("eitc_child_count", 2)),
dict(code="59664", name="eitc", breakdown=("eitc_child_count", "3+")),
dict(
code="59664",
name="qualified_business_income_deduction",
breakdown=None,
),
dict(code="18500", name="real_estate_taxes", breakdown=None),
dict(code="01000", name="net_capital_gain", breakdown=None),
dict(code="03150", name="retirement_distributions", breakdown=None),
dict(code="00300", name="taxable_interest_income", breakdown=None),
dict(code="00400", name="tax_exempt_interest_income", breakdown=None),
dict(
code="00600", name="non_qualified_dividend_income", breakdown=None
),
dict(code="00650", name="qualified_dividend_income", breakdown=None),
dict(
code="26270",
name="partnership_s_corp_income",
breakdown=None,
),
dict(code="02500", name="social_security", breakdown=None),
dict(code="02300", name="unemployment_compensation", breakdown=None),
dict(code="00700", name="salt_refund_income", breakdown=None),
dict(code="18425", name="reported_salt", breakdown=None),
dict(code="06500", name="income_tax", breakdown=None),
]
- agi is not there but that's been changed to
adjusted_gross_incomelater in the code - I left out the following because I wasn't sure there was a match:
- medical and dental
- pension and annuity
- I also just used the social security variable.
- I used a variable called "retirement distributions"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
salt_amount would be more than state_income_tax - we should have another variable that includes real estate taxes, local income taxes, and state+local sales tax (you can take either but not both income or sales tax)
salt_refunds - we don't model refunds (would require knowing the timing of tax payments throughout the year)
total_social_security would be social_security not taxable_social_security - check how the source defines it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
medical expense deduction is in the model
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think pension and annuity may also be in the model, at least when pavel reviewed my variable names document i had taxable_pension_income as the variable that maps to it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI Pavel has weighed in in the documentation Maria shared, which I didn't realize at first (thought it was the stock documentation. After time with that, Codex, and the above comments:
- I think we're good with
saltfor state and local taxes: Codex: "The salt variable aggregates state and local income or sales tax with real estate taxes, pulling its components from the specified 'salt_and_real_estate' sources list" - I will remove
salt_refund_incomefrom the list - Moving from person-level partnership_s_corp_income to tax unit level tax_unit_partnership_s_corp_income
- I will move to
taxable_social_security - The IRS variable is Medical + Dental deduction, but I will include medical_deduction as Max believes it does include dental.
- Moving from
retirement_distributionstotaxable_ira_distributions - I will use
taxable_pension_incomeas I see that it includes annuity income.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@MaxGhenis , @juaristi22 , spent a lot of time with the LLMs. Added refundable_ctc. Changed the IRS variable I'm pulling in for SALT and added a comment about our omission of person property taxes. I almost merged, but I thought I'd leave it open for one more night. Maria, you might find some other issue.
TARGETS = [
dict(code="59661", name="eitc", breakdown=("eitc_child_count", 0)),
dict(code="59662", name="eitc", breakdown=("eitc_child_count", 1)),
dict(code="59663", name="eitc", breakdown=("eitc_child_count", 2)),
dict(code="59664", name="eitc", breakdown=("eitc_child_count", "3+")),
dict(
code="04475",
name="qualified_business_income_deduction",
breakdown=None,
),
dict(code="18500", name="real_estate_taxes", breakdown=None),
dict(code="01000", name="net_capital_gain", breakdown=None),
dict(code="01400", name="taxable_ira_distributions", breakdown=None),
dict(code="00300", name="taxable_interest_income", breakdown=None),
dict(code="00400", name="tax_exempt_interest_income", breakdown=None),
dict(code="00600", name="dividend_income", breakdown=None),
dict(code="00650", name="qualified_dividend_income", breakdown=None),
dict(
code="26270",
name="tax_unit_partnership_s_corp_income",
breakdown=None,
),
dict(code="02500", name="taxable_social_security", breakdown=None),
dict(code="02300", name="unemployment_compensation", breakdown=None),
dict(code="17000", name="medical_expense_deduction", breakdown=None),
dict(code="01700", name="taxable_pension_income", breakdown=None),
dict(code="11070", name="refundable_ctc", breakdown=None),
# NOTE: A18460 is the capped SALT deduction and matches the `salt` variable.
# Our SALT base currently excludes personal property taxes (not modeled yet),
# so amounts may be slightly below IRS totals.
dict(code="18460", name="salt", breakdown=None),
dict(code="06500", name="income_tax", breakdown=None),
]
…s-tests-to-make-database Add Great Expectations validation for database build
|
I tried running calibration with the new dataset and found this: from policyengine_us import Microsimulation
sim = Microsimulation(dataset="hf://policyengine/policyengine-us-data/cps_2023.h5")
sim.default_period = 2023
print(sim.calculate("tax_unit_partnership_s_corp_income").sum())
--
0.0Seems like the calculation of this variable on the rules engine side is either not implemented, has an error, or something else i am unaware of, which leads to the estimates of all the corp income targets being 0 and thus, never improving with calibration. This is also the case with the All other variables seem to be loading fine, though eitc by child count at the national level is not calibrating properly. I will investigate further, but for now, i'd only worry about |
Good eyes here, @juaristi22, but I think this is because this variable (very related to our conversation) is zero until the PUF comes in to create the enhanced_cps: |
juaristi22
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So exciting!
Closes #399
The branch name is now too narrow. Concepts are:
Age
Medicaid
SNAP
And, from the IRS SOI:
ETIC by number of children
AGI
and, also from the IRS SOI, turning to the logs:
Note that this PR does increase build time by adding the step of building the target database. Sooner or later, we should split up these data extractions into their own pipeline, but this approach follows the current standards like the CPS and ACS, which are downloaded every time.