diff --git a/README.md b/README.md index 7b077276..ad518501 100644 --- a/README.md +++ b/README.md @@ -1 +1,161 @@ # PolicyEngine UK Data + +[![Documentation](https://img.shields.io/badge/docs-live-blue)](https://policyengine.github.io/policyengine-uk-data/) +[![PyPI version](https://badge.fury.io/py/policyengine-uk-data.svg)](https://badge.fury.io/py/policyengine-uk-data) +[![Python 3.13+](https://img.shields.io/badge/python-3.13+-blue.svg)](https://www.python.org/downloads/) + +**PolicyEngine UK Data** creates representative microdata for the United Kingdom, designed for use in the [PolicyEngine UK](https://github.com/PolicyEngine/policyengine-uk) tax-benefit microsimulation model. + +## What is this? + +This package transforms the UK Family Resources Survey (FRS) into an enhanced dataset suitable for accurate tax-benefit policy analysis. The enhancement process includes: + +- **Imputation** of missing variables (wealth, consumption, VAT exposure) +- **Income enhancement** using Survey of Personal Incomes (SPI) data +- **Calibration** to match official statistics from HMRC, DWP, and ONS +- **Local area datasets** for constituencies and local authorities + +The result is a dataset that accurately represents the UK population and economy, enabling precise policy impact analysis. 
+ +## Installation + +### Prerequisites + +- Python 3.13 or higher +- [Hugging Face account](https://huggingface.co/) (for data downloads) + +### Install from PyPI + +```bash +pip install policyengine-uk-data +``` + +### Install from source + +```bash +git clone https://github.com/PolicyEngine/policyengine-uk-data.git +cd policyengine-uk-data +pip install -e ".[dev]" +``` + +### Authentication + +Set your Hugging Face token as an environment variable: + +```bash +export HUGGING_FACE_TOKEN="your_token_here" +``` + +Or create a `.env` file in your project root: + +``` +HUGGING_FACE_TOKEN=your_token_here +``` + +## Quick Start + +```python +from policyengine_uk_data import EnhancedFRS_2022_23 +from policyengine_uk import Microsimulation + +# Load the enhanced dataset +dataset = EnhancedFRS_2022_23 + +# Create a microsimulation for 2025 +simulation = Microsimulation(dataset=dataset) + +# Calculate total employment income +employment_income = simulation.calculate("employment_income", period=2025) +print(f"Total employment income: £{employment_income.sum() / 1e9:.1f}bn") + +# Calculate mean household income +household_income = simulation.calculate("household_net_income", period=2025) +weights = simulation.calculate("household_weight", period=2025) +mean_income = (household_income * weights).sum() / weights.sum() +print(f"Mean household net income: £{mean_income:,.0f}") +``` + +## Available Datasets + +| Dataset | Description | Use Case | +|---------|-------------|----------| +| `FRS_2022_23` | Raw FRS with benefits as reported | Baseline comparison | +| `ExtendedFRS_2022_23` | FRS + imputed wealth/consumption | Basic policy analysis | +| `EnhancedFRS_2022_23` | Extended + SPI income enhancement | Recommended for most analyses | +| `ReweightedFRS_2022_23` | Enhanced + calibrated weights | Maximum accuracy | + +## Documentation + +- **[Getting Started Guide](https://policyengine.github.io/policyengine-uk-data/getting-started)** - Detailed installation and setup +- 
**[Methodology](https://policyengine.github.io/policyengine-uk-data/methodology)** - How we create the datasets +- **[API Reference](https://policyengine.github.io/policyengine-uk-data/api-reference)** - Complete API documentation +- **[Examples](https://policyengine.github.io/policyengine-uk-data/examples)** - Usage examples and tutorials +- **[Validation](https://policyengine.github.io/policyengine-uk-data/validation/)** - Comparison with official statistics + +## Building the Datasets + +To rebuild the datasets from source data: + +```bash +# Download prerequisites (requires authentication) +make download + +# Build all datasets +make data + +# Run tests +make test + +# Upload to storage (requires GCP credentials) +make upload +``` + +## Data Sources + +This package combines data from multiple UK surveys: + +- **Family Resources Survey (FRS)** - household demographics, income, benefits +- **Wealth and Assets Survey (WAS)** - wealth imputations +- **Living Costs and Food Survey (LCFS)** - consumption imputations +- **Survey of Personal Incomes (SPI)** - high-income enhancement +- **Effects of Taxes and Benefits (ETB)** - VAT exposure + +See [Data Sources documentation](https://policyengine.github.io/policyengine-uk-data/data-sources) for details. + +## Citation + +If you use this package in research, please cite: + +``` +PolicyEngine. (2024). PolicyEngine UK Data. GitHub. +https://github.com/PolicyEngine/policyengine-uk-data +``` + +For the methodology, see our [documentation](https://policyengine.github.io/policyengine-uk-data/methodology.html). + +## Contributing + +We welcome contributions! Please see our [Contributing Guide](CONTRIBUTING.md) for details. + +1. Fork the repository +2. Create a feature branch (`git checkout -b feature/amazing-feature`) +3. Commit your changes (`git commit -m 'Add amazing feature'`) +4. Push to the branch (`git push origin feature/amazing-feature`) +5. 
Open a Pull Request + +## License + +This project is licensed under the AGPL-3.0 License - see the [LICENSE](LICENSE) file for details. + +## Support + +- **Documentation**: https://policyengine.github.io/policyengine-uk-data/ +- **Issues**: https://github.com/PolicyEngine/policyengine-uk-data/issues +- **Discussions**: https://github.com/PolicyEngine/policyengine-uk-data/discussions +- **Email**: hello@policyengine.org + +## Related Projects + +- [**PolicyEngine UK**](https://github.com/PolicyEngine/policyengine-uk) - UK tax-benefit microsimulation model +- [**PolicyEngine**](https://github.com/PolicyEngine/policyengine) - Policy simulation platform +- [**PolicyEngine US Data**](https://github.com/PolicyEngine/policyengine-us-data) - US equivalent dataset \ No newline at end of file diff --git a/changelog_entry.yaml b/changelog_entry.yaml index e69de29b..7656f10f 100644 --- a/changelog_entry.yaml +++ b/changelog_entry.yaml @@ -0,0 +1,12 @@ +- bump: minor + changes: + added: + - Comprehensive README with installation, quick start, and examples + - Getting Started guide with detailed setup instructions + - API Reference documentation for all datasets and functions + - Usage Examples page with common analysis patterns + - Data Sources documentation explaining FRS, WAS, LCFS, SPI, ETB + - Glossary of technical terms and abbreviations + changed: + - Enhanced introduction with clearer problem statement and use cases + - Reorganized documentation TOC into User Guide, Technical Details, and Validation sections \ No newline at end of file diff --git a/docs/api-reference.md b/docs/api-reference.md new file mode 100644 index 00000000..a2a23f97 --- /dev/null +++ b/docs/api-reference.md @@ -0,0 +1,249 @@ +# API Reference + +This page documents the main classes and functions in PolicyEngine UK Data. + +## Datasets + +### Main Datasets + +All datasets inherit from `policyengine_uk.data.UKSingleYearDataset` and can be used with `policyengine_uk.Microsimulation`. 
+ +#### `FRS_2022_23` + +Raw Family Resources Survey data for 2022-23. + +**Use case:** Baseline comparison, replicating official FRS analysis + +**Features:** +- Demographics from FRS +- Reported benefits (not simulated) +- No wealth or consumption data +- Known income underreporting + +```python +from policyengine_uk_data import FRS_2022_23 +from policyengine_uk import Microsimulation + +simulation = Microsimulation(dataset=FRS_2022_23) +``` + +#### `ExtendedFRS_2022_23` + +FRS with imputed wealth and consumption variables. + +**Use case:** Analysis requiring wealth or consumption but not requiring maximum income accuracy + +**Additions over FRS:** +- Wealth variables (from WAS) +- Consumption variables (from LCFS) +- VAT exposure rates (from ETB) +- Simulated benefits (not reported) + +```python +from policyengine_uk_data import ExtendedFRS_2022_23 +from policyengine_uk import Microsimulation + +simulation = Microsimulation(dataset=ExtendedFRS_2022_23) +``` + +#### `EnhancedFRS_2022_23` (Recommended) + +Extended FRS with SPI-based income enhancement to correct high-income underreporting. + +**Use case:** Most policy analysis (recommended default) + +**Additions over Extended FRS:** +- High-income correction using SPI data +- More accurate income distribution +- Maintains all wealth/consumption imputations + +```python +from policyengine_uk_data import EnhancedFRS_2022_23 +from policyengine_uk import Microsimulation + +simulation = Microsimulation(dataset=EnhancedFRS_2022_23) +``` + +#### `ReweightedFRS_2022_23` + +Enhanced FRS with calibrated weights to match official statistics. + +**Use case:** Maximum accuracy, official statistic replication + +**Additions over Enhanced FRS:** +- Calibrated to 2000+ official statistics +- Matches HMRC, DWP, OBR data +- Higher computational cost + +```python +from policyengine_uk_data import ReweightedFRS_2022_23 +from policyengine_uk import Microsimulation + +simulation = Microsimulation(dataset=ReweightedFRS_2022_23) +``` + +### Local Area Datasets + +#### `Constituency_2024_25` + +Parliamentary constituency-level dataset. 
+ +```python +from policyengine_uk_data.datasets.local_areas import Constituency_2024_25 + +simulation = Microsimulation(dataset=Constituency_2024_25) +constituency = simulation.calculate("constituency", period=2025) +``` + +#### `LocalAuthority_2024_25` + +Local authority-level dataset. + +```python +from policyengine_uk_data.datasets.local_areas import LocalAuthority_2024_25 + +simulation = Microsimulation(dataset=LocalAuthority_2024_25) +local_authority = simulation.calculate("local_authority", period=2025) +``` + +## Utility Functions + +### Dataset Utilities + +#### `sum_to_entity(df, entity, variable, target_entity)` + +Aggregate a variable from one entity level to another. + +**Parameters:** +- `df` (DataFrame): Source data +- `entity` (str): Source entity level +- `variable` (str): Variable to aggregate +- `target_entity` (str): Target entity level + +**Returns:** Aggregated series + +```python +from policyengine_uk_data.utils.datasets import sum_to_entity + +# Sum person-level income to household level +household_income = sum_to_entity( + df=person_df, + entity="person", + variable="employment_income", + target_entity="household" +) +``` + +#### `categorical(series, categories)` + +Convert a series to categorical codes. + +**Parameters:** +- `series` (Series): Input series +- `categories` (dict): Mapping of values to category codes + +**Returns:** Series with categorical codes + +### Loss/Validation Functions + +#### `get_loss_results(dataset, time_period, reform=None)` + +Calculate validation metrics comparing dataset to official statistics. 
+ +**Parameters:** +- `dataset` (UKSingleYearDataset): Dataset to validate +- `time_period` (int): Year to validate +- `reform` (Reform, optional): Policy reform to apply + +**Returns:** DataFrame with validation metrics including: +- `name`: Statistic name +- `estimate`: Dataset estimate +- `target`: Official statistic +- `error`: Absolute error +- `rel_error`: Relative error +- `abs_rel_error`: Absolute relative error + +```python +from policyengine_uk_data.utils import get_loss_results +from policyengine_uk_data import EnhancedFRS_2022_23 + +results = get_loss_results(EnhancedFRS_2022_23, 2025) +print(f"Mean absolute relative error: {results.abs_rel_error.mean():.2%}") +``` + +### Download Functions + +#### `download_prerequisites()` + +Download required data files from Hugging Face. + +**Requires:** `HUGGING_FACE_TOKEN` environment variable + +```python +from policyengine_uk_data import download_prerequisites + +download_prerequisites() +``` + +#### `check_prerequisites()` + +Check if required data files are present. + +**Returns:** Boolean indicating if all prerequisites are available + +```python +from policyengine_uk_data import check_prerequisites + +if not check_prerequisites(): + print("Missing prerequisites. Run download_prerequisites()") +``` + +## Constants + +### `STORAGE_FOLDER` + +Path to local data storage directory. + +```python +from policyengine_uk_data.utils.datasets import STORAGE_FOLDER + +print(f"Data stored in: {STORAGE_FOLDER}") +``` + +## Building Custom Datasets + +### `create_frs(raw_frs_folder, year)` + +Process raw FRS data into PolicyEngine format. 
+ +**Parameters:** +- `raw_frs_folder` (str): Path to raw FRS `.tab` files +- `year` (int): Survey year + +**Returns:** `UKSingleYearDataset` + +```python +from policyengine_uk_data.datasets.frs import create_frs + +dataset = create_frs( + raw_frs_folder="/path/to/frs/data", + year=2022 +) +``` + +### Imputation Modules + +Located in `policyengine_uk_data.datasets.imputations`: + +- `wealth` - Wealth variable imputations from WAS +- `income` - Income enhancements from SPI +- `consumption` - Consumption imputations from LCFS +- `vat` - VAT exposure rates from ETB +- `capital_gains` - Capital gains imputations + +Each module provides functions to add imputed variables to datasets. + +## See Also + +- [Getting Started](getting-started.md) - Installation and basic usage +- [Examples](examples.md) - Detailed usage examples +- [Methodology](methodology.ipynb) - How datasets are created \ No newline at end of file diff --git a/docs/data-sources.md b/docs/data-sources.md new file mode 100644 index 00000000..b7cdf638 --- /dev/null +++ b/docs/data-sources.md @@ -0,0 +1,335 @@ +# Data Sources + +PolicyEngine UK Data combines multiple UK government surveys to create comprehensive representative microdata. This page documents each data source, its purpose, and how to access it. 
+ +## Primary Sources + +### Family Resources Survey (FRS) + +**Publisher:** Department for Work and Pensions (DWP) + +**Coverage:** ~20,000 UK households annually[^frs-sample] + +**Purpose:** Main source of household demographics, income, and benefits data + +**Variables include:** +- Demographics (age, gender, family composition) +- Employment and self-employment income +- Benefit receipt +- Housing costs and tenure +- Disability status +- Childcare arrangements + +**Access:** +- [UK Data Service](https://beta.ukdataservice.ac.uk/) +- Requires registration (free for UK academics) +- Special license for non-academics + +**In PolicyEngine:** +- Base dataset for all variants +- Provides household structure and demographics +- Used for 2022-23 survey year + +**Limitations:** +- Income underreporting, especially at high incomes +- No wealth data +- Limited consumption data +- Sample size limits regional granularity + +### Wealth and Assets Survey (WAS) + +**Publisher:** Office for National Statistics (ONS) + +**Coverage:** ~18,000 UK households biennially[^was-sample] + +**Purpose:** Wealth variable imputations + +**Variables include:** +- Property wealth +- Financial wealth +- Pension wealth +- Physical wealth +- Debt + +**Access:** +- [UK Data Service](https://beta.ukdataservice.ac.uk/) +- Same registration as FRS + +**In PolicyEngine:** +- Used to impute wealth variables in ExtendedFRS and EnhancedFRS +- Quantile regression forests predict wealth from FRS variables +- Maintains demographic correlations + +**Limitations:** +- Biennial survey (less frequent than FRS) +- Wealth measurement challenges +- Sample size limitations + +### Living Costs and Food Survey (LCFS) + +**Publisher:** Office for National Statistics (ONS) + +**Coverage:** ~5,000 UK households annually[^lcfs-sample] + +**Purpose:** Consumption expenditure imputations + +**Variables include:** +- Detailed consumption by category +- Food expenditure +- Housing costs +- Transport costs +- Recreation and 
culture + +**Access:** +- [UK Data Service](https://beta.ukdataservice.ac.uk/) +- Open access (no registration required for recent years) + +**In PolicyEngine:** +- Imputes consumption variables +- Supports VAT and consumption tax analysis +- QRF models predict consumption from FRS demographics + +**Limitations:** +- Smaller sample size (~5k households) +- Diary-keeping burden may affect representativeness +- Some consumption underreporting + +### Survey of Personal Incomes (SPI) + +**Publisher:** HM Revenue & Customs (HMRC) + +**Coverage:** All UK taxpayers (~30 million records, 1% sample released)[^spi-sample] + +**Purpose:** Correct high-income underreporting + +**Variables include:** +- Employment income +- Self-employment income +- Savings and dividend income +- Property income +- Pension income + +**Access:** +- [HMRC Statistics](https://www.gov.uk/government/statistics/personal-incomes-statistics) +- Publicly available aggregated tables +- Microdata available to approved researchers + +**In PolicyEngine:** +- Enhances income distribution in EnhancedFRS +- Adds high-income observations missing from FRS +- QRF models impute full income profiles + +**Strengths:** +- Administrative data (not survey-based) +- Complete coverage of taxpayers +- Accurate income reporting + +**Limitations:** +- Only covers individuals with tax obligations +- No demographic variables beyond age +- 2-year publication lag + +## Auxiliary Sources + +### Effects of Taxes and Benefits on Household Income (ETB) + +**Publisher:** Office for National Statistics (ONS) + +**Based on:** LCFS data with imputed taxes and benefits + +**Purpose:** VAT exposure rate imputations + +**Variables include:** +- VAT-able consumption +- Effective VAT rates +- Tax and benefit burdens + +**Access:** +- [ONS Publications](https://www.ons.gov.uk/peoplepopulationandcommunity/personalandhouseholdfinances/incomeandwealth/bulletins/theeffectsoftaxesandbenefitsonhouseholdincome/latest) +- Aggregated tables 
publicly available + +**In PolicyEngine:** +- Imputes share of consumption subject to full VAT rate +- Enables accurate VAT policy analysis + +### Official Statistics for Calibration + +**Used for validation and calibration:** + +#### HMRC Statistics +- Income by tax band +- National Insurance contributions +- Employment and self-employment income distributions +- Capital gains tax + +**Source:** [HMRC Statistics](https://www.gov.uk/government/organisations/hm-revenue-customs/about/statistics) + +#### DWP Benefit Statistics +- Benefit caseloads and expenditures +- Universal Credit statistics +- State Pension statistics + +**Source:** [DWP Stat-Xplore](https://stat-xplore.dwp.gov.uk/) + +#### OBR Forecasts +- Total tax revenues +- Total benefit expenditures +- Policy costings + +**Source:** [OBR Publications](https://obr.uk/forecasts-in-depth/the-economy-forecast/) + +#### ONS Demographics +- Population by age and region +- Household composition +- Tenure types + +**Source:** [ONS Population Estimates](https://www.ons.gov.uk/peoplepopulationandcommunity/populationandmigration/populationestimates) + +## Data Processing Pipeline + +```mermaid +graph TD + subgraph sources["Source Datasets"] + FRS["FRS 2022-23<br/>20k households"]:::data + WAS["WAS<br/>18k households"]:::data + LCFS["LCFS<br/>5k households"]:::data + SPI["SPI<br/>~300k taxpayers"]:::data + ETB["ETB<br/>(from LCFS)"]:::data + end + + SimBenefits("Simulate benefits<br/>(replace reported)"):::process + + ImputeWealth("Impute wealth variables"):::process + ImputeConsumption("Impute consumption"):::process + ImputeVAT("Impute VAT exposure"):::process + + Extended["Extended FRS<br/>20k households<br/>+ wealth + consumption"]:::data + + CloneFRS("Clone FRS records"):::process + TrainQRF("Train QRF on SPI"):::process + ImputeIncome("Impute high incomes"):::process + Concat("Concatenate copies"):::process + + Enhanced["Enhanced FRS<br/>40k observations<br/>+ corrected incomes"]:::data + + Targets{{"Official Statistics<br/>2000+ targets<br/>HMRC | DWP | ONS | OBR"}}:::targets + + Calibrate("Optimize weights"):::process + + Reweighted{{"Reweighted FRS<br/>Final Dataset"}}:::output + + FRS --> SimBenefits + + SimBenefits --> ImputeWealth + SimBenefits --> ImputeConsumption + SimBenefits --> ImputeVAT + + WAS --> ImputeWealth + LCFS --> ImputeConsumption + ETB --> ImputeVAT + + ImputeWealth --> Extended + ImputeConsumption --> Extended + ImputeVAT --> Extended + + Extended --> CloneFRS + SPI --> TrainQRF + + CloneFRS --> ImputeIncome + TrainQRF --> ImputeIncome + + ImputeIncome --> Concat + Concat --> Enhanced + + Enhanced --> Calibrate + Targets --> Calibrate + Calibrate --> Reweighted + + classDef data fill:#2C6496,stroke:#2C6496,color:#FFFFFF + classDef process fill:#39C6C0,stroke:#2C6496,color:#FFFFFF + classDef targets fill:#FFA500,stroke:#2C6496,color:#FFFFFF + classDef output fill:#5091CC,stroke:#2C6496,color:#FFFFFF +``` + +## Accessing Raw Data + +### For Users + +PolicyEngine UK Data handles data downloads automatically: + +```python +from policyengine_uk_data import EnhancedFRS_2022_23 + +# Data downloads automatically on first use +dataset = EnhancedFRS_2022_23 +``` + +Set `HUGGING_FACE_TOKEN` environment variable for authentication. + +### For Developers + +To rebuild from source data: + +1. **Register with UK Data Service** +2. **Download FRS, WAS, LCFS** (tab-delimited format) +3. **Set paths in environment:** + ```bash + export FRS_RAW_DATA=/path/to/frs + export WAS_RAW_DATA=/path/to/was + export LCFS_RAW_DATA=/path/to/lcfs + ``` +4. **Build datasets:** + ```bash + make download # Downloads public data (SPI, etc.) 
+ make data # Builds all datasets + ``` + +## Data Licensing and Usage + +### FRS, WAS, LCFS +- **License:** UK Data Service End User Licence +- **Academic use:** Free with registration +- **Commercial use:** Requires separate license +- **Attribution:** Required + +### SPI +- **License:** Open Government Licence +- **Use:** Free for any purpose +- **Attribution:** Required + +### PolicyEngine UK Data Outputs +- **License:** AGPL-3.0 +- **Use:** Free for any purpose +- **Attribution:** Appreciated +- **Sharing:** Derivative works must also be released under AGPL-3.0 + +## Citation + +When using PolicyEngine UK Data in research: + +``` +PolicyEngine. (2024). PolicyEngine UK Data [Software]. +https://github.com/PolicyEngine/policyengine-uk-data + +Based on: +- Family Resources Survey 2022-23, Department for Work and Pensions +- Wealth and Assets Survey, Office for National Statistics +- Living Costs and Food Survey, Office for National Statistics +- Survey of Personal Incomes, HM Revenue & Customs +``` + +## See Also + +- [Methodology](methodology.ipynb) - How we process and combine these sources +- [Glossary](glossary.md) - Definitions of surveys and terms +- [Validation](validation/) - How outputs compare to official statistics + +## References + +[^frs-sample]: Department for Work and Pensions. (2023). *Family Resources Survey 2022/23*. Sample of approximately 19,000 households. [UK Data Service SN 9016](https://beta.ukdataservice.ac.uk/datacatalogue/series/series?id=200017). + +[^was-sample]: Office for National Statistics. (2020). *Wealth and Assets Survey, Waves 1-7, 2006-2020*. Approximately 18,000 households per wave. [UK Data Service SN 7215](https://beta.ukdataservice.ac.uk/datacatalogue/studies/study?id=7215). + +[^lcfs-sample]: Office for National Statistics. (2023). *Living Costs and Food Survey, 2022*. Sample of approximately 5,000 households. [UK Data Service SN 9114](https://beta.ukdataservice.ac.uk/datacatalogue/series/series?id=200017). 
+ +[^spi-sample]: HM Revenue & Customs. (2024). *Survey of Personal Incomes, 2020-21*. Based on administrative records of all UK taxpayers; 1% sample (~300,000 individuals) released for research. [HMRC Statistics](https://www.gov.uk/government/statistics/personal-incomes-statistics-to-2020-to-2021). \ No newline at end of file diff --git a/docs/examples.md b/docs/examples.md new file mode 100644 index 00000000..fd6aadd9 --- /dev/null +++ b/docs/examples.md @@ -0,0 +1,292 @@ +# Usage Examples + +This page provides practical examples of using PolicyEngine UK Data for policy analysis. + +## Basic Analysis + +### Loading and Exploring a Dataset + +```python +from policyengine_uk_data import EnhancedFRS_2022_23 +from policyengine_uk import Microsimulation +import pandas as pd + +# Load dataset and create simulation +simulation = Microsimulation(dataset=EnhancedFRS_2022_23) + +# Get basic statistics +n_people = len(simulation.calculate("person_id", period=2025)) +n_households = len(simulation.calculate("household_id", period=2025).unique()) + +print(f"Sample size: {n_people:,} people in {n_households:,} households") + +# Calculate key aggregates +employment_income = simulation.calculate("employment_income", period=2025).sum() / 1e9 +benefits = simulation.calculate("benefits", period=2025).sum() / 1e9 +income_tax = simulation.calculate("income_tax", period=2025).sum() / 1e9 + +print(f"Employment income: £{employment_income:.1f}bn") +print(f"Benefits: £{benefits:.1f}bn") +print(f"Income tax: £{income_tax:.1f}bn") +``` + +### Comparing Datasets + +```python +from policyengine_uk_data import FRS_2022_23, EnhancedFRS_2022_23 + +def get_income_stats(dataset): + sim = Microsimulation(dataset=dataset) + income = sim.calculate("household_net_income", period=2025) + weights = sim.calculate("household_weight", period=2025) + + mean = (income * weights).sum() / weights.sum() + total = income.sum() / 1e9 + + return {"mean": mean, "total": total} + +frs_stats = 
get_income_stats(FRS_2022_23) +efrs_stats = get_income_stats(EnhancedFRS_2022_23) + +print(f"FRS mean income: £{frs_stats['mean']:,.0f}") +print(f"Enhanced FRS mean income: £{efrs_stats['mean']:,.0f}") +print(f"Difference: £{efrs_stats['mean'] - frs_stats['mean']:,.0f}") +``` + +## Policy Reform Analysis + +### Simple Tax Change + +```python +from policyengine_uk import Microsimulation, Reform +from policyengine_uk_data import EnhancedFRS_2022_23 + +# Define a basic rate threshold increase +class BasicRateIncrease(Reform): + def apply(self): + self.update_parameter( + "gov.hmrc.income_tax.rates.uk[0].threshold", + "2025-01-01.2099-12-31", + 15_000 # Increase from ~£12,570 to £15,000 + ) + +# Calculate impact +baseline = Microsimulation(dataset=EnhancedFRS_2022_23) +reformed = Microsimulation(dataset=EnhancedFRS_2022_23, reform=BasicRateIncrease) + +# Revenue impact +baseline_revenue = baseline.calculate("income_tax", period=2025).sum() +reformed_revenue = reformed.calculate("income_tax", period=2025).sum() +revenue_change = (reformed_revenue - baseline_revenue) / 1e9 + +print(f"Revenue change: £{revenue_change:.2f}bn") + +# Winners and losers +baseline_income = baseline.calculate("household_net_income", period=2025) +reformed_income = reformed.calculate("household_net_income", period=2025) +change = reformed_income - baseline_income + +winners = (change > 0).sum() +losers = (change < 0).sum() +unchanged = (change == 0).sum() + +print(f"Winners: {winners:,} households") +print(f"Losers: {losers:,} households") +print(f"Unchanged: {unchanged:,} households") +``` + +### Universal Basic Income + +```python +class UniversalBasicIncome(Reform): + def apply(self): + # £100/week UBI for all adults + self.update_parameter( + "gov.contrib.ubi.adult.amount", + "2025-01-01.2099-12-31", + 100 * 52 # Weekly to annual + ) + +baseline = Microsimulation(dataset=EnhancedFRS_2022_23) +ubi_sim = Microsimulation(dataset=EnhancedFRS_2022_23, reform=UniversalBasicIncome) + +# Cost 
+ubi_cost = ubi_sim.calculate("universal_basic_income", period=2025).sum() / 1e9 +print(f"UBI cost: £{ubi_cost:.1f}bn/year") + +# Poverty impact +baseline_poverty = ( + baseline.calculate("in_absolute_poverty", period=2025).sum() +) +ubi_poverty = ( + ubi_sim.calculate("in_absolute_poverty", period=2025).sum() +) + +print(f"Poverty reduction: {baseline_poverty - ubi_poverty:,} people") +``` + +## Distributional Analysis + +### Income Deciles + +```python +import numpy as np +import pandas as pd + +simulation = Microsimulation(dataset=EnhancedFRS_2022_23) + +# Get household data +income = simulation.calculate("household_net_income", period=2025) +weights = simulation.calculate("household_weight", period=2025) + +# Calculate deciles +decile = simulation.calculate("household_income_decile", period=2025) + +# Mean income by decile +decile_data = pd.DataFrame({ + "income": income, + "weight": weights, + "decile": decile +}) + +decile_means = decile_data.groupby("decile").apply( + lambda x: (x.income * x.weight).sum() / x.weight.sum() +) + +print("Mean income by decile:") +for d, mean in decile_means.items(): + print(f" Decile {d}: £{mean:,.0f}") +``` + +### Gini Coefficient + +```python +def gini(values, weights): + """Calculate Gini coefficient.""" + sorted_indices = np.argsort(values) + sorted_values = values[sorted_indices] + sorted_weights = weights[sorted_indices] + + cumsum = np.cumsum(sorted_weights) + cumsum_values = np.cumsum(sorted_values * sorted_weights) + + return ( + (2 * np.sum(cumsum * sorted_values * sorted_weights)) / + (cumsum[-1] * cumsum_values[-1]) - 1 + ) + +simulation = Microsimulation(dataset=EnhancedFRS_2022_23) +income = simulation.calculate("household_net_income", period=2025) +weights = simulation.calculate("household_weight", period=2025) + +gini_coef = gini(income, weights) +print(f"Gini coefficient: {gini_coef:.3f}") +``` + +## Regional Analysis + +### Income by Region + +```python +simulation = 
Microsimulation(dataset=EnhancedFRS_2022_23) + +income = simulation.calculate("household_net_income", period=2025) +region = simulation.calculate("region", period=2025) +weights = simulation.calculate("household_weight", period=2025) + +region_income = pd.DataFrame({ + "income": income, + "region": region, + "weight": weights +}) + +regional_means = region_income.groupby("region").apply( + lambda x: (x.income * x.weight).sum() / x.weight.sum() +) + +print("Mean household income by region:") +for r, mean in regional_means.items(): + print(f" {r}: £{mean:,.0f}") +``` + +## Custom Analysis + +### Targeting Analysis + +```python +# Analyze take-up of a benefit +simulation = Microsimulation(dataset=EnhancedFRS_2022_23) + +# Eligible population +eligible = simulation.calculate("universal_credit_entitlement", period=2025) > 0 + +# Actual recipients +receiving = simulation.calculate("universal_credit", period=2025) > 0 + +# Take-up rate +takeup_rate = receiving[eligible].mean() +print(f"Universal Credit take-up rate: {takeup_rate:.1%}") +``` + +### Marginal Tax Rates + +```python +def marginal_tax_rate(simulation, person_id, base_earnings): + """Calculate marginal tax rate for a person.""" + # Baseline + base_net = simulation.calculate("net_income", period=2025)[person_id] + + # Increment earnings by £1000 + # (period is passed positionally: a positional argument cannot follow a keyword argument) + simulation.set_input("employment_income", 2025, + {person_id: base_earnings + 1000}) + new_net = simulation.calculate("net_income", period=2025)[person_id] + + # MTR = 1 - (change in net / change in gross) + mtr = 1 - (new_net - base_net) / 1000 + return mtr + +simulation = Microsimulation(dataset=EnhancedFRS_2022_23) +# Calculate MTRs for employed people +employment_income = simulation.calculate("employment_income", period=2025) +employed = employment_income > 0 + +mtrs = [ + marginal_tax_rate(simulation, pid, employment_income[pid]) + for pid in range(len(employed)) if employed[pid] +] + +print(f"Mean MTR for employed: {np.mean(mtrs):.1%}") +``` + +## 
Validation and Quality Checks + +### Compare to Official Statistics + +```python +from policyengine_uk_data.utils import get_loss_results + +results = get_loss_results(EnhancedFRS_2022_23, 2025) + +# Filter to specific statistics +tax_stats = results[results.name.str.contains("obr")] +print("Tax-benefit program accuracy:") +print(tax_stats[["name", "target", "estimate", "abs_rel_error"]].head(10)) + +# Overall accuracy +print(f"\nMean absolute relative error: {results.abs_rel_error.mean():.2%}") +print(f"Median absolute relative error: {results.abs_rel_error.median():.2%}") +``` + +## Tips and Best Practices + +1. **Cache simulations** when running multiple reforms on the same baseline +2. **Use vectorized operations** instead of loops for better performance +3. **Check validation metrics** to understand dataset accuracy for your use case +4. **Start with EnhancedFRS** unless you have specific reasons to use another variant +5. **Weight all statistics** using household/person weights for population estimates + +## Next Steps + +- [API Reference](api-reference.md) - Complete function documentation +- [Methodology](methodology.ipynb) - Understand dataset construction +- [Validation](validation/) - See accuracy metrics \ No newline at end of file diff --git a/docs/getting-started.md b/docs/getting-started.md new file mode 100644 index 00000000..7863bc2e --- /dev/null +++ b/docs/getting-started.md @@ -0,0 +1,256 @@ +# Getting Started + +This guide will help you install and start using PolicyEngine UK Data. + +## Prerequisites + +Before installing, ensure you have: + +1. **Python 3.13 or higher** + ```bash + python --version # Should be 3.13+ + ``` + +2. **A Hugging Face account** + - Sign up at [huggingface.co](https://huggingface.co/) + - Create an access token at [Settings → Access Tokens](https://huggingface.co/settings/tokens) + - The token needs **read** access + +3. 
**(Optional) Google Cloud credentials** + - Only needed if you're building datasets from scratch + - For most users, pre-built datasets are available via Hugging Face + +## Installation + +### Standard Installation + +Install from PyPI: + +```bash +pip install policyengine-uk-data +``` + +### Development Installation + +For contributing or building datasets: + +```bash +# Clone the repository +git clone https://github.com/PolicyEngine/policyengine-uk-data.git +cd policyengine-uk-data + +# Install with development dependencies +pip install -e ".[dev]" +``` + +## Authentication + +### Hugging Face Token + +Set your Hugging Face token as an environment variable: + +**Linux/macOS:** +```bash +export HUGGING_FACE_TOKEN="your_token_here" +``` + +**Windows (Command Prompt):** +```cmd +set HUGGING_FACE_TOKEN=your_token_here +``` + +**Windows (PowerShell):** +```powershell +$env:HUGGING_FACE_TOKEN="your_token_here" +``` + +**Or use a `.env` file:** + +Create a `.env` file in your project directory: + +``` +HUGGING_FACE_TOKEN=your_token_here +``` + +The package will automatically load environment variables from `.env` files. + +## First Steps + +### 1. Import and Load a Dataset + +```python +from policyengine_uk_data import EnhancedFRS_2022_23 + +# The dataset will download automatically on first use +dataset = EnhancedFRS_2022_23 +``` + +First-time downloads may take a few minutes depending on your connection. + +### 2. Create a Microsimulation + +```python +from policyengine_uk import Microsimulation + +# Create a simulation for 2025 +simulation = Microsimulation(dataset=dataset) +``` + +### 3. 
Calculate Variables + +```python +# Calculate employment income for all persons +employment_income = simulation.calculate("employment_income", period=2025) + +# Calculate household net income +household_income = simulation.calculate("household_net_income", period=2025) + +# Get household weights for population-representative statistics +weights = simulation.calculate("household_weight", period=2025) +``` + +### 4. Compute Aggregate Statistics + +```python +import numpy as np + +# Total employment income (in billions) +total_employment = employment_income.sum() / 1e9 +print(f"Total employment income: £{total_employment:.1f}bn") + +# Mean household income +mean_income = (household_income * weights).sum() / weights.sum() +print(f"Mean household net income: £{mean_income:,.0f}") + +# Median household income +sorted_indices = np.argsort(household_income) +cumsum = np.cumsum(weights[sorted_indices]) +median_index = sorted_indices[np.searchsorted(cumsum, cumsum[-1] / 2)] +median_income = household_income[median_index] +print(f"Median household net income: £{median_income:,.0f}") +``` + +## Choosing a Dataset + +PolicyEngine UK Data provides four dataset variants: + +| Dataset | When to Use | Pros | Cons | +|---------|-------------|------|------| +| `FRS_2022_23` | Comparing with raw FRS | Matches official FRS | Missing wealth/consumption, income underreporting | +| `ExtendedFRS_2022_23` | Basic analysis with wealth/consumption | Adds wealth and consumption variables | Still has income underreporting | +| `EnhancedFRS_2022_23` | **Most analyses** (recommended) | Corrects income distribution, adds wealth/consumption | Small dataset size increase | +| `ReweightedFRS_2022_23` | Maximum accuracy needed | Calibrated to match official statistics exactly | Slightly higher memory usage | + +For most policy analysis, use `EnhancedFRS_2022_23`. 
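The weighted median computed in step 4 can be wrapped in a reusable helper. The following is a minimal sketch on toy NumPy arrays: `weighted_quantile` is a hypothetical helper name, not part of `policyengine_uk_data`, and it assumes plain arrays of values and weights rather than the objects returned by `simulation.calculate`.

```python
import numpy as np


def weighted_quantile(values, weights, q=0.5):
    """Return the smallest value whose cumulative weight share reaches q.

    Hypothetical helper for illustration; q=0.5 gives the weighted median.
    """
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    # Sort values and accumulate their weights in the same order
    order = np.argsort(values)
    cumulative = np.cumsum(weights[order])
    # First position where the cumulative weight reaches the target share
    idx = np.searchsorted(cumulative, q * cumulative[-1])
    idx = min(idx, len(values) - 1)
    return values[order][idx]


# Toy data: the highest-income household represents five households
toy_incomes = np.array([10_000, 20_000, 30_000, 40_000])
toy_weights = np.array([1.0, 1.0, 1.0, 5.0])

print(f"Unweighted median: £{np.median(toy_incomes):,.0f}")
print(f"Weighted median: £{weighted_quantile(toy_incomes, toy_weights):,.0f}")
```

On the toy data the two medians differ because one record carries most of the weight, which is why survey weights should be applied before reporting population statistics.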
+ +## Common Patterns + +### Analyzing a Policy Reform + +```python +from policyengine_uk import Microsimulation, Reform +from policyengine_uk_data import EnhancedFRS_2022_23 + +# Baseline simulation +baseline = Microsimulation(dataset=EnhancedFRS_2022_23) + +# Define a reform (e.g., increase basic rate threshold) +class IncomeTaxReform(Reform): + def apply(self): + self.update_parameter("gov.hmrc.income_tax.rates.uk[0].threshold", "2025-01-01.2099-12-31", 15_000) + +# Reformed simulation +reformed = Microsimulation(reform=IncomeTaxReform, dataset=EnhancedFRS_2022_23) + +# Compare tax revenues +baseline_tax = baseline.calculate("income_tax", period=2025).sum() +reformed_tax = reformed.calculate("income_tax", period=2025).sum() + +revenue_change = (reformed_tax - baseline_tax) / 1e9 +print(f"Revenue change: £{revenue_change:.1f}bn") +``` + +### Working with Local Areas + +```python +from policyengine_uk_data.datasets.local_areas import ( + Constituency_2024_25, + LocalAuthority_2024_25 +) + +# Load constituency-level data +constituency_data = Constituency_2024_25 +simulation = Microsimulation(dataset=constituency_data) + +# Get constituency codes +constituency_codes = simulation.calculate("constituency", period=2025) + +# Calculate statistics by constituency +# (Implementation depends on your specific needs) +``` + +## Troubleshooting + +### Import Error: "Prerequisites not found" + +The package checks for required data files on import. If you see this error: + +```python +from policyengine_uk_data import check_prerequisites, download_prerequisites + +# Check what's missing +check_prerequisites() + +# Download missing files +download_prerequisites() +``` + +### Download Fails + +If downloads fail: + +1. **Check your Hugging Face token** is set correctly +2. **Check internet connection** +3. **Try clearing the cache:** + ```bash + rm -rf ~/.cache/huggingface/ + ``` + +### Memory Issues + +For large-scale analysis: + +1. 
**Use a subset of the data:** + ```python + # Sample 10% of households + simulation.sample_size = 0.1 + ``` + +2. **Calculate variables individually** rather than all at once + +3. **Use `ReweightedFRS_2022_23`** instead of building custom datasets + +### Performance Tips + +- **Cache simulations** when running multiple reforms +- **Use vectorized operations** instead of loops +- **Profile your code** with `cProfile` to find bottlenecks +- **Consider using Dask** for truly large-scale analysis + +## Next Steps + +- **[Examples](examples.md)** - More detailed usage examples +- **[API Reference](api-reference.md)** - Complete API documentation +- **[Methodology](methodology.ipynb)** - Understand how datasets are created +- **[Validation](validation/)** - See how datasets compare to official statistics + +## Getting Help + +- **Documentation**: [policyengine.github.io/policyengine-uk-data](https://policyengine.github.io/policyengine-uk-data/) +- **Issues**: [GitHub Issues](https://github.com/PolicyEngine/policyengine-uk-data/issues) +- **Discussions**: [GitHub Discussions](https://github.com/PolicyEngine/policyengine-uk-data/discussions) +- **Email**: hello@policyengine.org \ No newline at end of file diff --git a/docs/glossary.md b/docs/glossary.md new file mode 100644 index 00000000..a03f7aa3 --- /dev/null +++ b/docs/glossary.md @@ -0,0 +1,170 @@ +# Glossary + +## Datasets and Surveys + +### FRS (Family Resources Survey) +The primary UK household survey conducted annually by the Department for Work and Pensions. Covers ~20,000 households with detailed information on demographics, income, benefits, and housing. The main data source for PolicyEngine UK Data. + +### WAS (Wealth and Assets Survey) +Biennial ONS survey of ~20,000 households focusing on household wealth including property, financial assets, pensions, and debt. Used to impute wealth variables. 
+ +### LCFS (Living Costs and Food Survey) +Annual ONS survey of ~5,000 households recording detailed consumption expenditure. Used to impute consumption variables for VAT analysis. + +### SPI (Survey of Personal Incomes) +HMRC administrative dataset based on tax records covering all UK taxpayers. A 1% sample (~300,000 individuals) is released for research. Used to correct high-income underreporting. + +### ETB (Effects of Taxes and Benefits on Household Income) +ONS analysis based on LCFS data showing the redistributive effects of taxes and benefits. Used to impute VAT exposure rates. + +## Dataset Variants + +### ExtendedFRS +FRS enhanced with imputed wealth (from WAS) and consumption (from LCFS) variables. First enhancement stage. + +### EnhancedFRS +ExtendedFRS with additional high-income enhancement using SPI data to correct income underreporting. Recommended for most analyses. + +### ReweightedFRS +EnhancedFRS with calibrated weights to match 2000+ official statistics from HMRC, DWP, and ONS. Maximum accuracy variant. + +## Statistical Terms + +### Calibration +Process of adjusting survey weights to match known population totals or distributional targets. In PolicyEngine UK Data, weights are calibrated to match official statistics on demographics, incomes, and tax-benefit programs. + +### Imputation +Statistical technique to estimate missing variables using machine learning models trained on other surveys. PolicyEngine uses Quantile Regression Forests for imputation. + +### Microdata +Individual-level (person or household) data, as opposed to aggregated statistics. Enables detailed distributional analysis. + +### Microsimulation +Modeling technique that applies policy rules to representative microdata to estimate policy impacts on individuals and the population. + +### QRF (Quantile Regression Forests) +Machine learning algorithm that predicts the full conditional distribution of a variable, not just its mean. 
Used for imputation to preserve distributional properties. + +### Reweighting +See Calibration. + +## Entities + +### Person +Individual in the dataset. Basic unit of analysis for many variables like age, gender, employment. + +### Benefit Unit +Group of adults and children whose benefit entitlements are assessed together. Usually a family within a household. + +### Household +Group of people living at the same address. May contain multiple benefit units (e.g., adult children living with parents). + +## Income Concepts + +### Gross Income +Total income before taxes and including benefits. + +### Net Income +Income after taxes and National Insurance contributions, including benefits. + +### Equivalised Income +Income adjusted for household size and composition to enable comparisons. Uses Modified OECD equivalence scale. + +### Market Income +Income from employment, self-employment, investments, and pensions before taxes and benefits. + +## Tax-Benefit System + +### Universal Credit (UC) +Main means-tested benefit in the UK, replacing six legacy benefits. Combines support for unemployment, low income, housing costs, children, and disabilities. + +### Income Tax +Progressive tax on income with multiple bands. Includes Personal Allowance (tax-free amount), Basic Rate (20%), Higher Rate (40%), and Additional Rate (45%). + +### National Insurance (NI) +Social insurance contributions on earnings. Separate rates for employees, employers, and self-employed. Establishes eligibility for State Pension and other contributory benefits. + +### VAT (Value Added Tax) +Consumption tax applied to most goods and services. Standard rate 20%, reduced rate 5%, zero rate for some essentials. + +### Council Tax +Local property tax based on property value bands. Varies by local authority. + +## Methodology Terms + +### Enhancement +Process of improving FRS data by adding variables or correcting biases. Includes imputation (adding variables) and income correction (fixing underreporting). 
+ +### Loss Function +Metric used to evaluate dataset quality by comparing estimates to known statistics. Lower loss indicates better match to reality. + +### Representative +Sample that accurately reflects the characteristics of the full population when appropriate weights are applied. + +### Validation +Process of comparing dataset estimates against official statistics to assess accuracy. + +### Weight +Multiplier applied to each household/person indicating how many real-world households/people they represent. Essential for population-level statistics. + +## PolicyEngine Terms + +### Reform +Change to policy parameters (e.g., tax rates, benefit amounts). Can be applied to simulations to estimate impacts. + +### Simulation +Application of the tax-benefit model to a dataset to calculate taxes, benefits, and net incomes under current or reformed policy. + +### Variable +Any measurable characteristic in the model (e.g., age, income, tax liability). Can be inputs (from data) or calculated (by model). + +## UK Government Departments + +### DWP (Department for Work and Pensions) +Responsible for welfare and pension policy. Publishes FRS and benefit statistics. + +### HMRC (HM Revenue & Customs) +Tax authority. Publishes tax statistics and SPI data. + +### ONS (Office for National Statistics) +National statistical institute. Publishes WAS, LCFS, ETB, and demographic statistics. + +### OBR (Office for Budget Responsibility) +Independent fiscal watchdog. Publishes forecasts and policy costings used for validation. + +## Research Terms + +### Gini Coefficient +Measure of income inequality ranging from 0 (perfect equality) to 1 (perfect inequality). Commonly reported for income distributions. + +### Poverty Rate +Percentage of population below a poverty threshold. UK typically uses 60% of median income (relative poverty) or inflation-adjusted threshold (absolute poverty). + +### Decile +One-tenth of a distribution. First decile = bottom 10%, tenth decile = top 10%. 
Used to analyze distributional impacts. + +### Marginal Tax Rate (MTR) +Percentage of an additional pound of income lost to taxes and benefit withdrawal. Can exceed 100% due to benefit tapers. + +### Winners and Losers +Households gaining (winners) or losing (losers) income under a policy reform. + +## Abbreviations + +- **AGPL**: GNU Affero General Public License (software license) +- **API**: Application Programming Interface +- **CSV**: Comma-Separated Values (data format) +- **GCP**: Google Cloud Platform +- **HDF5**: Hierarchical Data Format 5 (efficient data storage) +- **ML**: Machine Learning +- **OECD**: Organisation for Economic Co-operation and Development +- **PyPI**: Python Package Index +- **UK**: United Kingdom + +## See Also + +- [Data Sources](data-sources.md) - Detailed information on each survey +- [Methodology](methodology.ipynb) - Technical details of enhancement process +- [API Reference](api-reference.md) - Function and class documentation \ No newline at end of file diff --git a/docs/intro.md b/docs/intro.md index af04a858..4bcb429c 100644 --- a/docs/intro.md +++ b/docs/intro.md @@ -1,14 +1,127 @@ # Introduction -PolicyEngine-UK-Data is a package that creates representative microdata for the UK, -designed for input in the PolicyEngine tax-benefit microsimulation model. This tool -allows users to explore the data sources, validation processes, and enhancements -made to ensure accurate and reliable microsimulation results. +Welcome to PolicyEngine UK Data - a comprehensive solution for creating representative microdata for United Kingdom policy analysis. -PolicyEngine is a tool with a clear purpose: for given assumptions about UK government policy and UK households, predicting what UK households will look like in the next few years. To do that, we need both of two things: +## What is PolicyEngine UK Data? -* An accurate model of the effects of policy rules on households. -* An accurate representation of the current UK household sector *now*. 
+PolicyEngine UK Data transforms the UK Family Resources Survey into enhanced microdata suitable for accurate tax-benefit policy analysis. By combining multiple government surveys and applying advanced statistical techniques, we create datasets that accurately represent the UK population's demographics, incomes, wealth, and consumption patterns. -This repository is dedicated to the second of those. In this documentation, we'll explain how we do that, but we'll also use our model (the first bullet) to see what we end up with when we combine the two, and measure up against other organisations doing the same thing. +## The Challenge + +Effective tax-benefit policy analysis requires: + +1. **An accurate model of policy rules** - How do taxes and benefits actually work? +2. **Accurate representation of the population** - Who are the people affected by these policies? + +PolicyEngine UK provides the first (the tax-benefit model). This package provides the second (the microdata). + +The challenge is that no single survey captures everything we need: +- The FRS has good demographics but underreports income and lacks wealth data +- The WAS has wealth but smaller sample sizes +- The SPI has accurate high incomes but no demographics +- The LCFS has consumption but only ~5,000 households + +## Our Solution + +We combine the strengths of multiple surveys: + +``` +FRS (demographics) + WAS (wealth) + LCFS (consumption) + SPI (high incomes) + ↓ + Statistical enhancement + ↓ + Calibration to match + official statistics + ↓ + Enhanced representative microdata +``` + +The result is a dataset that: +- ✅ Matches official HMRC, DWP, and ONS statistics +- ✅ Includes wealth and consumption variables +- ✅ Correctly represents high-income individuals +- ✅ Enables accurate policy impact analysis + +## Who Should Use This? 
+ +### Researchers +- Academic economists studying UK tax-benefit policy +- Policy researchers analyzing distributional impacts +- PhD students modeling fiscal reforms + +### Policy Analysts +- Government departments evaluating policy options +- Think tanks developing policy proposals +- Advocacy organizations assessing policy impacts + +### Data Scientists +- Building tax-benefit calculators +- Developing distributional analysis tools +- Creating policy simulation platforms + +## What You Can Do + +With PolicyEngine UK Data, you can: + +- **Estimate policy costs** - How much would a reform cost or save? +- **Analyze distributional impacts** - Who wins and loses from policy changes? +- **Calculate poverty and inequality** - How do policies affect poverty rates? +- **Model benefit take-up** - How many people are eligible vs. receiving benefits? +- **Regional analysis** - How do impacts vary by constituency or local authority? +- **Behavioral responses** - How might people respond to policy incentives? + +## Quick Links + +| I want to... | Go to... | +|--------------|----------| +| Install and use the package | [Getting Started](getting-started.md) | +| See code examples | [Examples](examples.md) | +| Understand the methodology | [Methodology](methodology.ipynb) | +| Look up functions and classes | [API Reference](api-reference.md) | +| Check dataset accuracy | [Validation](validation/) | +| Understand technical terms | [Glossary](glossary.md) | +| Learn about data sources | [Data Sources](data-sources.md) | + +## How This Documentation is Organized + +1. **User Guide** - Practical information for using the package + - [Getting Started](getting-started.md) - Installation and first steps + - [Examples](examples.md) - Code examples for common tasks + - [API Reference](api-reference.md) - Complete function documentation + - [Data Sources](data-sources.md) - Information on source surveys + - [Glossary](glossary.md) - Definitions and terminology + +2. 
**Technical Details** - In-depth methodology + - [Methodology](methodology.ipynb) - Step-by-step dataset creation + - [Pension Contributions](pension_contributions.ipynb) - Pension data processing + - [Constituency Methodology](constituency_methodology.ipynb) - Constituency-level datasets + - [Local Authority Methodology](LA_methodology.ipynb) - Local authority datasets + +3. **Validation** - Accuracy and quality assurance + - [National Validation](validation/national.ipynb) - Comparison to national statistics + - [Constituency Validation](validation/constituencies.ipynb) - Constituency-level accuracy + - [Local Authority Validation](validation/local_authorities.ipynb) - Local authority accuracy + +## Project Context + +PolicyEngine UK Data is part of the broader PolicyEngine ecosystem: + +- **[PolicyEngine UK](https://github.com/PolicyEngine/policyengine-uk)** - The tax-benefit microsimulation model +- **[PolicyEngine](https://policyengine.org)** - Web application for policy analysis +- **[PolicyEngine US Data](https://github.com/PolicyEngine/policyengine-us-data)** - Equivalent dataset for the United States + +## Contributing + +We welcome contributions! Whether you're fixing bugs, improving documentation, or adding features, please see our [GitHub repository](https://github.com/PolicyEngine/policyengine-uk-data) to get started. + +## License and Citation + +PolicyEngine UK Data is open source (AGPL-3.0). If you use it in research, please cite: + +``` +PolicyEngine. (2024). PolicyEngine UK Data. +https://github.com/PolicyEngine/policyengine-uk-data +``` + +For methodology details, see our [Methodology page](methodology.ipynb). 
diff --git a/docs/myst.yml b/docs/myst.yml index 165211b4..487a86ef 100644 --- a/docs/myst.yml +++ b/docs/myst.yml @@ -9,15 +9,25 @@ project: github: policyengine/policyengine-uk-data toc: - file: intro.md - - file: methodology.ipynb - - file: validation/index.md + - file: getting-started.md + - file: examples.md + - title: User Guide children: + - file: api-reference.md + - file: data-sources.md + - file: glossary.md + - title: Technical Details + children: + - file: methodology.ipynb + - file: pension_contributions.ipynb + - file: constituency_methodology.ipynb + - file: LA_methodology.ipynb + - title: Validation + children: + - file: validation/index.md - file: validation/national.ipynb - file: validation/constituencies.ipynb - file: validation/local_authorities.ipynb - - file: pension_contributions.ipynb - - file: constituency_methodology.ipynb - - file: LA_methodology.ipynb site: options: logo: logo.png