Conversation


@bosd commented Dec 21, 2025

Summary

This PR addresses critical issues with the deferred-fields feature to make odoo-data-flow production-ready for ETL operations.

Key Fixes

  • Fix deferred-fields matching - Handle both field and field/id formats correctly

    • Normalize field names in Pass 1 ignore filtering
    • Normalize field names in Pass 2 data preparation
  • Add XML-ID resolution for non-self-referencing fields - Support fields like responsible_id that reference other models (e.g., res.users)

    • Added _resolve_external_id_for_pass2() helper function
    • Tries multiple XML-ID variations (module.name, export.prefix, etc.)
  • Fix batch rejection error handling - Records no longer inherit the same error message

    • Added _extract_per_row_errors() to parse per-row errors from Odoo's response
    • Falls back to individual processing when batch has multiple failures
    • First failed record gets batch error, subsequent records get reference
  • Add binary field deferral support - Allow deferring image fields like image_1920

    • Non-relational fields are written directly in Pass 2 (base64 data)
  • Add --company-id CLI parameter - Simplify multicompany imports

    • Sets allowed_company_ids and force_company in context
  • Fix CLI deferred-fields parsing - Convert comma-separated string to list
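The field-name matching fix above can be sketched as follows. This is an illustrative sketch only; `normalize_field` and `is_deferred` are hypothetical names, not the actual helpers in the PR:

```python
def normalize_field(name: str) -> str:
    """Strip the '/id' export suffix so 'parent_id/id' and 'parent_id' compare equal."""
    return name[:-3] if name.endswith("/id") else name


def is_deferred(column: str, deferred: set[str]) -> bool:
    """Match a CSV column against the deferred-fields set in either format."""
    normalized = {normalize_field(f) for f in deferred}
    return normalize_field(column) in normalized
```

With this normalization, a user can pass `--deferred-fields parent_id` and still match a CSV header that exports the column as `parent_id/id`.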

Tested With

  • Local Odoo 18 instance
  • Verified Pass 1 correctly excludes deferred fields
  • Verified Pass 2 resolves XML-IDs and updates records
  • All 382 unit tests pass
  • Pre-commit, mypy, typeguard all pass

Test plan

  • Test with existing ETL scripts using --deferred-fields
  • Test fail mode with deferred fields
  • Test multicompany imports with --company-id
  • Test image deferral with --deferred-fields image_1920

bosd and others added 2 commits December 21, 2025 21:20
- Fix deferred-fields matching to handle both 'field' and 'field/id' formats
- Add XML-ID resolution for non-self-referencing deferred fields (e.g., responsible_id)
- Support binary field deferral for image imports (e.g., image_1920)
- Fix batch rejection to not inherit same error message for all records
  - Extract per-row errors from Odoo's response when available
  - Fall back to individual processing when batch has multiple failures
- Add --company-id CLI parameter for multicompany imports
  - Sets allowed_company_ids and force_company in context
- Add _extract_per_row_errors helper for parsing Odoo's error messages
- Add _resolve_external_id_for_pass2 helper for XML-ID resolution

These changes address critical issues with:
- Deferred fields not working in fail mode
- All batch records inheriting the same failure reason
- Cross-company field references causing import failures
- Large image imports overwhelming the server

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Convert deferred_fields CLI parameter from comma-separated string to list
- Fix ignore_list filtering in Pass 1 to handle both 'field' and 'field/id' formats
  - Normalize ignore_set to strip '/id' suffix before matching column names
- Verified working with local Odoo 18 instance:
  - Pass 1 correctly excludes deferred fields from initial import
  - Pass 2 successfully resolves XML-IDs and updates records
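The Pass 1 ignore filtering described in this commit can be sketched like so; `filter_ignored_columns` here is an illustrative stand-in for the real helper, assuming header/row data as plain lists:

```python
def filter_ignored_columns(
    header: list[str], rows: list[list[str]], ignore: list[str]
) -> tuple[list[str], list[list[str]]]:
    """Drop ignored columns, matching both 'field' and 'field/id' header formats."""
    # Normalize the ignore list so 'field' and 'field/id' both match.
    ignore_set = {f[:-3] if f.endswith("/id") else f for f in ignore}
    keep = [
        i for i, col in enumerate(header)
        if (col[:-3] if col.endswith("/id") else col) not in ignore_set
    ]
    new_header = [header[i] for i in keep]
    new_rows = [[row[i] for i in keep] for row in rows]
    return new_header, new_rows
```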

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@gemini-code-assist

Summary of Changes

Hello @bosd, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the 'deferred-fields' feature within the Odoo data flow tool, making it more robust and reliable for production ETL operations. The changes focus on improving data integrity, error handling, and usability, particularly for complex scenarios involving cross-model references, binary data, and multicompany setups.

Highlights

  • Deferred Field Matching: Improved handling for deferred fields, correctly parsing both 'field' and 'field/id' formats during data preparation and ignore filtering.
  • XML-ID Resolution: Introduced a mechanism to resolve XML-IDs for non-self-referencing fields (e.g., 'responsible_id' referencing 'res.users'), ensuring proper linking across different models.
  • Enhanced Batch Error Handling: Implemented granular error reporting for batch rejections, allowing individual records to receive specific error messages rather than inheriting a generic batch error, and falling back to individual processing when necessary.
  • Binary Field Deferral: Added support for deferring non-relational fields like binary data (e.g., 'image_1920'), which are now written directly in the second pass.
  • Multicompany CLI Support: Introduced a new '--company-id' CLI parameter to simplify imports in multicompany Odoo environments by setting appropriate context variables.
  • CLI Deferred Fields Parsing: Fixed the command-line interface to correctly parse comma-separated 'deferred-fields' strings into a list.



@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request significantly enhances the "deferred-fields" feature, improving its reliability for production ETL workloads through better field matching, external XML-ID resolution, granular error handling, and binary field support. However, a medium-severity vulnerability was identified in the error handling of the new XML-ID resolution logic, which could lead to incorrect data relationships and potential access control risks under specific failure conditions. Additionally, feedback includes suggestions for moving an import to the top level for style consistency, simplifying a complex connection probing block, and refining exception handling for better specificity and robustness.

Comment on lines +403 to +404
    except Exception:  # noqa: S112
        continue


Severity: medium (security)

The broad except Exception: statement within the ID resolution loop is a medium-severity vulnerability. It can lead to incorrect record linking and potential access control issues by silently catching transient errors (e.g., network issues, DB locks) and proceeding with less specific variations. This could result in an external ID resolving to the wrong database record, linking data to an incorrect owner or parent. Beyond the security implications, broad exception handling also hides unexpected bugs and makes debugging difficult. It is recommended to catch more specific exceptions, such as OdooError, KeyError, ValueError, or IndexError, and only catch 'not found' errors, allowing other exceptions to fail loudly to prevent incorrect data mapping and improve debuggability.
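The reviewer's recommendation can be sketched as follows. `NotFoundError` is a hypothetical stand-in for the library's actual "record not found" exception, and `resolve_first`/`lookup` are illustrative names:

```python
class NotFoundError(Exception):
    """Hypothetical stand-in for the library's 'record not found' error."""


def resolve_first(variations, lookup):
    """Try each XML-ID variation in order.

    Only 'not found'-style errors advance the loop to the next variation;
    transient errors (network, DB locks) propagate so they fail loudly
    instead of silently resolving to a less specific variation.
    """
    for xml_id in variations:
        try:
            return lookup(xml_id)
        except (NotFoundError, KeyError, ValueError, IndexError):
            continue  # genuinely missing: try the next variation
    return None
```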

    Returns:
        A dictionary mapping row indices (0-based) to error messages.
    """
    import re


Severity: medium

For better code style and consistency, it's recommended to place all imports at the top of the file, as per PEP 8 guidelines. Moving import re to the top of import_threaded.py will improve readability and avoid re-importing the module on each function call.

Comment on lines +282 to +304
conn = None
for attr in ["connection", "client", "_connection", "_client"]:
    try:
        val = getattr(model_obj, attr, None)
        if val and not callable(val):
            conn = val
            break
        elif val and callable(val) and hasattr(val, "get_model"):
            conn = val
            break
    except Exception:  # noqa: S112
        continue

if conn:
    for method_name in ["model", "get_model"]:
        if hasattr(conn, method_name):
            try:
                method = getattr(conn, method_name)
                ir_model_data_proxy = method("ir.model.data")
                if ir_model_data_proxy:
                    break
            except Exception:  # noqa: S112
                continue


Severity: medium

This block for retrieving the ir.model.data proxy is overly complex and relies on probing several private attributes, which is fragile and can break with library updates. Since odoolib model objects typically store a reference to their connection, you can simplify this logic significantly.

A more direct approach is to access the connection object and call get_model on it. This is more readable, maintainable, and robust.

            conn = getattr(model_obj, "_connection", getattr(model_obj, "connection", None))
            if conn and hasattr(conn, "get_model"):
                try:
                    ir_model_data_proxy = conn.get_model("ir.model.data")
                except Exception:  # noqa: S112
                    pass

bosd and others added 26 commits December 21, 2025 23:15
Adds --auto-defer CLI flag that automatically defers all non-required
many2one fields to Pass 2. This enables progressive import where
records are created first and relational fields are populated
afterwards. Required many2one fields are NOT deferred as they must
succeed in Pass 1.

Usage: odoo-data-flow import --auto-defer --file data.csv --model res.partner
When records are created using the create() method (in fail mode or
when load() falls back to create()), XML IDs were not being persisted
to ir.model.data. This caused XML IDs to be missing after import.

Added _create_xmlid_entry() helper function that:
- Parses module and name from XML ID (uses __import__ for IDs without prefix)
- Creates or updates ir.model.data entry for each created record
- Handles edge cases like existing entries with different res_id

This ensures XML IDs are properly persisted regardless of whether
records are created via load() or create().
…acks

Added new CLI options for better control over import behavior:

--on-missing-ref: Handle missing references per field
  - create: auto-create via name_create
  - skip: skip row (default)
  - empty: set field to False

--auto-create-refs: Auto-create all missing m2o references

--set-empty-on-missing: Set fields to empty on missing refs

--fallback-values: Default values for invalid selection/boolean fields

--tracking-disable/--tracking-enable: Control mail tracking (default: disabled)

--defer-parent-store: Defer parent store computation for hierarchies

These options map to Odoo's native import context parameters:
- name_create_enabled_fields
- import_set_empty_fields
- fallback_values
- defer_parent_store_computation
Performance optimizations:
- Remove hard-coded 4-thread connection cap in RpcThread
  Users can now specify higher --worker values based on server capacity
- Add LRU cache (100k entries) to to_xmlid() function
  Significantly speeds up repeated XML ID sanitizations
- Pre-calculate column filter indices before batch loop
  Ignore set and indices now computed once per batch, not per chunk
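The LRU-cache optimization can be sketched as below. The sanitization rule shown is an assumption for illustration (keep alphanumerics, dots, and underscores; replace everything else); the real `to_xmlid` may apply different rules:

```python
import functools
import re


@functools.lru_cache(maxsize=100_000)
def to_xmlid(name: str) -> str:
    """Sanitize a value into a legal XML-ID component.

    Illustrative rule (an assumption): keep alphanumerics, dots, and
    underscores; replace everything else with '_'. The cache makes
    repeated sanitization of the same value essentially free.
    """
    return re.sub(r"[^a-zA-Z0-9_.]", "_", name)
```

During an import, the same partner names and categories recur across many rows, so cache hit rates are typically high.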

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add protocol selection to import and export commands:
- --protocol option: xmlrpc, xmlrpcs, jsonrpc, jsonrpcs, json2, json2s
- Can also set protocol in connection config file
- JSON-RPC recommended for Odoo 10-18 (~30% faster than XML-RPC)
- JSON-2 supported for Odoo 19+ (requires API key)

Protocol is passed through odoolib which handles the actual connection.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The --ignore CLI option was not being converted from a comma-separated
string to a list before being passed to run_import(), causing a
TypeError when concatenating with deferred_fields list.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Configuration guide:
- Document all protocol options (xmlrpc, jsonrpc, json2)
- Add JSON-RPC performance recommendation for Odoo 10+
- Document JSON-2 API for Odoo 19+ with API key requirements
- Add CLI --protocol override example

Performance tuning guide:
- Add new "Choosing the Right Protocol" section
- Add protocol comparison table
- Add worker tuning section with db_maxconn formula
- Add warnings about connection pool exhaustion

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Verify that import correctly preserves:
- Unicode characters (Japanese, Chinese, Korean, emojis)
- Multiline values in text fields
- Tab characters
- Quoted strings

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add batch_delay parameter to control the pause between batch submissions
during imports. This helps prevent server overload and 503 errors when
importing large datasets.

- Add --delay CLI option (default: 0, recommended: 0.5-2.0 for busy servers)
- Propagate batch_delay through import_data and _orchestrate_pass_1
- Add delay between batch submissions in _run_threaded_pass
- Fix Python 3.14 compatibility for ValueError message format in test

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
When the server returns 502/503 errors indicating overload, the importer
now automatically:
- Detects server overload conditions (502, 503, service unavailable)
- Adds increasing delays (up to 10 seconds) between batch submissions
- Gradually reduces the delay after successful batches
- Combines with user-specified --delay for total throttling

This helps prevent overwhelming busy servers and allows imports to
complete even under high load conditions.
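The overload-detection and backoff behavior can be sketched as follows; `AdaptiveDelay` is a hypothetical name, and the doubling/halving schedule is an illustrative assumption, not the exact algorithm in the PR:

```python
def is_overload(message: str) -> bool:
    """Detect server-overload responses (502/503/service unavailable)."""
    m = message.lower()
    return "502" in m or "503" in m or "service unavailable" in m


class AdaptiveDelay:
    """Grow the inter-batch pause on overload, shrink it on success (cap 10s)."""

    def __init__(self, base: float = 0.0, max_delay: float = 10.0):
        self.base = base          # user-specified --delay
        self.max_delay = max_delay
        self.extra = 0.0          # adaptive component

    def on_error(self, message: str) -> None:
        if is_overload(message):
            # start at 1s, then double, capped at max_delay
            self.extra = min(self.max_delay, self.extra * 2 or 1.0)

    def on_success(self) -> None:
        # gradually relax after healthy batches
        self.extra = self.extra / 2 if self.extra > 0.1 else 0.0

    @property
    def current(self) -> float:
        return self.base + self.extra
```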

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The progress bar was shifting because the RichHandler and Progress bar
use separate Console instances that compete for stdout. Added a context
manager `suppress_console_handler()` that temporarily disables the
RichHandler while a Progress bar is active.

Applied to all Progress bars in:
- import_threaded.py
- export_threaded.py
- write_threaded.py
- importer.py

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Exclude mapper.py (callable objects break introspection)
- Add write_threaded.py and tools.py to compilation
- Add usage documentation to setup.py docstring
- Add *.so to .gitignore

To build with mypyc:
  ODF_COMPILE_MYPYC=1 python setup.py build_ext --inplace

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add comprehensive tests for _extract_per_row_errors function
- Add tests for _filter_ignored_columns edge cases
- Add tests for _execute_write_batch success and failure paths
- Add tests for _execute_load_batch force_create, timeout, and pool errors
- Add tests for _format_odoo_error dict extraction
- Add tests for _create_batch_individually error handling
- Add tests for import_data with dict config
- Add tests for relational_import derivation and query functions
- Add tests for O2M tuple import edge cases
- Add tests for write tuple import edge cases

Coverage improved from 80.65% to 85.28%

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Implements streaming CSV processing that reads and processes data in
batches without loading the entire file into memory:

- Add _stream_csv_batches() generator that yields batches directly from file
- Add _count_csv_rows() for progress bar initialization
- Add _orchestrate_streaming_pass_1() for streaming import orchestration
- Add --stream CLI flag for enabling streaming mode
- Automatic fallback to standard mode when incompatible options are used
  (o2m, groupby, deferred_fields, force_create)

Streaming mode is ideal for very large CSV files where memory is a concern.
When enabled, the importer processes batches as they are read from disk,
significantly reducing peak memory usage.

Usage:
  odoo-data-flow import conn.conf data.csv --model res.partner --stream

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Checkpoint/Resume Support:
- Add checkpoint module for saving/restoring import progress
- Save checkpoint after Pass 1 completes with id_map
- Resume from checkpoint if Pass 1 was already completed
- Delete checkpoint on successful completion
- File hash check prevents resuming if data file changed
- CLI options: --resume/--no-resume, --no-checkpoint

Multi-Company Support:
- Add --all-companies flag to auto-set allowed_company_ids
- Fetches user's company_ids and sets context automatically
- Mimics Odoo web UI behavior for cross-company imports

Bug Fixes:
- Fix Pass 2 failures not being written to fail file
- Use sanitized IDs in source_data_map to match id_map keys

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add --dry-run option to validate CSV data before importing:
- Checks required fields are populated
- Validates selection field values against allowed values
- Verifies relational references exist in Odoo
- Displays formatted validation results with error summary

New validation module:
- ValidationError and ValidationResult dataclasses
- Reference checking for both external IDs and database IDs
- Caching of reference lookups for performance
- Formatted output with rich panels

Usage: odoo-data-flow import --dry-run --file data.csv --model res.partner

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add --check-refs option to verify relational references before import:
- Scans CSV for all many2one/many2many references
- Batch-checks external IDs and database IDs against Odoo
- Reports missing references with examples

Options:
- --check-refs=fail: Abort import if references missing (strict mode)
- --check-refs=warn: Show warning but continue (default)
- --check-refs=skip: Skip the reference check entirely

This helps catch missing reference data early, avoiding partial
imports that fail mid-way through processing.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add intelligent error categorization and retry strategies:

Error Categories:
- Transient: Timeouts, 502/503, deadlocks, connection pool - will retry
- Permanent: Constraint violations, access denied - fail immediately
- Recoverable: Missing references, company issues - suggest alternatives

Features:
- Exponential backoff with configurable base delay and max delay
- Jitter to prevent thundering herd effect
- Retry statistics tracking
- Helper functions for retry decisions
- Recommendations for error handling

Usage:
- categorize_error(error) -> (ErrorCategory, pattern)
- retry_with_backoff(func, config, stats) -> (result, error)
- get_retry_recommendation(error) -> dict with action/message
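A minimal sketch of the categorization and backoff pieces, assuming substring matching against a pattern table (the real table is larger, and the function names mirror but do not reproduce the PR's implementation):

```python
import enum
import random


class ErrorCategory(enum.Enum):
    TRANSIENT = "transient"      # retry with backoff
    PERMANENT = "permanent"      # fail immediately
    RECOVERABLE = "recoverable"  # suggest alternatives


# Illustrative patterns only; the real table covers more cases.
_PATTERNS = [
    (ErrorCategory.TRANSIENT, ("timeout", "502", "503", "deadlock", "pool")),
    (ErrorCategory.RECOVERABLE, ("no matching record", "company")),
    (ErrorCategory.PERMANENT, ("constraint", "access denied")),
]


def categorize_error(message: str):
    """Return (category, matched pattern) for an error message."""
    m = message.lower()
    for category, needles in _PATTERNS:
        for needle in needles:
            if needle in m:
                return category, needle
    return ErrorCategory.PERMANENT, None


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Exponential backoff with full jitter (prevents thundering herd)."""
    return random.uniform(0, min(cap, base * 2 ** attempt))
```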

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add functionality for skip-unchanged record detection:

Features:
- Normalize values for comparison (handles False, empty strings, m2o tuples)
- Compare source values with existing Odoo records
- Filter out unchanged rows before import
- Track statistics (new, changed, unchanged, skip rate)

Key functions:
- get_existing_records(): Fetch records from Odoo by external ID
- find_unchanged_records(): Identify unchanged records from dict data
- filter_unchanged_rows(): Filter unchanged rows from list data
- display_idempotent_stats(): Show import statistics

This module enables imports to be run multiple times safely, only
importing records that have actually changed.
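The value normalization at the heart of this comparison can be sketched as below. Note the simplifying assumption that falsy scalars (`False`, `None`, `""`) are all "empty", which matches how Odoo reports unset fields; the helper names are illustrative:

```python
def normalize(value):
    """Normalize Odoo values for comparison.

    False/None/'' all mean 'empty' (an assumption: 0 is also treated as
    empty by this rule); m2o (id, display_name) tuples compare by id.
    """
    if value in (False, None, ""):
        return None
    if isinstance(value, (tuple, list)) and len(value) == 2:
        return value[0]
    return value


def is_unchanged(source: dict, existing: dict) -> bool:
    """True if every source field matches the existing record after normalization."""
    return all(normalize(v) == normalize(existing.get(k)) for k, v in source.items())
```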

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add adaptive throttling based on server response times:

Server Health Levels:
- HEALTHY: Normal operation, no throttling
- DEGRADED: Slight slowdown, add small delays
- STRESSED: Significant load, reduce batch sizes
- OVERLOADED: Critical, aggressive throttling

Features:
- Rolling average response time monitoring
- Automatic delay adjustment between requests
- Dynamic batch size scaling based on health
- Hysteresis for health recovery (prevents flapping)
- Error recording for server errors (5xx)
- Comprehensive statistics tracking

Configuration:
- Customizable thresholds for each health level
- Configurable delays and batch multipliers
- Aggressive mode for sensitive servers

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Complete integration of the remaining 3 stability features:

1. **Smarter Retry Logic** - Integrated into error handling:
   - Uses ErrorCategory enum to classify errors as transient/permanent
   - Exponential backoff with jitter for server overload (502/503)
   - Database serialization conflict handling with backoff

2. **Idempotent Import Mode** (`--skip-unchanged`):
   - Fetches existing records from Odoo before import
   - Compares field values to detect unchanged records
   - Skips records that haven't changed, making imports idempotent
   - Reports skip statistics in final output

3. **Health-Aware Throttling** (`--adaptive-throttle`):
   - ThrottleController monitors server response times
   - Automatically adjusts delays based on server health
   - Records timing after each batch load operation
   - Reports throttle statistics at end of import

All 597 tests passing.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
This adds a comprehensive workflow for managing VAT validation during
contact imports, addressing VIES API timeouts in large imports.

Features:
- Local VAT format validation with regex patterns for all EU countries
- Checksum validation for BE, DE, NL
- Support for custom validators (e.g., Rust-based via PyO3)
- Save/restore VAT validation settings across companies
- Disable both VIES (online) and stdnum (local) validation
- Batch VIES validation with user notifications

CLI commands:
- vat get-settings: Display current VAT validation settings
- vat disable: Disable VAT validation, save settings to JSON
- vat restore: Restore settings from JSON file
- vat validate: Batch VIES validation with notifications

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add VIES/VAT Manager to API reference (autodoc)
- Add Module Manager to API reference (autodoc)
- Add comprehensive VAT Validation Management guide section
- Include CLI usage examples, programmatic usage, and custom validators

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add return type annotations to test functions
- Fix S110: Add logging to try-except-pass blocks
- Fix C901: Add noqa comments for complex functions
- Fix D417: Add missing docstring parameter descriptions
- Fix E501: Break long lines
- Fix RUF059: Remove/rename unused variables
- Use Optional[str] instead of str | None for Python 3.9 compatibility
- Replace assert type narrowing with conditional checks

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- validation.py: Cast search_count comparisons to bool explicitly
- idempotent.py: Rename loop variable to avoid redefinition
- preflight.py: Cast check_refs comparisons to bool explicitly

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Enables parsing date/datetime columns with custom formats using Polars'
vectorized str.to_date() and str.to_datetime() for efficient conversion.

Example usage:
    processor = Processor(
        mapping={},
        dataframe=df,
        date_formats={"birth_date": "%d/%m/%Y"},
        datetime_formats={"created_at": "%d/%m/%Y %H:%M:%S"},
    )

This provides an alternative to Polars' automatic date detection
(try_parse_dates=True) for cases where explicit format control is needed,
such as ambiguous date formats (DD/MM vs MM/DD).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
bosd and others added 14 commits December 29, 2025 09:26
phone_normalize() now handles additional formats:
- Numbers starting with country code directly (31612...) -> +31612...
- International dialing format with 00 prefix (0031612...) -> +31612...

This ensures phone numbers are always properly prefixed with + when
the country is known, regardless of input format.
…iling)

email() now handles:
- mailto: prefix (case insensitive): mailto:john@example.com -> john@example.com
- Colons as separators: label:john@example.com -> john@example.com
- Multiple colons: Work:Sales:john@example.com -> john@example.com
- Trailing colons: john@example.com: -> john@example.com
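The colon-handling rules above can be sketched in a simplified cleaner (an illustrative version, not the library's exact implementation):

```python
def email(value: str) -> str:
    """Strip mailto: prefixes, label colons, and trailing colons (sketch)."""
    v = value.strip().rstrip(":")
    if v.lower().startswith("mailto:"):
        v = v[len("mailto:"):]
    if ":" in v:
        # keep the part after the last label colon, e.g. 'Work:Sales:x@y' -> 'x@y'
        v = v.rsplit(":", 1)[-1]
    return v.strip()
```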
Implements GitHub issue #171 with:

- separate_city_postal(): Extract city and postal from combined fields
  - Supports country-specific patterns (NL, BE, DE, FR, GB, US, PT, IS, etc.)
  - Auto-detection when country not specified
  - Handles both prefix (FR: "75001 Paris") and suffix (NL: "Amsterdam 1012AB")

- detect_country(): Infer country from phone, postal, or city hints
  - Phone prefix detection (+31 → NL, +33 → FR, etc.)
  - Postal pattern matching (1234AB → NL, 75001 → FR, etc.)
  - Major city lookup (Amsterdam → NL, Paris → FR, etc.)

- Polars-native versions in clean_expr.py:
  - city_from_combined(): Extract city using vectorized operations
  - postal_from_combined(): Extract postal using vectorized operations

- New extensible constants:
  - POSTAL_PATTERNS: Country-specific postal code regex patterns
  - PHONE_PREFIX_TO_COUNTRY: Phone prefix to country code mapping
  - MAJOR_CITIES: City name to country code mapping

Closes #171
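The prefix/suffix split can be sketched with a tiny excerpt of the pattern table; the two patterns shown are assumptions for illustration, and the real module covers many more countries plus auto-detection:

```python
import re

# Illustrative subset of country-specific postal patterns.
POSTAL_PATTERNS = {
    "NL": r"\d{4}\s?[A-Z]{2}",  # 1012AB (suffix position)
    "FR": r"\d{5}",             # 75001 (prefix position)
}


def separate_city_postal(value: str, country: str) -> tuple[str, str]:
    """Split a combined 'city + postal' field into (city, postal)."""
    pattern = POSTAL_PATTERNS[country]
    match = re.search(pattern, value)
    if not match:
        return value.strip(), ""
    postal = match.group(0)
    city = (value[:match.start()] + value[match.end():]).strip(" ,")
    return city, postal
```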

Co-Authored-By: Claude <noreply@anthropic.com>
Adds a cleaner function to normalize company legal suffixes to their
canonical forms. Handles common variations across multiple countries:

- Netherlands: BV → B.V., NV → N.V., V.O.F., C.V.
- Germany: gmbh → GmbH, AG, KG, OHG, GmbH & Co. KG
- Belgium: BVBA → B.V.B.A., SPRL → S.P.R.L.
- France: SARL → S.A.R.L., SAS → S.A.S., S.A.
- UK: Ltd/Limited → Ltd., PLC, LLP
- US: Inc/Incorporated → Inc., LLC, Corp.
- Italy: SPA → S.p.A., SRL → S.r.l.
- Spain: SL → S.L.
- Scandinavia: AS → A/S, AB, Oy, ApS

Features:
- Case-insensitive matching (BV, Bv, bv all work)
- Handles variations with/without dots (B.V. or BV)
- Both row-by-row (clean.py) and Polars-native (clean_expr.py) versions
- Extensible COMPANY_SUFFIX_CANONICAL constant for custom suffixes
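The lookup-table approach can be sketched with a small excerpt of the canonical-suffix mapping (a hypothetical subset; the real constant covers all the countries listed above):

```python
# Illustrative subset of the canonical-suffix table.
COMPANY_SUFFIX_CANONICAL = {
    "bv": "B.V.",
    "b.v.": "B.V.",
    "nv": "N.V.",
    "gmbh": "GmbH",
    "ltd": "Ltd.",
    "limited": "Ltd.",
    "inc": "Inc.",
    "incorporated": "Inc.",
}


def company_suffix(name: str) -> str:
    """Normalize a trailing legal suffix to its canonical form."""
    parts = name.rsplit(" ", 1)
    if len(parts) == 2:
        head, tail = parts
        key = tail.lower()
        # try the dotted form first ('b.v.'), then the dotless one ('bv')
        canonical = COMPANY_SUFFIX_CANONICAL.get(key) or \
            COMPANY_SUFFIX_CANONICAL.get(key.rstrip("."))
        if canonical:
            return f"{head} {canonical}"
    return name
```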

Co-Authored-By: Claude <noreply@anthropic.com>
Remove hardcoded city-to-country mapping to avoid maintenance burden.
City data changes frequently and is better sourced from external data
like GeoNames or Odoo's res.city model.

Changes:
- Remove MAJOR_CITIES constant (~175 lines of city data)
- Update detect_country() to require explicit `cities` parameter
  for city-based detection (phone and postal still work by default)
- Update docstring with guidance on populating cities from external
  sources (GeoNames, Odoo res.city + res.country)
- Update tests to provide cities dict explicitly

Phone prefix and postal pattern detection remain unchanged as these
are standardized (ITU codes, postal standards) and rarely change.

Co-Authored-By: Claude <noreply@anthropic.com>
Add comprehensive documentation for the company suffix cleaner:
- Table of supported countries and canonical forms
- Usage examples with mapper (row-by-row)
- Usage examples with Polars expressions
- Custom suffix mapping examples
- Add COMPANY_SUFFIX_CANONICAL to available constants list

Co-Authored-By: Claude <noreply@anthropic.com>
Add new geonames module providing utilities to download, cache, and query
GeoNames data for city-to-country mapping, postal code validation, and
geographic lookups.

Features:
- load_cities(): Load cities data as Polars DataFrame
- load_alternate_names(): Load alternate names with language filtering
- load_postal_codes(): Load postal codes per country
- get_cities_lookup(): Build city→country dict with alternate name support
- get_postal_lookup(): Build postal code→place name lookup
- get_city_coordinates(): Get latitude/longitude for a city
- download_dataset(): Download and extract GeoNames data files
- Auto-caching in ~/.cache/odoo-data-flow/geonames/

Supports datasets: cities500, cities1000, cities5000, cities15000,
alternateNamesV2, allCountries

Integrates with clean.detect_country() for dynamic city lookups instead
of hardcoded city data.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Filter out invalid values starting with "e-" (returns None)
- Remove comma characters in addition to spaces
- Add tests for both clean.py and clean_expr.py

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Document the improved zip_code() cleaner behavior:
- Removes spaces and commas
- Filters out values starting with e- prefix

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add city name cleaning function to both clean.py and clean_expr.py:
- Strips whitespace and normalizes to title case
- Removes parenthetical notes like "(Noord-Holland)"
- Removes trailing postal codes
- Removes leading/trailing punctuation (commas, periods)
- Collapses multiple spaces
- Filters out invalid values starting with "e-"

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
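The city-cleaning rules listed above can be sketched as a single pipeline. The regexes here are illustrative assumptions about how the described steps might be implemented, not the project's exact code:

```python
import re


def clean_city(value):
    """Sketch of the city() cleaner: strip, drop parenthetical notes and
    trailing postal codes, trim punctuation, collapse spaces, title-case,
    and reject 'e-' prefixed values."""
    if value is None:
        return None
    s = str(value).strip()
    if s.lower().startswith("e-"):
        return None
    s = re.sub(r"\([^)]*\)", "", s)                 # "(Noord-Holland)" notes
    s = re.sub(r"\s*\d{4,}\s*[A-Z]{0,2}$", "", s)   # trailing postal codes
    s = s.strip(" ,.")                              # leading/trailing punctuation
    s = re.sub(r"\s+", " ", s)                      # collapse multiple spaces
    return s.title() or None
```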
Add street address cleaning function to both clean.py and clean_expr.py:
- Strips whitespace
- Removes parenthetical notes like "(Apt 4)"
- Removes leading/trailing punctuation (commas, periods)
- Collapses multiple spaces
- Preserves original case (unlike city())
- Filters out invalid values starting with "e-"
The 'id' field is always mandatory for imports as the external ID,
so it should not trigger a readonly field warning.

- Skip 'id' field when checking for readonly fields
- Add test assertions to verify 'id' is not in warning message
Place fail files in environment-specific subfolders based on config file name:
- test_connection.conf -> data/test/res_partner_fail.csv
- uat_connection.conf -> data/uat/res_partner_fail.csv

Added _get_env_from_config() to extract environment name from config file.
In --fail mode, looks for fail file in the correct environment folder.
Environment folder is created automatically if it doesn't exist.
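The mapping from config file name to environment folder can be sketched as follows. The helper name mirrors `_get_env_from_config()` mentioned above, but the body is an assumption about its behaviour based on the examples given:

```python
from pathlib import Path


def get_env_from_config(config_path):
    """Derive the environment name from a config file name:
    'test_connection.conf' -> 'test', 'uat_connection.conf' -> 'uat'."""
    stem = Path(config_path).stem
    if stem.endswith("_connection"):
        return stem[: -len("_connection")]
    return stem


def fail_file_path(config_path, model):
    """Place the fail file in the environment-specific subfolder."""
    env = get_env_from_config(config_path)
    return Path("data") / env / f"{model.replace('.', '_')}_fail.csv"
```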
@bosd bosd force-pushed the feature/production-ready-etl branch from bb4b8d1 to 9cfb151 on December 31, 2025 13:23
bosd and others added 15 commits December 31, 2025 16:02
Document the environment-based fail file placement feature:
- How environment name is extracted from config file
- Table showing config file to fail file path mapping
- Example commands for import and retry
- Benefits of the feature
When users encounter access/permission errors during import, the fail
files now show clean, user-friendly messages instead of long technical
JSON error structures.

The new _extract_access_error_message() function:
- Extracts "cannot be called remotely" errors with the method name
- Parses nested data.message from Odoo error responses
- Falls back to top-level message if data.message unavailable
- Truncates excessively long error strings

This makes it easier for users to understand why records failed,
especially when dealing with insufficient permissions.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
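The error-extraction logic described above can be sketched like this. It is a simplified stand-in for `_extract_access_error_message()`: prefer the nested `data.message` from an Odoo error payload, fall back to the top-level message, and truncate overlong strings.

```python
def extract_access_error_message(error, max_len=200):
    """Sketch: reduce an Odoo RPC error payload to a short, readable message."""
    if isinstance(error, dict):
        data = error.get("data") or {}
        # nested data.message is usually the human-readable part
        message = data.get("message") or error.get("message") or str(error)
    else:
        message = str(error)
    message = str(message).strip()
    if len(message) > max_len:
        message = message[: max_len - 3] + "..."
    return message
```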
…otely

Some Odoo models (like custom models with restricted access) don't allow
the browse() method to be called remotely via RPC. This caused imports
to fail with "Private methods cannot be called remotely" errors.

Changes:
- Pass connection object through thread_state to access other models
- Use connection.get_model("ir.model.data") instead of model.browse().env
- Update _create_xmlid_entry to accept connection instead of model
- Update _create_batch_individually to accept and use connection
- Update _orchestrate_pass_1 and _orchestrate_streaming_pass_1 signatures

This allows importing into models that have restricted browse() access.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The _convert_external_id_field function was still using model.env.ref()
to look up external IDs, which also triggers browse() internally and
fails for models where browse is not allowed remotely.

Changed to use ir.model.data lookups via connection.get_model() instead,
consistent with the previous fix for _create_xmlid_entry.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
When accessing .id on the record object returned by model.create(),
erppeek may internally call browse() to fetch the record, which fails
for models where browse is not allowed remotely.

Now handles both cases:
- create() returns an int ID directly (raw RPC behavior)
- create() returns a record object (erppeek behavior)

Uses int() conversion instead of .id access to avoid triggering browse.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
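The int-or-record handling can be sketched as a tiny normaliser. `int()` conversion works for both a raw integer ID and an erppeek record object (which supports `__int__`) without touching `.id`, which could trigger a remote `browse()`:

```python
def record_id(created):
    """Sketch: normalise model.create() output to an int ID."""
    try:
        return int(created)
    except (TypeError, ValueError):
        raise ValueError(f"Unexpected create() return value: {created!r}")
```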
- Fixed bug where pass 2 self-referencing field lookups (like parent_id)
  would fail because field values weren't sanitized to match id_map keys
- Added to_xmlid() sanitization for both source_id and related field values
  in _prepare_pass_2_data to ensure consistent key format matching
- Improved logging between pass 1 and pass 2 for better debugging:
  - Added info log when pass 1 completes with record count
  - Added info log when checkpoint is saved after pass 1
  - Added info log when pass 2 starts with deferred fields
- Changed missing reference logs from debug to warning level for
  easier troubleshooting of unresolved parent references
- Added debug logging for successful self-reference and external ID
  resolution to help track pass 2 processing

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
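The key-format mismatch fixed here comes down to applying the same sanitisation to both sides of the lookup. A hypothetical sketch of a `to_xmlid()`-style sanitiser (the real one may differ in which characters it allows):

```python
import re


def to_xmlid(value):
    """Sketch: sanitise a value into XML-ID-safe form so source_id and
    deferred field values hit the same id_map keys in Pass 2."""
    s = str(value).strip()
    return re.sub(r"[^A-Za-z0-9_.]", "_", s)
```

Applying this to both the `parent_id` value and the `id_map` keys makes self-reference lookups consistent.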
Added info-level logging before and after executor.shutdown() to help
identify if the thread pool shutdown is causing imports to hang at
the end of pass 1.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Log messages (log.info) are suppressed during progress display by
suppress_console_handler(). Changed to use progress.console.print()
so diagnostic messages are visible:

- "All batches processed, shutting down thread pool..."
- "Thread pool shutdown complete"
- "Pass 1 complete: X records created"
- "Saving checkpoint after Pass 1..."
- "Checkpoint saved: X records"
- "Starting Pass 2 for deferred fields: [...]"

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Added console.print messages to track Pass 2 progress:
- "Pass 2: Preparing data for X records..."
- "Pass 2: X records have parent references to update"
- "Pass 2: Grouped into X unique parent values"
- "Pass 2: Starting X batches..."
- "Pass 2: Threaded pass complete"

This helps identify where Pass 2 hangs.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Added print() statements to track progress inside _prepare_pass_2_data:
- "Getting ir.model.data proxy..."
- "ir.model.data proxy: found/not found"
- "Processing X records..."
- "Processed X/Y records..." (every 1000 records)
- "Data preparation complete: X records to update"

This helps identify if the hang is in proxy retrieval or record processing.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Added counters to diagnose why Pass 2 hangs:
- found_in_idmap: parent references resolved from id_map (fast)
- not_in_idmap: parent references not in id_map
- rpc_lookups: times _resolve_external_id_for_pass2 is called (slow)

If RPC lookups is high, that explains the hang - each lookup makes
multiple RPC calls to ir.model.data.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Added cache for external ID lookups to avoid repeated RPC calls
  for the same parent reference (major speedup if many records share
  the same parent)
- Progress now shows every 500 records OR every 5 seconds
- Shows processing rate (records/second)
- Shows cache hits vs RPC lookups so user can see the benefit
- Format: "[Pass 2] 500/8514 (120/s) | idmap: 450, rpc: 30, cache: 20"

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
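The caching idea above can be sketched as a wrapper around the slow RPC resolver, with counters matching the stats in the progress line. `rpc_lookup` stands in for the real `_resolve_external_id_for_pass2()` call:

```python
def make_cached_resolver(rpc_lookup):
    """Sketch: memoise XML-ID -> database-id lookups so each distinct
    parent reference triggers at most one RPC round-trip."""
    cache = {}
    stats = {"cache_hits": 0, "rpc_lookups": 0}

    def resolve(xmlid):
        if xmlid in cache:
            stats["cache_hits"] += 1
        else:
            stats["rpc_lookups"] += 1
            cache[xmlid] = rpc_lookup(xmlid)
        return cache[xmlid]

    return resolve, stats
```

When many records share the same parent, the hit counter dominates and Pass 2 throughput recovers.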
The preflight check was detecting m2m/o2m fields and automatically
adding them to deferred_fields, causing imports like res.users to fail
because company_ids and group_ids were being deferred unexpectedly.

Changed logic so auto-detected deferred fields are only used when:
1. User explicitly specifies --deferred-fields, OR
2. User enables --auto-defer flag

Without these flags, detected fields are logged at DEBUG level but not
applied, preserving backward compatibility.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…ring

When auto_defer is disabled (the default), the preflight no longer:
- Logs INFO-level "Detected deferrable fields" messages
- Sets up import_plan["deferred_fields"]
- Requires unique_id_field for 2-pass import

Now deferrable fields are only logged at DEBUG level when detected but
not applied. This eliminates confusion for users who see the message
but aren't actually doing 2-pass imports.

Updated tests to pass auto_defer=True when testing deferral behavior.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Replaced _create_batch_individually with _load_records_individually
that uses Odoo's native load() method with single records instead of
create(). This ensures XML IDs are properly created in ir.model.data
automatically, eliminating the need for manual XML ID creation which
could fail independently.

Benefits:
- XML IDs are always created correctly (Odoo handles it natively)
- No more manual ir.model.data entry creation
- Consistent behavior between batch and individual record processing
- Simpler code with fewer failure points

The old function name is kept as an alias for backward compatibility.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
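The single-record `load()` fallback can be sketched as below. Odoo's `load()` returns a dict with `ids` (a list, or `False` on failure) and `messages`; the loop shape here is an illustration of `_load_records_individually`, not its exact code:

```python
def load_records_individually(model, header, rows):
    """Sketch: call Odoo's native load() one record at a time so XML-IDs
    are created in ir.model.data automatically, and collect per-row results."""
    results = []
    for row in rows:
        res = model.load(header, [row])
        ids = res.get("ids") or []            # 'ids' is False on failure
        messages = res.get("messages") or []
        results.append((ids[0] if ids else None, messages))
    return results
```

Each row fails or succeeds independently, which is what makes per-record error reporting possible.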