Commit 493ee02
PYTHON-5683: Spike: Investigate using Rust for Extension Modules
- Implement comprehensive Rust BSON encoder/decoder
- Add Evergreen CI configuration and test scripts
- Add GitHub Actions workflow for Rust testing
- Add runtime selection via PYMONGO_USE_RUST environment variable
- Add performance benchmarking suite
- Update build system to support Rust extension
- Add documentation for Rust extension usage and testing
Fix Rust extension to respect document_class in CodecOptions
- Extract document_class from codec_options and call it to create document instances
- Fixes test_custom_class test
- Test pass rate: 70/88 (80%)
Fix Rust extension to respect tzinfo in CodecOptions for datetime decoding
- Extract tzinfo from codec_options and convert datetime to that timezone using astimezone()
- Matches C extension behavior
- Fixes test_local_datetime test
- Test pass rate: 71/88 (81%)
Add BSON validation to Rust extension
- Validate minimum size (5 bytes)
- Validate size field matches actual data length
- Validate document ends with null terminator (0x00)
- Validate no extra bytes after document
- Fixes test_basic_validation test
- Test pass rate: 72/88 (82%)
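The four framing checks listed above can be sketched in Python as follows. This is only an illustration of the rules, not the Rust code from this commit; the function name `validate_bson_document` and the error messages are hypothetical.

```python
import struct

MIN_BSON_SIZE = 5  # 4-byte little-endian length prefix + 1-byte terminator


def validate_bson_document(data: bytes) -> None:
    """Raise ValueError unless data frames exactly one BSON document.

    Hypothetical helper: only the outer framing is checked, not the elements.
    """
    if len(data) < MIN_BSON_SIZE:
        raise ValueError("not enough data for a BSON document")
    (declared_size,) = struct.unpack_from("<i", data, 0)
    if declared_size < MIN_BSON_SIZE:
        raise ValueError("declared size is smaller than the minimum document")
    if declared_size > len(data):
        raise ValueError("declared size exceeds the available data")
    if declared_size < len(data):
        raise ValueError("extra bytes after the document")
    if data[declared_size - 1] != 0:
        raise ValueError("document is not terminated by 0x00")
```

The empty document `b"\x05\x00\x00\x00\x00"` is the smallest input that passes all four checks.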
Update README to reflect comprehensive Rust extension implementation
- Changed 'Current Limitations' to 'Current Status'
- Listed all implemented BSON types and features
- Updated test pass rate to 82% (72/88 tests)
- Clarified remaining work items
- Removed outdated limitation about missing BSON types
Fix unknown BSON type error message format to match C extension
- Parse BSON data to extract field name when unknown type is encountered
- Use last non-numeric field name before unknown type (handles nested structures)
- Matches C extension error format: 'Detected unknown BSON type b'\xNN' for fieldname 'foo''
- Fixes test_unknown_type test
- Test pass rate: 73/88 (83%)
Update GitHub Actions workflow to test Rust extension capabilities instead of limitations
- Changed 'Test Rust extension limitations' to 'Test Rust extension with complex types'
- Now tests that ObjectId, DateTime, and Decimal128 work correctly
- Removed outdated tests expecting TypeError for these types
- Reflects comprehensive BSON implementation in Rust extension
Add UUID representation validation to Rust extension
- Check uuid_representation from codec_options when encoding native UUID
- Raise ValueError if uuid_representation is UNSPECIFIED (0)
- Use appropriate Binary subtype based on uuid_representation value
- Matches C extension behavior and error message
- Fixes test_uuid and test_decode_all_defaults tests
- Test pass rate: 75/88 (85%)
Add buffer protocol support for decode in Rust extension
- Support memoryview, array.array, and mmap objects
- Try multiple methods: extract, __bytes__, tobytes(), read()
- Handles all buffer protocol objects that C extension supports
- Fixes test_decode_buffer_protocol test
- Test pass rate: 76/88 (86%)
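A rough Python sketch of the "try several ways to get at the bytes" idea is shown below. It is an assumption-laden simplification: the helper name `coerce_to_bytes` is hypothetical, and it tries only the buffer protocol and then `read()`, which is a shorter chain than the one the commit lists.

```python
def coerce_to_bytes(obj) -> bytes:
    """Return the raw bytes behind obj (hypothetical helper, not the Rust code).

    Order of attempts:
      1. bytes is returned unchanged;
      2. anything exporting the buffer protocol (bytearray, memoryview,
         array.array, mmap.mmap, ...) is copied out through memoryview;
      3. file-like objects are drained with read().
    """
    if isinstance(obj, bytes):
        return obj
    try:
        return memoryview(obj).tobytes()
    except TypeError:
        pass  # obj does not export a buffer; try the next strategy
    read = getattr(obj, "read", None)
    if callable(read):
        return bytes(read())
    raise TypeError(f"cannot obtain bytes from {type(obj).__name__}")
```

Going through `memoryview` first means a single code path covers every buffer exporter, so the per-type special cases are only needed for objects that do not export a buffer.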
Fix UUID representation values in Rust extension
- Correct UUID representation values: PYTHON_LEGACY=3, STANDARD=4, JAVA_LEGACY=5, CSHARP_LEGACY=6
- STANDARD now correctly uses Binary subtype 4 instead of 3
- UUID decoding now works correctly with all representations
- Fixes test_decode_all_kwarg test
- Test pass rate: 77/88 (88%)
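The mapping from representation value to Binary subtype and byte order can be illustrated in Python. The integer values are the ones listed in this commit; the function name `encode_uuid` and the shortened error message are hypothetical, and the byte layouts follow the usual legacy-driver conventions (all legacy representations use subtype 3, STANDARD uses subtype 4).

```python
import uuid

# Representation values as listed in the commit message.
UNSPECIFIED, PYTHON_LEGACY, STANDARD, JAVA_LEGACY, CSHARP_LEGACY = 0, 3, 4, 5, 6

OLD_UUID_SUBTYPE = 3  # used by every legacy representation
UUID_SUBTYPE = 4      # used by STANDARD


def encode_uuid(value: uuid.UUID, uuid_representation: int) -> tuple[int, bytes]:
    """Return (binary_subtype, payload) for a native UUID (illustrative only)."""
    if uuid_representation == UNSPECIFIED:
        # Illustrative message, not the exact text used by the extensions.
        raise ValueError("cannot encode native uuid.UUID with UNSPECIFIED")
    if uuid_representation == STANDARD:
        return UUID_SUBTYPE, value.bytes
    if uuid_representation == PYTHON_LEGACY:
        return OLD_UUID_SUBTYPE, value.bytes
    if uuid_representation == JAVA_LEGACY:
        raw = value.bytes
        # Java legacy stores each 8-byte half in reversed byte order.
        return OLD_UUID_SUBTYPE, raw[0:8][::-1] + raw[8:16][::-1]
    if uuid_representation == CSHARP_LEGACY:
        # C# legacy stores the first three fields little-endian.
        return OLD_UUID_SUBTYPE, value.bytes_le
    raise ValueError(f"unknown uuid_representation: {uuid_representation}")
```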
Add DBPointer support to Rust extension
- DBPointer is a deprecated BSON type that decodes to DBRef
- Parse DbPointer Debug output to extract namespace and ObjectId
- Fixes test_dbpointer test
- Test pass rate: 78/88 (89%)
Add DatetimeMS support to Rust extension
- Check datetime_conversion from codec_options
- Return DatetimeMS objects when datetime_conversion=DATETIME_MS (value 3)
- Validate that DatetimeMS.__int__() returns integer, not float
- Fixes test_class_conversions test
- Test pass rate: 80/88 (91%)
Add datetime clamping support to Rust extension
- Implement DATETIME_CLAMP mode (value 2) to clamp out-of-range values
- Clamp to Python datetime range: -62135596800000 to 253402300799999 ms
- Raise OverflowError for out-of-range values in DATETIME_AUTO mode
- Fixes test_clamping, test_tz_clamping_local, test_tz_clamping_non_hashable, test_tz_clamping_utc tests
- Test pass rate: 84/88 (95%)
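The clamping bounds quoted above correspond to `datetime.min` and `datetime.max` truncated to millisecond precision. A minimal Python sketch of the clamp, with hypothetical helper names and naive UTC datetimes only, looks like this:

```python
from datetime import datetime, timedelta

EPOCH_NAIVE = datetime(1970, 1, 1)

# Bounds from the commit message, in milliseconds since the Unix epoch.
MIN_DATETIME_MS = -62135596800000  # 0001-01-01T00:00:00.000
MAX_DATETIME_MS = 253402300799999  # 9999-12-31T23:59:59.999


def clamp_millis(millis: int) -> int:
    """Pin millis into the range representable by datetime.datetime."""
    return max(MIN_DATETIME_MS, min(MAX_DATETIME_MS, millis))


def millis_to_datetime_clamped(millis: int) -> datetime:
    """Convert BSON milliseconds to a naive UTC datetime, clamping on overflow."""
    return EPOCH_NAIVE + timedelta(milliseconds=clamp_millis(millis))
```

Because the bounds are expressed in whole milliseconds, the upper clamp lands 999 microseconds short of `datetime.max`.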
Add DATETIME_AUTO support to Rust extension
- DATETIME_AUTO (value 4) returns DatetimeMS for out-of-range values
- Default to DATETIME_AUTO when no datetime_conversion specified
- Fixes test_datetime_auto test
- Test pass rate: 85/88 (97%)
Add InvalidBSON error for extremely out-of-range datetime values
- Raise InvalidBSON with helpful error message for values beyond ±2^52
- Include suggestion to use DATETIME_AUTO mode
- Fixes test_millis_from_datetime_ms test
- Test pass rate: 86/88 (98%)
Fix timezone clamping with non-UTC timezones
- Track original millis value before clamping
- Handle OverflowError during astimezone() by checking if datetime is at min or max
- Return datetime.min or datetime.max with target tzinfo when overflow occurs
- Fixes test_tz_clamping_non_utc test
- Test pass rate: 87/88 (99%)
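The overflow handling described above can be approximated in Python. This is a sketch under assumptions: the helper name `to_timezone_clamped` is hypothetical, and it picks the nearer boundary rather than tracking the original millisecond value as the commit does.

```python
from datetime import datetime, timedelta, timezone, tzinfo


def to_timezone_clamped(dt_utc: datetime, tz: tzinfo) -> datetime:
    """Convert an aware UTC datetime to tz, pinning to min/max on overflow.

    A value already clamped to datetime.min or datetime.max in UTC cannot be
    shifted further by a UTC offset, so astimezone() raises OverflowError.
    """
    try:
        return dt_utc.astimezone(tz)
    except OverflowError:
        naive = dt_utc.replace(tzinfo=None)
        # Decide which boundary the value overflowed past.
        if naive - datetime.min < datetime.max - naive:
            return datetime.min.replace(tzinfo=tz)
        return datetime.max.replace(tzinfo=tz)
```

For example, converting `datetime.min` in UTC to a UTC-5 zone would need a year-0 date, so the helper returns `datetime.min` tagged with the target zone instead.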
Implement unicode_decode_error_handler support
- When UTF-8 error is detected and unicode_decode_error_handler is not 'strict', fall back to Python implementation
- Python implementation correctly handles all error handlers (replace, backslashreplace, surrogateescape, ignore)
- Saves reference to Python _bson_to_dict implementation before it gets overridden by Rust extension
- Removed unused decode_bson_with_utf8_handler function
- Fixes test_unicode_decode_error_handler test
- Test pass rate: 88/88 (100%)
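The "strict fast path, then fall back" pattern can be shown on a single string in Python. The helper name `decode_bson_string` is hypothetical; the real change falls back to the pure-Python `_bson_to_dict` for the whole document rather than for one string.

```python
def decode_bson_string(raw: bytes, handler: str = "strict") -> str:
    """Decode UTF-8 strictly first; use the error handler only on failure."""
    try:
        return raw.decode("utf-8")  # fast path: valid UTF-8 needs no handler
    except UnicodeDecodeError:
        if handler == "strict":
            raise
        # Slow path: let Python's codec machinery apply the chosen handler.
        return raw.decode("utf-8", errors=handler)
```

With the invalid input `b"ab\xff"`, the handlers `replace`, `ignore`, `backslashreplace`, and `surrogateescape` yield `"ab\ufffd"`, `"ab"`, `"ab\\xff"`, and `"ab\udcff"` respectively.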
Update README with 100% test pass rate
- All 88 tests now passing
- Complete codec_options support implemented
- Datetime clamping and unicode error handlers working
- Ready for performance benchmarking
Add performance benchmarking and analysis
- Created comprehensive benchmark suite comparing C vs Rust extensions
- Added profiling script to identify bottlenecks
- Added micro-benchmarks for specific document types
- Updated README.md with performance analysis and recommendations
Optimize Rust extension: fast-path for common types and efficient _id handling
- Added fast-path that checks int/str/float/bool/None FIRST before expensive module lookups
- Moved _type_marker check before UUID/datetime/regex checks
- Optimized _id field handling to avoid creating new document and copying all fields
- Simplified mapping item processing
Implemented datetime_to_millis() function in Rust that:
- Extracts datetime components (year, month, day, hour, minute, second, microsecond)
- Checks for timezone offset using utcoffset() method
- Uses Python's calendar.timegm() for accurate epoch calculation
- Adjusts for timezone offset
- Converts to milliseconds
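The same steps expressed in Python, as a reference for what the Rust function is described as computing. This is a sketch rather than a copy of any existing implementation.

```python
import calendar
from datetime import datetime


def datetime_to_millis(dtm: datetime) -> int:
    """Convert a datetime to integer milliseconds since the Unix epoch.

    Aware datetimes are first shifted by their UTC offset; naive datetimes
    are treated as UTC. Microseconds are truncated to whole milliseconds.
    """
    offset = dtm.utcoffset()
    if offset is not None:
        dtm = dtm - offset
    # timegm() interprets the broken-down time as UTC and ignores tm_isdst.
    return calendar.timegm(dtm.timetuple()) * 1000 + dtm.microsecond // 1000
```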
Add type object caching to avoid repeated module imports
- UUID class from uuid module
- datetime class from datetime module
- Pattern class from re module
Fixed mypy errors by changing type ignore comments from attr-defined to union-attr:
- bson/__init__.py: Fixed union-attr errors for _cbson and _rbson module attributes
- tools/fail_if_no_c.py: Removed unnecessary type ignore (C extension is built)
- tools/clean.py: Removed unnecessary type ignore (C extension is built)
Also fixed typing issues in performance test files:
- test/performance/benchmark_bson.py: Added type annotations for function signatures
- test/performance/micro_benchmark.py: Added explicit type annotations for dict literals
Fix test_default_exports by cleaning up spec variable
- The 'spec' variable used during module initialization was being left in the
module namespace, causing test_default_exports.py to fail. Added 'del spec'
to clean up the variable after use.
Fix shellcheck warning in bson/_rbson/build.sh
- Changed trap command to use single quotes instead of double quotes to ensure
the referenced variable expands when the trap is executed (on EXIT) rather than
when the trap is set.
Fix Windows path handling in bson/_rbson/build.sh
- Changed the Python script to receive paths as command-line arguments via
sys.argv instead of embedding them in the script string. This ensures proper
path handling on Windows where paths like 'C:\...' would be mangled when
embedded in shell strings.
- Also use pathlib.Path consistently for cross-platform path handling.
Remove emoji from Python print to fix Windows encoding error
- Removed emoji characters (✅ ❌) from Python print statements to avoid
UnicodeEncodeError on Windows where the default encoding (cp1252) doesn't
support these characters.
Remove all emoji characters from Python print statements
- Removed emoji characters (✓, ✅, ❌, ✗) from all Python print statements
in shell scripts and GitHub workflows to fix UnicodeEncodeError on Windows
where the default encoding (cp1252) doesn't support these characters.
Fix cross-compatibility test to preserve extension modules
- When reloading the bson module to switch from C to Rust extension, we need
to preserve the extension modules (_cbson and _rbson) in sys.modules.
Otherwise, when bson is re-imported, it can't find the already-loaded
extensions and falls back to C.
- Save references to _cbson and _rbson before clearing sys.modules
- Only clear bson modules that aren't the extensions
- Restore the extension modules before re-importing bson
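The save, clear, and restore sequence above can be written as a small generic helper. The name `reimport_preserving` is hypothetical, and the sketch works on any package name rather than on `bson` specifically.

```python
import importlib
import sys


def reimport_preserving(package: str, preserved: list[str]):
    """Re-import package from scratch while keeping selected submodules loaded.

    preserved holds fully qualified module names (for the test described
    above these would be 'bson._cbson' and 'bson._rbson').
    """
    # 1. Save references to the modules that must survive the reload.
    saved = {name: sys.modules[name] for name in preserved if name in sys.modules}
    # 2. Drop the package and its submodules, except the preserved ones.
    prefix = package + "."
    for name in list(sys.modules):
        if (name == package or name.startswith(prefix)) and name not in saved:
            del sys.modules[name]
    # 3. Make sure the preserved modules are registered, then re-import.
    sys.modules.update(saved)
    return importlib.import_module(package)
```

The re-imported package is a new module object, while the preserved submodules keep their identity, so any state held by an extension module is not initialized a second time.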
Fix extension module reloading to reuse already-loaded modules
- Modified bson module initialization to check if extension modules (_cbson,
_rbson) are already loaded in sys.modules before creating new instances.
This allows the module to be reloaded with different settings (e.g.,
PYMONGO_USE_RUST) without losing access to already-loaded extensions.
- Check sys.modules for 'bson._cbson' and 'bson._rbson' before using
importlib.util.module_from_spec()
- Reuse existing module instances when available
- Renamed 'spec' to '_spec' to avoid namespace pollution
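A minimal Python sketch of the "look in `sys.modules` before creating a module from its spec" check is given below. The helper name `load_or_reuse` is hypothetical and the real initialization code in `bson/__init__.py` may be structured differently.

```python
import importlib.util
import sys


def load_or_reuse(name: str):
    """Return the already-loaded module if present, else load it from its spec."""
    existing = sys.modules.get(name)
    if existing is not None:
        return existing  # reuse: never create a second instance
    spec = importlib.util.find_spec(name)
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot find module {name!r}")
    module = importlib.util.module_from_spec(spec)
    sys.modules[name] = module  # register before executing, as import does
    try:
        spec.loader.exec_module(module)
    except BaseException:
        sys.modules.pop(name, None)  # do not leave a half-initialized module
        raise
    return module
```

Keeping the spec in a local variable inside a function also avoids the leftover module-level name that the `spec` to `_spec` rename addresses.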
Fix benchmark script to preserve extension modules when reloading
- Modified benchmark_bson.py to preserve extension modules (_cbson, _rbson)
when reloading the bson module to switch implementations. This is the same
fix applied to the cross-compatibility test.
- Save references to _cbson and _rbson before clearing sys.modules
- Only clear bson modules that aren't the extensions
- Restore extension modules before re-importing bson
Optimize Rust BSON encoding with PyDict and PyList fast paths
- Added fast-path optimizations for the most common Python types to reduce
overhead from Python API calls:
  - PyDict fast path in python_mapping_to_bson_doc():
    - Iterate directly over PyDict items instead of calling items() method
    - Avoids creating intermediate list of tuples
    - Pre-allocate vector with known capacity
    - Added extract_dict_item() helper for dict-specific extraction
  - PyList and PyTuple fast paths in handle_remaining_python_types():
    - Check for PyList/PyTuple before generic sequence extraction
    - Use direct iteration with pre-allocated capacity
    - Avoids expensive extract::<Vec<Bound<'_, PyAny>>>() call
- Also fixed micro_benchmark.py to preserve extension modules when reloading
Added profile_nested.py to identify specific performance bottlenecks in:
- Nested dictionaries (3 and 5 levels deep)
- Wide dictionaries (10 keys)
- Lists of dictionaries
- Lists of integers
Implement direct BSON byte writing for major performance improvement
- Major architectural change: Instead of building intermediate bson::Document
structures and then serializing them, we now write BSON bytes directly.
- Added write_document_bytes() - writes BSON documents directly to bytes
- Added write_element() - writes individual BSON elements with type-specific encoding
- Added write_array_bytes() and write_tuple_bytes() - direct array encoding
- Added helper functions: write_cstring(), write_string(), write_bson_value()
- Modified _dict_to_bson() to use the new direct byte writing approach
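To make the direct-writing idea concrete, here is a Python sketch that appends BSON bytes to one growing buffer and back-patches each length prefix. The function names echo the ones listed above but are Python stand-ins, not the Rust code, and only a few types are handled (bool, int, float, str, None, dict, list/tuple). Keys are assumed to be strings without embedded NUL bytes.

```python
import struct


def write_cstring(buf: bytearray, text: str) -> None:
    """Append a NUL-terminated UTF-8 key."""
    buf.extend(text.encode("utf-8"))
    buf.append(0)


def _write_items(buf: bytearray, items) -> None:
    """Write a document body from (key, value) pairs, patching the size last."""
    start = len(buf)
    buf.extend(b"\x00\x00\x00\x00")  # placeholder for the int32 size
    for key, value in items:
        write_element(buf, key, value)
    buf.append(0)  # document terminator
    struct.pack_into("<i", buf, start, len(buf) - start)


def write_element(buf: bytearray, key: str, value) -> None:
    """Append one element: type byte, key, then the type-specific payload."""
    if isinstance(value, bool):  # checked before int: bool subclasses int
        buf.append(0x08)
        write_cstring(buf, key)
        buf.append(1 if value else 0)
    elif isinstance(value, int):
        is_int32 = -(2**31) <= value < 2**31
        payload = struct.pack("<i" if is_int32 else "<q", value)
        buf.append(0x10 if is_int32 else 0x12)
        write_cstring(buf, key)
        buf.extend(payload)
    elif isinstance(value, float):
        buf.append(0x01)
        write_cstring(buf, key)
        buf.extend(struct.pack("<d", value))
    elif isinstance(value, str):
        data = value.encode("utf-8")
        buf.append(0x02)
        write_cstring(buf, key)
        buf.extend(struct.pack("<i", len(data) + 1))  # length includes the NUL
        buf.extend(data)
        buf.append(0)
    elif value is None:
        buf.append(0x0A)
        write_cstring(buf, key)
    elif isinstance(value, dict):
        buf.append(0x03)
        write_cstring(buf, key)
        _write_items(buf, value.items())
    elif isinstance(value, (list, tuple)):
        buf.append(0x04)
        write_cstring(buf, key)
        # Arrays are documents whose keys are the decimal indices.
        _write_items(buf, ((str(i), item) for i, item in enumerate(value)))
    else:
        raise TypeError(f"unsupported type in this sketch: {type(value).__name__}")


def dict_to_bson(doc: dict) -> bytes:
    """Encode doc by writing straight into a single buffer."""
    buf = bytearray()
    _write_items(buf, doc.items())
    return bytes(buf)
```

Reserving four bytes and patching them afterwards is what removes the need for an intermediate document tree: the size of a nested document is known only once its last element has been written.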
Implement direct BSON byte reading for improved decode performance
- Added direct BSON-to-Python decoding that reads bytes directly without
the intermediate Document structure.
- Added read_document_from_bytes() - reads BSON documents directly from bytes
- Added read_bson_value() - reads individual BSON values
- Added read_array_from_bytes() - reads BSON arrays
- Modified _bson_to_dict() to use the new direct byte reading approach
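The decoding side can be sketched the same way: walk the byte string with an offset and build Python objects as each element is read. The names mirror the functions listed above but are Python stand-ins, and only a handful of type bytes are handled; everything else raises an error.

```python
import struct


def read_cstring(data: bytes, pos: int) -> tuple[str, int]:
    """Read a NUL-terminated key starting at pos; return (key, next_pos)."""
    end = data.index(b"\x00", pos)
    return data[pos:end].decode("utf-8"), end + 1


def read_bson_value(data: bytes, pos: int, type_byte: int):
    """Read one value of the given type at pos; return (value, next_pos)."""
    if type_byte == 0x10:  # int32
        return struct.unpack_from("<i", data, pos)[0], pos + 4
    if type_byte == 0x12:  # int64
        return struct.unpack_from("<q", data, pos)[0], pos + 8
    if type_byte == 0x01:  # double
        return struct.unpack_from("<d", data, pos)[0], pos + 8
    if type_byte == 0x08:  # boolean
        return data[pos] != 0, pos + 1
    if type_byte == 0x0A:  # null carries no payload
        return None, pos
    if type_byte == 0x02:  # string: int32 length (incl. NUL), bytes, NUL
        (length,) = struct.unpack_from("<i", data, pos)
        start = pos + 4
        return data[start:start + length - 1].decode("utf-8"), start + length
    if type_byte in (0x03, 0x04):  # embedded document or array
        (size,) = struct.unpack_from("<i", data, pos)
        sub = read_document_from_bytes(data[pos:pos + size])
        value = list(sub.values()) if type_byte == 0x04 else sub
        return value, pos + size
    raise ValueError(f"Unsupported BSON type: 0x{type_byte:02x}")


def read_document_from_bytes(data: bytes) -> dict:
    """Decode one BSON document without building an intermediate tree."""
    if len(data) < 5:
        raise ValueError("not enough data for a BSON document")
    (size,) = struct.unpack_from("<i", data, 0)
    if size != len(data) or data[-1] != 0:
        raise ValueError("invalid BSON document framing")
    doc = {}
    pos, end = 4, size - 1
    while pos < end:
        type_byte = data[pos]
        key, pos = read_cstring(data, pos + 1)
        doc[key], pos = read_bson_value(data, pos, type_byte)
    return doc
```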
Fix mypy type errors in profile_nested.py
- Added 'Any' type annotation to the 'doc' variable to handle different
document types being assigned to the same variable. This fixes the
dict-item type incompatibility errors.
Fix BSON decode fallback for unsupported types
- Fixed the direct BSON decoder to properly fall back to the Document-based
approach when encountering unsupported BSON types (ObjectId, Binary, DateTime,
Regex, etc.).
- Modified read_bson_value() to return an error for unsupported types instead
of trying to parse them incorrectly
- Updated _bson_to_dict() to catch 'Unsupported BSON type' errors and fall
back to Document::from_reader() for the entire document
- This ensures correctness for all BSON types while maintaining performance
for common types (int, string, bool, null, dict, list)
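The control flow of this fallback is simple enough to sketch in Python. The names here are hypothetical, and the sketch signals the unsupported case with a dedicated exception class, whereas the commit describes matching on the 'Unsupported BSON type' error text.

```python
class UnsupportedBsonType(ValueError):
    """Raised by the fast decoder when it meets a type it does not handle."""


def decode_with_fallback(data: bytes, fast_decode, full_decode):
    """Try the fast decoder; redo the whole document with the full one if needed.

    fast_decode and full_decode are placeholders for the direct byte reader
    and the Document-based decoder respectively.
    """
    try:
        return fast_decode(data)
    except UnsupportedBsonType:
        # Partial results are discarded: the full decoder restarts from byte 0.
        return full_decode(data)
```

Restarting from the beginning keeps the fast path free of any partial-state bookkeeping, at the cost of decoding some leading fields twice when a document contains an unsupported type.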
Fix mypy type error in profile_decode.py
- Added type annotation 'dict[str, dict[str, Any]]' to the 'docs' variable
to handle different document structures with varying value types.
Add Rust comparison tests to perf_test.py and async_perf_test.py
- Added new test classes that compare C vs Rust BSON implementations:
  - RustSimpleIntEncodingTest/DecodingTest - Simple integer documents
  - RustMixedTypesEncodingTest - Documents with mixed types
  - RustNestedEncodingTest - Nested documents
  - RustListEncodingTest - Documents with lists
Fix TestLongLongToString test when C extension is not available
- The test was failing with AttributeError when _cbson was None (e.g.,
when using the Rust extension). Added a check to skip the test if
_cbson is None or not imported, since _test_long_long_to_str() is a
C-specific test function.
1 parent e077ebd commit 493ee02
File tree
20 files changed: +3497 -15 lines changed
- .evergreen
  - generated_configs
  - scripts
- .github/workflows
- bson
  - _rbson
    - src
- test
  - performance
- tools