-
Notifications
You must be signed in to change notification settings - Fork 4k
GH-48695: [Python][C++] Add max_rows parameter to CSV reader #48719
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
GH-48695: [Python][C++] Add max_rows parameter to CSV reader #48719
Conversation
This PR implements the max_rows parameter for PyArrow's CSV reader, addressing issue apache#48695. This feature is equivalent to Pandas' nrows parameter, allowing users to limit the number of rows read from a CSV file. Implementation details: - Added max_rows field to ReadOptions (default: -1 for unlimited) - Implemented exact row limiting in all three reader types: * StreamingReaderImpl: Atomic counter with batch slicing * SerialTableReader: Table slicing after reading * AsyncThreadedTableReader: Table slicing after parallel read - Added Python bindings with full property support - Includes 8 comprehensive tests covering all edge cases The implementation guarantees exact row count even with multithreading, using atomic counters and zero-copy slicing operations. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
|
|
|
I see that CI checks require maintainer approval to run. I'll run the full test suite locally first to ensure everything passes before requesting review. I'll update this PR once I've confirmed:
Will post the local test results shortly. |
The CSV reader's default behavior is to infer column types. When reading numeric values like "1", "2", "3", they are correctly converted to integers [1, 2, 3] rather than kept as strings ["1", "2", "3"]. Updated test expectations in test_max_rows_basic(), test_max_rows_with_skip_rows(), and test_max_rows_with_skip_rows_after_names() to expect integers instead of strings, matching the behavior of other CSV reader tests in the codebase. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
|
@hyangminj can you please confirm you have reviewed the changes Claude made to the code? |
|
Hello @AlenkaF I have reviewed the changes. But I might need to execute the test code for my changes. |
Build and Test ResultsBuild Status: ✅ Success
Local Testing: ✅ All tests passed (8/8)
Test Environment:
The implementation is working as expected in local testing. |
GH-48695: [Python][C++] Add max_rows parameter to CSV reader
Summary
This PR implements the
max_rowsparameter for PyArrow's CSV reader, addressing issue #48695. This feature is equivalent to Pandas'nrowsparameter, allowing users to limit the number of rows read from a CSV file.Rationale for Changes
Currently, PyArrow's CSV reader lacks a way to limit the number of rows read, which is available in both Pandas (
nrows) and Polars (n_rows). This feature is useful for:Implementation Details
C++ Core Changes
Added
max_rowsfield to ReadOptions (cpp/src/arrow/csv/options.h)int64_t-1(unlimited)-1for unlimited, or positive integer for exact row countValidation (
cpp/src/arrow/csv/options.cc)max_rows = 0→ Error (invalid)max_rows < -1→ Error (invalid)max_rows = -1→ Read all rows (default)max_rows > 0→ Read exactly that many rowsReader Implementations (
cpp/src/arrow/csv/reader.cc)rows_read_atomic counter for thread-safe row trackingPython Bindings
Cython Declarations (
python/pyarrow/includes/libarrow.pxd)int64_t max_rowsto CCSVReadOptionsPython Wrapper (
python/pyarrow/_csv.pyx)max_rowsparameter toReadOptions.__init__()equals()method__getstate__,__setstate__)Tests
python/pyarrow/tests/test_csv.py)test_max_rows_basic: Basic functionality (2 rows, 1 row, more than available)test_max_rows_with_skip_rows: Interaction withskip_rowstest_max_rows_with_skip_rows_after_names: Interaction withskip_rows_after_namestest_max_rows_edge_cases: Validation (0, negative values)test_max_rows_with_small_blocks: Multiple blocks with small block_sizetest_max_rows_multithreaded: Exact count guarantee withuse_threads=Truetest_max_rows_streaming: StreamingReader compatibilitytest_max_rows_pickle: Pickle supportKey Features
max_rowsrows (not approximate)use_threads=TrueRecordBatch::Slice()andTable::Slice()Usage Examples
Backward Compatibility
-1means no behavior change for existing codeChecklist
Related Issue
Closes #48695