
Commit 0ff4c0d

Clarify behavior of __arrow_c_stream__ execution, emphasizing incremental batch processing and memory efficiency
1 parent 17c4c2c commit 0ff4c0d

1 file changed: 10 additions, 4 deletions


docs/source/user-guide/io/arrow.rst

Lines changed: 10 additions & 4 deletions
@@ -60,10 +60,16 @@ Exporting from DataFusion
 DataFusion DataFrames implement ``__arrow_c_stream__`` PyCapsule interface, so any
 Python library that accepts these can import a DataFusion DataFrame directly.
 
-.. warning::
-    It is important to note that this will cause the DataFrame execution to happen, which may be
-    a time consuming task. That is, you will cause a
-    :py:func:`datafusion.dataframe.DataFrame.collect` operation call to occur.
+.. note::
+    Invoking ``__arrow_c_stream__`` still triggers execution of the underlying
+    query, but batches are yielded incrementally rather than materialized all at
+    once in memory. Consumers can process the stream as it arrives, avoiding the
+    memory overhead of a full
+    :py:func:`datafusion.dataframe.DataFrame.collect`.
+
+    For an example of this streamed execution and its memory safety, see the
+    ``test_arrow_c_stream_large_dataset`` unit test in
+    :mod:`python.tests.test_io`.
 
 
 .. ipython:: python