docs/source/user-guide/io (1 file changed: +23 -2 lines)

@@ -79,8 +79,29 @@ output incrementally:
     for batch in reader:
         ... # process each batch without buffering the entire table

-DataFusion reads one partition at a time when exporting a C stream, so large
-result sets are not buffered entirely in memory.
+To protect your process from unexpectedly large result sets, cap DataFusion's
+memory usage and allow spilling to disk:
+
+.. ipython:: python
+
+    from datafusion import RuntimeEnvBuilder, SessionContext
+    import pyarrow as pa
+
+    runtime = (
+        RuntimeEnvBuilder()
+        .with_disk_manager_os()
+        .with_memory_limit(1_000_000_000, 0.8)  # 1 GB cap, spill at 80%
+    )
+    ctx = SessionContext(runtime=runtime)
+    df = ctx.sql("SELECT * FROM my_large_table")
+    # consume the DataFrame through the Arrow C stream interface
+    reader = pa.RecordBatchReader.from_stream(df)
+    for batch in reader:
+        ...  # batches spill to disk once the memory limit is hit
+
+Setting a memory limit prevents out-of-memory errors by spilling to disk, but
+processing may slow down due to increased I/O and temporary storage usage. See
+:doc:`../configuration` for detailed setup options.
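+
+If you prefer a fixed budget to a fraction-based cap, DataFusion's fair spill
+pool divides a set amount of memory evenly across the operators that can
+spill. A minimal sketch, assuming the ``with_fair_spill_pool`` builder method
+described in the configuration guide:
+
+.. ipython:: python
+
+    from datafusion import RuntimeEnvBuilder, SessionContext
+
+    # Fair spill pool: spilling operators share a fixed 1 GB budget equally
+    runtime = RuntimeEnvBuilder().with_disk_manager_os().with_fair_spill_pool(1_000_000_000)
+    ctx = SessionContext(runtime=runtime)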

 If the goal is simply to persist results, prefer engine-level writers such as
 ``df.write_parquet()``. These writers stream data from Rust directly to the