
Commit 948f6ef

Add memory usage cap and disk spilling for large result sets in DataFusion
1 parent ad71e58 commit 948f6ef

1 file changed: +23 −2 lines changed

docs/source/user-guide/io/arrow.rst

@@ -79,8 +79,29 @@ output incrementally:
     for batch in reader:
         ... # process each batch without buffering the entire table
 
-DataFusion reads one partition at a time when exporting a C stream, so large
-result sets are not buffered entirely in memory.
+To protect your process from unexpectedly large result sets, cap DataFusion's
+memory usage and allow spilling to disk:
+
+.. ipython:: python
+
+    from datafusion import RuntimeEnvBuilder, SessionContext
+    import pyarrow as pa
+
+    runtime = (
+        RuntimeEnvBuilder()
+        .with_disk_manager_os()
+        .with_memory_limit(1_000_000_000, 0.8)  # 1 GB cap, spill at 80%
+        .build()
+    )
+    ctx = SessionContext(runtime=runtime)
+    df = ctx.sql("SELECT * FROM my_large_table")
+    reader = pa.RecordBatchReader.from_stream(df)  # imports via the Arrow C stream protocol
+    for batch in reader:
+        ... # batches spill to disk once the memory limit is hit
+
+Setting a memory limit prevents out-of-memory errors by spilling to disk, but
+processing may slow down due to increased I/O and temporary storage usage. See
+:doc:`../configuration` for detailed setup options.
 
 If the goal is simply to persist results, prefer engine-level writers such as
 ``df.write_parquet()``. These writers stream data from Rust directly to the
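The writer path those closing context lines describe, as a minimal sketch; the table registration and output directory are illustrative, not taken from this commit:

    from datafusion import SessionContext

    ctx = SessionContext()
    # Illustrative source; any registered table works the same way.
    ctx.register_parquet("my_large_table", "data/my_large_table.parquet")
    df = ctx.sql("SELECT * FROM my_large_table")

    # The engine streams batches from Rust straight into the output files,
    # so no Arrow C stream or Python-side batch loop is involved.
    df.write_parquet("results/")  # illustrative output directory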
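On the limit used in the new example: in DataFusion's Rust builder, ``with_memory_limit(max_memory, memory_fraction)`` sizes the pool at ``max_memory * memory_fraction``; assuming the Python binding mirrors that rule, the numbers work out as:

    # Pool sizing under with_memory_limit(1_000_000_000, 0.8), assuming the
    # Python binding mirrors the Rust max_memory * memory_fraction rule.
    max_memory = 1_000_000_000            # hard cap passed as the first argument
    memory_fraction = 0.8                 # share of the cap the pool may use
    pool_bytes = int(max_memory * memory_fraction)
    print(pool_bytes)                     # 800000000 -> operators spill past ~800 MB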
