
Commit 948f6ef

Add memory usage cap and disk spilling for large result sets in DataFusion
1 parent ad71e58 commit 948f6ef

1 file changed: +23 −2 lines changed

docs/source/user-guide/io/arrow.rst

@@ -79,8 +79,29 @@ output incrementally:
     for batch in reader:
         ... # process each batch without buffering the entire table
 
-DataFusion reads one partition at a time when exporting a C stream, so large
-result sets are not buffered entirely in memory.
+To protect your process from unexpectedly large result sets, cap DataFusion's
+memory usage and allow spilling to disk:
+
+.. ipython:: python
+
+    from datafusion import RuntimeEnvBuilder, SessionContext
+    import pyarrow as pa
+
+    runtime = (
+        RuntimeEnvBuilder()
+        .with_disk_manager_os()
+        .with_memory_limit(1_000_000_000, 0.8)  # 1 GB cap, spill at 80%
+        .build()
+    )
+    ctx = SessionContext(runtime=runtime)
+    df = ctx.sql("SELECT * FROM my_large_table")
+    reader = pa.RecordBatchReader.from_stream(df)  # imports via the Arrow C stream protocol
+    for batch in reader:
+        ... # batches spill to disk once the memory limit is hit
+
+Setting a memory limit prevents out-of-memory errors by spilling to disk, but
+processing may slow down due to increased I/O and temporary storage usage. See
+:doc:`../configuration` for detailed setup options.
 
 If the goal is simply to persist results, prefer engine-level writers such as
 ``df.write_parquet()``. These writers stream data from Rust directly to the
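The writer path those closing context lines describe, as a minimal sketch; the table registration and output directory are illustrative, not taken from this commit:

    from datafusion import SessionContext

    ctx = SessionContext()
    # Illustrative source; any registered table works the same way.
    ctx.register_parquet("my_large_table", "data/my_large_table.parquet")
    df = ctx.sql("SELECT * FROM my_large_table")

    # The engine streams batches from Rust straight into the output files,
    # so no Arrow C stream or Python-side batch loop is involved.
    df.write_parquet("results/")  # illustrative output directory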
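On the limit used in the new example: in DataFusion's Rust builder, ``with_memory_limit(max_memory, memory_fraction)`` sizes the pool at ``max_memory * memory_fraction``; assuming the Python binding mirrors that rule, the numbers work out as:

    # Pool sizing under with_memory_limit(1_000_000_000, 0.8), assuming the
    # Python binding mirrors the Rust max_memory * memory_fraction rule.
    max_memory = 1_000_000_000            # hard cap passed as the first argument
    memory_fraction = 0.8                 # share of the cap the pool may use
    pool_bytes = int(max_memory * memory_fraction)
    print(pool_bytes)                     # 800000000 -> operators spill past ~800 MB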
