Skip to content

Conversation

@prantogg
Copy link
Contributor

@prantogg prantogg commented Jan 3, 2026

This PR addresses #70 - enables direct S3 streaming for generated data, eliminating local storage requirements for large-scale dataset generation. S3 URIs in --output-dir are automatically detected and routed to a streaming multipart upload implementation.

  • Add direct S3 write support via --output-dir s3://bucket/prefix/ for all output formats (Parquet, CSV, TBL) and zone tables
  • S3 writer uses streaming multipart upload with 32MB chunks to keep memory usage bounded
  • S3 client is built from standard AWS environment variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_REGION, AWS_SESSION_TOKEN, etc.) via AmazonS3Builder::from_env()
  • Generated files are byte-for-byte identical to local generation

@prantogg prantogg changed the title [WIP] feat: add s3 write support feat: add S3 write support Feb 12, 2026
@prantogg prantogg added the enhancement New feature or request label Feb 12, 2026
@prantogg prantogg marked this pull request as ready for review February 12, 2026 07:29
@prantogg prantogg force-pushed the pranav/add-s3-write-support branch from 3c12ccf to 00c533b Compare February 12, 2026 16:48
@prantogg prantogg added the documentation Improvements or additions to documentation label Feb 12, 2026
Copy link
Member

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

I don't think it is necessary for this PR, but the ObjectStore may be a good future choice (here it would let you unify the local and s3 code path and make it easier to add future gcs or azure support).

It may be worth setting up a test for this in CI using MinIO or another S3 alternative as otherwise I don't think there are any tests here to validate the behaviour.

@jiayuasu
Copy link
Member

@prantogg Does this work with the --mb-per-file 256 config?

Copy link
Collaborator

@james-willis james-willis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I think Dewey's comment about objectstore makes a lot of sense though.

@prantogg
Copy link
Contributor Author

Thank you!

I don't think it is necessary for this PR, but the ObjectStore may be a good future choice (here it would let you unify the local and s3 code path and make it easier to add future gcs or azure support).

It may be worth setting up a test for this in CI using MinIO or another S3 alternative as otherwise I don't think there are any tests here to validate the behaviour.

Thanks @paleolimbot! Agreed on ObjectStore for local path. The current implementation for local path is an exact copy of tpchgen-rs project. Happy to tackle that as a follow-up!

On the MinIO CI test - my thinking was that since S3Writer implements std::io::Write, the datagen pipeline is agnostic to the destination. The unit tests validate the writer logic (buffering, multipart, finalization) against an InMemory object store, which covers our code. Testing against an actual S3 endpoint would mostly be validating object_store::aws::AmazonS3Builder's backend rather than ours?

@prantogg
Copy link
Contributor Author

Adding a small fix for S3 error propagation for things like token expiry.

@prantogg
Copy link
Contributor Author

@prantogg Does this work with the --mb-per-file 256 config?

@jiayuasu Yes, it works with all CLI options including --mb-per-file. Data is just streamed to S3 via multipart upload instead of written to disk.

@prantogg prantogg requested a review from paleolimbot February 12, 2026 20:18
@paleolimbot
Copy link
Member

Testing against an actual S3 endpoint would mostly be validating not our code

There's nothing in this PR to indicate that spatialbench-cli -s 1 --format=parquet --output-dir s3://my-bucket/sf1-parquet actually runs (and nothing to ensure that future LLMs or contributors won't break it). I call this an "is it plugged in check" and I while I think future you will be glad you added it now, the consequences of this not working are very low (and not my problem 🙂 ) so feel free to punt!

@prantogg prantogg force-pushed the pranav/add-s3-write-support branch from d28fb32 to c0550a7 Compare February 12, 2026 22:22
@prantogg prantogg force-pushed the pranav/add-s3-write-support branch from c0550a7 to 534a46c Compare February 12, 2026 22:33
@prantogg
Copy link
Contributor Author

There's nothing in this PR to indicate that spatialbench-cli -s 1 --format=parquet --output-dir s3://my-bucket/sf1-parquet actually runs (and nothing to ensure that future LLMs or contributors won't break it). I call this an "is it plugged in check" and I while I think future you will be glad you added it now, the consequences of this not working are very low (and not my problem 🙂 ) so feel free to punt!

I see your point @paleolimbot! Thanks for that. Just added MinIO tests

Copy link
Member

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

@prantogg prantogg merged commit 56f280b into main Feb 13, 2026
24 checks passed
@prantogg prantogg deleted the pranav/add-s3-write-support branch February 13, 2026 05:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

documentation Improvements or additions to documentation enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants