fix: Skip redundant S3 upload when file already exists after rollback by dimitri-yatsenko · Pull Request #1400 · datajoint/datajoint-python

dimitri-yatsenko · 2026-02-17T18:40:27Z

Summary

Fixes #1397: upload_filepath re-uploads files (potentially multi-GB) after a transaction rollback instead of checking S3 before uploading
Before uploading, check S3 via a single stat_object call for an existing object with matching size and contents_hash metadata
If the file is already on S3 (from a prior rolled-back attempt), skip the upload and just re-insert the DB tracking entry
Adds s3.Folder.stat() method; refactors exists() to use it

Root cause

S3 uploads are not transactional. When a transaction rolls back after a successful upload (e.g., DB connection timeout during a long make()), the file remains on S3 but the tracking entry is lost. On retry, upload_filepath checks only the DB, finds no entry, and re-uploads the entire file. For large files the re-upload can take long enough for the same failure to recur, creating an infinite retry loop.
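The failure mode can be reproduced in miniature. The sketch below is only an illustration: it uses an in-memory SQLite database in place of the real DataJoint tracking table and a plain dict in place of the S3 bucket, and the names `s3`, `tracking`, and `attempt` are hypothetical.

```python
import sqlite3

# Stand-ins (assumptions): a dict plays the S3 bucket, SQLite plays the DB.
s3 = {}
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tracking (path TEXT PRIMARY KEY, size INTEGER)")
conn.commit()


def attempt(path, data, fail):
    """Upload, then record the tracking entry inside a DB transaction."""
    s3[path] = data  # the upload is outside the transaction
    conn.execute("INSERT INTO tracking VALUES (?, ?)", (path, len(data)))
    if fail:
        conn.rollback()  # e.g. the connection timed out before commit
    else:
        conn.commit()


attempt("a/b.dat", b"x" * 1024, fail=True)
tracked = conn.execute("SELECT COUNT(*) FROM tracking").fetchone()[0]
uploaded = "a/b.dat" in s3
```

After the failed attempt the object is still in the stand-in bucket while the tracking table is empty, which is the state a DB-only check cannot detect.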

What changed

datajoint/s3.py: New stat() method returns the full stat_object result (size, metadata), or None if the object does not exist, using a single HTTP HEAD request. exists() is refactored to use it.
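A minimal sketch of the stat()/exists() relationship, not the code from the PR. `FakeClient`, `ObjectStat`, and `NoSuchKey` are hypothetical stand-ins so the example runs without a real S3 client; the actual method presumably wraps the client's stat_object call and its own error type.

```python
from dataclasses import dataclass, field
from typing import Optional


class NoSuchKey(Exception):
    """Stand-in for the client error raised when an object is missing."""


@dataclass
class ObjectStat:
    """Stand-in for the result of a stat (HEAD) request."""
    size: int
    metadata: dict = field(default_factory=dict)


class FakeClient:
    """In-memory stand-in for an S3 client; counts HEAD requests."""

    def __init__(self, objects):
        self.objects = objects
        self.head_requests = 0

    def stat_object(self, bucket, name):
        self.head_requests += 1
        try:
            return self.objects[(bucket, name)]
        except KeyError:
            raise NoSuchKey(name) from None


class Folder:
    def __init__(self, client, bucket):
        self.client = client
        self.bucket = bucket

    def stat(self, name) -> Optional[ObjectStat]:
        """Return the stat result, or None if the object does not exist."""
        try:
            return self.client.stat_object(self.bucket, name)
        except NoSuchKey:
            return None

    def exists(self, name) -> bool:
        """Existence check built on stat(): one HEAD request per call."""
        return self.stat(name) is not None
```

Returning the full result from stat() lets a caller learn both existence and size/metadata from one request, rather than calling exists() and then issuing a second stat.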

datajoint/external.py: In upload_filepath's else branch (no DB entry), before calling _upload_file:

Call s3.stat() on the expected S3 path
If object exists with matching size and contents_hash from metadata → skip upload, log info
If skip_checksum mode, match on size only
Always insert the DB tracking entry regardless
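The steps above can be sketched as a small decision function. This is an illustration under assumptions, not the code in external.py: `should_skip_upload` and `upload_with_check` are hypothetical names, the remote object is modeled as a plain dict, and the metadata key is assumed to be exactly `contents_hash`.

```python
from typing import Callable, Optional


def should_skip_upload(remote: Optional[dict], local_size: int,
                       local_hash: str, skip_checksum: bool = False) -> bool:
    """Decide whether the object already on S3 makes the upload redundant.

    `remote` is None when the object is missing; otherwise a dict with
    "size" and "metadata" keys (hypothetical shape for this sketch).
    """
    if remote is None:
        return False
    if remote["size"] != local_size:
        return False
    if skip_checksum:
        return True  # skip_checksum mode: match on size only
    return remote["metadata"].get("contents_hash") == local_hash


def upload_with_check(remote: Optional[dict], local_size: int,
                      local_hash: str, upload: Callable[[], None],
                      insert_entry: Callable[[], None],
                      skip_checksum: bool = False) -> bool:
    """Upload only if needed; always (re-)insert the tracking entry."""
    skipped = should_skip_upload(remote, local_size, local_hash, skip_checksum)
    if not skipped:
        upload()
    insert_entry()  # inserted regardless of whether the upload ran
    return skipped
```

A size mismatch or a missing object always falls through to a normal upload, so the check can only save work and never leaves a stale object in place of the new file.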

Test plan

Verify existing external storage tests pass
Manual test: upload large filepath, kill connection before commit, verify retry skips re-upload

🤖 Generated with Claude Code

After a transaction rollback, S3 files survive but DB tracking entries are lost. On retry, upload_filepath would re-upload the entire file (potentially multi-GB) because it only checked the DB.

Now checks S3 via a single stat_object call before uploading. If the object exists with matching size and contents_hash metadata, the upload is skipped. The DB tracking entry is always (re-)inserted regardless.

Also adds s3.Folder.stat() method and refactors exists() to use it, avoiding redundant stat_object calls.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

dimitri-yatsenko requested a review from ttngu207 February 17, 2026 18:47

ttngu207 approved these changes Feb 17, 2026

dimitri-yatsenko merged commit f401a20 into maint/0.14 Feb 17, 2026
3 checks passed

dimitri-yatsenko deleted the fix/skip-s3-reupload-1397 branch February 17, 2026 19:01

This was referenced Feb 17, 2026

Bug: Import error for pkg_resources #1394 (Closed)

upload_filepath re-uploads files after transaction rollback — should check S3 before uploading #1397 (Closed)

Update datajoint to 0.14.9 conda-forge/datajoint-feedstock#61 (Open)
