feat: Add internal bucket support for team-only datasets #38

ParticularlyPythonicBS wants to merge 4 commits into main
Conversation
- Add multi-bucket architecture supporting production (public) and internal (team-only) R2 buckets
- Implement bucket routing based on dataset manifest configuration
- Add --bucket option to prepare and pull commands
- Update CI/CD workflows to handle internal bucket publications and deletions
Walkthrough

Adds per-dataset bucket support (production vs internal) across the CI publish script, CLI, core R2 APIs, config, manifest, docs, env, and tests. Operations (prepare, pull, verify, delete, publish) now target a dataset's configured bucket; manifest entries include a new "bucket" field and helpers to get/update it.
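For orientation, a minimal sketch of a manifest entry carrying the new field and of the default applied when it is absent; the "fileName"/"bucket" keys and the "production" default follow the helpers discussed in this review, while the concrete values are illustrative:

```python
from typing import Any

# Illustrative manifest entry; real entries also carry version history, hashes, etc.
entry: dict[str, Any] = {
    "fileName": "example.sqlite",  # hypothetical dataset name
    "bucket": "internal",          # new field: "production" (default) or "internal"
}

def bucket_for(dataset: dict[str, Any]) -> str:
    # Entries written before this change have no "bucket" key; fall back to production.
    return dataset.get("bucket", "production")

assert bucket_for(entry) == "internal"
assert bucket_for({"fileName": "legacy.sqlite"}) == "production"
```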
Sequence Diagram(s)

```mermaid
sequenceDiagram
autonumber
actor User
participant CLI as CLI (__main__.py)
participant Manifest as Manifest
participant Core as Core (R2 APIs)
participant R2Prod as R2: Production
participant R2Int as R2: Internal
rect rgb(240,248,255)
note over User,CLI: Prepare dataset
User->>CLI: prepare --bucket=<production|internal>
CLI->>Manifest: update_dataset_bucket(name, bucket)
CLI->>Core: upload_to_staging(file, staging_key, bucket)
CLI->>Manifest: write history entry (staging_key, bucket)
end
rect rgb(245,255,240)
note over User,CLI: Pull dataset
User->>CLI: pull --bucket=<production|internal>
CLI->>Manifest: get_dataset_bucket(name)
CLI->>Core: pull_and_verify(object_key, sha, out, bucket)
alt bucket=production
Core->>R2Prod: GET object_key
else bucket=internal
Core->>R2Int: GET object_key
end
Core-->>CLI: verified
CLI-->>User: file saved
end
```

```mermaid
sequenceDiagram
autonumber
participant GH as GitHub Actions
participant Pub as publish_script.py
participant R2P as R2: Production
participant R2I as R2: Internal
rect rgb(255,250,240)
note over GH,Pub: Publish step
GH->>Pub: env (R2_INTERNAL_BUCKET,…)
Pub->>Pub: Group objects by bucket
par Delete production objects
Pub->>R2P: Batch DELETE(keys)
and Delete internal objects
Pub->>R2I: Batch DELETE(keys)
end
par Copy production objects
Pub->>R2P: Server-side COPY(staging_key -> object_key)
and Copy internal objects
Pub->>R2I: Server-side COPY(staging_key -> object_key)
end
Pub-->>GH: logs incl. bucket context
end
```
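As a rough sketch of the grouping, batched-delete, and server-side copy steps in the diagram above (the function and variable names are illustrative, not the publish script's actual ones; the boto3 S3 client calls shown are standard):

```python
from collections import defaultdict

def group_keys_by_bucket(entries):
    """entries: iterable of (bucket_name, object_key) pairs derived from the manifest."""
    grouped = defaultdict(list)
    for bucket_name, key in entries:
        grouped[bucket_name].append(key)
    return grouped

def delete_grouped(client, grouped):
    for bucket_name, keys in grouped.items():
        # Batch delete; S3-compatible APIs accept up to 1000 keys per request.
        client.delete_objects(
            Bucket=bucket_name,
            Delete={"Objects": [{"Key": k} for k in keys]},
        )

def copy_grouped(client, staging_bucket, grouped_copies):
    for bucket_name, pairs in grouped_copies.items():
        for staging_key, object_key in pairs:
            # Server-side copy from staging into the target (production or internal) bucket.
            client.copy_object(
                Bucket=bucket_name,
                Key=object_key,
                CopySource={"Bucket": staging_bucket, "Key": staging_key},
            )
```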
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Pre-merge checks and finishing touches: ✅ Passed checks (3 passed)
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tests/test_manifest.py (1)
507-515: Fix failing expectation: pass bucket alias to core or adjust assertion
CI shows the 4th arg to pull_and_verify is a real bucket name (e.g., "DUMMY") instead of "internal". Either:
- Pass the alias through to core and let core resolve it, or
- Update the test to check against settings.internal_bucket.
Given the rest of the suite expects aliases at boundaries, prefer passing the alias and resolving inside core.
Apply these diffs:
- Resolve alias in core and accept both alias or raw bucket names:
```diff
--- a/src/datamanager/core.py
+++ b/src/datamanager/core.py
@@
-from typing import Any, TypedDict
+from typing import Any, TypedDict
+from typing import Optional  # if needed elsewhere
@@
-def upload_to_r2(
-    client: S3Client, file_path: Path, object_key: str, bucket: str = None
-) -> None:
+def upload_to_r2(
+    client: S3Client, file_path: Path, object_key: str, bucket: str | None = None
+) -> None:
@@
-    target_bucket = bucket or settings.bucket
+    target_bucket = _resolve_bucket(bucket)
@@
-def download_from_r2(
-    client: S3Client, object_key: str, download_path: Path, bucket: str = None
-) -> None:
+def download_from_r2(
+    client: S3Client, object_key: str, download_path: Path, bucket: str | None = None
+) -> None:
@@
-    target_bucket = bucket or settings.bucket
+    target_bucket = _resolve_bucket(bucket)
@@
-def pull_and_verify(
-    object_key: str, expected_hash: str, output_path: Path, bucket: str = None
-) -> bool:
+def pull_and_verify(
+    object_key: str, expected_hash: str, output_path: Path, bucket: str | None = None
+) -> bool:
@@
-    download_from_r2(client, object_key, output_path, bucket)
+    download_from_r2(client, object_key, output_path, bucket)
@@
-def delete_from_r2(client: S3Client, object_key: str, bucket: str = None) -> None:
+def delete_from_r2(client: S3Client, object_key: str, bucket: str | None = None) -> None:
@@
-    target_bucket = bucket or settings.bucket
+    target_bucket = _resolve_bucket(bucket)
```

Add this helper (outside the changed ranges, e.g., near other helpers):
```python
def _resolve_bucket(bucket: str | None) -> str:
    # Accept aliases or raw bucket names.
    if bucket is None or bucket == "production":
        return settings.bucket
    if bucket == "internal":
        return settings.internal_bucket
    return bucket
```
- Pass the alias from CLI to core:
```diff
--- a/src/datamanager/__main__.py
+++ b/src/datamanager/__main__.py
@@
-    # Determine which bucket to use
-    target_bucket = (
-        settings.internal_bucket if bucket == "internal" else settings.bucket
-    )
-
     success = core.pull_and_verify(
         version_entry["r2_object_key"],
         version_entry["sha256"],
         final_path,
-        target_bucket,
+        bucket,  # pass alias; core resolves to bucket name
     )
```
- Use alias for diff download too (core resolves to actual):
```diff
--- a/src/datamanager/__main__.py
+++ b/src/datamanager/__main__.py
@@
-    # Download from the appropriate bucket
-    target_bucket = (
-        settings.internal_bucket if bucket == "internal" else settings.bucket
-    )
-    core.download_from_r2(
-        client, latest_version["r2_object_key"], old_path, target_bucket
-    )
+    # Download from the appropriate bucket (alias; core resolves)
+    core.download_from_r2(
+        client, latest_version["r2_object_key"], old_path, bucket
+    )
```
🧹 Nitpick comments (14)
docs/source/workflow.md (1)
44-49: Clarify how the "appropriate bucket" is chosen
Suggest explicitly stating that CI selects the target bucket based on the dataset's bucket field in manifest.json to remove ambiguity.
```diff
-It copies the data from the staging bucket to the appropriate target bucket (production or internal).
+It copies the data from the staging bucket to the appropriate target bucket (production or internal), as specified by each dataset's bucket field in manifest.json.
```

.github/workflows/publish.yml (1)
36-45: Fail fast if R2_INTERNAL_BUCKET (and other required env) are unset
Add a preflight validation step so the job errors with a clear message when any required variable is missing.
```diff
       - name: Install project and dependencies
         run: |
           uv sync --all-extras --dev
           uv pip install -e . --no-deps

+      - name: Validate required env
+        run: |
+          set -euo pipefail
+          for v in R2_ACCOUNT_ID R2_ACCESS_KEY_ID R2_SECRET_ACCESS_KEY R2_PRODUCTION_BUCKET R2_STAGING_BUCKET R2_INTERNAL_BUCKET; do
+            if [ -z "${!v:-}" ]; then
+              echo "::error::Missing required env: $v"
+              exit 1
+            fi
+          done
+        env:
+          R2_ACCOUNT_ID: ${{ secrets.R2_ACCOUNT_ID }}
+          R2_ACCESS_KEY_ID: ${{ secrets.R2_ACCESS_KEY_ID }}
+          R2_SECRET_ACCESS_KEY: ${{ secrets.R2_SECRET_ACCESS_KEY }}
+          R2_PRODUCTION_BUCKET: ${{ vars.R2_PRODUCTION_BUCKET }}
+          R2_STAGING_BUCKET: ${{ vars.R2_STAGING_BUCKET }}
+          R2_INTERNAL_BUCKET: ${{ vars.R2_INTERNAL_BUCKET }}
+
       - name: Run Publish Script
         run: uv run .github/scripts/publish_script.py
         env:
           R2_ACCOUNT_ID: ${{ secrets.R2_ACCOUNT_ID }}
           R2_ACCESS_KEY_ID: ${{ secrets.R2_ACCESS_KEY_ID }}
           R2_SECRET_ACCESS_KEY: ${{ secrets.R2_SECRET_ACCESS_KEY }}
           R2_PRODUCTION_BUCKET: ${{ vars.R2_PRODUCTION_BUCKET }}  # Use repo variable
           R2_STAGING_BUCKET: ${{ vars.R2_STAGING_BUCKET }}  # Use repo variable
           R2_INTERNAL_BUCKET: ${{ vars.R2_INTERNAL_BUCKET }}  # Use repo variable
```

docs/source/usage.md (1)
55-65: Note permission requirements for internal pulls
Add a short note pointing users to the verify command if they get AccessDenied when pulling internal datasets.
```diff
-Downloads a dataset from the appropriate R2 bucket (production or internal) and verifies its integrity. The bucket is determined by the dataset's configuration in the manifest.
+Downloads a dataset from the appropriate R2 bucket (production or internal) and verifies its integrity. The bucket is determined by the dataset's configuration in the manifest. Note: Pulling an internal dataset requires access to the internal bucket. If you see AccessDenied, run `uv run datamanager verify`.
```

src/datamanager/config.py (2)
6-11: Also read from OS environment, not only .env
Current code ignores exported env vars if .env is not present; reading both improves usability in CI and shells.
```diff
 from dotenv import find_dotenv, dotenv_values
+import os
 import warnings

 _ENV_PATH = Path(find_dotenv()) if find_dotenv() else None
 _ENV = dotenv_values(_ENV_PATH) if _ENV_PATH else {}
```
13-22: Apply env fallback in _need()
Prefer OS env first, then .env, then warn and use dummy.
```diff
 def _need(var: str) -> str:
-    val = _ENV.get(var)
+    val = os.getenv(var, _ENV.get(var))
     if val is None:
         warnings.warn(
             f"Env var {var} is missing – using dummy value for offline/test mode",
             RuntimeWarning,
         )
         val = "DUMMY"
     return str(val)
```

docs/source/setup.md (1)
3-8: LGTM; minor doc polish suggestion
Consider adding a one-liner noting that access to the internal bucket should be tightly scoped to team members (principle of least privilege). Optional.
Also applies to: 46-47
tests/test_core.py (3)
91-105: Avoid brittle order-dependence in verify_r2_access results
Tests assume [production, staging, internal] ordering. Please ensure core.verify_r2_access guarantees this order, or expose a bucket identifier so tests can map results by name instead of index.
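A sketch of the name-keyed alternative suggested here; it assumes each result dict gains a "bucket" identifier, which the current return shape may not have:

```python
def test_verify_results_by_name() -> None:
    # Assumes core.verify_r2_access() results each carry a "bucket" key (hypothetical).
    results = core.verify_r2_access()
    by_bucket = {r["bucket"]: r for r in results}
    assert by_bucket[settings.bucket]["exists"] is True
    assert by_bucket[settings.internal_bucket]["permissions"]["read"] is True
```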
116-124: Broaden assertions to staging/internal for read-only case
Optional: assert the same read-only behavior for staging and internal to increase coverage.
```diff
 results = core.verify_r2_access()
-prod_result = results[0]
-
-assert prod_result["exists"] is True
-assert prod_result["permissions"]["read"] is True
-assert prod_result["permissions"]["write"] is False
-assert prod_result["permissions"]["delete"] is False
-assert "Partial access: [read]" in prod_result["message"]
+for r in results:
+    assert r["exists"] is True
+    assert r["permissions"]["read"] is True
+    assert r["permissions"]["write"] is False
+    assert r["permissions"]["delete"] is False
+    assert "Partial access: [read]" in r["message"]
```
132-138: Broaden assertions to all buckets for not-found case
Given head_bucket raises for each call, all buckets should reflect not-found.
```diff
 results = core.verify_r2_access()
-prod_result = results[0]
-
-assert prod_result["exists"] is False
-assert not any(prod_result["permissions"].values())
-assert "Bucket not found" in prod_result["message"]
+for r in results:
+    assert r["exists"] is False
+    assert not any(r["permissions"].values())
+    assert "Bucket not found" in r["message"]
```

src/datamanager/manifest.py (2)
204-218: Type-narrow the return and clarify defaulting
Constrain the return type to the allowed literals and keep the current defaulting behavior.
```diff
-from typing import Any, Optional
+from typing import Any, Optional, Literal
@@
-def get_dataset_bucket(name: str) -> str:
+def get_dataset_bucket(name: str) -> Literal["production", "internal"]:
@@
-    return dataset.get("bucket", "production")
+    return dataset.get("bucket", "production")
```
220-234: Validate input and avoid unnecessary write on no‑op
- Validate that only "production" or "internal" are accepted.
- Return early if dataset is not found to avoid noop writes.
```diff
-from typing import Any, Optional
+from typing import Any, Optional, Literal
@@
-def update_dataset_bucket(name: str, bucket: str) -> None:
+def update_dataset_bucket(name: str, bucket: Literal["production", "internal"]) -> None:
@@
-    for item in data:
+    updated = False
+    for item in data:
         if item.get("fileName") == name:
             item["bucket"] = bucket
-            break
-    write_manifest(data)
+            updated = True
+            break
+    if updated:
+        write_manifest(data)
```

.github/scripts/publish_script.py (1)
49-55: Docs generation: avoid public links for internal datasets
For internal datasets, the generated "Download" link will point to a public base URL and likely be inaccessible. Prefer a label instead.
Outside this hunk, adjust finalize_dataset_docs:
```python
# inside the loop over datasets, before building download_link
is_internal = dataset.get("bucket") == "internal"
...
download_link = "N/A"
if r2_object_key:
    if is_internal:
        download_link = "Internal-only"
    else:
        full_url = f"{BASE_DOWNLOAD_URL}{r2_object_key}"
        download_link = f"[Download]({full_url})"
```
265-271: Unused ctx in _run_prepare_logic
ctx isn't used; consider removing the parameter or prefixing it with an underscore to satisfy linters. Keeping it is OK if you plan to use it soon.
385-390: Consider restricting --bucket to choices
Define an Enum or use typer.Choice to prevent invalid values.
Example:
```python
class Bucket(str, Enum):
    production = "production"
    internal = "internal"

bucket: Bucket = typer.Option(Bucket.production, "--bucket", "-b", case_sensitive=False)
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (16)
- .github/scripts/publish_script.py (5 hunks)
- .github/workflows/publish.yml (1 hunks)
- README.md (6 hunks)
- docs/source/setup.md (2 hunks)
- docs/source/usage.md (2 hunks)
- docs/source/workflow.md (1 hunks)
- env.example (1 hunks)
- manifest.json (2 hunks)
- src/datamanager/__main__.py (13 hunks)
- src/datamanager/config.py (1 hunks)
- src/datamanager/core.py (7 hunks)
- src/datamanager/manifest.py (2 hunks)
- tests/conftest.py (1 hunks)
- tests/test_core.py (1 hunks)
- tests/test_main.py (4 hunks)
- tests/test_manifest.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
tests/test_manifest.py (2)
- tests/conftest.py (1)
  - test_repo (16-82)
- src/datamanager/manifest.py (3)
  - get_dataset_bucket (204-217)
  - update_dataset_bucket (220-233)
  - read_manifest (33-52)

tests/test_main.py (2)
- tests/conftest.py (1)
  - test_repo (16-82)
- src/datamanager/manifest.py (1)
  - get_dataset (78-92)

src/datamanager/__main__.py (2)
- src/datamanager/core.py (2)
  - pull_and_verify (108-134)
  - download_from_r2 (81-105)
- src/datamanager/manifest.py (1)
  - get_dataset (78-92)
🪛 GitHub Actions: CI
tests/test_main.py
[error] 514-514: Test 'test_pull_internal_bucket' failed: expected bucket parameter 'internal' but actual value 'DUMMY' (likely due to offline/test mode dummy values).
🪛 Ruff (0.13.1)
src/datamanager/core.py
62-62: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
82-82: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
109-109: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
203-203: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
src/datamanager/__main__.py
266-266: Unused function argument: ctx
(ARG001)
🔇 Additional comments (33)
docs/source/usage.md (1)
97-101: LGTM. Clearer verify description aligning with three-bucket support.
tests/conftest.py (1)
46-46: LGTM; ensure legacy manifests remain compatible
Adding bucket: "production" is good. Please confirm the loader defaults to production when bucket is absent so older manifests don't break; add/keep a test for that.
src/datamanager/config.py (1)
31-31: LGTM. internal_bucket addition aligns with docs and workflow.
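For reference, a minimal sketch of how the new value might be exposed by the dotenv-based config shown in the nitpicks above; the dataclass wrapper and the simplified _need stub are illustrative, and only the env var names and the settings.internal_bucket attribute come from this PR:

```python
import os
from dataclasses import dataclass

def _need(var: str) -> str:
    # Simplified stand-in for the project's helper (which warns and returns "DUMMY" when unset).
    return os.getenv(var, "DUMMY")

@dataclass(frozen=True)
class Settings:
    bucket: str
    staging_bucket: str
    internal_bucket: str

settings = Settings(
    bucket=_need("R2_PRODUCTION_BUCKET"),
    staging_bucket=_need("R2_STAGING_BUCKET"),
    internal_bucket=_need("R2_INTERNAL_BUCKET"),
)
```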
env.example (1)
6-6: LGTM. Example env updated for internal bucket.
src/datamanager/manifest.py (1)
24-26: Export list update looks good. New APIs are properly exported.
tests/test_manifest.py (5)
36-36: LGTM: Bucket field asserted. Asserting default "production" on existing dataset is correct per fixtures.
52-53: LGTM: Bucket preserved across history update. Ensures update path doesn't drop the field.
55-64: LGTM: get_dataset_bucket tests. Covers both present and missing dataset cases.
66-76: LGTM: update_dataset_bucket tests. Covers update and no-op behaviors.
476-500: LGTM: Internal bucket prepare path. Validates internal flow and manifest fields.
manifest.json (2)
4-4: LGTM: Default bucket added to existing dataset. Matches code paths that default to "production".
36-51: LGTM: Example internal dataset entry. Structure aligns with create flow (bucket="internal", v1 history).
.github/scripts/publish_script.py (4)
137-142: Good: per-dataset bucket routing for deletions. Bucket selection logic is clear and isolated.
174-213: Good: batch delete grouped per bucket. Chunking and error handling are appropriate.
241-255: Good: publish targets correct bucket and logs context. Server-side copy uses target bucket; logs are clear.
15-15: R2_INTERNAL_BUCKET exported in CI
publish.yml exports R2_INTERNAL_BUCKET from repo vars: .github/workflows/publish.yml:44.
README.md (8)
62-69: LGTM: Feature bullets updated for multi-bucket and verify. Clear and consistent with CLI.
70-87: LGTM: Bucket types section. Concise and actionable guidance.
93-97: LGTM: Prereqs list includes internal bucket. Matches new env requirements.
139-139: LGTM: Env var added. R2_INTERNAL_BUCKET documented.
169-174: LGTM: Prepare examples show bucket selection. Good defaults and internal option.
220-221: LGTM: CLI note on --bucket. Accurately reflects new behavior.
260-271: LGTM: Pull examples for production and internal. Covers both latest and specific version cases.
303-307: LGTM: Verify mentions all three buckets. Aligned with core.verify_r2_access.
src/datamanager/core.py (2)
325-339: LGTM: upload_to_internal helper. Useful for explicit internal uploads; consistent progress handling.
280-296: LGTM: verify includes internal bucket. Covers all three buckets; messages are coherent.
src/datamanager/__main__.py (2)
70-73: LGTM: Verify table headers clarified. Minor UX polish.
104-112: LGTM: Datasets list shows bucket clearly. Icons and labels improve visibility.
tests/test_main.py (5)
33-33: Default bucket assertion LGTM. Good check that new datasets default to production.
233-234: Interactive prepare correctly propagates bucket. Ensures default production bucket flows into core logic.
299-300: Interactive pull correctly propagates bucket. Production bucket is forwarded to the pull logic as expected.
476-500: Internal prepare path looks good. Covers internal bucket flag end-to-end and validates manifest persistence.
502-515: Fix failing assertion: compare against settings.internal_bucket, not literal "internal"
Replace the assertion with:

```python
assert call_args[3] == settings.internal_bucket
```

Likely an incorrect or invalid review comment.
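For context on the call_args[3] indexing used in that assertion, a generic unittest.mock illustration (not the PR's actual test; the argument values are made up):

```python
from unittest.mock import MagicMock

fake_pull = MagicMock(return_value=True)
fake_pull("data/key_v2.sqlite", "sha256-placeholder", "out.sqlite", "my-internal-bucket")

positional_args = fake_pull.call_args[0]  # tuple of positional arguments from the last call
assert positional_args[3] == "my-internal-bucket"  # the 4th positional arg is the bucket
```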
refactor: centralize bucket resolution logic in core module

- Add _resolve_bucket() helper function with PEP 604 union types
- Update upload_to_r2, download_from_r2, and delete_from_r2 to use the helper
- Simplify bucket handling in the main module to pass aliases directly
- Support both bucket aliases (production/internal) and raw bucket names consistently
Codecov Report: ❌ Patch coverage is
Actionable comments posted: 4
🧹 Nitpick comments (5)
src/datamanager/core.py (2)
68-85: _resolve_bucket logic looks good
Covers None/"production"/"internal" and pass-through of raw names. Consider centralizing all resolution via this helper (see previous comment).
386-400: Avoid duplication: delegate upload_to_internal to upload_to_r2
Reuse the generic uploader with bucket="internal" to keep behavior consistent and reduce maintenance.
```diff
 def upload_to_internal(client: S3Client, file_path: Path, object_key: str) -> None:
-    """Uploads a file to the INTERNAL R2 bucket with a progress bar."""
-    file_size = file_path.stat().st_size
-    with Progress() as progress:
-        task = progress.add_task(
-            f"[blue]Uploading to internal: {file_path.name}...", total=file_size
-        )
-        client.upload_file(
-            str(file_path),
-            settings.internal_bucket,
-            object_key,
-            Callback=lambda bytes_transferred: progress.update(
-                task, advance=bytes_transferred
-            ),
-        )
+    """Uploads a file to the INTERNAL R2 bucket with a progress bar."""
+    upload_to_r2(client, file_path, object_key, bucket="internal")
```

src/datamanager/__main__.py (3)
260-266: Unused ctx argument in _run_prepare_logic
Ruff ARG001: ctx is unused. Rename to _ctx or remove the parameter.
```diff
-def _run_prepare_logic(
-    ctx: typer.Context, name: str, file: Path, bucket: str = "production"
-) -> None:
+def _run_prepare_logic(
+    _ctx: typer.Context, name: str, file: Path, bucket: str = "production"
+) -> None:
```

Call sites need no change since it's positional.
377-382: Constrain bucket option to avoid typos
Restrict CLI bucket values to "production" or "internal" for better UX.
```diff
-from typing import Callable, Optional, Any
+from typing import Callable, Optional, Any, Literal
@@
-def prepare(
+def prepare(
@@
-    bucket: str = typer.Option(
+    bucket: Literal["production", "internal"] = typer.Option(
         "production",
         "--bucket",
         "-b",
         help="Target bucket: 'production' or 'internal'. Defaults to production.",
     ),
```

Consider applying the same to the pull command's bucket option.
179-184: Apply the same choice constraint to pull command
Mirror the prepare command's restriction here.
```diff
-    bucket: str = typer.Option(
+    bucket: Literal["production", "internal"] = typer.Option(
         "production",
         "--bucket",
         "-b",
         help="Bucket to pull from: 'production' or 'internal'. Defaults to production.",
     ),
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- src/datamanager/__main__.py (13 hunks)
- src/datamanager/core.py (8 hunks)
- src/datamanager/manifest.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- src/datamanager/manifest.py
🧰 Additional context used
🧬 Code graph analysis (1)
src/datamanager/__main__.py (2)
- src/datamanager/manifest.py (1)
  - get_dataset (78-92)
- src/datamanager/core.py (1)
  - download_from_r2 (116-140)
🪛 Ruff (0.13.1)
src/datamanager/core.py
155-155: Do not catch blind exception: Exception
(BLE001)
src/datamanager/__main__.py
261-261: Unused function argument: ctx
(ARG001)
🔇 Additional comments (11)
src/datamanager/core.py (3)
96-101: upload_to_r2 correctly adopts alias resolution and PEP 604 type hints. Good use of _resolve_bucket and progress callback.
Also applies to: 107-109
116-138: download_from_r2 correctly resolves bucket and applies head/download to the same target. Looks consistent and safe. Errors are surfaced via ClientError.
343-357: verify_r2_access includes internal bucket — LGTM. Covers production, staging, internal. Good.
Please ensure settings.internal_bucket is non-empty in all deployment environments to avoid a false-negative “bucket not found”.
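A minimal sketch of the kind of startup guard that check implies; settings.bucket and settings.internal_bucket are the PR's attribute names, while the guard itself and the "DUMMY" sentinel check are illustrative:

```python
def assert_buckets_configured(settings) -> None:
    # Fail fast instead of surfacing a misleading "bucket not found" during verify.
    buckets = {
        "R2 production bucket": settings.bucket,
        "R2 internal bucket": settings.internal_bucket,
    }
    missing = [label for label, value in buckets.items() if not value or value == "DUMMY"]
    if missing:
        raise RuntimeError(f"Unconfigured buckets: {', '.join(missing)}")
```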
src/datamanager/__main__.py (8)
70-73: Minor table header rename — LGTM. Clearer labeling in the verify report.
104-112: Display of dataset bucket in list view — LGTM. Good UX; defaults to production and uses clear icons.
120-127: Threading bucket alias through pull logic — LGTM. Passing the alias to core centralizes resolution as intended.
Also applies to: 146-151
199-206: Interactive pull uses dataset bucket — LGTM. Choice formatting, null handling, and using the manifest's bucket are consistent.
Also applies to: 207-209, 215-217, 223-225, 254-255
296-299: Alias passed to core.download_from_r2 — LGTM. Keeps bucket resolution centralized.
392-427: Production prepare interactive path — LGTM. Flow and messaging look good.
430-468: Internal prepare interactive path — LGTM. Clear separation and user messaging for internal datasets.
779-781: TUI adds distinct internal/production prepare actions — LGTM. Good discoverability.
```diff
 def delete_from_r2(
     client: S3Client, object_key: str, bucket: str | None = None
 ) -> None:
     """Deletes an object from the R2 bucket."""
     target_bucket = _resolve_bucket(bucket)
     console.print(f"Attempting to delete [yellow]{object_key}[/] from R2...")
     try:
-        client.delete_object(Bucket=settings.bucket, Key=object_key)
+        client.delete_object(Bucket=target_bucket, Key=object_key)
         console.print("✅ Rollback deletion successful.")
```
Clarify deletion target and narrow exception type
Improve observability by logging the resolved bucket, and avoid catching broad exceptions.
```diff
 def delete_from_r2(
     client: S3Client, object_key: str, bucket: str | None = None
 ) -> None:
     """Deletes an object from the R2 bucket."""
     target_bucket = _resolve_bucket(bucket)
-    console.print(f"Attempting to delete [yellow]{object_key}[/] from R2...")
+    console.print(
+        f"Attempting to delete [yellow]{object_key}[/] from R2 bucket [cyan]{target_bucket}[/]..."
+    )
     try:
         client.delete_object(Bucket=target_bucket, Key=object_key)
         console.print("✅ Rollback deletion successful.")
-    except Exception as e:
+    except ClientError as e:
         # This is a best-effort cleanup. We notify the user if it fails.
         console.print(
             f"[bold red]Warning:[/] Failed to delete R2 object during rollback: {e}"
         )
         console.print("You may need to manually delete the object from your R2 bucket.")
```

Also applies to: 271-277
Actionable comments posted: 1
🧹 Nitpick comments (6)
tests/test_manifest.py (2)
115-119: Stabilize test environment by using the fixture
This test doesn't chdir into the repo. It works today because read_manifest handles missing files, but using the test_repo fixture would make it fully isolated and order-independent.
Apply this diff:
```diff
-def test_get_dataset_bucket_nonexistent() -> None:
+def test_get_dataset_bucket_nonexistent(test_repo: Path) -> None:
     """Test getting bucket type for non-existent dataset."""
+    os.chdir(test_repo)
     bucket = manifest.get_dataset_bucket("non-existent.sqlite")
     assert bucket == "production"  # Should default to production
```
121-131: Optional: consider validating bucket values at API boundaries
Tests intentionally allow any string. If you later want stricter semantics, add a validator (e.g., in CLI) or a helper like is_valid_bucket() rather than changing update_dataset_bucket's behavior.
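A sketch of such a boundary check, using the hypothetical is_valid_bucket name from the comment (not part of the PR):

```python
VALID_BUCKETS = frozenset({"production", "internal"})

def is_valid_bucket(value: str) -> bool:
    """Hypothetical helper: accept only the two bucket aliases."""
    return value in VALID_BUCKETS

# Possible use at the CLI boundary, leaving update_dataset_bucket's behavior unchanged:
requested = "internal"
if not is_valid_bucket(requested):
    raise ValueError("bucket must be 'production' or 'internal'")
```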
tests/test_core.py (1)
264-268: Silence Ruff ARG001: remove unused fixture param
The mocker argument isn't used here.
Apply this diff:
```diff
-def test_resolve_bucket_custom(mocker: MockerFixture) -> None:
+def test_resolve_bucket_custom() -> None:
     """Test _resolve_bucket with custom bucket name."""
     result = core._resolve_bucket("custom-bucket-name")
     assert result == "custom-bucket-name"
```

src/datamanager/core.py (1)
65-82: Clarify docstring: pass-through of non-alias names
The function may return a raw bucket name if it's not an alias; reflect that in the docstring.
Apply this diff:
```diff
 def _resolve_bucket(bucket: str | None) -> str:
     """
-    Resolves bucket parameter to actual bucket name.
+    Resolves bucket parameter to an actual bucket name:
+    - None or "production" -> settings.bucket
+    - "internal" -> settings.internal_bucket
+    - any other value -> returned as-is (pass-through)
```

src/datamanager/__main__.py (2)
120-127: Default pull bucket to dataset's configured bucket when unspecified
Safer DX: avoid accidental cross-bucket pulls. Consider making bucket optional and resolving from manifest if None.
Apply this diff:
```diff
-def _run_pull_logic(
-    name: str, version: str, output: Optional[Path], bucket: str = "production"
-) -> None:
+def _run_pull_logic(
+    name: str, version: str, output: Optional[Path], bucket: Optional[str] = None
+) -> None:
@@
-    success = core.pull_and_verify(
+    # Resolve default bucket from manifest if not provided
+    effective_bucket = bucket or manifest.get_dataset_bucket(name)
+    success = core.pull_and_verify(
         version_entry["r2_object_key"],
         version_entry["sha256"],
         final_path,
-        bucket,
+        effective_bucket,
     )
```

And in the CLI command:
```diff
-    bucket: str = typer.Option(
-        "production",
+    bucket: Optional[str] = typer.Option(
+        None,
         "--bucket",
         "-b",
-        help="Bucket to pull from: 'production' or 'internal'. Defaults to production.",
+        help="Bucket to pull from: 'production', 'internal', or a raw name. Defaults to the dataset's configured bucket.",
     ),
```
296-299: Use dataset's bucket for updates, ignore user-supplied bucket
Prevents accidental cross-bucket mixing on updates of existing datasets.
Apply this diff:
```diff
-    # Download from the appropriate bucket using core helper
-    core.download_from_r2(
-        client, latest_version["r2_object_key"], old_path, bucket
-    )
+    # Use the dataset's configured bucket for existing datasets
+    effective_bucket = dataset.get("bucket", "production")
+    core.download_from_r2(
+        client, latest_version["r2_object_key"], old_path, effective_bucket
+    )
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- .pre-commit-config.yaml (2 hunks)
- src/datamanager/__main__.py (13 hunks)
- src/datamanager/core.py (8 hunks)
- tests/test_core.py (5 hunks)
- tests/test_main.py (4 hunks)
- tests/test_manifest.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (4)
tests/test_manifest.py (2)
- tests/conftest.py (1)
  - test_repo (16-82)
- src/datamanager/manifest.py (4)
  - get_dataset_bucket (204-218)
  - update_dataset_bucket (221-234)
  - read_manifest (33-52)
  - write_manifest (55-65)

tests/test_core.py (1)
- src/datamanager/core.py (9)
  - download_from_r2 (113-137)
  - resolve_bucket_alias (52-62)
  - _resolve_bucket (65-81)
  - upload_to_staging (343-357)
  - upload_to_internal (360-362)
  - pull_and_verify (140-167)
  - delete_from_r2 (236-250)
  - verify_r2_access (315-340)
  - _check_bucket_permissions (253-312)

tests/test_main.py (3)
- tests/conftest.py (1)
  - test_repo (16-82)
- src/datamanager/manifest.py (1)
  - get_dataset (78-92)
- src/datamanager/__main__.py (1)
  - _run_prepare_logic (260-366)

src/datamanager/__main__.py (2)
- src/datamanager/manifest.py (1)
  - get_dataset (78-92)
- src/datamanager/core.py (1)
  - download_from_r2 (113-137)
🪛 Ruff (0.13.1)
tests/test_core.py
264-264: Unused function argument: mocker
(ARG001)
🔇 Additional comments (42)
.pre-commit-config.yaml (1)
24-28: Add pyupgrade target to match your minimum supported Python
Set pyupgrade's args to the project's minimum Python (e.g. --py311-plus for >=3.11). Detection of requires-python in this repo was inconclusive — confirm pyproject.toml and choose the matching flag.
```diff
   - repo: https://github.com/asottile/pyupgrade
     rev: v3.20.0
     hooks:
-      - id: pyupgrade
+      - id: pyupgrade
+        args: ["--py311-plus"]
```

tests/test_manifest.py (7)
36-36: LGTM: bucket field asserted on get_dataset. Good check that fixtures include the new "bucket" field.
52-53: LGTM: bucket preserved across history updates. Verifies that unrelated fields remain intact.
55-64: LGTM: defaults to production when unset/missing. Covers both present and non-existent dataset defaults.
66-76: LGTM: update_dataset_bucket basic path. Non-existent dataset no-op is acceptable per tests.
78-91: LGTM: backward-compat for old manifests. Solid coverage for missing bucket field.
93-103: LGTM: internal bucket retrieval. Covers explicit internal setting.
104-113: LGTM: production bucket retrieval. Covers explicit production setting.
tests/test_main.py (7)
33-33: LGTM: new datasets default to production bucket. Matches CLI default and manifest schema.
232-234: LGTM: interactive prepare passes explicit production bucket. Keeps core resolution centralized.
295-300: LGTM: interactive pull uses dataset's bucket. Correctly derives from manifest for safer defaults.
476-501: LGTM: prepare with internal bucket. Validates end-to-end create path for internal datasets.
503-516: LGTM: pull with internal bucket. Ensures bucket param is propagated to core.
518-560: LGTM: object key prefix uses dataset stem. Prevents namespace collisions; independent of bucket.
562-604: LGTM: update path derives r2_dir from existing key. Ensures stable prefixing and expected version bumping.
tests/test_core.py (13)
16-37: LGTM: typed ClientError builder. Cleaner and mypy-friendly approach for consistent error shaping.
115-129: LGTM: verify includes internal bucket. Asserts presence and permissions across all three buckets.
197-207: LGTM: download error raises ClientError. Aligns with core.download_from_r2 behavior.
209-268: LGTM: bucket alias/custom resolution tests. Covers production, internal, None, and custom names.
270-307: LGTM: staging/internal upload wrappers. Correct buckets asserted without leaking callback internals.
309-333: LGTM: pull_and_verify download error path. Properly avoids cleanup on download failure.
335-353: LGTM: pull_and_verify success path. No unnecessary cleanup; verifies expected flow.
355-362: LGTM: delete_from_r2 best-effort handling. Matches current core behavior to not raise.
364-376: LGTM: verify_r2_access client creation failure. Produces connection sentinel entry.
378-392: LGTM: access denied path. Message and permissions map look correct.
394-407: LGTM: connection error path. Surfaces connection failure coherently.
409-428: LGTM: read-only permissions path. Partial access messaging is precise.
430-447: LGTM: no permissions path. Covers zero-permission condition.
src/datamanager/core.py (7)
52-63: LGTM: alias resolution delegates to internal helper. Centralizes behavior and supports pass-through.
93-106: LGTM: upload_to_r2 resolves bucket aliases. Correctly honors explicit, aliased, and default buckets.
113-138: LGTM: download_from_r2 resolves bucket aliases and surfaces errors. Head + download with progress and re-raise is appropriate.
140-168: LGTM: pull_and_verify narrows exceptions and cleans up on mismatch. Matches tests and improves error messaging.
236-251: Log resolved bucket and narrow exception handling
Include the resolved bucket in the log; prefer ClientError first, then fall back to a broad catch to preserve current behavior and tests.
Apply this diff:
```diff
 def delete_from_r2(
     client: S3Client, object_key: str, bucket: str | None = None
 ) -> None:
     """Deletes an object from the R2 bucket."""
     target_bucket = _resolve_bucket(bucket)
-    console.print(f"Attempting to delete [yellow]{object_key}[/] from R2...")
+    console.print(
+        f"Attempting to delete [yellow]{object_key}[/] from R2 bucket [cyan]{target_bucket}[/]..."
+    )
     try:
         client.delete_object(Bucket=target_bucket, Key=object_key)
         console.print("✅ Rollback deletion successful.")
-    except Exception as e:
+    except ClientError as e:
         # This is a best-effort cleanup. We notify the user if it fails.
         console.print(
             f"[bold red]Warning:[/] Failed to delete R2 object during rollback: {e}"
         )
         console.print("You may need to manually delete the object from your R2 bucket.")
+    except Exception as e:
+        console.print(
+            f"[bold red]Warning:[/] Unexpected error during R2 deletion: {e}"
+        )
+        console.print("You may need to manually delete the object from your R2 bucket.")
```
315-341: LGTM: verify includes internal bucket. Expanded surface while preserving connection error handling.
360-363: LGTM: upload_to_internal wrapper. Encapsulates alias usage cleanly.
src/datamanager/__main__.py (7)
70-73: LGTM: clearer table headers. Improves UX without behavior change.
100-116: LGTM: dataset list shows bucket context. Nice visual cue with internal/production icons.
334-348: LGTM: v1 object key uses dataset stem, not bucket name. Fixes potential namespace collisions across buckets.
375-388: LGTM: prepare command exposes bucket option. Aligns CLI with internal/production use-cases.
390-425: LGTM: interactive production prepare. Clear prompts and safe confirmation flow.
428-465: LGTM: interactive internal prepare. Good separation of flows and messaging.
775-779: LGTM: TUI adds internal prepare action. Completes the UX for team-only datasets.
Internal Bucket Support for Team-Only Datasets
This PR introduces comprehensive support for internal (team-only) R2 buckets alongside the existing production (public) bucket system.
Datasets stored in the internal bucket are visible via datamanager pull with the right API access, or in the Cloudflare bucket dashboard.
Configuration Required
Add to your .env file:
R2_INTERNAL_BUCKET="your-internal-bucket-name"

Summary by CodeRabbit
New Features
Documentation
Tests
Chores