feat: Add internal bucket support for team-only datasets #38

ParticularlyPythonicBS wants to merge 4 commits into main
Conversation
- Add multi-bucket architecture supporting production (public) and internal (team-only) R2 buckets
- Implement bucket routing based on dataset manifest configuration
- Add --bucket option to prepare and pull commands
- Update CI/CD workflows to handle internal bucket publications and deletions
Walkthrough

Adds per-dataset bucket support (production vs internal) across the CI publish script, CLI, core R2 APIs, config, manifest, docs, env, and tests. Operations (prepare, pull, verify, delete, publish) now target a dataset's configured bucket; manifest entries include a new "bucket" field and helpers to get/update it.
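For orientation, a minimal sketch of a manifest entry carrying the new field and of the default applied when it is absent; the "fileName"/"bucket" keys and the "production" default follow the helpers discussed in this review, while the concrete values are illustrative:

```python
from typing import Any

# Illustrative manifest entry; real entries also carry version history, hashes, etc.
entry: dict[str, Any] = {
    "fileName": "example.sqlite",  # hypothetical dataset name
    "bucket": "internal",          # new field: "production" (default) or "internal"
}

def bucket_for(dataset: dict[str, Any]) -> str:
    # Entries written before this change have no "bucket" key; fall back to production.
    return dataset.get("bucket", "production")

assert bucket_for(entry) == "internal"
assert bucket_for({"fileName": "legacy.sqlite"}) == "production"
```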
Sequence Diagram(s)

```mermaid
sequenceDiagram
autonumber
actor User
participant CLI as CLI (__main__.py)
participant Manifest as Manifest
participant Core as Core (R2 APIs)
participant R2Prod as R2: Production
participant R2Int as R2: Internal
rect rgb(240,248,255)
note over User,CLI: Prepare dataset
User->>CLI: prepare --bucket=<production|internal>
CLI->>Manifest: update_dataset_bucket(name, bucket)
CLI->>Core: upload_to_staging(file, staging_key, bucket)
CLI->>Manifest: write history entry (staging_key, bucket)
end
rect rgb(245,255,240)
note over User,CLI: Pull dataset
User->>CLI: pull --bucket=<production|internal>
CLI->>Manifest: get_dataset_bucket(name)
CLI->>Core: pull_and_verify(object_key, sha, out, bucket)
alt bucket=production
Core->>R2Prod: GET object_key
else bucket=internal
Core->>R2Int: GET object_key
end
Core-->>CLI: verified
CLI-->>User: file saved
end
```

```mermaid
sequenceDiagram
autonumber
participant GH as GitHub Actions
participant Pub as publish_script.py
participant R2P as R2: Production
participant R2I as R2: Internal
rect rgb(255,250,240)
note over GH,Pub: Publish step
GH->>Pub: env (R2_INTERNAL_BUCKET,…)
Pub->>Pub: Group objects by bucket
par Delete production objects
Pub->>R2P: Batch DELETE(keys)
and Delete internal objects
Pub->>R2I: Batch DELETE(keys)
end
par Copy production objects
Pub->>R2P: Server-side COPY(staging_key -> object_key)
and Copy internal objects
Pub->>R2I: Server-side COPY(staging_key -> object_key)
end
Pub-->>GH: logs incl. bucket context
end
```
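As a rough sketch of the grouping, batched-delete, and server-side copy steps in the diagram above (the function and variable names are illustrative, not the publish script's actual ones; the boto3 S3 client calls shown are standard):

```python
from collections import defaultdict

def group_keys_by_bucket(entries):
    """entries: iterable of (bucket_name, object_key) pairs derived from the manifest."""
    grouped = defaultdict(list)
    for bucket_name, key in entries:
        grouped[bucket_name].append(key)
    return grouped

def delete_grouped(client, grouped):
    for bucket_name, keys in grouped.items():
        # Batch delete; S3-compatible APIs accept up to 1000 keys per request.
        client.delete_objects(
            Bucket=bucket_name,
            Delete={"Objects": [{"Key": k} for k in keys]},
        )

def copy_grouped(client, staging_bucket, grouped_copies):
    for bucket_name, pairs in grouped_copies.items():
        for staging_key, object_key in pairs:
            # Server-side copy from staging into the target (production or internal) bucket.
            client.copy_object(
                Bucket=bucket_name,
                Key=object_key,
                CopySource={"Bucket": staging_bucket, "Key": staging_key},
            )
```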
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Pre-merge checks and finishing touches: ✅ Passed checks (3 passed)
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tests/test_manifest.py (1)
507-515: Fix failing expectation: pass bucket alias to core or adjust assertion
CI shows the 4th arg to pull_and_verify is a real bucket name (e.g., "DUMMY") instead of "internal". Either:
- Pass the alias through to core and let core resolve it, or
- Update the test to check against settings.internal_bucket.
Given the rest of the suite expects aliases at boundaries, prefer passing the alias and resolving inside core.
Apply these diffs:
- Resolve alias in core and accept both alias or raw bucket names:
```diff
--- a/src/datamanager/core.py
+++ b/src/datamanager/core.py
@@
-from typing import Any, TypedDict
+from typing import Any, TypedDict
+from typing import Optional  # if needed elsewhere
@@
-def upload_to_r2(
-    client: S3Client, file_path: Path, object_key: str, bucket: str = None
-) -> None:
+def upload_to_r2(
+    client: S3Client, file_path: Path, object_key: str, bucket: str | None = None
+) -> None:
@@
-    target_bucket = bucket or settings.bucket
+    target_bucket = _resolve_bucket(bucket)
@@
-def download_from_r2(
-    client: S3Client, object_key: str, download_path: Path, bucket: str = None
-) -> None:
+def download_from_r2(
+    client: S3Client, object_key: str, download_path: Path, bucket: str | None = None
+) -> None:
@@
-    target_bucket = bucket or settings.bucket
+    target_bucket = _resolve_bucket(bucket)
@@
-def pull_and_verify(
-    object_key: str, expected_hash: str, output_path: Path, bucket: str = None
-) -> bool:
+def pull_and_verify(
+    object_key: str, expected_hash: str, output_path: Path, bucket: str | None = None
+) -> bool:
@@
-    download_from_r2(client, object_key, output_path, bucket)
+    download_from_r2(client, object_key, output_path, bucket)
@@
-def delete_from_r2(client: S3Client, object_key: str, bucket: str = None) -> None:
+def delete_from_r2(client: S3Client, object_key: str, bucket: str | None = None) -> None:
@@
-    target_bucket = bucket or settings.bucket
+    target_bucket = _resolve_bucket(bucket)
```

Add this helper (outside the changed ranges, e.g., near other helpers):
```python
def _resolve_bucket(bucket: str | None) -> str:
    # Accept aliases or raw bucket names.
    if bucket is None or bucket == "production":
        return settings.bucket
    if bucket == "internal":
        return settings.internal_bucket
    return bucket
```
- Pass the alias from CLI to core:
```diff
--- a/src/datamanager/__main__.py
+++ b/src/datamanager/__main__.py
@@
-    # Determine which bucket to use
-    target_bucket = (
-        settings.internal_bucket if bucket == "internal" else settings.bucket
-    )
-
     success = core.pull_and_verify(
         version_entry["r2_object_key"],
         version_entry["sha256"],
         final_path,
-        target_bucket,
+        bucket,  # pass alias; core resolves to bucket name
     )
```
- Use alias for diff download too (core resolves to actual):
```diff
--- a/src/datamanager/__main__.py
+++ b/src/datamanager/__main__.py
@@
-    # Download from the appropriate bucket
-    target_bucket = (
-        settings.internal_bucket if bucket == "internal" else settings.bucket
-    )
-    core.download_from_r2(
-        client, latest_version["r2_object_key"], old_path, target_bucket
-    )
+    # Download from the appropriate bucket (alias; core resolves)
+    core.download_from_r2(
+        client, latest_version["r2_object_key"], old_path, bucket
+    )
```
🧹 Nitpick comments (14)
docs/source/workflow.md (1)
44-49: Clarify how the "appropriate bucket" is chosen
Suggest explicitly stating that CI selects the target bucket based on the dataset's bucket field in manifest.json to remove ambiguity.
```diff
-It copies the data from the staging bucket to the appropriate target bucket (production or internal).
+It copies the data from the staging bucket to the appropriate target bucket (production or internal), as specified by each dataset's bucket field in manifest.json.
```

.github/workflows/publish.yml (1)
36-45: Fail fast if R2_INTERNAL_BUCKET (and other required env) are unset
Add a preflight validation step so the job errors with a clear message when any required variable is missing.
```diff
       - name: Install project and dependencies
         run: |
           uv sync --all-extras --dev
           uv pip install -e . --no-deps

+      - name: Validate required env
+        run: |
+          set -euo pipefail
+          for v in R2_ACCOUNT_ID R2_ACCESS_KEY_ID R2_SECRET_ACCESS_KEY R2_PRODUCTION_BUCKET R2_STAGING_BUCKET R2_INTERNAL_BUCKET; do
+            if [ -z "${!v:-}" ]; then
+              echo "::error::Missing required env: $v"
+              exit 1
+            fi
+          done
+        env:
+          R2_ACCOUNT_ID: ${{ secrets.R2_ACCOUNT_ID }}
+          R2_ACCESS_KEY_ID: ${{ secrets.R2_ACCESS_KEY_ID }}
+          R2_SECRET_ACCESS_KEY: ${{ secrets.R2_SECRET_ACCESS_KEY }}
+          R2_PRODUCTION_BUCKET: ${{ vars.R2_PRODUCTION_BUCKET }}
+          R2_STAGING_BUCKET: ${{ vars.R2_STAGING_BUCKET }}
+          R2_INTERNAL_BUCKET: ${{ vars.R2_INTERNAL_BUCKET }}
+
       - name: Run Publish Script
         run: uv run .github/scripts/publish_script.py
         env:
           R2_ACCOUNT_ID: ${{ secrets.R2_ACCOUNT_ID }}
           R2_ACCESS_KEY_ID: ${{ secrets.R2_ACCESS_KEY_ID }}
           R2_SECRET_ACCESS_KEY: ${{ secrets.R2_SECRET_ACCESS_KEY }}
           R2_PRODUCTION_BUCKET: ${{ vars.R2_PRODUCTION_BUCKET }}  # Use repo variable
           R2_STAGING_BUCKET: ${{ vars.R2_STAGING_BUCKET }}  # Use repo variable
           R2_INTERNAL_BUCKET: ${{ vars.R2_INTERNAL_BUCKET }}  # Use repo variable
```

docs/source/usage.md (1)
55-65: Note permission requirements for internal pulls
Add a short note pointing users to the verify command if they get AccessDenied when pulling internal datasets.
```diff
-Downloads a dataset from the appropriate R2 bucket (production or internal) and verifies its integrity. The bucket is determined by the dataset's configuration in the manifest.
+Downloads a dataset from the appropriate R2 bucket (production or internal) and verifies its integrity. The bucket is determined by the dataset's configuration in the manifest. Note: Pulling an internal dataset requires access to the internal bucket. If you see AccessDenied, run `uv run datamanager verify`.
```

src/datamanager/config.py (2)
6-11: Also read from OS environment, not only .env
Current code ignores exported env vars if .env is not present; reading both improves usability in CI and shells.
```diff
 from dotenv import find_dotenv, dotenv_values
+import os
 import warnings

 _ENV_PATH = Path(find_dotenv()) if find_dotenv() else None
 _ENV = dotenv_values(_ENV_PATH) if _ENV_PATH else {}
```
13-22: Apply env fallback in _need()
Prefer OS env first, then .env, then warn and use dummy.
```diff
 def _need(var: str) -> str:
-    val = _ENV.get(var)
+    val = os.getenv(var, _ENV.get(var))
     if val is None:
         warnings.warn(
             f"Env var {var} is missing – using dummy value for offline/test mode",
             RuntimeWarning,
         )
         val = "DUMMY"
     return str(val)
```

docs/source/setup.md (1)
3-8: LGTM; minor doc polish suggestion
Consider adding a one-liner noting that access to the internal bucket should be tightly scoped to team members (principle of least privilege). Optional.
Also applies to: 46-47
tests/test_core.py (3)
91-105: Avoid brittle order-dependence in verify_r2_access results
Tests assume [production, staging, internal] ordering. Please ensure core.verify_r2_access guarantees this order, or expose a bucket identifier so tests can map results by name instead of index.
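A sketch of the name-keyed alternative suggested here; it assumes each result dict gains a "bucket" identifier, which the current return shape may not have:

```python
def test_verify_results_by_name() -> None:
    # Assumes core.verify_r2_access() results each carry a "bucket" key (hypothetical).
    results = core.verify_r2_access()
    by_bucket = {r["bucket"]: r for r in results}
    assert by_bucket[settings.bucket]["exists"] is True
    assert by_bucket[settings.internal_bucket]["permissions"]["read"] is True
```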
116-124: Broaden assertions to staging/internal for read-only case
Optional: assert the same read-only behavior for staging and internal to increase coverage.
```diff
 results = core.verify_r2_access()
-prod_result = results[0]
-
-assert prod_result["exists"] is True
-assert prod_result["permissions"]["read"] is True
-assert prod_result["permissions"]["write"] is False
-assert prod_result["permissions"]["delete"] is False
-assert "Partial access: [read]" in prod_result["message"]
+for r in results:
+    assert r["exists"] is True
+    assert r["permissions"]["read"] is True
+    assert r["permissions"]["write"] is False
+    assert r["permissions"]["delete"] is False
+    assert "Partial access: [read]" in r["message"]
```
132-138: Broaden assertions to all buckets for not-found case
Given head_bucket raises for each call, all buckets should reflect not-found.
```diff
 results = core.verify_r2_access()
-prod_result = results[0]
-
-assert prod_result["exists"] is False
-assert not any(prod_result["permissions"].values())
-assert "Bucket not found" in prod_result["message"]
+for r in results:
+    assert r["exists"] is False
+    assert not any(r["permissions"].values())
+    assert "Bucket not found" in r["message"]
```

src/datamanager/manifest.py (2)
204-218: Type-narrow the return and clarify defaulting
Constrain the return type to the allowed literals and keep the current defaulting behavior.
```diff
-from typing import Any, Optional
+from typing import Any, Optional, Literal
@@
-def get_dataset_bucket(name: str) -> str:
+def get_dataset_bucket(name: str) -> Literal["production", "internal"]:
@@
-    return dataset.get("bucket", "production")
+    return dataset.get("bucket", "production")
```
220-234: Validate input and avoid unnecessary write on no‑op
- Validate that only "production" or "internal" are accepted.
- Return early if dataset is not found to avoid noop writes.
```diff
-from typing import Any, Optional
+from typing import Any, Optional, Literal
@@
-def update_dataset_bucket(name: str, bucket: str) -> None:
+def update_dataset_bucket(name: str, bucket: Literal["production", "internal"]) -> None:
@@
-    for item in data:
+    updated = False
+    for item in data:
         if item.get("fileName") == name:
             item["bucket"] = bucket
-            break
-    write_manifest(data)
+            updated = True
+            break
+    if updated:
+        write_manifest(data)
```

.github/scripts/publish_script.py (1)
49-55: Docs generation: avoid public links for internal datasets
For internal datasets, the generated "Download" link will point to a public base URL and likely be inaccessible. Prefer a label instead.
Outside this hunk, adjust finalize_dataset_docs:
```python
# inside the loop over datasets, before building download_link
is_internal = dataset.get("bucket") == "internal"
...
download_link = "N/A"
if r2_object_key:
    if is_internal:
        download_link = "Internal-only"
    else:
        full_url = f"{BASE_DOWNLOAD_URL}{r2_object_key}"
        download_link = f"[Download]({full_url})"
```
265-271: Unused ctx in _run_prepare_logic
ctx isn't used; consider removing the parameter or prefixing it with an underscore to satisfy linters. Keeping it is OK if you plan to use it soon.
385-390: Consider restricting --bucket to choices
Define an Enum or use typer.Choice to prevent invalid values.
Example:
```python
class Bucket(str, Enum):
    production = "production"
    internal = "internal"

bucket: Bucket = typer.Option(Bucket.production, "--bucket", "-b", case_sensitive=False)
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (16)
- .github/scripts/publish_script.py (5 hunks)
- .github/workflows/publish.yml (1 hunks)
- README.md (6 hunks)
- docs/source/setup.md (2 hunks)
- docs/source/usage.md (2 hunks)
- docs/source/workflow.md (1 hunks)
- env.example (1 hunks)
- manifest.json (2 hunks)
- src/datamanager/__main__.py (13 hunks)
- src/datamanager/config.py (1 hunks)
- src/datamanager/core.py (7 hunks)
- src/datamanager/manifest.py (2 hunks)
- tests/conftest.py (1 hunks)
- tests/test_core.py (1 hunks)
- tests/test_main.py (4 hunks)
- tests/test_manifest.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
tests/test_manifest.py (2)
- tests/conftest.py (1)
  - test_repo (16-82)
- src/datamanager/manifest.py (3)
  - get_dataset_bucket (204-217)
  - update_dataset_bucket (220-233)
  - read_manifest (33-52)

tests/test_main.py (2)
- tests/conftest.py (1)
  - test_repo (16-82)
- src/datamanager/manifest.py (1)
  - get_dataset (78-92)

src/datamanager/__main__.py (2)
- src/datamanager/core.py (2)
  - pull_and_verify (108-134)
  - download_from_r2 (81-105)
- src/datamanager/manifest.py (1)
  - get_dataset (78-92)
🪛 GitHub Actions: CI
tests/test_main.py
[error] 514-514: Test 'test_pull_internal_bucket' failed: expected bucket parameter 'internal' but actual value 'DUMMY' (likely due to offline/test mode dummy values).
🪛 Ruff (0.13.1)
src/datamanager/core.py
62-62: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
82-82: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
109-109: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
203-203: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
src/datamanager/__main__.py
266-266: Unused function argument: ctx
(ARG001)
🔇 Additional comments (33)
docs/source/usage.md (1)
97-101: LGTM. Clearer verify description aligning with three-bucket support.
tests/conftest.py (1)
46-46: LGTM; ensure legacy manifests remain compatible
Adding bucket: "production" is good. Please confirm the loader defaults to production when bucket is absent so older manifests don't break; add/keep a test for that.
src/datamanager/config.py (1)
31-31: LGTM. internal_bucket addition aligns with docs and workflow.
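For reference, a minimal sketch of how the new value might be exposed by the dotenv-based config shown in the nitpicks above; the dataclass wrapper and the simplified _need stub are illustrative, and only the env var names and the settings.internal_bucket attribute come from this PR:

```python
import os
from dataclasses import dataclass

def _need(var: str) -> str:
    # Simplified stand-in for the project's helper (which warns and returns "DUMMY" when unset).
    return os.getenv(var, "DUMMY")

@dataclass(frozen=True)
class Settings:
    bucket: str
    staging_bucket: str
    internal_bucket: str

settings = Settings(
    bucket=_need("R2_PRODUCTION_BUCKET"),
    staging_bucket=_need("R2_STAGING_BUCKET"),
    internal_bucket=_need("R2_INTERNAL_BUCKET"),
)
```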
env.example (1)
6-6: LGTM. Example env updated for internal bucket.
src/datamanager/manifest.py (1)
24-26: Export list update looks good. New APIs are properly exported.
tests/test_manifest.py (5)
36-36: LGTM: Bucket field asserted. Asserting default "production" on existing dataset is correct per fixtures.
52-53: LGTM: Bucket preserved across history update. Ensures update path doesn't drop the field.
55-64: LGTM: get_dataset_bucket tests. Covers both present and missing dataset cases.
66-76: LGTM: update_dataset_bucket tests. Covers update and no-op behaviors.
476-500: LGTM: Internal bucket prepare path. Validates internal flow and manifest fields.
manifest.json (2)
4-4: LGTM: Default bucket added to existing dataset. Matches code paths that default to "production".
36-51: LGTM: Example internal dataset entry. Structure aligns with create flow (bucket="internal", v1 history).
.github/scripts/publish_script.py (4)
137-142: Good: per-dataset bucket routing for deletions. Bucket selection logic is clear and isolated.
174-213: Good: batch delete grouped per bucket. Chunking and error handling are appropriate.
241-255: Good: publish targets correct bucket and logs context. Server-side copy uses target bucket; logs are clear.
15-15: R2_INTERNAL_BUCKET exported in CI
publish.yml exports R2_INTERNAL_BUCKET from repo vars: .github/workflows/publish.yml:44.
README.md (8)
62-69: LGTM: Feature bullets updated for multi-bucket and verify. Clear and consistent with CLI.
70-87: LGTM: Bucket types section. Concise and actionable guidance.
93-97: LGTM: Prereqs list includes internal bucket. Matches new env requirements.
139-139: LGTM: Env var added. R2_INTERNAL_BUCKET documented.
169-174: LGTM: Prepare examples show bucket selection. Good defaults and internal option.
220-221: LGTM: CLI note on --bucket. Accurately reflects new behavior.
260-271: LGTM: Pull examples for production and internal. Covers both latest and specific version cases.
303-307: LGTM: Verify mentions all three buckets. Aligned with core.verify_r2_access.
src/datamanager/core.py (2)
325-339: LGTM: upload_to_internal helper. Useful for explicit internal uploads; consistent progress handling.
280-296: LGTM: verify includes internal bucket. Covers all three buckets; messages are coherent.
src/datamanager/__main__.py (2)
70-73: LGTM: Verify table headers clarified. Minor UX polish.
104-112: LGTM: Datasets list shows bucket clearly. Icons and labels improve visibility.
tests/test_main.py (5)
33-33: Default bucket assertion LGTM. Good check that new datasets default to production.
233-234: Interactive prepare correctly propagates bucket. Ensures default production bucket flows into core logic.
299-300: Interactive pull correctly propagates bucket. Production bucket is forwarded to the pull logic as expected.
476-500: Internal prepare path looks good. Covers internal bucket flag end-to-end and validates manifest persistence.
502-515: Fix failing assertion: compare against settings.internal_bucket, not literal "internal"
Replace the assertion with:

```python
assert call_args[3] == settings.internal_bucket
```

Likely an incorrect or invalid review comment.
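For context on the call_args[3] indexing used in that assertion, a generic unittest.mock illustration (not the PR's actual test; the argument values are made up):

```python
from unittest.mock import MagicMock

fake_pull = MagicMock(return_value=True)
fake_pull("data/key_v2.sqlite", "sha256-placeholder", "out.sqlite", "my-internal-bucket")

positional_args = fake_pull.call_args[0]  # tuple of positional arguments from the last call
assert positional_args[3] == "my-internal-bucket"  # the 4th positional arg is the bucket
```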
refactor: centralize bucket resolution logic in core module

- Add _resolve_bucket() helper function with PEP 604 union types
- Update upload_to_r2, download_from_r2, and delete_from_r2 to use the helper
- Simplify bucket handling in the main module to pass aliases directly
- Support both bucket aliases (production/internal) and raw bucket names consistently
Codecov Report: ❌ Patch coverage is
Actionable comments posted: 4
🧹 Nitpick comments (5)
src/datamanager/core.py (2)
68-85: _resolve_bucket logic looks good
Covers None/"production"/"internal" and pass-through of raw names. Consider centralizing all resolution via this helper (see previous comment).
386-400: Avoid duplication: delegate upload_to_internal to upload_to_r2
Reuse the generic uploader with bucket="internal" to keep behavior consistent and reduce maintenance.
```diff
 def upload_to_internal(client: S3Client, file_path: Path, object_key: str) -> None:
-    """Uploads a file to the INTERNAL R2 bucket with a progress bar."""
-    file_size = file_path.stat().st_size
-    with Progress() as progress:
-        task = progress.add_task(
-            f"[blue]Uploading to internal: {file_path.name}...", total=file_size
-        )
-        client.upload_file(
-            str(file_path),
-            settings.internal_bucket,
-            object_key,
-            Callback=lambda bytes_transferred: progress.update(
-                task, advance=bytes_transferred
-            ),
-        )
+    """Uploads a file to the INTERNAL R2 bucket with a progress bar."""
+    upload_to_r2(client, file_path, object_key, bucket="internal")
```

src/datamanager/__main__.py (3)
260-266: Unused ctx argument in _run_prepare_logic
Ruff ARG001: ctx is unused. Rename to _ctx or remove the parameter.
```diff
-def _run_prepare_logic(
-    ctx: typer.Context, name: str, file: Path, bucket: str = "production"
-) -> None:
+def _run_prepare_logic(
+    _ctx: typer.Context, name: str, file: Path, bucket: str = "production"
+) -> None:
```

Call sites need no change since it's positional.
377-382: Constrain bucket option to avoid typos
Restrict CLI bucket values to "production" or "internal" for better UX.
```diff
-from typing import Callable, Optional, Any
+from typing import Callable, Optional, Any, Literal
@@
-def prepare(
+def prepare(
@@
-    bucket: str = typer.Option(
+    bucket: Literal["production", "internal"] = typer.Option(
         "production",
         "--bucket",
         "-b",
         help="Target bucket: 'production' or 'internal'. Defaults to production.",
     ),
```

Consider applying the same to the pull command's bucket option.
179-184: Apply the same choice constraint to pull command
Mirror the prepare command's restriction here.
```diff
-    bucket: str = typer.Option(
+    bucket: Literal["production", "internal"] = typer.Option(
         "production",
         "--bucket",
         "-b",
         help="Bucket to pull from: 'production' or 'internal'. Defaults to production.",
     ),
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- src/datamanager/__main__.py (13 hunks)
- src/datamanager/core.py (8 hunks)
- src/datamanager/manifest.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- src/datamanager/manifest.py
🧰 Additional context used
🧬 Code graph analysis (1)
src/datamanager/__main__.py (2)
- src/datamanager/manifest.py (1)
  - get_dataset (78-92)
- src/datamanager/core.py (1)
  - download_from_r2 (116-140)
🪛 Ruff (0.13.1)
src/datamanager/core.py
155-155: Do not catch blind exception: Exception
(BLE001)
src/datamanager/__main__.py
261-261: Unused function argument: ctx
(ARG001)
🔇 Additional comments (11)
src/datamanager/core.py (3)
96-101: upload_to_r2 correctly adopts alias resolution and PEP 604 type hints. Good use of _resolve_bucket and progress callback.
Also applies to: 107-109
116-138: download_from_r2 correctly resolves bucket and applies head/download to the same target. Looks consistent and safe. Errors are surfaced via ClientError.
343-357: verify_r2_access includes internal bucket — LGTM. Covers production, staging, internal. Good.
Please ensure settings.internal_bucket is non-empty in all deployment environments to avoid a false-negative “bucket not found”.
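A minimal sketch of the kind of startup guard that check implies; settings.bucket and settings.internal_bucket are the PR's attribute names, while the guard itself and the "DUMMY" sentinel check are illustrative:

```python
def assert_buckets_configured(settings) -> None:
    # Fail fast instead of surfacing a misleading "bucket not found" during verify.
    buckets = {
        "R2 production bucket": settings.bucket,
        "R2 internal bucket": settings.internal_bucket,
    }
    missing = [label for label, value in buckets.items() if not value or value == "DUMMY"]
    if missing:
        raise RuntimeError(f"Unconfigured buckets: {', '.join(missing)}")
```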
src/datamanager/__main__.py (8)
70-73: Minor table header rename — LGTM. Clearer labeling in the verify report.
104-112: Display of dataset bucket in list view — LGTM. Good UX; defaults to production and uses clear icons.
120-127: Threading bucket alias through pull logic — LGTM. Passing the alias to core centralizes resolution as intended.
Also applies to: 146-151
199-206: Interactive pull uses dataset bucket — LGTM. Choice formatting, null handling, and using the manifest's bucket are consistent.
Also applies to: 207-209, 215-217, 223-225, 254-255
296-299: Alias passed to core.download_from_r2 — LGTM. Keeps bucket resolution centralized.
392-427: Production prepare interactive path — LGTM. Flow and messaging look good.
430-468: Internal prepare interactive path — LGTM. Clear separation and user messaging for internal datasets.
779-781: TUI adds distinct internal/production prepare actions — LGTM. Good discoverability.
```diff
 def delete_from_r2(
     client: S3Client, object_key: str, bucket: str | None = None
 ) -> None:
     """Deletes an object from the R2 bucket."""
     target_bucket = _resolve_bucket(bucket)
     console.print(f"Attempting to delete [yellow]{object_key}[/] from R2...")
     try:
-        client.delete_object(Bucket=settings.bucket, Key=object_key)
+        client.delete_object(Bucket=target_bucket, Key=object_key)
         console.print("✅ Rollback deletion successful.")
```
Clarify deletion target and narrow exception type
Improve observability by logging the resolved bucket, and avoid catching broad exceptions.
```diff
 def delete_from_r2(
     client: S3Client, object_key: str, bucket: str | None = None
 ) -> None:
     """Deletes an object from the R2 bucket."""
     target_bucket = _resolve_bucket(bucket)
-    console.print(f"Attempting to delete [yellow]{object_key}[/] from R2...")
+    console.print(
+        f"Attempting to delete [yellow]{object_key}[/] from R2 bucket [cyan]{target_bucket}[/]..."
+    )
     try:
         client.delete_object(Bucket=target_bucket, Key=object_key)
         console.print("✅ Rollback deletion successful.")
-    except Exception as e:
+    except ClientError as e:
         # This is a best-effort cleanup. We notify the user if it fails.
         console.print(
             f"[bold red]Warning:[/] Failed to delete R2 object during rollback: {e}"
         )
         console.print("You may need to manually delete the object from your R2 bucket.")
```

Also applies to: 271-277
Actionable comments posted: 1
🧹 Nitpick comments (6)
tests/test_manifest.py (2)
115-119: Stabilize test environment by using the fixture
This test doesn't chdir into the repo. It works today because read_manifest handles missing files, but using the test_repo fixture would make it fully isolated and order-independent.
Apply this diff:
```diff
-def test_get_dataset_bucket_nonexistent() -> None:
+def test_get_dataset_bucket_nonexistent(test_repo: Path) -> None:
     """Test getting bucket type for non-existent dataset."""
+    os.chdir(test_repo)
     bucket = manifest.get_dataset_bucket("non-existent.sqlite")
     assert bucket == "production"  # Should default to production
```
121-131: Optional: consider validating bucket values at API boundaries
Tests intentionally allow any string. If you later want stricter semantics, add a validator (e.g., in CLI) or a helper like is_valid_bucket() rather than changing update_dataset_bucket's behavior.
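A sketch of such a boundary check, using the hypothetical is_valid_bucket name from the comment (not part of the PR):

```python
VALID_BUCKETS = frozenset({"production", "internal"})

def is_valid_bucket(value: str) -> bool:
    """Hypothetical helper: accept only the two bucket aliases."""
    return value in VALID_BUCKETS

# Possible use at the CLI boundary, leaving update_dataset_bucket's behavior unchanged:
requested = "internal"
if not is_valid_bucket(requested):
    raise ValueError("bucket must be 'production' or 'internal'")
```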
tests/test_core.py (1)
264-268: Silence Ruff ARG001: remove unused fixture param
The mocker argument isn't used here.
Apply this diff:
```diff
-def test_resolve_bucket_custom(mocker: MockerFixture) -> None:
+def test_resolve_bucket_custom() -> None:
     """Test _resolve_bucket with custom bucket name."""
     result = core._resolve_bucket("custom-bucket-name")
     assert result == "custom-bucket-name"
```

src/datamanager/core.py (1)
65-82: Clarify docstring: pass-through of non-alias names
The function may return a raw bucket name if it's not an alias; reflect that in the docstring.
Apply this diff:
```diff
 def _resolve_bucket(bucket: str | None) -> str:
     """
-    Resolves bucket parameter to actual bucket name.
+    Resolves bucket parameter to an actual bucket name:
+    - None or "production" -> settings.bucket
+    - "internal" -> settings.internal_bucket
+    - any other value -> returned as-is (pass-through)
```

src/datamanager/__main__.py (2)
120-127: Default pull bucket to dataset's configured bucket when unspecified
Safer DX: avoid accidental cross-bucket pulls. Consider making bucket optional and resolving from manifest if None.
Apply this diff:
```diff
-def _run_pull_logic(
-    name: str, version: str, output: Optional[Path], bucket: str = "production"
-) -> None:
+def _run_pull_logic(
+    name: str, version: str, output: Optional[Path], bucket: Optional[str] = None
+) -> None:
@@
-    success = core.pull_and_verify(
+    # Resolve default bucket from manifest if not provided
+    effective_bucket = bucket or manifest.get_dataset_bucket(name)
+    success = core.pull_and_verify(
         version_entry["r2_object_key"],
         version_entry["sha256"],
         final_path,
-        bucket,
+        effective_bucket,
     )
```

And in the CLI command:
```diff
-    bucket: str = typer.Option(
-        "production",
+    bucket: Optional[str] = typer.Option(
+        None,
         "--bucket",
         "-b",
-        help="Bucket to pull from: 'production' or 'internal'. Defaults to production.",
+        help="Bucket to pull from: 'production', 'internal', or a raw name. Defaults to the dataset's configured bucket.",
     ),
```
296-299: Use dataset's bucket for updates, ignore user-supplied bucket
Prevents accidental cross-bucket mixing on updates of existing datasets.
Apply this diff:
```diff
-    # Download from the appropriate bucket using core helper
-    core.download_from_r2(
-        client, latest_version["r2_object_key"], old_path, bucket
-    )
+    # Use the dataset's configured bucket for existing datasets
+    effective_bucket = dataset.get("bucket", "production")
+    core.download_from_r2(
+        client, latest_version["r2_object_key"], old_path, effective_bucket
+    )
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- .pre-commit-config.yaml (2 hunks)
- src/datamanager/__main__.py (13 hunks)
- src/datamanager/core.py (8 hunks)
- tests/test_core.py (5 hunks)
- tests/test_main.py (4 hunks)
- tests/test_manifest.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (4)
tests/test_manifest.py (2)
- tests/conftest.py (1)
  - test_repo (16-82)
- src/datamanager/manifest.py (4)
  - get_dataset_bucket (204-218)
  - update_dataset_bucket (221-234)
  - read_manifest (33-52)
  - write_manifest (55-65)

tests/test_core.py (1)
- src/datamanager/core.py (9)
  - download_from_r2 (113-137)
  - resolve_bucket_alias (52-62)
  - _resolve_bucket (65-81)
  - upload_to_staging (343-357)
  - upload_to_internal (360-362)
  - pull_and_verify (140-167)
  - delete_from_r2 (236-250)
  - verify_r2_access (315-340)
  - _check_bucket_permissions (253-312)

tests/test_main.py (3)
- tests/conftest.py (1)
  - test_repo (16-82)
- src/datamanager/manifest.py (1)
  - get_dataset (78-92)
- src/datamanager/__main__.py (1)
  - _run_prepare_logic (260-366)

src/datamanager/__main__.py (2)
- src/datamanager/manifest.py (1)
  - get_dataset (78-92)
- src/datamanager/core.py (1)
  - download_from_r2 (113-137)
🪛 Ruff (0.13.1)
tests/test_core.py
264-264: Unused function argument: mocker
(ARG001)
🔇 Additional comments (42)
.pre-commit-config.yaml (1)
24-28: Add pyupgrade target to match your minimum supported Python
Set pyupgrade's args to the project's minimum Python (e.g. --py311-plus for >=3.11). Detection of requires-python in this repo was inconclusive — confirm pyproject.toml and choose the matching flag.
```diff
   - repo: https://github.com/asottile/pyupgrade
     rev: v3.20.0
     hooks:
-      - id: pyupgrade
+      - id: pyupgrade
+        args: ["--py311-plus"]
```

tests/test_manifest.py (7)
36-36: LGTM: bucket field asserted on get_dataset. Good check that fixtures include the new "bucket" field.
52-53: LGTM: bucket preserved across history updates. Verifies that unrelated fields remain intact.
55-64: LGTM: defaults to production when unset/missing. Covers both present and non-existent dataset defaults.
66-76: LGTM: update_dataset_bucket basic path. Non-existent dataset no-op is acceptable per tests.
78-91: LGTM: backward-compat for old manifests. Solid coverage for missing bucket field.
93-103: LGTM: internal bucket retrieval. Covers explicit internal setting.
104-113: LGTM: production bucket retrieval. Covers explicit production setting.
tests/test_main.py (7)
33-33: LGTM: new datasets default to production bucket. Matches CLI default and manifest schema.
232-234: LGTM: interactive prepare passes explicit production bucket. Keeps core resolution centralized.
295-300: LGTM: interactive pull uses dataset's bucket. Correctly derives from manifest for safer defaults.
476-501: LGTM: prepare with internal bucket. Validates end-to-end create path for internal datasets.
503-516: LGTM: pull with internal bucket. Ensures bucket param is propagated to core.
518-560: LGTM: object key prefix uses dataset stem. Prevents namespace collisions; independent of bucket.
562-604: LGTM: update path derives r2_dir from existing key. Ensures stable prefixing and expected version bumping.
tests/test_core.py (13)
16-37: LGTM: typed ClientError builder. Cleaner and mypy-friendly approach for consistent error shaping.
115-129: LGTM: verify includes internal bucket. Asserts presence and permissions across all three buckets.
197-207: LGTM: download error raises ClientError. Aligns with core.download_from_r2 behavior.
209-268: LGTM: bucket alias/custom resolution tests. Covers production, internal, None, and custom names.
270-307: LGTM: staging/internal upload wrappers. Correct buckets asserted without leaking callback internals.
309-333: LGTM: pull_and_verify download error path. Properly avoids cleanup on download failure.
335-353: LGTM: pull_and_verify success path. No unnecessary cleanup; verifies expected flow.
355-362: LGTM: delete_from_r2 best-effort handling. Matches current core behavior to not raise.
364-376: LGTM: verify_r2_access client creation failure. Produces connection sentinel entry.
378-392: LGTM: access denied path. Message and permissions map look correct.
394-407: LGTM: connection error path. Surfaces connection failure coherently.
409-428: LGTM: read-only permissions path. Partial access messaging is precise.
430-447: LGTM: no permissions path. Covers zero-permission condition.
src/datamanager/core.py (7)
52-63: LGTM: alias resolution delegates to internal helper. Centralizes behavior and supports pass-through.
93-106: LGTM: upload_to_r2 resolves bucket aliases. Correctly honors explicit, aliased, and default buckets.
113-138: LGTM: download_from_r2 resolves bucket aliases and surfaces errors. Head + download with progress and re-raise is appropriate.
140-168: LGTM: pull_and_verify narrows exceptions and cleans up on mismatch. Matches tests and improves error messaging.
236-251: Log resolved bucket and narrow exception handling
Include the resolved bucket in the log; prefer ClientError first, then fall back to a broad catch to preserve current behavior and tests.
Apply this diff:
```diff
 def delete_from_r2(
     client: S3Client, object_key: str, bucket: str | None = None
 ) -> None:
     """Deletes an object from the R2 bucket."""
     target_bucket = _resolve_bucket(bucket)
-    console.print(f"Attempting to delete [yellow]{object_key}[/] from R2...")
+    console.print(
+        f"Attempting to delete [yellow]{object_key}[/] from R2 bucket [cyan]{target_bucket}[/]..."
+    )
     try:
         client.delete_object(Bucket=target_bucket, Key=object_key)
         console.print("✅ Rollback deletion successful.")
-    except Exception as e:
+    except ClientError as e:
         # This is a best-effort cleanup. We notify the user if it fails.
         console.print(
             f"[bold red]Warning:[/] Failed to delete R2 object during rollback: {e}"
         )
         console.print("You may need to manually delete the object from your R2 bucket.")
+    except Exception as e:
+        console.print(
+            f"[bold red]Warning:[/] Unexpected error during R2 deletion: {e}"
+        )
+        console.print("You may need to manually delete the object from your R2 bucket.")
```
315-341: LGTM: verify includes internal bucket. Expanded surface while preserving connection error handling.
360-363: LGTM: upload_to_internal wrapper. Encapsulates alias usage cleanly.
src/datamanager/__main__.py (7)
70-73: LGTM: clearer table headers. Improves UX without behavior change.
100-116: LGTM: dataset list shows bucket context. Nice visual cue with internal/production icons.
334-348: LGTM: v1 object key uses dataset stem, not bucket name. Fixes potential namespace collisions across buckets.
375-388: LGTM: prepare command exposes bucket option. Aligns CLI with internal/production use-cases.
390-425: LGTM: interactive production prepare. Clear prompts and safe confirmation flow.
428-465: LGTM: interactive internal prepare. Good separation of flows and messaging.
775-779: LGTM: TUI adds internal prepare action. Completes the UX for team-only datasets.
Internal Bucket Support for Team-Only Datasets
This PR introduces comprehensive support for internal (team-only) R2 buckets alongside the existing production (public) bucket system.
Datasets stored in the internal bucket are visible via datamanager pull with the right API access, or in the Cloudflare bucket dashboard.
Configuration Required
Add to your .env file:
R2_INTERNAL_BUCKET="your-internal-bucket-name"

Summary by CodeRabbit
New Features
Documentation
Tests
Chores