MDF Connect v2 backend: streaming, curation, DOI minting, and Globus Search#133
Open
New features:
- Streaming API for automated lab data ingestion
  - POST /stream/create, /stream/:id/upload, /stream/:id/close
  - Support for local, Globus HTTPS, and S3 storage backends
  - File preview without full download (CSV stats, JSON structure)
- Server-side curation workflow
  - GET /curation/pending - List submissions awaiting review
  - POST /curation/:id/approve - Approve with DOI minting
  - POST /curation/:id/reject - Reject with reason
  - Full curation history tracking
- DOI minting via DataCite API
  - Mock client for local development
  - Real client for production deployment
- Simplified Globus Flow
  - Removed curation steps (now handled by server)
  - Keeps: email notification, file transfer, user notification
- Deployment tooling
  - deploy.sh for local dev and AWS SAM deployment
  - SAM template for Lambda + API Gateway

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
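As a rough illustration of "file preview without full download", the sketch below summarizes a CSV from only the first chunk of bytes fetched from a storage backend. The function name `preview_csv`, its parameters, and the shape of the returned dict are assumptions made for this example; the PR's actual preview code may work differently.

```python
import csv
import io


def preview_csv(head: bytes, complete: bool = False) -> dict:
    """Summarize a CSV from its first bytes (hypothetical helper).

    `head` is whatever prefix of the file was fetched, e.g. via a ranged read.
    Set `complete=True` when `head` is known to be the whole file.
    """
    text = head.decode("utf-8", errors="replace")
    if not complete and "\n" in text:
        # A partial read usually ends mid-row, so drop the trailing fragment.
        # Note: quoted fields containing newlines are not handled here.
        text = text[: text.rfind("\n") + 1]
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return {"columns": [], "rows_previewed": 0}
    header, body = rows[0], rows[1:]
    return {"columns": header, "rows_previewed": len(body)}
```

Only the rows that fit in the fetched prefix are counted, so `rows_previewed` is a lower bound on the file's row count, not a total.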
…t, and curation

- Add publication orchestration: approval triggers DOI mint + Globus Search ingest + status update
- Fix status transitions: submissions land as pending_curation (not submitted)
- Add data source URL validation on submit (globus://, https://, stream://)
- Wire DataCite credentials into SAM template, samconfig, and deploy script
- Add DataCite test_connection() diagnostic method
- Add GlobusSearchClient + MockGlobusSearchClient with factory pattern
- Update search to try Globus Search first, fallback to DynamoDB scan
- Add Search index UUID params and USE_MOCK_SEARCH to both Lambda functions
- Refactor app into FastAPI router modules with auth, middleware, models
- Add async job system (inline/SQS/SQLite) with publish_submission job type

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
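The data source URL validation mentioned above might look roughly like the following. This is a minimal sketch using only the standard library: the three schemes come from the commit message, while the function name `validate_data_source`, the non-empty host check, and the error messages are assumptions for illustration.

```python
from urllib.parse import urlparse

# Schemes listed in the commit message.
ALLOWED_SCHEMES = {"globus", "https", "stream"}


def validate_data_source(url: str) -> str:
    """Return the URL unchanged if it looks acceptable, else raise ValueError.

    Hypothetical helper: the real validator may apply stricter rules.
    """
    parsed = urlparse(url)
    if parsed.scheme not in ALLOWED_SCHEMES:
        raise ValueError(
            f"unsupported data source scheme: {parsed.scheme or '(none)'}"
        )
    if not parsed.netloc:
        # Assumption: every source needs a host, endpoint ID, or stream ID.
        raise ValueError("data source URL is missing a host or identifier")
    return url
```

Rejecting plain `http://` and scheme-less paths at submit time means a bad source fails fast, before the submission reaches `pending_curation`.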
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add test_v2_publish_pipeline.py: 19 tests covering status transitions, data source validation, inline/async publish pipeline, and mock search client
- Fix test_v2_async_jobs.py: update doi_job → publish_job, remove stale status/update call (submissions now land as pending_curation directly)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Two bugs prevented search ingest from working:
1. oauth2_client_credentials_tokens() was called without requested_scopes, so no search.api.globus.org token was obtained (access_token was None)
2. License Pydantic model was passed directly into GMetaEntry content, causing JSON serialization failure in the Globus SDK

Also: clean up samconfig.toml duplicates, add search index UUIDs, add configurable CORS origins for prod, fix two pre-existing test bugs, and add operational scripts (search index permissions, DataCite SSM setup, search token diagnostics).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…c DOIs
- Add dataset_doi field to SQLite schema (store.py) with migration
- Propagate dataset_doi from prior published versions on update submissions
- Refactor _mint_doi_for_submission for version-aware DOI logic:
  - First version: mint dataset DOI (stored as both doi and dataset_doi)
  - Subsequent + mint_doi=False: inherit dataset_doi, update DataCite metadata
  - Subsequent + mint_doi=True: mint version DOI with -v{ver} suffix,
    add IsVersionOf/HasVersion relatedIdentifiers
- Add related_identifiers support and update_metadata() to DataCiteClient
- Add dataset_doi and version_count to Globus Search GMetaEntry
  - dc.doi falls back to dataset_doi when no version-specific DOI
- 13 new tests covering full lifecycle, DOI propagation, search index,
  mock DataCite, and curation logic
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
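The three version-aware cases above can be sketched as a small pure function. This is not the code in `_mint_doi_for_submission`: the names `DoiPlan`, `plan_doi`, and `indexed_doi`, the action labels, the integer version in the suffix, and placing `IsVersionOf` on the version record are all assumptions for illustration. The `relatedIdentifier`, `relatedIdentifierType`, and `relationType` keys follow the DataCite metadata schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DoiPlan:
    action: str                  # "mint_dataset", "update_metadata", or "mint_version"
    doi: Optional[str]           # version-specific DOI, if one is minted
    dataset_doi: str             # DOI shared by every version of the dataset
    related_identifiers: List[dict] = field(default_factory=list)


def plan_doi(base_doi: str, version: int,
             prior_dataset_doi: Optional[str], mint_doi: bool) -> DoiPlan:
    """Decide what to do with DOIs for one submission (hypothetical helper)."""
    if prior_dataset_doi is None:
        # First version: the new DOI serves as both doi and dataset_doi.
        return DoiPlan("mint_dataset", base_doi, base_doi)
    if not mint_doi:
        # Later version without its own DOI: inherit and refresh metadata.
        return DoiPlan("update_metadata", None, prior_dataset_doi)
    # Later version with its own DOI, linked back to the dataset DOI.
    version_doi = f"{prior_dataset_doi}-v{version}"
    related = [{
        "relatedIdentifier": prior_dataset_doi,
        "relatedIdentifierType": "DOI",
        "relationType": "IsVersionOf",
    }]
    return DoiPlan("mint_version", version_doi, prior_dataset_doi, related)


def indexed_doi(plan: DoiPlan) -> str:
    """Value for dc.doi: the version DOI, falling back to dataset_doi."""
    return plan.doi or plan.dataset_doi
```

In the `mint_version` case the dataset-level record would also need a matching `HasVersion` entry pointing at the new version DOI; that update is omitted here to keep the sketch short.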
Support domains (List[str]) for scientific domain categorization and external_doi/external_url/external_source for tracking provenance of externally-published datasets imported into MDF. Fields round-trip through submit → status and are indexed in Globus Search. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ignore

Set up prod samconfig.toml with test DataCite credentials (Globus.TEST) and test Globus Search index for initial production stack deployment. Add .aws-sam/ and .DS_Store to .gitignore.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Cover dev/staging/prod deployment, SSM parameters, switching to real credentials, quick deploy, local dev, tests, and architecture overview. Note that prod currently uses test credentials. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
Infrastructure
- template.yaml
- deploy.sh
- samconfig.toml

Verified on staging
Health check, stream CRUD, file upload via Globus HTTPS, snapshot, repo publish, DOI minting, Globus Search ingest, dataset cards, citations — all passing E2E.
Test plan
- `cd aws && python -m pytest v2/test_v2_*.py -v`: all unit/integration tests pass
- `./deploy.sh staging`: staging stack deploys cleanly
- `./deploy.sh prod`: prod stack deploys with test credentials

🤖 Generated with Claude Code