MDF Connect v2 backend: streaming, curation, DOI minting, and Globus Search#133

Open
blaiszik wants to merge 9 commits into prod from v2-backend-curation
@blaiszik

Summary

  • Complete v2 backend built on FastAPI + Mangum (single Lambda), deployed via SAM to AWS
  • Full publication pipeline: submit → pending_curation → approved (DOI minted) → published (Globus Search indexed)
  • Streaming endpoints: create, append (file upload to Globus HTTPS), close, snapshot (stream → dataset)
  • Dataset versioning with DOI inheritance and version-specific DOIs
  • Curation workflow with approve/reject, curator guards, and ownership enforcement
  • Real DataCite DOI minting (test credentials) and Globus Search ingest (test index)
  • Security hardening: path traversal protection, ownership checks, rate limiting, input size limits, structured logging
  • Cost-optimized: right-sized Lambda/concurrency, bounded log retention, capped search scans
  • Async job dispatch via SQS with inline/SQLite modes for testing
  • Comprehensive test suites: hardening, integration, publish pipeline, versioning, async jobs
  • SAM template with dev/staging/prod configs, deploy.sh with quick-deploy and teardown
  • Production config uses test credentials for initial deployment
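
The publication pipeline above reads as a small state machine. The following is a minimal sketch of how such transitions could be guarded, not the PR's actual code: the `Status` enum, the `ALLOWED` table, the `advance` helper, and the `rejected` status name are all assumptions made for illustration. Only `pending_curation`, `approved`, and `published` appear in the description.

```python
from enum import Enum


class Status(str, Enum):
    PENDING_CURATION = "pending_curation"
    APPROVED = "approved"      # DOI minted at this step, per the summary
    PUBLISHED = "published"    # Globus Search indexed at this step
    REJECTED = "rejected"      # hypothetical name for the reject outcome


# Allowed forward transitions; anything not listed here is refused.
ALLOWED = {
    Status.PENDING_CURATION: {Status.APPROVED, Status.REJECTED},
    Status.APPROVED: {Status.PUBLISHED},
    Status.PUBLISHED: set(),
    Status.REJECTED: set(),
}


def advance(current: Status, target: Status) -> Status:
    """Return target if the transition is allowed, otherwise raise ValueError."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target
```

With this table, a submission cannot jump from `pending_curation` straight to `published`; it has to pass through `approved` first.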

Infrastructure

| Resource | Description |
| --- | --- |
| template.yaml | SAM template: Lambda, API Gateway, DynamoDB, SQS, S3, CloudWatch |
| deploy.sh | Deploy script: dev, staging, prod, quick, local, teardown, logs |
| samconfig.toml | Per-environment config (dev, staging, prod) |

Verified on staging

Health check, stream CRUD, file upload via Globus HTTPS, snapshot, repo publish, DOI minting, Globus Search ingest, dataset cards, citations — all passing E2E.

Test plan

  • cd aws && python -m pytest v2/test_v2_*.py -v — all unit/integration tests pass
  • ./deploy.sh staging — staging stack deploys cleanly
  • ./deploy.sh prod — prod stack deploys with test credentials
  • Verify health endpoint on prod API URL
  • Submit test dataset through prod and verify DOI + search ingest

🤖 Generated with Claude Code

blaiszik and others added 9 commits January 31, 2026 22:52
New features:
- Streaming API for automated lab data ingestion
  - POST /stream/create, /stream/:id/upload, /stream/:id/close
  - Support for local, Globus HTTPS, and S3 storage backends
  - File preview without full download (CSV stats, JSON structure)

- Server-side curation workflow
  - GET /curation/pending - List submissions awaiting review
  - POST /curation/:id/approve - Approve with DOI minting
  - POST /curation/:id/reject - Reject with reason
  - Full curation history tracking

- DOI minting via DataCite API
  - Mock client for local development
  - Real client for production deployment

- Simplified Globus Flow
  - Removed curation steps (now handled by server)
  - Keeps: email notification, file transfer, user notification

- Deployment tooling
  - deploy.sh for local dev and AWS SAM deployment
  - SAM template for Lambda + API Gateway
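
The mock/real DOI client split described above can be sketched as a small factory. This is an illustration under stated assumptions, not the code in this commit: the class names, the `mint` method, and the environment variable names `USE_MOCK_DATACITE`, `DATACITE_USERNAME`, and `DATACITE_PASSWORD` are all hypothetical, and the real client's network call is deliberately left out.

```python
import os
from typing import Mapping, Optional, Protocol


class DoiClient(Protocol):
    def mint(self, metadata: dict) -> str: ...


class MockDoiClient:
    """Returns predictable fake DOIs and makes no network calls."""

    def __init__(self, prefix: str = "10.5555") -> None:
        self.prefix = prefix
        self._count = 0

    def mint(self, metadata: dict) -> str:
        self._count += 1
        return f"{self.prefix}/mock-{self._count}"


class RealDoiClient:
    """Placeholder for a client that would talk to the DataCite API."""

    def __init__(self, username: str, password: str) -> None:
        self.username = username
        self.password = password

    def mint(self, metadata: dict) -> str:
        raise NotImplementedError("network call omitted in this sketch")


def get_doi_client(env: Optional[Mapping[str, str]] = None) -> DoiClient:
    """Pick the mock client unless the (hypothetical) flag is set to false."""
    env = os.environ if env is None else env
    if env.get("USE_MOCK_DATACITE", "true").lower() == "true":
        return MockDoiClient()
    return RealDoiClient(env["DATACITE_USERNAME"], env["DATACITE_PASSWORD"])
```

Defaulting to the mock keeps local runs from minting anything by accident; a deployment has to opt in to the real client explicitly.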

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…t, and curation

- Add publication orchestration: approval triggers DOI mint + Globus Search ingest + status update
- Fix status transitions: submissions land as pending_curation (not submitted)
- Add data source URL validation on submit (globus://, https://, stream://)
- Wire DataCite credentials into SAM template, samconfig, and deploy script
- Add DataCite test_connection() diagnostic method
- Add GlobusSearchClient + MockGlobusSearchClient with factory pattern
- Update search to try Globus Search first, then fall back to a DynamoDB scan
- Add Search index UUID params and USE_MOCK_SEARCH to both Lambda functions
- Refactor app into FastAPI router modules with auth, middleware, models
- Add async job system (inline/SQS/SQLite) with publish_submission job type
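
The data source check listed above could look roughly like the sketch below. It assumes validation is a scheme allow-list plus a non-empty location. The function name `validate_data_source` and the exact rejection rules are illustrative and may differ from the PR's implementation.

```python
from urllib.parse import urlparse

# Schemes named in the commit message: globus://, https://, stream://
ALLOWED_SCHEMES = {"globus", "https", "stream"}


def validate_data_source(url: str) -> str:
    """Return url unchanged if it looks acceptable, otherwise raise ValueError."""
    parsed = urlparse(url)
    if parsed.scheme not in ALLOWED_SCHEMES:
        raise ValueError(f"unsupported data source scheme: {parsed.scheme!r}")
    # Require something after the scheme (an endpoint, host, or stream ID).
    if not (parsed.netloc or parsed.path.strip("/")):
        raise ValueError("data source URL has no location")
    return url
```

For example, `validate_data_source("stream://abc123")` returns the URL unchanged, while `http://` URLs and bare filesystem paths raise `ValueError`.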

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add test_v2_publish_pipeline.py: 19 tests covering status transitions,
  data source validation, inline/async publish pipeline, and mock search client
- Fix test_v2_async_jobs.py: update doi_job → publish_job, remove stale
  status/update call (submissions now land as pending_curation directly)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Two bugs prevented search ingest from working:
1. oauth2_client_credentials_tokens() was called without requested_scopes,
   so no search.api.globus.org token was obtained (access_token was None)
2. License Pydantic model was passed directly into GMetaEntry content,
   causing JSON serialization failure in the Globus SDK

Also: clean up samconfig.toml duplicates, add search index UUIDs, add
configurable CORS origins for prod, fix two pre-existing test bugs, and
add operational scripts (search index permissions, DataCite SSM setup,
search token diagnostics).
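
The second bug is a common pitfall: `json.dumps` cannot serialize a model object nested inside a dict. One way to guard against it is to convert models to plain dicts before building the payload, as in the sketch below. To stay dependency-free, `LicenseStandIn` is a hypothetical stand-in for the `License` Pydantic model that only mimics the Pydantic v2 `model_dump()` method name, and `to_jsonable` is an illustrative helper, not the fix applied in this commit.

```python
import json
from typing import Any


class LicenseStandIn:
    """Hypothetical stand-in for the License Pydantic model (stdlib only)."""

    def __init__(self, name: str, url: str) -> None:
        self.name = name
        self.url = url

    def model_dump(self) -> dict:
        # Mirrors the Pydantic v2 method name.
        return {"name": self.name, "url": self.url}


def to_jsonable(value: Any) -> Any:
    """Recursively replace model-like objects with plain dicts."""
    if hasattr(value, "model_dump"):
        return to_jsonable(value.model_dump())
    if isinstance(value, dict):
        return {key: to_jsonable(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_jsonable(item) for item in value]
    return value


content = {
    "title": "Example dataset",
    "license": LicenseStandIn("CC-BY-4.0", "https://creativecommons.org/licenses/by/4.0/"),
}
# json.dumps(content) would raise TypeError; the converted form serializes.
payload = json.dumps(to_jsonable(content))
```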

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…c DOIs

- Add dataset_doi field to SQLite schema (store.py) with migration
- Propagate dataset_doi from prior published versions on update submissions
- Refactor _mint_doi_for_submission for version-aware DOI logic:
  - First version: mint dataset DOI (stored as both doi and dataset_doi)
  - Subsequent + mint_doi=False: inherit dataset_doi, update DataCite metadata
  - Subsequent + mint_doi=True: mint version DOI with -v{ver} suffix,
    add IsVersionOf/HasVersion relatedIdentifiers
- Add related_identifiers support and update_metadata() to DataCiteClient
- Add dataset_doi and version_count to Globus Search GMetaEntry
- dc.doi falls back to dataset_doi when no version-specific DOI
- 13 new tests covering full lifecycle, DOI propagation, search index,
  mock DataCite, and curation logic
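
The three DOI cases above can be summarized as a small decision function. This is a sketch of the branching only, under assumptions: `DoiPlan`, `plan_doi`, and the action labels are hypothetical names, and appending the `-v{ver}` suffix to the dataset DOI is a guess at how the version DOI is formed. Minting and DataCite metadata updates are not modeled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DoiPlan:
    action: str                # "mint_dataset", "inherit", or "mint_version"
    dataset_doi: str           # DOI shared by every version of the dataset
    doi: Optional[str] = None  # DOI stored on this specific version, if any

    @property
    def effective_doi(self) -> str:
        # Falls back to dataset_doi when there is no version-specific DOI.
        return self.doi or self.dataset_doi


def plan_doi(
    prior_dataset_doi: Optional[str],
    version: str,
    mint_doi: bool,
    fresh_doi: str,
) -> DoiPlan:
    """Choose a DOI action; fresh_doi is only used for a first version."""
    if prior_dataset_doi is None:
        # First version: one DOI stored as both doi and dataset_doi.
        return DoiPlan("mint_dataset", dataset_doi=fresh_doi, doi=fresh_doi)
    if not mint_doi:
        # Later version without its own DOI: inherit the dataset DOI.
        return DoiPlan("inherit", dataset_doi=prior_dataset_doi)
    # Later version with its own DOI: assumed to be dataset DOI plus suffix.
    return DoiPlan(
        "mint_version",
        dataset_doi=prior_dataset_doi,
        doi=f"{prior_dataset_doi}-v{version}",
    )
```

For instance, `plan_doi("10.5555/example", "2", True, "unused")` yields a plan whose `doi` is `10.5555/example-v2`, while the same call with `mint_doi=False` leaves `doi` unset and `effective_doi` resolves to the dataset DOI.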

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Support domains (List[str]) for scientific domain categorization and
external_doi/external_url/external_source for tracking provenance of
externally-published datasets imported into MDF. Fields round-trip
through submit → status and are indexed in Globus Search.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ignore

Set up prod samconfig.toml with test DataCite credentials (Globus.TEST)
and test Globus Search index for initial production stack deployment.
Add .aws-sam/ and .DS_Store to .gitignore.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Cover dev/staging/prod deployment, SSM parameters, switching to
real credentials, quick deploy, local dev, tests, and architecture
overview. Note that prod currently uses test credentials.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>