Commit b9b6e34

Simplify to two-layer architecture: database types + AttributeTypes

- Remove "core types" concept - all storage types are now AttributeTypes
- Built-in AttributeTypes (object, content, filepath@store) use json dtype
- JSON stores metadata: path, hash, store name, size, etc.
- User-defined AttributeTypes can compose built-in ones (e.g., <xblob> uses content)
- Clearer separation: database types (json, longblob) vs AttributeTypes (encode/decode)

Co-authored-by: dimitri-yatsenko <dimitri@datajoint.com>

1 parent 43c1999 commit b9b6e34

1 file changed

+137
-89
lines changed

docs/src/design/tables/storage-types-spec.md

Lines changed: 137 additions & 89 deletions
@@ -2,11 +2,14 @@
22

33
## Overview
44

5-
This document defines a layered storage architecture:
5+
This document defines a two-layer storage architecture:
66

7-
1. **Database types**: `longblob`, `varchar`, `int`, `json`, etc.
8-
2. **Core DataJoint types**: `object`, `content`, `filepath`, `json` (and `@store` variants where applicable)
9-
3. **AttributeTypes**: `<djblob>`, `<xblob>`, `<attach>`, etc. (built on top of core types)
7+
1. **Database types**: `longblob`, `varchar`, `int`, `json`, etc. (MySQL/PostgreSQL native)
8+
2. **AttributeTypes**: Custom types with `encode()`/`decode()` semantics
9+
10+
All DataJoint storage types (`object`, `content`, `filepath@store`, `<djblob>`, etc.) are
11+
implemented as **AttributeTypes**. Some are built-in (auto-registered, use `dj.config` for stores)
12+
while others are user-defined.
1013

1114
### OAS Storage Regions
1215

@@ -20,17 +23,21 @@ This document defines a layered storage architecture:
2023
`filepath@store` provides portable relative paths within configured stores with lazy ObjectRef access.
2124
For arbitrary URLs that don't need ObjectRef semantics, use `varchar` instead.
2225

23-
## Core Types
26+
## Built-in AttributeTypes
27+
28+
Built-in types are auto-registered and use `dj.config['stores']` for store configuration.
29+
They use `json` as their database dtype to store metadata.
2430

2531
### `object` / `object@store` - Path-Addressed Storage
2632

27-
**Already implemented.** OAS (Object-Augmented Schema) storage:
33+
**Built-in AttributeType.** OAS (Object-Augmented Schema) storage:
2834

2935
- Path derived from primary key: `{schema}/{table}/{pk}/{attribute}/`
3036
- One-to-one relationship with table row
3137
- Deleted when row is deleted
3238
- Returns `ObjectRef` for lazy access
3339
- Supports direct writes (Zarr, HDF5) via fsspec
40+
- **dtype**: `json` (stores path, store name, metadata)
3441

3542
```python
3643
class Analysis(dj.Computed):
@@ -42,9 +49,34 @@ class Analysis(dj.Computed):
4249
"""
4350
```
4451

52+
#### Implementation
53+
54+
```python
55+
class ObjectType(AttributeType):
56+
"""Built-in AttributeType for path-addressed OAS storage."""
57+
type_name = "object"
58+
dtype = "json"
59+
60+
def encode(self, value, *, key=None, store_name=None) -> dict:
61+
store = get_store(store_name or dj.config['stores']['default'])
62+
path = self._compute_path(key) # {schema}/{table}/{pk}/{attr}/
63+
store.put(path, value)
64+
return {
65+
"path": path,
66+
"store": store_name,
67+
# Additional metadata (size, timestamps, etc.)
68+
}
69+
70+
def decode(self, stored: dict, *, key=None) -> ObjectRef:
71+
return ObjectRef(
72+
store=get_store(stored["store"]),
73+
path=stored["path"]
74+
)
75+
```
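
The path layout `{schema}/{table}/{pk}/{attribute}/` that `_compute_path` is described as producing can be illustrated with a minimal sketch. The function name `compute_object_path` is hypothetical, and the way the primary key is rendered (sorted `name=value` segments, URL-quoted) is an assumption; the spec only fixes the overall layout.

```python
from urllib.parse import quote


def compute_object_path(schema: str, table: str, key: dict, attribute: str) -> str:
    """Build '{schema}/{table}/{pk}/{attribute}/' from a primary-key dict.

    Assumption: the pk part is rendered as 'name=value' segments in sorted
    attribute order, so the same row always maps to the same path.
    """
    pk = "/".join(
        # Quote values so a '/' inside a key value cannot add path levels.
        f"{name}={quote(str(key[name]), safe='')}"
        for name in sorted(key)
    )
    return f"{schema}/{table}/{pk}/{attribute}/"


path = compute_object_path(
    "lab", "analysis", {"subject_id": 7, "session": 2}, "results"
)
```

Because the path depends only on the schema, table, primary key, and attribute name, each row has exactly one location per `object` attribute, which is what makes the one-to-one relationship and delete-with-row behavior possible.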
76+
4577
### `content` / `content@store` - Content-Addressed Storage
4678

47-
**New core type.** Content-addressed storage with deduplication:
79+
**Built-in AttributeType.** Content-addressed storage with deduplication:
4880

4981
- **Single blob only**: stores a single file or serialized object (not folders)
5082
- **Per-project scope**: content is shared across all schemas in a project (not per-schema)
@@ -53,6 +85,7 @@ class Analysis(dj.Computed):
5385
- Reference counted for garbage collection
5486
- Deduplication: identical content stored once across the entire project
5587
- For folders/complex objects, use `object` type instead
88+
- **dtype**: `json` (stores hash, store name, size, metadata)
5689

5790
```
5891
store_root/
@@ -63,58 +96,63 @@ store_root/
6396
└── {hash[:2]}/{hash[2:4]}/{hash}
6497
```
6598

66-
#### Content Type Behavior
67-
68-
The `content` core type:
69-
- Accepts `bytes` on insert
70-
- Computes SHA256 hash of the content
71-
- Stores in `_content/{hash}/` if not already present (deduplication)
72-
- Returns `bytes` on fetch (transparent retrieval)
73-
- Registers in `ContentRegistry` for GC tracking
99+
#### Implementation
74100

75101
```python
76-
# Core type behavior (built-in, not an AttributeType)
77-
class ContentType:
78-
"""Core content-addressed storage type."""
102+
class ContentType(AttributeType):
103+
"""Built-in AttributeType for content-addressed storage."""
104+
type_name = "content"
105+
dtype = "json"
79106

80-
def store(self, data: bytes, store_backend) -> str:
81-
"""Store content, return hash."""
107+
def encode(self, data: bytes, *, key=None, store_name=None) -> dict:
108+
"""Store content, return metadata as JSON."""
82109
content_hash = hashlib.sha256(data).hexdigest()
110+
store = get_store(store_name or dj.config['stores']['default'])
83111
path = f"_content/{content_hash[:2]}/{content_hash[2:4]}/{content_hash}"
84112

85-
if not store_backend.exists(path):
86-
store_backend.put(path, data)
113+
if not store.exists(path):
114+
store.put(path, data)
87115
ContentRegistry().insert1({
88116
'content_hash': content_hash,
89-
'store': store_backend.name,
117+
'store': store_name,
90118
'size': len(data)
91-
})
119+
}, skip_duplicates=True)
92120

93-
return content_hash
121+
return {
122+
"hash": content_hash,
123+
"store": store_name,
124+
"size": len(data)
125+
}
94126

95-
def retrieve(self, content_hash: str, store_backend) -> bytes:
127+
def decode(self, stored: dict, *, key=None) -> bytes:
96128
"""Retrieve content by hash."""
97-
path = f"_content/{content_hash[:2]}/{content_hash[2:4]}/{content_hash}"
98-
return store_backend.get(path)
129+
store = get_store(stored["store"])
130+
path = f"_content/{stored['hash'][:2]}/{stored['hash'][2:4]}/{stored['hash']}"
131+
return store.get(path)
99132
```
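
The deduplication behavior can be tried out without DataJoint. The following is a minimal sketch, assuming an in-memory stand-in for the store backend (`MemoryStore`, `put_content`, and `get_content` are hypothetical names, and the `ContentRegistry` bookkeeping is left out). It uses the same SHA256 hash and `_content/{hash[:2]}/{hash[2:4]}/{hash}` layout as the implementation above.

```python
import hashlib


class MemoryStore:
    """Hypothetical stand-in for a store backend; keeps blobs in a dict."""

    def __init__(self):
        self.blobs = {}
        self.writes = 0  # counts physical writes, to make dedup visible

    def exists(self, path: str) -> bool:
        return path in self.blobs

    def put(self, path: str, data: bytes) -> None:
        self.blobs[path] = data
        self.writes += 1

    def get(self, path: str) -> bytes:
        return self.blobs[path]


def content_path(content_hash: str) -> str:
    """Two levels of hash-prefix directories, then the full hash."""
    return f"_content/{content_hash[:2]}/{content_hash[2:4]}/{content_hash}"


def put_content(store: MemoryStore, data: bytes) -> dict:
    """Write data only if its hash is not already present; return metadata."""
    content_hash = hashlib.sha256(data).hexdigest()
    path = content_path(content_hash)
    if not store.exists(path):
        store.put(path, data)
    return {"hash": content_hash, "size": len(data)}


def get_content(store: MemoryStore, stored: dict) -> bytes:
    """Look the blob up again from the hash kept in the metadata."""
    return store.get(content_path(stored["hash"]))


store = MemoryStore()
first = put_content(store, b"abc")
second = put_content(store, b"abc")  # identical bytes: no second write
```

Inserting the same bytes twice yields identical metadata and a single physical write, which is the property the reference-counting scheme later relies on.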
100133

101134
#### Database Column
102135

103-
The `content` type stores a `char(64)` hash in the database:
136+
The `content` type stores JSON metadata:
104137

105138
```sql
106-
-- content column
107-
features CHAR(64) NOT NULL -- SHA256 hex hash
139+
-- content column (MySQL)
140+
features JSON NOT NULL
141+
-- Contains: {"hash": "abc123...", "store": "main", "size": 12345}
142+
143+
-- content column (PostgreSQL)
144+
features JSONB NOT NULL
108145
```
109146

110147
### `filepath@store` - Portable External Reference
111148

112-
**Upgraded from legacy.** Relative path references within configured stores:
149+
**Built-in AttributeType.** Relative path references within configured stores:
113150

114151
- **Relative paths**: paths within a configured store (portable across environments)
115152
- **Store-aware**: resolves paths against configured store backend
116153
- Returns `ObjectRef` for lazy access via fsspec
117154
- Stores optional checksum for verification
155+
- **dtype**: `json` (stores path, store name, checksum, metadata)
118156

119157
**Key benefit**: Portability. The path is relative to the store, so pipelines can be moved
120158
between environments (dev → prod, cloud → local) by changing store configuration without
@@ -154,42 +192,43 @@ ref.open() # fsspec streaming access
154192
For arbitrary URLs (S3, HTTP, etc.) where you don't need ObjectRef semantics,
155193
just use `varchar`. A string is simpler and more transparent.
156194

157-
#### Filepath Type Behavior
195+
#### Implementation
158196

159197
```python
160-
# Core type behavior
161-
class FilepathType:
162-
"""Core external reference type with store-relative paths."""
198+
class FilepathType(AttributeType):
199+
"""Built-in AttributeType for store-relative file references."""
200+
type_name = "filepath"
201+
dtype = "json"
163202

164-
def store(self, relative_path: str, store_backend, compute_checksum: bool = False) -> dict:
203+
def encode(self, relative_path: str, *, key=None, store_name=None,
204+
compute_checksum: bool = False) -> dict:
165205
"""Register reference to file in store."""
166-
metadata = {'path': relative_path}
206+
store = get_store(store_name) # store_name required for filepath
207+
metadata = {'path': relative_path, 'store': store_name}
167208

168209
if compute_checksum:
169-
full_path = store_backend.resolve(relative_path)
170-
if store_backend.exists(full_path):
171-
metadata['checksum'] = compute_file_checksum(store_backend, full_path)
172-
metadata['size'] = store_backend.size(full_path)
210+
full_path = store.resolve(relative_path)
211+
if store.exists(full_path):
212+
metadata['checksum'] = compute_file_checksum(store, full_path)
213+
metadata['size'] = store.size(full_path)
173214

174215
return metadata
175216

176-
def retrieve(self, metadata: dict, store_backend) -> ObjectRef:
217+
def decode(self, stored: dict, *, key=None) -> ObjectRef:
177218
"""Return ObjectRef for lazy access."""
178219
return ObjectRef(
179-
store=store_backend,
180-
path=metadata['path'],
181-
checksum=metadata.get('checksum') # optional verification
220+
store=get_store(stored['store']),
221+
path=stored['path'],
222+
checksum=stored.get('checksum') # optional verification
182223
)
183224
```
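
The helper `compute_file_checksum` is not defined in this spec. As a rough sketch of what the optional verification could look like, the code below hashes a binary file object in chunks and compares the result with the stored metadata. The function names and the choice of SHA256 are assumptions; the spec does not name the checksum algorithm for `filepath`.

```python
import hashlib
import io


def compute_stream_checksum(fileobj, chunk_size: int = 1 << 20) -> str:
    """Hash a binary file object in chunks so large files are never loaded whole.

    Assumption: SHA256 hex digest.
    """
    digest = hashlib.sha256()
    for chunk in iter(lambda: fileobj.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def verify_checksum(fileobj, stored: dict) -> bool:
    """Return True if the data matches, or if no checksum was recorded."""
    expected = stored.get("checksum")
    if expected is None:
        return True  # the checksum is optional in the stored metadata
    return compute_stream_checksum(fileobj) == expected


data = b"example recording bytes"
stored = {
    "path": "experiment_001/data.nwb",
    "store": "main",
    "checksum": compute_stream_checksum(io.BytesIO(data)),
}
ok = verify_checksum(io.BytesIO(data), stored)
tampered = verify_checksum(io.BytesIO(data + b"!"), stored)
```

Any object with a binary `read()` method works here, so the same function could be applied to a file handle opened through fsspec.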
184225

185226
#### Database Column
186227

187-
The `filepath` type uses the `json` core type:
188-
189228
```sql
190229
-- filepath column (MySQL)
191230
recording JSON NOT NULL
192-
-- Contains: {"path": "experiment_001/data.nwb", "checksum": "...", "size": ...}
231+
-- Contains: {"path": "experiment_001/data.nwb", "store": "main", "checksum": "...", "size": ...}
193232

194233
-- filepath column (PostgreSQL)
195234
recording JSONB NOT NULL
@@ -205,49 +244,52 @@ recording JSONB NOT NULL
205244
| Paths | Relative | Relative (unchanged) |
206245
| Store param | Required (`@store`) | Required (`@store`) |
207246

247+
## Database Types
248+
208249
### `json` - Cross-Database JSON Type
209250

210-
**New core type.** JSON storage compatible across MySQL and PostgreSQL:
251+
JSON storage compatible across MySQL and PostgreSQL:
211252

212253
```sql
213254
-- MySQL
214255
column_name JSON NOT NULL
215256

216-
-- PostgreSQL
257+
-- PostgreSQL (uses JSONB for better indexing)
217258
column_name JSONB NOT NULL
218259
```
219260

220-
The `json` core type:
261+
The `json` database type:
262+
- Used as dtype by built-in AttributeTypes (`object`, `content`, `filepath@store`)
221263
- Stores arbitrary JSON-serializable data
222264
- Automatically uses appropriate type for database backend
223265
- Supports JSON path queries where available
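
Choosing the column type per backend amounts to a small lookup. This is only a sketch: the function name `json_column_type` and the backend identifiers `"mysql"` and `"postgresql"` are assumptions, not part of the DataJoint API.

```python
def json_column_type(backend: str) -> str:
    """Map a backend name to the native column type used for `json`."""
    types = {"mysql": "JSON", "postgresql": "JSONB"}
    try:
        return types[backend.lower()]
    except KeyError:
        raise ValueError(f"unsupported backend: {backend!r}") from None


mysql_type = json_column_type("mysql")
postgres_type = json_column_type("postgresql")
```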
224266

225267
## Parameterized AttributeTypes
226268

227-
AttributeTypes can be parameterized with `<type@param>` syntax. The parameter is passed
228-
through to the underlying dtype:
269+
AttributeTypes can be parameterized with `<type@param>` syntax. The parameter specifies
270+
which store to use:
229271

230272
```python
231273
class AttributeType:
232-
type_name: str # Name used in <brackets>
233-
dtype: str # Base underlying type
274+
type_name: str # Name used in <brackets> or as bare type
275+
dtype: str # Database type or built-in AttributeType
234276

235-
# When user writes <type_name@param>, resolved dtype becomes:
236-
# f"{dtype}@{param}" if param specified, else dtype
277+
# When user writes type_name@param, resolved store becomes param
237278
```
238279

239280
**Resolution examples:**
240281
```
241-
<xblob> → dtype = "content" → default store
242-
<xblob@cold> → dtype = "content@cold" → cold store
243-
<djblob> → dtype = "longblob" → database
244-
<djblob@x> → ERROR: longblob doesn't support parameters
282+
<xblob> → uses content type → default store
283+
<xblob@cold> → uses content type → cold store
284+
<djblob> → dtype = "longblob" → database (no store)
285+
object@cold → uses object type → cold store
245286
```
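
The resolution examples can be reproduced with a small parser. This is a sketch under assumptions: `parse_type_spec` is a hypothetical name, it only splits the type name from the store parameter, and it does not check whether the named type accepts a store at all.

```python
from typing import Optional, Tuple


def parse_type_spec(spec: str) -> Tuple[str, Optional[str]]:
    """Split '<name@store>' or 'name@store' into (type_name, store).

    store is None when no '@param' is given, meaning the default store
    (or no store at all for database-backed types such as <djblob>).
    """
    spec = spec.strip()
    if spec.startswith("<") and spec.endswith(">"):
        spec = spec[1:-1]
    elif spec.startswith("<") or spec.endswith(">"):
        raise ValueError(f"unbalanced brackets in type spec: {spec!r}")

    name, sep, store = spec.partition("@")
    if not name or (sep and not store):
        raise ValueError(f"malformed type spec: {spec!r}")
    return name, (store or None)


examples = {
    spec: parse_type_spec(spec)
    for spec in ("<xblob>", "<xblob@cold>", "<djblob>", "object@cold")
}
```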
246287

247-
This means `<xblob>` and `<xblob@store>` share the same AttributeType class - the
248-
parameter flows through to the core type, which validates whether it supports `@store`.
288+
AttributeTypes can use other AttributeTypes as their dtype (composition):
289+
- `<xblob>` uses `content` - adds djblob serialization on top of content-addressed storage
290+
- `<xattach>` uses `content` - adds filename preservation on top of content-addressed storage
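
Composition can be pictured as one type delegating to another. The sketch below is illustrative only: `ContentSketch` and `XBlobSketch` are hypothetical classes, the blobs live in a dict rather than a store, and `pickle` stands in for the djblob serialization, which is not specified here.

```python
import hashlib
import pickle


class ContentSketch:
    """Toy content-addressed layer: bytes in, JSON-like metadata out."""

    def __init__(self):
        self._blobs = {}

    def encode(self, data: bytes) -> dict:
        content_hash = hashlib.sha256(data).hexdigest()
        self._blobs.setdefault(content_hash, data)  # store each hash once
        return {"hash": content_hash, "size": len(data)}

    def decode(self, stored: dict) -> bytes:
        return self._blobs[stored["hash"]]


class XBlobSketch:
    """Toy outer layer: serializes a Python object, then delegates."""

    def __init__(self, inner: ContentSketch):
        self.inner = inner

    def encode(self, value) -> dict:
        # Assumption: pickle stands in for the real djblob format.
        return self.inner.encode(pickle.dumps(value))

    def decode(self, stored: dict):
        return pickle.loads(self.inner.decode(stored))


content = ContentSketch()
xblob = XBlobSketch(content)
stored = xblob.encode({"a": [1, 2, 3]})
restored = xblob.decode(stored)
```

The outer type never touches the store directly; it only converts between Python objects and bytes, so deduplication and garbage collection remain the responsibility of the inner `content` layer.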
249291

250-
## AttributeTypes (Built on Core Types)
292+
## User-Defined AttributeTypes
251293

252294
### `<djblob>` - Internal Serialized Blob
253295

@@ -364,31 +406,35 @@ class Attachments(dj.Manual):
364406
```
365407
┌───────────────────────────────────────────────────────────────────┐
366408
│ AttributeTypes │
367-
│ <djblob> <xblob> <attach> <xattach> <custom> │
409+
│ │
410+
│ Built-in: object content filepath@s │
411+
│ User: <djblob> <xblob> <attach> <xattach> <custom> │
368412
├───────────────────────────────────────────────────────────────────┤
369-
│ Core DataJoint Types │
370-
│ longblob content object filepath@s json │
371-
│ content@s object@s │
372-
├───────────────────────────────────────────────────────────────────┤
373-
│ Database Types │
374-
│ LONGBLOB CHAR(64) JSON JSON/JSONB VARCHAR etc. │
375-
│ (MySQL) (PostgreSQL) │
413+
│ Database Types (dtype) │
414+
│ │
415+
│ LONGBLOB JSON/JSONB VARCHAR INT etc. │
376416
└───────────────────────────────────────────────────────────────────┘
377417
```
378418

419+
All storage types are AttributeTypes:
420+
- **Built-in**: `object`, `content`, `filepath@store` - auto-registered, use `dj.config`
421+
- **User-defined**: `<djblob>`, `<xblob>`, `<attach>`, `<xattach>`, `<custom>` - registered via `@dj.register_type`
422+
379423
## Storage Comparison
380424

381-
| Type | Core Type | Storage Location | Dedup | Returns |
382-
|------|-----------|------------------|-------|---------|
425+
| Type | dtype | Storage Location | Dedup | Returns |
426+
|------|-------|------------------|-------|---------|
427+
| `object` | `json` | `{schema}/{table}/{pk}/` | No | ObjectRef |
428+
| `object@s` | `json` | `{schema}/{table}/{pk}/` | No | ObjectRef |
429+
| `content` | `json` | `_content/{hash}` | Yes | bytes |
430+
| `content@s` | `json` | `_content/{hash}` | Yes | bytes |
431+
| `filepath@s` | `json` | Configured store (relative path) | No | ObjectRef |
383432
| `<djblob>` | `longblob` | Database | No | Python object |
384433
| `<xblob>` | `content` | `_content/{hash}` | Yes | Python object |
385434
| `<xblob@s>` | `content@s` | `_content/{hash}` | Yes | Python object |
386435
| `<attach>` | `longblob` | Database | No | Local file path |
387436
| `<xattach>` | `content` | `_content/{hash}` | Yes | Local file path |
388437
| `<xattach@s>` | `content@s` | `_content/{hash}` | Yes | Local file path |
389-
| `object` || `{schema}/{table}/{pk}/` | No | ObjectRef |
390-
| `object@s` || `{schema}/{table}/{pk}/` | No | ObjectRef |
391-
| `filepath@s` | `json` | Configured store (relative path) | No | ObjectRef |
392438

393439
## Reference Counting for Content Type
394440

@@ -435,10 +481,11 @@ def garbage_collect(project):
435481
(ContentRegistry() & {'content_hash': content_hash}).delete()
436482
```
437483

438-
## Core Type Comparison
484+
## Built-in AttributeType Comparison
439485

440486
| Feature | `object` | `content` | `filepath@store` |
441487
|---------|----------|-----------|------------------|
488+
| dtype | `json` | `json` | `json` |
442489
| Location | OAS store | OAS store | Configured store |
443490
| Addressing | Primary key | Content hash | Relative path |
444491
| Path control | DataJoint | DataJoint | User |
@@ -456,20 +503,21 @@ def garbage_collect(project):
456503

457504
## Key Design Decisions
458505

459-
1. **Layered architecture**: Core types (`object`, `content`, `filepath@store`, `json`) separate from AttributeTypes
460-
2. **Two OAS regions**: object (PK-addressed) and content (hash-addressed) within managed stores
461-
3. **Filepath for portability**: `filepath@store` uses relative paths within stores for environment portability
462-
4. **No `uri` type**: For arbitrary URLs, use `varchar`—simpler and more transparent
463-
5. **Content type**: Single-blob, content-addressed, deduplicated storage
464-
6. **JSON core type**: Cross-database compatible (MySQL JSON, PostgreSQL JSONB)
465-
7. **Parameterized types**: `<type@param>` passes parameter to underlying dtype
466-
8. **Naming convention**:
506+
1. **Two-layer architecture**: Database types (`json`, `longblob`, etc.) and AttributeTypes
507+
2. **All storage types are AttributeTypes**: Built-in (`object`, `content`, `filepath@store`) and user-defined (`<djblob>`, etc.)
508+
3. **Built-in types use JSON dtype**: Stores metadata (path, hash, store name, etc.) in JSON columns
509+
4. **Two OAS regions**: object (PK-addressed) and content (hash-addressed) within managed stores
510+
5. **Filepath for portability**: `filepath@store` uses relative paths within stores for environment portability
511+
6. **No `uri` type**: For arbitrary URLs, use `varchar`—simpler and more transparent
512+
7. **Content type**: Single-blob, content-addressed, deduplicated storage
513+
8. **Parameterized types**: `<type@param>` selects which store to use
514+
9. **Naming convention**:
467515
- `<djblob>` = internal serialized (database)
468516
- `<xblob>` = external serialized (content-addressed)
469517
- `<attach>` = internal file (single file)
470518
- `<xattach>` = external file (single file)
471-
9. **Transparent access**: AttributeTypes return Python objects or file paths
472-
10. **Lazy access**: `object`, `object@store`, and `filepath@store` return ObjectRef
519+
10. **Transparent access**: AttributeTypes return Python objects or file paths
520+
11. **Lazy access**: `object`, `object@store`, and `filepath@store` return ObjectRef
473521

474522
## Migration from Legacy Types
475523
