|
1 | | - |
2 | 1 | --- |
3 | 2 |
|
4 | 3 | ### **v4.1.0 — Baseline stability** |
|
11 | 10 |
|
12 | 11 | ### **v4.2.0 — Faster spaCy path** |
13 | 12 |
|
14 | | -* **MUST** hold the spaCy `nlp` object in a module-level cache (singleton). |
15 | | -* **MUST** replace per-doc loops with `nlp.pipe(batch_size=?, n_process=-1)`. |
16 | | -* **MUST** run spaCy and Tesseract calls in `asyncio.to_thread()` (or a thread-pool) so the event-loop stays free. |
17 | | -* **SHOULD** expose `PIPE_BATCH_SIZE` env var for tuning. |
| 13 | +- **MUST** hold the spaCy `nlp` object in a module-level cache (singleton). |
| 14 | +- **MUST** replace per-doc loops with `nlp.pipe(batch_size=?, n_process=-1)`. |
| 15 | +- **MUST** run spaCy and Tesseract calls in `asyncio.to_thread()` (or a thread-pool) so the event-loop stays free. |
| 16 | +- **SHOULD** expose `PIPE_BATCH_SIZE` env var for tuning. |
18 | 17 |
|
19 | 18 | --- |
20 | 19 |
|
21 | 20 | ### **v4.3.0 — Strong types, predictable output** |
22 | 21 |
|
23 | | -* **MUST** make `_process_text` always return `Dict[str, Dict]`. |
24 | | -* **MUST** add `mypy --strict` to CI; fix any revealed issues. |
25 | | -* **SHOULD** convert `datafog.config` to a Pydantic v2 `BaseSettings`. |
| 22 | +- **MUST** make `_process_text` always return `Dict[str, Dict]`. |
| 23 | +- **MUST** add `mypy --strict` to CI; fix any revealed issues. |
| 24 | +- **SHOULD** convert `datafog.config` to a Pydantic v2 `BaseSettings`. |
26 | 25 |
|
27 | 26 | --- |
28 | 27 |
|
29 | 28 | ### **v4.4.0 — Clean OCR architecture** |
30 | 29 |
|
31 | | -* **MUST** split `ImageService` into `TesseractOCR` and `DonutOCR`, each with `extract_text(Image)->str`. |
32 | | -* **MUST** let users pick via `ImageService(backend="tesseract"|"donut")` or the `DATAFOG_DEFAULT_OCR` env var. |
33 | | -* **SHOULD** add unit tests that stub each backend independently. |
| 30 | +- **MUST** split `ImageService` into `TesseractOCR` and `DonutOCR`, each with `extract_text(Image)->str`. |
| 31 | +- **MUST** let users pick via `ImageService(backend="tesseract"|"donut")` or the `DATAFOG_DEFAULT_OCR` env var. |
| 32 | +- **SHOULD** add unit tests that stub each backend independently. |
34 | 33 |
|
35 | 34 | --- |
36 | 35 |
|
37 | 36 | ### **v4.5.0 — Rust-powered pattern matching (optional wheel)** |
38 | 37 |
|
39 | | -* **MUST** create a PyO3 extension `datafog._fastregex` that wraps `aho-corasick` / `regex-automata`. |
40 | | -* **MUST** auto-import it when available; fall back to pure-Python silently. |
41 | | -* **SHOULD** publish platform wheels under `pip install "datafog[fastregex]"`. |
| 38 | +- **MUST** create a PyO3 extension `datafog._fastregex` that wraps `aho-corasick` / `regex-automata`. |
| 39 | +- **MUST** auto-import it when available; fall back to pure-Python silently. |
| 40 | +- **SHOULD** publish platform wheels under `pip install "datafog[fastregex]"`. |
42 | 41 |
|
43 | 42 | --- |
44 | 43 |
|
45 | 44 | ### **v4.6.0 — Streaming and zero-copy** |
46 | 45 |
|
47 | | -* **MUST** add `async def stream_text_pipeline(iterable[str]) -> AsyncIterator[Result]`. |
48 | | -* **MUST** scan CSV/JSON via `pyarrow.dataset` to avoid reading the whole file into RAM. |
49 | | -* **SHOULD** provide example notebook comparing latency/bandwidth vs. v4.5. |
| 46 | +- **MUST** add `async def stream_text_pipeline(iterable[str]) -> AsyncIterator[Result]`. |
| 47 | +- **MUST** scan CSV/JSON via `pyarrow.dataset` to avoid reading the whole file into RAM. |
| 48 | +- **SHOULD** provide example notebook comparing latency/bandwidth vs. v4.5. |
50 | 49 |
|
51 | 50 | --- |
52 | 51 |
|
53 | 52 | ### **v4.7.0 — GPU / transformer toggle** |
54 | 53 |
|
55 | | -* **MUST** accept `DataFog(use_gpu=True)` which loads `en_core_web_trf` in half precision if CUDA is present. |
56 | | -* **MUST** fall back gracefully on CPU-only hosts. |
57 | | -* **SHOULD** benchmark and log model choice at INFO level. |
| 54 | +- **MUST** accept `DataFog(use_gpu=True)` which loads `en_core_web_trf` in half precision if CUDA is present. |
| 55 | +- **MUST** fall back gracefully on CPU-only hosts. |
| 56 | +- **SHOULD** benchmark and log model choice at INFO level. |
58 | 57 |
|
59 | 58 | --- |
60 | 59 |
|
61 | 60 | ### **v4.8.0 — Fast anonymizer core** |
62 | 61 |
|
63 | | -* **MUST** rewrite `Anonymizer.replace_pii/redact_pii/hash_pii` in Cython (single-pass over the string). |
64 | | -* **MUST** switch hashing to OpenSSL EVP via `cffi` for SHA-256/SHA3-256. |
65 | | -* **SHOULD** guard with `pip install "datafog[fast]"`. |
| 62 | +- **MUST** rewrite `Anonymizer.replace_pii/redact_pii/hash_pii` in Cython (single-pass over the string). |
| 63 | +- **MUST** switch hashing to OpenSSL EVP via `cffi` for SHA-256/SHA3-256. |
| 64 | +- **SHOULD** guard with `pip install "datafog[fast]"`. |
66 | 65 |
|
67 | 66 | --- |
68 | 67 |
|
69 | 68 | ### **v4.9.0 — Edge & CI polish** |
70 | 69 |
|
71 | | -* **MUST** compile the annotator and anonymizer to WebAssembly using `maturin`, package as `_datafog_wasm`. |
72 | | -* **MUST** auto-load WASM build on `wasmtime` when `import datafog.wasm` succeeds. |
73 | | -* **MUST** cache spaCy model artefacts in GitHub Actions with `actions/cache`, keyed by `model-hash`. |
74 | | -* **SHOULD** update docs and `README.md` badges for new extras and WASM support. |
| 70 | +- **MUST** compile the annotator and anonymizer to WebAssembly using `maturin`, package as `_datafog_wasm`. |
| 71 | +- **MUST** auto-load WASM build on `wasmtime` when `import datafog.wasm` succeeds. |
| 72 | +- **MUST** cache spaCy model artefacts in GitHub Actions with `actions/cache`, keyed by `model-hash`. |
| 73 | +- **SHOULD** update docs and `README.md` badges for new extras and WASM support. |
75 | 74 |
|
76 | 75 | --- |
77 | 76 |
|
78 | | -Use this ladder as-is, bumping **only the minor version** each time, so v4.0.x callers never break. |
| 77 | +Use this ladder as-is, bumping **only the minor version** each time, so v4.0.x callers never break. |
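
The v4.2.0 items in the ladder above (module-level cache, one batched `pipe` call, `asyncio.to_thread()`, and the `PIPE_BATCH_SIZE` env var) could fit together roughly as follows. This is a minimal sketch and not the project's actual code: `StandInNlp`, `get_nlp`, `annotate_batch`, `annotate_async`, and the default batch size of 64 are all hypothetical names and values chosen for illustration, and the stand-in class replaces a real spaCy pipeline so the example runs without a model installed.

```python
import asyncio
import os
from functools import lru_cache
from typing import Iterable, Iterator, List


def pipe_batch_size(default: int = 64) -> int:
    """Read PIPE_BATCH_SIZE from the environment (the default of 64 is an assumption)."""
    raw = os.environ.get("PIPE_BATCH_SIZE", "").strip()
    return int(raw) if raw.isdigit() and int(raw) > 0 else default


class StandInNlp:
    """Hypothetical stand-in for a spaCy pipeline so the sketch runs without a model."""

    def pipe(self, texts: Iterable[str], batch_size: int = 64) -> Iterator[str]:
        for text in texts:
            yield text.upper()  # placeholder for real annotation work


@lru_cache(maxsize=1)
def get_nlp() -> StandInNlp:
    """Build the pipeline once per process; real code would load the spaCy model here."""
    return StandInNlp()


def annotate_batch(texts: List[str]) -> List[str]:
    """Blocking helper: one batched pipe() call instead of a per-document loop."""
    return list(get_nlp().pipe(texts, batch_size=pipe_batch_size()))


async def annotate_async(texts: List[str]) -> List[str]:
    """Run the blocking batch in a worker thread so the event loop stays free."""
    return await asyncio.to_thread(annotate_batch, texts)
```

Calling `asyncio.run(annotate_async(["alice", "bob"]))` exercises the whole path. `lru_cache(maxsize=1)` on a zero-argument loader is one simple way to get singleton behaviour; a plain module-level variable would work equally well.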
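
The v4.5.0 requirement to auto-import the Rust extension when available and fall back to pure Python silently might look something like the sketch below. Only the module name `datafog._fastregex` comes from the roadmap. The `find_all` attribute, the `Match` tuple shape, and the helper names are assumptions made for illustration, since the extension's real interface is not specified here.

```python
import importlib
import re
from typing import Callable, List, Tuple

# (start, end, matched_text); this tuple shape is an assumption for the sketch.
Match = Tuple[int, int, str]


def python_find_all(pattern: str, text: str) -> List[Match]:
    """Pure-Python fallback built on the standard-library re module."""
    return [(m.start(), m.end(), m.group(0)) for m in re.finditer(pattern, text)]


def load_find_all() -> Tuple[Callable[[str, str], List[Match]], str]:
    """Return (matcher, backend_name), preferring the optional extension."""
    try:
        fast = importlib.import_module("datafog._fastregex")
        # `find_all` is a hypothetical entry point on the extension module.
        return fast.find_all, "rust"
    except (ImportError, AttributeError):
        # Silent fallback: no warning and no log line, as the roadmap asks.
        return python_find_all, "python"


find_all, backend = load_find_all()
```

Resolving the backend once at import time keeps the check out of the hot path; callers use `find_all` without needing to know which implementation they received.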