
Commit 3e9683a

Merge pull request #56 from DataFog/feat/4.1-baseline-fixes
Feat/4.1 baseline fixes
2 parents 4be5015 + b6afabc commit 3e9683a

12 files changed: +459 −85 lines

.codecov.yml

Lines changed: 14 additions & 0 deletions
@@ -1 +1,15 @@
 comment: no
+
+coverage:
+  status:
+    project:
+      default:
+        # Target overall coverage percentage
+        target: 74%
+        # Allow coverage to drop by this amount without failing
+        # threshold: 0.5% # Optional: uncomment to allow small drops
+    patch:
+      default:
+        # Target coverage percentage for the changes in the PR/commit
+        target: 20% # Lower target for patch coverage
+        # threshold: 1% # Optional: Allow patch coverage to drop

.pre-commit-config.yaml

Lines changed: 1 addition & 0 deletions
@@ -25,4 +25,5 @@ repos:
     rev: v4.0.0-alpha.8
     hooks:
       - id: prettier
+        types: [yaml, markdown] # Explicitly define file types
         exclude: .venv

README.md

Lines changed: 22 additions & 1 deletion
@@ -21,10 +21,31 @@
 
 DataFog can be installed via pip:
 
-```
+```bash
 pip install datafog
 ```
 
+### Optional Features (Extras)
+
+DataFog uses `extras` to manage dependencies for optional features like specific OCR engines or Apache Spark integration. You can install these as needed:
+
+- **OCR (Tesseract):** For image scanning using Tesseract. Requires Tesseract OCR engine to be installed on your system separately.
+  ```bash
+  pip install "datafog[ocr]"
+  ```
+- **OCR (Donut):** For image scanning using the Donut document understanding model.
+  ```bash
+  pip install "datafog[donut]"
+  ```
+- **Spark:** For processing data using PySpark.
+  ```bash
+  pip install "datafog[spark]"
+  ```
+- **All:** To install all optional features at once.
+  ```bash
+  pip install "datafog[all]"
+  ```
+
 # CLI
 
 ## 📚 Quick Reference
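With the extras documented, missing optional dependencies now surface as import-time errors instead of silent runtime `pip install`s (see the module-level `try`/`except` blocks in the diffs below). A minimal sketch of how a caller might probe for the Spark extra; the `SparkService` import path comes from this commit, while the handling around it is illustrative:

```python
# Illustrative only: detect whether the optional Spark extra is installed.
# The spark_service module now raises ModuleNotFoundError at import time when
# pyspark is missing, with a hint to run: pip install "datafog[spark]"
try:
    from datafog.services.spark_service import SparkService
except ModuleNotFoundError as exc:
    print(f"Spark support unavailable: {exc}")
else:
    service = SparkService()  # builds (or reuses) a SparkSession
```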

datafog/__about__.py

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-__version__ = "3.3.0"
+__version__ = "4.1.0"

datafog/processing/image_processing/donut_processor.py

Lines changed: 15 additions & 15 deletions
@@ -19,6 +19,21 @@
 
 from .image_downloader import ImageDownloader
 
+# Attempt imports and provide helpful error messages
+try:
+    import torch
+except ModuleNotFoundError:
+    raise ModuleNotFoundError(
+        "torch is not installed. Please install it to use Donut features: pip install 'datafog[donut]'"
+    )
+try:
+    from transformers import DonutProcessor as TransformersDonutProcessor
+    from transformers import VisionEncoderDecoderModel
+except ModuleNotFoundError:
+    raise ModuleNotFoundError(
+        "transformers is not installed. Please install it to use Donut features: pip install 'datafog[donut]'"
+    )
+
 
 class DonutProcessor:
     """
@@ -30,28 +45,13 @@ class DonutProcessor:
     """
 
     def __init__(self, model_path="naver-clova-ix/donut-base-finetuned-cord-v2"):
-        self.ensure_installed("torch")
-        self.ensure_installed("transformers")
-
-        import torch
-        from transformers import DonutProcessor as TransformersDonutProcessor
-        from transformers import VisionEncoderDecoderModel
-
         self.processor = TransformersDonutProcessor.from_pretrained(model_path)
         self.model = VisionEncoderDecoderModel.from_pretrained(model_path)
         self.device = "cuda" if torch.cuda.is_available() else "cpu"
         self.model.to(self.device)
         self.model.eval()
         self.downloader = ImageDownloader()
 
-    def ensure_installed(self, package_name):
-        try:
-            importlib.import_module(package_name)
-        except ImportError:
-            subprocess.check_call(
-                [sys.executable, "-m", "pip", "install", package_name]
-            )
-
     def preprocess_image(self, image: Image.Image) -> np.ndarray:
         # Convert to RGB if the image is not already in RGB mode
         if image.mode != "RGB":
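A short usage sketch for the now fail-fast Donut path. The class, default model path, and `preprocess_image` signature come from the diff above; the image file name is illustrative, and the `datafog[donut]` extra (torch, transformers) must already be installed:

```python
from PIL import Image

from datafog.processing.image_processing.donut_processor import DonutProcessor

# Defaults to the naver-clova-ix/donut-base-finetuned-cord-v2 checkpoint and
# picks CUDA automatically when torch sees a GPU (per __init__ above).
processor = DonutProcessor()

image = Image.open("receipt.png")  # illustrative file name
pixel_array = processor.preprocess_image(image)  # converts to RGB, returns an np.ndarray
```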

datafog/processing/spark_processing/pyspark_udfs.py

Lines changed: 37 additions & 44 deletions
@@ -8,70 +8,63 @@
 """
 
 import importlib
+import logging
 import subprocess
 import sys
+import traceback
+from typing import List
 
-PII_ANNOTATION_LABELS = ["DATE_TIME", "LOC", "NRP", "ORG", "PER"]
-MAXIMAL_STRING_SIZE = 1000000
-
-
-def pii_annotator(text: str, broadcasted_nlp) -> list[list[str]]:
-    """Extract features using en_core_web_lg model.
-
-    Returns:
-        list[list[str]]: Values as arrays in order defined in the PII_ANNOTATION_LABELS.
-    """
-    ensure_installed("pyspark")
-    ensure_installed("spacy")
+try:
     import spacy
+except ImportError:
+    print("Spacy not found. Please install it: pip install spacy")
+    print("and download the model: python -m spacy download en_core_web_lg")
+    spacy = None
+    traceback.print_exc()
+    sys.exit(1)
+
+try:
     from pyspark.sql import SparkSession
     from pyspark.sql.functions import udf
-    from pyspark.sql.types import ArrayType, StringType, StructField, StructType
+    from pyspark.sql.types import ArrayType, StringType
+except ImportError:
+    print(
+        "PySpark not found. Please install it with the [spark] extra: pip install 'datafog[spark]'"
+    )
+
+    # Set placeholders to allow module import even if pyspark is not installed
+    def placeholder_udf(*args, **kwargs):
+        return None
+
+    def placeholder_arraytype(x):
+        return None
 
-    if text:
-        if len(text) > MAXIMAL_STRING_SIZE:
-            # Cut the strings for required sizes
-            text = text[:MAXIMAL_STRING_SIZE]
-        nlp = broadcasted_nlp.value
-        doc = nlp(text)
+    def placeholder_stringtype():
+        return None
 
-        # Pre-create dictionary with labels matching to expected extracted entities
-        classified_entities: dict[str, list[str]] = {
-            _label: [] for _label in PII_ANNOTATION_LABELS
-        }
-        for ent in doc.ents:
-            # Add entities from extracted values
-            classified_entities[ent.label_].append(ent.text)
+    udf = placeholder_udf
+    ArrayType = placeholder_arraytype
+    StringType = placeholder_stringtype
+    SparkSession = None  # Define a placeholder
+    traceback.print_exc()
+    # Do not exit, allow basic import but functions using Spark will fail later if called
 
-        return [_ent for _ent in classified_entities.values()]
-    else:
-        return [[] for _ in PII_ANNOTATION_LABELS]
+from datafog.processing.text_processing.spacy_pii_annotator import pii_annotator
+
+PII_ANNOTATION_LABELS = ["DATE_TIME", "LOC", "NRP", "ORG", "PER"]
+MAXIMAL_STRING_SIZE = 1000000
 
 
 def broadcast_pii_annotator_udf(
     spark_session=None, spacy_model: str = "en_core_web_lg"
 ):
     """Broadcast PII annotator across Spark cluster and create UDF"""
-    ensure_installed("pyspark")
-    ensure_installed("spacy")
-    import spacy
-    from pyspark.sql import SparkSession
-    from pyspark.sql.functions import udf
-    from pyspark.sql.types import ArrayType, StringType, StructField, StructType
-
     if not spark_session:
-        spark_session = SparkSession.builder.getOrCreate()
+        spark_session = SparkSession.builder.getOrCreate()  # noqa: F821
     broadcasted_nlp = spark_session.sparkContext.broadcast(spacy.load(spacy_model))
 
     pii_annotation_udf = udf(
         lambda text: pii_annotator(text, broadcasted_nlp),
         ArrayType(ArrayType(StringType())),
     )
     return pii_annotation_udf
-
-
-def ensure_installed(self, package_name):
-    try:
-        importlib.import_module(package_name)
-    except ImportError:
-        subprocess.check_call([sys.executable, "-m", "pip", "install", package_name])
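A hedged usage sketch for `broadcast_pii_annotator_udf`. The function name, label list, and return type (`ArrayType(ArrayType(StringType()))`) come from the diff above; the DataFrame, column name, and sample text are illustrative, and both the `[spark]` extra and the `en_core_web_lg` spaCy model must be installed:

```python
from pyspark.sql import SparkSession

from datafog.processing.spark_processing.pyspark_udfs import (
    PII_ANNOTATION_LABELS,
    broadcast_pii_annotator_udf,
)

spark = SparkSession.builder.appName("datafog-example").getOrCreate()
df = spark.createDataFrame([("Contact John Doe in Paris on 2024-01-01",)], ["text"])

pii_udf = broadcast_pii_annotator_udf(spark)  # broadcasts the spaCy model once per cluster
annotated = df.withColumn("pii", pii_udf("text"))

# Each "pii" cell holds one list of entity strings per label,
# in PII_ANNOTATION_LABELS order.
annotated.show(truncate=False)
```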

datafog/services/spark_service.py

Lines changed: 21 additions & 18 deletions
@@ -7,9 +7,21 @@
 
 import importlib
 import json
+import logging
 import subprocess
 import sys
-from typing import Any, List
+from typing import Any, List, Optional
+
+# Attempt to import pyspark and provide a helpful error message if missing
+try:
+    from pyspark.sql import DataFrame, SparkSession
+except ModuleNotFoundError:
+    raise ModuleNotFoundError(
+        "pyspark is not installed. Please install it to use Spark features: pip install datafog[spark]"
+    )
+
+from pyspark.sql.functions import udf
+from pyspark.sql.types import ArrayType, StringType
 
 
 class SparkService:
@@ -20,30 +32,21 @@ class SparkService:
     data reading and package installation.
     """
 
-    def __init__(self):
-        self.spark = self.create_spark_session()
-        self.ensure_installed("pyspark")
-
-        from pyspark.sql import DataFrame, SparkSession
-        from pyspark.sql.functions import udf
-        from pyspark.sql.types import ArrayType, StringType
+    def __init__(self, spark_session: Optional[SparkSession] = None):
+        if spark_session:
+            self.spark = spark_session
+        else:
+            self.spark = self.create_spark_session()
 
-        self.SparkSession = SparkSession
         self.DataFrame = DataFrame
         self.udf = udf
         self.ArrayType = ArrayType
        self.StringType = StringType
 
+        logging.info("SparkService initialized.")
+
     def create_spark_session(self):
-        return self.SparkSession.builder.appName("datafog").getOrCreate()
+        return SparkSession.builder.appName("datafog").getOrCreate()
 
     def read_json(self, path: str) -> List[dict]:
         return self.spark.read.json(path).collect()
-
-    def ensure_installed(self, package_name):
-        try:
-            importlib.import_module(package_name)
-        except ImportError:
-            subprocess.check_call(
-                [sys.executable, "-m", "pip", "install", package_name]
-            )
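A sketch of the new constructor injection in `SparkService`; the signature and `read_json` are from the diff above, while the session name and JSON path are illustrative:

```python
from pyspark.sql import SparkSession

from datafog.services.spark_service import SparkService

# Reuse a session you already manage (e.g. from a notebook or test fixture)...
existing = SparkSession.builder.appName("my-app").getOrCreate()
service = SparkService(spark_session=existing)

# ...or fall back to the service's own "datafog" session, as before.
default_service = SparkService()

rows = service.read_json("events.json")  # illustrative path; returns collected rows
```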

notes/ROADMAP.md

Lines changed: 77 additions & 0 deletions
@@ -0,0 +1,77 @@
---

### **v4.1.0 — Baseline stability**

* **MUST** read `__version__` from `datafog/__about__.py` and import it in `setup.py`; delete the duplicate there.
* **MUST** remove every `ensure_installed()` runtime `pip install`; fail fast instead.
* **MUST** document OCR/Donut extras in `setup.py[extras]`.

---
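As an illustration of the single-sourcing called for in the first v4.1.0 bullet, a minimal `setup.py` sketch that reads the version from `datafog/__about__.py` without importing the package; the extras lists here are illustrative, not the committed dependency pins:

```python
import os
import re

from setuptools import find_packages, setup

here = os.path.dirname(os.path.abspath(__file__))
with open(os.path.join(here, "datafog", "__about__.py"), encoding="utf-8") as f:
    version = re.search(r'__version__\s*=\s*"(.+?)"', f.read()).group(1)

setup(
    name="datafog",
    version=version,  # single source of truth: datafog/__about__.py
    packages=find_packages(),
    extras_require={
        # Illustrative extras; the real pins belong in setup.py itself.
        "ocr": ["pytesseract", "Pillow"],
        "donut": ["torch", "transformers"],
        "spark": ["pyspark"],
    },
)
```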
### **v4.2.0 — Faster spaCy path**

- **MUST** hold the spaCy `nlp` object in a module-level cache (singleton).
- **MUST** replace per-doc loops with `nlp.pipe(batch_size=?, n_process=-1)`.
- **MUST** run spaCy and Tesseract calls in `asyncio.to_thread()` (or a thread-pool) so the event-loop stays free.
- **SHOULD** expose `PIPE_BATCH_SIZE` env var for tuning.

---
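A sketch of what the v4.2.0 items could look like: a module-level model cache, batched `nlp.pipe`, and a thread hand-off so the event loop stays free. `PIPE_BATCH_SIZE` mirrors the proposed env var; everything else is illustrative:

```python
import asyncio
import os
from typing import Iterable, List, Optional

import spacy

PIPE_BATCH_SIZE = int(os.getenv("PIPE_BATCH_SIZE", "64"))
_NLP: Optional[spacy.language.Language] = None


def get_nlp(model: str = "en_core_web_lg") -> spacy.language.Language:
    """Load the spaCy model once per process and reuse it (module-level singleton)."""
    global _NLP
    if _NLP is None:
        _NLP = spacy.load(model)
    return _NLP


def annotate_many(texts: Iterable[str]) -> List[List[str]]:
    """Batch texts through nlp.pipe instead of calling nlp(text) per document."""
    nlp = get_nlp()
    return [
        [ent.label_ for ent in doc.ents]
        for doc in nlp.pipe(texts, batch_size=PIPE_BATCH_SIZE)
    ]


async def annotate_many_async(texts: List[str]) -> List[List[str]]:
    # Push the blocking spaCy call onto a worker thread (Python 3.9+).
    return await asyncio.to_thread(annotate_many, texts)
```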
### **v4.3.0 — Strong types, predictable output**

- **MUST** make `_process_text` always return `Dict[str, Dict]`.
- **MUST** add `mypy --strict` to CI; fix any revealed issues.
- **SHOULD** convert `datafog.config` to a Pydantic v2 `BaseSettings`.

---
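A sketch of the Pydantic v2 conversion suggested for `datafog.config`, using the `pydantic-settings` package; the field names are illustrative, and only the env-var names `DATAFOG_DEFAULT_OCR` and `PIPE_BATCH_SIZE` appear elsewhere in this roadmap:

```python
from pydantic_settings import BaseSettings, SettingsConfigDict


class DataFogSettings(BaseSettings):
    """Illustrative settings object; these are not the library's real config keys."""

    model_config = SettingsConfigDict(env_prefix="DATAFOG_")

    default_ocr: str = "tesseract"   # overridable via DATAFOG_DEFAULT_OCR
    pipe_batch_size: int = 64        # overridable via DATAFOG_PIPE_BATCH_SIZE
    spacy_model: str = "en_core_web_lg"


settings = DataFogSettings()  # environment variables win over the defaults above
```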
### **v4.4.0 — Clean OCR architecture**

- **MUST** split `ImageService` into `TesseractOCR` and `DonutOCR`, each with `extract_text(Image)->str`.
- **MUST** let users pick via `ImageService(backend="tesseract"|"donut")` or the `DATAFOG_DEFAULT_OCR` env var.
- **SHOULD** add unit tests that stub each backend independently.

---
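A sketch of the backend split described above. The class names, `extract_text` signature, backend argument, and `DATAFOG_DEFAULT_OCR` env var come from the roadmap item; the bodies are placeholders, not DataFog code:

```python
import os
from abc import ABC, abstractmethod
from typing import Optional

from PIL import Image


class OCRBackend(ABC):
    @abstractmethod
    def extract_text(self, image: Image.Image) -> str: ...


class TesseractOCR(OCRBackend):
    def extract_text(self, image: Image.Image) -> str:
        import pytesseract  # needs the system tesseract binary installed

        return pytesseract.image_to_string(image)


class DonutOCR(OCRBackend):
    def extract_text(self, image: Image.Image) -> str:
        raise NotImplementedError("Donut inference would run here")


class ImageService:
    _backends = {"tesseract": TesseractOCR, "donut": DonutOCR}

    def __init__(self, backend: Optional[str] = None):
        # Explicit argument wins, then the DATAFOG_DEFAULT_OCR env var.
        name = backend or os.getenv("DATAFOG_DEFAULT_OCR", "tesseract")
        self.ocr: OCRBackend = self._backends[name]()
```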
### **v4.5.0 — Rust-powered pattern matching (optional wheel)**

- **MUST** create a PyO3 extension `datafog._fastregex` that wraps `aho-corasick` / `regex-automata`.
- **MUST** auto-import it when available; fall back to pure-Python silently.
- **SHOULD** publish platform wheels under `pip install "datafog[fastregex]"`.

---
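A sketch of the "auto-import, silent fallback" behaviour this item asks for. `datafog._fastregex` does not exist yet, so its API here is hypothetical; the stdlib `re` module stands in for the pure-Python path:

```python
import re

try:
    from datafog import _fastregex as fastregex  # optional compiled wheel (hypothetical)
    HAVE_FASTREGEX = True
except ImportError:
    fastregex = None
    HAVE_FASTREGEX = False


def find_all(pattern: str, text: str):
    if HAVE_FASTREGEX:
        return fastregex.find_all(pattern, text)  # hypothetical API
    return re.findall(pattern, text)  # pure-Python fallback
```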
### **v4.6.0 — Streaming and zero-copy**

- **MUST** add `async def stream_text_pipeline(iterable[str]) -> AsyncIterator[Result]`.
- **MUST** scan CSV/JSON via `pyarrow.dataset` to avoid reading the whole file into RAM.
- **SHOULD** provide example notebook comparing latency/bandwidth vs. v4.5.

---
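A sketch of the `pyarrow.dataset` scan mentioned above, reading a large CSV in record batches instead of loading it whole; the directory, column name, and batch size are illustrative:

```python
import pyarrow.dataset as ds

dataset = ds.dataset("input_data/", format="csv")

for batch in dataset.to_batches(columns=["text"], batch_size=10_000):
    for text in batch.to_pydict()["text"]:
        ...  # feed each chunk into the (future) streaming pipeline
```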
### **v4.7.0 — GPU / transformer toggle**

- **MUST** accept `DataFog(use_gpu=True)` which loads `en_core_web_trf` in half precision if CUDA is present.
- **MUST** fall back gracefully on CPU-only hosts.
- **SHOULD** benchmark and log model choice at INFO level.

---
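A sketch of the GPU toggle with a graceful CPU fallback and INFO-level logging of the model choice; the half-precision detail is omitted and the helper name is illustrative:

```python
import logging

import spacy


def load_pipeline(use_gpu: bool = False):
    # spacy.prefer_gpu() returns False on CPU-only hosts, so this degrades cleanly.
    if use_gpu and spacy.prefer_gpu():
        model = "en_core_web_trf"
    else:
        model = "en_core_web_lg"
    logging.info("Loading spaCy model %s", model)
    return spacy.load(model)
```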
### **v4.8.0 — Fast anonymizer core**

- **MUST** rewrite `Anonymizer.replace_pii/redact_pii/hash_pii` in Cython (single-pass over the string).
- **MUST** switch hashing to OpenSSL EVP via `cffi` for SHA-256/SHA3-256.
- **SHOULD** guard with `pip install "datafog[fast]"`.

---

### **v4.9.0 — Edge & CI polish**

- **MUST** compile the annotator and anonymizer to WebAssembly using `maturin`, package as `_datafog_wasm`.
- **MUST** auto-load WASM build on `wasmtime` when `import datafog.wasm` succeeds.
- **MUST** cache spaCy model artefacts in GitHub Actions with `actions/cache`, keyed by `model-hash`.
- **SHOULD** update docs and `README.md` badges for new extras and WASM support.

---

Use this ladder as-is, bumping **only the minor version** each time, so v4.0.x callers never break.
