Skip to content

Commit a0a8bfd

Browse files
committed
chore: Apply pre-commit fixes
1 parent 1da1fd3 commit a0a8bfd

File tree

7 files changed

+112
-112
lines changed

7 files changed

+112
-112
lines changed

README.md

Lines changed: 16 additions & 16 deletions
Original file line numberDiff line numberDiff line change
@@ -29,22 +29,22 @@ pip install datafog
2929

3030
DataFog uses `extras` to manage dependencies for optional features like specific OCR engines or Apache Spark integration. You can install these as needed:
3131

32-
* **OCR (Tesseract):** For image scanning using Tesseract. Requires Tesseract OCR engine to be installed on your system separately.
33-
```bash
34-
pip install "datafog[ocr]"
35-
```
36-
* **OCR (Donut):** For image scanning using the Donut document understanding model.
37-
```bash
38-
pip install "datafog[donut]"
39-
```
40-
* **Spark:** For processing data using PySpark.
41-
```bash
42-
pip install "datafog[spark]"
43-
```
44-
* **All:** To install all optional features at once.
45-
```bash
46-
pip install "datafog[all]"
47-
```
32+
- **OCR (Tesseract):** For image scanning using Tesseract. Requires Tesseract OCR engine to be installed on your system separately.
33+
```bash
34+
pip install "datafog[ocr]"
35+
```
36+
- **OCR (Donut):** For image scanning using the Donut document understanding model.
37+
```bash
38+
pip install "datafog[donut]"
39+
```
40+
- **Spark:** For processing data using PySpark.
41+
```bash
42+
pip install "datafog[spark]"
43+
```
44+
- **All:** To install all optional features at once.
45+
```bash
46+
pip install "datafog[all]"
47+
```
4848

4949
# CLI
5050

datafog/processing/image_processing/donut_processor.py

Lines changed: 2 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -27,7 +27,8 @@
2727
"torch is not installed. Please install it to use Donut features: pip install 'datafog[donut]'"
2828
)
2929
try:
30-
from transformers import DonutProcessor as TransformersDonutProcessor, VisionEncoderDecoderModel
30+
from transformers import DonutProcessor as TransformersDonutProcessor
31+
from transformers import VisionEncoderDecoderModel
3132
except ModuleNotFoundError:
3233
raise ModuleNotFoundError(
3334
"transformers is not installed. Please install it to use Donut features: pip install 'datafog[donut]'"

datafog/processing/spark_processing/pyspark_udfs.py

Lines changed: 34 additions & 41 deletions
Original file line numberDiff line numberDiff line change
@@ -7,67 +7,60 @@
77
on text data.
88
"""
99

10-
import logging
11-
import sys
1210
import importlib
11+
import logging
1312
import subprocess
13+
import sys
14+
import traceback
15+
from typing import List
1416

15-
# Attempt imports and provide helpful error messages
1617
try:
17-
from pyspark.sql.functions import udf
18-
from pyspark.sql.types import StringType, ArrayType
19-
except ModuleNotFoundError:
20-
raise ModuleNotFoundError(
21-
"pyspark is not installed. Please install it to use Spark features: pip install datafog[spark]"
22-
)
18+
import spacy
19+
except ImportError:
20+
print("Spacy not found. Please install it: pip install spacy")
21+
print("and download the model: python -m spacy download en_core_web_lg")
22+
spacy = None
23+
traceback.print_exc()
24+
sys.exit(1)
2325

2426
try:
25-
import spacy
26-
except ModuleNotFoundError:
27-
# Spacy is a core dependency, but let's provide a helpful message just in case.
28-
raise ModuleNotFoundError(
29-
"spacy is not installed. Please ensure datafog is installed correctly: pip install datafog"
27+
from pyspark.sql import SparkSession
28+
from pyspark.sql.functions import udf
29+
from pyspark.sql.types import ArrayType, StringType
30+
except ImportError:
31+
print(
32+
"PySpark not found. Please install it with the [spark] extra: pip install 'datafog[spark]'"
3033
)
3134

35+
# Set placeholders to allow module import even if pyspark is not installed
36+
def placeholder_udf(*args, **kwargs):
37+
return None
3238

33-
from typing import List
34-
35-
PII_ANNOTATION_LABELS = ["DATE_TIME", "LOC", "NRP", "ORG", "PER"]
36-
MAXIMAL_STRING_SIZE = 1000000
37-
39+
def placeholder_arraytype(x):
40+
return None
3841

39-
def pii_annotator(text: str, broadcasted_nlp) -> List[List[str]]:
40-
"""Extract features using en_core_web_lg model.
42+
def placeholder_stringtype():
43+
return None
4144

42-
Returns:
43-
list[list[str]]: Values as arrays in order defined in the PII_ANNOTATION_LABELS.
44-
"""
45-
if text:
46-
if len(text) > MAXIMAL_STRING_SIZE:
47-
# Cut the strings for required sizes
48-
text = text[:MAXIMAL_STRING_SIZE]
49-
nlp = broadcasted_nlp.value
50-
doc = nlp(text)
45+
udf = placeholder_udf
46+
ArrayType = placeholder_arraytype
47+
StringType = placeholder_stringtype
48+
SparkSession = None # Define a placeholder
49+
traceback.print_exc()
50+
# Do not exit, allow basic import but functions using Spark will fail later if called
5151

52-
# Pre-create dictionary with labels matching to expected extracted entities
53-
classified_entities: dict[str, list[str]] = {
54-
_label: [] for _label in PII_ANNOTATION_LABELS
55-
}
56-
for ent in doc.ents:
57-
# Add entities from extracted values
58-
classified_entities[ent.label_].append(ent.text)
52+
from datafog.processing.text_processing.spacy_pii_annotator import pii_annotator
5953

60-
return [_ent for _ent in classified_entities.values()]
61-
else:
62-
return [[] for _ in PII_ANNOTATION_LABELS]
54+
PII_ANNOTATION_LABELS = ["DATE_TIME", "LOC", "NRP", "ORG", "PER"]
55+
MAXIMAL_STRING_SIZE = 1000000
6356

6457

6558
def broadcast_pii_annotator_udf(
6659
spark_session=None, spacy_model: str = "en_core_web_lg"
6760
):
6861
"""Broadcast PII annotator across Spark cluster and create UDF"""
6962
if not spark_session:
70-
spark_session = SparkSession.builder.getOrCreate()
63+
spark_session = SparkSession.builder.getOrCreate() # noqa: F821
7164
broadcasted_nlp = spark_session.sparkContext.broadcast(spacy.load(spacy_model))
7265

7366
pii_annotation_udf = udf(

datafog/services/spark_service.py

Lines changed: 4 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -5,16 +5,16 @@
55
JSON reading, and package management.
66
"""
77

8-
import sys
98
import importlib
10-
import subprocess
11-
import logging
129
import json
10+
import logging
11+
import subprocess
12+
import sys
1313
from typing import Any, List, Optional
1414

1515
# Attempt to import pyspark and provide a helpful error message if missing
1616
try:
17-
from pyspark.sql import SparkSession, DataFrame
17+
from pyspark.sql import DataFrame, SparkSession
1818
except ModuleNotFoundError:
1919
raise ModuleNotFoundError(
2020
"pyspark is not installed. Please install it to use Spark features: pip install datafog[spark]"

notes/ROADMAP.md

Lines changed: 27 additions & 28 deletions
Original file line numberDiff line numberDiff line change
@@ -1,4 +1,3 @@
1-
21
---
32

43
### **v4.1.0 — Baseline stability**
@@ -11,68 +10,68 @@
1110

1211
### **v4.2.0 — Faster spaCy path**
1312

14-
* **MUST** hold the spaCy `nlp` object in a module-level cache (singleton).
15-
* **MUST** replace per-doc loops with `nlp.pipe(batch_size=?, n_process=-1)`.
16-
* **MUST** run spaCy and Tesseract calls in `asyncio.to_thread()` (or a thread-pool) so the event-loop stays free.
17-
* **SHOULD** expose `PIPE_BATCH_SIZE` env var for tuning.
13+
- **MUST** hold the spaCy `nlp` object in a module-level cache (singleton).
14+
- **MUST** replace per-doc loops with `nlp.pipe(batch_size=?, n_process=-1)`.
15+
- **MUST** run spaCy and Tesseract calls in `asyncio.to_thread()` (or a thread-pool) so the event-loop stays free.
16+
- **SHOULD** expose `PIPE_BATCH_SIZE` env var for tuning.
1817

1918
---
2019

2120
### **v4.3.0 — Strong types, predictable output**
2221

23-
* **MUST** make `_process_text` always return `Dict[str, Dict]`.
24-
* **MUST** add `mypy --strict` to CI; fix any revealed issues.
25-
* **SHOULD** convert `datafog.config` to a Pydantic v2 `BaseSettings`.
22+
- **MUST** make `_process_text` always return `Dict[str, Dict]`.
23+
- **MUST** add `mypy --strict` to CI; fix any revealed issues.
24+
- **SHOULD** convert `datafog.config` to a Pydantic v2 `BaseSettings`.
2625

2726
---
2827

2928
### **v4.4.0 — Clean OCR architecture**
3029

31-
* **MUST** split `ImageService` into `TesseractOCR` and `DonutOCR`, each with `extract_text(Image)->str`.
32-
* **MUST** let users pick via `ImageService(backend="tesseract"|"donut")` or the `DATAFOG_DEFAULT_OCR` env var.
33-
* **SHOULD** add unit tests that stub each backend independently.
30+
- **MUST** split `ImageService` into `TesseractOCR` and `DonutOCR`, each with `extract_text(Image)->str`.
31+
- **MUST** let users pick via `ImageService(backend="tesseract"|"donut")` or the `DATAFOG_DEFAULT_OCR` env var.
32+
- **SHOULD** add unit tests that stub each backend independently.
3433

3534
---
3635

3736
### **v4.5.0 — Rust-powered pattern matching (optional wheel)**
3837

39-
* **MUST** create a PyO3 extension `datafog._fastregex` that wraps `aho-corasick` / `regex-automata`.
40-
* **MUST** auto-import it when available; fall back to pure-Python silently.
41-
* **SHOULD** publish platform wheels under `pip install "datafog[fastregex]"`.
38+
- **MUST** create a PyO3 extension `datafog._fastregex` that wraps `aho-corasick` / `regex-automata`.
39+
- **MUST** auto-import it when available; fall back to pure-Python silently.
40+
- **SHOULD** publish platform wheels under `pip install "datafog[fastregex]"`.
4241

4342
---
4443

4544
### **v4.6.0 — Streaming and zero-copy**
4645

47-
* **MUST** add `async def stream_text_pipeline(iterable[str]) -> AsyncIterator[Result]`.
48-
* **MUST** scan CSV/JSON via `pyarrow.dataset` to avoid reading the whole file into RAM.
49-
* **SHOULD** provide example notebook comparing latency/bandwidth vs. v4.5.
46+
- **MUST** add `async def stream_text_pipeline(iterable[str]) -> AsyncIterator[Result]`.
47+
- **MUST** scan CSV/JSON via `pyarrow.dataset` to avoid reading the whole file into RAM.
48+
- **SHOULD** provide example notebook comparing latency/bandwidth vs. v4.5.
5049

5150
---
5251

5352
### **v4.7.0 — GPU / transformer toggle**
5453

55-
* **MUST** accept `DataFog(use_gpu=True)` which loads `en_core_web_trf` in half precision if CUDA is present.
56-
* **MUST** fall back gracefully on CPU-only hosts.
57-
* **SHOULD** benchmark and log model choice at INFO level.
54+
- **MUST** accept `DataFog(use_gpu=True)` which loads `en_core_web_trf` in half precision if CUDA is present.
55+
- **MUST** fall back gracefully on CPU-only hosts.
56+
- **SHOULD** benchmark and log model choice at INFO level.
5857

5958
---
6059

6160
### **v4.8.0 — Fast anonymizer core**
6261

63-
* **MUST** rewrite `Anonymizer.replace_pii/redact_pii/hash_pii` in Cython (single-pass over the string).
64-
* **MUST** switch hashing to OpenSSL EVP via `cffi` for SHA-256/SHA3-256.
65-
* **SHOULD** guard with `pip install "datafog[fast]"`.
62+
- **MUST** rewrite `Anonymizer.replace_pii/redact_pii/hash_pii` in Cython (single-pass over the string).
63+
- **MUST** switch hashing to OpenSSL EVP via `cffi` for SHA-256/SHA3-256.
64+
- **SHOULD** guard with `pip install "datafog[fast]"`.
6665

6766
---
6867

6968
### **v4.9.0 — Edge & CI polish**
7069

71-
* **MUST** compile the annotator and anonymizer to WebAssembly using `maturin`, package as `_datafog_wasm`.
72-
* **MUST** auto-load WASM build on `wasmtime` when `import datafog.wasm` succeeds.
73-
* **MUST** cache spaCy model artefacts in GitHub Actions with `actions/cache`, keyed by `model-hash`.
74-
* **SHOULD** update docs and `README.md` badges for new extras and WASM support.
70+
- **MUST** compile the annotator and anonymizer to WebAssembly using `maturin`, package as `_datafog_wasm`.
71+
- **MUST** auto-load WASM build on `wasmtime` when `import datafog.wasm` succeeds.
72+
- **MUST** cache spaCy model artefacts in GitHub Actions with `actions/cache`, keyed by `model-hash`.
73+
- **SHOULD** update docs and `README.md` badges for new extras and WASM support.
7574

7675
---
7776

78-
Use this ladder as-is, bumping **only the minor version** each time, so v4.0.x callers never break.
77+
Use this ladder as-is, bumping **only the minor version** each time, so v4.0.x callers never break.

notes/v4.1.0-tickets.md

Lines changed: 13 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -10,13 +10,15 @@
1010
Currently, the package version might be duplicated or inconsistently defined. We need to centralize the version definition in `datafog/__about__.py`.
1111

1212
**Tasks:**
13+
1314
1. Ensure `datafog/__about__.py` exists and contains a `__version__` string variable (e.g., `__version__ = "4.1.0"`).
1415
2. Modify `setup.py` to read this `__version__` variable from `datafog/__about__.py`. Common patterns involve reading the file and executing its content in a temporary namespace or using regular expressions.
1516
3. Remove any hardcoded `version` assignment within `setup.py` itself.
1617
4. Verify that `pip install .` and building distributions (`sdist`, `wheel`) correctly pick up the version from `__about__.py`.
1718

1819
**Acceptance Criteria:**
19-
- The package version is defined *only* in `datafog/__about__.py`.
20+
21+
- The package version is defined _only_ in `datafog/__about__.py`.
2022
- `setup.py` successfully reads the version from `__about__.py` during installation and build processes.
2123
- Running `import datafog; print(datafog.__version__)` (if applicable) shows the correct version.
2224

@@ -30,12 +32,14 @@ Currently, the package version might be duplicated or inconsistently defined. We
3032
The codebase currently uses functions like `ensure_installed()` that attempt to `pip install` missing dependencies at runtime. This practice is unreliable, can hide dependency issues, slow down startup, and interfere with environment management. We must remove this pattern and adopt a "fail fast" approach.
3133

3234
**Tasks:**
35+
3336
1. Identify all code locations where runtime `pip install` commands are executed (e.g., calls to `ensure_installed`, `subprocess.run(['pip', 'install', ...])`).
3437
2. Remove these runtime installation calls entirely.
3538
3. Replace them with standard `import` statements. If an `ImportError` occurs, the program should exit gracefully, clearly stating which dependency is missing and how to install it (e.g., "Please install the 'X' package: pip install datafog[feature]").
3639
4. Ensure all necessary dependencies are listed correctly in `setup.py`'s `install_requires` or `extras_require`.
3740

3841
**Acceptance Criteria:**
42+
3943
- No code attempts to install packages using `pip` or similar mechanisms during program execution.
4044
- If an optional dependency (part of an `extra`) is needed but not installed, the program raises an `ImportError` with a helpful message instructing the user how to install the required extra.
4145
- Core dependencies listed in `install_requires` are assumed to be present; missing core dependencies will naturally cause `ImportError` on startup.
@@ -50,18 +54,20 @@ The codebase currently uses functions like `ensure_installed()` that attempt to
5054
The project offers optional OCR functionality using Tesseract and/or Donut models, which have their own dependencies. These optional dependencies need to be formally defined using `extras_require` in `setup.py` and documented for users.
5155

5256
**Tasks:**
53-
1. Identify all dependencies required *only* for Tesseract functionality.
54-
2. Identify all dependencies required *only* for Donut functionality.
57+
58+
1. Identify all dependencies required _only_ for Tesseract functionality.
59+
2. Identify all dependencies required _only_ for Donut functionality.
5560
3. Define appropriate extras in the `extras_require` dictionary within `setup.py`. Suggestions:
56-
* `'ocr': ['pytesseract', 'pillow', ...]` (for Tesseract)
57-
* `'donut': ['transformers[torch]', 'sentencepiece', ...]` (for Donut)
58-
* Optionally, a combined extra: `'all_ocr': ['pytesseract', 'pillow', 'transformers[torch]', 'sentencepiece', ...]` or include dependencies in a general `'ocr'` extra if they don't conflict significantly.
61+
- `'ocr': ['pytesseract', 'pillow', ...]` (for Tesseract)
62+
- `'donut': ['transformers[torch]', 'sentencepiece', ...]` (for Donut)
63+
- Optionally, a combined extra: `'all_ocr': ['pytesseract', 'pillow', 'transformers[torch]', 'sentencepiece', ...]` or include dependencies in a general `'ocr'` extra if they don't conflict significantly.
5964
4. Update the `README.md` and any installation documentation (e.g., `docs/installation.md`) to explain these extras and how users can install them (e.g., `pip install "datafog[ocr]"` or `pip install "datafog[donut]"`).
6065

6166
**Acceptance Criteria:**
67+
6268
- `setup.py` contains an `extras_require` section defining keys like `ocr` and/or `donut`.
6369
- Installing the package with these extras (e.g., `pip install .[ocr]`) successfully installs the associated dependencies.
6470
- Documentation clearly explains the available extras and the installation commands.
65-
- Core installation (`pip install .`) does *not* install the OCR-specific dependencies.
71+
- Core installation (`pip install .`) does _not_ install the OCR-specific dependencies.
6672

6773
---

0 commit comments

Comments
 (0)