Commit 5c7018c

20251115_02-Update
- Attacked the TODO list with some simple glyph fixes.
1 parent f1d72a7 commit 5c7018c

7 files changed: +51 -18 lines changed

CHANGELOG.md

Lines changed: 2 additions & 1 deletion
@@ -9,7 +9,8 @@
 - README: clarified installation modes (standard, editable, and NLP extras), tightened wording, and refreshed badges/links.
 - Setup: improved guidance printed by `setup.sh` after environment creation; clarified quick start steps.
 - Requirements: synced with current packaging to ensure local venv installs match `pyproject.toml` expectations.
-- No behavior changes to the cleaner; tests unchanged.
+- Normalization: fold fullwidth square brackets 【】 to ASCII [] by default; add `--keep-fullwidth-brackets` to preserve them; dagger `†` remains untouched.
+- Minor behavior change: default folding of fullwidth brackets; use the new flag to opt out.
 
 ## 2025-09-18

README.md

Lines changed: 4 additions & 1 deletion
@@ -102,7 +102,7 @@ Once installed and activated:
 ```bash
 (LLaSA-speech) [unixwzrd@xanax: bin]$ cleanup-text --help
 
-usage: cleanup-text [-h] [-i] [-Q] [-D] [-n] [-o OUTPUT] [-t] [-p] [infile ...]
+usage: cleanup-text [-h] [-i] [-Q] [-D] [--keep-fullwidth-brackets] [-n] [-o OUTPUT] [-t] [-p] [infile ...]
 
 Clean Unicode quirks from text. If no input files are given, reads from STDIN and writes to STDOUT (filter mode). If input files are given, creates cleaned files with .clean before the extension (e.g., foo.txt -> foo.clean.txt). Use -o - to force output to STDOUT for all input files, or -o <file> to specify a single output file (only with one
 input file).
@@ -116,6 +116,8 @@ options:
   -Q, --keep-smart-quotes
                         Preserve Unicode smart quotes (do not convert to ASCII)
   -D, --keep-dashes     Preserve Unicode EN/EM dashes (do not convert to ASCII)
+  --keep-fullwidth-brackets
+                        Preserve fullwidth square brackets (【】) (do not fold to ASCII)
   -n, --no-newline      Do not add a newline at the end of the output file (suppress final newline).
   -o OUTPUT, --output OUTPUT
                         Output file name, or '-' for STDOUT. Only valid with one input file, or use '-' for STDOUT with multiple files.
@@ -127,6 +129,7 @@ options:
 
 - `-Q`, `--keep-smart-quotes`: Preserve Unicode smart quotes (curly single/double quotes). Useful when preparing prose/blog posts where typographic quotes are intentional. Default behavior converts them to ASCII for shell/CI safety.
 - `-D`, `--keep-dashes`: Preserve EN/EM dashes. Useful when stylistic punctuation is desired in prose. Default behavior converts EM dash to ` - ` and EN dash to `-`.
+- `--keep-fullwidth-brackets`: Preserve fullwidth square brackets (`【】`). By default, they are folded to ASCII `[]` to keep monospace alignment in terminals and fixed-width tables.
 - `-R`, `--report`: Audit text for anomalies, human-readable.
 - `-J`, `--json`: Audit text for anomalies, JSON format.
 - `-T`, `--threshold`: Fail CI if anomalies exceed threshold.

docs/cleanup-text.md

Lines changed: 14 additions & 13 deletions
@@ -1,6 +1,6 @@
-# Unicode Text Cleaner (`cleanup-text`) - v1.1.0
+# Unicode Text Cleaner (`cleanup-text`) - v1.1.2
 
-*Last updated: 2025-09-18*
+*Last updated: 2025-11-15*
 
 A robust command-line tool to normalize and clean problematic Unicode characters, invisible characters, and formatting quirks from text files. Designed to make code and text more human, linter-friendly, and free of "AI tells" or watermarks.
 
@@ -85,17 +85,18 @@ tests/test_all.sh --help
 - Use the `-i` flag if you need to preserve invisible Unicode characters for special use cases.
 - Use the `-n` flag if you need to suppress the final newline (rare).
 
-## TODO (Alignment & Bracket Normalization)
-
-- Add optional folding of fullwidth square brackets to ASCII in `unicodefix.transforms.clean_text`:
-  - Map `【`→`[` and `】`→`]` under a new flag (e.g., `preserve_fullwidth_brackets: bool = False`).
-  - Preserve dagger glyph `†` and inline spans (e.g., `†L147-L156`).
-  - Rationale: terminal table alignment (fixed-width) and monospace column layout can drift with fullwidth characters.
-- Consider expanding flags to preserve typographic punctuation while still removing invisible/control chars:
-  - Existing: `preserve_quotes`, `preserve_dashes`
-  - Proposed: `preserve_fullwidth_brackets`, `preserve_fullwidth_variants`
-- Provide helper for ASCII-only display normalization for terminals while retaining original text for auditing/search.
-- Document patterns: (1) global pre-clean before render, (2) render-time folding behind a toggle.
+## Alignment & Fullwidth Brackets
+
+- Fullwidth square brackets are now folded to ASCII by default to preserve monospace alignment in terminals and fixed-width tables:
+  - `【`→`[`, `】`→`]`
+- Use `--keep-fullwidth-brackets` to preserve `【】`.
+- The dagger glyph `†` (e.g., `†L147-L156`) is preserved.
+- A small helper exists for display-only folding:
+  - `unicodefix.transforms.fold_for_terminal_display(text)` applies the same folding without other cleaning.
+  - Useful when you want ASCII rendering for terminals while keeping the original text for auditing/search.
+- Patterns:
+  1) Global pre-clean before render (default behavior).
+  2) Render-time folding behind a toggle (use the helper).
 
 ## Changelog
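The two patterns listed in the new docs section can be sketched as follows. This is a minimal, self-contained illustration rather than the project's code: `fold_for_display` is a local stand-in that mirrors the `fold_for_terminal_display` helper shown in this commit, and `render_row` with its `fold` toggle is a hypothetical renderer invented for the example.

```python
# Local stand-in mirroring fold_for_terminal_display from this commit's diff
# (assumption: only U+3010 and U+3011 are folded, as in the diff).
FOLD_TABLE = str.maketrans({"\u3010": "[", "\u3011": "]"})


def fold_for_display(text: str) -> str:
    """Fold the two bracket glyphs to ASCII; leave everything else alone."""
    return text.translate(FOLD_TABLE)


def render_row(label: str, value: str, fold: bool = True) -> str:
    """Hypothetical renderer: pattern 2, folding at render time behind a toggle."""
    shown = fold_for_display(label) if fold else label
    # Left-align the label in a 12-character column.
    return f"{shown:<12}|{value}"


# The original string is kept untouched for auditing/search.
original = "\u301012\u2020L1\u3011"  # 【12†L1】

# Pattern 1: global pre-clean before render (fold once, store the result).
precleaned = fold_for_display(original)

# Pattern 2: keep the original, fold only when rendering.
row = render_row(original, "ok")
```

With pattern 2 the stored text still contains `【` and `】`, so a search for the original glyphs keeps working while the terminal output uses ASCII brackets.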

docs/test-suite.md

Lines changed: 2 additions & 2 deletions
@@ -1,6 +1,6 @@
-# Test Suite for cleanup-text - v1.1.0
+# Test Suite for cleanup-text - v1.1.2
 
-*Last updated: 2025-09-18*
+*Last updated: 2025-11-15*
 
 ## Overview

src/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -1,2 +1,2 @@
 __all__ = ["clean_text", "handle_newlines"]
-__version__ = "1.1.0"
+__version__ = "1.1.2"

src/unicodefix/cli.py

Lines changed: 5 additions & 0 deletions
@@ -103,6 +103,7 @@ def run_filter_mode(args) -> None:
         preserve_invisible=args.invisible,
         preserve_quotes=args.keep_smart_quotes,
         preserve_dashes=args.keep_dashes,
+        preserve_fullwidth_brackets=args.keep_fullwidth_brackets,
     )
     cleaned = handle_newlines(cleaned, args.no_newline)
     # VSCode quirk: append only to stdout
@@ -142,6 +143,7 @@ def process_file(infile: str, args) -> None:
         preserve_invisible=args.invisible,
         preserve_quotes=args.keep_smart_quotes,
         preserve_dashes=args.keep_dashes,
+        preserve_fullwidth_brackets=args.keep_fullwidth_brackets,
     )
     cleaned = handle_newlines(cleaned, args.no_newline)
     _write_text(infile, cleaned, eol)
@@ -167,6 +169,7 @@ def process_file(infile: str, args) -> None:
         preserve_invisible=args.invisible,
         preserve_quotes=args.keep_smart_quotes,
         preserve_dashes=args.keep_dashes,
+        preserve_fullwidth_brackets=args.keep_fullwidth_brackets,
     )
     cleaned = handle_newlines(cleaned, args.no_newline)
 
@@ -202,6 +205,8 @@ def main():
                         help="Preserve Unicode smart quotes")
     parser.add_argument("-D", "--keep-dashes", action="store_true",
                         help="Preserve Unicode EN/EM dashes")
+    parser.add_argument("--keep-fullwidth-brackets", action="store_true",
+                        help="Preserve fullwidth square brackets (【】)")
     parser.add_argument("-n", "--no-newline", action="store_true",
                         help="Do not add a final newline")
     parser.add_argument("-o", "--output",
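To show how the new flag travels from the command line to the keyword argument, here is a small self-contained sketch. It assumes nothing beyond standard `argparse` behavior: `build_parser` and `clean_text_stub` are hypothetical names for this example, and the stub implements only the bracket folding, not the real `clean_text`.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser containing only the flag added in this commit."""
    parser = argparse.ArgumentParser(prog="cleanup-text")
    parser.add_argument("--keep-fullwidth-brackets", action="store_true",
                        help="Preserve fullwidth square brackets")
    return parser


def clean_text_stub(text: str, preserve_fullwidth_brackets: bool = False) -> str:
    """Stand-in for clean_text: folds U+3010/U+3011 unless told to preserve them."""
    if preserve_fullwidth_brackets:
        return text
    return text.translate(str.maketrans({"\u3010": "[", "\u3011": "]"}))


# argparse turns "--keep-fullwidth-brackets" into args.keep_fullwidth_brackets;
# with action="store_true" the attribute defaults to False.
default_args = build_parser().parse_args([])
keep_args = build_parser().parse_args(["--keep-fullwidth-brackets"])

sample = "\u3010note\u3011"
folded = clean_text_stub(
    sample, preserve_fullwidth_brackets=default_args.keep_fullwidth_brackets)
kept = clean_text_stub(
    sample, preserve_fullwidth_brackets=keep_args.keep_fullwidth_brackets)
```

Because the flag is opt-out, existing invocations without it start folding the brackets after this commit, which matches the "minor behavior change" noted in the changelog.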

src/unicodefix/transforms.py

Lines changed: 23 additions & 0 deletions
@@ -23,6 +23,7 @@ def clean_text(
     preserve_invisible: bool = False,
     preserve_quotes: bool = False,
     preserve_dashes: bool = False,
+    preserve_fullwidth_brackets: bool = False,
 ) -> str:
     """
     Normalize problematic/invisible Unicode to safe ASCII where appropriate.
@@ -62,6 +63,15 @@ def clean_text(
         text = re.sub(r"\s*\u2014\s*", " - ", text)  # EM → space-dash-space
         text = text.replace("\u2013", "-")  # EN → dash
 
+    # Fold select fullwidth punctuation that affects monospace alignment
+    if not preserve_fullwidth_brackets:
+        FULLWIDTH_FOLD = {
+            "\u3010": "[",  # 【
+            "\u3011": "]",  # 】
+        }
+        if any(ch in text for ch in FULLWIDTH_FOLD):
+            text = text.translate(str.maketrans(FULLWIDTH_FOLD))
+
     # Zs separators → ASCII space
     text = re.sub(r"[\u00A0\u1680\u2000-\u200A\u202F\u205F\u3000]", " ", text)
 
@@ -89,3 +99,16 @@ def handle_newlines(text: str, no_newline: bool = False) -> str:
     if no_newline:
         return text
     return text if text.endswith(("\n", "\r", "\r\n")) else text + "\n"
+
+
+def fold_for_terminal_display(text: str) -> str:
+    """
+    Fold a minimal set of width-breaking Unicode punctuation for better terminal alignment.
+    - Fullwidth square brackets 【】 → ASCII [].
+    - Intentionally does not touch † (dagger) and similar glyphs.
+    """
+    mapping = {
+        "\u3010": "[",  # 【
+        "\u3011": "]",  # 】
+    }
+    return text.translate(str.maketrans(mapping))
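The alignment rationale behind this folding can be checked with the standard library. The sketch below is not part of the commit: `display_columns` is a hypothetical helper that gives only a rough column estimate (it ignores combining marks and ambiguous-width characters), and `fold_brackets` is a local copy of the mapping from the diff. Note that Unicode names U+3010 and U+3011 "black lenticular brackets"; the code points formally named fullwidth square brackets (U+FF3B, U+FF3D) are not in the mapping shown in this diff.

```python
import unicodedata


def display_columns(text: str) -> int:
    """Rough terminal width: East Asian Wide (W) and Fullwidth (F) take two cells."""
    return sum(
        2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
        for ch in text
    )


def fold_brackets(text: str) -> str:
    """Local copy of the mapping used by fold_for_terminal_display in the diff."""
    return text.translate(str.maketrans({"\u3010": "[", "\u3011": "]"}))


before = "\u3010ref\u3011"     # 【ref】
after = fold_brackets(before)  # [ref]

# Same number of code points, different number of terminal cells.
same_length = len(before) == len(after)
cells_before = display_columns(before)  # 2 + 3 + 2
cells_after = display_columns(after)    # 1 + 3 + 1

# Official Unicode name of the left bracket handled by the mapping.
left_name = unicodedata.name("\u3010")
```

A table that pads cells by `len()` would therefore drift by two columns for every bracketed span left unfolded, which is the misalignment the commit targets.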
