Clean Unicode quirks from text. If no input files are given, reads from STDIN and writes to STDOUT (filter mode). If input files are given, creates cleaned files with .clean before the extension (e.g., foo.txt -> foo.clean.txt). Use -o - to force output to STDOUT for all input files, or -o <file> to specify a single output file (only with one input file).
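The output naming rule described in the help text (insert `.clean` before the extension) can be sketched in a few lines. This is a minimal illustration, not the tool's actual code: the helper name `clean_output_path` is hypothetical, and it assumes "the extension" means only the final suffix.

```python
from pathlib import Path


def clean_output_path(src: str) -> Path:
    """Return a sibling path with '.clean' inserted before the last extension.

    Hypothetical helper for illustration; the real tool may treat
    multi-part or missing extensions differently.
    """
    p = Path(src)
    # Path.stem drops only the final suffix: 'foo.txt' -> stem 'foo', suffix '.txt'.
    return p.with_name(f"{p.stem}.clean{p.suffix}")


if __name__ == "__main__":
    print(clean_output_path("foo.txt").name)
```

Under this assumption a name such as `a.b.md` becomes `a.b.clean.md`, since only the last suffix is treated as the extension.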
@@ -116,6 +116,8 @@ options:
   -Q, --keep-smart-quotes
                         Preserve Unicode smart quotes (do not convert to ASCII)
   -D, --keep-dashes     Preserve Unicode EN/EM dashes (do not convert to ASCII)
+  --keep-fullwidth-brackets
+                        Preserve fullwidth square brackets (【】) (do not fold to ASCII)
   -n, --no-newline      Do not add a newline at the end of the output file (suppress final newline).
   -o OUTPUT, --output OUTPUT
                         Output file name, or '-' for STDOUT. Only valid with one input file, or use '-' for STDOUT with multiple files.
@@ -127,6 +129,7 @@ options:
 
 - `-Q`, `--keep-smart-quotes`: Preserve Unicode smart quotes (curly single/double quotes). Useful when preparing prose/blog posts where typographic quotes are intentional. Default behavior converts them to ASCII for shell/CI safety.
 - `-D`, `--keep-dashes`: Preserve EN/EM dashes. Useful when stylistic punctuation is desired in prose. Default behavior converts EM dash to ` - ` and EN dash to `-`.
+- `--keep-fullwidth-brackets`: Preserve fullwidth square brackets (`【】`). By default, they are folded to ASCII `[]` to keep monospace alignment in terminals and fixed-width tables.
 - `-R`, `--report`: Audit text for anomalies, human-readable.
 - `-J`, `--json`: Audit text for anomalies, JSON format.
 - `-T`, `--threshold`: Fail CI if anomalies exceed threshold.
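The default folding that `--keep-dashes` and `--keep-fullwidth-brackets` switch off can be approximated as a character translation table. The sketch below is an assumption-laden illustration rather than the project's implementation: `fold_punctuation` and its parameter names are hypothetical, and only the mappings stated in the option descriptions are included (EM dash to ` - `, EN dash to `-`, `【】` to `[]`). Unicode names the two bracket characters LEFT and RIGHT BLACK LENTICULAR BRACKET (U+3010, U+3011).

```python
# Hypothetical mapping tables; the real tool's internals may differ.
DASHES = {"\u2014": " - ", "\u2013": "-"}   # EM dash, EN dash
BRACKETS = {"\u3010": "[", "\u3011": "]"}   # U+3010 and U+3011


def fold_punctuation(text: str,
                     keep_dashes: bool = False,
                     keep_fullwidth_brackets: bool = False) -> str:
    """Fold dashes and lenticular brackets to ASCII unless a keep flag is set."""
    table = {}
    if not keep_dashes:
        table.update(DASHES)
    if not keep_fullwidth_brackets:
        table.update(BRACKETS)
    # str.maketrans turns the single-character keys into code points for translate().
    return text.translate(str.maketrans(table))


if __name__ == "__main__":
    print(fold_punctuation("\u3010note\u3011 pages 1\u20132"))
```

Characters outside the table, such as the dagger `†`, pass through untouched. This sketch does not collapse the extra spaces that appear when an EM dash already has spaces around it; how the actual tool handles that case is not shown in this diff.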
docs/cleanup-text.md: 14 additions & 13 deletions
@@ -1,6 +1,6 @@
-# Unicode Text Cleaner (`cleanup-text`) - v1.1.0
+# Unicode Text Cleaner (`cleanup-text`) - v1.1.2
 
-*Last updated: 2025-09-18*
+*Last updated: 2025-11-15*
 
 A robust command-line tool to normalize and clean problematic Unicode characters, invisible characters, and formatting quirks from text files. Designed to make code and text more human, linter-friendly, and free of "AI tells" or watermarks.
 
@@ -85,17 +85,18 @@ tests/test_all.sh --help
 - Use the `-i` flag if you need to preserve invisible Unicode characters for special use cases.
 - Use the `-n` flag if you need to suppress the final newline (rare).
 
-## TODO (Alignment & Bracket Normalization)
-
-- Add optional folding of fullwidth square brackets to ASCII in `unicodefix.transforms.clean_text`:
-  - Map `【` → `[` and `】` → `]` under a new flag (e.g., `preserve_fullwidth_brackets: bool = False`).
-  - Preserve dagger glyph `†` and inline spans (e.g., `†L147-L156`).
-  - Rationale: terminal table alignment (fixed-width) and monospace column layout can drift with fullwidth characters.
-- Consider expanding flags to preserve typographic punctuation while still removing invisible/control chars:
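The alignment rationale in the removed TODO can be checked with the standard library: `unicodedata.east_asian_width` reports `【` and `】` as Wide, so most terminals draw each in two cells while ASCII `[` and `]` take one. The helper below is a rough sketch under that assumption; `display_width` is a hypothetical name, and it ignores combining marks and control characters.

```python
import unicodedata


def display_width(text: str) -> int:
    """Approximate terminal columns: Wide/Fullwidth characters count as two cells."""
    return sum(
        2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
        for ch in text
    )


if __name__ == "__main__":
    for cell in ("[ok]", "\u3010ok\u3011"):
        print(repr(cell), display_width(cell))
```

A table cell padded by character count would therefore come out two columns too wide when it contains the bracket pair, which is the drift that folding to ASCII avoids.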