
unicode


Implementations of various Unicode® Standard Annexes in Go.

This repository provides Go packages for Unicode text processing algorithms, organized by UAX (Unicode Standard Annex) specification.

Packages

uax9 - Bidirectional Algorithm

Implementation of UAX #9 (Unicode Bidirectional Algorithm) for handling bidirectional text that mixes LTR and RTL scripts.

Status: Complete with 100% conformance (513,494/513,494 tests passing)

Supports:

  • Full bidirectional text reordering - Proper display of mixed LTR/RTL content
  • Isolating run sequences (BD13) - Advanced context isolation for complex layouts
  • Explicit formatting characters - LRE, RLE, LRO, RLO, PDF, LRI, RLI, FSI, PDI
  • Deep embedding nesting - Up to 125 levels of explicit embedding
  • Bracket pair handling (N0) - Proper neutral character resolution
  • Automatic direction detection - Smart paragraph base direction
import "github.com/SCKelemen/unicode/uax9"

// Reorder mixed LTR/RTL text
text := "Hello שלום world"
result := uax9.Reorder(text, uax9.DirectionLTR)

// Auto-detect paragraph direction
dir := uax9.GetParagraphDirection("שלום עולם")  // Returns DirectionRTL

// Get bidi class of a character
class := uax9.GetBidiClass('א')  // Returns R (Right-to-Left)

uax11 - East Asian Width

Implementation of UAX #11 (East Asian Width) for determining character display width in East Asian typography contexts.

Status: Complete with comprehensive test coverage

Supports:

  • East Asian Width property lookup (Ambiguous, Fullwidth, Halfwidth, Narrow, Neutral, Wide)
  • Context-based width resolution for ambiguous characters
  • Character and string display width calculation
  • Terminal emulator and monospace font support
  • Complete Unicode 17.0.0 data
import "github.com/SCKelemen/unicode/uax11"

// Determine character width
width := uax11.LookupWidth('中')  // Returns Wide
if uax11.IsWide('中') {
    // Character occupies 2 display cells
}

// Calculate string display width
cols := uax11.StringWidth("Hello世界", uax11.ContextNarrow)  // Returns 9
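
For ambiguous characters such as Greek letters, the resolved width depends on the context argument. A minimal sketch, assuming a ContextWide counterpart to the ContextNarrow constant shown above:

// 'α' (GREEK SMALL LETTER ALPHA) has East_Asian_Width = Ambiguous, so its
// resolved width varies with rendering context (ContextWide is assumed here).
narrow := uax11.StringWidth("α", uax11.ContextNarrow)  // 1 column outside East Asian contexts
wide := uax11.StringWidth("α", uax11.ContextWide)      // 2 columns in East Asian contexts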

uax14 - Line Breaking Algorithm

Implementation of UAX #14 (Unicode Line Breaking Algorithm) for finding valid line break opportunities in text.

Status: Complete with 100% conformance (19,338/19,338 tests passing)

Note: This code was originally implemented in github.com/SCKelemen/layout and has been extracted to a standalone package for reusability.

Supports:

  • Word boundaries and spaces
  • Mandatory breaks (newlines)
  • Configurable hyphenation (none, manual, auto)
  • CJK ideographic text
  • Punctuation and numeric sequences
import "github.com/SCKelemen/unicode/uax14"

text := "Hello world! This is a test."
breaks := uax14.FindLineBreakOpportunities(text, uax14.HyphensManual)
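
The break positions can then drive a layout pass. A rough sketch of greedy wrapping, assuming FindLineBreakOpportunities returns candidate break offsets into the string in ascending byte order (the concrete return type is not shown here):

// Greedy wrap: break at the last opportunity that still fits the line.
const maxLineBytes = 20 // caller-chosen limit; a real layout pass would measure display width
var lines []string
lineStart, prev := 0, 0
for _, b := range breaks {
    if b-lineStart > maxLineBytes && prev > lineStart {
        lines = append(lines, text[lineStart:prev])
        lineStart = prev
    }
    prev = b
}
lines = append(lines, text[lineStart:])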

uax24 - Script Property

Implementation of UAX #24 (Unicode Script Property) for identifying the writing system (script) to which a character belongs.

Status: Complete with 100% conformance (159,866/159,866 tests passing)

Supports:

  • Script property lookup for all Unicode 17.0.0 characters
  • 174 scripts including Latin, Greek, Cyrillic, Han, Arabic, Hebrew, and many others
  • Mixed-script detection for security validation
  • Common and Inherited script handling
  • Single-script string validation
import "github.com/SCKelemen/unicode/uax24"

// Get the script of a character
script := uax24.LookupScript('A')      // Returns ScriptLatin
script = uax24.LookupScript('中')      // Returns ScriptHan
script = uax24.LookupScript('5')       // Returns ScriptCommon

// Check if character belongs to a specific script
if uax24.IsLatin('A') {
    // Character is Latin
}

// Analyze a string for script composition
info := uax24.AnalyzeScripts("Hello мир")
fmt.Printf("Scripts: %v\n", info.Scripts)        // [Latin Cyrillic]
fmt.Printf("Mixed: %v\n", info.IsMixedScript)    // true

// Security: Detect homograph attacks
if !uax24.IsSingleScript("myVariаble") {  // 'а' is Cyrillic
    // Warning: Mixed scripts detected
}

uax29 - Text Segmentation

Implementation of UAX #29 (Unicode Text Segmentation) for breaking text into grapheme clusters, words, and sentences.

Status: Complete with 100% conformance on all official Unicode tests

Supports:

  • Grapheme cluster boundaries (100.0% - 766/766 tests)

    • User-perceived characters, emoji sequences, combining marks
    • Hangul syllable composition
    • Regional indicator pairs (flag emojis)
    • Indic conjunct sequences for 10+ scripts
  • Word boundaries (100.0% - 1944/1944 tests)

    • Alphabetic and numeric sequences
    • Contractions, punctuation, hyphenated words
    • Hebrew letter handling, Katakana sequences
    • Emoji modifiers and ZWJ sequences
  • Sentence boundaries (100.0% - 512/512 tests)

    • Period, question mark, exclamation handling
    • Abbreviation detection, quote and parenthesis handling
    • Multi-script sentence terminators
import "github.com/SCKelemen/unicode/uax29"

// Grapheme clusters
graphemes := uax29.Graphemes("👨‍👩‍👧‍👦")  // Returns ["👨‍👩‍👧‍👦"]

// Words
words := uax29.Words("Hello, world!")  // Returns ["Hello", ",", " ", "world", "!"]

// Sentences
sentences := uax29.Sentences("Hello. World!")  // Returns ["Hello. ", "World!"]

// Single-pass API - get all three break types at once
breaks := uax29.FindAllBreaks("Hello, world!")
for _, pos := range breaks.Graphemes {
    // Process grapheme boundaries
}
for _, pos := range breaks.Words {
    // Process word boundaries
}
for _, pos := range breaks.Sentences {
    // Process sentence boundaries
}
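
A common application of grapheme segmentation is counting user-perceived characters, which differs from both byte length and rune count:

// Requires "unicode/utf8" from the standard library for the rune count.
s := "👨‍👩‍👧‍👦\u00E9" // family emoji + composed é
fmt.Println(len(s))                    // 27: UTF-8 bytes
fmt.Println(utf8.RuneCountInString(s)) // 8: code points (the three ZWJs count separately)
fmt.Println(len(uax29.Graphemes(s)))   // 2: user-perceived characters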

uax31 - Identifier and Pattern Syntax

Implementation of UAX #31 (Unicode Identifier and Pattern Syntax) for determining valid identifier characters in programming languages and pattern-based systems.

Status: Complete with 100% conformance (297,981/297,981 tests passing)

Supports:

  • XID_Start property - Characters valid at the start of an identifier
    • Letters, ideographs, letter numbers across all scripts
    • Binary search for O(log n) lookups
  • XID_Continue property - Characters valid after the first character
    • XID_Start plus marks, digits, connector punctuation
    • Includes zero-width joiner and combining marks
  • Pattern_Syntax property - Reserved characters for pattern languages
    • ASCII punctuation and mathematical symbols
    • Used to identify syntactic elements
  • Pattern_White_Space property - Whitespace in patterns
    • Spaces, tabs, line breaks for pattern tokenization
  • Default Identifier Syntax - Complete identifier validation
    • Pattern: <XID_Start> <XID_Continue>*
    • Stable across Unicode versions
import "github.com/SCKelemen/unicode/uax31"

// Check if character can start an identifier
if uax31.IsXIDStart('A') {
    // Valid identifier start (letters, ideographs)
}

// Check if character can continue an identifier
if uax31.IsXIDContinue('5') {
    // Valid after first character (includes digits, marks)
}

// Validate complete identifier
if uax31.IsValidIdentifier("myVar123") {
    // Valid: starts with letter, continues with letters/digits
}

// Pattern syntax detection
if uax31.IsPatternSyntax('*') {
    // Reserved for pattern languages (regex, etc.)
}

// Programming language tokenization example
func isIdentifierChar(r rune, isFirst bool) bool {
    if isFirst {
        return uax31.IsXIDStart(r)
    }
    return uax31.IsXIDContinue(r)
}

// Security: Validate identifiers for safety
identifier := "user_name"
if uax31.IsValidIdentifier(identifier) {
    // Identifier follows Unicode standard
}
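
Putting the predicates together, here is a sketch of a scanner that pulls identifiers out of arbitrary text (extractIdentifiers is a hypothetical helper, not part of the package):

// Extract maximal runs matching <XID_Start> <XID_Continue>*.
func extractIdentifiers(src string) []string {
    var ids []string
    var cur []rune
    for _, r := range src {
        switch {
        case len(cur) == 0 && uax31.IsXIDStart(r):
            cur = append(cur, r) // identifier starts here
        case len(cur) > 0 && uax31.IsXIDContinue(r):
            cur = append(cur, r) // identifier continues
        default:
            if len(cur) > 0 {
                ids = append(ids, string(cur))
                cur = cur[:0]
            }
        }
    }
    if len(cur) > 0 {
        ids = append(ids, string(cur))
    }
    return ids
}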

uax50 - Vertical Text Layout

Implementation of UAX #50 (Unicode Vertical Text Layout) for determining character orientation in vertical text.

Status: Complete with comprehensive test coverage

Supports:

  • Vertical orientation property lookup (Rotated, Upright, TransformedUpright, TransformedRotated)
  • Character rotation determination for vertical text
  • Glyph transformation detection for vertical-specific forms
  • Complete Unicode 17.0.0 data
  • East Asian typography and mixed-script vertical layouts
import "github.com/SCKelemen/unicode/uax50"

// Determine how to display characters in vertical text
orientation := uax50.LookupOrientation('中')  // Returns Upright
if uax50.IsUpright('A') {
    // Display upright
} else {
    // Rotate 90 degrees clockwise
}
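
In a vertical layout loop, the orientation property decides per character whether to rotate the glyph. A sketch, where drawUpright and drawRotated are placeholder rendering calls, not part of this package:

// Lay out mixed CJK/Latin text in a vertical column.
for _, r := range "日本語abc" {
    if uax50.IsUpright(r) {
        drawUpright(r)  // CJK stays upright
    } else {
        drawRotated(r)  // Latin rotates 90° clockwise
    }
}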

uts51 - Unicode Emoji

Implementation of UTS #51 (Unicode Emoji) for emoji property detection, sequence validation, and terminal rendering support.

Status: Complete with 100% conformance (5,223/5,223 tests passing)

Supports:

  • Emoji properties - All 6 core emoji properties
    • Emoji, Emoji_Presentation, Emoji_Modifier
    • Emoji_Modifier_Base, Emoji_Component, Extended_Pictographic
  • Sequence validation - All emoji sequence types
    • ZWJ sequences (family emoji, etc.)
    • Modifier sequences (skin tones)
    • Flag sequences (regional indicators)
    • Keycap sequences (#️⃣, *️⃣, 0️⃣-9️⃣)
    • Tag sequences (subdivision flags)
  • Terminal rendering - Width calculation for emoji display
  • Integration with UAX #11, #14, #29, #50
import "github.com/SCKelemen/unicode/uts51"

// Check if character is emoji
if uts51.IsEmoji('😀') {
    // Handle emoji
}

// Calculate width for terminal rendering
width := uts51.EmojiWidth('😀')  // Returns 2 (like CJK characters)

// Validate emoji sequences
sequence := []rune{0x1F468, 0x200D, 0x1F469, 0x200D, 0x1F467}  // Family
if uts51.IsValidEmojiSequence(sequence) {
    // Valid ZWJ sequence
}

uts15 - Unicode Normalization Forms

Implementation of UTS #15 (Unicode Normalization Forms) for text normalization, comparison, and canonicalization.

Status: Complete with 100% conformance (20,034/20,034 tests passing)

Supports:

  • NFC (Canonical Composition) - Recommended form for most uses
  • NFD (Canonical Decomposition) - Fully decomposed form
  • NFKC (Compatibility Composition) - Aggressive normalization for identifiers
  • NFKD (Compatibility Decomposition) - Fully compatibility decomposed
  • Hangul composition/decomposition - Algorithmic Hangul syllable handling
  • Canonical ordering - Proper combining mark ordering
  • Normalization stability - Idempotent operations
  • Complete Unicode 17.0.0 normalization data
import "github.com/SCKelemen/unicode/uts15"

// Normalize to NFC (recommended for most uses)
text := "café"  // May be composed or decomposed
normalized := uts15.NFC(text)

// Compare strings reliably
s1 := "café"  // Composed form
s2 := "cafe\u0301"  // Decomposed form (e + combining accent)
if uts15.NFC(s1) == uts15.NFC(s2) {
    // Strings are equivalent
}

// Normalize for searching (NFKC removes formatting distinctions)
query := "\uFB01le"  // Contains fi ligature
normalized = uts15.NFKC(query)  // "file"

// Check if already normalized
if uts15.IsNFC("café") {
    // No normalization needed
}
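
Because canonically equivalent strings can differ byte-for-byte, lookups keyed by user text should normalize on both insert and query. A minimal sketch:

// Normalize keys so composed and decomposed forms hit the same entry.
index := map[string]int{}
index[uts15.NFC("café")] = 42 // stored under the composed key
if _, found := index[uts15.NFC("cafe\u0301")]; found {
    // decomposed query still matches
}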

uts39 - Unicode Security Mechanisms

Implementation of UTS #39 (Unicode Security Mechanisms) for detecting and preventing security issues from confusable characters and mixed scripts.

Status: Complete with 100% conformance (6,565/6,565 confusable mappings verified)

Supports:

  • Confusable detection - Skeleton algorithm for visual similarity
    • Identifies lookalike characters (e.g., Cyrillic 'а' vs Latin 'a')
    • Case-insensitive confusable matching
    • 6,565 confusable mappings from Unicode 17.0.0
  • Mixed-script detection - Identifies suspicious script mixing
    • Single-script, mixed-script, and cross-script analysis
    • Script-specific security policies
  • Restriction levels - Security profiles for identifiers
    • ASCII-Only: Strictest, ASCII characters only
    • Single-Script: One script (excluding Common/Inherited)
    • Highly-Restrictive: Single script + Common + Inherited
    • Moderately-Restrictive: Multiple allowed script combinations
    • Minimally-Restrictive: Latin + one other script
    • Unrestricted: Any character combination
  • Safe identifier validation - Checks for security issues
    • Invalid invisible characters
    • Proper identifier structure (UAX #31)
    • Minimum restriction level enforcement
import "github.com/SCKelemen/unicode/uts39"

// Detect confusable strings (homograph attacks)
if uts39.AreConfusable("paypal", "pаypal") {  // Second uses Cyrillic 'а'
    // Warning: visually similar but different strings
}

// Get skeleton for comparison
skel := uts39.Skeleton("Hello")

// Check restriction level
level := uts39.GetRestrictionLevel("user_name")
if level >= uts39.HighlyRestrictive {
    // Identifier meets security requirements
}

// Detect mixed scripts
if uts39.IsMixedScript("hello мир") {  // Latin + Cyrillic
    // Warning: mixed script identifier
}

// Validate identifier safety
if uts39.IsSafeIdentifier("user_name") {
    // Safe: valid identifier, highly restrictive, no invisible chars
}

// Security validation example
func validateUsername(username string) error {
    if !uts39.IsValidIdentifier(username) {
        return errors.New("invalid identifier format")
    }

    level := uts39.GetRestrictionLevel(username)
    if level < uts39.HighlyRestrictive {
        return errors.New("username uses suspicious character mixing")
    }

    return nil
}

Installation

go get github.com/SCKelemen/unicode/uax9
go get github.com/SCKelemen/unicode/uax11
go get github.com/SCKelemen/unicode/uax14
go get github.com/SCKelemen/unicode/uax24
go get github.com/SCKelemen/unicode/uax29
go get github.com/SCKelemen/unicode/uax31
go get github.com/SCKelemen/unicode/uax50
go get github.com/SCKelemen/unicode/uts15
go get github.com/SCKelemen/unicode/uts39
go get github.com/SCKelemen/unicode/uts51

Design Philosophy

These implementations focus on practical text layout and rendering needs:

  • Simple, focused APIs
  • Minimal dependencies (standard library only)
  • Performance-conscious
  • Well-tested
  • Layout-engine agnostic
  • Full conformance with Unicode standards

Version 2.0.0 Performance Improvements

Version 2.0.0 focuses on performance optimization while maintaining 100% conformance with Unicode standards.

Table-Driven Binary Search

All packages now use table-driven O(log n) binary search for character classification, replacing sequential O(n) checks:

  • UAX #9: Bidi class lookup optimized with 3,060 precomputed ranges from DerivedBidiClass.txt
  • UAX #29: Unified packed data structure with 4,673 ranges encoding all three break types (grapheme, word, sentence) in 16-bit format

Performance: Character classification now runs at ~60-100 ns/op with 0 allocations on Apple M4 Pro.
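
The generated tables pair code point ranges with a property value, and lookup is a standard binary search over those ranges. An illustrative sketch of the shape (the actual generated table and field names may differ):

// rangeEntry maps an inclusive code point range to a property value.
type rangeEntry struct {
    lo, hi rune
    class  uint8
}

// lookup returns the class for r, or 0 (the default class) if no range matches.
func lookup(table []rangeEntry, r rune) uint8 {
    lo, hi := 0, len(table)
    for lo < hi {
        mid := (lo + hi) / 2
        switch {
        case r < table[mid].lo:
            hi = mid
        case r > table[mid].hi:
            lo = mid + 1
        default:
            return table[mid].class
        }
    }
    return 0
}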

Generated Unicode Data

All Unicode property data is now generated directly from official Unicode 17.0.0 data files:

  • Download from unicode.org during build
  • Parse property files (DerivedBidiClass.txt, GraphemeBreakProperty.txt, etc.)
  • Generate optimized Go code with binary search tables
  • Ensures correctness and synchronization with Unicode standard

Single-Pass API

UAX #29 provides a new FindAllBreaks() API that computes grapheme, word, and sentence boundaries in a single traversal:

// Before: Three separate passes
graphemes := uax29.FindGraphemeBreaks(text)
words := uax29.FindWordBreaks(text)
sentences := uax29.FindSentenceBreaks(text)

// After: Single pass with shared classification
breaks := uax29.FindAllBreaks(text)
// Use breaks.Graphemes, breaks.Words, breaks.Sentences

This provides a convenient API for applications that need multiple break types, with framework in place for future hierarchical optimization.

Version 3.0.0 Performance Improvements

Version 3.0.0 focuses on hierarchical optimization of the single-pass API introduced in v2.0.0.

Hierarchical Break Detection

The FindAllBreaks() API now implements true hierarchical checking, leveraging the natural subset relationships between break types:

  • Words ⊆ Graphemes: Word breaks only checked at grapheme cluster boundaries
  • Sentences ⊆ Words: Sentence breaks only checked at word boundaries

This eliminates redundant checks and significantly improves performance for applications needing multiple break types.
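
The control flow amounts to a single loop with early continues; the helpers below stand in for the package's internal rule evaluation at each candidate boundary:

// Sketch of hierarchical pruning: each level is only evaluated where the
// coarser level already found a boundary.
for _, pos := range candidates {
    if !isGraphemeBreak(pos) {
        continue // not a grapheme boundary ⇒ cannot be a word boundary
    }
    graphemes = append(graphemes, pos)
    if !isWordBreak(pos) {
        continue // not a word boundary ⇒ cannot be a sentence boundary
    }
    words = append(words, pos)
    if isSentenceBreak(pos) {
        sentences = append(sentences, pos)
    }
}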

Performance Improvements

Benchmark results on Apple M4 Pro comparing v3.0.0 single-pass vs three separate function calls:

Text Length       | v2.0.0 Three Passes | v3.0.0 Single Pass | Speedup
Short (33 chars)  | 3,457 ns/op         | 2,197 ns/op        | 1.57x
Medium (86 chars) | 16,191 ns/op        | 9,636 ns/op        | 1.68x
Long (467 chars)  | 423,491 ns/op       | 188,982 ns/op      | 2.24x

Key benefits:

  • Speedup increases with text length (hierarchical pruning more effective on longer text)
  • Single UTF-8 decode and classification pass
  • Pre-classified data reused across all three break types
  • No additional allocations compared to v2.0.0

Maintained Conformance

100% conformance maintained on all official Unicode test suites:

  • Grapheme: 766/766 tests passing
  • Word: 1,944/1,944 tests passing
  • Sentence: 512/512 tests passing

Version 4.0.0 Performance Improvements

Version 4.0.0 focuses on code quality and maintainability through rule-based state machine architecture.

Rule-Based State Machine Architecture

All break detection algorithms now use clean, rule-based implementations that directly map to the Unicode Standard specifications:

  • BreakContext abstractions: GraphemeBreakContext, WordBreakContext, SentenceBreakContext provide clean navigation APIs
  • Named rule functions: Each Unicode rule (GB3, WB5, SB8, etc.) becomes a named function with clear semantics
  • Declarative rule chains: Rules checked in order with first-match-wins strategy
  • Maintained hierarchical optimization: Words checked only at grapheme boundaries, sentences only at word boundaries

This architecture dramatically improves:

  1. Readability: Rules directly match Unicode Standard specification
  2. Maintainability: Easy to understand, modify, and extend
  3. Debuggability: Each rule can be tested and traced independently
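
The shape of a rule function and its chain looks roughly like this (signatures and class constants are assumptions; the real API lives in context.go and grapheme_rules.go):

// Each rule reports whether it matched at the current position and, if so,
// whether a break is allowed. The first matching rule wins.
type graphemeRule func(ctx *GraphemeBreakContext) (matched, breaks bool)

// GB3: CR × LF — never break between CR and LF.
func ruleGB3(ctx *GraphemeBreakContext) (matched, breaks bool) {
    return ctx.Before() == gbCR && ctx.After() == gbLF, false
}

// Rules are evaluated in specification order.
var graphemeRules = []graphemeRule{ruleGB3 /*, ruleGB4, ruleGB5, ... */}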

Code Organization

New files implementing the rule-based architecture:

  • context.go - Break context abstractions with navigation methods
  • grapheme_rules.go - Grapheme breaking rules (ruleGB3 through ruleGB12_13)
  • word_rules.go - Word breaking rules (ruleWB3 through ruleWB15_16)
  • sentence_rules.go - Sentence breaking rules (ruleSB3 through ruleSB11)

Performance Analysis

Benchmark results on Apple M4 Pro comparing v4.0.0 rule-based vs v3.0.0 inline:

Single-Pass API:

Text Length       | v3.0.0 Inline | v4.0.0 Rule-Based | Change
Short (33 chars)  | 2,197 ns/op   | 2,717 ns/op       | 1.24x slower
Medium (86 chars) | 9,636 ns/op   | 6,647 ns/op       | 1.45x faster
Long (467 chars)  | 188,982 ns/op | 32,200 ns/op      | 5.87x faster

Rule-based grapheme breaking alone (standalone function):

Text Length       | v3.0.0 Inline | v4.0.0 Rule-Based | Speedup
Short (33 chars)  | 1,882 ns/op   | 1,183 ns/op       | 1.59x
Medium (86 chars) | 8,759 ns/op   | 3,041 ns/op       | 2.88x
Long (467 chars)  | 168,060 ns/op | 15,170 ns/op      | 11.08x

Single-Pass vs Three Separate Passes (v4.0.0):

Text Length       | Single Pass  | Three Separate | Speedup
Short (33 chars)  | 2,717 ns/op  | 3,380 ns/op    | 1.24x
Medium (86 chars) | 6,647 ns/op  | 14,312 ns/op   | 2.15x
Long (467 chars)  | 32,200 ns/op | 239,624 ns/op  | 7.44x

Key findings:

  • Rule-based grapheme breaking provides 1.6-11x speedup over inline implementation
  • Performance improvements increase dramatically with text length
  • Single-pass API maintains significant advantage over three separate calls
  • Medium and long texts benefit most from rule-based architecture

Maintained Conformance

100% conformance maintained on all official Unicode test suites:

  • Grapheme: 766/766 tests passing
  • Word: 1,944/1,944 tests passing
  • Sentence: 512/512 tests passing

Version 5.0.0 Improvements

Version 5.0.0 extends the rule-based state machine architecture from UAX #29 to UAX #14 (Line Breaking Algorithm), achieving 100% conformance and dramatically improved maintainability.

Rule-Based Line Breaking Architecture

UAX #14 now uses a clean, rule-based implementation that directly maps to the Unicode Standard specification:

  • LineBreakContext abstraction: Clean navigation API with helper methods (SkipBackward, FindForward, etc.)
  • Named rule functions: Each Unicode rule (LB4, LB5, LB8, LB21, etc.) becomes a named function
  • Declarative rule chains: Rules checked in order with first-match-wins strategy
  • Pair table fallback: Common cases handled by efficient 2,064-entry lookup table

Code Organization

New architecture improves code organization:

  • Original: 1,112-line monolithic function with complex inline conditionals
  • Rule-based: Isolated, independently testable rule functions with clear documentation
  • Complex rules decomposed: LB21 (hyphen handling) and LB19 (quotation marks) broken into 7+ focused sub-rules

Key files:

  • context.go - LineBreakContext abstraction with navigation methods
  • linebreak_rules.go - Rule-based implementation (59 rule functions, 1,786 lines)
  • Original monolithic implementation retained for comparison and fallback

100% Conformance Achievement

The rule-based implementation passes all official Unicode conformance tests:

UAX #14 (Line Breaking): 19,338/19,338 tests passing (100.0%)

Key fixes for 100% conformance:

  • French guillemet separators: »word« pattern (U+00AB/U+00BB) requiring special break handling
  • German quotes: „…“ and ‚…‘ patterns, where ClassQU_Pi acts as a closing quote
  • Hebrew MAQAF: HL × HH ÷ HL pattern for U+05BE hyphen
  • Regional indicators with combining marks: RI × CM × RI sequences
  • Extended pictographic × emoji modifier: Reserved emoji ranges (U+1F000-U+1FFFD)
  • Rule ordering: guillemet and German quote patterns must be processed before the default quotation rules

Performance Analysis

Benchmark results on Apple M4 Pro comparing rule-based vs original:

Text Length       | Original    | Rule-Based  | Change
Short (10 chars)  | 494 ns/op   | 1,360 ns/op | 2.75x slower
Medium (64 chars) | 3,934 ns/op | 9,374 ns/op | 2.38x slower
Long (45 chars)   | 2,138 ns/op | 5,209 ns/op | 2.44x slower

Trade-off analysis:

  • Rule-based implementation is 2-3x slower due to abstraction overhead
  • Maintainability benefits are significant:
    • Isolated, testable rules directly mapping to spec
    • Clear documentation with spec links for each rule
    • Easy to add new rules without understanding entire state machine
    • Complex rules (LB21, LB19) broken into manageable sub-functions
  • Performance acceptable for text layout applications (thousands of characters per millisecond)

Benefits for Unicode Maintainability

The rule-based architecture provides critical benefits:

  1. Direct spec mapping: Rule functions named after Unicode spec rules (ruleLB4, ruleLB21, etc.)
  2. Independent testing: Each rule can be tested and traced independently
  3. Clear debugging: Rule execution can be logged to understand break decisions
  4. Easy updates: New Unicode versions can add rules without refactoring
  5. Reduced complexity: No massive conditional chains or inline state tracking

This matches the successful pattern from UAX #29 v4.0.0, providing consistency across the codebase.

Maintained Conformance

100% conformance maintained on all official Unicode test suites:

  • Line Breaking: 19,338/19,338 tests passing

Version 6.0.0 Performance Improvements

Version 6.0.0 focuses on memory optimization and ASCII fast paths to dramatically improve performance for common cases while maintaining 100% Unicode conformance.

Type Size Reductions

All Unicode property types now use minimal storage:

  • UTS #15 (Normalization): combiningClassMap changed from map[rune]int to map[rune]uint8

    • Unicode combining classes range from 0 to 240, fitting comfortably in a uint8 (0-255)
    • Memory savings: ~7.75 KB (50% reduction from 15.5 KB to 7.75 KB)
  • UAX #24 (Script Property): Script type changed from int to uint8

    • 176 Unicode scripts fit comfortably in uint8 (0-255)
    • Memory savings: 87.5% per value (8 bytes → 1 byte)
  • UAX #14 (Line Breaking): BreakClass type changed from int to uint8

    • 66 break classes fit in uint8 (0-255)
    • Memory savings: 87.5% per value (8 bytes → 1 byte)

Impact: All runtime structures using these types are 50-87.5% smaller, providing better CPU cache utilization.
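
The change itself is a one-line narrowing of each type declaration; an illustrative shape:

// Before: type Script int — 8 bytes per value on 64-bit platforms.
// After: one byte per value; all script values fit in 0-255.
type Script uint8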

ASCII Fast Paths

Common case optimization: ASCII-only text gets dedicated fast paths with early returns:

UTS #15 (Normalization):

  • Added isASCII() check to NFC, NFD, NFKC, NFKD functions
  • ASCII text is already normalized in all forms
  • Avoids expensive decomposition/composition operations

UTS #39 (Security):

  • ASCII fast paths in IsMixedScript() - ASCII is single-script (Latin)
  • ASCII fast paths in IsSafeIdentifier() - ASCII identifiers only need validation
  • Skips expensive script analysis for common identifiers
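
The fast path itself is a single linear scan for high bits before any Unicode machinery runs. A minimal sketch of the shape (the package's internal helper may differ):

// isASCII reports whether s contains only bytes below 0x80.
func isASCII(s string) bool {
    for i := 0; i < len(s); i++ {
        if s[i] >= 0x80 {
            return false
        }
    }
    return true
}

// In NFC and friends, ASCII input short-circuits: it is already normalized
// in every form, so the string is returned unchanged.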

Performance Benchmarks

Benchmark results on Apple M4 Pro comparing v5.0.0 vs v6.0.0:

UTS #15 (Normalization) - ASCII Fast Path Impact

Operation | ASCII (v6.0.0) | Non-ASCII (v6.0.0) | Speedup     | Improvement
NFC       | 7.68 ns/op     | 995 ns/op          | 129x faster | 12,850%
NFKC      | 7.72 ns/op     | 1,115 ns/op        | 144x faster | 14,340%

🎯 ASCII text normalization is essentially FREE (single isASCII() check)!

UTS #39 (Security) - ASCII Fast Path Impact

Operation        | ASCII (v6.0.0) | Non-ASCII (v6.0.0) | Speedup     | Improvement
IsMixedScript    | 4.18 ns/op     | 142 ns/op          | 34x faster  | 3,300%
IsSafeIdentifier | 74.7 ns/op     | 277 ns/op          | 3.7x faster | 271%

🎯 ASCII security checks are 34x faster!

UTS #39 (Security) - Mixed Unicode Text

Operation           | Before (v5.0.0) | After (v6.0.0) | Change      | Improvement
Skeleton            | 430 ns/op       | 174 ns/op      | -256 ns/op  | 2.5x faster
AreConfusable       | 874 ns/op       | 502 ns/op      | -372 ns/op  | 1.7x faster
GetRestrictionLevel | 5.06 ns/op      | 4.62 ns/op     | -0.44 ns/op | 9% faster

UTS #15 (Normalization) - Mixed Unicode Text

Operation | Before (v5.0.0) | After (v6.0.0) | Change     | Improvement
NFKC      | 5,877 ns/op     | 5,390 ns/op    | -487 ns/op | 8% faster
NFKD      | 3,337 ns/op     | 3,135 ns/op    | -202 ns/op | 6% faster
IsNFC     | 9,918 ns/op     | 9,622 ns/op    | -296 ns/op | 3% faster

Real-World Impact

Typical web application (mostly ASCII identifiers):

  • Variable name validation: 34x faster
  • URL normalization: 129x faster
  • Username security checks: 3.7x faster

International text (mixed Unicode):

  • Confusable detection: 2.5x faster
  • Text normalization: 3-8% faster
  • Security validation: 1.7x faster

Memory Improvements

Component                   | Before        | After        | Savings
combiningClassMap (UTS #15) | ~15.5 KB      | ~7.75 KB     | 50% (7.75 KB)
Script type (UAX #24)       | 8 bytes/value | 1 byte/value | 87.5% (7 bytes)
BreakClass type (UAX #14)   | 8 bytes/value | 1 byte/value | 87.5% (7 bytes)

🎯 All runtime structures using these types are 50-87.5% smaller with better CPU cache density.

Key Benefits

✅ ASCII normalization: 129-144x faster (essentially free)
✅ ASCII security checks: 34x faster
✅ Skeleton algorithm: 2.5x faster for all text
✅ Confusable detection: 1.7x faster for all text
✅ Memory footprint: ~15-20 KB saved, 50-87.5% reduction in type sizes
✅ Conformance: 100% maintained (all 207,333 tests passing)

Design Philosophy

The optimizations excel at what matters most:

  • Common case (ASCII) is blazingly fast (100x+ speedups)
  • Full Unicode support still provides solid improvements (1.7-2.5x)
  • 100% correctness maintained everywhere

Maintained Conformance

100% conformance maintained on all official Unicode test suites:

  • UTS #15: 20,034/20,034 normalization tests passing
  • UAX #24: 159,866/159,866 script property tests passing
  • UTS #39: 6,565/6,565 confusable mappings verified

Unicode Version

This repository implements Unicode 17.0.0 (September 2025).

Why Not Use Go's Standard Library?

Go's unicode package (as of Go 1.23) provides Unicode 15.0.0 data. While it includes some properties we need (e.g., Regional_Indicator, Ideographic, Sentence_Terminal), it is missing:

  • Emoji properties: Extended_Pictographic, Emoji, Emoji_Presentation, Emoji_Modifier, Emoji_Modifier_Base, Emoji_Component
  • Text segmentation properties: Grapheme_Cluster_Break, Word_Break, Sentence_Break
  • Layout properties: East_Asian_Width, Line_Break, Vertical_Orientation

Design Decision: We implement all related properties within each specification package (e.g., all emoji properties in uts51) rather than mixing standard library and custom implementations. This ensures:

  1. Consistency: All properties from a specification come from one authoritative source
  2. Completeness: Unicode 17.0.0 support with the latest emoji and text handling
  3. Maintainability: Single source of truth for each Unicode specification
  4. Testability: 100% conformance against official Unicode 17.0.0 test files

When Go's unicode package updates to Unicode 17.0.0, we will continue maintaining our implementations to provide the specialized properties not available in the standard library.

Conformance

All implementations follow the Unicode Standard and are tested against official Unicode conformance test suites where available:

Test Coverage

  • UAX #9 (Bidirectional Algorithm): 100% conformance (513,494/513,494 tests)
    • All explicit embeddings and isolates
    • Multi-isolate sequences and deep nesting (up to 125 levels)
    • Empty isolate handling and overflow isolation
    • Bracket pair matching and neutral resolution
  • UAX #11 (East Asian Width): Comprehensive test coverage
    • Character width property lookup for all Unicode code points
    • Context-based ambiguous character resolution
    • Display width calculation for strings
    • Terminal emulator compatibility
  • UAX #14 (Line Breaking): 100% conformance (19,338/19,338 tests)
    • All line break classes and combining rules
    • Tailorable break opportunities
    • Complex script handling (CJK, Thai, etc.)
    • Hyphenation support (soft hyphens U+00AD)
  • UAX #24 (Script Property): 100% conformance (159,866/159,866 tests)
    • Script property lookup for all Unicode 17.0.0 characters
    • 174 scripts with ISO 15924 codes
    • Mixed-script detection and validation
    • Common and Inherited script handling
  • UAX #29 (Text Segmentation): 100% conformance (3,222/3,222 tests)
    • Grapheme cluster breaking: 766/766 tests
    • Word breaking: 1,944/1,944 tests
    • Sentence breaking: 512/512 tests
  • UAX #31 (Identifier and Pattern Syntax): 100% conformance (297,981/297,981 tests)
    • XID_Start and XID_Continue properties
    • Pattern_Syntax and Pattern_White_Space properties
    • Default Identifier Syntax validation
    • Stable across Unicode versions
  • UAX #50 (Vertical Text Layout): Comprehensive test coverage
    • Vertical orientation property for all Unicode code points
    • Glyph transformation detection
    • Base orientation determination
    • Mixed-script vertical layout support
  • UTS #15 (Normalization Forms): 100% conformance (20,034/20,034 tests)
    • NFC, NFD, NFKC, NFKD normalization forms
    • Hangul composition and decomposition
    • Canonical ordering of combining marks
    • Normalization stability verification
  • UTS #39 (Unicode Security Mechanisms): 100% conformance (6,565/6,565 confusable mappings)
    • Confusable character detection via skeleton algorithm
    • Mixed-script detection and validation
    • Restriction levels for identifier security
    • Safe identifier validation
  • UTS #51 (Unicode Emoji): 100% conformance (5,223/5,223 tests)
    • All 6 emoji properties correctly implemented
    • Complete sequence validation (ZWJ, modifier, flag, keycap, tag sequences)

Conformance Testing

Implementations are validated using the official Unicode Character Database (UCD) test files.

The implementations follow the conformance model described in UTR #33: Unicode Conformance Model, which defines what it means to conform to Unicode Standard specifications.


License

🐻 BearWare 1.0 🐻

MIT License with bear emojis. See LICENSE file for details.
