diff --git a/docs/tts-research.md b/docs/tts-research.md
new file mode 100644
index 00000000..edb81022
--- /dev/null
+++ b/docs/tts-research.md
@@ -0,0 +1,248 @@
+# TTS Research
+
+## Gemini
+
+# Definitive Comparative Analysis of Local TTS Architectures for Cross-Platform Tauri Integration: Supertonic 2 vs. Chatterbox Turbo
+
+## 1. Executive Strategic Overview: The Local Inference Convergence in 2026
+
+The trajectory of computational linguistics and speech synthesis has undergone a profound transformation over the last half-decade, culminating in a distinct bifurcation of technology stacks in early 2026. For systems architects and developers leveraging the Tauri framework to build cross-platform applications—spanning the unrestricted desktop ecosystems of macOS and Linux, as well as the rigorously sandboxed mobile environments of iOS and Android—the selection of a Text-to-Speech (TTS) engine is no longer a mere feature choice. It has become a fundamental architectural decision that dictates the entire build pipeline, runtime efficiency, and distribution strategy of the final application.
+
+The user's query posits a choice between two leading contenders in the open-weight arena: Supertonic 2, released by Supertone Inc. in January 2026, and Chatterbox Turbo, developed by Resemble AI. This report provides exhaustive technical due diligence on these two models. The core tension explored herein is between architectural agility—epitomized by Supertonic's lightweight, ONNX-native design—and expressive density—represented by Chatterbox's larger, Llama-based backbone.
+
+While cloud-based inference dominated the early 2020s, the current paradigm emphasizes "Edge AI" and "local-first" principles. This shift is driven by privacy mandates, the need for zero-latency interaction in conversational interfaces, and the desire to eliminate recurring API costs. However, achieving parity with cloud-grade TTS on consumer hardware requires navigating a labyrinth of constraints: binary size limitations, memory bandwidth bottlenecks on mobile SoCs (Systems on Chips), and the draconian process management restrictions of mobile operating systems.
+
+For a Tauri developer, who enjoys the luxury of Rust's performance and the web's ubiquity, the challenge is uniquely complex. Tauri's promise of a "write once, deploy everywhere" codebase is severely tested when integrating deep learning models that rely on disparate runtimes. Supertonic 2 offers a path of least resistance through native compilation, while Chatterbox Turbo demands a hybrid architecture that may fracture the unified codebase ideal. This report rigorously dissects these trade-offs to provide a definitive integration roadmap.
+
+## 2. Architectural Deconstruction: The Lightweight vs. The Large Language Backbone
+
+To understand the feasibility of these models within a constrained Tauri environment, one must first dismantle their internal architectures. The "black box" of AI often obscures dependency chains that can shatter a cross-platform build pipeline. The difference between 44 million parameters and 350 million parameters is not merely quantitative; it represents two divergent philosophies of engineering.
+
+### 2.1 Supertonic 2: The Principles of Architectural Distillation
+
+Supertonic 2, as of its January 2026 release [1], is an anomaly in the contemporary landscape of generative AI. While the broader industry trend has been to scale parameters upwards—moving from millions to billions to achieve nuanced reasoning—Supertone Inc. has focused on distillation and efficiency.
+The model is engineered explicitly for embedded and on-device usage, prioritizing the reduction of computational overhead to near-negligible levels.
+
+**The 44 Million Parameter Advantage.** The model operates with approximately 44 million parameters [2]. In the context of modern neural networks, where even "Small Language Models" (SLMs) typically range from 0.5B to 3B parameters, 44M is microscopic. This scale confers specific hardware advantages that are critical for mobile performance:
+
+- **Cache Residency:** A model of this size (approx. 268 MB in FP32, significantly less if quantized) can often reside entirely within the System Level Cache (SLC) or high-speed RAM partitions of modern mobile processors like the Apple A-series or Qualcomm Snapdragon. This drastically minimizes memory bandwidth saturation, which is the primary source of heat and battery drain during inference.
+- **Initialization Speed:** The "cold start" time—the duration from loading the model to the first audio sample—is imperceptible, measured in milliseconds. This allows the TTS engine to be instantiated on demand rather than requiring a persistent background service, optimizing system resource usage.
+
+**The ONNX-Native Runtime.** Crucially for Tauri developers, Supertonic is built natively for the ONNX Runtime [1]. This choice is not incidental; it is a strategic enabler of cross-platform portability. ONNX (Open Neural Network Exchange) provides a standardized inference engine that is completely decoupled from the training environment. It does not require a Python interpreter, the heavy PyTorch library, or complex CUDA drivers to execute. Instead, it runs via optimized C++ libraries.
+
+Because Tauri's backend is written in Rust, developers can utilize the ort crate to bind directly to these C++ libraries. This means the TTS engine is not an external dependency or a separate process; it becomes an intrinsic function within the application's binary. This "library-level" integration is the gold standard for mobile development, ensuring compliance with App Store policies regarding executable code and utilizing native platform capabilities.
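+
+As a minimal sketch of what that library-level integration looks like (assuming the `ort` 2.0 release-candidate API; the file name comes from the bundling step in Section 6):
+
+```rust
+// Minimal sketch: the ONNX model is loaded inside the Rust core itself,
+// not in a helper process. ort 2.0 is still a release candidate, so exact
+// builder methods may shift between versions.
+use ort::session::Session;
+
+fn load_tts_session() -> ort::Result<Session> {
+    // One-time cost: milliseconds for a model of this size.
+    Session::builder()?.commit_from_file("supertonic-v2.onnx")
+}
+```
+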
+**The January 2026 Evolution (v2).** The user's query specifically highlights "Supertonic 2 (Jan 2026)." This version introduces pivotal upgrades that address previous limitations:
+
+- **Multilingual Unification:** Prior versions were often language-specific. Supertonic 2 introduces a unified architecture supporting English, Korean, Spanish, Portuguese, and French [1]. This implies that a single ONNX model file can handle dynamic language switching at runtime without the latency penalty of unloading and reloading different model weights.
+- **Voice Personas:** The update adds distinct voice styles (e.g., Alex, Sarah, James) [5]. While not offering the infinite flexibility of voice cloning, these preset personas cover the vast majority of use cases for standard reading applications, navigation, and accessibility tools.
+
+### 2.2 Chatterbox Turbo: The Llama-Based Heavyweight
+
+Chatterbox, developed by Resemble AI, represents the "Quality First" school of thought. It leverages the massive advancements in Large Language Models (LLMs) and generative flow matching to achieve state-of-the-art naturalness.
+
+**The Llama Backbone.** Chatterbox Turbo is built upon a Llama backbone [6], likely adapting the transformer architecture to process audio tokens alongside text. Even in its "Turbo" configuration, which is optimized for latency, the model retains a 350 million parameter structure. While efficient on a server-grade GPU, this is nearly an order of magnitude larger than Supertonic.
+
+- **Memory Pressure:** The model weights alone exceed 4 GB [7]. Loading a 4GB model into memory is a non-trivial operation on mobile devices. Most mid-range Android phones ship with 6GB or 8GB of total RAM, shared between the OS, the GPU, and all active apps. Allocating 4GB to a single background TTS process will almost certainly trigger the operating system's Low Memory Killer (LMK), terminating the application or other background services to preserve system stability.
+- **Storage Friction:** Distributing a mobile application with a 4GB asset payload is highly problematic. It exceeds the initial download size limits of both the Apple App Store (which requires over-the-air downloads to be under a certain threshold, often 200MB-4GB depending on OS version) and the Google Play Store (150MB base limit). Developers would be forced to implement complex "On-Demand Resource" downloading or expansion files (OBB), adding significant friction to the user's first-run experience.
+
+**The Python-PyTorch Dependency Chain.** Chatterbox is a PyTorch-native model [6]. Its architecture utilizes complex operations—specifically paralinguistic tag handling and flow matching decoders—that are deeply entwined with the PyTorch runtime and the Python ecosystem (requiring libraries like numpy, scipy, and torchaudio).
+
+**Lack of ONNX Export:** Unlike simpler models, Chatterbox does not offer a first-party, fully functional ONNX export that retains all its features. The dynamic nature of its flow matching steps and custom tokenizers makes "freezing" the model into a static computation graph exceptionally difficult. Consequently, running Chatterbox requires a live Python environment, a requirement that introduces the "Sidecar Problem" on mobile platforms—a critical hurdle for Tauri integration that will be explored in depth in subsequent sections.
+
+**Feature Superiority.** Despite this architectural weight, Chatterbox offers capabilities Supertonic cannot match:
+
+- **Paralinguistic Control:** Developers can inject tags like [laugh], [sigh], or [cough] directly into the text stream [6]. The model understands these non-verbal cues and generates appropriate audio artifacts, creating a level of "human" performance that is state of the art.
+- **Zero-Shot Cloning:** The model can clone a target voice from a mere 5-second reference clip [9]. This feature relies on the dense vector representations of the Llama backbone to capture and replicate timbre and prosody instantly.
+
+## 3. The Tauri Framework Context: Integration Realities
+
+The user's choice of Tauri as the application framework is the defining constraint of this analysis. Tauri operates on a unique architecture distinct from Electron or native development. A Tauri app consists of two distinct layers:
+
+- **The Core (Backend):** Written in Rust. This layer handles system interactions, file I/O, and heavy computation. It compiles down to a native binary.
+- **The Webview (Frontend):** Written in web technologies (HTML/JS/CSS). This layer handles the UI and communicates with the Core via an asynchronous IPC bridge.
+
+For a TTS engine to be "local," it must reside within or be managed by the Rust Core. The feasibility of this integration varies wildly between desktop (macOS/Linux) and mobile (iOS/Android).
+
+### 3.1 The "Sidecar Pattern" and Desktop Success
+
+On desktop operating systems, Tauri supports a feature known as the Sidecar Pattern. This allows the Rust Core to bundle and spawn external binaries as subprocesses.
+
+- **Mechanism:** The developer compiles a Python script (and its interpreter) into a standalone executable using tools like PyInstaller or Nuitka. The Rust Core then uses the Command::sidecar API to launch this executable (sketched below). Communication occurs via stdin (sending text) and stdout (receiving audio data).
+- **Implication for Chatterbox:** This pattern makes running Chatterbox on macOS and Linux entirely feasible. The massive Python dependency chain is encapsulated in the sidecar binary. While the installer size bloats to 4GB+, the application runs successfully.
+- **Implication for Supertonic:** While Supertonic can be run this way (e.g., using a Python wrapper around ONNX Runtime), it is unnecessary. Supertonic's C++ roots allow it to be linked directly into the Rust Core, avoiding the IPC overhead of a sidecar.
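+
+A hedged sketch of that spawn-and-stream mechanism with tauri-plugin-shell (Tauri v2); the `tts-sidecar` name is hypothetical and would need to be declared under `externalBin`:
+
+```rust
+// Desktop-only: spawn the bundled TTS binary, write text to stdin, and
+// collect audio bytes from stdout until the process exits.
+use tauri_plugin_shell::{process::CommandEvent, ShellExt};
+
+async fn run_tts_sidecar(app: tauri::AppHandle, text: &str) -> anyhow::Result<Vec<u8>> {
+    // `sidecar` resolves the platform-suffixed binary bundled with the app.
+    let (mut rx, mut child) = app.shell().sidecar("tts-sidecar")?.spawn()?;
+    child.write(format!("{text}\n").as_bytes())?;
+    let mut audio = Vec::new();
+    while let Some(event) = rx.recv().await {
+        if let CommandEvent::Stdout(bytes) = event {
+            audio.extend_from_slice(&bytes);
+        }
+    }
+    Ok(audio)
+}
+```
+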
+### 3.2 The "Mobile Wall": Why Sidecars Fail on iOS & Android
+
+The user's requirement for iOS and Android support reveals the fundamental weakness of the Chatterbox architecture in a Tauri context. The Sidecar Pattern described above is functionally non-existent on mobile platforms due to strict OS security models.
+
+**iOS Sandbox Constraints.** Apple's iOS enforces a draconian sandbox. An application bundle cannot contain arbitrary executables that are spawned as independent processes. The fork() and exec() system calls—essential for spawning a sidecar—are restricted or forbidden for App Store applications. Furthermore, iOS prohibits Just-In-Time (JIT) compilation for most applications (exceptions exist for browser engines and debuggers, but not general apps). PyTorch and complex Python runtimes heavily rely on JIT for performance. Running them in "interpreter-only" mode results in catastrophic performance degradation, rendering a 350M parameter model unusable.
+
+**Android Sandbox Constraints.** Android's security model, while slightly more flexible regarding JIT, imposes similar restrictions on subprocesses. While it is theoretically possible to package a Python binary and execute it via the NDK, managing the lifecycle of that process, ensuring it isn't killed by the stringent Android memory manager, and handling the communication bridge is a task of immense complexity. It fights against the grain of the Android application lifecycle.
+
+**The Dependency Hell of Embedded Python.** The alternative to a sidecar is embedding the Python interpreter directly into the Rust binary (using crates like pyo3). This allows Python code to run within the main application process, bypassing the subprocess restriction. However, this leads to "Dependency Hell." To run Chatterbox, one must embed not just Python, but numpy, scipy, and torch. These are not pure Python libraries; they are wrappers around massive C/C++ and Fortran codebases. Compiling scipy or torch from source for aarch64-linux-android or aarch64-apple-ios and linking them statically into a Rust binary is one of the most notoriously difficult tasks in cross-platform development. It involves resolving thousands of symbol conflicts, matching libc versions, and dealing with build system incompatibilities. For 99% of development teams, this is a non-starter.
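+
+For concreteness, a heavily hedged illustration of the embedded-interpreter route on desktop (pyo3 0.20-era API; the `chatterbox.tts` module, `ChatterboxTTS.from_pretrained`, and `generate` names mirror the upstream Python package but are assumptions here). The hard part described above, building numpy/torch for the mobile targets, is exactly what this snippet does not solve:
+
+```rust
+use pyo3::prelude::*;
+
+// Run Chatterbox inside the application process via an embedded interpreter.
+fn synthesize_embedded(text: &str) -> PyResult<()> {
+    Python::with_gil(|py| {
+        let module = py.import("chatterbox.tts")?; // assumed module path
+        let model = module
+            .getattr("ChatterboxTTS")?
+            .call_method1("from_pretrained", ("cpu",))?; // assumed signature
+        let _wav = model.call_method1("generate", (text,))?;
+        Ok(())
+    })
+}
+```
+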
+## 4. Platform-Specific Integration Analysis: Mobile Deep Dive
+
+Given that mobile is the "Great Filter" in this selection process, we must analyze the integration pathway for the surviving candidate—Supertonic—and the theoretical (but painful) path for Chatterbox.
+
+### 4.1 Supertonic 2 on Mobile: The Native Route
+
+Supertonic's reliance on ONNX Runtime (ORT) is its superpower here. ORT is designed with mobile in mind.
+
+**iOS Integration Strategy**
+
+- **Static Linking:** The ORT library is distributed as an .xcframework. In Rust, the ort crate can be configured to link against this framework during the build process (cargo build --target aarch64-apple-ios).
+- **CoreML Acceleration:** iOS devices feature the Apple Neural Engine (ANE). ONNX Runtime supports the CoreML Execution Provider. By enabling this provider in the Rust ort session options (see the sketch after these lists), Supertonic inference is offloaded from the CPU to the NPU. This results in faster generation and, critically, drastically lower battery consumption.
+- **Asset Management:** The 268MB .onnx file is treated as a standard bundle resource. It is accessible to the Rust Core via the NSBundle API (wrapped by Tauri's resource path helpers).
+
+**Android Integration Strategy**
+
+- **JNI and Shared Libraries:** Android requires native libraries to be .so files. The ort crate manages the inclusion of libonnxruntime.so into the jniLibs folder of the Android project structure generated by Tauri.
+- **NNAPI Acceleration:** Similar to CoreML, Android offers the Neural Networks API (NNAPI). Supertonic can leverage this to run on the DSP or NPU of Qualcomm or MediaTek chips, ensuring performance across the fragmented Android hardware ecosystem.
+- **App Bundle Size:** While 268MB exceeds the 150MB base APK limit, Tauri developers can utilize "Play Asset Delivery" (install-time delivery) to package the model. Since the model is a static file, this is a solved infrastructure problem.
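+
+How provider selection might look in the Rust core, assuming the ort 2.0 release candidate with its `coreml`/`nnapi` cargo features enabled (provider names and fallback behavior should be verified against the pinned ort version):
+
+```rust
+// Hedged sketch: route Supertonic inference to the NPU where available.
+// ONNX Runtime falls back to CPU kernels if a provider cannot register.
+use ort::session::Session;
+
+#[allow(unused_mut)]
+fn build_mobile_session(model_path: &str) -> ort::Result<Session> {
+    let mut builder = Session::builder()?;
+    #[cfg(target_os = "ios")]
+    {
+        use ort::execution_providers::CoreMLExecutionProvider;
+        builder = builder.with_execution_providers([CoreMLExecutionProvider::default().build()])?;
+    }
+    #[cfg(target_os = "android")]
+    {
+        use ort::execution_providers::NNAPIExecutionProvider;
+        builder = builder.with_execution_providers([NNAPIExecutionProvider::default().build()])?;
+    }
+    builder.commit_from_file(model_path)
+}
+```
+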
This enables "streaming" capabilities where long paragraphs are synthesized faster than the user can read them.Chatterbox Turbo PerformanceDesktop (RTX 4090): The model is fast, achieving sub-0.1 RTF.Mobile CPU (Theoretical): If one could run it on a mobile CPU (bypassing the build issues), the 350M parameters would crush the processor. Without heavy quantization (e.g., 4-bit) and optimization, RTF would likely hover between 0.5 and 1.0. This means a 10-second sentence could take 5-10 seconds to generate, creating awkward pauses in conversation or UI interaction.5.2 Memory Footprint & System StabilitySupertonic: Requires ~300-500 MB of RAM. This is safe for almost all modern mobile devices, even low-end Android phones with 4GB RAM. It leaves plenty of room for the OS and the webview.Chatterbox: Requires ~4-5 GB of RAM/VRAM. On a PC, this is fine. On a mobile device, this is catastrophic. iOS aggressively kills background processes that consume excessive memory. An app attempting to allocate 4GB for TTS would likely be terminated immediately upon initialization on all but the most expensive "Pro" model iPhones and Android flagships.6. Technical Integration Guide: Supertonic 2 (Recommended)Based on the evidence, Supertonic 2 is the only viable candidate for a truly local, cross-platform Tauri application. This section details the integration roadmap.6.1 Rust Core ConfigurationThe integration avoids the sidecar pattern entirely. We utilize the ort crate to bind to ONNX Runtime directly within the Rust process.Step 1: Dependency ManagementIn src-tauri/Cargo.toml:Ini, TOML[dependencies] +tauri = { version = "2.0", features = } +# ORT: The interface to ONNX Runtime. +# 'fetch-models' allows auto-downloading libs (mostly for dev). +# 'load-dynamic-lib' is crucial for mobile linking. +ort = { version = "2.0", features = ["fetch-models", "load-dynamic-lib", "ndarray"] } +# Rodio: For cross-platform audio playback +rodio = "0.19" +Step 2: Model Asset BundlingThe 268MB model file must be accessible to the binary at runtime.Place supertonic-v2.onnx and config.json in src-tauri/assets/.Update tauri.conf.json to include these assets:JSON"bundle": { + "resources": ["assets/*"] +} +Step 3: The Inference Engine (Rust)In src-tauri/src/lib.rs, implement a command that the frontend can invoke. This command should:Tokenize: Convert the input string into the specific integer tokens expected by Supertonic. (Note: Check if Supertonic v2 includes a fused tokenizer in the ONNX graph; if not, a small Rust-based tokenizer matching the training data is required).Inference: Pass the tokens to the ort session.Rust// Conceptual Rust Code +let inputs = ort::inputs!["input_ids" => token_tensor]?; +let outputs = session.run(inputs)?; +let audio_data = outputs["audio"].extract_tensor::()?; +Playback: Feed the audio_data into a rodio Sink for immediate playback.6.2 Mobile-Specific Build FlagsAndroid: You must ensure the correct jniLibs are present. You can often rely on the ort crate's build script, but for production, manually downloading the onnxruntime-android AAR and extracting the .so files to your project's android/app/src/main/jniLibs is the most robust method.iOS: You must link the onnxruntime.xcframework. In your build.rs, you may need to emit linker flags:Rustprintln!("cargo:rustc-link-lib=framework=onnxruntime"); +7. Technical Integration Guide: Chatterbox (The Desktop-Only Hybrid)For completeness, if the project demands Chatterbox's features, here is the implementation strategy. 
+## 7. Technical Integration Guide: Chatterbox (The Desktop-Only Hybrid)
+
+For completeness, if the project demands Chatterbox's features, here is the implementation strategy. Note that this abandons local mobile inference.
+
+### 7.1 Desktop: The Python Sidecar
+
+1. **Environment Isolation:** Create a standalone Python environment using uv or conda. Install chatterbox-tts and its heavy dependencies (torch).
+2. **Freezing the Binary:** Use PyInstaller to compile a server.py script into a single binary. This script should launch a local web server (e.g., FastAPI) to listen for TTS requests. Warning: the resulting binary will be 4GB+.
+3. **Tauri Orchestration:**
+   - Add the binary to externalBin in tauri.conf.json.
+   - On app launch, spawn it via Command::sidecar.
+   - Wait for the "ready" signal (monitor stdout).
+   - Send HTTP requests to localhost for generation.
+
+### 7.2 Mobile: The Remote API Fallback
+
+Since the sidecar cannot run on iOS/Android:
+
+1. **Host a Server:** Deploy the Chatterbox model to a cloud GPU provider (e.g., RunPod, Lambda Labs, or AWS).
+2. **Conditional Logic:** In your frontend JavaScript:
+
+```javascript
+import { type } from '@tauri-apps/plugin-os';
+
+async function generateSpeech(text) {
+  if (type() === 'android' || type() === 'ios') {
+    // Call the remote API
+    return await fetch('https://api.myapp.com/tts', {
+      method: 'POST',
+      body: JSON.stringify({ text }),
+    });
+  } else {
+    // Call the local sidecar
+    return await fetch('http://localhost:8000/tts', {
+      method: 'POST',
+      body: JSON.stringify({ text }),
+    });
+  }
+}
+```
+
+## 8. Quality of Experience (QoE) Analysis
+
+Beyond the binary "can it run" question lies the "how does it sound" question.
+
+### 8.1 Prosody and Stability
+
+- **Supertonic 2:** The model produces highly stable, intelligible speech. The prosody is consistent, making it ideal for reading long-form content (articles, ebooks). It rarely "hallucinates" or creates bizarre artifacts, a common trait of distilled models. However, it can sound "flatter" or less dynamic than larger models.
+- **Chatterbox Turbo:** The "human" element is significantly higher. The model captures micro-tremors in pitch, breath intake, and varied pacing that signal high production value. It is better suited for narrative content (fiction, gaming) where emotional engagement is key.
+
+### 8.2 The "Uncanny Valley" of Latency
+
+- **Supertonic:** The near-instant response (0.006 RTF) creates a seamless user experience. It feels like a native OS feature.
+- **Chatterbox:** Even on desktop, the 200ms+ latency can create a "turn-taking" delay in conversational apps. On a slow connection (mobile remote fallback), this latency can spike to seconds, breaking the illusion of interactivity.
+
+## 9. Commercial and Operational Considerations
+
+### 9.1 Licensing and Watermarking
+
+- **Supertonic 2:** Released under the OpenRAIL-M license [5]. This license permits commercial use but includes usage restrictions to prevent abuse (e.g., generating deepfakes for fraud). It does not mandate watermarking, though developers should be mindful of transparency.
+- **Chatterbox:** Released under the MIT license [6], the most permissive option. However, Resemble AI includes PerTh watermarking technology baked into the model [12]. Every generated audio file contains an imperceptible watermark. This is a robust safety feature for a commercial app, allowing you to prove the provenance of the audio if challenged, but it incurs a small computational cost during inference.
+
+### 9.2 Update Velocity
+
+- **Supertone Inc.:** The release of v2 in Jan 2026 suggests a committed roadmap. The shift to a unified multilingual architecture indicates maturity in their R&D pipeline.
+- **Resemble AI:** Chatterbox is an open-source offshoot of their core commercial product. Updates are frequent, but often prioritize their paid API services or newer, larger models that may drift further away from consumer hardware capability.
+
+## 10. Conclusion and Strategic Recommendation
+
+The comparative analysis yields a definitive conclusion based on the user's specific constraint of running locally across iOS, Android, Mac, and Linux.
+
+**The Recommendation: Supertonic 2 is the superior architectural choice.**
+
+- **Mobile Feasibility:** Supertonic 2 is the only candidate that offers a viable path to local inference on iOS and Android within a Tauri application. Its ONNX-native architecture allows for static linking and NPU acceleration, bypassing the OS restrictions that block Chatterbox's Python-based stack.
+- **Performance Profile:** With an inference speed 166x faster than real-time and a memory footprint of under 500MB, Supertonic ensures the application remains responsive and stable on resource-constrained mobile devices. Chatterbox's 4GB requirement is a non-starter for mobile memory budgets.
+- **Integration Simplicity:** While Rust requires a learning curve, the ort integration is cleaner and more robust than maintaining a fragile Python sidecar build chain.
+
+**When to Consider Chatterbox:** Chatterbox should only be selected if the application is desktop-exclusive (Mac/Linux/Windows) or if the requirement for zero-shot voice cloning and paralinguistic tags (laughter, emotion) outweighs the requirement for "local" execution on mobile. In that scenario, a hybrid architecture (local desktop + remote mobile API) is the only path forward.
+
+For the stated goal of a unified, local, cross-platform Tauri build, Supertonic 2 is not just the better option; it is practically the only option.
+
+
+## Claude
+
+# Local TTS for Tauri: Supertonic vs Chatterbox compared
+
+**Supertonic emerges as the clear choice for cross-platform Tauri deployment**, offering native Rust integration, ~264 MB model size, and proven iOS/Android support out of the box. Chatterbox provides superior voice cloning and emotion control but at **10-12× the model size** and significantly higher deployment complexity. For a privacy-focused chat application prioritizing simplicity and bundle size, Supertonic's ONNX-based architecture delivers the most practical path to production.
+
+## Model architecture and runtime requirements
+
+**Supertonic** runs entirely on ONNX Runtime, making it deployment-friendly across all platforms. The architecture splits into four ONNX components: text encoder (28 MB), vector estimator (133 MB), vocoder (101 MB), and duration predictor (1.6 MB). With only **66 million parameters**, it's deliberately optimized for edge devices—proven to run on Raspberry Pi and e-readers at 0.3× real-time factor.
+
+**Chatterbox** was built on PyTorch with a **0.5B Llama backbone**, requiring substantially more resources. Three model variants exist: the original 500M parameter model, Chatterbox-Multilingual (500M, 23 languages), and Chatterbox-Turbo (350M, optimized for speed). While native inference requires PyTorch with CUDA/MPS/ROCm backends, official ONNX exports now exist through `ResembleAI/chatterbox-turbo-ONNX`.
+
+| Specification | Supertonic | Chatterbox |
+|--------------|------------|------------|
+| Parameters | 66M | 350M-500M |
+| Native framework | ONNX Runtime | PyTorch |
+| ONNX available | ✅ Primary | ✅ Exported |
+| MLX support | ❌ | ✅ via mlx-audio |
+
+## Model sizes shape deployment decisions
+
+Supertonic's total ONNX bundle weighs approximately **264 MB** across all components, with OnnxSlim optimizations shaving a few megabytes. This size remains consistent since the architecture doesn't support quantization variants in the official release.
+
+Chatterbox offers more flexibility through quantization but starts much larger. The full-precision Turbo ONNX export totals **~3.3 GB** across its four sessions (speech encoder, language model, conditional decoder, embed tokens). Quantized variants dramatically reduce this:
+
+- **Q4F16** (4-bit with FP16): ~560 MB total
+- **INT8 (Q8)**: ~1.1 GB total
+- **FP16**: ~1.7 GB total
+
+For mobile deployment, the Q4F16 Chatterbox variant at 560 MB remains **roughly twice Supertonic's size**. Memory requirements diverge even more sharply: Supertonic runs comfortably in **250-500 MB RAM**, while Chatterbox ONNX peaks at **~3.2 GB RAM** on iOS based on real-world testing.
+
+## Cross-platform deployment capabilities
+
+Supertonic provides exceptional platform coverage with **official examples for every major platform** in its repository:
+
+- **Desktop**: Windows, macOS, Linux via C++, Rust, Go, Python, Node.js, Java, C#
+- **Mobile**: Native iOS (Swift/Xcode), Android (Java/Kotlin), Flutter
+- **Web**: WebGPU/WASM (Chrome 121+, Edge 121+, Safari macOS 15+)
+- **Embedded**: Proven on Raspberry Pi, Onyx Boox e-readers
+
+Chatterbox's platform support depends heavily on your chosen runtime:
+
+- **PyTorch native**: Linux (primary), macOS (MPS), Windows (CUDA/CPU only)
+- **ONNX Runtime**: All platforms theoretically supported; iOS demonstrated working
+- **MLX**: macOS 14.0+ and iOS 16.0+ only (Apple Silicon exclusive)
+- **Android**: ONNX Runtime supports it, but not officially tested
+
+## Rust integration and Tauri compatibility
+
+**Supertonic offers native Rust support** directly in the repository's `rust/` directory. The implementation uses ONNX Runtime Rust bindings, making Tauri integration straightforward—you can call TTS directly from your Rust backend without spawning external processes.
+
+```rust
+// Supertonic approach: Native Rust in Tauri backend
+// Uses ort crate (ONNX Runtime) directly
+```
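+
+Fleshing out the stub above: once the ort session has produced f32 samples, a hedged sketch of persisting them as 16-bit WAV with the hound crate (mono output and the 44.1 kHz rate are assumptions; use the model's actual configuration):
+
+```rust
+// Convert f32 samples from the ONNX session into a 16-bit PCM WAV file.
+fn write_wav(samples: &[f32], path: &str) -> Result<(), hound::Error> {
+    let spec = hound::WavSpec {
+        channels: 1,         // Supertonic emits mono speech
+        sample_rate: 44_100, // assumption; check the model config
+        bits_per_sample: 16,
+        sample_format: hound::SampleFormat::Int,
+    };
+    let mut writer = hound::WavWriter::create(path, spec)?;
+    for s in samples {
+        // Clamp to [-1, 1] and widen to i16 PCM.
+        writer.write_sample((s.clamp(-1.0, 1.0) * i16::MAX as f32) as i16)?;
+    }
+    writer.finalize()
+}
+```
+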
+**Chatterbox lacks official Rust bindings**, creating three integration paths for Tauri:
+
+1. **ONNX via `ort` crate**: Load quantized ONNX models directly from Rust—no Python required, works cross-platform
+2. **Python sidecar**: Bundle PyInstaller-compiled Python with Tauri's `externalBin` feature
+3. **Local HTTP server**: Run chatterbox-tts-api as subprocess with OpenAI-compatible endpoints
+
+The Python sidecar approach has been documented for Chatterbox with mlx-audio. Configure `tauri.conf.json` with `"externalBin": ["binaries/tts-sidecar"]`, compile Python using PyInstaller with target-specific naming (`tts-sidecar-x86_64-apple-darwin`), and spawn via `app.shell().sidecar()`. Known issues include sidecars not terminating cleanly on app close and **50-200 MB additional bundle size** for the Python runtime.
+
+## Voice quality and feature comparison
+
+Both systems produce high-quality, natural speech—neither sounds robotic in typical usage.
+
+**Supertonic** offers configurable inference steps trading speed for quality:
+
+- 2-step inference: "Close to ElevenLabs Flash" quality, fastest
+- 5-step inference: "Reaches much of ElevenLabs Prime tier"
+- 10+ steps: Highest quality, slower
+
+It includes **11 preset voices** (5 male, 5 female) and excels at text normalization—handling currencies ($5.2M), dates, phone numbers, and abbreviations without preprocessing. Supertonic 2, released January 6, 2026, added support for English, Korean, Spanish, Portuguese, and French.
+
+**Chatterbox** won **63.75% preference over ElevenLabs** in blind evaluations and offers richer features:
+
+- **Zero-shot voice cloning** from 5-10 seconds of reference audio
+- **Emotion exaggeration control** (0 = monotone, 1 = normal, 2+ = dramatic)
+- **Paralinguistic tags**: `[laugh]`, `[cough]`, `[sigh]`, `[groan]`
+- **23 languages** in the multilingual model
+- Built-in neural watermarking (PerTh) for provenance tracking
+
+## Performance benchmarks reveal the gap
+
+Supertonic's lightweight architecture delivers exceptional speed-to-quality ratios:
+
+| Hardware | Supertonic RTF | Throughput |
+|----------|---------------|------------|
+| M4 Pro (CPU) | 0.015 | 1,263 chars/sec |
+| M4 Pro (WebGPU) | 0.006 | 2,509 chars/sec |
+| RTX 4090 | 0.001 | 12,164 chars/sec |
+| Raspberry Pi | 0.3 | Real-time capable |
+
+Chatterbox requires more compute but achieves competitive latency:
+
+- **Streaming RTF**: 0.499 on RTX 4090
+- **Latency**: Sub-200ms optimized, sub-300ms typical
+- **Apple Silicon via MLX**: 2-3× faster than CPU
+- **Mobile (iOS ONNX)**: Functional but ~3.2 GB peak RAM
+
+## Licensing permits commercial use
+
+Both projects use permissive licenses suitable for commercial applications:
+
+| Aspect | Supertonic | Chatterbox |
+|--------|------------|------------|
+| Code license | MIT | MIT |
+| Model license | OpenRAIL-M | MIT |
+| Commercial use | ✅ Allowed | ✅ Allowed |
+| Voice cloning | Not supported | Built-in |
+| Watermarking | None | Optional neural watermark |
+
+## Practical deployment recommendation
+
+For your privacy-focused Tauri chat application, **Supertonic provides the most straightforward path**:
+
+1. **Native Rust integration**: Add ONNX Runtime via `ort` crate, call TTS directly from Tauri backend
+2. **Minimal bundle impact**: ~264 MB models + ~7-15 MB ONNX Runtime per platform
+3. **True cross-platform**: iOS, Android, macOS, Linux work today with official examples
+4. **No Python dependency**: Ships as pure native code
+
+Choose Chatterbox only if you specifically need **voice cloning** or **emotion control**. In that case, use the Q4F16 ONNX variant (~560 MB) with the `ort` crate for maximum portability. Avoid the Python sidecar approach on mobile—it's impractical due to bundle size and process management complexity.
+
+## Conclusion
+
+The two systems serve different use cases despite both being local TTS solutions. Supertonic optimizes for **deployment simplicity and efficiency**—its 66M parameter model runs on constrained devices while maintaining quality competitive with cloud services. Chatterbox optimizes for **feature richness**—voice cloning and emotion control justify its 5-8× larger footprint for applications requiring those capabilities.
+
+For a Tauri app shipping across iOS, Android, macOS, and Linux with privacy as a priority, Supertonic's architecture eliminates the primary pain points: no Python runtime, no platform-specific codepaths for inference, and proven mobile deployment. The January 2026 Supertonic 2 release with multilingual support further strengthens this recommendation for international applications.
+
+---
+
+## Grok
+
+### Key Considerations for Choosing Between Supertonic and Chatterbox
+
+- **Model Size and Resource Efficiency**: Research indicates Supertonic (66 million parameters) is significantly smaller and more lightweight than Chatterbox models (350-500 million parameters), making it better suited for resource-constrained devices across platforms.
+- **Cross-Platform Compatibility**: Evidence leans toward Supertonic being more reliable and easier to deploy on iOS, Android, Mac, and Linux due to its ONNX Runtime integration, with native examples for mobile and desktop; Chatterbox, while feasible on Mac and potentially iOS via optimizations like MLX, lacks straightforward Android support and relies on Python, complicating Tauri integration.
+- **Ease of Integration in Tauri Apps**: Supertonic appears more straightforward for Rust-based Tauri builds with direct Rust examples, while Chatterbox may require a Python sidecar (e.g., with MLX-audio on Apple devices), adding complexity but remaining viable for certain setups.
+- **Reliability and Recent Developments**: Both are actively maintained, but Supertonic 2's January 2026 release introduces multilingual support and optimizations, suggesting it may offer fresher reliability for local runs; user feedback highlights its stability for long-form audio, though Chatterbox excels in voice cloning and expressiveness.
+- **Recommendation**: For local TTS in Tauri across all specified platforms, Supertonic seems likely to be the better choice due to its smaller size, broader native compatibility, and simpler integration, though Chatterbox could suit scenarios prioritizing advanced features like zero-shot cloning if platform constraints are flexible.
+
+#### Tech Stack Overview
+
+Supertonic leverages ONNX Runtime for inference, enabling efficient, CPU-optimized execution without heavy dependencies—ideal for embedding in cross-platform apps. It supports batch processing and configurable inference steps (e.g., 2-5), with models hosted on Hugging Face (https://huggingface.co/Supertone/supertonic-2). Chatterbox, built on PyTorch, requires Python environments and supports CUDA, CPU, or MPS (Metal Performance Shaders) for Apple devices, but integration often involves additional tools like MLX-audio for optimization on Mac/iOS.
+
+#### Performance and Features
+
+Supertonic prioritizes speed, achieving real-time factors (RTF) as low as 0.006 on WebGPU and handling complex text (e.g., dates, currencies) natively. Chatterbox focuses on quality, with features like paralinguistic tags ([laugh], [chuckle]) for expressive speech and zero-shot voice cloning, though it may require more VRAM (around 5-6.5GB on GPUs). For local runs, Supertonic's lightweight design suggests lower latency on diverse hardware, while Chatterbox performs well on supported setups but could be slower on non-optimized platforms.
+
+#### Platform-Specific Insights
+
+- **iOS/Android**: Supertonic offers native iOS examples and Flutter support for cross-mobile deployment; ONNX Runtime ensures compatibility (https://onnxruntime.ai/docs/build/ios.html). Chatterbox has been adapted for iOS via apps like Chinny, but Android integration is less documented and may need custom work.
+- **Mac/Linux**: Both run well—Supertonic via Swift/Rust on Mac and various examples on Linux; Chatterbox uses MPS on Mac for acceleration and is straightforward on Linux with CUDA/CPU.
+
+If prioritizing minimal footprint and ease, start with Supertonic's Rust example for Tauri: clone the repo, build with Cargo, and integrate ONNX models directly.
+
+---
+
+### In-Depth Analysis of Supertonic and Chatterbox for Cross-Platform Local TTS in Tauri Applications
+
+Text-to-speech (TTS) models like Supertonic and Chatterbox represent significant advancements in open-source AI, enabling privacy-focused, offline audio generation for applications such as voice assistants, accessibility tools, and content creation. As of early 2026, these models cater to developers building cross-platform apps with frameworks like Tauri, which allows Rust-based backends for web-like UIs on iOS, Android, Mac, and Linux. This analysis draws from repository details, user integrations, performance benchmarks, and community feedback to evaluate their suitability for local deployment. We prioritize factors like model size, tech stack, platform compatibility, reliability, and Tauri-specific integration, acknowledging that while both are capable, differences in architecture influence their ease of use across devices.
+
+#### Model Architectures and Core Technologies
+
+Supertonic, developed by Supertone Inc., is an ONNX-based TTS system optimized for on-device inference with minimal overhead. Its core relies on ONNX Runtime, a cross-platform engine that supports CPU-optimized execution (GPU untested in the repo but feasible). Models are slimmed using OnnxSlim, resulting in efficient, lightweight files. The system generates 16-bit WAV audio, supports batch processing for throughput, and handles natural text variations (e.g., phone numbers, units) without preprocessing. Supertonic 2, released on January 6, 2026 (v2.0.0), expands to multilingual support for English, Korean, Spanish, Portuguese, and French, with six new voice styles (M3-M5, F3-F5). It's licensed under MIT for code and OpenRAIL-M for models, allowing commercial use.
+
+Chatterbox, from Resemble AI, is a PyTorch-based family of models: the original (500M parameters, English-only), Multilingual (500M, 23+ languages), and Turbo (350M, English with paralinguistic tags like [chuckle] or [cough]). It emphasizes high-fidelity, zero-shot voice cloning, and expressive speech via configurable parameters (e.g., CFG for guidance, exaggeration for emotion). All include Perth watermarking for ethical traceability. The Turbo variant distills the decoder to a single generation step, reducing latency and VRAM needs. It's MIT-licensed and installable via pip, with dependencies managed in pyproject.toml for Python 3.11 on Debian-like systems.
+
+Key tech differences: Supertonic's ONNX focus enables broader runtime flexibility without Python, while Chatterbox's PyTorch ties it to Python environments, potentially requiring sidecars in non-Python apps like Tauri.
+
+#### Model Sizes and Resource Requirements
+
+Model size directly impacts local feasibility, especially on mobile devices with limited RAM/VRAM.
+
+| Model | Variant | Parameters | Approximate Size | VRAM Usage (GPU) | Key Optimizations |
+|-------|---------|------------|------------------|------------------|-------------------|
+| Supertonic | Supertonic 2 | 66M | Ultra-lightweight (optimized ONNX files) | Minimal (CPU-focused; ~low GB if GPU) | OnnxSlim for compression; batch support |
+| Chatterbox | Turbo | 350M | Medium | ~5GB (e.g., RTX 3060) | Distilled decoder; low-latency mode |
+| Chatterbox | Multilingual/Original | 500M | Larger | ~6.5GB | Zero-shot cloning; expressive tuning |
+
+Supertonic's 66M parameters make it the smallest, enabling runs on edge devices like Raspberry Pi or e-readers with RTF as low as 0.012 on CPU.
+Chatterbox models, at 350-500M, demand more resources but offer efficiencies like 1-step generation in Turbo, using ~5GB VRAM for faster output (e.g., 1.8x speed over original). For Tauri apps, Supertonic's footprint reduces bundling overhead, while Chatterbox may need quantized versions (e.g., 6-bit via MLX) for mobile.
+
+#### Performance Benchmarks and Features
+
+Performance varies by use case: speed vs. quality.
+
+- **Speed and Latency**: Supertonic excels, processing up to 12,164 characters/second on RTX 4090 and 167x real-time on M4 Pro Mac, with RTF 0.006 on WebGPU. It's faster than Chatterbox on non-NVIDIA hardware. Chatterbox Turbo achieves sub-200ms latency, suitable for real-time agents, and handles long texts stably via chunking.
+- **Audio Quality and Expressiveness**: Chatterbox leads in naturalness, with low word error rates, emotional carry-over, and tags for non-verbal cues; it outperforms paid services like ElevenLabs in cloning (7-11s reference audio). Supertonic provides stable, natural long-form narration but lacks cloning or advanced emotion tuning, focusing on clear, reliable output.
+- **Multilingual Support**: Supertonic 2 adds five languages; Chatterbox Multilingual covers 23+.
+
+In comparisons, Supertonic is praised for efficiency in resource-limited scenarios, while Chatterbox shines in expressive, cloned audio.
+
+#### Cross-Platform Compatibility and Deployment
+
+ONNX Runtime makes Supertonic highly portable: it supports iOS (native Xcode), Android (via Flutter), Mac (Swift/MPS), Linux (multiple languages), and even browsers (WebGPU/WASM). Installation involves cloning the repo, Git LFS for models, and language-specific builds (e.g., `cargo build` for Rust).
+
+Chatterbox supports Mac (MPS), Linux (CUDA/CPU), and Windows (GPU), with iOS adaptations via apps like Chinny for offline runs. Android integration is not native; it may require embedding Python or API wrappers. MLX-audio optimizes for Apple Silicon, enabling faster inference on Mac/iOS.
+
+For Tauri: Supertonic integrates directly via Rust examples, embedding ONNX in the backend. Chatterbox uses a Python sidecar (e.g., via tauri-plugin-shell), running scripts as external processes—feasible but adds overhead, especially with MLX-audio for Apple platforms.
+
+#### Reliability, Maintenance, and Community Feedback
+
+Both are reliable for local use, with active repos (Supertonic updated January 2026; Chatterbox last updated in mid-2025). Supertonic's on-device privacy and stability for long narration earn praise in integrations (e.g., browser extensions, e-book apps). Early Supertonic 2 reviews highlight its speed and natural voices on Windows/Mac. Chatterbox is lauded for beating paid TTS in quality but may face setup issues on non-GPU devices.
+
+Pros/Cons Table:
+
+| Aspect | Supertonic | Chatterbox |
+|--------|------------|------------|
+| **Pros** | Ultra-fast, small size, broad platforms, easy Rust/Tauri integration | Superior quality/cloning, expressive tags, multilingual options |
+| **Cons** | Limited voices/cloning, recent multilingual addition | Larger size, Python-dependent, limited mobile native support |
+| **Best For** | Speed-focused, cross-platform local apps | Quality-focused, voice agents with cloning |
+
+#### Recommendations and Implementation Tips
+
+For Tauri builds targeting all platforms, Supertonic is preferable: download models from Hugging Face, use Rust bindings for backend inference, and expose via Tauri commands. Test on devices like Raspberry Pi for edge cases.
+If voice cloning is essential, opt for Chatterbox with a Python sidecar—use MLX-audio for Mac/iOS acceleration and explore quantization for Android. Always verify with hardware benchmarks, as real-world performance varies.
+
+### Key Citations
+
+- [GitHub - supertone-inc/supertonic](https://github.com/supertone-inc/supertonic)
+- [GitHub - resemble-ai/chatterbox](https://github.com/resemble-ai/chatterbox)
+- [Lightning-Fast Open-Source TTS Model Is Here!](https://medium.com/@bytefer/12-164-characters-per-second-lightning-fast-open-source-tts-model-is-here-96452a3dd66f)
+- [ONNX Runtime | Home](https://onnxruntime.ai/)
+- [ONNX Runtime: Enabling Cross-Platform AI Model Inference](https://medium.com/aimonks/onnx-runtime-enabling-cross-platform-ai-model-inference-80f136ecbb2d)
+- [This Local AI Voice Model Beats Paid TTS (Chatterbox Turbo Tested)](https://www.youtube.com/watch?v=3UN70ea4h9c)
+- [A few insights from running Chatterbox fully offline on iPhone and Mac](https://huggingface.co/ResembleAI/chatterbox/discussions/42)
+- [Blaizzy/mlx-audio](https://github.com/Blaizzy/mlx-audio/actions)
+- [Supertonic TTS is incredible — sharing our integration results](https://huggingface.co/Supertone/supertonic/discussions/13)
+- [Supertonic TTS: Ultra-Fast, Natural Voices on Windows & Mac](https://sonusahani.com/blogs/supertonic-tts)
+- [Supertonic Review: A New, Ultra-Fast and Lightweight AI Model](https://www.youtube.com/watch?v=g2CuKBqCeK0)
+- [The Top Open-Source Text to Speech (TTS) Models](https://modal.com/blog/open-source-tts)
+- [The Best Open-Source Text-to-Speech Models in 2026](https://www.bentoml.com/blog/exploring-the-world-of-open-source-text-to-speech-models)
+- [Embedding External Binaries](https://v2.tauri.app/develop/sidecar/)
+- [How Chatterbox Is Redefining Open-Source AI Voice](https://medium.com/%40vinodpolinati/chatterbox-the-leading-family-of-open-source-ai-voice-models-thats-changing-the-game-150991d5cf8d)
diff --git a/frontend/src-tauri/Cargo.lock b/frontend/src-tauri/Cargo.lock index f9b52bdd..9ed35c43 100644 --- a/frontend/src-tauri/Cargo.lock +++ b/frontend/src-tauri/Cargo.lock @@ -462,6 +462,12 @@ version = "0.22.1" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "72b3254f16251a8381aa12e40e3c4d2f0199f8c6508fbecb9d91f575e0fbb8c6" +[[package]] +name = "base64ct" +version = "1.8.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "55248b47b0caf0546f7988906588779981c43bb1bc9d0c44087278f80cdb44ba" + [[package]] name = "bitflags" version = "1.3.2" @@ -733,7 +739,7 @@ version = "0.15.8" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "d067ad48b8650848b989a59a86c6c36a995d02d2bf778d45c3c5d57bc2718f02" dependencies = [ - "smallvec", + "smallvec 1.15.1", "target-lexicon", ] @@ -976,6 +982,25 @@ dependencies = [ "crossbeam-utils", ] +[[package]] +name = "crossbeam-deque" +version = "0.8.6" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "9dd111b7b7f7d55b72c0a6ae361660ee5853c9af73f70c3c2ef6858b950e2e51" +dependencies = [ + "crossbeam-epoch", + "crossbeam-utils", +] + +[[package]] +name = "crossbeam-epoch" +version = "0.9.18" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "5b82ac4a3c2ca9c3460964f020e1402edd5753411d7737aa39c3714ad1b5420e" +dependencies = [ + "crossbeam-utils", +] + [[package]] name = "crossbeam-utils" version = "0.8.21" @@ -1012,7 +1037,7 @@ dependencies = [ "phf 0.10.1", "proc-macro2", "quote", - "smallvec
1.15.1", "syn 1.0.109", ] @@ -1103,6 +1128,16 @@ version = "2.9.0" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "2a2330da5de22e8a3cb63252ce2abb30116bf5265e89c0e01bc17015ce30a476" +[[package]] +name = "der" +version = "0.7.10" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "e7c1832837b905bbfb5101e07cc24c8deddf52f93225eee6ead5f4d63d53ddcb" +dependencies = [ + "pem-rfc7468", + "zeroize", +] + [[package]] name = "der-parser" version = "9.0.0" @@ -1161,13 +1196,34 @@ dependencies = [ "crypto-common", ] +[[package]] +name = "dirs" +version = "5.0.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "44c45a9d03d6676652bcb5e724c7e988de1acad23a711b5217ab9cbecbec2225" +dependencies = [ + "dirs-sys 0.4.1", +] + [[package]] name = "dirs" version = "6.0.0" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "c3e8aa94d75141228480295a7d0e7feb620b1a5ad9f12bc40be62411e38cce4e" dependencies = [ - "dirs-sys", + "dirs-sys 0.5.0", +] + +[[package]] +name = "dirs-sys" +version = "0.4.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "520f05a5cbd335fae5a99ff7a6ab8627577660ee5cfd6a94a6a929b52ff0321c" +dependencies = [ + "libc", + "option-ext", + "redox_users 0.4.6", + "windows-sys 0.48.0", ] [[package]] @@ -1178,8 +1234,8 @@ checksum = "e01a3366d27ee9890022452ee61b2b63a67e6f13f58900b651ff5665f0bb1fab" dependencies = [ "libc", "option-ext", - "redox_users", - "windows-sys 0.61.2", + "redox_users 0.5.2", + "windows-sys 0.59.0", ] [[package]] @@ -1283,6 +1339,12 @@ version = "1.0.20" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "d0881ea181b1df73ff77ffaaf9c7544ecc11e82fba9b5f27b262a3c73a332555" +[[package]] +name = "either" +version = "1.15.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "48c757948c5ede0e46177b7add2e67155f70e33c07fea8284df6576da70b3719" + [[package]] name = "embed-resource" version = "3.0.6" @@ -1373,7 +1435,7 @@ source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "39cab71617ae0d63f51a36d69f866391735b51691dbda63cf6f96d042b63efeb" dependencies = [ "libc", - "windows-sys 0.61.2", + "windows-sys 0.52.0", ] [[package]] @@ -1841,7 +1903,7 @@ dependencies = [ "libc", "once_cell", "pin-project-lite", - "smallvec", + "smallvec 1.15.1", "thiserror 1.0.69", ] @@ -1877,7 +1939,7 @@ dependencies = [ "libc", "memchr", "once_cell", - "smallvec", + "smallvec 1.15.1", "thiserror 1.0.69", ] @@ -2054,6 +2116,12 @@ version = "0.4.3" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "7f24254aa9a54b5c858eaee2f5bccdb46aaf0e486a595ed5fd8f86ba55232a70" +[[package]] +name = "hound" +version = "3.5.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "62adaabb884c94955b19907d60019f4e145d091c75345379e70d1ee696f7854f" + [[package]] name = "html5ever" version = "0.29.1" @@ -2130,7 +2198,7 @@ dependencies = [ "itoa", "pin-project-lite", "pin-utils", - "smallvec", + "smallvec 1.15.1", "tokio", "want", ] @@ -2264,7 +2332,7 @@ dependencies = [ "icu_normalizer_data", "icu_properties", "icu_provider", - "smallvec", + "smallvec 1.15.1", "zerovec", ] @@ -2322,7 +2390,7 @@ source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "3b0875f23caa03898994f6ddc501886a45c7d3d62d04d2d90788d47be1b1e4de" dependencies = [ "idna_adapter", - "smallvec", + "smallvec 1.15.1", "utf8_iter", ] @@ -2570,6 +2638,12 @@ dependencies = [ "winapi", ] 
+[[package]] +name = "libm" +version = "0.2.15" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "f9fbbcab51052fe104eb5e5d351cf728d30a5be1fe14d9be8a3b097481fb97de" + [[package]] name = "libredox" version = "0.1.10" @@ -2648,13 +2722,23 @@ dependencies = [ "anyhow", "axum", "base64 0.22.1", + "dirs 5.0.1", + "futures-util", + "hound", "log", "maple-proxy", + "ndarray", "once_cell", "openssl", + "ort", "pdf-extract", + "rand 0.8.5", + "rand_distr", + "regex", + "reqwest", "serde", "serde_json", + "sha2", "tauri", "tauri-build", "tauri-plugin", @@ -2667,6 +2751,7 @@ dependencies = [ "tauri-plugin-single-instance", "tauri-plugin-updater", "tokio", + "unicode-normalization", ] [[package]] @@ -2738,6 +2823,16 @@ version = "0.8.4" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "47e1ffaa40ddd1f3ed91f717a33c8c0ee23fff369e3aa8772b9605cc1d22f4c3" +[[package]] +name = "matrixmultiply" +version = "0.3.10" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "a06de3016e9fae57a36fd14dba131fccf49f74b40b7fbdb472f96e361ec71a08" +dependencies = [ + "autocfg", + "rawpointer", +] + [[package]] name = "md-5" version = "0.10.6" @@ -2840,6 +2935,22 @@ dependencies = [ "tempfile", ] +[[package]] +name = "ndarray" +version = "0.16.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "882ed72dce9365842bf196bdeedf5055305f11fc8c03dee7bb0194a6cad34841" +dependencies = [ + "matrixmultiply", + "num-complex", + "num-integer", + "num-traits", + "portable-atomic", + "portable-atomic-util", + "rawpointer", + "rayon", +] + [[package]] name = "ndk" version = "0.9.0" @@ -2911,7 +3022,7 @@ version = "0.50.3" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "7957b9740744892f114936ab4a57b3f487491bbeafaf8083688b16841a4240e5" dependencies = [ - "windows-sys 0.61.2", + "windows-sys 0.59.0", ] [[package]] @@ -2924,6 +3035,15 @@ dependencies = [ "num-traits", ] +[[package]] +name = "num-complex" +version = "0.4.6" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "73f88a1307638156682bada9d7604135552957b7818057dcef22705b4d509495" +dependencies = [ + "num-traits", +] + [[package]] name = "num-conv" version = "0.1.0" @@ -2946,6 +3066,7 @@ source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "071dfc062690e90b734c0b2273ce72ad0ffa95f0c74596bc250dcfd960262841" dependencies = [ "autocfg", + "libm", ] [[package]] @@ -3408,6 +3529,31 @@ dependencies = [ "pin-project-lite", ] +[[package]] +name = "ort" +version = "2.0.0-rc.10" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "1fa7e49bd669d32d7bc2a15ec540a527e7764aec722a45467814005725bcd721" +dependencies = [ + "ndarray", + "ort-sys", + "smallvec 2.0.0-alpha.10", + "tracing", +] + +[[package]] +name = "ort-sys" +version = "2.0.0-rc.10" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "e2aba9f5c7c479925205799216e7e5d07cc1d4fa76ea8058c60a9a30f6a4e890" +dependencies = [ + "flate2", + "pkg-config", + "sha2", + "tar", + "ureq", +] + [[package]] name = "os_info" version = "3.12.0" @@ -3484,7 +3630,7 @@ dependencies = [ "cfg-if", "libc", "redox_syscall", - "smallvec", + "smallvec 1.15.1", "windows-link 0.2.1", ] @@ -3509,6 +3655,15 @@ dependencies = [ "unicode-normalization", ] +[[package]] +name = "pem-rfc7468" +version = "0.7.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = 
"88b39c9bfcfc231068454382784bb460aae594343fb030d46e9f50a645418412" +dependencies = [ + "base64ct", +] + [[package]] name = "percent-encoding" version = "2.3.2" @@ -3755,6 +3910,21 @@ version = "1.1.0" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "60f6ce597ecdcc9a098e7fddacb1065093a3d66446fa16c675e7e71d1b5c28e6" +[[package]] +name = "portable-atomic" +version = "1.11.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "f84267b20a16ea918e43c6a88433c2d54fa145c92a811b5b047ccbe153674483" + +[[package]] +name = "portable-atomic-util" +version = "0.2.4" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "d8a2f0d8d040d7848a709caf78912debcc3f33ee4b3cac47d73d1e1069e83507" +dependencies = [ + "portable-atomic", +] + [[package]] name = "postscript" version = "0.14.1" @@ -3940,7 +4110,7 @@ dependencies = [ "once_cell", "socket2", "tracing", - "windows-sys 0.60.2", + "windows-sys 0.52.0", ] [[package]] @@ -4056,6 +4226,16 @@ dependencies = [ "getrandom 0.3.4", ] +[[package]] +name = "rand_distr" +version = "0.4.3" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "32cb0b9bc82b0a0876c2dd994a7e7a2683d3e7390ca40e6886785ef0c7e3ee31" +dependencies = [ + "num-traits", + "rand 0.8.5", +] + [[package]] name = "rand_hc" version = "0.2.0" @@ -4086,6 +4266,32 @@ version = "0.6.2" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "20675572f6f24e9e76ef639bc5552774ed45f1c30e2951e1e99c59888861c539" +[[package]] +name = "rawpointer" +version = "0.2.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "60a357793950651c4ed0f3f52338f53b2f809f32d83a07f72909fa13e4c6c1e3" + +[[package]] +name = "rayon" +version = "1.11.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "368f01d005bf8fd9b1206fb6fa653e6c4a81ceb1466406b81792d87c5677a58f" +dependencies = [ + "either", + "rayon-core", +] + +[[package]] +name = "rayon-core" +version = "1.13.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "22e18b0f0062d30d4230b2e85ff77fdfe4326feb054b9783a3460d8435c8ab91" +dependencies = [ + "crossbeam-deque", + "crossbeam-utils", +] + [[package]] name = "redox_syscall" version = "0.5.18" @@ -4095,6 +4301,17 @@ dependencies = [ "bitflags 2.10.0", ] +[[package]] +name = "redox_users" +version = "0.4.6" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "ba009ff324d1fc1b900bd1fdb31564febe58a8ccc8a6fdbb93b543d33b13ca43" +dependencies = [ + "getrandom 0.2.16", + "libredox", + "thiserror 1.0.69", +] + [[package]] name = "redox_users" version = "0.5.2" @@ -4314,7 +4531,7 @@ dependencies = [ "errno", "libc", "linux-raw-sys", - "windows-sys 0.61.2", + "windows-sys 0.52.0", ] [[package]] @@ -4483,7 +4700,7 @@ dependencies = [ "phf_codegen 0.8.0", "precomputed-hash", "servo_arc", - "smallvec", + "smallvec 1.15.1", ] [[package]] @@ -4758,6 +4975,12 @@ version = "1.15.1" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "67b1b7a3b5fe4f1376887184045fcf45c69e92af734b7aaddc05fb777b6fbd03" +[[package]] +name = "smallvec" +version = "2.0.0-alpha.10" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "51d44cfb396c3caf6fbfd0ab422af02631b69ddd96d2eff0b0f0724f9024051b" + [[package]] name = "socket2" version = "0.6.1" @@ -4768,6 +4991,17 @@ dependencies = [ "windows-sys 0.60.2", ] +[[package]] +name = "socks" +version = "0.3.4" +source = 
"registry+https://github.com/rust-lang/crates.io-index" +checksum = "f0c3dbbd9ae980613c6dd8e28a9407b50509d3803b57624d5dfe8315218cd58b" +dependencies = [ + "byteorder", + "libc", + "winapi", +] + [[package]] name = "softbuffer" version = "0.4.6" @@ -5044,7 +5278,7 @@ dependencies = [ "anyhow", "bytes", "cookie", - "dirs", + "dirs 6.0.0", "dunce", "embed_plist", "getrandom 0.3.4", @@ -5094,7 +5328,7 @@ checksum = "a924b6c50fe83193f0f8b14072afa7c25b7a72752a2a73d9549b463f5fe91a38" dependencies = [ "anyhow", "cargo_toml", - "dirs", + "dirs 6.0.0", "glob", "heck 0.5.0", "json-patch", @@ -5306,7 +5540,7 @@ source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "27cbc31740f4d507712550694749572ec0e43bdd66992db7599b89fbfd6b167b" dependencies = [ "base64 0.22.1", - "dirs", + "dirs 6.0.0", "flate2", "futures-util", "http", @@ -5441,7 +5675,7 @@ dependencies = [ "getrandom 0.3.4", "once_cell", "rustix", - "windows-sys 0.61.2", + "windows-sys 0.52.0", ] [[package]] @@ -5834,7 +6068,7 @@ dependencies = [ "once_cell", "regex-automata", "sharded-slab", - "smallvec", + "smallvec 1.15.1", "thread_local", "tracing", "tracing-core", @@ -5848,7 +6082,7 @@ source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "e3d5572781bee8e3f994d7467084e1b1fd7a93ce66bd480f8156ba89dee55a2b" dependencies = [ "crossbeam-channel", - "dirs", + "dirs 6.0.0", "libappindicator", "muda", "objc2 0.6.3", @@ -5979,6 +6213,36 @@ version = "0.9.0" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "8ecb6da28b8a351d773b68d5825ac39017e680750f980f3a1a85cd8dd28a47c1" +[[package]] +name = "ureq" +version = "3.1.4" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "d39cb1dbab692d82a977c0392ffac19e188bd9186a9f32806f0aaa859d75585a" +dependencies = [ + "base64 0.22.1", + "der", + "log", + "native-tls", + "percent-encoding", + "rustls-pki-types", + "socks", + "ureq-proto", + "utf-8", + "webpki-root-certs", +] + +[[package]] +name = "ureq-proto" +version = "0.5.3" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "d81f9efa9df032be5934a46a068815a10a042b494b6a58cb0a1a97bb5467ed6f" +dependencies = [ + "base64 0.22.1", + "http", + "httparse", + "log", +] + [[package]] name = "url" version = "2.5.7" @@ -6264,6 +6528,15 @@ dependencies = [ "system-deps", ] +[[package]] +name = "webpki-root-certs" +version = "1.0.4" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "ee3e3b5f5e80bc89f30ce8d0343bf4e5f12341c51f3e26cbeecbc7c85443e85b" +dependencies = [ + "rustls-pki-types", +] + [[package]] name = "webpki-roots" version = "1.0.4" @@ -6337,7 +6610,7 @@ version = "0.1.11" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "c2a7b1c03c876122aa43f3020e6c3c3ee5c05081c9a00739faf7503aeba10d22" dependencies = [ - "windows-sys 0.61.2", + "windows-sys 0.48.0", ] [[package]] @@ -6520,6 +6793,15 @@ dependencies = [ "windows-targets 0.42.2", ] +[[package]] +name = "windows-sys" +version = "0.48.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "677d2418bec65e3338edb076e806bc1ec15693c5d0104683f2efe857f61056a9" +dependencies = [ + "windows-targets 0.48.5", +] + [[package]] name = "windows-sys" version = "0.52.0" @@ -6571,6 +6853,21 @@ dependencies = [ "windows_x86_64_msvc 0.42.2", ] +[[package]] +name = "windows-targets" +version = "0.48.5" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = 
"9a2fa6e2155d7247be68c096456083145c183cbbbc2764150dda45a87197940c" +dependencies = [ + "windows_aarch64_gnullvm 0.48.5", + "windows_aarch64_msvc 0.48.5", + "windows_i686_gnu 0.48.5", + "windows_i686_msvc 0.48.5", + "windows_x86_64_gnu 0.48.5", + "windows_x86_64_gnullvm 0.48.5", + "windows_x86_64_msvc 0.48.5", +] + [[package]] name = "windows-targets" version = "0.52.6" @@ -6628,6 +6925,12 @@ version = "0.42.2" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "597a5118570b68bc08d8d59125332c54f1ba9d9adeedeef5b99b02ba2b0698f8" +[[package]] +name = "windows_aarch64_gnullvm" +version = "0.48.5" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "2b38e32f0abccf9987a4e3079dfb67dcd799fb61361e53e2882c3cbaf0d905d8" + [[package]] name = "windows_aarch64_gnullvm" version = "0.52.6" @@ -6646,6 +6949,12 @@ version = "0.42.2" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "e08e8864a60f06ef0d0ff4ba04124db8b0fb3be5776a5cd47641e942e58c4d43" +[[package]] +name = "windows_aarch64_msvc" +version = "0.48.5" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "dc35310971f3b2dbbf3f0690a219f40e2d9afcf64f9ab7cc1be722937c26b4bc" + [[package]] name = "windows_aarch64_msvc" version = "0.52.6" @@ -6664,6 +6973,12 @@ version = "0.42.2" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "c61d927d8da41da96a81f029489353e68739737d3beca43145c8afec9a31a84f" +[[package]] +name = "windows_i686_gnu" +version = "0.48.5" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "a75915e7def60c94dcef72200b9a8e58e5091744960da64ec734a6c6e9b3743e" + [[package]] name = "windows_i686_gnu" version = "0.52.6" @@ -6694,6 +7009,12 @@ version = "0.42.2" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "44d840b6ec649f480a41c8d80f9c65108b92d89345dd94027bfe06ac444d1060" +[[package]] +name = "windows_i686_msvc" +version = "0.48.5" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "8f55c233f70c4b27f66c523580f78f1004e8b5a8b659e05a4eb49d4166cca406" + [[package]] name = "windows_i686_msvc" version = "0.52.6" @@ -6712,6 +7033,12 @@ version = "0.42.2" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "8de912b8b8feb55c064867cf047dda097f92d51efad5b491dfb98f6bbb70cb36" +[[package]] +name = "windows_x86_64_gnu" +version = "0.48.5" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "53d40abd2583d23e4718fddf1ebec84dbff8381c07cae67ff7768bbf19c6718e" + [[package]] name = "windows_x86_64_gnu" version = "0.52.6" @@ -6730,6 +7057,12 @@ version = "0.42.2" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "26d41b46a36d453748aedef1486d5c7a85db22e56aff34643984ea85514e94a3" +[[package]] +name = "windows_x86_64_gnullvm" +version = "0.48.5" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "0b7b52767868a23d5bab768e390dc5f5c55825b6d30b86c844ff2dc7414044cc" + [[package]] name = "windows_x86_64_gnullvm" version = "0.52.6" @@ -6748,6 +7081,12 @@ version = "0.42.2" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "9aec5da331524158c6d1a4ac0ab1541149c0b9505fde06423b02f5ef0106b9f0" +[[package]] +name = "windows_x86_64_msvc" +version = "0.48.5" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "ed94fce61571a4006852b7389a063ab983c02eb1bb37b47f8272ce92d06d9538" + [[package]] name = "windows_x86_64_msvc" version = 
"0.52.6" @@ -6810,7 +7149,7 @@ dependencies = [ "block2 0.6.2", "cookie", "crossbeam-channel", - "dirs", + "dirs 6.0.0", "dpi", "dunce", "gdkx11", diff --git a/frontend/src-tauri/Cargo.toml b/frontend/src-tauri/Cargo.toml index 6379a147..2e4a8b4a 100644 --- a/frontend/src-tauri/Cargo.toml +++ b/frontend/src-tauri/Cargo.toml @@ -39,5 +39,19 @@ axum = "0.8" pdf-extract = "0.7" base64 = "0.22" +[target.'cfg(any(target_os = "macos", target_os = "linux", target_os = "windows"))'.dependencies] +# TTS dependencies (Supertonic) - desktop only +ort = "2.0.0-rc.10" +ndarray = { version = "0.16", features = ["rayon"] } +rand = "0.8" +rand_distr = "0.4" +hound = "3.5" +unicode-normalization = "0.1" +regex = "1.10" +reqwest = { version = "0.12", features = ["stream"] } +futures-util = "0.3" +dirs = "5.0" +sha2 = "0.10" + [target.'cfg(target_os = "android")'.dependencies] openssl = { version = "0.10", default-features = false, features = ["vendored"] } diff --git a/frontend/src-tauri/src/lib.rs b/frontend/src-tauri/src/lib.rs index 966d4501..1529e132 100644 --- a/frontend/src-tauri/src/lib.rs +++ b/frontend/src-tauri/src/lib.rs @@ -3,6 +3,8 @@ use tauri_plugin_deep_link::DeepLinkExt; mod pdf_extractor; mod proxy; +#[cfg(desktop)] +mod tts; #[cfg(desktop)] #[tauri::command] @@ -34,6 +36,7 @@ pub fn run() { .plugin(tauri_plugin_os::init()) .plugin(tauri_plugin_fs::init()) .manage(proxy::ProxyState::new()) + .manage(tts::TTSState::new()) .invoke_handler(tauri::generate_handler![ proxy::start_proxy, proxy::stop_proxy, @@ -43,6 +46,12 @@ pub fn run() { proxy::test_proxy_port, pdf_extractor::extract_document_content, restart_for_update, + tts::tts_get_status, + tts::tts_download_models, + tts::tts_load_models, + tts::tts_synthesize, + tts::tts_unload_models, + tts::tts_delete_models, ]) .setup(|app| { // Initialize proxy auto-start diff --git a/frontend/src-tauri/src/tts.rs b/frontend/src-tauri/src/tts.rs new file mode 100644 index 00000000..b132bb13 --- /dev/null +++ b/frontend/src-tauri/src/tts.rs @@ -0,0 +1,994 @@ +use anyhow::{Context, Result}; +use base64::{engine::general_purpose::STANDARD as BASE64, Engine}; +use futures_util::StreamExt; +use hound::{SampleFormat, WavSpec, WavWriter}; +use ndarray::{Array, Array3}; +use once_cell::sync::Lazy; +use ort::{session::Session, value::Value}; +use rand::thread_rng; +use rand_distr::{Distribution, Normal}; +use regex::Regex; +use serde::{Deserialize, Serialize}; +use sha2::{Digest, Sha256}; +use std::fs::{self, File}; +use std::io::{BufReader, Cursor, Write}; +use std::path::{Path, PathBuf}; +use std::sync::Mutex; +use tauri::{AppHandle, Emitter}; +use unicode_normalization::UnicodeNormalization; + +// Pre-compiled regexes for text preprocessing (compiled once, reused) +static RE_BOLD: Lazy = Lazy::new(|| Regex::new(r"\*\*([^*]+)\*\*").unwrap()); +static RE_BOLD2: Lazy = Lazy::new(|| Regex::new(r"__([^_]+)__").unwrap()); +static RE_ITALIC: Lazy = Lazy::new(|| Regex::new(r"\*([^*]+)\*").unwrap()); +static RE_ITALIC2: Lazy = Lazy::new(|| Regex::new(r"_([^_\s][^_]*)_").unwrap()); +static RE_STRIKE: Lazy = Lazy::new(|| Regex::new(r"~~([^~]+)~~").unwrap()); +static RE_CODE: Lazy = Lazy::new(|| Regex::new(r"`([^`]+)`").unwrap()); +static RE_CODEBLOCK: Lazy = Lazy::new(|| Regex::new(r"(?s)```[^`]*```").unwrap()); +static RE_HEADER: Lazy = Lazy::new(|| Regex::new(r"(?m)^#{1,6}\s*").unwrap()); +static RE_EMOJI: Lazy = Lazy::new(|| { + 
Regex::new(r"[\x{1F600}-\x{1F64F}\x{1F300}-\x{1F5FF}\x{1F680}-\x{1F6FF}\x{1F700}-\x{1F77F}\x{1F780}-\x{1F7FF}\x{1F800}-\x{1F8FF}\x{1F900}-\x{1F9FF}\x{1FA00}-\x{1FA6F}\x{1FA70}-\x{1FAFF}\x{2600}-\x{26FF}\x{2700}-\x{27BF}\x{1F1E6}-\x{1F1FF}]+").unwrap() +}); +static RE_DIACRITICS: Lazy = Lazy::new(|| { + Regex::new(r"[\u{0302}\u{0303}\u{0304}\u{0305}\u{0306}\u{0307}\u{0308}\u{030A}\u{030B}\u{030C}\u{0327}\u{0328}\u{0329}\u{032A}\u{032B}\u{032C}\u{032D}\u{032E}\u{032F}]").unwrap() +}); +static RE_SPACE_COMMA: Lazy = Lazy::new(|| Regex::new(r" ,").unwrap()); +static RE_SPACE_PERIOD: Lazy = Lazy::new(|| Regex::new(r" \.").unwrap()); +static RE_SPACE_EXCL: Lazy = Lazy::new(|| Regex::new(r" !").unwrap()); +static RE_SPACE_QUEST: Lazy = Lazy::new(|| Regex::new(r" \?").unwrap()); +static RE_SPACE_SEMI: Lazy = Lazy::new(|| Regex::new(r" ;").unwrap()); +static RE_SPACE_COLON: Lazy = Lazy::new(|| Regex::new(r" :").unwrap()); +static RE_SPACE_APOS: Lazy = Lazy::new(|| Regex::new(r" '").unwrap()); +static RE_DUP_DQUOTE: Lazy = Lazy::new(|| Regex::new(r#""{2,}"#).unwrap()); +static RE_DUP_SQUOTE: Lazy = Lazy::new(|| Regex::new(r"'{2,}").unwrap()); +static RE_DUP_BTICK: Lazy = Lazy::new(|| Regex::new(r"`{2,}").unwrap()); +static RE_MULTI_SPACE: Lazy = Lazy::new(|| Regex::new(r"\s+").unwrap()); +static RE_ENDS_PUNCT: Lazy = Lazy::new(|| { + Regex::new(r#"[.!?;:,'"\u{201C}\u{201D}\u{2018}\u{2019})\]}…。」』】〉》›»]$"#).unwrap() +}); +static RE_SENTENCE: Lazy = Lazy::new(|| Regex::new(r"([.!?])\s+").unwrap()); + +// Pin model downloads to a specific repo revision to ensure integrity and reproducibility. +const HUGGINGFACE_REVISION: &str = "b6856d033f622c63ea29441795be266a1133e227"; +const HUGGINGFACE_BASE_URL: &str = "https://huggingface.co/Supertone/supertonic/resolve"; + +// (file_name, url_path, expected_size_bytes, expected_sha256_hex) +const MODEL_FILES: &[(&str, &str, u64, &str)] = &[ + ( + "duration_predictor.onnx", + "onnx/duration_predictor.onnx", + 1_500_789, + "b861580c56a0cba2a2b82aa697ecb3c5a163c3240c60a0ddfac369d21d054092", + ), + ( + "text_encoder.onnx", + "onnx/text_encoder.onnx", + 27_348_373, + "ba0c8ea74aeb5df00d21a89b8d47c71317f47120232e3deef95024dba37dbd88", + ), + ( + "vector_estimator.onnx", + "onnx/vector_estimator.onnx", + 132_471_364, + "b3f82ecd2e9decc4e2236048b03628a1c1d5f14a792ba274a59b7325107aa6a6", + ), + ( + "vocoder.onnx", + "onnx/vocoder.onnx", + 101_405_066, + "19bd51f47a186069c752403518a40f7ea4c647455056d2511f7249691ecddf7c", + ), + ( + "tts.json", + "onnx/tts.json", + 8_645, + "4dac5f986698a3ace9a97ea2545d43f6c8ba120d25e005f8c905128281be9b6d", + ), + ( + "unicode_indexer.json", + "onnx/unicode_indexer.json", + 262_134, + "0c3800ba4fb1fc760c9070eb43a0ad5a68279ec165742591a68ea3edca452978", + ), + ( + "F1.json", + "voice_styles/F1.json", + 420_622, + "1450bcad84a2790eaf73f85e763dd5bae7c399f55d692c4835cf4f7686b5a10f", + ), + ( + "F2.json", + "voice_styles/F2.json", + 420_905, + "47c8d44445ef8ac8aae8ef5806feca21903483cbd4f1232e405184a40520a549", + ), + ( + "M1.json", + "voice_styles/M1.json", + 421_053, + "273c9ba6582d2e00383d8fbe2f5d660d86e8fba849c91ff695384d1a6e2e02f1", + ), + ( + "M2.json", + "voice_styles/M2.json", + 421_027, + "26898a9ec3de1b5bf8cc3f6cbf41930543ca0403f2201e12aad849691ff315dd", + ), +]; + +const TOTAL_MODEL_SIZE: u64 = 264_679_978; // bytes + +fn bytes_to_hex(bytes: &[u8]) -> String { + const HEX: &[u8; 16] = b"0123456789abcdef"; + let mut out = String::with_capacity(bytes.len() * 2); + for &b in bytes { + out.push(HEX[(b >> 4) as usize] as char); + 
out.push(HEX[(b & 0x0f) as usize] as char); + } + out +} + +#[derive(Debug, Clone, Serialize, Deserialize)] +pub struct Config { + pub ae: AEConfig, + pub ttl: TTLConfig, +} + +#[derive(Debug, Clone, Serialize, Deserialize)] +pub struct AEConfig { + pub sample_rate: i32, + pub base_chunk_size: i32, +} + +#[derive(Debug, Clone, Serialize, Deserialize)] +pub struct TTLConfig { + pub chunk_compress_factor: i32, + pub latent_dim: i32, +} + +#[derive(Debug, Clone, Serialize, Deserialize)] +pub struct VoiceStyleData { + pub style_ttl: StyleComponent, + pub style_dp: StyleComponent, +} + +#[derive(Debug, Clone, Serialize, Deserialize)] +pub struct StyleComponent { + pub data: Vec>>, + pub dims: Vec, + #[serde(rename = "type")] + pub dtype: String, +} + +#[derive(Clone)] +pub struct Style { + pub ttl: Array3, + pub dp: Array3, +} + +struct UnicodeProcessor { + indexer: Vec, +} + +impl UnicodeProcessor { + fn new(indexer: Vec) -> Self { + UnicodeProcessor { indexer } + } + + fn call(&self, text_list: &[String]) -> (Vec>, Array3) { + // Text should already be preprocessed before reaching here + let text_ids_lengths: Vec = text_list.iter().map(|t| t.chars().count()).collect(); + let max_len = *text_ids_lengths.iter().max().unwrap_or(&0); + + let mut text_ids = Vec::new(); + for text in text_list { + let mut row = vec![0i64; max_len]; + let unicode_vals: Vec = text.chars().map(|c| c as usize).collect(); + for (j, &val) in unicode_vals.iter().enumerate() { + if val < self.indexer.len() { + row[j] = self.indexer[val]; + } else { + // Use 0 (padding token) for out-of-vocabulary characters + row[j] = 0; + } + } + text_ids.push(row); + } + + let text_mask = length_to_mask(&text_ids_lengths, Some(max_len)); + (text_ids, text_mask) + } +} + +fn preprocess_text(text: &str) -> String { + let mut text: String = text.nfkd().collect(); + + // Remove markdown formatting (using pre-compiled regexes) + text = RE_BOLD.replace_all(&text, "$1").to_string(); + text = RE_BOLD2.replace_all(&text, "$1").to_string(); + text = RE_ITALIC.replace_all(&text, "$1").to_string(); + text = RE_ITALIC2.replace_all(&text, "$1").to_string(); + text = RE_STRIKE.replace_all(&text, "$1").to_string(); + text = RE_CODE.replace_all(&text, "$1").to_string(); + text = RE_CODEBLOCK.replace_all(&text, "").to_string(); + text = RE_HEADER.replace_all(&text, "").to_string(); + text = RE_EMOJI.replace_all(&text, "").to_string(); + + // Replace various dashes and symbols + let replacements = [ + ("–", "-"), + ("‑", "-"), + ("—", "-"), + ("¯", " "), + ("\u{201C}", "\""), + ("\u{201D}", "\""), + ("\u{2018}", "'"), + ("\u{2019}", "'"), + ("´", "'"), + ("`", "'"), + ("[", " "), + ("]", " "), + ("|", " "), + ("/", " "), + ("#", " "), + ("→", " "), + ("←", " "), + ]; + for (from, to) in &replacements { + text = text.replace(from, to); + } + + text = RE_DIACRITICS.replace_all(&text, "").to_string(); + + // Remove special symbols + for symbol in &["♥", "☆", "♡", "©", "\\"] { + text = text.replace(symbol, ""); + } + + // Replace known expressions + text = text.replace("@", " at "); + text = text.replace("e.g.,", "for example, "); + text = text.replace("i.e.,", "that is, "); + + // Fix spacing around punctuation + text = RE_SPACE_COMMA.replace_all(&text, ",").to_string(); + text = RE_SPACE_PERIOD.replace_all(&text, ".").to_string(); + text = RE_SPACE_EXCL.replace_all(&text, "!").to_string(); + text = RE_SPACE_QUEST.replace_all(&text, "?").to_string(); + text = RE_SPACE_SEMI.replace_all(&text, ";").to_string(); + text = RE_SPACE_COLON.replace_all(&text, 
":").to_string(); + text = RE_SPACE_APOS.replace_all(&text, "'").to_string(); + + // Remove duplicate quotes (single regex pass instead of while loop) + text = RE_DUP_DQUOTE.replace_all(&text, "\"").to_string(); + text = RE_DUP_SQUOTE.replace_all(&text, "'").to_string(); + text = RE_DUP_BTICK.replace_all(&text, "`").to_string(); + + // Remove extra spaces + text = RE_MULTI_SPACE.replace_all(&text, " ").to_string(); + text = text.trim().to_string(); + + // Add period if no ending punctuation + if !text.is_empty() && !RE_ENDS_PUNCT.is_match(&text) { + text.push('.'); + } + text +} + +fn length_to_mask(lengths: &[usize], max_len: Option) -> Array3 { + let bsz = lengths.len(); + let max_len = max_len.unwrap_or_else(|| *lengths.iter().max().unwrap_or(&0)); + let mut mask = Array3::::zeros((bsz, 1, max_len)); + for (i, &len) in lengths.iter().enumerate() { + for j in 0..len.min(max_len) { + mask[[i, 0, j]] = 1.0; + } + } + mask +} + +fn sample_noisy_latent( + duration: &[f32], + sample_rate: i32, + base_chunk_size: i32, + chunk_compress: i32, + latent_dim: i32, +) -> (Array3, Array3) { + let bsz = duration.len(); + let max_dur = duration.iter().fold(0.0f32, |a, &b| a.max(b)); + let wav_len_max = (max_dur * sample_rate as f32) as usize; + let wav_lengths: Vec = duration + .iter() + .map(|&d| (d * sample_rate as f32) as usize) + .collect(); + + let chunk_size = (base_chunk_size * chunk_compress) as usize; + let latent_len = wav_len_max.div_ceil(chunk_size); + let latent_dim_val = (latent_dim * chunk_compress) as usize; + + let mut noisy_latent = Array3::::zeros((bsz, latent_dim_val, latent_len)); + let normal = Normal::new(0.0, 1.0).unwrap(); + let mut rng = thread_rng(); + + for b in 0..bsz { + for d in 0..latent_dim_val { + for t in 0..latent_len { + noisy_latent[[b, d, t]] = normal.sample(&mut rng); + } + } + } + + let latent_lengths: Vec = wav_lengths + .iter() + .map(|&len| len.div_ceil(chunk_size)) + .collect(); + let latent_mask = length_to_mask(&latent_lengths, Some(latent_len)); + + // Apply mask + for b in 0..bsz { + for d in 0..latent_dim_val { + for t in 0..latent_len { + noisy_latent[[b, d, t]] *= latent_mask[[b, 0, t]]; + } + } + } + (noisy_latent, latent_mask) +} + +/// Split text by words when it exceeds max_len +fn split_by_words(text: &str, max_len: usize) -> Vec { + let mut result = Vec::new(); + let mut current = String::new(); + + for word in text.split_whitespace() { + if current.len() + word.len() + 1 > max_len && !current.is_empty() { + result.push(current.trim().to_string()); + current.clear(); + } + if !current.is_empty() { + current.push(' '); + } + current.push_str(word); + } + + if !current.is_empty() { + result.push(current.trim().to_string()); + } + result +} + +fn chunk_text(text: &str, max_len: usize) -> Vec { + let text = text.trim(); + if text.is_empty() { + return vec![String::new()]; + } + + static RE_PARA: Lazy = Lazy::new(|| Regex::new(r"\n\s*\n").unwrap()); + let paragraphs: Vec<&str> = RE_PARA.split(text).collect(); + let mut chunks = Vec::new(); + + for para in paragraphs { + let para = para.trim(); + if para.is_empty() { + continue; + } + + if para.len() <= max_len { + chunks.push(para.to_string()); + continue; + } + + // Split by sentence boundaries, keeping punctuation + let mut current = String::new(); + let mut last_end = 0; + + for m in RE_SENTENCE.find_iter(para) { + let sentence = para[last_end..m.start() + 1].trim(); // +1 to include punctuation + last_end = m.end(); + + if sentence.is_empty() { + continue; + } + + // If single sentence exceeds 
+fn chunk_text(text: &str, max_len: usize) -> Vec<String> {
+    let text = text.trim();
+    if text.is_empty() {
+        return vec![String::new()];
+    }
+
+    static RE_PARA: Lazy<Regex> = Lazy::new(|| Regex::new(r"\n\s*\n").unwrap());
+    let paragraphs: Vec<&str> = RE_PARA.split(text).collect();
+    let mut chunks = Vec::new();
+
+    for para in paragraphs {
+        let para = para.trim();
+        if para.is_empty() {
+            continue;
+        }
+
+        if para.len() <= max_len {
+            chunks.push(para.to_string());
+            continue;
+        }
+
+        // Split by sentence boundaries, keeping punctuation
+        let mut current = String::new();
+        let mut last_end = 0;
+
+        for m in RE_SENTENCE.find_iter(para) {
+            let sentence = para[last_end..m.start() + 1].trim(); // +1 to include punctuation
+            last_end = m.end();
+
+            if sentence.is_empty() {
+                continue;
+            }
+
+            // If a single sentence exceeds max_len, split by words
+            if sentence.len() > max_len {
+                if !current.is_empty() {
+                    chunks.push(current.trim().to_string());
+                    current.clear();
+                }
+                chunks.extend(split_by_words(sentence, max_len));
+                continue;
+            }
+
+            if current.len() + sentence.len() + 1 > max_len && !current.is_empty() {
+                chunks.push(current.trim().to_string());
+                current.clear();
+            }
+
+            if !current.is_empty() {
+                current.push(' ');
+            }
+            current.push_str(sentence);
+        }
+
+        // Remaining text after last sentence boundary
+        let remaining = para[last_end..].trim();
+        if !remaining.is_empty() {
+            // If remaining exceeds max_len, split by words
+            if remaining.len() > max_len {
+                if !current.is_empty() {
+                    chunks.push(current.trim().to_string());
+                }
+                chunks.extend(split_by_words(remaining, max_len));
+            } else if current.len() + remaining.len() + 1 > max_len && !current.is_empty() {
+                chunks.push(current.trim().to_string());
+                chunks.push(remaining.to_string());
+            } else {
+                if !current.is_empty() {
+                    current.push(' ');
+                }
+                current.push_str(remaining);
+                chunks.push(current.trim().to_string());
+            }
+        } else if !current.is_empty() {
+            chunks.push(current.trim().to_string());
+        }
+    }
+
+    if chunks.is_empty() {
+        vec![String::new()]
+    } else {
+        chunks
+    }
+}
+
+pub struct TTSState {
+    tts: Option,
+    style: Option