## Nvidia Triton Inference Serving Best Practice for Spark TTS

### Quick Start
Launch the service directly with Docker Compose.
```sh
docker compose up
```
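
To run the service in the background and follow the server logs instead:
```sh
docker compose up -d
docker compose logs -f
```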

### Build Image
Build the Docker image from scratch.
```sh
docker build . -f Dockerfile.server -t soar97/triton-spark-tts:25.02
```

### Create Docker Container
```sh
your_mount_dir=/mnt:/mnt
docker run -it --name "spark-tts-server" --gpus all --net host -v $your_mount_dir --shm-size=2g soar97/triton-spark-tts:25.02
```
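
If you exit the container, it keeps its state and can be re-attached with standard Docker commands:
```sh
# Restart the stopped container and open a shell inside it
docker start spark-tts-server
docker exec -it spark-tts-server bash
```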

### Understanding `run.sh`

The `run.sh` script automates the workflow in stages; a sketch of the stage-gating pattern it uses follows the stage list below. You can run a specific range of stages with:
```sh
bash run.sh <start_stage> <stop_stage> [service_type]
```
- `<start_stage>`: The stage to begin execution from (0-5).
- `<stop_stage>`: The stage to end execution at (0-5).
- `[service_type]`: `streaming` or `offline`. Required for stages 4 and 5; earlier stages fall back to the script's default when it is omitted.

Stages:
- **Stage 0**: Download the Spark-TTS-0.5B model from HuggingFace.
- **Stage 1**: Convert the HuggingFace checkpoint to TensorRT-LLM format and build the TensorRT engines.
- **Stage 2**: Create the Triton model repository structure and configure the model files (adjusts for streaming/offline).
- **Stage 3**: Launch the Triton Inference Server.
- **Stage 4**: Run the gRPC benchmark client.
- **Stage 5**: Run the single-utterance client (gRPC for streaming, HTTP for offline).
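
Internally, `run.sh` follows the common stage-gating shell pattern: each stage runs only when it falls inside the requested range. A minimal sketch of that pattern (variable names and commands are illustrative, not copied from the actual script):
```sh
stage=$1        # <start_stage>
stop_stage=$2   # <stop_stage>
service_type=$3 # optional: streaming | offline

if [ "$stage" -le 0 ] && [ "$stop_stage" -ge 0 ]; then
  echo "Stage 0: download Spark-TTS-0.5B from HuggingFace"
  # e.g. huggingface-cli download ... (placeholder)
fi

if [ "$stage" -le 1 ] && [ "$stop_stage" -ge 1 ]; then
  echo "Stage 1: convert the checkpoint and build TensorRT-LLM engines"
  # e.g. python3 convert_checkpoint.py ... && trtllm-build ... (placeholder)
fi

# ... stages 2-5 follow the same if-block pattern
```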

### Export Models to TensorRT-LLM and Launch Server
Inside the docker container, you can prepare the models and launch the Triton server by running stages 0 through 3. This involves downloading the original model, converting it to TensorRT-LLM format, building the optimized TensorRT engines, creating the necessary model repository structure for Triton, and finally starting the server.
```sh
# This runs stages 0, 1, 2, and 3
bash run.sh 0 3
```
*Note: Stage 2 prepares the model repository differently based on whether you intend to run streaming or offline inference later. You might need to re-run stage 2 if switching service types.*
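
For example, to switch a server that was prepared for offline inference over to streaming, you could rebuild the model repository and relaunch the server. This assumes `run.sh` also honors the `service_type` argument for stages 2-3, which matches its usage line but is worth verifying in the script:
```sh
# Rebuild the model repository for streaming (stage 2) and restart Triton (stage 3)
bash run.sh 2 3 streaming
```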

### Single Utterance Client
Run a single inference request. Specify `streaming` or `offline` as the third argument.

**Streaming Mode (gRPC):**
```sh
bash run.sh 5 5 streaming
```
This executes the `client_grpc.py` script with predefined example text and prompt audio in streaming mode.

**Offline Mode (HTTP):**
```sh
bash run.sh 5 5 offline
```
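
Under the hood, the offline path uses Triton's standard KServe v2 HTTP protocol on port 8000. A hand-rolled request might look like the sketch below; the tensor name, shape, and datatype are illustrative assumptions (the real inputs, including the prompt audio, are defined by the model's `config.pbtxt` in the Triton model repository):
```sh
curl -s http://localhost:8000/v2/models/spark_tts/infer \
  -H "Content-Type: application/json" \
  -d '{
        "inputs": [
          {"name": "target_text", "shape": [1, 1], "datatype": "BYTES",
           "data": ["Hello, this is a test."]}
        ]
      }'
```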

### Benchmark using Dataset
Run the benchmark client against the running Triton server. Specify `streaming` or `offline` as the third argument.
```sh
# Run benchmark in streaming mode
bash run.sh 4 4 streaming

# Run benchmark in offline mode
bash run.sh 4 4 offline

# You can also customize parameters such as --num-tasks directly in client_grpc.py or via args if supported
# Example from run.sh (streaming):
# python3 client_grpc.py \
#     --server-addr localhost \
#     --model-name spark_tts \
#     --num-tasks 2 \
#     --mode streaming \
#     --log-dir ./log_concurrent_tasks_2_streaming_new

# Example customizing the dataset (requires modifying client_grpc.py or adding args):
# python3 client_grpc.py --num-tasks 2 --huggingface-dataset yuekai/seed_tts --split-name wenetspeech4tts --mode [streaming|offline]
```

### Benchmark Results
Decoding on a single L20 GPU with 26 different prompt_audio/target_text [pairs](https://huggingface.co/datasets/yuekai/seed_tts) (total audio duration: 169 seconds):

| Mode | Note | Concurrency | Avg Latency | First Chunk Latency (P50) | RTF |
|------|------|-------------|-------------|---------------------------|-----|
| Offline | [Code Commit](https://github.com/SparkAudio/Spark-TTS/tree/4d769ff782a868524f29e0be851ca64f8b22ebf1/runtime/triton_trtllm) | 1 | 876.24 ms | - | 0.1362 |
| Offline | [Code Commit](https://github.com/SparkAudio/Spark-TTS/tree/4d769ff782a868524f29e0be851ca64f8b22ebf1/runtime/triton_trtllm) | 2 | 920.97 ms | - | 0.0737 |
| Offline | [Code Commit](https://github.com/SparkAudio/Spark-TTS/tree/4d769ff782a868524f29e0be851ca64f8b22ebf1/runtime/triton_trtllm) | 4 | 1611.51 ms | - | 0.0704 |
| Streaming | [Code Commit](https://github.com/yuekaizhang/Spark-TTS/commit/0e978a327f99aa49f0735f86eb09372f16410d86) | 1 | 913.28 ms | 210.42 ms | 0.1501 |
| Streaming | [Code Commit](https://github.com/yuekaizhang/Spark-TTS/commit/0e978a327f99aa49f0735f86eb09372f16410d86) | 2 | 1009.23 ms | 226.08 ms | 0.0862 |
| Streaming | [Code Commit](https://github.com/yuekaizhang/Spark-TTS/commit/0e978a327f99aa49f0735f86eb09372f16410d86) | 4 | 1793.86 ms | 1017.70 ms | 0.0824 |
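
RTF (real-time factor) is conventionally processing time divided by generated audio duration, so lower is better. A quick sanity check on the offline, concurrency-1 row, assuming the 26 requests run sequentially:
```sh
python3 -c "print(26 * 0.87624 / 169)"   # ~0.135, in line with the reported 0.1362
```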