Commit 5427c27

Author: Yuekai Zhang
Commit message: add triton solution
1 parent: b048a2d

File tree: 18 files changed, +3448 −0 lines changed

runtime/triton_trtllm/Dockerfile.server

Lines changed: 6 additions & 0 deletions

# Triton Inference Server image with the TensorRT-LLM backend and Python support
FROM nvcr.io/nvidia/tritonserver:25.06-trtllm-python-py3
# cmake is required to build torchaudio from source
RUN apt-get update && apt-get install -y cmake
# Build torchaudio from a pinned commit so it matches the container's PyTorch build
RUN git clone https://github.com/pytorch/audio.git && cd audio && git checkout c670ad8 && PATH=/usr/local/cuda/bin:$PATH python3 setup.py develop
# Install the Spark TTS runtime's Python dependencies
COPY ./requirements.txt /workspace/requirements.txt
RUN pip install -r /workspace/requirements.txt
WORKDIR /workspace
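After building the image (see the Build Image step in the README below), a quick sanity check is to confirm that the source-built torchaudio imports cleanly; the tag here assumes you built with the README's command:

```sh
# Smoke test: verify the source-built torchaudio is importable in the image.
# The tag assumes the build command shown in the README below.
docker run --rm soar97/triton-spark-tts:25.02 \
  python3 -c "import torchaudio; print(torchaudio.__version__)"
```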

runtime/triton_trtllm/README.md

Lines changed: 94 additions & 0 deletions
## NVIDIA Triton Inference Serving Best Practices for Spark TTS

### Quick Start
Launch the service directly with Docker Compose:
```sh
docker compose up
```
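If you would rather not keep Compose in the foreground, a detached launch plus a log tail works as well (both are standard Docker Compose subcommands):

```sh
# Detached launch; follow the logs to watch the server come up.
docker compose up -d
docker compose logs -f
```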
### Build Image
Build the Docker image from scratch:
```sh
docker build . -f Dockerfile.server -t soar97/triton-spark-tts:25.02
```
### Create Docker Container
```sh
your_mount_dir=/mnt:/mnt
docker run -it --name "spark-tts-server" --gpus all --net host -v $your_mount_dir --shm-size=2g soar97/triton-spark-tts:25.02
```
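The `-v $your_mount_dir` flag uses Docker's `host_path:container_path` bind-mount syntax; point it at wherever your models and data actually live. For example (paths illustrative):

```sh
# Illustrative paths: mount a host model directory into the container.
your_mount_dir=/data/spark-tts:/workspace/models
docker run -it --name "spark-tts-server" --gpus all --net host -v $your_mount_dir --shm-size=2g soar97/triton-spark-tts:25.02
```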
### Understanding `run.sh`

The `run.sh` script automates the workflow as a sequence of stages; a sketch of the stage-gating pattern follows the list below. Run a range of stages with:
```sh
bash run.sh <start_stage> <stop_stage> [service_type]
```
- `<start_stage>`: the stage to begin execution from (0-5).
- `<stop_stage>`: the stage to end execution at (0-5).
- `[service_type]`: optional; either `streaming` or `offline` (defaults may apply based on script logic). Required for stages 4 and 5.

Stages:
- **Stage 0**: Download the Spark-TTS-0.5B model from HuggingFace.
- **Stage 1**: Convert the HuggingFace checkpoint to TensorRT-LLM format and build the TensorRT engines.
- **Stage 2**: Create the Triton model repository structure and configure the model files (adjusted for streaming or offline serving).
- **Stage 3**: Launch the Triton Inference Server.
- **Stage 4**: Run the gRPC benchmark client.
- **Stage 5**: Run the single-utterance client (gRPC for streaming, HTTP for offline).
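The gating itself is a common shell idiom; a minimal sketch (variable names are illustrative, not copied from `run.sh`):

```sh
# Minimal sketch of stage gating; names are illustrative, not from run.sh.
start_stage=$1
stop_stage=$2
service_type=${3:-offline}

if [ "$start_stage" -le 0 ] && [ "$stop_stage" -ge 0 ]; then
  echo "Stage 0: download Spark-TTS-0.5B from HuggingFace"
fi
if [ "$start_stage" -le 3 ] && [ "$stop_stage" -ge 3 ]; then
  echo "Stage 3: launch the Triton Inference Server ($service_type)"
fi
```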
### Export Models to TensorRT-LLM and Launch Server
Inside the Docker container, prepare the models and launch the Triton server by running stages 0 through 3. This downloads the original model, converts the checkpoint to TensorRT-LLM format, builds the optimized TensorRT engines, creates the model repository structure Triton needs, and finally starts the server.
```sh
# This runs stages 0, 1, 2, and 3
bash run.sh 0 3
```
*Note: Stage 2 prepares the model repository differently based on whether you intend to run streaming or offline inference later. You might need to re-run stage 2 if switching service types.*
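Once stage 3 is running, you can confirm readiness via Triton's standard HTTP health endpoint (8000 is Triton's default HTTP port; adjust if your deployment remaps it):

```sh
# Poll Triton's standard readiness endpoint until the server is up.
until curl -sf http://localhost:8000/v2/health/ready; do
  echo "waiting for triton..."
  sleep 5
done
echo "server is ready"
```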
### Single Utterance Client
Run a single inference request. Pass `streaming` or `offline` as the third argument.

**Streaming Mode (gRPC):**
```sh
bash run.sh 5 5 streaming
```
This executes the `client_grpc.py` script with predefined example text and prompt audio in streaming mode.

**Offline Mode (HTTP):**
```sh
bash run.sh 5 5 offline
```
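Offline mode goes through Triton's standard HTTP inference endpoint (`/v2/models/<model>/infer`). A hand-rolled request would look roughly like the sketch below; the input tensor names are assumptions, so check the deployed model's `config.pbtxt` for the real ones:

```sh
# Hedged sketch of a raw request to Triton's HTTP infer endpoint.
# Tensor names "reference_text" and "target_text" are assumptions;
# config.pbtxt defines the model's actual inputs.
curl -s http://localhost:8000/v2/models/spark_tts/infer \
  -H "Content-Type: application/json" \
  -d '{
    "inputs": [
      {"name": "reference_text", "shape": [1, 1], "datatype": "BYTES", "data": ["prompt transcript"]},
      {"name": "target_text", "shape": [1, 1], "datatype": "BYTES", "data": ["text to synthesize"]}
    ]
  }'
```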
### Benchmark using Dataset
Run the benchmark client against the running Triton server. Pass `streaming` or `offline` as the third argument.
```sh
# Run the benchmark in streaming mode
bash run.sh 4 4 streaming

# Run the benchmark in offline mode
bash run.sh 4 4 offline

# You can also customize parameters such as num_tasks directly in client_grpc.py, or via args if supported.
# Example from run.sh (streaming):
# python3 client_grpc.py \
#     --server-addr localhost \
#     --model-name spark_tts \
#     --num-tasks 2 \
#     --mode streaming \
#     --log-dir ./log_concurrent_tasks_2_streaming_new

# Example customizing the dataset (requires modifying client_grpc.py or adding args):
# python3 client_grpc.py --num-tasks 2 --huggingface-dataset yuekai/seed_tts --split-name wenetspeech4tts --mode [streaming|offline]
```
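To reproduce the concurrency column of the results below, one option is a simple sweep over `--num-tasks`, reusing the flags from the commented example above:

```sh
# Sweep concurrency 1, 2, and 4, mirroring the results table below.
for n in 1 2 4; do
  python3 client_grpc.py \
    --server-addr localhost \
    --model-name spark_tts \
    --num-tasks "$n" \
    --mode offline \
    --log-dir "./log_concurrent_tasks_${n}_offline"
done
```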
### Benchmark Results
Decoding on a single L20 GPU, using 26 different prompt_audio/target_text [pairs](https://huggingface.co/datasets/yuekai/seed_tts) with a total audio duration of 169 seconds. RTF is the real-time factor: compute time divided by the duration of the generated audio, so lower is better.

| Mode | Note | Concurrency | Avg Latency (ms) | First Chunk Latency, P50 (ms) | RTF |
|------|------|-------------|------------------|-------------------------------|-----|
| Offline | [Code Commit](https://github.com/SparkAudio/Spark-TTS/tree/4d769ff782a868524f29e0be851ca64f8b22ebf1/runtime/triton_trtllm) | 1 | 876.24 | - | 0.1362 |
| Offline | [Code Commit](https://github.com/SparkAudio/Spark-TTS/tree/4d769ff782a868524f29e0be851ca64f8b22ebf1/runtime/triton_trtllm) | 2 | 920.97 | - | 0.0737 |
| Offline | [Code Commit](https://github.com/SparkAudio/Spark-TTS/tree/4d769ff782a868524f29e0be851ca64f8b22ebf1/runtime/triton_trtllm) | 4 | 1611.51 | - | 0.0704 |
| Streaming | [Code Commit](https://github.com/yuekaizhang/Spark-TTS/commit/0e978a327f99aa49f0735f86eb09372f16410d86) | 1 | 913.28 | 210.42 | 0.1501 |
| Streaming | [Code Commit](https://github.com/yuekaizhang/Spark-TTS/commit/0e978a327f99aa49f0735f86eb09372f16410d86) | 2 | 1009.23 | 226.08 | 0.0862 |
| Streaming | [Code Commit](https://github.com/yuekaizhang/Spark-TTS/commit/0e978a327f99aa49f0735f86eb09372f16410d86) | 4 | 1793.86 | 1017.70 | 0.0824 |
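As a rough sanity check on these numbers: at RTF 0.1362 (offline, concurrency 1), synthesizing the 169-second benchmark set costs about 169 × 0.1362 ≈ 23 seconds of compute.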
