Commit 8a940bd: Add RPC documentation (1 parent: d90b204)

docs/rpc.md (1 file changed, +202 -0 lines)

# Building and Using the RPC Server with `stable-diffusion.cpp`

This guide covers how to build a version of the RPC server from `llama.cpp` that is compatible with your version of `stable-diffusion.cpp`, in order to manage multi-backend setups. RPC allows you to offload specific model components to a remote server.

> **Note on Model Location:** The model files (e.g., `.safetensors` or `.gguf`) remain on the **Client** machine. The client parses the file and transmits the necessary tensor data and computational graphs to the server. The server does not need to store the model files locally.

## 1. Building `stable-diffusion.cpp` with the RPC client

First, build the client application from source, passing `-DGGML_RPC=ON` to include the RPC backend in the client.
```bash
mkdir build
cd build
# Add other build flags as needed (e.g., -DSD_VULKAN=ON)
cmake .. -DGGML_RPC=ON
cmake --build . --config Release -j $(nproc)
```

> **Note:** Make sure to add the other flags you would normally use (e.g., `-DSD_VULKAN=ON`, `-DSD_CUDA=ON`, `-DSD_HIPBLAS=ON`, or `-DGGML_METAL=ON`). For more information about building `stable-diffusion.cpp` from source, refer to the `build.md` documentation.
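
You can optionally confirm that RPC support made it into the client binary. This is only a quick sanity check, and the binary path below is an assumption (it depends on your platform and CMake generator); adjust it to wherever your build places `sd-cli`:
```bash
# From the build directory: the help text should mention the --rpc option
# if the RPC backend was compiled in
./bin/sd-cli --help 2>&1 | grep -i rpc
```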

## 2. Ensure `llama.cpp` is at the correct commit

`stable-diffusion.cpp`'s RPC client is designed to work with a specific version of `llama.cpp` (the one matching its `ggml` submodule) to ensure API compatibility. The corresponding commit hash is stored in `ggml/scripts/sync-llama.last`.

> **Start from Root:** Perform these steps from the root of your `stable-diffusion.cpp` directory.

1. Read the target commit hash from the submodule tracker:
```bash
# Linux / WSL / macOS
HASH=$(cat ggml/scripts/sync-llama.last)

# Windows (PowerShell)
$HASH = Get-Content -Path "ggml\scripts\sync-llama.last"
```

2. Clone `llama.cpp` at the target commit.
```bash
git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
git checkout $HASH
```

To save on download time and storage, you can use a shallow clone to download only the target commit:
```bash
mkdir -p llama.cpp
cd llama.cpp
git init
git remote add origin https://github.com/ggml-org/llama.cpp.git
git fetch --depth 1 origin $HASH
git checkout FETCH_HEAD
```
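
Either way, you can verify that the checkout matches the expected commit:
```bash
# Should print the same hash that is stored in ggml/scripts/sync-llama.last
git rev-parse HEAD
```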

## 3. Build `llama.cpp` (RPC Server)

The RPC server acts as the worker. You must explicitly enable the **backend** (the hardware interface, such as CUDA for Nvidia, Metal for Apple Silicon, or Vulkan) when building; otherwise the server will default to using only the CPU.

To find the correct flags, refer to the official documentation of the `llama.cpp` repository.

> **Crucial:** You must include the compiler flag required for API compatibility with `stable-diffusion.cpp`: `-DGGML_MAX_NAME=128`. Without this flag, `GGML_MAX_NAME` will default to `64` on the server, and data transfers between the client and server will fail. Of course, `-DGGML_RPC=ON` must also be enabled.
>
> I recommend disabling the `LLAMA_CURL` flag to avoid unnecessary dependencies, and disabling shared library builds to avoid potential conflicts.

> **Build Target:** We are specifically building the `rpc-server` target. This prevents the build system from compiling the entire `llama.cpp` suite (such as `llama-cli`), making the build significantly faster.

### Linux / WSL (Vulkan)
```bash
mkdir build
cd build
# -DGGML_VULKAN=ON ensures the Vulkan backend is enabled
cmake .. -DGGML_RPC=ON \
    -DGGML_VULKAN=ON \
    -DGGML_BUILD_SHARED_LIBS=OFF \
    -DLLAMA_CURL=OFF \
    -DCMAKE_C_FLAGS=-DGGML_MAX_NAME=128 \
    -DCMAKE_CXX_FLAGS=-DGGML_MAX_NAME=128
cmake --build . --config Release --target rpc-server -j $(nproc)
```

### macOS (Metal)
```bash
mkdir build
cd build
cmake .. -DGGML_RPC=ON \
    -DGGML_METAL=ON \
    -DGGML_BUILD_SHARED_LIBS=OFF \
    -DLLAMA_CURL=OFF \
    -DCMAKE_C_FLAGS=-DGGML_MAX_NAME=128 \
    -DCMAKE_CXX_FLAGS=-DGGML_MAX_NAME=128
cmake --build . --config Release --target rpc-server
```

### Windows (Visual Studio 2022, Vulkan)
```powershell
mkdir build
cd build
cmake .. -G "Visual Studio 17 2022" -A x64 `
    -DGGML_RPC=ON `
    -DGGML_VULKAN=ON `
    -DGGML_BUILD_SHARED_LIBS=OFF `
    -DLLAMA_CURL=OFF `
    -DCMAKE_C_FLAGS=-DGGML_MAX_NAME=128 `
    -DCMAKE_CXX_FLAGS=-DGGML_MAX_NAME=128
cmake --build . --config Release --target rpc-server
```

## 4. Usage

Once both applications are built, you can run the server and the client to manage your GPU allocation.

### Step A: Run the RPC Server

Start the server. It listens for connections on the default address (by default, `localhost:50052`). If your server is on a different machine, make sure the server binds to the correct interface and that your firewall allows the connection.

**On the Server:**
If running on the same machine, you can use the default address:
```bash
./rpc-server
```
If you want to allow connections from other machines on the network:
```bash
./rpc-server --host 0.0.0.0
```
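
If a firewall is active on the server machine, the client must also be able to reach the RPC port. As an example, with `ufw` on Ubuntu (adapt the command to your firewall and to the port you actually use; `50052` is the default assumed here):
```bash
# Allow incoming TCP connections on the default RPC port
sudo ufw allow 50052/tcp
```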

> **Security Warning:** The RPC server does not currently support authentication or encryption. **Only run the server on trusted local networks**. Never expose the RPC server directly to the open internet.

> **Drivers & Hardware:** Ensure the server machine has the necessary drivers installed and working (e.g., Nvidia drivers for CUDA, the Vulkan SDK, or Metal). If no devices are found, the server will simply fall back to the CPU.

### Step B: Check that the client can connect to the server and see the available devices

This assumes the server is running on your local machine and listening on the default port `50052`. If it is running on a different machine, replace `localhost` with the server's IP address.

**On the Client:**
```bash
./sd-cli --rpc localhost:50052 --list-devices
```
If the server is running and the client is able to connect, you should see `RPC0 localhost:50052` in the list of devices.

Example output (client built without GPU acceleration, two GPUs available on the server):
```
List of available GGML devices:
Name   Description
-------------------
CPU    AMD Ryzen 9 5900X 12-Core Processor
RPC0   localhost:50052
RPC1   localhost:50052
```

### Step C: Run with an RPC device

If everything is working correctly, you can now run the client while offloading some or all of the work to the RPC server.

Example: setting the main backend to the `RPC0` device so that all the work is done on the server.

```bash
./sd-cli -m models/sd1.5.safetensors -p "A cat" --rpc localhost:50052 --main-backend-device RPC0
```

---

## 5. Scaling: Multiple RPC Servers

You can connect the client to multiple RPC servers simultaneously to scale out your hardware usage.

Example: a main machine (`192.168.1.10`) with 3 GPUs, one used through CUDA and the other two through Vulkan, and a second machine (`192.168.1.11`) with a single GPU.

**On the first machine (Running two server instances):**

**Terminal 1 (CUDA):**
```bash
# Linux / macOS / WSL
export CUDA_VISIBLE_DEVICES=0
./rpc-server-cuda --host 0.0.0.0

# Windows PowerShell
$env:CUDA_VISIBLE_DEVICES="0"
./rpc-server-cuda --host 0.0.0.0
```

**Terminal 2 (Vulkan):**
```bash
./rpc-server-vulkan --host 0.0.0.0 --port 50053 -d Vulkan1,Vulkan2
```

**On the second machine:**
```bash
./rpc-server --host 0.0.0.0
```

**On the Client:**
Pass multiple server addresses separated by commas.

```bash
./sd-cli --rpc 192.168.1.10:50052,192.168.1.10:50053,192.168.1.11:50052 --list-devices
```

The client will map these servers to sequential device IDs (e.g., RPC0 from the first server, RPC1 and RPC2 from the second, and RPC3 from the third). With this setup, you could for example use RPC0 for the main backend, RPC1 and RPC2 for the text encoders, and RPC3 for the VAE.
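
For instance, here is a sketch of a run that places the main backend on the remote CUDA GPU, using only the flags shown above (how the remaining components are assigned depends on the options available in your build):

```bash
# Main backend on RPC0 (the CUDA GPU of the first machine)
./sd-cli -m models/sd1.5.safetensors -p "A cat" \
    --rpc 192.168.1.10:50052,192.168.1.10:50053,192.168.1.11:50052 \
    --main-backend-device RPC0
```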

---

## 6. Performance Considerations

RPC performance is heavily dependent on network bandwidth, as large weights and activations must be transferred back and forth over the network, especially for large models or when using high resolutions. For best results, ensure your network connection is stable and has sufficient bandwidth (>1 Gbps recommended).
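
As a rough back-of-envelope estimate: transferring 2 GB of tensor data over a 1 Gbps link takes about 2 × 8 / 1 ≈ 16 seconds before any compute time, while the same transfer over a 10 Gbps link takes under 2 seconds.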
