I used a single 3090 previously, but I have now purchased two aftermarket SXM2 Tesla V100 16GB cards with SXM2-to-PCIe adapters. By themselves they seem to work just fine (in llama.cpp when used for non-MoE models and one at a time, in ComfyUI, stress tests, PyTorch model training, etc.), but in llama.cpp I have run into two different problems.
When I use UMA (GGML_CUDA_UNIFIED_MEMORY=1) to avoid random OOM errors in situations where memory margins are very tight, everything works fine with one V100, or with the 3090 plus one V100. But if I select both V100s at the same time (with or without the 3090), prompt processing and token generation speeds drop to a crawl and gibberish is generated instead of text. In nvtop both V100s are also reported as fully utilized, whereas in normal mode they sit at around 50% utilization during generation.
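To make that case concrete, the failing UMA launch looks roughly like this (just a sketch, not the exact command I ran; the model path is a placeholder and I'm assuming CUDA1 and CUDA2 map to the two V100s, matching the -dev list further down):

GGML_CUDA_UNIFIED_MEMORY=1 ~/llama.cpp/build/bin/llama-server -m /models/llm_models/<some-dense-model>.gguf -ngl 900 -c 65536 -fa on -dev CUDA1,CUDA2

Changing the device list to a single V100 (or the 3090 plus one V100) makes the same invocation behave normally.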
When I try to use this build with models that use MXFP4 (like gpt-oss-*), I run into a problem with the -ncmoe flag. When I select the 3090 as the main GPU (it computes the offloaded layers during prompt processing), add either of the V100s or both to the device list, and load at least two layers onto it, I get token 30 (?) repeated in a loop when working with sequence lengths > ~32k (sometimes > ~20k).
Command is:
~/llama.cpp/build/bin/llama-server -m /models/llm_models/gpt-oss-120b-mxfp4-00001-of-00003.gguf -ngl 900 -c 65536 --port 5001 --host 0.0.0.0 -fa on -dev CUDA0,CUDA1,CUDA2 -ts 6,2,2 -ncmoe 12 -v -ub 4096 -b 16384
Other flags besides -ncmoe don't change the behavior (-cmoe works just fine), and if I select either of the V100s as the main GPU, or physically unplug one of the V100s from the system, everything starts working correctly.
Build command used is: CUDA_HOME=/usr/local/cuda CUDACXX=/usr/local/cuda/bin/nvcc cmake -S . -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="70;86" -DCMAKE_BUILD_TYPE=Release && cmake --build build --config Release -t llama-server -- -j 16
System: Ubuntu 24.04, llama.cpp built from the master branch (the behavior didn't change when I tried building another version, b7400), CPU 5800X3D, B550 motherboard. The 3090 is plugged into the CPU PCIe 4.0 x16 slot, and the two V100s are plugged into chipset PCIe 3.0 x4 and x2 slots. Driver is 580-server (not open), CUDA is 12.9, 128 GB of DDR4 @ 3600 CL18 in dual channel.
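If the exact PCIe paths matter, the interconnect matrix from nvidia-smi should show how the 3090 (CPU lanes) and the two V100s (chipset lanes) are wired up; I can post its output if that helps:

nvidia-smi topo -m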
I would appreciate any hints, and I am ready to hunt for the problem with gdb or other utilities if told what to look for and where to look.
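For example, something along these lines (a rough sketch, assuming the debugging tools from the CUDA 12.9 toolkit; flags may need adjusting):

# run the failing -ncmoe case under compute-sanitizer to catch invalid memory accesses
/usr/local/cuda/bin/compute-sanitizer --tool memcheck ~/llama.cpp/build/bin/llama-server -m /models/llm_models/gpt-oss-120b-mxfp4-00001-of-00003.gguf -ngl 900 -c 65536 -fa on -dev CUDA0,CUDA1,CUDA2 -ts 6,2,2 -ncmoe 12

# or attach gdb once the token-30 loop starts, to see where the threads are spinning
gdb -p $(pgrep -f llama-server)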