I used a single 3090 previously, but I have now purchased two aftermarket SXM2 Tesla V100 16GB cards with SXM2-to-PCIe adapters. By themselves they seem to work just fine (in llama.cpp when used for non-MoE models and one at a time, in ComfyUI, stress tests, PyTorch model training, etc.), but in llama.cpp I have run into two different problems.
When I use UMA (GGML_CUDA_UNIFIED_MEMORY=1) to avoid random OOM errors in situations where memory margins are very tight, everything works fine with one V100, or with the 3090 plus one V100. But if I select both V100s at the same time (with or without the 3090), prompt processing and token generation speeds drop to a crawl and gibberish is generated instead of text. In nvtop both V100s are also reported as fully utilized, whereas in normal mode they sit at around 50% utilization during generation.
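To make that case concrete, the failing UMA launch looks roughly like this (just a sketch, not the exact command I ran; the model path is a placeholder and I'm assuming CUDA1 and CUDA2 map to the two V100s, matching the -dev list further down):

GGML_CUDA_UNIFIED_MEMORY=1 ~/llama.cpp/build/bin/llama-server -m /models/llm_models/<some-dense-model>.gguf -ngl 900 -c 65536 -fa on -dev CUDA1,CUDA2

Changing the device list to a single V100 (or the 3090 plus one V100) makes the same invocation behave normally.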
When I try to use this build with models that use MXFP4 (like gpt-oss-*), I run into a problem with the -ncmoe flag. When I select the 3090 as the main GPU (it computes the offloaded layers during prompt processing), add either of the V100s or both to the device list, and load at least two layers onto it, I get token 30 (?) repeated in a loop when working with sequence lengths > ~32k (sometimes > ~20k).
Command is:
~/llama.cpp/build/bin/llama-server -m /models/llm_models/gpt-oss-120b-mxfp4-00001-of-00003.gguf -ngl 900 -c 65536 --port 5001 --host 0.0.0.0 -fa on -dev CUDA0,CUDA1,CUDA2 -ts 6,2,2 -ncmoe 12 -v -ub 4096 -b 16384
Other flags besides -ncmoe don't change the behavior (-cmoe works just fine), and if I select either of the V100s as the main GPU, or physically unplug one of the V100s from the system, everything starts working correctly.
Build command used is: CUDA_HOME=/usr/local/cuda CUDACXX=/usr/local/cuda/bin/nvcc cmake -S . -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="70;86" -DCMAKE_BUILD_TYPE=Release && cmake --build build --config Release -t llama-server -- -j 16
System: Ubuntu 24.04, llama.cpp built from the master branch (the behavior didn't change when I tried building another version, b7400), CPU 5800X3D, B550 motherboard. The 3090 is plugged into the CPU PCIe 4.0 x16 slot, and the two V100s are plugged into chipset PCIe 3.0 x4 and x2 slots. Driver is 580-server (not open), CUDA is 12.9, 128 GB of DDR4 @ 3600 CL18 in dual channel.
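If the exact PCIe paths matter, the interconnect matrix from nvidia-smi should show how the 3090 (CPU lanes) and the two V100s (chipset lanes) are wired up; I can post its output if that helps:

nvidia-smi topo -m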
I would appreciate any hints, and I am ready to hunt for the problem with gdb or other utilities if told what to look for and where to look.
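For example, something along these lines (a rough sketch, assuming the debugging tools from the CUDA 12.9 toolkit; flags may need adjusting):

# run the failing -ncmoe case under compute-sanitizer to catch invalid memory accesses
/usr/local/cuda/bin/compute-sanitizer --tool memcheck ~/llama.cpp/build/bin/llama-server -m /models/llm_models/gpt-oss-120b-mxfp4-00001-of-00003.gguf -ngl 900 -c 65536 -fa on -dev CUDA0,CUDA1,CUDA2 -ts 6,2,2 -ncmoe 12

# or attach gdb once the token-30 loop starts, to see where the threads are spinning
gdb -p $(pgrep -f llama-server)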