Replies: 1 comment 3 replies
Assuming you are using b1f3a6e / b7407 or later: memory allocation is now automated. If you set a higher physical batch size, the compute buffer is larger, so less memory is available for the weights.
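For reference, a minimal sketch of where the two batch sizes enter through llama.cpp's C API (entry-point names have shifted between versions, and the model path is a placeholder):

```cpp
#include "llama.h"
#include <cstdio>

int main() {
    llama_backend_init();

    llama_model_params mparams = llama_model_default_params();
    mparams.n_gpu_layers = 99; // offload as many layers as fit in VRAM

    // "model.gguf" is a placeholder path
    llama_model * model = llama_load_model_from_file("model.gguf", mparams);
    if (model == NULL) {
        fprintf(stderr, "failed to load model\n");
        return 1;
    }

    llama_context_params cparams = llama_context_default_params();
    cparams.n_ctx    = 4096;
    cparams.n_batch  = 2048; // logical batch: max tokens per llama_decode call
    cparams.n_ubatch = 512;  // physical batch: the compute buffer is sized for
                             // this many tokens, so a larger n_ubatch leaves
                             // less VRAM for offloaded weights

    llama_context * ctx = llama_new_context_with_model(model, cparams);
    if (ctx == NULL) {
        fprintf(stderr, "failed to create context\n");
        llama_free_model(model);
        return 1;
    }

    llama_free(ctx);
    llama_free_model(model);
    llama_backend_free();
    return 0;
}
```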
Why does a higher ubatch-size affect tg/s in VRAM-constrained environments (which is basically every machine once you try to fully utilize the GPU)? Maybe it is because part of the model spills into shared memory.
Should ubatch-size be able to affect/lower token generation speed? As far as I understand, ubatch-size isn't involved in token generation, only in prompt processing.
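To make concrete what I mean, here is a purely conceptual sketch (not llama.cpp code; the prompt length and batch size are made-up numbers) of how the two phases use the physical batch differently:

```cpp
#include <algorithm>
#include <cstdio>

// Conceptual only: a prompt of n_prompt tokens is processed in chunks of up
// to n_ubatch tokens, while generation submits a single token per step.
int main() {
    const int n_prompt = 2000; // hypothetical prompt length
    const int n_ubatch = 512;  // physical batch size

    // prompt processing: compute runs over up to n_ubatch tokens at once,
    // so the compute buffer must be sized for n_ubatch tokens
    for (int i = 0; i < n_prompt; i += n_ubatch) {
        int n = std::min(n_ubatch, n_prompt - i);
        printf("pp: decode tokens [%d, %d) in one pass\n", i, i + n);
    }

    // token generation: one token per pass, so n_ubatch itself is not
    // exercised here
    for (int t = 0; t < 4; ++t) {
        printf("tg: decode 1 token (step %d)\n", t);
    }
    return 0;
}
```

If that picture is right, the compute buffer sized for n_ubatch sits mostly idle during generation, yet it still occupies VRAM that could otherwise hold more offloaded layers.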
Was this a conscious decision to prioritize pp/s over tg/s, or is it inherent to the architecture, given the difficulty (and perhaps slowness) of switching -ub to 1 before token generation starts?
Is there currently any way to mitigate this?
Is there room to optimize the two phases separately, since prompt processing and token generation are executed at different times?
I basically don't know enough to be asking these questions, but it seemed weird to me that it impacted tg/s. Thanks, and sorry for my ignorance.