Replies: 1 comment 3 replies
Assuming you are using b1f3a6e / b7407 or later: memory allocation is now automated. If you set a higher physical batch size, the compute buffer is larger, so less memory is available for the weights.
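For reference, a minimal sketch of where the two batch sizes enter through llama.cpp's C API (entry-point names have shifted between versions, and the model path is a placeholder):

```cpp
#include "llama.h"
#include <cstdio>

int main() {
    llama_backend_init();

    llama_model_params mparams = llama_model_default_params();
    mparams.n_gpu_layers = 99; // offload as many layers as fit in VRAM

    // "model.gguf" is a placeholder path
    llama_model * model = llama_load_model_from_file("model.gguf", mparams);
    if (model == NULL) {
        fprintf(stderr, "failed to load model\n");
        return 1;
    }

    llama_context_params cparams = llama_context_default_params();
    cparams.n_ctx    = 4096;
    cparams.n_batch  = 2048; // logical batch: max tokens per llama_decode call
    cparams.n_ubatch = 512;  // physical batch: the compute buffer is sized for
                             // this many tokens, so a larger n_ubatch leaves
                             // less VRAM for offloaded weights

    llama_context * ctx = llama_new_context_with_model(model, cparams);
    if (ctx == NULL) {
        fprintf(stderr, "failed to create context\n");
        llama_free_model(model);
        return 1;
    }

    llama_free(ctx);
    llama_free_model(model);
    llama_backend_free();
    return 0;
}
```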
Why does a higher ubatch-size affect tg/s in VRAM-constrained environments (which is basically every machine once you try to fully utilize the GPU)? Maybe it is because part of the model spills into shared memory.
Should ubatch-size be able to affect/lower token generation speed? As far as I understand, ubatch-size isn't involved in token generation, only in prompt processing.
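To make concrete what I mean, here is a purely conceptual sketch (not llama.cpp code; the prompt length and batch size are made-up numbers) of how the two phases use the physical batch differently:

```cpp
#include <algorithm>
#include <cstdio>

// Conceptual only: a prompt of n_prompt tokens is processed in chunks of up
// to n_ubatch tokens, while generation submits a single token per step.
int main() {
    const int n_prompt = 2000; // hypothetical prompt length
    const int n_ubatch = 512;  // physical batch size

    // prompt processing: compute runs over up to n_ubatch tokens at once,
    // so the compute buffer must be sized for n_ubatch tokens
    for (int i = 0; i < n_prompt; i += n_ubatch) {
        int n = std::min(n_ubatch, n_prompt - i);
        printf("pp: decode tokens [%d, %d) in one pass\n", i, i + n);
    }

    // token generation: one token per pass, so n_ubatch itself is not
    // exercised here
    for (int t = 0; t < 4; ++t) {
        printf("tg: decode 1 token (step %d)\n", t);
    }
    return 0;
}
```

If that picture is right, the compute buffer sized for n_ubatch sits mostly idle during generation, yet it still occupies VRAM that could otherwise hold more offloaded layers.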
Was this a conscious decision to prioritize pp/s over tg/s, or is it inherent to the architecture, given the difficulty (and perhaps slowness) of switching -ub to 1 before token generation starts?
Is there currently any way to mitigate this?
Is there room to optimize the two phases separately, since prompt processing and token generation are executed at different times?
I basically don't know enough to be asking these questions, but it seemed weird to me that it impacted tg/s. Thanks, and sorry for my ignorance.