Async DirectIO model loading on Linux #18012
```diff
@@ -413,7 +413,7 @@ struct common_params {
     bool kv_unified       = false; // enable unified KV cache
     bool input_prefix_bos = false; // prefix BOS to user inputs, preceding input_prefix
-    bool use_mmap         = true;  // use mmap for faster loads
```
|
Member

Changing this to `false` causes a significant load-time regression:

```sh
time ./bin/llama-completion -m ../models/gpt-oss-120b/ggml-model-mxfp4.gguf -p "hello" -n 1 -no-cnv

# master
real    0m4.648s

# PR
real    0m17.957s
```

Not sure what is the best way to handle this. If we keep it […]
Contributor (Author)

Would it be OK to set the `use_mmap` default depending on the platform?
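A minimal sketch of what a per-platform default could look like, assuming the uncached loading path only exists on Linux (the field name follows the diff above; the surrounding struct is abbreviated and this is not the PR's actual code):

```cpp
// Hypothetical sketch of a platform-dependent default for use_mmap,
// assuming the direct-I/O loader is Linux-only. Not the PR's actual code.
struct common_params {
#if defined(__linux__)
    bool use_mmap = false; // prefer uncached direct-I/O reads where implemented
#else
    bool use_mmap = true;  // keep mmap on platforms without the direct-I/O path
#endif
    // ... remaining fields unchanged ...
};
```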
Member

We don't have such a precedent atm for any of the parameters in `common_params`.
Contributor (Author)

On an M4 Pro with GPT-OSS-20B I get […] on a cold load. Measured using […]. So the cold load time is still faster using […].
Member

We can do the following: […]

Might want to do it in a separate PR as it would require changes in […].
Contributor (Author)

Sounds good.
```diff
+    bool use_mmap         = false; // use uncached reads for faster loads
     bool use_mlock        = false; // use mlock to keep model in memory
     bool verbose_prompt   = false; // print prompt tokens before generation
     bool display_prompt   = true;  // print prompt before generation
```
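For context on what "uncached reads" refers to here: on Linux a direct-I/O loader typically opens the file with `O_DIRECT`, which bypasses the page cache and requires block-aligned buffers. This is also why cold loads and warm `mmap` loads differ so much in the timings above. A minimal standalone sketch follows; it is not the PR's implementation, and the 4096-byte alignment and 1 MiB chunk size are assumptions:

```cpp
// Minimal sketch of an uncached (direct I/O) file read on Linux.
// Simplified assumptions: 4096-byte alignment, no EINTR/retry handling.
#ifndef _GNU_SOURCE
#define _GNU_SOURCE // needed for O_DIRECT with glibc
#endif
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>
#include <cstdlib>

int main(int argc, char ** argv) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s <file>\n", argv[0]);
        return 1;
    }

    // O_DIRECT bypasses the kernel page cache; reads go straight to the device
    int fd = open(argv[1], O_RDONLY | O_DIRECT);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    // O_DIRECT requires the buffer, offset, and read size to be block-aligned
    const size_t align = 4096;     // assumed alignment
    const size_t chunk = 1 << 20;  // read in 1 MiB chunks
    void * buf = nullptr;
    if (posix_memalign(&buf, align, chunk) != 0) {
        close(fd);
        return 1;
    }

    ssize_t n;
    size_t total = 0;
    while ((n = read(fd, buf, chunk)) > 0) {
        total += (size_t) n; // a real loader would copy/upload the data here
    }

    printf("read %zu bytes uncached\n", total);
    free(buf);
    close(fd);
    return 0;
}
```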