Conversation

@JTischbein
Contributor

Follow-up to PR #18012 (comment).

To enable Direct I/O model reading by default on Linux and Windows while keeping --mmap as the default on Mac, this PR adds a separate flag for enabling/disabling Direct I/O. The flag defaults to true and takes precedence over the mmap parameter: if --direct-io is set and Direct I/O is available, --mmap is disabled; if --no-direct-io is set or Direct I/O is not available (e.g. on Mac), the specified mmap value is used.
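
For illustration, the intended precedence could be summarized roughly like this (a minimal sketch with a hypothetical helper; the actual logic lives in the argument parsing and the model loader):

// sketch only: decide whether mmap should be used, given the Direct I/O flags
static bool resolve_use_mmap(bool direct_io_requested, bool direct_io_available, bool mmap_requested) {
    if (direct_io_requested && direct_io_available) {
        return false;          // --direct-io wins, mmap gets disabled
    }
    return mmap_requested;     // --no-direct-io or unsupported platform: keep the --mmap setting
}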

@ggerganov
Member

I think you need this:

diff --git a/src/llama-model-loader.cpp b/src/llama-model-loader.cpp
index 1355eea95..2db2115a0 100644
--- a/src/llama-model-loader.cpp
+++ b/src/llama-model-loader.cpp
@@ -918,8 +918,7 @@ void llama_model_loader::load_data_for(struct ggml_tensor * cur) const {
         GGML_ASSERT(cur->data != nullptr);
         GGML_ASSERT(w.idx < files.size());
         const auto & file = files.at(w.idx);
-        file->seek(w.offs, SEEK_SET);
-        file->read_raw(cur->data, ggml_nbytes(cur));
+        file->read_raw_at(cur->data, ggml_nbytes(cur), w.offs);
     }
 
     if (check_tensors && !ggml_validate_row_data(cur->type, cur->data, ggml_nbytes(cur))) {

Probably we need to assert that llama_file::read_raw is never used with direct I/O?
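
A rough sketch of what such an assert could look like (member names assumed from the PR, not the actual llama_file code):

void llama_file::read_raw(void * ptr, size_t len) const {
    // guard the buffered, position-based read against being used on a direct-I/O handle
    GGML_ASSERT(!has_direct_io() && "read_raw must not be used with direct I/O");
    // ... existing read path ...
}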

@JTischbein
Contributor Author

Thanks for the hint, changed that.

I think an assert would not work, since read_raw in its current form is still needed. Would you suggest renaming read_raw_at to read_raw (and using tell() instead of the offset argument)? That way read_raw can be used safely again, and in the loop of load_all_data the current read_raw would be called (as read_raw_direct?).
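
To make the idea concrete, a sketch with assumed signatures (not the actual patch): read_raw would take its offset from tell(), so every caller goes through the offset-based path, while the loop in load_all_data would call the lower-level variant under its own name.

void llama_file::read_raw(void * ptr, size_t len) const {
    const size_t offs = tell();   // offset taken from the current position instead of a parameter
    // ... same aligned read as today's read_raw_at, at offs ...
}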

@ggerganov
Member

Would you suggest renaming read_raw_at to read_raw (and using tell() instead of the offset argument)? That way read_raw can be used safely again, and in the loop of load_all_data the current read_raw would be called (as read_raw_direct?).

Ok. Would we even need read_raw_direct in that case? If we still need it for some reason, then maybe call it read_raw_unsafe so we don't overload the word "direct" with more meanings.


Also, here are some suggestions that I have not tested, but they should at least convey what I mean:

diff --git a/src/llama-model-loader.cpp b/src/llama-model-loader.cpp
index 2db2115a0..ae0c698be 100644
--- a/src/llama-model-loader.cpp
+++ b/src/llama-model-loader.cpp
@@ -508,8 +508,11 @@ llama_model_loader::llama_model_loader(
     files.emplace_back(new llama_file(fname.c_str(), "rb", use_direct_io));
     contexts.emplace_back(ctx);
 
-    // Disable mmap in case Direct I/O is enabled and available
-    if (use_direct_io && files.at(0)->has_direct_io()) {
+    // check if direct io is enabled and supported
+    use_direct_io = use_direct_io && files.back()->has_direct_io();
+
+    if (use_direct_io && use_mmap) {
+        LLAMA_LOG_WARN("%s: direct I/O is enabled, disabling mmap\n", __func__);
         use_mmap = false;
     }
 
@@ -581,6 +584,10 @@ llama_model_loader::llama_model_loader(
             files.emplace_back(new llama_file(fname_split, "rb", use_direct_io));
             contexts.emplace_back(ctx);
 
+            if (use_direct_io && !files.back()->has_direct_io()) {
+                throw std::runtime_error(format("unexpected: direct I/O is not supported for split file %s", fname_split));
+            }
+
             // Save tensors data offset info of the shard.
             for (ggml_tensor * cur = ggml_get_first_tensor(ctx); cur; cur = ggml_get_next_tensor(ctx, cur)) {
                 std::string tensor_name = std::string(cur->name);
@@ -722,6 +729,7 @@ llama_model_loader::llama_model_loader(
     }
 
     this->use_mmap = use_mmap;
+    this->use_direct_io = use_direct_io;
     this->check_tensors = check_tensors;
     this->no_alloc = no_alloc;
 }
diff --git a/src/llama-model-loader.h b/src/llama-model-loader.h
index de06b5283..6f15115ce 100644
--- a/src/llama-model-loader.h
+++ b/src/llama-model-loader.h
@@ -70,6 +70,7 @@ struct llama_model_loader {
     size_t   n_bytes    = 0;
 
     bool use_mmap = false;
+    bool use_direct_io = false;
     bool check_tensors;
     bool no_alloc;
 
diff --git a/src/llama-model.cpp b/src/llama-model.cpp
index cf0c39475..502859d2e 100644
--- a/src/llama-model.cpp
+++ b/src/llama-model.cpp
@@ -2337,7 +2337,8 @@ bool llama_model::load_tensors(llama_model_loader & ml) {
 
     const bool use_mmap_buffer = true;
 
-    LLAMA_LOG_INFO("%s: loading model tensors, this can take a while... (mmap = %s)\n", __func__, ml.use_mmap ? "true" : "false");
+    LLAMA_LOG_INFO("%s: loading model tensors, this can take a while... (mmap = %s, direct_io = %s)\n",
+            __func__, ml.use_mmap ? "true" : "false", ml.use_direct_io ? "true" : "false");
 
     // build a list of buffer types for the CPU and GPU devices
     pimpl->cpu_buft_list = make_cpu_buft_list(devices, params.use_extra_bufts, params.no_host);

@askmyteapot

Just an FYI: #18012 broke loading with mmap disabled on Windows.

@JTischbein
Contributor Author

@askmyteapot Thank you for the hint. The issue was that I used off_t, which is a 32-bit signed long on Windows. The fix is in this PR.
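
Illustrative only (the actual fix is in the PR): the general idea is to avoid off_t for file offsets and use an explicit 64-bit type together with the platform's 64-bit seek.

#include <cstdint>
#include <cstdio>

// sketch: seek to a 64-bit offset portably, instead of relying on off_t
static int seek64(FILE * fp, uint64_t offs) {
#ifdef _WIN32
    return _fseeki64(fp, (int64_t) offs, SEEK_SET);   // 64-bit seek in the MSVC CRT
#else
    return fseeko(fp, (off_t) offs, SEEK_SET);        // off_t is 64-bit on typical Unix systems
#endif
}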

@JTischbein
Contributor Author

@ggerganov Should I split the Windows fix off into a new PR, separate from the changes here?

@ggerganov
Member

Yes, I would like to take an extra look at the changes here, so it's better to fix the Windows issue in a separate PR in the meantime. Thanks
