Conversation

@JTischbein
Contributor

Follow-up to PR #18012 (comment).

To enable Direct I/O model reading by default on Linux and Windows while keeping --mmap as the default on Mac, this PR adds a separate flag for enabling/disabling Direct I/O. The flag defaults to true and takes precedence over the mmap parameter: if --direct-io is set and Direct I/O is available, --mmap is disabled; if --no-direct-io is set or Direct I/O is not available (e.g. on Mac), the specified mmap value is used.
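
For illustration, the intended precedence could be summarized roughly like this (a minimal sketch with a hypothetical helper; the actual logic lives in the argument parsing and the model loader):

// sketch only: decide whether mmap should be used, given the Direct I/O flags
static bool resolve_use_mmap(bool direct_io_requested, bool direct_io_available, bool mmap_requested) {
    if (direct_io_requested && direct_io_available) {
        return false;          // --direct-io wins, mmap gets disabled
    }
    return mmap_requested;     // --no-direct-io or unsupported platform: keep the --mmap setting
}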

@ggerganov
Member

I think you need this:

diff --git a/src/llama-model-loader.cpp b/src/llama-model-loader.cpp
index 1355eea95..2db2115a0 100644
--- a/src/llama-model-loader.cpp
+++ b/src/llama-model-loader.cpp
@@ -918,8 +918,7 @@ void llama_model_loader::load_data_for(struct ggml_tensor * cur) const {
         GGML_ASSERT(cur->data != nullptr);
         GGML_ASSERT(w.idx < files.size());
         const auto & file = files.at(w.idx);
-        file->seek(w.offs, SEEK_SET);
-        file->read_raw(cur->data, ggml_nbytes(cur));
+        file->read_raw_at(cur->data, ggml_nbytes(cur), w.offs);
     }
 
     if (check_tensors && !ggml_validate_row_data(cur->type, cur->data, ggml_nbytes(cur))) {

Probably we need to assert that llama_file::read_raw is never used with direct I/O?
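
A rough sketch of what such an assert could look like (member names assumed from the PR, not the actual llama_file code):

void llama_file::read_raw(void * ptr, size_t len) const {
    // guard the buffered, position-based read against being used on a direct-I/O handle
    GGML_ASSERT(!has_direct_io() && "read_raw must not be used with direct I/O");
    // ... existing read path ...
}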

@JTischbein
Contributor Author

Thanks for the hint, changed that.

I think an assert would not work, since read_raw in its current form is still needed. Would you suggest renaming read_raw_at to read_raw (and using tell() instead of the offset argument)? That way read_raw can be used safely again, and in the loop of load_all_data the current read_raw would be called (as read_raw_direct?).
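
To make the idea concrete, a sketch with assumed signatures (not the actual patch): read_raw would take its offset from tell(), so every caller goes through the offset-based path, while the loop in load_all_data would call the lower-level variant under its own name.

void llama_file::read_raw(void * ptr, size_t len) const {
    const size_t offs = tell();   // offset taken from the current position instead of a parameter
    // ... same aligned read as today's read_raw_at, at offs ...
}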

@ggerganov
Member

Would you suggest renaming read_raw_at to read_raw (and using tell() instead of the offset argument)? That way read_raw can be used safely again, and in the loop of load_all_data the current read_raw would be called (as read_raw_direct?).

Ok. Would we even need read_raw_direct in that case? If we still need it for some reason, then maybe call it read_raw_unsafe so we don't overload the word "direct" with more meanings.


Also, here are some suggestions that I have not tested, but they should at least convey what I mean:

diff --git a/src/llama-model-loader.cpp b/src/llama-model-loader.cpp
index 2db2115a0..ae0c698be 100644
--- a/src/llama-model-loader.cpp
+++ b/src/llama-model-loader.cpp
@@ -508,8 +508,11 @@ llama_model_loader::llama_model_loader(
     files.emplace_back(new llama_file(fname.c_str(), "rb", use_direct_io));
     contexts.emplace_back(ctx);
 
-    // Disable mmap in case Direct I/O is enabled and available
-    if (use_direct_io && files.at(0)->has_direct_io()) {
+    // check if direct io is enabled and supported
+    use_direct_io = use_direct_io && files.back()->has_direct_io();
+
+    if (use_direct_io && use_mmap) {
+        LLAMA_LOG_WARN("%s: direct I/O is enabled, disabling mmap\n", __func__);
         use_mmap = false;
     }
 
@@ -581,6 +584,10 @@ llama_model_loader::llama_model_loader(
             files.emplace_back(new llama_file(fname_split, "rb", use_direct_io));
             contexts.emplace_back(ctx);
 
+            if (use_direct_io && !files.back()->has_direct_io()) {
+                throw std::runtime_error(format("unexpected: direct I/O is not supported for split file %s", fname_split));
+            }
+
             // Save tensors data offset info of the shard.
             for (ggml_tensor * cur = ggml_get_first_tensor(ctx); cur; cur = ggml_get_next_tensor(ctx, cur)) {
                 std::string tensor_name = std::string(cur->name);
@@ -722,6 +729,7 @@ llama_model_loader::llama_model_loader(
     }
 
     this->use_mmap = use_mmap;
+    this->use_direct_io = use_direct_io;
     this->check_tensors = check_tensors;
     this->no_alloc = no_alloc;
 }
diff --git a/src/llama-model-loader.h b/src/llama-model-loader.h
index de06b5283..6f15115ce 100644
--- a/src/llama-model-loader.h
+++ b/src/llama-model-loader.h
@@ -70,6 +70,7 @@ struct llama_model_loader {
     size_t   n_bytes    = 0;
 
     bool use_mmap = false;
+    bool use_direct_io = false;
     bool check_tensors;
     bool no_alloc;
 
diff --git a/src/llama-model.cpp b/src/llama-model.cpp
index cf0c39475..502859d2e 100644
--- a/src/llama-model.cpp
+++ b/src/llama-model.cpp
@@ -2337,7 +2337,8 @@ bool llama_model::load_tensors(llama_model_loader & ml) {
 
     const bool use_mmap_buffer = true;
 
-    LLAMA_LOG_INFO("%s: loading model tensors, this can take a while... (mmap = %s)\n", __func__, ml.use_mmap ? "true" : "false");
+    LLAMA_LOG_INFO("%s: loading model tensors, this can take a while... (mmap = %s, direct_io = %s)\n",
+            __func__, ml.use_mmap ? "true" : "false", ml.use_direct_io ? "true" : "false");
 
     // build a list of buffer types for the CPU and GPU devices
     pimpl->cpu_buft_list = make_cpu_buft_list(devices, params.use_extra_bufts, params.no_host);

@askmyteapot

Just an FYI: #18012 broke loading with mmap disabled on Windows.

@JTischbein
Contributor Author

@askmyteapot Thank you for the hint. The issue was that I used off_t, which is a 32-bit signed long on Windows. The fix is in this PR.
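
Illustrative only (the actual fix is in the PR): the general idea is to avoid off_t for file offsets and use an explicit 64-bit type together with the platform's 64-bit seek.

#include <cstdint>
#include <cstdio>

// sketch: seek to a 64-bit offset portably, instead of relying on off_t
static int seek64(FILE * fp, uint64_t offs) {
#ifdef _WIN32
    return _fseeki64(fp, (int64_t) offs, SEEK_SET);   // 64-bit seek in the MSVC CRT
#else
    return fseeko(fp, (off_t) offs, SEEK_SET);        // off_t is 64-bit on typical Unix systems
#endif
}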

@JTischbein
Contributor Author

@ggerganov Should I split the Windows fix off into a new PR, separate from the changes here?

@ggerganov
Member

Yes, I would like to take an extra look at the changes here, so it's better to fix the Windows issue in a separate PR in the meantime. Thanks
