ggml rope v2 demo #59
base: master
Conversation
Note: Reviews paused.
Walkthrough

This PR adds a composable RoPE (Rotary Position Embedding) operation to GGML, introducing a new `GGML_OP_ROPE_COMP` op.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~30 minutes
Pre-merge checks and finishing touches: ❌ Failed checks (2 warnings, 1 inconclusive)
@coderabbitai not working, something is wrong here. find the mistake
@coderabbitai these are unrelated errors. focus on the main change, the rope algorithm
Tip: For best results, initiate chat on the files or code changes.

@ngxson I found the bug in the RoPE algorithm! 🎯

Location: `ggml_rope_comp_set_yarn` in `ggml/src/ggml.c`.

The problem: the variable names `yarn_high` and `yarn_low` are assigned the wrong correction dimensions. The bug is in `ggml_rope_comp_set_yarn`:

```c
// WRONG - variable names are swapped!
float yarn_high = floorf(ggml_rope_yarn_corr_dim(n_dims, n_ctx_orig, beta_fast, freq_base));
float yarn_low  = ceilf(ggml_rope_yarn_corr_dim(n_dims, n_ctx_orig, beta_slow, freq_base));
```

With typical values (beta_fast=32.0, beta_slow=1.0), `ggml_rope_yarn_corr_dim` returns a smaller correction dimension for `beta_fast` than for `beta_slow`. But the code assigns them backwards! The variable named `yarn_high` ends up holding the low bound, and `yarn_low` the high bound.

The fix:

```c
// CORRECT - swap the variable names
float yarn_low  = floorf(ggml_rope_yarn_corr_dim(n_dims, n_ctx_orig, beta_fast, freq_base));
float yarn_high = ceilf(ggml_rope_yarn_corr_dim(n_dims, n_ctx_orig, beta_slow, freq_base));
```

This matches the original `ggml_rope_yarn_corr_dims` helper. With the names swapped, the YaRN ramp region is inverted and the intended scaling is effectively disabled.
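For reference, here is a minimal standalone check of the numbers behind that claim. The `corr_dim` formula below mirrors `ggml_rope_yarn_corr_dim` from ggml.c; the `n_dims`, `n_ctx_orig`, and `freq_base` values are illustrative, not taken from a specific model.

```c
// Standalone sketch: corr_dim(beta_fast) is the *smaller* dimension, so it
// belongs in yarn_low; corr_dim(beta_slow) is the larger one (yarn_high).
#include <math.h>
#include <stdio.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

static float corr_dim(int n_dims, int n_ctx_orig, float n_rot, float base) {
    return n_dims * logf(n_ctx_orig / (n_rot * 2 * (float) M_PI)) / (2 * logf(base));
}

int main(void) {
    const int   n_dims     = 128;
    const int   n_ctx_orig = 4096;
    const float freq_base  = 10000.0f;

    const float from_beta_fast = floorf(corr_dim(n_dims, n_ctx_orig, 32.0f, freq_base)); // beta_fast
    const float from_beta_slow = ceilf (corr_dim(n_dims, n_ctx_orig,  1.0f, freq_base)); // beta_slow

    printf("corr_dim(beta_fast) = %.0f (low bound)\n",  from_beta_fast);  // ~20
    printf("corr_dim(beta_slow) = %.0f (high bound)\n", from_beta_slow);  // ~46
    return 0;
}
```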
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
ggml/src/ggml.c (2)
993-1055: Add missing `GGML_OP_SYMBOL` entry for `GGML_OP_ROPE_COMP` to avoid null symbol lookups

You've added `"ROPE_COMP"` to `GGML_OP_NAME` and bumped `GGML_OP_COUNT` to 96, but `GGML_OP_SYMBOL` still only has `"rope(x)"` and `"rope_back(x)"`. For `GGML_OP_ROPE_COMP` this means `ggml_op_symbol(op)` will return a null pointer, which is later passed to `%s` formatters (e.g. in `ggml_graph_print` / DOT dumping), risking UB or crashes.

Recommend adding a symbol in the corresponding slot:

```diff
-    "soft_max_back(x)",
-    "rope(x)",
-    "rope_back(x)",
+    "soft_max_back(x)",
+    "rope(x)",
+    "rope_comp(x)", // GGML_OP_ROPE_COMP
+    "rope_back(x)",
```

and keeping the static_assert at 96 as you already did.

6913-6941: `GGML_OP_ROPE_COMP` backward pass is not implemented and will abort during training

`ggml_build_backward_expand` treats `GGML_OP_ROPE_COMP` identically to `GGML_OP_ROPE` in the `ignore_src[1]` section (lines 6933-6934), but `ggml_compute_backward` has a `case GGML_OP_ROPE:` (line 6663) with gradient computation while `GGML_OP_ROPE_COMP` has no corresponding case.

If a training graph includes a `GGML_OP_ROPE_COMP` node whose output influences a loss, the backward pass will reach that node and hit the default case, aborting with: `"unsupported ggml op for backward pass: GGML_OP_ROPE_COMP"`.

Either:
- Add a `case GGML_OP_ROPE_COMP:` implementation to `ggml_compute_backward` (mirroring `GGML_OP_ROPE` if the computation is identical), or
- Exclude `GGML_OP_ROPE_COMP` from gradient computation in `ggml_build_backward_expand` and document it as inference-only.
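A minimal stand-in sketch of the first option, showing only the dispatch shape. The enum, handler, and dispatch function here are illustrative placeholders, not ggml's actual backward-pass signature; the point is simply that the new op can fall through to the same gradient path if the forward math is identical.

```c
// Stand-in sketch: let ROPE_COMP share the ROPE gradient handler.
#include <stdio.h>

enum op { OP_ROPE, OP_ROPE_COMP, OP_OTHER };

static void rope_backward(void) { printf("accumulate RoPE gradient\n"); }

static void compute_backward(enum op o) {
    switch (o) {
        case OP_ROPE:
        case OP_ROPE_COMP:   // fall through: same gradient as regular RoPE (if forward math matches)
            rope_backward();
            break;
        default:
            printf("unsupported op for backward pass\n");
            break;
    }
}

int main(void) {
    compute_backward(OP_ROPE_COMP);   // no longer hits the unsupported branch
    return 0;
}
```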
🧹 Nitpick comments (3)
ggml/include/ggml.h (1)
1862-1896: New RoPE composable API is coherent; consider minor comment/API polish

The new `ggml_rope_ordering` enum and the trio `ggml_rope_comp` / `ggml_rope_comp_set_yarn` / `ggml_rope_comp_set_multi` form a clear, composable RoPE configuration surface and match the implementations described in `ggml.c`. Function naming and parameter ordering are consistent with existing RoPE helpers.

Two minor polish points you may want to address (non-blocking):
- The comment `// demo new RoPE API (NOT yet to be merged)` will be misleading once this PR is merged. Consider rewording to something like `// experimental RoPE composable API` or removing it.
- `ggml_rope_comp` uses `int32_t n_dims` while the older RoPE APIs nearby use plain `int` for `n_dims`. It's not wrong, but if you care about API surface consistency you could standardize on one type across the RoPE functions in a follow-up.

If you'd like, I can propose a small follow-up diff that harmonizes the RoPE comments and `n_dims` types across these declarations.
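As a usage illustration of the composable flow, a minimal sketch assuming the declarations from this PR's `ggml.h`. The `ggml_rope_comp` call follows the signature quoted elsewhere in this review; the `ggml_rope_comp_set_yarn` argument list is only hinted at in a comment because its exact order is not reproduced here. Tensor shapes and hyperparameter values are illustrative only.

```c
// Minimal sketch of the intended composable usage (assumes this PR's ggml.h).
#include "ggml.h"

struct ggml_tensor * build_rope_comp(struct ggml_context * ctx,
                                     struct ggml_tensor  * cur,   // e.g. [n_embd_head, n_head, n_tokens]
                                     struct ggml_tensor  * pos) { // F32 positions, length a multiple of cur->ne[2]
    const int32_t n_dims    = 128;
    const float   freq_base = 10000.0f;

    // base rotation with NEOX ordering
    struct ggml_tensor * out = ggml_rope_comp(ctx, cur, pos, n_dims, freq_base,
                                              GGML_ROPE_ORDERING_NEOX);

    // YaRN (and M-RoPE / freq factors) are then layered onto the same node, e.g.:
    //   ggml_rope_comp_set_yarn(out, /* n_ctx_orig, freq_scale, attn_factor,
    //                                   beta_fast, beta_slow, ramp_factor */ ...);
    // (exact argument order per the ggml.h declaration in this PR)

    return out;
}
```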
ggml/src/ggml-cpu/ops.cpp (1)
5859-5863: Thread partitioning and `idx_offset` leave early heads untouched and reduce parallelism

Two related points around `idx_offset`:

- You derive `nr` from `ggml_nrows(dst)` (i.e. `ne1 * ne2 * ne3`), but the work loop only iterates heads in `[idx_offset, ne1)`. Effective rows are `nr_eff = (ne1 - idx_offset) * ne2 * ne3`. Using `nr` for `dr` / `ir0` / `ir1` means that when `idx_offset` is non-zero, some threads will get no work even though others still have rows to process. This is correct but wastes parallelism for partial-head use.
- Because both the rotation and the "copy remaining channels" loop are nested under `for (int64_t i1 = idx_offset; i1 < ne1; i1++)`, heads with index `< idx_offset` are never written in `dst` at all. That's fine if the current API never sets `idx_offset > 0` (or guarantees that those heads are filled elsewhere), but if/when you start using non-zero `idx_offset` from `ggml_rope_comp_set_multi`, those heads will contain stale/undefined data.

If you plan to actually use `idx_offset > 0`, it would be safer to:
- Assert `0 <= idx_offset && idx_offset <= ne1`, and
- Either:
  - base `nr` on `(ne1 - idx_offset)` for proper thread partitioning, and/or
  - explicitly copy `src0` → `dst` for heads `< idx_offset` before the main loop.

Also applies to: 5925-5943
ggml/src/ggml.c (1)
4271-4349: Tighten ggml_rope_comp param invariants and reuse existing YARN helpers

The layout and use of `op_params` between `ggml_rope_comp` and `ggml_rope_comp_set_yarn` is internally consistent, but a couple of small hardening tweaks would reduce footguns:

- `n_dims` is used as a divisor (`-2.0f / n_dims`) and, for `GGML_ROPE_ORDERING_NEOX`, as `n_dims / 2`, so consider guarding it:

```diff
 GGML_API struct ggml_tensor * ggml_rope_comp(
         struct ggml_context * ctx,
         struct ggml_tensor  * a,
         struct ggml_tensor  * b,
         int32_t               n_dims,
         float                 freq_base,
         enum ggml_rope_ordering ordering) {
-    GGML_ASSERT(ggml_is_vector(b));
+    GGML_ASSERT(n_dims > 0);
+    GGML_ASSERT(ggml_is_vector(b));
@@
-    if (ordering == GGML_ROPE_ORDERING_NEOX) {
+    if (ordering == GGML_ROPE_ORDERING_NEOX) {
+        GGML_ASSERT((n_dims % 2) == 0);
         idx_pair  = n_dims / 2;
         idx_scale = 2;
     }
```

- `ggml_rope_comp_set_yarn` duplicates and slightly diverges from `ggml_rope_yarn_corr_dims` (no clamping to `[0, n_dims - 1]`). If you want behavior consistent with existing YARN RoPE, you can rely on that helper:

```diff
-    float yarn_high = floorf(ggml_rope_yarn_corr_dim(n_dims, n_ctx_orig, beta_fast, freq_base));
-    float yarn_low  = ceilf(ggml_rope_yarn_corr_dim(n_dims, n_ctx_orig, beta_slow, freq_base));
+    float dims[2];
+    ggml_rope_yarn_corr_dims(n_dims, n_ctx_orig, freq_base, beta_fast, beta_slow, dims);
+    float yarn_high = dims[0];
+    float yarn_low  = dims[1];
```

This keeps all RoPE+YARN variants using the same correction-dimension logic while still defaulting YARN off (all zeros) in the initial `ggml_rope_comp` call. Given your preference to keep things simple and defer more exhaustive error handling, feel free to treat the additional asserts as optional if you don't expect adversarial inputs. Based on learnings.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- ggml/include/ggml.h (2 hunks)
- ggml/src/ggml-cpu/ggml-cpu.c (3 hunks)
- ggml/src/ggml-cpu/ops.cpp (1 hunks)
- ggml/src/ggml-cpu/ops.h (1 hunks)
- ggml/src/ggml.c (5 hunks)
- src/models/mistral3.cpp (2 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{cpp,h,hpp}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.{cpp,h,hpp}: Always format C++ code using `git clang-format` before committing, following .clang-format configuration (4-space indentation, 120 column limit, braces on same line for functions, pointer/reference alignment in middle)
Minimize external dependencies; avoid adding new external dependencies unless absolutely necessary
Files:
- ggml/src/ggml-cpu/ops.h
- src/models/mistral3.cpp
- ggml/include/ggml.h
- ggml/src/ggml-cpu/ops.cpp
**/*.{cpp,h,hpp,py}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Ensure cross-platform compatibility by testing code changes on Linux, macOS, and Windows when possible
Files:
- ggml/src/ggml-cpu/ops.h
- src/models/mistral3.cpp
- ggml/include/ggml.h
- ggml/src/ggml-cpu/ops.cpp
src/**/*.cpp
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Prioritize performance optimization in core library implementations in
src/, as this is a performance-critical inference library
Files:
src/models/mistral3.cpp
🧠 Learnings (2)
📓 Common learnings
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/lookahead/lookahead.cpp:97-98
Timestamp: 2025-03-13T22:56:47.417Z
Learning: ngxson prefers to prioritize simplification in some cases and defer adding error handling to a later time, as indicated by their response to a suggestion about adding error checks for llama_decode_ext calls.
📚 Learning: 2025-11-29T22:55:53.865Z
Learnt from: CR
Repo: ngxson/llama.cpp PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-29T22:55:53.865Z
Learning: Applies to src/llama.cpp : Maintain core library implementation in `src/llama.cpp` with careful attention to API contracts and backward compatibility
Applied to files:
ggml/src/ggml-cpu/ops.cpp
🧬 Code graph analysis (6)
ggml/src/ggml-cpu/ops.h (1)
ggml/src/ggml-cpu/ops.cpp (2)
- ggml_compute_forward_rope_comp (5948-5968)
- ggml_compute_forward_rope_comp (5948-5950)
ggml/src/ggml-cpu/ggml-cpu.c (1)
ggml/src/ggml-cpu/ops.cpp (2)
- ggml_compute_forward_rope_comp (5948-5968)
- ggml_compute_forward_rope_comp (5948-5950)
src/models/mistral3.cpp (1)
ggml/src/ggml.c (2)
- ggml_rope_comp (4271-4323)
- ggml_rope_comp_set_yarn (4325-4349)
ggml/src/ggml.c (1)
ggml/src/ggml-impl.h (1)
- ggml_set_op_params (145-149)
ggml/include/ggml.h (1)
ggml/src/ggml.c (2)
- ggml_rope_comp (4271-4323)
- ggml_rope_comp_set_yarn (4325-4349)
ggml/src/ggml-cpu/ops.cpp (2)
ggml/src/ggml.c (1)
- ggml_nrows (1231-1235)

ggml/src/ggml-sycl/rope.cpp (2)
- rope_yarn_ramp (13-16)
- rope_yarn_ramp (13-13)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
- GitHub Check: windows-latest-cmake-hip
- GitHub Check: windows-2022-cmake-cuda (12.4)
- GitHub Check: windows-latest-cmake (llvm-arm64, arm64, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/ar...
- GitHub Check: windows-latest-cmake-sycl
- GitHub Check: macOS-latest-cmake-tvos
- GitHub Check: macOS-latest-cmake-ios
- GitHub Check: ubuntu-24-cmake-vulkan-deb
- GitHub Check: ubuntu-22-cmake-hip
- GitHub Check: server-windows
🔇 Additional comments (6)
ggml/src/ggml-cpu/ops.h (1)
63-65: Forward prototype for `ROPE_COMP` looks consistent

The new `ggml_compute_forward_rope_comp` declaration matches the implementation in `ggml/src/ggml-cpu/ops.cpp` and is placed correctly alongside the other RoPE forward ops. No issues from an ABI or style perspective.

ggml/include/ggml.h (1)

516-519: `GGML_OP_ROPE_COMP` placement in `ggml_op` enum is reasonable

Adding `GGML_OP_ROPE_COMP` immediately after `GGML_OP_ROPE` keeps RoPE-related ops grouped and is consistent with how other op families are organized. As long as the corresponding op-name/symbol lookup tables and any serialization that depends on enum order are updated (which appears to be done in ggml.c per the PR summary), this looks fine.

You may want to double-check that any on-disk or inter-process formats that encode `enum ggml_op` by ordinal (if any) are not relying on the previous numeric values of entries after `GGML_OP_ROPE`. If they do, bumping a format version or using symbolic encoding might be safer.

src/models/mistral3.cpp (1)

41-41: LGTM! Rope factors retrieval commented out as part of API migration.

The migration from `ggml_rope_ext` to `ggml_rope_comp` removes the need for explicit rope_factors retrieval. The new API handles rope configuration through `ggml_rope_comp_set_yarn`.

ggml/src/ggml-cpu/ggml-cpu.c (3)

1866-1869: LGTM! Forward routing for ROPE_COMP correctly integrated.

The case statement properly routes `GGML_OP_ROPE_COMP` to `ggml_compute_forward_rope_comp`, following the established pattern for operation dispatch.

2301-2301: LGTM! Task scheduling correctly treats ROPE_COMP as parallelizable.

The operation is properly grouped with `GGML_OP_ROPE` and `GGML_OP_ROPE_BACK`, allowing multi-threaded execution with `n_tasks = n_threads`.

2820-2820: LGTM! Work size estimation consistent with ROPE operations.

The work buffer allocation for `GGML_OP_ROPE_COMP` correctly mirrors the pattern used by `GGML_OP_ROPE` and `GGML_OP_ROPE_BACK`, allocating `F32` temporary storage sized by `node->ne[0] * n_tasks`.
Actionable comments posted: 0
♻️ Duplicate comments (1)
ggml/src/ggml-cpu/ops.cpp (1)
5893-5905: Yarn ramp still called with swapped (high, low) arguments

`rope_yarn_ramp` is declared as `rope_yarn_ramp(const float low, const float high, const int i0)` and other call sites pass `(low, high, i0)`. Here `init_cache` does:

```c
float ramp_mix = rope_yarn_ramp(yarn_high, yarn_low, i0) * ramp_factor;
```

which inverts low/high and breaks the intended ramp region, effectively disabling correct YaRN scaling for `GGML_OP_ROPE_COMP`. This is the same issue previously flagged in an earlier review and is still present.

Change the call to pass low first, high second:

```diff
-    float ramp_mix = rope_yarn_ramp(yarn_high, yarn_low, i0) * ramp_factor;
+    float ramp_mix = rope_yarn_ramp(yarn_low, yarn_high, i0) * ramp_factor;
```
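To see concretely what the swapped call does, here is a standalone sketch. The ramp body mirrors the `rope_yarn_ramp` helper referenced above (as found in the CPU/SYCL backends at the time of writing); the low/high values are the typical ones discussed earlier in this thread.

```c
// Swapping (low, high) degenerates the YaRN ramp into a step function.
#include <stdio.h>

#define MIN(a, b) ((a) < (b) ? (a) : (b))
#define MAX(a, b) ((a) > (b) ? (a) : (b))

static float rope_yarn_ramp(const float low, const float high, const int i0) {
    const float y = (i0 / 2 - low) / MAX(0.001f, high - low);
    return 1.0f - MIN(1.0f, MAX(0.0f, y));
}

int main(void) {
    const float yarn_low = 20.0f, yarn_high = 46.0f;
    for (int i0 = 0; i0 < 128; i0 += 32) {
        printf("i0=%3d  ramp(low, high)=%.3f  ramp(high, low)=%.3f\n",
               i0,
               rope_yarn_ramp(yarn_low,  yarn_high, i0),   // intended call: smooth ramp over [low, high]
               rope_yarn_ramp(yarn_high, yarn_low,  i0));  // swapped: collapses to a hard step
    }
    return 0;
}
```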
🧹 Nitpick comments (1)
ggml/src/ggml-cpu/ops.cpp (1)
5835-5852: `sections` parameters are parsed but currently unused (pending M-RoPE support)

`sections[0..3]` are copied out of `dst->op_params` but never used, and there's only a `// TODO M-RoPE` comment. This is harmless but will likely trigger an unused-variable warning.

If M-RoPE support isn't wired in yet, consider either:
- Removing `sections` and its `memcpy` lines for now, or
- Marking them explicitly unused (e.g., `GGML_UNUSED(sections);`) until the multi-rope logic is implemented.
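A tiny stand-alone illustration of the second option. The macro here is a local stand-in for ggml's `GGML_UNUSED`, which, at the time of writing, is a simple `(void)` cast.

```c
// Silence -Wunused-variable until the M-RoPE logic consumes `sections`.
#define MY_UNUSED(x) (void)(x)   // stand-in for GGML_UNUSED

int main(void) {
    int sections[4] = {0, 0, 0, 0};   // parsed from op_params but not consumed yet (TODO M-RoPE)
    MY_UNUSED(sections);
    return 0;
}
```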
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- ggml/src/ggml-cpu/ops.cpp (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.{cpp,h,hpp}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.{cpp,h,hpp}: Always format C++ code using `git clang-format` before committing, following .clang-format configuration (4-space indentation, 120 column limit, braces on same line for functions, pointer/reference alignment in middle)
Minimize external dependencies; avoid adding new external dependencies unless absolutely necessary
Files:
ggml/src/ggml-cpu/ops.cpp
**/*.{cpp,h,hpp,py}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Ensure cross-platform compatibility by testing code changes on Linux, macOS, and Windows when possible
Files:
ggml/src/ggml-cpu/ops.cpp
🧠 Learnings (2)
📓 Common learnings
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/lookahead/lookahead.cpp:97-98
Timestamp: 2025-03-13T22:56:47.417Z
Learning: ngxson prefers to prioritize simplification in some cases and defer adding error handling to a later time, as indicated by their response to a suggestion about adding error checks for llama_decode_ext calls.
📚 Learning: 2025-11-29T22:55:53.865Z
Learnt from: CR
Repo: ngxson/llama.cpp PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-29T22:55:53.865Z
Learning: Applies to src/llama.cpp : Maintain core library implementation in `src/llama.cpp` with careful attention to API contracts and backward compatibility
Applied to files:
ggml/src/ggml-cpu/ops.cpp
🧬 Code graph analysis (1)
ggml/src/ggml-cpu/ops.cpp (3)
ggml/src/ggml.c (1)
- ggml_nrows (1231-1235)

ggml/src/ggml-sycl/rope.cpp (2)
- rope_yarn_ramp (13-16)
- rope_yarn_ramp (13-13)

ggml/src/ggml-cpu/llamafile/sgemm.cpp (4)
- p (1046-1051)
- p (1046-1046)
- p (1053-1062)
- p (1053-1053)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (15)
- GitHub Check: build-cmake-pkg / linux
- GitHub Check: build-linux-cross / debian-13-loongarch64-vulkan-cross
- GitHub Check: build-linux-cross / ubuntu-24-riscv64-cpu-spacemit-ime-cross
- GitHub Check: windows-latest-cmake (vulkan-x64, x64, -DCMAKE_BUILD_TYPE=Release -DGGML_NATIVE=OFF -DLLAMA_BUILD...
- GitHub Check: ggml-ci-x64-cpu-high-perf
- GitHub Check: ggml-ci-x64-cpu-low-perf
- GitHub Check: ubuntu-22-cmake-sycl
- GitHub Check: ubuntu-22-cmake-musa
- GitHub Check: windows-msys2 (UCRT64, ucrt-x86_64, Release)
- GitHub Check: ubuntu-22-cmake-hip
- GitHub Check: ubuntu-24-cmake-webgpu
- GitHub Check: ubuntu-22-cmake-sycl-fp16
- GitHub Check: ubuntu-latest-cmake-sanitizer (UNDEFINED, Debug)
- GitHub Check: ubuntu-latest-cmake-sanitizer (THREAD, Debug)
- GitHub Check: server-windows
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
ggml/src/ggml.c (2)
1051-1158: Add missing GGML_OP_SYMBOL entry for ROPE_COMP between rope(x) and rope_back(x)

The GGML_OP_NAME table includes "ROPE_COMP" between "ROPE" and "ROPE_BACK", but GGML_OP_SYMBOL is missing the corresponding "rope_comp(x)" entry. This causes array index misalignment where ggml_op_symbol(GGML_OP_ROPE_COMP) returns "rope_back(x)" and shifts all subsequent symbols by one position, breaking graph dumps and logging that rely on ggml_op_symbol().
Add "rope_comp(x)", at the correct position in GGML_OP_SYMBOL to align both tables.
6919-6943: Backward pass: ROPE_COMP is wired in graph building but missing compute implementation

In `ggml_build_backward_expand`, you correctly added `GGML_OP_ROPE_COMP` to the list where `ignore_src[1] = true`, matching `GGML_OP_ROPE` (positions are non-differentiable). This handles the graph-wiring phase.

However, `ggml_compute_backward` has no `case GGML_OP_ROPE_COMP:`, only `case GGML_OP_ROPE:`. If a training graph includes `GGML_OP_ROPE_COMP` on any path from a loss to trainable parameters, the backward pass will hit the default branch and call `GGML_ABORT("unsupported ggml op for backward pass")`.

Either:
- Add a `GGML_OP_ROPE_COMP` case in `ggml_compute_backward` that reuses or parallels the `GGML_OP_ROPE` gradient logic (since both compute rope embeddings with similar parameterization), or
- If this operation is strictly inference-only, assert or document that constraint explicitly to prevent misuse in training.
♻️ Duplicate comments (1)
src/models/mistral3.cpp (1)
78-85: Use model hyperparameters instead of hardcoded YaRN beta values.

Lines 82 and 85 hardcode `beta_fast=32.0f` and `beta_slow=1.0f` in the `ggml_rope_comp_set_yarn` calls. As noted in the previous review, these should use `hparams.yarn_beta_fast` and `hparams.yarn_beta_slow` to respect model-specific YaRN configurations loaded from model metadata.
🧹 Nitpick comments (5)
ggml/src/ggml-metal/ggml-metal-ops.cpp (1)
373-376: ROPE_COMP Metal path wiring and arg decoding look correct

- The new `GGML_OP_ROPE_COMP` case in `ggml_metal_op_encode_impl` correctly forwards to `ggml_metal_op_rope_comp`, returning a single fused node, consistent with the existing ROPE path.
- `ggml_metal_op_rope_comp` mirrors `ggml_metal_op_rope` structurally: dimension asserts (`ne10 % ne02 == 0`, `ne10 >= ne02`), threadgroup sizing via `nth = std::min(1024, ne00)`, and dispatch over `(ne01, ne02, ne03, nth, 1, 1)`.
- The op-params decoding (indices 0-3 as ints, 4-9 as floats, 10-13 as section ints) matches the `ggml_rope_comp` / `ggml_rope_comp_set_yarn` layout from `ggml.c`, and those values are passed one-for-one into `ggml_metal_kargs_rope_comp`, including `src2 = (op->src[2] != nullptr)`.

If you want to tighten readability later, consider introducing a small helper/local `const int32_t * p = op->op_params;` and using it for the `memcpy` calls, to mirror other Metal ops and reduce repeated casts, but functionally this is fine as is.

Also applies to: 3223-3311
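A small standalone illustration of that decoding pattern: keep one typed pointer into the params block and `memcpy` each field out of it instead of repeating the cast at every call site. `op_params` here is a local stand-in for `dst->op_params`; the offsets follow the layout described in this review (0-3 ints, 4-9 floats, 10-13 ints).

```c
// Decode mixed int/float fields from a packed int32_t params array.
#include <stdint.h>
#include <string.h>
#include <stdio.h>

int main(void) {
    int32_t op_params[16] = {0};      // stand-in for dst->op_params
    op_params[0] = 128;               // n_dims
    float theta_scale = 0.865964f;    // ~ powf(10000.0f, -2.0f / 128)
    memcpy(&op_params[4], &theta_scale, sizeof(float));

    const int32_t * p = op_params;    // single typed view of the params

    int32_t n_dims;
    float   theta;
    memcpy(&n_dims, p + 0, sizeof(int32_t));
    memcpy(&theta,  p + 4, sizeof(float));

    printf("n_dims=%d theta_scale=%f\n", n_dims, theta);
    return 0;
}
```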
ggml/src/ggml-cpu/ops.cpp (2)
5879-5884: Guard freq_factors use when n_dims < ne0

`freq_factors` is indexed as `freq_factors[i0/2]` in `init_cache`, and you only assert `src2->ne[0] >= n_dims / 2`. However the loop currently runs to `i0 < ne0`, while `rotate_pairs` is only called over `n_dims`. If a future call uses `n_dims < ne0` with `src2->ne[0] == n_dims / 2` (matching the assertion), this will read past the end of `freq_factors`.

If you adopt the change above to bound the cache loop by `n_dims` instead of `ne0`, this risk disappears and the assertion remains correct. Otherwise, you should either tighten the assertion to `src2->ne[0] >= ne0 / 2` or explicitly clamp the loop to `min(ne0, 2*src2->ne[0])`.

5835-5853: Unused `sections` / TODO M-RoPE

`sections[4]` is read from `dst->op_params` but never used in `ggml_compute_forward_rope_comp_flt`, and the body contains `// TODO M-RoPE` comments. This is fine for a demo, but it will produce unused-variable warnings and ROPE_COMP currently ignores any multi-rope configuration passed via `ggml_rope_comp_set_multi`.

Either mark `sections` as `[[maybe_unused]]` for now or wire it into the M-RoPE implementation in a follow-up; also consider clarifying in comments that ROPE_COMP currently supports only the simple (non-M-RoPE) case so downstream users know what to expect.

ggml/src/ggml.c (2)
4271-4324: ggml_rope_comp op construction and metadata layout look sound

The new `ggml_rope_comp` API:
- Validates shapes/types reasonably (`b` is F32 vector, length multiple of `a->ne[2]` for M-RoPE).
- Encodes NORMAL vs NEOX ordering using `idx_pair` / `idx_scale` in a way that matches typical RoPE conventions.
- Derives `theta_scale = powf(freq_base, -2.0f / n_dims)` as expected.
- Packs op_params in a compact, forward-compatible layout (dims, ordering, theta, YaRN/mrope slots), and duplicates `a` to keep shape/type consistent with existing RoPE ops.

Given how tightly this layout couples to CPU/Metal kernels, it would be good to keep the offsets documented somewhere (even just a short comment block) so future changes don't silently desync the layouts.
4326-4352: YaRN low/high correction range now matches helper and fixes the earlier swap

`ggml_rope_comp_set_yarn` now:
- Reads `n_dims` from `op_params[0]`.
- Computes:
  - `start = floor(corr_dim(beta_fast))`
  - `end = ceil(corr_dim(beta_slow))`
  - `yarn_low = MAX(0, start)`
  - `yarn_high = MIN(n_dims - 1, end)`
- Writes `yarn_high` and `yarn_low` back at offsets 5 and 6, plus `freq_scale`, `attn_factor`, and `ramp_factor` at 7-9.

This is consistent with `ggml_rope_yarn_corr_dims`'s semantics (dims[0] = low, dims[1] = high) and addresses the previous high/low swap that broke the YaRN ramp. The zero-clamp and `n_dims - 1` clamp also look correct.

One minor suggestion: consider adding a short comment here pointing to `ggml_rope_yarn_corr_dims` and documenting the `[5..9]` slots, since this contract is shared implicitly with the backends.
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (12)
- ggml/include/ggml.h (2 hunks)
- ggml/src/ggml-cpu/ops.cpp (1 hunks)
- ggml/src/ggml-metal/ggml-metal-common.cpp (1 hunks)
- ggml/src/ggml-metal/ggml-metal-device.cpp (1 hunks)
- ggml/src/ggml-metal/ggml-metal-device.h (1 hunks)
- ggml/src/ggml-metal/ggml-metal-device.m (1 hunks)
- ggml/src/ggml-metal/ggml-metal-impl.h (1 hunks)
- ggml/src/ggml-metal/ggml-metal-ops.cpp (2 hunks)
- ggml/src/ggml-metal/ggml-metal-ops.h (1 hunks)
- ggml/src/ggml-metal/ggml-metal.metal (3 hunks)
- ggml/src/ggml.c (5 hunks)
- src/models/mistral3.cpp (2 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{cpp,h,hpp}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.{cpp,h,hpp}: Always format C++ code using `git clang-format` before committing, following .clang-format configuration (4-space indentation, 120 column limit, braces on same line for functions, pointer/reference alignment in middle)
Minimize external dependencies; avoid adding new external dependencies unless absolutely necessary
Files:
- ggml/src/ggml-metal/ggml-metal-device.cpp
- src/models/mistral3.cpp
- ggml/src/ggml-metal/ggml-metal-device.h
- ggml/src/ggml-metal/ggml-metal-ops.h
- ggml/src/ggml-cpu/ops.cpp
- ggml/include/ggml.h
- ggml/src/ggml-metal/ggml-metal-ops.cpp
- ggml/src/ggml-metal/ggml-metal-common.cpp
- ggml/src/ggml-metal/ggml-metal-impl.h
**/*.{cpp,h,hpp,py}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Ensure cross-platform compatibility by testing code changes on Linux, macOS, and Windows when possible
Files:
- ggml/src/ggml-metal/ggml-metal-device.cpp
- src/models/mistral3.cpp
- ggml/src/ggml-metal/ggml-metal-device.h
- ggml/src/ggml-metal/ggml-metal-ops.h
- ggml/src/ggml-cpu/ops.cpp
- ggml/include/ggml.h
- ggml/src/ggml-metal/ggml-metal-ops.cpp
- ggml/src/ggml-metal/ggml-metal-common.cpp
- ggml/src/ggml-metal/ggml-metal-impl.h
src/**/*.cpp
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Prioritize performance optimization in core library implementations in
src/, as this is a performance-critical inference library
Files:
src/models/mistral3.cpp
🧠 Learnings (2)
📓 Common learnings
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/lookahead/lookahead.cpp:97-98
Timestamp: 2025-03-13T22:56:47.417Z
Learning: ngxson prefers to prioritize simplification in some cases and defer adding error handling to a later time, as indicated by their response to a suggestion about adding error checks for llama_decode_ext calls.
📚 Learning: 2025-11-29T22:55:53.865Z
Learnt from: CR
Repo: ngxson/llama.cpp PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-29T22:55:53.865Z
Learning: Applies to src/llama.cpp : Maintain core library implementation in `src/llama.cpp` with careful attention to API contracts and backward compatibility
Applied to files:
ggml/src/ggml-cpu/ops.cpp
🧬 Code graph analysis (6)
ggml/src/ggml-metal/ggml-metal-device.cpp (1)
ggml/src/ggml.c (1)
- ggml_type_name (1283-1285)
ggml/src/ggml-metal/ggml-metal-device.h (1)
ggml/src/ggml-metal/ggml-metal-device.cpp (2)
- ggml_metal_library_get_pipeline_rope_comp (1478-1493)
- ggml_metal_library_get_pipeline_rope_comp (1478-1478)
ggml/src/ggml-metal/ggml-metal-ops.h (1)
ggml/src/ggml-metal/ggml-metal-ops.cpp (2)
- ggml_metal_op_rope_comp (3223-3311)
- ggml_metal_op_rope_comp (3223-3223)
ggml/src/ggml-cpu/ops.cpp (2)
ggml/src/ggml.c (1)
- ggml_nrows (1231-1235)

ggml/src/ggml-sycl/rope.cpp (2)
- rope_yarn_ramp (13-16)
- rope_yarn_ramp (13-13)
ggml/include/ggml.h (1)
ggml/src/ggml.c (2)
- ggml_rope_comp (4271-4324)
- ggml_rope_comp_set_yarn (4326-4352)
ggml/src/ggml.c (1)
ggml/src/ggml-impl.h (1)
- ggml_set_op_params (145-149)
🪛 GitHub Actions: EditorConfig Checker
ggml/src/ggml-metal/ggml-metal.metal
[error] 4134-4134: Trailing whitespace detected by editorconfig-checker.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (19)
- GitHub Check: ggml-ci-x64-cpu-high-perf
- GitHub Check: ggml-ci-x64-cpu-low-perf
- GitHub Check: ubuntu-22-cmake-musa
- GitHub Check: ubuntu-latest-cmake-cuda
- GitHub Check: ubuntu-24-cmake-vulkan-deb
- GitHub Check: android-build
- GitHub Check: android-ndk-build (arm64-snapdragon, --preset arm64-android-snapdragon-release)
- GitHub Check: ubuntu-22-cmake-sycl-fp16
- GitHub Check: ubuntu-22-cmake-sycl
- GitHub Check: ubuntu-22-cmake-hip
- GitHub Check: ubuntu-latest-cmake-sanitizer (THREAD, Debug)
- GitHub Check: ubuntu-cpu-cmake (x64, ubuntu-22.04)
- GitHub Check: ubuntu-latest-cmake-rpc
- GitHub Check: macOS-latest-cmake-x64
- GitHub Check: ubuntu-latest-llguidance
- GitHub Check: ubuntu-latest-cmake-sanitizer (ADDRESS, Debug)
- GitHub Check: ubuntu-latest-cmake-sanitizer (UNDEFINED, Debug)
- GitHub Check: macOS-latest-cmake-arm64
- GitHub Check: server-windows
🔇 Additional comments (10)
ggml/src/ggml-metal/ggml-metal-common.cpp (1)
264-264: LGTM!

The addition of `GGML_OP_ROPE_COMP` to the reorder-safe operations list is appropriate and consistent with the treatment of `GGML_OP_ROPE`.

ggml/src/ggml-metal/ggml-metal-device.m (1)

1032-1032: LGTM!

The `GGML_OP_ROPE_COMP` operation is correctly enabled for Metal backend support, falling through to the same logic as `GGML_OP_ROPE`.

ggml/src/ggml-metal/ggml-metal-ops.h (1)

74-74: LGTM!

The function declaration for `ggml_metal_op_rope_comp` follows the established pattern and is appropriately placed alongside the existing `ggml_metal_op_rope` declaration.

ggml/src/ggml-metal/ggml-metal-device.h (1)

139-139: LGTM!

The pipeline getter declaration for ROPE_COMP is consistent with the existing pipeline getter patterns and is appropriately positioned alongside the ROPE pipeline getter.
ggml/src/ggml-metal/ggml-metal-device.cpp (1)

1478-1493: LGTM!

The `ggml_metal_library_get_pipeline_rope_comp` implementation correctly follows the established pattern for Metal pipeline getters. The function properly constructs the kernel name based on the tensor type and handles pipeline compilation when needed.

ggml/src/ggml-metal/ggml-metal.metal (2)

4415-4431: Kernel alias and host_name exports look consistent.

`kernel_rope_comp_t` correctly aliases `kernel_rope_comp<float>`, and the f32/f16 `host_name` instantiations follow the existing pattern for the other RoPE kernels.
4102-4153: Fix trailing whitespace and verify YaRN implementation details.

The kernel logic appears structurally sound, but several RoPE v2 / YaRN details need alignment with the CPU implementation:

- i_dim exponent: The formula `i_dim = (i0 / 2) + 1` as the exponent in `pow(args.theta_scale, i_dim)` should be verified against the CPU `ggml_rope_comp` implementation. If the CPU uses 0-based indexing (i.e., `i0/2`), this off-by-one shifts all frequency values on Metal.
- YaRN magnitude scaling: The YaRN formula includes `mscale = (0.1 * ln(freq_scale) + 1) * mscale_param` to adjust magnitudes in the ramp region. This kernel applies `* args.attn_factor` but never multiplies by this additional mscale term. Confirm that `attn_factor` passed from the host is already pre-scaled to encode the YaRN magnitude correction (this is a known workaround pattern); otherwise, the GPU path will produce different attention outputs than CPU YaRN for extrapolated positions.
- src2 indexing: Verify that `freq_factor = src2[i0/2]` aligns with how src2 is laid out on the host, specifically whether src2 uses logical RoPE pair indexing (`i0/2`) or follows the memory layout stride (`idx_scale`).
- Non-rotated tail: The else branch uses `i0 * args.nb00` / `i0 * args.nb0` instead of `ic * args.nb00` (where `ic = i0 / idx_scale`). This assumes the unrotated dimensions are contiguously stored in the standard layout; if they share the composable layout with rotated dims, use `ic` consistently.

Also, remove trailing whitespace on the blank line after `sin_theta` (around line 4134) to fix the EditorConfig CI failure.
517-572: ROPE_COMP op and composable RoPE API surface are consistent with the implementation
- Adding
GGML_OP_ROPE_COMPnext toGGML_OP_ROPEkeeps the op enum organized by functionality; downstream code that switches on ops can now distinguish the composable variant cleanly.enum ggml_rope_ordering { GGML_ROPE_ORDERING_NORMAL, GGML_ROPE_ORDERING_NEOX }is a sensible, small enum specifically forggml_rope_comp, and is intentionally separate from the olderGGML_ROPE_TYPE_*bitfield macros used in the legacymodeargument.- The declarations for
ggml_rope_comp,ggml_rope_comp_set_yarn, andggml_rope_comp_set_multimatch the ggml.c snippet (parameter order and types), so the header and implementation are in sync, including the YaRN and M‑RoPE extension hooks.No issues from an API or header‑consistency standpoint.
Also applies to: 1863-1895
ggml/src/ggml-metal/ggml-metal-impl.h (1)

263-295: Host-side `ggml_metal_kargs_rope_comp` struct layout is correct and aligns with shader usage

The Metal kernel file includes `ggml-metal-impl.h` directly (line 8), ensuring the shader uses the identical struct definition from the host. The struct is populated via designated initializers in `ggml_metal_op_rope_comp()` with all fields in the correct order and types, and the kernel accesses all fields (dims, strides, rotation params, yarn factors, and frequency scaling) as expected. No ABI mismatch exists.
993-995: ROPE_COMP op registration looks consistent with existing opsAdding
"ROPE_COMP"between"ROPE"and"ROPE_BACK"and bumpingGGML_OP_COUNTto 96 keeps the op-name table coherent with the enum ordering. Nothing else to change here as long as other op tables (symbols, backends) are kept in sync, which you mostly did below.If you want to double‑check enum/op-name alignment across the codebase, you can run a quick grep on
GGML_OP_ROPE_COMPusages.Also applies to: 1049-1049
Actionable comments posted: 2
🧹 Nitpick comments (2)
tests/test-rope.cpp (2)
147-153: Non-portable format specifier for `int64_t`.

Using `%lld` for `int64_t` is not portable (`int64_t` is not guaranteed to be `long long` on every platform). Use `PRId64` from `<cinttypes>` for cross-platform compatibility.

🔎 Proposed fix

Add the include at the top of the file:

```cpp
#include <cinttypes>
```

Then update the format string:

```diff
 std::string vars() {
     char buf[256];
     snprintf(buf, sizeof(buf),
-        "type=%d ne=(%lld,%lld,%lld,%lld) n_dims=%d mode=%d fs=%f ef=%f af=%f ff=%d v=%d inplace=%d",
-        type, ne_a[0], ne_a[1], ne_a[2], ne_a[3], n_dims, mode, fs, ef, af, ff ? 1 : 0, v, inplace ? 1 : 0);
+        "type=%d ne=(%" PRId64 ",%" PRId64 ",%" PRId64 ",%" PRId64 ") n_dims=%d mode=%d fs=%f ef=%f af=%f ff=%d v=%d inplace=%d",
+        type, ne_a[0], ne_a[1], ne_a[2], ne_a[3], n_dims, mode, fs, ef, af, ff ? 1 : 0, v, inplace ? 1 : 0);
     return std::string(buf);
 }
```

As per coding guidelines, ensure cross-platform compatibility.
340-400: Memory leak risk on assertion failure.

Raw `new` / `delete` is used for test cases. If the assertion at line 450 fails, remaining elements in `test_cases` won't be cleaned up. For test code this is acceptable, but using stack-allocated `std::vector<test_rope>` or smart pointers would be safer.

🔎 Alternative using value semantics

```diff
-    std::vector<test_rope *> test_cases;
+    std::vector<test_rope> test_cases;
     ...
-    test_cases.emplace_back(new test_rope(type, {128, 32, 2, 1}, ...));
+    test_cases.emplace_back(type, {128, 32, 2, 1}, ...);
     ...
-    std::vector<test_rope *> comp_cases;
-    for (auto & tc : test_cases) {
-        auto tc_comp = new test_rope(*tc);
-        tc_comp->use_comp = true;
-        comp_cases.push_back(tc_comp);
-    }
+    std::vector<test_rope> comp_cases;
+    for (auto & tc : test_cases) {
+        comp_cases.push_back(tc);
+        comp_cases.back().use_comp = true;
+    }
```

Then access via `test_cases[i]` directly instead of through pointers.

Also applies to: 446-450
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- tests/test-rope.cpp (2 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.{cpp,h,hpp}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.{cpp,h,hpp}: Always format C++ code using `git clang-format` before committing, following .clang-format configuration (4-space indentation, 120 column limit, braces on same line for functions, pointer/reference alignment in middle)
Minimize external dependencies; avoid adding new external dependencies unless absolutely necessary
Files:
tests/test-rope.cpp
**/*.{cpp,h,hpp,py}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Ensure cross-platform compatibility by testing code changes on Linux, macOS, and Windows when possible
Files:
tests/test-rope.cpp
🧠 Learnings (1)
📓 Common learnings
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/lookahead/lookahead.cpp:97-98
Timestamp: 2025-03-13T22:56:47.417Z
Learning: ngxson prefers to prioritize simplification in some cases and defer adding error handling to a later time, as indicated by their response to a suggestion about adding error checks for llama_decode_ext calls.
🧬 Code graph analysis (1)
tests/test-rope.cpp (1)
ggml/src/ggml.c (2)
- ggml_rope_multi (4122-4141)
- ggml_rope_ext (4175-4193)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (20)
- GitHub Check: build-linux-cross / ubuntu-24-riscv64-cpu-spacemit-ime-cross
- GitHub Check: build-linux-cross / debian-13-loongarch64-vulkan-cross
- GitHub Check: build-linux-cross / debian-13-loongarch64-cpu-cross
- GitHub Check: build-cmake-pkg / linux
- GitHub Check: windows-latest-cmake (llvm-arm64-opencl-adreno, arm64, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN...
- GitHub Check: windows-latest-cmake (llvm-arm64, arm64, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/ar...
- GitHub Check: windows-latest-cmake (vulkan-x64, x64, -DCMAKE_BUILD_TYPE=Release -DGGML_NATIVE=OFF -DLLAMA_BUILD...
- GitHub Check: windows-latest-cmake (cpu-x64 (static), x64, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmak...
- GitHub Check: windows-latest-cmake (openblas-x64, x64, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/x6...
- GitHub Check: ubuntu-24-wasm-webgpu
- GitHub Check: windows-msys2 (UCRT64, ucrt-x86_64, Release)
- GitHub Check: windows-msys2 (CLANG64, clang-x86_64, Release)
- GitHub Check: ubuntu-24-cmake-webgpu
- GitHub Check: macOS-latest-cmake-arm64
- GitHub Check: ubuntu-24-cmake-vulkan
- GitHub Check: ubuntu-latest-cmake-sanitizer (ADDRESS, Debug)
- GitHub Check: ubuntu-24-cmake-vulkan-deb
- GitHub Check: ubuntu-latest-cmake-rpc
- GitHub Check: macOS-latest-cmake-arm64-webgpu
- GitHub Check: server-windows
🔇 Additional comments (3)
tests/test-rope.cpp (3)
176-184: MROPE/Vision/IMROPE tests are always skipped for `rope_comp`.

`_ggml_rope_multi` returns `nullptr` when `use_comp = true`, causing all multi-rope variants (MROPE, Vision, IMROPE) to be skipped in the comparison tests. This is likely intentional for this demo, but worth documenting or removing those test cases from the matrix to avoid unnecessary iterations.

If this is expected behavior for the demo, consider adding a comment explaining that `rope_comp` doesn't yet support multi-rope modes, or filter out those test cases earlier to avoid wasted iterations.

410-425: Both graphs share the same context, so tensor initialization overwrites shared data.

Both `tc_rope` and `tc_comp` build graphs in the same `ctx0`, and both call `initialize_tensors(ctx0)`. Since both graphs reference the same underlying tensors (e.g., input `a`, `pos`), the second initialization overwrites values set by the first. If this is intentional to ensure both graphs use identical inputs, it works, but if different inputs were expected, this is a bug.

Verify that sharing the same tensor data between both graphs is the intended behavior. If both should use identical inputs (which seems correct for NMSE comparison), consider initializing tensors only once after both graphs are built.

🔎 Suggested clarification

```diff
     ggml_tensor * out0 = tc_rope->build_graph(ctx0);
     ggml_tensor * out1 = tc_comp->build_graph(ctx0);
     if (out0 == nullptr || out1 == nullptr) {
         GGML_PRINT("test_rope_comp \x1b[33mSKIPPED\x1b[0m: %s\n", tc_rope->vars().c_str());
         ggml_free(ctx0);
         delete tc_comp;
         delete tc_rope;
         continue;
     }
-    tc_rope->initialize_tensors(ctx0);
-    tc_comp->initialize_tensors(ctx0);
+    // Initialize tensors once - both graphs share the same input tensors
+    tc_rope->initialize_tensors(ctx0);
```

588-589: LGTM!

Integration of `test_rope_comp()` into the test suite is properly placed after existing tests.
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
ggml/src/ggml.c (1)
941-1050: `GGML_OP_SYMBOL` array is out of sync with `GGML_OP_NAME` after adding `ROPE_COMP`.

You inserted `"ROPE_COMP"` into `GGML_OP_NAME` between `"ROPE"` and `"ROPE_BACK"` (Lines 993-995), but `GGML_OP_SYMBOL` still only has `"rope(x)"` and `"rope_back(x)"` with no `"rope_comp(x)"` entry (Lines 1101-1105). Because `GGML_OP_SYMBOL` is indexed by `enum ggml_op`, every symbol from `ROPE_COMP` onward is now shifted/mismatched (e.g., `GGML_OP_ROPE_COMP` reports `"rope_back(x)"`, `GGML_OP_ROPE_BACK` reports `"clamp(x)"`, etc.).

This doesn't break computation but makes graph dumps and debugging misleading.

Insert a `"rope_comp(x)"` symbol at the corresponding index (after `"soft_max_back(x)"` / before `"rope(x)"` or wherever matches your final enum ordering) so `GGML_OP_SYMBOL[GGML_OP_ROPE_COMP]` is correct and all following entries realign.
♻️ Duplicate comments (1)
tests/test-rope.cpp (1)
379-425: Use integer loop variable for `v` to match `test_rope` field type.

`test_rope::v` is an `int` used as a bit flag (`if (v & 1)` in `build_input`), but the loop currently declares `float v : { 0, 1 }`. This relies on implicit float→int conversion at the constructor call and is unnecessarily confusing for a bit-flag.

Consider changing to `for (int v : { 0, 1 }) { ... }` to keep the type consistent and avoid accidental misuse later.
🧹 Nitpick comments (4)
ggml/include/ggml.h (1)
1869-1875: Consider expanding parameter documentation.

The declaration is clean and the comment on line 1872 about the F32 requirement is helpful. Consider adding brief documentation for the other parameters (especially `n_dims` and `freq_base`) to match the documentation style of other complex operations in this file.

ggml/src/ggml-cpu/ops.cpp (1)
5862-5876: Thread partitioning ignores `idx_offset`, leading to poor load balancing (but correct results)

`nr` is computed as `ggml_nrows(dst)` (i.e., `ne1 * ne2 * ne3`), but the inner loop over heads starts at `i1 = idx_offset`. The global row counter `ir` only increments for heads `i1 >= idx_offset`, so when `idx_offset > 0` the effective number of processed rows is smaller than `nr`. This makes `ir0` / `ir1` ranges (and thus `dr`) too large and can leave higher-indexed threads with zero work while thread 0 does everything.

Numerically this is still correct, but it hurts parallelism when you use `idx_offset` (e.g., skipping some heads).

Consider basing the work partitioning on an effective row count:

```c
const int64_t nr_eff = (ne1 - idx_offset) * ne2 * ne3;
```

- Use `nr_eff` to compute `dr`, `ir0`, `ir1`.
- Either:
  - Map `(i3, i2, i1 >= idx_offset)` to a compact `[0, nr_eff)` index when incrementing `ir`, or
  - Pre-advance `ir` by `idx_offset * ne2 * ne3` before the loops.

This will keep the behavior the same while distributing work more evenly across threads when `idx_offset` is non-zero.
tests/test-rope.cpp (1)

440-472: Test harness only exercises forward ROPE vs ROPE_COMP, not gradients.

`test_rope_comp()` builds two forward graphs (classic rope vs rope_comp) and compares NMSE, which is good for catching numerical issues in the RoPE implementation. However, it does not exercise backward/gradient paths for `GGML_OP_ROPE_COMP`.

If you plan to use `ROPE_COMP` in training or fine-tuning, consider adding at least one small gradient test (e.g., build a trivial loss on top of the rope_comp output and run `ggml_build_backward_expand`) to ensure backward integration is correct and stays correct over time.

ggml/src/ggml.c (1)
4271-4340: `ggml_rope_comp` op-param layout: overall structure OK but symbol/comment consistency matters.

The `ggml_rope_comp(...)` builder:
- Validates `b` is a F32 vector and length is a multiple of `a->ne[2]` (M-RoPE compatible).
- Encodes `n_dims`, `idx_pair`, `idx_scale`, `idx_offset`, and `theta_scale` plus a block of floats/ints into `op_params`, then sets `op = GGML_OP_ROPE_COMP` and `src[0]=a`, `src[1]=b`, `src[2]=NULL`.

This matches how other ops use `op_params` and is structurally fine. Just make sure the comments on the param slots match how the CPU/Metal kernels interpret them (especially slots 5-9 used by YARN), otherwise future refactors will be error-prone.

If kernels treat slot 5 as `yarn_low` and slot 6 as `yarn_high` (see comment on `ggml_rope_comp_set_yarn`), the comments here should be updated accordingly for clarity.
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- ggml/include/ggml.h (2 hunks)
- ggml/src/ggml-cpu/ops.cpp (1 hunks)
- ggml/src/ggml-metal/ggml-metal.metal (3 hunks)
- ggml/src/ggml.c (5 hunks)
- tests/test-rope.cpp (2 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.{cpp,h,hpp}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.{cpp,h,hpp}: Always format C++ code using `git clang-format` before committing, following .clang-format configuration (4-space indentation, 120 column limit, braces on same line for functions, pointer/reference alignment in middle)
Minimize external dependencies; avoid adding new external dependencies unless absolutely necessary
Files:
- tests/test-rope.cpp
- ggml/src/ggml-cpu/ops.cpp
- ggml/include/ggml.h
**/*.{cpp,h,hpp,py}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Ensure cross-platform compatibility by testing code changes on Linux, macOS, and Windows when possible
Files:
- tests/test-rope.cpp
- ggml/src/ggml-cpu/ops.cpp
- ggml/include/ggml.h
🧠 Learnings (2)
📓 Common learnings
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/lookahead/lookahead.cpp:97-98
Timestamp: 2025-03-13T22:56:47.417Z
Learning: ngxson prefers to prioritize simplification in some cases and defer adding error handling to a later time, as indicated by their response to a suggestion about adding error checks for llama_decode_ext calls.
📚 Learning: 2025-11-29T22:55:53.865Z
Learnt from: CR
Repo: ngxson/llama.cpp PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-29T22:55:53.865Z
Learning: Applies to src/llama.cpp : Maintain core library implementation in `src/llama.cpp` with careful attention to API contracts and backward compatibility
Applied to files:
ggml/src/ggml-cpu/ops.cpp
🧬 Code graph analysis (3)
tests/test-rope.cpp (3)

tests/test-backend-ops.cpp (46)
- buf (231-231), type (2306-2325), mode (5455-5459), v (2661-2665), v (5316-5320), ctx (1090-1090), ctx (1130-1134), ctx (1130-1130), ctx (1165-1172), ctx (1165-1165), ctx (1176-1180), ctx (1176-1176), ctx (1182-1186), ctx (1182-1182), ctx (1188-1192), ctx (1188-1188), ctx (1194-1198), ctx (1194-1194), ctx (1200-1204), ctx (1200-1200), ctx (1415-1415), a (1104-1106), a (1104-1104), a (5025-5084), a (5025-5025), ne (3039-3041), ne (3892-3894), ne (5410-5412), t (1082-1084), t (1082-1082), t (1136-1145), t (1136-1136), t (1147-1150), t (1147-1147), t (2339-2342), t (2339-2339), t (2397-2400), t (2397-2397), t (2577-2579), t (2577-2577), t (2612-2614), t (2612-2612), t (2701-2703), t (2701-2701), nmse (274-287), nmse (274-274)

ggml/src/ggml.c (6)
- ggml_rope_multi (4122-4141), ggml_cast (3434-3446), ggml_rope_comp (4271-4324), ggml_rope_comp_set_yarn (4342-4368), ggml_rope_comp_set_freq_factors (4326-4340), ggml_rope_ext (4175-4193)

tools/mtmd/clip.cpp (2)
- x (2247-2249), x (2247-2247)

ggml/include/ggml.h (1)

ggml/src/ggml.c (3)
- ggml_rope_comp (4271-4324), ggml_rope_comp_set_freq_factors (4326-4340), ggml_rope_comp_set_yarn (4342-4368)

ggml/src/ggml.c (2)

ggml/src/ggml-cpu/repack.cpp (8)
- params (1813-1826), params (1813-1813), params (1828-1875), params (1828-1833), params (1877-2008), params (1877-1877), params (2010-2142), params (2010-2010)

ggml/src/ggml-impl.h (1)
- ggml_set_op_params (145-149)
🪛 GitHub Actions: EditorConfig Checker
ggml/src/ggml-metal/ggml-metal.metal
[error] 4133-4133: Trailing whitespace detected by editorconfig-checker.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (20)
- GitHub Check: android-build
- GitHub Check: windows-latest-cmake-hip
- GitHub Check: windows-latest-cmake (llvm-arm64, arm64, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/ar...
- GitHub Check: windows-latest-cmake-sycl
- GitHub Check: macOS-latest-cmake-ios
- GitHub Check: ios-xcode-build
- GitHub Check: ubuntu-cpu-cmake (arm64, ubuntu-22.04-arm)
- GitHub Check: ubuntu-cpu-cmake (x64, ubuntu-22.04)
- GitHub Check: macOS-latest-cmake-arm64-webgpu
- GitHub Check: ubuntu-latest-cmake-sanitizer (UNDEFINED, Debug)
- GitHub Check: ubuntu-latest-cmake-sanitizer (THREAD, Debug)
- GitHub Check: ubuntu-latest-cmake-rpc
- GitHub Check: ubuntu-latest-cmake-sanitizer (ADDRESS, Debug)
- GitHub Check: ubuntu-latest-llguidance
- GitHub Check: ubuntu-24-cmake-vulkan-deb
- GitHub Check: macOS-latest-cmake-x64
- GitHub Check: ubuntu-24-cmake-vulkan
- GitHub Check: ubuntu-24-wasm-webgpu
- GitHub Check: macOS-latest-cmake-arm64
- GitHub Check: server-windows
🔇 Additional comments (6)
ggml/src/ggml-metal/ggml-metal.metal (1)
4414-4429: New `kernel_rope_comp` typedef and host bindings look consistent

The `kernel_rope_comp_t` typedef and the `kernel_rope_comp_f32` / `kernel_rope_comp_f16` host_name instantiations mirror the pattern used by the existing ROPE kernels and correctly expose both float and half variants. No further changes needed here.

ggml/include/ggml.h (4)
1863-1866: LGTM! Clean API design.

The new `enum ggml_rope_ordering` provides a cleaner, type-safe alternative to the existing `GGML_ROPE_TYPE_*` macros. The composable API appropriately separates ordering semantics from other RoPE configuration.

1877-1880: LGTM!

The function signature follows GGML conventions for modifying existing computation graph nodes. The composable pattern allows optional configuration after initial RoPE node creation.

1894-1898: LGTM!

The function signature is consistent with the existing M-RoPE API (`ggml_rope_multi`). Using the `GGML_MROPE_SECTIONS` constant ensures consistency across the codebase.

517-517: The backward compatibility concern raised in this review is not applicable to the `ggml_op` enum. Unlike `ggml_type` (which is serialized in GGUF model files and has a documented policy to "always add types at the end"), `ggml_op` values are not persisted in model files; they exist only for runtime tensor operation tracking. Inserting `GGML_OP_ROPE_COMP` between `GGML_OP_ROPE` and `GGML_OP_ROPE_BACK` does not break binary compatibility with existing model files. The enum entry has been properly added to both the `GGML_OP_NAME` and `GGML_OP_SYMBOL` lookup arrays.

Likely an incorrect or invalid review comment.
ggml/src/ggml.c (1)
4342-4368: The parameter order is correct. Slot 5 holds `yarn_high` and slot 6 holds `yarn_low`, properly aligned with compute backend struct layouts and the `rope_yarn_ramp(low, high, i0)` function signature. For typical YARN settings where `yarn_low ≤ yarn_high`, the ramp calculation works as intended.
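For reference, here is the kind of layout comment suggested earlier in this thread, reconstructed from how the CPU kernel decodes `dst->op_params` (see the `ggml_compute_forward_rope_comp_flt` excerpt at the end of this page). Treat it as documentation of the current layout, not a normative spec.

```c
// ggml_rope_comp op_params layout (as decoded by the CPU kernel):
//   op_params[0]       (i32) n_dims
//   op_params[1]       (i32) idx_pair
//   op_params[2]       (i32) idx_scale
//   op_params[3]       (i32) idx_offset
//   op_params[4]       (f32) theta_scale
//   op_params[5]       (f32) yarn_high
//   op_params[6]       (f32) yarn_low
//   op_params[7]       (f32) freq_scale
//   op_params[8]       (f32) attn_factor
//   op_params[9]       (f32) ramp_factor
//   op_params[10..13]  (i32) sections[4]   (M-RoPE, currently unused)
```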
Actionable comments posted: 1
♻️ Duplicate comments (2)
tests/test-rope.cpp (1)
384-384: Potential regression: `float v` should be `int v`.

A previous review identified that the loop variable `v` should be `int` (not `float`) since it represents a binary flag (0 or 1) passed to constructors expecting `int`. That comment was marked as addressed in commits 1affe19 to 509dfeb, but the current code still declares `float v`. Verify whether this is a regression or an annotation timing issue.

🔎 Proposed fix (from previous review)

```diff
-    for (float v : { 0, 1 }) {
+    for (int v : { 0, 1 }) {
```

ggml/src/ggml.c (1)
6933-6958: ROPE_COMP is marked in backward graph expansion but still has no backward implementation

`GGML_OP_ROPE_COMP` is now added to the list of ops whose `src[1]` gradients are ignored (same treatment as `GGML_OP_ROPE`), but `ggml_compute_backward` still lacks a `case GGML_OP_ROPE_COMP`. Any training graph containing this op will still fall into the default branch and abort with "unsupported ggml op for backward pass."

If you intend ROPE_COMP to be trainable, you'll need to add a dedicated backward case mirroring the `GGML_OP_ROPE` path (adapted to the compact representation). If it is inference-only, consider adding explicit documentation or an assertion to make that clear.
🧹 Nitpick comments (2)
ggml/src/ggml.c (2)
941-1159: `GGML_OP_SYMBOL` is missing an entry for `ROPE_COMP`, misaligning symbols for subsequent ops

You added `ROPE_COMP` to `GGML_OP_NAME` and bumped `GGML_OP_COUNT` to 96, but `GGML_OP_SYMBOL` still has only `"rope(x)"` followed directly by `"rope_back(x)"`. This means `ggml_op_symbol(GGML_OP_ROPE_COMP)` will return the wrong symbol, and all later ops' symbols are shifted.

Minimal fix: insert a `"rope_comp(x)"` symbol after `"rope(x)"` so indices stay aligned.

Proposed diff for `GGML_OP_SYMBOL` around RoPE entries:

```diff
-    "soft_max(x)",
-    "soft_max_back(x)",
-    "rope(x)",
-    "rope_back(x)",
-    "clamp(x)",
+    "soft_max(x)",
+    "soft_max_back(x)",
+    "rope(x)",
+    "rope_comp(x)",
+    "rope_back(x)",
+    "clamp(x)",
```

4269-4368: ROPE_COMP helpers look consistent; consider guarding NEOX `n_dims` evenness

The `ggml_rope_comp`, `_set_freq_factors`, and `_set_yarn` helpers are internally consistent with the existing RoPE/YARN utilities: the parameter layout is documented, the theta-scale formula matches standard RoPE, and the low/high YaRN dims now follow the same `beta_fast` / `beta_slow` ordering as `ggml_rope_yarn_corr_dims`.

One small robustness improvement would be to assert that `n_dims` is even when using `GGML_ROPE_ORDERING_NEOX`, since `idx_pair = n_dims / 2` and `idx_scale = 2` implicitly assume pairs of dimensions.

Optional assert for NEOX ordering:

```diff
     int32_t idx_pair  = 1;
     int32_t idx_scale = 1;
     if (ordering == GGML_ROPE_ORDERING_NEOX) {
+        GGML_ASSERT((n_dims % 2) == 0);
         idx_pair  = n_dims / 2;
         idx_scale = 2;
     }
```
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- ggml/src/ggml-cpu/ops.cpp (1 hunks)
- ggml/src/ggml.c (5 hunks)
- tests/test-rope.cpp (2 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.{cpp,h,hpp}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.{cpp,h,hpp}: Always format C++ code using `git clang-format` before committing, following .clang-format configuration (4-space indentation, 120 column limit, braces on same line for functions, pointer/reference alignment in middle)
Minimize external dependencies; avoid adding new external dependencies unless absolutely necessary
Files:
- tests/test-rope.cpp
- ggml/src/ggml-cpu/ops.cpp
**/*.{cpp,h,hpp,py}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Ensure cross-platform compatibility by testing code changes on Linux, macOS, and Windows when possible
Files:
- tests/test-rope.cpp
- ggml/src/ggml-cpu/ops.cpp
🧠 Learnings (2)
📓 Common learnings
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/lookahead/lookahead.cpp:97-98
Timestamp: 2025-03-13T22:56:47.417Z
Learning: ngxson prefers to prioritize simplification in some cases and defer adding error handling to a later time, as indicated by their response to a suggestion about adding error checks for llama_decode_ext calls.
📚 Learning: 2025-11-29T22:55:53.865Z
Learnt from: CR
Repo: ngxson/llama.cpp PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-29T22:55:53.865Z
Learning: Applies to src/llama.cpp : Maintain core library implementation in `src/llama.cpp` with careful attention to API contracts and backward compatibility
Applied to files:
ggml/src/ggml-cpu/ops.cpp
🧬 Code graph analysis (2)
tests/test-rope.cpp (1)
ggml/src/ggml.c (5)
- ggml_rope_multi (4122-4141)
- ggml_rope_comp (4271-4324)
- ggml_rope_comp_set_yarn (4342-4368)
- ggml_rope_comp_set_freq_factors (4326-4340)
- ggml_rope_ext (4175-4193)
ggml/src/ggml.c (2)
ggml/src/ggml-cpu/llamafile/sgemm.cpp (14)
- a (1151-1253), a (1151-1151), a (2188-2274), a (2188-2188), b (600-602), b (600-600), b (604-606), b (604-604), b (608-612), b (608-608), b (614-617), b (614-614), b (964-966), b (964-964)

ggml/src/ggml-impl.h (1)
- ggml_set_op_params (145-149)
🔇 Additional comments (6)
tests/test-rope.cpp (6)
131-160: LGTM! Well-structured test configuration.The
test_ropestruct effectively encapsulates all test parameters and provides clear diagnostic output throughvars(). Theuse_compflag design allows elegant comparison between standard and composable RoPE paths.
161-222: LGTM! Wrapper logic correctly routes between rope implementations.

The wrappers elegantly switch between standard and composable RoPE:

_ggml_rope_multi returns nullptr for rope_comp (not yet supported), handled gracefully by the test skip logic.
_ggml_rope_ext implements the full rope_comp path with proper type casting, ordering selection, YaRN configuration, and frequency factor attachment.
228-315: LGTM! Input and graph building logic is sound.

The methods correctly handle tensor creation and graph construction across all RoPE variants (NORMAL, NEOX, MROPE, VISION, IMROPE). The dimensional splits (n_dims/2 for vision, n_dims for regular MROPE) and section configurations are intentional and correct. Commented-out paths (inplace, backward) are acceptable for incremental development.
317-365: LGTM! Tensor initialization handles F32 and F16 correctly.

The F16 support addition properly converts float values using ggml_fp32_to_fp16. The initialize_tensors method appropriately distinguishes between position tensors (randomized within context), frequency factors (tight range around 1.0), and general parameters (uniform distribution).
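For readers unfamiliar with the F16 path, the conversion the comment refers to typically looks like the sketch below. This is a generic illustration (the helper name and buffer handling are assumptions, and it presumes the tensor data lives in host memory as in the CPU tests), not the actual initialize_tensors code:

```cpp
#include <cstring>
#include <vector>

#include "ggml.h"

// Hypothetical helper (set_f16_tensor is not part of test-rope.cpp): fill an
// F16 tensor from float values, converting each element with ggml_fp32_to_fp16.
static void set_f16_tensor(ggml_tensor * t, const std::vector<float> & values) {
    GGML_ASSERT(t->type == GGML_TYPE_F16);
    GGML_ASSERT((size_t) ggml_nelements(t) == values.size());

    std::vector<ggml_fp16_t> buf(values.size());
    for (size_t i = 0; i < values.size(); ++i) {
        buf[i] = ggml_fp32_to_fp16(values[i]);   // per-element fp32 -> fp16
    }
    std::memcpy(t->data, buf.data(), buf.size() * sizeof(ggml_fp16_t));
}
```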
368-479: LGTM! Comprehensive test suite with sound comparison logic.

The test driver generates extensive coverage across RoPE variants, data types, and parameter combinations. The NMSE-based comparison between the standard and composable RoPE paths is appropriate, with graceful handling of unimplemented paths (nullptr skip) and clear pass/fail logging. The final assertion ensures all tests must pass.
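For reference, NMSE here means normalized mean squared error between the two outputs. A standalone sketch of such a metric (not the exact helper used in test-rope.cpp) might look like this:

```cpp
#include <cstddef>

// Hypothetical NMSE helper (not the exact function used in test-rope.cpp):
// sum of squared differences normalized by the reference signal's energy.
static double nmse(const float * ref, const float * out, size_t n) {
    double err  = 0.0;
    double norm = 0.0;
    for (size_t i = 0; i < n; ++i) {
        const double d = (double) out[i] - (double) ref[i];
        err  += d * d;
        norm += (double) ref[i] * (double) ref[i];
    }
    return err / (norm + 1e-12);   // epsilon guards against an all-zero reference
}
```

A driver would then treat a pair of outputs as matching when the returned value stays below a small threshold, typically with a looser bound for F16 than for F32.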
616-616: LGTM! Clean integration of the new test suite.

The call to test_rope_comp() at the end of main ensures the comprehensive rope/rope_comp comparison tests run after the existing rope tests.
```cpp
template<typename T> //float or ggml_fp16_t
static void ggml_compute_forward_rope_comp_flt(
        const ggml_compute_params * params,
        ggml_tensor * dst,
        const bool forward) {

    const ggml_tensor * src0 = dst->src[0];
    const ggml_tensor * src1 = dst->src[1];
    const ggml_tensor * src2 = dst->src[2];

    GGML_ASSERT(src0->type == GGML_TYPE_F32 || src0->type == GGML_TYPE_F16);
    GGML_ASSERT(src1->type == GGML_TYPE_F32);

    int32_t n_dims, idx_pair, idx_scale, idx_offset;
    float theta_scale, yarn_high, yarn_low, freq_scale, ramp_factor, attn_factor;
    int32_t sections[4];

    memcpy(&n_dims,      (int32_t *)dst->op_params +  0, sizeof(int32_t));
    memcpy(&idx_pair,    (int32_t *)dst->op_params +  1, sizeof(int32_t));
    memcpy(&idx_scale,   (int32_t *)dst->op_params +  2, sizeof(int32_t));
    memcpy(&idx_offset,  (int32_t *)dst->op_params +  3, sizeof(int32_t));
    memcpy(&theta_scale, (int32_t *)dst->op_params +  4, sizeof(float));
    memcpy(&yarn_high,   (int32_t *)dst->op_params +  5, sizeof(float));
    memcpy(&yarn_low,    (int32_t *)dst->op_params +  6, sizeof(float));
    memcpy(&freq_scale,  (int32_t *)dst->op_params +  7, sizeof(float));
    memcpy(&attn_factor, (int32_t *)dst->op_params +  8, sizeof(float));
    memcpy(&ramp_factor, (int32_t *)dst->op_params +  9, sizeof(float));
    memcpy(&sections[0], (int32_t *)dst->op_params + 10, sizeof(int32_t));
    memcpy(&sections[1], (int32_t *)dst->op_params + 11, sizeof(int32_t));
    memcpy(&sections[2], (int32_t *)dst->op_params + 12, sizeof(int32_t));
    memcpy(&sections[3], (int32_t *)dst->op_params + 13, sizeof(int32_t));

    GGML_TENSOR_UNARY_OP_LOCALS

    GGML_ASSERT(nb0 == nb00);
    GGML_ASSERT(nb0 == sizeof(T));

    const int ith = params->ith;
    const int nth = params->nth;

    const int nr = ggml_nrows(dst);

    GGML_ASSERT(n_dims <= ne0);
    GGML_ASSERT(n_dims % 2 == 0);

    // rows per thread
    const int dr = (nr + nth - 1)/nth;

    // row range for this thread
    const int ir0 = dr*ith;
    const int ir1 = MIN(ir0 + dr, nr);

    // row index used to determine which thread to use
    int ir = 0;

    // TODO M-RoPE

    const float * freq_factors = NULL;
    if (src2 != NULL) {
        GGML_ASSERT(src2->type == GGML_TYPE_F32);
        GGML_ASSERT(src2->ne[0] >= n_dims / 2);
        freq_factors = (const float *) src2->data;
    }

    // backward process uses inverse rotation by cos and sin.
    // cos and sin build a rotation matrix, where the inverse is the transpose.
    // this essentially just switches the sign of sin.
    const float sin_sign = forward ? 1.0f : -1.0f;

    const float * pos = (const float *) src1->data;

    auto init_cache = [&](float * cache, float p) -> void {
        for (int64_t i0 = 0; i0 < ne0; i0 += 2) {
            const float freq_factor = freq_factors ? freq_factors[i0/2] : 1.0f;

            float theta = p * powf(theta_scale, i0/2) / freq_factor;
            const float theta_extrap = theta;
            const float theta_interp = freq_scale * theta;
            if (ramp_factor != 0.0f) {
                const float ramp_mix = rope_yarn_ramp(yarn_low, yarn_high, i0) * ramp_factor;
                theta = theta_interp * (1 - ramp_mix) + theta_extrap * ramp_mix;
            } else {
                theta = theta_interp;
            }
            cache[i0 + 0] = cosf(theta) * attn_factor;
            cache[i0 + 1] = sinf(theta) * attn_factor * sin_sign;
        }
```
Limit init_cache loop to n_dims to avoid potential freq_factors OOB when ne0 > n_dims.
init_cache currently iterates i0 up to ne0, but freq_factors is only asserted to have src2->ne[0] >= n_dims / 2. If ne0 > n_dims and src2->ne[0] == n_dims / 2, the access freq_factors[i0/2] reads past the end of freq_factors for i0 >= n_dims. Those extra cache entries are not used (since rotate_pairs is called with n_dims), so it is both safer and slightly cheaper to bound the loop by n_dims instead of ne0.
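To make the bound concrete, here is a small numeric illustration with assumed sizes, chosen only to show where the index leaves the valid range:

```cpp
#include <cstdint>
#include <cstdio>

int main() {
    // Hypothetical sizes: head size 128, of which only 64 dims are rotated,
    // so src2 may legitimately carry just 64/2 = 32 frequency factors.
    const int64_t ne0    = 128;
    const int64_t n_dims = 64;
    const int64_t n_ff   = n_dims / 2;   // valid indices: 0 .. 31

    for (int64_t i0 = 0; i0 < ne0; i0 += 2) {
        const int64_t idx = i0 / 2;      // reaches 63 at i0 == 126
        if (idx >= n_ff) {
            std::printf("i0=%3lld -> freq_factors[%lld] would be out of bounds\n",
                        (long long) i0, (long long) idx);
        }
    }
    return 0;
}
```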
Consider tightening the loop as follows:
Proposed fix: use n_dims in init_cache:

```diff
-    auto init_cache = [&](float * cache, float p) -> void {
-        for (int64_t i0 = 0; i0 < ne0; i0 += 2) {
+    auto init_cache = [&](float * cache, float p) -> void {
+        for (int64_t i0 = 0; i0 < n_dims; i0 += 2) {
             const float freq_factor = freq_factors ? freq_factors[i0/2] : 1.0f;

             float theta = p * powf(theta_scale, i0/2) / freq_factor;
             const float theta_extrap = theta;
             const float theta_interp = freq_scale * theta;
             if (ramp_factor != 0.0f) {
                 const float ramp_mix = rope_yarn_ramp(yarn_low, yarn_high, i0) * ramp_factor;
                 theta = theta_interp * (1 - ramp_mix) + theta_extrap * ramp_mix;
             } else {
                 theta = theta_interp;
             }
             cache[i0 + 0] = cosf(theta) * attn_factor;
             cache[i0 + 1] = sinf(theta) * attn_factor * sin_sign;
         }
     };
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```cpp
template<typename T> //float or ggml_fp16_t
static void ggml_compute_forward_rope_comp_flt(
        const ggml_compute_params * params,
        ggml_tensor * dst,
        const bool forward) {

    const ggml_tensor * src0 = dst->src[0];
    const ggml_tensor * src1 = dst->src[1];
    const ggml_tensor * src2 = dst->src[2];

    GGML_ASSERT(src0->type == GGML_TYPE_F32 || src0->type == GGML_TYPE_F16);
    GGML_ASSERT(src1->type == GGML_TYPE_F32);

    int32_t n_dims, idx_pair, idx_scale, idx_offset;
    float theta_scale, yarn_high, yarn_low, freq_scale, ramp_factor, attn_factor;
    int32_t sections[4];

    memcpy(&n_dims,      (int32_t *)dst->op_params +  0, sizeof(int32_t));
    memcpy(&idx_pair,    (int32_t *)dst->op_params +  1, sizeof(int32_t));
    memcpy(&idx_scale,   (int32_t *)dst->op_params +  2, sizeof(int32_t));
    memcpy(&idx_offset,  (int32_t *)dst->op_params +  3, sizeof(int32_t));
    memcpy(&theta_scale, (int32_t *)dst->op_params +  4, sizeof(float));
    memcpy(&yarn_high,   (int32_t *)dst->op_params +  5, sizeof(float));
    memcpy(&yarn_low,    (int32_t *)dst->op_params +  6, sizeof(float));
    memcpy(&freq_scale,  (int32_t *)dst->op_params +  7, sizeof(float));
    memcpy(&attn_factor, (int32_t *)dst->op_params +  8, sizeof(float));
    memcpy(&ramp_factor, (int32_t *)dst->op_params +  9, sizeof(float));
    memcpy(&sections[0], (int32_t *)dst->op_params + 10, sizeof(int32_t));
    memcpy(&sections[1], (int32_t *)dst->op_params + 11, sizeof(int32_t));
    memcpy(&sections[2], (int32_t *)dst->op_params + 12, sizeof(int32_t));
    memcpy(&sections[3], (int32_t *)dst->op_params + 13, sizeof(int32_t));

    GGML_TENSOR_UNARY_OP_LOCALS

    GGML_ASSERT(nb0 == nb00);
    GGML_ASSERT(nb0 == sizeof(T));

    const int ith = params->ith;
    const int nth = params->nth;

    const int nr = ggml_nrows(dst);

    GGML_ASSERT(n_dims <= ne0);
    GGML_ASSERT(n_dims % 2 == 0);

    // rows per thread
    const int dr = (nr + nth - 1)/nth;

    // row range for this thread
    const int ir0 = dr*ith;
    const int ir1 = MIN(ir0 + dr, nr);

    // row index used to determine which thread to use
    int ir = 0;

    // TODO M-RoPE

    const float * freq_factors = NULL;
    if (src2 != NULL) {
        GGML_ASSERT(src2->type == GGML_TYPE_F32);
        GGML_ASSERT(src2->ne[0] >= n_dims / 2);
        freq_factors = (const float *) src2->data;
    }

    // backward process uses inverse rotation by cos and sin.
    // cos and sin build a rotation matrix, where the inverse is the transpose.
    // this essentially just switches the sign of sin.
    const float sin_sign = forward ? 1.0f : -1.0f;

    const float * pos = (const float *) src1->data;

    auto init_cache = [&](float * cache, float p) -> void {
        for (int64_t i0 = 0; i0 < n_dims; i0 += 2) {
            const float freq_factor = freq_factors ? freq_factors[i0/2] : 1.0f;

            float theta = p * powf(theta_scale, i0/2) / freq_factor;
            const float theta_extrap = theta;
            const float theta_interp = freq_scale * theta;
            if (ramp_factor != 0.0f) {
                const float ramp_mix = rope_yarn_ramp(yarn_low, yarn_high, i0) * ramp_factor;
                theta = theta_interp * (1 - ramp_mix) + theta_extrap * ramp_mix;
            } else {
                theta = theta_interp;
            }
            cache[i0 + 0] = cosf(theta) * attn_factor;
            cache[i0 + 1] = sinf(theta) * attn_factor * sin_sign;
        }
```
🤖 Prompt for AI Agents
In ggml/src/ggml-cpu/ops.cpp around lines 5822 to 5907, the init_cache lambda
iterates i0 up to ne0 which can index freq_factors out of bounds when ne0 >
n_dims and src2 only supplies n_dims/2 entries; change the loop bound to use
n_dims (i0 < n_dims) so the cache is only initialized for the used dimension
range and avoid reading past freq_factors, while leaving the rest of the logic
unchanged.
@coderabbitai pause |
Actionable comments posted: 0
♻️ Duplicate comments (2)
ggml/src/ggml-metal/ggml-metal.metal (1)
4119-4147: Fix freq_factor indexing to respect idx_scale (align with other RoPE kernels and the CPU path)

In kernel_rope_comp, freq_factor is currently indexed with i0/2:

```metal
const float freq_factor = args.src2 ? ((device const float *) src2)[i0/2] : 1.0f;
...
const int ic = i0 / args.idx_scale;
```

But src/dst access uses ic = i0 / args.idx_scale. When args.idx_scale != 2, i0/2 and ic diverge, so the frequency factors are taken from the wrong positions relative to the channels being rotated. All other RoPE kernels index src2 with the same index used for data access (ic), so this inconsistency will produce incorrect rotations in the Metal ROPE_COMP path.

Compute ic first and use it for freq_factor to keep the RoPE/YaRN math coherent with the CPU implementation and the existing Metal kernels:

Proposed fix:
```diff
-    for (int i0 = 2*tiitg; i0 < args.ne0; i0 += 2*tptg.x) {
-        if (i0 < args.n_dims) {
-            const float freq_factor = args.src2 ? ((device const float *) src2)[i0/2] : 1.0f;
+    for (int i0 = 2*tiitg; i0 < args.ne0; i0 += 2*tptg.x) {
+        if (i0 < args.n_dims) {
+            const int   ic = i0 / args.idx_scale;
+            const float freq_factor = args.src2 ? ((device const float *) src2)[ic] : 1.0f;
@@
-            const float cos_theta = cos(theta) * args.attn_factor;
-            const float sin_theta = sin(theta) * args.attn_factor;
-
-            const int ic = i0 / args.idx_scale;
+            const float cos_theta = cos(theta) * args.attn_factor;
+            const float sin_theta = sin(theta) * args.attn_factor;
```

This change makes freq_factor selection match the layout implied by idx_scale and the way src/dst are addressed.

ggml/src/ggml-cpu/ops.cpp (1)
5880-5904: Potential out-of-bounds read on freq_factors when ne0 > n_dims

init_cache iterates i0 up to ne0, but freq_factors is only asserted to have src2->ne[0] >= n_dims / 2. If ne0 > n_dims and src2->ne[0] == n_dims / 2, the access freq_factors[i0/2] for i0 >= n_dims can read past the end of freq_factors. Those extra cache entries are not used (since rotate_pairs is bounded by n_dims), so this is pure risk without benefit.

Tighten the loop to n_dims to avoid undefined behavior:

Proposed fix:

```diff
-    auto init_cache = [&](float * cache, float p) -> void {
-        for (int64_t i0 = 0; i0 < ne0; i0 += 2) {
+    auto init_cache = [&](float * cache, float p) -> void {
+        for (int64_t i0 = 0; i0 < n_dims; i0 += 2) {
             const float freq_factor = freq_factors ? freq_factors[i0/2] : 1.0f;
@@
             cache[i0 + 0] = cosf(theta) * attn_factor;
             cache[i0 + 1] = sinf(theta) * attn_factor * sin_sign;
         }
     };
```
🧹 Nitpick comments (2)
ggml/src/ggml-cpu/ops.cpp (2)
5893-5903: Optional: avoid powf per dimension in init_cache

theta is recomputed as p * powf(theta_scale, i0/2) for every dimension. The existing ggml_rope_cache_init path maintains theta multiplicatively (theta *= theta_scale), which is cheaper and avoids repeated powf calls.

Not a correctness issue, but you could mirror that pattern here for consistency and perf:

Possible refactor sketch:

```diff
-    auto init_cache = [&](float * cache, float p) -> void {
-        for (int64_t i0 = 0; i0 < n_dims; i0 += 2) {
-            const float freq_factor = freq_factors ? freq_factors[i0/2] : 1.0f;
-
-            float theta = p * powf(theta_scale, i0/2) / freq_factor;
+    auto init_cache = [&](float * cache, float p) -> void {
+        float theta = p;
+        for (int64_t i0 = 0; i0 < n_dims; i0 += 2) {
+            const float freq_factor = freq_factors ? freq_factors[i0/2] : 1.0f;
+
+            float theta_extrap = theta / freq_factor;
-            const float theta_extrap = theta;
-            const float theta_interp = freq_scale * theta;
+            const float theta_interp = freq_scale * theta_extrap;
@@
-            cache[i0 + 0] = cosf(theta) * attn_factor;
-            cache[i0 + 1] = sinf(theta) * attn_factor * sin_sign;
+            cache[i0 + 0] = cosf(theta) * attn_factor;
+            cache[i0 + 1] = sinf(theta) * attn_factor * sin_sign;
+
+            theta *= theta_scale;
         }
     };
```

(Adjust naming to taste; the core idea is to reuse theta as in ggml_rope_cache_init.)
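To sanity-check the equivalence this refactor relies on, a standalone snippet (independent of ggml, with arbitrary values) can compare the two ways of generating the angle sequence:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>

int main() {
    // Arbitrary illustrative values: theta_scale plays the role of the
    // per-pair decay factor, p the position.
    const float theta_scale = 0.85f;
    const float p           = 7.0f;
    const int   n_dims      = 16;

    float theta_mul = p;      // multiplicative accumulation, as in ggml_rope_cache_init
    float max_diff  = 0.0f;

    for (int i0 = 0; i0 < n_dims; i0 += 2) {
        // per-dimension power, as in the current init_cache
        const float theta_pow = p * std::pow(theta_scale, (float) (i0/2));
        max_diff = std::max(max_diff, std::fabs(theta_pow - theta_mul));
        theta_mul *= theta_scale;   // advance to the next pair's angle
    }

    // the two sequences agree up to float rounding
    std::printf("max |pow - multiplicative| = %g\n", (double) max_diff);
    return 0;
}
```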
5877-5879: M‑RoPE / sections parameters are currently unused
sections[] is read from op_params and there are // TODO M-RoPE comments, but the current implementation of ggml_compute_forward_rope_comp_flt never uses sections or applies multi-section RoPE logic. If higher-level code passes nontrivial section configs or expects M‑RoPE behavior, those settings will be silently ignored.

If this is intentional for the demo, consider documenting that ROPE_COMP currently handles only the simple (non‑M‑RoPE) layouts, or wire sections through to a multi-section init_cache to match the standard ggml_mrope_cache_init behavior.

Also applies to: 5921-5937
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
ggml/src/ggml-cpu/ops.cpp (1 hunks)
ggml/src/ggml-metal/ggml-metal.metal (3 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.{cpp,h,hpp}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.{cpp,h,hpp}: Always format C++ code using git clang-format before committing, following .clang-format configuration (4-space indentation, 120 column limit, braces on same line for functions, pointer/reference alignment in middle)
Minimize external dependencies; avoid adding new external dependencies unless absolutely necessary
Files:
ggml/src/ggml-cpu/ops.cpp
**/*.{cpp,h,hpp,py}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Ensure cross-platform compatibility by testing code changes on Linux, macOS, and Windows when possible
Files:
ggml/src/ggml-cpu/ops.cpp
🧠 Learnings (2)
📓 Common learnings
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/lookahead/lookahead.cpp:97-98
Timestamp: 2025-03-13T22:56:47.417Z
Learning: ngxson prefers to prioritize simplification in some cases and defer adding error handling to a later time, as indicated by their response to a suggestion about adding error checks for llama_decode_ext calls.
📚 Learning: 2025-11-29T22:55:53.865Z
Learnt from: CR
Repo: ngxson/llama.cpp PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-29T22:55:53.865Z
Learning: Applies to src/llama.cpp : Maintain core library implementation in `src/llama.cpp` with careful attention to API contracts and backward compatibility
Applied to files:
ggml/src/ggml-cpu/ops.cpp
🧬 Code graph analysis (1)
ggml/src/ggml-cpu/ops.cpp (2)
ggml/src/ggml-cuda/rope.cu (1)
rope_yarn_ramp (15-15)
ggml/src/ggml-sycl/rope.cpp (2)
rope_yarn_ramp (13-16), rope_yarn_ramp (13-13)
🔇 Additional comments (1)
ggml/src/ggml-metal/ggml-metal.metal (1)
4417-4433: ROPE_COMP kernel typedef and host bindings look consistent

The kernel_rope_comp_t typedef and the kernel_rope_comp_f32/kernel_rope_comp_f16 host_name instantiations follow the same pattern as the existing RoPE kernels (norm, neox, multi, vision) and correctly expose both float and half variants. No issues here.
✅ Actions performed

Reviews paused.
Summary by CodeRabbit
Release Notes
New Features
Tests