Skip to content

Conversation

@ngxson
Copy link
Owner

@ngxson ngxson commented Dec 13, 2025

Summary by CodeRabbit

  • New Features

    • Exposed model audio hyperparameters via a public accessor.
    • Added configurable audio preprocessing parameters (chunk length, sample rate, FFT/window/hop).
    • Runtime-generated mel filterbank with caching and chunked mel output.
  • Refactor

    • Replaced legacy preprocessing with a modular, runtime-built audio pipeline (shared FFT cache, configurable padding/normalization).
    • Introduced a per-context, class-based audio preprocessor and init flow; audio bitrate now read from model parameters.

✏️ Tip: You can customize this high-level summary in your review settings.

ngxson and others added 4 commits December 12, 2025 23:55
Co-authored-by: Tarek <tdakhran@users.noreply.github.com>
@coderabbitai
Copy link

coderabbitai bot commented Dec 13, 2025

Walkthrough

Added audio-to-mel hyperparameters to CLIP hparams and exposed an accessor; replaced the legacy whisper preprocessor with a new runtime-configurable mtmd_audio pipeline (global cache, runtime mel filterbank, parameterized FFT/DFT) and integrated a per-context polymorphic audio preprocessor.

Changes

Cohort / File(s) Summary
CLIP model header expansion
\tools/mtmd/clip-model.h``
Added five int32_t audio fields to clip_hparams (audio_chunk_len, audio_sample_rate, audio_n_fft, audio_window_len, audio_hop_len) and declared const clip_hparams * clip_get_hparams(const struct clip_ctx * ctx);.
CLIP model initialization
\tools/mtmd/clip.cpp``
Implemented clip_get_hparams() and initialized/logged new audio preprocessing fields in the Ultravox/ViP projector initialization path; removed strict mel-bin validation for Ultravox projector branch.
Audio preprocessing API (header)
\tools/mtmd/mtmd-audio.h``
Replaced whisper-based public types with mtmd_audio_* types; introduced base class mtmd_audio_preprocessor (holds const clip_hparams&) and derived mtmd_audio_preprocessor_whisper; added initialize() and virtual preprocess(...) methods; renamed whisper_melmtmd_audio_mel; added #include "clip-model.h".
Audio preprocessing implementation
\tools/mtmd/mtmd-audio.cpp``
Replaced legacy whisper flow with cache-driven pipeline: added global cache (sine/cos tables, Hann window, runtime mel filterbank), filter_params, runtime mel filterbank builder, FFT/DFT using cache, new initialize() and preprocess(const float*, size_t, std::vector<mtmd_audio_mel>&), parameterized padding/pre-emphasis/normalization, chunking, and updated public types.
Context & integration updates
\tools/mtmd/mtmd.cpp``
Replaced per-context precomputed filters with std::unique_ptr<mtmd_audio_preprocessor> audio_preproc;, added selection/initialization logic for the preprocessor, switched preprocessing call sites to audio_preproc->preprocess(...), and made audio bitrate read from clip_hparams.

Sequence Diagram(s)

sequenceDiagram
    autonumber
    participant App as mtmd caller
    participant Ctx as mtmd_context
    participant CLIP as clip_ctx / clip_hparams
    participant Pre as mtmd_audio_preprocessor
    participant Cache as mtmd_audio_global_cache

    App->>Ctx: request audio init / tokenize
    Ctx->>CLIP: read projector type & hparams (clip_get_hparams)
    Ctx->>Pre: construct appropriate mtmd_audio_preprocessor(ctx)
    Pre->>Cache: initialize() — populate sin/cos, Hann window, mel filterbank
    App->>Pre: preprocess(samples, n_samples)
    Pre->>Cache: use cached tables for FFT/DFT and mel filtering
    Pre-->>Ctx: return vector<mtmd_audio_mel> (chunks)
    Ctx-->>App: continue tokenization with mel chunks
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

  • Areas needing extra attention:
    • tools/mtmd/mtmd-audio.cpp: FFT/DFT math, mel filterbank generation, numeric stability, cache initialization and lifecycle.
    • tools/mtmd/mtmd-audio.h: public API/ABI changes, virtual interface correctness, header inclusion of clip-model.h.
    • tools/mtmd/mtmd.cpp and tools/mtmd/clip.cpp: initialization ordering, projector-type selection, propagation and use of new hparams (sample rate, chunking).
    • Thread-safety of the global cache and the non-thread-safe initialize() contract.

Poem

🐇
I hopped through tables of sine and cos,
Built mel banks where the old code froze.
Classes now hum, caches keep time,
Samples turn mel in rhythm and rhyme.
— a rabbit cheering the new pipeline 🎶

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
Check name Status Explanation Resolution
Description check ⚠️ Warning No pull request description was provided; the template requires reviewing contributing guidelines and providing context, but the author submitted with an empty description. Add a description explaining the refactor's purpose, the architectural changes (e.g., introduction of polymorphic audio preprocessor), and acknowledgment of the reported security issues that require resolution.
Docstring Coverage ⚠️ Warning Docstring coverage is 27.78% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (1 passed)
Check name Status Explanation
Title check ✅ Passed The title 'Xsn/mtmd refactor audio preproc' clearly indicates a refactor of audio preprocessing in the MTMD component, which aligns with the main changes across four files.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch xsn/mtmd_refactor_audio_preproc

Comment @coderabbitai help to get the list of available commands and usage tips.

@ngxson
Copy link
Owner Author

ngxson commented Dec 13, 2025

@coderabbitai anything obviously wrong here?

@coderabbitai
Copy link

coderabbitai bot commented Dec 13, 2025

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Fetch PR details
gh pr view 57 --json title,body,files,additions,deletions

echo "=== DIFF ==="
# Get the full diff for the PR
gh pr diff 57

Length of output: 217


🏁 Script executed:

#!/bin/bash
# Fetch PR details with explicit repo specification
gh pr view 57 --repo ngxson/llama.cpp --json title,body,files,additions,deletions

echo "=== DIFF ==="
# Get the full diff for the PR
gh pr diff 57 --repo ngxson/llama.cpp

Length of output: 44306


@ngxson, I found a critical bug in the mel spectrogram calculation:

Line 288 in mtmd-audio.cpp has a logic error:

sum = params.use_natural_log ? log(sum + 5.960464477539063e-08) : log10(1e-10);

When use_natural_log is false, this returns a constant log10(1e-10) instead of calculating log10(max(sum, 1e-10)) with the actual sum value. This means the mel spectrogram values will all be the same constant when use_natural_log is false.

It should probably be:

sum = params.use_natural_log ? log(sum + 5.960464477539063e-08) : log10(std::max(sum, 1e-10));

Line 293 also has a potential inconsistency - it uses log(1e-10) but should probably use log10(1e-10) to match the original code, or respect the use_natural_log parameter.

Additional observations:

  • The check for n_mel_bins != 128 was removed in clip.cpp (line 1144) - is this intentional?
  • params.need_chunking is set but never used

🧠 Learnings used
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/lookahead/lookahead.cpp:97-98
Timestamp: 2025-03-13T22:56:47.417Z
Learning: ngxson prefers to prioritize simplification in some cases and defer adding error handling to a later time, as indicated by their response to a suggestion about adding error checks for llama_decode_ext calls.

@ngxson
Copy link
Owner Author

ngxson commented Dec 13, 2025

@coderabbitai any potential security risks? (buffer overflow or something) - do a review

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4d5ae24 and 7b578b5.

📒 Files selected for processing (5)
  • tools/mtmd/clip-model.h (2 hunks)
  • tools/mtmd/clip.cpp (3 hunks)
  • tools/mtmd/mtmd-audio.cpp (7 hunks)
  • tools/mtmd/mtmd-audio.h (1 hunks)
  • tools/mtmd/mtmd.cpp (2 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.{cpp,h,hpp}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.{cpp,h,hpp}: Always format C++ code using git clang-format before committing, following .clang-format configuration (4-space indentation, 120 column limit, braces on same line for functions, pointer/reference alignment in middle)
Minimize external dependencies; avoid adding new external dependencies unless absolutely necessary

Files:

  • tools/mtmd/clip.cpp
  • tools/mtmd/clip-model.h
  • tools/mtmd/mtmd.cpp
  • tools/mtmd/mtmd-audio.h
  • tools/mtmd/mtmd-audio.cpp
**/*.{cpp,h,hpp,py}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Ensure cross-platform compatibility by testing code changes on Linux, macOS, and Windows when possible

Files:

  • tools/mtmd/clip.cpp
  • tools/mtmd/clip-model.h
  • tools/mtmd/mtmd.cpp
  • tools/mtmd/mtmd-audio.h
  • tools/mtmd/mtmd-audio.cpp
🧠 Learnings (2)
📓 Common learnings
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/lookahead/lookahead.cpp:97-98
Timestamp: 2025-03-13T22:56:47.417Z
Learning: ngxson prefers to prioritize simplification in some cases and defer adding error handling to a later time, as indicated by their response to a suggestion about adding error checks for llama_decode_ext calls.
📚 Learning: 2025-05-26T09:45:20.653Z
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 25
File: tools/mtmd/mtmd.cpp:275-293
Timestamp: 2025-05-26T09:45:20.653Z
Learning: In tools/mtmd/clip.cpp, PROJECTOR_TYPE_QWEN25O is a placeholder that gets replaced by either PROJECTOR_TYPE_QWEN25VL (for vision) or PROJECTOR_TYPE_QWEN2A (for audio) before the respective init_vision() or init_audio() functions are called, ensuring proper token handling.

Applied to files:

  • tools/mtmd/clip.cpp
  • tools/mtmd/clip-model.h
  • tools/mtmd/mtmd.cpp
🧬 Code graph analysis (2)
tools/mtmd/mtmd-audio.h (2)
tools/mtmd/clip.cpp (4)
  • ctx (2245-2347)
  • ctx (2245-2245)
  • clip_get_hparams (3429-3431)
  • clip_get_hparams (3429-3429)
tools/mtmd/mtmd-audio.cpp (2)
  • preprocess (438-518)
  • preprocess (438-441)
tools/mtmd/mtmd-audio.cpp (1)
tools/mtmd/mtmd.cpp (2)
  • GGML_ASSERT (216-311)
  • GGML_ASSERT (313-330)
🪛 GitHub Actions: EditorConfig Checker
tools/mtmd/mtmd-audio.h

[error] 24-24: Trailing whitespace detected by editorconfig-checker.

tools/mtmd/mtmd-audio.cpp

[error] 32-32: Trailing whitespace detected by editorconfig-checker.

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (20)
  • GitHub Check: build-cmake-pkg / linux
  • GitHub Check: build-linux-cross / ubuntu-24-riscv64-cpu-spacemit-ime-cross
  • GitHub Check: build-linux-cross / debian-13-loongarch64-cpu-cross
  • GitHub Check: build-linux-cross / debian-13-loongarch64-vulkan-cross
  • GitHub Check: ubuntu-24-cmake-webgpu
  • GitHub Check: windows-latest-cmake-sycl
  • GitHub Check: ubuntu-24-cmake-vulkan-deb
  • GitHub Check: ubuntu-24-wasm-webgpu
  • GitHub Check: windows-latest-cmake-hip
  • GitHub Check: ios-xcode-build
  • GitHub Check: ubuntu-24-cmake-vulkan
  • GitHub Check: ubuntu-cpu-cmake (ppc64le, ubuntu-24.04-ppc64le)
  • GitHub Check: macOS-latest-cmake-arm64-webgpu
  • GitHub Check: ubuntu-latest-cmake-sanitizer (ADDRESS, Debug)
  • GitHub Check: ubuntu-latest-llguidance
  • GitHub Check: ubuntu-latest-cmake-sanitizer (UNDEFINED, Debug)
  • GitHub Check: ubuntu-latest-cmake-rpc
  • GitHub Check: macOS-latest-cmake-arm64
  • GitHub Check: macOS-latest-cmake-x64
  • GitHub Check: server-windows
🔇 Additional comments (11)
tools/mtmd/mtmd.cpp (2)

648-656: LGTM! Clean integration of the new preprocessor API.

The local instantiation of mtmd_audio_whisper_preprocessor per preprocessing call is appropriate, and the new API is used correctly.


854-859: LGTM! Dynamic sample rate retrieval.

Using clip_get_hparams(ctx->ctx_a)->audio_sample_rate instead of a hardcoded value is cleaner and more maintainable.

tools/mtmd/clip-model.h (2)

68-73: LGTM! Well-structured audio preprocessing parameters.

The new fields are appropriately grouped with a clear comment, and default values of -1 properly indicate "unset" state.


287-288: LGTM! Safe accessor declaration.

Returning const clip_hparams * ensures read-only access to internal state.

tools/mtmd/clip.cpp (3)

1146-1152: Hardcoded audio preprocessing parameters for whisper-based projectors.

These values (30s chunks, 16kHz sample rate, 400-sample FFT/window, 160-sample hop) match standard whisper preprocessing. Consider adding a comment noting these are whisper-compatible defaults.

Please verify these parameters are correct for all three projector types (ULTRAVOX, QWEN2A, VOXTRAL). If any projector requires different values in the future, this code path may need adjustment.


1190-1194: LGTM! Helpful debug logging for audio parameters.

The additional logging aids debugging and aligns with the existing logging pattern for other hparams.


3428-3431: LGTM! Simple accessor implementation.

The implementation is straightforward and correctly returns a const pointer to internal hparams.

tools/mtmd/mtmd-audio.cpp (4)

470-479: Cache initialization assumes uniform hparams across all calls.

The global cache is initialized once with the first caller's hparams values. If preprocess() is later called with different hparams (different audio_n_fft, audio_window_len, etc.), the cached values will be incorrect for that call.

Consider either:

  1. Validating that subsequent calls use matching hparams, or
  2. Making the cache per-instance rather than global if varying hparams are expected

495-501: Assertion may fail for very short audio.

The assertion GGML_ASSERT((size_t)out_full.n_len > frames_per_chunk) will abort if the mel spectrogram output is shorter than 3000 frames. While there's padding for short inputs (lines 451-456), this may still fail for edge cases.

Consider whether a graceful error return would be more appropriate than an assertion, especially since this processes user-provided audio that could have unexpected lengths.


36-44: LGTM! Precomputed sin/cos table for FFT optimization.

Good optimization to avoid repeated trigonometric calculations during FFT.


57-138: LGTM! Runtime mel filterbank generation.

The Slaney-scale mel filterbank implementation matches librosa defaults. The triangular filter construction is correct.

@coderabbitai
Copy link

coderabbitai bot commented Dec 13, 2025

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Get the full diff for security review
gh pr diff 57 --repo ngxson/llama.cpp > /tmp/pr_diff.txt

# Show the diff
cat /tmp/pr_diff.txt

echo -e "\n\n=== FILES CHANGED ==="
gh pr view 57 --repo ngxson/llama.cpp --json files --jq '.files[].path'

Length of output: 44018


@ngxson, I found several security vulnerabilities in addition to the logic bug I mentioned earlier:

🚨 Critical Issues

1. Buffer Over-read in Reflective Padding (Line 334)

std::reverse_copy(samples + 1, samples + 1 + stage_2_pad, samples_padded.begin());

Risk: If n_samples < stage_2_pad + 1 (where stage_2_pad = frame_size / 2 = 200), this reads beyond the input buffer bounds.

Attack scenario: Attacker provides audio shorter than 201 samples → buffer over-read → potential information disclosure or crash.

Fix: Add validation before this operation:

if (n_samples < stage_2_pad + 1) {
    // Handle short audio differently or return error
    return false;
}

2. Integer Overflow in Buffer Allocation (Line 360)

out.data.resize(out.n_mel * out.n_len);

Risk: n_mel * n_len can overflow if parameters are large, causing undersized buffer allocation → heap buffer overflow in subsequent writes.

Fix: Add overflow check:

if (out.n_mel > 0 && out.n_len > SIZE_MAX / out.n_mel) {
    return false; // overflow
}
out.data.resize(size_t(out.n_mel) * size_t(out.n_len));

3. Signed Integer Underflow (Line 359)

out.n_len = (n_samples - frame_size) / frame_step + 1;

Risk: If n_samples < frame_size, this creates a negative value that gets implicitly converted to a very large unsigned number in the resize operation.

Fix: Check bounds first:

if (n_samples < frame_size) {
    return false;
}
out.n_len = (n_samples - frame_size) / frame_step + 1;

⚠️ High-Priority Issues

4. Division by Zero (Line 241 & 359)

for (; i < std::min(n_samples / frame_step + 1, out.n_len); i += n_threads) {
out.n_len = (n_samples - frame_size) / frame_step + 1;

Risk: If frame_step (hop_length) is 0, division by zero → crash.

Fix: Validate in the beginning:

if (params.hop_length <= 0 || frame_size <= 0) {
    return false;
}

5. Potential Integer Overflow in Mel Filter Indexing (Line 273-279)

fft_out[k + 0] * filters.data[j * n_fft_bins + k + 0]

Risk: j * n_fft_bins + k can overflow with large n_fft_bins values.

Fix: Use size_t for index calculations:

size_t idx = size_t(j) * size_t(n_fft_bins) + size_t(k);

6. Multiple Integer Overflow Risks in Initialization

Lines with similar issues:

  • Line 448: hparams.audio_n_fft * 2
  • Line 118: n_mel * n_fft_bins
  • Line 330: n_samples + stage_1_pad + stage_2_pad * 2

All need overflow validation before allocation.

7. Out-of-Bounds Access in Hann Window Padding (Line 353)

std::copy(hann, hann + params.hann_window_size, &hann_window_padded[padding]);

Risk: If padding + params.hann_window_size > frame_size, writes beyond buffer.

Fix:

if (padding + params.hann_window_size > frame_size) {
    return false;
}

8. Missing Input Validation

No validation on parameters like:

  • n_mel, n_fft, sample_rate (could be negative or zero)
  • n_threads (could be 0)
  • hann_window_size vs frame_size relationship

Recommendations

  1. Add comprehensive input validation at the entry point of preprocess() and fill_mel_filterbank_matrix()
  2. Use size_t for all size/index calculations to prevent signed/unsigned issues
  3. Add overflow checks before all multiplication operations used for sizing
  4. Validate audio length meets minimum requirements before processing
  5. Consider using safer APIs like std::vector::at() during development for bounds checking

Would you like me to generate a comprehensive patch with all the security fixes?


🧠 Learnings used
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/lookahead/lookahead.cpp:97-98
Timestamp: 2025-03-13T22:56:47.417Z
Learning: ngxson prefers to prioritize simplification in some cases and defer adding error handling to a later time, as indicated by their response to a suggestion about adding error checks for llama_decode_ext calls.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
tools/mtmd/mtmd-audio.cpp (1)

23-137: Global cache is not thread-safe for concurrent initialization.

The g_cache static object has its fill_* methods called from initialize(), but there's no synchronization. If multiple mtmd_audio_whisper_preprocessor instances call initialize() concurrently (or if preprocess() is called before initialize() completes), data races can occur.

Consider using std::call_once for thread-safe initialization:

+#include <mutex>
+
+static std::once_flag g_cache_init_flag;
+
 static struct mtmd_audio_global_cache {
     // ... existing members ...
 } g_cache;

Then in initialize():

 void mtmd_audio_whisper_preprocessor::initialize() {
+    std::call_once(g_cache_init_flag, [this]() {
         g_cache.fill_sin_cos_table(hparams.audio_n_fft);
         g_cache.fill_hann_window(hparams.audio_window_len, true);
         g_cache.fill_mel_filterbank_matrix(
             hparams.n_mel_bins,
             hparams.audio_n_fft,
             hparams.audio_sample_rate);
+    });
 }

Note: This assumes all instances use identical hparams values. Based on learnings, this may be deferred if thread-safety isn't an immediate concern.

🧹 Nitpick comments (4)
tools/mtmd/mtmd-audio.h (1)

20-28: Reference lifetime concern: hparams reference may outlive the source clip_ctx.

The hparams member is a reference to clip_hparams obtained from clip_ctx. If the clip_ctx is freed before the preprocessor, this reference becomes dangling. The current usage in mtmd.cpp appears safe since ctx_a and audio_preproc share the same lifetime in mtmd_context, but this coupling should be documented.

Consider adding a comment documenting this lifetime requirement:

 struct mtmd_audio_preprocessor {
-    const clip_hparams & hparams;
+    // NOTE: caller must ensure clip_ctx outlives this preprocessor
+    const clip_hparams & hparams;
tools/mtmd/mtmd-audio.cpp (3)

210-224: need_chunking field is set but never used.

The filter_params::need_chunking field is declared and set in preprocess() (line 492), but it's never read anywhere in the processing logic.

Either remove the unused field or implement the chunking behavior it's meant to control:

 struct filter_params {
     // ... other fields ...
-    bool    need_chunking = true;
 
     const mtmd_audio_global_cache & cache;
     filter_params() : cache(g_cache) {}
 };

356-361: Potential out-of-bounds write when padding Hann window.

If padding + hann_window_size > frame_size, the std::copy at line 359 will write past the end of hann_window_padded. While unlikely with typical parameters, consider adding a bounds check.

     if (params.hann_window_size < frame_size) {
         hann_window_padded.resize(frame_size);
         const int padding = (frame_size - params.hann_window_size) / 2;
+        GGML_ASSERT(padding + params.hann_window_size <= frame_size);
         std::copy(hann, hann + params.hann_window_size, &hann_window_padded[padding]);
         hann = hann_window_padded.data();
     }

494-497: Assertions verify cache initialization but don't provide graceful fallback.

The GGML_ASSERT calls will abort if initialize() wasn't called. This is acceptable for catching programming errors but could be more user-friendly.

Consider returning false with an error message instead of aborting:

     // make sure the global cache is initialized
-    GGML_ASSERT(!g_cache.sin_vals.empty());
-    GGML_ASSERT(!g_cache.cos_vals.empty());
-    GGML_ASSERT(!g_cache.filters.data.empty());
+    if (g_cache.sin_vals.empty() || g_cache.cos_vals.empty() || g_cache.filters.data.empty()) {
+        LOG_ERR("%s: global cache not initialized, call initialize() first\n", __func__);
+        return false;
+    }
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7b578b5 and 4b63e61.

📒 Files selected for processing (3)
  • tools/mtmd/mtmd-audio.cpp (7 hunks)
  • tools/mtmd/mtmd-audio.h (1 hunks)
  • tools/mtmd/mtmd.cpp (4 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.{cpp,h,hpp}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.{cpp,h,hpp}: Always format C++ code using git clang-format before committing, following .clang-format configuration (4-space indentation, 120 column limit, braces on same line for functions, pointer/reference alignment in middle)
Minimize external dependencies; avoid adding new external dependencies unless absolutely necessary

Files:

  • tools/mtmd/mtmd.cpp
  • tools/mtmd/mtmd-audio.h
  • tools/mtmd/mtmd-audio.cpp
**/*.{cpp,h,hpp,py}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Ensure cross-platform compatibility by testing code changes on Linux, macOS, and Windows when possible

Files:

  • tools/mtmd/mtmd.cpp
  • tools/mtmd/mtmd-audio.h
  • tools/mtmd/mtmd-audio.cpp
🧠 Learnings (2)
📓 Common learnings
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/lookahead/lookahead.cpp:97-98
Timestamp: 2025-03-13T22:56:47.417Z
Learning: ngxson prefers to prioritize simplification in some cases and defer adding error handling to a later time, as indicated by their response to a suggestion about adding error checks for llama_decode_ext calls.
📚 Learning: 2025-05-26T09:45:20.653Z
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 25
File: tools/mtmd/mtmd.cpp:275-293
Timestamp: 2025-05-26T09:45:20.653Z
Learning: In tools/mtmd/clip.cpp, PROJECTOR_TYPE_QWEN25O is a placeholder that gets replaced by either PROJECTOR_TYPE_QWEN25VL (for vision) or PROJECTOR_TYPE_QWEN2A (for audio) before the respective init_vision() or init_audio() functions are called, ensuring proper token handling.

Applied to files:

  • tools/mtmd/mtmd.cpp
🧬 Code graph analysis (1)
tools/mtmd/mtmd-audio.h (2)
tools/mtmd/clip.cpp (4)
  • ctx (2245-2347)
  • ctx (2245-2245)
  • clip_get_hparams (3429-3431)
  • clip_get_hparams (3429-3429)
tools/mtmd/mtmd-audio.cpp (4)
  • initialize (453-460)
  • initialize (453-453)
  • preprocess (462-536)
  • preprocess (462-465)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (19)
  • GitHub Check: ggml-ci-x64-cpu-high-perf
  • GitHub Check: build-cmake-pkg / linux
  • GitHub Check: ubuntu-22-cmake-musa
  • GitHub Check: windows-msys2 (UCRT64, ucrt-x86_64, Release)
  • GitHub Check: windows-2022-cmake-cuda (12.4)
  • GitHub Check: ggml-ci-x64-cpu-low-perf
  • GitHub Check: ubuntu-22-cmake-sycl
  • GitHub Check: macOS-latest-cmake-visionos
  • GitHub Check: windows-latest-cmake-hip
  • GitHub Check: ubuntu-22-cmake-hip
  • GitHub Check: ubuntu-24-cmake-vulkan-deb
  • GitHub Check: windows-latest-cmake-sycl
  • GitHub Check: ubuntu-latest-cmake-sanitizer (THREAD, Debug)
  • GitHub Check: ubuntu-latest-cmake-sanitizer (UNDEFINED, Debug)
  • GitHub Check: macOS-latest-cmake-arm64-webgpu
  • GitHub Check: macOS-latest-cmake-x64
  • GitHub Check: macOS-latest-cmake-arm64
  • GitHub Check: ubuntu-latest-llguidance
  • GitHub Check: server-windows
🔇 Additional comments (11)
tools/mtmd/mtmd.cpp (4)

154-154: LGTM!

The std::unique_ptr for the audio preprocessor is appropriate for managing the polymorphic preprocessor lifetime.


322-334: LGTM!

The switch-based initialization correctly maps projector types to the appropriate preprocessor implementation. The GGML_ABORT for unsupported types provides a clear failure mode during development.


665-668: LGTM!

The updated call to audio_preproc->preprocess correctly passes samples and receives the mel spectrogram chunks via the output vector.


870-875: No null-safety issue here. clip_get_hparams returns &ctx->model.hparams and cannot return null—it dereferences the context pointer directly to obtain the address of the hparams member. Since ctx->ctx_a is already validated on line 871, the code is safe.

Likely an incorrect or invalid review comment.

tools/mtmd/mtmd-audio.h (1)

30-34: LGTM!

The derived class properly inherits from the base and forwards the context to the parent constructor.

tools/mtmd/mtmd-audio.cpp (6)

14-15: LGTM!

Debug flag is appropriately set to false for production code.


334-338: Input validation for short audio prevents buffer over-read.

The check at line 334 correctly guards against buffer over-read when n_samples < stage_2_pad + 1. Returning false is appropriate.


364-375: Good defensive checks for overflow and underflow.

The overflow check at lines 367-370 and the n_samples < frame_size check at lines 371-374 properly guard against integer overflow in allocation and signed underflow in length calculation.


471-480: Short audio padding is a reasonable safeguard.

Padding input shorter than 2 * audio_n_fft with zeros prevents edge cases in the mel spectrogram computation. The TODO comment acknowledges this could be improved.


514-514: This assertion will not fail under normal operation.

The code includes multiple safeguards ensuring out_full.n_len > 3000:

  1. Pre-padding (line 475-481) ensures minimum 800 samples
  2. stage_1_pad = 480000 samples (30 seconds at 16kHz) is always applied when center_padding = false (explicitly set at line 488)
  3. With hop_length = 160 and frame_size = 400, even the minimum input of 800 samples yields out.n_len = (800 + 480400 - 400) / 160 + 1 ≈ 3006 frames

The mathematical formula out.n_len = (n_samples - 400) / 160 + 1 with the guaranteed 30-second padding ensures the assertion n_len > 3000 will always pass.

Likely an incorrect or invalid review comment.


453-460: Initialization fills global cache with instance-specific parameters.

The initialize() method populates the shared g_cache using this instance's hparams. If different preprocessor instances have different hparams values, the last one to initialize wins, potentially causing incorrect processing for earlier instances.

Verify that all mtmd_audio_whisper_preprocessor instances will use identical audio parameters:

#!/bin/bash
# Check if different projector types could have different audio hparams
rg -n "audio_n_fft|audio_window_len|audio_hop_len|audio_sample_rate|n_mel_bins" --type cpp -C 2

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (4)
tools/mtmd/mtmd-audio.cpp (4)

220-220: Unused field need_chunking.

This field is declared and assigned in preprocess() at line 492, but never read anywhere in the codebase. Consider removing it to avoid confusion.

-    bool    need_chunking = true;

279-280: Inconsistent index type safety.

Line 271 correctly uses size_t casting for the index calculation, but line 280 uses plain int multiplication which could overflow for large values.

             // handle n_fft remainder
             for (; k < n_fft_bins; k++) {
-                sum += fft_out[k] * filters.data[j * n_fft_bins + k];
+                sum += fft_out[k] * filters.data[size_t(j) * size_t(n_fft_bins) + size_t(k)];
             }

341-351: Preemphasis coefficient is hardcoded despite being a parameter.

params.preemph is used as a boolean toggle, but the actual coefficient is hardcoded as 0.97f. If preemph is meant to be configurable, consider using its value directly:

     // preemphasis
-    if (params.preemph) {
+    if (params.preemph != 0.0f) {
         const int pad_amount = frame_size / 2;
-        const float preemph = 0.97f;
         float prev = samples_padded[pad_amount];
         for (int i = pad_amount + 1; i + pad_amount < n_samples; ++i) {
             float cur = samples_padded[i];
-            samples_padded[i] = cur - preemph * prev;
+            samples_padded[i] = cur - params.preemph * prev;
             prev = cur;
         }
     }

Alternatively, if 0.97 is always intended, change preemph to a bool type for clarity.


503-503: Consider making thread count configurable.

The thread count is hardcoded to 4. This may not be optimal for all systems. Consider passing it as a parameter or deriving it from hardware concurrency.

+    const int n_threads = std::min(4, (int)std::thread::hardware_concurrency());
     bool ok = log_mel_spectrogram(
                 samples,
                 n_samples,
-                4, // n_threads
+                n_threads,
                 params,
                 out_full);
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4b63e61 and 93290e5.

📒 Files selected for processing (1)
  • tools/mtmd/mtmd-audio.cpp (7 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.{cpp,h,hpp}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.{cpp,h,hpp}: Always format C++ code using git clang-format before committing, following .clang-format configuration (4-space indentation, 120 column limit, braces on same line for functions, pointer/reference alignment in middle)
Minimize external dependencies; avoid adding new external dependencies unless absolutely necessary

Files:

  • tools/mtmd/mtmd-audio.cpp
**/*.{cpp,h,hpp,py}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Ensure cross-platform compatibility by testing code changes on Linux, macOS, and Windows when possible

Files:

  • tools/mtmd/mtmd-audio.cpp
🧠 Learnings (1)
📓 Common learnings
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/lookahead/lookahead.cpp:97-98
Timestamp: 2025-03-13T22:56:47.417Z
Learning: ngxson prefers to prioritize simplification in some cases and defer adding error handling to a later time, as indicated by their response to a suggestion about adding error checks for llama_decode_ext calls.
🧬 Code graph analysis (1)
tools/mtmd/mtmd-audio.cpp (1)
tools/mtmd/mtmd.cpp (2)
  • GGML_ASSERT (218-313)
  • GGML_ASSERT (315-347)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (19)
  • GitHub Check: windows-latest-cmake (llvm-arm64, arm64, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/ar...
  • GitHub Check: ggml-ci-arm64-cpu-high-perf-sve
  • GitHub Check: ggml-ci-arm64-cpu-kleidiai
  • GitHub Check: ggml-ci-arm64-cpu-low-perf
  • GitHub Check: ubuntu-latest-cmake-cuda
  • GitHub Check: windows-2022-cmake-cuda (12.4)
  • GitHub Check: ubuntu-24-cmake-vulkan
  • GitHub Check: ubuntu-cpu-cmake (arm64, ubuntu-22.04-arm)
  • GitHub Check: ubuntu-24-wasm-webgpu
  • GitHub Check: ubuntu-latest-cmake-sanitizer (UNDEFINED, Debug)
  • GitHub Check: macOS-latest-cmake-x64
  • GitHub Check: ubuntu-latest-cmake-sanitizer (ADDRESS, Debug)
  • GitHub Check: ubuntu-24-cmake-vulkan-deb
  • GitHub Check: macOS-latest-cmake-arm64
  • GitHub Check: ubuntu-latest-cmake-sanitizer (THREAD, Debug)
  • GitHub Check: ubuntu-latest-llguidance
  • GitHub Check: ubuntu-latest-cmake-rpc
  • GitHub Check: ubuntu-24-cmake-webgpu
  • GitHub Check: server-windows

Comment on lines +393 to +408
const int effective_n_len = n_samples_in / frame_step;
if (params.norm_per_feature) {
for (int i = 0; i < out.n_mel; i++) {
double mean = 0;
for (int j = 0; j < effective_n_len; ++j) {
mean += out.data[i * out.n_len + j];
}
mean /= effective_n_len;

mmax -= 8.0;
double var = 0.0;
for (int j = 0; j < effective_n_len; ++j) {
const double value = out.data[i * out.n_len + j] - mean;
var += value * value;
}
var /= effective_n_len - 1; // unbiased
const double mstd = std::sqrt(var + 1e-5);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Division by zero when effective_n_len is 1.

If n_samples_in < frame_step * 2, then effective_n_len will be 1, causing a division by zero at line 407 (var /= effective_n_len - 1).

     const int effective_n_len = n_samples_in / frame_step;
+    if (effective_n_len < 2) {
+        LOG_ERR("%s: not enough frames for normalization (effective_n_len=%d)\n", __func__, effective_n_len);
+        return false;
+    }
     if (params.norm_per_feature) {
🤖 Prompt for AI Agents
In tools/mtmd/mtmd-audio.cpp around lines 393-408, guard against effective_n_len
being <= 1 to avoid division-by-zero: check effective_n_len before computing
mean/variance; if effective_n_len == 0 skip normalization for that feature (or
continue) and if effective_n_len == 1 set variance to 0 (or unbiased var to 0)
so mstd = sqrt(0 + 1e-5) and proceed without dividing by (effective_n_len - 1).
Adjust the mean computation to avoid dividing by zero and ensure you still
produce a stable mstd for the subsequent normalization step.

Comment on lines +521 to 524
mtmd_audio_mel out_chunk;
out_chunk.n_len = n_len;
out_chunk.n_mel = out_full.n_mel;
out_chunk.n_len_org = out_full.n_mel; // unused
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Suspicious assignment: n_len_org is set to n_mel instead of a length value.

Line 524 assigns out_full.n_mel to out_chunk.n_len_org, which seems incorrect given the field name suggests it should be a length. The comment says "unused", but this could cause confusion or bugs if the field is used in the future.

         out_chunk.n_len     = n_len;
         out_chunk.n_mel     = out_full.n_mel;
-        out_chunk.n_len_org = out_full.n_mel; // unused
+        out_chunk.n_len_org = out_full.n_len_org; // original length before padding
         out_chunk.data.reserve(out_chunk.n_mel * out_chunk.n_len);
🤖 Prompt for AI Agents
In tools/mtmd/mtmd-audio.cpp around lines 521 to 524, the field
out_chunk.n_len_org is incorrectly set to out_full.n_mel; change it to a proper
length value (e.g., out_chunk.n_len_org = n_len) or explicitly zero it if truly
unused, and update the comment to explain why it’s set that way to avoid
confusion.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

♻️ Duplicate comments (2)
tools/mtmd/mtmd-audio.cpp (2)

393-408: Division by zero when effective_n_len is 1.

If n_samples_in < frame_step * 2, then effective_n_len will be 1, causing division by zero at line 407.

     const int effective_n_len = n_samples_in / frame_step;
+    if (effective_n_len < 2) {
+        LOG_ERR("%s: not enough frames for normalization (effective_n_len=%d)\n", __func__, effective_n_len);
+        return false;
+    }
     if (params.norm_per_feature) {

524-527: Suspicious assignment: n_len_org set to n_mel.

Line 527 assigns out_full.n_mel to out_chunk.n_len_org, which seems semantically incorrect given the field name. Even if currently unused, this could cause bugs if the field is used later.

         out_chunk.n_len     = n_len;
         out_chunk.n_mel     = out_full.n_mel;
-        out_chunk.n_len_org = out_full.n_mel; // unused
+        out_chunk.n_len_org = out_full.n_len_org; // original length before padding
🧹 Nitpick comments (3)
tools/mtmd/mtmd-audio.cpp (3)

14-14: Consider using constexpr for the debug flag.

Using constexpr instead of const for a compile-time constant boolean allows the compiler to eliminate debug code paths entirely when disabled.

-constexpr bool DEBUG = false;
+static constexpr bool DEBUG = false;

220-220: Unused field need_chunking.

The need_chunking field is assigned at line 493 but never read. Either use it to control chunking behavior or remove it to avoid confusion.


501-506: Hardcoded thread count.

The n_threads parameter is hardcoded to 4 at line 504, ignoring the configurable n_threads from mtmd_context_params. Consider passing the thread count through the preprocessor interface.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 93290e5 and 1aaec3b.

📒 Files selected for processing (2)
  • tools/mtmd/mtmd-audio.cpp (7 hunks)
  • tools/mtmd/mtmd.cpp (4 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.{cpp,h,hpp}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.{cpp,h,hpp}: Always format C++ code using git clang-format before committing, following .clang-format configuration (4-space indentation, 120 column limit, braces on same line for functions, pointer/reference alignment in middle)
Minimize external dependencies; avoid adding new external dependencies unless absolutely necessary

Files:

  • tools/mtmd/mtmd-audio.cpp
  • tools/mtmd/mtmd.cpp
**/*.{cpp,h,hpp,py}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Ensure cross-platform compatibility by testing code changes on Linux, macOS, and Windows when possible

Files:

  • tools/mtmd/mtmd-audio.cpp
  • tools/mtmd/mtmd.cpp
🧠 Learnings (2)
📓 Common learnings
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/lookahead/lookahead.cpp:97-98
Timestamp: 2025-03-13T22:56:47.417Z
Learning: ngxson prefers to prioritize simplification in some cases and defer adding error handling to a later time, as indicated by their response to a suggestion about adding error checks for llama_decode_ext calls.
📚 Learning: 2025-05-26T09:45:20.653Z
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 25
File: tools/mtmd/mtmd.cpp:275-293
Timestamp: 2025-05-26T09:45:20.653Z
Learning: In tools/mtmd/clip.cpp, PROJECTOR_TYPE_QWEN25O is a placeholder that gets replaced by either PROJECTOR_TYPE_QWEN25VL (for vision) or PROJECTOR_TYPE_QWEN2A (for audio) before the respective init_vision() or init_audio() functions are called, ensuring proper token handling.

Applied to files:

  • tools/mtmd/mtmd.cpp
🧬 Code graph analysis (1)
tools/mtmd/mtmd-audio.cpp (1)
tools/mtmd/mtmd.cpp (2)
  • GGML_ASSERT (218-313)
  • GGML_ASSERT (315-348)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (19)
  • GitHub Check: android-ndk-build (arm64-cpu, -D ANDROID_ABI=arm64-v8a -D ANDROID_PLATFORM=android-31 -D CMAKE_TO...
  • GitHub Check: android-build
  • GitHub Check: ubuntu-22-cmake-sycl
  • GitHub Check: windows-latest-cmake-sycl
  • GitHub Check: ggml-ci-arm64-cpu-kleidiai
  • GitHub Check: ubuntu-latest-cmake-cuda
  • GitHub Check: windows-latest-cmake-hip
  • GitHub Check: ios-xcode-build
  • GitHub Check: ubuntu-cpu-cmake (x64, ubuntu-22.04)
  • GitHub Check: ubuntu-latest-cmake-sanitizer (ADDRESS, Debug)
  • GitHub Check: ubuntu-cpu-cmake (arm64, ubuntu-22.04-arm)
  • GitHub Check: ubuntu-latest-cmake-sanitizer (UNDEFINED, Debug)
  • GitHub Check: ubuntu-latest-cmake-sanitizer (THREAD, Debug)
  • GitHub Check: ubuntu-latest-cmake-rpc
  • GitHub Check: ubuntu-latest-llguidance
  • GitHub Check: macOS-latest-cmake-arm64
  • GitHub Check: macOS-latest-cmake-x64
  • GitHub Check: macOS-latest-cmake-arm64-webgpu
  • GitHub Check: server-windows
🔇 Additional comments (10)
tools/mtmd/mtmd.cpp (4)

154-154: LGTM!

The addition of std::unique_ptr<mtmd_audio_preprocessor> is appropriate for managing the audio preprocessor lifetime within the context.


322-336: LGTM!

The switch statement correctly initializes the mtmd_audio_whisper_preprocessor for the supported audio projector types and aborts for unsupported types. The initialize() call ensures the global cache is populated before use. Based on learnings, PROJECTOR_TYPE_QWEN25O is a placeholder that gets replaced appropriately.


666-669: LGTM!

The type change to mtmd_audio_mel and routing through audio_preproc->preprocess() aligns with the new preprocessing API.


875-875: LGTM!

Retrieving the sample rate from clip_get_hparams instead of hardcoding 16000 is the correct approach for supporting different audio models.

tools/mtmd/mtmd-audio.cpp (6)

23-137: Global cache assumes single configuration.

The global g_cache is populated by initialize() based on the first audio context's hparams. If multiple mtmd_context instances are created with different audio configurations (different n_fft, sample_rate, etc.), the cache will be incorrect for subsequent contexts.

This design is acceptable if all audio models share identical preprocessing parameters, but document this assumption or consider making the cache per-preprocessor instance.


282-285: LGTM!

The logarithm computation correctly handles the use_natural_log flag, using different epsilon handling for each case. The std::max(sum, 1e-10) prevents log of zero when not using natural log.


334-339: Good: Short audio check prevents buffer over-read.

The check at line 334 prevents the reverse_copy from reading out of bounds when audio is too short. The TODO comment suggests this could be handled more gracefully in the future.


367-374: Good: Overflow and bounds checks added.

The SIZE_MAX check at line 367 prevents integer overflow in allocation, and the frame_size check at line 371 prevents underflow in the n_len calculation.


453-460: LGTM!

The initialize() method correctly populates the global cache with the required FFT tables, Hann window, and mel filterbank matrix.


496-498: Good: Assertions verify cache initialization.

These assertions provide a safety check ensuring initialize() was called before preprocess().

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

♻️ Duplicate comments (2)
tools/mtmd/mtmd-audio.cpp (2)

389-415: Division by zero when effective_n_len ≤ 1.

If n_samples_in < frame_step * 2, then effective_n_len will be 0 or 1, causing division by zero at line 403 (var /= effective_n_len - 1) and at line 396 (mean /= effective_n_len).

     const int effective_n_len = n_samples_in / frame_step;
+    if (effective_n_len < 2) {
+        LOG_ERR("%s: not enough frames for normalization (effective_n_len=%d)\n", __func__, effective_n_len);
+        return false;
+    }
     if (params.norm_per_feature) {

524-527: Suspicious assignment: n_len_org is set to n_mel.

Line 527 assigns out_full.n_mel to out_chunk.n_len_org, which semantically doesn't match the field name (length vs mel bins). Even if unused now, this could cause confusion or bugs later.

         out_chunk.n_len     = n_len;
         out_chunk.n_mel     = out_full.n_mel;
-        out_chunk.n_len_org = out_full.n_mel; // unused
+        out_chunk.n_len_org = 0; // not applicable for chunks
         out_chunk.data.reserve(out_chunk.n_mel * out_chunk.n_len);
🧹 Nitpick comments (3)
tools/mtmd/mtmd-audio.cpp (3)

23-26: Global cache sharing may cause issues with concurrent context initialization.

The comment acknowledges this limitation. If multiple mtmd_context instances are created concurrently with different hparams (e.g., different audio_n_fft), the cache will be overwritten by the last initializer, causing incorrect results for other contexts. Consider documenting this constraint more prominently or using std::call_once if same-hparams assumption holds.


213-224: Unused field: need_chunking.

The field need_chunking is assigned (line 493) but never read. Consider removing it or implementing the intended functionality.

 struct filter_params {
     int32_t n_mel;
     int32_t n_fft_bins;
     int32_t hann_window_size;
     int32_t hop_length;
     int32_t sample_rate;
     bool    center_padding = false;
     float   preemph = 0.f;
     bool    use_natural_log = false;
     bool    norm_per_feature = false;
-    bool    need_chunking = true;
 };

337-347: Preemphasis logic inconsistency.

The condition if (params.preemph) uses the float as a boolean, but line 340 ignores params.preemph and uses a hardcoded 0.97f. Since params.preemph is set to 0.0f in preprocess(), this block is dead code. Either remove it or fix the logic:

-    if (params.preemph) {
+    if (params.preemph != 0.0f) {
         const int pad_amount = frame_size / 2;
-        const float preemph = 0.97f;
+        const float preemph = params.preemph;
         float prev = samples_padded[pad_amount];
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1aaec3b and 7b417fd.

📒 Files selected for processing (3)
  • tools/mtmd/mtmd-audio.cpp (6 hunks)
  • tools/mtmd/mtmd-audio.h (1 hunks)
  • tools/mtmd/mtmd.cpp (4 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.{cpp,h,hpp}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.{cpp,h,hpp}: Always format C++ code using git clang-format before committing, following .clang-format configuration (4-space indentation, 120 column limit, braces on same line for functions, pointer/reference alignment in middle)
Minimize external dependencies; avoid adding new external dependencies unless absolutely necessary

Files:

  • tools/mtmd/mtmd.cpp
  • tools/mtmd/mtmd-audio.h
  • tools/mtmd/mtmd-audio.cpp
**/*.{cpp,h,hpp,py}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Ensure cross-platform compatibility by testing code changes on Linux, macOS, and Windows when possible

Files:

  • tools/mtmd/mtmd.cpp
  • tools/mtmd/mtmd-audio.h
  • tools/mtmd/mtmd-audio.cpp
🧠 Learnings (2)
📓 Common learnings
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/lookahead/lookahead.cpp:97-98
Timestamp: 2025-03-13T22:56:47.417Z
Learning: ngxson prefers to prioritize simplification in some cases and defer adding error handling to a later time, as indicated by their response to a suggestion about adding error checks for llama_decode_ext calls.
📚 Learning: 2025-05-26T09:45:20.653Z
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 25
File: tools/mtmd/mtmd.cpp:275-293
Timestamp: 2025-05-26T09:45:20.653Z
Learning: In tools/mtmd/clip.cpp, PROJECTOR_TYPE_QWEN25O is a placeholder that gets replaced by either PROJECTOR_TYPE_QWEN25VL (for vision) or PROJECTOR_TYPE_QWEN2A (for audio) before the respective init_vision() or init_audio() functions are called, ensuring proper token handling.

Applied to files:

  • tools/mtmd/mtmd.cpp
🧬 Code graph analysis (1)
tools/mtmd/mtmd-audio.h (2)
tools/mtmd/clip-impl.h (1)
  • vector (361-372)
tools/mtmd/mtmd-audio.cpp (4)
  • initialize (453-460)
  • initialize (453-453)
  • preprocess (462-539)
  • preprocess (462-465)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (17)
  • GitHub Check: WebUI Check
  • GitHub Check: ggml-ci-arm64-cpu-kleidiai
  • GitHub Check: windows-latest-cmake-sycl
  • GitHub Check: ggml-ci-arm64-cpu-high-perf
  • GitHub Check: ubuntu-cpu-cmake (arm64, ubuntu-22.04-arm)
  • GitHub Check: ubuntu-latest-cmake-sanitizer (THREAD, Debug)
  • GitHub Check: macOS-latest-cmake-arm64-webgpu
  • GitHub Check: ubuntu-cpu-cmake (x64, ubuntu-22.04)
  • GitHub Check: ubuntu-latest-cmake-rpc
  • GitHub Check: macOS-latest-cmake-x64
  • GitHub Check: ubuntu-24-cmake-vulkan
  • GitHub Check: ubuntu-latest-cmake-sanitizer (UNDEFINED, Debug)
  • GitHub Check: macOS-latest-cmake-arm64
  • GitHub Check: ubuntu-latest-cmake-sanitizer (ADDRESS, Debug)
  • GitHub Check: ubuntu-24-cmake-vulkan-deb
  • GitHub Check: ubuntu-latest-llguidance
  • GitHub Check: server-windows
🔇 Additional comments (9)
tools/mtmd/mtmd.cpp (4)

154-154: LGTM!

Good use of std::unique_ptr for automatic lifetime management of the audio preprocessor.


322-336: LGTM!

The switch-based preprocessor construction correctly handles the supported projector types and aborts on unsupported types. Per retrieved learnings, PROJECTOR_TYPE_QWEN25O is a placeholder that gets replaced before init_audio() is called, so including it here is correct.


666-669: LGTM!

The type change from whisper_mel to mtmd_audio_mel and routing through audio_preproc->preprocess() correctly integrates with the new preprocessing API.


875-875: LGTM!

Good improvement—using clip_get_hparams(ctx->ctx_a)->audio_sample_rate instead of hardcoded 16000 allows the bitrate to adapt to model-specific sample rates.

tools/mtmd/mtmd-audio.cpp (3)

330-333: Good: Short audio validation added.

The check at lines 330-333 prevents buffer over-read in reflective padding when audio is too short. This addresses the security concern raised in the PR objectives.


363-370: Good: Overflow and underflow checks added.

These validations prevent size overflow in allocation and handle the case where n_samples < frame_size.


453-460: LGTM!

The initialize() implementation correctly populates the global cache with sin/cos tables, Hann window, and mel filterbank.

tools/mtmd/mtmd-audio.h (2)

20-28: LGTM!

The base class now has a virtual destructor (addressing the past review comment), and the polymorphic interface is well-designed. The hparams reference requires the clip_ctx to outlive the preprocessor, which is ensured by the ownership structure in mtmd_context.


30-34: LGTM!

The derived class correctly inherits from the base and overrides the required virtual methods.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7b417fd and 8ace9e6.

📒 Files selected for processing (1)
  • tools/mtmd/mtmd-audio.cpp (6 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.{cpp,h,hpp}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.{cpp,h,hpp}: Always format C++ code using git clang-format before committing, following .clang-format configuration (4-space indentation, 120 column limit, braces on same line for functions, pointer/reference alignment in middle)
Minimize external dependencies; avoid adding new external dependencies unless absolutely necessary

Files:

  • tools/mtmd/mtmd-audio.cpp
**/*.{cpp,h,hpp,py}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Ensure cross-platform compatibility by testing code changes on Linux, macOS, and Windows when possible

Files:

  • tools/mtmd/mtmd-audio.cpp
🧠 Learnings (1)
📓 Common learnings
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/lookahead/lookahead.cpp:97-98
Timestamp: 2025-03-13T22:56:47.417Z
Learning: ngxson prefers to prioritize simplification in some cases and defer adding error handling to a later time, as indicated by their response to a suggestion about adding error checks for llama_decode_ext calls.
🧬 Code graph analysis (1)
tools/mtmd/mtmd-audio.cpp (1)
tools/mtmd/mtmd.cpp (2)
  • GGML_ASSERT (218-313)
  • GGML_ASSERT (315-348)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (20)
  • GitHub Check: build-linux-cross / debian-13-loongarch64-cpu-cross
  • GitHub Check: build-linux-cross / debian-13-loongarch64-vulkan-cross
  • GitHub Check: ubuntu-22-cmake-sycl
  • GitHub Check: ggml-ci-arm64-cpu-kleidiai
  • GitHub Check: ubuntu-24-wasm-webgpu
  • GitHub Check: ggml-ci-arm64-cpu-high-perf-sve
  • GitHub Check: ubuntu-24-cmake-webgpu
  • GitHub Check: ggml-ci-arm64-cpu-high-perf
  • GitHub Check: ggml-ci-arm64-cpu-low-perf
  • GitHub Check: ubuntu-22-cmake-sycl-fp16
  • GitHub Check: ubuntu-24-cmake-vulkan
  • GitHub Check: ubuntu-22-cmake-musa
  • GitHub Check: ggml-ci-x64-cpu-high-perf
  • GitHub Check: ubuntu-24-cmake-vulkan-deb
  • GitHub Check: ubuntu-cpu-cmake (arm64, ubuntu-22.04-arm)
  • GitHub Check: ubuntu-cpu-cmake (x64, ubuntu-22.04)
  • GitHub Check: macOS-latest-cmake-arm64
  • GitHub Check: macOS-latest-cmake-arm64-webgpu
  • GitHub Check: macOS-latest-cmake-x64
  • GitHub Check: server-windows
🔇 Additional comments (5)
tools/mtmd/mtmd-audio.cpp (5)

287-287: LGTM: Log inconsistency fixed.

The zero-fill case now correctly respects the use_natural_log parameter, fixing the inconsistency flagged in previous reviews.


452-459: LGTM: Clean initialization of global cache.

The initialize() method properly populates the global cache with FFT tables, Hann window, and mel filterbank matrix. This explicit initialization step is good practice before preprocessing.


470-480: Good defensive handling for short audio.

The code pads short audio inputs with zeros to meet minimum length requirements, preventing potential issues with the stage1/2 padding in log_mel_spectrogram. This addresses some of the buffer safety concerns.


348-356: Validate Hann window padding to prevent buffer overflow.

The code copies hann_window_size elements starting at offset padding into hann_window_padded of size frame_size. If padding + hann_window_size > frame_size, this will write out of bounds.

Add a bounds check:

     // pad hann window if it's smaller than frame_size
     // TODO: probably unnecessary here? (or better doing it in g_cache?)
     std::vector<float> hann_window_padded;
     if (params.hann_window_size < frame_size) {
         hann_window_padded.resize(frame_size);
         const int padding = (frame_size - params.hann_window_size) / 2;
+        if (padding + params.hann_window_size > frame_size) {
+            LOG_ERR("%s: hann window padding overflow\n", __func__);
+            return false;
+        }
         std::copy(hann, hann + params.hann_window_size, &hann_window_padded[padding]);
         hann = hann_window_padded.data();
     }

Based on PR objectives security analysis.

Likely an incorrect or invalid review comment.


329-333: Fix buffer over-read condition in reflective padding.

The check at line 329 uses n_samples < stage_2_pad + 1, but when n_samples == stage_2_pad, the expression samples + 1 + stage_2_pad at line 333 will read beyond the array bounds.

Tighten the bounds check:

         // reflective pad 200 samples at the beginning of audio
-        if (n_samples < stage_2_pad + 1) {
+        if (n_samples <= stage_2_pad) {
             // TODO: Handle short audio differently or return error
             return false;
         }
         std::reverse_copy(samples + 1, samples + 1 + stage_2_pad, samples_padded.begin());

Based on PR objectives security analysis.

Likely an incorrect or invalid review comment.

Comment on lines +60 to +139
void fill_mel_filterbank_matrix(
int n_mel,
int n_fft,
int sample_rate, // e.g. 16000
float fmin = 0.0f, // e.g. 0.0
float fmax = -1.0f, // e.g. sr/2; pass -1 for auto
bool slaney_area_norm = true,
float scale = 1.0f // optional extra scaling; use 1.0f/1000.0f to mimic your code
) {
GGML_ASSERT(n_mel > 0 && n_fft > 1);
if (fmax <= 0.0f) {
fmax = 0.5f * sample_rate;
}

// Slaney scale (matches librosa default)
const double min_log_hz = 1000.0;
const double lin_slope = 3 / 200.;
const double min_log_mel = min_log_hz * lin_slope;
const double log_step = log(6.4) / 27.0;
auto hz_to_mel = [min_log_hz, lin_slope, log_step, min_log_mel](const double f_hz) -> double {
return (f_hz < min_log_hz) ? f_hz * lin_slope : min_log_mel + log(f_hz / min_log_hz) / log_step;
};
auto mel_to_hz = [min_log_hz, lin_slope, log_step, min_log_mel](const double m) -> double {
return (m < min_log_mel) ? m / lin_slope : min_log_hz * exp((m - min_log_mel) * log_step);
};

// infer N_fft from n_fft_bins
const double bin_hz_step = double(sample_rate) / double(n_fft);

// mel grid: n_mel + 2 edges
const double m_lo = hz_to_mel(fmin);
const double m_hi = hz_to_mel(fmax);
std::vector<double> mel_pts(n_mel + 2);
for (int i = 0; i < n_mel + 2; ++i) {
mel_pts[i] = m_lo + (m_hi - m_lo) * (double(i) / (n_mel + 1));
}

// convert to Hz
std::vector<double> hz_pts(n_mel + 2);
for (int i = 0; i < n_mel + 2; ++i) {
hz_pts[i] = mel_to_hz(mel_pts[i]);
}

const int n_fft_bins = n_fft / 2 + 1;

// filterbank
std::vector<float> out(n_mel * n_fft_bins, 0);
for (int m = 0; m < n_mel; ++m) {
const double f_left = hz_pts[m];
const double f_center = hz_pts[m + 1];
const double f_right = hz_pts[m + 2];

const double denom_l = std::max(1e-30, f_center - f_left);
const double denom_r = std::max(1e-30, f_right - f_center);
const double enorm = slaney_area_norm ? (2.0 / std::max(1e-30, f_right - f_left)) : 1.0;

for (int k = 0; k < n_fft_bins; ++k) {
const double f = k * bin_hz_step;
double w = 0.0;
if (f >= f_left && f <= f_center) {
w = (f - f_left) / denom_l;
} else if (f > f_center && f <= f_right) {
w = (f_right - f) / denom_r;
}
out[size_t(m) * size_t(n_fft_bins) + size_t(k)] = float(w * enorm * scale);
}
}

filters.n_mel = n_mel;
filters.n_fft = n_fft;
filters.data = std::move(out);

if (DEBUG) { // debug
for (size_t i = 0; i < filters.data.size(); ++i) {
if (filters.data[i] != 0.0f) {
printf("filters[%zu] = %f\n", i, filters.data[i] * 1000.0f);
}
}
}
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Add input validation to prevent overflow and invalid parameters.

The fill_mel_filterbank_matrix function lacks validation for critical parameters. According to the PR objectives, this creates several risks:

  1. Integer overflow at line 106: n_mel * n_fft_bins multiplication should be checked before resize
  2. No validation that n_mel, n_fft, and sample_rate are positive and within reasonable bounds
  3. No check that n_fft is even (required for n_fft_bins = n_fft / 2 + 1 to be correct)

Add validation at the function start:

     void fill_mel_filterbank_matrix(
         int n_mel,
         int n_fft,
         int sample_rate,            // e.g. 16000
         float fmin = 0.0f,          // e.g. 0.0
         float fmax = -1.0f,         // e.g. sr/2; pass -1 for auto
         bool slaney_area_norm = true,
         float scale = 1.0f          // optional extra scaling; use 1.0f/1000.0f to mimic your code
     ) {
-        GGML_ASSERT(n_mel > 0 && n_fft > 1);
+        GGML_ASSERT(n_mel > 0 && n_mel <= 512);  // reasonable upper bound
+        GGML_ASSERT(n_fft > 1 && n_fft <= 65536 && n_fft % 2 == 0);  // must be even, reasonable upper bound
+        GGML_ASSERT(sample_rate > 0 && sample_rate <= 1000000);
         if (fmax <= 0.0f) {
             fmax = 0.5f * sample_rate;
         }
 
         // Slaney scale (matches librosa default)
         const double min_log_hz = 1000.0;
         const double lin_slope = 3 / 200.;
         const double min_log_mel = min_log_hz * lin_slope;
         const double log_step = log(6.4) / 27.0;
         auto hz_to_mel = [min_log_hz, lin_slope, log_step, min_log_mel](const double f_hz) -> double {
             return (f_hz < min_log_hz) ? f_hz * lin_slope : min_log_mel + log(f_hz / min_log_hz) / log_step;
         };
         auto mel_to_hz = [min_log_hz, lin_slope, log_step, min_log_mel](const double m) -> double {
             return (m < min_log_mel) ? m / lin_slope : min_log_hz * exp((m - min_log_mel) * log_step);
         };
 
         // infer N_fft from n_fft_bins
         const double bin_hz_step = double(sample_rate) / double(n_fft);
 
         // mel grid: n_mel + 2 edges
         const double m_lo = hz_to_mel(fmin);
         const double m_hi = hz_to_mel(fmax);
         std::vector<double> mel_pts(n_mel + 2);
         for (int i = 0; i < n_mel + 2; ++i) {
             mel_pts[i] = m_lo + (m_hi - m_lo) * (double(i) / (n_mel + 1));
         }
 
         // convert to Hz
         std::vector<double> hz_pts(n_mel + 2);
         for (int i = 0; i < n_mel + 2; ++i) {
             hz_pts[i] = mel_to_hz(mel_pts[i]);
         }
 
         const int n_fft_bins = n_fft / 2 + 1;
+        
+        // Check for overflow before allocating
+        if (n_mel > SIZE_MAX / n_fft_bins) {
+            GGML_ABORT("mel filterbank size overflow");
+        }
 
         // filterbank
         std::vector<float> out(n_mel * n_fft_bins, 0);

Based on PR objectives security analysis.

🤖 Prompt for AI Agents
In tools/mtmd/mtmd-audio.cpp around lines 60 to 139, add input validation at the
top of fill_mel_filterbank_matrix: verify n_mel > 0, n_fft > 1, and sample_rate
> 0 and that n_fft is even; ensure fmax > fmin (after auto-filling fmax) and
that fmin/fmax are finite and non-negative; compute n_fft_bins = n_fft/2 + 1 and
check the multiplication n_mel * n_fft_bins will not overflow (compare against
SIZE_MAX or a project-specific max like 1e8) before allocating/constructing the
vector; on validation failure use GGML_ASSERT or return early with a clear
error/log to avoid proceeding with invalid params.

Comment on lines +309 to +310
const int frame_size = (params.n_fft_bins - 1) * 2;
const int frame_step = params.hop_length;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Validate frame_size and frame_step to prevent division by zero.

Lines 360, 388, and 402 perform divisions using frame_step and derived values. If frame_step (hop_length) is zero or frame_size is zero, this will cause division by zero.

Add validation after computing frame parameters:

     // Hann window
     const float * hann = g_cache.hann_window.data();
     const int frame_size = (params.n_fft_bins - 1) * 2;
     const int frame_step = params.hop_length;
+    
+    if (frame_step <= 0 || frame_size <= 0) {
+        LOG_ERR("%s: invalid frame parameters (frame_size=%d, frame_step=%d)\n", 
+                __func__, frame_size, frame_step);
+        return false;
+    }

Based on PR objectives security analysis.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
const int frame_size = (params.n_fft_bins - 1) * 2;
const int frame_step = params.hop_length;
const int frame_size = (params.n_fft_bins - 1) * 2;
const int frame_step = params.hop_length;
if (frame_step <= 0 || frame_size <= 0) {
LOG_ERR("%s: invalid frame parameters (frame_size=%d, frame_step=%d)\n",
__func__, frame_size, frame_step);
return false;
}
🤖 Prompt for AI Agents
In tools/mtmd/mtmd-audio.cpp around lines 309-310, after computing frame_size
and frame_step validate that both are greater than zero to avoid
division-by-zero later (lines ~360, 388, 402); if either is zero or negative,
log an error (or set an error code), and return/fail early (or throw an
exception) instead of continuing, ensuring all callers handle the error path
consistently; this check should be added immediately after those two lines with
a clear message indicating invalid hop_length/frame_size.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants