-
Notifications
You must be signed in to change notification settings - Fork 5
Xsn/mtmd refactor audio preproc #57
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
WalkthroughAdded audio-to-mel hyperparameters to CLIP hparams and exposed an accessor; replaced the legacy whisper preprocessor with a new runtime-configurable mtmd_audio pipeline (global cache, runtime mel filterbank, parameterized FFT/DFT) and integrated a per-context polymorphic audio preprocessor. Changes
Sequence Diagram(s)sequenceDiagram
autonumber
participant App as mtmd caller
participant Ctx as mtmd_context
participant CLIP as clip_ctx / clip_hparams
participant Pre as mtmd_audio_preprocessor
participant Cache as mtmd_audio_global_cache
App->>Ctx: request audio init / tokenize
Ctx->>CLIP: read projector type & hparams (clip_get_hparams)
Ctx->>Pre: construct appropriate mtmd_audio_preprocessor(ctx)
Pre->>Cache: initialize() — populate sin/cos, Hann window, mel filterbank
App->>Pre: preprocess(samples, n_samples)
Pre->>Cache: use cached tables for FFT/DFT and mel filtering
Pre-->>Ctx: return vector<mtmd_audio_mel> (chunks)
Ctx-->>App: continue tokenization with mel chunks
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes
Poem
Pre-merge checks and finishing touches❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Comment |
|
@coderabbitai anything obviously wrong here? |
|
Tip For best results, initiate chat on the files or code changes. 🧩 Analysis chain🏁 Script executed: Length of output: 217 🏁 Script executed: Length of output: 44306 @ngxson, I found a critical bug in the mel spectrogram calculation: Line 288 in mtmd-audio.cpp has a logic error: sum = params.use_natural_log ? log(sum + 5.960464477539063e-08) : log10(1e-10);When It should probably be: sum = params.use_natural_log ? log(sum + 5.960464477539063e-08) : log10(std::max(sum, 1e-10));Line 293 also has a potential inconsistency - it uses Additional observations:
🧠 Learnings used |
|
@coderabbitai any potential security risks? (buffer overflow or something) - do a review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 4
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
tools/mtmd/clip-model.h(2 hunks)tools/mtmd/clip.cpp(3 hunks)tools/mtmd/mtmd-audio.cpp(7 hunks)tools/mtmd/mtmd-audio.h(1 hunks)tools/mtmd/mtmd.cpp(2 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.{cpp,h,hpp}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.{cpp,h,hpp}: Always format C++ code usinggit clang-formatbefore committing, following .clang-format configuration (4-space indentation, 120 column limit, braces on same line for functions, pointer/reference alignment in middle)
Minimize external dependencies; avoid adding new external dependencies unless absolutely necessary
Files:
tools/mtmd/clip.cpptools/mtmd/clip-model.htools/mtmd/mtmd.cpptools/mtmd/mtmd-audio.htools/mtmd/mtmd-audio.cpp
**/*.{cpp,h,hpp,py}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Ensure cross-platform compatibility by testing code changes on Linux, macOS, and Windows when possible
Files:
tools/mtmd/clip.cpptools/mtmd/clip-model.htools/mtmd/mtmd.cpptools/mtmd/mtmd-audio.htools/mtmd/mtmd-audio.cpp
🧠 Learnings (2)
📓 Common learnings
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/lookahead/lookahead.cpp:97-98
Timestamp: 2025-03-13T22:56:47.417Z
Learning: ngxson prefers to prioritize simplification in some cases and defer adding error handling to a later time, as indicated by their response to a suggestion about adding error checks for llama_decode_ext calls.
📚 Learning: 2025-05-26T09:45:20.653Z
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 25
File: tools/mtmd/mtmd.cpp:275-293
Timestamp: 2025-05-26T09:45:20.653Z
Learning: In tools/mtmd/clip.cpp, PROJECTOR_TYPE_QWEN25O is a placeholder that gets replaced by either PROJECTOR_TYPE_QWEN25VL (for vision) or PROJECTOR_TYPE_QWEN2A (for audio) before the respective init_vision() or init_audio() functions are called, ensuring proper token handling.
Applied to files:
tools/mtmd/clip.cpptools/mtmd/clip-model.htools/mtmd/mtmd.cpp
🧬 Code graph analysis (2)
tools/mtmd/mtmd-audio.h (2)
tools/mtmd/clip.cpp (4)
ctx(2245-2347)ctx(2245-2245)clip_get_hparams(3429-3431)clip_get_hparams(3429-3429)tools/mtmd/mtmd-audio.cpp (2)
preprocess(438-518)preprocess(438-441)
tools/mtmd/mtmd-audio.cpp (1)
tools/mtmd/mtmd.cpp (2)
GGML_ASSERT(216-311)GGML_ASSERT(313-330)
🪛 GitHub Actions: EditorConfig Checker
tools/mtmd/mtmd-audio.h
[error] 24-24: Trailing whitespace detected by editorconfig-checker.
tools/mtmd/mtmd-audio.cpp
[error] 32-32: Trailing whitespace detected by editorconfig-checker.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (20)
- GitHub Check: build-cmake-pkg / linux
- GitHub Check: build-linux-cross / ubuntu-24-riscv64-cpu-spacemit-ime-cross
- GitHub Check: build-linux-cross / debian-13-loongarch64-cpu-cross
- GitHub Check: build-linux-cross / debian-13-loongarch64-vulkan-cross
- GitHub Check: ubuntu-24-cmake-webgpu
- GitHub Check: windows-latest-cmake-sycl
- GitHub Check: ubuntu-24-cmake-vulkan-deb
- GitHub Check: ubuntu-24-wasm-webgpu
- GitHub Check: windows-latest-cmake-hip
- GitHub Check: ios-xcode-build
- GitHub Check: ubuntu-24-cmake-vulkan
- GitHub Check: ubuntu-cpu-cmake (ppc64le, ubuntu-24.04-ppc64le)
- GitHub Check: macOS-latest-cmake-arm64-webgpu
- GitHub Check: ubuntu-latest-cmake-sanitizer (ADDRESS, Debug)
- GitHub Check: ubuntu-latest-llguidance
- GitHub Check: ubuntu-latest-cmake-sanitizer (UNDEFINED, Debug)
- GitHub Check: ubuntu-latest-cmake-rpc
- GitHub Check: macOS-latest-cmake-arm64
- GitHub Check: macOS-latest-cmake-x64
- GitHub Check: server-windows
🔇 Additional comments (11)
tools/mtmd/mtmd.cpp (2)
648-656: LGTM! Clean integration of the new preprocessor API.The local instantiation of
mtmd_audio_whisper_preprocessorper preprocessing call is appropriate, and the new API is used correctly.
854-859: LGTM! Dynamic sample rate retrieval.Using
clip_get_hparams(ctx->ctx_a)->audio_sample_rateinstead of a hardcoded value is cleaner and more maintainable.tools/mtmd/clip-model.h (2)
68-73: LGTM! Well-structured audio preprocessing parameters.The new fields are appropriately grouped with a clear comment, and default values of
-1properly indicate "unset" state.
287-288: LGTM! Safe accessor declaration.Returning
const clip_hparams *ensures read-only access to internal state.tools/mtmd/clip.cpp (3)
1146-1152: Hardcoded audio preprocessing parameters for whisper-based projectors.These values (30s chunks, 16kHz sample rate, 400-sample FFT/window, 160-sample hop) match standard whisper preprocessing. Consider adding a comment noting these are whisper-compatible defaults.
Please verify these parameters are correct for all three projector types (ULTRAVOX, QWEN2A, VOXTRAL). If any projector requires different values in the future, this code path may need adjustment.
1190-1194: LGTM! Helpful debug logging for audio parameters.The additional logging aids debugging and aligns with the existing logging pattern for other hparams.
3428-3431: LGTM! Simple accessor implementation.The implementation is straightforward and correctly returns a const pointer to internal hparams.
tools/mtmd/mtmd-audio.cpp (4)
470-479: Cache initialization assumes uniform hparams across all calls.The global cache is initialized once with the first caller's
hparamsvalues. Ifpreprocess()is later called with differenthparams(differentaudio_n_fft,audio_window_len, etc.), the cached values will be incorrect for that call.Consider either:
- Validating that subsequent calls use matching hparams, or
- Making the cache per-instance rather than global if varying hparams are expected
495-501: Assertion may fail for very short audio.The assertion
GGML_ASSERT((size_t)out_full.n_len > frames_per_chunk)will abort if the mel spectrogram output is shorter than 3000 frames. While there's padding for short inputs (lines 451-456), this may still fail for edge cases.Consider whether a graceful error return would be more appropriate than an assertion, especially since this processes user-provided audio that could have unexpected lengths.
36-44: LGTM! Precomputed sin/cos table for FFT optimization.Good optimization to avoid repeated trigonometric calculations during FFT.
57-138: LGTM! Runtime mel filterbank generation.The Slaney-scale mel filterbank implementation matches librosa defaults. The triangular filter construction is correct.
|
Tip For best results, initiate chat on the files or code changes. 🧩 Analysis chain🏁 Script executed: Length of output: 44018 @ngxson, I found several security vulnerabilities in addition to the logic bug I mentioned earlier: 🚨 Critical Issues1. Buffer Over-read in Reflective Padding (Line 334)std::reverse_copy(samples + 1, samples + 1 + stage_2_pad, samples_padded.begin());Risk: If Attack scenario: Attacker provides audio shorter than 201 samples → buffer over-read → potential information disclosure or crash. Fix: Add validation before this operation: if (n_samples < stage_2_pad + 1) {
// Handle short audio differently or return error
return false;
}2. Integer Overflow in Buffer Allocation (Line 360)out.data.resize(out.n_mel * out.n_len);Risk: Fix: Add overflow check: if (out.n_mel > 0 && out.n_len > SIZE_MAX / out.n_mel) {
return false; // overflow
}
out.data.resize(size_t(out.n_mel) * size_t(out.n_len));3. Signed Integer Underflow (Line 359)out.n_len = (n_samples - frame_size) / frame_step + 1;Risk: If Fix: Check bounds first: if (n_samples < frame_size) {
return false;
}
out.n_len = (n_samples - frame_size) / frame_step + 1;
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
♻️ Duplicate comments (1)
tools/mtmd/mtmd-audio.cpp (1)
23-137: Global cache is not thread-safe for concurrent initialization.The
g_cachestatic object has itsfill_*methods called frominitialize(), but there's no synchronization. If multiplemtmd_audio_whisper_preprocessorinstances callinitialize()concurrently (or ifpreprocess()is called beforeinitialize()completes), data races can occur.Consider using
std::call_oncefor thread-safe initialization:+#include <mutex> + +static std::once_flag g_cache_init_flag; + static struct mtmd_audio_global_cache { // ... existing members ... } g_cache;Then in
initialize():void mtmd_audio_whisper_preprocessor::initialize() { + std::call_once(g_cache_init_flag, [this]() { g_cache.fill_sin_cos_table(hparams.audio_n_fft); g_cache.fill_hann_window(hparams.audio_window_len, true); g_cache.fill_mel_filterbank_matrix( hparams.n_mel_bins, hparams.audio_n_fft, hparams.audio_sample_rate); + }); }Note: This assumes all instances use identical
hparamsvalues. Based on learnings, this may be deferred if thread-safety isn't an immediate concern.
🧹 Nitpick comments (4)
tools/mtmd/mtmd-audio.h (1)
20-28: Reference lifetime concern:hparamsreference may outlive the sourceclip_ctx.The
hparamsmember is a reference toclip_hparamsobtained fromclip_ctx. If theclip_ctxis freed before the preprocessor, this reference becomes dangling. The current usage inmtmd.cppappears safe sincectx_aandaudio_preprocshare the same lifetime inmtmd_context, but this coupling should be documented.Consider adding a comment documenting this lifetime requirement:
struct mtmd_audio_preprocessor { - const clip_hparams & hparams; + // NOTE: caller must ensure clip_ctx outlives this preprocessor + const clip_hparams & hparams;tools/mtmd/mtmd-audio.cpp (3)
210-224:need_chunkingfield is set but never used.The
filter_params::need_chunkingfield is declared and set inpreprocess()(line 492), but it's never read anywhere in the processing logic.Either remove the unused field or implement the chunking behavior it's meant to control:
struct filter_params { // ... other fields ... - bool need_chunking = true; const mtmd_audio_global_cache & cache; filter_params() : cache(g_cache) {} };
356-361: Potential out-of-bounds write when padding Hann window.If
padding + hann_window_size > frame_size, thestd::copyat line 359 will write past the end ofhann_window_padded. While unlikely with typical parameters, consider adding a bounds check.if (params.hann_window_size < frame_size) { hann_window_padded.resize(frame_size); const int padding = (frame_size - params.hann_window_size) / 2; + GGML_ASSERT(padding + params.hann_window_size <= frame_size); std::copy(hann, hann + params.hann_window_size, &hann_window_padded[padding]); hann = hann_window_padded.data(); }
494-497: Assertions verify cache initialization but don't provide graceful fallback.The
GGML_ASSERTcalls will abort ifinitialize()wasn't called. This is acceptable for catching programming errors but could be more user-friendly.Consider returning
falsewith an error message instead of aborting:// make sure the global cache is initialized - GGML_ASSERT(!g_cache.sin_vals.empty()); - GGML_ASSERT(!g_cache.cos_vals.empty()); - GGML_ASSERT(!g_cache.filters.data.empty()); + if (g_cache.sin_vals.empty() || g_cache.cos_vals.empty() || g_cache.filters.data.empty()) { + LOG_ERR("%s: global cache not initialized, call initialize() first\n", __func__); + return false; + }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
tools/mtmd/mtmd-audio.cpp(7 hunks)tools/mtmd/mtmd-audio.h(1 hunks)tools/mtmd/mtmd.cpp(4 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.{cpp,h,hpp}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.{cpp,h,hpp}: Always format C++ code usinggit clang-formatbefore committing, following .clang-format configuration (4-space indentation, 120 column limit, braces on same line for functions, pointer/reference alignment in middle)
Minimize external dependencies; avoid adding new external dependencies unless absolutely necessary
Files:
tools/mtmd/mtmd.cpptools/mtmd/mtmd-audio.htools/mtmd/mtmd-audio.cpp
**/*.{cpp,h,hpp,py}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Ensure cross-platform compatibility by testing code changes on Linux, macOS, and Windows when possible
Files:
tools/mtmd/mtmd.cpptools/mtmd/mtmd-audio.htools/mtmd/mtmd-audio.cpp
🧠 Learnings (2)
📓 Common learnings
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/lookahead/lookahead.cpp:97-98
Timestamp: 2025-03-13T22:56:47.417Z
Learning: ngxson prefers to prioritize simplification in some cases and defer adding error handling to a later time, as indicated by their response to a suggestion about adding error checks for llama_decode_ext calls.
📚 Learning: 2025-05-26T09:45:20.653Z
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 25
File: tools/mtmd/mtmd.cpp:275-293
Timestamp: 2025-05-26T09:45:20.653Z
Learning: In tools/mtmd/clip.cpp, PROJECTOR_TYPE_QWEN25O is a placeholder that gets replaced by either PROJECTOR_TYPE_QWEN25VL (for vision) or PROJECTOR_TYPE_QWEN2A (for audio) before the respective init_vision() or init_audio() functions are called, ensuring proper token handling.
Applied to files:
tools/mtmd/mtmd.cpp
🧬 Code graph analysis (1)
tools/mtmd/mtmd-audio.h (2)
tools/mtmd/clip.cpp (4)
ctx(2245-2347)ctx(2245-2245)clip_get_hparams(3429-3431)clip_get_hparams(3429-3429)tools/mtmd/mtmd-audio.cpp (4)
initialize(453-460)initialize(453-453)preprocess(462-536)preprocess(462-465)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (19)
- GitHub Check: ggml-ci-x64-cpu-high-perf
- GitHub Check: build-cmake-pkg / linux
- GitHub Check: ubuntu-22-cmake-musa
- GitHub Check: windows-msys2 (UCRT64, ucrt-x86_64, Release)
- GitHub Check: windows-2022-cmake-cuda (12.4)
- GitHub Check: ggml-ci-x64-cpu-low-perf
- GitHub Check: ubuntu-22-cmake-sycl
- GitHub Check: macOS-latest-cmake-visionos
- GitHub Check: windows-latest-cmake-hip
- GitHub Check: ubuntu-22-cmake-hip
- GitHub Check: ubuntu-24-cmake-vulkan-deb
- GitHub Check: windows-latest-cmake-sycl
- GitHub Check: ubuntu-latest-cmake-sanitizer (THREAD, Debug)
- GitHub Check: ubuntu-latest-cmake-sanitizer (UNDEFINED, Debug)
- GitHub Check: macOS-latest-cmake-arm64-webgpu
- GitHub Check: macOS-latest-cmake-x64
- GitHub Check: macOS-latest-cmake-arm64
- GitHub Check: ubuntu-latest-llguidance
- GitHub Check: server-windows
🔇 Additional comments (11)
tools/mtmd/mtmd.cpp (4)
154-154: LGTM!The
std::unique_ptrfor the audio preprocessor is appropriate for managing the polymorphic preprocessor lifetime.
322-334: LGTM!The switch-based initialization correctly maps projector types to the appropriate preprocessor implementation. The
GGML_ABORTfor unsupported types provides a clear failure mode during development.
665-668: LGTM!The updated call to
audio_preproc->preprocesscorrectly passes samples and receives the mel spectrogram chunks via the output vector.
870-875: No null-safety issue here.clip_get_hparamsreturns&ctx->model.hparamsand cannot return null—it dereferences the context pointer directly to obtain the address of the hparams member. Sincectx->ctx_ais already validated on line 871, the code is safe.Likely an incorrect or invalid review comment.
tools/mtmd/mtmd-audio.h (1)
30-34: LGTM!The derived class properly inherits from the base and forwards the context to the parent constructor.
tools/mtmd/mtmd-audio.cpp (6)
14-15: LGTM!Debug flag is appropriately set to
falsefor production code.
334-338: Input validation for short audio prevents buffer over-read.The check at line 334 correctly guards against buffer over-read when
n_samples < stage_2_pad + 1. Returningfalseis appropriate.
364-375: Good defensive checks for overflow and underflow.The overflow check at lines 367-370 and the
n_samples < frame_sizecheck at lines 371-374 properly guard against integer overflow in allocation and signed underflow in length calculation.
471-480: Short audio padding is a reasonable safeguard.Padding input shorter than
2 * audio_n_fftwith zeros prevents edge cases in the mel spectrogram computation. The TODO comment acknowledges this could be improved.
514-514: This assertion will not fail under normal operation.The code includes multiple safeguards ensuring
out_full.n_len > 3000:
- Pre-padding (line 475-481) ensures minimum 800 samples
stage_1_pad = 480000samples (30 seconds at 16kHz) is always applied whencenter_padding = false(explicitly set at line 488)- With hop_length = 160 and frame_size = 400, even the minimum input of 800 samples yields
out.n_len = (800 + 480400 - 400) / 160 + 1 ≈ 3006framesThe mathematical formula
out.n_len = (n_samples - 400) / 160 + 1with the guaranteed 30-second padding ensures the assertionn_len > 3000will always pass.Likely an incorrect or invalid review comment.
453-460: Initialization fills global cache with instance-specific parameters.The
initialize()method populates the sharedg_cacheusing this instance'shparams. If different preprocessor instances have differenthparamsvalues, the last one to initialize wins, potentially causing incorrect processing for earlier instances.Verify that all
mtmd_audio_whisper_preprocessorinstances will use identical audio parameters:#!/bin/bash # Check if different projector types could have different audio hparams rg -n "audio_n_fft|audio_window_len|audio_hop_len|audio_sample_rate|n_mel_bins" --type cpp -C 2
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🧹 Nitpick comments (4)
tools/mtmd/mtmd-audio.cpp (4)
220-220: Unused fieldneed_chunking.This field is declared and assigned in
preprocess()at line 492, but never read anywhere in the codebase. Consider removing it to avoid confusion.- bool need_chunking = true;
279-280: Inconsistent index type safety.Line 271 correctly uses
size_tcasting for the index calculation, but line 280 uses plainintmultiplication which could overflow for large values.// handle n_fft remainder for (; k < n_fft_bins; k++) { - sum += fft_out[k] * filters.data[j * n_fft_bins + k]; + sum += fft_out[k] * filters.data[size_t(j) * size_t(n_fft_bins) + size_t(k)]; }
341-351: Preemphasis coefficient is hardcoded despite being a parameter.
params.preemphis used as a boolean toggle, but the actual coefficient is hardcoded as0.97f. Ifpreemphis meant to be configurable, consider using its value directly:// preemphasis - if (params.preemph) { + if (params.preemph != 0.0f) { const int pad_amount = frame_size / 2; - const float preemph = 0.97f; float prev = samples_padded[pad_amount]; for (int i = pad_amount + 1; i + pad_amount < n_samples; ++i) { float cur = samples_padded[i]; - samples_padded[i] = cur - preemph * prev; + samples_padded[i] = cur - params.preemph * prev; prev = cur; } }Alternatively, if 0.97 is always intended, change
preemphto abooltype for clarity.
503-503: Consider making thread count configurable.The thread count is hardcoded to 4. This may not be optimal for all systems. Consider passing it as a parameter or deriving it from hardware concurrency.
+ const int n_threads = std::min(4, (int)std::thread::hardware_concurrency()); bool ok = log_mel_spectrogram( samples, n_samples, - 4, // n_threads + n_threads, params, out_full);
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
tools/mtmd/mtmd-audio.cpp(7 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.{cpp,h,hpp}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.{cpp,h,hpp}: Always format C++ code usinggit clang-formatbefore committing, following .clang-format configuration (4-space indentation, 120 column limit, braces on same line for functions, pointer/reference alignment in middle)
Minimize external dependencies; avoid adding new external dependencies unless absolutely necessary
Files:
tools/mtmd/mtmd-audio.cpp
**/*.{cpp,h,hpp,py}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Ensure cross-platform compatibility by testing code changes on Linux, macOS, and Windows when possible
Files:
tools/mtmd/mtmd-audio.cpp
🧠 Learnings (1)
📓 Common learnings
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/lookahead/lookahead.cpp:97-98
Timestamp: 2025-03-13T22:56:47.417Z
Learning: ngxson prefers to prioritize simplification in some cases and defer adding error handling to a later time, as indicated by their response to a suggestion about adding error checks for llama_decode_ext calls.
🧬 Code graph analysis (1)
tools/mtmd/mtmd-audio.cpp (1)
tools/mtmd/mtmd.cpp (2)
GGML_ASSERT(218-313)GGML_ASSERT(315-347)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (19)
- GitHub Check: windows-latest-cmake (llvm-arm64, arm64, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/ar...
- GitHub Check: ggml-ci-arm64-cpu-high-perf-sve
- GitHub Check: ggml-ci-arm64-cpu-kleidiai
- GitHub Check: ggml-ci-arm64-cpu-low-perf
- GitHub Check: ubuntu-latest-cmake-cuda
- GitHub Check: windows-2022-cmake-cuda (12.4)
- GitHub Check: ubuntu-24-cmake-vulkan
- GitHub Check: ubuntu-cpu-cmake (arm64, ubuntu-22.04-arm)
- GitHub Check: ubuntu-24-wasm-webgpu
- GitHub Check: ubuntu-latest-cmake-sanitizer (UNDEFINED, Debug)
- GitHub Check: macOS-latest-cmake-x64
- GitHub Check: ubuntu-latest-cmake-sanitizer (ADDRESS, Debug)
- GitHub Check: ubuntu-24-cmake-vulkan-deb
- GitHub Check: macOS-latest-cmake-arm64
- GitHub Check: ubuntu-latest-cmake-sanitizer (THREAD, Debug)
- GitHub Check: ubuntu-latest-llguidance
- GitHub Check: ubuntu-latest-cmake-rpc
- GitHub Check: ubuntu-24-cmake-webgpu
- GitHub Check: server-windows
| const int effective_n_len = n_samples_in / frame_step; | ||
| if (params.norm_per_feature) { | ||
| for (int i = 0; i < out.n_mel; i++) { | ||
| double mean = 0; | ||
| for (int j = 0; j < effective_n_len; ++j) { | ||
| mean += out.data[i * out.n_len + j]; | ||
| } | ||
| mean /= effective_n_len; | ||
|
|
||
| mmax -= 8.0; | ||
| double var = 0.0; | ||
| for (int j = 0; j < effective_n_len; ++j) { | ||
| const double value = out.data[i * out.n_len + j] - mean; | ||
| var += value * value; | ||
| } | ||
| var /= effective_n_len - 1; // unbiased | ||
| const double mstd = std::sqrt(var + 1e-5); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Division by zero when effective_n_len is 1.
If n_samples_in < frame_step * 2, then effective_n_len will be 1, causing a division by zero at line 407 (var /= effective_n_len - 1).
const int effective_n_len = n_samples_in / frame_step;
+ if (effective_n_len < 2) {
+ LOG_ERR("%s: not enough frames for normalization (effective_n_len=%d)\n", __func__, effective_n_len);
+ return false;
+ }
if (params.norm_per_feature) {🤖 Prompt for AI Agents
In tools/mtmd/mtmd-audio.cpp around lines 393-408, guard against effective_n_len
being <= 1 to avoid division-by-zero: check effective_n_len before computing
mean/variance; if effective_n_len == 0 skip normalization for that feature (or
continue) and if effective_n_len == 1 set variance to 0 (or unbiased var to 0)
so mstd = sqrt(0 + 1e-5) and proceed without dividing by (effective_n_len - 1).
Adjust the mean computation to avoid dividing by zero and ensure you still
produce a stable mstd for the subsequent normalization step.
| mtmd_audio_mel out_chunk; | ||
| out_chunk.n_len = n_len; | ||
| out_chunk.n_mel = out_full.n_mel; | ||
| out_chunk.n_len_org = out_full.n_mel; // unused |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suspicious assignment: n_len_org is set to n_mel instead of a length value.
Line 524 assigns out_full.n_mel to out_chunk.n_len_org, which seems incorrect given the field name suggests it should be a length. The comment says "unused", but this could cause confusion or bugs if the field is used in the future.
out_chunk.n_len = n_len;
out_chunk.n_mel = out_full.n_mel;
- out_chunk.n_len_org = out_full.n_mel; // unused
+ out_chunk.n_len_org = out_full.n_len_org; // original length before padding
out_chunk.data.reserve(out_chunk.n_mel * out_chunk.n_len);🤖 Prompt for AI Agents
In tools/mtmd/mtmd-audio.cpp around lines 521 to 524, the field
out_chunk.n_len_org is incorrectly set to out_full.n_mel; change it to a proper
length value (e.g., out_chunk.n_len_org = n_len) or explicitly zero it if truly
unused, and update the comment to explain why it’s set that way to avoid
confusion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
♻️ Duplicate comments (2)
tools/mtmd/mtmd-audio.cpp (2)
393-408: Division by zero wheneffective_n_lenis 1.If
n_samples_in < frame_step * 2, theneffective_n_lenwill be 1, causing division by zero at line 407.const int effective_n_len = n_samples_in / frame_step; + if (effective_n_len < 2) { + LOG_ERR("%s: not enough frames for normalization (effective_n_len=%d)\n", __func__, effective_n_len); + return false; + } if (params.norm_per_feature) {
524-527: Suspicious assignment:n_len_orgset ton_mel.Line 527 assigns
out_full.n_meltoout_chunk.n_len_org, which seems semantically incorrect given the field name. Even if currently unused, this could cause bugs if the field is used later.out_chunk.n_len = n_len; out_chunk.n_mel = out_full.n_mel; - out_chunk.n_len_org = out_full.n_mel; // unused + out_chunk.n_len_org = out_full.n_len_org; // original length before padding
🧹 Nitpick comments (3)
tools/mtmd/mtmd-audio.cpp (3)
14-14: Consider usingconstexprfor the debug flag.Using
constexprinstead ofconstfor a compile-time constant boolean allows the compiler to eliminate debug code paths entirely when disabled.-constexpr bool DEBUG = false; +static constexpr bool DEBUG = false;
220-220: Unused fieldneed_chunking.The
need_chunkingfield is assigned at line 493 but never read. Either use it to control chunking behavior or remove it to avoid confusion.
501-506: Hardcoded thread count.The
n_threadsparameter is hardcoded to 4 at line 504, ignoring the configurablen_threadsfrommtmd_context_params. Consider passing the thread count through the preprocessor interface.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
tools/mtmd/mtmd-audio.cpp(7 hunks)tools/mtmd/mtmd.cpp(4 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.{cpp,h,hpp}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.{cpp,h,hpp}: Always format C++ code usinggit clang-formatbefore committing, following .clang-format configuration (4-space indentation, 120 column limit, braces on same line for functions, pointer/reference alignment in middle)
Minimize external dependencies; avoid adding new external dependencies unless absolutely necessary
Files:
tools/mtmd/mtmd-audio.cpptools/mtmd/mtmd.cpp
**/*.{cpp,h,hpp,py}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Ensure cross-platform compatibility by testing code changes on Linux, macOS, and Windows when possible
Files:
tools/mtmd/mtmd-audio.cpptools/mtmd/mtmd.cpp
🧠 Learnings (2)
📓 Common learnings
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/lookahead/lookahead.cpp:97-98
Timestamp: 2025-03-13T22:56:47.417Z
Learning: ngxson prefers to prioritize simplification in some cases and defer adding error handling to a later time, as indicated by their response to a suggestion about adding error checks for llama_decode_ext calls.
📚 Learning: 2025-05-26T09:45:20.653Z
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 25
File: tools/mtmd/mtmd.cpp:275-293
Timestamp: 2025-05-26T09:45:20.653Z
Learning: In tools/mtmd/clip.cpp, PROJECTOR_TYPE_QWEN25O is a placeholder that gets replaced by either PROJECTOR_TYPE_QWEN25VL (for vision) or PROJECTOR_TYPE_QWEN2A (for audio) before the respective init_vision() or init_audio() functions are called, ensuring proper token handling.
Applied to files:
tools/mtmd/mtmd.cpp
🧬 Code graph analysis (1)
tools/mtmd/mtmd-audio.cpp (1)
tools/mtmd/mtmd.cpp (2)
GGML_ASSERT(218-313)GGML_ASSERT(315-348)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (19)
- GitHub Check: android-ndk-build (arm64-cpu, -D ANDROID_ABI=arm64-v8a -D ANDROID_PLATFORM=android-31 -D CMAKE_TO...
- GitHub Check: android-build
- GitHub Check: ubuntu-22-cmake-sycl
- GitHub Check: windows-latest-cmake-sycl
- GitHub Check: ggml-ci-arm64-cpu-kleidiai
- GitHub Check: ubuntu-latest-cmake-cuda
- GitHub Check: windows-latest-cmake-hip
- GitHub Check: ios-xcode-build
- GitHub Check: ubuntu-cpu-cmake (x64, ubuntu-22.04)
- GitHub Check: ubuntu-latest-cmake-sanitizer (ADDRESS, Debug)
- GitHub Check: ubuntu-cpu-cmake (arm64, ubuntu-22.04-arm)
- GitHub Check: ubuntu-latest-cmake-sanitizer (UNDEFINED, Debug)
- GitHub Check: ubuntu-latest-cmake-sanitizer (THREAD, Debug)
- GitHub Check: ubuntu-latest-cmake-rpc
- GitHub Check: ubuntu-latest-llguidance
- GitHub Check: macOS-latest-cmake-arm64
- GitHub Check: macOS-latest-cmake-x64
- GitHub Check: macOS-latest-cmake-arm64-webgpu
- GitHub Check: server-windows
🔇 Additional comments (10)
tools/mtmd/mtmd.cpp (4)
154-154: LGTM!The addition of
std::unique_ptr<mtmd_audio_preprocessor>is appropriate for managing the audio preprocessor lifetime within the context.
322-336: LGTM!The switch statement correctly initializes the
mtmd_audio_whisper_preprocessorfor the supported audio projector types and aborts for unsupported types. Theinitialize()call ensures the global cache is populated before use. Based on learnings, PROJECTOR_TYPE_QWEN25O is a placeholder that gets replaced appropriately.
666-669: LGTM!The type change to
mtmd_audio_meland routing throughaudio_preproc->preprocess()aligns with the new preprocessing API.
875-875: LGTM!Retrieving the sample rate from
clip_get_hparamsinstead of hardcoding 16000 is the correct approach for supporting different audio models.tools/mtmd/mtmd-audio.cpp (6)
23-137: Global cache assumes single configuration.The global
g_cacheis populated byinitialize()based on the first audio context's hparams. If multiplemtmd_contextinstances are created with different audio configurations (differentn_fft,sample_rate, etc.), the cache will be incorrect for subsequent contexts.This design is acceptable if all audio models share identical preprocessing parameters, but document this assumption or consider making the cache per-preprocessor instance.
282-285: LGTM!The logarithm computation correctly handles the
use_natural_logflag, using different epsilon handling for each case. Thestd::max(sum, 1e-10)prevents log of zero when not using natural log.
334-339: Good: Short audio check prevents buffer over-read.The check at line 334 prevents the
reverse_copyfrom reading out of bounds when audio is too short. The TODO comment suggests this could be handled more gracefully in the future.
367-374: Good: Overflow and bounds checks added.The SIZE_MAX check at line 367 prevents integer overflow in allocation, and the frame_size check at line 371 prevents underflow in the n_len calculation.
453-460: LGTM!The
initialize()method correctly populates the global cache with the required FFT tables, Hann window, and mel filterbank matrix.
496-498: Good: Assertions verify cache initialization.These assertions provide a safety check ensuring
initialize()was called beforepreprocess().
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
♻️ Duplicate comments (2)
tools/mtmd/mtmd-audio.cpp (2)
389-415: Division by zero wheneffective_n_len≤ 1.If
n_samples_in < frame_step * 2, theneffective_n_lenwill be 0 or 1, causing division by zero at line 403 (var /= effective_n_len - 1) and at line 396 (mean /= effective_n_len).const int effective_n_len = n_samples_in / frame_step; + if (effective_n_len < 2) { + LOG_ERR("%s: not enough frames for normalization (effective_n_len=%d)\n", __func__, effective_n_len); + return false; + } if (params.norm_per_feature) {
524-527: Suspicious assignment:n_len_orgis set ton_mel.Line 527 assigns
out_full.n_meltoout_chunk.n_len_org, which semantically doesn't match the field name (length vs mel bins). Even if unused now, this could cause confusion or bugs later.out_chunk.n_len = n_len; out_chunk.n_mel = out_full.n_mel; - out_chunk.n_len_org = out_full.n_mel; // unused + out_chunk.n_len_org = 0; // not applicable for chunks out_chunk.data.reserve(out_chunk.n_mel * out_chunk.n_len);
🧹 Nitpick comments (3)
tools/mtmd/mtmd-audio.cpp (3)
23-26: Global cache sharing may cause issues with concurrent context initialization.The comment acknowledges this limitation. If multiple
mtmd_contextinstances are created concurrently with different hparams (e.g., differentaudio_n_fft), the cache will be overwritten by the last initializer, causing incorrect results for other contexts. Consider documenting this constraint more prominently or usingstd::call_onceif same-hparams assumption holds.
213-224: Unused field:need_chunking.The field
need_chunkingis assigned (line 493) but never read. Consider removing it or implementing the intended functionality.struct filter_params { int32_t n_mel; int32_t n_fft_bins; int32_t hann_window_size; int32_t hop_length; int32_t sample_rate; bool center_padding = false; float preemph = 0.f; bool use_natural_log = false; bool norm_per_feature = false; - bool need_chunking = true; };
337-347: Preemphasis logic inconsistency.The condition
if (params.preemph)uses the float as a boolean, but line 340 ignoresparams.preemphand uses a hardcoded0.97f. Sinceparams.preemphis set to0.0finpreprocess(), this block is dead code. Either remove it or fix the logic:- if (params.preemph) { + if (params.preemph != 0.0f) { const int pad_amount = frame_size / 2; - const float preemph = 0.97f; + const float preemph = params.preemph; float prev = samples_padded[pad_amount];
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
tools/mtmd/mtmd-audio.cpp(6 hunks)tools/mtmd/mtmd-audio.h(1 hunks)tools/mtmd/mtmd.cpp(4 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.{cpp,h,hpp}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.{cpp,h,hpp}: Always format C++ code usinggit clang-formatbefore committing, following .clang-format configuration (4-space indentation, 120 column limit, braces on same line for functions, pointer/reference alignment in middle)
Minimize external dependencies; avoid adding new external dependencies unless absolutely necessary
Files:
tools/mtmd/mtmd.cpptools/mtmd/mtmd-audio.htools/mtmd/mtmd-audio.cpp
**/*.{cpp,h,hpp,py}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Ensure cross-platform compatibility by testing code changes on Linux, macOS, and Windows when possible
Files:
tools/mtmd/mtmd.cpptools/mtmd/mtmd-audio.htools/mtmd/mtmd-audio.cpp
🧠 Learnings (2)
📓 Common learnings
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/lookahead/lookahead.cpp:97-98
Timestamp: 2025-03-13T22:56:47.417Z
Learning: ngxson prefers to prioritize simplification in some cases and defer adding error handling to a later time, as indicated by their response to a suggestion about adding error checks for llama_decode_ext calls.
📚 Learning: 2025-05-26T09:45:20.653Z
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 25
File: tools/mtmd/mtmd.cpp:275-293
Timestamp: 2025-05-26T09:45:20.653Z
Learning: In tools/mtmd/clip.cpp, PROJECTOR_TYPE_QWEN25O is a placeholder that gets replaced by either PROJECTOR_TYPE_QWEN25VL (for vision) or PROJECTOR_TYPE_QWEN2A (for audio) before the respective init_vision() or init_audio() functions are called, ensuring proper token handling.
Applied to files:
tools/mtmd/mtmd.cpp
🧬 Code graph analysis (1)
tools/mtmd/mtmd-audio.h (2)
tools/mtmd/clip-impl.h (1)
vector(361-372)tools/mtmd/mtmd-audio.cpp (4)
initialize(453-460)initialize(453-453)preprocess(462-539)preprocess(462-465)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (17)
- GitHub Check: WebUI Check
- GitHub Check: ggml-ci-arm64-cpu-kleidiai
- GitHub Check: windows-latest-cmake-sycl
- GitHub Check: ggml-ci-arm64-cpu-high-perf
- GitHub Check: ubuntu-cpu-cmake (arm64, ubuntu-22.04-arm)
- GitHub Check: ubuntu-latest-cmake-sanitizer (THREAD, Debug)
- GitHub Check: macOS-latest-cmake-arm64-webgpu
- GitHub Check: ubuntu-cpu-cmake (x64, ubuntu-22.04)
- GitHub Check: ubuntu-latest-cmake-rpc
- GitHub Check: macOS-latest-cmake-x64
- GitHub Check: ubuntu-24-cmake-vulkan
- GitHub Check: ubuntu-latest-cmake-sanitizer (UNDEFINED, Debug)
- GitHub Check: macOS-latest-cmake-arm64
- GitHub Check: ubuntu-latest-cmake-sanitizer (ADDRESS, Debug)
- GitHub Check: ubuntu-24-cmake-vulkan-deb
- GitHub Check: ubuntu-latest-llguidance
- GitHub Check: server-windows
🔇 Additional comments (9)
tools/mtmd/mtmd.cpp (4)
154-154: LGTM!Good use of
std::unique_ptrfor automatic lifetime management of the audio preprocessor.
322-336: LGTM!The switch-based preprocessor construction correctly handles the supported projector types and aborts on unsupported types. Per retrieved learnings,
PROJECTOR_TYPE_QWEN25Ois a placeholder that gets replaced beforeinit_audio()is called, so including it here is correct.
666-669: LGTM!The type change from
whisper_meltomtmd_audio_meland routing throughaudio_preproc->preprocess()correctly integrates with the new preprocessing API.
875-875: LGTM!Good improvement—using
clip_get_hparams(ctx->ctx_a)->audio_sample_rateinstead of hardcoded16000allows the bitrate to adapt to model-specific sample rates.tools/mtmd/mtmd-audio.cpp (3)
330-333: Good: Short audio validation added.The check at lines 330-333 prevents buffer over-read in reflective padding when audio is too short. This addresses the security concern raised in the PR objectives.
363-370: Good: Overflow and underflow checks added.These validations prevent size overflow in allocation and handle the case where
n_samples < frame_size.
453-460: LGTM!The
initialize()implementation correctly populates the global cache with sin/cos tables, Hann window, and mel filterbank.tools/mtmd/mtmd-audio.h (2)
20-28: LGTM!The base class now has a virtual destructor (addressing the past review comment), and the polymorphic interface is well-designed. The
hparamsreference requires theclip_ctxto outlive the preprocessor, which is ensured by the ownership structure inmtmd_context.
30-34: LGTM!The derived class correctly inherits from the base and overrides the required virtual methods.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
tools/mtmd/mtmd-audio.cpp(6 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.{cpp,h,hpp}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.{cpp,h,hpp}: Always format C++ code usinggit clang-formatbefore committing, following .clang-format configuration (4-space indentation, 120 column limit, braces on same line for functions, pointer/reference alignment in middle)
Minimize external dependencies; avoid adding new external dependencies unless absolutely necessary
Files:
tools/mtmd/mtmd-audio.cpp
**/*.{cpp,h,hpp,py}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Ensure cross-platform compatibility by testing code changes on Linux, macOS, and Windows when possible
Files:
tools/mtmd/mtmd-audio.cpp
🧠 Learnings (1)
📓 Common learnings
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/lookahead/lookahead.cpp:97-98
Timestamp: 2025-03-13T22:56:47.417Z
Learning: ngxson prefers to prioritize simplification in some cases and defer adding error handling to a later time, as indicated by their response to a suggestion about adding error checks for llama_decode_ext calls.
🧬 Code graph analysis (1)
tools/mtmd/mtmd-audio.cpp (1)
tools/mtmd/mtmd.cpp (2)
GGML_ASSERT(218-313)GGML_ASSERT(315-348)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (20)
- GitHub Check: build-linux-cross / debian-13-loongarch64-cpu-cross
- GitHub Check: build-linux-cross / debian-13-loongarch64-vulkan-cross
- GitHub Check: ubuntu-22-cmake-sycl
- GitHub Check: ggml-ci-arm64-cpu-kleidiai
- GitHub Check: ubuntu-24-wasm-webgpu
- GitHub Check: ggml-ci-arm64-cpu-high-perf-sve
- GitHub Check: ubuntu-24-cmake-webgpu
- GitHub Check: ggml-ci-arm64-cpu-high-perf
- GitHub Check: ggml-ci-arm64-cpu-low-perf
- GitHub Check: ubuntu-22-cmake-sycl-fp16
- GitHub Check: ubuntu-24-cmake-vulkan
- GitHub Check: ubuntu-22-cmake-musa
- GitHub Check: ggml-ci-x64-cpu-high-perf
- GitHub Check: ubuntu-24-cmake-vulkan-deb
- GitHub Check: ubuntu-cpu-cmake (arm64, ubuntu-22.04-arm)
- GitHub Check: ubuntu-cpu-cmake (x64, ubuntu-22.04)
- GitHub Check: macOS-latest-cmake-arm64
- GitHub Check: macOS-latest-cmake-arm64-webgpu
- GitHub Check: macOS-latest-cmake-x64
- GitHub Check: server-windows
🔇 Additional comments (5)
tools/mtmd/mtmd-audio.cpp (5)
287-287: LGTM: Log inconsistency fixed.The zero-fill case now correctly respects the
use_natural_logparameter, fixing the inconsistency flagged in previous reviews.
452-459: LGTM: Clean initialization of global cache.The
initialize()method properly populates the global cache with FFT tables, Hann window, and mel filterbank matrix. This explicit initialization step is good practice before preprocessing.
470-480: Good defensive handling for short audio.The code pads short audio inputs with zeros to meet minimum length requirements, preventing potential issues with the stage1/2 padding in
log_mel_spectrogram. This addresses some of the buffer safety concerns.
348-356: Validate Hann window padding to prevent buffer overflow.The code copies
hann_window_sizeelements starting at offsetpaddingintohann_window_paddedof sizeframe_size. Ifpadding + hann_window_size > frame_size, this will write out of bounds.Add a bounds check:
// pad hann window if it's smaller than frame_size // TODO: probably unnecessary here? (or better doing it in g_cache?) std::vector<float> hann_window_padded; if (params.hann_window_size < frame_size) { hann_window_padded.resize(frame_size); const int padding = (frame_size - params.hann_window_size) / 2; + if (padding + params.hann_window_size > frame_size) { + LOG_ERR("%s: hann window padding overflow\n", __func__); + return false; + } std::copy(hann, hann + params.hann_window_size, &hann_window_padded[padding]); hann = hann_window_padded.data(); }Based on PR objectives security analysis.
Likely an incorrect or invalid review comment.
329-333: Fix buffer over-read condition in reflective padding.The check at line 329 uses
n_samples < stage_2_pad + 1, but whenn_samples == stage_2_pad, the expressionsamples + 1 + stage_2_padat line 333 will read beyond the array bounds.Tighten the bounds check:
// reflective pad 200 samples at the beginning of audio - if (n_samples < stage_2_pad + 1) { + if (n_samples <= stage_2_pad) { // TODO: Handle short audio differently or return error return false; } std::reverse_copy(samples + 1, samples + 1 + stage_2_pad, samples_padded.begin());Based on PR objectives security analysis.
Likely an incorrect or invalid review comment.
| void fill_mel_filterbank_matrix( | ||
| int n_mel, | ||
| int n_fft, | ||
| int sample_rate, // e.g. 16000 | ||
| float fmin = 0.0f, // e.g. 0.0 | ||
| float fmax = -1.0f, // e.g. sr/2; pass -1 for auto | ||
| bool slaney_area_norm = true, | ||
| float scale = 1.0f // optional extra scaling; use 1.0f/1000.0f to mimic your code | ||
| ) { | ||
| GGML_ASSERT(n_mel > 0 && n_fft > 1); | ||
| if (fmax <= 0.0f) { | ||
| fmax = 0.5f * sample_rate; | ||
| } | ||
|
|
||
| // Slaney scale (matches librosa default) | ||
| const double min_log_hz = 1000.0; | ||
| const double lin_slope = 3 / 200.; | ||
| const double min_log_mel = min_log_hz * lin_slope; | ||
| const double log_step = log(6.4) / 27.0; | ||
| auto hz_to_mel = [min_log_hz, lin_slope, log_step, min_log_mel](const double f_hz) -> double { | ||
| return (f_hz < min_log_hz) ? f_hz * lin_slope : min_log_mel + log(f_hz / min_log_hz) / log_step; | ||
| }; | ||
| auto mel_to_hz = [min_log_hz, lin_slope, log_step, min_log_mel](const double m) -> double { | ||
| return (m < min_log_mel) ? m / lin_slope : min_log_hz * exp((m - min_log_mel) * log_step); | ||
| }; | ||
|
|
||
| // infer N_fft from n_fft_bins | ||
| const double bin_hz_step = double(sample_rate) / double(n_fft); | ||
|
|
||
| // mel grid: n_mel + 2 edges | ||
| const double m_lo = hz_to_mel(fmin); | ||
| const double m_hi = hz_to_mel(fmax); | ||
| std::vector<double> mel_pts(n_mel + 2); | ||
| for (int i = 0; i < n_mel + 2; ++i) { | ||
| mel_pts[i] = m_lo + (m_hi - m_lo) * (double(i) / (n_mel + 1)); | ||
| } | ||
|
|
||
| // convert to Hz | ||
| std::vector<double> hz_pts(n_mel + 2); | ||
| for (int i = 0; i < n_mel + 2; ++i) { | ||
| hz_pts[i] = mel_to_hz(mel_pts[i]); | ||
| } | ||
|
|
||
| const int n_fft_bins = n_fft / 2 + 1; | ||
|
|
||
| // filterbank | ||
| std::vector<float> out(n_mel * n_fft_bins, 0); | ||
| for (int m = 0; m < n_mel; ++m) { | ||
| const double f_left = hz_pts[m]; | ||
| const double f_center = hz_pts[m + 1]; | ||
| const double f_right = hz_pts[m + 2]; | ||
|
|
||
| const double denom_l = std::max(1e-30, f_center - f_left); | ||
| const double denom_r = std::max(1e-30, f_right - f_center); | ||
| const double enorm = slaney_area_norm ? (2.0 / std::max(1e-30, f_right - f_left)) : 1.0; | ||
|
|
||
| for (int k = 0; k < n_fft_bins; ++k) { | ||
| const double f = k * bin_hz_step; | ||
| double w = 0.0; | ||
| if (f >= f_left && f <= f_center) { | ||
| w = (f - f_left) / denom_l; | ||
| } else if (f > f_center && f <= f_right) { | ||
| w = (f_right - f) / denom_r; | ||
| } | ||
| out[size_t(m) * size_t(n_fft_bins) + size_t(k)] = float(w * enorm * scale); | ||
| } | ||
| } | ||
|
|
||
| filters.n_mel = n_mel; | ||
| filters.n_fft = n_fft; | ||
| filters.data = std::move(out); | ||
|
|
||
| if (DEBUG) { // debug | ||
| for (size_t i = 0; i < filters.data.size(); ++i) { | ||
| if (filters.data[i] != 0.0f) { | ||
| printf("filters[%zu] = %f\n", i, filters.data[i] * 1000.0f); | ||
| } | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add input validation to prevent overflow and invalid parameters.
The fill_mel_filterbank_matrix function lacks validation for critical parameters. According to the PR objectives, this creates several risks:
- Integer overflow at line 106:
n_mel * n_fft_binsmultiplication should be checked before resize - No validation that
n_mel,n_fft, andsample_rateare positive and within reasonable bounds - No check that
n_fftis even (required forn_fft_bins = n_fft / 2 + 1to be correct)
Add validation at the function start:
void fill_mel_filterbank_matrix(
int n_mel,
int n_fft,
int sample_rate, // e.g. 16000
float fmin = 0.0f, // e.g. 0.0
float fmax = -1.0f, // e.g. sr/2; pass -1 for auto
bool slaney_area_norm = true,
float scale = 1.0f // optional extra scaling; use 1.0f/1000.0f to mimic your code
) {
- GGML_ASSERT(n_mel > 0 && n_fft > 1);
+ GGML_ASSERT(n_mel > 0 && n_mel <= 512); // reasonable upper bound
+ GGML_ASSERT(n_fft > 1 && n_fft <= 65536 && n_fft % 2 == 0); // must be even, reasonable upper bound
+ GGML_ASSERT(sample_rate > 0 && sample_rate <= 1000000);
if (fmax <= 0.0f) {
fmax = 0.5f * sample_rate;
}
// Slaney scale (matches librosa default)
const double min_log_hz = 1000.0;
const double lin_slope = 3 / 200.;
const double min_log_mel = min_log_hz * lin_slope;
const double log_step = log(6.4) / 27.0;
auto hz_to_mel = [min_log_hz, lin_slope, log_step, min_log_mel](const double f_hz) -> double {
return (f_hz < min_log_hz) ? f_hz * lin_slope : min_log_mel + log(f_hz / min_log_hz) / log_step;
};
auto mel_to_hz = [min_log_hz, lin_slope, log_step, min_log_mel](const double m) -> double {
return (m < min_log_mel) ? m / lin_slope : min_log_hz * exp((m - min_log_mel) * log_step);
};
// infer N_fft from n_fft_bins
const double bin_hz_step = double(sample_rate) / double(n_fft);
// mel grid: n_mel + 2 edges
const double m_lo = hz_to_mel(fmin);
const double m_hi = hz_to_mel(fmax);
std::vector<double> mel_pts(n_mel + 2);
for (int i = 0; i < n_mel + 2; ++i) {
mel_pts[i] = m_lo + (m_hi - m_lo) * (double(i) / (n_mel + 1));
}
// convert to Hz
std::vector<double> hz_pts(n_mel + 2);
for (int i = 0; i < n_mel + 2; ++i) {
hz_pts[i] = mel_to_hz(mel_pts[i]);
}
const int n_fft_bins = n_fft / 2 + 1;
+
+ // Check for overflow before allocating
+ if (n_mel > SIZE_MAX / n_fft_bins) {
+ GGML_ABORT("mel filterbank size overflow");
+ }
// filterbank
std::vector<float> out(n_mel * n_fft_bins, 0);Based on PR objectives security analysis.
🤖 Prompt for AI Agents
In tools/mtmd/mtmd-audio.cpp around lines 60 to 139, add input validation at the
top of fill_mel_filterbank_matrix: verify n_mel > 0, n_fft > 1, and sample_rate
> 0 and that n_fft is even; ensure fmax > fmin (after auto-filling fmax) and
that fmin/fmax are finite and non-negative; compute n_fft_bins = n_fft/2 + 1 and
check the multiplication n_mel * n_fft_bins will not overflow (compare against
SIZE_MAX or a project-specific max like 1e8) before allocating/constructing the
vector; on validation failure use GGML_ASSERT or return early with a clear
error/log to avoid proceeding with invalid params.
| const int frame_size = (params.n_fft_bins - 1) * 2; | ||
| const int frame_step = params.hop_length; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Validate frame_size and frame_step to prevent division by zero.
Lines 360, 388, and 402 perform divisions using frame_step and derived values. If frame_step (hop_length) is zero or frame_size is zero, this will cause division by zero.
Add validation after computing frame parameters:
// Hann window
const float * hann = g_cache.hann_window.data();
const int frame_size = (params.n_fft_bins - 1) * 2;
const int frame_step = params.hop_length;
+
+ if (frame_step <= 0 || frame_size <= 0) {
+ LOG_ERR("%s: invalid frame parameters (frame_size=%d, frame_step=%d)\n",
+ __func__, frame_size, frame_step);
+ return false;
+ }Based on PR objectives security analysis.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| const int frame_size = (params.n_fft_bins - 1) * 2; | |
| const int frame_step = params.hop_length; | |
| const int frame_size = (params.n_fft_bins - 1) * 2; | |
| const int frame_step = params.hop_length; | |
| if (frame_step <= 0 || frame_size <= 0) { | |
| LOG_ERR("%s: invalid frame parameters (frame_size=%d, frame_step=%d)\n", | |
| __func__, frame_size, frame_step); | |
| return false; | |
| } |
🤖 Prompt for AI Agents
In tools/mtmd/mtmd-audio.cpp around lines 309-310, after computing frame_size
and frame_step validate that both are greater than zero to avoid
division-by-zero later (lines ~360, 388, 402); if either is zero or negative,
log an error (or set an error code), and return/fail early (or throw an
exception) instead of continuing, ensuring all callers handle the error path
consistently; this check should be added immediately after those two lines with
a clear message indicating invalid hop_length/frame_size.
Summary by CodeRabbit
New Features
Refactor
✏️ Tip: You can customize this high-level summary in your review settings.