Skip to content

Conversation

@loci-dev
Copy link

Mirrored from leejet/stable-diffusion.cpp#1222

for cases like leejet/stable-diffusion.cpp#1220

The cli should no longer save an empty image to disk if the generation failed. If no imge is sucessfuly saved, it should exit with error code -1 instead of 0. (Maybe I should change it to return -1 if any image failed to be saved?)

@loci-dev loci-dev temporarily deployed to stable-diffusion-cpp-prod January 21, 2026 17:49 — with GitHub Actions Inactive
@loci-review
Copy link

loci-review bot commented Jan 21, 2026

Performance Review Report: Stable Diffusion C++ - VAE Error Handling Improvements

Impact Classification: Moderate

Commit: 6aa3b9d "improve handling of VAE decode failures"
Binaries Analyzed: build.bin.sd-cli, build.bin.sd-server
Functions Analyzed: 14 (3 performance-critical, 11 utility/STL)

Executive Summary

The target version introduces error handling improvements for VAE decode operations, resulting in 7-8% response time increases (+4,600 nanoseconds absolute) in three GPU-critical functions. The changes prioritize production reliability over raw performance, with negligible user-facing impact (<0.001% of total generation time).

Performance-Critical Functions Impacted

1. StableDiffusionGGML::preview_image() [both binaries]

  • Base: 56,279-56,385 ns → Target: 60,881-60,977 ns
  • Change: +4,591-4,601 ns (+8.14-8.18%)
  • Called 20-50 times per generation for real-time previews

2. StableDiffusionGGML::decode_first_stage() [sd-cli]

  • Base: 63,713 ns → Target: 68,312 ns
  • Change: +4,599 ns (+7.22%)
  • Critical VAE decoder for final latent-to-image conversion

3. UpscalerGGML::upscale() [sd-cli]

  • Base: 61,032 ns → Target: 65,640 ns
  • Change: +4,608 ns (+7.55%)
  • ESRGAN 4x super-resolution with 345 convolution operations

Code Changes and Justification

All three functions received identical error handling enhancements:

  • Added return statements in tiling lambda callbacks to propagate compute() status
  • Implemented explicit error checking after VAE/TAE compute operations
  • Added free_compute_buffer() calls on error paths to prevent GPU memory leaks
  • Added memset initialization for result buffers

These changes prevent silent GPU failures, enable graceful degradation under memory constraints, and ensure proper resource cleanup—critical for production ML inference systems.

Cumulative Impact Analysis

Typical Generation (20 steps):

  • Preview overhead: 20 × 4.6 μs = 92 μs
  • Final decode overhead: 4.6 μs
  • Total: ~97 μs on 10-30 second generation (0.0003-0.001%)

Power Consumption: Negligible increase (<0.01W), imperceptible in 50-150W GPU inference workloads.

Additional Findings

Eleven STL functions show compiler-driven optimizations (no source changes) with 23-75% improvements in various metrics, providing distributed performance gains that partially offset error handling overhead. Notable: std::vector operations improved 40-75%, std::unordered_map::find() gained 44% throughput.

Conclusion

The 4.6 microsecond overhead per GPU operation is justified by substantial reliability improvements. The changes prevent crashes, memory leaks, and silent failures while adding sub-millisecond latency to multi-second operations. This represents mature engineering prioritizing correctness in production ML systems.
See the complete breakdown in Version Insights
Have questions? Tag @loci-dev to ask about this PR.

@noamshiovits-Aurora
Copy link

@loci-dev whats the most substantial change in this PR

@loci-dev loci-dev force-pushed the master branch 4 times, most recently from b9cb3c1 to e31dd7d Compare January 25, 2026 17:07
@noamshiovits-Aurora
Copy link

@loci-dev whats the most substantial change in this PR

@loci-review
Copy link

loci-review bot commented Jan 27, 2026

Based on the performance review report, the most substantial change in this PR is the error handling improvements for VAE (Variational Autoencoder) decode operations.

Here's what makes this the most substantial change:

Core Impact

The PR adds comprehensive error handling to three GPU-critical functions that were previously vulnerable to silent failures:

  1. StableDiffusionGGML::decode_first_stage() - The critical VAE decoder that converts latent representations to final images
  2. StableDiffusionGGML::preview_image() - Generates real-time previews during generation (called 20-50 times per run)
  3. UpscalerGGML::upscale() - ESRGAN 4x super-resolution with 345 convolution operations

What Changed

Each function received:

  • Return statement propagation in tiling lambda callbacks to surface compute() failures
  • Explicit error checking after VAE/TAE compute operations
  • GPU memory leak prevention via free_compute_buffer() calls on error paths
  • Buffer initialization with memset for result buffers

Why It's Substantial

Before this PR: GPU compute failures could occur silently, leading to:

  • Crashes without clear error messages
  • GPU memory leaks under resource constraints
  • Corrupted or invalid output images
  • Difficult-to-debug production issues

After this PR: The system can:

  • Detect and report GPU failures gracefully
  • Clean up resources properly on errors
  • Prevent cascading failures in long-running inference workloads
  • Provide actionable error information to users

This is a reliability and production-readiness improvement that trades ~4.6 microseconds per operation for robust error handling in GPU-intensive ML inference—a worthwhile trade-off for any production system.

Would you like me to dive deeper into any specific function or examine the actual code changes?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants