Commit 04d60af

docs: update readme, header, add images
1 parent 05fb3fb · commit 04d60af

File tree: 2 files changed, +27 −15 lines


README.md

Lines changed: 20 additions & 7 deletions
@@ -8,7 +8,7 @@ Computer Vision ML inference in C++
 * Growing number of supported models behind a simple API
 * Modular design for full control and implementing your own models
 
-Based on [GGML](https://github.com/ggml-org/ggml) which also powers the [llama.cpp](https://github.com/ggml-org/llama.cpp) project.
+Based on [ggml](https://github.com/ggml-org/ggml) similar to the [llama.cpp](https://github.com/ggml-org/llama.cpp) project.
 
 ### Features
 
@@ -28,12 +28,11 @@ Get the library and executables:
 
 ### Example: Select an object in an image
 
-Let's use MobileSAM to generate a segmentation mask.
+Let's use MobileSAM to generate a segmentation mask of the plushy on the right by passing in a box describing its approximate location.
 
-<img alt="Example image showing box prompt and mask output" src="docs/media/example-sam.jpg" width="400">
+<img alt="Example image showing box prompt at pixel location (420, 120) -> (650, 430), and the output mask" src="docs/media/example-sam-coords.jpg" width="400">
 
-We target the plushy on the right by passing a box at pixel position (420, 120) → (650, 430).
-Download the model [MobileSAM-F16.gguf](https://huggingface.co/Acly/MobileSAM-GGUF/resolve/main/MobileSAM-F16.gguf) and the [input image](docs/media/input.jpg).
+You can download the model and input image here: [MobileSAM-F16.gguf](https://huggingface.co/Acly/MobileSAM-GGUF/resolve/main/MobileSAM-F16.gguf) | [input.jpg](docs/media/input.jpg)
 
 
 #### CLI
@@ -43,6 +42,7 @@ Find the `vision-cli` executable in the `bin` folder and run it to generate the
 ```sh
 vision-cli -m MobileSAM-F16.gguf -i input.png -p 420 120 650 430 -o mask.png
 ```
+Pass `--composite output.png` to composite input and mask. Use `--help` for more options.
 
 #### API
 
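Taken together, the full command with the composite output would look like this (the file names follow the example above; `output.png` is just an illustrative name):

```sh
vision-cli -m MobileSAM-F16.gguf -i input.png -p 420 120 650 430 -o mask.png --composite output.png
```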
@@ -67,13 +67,14 @@ data to backend devices, post-processing output, etc.
 These can be used as building blocks for flexible functions which integrate
 with your existing data sources and infrastructure.
 
-#### UI
 
 
 ## Models
 
 ### MobileSAM
 
+<img src="docs/media/example-sam.jpg" width="400">
+
 [Model download](https://huggingface.co/Acly/MobileSAM-GGUF/tree/main) | [Paper (arXiv)](https://arxiv.org/pdf/2306.14289.pdf) | [Repository (GitHub)](https://github.com/ChaoningZhang/MobileSAM) | [Segment-Anything-Model](https://segment-anything.com/) | License: Apache-2
 
 ```sh
@@ -82,6 +83,8 @@ vision-cli sam -m MobileSAM-F16.gguf -i input.png -p 300 200 -o mask.png --compo
 
 ### BiRefNet
 
+<img src="docs/media/example-birefnet.png" width="400">
+
 [Model download](https://huggingface.co/Acly/BiRefNet-GGUF/tree/main) | [Paper (arXiv)](https://arxiv.org/pdf/2401.03407) | [Repository (GitHub)](https://github.com/ZhengPeng7/BiRefNet) | License: MIT
 
 ```sh
@@ -90,6 +93,8 @@ vision-cli birefnet -m BiRefNet-lite-F16.gguf -i input.png -o mask.png --composi
 
 ### MI-GAN
 
+<img src="docs/media/example-migan.jpg" width="400">
+
 [Model download](https://huggingface.co/Acly/MIGAN-GGUF/tree/main) | [Paper (thecvf.com)](https://openaccess.thecvf.com/content/ICCV2023/papers/Sargsyan_MI-GAN_A_Simple_Baseline_for_Image_Inpainting_on_Mobile_Devices_ICCV_2023_paper.pdf) | [Repository (GitHub)](https://github.com/Picsart-AI-Research/MI-GAN) | License: MIT
 
 ```sh
@@ -98,6 +103,8 @@ vision-cli migan -m MIGAN-512-places2-F16.gguf -i image.png mask.png -o output.p
 
 ### Real-ESRGAN
 
+<img src="docs/media/example-esrgan.jpg" width="400">
+
 [Model download](https://huggingface.co/Acly/Real-ESRGAN-GGUF) | [Paper (arXiv)](https://arxiv.org/abs/2107.10833) | [Repository (GitHub)](https://github.com/xinntao/Real-ESRGAN) | License: BSD-3-Clause
 
 ```sh
@@ -157,4 +164,10 @@ uv sync
 
 # Run python tests
 uv run pytest
-```
+```
+
+## Acknowledgements
+
+* [ggml](https://github.com/ggml-org/ggml) - ML inference library | MIT
+* [stb-image](https://github.com/nothings/stb) - Image load/save/resize | Public Domain
+* [fmt](https://github.com/fmtlib/fmt) - String formatting _(only if compiler doesn't support <format>)_ | MIT

include/visp/vision.hpp

Lines changed: 7 additions & 8 deletions
@@ -26,8 +26,8 @@
 //
 // Provides a high-level API to run inference on various vision models for
 // common tasks. These operations are built for simplicity and don't provide
-// a lot of options. Rather, you will find below each operation it is split into
-// several steps, which can be used to build more flexible pipelines.
+// a lot of options. If you need more control, you will find each operation
+// split into several steps below, which can be combined in a modular fashion.
 //
 // Basic Use
 // ---------
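To give a feel for the "simple by default" design this comment describes, a high-level call could look roughly like the sketch below. Every identifier in it is an assumed placeholder, not a declaration taken from this header:

```cpp
// Hypothetical sketch only -- all functions below are assumed placeholders,
// not the real API declared in include/visp/vision.hpp.
#include <visp/vision.hpp>

int main() {
    auto model = visp::load_model("MobileSAM-F16.gguf"); // assumed: load weights, prepare backend
    auto image = visp::load_image("input.jpg");          // assumed: decode image from disk
    auto mask  = visp::compute_mask(model, image, {420, 120, 650, 430}); // box prompt from the README
    visp::save_image(mask, "mask.png");                  // assumed: encode result to disk
}
```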
@@ -49,17 +49,16 @@
 //
 // Internally running the model is split into several steps:
 // 1. Load the model weights from a GGUF file.
-// 2. Allocate storage on the backend device and transfer the weights.
-// 3. Detect model hyperparameters and precompute required buffers.
+// 2. Detect model hyperparameters and precompute required buffers.
+// 3. Allocate storage on the backend device and transfer the weights.
 // 4. Build a compute graph for the model architecture.
 // 5. Allocate storage for input, output and intermediate tensors on the backend device.
-// 6. Pre-process the image and transfer it to the backend device.
+// 6. Pre-process the input and transfer it to the backend device.
 // 7. Run the compute graph.
 // 8. Transfer the output to the host and post-process it.
 //
-// You can run all steps individually in order to customize the pipeline. Check the
-// implementation of the high-level API functions to get started.
-//
+// Custom pipelines are simply functions which call the individual steps and extend them
+// where needed. The implementation of the high-level API functions is a good starting point.
 // This allows to:
 // * load model weights from a different source
 // * control exactly when allocation happens
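A custom pipeline in the sense of this comment is then just a function that walks steps 1-8 itself. A loose sketch, with every step and type name assumed rather than taken from the header:

```cpp
// Loose sketch of a custom pipeline following steps 1-8 above.
// All identifiers are assumed placeholders; the implementations of the
// high-level API functions show the real calls.
#include <visp/vision.hpp>

visp::image run_pipeline(char const* gguf_path, visp::image const& input) {
    auto weights = visp::load_gguf(gguf_path);           // 1. load model weights from a GGUF file
    auto params  = visp::detect_params(weights);         // 2. detect hyperparameters, precompute buffers
    auto model   = visp::upload_weights(weights);        // 3. allocate device storage, transfer weights
    auto graph   = visp::build_graph(model, params);     // 4. build the compute graph
    visp::allocate_graph(graph);                         // 5. allocate input/output/intermediate tensors
    visp::set_input(graph, visp::preprocess(input));     // 6. pre-process the input, transfer to device
    visp::compute(graph);                                // 7. run the compute graph
    return visp::postprocess(visp::read_output(graph));  // 8. transfer output to host, post-process
}
```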
