From 40d5dbdc332814cc08092f8aba6f657c130aa23a Mon Sep 17 00:00:00 2001
From: Deepak
Date: Fri, 30 Jan 2026 13:05:09 +0530
Subject: [PATCH] docs: document OCR support and required pymupdf.layout import for PyMuPDF4LLM

---
 docs/images/layout-ocr-flow.png | Bin 59396 -> 59398 bytes
 docs/pymupdf-layout/index.rst   |  49 ++++++++++++++++++++++++--------
 2 files changed, 37 insertions(+), 12 deletions(-)

diff --git a/docs/images/layout-ocr-flow.png b/docs/images/layout-ocr-flow.png
index 3b9a2e1ecc8e31958221e369999ebd056b7ba9bb..8f84016f5018bdc89b8c56acf42bfe4cba9798ac 100644
GIT binary patch
delta 12
TcmZp~J}CSEQ8B9jDW

delta 9
QcmZp>z}#|yc>~J}02PP?O#lD@

diff --git a/docs/pymupdf-layout/index.rst b/docs/pymupdf-layout/index.rst
index bd89e2501..6ecda0df1 100644
--- a/docs/pymupdf-layout/index.rst
+++ b/docs/pymupdf-layout/index.rst
@@ -138,28 +138,53 @@ Now we can happily load Office files and convert them as follows::
 OCR support
 ~~~~~~~~~~~~~~~~~
 
-The new layout-sensitive |PyMuPDF4LLM| version also evaluates whether a page would benefit from applying OCR to it. If its heuristics come to this conclusion, the built-in Tesseract-OCR module is automatically invoked. Its results are then handled like normal page content.
-
-If a page contains (roughly) no text at all, but is covered with images or many character-sized vectors, a check is made using `OpenCV `_ whether text is *probably* detectable on the page at all. This is done to tell apart image-based text from ordinary pictures (like photographs).
+**Critical: Import pymupdf.layout FIRST**
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
-If the page does contain text but too many characters are unreadable (like "�����"), OCR is also executed, but **for the affected text areas only** -- not the full page. This way, we avoid losing already existing text and other content like images and vectors.
+.. code-block:: python
+   :emphasize-lines: 1
 
-For these heuristics to work we need both, an existing :ref:`Tesseract installation ` and the availability of `OpenCV `_ in the Python environment. If either is missing, no OCR is attempted at all.
+   import pymupdf.layout  # REQUIRED FIRST - enables OCR decision tree
+   import pymupdf4llm  # Now OCR heuristics are active
 
-The decision tree for whether OCR is actually used or not depends on the following:
+   md_text = pymupdf4llm.to_markdown("scanned.pdf")
+   # Auto: detects image pages → OCR → markdown
 
-1. :ref:`PyMuPDF Layout is imported `
+.. warning::
+   **Without `import pymupdf.layout`, OCR is NEVER attempted** -
+   even if Tesseract and OpenCV are installed.
 
-2. In the :ref:`PyMuPDF4LLM API ` you have `use_ocr` enabled (this is set to `True` by default)
+**Complete Requirements** (all must be satisfied)
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
-3. :ref:`Tesseract is correctly installed `
+.. list-table:: OCR Decision Prerequisites
+   :widths: 15 85
+   :header-rows: 1
 
-4. `OpenCV `_ is available in your Python environment
+   * - Check
+     - Requirement
+   * - 1. Layout
+     - :ref:`PyMuPDF Layout is imported `
+   * - 2. OCR API
+     - :ref:`PyMuPDF4LLM API ` you have ``use_ocr`` enabled (this is set to ``True`` by default)
+   * - 3. Tesseract
+     - :ref:`Tesseract OCR is correctly installed `
+   * - 4. OpenCV
+     - Available in the Python environment (``pip install opencv-python``)
 
+**Smart OCR Heuristics** (Detailed)
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
-.. image:: ../images/layout-ocr-flow.png
+The new layout-sensitive |PyMuPDF4LLM| version also evaluates whether a page would benefit from applying OCR to it. If its heuristics come to this conclusion, the built-in Tesseract-OCR module is automatically invoked. Its results are then handled like normal page content.
 
-----
+If a page contains (roughly) **no text at all**, but is covered with **images or many character-sized vectors**, a check is made using `OpenCV `_ whether text is *probably* detectable on the page at all. This is done to tell apart **image-based text** from ordinary pictures (like photographs).
+
+If the page **does contain text** but **too many characters are unreadable** (like "�����"), OCR is also executed, but **for the affected text areas only** – not the full page. This way, we avoid losing already existing text and other content like images and vectors.
+
+**OCR Decision Tree**
+^^^^^^^^^^^^^^^^^^^^
+
+.. image:: ../images/layout-ocr-flow.png
 
 
 .. _pymupdf_layout_and_pymupdf4llm_api: