langchain-pymupdf-layout

An integration package connecting PyMuPDF Layout to LangChain.

Load PDF content to Markdown using AI-based, CPU only, layout analysis.

Features

📚 Structured data extraction from your documents
🧐 Advanced document page layout understanding, including semantic markup for titles, headings, headers, footers, tables, images and text styling
🔍 Detect and isolate header and footer patterns on each page

For more detailed information visit the official PyMuPDF Layout documentation webpage.

Requirements

Python 3.11 or higher
LangChain Core v1.0.0 or higher
PyMuPDF v1.26.6 or higher
PyMuPDF4LLM v0.2.0 or higher
PyMuPDF Layout v1.26.6 or higher

Installation

Install the package using pip to start using the Document Loader:

pip install -U langchain-pymupdf-layout

Usage

You can easily integrate and use the PyMuPDF Layout Loader in your Python application for loading and parsing PDFs.

Below is an example of how to set up and utilize this loader:

from langchain_pymupdf_layout import version

print(version())  # Output: version number

from langchain.text_splitter import RecursiveCharacterTextSplitter

from langchain_pymupdf_layout import PyMuPDFLayoutLoader

loader = PyMuPDFLayoutLoader(
    file_path="https://www.adobe.com/support/products/enterprise/knowledgecenter/media/c4611_sample_explain.pdf",
    show_progress=False,  
    # See other loader options on https://pymupdf.readthedocs.io/en/latest/pymupdf-layout/index.html#pymupdf-layout-and-parameter-caveats
)

documents = loader.load()

# Chunk
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)

chunks = text_splitter.split_documents(documents)

print(f"Loaded {len(documents)} document(s)")
print(f"Created {len(chunks)} chunk(s)")

content = chunks[0].page_content
print(f"\ncontent:\n{content}")

Name		Name	Last commit message	Last commit date
Latest commit History 11 Commits
langchain_pymupdf_layout		langchain_pymupdf_layout
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
pyproject.toml		pyproject.toml
simple_test.py		simple_test.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

langchain-pymupdf-layout

Features

Requirements

Installation

Usage

About

Uh oh!

Releases

Packages

Contributors 2

Uh oh!

Languages

License

pymupdf/langchain-pymupdf-layout

Folders and files

Latest commit

History

Repository files navigation

langchain-pymupdf-layout

Features

Requirements

Installation

Usage

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Uh oh!

Languages

Packages