feat: MarkdownHeaderSplitter #9660
Open · OGuggenbuehl wants to merge 86 commits into deepset-ai:main from OGuggenbuehl:feature/md-header-splitter
+771 −0
86 commits (all by OGuggenbuehl):

- `45e7c12` implement md-header-splitter and add tests
- `edfd644` rework md-header splitter to rewrite md-header levels
- `cd55f13` remove deprecated test
- `dafe1bd` Update haystack/components/preprocessors/markdown_header_splitter.py
- `6da2513` use native types
- `96e616c` move to haystack logging
- `c3e397f` docstrings improvements
- `1ca9803` Update haystack/components/preprocessors/markdown_header_splitter.py
- `6c49600` fix CustomDocumentSplitter arguments
- `9c23202` remove header prefix from content
- `b24d92d` rework split_id assignment to avoid collisions
- `7b8150e` remove unneeded dese methods
- `f085221` cleanup
- `3490d89` cleanup
- `0bf3187` add tests
- `d87ef97` move initialization of secondary-splitter out of run method
- `84e34ed` move _custom_document_splitter to class method
- `32b0958` removed the _CustomDocumentSplitter class. splitting logic is now enc…
- `69b7953` return to standard feed-forward character and add tests for page brea…
- `f5b91f0` quit exposing splitting_function param since it shouldn't be changed …
- `83e5579` remove test section in module
- `f3625f5` add license header
- `526ac4f` add release note
- `a46ac62` minor refactor for type safety
- `821d907` Update haystack/components/preprocessors/markdown_header_splitter.py
- `c630e14` remove unneeded release notes entries
- `fa53e1b` improved documentation for methods
- `1e6cbe3` improve method naming
- `e756d99` improved page-number assignment & added return in docstring
- `c48bdcf` unified page-counting
- `decaadf` simplify conditional secondary-split initialization and usage
- `3ef71c4` fix linting error
- `0fbea3a` clearly specify the use of ATX-style headers (#) only
- `38119a6` reference doc_id when logging no headers found
- `e12e7f7` initialize md-header pattern as private variable once
- `f31528e` add example for inferring header levels to docstring
- `cee156c` improve empty document handling
- `c63035f` more explicit testing for inferred headers
- `cf1b820` fix linting issue
- `22369b6` improved empty content handling test cases
- `316ebec` remove all functionality related to inferring md-header levels
- `d5e462c` compile regex-pattern in init for performance gains
- `4089ddc` Update haystack/components/preprocessors/markdown_header_splitter.py
- `20d172e` change all "none" to proper None values
- `a7c6725` fix minor
- `c9c44ee` explicitly test doc content
- `0e36419` rename parentheaders to parent_headers
- `edc60b5` test split_id, doc length
- `995c121` check meta content
- `223a676` remove unneeded test
- `babc7d9` make split_id testing more robust
- `e488edc` more realistic overlap test sample
- `c0efda3` assign split_id globally to all output docs
- `893e3de` test page numbers explicitly
- `9abf10b` cleanup pagebreak test
- `11da0a8` minor
- `32d8c68` return doc unchunked if no headers have content
- `bcf56ca` add doc-id to logging statement for unsplit documents
- `c5415ec` remove unneeded logs
- `dff06bc` minor cleanup
- `a54d25a` simplify page-number tracking method to not return count, just the up…
- `a34c7a6` add dev comment to mypy check for doc.content is None
- `7bc798e` Update haystack/components/preprocessors/markdown_header_splitter.py
- `a7eef6b` remove split meta flattening
- `5b5fc93` keep empty meta return consistent
- `8ef5af0` remove unneeded content is none check
- `f1e3739` update tests to reflect empty meta dict for unsplit docs
- `df7e775` clean up total_page counts
- `3c1c376` remove unneeded meta check
- `86feef6` Update test/components/preprocessors/test_markdown_header_splitter.py
- `c22b57d` implement keep_headers parameter
- `7c03a04` remove meta-dict flattening
- `9a8ca76` add minor sanity checks
- `2f1e203` Update test/components/preprocessors/test_markdown_header_splitter.py
- `b22feb5` add warmup
- `8501831` Update haystack/components/preprocessors/markdown_header_splitter.py
- `23da68e` fix splitting when keeping headers
- `ccc1057` test cleanup to cover keep_headers=True
- `c4a5c17` add tests for keep_headers=False splitting
- `f3d7799` remove strip()
- `f842fdb` simplify doc handling
- `c7fc2e4` fix split id assignment
- `64ff6fb` test cleanup
- `eb3e568` test splits more explicitly
- `ad155cc` cleanup tests
- `1c3897c` Merge branch 'main' into feature/md-header-splitter
`haystack/components/preprocessors/markdown_header_splitter.py` (337 additions)

```python
# SPDX-FileCopyrightText: 2022-present deepset GmbH <info@deepset.ai>
#
# SPDX-License-Identifier: Apache-2.0

import re
from typing import Literal, Optional

from haystack import Document, component, logging
from haystack.components.preprocessors import DocumentSplitter

logger = logging.getLogger(__name__)


@component
class MarkdownHeaderSplitter:
    """
    Split documents at ATX-style Markdown headers (#), with optional secondary splitting.

    This component processes text documents by:
    - Splitting them into chunks at Markdown headers (e.g., '#', '##', etc.), preserving header hierarchy as metadata.
    - Optionally applying a secondary split (by word, passage, period, or line) to each chunk
      (using Haystack's DocumentSplitter).
    - Preserving and propagating metadata such as parent headers, page numbers, and split IDs.
    """

    def __init__(
        self,
        *,
        page_break_character: str = "\f",
        keep_headers: bool = True,
        secondary_split: Optional[Literal["word", "passage", "period", "line"]] = None,
        split_length: int = 200,
        split_overlap: int = 0,
        split_threshold: int = 0,
        skip_empty_documents: bool = True,
    ):
        """
        Initialize the MarkdownHeaderSplitter.

        :param page_break_character: Character used to identify page breaks. Defaults to form feed ("\f").
        :param keep_headers: If True, headers are kept in the content. If False, headers are moved to metadata.
            Defaults to True.
        :param secondary_split: Optional secondary split condition applied after header splitting.
            Options are None, "word", "passage", "period", "line". Defaults to None.
        :param split_length: The maximum number of units in each split when using secondary splitting. Defaults to 200.
        :param split_overlap: The number of overlapping units for each split when using secondary splitting.
            Defaults to 0.
        :param split_threshold: The minimum number of units per split when using secondary splitting. Defaults to 0.
        :param skip_empty_documents: Choose whether to skip documents with empty content. Defaults to True.
            Set to False when downstream components in the pipeline (like LLMDocumentContentExtractor) can extract
            text from non-textual documents.
        """
        self.page_break_character = page_break_character
        self.secondary_split = secondary_split
        self.split_length = split_length
        self.split_overlap = split_overlap
        self.split_threshold = split_threshold
        self.skip_empty_documents = skip_empty_documents
        self.keep_headers = keep_headers
        self._header_pattern = re.compile(r"(?m)^(#{1,6}) (.+)$")  # ATX-style Markdown headers
        self._is_warmed_up = False

        # initialize the secondary splitter only if needed
        if self.secondary_split:
            self.secondary_splitter = DocumentSplitter(
                split_by=self.secondary_split,
                split_length=self.split_length,
                split_overlap=self.split_overlap,
                split_threshold=self.split_threshold,
            )

    def warm_up(self):
        """
        Warm up the MarkdownHeaderSplitter.
        """
        if self.secondary_split and not self._is_warmed_up:
            self.secondary_splitter.warm_up()
            self._is_warmed_up = True

    def _split_text_by_markdown_headers(self, text: str, doc_id: str) -> list[dict]:
        """Split text by ATX-style headers (#) and create chunks with appropriate metadata."""
        logger.debug("Splitting text by markdown headers")

        # find headers
        matches = list(re.finditer(self._header_pattern, text))

        # return the document unsplit if no headers are found
        if not matches:
            logger.info(
                "No headers found in document {doc_id}; returning full document as single chunk.", doc_id=doc_id
            )
            return [{"content": text, "meta": {}}]

        # process headers and build chunks
        chunks: list[dict] = []
        header_stack: list[Optional[str]] = [None] * 6
        active_parents: list[str] = []  # track active parent headers
        pending_headers: list[str] = []  # store empty headers to prepend to the next content
        has_content = False  # flag to track whether any header has content

        for i, match in enumerate(matches):
            # extract header info
            header_prefix = match.group(1)
            header_text = match.group(2)
            level = len(header_prefix)

            # get content
            start = match.end()
            end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
            content = text[start:end]
            if not self.keep_headers and content.startswith("\n"):
                content = content[1:]  # remove the leading newline if headers are not kept

            # update the header stack to track nesting
            header_stack[level - 1] = header_text
            for j in range(level, 6):
                header_stack[j] = None

            # skip splits without content
            if not content.strip():  # this strip is needed to avoid counting whitespace as content
                # add as parent for subsequent headers
                active_parents = [h for h in header_stack[: level - 1] if h is not None]
                active_parents.append(header_text)
                if self.keep_headers:
                    header_line = f"{header_prefix} {header_text}"
                    pending_headers.append(header_line)
                continue

            has_content = True  # at least one header has content
            parent_headers = list(active_parents)

            logger.debug(
                "Creating chunk for header '{header_text}' at level {level}", header_text=header_text, level=level
            )

            if self.keep_headers:
                header_line = f"{header_prefix} {header_text}"
                # prepend pending headers and the current header to the content
                chunk_content = ""
                if pending_headers:
                    chunk_content += "\n".join(pending_headers) + "\n"
                chunk_content += f"{header_line}{content}"
                chunks.append(
                    {
                        "content": chunk_content,
                        "meta": {} if self.keep_headers else {"header": header_text, "parent_headers": parent_headers},
                    }
                )
                pending_headers = []  # reset pending headers
            else:
                chunks.append({"content": content, "meta": {"header": header_text, "parent_headers": parent_headers}})

            # reset active parents
            active_parents = [h for h in header_stack[: level - 1] if h is not None]

        # return the document unchunked if no headers have content
        if not has_content:
            logger.info(
                "Document {doc_id} contains only headers with no content; returning original document.", doc_id=doc_id
            )
            return [{"content": text, "meta": {}}]

        return chunks

    def _apply_secondary_splitting(self, documents: list[Document]) -> list[Document]:
        """
        Apply secondary splitting while preserving header metadata and structure.

        Ensures page counting is maintained across splits.
        """
        result_docs = []
        current_split_id = 0  # track split_id across all secondary splits from the same parent

        for doc in documents:
            if doc.content is None:
                result_docs.append(doc)
                continue

            content_for_splitting: str = doc.content

            if not self.keep_headers:  # header extraction is only needed when headers were moved to meta
                # extract header information
                header_match = re.search(self._header_pattern, doc.content)
                if header_match:
                    content_for_splitting = doc.content[header_match.end() :]

            # track the page from meta
            current_page = doc.meta.get("page_number", 1)

            # create a clean meta dict without split_id for secondary splitting
            clean_meta = {k: v for k, v in doc.meta.items() if k != "split_id"}

            secondary_splits = self.secondary_splitter.run(
                documents=[Document(content=content_for_splitting, meta=clean_meta)]
            )["documents"]

            # split processing
            for i, split in enumerate(secondary_splits):
                # calculate the page number for this split
                if i > 0 and secondary_splits[i - 1].content:
                    current_page = self._update_page_number_with_breaks(secondary_splits[i - 1].content, current_page)

                # set the page number and split_id in meta
                split.meta["page_number"] = current_page
                split.meta["split_id"] = current_split_id
                # ensure source_id is preserved from the original document
                if "source_id" in doc.meta:
                    split.meta["source_id"] = doc.meta["source_id"]
                current_split_id += 1

                # preserve header metadata if we're not keeping headers in the content
                if not self.keep_headers:
                    for key in ["header", "parent_headers"]:
                        if key in doc.meta:
                            split.meta[key] = doc.meta[key]

                result_docs.append(split)

        logger.debug(
            "Secondary splitting complete. Final count: {final_count} documents.", final_count=len(result_docs)
        )
        return result_docs

    def _update_page_number_with_breaks(self, content: str, current_page: int) -> int:
        """
        Update the page number based on page breaks in the content.

        :param content: Content to check for page breaks.
        :param current_page: Current page number.
        :return: New current page number.
        """
        if not isinstance(content, str):
            return current_page

        page_breaks = content.count(self.page_break_character)
        new_page_number = current_page + page_breaks

        if page_breaks > 0:
            logger.debug(
                "Found {page_breaks} page breaks, page number updated: {old} → {new}",
                page_breaks=page_breaks,
                old=current_page,
                new=new_page_number,
            )

        return new_page_number

    def _split_documents_by_markdown_headers(self, documents: list[Document]) -> list[Document]:
        """Split a list of documents by markdown headers, preserving metadata."""
        result_docs = []
        for doc in documents:
            logger.debug("Splitting document with id={doc_id}", doc_id=doc.id)
            # mypy: doc.content is Optional[str], so we must check for None before passing it to the splitting method
            if doc.content is None:
                continue

            splits = self._split_text_by_markdown_headers(doc.content, doc.id)
            docs = []

            current_page = doc.meta.get("page_number", 1) if doc.meta else 1
            total_pages = doc.content.count(self.page_break_character) + 1
            logger.debug(
                "Processing page number: {current_page} out of {total_pages}",
                current_page=current_page,
                total_pages=total_pages,
            )
            for split_idx, split in enumerate(splits):
                meta = {}
                if doc.meta:
                    meta = doc.meta.copy()
                meta.update({"source_id": doc.id, "page_number": current_page, "split_id": split_idx})
                if split.get("meta"):
                    meta.update(split["meta"])
                current_page = self._update_page_number_with_breaks(split["content"], current_page)
                docs.append(Document(content=split["content"], meta=meta))
            logger.debug(
                "Split into {num_docs} documents for id={doc_id}, final page: {current_page}",
                num_docs=len(docs),
                doc_id=doc.id,
                current_page=current_page,
            )
            result_docs.extend(docs)
        return result_docs

    @component.output_types(documents=list[Document])
    def run(self, documents: list[Document]) -> dict[str, list[Document]]:
        """
        Run the markdown header splitter with optional secondary splitting.

        :param documents: List of documents to split.

        :returns: A dictionary with the following key:
            - `documents`: List of documents with the split texts. Each document includes:
                - A metadata field `source_id` to track the original document.
                - A metadata field `page_number` to track the original page number.
                - A metadata field `split_id` to identify the split chunk index within its parent document.
                - All other metadata copied from the original document.
        """
        # validate input documents
        for doc in documents:
            if doc.content is None:
                raise ValueError(
                    "MarkdownHeaderSplitter only works with text documents but content for document ID"
                    f" {doc.id} is None."
                )
            if not isinstance(doc.content, str):
                raise ValueError("MarkdownHeaderSplitter only works with text documents (str content).")

        final_docs = []
        for doc in documents:
            # handle empty documents
            if not doc.content or not doc.content.strip():  # avoid counting whitespace as content
                if self.skip_empty_documents:
                    logger.warning("Document ID {doc_id} has an empty content. Skipping this document.", doc_id=doc.id)
                    continue
                # keep empty documents
                final_docs.append(doc)
                logger.warning(
                    "Document ID {doc_id} has an empty content. Keeping this document as per configuration.",
                    doc_id=doc.id,
                )
                continue

            # split this document by headers
            header_split_docs = self._split_documents_by_markdown_headers([doc])

            # apply secondary splitting if configured
            if self.secondary_split:
                doc_splits = self._apply_secondary_splitting(header_split_docs)
            else:
                doc_splits = header_split_docs

        final_docs.extend(doc_splits)

        return {"documents": final_docs}
```
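The core chunking idea behind `_split_text_by_markdown_headers` can be tried without Haystack. The sketch below is a simplified, standalone approximation (it drops parent-header tracking, `keep_headers=False` handling, and page numbers; `split_by_headers` and the sample text are illustrative, not part of the PR):

```python
import re

# same ATX-header pattern the component compiles in __init__
HEADER_PATTERN = re.compile(r"(?m)^(#{1,6}) (.+)$")


def split_by_headers(text: str) -> list[dict]:
    """Simplified sketch: one chunk per header that has content, header kept in the chunk."""
    matches = list(HEADER_PATTERN.finditer(text))
    if not matches:
        # no headers: return the full text as a single chunk, as the component does
        return [{"content": text, "meta": {}}]
    chunks = []
    for i, m in enumerate(matches):
        # content runs from the end of this header line to the start of the next header
        start = m.end()
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        content = text[start:end]
        if not content.strip():
            continue  # header without content is skipped in this simplified version
        chunks.append(
            {"content": f"{m.group(1)} {m.group(2)}{content}", "meta": {"header": m.group(2)}}
        )
    return chunks


doc = "# Intro\nHello.\n## Details\nMore text.\n"
chunks = split_by_headers(doc)
```

Each chunk keeps its header line at the top, matching the component's `keep_headers=True` default.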
`releasenotes/notes/add-md-header-splitter-df5c024a6ddd2718.yaml` (9 additions)

```yaml
---
features:
  - |
    Introduced the `MarkdownHeaderSplitter` component:
    - Splits documents into chunks at Markdown headers (`#`, `##`, etc.), preserving header hierarchy as metadata.
    - Optionally infers and rewrites header levels for documents where header structure is ambiguous (e.g. documents parsed using Docling).
    - Supports secondary splitting (by word, passage, period, or line) for further chunking after header-based splitting using Haystack's `DocumentSplitter`.
    - Preserves and propagates metadata such as parent headers and page numbers.
    - Handles edge cases such as documents with no headers, empty content, and non-text documents.
```
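The page-number propagation the release note mentions comes down to counting page-break characters while emitting splits. A standalone sketch of that bookkeeping (mirroring `_update_page_number_with_breaks`; `advance_page` is an illustrative name, not the component's API):

```python
PAGE_BREAK = "\f"  # form feed, the component's default page_break_character


def advance_page(content: str, current_page: int) -> int:
    """Advance the running page counter by the number of page breaks found in `content`."""
    return current_page + content.count(PAGE_BREAK)


# a chunk spanning two page breaks moves the counter from page 1 to page 3,
# so the next chunk would be stamped with page_number=3
page = advance_page("intro text\fpage two\fpage three", 1)
```

Because each split's page number is computed from the breaks in the *preceding* splits, the counter stays correct even after secondary splitting reshuffles chunk boundaries.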
Review comment:

> **Reviewer:** Does this mean that if `keep_headers` is `True` we don't store them in the metadata?

> **OGuggenbuehl:** Yes! I thought it didn't make sense to keep them in meta if they're still in the content.
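The exchange above describes an either/or: the header lives in the content *or* in the metadata, never both. A minimal sketch of that contrast (standalone; `chunk_first_section` is an illustrative helper, not the component's API):

```python
import re

HEADER = re.compile(r"(?m)^(#{1,6}) (.+)$")


def chunk_first_section(text: str, keep_headers: bool) -> dict:
    """Build one chunk for the first header, placing the header either in
    the content (keep_headers=True) or in the metadata (keep_headers=False)."""
    m = HEADER.search(text)
    body = text[m.end():].lstrip("\n")
    if keep_headers:
        # header stays in the content; meta stays empty
        return {"content": f"{m.group(1)} {m.group(2)}\n{body}", "meta": {}}
    # header is stripped from the content and moved to meta
    return {"content": body, "meta": {"header": m.group(2)}}


doc = "## Setup\nInstall the package.\n"
kept = chunk_first_section(doc, keep_headers=True)
moved = chunk_first_section(doc, keep_headers=False)
```

With `keep_headers=True` the chunk is self-describing Markdown; with `keep_headers=False` the content is cleaner for embedding while the header remains queryable via metadata filters.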