feat!: deduplicate documents based on ID in MultiQuery components by sjrl · Pull Request #10496 · deepset-ai/haystack

sjrl · 2026-02-03T09:29:44Z

Related Issues

fixes #issue-number

Proposed Changes:

De-duplicate based on Document.id instead of Document.content for consistency in the code base and performance reasons
Simplify the from_dict methods to utilize the new from_dict improvements. The to_dict methods were not updateable since default_to_dict doesn't support serializing init variables that are components.
Refactor some mocks used in tests

How did you test it?

Existing tests

Notes for the reviewer

Checklist

I have read the contributors guidelines and the code of conduct.
I have updated the related issue with new insights and changes.
I have added unit tests and updated the docstrings.
I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test: and added ! in case the PR includes breaking changes.
I have documented my code.
I have added a release note file, following the contributors guidelines.
I have run pre-commit hooks and fixed any issue.

vercel · 2026-02-03T09:29:49Z

The latest updates on your projects. Learn more about Vercel for GitHub.

1 Skipped Deployment

Project	Deployment	Actions	Updated (UTC)
haystack-docs	Ignored	Preview	Feb 6, 2026 10:32am

bogdankostic · 2026-02-05T14:43:28Z

In principle, this PR looks good to me. However, I’m wondering if we should standardize how we deduplicate documents across components. I think we could utilize the _deduplicate_documents function introduced in #10508 here as well.

sjrl · 2026-02-06T08:06:58Z

In principle, this PR looks good to me. However, I’m wondering if we should standardize how we deduplicate documents across components. I think we could utilize the _deduplicate_documents function introduced in #10508 here as well.

Now that your PR is merged, I'm happy to switch over to the new utility function!

…ti-query

sjrl · 2026-02-06T08:27:56Z

@bogdankostic I updated both components to use the new utility method!

sjrl · 2026-02-06T08:29:14Z

haystack/utils/misc.py

+    highest_scoring_docs: dict[str, "Document"] = {}
    for doc in documents:
-        doc_id = doc.id
        score = doc.score if doc.score is not None else -inf
+        best = highest_scoring_docs.get(doc.id)

-        if doc_id not in highest_scoring_docs:
-            highest_scoring_docs[doc_id] = doc
-        else:
-            existing_doc = highest_scoring_docs[doc_id]
-            existing_score = existing_doc.score if existing_doc.score is not None else -inf
-            if score > existing_score:
-                highest_scoring_docs[doc_id] = doc
+        if best is None or score > (best.score if best.score is not None else -inf):
+            highest_scoring_docs[doc.id] = doc


@bogdankostic I also slightly compacted the utility method which I think helps with readability and still makes sure not to create intermediate lists

bogdankostic

Looking good! Nice refactoring of _deduplicate_documents making it much more concise :)

refactoring

17525f3

sjrl requested a review from a team as a code owner February 3, 2026 09:29

sjrl requested review from bogdankostic and removed request for a team February 3, 2026 09:29

sjrl added the ignore-for-release-notes PRs with this flag won't be included in the release notes. label Feb 3, 2026

sjrl self-assigned this Feb 3, 2026

github-actions bot added the topic:tests label Feb 3, 2026

sjrl changed the title ~~refactor: slight refactor MultiQuery components and tests~~ refactor: slight refactor to MultiQuery components and tests Feb 3, 2026

github-actions bot added the type:documentation Improvements on the docs label Feb 3, 2026

sjrl added 2 commits February 6, 2026 09:09

Merge branch 'main' of github.com:deepset-ai/haystack into update-mul…

e50f828

…ti-query

Updates

ea8f121

Remove unnecessary line

44c7d68

sjrl commented Feb 6, 2026

View reviewed changes

add reno

5496e5f

sjrl removed the ignore-for-release-notes PRs with this flag won't be included in the release notes. label Feb 6, 2026

sjrl changed the title ~~refactor: slight refactor to MultiQuery components and tests~~ feat!: deduplicate documents based on ID in MultiQuery components Feb 6, 2026

bogdankostic approved these changes Feb 6, 2026

View reviewed changes

sjrl merged commit d68f444 into main Feb 6, 2026
25 checks passed

sjrl deleted the update-multi-query branch February 6, 2026 11:03

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Comments

feat!: deduplicate documents based on ID in MultiQuery components#10496

feat!: deduplicate documents based on ID in MultiQuery components#10496
sjrl merged 5 commits intomainfrom
update-multi-query

sjrl commented Feb 3, 2026

Uh oh!

vercel bot commented Feb 3, 2026 •

edited

Loading

Uh oh!

bogdankostic commented Feb 5, 2026

Uh oh!

sjrl commented Feb 6, 2026

Uh oh!

sjrl commented Feb 6, 2026

Uh oh!

sjrl Feb 6, 2026

Uh oh!

bogdankostic left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Comments

Conversation

sjrl commented Feb 3, 2026

Related Issues

Proposed Changes:

How did you test it?

Notes for the reviewer

Checklist

Uh oh!

vercel bot commented Feb 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

bogdankostic commented Feb 5, 2026

Uh oh!

sjrl commented Feb 6, 2026

Uh oh!

sjrl commented Feb 6, 2026

Uh oh!

sjrl Feb 6, 2026

Choose a reason for hiding this comment

Uh oh!

bogdankostic left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

vercel bot commented Feb 3, 2026 •

edited

Loading