Skip to content

Comments

feat!: deduplicate documents based on ID in MultiQuery components#10496

Merged
sjrl merged 5 commits intomainfrom
update-multi-query
Feb 6, 2026
Merged

feat!: deduplicate documents based on ID in MultiQuery components#10496
sjrl merged 5 commits intomainfrom
update-multi-query

Conversation

@sjrl
Copy link
Contributor

@sjrl sjrl commented Feb 3, 2026

Related Issues

  • fixes #issue-number

Proposed Changes:

  • De-duplicate based on Document.id instead of Document.content for consistency in the code base and performance reasons
  • Simplify the from_dict methods to utilize the new from_dict improvements. The to_dict methods were not updateable since default_to_dict doesn't support serializing init variables that are components.
  • Refactor some mocks used in tests

How did you test it?

Existing tests

Notes for the reviewer

Checklist

  • I have read the contributors guidelines and the code of conduct.
  • I have updated the related issue with new insights and changes.
  • I have added unit tests and updated the docstrings.
  • I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test: and added ! in case the PR includes breaking changes.
  • I have documented my code.
  • I have added a release note file, following the contributors guidelines.
  • I have run pre-commit hooks and fixed any issue.

@sjrl sjrl requested a review from a team as a code owner February 3, 2026 09:29
@sjrl sjrl requested review from bogdankostic and removed request for a team February 3, 2026 09:29
@vercel
Copy link

vercel bot commented Feb 3, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

1 Skipped Deployment
Project Deployment Actions Updated (UTC)
haystack-docs Ignored Ignored Preview Feb 6, 2026 10:32am

Request Review

@sjrl sjrl added the ignore-for-release-notes PRs with this flag won't be included in the release notes. label Feb 3, 2026
@sjrl sjrl self-assigned this Feb 3, 2026
@sjrl sjrl changed the title refactor: slight refactor MultiQuery components and tests refactor: slight refactor to MultiQuery components and tests Feb 3, 2026
@github-actions github-actions bot added the type:documentation Improvements on the docs label Feb 3, 2026
@bogdankostic
Copy link
Contributor

In principle, this PR looks good to me. However, I’m wondering if we should standardize how we deduplicate documents across components. I think we could utilize the _deduplicate_documents function introduced in #10508 here as well.

@sjrl
Copy link
Contributor Author

sjrl commented Feb 6, 2026

In principle, this PR looks good to me. However, I’m wondering if we should standardize how we deduplicate documents across components. I think we could utilize the _deduplicate_documents function introduced in #10508 here as well.

Now that your PR is merged, I'm happy to switch over to the new utility function!

@sjrl
Copy link
Contributor Author

sjrl commented Feb 6, 2026

@bogdankostic I updated both components to use the new utility method!

Comment on lines +132 to +138
highest_scoring_docs: dict[str, "Document"] = {}
for doc in documents:
doc_id = doc.id
score = doc.score if doc.score is not None else -inf
best = highest_scoring_docs.get(doc.id)

if doc_id not in highest_scoring_docs:
highest_scoring_docs[doc_id] = doc
else:
existing_doc = highest_scoring_docs[doc_id]
existing_score = existing_doc.score if existing_doc.score is not None else -inf
if score > existing_score:
highest_scoring_docs[doc_id] = doc
if best is None or score > (best.score if best.score is not None else -inf):
highest_scoring_docs[doc.id] = doc
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bogdankostic I also slightly compacted the utility method which I think helps with readability and still makes sure not to create intermediate lists

@sjrl sjrl removed the ignore-for-release-notes PRs with this flag won't be included in the release notes. label Feb 6, 2026
@sjrl sjrl changed the title refactor: slight refactor to MultiQuery components and tests feat!: deduplicate documents based on ID in MultiQuery components Feb 6, 2026
Copy link
Contributor

@bogdankostic bogdankostic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good! Nice refactoring of _deduplicate_documents making it much more concise :)

@sjrl sjrl merged commit d68f444 into main Feb 6, 2026
25 checks passed
@sjrl sjrl deleted the update-multi-query branch February 6, 2026 11:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

topic:tests type:documentation Improvements on the docs

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants