@ikreymer ikreymer commented Dec 7, 2025

Fixes #937

  • Don't remove URLs from the seen list.
  • Add a new excluded key, and add URLs that are to be excluded (out of scope on redirect) to the excluded set. The size of this set gives the number of URLs excluded in this way, which can be used to compute the number of discovered URLs.
  • Don't write urn:pageinfo records for excluded pages, in addition to not writing them to pages/extraPages.jsonl.
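
For illustration, the bookkeeping described above can be sketched roughly as follows. This is a minimal in-memory model, not the code from this PR: the class and method names (CrawlStateSketch, addToQueue, markExcluded, numDiscovered, shouldWritePageRecords) are hypothetical, and it assumes the discovered count is the size of the seen set minus the size of the excluded set.

```typescript
// Hypothetical in-memory model of the crawl state; every name here is
// illustrative and is not taken from the PR itself.
class CrawlStateSketch {
  private seen = new Set<string>();
  private excluded = new Set<string>();
  private queue: string[] = [];

  // Queue a URL only if it has never been seen. Returns true if it was queued.
  addToQueue(url: string): boolean {
    if (this.seen.has(url)) {
      return false;
    }
    this.seen.add(url);
    this.queue.push(url);
    return true;
  }

  // Called when a page turns out to be out of scope after a redirect.
  // The URL stays in the seen set, so it cannot be queued again later.
  markExcluded(url: string): void {
    this.seen.add(url);
    this.excluded.add(url);
  }

  // Assumption: discovered URLs = seen URLs minus excluded-on-redirect URLs.
  numDiscovered(): number {
    return this.seen.size - this.excluded.size;
  }

  // Excluded pages get no page records (no pageinfo record, no pages list entry).
  shouldWritePageRecords(url: string): boolean {
    return !this.excluded.has(url);
  }
}

const state = new CrawlStateSketch();
state.addToQueue("https://example.com/");
state.addToQueue("https://example.com/out");

// The second URL redirects out of scope and is excluded.
state.markExcluded("https://example.com/out");

// A later link to the same URL is not queued again.
const requeued = state.addToQueue("https://example.com/out");
const discovered = state.numDiscovered();
```

Keeping the excluded URL in the seen set is what prevents the repeated queueing, and the separate excluded set keeps the discovered count from being inflated by those URLs.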

…ng them from the seen list

avoids requeuing URLs that are excluded on redirect
…s are not written for excluded-on-redirect page
@ikreymer ikreymer requested a review from tw4l December 7, 2025 11:11
…ere should also be no urn:pageinfo record added for excluded pages

@tw4l tw4l left a comment

Nice change, and thanks for the good tests! Also tried it out locally and confirmed it's behaving as expected.

tw4l commented Dec 8, 2025

Might want to create an issue to link this to or add this PR to our sprint board just for tracking purposes.

@ikreymer ikreymer self-assigned this Dec 8, 2025
@ikreymer ikreymer merged commit 850a6a6 into main Dec 9, 2025
6 checks passed
@ikreymer ikreymer deleted the add-exclude-key branch December 9, 2025 06:41

Development

Successfully merging this pull request may close these issues.

Pages excluded-on-redirect can result in same page being queued multiple times.
