@ring00 ring00 commented Nov 27, 2025

This PR updates the PyTorch scraper for documentation versions 2.8 and 2.9, addressing changes in the theme and HTML structure.

The PyTorch 2.8 and 2.9 docs use a new Sphinx theme, which is somewhat unfriendly to scrapers at the moment. This commit mainly addresses truncated names in the breadcrumb navigation section (e.g. https://docs.pytorch.org/docs/2.9/name_inference.html, https://docs.pytorch.org/docs/2.9/config_mod.html) by extracting the text inside the page heading instead.

The extracted doc structure differs slightly from that of older PyTorch docs, because truncation sometimes happens in the middle of a navigation path (e.g. https://docs.pytorch.org/docs/2.9/torch.compiler_aot_inductor_debugging_guide.html).

Key changes:

  • Identifies the main content area correctly in newer version docs.
  • Supports the new breadcrumb navigation structure.
  • Restores truncated entry names in newer docs using the full page header, maintaining consistent naming conventions.
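
The last point can be illustrated with a minimal sketch. This is not the scraper's actual code: it is a standalone Python illustration, and the names `FirstHeadingText` and `entry_name`, the sample markup, and the assumption that the heading ends with a `#` or `¶` permalink marker are all hypothetical. The idea is to read the entry name from the first `<h1>` on the page and fall back to the (possibly truncated) breadcrumb label only when no heading is found.

```python
from html.parser import HTMLParser


class FirstHeadingText(HTMLParser):
    """Collect the text content of the first <h1> element in a page."""

    def __init__(self):
        super().__init__()
        self._in_h1 = False   # True while parsing inside the first <h1>
        self._done = False    # True once the first <h1> has been closed
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "h1" and not self._done:
            self._in_h1 = True

    def handle_endtag(self, tag):
        if tag == "h1" and self._in_h1:
            self._in_h1 = False
            self._done = True

    def handle_data(self, data):
        if self._in_h1:
            self._parts.append(data)

    @property
    def text(self):
        return "".join(self._parts).strip()


def entry_name(page_html, breadcrumb_label):
    """Prefer the full heading text over the breadcrumb label.

    Assumption: a trailing '#' or '¶' in the heading is a permalink
    marker and not part of the title.
    """
    parser = FirstHeadingText()
    parser.feed(page_html)
    heading = parser.text.rstrip("#¶").strip()
    return heading or breadcrumb_label


# Hypothetical markup: the breadcrumb label is cut off, the heading is not.
sample = (
    '<nav><li class="breadcrumb-item">Example page ti...</li></nav>'
    '<h1>Example page title<a class="headerlink" href="#x">#</a></h1>'
)
print(entry_name(sample, "Example page ti..."))
```

Falling back to the breadcrumb label keeps pages without a heading from ending up with an empty name, at the cost of possibly keeping a truncated one.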

If you're updating existing documentation to its latest version, please ensure that you have:

  • Updated the versions and releases in the scraper file
  • Ensured the license is up-to-date
  • Ensured the icons and the SOURCE file in public/icons/your_scraper_name/ are up-to-date if the documentation has a custom icon
  • Ensured self.links contains up-to-date urls if self.links is defined
  • Tested the changes locally to ensure:
    • The scraper still works without errors
    • The scraped documentation still looks consistent with the rest of DevDocs
    • The categorization of entries is still good

This commit updates the PyTorch scraper for documentation versions 2.8
and 2.9, addressing changes in the theme and HTML structure.

Key changes:
- Identifies the main content area correctly in newer version docs.
- Supports the new breadcrumb navigation structure.
- Restores truncated entry names in newer docs using the full page title,
maintaining consistent naming conventions.
@ring00 ring00 marked this pull request as ready for review November 27, 2025 08:05
@ring00 ring00 requested a review from a team as a code owner November 27, 2025 08:05
@simon04 simon04 left a comment

Thank you!

@simon04 simon04 merged commit 407b0dc into freeCodeCamp:main Dec 7, 2025
1 check passed