Skip to content

Conversation

@majiayu000
Copy link

Summary

Fixes #1680

The html2text module was ignoring the HTML <base> tag when resolving relative links. According to HTML standards, when a document contains a <base> tag, relative URLs should be resolved against the base URL specified in the tag's href attribute.

List of files changed and why

  • crawl4ai/html2text/__init__.py - Added handling for the <base> tag in handle_tag() to update self.baseurl when encountered, ensuring subsequent relative links are resolved correctly
  • tests/unit/test_html2text_base_tag.py - Added unit test to verify the fix

How Has This Been Tested?

  • Unit test verifies handle_tag processes <base> tag and updates baseurl
  • Test uses AST parsing to verify code structure

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added/updated unit tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Fixes unclecode#1680

The html2text module was ignoring the HTML <base> tag when resolving
relative links. According to HTML standards, when a document contains
a <base> tag, relative URLs should be resolved against the base URL
specified in the tag's href attribute.

Added handling for the <base> tag in handle_tag() to update self.baseurl
when encountered, ensuring subsequent relative links are resolved
correctly.

Signed-off-by: majiayu000 <1835304752@qq.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant