@dmsnell
Member

@dmsnell dmsnell commented May 27, 2024

Status

This is a work in progress.

  • Special atomic elements need deeper access to split apart because the token and text boundaries aren't exposed. Can we reconstruct this by changing the start tag name and re-parsing?

Description

Replace the regular-expression approach to splitting HTML with the HTML API for a reliable parse.
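
For illustration, a minimal sketch of why a regular-expression split can be unreliable. The pattern below is a simplified stand-in, not the one returned by get_html_split_regex(), and the sample input is hypothetical. It treats everything from `<` to the next `>` as a tag, so a `>` inside a quoted attribute value ends the tag early:

```php
<?php
// Hypothetical stand-in pattern: every "<...>" run counts as one element.
// This is NOT the pattern returned by get_html_split_regex().
$simple_tag_regex = '/(<[^>]*>)/';

// Hypothetical input with a ">" inside a quoted attribute value.
$html = '<a title="a>b">link</a>';

// With PREG_SPLIT_DELIM_CAPTURE, text lands at even indexes and the
// captured delimiters (the "tags") land at odd indexes.
$chunks = preg_split( $simple_tag_regex, $html, -1, PREG_SPLIT_DELIM_CAPTURE );

foreach ( $chunks as $index => $chunk ) {
	printf( "%d: %s\n", $index, var_export( $chunk, true ) );
}
// 0: ''
// 1: '<a title="a>'
// 2: 'b">link'
// 3: '</a>'
// 4: ''
```

The start tag is cut at the `>` inside the attribute value, and the rest of it leaks into the following text chunk. An HTML-aware tokenizer would be expected to keep `<a title="a>b">` together as a single token.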

dmsnell and others added 5 commits May 27, 2024 10:20
@github-actions

The following accounts have interacted with this PR and/or linked issues. I will continue to update these lists as activity occurs. You can also manually ask me to refresh this list by adding the props-bot label.

Core Committers: Use this line as a base for the props when committing in SVN:

Props dmsnell, jonsurrell.

To understand the WordPress project's expectations around crediting contributors, please review the Contributor Attribution page in the Core Handbook.

@github-actions

Trac Ticket Missing

This pull request is missing a link to a Trac ticket. For a contribution to be considered, there must be a corresponding ticket in Trac.

To attach a pull request to a Trac ticket, please include the ticket's full URL in your pull request description. More information about contributing to WordPress on GitHub can be found in the Core Handbook.

@sirreal
Member

sirreal commented May 28, 2024

It seems like all the remaining test failures are around CDATA sections. The assertions appear to be wrong in HTML5. @westonruter shared some stats on XHTML usage and there seems to be very little: GoogleChromeLabs/wpp-research#74

Should these tests be removed or updated since they're wrong in the majority of cases?

It would be great to see https://core.trac.wordpress.org/ticket/59883 (drop support for pre-HTML5) move forward.
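
For context, a minimal sketch of how the two rule sets can disagree about where a CDATA construct ends. In XML/XHTML a CDATA section runs through the first `]]>`. In HTML5, outside foreign content such as SVG or MathML, `<![CDATA[` opens a bogus comment that ends at the first `>`. The helper name and the sample input are hypothetical; this is not code from the PR or from the test suite.

```php
<?php
/**
 * Hypothetical helper: length of a "<![CDATA[" construct that starts at
 * offset 0 of $html, under either XML/XHTML rules or HTML5 rules.
 *
 * Simplification: unterminated input is treated as running to the end
 * of the string in both modes.
 */
function cdata_span_length( string $html, bool $html5_rules ): int {
	// XML/XHTML: the section ends at the first "]]>".
	// HTML5 (outside SVG/MathML): bogus comment, ends at the first ">".
	$terminator = $html5_rules ? '>' : ']]>';
	$end        = strpos( $html, $terminator );

	return false === $end ? strlen( $html ) : $end + strlen( $terminator );
}

$html = '<![CDATA[ 1 > 0 ]]> tail';

$as_xhtml = substr( $html, 0, cdata_span_length( $html, false ) );
$as_html5 = substr( $html, 0, cdata_span_length( $html, true ) );

echo $as_xhtml, "\n"; // <![CDATA[ 1 > 0 ]]>
echo $as_html5, "\n"; // <![CDATA[ 1 >
```

Under HTML5 rules the remainder, ` 0 ]]> tail`, is ordinary text, which may be the kind of difference behind assertions written with XHTML semantics in mind.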

@sirreal
Member

sirreal commented May 28, 2024

There is one test failure I see that's not CDATA related, but it also seems like it may be a fix in behavior:

1) WP_Test_REST_Comments_Controller::test_comment_roundtrip_as_superadmin
Failed asserting that two strings are identical.
--- Expected
+++ Actual
@@ @@
-'<p>\\&#038;\\ &amp; &invalid; < &lt; &amp;lt;
-</p>'
+'<p>\\&#038;\\ &amp; &invalid; < &lt; &amp;lt;</p>'

@westonruter westonruter left a comment

Should wp_get_internal_tag_processor() be marked as private? Or else, should the logic be put in a closure?

 */
-function wp_html_split( $input ) {
-	return preg_split( get_html_split_regex(), $input, -1, PREG_SPLIT_DELIM_CAPTURE );
+function wp_html_split( $input_html ) {

Suggested change
function wp_html_split( $input_html ) {
	$get_internal_tag_processor = static function ( $html ) {
		return new class( $html ) extends WP_HTML_Tag_Processor {
			/**
			 * Returns the raw token from the input string at the
			 * current location, if paused at a location.
			 *
			 * @return false|string
			 */
			public function get_raw_token() {
				if (
					WP_HTML_Tag_Processor::STATE_READY === $this->parser_state ||
					WP_HTML_Tag_Processor::STATE_INCOMPLETE_INPUT === $this->parser_state ||
					WP_HTML_Tag_Processor::STATE_COMPLETE === $this->parser_state
				) {
					return false;
				}
				$this->set_bookmark( 'here' );
				$here = $this->bookmarks['here'];
				return substr( $this->html, $here->start, $here->length );
			}
		};
	};

-	return preg_split( get_html_split_regex(), $input, -1, PREG_SPLIT_DELIM_CAPTURE );
+function wp_html_split( $input_html ) {
+	$chunks = array();
+	$processor = wp_get_internal_tag_processor( $input_html );

Suggested change
-$processor = wp_get_internal_tag_processor( $input_html );
+$processor = $get_internal_tag_processor( $input_html );

$raw_html = $processor->get_raw_token();
$first_char = $raw_html[1];
$raw_html[1] = 'X';
$special = wp_get_internal_tag_processor( $raw_html );

Suggested change
-$special = wp_get_internal_tag_processor( $raw_html );
+$special = $get_internal_tag_processor( $raw_html );

Comment on lines +607 to +639
/**
 * Returns a Tag Processor exposing the raw matched tokens.
 *
 * @since 6.6.0
 *
 * @param string $html Passed into the Tag Processor.
 * @return WP_HTML_Tag_Processor|__anonymous@23567
 */
function wp_get_internal_tag_processor( $html ) {
	return new class( $html ) extends WP_HTML_Tag_Processor {
		/**
		 * Returns the raw token from the input string at the
		 * current location, if paused at a location.
		 *
		 * @return false|string
		 */
		public function get_raw_token() {
			if (
				WP_HTML_Tag_Processor::STATE_READY === $this->parser_state ||
				WP_HTML_Tag_Processor::STATE_INCOMPLETE_INPUT === $this->parser_state ||
				WP_HTML_Tag_Processor::STATE_COMPLETE === $this->parser_state
			) {
				return false;
			}

			$this->set_bookmark( 'here' );
			$here = $this->bookmarks['here'];

			return substr( $this->html, $here->start, $here->length );
		}
	};
}


Suggested change
/**
 * Returns a Tag Processor exposing the raw matched tokens.
 *
 * @since 6.6.0
 *
 * @param string $html Passed into the Tag Processor.
 * @return WP_HTML_Tag_Processor|__anonymous@23567
 */
function wp_get_internal_tag_processor( $html ) {
	return new class( $html ) extends WP_HTML_Tag_Processor {
		/**
		 * Returns the raw token from the input string at the
		 * current location, if paused at a location.
		 *
		 * @return false|string
		 */
		public function get_raw_token() {
			if (
				WP_HTML_Tag_Processor::STATE_READY === $this->parser_state ||
				WP_HTML_Tag_Processor::STATE_INCOMPLETE_INPUT === $this->parser_state ||
				WP_HTML_Tag_Processor::STATE_COMPLETE === $this->parser_state
			) {
				return false;
			}
			$this->set_bookmark( 'here' );
			$here = $this->bookmarks['here'];
			return substr( $this->html, $here->start, $here->length );
		}
	};
}

@sirreal
Member

sirreal commented Jul 29, 2025

Is this superseded by #9270?

@dmsnell
Member Author

dmsnell commented Jul 29, 2025

Yes @sirreal — thanks. I should have found this and linked it when I created #9270. I’ve updated its PR description accordingly.

Not ready to close this though, as I still will probably want to merge ideas from both of them.
