-
Notifications
You must be signed in to change notification settings - Fork 3.2k
HTML API: Refactor wp_kses_hair()
#9248
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HTML API: Refactor wp_kses_hair()
#9248
Conversation
|
The following accounts have interacted with this PR and/or linked issues. I will continue to update these lists as activity occurs. You can also manually ask me to refresh this list by adding the Core Committers: Use this line as a base for the props when committing in SVN: To understand the WordPress project's expectations around crediting contributors, please review the Contributor Attribution page in the Core Handbook. |
Trac ticket: Core-63694 `wp_kses_hair()` is built around an impressive state machine for parsing the `$attr` of an HTML tag, that is, the span of text after the tag name and before the closing `>`. Unfortunately, that parsing code doesn’t fully-implement the HTML specification and may be prone to mis-parsing. This patch replaces the existing state machine with a straight-forward use of the HTML API to parse the attributes for us, constructing a shell take for the `$attr` string and reading the attributes structurally. This shell is necessary because a previous stage of the pipeline has already separated what it thinks is the so-called “attribute list” from a tag. Props: dmsnell
68c7746 to
b476339
Compare
Trac ticket: Core-63694 `wp_kses_hair()` is built around an impressive state machine for parsing the `$attr` of an HTML tag, that is, the span of text after the tag name and before the closing `>`. Unfortunately, that parsing code doesn’t fully-implement the HTML specification and may be prone to mis-parsing. This patch replaces the existing state machine with a straight-forward use of the HTML API to parse the attributes for us, constructing a shell take for the `$attr` string and reading the attributes structurally. This shell is necessary because a previous stage of the pipeline has already separated what it thinks is the so-called “attribute list” from a tag. Props: dmsnell
b476339 to
6146ecd
Compare
Trac ticket: Core-63694 `wp_kses_hair()` is built around an impressive state machine for parsing the `$attr` of an HTML tag, that is, the span of text after the tag name and before the closing `>`. Unfortunately, that parsing code doesn’t fully-implement the HTML specification and may be prone to mis-parsing. This patch replaces the existing state machine with a straight-forward use of the HTML API to parse the attributes for us, constructing a shell take for the `$attr` string and reading the attributes structurally. This shell is necessary because a previous stage of the pipeline has already separated what it thinks is the so-called “attribute list” from a tag. Props: dmsnell
6146ecd to
d64f56e
Compare
Test using WordPress PlaygroundThe changes in this pull request can previewed and tested using a WordPress Playground instance. WordPress Playground is an experimental project that creates a full WordPress instance entirely within the browser. Some things to be aware of
For more details about these limitations and more, check out the Limitations page in the WordPress Playground documentation. |
Prep work for WordPress#9248 See also WordPress#9251
Prep work for WordPress#9248 See also WordPress#9251
Prep work for WordPress#9252. Part of WordPress#9248.
) Prep work for WordPress#9252. Part of WordPress#9248.
550921c to
5565848
Compare
Trac ticket: Core-63694 `wp_kses_hair()` is built around an impressive state machine for parsing the `$attr` of an HTML tag, that is, the span of text after the tag name and before the closing `>`. Unfortunately, that parsing code doesn’t fully-implement the HTML specification and may be prone to mis-parsing. This patch replaces the existing state machine with a straight-forward use of the HTML API to parse the attributes for us, constructing a shell take for the `$attr` string and reading the attributes structurally. This shell is necessary because a previous stage of the pipeline has already separated what it thinks is the so-called “attribute list” from a tag. Props: dmsnell
5565848 to
63aecea
Compare
Trac ticket: Core-63694 `wp_kses_hair()` is built around an impressive state machine for parsing the `$attr` of an HTML tag, that is, the span of text after the tag name and before the closing `>`. Unfortunately, that parsing code doesn’t fully-implement the HTML specification and may be prone to mis-parsing. This patch replaces the existing state machine with a straight-forward use of the HTML API to parse the attributes for us, constructing a shell take for the `$attr` string and reading the attributes structurally. This shell is necessary because a previous stage of the pipeline has already separated what it thinks is the so-called “attribute list” from a tag. Props: dmsnell
63aecea to
726bcfd
Compare
Adds 60 test methods covering all aspects of wp_kses_hair() function before the HTML API refactor in PR WordPress#9248. Test coverage includes: - Basic attribute parsing (single, double, unquoted, boolean) - Character reference handling (named, numeric decimal/hex) - Quote handling and normalization - URL protocol filtering - WordPress-specific attributes (data-*, aria-*) - Edge cases: malformed input, whitespace variations - Security patterns: injection attempts, contradictory values - Real-world malformed patterns from broken editors/templates Total: 60 test methods, 1,441 lines File: tests/phpunit/tests/kses/wpKsesHair.php This baseline test suite documents current behavior before refactoring to ensure no regressions during the HTML API migration. Related: WordPress#9248 Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Adds @ticket 63724 annotation to class-level docblock and all 60 test methods, following WordPress PHPUnit testing conventions. This links the test suite to Trac ticket 63724 (Core-63724) which tracks the wp_kses_hair() refactoring in PR WordPress#9248. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Improves test suite by adding explicit assertions documenting exact expected behavior for all edge cases and malformed input. Changes: - Replace weak assertIsArray() with explicit expected arrays - Document exact parsing behavior for malformed patterns - Show protocol filtering results explicitly - Document error recovery patterns for invalid input - Add whitespace handling assertions Results: 60 tests, 64 assertions, all passing (0.218s) This provides comprehensive baseline documentation of current wp_kses_hair() behavior before HTML API refactor in PR WordPress#9248. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Related: WordPress#9248 Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
|
This function had quirks that change with this PR and I want to understand them. I created a test suite for My review of the most common usages on suggest that this change is safe to make and would not negatively impact plugin authors.
This is a tricky question. It doesn't seem like folks rely on specifics of the input representation being present in the output, however it's certainly possible. In one of the examples from plugins, After some reflection, I believe the behavior you've implemented here is a good decision. Consider that the input is HTML and the output ( behavior diffdiff --git a/tests/phpunit/tests/kses/wpKsesHair.php b/tests/phpunit/tests/kses/wpKsesHair.php
index 2ed83679f2e3d..05d573bc070bc 100644
--- a/tests/phpunit/tests/kses/wpKsesHair.php
+++ b/tests/phpunit/tests/kses/wpKsesHair.php
@@ -57,7 +57,7 @@ public function data_attribute_parsing() {
'title' => array(
'name' => 'title',
'value' => 'My Title',
- 'whole' => "title='My Title'",
+ 'whole' => 'title="My Title"',
'vless' => 'n',
),
),
@@ -188,8 +188,8 @@ public function data_attribute_parsing() {
array(
'title' => array(
'name' => 'title',
- 'value' => '<test>',
- 'whole' => 'title="<test>"',
+ 'value' => '<test>',
+ 'whole' => 'title="<test>"',
'vless' => 'n',
),
),
@@ -200,8 +200,8 @@ public function data_attribute_parsing() {
array(
'title' => array(
'name' => 'title',
- 'value' => '<hex>',
- 'whole' => 'title="<hex>"',
+ 'value' => '<hex>',
+ 'whole' => 'title="<hex>"',
'vless' => 'n',
),
),
@@ -212,8 +212,8 @@ public function data_attribute_parsing() {
array(
'title' => array(
'name' => 'title',
- 'value' => '<HEX>',
- 'whole' => 'title="<HEX>"',
+ 'value' => '<HEX>',
+ 'whole' => 'title="<HEX>"',
'vless' => 'n',
),
),
@@ -224,8 +224,8 @@ public function data_attribute_parsing() {
array(
'title' => array(
'name' => 'title',
- 'value' => '&invalid; &#; &#x;',
- 'whole' => 'title="&invalid; &#; &#x;"',
+ 'value' => '&invalid; &#; &#x;',
+ 'whole' => 'title="&invalid; &#; &#x;"',
'vless' => 'n',
),
),
@@ -249,7 +249,7 @@ public function data_attribute_parsing() {
'data-text' => array(
'name' => 'data-text',
'value' => 'Single quoted value',
- 'whole' => "data-text='Single quoted value'",
+ 'whole' => 'data-text="Single quoted value"',
'vless' => 'n',
),
),
@@ -267,7 +267,7 @@ public function data_attribute_parsing() {
'alt' => array(
'name' => 'alt',
'value' => 'single',
- 'whole' => "alt='single'",
+ 'whole' => 'alt="single"',
'vless' => 'n',
),
'id' => array(
@@ -284,8 +284,8 @@ public function data_attribute_parsing() {
array(
'title' => array(
'name' => 'title',
- 'value' => "It's working",
- 'whole' => 'title="It\'s working"',
+ 'value' => 'It's working',
+ 'whole' => 'title="It's working"',
'vless' => 'n',
),
),
@@ -296,8 +296,8 @@ public function data_attribute_parsing() {
array(
'title' => array(
'name' => 'title',
- 'value' => 'He said "hello"',
- 'whole' => 'title=\'He said "hello"\'',
+ 'value' => 'He said "hello"',
+ 'whole' => 'title="He said "hello""',
'vless' => 'n',
),
),
@@ -327,12 +327,32 @@ public function data_attribute_parsing() {
yield 'invalid attribute name starting with number' => array(
'1invalid="value"',
- array(),
+ array(
+ '1invalid' => array(
+ 'name' => '1invalid',
+ 'value' => 'value',
+ 'whole' => '1invalid="value"',
+ 'vless' => 'n',
+ ),
+ ),
);
yield 'invalid attribute name special chars' => array(
'@invalid="value" $bad="value"',
- array(),
+ array(
+ '@invalid' => array(
+ 'name' => '@invalid',
+ 'value' => 'value',
+ 'whole' => '@invalid="value"',
+ 'vless' => 'n',
+ ),
+ '$bad' => array(
+ 'name' => '$bad',
+ 'value' => 'value',
+ 'whole' => '$bad="value"',
+ 'vless' => 'n',
+ ),
+ ),
);
yield 'duplicate attributes first wins' => array(
@@ -355,7 +375,20 @@ public function data_attribute_parsing() {
yield 'malformed unclosed double quote' => array(
'title="unclosed class="test"',
- array(),
+ array(
+ 'title' => array(
+ 'name' => 'title',
+ 'value' => 'unclosed class=',
+ 'whole' => 'title="unclosed class="',
+ 'vless' => 'n',
+ ),
+ 'test"' => array(
+ 'name' => 'test"',
+ 'value' => '',
+ 'whole' => 'test"',
+ 'vless' => 'y',
+ ),
+ ),
);
yield 'very long attribute value' => array(
@@ -610,7 +643,7 @@ public function data_attribute_parsing() {
'alt' => array(
'name' => 'alt',
'value' => '',
- 'whole' => "alt=''",
+ 'whole' => 'alt=""',
'vless' => 'n',
),
'class' => array(
@@ -625,7 +658,7 @@ public function data_attribute_parsing() {
yield 'forward slashes between attributes' => array(
'att / att2=2 /// att3="3"',
array(
- 'att' => array(
+ 'att' => array(
'name' => 'att',
'value' => '',
'whole' => 'att',
@@ -652,13 +685,13 @@ public function data_attribute_parsing() {
'att' => array(
'name' => 'att',
'value' => 'val',
- 'whole' => "att='val'",
+ 'whole' => 'att="val"',
'vless' => 'n',
),
'att2' => array(
'name' => 'att2',
'value' => 'val2',
- 'whole' => "att2='val2'",
+ 'whole' => 'att2="val2"',
'vless' => 'n',
),
),
@@ -670,13 +703,13 @@ public function data_attribute_parsing() {
'att' => array(
'name' => 'att',
'value' => 'val',
- 'whole' => "att='val'",
+ 'whole' => 'att="val"',
'vless' => 'n',
),
'att2' => array(
'name' => 'att2',
'value' => 'val2',
- 'whole' => "att2='val2'",
+ 'whole' => 'att2="val2"',
'vless' => 'n',
),
),
@@ -688,13 +721,13 @@ public function data_attribute_parsing() {
'att' => array(
'name' => 'att',
'value' => 'val',
- 'whole' => "att='val'",
+ 'whole' => 'att="val"',
'vless' => 'n',
),
'att2' => array(
'name' => 'att2',
'value' => 'val2',
- 'whole' => "att2='val2'",
+ 'whole' => 'att2="val2"',
'vless' => 'n',
),
),
@@ -706,13 +739,13 @@ public function data_attribute_parsing() {
'att' => array(
'name' => 'att',
'value' => 'val',
- 'whole' => "att='val'",
+ 'whole' => 'att="val"',
'vless' => 'n',
),
'att2' => array(
'name' => 'att2',
'value' => 'val2',
- 'whole' => "att2='val2'",
+ 'whole' => 'att2="val2"',
'vless' => 'n',
),
),
@@ -739,34 +772,67 @@ public function data_attribute_parsing() {
// Malformed Equals Patterns.
yield 'multiple equals signs' => array(
'att=="val"',
- array(),
+ array(
+ 'att' => array(
+ 'name' => 'att',
+ 'value' => '="val"',
+ 'whole' => 'att="="val""',
+ 'vless' => 'n',
+ ),
+ ),
);
yield 'equals with strange spacing' => array(
'att= ="val"',
- array(),
+ array(
+ 'att' => array(
+ 'name' => 'att',
+ 'value' => '="val"',
+ 'whole' => 'att="="val""',
+ 'vless' => 'n',
+ ),
+ ),
);
yield 'triple equals signs' => array(
'att==="val"',
- array(),
+ array(
+ 'att' => array(
+ 'name' => 'att',
+ 'value' => '=="val"',
+ 'whole' => 'att="=="val""',
+ 'vless' => 'n',
+ ),
+ ),
);
yield 'equals echo pattern' => array(
"att==echo 'something'",
array(
- 'att' => array(
+ 'att' => array(
'name' => 'att',
'value' => '=echo',
'whole' => 'att="=echo"',
'vless' => 'n',
),
+ "'something'" => array(
+ 'name' => "'something'",
+ 'value' => '',
+ 'whole' => "'something'",
+ 'vless' => 'y',
+ ),
),
);
yield 'attribute starting with equals' => array(
'= bool k=v',
array(
+ '=' => array(
+ 'name' => '=',
+ 'value' => '',
+ 'whole' => '=',
+ 'vless' => 'y',
+ ),
'bool' => array(
'name' => 'bool',
'value' => '',
@@ -785,18 +851,43 @@ public function data_attribute_parsing() {
yield 'mixed quotes and equals chaos' => array(
'k=v ="' . "' j=w",
array(
- 'k' => array(
+ 'k' => array(
'name' => 'k',
'value' => 'v',
'whole' => 'k="v"',
'vless' => 'n',
),
+ '="' . "'" => array(
+ 'name' => '="' . "'",
+ 'value' => '',
+ 'whole' => '="' . "'",
+ 'vless' => 'y',
+ ),
+ 'j' => array(
+ 'name' => 'j',
+ 'value' => 'w',
+ 'whole' => 'j="w"',
+ 'vless' => 'n',
+ ),
),
);
yield 'triple equals quoted whitespace' => array(
'===" "',
- array(),
+ array(
+ '=' => array(
+ 'name' => '=',
+ 'value' => '="',
+ 'whole' => '=="=""',
+ 'vless' => 'n',
+ ),
+ '"' => array(
+ 'name' => '"',
+ 'value' => '',
+ 'whole' => '"',
+ 'vless' => 'y',
+ ),
+ ),
);
yield 'boolean with contradictory value' => array(
@@ -820,7 +911,13 @@ public function data_attribute_parsing() {
yield 'empty attribute name with value' => array(
'="value" class="test"',
array(
- 'class' => array(
+ '="value"' => array(
+ 'name' => '="value"',
+ 'value' => '',
+ 'whole' => '="value"',
+ 'vless' => 'y',
+ ),
+ 'class' => array(
'name' => 'class',
'value' => 'test',
'whole' => 'class="test"',
@@ -890,7 +987,7 @@ public function data_protocol_filtering() {
'href' => array(
'name' => 'href',
'value' => 'alert(1)',
- 'whole' => "href='alert(1)'",
+ 'whole' => 'href="alert(1)"',
'vless' => 'n',
),
),
@@ -925,8 +1022,8 @@ public function data_protocol_filtering() {
array(
'src' => array(
'name' => 'src',
- 'value' => 'text/html,<script>alert(1)</script>',
- 'whole' => 'src="text/html,<script>alert(1)</script>"',
+ 'value' => 'text/html,<script>alert(1)</script>',
+ 'whole' => 'src="text/html,<script>alert(1)</script>"',
'vless' => 'n',
),
),Here are two examples from the most most popular plugins in the WP Directory search: From YITH (this appears to be part of the yith library used in many of their plugins): function yith_plugin_fw_html_attributes_to_string( $attributes = array(), $echo = false ) {
$html_attributes = '';
if ( ! ! $attributes ) {
if ( is_string( $attributes ) ) {
$parsed_attrs = wp_kses_hair( $attributes, wp_allowed_protocols() );
$attributes = array();
foreach ( $parsed_attrs as $attr ) {
$attributes[ $attr['name'] ] = 'n' === $attr['vless'] ? $attr['value'] : null;
}
}
if ( is_array( $attributes ) ) {
$html_attributes = array();
foreach ( $attributes as $key => $value ) {
if ( ! is_null( $value ) ) {
$html_attributes[] = esc_attr( $key ) . '="' . esc_attr( $value ) . '"';
} else {
$html_attributes[] = esc_attr( $key );
}
}
$html_attributes = implode( ' ', $html_attributes );
}
}
if ( $echo ) {
// Already escaped above.
echo $html_attributes; // phpcs:ignore WordPress.Security.EscapeOutput.OutputNotEscaped
}
return $html_attributes;
} $params = wp_kses_hair( $params, array( 'http' ) );
$width = isset( $params['width'] ) ? (int) $params['width']['value'] : 0;
$height = isset( $params['height'] ) ? (int) $params['height']['value'] : 0;
$wh = '';
if ( $width && $height ) {
$wh = "&w=$width&h=$height";
}
$url = esc_url_raw( "https://www.youtube.com/watch?v={$match[3]}{$wh}" ); |
sirreal
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like a good improvement.
src/wp-includes/kses.php
Outdated
| * ); | ||
| * | ||
| * @since 1.0.0 | ||
| * @since 6.9.0 Rebuilt on HTML API |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| * @since 6.9.0 Rebuilt on HTML API | |
| * @since 7.0.0 Rebuilt on HTML API |
Trac ticket: Core-63694 `wp_kses_hair()` is built around an impressive state machine for parsing the `$attr` of an HTML tag, that is, the span of text after the tag name and before the closing `>`. Unfortunately, that parsing code doesn’t fully-implement the HTML specification and may be prone to mis-parsing. This patch replaces the existing state machine with a straight-forward use of the HTML API to parse the attributes for us, constructing a shell take for the `$attr` string and reading the attributes structurally. This shell is necessary because a previous stage of the pipeline has already separated what it thinks is the so-called “attribute list” from a tag. Props: dmsnell
726bcfd to
34ba49c
Compare
Trac ticket: Core-63694 `wp_kses_hair()` is built around an impressive state machine for parsing the `$attr` of an HTML tag, that is, the span of text after the tag name and before the closing `>`. Unfortunately, that parsing code doesn’t fully-implement the HTML specification and may be prone to mis-parsing. This patch replaces the existing state machine with a straight-forward use of the HTML API to parse the attributes for us, constructing a shell take for the `$attr` string and reading the attributes structurally. This shell is necessary because a previous stage of the pipeline has already separated what it thinks is the so-called “attribute list” from a tag. Props: adamziel, dmsnell, jonsurrell, jorbin. Co-authored-by: Adam Zieliński <zieladam@git.wordpress.org> Co-authored-by: Aaron Jorbin <jorbin@git.wordpress.org> Co-authored-by: Jon Surrell <jonsurrell@git.wordpress.org> Github-PR: 9248 Github-PR-URL: WordPress#9248 Trac-Ticket: 63694 Trac-Ticket-URL: https://core.trac.wordpress.org/ticket/63724 Git-Branch: html-api/refactor-wp-kses-hair-take-3
2fe88f3 to
60c3c67
Compare
Trac ticket: Core-63694 `wp_kses_hair()` is built around an impressive state machine for parsing the `$attr` of an HTML tag, that is, the span of text after the tag name and before the closing `>`. Unfortunately, that parsing code doesn’t fully-implement the HTML specification and may be prone to mis-parsing. This patch replaces the existing state machine with a straight-forward use of the HTML API to parse the attributes for us, constructing a shell take for the `$attr` string and reading the attributes structurally. This shell is necessary because a previous stage of the pipeline has already separated what it thinks is the so-called “attribute list” from a tag. Props: adamziel, dmsnell, jonsurrell, jorbin. Co-authored-by: Adam Zieliński <zieladam@git.wordpress.org> Co-authored-by: Aaron Jorbin <jorbin@git.wordpress.org> Co-authored-by: Jon Surrell <jonsurrell@git.wordpress.org> Github-PR: 9248 Github-PR-URL: WordPress#9248 Trac-Ticket: 63694 Trac-Ticket-URL: https://core.trac.wordpress.org/ticket/63724 Git-Branch: html-api/refactor-wp-kses-hair-take-3
60c3c67 to
f3d1a51
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Props to Gemini)
Trac ticket: Core-63694 `wp_kses_hair()` is built around an impressive state machine for parsing the `$attr` of an HTML tag, that is, the span of text after the tag name and before the closing `>`. Unfortunately, that parsing code doesn’t fully-implement the HTML specification and may be prone to mis-parsing. This patch replaces the existing state machine with a straight-forward use of the HTML API to parse the attributes for us, constructing a shell take for the `$attr` string and reading the attributes structurally. This shell is necessary because a previous stage of the pipeline has already separated what it thinks is the so-called “attribute list” from a tag. Props: adamziel, dmsnell, jonsurrell, jorbin, westonruter. Co-authored-by: Adam Zieliński <zieladam@git.wordpress.org> Co-authored-by: Aaron Jorbin <jorbin@git.wordpress.org> Co-authored-by: Jon Surrell <jonsurrell@git.wordpress.org> CO-authored-by: Weston Ruter <westonruter@git.wordpress.org> Github-PR: 9248 Github-PR-URL: WordPress#9248 Trac-Ticket: 63694 Trac-Ticket-URL: https://core.trac.wordpress.org/ticket/63724 Git-Branch: html-api/refactor-wp-kses-hair-take-3
f3d1a51 to
290c46a
Compare
`wp_kses_hair()` is built around an impressive state machine for parsing the span of text following an HTML tag name and the tag’s closing `>` into a structured representation of the attributes. Unfortunately that parsing code doesn’t comply with the HTML Living Standard and is prone to mis-parsing attributes, particularly in the presence of malformed inputs. This patch replaces the existing state machine with the spec-compliant parsing from the HTML API. With a comprehensive test suite covering attribute parsing, the same reliability the Tag Processor affords will be applied to `wp_kses_hair()`, giving new guarantees not previously available in Core: - All attribute values are reported fully-normalized, where character references are decoded and then re-encoded in a predictable manner. Only the “big five” syntax characters (“&<>'"”) will remain, and in their named forms. - All `whole` values are fully normalized and presented either as boolean attributes without a value, or with double-quoted attribute values. - All attributes and their values will be properly parsed according to how a browser would parse them, bringing agreement between the server and user agents. Developed in #9248 Discussed in https://core.trac.wordpress.org/ticket/63724 Props adamziel, dmsnell, jonsurrell, jorbin, westonruter. Fixes #63724. git-svn-id: https://develop.svn.wordpress.org/trunk@61467 602fd350-edb4-49c9-b593-d223f7449a82
`wp_kses_hair()` is built around an impressive state machine for parsing the span of text following an HTML tag name and the tag’s closing `>` into a structured representation of the attributes. Unfortunately that parsing code doesn’t comply with the HTML Living Standard and is prone to mis-parsing attributes, particularly in the presence of malformed inputs. This patch replaces the existing state machine with the spec-compliant parsing from the HTML API. With a comprehensive test suite covering attribute parsing, the same reliability the Tag Processor affords will be applied to `wp_kses_hair()`, giving new guarantees not previously available in Core: - All attribute values are reported fully-normalized, where character references are decoded and then re-encoded in a predictable manner. Only the “big five” syntax characters (“&<>'"”) will remain, and in their named forms. - All `whole` values are fully normalized and presented either as boolean attributes without a value, or with double-quoted attribute values. - All attributes and their values will be properly parsed according to how a browser would parse them, bringing agreement between the server and user agents. Developed in WordPress/wordpress-develop#9248 Discussed in https://core.trac.wordpress.org/ticket/63724 Props adamziel, dmsnell, jonsurrell, jorbin, westonruter. Fixes #63724. Built from https://develop.svn.wordpress.org/trunk@61467 git-svn-id: http://core.svn.wordpress.org/trunk@60779 1a063a9b-81f0-0310-95a4-ce76da25c4cd
`wp_kses_hair()` is built around an impressive state machine for parsing the span of text following an HTML tag name and the tag’s closing `>` into a structured representation of the attributes. Unfortunately that parsing code doesn’t comply with the HTML Living Standard and is prone to mis-parsing attributes, particularly in the presence of malformed inputs. This patch replaces the existing state machine with the spec-compliant parsing from the HTML API. With a comprehensive test suite covering attribute parsing, the same reliability the Tag Processor affords will be applied to `wp_kses_hair()`, giving new guarantees not previously available in Core: - All attribute values are reported fully-normalized, where character references are decoded and then re-encoded in a predictable manner. Only the “big five” syntax characters (“&<>'"”) will remain, and in their named forms. - All `whole` values are fully normalized and presented either as boolean attributes without a value, or with double-quoted attribute values. - All attributes and their values will be properly parsed according to how a browser would parse them, bringing agreement between the server and user agents. Developed in WordPress/wordpress-develop#9248 Discussed in https://core.trac.wordpress.org/ticket/63724 Props adamziel, dmsnell, jonsurrell, jorbin, westonruter. Fixes #63724. Built from https://develop.svn.wordpress.org/trunk@61467 git-svn-id: https://core.svn.wordpress.org/trunk@60779 1a063a9b-81f0-0310-95a4-ce76da25c4cd
Trac ticket: Core-63724
Replaces #7407, dmsnell#5
Coordination in #9256
Review feedback
valueandwholeproperties of the returned array indicate the raw parsed bytes from the HTML (with some exceptions). This means that HTML character references are not decoded. This represents an abstraction leak between the HTML and structural return value.Implementation
wp_kses_hair()is built around an impressive state machine for parsing the$attrof an HTML tag, that is, the span of text after the tag name and before the closing>. Unfortunately, that parsing code doesn’t fully-implement the HTML specification and may be prone to mis-parsing.This patch replaces the existing state machine with a straight-forward use of the HTML API to parse the attributes for us, constructing a shell tag for the
$attrstring and reading the attributes structurally. This shell is necessary because a previous stage of the pipeline has already separated what it thinks is the so-called “attribute list” from a tag.Dependencies
relkeywords toAelements. #9252assertEqualHTML()inwp_rel_nofollow()tests. #9251assertEqualHTMLinwp_rel_ugc()tests. #9255assertEqualHTML()inwp_kses()tests. #9257assertEqualHTML()in post filtering tests. #9258oEmbedfiltering tests. #9259assertEqualHTML()in media tests. #9264