HTML often appears in ways that are unexpected. It may be missing implicit tags, may have unquoted, single-quoted, or double-quoted attributes, may contain duplicate attributes, may contain unescaped text content, or any number of other possible invalid constructions. The HTML API understands all fo these inputs, but downline parsers may not, and HTML snippets which are safe on their own may introduce problems when joined with other HTML snippets.
This patch introduces the `serialize()` method on the HTML Processor, which prints a fully-normative HTML output, eliminating invalid markup along the way. It produces a string which contains every missing tag, double-quoted attributes, and no duplicates. A `normalize()` static method on the HTML Processor provides a convenient wrapper for constructing a fragment parser and immediately serializing.
Subclasses relying on the `serialize_token()` method may perform structural HTML modifications with as much security as the upcoming `\Dom\HTMLDocument()` parser will, though these are not
able to provide the full safety that will eventually appear with `set_inner_html()`.
Further work may explore serializing to XML (which involves a number of other important transformations) and adding constraints to serialization (such as only allowing inline/flow/formatting elements and text).
Developed in https://github.com/wordpress/wordpress-develop/pull/7331
Discussed in https://core.trac.wordpress.org/ticket/62036
Props dmsnell, jonsurrell, westonruter.
Fixes#62036.
Built from https://develop.svn.wordpress.org/trunk@59076
git-svn-id: http://core.svn.wordpress.org/trunk@58472 1a063a9b-81f0-0310-95a4-ce76da25c4cd