XML_HTMLSax3
pear.php.net
A SAX parser for HTML and other badly formed XML documents

XML_HTMLSax3 is a SAX-based XML parser for badly formed XML documents, such as HTML. The original code base was developed by Alexander Zhukov and published at http://sourceforge.net/projects/phpshelve/. Alexander kindly gave permission to modify the code and license for inclusion in PEAR.

NOTE! This package is now dual licensed under PHP license v3.01 and LGPL 3.0. See the CVS repo link for the actual licenses.

PEAR::XML_HTMLSax3 provides an API very similar to the native PHP XML extension (http://www.php.net/xml), allowing handlers written for one to be easily adapted to the other. The key difference is that HTMLSax will not break on badly formed XML, so it can be used for parsing HTML documents. Otherwise HTMLSax supports all the handlers available from Expat except the namespace and external entity handlers. It also provides methods for handling XML escapes as well as JSP/ASP opening and closing tags.

Version 1.x introduced an API similar to the native SAX extension but used a slow character-by-character approach to parsing.

Version 2.x has had its internals completely overhauled to use a Lexer, delivering performance *approaching* that of the native XML extension, as well as a radically improved, modular design that makes adding further functionality easy.

Version 3.x is about fine-tuning the API and behaviour, and providing a mechanism to distinguish HTML "quirks" from badly formed HTML (the latter functionality is not yet implemented).

A big thanks to Jeff Moore (lead developer of WACT: http://wact.sourceforge.net), who is largely responsible for the new design, and to other members of Sitepoint's Advanced PHP forums for their input: http://www.sitepointforums.com/showthread.php?threadid=121246. Thanks also to Marcus Baker (lead developer of SimpleTest: http://www.lastcraft.com/simple_test.php) for sorting out the unit tests.
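Because the description says handlers written for the native XML extension adapt easily to HTMLSax, it may help to see what such a handler looks like. The following is a minimal sketch that uses only the native extension. The class name OutlineHandler and its method names are hypothetical and chosen for illustration. Each callback receives the parser as its first argument, which is the convention the changelog says HTMLSax follows as well.

```php
<?php
// Hypothetical handler class; the callback signatures are those of the
// native XML (Expat-style) extension: the parser comes first.
class OutlineHandler
{
    public $events = array();

    public function open($parser, $name, $attrs)
    {
        $this->events[] = 'open:' . $name;
    }

    public function close($parser, $name)
    {
        $this->events[] = 'close:' . $name;
    }

    public function data($parser, $text)
    {
        // Character data may arrive in several chunks; skip blank ones.
        if (trim($text) !== '') {
            $this->events[] = 'data:' . trim($text);
        }
    }
}

$handler = new OutlineHandler();
$parser  = xml_parser_create();

// The native parser upper-cases element names by default; turn that off.
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);
xml_set_element_handler($parser, array($handler, 'open'), array($handler, 'close'));
xml_set_character_data_handler($parser, array($handler, 'data'));

// Well-formed input only: the native parser stops on badly formed markup.
$ok = xml_parse($parser, '<p>Hello <b>world</b></p>', true);

print_r($handler->events);
```

In HTMLSax the corresponding registration calls named in the release notes are set_element_handler() and set_data_handler(), with the handler object supplied through set_object(), so a class like this one should need little change to be driven by either parser.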
Lead: Harry Fuecks (hfuecks) hfuecks@phppatterns.com, active: no

Release 3.0.0 / API 3.0.0, stable / stable, 2007-12-01, license: PHP
* Fixed bug #1850 HTMLtoXHTML.php does not produce XHTML [dufuz]
* Fixed bug #11607 Requesting License change, emails to listed authors bounce [cdake]
* Fixed bug #12159 not clarified license [hfuecks]
* This package is now dual licensed under PHP license v3.01 and LGPL 3.0

Requires: PHP 4.0.5, PEAR installer 1.4.0b1, extension pcre

Release 3.0.0RC1 / API 3.0.0RC1, beta / beta, 2004-06-02, license: PHP
* Re PEAR version naming rules, you now include XML/HTMLSax3.php and the main class is called XML_HTMLSax3
* Now able to parse Word generated HTML - fixed bug with parsing of XML escape sequences
* API break (minor): no longer extends PEAR
* API break (minor): attributes with no value (like option selected) are now populated with NULL instead of TRUE
* API break (minor): replaced XML_OPTION_FULL_ESCAPES with XML_OPTION_STRIP_ESCAPES - by default you now get back the complete escape sequence
* Added some more examples

Release 2.1.2 / API 2.1.2, stable / stable, 2003-12-05, license: PHP
* Bug fixed (thanks Jeff) where badly formed attributes resulted in an infinite loop
* Added an additional boolean argument to open and close handler calls to spot empty tags like <br/> - should not break existing APIs
* Added XML_OPTION_FULL_ESCAPES which (when = 1) passes through the complete content in an XML escape, allowing comment / cdata reconstruction

Release 2.1.1 / API 2.1.1, stable / stable, 2003-10-08, license: PHP
* Reporting of byte index with get_current_position() is more accurate on opening tags (thanks to Alexander Orlov at x-code.com)
* All parser options are now available to PHP versions earlier than 4.3.x, using an implementation of html_entity_decode written in PHP

Release 2.1.0 / API 2.1.0, stable / stable, 2003-09-10, license: PHP
* Well (unit) tested with SimpleTest

Release 2.0.2 / API 2.0.2, alpha / alpha, 2003-08-11, license: PHP
* API is backwards compatible apart from the renaming of parser options
* Performance dramatically increased. Not much slower than Expat
* Better handling of XML comments and CDATA
* Option to trigger additional data handler calls for linefeeds and tabs
* Option to trigger additional data handler calls for XML entities and parse them if required
* Added public get_current_position() and get_length() methods

Release 1.1 / API 1.1, stable / stable, 2003-06-26, license: PHP
* Bug fixes to Attribute_Parser to cope with newline, tag, forward slash and whitespace issues

Release 1.0 / API 1.0, stable / stable, 2003-06-08, license: PHP
* Modifications to file structure to place Attributes_Parser.php and State_Machine.php in subdirectory HTMLSax
* XML_HTMLSax.php includes Attributes_Parser.php and State_Machine.php using require_once()

Release 0.9.0rc2 / API 0.9.0rc2, beta / beta, 2003-05-18, license: PHP
* First release under PEAR
* Changed package name to XML_HTMLSax
* Added patch from John Luxford to parse single quoted attributes
* Modified State_Machine to be a simple variable store

Release 0.9.0rc1 / API 0.9.0rc1, beta / beta, 2003-05-09, license: PHP
A summary of the main differences between this version of HTML_Sax and HTMLSax2002082201:
* Instead of extending HTMLSax with your own "handlers" class, you now use the set_object() method to pass an instance of the class to HTMLSax.
* Class method callbacks are specified using the following methods:
  * set_element_handler('startHandler','endHandler') for <tag> and </tag>
  * set_data_handler('dataHandler') for the contents of an element
  * set_pi_handler('piHandler') for <?php ?>, <?xml ?> etc.
  * set_escape_handler() for anything beginning with <!
  * set_jasp_handler() to set a listener for <% %> tags
* Attributes which have no value are created and set to true
* Comments are handled and may contain entities; < >
* The callback handlers will all be passed an instance of HTMLSax in the same way as the native PHP XML Expat extension
* Setting of parser options is handled specifically by the set_option() method. Available options are:
  * skipWhiteSpace: instruct the parser to ignore whitespace characters
  * trimDataNodes: trim whitespace inside character data
  * breakOnNewLine: newline characters found in character data are treated as new events triggering another data callback
  * caseFolding: converts element names to uppercase
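
The registration calls listed in the 0.9.0rc1 notes can be sketched as follows, using the 3.x include path and class name given in the 3.0.0RC1 notes. This is an illustration under stated assumptions rather than an excerpt from the package. LinkCollector and collect_links() are hypothetical names, parse() is assumed to be the method that starts parsing (it is not named in the notes above), and the package is assumed to be installed on the include path. The package dates from the PHP 4 era, so it may not load unmodified on current PHP versions.

```php
<?php
// Hypothetical handler: gathers href values from <a> tags.
class LinkCollector
{
    public $links = array();

    // $empty is the extra boolean added in 2.1.2 for tags such as <br/>.
    public function startHandler($parser, $name, $attrs, $empty = false)
    {
        if (strtolower($name) === 'a' && isset($attrs['href'])) {
            $this->links[] = $attrs['href'];
        }
    }

    public function endHandler($parser, $name, $empty = false)
    {
        // Nothing to do on closing tags in this sketch.
    }

    public function dataHandler($parser, $data)
    {
        // Element content is ignored in this sketch.
    }
}

// Hypothetical helper; returns null when the package cannot be loaded.
function collect_links($html)
{
    if (!class_exists('XML_HTMLSax3')) {
        @include_once 'XML/HTMLSax3.php';
    }
    if (!class_exists('XML_HTMLSax3')) {
        return null;
    }

    $handler = new LinkCollector();
    $parser  = new XML_HTMLSax3();
    $parser->set_object($handler);
    $parser->set_element_handler('startHandler', 'endHandler');
    $parser->set_data_handler('dataHandler');
    $parser->parse($html); // assumption: parse() is the entry point

    return $handler->links;
}

// Badly formed on purpose: the <b> element is never closed.
$links = collect_links('<p><a href="/docs">Docs <b>here</a></p>');
```

Passing the handler object through set_object(), instead of extending the parser class, keeps the handler independent of the parser, which is the change the 0.9.0rc1 notes describe.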