Make HTML5 DOM trees less insane
    use HTML::HTML5::Parser;
    use HTML::HTML5::Sanity;

    my $parser    = HTML::HTML5::Parser->new;
    my $html5_dom = $parser->parse_file('http://example.com/');
    my $sane_dom  = fix_document($html5_dom);
The Document Object Model (DOM) generated by HTML::HTML5::Parser meets the requirements of the HTML5 spec, but will probably catch a lot of people by surprise.
The main oddity is that elements and attributes which appear to be namespaced are not really. For example, the following element:
<div xml:lang="fr">...</div>
Looks like it should be parsed so that it has an attribute "lang" in the XML namespace. Not so. It will really be parsed as having the attribute "xml:lang" in the null namespace.

    $sane_dom = fix_document($html5_dom);

Returns a modified copy of the DOM, leaving the original DOM unmodified.

The package's other functions are internal. Don't use them; they are not exported.

    $HTML::HTML5::Sanity::FIX_LANG_ATTRIBUTES = 2;
    $sane_dom = fix_document($html5_dom);

If set to 1 (the default), the package will detect invalid values in @lang and @xml:lang, and remove the attribute if it is invalid. If set to 2, it will also attempt to canonicalise the value (e.g. 'EN_GB' will be converted to 'en-GB'). If set to 0, then the value of language attributes is not checked.
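The namespace behaviour described above can be observed with a short script. This is a minimal sketch, not part of the module: the input markup, the `$XML_NS` variable and the `lang_report` helper are assumptions made for illustration. It assumes HTML::HTML5::Parser, HTML::HTML5::Sanity and XML::LibXML are installed, and uses only `parse_string`, `fix_document` and standard XML::LibXML element methods. Note that `getAttribute` looks up by qualified name and may match in both trees depending on the XML::LibXML version; `getAttributeNS` is the namespace-aware lookup.

```perl
use strict;
use warnings;
use HTML::HTML5::Parser;
use HTML::HTML5::Sanity;    # exports fix_document, as in the synopsis

# Namespace URI bound to the reserved "xml" prefix.
my $XML_NS = 'http://www.w3.org/XML/1998/namespace';

# Hypothetical input, used only for illustration.
my $html = '<!DOCTYPE html><title>t</title><div xml:lang="fr">Bonjour</div>';

my $parser    = HTML::HTML5::Parser->new;
my $html5_dom = $parser->parse_string($html);

# Optional: also canonicalise language values (the documented setting 2).
$HTML::HTML5::Sanity::FIX_LANG_ATTRIBUTES = 2;
my $sane_dom = fix_document($html5_dom);

# Hypothetical helper: look the attribute up both ways on the first <div>.
sub lang_report {
    my ($dom) = @_;
    my ($div) = $dom->getElementsByLocalName('div');
    return {
        by_qname  => $div->getAttribute('xml:lang'),         # qualified-name lookup
        in_xml_ns => $div->getAttributeNS($XML_NS, 'lang'),  # namespace-aware lookup
    };
}

my %report = (
    'HTML5 DOM' => lang_report($html5_dom),
    'sane DOM'  => lang_report($sane_dom),
);

for my $label ('HTML5 DOM', 'sane DOM') {
    printf "%-10s xml:lang by name: %s; lang in XML namespace: %s\n",
        $label,
        $report{$label}{by_qname}  // '(none)',
        $report{$label}{in_xml_ns} // '(none)';
}
```

According to the description above, the namespace-aware lookup should find nothing in the tree from HTML::HTML5::Parser and should find the attribute in the tree returned by `fix_document`. Code that relies on namespace-aware lookups should therefore run against the fixed copy.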
Please report any bugs to <http://rt.cpan.org/>.
HTML::HTML5::Parser, XML::LibXML, Task::HTML5.
Toby Inkster.
Copyright (C) 2009-2013 by Toby Inkster
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
THIS PACKAGE IS PROVIDED "AS IS" AND WITHOUT ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.