HTML::HTML5::Sanity: make HTML5 DOM trees less insane

SYNOPSIS

  use HTML::HTML5::Parser;
  use HTML::HTML5::Sanity;

  my $parser    = HTML::HTML5::Parser->new;
  my $html5_dom = $parser->parse_file('http://example.com/');
  my $sane_dom  = fix_document($html5_dom);

DESCRIPTION

The Document Object Model (DOM) generated by HTML::HTML5::Parser meets the requirements of the HTML5 spec, but will probably catch a lot of people by surprise.

The main oddity is that elements and attributes which appear to be namespaced are not really. For example, an element written with an xml:lang attribute looks like it should be parsed so that it has an attribute "lang" in the XML namespace. Not so. It will really be parsed as having the attribute "xml:lang" in the null namespace.

  $sane_dom = fix_document($html5_dom);

Returns a modified copy of the DOM, leaving the original DOM unmodified.

The package's other routines are not exported. Don't use them.

  $HTML::HTML5::Sanity::FIX_LANG_ATTRIBUTES = 2;
  $sane_dom = fix_document($html5_dom);

If set to 1 (the default), the package will detect invalid values in @lang and @xml:lang, and remove the attribute if it is invalid. If set to 2, it will also attempt to canonicalise the value (e.g. 'EN_GB' will be converted to 'en-GB'). If set to 0, the values of language attributes are not checked.
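To make the canonicalisation described for FIX_LANG_ATTRIBUTES = 2 more concrete, here is a minimal pure-Perl sketch of that kind of normalisation. It is not the module's actual code: the function name canonicalise_lang_sketch and its rules (underscores become hyphens, the primary subtag is lowercased, two-letter subtags are treated as regions and uppercased, four-letter subtags are treated as scripts and title-cased) are assumptions made for illustration only.

```perl
use strict;
use warnings;

# Hypothetical helper, NOT part of HTML::HTML5::Sanity.
# Returns a canonicalised language tag, or nothing (undef in scalar
# context) if the value does not look like a simple language tag.
sub canonicalise_lang_sketch {
    my ($value) = @_;
    return unless defined $value;

    # Treat underscores as hyphens, so 'EN_GB' is handled like 'EN-GB'.
    (my $tag = $value) =~ tr/_/-/;

    # Assumption: accept only hyphen-separated runs of 1 to 8 ASCII
    # letters or digits, with a purely alphabetic primary subtag.
    return unless $tag =~ /\A[A-Za-z]{1,8}(?:-[A-Za-z0-9]{1,8})*\z/;

    my ($primary, @rest) = split /-/, $tag;
    my @out = (lc $primary);
    for my $subtag (@rest) {
        if ($subtag =~ /\A[A-Za-z]{2}\z/) {
            push @out, uc $subtag;            # looks like a region: 'gb' -> 'GB'
        }
        elsif ($subtag =~ /\A[A-Za-z]{4}\z/) {
            push @out, ucfirst lc $subtag;    # looks like a script: 'hant' -> 'Hant'
        }
        else {
            push @out, lc $subtag;            # anything else is lowercased
        }
    }
    return join '-', @out;
}

print scalar(canonicalise_lang_sketch('EN_GB')), "\n";   # prints "en-GB"
```

The sketch deliberately ignores finer points of language tag syntax (for example, subtags following a private-use singleton), so it should be read as an illustration of the behaviour described above rather than a validator.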

BUGS

Please report any bugs to <http://rt.cpan.org/>.

SEE ALSO

HTML::HTML5::Parser, XML::LibXML, Task::HTML5.

AUTHOR

Toby Inkster <[email protected]>.

COPYRIGHT AND LICENSE

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

DISCLAIMER OF WARRANTIES

THIS PACKAGE IS PROVIDED "AS IS" AND WITHOUT ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTIES OF MERCHANTIBILITY AND FITNESS FOR A PARTICULAR PURPOSE.

HTML::HTML5::Sanity (3pm)