Tokenize XML the REX way
    my $tokens = MKDoc::XML::Tokenizer->process_data ($some_xml);
    foreach my $token (@{$tokens})
    {
        print "'" . $token->as_string() . "' is text\n" if (defined $token->text());
        print "'" . $token->as_string() . "' is a self closing tag\n" if (defined $token->tag_self_close());
        print "'" . $token->as_string() . "' is an opening tag\n" if (defined $token->tag_open());
        print "'" . $token->as_string() . "' is a closing tag\n" if (defined $token->tag_close());
        print "'" . $token->as_string() . "' is a processing instruction\n" if (defined $token->pi());
        print "'" . $token->as_string() . "' is a declaration\n" if (defined $token->declaration());
        print "'" . $token->as_string() . "' is a comment\n" if (defined $token->comment());
        print "'" . $token->as_string() . "' is a tag\n" if (defined $token->tag());
        print "'" . $token->as_string() . "' is a pseudo-tag (NOT text and NOT tag)\n" if (defined $token->pseudotag());
        print "'" . $token->as_string() . "' is a leaf token (NOT opening tag)\n" if (defined $token->leaf());
    }
MKDoc::XML::Tokenizer is a module that uses Robert D. Cameron's REX technique to parse XML (ignore the line breaks, shown here as spaces):
[^<]+|<(?:!(?:--(?:[^-]*-(?:[^-][^-]*-)*->?)?|\[CDATA\[(?:[^\]]*](?:[^\]]+]) *]+(?:[^\]>][^\]]*](?:[^\]]+])*]+)*>)?|DOCTYPE(?:[ \n\t\r]+(?:[A-Za-z_:]|[^\ x00-\x7F])(?:[A-Za-z0-9_:.-]|[^\x00-\x7F])*(?:[ \n\t\r]+(?:(?:[A-Za-z_:]|[^\ x00-\x7F])(?:[A-Za-z0-9_:.-]|[^\x00-\x7F])*|"[^"]*"|'[^']*'))*(?:[ \n\t\r]+) ?(?:\[(?:<(?:!(?:--[^-]*-(?:[^-][^-]*-)*->|[^-](?:[^\]"'><]+|"[^"]*"|'[^']*' )*>)|\?(?:[A-Za-z_:]|[^\x00-\x7F])(?:[A-Za-z0-9_:.-]|[^\x00-\x7F])*(?:\?>|[\ n\r\t ][^?]*\?+(?:[^>?][^?]*\?+)*>))|%(?:[A-Za-z_:]|[^\x00-\x7F])(?:[A-Za-z0 -9_:.-]|[^\x00-\x7F])*;|[ \n\t\r]+)*](?:[ \n\t\r]+)?)?>?)?)?|\?(?:(?:[A-Za-z _:]|[^\x00-\x7F])(?:[A-Za-z0-9_:.-]|[^\x00-\x7F])*(?:\?>|[\n\r\t ][^?]*\?+(? :[^>?][^?]*\?+)*>)?)?|/(?:(?:[A-Za-z_:]|[^\x00-\x7F])(?:[A-Za-z0-9_:.-]|[^\x 00-\x7F])*(?:[ \n\t\r]+)?>?)?|(?:(?:[A-Za-z_:]|[^\x00-\x7F])(?:[A-Za-z0-9_:. -]|[^\x00-\x7F])*(?:[ \n\t\r]+(?:[A-Za-z_:]|[^\x00-\x7F])(?:[A-Za-z0-9_:.-]| [^\x00-\x7F])*(?:[ \n\t\r]+)?=(?:[ \n\t\r]+)?(?:"[^<"]*"|'[^<']*'))*(?:[ \n\ t\r]+)?/?>?)?)
That's right. One big regex, and it works rather well.
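To make the idea concrete, here is a minimal sketch of regex-based shallow tokenizing in plain Perl. It deliberately uses a much smaller pattern than the REX expression above, so it does not handle DOCTYPE internal subsets or incomplete markup. The names $SHALLOW, shallow_tokenize and classify are hypothetical and are not part of MKDoc::XML::Tokenizer.

```perl
use strict;
use warnings;

# Simplified shallow-parsing pattern: NOT the full REX expression.
# Alternatives are tried left to right at each position.
my $SHALLOW = qr{
      <!--.*?-->            # comment
    | <\?.*?\?>             # processing instruction
    | <!\[CDATA\[.*?\]\]>   # CDATA section
    | <![^>]*>              # declaration without an internal subset
    | <[^>]*>               # opening, closing or self-closing tag
    | [^<]+                 # text up to the next markup character
}xs;

# Hypothetical helper: split a string into raw token strings.
sub shallow_tokenize {
    my ($xml) = @_;
    my @tokens = $xml =~ /$SHALLOW/g;
    return \@tokens;
}

# Hypothetical helper: name the kind of a raw token string.
sub classify {
    my ($token) = @_;
    return 'comment'        if $token =~ /^<!--/;
    return 'pi'             if $token =~ /^<\?/;
    return 'cdata'          if $token =~ /^<!\[CDATA\[/;
    return 'declaration'    if $token =~ /^<!/;
    return 'tag_close'      if $token =~ m{^</};
    return 'tag_self_close' if $token =~ m{^<.*/>$}s;
    return 'tag_open'       if $token =~ /^</;
    return 'text';
}

my $tokens = shallow_tokenize('<p>Hello <br/>world<!-- hi --></p>');
printf "%-14s %s\n", classify($_), $_ for @{$tokens};
```

Because every alternative consumes at least one character and no tree is built, a single global match is enough to split the input; the real expression differs mainly in how carefully each alternative follows the XML grammar.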
This module does low level XML manipulation. It will somehow parse even broken XML and try to do something with it. Do not use it unless you know what you're doing.
MKDoc::XML::Tokenizer->process_data ($some_xml) splits $some_xml into a list of MKDoc::XML::Token objects and returns an array reference to that list.

MKDoc::XML::Tokenizer->process_file ('/some/file.xml') does the same, except that it reads the XML from '/some/file.xml'.
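The two entry points can be compared side by side. The sketch below assumes the MKDoc::XML distribution is installed and that the file-reading counterpart of process_data is called process_file, as in the CPAN release; the sample markup and the temporary file are made up for illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use MKDoc::XML::Tokenizer;

my $xml = '<p>Hello <b>World</b></p>';

# Write the sample to a temporary file so there is something to read back.
my ($fh, $filename) = tempfile(UNLINK => 1);
print {$fh} $xml;
close $fh or die "Cannot close $filename: $!";

# Tokenize the same content from a string and from the file.
my $from_data = MKDoc::XML::Tokenizer->process_data($xml);
my $from_file = MKDoc::XML::Tokenizer->process_file($filename);

# Compare the raw string of each token from both sources.
my @a = map { $_->as_string() } @{$from_data};
my @b = map { $_->as_string() } @{$from_file};
print "same tokens\n" if "@a" eq "@b";
print "$_\n" for @a;
```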
MKDoc::XML::Tokenizer works with MKDoc::XML::Token, which can be used when building a full tree is not necessary. If you need to build a tree, look at MKDoc::XML::TreeBuilder.
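As an illustration of working on tokens without building a tree, the following sketch keeps only the text tokens of a fragment. It assumes the MKDoc::XML distribution is installed and relies only on the text() and as_string() methods shown in the synopsis; text_only is a hypothetical helper name.

```perl
use strict;
use warnings;
use MKDoc::XML::Tokenizer;

# Hypothetical helper: concatenate only the text tokens of a fragment.
sub text_only {
    my ($xml) = @_;
    my $tokens = MKDoc::XML::Tokenizer->process_data($xml);
    return join '',
        map  { $_->as_string() }
        grep { defined $_->text() } @{$tokens};
}

print text_only('<p>Hello <b>World</b>!</p>'), "\n";
```

A flat pass like this is enough for tasks such as stripping markup or counting tags; anything that depends on nesting is better served by MKDoc::XML::TreeBuilder.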
Copyright 2003 - MKDoc Holdings Ltd.
Author: Jean-Michel Hiver
This module is free software and is distributed under the same license as Perl itself. Use it at your own risk.
See also: MKDoc::XML::Token, MKDoc::XML::TreeBuilder