HTML::LinkExtractor: Extract links from an HTML document

DESCRIPTION

HTML::LinkExtractor is used for extracting links from HTML. It is very similar to HTML::LinkExtor, except that besides getting the URL, you also get the link text.

Example ( please run the examples ):

    use HTML::LinkExtractor;
    use Data::Dumper;

    my $input = q{If <a href="http://perl.com/"> I am a LINK!!! </a>};
    my $LX = new HTML::LinkExtractor();

    $LX->parse(\$input);

    print Dumper($LX->links);
    __END__
    # the above example will yield
    $VAR1 = [
              {
                '_TEXT' => '<a href="http://perl.com/"> I am a LINK!!! </a>',
                'href' => bless(do{\(my $o = 'http://perl.com/')}, 'URI::http'),
                'tag' => 'a'
              }
            ];

HTML::LinkExtractor will also correctly extract nested link-type tags.
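As a minimal sketch of the nested case (the input string is made up for illustration, and nothing is assumed here about the order or full contents of the records), an image nested inside an anchor should come back as two separate link records:

```perl
use strict;
use warnings;
use HTML::LinkExtractor;

# Hypothetical input: an <img> nested inside an <a>.
my $input = q{<a href="http://perl.com/"><img src="http://perl.com/logo.png"> Perl </a>};

my $LX = HTML::LinkExtractor->new();
$LX->parse(\$input);

# Collect the tag name of every extracted link record.
my @tags = sort map { $_->{tag} } @{ $LX->links };
print "found: @tags\n";
```

Each record is a plain hashref, so the usual Data::Dumper call from the example above works for inspecting the rest of its keys.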

SYNOPSIS

    ## the demo
    perl LinkExtractor.pm
    perl LinkExtractor.pm file.html otherfile.html

    ## or if the module is installed, but you don't know where

    perl -MHTML::LinkExtractor -e" system $^X, $INC{q{HTML/LinkExtractor.pm}} "
    perl -MHTML::LinkExtractor -e' system $^X, $INC{q{HTML/LinkExtractor.pm}} '

    ## or

    use HTML::LinkExtractor;
    use LWP::Simple qw( get );    # get() is exported by LWP::Simple, not LWP

    my $base = 'http://search.cpan.org';
    my $html = get($base.'/recent');
    my $LX = new HTML::LinkExtractor();

    $LX->parse(\$html);

    print qq{<base href="$base">\n};

    for my $Link( @{ $LX->links } ) {
        ## new modules are linked by /author/NAME/Dist
        if( $$Link{href}=~ m{^\/author\/\w+} ) {
            print $$Link{_TEXT}."\n";
        }
    }

    undef $LX;
    __END__

    ## or

    use HTML::LinkExtractor;
    use Data::Dumper;

    my $input = q{If <a href="http://perl.com/"> I am a LINK!!! </a>};
    my $LX = new HTML::LinkExtractor(
        sub {
            print Data::Dumper::Dumper(@_);
        },
        'http://perlFox.org/',
    );

    $LX->parse(\$input);
    $LX->strip(1);
    $LX->parse(\$input);
    __END__

    #### Calculate the total size of a web-page
    #### adds up the sizes of all the images and stylesheets and stuff

    use strict;
    use LWP::Simple;    # provides get() and head()
    use HTML::LinkExtractor;
    #
    my $url = shift || 'http://www.google.com';
    my $html = get($url);
    my $Total = length $html;
    #
    print "initial size $Total\n";
    #
    my $LX = new HTML::LinkExtractor(
        sub {
            my( $X, $tag ) = @_;
    #
            ## skip the tags that carry _TEXT (a, blockquote and the like)
            unless( grep {$_ eq $tag->{tag} } @HTML::LinkExtractor::TAGS_IN_NEED ) {
    #
                print "$$tag{tag}\n";
    #
                for my $urlAttr ( @{$HTML::LinkExtractor::TAGS{$$tag{tag}}} ) {
                    if( exists $$tag{$urlAttr} ) {
                        ## head() in list context returns the document length second
                        my $size = (head( $$tag{$urlAttr} ))[1];
                        $Total += $size if $size;
                        print "adding $size\n" if $size;
                    }
                }
            }
        },
        $url,
        0
    );
    #
    $LX->parse(\$html);
    #
    print "The total size of \n$url\n is $Total bytes\n";
    __END__

METHODS

$LX->new([\&callback, [$baseUrl, [1]]])

Accepts 3 arguments, all of which are optional. If, for example, you want to pass a $baseUrl but don't want a callback invoked, just put undef in place of the subref.

This is the only class method.

1.: a callback (a sub reference, as in sub{} or \&sub) which is called each time a new LINK is encountered (for @HTML::LinkExtractor::TAGS_IN_NEED this means after the closing tag is encountered). The callback receives an object reference ($LX) and a link hashref.
2.: a base URL (passed to URI->new, so it's up to you to make sure it's valid), which is used to convert all relative URIs to absolute ones:

    $ALink{href} = URI->new_abs( $ALink{href}, $base );

3.: a "boolean" (just stick with 1). See the example in "DESCRIPTION". Normally, you'd get back _TEXT that looks like

    '_TEXT' => '<a href="http://perl.com/"> I am a LINK!!! </a>',

If you turn this option on, you'll get the following instead

    '_TEXT' => ' I am a LINK!!! ',

The private utility function _stripHTML does this by using HTML::TokeParser's method get_trimmed_text.

You can turn this feature on and off by using $LX->strip(undef || 0 || 1).

$LX->parse

Each time you call parse, you should pass it a $filename, a *FILEHANDLE, or a \$FileContent.

Each time you call parse, a new HTML::TokeParser object is created and stored in $this->{_tp}.

You shouldn't need to mess with the TokeParser object.

$LX->links

Only after you call parse will this method return anything. This method returns a reference to an array of hashes, which basically looks like (Data::Dumper output)

    $VAR1 = [
              { tag => 'img', src => 'image.png' },
            ];
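The following is a small sketch of how the constructor arguments, parse, and links fit together. The input string and the example.com base URL are placeholders chosen for illustration; the callback slot is left as undef so that the records end up in links.

```perl
use strict;
use warnings;
use HTML::LinkExtractor;

# Hypothetical input with a relative link; example.com is a placeholder.
my $input = q{See <a href="docs/index.html"> the docs </a>};

# No callback (undef), a base URL, and stripping turned on.
my $LX = HTML::LinkExtractor->new( undef, 'http://example.com/perl/', 1 );
$LX->parse(\$input);

my $links = $LX->links;
for my $link (@$links) {
    # href is a URI object; interpolating it stringifies it.
    print "$link->{tag}: $link->{href}\n";
    print "text: $link->{_TEXT}\n" if exists $link->{_TEXT};
}
```

Because the base URL is applied with URI->new_abs, the relative href should come back resolved against http://example.com/perl/.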

Please note that if you provide a callback this array will be empty.

$LX->strip

If you pass in undef (or nothing), returns the state of the option. Passing in a true or false value sets the option.

If you want to know what the option does, see $LX->new([\&callback, [$baseUrl, [1]]])
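The callback form and the strip accessor can be sketched together as follows. This is an illustration under the behavior described in this page, not a definitive recipe: the @seen array and the variable names inside the callback are made up for the example, and the `|| []` guard is a defensive assumption in case links holds nothing.

```perl
use strict;
use warnings;
use HTML::LinkExtractor;

my $input = q{If <a href="http://perl.com/"> I am a LINK!!! </a>};

# The callback receives the extractor object and a link hashref.
my @seen;
my $LX = HTML::LinkExtractor->new( sub {
    my ( $extractor, $link ) = @_;
    push @seen, "$link->{href}";    # stringify the URI object
} );

$LX->parse(\$input);
print "callback saw: @seen\n";

# With a callback in place, links() is documented to be empty.
my $count = scalar @{ $LX->links || [] };
print "links() holds $count entries\n";

# strip() with no argument reports the setting; with an argument it sets it.
print "strip is ", ( $LX->strip ? "on" : "off" ), "\n";
$LX->strip(1);
print "strip is ", ( $LX->strip ? "on" : "off" ), "\n";
```

Changing strip only affects later calls to parse, so call parse again after toggling it if you want the stripped _TEXT.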

WHAT'S A LINK-type tag

Take a look at %HTML::LinkExtractor::TAGS to see what I consider to be a link-type tag.

Take a look at @HTML::LinkExtractor::VALID_URL_ATTRIBUTES to see all the possible tag attributes which can contain URIs (the links!!)

Take a look at @HTML::LinkExtractor::TAGS_IN_NEED to see the tags for which the '_TEXT' attribute is provided, like <a href="#"> TEST </a>
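One way to take that look is to print the three package variables. This is only a sketch: the `|| []` guard is an assumption, added in case some entry in %TAGS has no attribute list.

```perl
use strict;
use warnings;
use HTML::LinkExtractor;

# List each link-type tag with the attributes that are checked for URIs.
for my $tag ( sort keys %HTML::LinkExtractor::TAGS ) {
    my $attrs = $HTML::LinkExtractor::TAGS{$tag} || [];    # guard against undefined entries
    print "$tag: @$attrs\n";
}

print "URL attributes: @HTML::LinkExtractor::VALID_URL_ATTRIBUTES\n";
print "tags with _TEXT: @HTML::LinkExtractor::TAGS_IN_NEED\n";
```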

How can that be?!?!

I took a look at %HTML::Tagset::linkElements and the following URLs

http://www.blooberry.com/indexdot/html/tagindex/all.htm

http://www.blooberry.com/indexdot/html/tagpages/a/a-hyperlink.htm
http://www.blooberry.com/indexdot/html/tagpages/a/applet.htm
http://www.blooberry.com/indexdot/html/tagpages/a/area.htm

http://www.blooberry.com/indexdot/html/tagpages/b/base.htm
http://www.blooberry.com/indexdot/html/tagpages/b/bgsound.htm

http://www.blooberry.com/indexdot/html/tagpages/d/del.htm
http://www.blooberry.com/indexdot/html/tagpages/d/div.htm

http://www.blooberry.com/indexdot/html/tagpages/e/embed.htm
http://www.blooberry.com/indexdot/html/tagpages/f/frame.htm

http://www.blooberry.com/indexdot/html/tagpages/i/ins.htm
http://www.blooberry.com/indexdot/html/tagpages/i/image.htm
http://www.blooberry.com/indexdot/html/tagpages/i/iframe.htm
http://www.blooberry.com/indexdot/html/tagpages/i/ilayer.htm
http://www.blooberry.com/indexdot/html/tagpages/i/inputimage.htm

http://www.blooberry.com/indexdot/html/tagpages/l/layer.htm
http://www.blooberry.com/indexdot/html/tagpages/l/link.htm

http://www.blooberry.com/indexdot/html/tagpages/o/object.htm

http://www.blooberry.com/indexdot/html/tagpages/q/q.htm

http://www.blooberry.com/indexdot/html/tagpages/s/script.htm
http://www.blooberry.com/indexdot/html/tagpages/s/sound.htm

And the special cases

    <!DOCTYPE HTML SYSTEM "http://www.w3.org/DTD/HTML4-strict.dtd">
    http://www.blooberry.com/indexdot/html/tagpages/d/doctype.htm

'!doctype' is really a processing instruction, but is still listed in %TAGS with 'url' as the attribute

and

RELATED TO HTML::LinkExtractor…

HTML::LinkExtor, HTML::TokeParser, HTML::Tagset.

AUTHOR

D.H (PodMaster)

Please use http://rt.cpan.org/ to report bugs.

Just go to http://rt.cpan.org/NoAuth/Bugs.html?Dist=HTML-Scrubber to see a bug list and/or report new ones.

LICENSE

This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself. The LICENSE file contains the full text of the license.

HTML::LinkExtractor (3pm)