Agent for harvesting from open archives version 1.0, 1.1, 2.0 and static ('2.0s') compatible repositories
HTTP::OAI::Harvester is the harvesting front-end in the OAI-PERL library.
To harvest from an OAI-PMH compliant repository, create an HTTP::OAI::Harvester object using the baseURL option and then call OAI-PMH methods to request data from the repository. To handle version 1.0/1.1 repositories automatically you must request Identify() first.
It is recommended that you request an Identify from the repository and use the repository() method to update the Identify object used by the harvester.
When making OAI requests, the underlying HTTP::OAI::UserAgent module takes care of automatic redirection (HTTP status 302) and retry-after (HTTP status 503). OAI-PMH flow control (i.e. resumption tokens) is handled transparently by HTTP::OAI::Response.
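The flow control mentioned above can be pictured as a loop that keeps requesting partial lists until the repository stops handing back a resumption token. The following is a minimal sketch of that protocol logic only, not of how HTTP::OAI::Response is implemented: harvest_all, the fetch_page callback and the page layout (items, token) are all hypothetical names introduced for illustration, and canned pages stand in for a live repository.

```perl
use strict;
use warnings;

# Hypothetical helper: collect every item from a paged OAI-PMH list.
# $fetch_page stands in for the HTTP request; it receives the current
# resumption token (undef for the first request) and must return
# { items => [...], token => $next_token_or_empty }.
sub harvest_all {
    my ($fetch_page) = @_;
    my @items;
    my $token;
    while (1) {
        my $page = $fetch_page->($token);
        push @items, @{ $page->{items} };
        $token = $page->{token};
        # An absent or empty token marks the final partial list.
        last unless defined $token && length $token;
    }
    return @items;
}

# Canned pages keyed by the token that requests them ('' = first page).
my %pages = (
    ''   => { items => [ 'oai:a:1', 'oai:a:2' ], token => 't1' },
    't1' => { items => [ 'oai:a:3' ],            token => '' },
);
my @ids = harvest_all( sub { $pages{ defined $_[0] ? $_[0] : '' } } );
print scalar(@ids), " identifiers harvested\n";
```

With the harvester itself none of this is needed, since next() follows the tokens for you; the sketch is only meant to show what "transparent" covers.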
Static repositories are automatically and transparently supported within the existing API. To harvest a static repository, specify the repository XML file using the baseURL argument to HTTP::OAI::Harvester. An initial request is made that determines whether the base URL specifies a static repository or a normal OAI 1.x/2.0 CGI repository. To prevent this initial request, state the OAI version using an HTTP::OAI::Identify object, e.g.
    $h = HTTP::OAI::Harvester->new(
        repository => HTTP::OAI::Identify->new(
            baseURL => 'http://arXiv.org/oai2',
            version => '2.0',
        ));
If a static repository is found, the response is cached and further requests are served from that cache. Static repositories do not support sets, so any request that uses sets results in a noSetHierarchy error. You can determine whether the repository is static by checking the version ($h->repository->version), which will be "2.0s" for static repositories.
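Since static repositories reject set-based requests, a harvesting script may want to leave the set argument out when the reported version is "2.0s". The sketch below shows one way to do that; list_args is a hypothetical helper, not part of OAI-PERL, and the version string is passed in by hand here, whereas a real script would take it from the harvester's repository object.

```perl
use strict;
use warnings;

# Hypothetical helper: build the argument list for ListRecords or
# ListIdentifiers, omitting 'set' for static ('2.0s') repositories,
# which would otherwise answer with a noSetHierarchy error.
sub list_args {
    my ($version, %args) = @_;
    delete $args{set} if $version eq '2.0s';
    return %args;
}

my %static = list_args( '2.0s', metadataPrefix => 'oai_dc', set => 'physics:hep-th' );
my %normal = list_args( '2.0',  metadataPrefix => 'oai_dc', set => 'physics:hep-th' );
print "static: ", join( ',', sort keys %static ), "\n";
print "normal: ", join( ',', sort keys %normal ), "\n";
```

Dropping the set silently widens the harvest to the whole repository, so a script may prefer to warn or abort instead.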
You should refer to the Open Archives Protocol version 2.0 and other OAI documentation, available from http://www.openarchives.org/.
Note: OAI-PMH 1.0 and 1.1 are deprecated.
In the examples I use the OAI interfaces of arXiv.org and cogprints. To avoid annoying their server administrators, please contact them before performing testing or large downloads (or use other, less loaded, servers for testing).
    use HTTP::OAI;

    my $h = HTTP::OAI::Harvester->new(baseURL=>'http://arXiv.org/oai2');
    my $response = $h->repository($h->Identify);
    if( $response->is_error ) {
        print "Error requesting Identify:\n",
            $response->code . " " . $response->message, "\n";
        exit;
    }

    # Note: repositoryVersion will always be 2.0, $response->version returns
    # the actual version the repository is running
    print "Repository supports protocol version ", $response->version, "\n";

    # Version 1.x repositories don't support metadataPrefix,
    # but OAI-PERL will drop the prefix automatically
    # if an Identify was requested first (as above)
    $response = $h->ListIdentifiers(
        metadataPrefix=>'oai_dc',
        from=>'2001-02-03',
        until=>'2001-04-10'
    );

    if( $response->is_error ) {
        die("Error harvesting: " . $response->message . "\n");
    }

    print "responseDate => ", $response->responseDate, "\n",
        "requestURL => ", $response->requestURL, "\n";

    while( my $id = $response->next ) {
        print "identifier => ", $id->identifier;
        # Only available from OAI 2.0 repositories
        print " (", $id->datestamp, ")" if $id->datestamp;
        print " (", $id->status, ")" if $id->status;
        print "\n";
        # Only available from OAI 2.0 repositories
        for( $id->setSpec ) {
            print "\t", $_, "\n";
        }
    }

    # Using a handler
    $response = $h->ListRecords(
        metadataPrefix=>'oai_dc',
        handlers=>{metadata=>'HTTP::OAI::Metadata::OAI_DC'},
    );
    while( my $rec = $response->next ) {
        print $rec->identifier, "\t",
            $rec->datestamp, "\n",
            $rec->metadata, "\n";
        print join(',', @{$rec->metadata->dc->{'title'}}), "\n";
    }
    # Check the response here; $rec is out of scope after the loop
    if( $response->is_error ) {
        die $response->message;
    }

    # Offline parsing
    $I = HTTP::OAI::Identify->new();
    $I->parse_string($content);
    $I->parse_file($fh);
This constructor method returns a new instance of HTTP::OAI::Harvester. Requires either an HTTP::OAI::Identify object, which in turn must contain a baseURL, or a baseURL from which to construct an Identify object.

Any other parameters are passed to the HTTP::OAI::UserAgent module, and from there to the LWP::UserAgent module.

    $h = HTTP::OAI::Harvester->new(
        baseURL => 'http://arXiv.org/oai2',
        resume => 0, # Suppress automatic resumption
    );

    $h = HTTP::OAI::Harvester->new(
        repository => HTTP::OAI::Identify->new(
            baseURL => 'http://arXiv.org/oai2',
        ));

The repository() method returns and optionally sets the HTTP::OAI::Identify object used by the Harvester agent.

    $id = $h->repository();
    $h->repository($h->Identify);

If resume is set to true (the default), resumption tokens are handled automatically by requesting the next partial list during next() calls.
The six OAI-PMH verbs are the requests supported by an OAI-PMH interface.
Use is_success() or is_error() on the returned object to determine whether an error occurred (see HTTP::OAI::Response).
code() and message() return the error code (200 is success) and a human-readable message respectively. Errors returned by the repository can be retrieved using the errors() method:
    foreach my $error ($r->errors) {
        print $error->code, "\t", $error->message, "\n";
    }
Note: is_success() is true for the OAI error code noRecordsMatch (i.e. empty set), although errors() will still contain the OAI error.
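Because an empty result still counts as success, a script that wants to tell "nothing matched" apart from "records follow" has to look at errors() itself. A minimal sketch of that check follows; harvest_status is a hypothetical helper, and it relies only on the is_success(), errors() and code() methods described above, so it works on any object that offers them.

```perl
use strict;
use warnings;

# Hypothetical helper: classify a response as 'ok', 'empty' or 'failed'.
# $r is any object offering is_success() and errors(), where each error
# offers code(), as described for the response objects above.
sub harvest_status {
    my ($r) = @_;
    return 'failed' unless $r->is_success;
    my @codes = map { $_->code } $r->errors;
    return ( grep { $_ eq 'noRecordsMatch' } @codes ) ? 'empty' : 'ok';
}
```

A harvesting loop could call this on the object returned by ListIdentifiers or ListRecords and skip the record loop when the result is 'empty'.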
If the response contained a resumption token, it can be retrieved using the $r->resumptionToken method.
These methods return an object subclassed from HTTP::Response (where the class corresponds to the verb requested, e.g. GetRecord requests return an HTTP::OAI::GetRecord object).

GetRecord gets a single record from the repository identified by identifier, in format metadataPrefix.

    $gr = $h->GetRecord(
        identifier => 'oai:arXiv.org:hep-th/0001001', # Required
        metadataPrefix => 'oai_dc' # Required
    );
    die $gr->message if $gr->is_error;
    $rec = $gr->next;
    printf("%s (%s)\n", $rec->identifier, $rec->datestamp);
    $dom = $rec->metadata->dom;

Identify gets information about the repository.

    $id = $h->Identify();
    print join ',', $id->adminEmail;

ListIdentifiers retrieves the identifiers, datestamps, sets and deleted status for all records within the specified date range (from/until) and set spec (set). 1.x repositories will only return the identifier. Alternatively, resume an existing harvest by specifying resumptionToken.

    $lr = $h->ListIdentifiers(
        metadataPrefix => 'oai_dc', # Required
        from => '2001-10-01',
        until => '2001-10-31',
        set => 'physics:hep-th',
    );
    while($rec = $lr->next) {
        # ... do something with $rec ...
    }
    die $lr->message if $lr->is_error;

ListMetadataFormats lists the available metadata formats. Given an identifier, the repository should only return those metadata formats for which that item can be disseminated.

    $lmdf = $h->ListMetadataFormats(
        identifier => 'oai:arXiv.org:hep-th/0001001'
    );
    for($lmdf->metadataFormat) {
        print $_->metadataPrefix, "\n";
    }
    die $lmdf->message if $lmdf->is_error;

ListRecords returns full records within the specified date range (from/until), set and metadata format. Alternatively, specify a resumption token to resume a previous partial harvest.

    $lr = $h->ListRecords(
        metadataPrefix => 'oai_dc', # Required
        from => '2001-10-01',
        until => '2001-10-01',
        set => 'physics:hep-th',
    );
    while($rec = $lr->next) {
        # ... do something with $rec ...
    }
    die $lr->message if $lr->is_error;

ListSets returns a list of sets provided by the repository. The scope of sets is undefined by OAI-PMH, so a set may represent any subset of a collection. Optionally provide a resumption token to resume a previous partial request.

    $ls = $h->ListSets();
    while($set = $ls->next) {
        print $set->setSpec, "\n";
    }
    die $ls->message if $ls->is_error;
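Although the meaning of a set is left to the repository, OAI-PMH 2.0 does give setSpec a shape: hierarchy levels are separated by colons, as in physics:hep-th. The sketch below uses that to print set specs as an indented tree; indent_set_spec is a hypothetical helper, and the specs are hard-coded here where a harvest would read them from each set returned by ListSets.

```perl
use strict;
use warnings;

# Hypothetical helper: indent a setSpec by its depth in the hierarchy.
# OAI-PMH 2.0 separates hierarchy levels in a setSpec with colons.
sub indent_set_spec {
    my ($spec) = @_;
    my @path = split /:/, $spec;
    # Two spaces per level below the root, then the last path element.
    return ( '  ' x $#path ) . $path[-1];
}

print indent_set_spec($_), "\n" for sort qw(physics physics:hep-th math);
```

Sorting the specs first keeps each child directly under its parent, which is all this simple display needs.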
These modules have been written by Tim Brody <[email protected]>.