stag-ir: Information retrieval using a simple relational index

DESCRIPTION

Indexes stag nodes (XML Elements) in a simple relational db structure - keyed by ID with an XML Blob as a value

Imagine you have a very large file of data, in a stag compatible format such as XML. You want to index all the elements of type person; each person can be uniquely identified by social_security_no, which is a direct subnode of person

The first thing to do is to build the index, which will be stored in the database mydb

stag-ir.pl -r person -k social_security_no -d Pg:mydb myrecords.xml

You can then use the index "person-idx" to retrieve person nodes by their social security number

stag-ir.pl -d Pg:mydb -q 999-9999-9999 > some-person.xml
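The idea behind the two commands above (index each record node by its key, then fetch the stored XML by ID) can be sketched as follows. This is a minimal illustration and not the stag-ir implementation: it uses an in-memory SQLite database instead of Pg:mydb, and the table layout (columns id and xml), the helper names and the sample records are all assumptions made for the example.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical sample data; the names are invented for illustration.
RECORDS = """<person_set>
  <person><social_security_no>999-9999-9999</social_security_no><name>Ann</name></person>
  <person><social_security_no>888-8888-8888</social_security_no><name>Bob</name></person>
</person_set>"""

def build_index(conn, xml_text, relation, key):
    """Store every <relation> element as an XML blob keyed by its <key> child."""
    # Assumed schema: one table per relation, with an id column and an xml column.
    conn.execute(
        f'CREATE TABLE IF NOT EXISTS "{relation}" '
        "(id VARCHAR(255) PRIMARY KEY, xml TEXT NOT NULL)"
    )
    count = 0
    for node in ET.fromstring(xml_text).iter(relation):
        node_id = node.findtext(key)  # direct subnode holding the unique key
        if node_id is None:
            continue  # skip records without a key
        blob = ET.tostring(node, encoding="unicode").strip()
        conn.execute(
            f'INSERT OR REPLACE INTO "{relation}" (id, xml) VALUES (?, ?)',
            (node_id, blob),
        )
        count += 1
    conn.commit()
    return count

def lookup(conn, relation, node_id):
    """Return the stored XML for node_id, or None if it is not indexed."""
    row = conn.execute(
        f'SELECT xml FROM "{relation}" WHERE id = ?', (node_id,)
    ).fetchone()
    return row[0] if row else None

conn = sqlite3.connect(":memory:")
n_indexed = build_index(conn, RECORDS, "person", "social_security_no")
some_person = lookup(conn, "person", "999-9999-9999")
```

Because the whole node is stored as a blob, a lookup costs a single primary-key query and no re-parsing of the original file.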

You can export using different stag formats

stag-ir.pl -d Pg:mydb -q 999-9999-9999 -w sxpr > some-person.xml

You can retrieve multiple nodes (although these need to be rooted to make a valid file)

stag-ir.pl -d Pg:mydb -q 999-9999-9999 -q 888-8888-8888 -top personset
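Why a root is needed can be seen in a small sketch: two person elements side by side are not a well-formed XML document, so the results are nested under one enclosing node. The code below assumes a table named person with id and xml columns (an illustrative layout, not necessarily the one stag-ir creates), and fetch_rooted is a hypothetical helper.

```python
import sqlite3
import xml.etree.ElementTree as ET

conn = sqlite3.connect(":memory:")
# Assumed layout: id is the unique key, xml holds the serialized node.
conn.execute("CREATE TABLE person (id VARCHAR(255) PRIMARY KEY, xml TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO person (id, xml) VALUES (?, ?)",
    [
        ("999-9999-9999", "<person><id>999-9999-9999</id></person>"),
        ("888-8888-8888", "<person><id>888-8888-8888</id></person>"),
    ],
)

def fetch_rooted(conn, ids, top):
    """Fetch each ID and nest the resulting nodes inside a single <top> element."""
    root = ET.Element(top)
    for node_id in ids:
        row = conn.execute(
            "SELECT xml FROM person WHERE id = ?", (node_id,)
        ).fetchone()
        if row is not None:  # IDs that are not in the index are skipped
            root.append(ET.fromstring(row[0]))
    return ET.tostring(root, encoding="unicode")

doc = fetch_rooted(conn, ["999-9999-9999", "888-8888-8888"], "personset")
```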

Or you can use a list of IDs from a file (newline delimited)

stag-ir.pl -d Pg:mydb -qf my_ss_nmbrs.txt -top personset

ARGUMENTS

-d DB_NAME

This database will be used for storing the stag nodes

The name can be a logical name or DBI locator or DBStag shorthand - see DBIx::DBStag

The database must already exist

-clear

Deletes all data from the relation type (specified with -r) before loading

-insertonly

Does not check if the ID in the file exists in the db - will always attempt an INSERT (and will fail if ID already exists)

This is the fastest way to load data (only one SQL operation per node rather than two) but is only safe if there is no existing data

(Default is clobber mode - existing data with same ID will be replaced)

-newonly

If there is already data in the specified relation in the db, and the XML being loaded specifies an ID that is already in the db, then this node will be ignored

(Default is clobber mode - existing data with same ID will be replaced)
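The three load behaviours (the default clobber mode, -insertonly and -newonly) differ only in what happens when an ID is already present. The sketch below shows that decision logic under stated assumptions: the store function, its mode names and the person table layout are hypothetical and chosen to mirror the flags, not taken from the script itself.

```python
import sqlite3

def store(conn, node_id, blob, mode="clobber"):
    """Store one node and report what happened: inserted, replaced or ignored."""
    if mode not in ("clobber", "insertonly", "newonly"):
        raise ValueError(f"unknown mode: {mode}")
    if mode == "insertonly":
        # One SQL statement and no existence check; a duplicate ID raises
        # sqlite3.IntegrityError because id is the primary key.
        conn.execute("INSERT INTO person (id, xml) VALUES (?, ?)", (node_id, blob))
        return "inserted"
    # The other modes need two steps: a lookup, then a write (or nothing).
    exists = conn.execute(
        "SELECT 1 FROM person WHERE id = ?", (node_id,)
    ).fetchone()
    if exists is None:
        conn.execute("INSERT INTO person (id, xml) VALUES (?, ?)", (node_id, blob))
        return "inserted"
    if mode == "newonly":
        return "ignored"  # keep the existing row untouched
    conn.execute("UPDATE person SET xml = ? WHERE id = ?", (blob, node_id))
    return "replaced"  # clobber: the new data wins

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (id VARCHAR(255) PRIMARY KEY, xml TEXT NOT NULL)")
results = [
    store(conn, "1", "<person>v1</person>"),
    store(conn, "1", "<person>v2</person>"),
    store(conn, "1", "<person>v3</person>", mode="newonly"),
]
```

The single-statement path is what makes an insert-only load cheaper, and it is also why it is only safe when no conflicting rows exist.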

-transaction_size

A commit will be performed every n UPDATEs/INSERTs (and at the end)

Default is autocommit

Note that if you are using -insertonly together with transactions, and the input file contains an ID already in the database, then the transaction will fail because this script will try to insert a duplicate ID
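Committing every n writes, with a final commit for the remainder, can be sketched as below. This is an illustration of the batching pattern only: load_in_batches, the person table and the use of SQLite are assumptions, and the real script may count operations differently.

```python
import sqlite3

def load_in_batches(conn, rows, transaction_size):
    """Write rows, committing after every transaction_size writes and at the end."""
    commits = 0
    pending = 0
    for node_id, blob in rows:
        conn.execute(
            "INSERT OR REPLACE INTO person (id, xml) VALUES (?, ?)", (node_id, blob)
        )
        pending += 1
        if pending >= transaction_size:
            conn.commit()
            commits += 1
            pending = 0
    if pending:
        # Final commit for the rows left over from the last partial batch.
        conn.commit()
        commits += 1
    return commits

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (id VARCHAR(255) PRIMARY KEY, xml TEXT NOT NULL)")
conn.commit()
rows = [(str(i), f"<person><id>{i}</id></person>") for i in range(5)]
n_commits = load_in_batches(conn, rows, transaction_size=2)
```

A larger batch size means fewer commits, but a failure part-way through a batch loses every uncommitted row in it.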

-r RELATION-NAME

This is the name of the stag node (XML element) that will be stored in the index; for example, with the XML below you may want to use the node name person and the unique key id

<person_set>
  <person>
    <id>...</id>
  </person>
  <person>
    <id>...</id>
  </person>
  ...
</person_set>

This flag should only be used when you want to store data

-k UNIQUE-KEY

This node will be used as the unique/primary key for the data

This node should be nested directly below the node that is being stored in the index - if it is more than one level below, specify a path

This flag should only be used when you want to store data

-u UNIQUE-KEY

Synonym for -k

-create

If specified, this will create a table for the relation name specified with -r; you should use this the first time you index a relation

-idtype TYPE

(optional)

This is the SQL datatype for the unique key; it defaults to VARCHAR(255)

If you know that your id is an integer, you can specify INTEGER here

If your id is always an 8-character field you can do this

-idtype 'CHAR(8)'

This option only makes sense when combined with the -create option
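The reason the key datatype only matters at table-creation time is that it ends up in the table definition. A rough sketch, assuming a two-column layout (id and xml) that may not match what the script actually creates; create_relation is a hypothetical helper and SQLite stands in for the real database:

```python
import sqlite3

def create_relation(conn, relation, idtype="VARCHAR(255)"):
    """Create the table for one relation, using idtype for the key column."""
    # idtype is spliced into the DDL as-is, so it must come from a trusted source.
    ddl = f'CREATE TABLE "{relation}" (id {idtype} PRIMARY KEY, xml TEXT NOT NULL)'
    conn.execute(ddl)
    return ddl

conn = sqlite3.connect(":memory:")
ddl = create_relation(conn, "person", idtype="CHAR(8)")
default_ddl = create_relation(conn, "address")  # falls back to VARCHAR(255)
```

Once the table exists, changing the key type would require altering or recreating it, which is why the option has no effect on later loads.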

-p PARSER

This can be the name of a stag supported format (xml, sxpr, itext) - XML is assumed by default

It can also be a module name - this module is used to parse the input file into a stag stream; see Data::Stag::BaseGenerator for details on writing your own parsers/event generators

This flag should only be used when you want to store data

-q QUERY-ID

Fetches the relation/node with unique key value equal to query-id

Multiple arguments can be passed by specifying -q multiple times

This flag should only be used when you want to query data

-top NODE-NAME

If this is specified in conjunction with -q or -qf then all the query result nodes will be nested inside a node with this name (i.e. this provides a root for the resulting document tree)

-qf QUERY-FILE

This is a file of newline-separated IDs; this is useful for querying the index in batch
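Reading such a query file amounts to taking one ID per line. The sketch below is illustrative only: read_ids is a hypothetical helper, and trimming whitespace and skipping blank lines are assumptions about reasonable handling, not documented behaviour of the script.

```python
import os
import tempfile

def read_ids(path):
    """Return the IDs in a newline-delimited file, one per non-empty line."""
    with open(path, encoding="utf-8") as handle:
        # Assumption: surrounding whitespace is trimmed and blank lines are skipped.
        return [line.strip() for line in handle if line.strip()]

# Write a small throwaway query file so the example is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("999-9999-9999\n888-8888-8888\n\n")
    query_file = tmp.name

ids = read_ids(query_file)
os.remove(query_file)
```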

-keys

This will write a list of all primary keys in the index
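Listing the keys is a plain scan of the key column. A minimal sketch, again assuming a table with id and xml columns; list_keys is a hypothetical helper and the sorted order is a choice made here for predictable output, not something the script is known to guarantee.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed layout: id is the unique key, xml holds the serialized node.
conn.execute("CREATE TABLE person (id VARCHAR(255) PRIMARY KEY, xml TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO person (id, xml) VALUES (?, ?)",
    [
        ("999-9999-9999", "<person><id>999-9999-9999</id></person>"),
        ("888-8888-8888", "<person><id>888-8888-8888</id></person>"),
    ],
)

def list_keys(conn, relation):
    """Return every primary key stored for the relation, in sorted order."""
    query = f'SELECT id FROM "{relation}" ORDER BY id'
    return [row[0] for row in conn.execute(query)]

keys = list_keys(conn, "person")
```

The resulting list has the same one-ID-per-line shape as a query file, so it could be written out and fed back in as a batch query.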

stag-ir (1p)
