hOCR to djvused script converter
hocr2djvused [option...] [hocr-file...]
hocr2djvused reads one or more \m[blue]hOCR\m[]\s-2\u[1]\d\s+2 files (as produced by \m[blue]OCRopus\m[]\s-2\u[2]\d\s+2, \m[blue]Cuneiform\m[]\s-2\u[3]\d\s+2, or \m[blue]Tesseract\m[]\s-2\u[4]\d\s+2) and converts them to a djvused script.
Unless a filename is explicitly provided on the command line, hOCR is read from the standard input.
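As an illustration of the two input modes described above, the following Python sketch builds a command line and, when no file is named, feeds hOCR to the program on standard input. The helper names build_command and convert are hypothetical and not part of ocrodjvu; only the program name and the --details option are taken from this manual page.

```python
import shutil
import subprocess


def build_command(hocr_files=(), details="words"):
    """Return an argv list for hocr2djvused (hypothetical helper)."""
    if details not in ("lines", "words", "chars"):
        raise ValueError("details must be 'lines', 'words' or 'chars'")
    # Options come first, then the optional list of hOCR files.
    return ["hocr2djvused", "--details=" + details, *hocr_files]


def convert(hocr_bytes, details="words"):
    """Pass hOCR on standard input and return the djvused script as bytes."""
    if shutil.which("hocr2djvused") is None:
        raise FileNotFoundError("hocr2djvused is not installed")
    result = subprocess.run(
        build_command(details=details),  # no file names: read from stdin
        input=hocr_bytes,
        stdout=subprocess.PIPE,
        check=True,
    )
    return result.stdout
```

Calling build_command(["page1.html"], "chars") yields an argument list that names the file explicitly, in which case standard input is not read.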
-t lines, --details=lines
Record the location of every line, but not of individual words or characters.
-t words, --details=words
Record the location of every line and every word, but not of individual characters.
This is the default.
-t chars, --details=chars
Record the location of every line, every word, and every character.
--word-segmentation=simple
Consider each non-empty sequence of non-whitespace characters a single word.
This is the default, despite being linguistically incorrect.
--word-segmentation=uax29
Use the \m[blue]Unicode Text Segmentation\m[]\s-2\u[5]\d\s+2 algorithm to break lines into words.
This option breaks the assumption made by some DjVu tools that words are separated by spaces, and is therefore not recommended.
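The rule behind --word-segmentation=simple can be sketched in a few lines of Python. This is only an illustration of the rule as stated above, not the program's own code; the function name simple_words is hypothetical, and the exact set of characters treated as whitespace is assumed to match Python's str.split().

```python
def simple_words(line):
    """Split a line into maximal runs of non-whitespace characters."""
    # str.split() with no arguments splits on any whitespace run and
    # never returns empty strings, which matches "non-empty sequence
    # of non-whitespace characters".
    return line.split()


print(simple_words("Hello, world!  It's 12:30."))
```

Note that punctuation stays attached to the neighbouring word ("Hello," and "world!"), which is why this mode is called linguistically incorrect. A segmenter following the Unicode Text Segmentation rules would typically split such punctuation off into separate items.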
--rotation=n
Assume that DjVu pages are rotated by n degrees.
--page-size=widthxheight
Specify that the page size is width pixels \(mu height pixels.
This option is required for hOCR generated by Cuneiform (< 0.8) and superfluous otherwise.
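A minimal Python sketch of how a widthxheight value might be parsed and used is given below. It rests on an assumption not stated in this manual page: that the page height is needed to convert hOCR bounding boxes, whose origin is the top-left corner, into DjVu text coordinates, whose origin is the bottom-left corner. The function names parse_page_size and hocr_bbox_to_djvu are hypothetical.

```python
def parse_page_size(value):
    """Parse a 'WIDTHxHEIGHT' string such as '2550x3300' into two ints."""
    width, sep, height = value.partition("x")
    if not sep:
        raise ValueError("expected WIDTHxHEIGHT, got %r" % (value,))
    width, height = int(width), int(height)  # int() raises ValueError on junk
    if width <= 0 or height <= 0:
        raise ValueError("page dimensions must be positive")
    return width, height


def hocr_bbox_to_djvu(bbox, page_height):
    """Flip a top-left-origin box (x0, y0, x1, y1) to a bottom-left origin."""
    x0, y0, x1, y1 = bbox
    # Assumption: only the vertical axis changes; x coordinates are kept.
    return x0, page_height - y1, x1, page_height - y0
```

For a page declared with --page-size=2550x3300, a box whose top edge is 20 pixels below the top of the page ends up with its upper edge at y = 3280 in the flipped coordinate system.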
--html5
Use an \m[blue]HTML5 parser\m[]\s-2\u[6]\d\s+2, which is more robust but slower than the default parser.
--fix-utf8
Attempt to fix UTF-8 encoding issues and eliminate unwanted control characters.
This option might be needed for hOCR generated by Cuneiform\s-2\u[7]\d\s+2 or Tesseract\s-2\u[8]\d\s+2.
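The kind of clean-up that --fix-utf8 refers to can be approximated as follows. This Python sketch is an assumption about what such a repair might involve, not a description of the program's actual behaviour: it replaces undecodable bytes with U+FFFD and drops control characters other than tab, line feed, and carriage return. The function name fix_utf8 is hypothetical.

```python
import unicodedata


def fix_utf8(data):
    """Decode bytes leniently and strip unwanted control characters."""
    # Invalid byte sequences become U+FFFD instead of raising an error.
    text = data.decode("utf-8", errors="replace")
    return "".join(
        ch for ch in text
        # Keep ordinary whitespace controls; drop every other character
        # in Unicode category Cc (C0 and C1 controls).
        if ch in "\t\n\r" or unicodedata.category(ch) != "Cc"
    )
```

Whether a real repair step should replace, drop, or re-interpret invalid bytes depends on how the OCR engine produced them; replacement is chosen here only because it keeps the damage visible.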
--version
Output version information and exit.
-h, --help
Display help and exit.
Please report bugs at: \m[blue]https://bitbucket.org/jwilk/ocrodjvu/issues\m[]
hOCR
\m[blue]https://docs.google.com/View?docid=dfxcv4vc_67g844kf\m[]
OCRopus
\m[blue]https://code.google.com/p/ocropus/\m[]
Cuneiform
\m[blue]https://launchpad.net/cuneiform-linux\m[]
Tesseract
\m[blue]https://code.google.com/p/tesseract-ocr/\m[]
Unicode Text Segmentation
\m[blue]http://unicode.org/reports/tr29/\m[]
HTML5 parser
\m[blue]http://www.whatwg.org/specs/web-apps/current-work/#html-parser\m[]
\m[blue]https://bugs.launchpad.net/cuneiform-linux/+bug/585418\m[]
\m[blue]https://code.google.com/p/tesseract-ocr/issues/detail?id=690\m[]