On Thu, Mar 19, 2009 at 03:36:47PM +1030, Frank J Bruzzaniti wrote:
> I've been experimenting using tesseract to OCR tiff's with omega just
> using the tesseract binary package from Ubuntu.
Is tesseract better than gocr? In the previous discussion of this I
noted that gocr generates random junk from logos and graphics, and XML
tags for barcodes, and the pipeline used didn't handle multi-page
documents:
http://thread.gmane.org/gmane.comp.search.xapian.general/6336/focus
> The one issue I find is that tesseract is sooo slow.
>
> One work around so ocr'ing doesn't hold up omindex would be to maintain
> a separate instance of omindex and a separate database of ocr'd data
> then allow them both to be searched via the "stub database" method. I'd
> definitely wanna use last_mod patch here so I don't have to re-ocr.
>
> Does this sound reasonable, if anyone has any better solutions I'd love
> to hear of them. Once I've got it sorted I'll submit a patch. Maybe we
> could have a flag for omindex so it knows if it's designated just to ocr
> tiff's.
You don't actually need new flags for this - you can just specify -M
flags to disable subsets of mimetypes for each indexing run.
That's quite fiddly for the "everything but tiffs" case, but perhaps the
best way to deal with that is to add a "don't add the default mime
mapping" option, rather than something very specific to OCRing tiffs.
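
As a rough illustration of the two-database setup, here is a minimal Python
sketch that builds the omindex command line for the main run (with the tiff
extensions disabled via -M) and the text of a stub database file pointing at
both databases. The paths and the helper names omindex_command and
stub_contents are made up for the example. It also assumes that an empty
TYPE in "-M EXT:" is what removes the mapping for EXT in your omindex
version, and that the stub file uses "auto PATH" lines, so check both
against your installed Xapian/Omega before relying on it.

```python
import shlex


def omindex_command(db, url, directory, disabled_exts=()):
    """Build an omindex argument list with some extension mappings disabled."""
    args = ["omindex", "--db", db, "--url", url]
    for ext in disabled_exts:
        # Assumption: "-M EXT:" (empty TYPE) removes the mapping for EXT.
        args += ["-M", f"{ext}:"]
    args.append(directory)
    return args


def stub_contents(db_paths):
    """Return stub database file text: one 'auto PATH' line per database."""
    return "".join(f"auto {path}\n" for path in db_paths)


# Hypothetical paths, for illustration only.
main_run = omindex_command(
    "/var/lib/omega/data/main", "/", "/srv/docs", ["tif", "tiff"]
)
print(shlex.join(main_run))
print(
    stub_contents(["/var/lib/omega/data/main", "/var/lib/omega/data/ocr"]),
    end="",
)
```

The OCR-only run is the fiddly direction: with this approach it would need
one "-M EXT:" for every other extension in the default mapping.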
> I guess we could also ocr image pdf's if they come back with no data from
> the regular pdf filter.
We've been here before too - see the post linked to above.
Cheers,
Olly