thr3ads.net - Xapian discuss - [Xapian-discuss] tiff / image pdf filter [Mar 2009]

Frank J Bruzzaniti

2009-Mar-19 05:06 UTC

[Xapian-discuss] tiff / image pdf filter

I've been experimenting with using tesseract to OCR TIFFs with omega, just
using the tesseract binary package from Ubuntu.

The one issue I've found is that tesseract is really slow.

One workaround, so that OCRing doesn't hold up omindex, would be to maintain
a separate instance of omindex and a separate database of OCR'd data,
then allow them both to be searched via the "stub database" method.
I'd definitely want to use the last_mod patch here so I don't have to re-OCR.
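
As a rough illustration of the "stub database" idea, here is a minimal Python sketch that writes a stub file pointing at both databases. The database paths are made up for the example, and it assumes your Xapian version accepts more than one "type path" line in a stub file, so check the admin notes for the release you run.

```python
from pathlib import Path


def write_stub(stub_path, db_paths, backend="auto"):
    """Write a Xapian stub database file, one "backend path" line per database.

    "auto" asks Xapian to detect the backend of each database itself.
    Returns the lines written, which is handy for logging.
    """
    lines = [f"{backend} {path}" for path in db_paths]
    Path(stub_path).write_text("\n".join(lines) + "\n")
    return lines


if __name__ == "__main__":
    # Hypothetical locations: the normal omindex database and the OCR-only one.
    write_stub(
        "combined.stub",
        ["/var/lib/omega/data/default", "/var/lib/omega/data/ocr"],
    )
```

The search front end would then be pointed at the stub file rather than at either database directly, so the slow OCR run can update its own database without blocking the regular indexing run.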

Does this sound reasonable? If anyone has any better solutions I'd love
to hear of them.  Once I've got it sorted I'll submit a patch. Maybe we
could have a flag for omindex so it knows if it's designated just to OCR
TIFFs.

I guess we could also OCR image PDFs if they come back with no data from
the regular PDF filter. E.g. if you run something like omindex --tiff --ipdf
then it will only OCR TIFFs and image PDFs: it runs the regular PDF filter
first, and if that returns data it skips the file; if it doesn't, it OCRs it.
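
The "try the regular filter, OCR only if it comes back empty" logic can be sketched in Python as below. This is only an outline under some assumptions: pdftotext stands in for the regular PDF filter, and the OCR step is left as a function you pass in (here called ocr_filter, a name invented for this example), since tesseract needs page images rather than a PDF.

```python
import subprocess


def pdf_text(path):
    """Return the text layer of a PDF using pdftotext ("-" sends output to stdout)."""
    result = subprocess.run(
        ["pdftotext", str(path), "-"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def extract_with_fallback(path, text_filter, ocr_filter):
    """Return (text, used_ocr).

    The cheap text filter runs first. Only when it yields nothing but
    whitespace (an image-only PDF typically gives just page-break
    characters) is the slow OCR filter called.
    """
    text = text_filter(path)
    if text.strip():
        return text, False
    return ocr_filter(path), True
```

An OCR-only indexing run would then keep the files where used_ocr is true and skip the rest, which is the behaviour described above.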

Frank

Olly Betts

2009-Mar-19 22:42 UTC

[Xapian-discuss] tiff / image pdf filter

On Thu, Mar 19, 2009 at 03:36:47PM +1030, Frank J Bruzzaniti wrote:
> I've been experimenting with using tesseract to OCR TIFFs with omega, just
> using the tesseract binary package from Ubuntu.

Is tesseract better than gocr?  In the previous discussion of this I
noted that gocr generates random junk from logos and graphics, and XML
tags for barcodes, and the pipeline used didn't handle multi-page
documents:

http://thread.gmane.org/gmane.comp.search.xapian.general/6336/focus

> The one issue I've found is that tesseract is really slow.
>
> One workaround, so that OCRing doesn't hold up omindex, would be to maintain
> a separate instance of omindex and a separate database of OCR'd data,
> then allow them both to be searched via the "stub database" method.
> I'd definitely want to use the last_mod patch here so I don't have to re-OCR.
>
> Does this sound reasonable? If anyone has any better solutions I'd love
> to hear of them.  Once I've got it sorted I'll submit a patch. Maybe we
> could have a flag for omindex so it knows if it's designated just to OCR
> TIFFs.

You don't actually need new flags for this - you can just specify -M
flags to disable subsets of mimetypes for each indexing run.
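
For illustration, here is a small Python sketch that builds such an omindex command line. It assumes that -M takes EXT:TYPE and that an empty TYPE drops the mapping for that extension; that is worth confirming against omindex --help for your version. The database path, URL and directory are placeholders.

```python
def omindex_args(db, url, directory, skip_exts=()):
    """Build an omindex argv that ignores files with the given extensions."""
    args = ["omindex", "--db", db, "--url", url]
    for ext in skip_exts:
        # Assumption: "EXT:" (empty TYPE) removes the MIME mapping for EXT,
        # so files with that extension are skipped in this run.
        args += ["-M", f"{ext}:"]
    args.append(directory)
    return args


if __name__ == "__main__":
    # The regular (fast) run: everything except TIFFs.
    print(" ".join(omindex_args(
        "/var/lib/omega/data/default", "/", "/srv/docs",
        skip_exts=["tif", "tiff"],
    )))
```

The OCR-only run is the awkward direction, because every other extension would need its own -M entry.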

That's quite fiddly for the "everything but TIFFs" case, but perhaps the
best way to deal with that is to add a "don't add the default mime
mapping" option, rather than something very specific to OCRing TIFFs.

> I guess we could also OCR image PDFs if they come back with no data from
> the regular PDF filter.

We've been here before too - see the post linked to above.

Cheers,
    Olly
