Guys,

I was just playing around and added a bit of code to omindex.cc so I could ocr tiff and tif with gocr which seems to work. Here's what it looks like:

    // Tiff:
    } else if (startswith(mimetype, "image/tif")) {
        // Inspired by http://mjr.towers.org.uk/comp/sxw2text
        string safefile = shell_protect(file);
        string cmd = "tifftopnm " + safefile + " | gocr -f UTF8 -";
        try {
            dump = stdout_to_string(cmd);
        } catch (ReadError) {
            cout << "\"" << cmd << "\" failed - skipping\n";
            return;
        }
    // Tiff:End

I don't really understand all the code in omindex.cc but was wondering if I could OCR when no text was returned while trying to process PDF's as a way of dealing with image only PDF's.

Here's the bit in omindex.cc that deals with pdf's:

    } else if (mimetype == "application/pdf") {
        string safefile = shell_protect(file);
        string cmd = "pdftotext -enc UTF-8 " + safefile + " -";
        try {
            dump = stdout_to_string(cmd);
        } catch (ReadError) {
            cout << "\"" << cmd << "\" failed - skipping\n";
            return;
        }

I wanted to change it so if nothing (or no strings) was returned from "pdftotext -enc UTF-8 " + safefile + " -"; then run "pdftoppm " + safefile + " | gocr -f UTF8 -";

P.S. I was able to write similar snippets of code to process docx and xlsx, so far so good, if they test ok should I post them somewhere or email them to someone?

Thanks,
Frank
Reini Urban
2008-Jul-31 07:53 UTC
[Xapian-devel] [Xapian-discuss] Dealing with image PDF's
2008/7/30 Frank Bruzzaniti <frank.bruzzaniti at gmail.com>:
>     // Inspired by http://mjr.towers.org.uk/comp/sxw2text
>     string safefile = shell_protect(file);
>     string cmd = "tifftopnm " + safefile + " | gocr -f UTF8 -";
>     try {
>         dump = stdout_to_string(cmd);
>     } catch (ReadError) {
>         cout << "\"" << cmd << "\" failed - skipping\n";
>         return;
>     }

Can we finally please use configure checks for such weird helper apps, to avoid runtime exceptions where the system clearly has no such app.

I once provided a huge patch to do that.
http://thread.gmane.org/gmane.comp.search.xapian.devel/783/

Applied to 1.0.5 it is attached. But there's much more in this patch so some parts may be stripped. See ChangeLog. TEXTCAT support for language and charset detection, cached virtual directories (zip,msg,pst,...) to name a few. Works fine for me for two years and I haven't touched it since 0.9.6.

--
Reini Urban
http://phpwiki.org/ http://murbreak.at/

-------------- next part --------------
A non-text attachment was scrubbed...
Name: xapian-omega-1.0.5a.patch.gz
Type: application/x-gzip
Size: 42949 bytes
Desc: not available
URL: <http://lists.xapian.org/pipermail/xapian-devel/attachments/20080731/e1df52e7/attachment-0002.bin>
Richard Boulton
2008-Jul-31 08:55 UTC
[Xapian-devel] [Xapian-discuss] Dealing with image PDF's
Reini Urban wrote:
> 2008/7/30 Frank Bruzzaniti <frank.bruzzaniti at gmail.com>:
>>     // Inspired by http://mjr.towers.org.uk/comp/sxw2text
>>     string safefile = shell_protect(file);
>>     string cmd = "tifftopnm " + safefile + " | gocr -f UTF8 -";
>>     try {
>>         dump = stdout_to_string(cmd);
>>     } catch (ReadError) {
>>         cout << "\"" << cmd << "\" failed - skipping\n";
>>         return;
>>     }
>
> Can we finally please use configure checks for such weird helper apps,
> to avoid runtime exceptions where the system clearly has no such app.
>
> I once provided a huge patch to do that.
> http://thread.gmane.org/gmane.comp.search.xapian.devel/783/

Perhaps the patch should go in a ticket; that way, we're less likely to forget about it.

> Applied to 1.0.5 it is attached. But there's much more in this patch
> so some parts may be stripped. See ChangeLog.
> TEXTCAT support for language and charset detection, cached virtual
> directories (zip,msg,pst,...) to name a few. Works fine for me for two
> years and I haven't touched it since 0.9.6.

Sounds useful.

However, I'm not sure that configure time is the right place to check for the existence of helper apps. In particular, quite often omindex is installed from a pre-compiled package (for example, in Debian), and the helper apps present at configure time need therefore bear no relation to those present at runtime.

Perhaps omindex could be improved to handle missing helper applications. I've not actually looked at how it handles this recently, so I don't know if there's actually a problem, but if there is, the correct fix seems to me to be to handle missing helper applications gracefully, rather than disable them at configure time. Perhaps omindex would keep a cache, during each run, of the helper applications which have been found to be missing, so it would only attempt to run each one once.

--
Richard
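For illustration, the per-run cache Richard describes might look roughly like the sketch below. Nothing here comes from omindex: `MissingHelperCache`, `program_of`, `mark_missing` and `is_missing` are hypothetical names, and the sketch assumes the caller already has some way of telling that a helper failed because it is not installed (a POSIX shell, for example, reports exit status 127 for a command it cannot find). Only the first program of a pipeline is tracked.

```cpp
#include <set>
#include <string>

// Hypothetical helper (not part of omindex): remembers, for the duration
// of one indexing run, which helper programs turned out to be missing.
class MissingHelperCache {
    std::set<std::string> missing_;

  public:
    // Return the program name, i.e. the first space-delimited word of
    // the command line. For a pipeline this is the first stage only.
    static std::string program_of(const std::string& cmd) {
        std::string::size_type start = cmd.find_first_not_of(' ');
        if (start == std::string::npos) return std::string();
        std::string::size_type end = cmd.find(' ', start);
        if (end == std::string::npos) return cmd.substr(start);
        return cmd.substr(start, end - start);
    }

    // Record that the program named by this command could not be run.
    void mark_missing(const std::string& cmd) {
        missing_.insert(program_of(cmd));
    }

    // True if an earlier attempt already found this program missing,
    // so the caller can skip the file without spawning a shell again.
    bool is_missing(const std::string& cmd) const {
        return missing_.count(program_of(cmd)) != 0;
    }
};
```

A caller would check `is_missing(cmd)` before running the command and call `mark_missing(cmd)` the first time the helper cannot be found, so each missing helper costs only one failed attempt per run.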
On Thu, Jul 31, 2008 at 04:09:39AM +0930, Frank Bruzzaniti wrote:
> I was just playing around and added a bit of code to omindex.cc so I
> could ocr tiff and tif with gocr which seems to work. Here's what it
> looks like:
>
>     // Tiff:
>     } else if (startswith(mimetype, "image/tif"))

Just test (mimetype == "image/tiff") instead -- image/tif is just incorrect.

>     {
>         // Inspired by http://mjr.towers.org.uk/comp/sxw2text

This comment is not relevant here.

>         string safefile = shell_protect(file);
>         string cmd = "tifftopnm " + safefile + " | gocr -f UTF8 -";
>         try {
>             dump = stdout_to_string(cmd);
>         } catch (ReadError) {
>             cout << "\"" << cmd << "\" failed - skipping\n";
>             return;
>         }
>     // Tiff:End

Interesting idea!

I tried it on the TIFF files I have here. The problems I noticed:

* On the TIFF icons I have from various packages, I get random junk from
  the OCR software, which we don't really want to be indexing. I couldn't
  see an obvious option to tell gocr to "give up if there's nothing which
  looks like text". Logos and graphics on pages of text also lead to
  random junk so perhaps a filtering step to drop it would be better
  anyway.

* On the multi-page scanned document I have, I only get the text from the
  first page. I guess that's tifftopnm, but it doesn't seem to have an
  option to do "all pages". Perhaps something else to do this conversion
  would be better?

* It OCRed a barcode in my document, which is cute, but we don't really
  want to index the XML-like tag as plain text:

  <barcode type="39" chars="12" code="*N04456664M*" crc="E" error="0.049" />

> I don't really understand all the code in omindex.cc but was wondering
> if I could OCR when no text was returned while trying to process PDF's
> as a way of dealing with image only PDF's.
>
> Here's the bit in omindex.cc that deals with pdf's:
>
>     } else if (mimetype == "application/pdf") {
>         string safefile = shell_protect(file);
>         string cmd = "pdftotext -enc UTF-8 " + safefile + " -";
>         try {
>             dump = stdout_to_string(cmd);
>         } catch (ReadError) {
>             cout << "\"" << cmd << "\" failed - skipping\n";
>             return;
>         }

And then:

    if (dump.empty()) {
        // Do the OCR thing...
    }

Or if you can get an "empty" dump which actually just has whitespace in then:

    if (dump.find_first_not_of(" \n\t") == string::npos) {
        // Do the OCR thing...
    }

> I wanted to change it so if nothing (or no strings) was returned from
> "pdftotext -enc UTF-8 " + safefile + " -"; then run "pdftoppm " +
> safefile + " | gocr -f UTF8 -";

pdftoppm seems to produce one ppm file per page, rather than output on stdout, so you'll need to extract to a temporary directory and then read files from it. See the PostScript handling code for how to work with a temporary directory.

> P.S. I was able to write similar snippets of code to process docx and
> xlsx, so far so good, if they test ok should I post them somewhere or
> email them to someone?

Creating a new trac ticket and attaching the patch is probably best. If you've not already done so, take a look at:

http://trac.xapian.org/wiki/FAQ/OmegaNewFileFormat

As that says, it's very helpful if you can update the documentation to cover the new format(s) and supply some sample files which we can redistribute for testing (my hope is to create an automated test suite for omindex).

Cheers,
    Olly
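The "fall back to OCR when pdftotext returns nothing" idea from this message can be sketched in isolation as below. This is a minimal sketch, not omindex code: `is_blank` and `extract_with_fallback` are hypothetical names, and the two extractors are passed in as callables so the decision logic can be tried without pdftotext or gocr installed. The whitespace set also includes `\r` and `\f`, on the assumption that pdftotext output for an image-only PDF may consist of nothing but page-separating form feeds.

```cpp
#include <functional>
#include <string>

// True if the extractor's output contains nothing but whitespace
// (an empty string counts as blank too).
bool is_blank(const std::string& dump) {
    return dump.find_first_not_of(" \t\r\n\f") == std::string::npos;
}

// Run the primary extractor; only if it yields no usable text, run the
// (presumably slower) fallback such as an OCR pipeline.
std::string extract_with_fallback(
        const std::function<std::string()>& primary,
        const std::function<std::string()>& fallback) {
    std::string dump = primary();
    if (is_blank(dump)) {
        dump = fallback();
    }
    return dump;
}
```

In omindex itself, the primary callable would presumably wrap the existing `stdout_to_string(cmd)` call for pdftotext, and the fallback would wrap whatever temporary-directory OCR step ends up being used; exceptions such as `ReadError` would still need handling around both.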