thr3ads.net - Xapian devel - [Xapian-discuss] Dealing with image PDF's [Jul 2008]

Frank Bruzzaniti

2008-Jul-30 18:39 UTC

[Xapian-discuss] Dealing with image PDF's

Guys,

I was just playing around and added a bit of code to omindex.cc so I 
could ocr tiff and tif with gocr which seems to work. Here's what it 
looks like:

 // Tiff:
    } else if (startswith(mimetype, "image/tif"))
    {
    // Inspired by http://mjr.towers.org.uk/comp/sxw2text
    string safefile = shell_protect(file);
    string cmd = "tifftopnm " + safefile + " | gocr -f UTF8 -";
    try {
        dump = stdout_to_string(cmd);
    } catch (ReadError) {
        cout << "\"" << cmd << "\" failed - skipping\n";
        return;
    }
    // Tiff:End

I don't really understand all the code in omindex.cc but was wondering 
if I could OCR when no text was returned while trying to process PDF's 
as a way of dealing with image only PDF's.

Here's the bit in omindex.cc that deals with pdf's:

} else if (mimetype == "application/pdf") {
    string safefile = shell_protect(file);
    string cmd = "pdftotext -enc UTF-8 " + safefile + " -";
    try {
        dump = stdout_to_string(cmd);
    } catch (ReadError) {
        cout << "\"" << cmd << "\" failed - skipping\n";
        return;
    }

I wanted to change it so if nothing (or no strings) was returned from 
"pdftotext -enc UTF-8 " + safefile + " -";   then run
"pdftoppm " + safefile + " | gocr -f UTF8 -";

P.S. I was able to write similar snippets of code to process docx and 
xlsx, so far so good, if they test ok should I post them somewhere or 
email them to someone?

Thanks,

Frank

Reini Urban

2008-Jul-31 07:53 UTC

[Xapian-devel] [Xapian-discuss] Dealing with image PDF's

2008/7/30 Frank Bruzzaniti <frank.bruzzaniti at gmail.com>:
>    // Inspired by http://mjr.towers.org.uk/comp/sxw2text
>    string safefile = shell_protect(file);
>    string cmd = "tifftopnm " + safefile + " | gocr -f UTF8 -";
>    try {
>        dump = stdout_to_string(cmd);
>    } catch (ReadError) {
>        cout << "\"" << cmd << "\" failed - skipping\n";
>        return;
>    }

Can we finally please use configure checks for such weird helper apps,
to avoid runtime exceptions where the system clearly has no such app?

I once provided a huge patch to do that.
http://thread.gmane.org/gmane.comp.search.xapian.devel/783/

Applied to 1.0.5 it is attached. But there's much more in this patch
so some parts may be stripped. See ChangeLog.
TEXTCAT support for language and charset detection, cached virtual
directories (zip,msg,pst,...) to name a few. Works fine for me for two
years and I haven't touched it since 0.9.6.
-- 
Reini Urban
http://phpwiki.org/ http://murbreak.at/
-------------- next part --------------
A non-text attachment was scrubbed...
Name: xapian-omega-1.0.5a.patch.gz
Type: application/x-gzip
Size: 42949 bytes
Desc: not available
URL: <http://lists.xapian.org/pipermail/xapian-devel/attachments/20080731/e1df52e7/attachment-0002.bin>

Richard Boulton

2008-Jul-31 08:55 UTC

[Xapian-devel] [Xapian-discuss] Dealing with image PDF's

Reini Urban wrote:
> 2008/7/30 Frank Bruzzaniti <frank.bruzzaniti at gmail.com>:
>>    // Inspired by http://mjr.towers.org.uk/comp/sxw2text
>>    string safefile = shell_protect(file);
>>    string cmd = "tifftopnm " + safefile + " | gocr -f UTF8 -";
>>    try {
>>        dump = stdout_to_string(cmd);
>>    } catch (ReadError) {
>>        cout << "\"" << cmd << "\" failed - skipping\n";
>>        return;
>>    }
> 
> Can we finally please use configure checks for such weird helper apps,
> to avoid runtime exceptions where the system clearly has no such app?
> 
> I once provided a huge patch to do that.
> http://thread.gmane.org/gmane.comp.search.xapian.devel/783/

Perhaps the patch should go in a ticket; that way, we're less likely to 
forget about it.

> Applied to 1.0.5 it is attached. But there's much more in this patch
> so some parts may be stripped. See ChangeLog.
> TEXTCAT support for language and charset detection, cached virtual
> directories (zip,msg,pst,...) to name a few. Works fine for me for two
> years and I haven't touched it since 0.9.6.


Sounds useful.  However, I'm not sure that configure time is the right 
place to check for the existence of helper apps.  In particular, quite 
often omindex is installed from a pre-compiled package (for example, in 
Debian), and the helper apps present at configure time need therefore 
bear no relation to those present at runtime.

Perhaps omindex could be improved to handle missing helper applications 
- I've not actually looked at how it handles this recently, so I don't 
know if there's actually a problem, but if there is, the correct fix 
seems to me to be to handle missing helper applications gracefully, 
rather than disable them at configure time.  Perhaps omindex would keep 
a cache, during each run, of the helper applications which have been 
found to be missing, so it would only attempt to run each one once.
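
A per-run cache of missing helpers along these lines could be sketched as follows. This is only an illustration, not code from omindex: `HelperResult`, `MissingHelperCache`, `Runner` and `run_helper` are hypothetical names, and how a runner decides that a program is absent is left open (a POSIX shell, for instance, reports exit status 127 when a command is not found).

```cpp
#include <functional>
#include <set>
#include <string>

// Outcome of trying to run a helper program (hypothetical type).
enum class HelperResult { Ok, Failed, NotFound };

// Hypothetical per-run cache: remembers helpers that could not be found,
// so each missing program is attempted only once per omindex run.
class MissingHelperCache {
  public:
    bool is_missing(const std::string& prog) const {
        return missing_.count(prog) != 0;
    }
    void mark_missing(const std::string& prog) { missing_.insert(prog); }

  private:
    std::set<std::string> missing_;
};

// A runner executes the command line and classifies the outcome.
typedef std::function<HelperResult(const std::string&)> Runner;

// Runs `cmd` unless `prog` is already known to be missing.
HelperResult run_helper(MissingHelperCache& cache, const std::string& prog,
                        const std::string& cmd, const Runner& run) {
    if (cache.is_missing(prog)) return HelperResult::NotFound;
    HelperResult result = run(cmd);
    if (result == HelperResult::NotFound) cache.mark_missing(prog);
    return result;
}
```

An ordinary failure (for example a corrupt input file) is deliberately not cached, since the same helper may well succeed on the next document.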

-- 
Richard

Olly Betts

2008-Jul-31 12:26 UTC

[Xapian-discuss] Dealing with image PDF's

On Thu, Jul 31, 2008 at 04:09:39AM +0930, Frank Bruzzaniti wrote:
> I was just playing around and added a bit of code to omindex.cc so I 
> could ocr tiff and tif with gocr which seems to work. Here's what it 
> looks like:
> 
>  // Tiff:
>     } else if (startswith(mimetype, "image/tif"))

Just test (mimetype == "image/tiff") instead -- image/tif is just
incorrect.

>     {
>     // Inspired by http://mjr.towers.org.uk/comp/sxw2text

This comment is not relevant here.

>     string safefile = shell_protect(file);
>     string cmd = "tifftopnm " + safefile + " | gocr -f UTF8 -";
>     try {
>         dump = stdout_to_string(cmd);
>     } catch (ReadError) {
>         cout << "\"" << cmd << "\" failed - skipping\n";
>         return;
>     }
>     // Tiff:End

Interesting idea!  I tried it on the TIFF files I have here.  The
problems I noticed:

* On the TIFF icons I have from various packages, I get random junk from
  the OCR software, which we don't really want to be indexing.  I
  couldn't see an obvious option to tell gocr to "give up if there's
  nothing which looks like text".  Logos and graphics on pages of text
  also lead to random junk so perhaps a filtering step to drop it would
  be better anyway.

* On the multi-page scanned document I have, I only get the text from
  the first page.  I guess that's tifftopnm, but it doesn't seem to have
  an option to do "all pages".  Perhaps something else to do this
  conversion would be better?

* It OCRed a barcode in my document, which is cute, but we don't really
  want to index the XML-like tag as plain text:

  <barcode type="39" chars="12" code="*N04456664M*" crc="E" error="0.049" />
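
For the barcode case in the list above, one possible filtering step is sketched here. It assumes gocr always reports barcodes as self-closing `<barcode ... />` elements like the sample; `strip_barcode_tags` is a hypothetical name, not an existing omindex function, and this does nothing about the random junk from icons and logos.

```cpp
#include <string>

// Removes every self-closing <barcode ... /> element from OCR output.
// Assumption: barcode tags look like the sample in this thread.
std::string strip_barcode_tags(const std::string& text) {
    std::string out;
    std::string::size_type pos = 0;
    for (;;) {
        std::string::size_type start = text.find("<barcode ", pos);
        if (start == std::string::npos) break;
        std::string::size_type end = text.find("/>", start);
        if (end == std::string::npos) break;  // Unterminated tag: keep the rest.
        out.append(text, pos, start - pos);   // Copy the text before the tag.
        pos = end + 2;                        // Skip past the closing "/>".
    }
    out.append(text, pos, std::string::npos); // Copy whatever remains.
    return out;
}
```

It could be applied to `dump` straight after the `stdout_to_string(cmd)` call, before the text is indexed.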

> I don't really understand all the code in omindex.cc but was wondering 
> if I could OCR when no text was returned while trying to process PDF's 
> as a way of dealing with image only PDF's.
>
> Here's the bit in omindex.cc that deals with pdf's:
> 
> } else if (mimetype == "application/pdf") {
>     string safefile = shell_protect(file);
>     string cmd = "pdftotext -enc UTF-8 " + safefile + " -";
>     try {
>         dump = stdout_to_string(cmd);
>     } catch (ReadError) {
>         cout << "\"" << cmd << "\" failed - skipping\n";
>         return;
>     }
And then:

    if (dump.empty()) {
	// Do the OCR thing...
    }

Or if you can get an "empty" dump which actually just has whitespace
in it, then:

    if (dump.find_first_not_of(" \n\t") == string::npos) {
	// Do the OCR thing...
    }
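
Putting the two checks above together, the fallback might look roughly like this. It is a sketch under assumptions: `has_text` and `extract_with_fallback` are hypothetical names, and the two extractors are passed in as callbacks rather than calling omindex's `stdout_to_string()` directly, so the logic can be read on its own. The whitespace set also includes form feed, since pdftotext normally writes a form feed between pages (it has a `-nopgbrk` option to suppress them), so an image-only PDF may well yield form feeds rather than a truly empty string.

```cpp
#include <functional>
#include <string>

// True if `dump` contains at least one non-whitespace character.
// Form feed (\f) is treated as whitespace; see the note on pdftotext above.
bool has_text(const std::string& dump) {
    return dump.find_first_not_of(" \t\r\n\f\v") != std::string::npos;
}

// Tries the primary extractor first and only runs OCR when it yields
// nothing but whitespace. Both callbacks are stand-ins for shell commands.
std::string extract_with_fallback(const std::function<std::string()>& primary,
                                  const std::function<std::string()>& ocr) {
    std::string dump = primary();
    if (has_text(dump)) return dump;
    return ocr();
}
```

Because OCR is only attempted when text extraction comes back blank, ordinary PDFs pay no extra cost.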

> I wanted to change it so if nothing (or no strings) was returned from 
> "pdftotext -enc UTF-8 " + safefile + " -";   then run "pdftoppm " +
> safefile + " | gocr -f UTF8 -";

pdftoppm seems to produce one ppm file per page, rather than output on
stdout, so you'll need to extract to a temporary directory and then
read files from it.  See the PostScript handling code for how to work
with a temporary directory.
> P.S. I was able to write similar snippets of code to process docx and 
> xlsx, so far so good, if they test ok should I post them somewhere or 
> email them to someone?
Creating a new trac ticket and attaching the patch is probably best.

If you've not already done so, take a look at:

http://trac.xapian.org/wiki/FAQ/OmegaNewFileFormat

As that says, it's very helpful if you can update the documentation to
cover the new format(s) and supply some sample files which we can
redistribute for testing (my hope is to create an automated test suite
for omindex).

Cheers,
    Olly
