thr3ads.net - Xapian discuss - indexing pdf errors [Mar 2021]

Henk L.

2021-Mar-29 08:10 UTC

indexing pdf errors

I have indexed pdf's with omindex. In this batch not all went well,
though the files got indexed. I get errors:

Syntax Warning: Couldn't link the profiles
Syntax Warning: Can't create transform

These pdf's are the output of a scan and ocr process. So they contain
text.  

Is there a way I can find out what happened?

Olly Betts

2021-Mar-31 21:14 UTC

On Mon, Mar 29, 2021 at 10:10:39AM +0200, Henk L. wrote:
> I have indexed pdf's with omindex. In this batch not all went well,
> though the files got indexed. I get errors:
> 
> Syntax Warning: Couldn't link the profiles
> Syntax Warning: Can't create transform
> 
> These pdf's are the output of a scan and ocr process. So they contain
> text.  
> 
> Is there a way I can find out what happened?

Assuming Xapian 1.4.x where x >= 10, we extract text from PDFs by piping
them to:

    pdftotext -enc UTF-8 - -

For some PDF files pdftotext emits warning messages.

These are presumably due to invalid structure within the PDF file
(caused either by bugs in the tool that made it or by corruption of the
file since), or possibly to bugs in libpoppler (which pdftotext uses to
do most of the actual work).

Most of them sound like things that aren't important for extracting just
the text (e.g. "Can't create transform" sounds like a graphics
coordinate mapping problem.)

You can test by hand with the command above and see what text is
actually extracted.
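
As a rough sketch of that manual test, the following keeps the extracted
text and the warning messages in separate files so each can be inspected
on its own. The names check_pdf and PDFTOTEXT are hypothetical, made up
for this example; only the pdftotext arguments come from the command
above.

```shell
#!/bin/sh
# Sketch only: check_pdf and PDFTOTEXT are illustrative names, not part of
# Xapian or poppler. PDFTOTEXT lets you point at a different binary.
PDFTOTEXT=${PDFTOTEXT:-pdftotext}

check_pdf() {
    pdf=$1
    out=${2:-extracted.txt}    # where the extracted text goes
    warn=${3:-warnings.txt}    # where warnings and errors go
    # Same arguments as above: PDF on stdin, UTF-8 text on stdout.
    "$PDFTOTEXT" -enc UTF-8 - - < "$pdf" > "$out" 2> "$warn"
    status=$?
    echo "exit status: $status"
    echo "bytes of text: $(wc -c < "$out" | tr -d ' ')"
    echo "warning lines: $(wc -l < "$warn" | tr -d ' ')"
    return "$status"
}
```

Running something like "check_pdf scan.pdf" and then reading
extracted.txt should show whether the OCR text came through despite the
warnings.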

Possibly we should run pdftotext with -q to disable such messages,
though that seems to silence errors too, which is less helpful.  There
doesn't seem to be any finer level of control.
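
To illustrate the trade-off, here is a minimal sketch of running with -q
by hand and relying on the exit status instead of the messages. It
assumes pdftotext exits with a non-zero status when it fails, as its man
page describes; extract_quiet and PDFTOTEXT are hypothetical names used
only here, and this is not what omindex currently does.

```shell
#!/bin/sh
# Sketch only: extract_quiet and PDFTOTEXT are illustrative names.
PDFTOTEXT=${PDFTOTEXT:-pdftotext}

extract_quiet() {
    # $1 is the input PDF, $2 the file to write the text to.
    # -q suppresses warnings and errors alike, so only the exit
    # status tells us whether extraction worked.
    if "$PDFTOTEXT" -q -enc UTF-8 - - < "$1" > "$2"; then
        echo "ok: $1"
    else
        echo "failed (exit status $?): $1"
        return 1
    fi
}
```

This tells you that a file failed but not why, which is the loss of
detail described above.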

(Or if you're running Omega from git master you may be using the new
worker module for libpoppler, but mostly that just means we don't fork()
and exec() for each PDF file indexed - these messages are still coming
from libpoppler, and we could set the option that pdftotext -q sets to
disable them.)

Cheers,
    Olly
