Olly Betts writes:
> On Thu, Dec 12, 2013 at 03:11:29PM +0100, jf at dockes.org wrote:
> > I've had a heads up from a user that catppt did not work at all on
> > semi-recent PowerPoint files (ppt, not pptx). I checked, and indeed it
> > misses most of the content on many files.
> >
> > After looking around, I found Python code from the LibreOffice project
> > which makes a nice ppt text extractor after adding a very thin
> > command line wrapper:
> >
> > http://cgit.freedesktop.org/libreoffice/contrib/mso-dumper/
> >
> > It's pure Python, no other dependencies, orders of magnitude faster than
> > unoconv, and contrary to catppt, does extract the text...
> >
> > Just in case this can be useful to Omega... I can provide more details
> > of course.
>
> Thanks, that is interesting.
>
> Another option coming soon is liblibreoffice, which debuts in LibreOffice
> 4.2 - currently in beta, due for release late January or early February
> 2014:
>
> http://cgit.freedesktop.org/libreoffice/core/tree/desktop/inc/
>
> It looks like the current API requires saving to a temporary file.
>
> I haven't tried this yet, so I'm not sure about speed, but it should
> avoid a lot of the overhead of unoconv.

After doing a number of informal tests with unoconv, I have more or less
come to the conclusion that the abysmal performance on ppt files is due to
the time needed to process graphics, not to the client-server overhead (for
example, performance does not change much if the server is already
started). Then there are the incessant crashes. Or maybe I just did not
find the right options.

It will be interesting to see if liblibreoffice does better, but what I
like about the Python code is that I can ship it today (as a zip package +
script), without having to add dependencies and wait for packaging or
backporting.

For the sake of completeness there is also this:

http://silvercoders.com/en/products/doctotext/

It's commercial GPL, based on the wvWare libs, and works extremely well on
everything I tried it on. It is another order of magnitude faster than the
Python version (and also a bit better at eliminating spurious text), but
the build system is abysmal and it's not packaged anywhere. So I'm going
with Python for now...
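
For anyone who wants to repeat this kind of informal speed comparison, here
is a minimal sketch of how the timings could be taken. It is not the script
used for the tests above: time_command is a hypothetical helper name, and
the extractor command line is simply whatever list of arguments you pass in.
A non-zero exit status is reported as an error, so that a crashing converter
does not show up as a fast one.

```python
import subprocess
import time


def time_command(cmd, runs=3):
    """Run cmd `runs` times; return the wall-clock duration of each run in seconds.

    cmd is an argument list, e.g. ["some-extractor", "file.ppt"] (placeholder names).
    """
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        # The extracted text is discarded: only elapsed time and exit status matter.
        proc = subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        elapsed = time.perf_counter() - start
        if proc.returncode != 0:
            # Treat a crash or failure as an error, not as a timing sample.
            raise RuntimeError(
                "command failed with exit status %d: %r" % (proc.returncode, cmd)
            )
        durations.append(elapsed)
    return durations
```

Something like min(time_command(["catppt", "slides.ppt"])) then gives a
best-of-three figure for one extractor, where slides.ppt stands for any test
file you have at hand. Taking the minimum rather than the mean reduces the
effect of a cold disk cache on the first run.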
Cheers,
jf