Frank John Bruzzaniti
2009-Feb-02 15:05 UTC
[Xapian-discuss] Using Open Office to convert documents.
I wrote a little Python script (oOC.py) that I could insert as one of
the "helper" apps; it uses unoconv and OpenOffice to convert documents
to text. E.g. I was having trouble converting *.doc files that were
saved with WordPerfect, as antiword didn't decode them, so I replaced
antiword with oOC.py in the line in omindex that calls it.
Theoretically oOC.py can convert almost any format supported by
OpenOffice and unoconv.
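
As an alternative to replacing antiword outright, a small wrapper could try antiword first and only fall back to oOC.py when antiword fails. The following is a minimal sketch, not something taken from omindex: the function name first_successful is hypothetical, and it assumes that each converter writes the extracted text to stdout and exits with a non-zero status when it cannot decode the file.

```python
import subprocess

def first_successful(commands):
    """Run each command in order and return the stdout (bytes) of the
    first one that exits with status 0, or None if none succeeds."""
    for argv in commands:
        try:
            proc = subprocess.Popen(argv, stdout=subprocess.PIPE,
                                    stderr=subprocess.PIPE)
        except OSError:
            # Program is not installed or not executable: try the next one
            continue
        out, _err = proc.communicate()
        if proc.returncode == 0:
            return out
    return None
```

A call might look like first_successful([["antiword", path], ["python", "oOC.py", path]]), where path is the document handed over by the indexer. Whether antiword's exit status is reliable enough for this on WordPerfect-saved files would need checking.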
I've done some initial testing and it seems to work OK. I wouldn't
recommend it in a production environment without lots of testing; I
decided to email it for the sake of curiosity.
Basically it starts a headless copy of OpenOffice, which should stay
running and accept requests from unoconv, and prints the results to
stdout.
#!/usr/bin/python
# Python script to convert documents via OpenOffice for Xapian-Omega
# By Frank J Bruzzaniti
# frank.bruzzaniti at gmail.com
import subprocess, sys, time

# Get pid of any running headless soffice process
getpid = subprocess.Popen(
    "ps -ef | grep -v grep | grep "
    "'/usr/lib/openoffice/program/soffice.bin -headless "
    "-accept=socket,host=127.0.0.1,port=2002;urp; -nofirststartwizard' "
    "| awk '{print $2}'",
    shell=True, stdout=subprocess.PIPE)
# Save pid, might be useful
pid = getpid.communicate()[0].strip()
#print "PID=" + pid

# If soffice is not running, start it and wait 5 secs
if not pid:
    subprocess.Popen(
        'soffice -headless '
        '-accept="socket,host=127.0.0.1,port=2002;urp;" '
        '-nofirststartwizard',
        shell=True)
    #print "I didn't find soffice running so I'm starting one now and waiting 5 secs"
    time.sleep(5)

# Run unoconv; an argument list keeps file names with spaces intact
subprocess.call(['unoconv', '--stdout', '-f', 'text', sys.argv[1]])
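
The fixed five-second wait is a guess at how long soffice needs to start. One possible refinement, sketched here under assumptions, is to poll the listening socket (127.0.0.1, port 2002 in the script) until it accepts a connection. The helper name wait_for_port and its timeout and interval defaults are hypothetical, and an accepted TCP connection does not prove that OpenOffice is fully ready to convert, only that something is listening.

```python
import socket
import time

def wait_for_port(host, port, timeout=30.0, interval=0.5):
    """Return True once a TCP connection to host:port succeeds,
    or False if none succeeds before the timeout expires."""
    deadline = time.time() + timeout
    while True:
        try:
            # Raises an OSError subclass if nothing is listening yet
            conn = socket.create_connection((host, port), timeout=interval)
        except OSError:
            if time.time() >= deadline:
                return False
            time.sleep(interval)
        else:
            conn.close()
            return True
```

In the script, time.sleep(5) could then be replaced by something like wait_for_port("127.0.0.1", 2002), with the script giving up if it returns False.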
James Aylett
2009-Feb-02 19:25 UTC
[Xapian-discuss] Using Open Office to convert documents.
On Tue, Feb 03, 2009 at 01:35:14AM +1030, Frank John Bruzzaniti wrote:
> I wrote a little python script (oOC.py) that I could insert as one
> of the "helper" apps that uses unoconv and openoffice to convert
> documents to text.

Cool--any chance you could slap a license on it, or put it up on the
wiki or something? I have a feeling it'd be useful for lots of folk.

J

--
James Aylett
talktorex.co.uk - xapian.org - uncertaintydivision.org