Frank John Bruzzaniti
2009-Feb-02 15:05 UTC
[Xapian-discuss] Using Open Office to convert documents.
I wrote a little Python script (oOC.py) that I could insert as one of the "helper" apps; it uses unoconv and OpenOffice to convert documents to text. E.g. I was having trouble converting *.doc files that had been saved with WordPerfect, as antiword didn't decode them, so I substituted oOC.py in the line in omindex that contains antiword.

Theoretically oOC can convert almost any format supported by OpenOffice and unoconv. I've done some initial testing and it seems to work OK. I wouldn't recommend it in a production environment without lots of testing; I decided to email it out of curiosity.

Basically it runs a headless copy of OpenOffice, which should stay running and accept requests from unoconv, which prints the results to stdout.

#!/usr/bin/python
# Python script to convert documents via OpenOffice for Xapian-Omega
# By Frank J Bruzzaniti
# frank.bruzzaniti at gmail.com

import os, sys, time
from subprocess import *

# Get pid of any running soffice processes
getpid = Popen(["ps -ef | grep -v grep | grep '/usr/lib/openoffice/program/soffice.bin -headless -accept=socket,host=127.0.0.1,port=2002;urp; -nofirststartwizard' | cut -f3 -d' '"], shell=True, stdout=PIPE).stdout

# Save pid, might be useful
pid = getpid.read()
#print "PID=" + pid

# If soffice is not running, start it and wait 5 secs
if pid == "":
    Popen(['soffice -headless -accept="socket,host=127.0.0.1,port=2002;urp;" -nofirststartwizard'], shell=True)
    #print "I didn't find soffice running so I'm starting one now and waiting 5 secs"
    time.sleep(5)

# Run unoconv
os.system('unoconv --stdout -f text ' + sys.argv[1])
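One possible variation on the script above, offered only as a sketch and not as the author's method: instead of grepping the output of ps, it checks whether anything is accepting connections on 127.0.0.1:2002, and it passes the file name to unoconv as a separate argument so that paths containing spaces or shell metacharacters are not reinterpreted by a shell. It assumes Python 3, where the original targets Python 2. The function names (port_is_open, soffice_command, unoconv_command, ensure_listener, main) and the 10 second wait are made up for illustration; the soffice and unoconv options are copied from the original script.

```python
import socket
import subprocess
import sys
import time

# Listener address taken from the original script.
HOST, PORT = "127.0.0.1", 2002


def port_is_open(host=HOST, port=PORT, timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def soffice_command(host=HOST, port=PORT):
    """Build the headless soffice command line as an argument list."""
    return [
        "soffice",
        "-headless",
        "-accept=socket,host=%s,port=%d;urp;" % (host, port),
        "-nofirststartwizard",
    ]


def unoconv_command(path):
    """Build the unoconv command line; path stays a single argument."""
    return ["unoconv", "--stdout", "-f", "text", path]


def ensure_listener(wait_secs=10.0):
    """Start soffice if nothing is listening, then poll until it is."""
    if port_is_open():
        return True
    subprocess.Popen(soffice_command())
    deadline = time.monotonic() + wait_secs
    while time.monotonic() < deadline:
        if port_is_open():
            return True
        time.sleep(0.5)
    return False


def main(argv):
    """Convert argv[1] to text on stdout; return a process exit status."""
    if len(argv) != 2:
        sys.stderr.write("usage: oOC.py FILE\n")
        return 2
    if not ensure_listener():
        sys.stderr.write("soffice did not start listening on port %d\n" % PORT)
        return 1
    return subprocess.call(unoconv_command(argv[1]))
```

To use it as a helper, the file would end with sys.exit(main(sys.argv)). Polling the port replaces the fixed five second sleep, so the script continues as soon as soffice is ready and reports an error if it never comes up.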
James Aylett
2009-Feb-02 19:25 UTC
[Xapian-discuss] Using Open Office to convert documents.
On Tue, Feb 03, 2009 at 01:35:14AM +1030, Frank John Bruzzaniti wrote:
> I wrote a little python script (oOC.py) that I could insert as one
> of the "helper" apps that uses unoconv and openoffice to convert
> documents to text.

Cool--any chance you could slap a license on it, or put it up on the wiki or something? I have a feeling it'd be useful for lots of folk.

J

--
James Aylett
talktorex.co.uk - xapian.org - uncertaintydivision.org