On Tue, Sep 27, 2016 at 04:32:33PM -0400, John Bankert wrote:
> I've run into a couple of things using omega/omindex under cygwin. I don't
> think I'd attribute them to xapian, omega or omindex, but wanted to get
> them out to the list so that if anyone else should run into these things
> down the road, hopefully someone will remember and be able to help.
>
> 1) after compiling and building omega, and doing make install, I get a set
> violation when trying to run omindex from its installed location under
> cygwin. I worked around this by copying various required windows dll files
> into the same directory as omindex.exe and presto, success.
I've no idea what a "set violation" is - is that a typo for "seg violation"
(short for "segmentation violation")?
Not sure I can offer much insight into this though - I haven't had to wrangle
DLLs for several decades.
> 2) There appears to be some sort of weird path issue in using omindex in
> the cygwin bash shell. Using the path /www/example/product should, in cygwin
> bash, act as a fully defined path to the directory to be indexed by omindex.
> This is not the case. I had to produce a relative path from where
> omindex.exe was running in order to successfully index the files in
> /www/example/product.
I tried to set up a build on appveyor to reproduce this, but it works for me:
https://ci.appveyor.com/project/ojwb/xapian/build/1.0.30
In particular:
bash -c 'xapian-applications/omega/omindex -v --db omtest.db --url msproducts /www/example/products/'
[Entering directory ""]
Indexing "example.docx" as application/vnd.openxmlformats-officedocument.wordprocessingml.document ... Skipping - "unzip -p '/www/example/products/example.docx' word/document.xml 'word/header*.xml' 'word/footer*.xml' 2>/dev/null" failed
Indexing "html.htm" as text/html ... added
Indexing "sample.doc" as application/msword ... Skipping - "antiword -mUTF-8.txt '/www/example/products/sample.doc'" failed
Indexing "text.txt" as text/plain ... added
I didn't install "unzip" or "antiword", so that's what I'd expect to happen.
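
To see in advance which external helpers are available in a given shell, something like the following can be run before indexing. This is a minimal sketch assuming a POSIX shell such as Cygwin bash; the function name `check_helpers` is hypothetical, and the list passed to it is just the two programs that appear in the output above.

```shell
#!/bin/sh
# Hypothetical diagnostic helper (not part of omindex): report whether
# each named program can be found on PATH by the current shell.
check_helpers() {
    for prog in "$@"; do
        if command -v "$prog" >/dev/null 2>&1; then
            echo "$prog: found"
        else
            echo "$prog: missing"
        fi
    done
}

# The two external programs named in the "Skipping" messages above.
check_helpers unzip antiword
```

A "missing" line here would be consistent with the corresponding "Skipping - ... failed" message, though it only shows what the shell can find, not why a particular run failed.
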
> This next bit is me wondering about the output I've gotten.
>
> John@win-7-test ~/xapian-omega-1.4.0
> $ ls -al ../../../www/example/msproducts/
> total 357
> drwx------+ 1 John None 0 Sep 27 16:25 .
> drwx------+ 1 John None 0 Sep 27 15:41 ..
> -rwx------+ 1 John None 32476 Sep 14 15:18 100-objects-v1.csv
> -rwx------+ 1 John None 32477 Sep 14 15:19 100-objects-v2.csv
> -rwx------+ 1 John None 14228 Aug 31 11:41 burger.docx
> -rwx------+ 1 John None 19034 Jun 30 12:15 hotdog.docx
> -rwx------+ 1 John None 10538 Sep 14 15:30 index.html
> -rwx------+ 1 John None 137728 Jun 30 12:15 sausage.doc
> -rwx------+ 1 John None 71536 Sep 14 15:21 states.csv
> -rwx------+ 1 John None 541 Sep 14 15:21 us_states_on_wikipedia.html
> -rwx------+ 1 John None 29824 Aug 31 15:08 zlib_how.html
>
> John@win-7-test ~/xapian-omega-1.4.0
> $ ./omindex -v --db omtest.db --url msproducts ../../../www/example/msproducts/
Hmm, I notice here you have "www/example/msproducts", but above you said
"/www/example/product" - "msproducts" vs "product". Could that be the
problem, or was the earlier one just a typo or hypothetical example?
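
One way to rule that out is to check which spelling actually exists and what it resolves to. The following is a minimal sketch assuming a POSIX shell; `check_dir` is a hypothetical helper name, and the two paths are simply the ones quoted in this thread.

```shell
#!/bin/sh
# Hypothetical helper: say whether a path is a directory and, if so,
# print the absolute physical path the shell resolves it to.
check_dir() {
    if [ -d "$1" ]; then
        # The command substitution runs in a subshell, so the caller's
        # working directory is left unchanged.
        echo "ok: $1 -> $(cd "$1" && pwd -P)"
    else
        echo "not a directory: $1"
    fi
}

check_dir /www/example/product
check_dir /www/example/msproducts
```

If only one of the two reports "ok", the mismatch in the directory name is the likely culprit rather than anything in how omindex handles absolute paths.
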
> John@win-7-test ~/xapian-omega-1.4.0
> $ [Entering directory ""]
What was the exact command line you used to run the indexer here? It seems to
have got lost from the paste, and would be useful to know.
> Indexing "100-objects-v1.csv" as text/csv ... added
> Indexing "100-objects-v2.csv" as text/csv ... added
> Indexing "burger.docx" as application/vnd.openxmlformats-officedocument.wordprocessingml.document ... The system cannot find the path specified.
> Skipping - "unzip -p "..\..\..\www\example\msproducts\burger.docx" word/document.xml 'word/header*.xml' 'word/footer*.xml' 2>/dev/null" failed
> Indexing "hotdog.docx" as application/vnd.openxmlformats-officedocument.wordprocessingml.document ... The system cannot find the path specified.
> Skipping - "unzip -p "..\..\..\www\example\msproducts\hotdog.docx" word/document.xml 'word/header*.xml' 'word/footer*.xml' 2>/dev/null" failed
> Indexing "index.html" as text/html ... added
>
> omindex stops when it hits sausage.doc, and echo $? returns 0, so I've no
> idea why it doesn't want to process an ms word .doc file, although I
> suspect it may be related to the inability to process the .docx files. I
> should note that I am performing this work on a windows VM that does not have
> MS office or open office installed, if that makes a difference.
That shouldn't matter. By default omindex will try to use unzip (and internal
XML parsing) for .docx and antiword for .doc.
I can't see why it shouldn't try to handle the other files in the directory
- in my test it continues after both the .docx and .doc failures.
Cheers,
Olly