Hello,

I have a number of Excel .xlsx files that aren't indexed properly. To
illustrate, I have a file called "this is a test.xlsx". It consists of
four cells:

| this |
| is   |
| a    |
| test |

It gets indexed but I am unable to search for it. I was able to
determine the index number and use delve to see the term list:

# delve users -r 16496
Term List for record #16496: D20130405 Hvesuvius M201304 P/ Tapplication/vnd.openxmlformats-officedocument.spreadsheetml.sheet Ufile://vesuvius/cpurves/this is a test.xlsx Y2013 Zthisisatest thisisatest

You can see that the words are all concatenated together as if they
are a single word. If I search for "thisisatest" it comes up, but not
otherwise.

I'm using version 1.2.3 on Debian.

-- 
Chris Purves
Visit my blog: http://chris.northfolk.ca
"The idea is to zap them with lasers and see how they respond." - Dr. Scott Menary
On Fri, Apr 05, 2013 at 03:47:11PM -0300, Chris Purves wrote:
> I have a number of Excel .xlsx files that aren't indexed properly. To
> illustrate, I have a file called "this is a test.xlsx". It consists of
> four cells:
>
> | this |
> | is   |
> | a    |
> | test |
>
[...]
>
> You can see that the words are all concatenated together as if they
> are a single word. If I search for "thisisatest" it comes up, but not
> otherwise.
>
> I'm using version 1.2.3 on Debian.

The xlsx extraction code changed significantly in 1.2.11, so I think
this is quite likely to already be fixed.

Could you try a newer version, or point us at a sample file which
exhibits this problem?

Cheers,
    Olly
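
For anyone following along, here is a rough sketch of how the "thisisatest" term could arise. It is not the actual omindex filter code, and whether 1.2.3 behaves this way is only a guess. It assumes the cell text is stored as shared strings (the xl/sharedStrings.xml part of the .xlsx archive) and that the extractor strips the markup without putting anything between cells. The sample XML and the function names naive_text and separated_text are made up for illustration.

```python
import xml.etree.ElementTree as ET

# SpreadsheetML main namespace, as used in xl/sharedStrings.xml.
NS = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}"

# Hypothetical shared-strings part for the four-cell example above.
# It is written with no whitespace between elements.
SHARED_STRINGS = (
    '<sst xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">'
    "<si><t>this</t></si>"
    "<si><t>is</t></si>"
    "<si><t>a</t></si>"
    "<si><t>test</t></si>"
    "</sst>"
)


def naive_text(xml_text):
    """Strip all markup and keep only the character data (assumed failure mode)."""
    root = ET.fromstring(xml_text)
    return "".join(root.itertext())


def separated_text(xml_text):
    """Collect the text of each <si> entry and join the entries with spaces."""
    root = ET.fromstring(xml_text)
    cells = []
    for si in root.iter(NS + "si"):
        # One entry may hold several <t> runs (rich text); keep those together.
        cells.append("".join(t.text or "" for t in si.iter(NS + "t")))
    return " ".join(cells)


if __name__ == "__main__":
    print(naive_text(SHARED_STRINGS))      # one run-together word
    print(separated_text(SHARED_STRINGS))  # one word per cell
```

With the first approach the indexer only ever sees a single word, which would match the term list in the original report. The second gives it four separate words to index.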