Crowell, Brian
2011-Feb-09 21:11 UTC
[Xapian-discuss] Not separating words when parsing HTML in Omega
We noticed, when indexing a Word 2007 document, that two words in adjacent paragraphs got melded together in the Xapian database. For example:

To find the document containing these two paragraphs... ...you would search for "containingthese".

I fixed it locally by adding a "dump.append(" ");" just before the return in process_text() in myhtmlparse.cc. Thought I'd mention it to see if anyone could put in a better/more permanent fix.

I could send a sample document that produces the error, if that helps.

--Brian Crowell
Developer, Barbnet Investments
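
For readers following along, the workaround described above can be sketched roughly as below. This is only an illustration under assumptions, not the actual Omega source: the struct name SketchHtmlParser is hypothetical, and the real process_text() in myhtmlparse.cc does more than append text. Only the member name dump, the function name process_text(), and the appended space come from the post.

```cpp
#include <string>

// Hypothetical stand-in for the HTML parser in myhtmlparse.cc.
// Only `dump` and `process_text()` are names taken from the post;
// everything else here is an assumption made for illustration.
struct SketchHtmlParser {
    std::string dump;  // accumulated text that is later indexed

    // Called once per run of text found between tags.
    void process_text(const std::string& text) {
        dump.append(text);
        // The local workaround: always add a separator, so text from
        // adjacent paragraphs cannot be joined into one term.
        dump.append(" ");
    }
};
```

With this change, feeding "containing" and then "these" as separate text runs leaves a space between them in dump, at the cost of a trailing space and of possibly splitting words that inline markup breaks in the middle, which is presumably why a more careful fix was asked for.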
Olly Betts
2011-Feb-10 02:50 UTC
[Xapian-discuss] Not separating words when parsing HTML in Omega
On Wed, Feb 09, 2011 at 03:11:18PM -0600, Crowell, Brian wrote:
> We noticed, when indexing a Word 2007 document, that two words in
> adjacent paragraphs got melded together in the Xapian database. For
> example:

What version of Omega is this with? I have a feeling I fixed something to do with running words together fairly recently, but I'm not seeing it in the ChangeLog.

> I could send a sample document that produces the error, if that helps.

That would be useful if you have something you don't mind making public. Bonus points if you're happy to license it for use in a testsuite!

Cheers,
Olly