Richard Jolly
2007-Jan-21 11:36 UTC
[Xapian-discuss] indexing for phrase searching and constructing queries
Hi,

I'm new to Xapian, and to search engines in general. I'm using the Perl bindings and 0.9.9. In general it works excellently, but I've got some questions.

1. phrase searching

I'm having no luck getting phrase searching to work. I expect it's because I've not indexed the content correctly. The content is XML. I'm basically taking the text content of that, splitting it into words, lower casing, stemming and stripping of punctuation. The term position passed to add_posting is just incremented, but I'm keeping the same position for both the stemmed and the unstemmed words.

    # made up
    add_posting( 'office', 3 )
    add_posting( 'offic', 3 )

My hand-wavy understanding of phrase searching is that it's looking for consecutive matching terms, which is why I've done the stemmed and unstemmed words at the same position. But when I do a query, I get no results. The debug on the query looks sane to me:

    Xapian::Query((impose:(pos=1) PHRASE 3 time:(pos=2) PHRASE 3 limits:(pos=3)))

How can I tell why this isn't matching? Can I find those three posts in the index and compare the positions?

Secondly, a user entered search with an apostrophe ends up as a phrase search - not right at all:

    Xapian::Query(((mike:(pos=1) PHRASE 2 s:(pos=2)) OR tail:(pos=3)))

2. user interfaces

My next question is about the practicalities of user facing search interfaces. I've got a web form with a big text input, and also a couple of additional controls that correspond to indexed terms. I've then got code that combines the term controls with the text input into something like:

    ( name:foo AND name:bar ) AND text from text box

And I hand this off to QueryParser. But punctuation seems to mess it up. Should I be stripping out punctuation and stop words? Is it a bad approach altogether?

Thanks,
Richard
Olly Betts
2007-Jan-25 04:07 UTC
[Xapian-discuss] indexing for phrase searching and constructing queries
On 21/01/07, Richard Jolly <richardjolly@mac.com> wrote:
> 1. phrase searching
>
> I'm having no luck getting phrase searching to work. I expect it's
> because I've not indexed the content correctly. The content is XML. I'm
> basically taking the text content of that, splitting it into words,
> lower casing, stemming and stripping of punctuation. The term position
> passed to add_posting is just incremented, but I'm keeping the same
> position for both the stemmed and the unstemmed words.
>
> # made up
> add_posting( 'office', 3 )
> add_posting( 'offic', 3 )
>
> My hand-wavy understanding of phrase searching is that it's looking for
> consecutive matching terms

Yes, that's correct.

> which is why I've done the stemmed and
> unstemmed words at the same position. But when I do a query, I get no
> results. The debug on the query looks sane to me:
>
> Xapian::Query((impose:(pos=1) PHRASE 3 time:(pos=2) PHRASE 3
> limits:(pos=3)))
>
> How can I tell why this isn't matching? Can I find those three posts in
> the index and compare the positions?

Use "delve" - it's in the examples subdirectory of xapian-core.

> Secondly, a user entered search with an apostrophe ends up as a phrase
> search - not right at all:
>
> Xapian::Query(((mike:(pos=1) PHRASE 2 s:(pos=2)) OR tail:(pos=3)))

This is how Omega currently works, and what the QueryParser does. It's a misfeature really (particularly since it produces a more expensive phrase search for a case where we don't need to do one). I'm intending to change this for Xapian 1.0.

> 2. user interfaces
> My next question is about the practicalities of user facing search
> interfaces. I've got a web form with a big text input, and also a
> couple additional controls that correspond to indexed terms. I've then
> got code that combines the term controls with the text input into
> something like:
>
> ( name:foo AND name:bar ) AND text from text box
>
> And I hand this off to QueryParser. But punctuation seems to mess it
> up. Should I be stripping out punctuation and stop words? Is it a bad
> approach altogether?

It's generally a mistake to try to manipulate user entered text before passing it to the QueryParser class. It's better to let the QueryParser parse the user entered query and then apply additional filters etc to the Query object produced.

I take it you have a "name" box and a "text" box? If so, you'd ideally want to parse each separately using a QueryParser object with one set to default to "name:", but currently I don't think you can (I'll take a look in a week or so when I'm back from holiday as this should be easy to do).

For now, I'd suggest wrapping the "name" field in `name:(' and `)' and parsing that, then combining it with the result of parsing the "text" field with OP_AND.

Cheers,
Olly
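
The indexing scheme discussed in this thread (one incrementing position per word, with the stemmed and unstemmed forms posted at the same position) can be illustrated with a toy model. The sketch below is plain Python and does not use Xapian at all: `ToyPositionalIndex`, `toy_stem` and `index_text` are hypothetical names invented for illustration, the stemmer is a crude suffix stripper rather than a real one, and the phrase check assumes strict adjacency, which may differ from how Xapian evaluates a PHRASE window.

```python
import string
from collections import defaultdict


def toy_stem(word):
    """Hypothetical stand-in for a real stemmer: strips a few common suffixes."""
    for suffix in ("ing", "es", "s", "e"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


class ToyPositionalIndex:
    """Toy positional index: term -> doc_id -> set of positions."""

    def __init__(self):
        self.postings = defaultdict(lambda: defaultdict(set))

    def add_posting(self, doc_id, term, position):
        self.postings[term][doc_id].add(position)

    def phrase_search(self, terms):
        """Return sorted doc ids in which the terms occur at consecutive positions."""
        if not terms:
            return []
        # A matching document must contain every term of the phrase.
        candidates = set(self.postings.get(terms[0], {}))
        for term in terms[1:]:
            candidates &= set(self.postings.get(term, {}))
        matches = []
        for doc_id in sorted(candidates):
            starts = self.postings[terms[0]][doc_id]
            # The i-th term of the phrase must sit at position start + i.
            if any(
                all(start + i in self.postings[t][doc_id] for i, t in enumerate(terms))
                for start in starts
            ):
                matches.append(doc_id)
        return matches


def index_text(index, doc_id, text):
    """Lower-case, strip punctuation, and post each word and its stem at one position."""
    position = 0
    for raw in text.lower().split():
        word = raw.strip(string.punctuation)
        if not word:
            continue
        position += 1
        index.add_posting(doc_id, word, position)            # unstemmed form
        index.add_posting(doc_id, toy_stem(word), position)  # stemmed form, same position


index = ToyPositionalIndex()
index_text(index, 1, "Courts impose time limits on appeals.")
index_text(index, 2, "Time will impose its own limits.")

print(index.phrase_search(["impose", "time", "limits"]))  # unstemmed phrase
print(index.phrase_search(["impos", "tim", "limit"]))     # stemmed phrase
```

Because both forms share a position, a phrase made of stemmed terms, unstemmed terms, or a mix of the two lines up on the same consecutive positions. If a real index returns nothing for such a phrase, comparing the stored positions for each term (for example with delve, as suggested above) is the natural next check.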