Is there a way to make underscores and colons in terms behave like letters? It would be nice for query terms like doc_id and Search::Xapian to be treated as one term, not two. The results would be a lot more relevant for some queries.

Thanks,
John
Just another note. I'm currently looking at migrating fulltext search from MySQL to Xapian. Xapian is much faster but MySQL can search for terms with underscores and colons without breaking them up. It would be great if there was a way to do it with Xapian.

On 10/27/05, John Wang <johncwang@gmail.com> wrote:
> Is there a way to make underscores and colons in terms behave like
> letters? It would be nice for query terms like doc_id and Search::Xapian to
> be treated as one term, not two. The results would be a lot more relevant
> for some queries.
>
> Thanks,
> John
On Thu, Oct 27, 2005 at 04:50:12PM -0700, John Wang wrote:
> Is there a way to make underscores and colons in terms behave like
> letters? It would be nice for query terms like doc_id and
> Search::Xapian to be treated as one term, not two. The results would
> be a lot more relevant for some queries.

Hi, John.

Xapian itself doesn't care what terms look like - however the commonly-used QueryParser that ships with Xapian, and the omega indexers (scriptindex and omindex) generate their terms in a certain way. If you want terms including underscores and colons, you'll need to write your own word generator (that goes through text and figures out what the words are before optionally passing them to a stemmer to make terms), to use both while indexing and while compiling searches.

However it's worth pointing out that the query parser will often turn queries into PHRASE queries in these cases, which is actually more helpful - it means you can search for the fragments that make up the larger unit, as well as for the entire unit. I can't remember in detail how this works, however (and I don't read lemony, so I can't figure it out from source), so someone else will have to fill you in here.

J

-- 
/--------------------------------------------------------------------------\
  James Aylett                                                  xapian.org
  james@tartarus.org                               uncertaintydivision.org
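
For illustration only, here is a rough sketch of the kind of word generator described above, in plain Python rather than against any Xapian API. The function name generate_terms and the rule that only a double colon joins words are assumptions made for this example, not anything Xapian or Omega provides. The point is that the same function would have to be applied both to document text at index time and to query text at search time.

```python
import re

# Hypothetical tokenisation rule (an assumption for this sketch):
# letters, digits and '_' are word characters, and '::' may join words.
WORD = re.compile(r"[A-Za-z0-9_]+(?:::[A-Za-z0-9_]+)*")


def generate_terms(text):
    """Return lowercased terms, keeping doc_id and Search::Xapian whole."""
    return [match.group(0).lower() for match in WORD.finditer(text)]


# Use the same generator for both indexing and querying so the terms agree.
document_terms = generate_terms("The doc_id field is set by Search::Xapian.")
query_terms = generate_terms("doc_id Search::Xapian")
```

A stemmer, if wanted, would be applied to each term returned here before it is added to a document or a query.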
On Thu, Oct 27, 2005 at 04:50:12PM -0700, John Wang wrote:
> Is there a way to make underscores and colons in terms behave like letters?
> It would be nice for query terms like doc_id and Search::Xapian to be
> treated as one term, not two. The results would be a lot more relevant for
> some queries.

Terms can contain any characters (even zero bytes). You don't say how you're generating them, but I guess you must be using Omega...

Omega's current strategy is to split terms on characters like underscore and colon, and to let _ and : in a query generate a phrase search. So the query Search::Xapian is the same as the query "Search Xapian".

One benefit of this is that a query for Xapian matches Search::Xapian in a document, which is usually desirable. That's probably less of a benefit for underscore, but it is how Omega currently handles it.

The QueryParser class also assumes you're doing this, because it was originally part of Omega. That needs fixing - the tokenisation should be configurable there.

This use of phrase searches does cause slow searches on large databases sometimes:

http://www.xapian.org/cgi-bin/bugzilla/show_bug.cgi?id=22

It's also annoying if you don't want to support actual phrase searches but do want underscored terms, etc to work.

I'm working on addressing this issue. Currently by working on flint which will make access to positional information faster, but I'm also intending to revisit the tokenisation rules.

Cheers,
Olly
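
To make the split-and-phrase behaviour described above concrete, here is a toy model in plain Python. It is not Omega's or the QueryParser's actual code, and every name in it (split_terms, query_to_phrases, phrase_matches, matches) is made up for this sketch. It only shows why a query for Xapian still finds Search::Xapian, and why Search::Xapian in a query needs positional (phrase) matching.

```python
import re

# Toy rule (assumption): only letters and digits form terms, so '_' and ':'
# act as separators, as in the strategy described above.
WORD = re.compile(r"[A-Za-z0-9]+")


def split_terms(text):
    """Split text into lowercase fragments, breaking on '_' and ':'."""
    return [match.group(0).lower() for match in WORD.finditer(text)]


def query_to_phrases(query):
    """Turn each whitespace-separated query word into a list of fragments."""
    phrases = [split_terms(word) for word in query.split()]
    return [phrase for phrase in phrases if phrase]


def phrase_matches(phrase, doc_terms):
    """True if the fragments occur consecutively, in order, in doc_terms."""
    n = len(phrase)
    return any(doc_terms[i:i + n] == phrase
               for i in range(len(doc_terms) - n + 1))


def matches(query, document):
    """True if every phrase derived from the query occurs in the document."""
    doc_terms = split_terms(document)
    return all(phrase_matches(p, doc_terms) for p in query_to_phrases(query))
```

In this model, matches("Xapian", "use Search::Xapian here") is true because xapian is indexed as its own term, while matches("Search::Xapian", "Xapian search") is false because the two fragments are not adjacent in that order. The adjacency check is the part that relies on positional information.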