Hi all!

I've been trawling through the archives and I found reference to an n-gram query parser plugin which some guy made. I don't think it's been included into the main Xapian distro yet but I would be really interested in such a tokenizer if there were plans!

His tokenizer apparently plugs into Xapian, but I'm not sure how you plug extra query parsing engines in - could someone possibly shed some light on this for me please? Additionally, would any plugin be able to take advantage of the term prefixes? Or is that something that would need to be reimplemented with each query parsing / tokenizing engine?

The guy put all the code here: http://code.google.com/p/cjk-tokenizer/

(btw - xapian is looking really fantastic at the moment - thanks to all involved, Olly, Richard, James, etc.)
On Tue, Aug 19, 2008 at 07:29:46PM +0000, Joss Shaw wrote:
> I've been trawling through the archives and I found reference to an
> n-gram query parser plugin which some guy made. I don't think it's
> been included into the main Xapian distro yet but I would be really
> interested in such a tokenizer if there were plans!

It's certainly something I'd like to include, but I don't have firm plans for working on it myself currently.

> His tokenizer apparently plugs into Xapian, but I'm not sure how you
> plug extra query parsing engines in - could someone possibly shed some
> light on this for me please?

I've not studied this code myself, so if you want to know the true answer to this and other questions concerning it, then you'll have to read the source code or ask the author. But since Xapian::QueryParser pretty much just uses public API methods, I'd guess he's just implemented something similar but with different handling for CJK characters.

Cheers,
Olly
On Wed, Aug 20, 2008 at 7:00 PM, Joss Shaw <jossblowing at yahoo.co.uk> wrote:
> I've been trawling through the archives and I found reference to an n-gram query parser plugin
> which some guy made. I don't think it's been included into the main Xapian distro yet but I would
> be really interested in such a tokenizer if there were plans!
>
> His tokenizer apparently plugs into Xapian, but I'm not sure how you plug extra query parsing
> engines in - could someone possibly shed some light on this for me please? Additionally,
> would any plugin be able to take advantage of the term prefixes? Or is that something that would
> need to be reimplemented with each query parsing / tokenizing engine ?
>
> The guy put all the code here: http://code.google.com/p/cjk-tokenizer/

The guy in question is Yung-Chung Lin. I am using a slightly modified version of his CJKV tokenizer in Pinot to pre-process queries before feeding them to the QueryParser. I chose this route because I didn't want to implement my own query parser and wanted something that works with "mixed" queries.

Look for the QueryModifier class here:
http://svn.berlios.de/wsvn/pinot/trunk/IndexSearch/Xapian/XapianEngine.cpp

The CJKVTokenizer class is here:
http://svn.berlios.de/wsvn/dijon/trunk/cjkv/CJKVTokenizer.cc

For instance, the query "????? title:??" will become this:
(? ?? ? ?? ? ?? ? ?? ?) title:? title:?? title:?

Altogether it seems to work quite well. Of course, any bug is mine not Yung-Chung's :-)

I hope this helps.

Fabrice
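
To make the rewriting pattern above concrete, here is a rough Python sketch of that kind of query pre-processing. It is not the actual QueryModifier or CJKVTokenizer code, which I have not reproduced. The names `is_cjk`, `ngrams` and `rewrite_query` are made up for illustration, the character ranges are a simplifying assumption, and the sample characters in the usage line are arbitrary stand-ins. Each run of CJK characters is expanded into interleaved unigrams and bigrams, and a field prefix such as `title:` is repeated on every generated token so that the stock QueryParser can still apply its prefix mapping.

```python
def is_cjk(ch):
    """Rough check (assumption): CJK ideographs, kana, and Hangul syllables only."""
    cp = ord(ch)
    return (
        0x4E00 <= cp <= 0x9FFF      # CJK Unified Ideographs
        or 0x3040 <= cp <= 0x30FF   # Hiragana and Katakana
        or 0xAC00 <= cp <= 0xD7AF   # Hangul syllables
    )


def ngrams(run):
    """Interleave unigrams and bigrams: c1, c1c2, c2, c2c3, ..., cn."""
    tokens = []
    for i, ch in enumerate(run):
        tokens.append(ch)
        if i + 1 < len(run):
            tokens.append(run[i:i + 2])
    return tokens


def rewrite_query(query):
    """Expand whitespace-separated CJK words, leaving everything else untouched."""
    parts = []
    for word in query.split():
        head, sep, tail = word.partition(":")
        # Treat "field:text" as a prefixed word; otherwise the whole word is the body.
        prefix, body = (head + ":", tail) if sep else ("", word)
        if body and all(is_cjk(c) for c in body):
            tokens = ngrams(body)
            if prefix:
                # Repeat the field prefix on every generated token.
                parts.append(" ".join(prefix + t for t in tokens))
            elif len(tokens) > 1:
                # Group the tokens of one unprefixed word in parentheses.
                parts.append("(" + " ".join(tokens) + ")")
            else:
                parts.append(tokens[0])
        else:
            # Non-CJK or mixed-script words are passed through as-is in this sketch.
            parts.append(word)
    return " ".join(parts)


if __name__ == "__main__":
    print(rewrite_query("中文搜索引 title:标题"))
```

The rewritten string would then be handed to Xapian::QueryParser as usual. Words that mix scripts inside a single token are left alone here, which a real tokenizer would need to handle.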