Hi all,

I am studying the Xapian project.

I am wondering whether version 1.0.5 supports Chinese/Japanese indexing.

If so, could you please tell me which code in the project implements it? If not, how can I add support for indexing Chinese?

Thanks a lot!
chun yu wrote:
> Hi all,
>
> I am studying the Xapian project.
>
> I am wondering whether version 1.0.5 supports Chinese/Japanese indexing.
>
> If so, could you please tell me which code in the project implements it?
> If not, how can I add support for indexing Chinese?
>
> Thanks a lot!

Hello,

For indexing Chinese/Japanese/Korean data, I would suggest a product called Senna (qwik.jp/senna). It is also free as in free (if that's what floats your boat), but it is not Xapian-specific.

I haven't yet successfully used Xapian to index any characters from the CJK set in a production environment, and from my experience so far it is not very convenient for that purpose (no stemming support that I can see, and spaces are significant in many cases!).

Technically, however, indexing has _functioned_ in about 90% of the cases I've tested that include Asian characters; it just isn't catered to that use.

Perhaps one of the core developers can shed some light on this situation, but I do believe my personal tests are correct. There are also other solutions that cater to Asian character sets.

Regards,
Rick
On Tue, 26 Feb 2008 09:48:29 +0000, Olly Betts <olly at survex.com> wrote:
> On Tue, Feb 26, 2008 at 01:27:36AM -0800, Rick Olson wrote:
> > chun yu wrote:
> > > I am wondering whether version 1.0.5 supports Chinese/Japanese
> > > indexing.
>
> There's nothing specific to Chinese or Japanese currently, although we
> do support all of Unicode in the character classification code, so
> Chinese and Japanese characters should be correctly identified as part
> of words.
>
> > > If not, how can I add support for indexing Chinese?
>
> The usual approaches are based on n-gram matching. Someone posted a
> link to some code they'd written (and I think were using with Xapian)
> on the list, but I've not had a chance to study it yet.

Yung-chung Lin wrote a CJKV n-gram tokenizer. The source is here:
svn.berlios.de/wsvn/dijon/trunk/cjkv/?rev=0&sc=1

It's not tied to Xapian in particular. It needs libunicode 0.4 or glib.

I make use of it in Pinot to generate terms when indexing CJKV documents, and at search time to pre-process CJKV queries before feeding them to the QueryParser.

Fabrice
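
For readers unfamiliar with the n-gram approach mentioned in this thread, here is a minimal sketch in Python. It is not Yung-chung Lin's tokenizer and it does not use any Xapian API. The function name cjk_ngrams and the list of code point ranges are assumptions made for illustration, and the ranges are deliberately incomplete. The idea is that runs of CJK characters, which have no spaces to split on, are cut into overlapping n-character terms, while other text is split into ordinary lowercase words.

```python
import re

# Assumed, simplified set of CJK code point ranges (a real tokenizer covers more).
_CJK = (
    "\u3040-\u30ff"  # Hiragana and Katakana
    "\u3400-\u4dbf"  # CJK Unified Ideographs Extension A
    "\u4e00-\u9fff"  # CJK Unified Ideographs
    "\uac00-\ud7af"  # Hangul Syllables
)

# Group 1: a run of CJK characters. Group 2: a run of non-CJK word characters.
_TOKEN_RE = re.compile("([%s]+)|([^\\W%s]+)" % (_CJK, _CJK))


def cjk_ngrams(text, n=2):
    """Return index terms: overlapping n-grams for CJK runs, lowercase words otherwise."""
    terms = []
    for match in _TOKEN_RE.finditer(text):
        cjk_run, word = match.groups()
        if word:
            terms.append(word.lower())
        elif len(cjk_run) < n:
            # A run shorter than n is kept whole so it is still searchable.
            terms.append(cjk_run)
        else:
            terms.extend(cjk_run[i:i + n] for i in range(len(cjk_run) - n + 1))
    return terms


if __name__ == "__main__":
    print(cjk_ngrams("Xapian 中文索引"))
```

The same function would be applied both to documents at index time and to queries at search time, so that the terms on both sides line up. This matches the two-step use Fabrice describes for Pinot, though the details of his setup are not shown in the thread.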