thr3ads.net - Xapian discuss - [Xapian-discuss] chinese/japanese index support [Feb 2008]

chun yu

2008-Feb-26 08:02 UTC

[Xapian-discuss] chinese/japanese index support

Hi all,

I am studying the Xapian project.

I am wondering whether version 1.0.5 supports Chinese/Japanese indexing.

If so, could you please point me to the code in the project that implements it?

Otherwise, how can I implement support for indexing Chinese?

Thanks a lot!

Rick Olson

2008-Feb-26 09:27 UTC

[Xapian-discuss] chinese/japanese index support

chun yu wrote:
> Hi all,
>
> I am studying the Xapian project.
>
> I am wondering whether version 1.0.5 supports Chinese/Japanese indexing.
>
> If so, could you please point me to the code in the project that implements it?
>
> Otherwise, how can I implement support for indexing Chinese?
>
> Thanks a lot!

Hello,

For indexing Chinese/Japanese/Korean data, I would suggest a
product called Senna (qwik.jp/senna).  It is also free software
(if that's what floats your boat), but it is not Xapian-specific.
I haven't yet successfully used Xapian to index CJK text in a
production environment, and from my experience so far it is not
convenient for that purpose: there is no stemming support that I can
see, and spaces are significant in many cases.

Technically, however, indexing has _functioned_ in about 90% of the
cases I've tested that involve Asian characters; Xapian just isn't
tailored to that use.

Perhaps one of the core developers can shed some light on the
situation, but I believe my personal tests bear this out.  There
are also other solutions that cater to Asian character sets.

Regards,
Rick
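
A small illustration of the "significance of spaces" point above. This is only a sketch of a generic whitespace-based splitter, not Xapian's actual term generator; the function name `whitespace_terms` and the sample strings are made up for the example. It shows why text written without spaces between words ends up as one long term.

```python
def whitespace_terms(text):
    """Split text into terms on whitespace only (a deliberately naive tokenizer)."""
    return text.split()


# English words are separated by spaces, so each word becomes its own term.
english_terms = whitespace_terms("full text search engine")

# Chinese is normally written without spaces, so the whole phrase
# comes back as a single term that only an exact-phrase query would match.
chinese_terms = whitespace_terms("全文搜索引擎")

print(len(english_terms), english_terms)
print(len(chinese_terms), chinese_terms)
```

A query for a single word inside the Chinese phrase would not match that one long term, which is the practical problem n-gram or dictionary-based segmentation tries to solve.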

Fabrice Colin

2008-Feb-26 12:31 UTC

[Xapian-discuss] chinese/japanese index support

On Tue, 26 Feb 2008 09:48:29 +0000, Olly Betts <olly at survex.com> wrote:
>  On Tue, Feb 26, 2008 at 01:27:36AM -0800, Rick Olson wrote:
>  > chun yu wrote:
>  > > I am wondering whether version 1.0.5 supports Chinese/Japanese
>  > > indexing.
>
>  There's nothing specific to Chinese or Japanese currently, although we
>  do support all of Unicode in the character classification code, so
>  Chinese and Japanese characters should be correctly identified as part
>  of words.
>
>  > > or how can I implement support for indexing Chinese?
>
>  The usual approaches are based on n-gram matching.  Someone posted a
>  link to some code they'd written (and I think were using with Xapian)
>  on the list, but I've not had a chance to study it yet.

Yung-chung Lin wrote a CJKV n-gram tokenizer. The source is here:
svn.berlios.de/wsvn/dijon/trunk/cjkv/?rev=0&sc=1
It's not tied to Xapian in particular. It needs libunicode 0.4 or glib.

I make use of it in Pinot, to generate terms when indexing CJKV documents,
and at search time to pre-process CJKV queries before feeding them to the
QueryParser.

Fabrice
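
For readers unfamiliar with the n-gram approach mentioned in this thread, here is a minimal sketch in Python. It is not Yung-chung Lin's tokenizer and not what Pinot does internally; the function name `cjk_ngrams`, the choice of bigrams, and the simplified Unicode ranges used to detect CJK text are all assumptions made for illustration.

```python
import re

# Simplified, assumed set of CJK ranges: Hiragana/Katakana, CJK Extension A,
# CJK Unified Ideographs, Hangul Syllables. A real tokenizer would cover more.
CJK_RANGES = "\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uac00-\ud7af"

# Group 1: a run of CJK characters. Group 2: a run of other word characters.
RUN_RE = re.compile("([" + CJK_RANGES + "]+)|([^\\W" + CJK_RANGES + "]+)")


def cjk_ngrams(text, n=2):
    """Return index terms: overlapping n-grams for CJK runs, whole words otherwise."""
    terms = []
    for match in RUN_RE.finditer(text):
        cjk_run, word = match.group(1), match.group(2)
        if word is not None:
            # Non-CJK text is kept as lowercased whole words.
            terms.append(word.lower())
        elif len(cjk_run) < n:
            # A run shorter than n cannot be split further; keep it as is.
            terms.append(cjk_run)
        else:
            # Slide a window of n characters across the CJK run.
            terms.extend(cjk_run[i:i + n] for i in range(len(cjk_run) - n + 1))
    return terms


print(cjk_ngrams("Xapian 全文搜索引擎"))
```

The same function would be applied twice, in line with the description above: once to document text to produce the terms that get indexed, and once to the CJKV part of a query before it is handed to the query parser, so that both sides yield the same n-grams.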
