thr3ads.net - Xapian devel - [Xapian-devel] GSOC 2011- CJK Support [Apr 2011]

yong zhang

2011-Apr-07 15:11 UTC

[Xapian-devel] GSOC 2011- CJK Support

Hello everyone, I am Yongzhi Zhang, a Chinese student.

I'm interested in CJK (Chinese, Japanese, and Korean) support.

I have 6 years of experience in software development (C/C++ and Java).

I want to work on the "CJK Support" project. I come from Beijing, China.

Chinese is my native language, which is an advantage for "CJK Support".

I have fixed an indexing bug in the Chinese version of the OpenOffice help
system. OpenOffice uses Lucene to implement the indexing.


I'll be happy to participate in this project during the Google Summer of Code
2011 program and implement CJK Support.

Since Chinese words are not delimited by whitespace, we cannot easily
distinguish them. After investigating, I found three methods to resolve
this issue, and I prefer the last one.

   1. Index each character as a separate term. This is what Lucene does by
      default; the class is StandardAnalyzer.

   2. Index every two adjacent characters as a term. This is what Lucene
      uses for "CJK support"; the Java class is CJKAnalyzer
      <http://svn.services.openoffice.org/opengrok/s?defs=CJKAnalyzer&project=/DEV300_m103>.

   3. Use a dictionary to split the text into groups of characters (words).
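
For illustration, the first two methods can be sketched in a few lines of
Python. This is only a minimal sketch of the idea, not Lucene's code: the
function names unigrams and bigrams are hypothetical, and the input is assumed
to be a single run of CJK characters with no whitespace or punctuation.

```python
def unigrams(run):
    """Method 1: every character becomes its own index term."""
    return list(run)


def bigrams(run):
    """Method 2: every pair of adjacent characters becomes an index term.

    The pairs overlap, so a run of n characters yields n - 1 terms.
    A run of length 1 is kept as a single term so it is not lost.
    """
    if len(run) < 2:
        return list(run)
    return [run[i:i + 2] for i in range(len(run) - 1)]


sample = "中文分词"
print(unigrams(sample))  # ['中', '文', '分', '词']
print(bigrams(sample))   # ['中文', '文分', '分词']
```

Neither method needs a dictionary, but both index terms that are not real
words (such as the middle bigram above).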

Olly Betts

2011-Apr-08 04:23 UTC

[Xapian-devel] GSOC 2011- CJK Support

On Thu, Apr 07, 2011 at 11:11:13PM +0800, yong zhang wrote:
> I want to work on the "CJK Support" project. I come from Beijing, China.

Thanks for your interest.

In case it isn't clear, you need to submit an application here before
19:00 UTC on April 8th (which is less than 15 hours away now):

http://www.google-melange.com

Just sending a mail to this list isn't enough - you need to be in
Google's system.

> Since Chinese words are not delimited by whitespace, we cannot easily
> distinguish them. After investigating, I found three methods to resolve
> this issue, and I prefer the last one.
>
>    1. Index each character as a separate term. This is what Lucene does by
>       default; the class is StandardAnalyzer.
>
>    2. Index every two adjacent characters as a term. This is what Lucene
>       uses for "CJK support"; the Java class is CJKAnalyzer
>       <http://svn.services.openoffice.org/opengrok/s?defs=CJKAnalyzer&project=/DEV300_m103>.
>
>    3. Use a dictionary to split the text into groups of characters (words).

The third method should give the best results, but is harder to
implement, and I think will need separate versions for Chinese, Japanese,
and Korean.  Handling words not in the dictionary list is an issue too.
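
As a rough illustration of the dictionary approach and of the unknown-word
problem, here is a minimal Python sketch using forward maximum matching. This
technique is an assumption chosen for the example and is not something proposed
in this thread; the function name max_match, the max_len parameter, and the
tiny word list are all hypothetical.

```python
def max_match(text, dictionary, max_len=4):
    """Greedy forward maximum matching.

    At each position, take the longest dictionary word that starts there.
    If nothing matches, fall back to emitting a single character, which is
    one simple way to cope with words missing from the dictionary.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until something matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


words = {"中文", "分词", "搜索"}  # made-up dictionary for the example
print(max_match("中文分词很难", words))  # ['中文', '分词', '很', '难']
```

The last two characters are not in the word list, so they come out as single
characters. A real dictionary would be needed per language.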

Certainly we're happy to consider proposals for different approaches.
Just explain why you're taking the approach you are.

Cheers,
    Olly
